[PATCH v5 0/6] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v5 0/6] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC
@ 2014-05-22 23:05 Kees Cook
  2014-05-22 23:05 ` [PATCH v5 1/6] seccomp: create internal mode-setting function Kees Cook
                   ` (6 more replies)
  0 siblings, 7 replies; 35+ messages in thread
From: Kees Cook @ 2014-05-22 23:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kees Cook, Andy Lutomirski, Oleg Nesterov, Alexander Viro,
	Peter Zijlstra, James Morris, Eric Paris, Juri Lelli,
	John Stultz, David S. Miller, Daniel Borkmann, Alex Thorlton,
	Rik van Riel, Daeseok Youn, David Rientjes, Eric W. Biederman,
	Dario Faggioli, Rashika Kheria, liguang, Geert Uytterhoeven,
	linux-doc, linux-security-module

This adds the ability for threads to request seccomp filter
synchronization across their thread group (either at filter attach time
or later). (For example, for Chrome to make sure graphic driver threads
are fully confined after seccomp filters have been attached.)

To support this, seccomp locking on writes is introduced, along with
refactoring of no_new_privs. Races with thread creation are handled via
the tasklist_list.

I think all the concerns raised during the discussion[1] of the first
version of this patch have been addressed. However, the races involved
have tricked me before. :)

Thanks!

-Kees

[1] https://lkml.org/lkml/2014/1/13/795

v5:
 - move includes around (drysdale)
 - drop set_nnp return value (luto)
 - use smp_load_acquire/store_release (luto)
 - merge nnp changes to seccomp always, fewer ifdef (luto)
v4:
 - cleaned up locking further, as noticed by David Drysdale
v3:
 - added SECCOMP_EXT_ACT_FILTER for new filter install options
v2:
 - reworked to avoid clone races

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v5 1/6] seccomp: create internal mode-setting function
  2014-05-22 23:05 [PATCH v5 0/6] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC Kees Cook
@ 2014-05-22 23:05 ` Kees Cook
  2014-05-22 23:05 ` [PATCH v5 2/6] seccomp: split filter prep from check and apply Kees Cook
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 35+ messages in thread
From: Kees Cook @ 2014-05-22 23:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kees Cook, Andy Lutomirski, Oleg Nesterov, Alexander Viro,
	Peter Zijlstra, James Morris, Eric Paris, Juri Lelli,
	John Stultz, David S. Miller, Daniel Borkmann, Alex Thorlton,
	Rik van Riel, Daeseok Youn, David Rientjes, Eric W. Biederman,
	Dario Faggioli, Rashika Kheria, liguang, Geert Uytterhoeven,
	linux-doc, linux-security-module

In preparation for having other callers of the seccomp mode setting
logic, split the prctl entry point away from the core logic that performs
seccomp mode setting.

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 kernel/seccomp.c |   19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index b35c21503a36..8bbe20111222 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -465,7 +465,7 @@ long prctl_get_seccomp(void)
 }
 
 /**
- * prctl_set_seccomp: configures current->seccomp.mode
+ * seccomp_set_mode: internal function for setting seccomp mode
  * @seccomp_mode: requested mode to use
  * @filter: optional struct sock_fprog for use with SECCOMP_MODE_FILTER
  *
@@ -478,7 +478,7 @@ long prctl_get_seccomp(void)
  *
  * Returns 0 on success or -EINVAL on failure.
  */
-long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
+static long seccomp_set_mode(unsigned long seccomp_mode, char __user *filter)
 {
 	long ret = -EINVAL;
 
@@ -509,3 +509,18 @@ long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
 out:
 	return ret;
 }
+
+/**
+ * prctl_set_seccomp: configures current->seccomp.mode
+ * @seccomp_mode: requested mode to use
+ * @filter: optional struct sock_fprog for use with SECCOMP_MODE_FILTER
+ *
+ * Returns 0 on success or -EINVAL on failure.
+ */
+long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
+{
+	long ret;
+
+	ret = seccomp_set_mode(seccomp_mode, filter);
+	return ret;
+}
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v5 2/6] seccomp: split filter prep from check and apply
  2014-05-22 23:05 [PATCH v5 0/6] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC Kees Cook
  2014-05-22 23:05 ` [PATCH v5 1/6] seccomp: create internal mode-setting function Kees Cook
@ 2014-05-22 23:05 ` Kees Cook
  2014-05-22 23:05 ` [PATCH v5 3/6] seccomp: introduce writer locking Kees Cook
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 35+ messages in thread
From: Kees Cook @ 2014-05-22 23:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kees Cook, Andy Lutomirski, Oleg Nesterov, Alexander Viro,
	Peter Zijlstra, James Morris, Eric Paris, Juri Lelli,
	John Stultz, David S. Miller, Daniel Borkmann, Alex Thorlton,
	Rik van Riel, Daeseok Youn, David Rientjes, Eric W. Biederman,
	Dario Faggioli, Rashika Kheria, liguang, Geert Uytterhoeven,
	linux-doc, linux-security-module

In preparation for adding seccomp locking, move filter creation away
from where it is checked and applied. This will allow for locking where
no memory allocation is happening. The validation, filter attachment,
and seccomp mode setting can all happen under the future locks.

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 kernel/seccomp.c |   86 ++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 58 insertions(+), 28 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 8bbe20111222..481504100b1d 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -18,6 +18,7 @@
 #include <linux/compat.h>
 #include <linux/sched.h>
 #include <linux/seccomp.h>
+#include <linux/slab.h>
 
 /* #define SECCOMP_DEBUG 1 */
 
@@ -26,7 +27,6 @@
 #include <linux/filter.h>
 #include <linux/ptrace.h>
 #include <linux/security.h>
-#include <linux/slab.h>
 #include <linux/tracehook.h>
 #include <linux/uaccess.h>
 
@@ -197,27 +197,21 @@ static u32 seccomp_run_filters(int syscall)
 }
 
 /**
- * seccomp_attach_filter: Attaches a seccomp filter to current.
+ * seccomp_prepare_filter: Prepares a seccomp filter for use.
  * @fprog: BPF program to install
  *
- * Returns 0 on success or an errno on failure.
+ * Returns filter on success or an ERR_PTR on failure.
  */
-static long seccomp_attach_filter(struct sock_fprog *fprog)
+static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
 {
 	struct seccomp_filter *filter;
 	unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
-	unsigned long total_insns = fprog->len;
 	struct sock_filter *fp;
 	int new_len;
 	long ret;
 
 	if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
-		return -EINVAL;
-
-	for (filter = current->seccomp.filter; filter; filter = filter->prev)
-		total_insns += filter->len + 4;  /* include a 4 instr penalty */
-	if (total_insns > MAX_INSNS_PER_PATH)
-		return -ENOMEM;
+		return ERR_PTR(-EINVAL);
 
 	/*
 	 * Installing a seccomp filter requires that the task have
@@ -228,11 +222,11 @@ static long seccomp_attach_filter(struct sock_fprog *fprog)
 	if (!current->no_new_privs &&
 	    security_capable_noaudit(current_cred(), current_user_ns(),
 				     CAP_SYS_ADMIN) != 0)
-		return -EACCES;
+		return ERR_PTR(-EACCES);
 
 	fp = kzalloc(fp_size, GFP_KERNEL|__GFP_NOWARN);
 	if (!fp)
-		return -ENOMEM;
+		return ERR_PTR(-ENOMEM);
 
 	/* Copy the instructions from fprog. */
 	ret = -EFAULT;
@@ -270,31 +264,26 @@ static long seccomp_attach_filter(struct sock_fprog *fprog)
 	atomic_set(&filter->usage, 1);
 	filter->len = new_len;
 
-	/*
-	 * If there is an existing filter, make it the prev and don't drop its
-	 * task reference.
-	 */
-	filter->prev = current->seccomp.filter;
-	current->seccomp.filter = filter;
-	return 0;
+	return filter;
 
 free_filter:
 	kfree(filter);
 free_prog:
 	kfree(fp);
-	return ret;
+	return ERR_PTR(ret);
 }
 
 /**
- * seccomp_attach_user_filter - attaches a user-supplied sock_fprog
+ * seccomp_prepare_user_filter - prepares a user-supplied sock_fprog
  * @user_filter: pointer to the user data containing a sock_fprog.
  *
- * Returns 0 on success and non-zero otherwise.
+ * Returns filter on success and ERR_PTR otherwise.
  */
-static long seccomp_attach_user_filter(char __user *user_filter)
+static
+struct seccomp_filter *seccomp_prepare_user_filter(char __user *user_filter)
 {
 	struct sock_fprog fprog;
-	long ret = -EFAULT;
+	struct seccomp_filter *filter = ERR_PTR(-EFAULT);
 
 #ifdef CONFIG_COMPAT
 	if (is_compat_task()) {
@@ -307,9 +296,37 @@ static long seccomp_attach_user_filter(char __user *user_filter)
 #endif
 	if (copy_from_user(&fprog, user_filter, sizeof(fprog)))
 		goto out;
-	ret = seccomp_attach_filter(&fprog);
+	filter = seccomp_prepare_filter(&fprog);
 out:
-	return ret;
+	return filter;
+}
+
+/**
+ * _seccomp_attach_filter: validated and attach filter
+ * @filter: seccomp filter to add to the current process
+ *
+ * Returns 0 on success, -ve on error.
+ */
+static long _seccomp_attach_filter(struct seccomp_filter *filter)
+{
+	unsigned long total_insns;
+	struct seccomp_filter *walker;
+
+	/* Validate resulting filter length. */
+	total_insns = filter->len;
+	for (walker = current->seccomp.filter; walker; walker = filter->prev)
+		total_insns += walker->len + 4;  /* include a 4 instr penalty */
+	if (total_insns > MAX_INSNS_PER_PATH)
+		return -ENOMEM;
+
+	/*
+	 * If there is an existing filter, make it the prev and don't drop its
+	 * task reference.
+	 */
+	filter->prev = current->seccomp.filter;
+	current->seccomp.filter = filter;
+
+	return 0;
 }
 
 /* get_seccomp_filter - increments the reference count of the filter on @tsk */
@@ -480,8 +497,18 @@ long prctl_get_seccomp(void)
  */
 static long seccomp_set_mode(unsigned long seccomp_mode, char __user *filter)
 {
+	struct seccomp_filter *prepared = NULL;
 	long ret = -EINVAL;
 
+#ifdef CONFIG_SECCOMP_FILTER
+	/* Prepare the new filter outside of the seccomp lock. */
+	if (seccomp_mode == SECCOMP_MODE_FILTER) {
+		prepared = seccomp_prepare_user_filter(filter);
+		if (IS_ERR(prepared))
+			return PTR_ERR(prepared);
+	}
+#endif
+
 	if (current->seccomp.mode &&
 	    current->seccomp.mode != seccomp_mode)
 		goto out;
@@ -495,9 +522,11 @@ static long seccomp_set_mode(unsigned long seccomp_mode, char __user *filter)
 		break;
 #ifdef CONFIG_SECCOMP_FILTER
 	case SECCOMP_MODE_FILTER:
-		ret = seccomp_attach_user_filter(filter);
+		ret = _seccomp_attach_filter(prepared);
 		if (ret)
 			goto out;
+		/* Do not free the successfully attached filter. */
+		prepared = NULL;
 		break;
 #endif
 	default:
@@ -507,6 +536,7 @@ static long seccomp_set_mode(unsigned long seccomp_mode, char __user *filter)
 	current->seccomp.mode = seccomp_mode;
 	set_thread_flag(TIF_SECCOMP);
 out:
+	kfree(prepared);
 	return ret;
 }
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v5 3/6] seccomp: introduce writer locking
  2014-05-22 23:05 [PATCH v5 0/6] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC Kees Cook
  2014-05-22 23:05 ` [PATCH v5 1/6] seccomp: create internal mode-setting function Kees Cook
  2014-05-22 23:05 ` [PATCH v5 2/6] seccomp: split filter prep from check and apply Kees Cook
@ 2014-05-22 23:05 ` Kees Cook
  2014-05-23  0:28   ` Alexei Starovoitov
  2014-05-23  8:49   ` Peter Zijlstra
  2014-05-22 23:05 ` [PATCH v5 4/6] seccomp: move no_new_privs into seccomp Kees Cook
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 35+ messages in thread
From: Kees Cook @ 2014-05-22 23:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kees Cook, Andy Lutomirski, Oleg Nesterov, Alexander Viro,
	Peter Zijlstra, James Morris, Eric Paris, Juri Lelli,
	John Stultz, David S. Miller, Daniel Borkmann, Alex Thorlton,
	Rik van Riel, Daeseok Youn, David Rientjes, Eric W. Biederman,
	Dario Faggioli, Rashika Kheria, liguang, Geert Uytterhoeven,
	linux-doc, linux-security-module

Normally, task_struct.seccomp.filter is only ever read or modified by
the task that owns it (current). This property aids in fast access
during system call filtering as read access is lockless.

Updating the pointer from another task, however, opens up race
conditions. To allow cross-task filter pointer updates, writes to the
seccomp fields are now protected by a spinlock.  Read access remains
lockless because pointer updates themselves are atomic.  However, writes
(or cloning) often entail additional checking (like maximum instruction
counts) which require locking to perform safely.

In the case of cloning threads, the child is invisible to the system
until it enters the task list. To make sure a child can't be cloned
from a thread and left in a prior state, seccomp duplication is moved
under the tasklist_lock. Then parent and child are certain have the same
seccomp state when they exit the lock.

Based on patches by Will Drewry and David Drysdale.

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 include/linux/init_task.h |    8 ++++++++
 include/linux/sched.h     |   20 ++++++++++++++++++++
 include/linux/seccomp.h   |    6 ++++--
 kernel/fork.c             |   34 +++++++++++++++++++++++++++++++++-
 kernel/seccomp.c          |   20 ++++++++++++++------
 5 files changed, 79 insertions(+), 9 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 6df7f9fe0d01..e2370ec3102b 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -154,6 +154,13 @@ extern struct task_group root_task_group;
 # define INIT_VTIME(tsk)
 #endif
 
+#ifdef CONFIG_SECCOMP
+# define INIT_SECCOMP(tsk)						\
+	.seccomp.lock	= __SPIN_LOCK_UNLOCKED(tsk.seccomp.lock),
+#else
+# define INIT_SECCOMP(tsk)
+#endif
+
 #define INIT_TASK_COMM "swapper"
 
 #ifdef CONFIG_RT_MUTEXES
@@ -234,6 +241,7 @@ extern struct task_group root_task_group;
 	INIT_CPUSET_SEQ(tsk)						\
 	INIT_RT_MUTEXES(tsk)						\
 	INIT_VTIME(tsk)							\
+	INIT_SECCOMP(tsk)						\
 }
 
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 25f54c79f757..71a6cb66a3f3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2480,6 +2480,26 @@ static inline void task_unlock(struct task_struct *p)
 	spin_unlock(&p->alloc_lock);
 }
 
+#ifdef CONFIG_SECCOMP
+/*
+ * Protects changes to ->seccomp
+ */
+static inline void seccomp_lock(struct task_struct *p, unsigned long *flags)
+{
+	spin_lock_irqsave(&p->seccomp.lock, (*flags));
+}
+
+static inline void seccomp_unlock(struct task_struct *p, unsigned long flags)
+{
+	spin_unlock_irqrestore(&p->seccomp.lock, flags);
+}
+#else
+static inline void seccomp_lock(struct task_struct *p, unsigned long *flags)
+{ }
+static inline void seccomp_unlock(struct task_struct *p, unsigned long flags)
+{ }
+#endif
+
 extern struct sighand_struct *__lock_task_sighand(struct task_struct *tsk,
 							unsigned long *flags);
 
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 4054b0994071..c47be00e8ffb 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -14,14 +14,16 @@ struct seccomp_filter;
  *
  * @mode:  indicates one of the valid values above for controlled
  *         system calls available to a process.
- * @filter: The metadata and ruleset for determining what system calls
- *          are allowed for a task.
+ * @lock:  held when making changes to avoid thread races.
+ * @filter: must always point to a valid seccomp-filter or NULL as it is
+ *          accessed without locking during system call entry.
  *
  *          @filter must only be accessed from the context of current as there
  *          is no locking.
  */
 struct seccomp {
 	int mode;
+	spinlock_t lock;
 	struct seccomp_filter *filter;
 };
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 54a8d26f612f..e08a1907cd78 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -315,6 +315,15 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
 		goto free_ti;
 
 	tsk->stack = ti;
+#ifdef CONFIG_SECCOMP
+	/*
+	 * We must handle setting up seccomp filters once we're under
+	 * the tasklist_lock in case orig has changed between now and
+	 * then. Until then, filter must be NULL to avoid messing up
+	 * the usage counts on the error path calling free_task.
+	 */
+	tsk->seccomp.filter = NULL;
+#endif
 
 	setup_thread_stack(tsk, orig);
 	clear_user_return_notifier(tsk);
@@ -1081,6 +1090,24 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	return 0;
 }
 
+static void copy_seccomp(struct task_struct *p)
+{
+#ifdef CONFIG_SECCOMP
+	/* Child lock not needed since it is not yet in tasklist. */
+	unsigned long irqflags;
+	seccomp_lock(current, &irqflags);
+
+	get_seccomp_filter(current);
+	p->seccomp = current->seccomp;
+	spin_lock_init(&p->seccomp.lock);
+
+	if (p->seccomp.mode != SECCOMP_MODE_DISABLED)
+		set_tsk_thread_flag(p, TIF_SECCOMP);
+
+	seccomp_unlock(current, irqflags);
+#endif
+}
+
 SYSCALL_DEFINE1(set_tid_address, int __user *, tidptr)
 {
 	current->clear_child_tid = tidptr;
@@ -1196,7 +1223,6 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		goto fork_out;
 
 	ftrace_graph_init_task(p);
-	get_seccomp_filter(p);
 
 	rt_mutex_init_task(p);
 
@@ -1425,6 +1451,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	 */
 	write_lock_irq(&tasklist_lock);
 
+	/*
+	 * Copy seccomp details explicitly here, in case they were changed
+	 * before holding tasklist_lock.
+	 */
+	copy_seccomp(p);
+
 	/* CLONE_PARENT re-uses the old parent */
 	if (clone_flags & (CLONE_PARENT|CLONE_THREAD)) {
 		p->real_parent = current->real_parent;
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 481504100b1d..d200029728ca 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -174,12 +174,12 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
  */
 static u32 seccomp_run_filters(int syscall)
 {
-	struct seccomp_filter *f;
+	struct seccomp_filter *f = smp_load_acquire(&current->seccomp.filter);
 	struct seccomp_data sd;
 	u32 ret = SECCOMP_RET_ALLOW;
 
 	/* Ensure unexpected behavior doesn't result in failing open. */
-	if (WARN_ON(current->seccomp.filter == NULL))
+	if (WARN_ON(f == NULL))
 		return SECCOMP_RET_KILL;
 
 	populate_seccomp_data(&sd);
@@ -188,7 +188,7 @@ static u32 seccomp_run_filters(int syscall)
 	 * All filters in the list are evaluated and the lowest BPF return
 	 * value always takes priority (ignoring the DATA).
 	 */
-	for (f = current->seccomp.filter; f; f = f->prev) {
+	for (; f; f = smp_load_acquire(&f->prev)) {
 		u32 cur_ret = sk_run_filter_int_seccomp(&sd, f->insnsi);
 		if ((cur_ret & SECCOMP_RET_ACTION) < (ret & SECCOMP_RET_ACTION))
 			ret = cur_ret;
@@ -305,6 +305,8 @@ out:
  * _seccomp_attach_filter: validated and attach filter
  * @filter: seccomp filter to add to the current process
  *
+ * Caller must be holding the seccomp lock.
+ *
  * Returns 0 on success, -ve on error.
  */
 static long _seccomp_attach_filter(struct seccomp_filter *filter)
@@ -312,6 +314,8 @@ static long _seccomp_attach_filter(struct seccomp_filter *filter)
 	unsigned long total_insns;
 	struct seccomp_filter *walker;
 
+	BUG_ON(!spin_is_locked(&current->seccomp.lock));
+
 	/* Validate resulting filter length. */
 	total_insns = filter->len;
 	for (walker = current->seccomp.filter; walker; walker = filter->prev)
@@ -324,7 +328,7 @@ static long _seccomp_attach_filter(struct seccomp_filter *filter)
 	 * task reference.
 	 */
 	filter->prev = current->seccomp.filter;
-	current->seccomp.filter = filter;
+	smp_store_release(&current->seccomp.filter, filter);
 
 	return 0;
 }
@@ -332,7 +336,7 @@ static long _seccomp_attach_filter(struct seccomp_filter *filter)
 /* get_seccomp_filter - increments the reference count of the filter on @tsk */
 void get_seccomp_filter(struct task_struct *tsk)
 {
-	struct seccomp_filter *orig = tsk->seccomp.filter;
+	struct seccomp_filter *orig = smp_load_acquire(&tsk->seccomp.filter);
 	if (!orig)
 		return;
 	/* Reference count is bounded by the number of total processes. */
@@ -346,7 +350,7 @@ void put_seccomp_filter(struct task_struct *tsk)
 	/* Clean up single-reference branches iteratively. */
 	while (orig && atomic_dec_and_test(&orig->usage)) {
 		struct seccomp_filter *freeme = orig;
-		orig = orig->prev;
+		orig = smp_load_acquire(&orig->prev);
 		kfree(freeme);
 	}
 }
@@ -498,6 +502,7 @@ long prctl_get_seccomp(void)
 static long seccomp_set_mode(unsigned long seccomp_mode, char __user *filter)
 {
 	struct seccomp_filter *prepared = NULL;
+	unsigned long irqflags;
 	long ret = -EINVAL;
 
 #ifdef CONFIG_SECCOMP_FILTER
@@ -509,6 +514,8 @@ static long seccomp_set_mode(unsigned long seccomp_mode, char __user *filter)
 	}
 #endif
 
+	seccomp_lock(current, &irqflags);
+
 	if (current->seccomp.mode &&
 	    current->seccomp.mode != seccomp_mode)
 		goto out;
@@ -536,6 +543,7 @@ static long seccomp_set_mode(unsigned long seccomp_mode, char __user *filter)
 	current->seccomp.mode = seccomp_mode;
 	set_thread_flag(TIF_SECCOMP);
 out:
+	seccomp_unlock(current, irqflags);
 	kfree(prepared);
 	return ret;
 }
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v5 4/6] seccomp: move no_new_privs into seccomp
  2014-05-22 23:05 [PATCH v5 0/6] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC Kees Cook
                   ` (2 preceding siblings ...)
  2014-05-22 23:05 ` [PATCH v5 3/6] seccomp: introduce writer locking Kees Cook
@ 2014-05-22 23:05 ` Kees Cook
  2014-05-22 23:08   ` Andy Lutomirski
  2014-05-22 23:05 ` [PATCH v5 5/6] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_FILTER Kees Cook
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 35+ messages in thread
From: Kees Cook @ 2014-05-22 23:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kees Cook, Andy Lutomirski, Oleg Nesterov, Alexander Viro,
	Peter Zijlstra, James Morris, Eric Paris, Juri Lelli,
	John Stultz, David S. Miller, Daniel Borkmann, Alex Thorlton,
	Rik van Riel, Daeseok Youn, David Rientjes, Eric W. Biederman,
	Dario Faggioli, Rashika Kheria, liguang, Geert Uytterhoeven,
	linux-doc, linux-security-module

Since seccomp transitions between threads requires updates to the
no_new_privs flag to be atomic, changes must be atomic. This moves the nnp
flag into the seccomp field as a separate unsigned long for atomic access.

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 fs/exec.c                  |    4 ++--
 include/linux/sched.h      |   13 ++++++++++---
 include/linux/seccomp.h    |    8 +++++++-
 kernel/seccomp.c           |    2 +-
 kernel/sys.c               |    4 ++--
 security/apparmor/domain.c |    4 ++--
 6 files changed, 24 insertions(+), 11 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 238b7aa26f68..614fcb993739 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1233,7 +1233,7 @@ static void check_unsafe_exec(struct linux_binprm *bprm)
 	 * This isn't strictly necessary, but it makes it harder for LSMs to
 	 * mess up.
 	 */
-	if (current->no_new_privs)
+	if (task_no_new_privs(current))
 		bprm->unsafe |= LSM_UNSAFE_NO_NEW_PRIVS;
 
 	t = p;
@@ -1271,7 +1271,7 @@ int prepare_binprm(struct linux_binprm *bprm)
 	bprm->cred->egid = current_egid();
 
 	if (!(bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID) &&
-	    !current->no_new_privs &&
+	    !task_no_new_privs(current) &&
 	    kuid_has_mapping(bprm->cred->user_ns, inode->i_uid) &&
 	    kgid_has_mapping(bprm->cred->user_ns, inode->i_gid)) {
 		/* Set-uid? */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 71a6cb66a3f3..d2a72deba43b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1259,9 +1259,6 @@ struct task_struct {
 				 * execve */
 	unsigned in_iowait:1;
 
-	/* task may not gain privileges */
-	unsigned no_new_privs:1;
-
 	/* Revert to default priority/policy when forking */
 	unsigned sched_reset_on_fork:1;
 	unsigned sched_contributes_to_load:1;
@@ -2480,6 +2477,16 @@ static inline void task_unlock(struct task_struct *p)
 	spin_unlock(&p->alloc_lock);
 }
 
+static inline bool task_no_new_privs(struct task_struct *p)
+{
+	return test_bit(SECCOMP_FLAG_NO_NEW_PRIVS, &p->seccomp.flags);
+}
+
+static inline void task_set_no_new_privs(struct task_struct *p)
+{
+	set_bit(SECCOMP_FLAG_NO_NEW_PRIVS, &p->seccomp.flags);
+}
+
 #ifdef CONFIG_SECCOMP
 /*
  * Protects changes to ->seccomp
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index c47be00e8ffb..ed86b298f3b2 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -3,6 +3,8 @@
 
 #include <uapi/linux/seccomp.h>
 
+#define SECCOMP_FLAG_NO_NEW_PRIVS	0	/* task may not gain privs */
+
 #ifdef CONFIG_SECCOMP
 
 #include <linux/thread_info.h>
@@ -17,6 +19,7 @@ struct seccomp_filter;
  * @lock:  held when making changes to avoid thread races.
  * @filter: must always point to a valid seccomp-filter or NULL as it is
  *          accessed without locking during system call entry.
+ * @flags: flags under write lock
  *
  *          @filter must only be accessed from the context of current as there
  *          is no locking.
@@ -25,6 +28,7 @@ struct seccomp {
 	int mode;
 	spinlock_t lock;
 	struct seccomp_filter *filter;
+	unsigned long flags;
 };
 
 extern int __secure_computing(int);
@@ -53,7 +57,9 @@ static inline int seccomp_mode(struct seccomp *s)
 
 #include <linux/errno.h>
 
-struct seccomp { };
+struct seccomp {
+	unsigned long flags;
+};
 struct seccomp_filter { };
 
 static inline int secure_computing(int this_syscall) { return 0; }
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index d200029728ca..e7238f5708d4 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -219,7 +219,7 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
 	 * This avoids scenarios where unprivileged tasks can affect the
 	 * behavior of privileged children.
 	 */
-	if (!current->no_new_privs &&
+	if (!task_no_new_privs(current) &&
 	    security_capable_noaudit(current_cred(), current_user_ns(),
 				     CAP_SYS_ADMIN) != 0)
 		return ERR_PTR(-EACCES);
diff --git a/kernel/sys.c b/kernel/sys.c
index fba0f29401ea..d3b4af60a411 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1990,12 +1990,12 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		if (arg2 != 1 || arg3 || arg4 || arg5)
 			return -EINVAL;
 
-		current->no_new_privs = 1;
+		task_set_no_new_privs(current);
 		break;
 	case PR_GET_NO_NEW_PRIVS:
 		if (arg2 || arg3 || arg4 || arg5)
 			return -EINVAL;
-		return current->no_new_privs ? 1 : 0;
+		return task_no_new_privs(current) ? 1 : 0;
 	case PR_GET_THP_DISABLE:
 		if (arg2 || arg3 || arg4 || arg5)
 			return -EINVAL;
diff --git a/security/apparmor/domain.c b/security/apparmor/domain.c
index 452567d3a08e..d97cba3e3849 100644
--- a/security/apparmor/domain.c
+++ b/security/apparmor/domain.c
@@ -621,7 +621,7 @@ int aa_change_hat(const char *hats[], int count, u64 token, bool permtest)
 	 * There is no exception for unconfined as change_hat is not
 	 * available.
 	 */
-	if (current->no_new_privs)
+	if (task_no_new_privs(current))
 		return -EPERM;
 
 	/* released below */
@@ -776,7 +776,7 @@ int aa_change_profile(const char *ns_name, const char *hname, bool onexec,
 	 * no_new_privs is set because this aways results in a reduction
 	 * of permissions.
 	 */
-	if (current->no_new_privs && !unconfined(profile)) {
+	if (task_no_new_privs(current) && !unconfined(profile)) {
 		put_cred(cred);
 		return -EPERM;
 	}
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v5 5/6] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_FILTER
  2014-05-22 23:05 [PATCH v5 0/6] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC Kees Cook
                   ` (3 preceding siblings ...)
  2014-05-22 23:05 ` [PATCH v5 4/6] seccomp: move no_new_privs into seccomp Kees Cook
@ 2014-05-22 23:05 ` Kees Cook
  2014-05-22 23:05 ` [PATCH v5 6/6] seccomp: add SECCOMP_EXT_ACT_TSYNC and SECCOMP_FILTER_TSYNC Kees Cook
  2014-06-02 19:47 ` [PATCH v5 0/6] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC Kees Cook
  6 siblings, 0 replies; 35+ messages in thread
From: Kees Cook @ 2014-05-22 23:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kees Cook, Andy Lutomirski, Oleg Nesterov, Alexander Viro,
	Peter Zijlstra, James Morris, Eric Paris, Juri Lelli,
	John Stultz, David S. Miller, Daniel Borkmann, Alex Thorlton,
	Rik van Riel, Daeseok Youn, David Rientjes, Eric W. Biederman,
	Dario Faggioli, Rashika Kheria, liguang, Geert Uytterhoeven,
	linux-doc, linux-security-module

This adds a new seccomp "extension" framework for more complex filter
actions and option setting. The need for the added prctl() is due to
the lack of reserved arguments in PR_SET_SECCOMP (much existing code
already calls prctl without initializing trailing arguments).

When prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_FILTER, flags,
filter) is called, it will be the same as prctl(PR_SET_SECCOMP,
SECCOMP_MODE_FILTER, filter), when flags is 0. This will allow for
additional flags being set on the filter in the future.

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 Documentation/prctl/seccomp_filter.txt |   14 ++++++++
 include/linux/seccomp.h                |   10 ++++--
 include/uapi/linux/prctl.h             |    6 ++++
 include/uapi/linux/seccomp.h           |    6 ++++
 kernel/seccomp.c                       |   58 ++++++++++++++++++++++++++++++++
 kernel/sys.c                           |    3 ++
 6 files changed, 95 insertions(+), 2 deletions(-)

diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
index 1e469ef75778..a5e47753b32d 100644
--- a/Documentation/prctl/seccomp_filter.txt
+++ b/Documentation/prctl/seccomp_filter.txt
@@ -223,3 +223,17 @@ Note that modern systems are unlikely to use vsyscalls at all -- they
 are a legacy feature and they are considerably slower than standard
 syscalls.  New code will use the vDSO, and vDSO-issued system calls
 are indistinguishable from normal system calls.
+
+
+
+Extensions
+----------
+Additional seccomp extensions are available through prctl using
+PR_SECCOMP_EXT, with the extension as the following argument.
+
+prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_FILTER, flags, prog):
+	Attach filter, with flags.
+
+	This is the same as prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog)
+	except with the addition of optional "flags" argument. No flags
+	are currently recognized.
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index ed86b298f3b2..e315fea562e1 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -7,6 +7,7 @@
 
 #ifdef CONFIG_SECCOMP
 
+#include <linux/errno.h>
 #include <linux/thread_info.h>
 #include <asm/seccomp.h>
 
@@ -55,8 +56,6 @@ static inline int seccomp_mode(struct seccomp *s)
 
 #else /* CONFIG_SECCOMP */
 
-#include <linux/errno.h>
-
 struct seccomp {
 	unsigned long flags;
 };
@@ -84,6 +83,8 @@ static inline int seccomp_mode(struct seccomp *s)
 #ifdef CONFIG_SECCOMP_FILTER
 extern void put_seccomp_filter(struct task_struct *tsk);
 extern void get_seccomp_filter(struct task_struct *tsk);
+extern long prctl_seccomp_ext(unsigned long, unsigned long,
+			      unsigned long, unsigned long);
 #else  /* CONFIG_SECCOMP_FILTER */
 static inline void put_seccomp_filter(struct task_struct *tsk)
 {
@@ -93,5 +94,10 @@ static inline void get_seccomp_filter(struct task_struct *tsk)
 {
 	return;
 }
+static inline long prctl_seccomp_ext(unsigned long arg2, unsigned long arg3,
+				     unsigned long arg4, unsigned long arg5)
+{
+	return -EINVAL;
+}
 #endif /* CONFIG_SECCOMP_FILTER */
 #endif /* _LINUX_SECCOMP_H */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 58afc04c107e..ac758ed72495 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -152,4 +152,10 @@
 #define PR_SET_THP_DISABLE	41
 #define PR_GET_THP_DISABLE	42
 
+/*
+ * Access seccomp extensions
+ * See Documentation/prctl/seccomp_filter.txt for more details.
+ */
+#define PR_SECCOMP_EXT		43
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index ac2dc9f72973..d7ad626c684d 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -10,6 +10,12 @@
 #define SECCOMP_MODE_STRICT	1 /* uses hard-coded filter. */
 #define SECCOMP_MODE_FILTER	2 /* uses user-supplied filter. */
 
+/* Valid extension types as arg2 for prctl(PR_SECCOMP_EXT) */
+#define SECCOMP_EXT_ACT		1
+
+/* Valid extension actions as arg3 to prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT) */
+#define SECCOMP_EXT_ACT_FILTER	1 /* apply seccomp-bpf filter with flags */
+
 /*
  * All BPF programs must return a 32-bit value.
  * The bottom 16-bits are for optional return data.
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index e7238f5708d4..088244b2c765 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -30,6 +30,8 @@
 #include <linux/tracehook.h>
 #include <linux/uaccess.h>
 
+static long seccomp_set_mode(unsigned long mode, char * __user filter);
+
 /**
  * struct seccomp_filter - container for seccomp BPF programs
  *
@@ -333,6 +335,45 @@ static long _seccomp_attach_filter(struct seccomp_filter *filter)
 	return 0;
 }
 
+/**
+ * seccomp_act_filter: attach filter with additional flags
+ * @flags:  flags from SECCOMP_FILTER_* to change behavior
+ * @filter: struct sock_fprog for use with SECCOMP_MODE_FILTER
+ *
+ * Return 0 on success, -ve on error.
+ */
+static long seccomp_act_filter(unsigned long flags, char * __user filter)
+{
+	long ret;
+
+	/* No flags currently recognized. */
+	if (flags != 0)
+		return -EINVAL;
+
+	ret = seccomp_set_mode(SECCOMP_MODE_FILTER, filter);
+
+	return ret;
+}
+
+/**
+ * seccomp_extended_action: performs the specific action
+ * @action: the enum of the action to perform.
+ *
+ * Returns 0 on success. On failure, it returns != 0, or EINVAL on an
+ * invalid action.
+ */
+static long seccomp_extended_action(int action, unsigned long arg1,
+				    unsigned long arg2)
+{
+	switch (action) {
+	case SECCOMP_EXT_ACT_FILTER:
+		return seccomp_act_filter(arg1, (char * __user)arg2);
+	default:
+		break;
+	}
+	return -EINVAL;
+}
+
 /* get_seccomp_filter - increments the reference count of the filter on @tsk */
 void get_seccomp_filter(struct task_struct *tsk)
 {
@@ -374,6 +415,23 @@ static void seccomp_send_sigsys(int syscall, int reason)
 	info.si_syscall = syscall;
 	force_sig_info(SIGSYS, &info, current);
 }
+
+/**
+ * prctl_seccomp_ext: exposed extension behaviors for seccomp
+ * @cmd: the type of extension being called
+ * @arg[123]: the arguments for the extension
+ *
+ * Returns == 0 on success and != 0 on failure.
+ * Invalid arguments return -EINVAL.
+ */
+long prctl_seccomp_ext(unsigned long type, unsigned long arg1,
+		       unsigned long arg2, unsigned long arg3)
+{
+	if (type != SECCOMP_EXT_ACT)
+		return -EINVAL;
+	/* For action extensions, arg1 is the identifier. */
+	return seccomp_extended_action(arg1, arg2, arg3);
+}
 #endif	/* CONFIG_SECCOMP_FILTER */
 
 /*
diff --git a/kernel/sys.c b/kernel/sys.c
index d3b4af60a411..ce00ad88e48f 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1917,6 +1917,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_SET_SECCOMP:
 		error = prctl_set_seccomp(arg2, (char __user *)arg3);
 		break;
+	case PR_SECCOMP_EXT:
+		error = prctl_seccomp_ext(arg2, arg3, arg4, arg5);
+		break;
 	case PR_GET_TSC:
 		error = GET_TSC_CTL(arg2);
 		break;
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v5 6/6] seccomp: add SECCOMP_EXT_ACT_TSYNC and SECCOMP_FILTER_TSYNC
  2014-05-22 23:05 [PATCH v5 0/6] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC Kees Cook
                   ` (4 preceding siblings ...)
  2014-05-22 23:05 ` [PATCH v5 5/6] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_FILTER Kees Cook
@ 2014-05-22 23:05 ` Kees Cook
  2014-05-22 23:11   ` Andy Lutomirski
  2014-06-02 19:47 ` [PATCH v5 0/6] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC Kees Cook
  6 siblings, 1 reply; 35+ messages in thread
From: Kees Cook @ 2014-05-22 23:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kees Cook, Andy Lutomirski, Oleg Nesterov, Alexander Viro,
	Peter Zijlstra, James Morris, Eric Paris, Juri Lelli,
	John Stultz, David S. Miller, Daniel Borkmann, Alex Thorlton,
	Rik van Riel, Daeseok Youn, David Rientjes, Eric W. Biederman,
	Dario Faggioli, Rashika Kheria, liguang, Geert Uytterhoeven,
	linux-doc, linux-security-module

Applying restrictive seccomp filter programs to large or diverse
codebases often requires handling threads which may be started early in
the process lifetime (e.g., by code that is linked in). While it is
possible to apply permissive programs prior to process start up, it is
difficult to further restrict the kernel ABI to those threads after that
point.

This change adds a new seccomp extension action for synchronizing thread
group seccomp filters and a prctl() for accessing that functionality,
as well as a flag for SECCOMP_EXT_ACT_FILTER to perform sync at filter
installation time.

When calling prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_FILTER,
flags, filter) with flags containing SECCOMP_FILTER_TSYNC, or when calling
prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_TSYNC, 0, 0), it
will attempt to synchronize all threads in current's threadgroup to its
seccomp filter program. This is possible iff all threads are using a filter
that is an ancestor to the filter current is attempting to synchronize to.
NULL filters (where the task is running as SECCOMP_MODE_NONE) are also
treated as ancestors allowing threads to be transitioned into
SECCOMP_MODE_FILTER. If prctrl(PR_SET_NO_NEW_PRIVS, ...) has been set on the
calling thread, no_new_privs will be set for all synchronized threads too.
On success, 0 is returned. On failure, the pid of one of the failing threads
will be returned, with as many filters installed as possible.

The race conditions are against another thread calling TSYNC, another
thread performing a clone, and another thread changing its filter. The
seccomp write lock is sufficient for these cases, though the clone
case is assisted by the tasklist_lock so that new threads must have a
duplicate of its parent seccomp state when it appears on the tasklist.

Based on patches by Will Drewry.

Suggested-by: Julien Tinnes <jln@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
---
 Documentation/prctl/seccomp_filter.txt |   22 ++++++-
 include/uapi/linux/seccomp.h           |    4 ++
 kernel/seccomp.c                       |  106 ++++++++++++++++++++++++++++++--
 3 files changed, 126 insertions(+), 6 deletions(-)

diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
index a5e47753b32d..2924a3450870 100644
--- a/Documentation/prctl/seccomp_filter.txt
+++ b/Documentation/prctl/seccomp_filter.txt
@@ -235,5 +235,23 @@ prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_FILTER, flags, prog):
 	Attach filter, with flags.
 
 	This is the same as prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog)
-	except with the addition of optional "flags" argument. No flags
-	are currently recognized.
+	except with the addition of optional "flags" argument:
+
+	SECCOMP_FILTER_TSYNC:
+		After installing filter, perform threadgroup sync, as
+		described below for SECCOMP_EXT_ACT_TSYNC.
+
+prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_TSYNC, 0, 0):
+	Thread synchronization.
+
+	The current thread requests to synchronize all threads in current's
+	threadgroup to its seccomp filter program. This is possible iff all
+	threads are using a filter that is an ancestor to the filter current
+	is attempting to synchronize to, or the thread has not yet entered
+	seccomp. If prctrl(PR_SET_NO_NEW_PRIVS, ...) has been set on the
+	calling thread, no_new_privs will be set for all synchronized threads
+	too.
+
+	On success, 0 is returned. On failure, all synchronizable threads
+	will have been synchronized, and the pid of one of the failing
+	threads will be returned.
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index d7ad626c684d..7f4431b90fd4 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -15,6 +15,10 @@
 
 /* Valid extension actions as arg3 to prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT) */
 #define SECCOMP_EXT_ACT_FILTER	1 /* apply seccomp-bpf filter with flags */
+#define SECCOMP_EXT_ACT_TSYNC	2 /* synchronize threadgroup filters */
+
+/* Flags for prctl arg4 when calling SECCOMP_EXT_ACT_FILTER */
+#define SECCOMP_FILTER_TSYNC	1 /* synchronize threadgroup to filter */
 
 /*
  * All BPF programs must return a 32-bit value.
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 088244b2c765..d39c0dad9655 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -25,6 +25,7 @@
 #ifdef CONFIG_SECCOMP_FILTER
 #include <asm/syscall.h>
 #include <linux/filter.h>
+#include <linux/pid.h>
 #include <linux/ptrace.h>
 #include <linux/security.h>
 #include <linux/tracehook.h>
@@ -198,6 +199,93 @@ static u32 seccomp_run_filters(int syscall)
 	return ret;
 }
 
+/* Returns 1 if the candidate is an ancestor. */
+static int is_ancestor(struct seccomp_filter *candidate,
+		       struct seccomp_filter *child)
+{
+	/* NULL is the root ancestor. */
+	if (candidate == NULL)
+		return 1;
+	for (; child; child = child->prev)
+		if (child == candidate)
+			return 1;
+	return 0;
+}
+
+/* Expects locking and sync suitability to have been done already. */
+static void seccomp_sync_thread(struct task_struct *caller,
+				struct task_struct *thread)
+{
+	/* Get a task reference for the new leaf node. */
+	get_seccomp_filter(caller);
+	/*
+	 * Drop the task reference to the shared ancestor since
+	 * current's path will hold a reference.  (This also
+	 * allows a put before the assignment.)
+	 */
+	put_seccomp_filter(thread);
+	thread->seccomp.filter = caller->seccomp.filter;
+	/* Opt the other thread into seccomp if needed.
+	 * As threads are considered to be trust-realm
+	 * equivalent (see ptrace_may_access), it is safe to
+	 * allow one thread to transition the other.
+	 */
+	if (thread->seccomp.mode == SECCOMP_MODE_DISABLED) {
+		thread->seccomp.mode = SECCOMP_MODE_FILTER;
+		/*
+		 * Don't let an unprivileged task work around
+		 * the no_new_privs restriction by creating
+		 * a thread that sets it up, enters seccomp,
+		 * then dies.
+		 */
+		if (task_no_new_privs(caller))
+			task_set_no_new_privs(thread);
+		set_tsk_thread_flag(thread, TIF_SECCOMP);
+	}
+}
+
+/**
+ * seccomp_act_sync_threads: sets all threads to use current's filter
+ *
+ * Returns 0 on success, -ve on error, or the pid of a thread which was
+ * either not in the correct seccomp mode or it did not have an ancestral
+ * seccomp filter.
+ */
+static pid_t seccomp_act_sync_threads(void)
+{
+	struct task_struct *thread, *caller;
+	unsigned long tflags;
+	pid_t failed = 0;
+
+	if (current->seccomp.mode != SECCOMP_MODE_FILTER)
+		return -EACCES;
+
+	write_lock_irqsave(&tasklist_lock, tflags);
+	thread = caller = current;
+	while_each_thread(caller, thread) {
+		unsigned long irqflags;
+		seccomp_lock(thread, &irqflags);
+		/*
+		 * Validate thread being eligible for synchronization.
+		 */
+		if (thread->seccomp.mode == SECCOMP_MODE_DISABLED ||
+		    (thread->seccomp.mode == SECCOMP_MODE_FILTER &&
+		     is_ancestor(thread->seccomp.filter,
+				 caller->seccomp.filter))) {
+			seccomp_sync_thread(caller, thread);
+		} else {
+			/* Keep the last sibling that failed to return. */
+			failed = task_pid_vnr(thread);
+			/* If the pid cannot be resolved, then return -ESRCH */
+			if (failed == 0)
+				failed = -ESRCH;
+		}
+		seccomp_unlock(thread, irqflags);
+	}
+	write_unlock_irqrestore(&tasklist_lock, tflags);
+	return failed;
+}
+
 /**
  * seccomp_prepare_filter: Prepares a seccomp filter for use.
  * @fprog: BPF program to install
@@ -340,19 +428,24 @@ static long _seccomp_attach_filter(struct seccomp_filter *filter)
  * @flags:  flags from SECCOMP_FILTER_* to change behavior
  * @filter: struct sock_fprog for use with SECCOMP_MODE_FILTER
  *
- * Return 0 on success, -ve on error.
+ * Return 0 on success, -ve on error, or thread pid that caused failures.
  */
 static long seccomp_act_filter(unsigned long flags, char * __user filter)
 {
 	long ret;
 
-	/* No flags currently recognized. */
-	if (flags != 0)
+	/* Only SECCOMP_FILTER_TSYNC is recognized. */
+	if ((flags & ~(SECCOMP_FILTER_TSYNC)) != 0)
 		return -EINVAL;
 
 	ret = seccomp_set_mode(SECCOMP_MODE_FILTER, filter);
+	if (ret)
+		return ret;
 
-	return ret;
+	if (flags & SECCOMP_FILTER_TSYNC)
+		return seccomp_act_sync_threads();
+
+	return 0;
 }
 
 /**
@@ -368,6 +461,11 @@ static long seccomp_extended_action(int action, unsigned long arg1,
 	switch (action) {
 	case SECCOMP_EXT_ACT_FILTER:
 		return seccomp_act_filter(arg1, (char * __user)arg2);
+	case SECCOMP_EXT_ACT_TSYNC:
+		/* arg1 and arg2 are currently unused. */
+		if (arg1 || arg2)
+			return -EINVAL;
+		return seccomp_act_sync_threads();
 	default:
 		break;
 	}
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 4/6] seccomp: move no_new_privs into seccomp
  2014-05-22 23:05 ` [PATCH v5 4/6] seccomp: move no_new_privs into seccomp Kees Cook
@ 2014-05-22 23:08   ` Andy Lutomirski
  0 siblings, 0 replies; 35+ messages in thread
From: Andy Lutomirski @ 2014-05-22 23:08 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, Oleg Nesterov, Alexander Viro, Peter Zijlstra,
	James Morris, Eric Paris, Juri Lelli, John Stultz,
	David S. Miller, Daniel Borkmann, Alex Thorlton, Rik van Riel,
	Daeseok Youn, David Rientjes, Eric W. Biederman, Dario Faggioli,
	Rashika Kheria, liguang, Geert Uytterhoeven, linux-doc, LSM List

On Thu, May 22, 2014 at 4:05 PM, Kees Cook <keescook@chromium.org> wrote:
> Since seccomp transitions between threads requires updates to the
> no_new_privs flag to be atomic, changes must be atomic. This moves the nnp
> flag into the seccomp field as a separate unsigned long for atomic access.
>
> Signed-off-by: Kees Cook <keescook@chromium.org>

Acked-by: Andy Lutomirski <luto@amacapital.net>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 6/6] seccomp: add SECCOMP_EXT_ACT_TSYNC and SECCOMP_FILTER_TSYNC
  2014-05-22 23:05 ` [PATCH v5 6/6] seccomp: add SECCOMP_EXT_ACT_TSYNC and SECCOMP_FILTER_TSYNC Kees Cook
@ 2014-05-22 23:11   ` Andy Lutomirski
  2014-05-23 17:05     ` Kees Cook
  0 siblings, 1 reply; 35+ messages in thread
From: Andy Lutomirski @ 2014-05-22 23:11 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, Oleg Nesterov, Alexander Viro, Peter Zijlstra,
	James Morris, Eric Paris, Juri Lelli, John Stultz,
	David S. Miller, Daniel Borkmann, Alex Thorlton, Rik van Riel,
	Daeseok Youn, David Rientjes, Eric W. Biederman, Dario Faggioli,
	Rashika Kheria, liguang, Geert Uytterhoeven, linux-doc, LSM List

On Thu, May 22, 2014 at 4:05 PM, Kees Cook <keescook@chromium.org> wrote:
> Applying restrictive seccomp filter programs to large or diverse
> codebases often requires handling threads which may be started early in
> the process lifetime (e.g., by code that is linked in). While it is
> possible to apply permissive programs prior to process start up, it is
> difficult to further restrict the kernel ABI to those threads after that
> point.
>
> This change adds a new seccomp extension action for synchronizing thread
> group seccomp filters and a prctl() for accessing that functionality,
> as well as a flag for SECCOMP_EXT_ACT_FILTER to perform sync at filter
> installation time.
>
> When calling prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_FILTER,
> flags, filter) with flags containing SECCOMP_FILTER_TSYNC, or when calling
> prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_TSYNC, 0, 0), it
> will attempt to synchronize all threads in current's threadgroup to its
> seccomp filter program. This is possible iff all threads are using a filter
> that is an ancestor to the filter current is attempting to synchronize to.
> NULL filters (where the task is running as SECCOMP_MODE_NONE) are also
> treated as ancestors allowing threads to be transitioned into
> SECCOMP_MODE_FILTER. If prctrl(PR_SET_NO_NEW_PRIVS, ...) has been set on the
> calling thread, no_new_privs will be set for all synchronized threads too.
> On success, 0 is returned. On failure, the pid of one of the failing threads
> will be returned, with as many filters installed as possible.

Is there a use case for adding a filter and synchronizing filters
being separate operations?  If not, I think this would be easier to
understand and to use if there was just a single operation.

If you did that, you'd have to decide whether to continue requiring
that all the other threads have a filter that's an ancestor of the
current thread's filter.

--Andy

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 3/6] seccomp: introduce writer locking
  2014-05-22 23:05 ` [PATCH v5 3/6] seccomp: introduce writer locking Kees Cook
@ 2014-05-23  0:28   ` Alexei Starovoitov
  2014-05-23  8:49   ` Peter Zijlstra
  1 sibling, 0 replies; 35+ messages in thread
From: Alexei Starovoitov @ 2014-05-23  0:28 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, Andy Lutomirski, Oleg Nesterov, Alexander Viro,
	Peter Zijlstra, James Morris, Eric Paris, Juri Lelli,
	John Stultz, David S. Miller, Daniel Borkmann, Alex Thorlton,
	Rik van Riel, Daeseok Youn, David Rientjes, Eric W. Biederman,
	Dario Faggioli, Rashika Kheria, liguang, Geert Uytterhoeven,
	linux-doc, linux-security-module

On Thu, May 22, 2014 at 4:05 PM, Kees Cook <keescook@chromium.org> wrote:
> Normally, task_struct.seccomp.filter is only ever read or modified by
> the task that owns it (current). This property aids in fast access
> during system call filtering as read access is lockless.
>
> Updating the pointer from another task, however, opens up race
> conditions. To allow cross-task filter pointer updates, writes to the
> seccomp fields are now protected by a spinlock.  Read access remains
> lockless because pointer updates themselves are atomic.  However, writes
> (or cloning) often entail additional checking (like maximum instruction
> counts) which require locking to perform safely.
>
> In the case of cloning threads, the child is invisible to the system
> until it enters the task list. To make sure a child can't be cloned
> from a thread and left in a prior state, seccomp duplication is moved
> under the tasklist_lock. Then parent and child are certain have the same
> seccomp state when they exit the lock.
>
> Based on patches by Will Drewry and David Drysdale.
>
> Signed-off-by: Kees Cook <keescook@chromium.org>
> ---
>  include/linux/init_task.h |    8 ++++++++
>  include/linux/sched.h     |   20 ++++++++++++++++++++
>  include/linux/seccomp.h   |    6 ++++--
>  kernel/fork.c             |   34 +++++++++++++++++++++++++++++++++-
>  kernel/seccomp.c          |   20 ++++++++++++++------
>  5 files changed, 79 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> index 6df7f9fe0d01..e2370ec3102b 100644
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -154,6 +154,13 @@ extern struct task_group root_task_group;
>  # define INIT_VTIME(tsk)
>  #endif
>
> +#ifdef CONFIG_SECCOMP
> +# define INIT_SECCOMP(tsk)                                             \
> +       .seccomp.lock   = __SPIN_LOCK_UNLOCKED(tsk.seccomp.lock),
> +#else
> +# define INIT_SECCOMP(tsk)
> +#endif
> +
>  #define INIT_TASK_COMM "swapper"
>
>  #ifdef CONFIG_RT_MUTEXES
> @@ -234,6 +241,7 @@ extern struct task_group root_task_group;
>         INIT_CPUSET_SEQ(tsk)                                            \
>         INIT_RT_MUTEXES(tsk)                                            \
>         INIT_VTIME(tsk)                                                 \
> +       INIT_SECCOMP(tsk)                                               \
>  }
>
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 25f54c79f757..71a6cb66a3f3 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2480,6 +2480,26 @@ static inline void task_unlock(struct task_struct *p)
>         spin_unlock(&p->alloc_lock);
>  }
>
> +#ifdef CONFIG_SECCOMP
> +/*
> + * Protects changes to ->seccomp
> + */
> +static inline void seccomp_lock(struct task_struct *p, unsigned long *flags)
> +{
> +       spin_lock_irqsave(&p->seccomp.lock, (*flags));
> +}
> +
> +static inline void seccomp_unlock(struct task_struct *p, unsigned long flags)
> +{
> +       spin_unlock_irqrestore(&p->seccomp.lock, flags);
> +}
> +#else
> +static inline void seccomp_lock(struct task_struct *p, unsigned long *flags)
> +{ }
> +static inline void seccomp_unlock(struct task_struct *p, unsigned long flags)
> +{ }
> +#endif
> +
>  extern struct sighand_struct *__lock_task_sighand(struct task_struct *tsk,
>                                                         unsigned long *flags);
>
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index 4054b0994071..c47be00e8ffb 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -14,14 +14,16 @@ struct seccomp_filter;
>   *
>   * @mode:  indicates one of the valid values above for controlled
>   *         system calls available to a process.
> - * @filter: The metadata and ruleset for determining what system calls
> - *          are allowed for a task.
> + * @lock:  held when making changes to avoid thread races.
> + * @filter: must always point to a valid seccomp-filter or NULL as it is
> + *          accessed without locking during system call entry.
>   *
>   *          @filter must only be accessed from the context of current as there
>   *          is no locking.
>   */
>  struct seccomp {
>         int mode;
> +       spinlock_t lock;
>         struct seccomp_filter *filter;
>  };
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 54a8d26f612f..e08a1907cd78 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -315,6 +315,15 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
>                 goto free_ti;
>
>         tsk->stack = ti;
> +#ifdef CONFIG_SECCOMP
> +       /*
> +        * We must handle setting up seccomp filters once we're under
> +        * the tasklist_lock in case orig has changed between now and
> +        * then. Until then, filter must be NULL to avoid messing up
> +        * the usage counts on the error path calling free_task.
> +        */
> +       tsk->seccomp.filter = NULL;
> +#endif
>
>         setup_thread_stack(tsk, orig);
>         clear_user_return_notifier(tsk);
> @@ -1081,6 +1090,24 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
>         return 0;
>  }
>
> +static void copy_seccomp(struct task_struct *p)
> +{
> +#ifdef CONFIG_SECCOMP
> +       /* Child lock not needed since it is not yet in tasklist. */
> +       unsigned long irqflags;
> +       seccomp_lock(current, &irqflags);
> +
> +       get_seccomp_filter(current);
> +       p->seccomp = current->seccomp;
> +       spin_lock_init(&p->seccomp.lock);
> +
> +       if (p->seccomp.mode != SECCOMP_MODE_DISABLED)
> +               set_tsk_thread_flag(p, TIF_SECCOMP);
> +
> +       seccomp_unlock(current, irqflags);
> +#endif
> +}
> +
>  SYSCALL_DEFINE1(set_tid_address, int __user *, tidptr)
>  {
>         current->clear_child_tid = tidptr;
> @@ -1196,7 +1223,6 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>                 goto fork_out;
>
>         ftrace_graph_init_task(p);
> -       get_seccomp_filter(p);
>
>         rt_mutex_init_task(p);
>
> @@ -1425,6 +1451,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>          */
>         write_lock_irq(&tasklist_lock);
>
> +       /*
> +        * Copy seccomp details explicitly here, in case they were changed
> +        * before holding tasklist_lock.
> +        */
> +       copy_seccomp(p);
> +
>         /* CLONE_PARENT re-uses the old parent */
>         if (clone_flags & (CLONE_PARENT|CLONE_THREAD)) {
>                 p->real_parent = current->real_parent;
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 481504100b1d..d200029728ca 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -174,12 +174,12 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
>   */
>  static u32 seccomp_run_filters(int syscall)
>  {
> -       struct seccomp_filter *f;
> +       struct seccomp_filter *f = smp_load_acquire(&current->seccomp.filter);
>         struct seccomp_data sd;
>         u32 ret = SECCOMP_RET_ALLOW;
>
>         /* Ensure unexpected behavior doesn't result in failing open. */
> -       if (WARN_ON(current->seccomp.filter == NULL))
> +       if (WARN_ON(f == NULL))
>                 return SECCOMP_RET_KILL;
>
>         populate_seccomp_data(&sd);
> @@ -188,7 +188,7 @@ static u32 seccomp_run_filters(int syscall)
>          * All filters in the list are evaluated and the lowest BPF return
>          * value always takes priority (ignoring the DATA).
>          */
> -       for (f = current->seccomp.filter; f; f = f->prev) {
> +       for (; f; f = smp_load_acquire(&f->prev)) {
>                 u32 cur_ret = sk_run_filter_int_seccomp(&sd, f->insnsi);

indeed. the conflicts here will be trivial to resolve. Not sure what
tree you're using
for these patches. Can we avoid the conflicts all together?
Could you rebase against linux-next at least?

>                 if ((cur_ret & SECCOMP_RET_ACTION) < (ret & SECCOMP_RET_ACTION))
>                         ret = cur_ret;
> @@ -305,6 +305,8 @@ out:
>   * _seccomp_attach_filter: validated and attach filter
>   * @filter: seccomp filter to add to the current process
>   *
> + * Caller must be holding the seccomp lock.
> + *
>   * Returns 0 on success, -ve on error.
>   */
>  static long _seccomp_attach_filter(struct seccomp_filter *filter)
> @@ -312,6 +314,8 @@ static long _seccomp_attach_filter(struct seccomp_filter *filter)
>         unsigned long total_insns;
>         struct seccomp_filter *walker;
>
> +       BUG_ON(!spin_is_locked(&current->seccomp.lock));
> +
>         /* Validate resulting filter length. */
>         total_insns = filter->len;
>         for (walker = current->seccomp.filter; walker; walker = filter->prev)
> @@ -324,7 +328,7 @@ static long _seccomp_attach_filter(struct seccomp_filter *filter)
>          * task reference.
>          */
>         filter->prev = current->seccomp.filter;
> -       current->seccomp.filter = filter;
> +       smp_store_release(&current->seccomp.filter, filter);
>
>         return 0;
>  }
> @@ -332,7 +336,7 @@ static long _seccomp_attach_filter(struct seccomp_filter *filter)
>  /* get_seccomp_filter - increments the reference count of the filter on @tsk */
>  void get_seccomp_filter(struct task_struct *tsk)
>  {
> -       struct seccomp_filter *orig = tsk->seccomp.filter;
> +       struct seccomp_filter *orig = smp_load_acquire(&tsk->seccomp.filter);
>         if (!orig)
>                 return;
>         /* Reference count is bounded by the number of total processes. */
> @@ -346,7 +350,7 @@ void put_seccomp_filter(struct task_struct *tsk)
>         /* Clean up single-reference branches iteratively. */
>         while (orig && atomic_dec_and_test(&orig->usage)) {
>                 struct seccomp_filter *freeme = orig;
> -               orig = orig->prev;
> +               orig = smp_load_acquire(&orig->prev);
>                 kfree(freeme);
>         }
>  }
> @@ -498,6 +502,7 @@ long prctl_get_seccomp(void)
>  static long seccomp_set_mode(unsigned long seccomp_mode, char __user *filter)
>  {
>         struct seccomp_filter *prepared = NULL;
> +       unsigned long irqflags;
>         long ret = -EINVAL;
>
>  #ifdef CONFIG_SECCOMP_FILTER
> @@ -509,6 +514,8 @@ static long seccomp_set_mode(unsigned long seccomp_mode, char __user *filter)
>         }
>  #endif
>
> +       seccomp_lock(current, &irqflags);
> +
>         if (current->seccomp.mode &&
>             current->seccomp.mode != seccomp_mode)
>                 goto out;
> @@ -536,6 +543,7 @@ static long seccomp_set_mode(unsigned long seccomp_mode, char __user *filter)
>         current->seccomp.mode = seccomp_mode;
>         set_thread_flag(TIF_SECCOMP);
>  out:
> +       seccomp_unlock(current, irqflags);
>         kfree(prepared);
>         return ret;
>  }
> --
> 1.7.9.5
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 3/6] seccomp: introduce writer locking
  2014-05-22 23:05 ` [PATCH v5 3/6] seccomp: introduce writer locking Kees Cook
  2014-05-23  0:28   ` Alexei Starovoitov
@ 2014-05-23  8:49   ` Peter Zijlstra
  2014-05-23 21:05     ` Kees Cook
  1 sibling, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2014-05-23  8:49 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, Andy Lutomirski, Oleg Nesterov, Alexander Viro,
	James Morris, Eric Paris, Juri Lelli, John Stultz,
	David S. Miller, Daniel Borkmann, Alex Thorlton, Rik van Riel,
	Daeseok Youn, David Rientjes, Eric W. Biederman, Dario Faggioli,
	Rashika Kheria, liguang, Geert Uytterhoeven, linux-doc,
	linux-security-module

[-- Attachment #1: Type: text/plain, Size: 1405 bytes --]

On Thu, May 22, 2014 at 04:05:33PM -0700, Kees Cook wrote:
> Normally, task_struct.seccomp.filter is only ever read or modified by
> the task that owns it (current). This property aids in fast access
> during system call filtering as read access is lockless.
> 
> Updating the pointer from another task, however, opens up race
> conditions. To allow cross-task filter pointer updates, writes to the
> seccomp fields are now protected by a spinlock.  Read access remains
> lockless because pointer updates themselves are atomic.  However, writes
> (or cloning) often entail additional checking (like maximum instruction
> counts) which require locking to perform safely.
> 
> In the case of cloning threads, the child is invisible to the system
> until it enters the task list. To make sure a child can't be cloned
> from a thread and left in a prior state, seccomp duplication is moved
> under the tasklist_lock. Then parent and child are certain have the same
> seccomp state when they exit the lock.
> 

So I'm a complete noob on the whole seccomp thing, so maybe this is a
silly question, but.. what about object lifetimes?

Looking at put_seccomp_filter() it explicitly takes a tsk pointer,
suggesting one can call it on !current. And while it does a dec_and_test
on the object itself, run_filter() does nothing with refcounts, and
therefore can be touching dead memory.



[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 6/6] seccomp: add SECCOMP_EXT_ACT_TSYNC and SECCOMP_FILTER_TSYNC
  2014-05-22 23:11   ` Andy Lutomirski
@ 2014-05-23 17:05     ` Kees Cook
  2014-05-26 19:27       ` Andy Lutomirski
  0 siblings, 1 reply; 35+ messages in thread
From: Kees Cook @ 2014-05-23 17:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Oleg Nesterov, Alexander Viro, Peter Zijlstra,
	James Morris, Eric Paris, Juri Lelli, John Stultz,
	David S. Miller, Daniel Borkmann, Alex Thorlton, Rik van Riel,
	Daeseok Youn, David Rientjes, Eric W. Biederman, Dario Faggioli,
	Rashika Kheria, liguang, Geert Uytterhoeven, linux-doc, LSM List,
	Will Drewry, Julien Tinnes

On Thu, May 22, 2014 at 4:11 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Thu, May 22, 2014 at 4:05 PM, Kees Cook <keescook@chromium.org> wrote:
>> Applying restrictive seccomp filter programs to large or diverse
>> codebases often requires handling threads which may be started early in
>> the process lifetime (e.g., by code that is linked in). While it is
>> possible to apply permissive programs prior to process start up, it is
>> difficult to further restrict the kernel ABI to those threads after that
>> point.
>>
>> This change adds a new seccomp extension action for synchronizing thread
>> group seccomp filters and a prctl() for accessing that functionality,
>> as well as a flag for SECCOMP_EXT_ACT_FILTER to perform sync at filter
>> installation time.
>>
>> When calling prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_FILTER,
>> flags, filter) with flags containing SECCOMP_FILTER_TSYNC, or when calling
>> prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_TSYNC, 0, 0), it
>> will attempt to synchronize all threads in current's threadgroup to its
>> seccomp filter program. This is possible iff all threads are using a filter
>> that is an ancestor to the filter current is attempting to synchronize to.
>> NULL filters (where the task is running as SECCOMP_MODE_NONE) are also
>> treated as ancestors allowing threads to be transitioned into
>> SECCOMP_MODE_FILTER. If prctrl(PR_SET_NO_NEW_PRIVS, ...) has been set on the
>> calling thread, no_new_privs will be set for all synchronized threads too.
>> On success, 0 is returned. On failure, the pid of one of the failing threads
>> will be returned, with as many filters installed as possible.
>
> Is there a use case for adding a filter and synchronizing filters
> being separate operations?  If not, I think this would be easier to
> understand and to use if there was just a single operation.

Yes: if the other thread's lifetime is not well controlled, it's good
to be able to have a distinct interface to retry the thread sync that
doesn't require adding "no-op" filters.

> If you did that, you'd have to decide whether to continue requiring
> that all the other threads have a filter that's an ancestor of the
> current thread's filter.

This is required no matter what to make sure there is no way to
replace a filter tree with a different one (allowing accidental
bypasses, misbehavior, etc).

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 3/6] seccomp: introduce writer locking
  2014-05-23  8:49   ` Peter Zijlstra
@ 2014-05-23 21:05     ` Kees Cook
  0 siblings, 0 replies; 35+ messages in thread
From: Kees Cook @ 2014-05-23 21:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andy Lutomirski, Oleg Nesterov, Alexander Viro,
	James Morris, Eric Paris, Juri Lelli, John Stultz,
	David S. Miller, Daniel Borkmann, Alex Thorlton, Rik van Riel,
	Daeseok Youn, David Rientjes, Eric W. Biederman, Dario Faggioli,
	Rashika Kheria, liguang, Geert Uytterhoeven, linux-doc,
	linux-security-module

On Fri, May 23, 2014 at 1:49 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, May 22, 2014 at 04:05:33PM -0700, Kees Cook wrote:
>> Normally, task_struct.seccomp.filter is only ever read or modified by
>> the task that owns it (current). This property aids in fast access
>> during system call filtering as read access is lockless.
>>
>> Updating the pointer from another task, however, opens up race
>> conditions. To allow cross-task filter pointer updates, writes to the
>> seccomp fields are now protected by a spinlock.  Read access remains
>> lockless because pointer updates themselves are atomic.  However, writes
>> (or cloning) often entail additional checking (like maximum instruction
>> counts) which require locking to perform safely.
>>
>> In the case of cloning threads, the child is invisible to the system
>> until it enters the task list. To make sure a child can't be cloned
>> from a thread and left in a prior state, seccomp duplication is moved
>> under the tasklist_lock. Then parent and child are certain have the same
>> seccomp state when they exit the lock.
>>
>
> So I'm a complete noob on the whole seccomp thing, so maybe this is a
> silly question, but.. what about object lifetimes?

The get/put logic on seccomp filters eluded me when I first looked at
it too. :) Basically, each branch point holds counts, which means a
given filter will only get freed when all tasks using it have died.

> Looking at put_seccomp_filter() it explicitly takes a tsk pointer,
> suggesting one can call it on !current. And while it does a dec_and_test
> on the object itself, run_filter() does nothing with refcounts, and
> therefore can be touching dead memory.

That's technically true, but the only caller of put_seccomp_filter()
is free_task(), for which "current" doesn't make sense. But when
called, the task is no longer part of the task_list, so there's no
dead memory touching. (Unless you see something I don't.)

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 6/6] seccomp: add SECCOMP_EXT_ACT_TSYNC and SECCOMP_FILTER_TSYNC
  2014-05-23 17:05     ` Kees Cook
@ 2014-05-26 19:27       ` Andy Lutomirski
  2014-05-27 18:24         ` Kees Cook
  0 siblings, 1 reply; 35+ messages in thread
From: Andy Lutomirski @ 2014-05-26 19:27 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, Oleg Nesterov, Alexander Viro, Peter Zijlstra,
	James Morris, Eric Paris, Juri Lelli, John Stultz,
	David S. Miller, Daniel Borkmann, Alex Thorlton, Rik van Riel,
	Daeseok Youn, David Rientjes, Eric W. Biederman, Dario Faggioli,
	Rashika Kheria, liguang, Geert Uytterhoeven, linux-doc, LSM List,
	Will Drewry, Julien Tinnes

On Fri, May 23, 2014 at 10:05 AM, Kees Cook <keescook@chromium.org> wrote:
> On Thu, May 22, 2014 at 4:11 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Thu, May 22, 2014 at 4:05 PM, Kees Cook <keescook@chromium.org> wrote:
>>> Applying restrictive seccomp filter programs to large or diverse
>>> codebases often requires handling threads which may be started early in
>>> the process lifetime (e.g., by code that is linked in). While it is
>>> possible to apply permissive programs prior to process start up, it is
>>> difficult to further restrict the kernel ABI to those threads after that
>>> point.
>>>
>>> This change adds a new seccomp extension action for synchronizing thread
>>> group seccomp filters and a prctl() for accessing that functionality,
>>> as well as a flag for SECCOMP_EXT_ACT_FILTER to perform sync at filter
>>> installation time.
>>>
>>> When calling prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_FILTER,
>>> flags, filter) with flags containing SECCOMP_FILTER_TSYNC, or when calling
>>> prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_TSYNC, 0, 0), it
>>> will attempt to synchronize all threads in current's threadgroup to its
>>> seccomp filter program. This is possible iff all threads are using a filter
>>> that is an ancestor to the filter current is attempting to synchronize to.
>>> NULL filters (where the task is running as SECCOMP_MODE_NONE) are also
>>> treated as ancestors allowing threads to be transitioned into
>>> SECCOMP_MODE_FILTER. If prctrl(PR_SET_NO_NEW_PRIVS, ...) has been set on the
>>> calling thread, no_new_privs will be set for all synchronized threads too.
>>> On success, 0 is returned. On failure, the pid of one of the failing threads
>>> will be returned, with as many filters installed as possible.
>>
>> Is there a use case for adding a filter and synchronizing filters
>> being separate operations?  If not, I think this would be easier to
>> understand and to use if there was just a single operation.
>
> Yes: if the other thread's lifetime is not well controlled, it's good
> to be able to have a distinct interface to retry the thread sync that
> doesn't require adding "no-op" filters.

Wouldn't this still be solved by:

seccomp_add_filter(final_filter, SECCOMP_FILTER_ALL_THREADS);

the idea would be that, if seccomp_add_filter fails, then you give up
and, if it succeeds, then you're done.  It shouldn't fail unless out
of memory or you've nested too deeply.

>
>> If you did that, you'd have to decide whether to continue requiring
>> that all the other threads have a filter that's an ancestor of the
>> current thread's filter.
>
> This is required no matter what to make sure there is no way to
> replace a filter tree with a different one (allowing accidental
> bypasses, misbehavior, etc).

What I mean is:  should the add-new-filter-to-all-threads operation
add the new filter to all threads, regardless of what their current
state is, or should it fail if any thread has a filter that isn't an
ancestor of the current thread's filter?  Either version should be
safe.

--Andy

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 6/6] seccomp: add SECCOMP_EXT_ACT_TSYNC and SECCOMP_FILTER_TSYNC
  2014-05-26 19:27       ` Andy Lutomirski
@ 2014-05-27 18:24         ` Kees Cook
  2014-05-27 18:40           ` Andy Lutomirski
  0 siblings, 1 reply; 35+ messages in thread
From: Kees Cook @ 2014-05-27 18:24 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Oleg Nesterov, Alexander Viro, Peter Zijlstra,
	James Morris, Eric Paris, Juri Lelli, John Stultz,
	David S. Miller, Daniel Borkmann, Alex Thorlton, Rik van Riel,
	Daeseok Youn, David Rientjes, Eric W. Biederman, Dario Faggioli,
	Rashika Kheria, liguang, Geert Uytterhoeven, linux-doc, LSM List,
	Will Drewry, Julien Tinnes

On Mon, May 26, 2014 at 12:27 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Fri, May 23, 2014 at 10:05 AM, Kees Cook <keescook@chromium.org> wrote:
>> On Thu, May 22, 2014 at 4:11 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Thu, May 22, 2014 at 4:05 PM, Kees Cook <keescook@chromium.org> wrote:
>>>> Applying restrictive seccomp filter programs to large or diverse
>>>> codebases often requires handling threads which may be started early in
>>>> the process lifetime (e.g., by code that is linked in). While it is
>>>> possible to apply permissive programs prior to process start up, it is
>>>> difficult to further restrict the kernel ABI to those threads after that
>>>> point.
>>>>
>>>> This change adds a new seccomp extension action for synchronizing thread
>>>> group seccomp filters and a prctl() for accessing that functionality,
>>>> as well as a flag for SECCOMP_EXT_ACT_FILTER to perform sync at filter
>>>> installation time.
>>>>
>>>> When calling prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_FILTER,
>>>> flags, filter) with flags containing SECCOMP_FILTER_TSYNC, or when calling
>>>> prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_TSYNC, 0, 0), it
>>>> will attempt to synchronize all threads in current's threadgroup to its
>>>> seccomp filter program. This is possible iff all threads are using a filter
>>>> that is an ancestor to the filter current is attempting to synchronize to.
>>>> NULL filters (where the task is running as SECCOMP_MODE_NONE) are also
>>>> treated as ancestors allowing threads to be transitioned into
>>>> SECCOMP_MODE_FILTER. If prctrl(PR_SET_NO_NEW_PRIVS, ...) has been set on the
>>>> calling thread, no_new_privs will be set for all synchronized threads too.
>>>> On success, 0 is returned. On failure, the pid of one of the failing threads
>>>> will be returned, with as many filters installed as possible.
>>>
>>> Is there a use case for adding a filter and synchronizing filters
>>> being separate operations?  If not, I think this would be easier to
>>> understand and to use if there was just a single operation.
>>
>> Yes: if the other thread's lifetime is not well controlled, it's good
>> to be able to have a distinct interface to retry the thread sync that
>> doesn't require adding "no-op" filters.
>
> Wouldn't this still be solved by:
>
> seccomp_add_filter(final_filter, SECCOMP_FILTER_ALL_THREADS);
>
> the idea would be that, if seccomp_add_filter fails, then you give up
> and, if it succeeds, then you're done.  It shouldn't fail unless out
> of memory or you've nested too deeply.

I wanted to keep the case of being able to to wait for non-ancestor
threads to finish. For example, 2 threads start and set separate
filters. 1 does work and exits, 2 starts another thread (3) which adds
filters, does work, and then waits for 1 to finish by calling TSYNC.
Once 1 dies, TSYNC succeeds. In the case of not having direct control
over thread lifetime (say, when using third-party libraries), I'd like
to retain the flexibility of being able to do TSYNC without needing a
filter being attached to it.

>>> If you did that, you'd have to decide whether to continue requiring
>>> that all the other threads have a filter that's an ancestor of the
>>> current thread's filter.
>>
>> This is required no matter what to make sure there is no way to
>> replace a filter tree with a different one (allowing accidental
>> bypasses, misbehavior, etc).
>
> What I mean is:  should the add-new-filter-to-all-threads operation
> add the new filter to all threads, regardless of what their current
> state is, or should it fail if any thread has a filter that isn't an
> ancestor of the current thread's filter?  Either version should be
> safe.

It should fail -- we don't want to run the risk of effectively
replacing a filter out from under a thread. Adding additional
restrictions is safe as long as we retain the nnp from the caller .

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 6/6] seccomp: add SECCOMP_EXT_ACT_TSYNC and SECCOMP_FILTER_TSYNC
  2014-05-27 18:24         ` Kees Cook
@ 2014-05-27 18:40           ` Andy Lutomirski
  2014-05-27 18:45             ` Kees Cook
  0 siblings, 1 reply; 35+ messages in thread
From: Andy Lutomirski @ 2014-05-27 18:40 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, Oleg Nesterov, Alexander Viro, Peter Zijlstra,
	James Morris, Eric Paris, Juri Lelli, John Stultz,
	David S. Miller, Daniel Borkmann, Alex Thorlton, Rik van Riel,
	Daeseok Youn, David Rientjes, Eric W. Biederman, Dario Faggioli,
	Rashika Kheria, liguang, Geert Uytterhoeven, linux-doc, LSM List,
	Will Drewry, Julien Tinnes

On Tue, May 27, 2014 at 11:24 AM, Kees Cook <keescook@chromium.org> wrote:
> On Mon, May 26, 2014 at 12:27 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Fri, May 23, 2014 at 10:05 AM, Kees Cook <keescook@chromium.org> wrote:
>>> On Thu, May 22, 2014 at 4:11 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> On Thu, May 22, 2014 at 4:05 PM, Kees Cook <keescook@chromium.org> wrote:
>>>>> Applying restrictive seccomp filter programs to large or diverse
>>>>> codebases often requires handling threads which may be started early in
>>>>> the process lifetime (e.g., by code that is linked in). While it is
>>>>> possible to apply permissive programs prior to process start up, it is
>>>>> difficult to further restrict the kernel ABI to those threads after that
>>>>> point.
>>>>>
>>>>> This change adds a new seccomp extension action for synchronizing thread
>>>>> group seccomp filters and a prctl() for accessing that functionality,
>>>>> as well as a flag for SECCOMP_EXT_ACT_FILTER to perform sync at filter
>>>>> installation time.
>>>>>
>>>>> When calling prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_FILTER,
>>>>> flags, filter) with flags containing SECCOMP_FILTER_TSYNC, or when calling
>>>>> prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_TSYNC, 0, 0), it
>>>>> will attempt to synchronize all threads in current's threadgroup to its
>>>>> seccomp filter program. This is possible iff all threads are using a filter
>>>>> that is an ancestor to the filter current is attempting to synchronize to.
>>>>> NULL filters (where the task is running as SECCOMP_MODE_NONE) are also
>>>>> treated as ancestors allowing threads to be transitioned into
>>>>> SECCOMP_MODE_FILTER. If prctrl(PR_SET_NO_NEW_PRIVS, ...) has been set on the
>>>>> calling thread, no_new_privs will be set for all synchronized threads too.
>>>>> On success, 0 is returned. On failure, the pid of one of the failing threads
>>>>> will be returned, with as many filters installed as possible.
>>>>
>>>> Is there a use case for adding a filter and synchronizing filters
>>>> being separate operations?  If not, I think this would be easier to
>>>> understand and to use if there was just a single operation.
>>>
>>> Yes: if the other thread's lifetime is not well controlled, it's good
>>> to be able to have a distinct interface to retry the thread sync that
>>> doesn't require adding "no-op" filters.
>>
>> Wouldn't this still be solved by:
>>
>> seccomp_add_filter(final_filter, SECCOMP_FILTER_ALL_THREADS);
>>
>> the idea would be that, if seccomp_add_filter fails, then you give up
>> and, if it succeeds, then you're done.  It shouldn't fail unless out
>> of memory or you've nested too deeply.
>
> I wanted to keep the case of being able to to wait for non-ancestor
> threads to finish. For example, 2 threads start and set separate
> filters. 1 does work and exits, 2 starts another thread (3) which adds
> filters, does work, and then waits for 1 to finish by calling TSYNC.
> Once 1 dies, TSYNC succeeds. In the case of not having direct control
> over thread lifetime (say, when using third-party libraries), I'd like
> to retain the flexibility of being able to do TSYNC without needing a
> filter being attached to it.

I must admit this strikes me as odd.  What's the point of having a
thread set a filter if it intends to be a short-lived thread?

In any cast, I must have missed the ability for TSYNC to block.  Hmm.
That seems complicated, albeit potentially useful.

>
>>>> If you did that, you'd have to decide whether to continue requiring
>>>> that all the other threads have a filter that's an ancestor of the
>>>> current thread's filter.
>>>
>>> This is required no matter what to make sure there is no way to
>>> replace a filter tree with a different one (allowing accidental
>>> bypasses, misbehavior, etc).
>>
>> What I mean is:  should the add-new-filter-to-all-threads operation
>> add the new filter to all threads, regardless of what their current
>> state is, or should it fail if any thread has a filter that isn't an
>> ancestor of the current thread's filter?  Either version should be
>> safe.
>
> It should fail -- we don't want to run the risk of effectively
> replacing a filter out from under a thread. Adding additional
> restrictions is safe as long as we retain the nnp from the caller .

The other option would be to add the new filter.  The threads wouldn't
end up being ancestors of each other, but that could still be okay.

In any case, I don't have a strong feeling about this particular issue.

--Andy

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 6/6] seccomp: add SECCOMP_EXT_ACT_TSYNC and SECCOMP_FILTER_TSYNC
  2014-05-27 18:40           ` Andy Lutomirski
@ 2014-05-27 18:45             ` Kees Cook
  2014-05-27 19:10               ` Andy Lutomirski
  0 siblings, 1 reply; 35+ messages in thread
From: Kees Cook @ 2014-05-27 18:45 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Oleg Nesterov, Alexander Viro, Peter Zijlstra,
	James Morris, Eric Paris, Juri Lelli, John Stultz,
	David S. Miller, Daniel Borkmann, Alex Thorlton, Rik van Riel,
	Daeseok Youn, David Rientjes, Eric W. Biederman, Dario Faggioli,
	Rashika Kheria, liguang, Geert Uytterhoeven, linux-doc, LSM List,
	Will Drewry, Julien Tinnes

On Tue, May 27, 2014 at 11:40 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Tue, May 27, 2014 at 11:24 AM, Kees Cook <keescook@chromium.org> wrote:
>> On Mon, May 26, 2014 at 12:27 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Fri, May 23, 2014 at 10:05 AM, Kees Cook <keescook@chromium.org> wrote:
>>>> On Thu, May 22, 2014 at 4:11 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>> On Thu, May 22, 2014 at 4:05 PM, Kees Cook <keescook@chromium.org> wrote:
>>>>>> Applying restrictive seccomp filter programs to large or diverse
>>>>>> codebases often requires handling threads which may be started early in
>>>>>> the process lifetime (e.g., by code that is linked in). While it is
>>>>>> possible to apply permissive programs prior to process start up, it is
>>>>>> difficult to further restrict the kernel ABI to those threads after that
>>>>>> point.
>>>>>>
>>>>>> This change adds a new seccomp extension action for synchronizing thread
>>>>>> group seccomp filters and a prctl() for accessing that functionality,
>>>>>> as well as a flag for SECCOMP_EXT_ACT_FILTER to perform sync at filter
>>>>>> installation time.
>>>>>>
>>>>>> When calling prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_FILTER,
>>>>>> flags, filter) with flags containing SECCOMP_FILTER_TSYNC, or when calling
>>>>>> prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_TSYNC, 0, 0), it
>>>>>> will attempt to synchronize all threads in current's threadgroup to its
>>>>>> seccomp filter program. This is possible iff all threads are using a filter
>>>>>> that is an ancestor to the filter current is attempting to synchronize to.
>>>>>> NULL filters (where the task is running as SECCOMP_MODE_NONE) are also
>>>>>> treated as ancestors allowing threads to be transitioned into
>>>>>> SECCOMP_MODE_FILTER. If prctrl(PR_SET_NO_NEW_PRIVS, ...) has been set on the
>>>>>> calling thread, no_new_privs will be set for all synchronized threads too.
>>>>>> On success, 0 is returned. On failure, the pid of one of the failing threads
>>>>>> will be returned, with as many filters installed as possible.
>>>>>
>>>>> Is there a use case for adding a filter and synchronizing filters
>>>>> being separate operations?  If not, I think this would be easier to
>>>>> understand and to use if there was just a single operation.
>>>>
>>>> Yes: if the other thread's lifetime is not well controlled, it's good
>>>> to be able to have a distinct interface to retry the thread sync that
>>>> doesn't require adding "no-op" filters.
>>>
>>> Wouldn't this still be solved by:
>>>
>>> seccomp_add_filter(final_filter, SECCOMP_FILTER_ALL_THREADS);
>>>
>>> the idea would be that, if seccomp_add_filter fails, then you give up
>>> and, if it succeeds, then you're done.  It shouldn't fail unless out
>>> of memory or you've nested too deeply.
>>
>> I wanted to keep the case of being able to to wait for non-ancestor
>> threads to finish. For example, 2 threads start and set separate
>> filters. 1 does work and exits, 2 starts another thread (3) which adds
>> filters, does work, and then waits for 1 to finish by calling TSYNC.
>> Once 1 dies, TSYNC succeeds. In the case of not having direct control
>> over thread lifetime (say, when using third-party libraries), I'd like
>> to retain the flexibility of being able to do TSYNC without needing a
>> filter being attached to it.
>
> I must admit this strikes me as odd.  What's the point of having a
> thread set a filter if it intends to be a short-lived thread?

I was illustrating the potential insanity of third-party libraries.
There isn't much sense in that behavior, but if it exists, working
around it is harder without the separate TSYNC-only call.

> In any case, I must have missed the ability for TSYNC to block.  Hmm.
> That seems complicated, albeit potentially useful.

Oh, no, I didn't mean to imply TSYNC should block. I meant that thread
3 could do:

while (TSYNC-fails)
   wait-on-or-kill-unexpected-thread

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 6/6] seccomp: add SECCOMP_EXT_ACT_TSYNC and SECCOMP_FILTER_TSYNC
  2014-05-27 18:45             ` Kees Cook
@ 2014-05-27 19:10               ` Andy Lutomirski
  2014-05-27 19:23                 ` Kees Cook
  0 siblings, 1 reply; 35+ messages in thread
From: Andy Lutomirski @ 2014-05-27 19:10 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, Oleg Nesterov, Alexander Viro, Peter Zijlstra,
	James Morris, Eric Paris, Juri Lelli, John Stultz,
	David S. Miller, Daniel Borkmann, Alex Thorlton, Rik van Riel,
	Daeseok Youn, David Rientjes, Eric W. Biederman, Dario Faggioli,
	Rashika Kheria, liguang, Geert Uytterhoeven, linux-doc, LSM List,
	Will Drewry, Julien Tinnes

On Tue, May 27, 2014 at 11:45 AM, Kees Cook <keescook@chromium.org> wrote:
> On Tue, May 27, 2014 at 11:40 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Tue, May 27, 2014 at 11:24 AM, Kees Cook <keescook@chromium.org> wrote:
>>> On Mon, May 26, 2014 at 12:27 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> On Fri, May 23, 2014 at 10:05 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>> On Thu, May 22, 2014 at 4:11 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>> On Thu, May 22, 2014 at 4:05 PM, Kees Cook <keescook@chromium.org> wrote:
>>>>>>> Applying restrictive seccomp filter programs to large or diverse
>>>>>>> codebases often requires handling threads which may be started early in
>>>>>>> the process lifetime (e.g., by code that is linked in). While it is
>>>>>>> possible to apply permissive programs prior to process start up, it is
>>>>>>> difficult to further restrict the kernel ABI to those threads after that
>>>>>>> point.
>>>>>>>
>>>>>>> This change adds a new seccomp extension action for synchronizing thread
>>>>>>> group seccomp filters and a prctl() for accessing that functionality,
>>>>>>> as well as a flag for SECCOMP_EXT_ACT_FILTER to perform sync at filter
>>>>>>> installation time.
>>>>>>>
>>>>>>> When calling prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_FILTER,
>>>>>>> flags, filter) with flags containing SECCOMP_FILTER_TSYNC, or when calling
>>>>>>> prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_TSYNC, 0, 0), it
>>>>>>> will attempt to synchronize all threads in current's threadgroup to its
>>>>>>> seccomp filter program. This is possible iff all threads are using a filter
>>>>>>> that is an ancestor to the filter current is attempting to synchronize to.
>>>>>>> NULL filters (where the task is running as SECCOMP_MODE_NONE) are also
>>>>>>> treated as ancestors allowing threads to be transitioned into
>>>>>>> SECCOMP_MODE_FILTER. If prctrl(PR_SET_NO_NEW_PRIVS, ...) has been set on the
>>>>>>> calling thread, no_new_privs will be set for all synchronized threads too.
>>>>>>> On success, 0 is returned. On failure, the pid of one of the failing threads
>>>>>>> will be returned, with as many filters installed as possible.
>>>>>>
>>>>>> Is there a use case for adding a filter and synchronizing filters
>>>>>> being separate operations?  If not, I think this would be easier to
>>>>>> understand and to use if there was just a single operation.
>>>>>
>>>>> Yes: if the other thread's lifetime is not well controlled, it's good
>>>>> to be able to have a distinct interface to retry the thread sync that
>>>>> doesn't require adding "no-op" filters.
>>>>
>>>> Wouldn't this still be solved by:
>>>>
>>>> seccomp_add_filter(final_filter, SECCOMP_FILTER_ALL_THREADS);
>>>>
>>>> the idea would be that, if seccomp_add_filter fails, then you give up
>>>> and, if it succeeds, then you're done.  It shouldn't fail unless out
>>>> of memory or you've nested too deeply.
>>>
>>> I wanted to keep the case of being able to to wait for non-ancestor
>>> threads to finish. For example, 2 threads start and set separate
>>> filters. 1 does work and exits, 2 starts another thread (3) which adds
>>> filters, does work, and then waits for 1 to finish by calling TSYNC.
>>> Once 1 dies, TSYNC succeeds. In the case of not having direct control
>>> over thread lifetime (say, when using third-party libraries), I'd like
>>> to retain the flexibility of being able to do TSYNC without needing a
>>> filter being attached to it.
>>
>> I must admit this strikes me as odd.  What's the point of having a
>> thread set a filter if it intends to be a short-lived thread?
>
> I was illustrating the potential insanity of third-party libraries.
> There isn't much sense in that behavior, but if it exists, working
> around it is harder without the separate TSYNC-only call.
>
>> In any case, I must have missed the ability for TSYNC to block.  Hmm.
>> That seems complicated, albeit potentially useful.
>
> Oh, no, I didn't mean to imply TSYNC should block. I meant that thread
> 3 could do:
>
> while (TSYNC-fails)
>    wait-on-or-kill-unexpected-thread
>

Ok.

I'm still not seeing the need for a separate TSYNC option, though --
just add-a-filter-across-all-threads would work if it failed
harmlessly on error.  FWIW, TSYNC is probably equivalent to adding an
always-accept filter across all threads, although no one should really
do the latter for efficiency reasons.

--Andy

> -Kees
>
> --
> Kees Cook
> Chrome OS Security



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 6/6] seccomp: add SECCOMP_EXT_ACT_TSYNC and SECCOMP_FILTER_TSYNC
  2014-05-27 19:10               ` Andy Lutomirski
@ 2014-05-27 19:23                 ` Kees Cook
  2014-05-27 19:27                   ` Andy Lutomirski
  0 siblings, 1 reply; 35+ messages in thread
From: Kees Cook @ 2014-05-27 19:23 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Oleg Nesterov, Alexander Viro, Peter Zijlstra,
	James Morris, Eric Paris, Juri Lelli, John Stultz,
	David S. Miller, Daniel Borkmann, Alex Thorlton, Rik van Riel,
	Daeseok Youn, David Rientjes, Eric W. Biederman, Dario Faggioli,
	Rashika Kheria, liguang, Geert Uytterhoeven, linux-doc, LSM List,
	Will Drewry, Julien Tinnes

On Tue, May 27, 2014 at 12:10 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Tue, May 27, 2014 at 11:45 AM, Kees Cook <keescook@chromium.org> wrote:
>> On Tue, May 27, 2014 at 11:40 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Tue, May 27, 2014 at 11:24 AM, Kees Cook <keescook@chromium.org> wrote:
>>>> On Mon, May 26, 2014 at 12:27 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>> On Fri, May 23, 2014 at 10:05 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>>> On Thu, May 22, 2014 at 4:11 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>>> On Thu, May 22, 2014 at 4:05 PM, Kees Cook <keescook@chromium.org> wrote:
>>>>>>>> Applying restrictive seccomp filter programs to large or diverse
>>>>>>>> codebases often requires handling threads which may be started early in
>>>>>>>> the process lifetime (e.g., by code that is linked in). While it is
>>>>>>>> possible to apply permissive programs prior to process start up, it is
>>>>>>>> difficult to further restrict the kernel ABI to those threads after that
>>>>>>>> point.
>>>>>>>>
>>>>>>>> This change adds a new seccomp extension action for synchronizing thread
>>>>>>>> group seccomp filters and a prctl() for accessing that functionality,
>>>>>>>> as well as a flag for SECCOMP_EXT_ACT_FILTER to perform sync at filter
>>>>>>>> installation time.
>>>>>>>>
>>>>>>>> When calling prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_FILTER,
>>>>>>>> flags, filter) with flags containing SECCOMP_FILTER_TSYNC, or when calling
>>>>>>>> prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_TSYNC, 0, 0), it
>>>>>>>> will attempt to synchronize all threads in current's threadgroup to its
>>>>>>>> seccomp filter program. This is possible iff all threads are using a filter
>>>>>>>> that is an ancestor to the filter current is attempting to synchronize to.
>>>>>>>> NULL filters (where the task is running as SECCOMP_MODE_NONE) are also
>>>>>>>> treated as ancestors allowing threads to be transitioned into
>>>>>>>> SECCOMP_MODE_FILTER. If prctrl(PR_SET_NO_NEW_PRIVS, ...) has been set on the
>>>>>>>> calling thread, no_new_privs will be set for all synchronized threads too.
>>>>>>>> On success, 0 is returned. On failure, the pid of one of the failing threads
>>>>>>>> will be returned, with as many filters installed as possible.
>>>>>>>
>>>>>>> Is there a use case for adding a filter and synchronizing filters
>>>>>>> being separate operations?  If not, I think this would be easier to
>>>>>>> understand and to use if there was just a single operation.
>>>>>>
>>>>>> Yes: if the other thread's lifetime is not well controlled, it's good
>>>>>> to be able to have a distinct interface to retry the thread sync that
>>>>>> doesn't require adding "no-op" filters.
>>>>>
>>>>> Wouldn't this still be solved by:
>>>>>
>>>>> seccomp_add_filter(final_filter, SECCOMP_FILTER_ALL_THREADS);
>>>>>
>>>>> the idea would be that, if seccomp_add_filter fails, then you give up
>>>>> and, if it succeeds, then you're done.  It shouldn't fail unless out
>>>>> of memory or you've nested too deeply.
>>>>
>>>> I wanted to keep the case of being able to to wait for non-ancestor
>>>> threads to finish. For example, 2 threads start and set separate
>>>> filters. 1 does work and exits, 2 starts another thread (3) which adds
>>>> filters, does work, and then waits for 1 to finish by calling TSYNC.
>>>> Once 1 dies, TSYNC succeeds. In the case of not having direct control
>>>> over thread lifetime (say, when using third-party libraries), I'd like
>>>> to retain the flexibility of being able to do TSYNC without needing a
>>>> filter being attached to it.
>>>
>>> I must admit this strikes me as odd.  What's the point of having a
>>> thread set a filter if it intends to be a short-lived thread?
>>
>> I was illustrating the potential insanity of third-party libraries.
>> There isn't much sense in that behavior, but if it exists, working
>> around it is harder without the separate TSYNC-only call.
>>
>>> In any case, I must have missed the ability for TSYNC to block.  Hmm.
>>> That seems complicated, albeit potentially useful.
>>
>> Oh, no, I didn't mean to imply TSYNC should block. I meant that thread
>> 3 could do:
>>
>> while (TSYNC-fails)
>>    wait-on-or-kill-unexpected-thread
>>
>
> Ok.
>
> I'm still not seeing the need for a separate TSYNC option, though --
> just add-a-filter-across-all-threads would work if it failed
> harmlessly on error.  FWIW, TSYNC is probably equivalent to adding an
> always-accept filter across all threads, although no one should really
> do the latter for efficiency reasons.

Given the complexity of the locking, "fail" means "I applied the
change to all threads except for at least this one: *error code*",
which means looping with the "add-a-filter" method means all the other
threads keep getting filters added until there is full success. I
don't want that overhead, so this keeps TSYNC distinctly separate.

Because of the filter addition, when using add_filter-TSYNC, it's not
sensible to continue after a failure. However, using just-TSYNC allows
sensible re-trying. Since the environments where TSYNC intend to be
used in can be very weird, I really want to retain the retry ability.

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 6/6] seccomp: add SECCOMP_EXT_ACT_TSYNC and SECCOMP_FILTER_TSYNC
  2014-05-27 19:23                 ` Kees Cook
@ 2014-05-27 19:27                   ` Andy Lutomirski
  2014-05-27 19:55                     ` Kees Cook
  0 siblings, 1 reply; 35+ messages in thread
From: Andy Lutomirski @ 2014-05-27 19:27 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, Oleg Nesterov, Alexander Viro, Peter Zijlstra,
	James Morris, Eric Paris, Juri Lelli, John Stultz,
	David S. Miller, Daniel Borkmann, Alex Thorlton, Rik van Riel,
	Daeseok Youn, David Rientjes, Eric W. Biederman, Dario Faggioli,
	Rashika Kheria, liguang, Geert Uytterhoeven, linux-doc, LSM List,
	Will Drewry, Julien Tinnes

On Tue, May 27, 2014 at 12:23 PM, Kees Cook <keescook@chromium.org> wrote:
> On Tue, May 27, 2014 at 12:10 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Tue, May 27, 2014 at 11:45 AM, Kees Cook <keescook@chromium.org> wrote:
>>> On Tue, May 27, 2014 at 11:40 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> On Tue, May 27, 2014 at 11:24 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>> On Mon, May 26, 2014 at 12:27 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>> On Fri, May 23, 2014 at 10:05 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>>>> On Thu, May 22, 2014 at 4:11 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>>>> On Thu, May 22, 2014 at 4:05 PM, Kees Cook <keescook@chromium.org> wrote:
>>>>>>>>> Applying restrictive seccomp filter programs to large or diverse
>>>>>>>>> codebases often requires handling threads which may be started early in
>>>>>>>>> the process lifetime (e.g., by code that is linked in). While it is
>>>>>>>>> possible to apply permissive programs prior to process start up, it is
>>>>>>>>> difficult to further restrict the kernel ABI to those threads after that
>>>>>>>>> point.
>>>>>>>>>
>>>>>>>>> This change adds a new seccomp extension action for synchronizing thread
>>>>>>>>> group seccomp filters and a prctl() for accessing that functionality,
>>>>>>>>> as well as a flag for SECCOMP_EXT_ACT_FILTER to perform sync at filter
>>>>>>>>> installation time.
>>>>>>>>>
>>>>>>>>> When calling prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_FILTER,
>>>>>>>>> flags, filter) with flags containing SECCOMP_FILTER_TSYNC, or when calling
>>>>>>>>> prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_TSYNC, 0, 0), it
>>>>>>>>> will attempt to synchronize all threads in current's threadgroup to its
>>>>>>>>> seccomp filter program. This is possible iff all threads are using a filter
>>>>>>>>> that is an ancestor to the filter current is attempting to synchronize to.
>>>>>>>>> NULL filters (where the task is running as SECCOMP_MODE_NONE) are also
>>>>>>>>> treated as ancestors allowing threads to be transitioned into
>>>>>>>>> SECCOMP_MODE_FILTER. If prctrl(PR_SET_NO_NEW_PRIVS, ...) has been set on the
>>>>>>>>> calling thread, no_new_privs will be set for all synchronized threads too.
>>>>>>>>> On success, 0 is returned. On failure, the pid of one of the failing threads
>>>>>>>>> will be returned, with as many filters installed as possible.
>>>>>>>>
>>>>>>>> Is there a use case for adding a filter and synchronizing filters
>>>>>>>> being separate operations?  If not, I think this would be easier to
>>>>>>>> understand and to use if there was just a single operation.
>>>>>>>
>>>>>>> Yes: if the other thread's lifetime is not well controlled, it's good
>>>>>>> to be able to have a distinct interface to retry the thread sync that
>>>>>>> doesn't require adding "no-op" filters.
>>>>>>
>>>>>> Wouldn't this still be solved by:
>>>>>>
>>>>>> seccomp_add_filter(final_filter, SECCOMP_FILTER_ALL_THREADS);
>>>>>>
>>>>>> the idea would be that, if seccomp_add_filter fails, then you give up
>>>>>> and, if it succeeds, then you're done.  It shouldn't fail unless out
>>>>>> of memory or you've nested too deeply.
>>>>>
>>>>> I wanted to keep the case of being able to to wait for non-ancestor
>>>>> threads to finish. For example, 2 threads start and set separate
>>>>> filters. 1 does work and exits, 2 starts another thread (3) which adds
>>>>> filters, does work, and then waits for 1 to finish by calling TSYNC.
>>>>> Once 1 dies, TSYNC succeeds. In the case of not having direct control
>>>>> over thread lifetime (say, when using third-party libraries), I'd like
>>>>> to retain the flexibility of being able to do TSYNC without needing a
>>>>> filter being attached to it.
>>>>
>>>> I must admit this strikes me as odd.  What's the point of having a
>>>> thread set a filter if it intends to be a short-lived thread?
>>>
>>> I was illustrating the potential insanity of third-party libraries.
>>> There isn't much sense in that behavior, but if it exists, working
>>> around it is harder without the separate TSYNC-only call.
>>>
>>>> In any case, I must have missed the ability for TSYNC to block.  Hmm.
>>>> That seems complicated, albeit potentially useful.
>>>
>>> Oh, no, I didn't mean to imply TSYNC should block. I meant that thread
>>> 3 could do:
>>>
>>> while (TSYNC-fails)
>>>    wait-on-or-kill-unexpected-thread
>>>
>>
>> Ok.
>>
>> I'm still not seeing the need for a separate TSYNC option, though --
>> just add-a-filter-across-all-threads would work if it failed
>> harmlessly on error.  FWIW, TSYNC is probably equivalent to adding an
>> always-accept filter across all threads, although no one should really
>> do the latter for efficiency reasons.
>
> Given the complexity of the locking, "fail" means "I applied the
> change to all threads except for at least this one: *error code*",
> which means looping with the "add-a-filter" method means all the other
> threads keep getting filters added until there is full success. I
> don't want that overhead, so this keeps TSYNC distinctly separate.

Ugh, right.

>
> Because of the filter addition, when using add_filter-TSYNC, it's not
> sensible to continue after a failure. However, using just-TSYNC allows
> sensible re-trying. Since the environments where TSYNC intend to be
> used in can be very weird, I really want to retain the retry ability.

OK.  So what's wrong with the other approach in which the
add-to-all-threads option always succeeds?  IOW, rather than requiring
that all threads have the caller's filter as an ancestor, just add the
requested filter to all threads.  As an optimization, if the targetted
thread has the current thread's filter as its filter, too, then the
targetted thread can stay synchronized.

That way the add filter call really is atomic.

I'm not fundamentally opposed to TSYNC, but I think I'd be happier if
the userspace interface could be kept as simple as possible.  The fact
that there's a filter hierarchy is sort of an implementation detail, I
think.

--Andy

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 6/6] seccomp: add SECCOMP_EXT_ACT_TSYNC and SECCOMP_FILTER_TSYNC
  2014-05-27 19:27                   ` Andy Lutomirski
@ 2014-05-27 19:55                     ` Kees Cook
  2014-06-02 20:53                       ` Andy Lutomirski
  0 siblings, 1 reply; 35+ messages in thread
From: Kees Cook @ 2014-05-27 19:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Oleg Nesterov, Alexander Viro, Peter Zijlstra,
	James Morris, Eric Paris, Juri Lelli, John Stultz,
	David S. Miller, Daniel Borkmann, Alex Thorlton, Rik van Riel,
	Daeseok Youn, David Rientjes, Eric W. Biederman, Dario Faggioli,
	Rashika Kheria, liguang, Geert Uytterhoeven, linux-doc, LSM List,
	Will Drewry, Julien Tinnes

On Tue, May 27, 2014 at 12:27 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Tue, May 27, 2014 at 12:23 PM, Kees Cook <keescook@chromium.org> wrote:
>> On Tue, May 27, 2014 at 12:10 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Tue, May 27, 2014 at 11:45 AM, Kees Cook <keescook@chromium.org> wrote:
>>>> On Tue, May 27, 2014 at 11:40 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>> On Tue, May 27, 2014 at 11:24 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>>> On Mon, May 26, 2014 at 12:27 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>>> On Fri, May 23, 2014 at 10:05 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>>>>> On Thu, May 22, 2014 at 4:11 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>>>>> On Thu, May 22, 2014 at 4:05 PM, Kees Cook <keescook@chromium.org> wrote:
>>>>>>>>>> Applying restrictive seccomp filter programs to large or diverse
>>>>>>>>>> codebases often requires handling threads which may be started early in
>>>>>>>>>> the process lifetime (e.g., by code that is linked in). While it is
>>>>>>>>>> possible to apply permissive programs prior to process start up, it is
>>>>>>>>>> difficult to further restrict the kernel ABI to those threads after that
>>>>>>>>>> point.
>>>>>>>>>>
>>>>>>>>>> This change adds a new seccomp extension action for synchronizing thread
>>>>>>>>>> group seccomp filters and a prctl() for accessing that functionality,
>>>>>>>>>> as well as a flag for SECCOMP_EXT_ACT_FILTER to perform sync at filter
>>>>>>>>>> installation time.
>>>>>>>>>>
>>>>>>>>>> When calling prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_FILTER,
>>>>>>>>>> flags, filter) with flags containing SECCOMP_FILTER_TSYNC, or when calling
>>>>>>>>>> prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_TSYNC, 0, 0), it
>>>>>>>>>> will attempt to synchronize all threads in current's threadgroup to its
>>>>>>>>>> seccomp filter program. This is possible iff all threads are using a filter
>>>>>>>>>> that is an ancestor to the filter current is attempting to synchronize to.
>>>>>>>>>> NULL filters (where the task is running as SECCOMP_MODE_NONE) are also
>>>>>>>>>> treated as ancestors allowing threads to be transitioned into
>>>>>>>>>> SECCOMP_MODE_FILTER. If prctrl(PR_SET_NO_NEW_PRIVS, ...) has been set on the
>>>>>>>>>> calling thread, no_new_privs will be set for all synchronized threads too.
>>>>>>>>>> On success, 0 is returned. On failure, the pid of one of the failing threads
>>>>>>>>>> will be returned, with as many filters installed as possible.
>>>>>>>>>
>>>>>>>>> Is there a use case for adding a filter and synchronizing filters
>>>>>>>>> being separate operations?  If not, I think this would be easier to
>>>>>>>>> understand and to use if there was just a single operation.
>>>>>>>>
>>>>>>>> Yes: if the other thread's lifetime is not well controlled, it's good
>>>>>>>> to be able to have a distinct interface to retry the thread sync that
>>>>>>>> doesn't require adding "no-op" filters.
>>>>>>>
>>>>>>> Wouldn't this still be solved by:
>>>>>>>
>>>>>>> seccomp_add_filter(final_filter, SECCOMP_FILTER_ALL_THREADS);
>>>>>>>
>>>>>>> the idea would be that, if seccomp_add_filter fails, then you give up
>>>>>>> and, if it succeeds, then you're done.  It shouldn't fail unless out
>>>>>>> of memory or you've nested too deeply.
>>>>>>
>>>>>> I wanted to keep the case of being able to to wait for non-ancestor
>>>>>> threads to finish. For example, 2 threads start and set separate
>>>>>> filters. 1 does work and exits, 2 starts another thread (3) which adds
>>>>>> filters, does work, and then waits for 1 to finish by calling TSYNC.
>>>>>> Once 1 dies, TSYNC succeeds. In the case of not having direct control
>>>>>> over thread lifetime (say, when using third-party libraries), I'd like
>>>>>> to retain the flexibility of being able to do TSYNC without needing a
>>>>>> filter being attached to it.
>>>>>
>>>>> I must admit this strikes me as odd.  What's the point of having a
>>>>> thread set a filter if it intends to be a short-lived thread?
>>>>
>>>> I was illustrating the potential insanity of third-party libraries.
>>>> There isn't much sense in that behavior, but if it exists, working
>>>> around it is harder without the separate TSYNC-only call.
>>>>
>>>>> In any case, I must have missed the ability for TSYNC to block.  Hmm.
>>>>> That seems complicated, albeit potentially useful.
>>>>
>>>> Oh, no, I didn't mean to imply TSYNC should block. I meant that thread
>>>> 3 could do:
>>>>
>>>> while (TSYNC-fails)
>>>>    wait-on-or-kill-unexpected-thread
>>>>
>>>
>>> Ok.
>>>
>>> I'm still not seeing the need for a separate TSYNC option, though --
>>> just add-a-filter-across-all-threads would work if it failed
>>> harmlessly on error.  FWIW, TSYNC is probably equivalent to adding an
>>> always-accept filter across all threads, although no one should really
>>> do the latter for efficiency reasons.
>>
>> Given the complexity of the locking, "fail" means "I applied the
>> change to all threads except for at least this one: *error code*",
>> which means looping with the "add-a-filter" method means all the other
>> threads keep getting filters added until there is full success. I
>> don't want that overhead, so this keeps TSYNC distinctly separate.
>
> Ugh, right.
>
>>
>> Because of the filter addition, when using add_filter-TSYNC, it's not
>> sensible to continue after a failure. However, using just-TSYNC allows
>> sensible re-trying. Since the environments where TSYNC intend to be
>> used in can be very weird, I really want to retain the retry ability.
>
> OK.  So what's wrong with the other approach in which the
> add-to-all-threads option always succeeds?  IOW, rather than requiring
> that all threads have the caller's filter as an ancestor, just add the
> requested filter to all threads.  As an optimization, if the targetted
> thread has the current thread's filter as its filter, too, then the
> targetted thread can stay synchronized.
>
> That way the add filter call really is atomic.
>
> I'm not fundamentally opposed to TSYNC, but I think I'd be happier if
> the userspace interface could be kept as simple as possible.  The fact
> that there's a filter hierarchy is sort of an implementation detail, I
> think.

I'm totally on board with making this as simple as possible. :) The
corner cases are kind of horrible, though, but I think this is already
as simple as it can get.

Externally, without the ancestry check, we can run the risk of have
unstable behavior out of a filter change. Imagine the case of a race
where a thread is adding a filter (via prctl), and the other thread
attempts to TSYNC a filter that blocks prctl.

In the "always take the new filter" case, sometimes we get two filters
(original and TSYNCed) on the first thread, and sometimes it blows up
when it calls prctl (TSYNCed filter blocks the prctl). There's no way
for the TSYNC caller to detect who won the race.

With the patch series as-is, losing the race results in a TSYNC
failure (ancestry doesn't match). This is immediately detectable and
the caller can the decide how to handle the situation directly.

Regardless, I don't think the filter hierarchy is an implementation
detail -- complex filters take advantage of the hierarchies. And
keeping this hierarchy stable means the filters are simpler to
validate for correctness, etc.

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 0/6] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC
  2014-05-22 23:05 [PATCH v5 0/6] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC Kees Cook
                   ` (5 preceding siblings ...)
  2014-05-22 23:05 ` [PATCH v5 6/6] seccomp: add SECCOMP_EXT_ACT_TSYNC and SECCOMP_FILTER_TSYNC Kees Cook
@ 2014-06-02 19:47 ` Kees Cook
  2014-06-02 19:59   ` Andy Lutomirski
  6 siblings, 1 reply; 35+ messages in thread
From: Kees Cook @ 2014-06-02 19:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andy Lutomirski, Oleg Nesterov, James Morris, Stephen Rothwell,
	David S. Miller, LKML, Will Drewry, Julien Tinnes,
	Alexei Starovoitov

Hi Andrew,

Would you be willing to carry this series? Andy Lutomirski appears
happy with it now. (Thanks again for all the feedback Andy!) If so, it
has a relatively small merge conflict with the bpf changes living in
net-next. Would you prefer I rebase against net-next, let sfr handle
it, get carried in net-next, or some other option?

Thanks!

-Kees


On Thu, May 22, 2014 at 4:05 PM, Kees Cook <keescook@chromium.org> wrote:
> This adds the ability for threads to request seccomp filter
> synchronization across their thread group (either at filter attach time
> or later). (For example, for Chrome to make sure graphic driver threads
> are fully confined after seccomp filters have been attached.)
>
> To support this, seccomp locking on writes is introduced, along with
> refactoring of no_new_privs. Races with thread creation are handled via
> the tasklist_list.
>
> I think all the concerns raised during the discussion[1] of the first
> version of this patch have been addressed. However, the races involved
> have tricked me before. :)
>
> Thanks!
>
> -Kees
>
> [1] https://lkml.org/lkml/2014/1/13/795
>
> v5:
>  - move includes around (drysdale)
>  - drop set_nnp return value (luto)
>  - use smp_load_acquire/store_release (luto)
>  - merge nnp changes to seccomp always, fewer ifdef (luto)
> v4:
>  - cleaned up locking further, as noticed by David Drysdale
> v3:
>  - added SECCOMP_EXT_ACT_FILTER for new filter install options
> v2:
>  - reworked to avoid clone races
>



-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 0/6] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC
  2014-06-02 19:47 ` [PATCH v5 0/6] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC Kees Cook
@ 2014-06-02 19:59   ` Andy Lutomirski
  2014-06-02 20:06     ` Kees Cook
  0 siblings, 1 reply; 35+ messages in thread
From: Andy Lutomirski @ 2014-06-02 19:59 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrew Morton, Oleg Nesterov, James Morris, Stephen Rothwell,
	David S. Miller, LKML, Will Drewry, Julien Tinnes,
	Alexei Starovoitov

On Mon, Jun 2, 2014 at 12:47 PM, Kees Cook <keescook@chromium.org> wrote:
> Hi Andrew,
>
> Would you be willing to carry this series? Andy Lutomirski appears
> happy with it now. (Thanks again for all the feedback Andy!) If so, it
> has a relatively small merge conflict with the bpf changes living in
> net-next. Would you prefer I rebase against net-next, let sfr handle
> it, get carried in net-next, or some other option?

Well, I'm still not entirely convinced that we want to have this much
multiplexing in a prctl, and I'm still a bit unconvinced that the code
wouldn't be better off it it were completely atomic in the sense that
it would either work or fail without doing anything.

--Andy

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 0/6] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC
  2014-06-02 19:59   ` Andy Lutomirski
@ 2014-06-02 20:06     ` Kees Cook
  2014-06-02 21:17       ` Andy Lutomirski
  0 siblings, 1 reply; 35+ messages in thread
From: Kees Cook @ 2014-06-02 20:06 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, Oleg Nesterov, James Morris, Stephen Rothwell,
	David S. Miller, LKML, Will Drewry, Julien Tinnes,
	Alexei Starovoitov

On Mon, Jun 2, 2014 at 12:59 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Mon, Jun 2, 2014 at 12:47 PM, Kees Cook <keescook@chromium.org> wrote:
>> Hi Andrew,
>>
>> Would you be willing to carry this series? Andy Lutomirski appears
>> happy with it now. (Thanks again for all the feedback Andy!) If so, it
>> has a relatively small merge conflict with the bpf changes living in
>> net-next. Would you prefer I rebase against net-next, let sfr handle
>> it, get carried in net-next, or some other option?
>
> Well, I'm still not entirely convinced that we want to have this much
> multiplexing in a prctl, and I'm still a bit unconvinced that the code

I don't want to get caught without interface argument flexibility
again, so that's why the prctl interface is being set up that way.

> wouldn't be better off it it were completely atomic in the sense that
> it would either work or fail without doing anything.

Getting perfect atomic operation looks extremely hard given task
locking. If this could get fixed in the future, it would have no
impact on the interface. At present, the corner case of the racing
thread is small enough that just catching the race failure is
sufficient. If task locking is improved in the future, it could just
simply never lose a race. Userspace still needs to handle errors no
matter what is the non-race failure condition (mode 1 or forked
filter) still exists.

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 6/6] seccomp: add SECCOMP_EXT_ACT_TSYNC and SECCOMP_FILTER_TSYNC
  2014-05-27 19:55                     ` Kees Cook
@ 2014-06-02 20:53                       ` Andy Lutomirski
  2014-06-03  0:14                         ` Kees Cook
  0 siblings, 1 reply; 35+ messages in thread
From: Andy Lutomirski @ 2014-06-02 20:53 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, Oleg Nesterov, Alexander Viro, Peter Zijlstra,
	James Morris, Eric Paris, Juri Lelli, John Stultz,
	David S. Miller, Daniel Borkmann, Alex Thorlton, Rik van Riel,
	Daeseok Youn, David Rientjes, Eric W. Biederman, Dario Faggioli,
	Rashika Kheria, liguang, Geert Uytterhoeven, linux-doc, LSM List,
	Will Drewry, Julien Tinnes

On Tue, May 27, 2014 at 12:55 PM, Kees Cook <keescook@chromium.org> wrote:
> On Tue, May 27, 2014 at 12:27 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Tue, May 27, 2014 at 12:23 PM, Kees Cook <keescook@chromium.org> wrote:
>>> On Tue, May 27, 2014 at 12:10 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> On Tue, May 27, 2014 at 11:45 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>> On Tue, May 27, 2014 at 11:40 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>> On Tue, May 27, 2014 at 11:24 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>>>> On Mon, May 26, 2014 at 12:27 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>>>> On Fri, May 23, 2014 at 10:05 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>>>>>> On Thu, May 22, 2014 at 4:11 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>>>>>> On Thu, May 22, 2014 at 4:05 PM, Kees Cook <keescook@chromium.org> wrote:
>>>>>>>>>>> Applying restrictive seccomp filter programs to large or diverse
>>>>>>>>>>> codebases often requires handling threads which may be started early in
>>>>>>>>>>> the process lifetime (e.g., by code that is linked in). While it is
>>>>>>>>>>> possible to apply permissive programs prior to process start up, it is
>>>>>>>>>>> difficult to further restrict the kernel ABI to those threads after that
>>>>>>>>>>> point.
>>>>>>>>>>>
>>>>>>>>>>> This change adds a new seccomp extension action for synchronizing thread
>>>>>>>>>>> group seccomp filters and a prctl() for accessing that functionality,
>>>>>>>>>>> as well as a flag for SECCOMP_EXT_ACT_FILTER to perform sync at filter
>>>>>>>>>>> installation time.
>>>>>>>>>>>
>>>>>>>>>>> When calling prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_FILTER,
>>>>>>>>>>> flags, filter) with flags containing SECCOMP_FILTER_TSYNC, or when calling
>>>>>>>>>>> prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_TSYNC, 0, 0), it
>>>>>>>>>>> will attempt to synchronize all threads in current's threadgroup to its
>>>>>>>>>>> seccomp filter program. This is possible iff all threads are using a filter
>>>>>>>>>>> that is an ancestor to the filter current is attempting to synchronize to.
>>>>>>>>>>> NULL filters (where the task is running as SECCOMP_MODE_NONE) are also
>>>>>>>>>>> treated as ancestors allowing threads to be transitioned into
>>>>>>>>>>> SECCOMP_MODE_FILTER. If prctrl(PR_SET_NO_NEW_PRIVS, ...) has been set on the
>>>>>>>>>>> calling thread, no_new_privs will be set for all synchronized threads too.
>>>>>>>>>>> On success, 0 is returned. On failure, the pid of one of the failing threads
>>>>>>>>>>> will be returned, with as many filters installed as possible.
>>>>>>>>>>
>>>>>>>>>> Is there a use case for adding a filter and synchronizing filters
>>>>>>>>>> being separate operations?  If not, I think this would be easier to
>>>>>>>>>> understand and to use if there was just a single operation.
>>>>>>>>>
>>>>>>>>> Yes: if the other thread's lifetime is not well controlled, it's good
>>>>>>>>> to be able to have a distinct interface to retry the thread sync that
>>>>>>>>> doesn't require adding "no-op" filters.
>>>>>>>>
>>>>>>>> Wouldn't this still be solved by:
>>>>>>>>
>>>>>>>> seccomp_add_filter(final_filter, SECCOMP_FILTER_ALL_THREADS);
>>>>>>>>
>>>>>>>> the idea would be that, if seccomp_add_filter fails, then you give up
>>>>>>>> and, if it succeeds, then you're done.  It shouldn't fail unless out
>>>>>>>> of memory or you've nested too deeply.
>>>>>>>
>>>>>>> I wanted to keep the case of being able to to wait for non-ancestor
>>>>>>> threads to finish. For example, 2 threads start and set separate
>>>>>>> filters. 1 does work and exits, 2 starts another thread (3) which adds
>>>>>>> filters, does work, and then waits for 1 to finish by calling TSYNC.
>>>>>>> Once 1 dies, TSYNC succeeds. In the case of not having direct control
>>>>>>> over thread lifetime (say, when using third-party libraries), I'd like
>>>>>>> to retain the flexibility of being able to do TSYNC without needing a
>>>>>>> filter being attached to it.
>>>>>>
>>>>>> I must admit this strikes me as odd.  What's the point of having a
>>>>>> thread set a filter if it intends to be a short-lived thread?
>>>>>
>>>>> I was illustrating the potential insanity of third-party libraries.
>>>>> There isn't much sense in that behavior, but if it exists, working
>>>>> around it is harder without the separate TSYNC-only call.
>>>>>
>>>>>> In any case, I must have missed the ability for TSYNC to block.  Hmm.
>>>>>> That seems complicated, albeit potentially useful.
>>>>>
>>>>> Oh, no, I didn't mean to imply TSYNC should block. I meant that thread
>>>>> 3 could do:
>>>>>
>>>>> while (TSYNC-fails)
>>>>>    wait-on-or-kill-unexpected-thread
>>>>>
>>>>
>>>> Ok.
>>>>
>>>> I'm still not seeing the need for a separate TSYNC option, though --
>>>> just add-a-filter-across-all-threads would work if it failed
>>>> harmlessly on error.  FWIW, TSYNC is probably equivalent to adding an
>>>> always-accept filter across all threads, although no one should really
>>>> do the latter for efficiency reasons.
>>>
>>> Given the complexity of the locking, "fail" means "I applied the
>>> change to all threads except for at least this one: *error code*",
>>> which means looping with the "add-a-filter" method means all the other
>>> threads keep getting filters added until there is full success. I
>>> don't want that overhead, so this keeps TSYNC distinctly separate.
>>
>> Ugh, right.
>>
>>>
>>> Because of the filter addition, when using add_filter-TSYNC, it's not
>>> sensible to continue after a failure. However, using just-TSYNC allows
>>> sensible re-trying. Since the environments where TSYNC intend to be
>>> used in can be very weird, I really want to retain the retry ability.
>>
>> OK.  So what's wrong with the other approach in which the
>> add-to-all-threads option always succeeds?  IOW, rather than requiring
>> that all threads have the caller's filter as an ancestor, just add the
>> requested filter to all threads.  As an optimization, if the targetted
>> thread has the current thread's filter as its filter, too, then the
>> targetted thread can stay synchronized.
>>
>> That way the add filter call really is atomic.
>>
>> I'm not fundamentally opposed to TSYNC, but I think I'd be happier if
>> the userspace interface could be kept as simple as possible.  The fact
>> that there's a filter hierarchy is sort of an implementation detail, I
>> think.
>
> I'm totally on board with making this as simple as possible. :) The
> corner cases are kind of horrible, though, but I think this is already
> as simple as it can get.
>
> Externally, without the ancestry check, we can run the risk of have
> unstable behavior out of a filter change. Imagine the case of a race
> where a thread is adding a filter (via prctl), and the other thread
> attempts to TSYNC a filter that blocks prctl.
>
> In the "always take the new filter" case, sometimes we get two filters
> (original and TSYNCed) on the first thread, and sometimes it blows up
> when it calls prctl (TSYNCed filter blocks the prctl). There's no way
> for the TSYNC caller to detect who won the race.
>
> With the patch series as-is, losing the race results in a TSYNC
> failure (ancestry doesn't match). This is immediately detectable and
> the caller can the decide how to handle the situation directly.
>
> Regardless, I don't think the filter hierarchy is an implementation
> detail -- complex filters take advantage of the hierarchies. And
> keeping this hierarchy stable means the filters are simpler to
> validate for correctness, etc.

Hmm.

Another issue: unless I'm not looking at the right version of this, I
don't think that while_each_thread does what you think it does.

I'm generally in favor of making the kernel interfaces easy to use,
even at the cost of a bit of implementation complexity.  Would this be
simpler if you hijacked siglock to protect seccomp state?  Then I
think you could just make everything atomic.

--Andy

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 0/6] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC
  2014-06-02 20:06     ` Kees Cook
@ 2014-06-02 21:17       ` Andy Lutomirski
  2014-06-02 23:05         ` Kees Cook
  0 siblings, 1 reply; 35+ messages in thread
From: Andy Lutomirski @ 2014-06-02 21:17 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrew Morton, Oleg Nesterov, James Morris, Stephen Rothwell,
	David S. Miller, LKML, Will Drewry, Julien Tinnes,
	Alexei Starovoitov

On Mon, Jun 2, 2014 at 1:06 PM, Kees Cook <keescook@chromium.org> wrote:
> On Mon, Jun 2, 2014 at 12:59 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Mon, Jun 2, 2014 at 12:47 PM, Kees Cook <keescook@chromium.org> wrote:
>>> Hi Andrew,
>>>
>>> Would you be willing to carry this series? Andy Lutomirski appears
>>> happy with it now. (Thanks again for all the feedback Andy!) If so, it
>>> has a relatively small merge conflict with the bpf changes living in
>>> net-next. Would you prefer I rebase against net-next, let sfr handle
>>> it, get carried in net-next, or some other option?
>>
>> Well, I'm still not entirely convinced that we want to have this much
>> multiplexing in a prctl, and I'm still a bit unconvinced that the code
>
> I don't want to get caught without interface argument flexibility
> again, so that's why the prctl interface is being set up that way.

I was thinking that a syscall might be a lot prettier.  It may pay to
cc linux-api, too.

I'll offer you a deal: if you try to come up with a nice, clean
syscall, I'll try to write a fast(er) path for x86_64 to reduce
overhead.  I bet I can save 90-100ns per syscall. :)

>
>> wouldn't be better off it it were completely atomic in the sense that
>> it would either work or fail without doing anything.
>
> Getting perfect atomic operation looks extremely hard given task
> locking. If this could get fixed in the future, it would have no
> impact on the interface. At present, the corner case of the racing
> thread is small enough that just catching the race failure is
> sufficient. If task locking is improved in the future, it could just
> simply never lose a race. Userspace still needs to handle errors no
> matter what is the non-race failure condition (mode 1 or forked
> filter) still exists.
>

I think it's doable -- I just replied to the other thread.

> -Kees
>
> --
> Kees Cook
> Chrome OS Security



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 0/6] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC
  2014-06-02 21:17       ` Andy Lutomirski
@ 2014-06-02 23:05         ` Kees Cook
  2014-06-02 23:08           ` Andy Lutomirski
  0 siblings, 1 reply; 35+ messages in thread
From: Kees Cook @ 2014-06-02 23:05 UTC (permalink / raw)
  To: Andy Lutomirski, linux-api, Andrew Morton
  Cc: Oleg Nesterov, James Morris, Stephen Rothwell, David S. Miller,
	LKML, Will Drewry, Julien Tinnes, Alexei Starovoitov

On Mon, Jun 2, 2014 at 2:17 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Mon, Jun 2, 2014 at 1:06 PM, Kees Cook <keescook@chromium.org> wrote:
>> On Mon, Jun 2, 2014 at 12:59 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Mon, Jun 2, 2014 at 12:47 PM, Kees Cook <keescook@chromium.org> wrote:
>>>> Hi Andrew,
>>>>
>>>> Would you be willing to carry this series? Andy Lutomirski appears
>>>> happy with it now. (Thanks again for all the feedback Andy!) If so, it
>>>> has a relatively small merge conflict with the bpf changes living in
>>>> net-next. Would you prefer I rebase against net-next, let sfr handle
>>>> it, get carried in net-next, or some other option?
>>>
>>> Well, I'm still not entirely convinced that we want to have this much
>>> multiplexing in a prctl, and I'm still a bit unconvinced that the code
>>
>> I don't want to get caught without interface argument flexibility
>> again, so that's why the prctl interface is being set up that way.
>
> I was thinking that a syscall might be a lot prettier.  It may pay to
> cc linux-api, too.
>
> I'll offer you a deal: if you try to come up with a nice, clean
> syscall, I'll try to write a fast(er) path for x86_64 to reduce
> overhead.  I bet I can save 90-100ns per syscall. :)

Now added to the Cc.

Which path do you mean to improve? Neither the prctl nor a syscall for
this would need to be fast at all.

I don't want to go in circles on this. I've been there before on my
VFS link hardening series, and my module restriction series. I would
like consensus from more than just one person. :)

I'd like to hear from other folks on this (akpm?). My instinct is to
continue using prctl since that is already where mediation for seccomp
happens. I don't see why prctl vs a new syscall makes a difference
here, frankly.

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 0/6] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC
  2014-06-02 23:05         ` Kees Cook
@ 2014-06-02 23:08           ` Andy Lutomirski
  2014-06-03 10:12             ` Michael Kerrisk
  0 siblings, 1 reply; 35+ messages in thread
From: Andy Lutomirski @ 2014-06-02 23:08 UTC (permalink / raw)
  To: Kees Cook
  Cc: Linux API, Andrew Morton, Oleg Nesterov, James Morris,
	Stephen Rothwell, David S. Miller, LKML, Will Drewry,
	Julien Tinnes, Alexei Starovoitov

On Mon, Jun 2, 2014 at 4:05 PM, Kees Cook <keescook@chromium.org> wrote:
> On Mon, Jun 2, 2014 at 2:17 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Mon, Jun 2, 2014 at 1:06 PM, Kees Cook <keescook@chromium.org> wrote:
>>> On Mon, Jun 2, 2014 at 12:59 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> On Mon, Jun 2, 2014 at 12:47 PM, Kees Cook <keescook@chromium.org> wrote:
>>>>> Hi Andrew,
>>>>>
>>>>> Would you be willing to carry this series? Andy Lutomirski appears
>>>>> happy with it now. (Thanks again for all the feedback Andy!) If so, it
>>>>> has a relatively small merge conflict with the bpf changes living in
>>>>> net-next. Would you prefer I rebase against net-next, let sfr handle
>>>>> it, get carried in net-next, or some other option?
>>>>
>>>> Well, I'm still not entirely convinced that we want to have this much
>>>> multiplexing in a prctl, and I'm still a bit unconvinced that the code
>>>
>>> I don't want to get caught without interface argument flexibility
>>> again, so that's why the prctl interface is being set up that way.
>>
>> I was thinking that a syscall might be a lot prettier.  It may pay to
>> cc linux-api, too.
>>
>> I'll offer you a deal: if you try to come up with a nice, clean
>> syscall, I'll try to write a fast(er) path for x86_64 to reduce
>> overhead.  I bet I can save 90-100ns per syscall. :)
>
> Now added to the Cc.
>
> Which path do you mean to improve? Neither the prctl nor a syscall for
> this would need to be fast at all.

Non-seccomp-related syscalls when seccomp is enabled.

>
> I don't want to go in circles on this. I've been there before on my
> VFS link hardening series, and my module restriction series. I would
> like consensus from more than just one person. :)

I can't offer you anyone else's review, unfortunately :-/

>
> I'd like to hear from other folks on this (akpm?). My instinct is to
> continue using prctl since that is already where mediation for seccomp
> happens. I don't see why prctl vs a new syscall makes a difference
> here, frankly.

Aesthetics?  There's a tendency for people to get annoyed at big
multiplexed APIs, and your patches will be doubly multiplexed.

TBH, I care more about the atomicity thing than about the actual form
of the API.

--Andy

>
> -Kees
>
> --
> Kees Cook
> Chrome OS Security



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 6/6] seccomp: add SECCOMP_EXT_ACT_TSYNC and SECCOMP_FILTER_TSYNC
  2014-06-02 20:53                       ` Andy Lutomirski
@ 2014-06-03  0:14                         ` Kees Cook
  2014-06-03  0:29                           ` Andy Lutomirski
  0 siblings, 1 reply; 35+ messages in thread
From: Kees Cook @ 2014-06-03  0:14 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Oleg Nesterov, Alexander Viro, Peter Zijlstra,
	James Morris, Eric Paris, Juri Lelli, John Stultz,
	David S. Miller, Daniel Borkmann, Alex Thorlton, Rik van Riel,
	Daeseok Youn, David Rientjes, Eric W. Biederman, Dario Faggioli,
	Rashika Kheria, liguang, Geert Uytterhoeven, linux-doc, LSM List,
	Will Drewry, Julien Tinnes

On Mon, Jun 2, 2014 at 1:53 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Tue, May 27, 2014 at 12:55 PM, Kees Cook <keescook@chromium.org> wrote:
>> On Tue, May 27, 2014 at 12:27 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Tue, May 27, 2014 at 12:23 PM, Kees Cook <keescook@chromium.org> wrote:
>>>> On Tue, May 27, 2014 at 12:10 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>> On Tue, May 27, 2014 at 11:45 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>>> On Tue, May 27, 2014 at 11:40 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>>> On Tue, May 27, 2014 at 11:24 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>>>>> On Mon, May 26, 2014 at 12:27 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>>>>> On Fri, May 23, 2014 at 10:05 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>>>>>>> On Thu, May 22, 2014 at 4:11 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>>>>>>> On Thu, May 22, 2014 at 4:05 PM, Kees Cook <keescook@chromium.org> wrote:
>>>>>>>>>>>> Applying restrictive seccomp filter programs to large or diverse
>>>>>>>>>>>> codebases often requires handling threads which may be started early in
>>>>>>>>>>>> the process lifetime (e.g., by code that is linked in). While it is
>>>>>>>>>>>> possible to apply permissive programs prior to process start up, it is
>>>>>>>>>>>> difficult to further restrict the kernel ABI to those threads after that
>>>>>>>>>>>> point.
>>>>>>>>>>>>
>>>>>>>>>>>> This change adds a new seccomp extension action for synchronizing thread
>>>>>>>>>>>> group seccomp filters and a prctl() for accessing that functionality,
>>>>>>>>>>>> as well as a flag for SECCOMP_EXT_ACT_FILTER to perform sync at filter
>>>>>>>>>>>> installation time.
>>>>>>>>>>>>
>>>>>>>>>>>> When calling prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_FILTER,
>>>>>>>>>>>> flags, filter) with flags containing SECCOMP_FILTER_TSYNC, or when calling
>>>>>>>>>>>> prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_TSYNC, 0, 0), it
>>>>>>>>>>>> will attempt to synchronize all threads in current's threadgroup to its
>>>>>>>>>>>> seccomp filter program. This is possible iff all threads are using a filter
>>>>>>>>>>>> that is an ancestor to the filter current is attempting to synchronize to.
>>>>>>>>>>>> NULL filters (where the task is running as SECCOMP_MODE_NONE) are also
>>>>>>>>>>>> treated as ancestors allowing threads to be transitioned into
>>>>>>>>>>>> SECCOMP_MODE_FILTER. If prctrl(PR_SET_NO_NEW_PRIVS, ...) has been set on the
>>>>>>>>>>>> calling thread, no_new_privs will be set for all synchronized threads too.
>>>>>>>>>>>> On success, 0 is returned. On failure, the pid of one of the failing threads
>>>>>>>>>>>> will be returned, with as many filters installed as possible.
>>>>>>>>>>>
>>>>>>>>>>> Is there a use case for adding a filter and synchronizing filters
>>>>>>>>>>> being separate operations?  If not, I think this would be easier to
>>>>>>>>>>> understand and to use if there was just a single operation.
>>>>>>>>>>
>>>>>>>>>> Yes: if the other thread's lifetime is not well controlled, it's good
>>>>>>>>>> to be able to have a distinct interface to retry the thread sync that
>>>>>>>>>> doesn't require adding "no-op" filters.
>>>>>>>>>
>>>>>>>>> Wouldn't this still be solved by:
>>>>>>>>>
>>>>>>>>> seccomp_add_filter(final_filter, SECCOMP_FILTER_ALL_THREADS);
>>>>>>>>>
>>>>>>>>> the idea would be that, if seccomp_add_filter fails, then you give up
>>>>>>>>> and, if it succeeds, then you're done.  It shouldn't fail unless out
>>>>>>>>> of memory or you've nested too deeply.
>>>>>>>>
>>>>>>>> I wanted to keep the case of being able to to wait for non-ancestor
>>>>>>>> threads to finish. For example, 2 threads start and set separate
>>>>>>>> filters. 1 does work and exits, 2 starts another thread (3) which adds
>>>>>>>> filters, does work, and then waits for 1 to finish by calling TSYNC.
>>>>>>>> Once 1 dies, TSYNC succeeds. In the case of not having direct control
>>>>>>>> over thread lifetime (say, when using third-party libraries), I'd like
>>>>>>>> to retain the flexibility of being able to do TSYNC without needing a
>>>>>>>> filter being attached to it.
>>>>>>>
>>>>>>> I must admit this strikes me as odd.  What's the point of having a
>>>>>>> thread set a filter if it intends to be a short-lived thread?
>>>>>>
>>>>>> I was illustrating the potential insanity of third-party libraries.
>>>>>> There isn't much sense in that behavior, but if it exists, working
>>>>>> around it is harder without the separate TSYNC-only call.
>>>>>>
>>>>>>> In any case, I must have missed the ability for TSYNC to block.  Hmm.
>>>>>>> That seems complicated, albeit potentially useful.
>>>>>>
>>>>>> Oh, no, I didn't mean to imply TSYNC should block. I meant that thread
>>>>>> 3 could do:
>>>>>>
>>>>>> while (TSYNC-fails)
>>>>>>    wait-on-or-kill-unexpected-thread
>>>>>>
>>>>>
>>>>> Ok.
>>>>>
>>>>> I'm still not seeing the need for a separate TSYNC option, though --
>>>>> just add-a-filter-across-all-threads would work if it failed
>>>>> harmlessly on error.  FWIW, TSYNC is probably equivalent to adding an
>>>>> always-accept filter across all threads, although no one should really
>>>>> do the latter for efficiency reasons.
>>>>
>>>> Given the complexity of the locking, "fail" means "I applied the
>>>> change to all threads except for at least this one: *error code*",
>>>> which means looping with the "add-a-filter" method means all the other
>>>> threads keep getting filters added until there is full success. I
>>>> don't want that overhead, so this keeps TSYNC distinctly separate.
>>>
>>> Ugh, right.
>>>
>>>>
>>>> Because of the filter addition, when using add_filter-TSYNC, it's not
>>>> sensible to continue after a failure. However, using just-TSYNC allows
>>>> sensible re-trying. Since the environments where TSYNC intend to be
>>>> used in can be very weird, I really want to retain the retry ability.
>>>
>>> OK.  So what's wrong with the other approach in which the
>>> add-to-all-threads option always succeeds?  IOW, rather than requiring
>>> that all threads have the caller's filter as an ancestor, just add the
>>> requested filter to all threads.  As an optimization, if the targetted
>>> thread has the current thread's filter as its filter, too, then the
>>> targetted thread can stay synchronized.
>>>
>>> That way the add filter call really is atomic.
>>>
>>> I'm not fundamentally opposed to TSYNC, but I think I'd be happier if
>>> the userspace interface could be kept as simple as possible.  The fact
>>> that there's a filter hierarchy is sort of an implementation detail, I
>>> think.
>>
>> I'm totally on board with making this as simple as possible. :) The
>> corner cases are kind of horrible, though, but I think this is already
>> as simple as it can get.
>>
>> Externally, without the ancestry check, we can run the risk of have
>> unstable behavior out of a filter change. Imagine the case of a race
>> where a thread is adding a filter (via prctl), and the other thread
>> attempts to TSYNC a filter that blocks prctl.
>>
>> In the "always take the new filter" case, sometimes we get two filters
>> (original and TSYNCed) on the first thread, and sometimes it blows up
>> when it calls prctl (TSYNCed filter blocks the prctl). There's no way
>> for the TSYNC caller to detect who won the race.
>>
>> With the patch series as-is, losing the race results in a TSYNC
>> failure (ancestry doesn't match). This is immediately detectable and
>> the caller can the decide how to handle the situation directly.
>>
>> Regardless, I don't think the filter hierarchy is an implementation
>> detail -- complex filters take advantage of the hierarchies. And
>> keeping this hierarchy stable means the filters are simpler to
>> validate for correctness, etc.
>
> Hmm.
>
> Another issue: unless I'm not looking at the right version of this, I
> don't think that while_each_thread does what you think it does.

I believe it to walk current's thread_group list. It starts at current
and walks until it sees current again. While holding task_list for
writing, no new threads can appear on the list. Any cloned threads not
yet in the thread_group list will be added once task_lock is held by
the cloner (kernel/fork.c after "init_task_pid(p, PIDTYPE_PID,
pid);").

> I'm generally in favor of making the kernel interfaces easy to use,
> even at the cost of a bit of implementation complexity.  Would this be
> simpler if you hijacked siglock to protect seccomp state?  Then I
> think you could just make everything atomic.

The sighand structure isn't unique to the task nor the thread group
(CLONE_SIGHAND will leave the same sighand attached to a new thread),
so it can't be used sanely here AIUI. The threadgroup_lock stuff could
be used here, but I felt like it was too heavy a hammer for
per-add-filter actions.

The atomicity you're after is just avoiding the "I failed but some
threads may have changed" side-effect, IIUC. As in, you want "I failed
and no threads have changed". For that, I think we need to use
threadgroup_lock. However, I remain unconvinced that this is the right
way to go since the failure corner case needs to be handled by the
tsync caller in one of two ways regardless of either perfect "failed,
no threads changed" or existing series "failed, some threads changed":
retry until success or kill entire process.

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 6/6] seccomp: add SECCOMP_EXT_ACT_TSYNC and SECCOMP_FILTER_TSYNC
  2014-06-03  0:14                         ` Kees Cook
@ 2014-06-03  0:29                           ` Andy Lutomirski
  2014-06-03  1:09                             ` Kees Cook
  0 siblings, 1 reply; 35+ messages in thread
From: Andy Lutomirski @ 2014-06-03  0:29 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, Oleg Nesterov, Alexander Viro, Peter Zijlstra,
	James Morris, Eric Paris, Juri Lelli, John Stultz,
	David S. Miller, Daniel Borkmann, Alex Thorlton, Rik van Riel,
	Daeseok Youn, David Rientjes, Eric W. Biederman, Dario Faggioli,
	Rashika Kheria, liguang, Geert Uytterhoeven, linux-doc, LSM List,
	Will Drewry, Julien Tinnes

On Mon, Jun 2, 2014 at 5:14 PM, Kees Cook <keescook@chromium.org> wrote:
> On Mon, Jun 2, 2014 at 1:53 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Tue, May 27, 2014 at 12:55 PM, Kees Cook <keescook@chromium.org> wrote:
>>> On Tue, May 27, 2014 at 12:27 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> On Tue, May 27, 2014 at 12:23 PM, Kees Cook <keescook@chromium.org> wrote:
>>>>> On Tue, May 27, 2014 at 12:10 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>> On Tue, May 27, 2014 at 11:45 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>>>> On Tue, May 27, 2014 at 11:40 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>>>> On Tue, May 27, 2014 at 11:24 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>>>>>> On Mon, May 26, 2014 at 12:27 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>>>>>> On Fri, May 23, 2014 at 10:05 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>>>>>>>> On Thu, May 22, 2014 at 4:11 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>>>>>>>> On Thu, May 22, 2014 at 4:05 PM, Kees Cook <keescook@chromium.org> wrote:
>>>>>>>>>>>>> Applying restrictive seccomp filter programs to large or diverse
>>>>>>>>>>>>> codebases often requires handling threads which may be started early in
>>>>>>>>>>>>> the process lifetime (e.g., by code that is linked in). While it is
>>>>>>>>>>>>> possible to apply permissive programs prior to process start up, it is
>>>>>>>>>>>>> difficult to further restrict the kernel ABI to those threads after that
>>>>>>>>>>>>> point.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This change adds a new seccomp extension action for synchronizing thread
>>>>>>>>>>>>> group seccomp filters and a prctl() for accessing that functionality,
>>>>>>>>>>>>> as well as a flag for SECCOMP_EXT_ACT_FILTER to perform sync at filter
>>>>>>>>>>>>> installation time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> When calling prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_FILTER,
>>>>>>>>>>>>> flags, filter) with flags containing SECCOMP_FILTER_TSYNC, or when calling
>>>>>>>>>>>>> prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_TSYNC, 0, 0), it
>>>>>>>>>>>>> will attempt to synchronize all threads in current's threadgroup to its
>>>>>>>>>>>>> seccomp filter program. This is possible iff all threads are using a filter
>>>>>>>>>>>>> that is an ancestor to the filter current is attempting to synchronize to.
>>>>>>>>>>>>> NULL filters (where the task is running as SECCOMP_MODE_NONE) are also
>>>>>>>>>>>>> treated as ancestors allowing threads to be transitioned into
>>>>>>>>>>>>> SECCOMP_MODE_FILTER. If prctrl(PR_SET_NO_NEW_PRIVS, ...) has been set on the
>>>>>>>>>>>>> calling thread, no_new_privs will be set for all synchronized threads too.
>>>>>>>>>>>>> On success, 0 is returned. On failure, the pid of one of the failing threads
>>>>>>>>>>>>> will be returned, with as many filters installed as possible.
>>>>>>>>>>>>
>>>>>>>>>>>> Is there a use case for adding a filter and synchronizing filters
>>>>>>>>>>>> being separate operations?  If not, I think this would be easier to
>>>>>>>>>>>> understand and to use if there was just a single operation.
>>>>>>>>>>>
>>>>>>>>>>> Yes: if the other thread's lifetime is not well controlled, it's good
>>>>>>>>>>> to be able to have a distinct interface to retry the thread sync that
>>>>>>>>>>> doesn't require adding "no-op" filters.
>>>>>>>>>>
>>>>>>>>>> Wouldn't this still be solved by:
>>>>>>>>>>
>>>>>>>>>> seccomp_add_filter(final_filter, SECCOMP_FILTER_ALL_THREADS);
>>>>>>>>>>
>>>>>>>>>> the idea would be that, if seccomp_add_filter fails, then you give up
>>>>>>>>>> and, if it succeeds, then you're done.  It shouldn't fail unless out
>>>>>>>>>> of memory or you've nested too deeply.
>>>>>>>>>
>>>>>>>>> I wanted to keep the case of being able to to wait for non-ancestor
>>>>>>>>> threads to finish. For example, 2 threads start and set separate
>>>>>>>>> filters. 1 does work and exits, 2 starts another thread (3) which adds
>>>>>>>>> filters, does work, and then waits for 1 to finish by calling TSYNC.
>>>>>>>>> Once 1 dies, TSYNC succeeds. In the case of not having direct control
>>>>>>>>> over thread lifetime (say, when using third-party libraries), I'd like
>>>>>>>>> to retain the flexibility of being able to do TSYNC without needing a
>>>>>>>>> filter being attached to it.
>>>>>>>>
>>>>>>>> I must admit this strikes me as odd.  What's the point of having a
>>>>>>>> thread set a filter if it intends to be a short-lived thread?
>>>>>>>
>>>>>>> I was illustrating the potential insanity of third-party libraries.
>>>>>>> There isn't much sense in that behavior, but if it exists, working
>>>>>>> around it is harder without the separate TSYNC-only call.
>>>>>>>
>>>>>>>> In any case, I must have missed the ability for TSYNC to block.  Hmm.
>>>>>>>> That seems complicated, albeit potentially useful.
>>>>>>>
>>>>>>> Oh, no, I didn't mean to imply TSYNC should block. I meant that thread
>>>>>>> 3 could do:
>>>>>>>
>>>>>>> while (TSYNC-fails)
>>>>>>>    wait-on-or-kill-unexpected-thread
>>>>>>>
>>>>>>
>>>>>> Ok.
>>>>>>
>>>>>> I'm still not seeing the need for a separate TSYNC option, though --
>>>>>> just add-a-filter-across-all-threads would work if it failed
>>>>>> harmlessly on error.  FWIW, TSYNC is probably equivalent to adding an
>>>>>> always-accept filter across all threads, although no one should really
>>>>>> do the latter for efficiency reasons.
>>>>>
>>>>> Given the complexity of the locking, "fail" means "I applied the
>>>>> change to all threads except for at least this one: *error code*",
>>>>> which means looping with the "add-a-filter" method means all the other
>>>>> threads keep getting filters added until there is full success. I
>>>>> don't want that overhead, so this keeps TSYNC distinctly separate.
>>>>
>>>> Ugh, right.
>>>>
>>>>>
>>>>> Because of the filter addition, when using add_filter-TSYNC, it's not
>>>>> sensible to continue after a failure. However, using just-TSYNC allows
>>>>> sensible re-trying. Since the environments where TSYNC intend to be
>>>>> used in can be very weird, I really want to retain the retry ability.
>>>>
>>>> OK.  So what's wrong with the other approach in which the
>>>> add-to-all-threads option always succeeds?  IOW, rather than requiring
>>>> that all threads have the caller's filter as an ancestor, just add the
>>>> requested filter to all threads.  As an optimization, if the targetted
>>>> thread has the current thread's filter as its filter, too, then the
>>>> targetted thread can stay synchronized.
>>>>
>>>> That way the add filter call really is atomic.
>>>>
>>>> I'm not fundamentally opposed to TSYNC, but I think I'd be happier if
>>>> the userspace interface could be kept as simple as possible.  The fact
>>>> that there's a filter hierarchy is sort of an implementation detail, I
>>>> think.
>>>
>>> I'm totally on board with making this as simple as possible. :) The
>>> corner cases are kind of horrible, though, but I think this is already
>>> as simple as it can get.
>>>
>>> Externally, without the ancestry check, we can run the risk of have
>>> unstable behavior out of a filter change. Imagine the case of a race
>>> where a thread is adding a filter (via prctl), and the other thread
>>> attempts to TSYNC a filter that blocks prctl.
>>>
>>> In the "always take the new filter" case, sometimes we get two filters
>>> (original and TSYNCed) on the first thread, and sometimes it blows up
>>> when it calls prctl (TSYNCed filter blocks the prctl). There's no way
>>> for the TSYNC caller to detect who won the race.
>>>
>>> With the patch series as-is, losing the race results in a TSYNC
>>> failure (ancestry doesn't match). This is immediately detectable and
>>> the caller can the decide how to handle the situation directly.
>>>
>>> Regardless, I don't think the filter hierarchy is an implementation
>>> detail -- complex filters take advantage of the hierarchies. And
>>> keeping this hierarchy stable means the filters are simpler to
>>> validate for correctness, etc.
>>
>> Hmm.
>>
>> Another issue: unless I'm not looking at the right version of this, I
>> don't think that while_each_thread does what you think it does.
>
> I believe it to walk current's thread_group list. It starts at current
> and walks until it sees current again. While holding task_list for
> writing, no new threads can appear on the list. Any cloned threads not
> yet in the thread_group list will be added once task_lock is held by
> the cloner (kernel/fork.c after "init_task_pid(p, PIDTYPE_PID,
> pid);").

Oh.  I see.  I think the intended use is to pair do_each_thread with
while_each_thread, which has a rather different effect.
for_each_thread might be clearer.

>
>> I'm generally in favor of making the kernel interfaces easy to use,
>> even at the cost of a bit of implementation complexity.  Would this be
>> simpler if you hijacked siglock to protect seccomp state?  Then I
>> think you could just make everything atomic.
>
> The sighand structure isn't unique to the task nor the thread group
> (CLONE_SIGHAND will leave the same sighand attached to a new thread),
> so it can't be used sanely here AIUI. The threadgroup_lock stuff could
> be used here, but I felt like it was too heavy a hammer for
> per-add-filter actions.

I'm going by this comment:

/*
 * NOTE! "signal_struct" does not have its own
 * locking, because a shared signal_struct always
 * implies a shared sighand_struct, so locking
 * sighand_struct is always a proper superset of
 * the locking of signal_struct.
 */
struct signal_struct {

copy_process protects addition to signal->thread_head using siglock,
which seems like a suitable lock for you to use.  And it's there
regardless of configuration.

>
> The atomicity you're after is just avoiding the "I failed but some
> threads may have changed" side-effect, IIUC. As in, you want "I failed
> and no threads have changed". For that, I think we need to use
> threadgroup_lock. However, I remain unconvinced that this is the right
> way to go since the failure corner case needs to be handled by the
> tsync caller in one of two ways regardless of either perfect "failed,
> no threads changed" or existing series "failed, some threads changed":
> retry until success or kill entire process.


With the current patches, I think you need:

add filter to this thread;
while (trying to sync fails) {
  grumble;
  sleep a bit;
}

If the "failed, some threads changed" case goes away, then I think it can be:

while (trying to add a new filter to all threads fails) {
  grumble;
  sleep a bit;
}

which is nicer.  It could be fancier:

force add new filter to all threads;  (i.e. no loop)

where force-add does, roughly:

acquire siglock
old = current's filter
add the filter to this thread
new = current's new filter
for each other thread {
  if (thread's filter == old)
    thread's filter = new (w/ appropriate refcounting)
  else
    add filter to thread
}

IOW, if you are already adding filters to library threads, I think
it'll be more convenient and robust to to add it everywhere than to
exempt threads that have added their own filter.  If they've added
their own filter, we won't override it; we'll just compose.  It'll be
the same, modulo filter order, as if the library thread adds its
filter *after* we add ours.

Yes, I realize there are oddities here if one of the filters blocks
seccomp prctls, but I think there are oddities here regardless.  I
certainly see the value in the non-force variant.

--Andy

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 6/6] seccomp: add SECCOMP_EXT_ACT_TSYNC and SECCOMP_FILTER_TSYNC
  2014-06-03  0:29                           ` Andy Lutomirski
@ 2014-06-03  1:09                             ` Kees Cook
  2014-06-03  1:15                               ` Andy Lutomirski
  0 siblings, 1 reply; 35+ messages in thread
From: Kees Cook @ 2014-06-03  1:09 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Oleg Nesterov, Alexander Viro, Peter Zijlstra,
	James Morris, Eric Paris, Juri Lelli, John Stultz,
	David S. Miller, Daniel Borkmann, Alex Thorlton, Rik van Riel,
	Daeseok Youn, David Rientjes, Eric W. Biederman, Dario Faggioli,
	Rashika Kheria, liguang, Geert Uytterhoeven, linux-doc, LSM List,
	Will Drewry, Julien Tinnes

On Mon, Jun 2, 2014 at 5:29 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Mon, Jun 2, 2014 at 5:14 PM, Kees Cook <keescook@chromium.org> wrote:
>> On Mon, Jun 2, 2014 at 1:53 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Tue, May 27, 2014 at 12:55 PM, Kees Cook <keescook@chromium.org> wrote:
>>>> On Tue, May 27, 2014 at 12:27 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>> On Tue, May 27, 2014 at 12:23 PM, Kees Cook <keescook@chromium.org> wrote:
>>>>>> On Tue, May 27, 2014 at 12:10 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>>> On Tue, May 27, 2014 at 11:45 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>>>>> On Tue, May 27, 2014 at 11:40 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>>>>> On Tue, May 27, 2014 at 11:24 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>>>>>>> On Mon, May 26, 2014 at 12:27 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>>>>>>> On Fri, May 23, 2014 at 10:05 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>>>>>>>>> On Thu, May 22, 2014 at 4:11 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>>>>>>>>> On Thu, May 22, 2014 at 4:05 PM, Kees Cook <keescook@chromium.org> wrote:
>>>>>>>>>>>>>> Applying restrictive seccomp filter programs to large or diverse
>>>>>>>>>>>>>> codebases often requires handling threads which may be started early in
>>>>>>>>>>>>>> the process lifetime (e.g., by code that is linked in). While it is
>>>>>>>>>>>>>> possible to apply permissive programs prior to process start up, it is
>>>>>>>>>>>>>> difficult to further restrict the kernel ABI to those threads after that
>>>>>>>>>>>>>> point.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This change adds a new seccomp extension action for synchronizing thread
>>>>>>>>>>>>>> group seccomp filters and a prctl() for accessing that functionality,
>>>>>>>>>>>>>> as well as a flag for SECCOMP_EXT_ACT_FILTER to perform sync at filter
>>>>>>>>>>>>>> installation time.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When calling prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_FILTER,
>>>>>>>>>>>>>> flags, filter) with flags containing SECCOMP_FILTER_TSYNC, or when calling
>>>>>>>>>>>>>> prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_TSYNC, 0, 0), it
>>>>>>>>>>>>>> will attempt to synchronize all threads in current's threadgroup to its
>>>>>>>>>>>>>> seccomp filter program. This is possible iff all threads are using a filter
>>>>>>>>>>>>>> that is an ancestor to the filter current is attempting to synchronize to.
>>>>>>>>>>>>>> NULL filters (where the task is running as SECCOMP_MODE_NONE) are also
>>>>>>>>>>>>>> treated as ancestors allowing threads to be transitioned into
>>>>>>>>>>>>>> SECCOMP_MODE_FILTER. If prctrl(PR_SET_NO_NEW_PRIVS, ...) has been set on the
>>>>>>>>>>>>>> calling thread, no_new_privs will be set for all synchronized threads too.
>>>>>>>>>>>>>> On success, 0 is returned. On failure, the pid of one of the failing threads
>>>>>>>>>>>>>> will be returned, with as many filters installed as possible.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is there a use case for adding a filter and synchronizing filters
>>>>>>>>>>>>> being separate operations?  If not, I think this would be easier to
>>>>>>>>>>>>> understand and to use if there was just a single operation.
>>>>>>>>>>>>
>>>>>>>>>>>> Yes: if the other thread's lifetime is not well controlled, it's good
>>>>>>>>>>>> to be able to have a distinct interface to retry the thread sync that
>>>>>>>>>>>> doesn't require adding "no-op" filters.
>>>>>>>>>>>
>>>>>>>>>>> Wouldn't this still be solved by:
>>>>>>>>>>>
>>>>>>>>>>> seccomp_add_filter(final_filter, SECCOMP_FILTER_ALL_THREADS);
>>>>>>>>>>>
>>>>>>>>>>> the idea would be that, if seccomp_add_filter fails, then you give up
>>>>>>>>>>> and, if it succeeds, then you're done.  It shouldn't fail unless out
>>>>>>>>>>> of memory or you've nested too deeply.
>>>>>>>>>>
>>>>>>>>>> I wanted to keep the case of being able to to wait for non-ancestor
>>>>>>>>>> threads to finish. For example, 2 threads start and set separate
>>>>>>>>>> filters. 1 does work and exits, 2 starts another thread (3) which adds
>>>>>>>>>> filters, does work, and then waits for 1 to finish by calling TSYNC.
>>>>>>>>>> Once 1 dies, TSYNC succeeds. In the case of not having direct control
>>>>>>>>>> over thread lifetime (say, when using third-party libraries), I'd like
>>>>>>>>>> to retain the flexibility of being able to do TSYNC without needing a
>>>>>>>>>> filter being attached to it.
>>>>>>>>>
>>>>>>>>> I must admit this strikes me as odd.  What's the point of having a
>>>>>>>>> thread set a filter if it intends to be a short-lived thread?
>>>>>>>>
>>>>>>>> I was illustrating the potential insanity of third-party libraries.
>>>>>>>> There isn't much sense in that behavior, but if it exists, working
>>>>>>>> around it is harder without the separate TSYNC-only call.
>>>>>>>>
>>>>>>>>> In any case, I must have missed the ability for TSYNC to block.  Hmm.
>>>>>>>>> That seems complicated, albeit potentially useful.
>>>>>>>>
>>>>>>>> Oh, no, I didn't mean to imply TSYNC should block. I meant that thread
>>>>>>>> 3 could do:
>>>>>>>>
>>>>>>>> while (TSYNC-fails)
>>>>>>>>    wait-on-or-kill-unexpected-thread
>>>>>>>>
>>>>>>>
>>>>>>> Ok.
>>>>>>>
>>>>>>> I'm still not seeing the need for a separate TSYNC option, though --
>>>>>>> just add-a-filter-across-all-threads would work if it failed
>>>>>>> harmlessly on error.  FWIW, TSYNC is probably equivalent to adding an
>>>>>>> always-accept filter across all threads, although no one should really
>>>>>>> do the latter for efficiency reasons.
>>>>>>
>>>>>> Given the complexity of the locking, "fail" means "I applied the
>>>>>> change to all threads except for at least this one: *error code*",
>>>>>> which means looping with the "add-a-filter" method means all the other
>>>>>> threads keep getting filters added until there is full success. I
>>>>>> don't want that overhead, so this keeps TSYNC distinctly separate.
>>>>>
>>>>> Ugh, right.
>>>>>
>>>>>>
>>>>>> Because of the filter addition, when using add_filter-TSYNC, it's not
>>>>>> sensible to continue after a failure. However, using just-TSYNC allows
>>>>>> sensible re-trying. Since the environments where TSYNC intend to be
>>>>>> used in can be very weird, I really want to retain the retry ability.
>>>>>
>>>>> OK.  So what's wrong with the other approach in which the
>>>>> add-to-all-threads option always succeeds?  IOW, rather than requiring
>>>>> that all threads have the caller's filter as an ancestor, just add the
>>>>> requested filter to all threads.  As an optimization, if the targetted
>>>>> thread has the current thread's filter as its filter, too, then the
>>>>> targetted thread can stay synchronized.
>>>>>
>>>>> That way the add filter call really is atomic.
>>>>>
>>>>> I'm not fundamentally opposed to TSYNC, but I think I'd be happier if
>>>>> the userspace interface could be kept as simple as possible.  The fact
>>>>> that there's a filter hierarchy is sort of an implementation detail, I
>>>>> think.
>>>>
>>>> I'm totally on board with making this as simple as possible. :) The
>>>> corner cases are kind of horrible, though, but I think this is already
>>>> as simple as it can get.
>>>>
>>>> Externally, without the ancestry check, we can run the risk of have
>>>> unstable behavior out of a filter change. Imagine the case of a race
>>>> where a thread is adding a filter (via prctl), and the other thread
>>>> attempts to TSYNC a filter that blocks prctl.
>>>>
>>>> In the "always take the new filter" case, sometimes we get two filters
>>>> (original and TSYNCed) on the first thread, and sometimes it blows up
>>>> when it calls prctl (TSYNCed filter blocks the prctl). There's no way
>>>> for the TSYNC caller to detect who won the race.
>>>>
>>>> With the patch series as-is, losing the race results in a TSYNC
>>>> failure (ancestry doesn't match). This is immediately detectable and
>>>> the caller can the decide how to handle the situation directly.
>>>>
>>>> Regardless, I don't think the filter hierarchy is an implementation
>>>> detail -- complex filters take advantage of the hierarchies. And
>>>> keeping this hierarchy stable means the filters are simpler to
>>>> validate for correctness, etc.
>>>
>>> Hmm.
>>>
>>> Another issue: unless I'm not looking at the right version of this, I
>>> don't think that while_each_thread does what you think it does.
>>
>> I believe it to walk current's thread_group list. It starts at current
>> and walks until it sees current again. While holding task_list for
>> writing, no new threads can appear on the list. Any cloned threads not
>> yet in the thread_group list will be added once task_lock is held by
>> the cloner (kernel/fork.c after "init_task_pid(p, PIDTYPE_PID,
>> pid);").
>
> Oh.  I see.  I think the intended use is to pair do_each_thread with
> while_each_thread, which has a rather different effect.
> for_each_thread might be clearer.

I can swap it out. It doesn't seem to be any different.

>>> I'm generally in favor of making the kernel interfaces easy to use,
>>> even at the cost of a bit of implementation complexity.  Would this be
>>> simpler if you hijacked siglock to protect seccomp state?  Then I
>>> think you could just make everything atomic.
>>
>> The sighand structure isn't unique to the task nor the thread group
>> (CLONE_SIGHAND will leave the same sighand attached to a new thread),
>> so it can't be used sanely here AIUI. The threadgroup_lock stuff could
>> be used here, but I felt like it was too heavy a hammer for
>> per-add-filter actions.
>
> I'm going by this comment:
>
> /*
>  * NOTE! "signal_struct" does not have its own
>  * locking, because a shared signal_struct always
>  * implies a shared sighand_struct, so locking
>  * sighand_struct is always a proper superset of
>  * the locking of signal_struct.
>  */
> struct signal_struct {
>
> copy_process protects addition to signal->thread_head using siglock,
> which seems like a suitable lock for you to use.  And it's there
> regardless of configuration.

Ah, I see now that CLONE_THREAD must include CLONE_SIGHAND, and
CLONE_SIGHAND must include CLONE_VM. So, we can depend on this being a
single thread-group-wide lock! Excellent.

>> The atomicity you're after is just avoiding the "I failed but some
>> threads may have changed" side-effect, IIUC. As in, you want "I failed
>> and no threads have changed". For that, I think we need to use
>> threadgroup_lock. However, I remain unconvinced that this is the right
>> way to go since the failure corner case needs to be handled by the
>> tsync caller in one of two ways regardless of either perfect "failed,
>> no threads changed" or existing series "failed, some threads changed":
>> retry until success or kill entire process.
>
>
> With the current patches, I think you need:
>
> add filter to this thread;
> while (trying to sync fails) {
>   grumble;
>   sleep a bit;
> }
>
> If the "failed, some threads changed" case goes away, then I think it can be:
>
> while (trying to add a new filter to all threads fails) {
>   grumble;
>   sleep a bit;
> }

You're still looking to get rid of the stand alone tsync action. It
does seem unneeded here if the filter-add can fail without
side-effects. A new prctl or syscall is still needed to handle this,
regardless.

> which is nicer.  It could be fancier:
>
> force add new filter to all threads;  (i.e. no loop)
>
> where force-add does, roughly:
>
> acquire siglock
> old = current's filter
> add the filter to this thread
> new = current's new filter
> for each other thread {
>   if (thread's filter == old)
>     thread's filter = new (w/ appropriate refcounting)
>   else
>     add filter to thread
> }

The trick is the "w/ appropriate refcounting" part. :) Also, "add
filter to thread" means performing memory allocation, which we can't
do while holding the siglock spinlock IIUC. I remain uncomfortable
with the composition idea, because it's not really a "sync", it's a
"make sure *this* filter is attached" instead of "make sure everyone
is using my entire current filter chain".

> IOW, if you are already adding filters to library threads, I think
> it'll be more convenient and robust to to add it everywhere than to
> exempt threads that have added their own filter.  If they've added
> their own filter, we won't override it; we'll just compose.  It'll be
> the same, modulo filter order, as if the library thread adds its
> filter *after* we add ours.

This is why I don't like it: it's not sync. It's force-apply. The goal
was to be able to build up a filter tree with potentially multiple
calls, and in the future say "okay, now, make sure any threads that
got started in the middle of that are all moved forward to the same
filter chain position I'm on."

The force-apply route means all the filters must be force-loaded, and
doesn't handle, say, a library that doesn't start a thread but does
apply a filter running, and then we apply another with force-apply,
and suddenly all the other threads are missing the filter added by the
library.

I think sync should be sync -- it's the desired functionality that
Chrome needs. Force-apply is different, and could be implemented at a
later time if someone wants it, but for now, I'd like a literal
thread-sync.

> Yes, I realize there are oddities here if one of the filters blocks
> seccomp prctls, but I think there are oddities here regardless.  I
> certainly see the value in the non-force variant.

So, it sounds like if I can build the side-effect-less failure case,
you'll be happy? I really do not want to do the force-apply. Can you
live without that?

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 6/6] seccomp: add SECCOMP_EXT_ACT_TSYNC and SECCOMP_FILTER_TSYNC
  2014-06-03  1:09                             ` Kees Cook
@ 2014-06-03  1:15                               ` Andy Lutomirski
  2014-06-03 19:53                                 ` Kees Cook
  0 siblings, 1 reply; 35+ messages in thread
From: Andy Lutomirski @ 2014-06-03  1:15 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, Oleg Nesterov, Alexander Viro, Peter Zijlstra,
	James Morris, Eric Paris, Juri Lelli, John Stultz,
	David S. Miller, Daniel Borkmann, Alex Thorlton, Rik van Riel,
	Daeseok Youn, David Rientjes, Eric W. Biederman, Dario Faggioli,
	Rashika Kheria, liguang, Geert Uytterhoeven, linux-doc, LSM List,
	Will Drewry, Julien Tinnes

On Mon, Jun 2, 2014 at 6:09 PM, Kees Cook <keescook@chromium.org> wrote:
> On Mon, Jun 2, 2014 at 5:29 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Mon, Jun 2, 2014 at 5:14 PM, Kees Cook <keescook@chromium.org> wrote:
>>> On Mon, Jun 2, 2014 at 1:53 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> On Tue, May 27, 2014 at 12:55 PM, Kees Cook <keescook@chromium.org> wrote:
>>>>> On Tue, May 27, 2014 at 12:27 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>> On Tue, May 27, 2014 at 12:23 PM, Kees Cook <keescook@chromium.org> wrote:
>>>>>>> On Tue, May 27, 2014 at 12:10 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>>>> On Tue, May 27, 2014 at 11:45 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>>>>>> On Tue, May 27, 2014 at 11:40 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>>>>>> On Tue, May 27, 2014 at 11:24 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>>>>>>>> On Mon, May 26, 2014 at 12:27 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>>>>>>>> On Fri, May 23, 2014 at 10:05 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>>>>>>>>>> On Thu, May 22, 2014 at 4:11 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>>>>>>>>>> On Thu, May 22, 2014 at 4:05 PM, Kees Cook <keescook@chromium.org> wrote:
>>>>>>>>>>>>>>> Applying restrictive seccomp filter programs to large or diverse
>>>>>>>>>>>>>>> codebases often requires handling threads which may be started early in
>>>>>>>>>>>>>>> the process lifetime (e.g., by code that is linked in). While it is
>>>>>>>>>>>>>>> possible to apply permissive programs prior to process start up, it is
>>>>>>>>>>>>>>> difficult to further restrict the kernel ABI to those threads after that
>>>>>>>>>>>>>>> point.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This change adds a new seccomp extension action for synchronizing thread
>>>>>>>>>>>>>>> group seccomp filters and a prctl() for accessing that functionality,
>>>>>>>>>>>>>>> as well as a flag for SECCOMP_EXT_ACT_FILTER to perform sync at filter
>>>>>>>>>>>>>>> installation time.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> When calling prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_FILTER,
>>>>>>>>>>>>>>> flags, filter) with flags containing SECCOMP_FILTER_TSYNC, or when calling
>>>>>>>>>>>>>>> prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_TSYNC, 0, 0), it
>>>>>>>>>>>>>>> will attempt to synchronize all threads in current's threadgroup to its
>>>>>>>>>>>>>>> seccomp filter program. This is possible iff all threads are using a filter
>>>>>>>>>>>>>>> that is an ancestor to the filter current is attempting to synchronize to.
>>>>>>>>>>>>>>> NULL filters (where the task is running as SECCOMP_MODE_NONE) are also
>>>>>>>>>>>>>>> treated as ancestors allowing threads to be transitioned into
>>>>>>>>>>>>>>> SECCOMP_MODE_FILTER. If prctrl(PR_SET_NO_NEW_PRIVS, ...) has been set on the
>>>>>>>>>>>>>>> calling thread, no_new_privs will be set for all synchronized threads too.
>>>>>>>>>>>>>>> On success, 0 is returned. On failure, the pid of one of the failing threads
>>>>>>>>>>>>>>> will be returned, with as many filters installed as possible.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is there a use case for adding a filter and synchronizing filters
>>>>>>>>>>>>>> being separate operations?  If not, I think this would be easier to
>>>>>>>>>>>>>> understand and to use if there was just a single operation.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes: if the other thread's lifetime is not well controlled, it's good
>>>>>>>>>>>>> to be able to have a distinct interface to retry the thread sync that
>>>>>>>>>>>>> doesn't require adding "no-op" filters.
>>>>>>>>>>>>
>>>>>>>>>>>> Wouldn't this still be solved by:
>>>>>>>>>>>>
>>>>>>>>>>>> seccomp_add_filter(final_filter, SECCOMP_FILTER_ALL_THREADS);
>>>>>>>>>>>>
>>>>>>>>>>>> the idea would be that, if seccomp_add_filter fails, then you give up
>>>>>>>>>>>> and, if it succeeds, then you're done.  It shouldn't fail unless out
>>>>>>>>>>>> of memory or you've nested too deeply.
>>>>>>>>>>>
>>>>>>>>>>> I wanted to keep the case of being able to to wait for non-ancestor
>>>>>>>>>>> threads to finish. For example, 2 threads start and set separate
>>>>>>>>>>> filters. 1 does work and exits, 2 starts another thread (3) which adds
>>>>>>>>>>> filters, does work, and then waits for 1 to finish by calling TSYNC.
>>>>>>>>>>> Once 1 dies, TSYNC succeeds. In the case of not having direct control
>>>>>>>>>>> over thread lifetime (say, when using third-party libraries), I'd like
>>>>>>>>>>> to retain the flexibility of being able to do TSYNC without needing a
>>>>>>>>>>> filter being attached to it.
>>>>>>>>>>
>>>>>>>>>> I must admit this strikes me as odd.  What's the point of having a
>>>>>>>>>> thread set a filter if it intends to be a short-lived thread?
>>>>>>>>>
>>>>>>>>> I was illustrating the potential insanity of third-party libraries.
>>>>>>>>> There isn't much sense in that behavior, but if it exists, working
>>>>>>>>> around it is harder without the separate TSYNC-only call.
>>>>>>>>>
>>>>>>>>>> In any case, I must have missed the ability for TSYNC to block.  Hmm.
>>>>>>>>>> That seems complicated, albeit potentially useful.
>>>>>>>>>
>>>>>>>>> Oh, no, I didn't mean to imply TSYNC should block. I meant that thread
>>>>>>>>> 3 could do:
>>>>>>>>>
>>>>>>>>> while (TSYNC-fails)
>>>>>>>>>    wait-on-or-kill-unexpected-thread
>>>>>>>>>
>>>>>>>>
>>>>>>>> Ok.
>>>>>>>>
>>>>>>>> I'm still not seeing the need for a separate TSYNC option, though --
>>>>>>>> just add-a-filter-across-all-threads would work if it failed
>>>>>>>> harmlessly on error.  FWIW, TSYNC is probably equivalent to adding an
>>>>>>>> always-accept filter across all threads, although no one should really
>>>>>>>> do the latter for efficiency reasons.
>>>>>>>
>>>>>>> Given the complexity of the locking, "fail" means "I applied the
>>>>>>> change to all threads except for at least this one: *error code*",
>>>>>>> which means looping with the "add-a-filter" method means all the other
>>>>>>> threads keep getting filters added until there is full success. I
>>>>>>> don't want that overhead, so this keeps TSYNC distinctly separate.
>>>>>>
>>>>>> Ugh, right.
>>>>>>
>>>>>>>
>>>>>>> Because of the filter addition, when using add_filter-TSYNC, it's not
>>>>>>> sensible to continue after a failure. However, using just-TSYNC allows
>>>>>>> sensible re-trying. Since the environments where TSYNC intend to be
>>>>>>> used in can be very weird, I really want to retain the retry ability.
>>>>>>
>>>>>> OK.  So what's wrong with the other approach in which the
>>>>>> add-to-all-threads option always succeeds?  IOW, rather than requiring
>>>>>> that all threads have the caller's filter as an ancestor, just add the
>>>>>> requested filter to all threads.  As an optimization, if the targetted
>>>>>> thread has the current thread's filter as its filter, too, then the
>>>>>> targetted thread can stay synchronized.
>>>>>>
>>>>>> That way the add filter call really is atomic.
>>>>>>
>>>>>> I'm not fundamentally opposed to TSYNC, but I think I'd be happier if
>>>>>> the userspace interface could be kept as simple as possible.  The fact
>>>>>> that there's a filter hierarchy is sort of an implementation detail, I
>>>>>> think.
>>>>>
>>>>> I'm totally on board with making this as simple as possible. :) The
>>>>> corner cases are kind of horrible, though, but I think this is already
>>>>> as simple as it can get.
>>>>>
>>>>> Externally, without the ancestry check, we can run the risk of have
>>>>> unstable behavior out of a filter change. Imagine the case of a race
>>>>> where a thread is adding a filter (via prctl), and the other thread
>>>>> attempts to TSYNC a filter that blocks prctl.
>>>>>
>>>>> In the "always take the new filter" case, sometimes we get two filters
>>>>> (original and TSYNCed) on the first thread, and sometimes it blows up
>>>>> when it calls prctl (TSYNCed filter blocks the prctl). There's no way
>>>>> for the TSYNC caller to detect who won the race.
>>>>>
>>>>> With the patch series as-is, losing the race results in a TSYNC
>>>>> failure (ancestry doesn't match). This is immediately detectable and
>>>>> the caller can the decide how to handle the situation directly.
>>>>>
>>>>> Regardless, I don't think the filter hierarchy is an implementation
>>>>> detail -- complex filters take advantage of the hierarchies. And
>>>>> keeping this hierarchy stable means the filters are simpler to
>>>>> validate for correctness, etc.
>>>>
>>>> Hmm.
>>>>
>>>> Another issue: unless I'm not looking at the right version of this, I
>>>> don't think that while_each_thread does what you think it does.
>>>
>>> I believe it to walk current's thread_group list. It starts at current
>>> and walks until it sees current again. While holding task_list for
>>> writing, no new threads can appear on the list. Any cloned threads not
>>> yet in the thread_group list will be added once task_lock is held by
>>> the cloner (kernel/fork.c after "init_task_pid(p, PIDTYPE_PID,
>>> pid);").
>>
>> Oh.  I see.  I think the intended use is to pair do_each_thread with
>> while_each_thread, which has a rather different effect.
>> for_each_thread might be clearer.
>
> I can swap it out. It doesn't seem to be any different.
>
>>>> I'm generally in favor of making the kernel interfaces easy to use,
>>>> even at the cost of a bit of implementation complexity.  Would this be
>>>> simpler if you hijacked siglock to protect seccomp state?  Then I
>>>> think you could just make everything atomic.
>>>
>>> The sighand structure isn't unique to the task nor the thread group
>>> (CLONE_SIGHAND will leave the same sighand attached to a new thread),
>>> so it can't be used sanely here AIUI. The threadgroup_lock stuff could
>>> be used here, but I felt like it was too heavy a hammer for
>>> per-add-filter actions.
>>
>> I'm going by this comment:
>>
>> /*
>>  * NOTE! "signal_struct" does not have its own
>>  * locking, because a shared signal_struct always
>>  * implies a shared sighand_struct, so locking
>>  * sighand_struct is always a proper superset of
>>  * the locking of signal_struct.
>>  */
>> struct signal_struct {
>>
>> copy_process protects addition to signal->thread_head using siglock,
>> which seems like a suitable lock for you to use.  And it's there
>> regardless of configuration.
>
> Ah, I see now that CLONE_THREAD must include CLONE_SIGHAND, and
> CLONE_SIGHAND must include CLONE_VM. So, we can depend on this being a
> single thread-group-wide lock! Excellent.
>
>>> The atomicity you're after is just avoiding the "I failed but some
>>> threads may have changed" side-effect, IIUC. As in, you want "I failed
>>> and no threads have changed". For that, I think we need to use
>>> threadgroup_lock. However, I remain unconvinced that this is the right
>>> way to go since the failure corner case needs to be handled by the
>>> tsync caller in one of two ways regardless of either perfect "failed,
>>> no threads changed" or existing series "failed, some threads changed":
>>> retry until success or kill entire process.
>>
>>
>> With the current patches, I think you need:
>>
>> add filter to this thread;
>> while (trying to sync fails) {
>>   grumble;
>>   sleep a bit;
>> }
>>
>> If the "failed, some threads changed" case goes away, then I think it can be:
>>
>> while (trying to add a new filter to all threads fails) {
>>   grumble;
>>   sleep a bit;
>> }
>
> You're still looking to get rid of the stand alone tsync action. It
> does seem unneeded here if the filter-add can fail without
> side-effects. A new prctl or syscall is still needed to handle this,
> regardless.
>
>> which is nicer.  It could be fancier:
>>
>> force add new filter to all threads;  (i.e. no loop)
>>
>> where force-add does, roughly:
>>
>> acquire siglock
>> old = current's filter
>> add the filter to this thread
>> new = current's new filter
>> for each other thread {
>>   if (thread's filter == old)
>>     thread's filter = new (w/ appropriate refcounting)
>>   else
>>     add filter to thread
>> }
>
> The trick is the "w/ appropriate refcounting" part. :) Also, "add
> filter to thread" means performing memory allocation, which we can't
> do while holding the siglock spinlock IIUC. I remain uncomfortable
> with the composition idea, because it's not really a "sync", it's a
> "make sure *this* filter is attached" instead of "make sure everyone
> is using my entire current filter chain".
>
>> IOW, if you are already adding filters to library threads, I think
>> it'll be more convenient and robust to to add it everywhere than to
>> exempt threads that have added their own filter.  If they've added
>> their own filter, we won't override it; we'll just compose.  It'll be
>> the same, modulo filter order, as if the library thread adds its
>> filter *after* we add ours.
>
> This is why I don't like it: it's not sync. It's force-apply. The goal
> was to be able to build up a filter tree with potentially multiple
> calls, and in the future say "okay, now, make sure any threads that
> got started in the middle of that are all moved forward to the same
> filter chain position I'm on."
>
> The force-apply route means all the filters must be force-loaded, and
> doesn't handle, say, a library that doesn't start a thread but does
> apply a filter running, and then we apply another with force-apply,
> and suddenly all the other threads are missing the filter added by the
> library.

This is racy, though: what if you sync before the library adds its filter?

>
> I think sync should be sync -- it's the desired functionality that
> Chrome needs. Force-apply is different, and could be implemented at a
> later time if someone wants it, but for now, I'd like a literal
> thread-sync.

>
>> Yes, I realize there are oddities here if one of the filters blocks
>> seccomp prctls, but I think there are oddities here regardless.  I
>> certainly see the value in the non-force variant.
>
> So, it sounds like if I can build the side-effect-less failure case,
> you'll be happy? I really do not want to do the force-apply. Can you
> live without that?

Sure.  I can always add it if I find myself wanting it.  Are you
planning on changing the locking and making sync (or
add-to-all-threads-not-forced) atomic?

--Andy

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 0/6] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC
  2014-06-02 23:08           ` Andy Lutomirski
@ 2014-06-03 10:12             ` Michael Kerrisk
  2014-06-03 23:47               ` Julien Tinnes
  0 siblings, 1 reply; 35+ messages in thread
From: Michael Kerrisk @ 2014-06-03 10:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, Linux API, Andrew Morton, Oleg Nesterov, James Morris,
	Stephen Rothwell, David S. Miller, LKML, Will Drewry,
	Julien Tinnes, Alexei Starovoitov, Michael Kerrisk-manpages

[Kees, thank you for CCing linux-api]

On Tue, Jun 3, 2014 at 1:08 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Mon, Jun 2, 2014 at 4:05 PM, Kees Cook <keescook@chromium.org> wrote:

>> I'd like to hear from other folks on this (akpm?). My instinct is to
>> continue using prctl since that is already where mediation for seccomp
>> happens. I don't see why prctl vs a new syscall makes a difference
>> here, frankly.
>
> Aesthetics?  There's a tendency for people to get annoyed at big
> multiplexed APIs, and your patches will be doubly multiplexed.

prctl() is already a Franken-interface that provides a mass of
different, mostly completely unrelated, functionality. So, I wonder if
it would be better not to make the situation worse. Furthermore, the
very fact that the existing prctl seccomp API is being extended and
multiplexed suggests that other extensions might be desirable further
down the line, which also hints that a separate syscall would be a
good idea. (Or do we have to wait until the prctl seccomp API is
extended one more time, before we realize that a new system call would
have been a good idea...)

> TBH, I care more about the atomicity thing than about the actual form
> of the API.

User-space does not necessarily thank you for that perspective, Andy
;-). The atomicity thing is presumably fixable, regardless of the API.
On the other hand, APIs are things that kernel developers design once
and forget about, and user-space has to live with forever.

Cheers,

Michael

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 6/6] seccomp: add SECCOMP_EXT_ACT_TSYNC and SECCOMP_FILTER_TSYNC
  2014-06-03  1:15                               ` Andy Lutomirski
@ 2014-06-03 19:53                                 ` Kees Cook
  0 siblings, 0 replies; 35+ messages in thread
From: Kees Cook @ 2014-06-03 19:53 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Oleg Nesterov, Alexander Viro, Peter Zijlstra,
	James Morris, Eric Paris, Juri Lelli, John Stultz,
	David S. Miller, Daniel Borkmann, Alex Thorlton, Rik van Riel,
	Daeseok Youn, David Rientjes, Eric W. Biederman, Dario Faggioli,
	Rashika Kheria, liguang, Geert Uytterhoeven, linux-doc, LSM List,
	Will Drewry, Julien Tinnes

On Mon, Jun 2, 2014 at 6:15 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Mon, Jun 2, 2014 at 6:09 PM, Kees Cook <keescook@chromium.org> wrote:
>> On Mon, Jun 2, 2014 at 5:29 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Mon, Jun 2, 2014 at 5:14 PM, Kees Cook <keescook@chromium.org> wrote:
>>>> On Mon, Jun 2, 2014 at 1:53 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>> On Tue, May 27, 2014 at 12:55 PM, Kees Cook <keescook@chromium.org> wrote:
>>>>>> On Tue, May 27, 2014 at 12:27 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>>> On Tue, May 27, 2014 at 12:23 PM, Kees Cook <keescook@chromium.org> wrote:
>>>>>>>> On Tue, May 27, 2014 at 12:10 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>>>>> On Tue, May 27, 2014 at 11:45 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>>>>>>> On Tue, May 27, 2014 at 11:40 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>>>>>>> On Tue, May 27, 2014 at 11:24 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>>>>>>>>> On Mon, May 26, 2014 at 12:27 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>>>>>>>>> On Fri, May 23, 2014 at 10:05 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>>>>>>>>>>> On Thu, May 22, 2014 at 4:11 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>>>>>>>>>>> On Thu, May 22, 2014 at 4:05 PM, Kees Cook <keescook@chromium.org> wrote:
>>>>>>>>>>>>>>>> Applying restrictive seccomp filter programs to large or diverse
>>>>>>>>>>>>>>>> codebases often requires handling threads which may be started early in
>>>>>>>>>>>>>>>> the process lifetime (e.g., by code that is linked in). While it is
>>>>>>>>>>>>>>>> possible to apply permissive programs prior to process start up, it is
>>>>>>>>>>>>>>>> difficult to further restrict the kernel ABI to those threads after that
>>>>>>>>>>>>>>>> point.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This change adds a new seccomp extension action for synchronizing thread
>>>>>>>>>>>>>>>> group seccomp filters and a prctl() for accessing that functionality,
>>>>>>>>>>>>>>>> as well as a flag for SECCOMP_EXT_ACT_FILTER to perform sync at filter
>>>>>>>>>>>>>>>> installation time.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> When calling prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_FILTER,
>>>>>>>>>>>>>>>> flags, filter) with flags containing SECCOMP_FILTER_TSYNC, or when calling
>>>>>>>>>>>>>>>> prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_TSYNC, 0, 0), it
>>>>>>>>>>>>>>>> will attempt to synchronize all threads in current's threadgroup to its
>>>>>>>>>>>>>>>> seccomp filter program. This is possible iff all threads are using a filter
>>>>>>>>>>>>>>>> that is an ancestor to the filter current is attempting to synchronize to.
>>>>>>>>>>>>>>>> NULL filters (where the task is running as SECCOMP_MODE_NONE) are also
>>>>>>>>>>>>>>>> treated as ancestors allowing threads to be transitioned into
>>>>>>>>>>>>>>>> SECCOMP_MODE_FILTER. If prctrl(PR_SET_NO_NEW_PRIVS, ...) has been set on the
>>>>>>>>>>>>>>>> calling thread, no_new_privs will be set for all synchronized threads too.
>>>>>>>>>>>>>>>> On success, 0 is returned. On failure, the pid of one of the failing threads
>>>>>>>>>>>>>>>> will be returned, with as many filters installed as possible.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Is there a use case for adding a filter and synchronizing filters
>>>>>>>>>>>>>>> being separate operations?  If not, I think this would be easier to
>>>>>>>>>>>>>>> understand and to use if there was just a single operation.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes: if the other thread's lifetime is not well controlled, it's good
>>>>>>>>>>>>>> to be able to have a distinct interface to retry the thread sync that
>>>>>>>>>>>>>> doesn't require adding "no-op" filters.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Wouldn't this still be solved by:
>>>>>>>>>>>>>
>>>>>>>>>>>>> seccomp_add_filter(final_filter, SECCOMP_FILTER_ALL_THREADS);
>>>>>>>>>>>>>
>>>>>>>>>>>>> the idea would be that, if seccomp_add_filter fails, then you give up
>>>>>>>>>>>>> and, if it succeeds, then you're done.  It shouldn't fail unless out
>>>>>>>>>>>>> of memory or you've nested too deeply.
>>>>>>>>>>>>
>>>>>>>>>>>> I wanted to keep the case of being able to to wait for non-ancestor
>>>>>>>>>>>> threads to finish. For example, 2 threads start and set separate
>>>>>>>>>>>> filters. 1 does work and exits, 2 starts another thread (3) which adds
>>>>>>>>>>>> filters, does work, and then waits for 1 to finish by calling TSYNC.
>>>>>>>>>>>> Once 1 dies, TSYNC succeeds. In the case of not having direct control
>>>>>>>>>>>> over thread lifetime (say, when using third-party libraries), I'd like
>>>>>>>>>>>> to retain the flexibility of being able to do TSYNC without needing a
>>>>>>>>>>>> filter being attached to it.
>>>>>>>>>>>
>>>>>>>>>>> I must admit this strikes me as odd.  What's the point of having a
>>>>>>>>>>> thread set a filter if it intends to be a short-lived thread?
>>>>>>>>>>
>>>>>>>>>> I was illustrating the potential insanity of third-party libraries.
>>>>>>>>>> There isn't much sense in that behavior, but if it exists, working
>>>>>>>>>> around it is harder without the separate TSYNC-only call.
>>>>>>>>>>
>>>>>>>>>>> In any case, I must have missed the ability for TSYNC to block.  Hmm.
>>>>>>>>>>> That seems complicated, albeit potentially useful.
>>>>>>>>>>
>>>>>>>>>> Oh, no, I didn't mean to imply TSYNC should block. I meant that thread
>>>>>>>>>> 3 could do:
>>>>>>>>>>
>>>>>>>>>> while (TSYNC-fails)
>>>>>>>>>>    wait-on-or-kill-unexpected-thread
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Ok.
>>>>>>>>>
>>>>>>>>> I'm still not seeing the need for a separate TSYNC option, though --
>>>>>>>>> just add-a-filter-across-all-threads would work if it failed
>>>>>>>>> harmlessly on error.  FWIW, TSYNC is probably equivalent to adding an
>>>>>>>>> always-accept filter across all threads, although no one should really
>>>>>>>>> do the latter for efficiency reasons.
>>>>>>>>
>>>>>>>> Given the complexity of the locking, "fail" means "I applied the
>>>>>>>> change to all threads except for at least this one: *error code*",
>>>>>>>> which means looping with the "add-a-filter" method means all the other
>>>>>>>> threads keep getting filters added until there is full success. I
>>>>>>>> don't want that overhead, so this keeps TSYNC distinctly separate.
>>>>>>>
>>>>>>> Ugh, right.
>>>>>>>
>>>>>>>>
>>>>>>>> Because of the filter addition, when using add_filter-TSYNC, it's not
>>>>>>>> sensible to continue after a failure. However, using just-TSYNC allows
>>>>>>>> sensible re-trying. Since the environments where TSYNC intend to be
>>>>>>>> used in can be very weird, I really want to retain the retry ability.
>>>>>>>
>>>>>>> OK.  So what's wrong with the other approach in which the
>>>>>>> add-to-all-threads option always succeeds?  IOW, rather than requiring
>>>>>>> that all threads have the caller's filter as an ancestor, just add the
>>>>>>> requested filter to all threads.  As an optimization, if the targetted
>>>>>>> thread has the current thread's filter as its filter, too, then the
>>>>>>> targetted thread can stay synchronized.
>>>>>>>
>>>>>>> That way the add filter call really is atomic.
>>>>>>>
>>>>>>> I'm not fundamentally opposed to TSYNC, but I think I'd be happier if
>>>>>>> the userspace interface could be kept as simple as possible.  The fact
>>>>>>> that there's a filter hierarchy is sort of an implementation detail, I
>>>>>>> think.
>>>>>>
>>>>>> I'm totally on board with making this as simple as possible. :) The
>>>>>> corner cases are kind of horrible, though, but I think this is already
>>>>>> as simple as it can get.
>>>>>>
>>>>>> Externally, without the ancestry check, we can run the risk of have
>>>>>> unstable behavior out of a filter change. Imagine the case of a race
>>>>>> where a thread is adding a filter (via prctl), and the other thread
>>>>>> attempts to TSYNC a filter that blocks prctl.
>>>>>>
>>>>>> In the "always take the new filter" case, sometimes we get two filters
>>>>>> (original and TSYNCed) on the first thread, and sometimes it blows up
>>>>>> when it calls prctl (TSYNCed filter blocks the prctl). There's no way
>>>>>> for the TSYNC caller to detect who won the race.
>>>>>>
>>>>>> With the patch series as-is, losing the race results in a TSYNC
>>>>>> failure (ancestry doesn't match). This is immediately detectable and
>>>>>> the caller can the decide how to handle the situation directly.
>>>>>>
>>>>>> Regardless, I don't think the filter hierarchy is an implementation
>>>>>> detail -- complex filters take advantage of the hierarchies. And
>>>>>> keeping this hierarchy stable means the filters are simpler to
>>>>>> validate for correctness, etc.
>>>>>
>>>>> Hmm.
>>>>>
>>>>> Another issue: unless I'm not looking at the right version of this, I
>>>>> don't think that while_each_thread does what you think it does.
>>>>
>>>> I believe it to walk current's thread_group list. It starts at current
>>>> and walks until it sees current again. While holding task_list for
>>>> writing, no new threads can appear on the list. Any cloned threads not
>>>> yet in the thread_group list will be added once task_lock is held by
>>>> the cloner (kernel/fork.c after "init_task_pid(p, PIDTYPE_PID,
>>>> pid);").
>>>
>>> Oh.  I see.  I think the intended use is to pair do_each_thread with
>>> while_each_thread, which has a rather different effect.
>>> for_each_thread might be clearer.
>>
>> I can swap it out. It doesn't seem to be any different.
>>
>>>>> I'm generally in favor of making the kernel interfaces easy to use,
>>>>> even at the cost of a bit of implementation complexity.  Would this be
>>>>> simpler if you hijacked siglock to protect seccomp state?  Then I
>>>>> think you could just make everything atomic.
>>>>
>>>> The sighand structure isn't unique to the task nor the thread group
>>>> (CLONE_SIGHAND will leave the same sighand attached to a new thread),
>>>> so it can't be used sanely here AIUI. The threadgroup_lock stuff could
>>>> be used here, but I felt like it was too heavy a hammer for
>>>> per-add-filter actions.
>>>
>>> I'm going by this comment:
>>>
>>> /*
>>>  * NOTE! "signal_struct" does not have its own
>>>  * locking, because a shared signal_struct always
>>>  * implies a shared sighand_struct, so locking
>>>  * sighand_struct is always a proper superset of
>>>  * the locking of signal_struct.
>>>  */
>>> struct signal_struct {
>>>
>>> copy_process protects addition to signal->thread_head using siglock,
>>> which seems like a suitable lock for you to use.  And it's there
>>> regardless of configuration.
>>
>> Ah, I see now that CLONE_THREAD must include CLONE_SIGHAND, and
>> CLONE_SIGHAND must include CLONE_VM. So, we can depend on this being a
>> single thread-group-wide lock! Excellent.
>>
>>>> The atomicity you're after is just avoiding the "I failed but some
>>>> threads may have changed" side-effect, IIUC. As in, you want "I failed
>>>> and no threads have changed". For that, I think we need to use
>>>> threadgroup_lock. However, I remain unconvinced that this is the right
>>>> way to go since the failure corner case needs to be handled by the
>>>> tsync caller in one of two ways regardless of either perfect "failed,
>>>> no threads changed" or existing series "failed, some threads changed":
>>>> retry until success or kill entire process.
>>>
>>>
>>> With the current patches, I think you need:
>>>
>>> add filter to this thread;
>>> while (trying to sync fails) {
>>>   grumble;
>>>   sleep a bit;
>>> }
>>>
>>> If the "failed, some threads changed" case goes away, then I think it can be:
>>>
>>> while (trying to add a new filter to all threads fails) {
>>>   grumble;
>>>   sleep a bit;
>>> }
>>
>> You're still looking to get rid of the stand alone tsync action. It
>> does seem unneeded here if the filter-add can fail without
>> side-effects. A new prctl or syscall is still needed to handle this,
>> regardless.
>>
>>> which is nicer.  It could be fancier:
>>>
>>> force add new filter to all threads;  (i.e. no loop)
>>>
>>> where force-add does, roughly:
>>>
>>> acquire siglock
>>> old = current's filter
>>> add the filter to this thread
>>> new = current's new filter
>>> for each other thread {
>>>   if (thread's filter == old)
>>>     thread's filter = new (w/ appropriate refcounting)
>>>   else
>>>     add filter to thread
>>> }
>>
>> The trick is the "w/ appropriate refcounting" part. :) Also, "add
>> filter to thread" means performing memory allocation, which we can't
>> do while holding the siglock spinlock IIUC. I remain uncomfortable
>> with the composition idea, because it's not really a "sync", it's a
>> "make sure *this* filter is attached" instead of "make sure everyone
>> is using my entire current filter chain".
>>
>>> IOW, if you are already adding filters to library threads, I think
>>> it'll be more convenient and robust to to add it everywhere than to
>>> exempt threads that have added their own filter.  If they've added
>>> their own filter, we won't override it; we'll just compose.  It'll be
>>> the same, modulo filter order, as if the library thread adds its
>>> filter *after* we add ours.
>>
>> This is why I don't like it: it's not sync. It's force-apply. The goal
>> was to be able to build up a filter tree with potentially multiple
>> calls, and in the future say "okay, now, make sure any threads that
>> got started in the middle of that are all moved forward to the same
>> filter chain position I'm on."
>>
>> The force-apply route means all the filters must be force-loaded, and
>> doesn't handle, say, a library that doesn't start a thread but does
>> apply a filter running, and then we apply another with force-apply,
>> and suddenly all the other threads are missing the filter added by the
>> library.
>
> This is racy, though: what if you sync before the library adds its filter?

Even in the force-apply case, it's still potentially racy if prctl is
blocked. I don't feel the internal refactoring to the ref counting
(and associated bug risks) justifies the benefit of a very small race
that doesn't really go away and userspace has to handle gracefully
either way.

>> I think sync should be sync -- it's the desired functionality that
>> Chrome needs. Force-apply is different, and could be implemented at a
>> later time if someone wants it, but for now, I'd like a literal
>> thread-sync.
>
>>> Yes, I realize there are oddities here if one of the filters blocks
>>> seccomp prctls, but I think there are oddities here regardless.  I
>>> certainly see the value in the non-force variant.
>>
>> So, it sounds like if I can build the side-effect-less failure case,
>> you'll be happy? I really do not want to do the force-apply. Can you
>> live without that?
>
> Sure.  I can always add it if I find myself wanting it.  Are you
> planning on changing the locking and making sync (or
> add-to-all-threads-not-forced) atomic?

Yes, you've convinced me. :) I still need to hold task_list lock to
avoid ref count races during thread death, but holding sighand means
the evaluation can be done for all threads before actually making
changes.

And based on the feedback from linux-api, I'll rework the interface to
use a new syscall.

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 0/6] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC
  2014-06-03 10:12             ` Michael Kerrisk
@ 2014-06-03 23:47               ` Julien Tinnes
  0 siblings, 0 replies; 35+ messages in thread
From: Julien Tinnes @ 2014-06-03 23:47 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: Andy Lutomirski, Kees Cook, Linux API, Andrew Morton,
	Oleg Nesterov, James Morris, Stephen Rothwell, David S. Miller,
	LKML, Will Drewry, Alexei Starovoitov

On Tue, Jun 3, 2014 at 3:12 AM, Michael Kerrisk <mtk.manpages@gmail.com> wrote:
> [Kees, thank you for CCing linux-api]
>
> On Tue, Jun 3, 2014 at 1:08 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Mon, Jun 2, 2014 at 4:05 PM, Kees Cook <keescook@chromium.org> wrote:
>
>>> I'd like to hear from other folks on this (akpm?). My instinct is to
>>> continue using prctl since that is already where mediation for seccomp
>>> happens. I don't see why prctl vs a new syscall makes a difference
>>> here, frankly.
>>
>> Aesthetics?  There's a tendency for people to get annoyed at big
>> multiplexed APIs, and your patches will be doubly multiplexed.
>
> prctl() is already a Franken-interface that provides a mass of
> different, mostly completely unrelated, functionality. So, I wonder if
> it would be better not to make the situation worse. Furthermore, the
> very fact that the existing prctl seccomp API is being extended and
> multiplexed suggests that other extensions might be desirable further
> down the line, which also hints that a separate syscall would be a
> good idea. (Or do we have to wait until the prctl seccomp API is
> extended one more time, before we realize that a new system call would
> have been a good idea...)
>
>> TBH, I care more about the atomicity thing than about the actual form
>> of the API.
>
> User-space does not necessarily thank you for that perspective, Andy
> ;-). The atomicity thing is presumably fixable, regardless of the API.
> On the other hand, APIs are things that kernel developers design once
> and forget about, and user-space has to live with forever.

Well, maybe the history of it being a prctl() should count for something.
Most likely, userland will need to test for whether or not these
features are present in the kernel for years to come. With a syscall,
it would now require a syscall (unlikely to be in older headers for a
while, so will require using syscall(3) for a bit) as well as a call
to prctl() to test for seccomp mode 2 (without thread sync) in the
fallback path. It'll be a little odd. As the person who will make this
work in Chromium, I do not feel strongly either way, it's a detail, so
feel free to disregard this point.

But I'm eagerly waiting for:
- Not having to test for the presence of threads at run-time (which
requires a very ugly busy loop with exponential back-off watching for
/proc/self/tasks/ to drop to 1 directory entry).
- Being able to engage the sandbox after third-party libraries have
started threads.

Julien

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2014-06-03 23:47 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-22 23:05 [PATCH v5 0/6] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC Kees Cook
2014-05-22 23:05 ` [PATCH v5 1/6] seccomp: create internal mode-setting function Kees Cook
2014-05-22 23:05 ` [PATCH v5 2/6] seccomp: split filter prep from check and apply Kees Cook
2014-05-22 23:05 ` [PATCH v5 3/6] seccomp: introduce writer locking Kees Cook
2014-05-23  0:28   ` Alexei Starovoitov
2014-05-23  8:49   ` Peter Zijlstra
2014-05-23 21:05     ` Kees Cook
2014-05-22 23:05 ` [PATCH v5 4/6] seccomp: move no_new_privs into seccomp Kees Cook
2014-05-22 23:08   ` Andy Lutomirski
2014-05-22 23:05 ` [PATCH v5 5/6] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_FILTER Kees Cook
2014-05-22 23:05 ` [PATCH v5 6/6] seccomp: add SECCOMP_EXT_ACT_TSYNC and SECCOMP_FILTER_TSYNC Kees Cook
2014-05-22 23:11   ` Andy Lutomirski
2014-05-23 17:05     ` Kees Cook
2014-05-26 19:27       ` Andy Lutomirski
2014-05-27 18:24         ` Kees Cook
2014-05-27 18:40           ` Andy Lutomirski
2014-05-27 18:45             ` Kees Cook
2014-05-27 19:10               ` Andy Lutomirski
2014-05-27 19:23                 ` Kees Cook
2014-05-27 19:27                   ` Andy Lutomirski
2014-05-27 19:55                     ` Kees Cook
2014-06-02 20:53                       ` Andy Lutomirski
2014-06-03  0:14                         ` Kees Cook
2014-06-03  0:29                           ` Andy Lutomirski
2014-06-03  1:09                             ` Kees Cook
2014-06-03  1:15                               ` Andy Lutomirski
2014-06-03 19:53                                 ` Kees Cook
2014-06-02 19:47 ` [PATCH v5 0/6] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC Kees Cook
2014-06-02 19:59   ` Andy Lutomirski
2014-06-02 20:06     ` Kees Cook
2014-06-02 21:17       ` Andy Lutomirski
2014-06-02 23:05         ` Kees Cook
2014-06-02 23:08           ` Andy Lutomirski
2014-06-03 10:12             ` Michael Kerrisk
2014-06-03 23:47               ` Julien Tinnes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).