[PATCH v3 1/4] seccomp: rename "usage" to "refs" and document

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v3 1/4] seccomp: rename "usage" to "refs" and document
@ 2020-05-31 11:50 Christian Brauner
  2020-05-31 11:50 ` [PATCH v3 2/4] seccomp: release filter after task is fully dead Christian Brauner
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Christian Brauner @ 2020-05-31 11:50 UTC (permalink / raw)
  To: linux-kernel, Kees Cook
  Cc: Andy Lutomirski, Tycho Andersen, Matt Denton, Sargun Dhillon,
	Jann Horn, Chris Palmer, Aleksa Sarai, Robert Sesek,
	Jeffrey Vander Stoep, Linux Containers, Christian Brauner

Naming the lifetime counter of a seccomp filter "usage" suggests a
little too strongly that its about tasks that are using this filter
while it also tracks other references such as the user notifier or
ptrace. This also updates the documentation to note this fact.

We'll be introducing an actual usage counter in a follow-up patch.

Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matt Denton <mpdenton@google.com>
Cc: Sargun Dhillon <sargun@sargun.me>
Cc: Jann Horn <jannh@google.com>
Cc: Chris Palmer <palmer@google.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Robert Sesek <rsesek@google.com>
Cc: Jeffrey Vander Stoep <jeffv@google.com>
Cc: Linux Containers <containers@lists.linux-foundation.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
patch not present

/* v3 */
patch introduced
---
 kernel/seccomp.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 55a6184f5990..0ba2d6d0800f 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -106,10 +106,11 @@ struct notification {
 /**
  * struct seccomp_filter - container for seccomp BPF programs
  *
- * @usage: reference count to manage the object lifetime.
- *         get/put helpers should be used when accessing an instance
- *         outside of a lifetime-guarded section.  In general, this
- *         is only needed for handling filters shared across tasks.
+ * @refs: Reference count to manage the object lifetime.
+ *	  A filter's reference count is incremented for each directly
+ *	  attached task, once for the dependent filter, and if
+ *	  requested for the user notifier. When @refs reaches zero,
+ *	  the filter can be freed.
  * @log: true if all actions except for SECCOMP_RET_ALLOW should be logged
  * @prev: points to a previously installed, or inherited, filter
  * @prog: the BPF program to evaluate
@@ -124,10 +125,10 @@ struct notification {
  * how namespaces work.
  *
  * seccomp_filter objects should never be modified after being attached
- * to a task_struct (other than @usage).
+ * to a task_struct (other than @refs).
  */
 struct seccomp_filter {
-	refcount_t usage;
+	refcount_t refs;
 	bool log;
 	struct seccomp_filter *prev;
 	struct bpf_prog *prog;
@@ -461,7 +462,7 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
 		return ERR_PTR(ret);
 	}
 
-	refcount_set(&sfilter->usage, 1);
+	refcount_set(&sfilter->refs, 1);
 
 	return sfilter;
 }
@@ -554,7 +555,7 @@ static long seccomp_attach_filter(unsigned int flags,
 
 static void __get_seccomp_filter(struct seccomp_filter *filter)
 {
-	refcount_inc(&filter->usage);
+	refcount_inc(&filter->refs);
 }
 
 /* get_seccomp_filter - increments the reference count of the filter on @tsk */
@@ -577,7 +578,7 @@ static inline void seccomp_filter_free(struct seccomp_filter *filter)
 static void __put_seccomp_filter(struct seccomp_filter *orig)
 {
 	/* Clean up single-reference branches iteratively. */
-	while (orig && refcount_dec_and_test(&orig->usage)) {
+	while (orig && refcount_dec_and_test(&orig->refs)) {
 		struct seccomp_filter *freeme = orig;
 		orig = orig->prev;
 		seccomp_filter_free(freeme);

base-commit: b9bbe6ed63b2b9f2c9ee5cbd0f2c946a2723f4ce
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v3 2/4] seccomp: release filter after task is fully dead
  2020-05-31 11:50 [PATCH v3 1/4] seccomp: rename "usage" to "refs" and document Christian Brauner
@ 2020-05-31 11:50 ` Christian Brauner
  2020-06-01 18:40   ` Kees Cook
  2020-05-31 11:50 ` [PATCH v3 3/4] seccomp: notify about unused filter Christian Brauner
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: Christian Brauner @ 2020-05-31 11:50 UTC (permalink / raw)
  To: linux-kernel, Kees Cook
  Cc: Andy Lutomirski, Tycho Andersen, Matt Denton, Sargun Dhillon,
	Jann Horn, Chris Palmer, Aleksa Sarai, Robert Sesek,
	Jeffrey Vander Stoep, Linux Containers, Christian Brauner

The seccomp filter used to be released in free_task() which is called
asynchronously via call_rcu() and assorted mechanisms. Since we need
to inform tasks waiting on the seccomp notifier when a filter goes empty
we will notify them as soon as a task has been marked fully dead in
release_task(). To not split seccomp cleanup into two parts, move
filter release out of free_task() and into release_task() after we've
unhashed struct task from struct pid, exited signals, and unlinked it
from the threadgroups' thread list. We'll put the empty filter
notification infrastructure into it in a follow up patch.

This also renames put_seccomp_filter() to seccomp_filter_release() which
is a more descriptive name of what we're doing here especially once
we've added the empty filter notification mechanism in there.

We're also NULL-ing the task's filter tree entrypoint which seems
cleaner than leaving a dangling pointer in there. Note that this shouldn't
need any memory barriers since we're calling this when the task is in
release_task() which means it's EXIT_DEAD. So it can't modify it's seccomp
filters anymore. You can also see this from the point where we're calling
seccomp_filter_release(). It's after __exit_signal() and at this point,
tsk->sighand will already have been NULLed which is required for
thread-sync and filter installation alike.

Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matt Denton <mpdenton@google.com>
Cc: Sargun Dhillon <sargun@sargun.me>
Cc: Jann Horn <jannh@google.com>
Cc: Chris Palmer <palmer@google.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Robert Sesek <rsesek@google.com>
Cc: Jeffrey Vander Stoep <jeffv@google.com>
Cc: Linux Containers <containers@lists.linux-foundation.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
patch not present

/* v3 */
patch introduced
---
 include/linux/seccomp.h |  4 +--
 kernel/exit.c           |  1 +
 kernel/fork.c           |  1 -
 kernel/seccomp.c        | 60 ++++++++++++++++++++++++-----------------
 4 files changed, 38 insertions(+), 28 deletions(-)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 4192369b8418..1644922af19a 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -82,10 +82,10 @@ static inline int seccomp_mode(struct seccomp *s)
 #endif /* CONFIG_SECCOMP */
 
 #ifdef CONFIG_SECCOMP_FILTER
-extern void put_seccomp_filter(struct task_struct *tsk);
+extern void seccomp_filter_release(struct task_struct *tsk);
 extern void get_seccomp_filter(struct task_struct *tsk);
 #else  /* CONFIG_SECCOMP_FILTER */
-static inline void put_seccomp_filter(struct task_struct *tsk)
+static inline void seccomp_filter_release(struct task_struct *tsk)
 {
 	return;
 }
diff --git a/kernel/exit.c b/kernel/exit.c
index ce2a75bc0ade..5490cc04f436 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -218,6 +218,7 @@ void release_task(struct task_struct *p)
 	}
 
 	write_unlock_irq(&tasklist_lock);
+	seccomp_filter_release(p);
 	proc_flush_pid(thread_pid);
 	put_pid(thread_pid);
 	release_thread(p);
diff --git a/kernel/fork.c b/kernel/fork.c
index 48ed22774efa..b948a14da94f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -471,7 +471,6 @@ void free_task(struct task_struct *tsk)
 #endif
 	rt_mutex_debug_task_free(tsk);
 	ftrace_graph_exit_task(tsk);
-	put_seccomp_filter(tsk);
 	arch_release_task_struct(tsk);
 	if (tsk->flags & PF_KTHREAD)
 		free_kthread_struct(tsk);
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 0ba2d6d0800f..55251b1fe03f 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -367,6 +367,40 @@ static inline pid_t seccomp_can_sync_threads(void)
 	return 0;
 }
 
+static inline void seccomp_filter_free(struct seccomp_filter *filter)
+{
+	if (filter) {
+		bpf_prog_destroy(filter->prog);
+		kfree(filter);
+	}
+}
+
+static void __put_seccomp_filter(struct seccomp_filter *orig)
+{
+	/* Clean up single-reference branches iteratively. */
+	while (orig && refcount_dec_and_test(&orig->refs)) {
+		struct seccomp_filter *freeme = orig;
+		orig = orig->prev;
+		seccomp_filter_free(freeme);
+	}
+}
+
+/**
+ * seccomp_filter_release - Detach the task from its filter tree
+ *			    and drop its reference count during
+ *			    exit.
+ *
+ * This function should only be called when the task is exiting as
+ * it detaches it from its filter tree.
+ */
+void seccomp_filter_release(struct task_struct *tsk)
+{
+	struct seccomp_filter *cur = tsk->seccomp.filter;
+
+	tsk->seccomp.filter = NULL;
+	__put_seccomp_filter(cur);
+}
+
 /**
  * seccomp_sync_threads: sets all threads to use current's filter
  *
@@ -396,7 +430,7 @@ static inline void seccomp_sync_threads(unsigned long flags)
 		 * current's path will hold a reference.  (This also
 		 * allows a put before the assignment.)
 		 */
-		put_seccomp_filter(thread);
+		__put_seccomp_filter(thread->seccomp.filter);
 		smp_store_release(&thread->seccomp.filter,
 				  caller->seccomp.filter);
 
@@ -567,30 +601,6 @@ void get_seccomp_filter(struct task_struct *tsk)
 	__get_seccomp_filter(orig);
 }
 
-static inline void seccomp_filter_free(struct seccomp_filter *filter)
-{
-	if (filter) {
-		bpf_prog_destroy(filter->prog);
-		kfree(filter);
-	}
-}
-
-static void __put_seccomp_filter(struct seccomp_filter *orig)
-{
-	/* Clean up single-reference branches iteratively. */
-	while (orig && refcount_dec_and_test(&orig->refs)) {
-		struct seccomp_filter *freeme = orig;
-		orig = orig->prev;
-		seccomp_filter_free(freeme);
-	}
-}
-
-/* put_seccomp_filter - decrements the ref count of tsk->seccomp.filter */
-void put_seccomp_filter(struct task_struct *tsk)
-{
-	__put_seccomp_filter(tsk->seccomp.filter);
-}
-
 static void seccomp_init_siginfo(kernel_siginfo_t *info, int syscall, int reason)
 {
 	clear_siginfo(info);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v3 3/4] seccomp: notify about unused filter
  2020-05-31 11:50 [PATCH v3 1/4] seccomp: rename "usage" to "refs" and document Christian Brauner
  2020-05-31 11:50 ` [PATCH v3 2/4] seccomp: release filter after task is fully dead Christian Brauner
@ 2020-05-31 11:50 ` Christian Brauner
  2020-06-01 19:29   ` Kees Cook
  2020-05-31 11:50 ` [PATCH v3 4/4] tests: test seccomp filter notifications Christian Brauner
  2020-06-01 18:33 ` [PATCH v3 1/4] seccomp: rename "usage" to "refs" and document Kees Cook
  3 siblings, 1 reply; 9+ messages in thread
From: Christian Brauner @ 2020-05-31 11:50 UTC (permalink / raw)
  To: linux-kernel, Kees Cook
  Cc: Andy Lutomirski, Tycho Andersen, Matt Denton, Sargun Dhillon,
	Jann Horn, Chris Palmer, Aleksa Sarai, Robert Sesek,
	Jeffrey Vander Stoep, Linux Containers, Christian Brauner

We've been making heavy use of the seccomp notifier to intercept and
handle certain syscalls for containers. This patch allows a syscall
supervisor listening on a given notifier to be notified when a seccomp
filter has become unused.

A container is often managed by a singleton supervisor process the
so-called "monitor". This monitor process has an event loop which has
various event handlers registered. If the user specified a seccomp
profile that included a notifier for various syscalls then we also
register a seccomp notify even handler. For any container using a
separate pid namespace the lifecycle of the seccomp notifier is bound to
the init process of the pid namespace, i.e. when the init process exits
the filter must be unused.
If a new process attaches to a container we force it to assume a seccomp
profile. This can either be the same seccomp profile as the container
was started with or a modified one. If the attaching process makes use
of the seccomp notifier we will register a new seccomp notifier handler
in the monitor's event loop. However, when the attaching process exits
we can't simply delete the handler since other child processes could've
been created (daemons spawned etc.) that have inherited the seccomp
filter and so we need to keep the seccomp notifier fd alive in the event
loop. But this is problematic since we don't get a notification when the
seccomp filter has become unused and so we currently never remove the
seccomp notifier fd from the event loop and just keep accumulating fds
in the event loop. We've had this issue for a while but it has recently
become more pressing as more and larger users make use of this.

To fix this, we introduce a new "users" reference counter that tracks
any tasks and dependent filters making use of a filter. When a notifier is
registered waiting tasks will be notified that the filter is now empty by
receiving a (E)POLLHUP event.
The concept in this patch introduces is the same as for signal_struct,
i.e. reference counting for life-cycle management is decoupled from
reference counting taks using the object.

There's probably some trickery possible but the second counter is just
the correct way of doing this imho and has precedence. The patch also
lifts the waitqeue from struct notification into sruct seccomp_filter.
This is cleaner overall and let's us avoid having to take the notifier
mutex since we neither need to read nor modify the notifier specific
aspects of the seccomp filter. In the exit path I'd very much like to
avoid having to take the notifier mutex for each filter in the task's
filter hierarchy.

Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matt Denton <mpdenton@google.com>
Cc: Sargun Dhillon <sargun@sargun.me>
Cc: Jann Horn <jannh@google.com>
Cc: Chris Palmer <palmer@google.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Robert Sesek <rsesek@google.com>
Cc: Jeffrey Vander Stoep <jeffv@google.com>
Cc: Linux Containers <containers@lists.linux-foundation.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
- Jann Horn <jannh@google.com>:
  - Use more descriptive instead of seccomp_filter_notify().
    (I went with seccomp_filter_release().)

/* v3 */
- Kees Cook <keescook@chromium.org>:
  - Rename counter from "live" to "users".
---
 kernel/seccomp.c | 68 +++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 53 insertions(+), 15 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 55251b1fe03f..45244f1ba148 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -94,13 +94,11 @@ struct seccomp_knotif {
  *           filter->notify_lock.
  * @next_id: The id of the next request.
  * @notifications: A list of struct seccomp_knotif elements.
- * @wqh: A wait queue for poll.
  */
 struct notification {
 	struct semaphore request;
 	u64 next_id;
 	struct list_head notifications;
-	wait_queue_head_t wqh;
 };
 
 /**
@@ -116,6 +114,15 @@ struct notification {
  * @prog: the BPF program to evaluate
  * @notif: the struct that holds all notification related information
  * @notify_lock: A lock for all notification-related accesses.
+ * @wqh: A wait queue for poll if a notifier is in use.
+ * @users: A filter's @users count is incremented for each directly
+ *         attached task (filter installation, fork(), thread_sync),
+ *	   and once for the dependent filter (tracked in filter->prev).
+ *	   When it reaches zero it indicates that no direct or indirect
+ *	   users of that filter exist. No new tasks can get associated with
+ *	   this filter after reaching 0. The @users count is always smaller
+ *	   or equal to @refs. Hence, reaching 0 for @users does not mean
+ *	   the filter can be freed.
  *
  * seccomp_filter objects are organized in a tree linked via the @prev
  * pointer.  For any task, it appears to be a singly-linked list starting
@@ -134,6 +141,8 @@ struct seccomp_filter {
 	struct bpf_prog *prog;
 	struct notification *notif;
 	struct mutex notify_lock;
+	refcount_t users;
+	wait_queue_head_t wqh;
 };
 
 /* Limit any path through the tree to 256KB worth of instructions. */
@@ -375,6 +384,15 @@ static inline void seccomp_filter_free(struct seccomp_filter *filter)
 	}
 }
 
+static void __seccomp_filter_orphan(struct seccomp_filter *orig)
+{
+	while (orig && refcount_dec_and_test(&orig->users)) {
+		if (waitqueue_active(&orig->wqh))
+			wake_up_poll(&orig->wqh, EPOLLHUP);
+		orig = orig->prev;
+	}
+}
+
 static void __put_seccomp_filter(struct seccomp_filter *orig)
 {
 	/* Clean up single-reference branches iteratively. */
@@ -386,19 +404,23 @@ static void __put_seccomp_filter(struct seccomp_filter *orig)
 }
 
 /**
- * seccomp_filter_release - Detach the task from its filter tree
- *			    and drop its reference count during
- *			    exit.
+ * seccomp_filter_release - Detach the task from its filter tree,
+ *			    drop its reference count, and notify
+ *			    about unused filters
  *
  * This function should only be called when the task is exiting as
  * it detaches it from its filter tree.
  */
 void seccomp_filter_release(struct task_struct *tsk)
 {
-	struct seccomp_filter *cur = tsk->seccomp.filter;
+	struct seccomp_filter *orig = tsk->seccomp.filter;
 
+	/* Detach task from its filter tree. */
 	tsk->seccomp.filter = NULL;
-	__put_seccomp_filter(cur);
+	/* Notify about any unused filters in the task's former filter tree. */
+	__seccomp_filter_orphan(orig);
+	/* Finally drop all references to the task's former tree. */
+	__put_seccomp_filter(orig);
 }
 
 /**
@@ -419,18 +441,29 @@ static inline void seccomp_sync_threads(unsigned long flags)
 	/* Synchronize all threads. */
 	caller = current;
 	for_each_thread(caller, thread) {
+		struct seccomp_filter *cur = thread->seccomp.filter;
+
 		/* Skip current, since it needs no changes. */
 		if (thread == caller)
 			continue;
 
 		/* Get a task reference for the new leaf node. */
 		get_seccomp_filter(caller);
+
+		/*
+		 * Notify everyone as we're forcing the thread
+		 * to orphan its current filter tree.
+		 */
+		__seccomp_filter_orphan(cur);
+
 		/*
-		 * Drop the task reference to the shared ancestor since
-		 * current's path will hold a reference.  (This also
-		 * allows a put before the assignment.)
+		 * Drop the task's reference to the shared ancestor
+		 * since current's path will hold a reference.
+		 * (This also allows a put before the assignment.)
 		 */
-		__put_seccomp_filter(thread->seccomp.filter);
+		__put_seccomp_filter(cur);
+
+		/* Make our new filter tree visible. */
 		smp_store_release(&thread->seccomp.filter,
 				  caller->seccomp.filter);
 
@@ -497,6 +530,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
 	}
 
 	refcount_set(&sfilter->refs, 1);
+	refcount_set(&sfilter->users, 1);
+	init_waitqueue_head(&sfilter->wqh);
 
 	return sfilter;
 }
@@ -599,6 +634,7 @@ void get_seccomp_filter(struct task_struct *tsk)
 	if (!orig)
 		return;
 	__get_seccomp_filter(orig);
+	refcount_inc(&orig->users);
 }
 
 static void seccomp_init_siginfo(kernel_siginfo_t *info, int syscall, int reason)
@@ -768,7 +804,7 @@ static int seccomp_do_user_notification(int this_syscall,
 	list_add(&n.list, &match->notif->notifications);
 
 	up(&match->notif->request);
-	wake_up_poll(&match->notif->wqh, EPOLLIN | EPOLLRDNORM);
+	wake_up_poll(&match->wqh, EPOLLIN | EPOLLRDNORM);
 	mutex_unlock(&match->notify_lock);
 
 	/*
@@ -1075,7 +1111,7 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
 	unotif.data = *(knotif->data);
 
 	knotif->state = SECCOMP_NOTIFY_SENT;
-	wake_up_poll(&filter->notif->wqh, EPOLLOUT | EPOLLWRNORM);
+	wake_up_poll(&filter->wqh, EPOLLOUT | EPOLLWRNORM);
 	ret = 0;
 out:
 	mutex_unlock(&filter->notify_lock);
@@ -1211,7 +1247,7 @@ static __poll_t seccomp_notify_poll(struct file *file,
 	__poll_t ret = 0;
 	struct seccomp_knotif *cur;
 
-	poll_wait(file, &filter->notif->wqh, poll_tab);
+	poll_wait(file, &filter->wqh, poll_tab);
 
 	if (mutex_lock_interruptible(&filter->notify_lock) < 0)
 		return EPOLLERR;
@@ -1227,6 +1263,9 @@ static __poll_t seccomp_notify_poll(struct file *file,
 
 	mutex_unlock(&filter->notify_lock);
 
+	if (refcount_read(&filter->users) == 0)
+		ret |= EPOLLHUP;
+
 	return ret;
 }
 
@@ -1255,7 +1294,6 @@ static struct file *init_listener(struct seccomp_filter *filter)
 	sema_init(&filter->notif->request, 0);
 	filter->notif->next_id = get_random_u64();
 	INIT_LIST_HEAD(&filter->notif->notifications);
-	init_waitqueue_head(&filter->notif->wqh);
 
 	ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
 				 filter, O_RDWR);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v3 4/4] tests: test seccomp filter notifications
  2020-05-31 11:50 [PATCH v3 1/4] seccomp: rename "usage" to "refs" and document Christian Brauner
  2020-05-31 11:50 ` [PATCH v3 2/4] seccomp: release filter after task is fully dead Christian Brauner
  2020-05-31 11:50 ` [PATCH v3 3/4] seccomp: notify about unused filter Christian Brauner
@ 2020-05-31 11:50 ` Christian Brauner
  2020-06-01 19:31   ` Kees Cook
  2020-06-01 18:33 ` [PATCH v3 1/4] seccomp: rename "usage" to "refs" and document Kees Cook
  3 siblings, 1 reply; 9+ messages in thread
From: Christian Brauner @ 2020-05-31 11:50 UTC (permalink / raw)
  To: linux-kernel, Kees Cook
  Cc: Andy Lutomirski, Tycho Andersen, Matt Denton, Sargun Dhillon,
	Jann Horn, Chris Palmer, Aleksa Sarai, Robert Sesek,
	Jeffrey Vander Stoep, Linux Containers, Christian Brauner

This verifies we're correctly notified when a seccomp filter becomes
unused when a notifier is in use.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
unchanged

/* v3 */
At first it seemed sensible to add POLLHUP to all poll invocations but
all checks test for revents to be equal to POLLIN. Hence, when POLLHUP
is reported we'd fail the test so we don't gain anyhing by testing for
POLLHUP additionally.
---
 tools/testing/selftests/seccomp/seccomp_bpf.c | 136 ++++++++++++++++++
 1 file changed, 136 insertions(+)

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index c0aa46ce14f6..4dae278cf77e 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -51,6 +51,7 @@
 #include <poll.h>
 
 #include "../kselftest_harness.h"
+#include "../clone3/clone3_selftests.h"
 
 #ifndef PR_SET_PTRACER
 # define PR_SET_PTRACER 0x59616d61
@@ -3686,6 +3687,141 @@ TEST(user_notification_continue)
 	}
 }
 
+TEST(user_notification_filter_empty)
+{
+	pid_t pid;
+	long ret;
+	int status;
+	struct pollfd pollfd;
+	struct clone_args args = {
+		.flags = CLONE_FILES,
+		.exit_signal = SIGCHLD,
+	};
+
+	ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+	ASSERT_EQ(0, ret) {
+		TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+	}
+
+	pid = sys_clone3(&args, sizeof(args));
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		int listener;
+
+		listener = user_trap_syscall(__NR_mknod, SECCOMP_FILTER_FLAG_NEW_LISTENER);
+		if (listener < 0)
+			_exit(EXIT_FAILURE);
+
+		if (dup2(listener, 200) != 200)
+			_exit(EXIT_FAILURE);
+
+		close(listener);
+
+		_exit(EXIT_SUCCESS);
+	}
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	/*
+	 * The seccomp filter has become unused so we should be notified once
+	 * the kernel gets around to cleaning up task struct.
+	 */
+	pollfd.fd = 200;
+	pollfd.events = POLLHUP;
+
+	EXPECT_GT(poll(&pollfd, 1, -1), 0);
+	EXPECT_GT((pollfd.revents & POLLHUP) ?: 0, 0);
+}
+
+static void *do_thread(void *data)
+{
+	return NULL;
+}
+
+TEST(user_notification_filter_empty_threaded)
+{
+	pid_t pid;
+	long ret;
+	int status;
+	struct pollfd pollfd;
+	struct clone_args args = {
+		.flags = CLONE_FILES,
+		.exit_signal = SIGCHLD,
+	};
+
+	ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+	ASSERT_EQ(0, ret) {
+		TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+	}
+
+	pid = sys_clone3(&args, sizeof(args));
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		pid_t pid1, pid2;
+		int listener, status;
+		pthread_t thread;
+
+		listener = user_trap_syscall(__NR_dup, SECCOMP_FILTER_FLAG_NEW_LISTENER);
+		if (listener < 0)
+			_exit(EXIT_FAILURE);
+
+		if (dup2(listener, 200) != 200)
+			_exit(EXIT_FAILURE);
+
+		close(listener);
+
+		pid1 = fork();
+		if (pid1 < 0)
+			_exit(EXIT_FAILURE);
+
+		if (pid1 == 0)
+			_exit(EXIT_SUCCESS);
+
+		pid2 = fork();
+		if (pid2 < 0)
+			_exit(EXIT_FAILURE);
+
+		if (pid2 == 0)
+			_exit(EXIT_SUCCESS);
+
+		if (pthread_create(&thread, NULL, do_thread, NULL) ||
+		    pthread_join(thread, NULL))
+			_exit(EXIT_FAILURE);
+
+		if (pthread_create(&thread, NULL, do_thread, NULL) ||
+		    pthread_join(thread, NULL))
+			_exit(EXIT_FAILURE);
+
+		if (waitpid(pid1, &status, 0) != pid1 || !WIFEXITED(status) ||
+		    WEXITSTATUS(status))
+			_exit(EXIT_FAILURE);
+
+		if (waitpid(pid2, &status, 0) != pid2 || !WIFEXITED(status) ||
+		    WEXITSTATUS(status))
+			_exit(EXIT_FAILURE);
+
+		exit(EXIT_SUCCESS);
+	}
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	/*
+	 * The seccomp filter has become unused so we should be notified once
+	 * the kernel gets around to cleaning up task struct.
+	 */
+	pollfd.fd = 200;
+	pollfd.events = POLLHUP;
+
+	EXPECT_GT(poll(&pollfd, 1, -1), 0);
+	EXPECT_GT((pollfd.revents & POLLHUP) ?: 0, 0);
+}
+
 /*
  * TODO:
  * - add microbenchmarks
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v3 1/4] seccomp: rename "usage" to "refs" and document
  2020-05-31 11:50 [PATCH v3 1/4] seccomp: rename "usage" to "refs" and document Christian Brauner
                   ` (2 preceding siblings ...)
  2020-05-31 11:50 ` [PATCH v3 4/4] tests: test seccomp filter notifications Christian Brauner
@ 2020-06-01 18:33 ` Kees Cook
  3 siblings, 0 replies; 9+ messages in thread
From: Kees Cook @ 2020-06-01 18:33 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-kernel, Andy Lutomirski, Tycho Andersen, Matt Denton,
	Sargun Dhillon, Jann Horn, Chris Palmer, Aleksa Sarai,
	Robert Sesek, Jeffrey Vander Stoep, Linux Containers

On Sun, May 31, 2020 at 01:50:28PM +0200, Christian Brauner wrote:
> Naming the lifetime counter of a seccomp filter "usage" suggests a
> little too strongly that its about tasks that are using this filter
> while it also tracks other references such as the user notifier or
> ptrace. This also updates the documentation to note this fact.
> 
> We'll be introducing an actual usage counter in a follow-up patch.
> 
> Cc: Tycho Andersen <tycho@tycho.ws>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Matt Denton <mpdenton@google.com>
> Cc: Sargun Dhillon <sargun@sargun.me>
> Cc: Jann Horn <jannh@google.com>
> Cc: Chris Palmer <palmer@google.com>
> Cc: Aleksa Sarai <cyphar@cyphar.com>
> Cc: Robert Sesek <rsesek@google.com>
> Cc: Jeffrey Vander Stoep <jeffv@google.com>
> Cc: Linux Containers <containers@lists.linux-foundation.org>
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

Thanks! Applied to for-next/seccomp.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3 2/4] seccomp: release filter after task is fully dead
  2020-05-31 11:50 ` [PATCH v3 2/4] seccomp: release filter after task is fully dead Christian Brauner
@ 2020-06-01 18:40   ` Kees Cook
  0 siblings, 0 replies; 9+ messages in thread
From: Kees Cook @ 2020-06-01 18:40 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-kernel, Andy Lutomirski, Tycho Andersen, Matt Denton,
	Sargun Dhillon, Jann Horn, Chris Palmer, Aleksa Sarai,
	Robert Sesek, Jeffrey Vander Stoep, Linux Containers

On Sun, May 31, 2020 at 01:50:29PM +0200, Christian Brauner wrote:
> The seccomp filter used to be released in free_task() which is called
> asynchronously via call_rcu() and assorted mechanisms. Since we need
> to inform tasks waiting on the seccomp notifier when a filter goes empty
> we will notify them as soon as a task has been marked fully dead in
> release_task(). To not split seccomp cleanup into two parts, move
> filter release out of free_task() and into release_task() after we've
> unhashed struct task from struct pid, exited signals, and unlinked it
> from the threadgroups' thread list. We'll put the empty filter
> notification infrastructure into it in a follow up patch.
> 
> This also renames put_seccomp_filter() to seccomp_filter_release() which
> is a more descriptive name of what we're doing here especially once
> we've added the empty filter notification mechanism in there.
> 
> We're also NULL-ing the task's filter tree entrypoint which seems
> cleaner than leaving a dangling pointer in there. Note that this shouldn't
> need any memory barriers since we're calling this when the task is in
> release_task() which means it's EXIT_DEAD. So it can't modify it's seccomp
> filters anymore. You can also see this from the point where we're calling
> seccomp_filter_release(). It's after __exit_signal() and at this point,
> tsk->sighand will already have been NULLed which is required for
> thread-sync and filter installation alike.
> 
> Cc: Tycho Andersen <tycho@tycho.ws>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Matt Denton <mpdenton@google.com>
> Cc: Sargun Dhillon <sargun@sargun.me>
> Cc: Jann Horn <jannh@google.com>
> Cc: Chris Palmer <palmer@google.com>
> Cc: Aleksa Sarai <cyphar@cyphar.com>
> Cc: Robert Sesek <rsesek@google.com>
> Cc: Jeffrey Vander Stoep <jeffv@google.com>
> Cc: Linux Containers <containers@lists.linux-foundation.org>
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

Thanks! Applied with typo fixes to the commit log, a slightly expanded
comment on seccomp_filter_release() to just drive home the reason we
don't need barriers, and a variable renaming to avoid some needless
churn in the coming patches...

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3 3/4] seccomp: notify about unused filter
  2020-05-31 11:50 ` [PATCH v3 3/4] seccomp: notify about unused filter Christian Brauner
@ 2020-06-01 19:29   ` Kees Cook
  2020-06-02 11:35     ` Christian Brauner
  0 siblings, 1 reply; 9+ messages in thread
From: Kees Cook @ 2020-06-01 19:29 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-kernel, Andy Lutomirski, Tycho Andersen, Matt Denton,
	Sargun Dhillon, Jann Horn, Chris Palmer, Aleksa Sarai,
	Robert Sesek, Jeffrey Vander Stoep, Linux Containers

On Sun, May 31, 2020 at 01:50:30PM +0200, Christian Brauner wrote:
> We've been making heavy use of the seccomp notifier to intercept and
> handle certain syscalls for containers. This patch allows a syscall
> supervisor listening on a given notifier to be notified when a seccomp
> filter has become unused.
> 
> A container is often managed by a singleton supervisor process the
> so-called "monitor". This monitor process has an event loop which has
> various event handlers registered. If the user specified a seccomp
> profile that included a notifier for various syscalls then we also
> register a seccomp notify even handler. For any container using a
> separate pid namespace the lifecycle of the seccomp notifier is bound to
> the init process of the pid namespace, i.e. when the init process exits
> the filter must be unused.
> If a new process attaches to a container we force it to assume a seccomp
> profile. This can either be the same seccomp profile as the container
> was started with or a modified one. If the attaching process makes use
> of the seccomp notifier we will register a new seccomp notifier handler
> in the monitor's event loop. However, when the attaching process exits
> we can't simply delete the handler since other child processes could've
> been created (daemons spawned etc.) that have inherited the seccomp
> filter and so we need to keep the seccomp notifier fd alive in the event
> loop. But this is problematic since we don't get a notification when the
> seccomp filter has become unused and so we currently never remove the
> seccomp notifier fd from the event loop and just keep accumulating fds
> in the event loop. We've had this issue for a while but it has recently
> become more pressing as more and larger users make use of this.
> 
> To fix this, we introduce a new "users" reference counter that tracks
> any tasks and dependent filters making use of a filter. When a notifier is
> registered waiting tasks will be notified that the filter is now empty by
> receiving a (E)POLLHUP event.
> The concept in this patch introduces is the same as for signal_struct,
> i.e. reference counting for life-cycle management is decoupled from
> reference counting taks using the object.
> 
> There's probably some trickery possible but the second counter is just
> the correct way of doing this imho and has precedence. The patch also
> lifts the waitqeue from struct notification into sruct seccomp_filter.
> This is cleaner overall and let's us avoid having to take the notifier
> mutex since we neither need to read nor modify the notifier specific
> aspects of the seccomp filter. In the exit path I'd very much like to
> avoid having to take the notifier mutex for each filter in the task's
> filter hierarchy.
> 
> Cc: Tycho Andersen <tycho@tycho.ws>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Matt Denton <mpdenton@google.com>
> Cc: Sargun Dhillon <sargun@sargun.me>
> Cc: Jann Horn <jannh@google.com>
> Cc: Chris Palmer <palmer@google.com>
> Cc: Aleksa Sarai <cyphar@cyphar.com>
> Cc: Robert Sesek <rsesek@google.com>
> Cc: Jeffrey Vander Stoep <jeffv@google.com>
> Cc: Linux Containers <containers@lists.linux-foundation.org>
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> ---
> /* v2 */
> - Jann Horn <jannh@google.com>:
>   - Use more descriptive instead of seccomp_filter_notify().
>     (I went with seccomp_filter_release().)
> 
> /* v3 */
> - Kees Cook <keescook@chromium.org>:
>   - Rename counter from "live" to "users".
> ---
>  kernel/seccomp.c | 68 +++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 53 insertions(+), 15 deletions(-)
> 
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 55251b1fe03f..45244f1ba148 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -94,13 +94,11 @@ struct seccomp_knotif {
>   *           filter->notify_lock.
>   * @next_id: The id of the next request.
>   * @notifications: A list of struct seccomp_knotif elements.
> - * @wqh: A wait queue for poll.
>   */

I split the wait queue changes into a separate patch...

>  /**
> - * seccomp_filter_release - Detach the task from its filter tree
> - *			    and drop its reference count during
> - *			    exit.
> + * seccomp_filter_release - Detach the task from its filter tree,
> + *			    drop its reference count, and notify
> + *			    about unused filters
>   *
>   * This function should only be called when the task is exiting as
>   * it detaches it from its filter tree.
>   */
>  void seccomp_filter_release(struct task_struct *tsk)
>  {
> -	struct seccomp_filter *cur = tsk->seccomp.filter;
> +	struct seccomp_filter *orig = tsk->seccomp.filter;
>  
> +	/* Detach task from its filter tree. */
>  	tsk->seccomp.filter = NULL;
> -	__put_seccomp_filter(cur);
> +	/* Notify about any unused filters in the task's former filter tree. */
> +	__seccomp_filter_orphan(orig);
> +	/* Finally drop all references to the task's former tree. */
> +	__put_seccomp_filter(orig);
>  }

I added __seccomp_filter_release() to do the filter-specific parts (the
two functions passing "orig" above, so that it can be reused later...

>  
>  /**
> @@ -419,18 +441,29 @@ static inline void seccomp_sync_threads(unsigned long flags)
>  	/* Synchronize all threads. */
>  	caller = current;
>  	for_each_thread(caller, thread) {
> +		struct seccomp_filter *cur = thread->seccomp.filter;
> +
>  		/* Skip current, since it needs no changes. */
>  		if (thread == caller)
>  			continue;
>  
>  		/* Get a task reference for the new leaf node. */
>  		get_seccomp_filter(caller);
> +
> +		/*
> +		 * Notify everyone as we're forcing the thread
> +		 * to orphan its current filter tree.
> +		 */
> +		__seccomp_filter_orphan(cur);
> +
>  		/*
> -		 * Drop the task reference to the shared ancestor since
> -		 * current's path will hold a reference.  (This also
> -		 * allows a put before the assignment.)
> +		 * Drop the task's reference to the shared ancestor
> +		 * since current's path will hold a reference.
> +		 * (This also allows a put before the assignment.)
>  		 */
> -		__put_seccomp_filter(thread->seccomp.filter);
> +		__put_seccomp_filter(cur);

I switched this around to just call the new __seccomp_release_filter()
(there's no need to open-code this and add "cur"). I also removed the
comment about the notification, because that's not possible: "thread"
shares the same filter hierarchy as "caller", so the counts on "cur"
cannot reach 0 (no notifications can ever happen due to TSYNC).

Everything else looks great! I've applied it to for-next/seccomp.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3 4/4] tests: test seccomp filter notifications
  2020-05-31 11:50 ` [PATCH v3 4/4] tests: test seccomp filter notifications Christian Brauner
@ 2020-06-01 19:31   ` Kees Cook
  0 siblings, 0 replies; 9+ messages in thread
From: Kees Cook @ 2020-06-01 19:31 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-kernel, Andy Lutomirski, Tycho Andersen, Matt Denton,
	Sargun Dhillon, Jann Horn, Chris Palmer, Aleksa Sarai,
	Robert Sesek, Jeffrey Vander Stoep, Linux Containers

On Sun, May 31, 2020 at 01:50:31PM +0200, Christian Brauner wrote:
> This verifies we're correctly notified when a seccomp filter becomes
> unused when a notifier is in use.
> 
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> ---
> /* v2 */
> unchanged
> 
> /* v3 */
> At first it seemed sensible to add POLLHUP to all poll invocations but
> all checks test for revents to be equal to POLLIN. Hence, when POLLHUP
> is reported we'd fail the test so we don't gain anyhing by testing for
> POLLHUP additionally.

Ah! Well good; the tests were already sensitive enough. ;)

Applied to for-next/seccomp.

Since 5.7 just released, I'll be waiting for rc1 before actually
pushing for-next/seccomp to linux-next.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3 3/4] seccomp: notify about unused filter
  2020-06-01 19:29   ` Kees Cook
@ 2020-06-02 11:35     ` Christian Brauner
  0 siblings, 0 replies; 9+ messages in thread
From: Christian Brauner @ 2020-06-02 11:35 UTC (permalink / raw)
  To: Kees Cook
  Cc: Robert Sesek, Chris Palmer, Jann Horn, Jeffrey Vander Stoep,
	Linux Containers, linux-kernel, Matt Denton, Andy Lutomirski

On Mon, Jun 01, 2020 at 12:29:27PM -0700, Kees Cook wrote:
> On Sun, May 31, 2020 at 01:50:30PM +0200, Christian Brauner wrote:
> > We've been making heavy use of the seccomp notifier to intercept and
> > handle certain syscalls for containers. This patch allows a syscall
> > supervisor listening on a given notifier to be notified when a seccomp
> > filter has become unused.
> > 
> > A container is often managed by a singleton supervisor process the
> > so-called "monitor". This monitor process has an event loop which has
> > various event handlers registered. If the user specified a seccomp
> > profile that included a notifier for various syscalls then we also
> > register a seccomp notify even handler. For any container using a
> > separate pid namespace the lifecycle of the seccomp notifier is bound to
> > the init process of the pid namespace, i.e. when the init process exits
> > the filter must be unused.
> > If a new process attaches to a container we force it to assume a seccomp
> > profile. This can either be the same seccomp profile as the container
> > was started with or a modified one. If the attaching process makes use
> > of the seccomp notifier we will register a new seccomp notifier handler
> > in the monitor's event loop. However, when the attaching process exits
> > we can't simply delete the handler since other child processes could've
> > been created (daemons spawned etc.) that have inherited the seccomp
> > filter and so we need to keep the seccomp notifier fd alive in the event
> > loop. But this is problematic since we don't get a notification when the
> > seccomp filter has become unused and so we currently never remove the
> > seccomp notifier fd from the event loop and just keep accumulating fds
> > in the event loop. We've had this issue for a while but it has recently
> > become more pressing as more and larger users make use of this.
> > 
> > To fix this, we introduce a new "users" reference counter that tracks
> > any tasks and dependent filters making use of a filter. When a notifier is
> > registered waiting tasks will be notified that the filter is now empty by
> > receiving a (E)POLLHUP event.
> > The concept in this patch introduces is the same as for signal_struct,
> > i.e. reference counting for life-cycle management is decoupled from
> > reference counting taks using the object.
> > 
> > There's probably some trickery possible but the second counter is just
> > the correct way of doing this imho and has precedence. The patch also
> > lifts the waitqeue from struct notification into sruct seccomp_filter.
> > This is cleaner overall and let's us avoid having to take the notifier
> > mutex since we neither need to read nor modify the notifier specific
> > aspects of the seccomp filter. In the exit path I'd very much like to
> > avoid having to take the notifier mutex for each filter in the task's
> > filter hierarchy.
> > 
> > Cc: Tycho Andersen <tycho@tycho.ws>
> > Cc: Kees Cook <keescook@chromium.org>
> > Cc: Matt Denton <mpdenton@google.com>
> > Cc: Sargun Dhillon <sargun@sargun.me>
> > Cc: Jann Horn <jannh@google.com>
> > Cc: Chris Palmer <palmer@google.com>
> > Cc: Aleksa Sarai <cyphar@cyphar.com>
> > Cc: Robert Sesek <rsesek@google.com>
> > Cc: Jeffrey Vander Stoep <jeffv@google.com>
> > Cc: Linux Containers <containers@lists.linux-foundation.org>
> > Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> > ---
> > /* v2 */
> > - Jann Horn <jannh@google.com>:
> >   - Use more descriptive instead of seccomp_filter_notify().
> >     (I went with seccomp_filter_release().)
> > 
> > /* v3 */
> > - Kees Cook <keescook@chromium.org>:
> >   - Rename counter from "live" to "users".
> > ---
> >  kernel/seccomp.c | 68 +++++++++++++++++++++++++++++++++++++-----------
> >  1 file changed, 53 insertions(+), 15 deletions(-)
> > 
> > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > index 55251b1fe03f..45244f1ba148 100644
> > --- a/kernel/seccomp.c
> > +++ b/kernel/seccomp.c
> > @@ -94,13 +94,11 @@ struct seccomp_knotif {
> >   *           filter->notify_lock.
> >   * @next_id: The id of the next request.
> >   * @notifications: A list of struct seccomp_knotif elements.
> > - * @wqh: A wait queue for poll.
> >   */
> 
> I split the wait queue changes into a separate patch...
> 
> >  /**
> > - * seccomp_filter_release - Detach the task from its filter tree
> > - *			    and drop its reference count during
> > - *			    exit.
> > + * seccomp_filter_release - Detach the task from its filter tree,
> > + *			    drop its reference count, and notify
> > + *			    about unused filters
> >   *
> >   * This function should only be called when the task is exiting as
> >   * it detaches it from its filter tree.
> >   */
> >  void seccomp_filter_release(struct task_struct *tsk)
> >  {
> > -	struct seccomp_filter *cur = tsk->seccomp.filter;
> > +	struct seccomp_filter *orig = tsk->seccomp.filter;
> >  
> > +	/* Detach task from its filter tree. */
> >  	tsk->seccomp.filter = NULL;
> > -	__put_seccomp_filter(cur);
> > +	/* Notify about any unused filters in the task's former filter tree. */
> > +	__seccomp_filter_orphan(orig);
> > +	/* Finally drop all references to the task's former tree. */
> > +	__put_seccomp_filter(orig);
> >  }
> 
> I added __seccomp_filter_release() to do the filter-specific parts (the
> two functions passing "orig" above, so that it can be reused later...
> 
> >  
> >  /**
> > @@ -419,18 +441,29 @@ static inline void seccomp_sync_threads(unsigned long flags)
> >  	/* Synchronize all threads. */
> >  	caller = current;
> >  	for_each_thread(caller, thread) {
> > +		struct seccomp_filter *cur = thread->seccomp.filter;
> > +
> >  		/* Skip current, since it needs no changes. */
> >  		if (thread == caller)
> >  			continue;
> >  
> >  		/* Get a task reference for the new leaf node. */
> >  		get_seccomp_filter(caller);
> > +
> > +		/*
> > +		 * Notify everyone as we're forcing the thread
> > +		 * to orphan its current filter tree.
> > +		 */
> > +		__seccomp_filter_orphan(cur);
> > +
> >  		/*
> > -		 * Drop the task reference to the shared ancestor since
> > -		 * current's path will hold a reference.  (This also
> > -		 * allows a put before the assignment.)
> > +		 * Drop the task's reference to the shared ancestor
> > +		 * since current's path will hold a reference.
> > +		 * (This also allows a put before the assignment.)
> >  		 */
> > -		__put_seccomp_filter(thread->seccomp.filter);
> > +		__put_seccomp_filter(cur);
> 
> I switched this around to just call the new __seccomp_release_filter()
> (there's no need to open-code this and add "cur"). I also removed the
> comment about the notification, because that's not possible: "thread"
> shares the same filter hierarchy as "caller", so the counts on "cur"
> cannot reach 0 (no notifications can ever happen due to TSYNC).
> 
> Everything else looks great! I've applied it to for-next/seccomp.

Excellent, thanks!

Just in case this isn't obvious to everyone, I want to point out, that
this patchset means the seccomp notifier can be (ab)used to receive exit
notifications for a task tree (given some caveats) sharing the same
filter. In any case, I'd rather have a proper thread management
implementation of this through pidfds in the future if there's a need
for this. But one can abuse seccomp to achieve something similar with
my change here.
(Technically, we could also expand the notifier to have a "listen" mode
 whereby it only notifies about e.g. filter installation and filter)
 orphanage.)

Christian

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2020-06-02 11:35 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-31 11:50 [PATCH v3 1/4] seccomp: rename "usage" to "refs" and document Christian Brauner
2020-05-31 11:50 ` [PATCH v3 2/4] seccomp: release filter after task is fully dead Christian Brauner
2020-06-01 18:40   ` Kees Cook
2020-05-31 11:50 ` [PATCH v3 3/4] seccomp: notify about unused filter Christian Brauner
2020-06-01 19:29   ` Kees Cook
2020-06-02 11:35     ` Christian Brauner
2020-05-31 11:50 ` [PATCH v3 4/4] tests: test seccomp filter notifications Christian Brauner
2020-06-01 19:31   ` Kees Cook
2020-06-01 18:33 ` [PATCH v3 1/4] seccomp: rename "usage" to "refs" and document Kees Cook

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).