linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 0/2] Handle seccomp notification preemption
@ 2022-04-29  2:31 Sargun Dhillon
  2022-04-29  2:31 ` [PATCH v3 1/2] seccomp: Add wait_killable semantic to seccomp user notifier Sargun Dhillon
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Sargun Dhillon @ 2022-04-29  2:31 UTC (permalink / raw)
  To: Kees Cook, LKML, Linux Containers
  Cc: Sargun Dhillon, Rodrigo Campos, Christian Brauner,
	Giuseppe Scrivano, Will Drewry, Andy Lutomirski, Alban Crequy

This patchset addresses a race condition we've dealt with recently with
seccomp. Specifically programs interrupting syscalls while they're in
progress. This was exacerbated by Golang's[1] recent adoption of
"Non-cooperative goroutine preemption", in which they try to interrupt any
syscall that's been running for more than 10ms. During certain syscalls,
it's non-trivial to write them in a reetrant manner in userspace (mount).

It allows a per-filter flag to be set that makes it so that the notifying
process will switch to "TASK_KILLABLE" as opposed to returning to userspace
on non-fatal signals.

Changes since v2[3]:
 * Split out addfd patches
 * Move the flag to be per-filter (as opposed to per notification)

Changes since v1[2]:
 * Fix some documentation
 * Add Rata's patches to allow for direct return from addfd

[1]: https://github.com/golang/proposal/blob/master/design/24543-non-cooperative-preemption.md
[2]: https://lore.kernel.org/lkml/20210220090502.7202-1-sargun@sargun.me/
[3]: https://lore.kernel.org/all/20210426180610.2363-1-sargun@sargun.me/

Sargun Dhillon (2):
  seccomp: Add wait_killable semantic to seccomp user notifier
  selftests/seccomp: Add test for wait killable notifier

 .../userspace-api/seccomp_filter.rst          |   8 +
 include/linux/seccomp.h                       |   3 +-
 include/uapi/linux/seccomp.h                  |   2 +
 kernel/seccomp.c                              |  42 ++-
 tools/testing/selftests/seccomp/seccomp_bpf.c | 240 ++++++++++++++++++
 5 files changed, 292 insertions(+), 3 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v3 1/2] seccomp: Add wait_killable semantic to seccomp user notifier
  2022-04-29  2:31 [PATCH v3 0/2] Handle seccomp notification preemption Sargun Dhillon
@ 2022-04-29  2:31 ` Sargun Dhillon
  2022-04-29  9:42   ` Rodrigo Campos
                     ` (2 more replies)
  2022-04-29  2:31 ` [PATCH v3 2/2] selftests/seccomp: Add test for wait killable notifier Sargun Dhillon
  2022-04-29  9:24 ` [PATCH v3 0/2] Handle seccomp notification preemption Rodrigo Campos
  2 siblings, 3 replies; 15+ messages in thread
From: Sargun Dhillon @ 2022-04-29  2:31 UTC (permalink / raw)
  To: Kees Cook, LKML, Linux Containers
  Cc: Sargun Dhillon, Rodrigo Campos, Christian Brauner,
	Giuseppe Scrivano, Will Drewry, Andy Lutomirski, Alban Crequy

This introduces a per-filter flag (SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV)
that makes it so that when notifications are received by the supervisor
the notifying process will transition to wait killable semantics. Although
wait killable isn't a set of semantics formally exposed to userspace,
the concept is searchable. If the notifying process is signaled prior
to the notification being received by the userspace agent, it will
be handled as normal.

One quirk about how this is handled is that the notifying process
only switches to TASK_KILLABLE if it receives a wakeup from either
an addfd or a signal. This is to avoid an unnecessary wakeup of
the notifying task.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
---
 .../userspace-api/seccomp_filter.rst          |  8 ++++
 include/linux/seccomp.h                       |  3 +-
 include/uapi/linux/seccomp.h                  |  2 +
 kernel/seccomp.c                              | 42 ++++++++++++++++++-
 4 files changed, 52 insertions(+), 3 deletions(-)

diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
index 539e9d4a4860..204cf5ba511a 100644
--- a/Documentation/userspace-api/seccomp_filter.rst
+++ b/Documentation/userspace-api/seccomp_filter.rst
@@ -271,6 +271,14 @@ notifying process it will be replaced. The supervisor can also add an FD, and
 respond atomically by using the ``SECCOMP_ADDFD_FLAG_SEND`` flag and the return
 value will be the injected file descriptor number.
 
+The notifying process can be preempted, resulting in the notification being
+aborted. This can be problematic when trying to take actions on behalf of the
+notifying process that are long-running and typically retryable (mounting a
+filesytem). Alternatively, the at filter installation time, the
+``SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV`` flag can be set. This flag makes it
+such that when a user notification is received by the supervisor, the notifying
+process will ignore non-fatal signals until the response is sent.
+
 It is worth noting that ``struct seccomp_data`` contains the values of register
 arguments to the syscall, but does not contain pointers to memory. The task's
 memory is accessible to suitably privileged traces via ``ptrace()`` or
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 0c564e5d40ff..d31d76be4982 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -8,7 +8,8 @@
 					 SECCOMP_FILTER_FLAG_LOG | \
 					 SECCOMP_FILTER_FLAG_SPEC_ALLOW | \
 					 SECCOMP_FILTER_FLAG_NEW_LISTENER | \
-					 SECCOMP_FILTER_FLAG_TSYNC_ESRCH)
+					 SECCOMP_FILTER_FLAG_TSYNC_ESRCH | \
+					 SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV)
 
 /* sizeof() the first published struct seccomp_notif_addfd */
 #define SECCOMP_NOTIFY_ADDFD_SIZE_VER0 24
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 78074254ab98..0fdc6ef02b94 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -23,6 +23,8 @@
 #define SECCOMP_FILTER_FLAG_SPEC_ALLOW		(1UL << 2)
 #define SECCOMP_FILTER_FLAG_NEW_LISTENER	(1UL << 3)
 #define SECCOMP_FILTER_FLAG_TSYNC_ESRCH		(1UL << 4)
+/* Received notifications wait in killable state (only respond to fatal signals) */
+#define SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV	(1UL << 5)
 
 /*
  * All BPF programs must return a 32-bit value.
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index db10e73d06e0..9291b0843cb2 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -201,6 +201,8 @@ static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter)
  *	   the filter can be freed.
  * @cache: cache of arch/syscall mappings to actions
  * @log: true if all actions except for SECCOMP_RET_ALLOW should be logged
+ * @wait_killable_recv: Put notifying process in killable state once the
+ *			notification is received by the userspace listener.
  * @prev: points to a previously installed, or inherited, filter
  * @prog: the BPF program to evaluate
  * @notif: the struct that holds all notification related information
@@ -221,6 +223,7 @@ struct seccomp_filter {
 	refcount_t refs;
 	refcount_t users;
 	bool log;
+	bool wait_killable_recv;
 	struct action_cache cache;
 	struct seccomp_filter *prev;
 	struct bpf_prog *prog;
@@ -894,6 +897,10 @@ static long seccomp_attach_filter(unsigned int flags,
 	if (flags & SECCOMP_FILTER_FLAG_LOG)
 		filter->log = true;
 
+	/* Set wait killable flag, if present. */
+	if (flags & SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV)
+		filter->wait_killable_recv = true;
+
 	/*
 	 * If there is an existing filter, make it the prev and don't drop its
 	 * task reference.
@@ -1081,6 +1088,12 @@ static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd, struct seccomp_kn
 	complete(&addfd->completion);
 }
 
+static bool should_sleep_killable(struct seccomp_filter *match,
+				  struct seccomp_knotif *n)
+{
+	return match->wait_killable_recv && n->state == SECCOMP_NOTIFY_SENT;
+}
+
 static int seccomp_do_user_notification(int this_syscall,
 					struct seccomp_filter *match,
 					const struct seccomp_data *sd)
@@ -1111,11 +1124,25 @@ static int seccomp_do_user_notification(int this_syscall,
 	 * This is where we wait for a reply from userspace.
 	 */
 	do {
+		bool wait_killable = should_sleep_killable(match, &n);
+
 		mutex_unlock(&match->notify_lock);
-		err = wait_for_completion_interruptible(&n.ready);
+		if (wait_killable)
+			err = wait_for_completion_killable(&n.ready);
+		else
+			err = wait_for_completion_interruptible(&n.ready);
 		mutex_lock(&match->notify_lock);
-		if (err != 0)
+
+		if (err != 0) {
+			/*
+			 * Check to see if the notifcation got picked up and
+			 * whether we should switch to wait killable.
+			 */
+			if (!wait_killable && should_sleep_killable(match, &n))
+				continue;
+
 			goto interrupted;
+		}
 
 		addfd = list_first_entry_or_null(&n.addfd,
 						 struct seccomp_kaddfd, list);
@@ -1485,6 +1512,9 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
 		mutex_lock(&filter->notify_lock);
 		knotif = find_notification(filter, unotif.id);
 		if (knotif) {
+			/* Reset the process to make sure it's not stuck */
+			if (should_sleep_killable(filter, knotif))
+				complete(&knotif->ready);
 			knotif->state = SECCOMP_NOTIFY_INIT;
 			up(&filter->notif->request);
 		}
@@ -1830,6 +1860,14 @@ static long seccomp_set_mode_filter(unsigned int flags,
 	    ((flags & SECCOMP_FILTER_FLAG_TSYNC_ESRCH) == 0))
 		return -EINVAL;
 
+	/*
+	 * The SECCOMP_FILTER_FLAG_WAIT_KILLABLE_SENT flag doesn't make sense
+	 * without the SECCOMP_FILTER_FLAG_NEW_LISTENER flag.
+	 */
+	if ((flags & SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV) &&
+	    ((flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) == 0))
+		return -EINVAL;
+
 	/* Prepare the new filter before holding any locks. */
 	prepared = seccomp_prepare_user_filter(filter);
 	if (IS_ERR(prepared))
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v3 2/2] selftests/seccomp: Add test for wait killable notifier
  2022-04-29  2:31 [PATCH v3 0/2] Handle seccomp notification preemption Sargun Dhillon
  2022-04-29  2:31 ` [PATCH v3 1/2] seccomp: Add wait_killable semantic to seccomp user notifier Sargun Dhillon
@ 2022-04-29  2:31 ` Sargun Dhillon
  2022-04-29 18:19   ` Kees Cook
  2022-04-29  9:24 ` [PATCH v3 0/2] Handle seccomp notification preemption Rodrigo Campos
  2 siblings, 1 reply; 15+ messages in thread
From: Sargun Dhillon @ 2022-04-29  2:31 UTC (permalink / raw)
  To: Kees Cook, LKML, Linux Containers
  Cc: Sargun Dhillon, Rodrigo Campos, Christian Brauner,
	Giuseppe Scrivano, Will Drewry, Andy Lutomirski, Alban Crequy

This verifies that if a filter is set up with the wait killable feature
that it obeys the semantics that non-fatal signals are ignored during
a notification after the notification is received.

Cases tested:
 * Non-fatal signal prior to receive
 * Non-fatal signal during receive
 * Fatal signal after receive

The pre-receive signal handling is tested elsewhere, and that codepath
remains unchanged.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
---
 tools/testing/selftests/seccomp/seccomp_bpf.c | 240 ++++++++++++++++++
 1 file changed, 240 insertions(+)

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 9d126d7fabdb..825b179b6751 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -59,6 +59,8 @@
 #define SKIP(s, ...)	XFAIL(s, ##__VA_ARGS__)
 #endif
 
+#define MIN(X, Y) ((X) < (Y) ? (X) : (Y))
+
 #ifndef PR_SET_PTRACER
 # define PR_SET_PTRACER 0x59616d61
 #endif
@@ -268,6 +270,10 @@ struct seccomp_notif_addfd_big {
 #define SECCOMP_FILTER_FLAG_TSYNC_ESRCH (1UL << 4)
 #endif
 
+#ifndef SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV
+#define SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV (1UL << 5)
+#endif
+
 #ifndef seccomp
 int seccomp(unsigned int op, unsigned int flags, void *args)
 {
@@ -4231,6 +4237,240 @@ TEST(user_notification_addfd_rlimit)
 	close(memfd);
 }
 
+static char get_proc_stat(int pid)
+{
+	char proc_path[100] = { 0 };
+	char *line = NULL;
+	size_t len = 0;
+	ssize_t nread;
+	char status;
+	FILE *f;
+	int i;
+
+	snprintf(proc_path, sizeof(proc_path), "/proc/%d/stat", pid);
+	f = fopen(proc_path, "r");
+	if (f == NULL)
+		ksft_exit_fail_msg("%s - Could not open %s\n", strerror(errno),
+				   proc_path);
+
+	for (i = 0; i < 3; i++) {
+		nread = getdelim(&line, &len, ' ', f);
+		if (nread <= 0)
+			ksft_exit_fail_msg("Failed to read status: %s\n",
+					   strerror(errno));
+	}
+
+	status = *line;
+	free(line);
+	fclose(f);
+
+	return status;
+}
+
+/* Returns -1 if not in syscall (running or blocked) */
+static long get_proc_syscall(int pid)
+{
+	char proc_path[100] = { 0 };
+	char *line = NULL;
+	size_t len = 0;
+	ssize_t nread;
+	long ret = -1;
+	FILE *f;
+
+	snprintf(proc_path, sizeof(proc_path), "/proc/%d/syscall", pid);
+	f = fopen(proc_path, "r");
+	if (f == NULL)
+		ksft_exit_fail_msg("%s - Could not open %s\n", strerror(errno),
+				   proc_path);
+
+	nread = getdelim(&line, &len, ' ', f);
+	if (nread <= 0)
+		ksft_exit_fail_msg("Failed to read status: %s\n",
+				   strerror(errno));
+
+	if (!strncmp("running", line, MIN(7, nread)))
+		ret = strtol(line, NULL, 16);
+
+	free(line);
+	fclose(f);
+
+	return ret;
+}
+
+TEST(user_notification_wait_killable_pre_notification)
+{
+	struct sigaction new_action = {
+		.sa_handler = signal_handler,
+	};
+	int listener, status, sk_pair[2];
+	pid_t pid;
+	long ret;
+	char c;
+	/* 100 ms */
+	struct timespec delay = { .tv_nsec = 100000000 };
+
+	ASSERT_EQ(sigemptyset(&new_action.sa_mask), 0);
+
+	ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+	ASSERT_EQ(0, ret) {
+		TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+	}
+
+	ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
+
+
+	listener = user_notif_syscall(__NR_getppid,
+				      SECCOMP_FILTER_FLAG_NEW_LISTENER |
+				      SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV);
+	ASSERT_GE(listener, 0);
+
+	/*
+	 * Check that we can kill the process with SIGUSR1 prior to receiving
+	 * the notification. SIGUSR1 is wired up to a custom signal handler,
+	 * and make sure it gets called.
+	 */
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		close(sk_pair[0]);
+		handled = sk_pair[1];
+
+		/* Setup the sigaction without SA_RESTART */
+		if (sigaction(SIGUSR1, &new_action, NULL)) {
+			perror("sigaction");
+			exit(1);
+		}
+
+		ret = syscall(__NR_getppid);
+		exit(ret != -1 || errno != EINTR);
+	}
+
+	while (get_proc_syscall(pid) != __NR_getppid &&
+	       get_proc_stat(pid) != 'S')
+		nanosleep(&delay, NULL);
+
+	EXPECT_EQ(kill(pid, SIGUSR1), 0);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	EXPECT_EQ(read(sk_pair[0], &c, 1), 1);
+}
+
+TEST(user_notification_wait_killable)
+{
+	struct sigaction new_action = {
+		.sa_handler = signal_handler,
+	};
+	struct seccomp_notif_resp resp = {};
+	struct seccomp_notif req = {};
+	int listener, status, sk_pair[2];
+	pid_t pid;
+	long ret;
+	char c;
+	/* 100 ms */
+	struct timespec delay = { .tv_nsec = 100000000 };
+
+	ASSERT_EQ(sigemptyset(&new_action.sa_mask), 0);
+
+	ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+	ASSERT_EQ(0, ret) {
+		TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+	}
+
+	ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
+
+	listener = user_notif_syscall(__NR_getppid,
+				      SECCOMP_FILTER_FLAG_NEW_LISTENER |
+				      SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV);
+	ASSERT_GE(listener, 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		close(sk_pair[0]);
+		handled = sk_pair[1];
+
+		/* Setup the sigaction without SA_RESTART */
+		if (sigaction(SIGUSR1, &new_action, NULL)) {
+			perror("sigaction");
+			exit(1);
+		}
+
+		/* Make sure that the syscall is completed (no EINTR) */
+		ret = syscall(__NR_getppid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	while (get_proc_syscall(pid) != __NR_getppid &&
+	       get_proc_stat(pid) != 'S')
+		nanosleep(&delay, NULL);
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+	/* Kill the process to make sure it enters the wait_killable state */
+	EXPECT_EQ(kill(pid, SIGUSR1), 0);
+
+	/* TASK_KILLABLE is considered D (Disk Sleep) state */
+	while (get_proc_stat(pid) != 'D')
+		nanosleep(&delay, NULL);
+
+	resp.id = req.id;
+	resp.val = USER_NOTIF_MAGIC;
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+	/*
+	 * Make sure that the signal handler does get called once we're back in
+	 * userspace.
+	 */
+	EXPECT_EQ(read(sk_pair[0], &c, 1), 1);
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+}
+
+TEST(user_notification_wait_killable_fatal)
+{
+	struct seccomp_notif req = {};
+	int listener, status;
+	pid_t pid;
+	long ret;
+	/* 100 ms */
+	struct timespec delay = { .tv_nsec = 100000000 };
+
+	ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+	ASSERT_EQ(0, ret) {
+		TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+	}
+
+	listener = user_notif_syscall(__NR_getppid,
+				      SECCOMP_FILTER_FLAG_NEW_LISTENER |
+				      SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV);
+	ASSERT_GE(listener, 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		/* This should never complete */
+		syscall(__NR_getppid);
+		exit(1);
+	}
+
+	while (get_proc_stat(pid) != 'S')
+		nanosleep(&delay, NULL);
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+	/* Kill the process with a fatal signal */
+	EXPECT_EQ(kill(pid, SIGTERM), 0);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFSIGNALED(status));
+	EXPECT_EQ(SIGTERM, WTERMSIG(status));
+}
+
 /*
  * TODO:
  * - expand NNP testing
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 0/2] Handle seccomp notification preemption
  2022-04-29  2:31 [PATCH v3 0/2] Handle seccomp notification preemption Sargun Dhillon
  2022-04-29  2:31 ` [PATCH v3 1/2] seccomp: Add wait_killable semantic to seccomp user notifier Sargun Dhillon
  2022-04-29  2:31 ` [PATCH v3 2/2] selftests/seccomp: Add test for wait killable notifier Sargun Dhillon
@ 2022-04-29  9:24 ` Rodrigo Campos
  2 siblings, 0 replies; 15+ messages in thread
From: Rodrigo Campos @ 2022-04-29  9:24 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Kees Cook, LKML, Linux Containers, Christian Brauner,
	Giuseppe Scrivano, Will Drewry, Andy Lutomirski, Alban Crequy

On Fri, Apr 29, 2022 at 4:32 AM Sargun Dhillon <sargun@sargun.me> wrote:
>
> This patchset addresses a race condition we've dealt with recently with
> seccomp. Specifically programs interrupting syscalls while they're in

I think you sent it to the wrong containers ml. It should be
containers@lists.linux.dev, right?

If the next time you can cc me to rodrigo@sdfg.com.ar instead, it will be great.


Best,
Rodrigo

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 1/2] seccomp: Add wait_killable semantic to seccomp user notifier
  2022-04-29  2:31 ` [PATCH v3 1/2] seccomp: Add wait_killable semantic to seccomp user notifier Sargun Dhillon
@ 2022-04-29  9:42   ` Rodrigo Campos
  2022-04-29 17:14     ` Sargun Dhillon
  2022-04-29 18:22   ` Kees Cook
  2022-05-02 14:15   ` Rodrigo Campos
  2 siblings, 1 reply; 15+ messages in thread
From: Rodrigo Campos @ 2022-04-29  9:42 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Kees Cook, LKML, Linux Containers, Christian Brauner,
	Giuseppe Scrivano, Will Drewry, Andy Lutomirski, Alban Crequy

On Fri, Apr 29, 2022 at 4:32 AM Sargun Dhillon <sargun@sargun.me> wrote:
> the concept is searchable. If the notifying process is signaled prior
> to the notification being received by the userspace agent, it will
> be handled as normal.

Why is that? Why not always handle in the same way (if wait killable
is set, wait like that)

> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index db10e73d06e0..9291b0843cb2 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -1081,6 +1088,12 @@ static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd, struct seccomp_kn
>         complete(&addfd->completion);
>  }
>
> +static bool should_sleep_killable(struct seccomp_filter *match,
> +                                 struct seccomp_knotif *n)
> +{
> +       return match->wait_killable_recv && n->state == SECCOMP_NOTIFY_SENT;

Here for some reason we check the notification state to be SENT.

> +}
> +
>  static int seccomp_do_user_notification(int this_syscall,
>                                         struct seccomp_filter *match,
>                                         const struct seccomp_data *sd)
> @@ -1111,11 +1124,25 @@ static int seccomp_do_user_notification(int this_syscall,
>          * This is where we wait for a reply from userspace.
>          */
>         do {
> +               bool wait_killable = should_sleep_killable(match, &n);
> +

So here, the first time this runs this will be false even if the
wait_killable flag was used in the filter (because that function
checks the notification state to be sent, that is not true the first
time)

Why not just do wait_for_completion_killable if match->wait_killable
and wait_for_completion_interruptible otherwise? Am I missing
something?



Best,
Rodrigo

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 1/2] seccomp: Add wait_killable semantic to seccomp user notifier
  2022-04-29  9:42   ` Rodrigo Campos
@ 2022-04-29 17:14     ` Sargun Dhillon
  2022-04-29 18:20       ` Kees Cook
  0 siblings, 1 reply; 15+ messages in thread
From: Sargun Dhillon @ 2022-04-29 17:14 UTC (permalink / raw)
  To: Rodrigo Campos
  Cc: Kees Cook, LKML, Linux Containers, Christian Brauner,
	Giuseppe Scrivano, Will Drewry, Andy Lutomirski, Alban Crequy

On Fri, Apr 29, 2022 at 11:42:15AM +0200, Rodrigo Campos wrote:
> On Fri, Apr 29, 2022 at 4:32 AM Sargun Dhillon <sargun@sargun.me> wrote:
> > the concept is searchable. If the notifying process is signaled prior
> > to the notification being received by the userspace agent, it will
> > be handled as normal.
> 
> Why is that? Why not always handle in the same way (if wait killable
> is set, wait like that)
> 

The goal is to avoid two things:
1. Unncessary work - Often times, we see workloads that implement techniques
   like hedging (Also known as request racing[1]). In fact, RFC3484
   (destination address selection) gets implemented where the DNS library
   will connect to many backend addresses and whichever one comes back first
   "wins".
2. Side effects - We don't want a situation where a syscall is in progress
   that is non-trivial to rollback (mount), and from user space's perspective
   this syscall never completed.

Blocking before the syscall even starts is excessive. When we looked at this
we found that with runtimes like Golang, they can get into a bad situation
if they have many (1000s) of threads that are in the middle of a syscall
because all of them need to elide prior to GC. In this case the runtime
prioritizes the liveness of GC vs. the syscalls.

That being said, there may be some syscalls in a filter that need the suggested 
behaviour. I can imagine introducing a new flag
(say SECCOMP_FILTER_FLAG_WAIT_KILLABLE) that applies to all states.
Alternatively, in one implementation, I put the behaviour in the data
field of the return from the BPF filter.


> > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > index db10e73d06e0..9291b0843cb2 100644
> > --- a/kernel/seccomp.c
> > +++ b/kernel/seccomp.c
> > @@ -1081,6 +1088,12 @@ static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd, struct seccomp_kn
> >         complete(&addfd->completion);
> >  }
> >
> > +static bool should_sleep_killable(struct seccomp_filter *match,
> > +                                 struct seccomp_knotif *n)
> > +{
> > +       return match->wait_killable_recv && n->state == SECCOMP_NOTIFY_SENT;
> 
> Here for some reason we check the notification state to be SENT.
> 
Because we don't want to block unless the notification has been received
by userspace.

> > +}
> > +
> >  static int seccomp_do_user_notification(int this_syscall,
> >                                         struct seccomp_filter *match,
> >                                         const struct seccomp_data *sd)
> > @@ -1111,11 +1124,25 @@ static int seccomp_do_user_notification(int this_syscall,
> >          * This is where we wait for a reply from userspace.
> >          */
> >         do {
> > +               bool wait_killable = should_sleep_killable(match, &n);
> > +
> 
> So here, the first time this runs this will be false even if the
> wait_killable flag was used in the filter (because that function
> checks the notification state to be sent, that is not true the first
> time)
> 
> Why not just do wait_for_completion_killable if match->wait_killable
> and wait_for_completion_interruptible otherwise? Am I missing
> something?
Again, this is to allow for the notification to be able to be preempted
prior to being received by the supervisor.

> 
> 
> 
> Best,
> Rodrigo
[1]: https://research.google/pubs/pub40801/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 2/2] selftests/seccomp: Add test for wait killable notifier
  2022-04-29  2:31 ` [PATCH v3 2/2] selftests/seccomp: Add test for wait killable notifier Sargun Dhillon
@ 2022-04-29 18:19   ` Kees Cook
  2022-04-29 22:35     ` Sargun Dhillon
  0 siblings, 1 reply; 15+ messages in thread
From: Kees Cook @ 2022-04-29 18:19 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: LKML, Linux Containers, Rodrigo Campos, Christian Brauner,
	Giuseppe Scrivano, Will Drewry, Andy Lutomirski, Alban Crequy

On Thu, Apr 28, 2022 at 07:31:13PM -0700, Sargun Dhillon wrote:
> This verifies that if a filter is set up with the wait killable feature
> that it obeys the semantics that non-fatal signals are ignored during
> a notification after the notification is received.
> 
> Cases tested:
>  * Non-fatal signal prior to receive
>  * Non-fatal signal during receive
>  * Fatal signal after receive
> 
> The pre-receive signal handling is tested elsewhere, and that codepath
> remains unchanged.
> 
> Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> ---
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 240 ++++++++++++++++++
>  1 file changed, 240 insertions(+)
> 
> diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
> index 9d126d7fabdb..825b179b6751 100644
> --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> @@ -59,6 +59,8 @@
>  #define SKIP(s, ...)	XFAIL(s, ##__VA_ARGS__)
>  #endif
>  
> +#define MIN(X, Y) ((X) < (Y) ? (X) : (Y))
> +
>  #ifndef PR_SET_PTRACER
>  # define PR_SET_PTRACER 0x59616d61
>  #endif
> @@ -268,6 +270,10 @@ struct seccomp_notif_addfd_big {
>  #define SECCOMP_FILTER_FLAG_TSYNC_ESRCH (1UL << 4)
>  #endif
>  
> +#ifndef SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV
> +#define SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV (1UL << 5)
> +#endif
> +
>  #ifndef seccomp
>  int seccomp(unsigned int op, unsigned int flags, void *args)
>  {
> @@ -4231,6 +4237,240 @@ TEST(user_notification_addfd_rlimit)
>  	close(memfd);
>  }
>  
> +static char get_proc_stat(int pid)
> +{
> +	char proc_path[100] = { 0 };
> +	char *line = NULL;
> +	size_t len = 0;
> +	ssize_t nread;
> +	char status;
> +	FILE *f;
> +	int i;
> +
> +	snprintf(proc_path, sizeof(proc_path), "/proc/%d/stat", pid);
> +	f = fopen(proc_path, "r");
> +	if (f == NULL)
> +		ksft_exit_fail_msg("%s - Could not open %s\n", strerror(errno),
> +				   proc_path);
> +
> +	for (i = 0; i < 3; i++) {
> +		nread = getdelim(&line, &len, ' ', f);
> +		if (nread <= 0)
> +			ksft_exit_fail_msg("Failed to read status: %s\n",
> +					   strerror(errno));
> +	}
> +
> +	status = *line;
> +	free(line);
> +	fclose(f);
> +
> +	return status;
> +}
> +
> +/* Returns -1 if not in syscall (running or blocked) */
> +static long get_proc_syscall(int pid)
> +{
> +	char proc_path[100] = { 0 };
> +	char *line = NULL;
> +	size_t len = 0;
> +	ssize_t nread;
> +	long ret = -1;
> +	FILE *f;
> +
> +	snprintf(proc_path, sizeof(proc_path), "/proc/%d/syscall", pid);
> +	f = fopen(proc_path, "r");
> +	if (f == NULL)
> +		ksft_exit_fail_msg("%s - Could not open %s\n", strerror(errno),
> +				   proc_path);
> +
> +	nread = getdelim(&line, &len, ' ', f);
> +	if (nread <= 0)
> +		ksft_exit_fail_msg("Failed to read status: %s\n",
> +				   strerror(errno));
> +
> +	if (!strncmp("running", line, MIN(7, nread)))
> +		ret = strtol(line, NULL, 16);
> +
> +	free(line);
> +	fclose(f);
> +
> +	return ret;
> +}

I think these should be merged as they're almost entirely the same. The
differences are position wanted and file read. Also, passing metadata in
will let it break out on errors.

static char *get_proc_delim(struct __test_metadata *_metadata,
			    unsigned int pid, const char *name,
			    const unsigned int position)
{
	char proc_path[100] = { 0 };
	char *line = NULL;
	size_t len = 0;
	ssize_t nread;
	unsigned int i;
	FILE *f;

	snprintf(proc_path, sizeof(proc_path), "/proc/%d/%s", pid, name);
	f = fopen(proc_path, "r");
	ASSERT_NE(f, NULL) {
		TH_LOG("%s: %s\n", proc_path, strerror(errno));
	}

	for (i = 0; i < position; i++) {
		nread = getdelim(&line, &len, ' ', f);
		ASSERT_GE(nread, 0) {
			TH_LOG("%s: %s\n", proc_path, strerror(errno));
		}
	}
	fclose(f);

	return line;
}

static char get_proc_status(struct __test_metadata *_metadata, int pid)
{
	char status;
	char *item;

	item = get_proc_delim(_metadata,, pid, "stat", 3);
	status = *line;
	free(line);

	return status;
}

static long get_proc_syscall(struct __test_metadata *_metadata, int pid)
{
	char *item;
	long syscall = -1;

	item = get_proc_delim(_metadata, pid, "syscall", 1);
	syscall = strtol(line, item, 16);

	if (!strcmp("running", item)) {
		syscall = strtol(line, NULL, 16);
		ASSERT_NE(syscall, LONG_MIN);
		ASSERT_NE(syscall, LONG_MAX);
	}
	free(item);

	return syscall;
}

> +
> +TEST(user_notification_wait_killable_pre_notification)
> +{
> +	struct sigaction new_action = {
> +		.sa_handler = signal_handler,
> +	};
> +	int listener, status, sk_pair[2];
> +	pid_t pid;
> +	long ret;
> +	char c;
> +	/* 100 ms */
> +	struct timespec delay = { .tv_nsec = 100000000 };
> +
> +	ASSERT_EQ(sigemptyset(&new_action.sa_mask), 0);
> +
> +	ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
> +	ASSERT_EQ(0, ret) {
> +		TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
> +	}
> +
> +	ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
> +
> +
> +	listener = user_notif_syscall(__NR_getppid,
> +				      SECCOMP_FILTER_FLAG_NEW_LISTENER |
> +				      SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV);
> +	ASSERT_GE(listener, 0);
> +
> +	/*
> +	 * Check that we can kill the process with SIGUSR1 prior to receiving
> +	 * the notification. SIGUSR1 is wired up to a custom signal handler,
> +	 * and make sure it gets called.
> +	 */
> +	pid = fork();
> +	ASSERT_GE(pid, 0);
> +
> +	if (pid == 0) {
> +		close(sk_pair[0]);
> +		handled = sk_pair[1];
> +
> +		/* Setup the sigaction without SA_RESTART */
> +		if (sigaction(SIGUSR1, &new_action, NULL)) {
> +			perror("sigaction");
> +			exit(1);
> +		}
> +
> +		ret = syscall(__NR_getppid);
> +		exit(ret != -1 || errno != EINTR);
> +	}

I could use more comments in here. IIUC, they would be:

> +

	/* Wait for process to block for user_notif in getppid */
> +	while (get_proc_syscall(pid) != __NR_getppid &&
> +	       get_proc_stat(pid) != 'S')
> +		nanosleep(&delay, NULL);
> +

	/* Kill process */
> +	EXPECT_EQ(kill(pid, SIGUSR1), 0);
> +

	/*
	 * Make sure the process ends happily (i.e. its getppid()
	 * failed with EINTR.
	 */
> +	EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +	EXPECT_EQ(true, WIFEXITED(status));
> +	EXPECT_EQ(0, WEXITSTATUS(status));
> +

	/* Also make sure the signal handler fired. */
> +	EXPECT_EQ(read(sk_pair[0], &c, 1), 1);
> +}
> +
> +TEST(user_notification_wait_killable)
> +{
> +	struct sigaction new_action = {
> +		.sa_handler = signal_handler,
> +	};
> +	struct seccomp_notif_resp resp = {};
> +	struct seccomp_notif req = {};
> +	int listener, status, sk_pair[2];
> +	pid_t pid;
> +	long ret;
> +	char c;
> +	/* 100 ms */
> +	struct timespec delay = { .tv_nsec = 100000000 };
> +
> +	ASSERT_EQ(sigemptyset(&new_action.sa_mask), 0);
> +
> +	ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
> +	ASSERT_EQ(0, ret) {
> +		TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
> +	}
> +
> +	ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
> +
> +	listener = user_notif_syscall(__NR_getppid,
> +				      SECCOMP_FILTER_FLAG_NEW_LISTENER |
> +				      SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV);
> +	ASSERT_GE(listener, 0);
> +
> +	pid = fork();
> +	ASSERT_GE(pid, 0);
> +
> +	if (pid == 0) {
> +		close(sk_pair[0]);
> +		handled = sk_pair[1];
> +
> +		/* Setup the sigaction without SA_RESTART */
> +		if (sigaction(SIGUSR1, &new_action, NULL)) {
> +			perror("sigaction");
> +			exit(1);
> +		}
> +
> +		/* Make sure that the syscall is completed (no EINTR) */
> +		ret = syscall(__NR_getppid);
> +		exit(ret != USER_NOTIF_MAGIC);
> +	}
> +
> +	while (get_proc_syscall(pid) != __NR_getppid &&
> +	       get_proc_stat(pid) != 'S')
> +		nanosleep(&delay, NULL);
> +
> +	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
> +	/* Kill the process to make sure it enters the wait_killable state */
> +	EXPECT_EQ(kill(pid, SIGUSR1), 0);
> +
> +	/* TASK_KILLABLE is considered D (Disk Sleep) state */
> +	while (get_proc_stat(pid) != 'D')
> +		nanosleep(&delay, NULL);

Should a NOWAIT waitpid() happen in this loop to make sure this doesn't
spin forever?

i.e. running these tests on a kernel that doesn't have the support
shouldn't hang -- yes it'll time out eventually but that's annoying. ;)

> +
> +	resp.id = req.id;
> +	resp.val = USER_NOTIF_MAGIC;
> +	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
> +
> +	/*
> +	 * Make sure that the signal handler does get called once we're back in
> +	 * userspace.
> +	 */
> +	EXPECT_EQ(read(sk_pair[0], &c, 1), 1);
> +	EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +	EXPECT_EQ(true, WIFEXITED(status));
> +	EXPECT_EQ(0, WEXITSTATUS(status));
> +}
> +
> +TEST(user_notification_wait_killable_fatal)
> +{
> +	struct seccomp_notif req = {};
> +	int listener, status;
> +	pid_t pid;
> +	long ret;
> +	/* 100 ms */
> +	struct timespec delay = { .tv_nsec = 100000000 };
> +
> +	ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
> +	ASSERT_EQ(0, ret) {
> +		TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
> +	}
> +
> +	listener = user_notif_syscall(__NR_getppid,
> +				      SECCOMP_FILTER_FLAG_NEW_LISTENER |
> +				      SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV);
> +	ASSERT_GE(listener, 0);
> +
> +	pid = fork();
> +	ASSERT_GE(pid, 0);
> +
> +	if (pid == 0) {
> +		/* This should never complete */
> +		syscall(__NR_getppid);
> +		exit(1);
> +	}
> +
> +	while (get_proc_stat(pid) != 'S')
> +		nanosleep(&delay, NULL);
> +
> +	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
> +	/* Kill the process with a fatal signal */
> +	EXPECT_EQ(kill(pid, SIGTERM), 0);
> +
> +	EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +	EXPECT_EQ(true, WIFSIGNALED(status));
> +	EXPECT_EQ(SIGTERM, WTERMSIG(status));
> +}

Should there be a test validating the inverse of this, as in _without_
SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV, how should the above tests
behave?

Otherwise, looks good! Yay tests!

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 1/2] seccomp: Add wait_killable semantic to seccomp user notifier
  2022-04-29 17:14     ` Sargun Dhillon
@ 2022-04-29 18:20       ` Kees Cook
  2022-05-02 12:48         ` Rodrigo Campos
  0 siblings, 1 reply; 15+ messages in thread
From: Kees Cook @ 2022-04-29 18:20 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Rodrigo Campos, LKML, Linux Containers, Christian Brauner,
	Giuseppe Scrivano, Will Drewry, Andy Lutomirski, Alban Crequy

On Fri, Apr 29, 2022 at 05:14:37PM +0000, Sargun Dhillon wrote:
> On Fri, Apr 29, 2022 at 11:42:15AM +0200, Rodrigo Campos wrote:
> > On Fri, Apr 29, 2022 at 4:32 AM Sargun Dhillon <sargun@sargun.me> wrote:
> > > the concept is searchable. If the notifying process is signaled prior
> > > to the notification being received by the userspace agent, it will
> > > be handled as normal.
> > 
> > Why is that? Why not always handle in the same way (if wait killable
> > is set, wait like that)
> > 
> 
> The goal is to avoid two things:
> 1. Unncessary work - Often times, we see workloads that implement techniques
>    like hedging (Also known as request racing[1]). In fact, RFC3484
>    (destination address selection) gets implemented where the DNS library
>    will connect to many backend addresses and whichever one comes back first
>    "wins".
> 2. Side effects - We don't want a situation where a syscall is in progress
>    that is non-trivial to rollback (mount), and from user space's perspective
>    this syscall never completed.
> 
> Blocking before the syscall even starts is excessive. When we looked at this
> we found that with runtimes like Golang, they can get into a bad situation
> if they have many (1000s) of threads that are in the middle of a syscall
> because all of them need to elide prior to GC. In this case the runtime
> prioritizes the liveness of GC vs. the syscalls.
> 
> That being said, there may be some syscalls in a filter that need the suggested 
> behaviour. I can imagine introducing a new flag
> (say SECCOMP_FILTER_FLAG_WAIT_KILLABLE) that applies to all states.
> Alternatively, in one implementation, I put the behaviour in the data
> field of the return from the BPF filter.

I'd add something like the above to the commit log, just to have it
around.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 1/2] seccomp: Add wait_killable semantic to seccomp user notifier
  2022-04-29  2:31 ` [PATCH v3 1/2] seccomp: Add wait_killable semantic to seccomp user notifier Sargun Dhillon
  2022-04-29  9:42   ` Rodrigo Campos
@ 2022-04-29 18:22   ` Kees Cook
  2022-05-02 14:15   ` Rodrigo Campos
  2 siblings, 0 replies; 15+ messages in thread
From: Kees Cook @ 2022-04-29 18:22 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: LKML, Linux Containers, Rodrigo Campos, Christian Brauner,
	Giuseppe Scrivano, Will Drewry, Andy Lutomirski, Alban Crequy

On Thu, Apr 28, 2022 at 07:31:12PM -0700, Sargun Dhillon wrote:
> This introduces a per-filter flag (SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV)
> that makes it so that when notifications are received by the supervisor
> the notifying process will transition to wait killable semantics. Although
> wait killable isn't a set of semantics formally exposed to userspace,
> the concept is searchable. If the notifying process is signaled prior
> to the notification being received by the userspace agent, it will
> be handled as normal.
> 
> One quirk about how this is handled is that the notifying process
> only switches to TASK_KILLABLE if it receives a wakeup from either
> an addfd or a signal. This is to avoid an unnecessary wakeup of
> the notifying task.
> 
> Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> ---
>  .../userspace-api/seccomp_filter.rst          |  8 ++++
>  include/linux/seccomp.h                       |  3 +-
>  include/uapi/linux/seccomp.h                  |  2 +
>  kernel/seccomp.c                              | 42 ++++++++++++++++++-
>  4 files changed, 52 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
> index 539e9d4a4860..204cf5ba511a 100644
> --- a/Documentation/userspace-api/seccomp_filter.rst
> +++ b/Documentation/userspace-api/seccomp_filter.rst
> @@ -271,6 +271,14 @@ notifying process it will be replaced. The supervisor can also add an FD, and
>  respond atomically by using the ``SECCOMP_ADDFD_FLAG_SEND`` flag and the return
>  value will be the injected file descriptor number.
>  
> +The notifying process can be preempted, resulting in the notification being
> +aborted. This can be problematic when trying to take actions on behalf of the
> +notifying process that are long-running and typically retryable (mounting a
> +filesytem). Alternatively, the at filter installation time, the

typo: "the at" -> "at"

> +``SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV`` flag can be set. This flag makes it
> +such that when a user notification is received by the supervisor, the notifying
> +process will ignore non-fatal signals until the response is sent.
> +
>  It is worth noting that ``struct seccomp_data`` contains the values of register
>  arguments to the syscall, but does not contain pointers to memory. The task's
>  memory is accessible to suitably privileged traces via ``ptrace()`` or
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index 0c564e5d40ff..d31d76be4982 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -8,7 +8,8 @@
>  					 SECCOMP_FILTER_FLAG_LOG | \
>  					 SECCOMP_FILTER_FLAG_SPEC_ALLOW | \
>  					 SECCOMP_FILTER_FLAG_NEW_LISTENER | \
> -					 SECCOMP_FILTER_FLAG_TSYNC_ESRCH)
> +					 SECCOMP_FILTER_FLAG_TSYNC_ESRCH | \
> +					 SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV)
>  
>  /* sizeof() the first published struct seccomp_notif_addfd */
>  #define SECCOMP_NOTIFY_ADDFD_SIZE_VER0 24
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index 78074254ab98..0fdc6ef02b94 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -23,6 +23,8 @@
>  #define SECCOMP_FILTER_FLAG_SPEC_ALLOW		(1UL << 2)
>  #define SECCOMP_FILTER_FLAG_NEW_LISTENER	(1UL << 3)
>  #define SECCOMP_FILTER_FLAG_TSYNC_ESRCH		(1UL << 4)
> +/* Received notifications wait in killable state (only respond to fatal signals) */
> +#define SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV	(1UL << 5)
>  
>  /*
>   * All BPF programs must return a 32-bit value.
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index db10e73d06e0..9291b0843cb2 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -201,6 +201,8 @@ static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter)
>   *	   the filter can be freed.
>   * @cache: cache of arch/syscall mappings to actions
>   * @log: true if all actions except for SECCOMP_RET_ALLOW should be logged
> + * @wait_killable_recv: Put notifying process in killable state once the
> + *			notification is received by the userspace listener.
>   * @prev: points to a previously installed, or inherited, filter
>   * @prog: the BPF program to evaluate
>   * @notif: the struct that holds all notification related information
> @@ -221,6 +223,7 @@ struct seccomp_filter {
>  	refcount_t refs;
>  	refcount_t users;
>  	bool log;
> +	bool wait_killable_recv;
>  	struct action_cache cache;
>  	struct seccomp_filter *prev;
>  	struct bpf_prog *prog;
> @@ -894,6 +897,10 @@ static long seccomp_attach_filter(unsigned int flags,
>  	if (flags & SECCOMP_FILTER_FLAG_LOG)
>  		filter->log = true;
>  
> +	/* Set wait killable flag, if present. */
> +	if (flags & SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV)
> +		filter->wait_killable_recv = true;
> +
>  	/*
>  	 * If there is an existing filter, make it the prev and don't drop its
>  	 * task reference.
> @@ -1081,6 +1088,12 @@ static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd, struct seccomp_kn
>  	complete(&addfd->completion);
>  }
>  
> +static bool should_sleep_killable(struct seccomp_filter *match,
> +				  struct seccomp_knotif *n)
> +{
> +	return match->wait_killable_recv && n->state == SECCOMP_NOTIFY_SENT;
> +}
> +
>  static int seccomp_do_user_notification(int this_syscall,
>  					struct seccomp_filter *match,
>  					const struct seccomp_data *sd)
> @@ -1111,11 +1124,25 @@ static int seccomp_do_user_notification(int this_syscall,
>  	 * This is where we wait for a reply from userspace.
>  	 */
>  	do {
> +		bool wait_killable = should_sleep_killable(match, &n);
> +
>  		mutex_unlock(&match->notify_lock);
> -		err = wait_for_completion_interruptible(&n.ready);
> +		if (wait_killable)
> +			err = wait_for_completion_killable(&n.ready);
> +		else
> +			err = wait_for_completion_interruptible(&n.ready);
>  		mutex_lock(&match->notify_lock);
> -		if (err != 0)
> +
> +		if (err != 0) {
> +			/*
> +			 * Check to see if the notifcation got picked up and
> +			 * whether we should switch to wait killable.
> +			 */
> +			if (!wait_killable && should_sleep_killable(match, &n))
> +				continue;
> +
>  			goto interrupted;
> +		}
>  
>  		addfd = list_first_entry_or_null(&n.addfd,
>  						 struct seccomp_kaddfd, list);
> @@ -1485,6 +1512,9 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
>  		mutex_lock(&filter->notify_lock);
>  		knotif = find_notification(filter, unotif.id);
>  		if (knotif) {
> +			/* Reset the process to make sure it's not stuck */
> +			if (should_sleep_killable(filter, knotif))
> +				complete(&knotif->ready);
>  			knotif->state = SECCOMP_NOTIFY_INIT;
>  			up(&filter->notif->request);
>  		}
> @@ -1830,6 +1860,14 @@ static long seccomp_set_mode_filter(unsigned int flags,
>  	    ((flags & SECCOMP_FILTER_FLAG_TSYNC_ESRCH) == 0))
>  		return -EINVAL;
>  
> +	/*
> +	 * The SECCOMP_FILTER_FLAG_WAIT_KILLABLE_SENT flag doesn't make sense
> +	 * without the SECCOMP_FILTER_FLAG_NEW_LISTENER flag.
> +	 */
> +	if ((flags & SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV) &&
> +	    ((flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) == 0))
> +		return -EINVAL;
> +
>  	/* Prepare the new filter before holding any locks. */
>  	prepared = seccomp_prepare_user_filter(filter);
>  	if (IS_ERR(prepared))

Otherwise, looks good. Thanks for bringing this back up!

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 2/2] selftests/seccomp: Add test for wait killable notifier
  2022-04-29 18:19   ` Kees Cook
@ 2022-04-29 22:35     ` Sargun Dhillon
  2022-04-29 22:43       ` Kees Cook
  0 siblings, 1 reply; 15+ messages in thread
From: Sargun Dhillon @ 2022-04-29 22:35 UTC (permalink / raw)
  To: Kees Cook
  Cc: LKML, Linux Containers, Rodrigo Campos, Christian Brauner,
	Giuseppe Scrivano, Will Drewry, Andy Lutomirski, Alban Crequy,
	Tycho Andersen

On Fri, Apr 29, 2022 at 11:19:33AM -0700, Kees Cook wrote:
> On Thu, Apr 28, 2022 at 07:31:13PM -0700, Sargun Dhillon wrote:
> > +
> > +	ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
> > +
> > +	listener = user_notif_syscall(__NR_getppid,
> > +				      SECCOMP_FILTER_FLAG_NEW_LISTENER |
> > +				      SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV);
> > +	ASSERT_GE(listener, 0);
> > +
> > +	pid = fork();
> > +	ASSERT_GE(pid, 0);
> > +
> > +	if (pid == 0) {
> > +		close(sk_pair[0]);
> > +		handled = sk_pair[1];
> > +
> > +		/* Setup the sigaction without SA_RESTART */
> > +		if (sigaction(SIGUSR1, &new_action, NULL)) {
> > +			perror("sigaction");
> > +			exit(1);
> > +		}
> > +
> > +		/* Make sure that the syscall is completed (no EINTR) */
> > +		ret = syscall(__NR_getppid);
> > +		exit(ret != USER_NOTIF_MAGIC);
> > +	}
> > +
> > +	while (get_proc_syscall(pid) != __NR_getppid &&
> > +	       get_proc_stat(pid) != 'S')
> > +		nanosleep(&delay, NULL);
> > +
> > +	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
> > +	/* Kill the process to make sure it enters the wait_killable state */
> > +	EXPECT_EQ(kill(pid, SIGUSR1), 0);
> > +
> > +	/* TASK_KILLABLE is considered D (Disk Sleep) state */
> > +	while (get_proc_stat(pid) != 'D')
> > +		nanosleep(&delay, NULL);
> 
> Should a NOWAIT waitpid() happen in this loop to make sure this doesn't
> spin forever?
> 
> i.e. running these tests on a kernel that doesn't have the support
> shouldn't hang -- yes it'll time out eventually but that's annoying. ;)
> 
Wouldn't this bail already because user_notif_syscall would assert out
since the kernel would reject the unknown flag?

I might make this a little helper function, something like:
static void wait_for_state(struct __test_metadata *_metadata, pid_t pid, char wait_for) {
	/* 100 ms */
	struct timespec delay = { .tv_nsec = 100000000 };
	int status;

	while (get_proc_stat(pid) != wait_for) {
		ASSERT_EQ(waitpid(pid, &status, WNOHANG), 0) {
			if (WIFEXITED(status))
				TH_LOG("Process %d exited with error code %d", pid, WEXITSTATUS(status));
			else if (WIFSIGNALED(status))
				TH_LOG("Process %d exited due to signal %d", pid, WTERMSIG(status));
			else
				TH_LOG("Process %d exited due to unknown reason", pid);
		}
		nanosleep(&delay, NULL);
	}
}
	
}

> > +
> > +	resp.id = req.id;
> > +	resp.val = USER_NOTIF_MAGIC;
> > +	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
> > +
> > +	/*
> > +	 * Make sure that the signal handler does get called once we're back in
> > +	 * userspace.
> > +	 */
> > +	EXPECT_EQ(read(sk_pair[0], &c, 1), 1);
> > +	EXPECT_EQ(waitpid(pid, &status, 0), pid);
> > +	EXPECT_EQ(true, WIFEXITED(status));
> > +	EXPECT_EQ(0, WEXITSTATUS(status));
> > +}
> > +
> > +TEST(user_notification_wait_killable_fatal)
> > +{
> > +	struct seccomp_notif req = {};
> > +	int listener, status;
> > +	pid_t pid;
> > +	long ret;
> > +	/* 100 ms */
> > +	struct timespec delay = { .tv_nsec = 100000000 };
> > +
> > +	ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
> > +	ASSERT_EQ(0, ret) {
> > +		TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
> > +	}
> > +
> > +	listener = user_notif_syscall(__NR_getppid,
> > +				      SECCOMP_FILTER_FLAG_NEW_LISTENER |
> > +				      SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV);
> > +	ASSERT_GE(listener, 0);
> > +
> > +	pid = fork();
> > +	ASSERT_GE(pid, 0);
> > +
> > +	if (pid == 0) {
> > +		/* This should never complete */
> > +		syscall(__NR_getppid);
> > +		exit(1);
> > +	}
> > +
> > +	while (get_proc_stat(pid) != 'S')
> > +		nanosleep(&delay, NULL);
> > +
> > +	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
> > +	/* Kill the process with a fatal signal */
> > +	EXPECT_EQ(kill(pid, SIGTERM), 0);
> > +
> > +	EXPECT_EQ(waitpid(pid, &status, 0), pid);
> > +	EXPECT_EQ(true, WIFSIGNALED(status));
> > +	EXPECT_EQ(SIGTERM, WTERMSIG(status));
> > +}
> 
> Should there be a test validating the inverse of this, as in _without_
> SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV, how should the above tests
> behave?
Don't we roughly get that from the user_notification_kill_in_middle
and user_notification_signal?

Although, I might cleanup the user_notification_signal test to disable
SA_RESTART like these tests.

> 
> Otherwise, looks good! Yay tests!
> 
> -- 
> Kees Cook

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 2/2] selftests/seccomp: Add test for wait killable notifier
  2022-04-29 22:35     ` Sargun Dhillon
@ 2022-04-29 22:43       ` Kees Cook
  0 siblings, 0 replies; 15+ messages in thread
From: Kees Cook @ 2022-04-29 22:43 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: LKML, Linux Containers, Rodrigo Campos, Christian Brauner,
	Giuseppe Scrivano, Will Drewry, Andy Lutomirski, Alban Crequy,
	Tycho Andersen

On Fri, Apr 29, 2022 at 10:35:57PM +0000, Sargun Dhillon wrote:
> On Fri, Apr 29, 2022 at 11:19:33AM -0700, Kees Cook wrote:
> > On Thu, Apr 28, 2022 at 07:31:13PM -0700, Sargun Dhillon wrote:
> > > +
> > > +	ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
> > > +
> > > +	listener = user_notif_syscall(__NR_getppid,
> > > +				      SECCOMP_FILTER_FLAG_NEW_LISTENER |
> > > +				      SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV);
> > > +	ASSERT_GE(listener, 0);
> > > +
> > > +	pid = fork();
> > > +	ASSERT_GE(pid, 0);
> > > +
> > > +	if (pid == 0) {
> > > +		close(sk_pair[0]);
> > > +		handled = sk_pair[1];
> > > +
> > > +		/* Setup the sigaction without SA_RESTART */
> > > +		if (sigaction(SIGUSR1, &new_action, NULL)) {
> > > +			perror("sigaction");
> > > +			exit(1);
> > > +		}
> > > +
> > > +		/* Make sure that the syscall is completed (no EINTR) */
> > > +		ret = syscall(__NR_getppid);
> > > +		exit(ret != USER_NOTIF_MAGIC);
> > > +	}
> > > +
> > > +	while (get_proc_syscall(pid) != __NR_getppid &&
> > > +	       get_proc_stat(pid) != 'S')
> > > +		nanosleep(&delay, NULL);
> > > +
> > > +	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
> > > +	/* Kill the process to make sure it enters the wait_killable state */
> > > +	EXPECT_EQ(kill(pid, SIGUSR1), 0);
> > > +
> > > +	/* TASK_KILLABLE is considered D (Disk Sleep) state */
> > > +	while (get_proc_stat(pid) != 'D')
> > > +		nanosleep(&delay, NULL);
> > 
> > Should a NOWAIT waitpid() happen in this loop to make sure this doesn't
> > spin forever?
> > 
> > i.e. running these tests on a kernel that doesn't have the support
> > shouldn't hang -- yes it'll time out eventually but that's annoying. ;)
> > 
> Wouldn't this bail already because user_notif_syscall would assert out
> since the kernel would reject the unknown flag?

Oh yeah, duh. :P

> I might make this a little helper function, something like:
> static void wait_for_state(struct __test_metadata *_metadata, pid_t pid, char wait_for) {
> 	/* 100 ms */
> 	struct timespec delay = { .tv_nsec = 100000000 };
> 	int status;
> 
> 	while (get_proc_stat(pid) != wait_for) {
> 		ASSERT_EQ(waitpid(pid, &status, WNOHANG), 0) {
> 			if (WIFEXITED(status))
> 				TH_LOG("Process %d exited with error code %d", pid, WEXITSTATUS(status));
> 			else if (WIFSIGNALED(status))
> 				TH_LOG("Process %d exited due to signal %d", pid, WTERMSIG(status));
> 			else
> 				TH_LOG("Process %d exited due to unknown reason", pid);
> 		}
> 		nanosleep(&delay, NULL);
> 	}
> }

Yeah, though as you point out, that is likely overkill. :)

> > > +	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
> > > +	/* Kill the process with a fatal signal */
> > > +	EXPECT_EQ(kill(pid, SIGTERM), 0);
> > > +
> > > +	EXPECT_EQ(waitpid(pid, &status, 0), pid);
> > > +	EXPECT_EQ(true, WIFSIGNALED(status));
> > > +	EXPECT_EQ(SIGTERM, WTERMSIG(status));
> > > +}
> > 
> > Should there be a test validating the inverse of this, as in _without_
> > SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV, how should the above tests
> > behave?
> Don't we roughly get that from the user_notification_kill_in_middle
> and user_notification_signal?

Yeah, I guess that's true. Cool, cool.

> Although, I might cleanup the user_notification_signal test to disable
> SA_RESTART like these tests.

Sounds good, though maybe that can be a separate patch?

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 1/2] seccomp: Add wait_killable semantic to seccomp user notifier
  2022-04-29 18:20       ` Kees Cook
@ 2022-05-02 12:48         ` Rodrigo Campos
  0 siblings, 0 replies; 15+ messages in thread
From: Rodrigo Campos @ 2022-05-02 12:48 UTC (permalink / raw)
  To: Kees Cook
  Cc: Sargun Dhillon, LKML, Linux Containers, Christian Brauner,
	Giuseppe Scrivano, Will Drewry, Andy Lutomirski, Alban Crequy

On Fri, Apr 29, 2022 at 8:20 PM Kees Cook <keescook@chromium.org> wrote:
> On Fri, Apr 29, 2022 at 05:14:37PM +0000, Sargun Dhillon wrote:
> > On Fri, Apr 29, 2022 at 11:42:15AM +0200, Rodrigo Campos wrote:
> > > On Fri, Apr 29, 2022 at 4:32 AM Sargun Dhillon <sargun@sargun.me> wrote:
> > > > the concept is searchable. If the notifying process is signaled prior
> > > > to the notification being received by the userspace agent, it will
> > > > be handled as normal.
> > >
> > > Why is that? Why not always handle in the same way (if wait killable
> > > is set, wait like that)
> > >
> >
> > The goal is to avoid two things:
> > 1. Unncessary work - Often times, we see workloads that implement techniques
> >    like hedging (Also known as request racing[1]). In fact, RFC3484
> >    (destination address selection) gets implemented where the DNS library
> >    will connect to many backend addresses and whichever one comes back first
> >    "wins".
> > 2. Side effects - We don't want a situation where a syscall is in progress
> >    that is non-trivial to rollback (mount), and from user space's perspective
> >    this syscall never completed.
> >
> > Blocking before the syscall even starts is excessive. When we looked at this
> > we found that with runtimes like Golang, they can get into a bad situation
> > if they have many (1000s) of threads that are in the middle of a syscall
> > because all of them need to elide prior to GC. In this case the runtime
> > prioritizes the liveness of GC vs. the syscalls.
> >
> > That being said, there may be some syscalls in a filter that need the suggested
> > behaviour. I can imagine introducing a new flag
> > (say SECCOMP_FILTER_FLAG_WAIT_KILLABLE) that applies to all states.
> > Alternatively, in one implementation, I put the behaviour in the data
> > field of the return from the BPF filter.

Makes sense, if we need to, we can implement that in the future too.

> I'd add something like the above to the commit log, just to have it
> around.

Yes, please. It was not obvious to me.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 1/2] seccomp: Add wait_killable semantic to seccomp user notifier
  2022-04-29  2:31 ` [PATCH v3 1/2] seccomp: Add wait_killable semantic to seccomp user notifier Sargun Dhillon
  2022-04-29  9:42   ` Rodrigo Campos
  2022-04-29 18:22   ` Kees Cook
@ 2022-05-02 14:15   ` Rodrigo Campos
  2022-05-02 16:04     ` Sargun Dhillon
  2 siblings, 1 reply; 15+ messages in thread
From: Rodrigo Campos @ 2022-05-02 14:15 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Kees Cook, LKML, Linux Containers, Christian Brauner,
	Giuseppe Scrivano, Will Drewry, Andy Lutomirski, Alban Crequy

On Fri, Apr 29, 2022 at 4:32 AM Sargun Dhillon <sargun@sargun.me> wrote:
> diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
> index 539e9d4a4860..204cf5ba511a 100644
> --- a/Documentation/userspace-api/seccomp_filter.rst
> +++ b/Documentation/userspace-api/seccomp_filter.rst
> @@ -271,6 +271,14 @@ notifying process it will be replaced. The supervisor can also add an FD, and
>  respond atomically by using the ``SECCOMP_ADDFD_FLAG_SEND`` flag and the return
>  value will be the injected file descriptor number.
>
> +The notifying process can be preempted, resulting in the notification being
> +aborted. This can be problematic when trying to take actions on behalf of the
> +notifying process that are long-running and typically retryable (mounting a
> +filesytem). Alternatively, the at filter installation time, the
> +``SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV`` flag can be set. This flag makes it
> +such that when a user notification is received by the supervisor, the notifying
> +process will ignore non-fatal signals until the response is sent.

Maybe:

This flags ignores non-fatal signals that arrive after the supervisor
received the notification

I mean, I want to make it clear that if a signal arrives before the
notification was received by the supervisor, then it will be
interrupted anyways.


> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index db10e73d06e0..9291b0843cb2 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -1485,6 +1512,9 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
>                 mutex_lock(&filter->notify_lock);
>                 knotif = find_notification(filter, unotif.id);
>                 if (knotif) {
> +                       /* Reset the process to make sure it's not stuck */
> +                       if (should_sleep_killable(filter, knotif))
> +                               complete(&knotif->ready);
>                         knotif->state = SECCOMP_NOTIFY_INIT;
>                         up(&filter->notif->request);

(I couldn't git-am this locally, so maybe I'm injecting this at the
wrong parts mentally when looking at the other code for more context.
Sorry if that is the case :))

Why do we need to complete() only in this error path? As far as I can
see this is on the error path where the copy to userspace failed and
we want to reset this notification.

I think that is wrong, we want to wake up the other side not just on
the error path, but on the non-error path (in fact, do we want to do
this on the error path? It seems like a no-op, but don't see any
reason to do it).

We _need_ to call complete() in the non error path here so the other
side wakes up and switches to a killable wait. As we are not doing
this (for the non error path), this will basically not achieve a
wait_killable() at all.

I think this was probably an oversight adapting the patch from last
year. Is it possble? Because it seems that in the previous version we
sent last year[1] (if you can link them next time it will be way
simpler :)) you had a new ioctl() and the call to complete() was
handled there, in seccomp_notify_set_wait_killable(). Now, as this is
part of the filter (and as I said last year, I think this way looks
better) that call to complete() was completely forgotten.

Is it possible that this is not really working as intended, then? Am I
missing something?


Best,
Rodrigo


[1]: https://lore.kernel.org/all/20210430204939.5152-3-sargun@sargun.me/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 1/2] seccomp: Add wait_killable semantic to seccomp user notifier
  2022-05-02 14:15   ` Rodrigo Campos
@ 2022-05-02 16:04     ` Sargun Dhillon
  2022-05-03 14:27       ` Rodrigo Campos
  0 siblings, 1 reply; 15+ messages in thread
From: Sargun Dhillon @ 2022-05-02 16:04 UTC (permalink / raw)
  To: Rodrigo Campos
  Cc: Kees Cook, LKML, Linux Containers, Christian Brauner,
	Giuseppe Scrivano, Will Drewry, Andy Lutomirski, Alban Crequy

On Mon, May 02, 2022 at 04:15:07PM +0200, Rodrigo Campos wrote:
> On Fri, Apr 29, 2022 at 4:32 AM Sargun Dhillon <sargun@sargun.me> wrote:
> > diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
> > index 539e9d4a4860..204cf5ba511a 100644
> > --- a/Documentation/userspace-api/seccomp_filter.rst
> > +++ b/Documentation/userspace-api/seccomp_filter.rst
> > @@ -271,6 +271,14 @@ notifying process it will be replaced. The supervisor can also add an FD, and
> >  respond atomically by using the ``SECCOMP_ADDFD_FLAG_SEND`` flag and the return
> >  value will be the injected file descriptor number.
> >
> > +The notifying process can be preempted, resulting in the notification being
> > +aborted. This can be problematic when trying to take actions on behalf of the
> > +notifying process that are long-running and typically retryable (mounting a
> > +filesytem). Alternatively, the at filter installation time, the
> > +``SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV`` flag can be set. This flag makes it
> > +such that when a user notification is received by the supervisor, the notifying
> > +process will ignore non-fatal signals until the response is sent.
> 
> Maybe:
> 
> This flags ignores non-fatal signals that arrive after the supervisor
> received the notification
> 
> I mean, I want to make it clear that if a signal arrives before the
> notification was received by the supervisor, then it will be
> interrupted anyways.
> 
> 
Added: Signals that are sent prior to the notification being received by 
userspace are handled normally.

> > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > index db10e73d06e0..9291b0843cb2 100644
> > --- a/kernel/seccomp.c
> > +++ b/kernel/seccomp.c
> > @@ -1485,6 +1512,9 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
> >                 mutex_lock(&filter->notify_lock);
> >                 knotif = find_notification(filter, unotif.id);
> >                 if (knotif) {
> > +                       /* Reset the process to make sure it's not stuck */
> > +                       if (should_sleep_killable(filter, knotif))
> > +                               complete(&knotif->ready);
> >                         knotif->state = SECCOMP_NOTIFY_INIT;
> >                         up(&filter->notif->request);
> 
> (I couldn't git-am this locally, so maybe I'm injecting this at the
> wrong parts mentally when looking at the other code for more context.
> Sorry if that is the case :))
> 
> Why do we need to complete() only in this error path? As far as I can
> see this is on the error path where the copy to userspace failed and
> we want to reset this notification.
This error path acts as follows
(Say, S: Supervisor, P: Notifying Process, U: User)

P: 2 <-- Pid
P: getppid() // Generated notification
P: Waiting in wait_interruptible state
S: Calls receive notification, and the codepath gets up to the poin
   where it's copying the notification to userspace
U: kill -SIGURG 2 // Send a kill signal to the notifying process
P: Waiting in the wait_killable state
S: Kernel fails to copy notification into user memory, and resets
   the notification and returns an error

If we do not have the reset, the P will never return to wait interruptible.
This is the only code path that a notification can go init -> sent -> init.

> 
> I think that is wrong, we want to wake up the other side not just on
> the error path, but on the non-error path (in fact, do we want to do
> this on the error path? It seems like a no-op, but don't see any
> reason to do it).
>

It's unneccessary. Why do it? It just means we wake up a process without reason.
Wakeups are expensive.
 
> We _need_ to call complete() in the non error path here so the other
> side wakes up and switches to a killable wait. As we are not doing
> this (for the non error path), this will basically not achieve a
> wait_killable() at all.
> 
No, because here, we check if we were waiting interruptible, and
then we switch to wait_killable:
/*
 * Check to see if the notifcation got picked up and
 * whether we should switch to wait killable.
 */
if (!wait_killable && should_sleep_killable(match, &n))
	continue;

This could probably be:
if (fatal_signal_pending(current))
	break;
if (!wait_killable && should_sleep_killable(match, &n))
	continue;

But, that check for fatal_signal_pending seems to be unneccessary (or something we'll get
for free in the next iteration).

> I think this was probably an oversight adapting the patch from last
> year. Is it possble? Because it seems that in the previous version we
> sent last year[1] (if you can link them next time it will be way
> simpler :)) you had a new ioctl() and the call to complete() was
> handled there, in seccomp_notify_set_wait_killable(). Now, as this is
> part of the filter (and as I said last year, I think this way looks
> better) that call to complete() was completely forgotten.
> 
> Is it possible that this is not really working as intended, then? Am I
> missing something?
> 
> 
> Best,
> Rodrigo
> 
> 
> [1]: https://lore.kernel.org/all/20210430204939.5152-3-sargun@sargun.me/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 1/2] seccomp: Add wait_killable semantic to seccomp user notifier
  2022-05-02 16:04     ` Sargun Dhillon
@ 2022-05-03 14:27       ` Rodrigo Campos
  0 siblings, 0 replies; 15+ messages in thread
From: Rodrigo Campos @ 2022-05-03 14:27 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Kees Cook, LKML, Linux Containers, Christian Brauner,
	Giuseppe Scrivano, Will Drewry, Andy Lutomirski, Alban Crequy

On Mon, May 2, 2022 at 6:04 PM Sargun Dhillon <sargun@sargun.me> wrote:
>
> On Mon, May 02, 2022 at 04:15:07PM +0200, Rodrigo Campos wrote:
> > (I couldn't git-am this locally, so maybe I'm injecting this at the
> > wrong parts mentally when looking at the other code for more context.
> > Sorry if that is the case :))
> >
> > Why do we need to complete() only in this error path? As far as I can
> > see this is on the error path where the copy to userspace failed and
> > we want to reset this notification.
> This error path acts as follows
> (Say, S: Supervisor, P: Notifying Process, U: User)
>
> P: 2 <-- Pid
> P: getppid() // Generated notification
> P: Waiting in wait_interruptible state
> S: Calls receive notification, and the codepath gets up to the poin
>    where it's copying the notification to userspace
> U: kill -SIGURG 2 // Send a kill signal to the notifying process
> P: Waiting in the wait_killable state
> S: Kernel fails to copy notification into user memory, and resets
>    the notification and returns an error
>
> If we do not have the reset, the P will never return to wait interruptible.

Ohhh, because we want the wait to be interruptible again! Right, I
forgot we should reset to that state again, until the notification is
indeed handled.

What if we say something along those lines in the comment, then? Like:

// Make the other side go back to wait interruptible, the notification
is not SENT.

Something like that would at least help me in the future :)

> > We _need_ to call complete() in the non error path here so the other
> > side wakes up and switches to a killable wait. As we are not doing
> > this (for the non error path), this will basically not achieve a
> > wait_killable() at all.
> >
> No, because here, we check if we were waiting interruptible, and
> then we switch to wait_killable:

Ohhh, right right right. This is lazily changing to wait killable only
when something already wakes up the process. Sorry, I overlooked that.



Best,
Rodrigo

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2022-05-03 14:28 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-29  2:31 [PATCH v3 0/2] Handle seccomp notification preemption Sargun Dhillon
2022-04-29  2:31 ` [PATCH v3 1/2] seccomp: Add wait_killable semantic to seccomp user notifier Sargun Dhillon
2022-04-29  9:42   ` Rodrigo Campos
2022-04-29 17:14     ` Sargun Dhillon
2022-04-29 18:20       ` Kees Cook
2022-05-02 12:48         ` Rodrigo Campos
2022-04-29 18:22   ` Kees Cook
2022-05-02 14:15   ` Rodrigo Campos
2022-05-02 16:04     ` Sargun Dhillon
2022-05-03 14:27       ` Rodrigo Campos
2022-04-29  2:31 ` [PATCH v3 2/2] selftests/seccomp: Add test for wait killable notifier Sargun Dhillon
2022-04-29 18:19   ` Kees Cook
2022-04-29 22:35     ` Sargun Dhillon
2022-04-29 22:43       ` Kees Cook
2022-04-29  9:24 ` [PATCH v3 0/2] Handle seccomp notification preemption Rodrigo Campos

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).