linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/3] Seccomp non-preemptible notifier
@ 2021-02-20  9:04 Sargun Dhillon
  2021-02-20  9:05 ` [RFC PATCH 1/3] seccomp: Refactor notification handler to prepare for new semantics Sargun Dhillon
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Sargun Dhillon @ 2021-02-20  9:04 UTC (permalink / raw)
  To: Kees Cook, LKML, Linux Containers
  Cc: Sargun Dhillon, Mauricio Vásquez Bernal, Rodrigo Campos,
	Tycho Andersen, Giuseppe Scrivano

This patchset addresses a race condition we've dealt with recently with
seccomp. Specifically programs interrupting syscalls while they're in
progress. This was exacerbated by Golang's recent adoption of "async
preemption", in which they try to interrupt any syscall that's been
running for more than 10ms during GC. During certain syscalls, it's
non-trivial to write them in a reetrant manner in userspace (mount).

This has a couple semantic changes, and relaxes a check on seccomp_data.
I can deal with these, but this was a first cut. I also expect that the
patch would be squashed down, but it's split out for easier review.

Sargun Dhillon (3):
  seccomp: Refactor notification handler to prepare for new semantics
  seccomp: Add wait_killable semantic to seccomp user notifier
  selftests/seccomp: Add test for wait killable notifier

 include/uapi/linux/seccomp.h                  | 10 +++
 kernel/seccomp.c                              | 63 +++++++++++++------
 tools/testing/selftests/seccomp/seccomp_bpf.c | 60 ++++++++++++++++++
 3 files changed, 114 insertions(+), 19 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 5+ messages in thread

* [RFC PATCH 1/3] seccomp: Refactor notification handler to prepare for new semantics
  2021-02-20  9:04 [RFC PATCH 0/3] Seccomp non-preemptible notifier Sargun Dhillon
@ 2021-02-20  9:05 ` Sargun Dhillon
  2021-02-20  9:05 ` [RFC PATCH 2/3] seccomp: Add wait_killable semantic to seccomp user notifier Sargun Dhillon
  2021-02-20  9:05 ` [RFC PATCH 3/3] selftests/seccomp: Add test for wait killable notifier Sargun Dhillon
  2 siblings, 0 replies; 5+ messages in thread
From: Sargun Dhillon @ 2021-02-20  9:05 UTC (permalink / raw)
  To: Kees Cook, LKML, Linux Containers
  Cc: Sargun Dhillon, Mauricio Vásquez Bernal, Rodrigo Campos,
	Tycho Andersen, Giuseppe Scrivano

This refactors the user notification code to have a do / while loop around
the completion condition. This has a small change in semantic, in that
previously we ignored addfd calls upon wakeup if the notification had been
responded to, but instead with the new change we check for an outstanding
addfd calls prior to returning to userspace.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
---
 kernel/seccomp.c | 30 ++++++++++++++++--------------
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 952dc1c90229..b48fb0a29455 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1098,28 +1098,30 @@ static int seccomp_do_user_notification(int this_syscall,
 
 	up(&match->notif->request);
 	wake_up_poll(&match->wqh, EPOLLIN | EPOLLRDNORM);
-	mutex_unlock(&match->notify_lock);
 
 	/*
 	 * This is where we wait for a reply from userspace.
 	 */
-wait:
-	err = wait_for_completion_interruptible(&n.ready);
-	mutex_lock(&match->notify_lock);
-	if (err == 0) {
-		/* Check if we were woken up by a addfd message */
+	do {
+		mutex_unlock(&match->notify_lock);
+		err = wait_for_completion_interruptible(&n.ready);
+		mutex_lock(&match->notify_lock);
+		if (err != 0)
+			goto interrupted;
+
 		addfd = list_first_entry_or_null(&n.addfd,
 						 struct seccomp_kaddfd, list);
-		if (addfd && n.state != SECCOMP_NOTIFY_REPLIED) {
+		/* Check if we were woken up by a addfd message */
+		if (addfd)
 			seccomp_handle_addfd(addfd);
-			mutex_unlock(&match->notify_lock);
-			goto wait;
-		}
-		ret = n.val;
-		err = n.error;
-		flags = n.flags;
-	}
 
+	}  while (n.state != SECCOMP_NOTIFY_REPLIED);
+
+	ret = n.val;
+	err = n.error;
+	flags = n.flags;
+
+interrupted:
 	/* If there were any pending addfd calls, clear them out */
 	list_for_each_entry_safe(addfd, tmp, &n.addfd, list) {
 		/* The process went away before we got a chance to handle it */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [RFC PATCH 2/3] seccomp: Add wait_killable semantic to seccomp user notifier
  2021-02-20  9:04 [RFC PATCH 0/3] Seccomp non-preemptible notifier Sargun Dhillon
  2021-02-20  9:05 ` [RFC PATCH 1/3] seccomp: Refactor notification handler to prepare for new semantics Sargun Dhillon
@ 2021-02-20  9:05 ` Sargun Dhillon
  2021-02-26 16:57   ` Rodrigo Campos
  2021-02-20  9:05 ` [RFC PATCH 3/3] selftests/seccomp: Add test for wait killable notifier Sargun Dhillon
  2 siblings, 1 reply; 5+ messages in thread
From: Sargun Dhillon @ 2021-02-20  9:05 UTC (permalink / raw)
  To: Kees Cook, LKML, Linux Containers
  Cc: Sargun Dhillon, Mauricio Vásquez Bernal, Rodrigo Campos,
	Tycho Andersen, Giuseppe Scrivano

The user notifier feature allows for filtering of seccomp notifications in
userspace. While the user notifier is handling the syscall, the notifying
process can be preempted, thus ending the notification. This has become a
growing problem, as Golang has adopted signal based async preemption[1]. In
this, it will preempt every 10ms, thus leaving the supervisor less than
10ms to respond to a given notification. If the syscall require I/O (mount,
connect) on behalf of the process, it can easily take 10ms.

This allows the supervisor to set a flag that moves the process into a
state where it is only killable by terminating signals as opposed to all
signals.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>

[1]: https://github.com/golang/go/issues/24543
---
 include/uapi/linux/seccomp.h | 10 ++++++++++
 kernel/seccomp.c             | 35 +++++++++++++++++++++++++++++------
 2 files changed, 39 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 6ba18b82a02e..f9acdb58138b 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -70,6 +70,16 @@ struct seccomp_notif_sizes {
 	__u16 seccomp_data;
 };
 
+/*
+ * Valid flags for struct seccomp_notif
+ *
+ * SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE
+ *
+ * Prevent the notifying process from being interrupted by non-fatal, unmasked
+ * signals.
+ */
+#define SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE (1UL << 0)
+
 struct seccomp_notif {
 	__u64 id;
 	__u32 pid;
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index b48fb0a29455..f8c6c47df5d8 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -97,6 +97,8 @@ struct seccomp_knotif {
 
 	/* outstanding addfd requests */
 	struct list_head addfd;
+
+	bool wait_killable;
 };
 
 /**
@@ -1082,6 +1084,7 @@ static int seccomp_do_user_notification(int this_syscall,
 	long ret = 0;
 	struct seccomp_knotif n = {};
 	struct seccomp_kaddfd *addfd, *tmp;
+	bool wait_killable = false;
 
 	mutex_lock(&match->notify_lock);
 	err = -ENOSYS;
@@ -1103,8 +1106,14 @@ static int seccomp_do_user_notification(int this_syscall,
 	 * This is where we wait for a reply from userspace.
 	 */
 	do {
+		wait_killable = n.state == SECCOMP_NOTIFY_SENT &&
+				n.wait_killable;
+
 		mutex_unlock(&match->notify_lock);
-		err = wait_for_completion_interruptible(&n.ready);
+		if (wait_killable)
+			err = wait_for_completion_killable(&n.ready);
+		else
+			err = wait_for_completion_interruptible(&n.ready);
 		mutex_lock(&match->notify_lock);
 		if (err != 0)
 			goto interrupted;
@@ -1420,14 +1429,16 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
 	struct seccomp_notif unotif;
 	ssize_t ret;
 
+	ret = copy_from_user(&unotif, buf, sizeof(unotif));
+	if (ret)
+		return -EFAULT;
+
 	/* Verify that we're not given garbage to keep struct extensible. */
-	ret = check_zeroed_user(buf, sizeof(unotif));
-	if (ret < 0)
-		return ret;
-	if (!ret)
+	if (unotif.flags & ~(SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE))
 		return -EINVAL;
 
-	memset(&unotif, 0, sizeof(unotif));
+	if (unotif.id || unotif.pid)
+		return -EINVAL;
 
 	ret = down_interruptible(&filter->notif->request);
 	if (ret < 0)
@@ -1455,6 +1466,12 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
 	unotif.pid = task_pid_vnr(knotif->task);
 	unotif.data = *(knotif->data);
 
+	if (unotif.flags & SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE) {
+		knotif->wait_killable = true;
+		complete(&knotif->ready);
+	}
+
+
 	knotif->state = SECCOMP_NOTIFY_SENT;
 	wake_up_poll(&filter->wqh, EPOLLOUT | EPOLLWRNORM);
 	ret = 0;
@@ -1473,6 +1490,12 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
 		mutex_lock(&filter->notify_lock);
 		knotif = find_notification(filter, unotif.id);
 		if (knotif) {
+			/* Reset the waiting state */
+			if (knotif->wait_killable) {
+				knotif->wait_killable = false;
+				complete(&knotif->ready);
+			}
+
 			knotif->state = SECCOMP_NOTIFY_INIT;
 			up(&filter->notif->request);
 		}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [RFC PATCH 3/3] selftests/seccomp: Add test for wait killable notifier
  2021-02-20  9:04 [RFC PATCH 0/3] Seccomp non-preemptible notifier Sargun Dhillon
  2021-02-20  9:05 ` [RFC PATCH 1/3] seccomp: Refactor notification handler to prepare for new semantics Sargun Dhillon
  2021-02-20  9:05 ` [RFC PATCH 2/3] seccomp: Add wait_killable semantic to seccomp user notifier Sargun Dhillon
@ 2021-02-20  9:05 ` Sargun Dhillon
  2 siblings, 0 replies; 5+ messages in thread
From: Sargun Dhillon @ 2021-02-20  9:05 UTC (permalink / raw)
  To: Kees Cook, LKML, Linux Containers
  Cc: Sargun Dhillon, Mauricio Vásquez Bernal, Rodrigo Campos,
	Tycho Andersen, Giuseppe Scrivano

This adds a test for the positive case of the wait killable notifier,
in testing that when the feature is activated the process acts as
expected -- in not terminating on a non-fatal signal, and instead
queueing it up. There is already a test case for normal handlers
and preemption.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
---
 tools/testing/selftests/seccomp/seccomp_bpf.c | 60 +++++++++++++++++++
 1 file changed, 60 insertions(+)

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 26c72f2b61b1..a8ef4558d673 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -4139,6 +4139,66 @@ TEST(user_notification_addfd_rlimit)
 	close(memfd);
 }
 
+TEST(user_notification_signal_wait_killable)
+{
+	pid_t pid;
+	long ret;
+	int status, listener, sk_pair[2];
+	struct seccomp_notif req = {
+		.flags = SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE,
+	};
+	struct seccomp_notif_resp resp = {};
+	char c;
+
+	ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+	ASSERT_EQ(0, ret) {
+		TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+	}
+
+	ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
+	ASSERT_EQ(fcntl(sk_pair[0], F_SETFL, O_NONBLOCK), 0);
+
+	listener = user_notif_syscall(__NR_gettid,
+				      SECCOMP_FILTER_FLAG_NEW_LISTENER);
+	ASSERT_GE(listener, 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		close(sk_pair[0]);
+		handled = sk_pair[1];
+		if (signal(SIGUSR1, signal_handler) == SIG_ERR) {
+			perror("signal");
+			exit(1);
+		}
+
+		ret = syscall(__NR_gettid);
+		exit(!(ret == 42));
+	}
+	close(sk_pair[1]);
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+	EXPECT_EQ(kill(pid, SIGUSR1), 0);
+	/* Make sure we didn't get a signal */
+	EXPECT_EQ(read(sk_pair[0], &c, 1), -1);
+	/* Make sure the notification is still alive */
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ID_VALID, &req.id), 0);
+
+	resp.id = req.id;
+	resp.error = 0;
+	resp.val = 42;
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+	/* Check we eventually received the signal */
+	EXPECT_EQ(read(sk_pair[0], &c, 1), 1);
+}
+
+
 /*
  * TODO:
  * - expand NNP testing
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH 2/3] seccomp: Add wait_killable semantic to seccomp user notifier
  2021-02-20  9:05 ` [RFC PATCH 2/3] seccomp: Add wait_killable semantic to seccomp user notifier Sargun Dhillon
@ 2021-02-26 16:57   ` Rodrigo Campos
  0 siblings, 0 replies; 5+ messages in thread
From: Rodrigo Campos @ 2021-02-26 16:57 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Kees Cook, LKML, Linux Containers, Mauricio Vásquez Bernal,
	Tycho Andersen, Giuseppe Scrivano

On Sat, Feb 20, 2021 at 10:05 AM Sargun Dhillon <sargun@sargun.me> wrote:
>
> The user notifier feature allows for filtering of seccomp notifications in
> userspace. While the user notifier is handling the syscall, the notifying
> process can be preempted, thus ending the notification. This has become a
> growing problem, as Golang has adopted signal based async preemption[1]. In
> this, it will preempt every 10ms, thus leaving the supervisor less than
> 10ms to respond to a given notification. If the syscall require I/O (mount,
> connect) on behalf of the process, it can easily take 10ms.
>
> This allows the supervisor to set a flag that moves the process into a
> state where it is only killable by terminating signals as opposed to all
> signals.
>
> Signed-off-by: Sargun Dhillon <sargun@sargun.me>
>
> [1]: https://github.com/golang/go/issues/24543
> ---
>  include/uapi/linux/seccomp.h | 10 ++++++++++
>  kernel/seccomp.c             | 35 +++++++++++++++++++++++++++++------
>  2 files changed, 39 insertions(+), 6 deletions(-)
>
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index 6ba18b82a02e..f9acdb58138b 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -70,6 +70,16 @@ struct seccomp_notif_sizes {
>         __u16 seccomp_data;
>  };
>
> +/*
> + * Valid flags for struct seccomp_notif
> + *
> + * SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE
> + *
> + * Prevent the notifying process from being interrupted by non-fatal, unmasked

Maybe s/process/task/ to keep the vocabulary from "wait for
completion" barriers?

> + * signals.
> + */
> +#define SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE (1UL << 0)
> +
>  struct seccomp_notif {
>         __u64 id;
>         __u32 pid;
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index b48fb0a29455..f8c6c47df5d8 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -97,6 +97,8 @@ struct seccomp_knotif {
>
>         /* outstanding addfd requests */
>         struct list_head addfd;
> +
> +       bool wait_killable;
>  };
>
>  /**
> @@ -1082,6 +1084,7 @@ static int seccomp_do_user_notification(int this_syscall,
>         long ret = 0;
>         struct seccomp_knotif n = {};
>         struct seccomp_kaddfd *addfd, *tmp;
> +       bool wait_killable = false;
>
>         mutex_lock(&match->notify_lock);
>         err = -ENOSYS;
> @@ -1103,8 +1106,14 @@ static int seccomp_do_user_notification(int this_syscall,
>          * This is where we wait for a reply from userspace.
>          */
>         do {
> +               wait_killable = n.state == SECCOMP_NOTIFY_SENT &&
> +                               n.wait_killable;
> +

The state is set to SENT in seccomp_notify_recv() which can specify
the new flag. So, if that ioctl hasn't been issued yet, the state
won't be SENT and we do an interruptible wait. Do I follow correctly?

Then, if userspace does a SECCOMP_IOCTL_NOTIF_RECV ioctl(), there is a
way below to "switch the waiting" so we switch to a killable wait.
Therefore, it is still possible to receive a signal before the "wait
switch" in which case we will be interrupted anyways. If the go
runtime is sending SIGURG every 10ms, isn't this a problem? Like, in
your use cases almost never happens that the seccomp agent does the
ioctl() to receive the notification after the go runtime sends a
signal and is interrupted?

I see the window is smaller with this change and you have a chance to
handle it (as if you win the first time, it is then not interrupted),
but I wonder if it makes sense to know if we want to
wait_killable/wait_interruptable before the SECCOMP_IOCTL_NOTIF_RECV
ioctl() is done, so here we can just do the proper wait. As long as we
have to guess here, there will be races and "let's switch the wait"
kind of games, IIUC.

Can I ask why a flag for the SECCOMP_IOCTL_NOTIF_RECV ioctl() instead
of before in the flow? Like having to use
SECCOMP_FILTER_FLAG_NEW_LISTENER and a new flag that will make this
wait killable instead of interruptible. I guess code is kind of messy
to achieve that?

Am I missing something? Sorry if I am :)

>                 mutex_unlock(&match->notify_lock);
> -               err = wait_for_completion_interruptible(&n.ready);
> +               if (wait_killable)
> +                       err = wait_for_completion_killable(&n.ready);
> +               else
> +                       err = wait_for_completion_interruptible(&n.ready);
>                 mutex_lock(&match->notify_lock);
>                 if (err != 0)
>                         goto interrupted;
> @@ -1455,6 +1466,12 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
>         unotif.pid = task_pid_vnr(knotif->task);
>         unotif.data = *(knotif->data);
>
> +       if (unotif.flags & SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE) {
> +               knotif->wait_killable = true;
> +               complete(&knotif->ready);
> +       }
> +
> +

IIUC This triggers the wait switch from interruptible to killable when
the ioctl SECCOMP_IOCTL_NOTIF_RECV is done.

> @@ -1473,6 +1490,12 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
>                 mutex_lock(&filter->notify_lock);
>                 knotif = find_notification(filter, unotif.id);
>                 if (knotif) {
> +                       /* Reset the waiting state */
> +                       if (knotif->wait_killable) {
> +                               knotif->wait_killable = false;
> +                               complete(&knotif->ready);
> +                       }
> +
>                         knotif->state = SECCOMP_NOTIFY_INIT;

Haha, so we reset the wait to be interruptible if userspace screws it
up. Semantics are getting tricky, maybe a function can help here?


--
Rodrigo Campos
---
Kinvolk GmbH | Adalbertstr.6a, 10999 Berlin | tel: +491755589364
Geschäftsführer/Directors: Alban Crequy, Chris Kühl, Iago López Galeiras
Registergericht/Court of registration: Amtsgericht Charlottenburg
Registernummer/Registration number: HRB 171414 B
Ust-ID-Nummer/VAT ID number: DE302207000

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2021-02-26 17:01 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-20  9:04 [RFC PATCH 0/3] Seccomp non-preemptible notifier Sargun Dhillon
2021-02-20  9:05 ` [RFC PATCH 1/3] seccomp: Refactor notification handler to prepare for new semantics Sargun Dhillon
2021-02-20  9:05 ` [RFC PATCH 2/3] seccomp: Add wait_killable semantic to seccomp user notifier Sargun Dhillon
2021-02-26 16:57   ` Rodrigo Campos
2021-02-20  9:05 ` [RFC PATCH 3/3] selftests/seccomp: Add test for wait killable notifier Sargun Dhillon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).