Re: [RFC PATCH 2/3] seccomp: Add wait_killable semantic to seccomp user notifier

From: Rodrigo Campos <rodrigo@kinvolk.io>
To: Sargun Dhillon <sargun@sargun.me>
Cc: "Kees Cook" <keescook@chromium.org>,
	LKML <linux-kernel@vger.kernel.org>,
	"Linux Containers" <containers@lists.linux-foundation.org>,
	"Mauricio Vásquez Bernal" <mauricio@kinvolk.io>,
	"Tycho Andersen" <tycho@tycho.pizza>,
	"Giuseppe Scrivano" <gscrivan@redhat.com>
Subject: Re: [RFC PATCH 2/3] seccomp: Add wait_killable semantic to seccomp user notifier
Date: Fri, 26 Feb 2021 17:57:51 +0100	[thread overview]
Message-ID: <CACaBj2b_U_oV=cXyZyt8-P3MCRkuD_NrgczUQ+fNn_q0e+wDSQ@mail.gmail.com> (raw)
In-Reply-To: <20210220090502.7202-3-sargun@sargun.me>

On Sat, Feb 20, 2021 at 10:05 AM Sargun Dhillon <sargun@sargun.me> wrote:
>
> The user notifier feature allows for filtering of seccomp notifications in
> userspace. While the user notifier is handling the syscall, the notifying
> process can be preempted, thus ending the notification. This has become a
> growing problem, as Golang has adopted signal based async preemption[1]. In
> this, it will preempt every 10ms, thus leaving the supervisor less than
> 10ms to respond to a given notification. If the syscall require I/O (mount,
> connect) on behalf of the process, it can easily take 10ms.
>
> This allows the supervisor to set a flag that moves the process into a
> state where it is only killable by terminating signals as opposed to all
> signals.
>
> Signed-off-by: Sargun Dhillon <sargun@sargun.me>
>
> [1]: https://github.com/golang/go/issues/24543
> ---
>  include/uapi/linux/seccomp.h | 10 ++++++++++
>  kernel/seccomp.c             | 35 +++++++++++++++++++++++++++++------
>  2 files changed, 39 insertions(+), 6 deletions(-)
>
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index 6ba18b82a02e..f9acdb58138b 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -70,6 +70,16 @@ struct seccomp_notif_sizes {
>         __u16 seccomp_data;
>  };
>
> +/*
> + * Valid flags for struct seccomp_notif
> + *
> + * SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE
> + *
> + * Prevent the notifying process from being interrupted by non-fatal, unmasked

Maybe s/process/task/ to keep the vocabulary from "wait for
completion" barriers?

> + * signals.
> + */
> +#define SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE (1UL << 0)
> +
>  struct seccomp_notif {
>         __u64 id;
>         __u32 pid;
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index b48fb0a29455..f8c6c47df5d8 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -97,6 +97,8 @@ struct seccomp_knotif {
>
>         /* outstanding addfd requests */
>         struct list_head addfd;
> +
> +       bool wait_killable;
>  };
>
>  /**
> @@ -1082,6 +1084,7 @@ static int seccomp_do_user_notification(int this_syscall,
>         long ret = 0;
>         struct seccomp_knotif n = {};
>         struct seccomp_kaddfd *addfd, *tmp;
> +       bool wait_killable = false;
>
>         mutex_lock(&match->notify_lock);
>         err = -ENOSYS;
> @@ -1103,8 +1106,14 @@ static int seccomp_do_user_notification(int this_syscall,
>          * This is where we wait for a reply from userspace.
>          */
>         do {
> +               wait_killable = n.state == SECCOMP_NOTIFY_SENT &&
> +                               n.wait_killable;
> +

The state is set to SENT in seccomp_notify_recv() which can specify
the new flag. So, if that ioctl hasn't been issued yet, the state
won't be SENT and we do an interruptible wait. Do I follow correctly?

Then, if userspace does a SECCOMP_IOCTL_NOTIF_RECV ioctl(), there is a
way below to "switch the waiting" so we switch to a killable wait.
Therefore, it is still possible to receive a signal before the "wait
switch" in which case we will be interrupted anyways. If the go
runtime is sending SIGURG every 10ms, isn't this a problem? Like, in
your use cases almost never happens that the seccomp agent does the
ioctl() to receive the notification after the go runtime sends a
signal and is interrupted?

I see the window is smaller with this change and you have a chance to
handle it (as if you win the first time, it is then not interrupted),
but I wonder if it makes sense to know if we want to
wait_killable/wait_interruptable before the SECCOMP_IOCTL_NOTIF_RECV
ioctl() is done, so here we can just do the proper wait. As long as we
have to guess here, there will be races and "let's switch the wait"
kind of games, IIUC.

Can I ask why a flag for the SECCOMP_IOCTL_NOTIF_RECV ioctl() instead
of before in the flow? Like having to use
SECCOMP_FILTER_FLAG_NEW_LISTENER and a new flag that will make this
wait killable instead of interruptible. I guess code is kind of messy
to achieve that?

Am I missing something? Sorry if I am :)

>                 mutex_unlock(&match->notify_lock);
> -               err = wait_for_completion_interruptible(&n.ready);
> +               if (wait_killable)
> +                       err = wait_for_completion_killable(&n.ready);
> +               else
> +                       err = wait_for_completion_interruptible(&n.ready);
>                 mutex_lock(&match->notify_lock);
>                 if (err != 0)
>                         goto interrupted;
> @@ -1455,6 +1466,12 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
>         unotif.pid = task_pid_vnr(knotif->task);
>         unotif.data = *(knotif->data);
>
> +       if (unotif.flags & SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE) {
> +               knotif->wait_killable = true;
> +               complete(&knotif->ready);
> +       }
> +
> +

IIUC This triggers the wait switch from interruptible to killable when
the ioctl SECCOMP_IOCTL_NOTIF_RECV is done.

> @@ -1473,6 +1490,12 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
>                 mutex_lock(&filter->notify_lock);
>                 knotif = find_notification(filter, unotif.id);
>                 if (knotif) {
> +                       /* Reset the waiting state */
> +                       if (knotif->wait_killable) {
> +                               knotif->wait_killable = false;
> +                               complete(&knotif->ready);
> +                       }
> +
>                         knotif->state = SECCOMP_NOTIFY_INIT;

Haha, so we reset the wait to be interruptible if userspace screws it
up. Semantics are getting tricky, maybe a function can help here?

--
Rodrigo Campos
---
Kinvolk GmbH | Adalbertstr.6a, 10999 Berlin | tel: +491755589364
Geschäftsführer/Directors: Alban Crequy, Chris Kühl, Iago López Galeiras
Registergericht/Court of registration: Amtsgericht Charlottenburg
Registernummer/Registration number: HRB 171414 B
Ust-ID-Nummer/VAT ID number: DE302207000