All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andy Lutomirski <luto@kernel.org>
To: Sargun Dhillon <sargun@sargun.me>
Cc: Giuseppe Scrivano <gscrivan@redhat.com>,
	Will Drewry <wad@chromium.org>, Kees Cook <keescook@chromium.org>,
	Linux Containers <containers@lists.linux-foundation.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Alban Crequy <alban@kinvolk.io>,
	Andy Lutomirski <luto@kernel.org>,
	Christian Brauner <christian.brauner@ubuntu.com>
Subject: Re: [PATCH RESEND 2/5] seccomp: Add wait_killable semantic to seccomp user notifier
Date: Tue, 27 Apr 2021 16:19:54 -0700	[thread overview]
Message-ID: <CALCETrX9JnHE9BOhRxCc1bCvEBfbOY8bb2rxeKTsDNxfMruntQ@mail.gmail.com> (raw)
In-Reply-To: <20210427221028.GA16602@ircssh-2.c.rugged-nimbus-611.internal>

On Tue, Apr 27, 2021 at 3:10 PM Sargun Dhillon <sargun@sargun.me> wrote:
>
> On Tue, Apr 27, 2021 at 11:07:53AM -0600, Tycho Andersen wrote:
> > On Tue, Apr 27, 2021 at 09:23:42AM -0700, Andy Lutomirski wrote:
> > > On Tue, Apr 27, 2021 at 6:48 AM Tycho Andersen <tycho@tycho.pizza> wrote:
> > > >
> > > > On Mon, Apr 26, 2021 at 10:15:28PM +0000, Sargun Dhillon wrote:
> > >
> > > ISTM the current behavior is severely broken, and the new behavior
> > > isn't *that* much better since it simply ignores signals and can't
> > > emulate -EINTR (or all the various restart modes, sigh).  Surely the
> > > right behavior is to have the seccomped process notice that it got a
> > > signal and inform the monitor of that fact so that the monitor can
> > > take appropriate action.
> >
> > This doesn't help your case (2) though, since the IO could be done
> > before the supervisor gets the notification.

Tycho, I disagree.  Here's how native syscalls work:

1. Entry work is done and the syscall hander does whatever it does at
the beginning of the function.  This is entirely non-interruptible.

2. The syscall handler decides it wants to wait, interruptibly,
killably or otherwise.

3. It gets signaled.  It takes appropriate action.  Appropriate action
does *not* mean -EINTR.  It means that something that is correct *for
that syscall* happens.  For nanosleep(), this involves the restart
block (and I don't think we need to support the restart block).  For
accept(), it mostly seems to mean that the syscall completes as usual.
For write(2), it means that, depending on file type and whether any IO
has occured, either -EINTR is returned and no IO happens, or fewer
bytes than requested are transferred, or the syscall completes.  (Or,
if it's a KILL, the process dies early and partial IO is ignored.)
For some syscalls (some AF_UNIX writes, for example, or ptrace()), the
syscall indeed gets interrupted, but it uses one of the -ERESTART
mecahnisms.

User notifiers should allow correct emulation.  Right now, it doesn't,
but there is no reason it can't.

> >
> I think for something like mount, if it fails (gets interrupted) via a
> fatal signal, that's grounds for terminating the container.

That would be quite obnoxious if it happens after the container
starts.  Ctrl-C-ing a fusermount call should not be grounds for
outright destruction.
>
> I see a handful of paths forward:
>
> * We add a new action USER_NOTIF_KILLABLE which requires a fatal signal
>   in order to be interrupted
> * We add a chunk of data to the USER_NOTIF return code (say, WAIT_KILLABLE)
>   from the BPF filter that indicates what kind of wait should happen
> * (what is happening now) An ioctl flag to say pickup the notification
>   and put it into a wait_killable state
> * An ioctl "command" that puts an existing notifcation in progress into
>   the wait killable state.

In the simplest correct API, I think that kills should always kill the
task.  Non-kill signals should not kill the syscall but the user
notifier should be informed that a signal happened.  (Whether this
requires POLLPRI or some other handshaking mechanism is an open
question.)

The only real downside I see is that genuinely interruptible syscalls
will end up with higher than necessary signal latency if they occur
during the interruptible period.  An extended API could allow the
filter to return an indication that an interrupt should result in a
specified error code (-EINTR, ERESTARTxyz, etc), and the monitor could
do a new ioctl() to tell the kernel that the syscall should stop being
interruptible.  That ioctl() itself would return a status saying
whether the syscall was already interrupted.  One nice feature of this
approach is that the existing model is equivalent to the filter saying
"interruptible with EINTR" and the monitor simply forgetting to do the
new ioctl.

--Andy
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers

WARNING: multiple messages have this Message-ID (diff)
From: Andy Lutomirski <luto@kernel.org>
To: Sargun Dhillon <sargun@sargun.me>
Cc: "Tycho Andersen" <tycho@tycho.pizza>,
	"Andy Lutomirski" <luto@kernel.org>,
	"Kees Cook" <keescook@chromium.org>,
	LKML <linux-kernel@vger.kernel.org>,
	"Linux Containers" <containers@lists.linux-foundation.org>,
	"Rodrigo Campos" <rodrigo@kinvolk.io>,
	"Christian Brauner" <christian.brauner@ubuntu.com>,
	"Mauricio Vásquez Bernal" <mauricio@kinvolk.io>,
	"Giuseppe Scrivano" <gscrivan@redhat.com>,
	"Will Drewry" <wad@chromium.org>,
	"Alban Crequy" <alban@kinvolk.io>
Subject: Re: [PATCH RESEND 2/5] seccomp: Add wait_killable semantic to seccomp user notifier
Date: Tue, 27 Apr 2021 16:19:54 -0700	[thread overview]
Message-ID: <CALCETrX9JnHE9BOhRxCc1bCvEBfbOY8bb2rxeKTsDNxfMruntQ@mail.gmail.com> (raw)
In-Reply-To: <20210427221028.GA16602@ircssh-2.c.rugged-nimbus-611.internal>

On Tue, Apr 27, 2021 at 3:10 PM Sargun Dhillon <sargun@sargun.me> wrote:
>
> On Tue, Apr 27, 2021 at 11:07:53AM -0600, Tycho Andersen wrote:
> > On Tue, Apr 27, 2021 at 09:23:42AM -0700, Andy Lutomirski wrote:
> > > On Tue, Apr 27, 2021 at 6:48 AM Tycho Andersen <tycho@tycho.pizza> wrote:
> > > >
> > > > On Mon, Apr 26, 2021 at 10:15:28PM +0000, Sargun Dhillon wrote:
> > >
> > > ISTM the current behavior is severely broken, and the new behavior
> > > isn't *that* much better since it simply ignores signals and can't
> > > emulate -EINTR (or all the various restart modes, sigh).  Surely the
> > > right behavior is to have the seccomped process notice that it got a
> > > signal and inform the monitor of that fact so that the monitor can
> > > take appropriate action.
> >
> > This doesn't help your case (2) though, since the IO could be done
> > before the supervisor gets the notification.

Tycho, I disagree.  Here's how native syscalls work:

1. Entry work is done and the syscall hander does whatever it does at
the beginning of the function.  This is entirely non-interruptible.

2. The syscall handler decides it wants to wait, interruptibly,
killably or otherwise.

3. It gets signaled.  It takes appropriate action.  Appropriate action
does *not* mean -EINTR.  It means that something that is correct *for
that syscall* happens.  For nanosleep(), this involves the restart
block (and I don't think we need to support the restart block).  For
accept(), it mostly seems to mean that the syscall completes as usual.
For write(2), it means that, depending on file type and whether any IO
has occured, either -EINTR is returned and no IO happens, or fewer
bytes than requested are transferred, or the syscall completes.  (Or,
if it's a KILL, the process dies early and partial IO is ignored.)
For some syscalls (some AF_UNIX writes, for example, or ptrace()), the
syscall indeed gets interrupted, but it uses one of the -ERESTART
mecahnisms.

User notifiers should allow correct emulation.  Right now, it doesn't,
but there is no reason it can't.

> >
> I think for something like mount, if it fails (gets interrupted) via a
> fatal signal, that's grounds for terminating the container.

That would be quite obnoxious if it happens after the container
starts.  Ctrl-C-ing a fusermount call should not be grounds for
outright destruction.
>
> I see a handful of paths forward:
>
> * We add a new action USER_NOTIF_KILLABLE which requires a fatal signal
>   in order to be interrupted
> * We add a chunk of data to the USER_NOTIF return code (say, WAIT_KILLABLE)
>   from the BPF filter that indicates what kind of wait should happen
> * (what is happening now) An ioctl flag to say pickup the notification
>   and put it into a wait_killable state
> * An ioctl "command" that puts an existing notifcation in progress into
>   the wait killable state.

In the simplest correct API, I think that kills should always kill the
task.  Non-kill signals should not kill the syscall but the user
notifier should be informed that a signal happened.  (Whether this
requires POLLPRI or some other handshaking mechanism is an open
question.)

The only real downside I see is that genuinely interruptible syscalls
will end up with higher than necessary signal latency if they occur
during the interruptible period.  An extended API could allow the
filter to return an indication that an interrupt should result in a
specified error code (-EINTR, ERESTARTxyz, etc), and the monitor could
do a new ioctl() to tell the kernel that the syscall should stop being
interruptible.  That ioctl() itself would return a status saying
whether the syscall was already interrupted.  One nice feature of this
approach is that the existing model is equivalent to the filter saying
"interruptible with EINTR" and the monitor simply forgetting to do the
new ioctl.

--Andy

  reply	other threads:[~2021-04-27 23:20 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-26 18:06 [PATCH RESEND 0/5] Handle seccomp notification preemption Sargun Dhillon
2021-04-26 18:06 ` Sargun Dhillon
2021-04-26 18:06 ` [PATCH RESEND 1/5] seccomp: Refactor notification handler to prepare for new semantics Sargun Dhillon
2021-04-26 18:06   ` Sargun Dhillon
2021-04-26 18:06 ` [PATCH RESEND 2/5] seccomp: Add wait_killable semantic to seccomp user notifier Sargun Dhillon
2021-04-26 18:06   ` Sargun Dhillon
2021-04-26 19:02   ` Tycho Andersen
2021-04-26 19:02     ` Tycho Andersen
2021-04-26 22:15     ` Sargun Dhillon
2021-04-26 22:15       ` Sargun Dhillon
2021-04-27 13:48       ` Tycho Andersen
2021-04-27 13:48         ` Tycho Andersen
2021-04-27 16:23         ` Andy Lutomirski
2021-04-27 16:23           ` Andy Lutomirski
2021-04-27 17:07           ` Tycho Andersen
2021-04-27 17:07             ` Tycho Andersen
2021-04-27 22:10             ` Sargun Dhillon
2021-04-27 22:10               ` Sargun Dhillon
2021-04-27 23:19               ` Andy Lutomirski [this message]
2021-04-27 23:19                 ` Andy Lutomirski
2021-04-28  0:22                 ` Tycho Andersen
2021-04-28  0:22                   ` Tycho Andersen
2021-04-28 11:10                   ` Rodrigo Campos
2021-04-28 11:10                     ` Rodrigo Campos
2021-04-28 13:20                     ` Rodrigo Campos
2021-04-28 13:20                       ` Rodrigo Campos
2021-04-28 14:08                       ` Tycho Andersen
2021-04-28 14:08                         ` Tycho Andersen
2021-04-28 17:13                         ` Sargun Dhillon
2021-04-28 17:13                           ` Sargun Dhillon
2021-04-28  3:20                 ` Sargun Dhillon
2021-04-28  3:20                   ` Sargun Dhillon
2021-04-27 16:34         ` Sargun Dhillon
2021-04-27 16:34           ` Sargun Dhillon
2021-04-26 18:06 ` [PATCH RESEND 3/5] selftests/seccomp: Add test for wait killable notifier Sargun Dhillon
2021-04-26 18:06   ` Sargun Dhillon
2021-04-26 18:51   ` Tycho Andersen
2021-04-26 18:51     ` Tycho Andersen
2021-04-26 18:06 ` [PATCH RESEND 4/5] seccomp: Support atomic "addfd + send reply" Sargun Dhillon
2021-04-26 18:06   ` Sargun Dhillon
2021-04-26 18:06 ` [PATCH RESEND 5/5] selftests/seccomp: Add test for atomic addfd+send Sargun Dhillon
2021-04-26 18:06   ` Sargun Dhillon

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CALCETrX9JnHE9BOhRxCc1bCvEBfbOY8bb2rxeKTsDNxfMruntQ@mail.gmail.com \
    --to=luto@kernel.org \
    --cc=alban@kinvolk.io \
    --cc=christian.brauner@ubuntu.com \
    --cc=containers@lists.linux-foundation.org \
    --cc=gscrivan@redhat.com \
    --cc=keescook@chromium.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=sargun@sargun.me \
    --cc=wad@chromium.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.