Re: [PATCH RESEND 2/5] seccomp: Add wait_killable semantic to seccomp user notifier

From: Sargun Dhillon <sargun@sargun.me>
To: Tycho Andersen <tycho@tycho.pizza>
Cc: "Andy Lutomirski" <luto@kernel.org>,
	"Kees Cook" <keescook@chromium.org>,
	LKML <linux-kernel@vger.kernel.org>,
	"Linux Containers" <containers@lists.linux-foundation.org>,
	"Rodrigo Campos" <rodrigo@kinvolk.io>,
	"Christian Brauner" <christian.brauner@ubuntu.com>,
	"Mauricio Vásquez Bernal" <mauricio@kinvolk.io>,
	"Giuseppe Scrivano" <gscrivan@redhat.com>,
	"Will Drewry" <wad@chromium.org>,
	"Alban Crequy" <alban@kinvolk.io>
Subject: Re: [PATCH RESEND 2/5] seccomp: Add wait_killable semantic to seccomp user notifier
Date: Tue, 27 Apr 2021 22:10:29 +0000
Message-ID: <20210427221028.GA16602@ircssh-2.c.rugged-nimbus-611.internal>
In-Reply-To: <20210427170753.GA1786245@cisco>

On Tue, Apr 27, 2021 at 11:07:53AM -0600, Tycho Andersen wrote:
> On Tue, Apr 27, 2021 at 09:23:42AM -0700, Andy Lutomirski wrote:
> > On Tue, Apr 27, 2021 at 6:48 AM Tycho Andersen <tycho@tycho.pizza> wrote:
> > >
> > > On Mon, Apr 26, 2021 at 10:15:28PM +0000, Sargun Dhillon wrote:
> > > > On Mon, Apr 26, 2021 at 01:02:29PM -0600, Tycho Andersen wrote:
> > > > > On Mon, Apr 26, 2021 at 11:06:07AM -0700, Sargun Dhillon wrote:
> > > > > > @@ -1103,11 +1111,31 @@ static int seccomp_do_user_notification(int this_syscall,
> > > > > >    * This is where we wait for a reply from userspace.
> > > > > >    */
> > > > > >   do {
> > > > > > +         interruptible = notification_interruptible(&n);
> > > > > > +
> > > > > >           mutex_unlock(&match->notify_lock);
> > > > > > -         err = wait_for_completion_interruptible(&n.ready);
> > > > > > +         if (interruptible)
> > > > > > +                 err = wait_for_completion_interruptible(&n.ready);
> > > > > > +         else
> > > > > > +                 err = wait_for_completion_killable(&n.ready);
> > > > > >           mutex_lock(&match->notify_lock);
> > > > > > -         if (err != 0)
> > > > > > +
> > > > > > +         if (err != 0) {
> > > > > > +                 /*
> > > > > > +                  * There is a race condition here where if the
> > > > > > +                  * notification was received with the
> > > > > > +                  * SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE flag, but a
> > > > > > +                  * non-fatal signal was received before we could
> > > > > > +                  * transition we could erroneously end our wait early.
> > > > > > +                  *
> > > > > > +                  * The next wait for completion will ensure the signal
> > > > > > +                  * was not fatal.
> > > > > > +                  */
> > > > > > +                 if (interruptible && !notification_interruptible(&n))
> > > > > > +                         continue;
> > > > >
> > > > > I'm trying to understand how one would hit this race,
> > > > >
> > > >
> > > > I'm thinking:
> > > > P: Process that "generates" notification
> > > > S: Supervisor
> > > > U: User
> > > >
> > > > P: Generated notification
> > > > S: ioctl(RECV...) // With wait_killable flag.
> > > > ...complete is called in the supervisor, but the P may not be woken up...
> > > > U: kill -SIGTERM $P
> > > > ...signal gets delivered to p and causes wakeup and
> > > > wait_for_completion_interruptible returns 1...
> > > >
> > > > Then you need to check the race
> > >
> > > I see, thanks. This seems like a consequence of having the flag be
> > > per-RECV-call vs. per-filter. Seems like it might be simpler to have
> > > it be per-filter?
> > >
I agree that it is hard / impossible to guarantee correctness *after* the fact.
> > 
> > Backing up a minute, how is the current behavior not a serious
> > correctness issue?  I can think of two scenarios that seem entirely
> > broken right now:
> > 
> > 1. Process makes a syscall that is not permitted to return -EINTR.  It
> > gets a signal and returns -EINTR when user notifiers are in use.
> >
Yes, there's a whole host of problems here. Things like fsmount should not
be interruptible.

> > 2. Process makes a syscall that is permitted to return -EINTR.  But
> > -EINTR for IO means "I got interrupted and *did not do the IO*".
> > Nevertheless, the syscall returns -EINTR and the IO is done.
In general, I think the idea is for the supervisor to do as little I/O with
side effects as possible. The use cases we've looked at all have clean ways to
unwind (perf_event_open, BPF, accept), but others are harder to unwind
(mount). There are some middle-ground calls like connect, but they're
less bad.

> > 
> > ISTM the current behavior is severely broken, and the new behavior
> > isn't *that* much better since it simply ignores signals and can't
> > emulate -EINTR (or all the various restart modes, sigh).  Surely the
> > right behavior is to have the seccomped process notice that it got a
> > signal and inform the monitor of that fact so that the monitor can
> > take appropriate action.
> 
> This doesn't help your case (2) though, since the IO could be done
> before the supervisor gets the notification.
> 
I think for something like mount, if it fails (gets interrupted) via a
fatal signal, that's grounds for terminating the container.

> > IOW, I don't think that the current behavior *or* the patched opt-in
> > behavior is great.  I think we would do better to have the filter
> > indicate that it is signal-aware and to document that non-signal-aware
> > filters cannot behave correctly with respect to signals.
> 
> I think it would be hard to make a signal-aware filter, it really does
> feel like the only thing to do is a killable wait.
> 
> Tycho

There are plenty of scenarios where the syscall can be handled in an interruptible
fashion. I like to use accept as an example. I think Jann Horn had put together
a patchset on how the supervisor could be notified (as opposed to background
polling). If the call is interrupted, you can just "finish" the accept on restart
of the syscall by handing the FD over.
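
To make the accept case a bit more concrete, here is a rough userspace sketch
of the bookkeeping a supervisor might keep. The names (struct pending_accept,
stash_pending, take_pending) are hypothetical and only for illustration. The
actual handover of the fd into the target (e.g. via SECCOMP_IOCTL_NOTIF_ADDFD)
is left out.

```c
#include <stddef.h>

#define MAX_PENDING 16

/* Hypothetical supervisor-side record of a connection that was already
 * accepted on behalf of a task whose accept() was then interrupted. */
struct pending_accept {
	int in_use;
	int pid;       /* task that issued accept() */
	int listen_fd; /* listening socket number in that task */
	int conn_fd;   /* connection fd held by the supervisor */
};

static struct pending_accept pending[MAX_PENDING];

/* Remember an accepted connection so it is not lost on interruption.
 * Returns 0 on success, -1 if the table is full. */
static int stash_pending(int pid, int listen_fd, int conn_fd)
{
	for (size_t i = 0; i < MAX_PENDING; i++) {
		if (!pending[i].in_use) {
			pending[i] = (struct pending_accept){ 1, pid, listen_fd, conn_fd };
			return 0;
		}
	}
	return -1;
}

/* Called when the restarted accept() shows up as a new notification.
 * Returns the stashed connection fd and forgets it, or -1 if there is
 * none and the supervisor has to do a fresh accept(). */
static int take_pending(int pid, int listen_fd)
{
	for (size_t i = 0; i < MAX_PENDING; i++) {
		if (pending[i].in_use && pending[i].pid == pid &&
		    pending[i].listen_fd == listen_fd) {
			pending[i].in_use = 0;
			return pending[i].conn_fd;
		}
	}
	return -1;
}
```

The point is only that the side effect (the accepted connection) survives the
interruption and gets delivered when the syscall is restarted.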

I see a handful of paths forward:

* We add a new action USER_NOTIF_KILLABLE which requires a fatal signal
  in order to be interrupted
* We add a chunk of data to the USER_NOTIF return code (say, WAIT_KILLABLE)
  from the BPF filter that indicates what kind of wait should happen
* (what is happening now) An ioctl flag to say pick up the notification
  and put it into a wait_killable state
* An ioctl "command" that puts an existing notification in progress into
  the wait_killable state.
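
As a sketch of what the second option could look like, assuming we steal one
bit of the SECCOMP_RET_DATA field of the filter's return value: the
SECCOMP_RET_* values below match include/uapi/linux/seccomp.h, but
HYPOTHETICAL_NOTIF_DATA_WAIT_KILLABLE, enum notif_wait_mode and
decode_wait_mode are made up for illustration and do not exist anywhere.

```c
#include <stdint.h>

/* Values as defined in include/uapi/linux/seccomp.h. */
#define SECCOMP_RET_USER_NOTIF  0x7fc00000U
#define SECCOMP_RET_ACTION_FULL 0xffff0000U
#define SECCOMP_RET_DATA        0x0000ffffU

/* Hypothetical data bit, NOT part of the UAPI. */
#define HYPOTHETICAL_NOTIF_DATA_WAIT_KILLABLE 0x0001U

enum notif_wait_mode {
	NOTIF_WAIT_NONE,          /* filter did not return USER_NOTIF */
	NOTIF_WAIT_INTERRUPTIBLE, /* today's behaviour */
	NOTIF_WAIT_KILLABLE,      /* only fatal signals end the wait */
};

/* Decide how the notifying task should wait, based only on the value
 * returned by the BPF filter. */
static enum notif_wait_mode decode_wait_mode(uint32_t filter_ret)
{
	if ((filter_ret & SECCOMP_RET_ACTION_FULL) != SECCOMP_RET_USER_NOTIF)
		return NOTIF_WAIT_NONE;

	if (filter_ret & SECCOMP_RET_DATA & HYPOTHETICAL_NOTIF_DATA_WAIT_KILLABLE)
		return NOTIF_WAIT_KILLABLE;

	return NOTIF_WAIT_INTERRUPTIBLE;
}
```

With something like this the wait mode is fixed before the task ever sleeps,
so there is no window where it changes underneath the waiter.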