All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jann Horn via Containers <containers@lists.linux-foundation.org>
To: Kees Cook <keescook@chromium.org>,
	 "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: linux-man <linux-man@vger.kernel.org>,
	Song Liu <songliubraving@fb.com>, Will Drewry <wad@chromium.org>,
	Robert Sesek <rsesek@google.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Giuseppe Scrivano <gscrivan@redhat.com>,
	Linux Containers <containers@lists.linux-foundation.org>,
	lkml <linux-kernel@vger.kernel.org>,
	Alexei Starovoitov <ast@kernel.org>, bpf <bpf@vger.kernel.org>,
	Andy Lutomirski <luto@amacapital.net>,
	Christian Brauner <christian@brauner.io>
Subject: Re: For review: seccomp_user_notif(2) manual page
Date: Thu, 29 Oct 2020 02:25:34 +0100	[thread overview]
Message-ID: <CAG48ez0m4Y24ZBZCh+Tf4ORMm9_q4n7VOzpGjwGF7_Fe8EQH=Q@mail.gmail.com> (raw)
In-Reply-To: <202010281548.CCA92731F@keescook>

On Wed, Oct 28, 2020 at 11:53 PM Kees Cook <keescook@chromium.org> wrote:
> On Mon, Oct 26, 2020 at 10:51:02AM +0100, Jann Horn wrote:
> > The problem is the scenario where a process is interrupted while it's
> > waiting for the supervisor to reply.
> >
> > Consider the following scenario (with supervisor "S" and target "T"; S
> > wants to wait for events on two file descriptors seccomp_fd and
> > other_fd):
> >
> > S: starts poll() to wait for events on seccomp_fd and other_fd
> > T: performs a syscall that's filtered with RET_USER_NOTIF
> > S: poll() returns and signals readiness of seccomp_fd
> > T: receives signal SIGUSR1
> > T: syscall aborts, enters signal handler
> > T: signal handler blocks on unfiltered syscall (e.g. write())
> > S: starts SECCOMP_IOCTL_NOTIF_RECV
> > S: blocks because no syscalls are pending
>
> Oooh, yes, ew. Thanks for the illustration.
>
> Thinking about this from userspace's least-surprise view, I would expect
> the "recv" to stay "queued", in the sense we'd see this:
>
> S: starts poll() to wait for events on seccomp_fd and other_fd
> T: performs a syscall that's filtered with RET_USER_NOTIF
> S: poll() returns and signals readiness of seccomp_fd
> T: receives signal SIGUSR1
> T: syscall aborts, enters signal handler
> T: signal handler blocks on unfiltered syscall (e.g. write())
> S: starts SECCOMP_IOCTL_NOTIF_RECV
> S: gets (stale) seccomp_notif from seccomp_fd
> S: sends seccomp_notif_resp, receives ENOENT (or some better errno?)
>
> This is not at all how things are designed internally right now, but
> that behavior would work, yes?

It would be really ugly, but it could theoretically be made to work,
to some degree.


The first bit of trouble is that currently the notification lives on
the stack of the target process. If we want to be able to show
userspace the stale notification, we'd have to store it elsewhere. And
since we really don't want to start randomly throwing -ENOMEM in any
of this stuff, we'd basically have to store it in pre-allocated memory
inside the filter.


The second bit of trouble is that if the supervisor is so oblivious
that it doesn't realize that syscalls can be interrupted, it'll run
into other problems. Let's say the target process does something like
this:

int func(void) {
  char pathbuf[4096];
  sprintf(pathbuf, "/tmp/blah.%d", some_number);
  mount("foo", pathbuf, ...);
}

and mount() is handled with a notification. If the supervisor just
reads the path string and immediately passes it into the real mount()
syscall, something like this can happen:

target: starts mount()
target: receives signal, aborts mount()
target: runs signal handler, returns from signal handler
target: returns out of func()
supervisor: receives notification
supervisor: reads path from remote buffer
supervisor: calls mount()

but because the stack allocation has already been freed by the time
the supervisor reads it, the supervisor just reads random garbage, and
beautiful fireworks ensue.

So the supervisor *fundamentally* has to be written to expect that at
*any* time, the target can abandon a syscall. And every read of remote
memory has to be separated from uses of that remote memory by a
notification ID recheck.

And at that point, I think it's reasonable to expect the supervisor to
also be able to handle that a syscall can be aborted before the
notification is delivered.
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers

WARNING: multiple messages have this Message-ID (diff)
From: Jann Horn <jannh@google.com>
To: Kees Cook <keescook@chromium.org>,
	"Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Tycho Andersen <tycho@tycho.pizza>,
	Sargun Dhillon <sargun@sargun.me>,
	Christian Brauner <christian@brauner.io>,
	linux-man <linux-man@vger.kernel.org>,
	lkml <linux-kernel@vger.kernel.org>,
	Aleksa Sarai <cyphar@cyphar.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Will Drewry <wad@chromium.org>, bpf <bpf@vger.kernel.org>,
	Song Liu <songliubraving@fb.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andy Lutomirski <luto@amacapital.net>,
	Linux Containers <containers@lists.linux-foundation.org>,
	Giuseppe Scrivano <gscrivan@redhat.com>,
	Robert Sesek <rsesek@google.com>
Subject: Re: For review: seccomp_user_notif(2) manual page
Date: Thu, 29 Oct 2020 02:25:34 +0100	[thread overview]
Message-ID: <CAG48ez0m4Y24ZBZCh+Tf4ORMm9_q4n7VOzpGjwGF7_Fe8EQH=Q@mail.gmail.com> (raw)
In-Reply-To: <202010281548.CCA92731F@keescook>

On Wed, Oct 28, 2020 at 11:53 PM Kees Cook <keescook@chromium.org> wrote:
> On Mon, Oct 26, 2020 at 10:51:02AM +0100, Jann Horn wrote:
> > The problem is the scenario where a process is interrupted while it's
> > waiting for the supervisor to reply.
> >
> > Consider the following scenario (with supervisor "S" and target "T"; S
> > wants to wait for events on two file descriptors seccomp_fd and
> > other_fd):
> >
> > S: starts poll() to wait for events on seccomp_fd and other_fd
> > T: performs a syscall that's filtered with RET_USER_NOTIF
> > S: poll() returns and signals readiness of seccomp_fd
> > T: receives signal SIGUSR1
> > T: syscall aborts, enters signal handler
> > T: signal handler blocks on unfiltered syscall (e.g. write())
> > S: starts SECCOMP_IOCTL_NOTIF_RECV
> > S: blocks because no syscalls are pending
>
> Oooh, yes, ew. Thanks for the illustration.
>
> Thinking about this from userspace's least-surprise view, I would expect
> the "recv" to stay "queued", in the sense we'd see this:
>
> S: starts poll() to wait for events on seccomp_fd and other_fd
> T: performs a syscall that's filtered with RET_USER_NOTIF
> S: poll() returns and signals readiness of seccomp_fd
> T: receives signal SIGUSR1
> T: syscall aborts, enters signal handler
> T: signal handler blocks on unfiltered syscall (e.g. write())
> S: starts SECCOMP_IOCTL_NOTIF_RECV
> S: gets (stale) seccomp_notif from seccomp_fd
> S: sends seccomp_notif_resp, receives ENOENT (or some better errno?)
>
> This is not at all how things are designed internally right now, but
> that behavior would work, yes?

It would be really ugly, but it could theoretically be made to work,
to some degree.


The first bit of trouble is that currently the notification lives on
the stack of the target process. If we want to be able to show
userspace the stale notification, we'd have to store it elsewhere. And
since we really don't want to start randomly throwing -ENOMEM in any
of this stuff, we'd basically have to store it in pre-allocated memory
inside the filter.


The second bit of trouble is that if the supervisor is so oblivious
that it doesn't realize that syscalls can be interrupted, it'll run
into other problems. Let's say the target process does something like
this:

int func(void) {
  char pathbuf[4096];
  sprintf(pathbuf, "/tmp/blah.%d", some_number);
  mount("foo", pathbuf, ...);
}

and mount() is handled with a notification. If the supervisor just
reads the path string and immediately passes it into the real mount()
syscall, something like this can happen:

target: starts mount()
target: receives signal, aborts mount()
target: runs signal handler, returns from signal handler
target: returns out of func()
supervisor: receives notification
supervisor: reads path from remote buffer
supervisor: calls mount()

but because the stack allocation has already been freed by the time
the supervisor reads it, the supervisor just reads random garbage, and
beautiful fireworks ensue.

So the supervisor *fundamentally* has to be written to expect that at
*any* time, the target can abandon a syscall. And every read of remote
memory has to be separated from uses of that remote memory by a
notification ID recheck.

And at that point, I think it's reasonable to expect the supervisor to
also be able to handle that a syscall can be aborted before the
notification is delivered.

  reply	other threads:[~2020-10-29  1:26 UTC|newest]

Thread overview: 105+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-09-30 11:07 For review: seccomp_user_notif(2) manual page Michael Kerrisk (man-pages)
2020-09-30 11:07 ` Michael Kerrisk (man-pages)
2020-09-30 15:03 ` Tycho Andersen
2020-09-30 15:03   ` Tycho Andersen
2020-09-30 15:11   ` Tycho Andersen
2020-09-30 15:11     ` Tycho Andersen
2020-09-30 20:34   ` Michael Kerrisk (man-pages)
2020-09-30 20:34     ` Michael Kerrisk (man-pages)
2020-09-30 23:03     ` Tycho Andersen
2020-09-30 23:03       ` Tycho Andersen
2020-09-30 23:11       ` Jann Horn via Containers
2020-09-30 23:11         ` Jann Horn
2020-09-30 23:24         ` Tycho Andersen
2020-09-30 23:24           ` Tycho Andersen
2020-10-01  1:52           ` Jann Horn via Containers
2020-10-01  1:52             ` Jann Horn
2020-10-01  2:14             ` Jann Horn via Containers
2020-10-01  2:14               ` Jann Horn
2020-10-25 16:31               ` Michael Kerrisk (man-pages)
2020-10-25 16:31                 ` Michael Kerrisk (man-pages)
2020-10-26 15:54                 ` Jann Horn via Containers
2020-10-26 15:54                   ` Jann Horn
2020-10-27  6:14                   ` Michael Kerrisk (man-pages)
2020-10-27  6:14                     ` Michael Kerrisk (man-pages)
2020-10-27 10:28                     ` Jann Horn via Containers
2020-10-27 10:28                       ` Jann Horn
2020-10-28  6:31                       ` Sargun Dhillon
2020-10-28  6:31                         ` Sargun Dhillon
2020-10-28  9:43                         ` Jann Horn via Containers
2020-10-28  9:43                           ` Jann Horn
2020-10-28 17:43                           ` Sargun Dhillon
2020-10-28 17:43                             ` Sargun Dhillon
2020-10-28 18:20                             ` Jann Horn via Containers
2020-10-28 18:20                               ` Jann Horn
2020-10-01  7:49             ` Michael Kerrisk (man-pages)
2020-10-01  7:49               ` Michael Kerrisk (man-pages)
2020-10-26  0:32             ` Kees Cook
2020-10-26  0:32               ` Kees Cook
2020-10-26  9:51               ` Jann Horn via Containers
2020-10-26  9:51                 ` Jann Horn
2020-10-26 10:31                 ` Jann Horn via Containers
2020-10-26 10:31                   ` Jann Horn
2020-10-28 22:56                   ` Kees Cook
2020-10-28 22:56                     ` Kees Cook
2020-10-29  1:11                     ` Jann Horn via Containers
2020-10-29  1:11                       ` Jann Horn
2020-10-29  2:13                   ` Tycho Andersen
2020-10-29  4:26                     ` Jann Horn via Containers
2020-10-29  4:26                       ` Jann Horn
2020-10-28 22:53                 ` Kees Cook
2020-10-28 22:53                   ` Kees Cook
2020-10-29  1:25                   ` Jann Horn via Containers [this message]
2020-10-29  1:25                     ` Jann Horn
2020-10-01  7:45       ` Michael Kerrisk (man-pages)
2020-10-01  7:45         ` Michael Kerrisk (man-pages)
2020-10-14  4:40         ` Michael Kerrisk (man-pages)
2020-10-14  4:40           ` Michael Kerrisk (man-pages)
2020-09-30 15:53 ` Jann Horn via Containers
2020-09-30 15:53   ` Jann Horn
2020-10-01 12:54   ` Christian Brauner
2020-10-01 12:54     ` Christian Brauner
2020-10-01 15:47     ` Jann Horn via Containers
2020-10-01 15:47       ` Jann Horn
2020-10-01 16:58       ` Tycho Andersen
2020-10-01 16:58         ` Tycho Andersen
2020-10-01 17:12         ` Christian Brauner
2020-10-01 17:12           ` Christian Brauner
2020-10-14  5:41           ` Michael Kerrisk (man-pages)
2020-10-14  5:41             ` Michael Kerrisk (man-pages)
2020-10-01 18:18         ` Jann Horn via Containers
2020-10-01 18:18           ` Jann Horn
2020-10-01 18:56           ` Tycho Andersen
2020-10-01 18:56             ` Tycho Andersen
2020-10-01 17:05       ` Christian Brauner
2020-10-01 17:05         ` Christian Brauner
2020-10-15 11:24   ` Michael Kerrisk (man-pages)
2020-10-15 11:24     ` Michael Kerrisk (man-pages)
2020-10-15 20:32     ` Jann Horn via Containers
2020-10-15 20:32       ` Jann Horn
2020-10-16 18:29       ` Michael Kerrisk (man-pages)
2020-10-16 18:29         ` Michael Kerrisk (man-pages)
2020-10-17  0:25         ` Jann Horn via Containers
2020-10-17  0:25           ` Jann Horn
2020-10-24 12:52           ` Michael Kerrisk (man-pages)
2020-10-24 12:52             ` Michael Kerrisk (man-pages)
2020-10-26  9:32             ` Jann Horn via Containers
2020-10-26  9:32               ` Jann Horn
2020-10-26  9:47               ` Michael Kerrisk (man-pages)
2020-10-26  9:47                 ` Michael Kerrisk (man-pages)
2020-09-30 23:39 ` Kees Cook
2020-09-30 23:39   ` Kees Cook
2020-10-15 11:24   ` Michael Kerrisk (man-pages)
2020-10-15 11:24     ` Michael Kerrisk (man-pages)
2020-10-26  0:19     ` Kees Cook
2020-10-26  0:19       ` Kees Cook
2020-10-26  9:39       ` Michael Kerrisk (man-pages)
2020-10-26  9:39         ` Michael Kerrisk (man-pages)
2020-10-01 12:36 ` Christian Brauner
2020-10-01 12:36   ` Christian Brauner
2020-10-15 11:23   ` Michael Kerrisk (man-pages)
2020-10-15 11:23     ` Michael Kerrisk (man-pages)
2020-10-01 21:06 ` Sargun Dhillon
2020-10-01 21:06   ` Sargun Dhillon
2020-10-01 23:19   ` Tycho Andersen
2020-10-01 23:19     ` Tycho Andersen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAG48ez0m4Y24ZBZCh+Tf4ORMm9_q4n7VOzpGjwGF7_Fe8EQH=Q@mail.gmail.com' \
    --to=containers@lists.linux-foundation.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=christian@brauner.io \
    --cc=daniel@iogearbox.net \
    --cc=gscrivan@redhat.com \
    --cc=jannh@google.com \
    --cc=keescook@chromium.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-man@vger.kernel.org \
    --cc=luto@amacapital.net \
    --cc=mtk.manpages@gmail.com \
    --cc=rsesek@google.com \
    --cc=songliubraving@fb.com \
    --cc=wad@chromium.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.