From: Jann Horn via Containers <containers@lists.linux-foundation.org>
To: Kees Cook <keescook@chromium.org>
Cc: linux-man <linux-man@vger.kernel.org>,
	Song Liu <songliubraving@fb.com>, Will Drewry <wad@chromium.org>,
	Robert Sesek <rsesek@google.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Giuseppe Scrivano <gscrivan@redhat.com>,
	Linux Containers <containers@lists.linux-foundation.org>,
	lkml <linux-kernel@vger.kernel.org>,
	Alexei Starovoitov <ast@kernel.org>,
	"Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>,
	bpf <bpf@vger.kernel.org>, Andy Lutomirski <luto@amacapital.net>,
	Christian Brauner <christian@brauner.io>
Subject: Re: For review: seccomp_user_notif(2) manual page
Date: Mon, 26 Oct 2020 10:51:02 +0100
Message-ID: <CAG48ez2b-fnsp8YAR=H5uRMT4bBTid_hyU4m6KavHxDko1Efog@mail.gmail.com>
In-Reply-To: <202010251725.2BD96926E3@keescook>

On Mon, Oct 26, 2020 at 1:32 AM Kees Cook <keescook@chromium.org> wrote:
> On Thu, Oct 01, 2020 at 03:52:02AM +0200, Jann Horn wrote:
> > On Thu, Oct 1, 2020 at 1:25 AM Tycho Andersen <tycho@tycho.pizza> wrote:
> > > On Thu, Oct 01, 2020 at 01:11:33AM +0200, Jann Horn wrote:
> > > > On Thu, Oct 1, 2020 at 1:03 AM Tycho Andersen <tycho@tycho.pizza> wrote:
> > > > > On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) wrote:
> > > > > > On 9/30/20 5:03 PM, Tycho Andersen wrote:
> > > > > > > On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
> > > > > > >>        ┌─────────────────────────────────────────────────────┐
> > > > > > >>        │FIXME                                                │
> > > > > > >>        ├─────────────────────────────────────────────────────┤
> > > > > > >>        │From my experiments,  it  appears  that  if  a  SEC‐ │
> > > > > > >>        │COMP_IOCTL_NOTIF_RECV   is  done  after  the  target │
> > > > > > >>        │process terminates, then the ioctl()  simply  blocks │
> > > > > > >>        │(rather than returning an error to indicate that the │
> > > > > > >>        │target process no longer exists).                    │
> > > > > > >
> > > > > > > Yeah, I think Christian wanted to fix this at some point,
> > > > > >
> > > > > > > Do you have a pointer to that discussion? I could not find it
> > > > > > > with a quick search.
> > > > > >
> > > > > > > but it's a
> > > > > > > bit sticky to do.
> > > > > >
> > > > > > Can you say a few words about the nature of the problem?
> > > > >
> > > > > I remembered wrong, it's actually in the tree: 99cdb8b9a573 ("seccomp:
> > > > > notify about unused filter"). So maybe there's a bug here?
> > > >
> > > > That thing only notifies on ->poll, it doesn't unblock ioctls; and
> > > > Michael's sample code uses SECCOMP_IOCTL_NOTIF_RECV to wait. So that
> > > > commit doesn't have any effect on this kind of usage.
> > >
> > > Yes, thanks. And the ones stuck in RECV are waiting on a semaphore so
> > > we don't have a count of all of them, unfortunately.
> > >
> > > We could maybe look inside the wait_list, but that will probably make
> > > people angry :)
> >
> > The easiest way would probably be to open-code the semaphore-ish part,
> > and let the semaphore and poll share the waitqueue. The current code
> > kind of mirrors the semaphore's waitqueue in the wqh - open-coding the
> > entire semaphore would IMO be cleaner than that. And it's not like
> > semaphore semantics are even a good fit for this code anyway.
> >
> > Let's see... if we didn't have the existing UAPI to worry about, I'd
> > do it as follows (*completely* untested). That way, the ioctl would
> > block exactly until either there actually is a request to deliver or
> > there are no more users of the filter. The problem is that if we just
> > apply this patch, existing users of SECCOMP_IOCTL_NOTIF_RECV that use
> > an event loop and don't set O_NONBLOCK will be screwed. So we'd
>
> Wait, why? Do you mean an ioctl calling loop (rather than a poll event
> loop)?

No, I'm talking about poll event loops.

> I think poll would be fine, but a "try calling RECV and expect to
> return ENOENT" loop would change. But I don't think anyone would do this
> exactly because it _currently_ acts like O_NONBLOCK, yes?
>
> > probably also have to add some stupid counter in place of the
> > semaphore's counter that we can use to preserve the old behavior of
> > returning -ENOENT once for each cancelled request. :(
>
> I only see this in Debian Code Search:
> https://sources.debian.org/src/crun/0.15+dfsg-1/src/libcrun/seccomp_notify.c/?hl=166#L166
> which is using epoll_wait():
> https://sources.debian.org/src/crun/0.15+dfsg-1/src/libcrun/container.c/?hl=1326#L1326
>
> I expect LXC is using it. :)

The problem is the scenario where a process is interrupted while it's
waiting for the supervisor to reply.

Consider the following scenario (with supervisor "S" and target "T"; S
wants to wait for events on two file descriptors seccomp_fd and
other_fd):

S: starts poll() to wait for events on seccomp_fd and other_fd
T: performs a syscall that's filtered with RET_USER_NOTIF
S: poll() returns and signals readiness of seccomp_fd
T: receives signal SIGUSR1
T: syscall aborts, enters signal handler
T: signal handler blocks on unfiltered syscall (e.g. write())
S: starts SECCOMP_IOCTL_NOTIF_RECV
S: blocks because no syscalls are pending

Depending on what other_fd is, this could in the worst case even lead
to a deadlock: e.g. the signal handler wants to write to stdout, the
stdout fd is hooked up to other_fd in the supervisor, and the
supervisor can't consume the written data because it's stuck in
seccomp handling.

So we have to ensure that when existing code (like that crun code you
linked to) triggers this case, SECCOMP_IOCTL_NOTIF_RECV returns
immediately instead of blocking.

(Oh, but by the way, that crun code looks broken anyway, because
AFAICS it treats all error returns from SECCOMP_IOCTL_NOTIF_RECV
equally by bailing out; and it kinda looks like that bailout path then
nukes the container, or something? So that needs to be fixed either
way.)
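
To illustrate the point about not treating all error returns equally,
here is a minimal sketch of how a supervisor's event loop might sort
the results of SECCOMP_IOCTL_NOTIF_RECV. The enum and the function name
are hypothetical and not taken from crun or the kernel; the ENOENT case
reflects the "cancelled request" behavior discussed above, and EINTR is
the usual return when the supervisor itself is interrupted by a signal.

```c
#include <errno.h>

/* Hypothetical outcome categories for a supervisor's event loop. */
enum recv_action {
    RECV_HANDLE,   /* a notification was received; process it */
    RECV_RETRY,    /* nothing to handle right now; go back to poll() */
    RECV_FATAL     /* unexpected failure; give up on this notify fd */
};

/*
 * Decide what to do after ioctl(fd, SECCOMP_IOCTL_NOTIF_RECV, &req).
 * ret is the ioctl() return value, err is errno saved right after it.
 */
static enum recv_action classify_recv_result(int ret, int err)
{
    if (ret == 0)
        return RECV_HANDLE;

    switch (err) {
    case ENOENT: /* the target's syscall was cancelled, e.g. by a signal */
    case EINTR:  /* the supervisor itself was interrupted by a signal */
        return RECV_RETRY;
    default:     /* anything else is treated as a real error here */
        return RECV_FATAL;
    }
}
```

With something along these lines, a cancelled request sends the
supervisor back to poll() instead of down the bailout path. Whether
other errno values should also be retried is a policy choice this
sketch does not try to settle.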
