From: Daniel Colascione <dancol@google.com>
To: Jonathan Kowalski <bl0pbl33p@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Aleksa Sarai <cyphar@cyphar.com>,
	Andy Lutomirski <luto@amacapital.net>,
	Christian Brauner <christian@brauner.io>,
	Jann Horn <jannh@google.com>, Andrew Lutomirski <luto@kernel.org>,
	David Howells <dhowells@redhat.com>,
	"Serge E. Hallyn" <serge@hallyn.com>,
	Linux API <linux-api@vger.kernel.org>,
	Linux List Kernel Mailing <linux-kernel@vger.kernel.org>,
	Arnd Bergmann <arnd@arndb.de>,
	"Eric W. Biederman" <ebiederm@xmission.com>,
	Konstantin Khlebnikov <khlebnikov@yandex-team.ru>,
	Kees Cook <keescook@chromium.org>,
	Alexey Dobriyan <adobriyan@gmail.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Michael Kerrisk-manpages <mtk.manpages@gmail.com>,
	"Dmitry V. Levin" <ldv@altlinux.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Oleg Nesterov <oleg@redhat.com>,
	Nagarathnam Muthusamy <nagarathnam.muthusamy@oracle.com>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Joel Fernandes <joel@joelfernandes.org>
Subject: Re: [PATCH v2 0/5] pid: add pidfd_open()
Date: Mon, 1 Apr 2019 09:21:49 -0700
Message-ID: <CAKOZuetovU9NawSjUfGTAwXVfnmKC9GFtmS8FxAE5Hjmzq0HHA@mail.gmail.com>
In-Reply-To: <CAGLj2rF85hZLAR79oEA1dQARadw1OHr8bxqCLbvO6g9B6ou2Qw@mail.gmail.com>

On Mon, Apr 1, 2019 at 9:07 AM Jonathan Kowalski <bl0pbl33p@gmail.com> wrote:
>
> On Mon, Apr 1, 2019 at 4:55 PM Daniel Colascione <dancol@google.com> wrote:
> >
> > On Mon, Apr 1, 2019 at 8:36 AM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > On Mon, Apr 1, 2019 at 4:41 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > > >
> > > > Eric pitched a procfs2 which would *just* be the PIDs some time ago (in
> > > > an attempt to make it possible one day to mount /proc inside a container
> > > > without adding a bunch of masked paths), though it was just an idea and
> > > > I don't know if he ever had a patch for it.
> >
> > Couldn't this mode just be a relatively simple procfs mount option
> > instead of a whole new filesystem? It'd be a bit like hidepid, right?
> > The internal bind mount option and the no-dotdot-traversal options
> > also look good to me.
> >
> > > I wonder if we really want a full procfs2, or maybe we could just make
> > > the pidfd readable (yes, it's a directory file descriptor, but we
> > > could allow reading).
> >
> > What would read(2) read?
> >
> > > What are the *actual* use cases for opening /proc files through it? If
> > > it's really just for a small subset that android wants to do this
> > > (getting basic process state like "running" etc), rather than anything
> > > else, then we could skip the whole /proc linking entirely and go the
> > > other way instead (ie open_pidfd() would get that limited IO model,
> > > and we could make the /proc directory node get the same limited IO
> > > model).
> >
> > We do a lot of process state inspection and manipulation, including
> > reading and writing the oom killer adjustment score, reading smaps,
> > and the occasional cgroup manipulation. More generally, I'd also like
> > to be able to write a race-free pkill(1). Doing this work via pidfd
> > would be convenient. More generally, we can't enumerate the specific
> > use cases, because what we want to do with processes isn't bounded in
> > advance, and we regularly find new things in /proc/pid that we want to
> > read and write. I'd rather not prematurely limit the applicability of
> > the pidfd interface, especially when there's a simple option (the
> > procfs directory file descriptor approach) that doesn't require
> > in-advance enumeration of supported process inspection and
> > manipulation actions or a separate per-option pidfd equivalent. I very
> > much want a general-purpose API that reuses the metadata interfaces
> > the kernel already exposes. It's not clear to me how this rich
> > interface could be matched by read(2) on a pidfd.
>
> With the POLLHUP model on a simple pidfd, you'd know when the process
> you were referring to is dead (and one can map POLLPRI to dead and
> POLLHUP to zombie, etc).

I agree that there needs to be *some* kind of pollable file descriptor
that fires on process death. Should it be the pidfd itself? I've been
thinking that a pollable directory would be too weird, but if that's
not actually a problem, I'm fine with it.

Mapping different state transitions to different poll bits is an
interesting idea, but I'm also worried about making poll the "source
of truth" with respect to process state as reported by a pidfd.
Consider a socket: for a socket, read(2)/recv(2) is the "source of
truth" and poll is just a hint that says "now is a good time to
attempt a read". Some other systems even allow for spurious wakeups
from poll. It's also worth keeping in mind that some pre-existing
event loops let you provide "is readable" and "is writable" callbacks,
but don't support the more exotic poll notification bits. That's why
I've tried to keep my proposals limited to poll signaling readability.

But I'm not really that picky. I just want something that works.
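
The socket comparison above (poll is a hint, read(2)/recv(2) is the
source of truth) can be sketched in C roughly as follows. This is only
an illustration under stated assumptions: the helper name
read_when_ready is hypothetical, and MSG_DONTWAIT is used so that a
spurious wakeup cannot block the caller.

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper: wait until fd yields data or EOF.
 * Returns the number of bytes read, 0 on EOF, or -1 on error.
 */
static ssize_t read_when_ready(int fd, void *buf, size_t len)
{
    for (;;) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };

        if (poll(&pfd, 1, -1) < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal, poll again */
            return -1;
        }

        /* poll only suggested that a read may succeed; recv decides. */
        ssize_t n = recv(fd, buf, len, MSG_DONTWAIT);
        if (n >= 0)
            return n;
        if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
            continue;           /* treat as a spurious wakeup */
        return -1;
    }
}
```

The caller never inspects revents: it relies only on "readable" and
lets the non-blocking recv report what actually happened, which is the
same restriction that event loops with plain "is readable" callbacks
impose.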

> This is just an extension of the child process model, since you'd know
> when it's dead, there's no race involved with opening the wrong
> /proc/<PID> entry. The entire thing is already non-racy for direct
> children, the same model can be extended to non-direct ones.
>
> This simplifies a lot of things, now I am essentially just passing a
> file descriptor pinning the struct pid associated with the original
> task, and not process state around to others (I may even want the
> other process to not read that stuff out even if it was allowed to, as
> it wouldn't have been able to otherwise, due to being in a different
> mount namespace). KISS.
>
> The upshot is this same descriptor can be returned from clone, which
> would allow you to directly register it in your event loop (like
> signalfd, timerfd, file fd, sockets, etc) and POLLIN generated for you
> to read back its exit status (it is arguable if non-parents should be
> returned a readable instance from pidfd_open, but parents sure
> should). You can disable SIGCHLD for the child, and read back exit
> status much later. The entire point of waiting and reaping was that
> it'd be lost, but now you have a descriptor where it is kept for you
> to consume.

There's a subtlety: suppose I'm a library and I want to create a
private subprocess. I use the new clone facility, whatever it is, and
get a pidfd back. I need to be able to read the child's exit status
even if the child exits before clone returns in the parent. Doesn't
this requirement imply that the pidfd, kernel-side, contain something
a bit more than a struct pid?
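
For illustration, here is a minimal C sketch of the "child exits before
the parent looks" case. It does not use the clone facility discussed in
this thread; it assumes the pidfd_open(2) system call in the form that
later landed in mainline (Linux 5.3), where a pidfd polls as readable
once the process has terminated. The helper name spawn_and_collect and
the fallback syscall number 434 are assumptions of this sketch. In this
variant the status is still collected with waitpid(2) by the parent, so
it stays with the unreaped child rather than in the descriptor.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <poll.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434  /* assumption: generic number, Linux >= 5.3 */
#endif

/*
 * Hypothetical helper: fork a child that exits at once with `code`, then
 * collect its wait status with the help of a pidfd. Returns 0 and fills
 * *wstatus on success, -1 with errno set otherwise (ENOSYS on kernels
 * that lack pidfd_open).
 */
static int spawn_and_collect(int code, int *wstatus)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);            /* may finish before the parent runs again */

    /* No PID-reuse race here only because we are the parent and have not
     * reaped yet: the unreaped child keeps its PID allocated. */
    int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
    if (pidfd < 0) {
        int saved = errno;
        waitpid(pid, NULL, 0);  /* still reap the child */
        errno = saved;
        return -1;
    }

    struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
    int rc;
    do {
        rc = poll(&pfd, 1, -1); /* readable once the process has exited */
    } while (rc < 0 && errno == EINTR);

    /* The status is read through the classic wait interface. */
    pid_t reaped = (rc < 0) ? -1 : waitpid(pid, wstatus, 0);
    int saved = errno;
    close(pidfd);
    errno = saved;
    return reaped == pid ? 0 : -1;
}
```

A caller would check the result with WIFEXITED() and WEXITSTATUS(), as
with any waitpid(2) status. A non-parent holder of the pidfd has no such
wait path, which is where the question of what the descriptor itself
must carry becomes relevant.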
