linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: pidfd design
@ 2019-03-20 20:07 Alexey Dobriyan
  2019-03-20 20:14 ` Daniel Colascione
  2019-03-22 14:04 ` Michael Tirado
  0 siblings, 2 replies; 35+ messages in thread
From: Alexey Dobriyan @ 2019-03-20 20:07 UTC (permalink / raw)
  To: christian; +Cc: linux-kernel, dancol, joel, luto

> What would be your opinion to having a
> /proc/<pid>/handle
> file instead of having a dirfd.

This is even worse than depending on PROC_FS. Just for the dependency
pidfd code should be backed out immediately. Forget about /proc.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: pidfd design
  2019-03-20 20:07 pidfd design Alexey Dobriyan
@ 2019-03-20 20:14 ` Daniel Colascione
  2019-03-20 20:39   ` Alexey Dobriyan
  2019-03-22 14:04 ` Michael Tirado
  1 sibling, 1 reply; 35+ messages in thread
From: Daniel Colascione @ 2019-03-20 20:14 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Christian Brauner, linux-kernel, Joel Fernandes, Andy Lutomirski

On Wed, Mar 20, 2019 at 1:07 PM Alexey Dobriyan <adobriyan@gmail.com> wrote:
> > What would be your opinion to having a
> > /proc/<pid>/handle
> > file instead of having a dirfd.
>
> This is even worse than depending on PROC_FS. Just for the dependency
> pidfd code should be backed out immediately. Forget about /proc.

We already have pidfds, and we've had them since /proc was added ages
ago. Backing out procfs is a bold proposal.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: pidfd design
  2019-03-20 20:14 ` Daniel Colascione
@ 2019-03-20 20:39   ` Alexey Dobriyan
  2019-03-20 20:47     ` Christian Brauner
  0 siblings, 1 reply; 35+ messages in thread
From: Alexey Dobriyan @ 2019-03-20 20:39 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Christian Brauner, linux-kernel, Joel Fernandes, Andy Lutomirski

On Wed, Mar 20, 2019 at 01:14:01PM -0700, Daniel Colascione wrote:
> On Wed, Mar 20, 2019 at 1:07 PM Alexey Dobriyan <adobriyan@gmail.com> wrote:
> > > What would be your opinion to having a
> > > /proc/<pid>/handle
> > > file instead of having a dirfd.
> >
> > This is even worse than depending on PROC_FS. Just for the dependency
> > pidfd code should be backed out immediately. Forget about /proc.
> 
> We already have pidfds, and we've had them since /proc was added ages
> ago.

New pidfd code (or whatever the name) should NOT depend on /proc and
should not interact with VFS at all at any point (other than probably
being a descriptor on a fake filesystem). The reason is that /proc is
full of crap and you don't want to spill that into new and hopefully
properly designed part of new code.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: pidfd design
  2019-03-20 20:39   ` Alexey Dobriyan
@ 2019-03-20 20:47     ` Christian Brauner
  2019-03-20 20:50       ` Daniel Colascione
  0 siblings, 1 reply; 35+ messages in thread
From: Christian Brauner @ 2019-03-20 20:47 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Daniel Colascione, linux-kernel, Joel Fernandes, Andy Lutomirski

On Wed, Mar 20, 2019 at 11:39:10PM +0300, Alexey Dobriyan wrote:
> On Wed, Mar 20, 2019 at 01:14:01PM -0700, Daniel Colascione wrote:
> > On Wed, Mar 20, 2019 at 1:07 PM Alexey Dobriyan <adobriyan@gmail.com> wrote:
> > > > What would be your opinion to having a
> > > > /proc/<pid>/handle
> > > > file instead of having a dirfd.
> > >
> > > This is even worse than depending on PROC_FS. Just for the dependency
> > > pidfd code should be backed out immediately. Forget about /proc.
> > 
> > We already have pidfds, and we've had them since /proc was added ages
> > ago.
> 
> New pidfd code (or whatever the name) should NOT depend on /proc and
> should not interact with VFS at all at any point (other than probably
> being a descriptor on a fake filesystem). The reason is that /proc is
> full of crap and you don't want to spill that into new and hopefully
> properly designed part of new code.

Yes, I agree. That's why I was thinking that translate_pid() is a good
candidate to provide that decoupling.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: pidfd design
  2019-03-20 20:47     ` Christian Brauner
@ 2019-03-20 20:50       ` Daniel Colascione
  2019-03-20 21:00         ` Christian Brauner
  0 siblings, 1 reply; 35+ messages in thread
From: Daniel Colascione @ 2019-03-20 20:50 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexey Dobriyan, linux-kernel, Joel Fernandes, Andy Lutomirski

On Wed, Mar 20, 2019 at 1:47 PM Christian Brauner <christian@brauner.io> wrote:
>
> On Wed, Mar 20, 2019 at 11:39:10PM +0300, Alexey Dobriyan wrote:
> > On Wed, Mar 20, 2019 at 01:14:01PM -0700, Daniel Colascione wrote:
> > > On Wed, Mar 20, 2019 at 1:07 PM Alexey Dobriyan <adobriyan@gmail.com> wrote:
> > > > > What would be your opinion to having a
> > > > > /proc/<pid>/handle
> > > > > file instead of having a dirfd.
> > > >
> > > > This is even worse than depending on PROC_FS. Just for the dependency
> > > > pidfd code should be backed out immediately. Forget about /proc.
> > >
> > > We already have pidfds, and we've had them since /proc was added ages
> > > ago.
> >
> > New pidfd code (or whatever the name) should NOT depend on /proc and
> > should not interact with VFS at all at any point (other than probably
> > being a descriptor on a fake filesystem). The reason is that /proc is
> > full of crap and you don't want to spill that into new and hopefully
> > properly designed part of new code.
>
> Yes, I agree. That's why I was thinking that translate_pid() is a good
> candidate to provide that decoupling.

Then again: how do you propose fetching process metadata? If you adopt
a stance that nothing can use procfs and simultaneously adopt a stance
that we don't want to duplicate all the decades of metadata interfaces
in /proc/pid (which are useful, not "crap"), then the overall result
is that we just won't make any progress at all. There's nothing wrong
with taking a dependency on procfs: procfs is how we talk about
processes. It's completely unreasonable to say "no, you can't use the
old thing" and also "no, we can't add a new thing that would duplicate
the old thing".

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: pidfd design
  2019-03-20 20:50       ` Daniel Colascione
@ 2019-03-20 21:00         ` Christian Brauner
  0 siblings, 0 replies; 35+ messages in thread
From: Christian Brauner @ 2019-03-20 21:00 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Alexey Dobriyan, linux-kernel, Joel Fernandes, Andy Lutomirski

On Wed, Mar 20, 2019 at 01:50:43PM -0700, Daniel Colascione wrote:
> On Wed, Mar 20, 2019 at 1:47 PM Christian Brauner <christian@brauner.io> wrote:
> >
> > On Wed, Mar 20, 2019 at 11:39:10PM +0300, Alexey Dobriyan wrote:
> > > On Wed, Mar 20, 2019 at 01:14:01PM -0700, Daniel Colascione wrote:
> > > > On Wed, Mar 20, 2019 at 1:07 PM Alexey Dobriyan <adobriyan@gmail.com> wrote:
> > > > > > What would be your opinion to having a
> > > > > > /proc/<pid>/handle
> > > > > > file instead of having a dirfd.
> > > > >
> > > > > This is even worse than depending on PROC_FS. Just for the dependency
> > > > > pidfd code should be backed out immediately. Forget about /proc.
> > > >
> > > > We already have pidfds, and we've had them since /proc was added ages
> > > > ago.
> > >
> > > New pidfd code (or whatever the name) should NOT depend on /proc and
> > > should not interact with VFS at all at any point (other than probably
> > > being a descriptor on a fake filesystem). The reason is that /proc is
> > > full of crap and you don't want to spill that into new and hopefully
> > > properly designed part of new code.
> >
> > Yes, I agree. That's why I was thinking that translate_pid() is a good
> > candidate to provide that decoupling.
> 
> Then again: how do you propose fetching process metadata? If you adopt
> a stance that nothing can use procfs and simultaneously adopt a stance
> that we don't want to duplicate all the decades of metadata interfaces
> in /proc/pid (which are useful, not "crap"), then the overall result
> is that we just won't make any progress at all. There's nothing wrong
> with taking a dependency on procfs: procfs is how we talk about
> processes. It's completely unreasonable to say "no, you can't use the
> old thing" and also "no, we can't add a new thing that would duplicate
> the old thing".

This style or arguing won't get us any further. Please, send in the code
that you think is right and you want to get reviewed. If you can get the
Acks from the people you need, good.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: pidfd design
  2019-03-20 20:07 pidfd design Alexey Dobriyan
  2019-03-20 20:14 ` Daniel Colascione
@ 2019-03-22 14:04 ` Michael Tirado
  2019-03-25 17:45   ` Linus Torvalds
  2019-03-25 18:50   ` Christian Brauner
  1 sibling, 2 replies; 35+ messages in thread
From: Michael Tirado @ 2019-03-22 14:04 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: Linus Torvalds, LKML

On Wed, Mar 20, 2019 at 8:08 PM Alexey Dobriyan <adobriyan@gmail.com> wrote:
>
> pidfd code should be backed out immediately. Forget about /proc.

Seems like Torvalds just merges this sort of "stuff" without reading
it now, or there's something that auto accepted pull request to RC tree?

> The pull request you sent on Tue,  5 Mar 2019 18:13:01 +0100:
>
>>  git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git tags/pidfd-v5.1-rc1
>
> has been merged into torvalds/linux.git:
> https://git.kernel.org/torvalds/c/a9dce6679d736cb3d612af39bab9f31f8db66f9b
>
>Thank you!

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: pidfd design
  2019-03-25 17:45   ` Linus Torvalds
@ 2019-03-25 16:14     ` Michael Tirado
  2019-03-25 20:45     ` Christian Brauner
  1 sibling, 0 replies; 35+ messages in thread
From: Michael Tirado @ 2019-03-25 16:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: LKML, Alexey Dobriyan, christian

On Mon, Mar 25, 2019 at 5:45 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, Mar 22, 2019 at 11:34 AM Michael Tirado <mtirado418@gmail.com> wrote:
> >
> > On Wed, Mar 20, 2019 at 8:08 PM Alexey Dobriyan <adobriyan@gmail.com> wrote:
> > >
> > > pidfd code should be backed out immediately. Forget about /proc.
> >
> > Seems like Torvalds just merges this sort of "stuff" without reading
> > it now, or there's something that auto accepted pull request to RC tree?
>
> There is no auto-accept.
>
> But there also didn't seem to be any valid arguments against it, and
> the android people had arguments for it.
>

Isn't Google working on their own C++ kernel now, I bet they would want
to make a smooth transition to that at some point? Hopefully they don't
screw up Linux in the process.


> Arguing against it based on "I don't like /proc" is pointless. The
> fact is, /proc is our system interface for a lot of things.
>

The argument was valid to me, at least the design is not set in
stone just yet and there is still hope. I have an option in my
namespace sandbox called "noproc", it works for many things, but
if devs start relying on /proc ALWAYS being available I begin to
have issues. You are all aware of the horrors of /proc, I hope.
I don't want /proc so deeply entrenched in the ecosystem that I
can no longer use "noproc". These sort of bold new core features
need to be designed with extreme caution and awareness of the full
spectra of affects. Just because something like procfs exists and
can be used doesn't mean it is wise to go all-in.


> Arguing against it based on "I worry about the _other_
> non-signal-sending things that could be done with this" is also
> pointless. What other things? The only thing that got merged was the
> signalling.
>

There have been "future changes" hinted through the patches lifecycle,
it leads me to believe it's a gateway patch, and the pid wrapping is a
minor bugfix bridge to some other undisclosed features. How could anyone
know the design is right without knowing what these changes might be?

pidctl/translate_pid?
I am against any new systemcall that crosses namespaces by design to
accomplish something that is already plummable. Seems like they want
to use pidfd's as some sort of token to perform these cross namespace
operations, can't wait to see how devs end up abusing this one.


> So the model of using a file descriptor instead of a 'pid' for signal
> handling is actually very unix-like. Maybe that's how pid's should
> have worked to begin with. Remember that whole "everything is a file"
> thing?
>

Perhaps it could be called an improvement if yall get it right because
AFAIK the only way to handle wrapping today is to directly clone the PID
you're worried about and deal with it immediately when the process exits
before wrap can happen. But I really wonder why PID wrapping matters SO
much, I bet some people are doing dangerously stupid things like using
PID as a credential even though everyone knows it wraps.

Maybe this can make signalling less racey somehow?
At the very least you could learn the process has exited instead of
blindly acting on a potentially recycled number. I recognize the value
in that specifically.
However, using pidfd as a token to do cross-namespace activities that
are already plummable is just plain weird to me, but maybe I'm too used
to doing things "the hard way".

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: pidfd design
  2019-03-22 14:04 ` Michael Tirado
@ 2019-03-25 17:45   ` Linus Torvalds
  2019-03-25 16:14     ` Michael Tirado
  2019-03-25 20:45     ` Christian Brauner
  2019-03-25 18:50   ` Christian Brauner
  1 sibling, 2 replies; 35+ messages in thread
From: Linus Torvalds @ 2019-03-25 17:45 UTC (permalink / raw)
  To: Michael Tirado; +Cc: Alexey Dobriyan, LKML

On Fri, Mar 22, 2019 at 11:34 AM Michael Tirado <mtirado418@gmail.com> wrote:
>
> On Wed, Mar 20, 2019 at 8:08 PM Alexey Dobriyan <adobriyan@gmail.com> wrote:
> >
> > pidfd code should be backed out immediately. Forget about /proc.
>
> Seems like Torvalds just merges this sort of "stuff" without reading
> it now, or there's something that auto accepted pull request to RC tree?

There is no auto-accept.

But there also didn't seem to be any valid arguments against it, and
the android people had arguments for it.

Arguing against it based on "I don't like /proc" is pointless. The
fact is, /proc is our system interface for a lot of things.

Arguing against it based on "I worry about the _other_
non-signal-sending things that could be done with this" is also
pointless. What other things? The only thing that got merged was the
signalling.

Now, arguing that signalling should use the open-time credentials
might make sense, but this isn't read/write. You can't fool some suid
program to do magic randon system calls for you, and if you can, then
arguing about pidfd is kind of pointless.

So the model of using a file descriptor instead of a 'pid' for signal
handling is actually very unix-like. Maybe that's how pid's should
have worked to begin with. Remember that whole "everything is a file"
thing?

Now, the fact that fork() and clone() return a pid obviously means
that pidfd isn't the primary model (not to decades of just history),
but that doesn't make pidfd wrong.

And namespace issues etc are all also kind of irrelevant. If you open
random files in /proc and randomly do pidfd_send_signal() on those,
you get random results. If that worries you, then DON'T DO THAT THEN,
for chrissake! That's not a sane model to begin with, but it's not the
usage model for this, so it's another completely specious argument.

So yes, I thought about the pidfd pull (which was why it happened at
the very end of the merge window), and I found the arguments against
it bad.

                Linus

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: pidfd design
  2019-03-22 14:04 ` Michael Tirado
  2019-03-25 17:45   ` Linus Torvalds
@ 2019-03-25 18:50   ` Christian Brauner
  1 sibling, 0 replies; 35+ messages in thread
From: Christian Brauner @ 2019-03-25 18:50 UTC (permalink / raw)
  To: Michael Tirado; +Cc: Alexey Dobriyan, Linus Torvalds, LKML

On Fri, Mar 22, 2019 at 02:04:35PM +0000, Michael Tirado wrote:
> On Wed, Mar 20, 2019 at 8:08 PM Alexey Dobriyan <adobriyan@gmail.com> wrote:
> >
> > pidfd code should be backed out immediately. Forget about /proc.
> 
> Seems like Torvalds just merges this sort of "stuff" without reading
> it now, or there's something that auto accepted pull request to RC tree?

I don't mind controversy around patchsets but I'd appreciate it that
when they are started the Cc is not modified and main people who
maintain the code that is critized are removed. I'd like to be able to
respond without other people having to notify me that there's a
discussion going on I should be involved in.

Christian

> 
> > The pull request you sent on Tue,  5 Mar 2019 18:13:01 +0100:
> >
> >>  git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git tags/pidfd-v5.1-rc1
> >
> > has been merged into torvalds/linux.git:
> > https://git.kernel.org/torvalds/c/a9dce6679d736cb3d612af39bab9f31f8db66f9b
> >
> >Thank you!

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: pidfd design
  2019-03-25 17:45   ` Linus Torvalds
  2019-03-25 16:14     ` Michael Tirado
@ 2019-03-25 20:45     ` Christian Brauner
  1 sibling, 0 replies; 35+ messages in thread
From: Christian Brauner @ 2019-03-25 20:45 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Michael Tirado, Alexey Dobriyan, LKML

On Mon, Mar 25, 2019 at 10:45:29AM -0700, Linus Torvalds wrote:
> On Fri, Mar 22, 2019 at 11:34 AM Michael Tirado <mtirado418@gmail.com> wrote:
> >
> > On Wed, Mar 20, 2019 at 8:08 PM Alexey Dobriyan <adobriyan@gmail.com> wrote:
> > >
> > > pidfd code should be backed out immediately. Forget about /proc.
> >
> > Seems like Torvalds just merges this sort of "stuff" without reading
> > it now, or there's something that auto accepted pull request to RC tree?
> 
> There is no auto-accept.
> 
> But there also didn't seem to be any valid arguments against it, and
> the android people had arguments for it.
> 
> Arguing against it based on "I don't like /proc" is pointless. The
> fact is, /proc is our system interface for a lot of things.
> 
> Arguing against it based on "I worry about the _other_
> non-signal-sending things that could be done with this" is also
> pointless. What other things? The only thing that got merged was the
> signalling.

To back Linus defense up with a glimpse into the future.

We will not be to rely on dirfds from proc to do general process
management. That is even in the commit message for the pidfd_send_signal
syscall, that we intend to decouple this from procfs, i.e. decouple
process management from process metadata reading.
We have an ongoing discussion and what a lot of people agree upon is
that pidfds will be anon inode file descriptors that stash a reference
to struct pid in their private_data member. They can be pollable if ever
need be and they are just conceptually cleaner and way simpler and
mirror what will happen in the new mount api as well.
The idea is to translate these pidfds e.g. via a simple ioctl()
interface that takes a pidfd and gives back - with standard permissions
applied as are today - a corresponding /proc/<pid> fd that can be used
to read metadata of a process (see the suggestion by Andy and Jann [1]).
The advantage is that this means that pidfd_clone() or something similar
can simply return a pidfd and does not need to care about what procfs
the process is supposed to be located in/reference and is in general way
safer.

But there is absolutely nothing wrong with allowing users to use
/proc/<pid> to signal processes. One of the reasons why I did this is
that it is so intuitive to users that non-kernel people have requested
this be possible over and over.
As mentioned in the orignal patchset the future was always to decouple
this from procfs (see the references in there) and this is what the new
pidctl() syscall is for that transparently translates between the
pid-based api and the pidfd-based api.

[1]: https://lore.kernel.org/lkml/CAG48ez3VMjLJBC_F3BxC2sc2s-28NdsrUduR=jX66XH0w2O-Qg@mail.gmail.com/ 

> 
> Now, arguing that signalling should use the open-time credentials
> might make sense, but this isn't read/write. You can't fool some suid
> program to do magic randon system calls for you, and if you can, then
> arguing about pidfd is kind of pointless.
> 
> So the model of using a file descriptor instead of a 'pid' for signal
> handling is actually very unix-like. Maybe that's how pid's should
> have worked to begin with. Remember that whole "everything is a file"
> thing?
> 
> Now, the fact that fork() and clone() return a pid obviously means
> that pidfd isn't the primary model (not to decades of just history),
> but that doesn't make pidfd wrong.
> 
> And namespace issues etc are all also kind of irrelevant. If you open
> random files in /proc and randomly do pidfd_send_signal() on those,
> you get random results. If that worries you, then DON'T DO THAT THEN,
> for chrissake! That's not a sane model to begin with, but it's not the
> usage model for this, so it's another completely specious argument.
> 
> So yes, I thought about the pidfd pull (which was why it happened at
> the very end of the merge window), and I found the arguments against
> it bad.
> 
>                 Linus

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: pidfd design
  2019-03-26  0:24                                                       ` Andy Lutomirski
@ 2019-03-28  9:21                                                         ` Christian Brauner
  0 siblings, 0 replies; 35+ messages in thread
From: Christian Brauner @ 2019-03-28  9:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Daniel Colascione, Jann Horn, Joel Fernandes, Suren Baghdasaryan,
	Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko,
	Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos,
	Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML,
	open list:ANDROID DRIVERS, kernel-team, Oleg Nesterov,
	Serge E. Hallyn, Kees Cook, Jonathan Kowalski, Linux API

On Mon, Mar 25, 2019 at 05:24:49PM -0700, Andy Lutomirski wrote:
> On Mon, Mar 25, 2019 at 5:12 PM Christian Brauner <christian@brauner.io> wrote:
> >
> > On Mon, Mar 25, 2019 at 05:00:17PM -0700, Andy Lutomirski wrote:
> > > On Mon, Mar 25, 2019 at 4:45 PM Christian Brauner <christian@brauner.io> wrote:
> > > >
> > > > On Mon, Mar 25, 2019 at 04:42:14PM -0700, Andy Lutomirski wrote:
> > > > > On Mon, Mar 25, 2019 at 1:23 PM Daniel Colascione <dancol@google.com> wrote:
> > > > > >
> > > > > > On Mon, Mar 25, 2019 at 1:14 PM Jann Horn <jannh@google.com> wrote:
> > > > > > >
> > > > > > > On Mon, Mar 25, 2019 at 8:44 PM Andy Lutomirski <luto@kernel.org> wrote:
> > > > >
> > > > > > > One ioctl on procfs roots to translate pidfds into that procfs,
> > > > > > > subject to both the normal lookup permission checks and only working
> > > > > > > if the pidfd has a translation into the procfs:
> > > > > > >
> > > > > > > int proc_root_fd = open("/proc", O_RDONLY);
> > > > > > > int proc_dir_fd = ioctl(proc_root_fd, PROC_PIDFD_TO_PROCFSFD, pidfd);
> > > > > > >
> > > > > > > And one ioctl on procfs directories to translate from PGIDs and PIDs to pidfds:
> > > > > > >
> > > > > > > int proc_pgid_fd = open("/proc/self", O_RDONLY);
> > > > > > > int self_pg_pidfd = ioctl(proc_pgid_fd, PROC_PROCFSFD_TO_PIDFD, 0);
> > > > > > > int proc_pid_fd = open("/proc/thread-self", O_RDONLY);
> > > > > > > int self_p_pidfd = ioctl(proc_pid_fd, PROC_PROCFSFD_TO_PIDFD, 0);
> > > > > > >
> > > > >
> > > > > This sounds okay to me.  Or we could make it so that a procfs
> > > > > directory fd also works as a pidfd, but that seems more likely to be
> > > > > problematic than just allowing two-way translation like this
> > > > >
> > > > > > >
> > > > > > > And then, as you proposed, the new sys_clone() can just return a
> > > > > > > pidfd, and you can convert it into a procfs fd yourself if you want.
> > > > > >
> > > > > > I think that's the consensus we reached on the other thread. The
> > > > > > O_DIRECTORY open on /proc/self/fd/mypidfd seems like it'd work well
> > > > > > enough.
> > > > >
> > > > > I must have missed this particular email.
> > > > >
> > > > > IMO, if /proc/self/fd/mypidfd allows O_DIRECTORY open to work, then it
> > > > > really ought to do function just like /proc/self/fd/mypidfd/. and
> > > > > /proc/self/fd/mypidfd/status should work.  And these latter two
> > > > > options seem nutty.
> > > > >
> > > > > Also, this O_DIRECTORY thing is missing the entire point of the ioctl
> > > > > interface -- it doesn't require procfs access.
> > > >
> > > > The other option was to encode the pid in the callers pid namespace into
> > > > the pidfd's fdinfo so that you can parse it out and open /proc/<pid>.
> > > > You'd just need an event on the pidfd to tell you when the process has
> > > > died. Jonathan and I just discussed this.
> > >
> > > From an application developer's POV, the ioctl interface sounds much,
> > > much nicer.
> >
> > Some people are strongly against ioctl()s some don't. I'm not against
> > them so both options are fine with me if people can agree.
> >
> 
> There are certainly non-ioctl equivalents that are functionally
> equivalent.  For example, there could be a syscall
> procfs_open_pidfd(procfs_fd, pid_fd).  I personally don't really mind
> ioctl() when it's really an operation on an fd.

I totally missed that mail somehow.
Yes, I agree that an ioctl() makes sense for that.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: pidfd design
  2019-03-26  0:12                                                     ` Christian Brauner
@ 2019-03-26  0:24                                                       ` Andy Lutomirski
  2019-03-28  9:21                                                         ` Christian Brauner
  0 siblings, 1 reply; 35+ messages in thread
From: Andy Lutomirski @ 2019-03-26  0:24 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Andy Lutomirski, Daniel Colascione, Jann Horn, Joel Fernandes,
	Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray,
	Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg,
	Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML,
	open list:ANDROID DRIVERS, kernel-team, Oleg Nesterov,
	Serge E. Hallyn, Kees Cook, Jonathan Kowalski, Linux API

On Mon, Mar 25, 2019 at 5:12 PM Christian Brauner <christian@brauner.io> wrote:
>
> On Mon, Mar 25, 2019 at 05:00:17PM -0700, Andy Lutomirski wrote:
> > On Mon, Mar 25, 2019 at 4:45 PM Christian Brauner <christian@brauner.io> wrote:
> > >
> > > On Mon, Mar 25, 2019 at 04:42:14PM -0700, Andy Lutomirski wrote:
> > > > On Mon, Mar 25, 2019 at 1:23 PM Daniel Colascione <dancol@google.com> wrote:
> > > > >
> > > > > On Mon, Mar 25, 2019 at 1:14 PM Jann Horn <jannh@google.com> wrote:
> > > > > >
> > > > > > On Mon, Mar 25, 2019 at 8:44 PM Andy Lutomirski <luto@kernel.org> wrote:
> > > >
> > > > > > One ioctl on procfs roots to translate pidfds into that procfs,
> > > > > > subject to both the normal lookup permission checks and only working
> > > > > > if the pidfd has a translation into the procfs:
> > > > > >
> > > > > > int proc_root_fd = open("/proc", O_RDONLY);
> > > > > > int proc_dir_fd = ioctl(proc_root_fd, PROC_PIDFD_TO_PROCFSFD, pidfd);
> > > > > >
> > > > > > And one ioctl on procfs directories to translate from PGIDs and PIDs to pidfds:
> > > > > >
> > > > > > int proc_pgid_fd = open("/proc/self", O_RDONLY);
> > > > > > int self_pg_pidfd = ioctl(proc_pgid_fd, PROC_PROCFSFD_TO_PIDFD, 0);
> > > > > > int proc_pid_fd = open("/proc/thread-self", O_RDONLY);
> > > > > > int self_p_pidfd = ioctl(proc_pid_fd, PROC_PROCFSFD_TO_PIDFD, 0);
> > > > > >
> > > >
> > > > This sounds okay to me.  Or we could make it so that a procfs
> > > > directory fd also works as a pidfd, but that seems more likely to be
> > > > problematic than just allowing two-way translation like this
> > > >
> > > > > >
> > > > > > And then, as you proposed, the new sys_clone() can just return a
> > > > > > pidfd, and you can convert it into a procfs fd yourself if you want.
> > > > >
> > > > > I think that's the consensus we reached on the other thread. The
> > > > > O_DIRECTORY open on /proc/self/fd/mypidfd seems like it'd work well
> > > > > enough.
> > > >
> > > > I must have missed this particular email.
> > > >
> > > > IMO, if /proc/self/fd/mypidfd allows O_DIRECTORY open to work, then it
> > > > really ought to do function just like /proc/self/fd/mypidfd/. and
> > > > /proc/self/fd/mypidfd/status should work.  And these latter two
> > > > options seem nutty.
> > > >
> > > > Also, this O_DIRECTORY thing is missing the entire point of the ioctl
> > > > interface -- it doesn't require procfs access.
> > >
> > > The other option was to encode the pid in the callers pid namespace into
> > > the pidfd's fdinfo so that you can parse it out and open /proc/<pid>.
> > > You'd just need an event on the pidfd to tell you when the process has
> > > died. Jonathan and I just discussed this.
> >
> > From an application developer's POV, the ioctl interface sounds much,
> > much nicer.
>
> Some people are strongly against ioctl()s some don't. I'm not against
> them so both options are fine with me if people can agree.
>

There are certainly non-ioctl equivalents that are functionally
equivalent.  For example, there could be a syscall
procfs_open_pidfd(procfs_fd, pid_fd).  I personally don't really mind
ioctl() when it's really an operation on an fd.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: pidfd design
  2019-03-26  0:00                                                   ` Andy Lutomirski
@ 2019-03-26  0:12                                                     ` Christian Brauner
  2019-03-26  0:24                                                       ` Andy Lutomirski
  0 siblings, 1 reply; 35+ messages in thread
From: Christian Brauner @ 2019-03-26  0:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Daniel Colascione, Jann Horn, Joel Fernandes, Suren Baghdasaryan,
	Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko,
	Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos,
	Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML,
	open list:ANDROID DRIVERS, kernel-team, Oleg Nesterov,
	Serge E. Hallyn, Kees Cook, Jonathan Kowalski, Linux API

On Mon, Mar 25, 2019 at 05:00:17PM -0700, Andy Lutomirski wrote:
> On Mon, Mar 25, 2019 at 4:45 PM Christian Brauner <christian@brauner.io> wrote:
> >
> > On Mon, Mar 25, 2019 at 04:42:14PM -0700, Andy Lutomirski wrote:
> > > On Mon, Mar 25, 2019 at 1:23 PM Daniel Colascione <dancol@google.com> wrote:
> > > >
> > > > On Mon, Mar 25, 2019 at 1:14 PM Jann Horn <jannh@google.com> wrote:
> > > > >
> > > > > On Mon, Mar 25, 2019 at 8:44 PM Andy Lutomirski <luto@kernel.org> wrote:
> > >
> > > > > One ioctl on procfs roots to translate pidfds into that procfs,
> > > > > subject to both the normal lookup permission checks and only working
> > > > > if the pidfd has a translation into the procfs:
> > > > >
> > > > > int proc_root_fd = open("/proc", O_RDONLY);
> > > > > int proc_dir_fd = ioctl(proc_root_fd, PROC_PIDFD_TO_PROCFSFD, pidfd);
> > > > >
> > > > > And one ioctl on procfs directories to translate from PGIDs and PIDs to pidfds:
> > > > >
> > > > > int proc_pgid_fd = open("/proc/self", O_RDONLY);
> > > > > int self_pg_pidfd = ioctl(proc_pgid_fd, PROC_PROCFSFD_TO_PIDFD, 0);
> > > > > int proc_pid_fd = open("/proc/thread-self", O_RDONLY);
> > > > > int self_p_pidfd = ioctl(proc_pid_fd, PROC_PROCFSFD_TO_PIDFD, 0);
> > > > >
> > >
> > > This sounds okay to me.  Or we could make it so that a procfs
> > > directory fd also works as a pidfd, but that seems more likely to be
> > > problematic than just allowing two-way translation like this
> > >
> > > > >
> > > > > And then, as you proposed, the new sys_clone() can just return a
> > > > > pidfd, and you can convert it into a procfs fd yourself if you want.
> > > >
> > > > I think that's the consensus we reached on the other thread. The
> > > > O_DIRECTORY open on /proc/self/fd/mypidfd seems like it'd work well
> > > > enough.
> > >
> > > I must have missed this particular email.
> > >
> > > IMO, if /proc/self/fd/mypidfd allows O_DIRECTORY open to work, then it
> > > really ought to do function just like /proc/self/fd/mypidfd/. and
> > > /proc/self/fd/mypidfd/status should work.  And these latter two
> > > options seem nutty.
> > >
> > > Also, this O_DIRECTORY thing is missing the entire point of the ioctl
> > > interface -- it doesn't require procfs access.
> >
> > The other option was to encode the pid in the callers pid namespace into
> > the pidfd's fdinfo so that you can parse it out and open /proc/<pid>.
> > You'd just need an event on the pidfd to tell you when the process has
> > died. Jonathan and I just discussed this.
> 
> From an application developer's POV, the ioctl interface sounds much,
> much nicer.

Some people are strongly against ioctl()s some don't. I'm not against
them so both options are fine with me if people can agree.

Christian

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: pidfd design
  2019-03-25 23:45                                                 ` Christian Brauner
@ 2019-03-26  0:00                                                   ` Andy Lutomirski
  2019-03-26  0:12                                                     ` Christian Brauner
  0 siblings, 1 reply; 35+ messages in thread
From: Andy Lutomirski @ 2019-03-26  0:00 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Andy Lutomirski, Daniel Colascione, Jann Horn, Joel Fernandes,
	Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray,
	Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg,
	Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML,
	open list:ANDROID DRIVERS, kernel-team, Oleg Nesterov,
	Serge E. Hallyn, Kees Cook, Jonathan Kowalski, Linux API

On Mon, Mar 25, 2019 at 4:45 PM Christian Brauner <christian@brauner.io> wrote:
>
> On Mon, Mar 25, 2019 at 04:42:14PM -0700, Andy Lutomirski wrote:
> > On Mon, Mar 25, 2019 at 1:23 PM Daniel Colascione <dancol@google.com> wrote:
> > >
> > > On Mon, Mar 25, 2019 at 1:14 PM Jann Horn <jannh@google.com> wrote:
> > > >
> > > > On Mon, Mar 25, 2019 at 8:44 PM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > > > One ioctl on procfs roots to translate pidfds into that procfs,
> > > > subject to both the normal lookup permission checks and only working
> > > > if the pidfd has a translation into the procfs:
> > > >
> > > > int proc_root_fd = open("/proc", O_RDONLY);
> > > > int proc_dir_fd = ioctl(proc_root_fd, PROC_PIDFD_TO_PROCFSFD, pidfd);
> > > >
> > > > And one ioctl on procfs directories to translate from PGIDs and PIDs to pidfds:
> > > >
> > > > int proc_pgid_fd = open("/proc/self", O_RDONLY);
> > > > int self_pg_pidfd = ioctl(proc_pgid_fd, PROC_PROCFSFD_TO_PIDFD, 0);
> > > > int proc_pid_fd = open("/proc/thread-self", O_RDONLY);
> > > > int self_p_pidfd = ioctl(proc_pid_fd, PROC_PROCFSFD_TO_PIDFD, 0);
> > > >
> >
> > This sounds okay to me.  Or we could make it so that a procfs
> > directory fd also works as a pidfd, but that seems more likely to be
> > problematic than just allowing two-way translation like this
> >
> > > >
> > > > And then, as you proposed, the new sys_clone() can just return a
> > > > pidfd, and you can convert it into a procfs fd yourself if you want.
> > >
> > > I think that's the consensus we reached on the other thread. The
> > > O_DIRECTORY open on /proc/self/fd/mypidfd seems like it'd work well
> > > enough.
> >
> > I must have missed this particular email.
> >
> > IMO, if /proc/self/fd/mypidfd allows O_DIRECTORY open to work, then it
> > really ought to do function just like /proc/self/fd/mypidfd/. and
> > /proc/self/fd/mypidfd/status should work.  And these latter two
> > options seem nutty.
> >
> > Also, this O_DIRECTORY thing is missing the entire point of the ioctl
> > interface -- it doesn't require procfs access.
>
> The other option was to encode the pid in the callers pid namespace into
> the pidfd's fdinfo so that you can parse it out and open /proc/<pid>.
> You'd just need an event on the pidfd to tell you when the process has
> died. Jonathan and I just discussed this.

From an application developer's POV, the ioctl interface sounds much,
much nicer.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: pidfd design
  2019-03-25 23:42                                               ` Andy Lutomirski
@ 2019-03-25 23:45                                                 ` Christian Brauner
  2019-03-26  0:00                                                   ` Andy Lutomirski
  0 siblings, 1 reply; 35+ messages in thread
From: Christian Brauner @ 2019-03-25 23:45 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Daniel Colascione, Jann Horn, Joel Fernandes, Suren Baghdasaryan,
	Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko,
	Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos,
	Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML,
	open list:ANDROID DRIVERS, kernel-team, Oleg Nesterov,
	Serge E. Hallyn, Kees Cook, Jonathan Kowalski, Linux API

On Mon, Mar 25, 2019 at 04:42:14PM -0700, Andy Lutomirski wrote:
> On Mon, Mar 25, 2019 at 1:23 PM Daniel Colascione <dancol@google.com> wrote:
> >
> > On Mon, Mar 25, 2019 at 1:14 PM Jann Horn <jannh@google.com> wrote:
> > >
> > > On Mon, Mar 25, 2019 at 8:44 PM Andy Lutomirski <luto@kernel.org> wrote:
> 
> > > One ioctl on procfs roots to translate pidfds into that procfs,
> > > subject to both the normal lookup permission checks and only working
> > > if the pidfd has a translation into the procfs:
> > >
> > > int proc_root_fd = open("/proc", O_RDONLY);
> > > int proc_dir_fd = ioctl(proc_root_fd, PROC_PIDFD_TO_PROCFSFD, pidfd);
> > >
> > > And one ioctl on procfs directories to translate from PGIDs and PIDs to pidfds:
> > >
> > > int proc_pgid_fd = open("/proc/self", O_RDONLY);
> > > int self_pg_pidfd = ioctl(proc_pgid_fd, PROC_PROCFSFD_TO_PIDFD, 0);
> > > int proc_pid_fd = open("/proc/thread-self", O_RDONLY);
> > > int self_p_pidfd = ioctl(proc_pid_fd, PROC_PROCFSFD_TO_PIDFD, 0);
> > >
> 
> This sounds okay to me.  Or we could make it so that a procfs
> directory fd also works as a pidfd, but that seems more likely to be
> problematic than just allowing two-way translation like this
> 
> > >
> > > And then, as you proposed, the new sys_clone() can just return a
> > > pidfd, and you can convert it into a procfs fd yourself if you want.
> >
> > I think that's the consensus we reached on the other thread. The
> > O_DIRECTORY open on /proc/self/fd/mypidfd seems like it'd work well
> > enough.
> 
> I must have missed this particular email.
> 
> IMO, if /proc/self/fd/mypidfd allows O_DIRECTORY open to work, then it
> really ought to do function just like /proc/self/fd/mypidfd/. and
> /proc/self/fd/mypidfd/status should work.  And these latter two
> options seem nutty.
> 
> Also, this O_DIRECTORY thing is missing the entire point of the ioctl
> interface -- it doesn't require procfs access.

The other option was to encode the pid in the callers pid namespace into
the pidfd's fdinfo so that you can parse it out and open /proc/<pid>.
You'd just need an event on the pidfd to tell you when the process has
died. Jonathan and I just discussed this.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: pidfd design
  2019-03-25 20:23                                             ` Daniel Colascione
@ 2019-03-25 23:42                                               ` Andy Lutomirski
  2019-03-25 23:45                                                 ` Christian Brauner
  0 siblings, 1 reply; 35+ messages in thread
From: Andy Lutomirski @ 2019-03-25 23:42 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Jann Horn, Andy Lutomirski, Christian Brauner, Joel Fernandes,
	Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray,
	Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg,
	Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML,
	open list:ANDROID DRIVERS, kernel-team, Oleg Nesterov,
	Serge E. Hallyn, Kees Cook, Jonathan Kowalski, Linux API

On Mon, Mar 25, 2019 at 1:23 PM Daniel Colascione <dancol@google.com> wrote:
>
> On Mon, Mar 25, 2019 at 1:14 PM Jann Horn <jannh@google.com> wrote:
> >
> > On Mon, Mar 25, 2019 at 8:44 PM Andy Lutomirski <luto@kernel.org> wrote:

> > One ioctl on procfs roots to translate pidfds into that procfs,
> > subject to both the normal lookup permission checks and only working
> > if the pidfd has a translation into the procfs:
> >
> > int proc_root_fd = open("/proc", O_RDONLY);
> > int proc_dir_fd = ioctl(proc_root_fd, PROC_PIDFD_TO_PROCFSFD, pidfd);
> >
> > And one ioctl on procfs directories to translate from PGIDs and PIDs to pidfds:
> >
> > int proc_pgid_fd = open("/proc/self", O_RDONLY);
> > int self_pg_pidfd = ioctl(proc_pgid_fd, PROC_PROCFSFD_TO_PIDFD, 0);
> > int proc_pid_fd = open("/proc/thread-self", O_RDONLY);
> > int self_p_pidfd = ioctl(proc_pid_fd, PROC_PROCFSFD_TO_PIDFD, 0);
> >

This sounds okay to me.  Or we could make it so that a procfs
directory fd also works as a pidfd, but that seems more likely to be
problematic than just allowing two-way translation like this

> >
> > And then, as you proposed, the new sys_clone() can just return a
> > pidfd, and you can convert it into a procfs fd yourself if you want.
>
> I think that's the consensus we reached on the other thread. The
> O_DIRECTORY open on /proc/self/fd/mypidfd seems like it'd work well
> enough.

I must have missed this particular email.

IMO, if /proc/self/fd/mypidfd allows O_DIRECTORY open to work, then it
really ought to do function just like /proc/self/fd/mypidfd/. and
/proc/self/fd/mypidfd/status should work.  And these latter two
options seem nutty.

Also, this O_DIRECTORY thing is missing the entire point of the ioctl
interface -- it doesn't require procfs access.

--Andy

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: pidfd design
  2019-03-25 20:13                                           ` Jann Horn
@ 2019-03-25 20:23                                             ` Daniel Colascione
  2019-03-25 23:42                                               ` Andy Lutomirski
  0 siblings, 1 reply; 35+ messages in thread
From: Daniel Colascione @ 2019-03-25 20:23 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andy Lutomirski, Christian Brauner, Joel Fernandes,
	Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray,
	Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg,
	Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML,
	open list:ANDROID DRIVERS, kernel-team, Oleg Nesterov,
	Serge E. Hallyn, Kees Cook, Jonathan Kowalski, Linux API

On Mon, Mar 25, 2019 at 1:14 PM Jann Horn <jannh@google.com> wrote:
>
> On Mon, Mar 25, 2019 at 8:44 PM Andy Lutomirski <luto@kernel.org> wrote:
> > On Wed, Mar 20, 2019 at 12:40 PM Daniel Colascione <dancol@google.com> wrote:
> > > On Wed, Mar 20, 2019 at 12:14 PM Christian Brauner <christian@brauner.io> wrote:
> > > > On Wed, Mar 20, 2019 at 11:58:57AM -0700, Andy Lutomirski wrote:
> > > > > On Wed, Mar 20, 2019 at 11:52 AM Christian Brauner <christian@brauner.io> wrote:
> > > > > >
> > > > > > You're misunderstanding. Again, I said in my previous mails it should
> > > > > > accept pidfds optionally as arguments, yes. But I don't want it to
> > > > > > return the status fds that you previously wanted pidfd_wait() to return.
> > > > > > I really want to see Joel's pidfd_wait() patchset and have more people
> > > > > > review the actual code.
> > > > >
> > > > > Just to make sure that no one is forgetting a material security consideration:
> > > >
> > > > Andy, thanks for commenting!
> > > >
> > > > >
> > > > > $ ls /proc/self
> > > > > attr             exe        mountinfo      projid_map    status
> > > > > autogroup        fd         mounts         root          syscall
> > > > > auxv             fdinfo     mountstats     sched         task
> > > > > cgroup           gid_map    net            schedstat     timers
> > > > > clear_refs       io         ns             sessionid     timerslack_ns
> > > > > cmdline          latency    numa_maps      setgroups     uid_map
> > > > > comm             limits     oom_adj        smaps         wchan
> > > > > coredump_filter  loginuid   oom_score      smaps_rollup
> > > > > cpuset           map_files  oom_score_adj  stack
> > > > > cwd              maps       pagemap        stat
> > > > > environ          mem        personality    statm
> > > > >
> > > > > A bunch of this stuff makes sense to make accessible through a syscall
> > > > > interface that we expect to be used even in sandboxes.  But a bunch of
> > > > > it does not.  For example, *_map, mounts, mountstats, and net are all
> > > > > namespace-wide things that certain policies expect to be unavailable.
> > > > > stack, for example, is a potential attack surface.  Etc.
> > >
> > > If you can access these files sources via open(2) on /proc/<pid>, you
> > > should be able to access them via a pidfd. If you can't, you
> > > shouldn't. Which /proc? The one you'd get by mounting procfs. I don't
> > > see how pidfd makes any material changes to anyone's security. As far
> > > as I'm concerned, if a sandbox can't mount /proc at all, it's just a
> > > broken and unsupported configuration.
> >
> > It's not "broken and unsupported".  I know of an actual working,
> > deployed container-ish sandbox that does exactly this.  I would also
> > guess that quite a few not-at-all-container-like sandboxes work like
> > this.  (The obvious seccomp + unshare + pivot_root
> > deny-myself-access-to-lots-of-things trick results in no /proc, which
> > is by dsign.)
> >
> > >
> > > An actual threat model and real thought paid to access capabilities
> > > would help. Almost everything around the interaction of Linux kernel
> > > namespaces and security feels like a jumble of ad-hoc patches added as
> > > afterthoughts in response to random objections.
> >
> > I fully agree.  But if you start thinking for real about access
> > capabilities, there's no way that you're going to conclude that a
> > capability to access some process implies a capability to access the
> > settings of its network namespace.
> >
> > >
> > > >> All these new APIs either need to
> > > > > return something more restrictive than a proc dirfd or they need to
> > > > > follow the same rules.
> > >
> >
> > ...
> >
> > > What's special about libraries? How is a library any worse-off using
> > > openat(2) on a pidfd than it would be just opening the file called
> > > "/proc/$apid"?
> >
> > Because most libraries actually work, right now, without /proc.  Even
> > libraries that spawn subprocesses.  If we make the new API have the
> > property that it doesn't work if you're in a non-root user namespace
> > and /proc isn't mounted, the result will be an utter mess.
> >
> > >
> > > > > Yes, this is unfortunate, but it is indeed the current situation.  I
> > > > > suppose that we could return magic restricted dirfds, or we could
> > > > > return things that aren't dirfds and all and have some API that gives
> > > > > you the dirfd associated with a procfd but only if you can see
> > > > > /proc/PID.
> > > >
> > > > What would be your opinion to having a
> > > > /proc/<pid>/handle
> > > > file instead of having a dirfd. Essentially, what I initially proposed
> > > > at LPC. The change on what we currently have in master would be:
> > > > https://gist.github.com/brauner/59eec91550c5624c9999eaebd95a70df
> > >
> > > And how do you propose, given one of these handle objects, getting a
> > > process's current priority, or its current oom score, or its list of
> > > memory maps? As I mentioned in my original email, and which nobody has
> > > addressed, if you don't use a dirfd as your process handle or you
> > > don't provide an easy way to get one of these proc directory FDs, you
> > > need to duplicate a lot of metadata access interfaces.
> >
> > An API that takes a process handle object and an fd pointing at /proc
> > (the root of the proc fs) and gives you back a proc dirfd would do the
> > trick.  You could do this with no new kernel features at all if you're
> > willing to read the pid, call openat(2), and handle the races in user
> > code.
>
> This seems like something that might be a good fit for two ioctls?

As an aside, we had a long discussion about why fundamental facilities
like this should be system calls, not ioctls. I think the arguments
still apply.

> One ioctl on procfs roots to translate pidfds into that procfs,
> subject to both the normal lookup permission checks and only working
> if the pidfd has a translation into the procfs:
>
> int proc_root_fd = open("/proc", O_RDONLY);
> int proc_dir_fd = ioctl(proc_root_fd, PROC_PIDFD_TO_PROCFSFD, pidfd);
>
> And one ioctl on procfs directories to translate from PGIDs and PIDs to pidfds:
>
> int proc_pgid_fd = open("/proc/self", O_RDONLY);
> int self_pg_pidfd = ioctl(proc_pgid_fd, PROC_PROCFSFD_TO_PIDFD, 0);
> int proc_pid_fd = open("/proc/thread-self", O_RDONLY);
> int self_p_pidfd = ioctl(proc_pid_fd, PROC_PROCFSFD_TO_PIDFD, 0);
>
>
> And then, as you proposed, the new sys_clone() can just return a
> pidfd, and you can convert it into a procfs fd yourself if you want.

I think that's the consensus we reached on the other thread. The
O_DIRECTORY open on /proc/self/fd/mypidfd seems like it'd work well
enough.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: pidfd design
  2019-03-21 17:02                                         ` Andy Lutomirski
@ 2019-03-25 20:13                                           ` Jann Horn
  2019-03-25 20:23                                             ` Daniel Colascione
  0 siblings, 1 reply; 35+ messages in thread
From: Jann Horn @ 2019-03-25 20:13 UTC (permalink / raw)
  To: Andy Lutomirski, Christian Brauner
  Cc: Daniel Colascione, Joel Fernandes, Suren Baghdasaryan,
	Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko,
	Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos,
	Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML,
	open list:ANDROID DRIVERS, kernel-team, Oleg Nesterov,
	Serge E. Hallyn, Kees Cook, Jonathan Kowalski, Linux API

On Mon, Mar 25, 2019 at 8:44 PM Andy Lutomirski <luto@kernel.org> wrote:
> On Wed, Mar 20, 2019 at 12:40 PM Daniel Colascione <dancol@google.com> wrote:
> > On Wed, Mar 20, 2019 at 12:14 PM Christian Brauner <christian@brauner.io> wrote:
> > > On Wed, Mar 20, 2019 at 11:58:57AM -0700, Andy Lutomirski wrote:
> > > > On Wed, Mar 20, 2019 at 11:52 AM Christian Brauner <christian@brauner.io> wrote:
> > > > >
> > > > > You're misunderstanding. Again, I said in my previous mails it should
> > > > > accept pidfds optionally as arguments, yes. But I don't want it to
> > > > > return the status fds that you previously wanted pidfd_wait() to return.
> > > > > I really want to see Joel's pidfd_wait() patchset and have more people
> > > > > review the actual code.
> > > >
> > > > Just to make sure that no one is forgetting a material security consideration:
> > >
> > > Andy, thanks for commenting!
> > >
> > > >
> > > > $ ls /proc/self
> > > > attr             exe        mountinfo      projid_map    status
> > > > autogroup        fd         mounts         root          syscall
> > > > auxv             fdinfo     mountstats     sched         task
> > > > cgroup           gid_map    net            schedstat     timers
> > > > clear_refs       io         ns             sessionid     timerslack_ns
> > > > cmdline          latency    numa_maps      setgroups     uid_map
> > > > comm             limits     oom_adj        smaps         wchan
> > > > coredump_filter  loginuid   oom_score      smaps_rollup
> > > > cpuset           map_files  oom_score_adj  stack
> > > > cwd              maps       pagemap        stat
> > > > environ          mem        personality    statm
> > > >
> > > > A bunch of this stuff makes sense to make accessible through a syscall
> > > > interface that we expect to be used even in sandboxes.  But a bunch of
> > > > it does not.  For example, *_map, mounts, mountstats, and net are all
> > > > namespace-wide things that certain policies expect to be unavailable.
> > > > stack, for example, is a potential attack surface.  Etc.
> >
> > If you can access these files sources via open(2) on /proc/<pid>, you
> > should be able to access them via a pidfd. If you can't, you
> > shouldn't. Which /proc? The one you'd get by mounting procfs. I don't
> > see how pidfd makes any material changes to anyone's security. As far
> > as I'm concerned, if a sandbox can't mount /proc at all, it's just a
> > broken and unsupported configuration.
>
> It's not "broken and unsupported".  I know of an actual working,
> deployed container-ish sandbox that does exactly this.  I would also
> guess that quite a few not-at-all-container-like sandboxes work like
> this.  (The obvious seccomp + unshare + pivot_root
> deny-myself-access-to-lots-of-things trick results in no /proc, which
> is by dsign.)
>
> >
> > An actual threat model and real thought paid to access capabilities
> > would help. Almost everything around the interaction of Linux kernel
> > namespaces and security feels like a jumble of ad-hoc patches added as
> > afterthoughts in response to random objections.
>
> I fully agree.  But if you start thinking for real about access
> capabilities, there's no way that you're going to conclude that a
> capability to access some process implies a capability to access the
> settings of its network namespace.
>
> >
> > >> All these new APIs either need to
> > > > return something more restrictive than a proc dirfd or they need to
> > > > follow the same rules.
> >
>
> ...
>
> > What's special about libraries? How is a library any worse-off using
> > openat(2) on a pidfd than it would be just opening the file called
> > "/proc/$apid"?
>
> Because most libraries actually work, right now, without /proc.  Even
> libraries that spawn subprocesses.  If we make the new API have the
> property that it doesn't work if you're in a non-root user namespace
> and /proc isn't mounted, the result will be an utter mess.
>
> >
> > > > Yes, this is unfortunate, but it is indeed the current situation.  I
> > > > suppose that we could return magic restricted dirfds, or we could
> > > > return things that aren't dirfds and all and have some API that gives
> > > > you the dirfd associated with a procfd but only if you can see
> > > > /proc/PID.
> > >
> > > What would be your opinion to having a
> > > /proc/<pid>/handle
> > > file instead of having a dirfd. Essentially, what I initially proposed
> > > at LPC. The change on what we currently have in master would be:
> > > https://gist.github.com/brauner/59eec91550c5624c9999eaebd95a70df
> >
> > And how do you propose, given one of these handle objects, getting a
> > process's current priority, or its current oom score, or its list of
> > memory maps? As I mentioned in my original email, and which nobody has
> > addressed, if you don't use a dirfd as your process handle or you
> > don't provide an easy way to get one of these proc directory FDs, you
> > need to duplicate a lot of metadata access interfaces.
>
> An API that takes a process handle object and an fd pointing at /proc
> (the root of the proc fs) and gives you back a proc dirfd would do the
> trick.  You could do this with no new kernel features at all if you're
> willing to read the pid, call openat(2), and handle the races in user
> code.

This seems like something that might be a good fit for two ioctls?

One ioctl on procfs roots to translate pidfds into that procfs,
subject to both the normal lookup permission checks and only working
if the pidfd has a translation into the procfs:

int proc_root_fd = open("/proc", O_RDONLY);
int proc_dir_fd = ioctl(proc_root_fd, PROC_PIDFD_TO_PROCFSFD, pidfd);

And one ioctl on procfs directories to translate from PGIDs and PIDs to pidfds:

int proc_pgid_fd = open("/proc/self", O_RDONLY);
int self_pg_pidfd = ioctl(proc_pgid_fd, PROC_PROCFSFD_TO_PIDFD, 0);
int proc_pid_fd = open("/proc/thread-self", O_RDONLY);
int self_p_pidfd = ioctl(proc_pid_fd, PROC_PROCFSFD_TO_PIDFD, 0);


And then, as you proposed, the new sys_clone() can just return a
pidfd, and you can convert it into a procfs fd yourself if you want.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: pidfd design
  2019-03-24 14:44                                     ` Serge E. Hallyn
@ 2019-03-24 18:48                                       ` Joel Fernandes
  0 siblings, 0 replies; 35+ messages in thread
From: Joel Fernandes @ 2019-03-24 18:48 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Daniel Colascione, Christian Brauner, Joel Fernandes,
	Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray,
	Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg,
	Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML,
	open list:ANDROID DRIVERS, linux-mm, kernel-team, Oleg Nesterov,
	Andy Lutomirski, Kees Cook

On Sun, Mar 24, 2019 at 10:44 AM Serge E. Hallyn <serge@hallyn.com> wrote:
>
> On Wed, Mar 20, 2019 at 12:29:31PM -0700, Daniel Colascione wrote:
> > On Wed, Mar 20, 2019 at 11:52 AM Christian Brauner <christian@brauner.io> wrote:
> > > I really want to see Joel's pidfd_wait() patchset and have more people
> > > review the actual code.
> >
> > Sure. But it's also unpleasant to have people write code and then have
> > to throw it away due to guessing incorrectly about unclear
> > requirements.
>
> No, it is not.  It is not unpleasant.  And it is useful.  It is the best way to
> identify and resolve those incorrect guesses and unclear requirements.

No problem, a bit of discussion helped set the direction. Personally
it did help clarify lot of things for me.  We are hard at work with
come up with an implementation and are looking at posting something
soon. I agree that the best is to discuss on actual code where
possible.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: pidfd design
  2019-03-20 19:29                                   ` Daniel Colascione
@ 2019-03-24 14:44                                     ` Serge E. Hallyn
  2019-03-24 18:48                                       ` Joel Fernandes
  0 siblings, 1 reply; 35+ messages in thread
From: Serge E. Hallyn @ 2019-03-24 14:44 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Christian Brauner, Joel Fernandes, Suren Baghdasaryan,
	Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko,
	Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos,
	Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML,
	open list:ANDROID DRIVERS, linux-mm, kernel-team, Oleg Nesterov,
	Andy Lutomirski, Serge E. Hallyn, Kees Cook

On Wed, Mar 20, 2019 at 12:29:31PM -0700, Daniel Colascione wrote:
> On Wed, Mar 20, 2019 at 11:52 AM Christian Brauner <christian@brauner.io> wrote:
> > I really want to see Joel's pidfd_wait() patchset and have more people
> > review the actual code.
> 
> Sure. But it's also unpleasant to have people write code and then have
> to throw it away due to guessing incorrectly about unclear
> requirements.

No, it is not.  It is not unpleasant.  And it is useful.  It is the best way to
identify and resolve those incorrect guesses and unclear requirements.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: pidfd design
  2019-03-20 19:40                                       ` Daniel Colascione
@ 2019-03-21 17:02                                         ` Andy Lutomirski
  2019-03-25 20:13                                           ` Jann Horn
  0 siblings, 1 reply; 35+ messages in thread
From: Andy Lutomirski @ 2019-03-21 17:02 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Christian Brauner, Andy Lutomirski, Joel Fernandes,
	Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray,
	Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg,
	Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML,
	open list:ANDROID DRIVERS, linux-mm, kernel-team, Oleg Nesterov,
	Serge E. Hallyn, Kees Cook

On Wed, Mar 20, 2019 at 12:40 PM Daniel Colascione <dancol@google.com> wrote:
>
> On Wed, Mar 20, 2019 at 12:14 PM Christian Brauner <christian@brauner.io> wrote:
> >
> > On Wed, Mar 20, 2019 at 11:58:57AM -0700, Andy Lutomirski wrote:
> > > On Wed, Mar 20, 2019 at 11:52 AM Christian Brauner <christian@brauner.io> wrote:
> > > >
> > > > You're misunderstanding. Again, I said in my previous mails it should
> > > > accept pidfds optionally as arguments, yes. But I don't want it to
> > > > return the status fds that you previously wanted pidfd_wait() to return.
> > > > I really want to see Joel's pidfd_wait() patchset and have more people
> > > > review the actual code.
> > >
> > > Just to make sure that no one is forgetting a material security consideration:
> >
> > Andy, thanks for commenting!
> >
> > >
> > > $ ls /proc/self
> > > attr             exe        mountinfo      projid_map    status
> > > autogroup        fd         mounts         root          syscall
> > > auxv             fdinfo     mountstats     sched         task
> > > cgroup           gid_map    net            schedstat     timers
> > > clear_refs       io         ns             sessionid     timerslack_ns
> > > cmdline          latency    numa_maps      setgroups     uid_map
> > > comm             limits     oom_adj        smaps         wchan
> > > coredump_filter  loginuid   oom_score      smaps_rollup
> > > cpuset           map_files  oom_score_adj  stack
> > > cwd              maps       pagemap        stat
> > > environ          mem        personality    statm
> > >
> > > A bunch of this stuff makes sense to make accessible through a syscall
> > > interface that we expect to be used even in sandboxes.  But a bunch of
> > > it does not.  For example, *_map, mounts, mountstats, and net are all
> > > namespace-wide things that certain policies expect to be unavailable.
> > > stack, for example, is a potential attack surface.  Etc.
>
> If you can access these files sources via open(2) on /proc/<pid>, you
> should be able to access them via a pidfd. If you can't, you
> shouldn't. Which /proc? The one you'd get by mounting procfs. I don't
> see how pidfd makes any material changes to anyone's security. As far
> as I'm concerned, if a sandbox can't mount /proc at all, it's just a
> broken and unsupported configuration.

It's not "broken and unsupported".  I know of an actual working,
deployed container-ish sandbox that does exactly this.  I would also
guess that quite a few not-at-all-container-like sandboxes work like
this.  (The obvious seccomp + unshare + pivot_root
deny-myself-access-to-lots-of-things trick results in no /proc, which
is by dsign.)

>
> An actual threat model and real thought paid to access capabilities
> would help. Almost everything around the interaction of Linux kernel
> namespaces and security feels like a jumble of ad-hoc patches added as
> afterthoughts in response to random objections.

I fully agree.  But if you start thinking for real about access
capabilities, there's no way that you're going to conclude that a
capability to access some process implies a capability to access the
settings of its network namespace.

>
> >> All these new APIs either need to
> > > return something more restrictive than a proc dirfd or they need to
> > > follow the same rules.
>

...

> What's special about libraries? How is a library any worse-off using
> openat(2) on a pidfd than it would be just opening the file called
> "/proc/$apid"?

Because most libraries actually work, right now, without /proc.  Even
libraries that spawn subprocesses.  If we make the new API have the
property that it doesn't work if you're in a non-root user namespace
and /proc isn't mounted, the result will be an utter mess.

>
> > > Yes, this is unfortunate, but it is indeed the current situation.  I
> > > suppose that we could return magic restricted dirfds, or we could
> > > return things that aren't dirfds and all and have some API that gives
> > > you the dirfd associated with a procfd but only if you can see
> > > /proc/PID.
> >
> > What would be your opinion to having a
> > /proc/<pid>/handle
> > file instead of having a dirfd. Essentially, what I initially proposed
> > at LPC. The change on what we currently have in master would be:
> > https://gist.github.com/brauner/59eec91550c5624c9999eaebd95a70df
>
> And how do you propose, given one of these handle objects, getting a
> process's current priority, or its current oom score, or its list of
> memory maps? As I mentioned in my original email, and which nobody has
> addressed, if you don't use a dirfd as your process handle or you
> don't provide an easy way to get one of these proc directory FDs, you
> need to duplicate a lot of metadata access interfaces.

An API that takes a process handle object and an fd pointing at /proc
(the root of the proc fs) and gives you back a proc dirfd would do the
trick.  You could do this with no new kernel features at all if you're
willing to read the pid, call openat(2), and handle the races in user
code.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: pidfd design
  2019-03-20 19:14                                     ` Christian Brauner
@ 2019-03-20 19:40                                       ` Daniel Colascione
  2019-03-21 17:02                                         ` Andy Lutomirski
  0 siblings, 1 reply; 35+ messages in thread
From: Daniel Colascione @ 2019-03-20 19:40 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Andy Lutomirski, Joel Fernandes, Suren Baghdasaryan,
	Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko,
	Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos,
	Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML,
	open list:ANDROID DRIVERS, linux-mm, kernel-team, Oleg Nesterov,
	Serge E. Hallyn, Kees Cook

On Wed, Mar 20, 2019 at 12:14 PM Christian Brauner <christian@brauner.io> wrote:
>
> On Wed, Mar 20, 2019 at 11:58:57AM -0700, Andy Lutomirski wrote:
> > On Wed, Mar 20, 2019 at 11:52 AM Christian Brauner <christian@brauner.io> wrote:
> > >
> > > You're misunderstanding. Again, I said in my previous mails it should
> > > accept pidfds optionally as arguments, yes. But I don't want it to
> > > return the status fds that you previously wanted pidfd_wait() to return.
> > > I really want to see Joel's pidfd_wait() patchset and have more people
> > > review the actual code.
> >
> > Just to make sure that no one is forgetting a material security consideration:
>
> Andy, thanks for commenting!
>
> >
> > $ ls /proc/self
> > attr             exe        mountinfo      projid_map    status
> > autogroup        fd         mounts         root          syscall
> > auxv             fdinfo     mountstats     sched         task
> > cgroup           gid_map    net            schedstat     timers
> > clear_refs       io         ns             sessionid     timerslack_ns
> > cmdline          latency    numa_maps      setgroups     uid_map
> > comm             limits     oom_adj        smaps         wchan
> > coredump_filter  loginuid   oom_score      smaps_rollup
> > cpuset           map_files  oom_score_adj  stack
> > cwd              maps       pagemap        stat
> > environ          mem        personality    statm
> >
> > A bunch of this stuff makes sense to make accessible through a syscall
> > interface that we expect to be used even in sandboxes.  But a bunch of
> > it does not.  For example, *_map, mounts, mountstats, and net are all
> > namespace-wide things that certain policies expect to be unavailable.
> > stack, for example, is a potential attack surface.  Etc.

If you can access these files sources via open(2) on /proc/<pid>, you
should be able to access them via a pidfd. If you can't, you
shouldn't. Which /proc? The one you'd get by mounting procfs. I don't
see how pidfd makes any material changes to anyone's security. As far
as I'm concerned, if a sandbox can't mount /proc at all, it's just a
broken and unsupported configuration.

An actual threat model and real thought paid to access capabilities
would help. Almost everything around the interaction of Linux kernel
namespaces and security feels like a jumble of ad-hoc patches added as
afterthoughts in response to random objections.

>> All these new APIs either need to
> > return something more restrictive than a proc dirfd or they need to
> > follow the same rules.

What's wrong with the latter?

> > And I'm afraid that the latter may be a
> > nonstarter if you expect these APIs to be used in libraries.

What's special about libraries? How is a library any worse-off using
openat(2) on a pidfd than it would be just opening the file called
"/proc/$apid"?

> > Yes, this is unfortunate, but it is indeed the current situation.  I
> > suppose that we could return magic restricted dirfds, or we could
> > return things that aren't dirfds and all and have some API that gives
> > you the dirfd associated with a procfd but only if you can see
> > /proc/PID.
>
> What would be your opinion to having a
> /proc/<pid>/handle
> file instead of having a dirfd. Essentially, what I initially proposed
> at LPC. The change on what we currently have in master would be:
> https://gist.github.com/brauner/59eec91550c5624c9999eaebd95a70df

And how do you propose, given one of these handle objects, getting a
process's current priority, or its current oom score, or its list of
memory maps? As I mentioned in my original email, and which nobody has
addressed, if you don't use a dirfd as your process handle or you
don't provide an easy way to get one of these proc directory FDs, you
need to duplicate a lot of metadata access interfaces.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: pidfd design
  2019-03-20 18:51                                 ` Christian Brauner
  2019-03-20 18:58                                   ` Andy Lutomirski
  2019-03-20 19:19                                   ` Joel Fernandes
@ 2019-03-20 19:29                                   ` Daniel Colascione
  2019-03-24 14:44                                     ` Serge E. Hallyn
  2 siblings, 1 reply; 35+ messages in thread
From: Daniel Colascione @ 2019-03-20 19:29 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Joel Fernandes, Suren Baghdasaryan, Steven Rostedt,
	Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman,
	Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar,
	Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm,
	kernel-team, Oleg Nesterov, Andy Lutomirski, Serge E. Hallyn,
	Kees Cook

On Wed, Mar 20, 2019 at 11:52 AM Christian Brauner <christian@brauner.io> wrote:
>
> On Wed, Mar 20, 2019 at 11:38:35AM -0700, Daniel Colascione wrote:
> > On Wed, Mar 20, 2019 at 11:26 AM Christian Brauner <christian@brauner.io> wrote:
> > > On Wed, Mar 20, 2019 at 07:33:51AM -0400, Joel Fernandes wrote:
> > > >
> > > >
> > > > On March 20, 2019 3:02:32 AM EDT, Daniel Colascione <dancol@google.com> wrote:
> > > > >On Tue, Mar 19, 2019 at 8:59 PM Christian Brauner
> > > > ><christian@brauner.io> wrote:
> > > > >>
> > > > >> On Tue, Mar 19, 2019 at 07:42:52PM -0700, Daniel Colascione wrote:
> > > > >> > On Tue, Mar 19, 2019 at 6:52 PM Joel Fernandes
> > > > ><joel@joelfernandes.org> wrote:
> > > > >> > >
> > > > >> > > On Wed, Mar 20, 2019 at 12:10:23AM +0100, Christian Brauner
> > > > >wrote:
> > > > >> > > > On Tue, Mar 19, 2019 at 03:48:32PM -0700, Daniel Colascione
> > > > >wrote:
> > > > >> > > > > On Tue, Mar 19, 2019 at 3:14 PM Christian Brauner
> > > > ><christian@brauner.io> wrote:
> > > > >> > > > > > So I dislike the idea of allocating new inodes from the
> > > > >procfs super
> > > > >> > > > > > block. I would like to avoid pinning the whole pidfd
> > > > >concept exclusively
> > > > >> > > > > > to proc. The idea is that the pidfd API will be useable
> > > > >through procfs
> > > > >> > > > > > via open("/proc/<pid>") because that is what users expect
> > > > >and really
> > > > >> > > > > > wanted to have for a long time. So it makes sense to have
> > > > >this working.
> > > > >> > > > > > But it should really be useable without it. That's why
> > > > >translate_pid()
> > > > >> > > > > > and pidfd_clone() are on the table.  What I'm saying is,
> > > > >once the pidfd
> > > > >> > > > > > api is "complete" you should be able to set CONFIG_PROCFS=N
> > > > >- even
> > > > >> > > > > > though that's crazy - and still be able to use pidfds. This
> > > > >is also a
> > > > >> > > > > > point akpm asked about when I did the pidfd_send_signal
> > > > >work.
> > > > >> > > > >
> > > > >> > > > > I agree that you shouldn't need CONFIG_PROCFS=Y to use
> > > > >pidfds. One
> > > > >> > > > > crazy idea that I was discussing with Joel the other day is
> > > > >to just
> > > > >> > > > > make CONFIG_PROCFS=Y mandatory and provide a new
> > > > >get_procfs_root()
> > > > >> > > > > system call that returned, out of thin air and independent of
> > > > >the
> > > > >> > > > > mount table, a procfs root directory file descriptor for the
> > > > >caller's
> > > > >> > > > > PID namspace and suitable for use with openat(2).
> > > > >> > > >
> > > > >> > > > Even if this works I'm pretty sure that Al and a lot of others
> > > > >will not
> > > > >> > > > be happy about this. A syscall to get an fd to /proc?
> > > > >> >
> > > > >> > Why not? procfs provides access to a lot of core kernel
> > > > >functionality.
> > > > >> > Why should you need a mountpoint to get to it?
> > > > >> >
> > > > >> > > That's not going
> > > > >> > > > to happen and I don't see the need for a separate syscall just
> > > > >for that.
> > > > >> >
> > > > >> > We need a system call for the same reason we need a getrandom(2):
> > > > >you
> > > > >> > have to bootstrap somehow when you're in a minimal environment.
> > > > >> >
> > > > >> > > > (I do see the point of making CONFIG_PROCFS=y the default btw.)
> > > > >> >
> > > > >> > I'm not proposing that we make CONFIG_PROCFS=y the default. I'm
> > > > >> > proposing that we *hardwire* it as the default and just declare
> > > > >that
> > > > >> > it's not possible to build a Linux kernel that doesn't include
> > > > >procfs.
> > > > >> > Why do we even have that button?
> > > > >> >
> > > > >> > > I think his point here was that he wanted a handle to procfs no
> > > > >matter where
> > > > >> > > it was mounted and then can later use openat on that. Agreed that
> > > > >it may be
> > > > >> > > unnecessary unless there is a usecase for it, and especially if
> > > > >the /proc
> > > > >> > > directory being the defacto mountpoint for procfs is a universal
> > > > >convention.
> > > > >> >
> > > > >> > If it's a universal convention and, in practice, everyone needs
> > > > >proc
> > > > >> > mounted anyway, so what's the harm in hardwiring CONFIG_PROCFS=y?
> > > > >If
> > > > >> > we advertise /proc as not merely some kind of optional debug
> > > > >interface
> > > > >> > but *the* way certain kernel features are exposed --- and there's
> > > > >> > nothing wrong with that --- then we should give programs access to
> > > > >> > these core kernel features in a way that doesn't depend on
> > > > >userspace
> > > > >> > kernel configuration, and you do that by either providing a
> > > > >> > procfs-root-getting system call or just hardwiring the "/proc/"
> > > > >prefix
> > > > >> > into VFS.
> > > > >> >
> > > > >> > > > Inode allocation from the procfs mount for the file descriptors
> > > > >Joel
> > > > >> > > > wants is not correct. Their not really procfs file descriptors
> > > > >so this
> > > > >> > > > is a nack. We can't just hook into proc that way.
> > > > >> > >
> > > > >> > > I was not particular about using procfs mount for the FDs but
> > > > >that's the only
> > > > >> > > way I knew how to do it until you pointed out anon_inode (my grep
> > > > >skills
> > > > >> > > missed that), so thank you!
> > > > >> > >
> > > > >> > > > > C'mon: /proc is used by everyone today and almost every
> > > > >program breaks
> > > > >> > > > > if it's not around. The string "/proc" is already de facto
> > > > >kernel ABI.
> > > > >> > > > > Let's just drop the pretense of /proc being optional and bake
> > > > >it into
> > > > >> > > > > the kernel proper, then give programs a way to get to /proc
> > > > >that isn't
> > > > >> > > > > tied to any particular mount configuration. This way, we
> > > > >don't need a
> > > > >> > > > > translate_pid(), since callers can just use procfs to do the
> > > > >same
> > > > >> > > > > thing. (That is, if I understand correctly what translate_pid
> > > > >does.)
> > > > >> > > >
> > > > >> > > > I'm not sure what you think translate_pid() is doing since
> > > > >you're not
> > > > >> > > > saying what you think it does.
> > > > >> > > > Examples from the old patchset:
> > > > >> > > > translate_pid(pid, ns, -1)      - get pid in our pid namespace
> > > > >> >
> > > > >> > Ah, it's a bit different from what I had in mind. It's fair to want
> > > > >to
> > > > >> > translate PIDs between namespaces, but the only way to make the
> > > > >> > translate_pid under discussion robust is to have it accept and
> > > > >produce
> > > > >> > pidfds. (At that point, you might as well call it translate_pidfd.)
> > > > >We
> > > > >> > should not be adding new APIs to the kernel that accept numeric
> > > > >PIDs:
> > > > >>
> > > > >> The traditional pid-based api is not going away. There are users that
> > > > >> have the requirement to translate pids between namespaces and also
> > > > >doing
> > > > >> introspection on these namespaces independent of pidfds. We will not
> > > > >> restrict the usefulness of this syscall by making it only work with
> > > > >> pidfds.
> > > > >>
> > > > >> > it's not possible to use these APIs correctly except under very
> > > > >> > limited circumstances --- mostly, talking about init or a parent
> > > > >>
> > > > >> The pid-based api is one of the most widely used apis of the kernel
> > > > >and
> > > > >> people have been using it quite successfully for a long time. Yes,
> > > > >it's
> > > > >> rac, but it's here to stay.
> > > > >>
> > > > >> > talking about its child.
> > > > >> >
> > > > >> > Really, we need a few related operations, and we shouldn't
> > > > >necessarily
> > > > >> > mingle them.
> > > > >>
> > > > >> Yes, we've established that previously.
> > > > >>
> > > > >> >
> > > > >> > 1) Given a numeric PID, give me a pidfd: that works today: you just
> > > > >> > open /proc/<pid>
> > > > >>
> > > > >> Agreed.
> > > > >>
> > > > >> >
> > > > >> > 2) Given a pidfd, give me a numeric PID: that works today: you just
> > > > >> > openat(pidfd, "stat", O_RDONLY) and read the first token (which is
> > > > >> > always the numeric PID).
> > > > >>
> > > > >> Agreed.
> > > > >>
> > > > >> >
> > > > >> > 3) Given a pidfd, send a signal: that's what pidfd_send_signal
> > > > >does,
> > > > >> > and it's a good start on the rest of these operations.
> > > > >>
> > > > >> Agreed.
> > > > >>
> > > > >> > 5) Given a pidfd in NS1, get a pidfd in NS2. That's what
> > > > >translate_pid
> > > > >> > is for. My preferred signature for this routine is
> > > > >translate_pid(int
> > > > >> > pidfd, int nsfd) -> pidfd. We don't need two namespace arguments.
> > > > >Why
> > > > >> > not? Because the pidfd *already* names a single process, uniquely!
> > > > >>
> > > > >> Given that people are interested in pids we can't just always return
> > > > >a
> > > > >> pidfd. That would mean a user would need to do get the pidfd read
> > > > >from
> > > > >> <pidfd>/stat and then close the pidfd. If you do that for a 100 pids
> > > > >or
> > > > >> more you end up allocating and closing file descriptors constantly
> > > > >for
> > > > >> no reason. We can't just debate pids away. So it will also need to be
> > > > >> able to yield pids e.g. through a flag argument.
> > > > >
> > > > >Sure, but that's still not a reason that we should care about pidfds
> > > > >working separately from procfs..
> > >
> > > That's unrelated to the point made in the above paragraph.
> > > Please note, I said that the pidfd api should work when proc is not
> > > available not that they can't be dirfds.
> >
> > What do you mean by "not available"? CONFIG_PROCFS=n? If pidfds
>
> I'm talking about the ability to clone processes and get fd handles on
> them via pidfd_clone() or CLONE_NEWFD.

I wouldn't call that situation "proc [not being] available". We need
pidfd_clone to return a pidfd for atomicity reasons, not /proc
availability reasons. Again, it doesn't make any sense to support this
stuff when CONFIG_PROCFS=n, and CONFIG_PROCFS=n shouldn't even be a
supported configuration.

> > > translate_pid() should just return you a pidfd. Having it return a pidfd
> > > and a status fd feels like stuffing too much functionality in there. If
> > > you're fine with it I'll finish prototyping what I had in mind. As I
> > > said in previous mails I'm already working on this.
> >
> > translate_pid also needs to *accept* pidfds, at least optionally.
> > Unless you have a function from pidfd to pidfd, you race.
>
> You're misunderstanding. Again, I said in my previous mails it should
> accept pidfds optionally as arguments, yes. But I don't want it to
> return the status fds that you previously wanted pidfd_wait() to return.

Agreed. There should be a different way to get these wait handle FDs.

> I really want to see Joel's pidfd_wait() patchset and have more people
> review the actual code.

Sure. But it's also unpleasant to have people write code and then have
to throw it away due to guessing incorrectly about unclear
requirements.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: pidfd design
  2019-03-20 18:51                                 ` Christian Brauner
  2019-03-20 18:58                                   ` Andy Lutomirski
@ 2019-03-20 19:19                                   ` Joel Fernandes
  2019-03-20 19:29                                   ` Daniel Colascione
  2 siblings, 0 replies; 35+ messages in thread
From: Joel Fernandes @ 2019-03-20 19:19 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Daniel Colascione, Suren Baghdasaryan, Steven Rostedt,
	Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman,
	Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar,
	Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm,
	kernel-team, Oleg Nesterov, Andy Lutomirski, Serge E. Hallyn,
	Kees Cook

On Wed, Mar 20, 2019 at 07:51:57PM +0100, Christian Brauner wrote:
[snip]
> > > translate_pid() should just return you a pidfd. Having it return a pidfd
> > > and a status fd feels like stuffing too much functionality in there. If
> > > you're fine with it I'll finish prototyping what I had in mind. As I
> > > said in previous mails I'm already working on this.
> > 
> > translate_pid also needs to *accept* pidfds, at least optionally.
> > Unless you have a function from pidfd to pidfd, you race.
> 
> You're misunderstanding. Again, I said in my previous mails it should
> accept pidfds optionally as arguments, yes. But I don't want it to
> return the status fds that you previously wanted pidfd_wait() to return.
> I really want to see Joel's pidfd_wait() patchset and have more people
> review the actual code.

No problem, pidfd_wait is also fine with me and we can change it later to
translate_pid or something else if needed.

Agreed that lets get to some code writing now that (I hope) we are all on the
same page and discuss on actual code.

 - Joel


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: pidfd design
  2019-03-20 18:58                                   ` Andy Lutomirski
@ 2019-03-20 19:14                                     ` Christian Brauner
  2019-03-20 19:40                                       ` Daniel Colascione
  0 siblings, 1 reply; 35+ messages in thread
From: Christian Brauner @ 2019-03-20 19:14 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Daniel Colascione, Joel Fernandes, Suren Baghdasaryan,
	Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko,
	Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos,
	Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML,
	open list:ANDROID DRIVERS, linux-mm, kernel-team, Oleg Nesterov,
	Serge E. Hallyn, Kees Cook

On Wed, Mar 20, 2019 at 11:58:57AM -0700, Andy Lutomirski wrote:
> On Wed, Mar 20, 2019 at 11:52 AM Christian Brauner <christian@brauner.io> wrote:
> >
> > You're misunderstanding. Again, I said in my previous mails it should
> > accept pidfds optionally as arguments, yes. But I don't want it to
> > return the status fds that you previously wanted pidfd_wait() to return.
> > I really want to see Joel's pidfd_wait() patchset and have more people
> > review the actual code.
> 
> Just to make sure that no one is forgetting a material security consideration:

Andy, thanks for commenting!

> 
> $ ls /proc/self
> attr             exe        mountinfo      projid_map    status
> autogroup        fd         mounts         root          syscall
> auxv             fdinfo     mountstats     sched         task
> cgroup           gid_map    net            schedstat     timers
> clear_refs       io         ns             sessionid     timerslack_ns
> cmdline          latency    numa_maps      setgroups     uid_map
> comm             limits     oom_adj        smaps         wchan
> coredump_filter  loginuid   oom_score      smaps_rollup
> cpuset           map_files  oom_score_adj  stack
> cwd              maps       pagemap        stat
> environ          mem        personality    statm
> 
> A bunch of this stuff makes sense to make accessible through a syscall
> interface that we expect to be used even in sandboxes.  But a bunch of
> it does not.  For example, *_map, mounts, mountstats, and net are all
> namespace-wide things that certain policies expect to be unavailable.
> stack, for example, is a potential attack surface.  Etc.
> 
> As it stands, if you create a fresh userns and mountns and try to
> mount /proc, there are some really awful and hideous rules that are
> checked for security reasons.  All these new APIs either need to
> return something more restrictive than a proc dirfd or they need to
> follow the same rules.  And I'm afraid that the latter may be a
> nonstarter if you expect these APIs to be used in libraries.
> 
> Yes, this is unfortunate, but it is indeed the current situation.  I
> suppose that we could return magic restricted dirfds, or we could
> return things that aren't dirfds and all and have some API that gives
> you the dirfd associated with a procfd but only if you can see
> /proc/PID.

What would be your opinion to having a
/proc/<pid>/handle
file instead of having a dirfd. Essentially, what I initially proposed
at LPC. The change on what we currently have in master would be:
https://gist.github.com/brauner/59eec91550c5624c9999eaebd95a70df

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: pidfd design
  2019-03-20 18:26                             ` Christian Brauner
  2019-03-20 18:38                               ` Daniel Colascione
@ 2019-03-20 19:11                               ` Joel Fernandes
  1 sibling, 0 replies; 35+ messages in thread
From: Joel Fernandes @ 2019-03-20 19:11 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Daniel Colascione, Suren Baghdasaryan, Steven Rostedt,
	Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman,
	Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar,
	Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm,
	kernel-team, Oleg Nesterov, Andy Lutomirski, Serge E. Hallyn,
	Kees Cook

On Wed, Mar 20, 2019 at 07:26:50PM +0100, Christian Brauner wrote:
> On Wed, Mar 20, 2019 at 07:33:51AM -0400, Joel Fernandes wrote:
> > 
> > 
> > On March 20, 2019 3:02:32 AM EDT, Daniel Colascione <dancol@google.com> wrote:
> > >On Tue, Mar 19, 2019 at 8:59 PM Christian Brauner
> > ><christian@brauner.io> wrote:
> > >>
> > >> On Tue, Mar 19, 2019 at 07:42:52PM -0700, Daniel Colascione wrote:
> > >> > On Tue, Mar 19, 2019 at 6:52 PM Joel Fernandes
> > ><joel@joelfernandes.org> wrote:
> > >> > >
> > >> > > On Wed, Mar 20, 2019 at 12:10:23AM +0100, Christian Brauner
> > >wrote:
> > >> > > > On Tue, Mar 19, 2019 at 03:48:32PM -0700, Daniel Colascione
> > >wrote:
> > >> > > > > On Tue, Mar 19, 2019 at 3:14 PM Christian Brauner
> > ><christian@brauner.io> wrote:
> > >> > > > > > So I dislike the idea of allocating new inodes from the
> > >procfs super
> > >> > > > > > block. I would like to avoid pinning the whole pidfd
> > >concept exclusively
> > >> > > > > > to proc. The idea is that the pidfd API will be useable
> > >through procfs
> > >> > > > > > via open("/proc/<pid>") because that is what users expect
> > >and really
> > >> > > > > > wanted to have for a long time. So it makes sense to have
> > >this working.
> > >> > > > > > But it should really be useable without it. That's why
> > >translate_pid()
> > >> > > > > > and pidfd_clone() are on the table.  What I'm saying is,
> > >once the pidfd
> > >> > > > > > api is "complete" you should be able to set CONFIG_PROCFS=N
> > >- even
> > >> > > > > > though that's crazy - and still be able to use pidfds. This
> > >is also a
> > >> > > > > > point akpm asked about when I did the pidfd_send_signal
> > >work.
> > >> > > > >
> > >> > > > > I agree that you shouldn't need CONFIG_PROCFS=Y to use
> > >pidfds. One
> > >> > > > > crazy idea that I was discussing with Joel the other day is
> > >to just
> > >> > > > > make CONFIG_PROCFS=Y mandatory and provide a new
> > >get_procfs_root()
> > >> > > > > system call that returned, out of thin air and independent of
> > >the
> > >> > > > > mount table, a procfs root directory file descriptor for the
> > >caller's
> > >> > > > > PID namspace and suitable for use with openat(2).
> > >> > > >
> > >> > > > Even if this works I'm pretty sure that Al and a lot of others
> > >will not
> > >> > > > be happy about this. A syscall to get an fd to /proc?
> > >> >
> > >> > Why not? procfs provides access to a lot of core kernel
> > >functionality.
> > >> > Why should you need a mountpoint to get to it?
> > >> >
> > >> > > That's not going
> > >> > > > to happen and I don't see the need for a separate syscall just
> > >for that.
> > >> >
> > >> > We need a system call for the same reason we need a getrandom(2):
> > >you
> > >> > have to bootstrap somehow when you're in a minimal environment.
> > >> >
> > >> > > > (I do see the point of making CONFIG_PROCFS=y the default btw.)
> > >> >
> > >> > I'm not proposing that we make CONFIG_PROCFS=y the default. I'm
> > >> > proposing that we *hardwire* it as the default and just declare
> > >that
> > >> > it's not possible to build a Linux kernel that doesn't include
> > >procfs.
> > >> > Why do we even have that button?
> > >> >
> > >> > > I think his point here was that he wanted a handle to procfs no
> > >matter where
> > >> > > it was mounted and then can later use openat on that. Agreed that
> > >it may be
> > >> > > unnecessary unless there is a usecase for it, and especially if
> > >the /proc
> > >> > > directory being the defacto mountpoint for procfs is a universal
> > >convention.
> > >> >
> > >> > If it's a universal convention and, in practice, everyone needs
> > >proc
> > >> > mounted anyway, so what's the harm in hardwiring CONFIG_PROCFS=y?
> > >If
> > >> > we advertise /proc as not merely some kind of optional debug
> > >interface
> > >> > but *the* way certain kernel features are exposed --- and there's
> > >> > nothing wrong with that --- then we should give programs access to
> > >> > these core kernel features in a way that doesn't depend on
> > >userspace
> > >> > kernel configuration, and you do that by either providing a
> > >> > procfs-root-getting system call or just hardwiring the "/proc/"
> > >prefix
> > >> > into VFS.
> > >> >
> > >> > > > Inode allocation from the procfs mount for the file descriptors
> > >Joel
> > >> > > > wants is not correct. Their not really procfs file descriptors
> > >so this
> > >> > > > is a nack. We can't just hook into proc that way.
> > >> > >
> > >> > > I was not particular about using procfs mount for the FDs but
> > >that's the only
> > >> > > way I knew how to do it until you pointed out anon_inode (my grep
> > >skills
> > >> > > missed that), so thank you!
> > >> > >
> > >> > > > > C'mon: /proc is used by everyone today and almost every
> > >program breaks
> > >> > > > > if it's not around. The string "/proc" is already de facto
> > >kernel ABI.
> > >> > > > > Let's just drop the pretense of /proc being optional and bake
> > >it into
> > >> > > > > the kernel proper, then give programs a way to get to /proc
> > >that isn't
> > >> > > > > tied to any particular mount configuration. This way, we
> > >don't need a
> > >> > > > > translate_pid(), since callers can just use procfs to do the
> > >same
> > >> > > > > thing. (That is, if I understand correctly what translate_pid
> > >does.)
> > >> > > >
> > >> > > > I'm not sure what you think translate_pid() is doing since
> > >you're not
> > >> > > > saying what you think it does.
> > >> > > > Examples from the old patchset:
> > >> > > > translate_pid(pid, ns, -1)      - get pid in our pid namespace
> > >> >
> > >> > Ah, it's a bit different from what I had in mind. It's fair to want
> > >to
> > >> > translate PIDs between namespaces, but the only way to make the
> > >> > translate_pid under discussion robust is to have it accept and
> > >produce
> > >> > pidfds. (At that point, you might as well call it translate_pidfd.)
> > >We
> > >> > should not be adding new APIs to the kernel that accept numeric
> > >PIDs:
> > >>
> > >> The traditional pid-based api is not going away. There are users that
> > >> have the requirement to translate pids between namespaces and also
> > >doing
> > >> introspection on these namespaces independent of pidfds. We will not
> > >> restrict the usefulness of this syscall by making it only work with
> > >> pidfds.
> > >>
> > >> > it's not possible to use these APIs correctly except under very
> > >> > limited circumstances --- mostly, talking about init or a parent
> > >>
> > >> The pid-based api is one of the most widely used apis of the kernel
> > >and
> > >> people have been using it quite successfully for a long time. Yes,
> > >it's
> > >> rac, but it's here to stay.
> > >>
> > >> > talking about its child.
> > >> >
> > >> > Really, we need a few related operations, and we shouldn't
> > >necessarily
> > >> > mingle them.
> > >>
> > >> Yes, we've established that previously.
> > >>
> > >> >
> > >> > 1) Given a numeric PID, give me a pidfd: that works today: you just
> > >> > open /proc/<pid>
> > >>
> > >> Agreed.
> > >>
> > >> >
> > >> > 2) Given a pidfd, give me a numeric PID: that works today: you just
> > >> > openat(pidfd, "stat", O_RDONLY) and read the first token (which is
> > >> > always the numeric PID).
> > >>
> > >> Agreed.
> > >>
> > >> >
> > >> > 3) Given a pidfd, send a signal: that's what pidfd_send_signal
> > >does,
> > >> > and it's a good start on the rest of these operations.
> > >>
> > >> Agreed.
> > >>
> > >> > 5) Given a pidfd in NS1, get a pidfd in NS2. That's what
> > >translate_pid
> > >> > is for. My preferred signature for this routine is
> > >translate_pid(int
> > >> > pidfd, int nsfd) -> pidfd. We don't need two namespace arguments.
> > >Why
> > >> > not? Because the pidfd *already* names a single process, uniquely!
> > >>
> > >> Given that people are interested in pids we can't just always return
> > >a
> > >> pidfd. That would mean a user would need to do get the pidfd read
> > >from
> > >> <pidfd>/stat and then close the pidfd. If you do that for a 100 pids
> > >or
> > >> more you end up allocating and closing file descriptors constantly
> > >for
> > >> no reason. We can't just debate pids away. So it will also need to be
> > >> able to yield pids e.g. through a flag argument.
> > >
> > >Sure, but that's still not a reason that we should care about pidfds
> > >working separately from procfs..
> 
> That's unrelated to the point made in the above paragraph.
> Please note, I said that the pidfd api should work when proc is not
> available not that they can't be dirfds.
> 
> > 
> > Agreed. I can't imagine pidfd being anything but a proc pid directory handle. So I am confused what Christian meant. Pidfd *is* a procfs directory fid  always. That's what I gathered from his pidfd_send_signal patch but let me know if I'm way off in the woods.
> 
> (K9 Mail still hasn't learned to wrap lines at 80 it seems. :))

Indeed, or I misconfigured it :) Just set it up recently so I'm still messing
with it.

The other issue is it does wrapping on quoted lines too, and there's a bug
filed somewhere for that.

> Again, I never said that pidfds should be a directory handle.
> (Though I would like to point out that one of the original ideas I
> discussed at LPC was to have something like this to get regular file
> descriptors instead of dirfds:
> https://gist.github.com/brauner/59eec91550c5624c9999eaebd95a70df)

Ok. I was just going by this code in your send_signal patch where you error
out if the pidfd is not a directory.
 
+struct pid *tgid_pidfd_to_pid(const struct file *file)
+{
+	if (!d_is_dir(file->f_path.dentry) ||
+	    (file->f_op != &proc_tgid_base_operations))
+		return ERR_PTR(-EBADF);

> > For my next revision, I am thinking of adding the flag argument Christian mentioned to make translate_pid return an anon_inode FD which can be used for death status, given a <pid>. Since it is thought that translate_pid can be made to return a pid FD, I think it is ok to have it return a pid status FD for the purposes of the death status as well.
> 
> translate_pid() should just return you a pidfd. Having it return a pidfd
> and a status fd feels like stuffing too much functionality in there. If
> you're fine with it I'll finish prototyping what I had in mind. As I
> said in previous mails I'm already working on this.

Yes, please continue to work on it. No problem.

> Would you be ok with prototyping the pidfd_wait() syscall you had in
> mind?

Yes, Of course, I am working on it. No problem. It is still good to discuss
these ideas and to know what my direction should be, so I appreciate the
conversation here.

> Especially the wait_fd part that you want to have I would like to
> see how that is supposed to work, e.g. who is allowed to wait on the
> process and how notifications will work for non-parent processes and so
> on. I feel we won't get anywhere by talking in the abstrace and other
> people are far more likely to review/comment once there's actual code.

Got it. Lets chat more once I post something.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: pidfd design
  2019-03-20 18:51                                 ` Christian Brauner
@ 2019-03-20 18:58                                   ` Andy Lutomirski
  2019-03-20 19:14                                     ` Christian Brauner
  2019-03-20 19:19                                   ` Joel Fernandes
  2019-03-20 19:29                                   ` Daniel Colascione
  2 siblings, 1 reply; 35+ messages in thread
From: Andy Lutomirski @ 2019-03-20 18:58 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Daniel Colascione, Joel Fernandes, Suren Baghdasaryan,
	Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko,
	Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos,
	Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML,
	open list:ANDROID DRIVERS, linux-mm, kernel-team, Oleg Nesterov,
	Serge E. Hallyn, Kees Cook

On Wed, Mar 20, 2019 at 11:52 AM Christian Brauner <christian@brauner.io> wrote:
>
> You're misunderstanding. Again, I said in my previous mails it should
> accept pidfds optionally as arguments, yes. But I don't want it to
> return the status fds that you previously wanted pidfd_wait() to return.
> I really want to see Joel's pidfd_wait() patchset and have more people
> review the actual code.

Just to make sure that no one is forgetting a material security consideration:

$ ls /proc/self
attr             exe        mountinfo      projid_map    status
autogroup        fd         mounts         root          syscall
auxv             fdinfo     mountstats     sched         task
cgroup           gid_map    net            schedstat     timers
clear_refs       io         ns             sessionid     timerslack_ns
cmdline          latency    numa_maps      setgroups     uid_map
comm             limits     oom_adj        smaps         wchan
coredump_filter  loginuid   oom_score      smaps_rollup
cpuset           map_files  oom_score_adj  stack
cwd              maps       pagemap        stat
environ          mem        personality    statm

A bunch of this stuff makes sense to make accessible through a syscall
interface that we expect to be used even in sandboxes.  But a bunch of
it does not.  For example, *_map, mounts, mountstats, and net are all
namespace-wide things that certain policies expect to be unavailable.
stack, for example, is a potential attack surface.  Etc.

As it stands, if you create a fresh userns and mountns and try to
mount /proc, there are some really awful and hideous rules that are
checked for security reasons.  All these new APIs either need to
return something more restrictive than a proc dirfd or they need to
follow the same rules.  And I'm afraid that the latter may be a
nonstarter if you expect these APIs to be used in libraries.

Yes, this is unfortunate, but it is indeed the current situation.  I
suppose that we could return magic restricted dirfds, or we could
return things that aren't dirfds and all and have some API that gives
you the dirfd associated with a procfd but only if you can see
/proc/PID.

--Andy

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: pidfd design
  2019-03-20 18:38                               ` Daniel Colascione
@ 2019-03-20 18:51                                 ` Christian Brauner
  2019-03-20 18:58                                   ` Andy Lutomirski
                                                     ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Christian Brauner @ 2019-03-20 18:51 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Joel Fernandes, Suren Baghdasaryan, Steven Rostedt,
	Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman,
	Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar,
	Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm,
	kernel-team, Oleg Nesterov, Andy Lutomirski, Serge E. Hallyn,
	Kees Cook

On Wed, Mar 20, 2019 at 11:38:35AM -0700, Daniel Colascione wrote:
> On Wed, Mar 20, 2019 at 11:26 AM Christian Brauner <christian@brauner.io> wrote:
> > On Wed, Mar 20, 2019 at 07:33:51AM -0400, Joel Fernandes wrote:
> > >
> > >
> > > On March 20, 2019 3:02:32 AM EDT, Daniel Colascione <dancol@google.com> wrote:
> > > >On Tue, Mar 19, 2019 at 8:59 PM Christian Brauner
> > > ><christian@brauner.io> wrote:
> > > >>
> > > >> On Tue, Mar 19, 2019 at 07:42:52PM -0700, Daniel Colascione wrote:
> > > >> > On Tue, Mar 19, 2019 at 6:52 PM Joel Fernandes
> > > ><joel@joelfernandes.org> wrote:
> > > >> > >
> > > >> > > On Wed, Mar 20, 2019 at 12:10:23AM +0100, Christian Brauner
> > > >wrote:
> > > >> > > > On Tue, Mar 19, 2019 at 03:48:32PM -0700, Daniel Colascione
> > > >wrote:
> > > >> > > > > On Tue, Mar 19, 2019 at 3:14 PM Christian Brauner
> > > ><christian@brauner.io> wrote:
> > > >> > > > > > So I dislike the idea of allocating new inodes from the
> > > >procfs super
> > > >> > > > > > block. I would like to avoid pinning the whole pidfd
> > > >concept exclusively
> > > >> > > > > > to proc. The idea is that the pidfd API will be useable
> > > >through procfs
> > > >> > > > > > via open("/proc/<pid>") because that is what users expect
> > > >and really
> > > >> > > > > > wanted to have for a long time. So it makes sense to have
> > > >this working.
> > > >> > > > > > But it should really be useable without it. That's why
> > > >translate_pid()
> > > >> > > > > > and pidfd_clone() are on the table.  What I'm saying is,
> > > >once the pidfd
> > > >> > > > > > api is "complete" you should be able to set CONFIG_PROCFS=N
> > > >- even
> > > >> > > > > > though that's crazy - and still be able to use pidfds. This
> > > >is also a
> > > >> > > > > > point akpm asked about when I did the pidfd_send_signal
> > > >work.
> > > >> > > > >
> > > >> > > > > I agree that you shouldn't need CONFIG_PROCFS=Y to use
> > > >pidfds. One
> > > >> > > > > crazy idea that I was discussing with Joel the other day is
> > > >to just
> > > >> > > > > make CONFIG_PROCFS=Y mandatory and provide a new
> > > >get_procfs_root()
> > > >> > > > > system call that returned, out of thin air and independent of
> > > >the
> > > >> > > > > mount table, a procfs root directory file descriptor for the
> > > >caller's
> > > >> > > > > PID namspace and suitable for use with openat(2).
> > > >> > > >
> > > >> > > > Even if this works I'm pretty sure that Al and a lot of others
> > > >will not
> > > >> > > > be happy about this. A syscall to get an fd to /proc?
> > > >> >
> > > >> > Why not? procfs provides access to a lot of core kernel
> > > >functionality.
> > > >> > Why should you need a mountpoint to get to it?
> > > >> >
> > > >> > > That's not going
> > > >> > > > to happen and I don't see the need for a separate syscall just
> > > >for that.
> > > >> >
> > > >> > We need a system call for the same reason we need a getrandom(2):
> > > >you
> > > >> > have to bootstrap somehow when you're in a minimal environment.
> > > >> >
> > > >> > > > (I do see the point of making CONFIG_PROCFS=y the default btw.)
> > > >> >
> > > >> > I'm not proposing that we make CONFIG_PROCFS=y the default. I'm
> > > >> > proposing that we *hardwire* it as the default and just declare
> > > >that
> > > >> > it's not possible to build a Linux kernel that doesn't include
> > > >procfs.
> > > >> > Why do we even have that button?
> > > >> >
> > > >> > > I think his point here was that he wanted a handle to procfs no
> > > >matter where
> > > >> > > it was mounted and then can later use openat on that. Agreed that
> > > >it may be
> > > >> > > unnecessary unless there is a usecase for it, and especially if
> > > >the /proc
> > > >> > > directory being the defacto mountpoint for procfs is a universal
> > > >convention.
> > > >> >
> > > >> > If it's a universal convention and, in practice, everyone needs
> > > >proc
> > > >> > mounted anyway, so what's the harm in hardwiring CONFIG_PROCFS=y?
> > > >If
> > > >> > we advertise /proc as not merely some kind of optional debug
> > > >interface
> > > >> > but *the* way certain kernel features are exposed --- and there's
> > > >> > nothing wrong with that --- then we should give programs access to
> > > >> > these core kernel features in a way that doesn't depend on
> > > >userspace
> > > >> > kernel configuration, and you do that by either providing a
> > > >> > procfs-root-getting system call or just hardwiring the "/proc/"
> > > >prefix
> > > >> > into VFS.
> > > >> >
> > > >> > > > Inode allocation from the procfs mount for the file descriptors
> > > >Joel
> > > >> > > > wants is not correct. Their not really procfs file descriptors
> > > >so this
> > > >> > > > is a nack. We can't just hook into proc that way.
> > > >> > >
> > > >> > > I was not particular about using procfs mount for the FDs but
> > > >that's the only
> > > >> > > way I knew how to do it until you pointed out anon_inode (my grep
> > > >skills
> > > >> > > missed that), so thank you!
> > > >> > >
> > > >> > > > > C'mon: /proc is used by everyone today and almost every
> > > >program breaks
> > > >> > > > > if it's not around. The string "/proc" is already de facto
> > > >kernel ABI.
> > > >> > > > > Let's just drop the pretense of /proc being optional and bake
> > > >it into
> > > >> > > > > the kernel proper, then give programs a way to get to /proc
> > > >that isn't
> > > >> > > > > tied to any particular mount configuration. This way, we
> > > >don't need a
> > > >> > > > > translate_pid(), since callers can just use procfs to do the
> > > >same
> > > >> > > > > thing. (That is, if I understand correctly what translate_pid
> > > >does.)
> > > >> > > >
> > > >> > > > I'm not sure what you think translate_pid() is doing since
> > > >you're not
> > > >> > > > saying what you think it does.
> > > >> > > > Examples from the old patchset:
> > > >> > > > translate_pid(pid, ns, -1)      - get pid in our pid namespace
> > > >> >
> > > >> > Ah, it's a bit different from what I had in mind. It's fair to want
> > > >to
> > > >> > translate PIDs between namespaces, but the only way to make the
> > > >> > translate_pid under discussion robust is to have it accept and
> > > >produce
> > > >> > pidfds. (At that point, you might as well call it translate_pidfd.)
> > > >We
> > > >> > should not be adding new APIs to the kernel that accept numeric
> > > >PIDs:
> > > >>
> > > >> The traditional pid-based api is not going away. There are users that
> > > >> have the requirement to translate pids between namespaces and also
> > > >doing
> > > >> introspection on these namespaces independent of pidfds. We will not
> > > >> restrict the usefulness of this syscall by making it only work with
> > > >> pidfds.
> > > >>
> > > >> > it's not possible to use these APIs correctly except under very
> > > >> > limited circumstances --- mostly, talking about init or a parent
> > > >>
> > > >> The pid-based api is one of the most widely used apis of the kernel
> > > >and
> > > >> people have been using it quite successfully for a long time. Yes,
> > > >it's
> > > >> rac, but it's here to stay.
> > > >>
> > > >> > talking about its child.
> > > >> >
> > > >> > Really, we need a few related operations, and we shouldn't
> > > >necessarily
> > > >> > mingle them.
> > > >>
> > > >> Yes, we've established that previously.
> > > >>
> > > >> >
> > > >> > 1) Given a numeric PID, give me a pidfd: that works today: you just
> > > >> > open /proc/<pid>
> > > >>
> > > >> Agreed.
> > > >>
> > > >> >
> > > >> > 2) Given a pidfd, give me a numeric PID: that works today: you just
> > > >> > openat(pidfd, "stat", O_RDONLY) and read the first token (which is
> > > >> > always the numeric PID).
> > > >>
> > > >> Agreed.
> > > >>
> > > >> >
> > > >> > 3) Given a pidfd, send a signal: that's what pidfd_send_signal
> > > >does,
> > > >> > and it's a good start on the rest of these operations.
> > > >>
> > > >> Agreed.
> > > >>
> > > >> > 5) Given a pidfd in NS1, get a pidfd in NS2. That's what
> > > >translate_pid
> > > >> > is for. My preferred signature for this routine is
> > > >translate_pid(int
> > > >> > pidfd, int nsfd) -> pidfd. We don't need two namespace arguments.
> > > >Why
> > > >> > not? Because the pidfd *already* names a single process, uniquely!
> > > >>
> > > >> Given that people are interested in pids we can't just always return
> > > >a
> > > >> pidfd. That would mean a user would need to do get the pidfd read
> > > >from
> > > >> <pidfd>/stat and then close the pidfd. If you do that for a 100 pids
> > > >or
> > > >> more you end up allocating and closing file descriptors constantly
> > > >for
> > > >> no reason. We can't just debate pids away. So it will also need to be
> > > >> able to yield pids e.g. through a flag argument.
> > > >
> > > >Sure, but that's still not a reason that we should care about pidfds
> > > >working separately from procfs..
> >
> > That's unrelated to the point made in the above paragraph.
> > Please note, I said that the pidfd api should work when proc is not
> > available not that they can't be dirfds.
> 
> What do you mean by "not available"? CONFIG_PROCFS=n? If pidfds

I'm talking about the ability to clone processes and get fd handles on
them via pidfd_clone() or CLONE_NEWFD.

> 
> > translate_pid() should just return you a pidfd. Having it return a pidfd
> > and a status fd feels like stuffing too much functionality in there. If
> > you're fine with it I'll finish prototyping what I had in mind. As I
> > said in previous mails I'm already working on this.
> 
> translate_pid also needs to *accept* pidfds, at least optionally.
> Unless you have a function from pidfd to pidfd, you race.

You're misunderstanding. Again, I said in my previous mails it should
accept pidfds optionally as arguments, yes. But I don't want it to
return the status fds that you previously wanted pidfd_wait() to return.
I really want to see Joel's pidfd_wait() patchset and have more people
review the actual code.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: pidfd design
  2019-03-20 18:26                             ` Christian Brauner
@ 2019-03-20 18:38                               ` Daniel Colascione
  2019-03-20 18:51                                 ` Christian Brauner
  2019-03-20 19:11                               ` Joel Fernandes
  1 sibling, 1 reply; 35+ messages in thread
From: Daniel Colascione @ 2019-03-20 18:38 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Joel Fernandes, Suren Baghdasaryan, Steven Rostedt,
	Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman,
	Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar,
	Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm,
	kernel-team, Oleg Nesterov, Andy Lutomirski, Serge E. Hallyn,
	Kees Cook

On Wed, Mar 20, 2019 at 11:26 AM Christian Brauner <christian@brauner.io> wrote:
> On Wed, Mar 20, 2019 at 07:33:51AM -0400, Joel Fernandes wrote:
> >
> >
> > On March 20, 2019 3:02:32 AM EDT, Daniel Colascione <dancol@google.com> wrote:
> > >On Tue, Mar 19, 2019 at 8:59 PM Christian Brauner
> > ><christian@brauner.io> wrote:
> > >>
> > >> On Tue, Mar 19, 2019 at 07:42:52PM -0700, Daniel Colascione wrote:
> > >> > On Tue, Mar 19, 2019 at 6:52 PM Joel Fernandes
> > ><joel@joelfernandes.org> wrote:
> > >> > >
> > >> > > On Wed, Mar 20, 2019 at 12:10:23AM +0100, Christian Brauner
> > >wrote:
> > >> > > > On Tue, Mar 19, 2019 at 03:48:32PM -0700, Daniel Colascione
> > >wrote:
> > >> > > > > On Tue, Mar 19, 2019 at 3:14 PM Christian Brauner
> > ><christian@brauner.io> wrote:
> > >> > > > > > So I dislike the idea of allocating new inodes from the
> > >procfs super
> > >> > > > > > block. I would like to avoid pinning the whole pidfd
> > >concept exclusively
> > >> > > > > > to proc. The idea is that the pidfd API will be useable
> > >through procfs
> > >> > > > > > via open("/proc/<pid>") because that is what users expect
> > >and really
> > >> > > > > > wanted to have for a long time. So it makes sense to have
> > >this working.
> > >> > > > > > But it should really be useable without it. That's why
> > >translate_pid()
> > >> > > > > > and pidfd_clone() are on the table.  What I'm saying is,
> > >once the pidfd
> > >> > > > > > api is "complete" you should be able to set CONFIG_PROCFS=N
> > >- even
> > >> > > > > > though that's crazy - and still be able to use pidfds. This
> > >is also a
> > >> > > > > > point akpm asked about when I did the pidfd_send_signal
> > >work.
> > >> > > > >
> > >> > > > > I agree that you shouldn't need CONFIG_PROCFS=Y to use
> > >pidfds. One
> > >> > > > > crazy idea that I was discussing with Joel the other day is
> > >to just
> > >> > > > > make CONFIG_PROCFS=Y mandatory and provide a new
> > >get_procfs_root()
> > >> > > > > system call that returned, out of thin air and independent of
> > >the
> > >> > > > > mount table, a procfs root directory file descriptor for the
> > >caller's
> > >> > > > > PID namspace and suitable for use with openat(2).
> > >> > > >
> > >> > > > Even if this works I'm pretty sure that Al and a lot of others
> > >will not
> > >> > > > be happy about this. A syscall to get an fd to /proc?
> > >> >
> > >> > Why not? procfs provides access to a lot of core kernel
> > >functionality.
> > >> > Why should you need a mountpoint to get to it?
> > >> >
> > >> > > That's not going
> > >> > > > to happen and I don't see the need for a separate syscall just
> > >for that.
> > >> >
> > >> > We need a system call for the same reason we need a getrandom(2):
> > >you
> > >> > have to bootstrap somehow when you're in a minimal environment.
> > >> >
> > >> > > > (I do see the point of making CONFIG_PROCFS=y the default btw.)
> > >> >
> > >> > I'm not proposing that we make CONFIG_PROCFS=y the default. I'm
> > >> > proposing that we *hardwire* it as the default and just declare
> > >that
> > >> > it's not possible to build a Linux kernel that doesn't include
> > >procfs.
> > >> > Why do we even have that button?
> > >> >
> > >> > > I think his point here was that he wanted a handle to procfs no
> > >matter where
> > >> > > it was mounted and then can later use openat on that. Agreed that
> > >it may be
> > >> > > unnecessary unless there is a usecase for it, and especially if
> > >the /proc
> > >> > > directory being the defacto mountpoint for procfs is a universal
> > >convention.
> > >> >
> > >> > If it's a universal convention and, in practice, everyone needs
> > >proc
> > >> > mounted anyway, so what's the harm in hardwiring CONFIG_PROCFS=y?
> > >If
> > >> > we advertise /proc as not merely some kind of optional debug
> > >interface
> > >> > but *the* way certain kernel features are exposed --- and there's
> > >> > nothing wrong with that --- then we should give programs access to
> > >> > these core kernel features in a way that doesn't depend on
> > >userspace
> > >> > kernel configuration, and you do that by either providing a
> > >> > procfs-root-getting system call or just hardwiring the "/proc/"
> > >prefix
> > >> > into VFS.
> > >> >
> > >> > > > Inode allocation from the procfs mount for the file descriptors
> > >Joel
> > >> > > > wants is not correct. Their not really procfs file descriptors
> > >so this
> > >> > > > is a nack. We can't just hook into proc that way.
> > >> > >
> > >> > > I was not particular about using procfs mount for the FDs but
> > >that's the only
> > >> > > way I knew how to do it until you pointed out anon_inode (my grep
> > >skills
> > >> > > missed that), so thank you!
> > >> > >
> > >> > > > > C'mon: /proc is used by everyone today and almost every
> > >program breaks
> > >> > > > > if it's not around. The string "/proc" is already de facto
> > >kernel ABI.
> > >> > > > > Let's just drop the pretense of /proc being optional and bake
> > >it into
> > >> > > > > the kernel proper, then give programs a way to get to /proc
> > >that isn't
> > >> > > > > tied to any particular mount configuration. This way, we
> > >don't need a
> > >> > > > > translate_pid(), since callers can just use procfs to do the
> > >same
> > >> > > > > thing. (That is, if I understand correctly what translate_pid
> > >does.)
> > >> > > >
> > >> > > > I'm not sure what you think translate_pid() is doing since
> > >you're not
> > >> > > > saying what you think it does.
> > >> > > > Examples from the old patchset:
> > >> > > > translate_pid(pid, ns, -1)      - get pid in our pid namespace
> > >> >
> > >> > Ah, it's a bit different from what I had in mind. It's fair to want
> > >to
> > >> > translate PIDs between namespaces, but the only way to make the
> > >> > translate_pid under discussion robust is to have it accept and
> > >produce
> > >> > pidfds. (At that point, you might as well call it translate_pidfd.)
> > >We
> > >> > should not be adding new APIs to the kernel that accept numeric
> > >PIDs:
> > >>
> > >> The traditional pid-based api is not going away. There are users that
> > >> have the requirement to translate pids between namespaces and also
> > >doing
> > >> introspection on these namespaces independent of pidfds. We will not
> > >> restrict the usefulness of this syscall by making it only work with
> > >> pidfds.
> > >>
> > >> > it's not possible to use these APIs correctly except under very
> > >> > limited circumstances --- mostly, talking about init or a parent
> > >>
> > >> The pid-based api is one of the most widely used apis of the kernel
> > >and
> > >> people have been using it quite successfully for a long time. Yes,
> > >it's
> > >> rac, but it's here to stay.
> > >>
> > >> > talking about its child.
> > >> >
> > >> > Really, we need a few related operations, and we shouldn't
> > >necessarily
> > >> > mingle them.
> > >>
> > >> Yes, we've established that previously.
> > >>
> > >> >
> > >> > 1) Given a numeric PID, give me a pidfd: that works today: you just
> > >> > open /proc/<pid>
> > >>
> > >> Agreed.
> > >>
> > >> >
> > >> > 2) Given a pidfd, give me a numeric PID: that works today: you just
> > >> > openat(pidfd, "stat", O_RDONLY) and read the first token (which is
> > >> > always the numeric PID).
> > >>
> > >> Agreed.
> > >>
> > >> >
> > >> > 3) Given a pidfd, send a signal: that's what pidfd_send_signal
> > >does,
> > >> > and it's a good start on the rest of these operations.
> > >>
> > >> Agreed.
> > >>
> > >> > 5) Given a pidfd in NS1, get a pidfd in NS2. That's what
> > >translate_pid
> > >> > is for. My preferred signature for this routine is
> > >translate_pid(int
> > >> > pidfd, int nsfd) -> pidfd. We don't need two namespace arguments.
> > >Why
> > >> > not? Because the pidfd *already* names a single process, uniquely!
> > >>
> > >> Given that people are interested in pids we can't just always return
> > >a
> > >> pidfd. That would mean a user would need to do get the pidfd read
> > >from
> > >> <pidfd>/stat and then close the pidfd. If you do that for a 100 pids
> > >or
> > >> more you end up allocating and closing file descriptors constantly
> > >for
> > >> no reason. We can't just debate pids away. So it will also need to be
> > >> able to yield pids e.g. through a flag argument.
> > >
> > >Sure, but that's still not a reason that we should care about pidfds
> > >working separately from procfs..
>
> That's unrelated to the point made in the above paragraph.
> Please note, I said that the pidfd api should work when proc is not
> available not that they can't be dirfds.

What do you mean by "not available"? CONFIG_PROCFS=n? If pidfds
supposed to work when proc is unavailable yet also be directory FDs,
to what directory should the FD refer? As I mentioned in my previous
message, trying to make pidfd work without CONFIG_PROCFS is a very bad
idea.

>
> >
> > Agreed. I can't imagine pidfd being anything but a proc pid directory handle. So I am confused what Christian meant. Pidfd *is* a procfs directory fid  always. That's what I gathered from his pidfd_send_signal patch but let me know if I'm way off in the woods.
>
> (K9 Mail still hasn't learned to wrap lines at 80 it seems. :))
>
> Again, I never said that pidfds should be a directory handle.
> (Though I would like to point out that one of the original ideas I
> discussed at LPC was to have something like this to get regular file
> descriptors instead of dirfds:
> https://gist.github.com/brauner/59eec91550c5624c9999eaebd95a70df)

As I mentioned in my original email on this thread, if you have
regular file descriptors instead of directory FDs, you have to use
some special new API instead of openat to get metadata about a
process. That's pointless duplication of functionality considering
that a directory FD gives you that information automatically.

> > For my next revision, I am thinking of adding the flag argument Christian mentioned to make translate_pid return an anon_inode FD which can be used for death status, given a <pid>. Since it is thought that translate_pid can be made to return a pid FD, I think it is ok to have it return a pid status FD for the purposes of the death status as well.

> translate_pid() should just return you a pidfd. Having it return a pidfd
> and a status fd feels like stuffing too much functionality in there. If
> you're fine with it I'll finish prototyping what I had in mind. As I
> said in previous mails I'm already working on this.

translate_pid also needs to *accept* pidfds, at least optionally.
Unless you have a function from pidfd to pidfd, you race.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: pidfd design
  2019-03-20 11:33                           ` Joel Fernandes
@ 2019-03-20 18:26                             ` Christian Brauner
  2019-03-20 18:38                               ` Daniel Colascione
  2019-03-20 19:11                               ` Joel Fernandes
  0 siblings, 2 replies; 35+ messages in thread
From: Christian Brauner @ 2019-03-20 18:26 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Daniel Colascione, Suren Baghdasaryan, Steven Rostedt,
	Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman,
	Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar,
	Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm,
	kernel-team, Oleg Nesterov, Andy Lutomirski, Serge E. Hallyn,
	Kees Cook

On Wed, Mar 20, 2019 at 07:33:51AM -0400, Joel Fernandes wrote:
> 
> 
> On March 20, 2019 3:02:32 AM EDT, Daniel Colascione <dancol@google.com> wrote:
> >On Tue, Mar 19, 2019 at 8:59 PM Christian Brauner
> ><christian@brauner.io> wrote:
> >>
> >> On Tue, Mar 19, 2019 at 07:42:52PM -0700, Daniel Colascione wrote:
> >> > On Tue, Mar 19, 2019 at 6:52 PM Joel Fernandes
> ><joel@joelfernandes.org> wrote:
> >> > >
> >> > > On Wed, Mar 20, 2019 at 12:10:23AM +0100, Christian Brauner
> >wrote:
> >> > > > On Tue, Mar 19, 2019 at 03:48:32PM -0700, Daniel Colascione
> >wrote:
> >> > > > > On Tue, Mar 19, 2019 at 3:14 PM Christian Brauner
> ><christian@brauner.io> wrote:
> >> > > > > > So I dislike the idea of allocating new inodes from the
> >procfs super
> >> > > > > > block. I would like to avoid pinning the whole pidfd
> >concept exclusively
> >> > > > > > to proc. The idea is that the pidfd API will be useable
> >through procfs
> >> > > > > > via open("/proc/<pid>") because that is what users expect
> >and really
> >> > > > > > wanted to have for a long time. So it makes sense to have
> >this working.
> >> > > > > > But it should really be useable without it. That's why
> >translate_pid()
> >> > > > > > and pidfd_clone() are on the table.  What I'm saying is,
> >once the pidfd
> >> > > > > > api is "complete" you should be able to set CONFIG_PROCFS=N
> >- even
> >> > > > > > though that's crazy - and still be able to use pidfds. This
> >is also a
> >> > > > > > point akpm asked about when I did the pidfd_send_signal
> >work.
> >> > > > >
> >> > > > > I agree that you shouldn't need CONFIG_PROCFS=Y to use
> >pidfds. One
> >> > > > > crazy idea that I was discussing with Joel the other day is
> >to just
> >> > > > > make CONFIG_PROCFS=Y mandatory and provide a new
> >get_procfs_root()
> >> > > > > system call that returned, out of thin air and independent of
> >the
> >> > > > > mount table, a procfs root directory file descriptor for the
> >caller's
> >> > > > > PID namspace and suitable for use with openat(2).
> >> > > >
> >> > > > Even if this works I'm pretty sure that Al and a lot of others
> >will not
> >> > > > be happy about this. A syscall to get an fd to /proc?
> >> >
> >> > Why not? procfs provides access to a lot of core kernel
> >functionality.
> >> > Why should you need a mountpoint to get to it?
> >> >
> >> > > That's not going
> >> > > > to happen and I don't see the need for a separate syscall just
> >for that.
> >> >
> >> > We need a system call for the same reason we need a getrandom(2):
> >you
> >> > have to bootstrap somehow when you're in a minimal environment.
> >> >
> >> > > > (I do see the point of making CONFIG_PROCFS=y the default btw.)
> >> >
> >> > I'm not proposing that we make CONFIG_PROCFS=y the default. I'm
> >> > proposing that we *hardwire* it as the default and just declare
> >that
> >> > it's not possible to build a Linux kernel that doesn't include
> >procfs.
> >> > Why do we even have that button?
> >> >
> >> > > I think his point here was that he wanted a handle to procfs no
> >matter where
> >> > > it was mounted and then can later use openat on that. Agreed that
> >it may be
> >> > > unnecessary unless there is a usecase for it, and especially if
> >the /proc
> >> > > directory being the defacto mountpoint for procfs is a universal
> >convention.
> >> >
> >> > If it's a universal convention and, in practice, everyone needs
> >proc
> >> > mounted anyway, so what's the harm in hardwiring CONFIG_PROCFS=y?
> >If
> >> > we advertise /proc as not merely some kind of optional debug
> >interface
> >> > but *the* way certain kernel features are exposed --- and there's
> >> > nothing wrong with that --- then we should give programs access to
> >> > these core kernel features in a way that doesn't depend on
> >userspace
> >> > kernel configuration, and you do that by either providing a
> >> > procfs-root-getting system call or just hardwiring the "/proc/"
> >prefix
> >> > into VFS.
> >> >
> >> > > > Inode allocation from the procfs mount for the file descriptors
> >Joel
> >> > > > wants is not correct. Their not really procfs file descriptors
> >so this
> >> > > > is a nack. We can't just hook into proc that way.
> >> > >
> >> > > I was not particular about using procfs mount for the FDs but
> >that's the only
> >> > > way I knew how to do it until you pointed out anon_inode (my grep
> >skills
> >> > > missed that), so thank you!
> >> > >
> >> > > > > C'mon: /proc is used by everyone today and almost every
> >program breaks
> >> > > > > if it's not around. The string "/proc" is already de facto
> >kernel ABI.
> >> > > > > Let's just drop the pretense of /proc being optional and bake
> >it into
> >> > > > > the kernel proper, then give programs a way to get to /proc
> >that isn't
> >> > > > > tied to any particular mount configuration. This way, we
> >don't need a
> >> > > > > translate_pid(), since callers can just use procfs to do the
> >same
> >> > > > > thing. (That is, if I understand correctly what translate_pid
> >does.)
> >> > > >
> >> > > > I'm not sure what you think translate_pid() is doing since
> >you're not
> >> > > > saying what you think it does.
> >> > > > Examples from the old patchset:
> >> > > > translate_pid(pid, ns, -1)      - get pid in our pid namespace
> >> >
> >> > Ah, it's a bit different from what I had in mind. It's fair to want
> >to
> >> > translate PIDs between namespaces, but the only way to make the
> >> > translate_pid under discussion robust is to have it accept and
> >produce
> >> > pidfds. (At that point, you might as well call it translate_pidfd.)
> >We
> >> > should not be adding new APIs to the kernel that accept numeric
> >PIDs:
> >>
> >> The traditional pid-based api is not going away. There are users that
> >> have the requirement to translate pids between namespaces and also
> >doing
> >> introspection on these namespaces independent of pidfds. We will not
> >> restrict the usefulness of this syscall by making it only work with
> >> pidfds.
> >>
> >> > it's not possible to use these APIs correctly except under very
> >> > limited circumstances --- mostly, talking about init or a parent
> >>
> >> The pid-based api is one of the most widely used apis of the kernel
> >and
> >> people have been using it quite successfully for a long time. Yes,
> >it's
> >> rac, but it's here to stay.
> >>
> >> > talking about its child.
> >> >
> >> > Really, we need a few related operations, and we shouldn't
> >necessarily
> >> > mingle them.
> >>
> >> Yes, we've established that previously.
> >>
> >> >
> >> > 1) Given a numeric PID, give me a pidfd: that works today: you just
> >> > open /proc/<pid>
> >>
> >> Agreed.
> >>
> >> >
> >> > 2) Given a pidfd, give me a numeric PID: that works today: you just
> >> > openat(pidfd, "stat", O_RDONLY) and read the first token (which is
> >> > always the numeric PID).
> >>
> >> Agreed.
> >>
> >> >
> >> > 3) Given a pidfd, send a signal: that's what pidfd_send_signal
> >does,
> >> > and it's a good start on the rest of these operations.
> >>
> >> Agreed.
> >>
> >> > 5) Given a pidfd in NS1, get a pidfd in NS2. That's what
> >translate_pid
> >> > is for. My preferred signature for this routine is
> >translate_pid(int
> >> > pidfd, int nsfd) -> pidfd. We don't need two namespace arguments.
> >Why
> >> > not? Because the pidfd *already* names a single process, uniquely!
> >>
> >> Given that people are interested in pids we can't just always return
> >a
> >> pidfd. That would mean a user would need to do get the pidfd read
> >from
> >> <pidfd>/stat and then close the pidfd. If you do that for a 100 pids
> >or
> >> more you end up allocating and closing file descriptors constantly
> >for
> >> no reason. We can't just debate pids away. So it will also need to be
> >> able to yield pids e.g. through a flag argument.
> >
> >Sure, but that's still not a reason that we should care about pidfds
> >working separately from procfs..

That's unrelated to the point made in the above paragraph.
Please note, I said that the pidfd api should work when proc is not
available not that they can't be dirfds.

> 
> Agreed. I can't imagine pidfd being anything but a proc pid directory handle. So I am confused what Christian meant. Pidfd *is* a procfs directory fid  always. That's what I gathered from his pidfd_send_signal patch but let me know if I'm way off in the woods.

(K9 Mail still hasn't learned to wrap lines at 80 it seems. :))

Again, I never said that pidfds should be a directory handle.
(Though I would like to point out that one of the original ideas I
discussed at LPC was to have something like this to get regular file
descriptors instead of dirfds:
https://gist.github.com/brauner/59eec91550c5624c9999eaebd95a70df)

> 
> For my next revision, I am thinking of adding the flag argument Christian mentioned to make translate_pid return an anon_inode FD which can be used for death status, given a <pid>. Since it is thought that translate_pid can be made to return a pid FD, I think it is ok to have it return a pid status FD for the purposes of the death status as well.

translate_pid() should just return you a pidfd. Having it return a pidfd
and a status fd feels like stuffing too much functionality in there. If
you're fine with it I'll finish prototyping what I had in mind. As I
said in previous mails I'm already working on this.

Would you be ok with prototyping the pidfd_wait() syscall you had in
mind? Especially the wait_fd part that you want to have I would like to
see how that is supposed to work, e.g. who is allowed to wait on the
process and how notifications will work for non-parent processes and so
on. I feel we won't get anywhere by talking in the abstrace and other
people are far more likely to review/comment once there's actual code.

Christian

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: pidfd design
  2019-03-20  7:02                         ` Daniel Colascione
@ 2019-03-20 11:33                           ` Joel Fernandes
  2019-03-20 18:26                             ` Christian Brauner
  0 siblings, 1 reply; 35+ messages in thread
From: Joel Fernandes @ 2019-03-20 11:33 UTC (permalink / raw)
  To: Daniel Colascione, Christian Brauner
  Cc: Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray,
	Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg,
	Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML,
	open list:ANDROID DRIVERS, linux-mm, kernel-team, Oleg Nesterov,
	Andy Lutomirski, Serge E. Hallyn, Kees Cook



On March 20, 2019 3:02:32 AM EDT, Daniel Colascione <dancol@google.com> wrote:
>On Tue, Mar 19, 2019 at 8:59 PM Christian Brauner
><christian@brauner.io> wrote:
>>
>> On Tue, Mar 19, 2019 at 07:42:52PM -0700, Daniel Colascione wrote:
>> > On Tue, Mar 19, 2019 at 6:52 PM Joel Fernandes
><joel@joelfernandes.org> wrote:
>> > >
>> > > On Wed, Mar 20, 2019 at 12:10:23AM +0100, Christian Brauner
>wrote:
>> > > > On Tue, Mar 19, 2019 at 03:48:32PM -0700, Daniel Colascione
>wrote:
>> > > > > On Tue, Mar 19, 2019 at 3:14 PM Christian Brauner
><christian@brauner.io> wrote:
>> > > > > > So I dislike the idea of allocating new inodes from the
>procfs super
>> > > > > > block. I would like to avoid pinning the whole pidfd
>concept exclusively
>> > > > > > to proc. The idea is that the pidfd API will be useable
>through procfs
>> > > > > > via open("/proc/<pid>") because that is what users expect
>and really
>> > > > > > wanted to have for a long time. So it makes sense to have
>this working.
>> > > > > > But it should really be useable without it. That's why
>translate_pid()
>> > > > > > and pidfd_clone() are on the table.  What I'm saying is,
>once the pidfd
>> > > > > > api is "complete" you should be able to set CONFIG_PROCFS=N
>- even
>> > > > > > though that's crazy - and still be able to use pidfds. This
>is also a
>> > > > > > point akpm asked about when I did the pidfd_send_signal
>work.
>> > > > >
>> > > > > I agree that you shouldn't need CONFIG_PROCFS=Y to use
>pidfds. One
>> > > > > crazy idea that I was discussing with Joel the other day is
>to just
>> > > > > make CONFIG_PROCFS=Y mandatory and provide a new
>get_procfs_root()
>> > > > > system call that returned, out of thin air and independent of
>the
>> > > > > mount table, a procfs root directory file descriptor for the
>caller's
>> > > > > PID namspace and suitable for use with openat(2).
>> > > >
>> > > > Even if this works I'm pretty sure that Al and a lot of others
>will not
>> > > > be happy about this. A syscall to get an fd to /proc?
>> >
>> > Why not? procfs provides access to a lot of core kernel
>functionality.
>> > Why should you need a mountpoint to get to it?
>> >
>> > > That's not going
>> > > > to happen and I don't see the need for a separate syscall just
>for that.
>> >
>> > We need a system call for the same reason we need a getrandom(2):
>you
>> > have to bootstrap somehow when you're in a minimal environment.
>> >
>> > > > (I do see the point of making CONFIG_PROCFS=y the default btw.)
>> >
>> > I'm not proposing that we make CONFIG_PROCFS=y the default. I'm
>> > proposing that we *hardwire* it as the default and just declare
>that
>> > it's not possible to build a Linux kernel that doesn't include
>procfs.
>> > Why do we even have that button?
>> >
>> > > I think his point here was that he wanted a handle to procfs no
>matter where
>> > > it was mounted and then can later use openat on that. Agreed that
>it may be
>> > > unnecessary unless there is a usecase for it, and especially if
>the /proc
>> > > directory being the defacto mountpoint for procfs is a universal
>convention.
>> >
>> > If it's a universal convention and, in practice, everyone needs
>proc
>> > mounted anyway, so what's the harm in hardwiring CONFIG_PROCFS=y?
>If
>> > we advertise /proc as not merely some kind of optional debug
>interface
>> > but *the* way certain kernel features are exposed --- and there's
>> > nothing wrong with that --- then we should give programs access to
>> > these core kernel features in a way that doesn't depend on
>userspace
>> > kernel configuration, and you do that by either providing a
>> > procfs-root-getting system call or just hardwiring the "/proc/"
>prefix
>> > into VFS.
>> >
>> > > > Inode allocation from the procfs mount for the file descriptors
>Joel
>> > > > wants is not correct. Their not really procfs file descriptors
>so this
>> > > > is a nack. We can't just hook into proc that way.
>> > >
>> > > I was not particular about using procfs mount for the FDs but
>that's the only
>> > > way I knew how to do it until you pointed out anon_inode (my grep
>skills
>> > > missed that), so thank you!
>> > >
>> > > > > C'mon: /proc is used by everyone today and almost every
>program breaks
>> > > > > if it's not around. The string "/proc" is already de facto
>kernel ABI.
>> > > > > Let's just drop the pretense of /proc being optional and bake
>it into
>> > > > > the kernel proper, then give programs a way to get to /proc
>that isn't
>> > > > > tied to any particular mount configuration. This way, we
>don't need a
>> > > > > translate_pid(), since callers can just use procfs to do the
>same
>> > > > > thing. (That is, if I understand correctly what translate_pid
>does.)
>> > > >
>> > > > I'm not sure what you think translate_pid() is doing since
>you're not
>> > > > saying what you think it does.
>> > > > Examples from the old patchset:
>> > > > translate_pid(pid, ns, -1)      - get pid in our pid namespace
>> >
>> > Ah, it's a bit different from what I had in mind. It's fair to want
>to
>> > translate PIDs between namespaces, but the only way to make the
>> > translate_pid under discussion robust is to have it accept and
>produce
>> > pidfds. (At that point, you might as well call it translate_pidfd.)
>We
>> > should not be adding new APIs to the kernel that accept numeric
>PIDs:
>>
>> The traditional pid-based api is not going away. There are users that
>> have the requirement to translate pids between namespaces and also
>doing
>> introspection on these namespaces independent of pidfds. We will not
>> restrict the usefulness of this syscall by making it only work with
>> pidfds.
>>
>> > it's not possible to use these APIs correctly except under very
>> > limited circumstances --- mostly, talking about init or a parent
>>
>> The pid-based api is one of the most widely used apis of the kernel
>and
>> people have been using it quite successfully for a long time. Yes,
>it's
>> rac, but it's here to stay.
>>
>> > talking about its child.
>> >
>> > Really, we need a few related operations, and we shouldn't
>necessarily
>> > mingle them.
>>
>> Yes, we've established that previously.
>>
>> >
>> > 1) Given a numeric PID, give me a pidfd: that works today: you just
>> > open /proc/<pid>
>>
>> Agreed.
>>
>> >
>> > 2) Given a pidfd, give me a numeric PID: that works today: you just
>> > openat(pidfd, "stat", O_RDONLY) and read the first token (which is
>> > always the numeric PID).
>>
>> Agreed.
>>
>> >
>> > 3) Given a pidfd, send a signal: that's what pidfd_send_signal
>does,
>> > and it's a good start on the rest of these operations.
>>
>> Agreed.
>>
>> > 5) Given a pidfd in NS1, get a pidfd in NS2. That's what
>translate_pid
>> > is for. My preferred signature for this routine is
>translate_pid(int
>> > pidfd, int nsfd) -> pidfd. We don't need two namespace arguments.
>Why
>> > not? Because the pidfd *already* names a single process, uniquely!
>>
>> Given that people are interested in pids we can't just always return
>a
>> pidfd. That would mean a user would need to do get the pidfd read
>from
>> <pidfd>/stat and then close the pidfd. If you do that for a 100 pids
>or
>> more you end up allocating and closing file descriptors constantly
>for
>> no reason. We can't just debate pids away. So it will also need to be
>> able to yield pids e.g. through a flag argument.
>
>Sure, but that's still not a reason that we should care about pidfds
>working separately from procfs..

Agreed. I can't imagine pidfd being anything but a proc pid directory handle. So I am confused what Christian meant. Pidfd *is* a procfs directory fid  always. That's what I gathered from his pidfd_send_signal patch but let me know if I'm way off in the woods.

For my next revision, I am thinking of adding the flag argument Christian mentioned to make translate_pid return an anon_inode FD which can be used for death status, given a <pid>. Since it is thought that translate_pid can be made to return a pid FD, I think it is ok to have it return a pid status FD for the purposes of the death status as well.

Joel Fernandes, Android kernel team
Sent from k9-mail on Android

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: pidfd design
  2019-03-20  3:59                       ` Christian Brauner
@ 2019-03-20  7:02                         ` Daniel Colascione
  2019-03-20 11:33                           ` Joel Fernandes
  0 siblings, 1 reply; 35+ messages in thread
From: Daniel Colascione @ 2019-03-20  7:02 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Joel Fernandes, Suren Baghdasaryan, Steven Rostedt,
	Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman,
	Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar,
	Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm,
	kernel-team, Oleg Nesterov, Andy Lutomirski, Serge E. Hallyn,
	Kees Cook

On Tue, Mar 19, 2019 at 8:59 PM Christian Brauner <christian@brauner.io> wrote:
>
> On Tue, Mar 19, 2019 at 07:42:52PM -0700, Daniel Colascione wrote:
> > On Tue, Mar 19, 2019 at 6:52 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > >
> > > On Wed, Mar 20, 2019 at 12:10:23AM +0100, Christian Brauner wrote:
> > > > On Tue, Mar 19, 2019 at 03:48:32PM -0700, Daniel Colascione wrote:
> > > > > On Tue, Mar 19, 2019 at 3:14 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > > So I dislike the idea of allocating new inodes from the procfs super
> > > > > > block. I would like to avoid pinning the whole pidfd concept exclusively
> > > > > > to proc. The idea is that the pidfd API will be useable through procfs
> > > > > > via open("/proc/<pid>") because that is what users expect and really
> > > > > > wanted to have for a long time. So it makes sense to have this working.
> > > > > > But it should really be useable without it. That's why translate_pid()
> > > > > > and pidfd_clone() are on the table.  What I'm saying is, once the pidfd
> > > > > > api is "complete" you should be able to set CONFIG_PROCFS=N - even
> > > > > > though that's crazy - and still be able to use pidfds. This is also a
> > > > > > point akpm asked about when I did the pidfd_send_signal work.
> > > > >
> > > > > I agree that you shouldn't need CONFIG_PROCFS=Y to use pidfds. One
> > > > > crazy idea that I was discussing with Joel the other day is to just
> > > > > make CONFIG_PROCFS=Y mandatory and provide a new get_procfs_root()
> > > > > system call that returned, out of thin air and independent of the
> > > > > mount table, a procfs root directory file descriptor for the caller's
> > > > > PID namspace and suitable for use with openat(2).
> > > >
> > > > Even if this works I'm pretty sure that Al and a lot of others will not
> > > > be happy about this. A syscall to get an fd to /proc?
> >
> > Why not? procfs provides access to a lot of core kernel functionality.
> > Why should you need a mountpoint to get to it?
> >
> > > That's not going
> > > > to happen and I don't see the need for a separate syscall just for that.
> >
> > We need a system call for the same reason we need a getrandom(2): you
> > have to bootstrap somehow when you're in a minimal environment.
> >
> > > > (I do see the point of making CONFIG_PROCFS=y the default btw.)
> >
> > I'm not proposing that we make CONFIG_PROCFS=y the default. I'm
> > proposing that we *hardwire* it as the default and just declare that
> > it's not possible to build a Linux kernel that doesn't include procfs.
> > Why do we even have that button?
> >
> > > I think his point here was that he wanted a handle to procfs no matter where
> > > it was mounted and then can later use openat on that. Agreed that it may be
> > > unnecessary unless there is a usecase for it, and especially if the /proc
> > > directory being the defacto mountpoint for procfs is a universal convention.
> >
> > If it's a universal convention and, in practice, everyone needs proc
> > mounted anyway, so what's the harm in hardwiring CONFIG_PROCFS=y? If
> > we advertise /proc as not merely some kind of optional debug interface
> > but *the* way certain kernel features are exposed --- and there's
> > nothing wrong with that --- then we should give programs access to
> > these core kernel features in a way that doesn't depend on userspace
> > kernel configuration, and you do that by either providing a
> > procfs-root-getting system call or just hardwiring the "/proc/" prefix
> > into VFS.
> >
> > > > Inode allocation from the procfs mount for the file descriptors Joel
> > > > wants is not correct. Their not really procfs file descriptors so this
> > > > is a nack. We can't just hook into proc that way.
> > >
> > > I was not particular about using procfs mount for the FDs but that's the only
> > > way I knew how to do it until you pointed out anon_inode (my grep skills
> > > missed that), so thank you!
> > >
> > > > > C'mon: /proc is used by everyone today and almost every program breaks
> > > > > if it's not around. The string "/proc" is already de facto kernel ABI.
> > > > > Let's just drop the pretense of /proc being optional and bake it into
> > > > > the kernel proper, then give programs a way to get to /proc that isn't
> > > > > tied to any particular mount configuration. This way, we don't need a
> > > > > translate_pid(), since callers can just use procfs to do the same
> > > > > thing. (That is, if I understand correctly what translate_pid does.)
> > > >
> > > > I'm not sure what you think translate_pid() is doing since you're not
> > > > saying what you think it does.
> > > > Examples from the old patchset:
> > > > translate_pid(pid, ns, -1)      - get pid in our pid namespace
> >
> > Ah, it's a bit different from what I had in mind. It's fair to want to
> > translate PIDs between namespaces, but the only way to make the
> > translate_pid under discussion robust is to have it accept and produce
> > pidfds. (At that point, you might as well call it translate_pidfd.) We
> > should not be adding new APIs to the kernel that accept numeric PIDs:
>
> The traditional pid-based api is not going away. There are users that
> have the requirement to translate pids between namespaces and also doing
> introspection on these namespaces independent of pidfds. We will not
> restrict the usefulness of this syscall by making it only work with
> pidfds.
>
> > it's not possible to use these APIs correctly except under very
> > limited circumstances --- mostly, talking about init or a parent
>
> The pid-based api is one of the most widely used apis of the kernel and
> people have been using it quite successfully for a long time. Yes, it's
> rac, but it's here to stay.
>
> > talking about its child.
> >
> > Really, we need a few related operations, and we shouldn't necessarily
> > mingle them.
>
> Yes, we've established that previously.
>
> >
> > 1) Given a numeric PID, give me a pidfd: that works today: you just
> > open /proc/<pid>
>
> Agreed.
>
> >
> > 2) Given a pidfd, give me a numeric PID: that works today: you just
> > openat(pidfd, "stat", O_RDONLY) and read the first token (which is
> > always the numeric PID).
>
> Agreed.
>
> >
> > 3) Given a pidfd, send a signal: that's what pidfd_send_signal does,
> > and it's a good start on the rest of these operations.
>
> Agreed.
>
> > 5) Given a pidfd in NS1, get a pidfd in NS2. That's what translate_pid
> > is for. My preferred signature for this routine is translate_pid(int
> > pidfd, int nsfd) -> pidfd. We don't need two namespace arguments. Why
> > not? Because the pidfd *already* names a single process, uniquely!
>
> Given that people are interested in pids we can't just always return a
> pidfd. That would mean a user would need to do get the pidfd read from
> <pidfd>/stat and then close the pidfd. If you do that for a 100 pids or
> more you end up allocating and closing file descriptors constantly for
> no reason. We can't just debate pids away. So it will also need to be
> able to yield pids e.g. through a flag argument.

Sure, but that's still not a reason that we should care about pidfds
working separately from procfs.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: pidfd design
  2019-03-20  2:42                     ` pidfd design Daniel Colascione
@ 2019-03-20  3:59                       ` Christian Brauner
  2019-03-20  7:02                         ` Daniel Colascione
  0 siblings, 1 reply; 35+ messages in thread
From: Christian Brauner @ 2019-03-20  3:59 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Joel Fernandes, Suren Baghdasaryan, Steven Rostedt,
	Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman,
	Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar,
	Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm,
	kernel-team, Oleg Nesterov, Andy Lutomirski, Serge E. Hallyn,
	Kees Cook

On Tue, Mar 19, 2019 at 07:42:52PM -0700, Daniel Colascione wrote:
> On Tue, Mar 19, 2019 at 6:52 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > On Wed, Mar 20, 2019 at 12:10:23AM +0100, Christian Brauner wrote:
> > > On Tue, Mar 19, 2019 at 03:48:32PM -0700, Daniel Colascione wrote:
> > > > On Tue, Mar 19, 2019 at 3:14 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > So I dislike the idea of allocating new inodes from the procfs super
> > > > > block. I would like to avoid pinning the whole pidfd concept exclusively
> > > > > to proc. The idea is that the pidfd API will be useable through procfs
> > > > > via open("/proc/<pid>") because that is what users expect and really
> > > > > wanted to have for a long time. So it makes sense to have this working.
> > > > > But it should really be useable without it. That's why translate_pid()
> > > > > and pidfd_clone() are on the table.  What I'm saying is, once the pidfd
> > > > > api is "complete" you should be able to set CONFIG_PROCFS=N - even
> > > > > though that's crazy - and still be able to use pidfds. This is also a
> > > > > point akpm asked about when I did the pidfd_send_signal work.
> > > >
> > > > I agree that you shouldn't need CONFIG_PROCFS=Y to use pidfds. One
> > > > crazy idea that I was discussing with Joel the other day is to just
> > > > make CONFIG_PROCFS=Y mandatory and provide a new get_procfs_root()
> > > > system call that returned, out of thin air and independent of the
> > > > mount table, a procfs root directory file descriptor for the caller's
> > > > PID namspace and suitable for use with openat(2).
> > >
> > > Even if this works I'm pretty sure that Al and a lot of others will not
> > > be happy about this. A syscall to get an fd to /proc?
> 
> Why not? procfs provides access to a lot of core kernel functionality.
> Why should you need a mountpoint to get to it?
> 
> > That's not going
> > > to happen and I don't see the need for a separate syscall just for that.
> 
> We need a system call for the same reason we need a getrandom(2): you
> have to bootstrap somehow when you're in a minimal environment.
> 
> > > (I do see the point of making CONFIG_PROCFS=y the default btw.)
> 
> I'm not proposing that we make CONFIG_PROCFS=y the default. I'm
> proposing that we *hardwire* it as the default and just declare that
> it's not possible to build a Linux kernel that doesn't include procfs.
> Why do we even have that button?
> 
> > I think his point here was that he wanted a handle to procfs no matter where
> > it was mounted and then can later use openat on that. Agreed that it may be
> > unnecessary unless there is a usecase for it, and especially if the /proc
> > directory being the defacto mountpoint for procfs is a universal convention.
> 
> If it's a universal convention and, in practice, everyone needs proc
> mounted anyway, so what's the harm in hardwiring CONFIG_PROCFS=y? If
> we advertise /proc as not merely some kind of optional debug interface
> but *the* way certain kernel features are exposed --- and there's
> nothing wrong with that --- then we should give programs access to
> these core kernel features in a way that doesn't depend on userspace
> kernel configuration, and you do that by either providing a
> procfs-root-getting system call or just hardwiring the "/proc/" prefix
> into VFS.
> 
> > > Inode allocation from the procfs mount for the file descriptors Joel
> > > wants is not correct. Their not really procfs file descriptors so this
> > > is a nack. We can't just hook into proc that way.
> >
> > I was not particular about using procfs mount for the FDs but that's the only
> > way I knew how to do it until you pointed out anon_inode (my grep skills
> > missed that), so thank you!
> >
> > > > C'mon: /proc is used by everyone today and almost every program breaks
> > > > if it's not around. The string "/proc" is already de facto kernel ABI.
> > > > Let's just drop the pretense of /proc being optional and bake it into
> > > > the kernel proper, then give programs a way to get to /proc that isn't
> > > > tied to any particular mount configuration. This way, we don't need a
> > > > translate_pid(), since callers can just use procfs to do the same
> > > > thing. (That is, if I understand correctly what translate_pid does.)
> > >
> > > I'm not sure what you think translate_pid() is doing since you're not
> > > saying what you think it does.
> > > Examples from the old patchset:
> > > translate_pid(pid, ns, -1)      - get pid in our pid namespace
> 
> Ah, it's a bit different from what I had in mind. It's fair to want to
> translate PIDs between namespaces, but the only way to make the
> translate_pid under discussion robust is to have it accept and produce
> pidfds. (At that point, you might as well call it translate_pidfd.) We
> should not be adding new APIs to the kernel that accept numeric PIDs:

The traditional pid-based api is not going away. There are users that
have the requirement to translate pids between namespaces and also doing
introspection on these namespaces independent of pidfds. We will not
restrict the usefulness of this syscall by making it only work with
pidfds.

> it's not possible to use these APIs correctly except under very
> limited circumstances --- mostly, talking about init or a parent

The pid-based api is one of the most widely used apis of the kernel and
people have been using it quite successfully for a long time. Yes, it's
rac, but it's here to stay.

> talking about its child.
> 
> Really, we need a few related operations, and we shouldn't necessarily
> mingle them.

Yes, we've established that previously.

> 
> 1) Given a numeric PID, give me a pidfd: that works today: you just
> open /proc/<pid>

Agreed.

> 
> 2) Given a pidfd, give me a numeric PID: that works today: you just
> openat(pidfd, "stat", O_RDONLY) and read the first token (which is
> always the numeric PID).

Agreed.

> 
> 3) Given a pidfd, send a signal: that's what pidfd_send_signal does,
> and it's a good start on the rest of these operations.

Agreed.

> 5) Given a pidfd in NS1, get a pidfd in NS2. That's what translate_pid
> is for. My preferred signature for this routine is translate_pid(int
> pidfd, int nsfd) -> pidfd. We don't need two namespace arguments. Why
> not? Because the pidfd *already* names a single process, uniquely!

Given that people are interested in pids we can't just always return a
pidfd. That would mean a user would need to do get the pidfd read from
<pidfd>/stat and then close the pidfd. If you do that for a 100 pids or
more you end up allocating and closing file descriptors constantly for
no reason. We can't just debate pids away. So it will also need to be
able to yield pids e.g. through a flag argument.

> 
> 6) Make a new process and atomically give me a pidfd for it. We need a
> new kind of clone(2) for that. People have been proposing some kind of
> FD-based fork/spawn/etc. thing for ages, and we can finally provide
> it. Yay.

Agreed.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* pidfd design
  2019-03-20  1:52                   ` Joel Fernandes
@ 2019-03-20  2:42                     ` Daniel Colascione
  2019-03-20  3:59                       ` Christian Brauner
  0 siblings, 1 reply; 35+ messages in thread
From: Daniel Colascione @ 2019-03-20  2:42 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Christian Brauner, Suren Baghdasaryan, Steven Rostedt,
	Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman,
	Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar,
	Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm,
	kernel-team, Oleg Nesterov, Andy Lutomirski, Serge E. Hallyn,
	Kees Cook

On Tue, Mar 19, 2019 at 6:52 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Wed, Mar 20, 2019 at 12:10:23AM +0100, Christian Brauner wrote:
> > On Tue, Mar 19, 2019 at 03:48:32PM -0700, Daniel Colascione wrote:
> > > On Tue, Mar 19, 2019 at 3:14 PM Christian Brauner <christian@brauner.io> wrote:
> > > > So I dislike the idea of allocating new inodes from the procfs super
> > > > block. I would like to avoid pinning the whole pidfd concept exclusively
> > > > to proc. The idea is that the pidfd API will be useable through procfs
> > > > via open("/proc/<pid>") because that is what users expect and really
> > > > wanted to have for a long time. So it makes sense to have this working.
> > > > But it should really be useable without it. That's why translate_pid()
> > > > and pidfd_clone() are on the table.  What I'm saying is, once the pidfd
> > > > api is "complete" you should be able to set CONFIG_PROCFS=N - even
> > > > though that's crazy - and still be able to use pidfds. This is also a
> > > > point akpm asked about when I did the pidfd_send_signal work.
> > >
> > > I agree that you shouldn't need CONFIG_PROCFS=Y to use pidfds. One
> > > crazy idea that I was discussing with Joel the other day is to just
> > > make CONFIG_PROCFS=Y mandatory and provide a new get_procfs_root()
> > > system call that returned, out of thin air and independent of the
> > > mount table, a procfs root directory file descriptor for the caller's
> > > PID namspace and suitable for use with openat(2).
> >
> > Even if this works I'm pretty sure that Al and a lot of others will not
> > be happy about this. A syscall to get an fd to /proc?

Why not? procfs provides access to a lot of core kernel functionality.
Why should you need a mountpoint to get to it?

> That's not going
> > to happen and I don't see the need for a separate syscall just for that.

We need a system call for the same reason we need a getrandom(2): you
have to bootstrap somehow when you're in a minimal environment.

> > (I do see the point of making CONFIG_PROCFS=y the default btw.)

I'm not proposing that we make CONFIG_PROCFS=y the default. I'm
proposing that we *hardwire* it as the default and just declare that
it's not possible to build a Linux kernel that doesn't include procfs.
Why do we even have that button?

> I think his point here was that he wanted a handle to procfs no matter where
> it was mounted and then can later use openat on that. Agreed that it may be
> unnecessary unless there is a usecase for it, and especially if the /proc
> directory being the defacto mountpoint for procfs is a universal convention.

If it's a universal convention and, in practice, everyone needs proc
mounted anyway, so what's the harm in hardwiring CONFIG_PROCFS=y? If
we advertise /proc as not merely some kind of optional debug interface
but *the* way certain kernel features are exposed --- and there's
nothing wrong with that --- then we should give programs access to
these core kernel features in a way that doesn't depend on userspace
kernel configuration, and you do that by either providing a
procfs-root-getting system call or just hardwiring the "/proc/" prefix
into VFS.

> > Inode allocation from the procfs mount for the file descriptors Joel
> > wants is not correct. Their not really procfs file descriptors so this
> > is a nack. We can't just hook into proc that way.
>
> I was not particular about using procfs mount for the FDs but that's the only
> way I knew how to do it until you pointed out anon_inode (my grep skills
> missed that), so thank you!
>
> > > C'mon: /proc is used by everyone today and almost every program breaks
> > > if it's not around. The string "/proc" is already de facto kernel ABI.
> > > Let's just drop the pretense of /proc being optional and bake it into
> > > the kernel proper, then give programs a way to get to /proc that isn't
> > > tied to any particular mount configuration. This way, we don't need a
> > > translate_pid(), since callers can just use procfs to do the same
> > > thing. (That is, if I understand correctly what translate_pid does.)
> >
> > I'm not sure what you think translate_pid() is doing since you're not
> > saying what you think it does.
> > Examples from the old patchset:
> > translate_pid(pid, ns, -1)      - get pid in our pid namespace

Ah, it's a bit different from what I had in mind. It's fair to want to
translate PIDs between namespaces, but the only way to make the
translate_pid under discussion robust is to have it accept and produce
pidfds. (At that point, you might as well call it translate_pidfd.) We
should not be adding new APIs to the kernel that accept numeric PIDs:
it's not possible to use these APIs correctly except under very
limited circumstances --- mostly, talking about init or a parent
talking about its child.

Really, we need a few related operations, and we shouldn't necessarily
mingle them.

1) Given a numeric PID, give me a pidfd: that works today: you just
open /proc/<pid>

2) Given a pidfd, give me a numeric PID: that works today: you just
openat(pidfd, "stat", O_RDONLY) and read the first token (which is
always the numeric PID).

3) Given a pidfd, send a signal: that's what pidfd_send_signal does,
and it's a good start on the rest of these operations.

4) Given a pidfd, wait for the named process to exit: that's what my
original exithand thing did, and that's what Joel's helpfully agreed
to start hacking on.

5) Given a pidfd in NS1, get a pidfd in NS2. That's what translate_pid
is for. My preferred signature for this routine is translate_pid(int
pidfd, int nsfd) -> pidfd. We don't need two namespace arguments. Why
not? Because the pidfd *already* names a single process, uniquely!

6) Make a new process and atomically give me a pidfd for it. We need a
new kind of clone(2) for that. People have been proposing some kind of
FD-based fork/spawn/etc. thing for ages, and we can finally provide
it. Yay.

7) Retrieve miscellaneous information about a process identified by a
pidfd: openat(2) handles this case today.

This is a decent framework for a good general-purpose process API that
builds on the one the kernel already provides. With this API, people
should never have to touch the old unix process API except to talk to
humans and other legacy systems. It's a big project, but worthwhile,
and we can do it piecemeal.

Christian, what worries me is that you want to make this project 10x
harder, both in technical and lkml-political terms, by making it work
without CONFIG_PROCFS=y. Without procfs, all the operations above that
involve the word "openat" or "/proc" break, which means that our
general-purpose process API needs to provide its own equivalents to
these operations, and on top of these, its own non-procfs pidfd FD
type --- let's call it pidfd_2. (Let's call a directory FD on
/proc/<pid> a pidfd_1.) Under this scheme, we have to have all
operations that accept a pidfd_1 (like pidfd_send_signal) and have
them accept pidfd_2 file descriptors as well in general fashion. (The
difference between pidfd_1 and pidfd_2 is visible to users who can
call fstat and look at st_dev.) We'd also need an API to translate a
pidfd_2 to a pidfd_1 so you could call openat on it to look at
/proc/<pid> data files, to support operation #7 above.  The
alternative to provide #7 is some kind of new general-purpose
process-information-retrieval interface that mirrors the functionality
/proc/<pid> already provides --- e.g., getting the smaps list for a
process.

To sum it up, we can

A) declare that pidfds don't work without CONFIG_PROCFS=y,
B) hardwire CONFIG_PROCFS=y in all configurations, or
C) support both procfs-based pidfd_1 FDs and non-procfs pidfd_2 FDs.

Option C seems like pointless complexity to me, as I described above.
Option C means that we have to duplicate a lot of existing and
perfectly good functionality.

Option A is fine by me, since I think CONFIG_PROCFS=n is just a
bonkers and broken configuration that's barely even Linux.

From a design perspective, I prefer option B: it turns a de-facto
guaranteed /proc ABI into a de-jure guaranteed ABI, and that's just
more straightforward for everyone --- especially since it reduces the
complexity of the Linux core by deleting all the !CONFIG_PROCFS code
paths. My point about the procfs system call is that *if* we go with
option B and make procfs mandatory, we're essentially stating that
certain kernel functionality is always available, and because (as a
general rule) kernel functionality that's always available should be
available to every process, we should provide a way to *use* this
always-present kernel functionality that doesn't depend on the mount
table --- thus my proposed get_procfs_root(2).

We don't have to decide between A and B right now. We can continue
develop pidfd development under the assumption we're going with option
A, and when option B seems like a good idea, we can just switch with
minimal hassle. On the other hand, if we did implement option C and,
later, it became apparently that option B was right after all, all the
work needed for option C would have been a waste.

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2019-03-28  9:21 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-20 20:07 pidfd design Alexey Dobriyan
2019-03-20 20:14 ` Daniel Colascione
2019-03-20 20:39   ` Alexey Dobriyan
2019-03-20 20:47     ` Christian Brauner
2019-03-20 20:50       ` Daniel Colascione
2019-03-20 21:00         ` Christian Brauner
2019-03-22 14:04 ` Michael Tirado
2019-03-25 17:45   ` Linus Torvalds
2019-03-25 16:14     ` Michael Tirado
2019-03-25 20:45     ` Christian Brauner
2019-03-25 18:50   ` Christian Brauner
  -- strict thread matches above, loose matches on Subject: below --
2019-03-16 18:57 [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android Christian Brauner
2019-03-16 19:37 ` Suren Baghdasaryan
2019-03-17  1:53   ` Joel Fernandes
2019-03-17 11:42     ` Christian Brauner
2019-03-17 15:40       ` Daniel Colascione
2019-03-18  0:29         ` Christian Brauner
2019-03-18 23:50           ` Joel Fernandes
2019-03-19 22:14             ` Christian Brauner
2019-03-19 22:48               ` Daniel Colascione
2019-03-19 23:10                 ` Christian Brauner
2019-03-20  1:52                   ` Joel Fernandes
2019-03-20  2:42                     ` pidfd design Daniel Colascione
2019-03-20  3:59                       ` Christian Brauner
2019-03-20  7:02                         ` Daniel Colascione
2019-03-20 11:33                           ` Joel Fernandes
2019-03-20 18:26                             ` Christian Brauner
2019-03-20 18:38                               ` Daniel Colascione
2019-03-20 18:51                                 ` Christian Brauner
2019-03-20 18:58                                   ` Andy Lutomirski
2019-03-20 19:14                                     ` Christian Brauner
2019-03-20 19:40                                       ` Daniel Colascione
2019-03-21 17:02                                         ` Andy Lutomirski
2019-03-25 20:13                                           ` Jann Horn
2019-03-25 20:23                                             ` Daniel Colascione
2019-03-25 23:42                                               ` Andy Lutomirski
2019-03-25 23:45                                                 ` Christian Brauner
2019-03-26  0:00                                                   ` Andy Lutomirski
2019-03-26  0:12                                                     ` Christian Brauner
2019-03-26  0:24                                                       ` Andy Lutomirski
2019-03-28  9:21                                                         ` Christian Brauner
2019-03-20 19:19                                   ` Joel Fernandes
2019-03-20 19:29                                   ` Daniel Colascione
2019-03-24 14:44                                     ` Serge E. Hallyn
2019-03-24 18:48                                       ` Joel Fernandes
2019-03-20 19:11                               ` Joel Fernandes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).