* Re: pidfd design
@ 2019-03-20 20:07 Alexey Dobriyan
2019-03-20 20:14 ` Daniel Colascione
2019-03-22 14:04 ` Michael Tirado
0 siblings, 2 replies; 44+ messages in thread
From: Alexey Dobriyan @ 2019-03-20 20:07 UTC (permalink / raw)
To: christian; +Cc: linux-kernel, dancol, joel, luto
> What would be your opinion to having a
> /proc/<pid>/handle
> file instead of having a dirfd.
This is even worse than depending on PROC_FS. Just for the dependency
pidfd code should be backed out immediately. Forget about /proc.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design 2019-03-20 20:07 pidfd design Alexey Dobriyan @ 2019-03-20 20:14 ` Daniel Colascione 2019-03-20 20:39 ` Alexey Dobriyan 2019-03-22 14:04 ` Michael Tirado 1 sibling, 1 reply; 44+ messages in thread From: Daniel Colascione @ 2019-03-20 20:14 UTC (permalink / raw) To: Alexey Dobriyan Cc: Christian Brauner, linux-kernel, Joel Fernandes, Andy Lutomirski On Wed, Mar 20, 2019 at 1:07 PM Alexey Dobriyan <adobriyan@gmail.com> wrote: > > What would be your opinion to having a > > /proc/<pid>/handle > > file instead of having a dirfd. > > This is even worse than depending on PROC_FS. Just for the dependency > pidfd code should be backed out immediately. Forget about /proc. We already have pidfds, and we've had them since /proc was added ages ago. Backing out procfs is a bold proposal. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design 2019-03-20 20:14 ` Daniel Colascione @ 2019-03-20 20:39 ` Alexey Dobriyan 2019-03-20 20:47 ` Christian Brauner 0 siblings, 1 reply; 44+ messages in thread From: Alexey Dobriyan @ 2019-03-20 20:39 UTC (permalink / raw) To: Daniel Colascione Cc: Christian Brauner, linux-kernel, Joel Fernandes, Andy Lutomirski On Wed, Mar 20, 2019 at 01:14:01PM -0700, Daniel Colascione wrote: > On Wed, Mar 20, 2019 at 1:07 PM Alexey Dobriyan <adobriyan@gmail.com> wrote: > > > What would be your opinion to having a > > > /proc/<pid>/handle > > > file instead of having a dirfd. > > > > This is even worse than depending on PROC_FS. Just for the dependency > > pidfd code should be backed out immediately. Forget about /proc. > > We already have pidfds, and we've had them since /proc was added ages > ago. New pidfd code (or whatever the name) should NOT depend on /proc and should not interact with VFS at all at any point (other than probably being a descriptor on a fake filesystem). The reason is that /proc is full of crap and you don't want to spill that into new and hopefully properly designed part of new code. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design 2019-03-20 20:39 ` Alexey Dobriyan @ 2019-03-20 20:47 ` Christian Brauner 2019-03-20 20:50 ` Daniel Colascione 0 siblings, 1 reply; 44+ messages in thread From: Christian Brauner @ 2019-03-20 20:47 UTC (permalink / raw) To: Alexey Dobriyan Cc: Daniel Colascione, linux-kernel, Joel Fernandes, Andy Lutomirski On Wed, Mar 20, 2019 at 11:39:10PM +0300, Alexey Dobriyan wrote: > On Wed, Mar 20, 2019 at 01:14:01PM -0700, Daniel Colascione wrote: > > On Wed, Mar 20, 2019 at 1:07 PM Alexey Dobriyan <adobriyan@gmail.com> wrote: > > > > What would be your opinion to having a > > > > /proc/<pid>/handle > > > > file instead of having a dirfd. > > > > > > This is even worse than depending on PROC_FS. Just for the dependency > > > pidfd code should be backed out immediately. Forget about /proc. > > > > We already have pidfds, and we've had them since /proc was added ages > > ago. > > New pidfd code (or whatever the name) should NOT depend on /proc and > should not interact with VFS at all at any point (other than probably > being a descriptor on a fake filesystem). The reason is that /proc is > full of crap and you don't want to spill that into new and hopefully > properly designed part of new code. Yes, I agree. That's why I was thinking that translate_pid() is a good candidate to provide that decoupling. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design 2019-03-20 20:47 ` Christian Brauner @ 2019-03-20 20:50 ` Daniel Colascione 2019-03-20 21:00 ` Christian Brauner 0 siblings, 1 reply; 44+ messages in thread From: Daniel Colascione @ 2019-03-20 20:50 UTC (permalink / raw) To: Christian Brauner Cc: Alexey Dobriyan, linux-kernel, Joel Fernandes, Andy Lutomirski On Wed, Mar 20, 2019 at 1:47 PM Christian Brauner <christian@brauner.io> wrote: > > On Wed, Mar 20, 2019 at 11:39:10PM +0300, Alexey Dobriyan wrote: > > On Wed, Mar 20, 2019 at 01:14:01PM -0700, Daniel Colascione wrote: > > > On Wed, Mar 20, 2019 at 1:07 PM Alexey Dobriyan <adobriyan@gmail.com> wrote: > > > > > What would be your opinion to having a > > > > > /proc/<pid>/handle > > > > > file instead of having a dirfd. > > > > > > > > This is even worse than depending on PROC_FS. Just for the dependency > > > > pidfd code should be backed out immediately. Forget about /proc. > > > > > > We already have pidfds, and we've had them since /proc was added ages > > > ago. > > > > New pidfd code (or whatever the name) should NOT depend on /proc and > > should not interact with VFS at all at any point (other than probably > > being a descriptor on a fake filesystem). The reason is that /proc is > > full of crap and you don't want to spill that into new and hopefully > > properly designed part of new code. > > Yes, I agree. That's why I was thinking that translate_pid() is a good > candidate to provide that decoupling. Then again: how do you propose fetching process metadata? If you adopt a stance that nothing can use procfs and simultaneously adopt a stance that we don't want to duplicate all the decades of metadata interfaces in /proc/pid (which are useful, not "crap"), then the overall result is that we just won't make any progress at all. There's nothing wrong with taking a dependency on procfs: procfs is how we talk about processes. It's completely unreasonable to say "no, you can't use the old thing" and also "no, we can't add a new thing that would duplicate the old thing". ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design 2019-03-20 20:50 ` Daniel Colascione @ 2019-03-20 21:00 ` Christian Brauner 0 siblings, 0 replies; 44+ messages in thread From: Christian Brauner @ 2019-03-20 21:00 UTC (permalink / raw) To: Daniel Colascione Cc: Alexey Dobriyan, linux-kernel, Joel Fernandes, Andy Lutomirski On Wed, Mar 20, 2019 at 01:50:43PM -0700, Daniel Colascione wrote: > On Wed, Mar 20, 2019 at 1:47 PM Christian Brauner <christian@brauner.io> wrote: > > > > On Wed, Mar 20, 2019 at 11:39:10PM +0300, Alexey Dobriyan wrote: > > > On Wed, Mar 20, 2019 at 01:14:01PM -0700, Daniel Colascione wrote: > > > > On Wed, Mar 20, 2019 at 1:07 PM Alexey Dobriyan <adobriyan@gmail.com> wrote: > > > > > > What would be your opinion to having a > > > > > > /proc/<pid>/handle > > > > > > file instead of having a dirfd. > > > > > > > > > > This is even worse than depending on PROC_FS. Just for the dependency > > > > > pidfd code should be backed out immediately. Forget about /proc. > > > > > > > > We already have pidfds, and we've had them since /proc was added ages > > > > ago. > > > > > > New pidfd code (or whatever the name) should NOT depend on /proc and > > > should not interact with VFS at all at any point (other than probably > > > being a descriptor on a fake filesystem). The reason is that /proc is > > > full of crap and you don't want to spill that into new and hopefully > > > properly designed part of new code. > > > > Yes, I agree. That's why I was thinking that translate_pid() is a good > > candidate to provide that decoupling. > > Then again: how do you propose fetching process metadata? If you adopt > a stance that nothing can use procfs and simultaneously adopt a stance > that we don't want to duplicate all the decades of metadata interfaces > in /proc/pid (which are useful, not "crap"), then the overall result > is that we just won't make any progress at all. There's nothing wrong > with taking a dependency on procfs: procfs is how we talk about > processes. It's completely unreasonable to say "no, you can't use the > old thing" and also "no, we can't add a new thing that would duplicate > the old thing". This style or arguing won't get us any further. Please, send in the code that you think is right and you want to get reviewed. If you can get the Acks from the people you need, good. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design 2019-03-20 20:07 pidfd design Alexey Dobriyan 2019-03-20 20:14 ` Daniel Colascione @ 2019-03-22 14:04 ` Michael Tirado 2019-03-25 17:45 ` Linus Torvalds 2019-03-25 18:50 ` Christian Brauner 1 sibling, 2 replies; 44+ messages in thread From: Michael Tirado @ 2019-03-22 14:04 UTC (permalink / raw) To: Alexey Dobriyan; +Cc: Linus Torvalds, LKML On Wed, Mar 20, 2019 at 8:08 PM Alexey Dobriyan <adobriyan@gmail.com> wrote: > > pidfd code should be backed out immediately. Forget about /proc. Seems like Torvalds just merges this sort of "stuff" without reading it now, or there's something that auto accepted pull request to RC tree? > The pull request you sent on Tue, 5 Mar 2019 18:13:01 +0100: > >> git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git tags/pidfd-v5.1-rc1 > > has been merged into torvalds/linux.git: > https://git.kernel.org/torvalds/c/a9dce6679d736cb3d612af39bab9f31f8db66f9b > >Thank you! ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design 2019-03-22 14:04 ` Michael Tirado @ 2019-03-25 17:45 ` Linus Torvalds 2019-03-25 16:14 ` Michael Tirado 2019-03-25 20:45 ` Christian Brauner 2019-03-25 18:50 ` Christian Brauner 1 sibling, 2 replies; 44+ messages in thread From: Linus Torvalds @ 2019-03-25 17:45 UTC (permalink / raw) To: Michael Tirado; +Cc: Alexey Dobriyan, LKML On Fri, Mar 22, 2019 at 11:34 AM Michael Tirado <mtirado418@gmail.com> wrote: > > On Wed, Mar 20, 2019 at 8:08 PM Alexey Dobriyan <adobriyan@gmail.com> wrote: > > > > pidfd code should be backed out immediately. Forget about /proc. > > Seems like Torvalds just merges this sort of "stuff" without reading > it now, or there's something that auto accepted pull request to RC tree? There is no auto-accept. But there also didn't seem to be any valid arguments against it, and the android people had arguments for it. Arguing against it based on "I don't like /proc" is pointless. The fact is, /proc is our system interface for a lot of things. Arguing against it based on "I worry about the _other_ non-signal-sending things that could be done with this" is also pointless. What other things? The only thing that got merged was the signalling. Now, arguing that signalling should use the open-time credentials might make sense, but this isn't read/write. You can't fool some suid program to do magic randon system calls for you, and if you can, then arguing about pidfd is kind of pointless. So the model of using a file descriptor instead of a 'pid' for signal handling is actually very unix-like. Maybe that's how pid's should have worked to begin with. Remember that whole "everything is a file" thing? Now, the fact that fork() and clone() return a pid obviously means that pidfd isn't the primary model (not to decades of just history), but that doesn't make pidfd wrong. And namespace issues etc are all also kind of irrelevant. If you open random files in /proc and randomly do pidfd_send_signal() on those, you get random results. If that worries you, then DON'T DO THAT THEN, for chrissake! That's not a sane model to begin with, but it's not the usage model for this, so it's another completely specious argument. So yes, I thought about the pidfd pull (which was why it happened at the very end of the merge window), and I found the arguments against it bad. Linus ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design 2019-03-25 17:45 ` Linus Torvalds @ 2019-03-25 16:14 ` Michael Tirado 2019-03-25 20:45 ` Christian Brauner 1 sibling, 0 replies; 44+ messages in thread From: Michael Tirado @ 2019-03-25 16:14 UTC (permalink / raw) To: Linus Torvalds; +Cc: LKML, Alexey Dobriyan, christian On Mon, Mar 25, 2019 at 5:45 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Fri, Mar 22, 2019 at 11:34 AM Michael Tirado <mtirado418@gmail.com> wrote: > > > > On Wed, Mar 20, 2019 at 8:08 PM Alexey Dobriyan <adobriyan@gmail.com> wrote: > > > > > > pidfd code should be backed out immediately. Forget about /proc. > > > > Seems like Torvalds just merges this sort of "stuff" without reading > > it now, or there's something that auto accepted pull request to RC tree? > > There is no auto-accept. > > But there also didn't seem to be any valid arguments against it, and > the android people had arguments for it. > Isn't Google working on their own C++ kernel now, I bet they would want to make a smooth transition to that at some point? Hopefully they don't screw up Linux in the process. > Arguing against it based on "I don't like /proc" is pointless. The > fact is, /proc is our system interface for a lot of things. > The argument was valid to me, at least the design is not set in stone just yet and there is still hope. I have an option in my namespace sandbox called "noproc", it works for many things, but if devs start relying on /proc ALWAYS being available I begin to have issues. You are all aware of the horrors of /proc, I hope. I don't want /proc so deeply entrenched in the ecosystem that I can no longer use "noproc". These sort of bold new core features need to be designed with extreme caution and awareness of the full spectra of affects. Just because something like procfs exists and can be used doesn't mean it is wise to go all-in. > Arguing against it based on "I worry about the _other_ > non-signal-sending things that could be done with this" is also > pointless. What other things? The only thing that got merged was the > signalling. > There have been "future changes" hinted through the patches lifecycle, it leads me to believe it's a gateway patch, and the pid wrapping is a minor bugfix bridge to some other undisclosed features. How could anyone know the design is right without knowing what these changes might be? pidctl/translate_pid? I am against any new systemcall that crosses namespaces by design to accomplish something that is already plummable. Seems like they want to use pidfd's as some sort of token to perform these cross namespace operations, can't wait to see how devs end up abusing this one. > So the model of using a file descriptor instead of a 'pid' for signal > handling is actually very unix-like. Maybe that's how pid's should > have worked to begin with. Remember that whole "everything is a file" > thing? > Perhaps it could be called an improvement if yall get it right because AFAIK the only way to handle wrapping today is to directly clone the PID you're worried about and deal with it immediately when the process exits before wrap can happen. But I really wonder why PID wrapping matters SO much, I bet some people are doing dangerously stupid things like using PID as a credential even though everyone knows it wraps. Maybe this can make signalling less racey somehow? At the very least you could learn the process has exited instead of blindly acting on a potentially recycled number. I recognize the value in that specifically. However, using pidfd as a token to do cross-namespace activities that are already plummable is just plain weird to me, but maybe I'm too used to doing things "the hard way". ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design 2019-03-25 17:45 ` Linus Torvalds 2019-03-25 16:14 ` Michael Tirado @ 2019-03-25 20:45 ` Christian Brauner 1 sibling, 0 replies; 44+ messages in thread From: Christian Brauner @ 2019-03-25 20:45 UTC (permalink / raw) To: Linus Torvalds; +Cc: Michael Tirado, Alexey Dobriyan, LKML On Mon, Mar 25, 2019 at 10:45:29AM -0700, Linus Torvalds wrote: > On Fri, Mar 22, 2019 at 11:34 AM Michael Tirado <mtirado418@gmail.com> wrote: > > > > On Wed, Mar 20, 2019 at 8:08 PM Alexey Dobriyan <adobriyan@gmail.com> wrote: > > > > > > pidfd code should be backed out immediately. Forget about /proc. > > > > Seems like Torvalds just merges this sort of "stuff" without reading > > it now, or there's something that auto accepted pull request to RC tree? > > There is no auto-accept. > > But there also didn't seem to be any valid arguments against it, and > the android people had arguments for it. > > Arguing against it based on "I don't like /proc" is pointless. The > fact is, /proc is our system interface for a lot of things. > > Arguing against it based on "I worry about the _other_ > non-signal-sending things that could be done with this" is also > pointless. What other things? The only thing that got merged was the > signalling. To back Linus defense up with a glimpse into the future. We will not be to rely on dirfds from proc to do general process management. That is even in the commit message for the pidfd_send_signal syscall, that we intend to decouple this from procfs, i.e. decouple process management from process metadata reading. We have an ongoing discussion and what a lot of people agree upon is that pidfds will be anon inode file descriptors that stash a reference to struct pid in their private_data member. They can be pollable if ever need be and they are just conceptually cleaner and way simpler and mirror what will happen in the new mount api as well. The idea is to translate these pidfds e.g. via a simple ioctl() interface that takes a pidfd and gives back - with standard permissions applied as are today - a corresponding /proc/<pid> fd that can be used to read metadata of a process (see the suggestion by Andy and Jann [1]). The advantage is that this means that pidfd_clone() or something similar can simply return a pidfd and does not need to care about what procfs the process is supposed to be located in/reference and is in general way safer. But there is absolutely nothing wrong with allowing users to use /proc/<pid> to signal processes. One of the reasons why I did this is that it is so intuitive to users that non-kernel people have requested this be possible over and over. As mentioned in the orignal patchset the future was always to decouple this from procfs (see the references in there) and this is what the new pidctl() syscall is for that transparently translates between the pid-based api and the pidfd-based api. [1]: https://lore.kernel.org/lkml/CAG48ez3VMjLJBC_F3BxC2sc2s-28NdsrUduR=jX66XH0w2O-Qg@mail.gmail.com/ > > Now, arguing that signalling should use the open-time credentials > might make sense, but this isn't read/write. You can't fool some suid > program to do magic randon system calls for you, and if you can, then > arguing about pidfd is kind of pointless. > > So the model of using a file descriptor instead of a 'pid' for signal > handling is actually very unix-like. Maybe that's how pid's should > have worked to begin with. Remember that whole "everything is a file" > thing? > > Now, the fact that fork() and clone() return a pid obviously means > that pidfd isn't the primary model (not to decades of just history), > but that doesn't make pidfd wrong. > > And namespace issues etc are all also kind of irrelevant. If you open > random files in /proc and randomly do pidfd_send_signal() on those, > you get random results. If that worries you, then DON'T DO THAT THEN, > for chrissake! That's not a sane model to begin with, but it's not the > usage model for this, so it's another completely specious argument. > > So yes, I thought about the pidfd pull (which was why it happened at > the very end of the merge window), and I found the arguments against > it bad. > > Linus ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design 2019-03-22 14:04 ` Michael Tirado 2019-03-25 17:45 ` Linus Torvalds @ 2019-03-25 18:50 ` Christian Brauner 1 sibling, 0 replies; 44+ messages in thread From: Christian Brauner @ 2019-03-25 18:50 UTC (permalink / raw) To: Michael Tirado; +Cc: Alexey Dobriyan, Linus Torvalds, LKML On Fri, Mar 22, 2019 at 02:04:35PM +0000, Michael Tirado wrote: > On Wed, Mar 20, 2019 at 8:08 PM Alexey Dobriyan <adobriyan@gmail.com> wrote: > > > > pidfd code should be backed out immediately. Forget about /proc. > > Seems like Torvalds just merges this sort of "stuff" without reading > it now, or there's something that auto accepted pull request to RC tree? I don't mind controversy around patchsets but I'd appreciate it that when they are started the Cc is not modified and main people who maintain the code that is critized are removed. I'd like to be able to respond without other people having to notify me that there's a discussion going on I should be involved in. Christian > > > The pull request you sent on Tue, 5 Mar 2019 18:13:01 +0100: > > > >> git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git tags/pidfd-v5.1-rc1 > > > > has been merged into torvalds/linux.git: > > https://git.kernel.org/torvalds/c/a9dce6679d736cb3d612af39bab9f31f8db66f9b > > > >Thank you! ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
@ 2019-03-16 18:57 Christian Brauner
2019-03-16 19:37 ` Suren Baghdasaryan
0 siblings, 1 reply; 44+ messages in thread
From: Christian Brauner @ 2019-03-16 18:57 UTC (permalink / raw)
To: Daniel Colascione
Cc: Suren Baghdasaryan, Joel Fernandes, Steven Rostedt,
Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman,
Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar,
Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm,
kernel-team
On Sat, Mar 16, 2019 at 11:00:10AM -0700, Daniel Colascione wrote:
> On Sat, Mar 16, 2019 at 10:31 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Fri, Mar 15, 2019 at 11:49 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > >
> > > On Fri, Mar 15, 2019 at 07:24:28PM +0100, Christian Brauner wrote:
> > > [..]
> > > > > why do we want to add a new syscall (pidfd_wait) though? Why not just use
> > > > > standard poll/epoll interface on the proc fd like Daniel was suggesting.
> > > > > AFAIK, once the proc file is opened, the struct pid is essentially pinned
> > > > > even though the proc number may be reused. Then the caller can just poll.
> > > > > We can add a waitqueue to struct pid, and wake up any waiters on process
> > > > > death (A quick look shows task_struct can be mapped to its struct pid) and
> > > > > also possibly optimize it using Steve's TIF flag idea. No new syscall is
> > > > > needed then, let me know if I missed something?
> > > >
> > > > Huh, I thought that Daniel was against the poll/epoll solution?
> > >
> > > Hmm, going through earlier threads, I believe so now. Here was Daniel's
> > > reasoning about avoiding a notification about process death through proc
> > > directory fd: http://lkml.iu.edu/hypermail/linux/kernel/1811.0/00232.html
> > >
> > > May be a dedicated syscall for this would be cleaner after all.
> >
> > Ah, I wish I've seen that discussion before...
> > syscall makes sense and it can be non-blocking and we can use
> > select/poll/epoll if we use eventfd.
>
> Thanks for taking a look.
>
> > I would strongly advocate for
> > non-blocking version or at least to have a non-blocking option.
>
> Waiting for FD readiness is *already* blocking or non-blocking
> according to the caller's desire --- users can pass options they want
> to poll(2) or whatever. There's no need for any kind of special
> configuration knob or non-blocking option. We already *have* a
> non-blocking option that works universally for everything.
>
> As I mentioned in the linked thread, waiting for process exit should
> work just like waiting for bytes to appear on a pipe. Process exit
> status is just another blob of bytes that a process might receive. A
> process exit handle ought to be just another information source. The
> reason the unix process API is so awful is that for whatever reason
> the original designers treated processes as some kind of special kind
> of resource instead of fitting them into the otherwise general-purpose
> unix data-handling API. Let's not repeat that mistake.
>
> > Something like this:
> >
> > evfd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
> > // register eventfd to receive death notification
> > pidfd_wait(pid_to_kill, evfd);
> > // kill the process
> > pidfd_send_signal(pid_to_kill, ...)
> > // tend to other things
>
> Now you've lost me. pidfd_wait should return a *new* FD, not wire up
> an eventfd.
>
> Why? Because the new type FD can report process exit *status*
> information (via read(2) after readability signal) as well as this
> binary yes-or-no signal *that* a process exited, and this capability
> is useful if you want to the pidfd interface to be a good
> general-purpose process management facility to replace the awful
> wait() family of functions. You can't get an exit status from an
> eventfd. Wiring up an eventfd the way you've proposed also complicates
> wait-causality information, complicating both tracing and any priority
> inheritance we might want in the future (because all the wakeups gets
> mixed into the eventfd and you can't unscramble an egg). And for what?
> What do we gain by using an eventfd? Is the reason that exit.c would
> be able to use eventfd_signal instead of poking a waitqueue directly?
> How is that better? With an eventfd, you've increased path length on
> process exit *and* complicated the API for no reason.
>
> > ...
> > // wait for the process to die
> > poll_wait(evfd, ...);
> >
> > This simplifies userspace
>
> Not relative to an exit handle it doesn't.
>
> >, allows it to wait for multiple events using
> > epoll
>
> So does a process exit status handle.
>
> > and I think kernel implementation will be also quite simple
> > because it already implements eventfd_signal() that takes care of
> > waitqueue handling.
>
> What if there are multiple eventfds registered for the death of a
> process? In any case, you need some mechanism to find, upon process
> death, a list of waiters, then wake each of them up. That's either a
> global search or a search in some list rooted in a task-related
> structure (either struct task or one of its friends). Using an eventfd
> here adds nothing, since upon death, you need this list search
> regardless, and as I mentioned above, eventfd-wiring just makes the
> API worse.
>
> > If pidfd_send_signal could be extended to have an optional eventfd
> > parameter then we would not even have to add a new syscall.
>
> There is nothing wrong with adding a new system call. I don't know why
> there's this idea circulating that adding system calls is something we
> should bend over backwards to avoid. It's cheap, and support-wise,
> kernel interface is kernel interface. Sending a signal has *nothing*
> to do with wiring up some kind of notification and there's no reason
> to mingle it with some kind of event registration.
I agree with Daniel.
One design goal is to not stuff clearly delinated tasks related to
process management into the same syscall. That will just leave us with a
confusing api. Sending signals is part of managing a process while it is
running. Waiting on a process to end is clearly separate from that.
It's important to keep in mind that the goal of the pidfd work is to end
up with an api that is of use to all of user space concerned with
process management not just a specific project.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android 2019-03-16 18:57 [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android Christian Brauner @ 2019-03-16 19:37 ` Suren Baghdasaryan 2019-03-17 1:53 ` Joel Fernandes 0 siblings, 1 reply; 44+ messages in thread From: Suren Baghdasaryan @ 2019-03-16 19:37 UTC (permalink / raw) To: Christian Brauner Cc: Daniel Colascione, Joel Fernandes, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm, kernel-team On Sat, Mar 16, 2019 at 11:57 AM Christian Brauner <christian@brauner.io> wrote: > > On Sat, Mar 16, 2019 at 11:00:10AM -0700, Daniel Colascione wrote: > > On Sat, Mar 16, 2019 at 10:31 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > On Fri, Mar 15, 2019 at 11:49 AM Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > > > On Fri, Mar 15, 2019 at 07:24:28PM +0100, Christian Brauner wrote: > > > > [..] > > > > > > why do we want to add a new syscall (pidfd_wait) though? Why not just use > > > > > > standard poll/epoll interface on the proc fd like Daniel was suggesting. > > > > > > AFAIK, once the proc file is opened, the struct pid is essentially pinned > > > > > > even though the proc number may be reused. Then the caller can just poll. > > > > > > We can add a waitqueue to struct pid, and wake up any waiters on process > > > > > > death (A quick look shows task_struct can be mapped to its struct pid) and > > > > > > also possibly optimize it using Steve's TIF flag idea. No new syscall is > > > > > > needed then, let me know if I missed something? > > > > > > > > > > Huh, I thought that Daniel was against the poll/epoll solution? > > > > > > > > Hmm, going through earlier threads, I believe so now. Here was Daniel's > > > > reasoning about avoiding a notification about process death through proc > > > > directory fd: http://lkml.iu.edu/hypermail/linux/kernel/1811.0/00232.html > > > > > > > > May be a dedicated syscall for this would be cleaner after all. > > > > > > Ah, I wish I've seen that discussion before... > > > syscall makes sense and it can be non-blocking and we can use > > > select/poll/epoll if we use eventfd. > > > > Thanks for taking a look. > > > > > I would strongly advocate for > > > non-blocking version or at least to have a non-blocking option. > > > > Waiting for FD readiness is *already* blocking or non-blocking > > according to the caller's desire --- users can pass options they want > > to poll(2) or whatever. There's no need for any kind of special > > configuration knob or non-blocking option. We already *have* a > > non-blocking option that works universally for everything. > > > > As I mentioned in the linked thread, waiting for process exit should > > work just like waiting for bytes to appear on a pipe. Process exit > > status is just another blob of bytes that a process might receive. A > > process exit handle ought to be just another information source. The > > reason the unix process API is so awful is that for whatever reason > > the original designers treated processes as some kind of special kind > > of resource instead of fitting them into the otherwise general-purpose > > unix data-handling API. Let's not repeat that mistake. > > > > > Something like this: > > > > > > evfd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC); > > > // register eventfd to receive death notification > > > pidfd_wait(pid_to_kill, evfd); > > > // kill the process > > > pidfd_send_signal(pid_to_kill, ...) > > > // tend to other things > > > > Now you've lost me. pidfd_wait should return a *new* FD, not wire up > > an eventfd. > > Ok, I probably misunderstood your post linked by Joel. I though your original proposal was based on being able to poll a file under /proc/pid and then you changed your mind to have a separate syscall which I assumed would be a blocking one to wait for process exit. Maybe you can describe the new interface you are thinking about in terms of userspace usage like I did above? Several lines of code would explain more than paragraphs of text. > > Why? Because the new type FD can report process exit *status* > > information (via read(2) after readability signal) as well as this > > binary yes-or-no signal *that* a process exited, and this capability > > is useful if you want to the pidfd interface to be a good > > general-purpose process management facility to replace the awful > > wait() family of functions. You can't get an exit status from an > > eventfd. Wiring up an eventfd the way you've proposed also complicates > > wait-causality information, complicating both tracing and any priority > > inheritance we might want in the future (because all the wakeups gets > > mixed into the eventfd and you can't unscramble an egg). And for what? > > What do we gain by using an eventfd? Is the reason that exit.c would > > be able to use eventfd_signal instead of poking a waitqueue directly? > > How is that better? With an eventfd, you've increased path length on > > process exit *and* complicated the API for no reason. > > > > > ... > > > // wait for the process to die > > > poll_wait(evfd, ...); > > > > > > This simplifies userspace > > > > Not relative to an exit handle it doesn't. > > > > >, allows it to wait for multiple events using > > > epoll > > > > So does a process exit status handle. > > > > > and I think kernel implementation will be also quite simple > > > because it already implements eventfd_signal() that takes care of > > > waitqueue handling. > > > > What if there are multiple eventfds registered for the death of a > > process? In any case, you need some mechanism to find, upon process > > death, a list of waiters, then wake each of them up. That's either a > > global search or a search in some list rooted in a task-related > > structure (either struct task or one of its friends). Using an eventfd > > here adds nothing, since upon death, you need this list search > > regardless, and as I mentioned above, eventfd-wiring just makes the > > API worse. > > > > > If pidfd_send_signal could be extended to have an optional eventfd > > > parameter then we would not even have to add a new syscall. > > > > There is nothing wrong with adding a new system call. I don't know why > > there's this idea circulating that adding system calls is something we > > should bend over backwards to avoid. It's cheap, and support-wise, > > kernel interface is kernel interface. Sending a signal has *nothing* > > to do with wiring up some kind of notification and there's no reason > > to mingle it with some kind of event registration. > > > I agree with Daniel. > One design goal is to not stuff clearly delinated tasks related to > process management into the same syscall. That will just leave us with a > confusing api. Sending signals is part of managing a process while it is > running. Waiting on a process to end is clearly separate from that. > It's important to keep in mind that the goal of the pidfd work is to end > up with an api that is of use to all of user space concerned with > process management not just a specific project. I'm not bent on adding or not adding a new syscall as long as functionality is there. Thanks! ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android 2019-03-16 19:37 ` Suren Baghdasaryan @ 2019-03-17 1:53 ` Joel Fernandes 2019-03-17 11:42 ` Christian Brauner 0 siblings, 1 reply; 44+ messages in thread From: Joel Fernandes @ 2019-03-17 1:53 UTC (permalink / raw) To: Suren Baghdasaryan Cc: Christian Brauner, Daniel Colascione, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm, kernel-team On Sat, Mar 16, 2019 at 12:37:18PM -0700, Suren Baghdasaryan wrote: > On Sat, Mar 16, 2019 at 11:57 AM Christian Brauner <christian@brauner.io> wrote: > > > > On Sat, Mar 16, 2019 at 11:00:10AM -0700, Daniel Colascione wrote: > > > On Sat, Mar 16, 2019 at 10:31 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > On Fri, Mar 15, 2019 at 11:49 AM Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > > > > > On Fri, Mar 15, 2019 at 07:24:28PM +0100, Christian Brauner wrote: > > > > > [..] > > > > > > > why do we want to add a new syscall (pidfd_wait) though? Why not just use > > > > > > > standard poll/epoll interface on the proc fd like Daniel was suggesting. > > > > > > > AFAIK, once the proc file is opened, the struct pid is essentially pinned > > > > > > > even though the proc number may be reused. Then the caller can just poll. > > > > > > > We can add a waitqueue to struct pid, and wake up any waiters on process > > > > > > > death (A quick look shows task_struct can be mapped to its struct pid) and > > > > > > > also possibly optimize it using Steve's TIF flag idea. No new syscall is > > > > > > > needed then, let me know if I missed something? > > > > > > > > > > > > Huh, I thought that Daniel was against the poll/epoll solution? > > > > > > > > > > Hmm, going through earlier threads, I believe so now. Here was Daniel's > > > > > reasoning about avoiding a notification about process death through proc > > > > > directory fd: http://lkml.iu.edu/hypermail/linux/kernel/1811.0/00232.html > > > > > > > > > > May be a dedicated syscall for this would be cleaner after all. > > > > > > > > Ah, I wish I've seen that discussion before... > > > > syscall makes sense and it can be non-blocking and we can use > > > > select/poll/epoll if we use eventfd. > > > > > > Thanks for taking a look. > > > > > > > I would strongly advocate for > > > > non-blocking version or at least to have a non-blocking option. > > > > > > Waiting for FD readiness is *already* blocking or non-blocking > > > according to the caller's desire --- users can pass options they want > > > to poll(2) or whatever. There's no need for any kind of special > > > configuration knob or non-blocking option. We already *have* a > > > non-blocking option that works universally for everything. > > > > > > As I mentioned in the linked thread, waiting for process exit should > > > work just like waiting for bytes to appear on a pipe. Process exit > > > status is just another blob of bytes that a process might receive. A > > > process exit handle ought to be just another information source. The > > > reason the unix process API is so awful is that for whatever reason > > > the original designers treated processes as some kind of special kind > > > of resource instead of fitting them into the otherwise general-purpose > > > unix data-handling API. Let's not repeat that mistake. > > > > > > > Something like this: > > > > > > > > evfd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC); > > > > // register eventfd to receive death notification > > > > pidfd_wait(pid_to_kill, evfd); > > > > // kill the process > > > > pidfd_send_signal(pid_to_kill, ...) > > > > // tend to other things > > > > > > Now you've lost me. pidfd_wait should return a *new* FD, not wire up > > > an eventfd. > > > > > Ok, I probably misunderstood your post linked by Joel. I though your > original proposal was based on being able to poll a file under > /proc/pid and then you changed your mind to have a separate syscall > which I assumed would be a blocking one to wait for process exit. > Maybe you can describe the new interface you are thinking about in > terms of userspace usage like I did above? Several lines of code would > explain more than paragraphs of text. Hey, Thanks Suren for the eventfd idea. I agree with Daniel on this. The idea from Daniel here is to wait for process death and exit events by just referring to a stable fd, independent of whatever is going on in /proc. What is needed is something like this (in highly pseudo-code form): pidfd = opendir("/proc/<pid>",..); wait_fd = pidfd_wait(pidfd); read or poll wait_fd (non-blocking or blocking whichever) wait_fd will block until the task has either died or reaped. In both these cases, it can return a suitable string such as "dead" or "reaped" although an integer with some predefined meaning is also Ok. What that guarantees is, even if the task's PID has been reused, or the task has already died or already died + reaped, all of these events cannot race with the code above and the information passed to the user is race-free and stable / guaranteed. An eventfd seems to not fit well, because AFAICS passing the raw PID to eventfd as in your example would still race since the PID could have been reused by another process by the time the eventfd is created. Also Andy's idea in [1] seems to use poll flags to communicate various tihngs which is still not as explicit about the PID's status so that's a poor API choice compared to the explicit syscall. I am planning to work on a prototype patch based on Daniel's idea and post something soon (chatted with Daniel about it and will reference him in the posting as well), during this posting I will also summarize all the previous discussions and come up with some tests as well. I hope to have something soon. Let me know if I hit all the points correctly and I hope we are all on the same page. Thanks! - Joel [1] http://lkml.iu.edu/hypermail//linux/kernel/1212.0/00808.html ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android 2019-03-17 1:53 ` Joel Fernandes @ 2019-03-17 11:42 ` Christian Brauner 2019-03-17 15:40 ` Daniel Colascione 0 siblings, 1 reply; 44+ messages in thread From: Christian Brauner @ 2019-03-17 11:42 UTC (permalink / raw) To: Joel Fernandes Cc: Suren Baghdasaryan, Daniel Colascione, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm, kernel-team, oleg, luto, serge On Sat, Mar 16, 2019 at 09:53:06PM -0400, Joel Fernandes wrote: > On Sat, Mar 16, 2019 at 12:37:18PM -0700, Suren Baghdasaryan wrote: > > On Sat, Mar 16, 2019 at 11:57 AM Christian Brauner <christian@brauner.io> wrote: > > > > > > On Sat, Mar 16, 2019 at 11:00:10AM -0700, Daniel Colascione wrote: > > > > On Sat, Mar 16, 2019 at 10:31 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > > > On Fri, Mar 15, 2019 at 11:49 AM Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > > > > > > > On Fri, Mar 15, 2019 at 07:24:28PM +0100, Christian Brauner wrote: > > > > > > [..] > > > > > > > > why do we want to add a new syscall (pidfd_wait) though? Why not just use > > > > > > > > standard poll/epoll interface on the proc fd like Daniel was suggesting. > > > > > > > > AFAIK, once the proc file is opened, the struct pid is essentially pinned > > > > > > > > even though the proc number may be reused. Then the caller can just poll. > > > > > > > > We can add a waitqueue to struct pid, and wake up any waiters on process > > > > > > > > death (A quick look shows task_struct can be mapped to its struct pid) and > > > > > > > > also possibly optimize it using Steve's TIF flag idea. No new syscall is > > > > > > > > needed then, let me know if I missed something? > > > > > > > > > > > > > > Huh, I thought that Daniel was against the poll/epoll solution? > > > > > > > > > > > > Hmm, going through earlier threads, I believe so now. Here was Daniel's > > > > > > reasoning about avoiding a notification about process death through proc > > > > > > directory fd: http://lkml.iu.edu/hypermail/linux/kernel/1811.0/00232.html > > > > > > > > > > > > May be a dedicated syscall for this would be cleaner after all. > > > > > > > > > > Ah, I wish I've seen that discussion before... > > > > > syscall makes sense and it can be non-blocking and we can use > > > > > select/poll/epoll if we use eventfd. > > > > > > > > Thanks for taking a look. > > > > > > > > > I would strongly advocate for > > > > > non-blocking version or at least to have a non-blocking option. > > > > > > > > Waiting for FD readiness is *already* blocking or non-blocking > > > > according to the caller's desire --- users can pass options they want > > > > to poll(2) or whatever. There's no need for any kind of special > > > > configuration knob or non-blocking option. We already *have* a > > > > non-blocking option that works universally for everything. > > > > > > > > As I mentioned in the linked thread, waiting for process exit should > > > > work just like waiting for bytes to appear on a pipe. Process exit > > > > status is just another blob of bytes that a process might receive. A > > > > process exit handle ought to be just another information source. The > > > > reason the unix process API is so awful is that for whatever reason > > > > the original designers treated processes as some kind of special kind > > > > of resource instead of fitting them into the otherwise general-purpose > > > > unix data-handling API. Let's not repeat that mistake. > > > > > > > > > Something like this: > > > > > > > > > > evfd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC); > > > > > // register eventfd to receive death notification > > > > > pidfd_wait(pid_to_kill, evfd); > > > > > // kill the process > > > > > pidfd_send_signal(pid_to_kill, ...) > > > > > // tend to other things > > > > > > > > Now you've lost me. pidfd_wait should return a *new* FD, not wire up > > > > an eventfd. > > > > > > > > Ok, I probably misunderstood your post linked by Joel. I though your > > original proposal was based on being able to poll a file under > > /proc/pid and then you changed your mind to have a separate syscall > > which I assumed would be a blocking one to wait for process exit. > > Maybe you can describe the new interface you are thinking about in > > terms of userspace usage like I did above? Several lines of code would > > explain more than paragraphs of text. > > Hey, Thanks Suren for the eventfd idea. I agree with Daniel on this. The idea > from Daniel here is to wait for process death and exit events by just > referring to a stable fd, independent of whatever is going on in /proc. > > What is needed is something like this (in highly pseudo-code form): > > pidfd = opendir("/proc/<pid>",..); > wait_fd = pidfd_wait(pidfd); > read or poll wait_fd (non-blocking or blocking whichever) > > wait_fd will block until the task has either died or reaped. In both these > cases, it can return a suitable string such as "dead" or "reaped" although an > integer with some predefined meaning is also Ok. > > What that guarantees is, even if the task's PID has been reused, or the task > has already died or already died + reaped, all of these events cannot race > with the code above and the information passed to the user is race-free and > stable / guaranteed. > > An eventfd seems to not fit well, because AFAICS passing the raw PID to > eventfd as in your example would still race since the PID could have been > reused by another process by the time the eventfd is created. > > Also Andy's idea in [1] seems to use poll flags to communicate various tihngs > which is still not as explicit about the PID's status so that's a poor API > choice compared to the explicit syscall. > > I am planning to work on a prototype patch based on Daniel's idea and post something > soon (chatted with Daniel about it and will reference him in the posting as > well), during this posting I will also summarize all the previous discussions > and come up with some tests as well. I hope to have something soon. Having pidfd_wait() return another fd will make the syscall harder to swallow for a lot of people I reckon. What exactly prevents us from making the pidfd itself readable/pollable for the exit staus? They are "special" fds anyway. I would really like to avoid polluting the api with multiple different types of fds if possible. ret = pidfd_wait(pidfd); read or poll pidfd (Note that I'm traveling so my responses might be delayed quite a bit.) (Ccing a few people that might have an opinion here.) Christian ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android 2019-03-17 11:42 ` Christian Brauner @ 2019-03-17 15:40 ` Daniel Colascione 2019-03-18 0:29 ` Christian Brauner 0 siblings, 1 reply; 44+ messages in thread From: Daniel Colascione @ 2019-03-17 15:40 UTC (permalink / raw) To: Christian Brauner Cc: Joel Fernandes, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm, kernel-team, Oleg Nesterov, Andy Lutomirski, Serge E. Hallyn On Sun, Mar 17, 2019 at 4:42 AM Christian Brauner <christian@brauner.io> wrote: > > On Sat, Mar 16, 2019 at 09:53:06PM -0400, Joel Fernandes wrote: > > On Sat, Mar 16, 2019 at 12:37:18PM -0700, Suren Baghdasaryan wrote: > > > On Sat, Mar 16, 2019 at 11:57 AM Christian Brauner <christian@brauner.io> wrote: > > > > > > > > On Sat, Mar 16, 2019 at 11:00:10AM -0700, Daniel Colascione wrote: > > > > > On Sat, Mar 16, 2019 at 10:31 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > > > > > On Fri, Mar 15, 2019 at 11:49 AM Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > > > > > > > > > On Fri, Mar 15, 2019 at 07:24:28PM +0100, Christian Brauner wrote: > > > > > > > [..] > > > > > > > > > why do we want to add a new syscall (pidfd_wait) though? Why not just use > > > > > > > > > standard poll/epoll interface on the proc fd like Daniel was suggesting. > > > > > > > > > AFAIK, once the proc file is opened, the struct pid is essentially pinned > > > > > > > > > even though the proc number may be reused. Then the caller can just poll. > > > > > > > > > We can add a waitqueue to struct pid, and wake up any waiters on process > > > > > > > > > death (A quick look shows task_struct can be mapped to its struct pid) and > > > > > > > > > also possibly optimize it using Steve's TIF flag idea. No new syscall is > > > > > > > > > needed then, let me know if I missed something? > > > > > > > > > > > > > > > > Huh, I thought that Daniel was against the poll/epoll solution? > > > > > > > > > > > > > > Hmm, going through earlier threads, I believe so now. Here was Daniel's > > > > > > > reasoning about avoiding a notification about process death through proc > > > > > > > directory fd: http://lkml.iu.edu/hypermail/linux/kernel/1811.0/00232.html > > > > > > > > > > > > > > May be a dedicated syscall for this would be cleaner after all. > > > > > > > > > > > > Ah, I wish I've seen that discussion before... > > > > > > syscall makes sense and it can be non-blocking and we can use > > > > > > select/poll/epoll if we use eventfd. > > > > > > > > > > Thanks for taking a look. > > > > > > > > > > > I would strongly advocate for > > > > > > non-blocking version or at least to have a non-blocking option. > > > > > > > > > > Waiting for FD readiness is *already* blocking or non-blocking > > > > > according to the caller's desire --- users can pass options they want > > > > > to poll(2) or whatever. There's no need for any kind of special > > > > > configuration knob or non-blocking option. We already *have* a > > > > > non-blocking option that works universally for everything. > > > > > > > > > > As I mentioned in the linked thread, waiting for process exit should > > > > > work just like waiting for bytes to appear on a pipe. Process exit > > > > > status is just another blob of bytes that a process might receive. A > > > > > process exit handle ought to be just another information source. The > > > > > reason the unix process API is so awful is that for whatever reason > > > > > the original designers treated processes as some kind of special kind > > > > > of resource instead of fitting them into the otherwise general-purpose > > > > > unix data-handling API. Let's not repeat that mistake. > > > > > > > > > > > Something like this: > > > > > > > > > > > > evfd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC); > > > > > > // register eventfd to receive death notification > > > > > > pidfd_wait(pid_to_kill, evfd); > > > > > > // kill the process > > > > > > pidfd_send_signal(pid_to_kill, ...) > > > > > > // tend to other things > > > > > > > > > > Now you've lost me. pidfd_wait should return a *new* FD, not wire up > > > > > an eventfd. > > > > > > > > > > > Ok, I probably misunderstood your post linked by Joel. I though your > > > original proposal was based on being able to poll a file under > > > /proc/pid and then you changed your mind to have a separate syscall > > > which I assumed would be a blocking one to wait for process exit. > > > Maybe you can describe the new interface you are thinking about in > > > terms of userspace usage like I did above? Several lines of code would > > > explain more than paragraphs of text. > > > > Hey, Thanks Suren for the eventfd idea. I agree with Daniel on this. The idea > > from Daniel here is to wait for process death and exit events by just > > referring to a stable fd, independent of whatever is going on in /proc. > > > > What is needed is something like this (in highly pseudo-code form): > > > > pidfd = opendir("/proc/<pid>",..); > > wait_fd = pidfd_wait(pidfd); > > read or poll wait_fd (non-blocking or blocking whichever) > > > > wait_fd will block until the task has either died or reaped. In both these > > cases, it can return a suitable string such as "dead" or "reaped" although an > > integer with some predefined meaning is also Ok. I want to return a siginfo_t: we already use this structure in other contexts to report exit status. > > What that guarantees is, even if the task's PID has been reused, or the task > > has already died or already died + reaped, all of these events cannot race > > with the code above and the information passed to the user is race-free and > > stable / guaranteed. > > > > An eventfd seems to not fit well, because AFAICS passing the raw PID to > > eventfd as in your example would still race since the PID could have been > > reused by another process by the time the eventfd is created. > > > > Also Andy's idea in [1] seems to use poll flags to communicate various tihngs > > which is still not as explicit about the PID's status so that's a poor API > > choice compared to the explicit syscall. > > > > I am planning to work on a prototype patch based on Daniel's idea and post something > > soon (chatted with Daniel about it and will reference him in the posting as > > well), during this posting I will also summarize all the previous discussions > > and come up with some tests as well. I hope to have something soon. Thanks. > Having pidfd_wait() return another fd will make the syscall harder to > swallow for a lot of people I reckon. > What exactly prevents us from making the pidfd itself readable/pollable > for the exit staus? They are "special" fds anyway. I would really like > to avoid polluting the api with multiple different types of fds if possible. If pidfds had been their own file type, I'd agree with you. But pidfds are directories, which means that we're beholden to make them behave like directories normally do. I'd rather introduce another FD than heavily overload the semantics of a directory FD in one particular context. In no other circumstances are directory FDs also weird IO-data sources. Our providing a facility to get a new FD to which we *can* give pipe-like behavior does no harm and *usage* cleaner and easier to reason about. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android 2019-03-17 15:40 ` Daniel Colascione @ 2019-03-18 0:29 ` Christian Brauner 2019-03-18 23:50 ` Joel Fernandes 0 siblings, 1 reply; 44+ messages in thread From: Christian Brauner @ 2019-03-18 0:29 UTC (permalink / raw) To: Daniel Colascione Cc: Joel Fernandes, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm, kernel-team, Oleg Nesterov, Andy Lutomirski, Serge E. Hallyn On Sun, Mar 17, 2019 at 08:40:19AM -0700, Daniel Colascione wrote: > On Sun, Mar 17, 2019 at 4:42 AM Christian Brauner <christian@brauner.io> wrote: > > > > On Sat, Mar 16, 2019 at 09:53:06PM -0400, Joel Fernandes wrote: > > > On Sat, Mar 16, 2019 at 12:37:18PM -0700, Suren Baghdasaryan wrote: > > > > On Sat, Mar 16, 2019 at 11:57 AM Christian Brauner <christian@brauner.io> wrote: > > > > > > > > > > On Sat, Mar 16, 2019 at 11:00:10AM -0700, Daniel Colascione wrote: > > > > > > On Sat, Mar 16, 2019 at 10:31 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > > > > > > > On Fri, Mar 15, 2019 at 11:49 AM Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > > > > > > > > > > > On Fri, Mar 15, 2019 at 07:24:28PM +0100, Christian Brauner wrote: > > > > > > > > [..] > > > > > > > > > > why do we want to add a new syscall (pidfd_wait) though? Why not just use > > > > > > > > > > standard poll/epoll interface on the proc fd like Daniel was suggesting. > > > > > > > > > > AFAIK, once the proc file is opened, the struct pid is essentially pinned > > > > > > > > > > even though the proc number may be reused. Then the caller can just poll. > > > > > > > > > > We can add a waitqueue to struct pid, and wake up any waiters on process > > > > > > > > > > death (A quick look shows task_struct can be mapped to its struct pid) and > > > > > > > > > > also possibly optimize it using Steve's TIF flag idea. No new syscall is > > > > > > > > > > needed then, let me know if I missed something? > > > > > > > > > > > > > > > > > > Huh, I thought that Daniel was against the poll/epoll solution? > > > > > > > > > > > > > > > > Hmm, going through earlier threads, I believe so now. Here was Daniel's > > > > > > > > reasoning about avoiding a notification about process death through proc > > > > > > > > directory fd: http://lkml.iu.edu/hypermail/linux/kernel/1811.0/00232.html > > > > > > > > > > > > > > > > May be a dedicated syscall for this would be cleaner after all. > > > > > > > > > > > > > > Ah, I wish I've seen that discussion before... > > > > > > > syscall makes sense and it can be non-blocking and we can use > > > > > > > select/poll/epoll if we use eventfd. > > > > > > > > > > > > Thanks for taking a look. > > > > > > > > > > > > > I would strongly advocate for > > > > > > > non-blocking version or at least to have a non-blocking option. > > > > > > > > > > > > Waiting for FD readiness is *already* blocking or non-blocking > > > > > > according to the caller's desire --- users can pass options they want > > > > > > to poll(2) or whatever. There's no need for any kind of special > > > > > > configuration knob or non-blocking option. We already *have* a > > > > > > non-blocking option that works universally for everything. > > > > > > > > > > > > As I mentioned in the linked thread, waiting for process exit should > > > > > > work just like waiting for bytes to appear on a pipe. Process exit > > > > > > status is just another blob of bytes that a process might receive. A > > > > > > process exit handle ought to be just another information source. The > > > > > > reason the unix process API is so awful is that for whatever reason > > > > > > the original designers treated processes as some kind of special kind > > > > > > of resource instead of fitting them into the otherwise general-purpose > > > > > > unix data-handling API. Let's not repeat that mistake. > > > > > > > > > > > > > Something like this: > > > > > > > > > > > > > > evfd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC); > > > > > > > // register eventfd to receive death notification > > > > > > > pidfd_wait(pid_to_kill, evfd); > > > > > > > // kill the process > > > > > > > pidfd_send_signal(pid_to_kill, ...) > > > > > > > // tend to other things > > > > > > > > > > > > Now you've lost me. pidfd_wait should return a *new* FD, not wire up > > > > > > an eventfd. > > > > > > > > > > > > > > Ok, I probably misunderstood your post linked by Joel. I though your > > > > original proposal was based on being able to poll a file under > > > > /proc/pid and then you changed your mind to have a separate syscall > > > > which I assumed would be a blocking one to wait for process exit. > > > > Maybe you can describe the new interface you are thinking about in > > > > terms of userspace usage like I did above? Several lines of code would > > > > explain more than paragraphs of text. > > > > > > Hey, Thanks Suren for the eventfd idea. I agree with Daniel on this. The idea > > > from Daniel here is to wait for process death and exit events by just > > > referring to a stable fd, independent of whatever is going on in /proc. > > > > > > What is needed is something like this (in highly pseudo-code form): > > > > > > pidfd = opendir("/proc/<pid>",..); > > > wait_fd = pidfd_wait(pidfd); > > > read or poll wait_fd (non-blocking or blocking whichever) > > > > > > wait_fd will block until the task has either died or reaped. In both these > > > cases, it can return a suitable string such as "dead" or "reaped" although an > > > integer with some predefined meaning is also Ok. > > I want to return a siginfo_t: we already use this structure in other > contexts to report exit status. > > > > What that guarantees is, even if the task's PID has been reused, or the task > > > has already died or already died + reaped, all of these events cannot race > > > with the code above and the information passed to the user is race-free and > > > stable / guaranteed. > > > > > > An eventfd seems to not fit well, because AFAICS passing the raw PID to > > > eventfd as in your example would still race since the PID could have been > > > reused by another process by the time the eventfd is created. > > > > > > Also Andy's idea in [1] seems to use poll flags to communicate various tihngs > > > which is still not as explicit about the PID's status so that's a poor API > > > choice compared to the explicit syscall. > > > > > > I am planning to work on a prototype patch based on Daniel's idea and post something > > > soon (chatted with Daniel about it and will reference him in the posting as > > > well), during this posting I will also summarize all the previous discussions > > > and come up with some tests as well. I hope to have something soon. > > Thanks. > > > Having pidfd_wait() return another fd will make the syscall harder to > > swallow for a lot of people I reckon. > > What exactly prevents us from making the pidfd itself readable/pollable > > for the exit staus? They are "special" fds anyway. I would really like > > to avoid polluting the api with multiple different types of fds if possible. > > If pidfds had been their own file type, I'd agree with you. But pidfds > are directories, which means that we're beholden to make them behave > like directories normally do. I'd rather introduce another FD than > heavily overload the semantics of a directory FD in one particular > context. In no other circumstances are directory FDs also weird > IO-data sources. Our providing a facility to get a new FD to which we > *can* give pipe-like behavior does no harm and *usage* cleaner and > easier to reason about. I have two things I'm currently working on: - hijacking translate_pid() - pidfd_clone() essentially My first goal is to talk to Eric about taking the translate_pid() syscall that has been sitting in his tree and expanding it. translate_pid() currently allows you to either get an fd for the pid namespace a pid resides in or the pid number of a given process in another pid namespace relative to a passed in pid namespace fd. I would like to make it possible for this syscall to also give us back pidfds. One question I'm currently struggling with is exactly what you said above: what type of file descriptor these are going to give back to us. It seems that a regular file instead of directory would make the most sense and would lead to a nicer API and I'm very much leaning towards that. Christian ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android 2019-03-18 0:29 ` Christian Brauner @ 2019-03-18 23:50 ` Joel Fernandes 2019-03-19 22:14 ` Christian Brauner 0 siblings, 1 reply; 44+ messages in thread From: Joel Fernandes @ 2019-03-18 23:50 UTC (permalink / raw) To: Christian Brauner Cc: Daniel Colascione, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm, kernel-team, Oleg Nesterov, Andy Lutomirski, Serge E. Hallyn On Mon, Mar 18, 2019 at 01:29:51AM +0100, Christian Brauner wrote: > On Sun, Mar 17, 2019 at 08:40:19AM -0700, Daniel Colascione wrote: > > On Sun, Mar 17, 2019 at 4:42 AM Christian Brauner <christian@brauner.io> wrote: > > > > > > On Sat, Mar 16, 2019 at 09:53:06PM -0400, Joel Fernandes wrote: > > > > On Sat, Mar 16, 2019 at 12:37:18PM -0700, Suren Baghdasaryan wrote: > > > > > On Sat, Mar 16, 2019 at 11:57 AM Christian Brauner <christian@brauner.io> wrote: > > > > > > > > > > > > On Sat, Mar 16, 2019 at 11:00:10AM -0700, Daniel Colascione wrote: > > > > > > > On Sat, Mar 16, 2019 at 10:31 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > > > > > > > > > On Fri, Mar 15, 2019 at 11:49 AM Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > > > > > > > > > > > > > On Fri, Mar 15, 2019 at 07:24:28PM +0100, Christian Brauner wrote: > > > > > > > > > [..] > > > > > > > > > > > why do we want to add a new syscall (pidfd_wait) though? Why not just use > > > > > > > > > > > standard poll/epoll interface on the proc fd like Daniel was suggesting. > > > > > > > > > > > AFAIK, once the proc file is opened, the struct pid is essentially pinned > > > > > > > > > > > even though the proc number may be reused. Then the caller can just poll. > > > > > > > > > > > We can add a waitqueue to struct pid, and wake up any waiters on process > > > > > > > > > > > death (A quick look shows task_struct can be mapped to its struct pid) and > > > > > > > > > > > also possibly optimize it using Steve's TIF flag idea. No new syscall is > > > > > > > > > > > needed then, let me know if I missed something? > > > > > > > > > > > > > > > > > > > > Huh, I thought that Daniel was against the poll/epoll solution? > > > > > > > > > > > > > > > > > > Hmm, going through earlier threads, I believe so now. Here was Daniel's > > > > > > > > > reasoning about avoiding a notification about process death through proc > > > > > > > > > directory fd: http://lkml.iu.edu/hypermail/linux/kernel/1811.0/00232.html > > > > > > > > > > > > > > > > > > May be a dedicated syscall for this would be cleaner after all. > > > > > > > > > > > > > > > > Ah, I wish I've seen that discussion before... > > > > > > > > syscall makes sense and it can be non-blocking and we can use > > > > > > > > select/poll/epoll if we use eventfd. > > > > > > > > > > > > > > Thanks for taking a look. > > > > > > > > > > > > > > > I would strongly advocate for > > > > > > > > non-blocking version or at least to have a non-blocking option. > > > > > > > > > > > > > > Waiting for FD readiness is *already* blocking or non-blocking > > > > > > > according to the caller's desire --- users can pass options they want > > > > > > > to poll(2) or whatever. There's no need for any kind of special > > > > > > > configuration knob or non-blocking option. We already *have* a > > > > > > > non-blocking option that works universally for everything. > > > > > > > > > > > > > > As I mentioned in the linked thread, waiting for process exit should > > > > > > > work just like waiting for bytes to appear on a pipe. Process exit > > > > > > > status is just another blob of bytes that a process might receive. A > > > > > > > process exit handle ought to be just another information source. The > > > > > > > reason the unix process API is so awful is that for whatever reason > > > > > > > the original designers treated processes as some kind of special kind > > > > > > > of resource instead of fitting them into the otherwise general-purpose > > > > > > > unix data-handling API. Let's not repeat that mistake. > > > > > > > > > > > > > > > Something like this: > > > > > > > > > > > > > > > > evfd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC); > > > > > > > > // register eventfd to receive death notification > > > > > > > > pidfd_wait(pid_to_kill, evfd); > > > > > > > > // kill the process > > > > > > > > pidfd_send_signal(pid_to_kill, ...) > > > > > > > > // tend to other things > > > > > > > > > > > > > > Now you've lost me. pidfd_wait should return a *new* FD, not wire up > > > > > > > an eventfd. > > > > > > > > > > > > > > > > > Ok, I probably misunderstood your post linked by Joel. I though your > > > > > original proposal was based on being able to poll a file under > > > > > /proc/pid and then you changed your mind to have a separate syscall > > > > > which I assumed would be a blocking one to wait for process exit. > > > > > Maybe you can describe the new interface you are thinking about in > > > > > terms of userspace usage like I did above? Several lines of code would > > > > > explain more than paragraphs of text. > > > > > > > > Hey, Thanks Suren for the eventfd idea. I agree with Daniel on this. The idea > > > > from Daniel here is to wait for process death and exit events by just > > > > referring to a stable fd, independent of whatever is going on in /proc. > > > > > > > > What is needed is something like this (in highly pseudo-code form): > > > > > > > > pidfd = opendir("/proc/<pid>",..); > > > > wait_fd = pidfd_wait(pidfd); > > > > read or poll wait_fd (non-blocking or blocking whichever) > > > > > > > > wait_fd will block until the task has either died or reaped. In both these > > > > cases, it can return a suitable string such as "dead" or "reaped" although an > > > > integer with some predefined meaning is also Ok. > > > > I want to return a siginfo_t: we already use this structure in other > > contexts to report exit status. > > Fine with me. I did a prototype (code is below) as a string but I can change that to siginfo_t in the future. > > > Having pidfd_wait() return another fd will make the syscall harder to > > > swallow for a lot of people I reckon. > > > What exactly prevents us from making the pidfd itself readable/pollable > > > for the exit staus? They are "special" fds anyway. I would really like > > > to avoid polluting the api with multiple different types of fds if possible. > > > > If pidfds had been their own file type, I'd agree with you. But pidfds > > are directories, which means that we're beholden to make them behave > > like directories normally do. I'd rather introduce another FD than > > heavily overload the semantics of a directory FD in one particular > > context. In no other circumstances are directory FDs also weird > > IO-data sources. Our providing a facility to get a new FD to which we > > *can* give pipe-like behavior does no harm and *usage* cleaner and > > easier to reason about. > > I have two things I'm currently working on: > - hijacking translate_pid() > - pidfd_clone() essentially > > My first goal is to talk to Eric about taking the translate_pid() > syscall that has been sitting in his tree and expanding it. > translate_pid() currently allows you to either get an fd for the pid > namespace a pid resides in or the pid number of a given process in > another pid namespace relative to a passed in pid namespace fd. That's good to know. More comments below: > I would > like to make it possible for this syscall to also give us back pidfds. > One question I'm currently struggling with is exactly what you said > above: what type of file descriptor these are going to give back to us. > It seems that a regular file instead of directory would make the most > sense and would lead to a nicer API and I'm very much leaning towards > that. How about something like the following? We can plumb the new file as a pseudo file that is invisible and linked to the fd. This is extremely rough (does not do error handling, synchronizatoin etc) but just wanted to share the idea of what the "frontend" could look like. It is also missing all the actual pid status messages. It just takes care of the creating new fd from the pidfd part and providing file read ops returning the "status" string. It is also written in signal.c and should likely go into proc fs files under fs. Appreciate any suggestions (a test program did prove it works). Also, I was able to translate a pidfd to a pid_namespace by referring to some existing code but perhaps you may be able to suggest something better for such translation.. ---8<----------------------- From: Joel Fernandes <joelaf@google.com> Subject: [PATCH] Partial skeleton prototype of pidfd_wait frontend Signed-off-by: Joel Fernandes <joelaf@google.com> --- arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + include/linux/syscalls.h | 1 + include/uapi/asm-generic/unistd.h | 4 +- kernel/signal.c | 62 ++++++++++++++++++++++++++ kernel/sys_ni.c | 3 ++ 6 files changed, 71 insertions(+), 1 deletion(-) diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 1f9607ed087c..2a63f1896b63 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -433,3 +433,4 @@ 425 i386 io_uring_setup sys_io_uring_setup __ia32_sys_io_uring_setup 426 i386 io_uring_enter sys_io_uring_enter __ia32_sys_io_uring_enter 427 i386 io_uring_register sys_io_uring_register __ia32_sys_io_uring_register +428 i386 pidfd_wait sys_pidfd_wait __ia32_sys_pidfd_wait diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index 92ee0b4378d4..cf2e08a8053b 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -349,6 +349,7 @@ 425 common io_uring_setup __x64_sys_io_uring_setup 426 common io_uring_enter __x64_sys_io_uring_enter 427 common io_uring_register __x64_sys_io_uring_register +428 common pidfd_wait __x64_sys_pidfd_wait # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index e446806a561f..62160970ed3f 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -988,6 +988,7 @@ asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len, asmlinkage long sys_pidfd_send_signal(int pidfd, int sig, siginfo_t __user *info, unsigned int flags); +asmlinkage long sys_pidfd_wait(int pidfd); /* * Architecture-specific system calls diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index dee7292e1df6..137aa8662230 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -832,9 +832,11 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup) __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter) #define __NR_io_uring_register 427 __SYSCALL(__NR_io_uring_register, sys_io_uring_register) +#define __NR_pidfd_wait 428 +__SYSCALL(__NR_pidfd_wait, sys_pidfd_wait) #undef __NR_syscalls -#define __NR_syscalls 428 +#define __NR_syscalls 429 /* * 32 bit systems traditionally used different diff --git a/kernel/signal.c b/kernel/signal.c index b7953934aa99..ebb550b87044 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -3550,6 +3550,68 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info) return copy_siginfo_from_user(kinfo, info); } +static ssize_t pidfd_wait_read_iter(struct kiocb *iocb, struct iov_iter *to) +{ + /* + * This is just a test string, it will contain the actual + * status of the pidfd in the future. + */ + char buf[] = "status"; + + return copy_to_iter(buf, strlen(buf)+1, to); +} + +static const struct file_operations pidfd_wait_file_ops = { + .read_iter = pidfd_wait_read_iter, +}; + +static struct inode *pidfd_wait_get_inode(struct super_block *sb) +{ + struct inode *inode = new_inode(sb); + + inode->i_ino = get_next_ino(); + inode_init_owner(inode, NULL, S_IFREG); + + inode->i_op = &simple_dir_inode_operations; + inode->i_fop = &pidfd_wait_file_ops; + + return inode; +} + +SYSCALL_DEFINE1(pidfd_wait, int, pidfd) +{ + struct fd f; + struct inode *inode; + struct file *file; + int new_fd; + struct pid_namespace *pid_ns; + struct super_block *sb; + struct vfsmount *mnt; + + f = fdget_raw(pidfd); + if (!f.file) + return -EBADF; + + sb = file_inode(f.file)->i_sb; + pid_ns = sb->s_fs_info; + + inode = pidfd_wait_get_inode(sb); + + mnt = pid_ns->proc_mnt; + + file = alloc_file_pseudo(inode, mnt, "pidfd_wait", O_RDONLY, + &pidfd_wait_file_ops); + + file->f_mode |= FMODE_PREAD; + + new_fd = get_unused_fd_flags(0); + fd_install(new_fd, file); + + fdput(f); + + return new_fd; +} + /** * sys_pidfd_send_signal - send a signal to a process through a task file * descriptor diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index d21f4befaea4..f52c4d864038 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -450,3 +450,6 @@ COND_SYSCALL(setuid16); /* restartable sequence */ COND_SYSCALL(rseq); + +/* pidfd */ +COND_SYSCALL(pidfd_wait); -- 2.21.0.225.g810b269d1ac-goog ^ permalink raw reply related [flat|nested] 44+ messages in thread
* Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android 2019-03-18 23:50 ` Joel Fernandes @ 2019-03-19 22:14 ` Christian Brauner 2019-03-19 22:48 ` Daniel Colascione 0 siblings, 1 reply; 44+ messages in thread From: Christian Brauner @ 2019-03-19 22:14 UTC (permalink / raw) To: Joel Fernandes Cc: Daniel Colascione, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm, kernel-team, Oleg Nesterov, Andy Lutomirski, Serge E. Hallyn, keescook On Mon, Mar 18, 2019 at 07:50:52PM -0400, Joel Fernandes wrote: > On Mon, Mar 18, 2019 at 01:29:51AM +0100, Christian Brauner wrote: > > On Sun, Mar 17, 2019 at 08:40:19AM -0700, Daniel Colascione wrote: > > > On Sun, Mar 17, 2019 at 4:42 AM Christian Brauner <christian@brauner.io> wrote: > > > > > > > > On Sat, Mar 16, 2019 at 09:53:06PM -0400, Joel Fernandes wrote: > > > > > On Sat, Mar 16, 2019 at 12:37:18PM -0700, Suren Baghdasaryan wrote: > > > > > > On Sat, Mar 16, 2019 at 11:57 AM Christian Brauner <christian@brauner.io> wrote: > > > > > > > > > > > > > > On Sat, Mar 16, 2019 at 11:00:10AM -0700, Daniel Colascione wrote: > > > > > > > > On Sat, Mar 16, 2019 at 10:31 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > > > > > > > > > > > On Fri, Mar 15, 2019 at 11:49 AM Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > > > > > > > > > > > > > > > On Fri, Mar 15, 2019 at 07:24:28PM +0100, Christian Brauner wrote: > > > > > > > > > > [..] > > > > > > > > > > > > why do we want to add a new syscall (pidfd_wait) though? Why not just use > > > > > > > > > > > > standard poll/epoll interface on the proc fd like Daniel was suggesting. > > > > > > > > > > > > AFAIK, once the proc file is opened, the struct pid is essentially pinned > > > > > > > > > > > > even though the proc number may be reused. Then the caller can just poll. > > > > > > > > > > > > We can add a waitqueue to struct pid, and wake up any waiters on process > > > > > > > > > > > > death (A quick look shows task_struct can be mapped to its struct pid) and > > > > > > > > > > > > also possibly optimize it using Steve's TIF flag idea. No new syscall is > > > > > > > > > > > > needed then, let me know if I missed something? > > > > > > > > > > > > > > > > > > > > > > Huh, I thought that Daniel was against the poll/epoll solution? > > > > > > > > > > > > > > > > > > > > Hmm, going through earlier threads, I believe so now. Here was Daniel's > > > > > > > > > > reasoning about avoiding a notification about process death through proc > > > > > > > > > > directory fd: http://lkml.iu.edu/hypermail/linux/kernel/1811.0/00232.html > > > > > > > > > > > > > > > > > > > > May be a dedicated syscall for this would be cleaner after all. > > > > > > > > > > > > > > > > > > Ah, I wish I've seen that discussion before... > > > > > > > > > syscall makes sense and it can be non-blocking and we can use > > > > > > > > > select/poll/epoll if we use eventfd. > > > > > > > > > > > > > > > > Thanks for taking a look. > > > > > > > > > > > > > > > > > I would strongly advocate for > > > > > > > > > non-blocking version or at least to have a non-blocking option. > > > > > > > > > > > > > > > > Waiting for FD readiness is *already* blocking or non-blocking > > > > > > > > according to the caller's desire --- users can pass options they want > > > > > > > > to poll(2) or whatever. There's no need for any kind of special > > > > > > > > configuration knob or non-blocking option. We already *have* a > > > > > > > > non-blocking option that works universally for everything. > > > > > > > > > > > > > > > > As I mentioned in the linked thread, waiting for process exit should > > > > > > > > work just like waiting for bytes to appear on a pipe. Process exit > > > > > > > > status is just another blob of bytes that a process might receive. A > > > > > > > > process exit handle ought to be just another information source. The > > > > > > > > reason the unix process API is so awful is that for whatever reason > > > > > > > > the original designers treated processes as some kind of special kind > > > > > > > > of resource instead of fitting them into the otherwise general-purpose > > > > > > > > unix data-handling API. Let's not repeat that mistake. > > > > > > > > > > > > > > > > > Something like this: > > > > > > > > > > > > > > > > > > evfd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC); > > > > > > > > > // register eventfd to receive death notification > > > > > > > > > pidfd_wait(pid_to_kill, evfd); > > > > > > > > > // kill the process > > > > > > > > > pidfd_send_signal(pid_to_kill, ...) > > > > > > > > > // tend to other things > > > > > > > > > > > > > > > > Now you've lost me. pidfd_wait should return a *new* FD, not wire up > > > > > > > > an eventfd. > > > > > > > > > > > > > > > > > > > > Ok, I probably misunderstood your post linked by Joel. I though your > > > > > > original proposal was based on being able to poll a file under > > > > > > /proc/pid and then you changed your mind to have a separate syscall > > > > > > which I assumed would be a blocking one to wait for process exit. > > > > > > Maybe you can describe the new interface you are thinking about in > > > > > > terms of userspace usage like I did above? Several lines of code would > > > > > > explain more than paragraphs of text. > > > > > > > > > > Hey, Thanks Suren for the eventfd idea. I agree with Daniel on this. The idea > > > > > from Daniel here is to wait for process death and exit events by just > > > > > referring to a stable fd, independent of whatever is going on in /proc. > > > > > > > > > > What is needed is something like this (in highly pseudo-code form): > > > > > > > > > > pidfd = opendir("/proc/<pid>",..); > > > > > wait_fd = pidfd_wait(pidfd); > > > > > read or poll wait_fd (non-blocking or blocking whichever) > > > > > > > > > > wait_fd will block until the task has either died or reaped. In both these > > > > > cases, it can return a suitable string such as "dead" or "reaped" although an > > > > > integer with some predefined meaning is also Ok. > > > > > > I want to return a siginfo_t: we already use this structure in other > > > contexts to report exit status. > > > > > Fine with me. I did a prototype (code is below) as a string but I can change > that to siginfo_t in the future. > > > > > Having pidfd_wait() return another fd will make the syscall harder to > > > > swallow for a lot of people I reckon. > > > > What exactly prevents us from making the pidfd itself readable/pollable > > > > for the exit staus? They are "special" fds anyway. I would really like > > > > to avoid polluting the api with multiple different types of fds if possible. > > > > > > If pidfds had been their own file type, I'd agree with you. But pidfds > > > are directories, which means that we're beholden to make them behave > > > like directories normally do. I'd rather introduce another FD than > > > heavily overload the semantics of a directory FD in one particular > > > context. In no other circumstances are directory FDs also weird > > > IO-data sources. Our providing a facility to get a new FD to which we > > > *can* give pipe-like behavior does no harm and *usage* cleaner and > > > easier to reason about. > > > > I have two things I'm currently working on: > > - hijacking translate_pid() > > - pidfd_clone() essentially > > > > My first goal is to talk to Eric about taking the translate_pid() > > syscall that has been sitting in his tree and expanding it. > > translate_pid() currently allows you to either get an fd for the pid > > namespace a pid resides in or the pid number of a given process in > > another pid namespace relative to a passed in pid namespace fd. > > That's good to know. More comments below: Sorry for the delay I'm still traveling. I'll be back on a fully functional schedule starting Monday. > > > I would > > like to make it possible for this syscall to also give us back pidfds. > > One question I'm currently struggling with is exactly what you said > > above: what type of file descriptor these are going to give back to us. > > It seems that a regular file instead of directory would make the most > > sense and would lead to a nicer API and I'm very much leaning towards > > that. > > How about something like the following? We can plumb the new file as a pseudo > file that is invisible and linked to the fd. This is extremely rough (does > not do error handling, synchronizatoin etc) but just wanted to share the idea > of what the "frontend" could look like. It is also missing all the actual pid > status messages. It just takes care of the creating new fd from the pidfd > part and providing file read ops returning the "status" string. It is also > written in signal.c and should likely go into proc fs files under fs. > Appreciate any suggestions (a test program did prove it works). > > Also, I was able to translate a pidfd to a pid_namespace by referring to some > existing code but perhaps you may be able to suggest something better for > such translation.. Yeah, there's better ways but I think there's another issue. See below. > > ---8<----------------------- > > From: Joel Fernandes <joelaf@google.com> > Subject: [PATCH] Partial skeleton prototype of pidfd_wait frontend > > Signed-off-by: Joel Fernandes <joelaf@google.com> > --- > arch/x86/entry/syscalls/syscall_32.tbl | 1 + > arch/x86/entry/syscalls/syscall_64.tbl | 1 + > include/linux/syscalls.h | 1 + > include/uapi/asm-generic/unistd.h | 4 +- > kernel/signal.c | 62 ++++++++++++++++++++++++++ > kernel/sys_ni.c | 3 ++ > 6 files changed, 71 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl > index 1f9607ed087c..2a63f1896b63 100644 > --- a/arch/x86/entry/syscalls/syscall_32.tbl > +++ b/arch/x86/entry/syscalls/syscall_32.tbl > @@ -433,3 +433,4 @@ > 425 i386 io_uring_setup sys_io_uring_setup __ia32_sys_io_uring_setup > 426 i386 io_uring_enter sys_io_uring_enter __ia32_sys_io_uring_enter > 427 i386 io_uring_register sys_io_uring_register __ia32_sys_io_uring_register > +428 i386 pidfd_wait sys_pidfd_wait __ia32_sys_pidfd_wait > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl > index 92ee0b4378d4..cf2e08a8053b 100644 > --- a/arch/x86/entry/syscalls/syscall_64.tbl > +++ b/arch/x86/entry/syscalls/syscall_64.tbl > @@ -349,6 +349,7 @@ > 425 common io_uring_setup __x64_sys_io_uring_setup > 426 common io_uring_enter __x64_sys_io_uring_enter > 427 common io_uring_register __x64_sys_io_uring_register > +428 common pidfd_wait __x64_sys_pidfd_wait > > # > # x32-specific system call numbers start at 512 to avoid cache impact > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h > index e446806a561f..62160970ed3f 100644 > --- a/include/linux/syscalls.h > +++ b/include/linux/syscalls.h > @@ -988,6 +988,7 @@ asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len, > asmlinkage long sys_pidfd_send_signal(int pidfd, int sig, > siginfo_t __user *info, > unsigned int flags); > +asmlinkage long sys_pidfd_wait(int pidfd); > > /* > * Architecture-specific system calls > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h > index dee7292e1df6..137aa8662230 100644 > --- a/include/uapi/asm-generic/unistd.h > +++ b/include/uapi/asm-generic/unistd.h > @@ -832,9 +832,11 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup) > __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter) > #define __NR_io_uring_register 427 > __SYSCALL(__NR_io_uring_register, sys_io_uring_register) > +#define __NR_pidfd_wait 428 > +__SYSCALL(__NR_pidfd_wait, sys_pidfd_wait) > > #undef __NR_syscalls > -#define __NR_syscalls 428 > +#define __NR_syscalls 429 > > /* > * 32 bit systems traditionally used different > diff --git a/kernel/signal.c b/kernel/signal.c > index b7953934aa99..ebb550b87044 100644 > --- a/kernel/signal.c > +++ b/kernel/signal.c > @@ -3550,6 +3550,68 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info) > return copy_siginfo_from_user(kinfo, info); > } > > +static ssize_t pidfd_wait_read_iter(struct kiocb *iocb, struct iov_iter *to) > +{ > + /* > + * This is just a test string, it will contain the actual > + * status of the pidfd in the future. > + */ > + char buf[] = "status"; > + > + return copy_to_iter(buf, strlen(buf)+1, to); > +} > + > +static const struct file_operations pidfd_wait_file_ops = { > + .read_iter = pidfd_wait_read_iter, > +}; > + > +static struct inode *pidfd_wait_get_inode(struct super_block *sb) > +{ > + struct inode *inode = new_inode(sb); > + > + inode->i_ino = get_next_ino(); > + inode_init_owner(inode, NULL, S_IFREG); > + > + inode->i_op = &simple_dir_inode_operations; > + inode->i_fop = &pidfd_wait_file_ops; > + > + return inode; > +} > + > +SYSCALL_DEFINE1(pidfd_wait, int, pidfd) > +{ > + struct fd f; > + struct inode *inode; > + struct file *file; > + int new_fd; > + struct pid_namespace *pid_ns; > + struct super_block *sb; > + struct vfsmount *mnt; > + > + f = fdget_raw(pidfd); > + if (!f.file) > + return -EBADF; > + > + sb = file_inode(f.file)->i_sb; > + pid_ns = sb->s_fs_info; > + > + inode = pidfd_wait_get_inode(sb); > + > + mnt = pid_ns->proc_mnt; > + > + file = alloc_file_pseudo(inode, mnt, "pidfd_wait", O_RDONLY, > + &pidfd_wait_file_ops); So I dislike the idea of allocating new inodes from the procfs super block. I would like to avoid pinning the whole pidfd concept exclusively to proc. The idea is that the pidfd API will be useable through procfs via open("/proc/<pid>") because that is what users expect and really wanted to have for a long time. So it makes sense to have this working. But it should really be useable without it. That's why translate_pid() and pidfd_clone() are on the table. What I'm saying is, once the pidfd api is "complete" you should be able to set CONFIG_PROCFS=N - even though that's crazy - and still be able to use pidfds. This is also a point akpm asked about when I did the pidfd_send_signal work. So instead of going throught proc we should probably do what David has been doing in the mount API and come to rely on anone_inode. So something like: fd = anon_inode_getfd("pidfd", &pidfd_fops, file_priv_data, flags); and stash information such as pid namespace etc. in a pidfd struct or something that we then can stash file->private_data of the new file. This also lets us avoid all this open coding done here. Another advantage is that anon_inodes is its own kernel-internal filesystem. Christian > + > + file->f_mode |= FMODE_PREAD; > + > + new_fd = get_unused_fd_flags(0); > + fd_install(new_fd, file); > + > + fdput(f); > + > + return new_fd; > +} > + > /** > * sys_pidfd_send_signal - send a signal to a process through a task file > * descriptor > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c > index d21f4befaea4..f52c4d864038 100644 > --- a/kernel/sys_ni.c > +++ b/kernel/sys_ni.c > @@ -450,3 +450,6 @@ COND_SYSCALL(setuid16); > > /* restartable sequence */ > COND_SYSCALL(rseq); > + > +/* pidfd */ > +COND_SYSCALL(pidfd_wait); > -- > 2.21.0.225.g810b269d1ac-goog > ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android 2019-03-19 22:14 ` Christian Brauner @ 2019-03-19 22:48 ` Daniel Colascione 2019-03-19 23:10 ` Christian Brauner 0 siblings, 1 reply; 44+ messages in thread From: Daniel Colascione @ 2019-03-19 22:48 UTC (permalink / raw) To: Christian Brauner Cc: Joel Fernandes, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm, kernel-team, Oleg Nesterov, Andy Lutomirski, Serge E. Hallyn, Kees Cook On Tue, Mar 19, 2019 at 3:14 PM Christian Brauner <christian@brauner.io> wrote: > So I dislike the idea of allocating new inodes from the procfs super > block. I would like to avoid pinning the whole pidfd concept exclusively > to proc. The idea is that the pidfd API will be useable through procfs > via open("/proc/<pid>") because that is what users expect and really > wanted to have for a long time. So it makes sense to have this working. > But it should really be useable without it. That's why translate_pid() > and pidfd_clone() are on the table. What I'm saying is, once the pidfd > api is "complete" you should be able to set CONFIG_PROCFS=N - even > though that's crazy - and still be able to use pidfds. This is also a > point akpm asked about when I did the pidfd_send_signal work. I agree that you shouldn't need CONFIG_PROCFS=Y to use pidfds. One crazy idea that I was discussing with Joel the other day is to just make CONFIG_PROCFS=Y mandatory and provide a new get_procfs_root() system call that returned, out of thin air and independent of the mount table, a procfs root directory file descriptor for the caller's PID namspace and suitable for use with openat(2). C'mon: /proc is used by everyone today and almost every program breaks if it's not around. The string "/proc" is already de facto kernel ABI. Let's just drop the pretense of /proc being optional and bake it into the kernel proper, then give programs a way to get to /proc that isn't tied to any particular mount configuration. This way, we don't need a translate_pid(), since callers can just use procfs to do the same thing. (That is, if I understand correctly what translate_pid does.) We still need a pidfd_clone() for atomicity reasons, but that's a separate story. My goal is to be able to write a library that transparently creates and manages a helper child process even in a "hostile" process environment in which some other uncoordinated thread is constantly doing a waitpid(-1) (e.g., the JVM). > So instead of going throught proc we should probably do what David has > been doing in the mount API and come to rely on anone_inode. So > something like: > > fd = anon_inode_getfd("pidfd", &pidfd_fops, file_priv_data, flags); > > and stash information such as pid namespace etc. in a pidfd struct or > something that we then can stash file->private_data of the new file. > This also lets us avoid all this open coding done here. > Another advantage is that anon_inodes is its own kernel-internal > filesystem. Sure. That works too. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android 2019-03-19 22:48 ` Daniel Colascione @ 2019-03-19 23:10 ` Christian Brauner 2019-03-20 1:52 ` Joel Fernandes 0 siblings, 1 reply; 44+ messages in thread From: Christian Brauner @ 2019-03-19 23:10 UTC (permalink / raw) To: Daniel Colascione Cc: Joel Fernandes, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm, kernel-team, Oleg Nesterov, Andy Lutomirski, Serge E. Hallyn, Kees Cook On Tue, Mar 19, 2019 at 03:48:32PM -0700, Daniel Colascione wrote: > On Tue, Mar 19, 2019 at 3:14 PM Christian Brauner <christian@brauner.io> wrote: > > So I dislike the idea of allocating new inodes from the procfs super > > block. I would like to avoid pinning the whole pidfd concept exclusively > > to proc. The idea is that the pidfd API will be useable through procfs > > via open("/proc/<pid>") because that is what users expect and really > > wanted to have for a long time. So it makes sense to have this working. > > But it should really be useable without it. That's why translate_pid() > > and pidfd_clone() are on the table. What I'm saying is, once the pidfd > > api is "complete" you should be able to set CONFIG_PROCFS=N - even > > though that's crazy - and still be able to use pidfds. This is also a > > point akpm asked about when I did the pidfd_send_signal work. > > I agree that you shouldn't need CONFIG_PROCFS=Y to use pidfds. One > crazy idea that I was discussing with Joel the other day is to just > make CONFIG_PROCFS=Y mandatory and provide a new get_procfs_root() > system call that returned, out of thin air and independent of the > mount table, a procfs root directory file descriptor for the caller's > PID namspace and suitable for use with openat(2). Even if this works I'm pretty sure that Al and a lot of others will not be happy about this. A syscall to get an fd to /proc? That's not going to happen and I don't see the need for a separate syscall just for that. (I do see the point of making CONFIG_PROCFS=y the default btw.) Inode allocation from the procfs mount for the file descriptors Joel wants is not correct. Their not really procfs file descriptors so this is a nack. We can't just hook into proc that way. > > C'mon: /proc is used by everyone today and almost every program breaks > if it's not around. The string "/proc" is already de facto kernel ABI. > Let's just drop the pretense of /proc being optional and bake it into > the kernel proper, then give programs a way to get to /proc that isn't > tied to any particular mount configuration. This way, we don't need a > translate_pid(), since callers can just use procfs to do the same > thing. (That is, if I understand correctly what translate_pid does.) I'm not sure what you think translate_pid() is doing since you're not saying what you think it does. Examples from the old patchset: translate_pid(pid, ns, -1) - get pid in our pid namespace translate_pid(pid, -1, ns) - get pid in other pid namespace translate_pid(1, ns, -1) - get pid of init task for namespace translate_pid(pid, -1, ns) > 0 - is pid is reachable from ns? translate_pid(1, ns1, ns2) > 0 - is ns1 inside ns2? translate_pid(1, ns1, ns2) == 0 - is ns1 outside ns2? translate_pid(1, ns1, ns2) == 1 - is ns1 equal ns2? Allowing this syscall to yield pidfds as proper regular file fds and also taking pidfds as argument is an excellent way to kill a few problems at once: - cheap pid namespace introspection - creates a bridge between the "old" pid-based api and the "new" pidfd api - allows us to get proper non-directory file descriptors for any pids we like The additional advantage is that people are already happy to add this syscall so simply extending it and routing it through the pidfd tree or Eric's tree is reasonable. (It should probably grow a flag argument. I need to start prototyping this.) > > We still need a pidfd_clone() for atomicity reasons, but that's a > separate story. My goal is to be able to write a library that Yes, on my todo list and I have a ported patch based on prior working rotting somehwere on my git server. > transparently creates and manages a helper child process even in a > "hostile" process environment in which some other uncoordinated thread > is constantly doing a waitpid(-1) (e.g., the JVM). > > > So instead of going throught proc we should probably do what David has > > been doing in the mount API and come to rely on anone_inode. So > > something like: > > > > fd = anon_inode_getfd("pidfd", &pidfd_fops, file_priv_data, flags); > > > > and stash information such as pid namespace etc. in a pidfd struct or > > something that we then can stash file->private_data of the new file. > > This also lets us avoid all this open coding done here. > > Another advantage is that anon_inodes is its own kernel-internal > > filesystem. > > Sure. That works too. Great. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android 2019-03-19 23:10 ` Christian Brauner @ 2019-03-20 1:52 ` Joel Fernandes 2019-03-20 2:42 ` pidfd design Daniel Colascione 0 siblings, 1 reply; 44+ messages in thread From: Joel Fernandes @ 2019-03-20 1:52 UTC (permalink / raw) To: Christian Brauner Cc: Daniel Colascione, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm, kernel-team, Oleg Nesterov, Andy Lutomirski, Serge E. Hallyn, Kees Cook On Wed, Mar 20, 2019 at 12:10:23AM +0100, Christian Brauner wrote: > On Tue, Mar 19, 2019 at 03:48:32PM -0700, Daniel Colascione wrote: > > On Tue, Mar 19, 2019 at 3:14 PM Christian Brauner <christian@brauner.io> wrote: > > > So I dislike the idea of allocating new inodes from the procfs super > > > block. I would like to avoid pinning the whole pidfd concept exclusively > > > to proc. The idea is that the pidfd API will be useable through procfs > > > via open("/proc/<pid>") because that is what users expect and really > > > wanted to have for a long time. So it makes sense to have this working. > > > But it should really be useable without it. That's why translate_pid() > > > and pidfd_clone() are on the table. What I'm saying is, once the pidfd > > > api is "complete" you should be able to set CONFIG_PROCFS=N - even > > > though that's crazy - and still be able to use pidfds. This is also a > > > point akpm asked about when I did the pidfd_send_signal work. > > > > I agree that you shouldn't need CONFIG_PROCFS=Y to use pidfds. One > > crazy idea that I was discussing with Joel the other day is to just > > make CONFIG_PROCFS=Y mandatory and provide a new get_procfs_root() > > system call that returned, out of thin air and independent of the > > mount table, a procfs root directory file descriptor for the caller's > > PID namspace and suitable for use with openat(2). > > Even if this works I'm pretty sure that Al and a lot of others will not > be happy about this. A syscall to get an fd to /proc? That's not going > to happen and I don't see the need for a separate syscall just for that. > (I do see the point of making CONFIG_PROCFS=y the default btw.) I think his point here was that he wanted a handle to procfs no matter where it was mounted and then can later use openat on that. Agreed that it may be unnecessary unless there is a usecase for it, and especially if the /proc directory being the defacto mountpoint for procfs is a universal convention. > Inode allocation from the procfs mount for the file descriptors Joel > wants is not correct. Their not really procfs file descriptors so this > is a nack. We can't just hook into proc that way. I was not particular about using procfs mount for the FDs but that's the only way I knew how to do it until you pointed out anon_inode (my grep skills missed that), so thank you! > > C'mon: /proc is used by everyone today and almost every program breaks > > if it's not around. The string "/proc" is already de facto kernel ABI. > > Let's just drop the pretense of /proc being optional and bake it into > > the kernel proper, then give programs a way to get to /proc that isn't > > tied to any particular mount configuration. This way, we don't need a > > translate_pid(), since callers can just use procfs to do the same > > thing. (That is, if I understand correctly what translate_pid does.) > > I'm not sure what you think translate_pid() is doing since you're not > saying what you think it does. > Examples from the old patchset: > translate_pid(pid, ns, -1) - get pid in our pid namespace > translate_pid(pid, -1, ns) - get pid in other pid namespace > translate_pid(1, ns, -1) - get pid of init task for namespace > translate_pid(pid, -1, ns) > 0 - is pid is reachable from ns? > translate_pid(1, ns1, ns2) > 0 - is ns1 inside ns2? > translate_pid(1, ns1, ns2) == 0 - is ns1 outside ns2? > translate_pid(1, ns1, ns2) == 1 - is ns1 equal ns2? > > Allowing this syscall to yield pidfds as proper regular file fds and > also taking pidfds as argument is an excellent way to kill a few > problems at once: > - cheap pid namespace introspection > - creates a bridge between the "old" pid-based api and the "new" pidfd api This second point would solve the problem of getting a new pidfd given a pid indeed, without depending on /proc/<pid> at all. So kudos for that and I am glad you are making it return pidfds (but correct me if I misunderstood what you're planning to do with translate_fd). It also obviates any need for dealing with procfs mount points. > - allows us to get proper non-directory file descriptors for any pids we > like Here I got a bit lost. AIUI pidfd is a directory fd. Why would we want it to not be a directory fd? That would be ambigiuous with what pidfd_send_signal expects. Also would it be a bad idea to extend translate_pid to also do what we want for the pidfd_wait syscall? So translate_fd in this case would return an fd that is just used for the pid's death status. All of these extensions seem to mean translate_pid should probably take a fourth parameter that tells it the target translation type? They way I am hypothesizing, translate_pid, it should probably be - translation to a pid in some ns - translation of a pid to a pidfd - translation of a pid to a "wait" fd which returns the death/reap process status. If that makes sense, that would also avoid the need for a new syscall we are adding. > The additional advantage is that people are already happy to add this > syscall so simply extending it and routing it through the pidfd tree or > Eric's tree is reasonable. (It should probably grow a flag argument. I > need to start prototyping this.) Great! > > > > We still need a pidfd_clone() for atomicity reasons, but that's a > > separate story. My goal is to be able to write a library that > > Yes, on my todo list and I have a ported patch based on prior working > rotting somehwere on my git server. Is that different from using dup2 on a pidfd? Sorry I don't follow what is pidfd_clone / why it is needed. thanks, - Joel ^ permalink raw reply [flat|nested] 44+ messages in thread
* pidfd design 2019-03-20 1:52 ` Joel Fernandes @ 2019-03-20 2:42 ` Daniel Colascione 2019-03-20 3:59 ` Christian Brauner 0 siblings, 1 reply; 44+ messages in thread From: Daniel Colascione @ 2019-03-20 2:42 UTC (permalink / raw) To: Joel Fernandes Cc: Christian Brauner, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm, kernel-team, Oleg Nesterov, Andy Lutomirski, Serge E. Hallyn, Kees Cook On Tue, Mar 19, 2019 at 6:52 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > On Wed, Mar 20, 2019 at 12:10:23AM +0100, Christian Brauner wrote: > > On Tue, Mar 19, 2019 at 03:48:32PM -0700, Daniel Colascione wrote: > > > On Tue, Mar 19, 2019 at 3:14 PM Christian Brauner <christian@brauner.io> wrote: > > > > So I dislike the idea of allocating new inodes from the procfs super > > > > block. I would like to avoid pinning the whole pidfd concept exclusively > > > > to proc. The idea is that the pidfd API will be useable through procfs > > > > via open("/proc/<pid>") because that is what users expect and really > > > > wanted to have for a long time. So it makes sense to have this working. > > > > But it should really be useable without it. That's why translate_pid() > > > > and pidfd_clone() are on the table. What I'm saying is, once the pidfd > > > > api is "complete" you should be able to set CONFIG_PROCFS=N - even > > > > though that's crazy - and still be able to use pidfds. This is also a > > > > point akpm asked about when I did the pidfd_send_signal work. > > > > > > I agree that you shouldn't need CONFIG_PROCFS=Y to use pidfds. One > > > crazy idea that I was discussing with Joel the other day is to just > > > make CONFIG_PROCFS=Y mandatory and provide a new get_procfs_root() > > > system call that returned, out of thin air and independent of the > > > mount table, a procfs root directory file descriptor for the caller's > > > PID namspace and suitable for use with openat(2). > > > > Even if this works I'm pretty sure that Al and a lot of others will not > > be happy about this. A syscall to get an fd to /proc? Why not? procfs provides access to a lot of core kernel functionality. Why should you need a mountpoint to get to it? > That's not going > > to happen and I don't see the need for a separate syscall just for that. We need a system call for the same reason we need a getrandom(2): you have to bootstrap somehow when you're in a minimal environment. > > (I do see the point of making CONFIG_PROCFS=y the default btw.) I'm not proposing that we make CONFIG_PROCFS=y the default. I'm proposing that we *hardwire* it as the default and just declare that it's not possible to build a Linux kernel that doesn't include procfs. Why do we even have that button? > I think his point here was that he wanted a handle to procfs no matter where > it was mounted and then can later use openat on that. Agreed that it may be > unnecessary unless there is a usecase for it, and especially if the /proc > directory being the defacto mountpoint for procfs is a universal convention. If it's a universal convention and, in practice, everyone needs proc mounted anyway, so what's the harm in hardwiring CONFIG_PROCFS=y? If we advertise /proc as not merely some kind of optional debug interface but *the* way certain kernel features are exposed --- and there's nothing wrong with that --- then we should give programs access to these core kernel features in a way that doesn't depend on userspace kernel configuration, and you do that by either providing a procfs-root-getting system call or just hardwiring the "/proc/" prefix into VFS. > > Inode allocation from the procfs mount for the file descriptors Joel > > wants is not correct. Their not really procfs file descriptors so this > > is a nack. We can't just hook into proc that way. > > I was not particular about using procfs mount for the FDs but that's the only > way I knew how to do it until you pointed out anon_inode (my grep skills > missed that), so thank you! > > > > C'mon: /proc is used by everyone today and almost every program breaks > > > if it's not around. The string "/proc" is already de facto kernel ABI. > > > Let's just drop the pretense of /proc being optional and bake it into > > > the kernel proper, then give programs a way to get to /proc that isn't > > > tied to any particular mount configuration. This way, we don't need a > > > translate_pid(), since callers can just use procfs to do the same > > > thing. (That is, if I understand correctly what translate_pid does.) > > > > I'm not sure what you think translate_pid() is doing since you're not > > saying what you think it does. > > Examples from the old patchset: > > translate_pid(pid, ns, -1) - get pid in our pid namespace Ah, it's a bit different from what I had in mind. It's fair to want to translate PIDs between namespaces, but the only way to make the translate_pid under discussion robust is to have it accept and produce pidfds. (At that point, you might as well call it translate_pidfd.) We should not be adding new APIs to the kernel that accept numeric PIDs: it's not possible to use these APIs correctly except under very limited circumstances --- mostly, talking about init or a parent talking about its child. Really, we need a few related operations, and we shouldn't necessarily mingle them. 1) Given a numeric PID, give me a pidfd: that works today: you just open /proc/<pid> 2) Given a pidfd, give me a numeric PID: that works today: you just openat(pidfd, "stat", O_RDONLY) and read the first token (which is always the numeric PID). 3) Given a pidfd, send a signal: that's what pidfd_send_signal does, and it's a good start on the rest of these operations. 4) Given a pidfd, wait for the named process to exit: that's what my original exithand thing did, and that's what Joel's helpfully agreed to start hacking on. 5) Given a pidfd in NS1, get a pidfd in NS2. That's what translate_pid is for. My preferred signature for this routine is translate_pid(int pidfd, int nsfd) -> pidfd. We don't need two namespace arguments. Why not? Because the pidfd *already* names a single process, uniquely! 6) Make a new process and atomically give me a pidfd for it. We need a new kind of clone(2) for that. People have been proposing some kind of FD-based fork/spawn/etc. thing for ages, and we can finally provide it. Yay. 7) Retrieve miscellaneous information about a process identified by a pidfd: openat(2) handles this case today. This is a decent framework for a good general-purpose process API that builds on the one the kernel already provides. With this API, people should never have to touch the old unix process API except to talk to humans and other legacy systems. It's a big project, but worthwhile, and we can do it piecemeal. Christian, what worries me is that you want to make this project 10x harder, both in technical and lkml-political terms, by making it work without CONFIG_PROCFS=y. Without procfs, all the operations above that involve the word "openat" or "/proc" break, which means that our general-purpose process API needs to provide its own equivalents to these operations, and on top of these, its own non-procfs pidfd FD type --- let's call it pidfd_2. (Let's call a directory FD on /proc/<pid> a pidfd_1.) Under this scheme, we have to have all operations that accept a pidfd_1 (like pidfd_send_signal) and have them accept pidfd_2 file descriptors as well in general fashion. (The difference between pidfd_1 and pidfd_2 is visible to users who can call fstat and look at st_dev.) We'd also need an API to translate a pidfd_2 to a pidfd_1 so you could call openat on it to look at /proc/<pid> data files, to support operation #7 above. The alternative to provide #7 is some kind of new general-purpose process-information-retrieval interface that mirrors the functionality /proc/<pid> already provides --- e.g., getting the smaps list for a process. To sum it up, we can A) declare that pidfds don't work without CONFIG_PROCFS=y, B) hardwire CONFIG_PROCFS=y in all configurations, or C) support both procfs-based pidfd_1 FDs and non-procfs pidfd_2 FDs. Option C seems like pointless complexity to me, as I described above. Option C means that we have to duplicate a lot of existing and perfectly good functionality. Option A is fine by me, since I think CONFIG_PROCFS=n is just a bonkers and broken configuration that's barely even Linux. From a design perspective, I prefer option B: it turns a de-facto guaranteed /proc ABI into a de-jure guaranteed ABI, and that's just more straightforward for everyone --- especially since it reduces the complexity of the Linux core by deleting all the !CONFIG_PROCFS code paths. My point about the procfs system call is that *if* we go with option B and make procfs mandatory, we're essentially stating that certain kernel functionality is always available, and because (as a general rule) kernel functionality that's always available should be available to every process, we should provide a way to *use* this always-present kernel functionality that doesn't depend on the mount table --- thus my proposed get_procfs_root(2). We don't have to decide between A and B right now. We can continue develop pidfd development under the assumption we're going with option A, and when option B seems like a good idea, we can just switch with minimal hassle. On the other hand, if we did implement option C and, later, it became apparently that option B was right after all, all the work needed for option C would have been a waste. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design 2019-03-20 2:42 ` pidfd design Daniel Colascione @ 2019-03-20 3:59 ` Christian Brauner 2019-03-20 7:02 ` Daniel Colascione 0 siblings, 1 reply; 44+ messages in thread From: Christian Brauner @ 2019-03-20 3:59 UTC (permalink / raw) To: Daniel Colascione Cc: Joel Fernandes, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm, kernel-team, Oleg Nesterov, Andy Lutomirski, Serge E. Hallyn, Kees Cook On Tue, Mar 19, 2019 at 07:42:52PM -0700, Daniel Colascione wrote: > On Tue, Mar 19, 2019 at 6:52 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > > > On Wed, Mar 20, 2019 at 12:10:23AM +0100, Christian Brauner wrote: > > > On Tue, Mar 19, 2019 at 03:48:32PM -0700, Daniel Colascione wrote: > > > > On Tue, Mar 19, 2019 at 3:14 PM Christian Brauner <christian@brauner.io> wrote: > > > > > So I dislike the idea of allocating new inodes from the procfs super > > > > > block. I would like to avoid pinning the whole pidfd concept exclusively > > > > > to proc. The idea is that the pidfd API will be useable through procfs > > > > > via open("/proc/<pid>") because that is what users expect and really > > > > > wanted to have for a long time. So it makes sense to have this working. > > > > > But it should really be useable without it. That's why translate_pid() > > > > > and pidfd_clone() are on the table. What I'm saying is, once the pidfd > > > > > api is "complete" you should be able to set CONFIG_PROCFS=N - even > > > > > though that's crazy - and still be able to use pidfds. This is also a > > > > > point akpm asked about when I did the pidfd_send_signal work. > > > > > > > > I agree that you shouldn't need CONFIG_PROCFS=Y to use pidfds. One > > > > crazy idea that I was discussing with Joel the other day is to just > > > > make CONFIG_PROCFS=Y mandatory and provide a new get_procfs_root() > > > > system call that returned, out of thin air and independent of the > > > > mount table, a procfs root directory file descriptor for the caller's > > > > PID namspace and suitable for use with openat(2). > > > > > > Even if this works I'm pretty sure that Al and a lot of others will not > > > be happy about this. A syscall to get an fd to /proc? > > Why not? procfs provides access to a lot of core kernel functionality. > Why should you need a mountpoint to get to it? > > > That's not going > > > to happen and I don't see the need for a separate syscall just for that. > > We need a system call for the same reason we need a getrandom(2): you > have to bootstrap somehow when you're in a minimal environment. > > > > (I do see the point of making CONFIG_PROCFS=y the default btw.) > > I'm not proposing that we make CONFIG_PROCFS=y the default. I'm > proposing that we *hardwire* it as the default and just declare that > it's not possible to build a Linux kernel that doesn't include procfs. > Why do we even have that button? > > > I think his point here was that he wanted a handle to procfs no matter where > > it was mounted and then can later use openat on that. Agreed that it may be > > unnecessary unless there is a usecase for it, and especially if the /proc > > directory being the defacto mountpoint for procfs is a universal convention. > > If it's a universal convention and, in practice, everyone needs proc > mounted anyway, so what's the harm in hardwiring CONFIG_PROCFS=y? If > we advertise /proc as not merely some kind of optional debug interface > but *the* way certain kernel features are exposed --- and there's > nothing wrong with that --- then we should give programs access to > these core kernel features in a way that doesn't depend on userspace > kernel configuration, and you do that by either providing a > procfs-root-getting system call or just hardwiring the "/proc/" prefix > into VFS. > > > > Inode allocation from the procfs mount for the file descriptors Joel > > > wants is not correct. Their not really procfs file descriptors so this > > > is a nack. We can't just hook into proc that way. > > > > I was not particular about using procfs mount for the FDs but that's the only > > way I knew how to do it until you pointed out anon_inode (my grep skills > > missed that), so thank you! > > > > > > C'mon: /proc is used by everyone today and almost every program breaks > > > > if it's not around. The string "/proc" is already de facto kernel ABI. > > > > Let's just drop the pretense of /proc being optional and bake it into > > > > the kernel proper, then give programs a way to get to /proc that isn't > > > > tied to any particular mount configuration. This way, we don't need a > > > > translate_pid(), since callers can just use procfs to do the same > > > > thing. (That is, if I understand correctly what translate_pid does.) > > > > > > I'm not sure what you think translate_pid() is doing since you're not > > > saying what you think it does. > > > Examples from the old patchset: > > > translate_pid(pid, ns, -1) - get pid in our pid namespace > > Ah, it's a bit different from what I had in mind. It's fair to want to > translate PIDs between namespaces, but the only way to make the > translate_pid under discussion robust is to have it accept and produce > pidfds. (At that point, you might as well call it translate_pidfd.) We > should not be adding new APIs to the kernel that accept numeric PIDs: The traditional pid-based api is not going away. There are users that have the requirement to translate pids between namespaces and also doing introspection on these namespaces independent of pidfds. We will not restrict the usefulness of this syscall by making it only work with pidfds. > it's not possible to use these APIs correctly except under very > limited circumstances --- mostly, talking about init or a parent The pid-based api is one of the most widely used apis of the kernel and people have been using it quite successfully for a long time. Yes, it's rac, but it's here to stay. > talking about its child. > > Really, we need a few related operations, and we shouldn't necessarily > mingle them. Yes, we've established that previously. > > 1) Given a numeric PID, give me a pidfd: that works today: you just > open /proc/<pid> Agreed. > > 2) Given a pidfd, give me a numeric PID: that works today: you just > openat(pidfd, "stat", O_RDONLY) and read the first token (which is > always the numeric PID). Agreed. > > 3) Given a pidfd, send a signal: that's what pidfd_send_signal does, > and it's a good start on the rest of these operations. Agreed. > 5) Given a pidfd in NS1, get a pidfd in NS2. That's what translate_pid > is for. My preferred signature for this routine is translate_pid(int > pidfd, int nsfd) -> pidfd. We don't need two namespace arguments. Why > not? Because the pidfd *already* names a single process, uniquely! Given that people are interested in pids we can't just always return a pidfd. That would mean a user would need to do get the pidfd read from <pidfd>/stat and then close the pidfd. If you do that for a 100 pids or more you end up allocating and closing file descriptors constantly for no reason. We can't just debate pids away. So it will also need to be able to yield pids e.g. through a flag argument. > > 6) Make a new process and atomically give me a pidfd for it. We need a > new kind of clone(2) for that. People have been proposing some kind of > FD-based fork/spawn/etc. thing for ages, and we can finally provide > it. Yay. Agreed. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design 2019-03-20 3:59 ` Christian Brauner @ 2019-03-20 7:02 ` Daniel Colascione 2019-03-20 11:33 ` Joel Fernandes 0 siblings, 1 reply; 44+ messages in thread From: Daniel Colascione @ 2019-03-20 7:02 UTC (permalink / raw) To: Christian Brauner Cc: Joel Fernandes, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm, kernel-team, Oleg Nesterov, Andy Lutomirski, Serge E. Hallyn, Kees Cook On Tue, Mar 19, 2019 at 8:59 PM Christian Brauner <christian@brauner.io> wrote: > > On Tue, Mar 19, 2019 at 07:42:52PM -0700, Daniel Colascione wrote: > > On Tue, Mar 19, 2019 at 6:52 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > On Wed, Mar 20, 2019 at 12:10:23AM +0100, Christian Brauner wrote: > > > > On Tue, Mar 19, 2019 at 03:48:32PM -0700, Daniel Colascione wrote: > > > > > On Tue, Mar 19, 2019 at 3:14 PM Christian Brauner <christian@brauner.io> wrote: > > > > > > So I dislike the idea of allocating new inodes from the procfs super > > > > > > block. I would like to avoid pinning the whole pidfd concept exclusively > > > > > > to proc. The idea is that the pidfd API will be useable through procfs > > > > > > via open("/proc/<pid>") because that is what users expect and really > > > > > > wanted to have for a long time. So it makes sense to have this working. > > > > > > But it should really be useable without it. That's why translate_pid() > > > > > > and pidfd_clone() are on the table. What I'm saying is, once the pidfd > > > > > > api is "complete" you should be able to set CONFIG_PROCFS=N - even > > > > > > though that's crazy - and still be able to use pidfds. This is also a > > > > > > point akpm asked about when I did the pidfd_send_signal work. > > > > > > > > > > I agree that you shouldn't need CONFIG_PROCFS=Y to use pidfds. One > > > > > crazy idea that I was discussing with Joel the other day is to just > > > > > make CONFIG_PROCFS=Y mandatory and provide a new get_procfs_root() > > > > > system call that returned, out of thin air and independent of the > > > > > mount table, a procfs root directory file descriptor for the caller's > > > > > PID namspace and suitable for use with openat(2). > > > > > > > > Even if this works I'm pretty sure that Al and a lot of others will not > > > > be happy about this. A syscall to get an fd to /proc? > > > > Why not? procfs provides access to a lot of core kernel functionality. > > Why should you need a mountpoint to get to it? > > > > > That's not going > > > > to happen and I don't see the need for a separate syscall just for that. > > > > We need a system call for the same reason we need a getrandom(2): you > > have to bootstrap somehow when you're in a minimal environment. > > > > > > (I do see the point of making CONFIG_PROCFS=y the default btw.) > > > > I'm not proposing that we make CONFIG_PROCFS=y the default. I'm > > proposing that we *hardwire* it as the default and just declare that > > it's not possible to build a Linux kernel that doesn't include procfs. > > Why do we even have that button? > > > > > I think his point here was that he wanted a handle to procfs no matter where > > > it was mounted and then can later use openat on that. Agreed that it may be > > > unnecessary unless there is a usecase for it, and especially if the /proc > > > directory being the defacto mountpoint for procfs is a universal convention. > > > > If it's a universal convention and, in practice, everyone needs proc > > mounted anyway, so what's the harm in hardwiring CONFIG_PROCFS=y? If > > we advertise /proc as not merely some kind of optional debug interface > > but *the* way certain kernel features are exposed --- and there's > > nothing wrong with that --- then we should give programs access to > > these core kernel features in a way that doesn't depend on userspace > > kernel configuration, and you do that by either providing a > > procfs-root-getting system call or just hardwiring the "/proc/" prefix > > into VFS. > > > > > > Inode allocation from the procfs mount for the file descriptors Joel > > > > wants is not correct. Their not really procfs file descriptors so this > > > > is a nack. We can't just hook into proc that way. > > > > > > I was not particular about using procfs mount for the FDs but that's the only > > > way I knew how to do it until you pointed out anon_inode (my grep skills > > > missed that), so thank you! > > > > > > > > C'mon: /proc is used by everyone today and almost every program breaks > > > > > if it's not around. The string "/proc" is already de facto kernel ABI. > > > > > Let's just drop the pretense of /proc being optional and bake it into > > > > > the kernel proper, then give programs a way to get to /proc that isn't > > > > > tied to any particular mount configuration. This way, we don't need a > > > > > translate_pid(), since callers can just use procfs to do the same > > > > > thing. (That is, if I understand correctly what translate_pid does.) > > > > > > > > I'm not sure what you think translate_pid() is doing since you're not > > > > saying what you think it does. > > > > Examples from the old patchset: > > > > translate_pid(pid, ns, -1) - get pid in our pid namespace > > > > Ah, it's a bit different from what I had in mind. It's fair to want to > > translate PIDs between namespaces, but the only way to make the > > translate_pid under discussion robust is to have it accept and produce > > pidfds. (At that point, you might as well call it translate_pidfd.) We > > should not be adding new APIs to the kernel that accept numeric PIDs: > > The traditional pid-based api is not going away. There are users that > have the requirement to translate pids between namespaces and also doing > introspection on these namespaces independent of pidfds. We will not > restrict the usefulness of this syscall by making it only work with > pidfds. > > > it's not possible to use these APIs correctly except under very > > limited circumstances --- mostly, talking about init or a parent > > The pid-based api is one of the most widely used apis of the kernel and > people have been using it quite successfully for a long time. Yes, it's > rac, but it's here to stay. > > > talking about its child. > > > > Really, we need a few related operations, and we shouldn't necessarily > > mingle them. > > Yes, we've established that previously. > > > > > 1) Given a numeric PID, give me a pidfd: that works today: you just > > open /proc/<pid> > > Agreed. > > > > > 2) Given a pidfd, give me a numeric PID: that works today: you just > > openat(pidfd, "stat", O_RDONLY) and read the first token (which is > > always the numeric PID). > > Agreed. > > > > > 3) Given a pidfd, send a signal: that's what pidfd_send_signal does, > > and it's a good start on the rest of these operations. > > Agreed. > > > 5) Given a pidfd in NS1, get a pidfd in NS2. That's what translate_pid > > is for. My preferred signature for this routine is translate_pid(int > > pidfd, int nsfd) -> pidfd. We don't need two namespace arguments. Why > > not? Because the pidfd *already* names a single process, uniquely! > > Given that people are interested in pids we can't just always return a > pidfd. That would mean a user would need to do get the pidfd read from > <pidfd>/stat and then close the pidfd. If you do that for a 100 pids or > more you end up allocating and closing file descriptors constantly for > no reason. We can't just debate pids away. So it will also need to be > able to yield pids e.g. through a flag argument. Sure, but that's still not a reason that we should care about pidfds working separately from procfs. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design 2019-03-20 7:02 ` Daniel Colascione @ 2019-03-20 11:33 ` Joel Fernandes 0 siblings, 0 replies; 44+ messages in thread From: Joel Fernandes @ 2019-03-20 11:33 UTC (permalink / raw) To: Daniel Colascione, Christian Brauner Cc: Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm, kernel-team, Oleg Nesterov, Andy Lutomirski, Serge E. Hallyn, Kees Cook On March 20, 2019 3:02:32 AM EDT, Daniel Colascione <dancol@google.com> wrote: >On Tue, Mar 19, 2019 at 8:59 PM Christian Brauner ><christian@brauner.io> wrote: >> >> On Tue, Mar 19, 2019 at 07:42:52PM -0700, Daniel Colascione wrote: >> > On Tue, Mar 19, 2019 at 6:52 PM Joel Fernandes ><joel@joelfernandes.org> wrote: >> > > >> > > On Wed, Mar 20, 2019 at 12:10:23AM +0100, Christian Brauner >wrote: >> > > > On Tue, Mar 19, 2019 at 03:48:32PM -0700, Daniel Colascione >wrote: >> > > > > On Tue, Mar 19, 2019 at 3:14 PM Christian Brauner ><christian@brauner.io> wrote: >> > > > > > So I dislike the idea of allocating new inodes from the >procfs super >> > > > > > block. I would like to avoid pinning the whole pidfd >concept exclusively >> > > > > > to proc. The idea is that the pidfd API will be useable >through procfs >> > > > > > via open("/proc/<pid>") because that is what users expect >and really >> > > > > > wanted to have for a long time. So it makes sense to have >this working. >> > > > > > But it should really be useable without it. That's why >translate_pid() >> > > > > > and pidfd_clone() are on the table. What I'm saying is, >once the pidfd >> > > > > > api is "complete" you should be able to set CONFIG_PROCFS=N >- even >> > > > > > though that's crazy - and still be able to use pidfds. This >is also a >> > > > > > point akpm asked about when I did the pidfd_send_signal >work. >> > > > > >> > > > > I agree that you shouldn't need CONFIG_PROCFS=Y to use >pidfds. One >> > > > > crazy idea that I was discussing with Joel the other day is >to just >> > > > > make CONFIG_PROCFS=Y mandatory and provide a new >get_procfs_root() >> > > > > system call that returned, out of thin air and independent of >the >> > > > > mount table, a procfs root directory file descriptor for the >caller's >> > > > > PID namspace and suitable for use with openat(2). >> > > > >> > > > Even if this works I'm pretty sure that Al and a lot of others >will not >> > > > be happy about this. A syscall to get an fd to /proc? >> > >> > Why not? procfs provides access to a lot of core kernel >functionality. >> > Why should you need a mountpoint to get to it? >> > >> > > That's not going >> > > > to happen and I don't see the need for a separate syscall just >for that. >> > >> > We need a system call for the same reason we need a getrandom(2): >you >> > have to bootstrap somehow when you're in a minimal environment. >> > >> > > > (I do see the point of making CONFIG_PROCFS=y the default btw.) >> > >> > I'm not proposing that we make CONFIG_PROCFS=y the default. I'm >> > proposing that we *hardwire* it as the default and just declare >that >> > it's not possible to build a Linux kernel that doesn't include >procfs. >> > Why do we even have that button? >> > >> > > I think his point here was that he wanted a handle to procfs no >matter where >> > > it was mounted and then can later use openat on that. Agreed that >it may be >> > > unnecessary unless there is a usecase for it, and especially if >the /proc >> > > directory being the defacto mountpoint for procfs is a universal >convention. >> > >> > If it's a universal convention and, in practice, everyone needs >proc >> > mounted anyway, so what's the harm in hardwiring CONFIG_PROCFS=y? >If >> > we advertise /proc as not merely some kind of optional debug >interface >> > but *the* way certain kernel features are exposed --- and there's >> > nothing wrong with that --- then we should give programs access to >> > these core kernel features in a way that doesn't depend on >userspace >> > kernel configuration, and you do that by either providing a >> > procfs-root-getting system call or just hardwiring the "/proc/" >prefix >> > into VFS. >> > >> > > > Inode allocation from the procfs mount for the file descriptors >Joel >> > > > wants is not correct. Their not really procfs file descriptors >so this >> > > > is a nack. We can't just hook into proc that way. >> > > >> > > I was not particular about using procfs mount for the FDs but >that's the only >> > > way I knew how to do it until you pointed out anon_inode (my grep >skills >> > > missed that), so thank you! >> > > >> > > > > C'mon: /proc is used by everyone today and almost every >program breaks >> > > > > if it's not around. The string "/proc" is already de facto >kernel ABI. >> > > > > Let's just drop the pretense of /proc being optional and bake >it into >> > > > > the kernel proper, then give programs a way to get to /proc >that isn't >> > > > > tied to any particular mount configuration. This way, we >don't need a >> > > > > translate_pid(), since callers can just use procfs to do the >same >> > > > > thing. (That is, if I understand correctly what translate_pid >does.) >> > > > >> > > > I'm not sure what you think translate_pid() is doing since >you're not >> > > > saying what you think it does. >> > > > Examples from the old patchset: >> > > > translate_pid(pid, ns, -1) - get pid in our pid namespace >> > >> > Ah, it's a bit different from what I had in mind. It's fair to want >to >> > translate PIDs between namespaces, but the only way to make the >> > translate_pid under discussion robust is to have it accept and >produce >> > pidfds. (At that point, you might as well call it translate_pidfd.) >We >> > should not be adding new APIs to the kernel that accept numeric >PIDs: >> >> The traditional pid-based api is not going away. There are users that >> have the requirement to translate pids between namespaces and also >doing >> introspection on these namespaces independent of pidfds. We will not >> restrict the usefulness of this syscall by making it only work with >> pidfds. >> >> > it's not possible to use these APIs correctly except under very >> > limited circumstances --- mostly, talking about init or a parent >> >> The pid-based api is one of the most widely used apis of the kernel >and >> people have been using it quite successfully for a long time. Yes, >it's >> rac, but it's here to stay. >> >> > talking about its child. >> > >> > Really, we need a few related operations, and we shouldn't >necessarily >> > mingle them. >> >> Yes, we've established that previously. >> >> > >> > 1) Given a numeric PID, give me a pidfd: that works today: you just >> > open /proc/<pid> >> >> Agreed. >> >> > >> > 2) Given a pidfd, give me a numeric PID: that works today: you just >> > openat(pidfd, "stat", O_RDONLY) and read the first token (which is >> > always the numeric PID). >> >> Agreed. >> >> > >> > 3) Given a pidfd, send a signal: that's what pidfd_send_signal >does, >> > and it's a good start on the rest of these operations. >> >> Agreed. >> >> > 5) Given a pidfd in NS1, get a pidfd in NS2. That's what >translate_pid >> > is for. My preferred signature for this routine is >translate_pid(int >> > pidfd, int nsfd) -> pidfd. We don't need two namespace arguments. >Why >> > not? Because the pidfd *already* names a single process, uniquely! >> >> Given that people are interested in pids we can't just always return >a >> pidfd. That would mean a user would need to do get the pidfd read >from >> <pidfd>/stat and then close the pidfd. If you do that for a 100 pids >or >> more you end up allocating and closing file descriptors constantly >for >> no reason. We can't just debate pids away. So it will also need to be >> able to yield pids e.g. through a flag argument. > >Sure, but that's still not a reason that we should care about pidfds >working separately from procfs.. Agreed. I can't imagine pidfd being anything but a proc pid directory handle. So I am confused what Christian meant. Pidfd *is* a procfs directory fid always. That's what I gathered from his pidfd_send_signal patch but let me know if I'm way off in the woods. For my next revision, I am thinking of adding the flag argument Christian mentioned to make translate_pid return an anon_inode FD which can be used for death status, given a <pid>. Since it is thought that translate_pid can be made to return a pid FD, I think it is ok to have it return a pid status FD for the purposes of the death status as well. Joel Fernandes, Android kernel team Sent from k9-mail on Android ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design @ 2019-03-20 11:33 ` Joel Fernandes 0 siblings, 0 replies; 44+ messages in thread From: Joel Fernandes @ 2019-03-20 11:33 UTC (permalink / raw) To: Daniel Colascione, Christian Brauner Cc: Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm, kernel-team, Oleg Nesterov, Andy Lutomirski, Serge E. Hallyn, Kees Cook On March 20, 2019 3:02:32 AM EDT, Daniel Colascione <dancol@google.com> wrote: >On Tue, Mar 19, 2019 at 8:59 PM Christian Brauner ><christian@brauner.io> wrote: >> >> On Tue, Mar 19, 2019 at 07:42:52PM -0700, Daniel Colascione wrote: >> > On Tue, Mar 19, 2019 at 6:52 PM Joel Fernandes ><joel@joelfernandes.org> wrote: >> > > >> > > On Wed, Mar 20, 2019 at 12:10:23AM +0100, Christian Brauner >wrote: >> > > > On Tue, Mar 19, 2019 at 03:48:32PM -0700, Daniel Colascione >wrote: >> > > > > On Tue, Mar 19, 2019 at 3:14 PM Christian Brauner ><christian@brauner.io> wrote: >> > > > > > So I dislike the idea of allocating new inodes from the >procfs super >> > > > > > block. I would like to avoid pinning the whole pidfd >concept exclusively >> > > > > > to proc. The idea is that the pidfd API will be useable >through procfs >> > > > > > via open("/proc/<pid>") because that is what users expect >and really >> > > > > > wanted to have for a long time. So it makes sense to have >this working. >> > > > > > But it should really be useable without it. That's why >translate_pid() >> > > > > > and pidfd_clone() are on the table. What I'm saying is, >once the pidfd >> > > > > > api is "complete" you should be able to set CONFIG_PROCFS=N >- even >> > > > > > though that's crazy - and still be able to use pidfds. This >is also a >> > > > > > point akpm asked about when I did the pidfd_send_signal >work. >> > > > > >> > > > > I agree that you shouldn't need CONFIG_PROCFS=Y to use >pidfds. One >> > > > > crazy idea that I was discussing with Joel the other day is >to just >> > > > > make CONFIG_PROCFS=Y mandatory and provide a new >get_procfs_root() >> > > > > system call that returned, out of thin air and independent of >the >> > > > > mount table, a procfs root directory file descriptor for the >caller's >> > > > > PID namspace and suitable for use with openat(2). >> > > > >> > > > Even if this works I'm pretty sure that Al and a lot of others >will not >> > > > be happy about this. A syscall to get an fd to /proc? >> > >> > Why not? procfs provides access to a lot of core kernel >functionality. >> > Why should you need a mountpoint to get to it? >> > >> > > That's not going >> > > > to happen and I don't see the need for a separate syscall just >for that. >> > >> > We need a system call for the same reason we need a getrandom(2): >you >> > have to bootstrap somehow when you're in a minimal environment. >> > >> > > > (I do see the point of making CONFIG_PROCFS=y the default btw.) >> > >> > I'm not proposing that we make CONFIG_PROCFS=y the default. I'm >> > proposing that we *hardwire* it as the default and just declare >that >> > it's not possible to build a Linux kernel that doesn't include >procfs. >> > Why do we even have that button? >> > >> > > I think his point here was that he wanted a handle to procfs no >matter where >> > > it was mounted and then can later use openat on that. Agreed that >it may be >> > > unnecessary unless there is a usecase for it, and especially if >the /proc >> > > directory being the defacto mountpoint for procfs is a universal >convention. >> > >> > If it's a universal convention and, in practice, everyone needs >proc >> > mounted anyway, so what's the harm in hardwiring CONFIG_PROCFS=y? >If >> > we advertise /proc as not merely some kind of optional debug >interface >> > but *the* way certain kernel features are exposed --- and there's >> > nothing wrong with that --- then we should give programs access to >> > these core kernel features in a way that doesn't depend on >userspace >> > kernel configuration, and you do that by either providing a >> > procfs-root-getting system call or just hardwiring the "/proc/" >prefix >> > into VFS. >> > >> > > > Inode allocation from the procfs mount for the file descriptors >Joel >> > > > wants is not correct. Their not really procfs file descriptors >so this >> > > > is a nack. We can't just hook into proc that way. >> > > >> > > I was not particular about using procfs mount for the FDs but >that's the only >> > > way I knew how to do it until you pointed out anon_inode (my grep >skills >> > > missed that), so thank you! >> > > >> > > > > C'mon: /proc is used by everyone today and almost every >program breaks >> > > > > if it's not around. The string "/proc" is already de facto >kernel ABI. >> > > > > Let's just drop the pretense of /proc being optional and bake >it into >> > > > > the kernel proper, then give programs a way to get to /proc >that isn't >> > > > > tied to any particular mount configuration. This way, we >don't need a >> > > > > translate_pid(), since callers can just use procfs to do the >same >> > > > > thing. (That is, if I understand correctly what translate_pid >does.) >> > > > >> > > > I'm not sure what you think translate_pid() is doing since >you're not >> > > > saying what you think it does. >> > > > Examples from the old patchset: >> > > > translate_pid(pid, ns, -1) - get pid in our pid namespace >> > >> > Ah, it's a bit different from what I had in mind. It's fair to want >to >> > translate PIDs between namespaces, but the only way to make the >> > translate_pid under discussion robust is to have it accept and >produce >> > pidfds. (At that point, you might as well call it translate_pidfd.) >We >> > should not be adding new APIs to the kernel that accept numeric >PIDs: >> >> The traditional pid-based api is not going away. There are users that >> have the requirement to translate pids between namespaces and also >doing >> introspection on these namespaces independent of pidfds. We will not >> restrict the usefulness of this syscall by making it only work with >> pidfds. >> >> > it's not possible to use these APIs correctly except under very >> > limited circumstances --- mostly, talking about init or a parent >> >> The pid-based api is one of the most widely used apis of the kernel >and >> people have been using it quite successfully for a long time. Yes, >it's >> rac, but it's here to stay. >> >> > talking about its child. >> > >> > Really, we need a few related operations, and we shouldn't >necessarily >> > mingle them. >> >> Yes, we've established that previously. >> >> > >> > 1) Given a numeric PID, give me a pidfd: that works today: you just >> > open /proc/<pid> >> >> Agreed. >> >> > >> > 2) Given a pidfd, give me a numeric PID: that works today: you just >> > openat(pidfd, "stat", O_RDONLY) and read the first token (which is >> > always the numeric PID). >> >> Agreed. >> >> > >> > 3) Given a pidfd, send a signal: that's what pidfd_send_signal >does, >> > and it's a good start on the rest of these operations. >> >> Agreed. >> >> > 5) Given a pidfd in NS1, get a pidfd in NS2. That's what >translate_pid >> > is for. My preferred signature for this routine is >translate_pid(int >> > pidfd, int nsfd) -> pidfd. We don't need two namespace arguments. >Why >> > not? Because the pidfd *already* names a single process, uniquely! >> >> Given that people are interested in pids we can't just always return >a >> pidfd. That would mean a user would need to do get the pidfd read >from >> <pidfd>/stat and then close the pidfd. If you do that for a 100 pids >or >> more you end up allocating and closing file descriptors constantly >for >> no reason. We can't just debate pids away. So it will also need to be >> able to yield pids e.g. through a flag argument. > >Sure, but that's still not a reason that we should care about pidfds >working separately from procfs.. Agreed. I can't imagine pidfd being anything but a proc pid directory handle. So I am confused what Christian meant. Pidfd *is* a procfs directory fid always. That's what I gathered from his pidfd_send_signal patch but let me know if I'm way off in the woods. For my next revision, I am thinking of adding the flag argument Christian mentioned to make translate_pid return an anon_inode FD which can be used for death status, given a <pid>. Since it is thought that translate_pid can be made to return a pid FD, I think it is ok to have it return a pid status FD for the purposes of the death status as well. Joel Fernandes, Android kernel team Sent from k9-mail on Android ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design 2019-03-20 11:33 ` Joel Fernandes (?) @ 2019-03-20 18:26 ` Christian Brauner 2019-03-20 18:38 ` Daniel Colascione 2019-03-20 19:11 ` Joel Fernandes -1 siblings, 2 replies; 44+ messages in thread From: Christian Brauner @ 2019-03-20 18:26 UTC (permalink / raw) To: Joel Fernandes Cc: Daniel Colascione, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm, kernel-team, Oleg Nesterov, Andy Lutomirski, Serge E. Hallyn, Kees Cook On Wed, Mar 20, 2019 at 07:33:51AM -0400, Joel Fernandes wrote: > > > On March 20, 2019 3:02:32 AM EDT, Daniel Colascione <dancol@google.com> wrote: > >On Tue, Mar 19, 2019 at 8:59 PM Christian Brauner > ><christian@brauner.io> wrote: > >> > >> On Tue, Mar 19, 2019 at 07:42:52PM -0700, Daniel Colascione wrote: > >> > On Tue, Mar 19, 2019 at 6:52 PM Joel Fernandes > ><joel@joelfernandes.org> wrote: > >> > > > >> > > On Wed, Mar 20, 2019 at 12:10:23AM +0100, Christian Brauner > >wrote: > >> > > > On Tue, Mar 19, 2019 at 03:48:32PM -0700, Daniel Colascione > >wrote: > >> > > > > On Tue, Mar 19, 2019 at 3:14 PM Christian Brauner > ><christian@brauner.io> wrote: > >> > > > > > So I dislike the idea of allocating new inodes from the > >procfs super > >> > > > > > block. I would like to avoid pinning the whole pidfd > >concept exclusively > >> > > > > > to proc. The idea is that the pidfd API will be useable > >through procfs > >> > > > > > via open("/proc/<pid>") because that is what users expect > >and really > >> > > > > > wanted to have for a long time. So it makes sense to have > >this working. > >> > > > > > But it should really be useable without it. That's why > >translate_pid() > >> > > > > > and pidfd_clone() are on the table. What I'm saying is, > >once the pidfd > >> > > > > > api is "complete" you should be able to set CONFIG_PROCFS=N > >- even > >> > > > > > though that's crazy - and still be able to use pidfds. This > >is also a > >> > > > > > point akpm asked about when I did the pidfd_send_signal > >work. > >> > > > > > >> > > > > I agree that you shouldn't need CONFIG_PROCFS=Y to use > >pidfds. One > >> > > > > crazy idea that I was discussing with Joel the other day is > >to just > >> > > > > make CONFIG_PROCFS=Y mandatory and provide a new > >get_procfs_root() > >> > > > > system call that returned, out of thin air and independent of > >the > >> > > > > mount table, a procfs root directory file descriptor for the > >caller's > >> > > > > PID namspace and suitable for use with openat(2). > >> > > > > >> > > > Even if this works I'm pretty sure that Al and a lot of others > >will not > >> > > > be happy about this. A syscall to get an fd to /proc? > >> > > >> > Why not? procfs provides access to a lot of core kernel > >functionality. > >> > Why should you need a mountpoint to get to it? > >> > > >> > > That's not going > >> > > > to happen and I don't see the need for a separate syscall just > >for that. > >> > > >> > We need a system call for the same reason we need a getrandom(2): > >you > >> > have to bootstrap somehow when you're in a minimal environment. > >> > > >> > > > (I do see the point of making CONFIG_PROCFS=y the default btw.) > >> > > >> > I'm not proposing that we make CONFIG_PROCFS=y the default. I'm > >> > proposing that we *hardwire* it as the default and just declare > >that > >> > it's not possible to build a Linux kernel that doesn't include > >procfs. > >> > Why do we even have that button? > >> > > >> > > I think his point here was that he wanted a handle to procfs no > >matter where > >> > > it was mounted and then can later use openat on that. Agreed that > >it may be > >> > > unnecessary unless there is a usecase for it, and especially if > >the /proc > >> > > directory being the defacto mountpoint for procfs is a universal > >convention. > >> > > >> > If it's a universal convention and, in practice, everyone needs > >proc > >> > mounted anyway, so what's the harm in hardwiring CONFIG_PROCFS=y? > >If > >> > we advertise /proc as not merely some kind of optional debug > >interface > >> > but *the* way certain kernel features are exposed --- and there's > >> > nothing wrong with that --- then we should give programs access to > >> > these core kernel features in a way that doesn't depend on > >userspace > >> > kernel configuration, and you do that by either providing a > >> > procfs-root-getting system call or just hardwiring the "/proc/" > >prefix > >> > into VFS. > >> > > >> > > > Inode allocation from the procfs mount for the file descriptors > >Joel > >> > > > wants is not correct. Their not really procfs file descriptors > >so this > >> > > > is a nack. We can't just hook into proc that way. > >> > > > >> > > I was not particular about using procfs mount for the FDs but > >that's the only > >> > > way I knew how to do it until you pointed out anon_inode (my grep > >skills > >> > > missed that), so thank you! > >> > > > >> > > > > C'mon: /proc is used by everyone today and almost every > >program breaks > >> > > > > if it's not around. The string "/proc" is already de facto > >kernel ABI. > >> > > > > Let's just drop the pretense of /proc being optional and bake > >it into > >> > > > > the kernel proper, then give programs a way to get to /proc > >that isn't > >> > > > > tied to any particular mount configuration. This way, we > >don't need a > >> > > > > translate_pid(), since callers can just use procfs to do the > >same > >> > > > > thing. (That is, if I understand correctly what translate_pid > >does.) > >> > > > > >> > > > I'm not sure what you think translate_pid() is doing since > >you're not > >> > > > saying what you think it does. > >> > > > Examples from the old patchset: > >> > > > translate_pid(pid, ns, -1) - get pid in our pid namespace > >> > > >> > Ah, it's a bit different from what I had in mind. It's fair to want > >to > >> > translate PIDs between namespaces, but the only way to make the > >> > translate_pid under discussion robust is to have it accept and > >produce > >> > pidfds. (At that point, you might as well call it translate_pidfd.) > >We > >> > should not be adding new APIs to the kernel that accept numeric > >PIDs: > >> > >> The traditional pid-based api is not going away. There are users that > >> have the requirement to translate pids between namespaces and also > >doing > >> introspection on these namespaces independent of pidfds. We will not > >> restrict the usefulness of this syscall by making it only work with > >> pidfds. > >> > >> > it's not possible to use these APIs correctly except under very > >> > limited circumstances --- mostly, talking about init or a parent > >> > >> The pid-based api is one of the most widely used apis of the kernel > >and > >> people have been using it quite successfully for a long time. Yes, > >it's > >> rac, but it's here to stay. > >> > >> > talking about its child. > >> > > >> > Really, we need a few related operations, and we shouldn't > >necessarily > >> > mingle them. > >> > >> Yes, we've established that previously. > >> > >> > > >> > 1) Given a numeric PID, give me a pidfd: that works today: you just > >> > open /proc/<pid> > >> > >> Agreed. > >> > >> > > >> > 2) Given a pidfd, give me a numeric PID: that works today: you just > >> > openat(pidfd, "stat", O_RDONLY) and read the first token (which is > >> > always the numeric PID). > >> > >> Agreed. > >> > >> > > >> > 3) Given a pidfd, send a signal: that's what pidfd_send_signal > >does, > >> > and it's a good start on the rest of these operations. > >> > >> Agreed. > >> > >> > 5) Given a pidfd in NS1, get a pidfd in NS2. That's what > >translate_pid > >> > is for. My preferred signature for this routine is > >translate_pid(int > >> > pidfd, int nsfd) -> pidfd. We don't need two namespace arguments. > >Why > >> > not? Because the pidfd *already* names a single process, uniquely! > >> > >> Given that people are interested in pids we can't just always return > >a > >> pidfd. That would mean a user would need to do get the pidfd read > >from > >> <pidfd>/stat and then close the pidfd. If you do that for a 100 pids > >or > >> more you end up allocating and closing file descriptors constantly > >for > >> no reason. We can't just debate pids away. So it will also need to be > >> able to yield pids e.g. through a flag argument. > > > >Sure, but that's still not a reason that we should care about pidfds > >working separately from procfs.. That's unrelated to the point made in the above paragraph. Please note, I said that the pidfd api should work when proc is not available not that they can't be dirfds. > > Agreed. I can't imagine pidfd being anything but a proc pid directory handle. So I am confused what Christian meant. Pidfd *is* a procfs directory fid always. That's what I gathered from his pidfd_send_signal patch but let me know if I'm way off in the woods. (K9 Mail still hasn't learned to wrap lines at 80 it seems. :)) Again, I never said that pidfds should be a directory handle. (Though I would like to point out that one of the original ideas I discussed at LPC was to have something like this to get regular file descriptors instead of dirfds: https://gist.github.com/brauner/59eec91550c5624c9999eaebd95a70df) > > For my next revision, I am thinking of adding the flag argument Christian mentioned to make translate_pid return an anon_inode FD which can be used for death status, given a <pid>. Since it is thought that translate_pid can be made to return a pid FD, I think it is ok to have it return a pid status FD for the purposes of the death status as well. translate_pid() should just return you a pidfd. Having it return a pidfd and a status fd feels like stuffing too much functionality in there. If you're fine with it I'll finish prototyping what I had in mind. As I said in previous mails I'm already working on this. Would you be ok with prototyping the pidfd_wait() syscall you had in mind? Especially the wait_fd part that you want to have I would like to see how that is supposed to work, e.g. who is allowed to wait on the process and how notifications will work for non-parent processes and so on. I feel we won't get anywhere by talking in the abstrace and other people are far more likely to review/comment once there's actual code. Christian ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design 2019-03-20 18:26 ` Christian Brauner @ 2019-03-20 18:38 ` Daniel Colascione 2019-03-20 18:51 ` Christian Brauner 2019-03-20 19:11 ` Joel Fernandes 1 sibling, 1 reply; 44+ messages in thread From: Daniel Colascione @ 2019-03-20 18:38 UTC (permalink / raw) To: Christian Brauner Cc: Joel Fernandes, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm, kernel-team, Oleg Nesterov, Andy Lutomirski, Serge E. Hallyn, Kees Cook On Wed, Mar 20, 2019 at 11:26 AM Christian Brauner <christian@brauner.io> wrote: > On Wed, Mar 20, 2019 at 07:33:51AM -0400, Joel Fernandes wrote: > > > > > > On March 20, 2019 3:02:32 AM EDT, Daniel Colascione <dancol@google.com> wrote: > > >On Tue, Mar 19, 2019 at 8:59 PM Christian Brauner > > ><christian@brauner.io> wrote: > > >> > > >> On Tue, Mar 19, 2019 at 07:42:52PM -0700, Daniel Colascione wrote: > > >> > On Tue, Mar 19, 2019 at 6:52 PM Joel Fernandes > > ><joel@joelfernandes.org> wrote: > > >> > > > > >> > > On Wed, Mar 20, 2019 at 12:10:23AM +0100, Christian Brauner > > >wrote: > > >> > > > On Tue, Mar 19, 2019 at 03:48:32PM -0700, Daniel Colascione > > >wrote: > > >> > > > > On Tue, Mar 19, 2019 at 3:14 PM Christian Brauner > > ><christian@brauner.io> wrote: > > >> > > > > > So I dislike the idea of allocating new inodes from the > > >procfs super > > >> > > > > > block. I would like to avoid pinning the whole pidfd > > >concept exclusively > > >> > > > > > to proc. The idea is that the pidfd API will be useable > > >through procfs > > >> > > > > > via open("/proc/<pid>") because that is what users expect > > >and really > > >> > > > > > wanted to have for a long time. So it makes sense to have > > >this working. > > >> > > > > > But it should really be useable without it. That's why > > >translate_pid() > > >> > > > > > and pidfd_clone() are on the table. What I'm saying is, > > >once the pidfd > > >> > > > > > api is "complete" you should be able to set CONFIG_PROCFS=N > > >- even > > >> > > > > > though that's crazy - and still be able to use pidfds. This > > >is also a > > >> > > > > > point akpm asked about when I did the pidfd_send_signal > > >work. > > >> > > > > > > >> > > > > I agree that you shouldn't need CONFIG_PROCFS=Y to use > > >pidfds. One > > >> > > > > crazy idea that I was discussing with Joel the other day is > > >to just > > >> > > > > make CONFIG_PROCFS=Y mandatory and provide a new > > >get_procfs_root() > > >> > > > > system call that returned, out of thin air and independent of > > >the > > >> > > > > mount table, a procfs root directory file descriptor for the > > >caller's > > >> > > > > PID namspace and suitable for use with openat(2). > > >> > > > > > >> > > > Even if this works I'm pretty sure that Al and a lot of others > > >will not > > >> > > > be happy about this. A syscall to get an fd to /proc? > > >> > > > >> > Why not? procfs provides access to a lot of core kernel > > >functionality. > > >> > Why should you need a mountpoint to get to it? > > >> > > > >> > > That's not going > > >> > > > to happen and I don't see the need for a separate syscall just > > >for that. > > >> > > > >> > We need a system call for the same reason we need a getrandom(2): > > >you > > >> > have to bootstrap somehow when you're in a minimal environment. > > >> > > > >> > > > (I do see the point of making CONFIG_PROCFS=y the default btw.) > > >> > > > >> > I'm not proposing that we make CONFIG_PROCFS=y the default. I'm > > >> > proposing that we *hardwire* it as the default and just declare > > >that > > >> > it's not possible to build a Linux kernel that doesn't include > > >procfs. > > >> > Why do we even have that button? > > >> > > > >> > > I think his point here was that he wanted a handle to procfs no > > >matter where > > >> > > it was mounted and then can later use openat on that. Agreed that > > >it may be > > >> > > unnecessary unless there is a usecase for it, and especially if > > >the /proc > > >> > > directory being the defacto mountpoint for procfs is a universal > > >convention. > > >> > > > >> > If it's a universal convention and, in practice, everyone needs > > >proc > > >> > mounted anyway, so what's the harm in hardwiring CONFIG_PROCFS=y? > > >If > > >> > we advertise /proc as not merely some kind of optional debug > > >interface > > >> > but *the* way certain kernel features are exposed --- and there's > > >> > nothing wrong with that --- then we should give programs access to > > >> > these core kernel features in a way that doesn't depend on > > >userspace > > >> > kernel configuration, and you do that by either providing a > > >> > procfs-root-getting system call or just hardwiring the "/proc/" > > >prefix > > >> > into VFS. > > >> > > > >> > > > Inode allocation from the procfs mount for the file descriptors > > >Joel > > >> > > > wants is not correct. Their not really procfs file descriptors > > >so this > > >> > > > is a nack. We can't just hook into proc that way. > > >> > > > > >> > > I was not particular about using procfs mount for the FDs but > > >that's the only > > >> > > way I knew how to do it until you pointed out anon_inode (my grep > > >skills > > >> > > missed that), so thank you! > > >> > > > > >> > > > > C'mon: /proc is used by everyone today and almost every > > >program breaks > > >> > > > > if it's not around. The string "/proc" is already de facto > > >kernel ABI. > > >> > > > > Let's just drop the pretense of /proc being optional and bake > > >it into > > >> > > > > the kernel proper, then give programs a way to get to /proc > > >that isn't > > >> > > > > tied to any particular mount configuration. This way, we > > >don't need a > > >> > > > > translate_pid(), since callers can just use procfs to do the > > >same > > >> > > > > thing. (That is, if I understand correctly what translate_pid > > >does.) > > >> > > > > > >> > > > I'm not sure what you think translate_pid() is doing since > > >you're not > > >> > > > saying what you think it does. > > >> > > > Examples from the old patchset: > > >> > > > translate_pid(pid, ns, -1) - get pid in our pid namespace > > >> > > > >> > Ah, it's a bit different from what I had in mind. It's fair to want > > >to > > >> > translate PIDs between namespaces, but the only way to make the > > >> > translate_pid under discussion robust is to have it accept and > > >produce > > >> > pidfds. (At that point, you might as well call it translate_pidfd.) > > >We > > >> > should not be adding new APIs to the kernel that accept numeric > > >PIDs: > > >> > > >> The traditional pid-based api is not going away. There are users that > > >> have the requirement to translate pids between namespaces and also > > >doing > > >> introspection on these namespaces independent of pidfds. We will not > > >> restrict the usefulness of this syscall by making it only work with > > >> pidfds. > > >> > > >> > it's not possible to use these APIs correctly except under very > > >> > limited circumstances --- mostly, talking about init or a parent > > >> > > >> The pid-based api is one of the most widely used apis of the kernel > > >and > > >> people have been using it quite successfully for a long time. Yes, > > >it's > > >> rac, but it's here to stay. > > >> > > >> > talking about its child. > > >> > > > >> > Really, we need a few related operations, and we shouldn't > > >necessarily > > >> > mingle them. > > >> > > >> Yes, we've established that previously. > > >> > > >> > > > >> > 1) Given a numeric PID, give me a pidfd: that works today: you just > > >> > open /proc/<pid> > > >> > > >> Agreed. > > >> > > >> > > > >> > 2) Given a pidfd, give me a numeric PID: that works today: you just > > >> > openat(pidfd, "stat", O_RDONLY) and read the first token (which is > > >> > always the numeric PID). > > >> > > >> Agreed. > > >> > > >> > > > >> > 3) Given a pidfd, send a signal: that's what pidfd_send_signal > > >does, > > >> > and it's a good start on the rest of these operations. > > >> > > >> Agreed. > > >> > > >> > 5) Given a pidfd in NS1, get a pidfd in NS2. That's what > > >translate_pid > > >> > is for. My preferred signature for this routine is > > >translate_pid(int > > >> > pidfd, int nsfd) -> pidfd. We don't need two namespace arguments. > > >Why > > >> > not? Because the pidfd *already* names a single process, uniquely! > > >> > > >> Given that people are interested in pids we can't just always return > > >a > > >> pidfd. That would mean a user would need to do get the pidfd read > > >from > > >> <pidfd>/stat and then close the pidfd. If you do that for a 100 pids > > >or > > >> more you end up allocating and closing file descriptors constantly > > >for > > >> no reason. We can't just debate pids away. So it will also need to be > > >> able to yield pids e.g. through a flag argument. > > > > > >Sure, but that's still not a reason that we should care about pidfds > > >working separately from procfs.. > > That's unrelated to the point made in the above paragraph. > Please note, I said that the pidfd api should work when proc is not > available not that they can't be dirfds. What do you mean by "not available"? CONFIG_PROCFS=n? If pidfds supposed to work when proc is unavailable yet also be directory FDs, to what directory should the FD refer? As I mentioned in my previous message, trying to make pidfd work without CONFIG_PROCFS is a very bad idea. > > > > > Agreed. I can't imagine pidfd being anything but a proc pid directory handle. So I am confused what Christian meant. Pidfd *is* a procfs directory fid always. That's what I gathered from his pidfd_send_signal patch but let me know if I'm way off in the woods. > > (K9 Mail still hasn't learned to wrap lines at 80 it seems. :)) > > Again, I never said that pidfds should be a directory handle. > (Though I would like to point out that one of the original ideas I > discussed at LPC was to have something like this to get regular file > descriptors instead of dirfds: > https://gist.github.com/brauner/59eec91550c5624c9999eaebd95a70df) As I mentioned in my original email on this thread, if you have regular file descriptors instead of directory FDs, you have to use some special new API instead of openat to get metadata about a process. That's pointless duplication of functionality considering that a directory FD gives you that information automatically. > > For my next revision, I am thinking of adding the flag argument Christian mentioned to make translate_pid return an anon_inode FD which can be used for death status, given a <pid>. Since it is thought that translate_pid can be made to return a pid FD, I think it is ok to have it return a pid status FD for the purposes of the death status as well. > translate_pid() should just return you a pidfd. Having it return a pidfd > and a status fd feels like stuffing too much functionality in there. If > you're fine with it I'll finish prototyping what I had in mind. As I > said in previous mails I'm already working on this. translate_pid also needs to *accept* pidfds, at least optionally. Unless you have a function from pidfd to pidfd, you race. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design 2019-03-20 18:38 ` Daniel Colascione @ 2019-03-20 18:51 ` Christian Brauner 2019-03-20 18:58 ` Andy Lutomirski ` (2 more replies) 0 siblings, 3 replies; 44+ messages in thread From: Christian Brauner @ 2019-03-20 18:51 UTC (permalink / raw) To: Daniel Colascione Cc: Joel Fernandes, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm, kernel-team, Oleg Nesterov, Andy Lutomirski, Serge E. Hallyn, Kees Cook On Wed, Mar 20, 2019 at 11:38:35AM -0700, Daniel Colascione wrote: > On Wed, Mar 20, 2019 at 11:26 AM Christian Brauner <christian@brauner.io> wrote: > > On Wed, Mar 20, 2019 at 07:33:51AM -0400, Joel Fernandes wrote: > > > > > > > > > On March 20, 2019 3:02:32 AM EDT, Daniel Colascione <dancol@google.com> wrote: > > > >On Tue, Mar 19, 2019 at 8:59 PM Christian Brauner > > > ><christian@brauner.io> wrote: > > > >> > > > >> On Tue, Mar 19, 2019 at 07:42:52PM -0700, Daniel Colascione wrote: > > > >> > On Tue, Mar 19, 2019 at 6:52 PM Joel Fernandes > > > ><joel@joelfernandes.org> wrote: > > > >> > > > > > >> > > On Wed, Mar 20, 2019 at 12:10:23AM +0100, Christian Brauner > > > >wrote: > > > >> > > > On Tue, Mar 19, 2019 at 03:48:32PM -0700, Daniel Colascione > > > >wrote: > > > >> > > > > On Tue, Mar 19, 2019 at 3:14 PM Christian Brauner > > > ><christian@brauner.io> wrote: > > > >> > > > > > So I dislike the idea of allocating new inodes from the > > > >procfs super > > > >> > > > > > block. I would like to avoid pinning the whole pidfd > > > >concept exclusively > > > >> > > > > > to proc. The idea is that the pidfd API will be useable > > > >through procfs > > > >> > > > > > via open("/proc/<pid>") because that is what users expect > > > >and really > > > >> > > > > > wanted to have for a long time. So it makes sense to have > > > >this working. > > > >> > > > > > But it should really be useable without it. That's why > > > >translate_pid() > > > >> > > > > > and pidfd_clone() are on the table. What I'm saying is, > > > >once the pidfd > > > >> > > > > > api is "complete" you should be able to set CONFIG_PROCFS=N > > > >- even > > > >> > > > > > though that's crazy - and still be able to use pidfds. This > > > >is also a > > > >> > > > > > point akpm asked about when I did the pidfd_send_signal > > > >work. > > > >> > > > > > > > >> > > > > I agree that you shouldn't need CONFIG_PROCFS=Y to use > > > >pidfds. One > > > >> > > > > crazy idea that I was discussing with Joel the other day is > > > >to just > > > >> > > > > make CONFIG_PROCFS=Y mandatory and provide a new > > > >get_procfs_root() > > > >> > > > > system call that returned, out of thin air and independent of > > > >the > > > >> > > > > mount table, a procfs root directory file descriptor for the > > > >caller's > > > >> > > > > PID namspace and suitable for use with openat(2). > > > >> > > > > > > >> > > > Even if this works I'm pretty sure that Al and a lot of others > > > >will not > > > >> > > > be happy about this. A syscall to get an fd to /proc? > > > >> > > > > >> > Why not? procfs provides access to a lot of core kernel > > > >functionality. > > > >> > Why should you need a mountpoint to get to it? > > > >> > > > > >> > > That's not going > > > >> > > > to happen and I don't see the need for a separate syscall just > > > >for that. > > > >> > > > > >> > We need a system call for the same reason we need a getrandom(2): > > > >you > > > >> > have to bootstrap somehow when you're in a minimal environment. > > > >> > > > > >> > > > (I do see the point of making CONFIG_PROCFS=y the default btw.) > > > >> > > > > >> > I'm not proposing that we make CONFIG_PROCFS=y the default. I'm > > > >> > proposing that we *hardwire* it as the default and just declare > > > >that > > > >> > it's not possible to build a Linux kernel that doesn't include > > > >procfs. > > > >> > Why do we even have that button? > > > >> > > > > >> > > I think his point here was that he wanted a handle to procfs no > > > >matter where > > > >> > > it was mounted and then can later use openat on that. Agreed that > > > >it may be > > > >> > > unnecessary unless there is a usecase for it, and especially if > > > >the /proc > > > >> > > directory being the defacto mountpoint for procfs is a universal > > > >convention. > > > >> > > > > >> > If it's a universal convention and, in practice, everyone needs > > > >proc > > > >> > mounted anyway, so what's the harm in hardwiring CONFIG_PROCFS=y? > > > >If > > > >> > we advertise /proc as not merely some kind of optional debug > > > >interface > > > >> > but *the* way certain kernel features are exposed --- and there's > > > >> > nothing wrong with that --- then we should give programs access to > > > >> > these core kernel features in a way that doesn't depend on > > > >userspace > > > >> > kernel configuration, and you do that by either providing a > > > >> > procfs-root-getting system call or just hardwiring the "/proc/" > > > >prefix > > > >> > into VFS. > > > >> > > > > >> > > > Inode allocation from the procfs mount for the file descriptors > > > >Joel > > > >> > > > wants is not correct. Their not really procfs file descriptors > > > >so this > > > >> > > > is a nack. We can't just hook into proc that way. > > > >> > > > > > >> > > I was not particular about using procfs mount for the FDs but > > > >that's the only > > > >> > > way I knew how to do it until you pointed out anon_inode (my grep > > > >skills > > > >> > > missed that), so thank you! > > > >> > > > > > >> > > > > C'mon: /proc is used by everyone today and almost every > > > >program breaks > > > >> > > > > if it's not around. The string "/proc" is already de facto > > > >kernel ABI. > > > >> > > > > Let's just drop the pretense of /proc being optional and bake > > > >it into > > > >> > > > > the kernel proper, then give programs a way to get to /proc > > > >that isn't > > > >> > > > > tied to any particular mount configuration. This way, we > > > >don't need a > > > >> > > > > translate_pid(), since callers can just use procfs to do the > > > >same > > > >> > > > > thing. (That is, if I understand correctly what translate_pid > > > >does.) > > > >> > > > > > > >> > > > I'm not sure what you think translate_pid() is doing since > > > >you're not > > > >> > > > saying what you think it does. > > > >> > > > Examples from the old patchset: > > > >> > > > translate_pid(pid, ns, -1) - get pid in our pid namespace > > > >> > > > > >> > Ah, it's a bit different from what I had in mind. It's fair to want > > > >to > > > >> > translate PIDs between namespaces, but the only way to make the > > > >> > translate_pid under discussion robust is to have it accept and > > > >produce > > > >> > pidfds. (At that point, you might as well call it translate_pidfd.) > > > >We > > > >> > should not be adding new APIs to the kernel that accept numeric > > > >PIDs: > > > >> > > > >> The traditional pid-based api is not going away. There are users that > > > >> have the requirement to translate pids between namespaces and also > > > >doing > > > >> introspection on these namespaces independent of pidfds. We will not > > > >> restrict the usefulness of this syscall by making it only work with > > > >> pidfds. > > > >> > > > >> > it's not possible to use these APIs correctly except under very > > > >> > limited circumstances --- mostly, talking about init or a parent > > > >> > > > >> The pid-based api is one of the most widely used apis of the kernel > > > >and > > > >> people have been using it quite successfully for a long time. Yes, > > > >it's > > > >> rac, but it's here to stay. > > > >> > > > >> > talking about its child. > > > >> > > > > >> > Really, we need a few related operations, and we shouldn't > > > >necessarily > > > >> > mingle them. > > > >> > > > >> Yes, we've established that previously. > > > >> > > > >> > > > > >> > 1) Given a numeric PID, give me a pidfd: that works today: you just > > > >> > open /proc/<pid> > > > >> > > > >> Agreed. > > > >> > > > >> > > > > >> > 2) Given a pidfd, give me a numeric PID: that works today: you just > > > >> > openat(pidfd, "stat", O_RDONLY) and read the first token (which is > > > >> > always the numeric PID). > > > >> > > > >> Agreed. > > > >> > > > >> > > > > >> > 3) Given a pidfd, send a signal: that's what pidfd_send_signal > > > >does, > > > >> > and it's a good start on the rest of these operations. > > > >> > > > >> Agreed. > > > >> > > > >> > 5) Given a pidfd in NS1, get a pidfd in NS2. That's what > > > >translate_pid > > > >> > is for. My preferred signature for this routine is > > > >translate_pid(int > > > >> > pidfd, int nsfd) -> pidfd. We don't need two namespace arguments. > > > >Why > > > >> > not? Because the pidfd *already* names a single process, uniquely! > > > >> > > > >> Given that people are interested in pids we can't just always return > > > >a > > > >> pidfd. That would mean a user would need to do get the pidfd read > > > >from > > > >> <pidfd>/stat and then close the pidfd. If you do that for a 100 pids > > > >or > > > >> more you end up allocating and closing file descriptors constantly > > > >for > > > >> no reason. We can't just debate pids away. So it will also need to be > > > >> able to yield pids e.g. through a flag argument. > > > > > > > >Sure, but that's still not a reason that we should care about pidfds > > > >working separately from procfs.. > > > > That's unrelated to the point made in the above paragraph. > > Please note, I said that the pidfd api should work when proc is not > > available not that they can't be dirfds. > > What do you mean by "not available"? CONFIG_PROCFS=n? If pidfds I'm talking about the ability to clone processes and get fd handles on them via pidfd_clone() or CLONE_NEWFD. > > > translate_pid() should just return you a pidfd. Having it return a pidfd > > and a status fd feels like stuffing too much functionality in there. If > > you're fine with it I'll finish prototyping what I had in mind. As I > > said in previous mails I'm already working on this. > > translate_pid also needs to *accept* pidfds, at least optionally. > Unless you have a function from pidfd to pidfd, you race. You're misunderstanding. Again, I said in my previous mails it should accept pidfds optionally as arguments, yes. But I don't want it to return the status fds that you previously wanted pidfd_wait() to return. I really want to see Joel's pidfd_wait() patchset and have more people review the actual code. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design 2019-03-20 18:51 ` Christian Brauner @ 2019-03-20 18:58 ` Andy Lutomirski 2019-03-20 19:14 ` Christian Brauner 2019-03-20 19:19 ` Joel Fernandes 2019-03-20 19:29 ` Daniel Colascione 2 siblings, 1 reply; 44+ messages in thread From: Andy Lutomirski @ 2019-03-20 18:58 UTC (permalink / raw) To: Christian Brauner Cc: Daniel Colascione, Joel Fernandes, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm, kernel-team, Oleg Nesterov, Serge E. Hallyn, Kees Cook On Wed, Mar 20, 2019 at 11:52 AM Christian Brauner <christian@brauner.io> wrote: > > You're misunderstanding. Again, I said in my previous mails it should > accept pidfds optionally as arguments, yes. But I don't want it to > return the status fds that you previously wanted pidfd_wait() to return. > I really want to see Joel's pidfd_wait() patchset and have more people > review the actual code. Just to make sure that no one is forgetting a material security consideration: $ ls /proc/self attr exe mountinfo projid_map status autogroup fd mounts root syscall auxv fdinfo mountstats sched task cgroup gid_map net schedstat timers clear_refs io ns sessionid timerslack_ns cmdline latency numa_maps setgroups uid_map comm limits oom_adj smaps wchan coredump_filter loginuid oom_score smaps_rollup cpuset map_files oom_score_adj stack cwd maps pagemap stat environ mem personality statm A bunch of this stuff makes sense to make accessible through a syscall interface that we expect to be used even in sandboxes. But a bunch of it does not. For example, *_map, mounts, mountstats, and net are all namespace-wide things that certain policies expect to be unavailable. stack, for example, is a potential attack surface. Etc. As it stands, if you create a fresh userns and mountns and try to mount /proc, there are some really awful and hideous rules that are checked for security reasons. All these new APIs either need to return something more restrictive than a proc dirfd or they need to follow the same rules. And I'm afraid that the latter may be a nonstarter if you expect these APIs to be used in libraries. Yes, this is unfortunate, but it is indeed the current situation. I suppose that we could return magic restricted dirfds, or we could return things that aren't dirfds and all and have some API that gives you the dirfd associated with a procfd but only if you can see /proc/PID. --Andy ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design 2019-03-20 18:58 ` Andy Lutomirski @ 2019-03-20 19:14 ` Christian Brauner 2019-03-20 19:40 ` Daniel Colascione 0 siblings, 1 reply; 44+ messages in thread From: Christian Brauner @ 2019-03-20 19:14 UTC (permalink / raw) To: Andy Lutomirski Cc: Daniel Colascione, Joel Fernandes, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm, kernel-team, Oleg Nesterov, Serge E. Hallyn, Kees Cook On Wed, Mar 20, 2019 at 11:58:57AM -0700, Andy Lutomirski wrote: > On Wed, Mar 20, 2019 at 11:52 AM Christian Brauner <christian@brauner.io> wrote: > > > > You're misunderstanding. Again, I said in my previous mails it should > > accept pidfds optionally as arguments, yes. But I don't want it to > > return the status fds that you previously wanted pidfd_wait() to return. > > I really want to see Joel's pidfd_wait() patchset and have more people > > review the actual code. > > Just to make sure that no one is forgetting a material security consideration: Andy, thanks for commenting! > > $ ls /proc/self > attr exe mountinfo projid_map status > autogroup fd mounts root syscall > auxv fdinfo mountstats sched task > cgroup gid_map net schedstat timers > clear_refs io ns sessionid timerslack_ns > cmdline latency numa_maps setgroups uid_map > comm limits oom_adj smaps wchan > coredump_filter loginuid oom_score smaps_rollup > cpuset map_files oom_score_adj stack > cwd maps pagemap stat > environ mem personality statm > > A bunch of this stuff makes sense to make accessible through a syscall > interface that we expect to be used even in sandboxes. But a bunch of > it does not. For example, *_map, mounts, mountstats, and net are all > namespace-wide things that certain policies expect to be unavailable. > stack, for example, is a potential attack surface. Etc. > > As it stands, if you create a fresh userns and mountns and try to > mount /proc, there are some really awful and hideous rules that are > checked for security reasons. All these new APIs either need to > return something more restrictive than a proc dirfd or they need to > follow the same rules. And I'm afraid that the latter may be a > nonstarter if you expect these APIs to be used in libraries. > > Yes, this is unfortunate, but it is indeed the current situation. I > suppose that we could return magic restricted dirfds, or we could > return things that aren't dirfds and all and have some API that gives > you the dirfd associated with a procfd but only if you can see > /proc/PID. What would be your opinion to having a /proc/<pid>/handle file instead of having a dirfd. Essentially, what I initially proposed at LPC. The change on what we currently have in master would be: https://gist.github.com/brauner/59eec91550c5624c9999eaebd95a70df ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design 2019-03-20 19:14 ` Christian Brauner @ 2019-03-20 19:40 ` Daniel Colascione 2019-03-21 17:02 ` Andy Lutomirski 0 siblings, 1 reply; 44+ messages in thread From: Daniel Colascione @ 2019-03-20 19:40 UTC (permalink / raw) To: Christian Brauner Cc: Andy Lutomirski, Joel Fernandes, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm, kernel-team, Oleg Nesterov, Serge E. Hallyn, Kees Cook On Wed, Mar 20, 2019 at 12:14 PM Christian Brauner <christian@brauner.io> wrote: > > On Wed, Mar 20, 2019 at 11:58:57AM -0700, Andy Lutomirski wrote: > > On Wed, Mar 20, 2019 at 11:52 AM Christian Brauner <christian@brauner.io> wrote: > > > > > > You're misunderstanding. Again, I said in my previous mails it should > > > accept pidfds optionally as arguments, yes. But I don't want it to > > > return the status fds that you previously wanted pidfd_wait() to return. > > > I really want to see Joel's pidfd_wait() patchset and have more people > > > review the actual code. > > > > Just to make sure that no one is forgetting a material security consideration: > > Andy, thanks for commenting! > > > > > $ ls /proc/self > > attr exe mountinfo projid_map status > > autogroup fd mounts root syscall > > auxv fdinfo mountstats sched task > > cgroup gid_map net schedstat timers > > clear_refs io ns sessionid timerslack_ns > > cmdline latency numa_maps setgroups uid_map > > comm limits oom_adj smaps wchan > > coredump_filter loginuid oom_score smaps_rollup > > cpuset map_files oom_score_adj stack > > cwd maps pagemap stat > > environ mem personality statm > > > > A bunch of this stuff makes sense to make accessible through a syscall > > interface that we expect to be used even in sandboxes. But a bunch of > > it does not. For example, *_map, mounts, mountstats, and net are all > > namespace-wide things that certain policies expect to be unavailable. > > stack, for example, is a potential attack surface. Etc. If you can access these files sources via open(2) on /proc/<pid>, you should be able to access them via a pidfd. If you can't, you shouldn't. Which /proc? The one you'd get by mounting procfs. I don't see how pidfd makes any material changes to anyone's security. As far as I'm concerned, if a sandbox can't mount /proc at all, it's just a broken and unsupported configuration. An actual threat model and real thought paid to access capabilities would help. Almost everything around the interaction of Linux kernel namespaces and security feels like a jumble of ad-hoc patches added as afterthoughts in response to random objections. >> All these new APIs either need to > > return something more restrictive than a proc dirfd or they need to > > follow the same rules. What's wrong with the latter? > > And I'm afraid that the latter may be a > > nonstarter if you expect these APIs to be used in libraries. What's special about libraries? How is a library any worse-off using openat(2) on a pidfd than it would be just opening the file called "/proc/$apid"? > > Yes, this is unfortunate, but it is indeed the current situation. I > > suppose that we could return magic restricted dirfds, or we could > > return things that aren't dirfds and all and have some API that gives > > you the dirfd associated with a procfd but only if you can see > > /proc/PID. > > What would be your opinion to having a > /proc/<pid>/handle > file instead of having a dirfd. Essentially, what I initially proposed > at LPC. The change on what we currently have in master would be: > https://gist.github.com/brauner/59eec91550c5624c9999eaebd95a70df And how do you propose, given one of these handle objects, getting a process's current priority, or its current oom score, or its list of memory maps? As I mentioned in my original email, and which nobody has addressed, if you don't use a dirfd as your process handle or you don't provide an easy way to get one of these proc directory FDs, you need to duplicate a lot of metadata access interfaces. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design 2019-03-20 19:40 ` Daniel Colascione @ 2019-03-21 17:02 ` Andy Lutomirski 2019-03-25 20:13 ` Jann Horn 0 siblings, 1 reply; 44+ messages in thread From: Andy Lutomirski @ 2019-03-21 17:02 UTC (permalink / raw) To: Daniel Colascione Cc: Christian Brauner, Andy Lutomirski, Joel Fernandes, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm, kernel-team, Oleg Nesterov, Serge E. Hallyn, Kees Cook On Wed, Mar 20, 2019 at 12:40 PM Daniel Colascione <dancol@google.com> wrote: > > On Wed, Mar 20, 2019 at 12:14 PM Christian Brauner <christian@brauner.io> wrote: > > > > On Wed, Mar 20, 2019 at 11:58:57AM -0700, Andy Lutomirski wrote: > > > On Wed, Mar 20, 2019 at 11:52 AM Christian Brauner <christian@brauner.io> wrote: > > > > > > > > You're misunderstanding. Again, I said in my previous mails it should > > > > accept pidfds optionally as arguments, yes. But I don't want it to > > > > return the status fds that you previously wanted pidfd_wait() to return. > > > > I really want to see Joel's pidfd_wait() patchset and have more people > > > > review the actual code. > > > > > > Just to make sure that no one is forgetting a material security consideration: > > > > Andy, thanks for commenting! > > > > > > > > $ ls /proc/self > > > attr exe mountinfo projid_map status > > > autogroup fd mounts root syscall > > > auxv fdinfo mountstats sched task > > > cgroup gid_map net schedstat timers > > > clear_refs io ns sessionid timerslack_ns > > > cmdline latency numa_maps setgroups uid_map > > > comm limits oom_adj smaps wchan > > > coredump_filter loginuid oom_score smaps_rollup > > > cpuset map_files oom_score_adj stack > > > cwd maps pagemap stat > > > environ mem personality statm > > > > > > A bunch of this stuff makes sense to make accessible through a syscall > > > interface that we expect to be used even in sandboxes. But a bunch of > > > it does not. For example, *_map, mounts, mountstats, and net are all > > > namespace-wide things that certain policies expect to be unavailable. > > > stack, for example, is a potential attack surface. Etc. > > If you can access these files sources via open(2) on /proc/<pid>, you > should be able to access them via a pidfd. If you can't, you > shouldn't. Which /proc? The one you'd get by mounting procfs. I don't > see how pidfd makes any material changes to anyone's security. As far > as I'm concerned, if a sandbox can't mount /proc at all, it's just a > broken and unsupported configuration. It's not "broken and unsupported". I know of an actual working, deployed container-ish sandbox that does exactly this. I would also guess that quite a few not-at-all-container-like sandboxes work like this. (The obvious seccomp + unshare + pivot_root deny-myself-access-to-lots-of-things trick results in no /proc, which is by dsign.) > > An actual threat model and real thought paid to access capabilities > would help. Almost everything around the interaction of Linux kernel > namespaces and security feels like a jumble of ad-hoc patches added as > afterthoughts in response to random objections. I fully agree. But if you start thinking for real about access capabilities, there's no way that you're going to conclude that a capability to access some process implies a capability to access the settings of its network namespace. > > >> All these new APIs either need to > > > return something more restrictive than a proc dirfd or they need to > > > follow the same rules. > ... > What's special about libraries? How is a library any worse-off using > openat(2) on a pidfd than it would be just opening the file called > "/proc/$apid"? Because most libraries actually work, right now, without /proc. Even libraries that spawn subprocesses. If we make the new API have the property that it doesn't work if you're in a non-root user namespace and /proc isn't mounted, the result will be an utter mess. > > > > Yes, this is unfortunate, but it is indeed the current situation. I > > > suppose that we could return magic restricted dirfds, or we could > > > return things that aren't dirfds and all and have some API that gives > > > you the dirfd associated with a procfd but only if you can see > > > /proc/PID. > > > > What would be your opinion to having a > > /proc/<pid>/handle > > file instead of having a dirfd. Essentially, what I initially proposed > > at LPC. The change on what we currently have in master would be: > > https://gist.github.com/brauner/59eec91550c5624c9999eaebd95a70df > > And how do you propose, given one of these handle objects, getting a > process's current priority, or its current oom score, or its list of > memory maps? As I mentioned in my original email, and which nobody has > addressed, if you don't use a dirfd as your process handle or you > don't provide an easy way to get one of these proc directory FDs, you > need to duplicate a lot of metadata access interfaces. An API that takes a process handle object and an fd pointing at /proc (the root of the proc fs) and gives you back a proc dirfd would do the trick. You could do this with no new kernel features at all if you're willing to read the pid, call openat(2), and handle the races in user code. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design 2019-03-21 17:02 ` Andy Lutomirski @ 2019-03-25 20:13 ` Jann Horn 0 siblings, 0 replies; 44+ messages in thread From: Jann Horn @ 2019-03-25 20:13 UTC (permalink / raw) To: Andy Lutomirski, Christian Brauner Cc: Daniel Colascione, Joel Fernandes, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, kernel-team, Oleg Nesterov, Serge E. Hallyn, Kees Cook, Jonathan Kowalski, Linux API On Mon, Mar 25, 2019 at 8:44 PM Andy Lutomirski <luto@kernel.org> wrote: > On Wed, Mar 20, 2019 at 12:40 PM Daniel Colascione <dancol@google.com> wrote: > > On Wed, Mar 20, 2019 at 12:14 PM Christian Brauner <christian@brauner.io> wrote: > > > On Wed, Mar 20, 2019 at 11:58:57AM -0700, Andy Lutomirski wrote: > > > > On Wed, Mar 20, 2019 at 11:52 AM Christian Brauner <christian@brauner.io> wrote: > > > > > > > > > > You're misunderstanding. Again, I said in my previous mails it should > > > > > accept pidfds optionally as arguments, yes. But I don't want it to > > > > > return the status fds that you previously wanted pidfd_wait() to return. > > > > > I really want to see Joel's pidfd_wait() patchset and have more people > > > > > review the actual code. > > > > > > > > Just to make sure that no one is forgetting a material security consideration: > > > > > > Andy, thanks for commenting! > > > > > > > > > > > $ ls /proc/self > > > > attr exe mountinfo projid_map status > > > > autogroup fd mounts root syscall > > > > auxv fdinfo mountstats sched task > > > > cgroup gid_map net schedstat timers > > > > clear_refs io ns sessionid timerslack_ns > > > > cmdline latency numa_maps setgroups uid_map > > > > comm limits oom_adj smaps wchan > > > > coredump_filter loginuid oom_score smaps_rollup > > > > cpuset map_files oom_score_adj stack > > > > cwd maps pagemap stat > > > > environ mem personality statm > > > > > > > > A bunch of this stuff makes sense to make accessible through a syscall > > > > interface that we expect to be used even in sandboxes. But a bunch of > > > > it does not. For example, *_map, mounts, mountstats, and net are all > > > > namespace-wide things that certain policies expect to be unavailable. > > > > stack, for example, is a potential attack surface. Etc. > > > > If you can access these files sources via open(2) on /proc/<pid>, you > > should be able to access them via a pidfd. If you can't, you > > shouldn't. Which /proc? The one you'd get by mounting procfs. I don't > > see how pidfd makes any material changes to anyone's security. As far > > as I'm concerned, if a sandbox can't mount /proc at all, it's just a > > broken and unsupported configuration. > > It's not "broken and unsupported". I know of an actual working, > deployed container-ish sandbox that does exactly this. I would also > guess that quite a few not-at-all-container-like sandboxes work like > this. (The obvious seccomp + unshare + pivot_root > deny-myself-access-to-lots-of-things trick results in no /proc, which > is by dsign.) > > > > > An actual threat model and real thought paid to access capabilities > > would help. Almost everything around the interaction of Linux kernel > > namespaces and security feels like a jumble of ad-hoc patches added as > > afterthoughts in response to random objections. > > I fully agree. But if you start thinking for real about access > capabilities, there's no way that you're going to conclude that a > capability to access some process implies a capability to access the > settings of its network namespace. > > > > > >> All these new APIs either need to > > > > return something more restrictive than a proc dirfd or they need to > > > > follow the same rules. > > > > ... > > > What's special about libraries? How is a library any worse-off using > > openat(2) on a pidfd than it would be just opening the file called > > "/proc/$apid"? > > Because most libraries actually work, right now, without /proc. Even > libraries that spawn subprocesses. If we make the new API have the > property that it doesn't work if you're in a non-root user namespace > and /proc isn't mounted, the result will be an utter mess. > > > > > > > Yes, this is unfortunate, but it is indeed the current situation. I > > > > suppose that we could return magic restricted dirfds, or we could > > > > return things that aren't dirfds and all and have some API that gives > > > > you the dirfd associated with a procfd but only if you can see > > > > /proc/PID. > > > > > > What would be your opinion to having a > > > /proc/<pid>/handle > > > file instead of having a dirfd. Essentially, what I initially proposed > > > at LPC. The change on what we currently have in master would be: > > > https://gist.github.com/brauner/59eec91550c5624c9999eaebd95a70df > > > > And how do you propose, given one of these handle objects, getting a > > process's current priority, or its current oom score, or its list of > > memory maps? As I mentioned in my original email, and which nobody has > > addressed, if you don't use a dirfd as your process handle or you > > don't provide an easy way to get one of these proc directory FDs, you > > need to duplicate a lot of metadata access interfaces. > > An API that takes a process handle object and an fd pointing at /proc > (the root of the proc fs) and gives you back a proc dirfd would do the > trick. You could do this with no new kernel features at all if you're > willing to read the pid, call openat(2), and handle the races in user > code. This seems like something that might be a good fit for two ioctls? One ioctl on procfs roots to translate pidfds into that procfs, subject to both the normal lookup permission checks and only working if the pidfd has a translation into the procfs: int proc_root_fd = open("/proc", O_RDONLY); int proc_dir_fd = ioctl(proc_root_fd, PROC_PIDFD_TO_PROCFSFD, pidfd); And one ioctl on procfs directories to translate from PGIDs and PIDs to pidfds: int proc_pgid_fd = open("/proc/self", O_RDONLY); int self_pg_pidfd = ioctl(proc_pgid_fd, PROC_PROCFSFD_TO_PIDFD, 0); int proc_pid_fd = open("/proc/thread-self", O_RDONLY); int self_p_pidfd = ioctl(proc_pid_fd, PROC_PROCFSFD_TO_PIDFD, 0); And then, as you proposed, the new sys_clone() can just return a pidfd, and you can convert it into a procfs fd yourself if you want. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design @ 2019-03-25 20:13 ` Jann Horn 0 siblings, 0 replies; 44+ messages in thread From: Jann Horn @ 2019-03-25 20:13 UTC (permalink / raw) To: Andy Lutomirski, Christian Brauner Cc: Daniel Colascione, Joel Fernandes, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, kernel-team, Oleg Nesterov, Serge E. Hallyn, Kees Cook On Mon, Mar 25, 2019 at 8:44 PM Andy Lutomirski <luto@kernel.org> wrote: > On Wed, Mar 20, 2019 at 12:40 PM Daniel Colascione <dancol@google.com> wrote: > > On Wed, Mar 20, 2019 at 12:14 PM Christian Brauner <christian@brauner.io> wrote: > > > On Wed, Mar 20, 2019 at 11:58:57AM -0700, Andy Lutomirski wrote: > > > > On Wed, Mar 20, 2019 at 11:52 AM Christian Brauner <christian@brauner.io> wrote: > > > > > > > > > > You're misunderstanding. Again, I said in my previous mails it should > > > > > accept pidfds optionally as arguments, yes. But I don't want it to > > > > > return the status fds that you previously wanted pidfd_wait() to return. > > > > > I really want to see Joel's pidfd_wait() patchset and have more people > > > > > review the actual code. > > > > > > > > Just to make sure that no one is forgetting a material security consideration: > > > > > > Andy, thanks for commenting! > > > > > > > > > > > $ ls /proc/self > > > > attr exe mountinfo projid_map status > > > > autogroup fd mounts root syscall > > > > auxv fdinfo mountstats sched task > > > > cgroup gid_map net schedstat timers > > > > clear_refs io ns sessionid timerslack_ns > > > > cmdline latency numa_maps setgroups uid_map > > > > comm limits oom_adj smaps wchan > > > > coredump_filter loginuid oom_score smaps_rollup > > > > cpuset map_files oom_score_adj stack > > > > cwd maps pagemap stat > > > > environ mem personality statm > > > > > > > > A bunch of this stuff makes sense to make accessible through a syscall > > > > interface that we expect to be used even in sandboxes. But a bunch of > > > > it does not. For example, *_map, mounts, mountstats, and net are all > > > > namespace-wide things that certain policies expect to be unavailable. > > > > stack, for example, is a potential attack surface. Etc. > > > > If you can access these files sources via open(2) on /proc/<pid>, you > > should be able to access them via a pidfd. If you can't, you > > shouldn't. Which /proc? The one you'd get by mounting procfs. I don't > > see how pidfd makes any material changes to anyone's security. As far > > as I'm concerned, if a sandbox can't mount /proc at all, it's just a > > broken and unsupported configuration. > > It's not "broken and unsupported". I know of an actual working, > deployed container-ish sandbox that does exactly this. I would also > guess that quite a few not-at-all-container-like sandboxes work like > this. (The obvious seccomp + unshare + pivot_root > deny-myself-access-to-lots-of-things trick results in no /proc, which > is by dsign.) > > > > > An actual threat model and real thought paid to access capabilities > > would help. Almost everything around the interaction of Linux kernel > > namespaces and security feels like a jumble of ad-hoc patches added as > > afterthoughts in response to random objections. > > I fully agree. But if you start thinking for real about access > capabilities, there's no way that you're going to conclude that a > capability to access some process implies a capability to access the > settings of its network namespace. > > > > > >> All these new APIs either need to > > > > return something more restrictive than a proc dirfd or they need to > > > > follow the same rules. > > > > ... > > > What's special about libraries? How is a library any worse-off using > > openat(2) on a pidfd than it would be just opening the file called > > "/proc/$apid"? > > Because most libraries actually work, right now, without /proc. Even > libraries that spawn subprocesses. If we make the new API have the > property that it doesn't work if you're in a non-root user namespace > and /proc isn't mounted, the result will be an utter mess. > > > > > > > Yes, this is unfortunate, but it is indeed the current situation. I > > > > suppose that we could return magic restricted dirfds, or we could > > > > return things that aren't dirfds and all and have some API that gives > > > > you the dirfd associated with a procfd but only if you can see > > > > /proc/PID. > > > > > > What would be your opinion to having a > > > /proc/<pid>/handle > > > file instead of having a dirfd. Essentially, what I initially proposed > > > at LPC. The change on what we currently have in master would be: > > > https://gist.github.com/brauner/59eec91550c5624c9999eaebd95a70df > > > > And how do you propose, given one of these handle objects, getting a > > process's current priority, or its current oom score, or its list of > > memory maps? As I mentioned in my original email, and which nobody has > > addressed, if you don't use a dirfd as your process handle or you > > don't provide an easy way to get one of these proc directory FDs, you > > need to duplicate a lot of metadata access interfaces. > > An API that takes a process handle object and an fd pointing at /proc > (the root of the proc fs) and gives you back a proc dirfd would do the > trick. You could do this with no new kernel features at all if you're > willing to read the pid, call openat(2), and handle the races in user > code. This seems like something that might be a good fit for two ioctls? One ioctl on procfs roots to translate pidfds into that procfs, subject to both the normal lookup permission checks and only working if the pidfd has a translation into the procfs: int proc_root_fd = open("/proc", O_RDONLY); int proc_dir_fd = ioctl(proc_root_fd, PROC_PIDFD_TO_PROCFSFD, pidfd); And one ioctl on procfs directories to translate from PGIDs and PIDs to pidfds: int proc_pgid_fd = open("/proc/self", O_RDONLY); int self_pg_pidfd = ioctl(proc_pgid_fd, PROC_PROCFSFD_TO_PIDFD, 0); int proc_pid_fd = open("/proc/thread-self", O_RDONLY); int self_p_pidfd = ioctl(proc_pid_fd, PROC_PROCFSFD_TO_PIDFD, 0); And then, as you proposed, the new sys_clone() can just return a pidfd, and you can convert it into a procfs fd yourself if you want. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design 2019-03-25 20:13 ` Jann Horn @ 2019-03-25 20:23 ` Daniel Colascione -1 siblings, 0 replies; 44+ messages in thread From: Daniel Colascione @ 2019-03-25 20:23 UTC (permalink / raw) To: Jann Horn Cc: Andy Lutomirski, Christian Brauner, Joel Fernandes, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, kernel-team, Oleg Nesterov, Serge E. Hallyn, Kees Cook, Jonathan Kowalski, Linux API On Mon, Mar 25, 2019 at 1:14 PM Jann Horn <jannh@google.com> wrote: > > On Mon, Mar 25, 2019 at 8:44 PM Andy Lutomirski <luto@kernel.org> wrote: > > On Wed, Mar 20, 2019 at 12:40 PM Daniel Colascione <dancol@google.com> wrote: > > > On Wed, Mar 20, 2019 at 12:14 PM Christian Brauner <christian@brauner.io> wrote: > > > > On Wed, Mar 20, 2019 at 11:58:57AM -0700, Andy Lutomirski wrote: > > > > > On Wed, Mar 20, 2019 at 11:52 AM Christian Brauner <christian@brauner.io> wrote: > > > > > > > > > > > > You're misunderstanding. Again, I said in my previous mails it should > > > > > > accept pidfds optionally as arguments, yes. But I don't want it to > > > > > > return the status fds that you previously wanted pidfd_wait() to return. > > > > > > I really want to see Joel's pidfd_wait() patchset and have more people > > > > > > review the actual code. > > > > > > > > > > Just to make sure that no one is forgetting a material security consideration: > > > > > > > > Andy, thanks for commenting! > > > > > > > > > > > > > > $ ls /proc/self > > > > > attr exe mountinfo projid_map status > > > > > autogroup fd mounts root syscall > > > > > auxv fdinfo mountstats sched task > > > > > cgroup gid_map net schedstat timers > > > > > clear_refs io ns sessionid timerslack_ns > > > > > cmdline latency numa_maps setgroups uid_map > > > > > comm limits oom_adj smaps wchan > > > > > coredump_filter loginuid oom_score smaps_rollup > > > > > cpuset map_files oom_score_adj stack > > > > > cwd maps pagemap stat > > > > > environ mem personality statm > > > > > > > > > > A bunch of this stuff makes sense to make accessible through a syscall > > > > > interface that we expect to be used even in sandboxes. But a bunch of > > > > > it does not. For example, *_map, mounts, mountstats, and net are all > > > > > namespace-wide things that certain policies expect to be unavailable. > > > > > stack, for example, is a potential attack surface. Etc. > > > > > > If you can access these files sources via open(2) on /proc/<pid>, you > > > should be able to access them via a pidfd. If you can't, you > > > shouldn't. Which /proc? The one you'd get by mounting procfs. I don't > > > see how pidfd makes any material changes to anyone's security. As far > > > as I'm concerned, if a sandbox can't mount /proc at all, it's just a > > > broken and unsupported configuration. > > > > It's not "broken and unsupported". I know of an actual working, > > deployed container-ish sandbox that does exactly this. I would also > > guess that quite a few not-at-all-container-like sandboxes work like > > this. (The obvious seccomp + unshare + pivot_root > > deny-myself-access-to-lots-of-things trick results in no /proc, which > > is by dsign.) > > > > > > > > An actual threat model and real thought paid to access capabilities > > > would help. Almost everything around the interaction of Linux kernel > > > namespaces and security feels like a jumble of ad-hoc patches added as > > > afterthoughts in response to random objections. > > > > I fully agree. But if you start thinking for real about access > > capabilities, there's no way that you're going to conclude that a > > capability to access some process implies a capability to access the > > settings of its network namespace. > > > > > > > > >> All these new APIs either need to > > > > > return something more restrictive than a proc dirfd or they need to > > > > > follow the same rules. > > > > > > > ... > > > > > What's special about libraries? How is a library any worse-off using > > > openat(2) on a pidfd than it would be just opening the file called > > > "/proc/$apid"? > > > > Because most libraries actually work, right now, without /proc. Even > > libraries that spawn subprocesses. If we make the new API have the > > property that it doesn't work if you're in a non-root user namespace > > and /proc isn't mounted, the result will be an utter mess. > > > > > > > > > > Yes, this is unfortunate, but it is indeed the current situation. I > > > > > suppose that we could return magic restricted dirfds, or we could > > > > > return things that aren't dirfds and all and have some API that gives > > > > > you the dirfd associated with a procfd but only if you can see > > > > > /proc/PID. > > > > > > > > What would be your opinion to having a > > > > /proc/<pid>/handle > > > > file instead of having a dirfd. Essentially, what I initially proposed > > > > at LPC. The change on what we currently have in master would be: > > > > https://gist.github.com/brauner/59eec91550c5624c9999eaebd95a70df > > > > > > And how do you propose, given one of these handle objects, getting a > > > process's current priority, or its current oom score, or its list of > > > memory maps? As I mentioned in my original email, and which nobody has > > > addressed, if you don't use a dirfd as your process handle or you > > > don't provide an easy way to get one of these proc directory FDs, you > > > need to duplicate a lot of metadata access interfaces. > > > > An API that takes a process handle object and an fd pointing at /proc > > (the root of the proc fs) and gives you back a proc dirfd would do the > > trick. You could do this with no new kernel features at all if you're > > willing to read the pid, call openat(2), and handle the races in user > > code. > > This seems like something that might be a good fit for two ioctls? As an aside, we had a long discussion about why fundamental facilities like this should be system calls, not ioctls. I think the arguments still apply. > One ioctl on procfs roots to translate pidfds into that procfs, > subject to both the normal lookup permission checks and only working > if the pidfd has a translation into the procfs: > > int proc_root_fd = open("/proc", O_RDONLY); > int proc_dir_fd = ioctl(proc_root_fd, PROC_PIDFD_TO_PROCFSFD, pidfd); > > And one ioctl on procfs directories to translate from PGIDs and PIDs to pidfds: > > int proc_pgid_fd = open("/proc/self", O_RDONLY); > int self_pg_pidfd = ioctl(proc_pgid_fd, PROC_PROCFSFD_TO_PIDFD, 0); > int proc_pid_fd = open("/proc/thread-self", O_RDONLY); > int self_p_pidfd = ioctl(proc_pid_fd, PROC_PROCFSFD_TO_PIDFD, 0); > > > And then, as you proposed, the new sys_clone() can just return a > pidfd, and you can convert it into a procfs fd yourself if you want. I think that's the consensus we reached on the other thread. The O_DIRECTORY open on /proc/self/fd/mypidfd seems like it'd work well enough. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design @ 2019-03-25 20:23 ` Daniel Colascione 0 siblings, 0 replies; 44+ messages in thread From: Daniel Colascione @ 2019-03-25 20:23 UTC (permalink / raw) To: Jann Horn Cc: Andy Lutomirski, Christian Brauner, Joel Fernandes, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, kernel-team, Oleg Nesterov, Serge E. Hallyn On Mon, Mar 25, 2019 at 1:14 PM Jann Horn <jannh@google.com> wrote: > > On Mon, Mar 25, 2019 at 8:44 PM Andy Lutomirski <luto@kernel.org> wrote: > > On Wed, Mar 20, 2019 at 12:40 PM Daniel Colascione <dancol@google.com> wrote: > > > On Wed, Mar 20, 2019 at 12:14 PM Christian Brauner <christian@brauner.io> wrote: > > > > On Wed, Mar 20, 2019 at 11:58:57AM -0700, Andy Lutomirski wrote: > > > > > On Wed, Mar 20, 2019 at 11:52 AM Christian Brauner <christian@brauner.io> wrote: > > > > > > > > > > > > You're misunderstanding. Again, I said in my previous mails it should > > > > > > accept pidfds optionally as arguments, yes. But I don't want it to > > > > > > return the status fds that you previously wanted pidfd_wait() to return. > > > > > > I really want to see Joel's pidfd_wait() patchset and have more people > > > > > > review the actual code. > > > > > > > > > > Just to make sure that no one is forgetting a material security consideration: > > > > > > > > Andy, thanks for commenting! > > > > > > > > > > > > > > $ ls /proc/self > > > > > attr exe mountinfo projid_map status > > > > > autogroup fd mounts root syscall > > > > > auxv fdinfo mountstats sched task > > > > > cgroup gid_map net schedstat timers > > > > > clear_refs io ns sessionid timerslack_ns > > > > > cmdline latency numa_maps setgroups uid_map > > > > > comm limits oom_adj smaps wchan > > > > > coredump_filter loginuid oom_score smaps_rollup > > > > > cpuset map_files oom_score_adj stack > > > > > cwd maps pagemap stat > > > > > environ mem personality statm > > > > > > > > > > A bunch of this stuff makes sense to make accessible through a syscall > > > > > interface that we expect to be used even in sandboxes. But a bunch of > > > > > it does not. For example, *_map, mounts, mountstats, and net are all > > > > > namespace-wide things that certain policies expect to be unavailable. > > > > > stack, for example, is a potential attack surface. Etc. > > > > > > If you can access these files sources via open(2) on /proc/<pid>, you > > > should be able to access them via a pidfd. If you can't, you > > > shouldn't. Which /proc? The one you'd get by mounting procfs. I don't > > > see how pidfd makes any material changes to anyone's security. As far > > > as I'm concerned, if a sandbox can't mount /proc at all, it's just a > > > broken and unsupported configuration. > > > > It's not "broken and unsupported". I know of an actual working, > > deployed container-ish sandbox that does exactly this. I would also > > guess that quite a few not-at-all-container-like sandboxes work like > > this. (The obvious seccomp + unshare + pivot_root > > deny-myself-access-to-lots-of-things trick results in no /proc, which > > is by dsign.) > > > > > > > > An actual threat model and real thought paid to access capabilities > > > would help. Almost everything around the interaction of Linux kernel > > > namespaces and security feels like a jumble of ad-hoc patches added as > > > afterthoughts in response to random objections. > > > > I fully agree. But if you start thinking for real about access > > capabilities, there's no way that you're going to conclude that a > > capability to access some process implies a capability to access the > > settings of its network namespace. > > > > > > > > >> All these new APIs either need to > > > > > return something more restrictive than a proc dirfd or they need to > > > > > follow the same rules. > > > > > > > ... > > > > > What's special about libraries? How is a library any worse-off using > > > openat(2) on a pidfd than it would be just opening the file called > > > "/proc/$apid"? > > > > Because most libraries actually work, right now, without /proc. Even > > libraries that spawn subprocesses. If we make the new API have the > > property that it doesn't work if you're in a non-root user namespace > > and /proc isn't mounted, the result will be an utter mess. > > > > > > > > > > Yes, this is unfortunate, but it is indeed the current situation. I > > > > > suppose that we could return magic restricted dirfds, or we could > > > > > return things that aren't dirfds and all and have some API that gives > > > > > you the dirfd associated with a procfd but only if you can see > > > > > /proc/PID. > > > > > > > > What would be your opinion to having a > > > > /proc/<pid>/handle > > > > file instead of having a dirfd. Essentially, what I initially proposed > > > > at LPC. The change on what we currently have in master would be: > > > > https://gist.github.com/brauner/59eec91550c5624c9999eaebd95a70df > > > > > > And how do you propose, given one of these handle objects, getting a > > > process's current priority, or its current oom score, or its list of > > > memory maps? As I mentioned in my original email, and which nobody has > > > addressed, if you don't use a dirfd as your process handle or you > > > don't provide an easy way to get one of these proc directory FDs, you > > > need to duplicate a lot of metadata access interfaces. > > > > An API that takes a process handle object and an fd pointing at /proc > > (the root of the proc fs) and gives you back a proc dirfd would do the > > trick. You could do this with no new kernel features at all if you're > > willing to read the pid, call openat(2), and handle the races in user > > code. > > This seems like something that might be a good fit for two ioctls? As an aside, we had a long discussion about why fundamental facilities like this should be system calls, not ioctls. I think the arguments still apply. > One ioctl on procfs roots to translate pidfds into that procfs, > subject to both the normal lookup permission checks and only working > if the pidfd has a translation into the procfs: > > int proc_root_fd = open("/proc", O_RDONLY); > int proc_dir_fd = ioctl(proc_root_fd, PROC_PIDFD_TO_PROCFSFD, pidfd); > > And one ioctl on procfs directories to translate from PGIDs and PIDs to pidfds: > > int proc_pgid_fd = open("/proc/self", O_RDONLY); > int self_pg_pidfd = ioctl(proc_pgid_fd, PROC_PROCFSFD_TO_PIDFD, 0); > int proc_pid_fd = open("/proc/thread-self", O_RDONLY); > int self_p_pidfd = ioctl(proc_pid_fd, PROC_PROCFSFD_TO_PIDFD, 0); > > > And then, as you proposed, the new sys_clone() can just return a > pidfd, and you can convert it into a procfs fd yourself if you want. I think that's the consensus we reached on the other thread. The O_DIRECTORY open on /proc/self/fd/mypidfd seems like it'd work well enough. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design 2019-03-25 20:23 ` Daniel Colascione @ 2019-03-25 23:42 ` Andy Lutomirski -1 siblings, 0 replies; 44+ messages in thread From: Andy Lutomirski @ 2019-03-25 23:42 UTC (permalink / raw) To: Daniel Colascione Cc: Jann Horn, Andy Lutomirski, Christian Brauner, Joel Fernandes, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, kernel-team, Oleg Nesterov, Serge E. Hallyn, Kees Cook, Jonathan Kowalski, Linux API On Mon, Mar 25, 2019 at 1:23 PM Daniel Colascione <dancol@google.com> wrote: > > On Mon, Mar 25, 2019 at 1:14 PM Jann Horn <jannh@google.com> wrote: > > > > On Mon, Mar 25, 2019 at 8:44 PM Andy Lutomirski <luto@kernel.org> wrote: > > One ioctl on procfs roots to translate pidfds into that procfs, > > subject to both the normal lookup permission checks and only working > > if the pidfd has a translation into the procfs: > > > > int proc_root_fd = open("/proc", O_RDONLY); > > int proc_dir_fd = ioctl(proc_root_fd, PROC_PIDFD_TO_PROCFSFD, pidfd); > > > > And one ioctl on procfs directories to translate from PGIDs and PIDs to pidfds: > > > > int proc_pgid_fd = open("/proc/self", O_RDONLY); > > int self_pg_pidfd = ioctl(proc_pgid_fd, PROC_PROCFSFD_TO_PIDFD, 0); > > int proc_pid_fd = open("/proc/thread-self", O_RDONLY); > > int self_p_pidfd = ioctl(proc_pid_fd, PROC_PROCFSFD_TO_PIDFD, 0); > > This sounds okay to me. Or we could make it so that a procfs directory fd also works as a pidfd, but that seems more likely to be problematic than just allowing two-way translation like this > > > > And then, as you proposed, the new sys_clone() can just return a > > pidfd, and you can convert it into a procfs fd yourself if you want. > > I think that's the consensus we reached on the other thread. The > O_DIRECTORY open on /proc/self/fd/mypidfd seems like it'd work well > enough. I must have missed this particular email. IMO, if /proc/self/fd/mypidfd allows O_DIRECTORY open to work, then it really ought to do function just like /proc/self/fd/mypidfd/. and /proc/self/fd/mypidfd/status should work. And these latter two options seem nutty. Also, this O_DIRECTORY thing is missing the entire point of the ioctl interface -- it doesn't require procfs access. --Andy ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design @ 2019-03-25 23:42 ` Andy Lutomirski 0 siblings, 0 replies; 44+ messages in thread From: Andy Lutomirski @ 2019-03-25 23:42 UTC (permalink / raw) To: Daniel Colascione Cc: Jann Horn, Andy Lutomirski, Christian Brauner, Joel Fernandes, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, kernel-team, Oleg Nesterov On Mon, Mar 25, 2019 at 1:23 PM Daniel Colascione <dancol@google.com> wrote: > > On Mon, Mar 25, 2019 at 1:14 PM Jann Horn <jannh@google.com> wrote: > > > > On Mon, Mar 25, 2019 at 8:44 PM Andy Lutomirski <luto@kernel.org> wrote: > > One ioctl on procfs roots to translate pidfds into that procfs, > > subject to both the normal lookup permission checks and only working > > if the pidfd has a translation into the procfs: > > > > int proc_root_fd = open("/proc", O_RDONLY); > > int proc_dir_fd = ioctl(proc_root_fd, PROC_PIDFD_TO_PROCFSFD, pidfd); > > > > And one ioctl on procfs directories to translate from PGIDs and PIDs to pidfds: > > > > int proc_pgid_fd = open("/proc/self", O_RDONLY); > > int self_pg_pidfd = ioctl(proc_pgid_fd, PROC_PROCFSFD_TO_PIDFD, 0); > > int proc_pid_fd = open("/proc/thread-self", O_RDONLY); > > int self_p_pidfd = ioctl(proc_pid_fd, PROC_PROCFSFD_TO_PIDFD, 0); > > This sounds okay to me. Or we could make it so that a procfs directory fd also works as a pidfd, but that seems more likely to be problematic than just allowing two-way translation like this > > > > And then, as you proposed, the new sys_clone() can just return a > > pidfd, and you can convert it into a procfs fd yourself if you want. > > I think that's the consensus we reached on the other thread. The > O_DIRECTORY open on /proc/self/fd/mypidfd seems like it'd work well > enough. I must have missed this particular email. IMO, if /proc/self/fd/mypidfd allows O_DIRECTORY open to work, then it really ought to do function just like /proc/self/fd/mypidfd/. and /proc/self/fd/mypidfd/status should work. And these latter two options seem nutty. Also, this O_DIRECTORY thing is missing the entire point of the ioctl interface -- it doesn't require procfs access. --Andy ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design 2019-03-25 23:42 ` Andy Lutomirski @ 2019-03-25 23:45 ` Christian Brauner -1 siblings, 0 replies; 44+ messages in thread From: Christian Brauner @ 2019-03-25 23:45 UTC (permalink / raw) To: Andy Lutomirski Cc: Daniel Colascione, Jann Horn, Joel Fernandes, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, kernel-team, Oleg Nesterov, Serge E. Hallyn, Kees Cook, Jonathan Kowalski, Linux API On Mon, Mar 25, 2019 at 04:42:14PM -0700, Andy Lutomirski wrote: > On Mon, Mar 25, 2019 at 1:23 PM Daniel Colascione <dancol@google.com> wrote: > > > > On Mon, Mar 25, 2019 at 1:14 PM Jann Horn <jannh@google.com> wrote: > > > > > > On Mon, Mar 25, 2019 at 8:44 PM Andy Lutomirski <luto@kernel.org> wrote: > > > > One ioctl on procfs roots to translate pidfds into that procfs, > > > subject to both the normal lookup permission checks and only working > > > if the pidfd has a translation into the procfs: > > > > > > int proc_root_fd = open("/proc", O_RDONLY); > > > int proc_dir_fd = ioctl(proc_root_fd, PROC_PIDFD_TO_PROCFSFD, pidfd); > > > > > > And one ioctl on procfs directories to translate from PGIDs and PIDs to pidfds: > > > > > > int proc_pgid_fd = open("/proc/self", O_RDONLY); > > > int self_pg_pidfd = ioctl(proc_pgid_fd, PROC_PROCFSFD_TO_PIDFD, 0); > > > int proc_pid_fd = open("/proc/thread-self", O_RDONLY); > > > int self_p_pidfd = ioctl(proc_pid_fd, PROC_PROCFSFD_TO_PIDFD, 0); > > > > > This sounds okay to me. Or we could make it so that a procfs > directory fd also works as a pidfd, but that seems more likely to be > problematic than just allowing two-way translation like this > > > > > > > And then, as you proposed, the new sys_clone() can just return a > > > pidfd, and you can convert it into a procfs fd yourself if you want. > > > > I think that's the consensus we reached on the other thread. The > > O_DIRECTORY open on /proc/self/fd/mypidfd seems like it'd work well > > enough. > > I must have missed this particular email. > > IMO, if /proc/self/fd/mypidfd allows O_DIRECTORY open to work, then it > really ought to do function just like /proc/self/fd/mypidfd/. and > /proc/self/fd/mypidfd/status should work. And these latter two > options seem nutty. > > Also, this O_DIRECTORY thing is missing the entire point of the ioctl > interface -- it doesn't require procfs access. The other option was to encode the pid in the callers pid namespace into the pidfd's fdinfo so that you can parse it out and open /proc/<pid>. You'd just need an event on the pidfd to tell you when the process has died. Jonathan and I just discussed this. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design @ 2019-03-25 23:45 ` Christian Brauner 0 siblings, 0 replies; 44+ messages in thread From: Christian Brauner @ 2019-03-25 23:45 UTC (permalink / raw) To: Andy Lutomirski Cc: Daniel Colascione, Jann Horn, Joel Fernandes, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, kernel-team, Oleg Nesterov, Serge E. Hallyn On Mon, Mar 25, 2019 at 04:42:14PM -0700, Andy Lutomirski wrote: > On Mon, Mar 25, 2019 at 1:23 PM Daniel Colascione <dancol@google.com> wrote: > > > > On Mon, Mar 25, 2019 at 1:14 PM Jann Horn <jannh@google.com> wrote: > > > > > > On Mon, Mar 25, 2019 at 8:44 PM Andy Lutomirski <luto@kernel.org> wrote: > > > > One ioctl on procfs roots to translate pidfds into that procfs, > > > subject to both the normal lookup permission checks and only working > > > if the pidfd has a translation into the procfs: > > > > > > int proc_root_fd = open("/proc", O_RDONLY); > > > int proc_dir_fd = ioctl(proc_root_fd, PROC_PIDFD_TO_PROCFSFD, pidfd); > > > > > > And one ioctl on procfs directories to translate from PGIDs and PIDs to pidfds: > > > > > > int proc_pgid_fd = open("/proc/self", O_RDONLY); > > > int self_pg_pidfd = ioctl(proc_pgid_fd, PROC_PROCFSFD_TO_PIDFD, 0); > > > int proc_pid_fd = open("/proc/thread-self", O_RDONLY); > > > int self_p_pidfd = ioctl(proc_pid_fd, PROC_PROCFSFD_TO_PIDFD, 0); > > > > > This sounds okay to me. Or we could make it so that a procfs > directory fd also works as a pidfd, but that seems more likely to be > problematic than just allowing two-way translation like this > > > > > > > And then, as you proposed, the new sys_clone() can just return a > > > pidfd, and you can convert it into a procfs fd yourself if you want. > > > > I think that's the consensus we reached on the other thread. The > > O_DIRECTORY open on /proc/self/fd/mypidfd seems like it'd work well > > enough. > > I must have missed this particular email. > > IMO, if /proc/self/fd/mypidfd allows O_DIRECTORY open to work, then it > really ought to do function just like /proc/self/fd/mypidfd/. and > /proc/self/fd/mypidfd/status should work. And these latter two > options seem nutty. > > Also, this O_DIRECTORY thing is missing the entire point of the ioctl > interface -- it doesn't require procfs access. The other option was to encode the pid in the callers pid namespace into the pidfd's fdinfo so that you can parse it out and open /proc/<pid>. You'd just need an event on the pidfd to tell you when the process has died. Jonathan and I just discussed this. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design 2019-03-25 23:45 ` Christian Brauner @ 2019-03-26 0:00 ` Andy Lutomirski -1 siblings, 0 replies; 44+ messages in thread From: Andy Lutomirski @ 2019-03-26 0:00 UTC (permalink / raw) To: Christian Brauner Cc: Andy Lutomirski, Daniel Colascione, Jann Horn, Joel Fernandes, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, kernel-team, Oleg Nesterov, Serge E. Hallyn, Kees Cook, Jonathan Kowalski, Linux API On Mon, Mar 25, 2019 at 4:45 PM Christian Brauner <christian@brauner.io> wrote: > > On Mon, Mar 25, 2019 at 04:42:14PM -0700, Andy Lutomirski wrote: > > On Mon, Mar 25, 2019 at 1:23 PM Daniel Colascione <dancol@google.com> wrote: > > > > > > On Mon, Mar 25, 2019 at 1:14 PM Jann Horn <jannh@google.com> wrote: > > > > > > > > On Mon, Mar 25, 2019 at 8:44 PM Andy Lutomirski <luto@kernel.org> wrote: > > > > > > One ioctl on procfs roots to translate pidfds into that procfs, > > > > subject to both the normal lookup permission checks and only working > > > > if the pidfd has a translation into the procfs: > > > > > > > > int proc_root_fd = open("/proc", O_RDONLY); > > > > int proc_dir_fd = ioctl(proc_root_fd, PROC_PIDFD_TO_PROCFSFD, pidfd); > > > > > > > > And one ioctl on procfs directories to translate from PGIDs and PIDs to pidfds: > > > > > > > > int proc_pgid_fd = open("/proc/self", O_RDONLY); > > > > int self_pg_pidfd = ioctl(proc_pgid_fd, PROC_PROCFSFD_TO_PIDFD, 0); > > > > int proc_pid_fd = open("/proc/thread-self", O_RDONLY); > > > > int self_p_pidfd = ioctl(proc_pid_fd, PROC_PROCFSFD_TO_PIDFD, 0); > > > > > > > > This sounds okay to me. Or we could make it so that a procfs > > directory fd also works as a pidfd, but that seems more likely to be > > problematic than just allowing two-way translation like this > > > > > > > > > > And then, as you proposed, the new sys_clone() can just return a > > > > pidfd, and you can convert it into a procfs fd yourself if you want. > > > > > > I think that's the consensus we reached on the other thread. The > > > O_DIRECTORY open on /proc/self/fd/mypidfd seems like it'd work well > > > enough. > > > > I must have missed this particular email. > > > > IMO, if /proc/self/fd/mypidfd allows O_DIRECTORY open to work, then it > > really ought to do function just like /proc/self/fd/mypidfd/. and > > /proc/self/fd/mypidfd/status should work. And these latter two > > options seem nutty. > > > > Also, this O_DIRECTORY thing is missing the entire point of the ioctl > > interface -- it doesn't require procfs access. > > The other option was to encode the pid in the callers pid namespace into > the pidfd's fdinfo so that you can parse it out and open /proc/<pid>. > You'd just need an event on the pidfd to tell you when the process has > died. Jonathan and I just discussed this. From an application developer's POV, the ioctl interface sounds much, much nicer. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design @ 2019-03-26 0:00 ` Andy Lutomirski 0 siblings, 0 replies; 44+ messages in thread From: Andy Lutomirski @ 2019-03-26 0:00 UTC (permalink / raw) To: Christian Brauner Cc: Andy Lutomirski, Daniel Colascione, Jann Horn, Joel Fernandes, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, kernel-team, Oleg Nesterov On Mon, Mar 25, 2019 at 4:45 PM Christian Brauner <christian@brauner.io> wrote: > > On Mon, Mar 25, 2019 at 04:42:14PM -0700, Andy Lutomirski wrote: > > On Mon, Mar 25, 2019 at 1:23 PM Daniel Colascione <dancol@google.com> wrote: > > > > > > On Mon, Mar 25, 2019 at 1:14 PM Jann Horn <jannh@google.com> wrote: > > > > > > > > On Mon, Mar 25, 2019 at 8:44 PM Andy Lutomirski <luto@kernel.org> wrote: > > > > > > One ioctl on procfs roots to translate pidfds into that procfs, > > > > subject to both the normal lookup permission checks and only working > > > > if the pidfd has a translation into the procfs: > > > > > > > > int proc_root_fd = open("/proc", O_RDONLY); > > > > int proc_dir_fd = ioctl(proc_root_fd, PROC_PIDFD_TO_PROCFSFD, pidfd); > > > > > > > > And one ioctl on procfs directories to translate from PGIDs and PIDs to pidfds: > > > > > > > > int proc_pgid_fd = open("/proc/self", O_RDONLY); > > > > int self_pg_pidfd = ioctl(proc_pgid_fd, PROC_PROCFSFD_TO_PIDFD, 0); > > > > int proc_pid_fd = open("/proc/thread-self", O_RDONLY); > > > > int self_p_pidfd = ioctl(proc_pid_fd, PROC_PROCFSFD_TO_PIDFD, 0); > > > > > > > > This sounds okay to me. Or we could make it so that a procfs > > directory fd also works as a pidfd, but that seems more likely to be > > problematic than just allowing two-way translation like this > > > > > > > > > > And then, as you proposed, the new sys_clone() can just return a > > > > pidfd, and you can convert it into a procfs fd yourself if you want. > > > > > > I think that's the consensus we reached on the other thread. The > > > O_DIRECTORY open on /proc/self/fd/mypidfd seems like it'd work well > > > enough. > > > > I must have missed this particular email. > > > > IMO, if /proc/self/fd/mypidfd allows O_DIRECTORY open to work, then it > > really ought to do function just like /proc/self/fd/mypidfd/. and > > /proc/self/fd/mypidfd/status should work. And these latter two > > options seem nutty. > > > > Also, this O_DIRECTORY thing is missing the entire point of the ioctl > > interface -- it doesn't require procfs access. > > The other option was to encode the pid in the callers pid namespace into > the pidfd's fdinfo so that you can parse it out and open /proc/<pid>. > You'd just need an event on the pidfd to tell you when the process has > died. Jonathan and I just discussed this. >From an application developer's POV, the ioctl interface sounds much, much nicer. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design 2019-03-26 0:00 ` Andy Lutomirski @ 2019-03-26 0:12 ` Christian Brauner -1 siblings, 0 replies; 44+ messages in thread From: Christian Brauner @ 2019-03-26 0:12 UTC (permalink / raw) To: Andy Lutomirski Cc: Daniel Colascione, Jann Horn, Joel Fernandes, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, kernel-team, Oleg Nesterov, Serge E. Hallyn, Kees Cook, Jonathan Kowalski, Linux API On Mon, Mar 25, 2019 at 05:00:17PM -0700, Andy Lutomirski wrote: > On Mon, Mar 25, 2019 at 4:45 PM Christian Brauner <christian@brauner.io> wrote: > > > > On Mon, Mar 25, 2019 at 04:42:14PM -0700, Andy Lutomirski wrote: > > > On Mon, Mar 25, 2019 at 1:23 PM Daniel Colascione <dancol@google.com> wrote: > > > > > > > > On Mon, Mar 25, 2019 at 1:14 PM Jann Horn <jannh@google.com> wrote: > > > > > > > > > > On Mon, Mar 25, 2019 at 8:44 PM Andy Lutomirski <luto@kernel.org> wrote: > > > > > > > > One ioctl on procfs roots to translate pidfds into that procfs, > > > > > subject to both the normal lookup permission checks and only working > > > > > if the pidfd has a translation into the procfs: > > > > > > > > > > int proc_root_fd = open("/proc", O_RDONLY); > > > > > int proc_dir_fd = ioctl(proc_root_fd, PROC_PIDFD_TO_PROCFSFD, pidfd); > > > > > > > > > > And one ioctl on procfs directories to translate from PGIDs and PIDs to pidfds: > > > > > > > > > > int proc_pgid_fd = open("/proc/self", O_RDONLY); > > > > > int self_pg_pidfd = ioctl(proc_pgid_fd, PROC_PROCFSFD_TO_PIDFD, 0); > > > > > int proc_pid_fd = open("/proc/thread-self", O_RDONLY); > > > > > int self_p_pidfd = ioctl(proc_pid_fd, PROC_PROCFSFD_TO_PIDFD, 0); > > > > > > > > > > > This sounds okay to me. Or we could make it so that a procfs > > > directory fd also works as a pidfd, but that seems more likely to be > > > problematic than just allowing two-way translation like this > > > > > > > > > > > > > And then, as you proposed, the new sys_clone() can just return a > > > > > pidfd, and you can convert it into a procfs fd yourself if you want. > > > > > > > > I think that's the consensus we reached on the other thread. The > > > > O_DIRECTORY open on /proc/self/fd/mypidfd seems like it'd work well > > > > enough. > > > > > > I must have missed this particular email. > > > > > > IMO, if /proc/self/fd/mypidfd allows O_DIRECTORY open to work, then it > > > really ought to do function just like /proc/self/fd/mypidfd/. and > > > /proc/self/fd/mypidfd/status should work. And these latter two > > > options seem nutty. > > > > > > Also, this O_DIRECTORY thing is missing the entire point of the ioctl > > > interface -- it doesn't require procfs access. > > > > The other option was to encode the pid in the callers pid namespace into > > the pidfd's fdinfo so that you can parse it out and open /proc/<pid>. > > You'd just need an event on the pidfd to tell you when the process has > > died. Jonathan and I just discussed this. > > From an application developer's POV, the ioctl interface sounds much, > much nicer. Some people are strongly against ioctl()s some don't. I'm not against them so both options are fine with me if people can agree. Christian ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design @ 2019-03-26 0:12 ` Christian Brauner 0 siblings, 0 replies; 44+ messages in thread From: Christian Brauner @ 2019-03-26 0:12 UTC (permalink / raw) To: Andy Lutomirski Cc: Peter Zijlstra, Tim Murray, Michal Hocko, Joel Fernandes, Sultan Alsawaf, Jonathan Kowalski, open list:ANDROID DRIVERS, Daniel Colascione, Suren Baghdasaryan, Ingo Molnar, kernel-team, Todd Kjos, Kees Cook, Jann Horn, Steven Rostedt, Oleg Nesterov, Martijn Coenen, Greg Kroah-Hartman, LKML, Arve Hjønnevåg, Linux API On Mon, Mar 25, 2019 at 05:00:17PM -0700, Andy Lutomirski wrote: > On Mon, Mar 25, 2019 at 4:45 PM Christian Brauner <christian@brauner.io> wrote: > > > > On Mon, Mar 25, 2019 at 04:42:14PM -0700, Andy Lutomirski wrote: > > > On Mon, Mar 25, 2019 at 1:23 PM Daniel Colascione <dancol@google.com> wrote: > > > > > > > > On Mon, Mar 25, 2019 at 1:14 PM Jann Horn <jannh@google.com> wrote: > > > > > > > > > > On Mon, Mar 25, 2019 at 8:44 PM Andy Lutomirski <luto@kernel.org> wrote: > > > > > > > > One ioctl on procfs roots to translate pidfds into that procfs, > > > > > subject to both the normal lookup permission checks and only working > > > > > if the pidfd has a translation into the procfs: > > > > > > > > > > int proc_root_fd = open("/proc", O_RDONLY); > > > > > int proc_dir_fd = ioctl(proc_root_fd, PROC_PIDFD_TO_PROCFSFD, pidfd); > > > > > > > > > > And one ioctl on procfs directories to translate from PGIDs and PIDs to pidfds: > > > > > > > > > > int proc_pgid_fd = open("/proc/self", O_RDONLY); > > > > > int self_pg_pidfd = ioctl(proc_pgid_fd, PROC_PROCFSFD_TO_PIDFD, 0); > > > > > int proc_pid_fd = open("/proc/thread-self", O_RDONLY); > > > > > int self_p_pidfd = ioctl(proc_pid_fd, PROC_PROCFSFD_TO_PIDFD, 0); > > > > > > > > > > > This sounds okay to me. Or we could make it so that a procfs > > > directory fd also works as a pidfd, but that seems more likely to be > > > problematic than just allowing two-way translation like this > > > > > > > > > > > > > And then, as you proposed, the new sys_clone() can just return a > > > > > pidfd, and you can convert it into a procfs fd yourself if you want. > > > > > > > > I think that's the consensus we reached on the other thread. The > > > > O_DIRECTORY open on /proc/self/fd/mypidfd seems like it'd work well > > > > enough. > > > > > > I must have missed this particular email. > > > > > > IMO, if /proc/self/fd/mypidfd allows O_DIRECTORY open to work, then it > > > really ought to do function just like /proc/self/fd/mypidfd/. and > > > /proc/self/fd/mypidfd/status should work. And these latter two > > > options seem nutty. > > > > > > Also, this O_DIRECTORY thing is missing the entire point of the ioctl > > > interface -- it doesn't require procfs access. > > > > The other option was to encode the pid in the callers pid namespace into > > the pidfd's fdinfo so that you can parse it out and open /proc/<pid>. > > You'd just need an event on the pidfd to tell you when the process has > > died. Jonathan and I just discussed this. > > From an application developer's POV, the ioctl interface sounds much, > much nicer. Some people are strongly against ioctl()s some don't. I'm not against them so both options are fine with me if people can agree. Christian ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design 2019-03-26 0:12 ` Christian Brauner @ 2019-03-26 0:24 ` Andy Lutomirski -1 siblings, 0 replies; 44+ messages in thread From: Andy Lutomirski @ 2019-03-26 0:24 UTC (permalink / raw) To: Christian Brauner Cc: Andy Lutomirski, Daniel Colascione, Jann Horn, Joel Fernandes, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, kernel-team, Oleg Nesterov, Serge E. Hallyn, Kees Cook, Jonathan Kowalski, Linux API On Mon, Mar 25, 2019 at 5:12 PM Christian Brauner <christian@brauner.io> wrote: > > On Mon, Mar 25, 2019 at 05:00:17PM -0700, Andy Lutomirski wrote: > > On Mon, Mar 25, 2019 at 4:45 PM Christian Brauner <christian@brauner.io> wrote: > > > > > > On Mon, Mar 25, 2019 at 04:42:14PM -0700, Andy Lutomirski wrote: > > > > On Mon, Mar 25, 2019 at 1:23 PM Daniel Colascione <dancol@google.com> wrote: > > > > > > > > > > On Mon, Mar 25, 2019 at 1:14 PM Jann Horn <jannh@google.com> wrote: > > > > > > > > > > > > On Mon, Mar 25, 2019 at 8:44 PM Andy Lutomirski <luto@kernel.org> wrote: > > > > > > > > > > One ioctl on procfs roots to translate pidfds into that procfs, > > > > > > subject to both the normal lookup permission checks and only working > > > > > > if the pidfd has a translation into the procfs: > > > > > > > > > > > > int proc_root_fd = open("/proc", O_RDONLY); > > > > > > int proc_dir_fd = ioctl(proc_root_fd, PROC_PIDFD_TO_PROCFSFD, pidfd); > > > > > > > > > > > > And one ioctl on procfs directories to translate from PGIDs and PIDs to pidfds: > > > > > > > > > > > > int proc_pgid_fd = open("/proc/self", O_RDONLY); > > > > > > int self_pg_pidfd = ioctl(proc_pgid_fd, PROC_PROCFSFD_TO_PIDFD, 0); > > > > > > int proc_pid_fd = open("/proc/thread-self", O_RDONLY); > > > > > > int self_p_pidfd = ioctl(proc_pid_fd, PROC_PROCFSFD_TO_PIDFD, 0); > > > > > > > > > > > > > > This sounds okay to me. Or we could make it so that a procfs > > > > directory fd also works as a pidfd, but that seems more likely to be > > > > problematic than just allowing two-way translation like this > > > > > > > > > > > > > > > > And then, as you proposed, the new sys_clone() can just return a > > > > > > pidfd, and you can convert it into a procfs fd yourself if you want. > > > > > > > > > > I think that's the consensus we reached on the other thread. The > > > > > O_DIRECTORY open on /proc/self/fd/mypidfd seems like it'd work well > > > > > enough. > > > > > > > > I must have missed this particular email. > > > > > > > > IMO, if /proc/self/fd/mypidfd allows O_DIRECTORY open to work, then it > > > > really ought to do function just like /proc/self/fd/mypidfd/. and > > > > /proc/self/fd/mypidfd/status should work. And these latter two > > > > options seem nutty. > > > > > > > > Also, this O_DIRECTORY thing is missing the entire point of the ioctl > > > > interface -- it doesn't require procfs access. > > > > > > The other option was to encode the pid in the callers pid namespace into > > > the pidfd's fdinfo so that you can parse it out and open /proc/<pid>. > > > You'd just need an event on the pidfd to tell you when the process has > > > died. Jonathan and I just discussed this. > > > > From an application developer's POV, the ioctl interface sounds much, > > much nicer. > > Some people are strongly against ioctl()s some don't. I'm not against > them so both options are fine with me if people can agree. > There are certainly non-ioctl equivalents that are functionally equivalent. For example, there could be a syscall procfs_open_pidfd(procfs_fd, pid_fd). I personally don't really mind ioctl() when it's really an operation on an fd. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design @ 2019-03-26 0:24 ` Andy Lutomirski 0 siblings, 0 replies; 44+ messages in thread From: Andy Lutomirski @ 2019-03-26 0:24 UTC (permalink / raw) To: Christian Brauner Cc: Andy Lutomirski, Daniel Colascione, Jann Horn, Joel Fernandes, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, kernel-team, Oleg Nesterov On Mon, Mar 25, 2019 at 5:12 PM Christian Brauner <christian@brauner.io> wrote: > > On Mon, Mar 25, 2019 at 05:00:17PM -0700, Andy Lutomirski wrote: > > On Mon, Mar 25, 2019 at 4:45 PM Christian Brauner <christian@brauner.io> wrote: > > > > > > On Mon, Mar 25, 2019 at 04:42:14PM -0700, Andy Lutomirski wrote: > > > > On Mon, Mar 25, 2019 at 1:23 PM Daniel Colascione <dancol@google.com> wrote: > > > > > > > > > > On Mon, Mar 25, 2019 at 1:14 PM Jann Horn <jannh@google.com> wrote: > > > > > > > > > > > > On Mon, Mar 25, 2019 at 8:44 PM Andy Lutomirski <luto@kernel.org> wrote: > > > > > > > > > > One ioctl on procfs roots to translate pidfds into that procfs, > > > > > > subject to both the normal lookup permission checks and only working > > > > > > if the pidfd has a translation into the procfs: > > > > > > > > > > > > int proc_root_fd = open("/proc", O_RDONLY); > > > > > > int proc_dir_fd = ioctl(proc_root_fd, PROC_PIDFD_TO_PROCFSFD, pidfd); > > > > > > > > > > > > And one ioctl on procfs directories to translate from PGIDs and PIDs to pidfds: > > > > > > > > > > > > int proc_pgid_fd = open("/proc/self", O_RDONLY); > > > > > > int self_pg_pidfd = ioctl(proc_pgid_fd, PROC_PROCFSFD_TO_PIDFD, 0); > > > > > > int proc_pid_fd = open("/proc/thread-self", O_RDONLY); > > > > > > int self_p_pidfd = ioctl(proc_pid_fd, PROC_PROCFSFD_TO_PIDFD, 0); > > > > > > > > > > > > > > This sounds okay to me. Or we could make it so that a procfs > > > > directory fd also works as a pidfd, but that seems more likely to be > > > > problematic than just allowing two-way translation like this > > > > > > > > > > > > > > > > And then, as you proposed, the new sys_clone() can just return a > > > > > > pidfd, and you can convert it into a procfs fd yourself if you want. > > > > > > > > > > I think that's the consensus we reached on the other thread. The > > > > > O_DIRECTORY open on /proc/self/fd/mypidfd seems like it'd work well > > > > > enough. > > > > > > > > I must have missed this particular email. > > > > > > > > IMO, if /proc/self/fd/mypidfd allows O_DIRECTORY open to work, then it > > > > really ought to do function just like /proc/self/fd/mypidfd/. and > > > > /proc/self/fd/mypidfd/status should work. And these latter two > > > > options seem nutty. > > > > > > > > Also, this O_DIRECTORY thing is missing the entire point of the ioctl > > > > interface -- it doesn't require procfs access. > > > > > > The other option was to encode the pid in the callers pid namespace into > > > the pidfd's fdinfo so that you can parse it out and open /proc/<pid>. > > > You'd just need an event on the pidfd to tell you when the process has > > > died. Jonathan and I just discussed this. > > > > From an application developer's POV, the ioctl interface sounds much, > > much nicer. > > Some people are strongly against ioctl()s some don't. I'm not against > them so both options are fine with me if people can agree. > There are certainly non-ioctl equivalents that are functionally equivalent. For example, there could be a syscall procfs_open_pidfd(procfs_fd, pid_fd). I personally don't really mind ioctl() when it's really an operation on an fd. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design 2019-03-26 0:24 ` Andy Lutomirski @ 2019-03-28 9:21 ` Christian Brauner -1 siblings, 0 replies; 44+ messages in thread From: Christian Brauner @ 2019-03-28 9:21 UTC (permalink / raw) To: Andy Lutomirski Cc: Daniel Colascione, Jann Horn, Joel Fernandes, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, kernel-team, Oleg Nesterov, Serge E. Hallyn, Kees Cook, Jonathan Kowalski, Linux API On Mon, Mar 25, 2019 at 05:24:49PM -0700, Andy Lutomirski wrote: > On Mon, Mar 25, 2019 at 5:12 PM Christian Brauner <christian@brauner.io> wrote: > > > > On Mon, Mar 25, 2019 at 05:00:17PM -0700, Andy Lutomirski wrote: > > > On Mon, Mar 25, 2019 at 4:45 PM Christian Brauner <christian@brauner.io> wrote: > > > > > > > > On Mon, Mar 25, 2019 at 04:42:14PM -0700, Andy Lutomirski wrote: > > > > > On Mon, Mar 25, 2019 at 1:23 PM Daniel Colascione <dancol@google.com> wrote: > > > > > > > > > > > > On Mon, Mar 25, 2019 at 1:14 PM Jann Horn <jannh@google.com> wrote: > > > > > > > > > > > > > > On Mon, Mar 25, 2019 at 8:44 PM Andy Lutomirski <luto@kernel.org> wrote: > > > > > > > > > > > > One ioctl on procfs roots to translate pidfds into that procfs, > > > > > > > subject to both the normal lookup permission checks and only working > > > > > > > if the pidfd has a translation into the procfs: > > > > > > > > > > > > > > int proc_root_fd = open("/proc", O_RDONLY); > > > > > > > int proc_dir_fd = ioctl(proc_root_fd, PROC_PIDFD_TO_PROCFSFD, pidfd); > > > > > > > > > > > > > > And one ioctl on procfs directories to translate from PGIDs and PIDs to pidfds: > > > > > > > > > > > > > > int proc_pgid_fd = open("/proc/self", O_RDONLY); > > > > > > > int self_pg_pidfd = ioctl(proc_pgid_fd, PROC_PROCFSFD_TO_PIDFD, 0); > > > > > > > int proc_pid_fd = open("/proc/thread-self", O_RDONLY); > > > > > > > int self_p_pidfd = ioctl(proc_pid_fd, PROC_PROCFSFD_TO_PIDFD, 0); > > > > > > > > > > > > > > > > > This sounds okay to me. Or we could make it so that a procfs > > > > > directory fd also works as a pidfd, but that seems more likely to be > > > > > problematic than just allowing two-way translation like this > > > > > > > > > > > > > > > > > > > And then, as you proposed, the new sys_clone() can just return a > > > > > > > pidfd, and you can convert it into a procfs fd yourself if you want. > > > > > > > > > > > > I think that's the consensus we reached on the other thread. The > > > > > > O_DIRECTORY open on /proc/self/fd/mypidfd seems like it'd work well > > > > > > enough. > > > > > > > > > > I must have missed this particular email. > > > > > > > > > > IMO, if /proc/self/fd/mypidfd allows O_DIRECTORY open to work, then it > > > > > really ought to do function just like /proc/self/fd/mypidfd/. and > > > > > /proc/self/fd/mypidfd/status should work. And these latter two > > > > > options seem nutty. > > > > > > > > > > Also, this O_DIRECTORY thing is missing the entire point of the ioctl > > > > > interface -- it doesn't require procfs access. > > > > > > > > The other option was to encode the pid in the callers pid namespace into > > > > the pidfd's fdinfo so that you can parse it out and open /proc/<pid>. > > > > You'd just need an event on the pidfd to tell you when the process has > > > > died. Jonathan and I just discussed this. > > > > > > From an application developer's POV, the ioctl interface sounds much, > > > much nicer. > > > > Some people are strongly against ioctl()s some don't. I'm not against > > them so both options are fine with me if people can agree. > > > > There are certainly non-ioctl equivalents that are functionally > equivalent. For example, there could be a syscall > procfs_open_pidfd(procfs_fd, pid_fd). I personally don't really mind > ioctl() when it's really an operation on an fd. I totally missed that mail somehow. Yes, I agree that an ioctl() makes sense for that. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design @ 2019-03-28 9:21 ` Christian Brauner 0 siblings, 0 replies; 44+ messages in thread From: Christian Brauner @ 2019-03-28 9:21 UTC (permalink / raw) To: Andy Lutomirski Cc: Daniel Colascione, Jann Horn, Joel Fernandes, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, kernel-team, Oleg Nesterov, Serge E. Hallyn On Mon, Mar 25, 2019 at 05:24:49PM -0700, Andy Lutomirski wrote: > On Mon, Mar 25, 2019 at 5:12 PM Christian Brauner <christian@brauner.io> wrote: > > > > On Mon, Mar 25, 2019 at 05:00:17PM -0700, Andy Lutomirski wrote: > > > On Mon, Mar 25, 2019 at 4:45 PM Christian Brauner <christian@brauner.io> wrote: > > > > > > > > On Mon, Mar 25, 2019 at 04:42:14PM -0700, Andy Lutomirski wrote: > > > > > On Mon, Mar 25, 2019 at 1:23 PM Daniel Colascione <dancol@google.com> wrote: > > > > > > > > > > > > On Mon, Mar 25, 2019 at 1:14 PM Jann Horn <jannh@google.com> wrote: > > > > > > > > > > > > > > On Mon, Mar 25, 2019 at 8:44 PM Andy Lutomirski <luto@kernel.org> wrote: > > > > > > > > > > > > One ioctl on procfs roots to translate pidfds into that procfs, > > > > > > > subject to both the normal lookup permission checks and only working > > > > > > > if the pidfd has a translation into the procfs: > > > > > > > > > > > > > > int proc_root_fd = open("/proc", O_RDONLY); > > > > > > > int proc_dir_fd = ioctl(proc_root_fd, PROC_PIDFD_TO_PROCFSFD, pidfd); > > > > > > > > > > > > > > And one ioctl on procfs directories to translate from PGIDs and PIDs to pidfds: > > > > > > > > > > > > > > int proc_pgid_fd = open("/proc/self", O_RDONLY); > > > > > > > int self_pg_pidfd = ioctl(proc_pgid_fd, PROC_PROCFSFD_TO_PIDFD, 0); > > > > > > > int proc_pid_fd = open("/proc/thread-self", O_RDONLY); > > > > > > > int self_p_pidfd = ioctl(proc_pid_fd, PROC_PROCFSFD_TO_PIDFD, 0); > > > > > > > > > > > > > > > > > This sounds okay to me. Or we could make it so that a procfs > > > > > directory fd also works as a pidfd, but that seems more likely to be > > > > > problematic than just allowing two-way translation like this > > > > > > > > > > > > > > > > > > > And then, as you proposed, the new sys_clone() can just return a > > > > > > > pidfd, and you can convert it into a procfs fd yourself if you want. > > > > > > > > > > > > I think that's the consensus we reached on the other thread. The > > > > > > O_DIRECTORY open on /proc/self/fd/mypidfd seems like it'd work well > > > > > > enough. > > > > > > > > > > I must have missed this particular email. > > > > > > > > > > IMO, if /proc/self/fd/mypidfd allows O_DIRECTORY open to work, then it > > > > > really ought to do function just like /proc/self/fd/mypidfd/. and > > > > > /proc/self/fd/mypidfd/status should work. And these latter two > > > > > options seem nutty. > > > > > > > > > > Also, this O_DIRECTORY thing is missing the entire point of the ioctl > > > > > interface -- it doesn't require procfs access. > > > > > > > > The other option was to encode the pid in the callers pid namespace into > > > > the pidfd's fdinfo so that you can parse it out and open /proc/<pid>. > > > > You'd just need an event on the pidfd to tell you when the process has > > > > died. Jonathan and I just discussed this. > > > > > > From an application developer's POV, the ioctl interface sounds much, > > > much nicer. > > > > Some people are strongly against ioctl()s some don't. I'm not against > > them so both options are fine with me if people can agree. > > > > There are certainly non-ioctl equivalents that are functionally > equivalent. For example, there could be a syscall > procfs_open_pidfd(procfs_fd, pid_fd). I personally don't really mind > ioctl() when it's really an operation on an fd. I totally missed that mail somehow. Yes, I agree that an ioctl() makes sense for that. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design 2019-03-20 18:51 ` Christian Brauner 2019-03-20 18:58 ` Andy Lutomirski @ 2019-03-20 19:19 ` Joel Fernandes 2019-03-20 19:29 ` Daniel Colascione 2 siblings, 0 replies; 44+ messages in thread From: Joel Fernandes @ 2019-03-20 19:19 UTC (permalink / raw) To: Christian Brauner Cc: Daniel Colascione, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm, kernel-team, Oleg Nesterov, Andy Lutomirski, Serge E. Hallyn, Kees Cook On Wed, Mar 20, 2019 at 07:51:57PM +0100, Christian Brauner wrote: [snip] > > > translate_pid() should just return you a pidfd. Having it return a pidfd > > > and a status fd feels like stuffing too much functionality in there. If > > > you're fine with it I'll finish prototyping what I had in mind. As I > > > said in previous mails I'm already working on this. > > > > translate_pid also needs to *accept* pidfds, at least optionally. > > Unless you have a function from pidfd to pidfd, you race. > > You're misunderstanding. Again, I said in my previous mails it should > accept pidfds optionally as arguments, yes. But I don't want it to > return the status fds that you previously wanted pidfd_wait() to return. > I really want to see Joel's pidfd_wait() patchset and have more people > review the actual code. No problem, pidfd_wait is also fine with me and we can change it later to translate_pid or something else if needed. Agreed that lets get to some code writing now that (I hope) we are all on the same page and discuss on actual code. - Joel ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design 2019-03-20 18:51 ` Christian Brauner 2019-03-20 18:58 ` Andy Lutomirski 2019-03-20 19:19 ` Joel Fernandes @ 2019-03-20 19:29 ` Daniel Colascione 2019-03-24 14:44 ` Serge E. Hallyn 2 siblings, 1 reply; 44+ messages in thread From: Daniel Colascione @ 2019-03-20 19:29 UTC (permalink / raw) To: Christian Brauner Cc: Joel Fernandes, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm, kernel-team, Oleg Nesterov, Andy Lutomirski, Serge E. Hallyn, Kees Cook On Wed, Mar 20, 2019 at 11:52 AM Christian Brauner <christian@brauner.io> wrote: > > On Wed, Mar 20, 2019 at 11:38:35AM -0700, Daniel Colascione wrote: > > On Wed, Mar 20, 2019 at 11:26 AM Christian Brauner <christian@brauner.io> wrote: > > > On Wed, Mar 20, 2019 at 07:33:51AM -0400, Joel Fernandes wrote: > > > > > > > > > > > > On March 20, 2019 3:02:32 AM EDT, Daniel Colascione <dancol@google.com> wrote: > > > > >On Tue, Mar 19, 2019 at 8:59 PM Christian Brauner > > > > ><christian@brauner.io> wrote: > > > > >> > > > > >> On Tue, Mar 19, 2019 at 07:42:52PM -0700, Daniel Colascione wrote: > > > > >> > On Tue, Mar 19, 2019 at 6:52 PM Joel Fernandes > > > > ><joel@joelfernandes.org> wrote: > > > > >> > > > > > > >> > > On Wed, Mar 20, 2019 at 12:10:23AM +0100, Christian Brauner > > > > >wrote: > > > > >> > > > On Tue, Mar 19, 2019 at 03:48:32PM -0700, Daniel Colascione > > > > >wrote: > > > > >> > > > > On Tue, Mar 19, 2019 at 3:14 PM Christian Brauner > > > > ><christian@brauner.io> wrote: > > > > >> > > > > > So I dislike the idea of allocating new inodes from the > > > > >procfs super > > > > >> > > > > > block. I would like to avoid pinning the whole pidfd > > > > >concept exclusively > > > > >> > > > > > to proc. The idea is that the pidfd API will be useable > > > > >through procfs > > > > >> > > > > > via open("/proc/<pid>") because that is what users expect > > > > >and really > > > > >> > > > > > wanted to have for a long time. So it makes sense to have > > > > >this working. > > > > >> > > > > > But it should really be useable without it. That's why > > > > >translate_pid() > > > > >> > > > > > and pidfd_clone() are on the table. What I'm saying is, > > > > >once the pidfd > > > > >> > > > > > api is "complete" you should be able to set CONFIG_PROCFS=N > > > > >- even > > > > >> > > > > > though that's crazy - and still be able to use pidfds. This > > > > >is also a > > > > >> > > > > > point akpm asked about when I did the pidfd_send_signal > > > > >work. > > > > >> > > > > > > > > >> > > > > I agree that you shouldn't need CONFIG_PROCFS=Y to use > > > > >pidfds. One > > > > >> > > > > crazy idea that I was discussing with Joel the other day is > > > > >to just > > > > >> > > > > make CONFIG_PROCFS=Y mandatory and provide a new > > > > >get_procfs_root() > > > > >> > > > > system call that returned, out of thin air and independent of > > > > >the > > > > >> > > > > mount table, a procfs root directory file descriptor for the > > > > >caller's > > > > >> > > > > PID namspace and suitable for use with openat(2). > > > > >> > > > > > > > >> > > > Even if this works I'm pretty sure that Al and a lot of others > > > > >will not > > > > >> > > > be happy about this. A syscall to get an fd to /proc? > > > > >> > > > > > >> > Why not? procfs provides access to a lot of core kernel > > > > >functionality. > > > > >> > Why should you need a mountpoint to get to it? > > > > >> > > > > > >> > > That's not going > > > > >> > > > to happen and I don't see the need for a separate syscall just > > > > >for that. > > > > >> > > > > > >> > We need a system call for the same reason we need a getrandom(2): > > > > >you > > > > >> > have to bootstrap somehow when you're in a minimal environment. > > > > >> > > > > > >> > > > (I do see the point of making CONFIG_PROCFS=y the default btw.) > > > > >> > > > > > >> > I'm not proposing that we make CONFIG_PROCFS=y the default. I'm > > > > >> > proposing that we *hardwire* it as the default and just declare > > > > >that > > > > >> > it's not possible to build a Linux kernel that doesn't include > > > > >procfs. > > > > >> > Why do we even have that button? > > > > >> > > > > > >> > > I think his point here was that he wanted a handle to procfs no > > > > >matter where > > > > >> > > it was mounted and then can later use openat on that. Agreed that > > > > >it may be > > > > >> > > unnecessary unless there is a usecase for it, and especially if > > > > >the /proc > > > > >> > > directory being the defacto mountpoint for procfs is a universal > > > > >convention. > > > > >> > > > > > >> > If it's a universal convention and, in practice, everyone needs > > > > >proc > > > > >> > mounted anyway, so what's the harm in hardwiring CONFIG_PROCFS=y? > > > > >If > > > > >> > we advertise /proc as not merely some kind of optional debug > > > > >interface > > > > >> > but *the* way certain kernel features are exposed --- and there's > > > > >> > nothing wrong with that --- then we should give programs access to > > > > >> > these core kernel features in a way that doesn't depend on > > > > >userspace > > > > >> > kernel configuration, and you do that by either providing a > > > > >> > procfs-root-getting system call or just hardwiring the "/proc/" > > > > >prefix > > > > >> > into VFS. > > > > >> > > > > > >> > > > Inode allocation from the procfs mount for the file descriptors > > > > >Joel > > > > >> > > > wants is not correct. Their not really procfs file descriptors > > > > >so this > > > > >> > > > is a nack. We can't just hook into proc that way. > > > > >> > > > > > > >> > > I was not particular about using procfs mount for the FDs but > > > > >that's the only > > > > >> > > way I knew how to do it until you pointed out anon_inode (my grep > > > > >skills > > > > >> > > missed that), so thank you! > > > > >> > > > > > > >> > > > > C'mon: /proc is used by everyone today and almost every > > > > >program breaks > > > > >> > > > > if it's not around. The string "/proc" is already de facto > > > > >kernel ABI. > > > > >> > > > > Let's just drop the pretense of /proc being optional and bake > > > > >it into > > > > >> > > > > the kernel proper, then give programs a way to get to /proc > > > > >that isn't > > > > >> > > > > tied to any particular mount configuration. This way, we > > > > >don't need a > > > > >> > > > > translate_pid(), since callers can just use procfs to do the > > > > >same > > > > >> > > > > thing. (That is, if I understand correctly what translate_pid > > > > >does.) > > > > >> > > > > > > > >> > > > I'm not sure what you think translate_pid() is doing since > > > > >you're not > > > > >> > > > saying what you think it does. > > > > >> > > > Examples from the old patchset: > > > > >> > > > translate_pid(pid, ns, -1) - get pid in our pid namespace > > > > >> > > > > > >> > Ah, it's a bit different from what I had in mind. It's fair to want > > > > >to > > > > >> > translate PIDs between namespaces, but the only way to make the > > > > >> > translate_pid under discussion robust is to have it accept and > > > > >produce > > > > >> > pidfds. (At that point, you might as well call it translate_pidfd.) > > > > >We > > > > >> > should not be adding new APIs to the kernel that accept numeric > > > > >PIDs: > > > > >> > > > > >> The traditional pid-based api is not going away. There are users that > > > > >> have the requirement to translate pids between namespaces and also > > > > >doing > > > > >> introspection on these namespaces independent of pidfds. We will not > > > > >> restrict the usefulness of this syscall by making it only work with > > > > >> pidfds. > > > > >> > > > > >> > it's not possible to use these APIs correctly except under very > > > > >> > limited circumstances --- mostly, talking about init or a parent > > > > >> > > > > >> The pid-based api is one of the most widely used apis of the kernel > > > > >and > > > > >> people have been using it quite successfully for a long time. Yes, > > > > >it's > > > > >> rac, but it's here to stay. > > > > >> > > > > >> > talking about its child. > > > > >> > > > > > >> > Really, we need a few related operations, and we shouldn't > > > > >necessarily > > > > >> > mingle them. > > > > >> > > > > >> Yes, we've established that previously. > > > > >> > > > > >> > > > > > >> > 1) Given a numeric PID, give me a pidfd: that works today: you just > > > > >> > open /proc/<pid> > > > > >> > > > > >> Agreed. > > > > >> > > > > >> > > > > > >> > 2) Given a pidfd, give me a numeric PID: that works today: you just > > > > >> > openat(pidfd, "stat", O_RDONLY) and read the first token (which is > > > > >> > always the numeric PID). > > > > >> > > > > >> Agreed. > > > > >> > > > > >> > > > > > >> > 3) Given a pidfd, send a signal: that's what pidfd_send_signal > > > > >does, > > > > >> > and it's a good start on the rest of these operations. > > > > >> > > > > >> Agreed. > > > > >> > > > > >> > 5) Given a pidfd in NS1, get a pidfd in NS2. That's what > > > > >translate_pid > > > > >> > is for. My preferred signature for this routine is > > > > >translate_pid(int > > > > >> > pidfd, int nsfd) -> pidfd. We don't need two namespace arguments. > > > > >Why > > > > >> > not? Because the pidfd *already* names a single process, uniquely! > > > > >> > > > > >> Given that people are interested in pids we can't just always return > > > > >a > > > > >> pidfd. That would mean a user would need to do get the pidfd read > > > > >from > > > > >> <pidfd>/stat and then close the pidfd. If you do that for a 100 pids > > > > >or > > > > >> more you end up allocating and closing file descriptors constantly > > > > >for > > > > >> no reason. We can't just debate pids away. So it will also need to be > > > > >> able to yield pids e.g. through a flag argument. > > > > > > > > > >Sure, but that's still not a reason that we should care about pidfds > > > > >working separately from procfs.. > > > > > > That's unrelated to the point made in the above paragraph. > > > Please note, I said that the pidfd api should work when proc is not > > > available not that they can't be dirfds. > > > > What do you mean by "not available"? CONFIG_PROCFS=n? If pidfds > > I'm talking about the ability to clone processes and get fd handles on > them via pidfd_clone() or CLONE_NEWFD. I wouldn't call that situation "proc [not being] available". We need pidfd_clone to return a pidfd for atomicity reasons, not /proc availability reasons. Again, it doesn't make any sense to support this stuff when CONFIG_PROCFS=n, and CONFIG_PROCFS=n shouldn't even be a supported configuration. > > > translate_pid() should just return you a pidfd. Having it return a pidfd > > > and a status fd feels like stuffing too much functionality in there. If > > > you're fine with it I'll finish prototyping what I had in mind. As I > > > said in previous mails I'm already working on this. > > > > translate_pid also needs to *accept* pidfds, at least optionally. > > Unless you have a function from pidfd to pidfd, you race. > > You're misunderstanding. Again, I said in my previous mails it should > accept pidfds optionally as arguments, yes. But I don't want it to > return the status fds that you previously wanted pidfd_wait() to return. Agreed. There should be a different way to get these wait handle FDs. > I really want to see Joel's pidfd_wait() patchset and have more people > review the actual code. Sure. But it's also unpleasant to have people write code and then have to throw it away due to guessing incorrectly about unclear requirements. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design 2019-03-20 19:29 ` Daniel Colascione @ 2019-03-24 14:44 ` Serge E. Hallyn 2019-03-24 18:48 ` Joel Fernandes 0 siblings, 1 reply; 44+ messages in thread From: Serge E. Hallyn @ 2019-03-24 14:44 UTC (permalink / raw) To: Daniel Colascione Cc: Christian Brauner, Joel Fernandes, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm, kernel-team, Oleg Nesterov, Andy Lutomirski, Serge E. Hallyn, Kees Cook On Wed, Mar 20, 2019 at 12:29:31PM -0700, Daniel Colascione wrote: > On Wed, Mar 20, 2019 at 11:52 AM Christian Brauner <christian@brauner.io> wrote: > > I really want to see Joel's pidfd_wait() patchset and have more people > > review the actual code. > > Sure. But it's also unpleasant to have people write code and then have > to throw it away due to guessing incorrectly about unclear > requirements. No, it is not. It is not unpleasant. And it is useful. It is the best way to identify and resolve those incorrect guesses and unclear requirements. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design 2019-03-24 14:44 ` Serge E. Hallyn @ 2019-03-24 18:48 ` Joel Fernandes 0 siblings, 0 replies; 44+ messages in thread From: Joel Fernandes @ 2019-03-24 18:48 UTC (permalink / raw) To: Serge E. Hallyn Cc: Daniel Colascione, Christian Brauner, Joel Fernandes, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm, kernel-team, Oleg Nesterov, Andy Lutomirski, Kees Cook On Sun, Mar 24, 2019 at 10:44 AM Serge E. Hallyn <serge@hallyn.com> wrote: > > On Wed, Mar 20, 2019 at 12:29:31PM -0700, Daniel Colascione wrote: > > On Wed, Mar 20, 2019 at 11:52 AM Christian Brauner <christian@brauner.io> wrote: > > > I really want to see Joel's pidfd_wait() patchset and have more people > > > review the actual code. > > > > Sure. But it's also unpleasant to have people write code and then have > > to throw it away due to guessing incorrectly about unclear > > requirements. > > No, it is not. It is not unpleasant. And it is useful. It is the best way to > identify and resolve those incorrect guesses and unclear requirements. No problem, a bit of discussion helped set the direction. Personally it did help clarify lot of things for me. We are hard at work with come up with an implementation and are looking at posting something soon. I agree that the best is to discuss on actual code where possible. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: pidfd design 2019-03-20 18:26 ` Christian Brauner 2019-03-20 18:38 ` Daniel Colascione @ 2019-03-20 19:11 ` Joel Fernandes 1 sibling, 0 replies; 44+ messages in thread From: Joel Fernandes @ 2019-03-20 19:11 UTC (permalink / raw) To: Christian Brauner Cc: Daniel Colascione, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML, open list:ANDROID DRIVERS, linux-mm, kernel-team, Oleg Nesterov, Andy Lutomirski, Serge E. Hallyn, Kees Cook On Wed, Mar 20, 2019 at 07:26:50PM +0100, Christian Brauner wrote: > On Wed, Mar 20, 2019 at 07:33:51AM -0400, Joel Fernandes wrote: > > > > > > On March 20, 2019 3:02:32 AM EDT, Daniel Colascione <dancol@google.com> wrote: > > >On Tue, Mar 19, 2019 at 8:59 PM Christian Brauner > > ><christian@brauner.io> wrote: > > >> > > >> On Tue, Mar 19, 2019 at 07:42:52PM -0700, Daniel Colascione wrote: > > >> > On Tue, Mar 19, 2019 at 6:52 PM Joel Fernandes > > ><joel@joelfernandes.org> wrote: > > >> > > > > >> > > On Wed, Mar 20, 2019 at 12:10:23AM +0100, Christian Brauner > > >wrote: > > >> > > > On Tue, Mar 19, 2019 at 03:48:32PM -0700, Daniel Colascione > > >wrote: > > >> > > > > On Tue, Mar 19, 2019 at 3:14 PM Christian Brauner > > ><christian@brauner.io> wrote: > > >> > > > > > So I dislike the idea of allocating new inodes from the > > >procfs super > > >> > > > > > block. I would like to avoid pinning the whole pidfd > > >concept exclusively > > >> > > > > > to proc. The idea is that the pidfd API will be useable > > >through procfs > > >> > > > > > via open("/proc/<pid>") because that is what users expect > > >and really > > >> > > > > > wanted to have for a long time. So it makes sense to have > > >this working. > > >> > > > > > But it should really be useable without it. That's why > > >translate_pid() > > >> > > > > > and pidfd_clone() are on the table. What I'm saying is, > > >once the pidfd > > >> > > > > > api is "complete" you should be able to set CONFIG_PROCFS=N > > >- even > > >> > > > > > though that's crazy - and still be able to use pidfds. This > > >is also a > > >> > > > > > point akpm asked about when I did the pidfd_send_signal > > >work. > > >> > > > > > > >> > > > > I agree that you shouldn't need CONFIG_PROCFS=Y to use > > >pidfds. One > > >> > > > > crazy idea that I was discussing with Joel the other day is > > >to just > > >> > > > > make CONFIG_PROCFS=Y mandatory and provide a new > > >get_procfs_root() > > >> > > > > system call that returned, out of thin air and independent of > > >the > > >> > > > > mount table, a procfs root directory file descriptor for the > > >caller's > > >> > > > > PID namspace and suitable for use with openat(2). > > >> > > > > > >> > > > Even if this works I'm pretty sure that Al and a lot of others > > >will not > > >> > > > be happy about this. A syscall to get an fd to /proc? > > >> > > > >> > Why not? procfs provides access to a lot of core kernel > > >functionality. > > >> > Why should you need a mountpoint to get to it? > > >> > > > >> > > That's not going > > >> > > > to happen and I don't see the need for a separate syscall just > > >for that. > > >> > > > >> > We need a system call for the same reason we need a getrandom(2): > > >you > > >> > have to bootstrap somehow when you're in a minimal environment. > > >> > > > >> > > > (I do see the point of making CONFIG_PROCFS=y the default btw.) > > >> > > > >> > I'm not proposing that we make CONFIG_PROCFS=y the default. I'm > > >> > proposing that we *hardwire* it as the default and just declare > > >that > > >> > it's not possible to build a Linux kernel that doesn't include > > >procfs. > > >> > Why do we even have that button? > > >> > > > >> > > I think his point here was that he wanted a handle to procfs no > > >matter where > > >> > > it was mounted and then can later use openat on that. Agreed that > > >it may be > > >> > > unnecessary unless there is a usecase for it, and especially if > > >the /proc > > >> > > directory being the defacto mountpoint for procfs is a universal > > >convention. > > >> > > > >> > If it's a universal convention and, in practice, everyone needs > > >proc > > >> > mounted anyway, so what's the harm in hardwiring CONFIG_PROCFS=y? > > >If > > >> > we advertise /proc as not merely some kind of optional debug > > >interface > > >> > but *the* way certain kernel features are exposed --- and there's > > >> > nothing wrong with that --- then we should give programs access to > > >> > these core kernel features in a way that doesn't depend on > > >userspace > > >> > kernel configuration, and you do that by either providing a > > >> > procfs-root-getting system call or just hardwiring the "/proc/" > > >prefix > > >> > into VFS. > > >> > > > >> > > > Inode allocation from the procfs mount for the file descriptors > > >Joel > > >> > > > wants is not correct. Their not really procfs file descriptors > > >so this > > >> > > > is a nack. We can't just hook into proc that way. > > >> > > > > >> > > I was not particular about using procfs mount for the FDs but > > >that's the only > > >> > > way I knew how to do it until you pointed out anon_inode (my grep > > >skills > > >> > > missed that), so thank you! > > >> > > > > >> > > > > C'mon: /proc is used by everyone today and almost every > > >program breaks > > >> > > > > if it's not around. The string "/proc" is already de facto > > >kernel ABI. > > >> > > > > Let's just drop the pretense of /proc being optional and bake > > >it into > > >> > > > > the kernel proper, then give programs a way to get to /proc > > >that isn't > > >> > > > > tied to any particular mount configuration. This way, we > > >don't need a > > >> > > > > translate_pid(), since callers can just use procfs to do the > > >same > > >> > > > > thing. (That is, if I understand correctly what translate_pid > > >does.) > > >> > > > > > >> > > > I'm not sure what you think translate_pid() is doing since > > >you're not > > >> > > > saying what you think it does. > > >> > > > Examples from the old patchset: > > >> > > > translate_pid(pid, ns, -1) - get pid in our pid namespace > > >> > > > >> > Ah, it's a bit different from what I had in mind. It's fair to want > > >to > > >> > translate PIDs between namespaces, but the only way to make the > > >> > translate_pid under discussion robust is to have it accept and > > >produce > > >> > pidfds. (At that point, you might as well call it translate_pidfd.) > > >We > > >> > should not be adding new APIs to the kernel that accept numeric > > >PIDs: > > >> > > >> The traditional pid-based api is not going away. There are users that > > >> have the requirement to translate pids between namespaces and also > > >doing > > >> introspection on these namespaces independent of pidfds. We will not > > >> restrict the usefulness of this syscall by making it only work with > > >> pidfds. > > >> > > >> > it's not possible to use these APIs correctly except under very > > >> > limited circumstances --- mostly, talking about init or a parent > > >> > > >> The pid-based api is one of the most widely used apis of the kernel > > >and > > >> people have been using it quite successfully for a long time. Yes, > > >it's > > >> rac, but it's here to stay. > > >> > > >> > talking about its child. > > >> > > > >> > Really, we need a few related operations, and we shouldn't > > >necessarily > > >> > mingle them. > > >> > > >> Yes, we've established that previously. > > >> > > >> > > > >> > 1) Given a numeric PID, give me a pidfd: that works today: you just > > >> > open /proc/<pid> > > >> > > >> Agreed. > > >> > > >> > > > >> > 2) Given a pidfd, give me a numeric PID: that works today: you just > > >> > openat(pidfd, "stat", O_RDONLY) and read the first token (which is > > >> > always the numeric PID). > > >> > > >> Agreed. > > >> > > >> > > > >> > 3) Given a pidfd, send a signal: that's what pidfd_send_signal > > >does, > > >> > and it's a good start on the rest of these operations. > > >> > > >> Agreed. > > >> > > >> > 5) Given a pidfd in NS1, get a pidfd in NS2. That's what > > >translate_pid > > >> > is for. My preferred signature for this routine is > > >translate_pid(int > > >> > pidfd, int nsfd) -> pidfd. We don't need two namespace arguments. > > >Why > > >> > not? Because the pidfd *already* names a single process, uniquely! > > >> > > >> Given that people are interested in pids we can't just always return > > >a > > >> pidfd. That would mean a user would need to do get the pidfd read > > >from > > >> <pidfd>/stat and then close the pidfd. If you do that for a 100 pids > > >or > > >> more you end up allocating and closing file descriptors constantly > > >for > > >> no reason. We can't just debate pids away. So it will also need to be > > >> able to yield pids e.g. through a flag argument. > > > > > >Sure, but that's still not a reason that we should care about pidfds > > >working separately from procfs.. > > That's unrelated to the point made in the above paragraph. > Please note, I said that the pidfd api should work when proc is not > available not that they can't be dirfds. > > > > > Agreed. I can't imagine pidfd being anything but a proc pid directory handle. So I am confused what Christian meant. Pidfd *is* a procfs directory fid always. That's what I gathered from his pidfd_send_signal patch but let me know if I'm way off in the woods. > > (K9 Mail still hasn't learned to wrap lines at 80 it seems. :)) Indeed, or I misconfigured it :) Just set it up recently so I'm still messing with it. The other issue is it does wrapping on quoted lines too, and there's a bug filed somewhere for that. > Again, I never said that pidfds should be a directory handle. > (Though I would like to point out that one of the original ideas I > discussed at LPC was to have something like this to get regular file > descriptors instead of dirfds: > https://gist.github.com/brauner/59eec91550c5624c9999eaebd95a70df) Ok. I was just going by this code in your send_signal patch where you error out if the pidfd is not a directory. +struct pid *tgid_pidfd_to_pid(const struct file *file) +{ + if (!d_is_dir(file->f_path.dentry) || + (file->f_op != &proc_tgid_base_operations)) + return ERR_PTR(-EBADF); > > For my next revision, I am thinking of adding the flag argument Christian mentioned to make translate_pid return an anon_inode FD which can be used for death status, given a <pid>. Since it is thought that translate_pid can be made to return a pid FD, I think it is ok to have it return a pid status FD for the purposes of the death status as well. > > translate_pid() should just return you a pidfd. Having it return a pidfd > and a status fd feels like stuffing too much functionality in there. If > you're fine with it I'll finish prototyping what I had in mind. As I > said in previous mails I'm already working on this. Yes, please continue to work on it. No problem. > Would you be ok with prototyping the pidfd_wait() syscall you had in > mind? Yes, Of course, I am working on it. No problem. It is still good to discuss these ideas and to know what my direction should be, so I appreciate the conversation here. > Especially the wait_fd part that you want to have I would like to > see how that is supposed to work, e.g. who is allowed to wait on the > process and how notifications will work for non-parent processes and so > on. I feel we won't get anywhere by talking in the abstrace and other > people are far more likely to review/comment once there's actual code. Got it. Lets chat more once I post something. thanks, - Joel ^ permalink raw reply [flat|nested] 44+ messages in thread
end of thread, other threads:[~2019-03-28 9:21 UTC | newest] Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-03-20 20:07 pidfd design Alexey Dobriyan 2019-03-20 20:14 ` Daniel Colascione 2019-03-20 20:39 ` Alexey Dobriyan 2019-03-20 20:47 ` Christian Brauner 2019-03-20 20:50 ` Daniel Colascione 2019-03-20 21:00 ` Christian Brauner 2019-03-22 14:04 ` Michael Tirado 2019-03-25 17:45 ` Linus Torvalds 2019-03-25 16:14 ` Michael Tirado 2019-03-25 20:45 ` Christian Brauner 2019-03-25 18:50 ` Christian Brauner -- strict thread matches above, loose matches on Subject: below -- 2019-03-16 18:57 [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android Christian Brauner 2019-03-16 19:37 ` Suren Baghdasaryan 2019-03-17 1:53 ` Joel Fernandes 2019-03-17 11:42 ` Christian Brauner 2019-03-17 15:40 ` Daniel Colascione 2019-03-18 0:29 ` Christian Brauner 2019-03-18 23:50 ` Joel Fernandes 2019-03-19 22:14 ` Christian Brauner 2019-03-19 22:48 ` Daniel Colascione 2019-03-19 23:10 ` Christian Brauner 2019-03-20 1:52 ` Joel Fernandes 2019-03-20 2:42 ` pidfd design Daniel Colascione 2019-03-20 3:59 ` Christian Brauner 2019-03-20 7:02 ` Daniel Colascione 2019-03-20 11:33 ` Joel Fernandes 2019-03-20 11:33 ` Joel Fernandes 2019-03-20 18:26 ` Christian Brauner 2019-03-20 18:38 ` Daniel Colascione 2019-03-20 18:51 ` Christian Brauner 2019-03-20 18:58 ` Andy Lutomirski 2019-03-20 19:14 ` Christian Brauner 2019-03-20 19:40 ` Daniel Colascione 2019-03-21 17:02 ` Andy Lutomirski 2019-03-25 20:13 ` Jann Horn 2019-03-25 20:13 ` Jann Horn 2019-03-25 20:23 ` Daniel Colascione 2019-03-25 20:23 ` Daniel Colascione 2019-03-25 23:42 ` Andy Lutomirski 2019-03-25 23:42 ` Andy Lutomirski 2019-03-25 23:45 ` Christian Brauner 2019-03-25 23:45 ` Christian Brauner 2019-03-26 0:00 ` Andy Lutomirski 2019-03-26 0:00 ` Andy Lutomirski 2019-03-26 0:12 ` Christian Brauner 2019-03-26 0:12 ` Christian Brauner 2019-03-26 0:24 ` Andy Lutomirski 2019-03-26 0:24 ` Andy Lutomirski 2019-03-28 9:21 ` Christian Brauner 2019-03-28 9:21 ` Christian Brauner 2019-03-20 19:19 ` Joel Fernandes 2019-03-20 19:29 ` Daniel Colascione 2019-03-24 14:44 ` Serge E. Hallyn 2019-03-24 18:48 ` Joel Fernandes 2019-03-20 19:11 ` Joel Fernandes
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.