linux-man.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH] clone.2: add CLONE_PIDFD entry
       [not found] <20190511064908.21956-1-christian.brauner@ubuntu.com>
@ 2019-09-11  8:58 ` Michael Kerrisk (man-pages)
  2019-09-16  7:40   ` Christian Brauner
  0 siblings, 1 reply; 9+ messages in thread
From: Michael Kerrisk (man-pages) @ 2019-09-11  8:58 UTC (permalink / raw)
  To: Christian Brauner, linux-man; +Cc: mtk.manpages, jannh, oleg

Hello Christian,

On 5/11/19 8:49 AM, Christian Brauner wrote:
> From: Christian Brauner <christian@brauner.io>
> 
> Add an entry for CLONE_PIDFD. This flag is available starting with
> kernel 5.2. If specified, a process file descriptor ("pidfd") referring
> to the child process will be returned in the ptid argument.

I've applied this patch in a local branch, and made some minor edits
and added a piece. And I have some questions. See below.

> Signed-off-by: Christian Brauner <christian@brauner.io>
> ---
>  man2/clone.2 | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 51 insertions(+)
> 
> diff --git a/man2/clone.2 b/man2/clone.2
> index 7e880beb8..ee08aeb42 100644
> --- a/man2/clone.2
> +++ b/man2/clone.2
> @@ -539,6 +539,42 @@ The flag disappeared completely from the kernel sources in Linux 2.5.16.
>  Since then, the kernel silently ignores this bit if it is specified in
>  .IR flags .
>  .TP
> +.BR CLONE_PIDFD " (since Linux 5.2)"
> +If
> +.B CLONE_PIDFD
> +is set,
> +.BR clone ()
> +stores a process file descriptor ("pidfd") referring to the child process at
> +the location
> +.I ptid
> +in the parent's memory. 

I added a note that the close-on-exec flag is set on the new FD.

> Note, that the kernel verifies that the value for
> +.I ptid
> +is zero. If it is not an error will be returned. This ensures that
> +.I ptid
> +can potentially be used to specify additional options for
> +.B CLONE_PIDFD
> +in the future.

This piece is no longer true, right? At least I can't see such 
a check in the kernel code, and my testing doesn't yield an error
when ptid != 0 before the call.(No need to send me a patch; if I'm
correct just let me know and I'll edit out this piece.)

> +.IP
> +Since the
> +.I ptid
> +argument is used to return the pidfd,
> +.B CLONE_PIDFD
> +cannot be used with
> +.B CLONE_PARENT_SETTID.
> +.IP
> +It is currently not possible to use this flag together with
> +.B CLONE_THREAD.
> +This means that the process identified by the pidfd will always be a
> +thread-group leader.
> +.IP
> +For a while there was a
> +.B CLONE_DETACHED
> +flag. This flag is usually ignored when passed along with other flags.
> +However, when passed alongside
> +.B CLONE_PIDFD
> +an error will be returned. This ensures that this flag can be reused
> +for further pidfd features in the future.
> +.TP
>  .BR CLONE_PTRACE " (since Linux 2.2)"
>  If
>  .B CLONE_PTRACE
> @@ -1122,6 +1158,21 @@ For example, on aarch64,
>  .I child_stack
>  must be a multiple of 16.
>  .TP
> +.B EINVAL
> +.B CLONE_PIDFD
> +was specified together with
> +.B CLONE_DETACHED.
> +.TP
> +.B EINVAL
> +.B CLONE_PIDFD
> +was specified together with
> +.B CLONE_PARENT_SETTID.
> +.TP
> +.B EINVAL
> +.B CLONE_PIDFD
> +was specified together with
> +.B CLONE_THREAD.
> +.TP
>  .B ENOMEM
>  Cannot allocate sufficient memory to allocate a task structure for the
>  child, or to copy those parts of the caller's context that need to be

One other piece seems to be missing: the returned file descriptor can
be fed to poll()/select()/epoll and the FD will test as readable when
the child terminates. Right? Did that functionality also land in
kernel 5.2? And did it get implemented as a separate commit, or did
the behavior just fall naturally out of the implementation of pidfd's?
Let me know the details, and I will craft a patch.

Also, as far as I can see (from testing) the FD only gives pollable
events on process termination, not on other process transitions such
as stop and continue. Right? (Are there any plans to implement such
functionality for stop/contine transitions?

By the way, when do you expect the pidfd-wait functionality to land 
in the kernel?

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] clone.2: add CLONE_PIDFD entry
  2019-09-11  8:58 ` [PATCH] clone.2: add CLONE_PIDFD entry Michael Kerrisk (man-pages)
@ 2019-09-16  7:40   ` Christian Brauner
  2019-09-18  6:49     ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 9+ messages in thread
From: Christian Brauner @ 2019-09-16  7:40 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages); +Cc: linux-man, jannh, oleg

On Wed, Sep 11, 2019 at 10:58:57AM +0200, Michael Kerrisk (man-pages) wrote:
> Hello Christian,
> 
> On 5/11/19 8:49 AM, Christian Brauner wrote:
> > From: Christian Brauner <christian@brauner.io>
> > 
> > Add an entry for CLONE_PIDFD. This flag is available starting with
> > kernel 5.2. If specified, a process file descriptor ("pidfd") referring
> > to the child process will be returned in the ptid argument.
> 
> I've applied this patch in a local branch, and made some minor edits

Thank you! :)

> and added a piece. And I have some questions. See below.
> 
> > Signed-off-by: Christian Brauner <christian@brauner.io>
> > ---
> >  man2/clone.2 | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 51 insertions(+)
> > 
> > diff --git a/man2/clone.2 b/man2/clone.2
> > index 7e880beb8..ee08aeb42 100644
> > --- a/man2/clone.2
> > +++ b/man2/clone.2
> > @@ -539,6 +539,42 @@ The flag disappeared completely from the kernel sources in Linux 2.5.16.
> >  Since then, the kernel silently ignores this bit if it is specified in
> >  .IR flags .
> >  .TP
> > +.BR CLONE_PIDFD " (since Linux 5.2)"
> > +If
> > +.B CLONE_PIDFD
> > +is set,
> > +.BR clone ()
> > +stores a process file descriptor ("pidfd") referring to the child process at
> > +the location
> > +.I ptid
> > +in the parent's memory. 
> 
> I added a note that the close-on-exec flag is set on the new FD.

Ack.

> 
> > Note, that the kernel verifies that the value for
> > +.I ptid
> > +is zero. If it is not an error will be returned. This ensures that
> > +.I ptid
> > +can potentially be used to specify additional options for
> > +.B CLONE_PIDFD
> > +in the future.
> 
> This piece is no longer true, right? At least I can't see such 

Correct.

> a check in the kernel code, and my testing doesn't yield an error
> when ptid != 0 before the call.(No need to send me a patch; if I'm
> correct just let me know and I'll edit out this piece.)
> 
> > +.IP
> > +Since the
> > +.I ptid
> > +argument is used to return the pidfd,
> > +.B CLONE_PIDFD
> > +cannot be used with
> > +.B CLONE_PARENT_SETTID.
> > +.IP
> > +It is currently not possible to use this flag together with
> > +.B CLONE_THREAD.
> > +This means that the process identified by the pidfd will always be a
> > +thread-group leader.
> > +.IP
> > +For a while there was a
> > +.B CLONE_DETACHED
> > +flag. This flag is usually ignored when passed along with other flags.
> > +However, when passed alongside
> > +.B CLONE_PIDFD
> > +an error will be returned. This ensures that this flag can be reused
> > +for further pidfd features in the future.
> > +.TP
> >  .BR CLONE_PTRACE " (since Linux 2.2)"
> >  If
> >  .B CLONE_PTRACE
> > @@ -1122,6 +1158,21 @@ For example, on aarch64,
> >  .I child_stack
> >  must be a multiple of 16.
> >  .TP
> > +.B EINVAL
> > +.B CLONE_PIDFD
> > +was specified together with
> > +.B CLONE_DETACHED.
> > +.TP
> > +.B EINVAL
> > +.B CLONE_PIDFD
> > +was specified together with
> > +.B CLONE_PARENT_SETTID.
> > +.TP
> > +.B EINVAL
> > +.B CLONE_PIDFD
> > +was specified together with
> > +.B CLONE_THREAD.
> > +.TP
> >  .B ENOMEM
> >  Cannot allocate sufficient memory to allocate a task structure for the
> >  child, or to copy those parts of the caller's context that need to be
> 
> One other piece seems to be missing: the returned file descriptor can
> be fed to poll()/select()/epoll and the FD will test as readable when
> the child terminates. Right? Did that functionality also land in
> kernel 5.2? And did it get implemented as a separate commit, or did
> the behavior just fall naturally out of the implementation of pidfd's?
> Let me know the details, and I will craft a patch.

It landed in 5.3. The relevant commit is:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b53b0b9d9a613c418057f6cb921c2f40a6f78c24
and belongs to the following merge:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5450e8a316a64cddcbc15f90733ebc78aa736545

> 
> Also, as far as I can see (from testing) the FD only gives pollable
> events on process termination, not on other process transitions such
> as stop and continue. Right? (Are there any plans to implement such

Correct.

> functionality for stop/contine transitions?

Yes, at some point we will likely want this.

> 
> By the way, when do you expect the pidfd-wait functionality to land 
> in the kernel?

I've sent a PR for 5.4:
https://lkml.org/lkml/2019/9/10/682
which contains the P_PIDFD extension to waitid().

Thanks for the work, Michael!
Christian

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] clone.2: add CLONE_PIDFD entry
  2019-09-16  7:40   ` Christian Brauner
@ 2019-09-18  6:49     ` Michael Kerrisk (man-pages)
  2019-09-18  7:14       ` Christian Brauner
  0 siblings, 1 reply; 9+ messages in thread
From: Michael Kerrisk (man-pages) @ 2019-09-18  6:49 UTC (permalink / raw)
  To: Christian Brauner; +Cc: mtk.manpages, linux-man, jannh, oleg

Hello Christian,

On 9/16/19 9:40 AM, Christian Brauner wrote:
> On Wed, Sep 11, 2019 at 10:58:57AM +0200, Michael Kerrisk (man-pages) wrote:
>> Hello Christian,
>>
>> On 5/11/19 8:49 AM, Christian Brauner wrote:
>>> From: Christian Brauner <christian@brauner.io>
>>>
>>> Add an entry for CLONE_PIDFD. This flag is available starting with
>>> kernel 5.2. If specified, a process file descriptor ("pidfd") referring
>>> to the child process will be returned in the ptid argument.
>>
>> I've applied this patch in a local branch, and made some minor edits
> 
> Thank you! :)
> 
>> and added a piece. And I have some questions. See below.
>>
>>> Signed-off-by: Christian Brauner <christian@brauner.io>
>>> ---

[...]

>>> Note, that the kernel verifies that the value for
>>> +.I ptid
>>> +is zero. If it is not an error will be returned. This ensures that
>>> +.I ptid
>>> +can potentially be used to specify additional options for
>>> +.B CLONE_PIDFD
>>> +in the future.
>>
>> This piece is no longer true, right? At least I can't see such 
> 
> Correct.

Thanks. Page amended.

>> a check in the kernel code, and my testing doesn't yield an error
>> when ptid != 0 before the call.(No need to send me a patch; if I'm
>> correct just let me know and I'll edit out this piece.)
>>
>>> +.IP
>>> +Since the
>>> +.I ptid
>>> +argument is used to return the pidfd,
>>> +.B CLONE_PIDFD
>>> +cannot be used with
>>> +.B CLONE_PARENT_SETTID.
>>> +.IP
>>> +It is currently not possible to use this flag together with
>>> +.B CLONE_THREAD.
>>> +This means that the process identified by the pidfd will always be a
>>> +thread-group leader.
>>> +.IP
>>> +For a while there was a
>>> +.B CLONE_DETACHED
>>> +flag. This flag is usually ignored when passed along with other flags.
>>> +However, when passed alongside
>>> +.B CLONE_PIDFD
>>> +an error will be returned. This ensures that this flag can be reused
>>> +for further pidfd features in the future.
>>> +.TP
>>>  .BR CLONE_PTRACE " (since Linux 2.2)"
>>>  If
>>>  .B CLONE_PTRACE
>>> @@ -1122,6 +1158,21 @@ For example, on aarch64,
>>>  .I child_stack
>>>  must be a multiple of 16.
>>>  .TP
>>> +.B EINVAL
>>> +.B CLONE_PIDFD
>>> +was specified together with
>>> +.B CLONE_DETACHED.
>>> +.TP
>>> +.B EINVAL
>>> +.B CLONE_PIDFD
>>> +was specified together with
>>> +.B CLONE_PARENT_SETTID.
>>> +.TP
>>> +.B EINVAL
>>> +.B CLONE_PIDFD
>>> +was specified together with
>>> +.B CLONE_THREAD.
>>> +.TP
>>>  .B ENOMEM
>>>  Cannot allocate sufficient memory to allocate a task structure for the
>>>  child, or to copy those parts of the caller's context that need to be
>>
>> One other piece seems to be missing: the returned file descriptor can
>> be fed to poll()/select()/epoll and the FD will test as readable when
>> the child terminates. Right? Did that functionality also land in
>> kernel 5.2? And did it get implemented as a separate commit, or did
>> the behavior just fall naturally out of the implementation of pidfd's?
>> Let me know the details, and I will craft a patch.
> 
> It landed in 5.3. The relevant commit is:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b53b0b9d9a613c418057f6cb921c2f40a6f78c24
> and belongs to the following merge:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5450e8a316a64cddcbc15f90733ebc78aa736545

Thanks for that info. One other questions springs to mind.
I haven't looked at the source or tried testing this,
but can anything actually be read() from a PIDFD? Presumably,
it might be useful to have data generated on the FD, since
different values could (ultimately) be used to distinguish
between terminate/stopp/continue transitions.

>> Also, as far as I can see (from testing) the FD only gives pollable
>> events on process termination, not on other process transitions such
>> as stop and continue. Right? (Are there any plans to implement such
> 
> Correct.
> 
>> functionality for stop/contine transitions?
> 
> Yes, at some point we will likely want this.
> 
>>
>> By the way, when do you expect the pidfd-wait functionality to land 
>> in the kernel?
> 
> I've sent a PR for 5.4:
> https://lkml.org/lkml/2019/9/10/682
> which contains the P_PIDFD extension to waitid().

Thanks for that pointer. I see that the code is
now merged.

> Thanks for the work, Michael!

You're welcome!

Cheers,

Michael



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] clone.2: add CLONE_PIDFD entry
  2019-09-18  6:49     ` Michael Kerrisk (man-pages)
@ 2019-09-18  7:14       ` Christian Brauner
  2019-09-19  4:04         ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 9+ messages in thread
From: Christian Brauner @ 2019-09-18  7:14 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages); +Cc: linux-man, jannh, oleg

On Wed, Sep 18, 2019 at 08:49:59AM +0200, Michael Kerrisk (man-pages) wrote:
> Hello Christian,
> 
> On 9/16/19 9:40 AM, Christian Brauner wrote:
> > On Wed, Sep 11, 2019 at 10:58:57AM +0200, Michael Kerrisk (man-pages) wrote:
> >> Hello Christian,
> >>
> >> On 5/11/19 8:49 AM, Christian Brauner wrote:
> >>> From: Christian Brauner <christian@brauner.io>
> >>>
> >>> Add an entry for CLONE_PIDFD. This flag is available starting with
> >>> kernel 5.2. If specified, a process file descriptor ("pidfd") referring
> >>> to the child process will be returned in the ptid argument.
> >>
> >> I've applied this patch in a local branch, and made some minor edits
> > 
> > Thank you! :)
> > 
> >> and added a piece. And I have some questions. See below.
> >>
> >>> Signed-off-by: Christian Brauner <christian@brauner.io>
> >>> ---
> 
> [...]
> 
> >>> Note, that the kernel verifies that the value for
> >>> +.I ptid
> >>> +is zero. If it is not an error will be returned. This ensures that
> >>> +.I ptid
> >>> +can potentially be used to specify additional options for
> >>> +.B CLONE_PIDFD
> >>> +in the future.
> >>
> >> This piece is no longer true, right? At least I can't see such 
> > 
> > Correct.
> 
> Thanks. Page amended.
> 
> >> a check in the kernel code, and my testing doesn't yield an error
> >> when ptid != 0 before the call.(No need to send me a patch; if I'm
> >> correct just let me know and I'll edit out this piece.)
> >>
> >>> +.IP
> >>> +Since the
> >>> +.I ptid
> >>> +argument is used to return the pidfd,
> >>> +.B CLONE_PIDFD
> >>> +cannot be used with
> >>> +.B CLONE_PARENT_SETTID.
> >>> +.IP
> >>> +It is currently not possible to use this flag together with
> >>> +.B CLONE_THREAD.
> >>> +This means that the process identified by the pidfd will always be a
> >>> +thread-group leader.
> >>> +.IP
> >>> +For a while there was a
> >>> +.B CLONE_DETACHED
> >>> +flag. This flag is usually ignored when passed along with other flags.
> >>> +However, when passed alongside
> >>> +.B CLONE_PIDFD
> >>> +an error will be returned. This ensures that this flag can be reused
> >>> +for further pidfd features in the future.
> >>> +.TP
> >>>  .BR CLONE_PTRACE " (since Linux 2.2)"
> >>>  If
> >>>  .B CLONE_PTRACE
> >>> @@ -1122,6 +1158,21 @@ For example, on aarch64,
> >>>  .I child_stack
> >>>  must be a multiple of 16.
> >>>  .TP
> >>> +.B EINVAL
> >>> +.B CLONE_PIDFD
> >>> +was specified together with
> >>> +.B CLONE_DETACHED.
> >>> +.TP
> >>> +.B EINVAL
> >>> +.B CLONE_PIDFD
> >>> +was specified together with
> >>> +.B CLONE_PARENT_SETTID.
> >>> +.TP
> >>> +.B EINVAL
> >>> +.B CLONE_PIDFD
> >>> +was specified together with
> >>> +.B CLONE_THREAD.
> >>> +.TP
> >>>  .B ENOMEM
> >>>  Cannot allocate sufficient memory to allocate a task structure for the
> >>>  child, or to copy those parts of the caller's context that need to be
> >>
> >> One other piece seems to be missing: the returned file descriptor can
> >> be fed to poll()/select()/epoll and the FD will test as readable when
> >> the child terminates. Right? Did that functionality also land in
> >> kernel 5.2? And did it get implemented as a separate commit, or did
> >> the behavior just fall naturally out of the implementation of pidfd's?
> >> Let me know the details, and I will craft a patch.
> > 
> > It landed in 5.3. The relevant commit is:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b53b0b9d9a613c418057f6cb921c2f40a6f78c24
> > and belongs to the following merge:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5450e8a316a64cddcbc15f90733ebc78aa736545
> 
> Thanks for that info. One other questions springs to mind.
> I haven't looked at the source or tried testing this,
> but can anything actually be read() from a PIDFD? Presumably,

We had discussed this but decided to not implement this right away.
Mainly, because we did not have a clear picture what the semantics
should be. But it is something that we will probably want in the
future...

> it might be useful to have data generated on the FD, since
> different values could (ultimately) be used to distinguish
> between terminate/stopp/continue transitions.

Yes.

Thanks!
Christian

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] clone.2: add CLONE_PIDFD entry
  2019-09-18  7:14       ` Christian Brauner
@ 2019-09-19  4:04         ` Michael Kerrisk (man-pages)
  2019-09-19  4:43           ` Michael Kerrisk (man-pages)
  2019-09-19  6:47           ` Christian Brauner
  0 siblings, 2 replies; 9+ messages in thread
From: Michael Kerrisk (man-pages) @ 2019-09-19  4:04 UTC (permalink / raw)
  To: Christian Brauner; +Cc: mtk.manpages, linux-man, jannh, oleg

Hello Christian,

On 9/18/19 9:14 AM, Christian Brauner wrote:
> On Wed, Sep 18, 2019 at 08:49:59AM +0200, Michael Kerrisk (man-pages) wrote:

>>>> One other piece seems to be missing: the returned file descriptor can
>>>> be fed to poll()/select()/epoll and the FD will test as readable when
>>>> the child terminates. Right? Did that functionality also land in
>>>> kernel 5.2? And did it get implemented as a separate commit, or did
>>>> the behavior just fall naturally out of the implementation of pidfd's?
>>>> Let me know the details, and I will craft a patch.
>>>
>>> It landed in 5.3. The relevant commit is:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b53b0b9d9a613c418057f6cb921c2f40a6f78c24
>>> and belongs to the following merge:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5450e8a316a64cddcbc15f90733ebc78aa736545
>>
>> Thanks for that info. One other questions springs to mind.
>> I haven't looked at the source or tried testing this,
>> but can anything actually be read() from a PIDFD? Presumably,
> 
> We had discussed this but decided to not implement this right away.
> Mainly, because we did not have a clear picture what the semantics
> should be. But it is something that we will probably want in the
> future...

That makes sense.

A further question... We now have three ways of getting a
process file descriptor [*]:

open() of /proc/PID
pidfd_open()
clone()/clone3() with CLONE_PIDFD

I thought the FD was supposed to be equivalent in all three cases.
However, if I try (on kernel 5.3) poll() an FD returned by opening
/proc/PID, poll() tells me POLLNVAL for the FD. Is that difference
intentional? (I am guessing it is not.)

Thanks,

Michael

[*} By the way, going forward, can we call these things
"process FDs", rather than "PID FDs"? The API names are what
they are, an that's okay, but these just as we have socket
FDs that refer to sockets, directory FDs that refer to 
directories, and timer FDs that refer to timers, and so on,
these are FDs that refer to *processes*, not "process IDs".
It's a little thing, but I think the naming better, and
it's what I propose to use in the manual pages.

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] clone.2: add CLONE_PIDFD entry
  2019-09-19  4:04         ` Michael Kerrisk (man-pages)
@ 2019-09-19  4:43           ` Michael Kerrisk (man-pages)
  2019-09-19  6:47           ` Christian Brauner
  1 sibling, 0 replies; 9+ messages in thread
From: Michael Kerrisk (man-pages) @ 2019-09-19  4:43 UTC (permalink / raw)
  To: Christian Brauner; +Cc: mtk.manpages, linux-man, jannh, oleg

Hello Christian,

A tweak to one point from my mail of a few minutes ago...

On 9/19/19 6:04 AM, Michael Kerrisk (man-pages) wrote:

> A further question... We now have three ways of getting a
> process file descriptor [*]:
> 
> open() of /proc/PID
> pidfd_open()
> clone()/clone3() with CLONE_PIDFD
> 
> I thought the FD was supposed to be equivalent in all three cases.
> However, if I try (on kernel 5.3) poll() an FD returned by opening
> /proc/PID, poll() tells me POLLNVAL for the FD. Is that difference
> intentional? (I am guessing it is not.)

Okay -- I made a misstep there, although my question remains.
I got the POLLNVAL return because I had opened /proc/PID with
the O_PATH flag. When I open /proc/PID with O_RDONLY, then poll()
on the FD always returns immediately (i.e., before the target
process has terminated), and the returned events are
POLLIN+POLLOUT.

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] clone.2: add CLONE_PIDFD entry
  2019-09-19  4:04         ` Michael Kerrisk (man-pages)
  2019-09-19  4:43           ` Michael Kerrisk (man-pages)
@ 2019-09-19  6:47           ` Christian Brauner
  2019-09-23  8:11             ` Michael Kerrisk (man-pages)
  1 sibling, 1 reply; 9+ messages in thread
From: Christian Brauner @ 2019-09-19  6:47 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages); +Cc: linux-man, jannh, oleg

On Thu, Sep 19, 2019 at 06:04:55AM +0200, Michael Kerrisk (man-pages) wrote:
> Hello Christian,
> 
> On 9/18/19 9:14 AM, Christian Brauner wrote:
> > On Wed, Sep 18, 2019 at 08:49:59AM +0200, Michael Kerrisk (man-pages) wrote:
> 
> >>>> One other piece seems to be missing: the returned file descriptor can
> >>>> be fed to poll()/select()/epoll and the FD will test as readable when
> >>>> the child terminates. Right? Did that functionality also land in
> >>>> kernel 5.2? And did it get implemented as a separate commit, or did
> >>>> the behavior just fall naturally out of the implementation of pidfd's?
> >>>> Let me know the details, and I will craft a patch.
> >>>
> >>> It landed in 5.3. The relevant commit is:
> >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b53b0b9d9a613c418057f6cb921c2f40a6f78c24
> >>> and belongs to the following merge:
> >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5450e8a316a64cddcbc15f90733ebc78aa736545
> >>
> >> Thanks for that info. One other questions springs to mind.
> >> I haven't looked at the source or tried testing this,
> >> but can anything actually be read() from a PIDFD? Presumably,
> > 
> > We had discussed this but decided to not implement this right away.
> > Mainly, because we did not have a clear picture what the semantics
> > should be. But it is something that we will probably want in the
> > future...
> 
> That makes sense.
> 
> A further question... We now have three ways of getting a
> process file descriptor [*]:
> 
> open() of /proc/PID
> pidfd_open()
> clone()/clone3() with CLONE_PIDFD
> 
> I thought the FD was supposed to be equivalent in all three cases.
> However, if I try (on kernel 5.3) poll() an FD returned by opening
> /proc/PID, poll() tells me POLLNVAL for the FD. Is that difference
> intentional? (I am guessing it is not.)

It's intentional.
The short answer is that /proc/<pid> is a convenience for sending
signals.
The longer answer is that this stems from a heavy debate about what a
process file descriptor was supposed to be and some people pushing for
at least being able to use /proc/<pid> dirfds while ignoring security
problems as soon as you're talking about returning those fds from
clone(); not to mention the additional problems discovered when trying
to implementing this.
A "real" pidfd is one from CLONE_PIDFD or pidfd_open() and all features
such as exit notification, read, and other future extensions will only
be implemented on top of them.
As much as we'd have liked to get rid of two different file descriptor
types it doesn't hurt us much and is not that much different from what
we will e.g. see with fsinfo() in the new mount api which needs to work
on regular fds gotten via open()/openat() and mountfds gotten from
fsopen() and fspick(). The mountfds will also allow for advanced
operations that the other ones will not. There's even an argument to be
made that fds you will get from open()/openat() and openat2() are
different types since they have very different behavior; openat2()
returning fds that are non arbitrarily upgradable etc.

> 
> Thanks,
> 
> Michael
> 
> [*} By the way, going forward, can we call these things
> "process FDs", rather than "PID FDs"? The API names are what
> they are, an that's okay, but these just as we have socket
> FDs that refer to sockets, directory FDs that refer to 
> directories, and timer FDs that refer to timers, and so on,
> these are FDs that refer to *processes*, not "process IDs".
> It's a little thing, but I think the naming better, and
> it's what I propose to use in the manual pages.

The naming was another debate and we ended with this compromise.
I would just clarify that a pidfd is a process file descriptor. I
wouldn't make too much of a deal of hiding the shortcut "pidfd". People
are already using it out there in the wild and it's never proven a good
idea to go against accepted practice.

Christian

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] clone.2: add CLONE_PIDFD entry
  2019-09-19  6:47           ` Christian Brauner
@ 2019-09-23  8:11             ` Michael Kerrisk (man-pages)
  2019-09-23 14:13               ` Christian Brauner
  0 siblings, 1 reply; 9+ messages in thread
From: Michael Kerrisk (man-pages) @ 2019-09-23  8:11 UTC (permalink / raw)
  To: Christian Brauner; +Cc: mtk.manpages, linux-man, jannh, oleg

Hello Christian,

On 9/19/19 8:47 AM, Christian Brauner wrote:
> On Thu, Sep 19, 2019 at 06:04:55AM +0200, Michael Kerrisk (man-pages) wrote:

[...]

>>>> Thanks for that info. One other questions springs to mind.
>>>> I haven't looked at the source or tried testing this,
>>>> but can anything actually be read() from a PIDFD? Presumably,
>>>
>>> We had discussed this but decided to not implement this right away.
>>> Mainly, because we did not have a clear picture what the semantics
>>> should be. But it is something that we will probably want in the
>>> future...
>>
>> That makes sense.
>>
>> A further question... We now have three ways of getting a
>> process file descriptor [*]:
>>
>> open() of /proc/PID
>> pidfd_open()
>> clone()/clone3() with CLONE_PIDFD
>>
>> I thought the FD was supposed to be equivalent in all three cases.
>> However, if I try (on kernel 5.3) poll() an FD returned by opening
>> /proc/PID, poll() tells me POLLNVAL for the FD. Is that difference
>> intentional? (I am guessing it is not.)
> 
> It's intentional.
> The short answer is that /proc/<pid> is a convenience for sending
> signals.
> The longer answer is that this stems from a heavy debate about what a
> process file descriptor was supposed to be and some people pushing for
> at least being able to use /proc/<pid> dirfds while ignoring security
> problems as soon as you're talking about returning those fds from
> clone(); not to mention the additional problems discovered when trying
> to implementing this.
> A "real" pidfd is one from CLONE_PIDFD or pidfd_open() and all features
> such as exit notification, read, and other future extensions will only
> be implemented on top of them.
> As much as we'd have liked to get rid of two different file descriptor
> types it doesn't hurt us much and is not that much different from what
> we will e.g. see with fsinfo() in the new mount api which needs to work
> on regular fds gotten via open()/openat() and mountfds gotten from
> fsopen() and fspick(). The mountfds will also allow for advanced
> operations that the other ones will not. There's even an argument to be
> made that fds you will get from open()/openat() and openat2() are
> different types since they have very different behavior; openat2()
> returning fds that are non arbitrarily upgradable etc.

Okay. So, it would be fair to say (in the man pages) that
pidfd_open() is the preferred way of obtaining a PID file
descriptor for an already existing process?

>> [*} By the way, going forward, can we call these things
>> "process FDs", rather than "PID FDs"? The API names are what
>> they are, an that's okay, but these just as we have socket
>> FDs that refer to sockets, directory FDs that refer to 
>> directories, and timer FDs that refer to timers, and so on,
>> these are FDs that refer to *processes*, not "process IDs".
>> It's a little thing, but I think the naming better, and
>> it's what I propose to use in the manual pages.
> 
> The naming was another debate and we ended with this compromise.
> I would just clarify that a pidfd is a process file descriptor. I
> wouldn't make too much of a deal of hiding the shortcut "pidfd". People
> are already using it out there in the wild and it's never proven a good
> idea to go against accepted practice.

Okay.

I have a draft pidfd_open(2) page that I will send out soon.

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] clone.2: add CLONE_PIDFD entry
  2019-09-23  8:11             ` Michael Kerrisk (man-pages)
@ 2019-09-23 14:13               ` Christian Brauner
  0 siblings, 0 replies; 9+ messages in thread
From: Christian Brauner @ 2019-09-23 14:13 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages); +Cc: linux-man, jannh, oleg

On Mon, Sep 23, 2019 at 10:11:45AM +0200, Michael Kerrisk (man-pages) wrote:
> Hello Christian,
> 
> On 9/19/19 8:47 AM, Christian Brauner wrote:
> > On Thu, Sep 19, 2019 at 06:04:55AM +0200, Michael Kerrisk (man-pages) wrote:
> 
> [...]
> 
> >>>> Thanks for that info. One other questions springs to mind.
> >>>> I haven't looked at the source or tried testing this,
> >>>> but can anything actually be read() from a PIDFD? Presumably,
> >>>
> >>> We had discussed this but decided to not implement this right away.
> >>> Mainly, because we did not have a clear picture what the semantics
> >>> should be. But it is something that we will probably want in the
> >>> future...
> >>
> >> That makes sense.
> >>
> >> A further question... We now have three ways of getting a
> >> process file descriptor [*]:
> >>
> >> open() of /proc/PID
> >> pidfd_open()
> >> clone()/clone3() with CLONE_PIDFD
> >>
> >> I thought the FD was supposed to be equivalent in all three cases.
> >> However, if I try (on kernel 5.3) poll() an FD returned by opening
> >> /proc/PID, poll() tells me POLLNVAL for the FD. Is that difference
> >> intentional? (I am guessing it is not.)
> > 
> > It's intentional.
> > The short answer is that /proc/<pid> is a convenience for sending
> > signals.
> > The longer answer is that this stems from a heavy debate about what a
> > process file descriptor was supposed to be and some people pushing for
> > at least being able to use /proc/<pid> dirfds while ignoring security
> > problems as soon as you're talking about returning those fds from
> > clone(); not to mention the additional problems discovered when trying
> > to implementing this.
> > A "real" pidfd is one from CLONE_PIDFD or pidfd_open() and all features
> > such as exit notification, read, and other future extensions will only
> > be implemented on top of them.
> > As much as we'd have liked to get rid of two different file descriptor
> > types it doesn't hurt us much and is not that much different from what
> > we will e.g. see with fsinfo() in the new mount api which needs to work
> > on regular fds gotten via open()/openat() and mountfds gotten from
> > fsopen() and fspick(). The mountfds will also allow for advanced
> > operations that the other ones will not. There's even an argument to be
> > made that fds you will get from open()/openat() and openat2() are
> > different types since they have very different behavior; openat2()
> > returning fds that are non arbitrarily upgradable etc.
> 
> Okay. So, it would be fair to say (in the man pages) that
> pidfd_open() is the preferred way of obtaining a PID file
> descriptor for an already existing process?

Sure, or just not make a big thing about /proc/<pid> being useable.

> 
> >> [*} By the way, going forward, can we call these things
> >> "process FDs", rather than "PID FDs"? The API names are what
> >> they are, an that's okay, but these just as we have socket
> >> FDs that refer to sockets, directory FDs that refer to 
> >> directories, and timer FDs that refer to timers, and so on,
> >> these are FDs that refer to *processes*, not "process IDs".
> >> It's a little thing, but I think the naming better, and
> >> it's what I propose to use in the manual pages.
> > 
> > The naming was another debate and we ended with this compromise.
> > I would just clarify that a pidfd is a process file descriptor. I
> > wouldn't make too much of a deal of hiding the shortcut "pidfd". People
> > are already using it out there in the wild and it's never proven a good
> > idea to go against accepted practice.
> 
> Okay.
> 
> I have a draft pidfd_open(2) page that I will send out soon.

Thanks!
I will review it soon!

Christian

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2019-09-23 14:13 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20190511064908.21956-1-christian.brauner@ubuntu.com>
2019-09-11  8:58 ` [PATCH] clone.2: add CLONE_PIDFD entry Michael Kerrisk (man-pages)
2019-09-16  7:40   ` Christian Brauner
2019-09-18  6:49     ` Michael Kerrisk (man-pages)
2019-09-18  7:14       ` Christian Brauner
2019-09-19  4:04         ` Michael Kerrisk (man-pages)
2019-09-19  4:43           ` Michael Kerrisk (man-pages)
2019-09-19  6:47           ` Christian Brauner
2019-09-23  8:11             ` Michael Kerrisk (man-pages)
2019-09-23 14:13               ` Christian Brauner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).