linux-man.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/2] clone.2: tfix
@ 2019-11-28 12:46 Adrian Reber
  2019-11-28 12:46 ` [PATCH 2/2] clone.2: added clone3() set_tid information Adrian Reber
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Adrian Reber @ 2019-11-28 12:46 UTC (permalink / raw)
  To: mtk.manpages, Christian Brauner; +Cc: linux-man

added two missing parentheses

Signed-off-by: Adrian Reber <areber@redhat.com>
---
 man2/clone.2 | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/man2/clone.2 b/man2/clone.2
index e5ab2a096..076b9258e 100644
--- a/man2/clone.2
+++ b/man2/clone.2
@@ -1349,13 +1349,13 @@ For example, on aarch64,
 .I stack
 must be a multiple of 16.
 .TP
-.BR EINVAL " (" clone3 "() only"
+.BR EINVAL " (" clone3 "() only)"
 .B  CLONE_DETACHED
 was specified in the
 .I flags
 mask.
 .TP
-.BR EINVAL " (" clone "() only"
+.BR EINVAL " (" clone "() only)"
 .B CLONE_PIDFD
 was specified together with
 .B CLONE_DETACHED

base-commit: 445fc03eebeee9b0e9ccaa4e2361f91f30749e8e
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/2] clone.2: added clone3() set_tid information
  2019-11-28 12:46 [PATCH 1/2] clone.2: tfix Adrian Reber
@ 2019-11-28 12:46 ` Adrian Reber
  2019-11-28 17:24   ` Christian Brauner
  2019-12-01 12:18   ` Michael Kerrisk (man-pages)
  2019-11-28 13:30 ` [PATCH 1/2] clone.2: tfix Christian Brauner
  2019-12-01 12:16 ` Michael Kerrisk (man-pages)
  2 siblings, 2 replies; 10+ messages in thread
From: Adrian Reber @ 2019-11-28 12:46 UTC (permalink / raw)
  To: mtk.manpages, Christian Brauner; +Cc: linux-man

Signed-off-by: Adrian Reber <areber@redhat.com>
---
 man2/clone.2 | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 90 insertions(+)

diff --git a/man2/clone.2 b/man2/clone.2
index 076b9258e..59c13ec35 100644
--- a/man2/clone.2
+++ b/man2/clone.2
@@ -195,6 +195,8 @@ struct clone_args {
     u64 stack;        /* Pointer to lowest byte of stack */
     u64 stack_size;   /* Size of stack */
     u64 tls;          /* Location of new TLS */
+    u64 set_tid;      /* Pointer to a \fIpid_t\fP array */
+    u64 set_tid_size; /* Number of elements in \fIset_tid\fP */
 };
 .EE
 .in
@@ -262,6 +264,8 @@ flags & 0xff	exit_signal
 stack	stack
 \fP---\fP	stack_size
 tls	tls	See CLONE_SETTLS
+\fP---\fP	set_tid	See below for details
+\fP---\fP	set_tid_size
 .TE
 .RE
 .\"
@@ -285,6 +289,74 @@ options when waiting for the child with
 If no signal (i.e., zero) is specified, then the parent process is not signaled
 when the child terminates.
 .\"
+.SS The set_tid array
+.PP
+The
+.I set_tid
+array is used to select a certain PID for the process to be created by
+.BR clone3 ().
+If the PID of the newly created process should only be set for the current
+PID namespace or in the newly created PID namespace (if
+.I flags
+contains
+.BR CLONE_NEWPID )
+then the first element in the
+.I set_tid
+array has to be the desired PID and
+.I set_tid_size
+needs to be 1.
+.PP
+If the PID of the newly created process should have a certain value in
+multiple PID namespaces the
+.I set_tid
+array can have multiple entries. The first entry defines the PID in the most
+nested PID namespace and all following entries contain the PID of the
+corresponding parent PID namespace. The number of PID namespaces in which a PID
+should be set is defined by
+.I set_tid_size
+which cannot be larger than the number of currently nested PID namespaces.
+.PP
+To create a process with the following PIDs:
+.RS
+.TS
+lb lb
+l l .
+PID NS level	Requested PID
+0 (host)	31496
+1	42
+2	7
+.TE
+.RE
+.PP
+The
+.I set_tid
+array would need to be filled with:
+.PP
+.EX
+	set_tid[0] = 7;
+	set_tid[1] = 42;
+	set_tid[2] = 31496;
+	set_tid_size = 3;
+.EE
+.PP
+If only the PID of the two innermost PID namespaces
+should be defined it needs to be set like this:
+.PP
+.EX
+	set_tid[0] = 7;
+	set_tid[1] = 42;
+	set_tid_size = 2;
+.EE
+.PP
+The PID in the PID namespaces outside the two innermost PID namespaces
+is then selected the same way as any other PID is selected.
+.PP
+Only a privileged process
+.RB ( CAP_SYS_ADMIN )
+can set
+.I set_tid
+to select a PID for the process to be created.
+.\"
 .SS The flags mask
 .PP
 Both
@@ -1379,6 +1451,16 @@ in the
 .I flags
 mask.
 .TP
+.BR EINVAL " (" clone3 "() only)"
+.I set_tid_size
+larger than current number of nested PID namespaces or maximum number of
+nested PID namespaces was specified.
+.TP
+.BR EINVAL " (" clone3 "() only)"
+If one of the PIDs specified in
+.I set_tid
+was an invalid PID.
+.TP
 .B ENOMEM
 Cannot allocate sufficient memory to allocate a task structure for the
 child, or to copy those parts of the caller's context that need to be
@@ -1450,6 +1532,14 @@ mask and the caller is in a chroot environment
 (i.e., the caller's root directory does not match the root directory
 of the mount namespace in which it resides).
 .TP
+.BR EPERM " (" clone3 "() only)"
+If
+.I set_tid
+with
+.I set_tid_size
+larger than 0 was specified by an unprivileged process (process without
+\fBCAP_SYS_ADMIN\fP).
+.TP
 .BR ERESTARTNOINTR " (since Linux 2.6.17)"
 .\" commit 4a2c7a7837da1b91468e50426066d988050e4d56
 System call was interrupted by a signal and will be restarted.
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] clone.2: tfix
  2019-11-28 12:46 [PATCH 1/2] clone.2: tfix Adrian Reber
  2019-11-28 12:46 ` [PATCH 2/2] clone.2: added clone3() set_tid information Adrian Reber
@ 2019-11-28 13:30 ` Christian Brauner
  2019-12-01 12:16 ` Michael Kerrisk (man-pages)
  2 siblings, 0 replies; 10+ messages in thread
From: Christian Brauner @ 2019-11-28 13:30 UTC (permalink / raw)
  To: Adrian Reber; +Cc: mtk.manpages, linux-man

On Thu, Nov 28, 2019 at 01:46:49PM +0100, Adrian Reber wrote:
> added two missing parentheses
> 
> Signed-off-by: Adrian Reber <areber@redhat.com>

Seems reasonable to me.
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] clone.2: added clone3() set_tid information
  2019-11-28 12:46 ` [PATCH 2/2] clone.2: added clone3() set_tid information Adrian Reber
@ 2019-11-28 17:24   ` Christian Brauner
  2019-11-29 13:56     ` Adrian Reber
  2019-12-01 12:18     ` Michael Kerrisk (man-pages)
  2019-12-01 12:18   ` Michael Kerrisk (man-pages)
  1 sibling, 2 replies; 10+ messages in thread
From: Christian Brauner @ 2019-11-28 17:24 UTC (permalink / raw)
  To: Adrian Reber; +Cc: mtk.manpages, linux-man

On Thu, Nov 28, 2019 at 01:46:50PM +0100, Adrian Reber wrote:
> Signed-off-by: Adrian Reber <areber@redhat.com>
> ---
>  man2/clone.2 | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 90 insertions(+)
> 
> diff --git a/man2/clone.2 b/man2/clone.2
> index 076b9258e..59c13ec35 100644
> --- a/man2/clone.2
> +++ b/man2/clone.2
> @@ -195,6 +195,8 @@ struct clone_args {
>      u64 stack;        /* Pointer to lowest byte of stack */
>      u64 stack_size;   /* Size of stack */
>      u64 tls;          /* Location of new TLS */
> +    u64 set_tid;      /* Pointer to a \fIpid_t\fP array */
> +    u64 set_tid_size; /* Number of elements in \fIset_tid\fP */
>  };
>  .EE
>  .in
> @@ -262,6 +264,8 @@ flags & 0xff	exit_signal
>  stack	stack
>  \fP---\fP	stack_size
>  tls	tls	See CLONE_SETTLS
> +\fP---\fP	set_tid	See below for details
> +\fP---\fP	set_tid_size
>  .TE
>  .RE
>  .\"
> @@ -285,6 +289,74 @@ options when waiting for the child with
>  If no signal (i.e., zero) is specified, then the parent process is not signaled
>  when the child terminates.
>  .\"
> +.SS The set_tid array
> +.PP
> +The
> +.I set_tid
> +array is used to select a certain PID for the process to be created by
> +.BR clone3 ().
> +If the PID of the newly created process should only be set for the current
> +PID namespace or in the newly created PID namespace (if
> +.I flags
> +contains
> +.BR CLONE_NEWPID )
> +then the first element in the
> +.I set_tid
> +array has to be the desired PID and
> +.I set_tid_size
> +needs to be 1.
> +.PP
> +If the PID of the newly created process should have a certain value in
> +multiple PID namespaces the
> +.I set_tid
> +array can have multiple entries. The first entry defines the PID in the most
> +nested PID namespace and all following entries contain the PID of the
> +corresponding parent PID namespace. The number of PID namespaces in which a PID
> +should be set is defined by
> +.I set_tid_size
> +which cannot be larger than the number of currently nested PID namespaces.

"It's upper cap is the kernel-enforced general nesting limit."
or sm like that

> +.PP
> +To create a process with the following PIDs:

in a pid namespace hierarchy

> +.RS
> +.TS
> +lb lb
> +l l .
> +PID NS level	Requested PID
> +0 (host)	31496
> +1	42
> +2	7
> +.TE
> +.RE
> +.PP

Set the array to:

> +.PP
> +.EX
> +	set_tid[0] = 7;
> +	set_tid[1] = 42;
> +	set_tid[2] = 31496;
> +	set_tid_size = 3;
> +.EE
> +.PP
> +If only the PID of the two innermost PID namespaces

need to be specified set the array to:

> +.PP
> +.EX
> +	set_tid[0] = 7;
> +	set_tid[1] = 42;
> +	set_tid_size = 2;
> +.EE
> +.PP
> +The PID in the PID namespaces outside the two innermost PID namespaces
> +is then selected the same way as any other PID is selected.

s/is then/will be/

> +.PP
> +Only a privileged process
> +.RB ( CAP_SYS_ADMIN )

Maybe more specific:

The set_tid feature requires CAP_SYS_ADMIN in all owning user namespaces
of the target pid namespaces.

> +can set
> +.I set_tid
> +to select a PID for the process to be created.
> +.\"
>  .SS The flags mask
>  .PP
>  Both
> @@ -1379,6 +1451,16 @@ in the
>  .I flags
>  mask.
>  .TP
> +.BR EINVAL " (" clone3 "() only)"
> +.I set_tid_size
> +larger than current number of nested PID namespaces or maximum number of
> +nested PID namespaces was specified.
> +.TP
> +.BR EINVAL " (" clone3 "() only)"
> +If one of the PIDs specified in
> +.I set_tid
> +was an invalid PID.
> +.TP
>  .B ENOMEM
>  Cannot allocate sufficient memory to allocate a task structure for the
>  child, or to copy those parts of the caller's context that need to be
> @@ -1450,6 +1532,14 @@ mask and the caller is in a chroot environment
>  (i.e., the caller's root directory does not match the root directory
>  of the mount namespace in which it resides).
>  .TP
> +.BR EPERM " (" clone3 "() only)"
> +If
> +.I set_tid
> +with
> +.I set_tid_size
> +larger than 0 was specified by an unprivileged process (process without
> +\fBCAP_SYS_ADMIN\fP).

without CAP_SYS_ADMIN in _one_ of the owning user namespaces of the
target pid namespace

maybe?

Where's EEXIST?

	Christian

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] clone.2: added clone3() set_tid information
  2019-11-28 17:24   ` Christian Brauner
@ 2019-11-29 13:56     ` Adrian Reber
  2019-12-01 12:17       ` Michael Kerrisk (man-pages)
  2019-12-01 12:18     ` Michael Kerrisk (man-pages)
  1 sibling, 1 reply; 10+ messages in thread
From: Adrian Reber @ 2019-11-29 13:56 UTC (permalink / raw)
  To: Christian Brauner; +Cc: mtk.manpages, linux-man

On Thu, Nov 28, 2019 at 06:24:05PM +0100, Christian Brauner wrote:
> On Thu, Nov 28, 2019 at 01:46:50PM +0100, Adrian Reber wrote:
> > Signed-off-by: Adrian Reber <areber@redhat.com>
> > ---
> >  man2/clone.2 | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 90 insertions(+)
> > 
> > diff --git a/man2/clone.2 b/man2/clone.2
> > index 076b9258e..59c13ec35 100644
> > --- a/man2/clone.2
> > +++ b/man2/clone.2
> > @@ -195,6 +195,8 @@ struct clone_args {
> >      u64 stack;        /* Pointer to lowest byte of stack */
> >      u64 stack_size;   /* Size of stack */
> >      u64 tls;          /* Location of new TLS */
> > +    u64 set_tid;      /* Pointer to a \fIpid_t\fP array */
> > +    u64 set_tid_size; /* Number of elements in \fIset_tid\fP */
> >  };
> >  .EE
> >  .in
> > @@ -262,6 +264,8 @@ flags & 0xff	exit_signal
> >  stack	stack
> >  \fP---\fP	stack_size
> >  tls	tls	See CLONE_SETTLS
> > +\fP---\fP	set_tid	See below for details
> > +\fP---\fP	set_tid_size
> >  .TE
> >  .RE
> >  .\"
> > @@ -285,6 +289,74 @@ options when waiting for the child with
> >  If no signal (i.e., zero) is specified, then the parent process is not signaled
> >  when the child terminates.
> >  .\"
> > +.SS The set_tid array
> > +.PP
> > +The
> > +.I set_tid
> > +array is used to select a certain PID for the process to be created by
> > +.BR clone3 ().
> > +If the PID of the newly created process should only be set for the current
> > +PID namespace or in the newly created PID namespace (if
> > +.I flags
> > +contains
> > +.BR CLONE_NEWPID )
> > +then the first element in the
> > +.I set_tid
> > +array has to be the desired PID and
> > +.I set_tid_size
> > +needs to be 1.
> > +.PP
> > +If the PID of the newly created process should have a certain value in
> > +multiple PID namespaces the
> > +.I set_tid
> > +array can have multiple entries. The first entry defines the PID in the most
> > +nested PID namespace and all following entries contain the PID of the
> > +corresponding parent PID namespace. The number of PID namespaces in which a PID
> > +should be set is defined by
> > +.I set_tid_size
> > +which cannot be larger than the number of currently nested PID namespaces.
> 
> "It's upper cap is the kernel-enforced general nesting limit."
> or sm like that

Is that an addition to my sentence or a replacement. I think at this
point it is more important to point out that it cannot be larger than
the number of currently nested PID namespaces. Later (at EPERM) I am
also mentioning that it cannot be larger than the maximum number of
nested PID namespaces. The code does indeed check if set_tid_size is
larger than the maximum number of possible nested PID namespaces for
the user, I think, when calling clone3(), it is more relevant that
set_tid_size is not larger than the number of currently nested PID
namespaces. The maximum number of possible nested PID namespaces is more
likely enforced during unshare() or CLONE_NEWPID (which could be
happening at the same point in time as set_tid_size larger than maximum
number of nested PID namespace).

This definitely feels like too much discussion for a single sentence ;)

I can add a sentence about the maximum number of nested PID namespaces
here in addition to the one at EPERM.
I do not think it is relevant for the user at this point in time.

		Adrian


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] clone.2: tfix
  2019-11-28 12:46 [PATCH 1/2] clone.2: tfix Adrian Reber
  2019-11-28 12:46 ` [PATCH 2/2] clone.2: added clone3() set_tid information Adrian Reber
  2019-11-28 13:30 ` [PATCH 1/2] clone.2: tfix Christian Brauner
@ 2019-12-01 12:16 ` Michael Kerrisk (man-pages)
  2 siblings, 0 replies; 10+ messages in thread
From: Michael Kerrisk (man-pages) @ 2019-12-01 12:16 UTC (permalink / raw)
  To: Adrian Reber, Christian Brauner; +Cc: linux-man

Hi Adrian,

On 11/28/19 1:46 PM, Adrian Reber wrote:
> added two missing parentheses
> 
> Signed-off-by: Adrian Reber <areber@redhat.com>
> ---
>   man2/clone.2 | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/man2/clone.2 b/man2/clone.2
> index e5ab2a096..076b9258e 100644
> --- a/man2/clone.2
> +++ b/man2/clone.2
> @@ -1349,13 +1349,13 @@ For example, on aarch64,
>   .I stack
>   must be a multiple of 16.
>   .TP
> -.BR EINVAL " (" clone3 "() only"
> +.BR EINVAL " (" clone3 "() only)"
>   .B  CLONE_DETACHED
>   was specified in the
>   .I flags
>   mask.
>   .TP
> -.BR EINVAL " (" clone "() only"
> +.BR EINVAL " (" clone "() only)"
>   .B CLONE_PIDFD
>   was specified together with
>   .B CLONE_DETACHED
> 
> base-commit: 445fc03eebeee9b0e9ccaa4e2361f91f30749e8e

Thanks. Applied.

Cheers,

Michael


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] clone.2: added clone3() set_tid information
  2019-11-29 13:56     ` Adrian Reber
@ 2019-12-01 12:17       ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 10+ messages in thread
From: Michael Kerrisk (man-pages) @ 2019-12-01 12:17 UTC (permalink / raw)
  To: Adrian Reber, Christian Brauner; +Cc: linux-man

Hi Adrian,

Thank you for the patch. I have some review comments below,
and also further comments in my reply to Christian's notes.

On 11/29/19 2:56 PM, Adrian Reber wrote:
> On Thu, Nov 28, 2019 at 06:24:05PM +0100, Christian Brauner wrote:
>> On Thu, Nov 28, 2019 at 01:46:50PM +0100, Adrian Reber wrote:
>>> Signed-off-by: Adrian Reber <areber@redhat.com>
>>> ---
>>>   man2/clone.2 | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>   1 file changed, 90 insertions(+)
>>>
>>> diff --git a/man2/clone.2 b/man2/clone.2
>>> index 076b9258e..59c13ec35 100644
>>> --- a/man2/clone.2
>>> +++ b/man2/clone.2
>>> @@ -195,6 +195,8 @@ struct clone_args {
>>>       u64 stack;        /* Pointer to lowest byte of stack */
>>>       u64 stack_size;   /* Size of stack */
>>>       u64 tls;          /* Location of new TLS */
>>> +    u64 set_tid;      /* Pointer to a \fIpid_t\fP array */
>>> +    u64 set_tid_size; /* Number of elements in \fIset_tid\fP */
>>>   };
>>>   .EE
>>>   .in
>>> @@ -262,6 +264,8 @@ flags & 0xff	exit_signal
>>>   stack	stack
>>>   \fP---\fP	stack_size
>>>   tls	tls	See CLONE_SETTLS
>>> +\fP---\fP	set_tid	See below for details
>>> +\fP---\fP	set_tid_size
>>>   .TE
>>>   .RE
>>>   .\"
>>> @@ -285,6 +289,74 @@ options when waiting for the child with
>>>   If no signal (i.e., zero) is specified, then the parent process is not signaled
>>>   when the child terminates.
>>>   .\"
>>> +.SS The set_tid array
>>> +.PP
>>> +The
>>> +.I set_tid
>>> +array is used to select a certain PID for the process to be created by
>>> +.BR clone3 ().
>>> +If the PID of the newly created process should only be set for the current
>>> +PID namespace or in the newly created PID namespace (if
>>> +.I flags
>>> +contains
>>> +.BR CLONE_NEWPID )
>>> +then the first element in the
>>> +.I set_tid
>>> +array has to be the desired PID and
>>> +.I set_tid_size
>>> +needs to be 1.
>>> +.PP
>>> +If the PID of the newly created process should have a certain value in
>>> +multiple PID namespaces the
>>> +.I set_tid
>>> +array can have multiple entries. The first entry defines the PID in the most
>>> +nested PID namespace and all following entries contain the PID of the
>>> +corresponding parent PID namespace. The number of PID namespaces in which a PID
>>> +should be set is defined by
>>> +.I set_tid_size
>>> +which cannot be larger than the number of currently nested PID namespaces.
>>
>> "It's upper cap is the kernel-enforced general nesting limit."
>> or sm like that
> 
> Is that an addition to my sentence or a replacement. I think at this
> point it is more important to point out that it cannot be larger than
> the number of currently nested PID namespaces. Later (at EPERM) I am
> also mentioning that it cannot be larger than the maximum number of
> nested PID namespaces. The code does indeed check if set_tid_size is
> larger than the maximum number of possible nested PID namespaces for
> the user, I think, when calling clone3(), it is more relevant that
> set_tid_size is not larger than the number of currently nested PID
> namespaces. The maximum number of possible nested PID namespaces is more
> likely enforced during unshare() or CLONE_NEWPID (which could be
> happening at the same point in time as set_tid_size larger than maximum
> number of nested PID namespace).
> 
> This definitely feels like too much discussion for a single sentence ;)
> 
> I can add a sentence about the maximum number of nested PID namespaces
> here in addition to the one at EPERM.
> I do not think it is relevant for the user at this point in time.

I'm inclined to agree with you. The maximum nesting depth is a general
detail of PID NSs and is already covered in pid_namespaces((7). I think
there's no need to repeat or allude to that detial here the detail here.

Thanks,

Michael

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] clone.2: added clone3() set_tid information
  2019-11-28 12:46 ` [PATCH 2/2] clone.2: added clone3() set_tid information Adrian Reber
  2019-11-28 17:24   ` Christian Brauner
@ 2019-12-01 12:18   ` Michael Kerrisk (man-pages)
  1 sibling, 0 replies; 10+ messages in thread
From: Michael Kerrisk (man-pages) @ 2019-12-01 12:18 UTC (permalink / raw)
  To: Adrian Reber, Christian Brauner; +Cc: linux-man

Hello Adrian,

On 11/28/19 1:46 PM, Adrian Reber wrote:
> Signed-off-by: Adrian Reber <areber@redhat.com>
> ---
>   man2/clone.2 | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 90 insertions(+)
> 
> diff --git a/man2/clone.2 b/man2/clone.2
> index 076b9258e..59c13ec35 100644
> --- a/man2/clone.2
> +++ b/man2/clone.2
> @@ -195,6 +195,8 @@ struct clone_args {
>       u64 stack;        /* Pointer to lowest byte of stack */
>       u64 stack_size;   /* Size of stack */
>       u64 tls;          /* Location of new TLS */
> +    u64 set_tid;      /* Pointer to a \fIpid_t\fP array */
> +    u64 set_tid_size; /* Number of elements in \fIset_tid\fP */
>   };
>   .EE
>   .in
> @@ -262,6 +264,8 @@ flags & 0xff	exit_signal
>   stack	stack
>   \fP---\fP	stack_size
>   tls	tls	See CLONE_SETTLS
> +\fP---\fP	set_tid	See below for details
> +\fP---\fP	set_tid_size
>   .TE
>   .RE
>   .\"
> @@ -285,6 +289,74 @@ options when waiting for the child with
>   If no signal (i.e., zero) is specified, then the parent process is not signaled
>   when the child terminates.
>   .\"
> +.SS The set_tid array
> +.PP
> +The
> +.I set_tid
> +array is used to select a certain PID for the process to be created by

s/is used/may be used/

Because it's not required to use this array, right? I mean, the
default is that the kernel chooses the PIDs. Perhaps this needs to be
more clearly stated at the start of this subsection.

How about:

[[
By default, the kernel chooses the next sequential PID for the new
process in each of the PID namespaces where it is present.
When creating a process with
.BR clone3 (),
the
.I set_tid
array can be used to select specific PIDs for the process in some
or all of the PID namespaces where it is present.
]]

?

> +.BR clone3 ().
> +If the PID of the newly created process should only be set for the current
> +PID namespace or in the newly created PID namespace (if
> +.I flags
> +contains
> +.BR CLONE_NEWPID )
> +then the first element in the
> +.I set_tid
> +array has to be the desired PID and
> +.I set_tid_size
> +needs to be 1.
> +.PP
> +If the PID of the newly created process should have a certain value in
> +multiple PID namespaces the
> +.I set_tid
> +array can have multiple entries. The first entry defines the PID in the most

most *deeply* nested

> +nested PID namespace and all following entries contain the PID of the
> +corresponding parent PID namespace. The number of PID namespaces in which a PID
> +should be set is defined by
> +.I set_tid_size
> +which cannot be larger than the number of currently nested PID namespaces.
> +.PP
> +To create a process with the following PIDs:
> +.RS
> +.TS
> +lb lb
> +l l .
> +PID NS level	Requested PID
> +0 (host)	31496
> +1	42
> +2	7
> +.TE
> +.RE
> +.PP
> +The
> +.I set_tid
> +array would need to be filled with:
> +.PP
> +.EX
> +	set_tid[0] = 7;
> +	set_tid[1] = 42;
> +	set_tid[2] = 31496;
> +	set_tid_size = 3;
> +.EE
> +.PP
> +If only the PID of the two innermost PID namespaces
> +should be defined it needs to be set like this:
> +.PP
> +.EX
> +	set_tid[0] = 7;
> +	set_tid[1] = 42;
> +	set_tid_size = 2;
> +.EE
> +.PP
> +The PID in the PID namespaces outside the two innermost PID namespaces
> +is then selected the same way as any other PID is selected.
> +.PP
> +Only a privileged process
> +.RB ( CAP_SYS_ADMIN )
> +can set
> +.I set_tid
> +to select a PID for the process to be created.
> +.\"
>   .SS The flags mask
>   .PP
>   Both
> @@ -1379,6 +1451,16 @@ in the
>   .I flags
>   mask.
>   .TP
> +.BR EINVAL " (" clone3 "() only)"
> +.I set_tid_size
> +larger than current number of nested PID namespaces or maximum number of
> +nested PID namespaces was specified.
> +.TP
> +.BR EINVAL " (" clone3 "() only)"
> +If one of the PIDs specified in
> +.I set_tid
> +was an invalid PID.
> +.TP
>   .B ENOMEM
>   Cannot allocate sufficient memory to allocate a task structure for the
>   child, or to copy those parts of the caller's context that need to be
> @@ -1450,6 +1532,14 @@ mask and the caller is in a chroot environment
>   (i.e., the caller's root directory does not match the root directory
>   of the mount namespace in which it resides).
>   .TP
> +.BR EPERM " (" clone3 "() only)"
> +If
> +.I set_tid
> +with
> +.I set_tid_size
> +larger than 0 was specified by an unprivileged process (process without
> +\fBCAP_SYS_ADMIN\fP).
> +.TP
>   .BR ERESTARTNOINTR " (since Linux 2.6.17)"
>   .\" commit 4a2c7a7837da1b91468e50426066d988050e4d56
>   System call was interrupted by a signal and will be restarted.

Thanks,

Michael

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] clone.2: added clone3() set_tid information
  2019-11-28 17:24   ` Christian Brauner
  2019-11-29 13:56     ` Adrian Reber
@ 2019-12-01 12:18     ` Michael Kerrisk (man-pages)
  2019-12-02 14:25       ` Adrian Reber
  1 sibling, 1 reply; 10+ messages in thread
From: Michael Kerrisk (man-pages) @ 2019-12-01 12:18 UTC (permalink / raw)
  To: Christian Brauner, Adrian Reber; +Cc: linux-man

Hello Christian, and Adrian,

Christian, thanks for the review!

On 11/28/19 6:24 PM, Christian Brauner wrote:
> On Thu, Nov 28, 2019 at 01:46:50PM +0100, Adrian Reber wrote:
>> Signed-off-by: Adrian Reber <areber@redhat.com>
>> ---
>>   man2/clone.2 | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 90 insertions(+)
>>
>> diff --git a/man2/clone.2 b/man2/clone.2
>> index 076b9258e..59c13ec35 100644
>> --- a/man2/clone.2
>> +++ b/man2/clone.2
>> @@ -195,6 +195,8 @@ struct clone_args {
>>       u64 stack;        /* Pointer to lowest byte of stack */
>>       u64 stack_size;   /* Size of stack */
>>       u64 tls;          /* Location of new TLS */
>> +    u64 set_tid;      /* Pointer to a \fIpid_t\fP array */
>> +    u64 set_tid_size; /* Number of elements in \fIset_tid\fP */
>>   };
>>   .EE
>>   .in
>> @@ -262,6 +264,8 @@ flags & 0xff	exit_signal
>>   stack	stack
>>   \fP---\fP	stack_size
>>   tls	tls	See CLONE_SETTLS
>> +\fP---\fP	set_tid	See below for details
>> +\fP---\fP	set_tid_size
>>   .TE
>>   .RE
>>   .\"
>> @@ -285,6 +289,74 @@ options when waiting for the child with
>>   If no signal (i.e., zero) is specified, then the parent process is not signaled
>>   when the child terminates.
>>   .\"
>> +.SS The set_tid array
>> +.PP
>> +The
>> +.I set_tid
>> +array is used to select a certain PID for the process to be created by
>> +.BR clone3 ().
>> +If the PID of the newly created process should only be set for the current
>> +PID namespace or in the newly created PID namespace (if
>> +.I flags
>> +contains
>> +.BR CLONE_NEWPID )
>> +then the first element in the
>> +.I set_tid
>> +array has to be the desired PID and
>> +.I set_tid_size
>> +needs to be 1.
>> +.PP
>> +If the PID of the newly created process should have a certain value in
>> +multiple PID namespaces the
>> +.I set_tid
>> +array can have multiple entries. The first entry defines the PID in the most
>> +nested PID namespace and all following entries contain the PID of the
>> +corresponding parent PID namespace. The number of PID namespaces in which a PID
>> +should be set is defined by
>> +.I set_tid_size
>> +which cannot be larger than the number of currently nested PID namespaces.
> 
> "It's upper cap is the kernel-enforced general nesting limit."
> or sm like that

See my comments in a separate mail where Adrian replied to the above.

>> +.PP
>> +To create a process with the following PIDs:
> 
> in a pid namespace hierarchy

(okay)

>> +.RS
>> +.TS
>> +lb lb
>> +l l .
>> +PID NS level	Requested PID
>> +0 (host)	31496
>> +1	42
>> +2	7
>> +.TE
>> +.RE
>> +.PP
> 

Christian, you trim away the piece:

[[
     +The
     +.I set_tid
     +array would need to be filled with:
]]

that you suggest replacing with:

 > Set the array to:

I think the suggested change is fine, but when you trim the text
in this way (as you do a few times below, it makes it overly
difficult to review your suggestions, since I must separately
refer back to Adrian's original mail to see what is being
replaced (which is not comfortable on my 13" laptop
11km above Belgium :-)).

>> +.PP
>> +.EX
>> +	set_tid[0] = 7;
>> +	set_tid[1] = 42;
>> +	set_tid[2] = 31496;
>> +	set_tid_size = 3;
>> +.EE
>> +.PP
>> +If only the PID of the two innermost PID namespaces

s/PID of/PIDs in/

> 
> need to be specified set the array to:

(okay; but see my comment about trimming above.)

>> +.PP
>> +.EX
>> +	set_tid[0] = 7;
>> +	set_tid[1] = 42;
>> +	set_tid_size = 2;
>> +.EE
>> +.PP
>> +The PID in the PID namespaces outside the two innermost PID namespaces
>> +is then selected the same way as any other PID is selected.
> 
> s/is then/will be/

(okay)

>> +.PP
>> +Only a privileged process
>> +.RB ( CAP_SYS_ADMIN )
> 
> Maybe more specific:
> 
> The set_tid feature requires CAP_SYS_ADMIN in all owning user namespaces
> of the target pid namespaces.

Excellent addition.

>> +can set
>> +.I set_tid
>> +to select a PID for the process to be created.
>> +.\"
>>   .SS The flags mask
>>   .PP
>>   Both
>> @@ -1379,6 +1451,16 @@ in the
>>   .I flags
>>   mask.
>>   .TP
>> +.BR EINVAL " (" clone3 "() only)"
>> +.I set_tid_size
>> +larger than current number of nested PID namespaces or maximum number of
>> +nested PID namespaces was specified.

The last piece of that sentence feels redundant to me, and it
could/should  be removed, I think. The current number of nested PID
namespaces can't exceed the maximum number of nested namespaces, so
mentioning the latter as a limit seems unnecessary (and even a little
confusing, since it leaves the reader wondering whether "current"
could exceed "maximum".)

Or have I misunderstood what you intend to convey, Adrian?

>> +.TP
>> +.BR EINVAL " (" clone3 "() only)"
>> +If one of the PIDs specified in
>> +.I set_tid
>> +was an invalid PID.
>> +.TP
>>   .B ENOMEM
>>   Cannot allocate sufficient memory to allocate a task structure for the
>>   child, or to copy those parts of the caller's context that need to be
>> @@ -1450,6 +1532,14 @@ mask and the caller is in a chroot environment
>>   (i.e., the caller's root directory does not match the root directory
>>   of the mount namespace in which it resides).
>>   .TP
>> +.BR EPERM " (" clone3 "() only)"
>> +If
>> +.I set_tid
>> +with
>> +.I set_tid_size
>> +larger than 0 was specified by an unprivileged process (process without
>> +\fBCAP_SYS_ADMIN\fP).
> 
> without CAP_SYS_ADMIN in _one_ of the owning user namespaces of the
> target pid namespace
> 
> maybe?

Adrian, I think there's no need to mention 'set_tid'. That's implied
by 'set_tid_size' being greater than 0.

How about:

[[
.BR EPERM " (" clone3 "() only)"
.I set_tid_size
was greater than zero, and the caller lacks the
.B CAP_SYS_ADMIN
capability in one or more of the user namespaces that own the
corresponding PID namespaces.
]]

?

> Where's EEXIST?

Indeed!

I speculate (without having looked at the kernel code):

[[
.BR EPERM " (" clone3 "() only)"
One or more of the PIDs specified in
.I set_tid
already exists in the corresponding namespace.
]]

Thanks,

Michael

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] clone.2: added clone3() set_tid information
  2019-12-01 12:18     ` Michael Kerrisk (man-pages)
@ 2019-12-02 14:25       ` Adrian Reber
  0 siblings, 0 replies; 10+ messages in thread
From: Adrian Reber @ 2019-12-02 14:25 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages); +Cc: Christian Brauner, linux-man

Christian and Michael,

thanks for the reviews. I tried to apply all suggested changes in v2
which I will send out in a few minutes.

		Adrian


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2019-12-02 14:25 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-28 12:46 [PATCH 1/2] clone.2: tfix Adrian Reber
2019-11-28 12:46 ` [PATCH 2/2] clone.2: added clone3() set_tid information Adrian Reber
2019-11-28 17:24   ` Christian Brauner
2019-11-29 13:56     ` Adrian Reber
2019-12-01 12:17       ` Michael Kerrisk (man-pages)
2019-12-01 12:18     ` Michael Kerrisk (man-pages)
2019-12-02 14:25       ` Adrian Reber
2019-12-01 12:18   ` Michael Kerrisk (man-pages)
2019-11-28 13:30 ` [PATCH 1/2] clone.2: tfix Christian Brauner
2019-12-01 12:16 ` Michael Kerrisk (man-pages)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).