unhare() interface design questions and man page

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* unhare() interface design questions and man page
       [not found] <200602072059.k17KxJUI016348@shell0.pdx.osdl.net>
@ 2006-02-13 22:10 ` Michael Kerrisk
  2006-02-14 13:44   ` JANAK DESAI
  0 siblings, 1 reply; 10+ messages in thread
From: Michael Kerrisk @ 2006-02-13 22:10 UTC (permalink / raw)
  To: janak
  Cc: torvalds, akpm, ak, hch, paulus, viro, akpm, linux-kernel,
	michael.kerrisk, mtk-manpages

Hello Janak,

I've been working on a man page for the upcoming 2.6.16 unshare()
syscall, using the documentation that you provided (thank you!)
in your Documentation/unshare.txt patch as a base.

Perhaps you would care to review that page (below), and make
corrections, if needed.

While writing this page, I came up with a number of 
questions about the design of this interface:

1. Your original documentation said:

        The flags argument specifies one or bitwise-or'ed of
        several of the following constants.

   However, my reading of the code (I have not yet tested the 
   syscall) is that 'flags' can be zero.  I don't see any problem
   with that, but it is in conflict with the statement above,
   so it may be worth confirming: what is intended behaviour?
   is 'flags' allowed to be zero?

2. Reading the code and your documentation, I see that CLONE_VM
   implies CLONE_SIGHAND.  Since CLONE_SIGHAND is not implemented
   (i.e., results in an EINVAL error), I take it that this means
   that at the moment CLONE_VM will not work (i.e., will result 
   in EINVAL).  Is this correct?  If so, I will note this in 
   the man page.

3. The naming of the 'flags' bits is inconsistent.  In your
   documentation you note:

        unshare reverses sharing that was done using clone(2) system 
        call, so unshare should have a similar interface as clone(2). 
        That is, since flags in clone(int flags, void *stack) 
        specifies what should be shared, similar flags in 
        unshare(int flags) should specify what should be unshared. 
        Unfortunately, this may appear to invert the meaning of the 
        flags from the way they are used in clone(2).  However, 
        there was no easy solution that was less confusing and that 
        allowed incremental context unsharing in future without an 
        ABI change.

   The problem is that the flags don't simply reverse the meanings
   of the clone() flags of the same name: they do it inconsistently.

   That is, CLONE_FS, CLONE_FILES, and CLONE_VM *reverse* the 
   effects of the clone() flags of the same name, but CLONE_NEWNS 
   *has the same meaning* as the clone() flag of the same name.
   If *all* of the flags were simply reversed, that would be 
   a little strange, but comprehensible; but the fact that one of 
   them is not reversed is very confusing for users of the 
   interface.

   An idea: why not define a new set of flags for unshare()
   which are synonyms of the clone() flags, but make their
   purpose more obvious to the user, i.e., something like
   the following:

         #define UNSHARE_VM     CLONE_VM
         #define UNSHARE_FS     CLONE_FS
         #define UNSHARE_FILES  CLONE_FILES
         #define UNSHARE_NS     CLONE_NEWNS
         etc.

   This would avoid confusion for the interface user.  
   (Yes, I realize that this could be done in glibc, but why 
   make the kernel and glibc definitions differ?)

4. Would it not be wise to include a check of the following form
   at the entry to sys_unshare():

        if (flags & ~(CLONE_FS | CLONE_FILES | CLONE_VM | 
                CLONE_NEWNS | CLONE_SYSVSEM | CLONE_THREAD))
            return -EINVAL;

   This future-proofs the interface against applications
   that try to specify extraneous bits in 'flags': if those
   bits happen to become meaningful in the future, then the
   application behavior would silently change.  Adding this 
   check now prevents applications trying to use those bits 
   until such a time as they have meaning.

Cheers,

Michael

unshare.2 draft man page:

.\" (C) 2006, Janak Desai <janak@us.ibm.com>
.\" (C) 2006, Michael Kerrisk <mtk-manpages@gmx.ne>
.\" Licensed under the GPL
.\"
.TH UNSHARE 2 2005-03-10 "Linux 2.6.16" "Linux Programmer's Manual"
.SH NAME
unshare \- disassociate parts of the process execution context
.SH SYNOPSIS
.nf
.B #include <sched.h>
.sp
.BI "int unshare(int " flags );
.fi
.SH DESCRIPTION
.BR unshare () 
allows a process to disassociate parts of its execution
context that are currently being shared with other processes. 
Part of the execution context, such as the namespace, is shared 
implicitly when a new process is created using 
.BR fork (2)
or
.BR vfork (2), 
while other parts, such as virtual memory, may be
shared by explicit request when creating a process using 
.BR clone (2).

The main use of 
.BR unshare (2)
is to allow a process to control its
shared execution context without creating a new process.

The 
.I flags 
argument is a bit mask that specifies which parts of 
the execution context should be unshared.  
This argument is specified by ORing together one or more
.\" FIXME according to the code, it looks as though 
.\" flags can actually be zero.
of the following constants:
.TP
.B CLONE_FILES
Reverse the effect of the
.BR clone (2)
.B CLONE_FILES
flag.
Unshare the file descriptor table, so that the calling process 
no longer shares its file descriptors with any other process.
.TP
.B CLONE_FS
Reverse the effect of the
.BR clone (2)
.B CLONE_FS 
flag.
Unshare file system attributes, so that the calling process 
no longer shares its root directory, current directory, 
or umask attributes with any other process.
.BR chroot (2),
.BR chdir (2),
or
.BR umask (2)
.TP
.B CLONE_NEWNS
This flag has the same effect as the
.BR clone (2)
.B CLONE_NEWNS
flag.
Unshare the namespace, so that the calling process has a private 
copy of its namespace which is not shared with any other process.
Specifying this flag automatically implies
.B CLONE_FS
as well.
.TP
.B CLONE_VM
Reverse the effect of the
.BR clone (2)
.B CLONE_VM
flag.
.RB ( CLONE_VM
is also implicitly set by
.BR vfork (2),
and can be reversed using this
.BR unshare ()
flag.)
Unshare virtual memory, so that the calling process no 
longer shares its virtual address space with any other process.
.SH RETURN VALUE
On success, zero returned. On failure, \-1 is returned and 
.I errno 
is set to indicate the error.
.SH ERRORS
.TP
.B EPERM
.I flags
specified
.B CLONE_NEWNS 
but the calling process was not privileged (did not have the
.B CAP_SYS_ADMIN
capability).
.TP
.B ENOMEM
Cannot allocate sufficient memory to copy parts of caller's
context that need to be unshared.
.TP
.B EINVAL
An invalid but was specified in
.IR flags .
.SH CONFORMING TO
The
.BR unshare (2)
system call is Linux-specific.
.SH NOTES
Not all of the process attributes that can be shared when 
a new process is created using
.BR clone (2)
can be unshared using
.BR unshare ().
In particular, as at kernel 2.6.16,
.BR unshare () 
does not implement
.BR CLONE_SIGHAND ,
.BR CLONE_SYSVSEM ,
or 
.BR CLONE_THREAD .
.SH SEE ALSO
.BR clone (2), 
.BR fork (2), 
.BR vfork (2), 
Documentation/unshare.txt

-- 
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7 

Want to help with man page maintenance?  
Grab the latest tarball at
ftp://ftp.win.tue.nl/pub/linux-local/manpages/, 
read the HOWTOHELP file and grep the source 
files for 'FIXME'.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: unhare() interface design questions and man page
  2006-02-13 22:10 ` unhare() interface design questions and man page Michael Kerrisk
@ 2006-02-14 13:44   ` JANAK DESAI
  2006-02-14 18:49     ` Michael Kerrisk
  0 siblings, 1 reply; 10+ messages in thread
From: JANAK DESAI @ 2006-02-14 13:44 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: torvalds, akpm, ak, hch, paulus, viro, linux-kernel, michael.kerrisk

Thanks Michael. Answers are inline ... please feel free to change the
man page text for grammer and readability.

Michael Kerrisk wrote:

>Hello Janak,
>
>I've been working on a man page for the upcoming 2.6.16 unshare()
>syscall, using the documentation that you provided (thank you!)
>in your Documentation/unshare.txt patch as a base.
>
>Perhaps you would care to review that page (below), and make
>corrections, if needed.
>
>While writing this page, I came up with a number of 
>questions about the design of this interface:
>
>1. Your original documentation said:
>
>        The flags argument specifies one or bitwise-or'ed of
>        several of the following constants.
>
>   However, my reading of the code (I have not yet tested the 
>   syscall) is that 'flags' can be zero.  I don't see any problem
>   with that, but it is in conflict with the statement above,
>   so it may be worth confirming: what is intended behaviour?
>   is 'flags' allowed to be zero?
>
>  
>
Yes, I agree that the intended behavior of flags == 0 should be clarified.
If flags is zero, then the system call is basically a no-op. You are saying
that you don't want to unshare anything. I have attempted to state this
in the man page source below.

>2. Reading the code and your documentation, I see that CLONE_VM
>   implies CLONE_SIGHAND.  Since CLONE_SIGHAND is not implemented
>   (i.e., results in an EINVAL error), I take it that this means
>   that at the moment CLONE_VM will not work (i.e., will result 
>   in EINVAL).  Is this correct?  If so, I will note this in 
>   the man page.
>
>  
>
Actually, CLONE_SIGHAND implies CLONE_VM and not the
otherway around. Currently CLONE_VM is supported, as long as
singnal handlers are not being shared. That is, if you created the
process using clone(CLONE_VM), which kept signal handlers
different, then you can unshare VM using unshare(CLONE_VM).

>3. The naming of the 'flags' bits is inconsistent.  In your
>   documentation you note:
>
>        unshare reverses sharing that was done using clone(2) system 
>        call, so unshare should have a similar interface as clone(2). 
>        That is, since flags in clone(int flags, void *stack) 
>        specifies what should be shared, similar flags in 
>        unshare(int flags) should specify what should be unshared. 
>        Unfortunately, this may appear to invert the meaning of the 
>        flags from the way they are used in clone(2).  However, 
>        there was no easy solution that was less confusing and that 
>        allowed incremental context unsharing in future without an 
>        ABI change.
>   
>   The problem is that the flags don't simply reverse the meanings
>   of the clone() flags of the same name: they do it inconsistently.
>
>   That is, CLONE_FS, CLONE_FILES, and CLONE_VM *reverse* the 
>   effects of the clone() flags of the same name, but CLONE_NEWNS 
>   *has the same meaning* as the clone() flag of the same name.
>   If *all* of the flags were simply reversed, that would be 
>   a little strange, but comprehensible; but the fact that one of 
>   them is not reversed is very confusing for users of the 
>   interface.
>
>   An idea: why not define a new set of flags for unshare()
>   which are synonyms of the clone() flags, but make their
>   purpose more obvious to the user, i.e., something like
>   the following:
>    
>         #define UNSHARE_VM     CLONE_VM
>         #define UNSHARE_FS     CLONE_FS
>         #define UNSHARE_FILES  CLONE_FILES
>         #define UNSHARE_NS     CLONE_NEWNS
>         etc.
>         
>   This would avoid confusion for the interface user.  
>   (Yes, I realize that this could be done in glibc, but why 
>   make the kernel and glibc definitions differ?)
>
>  
>
I agree that use of clone flags can be confusing. At least a couple of
folks pointed that out when I posted the patch. The issues was even
raised when unshare was proposed few years back on lkml. Some
source of confusion is the special status of CLONE_NEWNS. Because
namespaces are shared by default with fork/clone, it is different than
other CLONE_* flags. That's probably why it was called CLONE_NEWNS
and not CLONE_NS. In the original discussion in Aug'00, Linus
said that "it makes sense that a unshare(CLONE_FILES) basically
_undoes_ the sharing of clone(CLONE_FILES)"

http://www.ussg.iu.edu/hypermail/linux/kernel/0008.3/0662.html

So I decided to follow that as a guidance for unshare interface.

>4. Would it not be wise to include a check of the following form
>   at the entry to sys_unshare():
>
>        if (flags & ~(CLONE_FS | CLONE_FILES | CLONE_VM | 
>                CLONE_NEWNS | CLONE_SYSVSEM | CLONE_THREAD))
>            return -EINVAL;
>
>   This future-proofs the interface against applications
>   that try to specify extraneous bits in 'flags': if those
>   bits happen to become meaningful in the future, then the
>   application behavior would silently change.  Adding this 
>   check now prevents applications trying to use those bits 
>   until such a time as they have meaning.
>
>  
>
I did have a similar check in the first incarnation of the patch. It was
pointed out, correctly, that it is better to allow all flags so we can
incrementally add new unshare functionality while not making
any ABI changes. unshare follows clone here, which also does not
check for extraneous bits in flags.

>Cheers,
>
>Michael
>
>
>unshare.2 draft man page:
>
>.\" (C) 2006, Janak Desai <janak@us.ibm.com>
>.\" (C) 2006, Michael Kerrisk <mtk-manpages@gmx.ne>
>.\" Licensed under the GPL
>.\"
>.TH UNSHARE 2 2005-03-10 "Linux 2.6.16" "Linux Programmer's Manual"
>.SH NAME
>unshare \- disassociate parts of the process execution context
>.SH SYNOPSIS
>.nf
>.B #include <sched.h>
>.sp
>.BI "int unshare(int " flags );
>.fi
>.SH DESCRIPTION
>.BR unshare () 
>allows a process to disassociate parts of its execution
>context that are currently being shared with other processes. 
>Part of the execution context, such as the namespace, is shared 
>implicitly when a new process is created using 
>.BR fork (2)
>or
>.BR vfork (2), 
>while other parts, such as virtual memory, may be
>shared by explicit request when creating a process using 
>.BR clone (2).
>
>The main use of 
>.BR unshare (2)
>is to allow a process to control its
>shared execution context without creating a new process.
>
>The 
>.I flags 
>argument is a bit mask that specifies which parts of 
>the execution context should be unshared. 
>
If no bits are specified, unshare system call does not change the
execution context of the calling process. The flags argument
may be specified

> 
>  by ORing together one or more of the following constants:
>.TP
>.B CLONE_FILES
>Reverse the effect of the
>.BR clone (2)
>.B CLONE_FILES
>flag.
>Unshare the file descriptor table, so that the calling process 
>no longer shares its file descriptors with any other process.
>.TP
>.B CLONE_FS
>Reverse the effect of the
>.BR clone (2)
>.B CLONE_FS 
>flag.
>Unshare file system attributes, so that the calling process 
>no longer shares its root directory, current directory, 
>or umask attributes with any other process.
>.BR chroot (2),
>.BR chdir (2),
>or
>.BR umask (2)
>.TP
>.B CLONE_NEWNS
>This flag has the same effect as the
>.BR clone (2)
>.B CLONE_NEWNS
>flag.
>Unshare the namespace, so that the calling process has a private 
>copy of its namespace which is not shared with any other process.
>Specifying this flag automatically implies
>.B CLONE_FS
>as well.
>.TP
>.B CLONE_VM
>Reverse the effect of the
>.BR clone (2)
>.B CLONE_VM
>flag.
>.RB ( CLONE_VM
>is also implicitly set by
>.BR vfork (2),
>and can be reversed using this
>.BR unshare ()
>flag.)
>Unshare virtual memory, so that the calling process no 
>longer shares its virtual address space with any other process.
>.SH RETURN VALUE
>On success, zero returned. On failure, \-1 is returned and 
>.I errno 
>is set to indicate the error.
>.SH ERRORS
>.TP
>.B EPERM
>.I flags
>specified
>.B CLONE_NEWNS 
>but the calling process was not privileged (did not have the
>.B CAP_SYS_ADMIN
>capability).
>.TP
>.B ENOMEM
>Cannot allocate sufficient memory to copy parts of caller's
>context that need to be unshared.
>.TP
>.B EINVAL
>An invalid but was specified in
>.IR flags .
>.SH CONFORMING TO
>The
>.BR unshare (2)
>system call is Linux-specific.
>.SH NOTES
>Not all of the process attributes that can be shared when 
>a new process is created using
>.BR clone (2)
>can be unshared using
>.BR unshare ().
>In particular, as at kernel 2.6.16,
>.BR unshare () 
>does not implement
>.BR CLONE_SIGHAND ,
>.BR CLONE_SYSVSEM ,
>or 
>.BR CLONE_THREAD .
>.SH SEE ALSO
>.BR clone (2), 
>.BR fork (2), 
>.BR vfork (2), 
>Documentation/unshare.txt
>
>  
>


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: unhare() interface design questions and man page
  2006-02-14 13:44   ` JANAK DESAI
@ 2006-02-14 18:49     ` Michael Kerrisk
  2006-02-14 20:47       ` JANAK DESAI
  0 siblings, 1 reply; 10+ messages in thread
From: Michael Kerrisk @ 2006-02-14 18:49 UTC (permalink / raw)
  To: JANAK DESAI
  Cc: torvalds, akpm, ak, hch, paulus, viro, linux-kernel, michael.kerrisk

Helllo Janak,

[...]

> >While writing this page, I came up with a number of 
> >questions about the design of this interface:
> >
> >1. Your original documentation said:
> >
> >        The flags argument specifies one or bitwise-or'ed of
> >        several of the following constants.
> >
> >   However, my reading of the code (I have not yet tested the 
> >   syscall) is that 'flags' can be zero.  I don't see any problem
> >   with that, but it is in conflict with the statement above,
> >   so it may be worth confirming: what is intended behaviour?
> >   is 'flags' allowed to be zero?
> >
> >  
> >
> Yes, I agree that the intended behavior of flags == 0 should be 
> clarified. If flags is zero, then the system call is basically 
> a no-op. 
[...]

Good.  I assumed that was the case.

> >2. Reading the code and your documentation, I see that CLONE_VM
> >   implies CLONE_SIGHAND.  Since CLONE_SIGHAND is not implemented
> >   (i.e., results in an EINVAL error), I take it that this means
> >   that at the moment CLONE_VM will not work (i.e., will result 
> >   in EINVAL).  Is this correct?  If so, I will note this in 
> >   the man page.
> >
> Actually, CLONE_SIGHAND implies CLONE_VM and not the
> otherway around. Currently CLONE_VM is supported, as long as
> singnal handlers are not being shared. That is, if you created the
> process using clone(CLONE_VM), which kept signal handlers
> different, then you can unshare VM using unshare(CLONE_VM).

Maybe I'm being dense, and as I said I haven't actually
tested the interface, but your Documentation/unshare.txt 
(7.2) says the opposite:

    If CLONE_VM is set, force CLONE_SIGHAND.

and the code in check_unshare_flags() seems to agree:

        /*
         * If unsharing vm, we must also unshare signal handlers 
         */
        if (*flags_ptr & CLONE_VM)
                 *flags_ptr |= CLONE)SIGHAND;

What am I missing?

> >3. The naming of the 'flags' bits is inconsistent.  In your
> >   documentation you note:
> >
> >        unshare reverses sharing that was done using clone(2) system 
> >        call, so unshare should have a similar interface as clone(2). 
> >        That is, since flags in clone(int flags, void *stack) 
> >        specifies what should be shared, similar flags in 
> >        unshare(int flags) should specify what should be unshared. 
> >        Unfortunately, this may appear to invert the meaning of the 
> >        flags from the way they are used in clone(2).  However, 
> >        there was no easy solution that was less confusing and that 
> >        allowed incremental context unsharing in future without an 
> >        ABI change.
> >   
> >   The problem is that the flags don't simply reverse the meanings
> >   of the clone() flags of the same name: they do it inconsistently.
> >
> >   That is, CLONE_FS, CLONE_FILES, and CLONE_VM *reverse* the 
> >   effects of the clone() flags of the same name, but CLONE_NEWNS 
> >   *has the same meaning* as the clone() flag of the same name.
> >   If *all* of the flags were simply reversed, that would be 
> >   a little strange, but comprehensible; but the fact that one of 
> >   them is not reversed is very confusing for users of the 
> >   interface.
> >
> >   An idea: why not define a new set of flags for unshare()
> >   which are synonyms of the clone() flags, but make their
> >   purpose more obvious to the user, i.e., something like
> >   the following:
> >    
> >         #define UNSHARE_VM     CLONE_VM
> >         #define UNSHARE_FS     CLONE_FS
> >         #define UNSHARE_FILES  CLONE_FILES
> >         #define UNSHARE_NS     CLONE_NEWNS
> >         etc.
> >         
> >   This would avoid confusion for the interface user.  
> >   (Yes, I realize that this could be done in glibc, but why 
> >   make the kernel and glibc definitions differ?)
> >
> >  
> >
> I agree that use of clone flags can be confusing. At least a couple of
> folks pointed that out when I posted the patch. The issues was even
> raised when unshare was proposed few years back on lkml. Some
> source of confusion is the special status of CLONE_NEWNS. Because
> namespaces are shared by default with fork/clone, it is different than
> other CLONE_* flags. That's probably why it was called CLONE_NEWNS
> and not CLONE_NS. 

Yes, most likely.

> In the original discussion in Aug'00, Linus
> said that "it makes sense that a unshare(CLONE_FILES) basically
> _undoes_ the sharing of clone(CLONE_FILES)"
> 
> http://www.ussg.iu.edu/hypermail/linux/kernel/0008.3/0662.html
>
> So I decided to follow that as a guidance for unshare interface.

Yes, but when Linus made that statement (Aug 2000), there 
was no CLONE_NEWNS (it arrived in 2.4.19, Aug 2002).  So
the inconsistency that I'm highlighting didn't exist back 
then.  As I said above: if *all* of the flags were simply 
reversed, that would be comprehensible; but the fact that 
one of them is not reversed is inconsistent.  This &will*
confuse people in userland, and it seems quite 
unnecessary to do that.  Please consider this point further.

> >4. Would it not be wise to include a check of the following form
> >   at the entry to sys_unshare():
> >
> >        if (flags & ~(CLONE_FS | CLONE_FILES | CLONE_VM | 
> >                CLONE_NEWNS | CLONE_SYSVSEM | CLONE_THREAD))
> >            return -EINVAL;
> >
> >   This future-proofs the interface against applications
> >   that try to specify extraneous bits in 'flags': if those
> >   bits happen to become meaningful in the future, then the
> >   application behavior would silently change.  Adding this 
> >   check now prevents applications trying to use those bits 
> >   until such a time as they have meaning.
> >
> I did have a similar check in the first incarnation of the patch. It was
> pointed out, correctly, that it is better to allow all flags so we can
> incrementally add new unshare functionality while not making
> any ABI changes. unshare follows clone here, which also does not
> check for extraneous bits in flags.

I guess I need educating here.  Several other system calls 
do include such checks:

mm/mlock.c: mlockall(2):
if (!flags || (flags & ~(MCL_CURRENT | MCL_FUTURE)))
mm/mprotect.c: mprotect(2):
if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM))
mm/msync.c: msync(2):
if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
mm/mremap.c: mremap(2):
if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE))
mm/mempolicy.c:	mbind(2):
if ((flags & ~(unsigned long)(MPOL_MF_STRICT)) || mode > MPOL_MAX)
mm/mempolicy.c:	get_mempolicy(2):
if (flags & ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR))

What distinguishes unshare() (and clone()) from these?

I guess I'm unclear too about this (requoted) text

> It was
> pointed out, correctly, that it is better to allow all flags so we can
> incrementally add new unshare functionality while not making
> any ABI changes. 

If one restricts the range of flags that are available now
(prevents userland from trying to use them), and then
permits additional flags later, then from the point of
view of the old userland apps, there has been no ABI change.
What am I missing here?

Cheers,

Michael

-- 
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7 

Want to help with man page maintenance?  
Grab the latest tarball at
ftp://ftp.win.tue.nl/pub/linux-local/manpages/, 
read the HOWTOHELP file and grep the source 
files for 'FIXME'.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: unhare() interface design questions and man page
  2006-02-14 18:49     ` Michael Kerrisk
@ 2006-02-14 20:47       ` JANAK DESAI
  2006-02-16  5:50         ` Michael Kerrisk
  0 siblings, 1 reply; 10+ messages in thread
From: JANAK DESAI @ 2006-02-14 20:47 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: torvalds, akpm, ak, hch, paulus, viro, linux-kernel, michael.kerrisk

Michael Kerrisk wrote:

>Helllo Janak,
>
>[...]
>
>  
>
>>>While writing this page, I came up with a number of 
>>>questions about the design of this interface:
>>>
>>>1. Your original documentation said:
>>>
>>>       The flags argument specifies one or bitwise-or'ed of
>>>       several of the following constants.
>>>
>>>  However, my reading of the code (I have not yet tested the 
>>>  syscall) is that 'flags' can be zero.  I don't see any problem
>>>  with that, but it is in conflict with the statement above,
>>>  so it may be worth confirming: what is intended behaviour?
>>>  is 'flags' allowed to be zero?
>>>
>>> 
>>>
>>>      
>>>
>>Yes, I agree that the intended behavior of flags == 0 should be 
>>clarified. If flags is zero, then the system call is basically 
>>a no-op. 
>>    
>>
>[...]
>
>Good.  I assumed that was the case.
>
>  
>
>>>2. Reading the code and your documentation, I see that CLONE_VM
>>>  implies CLONE_SIGHAND.  Since CLONE_SIGHAND is not implemented
>>>  (i.e., results in an EINVAL error), I take it that this means
>>>  that at the moment CLONE_VM will not work (i.e., will result 
>>>  in EINVAL).  Is this correct?  If so, I will note this in 
>>>  the man page.
>>>
>>>      
>>>
>>Actually, CLONE_SIGHAND implies CLONE_VM and not the
>>otherway around. Currently CLONE_VM is supported, as long as
>>singnal handlers are not being shared. That is, if you created the
>>process using clone(CLONE_VM), which kept signal handlers
>>different, then you can unshare VM using unshare(CLONE_VM).
>>    
>>
>
>Maybe I'm being dense, and as I said I haven't actually
>tested the interface, but your Documentation/unshare.txt 
>(7.2) says the opposite:
>
>    If CLONE_VM is set, force CLONE_SIGHAND.
>
>and the code in check_unshare_flags() seems to agree:
>
>        /*
>         * If unsharing vm, we must also unshare signal handlers 
>         */
>        if (*flags_ptr & CLONE_VM)
>                 *flags_ptr |= CLONE)SIGHAND;
>
>What am I missing?
>
>  
>
Sorry, I think I have caused confusion by my use of "implies" and
"forcing of flags". I am easily confused by this as well, so let me see
if I can clarify with a picture.  I will double check the documentation
file to make sure I am using correct and unambiguous words.

Consider the following 4 cases.

  (1)        (2)              (3)              (4)
                                                                                
SH1  SH2   SH1   SH2           SH               SH
 |    |     \     /            ||              /  \
 |    |      \   /             ||             /    \
VM1  VM2       VM              VM            VM1   VM2
                                                                                
   ok          ok              ok              NOT ok
                                                                                
You can achieve (1), (2) or (3) using clone(), but clone()
will not let you do (4). What we want to make sure is
that we don't endup like (4) using unshare. unshare
makes sure that if you are trying to unshare vm, AND
you were sharing signal handlers (case 3) before,
then it won't allow that operation. However, if you
were like (2), then you can unshare vm and end up
like (1), which is allowed. So the forcing of sighand
flag makes sure that you check for case (2) or (3).



>>>3. The naming of the 'flags' bits is inconsistent.  In your
>>>  documentation you note:
>>>
>>>       unshare reverses sharing that was done using clone(2) system 
>>>       call, so unshare should have a similar interface as clone(2). 
>>>       That is, since flags in clone(int flags, void *stack) 
>>>       specifies what should be shared, similar flags in 
>>>       unshare(int flags) should specify what should be unshared. 
>>>       Unfortunately, this may appear to invert the meaning of the 
>>>       flags from the way they are used in clone(2).  However, 
>>>       there was no easy solution that was less confusing and that 
>>>       allowed incremental context unsharing in future without an 
>>>       ABI change.
>>>  
>>>  The problem is that the flags don't simply reverse the meanings
>>>  of the clone() flags of the same name: they do it inconsistently.
>>>
>>>  That is, CLONE_FS, CLONE_FILES, and CLONE_VM *reverse* the 
>>>  effects of the clone() flags of the same name, but CLONE_NEWNS 
>>>  *has the same meaning* as the clone() flag of the same name.
>>>  If *all* of the flags were simply reversed, that would be 
>>>  a little strange, but comprehensible; but the fact that one of 
>>>  them is not reversed is very confusing for users of the 
>>>  interface.
>>>
>>>  An idea: why not define a new set of flags for unshare()
>>>  which are synonyms of the clone() flags, but make their
>>>  purpose more obvious to the user, i.e., something like
>>>  the following:
>>>   
>>>        #define UNSHARE_VM     CLONE_VM
>>>        #define UNSHARE_FS     CLONE_FS
>>>        #define UNSHARE_FILES  CLONE_FILES
>>>        #define UNSHARE_NS     CLONE_NEWNS
>>>        etc.
>>>        
>>>  This would avoid confusion for the interface user.  
>>>  (Yes, I realize that this could be done in glibc, but why 
>>>  make the kernel and glibc definitions differ?)
>>>
>>> 
>>>
>>>      
>>>
>>I agree that use of clone flags can be confusing. At least a couple of
>>folks pointed that out when I posted the patch. The issues was even
>>raised when unshare was proposed few years back on lkml. Some
>>source of confusion is the special status of CLONE_NEWNS. Because
>>namespaces are shared by default with fork/clone, it is different than
>>other CLONE_* flags. That's probably why it was called CLONE_NEWNS
>>and not CLONE_NS. 
>>    
>>
>
>Yes, most likely.
>
>  
>
>>In the original discussion in Aug'00, Linus
>>said that "it makes sense that a unshare(CLONE_FILES) basically
>>_undoes_ the sharing of clone(CLONE_FILES)"
>>
>>http://www.ussg.iu.edu/hypermail/linux/kernel/0008.3/0662.html
>>
>>So I decided to follow that as a guidance for unshare interface.
>>    
>>
>
>Yes, but when Linus made that statement (Aug 2000), there 
>was no CLONE_NEWNS (it arrived in 2.4.19, Aug 2002).  So
>the inconsistency that I'm highlighting didn't exist back 
>then.  As I said above: if *all* of the flags were simply 
>reversed, that would be comprehensible; but the fact that 
>one of them is not reversed is inconsistent.  This &will*
>confuse people in userland, and it seems quite 
>unnecessary to do that.  Please consider this point further.
>  
>
Thanks for clarification. I didn't check that namespaces cames after
that original discussion. I still think that the confusion is not acute
enough to warrent addition of more flags, but will run it by some
folks to see what they think.

>  
>
>>>4. Would it not be wise to include a check of the following form
>>>  at the entry to sys_unshare():
>>>
>>>       if (flags & ~(CLONE_FS | CLONE_FILES | CLONE_VM | 
>>>               CLONE_NEWNS | CLONE_SYSVSEM | CLONE_THREAD))
>>>           return -EINVAL;
>>>
>>>  This future-proofs the interface against applications
>>>  that try to specify extraneous bits in 'flags': if those
>>>  bits happen to become meaningful in the future, then the
>>>  application behavior would silently change.  Adding this 
>>>  check now prevents applications trying to use those bits 
>>>  until such a time as they have meaning.
>>>
>>>      
>>>
>>I did have a similar check in the first incarnation of the patch. It was
>>pointed out, correctly, that it is better to allow all flags so we can
>>incrementally add new unshare functionality while not making
>>any ABI changes. unshare follows clone here, which also does not
>>check for extraneous bits in flags.
>>    
>>
>
>I guess I need educating here.  Several other system calls 
>do include such checks:
>
>mm/mlock.c: mlockall(2):
>if (!flags || (flags & ~(MCL_CURRENT | MCL_FUTURE)))
>mm/mprotect.c: mprotect(2):
>if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM))
>mm/msync.c: msync(2):
>if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
>mm/mremap.c: mremap(2):
>if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE))
>mm/mempolicy.c:	mbind(2):
>if ((flags & ~(unsigned long)(MPOL_MF_STRICT)) || mode > MPOL_MAX)
>mm/mempolicy.c:	get_mempolicy(2):
>if (flags & ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR))
>
>What distinguishes unshare() (and clone()) from these?
>  
>
I haven't looked at your examples in detail, but basically clone and
unshare work on pieces of process context. It is quite possible that
in future there may be new pieces added to the process context
resulting in new flags. You want to make sure that you can
incrementally add functionality for sharing and unsharing of
new flags.

>I guess I'm unclear too about this (requoted) text
>
>  
>
>>It was
>>pointed out, correctly, that it is better to allow all flags so we can
>>incrementally add new unshare functionality while not making
>>any ABI changes. 
>>    
>>
>
>If one restricts the range of flags that are available now
>(prevents userland from trying to use them), and then
>permits additional flags later, then from the point of
>view of the old userland apps, there has been no ABI change.
>What am I missing here?
>
>  
>
I think the ABI change may occur if the new flag that gets added,
somehow interacts with an existing flag (just like signal handlers and
vm) or has a different default behavior (like namespace). I think
that's why clone and unshare (which mimics the clone interface)
do not check for unimplmented flags.

-Janak

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: unhare() interface design questions and man page
  2006-02-14 20:47       ` JANAK DESAI
@ 2006-02-16  5:50         ` Michael Kerrisk
  2006-02-19 17:52           ` Janak Desai
  0 siblings, 1 reply; 10+ messages in thread
From: Michael Kerrisk @ 2006-02-16  5:50 UTC (permalink / raw)
  To: JANAK DESAI
  Cc: torvalds, akpm, ak, hch, paulus, viro, linux-kernel, michael.kerrisk

Hi Janak,

> >>>2. Reading the code and your documentation, I see that CLONE_VM
> >>>  implies CLONE_SIGHAND.  Since CLONE_SIGHAND is not implemented
> >>>  (i.e., results in an EINVAL error), I take it that this means
> >>>  that at the moment CLONE_VM will not work (i.e., will result 
> >>>  in EINVAL).  Is this correct?  If so, I will note this in 
> >>>  the man page.
> >>>
> >>Actually, CLONE_SIGHAND implies CLONE_VM and not the
> >>otherway around. Currently CLONE_VM is supported, as long as
> >>singnal handlers are not being shared. That is, if you created the
> >>process using clone(CLONE_VM), which kept signal handlers
> >>different, then you can unshare VM using unshare(CLONE_VM).
> >
> >Maybe I'm being dense, and as I said I haven't actually
> >tested the interface, but your Documentation/unshare.txt 
> >(7.2) says the opposite:
> >
> >    If CLONE_VM is set, force CLONE_SIGHAND.
> >
> >and the code in check_unshare_flags() seems to agree:
> >
> >        /*
> >         * If unsharing vm, we must also unshare signal handlers 
> >         */
> >        if (*flags_ptr & CLONE_VM)
> >                 *flags_ptr |= CLONE)SIGHAND;
> >
> >What am I missing?
> >
> Sorry, I think I have caused confusion by my use of "implies" and
> "forcing of flags". I am easily confused by this as well, so let me see
> if I can clarify with a picture.  I will double check the documentation
> file to make sure I am using correct and unambiguous words.
> 
> Consider the following 4 cases.
> 
>   (1)        (2)              (3)              (4)
>
>      
> SH1  SH2   SH1   SH2           SH               SH
>  |    |     \     /            ||              /  \
>  |    |      \   /             ||             /    \
> VM1  VM2       VM              VM            VM1   VM2
>  
>      
>    ok          ok              ok              NOT ok
>                          
>      
> You can achieve (1), (2) or (3) using clone(), but clone()
> will not let you do (4). What we want to make sure is
> that we don't endup like (4) using unshare. unshare
> makes sure that if you are trying to unshare vm, AND
> you were sharing signal handlers (case 3) before,
> then it won't allow that operation. However, if you
> were like (2), then you can unshare vm and end up
> like (1), which is allowed. So the forcing of sighand
> flag makes sure that you check for case (2) or (3).

That all makes sense.  Thanks.  Just returning to my
original statement though: CLONE_VM *does* imply 
CLONE_SIGHAND.  But that's okay.  What I had 
overlooked was that unshare(CLONE_SIGHAND) is 
permitted, and implemented as a no-op, if the count
of threads sharing the signal structure is 1.
In other words, we can do an unshare(CLONE_SIGHAND)
as long as we did not specifiy CLONE_SIGHAND in the
previous clone() call.

By the way, I have by now done quite a bit of testing
of unshare(), and it works as documented for all the 
cases I've tested.  I have included a fairly
general (command-line argument driven) program at
the foot of this message.  It allows you to test
various flag combinations.  Maybe it is useful
to you also?

> >>>3. The naming of the 'flags' bits is inconsistent.  In your
> >>>  documentation you note:
> >>>
> >>>       unshare reverses sharing that was done using clone(2) system 
> >>>       call, so unshare should have a similar interface as clone(2). 
> >>>       That is, since flags in clone(int flags, void *stack) 
> >>>       specifies what should be shared, similar flags in 
> >>>       unshare(int flags) should specify what should be unshared. 
> >>>       Unfortunately, this may appear to invert the meaning of the 
> >>>       flags from the way they are used in clone(2).  However, 
> >>>       there was no easy solution that was less confusing and that 
> >>>       allowed incremental context unsharing in future without an 
> >>>       ABI change.
> >>>  
> >>>  The problem is that the flags don't simply reverse the meanings
> >>>  of the clone() flags of the same name: they do it inconsistently.
> >>>
> >>>  That is, CLONE_FS, CLONE_FILES, and CLONE_VM *reverse* the 
> >>>  effects of the clone() flags of the same name, but CLONE_NEWNS 
> >>>  *has the same meaning* as the clone() flag of the same name.
> >>>  If *all* of the flags were simply reversed, that would be 
> >>>  a little strange, but comprehensible; but the fact that one of 
> >>>  them is not reversed is very confusing for users of the 
> >>>  interface.
> >>>
> >>>  An idea: why not define a new set of flags for unshare()
> >>>  which are synonyms of the clone() flags, but make their
> >>>  purpose more obvious to the user, i.e., something like
> >>>  the following:
> >>>   
> >>>        #define UNSHARE_VM     CLONE_VM
> >>>        #define UNSHARE_FS     CLONE_FS
> >>>        #define UNSHARE_FILES  CLONE_FILES
> >>>        #define UNSHARE_NS     CLONE_NEWNS
> >>>        etc.
> >>>        
> >>>  This would avoid confusion for the interface user.  
> >>>  (Yes, I realize that this could be done in glibc, but why 
> >>>  make the kernel and glibc definitions differ?)
> >>>
> >>I agree that use of clone flags can be confusing. At least a couple of
> >>folks pointed that out when I posted the patch. The issues was even
> >>raised when unshare was proposed few years back on lkml. Some
> >>source of confusion is the special status of CLONE_NEWNS. Because
> >>namespaces are shared by default with fork/clone, it is different than
> >>other CLONE_* flags. That's probably why it was called CLONE_NEWNS
> >>and not CLONE_NS. 
> >
> >Yes, most likely.
> >
> >>In the original discussion in Aug'00, Linus
> >>said that "it makes sense that a unshare(CLONE_FILES) basically
> >>_undoes_ the sharing of clone(CLONE_FILES)"
> >>
> >>http://www.ussg.iu.edu/hypermail/linux/kernel/0008.3/0662.html
> >>
> >>So I decided to follow that as a guidance for unshare interface.
> >
> >Yes, but when Linus made that statement (Aug 2000), there 
> >was no CLONE_NEWNS (it arrived in 2.4.19, Aug 2002).  So
> >the inconsistency that I'm highlighting didn't exist back 
> >then.  As I said above: if *all* of the flags were simply 
> >reversed, that would be comprehensible; but the fact that 
> >one of them is not reversed is inconsistent.  This &will*
> >confuse people in userland, and it seems quite 
> >unnecessary to do that.  Please consider this point further.
> >
> Thanks for clarification. I didn't check that namespaces cames after
> that original discussion. I still think that the confusion is not acute
> enough to warrent addition of more flags, but will run it by some
> folks to see what they think.

Let me put it this way: if you change things in the manner
I suggest, then it will cause a few kernel developers
to have to stop and think for a moment.  If you leave things 
as they are, then a multitude of userland programmers will be
condemned to stumble over this confusion for many years to
come.  

(And yes, I appreciate that the original problem arose 
with clone(), really there should have been a CLONE_NS 
flag which was used as the default for fork() and exec(), 
and omitted to get the CLONE_NEWNS behaviour we now have.
But extended this problem into unshare() is even
more confusing, IMHO.)

Just to emphasize this point: while testing the
various unshare() flags, I found I was myself runnng 
into confusion when I tested unshare(CLONE_NEWNS).
That confusion arose precisely because CLONE_NEWNS
has the same effect in clone() and unshare().  And
I was still getting confused even though I 
understood that!

Please consider changing these names.  (I'm a little
surprised that no-one else has offered an opinion for
or against this point so far...)

> >>>4. Would it not be wise to include a check of the following form
> >>>  at the entry to sys_unshare():
> >>>
> >>>       if (flags & ~(CLONE_FS | CLONE_FILES | CLONE_VM | 
> >>>               CLONE_NEWNS | CLONE_SYSVSEM | CLONE_THREAD))
> >>>           return -EINVAL;
> >>>
> >>>  This future-proofs the interface against applications
> >>>  that try to specify extraneous bits in 'flags': if those
> >>>  bits happen to become meaningful in the future, then the
> >>>  application behavior would silently change.  Adding this 
> >>>  check now prevents applications trying to use those bits 
> >>>  until such a time as they have meaning.
> >>>
> >>I did have a similar check in the first incarnation of the patch. It 
> >>was
> >>pointed out, correctly, that it is better to allow all flags so we can
> >>incrementally add new unshare functionality while not making
> >>any ABI changes. unshare follows clone here, which also does not
> >>check for extraneous bits in flags.
> >
> >I guess I need educating here.  Several other system calls 
> >do include such checks:
> >
> >mm/mlock.c: mlockall(2):
> >if (!flags || (flags & ~(MCL_CURRENT | MCL_FUTURE)))
> >mm/mprotect.c: mprotect(2):
> >if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM))
> >mm/msync.c: msync(2):
> >if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
> >mm/mremap.c: mremap(2):
> >if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE))
> >mm/mempolicy.c:	mbind(2):
> >if ((flags & ~(unsigned long)(MPOL_MF_STRICT)) || mode > MPOL_MAX)
> >mm/mempolicy.c:	get_mempolicy(2):
> >if (flags & ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR))
> >
> >What distinguishes unshare() (and clone()) from these?
> >
> I haven't looked at your examples in detail, but basically clone and
> unshare work on pieces of process context. It is quite possible that
> in future there may be new pieces added to the process context
> resulting in new flags. You want to make sure that you can
> incrementally add functionality for sharing and unsharing of
> new flags.

Sure -- and I do not see how my suggestion preclused 
that possibility.

> >I guess I'm unclear too about this (requoted) text
> >
> >>It was
> >>pointed out, correctly, that it is better to allow all flags so we can
> >>incrementally add new unshare functionality while not making
> >>any ABI changes. 
> >
> >If one restricts the range of flags that are available now
> >(prevents userland from trying to use them), and then
> >permits additional flags later, then from the point of
> >view of the old userland apps, there has been no ABI change.
> >What am I missing here?
> >
> I think the ABI change may occur if the new flag that gets added,
> somehow interacts with an existing flag (just like signal handlers and
> vm) or has a different default behavior (like namespace). I think
> that's why clone and unshare (which mimics the clone interface)
> do not check for unimplmented flags.

What you are saying here doesn't make sense to me.  Here is how 
I see that an ABI change can occur, and it seems to me
that it is highly undesirable:

1. Under the the current implementation, useland calls 
   unshare() *accidentally* specifying some additional 
   bits that currently have no meaning, and do not 
   cause an EINVAL error.

2. Later, those bits acquire meaning in unshare().

3. As a consequence, the behaviour of the old
   binary application changes (perhaps crashes,
   perhaps just does something new and strange)

Does this scenario not seem to be a problem to you?
If not, why not?

Cheers,

Michael




/* demo_unshare.c */

#define _GNU_SOURCE
#include <sys/types.h>
#include <string.h>
#include <signal.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>

#define errMsg(msg)     do { perror(msg); } while (0)

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

#define fatalExit(msg)  do { fprintf(stderr, "%s\n", msg); \
                             exit(EXIT_FAILURE); } while (0)

static void
usageError(char *progName)
{
    fprintf(stderr, "Usage: %s [clone-arg [unshare-arg]]\n", progName);
#define fpe(str) fprintf(stderr, "        " str)
    fpe("args can contain the following letters:\n");
    fpe("    d - file descriptors (CLONE_FILES)\n");
    fpe("    f - file system information (CLONE_FS)\n");
    fpe("    n - give child separate namespace (CLONE_NEWNS)\n");
    fpe("    s - signal dispositions (CLONE_SIGHAND)\n");
    fpe("    e - SV semaphore undo structures (CLONE_SYSVSEM)\n");
    fpe("    t - place in same thread group (CLONE_THREAD)\n");
    fpe("    v - signal dispositions (CLONE_VM)\n");
    exit(EXIT_FAILURE);
} /* usageError */

volatile int glob = 0;

typedef struct {        /* For passing info to child startup function */
    int    fd;
    mode_t umask;
    int    signal;
} ChildParams;

static char *unshareFlags = NULL;

#define SYS_unshare 310

static int
charsToBits(char *flags)
{
    char *p;
    int cloneFlags;

    cloneFlags = 0;

    if (flags != NULL) {
        for (p = flags; *p != '\0'; p++) {
            if      (*p == 'd') cloneFlags |= CLONE_FILES;
            else if (*p == 'f') cloneFlags |= CLONE_FS;
            else if (*p == 'n') cloneFlags |= CLONE_NEWNS;
            else if (*p == 's') cloneFlags |= CLONE_SIGHAND;
            else if (*p == 'e') cloneFlags |= CLONE_SYSVSEM;
            else if (*p == 't') cloneFlags |= CLONE_THREAD;
            else if (*p == 'v') cloneFlags |= CLONE_VM;
            else usageError("Bad flag");
        } 
    } 

    return cloneFlags;
} /* charsToBits */

static int              /* Startup function for cloned child */
childFunc(void *arg)
{
    ChildParams *cp;
    int cloneFlags;

    cloneFlags = 0;

    printf("Child:  PID=%ld PPID=%ld\n", (long) getpid(),
            (long) getppid());

#define SYS_unshare 310
    cloneFlags = charsToBits(unshareFlags);

    printf("Child unsharing 0x%x\n\n", cloneFlags);
    if (syscall(SYS_unshare, cloneFlags) == -1) errExit("unshare");

    cp = (ChildParams *) arg;   /* Cast arg to true form */

    /* The following changes may affect parent */

    glob++;             /* May change memory shared with parent */

    umask(cp->umask);

    if (close(cp->fd) == -1) errExit("child:close");
    if (signal(cp->signal, SIG_DFL) == SIG_ERR)
        errExit("child:signal");

    return 0;
} /* childFunc */

int
main(int argc, char *argv[])
{
    const int STACK_SIZE = 65536;    /* Stack size for cloned child */
    char *stack;                     /* Start of stack buffer area */
    char *stackTop;                  /* End of stack buffer area */
    int cloneFlags;                  /* Flags for cloning child */
    ChildParams cp;                  /* Passed to child function */
    const mode_t START_UMASK = S_IWOTH;  /* Initial umask setting */
    struct sigaction sa;
    int status, s;
    pid_t pid;

    printf("Parent: PID=%ld PPID=%ld\n", (long) getpid(),
            (long) getppid());

    if (argc > 2)
        unshareFlags = argv[2];

    /* Set up an argument structure to be passed to cloned child, and
       set some process attributes that will be modified by child */

    umask(START_UMASK);         /* Initialize umask to some value */
    cp.umask = S_IWGRP;         /* Child sets umask to this value */

    cp.fd = open("/dev/null", O_RDWR);  /* Child will close this fd */
    if (cp.fd == -1) errExit("open");

    cp.signal = SIGTERM;        /* Child will change disposition */
    if (signal(cp.signal, SIG_IGN) == SIG_ERR) errExit("signal");

    /* Initialize clone flags using command-line argument
       (if supplied) */

    cloneFlags = charsToBits(argv[1]);
    printf("Parent clone flags 0x%x\n\n", cloneFlags);

    /* Allocate stack for child */

    stack = malloc(STACK_SIZE);
    if (stack == NULL) errExit("malloc");
    stackTop = stack + STACK_SIZE;  /* Assume stack grows downwards */

    if (clone(childFunc, stackTop, cloneFlags | SIGCHLD, &cp) == -1)
        errExit("clone");

    /* Wait for child.  */

    pid = waitpid(-1, &status, 0);
    if (pid == -1) {    /* Probably because CLONE_THREAD was used */
        if (! (cloneFlags & CLONE_THREAD))
            errExit("waitpid");
        else
            sleep(1);
    } 

    if (WIFEXITED(status) && WEXITSTATUS(status) != 0)
        fatalExit("Child failed");

    printf("\nAfter wait in parent:\n");
    printf("    Child PID=%ld; status=0x%x\n", (long) pid, status);

    /* Check whether changes made by cloned child have
       affected parent */

    printf("\nParent - checking process attributes:\n");

    printf("    [VM]      ");
    printf("glob=%d (%s)\n", glob, glob ? "changed" : "unchanged");

    printf("    [FS]      ");
    if (umask(0) != START_UMASK)
        printf("umask has changed\n");
    else
        printf("umask is unchanged\n");

    printf("    [FILES]   ");
    s = write(cp.fd, "Hello world\n", 12);
    if (s == -1 && errno == EBADF)
        printf("file descriptor %d has been closed "
                "(i.e., changed)\n", cp.fd);
    else if (s == -1)
        printf("write() on file descriptor %d failed (%s) "
                "(unexpected!)\n", cp.fd, strerror(errno));
    else
        printf("write() on file descriptor %d succeeded"
                "(i.e., unchanged)\n", cp.fd);

    printf("    [SIGHAND] ");
    if (sigaction(cp.signal, NULL, &sa) == -1) errExit("sigaction");
    if (sa.sa_handler != SIG_IGN)
        printf("signal disposition has changed\n");
    else
        printf("signal disposition is unchanged\n");

    exit(EXIT_SUCCESS);
} /* main */

-- 
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7 

Want to help with man page maintenance?  
Grab the latest tarball at
ftp://ftp.win.tue.nl/pub/linux-local/manpages/, 
read the HOWTOHELP file and grep the source 
files for 'FIXME'.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: unhare() interface design questions and man page
  2006-02-16  5:50         ` Michael Kerrisk
@ 2006-02-19 17:52           ` Janak Desai
  2006-03-02  3:31             ` Michael Kerrisk
  0 siblings, 1 reply; 10+ messages in thread
From: Janak Desai @ 2006-02-19 17:52 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: torvalds, akpm, ak, hch, paulus, viro, linux-kernel, michael.kerrisk

Michael Kerrisk wrote:

>Hi Janak,
>
>  
>
>>>>>2. Reading the code and your documentation, I see that CLONE_VM
>>>>> implies CLONE_SIGHAND.  Since CLONE_SIGHAND is not implemented
>>>>> (i.e., results in an EINVAL error), I take it that this means
>>>>> that at the moment CLONE_VM will not work (i.e., will result 
>>>>> in EINVAL).  Is this correct?  If so, I will note this in 
>>>>> the man page.
>>>>>
>>>>>          
>>>>>
>>>>Actually, CLONE_SIGHAND implies CLONE_VM and not the
>>>>otherway around. Currently CLONE_VM is supported, as long as
>>>>singnal handlers are not being shared. That is, if you created the
>>>>process using clone(CLONE_VM), which kept signal handlers
>>>>different, then you can unshare VM using unshare(CLONE_VM).
>>>>        
>>>>
>>>Maybe I'm being dense, and as I said I haven't actually
>>>tested the interface, but your Documentation/unshare.txt 
>>>(7.2) says the opposite:
>>>
>>>   If CLONE_VM is set, force CLONE_SIGHAND.
>>>
>>>and the code in check_unshare_flags() seems to agree:
>>>
>>>       /*
>>>        * If unsharing vm, we must also unshare signal handlers 
>>>        */
>>>       if (*flags_ptr & CLONE_VM)
>>>                *flags_ptr |= CLONE)SIGHAND;
>>>
>>>What am I missing?
>>>
>>>      
>>>
>>Sorry, I think I have caused confusion by my use of "implies" and
>>"forcing of flags". I am easily confused by this as well, so let me see
>>if I can clarify with a picture.  I will double check the documentation
>>file to make sure I am using correct and unambiguous words.
>>
>>Consider the following 4 cases.
>>
>>  (1)        (2)              (3)              (4)
>>
>>     
>>SH1  SH2   SH1   SH2           SH               SH
>> |    |     \     /            ||              /  \
>> |    |      \   /             ||             /    \
>>VM1  VM2       VM              VM            VM1   VM2
>> 
>>     
>>   ok          ok              ok              NOT ok
>>                         
>>     
>>You can achieve (1), (2) or (3) using clone(), but clone()
>>will not let you do (4). What we want to make sure is
>>that we don't endup like (4) using unshare. unshare
>>makes sure that if you are trying to unshare vm, AND
>>you were sharing signal handlers (case 3) before,
>>then it won't allow that operation. However, if you
>>were like (2), then you can unshare vm and end up
>>like (1), which is allowed. So the forcing of sighand
>>flag makes sure that you check for case (2) or (3).
>>    
>>
>
>That all makes sense.  Thanks.  Just returning to my
>original statement though: CLONE_VM *does* imply 
>CLONE_SIGHAND.  But that's okay.  What I had 
>overlooked was that unshare(CLONE_SIGHAND) is 
>permitted, and implemented as a no-op, if the count
>of threads sharing the signal structure is 1.
>In other words, we can do an unshare(CLONE_SIGHAND)
>as long as we did not specifiy CLONE_SIGHAND in the
>previous clone() call.
>
>By the way, I have by now done quite a bit of testing
>of unshare(), and it works as documented for all the 
>cases I've tested.  I have included a fairly
>general (command-line argument driven) program at
>the foot of this message.  It allows you to test
>various flag combinations.  Maybe it is useful
>to you also?
>
>  
>
Thank you for your extensive testing. I have been using the following
program at

http://prdownloads.sourceforge.net/audit/unshare_test.c?download

as new kernels are released. This program was created for arch
maintainers and for eventual inclusion in the LTP.


>>>>>3. The naming of the 'flags' bits is inconsistent.  In your
>>>>> documentation you note:
>>>>>
>>>>>      unshare reverses sharing that was done using clone(2) system 
>>>>>      call, so unshare should have a similar interface as clone(2). 
>>>>>      That is, since flags in clone(int flags, void *stack) 
>>>>>      specifies what should be shared, similar flags in 
>>>>>      unshare(int flags) should specify what should be unshared. 
>>>>>      Unfortunately, this may appear to invert the meaning of the 
>>>>>      flags from the way they are used in clone(2).  However, 
>>>>>      there was no easy solution that was less confusing and that 
>>>>>      allowed incremental context unsharing in future without an 
>>>>>      ABI change.
>>>>> 
>>>>> The problem is that the flags don't simply reverse the meanings
>>>>> of the clone() flags of the same name: they do it inconsistently.
>>>>>
>>>>> That is, CLONE_FS, CLONE_FILES, and CLONE_VM *reverse* the 
>>>>> effects of the clone() flags of the same name, but CLONE_NEWNS 
>>>>> *has the same meaning* as the clone() flag of the same name.
>>>>> If *all* of the flags were simply reversed, that would be 
>>>>> a little strange, but comprehensible; but the fact that one of 
>>>>> them is not reversed is very confusing for users of the 
>>>>> interface.
>>>>>
>>>>> An idea: why not define a new set of flags for unshare()
>>>>> which are synonyms of the clone() flags, but make their
>>>>> purpose more obvious to the user, i.e., something like
>>>>> the following:
>>>>>  
>>>>>       #define UNSHARE_VM     CLONE_VM
>>>>>       #define UNSHARE_FS     CLONE_FS
>>>>>       #define UNSHARE_FILES  CLONE_FILES
>>>>>       #define UNSHARE_NS     CLONE_NEWNS
>>>>>       etc.
>>>>>       
>>>>> This would avoid confusion for the interface user.  
>>>>> (Yes, I realize that this could be done in glibc, but why 
>>>>> make the kernel and glibc definitions differ?)
>>>>>
>>>>>          
>>>>>
>>>>I agree that use of clone flags can be confusing. At least a couple of
>>>>folks pointed that out when I posted the patch. The issues was even
>>>>raised when unshare was proposed few years back on lkml. Some
>>>>source of confusion is the special status of CLONE_NEWNS. Because
>>>>namespaces are shared by default with fork/clone, it is different than
>>>>other CLONE_* flags. That's probably why it was called CLONE_NEWNS
>>>>and not CLONE_NS. 
>>>>        
>>>>
>>>Yes, most likely.
>>>
>>>      
>>>
>>>>In the original discussion in Aug'00, Linus
>>>>said that "it makes sense that a unshare(CLONE_FILES) basically
>>>>_undoes_ the sharing of clone(CLONE_FILES)"
>>>>
>>>>http://www.ussg.iu.edu/hypermail/linux/kernel/0008.3/0662.html
>>>>
>>>>So I decided to follow that as a guidance for unshare interface.
>>>>        
>>>>
>>>Yes, but when Linus made that statement (Aug 2000), there 
>>>was no CLONE_NEWNS (it arrived in 2.4.19, Aug 2002).  So
>>>the inconsistency that I'm highlighting didn't exist back 
>>>then.  As I said above: if *all* of the flags were simply 
>>>reversed, that would be comprehensible; but the fact that 
>>>one of them is not reversed is inconsistent.  This &will*
>>>confuse people in userland, and it seems quite 
>>>unnecessary to do that.  Please consider this point further.
>>>
>>>      
>>>
>>Thanks for clarification. I didn't check that namespaces cames after
>>that original discussion. I still think that the confusion is not acute
>>enough to warrent addition of more flags, but will run it by some
>>folks to see what they think.
>>    
>>
>
>Let me put it this way: if you change things in the manner
>I suggest, then it will cause a few kernel developers
>to have to stop and think for a moment.  If you leave things 
>as they are, then a multitude of userland programmers will be
>condemned to stumble over this confusion for many years to
>come.  
>
>(And yes, I appreciate that the original problem arose 
>with clone(), really there should have been a CLONE_NS 
>flag which was used as the default for fork() and exec(), 
>and omitted to get the CLONE_NEWNS behaviour we now have.
>But extended this problem into unshare() is even
>more confusing, IMHO.)
>
>Just to emphasize this point: while testing the
>various unshare() flags, I found I was myself runnng 
>into confusion when I tested unshare(CLONE_NEWNS).
>That confusion arose precisely because CLONE_NEWNS
>has the same effect in clone() and unshare().  And
>I was still getting confused even though I 
>understood that!
>
>Please consider changing these names.  (I'm a little
>surprised that no-one else has offered an opinion for
>or against this point so far...)
>
>  
>
>>>>>4. Would it not be wise to include a check of the following form
>>>>> at the entry to sys_unshare():
>>>>>
>>>>>      if (flags & ~(CLONE_FS | CLONE_FILES | CLONE_VM | 
>>>>>              CLONE_NEWNS | CLONE_SYSVSEM | CLONE_THREAD))
>>>>>          return -EINVAL;
>>>>>
>>>>> This future-proofs the interface against applications
>>>>> that try to specify extraneous bits in 'flags': if those
>>>>> bits happen to become meaningful in the future, then the
>>>>> application behavior would silently change.  Adding this 
>>>>> check now prevents applications trying to use those bits 
>>>>> until such a time as they have meaning.
>>>>>
>>>>>          
>>>>>
>>>>I did have a similar check in the first incarnation of the patch. It 
>>>>was
>>>>pointed out, correctly, that it is better to allow all flags so we can
>>>>incrementally add new unshare functionality while not making
>>>>any ABI changes. unshare follows clone here, which also does not
>>>>check for extraneous bits in flags.
>>>>        
>>>>
>>>I guess I need educating here.  Several other system calls 
>>>do include such checks:
>>>
>>>mm/mlock.c: mlockall(2):
>>>if (!flags || (flags & ~(MCL_CURRENT | MCL_FUTURE)))
>>>mm/mprotect.c: mprotect(2):
>>>if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM))
>>>mm/msync.c: msync(2):
>>>if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
>>>mm/mremap.c: mremap(2):
>>>if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE))
>>>mm/mempolicy.c:	mbind(2):
>>>if ((flags & ~(unsigned long)(MPOL_MF_STRICT)) || mode > MPOL_MAX)
>>>mm/mempolicy.c:	get_mempolicy(2):
>>>if (flags & ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR))
>>>
>>>What distinguishes unshare() (and clone()) from these?
>>>
>>>      
>>>
>>I haven't looked at your examples in detail, but basically clone and
>>unshare work on pieces of process context. It is quite possible that
>>in future there may be new pieces added to the process context
>>resulting in new flags. You want to make sure that you can
>>incrementally add functionality for sharing and unsharing of
>>new flags.
>>    
>>
>
>Sure -- and I do not see how my suggestion preclused 
>that possibility.
>
>  
>
>>>I guess I'm unclear too about this (requoted) text
>>>
>>>      
>>>
>>>>It was
>>>>pointed out, correctly, that it is better to allow all flags so we can
>>>>incrementally add new unshare functionality while not making
>>>>any ABI changes. 
>>>>        
>>>>
>>>If one restricts the range of flags that are available now
>>>(prevents userland from trying to use them), and then
>>>permits additional flags later, then from the point of
>>>view of the old userland apps, there has been no ABI change.
>>>What am I missing here?
>>>
>>>      
>>>
>>I think the ABI change may occur if the new flag that gets added,
>>somehow interacts with an existing flag (just like signal handlers and
>>vm) or has a different default behavior (like namespace). I think
>>that's why clone and unshare (which mimics the clone interface)
>>do not check for unimplmented flags.
>>    
>>
>
>What you are saying here doesn't make sense to me.  Here is how 
>I see that an ABI change can occur, and it seems to me
>that it is highly undesirable:
>
>1. Under the the current implementation, useland calls 
>   unshare() *accidentally* specifying some additional 
>   bits that currently have no meaning, and do not 
>   cause an EINVAL error.
>
>2. Later, those bits acquire meaning in unshare().
>
>3. As a consequence, the behaviour of the old
>   binary application changes (perhaps crashes,
>   perhaps just does something new and strange)
>
>Does this scenario not seem to be a problem to you?
>If not, why not?
>
>  
>
To me, instead of an application accidently passing extra bits/flags, a more
likely scenario is the incremental addition of new and valid flags. What
I was trying to cover is the possibility that new context flags may get
added to the kernel, but their unsharing may not get added at the same
time. An application developer can appropriately add new flags to their
unshare call and would not have to port their application everytime an
unsharing of a new context flag was supported. A context flag, for
which unsharing is not yet implemented, will result in a no-op. I understand
your concerns and I will run them by a couple of senior kernel developers
to see what they think.

>Cheers,
>
>Michael
>
>
>
>
>/* demo_unshare.c */
>
>#define _GNU_SOURCE
>#include <sys/types.h>
>#include <string.h>
>#include <signal.h>
>#include <sys/stat.h>
>#include <sys/wait.h>
>#include <fcntl.h>
>#include <sched.h>
>#include <stdio.h>
>#include <stdlib.h>
>#include <unistd.h>
>#include <errno.h>
>
>#define errMsg(msg)     do { perror(msg); } while (0)
>
>#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
>                        } while (0)
>
>#define fatalExit(msg)  do { fprintf(stderr, "%s\n", msg); \
>                             exit(EXIT_FAILURE); } while (0)
>
>static void
>usageError(char *progName)
>{
>    fprintf(stderr, "Usage: %s [clone-arg [unshare-arg]]\n", progName);
>#define fpe(str) fprintf(stderr, "        " str)
>    fpe("args can contain the following letters:\n");
>    fpe("    d - file descriptors (CLONE_FILES)\n");
>    fpe("    f - file system information (CLONE_FS)\n");
>    fpe("    n - give child separate namespace (CLONE_NEWNS)\n");
>    fpe("    s - signal dispositions (CLONE_SIGHAND)\n");
>    fpe("    e - SV semaphore undo structures (CLONE_SYSVSEM)\n");
>    fpe("    t - place in same thread group (CLONE_THREAD)\n");
>    fpe("    v - signal dispositions (CLONE_VM)\n");
>    exit(EXIT_FAILURE);
>} /* usageError */
>
>volatile int glob = 0;
>
>typedef struct {        /* For passing info to child startup function */
>    int    fd;
>    mode_t umask;
>    int    signal;
>} ChildParams;
>
>static char *unshareFlags = NULL;
>
>#define SYS_unshare 310
>
>static int
>charsToBits(char *flags)
>{
>    char *p;
>    int cloneFlags;
>
>    cloneFlags = 0;
>
>    if (flags != NULL) {
>        for (p = flags; *p != '\0'; p++) {
>            if      (*p == 'd') cloneFlags |= CLONE_FILES;
>            else if (*p == 'f') cloneFlags |= CLONE_FS;
>            else if (*p == 'n') cloneFlags |= CLONE_NEWNS;
>            else if (*p == 's') cloneFlags |= CLONE_SIGHAND;
>            else if (*p == 'e') cloneFlags |= CLONE_SYSVSEM;
>            else if (*p == 't') cloneFlags |= CLONE_THREAD;
>            else if (*p == 'v') cloneFlags |= CLONE_VM;
>            else usageError("Bad flag");
>        } 
>    } 
>
>    return cloneFlags;
>} /* charsToBits */
>
>static int              /* Startup function for cloned child */
>childFunc(void *arg)
>{
>    ChildParams *cp;
>    int cloneFlags;
>
>    cloneFlags = 0;
>
>    printf("Child:  PID=%ld PPID=%ld\n", (long) getpid(),
>            (long) getppid());
>
>#define SYS_unshare 310
>    cloneFlags = charsToBits(unshareFlags);
>
>    printf("Child unsharing 0x%x\n\n", cloneFlags);
>    if (syscall(SYS_unshare, cloneFlags) == -1) errExit("unshare");
>
>    cp = (ChildParams *) arg;   /* Cast arg to true form */
>
>    /* The following changes may affect parent */
>
>    glob++;             /* May change memory shared with parent */
>
>    umask(cp->umask);
>
>    if (close(cp->fd) == -1) errExit("child:close");
>    if (signal(cp->signal, SIG_DFL) == SIG_ERR)
>        errExit("child:signal");
>
>    return 0;
>} /* childFunc */
>
>int
>main(int argc, char *argv[])
>{
>    const int STACK_SIZE = 65536;    /* Stack size for cloned child */
>    char *stack;                     /* Start of stack buffer area */
>    char *stackTop;                  /* End of stack buffer area */
>    int cloneFlags;                  /* Flags for cloning child */
>    ChildParams cp;                  /* Passed to child function */
>    const mode_t START_UMASK = S_IWOTH;  /* Initial umask setting */
>    struct sigaction sa;
>    int status, s;
>    pid_t pid;
>
>    printf("Parent: PID=%ld PPID=%ld\n", (long) getpid(),
>            (long) getppid());
>
>    if (argc > 2)
>        unshareFlags = argv[2];
>
>    /* Set up an argument structure to be passed to cloned child, and
>       set some process attributes that will be modified by child */
>
>    umask(START_UMASK);         /* Initialize umask to some value */
>    cp.umask = S_IWGRP;         /* Child sets umask to this value */
>
>    cp.fd = open("/dev/null", O_RDWR);  /* Child will close this fd */
>    if (cp.fd == -1) errExit("open");
>
>    cp.signal = SIGTERM;        /* Child will change disposition */
>    if (signal(cp.signal, SIG_IGN) == SIG_ERR) errExit("signal");
>
>    /* Initialize clone flags using command-line argument
>       (if supplied) */
>
>    cloneFlags = charsToBits(argv[1]);
>    printf("Parent clone flags 0x%x\n\n", cloneFlags);
>
>    /* Allocate stack for child */
>
>    stack = malloc(STACK_SIZE);
>    if (stack == NULL) errExit("malloc");
>    stackTop = stack + STACK_SIZE;  /* Assume stack grows downwards */
>
>    if (clone(childFunc, stackTop, cloneFlags | SIGCHLD, &cp) == -1)
>        errExit("clone");
>
>    /* Wait for child.  */
>
>    pid = waitpid(-1, &status, 0);
>    if (pid == -1) {    /* Probably because CLONE_THREAD was used */
>        if (! (cloneFlags & CLONE_THREAD))
>            errExit("waitpid");
>        else
>            sleep(1);
>    } 
>
>    if (WIFEXITED(status) && WEXITSTATUS(status) != 0)
>        fatalExit("Child failed");
>
>    printf("\nAfter wait in parent:\n");
>    printf("    Child PID=%ld; status=0x%x\n", (long) pid, status);
>
>    /* Check whether changes made by cloned child have
>       affected parent */
>
>    printf("\nParent - checking process attributes:\n");
>
>    printf("    [VM]      ");
>    printf("glob=%d (%s)\n", glob, glob ? "changed" : "unchanged");
>
>    printf("    [FS]      ");
>    if (umask(0) != START_UMASK)
>        printf("umask has changed\n");
>    else
>        printf("umask is unchanged\n");
>
>    printf("    [FILES]   ");
>    s = write(cp.fd, "Hello world\n", 12);
>    if (s == -1 && errno == EBADF)
>        printf("file descriptor %d has been closed "
>                "(i.e., changed)\n", cp.fd);
>    else if (s == -1)
>        printf("write() on file descriptor %d failed (%s) "
>                "(unexpected!)\n", cp.fd, strerror(errno));
>    else
>        printf("write() on file descriptor %d succeeded"
>                "(i.e., unchanged)\n", cp.fd);
>
>    printf("    [SIGHAND] ");
>    if (sigaction(cp.signal, NULL, &sa) == -1) errExit("sigaction");
>    if (sa.sa_handler != SIG_IGN)
>        printf("signal disposition has changed\n");
>    else
>        printf("signal disposition is unchanged\n");
>
>    exit(EXIT_SUCCESS);
>} /* main */
>
>  
>


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: unhare() interface design questions and man page
  2006-02-19 17:52           ` Janak Desai
@ 2006-03-02  3:31             ` Michael Kerrisk
  2006-03-02  4:03               ` Linus Torvalds
  0 siblings, 1 reply; 10+ messages in thread
From: Michael Kerrisk @ 2006-03-02  3:31 UTC (permalink / raw)
  To: Janak Desai
  Cc: torvalds, akpm, ak, hch, paulus, viro, linux-kernel, michael.kerrisk

Hi Janak,

> >>>>>3. The naming of the 'flags' bits is inconsistent.  In your
> >>>>> documentation you note:
> >>>>>
> >>>>>      unshare reverses sharing that was done using clone(2) system 
> >>>>>      call, so unshare should have a similar interface as clone(2). 
> >>>>>      That is, since flags in clone(int flags, void *stack) 
> >>>>>      specifies what should be shared, similar flags in 
> >>>>>      unshare(int flags) should specify what should be unshared. 
> >>>>>      Unfortunately, this may appear to invert the meaning of the 
> >>>>>      flags from the way they are used in clone(2).  However, 
> >>>>>      there was no easy solution that was less confusing and that 
> >>>>>      allowed incremental context unsharing in future without an 
> >>>>>      ABI change.
> >>>>> 
> >>>>> The problem is that the flags don't simply reverse the meanings
> >>>>> of the clone() flags of the same name: they do it inconsistently.
> >>>>>
> >>>>> That is, CLONE_FS, CLONE_FILES, and CLONE_VM *reverse* the 
> >>>>> effects of the clone() flags of the same name, but CLONE_NEWNS 
> >>>>> *has the same meaning* as the clone() flag of the same name.
> >>>>> If *all* of the flags were simply reversed, that would be 
> >>>>> a little strange, but comprehensible; but the fact that one of 
> >>>>> them is not reversed is very confusing for users of the 
> >>>>> interface.
> >>>>>
> >>>>> An idea: why not define a new set of flags for unshare()
> >>>>> which are synonyms of the clone() flags, but make their
> >>>>> purpose more obvious to the user, i.e., something like
> >>>>> the following:
> >>>>>  
> >>>>>       #define UNSHARE_VM     CLONE_VM
> >>>>>       #define UNSHARE_FS     CLONE_FS
> >>>>>       #define UNSHARE_FILES  CLONE_FILES
> >>>>>       #define UNSHARE_NS     CLONE_NEWNS
> >>>>>       etc.
> >>>>>       
> >>>>> This would avoid confusion for the interface user.  
> >>>>> (Yes, I realize that this could be done in glibc, but why 
> >>>>> make the kernel and glibc definitions differ?)
> >>>>>
> >>>>>          
> >>>>>
> >>>>I agree that use of clone flags can be confusing. At least a couple 
> >>>>of
> >>>>folks pointed that out when I posted the patch. The issues was even
> >>>>raised when unshare was proposed few years back on lkml. Some
> >>>>source of confusion is the special status of CLONE_NEWNS. Because
> >>>>namespaces are shared by default with fork/clone, it is different 
> >>>>>than
> >>>>other CLONE_* flags. That's probably why it was called CLONE_NEWNS
> >>>>and not CLONE_NS. 
> >>>>        
> >>>>
> >>>Yes, most likely.
> >>>
> >>>>In the original discussion in Aug'00, Linus
> >>>>said that "it makes sense that a unshare(CLONE_FILES) basically
> >>>>_undoes_ the sharing of clone(CLONE_FILES)"
> >>>>
> >>>>http://www.ussg.iu.edu/hypermail/linux/kernel/0008.3/0662.html
> >>>>
> >>>>So I decided to follow that as a guidance for unshare interface.
> >>>>        
> >>>>
> >>>Yes, but when Linus made that statement (Aug 2000), there 
> >>>was no CLONE_NEWNS (it arrived in 2.4.19, Aug 2002).  So
> >>>the inconsistency that I'm highlighting didn't exist back 
> >>>then.  As I said above: if *all* of the flags were simply 
> >>>reversed, that would be comprehensible; but the fact that 
> >>>one of them is not reversed is inconsistent.  This &will*
> >>>confuse people in userland, and it seems quite 
> >>>unnecessary to do that.  Please consider this point further.
> >>>
> >>>      
> >>>
> >>Thanks for clarification. I didn't check that namespaces cames after
> >>that original discussion. I still think that the confusion is not acute
> >>enough to warrent addition of more flags, but will run it by some
> >>folks to see what they think.
> >>    
> >>
> >
> >Let me put it this way: if you change things in the manner
> >I suggest, then it will cause a few kernel developers
> >to have to stop and think for a moment.  If you leave things 
> >as they are, then a multitude of userland programmers will be
> >condemned to stumble over this confusion for many years to
> >come.  
> >
> >(And yes, I appreciate that the original problem arose 
> >with clone(), really there should have been a CLONE_NS 
> >flag which was used as the default for fork() and exec(), 
> >and omitted to get the CLONE_NEWNS behaviour we now have.
> >But extended this problem into unshare() is even
> >more confusing, IMHO.)
> >
> >Just to emphasize this point: while testing the
> >various unshare() flags, I found I was myself runnng 
> >into confusion when I tested unshare(CLONE_NEWNS).
> >That confusion arose precisely because CLONE_NEWNS
> >has the same effect in clone() and unshare().  And
> >I was still getting confused even though I 
> >understood that!
> >
> >Please consider changing these names.  (I'm a little
> >surprised that no-one else has offered an opinion for
> >or against this point so far...)

Do you have any further response on this point?
(There was none in your last message?)

> >>>>>4. Would it not be wise to include a check of the following form
> >>>>> at the entry to sys_unshare():
> >>>>>
> >>>>>      if (flags & ~(CLONE_FS | CLONE_FILES | CLONE_VM | 
> >>>>>              CLONE_NEWNS | CLONE_SYSVSEM | CLONE_THREAD))
> >>>>>          return -EINVAL;
> >>>>>
> >>>>> This future-proofs the interface against applications
> >>>>> that try to specify extraneous bits in 'flags': if those
> >>>>> bits happen to become meaningful in the future, then the
> >>>>> application behavior would silently change.  Adding this 
> >>>>> check now prevents applications trying to use those bits 
> >>>>> until such a time as they have meaning.
> >>>>>
> >>>>I did have a similar check in the first incarnation of the patch. It 
> >>>>was
> >>>>pointed out, correctly, that it is better to allow all flags so we 
> >>>>can
> >>>>incrementally add new unshare functionality while not making
> >>>>any ABI changes. unshare follows clone here, which also does not
> >>>>check for extraneous bits in flags.
> >>>>        
> >>>I guess I need educating here.  Several other system calls 
> >>>do include such checks:
> >>>
> >>>mm/mlock.c: mlockall(2):
> >>>if (!flags || (flags & ~(MCL_CURRENT | MCL_FUTURE)))
> >>>mm/mprotect.c: mprotect(2):
> >>>if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM))
> >>>mm/msync.c: msync(2):
> >>>if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
> >>>mm/mremap.c: mremap(2):
> >>>if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE))
> >>>mm/mempolicy.c:	mbind(2):
> >>>if ((flags & ~(unsigned long)(MPOL_MF_STRICT)) || mode > MPOL_MAX)
> >>>mm/mempolicy.c:	get_mempolicy(2):
> >>>if (flags & ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR))
> >>>
> >>>What distinguishes unshare() (and clone()) from these?
> >>>
> >>I haven't looked at your examples in detail, but basically clone and
> >>unshare work on pieces of process context. It is quite possible that
> >>in future there may be new pieces added to the process context
> >>resulting in new flags. You want to make sure that you can
> >>incrementally add functionality for sharing and unsharing of
> >>new flags.
> >
> >Sure -- and I do not see how my suggestion preclused 
> >that possibility.
> >
> >>>I guess I'm unclear too about this (requoted) text
> >>>
> >>>>It was
> >>>>pointed out, correctly, that it is better to allow all flags so we 
> >>>>can
> >>>>incrementally add new unshare functionality while not making
> >>>>any ABI changes. 
> >>>>        
> >>>>
> >>>If one restricts the range of flags that are available now
> >>>(prevents userland from trying to use them), and then
> >>>permits additional flags later, then from the point of
> >>>view of the old userland apps, there has been no ABI change.
> >>>What am I missing here?
> >>>
> >>I think the ABI change may occur if the new flag that gets added,
> >>somehow interacts with an existing flag (just like signal handlers and
> >>vm) or has a different default behavior (like namespace). I think
> >>that's why clone and unshare (which mimics the clone interface)
> >>do not check for unimplmented flags.
> >
> >What you are saying here doesn't make sense to me.  Here is how 
> >I see that an ABI change can occur, and it seems to me
> >that it is highly undesirable:
> >
> >1. Under the the current implementation, useland calls 
> >   unshare() *accidentally* specifying some additional 
> >   bits that currently have no meaning, and do not 
> >   cause an EINVAL error.
> >
> >2. Later, those bits acquire meaning in unshare().
> >
> >3. As a consequence, the behaviour of the old
> >   binary application changes (perhaps crashes,
> >   perhaps just does something new and strange)
> >
> >Does this scenario not seem to be a problem to you?
> >If not, why not?
> >
> To me, instead of an application accidently passing extra bits/flags, a
> more
> likely scenario is the incremental addition of new and valid flags. What
> I was trying to cover is the possibility that new context flags may get
> added to the kernel, but their unsharing may not get added at the same
> time. An application developer can appropriately add new flags to their
> unshare call and would not have to port their application everytime an
> unsharing of a new context flag was supported. A context flag, for
> which unsharing is not yet implemented, will result in a no-op. I
> understand
> your concerns and I will run them by a couple of senior kernel developers
> to see what they think.

And what did they think?

Janak, I think these points should be conclusively resolved 
(as in: I'm definitely wrong, and I should shut up; or fixes
should be made) before 2.6.16 goes out.  After that things get 
much more difficult.

Cheers,

Michael

-- 
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7 

Want to help with man page maintenance?  
Grab the latest tarball at
ftp://ftp.win.tue.nl/pub/linux-local/manpages/, 
read the HOWTOHELP file and grep the source 
files for 'FIXME'.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: unhare() interface design questions and man page
  2006-03-02  3:31             ` Michael Kerrisk
@ 2006-03-02  4:03               ` Linus Torvalds
  2006-03-02  4:48                 ` Michael Kerrisk
  2006-03-02 21:40                 ` Michael Kerrisk
  0 siblings, 2 replies; 10+ messages in thread
From: Linus Torvalds @ 2006-03-02  4:03 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: Janak Desai, akpm, ak, hch, paulus, viro, linux-kernel, michael.kerrisk

On Thu, 2 Mar 2006, Michael Kerrisk wrote:
> > >>>>>
> > >>>>> That is, CLONE_FS, CLONE_FILES, and CLONE_VM *reverse* the 
> > >>>>> effects of the clone() flags of the same name, but CLONE_NEWNS 
> > >>>>> *has the same meaning* as the clone() flag of the same name.

Well, if this is the only problem, who cares? CLONE_NEWNS itself is 
actually a reversal of clone flags: unlike the others, that tell to 
_share_ things that normally aren't shared across a fork(), CLONE_NEWNS 
does the opposite: it asks to unshare something that normally is shared.

So the fact that it then acts not as a reversal when doing "unshare()" is 
actually consistent with the fact that it's a already a "unshare" event 
for clone() itself.

> Do you have any further response on this point?
> (There was none in your last message?)

I personally don't think it's worth makign UNSHARE_NEWNS just because it's 
a flag that acts differently from the other CLONE_xxx flags.

As to whether allow or not allow unknown unshare() flags, I don't think 
it's a huge deal either way. Right now we don't check the flags to 
"clone()" either, I think.

		Linus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: unhare() interface design questions and man page
  2006-03-02  4:03               ` Linus Torvalds
@ 2006-03-02  4:48                 ` Michael Kerrisk
  2006-03-02 21:40                 ` Michael Kerrisk
  1 sibling, 0 replies; 10+ messages in thread
From: Michael Kerrisk @ 2006-03-02  4:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: janak, akpm, ak, hch, paulus, viro, linux-kernel, michael.kerrisk

Linus, 

> On Thu, 2 Mar 2006, Michael Kerrisk wrote:
> > > >>>>>
> > > >>>>> That is, CLONE_FS, CLONE_FILES, and CLONE_VM *reverse* the 
> > > >>>>> effects of the clone() flags of the same name, but CLONE_NEWNS 
> > > >>>>> *has the same meaning* as the clone() flag of the same name.
> 
> Well, if this is the only problem, who cares? CLONE_NEWNS itself is 
> actually a reversal of clone flags: unlike the others, that tell to 
> _share_ things that normally aren't shared across a fork(), CLONE_NEWNS 
> does the opposite: it asks to unshare something that normally is shared.
> 
> So the fact that it then acts not as a reversal when doing "unshare()" 
> is 
> actually consistent with the fact that it's a already a "unshare" event 
> for clone() itself.

I care about this from an (kernel-userland) *interface* point 
of view.  I see a small but steady stream of unnecessarily 
confusing interfaces going into the kernel-userland interface; 
the problem is that no-one (or not enough people) seem to really
care about this.

The key point is this: if it looks like a duck, I expect it
to behave like a duck.  When interfaces have the same name, 
then users expect the interfaces to provide the same 
behaviour.  Now with unshare() we have the situation that
a set of flags with the same name reverses the 
corresponding flags for clone().  From an interface design 
point of view that might almost be okay.  But one flag
breaks the pattern: CLONE_NEWNS, does the same thing in 
both clone() and unshare().

Inflicting this sort of messiness on userland is a recipe 
for programmer confusion, with bugs and security holes 
in userland programs as the possible result. If you agree
that that is a possible result, then why not avoid the
possibility?  It does not cost much to do so.

> > Do you have any further response on this point?
> > (There was none in your last message?)
> 
> I personally don't think it's worth makign UNSHARE_NEWNS just because 
> it's
> a flag that acts differently from the other CLONE_xxx flags.

See my comments above.  (And in case it wasn't clear, I meant 
make a complete set of UNSHARE_* flags that mirror the 
corresponding CLONE_* flags.)

> As to whether allow or not allow unknown unshare() flags, I don't think 
> it's a huge deal either way. Right now we don't check the flags to 
> "clone()" either, I think.

No checking in clone() was probably a mistake (now difficult 
to rectify because it could break bad userland apps).
Is that past mistake a rationale for making the same
mistake in future interfaces?  Without checks such as I've 
described, the kernel is not providing ABI stability (i.e.,
if some bit that was meaningless in the past acquires a meaning
in later kernels, then (older) application behaviour arbitrarily
and unexpectedly changes.  Why does that possibility not matter?
Making the change I suggest would help userland programmers.
What is the cost of making it?  (i.e., is there some good reason
not to do it?)

Cheers,

Michael

-- 
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7 

Want to help with man page maintenance?  
Grab the latest tarball at
ftp://ftp.win.tue.nl/pub/linux-local/manpages/, 
read the HOWTOHELP file and grep the source 
files for 'FIXME'.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: unhare() interface design questions and man page
  2006-03-02  4:03               ` Linus Torvalds
  2006-03-02  4:48                 ` Michael Kerrisk
@ 2006-03-02 21:40                 ` Michael Kerrisk
  1 sibling, 0 replies; 10+ messages in thread
From: Michael Kerrisk @ 2006-03-02 21:40 UTC (permalink / raw)
  To: janak
  Cc: Linus Torvalds, akpm, ak, hch, paulus, viro, linux-kernel,
	michael.kerrisk

Replying to myself...

> > > Do you have any further response on this point?
> > > (There was none in your last message?)
> > 
> > I personally don't think it's worth makign UNSHARE_NEWNS just because
> > it's a flag that acts differently from the other CLONE_xxx flags.
> 
> See my comments above.  (And in case it wasn't clear, I meant 
> make a complete set of UNSHARE_* flags that mirror the 
> corresponding CLONE_* flags.)

(By the way, I meant that the flag should preferably be called 
UNSHARE_NS, not UNSHARE_NEWNS -- as noted in an earlier message
in this thread, CLONE_NEWNS was itself a bad name.)

I had another thought about why using names of the form
UNSHARE_* might be worthwhile.  It just might be that in the 
future, someone might want to add a flag that has nothing
to do with clone().  I mean some flag that somehow performs
some other modification of the behaviour or unshare(), or
perhaps unshares something that isn't shared/unshared by 
clone().  (The first possibility seems more likely than 
the second.)  In that case, it would make litte sense to
name the flag(s) CLONE_xxx.

Cheers,

Michael

-- 
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7 

Want to help with man page maintenance?  
Grab the latest tarball at
ftp://ftp.win.tue.nl/pub/linux-local/manpages/, 
read the HOWTOHELP file and grep the source 
files for 'FIXME'.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2006-03-02 21:40 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <200602072059.k17KxJUI016348@shell0.pdx.osdl.net>
2006-02-13 22:10 ` unhare() interface design questions and man page Michael Kerrisk
2006-02-14 13:44   ` JANAK DESAI
2006-02-14 18:49     ` Michael Kerrisk
2006-02-14 20:47       ` JANAK DESAI
2006-02-16  5:50         ` Michael Kerrisk
2006-02-19 17:52           ` Janak Desai
2006-03-02  3:31             ` Michael Kerrisk
2006-03-02  4:03               ` Linus Torvalds
2006-03-02  4:48                 ` Michael Kerrisk
2006-03-02 21:40                 ` Michael Kerrisk

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).