Linux-man Archive on
 help / color / Atom feed
From: "Michael Kerrisk (man-pages)" <>
To: Christian Brauner <>
Cc:, Florian Weimer <>,
	Christian Brauner <>,
	lkml <>,
	linux-man <>,
	Kees Cook <>,
	Oleg Nesterov <>, Arnd Bergmann <>,
	David Howells <>,
	Pavel Emelyanov <>,
	Andrew Morton <>,
	Adrian Reber <>, Andrei Vagin <>,
	Linux API <>,
	Jann Horn <>, Ingo Molnar <>
Subject: Re: For review: documentation of clone3() system call
Date: Mon, 11 Nov 2019 10:02:30 +0100
Message-ID: <> (raw)
In-Reply-To: <20191109165349.ec5jkuqt7gtm2iy2@wittgenstein>

Hello Christian,


>>>>        These  system  calls  create a new process, in a manner similar to
>>>>        fork(2).
>>>>        Unlike fork(2), these system calls  allow  the  child  process  to
>>>>        share  parts  of  its  execution context with the calling process,
>>> Hm, sharing part of the execution context is not the only thing that
>>> clone{3}() does. 
>> True. That text has been in the page for 21 years. It probably needs
>> a new coat of paint...
>>> Maybe something like:
>>> 	Unlike fork(2), these system calls allow to create a child process with
>>> 	different properties than its parent. For example, these syscalls allow
>>> 	the child to share various parts of the execution context with the
>>> 	calling process such as [...]. They also allow placing the process in a
>>> 	new set of namespaces.
>>> Just a thought.
>> A good thought...
>> I changed the text to read:
>>        Unlike fork(2), these system calls allow the child to  be  created
>>        with various properties that differ from the parent.  For example,
>>        these system calls provide more precise control over  what  pieces
>>        of  execution  context  are shared between the calling process and
>>        the child process.  For example, using  these  system  calls,  the
>>        caller can control whether or not the two processes share the vir‐
>>        tual address space, the table of file descriptors, and  the  table
>>        of  signal handlers.  These system system calls also allow the new
>>        child process to placed in separate namespaces(7).
>> Okay?
> Nit: The second and thirs sentence both start with "For example".
> Otherwise sounds great.

You gotta love it when the nonnative speaker points out the style errors
of the native speaker :-).

I fixed this.


>>>>        The stack argument specifies the location of the stack used by the
>>>>        child process.  Since the child and calling process may share mem‐
>>>>        ory,  it  is  not possible for the child process to execute in the
>>>>        same stack as the  calling  process.   The  calling  process  must
>>>>        therefore  set  up  memory  space  for  the child stack and pass a
>>>>        pointer to this space to clone().  Stacks  grow  downward  on  all
>>> It might be a good idea to advise people to use mmap() to create a
>>> stack. The "canonical" way of doing this would usually be something like
>>> #define DEFAULT_STACK_SIZE (4 * 1024 * 1024) /* 8 MB usually on Linux */
>>> (Yes, the MAP_STACK is usally a noop but people should always include it
>>>  in case some arch will have weird alignment requirement in which case
>>>  this flag can be changed to actually do something...)
>> So, I'm getting a little bit of an education here, and maybe you are
>> going to further educate me. Long ago, I added the documentation of
>> MAP_STACK to mmap(2), but I never quite connected the dots.
>> However, you say MAP_STACK is *usually* a noop. As far as I can see,
>> in current kernels it is *always* a noop. And AFAICS, since it was first
>> added in 2.6.27 (2008), it has always been a noop.
>> I wonder if it will always be a noop.
>> If we go back and look at the commit:
>> [[
>> commit 2fdc86901d2ab30a12402b46238951d2a7891590
>> Author: Ingo Molnar <>
>> Date:   Wed Aug 13 18:02:18 2008 +0200
>>     x86: add MAP_STACK mmap flag
>>     as per this discussion:
>>     Pardo reported that 64-bit threaded apps, if their stacks exceed the
>>     combined size of ~4GB, slow down drastically in pthread_create() - because
>>     glibc uses MAP_32BIT to allocate the stacks. The use of MAP_32BIT is
>>     a legacy hack - to speed up context switching on certain early model
>>     64-bit P4 CPUs.
>>     So introduce a new flag to be used by glibc instead, to not constrain
>>     64-bit apps like this.
>>     glibc can switch to this new flag straight away - it will be ignored
>>     by the kernel. If those old CPUs ever matter to anyone, support for
>>     it can be implemented.
>> ]]
>> And see also
>> So, my understanding from the above is that MAP_STACK was added to 
>> allow a possible fix on some old architectures, should anyone decide it
>> was worth doing the work of implementing it. But so far, after 12 years,
>> no one did. It kind of looks like no one ever will (since those old
>> architectures become less and less relevant).
>> So, AFAICT, while it's not wrong to tell people to use mmap(MAP_STACKED),
>> it doesn't provide any benefit (and perhaps never will), and it is a
>> more clumsy than plain old malloc().
> Apart from MAP_STACK at some point (however unlikely) becoming relevant
> I also like that mmap() makes it explicit that the page needs to be
> readable (PROT_READ) and writeable (PROT_WRITE) and also every
> (relevant) libc I know uses the explicit mmap() in their pthread
> implementation. If you prefer the malloc() as an example that's
> obviously fine.

I think the benefits you point out are not so significant.
The default assumption is that a given piece of memory is
readable and writable.

Does musl use MAP_STACKED? AFAICT, it doesn't.

I suppose it's not that I prefer malloc(), but rather that (first I
was ignorant, and second that) I prefer simplicity. In fact I even
set up a branch to change various pieces of code I had to use
MAP_STACKED, and then had an "if I had thought about this for
30 minutes, I might have saved myself a couple of hours of work" 
moment after I researched the history of MAP_STACKED a little.

I still don't know the answer for sure. I'm inclined to
leave things as they are until I hear a compelling
argument to change things.

>> Yeah, but I didn't get mentioned in the commit ;-)
> You provided your Ack after I had taken the patch into my tree and I've
> already changed it once for Arnd's ack so I didn't want to change it
> again.

Yes. But I *motivated* the commit :-). Just saying. I'm not
too concerned!


>> I changed the text here to read:
>>        CLONE_PID (Linux 2.0 to 2.5.15)
>>               If  CLONE_PID is set, the child process is created with the
>>               same process ID as the calling process.  This is  good  for
>>               hacking  the  system,  but otherwise of not much use.  From
>>               Linux 2.3.21 onward, this flag could be specified  only  by
>>               the system boot process (PID 0).  The flag disappeared com‐
>>               pletely from the kernel sources in  Linux  2.5.16.   Subse‐
>>               quently,  the  kernel  silently  ignored this bit if it was
>>               specified in the flags mask.  Much later, the same bit  was
>>               recycled for use as the CLONE_PIDFD flag.
> Yip, sounds good.

Are you making fun of my accent? ;-)

Thanks again for your feedback, Christian.



Michael Kerrisk
Linux man-pages maintainer;
Linux/UNIX System Programming Training:

  reply index

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-10-25 16:59 Michael Kerrisk (man-pages)
2019-10-25 17:07 ` Christian Brauner
2019-11-07 12:26   ` Michael Kerrisk (man-pages)
2019-10-26  2:28 ` G. Branden Robinson
2019-10-31  6:06   ` Michael Kerrisk (man-pages)
2019-10-28 15:12 ` Jann Horn
2019-10-28 17:21   ` Christian Brauner
2019-10-28 19:09     ` Jann Horn
2019-10-29 11:27       ` Christian Brauner
2019-10-29 14:26         ` Christian Brauner
2019-10-29 14:36           ` Florian Weimer
2019-10-29 16:04             ` Christian Brauner
2019-10-29 15:20           ` Jann Horn
2019-10-29 16:05             ` Christian Brauner
2019-11-07 15:19 ` Christian Brauner
2019-11-07 16:10   ` Florian Weimer
2019-11-09  8:09   ` Michael Kerrisk (man-pages)
2019-11-09 16:53     ` Christian Brauner
2019-11-11  9:02       ` Michael Kerrisk (man-pages) [this message]
2019-11-11 11:36         ` Christian Brauner
2019-11-11 19:56           ` Michael Kerrisk (man-pages)
2019-11-11 14:55     ` Jann Horn
2019-11-11 16:58       ` Theodore Y. Ts'o
2019-11-11 20:24         ` Jann Horn
2019-11-12 23:03           ` Kees Cook
2019-11-14 12:15       ` Michael Kerrisk (man-pages)
2019-11-14 12:29         ` Christian Brauner
2019-11-11 15:03 ` Florian Weimer
2019-11-11 15:15   ` Jann Horn
2019-11-11 15:20     ` Florian Weimer

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-man Archive on

Archives are clonable:
	git clone --mirror linux-man/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-man linux-man/ \
	public-inbox-index linux-man

Example config snippet for mirrors

Newsgroup available over NNTP:

AGPL code for this site: git clone