From: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
To: Christian Brauner <christian.brauner@ubuntu.com>
Cc: mtk.manpages@gmail.com, Florian Weimer <fweimer@redhat.com>,
Christian Brauner <christian@brauner.io>,
lkml <linux-kernel@vger.kernel.org>,
linux-man <linux-man@vger.kernel.org>,
Kees Cook <keescook@chromium.org>,
Oleg Nesterov <oleg@redhat.com>, Arnd Bergmann <arnd@arndb.de>,
David Howells <dhowells@redhat.com>,
Pavel Emelyanov <xemul@virtuozzo.com>,
Andrew Morton <akpm@linux-foundation.org>,
Adrian Reber <adrian@lisas.de>, Andrei Vagin <avagin@gmail.com>,
Linux API <linux-api@vger.kernel.org>,
Jann Horn <jannh@google.com>, Ingo Molnar <mingo@elte.hu>
Subject: Re: For review: documentation of clone3() system call
Date: Mon, 11 Nov 2019 10:02:30 +0100 [thread overview]
Message-ID: <52e0103b-2bc4-b0c6-274f-ce7effb1799f@gmail.com> (raw)
In-Reply-To: <20191109165349.ec5jkuqt7gtm2iy2@wittgenstein>
Hello Christian,
[...]
>>>> These system calls create a new process, in a manner similar to
>>>> fork(2).
>>>>
>>>> Unlike fork(2), these system calls allow the child process to
>>>> share parts of its execution context with the calling process,
>>>
>>> Hm, sharing part of the execution context is not the only thing that
>>> clone{3}() does.
>>
>> True. That text has been in the page for 21 years. It probably needs
>> a new coat of paint...
>>
>>> Maybe something like:
>>>
>>> Unlike fork(2), these system calls allow to create a child process with
>>> different properties than its parent. For example, these syscalls allow
>>> the child to share various parts of the execution context with the
>>> calling process such as [...]. They also allow placing the process in a
>>> new set of namespaces.
>>>
>>> Just a thought.
>>
>> A good thought...
>>
>> I changed the text to read:
>>
>> Unlike fork(2), these system calls allow the child to be created
>> with various properties that differ from the parent. For example,
>> these system calls provide more precise control over what pieces
>> of execution context are shared between the calling process and
>> the child process. For example, using these system calls, the
>> caller can control whether or not the two processes share the vir‐
>> tual address space, the table of file descriptors, and the table
>> of signal handlers. These system system calls also allow the new
>> child process to placed in separate namespaces(7).
>>
>> Okay?
>
> Nit: The second and thirs sentence both start with "For example".
> Otherwise sounds great.
You gotta love it when the nonnative speaker points out the style errors
of the native speaker :-).
I fixed this.
[...]
>>>> The stack argument specifies the location of the stack used by the
>>>> child process. Since the child and calling process may share mem‐
>>>> ory, it is not possible for the child process to execute in the
>>>> same stack as the calling process. The calling process must
>>>> therefore set up memory space for the child stack and pass a
>>>> pointer to this space to clone(). Stacks grow downward on all
>>>
>>> It might be a good idea to advise people to use mmap() to create a
>>> stack. The "canonical" way of doing this would usually be something like
>>>
>>> #define DEFAULT_STACK_SIZE (4 * 1024 * 1024) /* 8 MB usually on Linux */
>>> void *stack = mmap(NULL, DEFAULT_STACK_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
>>>
>>> (Yes, the MAP_STACK is usally a noop but people should always include it
>>> in case some arch will have weird alignment requirement in which case
>>> this flag can be changed to actually do something...)
>>
>> So, I'm getting a little bit of an education here, and maybe you are
>> going to further educate me. Long ago, I added the documentation of
>> MAP_STACK to mmap(2), but I never quite connected the dots.
>>
>> However, you say MAP_STACK is *usually* a noop. As far as I can see,
>> in current kernels it is *always* a noop. And AFAICS, since it was first
>> added in 2.6.27 (2008), it has always been a noop.
>>
>> I wonder if it will always be a noop.
>>
>> If we go back and look at the commit:
>>
>> [[
>> commit 2fdc86901d2ab30a12402b46238951d2a7891590
>> Author: Ingo Molnar <mingo@elte.hu>
>> Date: Wed Aug 13 18:02:18 2008 +0200
>>
>> x86: add MAP_STACK mmap flag
>>
>> as per this discussion:
>>
>> http://lkml.org/lkml/2008/8/12/423
>>
>> Pardo reported that 64-bit threaded apps, if their stacks exceed the
>> combined size of ~4GB, slow down drastically in pthread_create() - because
>> glibc uses MAP_32BIT to allocate the stacks. The use of MAP_32BIT is
>> a legacy hack - to speed up context switching on certain early model
>> 64-bit P4 CPUs.
>>
>> So introduce a new flag to be used by glibc instead, to not constrain
>> 64-bit apps like this.
>>
>> glibc can switch to this new flag straight away - it will be ignored
>> by the kernel. If those old CPUs ever matter to anyone, support for
>> it can be implemented.
>> ]]
>>
>> And see also https://lwn.net/Articles/294642/
>>
>> So, my understanding from the above is that MAP_STACK was added to
>> allow a possible fix on some old architectures, should anyone decide it
>> was worth doing the work of implementing it. But so far, after 12 years,
>> no one did. It kind of looks like no one ever will (since those old
>> architectures become less and less relevant).
>>
>> So, AFAICT, while it's not wrong to tell people to use mmap(MAP_STACKED),
>> it doesn't provide any benefit (and perhaps never will), and it is a
>> more clumsy than plain old malloc().
>
> Apart from MAP_STACK at some point (however unlikely) becoming relevant
> I also like that mmap() makes it explicit that the page needs to be
> readable (PROT_READ) and writeable (PROT_WRITE) and also every
> (relevant) libc I know uses the explicit mmap() in their pthread
> implementation. If you prefer the malloc() as an example that's
> obviously fine.
I think the benefits you point out are not so significant.
The default assumption is that a given piece of memory is
readable and writable.
Does musl use MAP_STACKED? AFAICT, it doesn't.
I suppose it's not that I prefer malloc(), but rather that (first I
was ignorant, and second that) I prefer simplicity. In fact I even
set up a branch to change various pieces of code I had to use
MAP_STACKED, and then had an "if I had thought about this for
30 minutes, I might have saved myself a couple of hours of work"
moment after I researched the history of MAP_STACKED a little.
I still don't know the answer for sure. I'm inclined to
leave things as they are until I hear a compelling
argument to change things.
>> Yeah, but I didn't get mentioned in the commit ;-)
>
> You provided your Ack after I had taken the patch into my tree and I've
> already changed it once for Arnd's ack so I didn't want to change it
> again.
Yes. But I *motivated* the commit :-). Just saying. I'm not
too concerned!
[...]
>> I changed the text here to read:
>>
>> CLONE_PID (Linux 2.0 to 2.5.15)
>> If CLONE_PID is set, the child process is created with the
>> same process ID as the calling process. This is good for
>> hacking the system, but otherwise of not much use. From
>> Linux 2.3.21 onward, this flag could be specified only by
>> the system boot process (PID 0). The flag disappeared com‐
>> pletely from the kernel sources in Linux 2.5.16. Subse‐
>> quently, the kernel silently ignored this bit if it was
>> specified in the flags mask. Much later, the same bit was
>> recycled for use as the CLONE_PIDFD flag.
>
> Yip, sounds good.
Are you making fun of my accent? ;-)
Thanks again for your feedback, Christian.
Cheers,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
next prev parent reply other threads:[~2019-11-11 9:02 UTC|newest]
Thread overview: 28+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-10-25 16:59 For review: documentation of clone3() system call Michael Kerrisk (man-pages)
2019-10-25 17:07 ` Christian Brauner
2019-11-07 12:26 ` Michael Kerrisk (man-pages)
2019-10-28 15:12 ` Jann Horn
2019-10-28 17:21 ` Christian Brauner
2019-10-28 19:09 ` Jann Horn
2019-10-29 11:27 ` Christian Brauner
2019-10-29 14:26 ` Christian Brauner
2019-10-29 14:36 ` Florian Weimer
2019-10-29 16:04 ` Christian Brauner
2019-10-29 15:20 ` Jann Horn
2019-10-29 16:05 ` Christian Brauner
2019-11-07 15:19 ` Christian Brauner
2019-11-07 16:10 ` Florian Weimer
2019-11-09 8:09 ` Michael Kerrisk (man-pages)
2019-11-09 16:53 ` Christian Brauner
2019-11-11 9:02 ` Michael Kerrisk (man-pages) [this message]
2019-11-11 11:36 ` Christian Brauner
2019-11-11 19:56 ` Michael Kerrisk (man-pages)
2019-11-11 14:55 ` Jann Horn
2019-11-11 16:58 ` Theodore Y. Ts'o
2019-11-11 20:24 ` Jann Horn
2019-11-12 23:03 ` Kees Cook
2019-11-14 12:15 ` Michael Kerrisk (man-pages)
2019-11-14 12:29 ` Christian Brauner
2019-11-11 15:03 ` Florian Weimer
2019-11-11 15:15 ` Jann Horn
2019-11-11 15:20 ` Florian Weimer
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=52e0103b-2bc4-b0c6-274f-ce7effb1799f@gmail.com \
--to=mtk.manpages@gmail.com \
--cc=adrian@lisas.de \
--cc=akpm@linux-foundation.org \
--cc=arnd@arndb.de \
--cc=avagin@gmail.com \
--cc=christian.brauner@ubuntu.com \
--cc=christian@brauner.io \
--cc=dhowells@redhat.com \
--cc=fweimer@redhat.com \
--cc=jannh@google.com \
--cc=keescook@chromium.org \
--cc=linux-api@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-man@vger.kernel.org \
--cc=mingo@elte.hu \
--cc=oleg@redhat.com \
--cc=xemul@virtuozzo.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).