Linux-man Archive on lore.kernel.org
 help / color / Atom feed
From: Christian Brauner <christian.brauner@ubuntu.com>
To: Jann Horn <jannh@google.com>
Cc: Michael Kerrisk-manpages <mtk.manpages@gmail.com>,
	lkml <linux-kernel@vger.kernel.org>,
	linux-man <linux-man@vger.kernel.org>,
	Kees Cook <keescook@chromium.org>,
	Florian Weimer <fweimer@redhat.com>,
	Oleg Nesterov <oleg@redhat.com>, Arnd Bergmann <arnd@arndb.de>,
	David Howells <dhowells@redhat.com>,
	Pavel Emelyanov <xemul@virtuozzo.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Adrian Reber <adrian@lisas.de>, Andrei Vagin <avagin@gmail.com>,
	Linux API <linux-api@vger.kernel.org>
Subject: Re: For review: documentation of clone3() system call
Date: Mon, 28 Oct 2019 18:21:44 +0100
Message-ID: <20191028172143.4vnnjpdljfnexaq5@wittgenstein> (raw)
In-Reply-To: <CAG48ez3q=BeNcuVTKBN79kJui4vC6nw0Bfq6xc-i0neheT17TA@mail.gmail.com>

On Mon, Oct 28, 2019 at 04:12:09PM +0100, Jann Horn wrote:
> On Fri, Oct 25, 2019 at 6:59 PM Michael Kerrisk (man-pages)
> <mtk.manpages@gmail.com> wrote:
> > I've made a first shot at adding documentation for clone3(). You can
> > see the diff here:
> > https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=faa0e55ae9e490d71c826546bbdef954a1800969
> [...]
> >    clone3()
> >        The  clone3() system call provides a superset of the functionality
> >        of the older clone() interface.  It also provides a number of  API
> >        improvements,  including: space for additional flags bits; cleaner
> >        separation in the use of various arguments;  and  the  ability  to
> >        specify the size of the child's stack area.
> 
> You might want to note somewhere that its flags can't be
> seccomp-filtered because they're stored in memory, making it
> inappropriate to use in heavily sandboxed processes.

Hm, I don't think that belongs on the clone manpage. Granted that
process creation is an important syscall but so are a bunch of others
that aren't filterable because of pointer arguments.
We can probably mention on the seccomp manpage that seccomp can't filter
on pointer arguments and then provide a list of examples. If you setup a
seccomp filter and don't know that you can't filter syscalls with
pointer args that seems pretty bad to begin with.

> 
> >            struct clone_args {
> >                u64 flags;        /* Flags bit mask */
> >                u64 pidfd;        /* Where to store PID file descriptor
> >                                     (int *) */
> >                u64 child_tid;    /* Where to store child TID,
> >                                     in child's memory (int *) */
> >                u64 parent_tid;   /* Where to store child TID,
> >                                     in parent's memory (int *) */
> >                u64 exit_signal;  /* Signal to deliver to parent on
> >                                     child termination */
> >                u64 stack;        /* Pointer to lowest byte of stack */
> >                u64 stack_size;   /* Size of stack */
> >                u64 tls;          /* Location of new TLS */
> >            };
> >
> >        The size argument that is supplied to clone3() should be  initial‐
> >        ized  to  the  size of this structure.  (The existence of the size
> >        argument permits future extensions to the clone_args structure.)
> >
> >        The stack for the child process is  specified  via  cl_args.stack,
> >        which   points   to  the  lowest  byte  of  the  stack  area,  and
> 
> Here and in the comment in the struct above, you say that .stack
> "points to the lowest byte of the stack area", but isn't that
> architecture-dependent? For most architectures, I think it should
> instead be "is the initial stack pointer", with the exception of IA64
> (and maybe others, I'm not sure). For example, on X86, when launching
> a thread with an initially empty stack, it points directly *after* the
> end of the stack area.

re arch and stack_size: You mentioned ia64 below (I snipped this part.)
but it's not the only one. With legacy clone it's _passed_ for any
architecture that has CONFIG_CLONE_BACKWARDS3. That includes at least
microblaze and ia64 I think. But only ia64 makes _actual use_ of this in
copy_thread() by doing user_stack_base + user_stack_size - 16. I think ia64
only needs stack_size because of the split page-table layout where two
stacks grow in different directions; so the stack doesn't grow
dynamically. Afair, stack_size is mainly used when PF_KTHREAD is true
but that can't be set from userspace anyway, so _shrug_.

One thing I never liked about clone() was that userspace had to know
about stack direction. And there is a lot of ugly code in userspace that
has nasty clone() wrappers like:

pid_t wrap_clone(int (*fn)(void *), void *arg, int flags, int *pidfd)
{
	pid_t ret;
	void *stack;

	stack = malloc(__STACK_SIZE);
	if (!stack) {
		SYSERROR("Failed to allocate clone stack");
		return -ENOMEM;
	}

#ifdef __ia64__
	ret = __clone2(fn, stack, __STACK_SIZE, flags | SIGCHLD, arg, pidfd);
#else
	ret = clone(fn, stack + __STACK_SIZE, flags | SIGCHLD, arg, pidfd);
#endif
	return ret;
}

where stack + stack_size is addition on a void pointer which usually
clang and gcc are not very happy about.
I wanted to bring this up on the mailing list soon: If possible, I don't
want userspace to need to know about stack direction and just have stack
point to the beginning and then have the kernel do the + stack_size
after the copy_clone_args_from_user() if the arch needs it. For example,
by having a dumb helder similar to copy_thread_tls()/coyp_thread() that
either does the + stack_size or not. Right now, clone3() is supported on
parisc and afaict, the stack grows upwards for it. I'm not sure if there
are obvious reasons why that won't work or it would be a bad idea...

Christian

  reply index

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-10-25 16:59 Michael Kerrisk (man-pages)
2019-10-25 17:07 ` Christian Brauner
2019-11-07 12:26   ` Michael Kerrisk (man-pages)
2019-10-26  2:28 ` G. Branden Robinson
2019-10-31  6:06   ` Michael Kerrisk (man-pages)
2019-10-28 15:12 ` Jann Horn
2019-10-28 17:21   ` Christian Brauner [this message]
2019-10-28 19:09     ` Jann Horn
2019-10-29 11:27       ` Christian Brauner
2019-10-29 14:26         ` Christian Brauner
2019-10-29 14:36           ` Florian Weimer
2019-10-29 16:04             ` Christian Brauner
2019-10-29 15:20           ` Jann Horn
2019-10-29 16:05             ` Christian Brauner
2019-11-07 15:19 ` Christian Brauner
2019-11-07 16:10   ` Florian Weimer
2019-11-09  8:09   ` Michael Kerrisk (man-pages)
2019-11-09 16:53     ` Christian Brauner
2019-11-11  9:02       ` Michael Kerrisk (man-pages)
2019-11-11 11:36         ` Christian Brauner
2019-11-11 19:56           ` Michael Kerrisk (man-pages)
2019-11-11 14:55     ` Jann Horn
2019-11-11 16:58       ` Theodore Y. Ts'o
2019-11-11 20:24         ` Jann Horn
2019-11-12 23:03           ` Kees Cook
2019-11-14 12:15       ` Michael Kerrisk (man-pages)
2019-11-14 12:29         ` Christian Brauner
2019-11-11 15:03 ` Florian Weimer
2019-11-11 15:15   ` Jann Horn
2019-11-11 15:20     ` Florian Weimer

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191028172143.4vnnjpdljfnexaq5@wittgenstein \
    --to=christian.brauner@ubuntu.com \
    --cc=adrian@lisas.de \
    --cc=akpm@linux-foundation.org \
    --cc=arnd@arndb.de \
    --cc=avagin@gmail.com \
    --cc=dhowells@redhat.com \
    --cc=fweimer@redhat.com \
    --cc=jannh@google.com \
    --cc=keescook@chromium.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-man@vger.kernel.org \
    --cc=mtk.manpages@gmail.com \
    --cc=oleg@redhat.com \
    --cc=xemul@virtuozzo.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-man Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-man/0 linux-man/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-man linux-man/ https://lore.kernel.org/linux-man \
		linux-man@vger.kernel.org
	public-inbox-index linux-man

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-man


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git