From: Florian Weimer
To: Jann Horn
Cc: Kevin Easton, Andy Lutomirski, Christian Brauner, Aleksa Sarai,
    "Enrico Weigelt\, metux IT consult", Linus Torvalds, Al Viro,
    David Howells, Linux API, LKML, "Serge E. Hallyn", Arnd Bergmann,
    "Eric W. Biederman", Kees Cook, Thomas Gleixner, Michael Kerrisk,
    Andrew Morton, Oleg Nesterov, Joel Fernandes, Daniel Colascione
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]
References: <20190414201436.19502-1-christian@brauner.io>
    <20190415195911.z7b7miwsj67ha54y@yavin>
    <20190420071406.GA22257@ip-172-31-15-78>
Date: Mon, 29 Apr 2019 22:49:55 +0200
In-Reply-To: (Jann Horn's message of "Mon, 29 Apr 2019 15:55:11 -0400")
Message-ID: <87v9ywbkp8.fsf@oldenburg2.str.redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

* Jann Horn:

>> int clone_temporary(int (*fn)(void *arg), void *arg, pid_t *child_pid,
>> )
>>
>> and then you'd use it like this to fork off a child process:
>>
>> int spawn_shell_subprocess_(void *arg) {
>>   char *cmdline = arg;
>>   execl("/bin/sh", "sh", "-c", cmdline, (char *)NULL);
>>   return -1;
>> }
>> pid_t spawn_shell_subprocess(char *cmdline) {
>>   pid_t child_pid;
>>   int res = clone_temporary(spawn_shell_subprocess_, cmdline,
>> &child_pid, [...]);
>>   if (res == 0) return child_pid;
>>   return res;
>> }
>>
>> clone_temporary() could be implemented roughly as follows by the libc
>> (or other userspace code):
>>
>> sigset_t sigset, sigset_old;
>> sigfillset(&sigset);
>> sigprocmask(SIG_SETMASK, &sigset, &sigset_old);
>> int child_pid;
>> int result = 0;
>> /* starting here, use inline assembly to ensure that no stack
>> allocations occur */
>> long child = syscall(__NR_clone,
>> CLONE_VM|CLONE_CHILD_SETTID|CLONE_CHILD_CLEARTID|SIGCHLD, $RSP -
>> ABI_STACK_REDZONE_SIZE, NULL, &child_pid, 0);
>> if (child == -1) { result = -1; goto reset_sigmask; }
>> if (child == 0) {
>>   result = fn(arg);
>>   syscall(__NR_exit, 0);
>> }
>> futex(&child_pid, FUTEX_WAIT, child, NULL);
>> /* end of no-stack-allocations zone */
>> reset_sigmask:
>> sigprocmask(SIG_SETMASK, &sigset_old, NULL);
>> return result;
>
> ... I guess that already has a name, and it's called vfork(). (Well,
> except that the Linux vfork() isn't a real vfork().)
>
> So I guess my question is: Why not vfork()?

Mainly because some users want access to the clone flags, and that's
not possible with the current userspace wrappers. The stack setup for
the undocumented clone wrapper is also cumbersome, and the ia64
peculiarity is annoying.

For the stack sharing, the callback-based interface looks to me like
absolutely the right thing to do. It enforces the notion that you can
safely return on the child path from a function calling vfork.

> And if vfork() alone isn't flexible enough, alternatively: How about
> an API that forks a new child in the same address space, and then
> allows the parent to invoke arbitrary syscalls in the context of the
> child?

As long as it's not an eBPF script …

> You could also build that in userspace if you wanted, I think - just
> let the child run an assembly loop that reads registers from a unix
> seqpacket socket, invokes the syscall instruction, and writes the
> value of the result register back into the seqpacket socket. As long
> as you use CLONE_VM, you don't have to worry about moving the pointer
> targets of syscalls. The user-visible API could look like this:

People already use a variant of this, execve'ing twice. See
jspawnhelper.

Thanks,
Florian
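
For illustration only, here is a minimal sketch of a callback-shaped
wrapper built on plain vfork(), assuming Linux with glibc. The names
vfork_call(), exec_shell() and run_shell() are hypothetical and not an
existing libc interface, and the wrapper takes no clone flags, so it
does not address the flags request discussed above. POSIX only permits
_exit() and the exec family in a vfork() child, so calling fn() there
relies on Linux behaviour rather than on the standard.

```c
#define _DEFAULT_SOURCE         /* for the vfork() declaration in glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run fn(arg) in a vfork() child. The child leaves
 * only through exec (inside fn) or _exit(); it never returns into the
 * caller's stack frames, which the parent still needs. */
int vfork_call(int (*fn)(void *arg), void *arg, pid_t *child_pid)
{
    assert(fn != NULL && child_pid != NULL);

    pid_t pid = vfork();
    if (pid < 0)
        return -1;                  /* vfork failed, errno is set */
    if (pid == 0)
        _exit(fn(arg) ? 127 : 0);   /* child: fn came back without exec */

    *child_pid = pid;               /* parent: resumes after exec/_exit */
    return 0;
}

/* Example callback: replace the child with "sh -c <cmdline>". */
int exec_shell(void *arg)
{
    /* The execl argument list must end with a null pointer. */
    execl("/bin/sh", "sh", "-c", (char *)arg, (char *)NULL);
    return -1;                      /* only reached if execl failed */
}

/* Spawn a shell command; return its exit status, or -1 on error. */
int run_shell(const char *cmdline)
{
    pid_t pid;
    int status;

    if (vfork_call(exec_shell, (void *)cmdline, &pid) != 0)
        return -1;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

A caller would use run_shell("some command") and inspect the returned
exit status. Keeping the child path inside the callback means the
function that calls vfork() itself never returns in the child.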