All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Alex Bennée" <alex.bennee@linaro.org>
To: Josh Kunz <jkz@google.com>
Cc: riku.voipio@iki.fi, qemu-devel@nongnu.org, laurent@vivier.eu
Subject: Re: [PATCH 4/5] linux-user: Support CLONE_VM and extended clone options
Date: Tue, 16 Jun 2020 17:08:16 +0100	[thread overview]
Message-ID: <87h7vbyowf.fsf@linaro.org> (raw)
In-Reply-To: <20200612014606.147691-5-jkz@google.com>


Josh Kunz <jkz@google.com> writes:

> The `clone` system call can be used to create new processes that share
> attributes with their parents, such as virtual memory, file
> system location, file descriptor tables, etc. These can be useful to a
> variety of guest programs.
>
> Before this patch, QEMU had support for a limited set of these attributes.
> Basically the ones needed for threads, and the options used by fork.
> This change adds support for all flag combinations involving CLONE_VM.
> In theory, almost all clone options could be supported, but invocations
> not using CLONE_VM are likely to run afoul of linux-user's inherently
> multi-threaded design.
>
> To add this support, this patch updates the `qemu_clone` helper. An
> overview of the mechanism used to support general `clone` options with
> CLONE_VM is described below.
>
> This patch also enables by-default the `clone` unit-tests in
> tests/tcg/multiarch/linux-test.c, and adds an additional test for duplicate
> exit signals, based on a bug found during development.

Which by the way fail on some targets:

    TEST    linux-test on alpha
  /home/alex/lsrc/qemu.git/tests/tcg/multiarch/linux-test.c:709: child did not receive PDEATHSIG on parent death
  make[2]: *** [../Makefile.target:153: run-linux-test] Error 1
  make[1]: *** [/home/alex/lsrc/qemu.git/tests/tcg/Makefile.qemu:76: run-guest-tests] Error 2
  make: *** [/home/alex/lsrc/qemu.git/tests/Makefile.include:851: run-tcg-tests-alpha-linux-user] Error 2

Have you managed a clean check-tcg with docker enabled so all the guest
architectures get tested?

>
> !! Overview
>
> Adding support for CLONE_VM is tricky. The parent and guest process will
> share an address space (similar to threads), so the emulator must
> coordinate between the parent and the child. Currently, QEMU relies
> heavily on Thread Local Storage (TLS) as part of this coordination
> strategy. For threads, this works fine, because libc manages the
> thread-local data region used for TLS, when we create new threads using
> `pthread_create`. Ideally we could use the same mechanism for
> "process-local storage" needed to allow the parent/child processes to
> emulate in tandem. Unfortunately TLS is tightly integrated into libc.
> The only way to create TLS data regions is via the `pthread_create` API
> which also spawns a new thread (rather than a new processes, which is
> what we want). Worse still, TLS itself is a complicated arch-specific
> feature that is tightly integrated into the rest of libc and the dynamic
> linker. Re-implementing TLS support for QEMU would likely require a
> special dynamic linker / libc. Alternatively, the popular libcs could be
> extended, to allow for users to create TLS regions without creating
> threads. Even if major libcs decide to add this support, QEMU will still
> need a temporary work around until those libcs are widely deployed. It's
> also unclear if libcs will be interested in supporting this case, since
> TLS image creation is generally deeply integrated with thread setup.
>
> In this patch, I've employed an alternative approach: spawning a thread
> an "stealing" its TLS image for use in the child process. This approach
> leaves a dangling thread while the TLS image is in use, but by design
> that thread will not become schedulable until after the TLS data is no
> longer in-use by the child (as described in a moment). Therefore, it
> should cause relatively minimal overhead. When considered in the larger
> context, this seems like a reasonable tradeoff.

*sharp intake of breath*

OK so the solution to the complexity of handling threads is to add more
threads? cool cool cool....

>
> A major complication of this approach knowing when it is safe to clean up
> the stack, and TLS image, used by a child process. When a child is
> created with `CLONE_VM` its stack, and TLS data, need to remain valid
> until that child has either exited, or successfully called `execve` (on
> `execve` the child is given a new VMM by the kernel). One approach would
> be to use `waitid(WNOWAIT)` (the `WNOWAIT` allows the guest to reap the
> child). The problem is that the `wait` family of calls only waits for
> termination. The pattern of `clone() ... execve()` for long running
> child processes is pretty common. If we waited for child processes to
> exit, it's likely we would end up using substantially more memory, and
> keep the suspended TLS thread around much longer than necessary.
> Instead, in this patch, I've used an "trampoline" process. The real
> parent first clones a trampoline, the trampoline then clones the
> ultimate child using the `CLONE_VFORK` option. `CLONE_VFORK` suspends
> the trampoline process until the child has exited, or called `execve`.
> Once the trampoline is re-scheduled, we know it is safe to clean up
> after the child. This creates one more suspended process, but typically,
> the trampoline only exists for a short period of time.
>
> !! CLONE_VM setup, step by step
>
> 1. First, the suspended thread whose TLS we will use is created using
>    `pthread_create`. The thread fetches and returns it's "TLS pointer"
>    (an arch-specific value given to the kernel) to the parent. It then
>    blocks on a lock to prevent its TLS data from being cleaned up.
>    Ultimately the lock will be unlocked by the trampoline once the child
>    exits.
> 2. Once the TLS thread has fetched the TLS pointer, it notifies the real
>    parent thread, which calls `clone()` to create the trampoline
>    process. For ease of implementation, the TLS image is set for the
>    trampoline process during this step. This allows the trampoline to
>    use functions that require TLS if needed (e.g., printf). TLS location
>    is inherited when a new child is spawned, so this TLS data will
>    automatically be inherited by the child.
> 3. Once the trampoline has been spawned, it registers itself as a
>    "hidden" process with the signal subsystem. This prevents the exit
>    signal from the trampoline from ever being forwarded to the guest.
>    This is needed due to the way that Linux sets the exit signal for the
>    ultimate child when `CLONE_PARENT` is set. See the source for
>    details.
> 4. Once setup is complete, the trampoline spawns the final child with
>    the original clone flags, plus `CLONE_PARENT`, so the child is
>    correctly parented to the kernel task on which the guest invoked
>    `clone`. Without this, kernel features like PDEATHSIG, and
>    subreapers, would not work properly. As previously discussed, the
>    trampoline also supplies `CLONE_VFORK` so that it is suspended until
>    the child can be cleaned up.
> 5. Once the child is spawned, it signals the original parent thread that
>    it is running. At this point, the trampoline process is suspended
>    (due to CLONE_VFORK).
> 6. Finally, the call to `qemu_clone` in the parent is finished, the
>    child begins executing the given callback function in the new child
>    process.
>
> !! Cleaning up
>
> Clean up itself is a multi-step process. Once the child exits, or is
> killed by a signal (cleanup is the same in both cases), the trampoline
> process becomes schedulable. When the trampoline is scheduled, it frees
> the child stack, and unblocks the suspended TLS thread. This cleans up
> the child resources, but not the stack used by the trampoline itself. It
> is possible for a process to clean up its own stack, but it is tricky,
> and architecture-specific. Instead we leverage the TLS manager thread to
> clean up the trampoline stack. When the trampoline is cloned (in step 2
> above), we additionally set the `CHILD_SETTID` and `CHILD_CLEARTID`
> flags. The target location for the SET/CLEAR TID is set to a special field
> known by the TLS manager. Then, when the TLS manager thread is unsuspended,
> it performs an additional `FUTEX_WAIT` on this location. That blocks the
> TLS manager thread until the trampoline has fully exited, then the TLS
> manager thread frees the trampoline process's stack, before exiting
> itself.
>
> !! Shortcomings of this patch
>
> * It's complicated.
> * It doesn't support any clone options when CLONE_VM is omitted.
> * It doesn't properly clean up the CPU queue when the child process
>   terminates, or calls execve().
> * RCU unregistration is done in the trampoline process (in clone.c), but
>   registration happens in syscall.c This should be made more explicit.
> * The TLS image, and trampoline stack are not cleaned up if the parent
>   calls `execve` or `exit_group` before the child does. This is because
>   those cleanup tasks are handled by the TLS manager thread. The TLS
>   manager thread is in the same thread group as the parent, so it will
>   be terminated if the parent exits or calls `execve`.
>
> !! Alternatives considered
>
> * Non-standard libc extension to allow creating TLS images independent
>   of threads. This would allow us to just `clone` the child directly
>   instead of this complicated maneuver. Though we probably would still
>   need the cleanup logic. For libcs, TLS image allocation is tightly
>   connected to thread stack allocation, which is also arch-specific. I
>   do not have enough experience with libc development to know if
>   maintainers of any popular libcs would be open to supporting such an
>   API. Additionally, since it will probably take years before a libc
>   fix would be widely deployed, we need an interim solution anyways.

We could consider a custom lib stub that intercepts calls to the guests
original libc and replaces it with a QEMU aware one?

> * Non-standard, Linux-only, libc extension to allow us to specify the
>   CLONE_* flags used by `pthread_create`. The processes we are creating
>   are basically threads in a different thread group. If we could alter
>   the flags used, this whole processes could become a `pthread_create.`
>   The problem with this approach is that I don't know what requirements
>   pthreads has on threads to ensure they function properly. I suspect
>   that pthreads relies on CHILD_CLEARTID+FUTEX_WAKE to cleanup detached
>   thread state. Since we don't control the child exit reason (Linux only
>   handles CHILD_CLEARTID on normal, non-signal process termination), we
>   probably can't use this same tracking mechanism.
> * Other mechanisms for detecting child exit so cleanup can happen
>   besides CLONE_VFORK:
>   * waitid(WNOWAIT): This can only detect exit, not execve.
>   * file descriptors with close on exec set: This cannot detect children
>     cloned with CLONE_FILES.
>   * System V semaphore adjustments: Cannot detect children cloned with
>     CLONE_SYSVSEM.
>   * CLONE_CHILD_CLEARTID + FUTEX_WAIT: Cannot detect abnormally
>     terminated children.
> * Doing the child clone directly in the TLS manager thread: This saves the
>   need for the trampoline process, but it causes the child process to be
>   parented to the wrong kernel task (the TLS thread instead of the Main
>   thread) breaking things like PDEATHSIG.

Have you considered a daemon which could co-ordinate between the
multiple processes that are sharing some state?


> Signed-off-by: Josh Kunz <jkz@google.com>
> ---
>  linux-user/clone.c               | 415 ++++++++++++++++++++++++++++++-
>  linux-user/qemu.h                |  17 ++
>  linux-user/signal.c              |  49 ++++
>  linux-user/syscall.c             |  69 +++--
>  tests/tcg/multiarch/linux-test.c |  67 ++++-
>  5 files changed, 592 insertions(+), 25 deletions(-)
>
> diff --git a/linux-user/clone.c b/linux-user/clone.c
> index f02ae8c464..3f7344cf9e 100644
> --- a/linux-user/clone.c
> +++ b/linux-user/clone.c
> @@ -12,6 +12,12 @@
>  #include <stdbool.h>
>  #include <assert.h>
>  
> +/* arch-specifc includes needed to fetch the TLS base offset. */
> +#if defined(__x86_64__)
> +#include <asm/prctl.h>
> +#include <sys/prctl.h>
> +#endif
> +
>  static const unsigned long NEW_STACK_SIZE = 0x40000UL;
>  
>  /*
> @@ -62,6 +68,397 @@ static void completion_finish(struct completion *c)
>      pthread_mutex_unlock(&c->mu);
>  }
>  
> +struct tls_manager {
> +    void *tls_ptr;
> +    /* fetched is completed once tls_ptr has been set by the thread. */
> +    struct completion fetched;
> +    /*
> +     * spawned is completed by the user once the managed_tid
> +     * has been spawned.
> +     */
> +    struct completion spawned;
> +    /*
> +     * TID of the child whose memory is cleaned up upon death. This memory
> +     * location is used as part of a futex op, and is cleared by the kernel
> +     * since we specify CHILD_CLEARTID.
> +     */
> +    int managed_tid;
> +    /*
> +     * The value to be `free`'d up once the janitor is ready to clean up the
> +     * TLS section, and the managed tid has exited.
> +     */
> +    void *cleanup;
> +};
> +
> +/*
> + * tls_ptr fetches the TLS "pointer" for the current thread. This pointer
> + * should be whatever platform-specific address is used to represent the TLS
> + * base address.
> + */
> +static void *tls_ptr()

This and a number of other prototypes need void args to stop the
compiler complaining about missing prototypes.

> +{
> +    void *ptr;
> +#if defined(__x86_64__)
> +    /*
> +     * On x86_64, the TLS base is stored in the `fs` segment register, we can
> +     * fetch it with `ARCH_GET_FS`:
> +     */
> +    (void)syscall(SYS_arch_prctl, ARCH_GET_FS, (unsigned long) &ptr);
> +#else
> +    ptr = NULL;
> +#endif
> +    return ptr;
> +}
> +
> +/*
> + * clone_vm_supported returns true if clone_vm() is supported on this
> + * platform.
> + */
> +static bool clone_vm_supported()
> +{
> +#if defined(__x86_64__)
> +    return true;
> +#else
> +    return false;
> +#endif
> +}
<snip>

-- 
Alex Bennée


  parent reply	other threads:[~2020-06-16 16:09 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-06-12  1:46 [PATCH 0/5] linux-user: Support extended clone(CLONE_VM) Josh Kunz
2020-06-12  1:46 ` [PATCH 1/5] linux-user: Refactor do_fork to use new `qemu_clone` Josh Kunz
2020-06-16 15:51   ` Alex Bennée
2020-06-12  1:46 ` [PATCH 2/5] linux-user: Make fd_trans task-specific Josh Kunz
2020-06-12  1:46 ` [PATCH 3/5] linux-user: Make sigact_table part of the task state Josh Kunz
2020-06-12  1:46 ` [PATCH 4/5] linux-user: Support CLONE_VM and extended clone options Josh Kunz
2020-06-13  0:10   ` Josh Kunz
2020-06-16 16:08   ` Alex Bennée [this message]
2020-06-23  3:43     ` Josh Kunz
2020-06-23  8:21       ` Alex Bennée
     [not found]         ` <CADgy-2tB0Z133RB1i8OdnpKMD3xATL059dFoduHOjdim11G4-A@mail.gmail.com>
     [not found]           ` <87k0zw7opa.fsf@linaro.org>
2020-07-09  0:16             ` Josh Kunz
2020-07-16 10:41               ` Alex Bennée
2020-06-12  1:46 ` [PATCH 5/5] linux-user: Add PDEATHSIG test for clone process hierarchy Josh Kunz
2020-06-12  3:48 ` [PATCH 0/5] linux-user: Support extended clone(CLONE_VM) no-reply
2020-06-13 11:16 ` Alex Bennée
2020-06-16 16:02 ` Alex Bennée
2020-06-16 16:32   ` Peter Maydell
2020-06-16 23:38     ` Josh Kunz

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87h7vbyowf.fsf@linaro.org \
    --to=alex.bennee@linaro.org \
    --cc=jkz@google.com \
    --cc=laurent@vivier.eu \
    --cc=qemu-devel@nongnu.org \
    --cc=riku.voipio@iki.fi \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.