* [PATCH 0/5] linux-user: Support extended clone(CLONE_VM)
@ 2020-06-12  1:46 Josh Kunz
  2020-06-12  1:46 ` [PATCH 1/5] linux-user: Refactor do_fork to use new `qemu_clone` Josh Kunz
From: Josh Kunz @ 2020-06-12  1:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: riku.voipio, laurent, alex.bennee, Josh Kunz

This patch series implements extended support for the `clone` system
call. As best I can tell, any option combination including `CLONE_VM`
should be supported with the addition of this patch series. The
implementation is described in greater detail in the patches themselves.

Testing:

  * All targets built on x86_64.
  * `make check` and `make check-tcg` are passing. Additional tests have
    been added to `linux-test.c` to validate clone behavior.

Caveats:

  * This series touches, but does not fix, several bits of code that are
    racy (namely the sigact table and the fd trans table).
  * `exit_group` does not perform the appropriate cleanup for non-thread
    children created with `CLONE_VM`. CPUs for such children are never
    cleaned up. A correct implementation of `exit_group` is non-trivial,
    since it also needs to track and handle cleanup for threads in the
    clone'd child process. Also, I don't fully understand the interaction
    between QOM and linux-user: my naive implementation, based on the
    current implementation of `exit(2)`, was regularly crashing. If
    maintainers have suggestions for better ways to handle `exit_group`,
    they would be greatly appreciated.
  * execve does not clean up the CPUs of clone'd children, for the same
    reasons as `exit_group`.

Josh Kunz (5):
  linux-user: Refactor do_fork to use new `qemu_clone`
  linux-user: Make fd_trans task-specific.
  linux-user: Make sigact_table part of the task state.
  linux-user: Support CLONE_VM and extended clone options
  linux-user: Add PDEATHSIG test for clone process hierarchy.

 linux-user/Makefile.objs            |   2 +-
 linux-user/clone.c                  | 565 ++++++++++++++++++++++++++++
 linux-user/clone.h                  |  27 ++
 linux-user/fd-trans-tbl.c           |  13 +
 linux-user/fd-trans-type.h          |  17 +
 linux-user/fd-trans.c               |   3 -
 linux-user/fd-trans.h               |  75 ++--
 linux-user/main.c                   |   1 +
 linux-user/qemu.h                   |  49 +++
 linux-user/signal.c                 |  84 ++++-
 linux-user/syscall.c                | 452 ++++++++++++----------
 tests/tcg/multiarch/Makefile.target |   3 +
 tests/tcg/multiarch/linux-test.c    | 227 ++++++++++-
 13 files changed, 1264 insertions(+), 254 deletions(-)
 create mode 100644 linux-user/clone.c
 create mode 100644 linux-user/clone.h
 create mode 100644 linux-user/fd-trans-tbl.c
 create mode 100644 linux-user/fd-trans-type.h

-- 
2.27.0.290.gba653c62da-goog




* [PATCH 1/5] linux-user: Refactor do_fork to use new `qemu_clone`
  2020-06-12  1:46 [PATCH 0/5] linux-user: Support extended clone(CLONE_VM) Josh Kunz
@ 2020-06-12  1:46 ` Josh Kunz
  2020-06-16 15:51   ` Alex Bennée
  2020-06-12  1:46 ` [PATCH 2/5] linux-user: Make fd_trans task-specific Josh Kunz
From: Josh Kunz @ 2020-06-12  1:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: riku.voipio, laurent, alex.bennee, Josh Kunz

This is pre-work for adding full support for the `CLONE_VM` `clone`
flag. In a follow-up patch, we'll add support to `clone.c` for
`CLONE_VM`-type clones beyond threads. CLONE_VM support is more
complicated, so first we're splitting the existing clone mechanisms
(pthread_create and fork) into a separate file.

Signed-off-by: Josh Kunz <jkz@google.com>
---
 linux-user/Makefile.objs |   2 +-
 linux-user/clone.c       | 152 ++++++++++++++++
 linux-user/clone.h       |  27 +++
 linux-user/syscall.c     | 376 +++++++++++++++++++--------------------
 4 files changed, 365 insertions(+), 192 deletions(-)
 create mode 100644 linux-user/clone.c
 create mode 100644 linux-user/clone.h

diff --git a/linux-user/Makefile.objs b/linux-user/Makefile.objs
index 1940910a73..d6788f012c 100644
--- a/linux-user/Makefile.objs
+++ b/linux-user/Makefile.objs
@@ -1,7 +1,7 @@
 obj-y = main.o syscall.o strace.o mmap.o signal.o \
 	elfload.o linuxload.o uaccess.o uname.o \
 	safe-syscall.o $(TARGET_ABI_DIR)/signal.o \
-        $(TARGET_ABI_DIR)/cpu_loop.o exit.o fd-trans.o
+        $(TARGET_ABI_DIR)/cpu_loop.o exit.o fd-trans.o clone.o
 
 obj-$(TARGET_HAS_BFLT) += flatload.o
 obj-$(TARGET_I386) += vm86.o
diff --git a/linux-user/clone.c b/linux-user/clone.c
new file mode 100644
index 0000000000..f02ae8c464
--- /dev/null
+++ b/linux-user/clone.c
@@ -0,0 +1,152 @@
+#include "qemu/osdep.h"
+#include "qemu.h"
+#include "clone.h"
+#include <pthread.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include <stdbool.h>
+#include <assert.h>
+
+static const unsigned long NEW_STACK_SIZE = 0x40000UL;
+
+/*
+ * A completion tracks an event that can be completed. It's based on the
+ * kernel concept with the same name, but implemented with userspace locks.
+ */
+struct completion {
+    /* done is set once this completion has been completed. */
+    bool done;
+    /* mu synchronizes access to this completion. */
+    pthread_mutex_t mu;
+    /* cond is used to broadcast completion status to awaiting threads. */
+    pthread_cond_t cond;
+};
+
+static void completion_init(struct completion *c)
+{
+    c->done = false;
+    pthread_mutex_init(&c->mu, NULL);
+    pthread_cond_init(&c->cond, NULL);
+}
+
+/*
+ * Block until the given completion finishes. Returns immediately if the
+ * completion has already finished.
+ */
+static void completion_await(struct completion *c)
+{
+    pthread_mutex_lock(&c->mu);
+    if (c->done) {
+        pthread_mutex_unlock(&c->mu);
+        return;
+    }
+    pthread_cond_wait(&c->cond, &c->mu);
+    assert(c->done && "returned from cond wait without being marked as done");
+    pthread_mutex_unlock(&c->mu);
+}
+
+/*
+ * Finish the completion. Unblocks all awaiters.
+ */
+static void completion_finish(struct completion *c)
+{
+    pthread_mutex_lock(&c->mu);
+    assert(!c->done && "trying to finish an already finished completion");
+    c->done = true;
+    pthread_cond_broadcast(&c->cond);
+    pthread_mutex_unlock(&c->mu);
+}
+
+struct clone_thread_info {
+    struct completion running;
+    int tid;
+    int (*callback)(void *);
+    void *payload;
+};
+
+static void *clone_thread_run(void *raw_info)
+{
+    struct clone_thread_info *info = (struct clone_thread_info *) raw_info;
+    info->tid = syscall(SYS_gettid);
+
+    /*
+     * Save out callback/payload since lifetime of info is only guaranteed
+     * until we finish the completion.
+     */
+    int (*callback)(void *) = info->callback;
+    void *payload = info->payload;
+    completion_finish(&info->running);
+
+    _exit(callback(payload));
+}
+
+static int clone_thread(int flags, int (*callback)(void *), void *payload)
+{
+    struct clone_thread_info info;
+    pthread_attr_t attr;
+    int ret;
+    pthread_t thread_unused;
+
+    memset(&info, 0, sizeof(info));
+
+    completion_init(&info.running);
+    info.callback = callback;
+    info.payload = payload;
+
+    (void)pthread_attr_init(&attr);
+    (void)pthread_attr_setstacksize(&attr, NEW_STACK_SIZE);
+    (void)pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
+
+    ret = pthread_create(&thread_unused, &attr, clone_thread_run, (void *) &info);
+    /* pthread_create returns errors directly, instead of via errno. */
+    if (ret != 0) {
+        errno = ret;
+        ret = -1;
+    } else {
+        completion_await(&info.running);
+        ret = info.tid;
+    }
+
+    pthread_attr_destroy(&attr);
+    return ret;
+}
+
+int qemu_clone(int flags, int (*callback)(void *), void *payload)
+{
+    int ret;
+
+    if (clone_flags_are_thread(flags)) {
+        /*
+         * The flags match those used by pthread_create, so we can use
+         * pthread_create directly. This is an optimization.
+         */
+        return clone_thread(flags, callback, payload);
+    }
+
+    if (clone_flags_are_fork(flags)) {
+        /*
+         * Special case a true `fork` clone call. This is so we can take
+         * advantage of special pthread_atfork handlers in libraries we
+         * depend on (e.g., glibc). Without this, existing users of `fork`
+         * in multi-threaded environments will likely get new flaky
+         * deadlocks.
+         */
+        fork_start();
+        ret = fork();
+        if (ret == 0) {
+            fork_end(1);
+            _exit(callback(payload));
+        }
+        fork_end(0);
+        return ret;
+    }
+
+    /* !fork && !thread */
+    errno = EINVAL;
+    return -1;
+}
diff --git a/linux-user/clone.h b/linux-user/clone.h
new file mode 100644
index 0000000000..34ae9b3780
--- /dev/null
+++ b/linux-user/clone.h
@@ -0,0 +1,27 @@
+#ifndef CLONE_H
+#define CLONE_H
+
+/*
+ * qemu_clone executes the given `callback`, with the given payload as the
+ * first argument, in a new process created with the given flags. Some clone
+ * flags, such as *SETTLS and *CLEARTID, are not supported. On success, the
+ * child's thread ID is returned; on failure, -1 is returned and errno is set.
+ */
+int qemu_clone(int flags, int (*callback)(void *), void *payload);
+
+/* Returns true if the given clone flags can be emulated with libc fork. */
+static inline bool clone_flags_are_fork(unsigned int flags)
+{
+    return flags == SIGCHLD;
+}
+
+/* Returns true if the given clone flags can be emulated with pthread_create. */
+static inline bool clone_flags_are_thread(unsigned int flags)
+{
+    return flags == (
+        CLONE_VM | CLONE_FS | CLONE_FILES |
+        CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM
+    );
+}
+
+#endif /* CLONE_H */
diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index 97de9fb5c9..7ce021cea2 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -122,6 +122,7 @@
 #include "qapi/error.h"
 #include "fd-trans.h"
 #include "tcg/tcg.h"
+#include "clone.h"
 
 #ifndef CLONE_IO
 #define CLONE_IO                0x80000000      /* Clone io context */
@@ -135,12 +136,6 @@
  *  * flags we can implement within QEMU itself
  *  * flags we can't support and will return an error for
  */
-/* For thread creation, all these flags must be present; for
- * fork, none must be present.
- */
-#define CLONE_THREAD_FLAGS                              \
-    (CLONE_VM | CLONE_FS | CLONE_FILES |                \
-     CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM)
 
 /* These flags are ignored:
  * CLONE_DETACHED is now ignored by the kernel;
@@ -150,30 +145,10 @@
     (CLONE_DETACHED | CLONE_IO)
 
 /* Flags for fork which we can implement within QEMU itself */
-#define CLONE_OPTIONAL_FORK_FLAGS               \
+#define CLONE_EMULATED_FLAGS               \
     (CLONE_SETTLS | CLONE_PARENT_SETTID |       \
      CLONE_CHILD_CLEARTID | CLONE_CHILD_SETTID)
 
-/* Flags for thread creation which we can implement within QEMU itself */
-#define CLONE_OPTIONAL_THREAD_FLAGS                             \
-    (CLONE_SETTLS | CLONE_PARENT_SETTID |                       \
-     CLONE_CHILD_CLEARTID | CLONE_CHILD_SETTID | CLONE_PARENT)
-
-#define CLONE_INVALID_FORK_FLAGS                                        \
-    (~(CSIGNAL | CLONE_OPTIONAL_FORK_FLAGS | CLONE_IGNORED_FLAGS))
-
-#define CLONE_INVALID_THREAD_FLAGS                                      \
-    (~(CSIGNAL | CLONE_THREAD_FLAGS | CLONE_OPTIONAL_THREAD_FLAGS |     \
-       CLONE_IGNORED_FLAGS))
-
-/* CLONE_VFORK is special cased early in do_fork(). The other flag bits
- * have almost all been allocated. We cannot support any of
- * CLONE_NEWNS, CLONE_NEWCGROUP, CLONE_NEWUTS, CLONE_NEWIPC,
- * CLONE_NEWUSER, CLONE_NEWPID, CLONE_NEWNET, CLONE_PTRACE, CLONE_UNTRACED.
- * The checks against the invalid thread masks above will catch these.
- * (The one remaining unallocated bit is 0x1000 which used to be CLONE_PID.)
- */
-
 /* Define DEBUG_ERESTARTSYS to force every syscall to be restarted
  * once. This exercises the codepaths for restart.
  */
@@ -1104,7 +1079,7 @@ static inline rlim_t target_to_host_rlim(abi_ulong target_rlim)
 {
     abi_ulong target_rlim_swap;
     rlim_t result;
-    
+
     target_rlim_swap = tswapal(target_rlim);
     if (target_rlim_swap == TARGET_RLIM_INFINITY)
         return RLIM_INFINITY;
@@ -1112,7 +1087,7 @@ static inline rlim_t target_to_host_rlim(abi_ulong target_rlim)
     result = target_rlim_swap;
     if (target_rlim_swap != (rlim_t)result)
         return RLIM_INFINITY;
-    
+
     return result;
 }
 #endif
@@ -1122,13 +1097,13 @@ static inline abi_ulong host_to_target_rlim(rlim_t rlim)
 {
     abi_ulong target_rlim_swap;
     abi_ulong result;
-    
+
     if (rlim == RLIM_INFINITY || rlim != (abi_long)rlim)
         target_rlim_swap = TARGET_RLIM_INFINITY;
     else
         target_rlim_swap = rlim;
     result = tswapal(target_rlim_swap);
-    
+
     return result;
 }
 #endif
@@ -1615,10 +1590,11 @@ static inline abi_long target_to_host_cmsg(struct msghdr *msgh,
     abi_ulong target_cmsg_addr;
     struct target_cmsghdr *target_cmsg, *target_cmsg_start;
     socklen_t space = 0;
-    
+
     msg_controllen = tswapal(target_msgh->msg_controllen);
-    if (msg_controllen < sizeof (struct target_cmsghdr)) 
+    if (msg_controllen < sizeof(struct target_cmsghdr)) {
         goto the_end;
+    }
     target_cmsg_addr = tswapal(target_msgh->msg_control);
     target_cmsg = lock_user(VERIFY_READ, target_cmsg_addr, msg_controllen, 1);
     target_cmsg_start = target_cmsg;
@@ -1703,8 +1679,9 @@ static inline abi_long host_to_target_cmsg(struct target_msghdr *target_msgh,
     socklen_t space = 0;
 
     msg_controllen = tswapal(target_msgh->msg_controllen);
-    if (msg_controllen < sizeof (struct target_cmsghdr)) 
+    if (msg_controllen < sizeof(struct target_cmsghdr)) {
         goto the_end;
+    }
     target_cmsg_addr = tswapal(target_msgh->msg_control);
     target_cmsg = lock_user(VERIFY_WRITE, target_cmsg_addr, msg_controllen, 0);
     target_cmsg_start = target_cmsg;
@@ -5750,9 +5727,10 @@ abi_long do_set_thread_area(CPUX86State *env, abi_ulong ptr)
     }
     unlock_user_struct(target_ldt_info, ptr, 1);
 
-    if (ldt_info.entry_number < TARGET_GDT_ENTRY_TLS_MIN || 
-        ldt_info.entry_number > TARGET_GDT_ENTRY_TLS_MAX)
-           return -TARGET_EINVAL;
+    if (ldt_info.entry_number < TARGET_GDT_ENTRY_TLS_MIN ||
+        ldt_info.entry_number > TARGET_GDT_ENTRY_TLS_MAX) {
+        return -TARGET_EINVAL;
+    }
     seg_32bit = ldt_info.flags & 1;
     contents = (ldt_info.flags >> 1) & 3;
     read_exec_only = (ldt_info.flags >> 3) & 1;
@@ -5828,7 +5806,7 @@ static abi_long do_get_thread_area(CPUX86State *env, abi_ulong ptr)
     lp = (uint32_t *)(gdt_table + idx);
     entry_1 = tswap32(lp[0]);
     entry_2 = tswap32(lp[1]);
-    
+
     read_exec_only = ((entry_2 >> 9) & 1) ^ 1;
     contents = (entry_2 >> 10) & 3;
     seg_not_present = ((entry_2 >> 15) & 1) ^ 1;
@@ -5844,8 +5822,8 @@ static abi_long do_get_thread_area(CPUX86State *env, abi_ulong ptr)
         (read_exec_only << 3) | (limit_in_pages << 4) |
         (seg_not_present << 5) | (useable << 6) | (lm << 7);
     limit = (entry_1 & 0xffff) | (entry_2  & 0xf0000);
-    base_addr = (entry_1 >> 16) | 
-        (entry_2 & 0xff000000) | 
+    base_addr = (entry_1 >> 16) |
+        (entry_2 & 0xff000000) |
         ((entry_2 & 0xff) << 16);
     target_ldt_info->base_addr = tswapal(base_addr);
     target_ldt_info->limit = tswap32(limit);
@@ -5895,53 +5873,71 @@ abi_long do_arch_prctl(CPUX86State *env, int code, abi_ulong addr)
 
 #endif /* defined(TARGET_I386) */
 
-#define NEW_STACK_SIZE 0x40000
-
-
 static pthread_mutex_t clone_lock = PTHREAD_MUTEX_INITIALIZER;
 typedef struct {
-    CPUArchState *env;
+    /* Used to synchronize thread/process creation between parent and child. */
     pthread_mutex_t mutex;
     pthread_cond_t cond;
-    pthread_t thread;
-    uint32_t tid;
+    /*
+     * Guest pointers for implementing CLONE_PARENT_SETTID
+     * and CLONE_CHILD_SETTID.
+     */
     abi_ulong child_tidptr;
     abi_ulong parent_tidptr;
-    sigset_t sigmask;
-} new_thread_info;
+    struct {
+        sigset_t sigmask;
+        CPUArchState *env;
+        bool register_thread;
+        bool signal_setup;
+    } child;
+} clone_info;
 
-static void *clone_func(void *arg)
+static int clone_run(void *arg)
 {
-    new_thread_info *info = arg;
+    clone_info *info = (clone_info *) arg;
     CPUArchState *env;
     CPUState *cpu;
     TaskState *ts;
+    uint32_t tid;
 
-    rcu_register_thread();
-    tcg_register_thread();
-    env = info->env;
+    if (info->child.register_thread) {
+        rcu_register_thread();
+        tcg_register_thread();
+    }
+
+    env = info->child.env;
     cpu = env_cpu(env);
     thread_cpu = cpu;
     ts = (TaskState *)cpu->opaque;
-    info->tid = sys_gettid();
+    tid = sys_gettid();
     task_settid(ts);
-    if (info->child_tidptr)
-        put_user_u32(info->tid, info->child_tidptr);
-    if (info->parent_tidptr)
-        put_user_u32(info->tid, info->parent_tidptr);
+
     qemu_guest_random_seed_thread_part2(cpu->random_seed);
-    /* Enable signals.  */
-    sigprocmask(SIG_SETMASK, &info->sigmask, NULL);
-    /* Signal to the parent that we're ready.  */
-    pthread_mutex_lock(&info->mutex);
-    pthread_cond_broadcast(&info->cond);
-    pthread_mutex_unlock(&info->mutex);
-    /* Wait until the parent has finished initializing the tls state.  */
-    pthread_mutex_lock(&clone_lock);
-    pthread_mutex_unlock(&clone_lock);
+
+    if (info->parent_tidptr) {
+        /*
+         * Even when memory is not shared, parent_tidptr is set before the
+         * process copy, so we need to set it in the child.
+         */
+        put_user_u32(tid, info->parent_tidptr);
+    }
+
+    if (info->child_tidptr) {
+        put_user_u32(tid, info->child_tidptr);
+    }
+
+    /* Enable signals. */
+    sigprocmask(SIG_SETMASK, &info->child.sigmask, NULL);
+
+    if (info->child.signal_setup) {
+        pthread_mutex_lock(&info->mutex);
+        pthread_cond_broadcast(&info->cond);
+        pthread_mutex_unlock(&info->mutex);
+    }
+
     cpu_loop(env);
     /* never exits */
-    return NULL;
+    _exit(1);  /* avoid compiler warning. */
 }
 
 /* do_fork() Must return host values and target errnos (unlike most
@@ -5951,139 +5947,131 @@ static int do_fork(CPUArchState *env, unsigned int flags, abi_ulong newsp,
                    abi_ulong child_tidptr)
 {
     CPUState *cpu = env_cpu(env);
-    int ret;
+    int proc_flags, host_sig, ret;
     TaskState *ts;
     CPUState *new_cpu;
-    CPUArchState *new_env;
-    sigset_t sigmask;
+    sigset_t block_sigmask;
+    sigset_t orig_sigmask;
+    clone_info info;
+    TaskState *parent_ts = (TaskState *)cpu->opaque;
 
-    flags &= ~CLONE_IGNORED_FLAGS;
+    memset(&info, 0, sizeof(info));
+
+    /*
+     * When cloning the actual subprocess, we don't need to worry about any
+     * flags that can be ignored, or emulated in QEMU. proc_flags holds only
+     * the flags that need to be passed to `clone` itself.
+     */
+    proc_flags = flags & ~(CLONE_EMULATED_FLAGS | CLONE_IGNORED_FLAGS);
+
+    /*
+     * The exit signal is included in the flags. That signal needs to be mapped
+     * to the appropriate host signal, and we need to check if that signal is
+     * supported.
+     */
+    host_sig = target_to_host_signal(proc_flags & CSIGNAL);
+    if (host_sig > SIGRTMAX) {
+        qemu_log_mask(LOG_UNIMP,
+                      "guest signal %d not supported for exit_signal\n",
+                      proc_flags & CSIGNAL);
+        return -TARGET_EINVAL;
+    }
+    proc_flags = (proc_flags & ~CSIGNAL) | host_sig;
 
     /* Emulate vfork() with fork() */
-    if (flags & CLONE_VFORK)
-        flags &= ~(CLONE_VFORK | CLONE_VM);
+    if (proc_flags & CLONE_VFORK) {
+        proc_flags &= ~(CLONE_VFORK | CLONE_VM);
+    }
 
-    if (flags & CLONE_VM) {
-        TaskState *parent_ts = (TaskState *)cpu->opaque;
-        new_thread_info info;
-        pthread_attr_t attr;
+    if (!clone_flags_are_fork(proc_flags) &&
+        !clone_flags_are_thread(proc_flags)) {
+        qemu_log_mask(LOG_UNIMP, "unsupported clone flags\n");
+        return -TARGET_EINVAL;
+    }
 
-        if (((flags & CLONE_THREAD_FLAGS) != CLONE_THREAD_FLAGS) ||
-            (flags & CLONE_INVALID_THREAD_FLAGS)) {
-            return -TARGET_EINVAL;
-        }
+    pthread_mutex_init(&info.mutex, NULL);
+    pthread_mutex_lock(&info.mutex);
+    pthread_cond_init(&info.cond, NULL);
 
-        ts = g_new0(TaskState, 1);
-        init_task_state(ts);
+    ts = g_new0(TaskState, 1);
+    init_task_state(ts);
 
-        /* Grab a mutex so that thread setup appears atomic.  */
-        pthread_mutex_lock(&clone_lock);
+    /* Guard CPU copy. It is not thread-safe. */
+    pthread_mutex_lock(&clone_lock);
+    info.child.env = cpu_copy(env);
+    pthread_mutex_unlock(&clone_lock);
+    /* Init regs that differ from the parent.  */
+    cpu_clone_regs_child(info.child.env, newsp, flags);
 
-        /* we create a new CPU instance. */
-        new_env = cpu_copy(env);
-        /* Init regs that differ from the parent.  */
-        cpu_clone_regs_child(new_env, newsp, flags);
+    if (flags & CLONE_SETTLS) {
+        cpu_set_tls(info.child.env, newtls);
+    }
+
+    new_cpu = env_cpu(info.child.env);
+    new_cpu->opaque = ts;
+    ts->bprm = parent_ts->bprm;
+    ts->info = parent_ts->info;
+    ts->signal_mask = parent_ts->signal_mask;
+
+    if (flags & CLONE_CHILD_CLEARTID) {
+        ts->child_tidptr = child_tidptr;
+    }
+
+    if (flags & CLONE_CHILD_SETTID) {
+        info.child_tidptr = child_tidptr;
+    }
+    if (flags & CLONE_PARENT_SETTID) {
+        info.parent_tidptr = parent_tidptr;
+    }
+
+    /*
+     * If the child process is going to share memory, and this is our first
+     * such child process or thread, we need to ensure we generate code for
+     * parallel execution and flush old translations.
+     */
+    if (!parallel_cpus && (proc_flags & CLONE_VM)) {
+        parallel_cpus = true;
+        tb_flush(cpu);
+    }
+
+    if (proc_flags & CLONE_VM) {
+        info.child.register_thread = true;
+        info.child.signal_setup = true;
+    }
+
+    /*
+     * It is not safe to deliver signals until the child has finished
+     * initializing, so temporarily block all signals.
+     */
+    sigfillset(&block_sigmask);
+    sigprocmask(SIG_BLOCK, &block_sigmask, &orig_sigmask);
+    info.child.sigmask = orig_sigmask;
+
+    ret = get_errno(qemu_clone(proc_flags, clone_run, (void *) &info));
+
+    if (ret >= 0 && (proc_flags & CLONE_VM)) {
+        /*
+         * Wait for the child to finish setup if the child is running in the
+         * same VM.
+         */
+        pthread_cond_wait(&info.cond, &info.mutex);
+    }
+
+    sigprocmask(SIG_SETMASK, &orig_sigmask, NULL);
+
+    pthread_mutex_unlock(&info.mutex);
+    pthread_cond_destroy(&info.cond);
+    pthread_mutex_destroy(&info.mutex);
+
+    if (ret >= 0 && !(proc_flags & CLONE_VM)) {
+        /*
+         * If !CLONE_VM, then we need to set parent_tidptr, since the child
+     * won't set it for us. It should always be safe to set it here anyway.
+         */
+        put_user_u32(ret, info.parent_tidptr);
         cpu_clone_regs_parent(env, flags);
-        new_cpu = env_cpu(new_env);
-        new_cpu->opaque = ts;
-        ts->bprm = parent_ts->bprm;
-        ts->info = parent_ts->info;
-        ts->signal_mask = parent_ts->signal_mask;
-
-        if (flags & CLONE_CHILD_CLEARTID) {
-            ts->child_tidptr = child_tidptr;
-        }
-
-        if (flags & CLONE_SETTLS) {
-            cpu_set_tls (new_env, newtls);
-        }
-
-        memset(&info, 0, sizeof(info));
-        pthread_mutex_init(&info.mutex, NULL);
-        pthread_mutex_lock(&info.mutex);
-        pthread_cond_init(&info.cond, NULL);
-        info.env = new_env;
-        if (flags & CLONE_CHILD_SETTID) {
-            info.child_tidptr = child_tidptr;
-        }
-        if (flags & CLONE_PARENT_SETTID) {
-            info.parent_tidptr = parent_tidptr;
-        }
-
-        ret = pthread_attr_init(&attr);
-        ret = pthread_attr_setstacksize(&attr, NEW_STACK_SIZE);
-        ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
-        /* It is not safe to deliver signals until the child has finished
-           initializing, so temporarily block all signals.  */
-        sigfillset(&sigmask);
-        sigprocmask(SIG_BLOCK, &sigmask, &info.sigmask);
-        cpu->random_seed = qemu_guest_random_seed_thread_part1();
-
-        /* If this is our first additional thread, we need to ensure we
-         * generate code for parallel execution and flush old translations.
-         */
-        if (!parallel_cpus) {
-            parallel_cpus = true;
-            tb_flush(cpu);
-        }
-
-        ret = pthread_create(&info.thread, &attr, clone_func, &info);
-        /* TODO: Free new CPU state if thread creation failed.  */
-
-        sigprocmask(SIG_SETMASK, &info.sigmask, NULL);
-        pthread_attr_destroy(&attr);
-        if (ret == 0) {
-            /* Wait for the child to initialize.  */
-            pthread_cond_wait(&info.cond, &info.mutex);
-            ret = info.tid;
-        } else {
-            ret = -1;
-        }
-        pthread_mutex_unlock(&info.mutex);
-        pthread_cond_destroy(&info.cond);
-        pthread_mutex_destroy(&info.mutex);
-        pthread_mutex_unlock(&clone_lock);
-    } else {
-        /* if no CLONE_VM, we consider it is a fork */
-        if (flags & CLONE_INVALID_FORK_FLAGS) {
-            return -TARGET_EINVAL;
-        }
-
-        /* We can't support custom termination signals */
-        if ((flags & CSIGNAL) != TARGET_SIGCHLD) {
-            return -TARGET_EINVAL;
-        }
-
-        if (block_signals()) {
-            return -TARGET_ERESTARTSYS;
-        }
-
-        fork_start();
-        ret = fork();
-        if (ret == 0) {
-            /* Child Process.  */
-            cpu_clone_regs_child(env, newsp, flags);
-            fork_end(1);
-            /* There is a race condition here.  The parent process could
-               theoretically read the TID in the child process before the child
-               tid is set.  This would require using either ptrace
-               (not implemented) or having *_tidptr to point at a shared memory
-               mapping.  We can't repeat the spinlock hack used above because
-               the child process gets its own copy of the lock.  */
-            if (flags & CLONE_CHILD_SETTID)
-                put_user_u32(sys_gettid(), child_tidptr);
-            if (flags & CLONE_PARENT_SETTID)
-                put_user_u32(sys_gettid(), parent_tidptr);
-            ts = (TaskState *)cpu->opaque;
-            if (flags & CLONE_SETTLS)
-                cpu_set_tls (env, newtls);
-            if (flags & CLONE_CHILD_CLEARTID)
-                ts->child_tidptr = child_tidptr;
-        } else {
-            cpu_clone_regs_parent(env, flags);
-            fork_end(0);
-        }
     }
+
     return ret;
 }
 
@@ -7644,6 +7632,7 @@ static abi_long do_syscall1(void *cpu_env, int num, abi_long arg1,
 
     switch(num) {
     case TARGET_NR_exit:
+    {
         /* In old applications this may be used to implement _exit(2).
            However in threaded applictions it is used for thread termination,
            and _exit_group is used for application termination.
@@ -7673,6 +7662,7 @@ static abi_long do_syscall1(void *cpu_env, int num, abi_long arg1,
                 do_sys_futex(g2h(ts->child_tidptr), FUTEX_WAKE, INT_MAX,
                           NULL, NULL, 0);
             }
+
             thread_cpu = NULL;
             g_free(ts);
             rcu_unregister_thread();
@@ -7683,6 +7673,7 @@ static abi_long do_syscall1(void *cpu_env, int num, abi_long arg1,
         preexit_cleanup(cpu_env, arg1);
         _exit(arg1);
         return 0; /* avoid warning */
+    }
     case TARGET_NR_read:
         if (arg2 == 0 && arg3 == 0) {
             return get_errno(safe_read(arg1, 0, 0));
@@ -9679,9 +9670,10 @@ static abi_long do_syscall1(void *cpu_env, int num, abi_long arg1,
         return ret;
 #ifdef __NR_exit_group
         /* new thread calls */
-    case TARGET_NR_exit_group:
+    case TARGET_NR_exit_group: {
         preexit_cleanup(cpu_env, arg1);
         return get_errno(exit_group(arg1));
+    }
 #endif
     case TARGET_NR_setdomainname:
         if (!(p = lock_user_string(arg1)))
@@ -10873,8 +10865,10 @@ static abi_long do_syscall1(void *cpu_env, int num, abi_long arg1,
         return get_errno(fchown(arg1, low2highuid(arg2), low2highgid(arg3)));
 #if defined(TARGET_NR_fchownat)
     case TARGET_NR_fchownat:
-        if (!(p = lock_user_string(arg2))) 
+        p = lock_user_string(arg2);
+        if (!p) {
             return -TARGET_EFAULT;
+        }
         ret = get_errno(fchownat(arg1, p, low2highuid(arg3),
                                  low2highgid(arg4), arg5));
         unlock_user(p, arg2, 0);
-- 
2.27.0.290.gba653c62da-goog




* [PATCH 2/5] linux-user: Make fd_trans task-specific.
  2020-06-12  1:46 [PATCH 0/5] linux-user: Support extended clone(CLONE_VM) Josh Kunz
  2020-06-12  1:46 ` [PATCH 1/5] linux-user: Refactor do_fork to use new `qemu_clone` Josh Kunz
@ 2020-06-12  1:46 ` Josh Kunz
  2020-06-12  1:46 ` [PATCH 3/5] linux-user: Make sigact_table part of the task state Josh Kunz
From: Josh Kunz @ 2020-06-12  1:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: riku.voipio, laurent, alex.bennee, Josh Kunz

The file-descriptor translation subsystem used by QEMU uses some global
variables to track file descriptors and their associated state. In the
future (when clone is implemented) it may be possible to have two
processes that share memory but have distinct sets of file descriptors.
This change associates the file-descriptor translation table with the
per-task TaskState structure. Since many tasks share file descriptors
(e.g., threads), a structure similar to the existing one is used: each
task holds a pointer to a global table. That table can be shared by
multiple tasks, or swapped out if a task needs a different FD table.

Signed-off-by: Josh Kunz <jkz@google.com>
---
 linux-user/Makefile.objs   |  2 +-
 linux-user/fd-trans-tbl.c  | 13 +++++++
 linux-user/fd-trans-type.h | 17 +++++++++
 linux-user/fd-trans.c      |  3 --
 linux-user/fd-trans.h      | 75 ++++++++++++++++++++++++--------------
 linux-user/main.c          |  1 +
 linux-user/qemu.h          | 24 ++++++++++++
 linux-user/syscall.c       | 12 ++++++
 8 files changed, 115 insertions(+), 32 deletions(-)
 create mode 100644 linux-user/fd-trans-tbl.c
 create mode 100644 linux-user/fd-trans-type.h

diff --git a/linux-user/Makefile.objs b/linux-user/Makefile.objs
index d6788f012c..d19102e244 100644
--- a/linux-user/Makefile.objs
+++ b/linux-user/Makefile.objs
@@ -1,7 +1,7 @@
 obj-y = main.o syscall.o strace.o mmap.o signal.o \
 	elfload.o linuxload.o uaccess.o uname.o \
 	safe-syscall.o $(TARGET_ABI_DIR)/signal.o \
-        $(TARGET_ABI_DIR)/cpu_loop.o exit.o fd-trans.o clone.o
+        $(TARGET_ABI_DIR)/cpu_loop.o exit.o fd-trans.o clone.o fd-trans-tbl.o
 
 obj-$(TARGET_HAS_BFLT) += flatload.o
 obj-$(TARGET_I386) += vm86.o
diff --git a/linux-user/fd-trans-tbl.c b/linux-user/fd-trans-tbl.c
new file mode 100644
index 0000000000..6afe91096e
--- /dev/null
+++ b/linux-user/fd-trans-tbl.c
@@ -0,0 +1,13 @@
+#include "qemu/osdep.h"
+#include "fd-trans.h"
+
+struct fd_trans_table *fd_trans_table_clone(struct fd_trans_table *tbl)
+{
+    struct fd_trans_table *new_tbl = g_new0(struct fd_trans_table, 1);
+    new_tbl->fd_max = tbl->fd_max;
+    new_tbl->entries = g_new0(TargetFdTrans*, tbl->fd_max);
+    memcpy(new_tbl->entries,
+           tbl->entries,
+           sizeof(*new_tbl->entries) * tbl->fd_max);
+    return new_tbl;
+}
diff --git a/linux-user/fd-trans-type.h b/linux-user/fd-trans-type.h
new file mode 100644
index 0000000000..06c4427642
--- /dev/null
+++ b/linux-user/fd-trans-type.h
@@ -0,0 +1,17 @@
+#ifndef FD_TRANS_TYPE_H
+#define FD_TRANS_TYPE_H
+
+/*
+ * Break out the TargetFdTrans typedefs into a separate file, to break
+ * the circular dependency between qemu.h and fd-trans.h.
+ */
+
+typedef abi_long (*TargetFdDataFunc)(void *, size_t);
+typedef abi_long (*TargetFdAddrFunc)(void *, abi_ulong, socklen_t);
+typedef struct TargetFdTrans {
+    TargetFdDataFunc host_to_target_data;
+    TargetFdDataFunc target_to_host_data;
+    TargetFdAddrFunc target_to_host_addr;
+} TargetFdTrans;
+
+#endif /* FD_TRANS_TYPE_H */
diff --git a/linux-user/fd-trans.c b/linux-user/fd-trans.c
index c0687c52e6..c552034a5e 100644
--- a/linux-user/fd-trans.c
+++ b/linux-user/fd-trans.c
@@ -261,9 +261,6 @@ enum {
     QEMU___RTA_MAX
 };
 
-TargetFdTrans **target_fd_trans;
-unsigned int target_fd_max;
-
 static void tswap_nlmsghdr(struct nlmsghdr *nlh)
 {
     nlh->nlmsg_len = tswap32(nlh->nlmsg_len);
diff --git a/linux-user/fd-trans.h b/linux-user/fd-trans.h
index a3fcdaabc7..07ae04dad7 100644
--- a/linux-user/fd-trans.h
+++ b/linux-user/fd-trans.h
@@ -16,38 +16,45 @@
 #ifndef FD_TRANS_H
 #define FD_TRANS_H
 
-typedef abi_long (*TargetFdDataFunc)(void *, size_t);
-typedef abi_long (*TargetFdAddrFunc)(void *, abi_ulong, socklen_t);
-typedef struct TargetFdTrans {
-    TargetFdDataFunc host_to_target_data;
-    TargetFdDataFunc target_to_host_data;
-    TargetFdAddrFunc target_to_host_addr;
-} TargetFdTrans;
+#include "qemu.h"
+#include "fd-trans-type.h"
 
-extern TargetFdTrans **target_fd_trans;
-
-extern unsigned int target_fd_max;
+/*
+ * Return a duplicate of the given fd_trans_table. This function always
+ * succeeds. Ownership of the pointed-to table is yielded to the caller. The
+ * caller is responsible for freeing the table when it is no longer in-use.
+ */
+struct fd_trans_table *fd_trans_table_clone(struct fd_trans_table *tbl);
 
 static inline TargetFdDataFunc fd_trans_target_to_host_data(int fd)
 {
-    if (fd >= 0 && fd < target_fd_max && target_fd_trans[fd]) {
-        return target_fd_trans[fd]->target_to_host_data;
+    TaskState *ts = (TaskState *)thread_cpu->opaque;
+    struct fd_trans_table *tbl = ts->fd_trans_tbl;
+
+    if (fd >= 0 && fd < tbl->fd_max && tbl->entries[fd]) {
+        return tbl->entries[fd]->target_to_host_data;
     }
     return NULL;
 }
 
 static inline TargetFdDataFunc fd_trans_host_to_target_data(int fd)
 {
-    if (fd >= 0 && fd < target_fd_max && target_fd_trans[fd]) {
-        return target_fd_trans[fd]->host_to_target_data;
+    TaskState *ts = (TaskState *)thread_cpu->opaque;
+    struct fd_trans_table *tbl = ts->fd_trans_tbl;
+
+    if (fd >= 0 && fd < tbl->fd_max && tbl->entries[fd]) {
+        return tbl->entries[fd]->host_to_target_data;
     }
     return NULL;
 }
 
 static inline TargetFdAddrFunc fd_trans_target_to_host_addr(int fd)
 {
-    if (fd >= 0 && fd < target_fd_max && target_fd_trans[fd]) {
-        return target_fd_trans[fd]->target_to_host_addr;
+    TaskState *ts = (TaskState *)thread_cpu->opaque;
+    struct fd_trans_table *tbl = ts->fd_trans_tbl;
+
+    if (fd >= 0 && fd < tbl->fd_max && tbl->entries[fd]) {
+        return tbl->entries[fd]->target_to_host_addr;
     }
     return NULL;
 }
@@ -56,29 +63,41 @@ static inline void fd_trans_register(int fd, TargetFdTrans *trans)
 {
     unsigned int oldmax;
 
-    if (fd >= target_fd_max) {
-        oldmax = target_fd_max;
-        target_fd_max = ((fd >> 6) + 1) << 6; /* by slice of 64 entries */
-        target_fd_trans = g_renew(TargetFdTrans *,
-                                  target_fd_trans, target_fd_max);
-        memset((void *)(target_fd_trans + oldmax), 0,
-               (target_fd_max - oldmax) * sizeof(TargetFdTrans *));
+    TaskState *ts = (TaskState *)thread_cpu->opaque;
+    struct fd_trans_table *tbl = ts->fd_trans_tbl;
+
+    /*
+     * TODO: This is racy. Updates to tbl->entries should be guarded by
+     * a lock.
+     */
+    if (fd >= tbl->fd_max) {
+        oldmax = tbl->fd_max;
+        tbl->fd_max = ((fd >> 6) + 1) << 6; /* by slice of 64 entries */
+        tbl->entries = g_renew(TargetFdTrans *, tbl->entries, tbl->fd_max);
+        memset((void *)(tbl->entries + oldmax), 0,
+               (tbl->fd_max - oldmax) * sizeof(TargetFdTrans *));
     }
-    target_fd_trans[fd] = trans;
+    tbl->entries[fd] = trans;
 }
 
 static inline void fd_trans_unregister(int fd)
 {
-    if (fd >= 0 && fd < target_fd_max) {
-        target_fd_trans[fd] = NULL;
+    TaskState *ts = (TaskState *)thread_cpu->opaque;
+    struct fd_trans_table *tbl = ts->fd_trans_tbl;
+
+    if (fd >= 0 && fd < tbl->fd_max) {
+        tbl->entries[fd] = NULL;
     }
 }
 
 static inline void fd_trans_dup(int oldfd, int newfd)
 {
+    TaskState *ts = (TaskState *)thread_cpu->opaque;
+    struct fd_trans_table *tbl = ts->fd_trans_tbl;
+
     fd_trans_unregister(newfd);
-    if (oldfd < target_fd_max && target_fd_trans[oldfd]) {
-        fd_trans_register(newfd, target_fd_trans[oldfd]);
+    if (oldfd >= 0 && oldfd < tbl->fd_max && tbl->entries[oldfd]) {
+        fd_trans_register(newfd, tbl->entries[oldfd]);
     }
 }
 
diff --git a/linux-user/main.c b/linux-user/main.c
index 3597e99bb1..d1ed0f6120 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -796,6 +796,7 @@ int main(int argc, char **argv, char **envp)
     ts->bprm = &bprm;
     cpu->opaque = ts;
     task_settid(ts);
+    ts->fd_trans_tbl = g_new0(struct fd_trans_table, 1);
 
     ret = loader_exec(execfd, exec_path, target_argv, target_environ, regs,
         info, &bprm);
diff --git a/linux-user/qemu.h b/linux-user/qemu.h
index ce902f5132..989e01ad8d 100644
--- a/linux-user/qemu.h
+++ b/linux-user/qemu.h
@@ -5,6 +5,7 @@
 #include "cpu.h"
 #include "exec/exec-all.h"
 #include "exec/cpu_ldst.h"
+#include "fd-trans-type.h"
 
 #undef DEBUG_REMAP
 #ifdef DEBUG_REMAP
@@ -96,6 +97,22 @@ struct emulated_sigtable {
     target_siginfo_t info;
 };
 
+/*
+ * The fd_trans_table is used by the FD data translation subsystem to
+ * find FD data translators (i.e. functions). `entries` is an array of pointers
+ * with size `fd_max`, containing pointers to TargetFdTrans structs. A pointer
+ * to a struct of this type is stored in the TaskState, which allows the struct
+ * to be shared by all tasks (e.g., threads) that share a file descriptor
+ * namespace. Storing a pointer to this table in the TaskState struct is needed
+ * to support rare cases where tasks share an address space, but do not share
+ * a set of file descriptors (e.g., after clone(CLONE_VM) when CLONE_FILES is
+ * not set). See `fd-trans.h` for more info on the FD translation subsystem.
+ */
+struct fd_trans_table {
+    uint64_t fd_max;
+    TargetFdTrans **entries;
+};
+
 /* NOTE: we force a big alignment so that the stack stored after is
    aligned too */
 typedef struct TaskState {
@@ -153,6 +170,13 @@ typedef struct TaskState {
 
     /* This thread's sigaltstack, if it has one */
     struct target_sigaltstack sigaltstack_used;
+
+    /*
+     * A pointer to the FD trans table to be used by this task. Note that the
+     * task doesn't have exclusive control of the fd_trans_table so access to
+     * the table itself should be guarded.
+     */
+    struct fd_trans_table *fd_trans_tbl;
 } __attribute__((aligned(16))) TaskState;
 
 extern char *exec_path;
diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index 7ce021cea2..ff1d07871f 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -6013,6 +6013,18 @@ static int do_fork(CPUArchState *env, unsigned int flags, abi_ulong newsp,
     ts->info = parent_ts->info;
     ts->signal_mask = parent_ts->signal_mask;
 
+    if (flags & CLONE_FILES) {
+        ts->fd_trans_tbl = parent_ts->fd_trans_tbl;
+    } else {
+        /*
+         * When CLONE_FILES is not set, the parent and child will have
+         * different file descriptor tables, so we need a new
+         * fd_trans_tbl. Clone from parent_ts, since child inherits all our
+         * file descriptors.
+         */
+        ts->fd_trans_tbl = fd_trans_table_clone(parent_ts->fd_trans_tbl);
+    }
+
     if (flags & CLONE_CHILD_CLEARTID) {
         ts->child_tidptr = child_tidptr;
     }
-- 
2.27.0.290.gba653c62da-goog



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 3/5] linux-user: Make sigact_table part of the task state.
  2020-06-12  1:46 [PATCH 0/5] linux-user: Support extended clone(CLONE_VM) Josh Kunz
  2020-06-12  1:46 ` [PATCH 1/5] linux-user: Refactor do_fork to use new `qemu_clone` Josh Kunz
  2020-06-12  1:46 ` [PATCH 2/5] linux-user: Make fd_trans task-specific Josh Kunz
@ 2020-06-12  1:46 ` Josh Kunz
  2020-06-12  1:46 ` [PATCH 4/5] linux-user: Support CLONE_VM and extended clone options Josh Kunz
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Josh Kunz @ 2020-06-12  1:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: riku.voipio, laurent, alex.bennee, Josh Kunz

sigact_table stores the signal handlers for the given process. Once we
support CLONE_VM, two tasks using the same virtual memory may need
different signal handler tables (e.g., if CLONE_SIGHAND is not
provided). Here we make sigact_table part of the TaskState, so it can be
duplicated as needed when cloning children.

Signed-off-by: Josh Kunz <jkz@google.com>
---
 linux-user/qemu.h    |  8 ++++++++
 linux-user/signal.c  | 35 +++++++++++++++++++++++++++--------
 linux-user/syscall.c | 17 +++++++++++++++++
 3 files changed, 52 insertions(+), 8 deletions(-)

diff --git a/linux-user/qemu.h b/linux-user/qemu.h
index 989e01ad8d..54bf4f47be 100644
--- a/linux-user/qemu.h
+++ b/linux-user/qemu.h
@@ -177,6 +177,12 @@ typedef struct TaskState {
      * the table itself should be guarded.
      */
     struct fd_trans_table *fd_trans_tbl;
+
+    /*
+     * A table containing signal actions for the target. It should have at
+     * least TARGET_NSIG entries
+     */
+    struct target_sigaction *sigact_tbl;
 } __attribute__((aligned(16))) TaskState;
 
 extern char *exec_path;
@@ -419,7 +425,9 @@ void print_syscall_ret(int num, abi_long arg1);
  */
 void print_taken_signal(int target_signum, const target_siginfo_t *tinfo);
 
+
 /* signal.c */
+struct target_sigaction *sigact_table_clone(struct target_sigaction *orig);
 void process_pending_signals(CPUArchState *cpu_env);
 void signal_init(void);
 int queue_signal(CPUArchState *env, int sig, int si_type,
diff --git a/linux-user/signal.c b/linux-user/signal.c
index 8cf51ffecd..dc98def6d1 100644
--- a/linux-user/signal.c
+++ b/linux-user/signal.c
@@ -25,7 +25,13 @@
 #include "trace.h"
 #include "signal-common.h"
 
-static struct target_sigaction sigact_table[TARGET_NSIG];
+struct target_sigaltstack target_sigaltstack_used = {
+    .ss_sp = 0,
+    .ss_size = 0,
+    .ss_flags = TARGET_SS_DISABLE,
+};
+
+typedef struct target_sigaction sigact_table[TARGET_NSIG];
 
 static void host_signal_handler(int host_signum, siginfo_t *info,
                                 void *puc);
@@ -542,6 +548,11 @@ static void signal_table_init(void)
     }
 }
 
+struct target_sigaction *sigact_table_clone(struct target_sigaction *orig)
+{
+    return memcpy(g_new(sigact_table, 1), orig, sizeof(sigact_table));
+}
+
 void signal_init(void)
 {
     TaskState *ts = (TaskState *)thread_cpu->opaque;
@@ -556,6 +567,12 @@ void signal_init(void)
     /* Set the signal mask from the host mask. */
     sigprocmask(0, 0, &ts->signal_mask);
 
+    /*
+     * Set all host signal handlers. ALL signals are blocked during
+     * the handlers to serialize them.
+     */
+    ts->sigact_tbl = (struct target_sigaction *) g_new0(sigact_table, 1);
+
     sigfillset(&act.sa_mask);
     act.sa_flags = SA_SIGINFO;
     act.sa_sigaction = host_signal_handler;
@@ -568,9 +585,9 @@ void signal_init(void)
         host_sig = target_to_host_signal(i);
         sigaction(host_sig, NULL, &oact);
         if (oact.sa_sigaction == (void *)SIG_IGN) {
-            sigact_table[i - 1]._sa_handler = TARGET_SIG_IGN;
+            ts->sigact_tbl[i - 1]._sa_handler = TARGET_SIG_IGN;
         } else if (oact.sa_sigaction == (void *)SIG_DFL) {
-            sigact_table[i - 1]._sa_handler = TARGET_SIG_DFL;
+            ts->sigact_tbl[i - 1]._sa_handler = TARGET_SIG_DFL;
         }
         /* If there's already a handler installed then something has
            gone horribly wrong, so don't even try to handle that case.  */
@@ -608,11 +625,12 @@ void force_sig(int sig)
 #if !defined(TARGET_RISCV)
 void force_sigsegv(int oldsig)
 {
+    TaskState *ts = (TaskState *)thread_cpu->opaque;
     if (oldsig == SIGSEGV) {
         /* Make sure we don't try to deliver the signal again; this will
          * end up with handle_pending_signal() calling dump_core_and_abort().
          */
-        sigact_table[oldsig - 1]._sa_handler = TARGET_SIG_DFL;
+        ts->sigact_tbl[oldsig - 1]._sa_handler = TARGET_SIG_DFL;
     }
     force_sig(TARGET_SIGSEGV);
 }
@@ -837,6 +855,7 @@ int do_sigaction(int sig, const struct target_sigaction *act,
     struct sigaction act1;
     int host_sig;
     int ret = 0;
+    TaskState* ts = (TaskState *)thread_cpu->opaque;
 
     trace_signal_do_sigaction_guest(sig, TARGET_NSIG);
 
@@ -848,7 +867,7 @@ int do_sigaction(int sig, const struct target_sigaction *act,
         return -TARGET_ERESTARTSYS;
     }
 
-    k = &sigact_table[sig - 1];
+    k = &ts->sigact_tbl[sig - 1];
     if (oact) {
         __put_user(k->_sa_handler, &oact->_sa_handler);
         __put_user(k->sa_flags, &oact->sa_flags);
@@ -930,7 +949,7 @@ static void handle_pending_signal(CPUArchState *cpu_env, int sig,
         sa = NULL;
         handler = TARGET_SIG_IGN;
     } else {
-        sa = &sigact_table[sig - 1];
+        sa = &ts->sigact_tbl[sig - 1];
         handler = sa->_sa_handler;
     }
 
@@ -1022,9 +1041,9 @@ void process_pending_signals(CPUArchState *cpu_env)
              * looping round and round indefinitely.
              */
             if (sigismember(&ts->signal_mask, target_to_host_signal_table[sig])
-                || sigact_table[sig - 1]._sa_handler == TARGET_SIG_IGN) {
+                || ts->sigact_tbl[sig - 1]._sa_handler == TARGET_SIG_IGN) {
                 sigdelset(&ts->signal_mask, target_to_host_signal_table[sig]);
-                sigact_table[sig - 1]._sa_handler = TARGET_SIG_DFL;
+                ts->sigact_tbl[sig - 1]._sa_handler = TARGET_SIG_DFL;
             }
 
             handle_pending_signal(cpu_env, sig, &ts->sync_signal);
diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index ff1d07871f..838caf9c98 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -5989,6 +5989,17 @@ static int do_fork(CPUArchState *env, unsigned int flags, abi_ulong newsp,
         return -TARGET_EINVAL;
     }
 
+    if ((flags & CLONE_SIGHAND) && !(flags & CLONE_VM)) {
+        /*
+         * Like CLONE_FILES, this flag combination is unsupported. If
+         * CLONE_SIGHAND is specified without CLONE_VM, then we need to keep
+         * the sigact table in-sync across virtual memory boundaries, which is
+         * substantially more complicated.
+         */
+        qemu_log_mask(LOG_UNIMP, "CLONE_SIGHAND only supported with CLONE_VM");
+        return -TARGET_EINVAL;
+    }
+
     pthread_mutex_init(&info.mutex, NULL);
     pthread_mutex_lock(&info.mutex);
     pthread_cond_init(&info.cond, NULL);
@@ -6025,6 +6036,12 @@ static int do_fork(CPUArchState *env, unsigned int flags, abi_ulong newsp,
         ts->fd_trans_tbl = fd_trans_table_clone(parent_ts->fd_trans_tbl);
     }
 
+    if (flags & CLONE_SIGHAND) {
+        ts->sigact_tbl = parent_ts->sigact_tbl;
+    } else {
+        ts->sigact_tbl = sigact_table_clone(parent_ts->sigact_tbl);
+    }
+
     if (flags & CLONE_CHILD_CLEARTID) {
         ts->child_tidptr = child_tidptr;
     }
-- 
2.27.0.290.gba653c62da-goog



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 4/5] linux-user: Support CLONE_VM and extended clone options
  2020-06-12  1:46 [PATCH 0/5] linux-user: Support extended clone(CLONE_VM) Josh Kunz
                   ` (2 preceding siblings ...)
  2020-06-12  1:46 ` [PATCH 3/5] linux-user: Make sigact_table part of the task state Josh Kunz
@ 2020-06-12  1:46 ` Josh Kunz
  2020-06-13  0:10   ` Josh Kunz
  2020-06-16 16:08   ` Alex Bennée
  2020-06-12  1:46 ` [PATCH 5/5] linux-user: Add PDEATHSIG test for clone process hierarchy Josh Kunz
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 18+ messages in thread
From: Josh Kunz @ 2020-06-12  1:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: riku.voipio, laurent, alex.bennee, Josh Kunz

The `clone` system call can be used to create new processes that share
attributes with their parents, such as virtual memory, file
system location, file descriptor tables, etc. These can be useful to a
variety of guest programs.

Before this patch, QEMU supported only a limited set of these
attributes: essentially the ones needed for threads, plus the options
used by fork.
This change adds support for all flag combinations involving CLONE_VM.
In theory, almost all clone options could be supported, but invocations
not using CLONE_VM are likely to run afoul of linux-user's inherently
multi-threaded design.

To add this support, this patch updates the `qemu_clone` helper. An
overview of the mechanism used to support general `clone` options with
CLONE_VM is described below.

This patch also enables by-default the `clone` unit-tests in
tests/tcg/multiarch/linux-test.c, and adds an additional test for duplicate
exit signals, based on a bug found during development.

!! Overview

Adding support for CLONE_VM is tricky. The parent and child process will
share an address space (similar to threads), so the emulator must
coordinate between the parent and the child. Currently, QEMU relies
heavily on Thread Local Storage (TLS) as part of this coordination
strategy. For threads, this works fine, because libc manages the
thread-local data region used for TLS, when we create new threads using
`pthread_create`. Ideally we could use the same mechanism for
"process-local storage" needed to allow the parent/child processes to
emulate in tandem. Unfortunately TLS is tightly integrated into libc.
The only way to create TLS data regions is via the `pthread_create` API
which also spawns a new thread (rather than a new process, which is
what we want). Worse still, TLS itself is a complicated arch-specific
feature that is tightly integrated into the rest of libc and the dynamic
linker. Re-implementing TLS support for QEMU would likely require a
special dynamic linker / libc. Alternatively, the popular libcs could be
extended, to allow for users to create TLS regions without creating
threads. Even if major libcs decide to add this support, QEMU will still
need a temporary work around until those libcs are widely deployed. It's
also unclear if libcs will be interested in supporting this case, since
TLS image creation is generally deeply integrated with thread setup.

In this patch, I've employed an alternative approach: spawning a thread
and "stealing" its TLS image for use in the child process. This approach
leaves a dangling thread while the TLS image is in use, but by design
that thread will not become schedulable until after the TLS data is no
longer in-use by the child (as described in a moment). Therefore, it
should cause relatively minimal overhead. When considered in the larger
context, this seems like a reasonable tradeoff.

A major complication of this approach is knowing when it is safe to
clean up the stack, and TLS image, used by a child process. When a child
is created with `CLONE_VM`, its stack and TLS data need to remain valid
until that child has either exited, or successfully called `execve` (on
`execve` the child is given a fresh address space by the kernel). One
approach would be to use `waitid(WNOWAIT)` (`WNOWAIT` leaves the child
reapable by the guest). The problem is that the `wait` family of calls
only waits for termination. The pattern of `clone() ... execve()` for long running
child processes is pretty common. If we waited for child processes to
exit, it's likely we would end up using substantially more memory, and
keep the suspended TLS thread around much longer than necessary.
Instead, in this patch, I've used a "trampoline" process. The real
parent first clones a trampoline, the trampoline then clones the
ultimate child using the `CLONE_VFORK` option. `CLONE_VFORK` suspends
the trampoline process until the child has exited, or called `execve`.
Once the trampoline is re-scheduled, we know it is safe to clean up
after the child. This creates one more suspended process, but typically,
the trampoline only exists for a short period of time.

!! CLONE_VM setup, step by step

1. First, the suspended thread whose TLS we will use is created using
   `pthread_create`. The thread fetches and returns its "TLS pointer"
   (an arch-specific value given to the kernel) to the parent. It then
   blocks on a lock to prevent its TLS data from being cleaned up.
   Ultimately the lock will be unlocked by the trampoline once the child
   exits.
2. Once the TLS thread has fetched the TLS pointer, it notifies the real
   parent thread, which calls `clone()` to create the trampoline
   process. For ease of implementation, the TLS image is set for the
   trampoline process during this step. This allows the trampoline to
   use functions that require TLS if needed (e.g., printf). TLS location
   is inherited when a new child is spawned, so this TLS data will
   automatically be inherited by the child.
3. Once the trampoline has been spawned, it registers itself as a
   "hidden" process with the signal subsystem. This prevents the exit
   signal from the trampoline from ever being forwarded to the guest.
   This is needed due to the way that Linux sets the exit signal for the
   ultimate child when `CLONE_PARENT` is set. See the source for
   details.
4. Once setup is complete, the trampoline spawns the final child with
   the original clone flags, plus `CLONE_PARENT`, so the child is
   correctly parented to the kernel task on which the guest invoked
   `clone`. Without this, kernel features like PDEATHSIG, and
   subreapers, would not work properly. As previously discussed, the
   trampoline also supplies `CLONE_VFORK` so that it is suspended until
   the child can be cleaned up.
5. Once the child is spawned, it signals the original parent thread that
   it is running. At this point, the trampoline process is suspended
   (due to CLONE_VFORK).
6. Finally, the call to `qemu_clone` in the parent is finished, the
   child begins executing the given callback function in the new child
   process.

!! Cleaning up

Cleanup itself is a multi-step process. Once the child exits, or is
killed by a signal (cleanup is the same in both cases), the trampoline
process becomes schedulable. When the trampoline is scheduled, it frees
the child stack, and unblocks the suspended TLS thread. This cleans up
the child resources, but not the stack used by the trampoline itself. It
is possible for a process to clean up its own stack, but it is tricky,
and architecture-specific. Instead we leverage the TLS manager thread to
clean up the trampoline stack. When the trampoline is cloned (in step 2
above), we additionally set the `CHILD_SETTID` and `CHILD_CLEARTID`
flags. The target location for the SET/CLEAR TID is set to a special field
known by the TLS manager. Then, when the TLS manager thread is unsuspended,
it performs an additional `FUTEX_WAIT` on this location. That blocks the
TLS manager thread until the trampoline has fully exited, then the TLS
manager thread frees the trampoline process's stack, before exiting
itself.

!! Shortcomings of this patch

* It's complicated.
* It doesn't support any clone options when CLONE_VM is omitted.
* It doesn't properly clean up the CPU queue when the child process
  terminates, or calls execve().
* RCU unregistration is done in the trampoline process (in clone.c), but
  registration happens in syscall.c. This should be made more explicit.
* The TLS image, and trampoline stack are not cleaned up if the parent
  calls `execve` or `exit_group` before the child does. This is because
  those cleanup tasks are handled by the TLS manager thread. The TLS
  manager thread is in the same thread group as the parent, so it will
  be terminated if the parent exits or calls `execve`.

!! Alternatives considered

* Non-standard libc extension to allow creating TLS images independent
  of threads. This would allow us to just `clone` the child directly
  instead of this complicated maneuver. Though we probably would still
  need the cleanup logic. For libcs, TLS image allocation is tightly
  connected to thread stack allocation, which is also arch-specific. I
  do not have enough experience with libc development to know if
  maintainers of any popular libcs would be open to supporting such an
  API. Additionally, since it will probably take years before a libc
  fix would be widely deployed, we need an interim solution anyways.
* Non-standard, Linux-only, libc extension to allow us to specify the
  CLONE_* flags used by `pthread_create`. The processes we are creating
  are basically threads in a different thread group. If we could alter
  the flags used, this whole process could become a `pthread_create`.
  The problem with this approach is that I don't know what requirements
  pthreads has on threads to ensure they function properly. I suspect
  that pthreads relies on CHILD_CLEARTID+FUTEX_WAKE to cleanup detached
  thread state. Since we don't control the child exit reason (Linux only
  handles CHILD_CLEARTID on normal, non-signal process termination), we
  probably can't use this same tracking mechanism.
* Other mechanisms for detecting child exit so cleanup can happen
  besides CLONE_VFORK:
  * waitid(WNOWAIT): This can only detect exit, not execve.
  * file descriptors with close on exec set: This cannot detect children
    cloned with CLONE_FILES.
  * System V semaphore adjustments: Cannot detect children cloned with
    CLONE_SYSVSEM.
  * CLONE_CHILD_CLEARTID + FUTEX_WAIT: Cannot detect abnormally
    terminated children.
* Doing the child clone directly in the TLS manager thread: This saves the
  need for the trampoline process, but it causes the child process to be
  parented to the wrong kernel task (the TLS thread instead of the Main
  thread) breaking things like PDEATHSIG.

Signed-off-by: Josh Kunz <jkz@google.com>
---
 linux-user/clone.c               | 415 ++++++++++++++++++++++++++++++-
 linux-user/qemu.h                |  17 ++
 linux-user/signal.c              |  49 ++++
 linux-user/syscall.c             |  69 +++--
 tests/tcg/multiarch/linux-test.c |  67 ++++-
 5 files changed, 592 insertions(+), 25 deletions(-)

diff --git a/linux-user/clone.c b/linux-user/clone.c
index f02ae8c464..3f7344cf9e 100644
--- a/linux-user/clone.c
+++ b/linux-user/clone.c
@@ -12,6 +12,12 @@
 #include <stdbool.h>
 #include <assert.h>
 
+/* arch-specific includes needed to fetch the TLS base offset. */
+#if defined(__x86_64__)
+#include <asm/prctl.h>
+#include <sys/prctl.h>
+#endif
+
 static const unsigned long NEW_STACK_SIZE = 0x40000UL;
 
 /*
@@ -62,6 +68,397 @@ static void completion_finish(struct completion *c)
     pthread_mutex_unlock(&c->mu);
 }
 
+struct tls_manager {
+    void *tls_ptr;
+    /* fetched is completed once tls_ptr has been set by the thread. */
+    struct completion fetched;
+    /*
+     * spawned is completed by the user once the managed_tid
+     * has been spawned.
+     */
+    struct completion spawned;
+    /*
+     * TID of the child whose memory is cleaned up upon death. This memory
+     * location is used as part of a futex op, and is cleared by the kernel
+     * since we specify CHILD_CLEARTID.
+     */
+    int managed_tid;
+    /*
+     * The value to be `free`'d up once the janitor is ready to clean up the
+     * TLS section, and the managed tid has exited.
+     */
+    void *cleanup;
+};
+
+/*
+ * tls_ptr fetches the TLS "pointer" for the current thread. This pointer
+ * should be whatever platform-specific address is used to represent the TLS
+ * base address.
+ */
+static void *tls_ptr()
+{
+    void *ptr;
+#if defined(__x86_64__)
+    /*
+     * On x86_64, the TLS base is stored in the `fs` segment register, we can
+     * fetch it with `ARCH_GET_FS`:
+     */
+    (void)syscall(SYS_arch_prctl, ARCH_GET_FS, (unsigned long) &ptr);
+#else
+    ptr = NULL;
+#endif
+    return ptr;
+}
+
+/*
+ * clone_vm_supported returns true if clone_vm() is supported on this
+ * platform.
+ */
+static bool clone_vm_supported()
+{
+#if defined(__x86_64__)
+    return true;
+#else
+    return false;
+#endif
+}
+
+static void *tls_manager_thread(void *arg)
+{
+    struct tls_manager *mgr = (struct tls_manager *) arg;
+    int child_tid, ret;
+
+    /*
+     * NOTE: Do not use any TLS in this thread until after the `spawned`
+     * completion is finished. We need to preserve the pristine state of
+     * the TLS image for this thread, so it can be re-used in a separate
+     * process.
+     */
+    mgr->tls_ptr = tls_ptr();
+
+    /* Notify tls_new that we finished fetching the TLS ptr. */
+    completion_finish(&mgr->fetched);
+
+    /*
+     * Wait for the user of our TLS to tell us the child using our TLS has
+     * been spawned.
+     */
+    completion_await(&mgr->spawned);
+
+    child_tid = atomic_fetch_or(&mgr->managed_tid, 0);
+    /*
+     * Check if the child has already terminated by this point. If not, wait
+     * for the child to exit. As long as the trampoline is not killed by
+     * a signal, the kernel guarantees that the memory at &mgr->managed_tid
+     * will be cleared, and a FUTEX_WAKE at that address will be triggered.
+     */
+    if (child_tid != 0) {
+        ret = syscall(SYS_futex, &mgr->managed_tid, FUTEX_WAIT,
+                      child_tid, NULL, NULL, 0);
+        assert(ret == 0 && "clone manager futex should always succeed");
+    }
+
+    free(mgr->cleanup);
+    g_free(mgr);
+
+    return NULL;
+}
+
+static struct tls_manager *tls_manager_new()
+{
+    struct tls_manager *mgr = g_new0(struct tls_manager, 1);
+    sigset_t block, oldmask;
+
+    sigfillset(&block);
+    if (sigprocmask(SIG_BLOCK, &block, &oldmask) != 0) {
+        return NULL;
+    }
+
+    completion_init(&mgr->fetched);
+    completion_init(&mgr->spawned);
+
+    pthread_attr_t attr;
+    pthread_attr_init(&attr);
+    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
+
+    pthread_t unused;
+    if (pthread_create(&unused, &attr, tls_manager_thread, (void *) mgr)) {
+        pthread_attr_destroy(&attr);
+        g_free(mgr);
+        return NULL;
+    }
+    pthread_attr_destroy(&attr);
+    completion_await(&mgr->fetched);
+
+    if (sigprocmask(SIG_SETMASK, &oldmask, NULL) != 0) {
+        /* Let the thread exit, and cleanup itself. */
+        completion_finish(&mgr->spawned);
+        return NULL;
+    }
+
+    /* Once we finish awaiting, the tls_ptr will be usable. */
+    return mgr;
+}
+
+struct stack {
+    /* Buffer is the "base" of the stack buffer. */
+    void *buffer;
+    /* Top is the "start" of the stack (since stack addresses "grow down"). */
+    void *top;
+};
+
+struct info {
+    /* Stacks used for the trampoline and child process. */
+    struct {
+        struct stack trampoline;
+        struct stack process;
+    } stack;
+    struct completion child_ready;
+    /* `clone` flags for the process the user asked us to make. */
+    int flags;
+    sigset_t orig_mask;
+    /*
+     * Function to run in the ultimate child process, and payload to pass as
+     * the argument.
+     */
+    int (*clone_f)(void *);
+    void *payload;
+    /*
+     * Result of calling `clone` for the child clone. Will be set to
+     * `-errno` if an error occurs.
+     */
+    int result;
+};
+
+static bool stack_new(struct stack *stack)
+{
+    /*
+     * TODO: put a guard page at the bottom of the stack, so we don't
+     * accidentally roll off the end.
+     */
+    if (posix_memalign(&stack->buffer, 16, NEW_STACK_SIZE)) {
+        return false;
+    }
+    memset(stack->buffer, 0, NEW_STACK_SIZE);
+    stack->top = stack->buffer + NEW_STACK_SIZE;
+    return true;
+}
+
+static int clone_child(void *raw_info)
+{
+    struct info *info = (struct info *) raw_info;
+    int (*clone_f)(void *) = info->clone_f;
+    void *payload = info->payload;
+    if (!(info->flags & CLONE_VFORK)) {
+        /*
+         * If CLONE_VFORK is NOT set, then the trampoline has stalled (it
+         * forces VFORK), but the actual clone should return immediately. In
+         * this case, this thread needs to notify the parent that the new
+         * process is running. If CLONE_VFORK IS set, the trampoline will
+         * notify the parent once the normal kernel vfork completes.
+         */
+        completion_finish(&info->child_ready);
+    }
+    if (sigprocmask(SIG_SETMASK, &info->orig_mask, NULL) != 0) {
+        perror("failed to restore signal mask in cloned child");
+        _exit(1);
+    }
+    return clone_f(payload);
+}
+
+static int clone_trampoline(void *raw_info)
+{
+    struct info *info = (struct info *) raw_info;
+    int flags;
+
+    struct stack process_stack = info->stack.process;
+    int orig_flags = info->flags;
+
+    if (orig_flags & CSIGNAL) {
+        /*
+         * It should be safe to call here, since we know signals are blocked
+         * for this process.
+         */
+        hide_current_process_exit_signal();
+    }
+
+    /*
+     * Force CLONE_PARENT, so that we don't accidentally become a child of the
+     * trampoline thread. This kernel task should either be a child of the
+     * trampoline's parent (if CLONE_PARENT is not in info->flags), or a child
+     * of the calling process's parent (if CLONE_PARENT IS in info->flags).
+     * That is to say, our parent should always be the correct parent for the
+     * child task.
+     *
+     * Force CLONE_VFORK so that we know when the child is no longer holding
+     * a reference to this process's virtual memory. CLONE_VFORK just suspends
+     * this task until the child execs or exits; it should not affect how the
+     * child process is created in any way. This is the only generic way I'm
+     * aware of to observe *any* exit or exec, including "abnormal" exits
+     * such as exits via signals.
+     *
+     * Force CLONE_CHILD_SETTID, since we want to track the CHILD TID in the
+     * `info` structure. Capturing the child TID via the `clone` call directly
+     * is slightly nicer than making a syscall in the child. Since we know
+     * we're doing a CLONE_VM here, we can use CLONE_CHILD_SETTID to
+     * guarantee that the kernel sets the child TID before the child runs.
+     * The child TID should be visible to the parent, since parent and child
+     * share an address space. If the clone fails, we overwrite
+     * `info->result` with the error code anyway.
+     */
+    flags = orig_flags | CLONE_PARENT | CLONE_VFORK | CLONE_CHILD_SETTID;
+    if (clone(clone_child, info->stack.process.top, flags,
+              (void *) info, NULL, NULL, &info->result) < 0) {
+        info->result = -errno;
+        completion_finish(&info->child_ready);
+        return 0;
+    }
+
+    /*
+     * Clean up the child process stack, since we know the child can no longer
+     * reference it.
+     */
+    free(process_stack.buffer);
+
+    /*
+     * We know the process we created used CLONE_VFORK, so it registered
+     * with the RCU. We share a TLS image with that process, so we can
+     * unregister it from the RCU. Since the TLS image will be valid for at
+     * least our lifetime, it should be OK to leave the child process's RCU
+     * entry in the queue between when the child execs or exits and when the
+     * kernel returns here from our vfork.
+     */
+    rcu_unregister_thread();
+
+    /*
+     * If we're doing a real vfork here, we need to notify the parent that the
+     * vfork has happened.
+     */
+    if (orig_flags & CLONE_VFORK) {
+        completion_finish(&info->child_ready);
+    }
+
+    return 0;
+}
+
+static int clone_vm(int flags, int (*callback)(void *), void *payload)
+{
+    struct info info;
+    sigset_t sigmask;
+    int ret;
+
+    assert(flags & CLONE_VM && "CLONE_VM flag must be set");
+
+    memset(&info, 0, sizeof(info));
+    info.clone_f = callback;
+    info.payload = payload;
+    info.flags = flags;
+
+    /*
+     * Set up the stacks for the child processes needed to execute the clone.
+     */
+    if (!stack_new(&info.stack.trampoline)) {
+        return -1;
+    }
+    if (!stack_new(&info.stack.process)) {
+        free(info.stack.trampoline.buffer);
+        return -1;
+    }
+
+    /*
+     * tls_manager_new transfers ownership of its reference to the TLS
+     * manager to us, so we "leak" the data pointer instead of using _get().
+     */
+    struct tls_manager *mgr = tls_manager_new();
+    if (mgr == NULL) {
+        free(info.stack.trampoline.buffer);
+        free(info.stack.process.buffer);
+        return -1;
+    }
+
+    /* Manager cleans up the trampoline stack once the trampoline exits. */
+    mgr->cleanup = info.stack.trampoline.buffer;
+
+    /*
+     * Flags used by the trampoline in the 2-phase clone setup for children
+     * cloned with CLONE_VM. We want the trampoline to be essentially identical
+     * to its parent. This improves the performance of cloning the trampoline,
+     * and guarantees that the real flags are implemented correctly.
+     *
+     * CLONE_CHILD_SETTID: Make the kernel set the managed_tid for the TLS
+     * manager.
+     *
+     * CLONE_CHILD_CLEARTID: Make the kernel clear the managed_tid, and
+     * trigger a FUTEX_WAKE (received by the TLS manager), so the TLS manager
+     * knows when to cleanup the trampoline stack.
+     *
+     * CLONE_SETTLS: To set the trampoline TLS based on the tls manager.
+     */
+    static const int base_trampoline_flags = (
+        CLONE_FILES | CLONE_FS | CLONE_IO | CLONE_PTRACE |
+        CLONE_SIGHAND | CLONE_SYSVSEM | CLONE_VM
+    ) | CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID | CLONE_SETTLS;
+
+    int trampoline_flags = base_trampoline_flags;
+
+    /*
+     * To get the process hierarchy right, we set the trampoline
+     * CLONE_PARENT/CLONE_THREAD flag to match the child
+     * CLONE_PARENT/CLONE_THREAD. So add those flags if specified by the child.
+     */
+    trampoline_flags |= (flags & CLONE_PARENT) ? CLONE_PARENT : 0;
+    trampoline_flags |= (flags & CLONE_THREAD) ? CLONE_THREAD : 0;
+
+    /*
+     * When using CLONE_PARENT, Linux always sets the exit_signal for the new
+     * task to the exit_signal of the parent process, which in our case is
+     * the trampoline process. exit_signal has special significance for calls
+     * like `wait`, so it needs to be set correctly. We add the signal part
+     * of the user flags here so the ultimate child gets the right signal.
+     *
+     * This has the unfortunate side effect of sending the parent two exit
+     * signals: one when the true child exits, and one when the trampoline
+     * exits. To work around this we have to capture the exit signal from
+     * the trampoline and suppress it.
+     */
+    trampoline_flags |= (flags & CSIGNAL);
+
+    sigfillset(&sigmask);
+    if (sigprocmask(SIG_BLOCK, &sigmask, &info.orig_mask) != 0) {
+        free(info.stack.trampoline.buffer);
+        free(info.stack.process.buffer);
+        completion_finish(&mgr->spawned);
+        return -1;
+    }
+
+    if (clone(clone_trampoline,
+              info.stack.trampoline.top, trampoline_flags, &info,
+              NULL, mgr->tls_ptr, &mgr->managed_tid) < 0) {
+        free(info.stack.trampoline.buffer);
+        free(info.stack.process.buffer);
+        completion_finish(&mgr->spawned);
+        return -1;
+    }
+
+    completion_await(&info.child_ready);
+    completion_finish(&mgr->spawned);
+
+    ret = sigprocmask(SIG_SETMASK, &info.orig_mask, NULL);
+    /*
+     * If our final sigproc mask doesn't work, we're pretty screwed. We may
+     * have started the final child now, and there's no going back. If this
+     * ever happens, just crash.
+     */
+    assert(!ret && "sigprocmask after clone needs to succeed");
+
+    /* If we have an error result, then set errno as needed. */
+    if (info.result < 0) {
+        errno = -info.result;
+        return -1;
+    }
+    return info.result;
+}
+
 struct clone_thread_info {
     struct completion running;
     int tid;
@@ -120,6 +517,17 @@ int qemu_clone(int flags, int (*callback)(void *), void *payload)
 {
     int ret;
 
+    /*
+     * Backwards compatibility: remove once all target platforms support
+     * clone_vm. Previously, we implemented vfork() via a fork() call, so
+     * preserve that behavior instead of failing.
+     */
+    if (!clone_vm_supported()) {
+        if (flags & CLONE_VFORK) {
+            flags &= ~(CLONE_VFORK | CLONE_VM);
+        }
+    }
+
     if (clone_flags_are_thread(flags)) {
         /*
          * The new process uses the same flags as pthread_create, so we can
@@ -146,7 +554,12 @@ int qemu_clone(int flags, int (*callback)(void *), void *payload)
         return ret;
     }
 
-    /* !fork && !thread */
+    if (clone_vm_supported() && (flags & CLONE_VM)) {
+        return clone_vm(flags, callback, payload);
+    }
+
+    /* !fork && !thread && !CLONE_VM. This form is unsupported. */
+
     errno = EINVAL;
     return -1;
 }
diff --git a/linux-user/qemu.h b/linux-user/qemu.h
index 54bf4f47be..e29912466c 100644
--- a/linux-user/qemu.h
+++ b/linux-user/qemu.h
@@ -94,6 +94,7 @@ struct vm86_saved_state {
 
 struct emulated_sigtable {
     int pending; /* true if signal is pending */
+    pid_t exit_pid; /* non-zero host pid, if a process is exiting. */
     target_siginfo_t info;
 };
 
@@ -183,6 +184,15 @@ typedef struct TaskState {
      * least TARGET_NSIG entries
      */
     struct target_sigaction *sigact_tbl;
+
+    /*
+     * Set to true if the process associated with this task state was cloned.
+     * This is needed to disambiguate cloned processes from threads. If
+     * CLONE_VM is used, a pthread_exit(..) will free the stack/TLS of the
+     * trampoline thread, and the trampoline will be unable to conduct its
+     * cleanup.
+     */
+    bool is_cloned;
 } __attribute__((aligned(16))) TaskState;
 
 extern char *exec_path;
@@ -442,6 +452,13 @@ abi_long do_sigaltstack(abi_ulong uss_addr, abi_ulong uoss_addr, abi_ulong sp);
 int do_sigprocmask(int how, const sigset_t *set, sigset_t *oldset);
 abi_long do_swapcontext(CPUArchState *env, abi_ulong uold_ctx,
                         abi_ulong unew_ctx, abi_long ctx_size);
+
+/*
+ * Register the current process as a "hidden" process. Exit signals generated
+ * by this process should not be delivered to the guest.
+ */
+void hide_current_process_exit_signal(void);
+
 /**
  * block_signals: block all signals while handling this guest syscall
  *
diff --git a/linux-user/signal.c b/linux-user/signal.c
index dc98def6d1..a7f0612b64 100644
--- a/linux-user/signal.c
+++ b/linux-user/signal.c
@@ -36,6 +36,21 @@ typedef struct target_sigaction sigact_table[TARGET_NSIG];
 static void host_signal_handler(int host_signum, siginfo_t *info,
                                 void *puc);
 
+/*
+ * This table, initialized in signal_init, is used to track "hidden"
+ * processes for which exit signals should not be delivered. The PIDs of the
+ * hidden processes are stored as keys. Values are always set to NULL.
+ *
+ * Note: Process IDs stored in this table may "leak" (i.e., never be removed
+ * from the table) if the guest ignores (SIG_IGN) the exit signal for the
+ * child it spawned. There is a small risk that such a PID could later be
+ * reused by another child process, whose exit would then be wrongly hidden.
+ * This is an unusual case that is unlikely in practice, but it is possible.
+ */
+static GHashTable *hidden_processes;
+
+/* This lock guards access to the `hidden_processes` table. */
+static pthread_mutex_t hidden_processes_lock = PTHREAD_MUTEX_INITIALIZER;
 
 /*
  * System includes define _NSIG as SIGRTMAX + 1,
@@ -564,6 +579,9 @@ void signal_init(void)
     /* initialize signal conversion tables */
     signal_table_init();
 
+    /* initialize the hidden process table. */
+    hidden_processes = g_hash_table_new(g_direct_hash, g_direct_equal);
+
     /* Set the signal mask from the host mask. */
     sigprocmask(0, 0, &ts->signal_mask);
 
@@ -749,6 +767,10 @@ static void host_signal_handler(int host_signum, siginfo_t *info,
     k = &ts->sigtab[sig - 1];
     k->info = tinfo;
     k->pending = sig;
+    k->exit_pid = 0;
+    if (info->si_code & (CLD_DUMPED | CLD_KILLED | CLD_EXITED)) {
+        k->exit_pid = info->si_pid;
+    }
     ts->signal_pending = 1;
 
     /* Block host signals until target signal handler entered. We
@@ -930,6 +952,17 @@ int do_sigaction(int sig, const struct target_sigaction *act,
     return ret;
 }
 
+void hide_current_process_exit_signal(void)
+{
+    pid_t pid = getpid();
+
+    pthread_mutex_lock(&hidden_processes_lock);
+
+    (void)g_hash_table_insert(hidden_processes, GINT_TO_POINTER(pid), NULL);
+
+    pthread_mutex_unlock(&hidden_processes_lock);
+}
+
 static void handle_pending_signal(CPUArchState *cpu_env, int sig,
                                   struct emulated_sigtable *k)
 {
@@ -944,6 +977,22 @@ static void handle_pending_signal(CPUArchState *cpu_env, int sig,
     /* dequeue signal */
     k->pending = 0;
 
+    if (k->exit_pid) {
+        pthread_mutex_lock(&hidden_processes_lock);
+        /*
+         * If the exit signal is for a hidden PID, then just drop it, and
+         * remove the hidden process from the list, since we know it has
+         * exited.
+         */
+        if (g_hash_table_contains(hidden_processes,
+                                  GINT_TO_POINTER(k->exit_pid))) {
+            g_hash_table_remove(hidden_processes, GINT_TO_POINTER(k->exit_pid));
+            pthread_mutex_unlock(&hidden_processes_lock);
+            return;
+        }
+        pthread_mutex_unlock(&hidden_processes_lock);
+    }
+
     sig = gdb_handlesig(cpu, sig);
     if (!sig) {
         sa = NULL;
diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index 838caf9c98..20cf5d5464 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -139,10 +139,9 @@
 
 /* These flags are ignored:
  * CLONE_DETACHED is now ignored by the kernel;
- * CLONE_IO is just an optimisation hint to the I/O scheduler
  */
 #define CLONE_IGNORED_FLAGS                     \
-    (CLONE_DETACHED | CLONE_IO)
+    (CLONE_DETACHED)
 
 /* Flags for fork which we can implement within QEMU itself */
 #define CLONE_EMULATED_FLAGS               \
@@ -5978,14 +5977,31 @@ static int do_fork(CPUArchState *env, unsigned int flags, abi_ulong newsp,
     }
     proc_flags = (proc_flags & ~CSIGNAL) | host_sig;
 
-    /* Emulate vfork() with fork() */
-    if (proc_flags & CLONE_VFORK) {
-        proc_flags &= ~(CLONE_VFORK | CLONE_VM);
+
+    if (!clone_flags_are_fork(proc_flags) && !(flags & CLONE_VM)) {
+        /*
+         * If the user is doing a non-CLONE_VM clone that cannot be emulated
+         * with fork, we can't guarantee that we can emulate it correctly.
+         * It should work OK as long as there are no threads in the parent
+         * process, so we hide it behind a flag for users who know what
+         * they're doing.
+         */
+        qemu_log_mask(LOG_UNIMP,
+                      "Refusing non-fork/thread clone without CLONE_VM.");
+        return -TARGET_EINVAL;
     }
 
-    if (!clone_flags_are_fork(proc_flags) &&
-        !clone_flags_are_thread(proc_flags)) {
-        qemu_log_mask(LOG_UNIMP, "unsupported clone flags");
+    if ((flags & CLONE_FILES) && !(flags & CLONE_VM)) {
+        /*
+         * This flag combination is currently unsupported. QEMU needs to update
+         * the fd_trans_table as new file descriptors are opened. This is easy
+         * when CLONE_VM is set, because the fd_trans_table is shared between
+         * the parent and child. Without CLONE_VM the fd_trans_table would
+         * need to be shared specially, using shared memory mappings or a
+         * consistency protocol between the child and the parent.
+         *
+         * For now, just return EINVAL in this case.
+         */
+        qemu_log_mask(LOG_UNIMP, "CLONE_FILES only supported with CLONE_VM");
         return -TARGET_EINVAL;
     }
 
@@ -6042,6 +6058,10 @@ static int do_fork(CPUArchState *env, unsigned int flags, abi_ulong newsp,
         ts->sigact_tbl = sigact_table_clone(parent_ts->sigact_tbl);
     }
 
+    if (!clone_flags_are_thread(proc_flags)) {
+        ts->is_cloned = true;
+    }
+
     if (flags & CLONE_CHILD_CLEARTID) {
         ts->child_tidptr = child_tidptr;
     }
@@ -6063,10 +6083,8 @@ static int do_fork(CPUArchState *env, unsigned int flags, abi_ulong newsp,
         tb_flush(cpu);
     }
 
-    if (proc_flags & CLONE_VM) {
-        info.child.register_thread = true;
-        info.child.signal_setup = true;
-    }
+    info.child.signal_setup = (flags & CLONE_VM) && !(flags & CLONE_VFORK);
+    info.child.register_thread = !!(flags & CLONE_VM);
 
     /*
      * It is not safe to deliver signals until the child has finished
@@ -6078,7 +6096,7 @@ static int do_fork(CPUArchState *env, unsigned int flags, abi_ulong newsp,
 
     ret = get_errno(qemu_clone(proc_flags, clone_run, (void *) &info));
 
-    if (ret >= 0 && (proc_flags & CLONE_VM)) {
+    if (ret >= 0 && (flags & CLONE_VM) && !(flags & CLONE_VFORK)) {
         /*
          * Wait for the child to finish setup if the child is running in the
          * same VM.
@@ -6092,7 +6110,7 @@ static int do_fork(CPUArchState *env, unsigned int flags, abi_ulong newsp,
     pthread_cond_destroy(&info.cond);
     pthread_mutex_destroy(&info.mutex);
 
-    if (ret >= 0 && !(proc_flags & CLONE_VM)) {
+    if (ret >= 0 && !(flags & CLONE_VM)) {
         /*
          * If !CLONE_VM, then we need to set parent_tidptr, since the child
          * won't set it for us. Should always be safe to set it here anyways.
@@ -7662,6 +7680,7 @@ static abi_long do_syscall1(void *cpu_env, int num, abi_long arg1,
     switch(num) {
     case TARGET_NR_exit:
     {
+        bool do_pthread_exit = false;
         /* In old applications this may be used to implement _exit(2).
            However in threaded applictions it is used for thread termination,
            and _exit_group is used for application termination.
@@ -7692,10 +7711,20 @@ static abi_long do_syscall1(void *cpu_env, int num, abi_long arg1,
                           NULL, NULL, 0);
             }
 
+            /*
+             * Need this multi-step process so we can free ts before calling
+             * pthread_exit.
+             */
+            if (!ts->is_cloned) {
+                do_pthread_exit = true;
+            }
+
             thread_cpu = NULL;
             g_free(ts);
-            rcu_unregister_thread();
-            pthread_exit(NULL);
+            if (do_pthread_exit) {
+                rcu_unregister_thread();
+                pthread_exit(NULL);
+            }
         }
 
         pthread_mutex_unlock(&clone_lock);
@@ -9700,6 +9729,14 @@ static abi_long do_syscall1(void *cpu_env, int num, abi_long arg1,
 #ifdef __NR_exit_group
         /* new thread calls */
     case TARGET_NR_exit_group: {
+        /*
+         * TODO: We need to clean up CPUs (as is done for exit(2)) for all
+         * threads in this process when exit_group is called, at least for
+         * tasks that have been cloned. This could also be done in
+         * clone_trampoline/tls_mgr, but the cleanup is non-trivial (it needs
+         * to be coordinated across threads). Right now things seem to be
+         * fine without the cleanup, so just leaving a note.
+         */
         preexit_cleanup(cpu_env, arg1);
         return get_errno(exit_group(arg1));
     }
diff --git a/tests/tcg/multiarch/linux-test.c b/tests/tcg/multiarch/linux-test.c
index 8a7c15cd31..a7723556c2 100644
--- a/tests/tcg/multiarch/linux-test.c
+++ b/tests/tcg/multiarch/linux-test.c
@@ -407,14 +407,13 @@ static void test_clone(void)
 
     stack1 = malloc(STACK_SIZE);
     pid1 = chk_error(clone(thread1_func, stack1 + STACK_SIZE,
-                           CLONE_VM | CLONE_FS | CLONE_FILES |
-                           CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM,
+                           CLONE_VM | SIGCHLD,
                             "hello1"));
 
     stack2 = malloc(STACK_SIZE);
     pid2 = chk_error(clone(thread2_func, stack2 + STACK_SIZE,
                            CLONE_VM | CLONE_FS | CLONE_FILES |
-                           CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM,
+                           CLONE_SIGHAND | CLONE_SYSVSEM | SIGCHLD,
                            "hello2"));
 
     wait_for_child(pid1);
@@ -517,6 +516,61 @@ static void test_shm(void)
     chk_error(shmdt(ptr));
 }
 
+static volatile sig_atomic_t test_clone_signal_count_handler_calls;
+
+static void test_clone_signal_count_handler(int sig)
+{
+    test_clone_signal_count_handler_calls++;
+}
+
+/* A clone function that does nothing and exits successfully. */
+static int successful_func(void *arg __attribute__((unused)))
+{
+    return 0;
+}
+
+/*
+ * With our clone implementation it's possible that we could generate too many
+ * child exit signals. Make sure only the single expected child-exit signal is
+ * generated.
+ */
+static void test_clone_signal_count(void)
+{
+    uint8_t *child_stack;
+    struct sigaction prev, test;
+    int status;
+    pid_t pid;
+
+    memset(&test, 0, sizeof(test));
+    test.sa_handler = test_clone_signal_count_handler;
+    test.sa_flags = SA_RESTART;
+
+    /* Use real-time signals, so every signal event gets delivered. */
+    chk_error(sigaction(SIGRTMIN, &test, &prev));
+
+    child_stack = malloc(STACK_SIZE);
+    pid = chk_error(clone(
+        successful_func,
+        child_stack + STACK_SIZE,
+        CLONE_VM | SIGRTMIN,
+        NULL
+    ));
+
+    /*
+     * Need to use __WCLONE here because we are not using SIGCHLD as the
+     * exit_signal. By default, Linux only waits for children spawned with
+     * SIGCHLD.
+     */
+    chk_error(waitpid(pid, &status, __WCLONE));
+
+    chk_error(sigaction(SIGRTMIN, &prev, NULL));
+
+    if (test_clone_signal_count_handler_calls != 1) {
+        error("expected to receive exactly 1 signal, received %d signals",
+              test_clone_signal_count_handler_calls);
+    }
+}
+
 int main(int argc, char **argv)
 {
     test_file();
@@ -524,11 +578,8 @@ int main(int argc, char **argv)
     test_fork();
     test_time();
     test_socket();
-
-    if (argc > 1) {
-        printf("test_clone still considered buggy\n");
-        test_clone();
-    }
+    test_clone();
+    test_clone_signal_count();
 
     test_signal();
     test_shm();
-- 
2.27.0.290.gba653c62da-goog




* [PATCH 5/5] linux-user: Add PDEATHSIG test for clone process hierarchy.
  2020-06-12  1:46 [PATCH 0/5] linux-user: Support extended clone(CLONE_VM) Josh Kunz
                   ` (3 preceding siblings ...)
  2020-06-12  1:46 ` [PATCH 4/5] linux-user: Support CLONE_VM and extended clone options Josh Kunz
@ 2020-06-12  1:46 ` Josh Kunz
  2020-06-12  3:48 ` [PATCH 0/5] linux-user: Support extended clone(CLONE_VM) no-reply
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Josh Kunz @ 2020-06-12  1:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: riku.voipio, laurent, alex.bennee, Josh Kunz

Certain process-level linux features like subreapers, and PDEATHSIG,
depend on the guest's process hierarchy being emulated correctly on the
host. This change adds a test that makes sure PDEATHSIG works for a
guest process created with `clone`.

Signed-off-by: Josh Kunz <jkz@google.com>
---
 tests/tcg/multiarch/Makefile.target |   3 +
 tests/tcg/multiarch/linux-test.c    | 160 ++++++++++++++++++++++++++--
 2 files changed, 153 insertions(+), 10 deletions(-)

diff --git a/tests/tcg/multiarch/Makefile.target b/tests/tcg/multiarch/Makefile.target
index cb49cc9ccb..d937b4c59b 100644
--- a/tests/tcg/multiarch/Makefile.target
+++ b/tests/tcg/multiarch/Makefile.target
@@ -60,3 +60,6 @@ endif
 
 # Update TESTS
 TESTS += $(MULTIARCH_TESTS)
+
+# linux-test.c depends on -pthread.
+LDFLAGS += -pthread
diff --git a/tests/tcg/multiarch/linux-test.c b/tests/tcg/multiarch/linux-test.c
index a7723556c2..1824a5a0c2 100644
--- a/tests/tcg/multiarch/linux-test.c
+++ b/tests/tcg/multiarch/linux-test.c
@@ -20,16 +20,19 @@
 #include <stdarg.h>
 #include <stdlib.h>
 #include <stdio.h>
+#include <stdbool.h>
 #include <unistd.h>
 #include <fcntl.h>
 #include <inttypes.h>
 #include <string.h>
 #include <sys/types.h>
+#include <syscall.h>
 #include <sys/stat.h>
 #include <sys/wait.h>
 #include <errno.h>
 #include <utime.h>
 #include <time.h>
+#include <sys/prctl.h>
 #include <sys/time.h>
 #include <sys/resource.h>
 #include <sys/uio.h>
@@ -41,6 +44,7 @@
 #include <setjmp.h>
 #include <sys/shm.h>
 #include <assert.h>
+#include <pthread.h>
 
 #define STACK_SIZE 16384
 
@@ -368,14 +372,12 @@ static void test_pipe(void)
     chk_error(close(fds[1]));
 }
 
-static int thread1_res;
-static int thread2_res;
-
 static int thread1_func(void *arg)
 {
+    int *res = (int *) arg;
     int i;
     for(i=0;i<5;i++) {
-        thread1_res++;
+        (*res)++;
         usleep(10 * 1000);
     }
     return 0;
@@ -383,9 +385,10 @@ static int thread1_func(void *arg)
 
 static int thread2_func(void *arg)
 {
+    int *res = (int *) arg;
     int i;
     for(i=0;i<6;i++) {
-        thread2_res++;
+        (*res)++;
         usleep(10 * 1000);
     }
     return 0;
@@ -405,25 +408,27 @@ static void test_clone(void)
     uint8_t *stack1, *stack2;
     pid_t pid1, pid2;
 
+    int t1 = 0, t2 = 0;
+
     stack1 = malloc(STACK_SIZE);
     pid1 = chk_error(clone(thread1_func, stack1 + STACK_SIZE,
                            CLONE_VM | SIGCHLD,
-                            "hello1"));
+                            &t1));
 
     stack2 = malloc(STACK_SIZE);
     pid2 = chk_error(clone(thread2_func, stack2 + STACK_SIZE,
                            CLONE_VM | CLONE_FS | CLONE_FILES |
                            CLONE_SIGHAND | CLONE_SYSVSEM | SIGCHLD,
-                           "hello2"));
+                           &t2));
 
     wait_for_child(pid1);
     free(stack1);
     wait_for_child(pid2);
     free(stack2);
 
-    if (thread1_res != 5 ||
-        thread2_res != 6)
+    if (t1 != 5 || t2 != 6) {
         error("clone");
+    }
 }
 
 /***********************************/
@@ -562,6 +567,7 @@ static void test_clone_signal_count(void)
      * SIGCHLD.
      */
     chk_error(waitpid(pid, &status, __WCLONE));
+    free(child_stack);
 
     chk_error(sigaction(SIGRTMIN, &prev, NULL));
 
@@ -571,6 +577,139 @@ static void test_clone_signal_count(void)
     }
 }
 
+struct test_clone_pdeathsig_info {
+    uint8_t *child_stack;
+    pthread_mutex_t notify_test_mutex;
+    pthread_cond_t notify_test_cond;
+    pthread_mutex_t notify_parent_mutex;
+    pthread_cond_t notify_parent_cond;
+    bool signal_received;
+};
+
+static int test_clone_pdeathsig_child(void *arg)
+{
+    struct test_clone_pdeathsig_info *info =
+        (struct test_clone_pdeathsig_info *) arg;
+    sigset_t wait_on, block_all;
+    siginfo_t sinfo;
+    struct timespec timeout;
+    int ret;
+
+    /* Block all signals, so SIGUSR1 will be pending when we wait on it. */
+    sigfillset(&block_all);
+    chk_error(sigprocmask(SIG_BLOCK, &block_all, NULL));
+
+    chk_error(prctl(PR_SET_PDEATHSIG, SIGUSR1));
+
+    pthread_mutex_lock(&info->notify_parent_mutex);
+    pthread_cond_broadcast(&info->notify_parent_cond);
+    pthread_mutex_unlock(&info->notify_parent_mutex);
+
+    sigemptyset(&wait_on);
+    sigaddset(&wait_on, SIGUSR1);
+    timeout.tv_sec = 0;
+    timeout.tv_nsec = 300 * 1000 * 1000;  /* 300ms */
+
+    ret = sigtimedwait(&wait_on, &sinfo, &timeout);
+
+    if (ret < 0 && errno != EAGAIN) {
+        error("%m (ret=%d, errno=%d/%s)", ret, errno, strerror(errno));
+    }
+    if (ret == SIGUSR1) {
+        info->signal_received = true;
+    }
+    pthread_mutex_lock(&info->notify_test_mutex);
+    pthread_cond_broadcast(&info->notify_test_cond);
+    pthread_mutex_unlock(&info->notify_test_mutex);
+    _exit(0);
+}
+
+static int test_clone_pdeathsig_parent(void *arg)
+{
+    struct test_clone_pdeathsig_info *info =
+        (struct test_clone_pdeathsig_info *) arg;
+
+    pthread_mutex_lock(&info->notify_parent_mutex);
+
+    chk_error(clone(
+        test_clone_pdeathsig_child,
+        info->child_stack + STACK_SIZE,
+        CLONE_VM,
+        info
+    ));
+
+    /* No need to reap the child, it will get reaped by init. */
+
+    /* Wait for the child to signal that they have set up PDEATHSIG. */
+    pthread_cond_wait(&info->notify_parent_cond, &info->notify_parent_mutex);
+    pthread_mutex_unlock(&info->notify_parent_mutex);  /* avoid UB on destroy */
+
+    _exit(0);
+}
+
+/*
+ * This checks that cloned children have the correct parent/child
+ * relationship using PDEATHSIG. PDEATHSIG is based on kernel task hierarchy,
+ * rather than "process" hierarchy, so it should be pretty sensitive to
+ * breakages. PDEATHSIG is also a widely used feature, so it's important
+ * that it behaves correctly.
+ *
+ * This test works by spawning a child process (the parent) which then spawns
+ * its own child (the child). The child registers a PDEATHSIG handler and
+ * then notifies the parent, which exits. The child then waits for the
+ * PDEATHSIG signal it registered, reports whether or not the signal was
+ * received within a small time window, and then notifies the test runner
+ * (this function) that the test is finished.
+ */
+static void test_clone_pdeathsig(void)
+{
+    uint8_t *parent_stack;
+    struct test_clone_pdeathsig_info info;
+    pid_t pid;
+    int status;
+
+    memset(&info, 0, sizeof(info));
+
+    /*
+     * Setup condition variables, so we can be notified once the final child
+     * observes the PDEATHSIG signal from its parent exiting. When the parent
+     * exits, the child will be orphaned, so we can't use `wait*` to wait for
+     * it to finish.
+     */
+    chk_error(pthread_mutex_init(&info.notify_test_mutex, NULL));
+    chk_error(pthread_cond_init(&info.notify_test_cond, NULL));
+    chk_error(pthread_mutex_init(&info.notify_parent_mutex, NULL));
+    chk_error(pthread_cond_init(&info.notify_parent_cond, NULL));
+
+    parent_stack = malloc(STACK_SIZE);
+    info.child_stack = malloc(STACK_SIZE);
+
+    pthread_mutex_lock(&info.notify_test_mutex);
+
+    pid = chk_error(clone(
+        test_clone_pdeathsig_parent,
+        parent_stack + STACK_SIZE,
+        CLONE_VM,
+        &info
+    ));
+
+    pthread_cond_wait(&info.notify_test_cond, &info.notify_test_mutex);
+    pthread_mutex_unlock(&info.notify_test_mutex);
+    chk_error(waitpid(pid, &status, __WCLONE));  /* reap the parent */
+
+    free(parent_stack);
+    free(info.child_stack);
+
+    pthread_cond_destroy(&info.notify_parent_cond);
+    pthread_mutex_destroy(&info.notify_parent_mutex);
+    pthread_cond_destroy(&info.notify_test_cond);
+    pthread_mutex_destroy(&info.notify_test_mutex);
+
+    if (!info.signal_received) {
+        error("child did not receive PDEATHSIG on parent death");
+    }
+}
+
 int main(int argc, char **argv)
 {
     test_file();
@@ -580,8 +719,9 @@ int main(int argc, char **argv)
     test_socket();
     test_clone();
     test_clone_signal_count();
-
+    test_clone_pdeathsig();
     test_signal();
     test_shm();
+
     return 0;
 }
-- 
2.27.0.290.gba653c62da-goog



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/5] linux-user: Support extended clone(CLONE_VM)
  2020-06-12  1:46 [PATCH 0/5] linux-user: Support extended clone(CLONE_VM) Josh Kunz
                   ` (4 preceding siblings ...)
  2020-06-12  1:46 ` [PATCH 5/5] linux-user: Add PDEATHSIG test for clone process hierarchy Josh Kunz
@ 2020-06-12  3:48 ` no-reply
  2020-06-13 11:16 ` Alex Bennée
  2020-06-16 16:02 ` Alex Bennée
  7 siblings, 0 replies; 18+ messages in thread
From: no-reply @ 2020-06-12  3:48 UTC (permalink / raw)
  To: jkz; +Cc: jkz, riku.voipio, alex.bennee, qemu-devel, laurent

Patchew URL: https://patchew.org/QEMU/20200612014606.147691-1-jkz@google.com/



Hi,

This series failed the docker-quick@centos7 build test. Please find the testing commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
make docker-image-centos7 V=1 NETWORK=1
time make docker-test-quick@centos7 SHOW_ENV=1 J=14 NETWORK=1
=== TEST SCRIPT END ===

+++ /tmp/qemu-test/build/tests/qemu-iotests/041.out.bad 2020-06-12 03:41:14.015076859 +0000
@@ -1,5 +1,30 @@
-........................................................................................................
+WARNING:qemu.machine:qemu received signal 9: /tmp/qemu-test/build/tests/qemu-iotests/../../x86_64-softmmu/qemu-system-x86_64 -display none -vga none -chardev socket,id=mon,path=/tmp/tmp.YHsXXgjzyL/qemu-22509-monitor.sock -mon chardev=mon,mode=control -qtest unix:path=/tmp/tmp.YHsXXgjzyL/qemu-22509-qtest.sock -accel qtest -nodefaults -display none -accel qtest -drive if=virtio,id=drive0,file=/tmp/qemu-test/test.img,format=qcow2,cache=writeback,aio=threads,node-name=top,backing.node-name=base -drive if=ide,id=drive1,media=cdrom
+......................................E.................................................................
+======================================================================
+ERROR: test_pause (__main__.TestSingleBlockdev)
+----------------------------------------------------------------------
+Traceback (most recent call last):
+  File "041", line 108, in test_pause
---
 Ran 104 tests
 
-OK
+FAILED (errors=1)
  TEST    iotest-qcow2: 042
  TEST    iotest-qcow2: 043
  TEST    iotest-qcow2: 046
---
Not run: 259
Failures: 041
Failed 1 of 119 iotests
make: *** [check-tests/check-block.sh] Error 1
Traceback (most recent call last):
  File "./tests/docker/docker.py", line 665, in <module>
    sys.exit(main())
---
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', '--label', 'com.qemu.instance.uuid=0b953bb4c5534a2f9f54bbb077c4a060', '-u', '1003', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=', '-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 'SHOW_ENV=1', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', '/home/patchew2/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', '/var/tmp/patchew-tester-tmp-fy1xwr_z/src/docker-src.2020-06-11-23.31.14.29203:/var/tmp/qemu:z,ro', 'qemu:centos7', '/var/tmp/qemu/run', 'test-quick']' returned non-zero exit status 2.
filter=--filter=label=com.qemu.instance.uuid=0b953bb4c5534a2f9f54bbb077c4a060
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-fy1xwr_z/src'
make: *** [docker-run-test-quick@centos7] Error 2

real    17m15.344s
user    0m8.554s


The full log is available at
http://patchew.org/logs/20200612014606.147691-1-jkz@google.com/testing.docker-quick@centos7/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] linux-user: Support CLONE_VM and extended clone options
  2020-06-12  1:46 ` [PATCH 4/5] linux-user: Support CLONE_VM and extended clone options Josh Kunz
@ 2020-06-13  0:10   ` Josh Kunz
  2020-06-16 16:08   ` Alex Bennée
  1 sibling, 0 replies; 18+ messages in thread
From: Josh Kunz @ 2020-06-13  0:10 UTC (permalink / raw)
  To: QEMU Developers; +Cc: Riku Voipio, Laurent Vivier, Alex Bennée

> +    child_tid = atomic_fetch_or(&mgr->managed_tid, 0);
> +    /*
> +     * Check if the child has already terminated by this point. If not, wait
> +     * for the child to exit. As long as the trampoline is not killed by
> +     * a signal, the kernel guarantees that the memory at &mgr->managed_tid
> +     * will be cleared, and a FUTEX_WAKE at that address will be triggered.
> +     */
> +    if (child_tid != 0) {
> +        ret = syscall(SYS_futex, &mgr->managed_tid, FUTEX_WAIT,
> +                      child_tid, NULL, NULL, 0);
> +        assert(ret == 0 && "clone manager futex should always succeed");
> +    }

A note for any reviewers/maintainers: While doing some additional
testing today, I discovered there is a bug in this section of the
patch. The child process can exit between the `atomic_fetch` and start
of the `futex(FUTEX_WAIT)` call, causing the kernel to respond with an
`EAGAIN` error, which will be caught by the assert and crash the
program. I have a patch for this. I suspect there will be comments on
this change, so I'm holding off on re-sending the series until initial
reviews have been done. I just wanted to make maintainers aware to
avoid the possibility of this bug being merged in the (very) unlikely
case there are no comments.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/5] linux-user: Support extended clone(CLONE_VM)
  2020-06-12  1:46 [PATCH 0/5] linux-user: Support extended clone(CLONE_VM) Josh Kunz
                   ` (5 preceding siblings ...)
  2020-06-12  3:48 ` [PATCH 0/5] linux-user: Support extended clone(CLONE_VM) no-reply
@ 2020-06-13 11:16 ` Alex Bennée
  2020-06-16 16:02 ` Alex Bennée
  7 siblings, 0 replies; 18+ messages in thread
From: Alex Bennée @ 2020-06-13 11:16 UTC (permalink / raw)
  To: Josh Kunz; +Cc: riku.voipio, qemu-devel, laurent


Josh Kunz <jkz@google.com> writes:

> This patch series implements extended support for the `clone` system
> call. As best I can tell, any option combination including `CLONE_VM`
> should be supported with the addition of this patch series. The
> implementation is described in greater detail in the patches themselves.
>
> Testing:
>
>   * All targets built on x86_64.
>   * `make check` and `make check-tcg` are passing. Additional tests have
>     been added to `linux-test.c` to validate clone behavior.
>
> Caveats:
>
>   * This series touches, but does not fix, several bits of code that are
>     racey (namely the sigact table and the fd trans table).
>   * `exit_group` does not perform the appropriate cleanup for non-thread
>     children created with `CLONE_VM`. CPUs for such children are never
>     cleaned up. The correct implementation of exit-group is non-trivial
>     (since it also needs to track/handle cleanup for threads in the
>     clone'd child process). Also, I don't fully understand the
>     interaction between QOM<->linux-user.

When the QOM object gets unrefed for the final time it should cause a
bunch of clean-up in the common vCPU management code where things like
plugin cleanup are done.

This was recently touched in 1f81ce90e31ef338ee53a0cea02344237bc470cc
where I removed linux-user messing around with the active cpu list and
left it to the core code to deal with. Previously it wasn't being
properly unrealized.

>     My naive implementation based
>     on the current implementation `exit(2)` was regularly crashing. If
>     maintainers have suggestions for better ways to handle exit_group,
>     they would be greatly appreciated. 
>   * execve does not clean up the CPUs of clone'd children, for the same
>     reasons as `exit_group`.
>
> Josh Kunz (5):
>   linux-user: Refactor do_fork to use new `qemu_clone`
>   linux-user: Make fd_trans task-specific.
>   linux-user: Make sigact_table part of the task state.
>   linux-user: Support CLONE_VM and extended clone options
>   linux-user: Add PDEATHSIG test for clone process hierarchy.
>
>  linux-user/Makefile.objs            |   2 +-
>  linux-user/clone.c                  | 565 ++++++++++++++++++++++++++++
>  linux-user/clone.h                  |  27 ++
>  linux-user/fd-trans-tbl.c           |  13 +
>  linux-user/fd-trans-type.h          |  17 +
>  linux-user/fd-trans.c               |   3 -
>  linux-user/fd-trans.h               |  75 ++--
>  linux-user/main.c                   |   1 +
>  linux-user/qemu.h                   |  49 +++
>  linux-user/signal.c                 |  84 ++++-
>  linux-user/syscall.c                | 452 ++++++++++++----------
>  tests/tcg/multiarch/Makefile.target |   3 +
>  tests/tcg/multiarch/linux-test.c    | 227 ++++++++++-
>  13 files changed, 1264 insertions(+), 254 deletions(-)
>  create mode 100644 linux-user/clone.c
>  create mode 100644 linux-user/clone.h
>  create mode 100644 linux-user/fd-trans-tbl.c
>  create mode 100644 linux-user/fd-trans-type.h


-- 
Alex Bennée


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/5] linux-user: Refactor do_fork to use new `qemu_clone`
  2020-06-12  1:46 ` [PATCH 1/5] linux-user: Refactor do_fork to use new `qemu_clone` Josh Kunz
@ 2020-06-16 15:51   ` Alex Bennée
  0 siblings, 0 replies; 18+ messages in thread
From: Alex Bennée @ 2020-06-16 15:51 UTC (permalink / raw)
  To: Josh Kunz; +Cc: riku.voipio, qemu-devel, laurent


Josh Kunz <jkz@google.com> writes:

> This is pre-work for adding full support for the `CLONE_VM` `clone`
> flag. In a follow-up patch, we'll add support to `clone.c` for
> `clone_vm`-type clones beyond threads. CLONE_VM support is more
> complicated, so first we're splitting existing clone mechanisms
> (pthread_create, and fork) into a separate file.
>
> Signed-off-by: Josh Kunz <jkz@google.com>
> ---
>  linux-user/Makefile.objs |   2 +-
>  linux-user/clone.c       | 152 ++++++++++++++++
>  linux-user/clone.h       |  27 +++
>  linux-user/syscall.c     | 376 +++++++++++++++++++--------------------
>  4 files changed, 365 insertions(+), 192 deletions(-)
>  create mode 100644 linux-user/clone.c
>  create mode 100644 linux-user/clone.h
>
> diff --git a/linux-user/Makefile.objs b/linux-user/Makefile.objs
> index 1940910a73..d6788f012c 100644
> --- a/linux-user/Makefile.objs
> +++ b/linux-user/Makefile.objs
> @@ -1,7 +1,7 @@
>  obj-y = main.o syscall.o strace.o mmap.o signal.o \
>  	elfload.o linuxload.o uaccess.o uname.o \
>  	safe-syscall.o $(TARGET_ABI_DIR)/signal.o \
> -        $(TARGET_ABI_DIR)/cpu_loop.o exit.o fd-trans.o
> +        $(TARGET_ABI_DIR)/cpu_loop.o exit.o fd-trans.o clone.o
>  
>  obj-$(TARGET_HAS_BFLT) += flatload.o
>  obj-$(TARGET_I386) += vm86.o
> diff --git a/linux-user/clone.c b/linux-user/clone.c
> new file mode 100644
> index 0000000000..f02ae8c464
> --- /dev/null
> +++ b/linux-user/clone.c
> @@ -0,0 +1,152 @@
> +#include "qemu/osdep.h"
> +#include "qemu.h"
> +#include "clone.h"
> +#include <pthread.h>
> +#include <sched.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <sys/syscall.h>
> +#include <sys/types.h>
> +#include <sys/wait.h>
> +#include <unistd.h>
> +#include <stdbool.h>
> +#include <assert.h>
> +
> +static const unsigned long NEW_STACK_SIZE = 0x40000UL;
> +
> +/*
> + * A completion tracks an event that can be completed. It's based on the
> + * kernel concept with the same name, but implemented with userspace locks.
> + */
> +struct completion {
> +    /* done is set once this completion has been completed. */
> +    bool done;
> +    /* mu synchronizes access to this completion. */
> +    pthread_mutex_t mu;
> +    /* cond is used to broadcast completion status to awaiting threads. */
> +    pthread_cond_t cond;
> +};
> +
> +static void completion_init(struct completion *c)
> +{
> +    c->done = false;
> +    pthread_mutex_init(&c->mu, NULL);
> +    pthread_cond_init(&c->cond, NULL);
> +}
> +
> +/*
> + * Block until the given completion finishes. Returns immediately if the
> + * completion has already finished.
> + */
> +static void completion_await(struct completion *c)
> +{
> +    pthread_mutex_lock(&c->mu);
> +    if (c->done) {
> +        pthread_mutex_unlock(&c->mu);
> +        return;
> +    }
> +    pthread_cond_wait(&c->cond, &c->mu);
> +    assert(c->done && "returned from cond wait without being marked as done");
> +    pthread_mutex_unlock(&c->mu);
> +}

This introduces another sync mechanism specifically for clone behaviour
- is there a reason one of the existing mechanisms can't be used? If this
brings new useful functionality it might be worth introducing it as a
system wide function.

> +
> +/*
> + * Finish the completion. Unblocks all awaiters.
> + */
> +static void completion_finish(struct completion *c)
> +{
> +    pthread_mutex_lock(&c->mu);
> +    assert(!c->done && "trying to finish an already finished completion");
> +    c->done = true;
> +    pthread_cond_broadcast(&c->cond);
> +    pthread_mutex_unlock(&c->mu);
> +}
> +
> +struct clone_thread_info {
> +    struct completion running;
> +    int tid;
> +    int (*callback)(void *);
> +    void *payload;
> +};
> +
> +static void *clone_thread_run(void *raw_info)
> +{
> +    struct clone_thread_info *info = (struct clone_thread_info *) raw_info;
> +    info->tid = syscall(SYS_gettid);
> +
> +    /*
> +     * Save out callback/payload since lifetime of info is only guaranteed
> +     * until we finish the completion.
> +     */
> +    int (*callback)(void *) = info->callback;
> +    void *payload = info->payload;
> +    completion_finish(&info->running);
> +
> +    _exit(callback(payload));
> +}
> +
> +static int clone_thread(int flags, int (*callback)(void *), void
> *payload)

It's nicer to typedef a function call type rather than manually casting
to it each time.

> +{
> +    struct clone_thread_info info;
> +    pthread_attr_t attr;
> +    int ret;
> +    pthread_t thread_unused;
> +
> +    memset(&info, 0, sizeof(info));
> +
> +    completion_init(&info.running);
> +    info.callback = callback;
> +    info.payload = payload;
> +
> +    (void)pthread_attr_init(&attr);
> +    (void)pthread_attr_setstacksize(&attr, NEW_STACK_SIZE);
> +    (void)pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
> +
> +    ret = pthread_create(&thread_unused, &attr, clone_thread_run, (void *) &info);
> +    /* pthread_create returns errors directly, instead of via errno. */
> +    if (ret != 0) {
> +        errno = ret;
> +        ret = -1;
> +    } else {
> +        completion_await(&info.running);
> +        ret = info.tid;
> +    }
> +
> +    pthread_attr_destroy(&attr);
> +    return ret;
> +}
> +
> +int qemu_clone(int flags, int (*callback)(void *), void *payload)
> +{
> +    int ret;
> +
> +    if (clone_flags_are_thread(flags)) {
> +        /*
> +         * The new process uses the same flags as pthread_create, so we can
> +         * use pthread_create directly. This is an optimization.
> +         */
> +        return clone_thread(flags, callback, payload);
> +    }
> +
> +    if (clone_flags_are_fork(flags)) {
> +        /*
> +         * Special case a true `fork` clone call. This is so we can take
> +         * advantage of special pthread_atfork handlers in libraries we
> +         * depend on (e.g., glibc). Without this, existing users of `fork`
> +         * in multi-threaded environments will likely get new flaky
> +         * deadlocks.
> +         */
> +        fork_start();
> +        ret = fork();
> +        if (ret == 0) {
> +            fork_end(1);
> +            _exit(callback(payload));
> +        }
> +        fork_end(0);
> +        return ret;
> +    }
> +
> +    /* !fork && !thread */
> +    errno = EINVAL;
> +    return -1;
> +}
> diff --git a/linux-user/clone.h b/linux-user/clone.h
> new file mode 100644
> index 0000000000..34ae9b3780
> --- /dev/null
> +++ b/linux-user/clone.h
> @@ -0,0 +1,27 @@
> +#ifndef CLONE_H
> +#define CLONE_H
> +
> +/*
> + * qemu_clone executes the given `callback`, with the given payload as the
> + * first argument, in a new process created with the given flags. Some clone
> + * flags, such as *SETTLS, *CLEARTID are not supported. The child thread ID is
> + * returned on success, otherwise negative errno is returned on clone failure.
> + */
> +int qemu_clone(int flags, int (*callback)(void *), void *payload);
> +
> +/* Returns true if the given clone flags can be emulated with libc fork. */
> +static bool clone_flags_are_fork(unsigned int flags)
> +{
> +    return flags == SIGCHLD;
> +}
> +
> +/* Returns true if the given clone flags can be emulated with pthread_create. */
> +static bool clone_flags_are_thread(unsigned int flags)
> +{
> +    return flags == (
> +        CLONE_VM | CLONE_FS | CLONE_FILES |
> +        CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM
> +    );
> +}

This is fine in and of itself but we seem to have lost some context from
moving the flags out of syscall.c.

> +
> +#endif /* CLONE_H */
> diff --git a/linux-user/syscall.c b/linux-user/syscall.c
> index 97de9fb5c9..7ce021cea2 100644
> --- a/linux-user/syscall.c
> +++ b/linux-user/syscall.c
> @@ -122,6 +122,7 @@
>  #include "qapi/error.h"
>  #include "fd-trans.h"
>  #include "tcg/tcg.h"
> +#include "clone.h"
>  
>  #ifndef CLONE_IO
>  #define CLONE_IO                0x80000000      /* Clone io context */
> @@ -135,12 +136,6 @@
>   *  * flags we can implement within QEMU itself
>   *  * flags we can't support and will return an error for
>   */
> -/* For thread creation, all these flags must be present; for
> - * fork, none must be present.
> - */
> -#define CLONE_THREAD_FLAGS                              \
> -    (CLONE_VM | CLONE_FS | CLONE_FILES |                \
> -     CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM)
>  
>  /* These flags are ignored:
>   * CLONE_DETACHED is now ignored by the kernel;
> @@ -150,30 +145,10 @@
>      (CLONE_DETACHED | CLONE_IO)
>  
>  /* Flags for fork which we can implement within QEMU itself */
> -#define CLONE_OPTIONAL_FORK_FLAGS               \
> +#define CLONE_EMULATED_FLAGS               \
>      (CLONE_SETTLS | CLONE_PARENT_SETTID |       \
>       CLONE_CHILD_CLEARTID | CLONE_CHILD_SETTID)
>  
> -/* Flags for thread creation which we can implement within QEMU itself */
> -#define CLONE_OPTIONAL_THREAD_FLAGS                             \
> -    (CLONE_SETTLS | CLONE_PARENT_SETTID |                       \
> -     CLONE_CHILD_CLEARTID | CLONE_CHILD_SETTID | CLONE_PARENT)
> -
> -#define CLONE_INVALID_FORK_FLAGS                                        \
> -    (~(CSIGNAL | CLONE_OPTIONAL_FORK_FLAGS | CLONE_IGNORED_FLAGS))
> -
> -#define CLONE_INVALID_THREAD_FLAGS                                      \
> -    (~(CSIGNAL | CLONE_THREAD_FLAGS | CLONE_OPTIONAL_THREAD_FLAGS |     \
> -       CLONE_IGNORED_FLAGS))
> -
> -/* CLONE_VFORK is special cased early in do_fork(). The other flag bits
> - * have almost all been allocated. We cannot support any of
> - * CLONE_NEWNS, CLONE_NEWCGROUP, CLONE_NEWUTS, CLONE_NEWIPC,
> - * CLONE_NEWUSER, CLONE_NEWPID, CLONE_NEWNET, CLONE_PTRACE, CLONE_UNTRACED.
> - * The checks against the invalid thread masks above will catch these.
> - * (The one remaining unallocated bit is 0x1000 which used to be CLONE_PID.)
> - */
> -

I think some of the above can be usefully moved to clone.h to discuss
the various clone/fork options QEMU can and can't support.

>  /* Define DEBUG_ERESTARTSYS to force every syscall to be restarted
>   * once. This exercises the codepaths for restart.
>   */
> @@ -1104,7 +1079,7 @@ static inline rlim_t target_to_host_rlim(abi_ulong target_rlim)
>  {
>      abi_ulong target_rlim_swap;
>      rlim_t result;
> -    
> +
>      target_rlim_swap = tswapal(target_rlim);
>      if (target_rlim_swap == TARGET_RLIM_INFINITY)
>          return RLIM_INFINITY;
> @@ -1112,7 +1087,7 @@ static inline rlim_t target_to_host_rlim(abi_ulong target_rlim)
>      result = target_rlim_swap;
>      if (target_rlim_swap != (rlim_t)result)
>          return RLIM_INFINITY;
> -    
> +
>      return result;
>  }
>  #endif
> @@ -1122,13 +1097,13 @@ static inline abi_ulong host_to_target_rlim(rlim_t rlim)
>  {
>      abi_ulong target_rlim_swap;
>      abi_ulong result;
> -    
> +
>      if (rlim == RLIM_INFINITY || rlim != (abi_long)rlim)
>          target_rlim_swap = TARGET_RLIM_INFINITY;
>      else
>          target_rlim_swap = rlim;
>      result = tswapal(target_rlim_swap);
> -    
> +
>      return result;
>  }
>  #endif
> @@ -1615,10 +1590,11 @@ static inline abi_long target_to_host_cmsg(struct msghdr *msgh,
>      abi_ulong target_cmsg_addr;
>      struct target_cmsghdr *target_cmsg, *target_cmsg_start;
>      socklen_t space = 0;
> -    
> +
>      msg_controllen = tswapal(target_msgh->msg_controllen);
> -    if (msg_controllen < sizeof (struct target_cmsghdr)) 
> +    if (msg_controllen < sizeof(struct target_cmsghdr)) {
>          goto the_end;
> +    }
>      target_cmsg_addr = tswapal(target_msgh->msg_control);
>      target_cmsg = lock_user(VERIFY_READ, target_cmsg_addr, msg_controllen, 1);
>      target_cmsg_start = target_cmsg;
> @@ -1703,8 +1679,9 @@ static inline abi_long host_to_target_cmsg(struct target_msghdr *target_msgh,
>      socklen_t space = 0;
>  
>      msg_controllen = tswapal(target_msgh->msg_controllen);
> -    if (msg_controllen < sizeof (struct target_cmsghdr)) 
> +    if (msg_controllen < sizeof(struct target_cmsghdr)) {
>          goto the_end;
> +    }

Try and avoid unrelated whitespace fixes in code that is otherwise unchanged.

>      target_cmsg_addr = tswapal(target_msgh->msg_control);
>      target_cmsg = lock_user(VERIFY_WRITE, target_cmsg_addr, msg_controllen, 0);
>      target_cmsg_start = target_cmsg;
> @@ -5750,9 +5727,10 @@ abi_long do_set_thread_area(CPUX86State *env, abi_ulong ptr)
>      }
>      unlock_user_struct(target_ldt_info, ptr, 1);
>  
> -    if (ldt_info.entry_number < TARGET_GDT_ENTRY_TLS_MIN || 
> -        ldt_info.entry_number > TARGET_GDT_ENTRY_TLS_MAX)
> -           return -TARGET_EINVAL;
> +    if (ldt_info.entry_number < TARGET_GDT_ENTRY_TLS_MIN ||
> +        ldt_info.entry_number > TARGET_GDT_ENTRY_TLS_MAX) {
> +        return -TARGET_EINVAL;
> +    }
>      seg_32bit = ldt_info.flags & 1;
>      contents = (ldt_info.flags >> 1) & 3;
>      read_exec_only = (ldt_info.flags >> 3) & 1;
> @@ -5828,7 +5806,7 @@ static abi_long do_get_thread_area(CPUX86State *env, abi_ulong ptr)
>      lp = (uint32_t *)(gdt_table + idx);
>      entry_1 = tswap32(lp[0]);
>      entry_2 = tswap32(lp[1]);
> -    
> +
>      read_exec_only = ((entry_2 >> 9) & 1) ^ 1;
>      contents = (entry_2 >> 10) & 3;
>      seg_not_present = ((entry_2 >> 15) & 1) ^ 1;
> @@ -5844,8 +5822,8 @@ static abi_long do_get_thread_area(CPUX86State *env, abi_ulong ptr)
>          (read_exec_only << 3) | (limit_in_pages << 4) |
>          (seg_not_present << 5) | (useable << 6) | (lm << 7);
>      limit = (entry_1 & 0xffff) | (entry_2  & 0xf0000);
> -    base_addr = (entry_1 >> 16) | 
> -        (entry_2 & 0xff000000) | 
> +    base_addr = (entry_1 >> 16) |
> +        (entry_2 & 0xff000000) |
>          ((entry_2 & 0xff) << 16);
>      target_ldt_info->base_addr = tswapal(base_addr);
>      target_ldt_info->limit = tswap32(limit);
> @@ -5895,53 +5873,71 @@ abi_long do_arch_prctl(CPUX86State *env, int code, abi_ulong addr)
>  
>  #endif /* defined(TARGET_I386) */
>  
> -#define NEW_STACK_SIZE 0x40000
> -
> -
>  static pthread_mutex_t clone_lock = PTHREAD_MUTEX_INITIALIZER;
>  typedef struct {
> -    CPUArchState *env;
> +    /* Used to synchronize thread/process creation between parent and child. */
>      pthread_mutex_t mutex;
>      pthread_cond_t cond;
> -    pthread_t thread;
> -    uint32_t tid;
> +    /*
> +     * Guest pointers for implementing CLONE_PARENT_SETTID
> +     * and CLONE_CHILD_SETTID.
> +     */
>      abi_ulong child_tidptr;
>      abi_ulong parent_tidptr;
> -    sigset_t sigmask;
> -} new_thread_info;
> +    struct {
> +        sigset_t sigmask;
> +        CPUArchState *env;
> +        bool register_thread;
> +        bool signal_setup;
> +    } child;
> +} clone_info;
>  
> -static void *clone_func(void *arg)
> +static int clone_run(void *arg)
>  {
> -    new_thread_info *info = arg;
> +    clone_info *info = (clone_info *) arg;
>      CPUArchState *env;
>      CPUState *cpu;
>      TaskState *ts;
> +    uint32_t tid;
>  
> -    rcu_register_thread();
> -    tcg_register_thread();
> -    env = info->env;
> +    if (info->child.register_thread) {
> +        rcu_register_thread();
> +        tcg_register_thread();
> +    }
> +
> +    env = info->child.env;
>      cpu = env_cpu(env);
>      thread_cpu = cpu;
>      ts = (TaskState *)cpu->opaque;
> -    info->tid = sys_gettid();
> +    tid = sys_gettid();
>      task_settid(ts);
> -    if (info->child_tidptr)
> -        put_user_u32(info->tid, info->child_tidptr);
> -    if (info->parent_tidptr)
> -        put_user_u32(info->tid, info->parent_tidptr);
> +
>      qemu_guest_random_seed_thread_part2(cpu->random_seed);
> -    /* Enable signals.  */
> -    sigprocmask(SIG_SETMASK, &info->sigmask, NULL);
> -    /* Signal to the parent that we're ready.  */
> -    pthread_mutex_lock(&info->mutex);
> -    pthread_cond_broadcast(&info->cond);
> -    pthread_mutex_unlock(&info->mutex);
> -    /* Wait until the parent has finished initializing the tls state.  */
> -    pthread_mutex_lock(&clone_lock);
> -    pthread_mutex_unlock(&clone_lock);
> +
> +    if (info->parent_tidptr) {
> +        /*
> +         * Even when memory is not shared, parent_tidptr is set before the
> +         * process copy, so we need to set it in the child.
> +         */
> +        put_user_u32(tid, info->parent_tidptr);
> +    }
> +
> +    if (info->child_tidptr) {
> +        put_user_u32(tid, info->child_tidptr);
> +    }
> +
> +    /* Enable signals. */
> +    sigprocmask(SIG_SETMASK, &info->child.sigmask, NULL);
> +
> +    if (info->child.signal_setup) {
> +        pthread_mutex_lock(&info->mutex);
> +        pthread_cond_broadcast(&info->cond);
> +        pthread_mutex_unlock(&info->mutex);
> +    }
> +
>      cpu_loop(env);
>      /* never exits */
> -    return NULL;
> +    _exit(1);  /* avoid compiler warning. */
>  }
>  
>  /* do_fork() Must return host values and target errnos (unlike most
> @@ -5951,139 +5947,131 @@ static int do_fork(CPUArchState *env, unsigned int flags, abi_ulong newsp,
>                     abi_ulong child_tidptr)
>  {
>      CPUState *cpu = env_cpu(env);
> -    int ret;
> +    int proc_flags, host_sig, ret;
>      TaskState *ts;
>      CPUState *new_cpu;
> -    CPUArchState *new_env;
> -    sigset_t sigmask;
> +    sigset_t block_sigmask;
> +    sigset_t orig_sigmask;
> +    clone_info info;
> +    TaskState *parent_ts = (TaskState *)cpu->opaque;
>  
> -    flags &= ~CLONE_IGNORED_FLAGS;
> +    memset(&info, 0, sizeof(info));
> +
> +    /*
> +     * When cloning the actual subprocess, we don't need to worry about any
> +     * flags that can be ignored, or emulated in QEMU. proc_flags holds only
> +     * the flags that need to be passed to `clone` itself.
> +     */
> +    proc_flags = flags & ~(CLONE_EMULATED_FLAGS | CLONE_IGNORED_FLAGS);
> +
> +    /*
> +     * The exit signal is included in the flags. That signal needs to be mapped
> +     * to the appropriate host signal, and we need to check if that signal is
> +     * supported.
> +     */
> +    host_sig = target_to_host_signal(proc_flags & CSIGNAL);
> +    if (host_sig > SIGRTMAX) {
> +        qemu_log_mask(LOG_UNIMP,
> +                      "guest signal %d not supported for exit_signal",
> +                      proc_flags & CSIGNAL);
> +        return -TARGET_EINVAL;
> +    }
> +    proc_flags = (proc_flags & ~CSIGNAL) | host_sig;
>  
>      /* Emulate vfork() with fork() */
> -    if (flags & CLONE_VFORK)
> -        flags &= ~(CLONE_VFORK | CLONE_VM);
> +    if (proc_flags & CLONE_VFORK) {
> +        proc_flags &= ~(CLONE_VFORK | CLONE_VM);
> +    }
>  
> -    if (flags & CLONE_VM) {
> -        TaskState *parent_ts = (TaskState *)cpu->opaque;
> -        new_thread_info info;
> -        pthread_attr_t attr;
> +    if (!clone_flags_are_fork(proc_flags) &&
> +        !clone_flags_are_thread(proc_flags)) {
> +        qemu_log_mask(LOG_UNIMP, "unsupported clone flags");
> +        return -TARGET_EINVAL;
> +    }
>  
> -        if (((flags & CLONE_THREAD_FLAGS) != CLONE_THREAD_FLAGS) ||
> -            (flags & CLONE_INVALID_THREAD_FLAGS)) {
> -            return -TARGET_EINVAL;
> -        }
> +    pthread_mutex_init(&info.mutex, NULL);
> +    pthread_mutex_lock(&info.mutex);
> +    pthread_cond_init(&info.cond, NULL);
>  
> -        ts = g_new0(TaskState, 1);
> -        init_task_state(ts);
> +    ts = g_new0(TaskState, 1);
> +    init_task_state(ts);
>  
> -        /* Grab a mutex so that thread setup appears atomic.  */
> -        pthread_mutex_lock(&clone_lock);
> +    /* Guard CPU copy. It is not thread-safe. */

Why not - isn't that the point of the lock?

<snip>
>      case TARGET_NR_setdomainname:
>          if (!(p = lock_user_string(arg1)))
> @@ -10873,8 +10865,10 @@ static abi_long do_syscall1(void *cpu_env, int num, abi_long arg1,
>          return get_errno(fchown(arg1, low2highuid(arg2), low2highgid(arg3)));
>  #if defined(TARGET_NR_fchownat)
>      case TARGET_NR_fchownat:
> -        if (!(p = lock_user_string(arg2))) 
> +        p = lock_user_string(arg2)
> +        if (!p) {
>              return -TARGET_EFAULT;
> +        }

This has dropped a ; killing the build

-- 
Alex Bennée


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/5] linux-user: Support extended clone(CLONE_VM)
  2020-06-12  1:46 [PATCH 0/5] linux-user: Support extended clone(CLONE_VM) Josh Kunz
                   ` (6 preceding siblings ...)
  2020-06-13 11:16 ` Alex Bennée
@ 2020-06-16 16:02 ` Alex Bennée
  2020-06-16 16:32   ` Peter Maydell
  7 siblings, 1 reply; 18+ messages in thread
From: Alex Bennée @ 2020-06-16 16:02 UTC (permalink / raw)
  To: Josh Kunz; +Cc: riku.voipio, qemu-devel, laurent


Josh Kunz <jkz@google.com> writes:

> This patch series implements extended support for the `clone` system
> call. As best I can tell, any option combination including `CLONE_VM`
> should be supported with the addition of this patch series. The
> implementation is described in greater detail in the patches themselves.
>
> Testing:
>
>   * All targets built on x86_64.
>   * `make check` and `make check-tcg` are passing. Additional tests have
>     been added to `linux-test.c` to validate clone behavior.

Not for me - which probably means you don't have docker/podman setup or
cross compilers for all the various guests we have. See
tests/tcg/configure.sh

>
> Caveats:
>
>   * This series touches, but does not fix, several bits of code that are
>     racey (namely the sigact table and the fd trans table).
>   * `exit_group` does not perform the appropriate cleanup for non-thread
>     children created with `CLONE_VM`. CPUs for such children are never
>     cleaned up. The correct implementation of exit-group is non-trivial
>     (since it also needs to track/handle cleanup for threads in the
>     clone'd child process). Also, I don't fully understand the
>     interaction between QOM<->linux-user. My naive implementation based
>     on the current implementation `exit(2)` was regularly crashing. If
>     maintainers have suggestions for better ways to handle exit_group,
>     they would be greatly appreciated. 
>   * execve does not clean up the CPUs of clone'd children, for the same
>     reasons as `exit_group`.

Apart from "a more perfect emulation" is there a particular use case
served by the extra functionality? AIUI up until this point we've
basically supported glibc's use of clone() which has generally been
enough. I'm assuming you've come across stuff that needs this more fine
grained support?

>
> Josh Kunz (5):
>   linux-user: Refactor do_fork to use new `qemu_clone`
>   linux-user: Make fd_trans task-specific.
>   linux-user: Make sigact_table part of the task state.
>   linux-user: Support CLONE_VM and extended clone options
>   linux-user: Add PDEATHSIG test for clone process hierarchy.
>
>  linux-user/Makefile.objs            |   2 +-
>  linux-user/clone.c                  | 565 ++++++++++++++++++++++++++++
>  linux-user/clone.h                  |  27 ++
>  linux-user/fd-trans-tbl.c           |  13 +
>  linux-user/fd-trans-type.h          |  17 +
>  linux-user/fd-trans.c               |   3 -
>  linux-user/fd-trans.h               |  75 ++--
>  linux-user/main.c                   |   1 +
>  linux-user/qemu.h                   |  49 +++
>  linux-user/signal.c                 |  84 ++++-
>  linux-user/syscall.c                | 452 ++++++++++++----------
>  tests/tcg/multiarch/Makefile.target |   3 +
>  tests/tcg/multiarch/linux-test.c    | 227 ++++++++++-
>  13 files changed, 1264 insertions(+), 254 deletions(-)
>  create mode 100644 linux-user/clone.c
>  create mode 100644 linux-user/clone.h
>  create mode 100644 linux-user/fd-trans-tbl.c
>  create mode 100644 linux-user/fd-trans-type.h


-- 
Alex Bennée



* Re: [PATCH 4/5] linux-user: Support CLONE_VM and extended clone options
  2020-06-12  1:46 ` [PATCH 4/5] linux-user: Support CLONE_VM and extended clone options Josh Kunz
  2020-06-13  0:10   ` Josh Kunz
@ 2020-06-16 16:08   ` Alex Bennée
  2020-06-23  3:43     ` Josh Kunz
  1 sibling, 1 reply; 18+ messages in thread
From: Alex Bennée @ 2020-06-16 16:08 UTC (permalink / raw)
  To: Josh Kunz; +Cc: riku.voipio, qemu-devel, laurent


Josh Kunz <jkz@google.com> writes:

> The `clone` system call can be used to create new processes that share
> attributes with their parents, such as virtual memory, file
> system location, file descriptor tables, etc. These can be useful to a
> variety of guest programs.
>
> Before this patch, QEMU had support for a limited set of these attributes.
> Basically the ones needed for threads, and the options used by fork.
> This change adds support for all flag combinations involving CLONE_VM.
> In theory, almost all clone options could be supported, but invocations
> not using CLONE_VM are likely to run afoul of linux-user's inherently
> multi-threaded design.
>
> To add this support, this patch updates the `qemu_clone` helper. An
> overview of the mechanism used to support general `clone` options with
> CLONE_VM is described below.
>
> This patch also enables by-default the `clone` unit-tests in
> tests/tcg/multiarch/linux-test.c, and adds an additional test for duplicate
> exit signals, based on a bug found during development.

Which by the way fail on some targets:

    TEST    linux-test on alpha
  /home/alex/lsrc/qemu.git/tests/tcg/multiarch/linux-test.c:709: child did not receive PDEATHSIG on parent death
  make[2]: *** [../Makefile.target:153: run-linux-test] Error 1
  make[1]: *** [/home/alex/lsrc/qemu.git/tests/tcg/Makefile.qemu:76: run-guest-tests] Error 2
  make: *** [/home/alex/lsrc/qemu.git/tests/Makefile.include:851: run-tcg-tests-alpha-linux-user] Error 2

Have you managed a clean check-tcg with docker enabled so all the guest
architectures get tested?

>
> !! Overview
>
> Adding support for CLONE_VM is tricky. The parent and child process will
> share an address space (similar to threads), so the emulator must
> coordinate between the parent and the child. Currently, QEMU relies
> heavily on Thread Local Storage (TLS) as part of this coordination
> strategy. For threads, this works fine, because libc manages the
> thread-local data region used for TLS, when we create new threads using
> `pthread_create`. Ideally we could use the same mechanism for
> "process-local storage" needed to allow the parent/child processes to
> emulate in tandem. Unfortunately TLS is tightly integrated into libc.
> The only way to create TLS data regions is via the `pthread_create` API
> which also spawns a new thread (rather than a new processes, which is
> what we want). Worse still, TLS itself is a complicated arch-specific
> feature that is tightly integrated into the rest of libc and the dynamic
> linker. Re-implementing TLS support for QEMU would likely require a
> special dynamic linker / libc. Alternatively, the popular libcs could be
> extended, to allow for users to create TLS regions without creating
> threads. Even if major libcs decide to add this support, QEMU will still
> need a temporary work around until those libcs are widely deployed. It's
> also unclear if libcs will be interested in supporting this case, since
> TLS image creation is generally deeply integrated with thread setup.
>
> In this patch, I've employed an alternative approach: spawning a thread
> and "stealing" its TLS image for use in the child process. This approach
> leaves a dangling thread while the TLS image is in use, but by design
> that thread will not become schedulable until after the TLS data is no
> longer in-use by the child (as described in a moment). Therefore, it
> should cause relatively minimal overhead. When considered in the larger
> context, this seems like a reasonable tradeoff.

*sharp intake of breath*

OK so the solution to the complexity of handling threads is to add more
threads? cool cool cool....

>
> A major complication of this approach is knowing when it is safe to clean up
> the stack, and TLS image, used by a child process. When a child is
> created with `CLONE_VM` its stack, and TLS data, need to remain valid
> until that child has either exited, or successfully called `execve` (on
> `execve` the child is given a fresh memory map by the kernel). One approach would
> be to use `waitid(WNOWAIT)` (the `WNOWAIT` allows the guest to reap the
> child). The problem is that the `wait` family of calls only waits for
> termination. The pattern of `clone() ... execve()` for long running
> child processes is pretty common. If we waited for child processes to
> exit, it's likely we would end up using substantially more memory, and
> keep the suspended TLS thread around much longer than necessary.
> Instead, in this patch, I've used a "trampoline" process. The real
> parent first clones a trampoline, the trampoline then clones the
> ultimate child using the `CLONE_VFORK` option. `CLONE_VFORK` suspends
> the trampoline process until the child has exited, or called `execve`.
> Once the trampoline is re-scheduled, we know it is safe to clean up
> after the child. This creates one more suspended process, but typically,
> the trampoline only exists for a short period of time.
>
> !! CLONE_VM setup, step by step
>
> 1. First, the suspended thread whose TLS we will use is created using
>    `pthread_create`. The thread fetches and returns its "TLS pointer"
>    (an arch-specific value given to the kernel) to the parent. It then
>    blocks on a lock to prevent its TLS data from being cleaned up.
>    Ultimately the lock will be unlocked by the trampoline once the child
>    exits.
> 2. Once the TLS thread has fetched the TLS pointer, it notifies the real
>    parent thread, which calls `clone()` to create the trampoline
>    process. For ease of implementation, the TLS image is set for the
>    trampoline process during this step. This allows the trampoline to
>    use functions that require TLS if needed (e.g., printf). TLS location
>    is inherited when a new child is spawned, so this TLS data will
>    automatically be inherited by the child.
> 3. Once the trampoline has been spawned, it registers itself as a
>    "hidden" process with the signal subsystem. This prevents the exit
>    signal from the trampoline from ever being forwarded to the guest.
>    This is needed due to the way that Linux sets the exit signal for the
>    ultimate child when `CLONE_PARENT` is set. See the source for
>    details.
> 4. Once setup is complete, the trampoline spawns the final child with
>    the original clone flags, plus `CLONE_PARENT`, so the child is
>    correctly parented to the kernel task on which the guest invoked
>    `clone`. Without this, kernel features like PDEATHSIG, and
>    subreapers, would not work properly. As previously discussed, the
>    trampoline also supplies `CLONE_VFORK` so that it is suspended until
>    the child can be cleaned up.
> 5. Once the child is spawned, it signals the original parent thread that
>    it is running. At this point, the trampoline process is suspended
>    (due to CLONE_VFORK).
> 6. Finally, the call to `qemu_clone` in the parent is finished, the
>    child begins executing the given callback function in the new child
>    process.
>
> !! Cleaning up
>
> Clean up itself is a multi-step process. Once the child exits, or is
> killed by a signal (cleanup is the same in both cases), the trampoline
> process becomes schedulable. When the trampoline is scheduled, it frees
> the child stack, and unblocks the suspended TLS thread. This cleans up
> the child resources, but not the stack used by the trampoline itself. It
> is possible for a process to clean up its own stack, but it is tricky,
> and architecture-specific. Instead we leverage the TLS manager thread to
> clean up the trampoline stack. When the trampoline is cloned (in step 2
> above), we additionally set the `CHILD_SETTID` and `CHILD_CLEARTID`
> flags. The target location for the SET/CLEAR TID is set to a special field
> known by the TLS manager. Then, when the TLS manager thread is unsuspended,
> it performs an additional `FUTEX_WAIT` on this location. That blocks the
> TLS manager thread until the trampoline has fully exited, then the TLS
> manager thread frees the trampoline process's stack, before exiting
> itself.
>
> !! Shortcomings of this patch
>
> * It's complicated.
> * It doesn't support any clone options when CLONE_VM is omitted.
> * It doesn't properly clean up the CPU queue when the child process
>   terminates, or calls execve().
> * RCU unregistration is done in the trampoline process (in clone.c), but
>   registration happens in syscall.c. This should be made more explicit.
> * The TLS image, and trampoline stack are not cleaned up if the parent
>   calls `execve` or `exit_group` before the child does. This is because
>   those cleanup tasks are handled by the TLS manager thread. The TLS
>   manager thread is in the same thread group as the parent, so it will
>   be terminated if the parent exits or calls `execve`.
>
> !! Alternatives considered
>
> * Non-standard libc extension to allow creating TLS images independent
>   of threads. This would allow us to just `clone` the child directly
>   instead of this complicated maneuver. Though we probably would still
>   need the cleanup logic. For libcs, TLS image allocation is tightly
>   connected to thread stack allocation, which is also arch-specific. I
>   do not have enough experience with libc development to know if
>   maintainers of any popular libcs would be open to supporting such an
>   API. Additionally, since it will probably take years before a libc
>   fix would be widely deployed, we need an interim solution anyways.

We could consider a custom lib stub that intercepts calls to the guest's
original libc and replaces it with a QEMU aware one?

> * Non-standard, Linux-only, libc extension to allow us to specify the
>   CLONE_* flags used by `pthread_create`. The processes we are creating
>   are basically threads in a different thread group. If we could alter
>   the flags used, this whole process could become a `pthread_create`.
>   The problem with this approach is that I don't know what requirements
>   pthreads has on threads to ensure they function properly. I suspect
>   that pthreads relies on CHILD_CLEARTID+FUTEX_WAKE to cleanup detached
>   thread state. Since we don't control the child exit reason (Linux only
>   handles CHILD_CLEARTID on normal, non-signal process termination), we
>   probably can't use this same tracking mechanism.
> * Other mechanisms for detecting child exit so cleanup can happen
>   besides CLONE_VFORK:
>   * waitid(WNOWAIT): This can only detect exit, not execve.
>   * file descriptors with close on exec set: This cannot detect children
>     cloned with CLONE_FILES.
>   * System V semaphore adjustments: Cannot detect children cloned with
>     CLONE_SYSVSEM.
>   * CLONE_CHILD_CLEARTID + FUTEX_WAIT: Cannot detect abnormally
>     terminated children.
> * Doing the child clone directly in the TLS manager thread: This saves the
>   need for the trampoline process, but it causes the child process to be
>   parented to the wrong kernel task (the TLS thread instead of the Main
>   thread) breaking things like PDEATHSIG.

Have you considered a daemon which could co-ordinate between the
multiple processes that are sharing some state?


> Signed-off-by: Josh Kunz <jkz@google.com>
> ---
>  linux-user/clone.c               | 415 ++++++++++++++++++++++++++++++-
>  linux-user/qemu.h                |  17 ++
>  linux-user/signal.c              |  49 ++++
>  linux-user/syscall.c             |  69 +++--
>  tests/tcg/multiarch/linux-test.c |  67 ++++-
>  5 files changed, 592 insertions(+), 25 deletions(-)
>
> diff --git a/linux-user/clone.c b/linux-user/clone.c
> index f02ae8c464..3f7344cf9e 100644
> --- a/linux-user/clone.c
> +++ b/linux-user/clone.c
> @@ -12,6 +12,12 @@
>  #include <stdbool.h>
>  #include <assert.h>
>  
> +/* arch-specific includes needed to fetch the TLS base offset. */
> +#if defined(__x86_64__)
> +#include <asm/prctl.h>
> +#include <sys/prctl.h>
> +#endif
> +
>  static const unsigned long NEW_STACK_SIZE = 0x40000UL;
>  
>  /*
> @@ -62,6 +68,397 @@ static void completion_finish(struct completion *c)
>      pthread_mutex_unlock(&c->mu);
>  }
>  
> +struct tls_manager {
> +    void *tls_ptr;
> +    /* fetched is completed once tls_ptr has been set by the thread. */
> +    struct completion fetched;
> +    /*
> +     * spawned is completed by the user once the managed_tid
> +     * has been spawned.
> +     */
> +    struct completion spawned;
> +    /*
> +     * TID of the child whose memory is cleaned up upon death. This memory
> +     * location is used as part of a futex op, and is cleared by the kernel
> +     * since we specify CHILD_CLEARTID.
> +     */
> +    int managed_tid;
> +    /*
> +     * The value to be `free`'d up once the janitor is ready to clean up the
> +     * TLS section, and the managed tid has exited.
> +     */
> +    void *cleanup;
> +};
> +
> +/*
> + * tls_ptr fetches the TLS "pointer" for the current thread. This pointer
> + * should be whatever platform-specific address is used to represent the TLS
> + * base address.
> + */
> +static void *tls_ptr()

This and a number of other prototypes need void args to stop the
compiler complaining about missing prototypes.

> +{
> +    void *ptr;
> +#if defined(__x86_64__)
> +    /*
> +     * On x86_64, the TLS base is stored in the `fs` segment register, we can
> +     * fetch it with `ARCH_GET_FS`:
> +     */
> +    (void)syscall(SYS_arch_prctl, ARCH_GET_FS, (unsigned long) &ptr);
> +#else
> +    ptr = NULL;
> +#endif
> +    return ptr;
> +}
> +
> +/*
> + * clone_vm_supported returns true if clone_vm() is supported on this
> + * platform.
> + */
> +static bool clone_vm_supported()
> +{
> +#if defined(__x86_64__)
> +    return true;
> +#else
> +    return false;
> +#endif
> +}
<snip>

-- 
Alex Bennée



* Re: [PATCH 0/5] linux-user: Support extended clone(CLONE_VM)
  2020-06-16 16:02 ` Alex Bennée
@ 2020-06-16 16:32   ` Peter Maydell
  2020-06-16 23:38     ` Josh Kunz
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Maydell @ 2020-06-16 16:32 UTC (permalink / raw)
  To: Alex Bennée; +Cc: Josh Kunz, Riku Voipio, QEMU Developers, Laurent Vivier

On Tue, 16 Jun 2020 at 17:08, Alex Bennée <alex.bennee@linaro.org> wrote:
> Apart from "a more perfect emulation" is there a particular use case
> served by the extra functionality? AIUI up until this point we've
> basically supported glibc's use of clone() which has generally been
> enough. I'm assuming you've come across stuff that needs this more fine
> grained support?

There are definitely cases we don't handle that cause problems;
notably https://bugs.launchpad.net/qemu/+bug/1673976 reports
that newer glibc implement posix_spawn() using CLONE_VM|CLONE_VFORK
which we don't handle correctly (though it is now just "we don't
report failures correctly" rather than "guest asserts").
The problem has always been that glibc implicitly assumes it
knows what the process's threads are like, ie that it is the
only thing doing any clone()s. (The comment in syscall.c mentions
it "breaking mutexes" though I forget what I had in mind when
I wrote that comment.) I haven't looked at these patches,
but the risk of being clever is that we end up implicitly
depending on details of glibc's internal implementation in a
potentially fragile way.

I forget whether QEMU can build against musl libc, but if we do
then that might be an interesting test of whether we have
accidental dependencies on the libc internals.

thanks
-- PMM



* Re: [PATCH 0/5] linux-user: Support extended clone(CLONE_VM)
  2020-06-16 16:32   ` Peter Maydell
@ 2020-06-16 23:38     ` Josh Kunz
  0 siblings, 0 replies; 18+ messages in thread
From: Josh Kunz @ 2020-06-16 23:38 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Alex Bennée, Riku Voipio, QEMU Developers, Laurent Vivier


On Tue, Jun 16, 2020 at 9:32 AM Peter Maydell <peter.maydell@linaro.org>
wrote:
>
> On Tue, 16 Jun 2020 at 17:08, Alex Bennée <alex.bennee@linaro.org> wrote:
> > Apart from "a more perfect emulation" is there a particular use case
> > served by the extra functionality? AIUI up until this point we've
> > basically supported glibc's use of clone() which has generally been
> > enough. I'm assuming you've come across stuff that needs this more fine
> > grained support?
>
> There are definitely cases we don't handle that cause problems;
> notably https://bugs.launchpad.net/qemu/+bug/1673976 reports
> that newer glibc implement posix_spawn() using CLONE_VM|CLONE_VFORK
> which we don't handle correctly (though it is now just "we don't
> report failures correctly" rather than "guest asserts").

This originally came up for us at Google in multi-threaded guest binaries
using TCMalloc (https://github.com/google/tcmalloc). TCMalloc does not have
any special `at_fork` handling, so guest processes using TCMalloc spawn
subprocesses using a custom bit of code based on `clone(CLONE_VM)` (notably
*not* vfork()).

We've also been using this patch to work around similar issues in QEMU
itself when creating subprocesses with fork()/vfork(). Since QEMU (and
GLib) rely on locks to emulate multi-threaded guests that share virtual
memory, QEMU itself is also vulnerable to deadlocks when a guest forks.
Without this patch we've run into deadlocks with TCG region trees, and
GLib's `g_malloc`, there are likely others as well, since we could still
reproduce the deadlocks after fixing these specific problems.

The issues caused by using fork() in multi-threaded guests are pretty
tricky to debug. They are fundamentally data races (was another thread in
the critical section or not?), and they usually manifest as deadlocks,
which show up as timeouts or hangs to users. I suspect this issue happens
frequently in the wild, but at a low enough rate/user that nobody bothered
fixing it/reporting it yet. Use of `vfork()` with `CLONE_VM` is common as
mentioned by Alex. For example it is the only way to spawn subprocesses in
Go on most platforms:
https://github.com/golang/go/blob/master/src/syscall/exec_linux.go#L218

I tried to come up with a good reproducer for these issues, but I haven't
been able to make one. The cases we have at Google that trigger this issue
reliably are big and they contain lots of code I can't share. When
compiling QEMU itself with TCMalloc, I can pretty reliably reproduce a
deadlock with a program that just calls vfork(), but that's somewhat
synthetic.

> The problem has always been that glibc implicitly assumes it
> knows what the process's threads are like, ie that it is the
> only thing doing any clone()s. (The comment in syscall.c mentions
> it "breaking mutexes" though I forget what I had in mind when
> I wrote that comment.) I haven't looked at these patches,
> but the risk of being clever is that we end up implicitly
> depending on details of glibc's internal implementation in a
> potentially fragile way.
>
>
> I forget whether QEMU can build against musl libc, but if we do
> then that might be an interesting test of whether we have
> accidental dependencies on the libc internals.

I agree it would be interesting to test against musl. I'm pretty sure it
would work (this patch only relies on POSIX APIs + Platform ABIs for TLS),
but it would be interesting to confirm.

--
Josh Kunz



* Re: [PATCH 4/5] linux-user: Support CLONE_VM and extended clone options
  2020-06-16 16:08   ` Alex Bennée
@ 2020-06-23  3:43     ` Josh Kunz
  2020-06-23  8:21       ` Alex Bennée
  0 siblings, 1 reply; 18+ messages in thread
From: Josh Kunz @ 2020-06-23  3:43 UTC (permalink / raw)
  To: Alex Bennée; +Cc: QEMU Developers, Riku Voipio, Laurent Vivier

Thanks for the responses Alex. I'm working on your comments, but
wanted to clarify some of the points you brought up before mailing a
second version. Responses inline.

On Tue, Jun 16, 2020 at 9:08 AM Alex Bennée <alex.bennee@linaro.org> wrote:
> Which by the way fail on some targets:
>
>     TEST    linux-test on alpha
>   /home/alex/lsrc/qemu.git/tests/tcg/multiarch/linux-test.c:709: child did not receive PDEATHSIG on parent death
>   make[2]: *** [../Makefile.target:153: run-linux-test] Error 1
>   make[1]: *** [/home/alex/lsrc/qemu.git/tests/tcg/Makefile.qemu:76: run-guest-tests] Error 2
>   make: *** [/home/alex/lsrc/qemu.git/tests/Makefile.include:851: run-tcg-tests-alpha-linux-user] Error 2
>
> Have you managed a clean check-tcg with docker enabled so all the guest
> architectures get tested?

I've gotten this Alpha failure to reproduce on my local build and I'm
working on a fix. Thanks for pointing this out. I'll make sure I get a
clean `make check-tcg` for `linux-test` on all guest architectures.

> > In this patch, I've employed an alternative approach: spawning a thread
> > and "stealing" its TLS image for use in the child process. This approach
> > leaves a dangling thread while the TLS image is in use, but by design
> > that thread will not become schedulable until after the TLS data is no
> > longer in-use by the child (as described in a moment). Therefore, it
> > should cause relatively minimal overhead. When considered in the larger
> > context, this seems like a reasonable tradeoff.
>
> *sharp intake of breath*
>
> OK so the solution to the complexity of handling threads is to add more
> threads? cool cool cool....

The solution to the complexity of shared memory, but yeah, not my
favorite either. I was kinda hoping that someone on the list would
explain why this approach is clearly wrong.

> > * Non-standard libc extension to allow creating TLS images independent
> >   of threads. This would allow us to just `clone` the child directly
> >   instead of this complicated maneuver. Though we probably would still
> >   need the cleanup logic. For libcs, TLS image allocation is tightly
> >   connected to thread stack allocation, which is also arch-specific. I
> >   do not have enough experience with libc development to know if
> >   maintainers of any popular libcs would be open to supporting such an
> >   API. Additionally, since it will probably take years before a libc
> >   fix would be widely deployed, we need an interim solution anyways.
>
>> We could consider a custom lib stub that intercepts calls to the guest's
> original libc and replaces it with a QEMU aware one?

Unfortunately the problem here is host libc, rather than guest libc.
We need to make TLS variables in QEMU itself work, so intercepting
guest libc calls won't help much. Or am I misunderstanding the point?

> Have you considered a daemon which could co-ordinate between the
> multiple processes that are sharing some state?

Not really for the `CLONE_VM` support added in this patch series. I
have considered trying to pull tcg out of the guest process, but not
very seriously, since it seems like a pretty heavyweight approach.
Especially compared to the solution included in this series. Do you
think there's a simpler approach that involves using a daemon to do
coordination?

Thanks again for your reviews.

--
Josh Kunz



* Re: [PATCH 4/5] linux-user: Support CLONE_VM and extended clone options
  2020-06-23  3:43     ` Josh Kunz
@ 2020-06-23  8:21       ` Alex Bennée
       [not found]         ` <CADgy-2tB0Z133RB1i8OdnpKMD3xATL059dFoduHOjdim11G4-A@mail.gmail.com>
  0 siblings, 1 reply; 18+ messages in thread
From: Alex Bennée @ 2020-06-23  8:21 UTC (permalink / raw)
  To: Josh Kunz; +Cc: Riku Voipio, QEMU Developers, Laurent Vivier


Josh Kunz <jkz@google.com> writes:

> Thanks for the responses Alex. I'm working on your comments, but
> wanted to clarify some of the points you brought up before mailing a
> second version. Responses inline.
>
> On Tue, Jun 16, 2020 at 9:08 AM Alex Bennée <alex.bennee@linaro.org> wrote:
>> Which by the way fail on some targets:
>>
>>     TEST    linux-test on alpha
>>   /home/alex/lsrc/qemu.git/tests/tcg/multiarch/linux-test.c:709: child did not receive PDEATHSIG on parent death
>>   make[2]: *** [../Makefile.target:153: run-linux-test] Error 1
>>   make[1]: *** [/home/alex/lsrc/qemu.git/tests/tcg/Makefile.qemu:76: run-guest-tests] Error 2
>>   make: *** [/home/alex/lsrc/qemu.git/tests/Makefile.include:851: run-tcg-tests-alpha-linux-user] Error 2
>>
>> Have you managed a clean check-tcg with docker enabled so all the guest
>> architectures get tested?
>
> I've gotten this Alpha failure to reproduce on my local build and I'm
> working on a fix. Thanks for pointing this out. I'll make sure I get a
> clean `make check-tcg` for `linux-test` on all guest architectures.
>
>> > In this patch, I've employed an alternative approach: spawning a thread
>> > and "stealing" its TLS image for use in the child process. This approach
>> > leaves a dangling thread while the TLS image is in use, but by design
>> > that thread will not become schedulable until after the TLS data is no
>> > longer in-use by the child (as described in a moment). Therefore, it
>> > should cause relatively minimal overhead. When considered in the larger
>> > context, this seems like a reasonable tradeoff.
>>
>> *sharp intake of breath*
>>
>> OK so the solution to the complexity of handling threads is to add more
>> threads? cool cool cool....
>
> The solution to the complexity of shared memory, but yeah, not my
> favorite either. I was kinda hoping that someone on the list would
> explain why this approach is clearly wrong.
>
>> > * Non-standard libc extension to allow creating TLS images independent
>> >   of threads. This would allow us to just `clone` the child directly
>> >   instead of this complicated maneuver. Though we probably would still
>> >   need the cleanup logic. For libcs, TLS image allocation is tightly
>> >   connected to thread stack allocation, which is also arch-specific. I
>> >   do not have enough experience with libc development to know if
>> >   maintainers of any popular libcs would be open to supporting such an
>> >   API. Additionally, since it will probably take years before a libc
>> >   fix would be widely deployed, we need an interim solution anyways.
>>
>> We could consider a custom lib stub that intercepts calls to the guest's
>> original libc and replaces it with a QEMU aware one?
>
> Unfortunately the problem here is host libc, rather than guest libc.
> We need to make TLS variables in QEMU itself work, so intercepting
> guest libc calls won't help much. Or am I misunderstanding the point?

Hold up - I'm a little confused now. Why does the host TLS affect the
guest TLS? We have complete control over the guest's view of the world, so
we should be able to control its TLS storage.

>> Have you considered a daemon which could co-ordinate between the
>> multiple processes that are sharing some state?
>
> Not really for the `CLONE_VM` support added in this patch series. I
> have considered trying to pull tcg out of the guest process, but not
> very seriously, since it seems like a pretty heavyweight approach.
> Especially compared to the solution included in this series. Do you
> think there's a simpler approach that involves using a daemon to do
> coordination?

I'm getting a little lost now. Exactly what state are we trying to share
between two QEMU guests which are now in separate execution contexts?

-- 
Alex Bennée



* Re: [PATCH 4/5] linux-user: Support CLONE_VM and extended clone options
       [not found]           ` <87k0zw7opa.fsf@linaro.org>
@ 2020-07-09  0:16             ` Josh Kunz
  2020-07-16 10:41               ` Alex Bennée
  0 siblings, 1 reply; 18+ messages in thread
From: Josh Kunz @ 2020-07-09  0:16 UTC (permalink / raw)
  To: Alex Bennée; +Cc: QEMU Developers, Laurent Vivier, Riku Voipio

Sorry for the late reply; response inline. Also, I noticed that a couple of
mails ago I seem to have dropped the devel list and maintainers.
I've re-added them to the CC line.

On Wed, Jun 24, 2020 at 3:17 AM Alex Bennée <alex.bennee@linaro.org> wrote:
>
>
> Josh Kunz <jkz@google.com> writes:
>
> > On Tue, Jun 23, 2020, 1:21 AM Alex Bennée <alex.bennee@linaro.org> wrote:
> >
> > (snip)
> >
> >> >> > * Non-standard libc extension to allow creating TLS images independent
> >> >> >   of threads. This would allow us to just `clone` the child directly
> >> >> >   instead of this complicated maneuver. Though we probably would still
> >> >> >   need the cleanup logic. For libcs, TLS image allocation is tightly
> >> >> >   connected to thread stack allocation, which is also arch-specific. I
> >> >> >   do not have enough experience with libc development to know if
> >> >> >   maintainers of any popular libcs would be open to supporting such an
> >> >> >   API. Additionally, since it will probably take years before a libc
> >> >> >   fix would be widely deployed, we need an interim solution anyways.
> >> >>
> >> >> We could consider a custom lib stub that intercepts calls to the guest's
> >> >> original libc and replaces it with a QEMU aware one?
> >> >
> >> > Unfortunately the problem here is host libc, rather than guest libc.
> >> > We need to make TLS variables in QEMU itself work, so intercepting
> >> > guest libc calls won't help much. Or am I misunderstanding the point?
> >>
> >> Hold up - I'm a little confused now. Why does the host TLS affect the
> >> guest TLS? We have complete control over the guests view of the world so
> >> we should be able to control it's TLS storage.
> >
> > Guest TLS is unaffected, just like in the existing case for guest
> > threads. Guest TLS is handled by the guest libc and the CPU emulation.
> > Just to be clear: This series changes nothing about guest TLS.
> >
> > The complexity of this series is to deal with *host* usage of TLS.
> > That is to say: use of thread local variables in QEMU itself. Host TLS
> > is needed to allow the subprocess created with `clone(CLONE_VM, ...)`
> > to run at all. TLS variables are used in QEMU for the RCU
> > implementation, parts of the TCG, and all over the place to access the
> > CPU/TaskState for the running thread. Host TLS is managed by the host
> > libc, and TLS is only set up for host threads created via
> > `pthread_create`. Subprocesses created with `clone(CLONE_VM)` share a
> > virtual memory map *and* TLS data with their parent[1], since libcs
> > provide no special handling of TLS when `clone(CLONE_VM)` is used.
> > Without the workaround used in this patch, both the parent and child
> > process's thread local variables reference the same memory locations.
> > This just doesn't work, since thread local data is assumed to actually
> > be thread local.
> >
> > The "alternative" proposed was to make the host libc support TLS for
> > processes created using clone (there are several ways to go about
> > this, each with different tradeoffs). You mentioned that "We could
> > consider a custom lib stub that intercepts calls to the guest's
> > original libc..." in your comment. Since *guest* libc is not involved
> > here I was a bit confused about how this could help, and wanted to
> > clarify.
> >
> >> >> Have you considered a daemon which could co-ordinate between the
> >> >> multiple processes that are sharing some state?
> >> >
> >> > Not really for the `CLONE_VM` support added in this patch series. I
> >> > have considered trying to pull tcg out of the guest process, but not
> >> > very seriously, since it seems like a pretty heavyweight approach.
> >> > Especially compared to the solution included in this series. Do you
> >> > think there's a simpler approach that involves using a daemon to do
> >> > coordination?
> >>
> >> I'm getting a little lost now. Exactly what state are we trying to share
> >> between two QEMU guests which are now in separate execution contexts?
> >
> > Since this series only deals with `clone(CLONE_VM)` we always want to
> > share guest virtual memory between the execution contexts. There is
> > also some extra state that needs to be shared depending on which flags
> > are provided to `clone()`. E.g., signal handler tables for
> > CLONE_SIGHAND, file descriptor tables for CLONE_FILES, etc.
> >
> > The problem is that since QEMU and the guest live in the same virtual
> > memory map, keeping the mappings the same between the guest parent and
> > guest child means that the mappings also stay the same between the
> > host (QEMU) parent and host child. Two hosts can live in the same
> > virtual memory map, like we do right now with threads, but *only* with
> > valid TLS for each thread/process. That's why we bend over backwards
> > to set up TLS for emulation in the child process.
>
> OK thanks for that. I'd obviously misunderstood from my first read
> through. So while hiding the underlying bits of QEMU from the guest is
> relatively easy it's quite hard to hide QEMU from itself in this
> CLONE_VM case.

Yes exactly.
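To make the hazard concrete, here's a minimal sketch of what happens to a
host `__thread` variable after a raw clone(CLONE_VM). x86_64 Linux is
assumed, since %fs is inherited there, and the helper name is purely
illustrative, not a QEMU API:

```c
/* Sketch of the host-TLS hazard: after clone(CLONE_VM) with no
 * CLONE_SETTLS, the child resolves __thread variables to the
 * *parent's* TLS slots. */
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>

static __thread int tls_var;   /* stand-in for QEMU's per-task state */
static int *child_tls_addr;    /* visible to parent: CLONE_VM shares memory */

static int child_fn(void *arg)
{
    child_tls_addr = &tls_var; /* the child's view of the TLS variable */
    return 0;
}

int tls_shared_with_clone_child(void)
{
    enum { STACK_SIZE = 1024 * 1024 };
    char *stack = malloc(STACK_SIZE);
    /* Child stack grows down, so pass the top of the allocation. */
    pid_t pid = clone(child_fn, stack + STACK_SIZE,
                      CLONE_VM | SIGCHLD, NULL);
    assert(pid > 0);
    waitpid(pid, NULL, 0);
    free(stack);
    /* Same address in both tasks: thread-"local" data is not local. */
    return child_tls_addr == &tls_var;
}
```

Parent and child resolve `tls_var` to the same slot, which is exactly why
this series has to construct a fresh TLS image for the child.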

> The other approach would be to suppress CLONE_VM for the actual process
> (thereby allowing QEMU to safely have a new instance and no clashing
> shared data) but emulate CLONE_VM for the guest itself (making the guest
> portions of memory shared and visible to each other). The trouble then
> would be co-ordination of mapping operations and other things that
> should be visible in a real CLONE_VM setup. This is the sort of
> situation I envisioned a co-ordination daemon might be useful.

Ah. This is interesting. Effectively the inverse of this patch. I had
not considered this approach. Thinking more about it, a "no shared
memory" approach does seem more straightforward implementation-wise.
Unfortunately I think there would be a few substantial drawbacks:

1. Memory overhead. Every guest thread would need a full copy of QEMU
memory, including the translated guest binary.
2. Performance overhead. To keep virtual memory maps consistent across
tasks, a heavyweight two-phase commit scheme, or similar, would be
needed for every `mmap`. That could have substantial performance
overhead for the guest. This could be a huge problem for processes
that use a large number of threads *and* do a lot of memory mapping or
frequently change page permissions.
3. There would be lots of similarly-fiddly bits that need to be shared
and coordinated in addition to guest memory. At least the signal
handler tables and fd_trans tables, but there are likely others I'm
missing.
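As a rough sketch of what point 2 would entail, here is a toy in-process
model of the two-phase scheme. The peer bookkeeping is illustrative only;
real peers would be separate processes coordinating over a socket, and the
"reservation" would be something like a PROT_NONE placeholder mapping:

```c
/* Toy two-phase commit: a guest mmap must be reserved by every
 * emulator task (phase 1) before any task installs the real mapping
 * (phase 2); a single refusal rolls everyone back. */
#include <stdbool.h>

#define NPEERS  4
#define MAXMAPS 16

struct peer {
    unsigned long start[MAXMAPS], len[MAXMAPS];
    int nmaps;
};

static struct peer peers[NPEERS];

static bool overlaps(const struct peer *p, unsigned long s, unsigned long l)
{
    for (int i = 0; i < p->nmaps; i++) {
        if (s < p->start[i] + p->len[i] && p->start[i] < s + l) {
            return true;
        }
    }
    return false;
}

static bool prepare(struct peer *p, unsigned long s, unsigned long l)
{
    if (p->nmaps == MAXMAPS || overlaps(p, s, l)) {
        return false;              /* peer votes "abort" */
    }
    p->start[p->nmaps] = s;        /* reserve the range */
    p->len[p->nmaps++] = l;
    return true;
}

static void rollback(struct peer *p)
{
    p->nmaps--;                    /* drop the reservation just taken */
}

bool coordinated_mmap(unsigned long start, unsigned long len)
{
    int acked = 0;
    for (; acked < NPEERS; acked++) {          /* phase 1: prepare */
        if (!prepare(&peers[acked], start, len)) {
            while (acked-- > 0) {
                rollback(&peers[acked]);
            }
            return false;
        }
    }
    /* phase 2: commit - every peer would now mmap/mprotect for real */
    return true;
}
```

Every guest `mmap`, `munmap`, and `mprotect` would need a round trip like
this, which is where the performance concern comes from.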

The performance drawbacks could be largely mitigated by using the
current thread-only `CLONE_VM` support, but having *any* threads in
the process at all would lead to deadlocks after fork() or similar
non-CLONE_VM clone() calls. This could be worked around with a "stop
the world" button somewhat like `start_exclusive`, but expanded to
include all emulator threads. That will substantially slow down
fork().
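For reference, a hypothetical sketch of such a gate, in the spirit of
linux-user's start_exclusive()/end_exclusive(). Names and structure are
illustrative, not QEMU's actual code:

```c
/* "Stop the world" gate: fork() waits until every emulator thread has
 * parked at a safe point, then holds new work off until resume. */
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static int running_threads;
static int world_stopped;

void thread_enter(void)          /* each emulator thread, before running */
{
    pthread_mutex_lock(&lock);
    while (world_stopped) {
        pthread_cond_wait(&cv, &lock);
    }
    running_threads++;
    pthread_mutex_unlock(&lock);
}

void thread_exit_section(void)   /* at the thread's next safe point */
{
    pthread_mutex_lock(&lock);
    running_threads--;
    pthread_cond_broadcast(&cv);
    pthread_mutex_unlock(&lock);
}

void stop_the_world(void)        /* the thread about to fork() */
{
    pthread_mutex_lock(&lock);
    world_stopped = 1;
    while (running_threads > 0) {
        pthread_cond_wait(&cv, &lock);
    }
    pthread_mutex_unlock(&lock);
}

void resume_the_world(void)      /* after fork() returns in the parent */
{
    pthread_mutex_lock(&lock);
    world_stopped = 0;
    pthread_cond_broadcast(&cv);
    pthread_mutex_unlock(&lock);
}
```

The cost is that every fork() serializes against all emulator threads,
which is the slowdown mentioned above.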

Given all this I think the approach used in this series is probably at
least as "good" as a "no shared memory" approach. It has its own
complexities and drawbacks, but doesn't have obvious performance
issues. If you or other maintainers disagree, I'd be happy to write up
an RFC comparing the approaches in more detail (or we can just use
this thread), just let me know. Until then I'll keep pursuing this
patch.

> > [1] At least on x86_64, because TLS references are defined in terms of
> > the %fs segment, which is inherited on Linux. Theoretically it's up to
> > the architecture to specify how TLS is inherited across execution
> > contexts. It's possible that the child actually ends up with no valid
> > TLS rather than using the parent TLS data. But that's not really
> > relevant here. The important thing is that the child ends up with
> > *valid* TLS, not invalid or inherited TLS.
>
>
> --
> Alex Bennée

--
Josh Kunz



* Re: [PATCH 4/5] linux-user: Support CLONE_VM and extended clone options
  2020-07-09  0:16             ` Josh Kunz
@ 2020-07-16 10:41               ` Alex Bennée
  0 siblings, 0 replies; 18+ messages in thread
From: Alex Bennée @ 2020-07-16 10:41 UTC (permalink / raw)
  To: Josh Kunz; +Cc: Richard Henderson, Riku Voipio, QEMU Developers, Laurent Vivier


Josh Kunz <jkz@google.com> writes:

> Sorry for the late reply; response inline. Also, I noticed that a couple
> of mails ago I seem to have dropped the devel list and maintainers.
> I've re-added them to the CC line.
>
> On Wed, Jun 24, 2020 at 3:17 AM Alex Bennée <alex.bennee@linaro.org> wrote:
>>
>>
>> Josh Kunz <jkz@google.com> writes:
>>
>> > On Tue, Jun 23, 2020, 1:21 AM Alex Bennée <alex.bennee@linaro.org> wrote:
>> >
>> > (snip)
>> >
>> >> >> > * Non-standard libc extension to allow creating TLS images independent
>> >> >> >   of threads. This would allow us to just `clone` the child directly
>> >> >> >   instead of this complicated maneuver. Though we probably would still
>> >> >> >   need the cleanup logic. For libcs, TLS image allocation is tightly
>> >> >> >   connected to thread stack allocation, which is also arch-specific. I
>> >> >> >   do not have enough experience with libc development to know if
>> >> >> >   maintainers of any popular libcs would be open to supporting such an
>> >> >> >   API. Additionally, since it will probably take years before a libc
>> >> >> >   fix would be widely deployed, we need an interim solution anyways.
>> >> >>
>> >> >> We could consider a custom lib stub that intercepts calls to the guest's
>> >> >> original libc and replaces it with a QEMU-aware one?
>> >> >
>> >> > Unfortunately the problem here is host libc, rather than guest libc.
>> >> > We need to make TLS variables in QEMU itself work, so intercepting
>> >> > guest libc calls won't help much. Or am I misunderstanding the point?
>> >>
>> >> Hold up - I'm a little confused now. Why does the host TLS affect the
>> >> guest TLS? We have complete control over the guest's view of the world so
>> >> we should be able to control its TLS storage.
>> >
>> > Guest TLS is unaffected, just like in the existing case for guest
>> > threads. Guest TLS is handled by the guest libc and the CPU emulation.
>> > Just to be clear: This series changes nothing about guest TLS.
>> >
>> > The complexity of this series is to deal with *host* usage of TLS.
>> > That is to say: use of thread local variables in QEMU itself. Host TLS
>> > is needed to allow the subprocess created with `clone(CLONE_VM, ...)`
>> > to run at all. TLS variables are used in QEMU for the RCU
>> > implementation, parts of the TCG, and all over the place to access the
>> > CPU/TaskState for the running thread. Host TLS is managed by the host
>> > libc, and TLS is only set up for host threads created via
>> > `pthread_create`. Subprocesses created with `clone(CLONE_VM)` share a
>> > virtual memory map *and* TLS data with their parent[1], since libcs
>> > provide no special handling of TLS when `clone(CLONE_VM)` is used.
>> > Without the workaround used in this patch, both the parent and child
>> > process's thread local variables reference the same memory locations.
>> > This just doesn't work, since thread local data is assumed to actually
>> > be thread local.
>> >
>> > The "alternative" proposed was to make the host libc support TLS for
>> > processes created using clone (there are several ways to go about
>> > this, each with different tradeoffs). You mentioned that "We could
>> > consider a custom lib stub that intercepts calls to the guest's
>> > original libc..." in your comment. Since *guest* libc is not involved
>> > here I was a bit confused about how this could help, and wanted to
>> > clarify.
>> >
>> >> >> Have you considered a daemon which could co-ordinate between the
>> >> >> multiple processes that are sharing some state?
>> >> >
>> >> > Not really for the `CLONE_VM` support added in this patch series. I
>> >> > have considered trying to pull tcg out of the guest process, but not
>> >> > very seriously, since it seems like a pretty heavyweight approach.
>> >> > Especially compared to the solution included in this series. Do you
>> >> > think there's a simpler approach that involves using a daemon to do
>> >> > coordination?
>> >>
>> >> I'm getting a little lost now. Exactly what state are we trying to share
>> >> between two QEMU guests which are now in separate execution contexts?
>> >
>> > Since this series only deals with `clone(CLONE_VM)` we always want to
>> > share guest virtual memory between the execution contexts. There is
>> > also some extra state that needs to be shared depending on which flags
>> > are provided to `clone()`. E.g., signal handler tables for
>> > CLONE_SIGHAND, file descriptor tables for CLONE_FILES, etc.
>> >
>> > The problem is that since QEMU and the guest live in the same virtual
>> > memory map, keeping the mappings the same between the guest parent and
>> > guest child means that the mappings also stay the same between the
>> > host (QEMU) parent and host child. Two hosts can live in the same
>> > virtual memory map, like we do right now with threads, but *only* with
>> > valid TLS for each thread/process. That's why we bend over backwards
>> > to set up TLS for emulation in the child process.
>>
>> OK thanks for that. I'd obviously misunderstood from my first read
>> through. So while hiding the underlying bits of QEMU from the guest is
>> relatively easy it's quite hard to hide QEMU from itself in this
>> CLONE_VM case.
>
> Yes exactly.
>
>> The other approach would be to suppress CLONE_VM for the actual process
>> (thereby allowing QEMU to safely have a new instance and no clashing
>> shared data) but emulate CLONE_VM for the guest itself (making the guest
>> portions of memory shared and visible to each other). The trouble then
>> would be co-ordination of mapping operations and other things that
>> should be visible in a real CLONE_VM setup. This is the sort of
>> situation I envisioned a co-ordination daemon might be useful.
>
> Ah. This is interesting. Effectively the inverse of this patch. I had
> not considered this approach. Thinking more about it, a "no shared
> memory" approach does seem more straightforward implementation-wise.
> Unfortunately I think there would be a few substantial drawbacks:
>
> 1. Memory overhead. Every guest thread would need a full copy of QEMU
> memory, including the translated guest binary.

Sure, although I suspect the overhead is not that great. For linux-user
on 64-bit systems we only allocate 128MB of translation buffer per
process. What size of systems are you expecting to run on, and how big
are the binaries?

> 2. Performance overhead. To keep virtual memory maps consistent across
> tasks, a heavyweight two-phase commit scheme, or similar, would be
> needed for every `mmap`. That could have substantial performance
> overhead for the guest. This could be a huge problem for processes
> that use a large number of threads *and* do a lot of memory mapping or
> frequently change page permissions.

I suspect that cross-arch highly threaded apps are still in the realm of
"wow, that actually works, neat :-)" for linux-user. We don't have the
luxury of falling back to a single thread like we do for system
emulation so things like strong-on-weak memory order bugs can still trip
us up.

> 3. There would be lots of similarly-fiddly bits that need to be shared
> and coordinated in addition to guest memory. At least the signal
> handler tables and fd_trans tables, but there are likely others I'm
> missing.
>
> The performance drawbacks could be largely mitigated by using the
> current thread-only `CLONE_VM` support, but having *any* threads in
> the process at all would lead to deadlocks after fork() or similar
> non-CLONE_VM clone() calls. This could be worked around with a "stop
> the world" button somewhat like `start_exclusive`, but expanded to
> include all emulator threads. That will substantially slow down
> fork().
>
> Given all this I think the approach used in this series is probably at
> least as "good" as a "no shared memory" approach. It has its own
> complexities and drawbacks, but doesn't have obvious performance
> issues. If you or other maintainers disagree, I'd be happy to write up
> an RFC comparing the approaches in more detail (or we can just use
> this thread), just let me know. Until then I'll keep pursuing this
> patch.

I think that's fair. I'll leave it to the maintainers to chime in if
they have something to add. I'd already given some comments on patch 1 and
given it needs a re-spin I'll have another look on the next iteration.

I will say expect the system to get some testing on multiple backends so
if you can expand your testing beyond an x86_64 host please do.

>
>> > [1] At least on x86_64, because TLS references are defined in terms of
>> > the %fs segment, which is inherited on Linux. Theoretically it's up to
>> > the architecture to specify how TLS is inherited across execution
>> > contexts. It's possible that the child actually ends up with no valid
>> > TLS rather than using the parent TLS data. But that's not really
>> > relevant here. The important thing is that the child ends up with
>> > *valid* TLS, not invalid or inherited TLS.
>>
>>
>> --
>> Alex Bennée


-- 
Alex Bennée



end of thread, other threads:[~2020-07-16 10:42 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-12  1:46 [PATCH 0/5] linux-user: Support extended clone(CLONE_VM) Josh Kunz
2020-06-12  1:46 ` [PATCH 1/5] linux-user: Refactor do_fork to use new `qemu_clone` Josh Kunz
2020-06-16 15:51   ` Alex Bennée
2020-06-12  1:46 ` [PATCH 2/5] linux-user: Make fd_trans task-specific Josh Kunz
2020-06-12  1:46 ` [PATCH 3/5] linux-user: Make sigact_table part of the task state Josh Kunz
2020-06-12  1:46 ` [PATCH 4/5] linux-user: Support CLONE_VM and extended clone options Josh Kunz
2020-06-13  0:10   ` Josh Kunz
2020-06-16 16:08   ` Alex Bennée
2020-06-23  3:43     ` Josh Kunz
2020-06-23  8:21       ` Alex Bennée
     [not found]         ` <CADgy-2tB0Z133RB1i8OdnpKMD3xATL059dFoduHOjdim11G4-A@mail.gmail.com>
     [not found]           ` <87k0zw7opa.fsf@linaro.org>
2020-07-09  0:16             ` Josh Kunz
2020-07-16 10:41               ` Alex Bennée
2020-06-12  1:46 ` [PATCH 5/5] linux-user: Add PDEATHSIG test for clone process hierarchy Josh Kunz
2020-06-12  3:48 ` [PATCH 0/5] linux-user: Support extended clone(CLONE_VM) no-reply
2020-06-13 11:16 ` Alex Bennée
2020-06-16 16:02 ` Alex Bennée
2020-06-16 16:32   ` Peter Maydell
2020-06-16 23:38     ` Josh Kunz
