* [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor
From: Josh Triplett @ 2015-03-13  1:40 UTC
  To: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Thiago Macieira, Michael Kerrisk, linux-kernel,
	linux-api, linux-fsdevel, x86

This patch series introduces a new clone flag, CLONE_FD, which lets the caller
handle child process exit notification via a file descriptor rather than
SIGCHLD.  CLONE_FD makes it possible for libraries to safely launch and manage
child processes on behalf of their caller, *without* taking over process-wide
SIGCHLD handling (either via signal handler or signalfd).

Note that signalfd for SIGCHLD does not suffice here, because that still
receives notification for all child processes, and interferes with process-wide
signal handling.

The CLONE_FD file descriptor uniquely identifies a process on the system in a
race-free way, by holding a reference to the task_struct.  In the future, we
may introduce APIs that support using process file descriptors instead of PIDs.

Introducing CLONE_FD required two additional bits of yak shaving: Since clone
has no more usable flags (with the three currently unused flags unusable
because old kernels ignore them without EINVAL), also introduce a new clone4
system call with more flag bits and an extensible argument structure.  And
since the magic pt_regs-based syscall argument processing for clone's tls
argument would otherwise prevent introducing a sane clone4 system call, fix
that too.

I tested the CLONE_SETTLS changes with a thread-local storage test program (two
threads independently reading and writing a __thread variable), on both 32-bit
and 64-bit, and I observed no issues there.

I tested clone4 and the new CLONE_FD call with several additional test
programs, launching either a process or thread (in the former case using
syscall(), in the latter case by calling clone4 via assembly and returning to
C), sleeping in parent and child to test the case of either exiting first, and
then printing the received clonefd_info structure.  Thiago also tested clone4
with CLONE_FD with a modified version of libqt's process handling, which
includes a test suite.

I've also included the manpages patch at the end of this series.  (Note that
the manpage documents the behavior of the future glibc wrapper as well as the
raw syscall.)  Here's a formatted plain-text version of the manpage for
reference:

CLONE4(2)                  Linux Programmer's Manual                 CLONE4(2)



NAME
       clone4 - create a child process

SYNOPSIS
       /* Prototype for the glibc wrapper function */

       #define _GNU_SOURCE
       #include <sched.h>

       int clone4(uint64_t flags,
                  size_t args_size,
                  struct clone4_args *args,
                  int (*fn)(void *), void *arg);

       /* Prototype for the raw system call */

       int clone4(unsigned flags_high, unsigned flags_low,
                  unsigned long args_size,
                  struct clone4_args *args);

       struct clone4_args {
           pid_t *ptid;
           pid_t *ctid;
           unsigned long stack_start;
           unsigned long stack_size;
           unsigned long tls;
       };


DESCRIPTION
       clone4()  creates  a  new  process,  similar  to  clone(2) and fork(2).
       clone4() supports additional flags that clone(2) does not, and  accepts
       arguments via an extensible structure.

       args  points to a clone4_args structure, and args_size must contain the
       size of that structure, as understood by the  caller.   If  the  caller
       passes  a  shorter  structure  than  the  kernel expects, the remaining
       fields will default to 0.  If the caller passes a larger structure than
       the  kernel  expects  (such  as one from a newer kernel), clone4() will
       return EINVAL.  The clone4_args structure may gain additional fields at
       the  end  in  the future, and callers must only pass a size that encom‐
       passes the number of fields they understand.  If the  caller  passes  0
       for args_size, args is ignored and may be NULL.
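
The size rules above can be illustrated with a small user-space sketch.  This
is not code from the series: copy_clone4_args() is a hypothetical helper, and
a real kernel would copy from user memory rather than call memcpy().  It only
shows the three cases: a shorter structure is zero-extended, a larger one is
rejected with EINVAL, and a size of 0 ignores the pointer.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Layout copied from the SYNOPSIS section of the man page. */
struct clone4_args {
    pid_t *ptid;
    pid_t *ctid;
    unsigned long stack_start;
    unsigned long stack_size;
    unsigned long tls;
};

/*
 * Hypothetical helper (an assumption, not part of the patch series):
 * copy a caller-supplied argument block of args_size bytes into the
 * structure known to this build.
 * Returns 0 on success, or -EINVAL if the caller's structure is larger
 * than the one we understand.
 */
static int copy_clone4_args(struct clone4_args *dst,
                            const void *src, size_t args_size)
{
    if (args_size > sizeof(*dst))
        return -EINVAL;               /* caller passed fields we do not know */

    memset(dst, 0, sizeof(*dst));     /* fields the caller omitted default to 0 */
    if (args_size > 0)
        memcpy(dst, src, args_size);  /* src is ignored (may be NULL) when size is 0 */
    return 0;
}
```

A caller built against an older header would pass its own, smaller
sizeof(struct clone4_args); in this sketch the trailing fields it does not
know about simply read back as 0.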

       In  the clone4_args structure, ptid, ctid, stack_start, stack_size, and
       tls have the same semantics as they do with clone(2) and clone2(2).

       In the glibc wrapper, fn and arg have the same  semantics  as  they  do
       with clone(2).  As with clone(2), the underlying system call works more
       like fork(2), returning 0 in the child process; the glibc wrapper  sim‐
       plifies  thread execution by calling fn(arg) and exiting the child when
       that function exits.

       The 64-bit  flags  argument  (split  into  the  32-bit  flags_high  and
       flags_low arguments in the kernel interface) accepts all the same flags
       as  clone(2),  with  the   exception   of   the   obsolete   CLONE_PID,
       CLONE_DETACHED, and CLONE_STOPPED.  In addition, flags accepts the fol‐
       lowing flags:


       CLONE_FD
              Instead of returning a process ID, clone4()  with  the  CLONE_FD
              flag  returns a file descriptor associated with the new process.
              When the new process exits, the kernel will not send a signal to
              the  parent process, and will not keep the new process around as
              a "zombie" process  until  a  call  to  waitpid(2)  or  similar.
              Instead,  the file descriptor will become available for reading,
              and the new process will be immediately reaped.

              Unlike using  signalfd(2)  for  the  SIGCHLD  signal,  the  file
              descriptor  returned  by  clone4()  with the CLONE_FD flag works
              even with SIGCHLD unblocked in one or more threads of the parent
              process,  and  allows the process to have different handlers for
              different child processes, such as those created by  a  library,
              without  introducing  race conditions around process-wide signal
              handling.

              clone4() will never return a file descriptor in the range 0-2 to
              the caller, to avoid ambiguity with the return of 0 in the child
              process.  Only the  calling  process  will  have  the  new  file
              descriptor open; the child process will not.

              Since the kernel does not send a termination signal when a child
              process created with CLONE_FD exits, the low byte of flags  does
              not contain a signal number.  Instead, the low byte of flags can
              contain the following additional flags for use with CLONE_FD:


              CLONEFD_CLOEXEC
                     Set the O_CLOEXEC flag on the new open  file  descriptor.
                     See  the description of the O_CLOEXEC flag in open(2) for
                     reasons why this may be useful.


              CLONEFD_NONBLOCK
                     Set the O_NONBLOCK flag on the new open file  descriptor.
                     Using  this flag saves extra calls to fcntl(2) to achieve
                     the same result.


              clone4() with the CLONE_FD flag returns a file  descriptor  that
              supports the following operations:

              read(2) (and similar)
                     When  the  new  process  exits,  reading  from  the  file
                     descriptor produces a single clonefd_info structure:

                     struct clonefd_info {
                         uint32_t code;   /* Signal code */
                         uint32_t status; /* Exit status or signal */
                         uint64_t utime;  /* User CPU time */
                         uint64_t stime;  /* System CPU time */
                     };


                     If the new process has not  yet  exited,  read(2)  either
                     blocks  until  it does, or fails with the error EAGAIN if
                     the file descriptor has been made nonblocking.

                     Future kernels may extend clonefd_info by appending addi‐
                     tional  fields  to  the end.  Callers should read as many
                     bytes as they understand; unread data will be  discarded,
                     and  subsequent  reads  after  the first will return 0 to
                     indicate end-of-file.  Callers requesting more bytes than
                     the  kernel  provides  (such as callers expecting a newer
                     clonefd_info structure) will receive a shorter  structure
                     from older kernels.
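
A caller following the read(2) rules above might look roughly like the sketch
below.  read_clonefd_info() is a hypothetical helper name, and the structure
is copied from this man page because no released header is assumed to define
it.  The buffer is zeroed first so that fields a shorter (older) record does
not supply read as 0.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Layout copied from the man page text; not taken from an installed header. */
struct clonefd_info {
    uint32_t code;   /* Signal code */
    uint32_t status; /* Exit status or signal */
    uint64_t utime;  /* User CPU time */
    uint64_t stime;  /* System CPU time */
};

/*
 * Hypothetical helper: read one exit record from fd.
 * Returns the number of bytes actually provided (possibly fewer than
 * sizeof(*info)), 0 at end-of-file, or -1 with errno set.  EAGAIN from
 * a nonblocking descriptor is passed through to the caller.
 */
static ssize_t read_clonefd_info(int fd, struct clonefd_info *info)
{
    ssize_t n;

    memset(info, 0, sizeof(*info));   /* missing trailing fields read as 0 */
    do {
        n = read(fd, info, sizeof(*info));
    } while (n < 0 && errno == EINTR); /* retry if interrupted by a signal */
    return n;
}
```

A return value smaller than sizeof(struct clonefd_info) would indicate a
record from a kernel that knows fewer fields; a return of 0 means the record
was already consumed.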

              poll(2), select(2), epoll(7) (and similar)
                     The  file  descriptor  is readable (the select(2) readfds
                     argument; the poll(2) POLLIN flag) if the new process has
                     exited.
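
As a sketch of the readiness check described above, a non-blocking test for
"has the child exited yet" could be written with poll(2) and a zero timeout.
clonefd_exited() is a hypothetical name chosen for illustration; it works on
any descriptor that reports POLLIN.

```c
#include <poll.h>

/*
 * Hypothetical helper: report whether fd is readable right now.
 * Returns 1 if readable (for a CLONE_FD descriptor, the child has
 * exited), 0 if not, or -1 if poll() failed (errno is set).
 */
static int clonefd_exited(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int n = poll(&pfd, 1, 0);          /* timeout 0: do not block */

    if (n < 0)
        return -1;
    return (n == 1 && (pfd.revents & POLLIN)) ? 1 : 0;
}
```

In an event loop the descriptor would more likely be registered once with
epoll(7) or a long-lived poll(2) set instead of being polled repeatedly.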

              close(2)
                     When  the file descriptor is no longer required it should
                     be closed.  If no process has a file descriptor open  for
                     the new process, no process will receive any notification
                     when the new process exits.  The new process  will  still
                     be immediately reaped.


   C library/kernel ABI differences
       As with clone(2), the raw clone4() system call corresponds more closely
       to fork(2) in that execution in the child continues from the  point  of
       the call.

       Unlike  clone(2),  the  raw  system call interface for clone4() accepts
       arguments in the same order on all architectures.

       The raw system call accepts flags as two 32-bit  arguments,  flags_high
       and  flags_low, to simplify portability across 32-bit and 64-bit archi‐
       tectures and calling conventions.  The glibc wrapper accepts flags as a
       single 64-bit argument for convenience.
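
The split between the wrapper's 64-bit flags and the raw call's two 32-bit
words amounts to a shift and a mask.  The sketch below assumes flags_high
carries bits 63..32 and flags_low carries bits 31..0, as the argument names
suggest; the type and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical pair type for the two raw syscall flag arguments. */
struct clone4_flag_words {
    uint32_t high; /* assumed: bits 63..32 of the 64-bit flags value */
    uint32_t low;  /* assumed: bits 31..0 of the 64-bit flags value */
};

/* Split a 64-bit flags value into the two 32-bit words. */
static struct clone4_flag_words split_clone4_flags(uint64_t flags)
{
    struct clone4_flag_words w;

    w.high = (uint32_t)(flags >> 32);
    w.low  = (uint32_t)(flags & 0xffffffffu);
    return w;
}
```

A wrapper built this way would pass w.high and w.low as the first two
arguments of the raw call, in that order.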


RETURN VALUE
       For the glibc wrapper, on success, clone4() returns the file descriptor
       (with CLONE_FD) or new process ID (without  CLONE_FD),  and  the  child
       process begins running at the specified function.

       For  the  raw syscall, on success, clone4() returns the file descriptor
       or new process ID to the calling process, and  returns  0  in  the  new
       child process.

       On failure, clone4() returns -1 and sets errno accordingly.


ERRORS
       clone4()  can  return any error from clone(2), as well as the following
       additional errors:

       EINVAL flags contained an unknown flag.

       EINVAL flags included CLONE_FD, but the kernel configuration  does  not
              have the CONFIG_CLONEFD option enabled.

       EMFILE flags  included  CLONE_FD,  but  the  new  file descriptor would
              exceed the process limit on open file descriptors.

       ENFILE flags included CLONE_FD,  but  the  new  file  descriptor  would
              exceed the system-wide limit on open file descriptors.

       ENODEV flags  included  CLONE_FD,  but  clone4()  could  not  mount the
              (internal) anonymous inode device.


CONFORMING TO
       clone4() is Linux-specific and should not be used in programs  intended
       to be portable.


SEE ALSO
       clone(2), epoll(7), poll(2), pthreads(7), read(2), select(2)



Linux                             2015-03-01                         CLONE4(2)


Josh Triplett and Thiago Macieira (6):
  clone: Support passing tls argument via C rather than pt_regs magic
  x86: Opt into HAVE_COPY_THREAD_TLS, for both 32-bit and 64-bit
  Introduce a new clone4 syscall with more flag bits and extensible arguments
  signal: Factor out a helper function to process task_struct exit_code
  fs: Make alloc_fd non-private
  clone4: Introduce new CLONE_FD flag to get task exit notification via fd

 arch/Kconfig                     |   7 ++
 arch/x86/Kconfig                 |   1 +
 arch/x86/ia32/ia32entry.S        |   3 +-
 arch/x86/kernel/entry_64.S       |   1 +
 arch/x86/kernel/process_32.c     |   6 +-
 arch/x86/kernel/process_64.c     |   8 +--
 arch/x86/syscalls/syscall_32.tbl |   1 +
 arch/x86/syscalls/syscall_64.tbl |   2 +
 fs/file.c                        |   2 +-
 include/linux/compat.h           |  12 ++++
 include/linux/file.h             |   1 +
 include/linux/sched.h            |  20 ++++++
 include/linux/syscalls.h         |   6 +-
 include/uapi/linux/sched.h       |  54 ++++++++++++++-
 init/Kconfig                     |  21 ++++++
 kernel/Makefile                  |   1 +
 kernel/clonefd.c                 | 123 +++++++++++++++++++++++++++++++++
 kernel/clonefd.h                 |  27 ++++++++
 kernel/exit.c                    |  10 ++-
 kernel/fork.c                    | 143 ++++++++++++++++++++++++++++++++-------
 kernel/signal.c                  |  24 ++++---
 kernel/sys_ni.c                  |   1 +
 22 files changed, 425 insertions(+), 49 deletions(-)
 create mode 100644 kernel/clonefd.c
 create mode 100644 kernel/clonefd.h

-- 
2.1.4


* [PATCH 1/6] clone: Support passing tls argument via C rather than pt_regs magic
From: Josh Triplett @ 2015-03-13  1:40 UTC
  To: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Thiago Macieira, Michael Kerrisk, linux-kernel,
	linux-api, linux-fsdevel, x86

clone with CLONE_SETTLS accepts an argument to set the thread-local
storage area for the new thread.  sys_clone declares an int argument
tls_val at the appropriate point in the argument list (based on the
various CLONE_BACKWARDS variants), but doesn't actually use or pass
along that argument.  Instead, sys_clone calls do_fork, which calls
copy_process, which calls the arch-specific copy_thread, and copy_thread
pulls the corresponding syscall argument out of the pt_regs captured at
kernel entry (knowing what argument of clone that architecture passes
tls in).

Apart from being awful and inscrutable, that also only works because
only one code path into copy_thread can pass the CLONE_SETTLS flag, and
that code path comes from sys_clone with its architecture-specific
argument-passing order.  This prevents introducing a new version of the
clone system call without propagating the same architecture-specific
position of the tls argument.

However, there's no reason to pull the argument out of pt_regs when
sys_clone could just pass it down via C function call arguments.

Introduce a new CONFIG_HAVE_COPY_THREAD_TLS for architectures to opt
into, and a new copy_thread_tls that accepts the tls parameter as an
additional unsigned long (syscall-argument-sized) argument.
Change sys_clone's tls argument to an unsigned long (which does
not change the ABI), and pass that down to copy_thread_tls.

Architectures that don't opt into copy_thread_tls will continue to
ignore the C argument to sys_clone in favor of the pt_regs captured at
kernel entry, and thus will be unable to introduce new versions of the
clone syscall.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Thiago Macieira <thiago.macieira@intel.com>
---
 arch/Kconfig             |  7 ++++++
 include/linux/sched.h    | 14 ++++++++++++
 include/linux/syscalls.h |  6 +++---
 kernel/fork.c            | 55 +++++++++++++++++++++++++++++++-----------------
 4 files changed, 60 insertions(+), 22 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 05d7a8a..4834a58 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -484,6 +484,13 @@ config HAVE_IRQ_EXIT_ON_IRQ_STACK
 	  This spares a stack switch and improves cache usage on softirq
 	  processing.
 
+config HAVE_COPY_THREAD_TLS
+	bool
+	help
+	  Architecture provides copy_thread_tls to accept tls argument via
+	  normal C parameter passing, rather than extracting the syscall
+	  argument from pt_regs.
+
 #
 # ABI hall of shame
 #
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d77432..9ec36fd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2479,8 +2479,22 @@ extern struct mm_struct *mm_access(struct task_struct *task, unsigned int mode);
 /* Remove the current tasks stale references to the old mm_struct */
 extern void mm_release(struct task_struct *, struct mm_struct *);
 
+#ifdef CONFIG_HAVE_COPY_THREAD_TLS
+extern int copy_thread_tls(unsigned long, unsigned long, unsigned long,
+			struct task_struct *, unsigned long);
+#else
 extern int copy_thread(unsigned long, unsigned long, unsigned long,
 			struct task_struct *);
+
+/* Architectures that haven't opted into copy_thread_tls get the tls argument
+ * via pt_regs, so ignore the tls argument passed via C. */
+static inline int copy_thread_tls(
+		unsigned long clone_flags, unsigned long sp, unsigned long arg,
+		struct task_struct *p, unsigned long tls)
+{
+	return copy_thread(clone_flags, sp, arg, p);
+}
+#endif
 extern void flush_thread(void);
 extern void exit_thread(void);
 
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 76d1e38..bb51bec 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -827,15 +827,15 @@ asmlinkage long sys_syncfs(int fd);
 asmlinkage long sys_fork(void);
 asmlinkage long sys_vfork(void);
 #ifdef CONFIG_CLONE_BACKWARDS
-asmlinkage long sys_clone(unsigned long, unsigned long, int __user *, int,
+asmlinkage long sys_clone(unsigned long, unsigned long, int __user *, unsigned long,
 	       int __user *);
 #else
 #ifdef CONFIG_CLONE_BACKWARDS3
 asmlinkage long sys_clone(unsigned long, unsigned long, int, int __user *,
-			  int __user *, int);
+			  int __user *, unsigned long);
 #else
 asmlinkage long sys_clone(unsigned long, unsigned long, int __user *,
-	       int __user *, int);
+	       int __user *, unsigned long);
 #endif
 #endif
 
diff --git a/kernel/fork.c b/kernel/fork.c
index cf65139..b3dadf4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1192,7 +1192,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 					unsigned long stack_size,
 					int __user *child_tidptr,
 					struct pid *pid,
-					int trace)
+					int trace,
+					unsigned long tls)
 {
 	int retval;
 	struct task_struct *p;
@@ -1401,7 +1402,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	retval = copy_io(clone_flags, p);
 	if (retval)
 		goto bad_fork_cleanup_namespaces;
-	retval = copy_thread(clone_flags, stack_start, stack_size, p);
+	retval = copy_thread_tls(clone_flags, stack_start, stack_size, p, tls);
 	if (retval)
 		goto bad_fork_cleanup_io;
 
@@ -1613,7 +1614,7 @@ static inline void init_idle_pids(struct pid_link *links)
 struct task_struct *fork_idle(int cpu)
 {
 	struct task_struct *task;
-	task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0);
+	task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0, 0);
 	if (!IS_ERR(task)) {
 		init_idle_pids(task->pids);
 		init_idle(task, cpu);
@@ -1628,11 +1629,13 @@ struct task_struct *fork_idle(int cpu)
  * It copies the process, and if successful kick-starts
  * it and waits for it to finish using the VM if required.
  */
-long do_fork(unsigned long clone_flags,
-	      unsigned long stack_start,
-	      unsigned long stack_size,
-	      int __user *parent_tidptr,
-	      int __user *child_tidptr)
+static long _do_fork(
+		unsigned long clone_flags,
+		unsigned long stack_start,
+		unsigned long stack_size,
+		int __user *parent_tidptr,
+		int __user *child_tidptr,
+		unsigned long tls)
 {
 	struct task_struct *p;
 	int trace = 0;
@@ -1657,7 +1660,7 @@ long do_fork(unsigned long clone_flags,
 	}
 
 	p = copy_process(clone_flags, stack_start, stack_size,
-			 child_tidptr, NULL, trace);
+			 child_tidptr, NULL, trace, tls);
 	/*
 	 * Do this prior waking up the new thread - the thread pointer
 	 * might get invalid after that point, if the thread exits quickly.
@@ -1698,20 +1701,34 @@ long do_fork(unsigned long clone_flags,
 	return nr;
 }
 
+#ifndef CONFIG_HAVE_COPY_THREAD_TLS
+/* For compatibility with architectures that call do_fork directly rather than
+ * using the syscall entry points below. */
+long do_fork(unsigned long clone_flags,
+	      unsigned long stack_start,
+	      unsigned long stack_size,
+	      int __user *parent_tidptr,
+	      int __user *child_tidptr)
+{
+	return _do_fork(clone_flags, stack_start, stack_size,
+			parent_tidptr, child_tidptr, 0);
+}
+#endif
+
 /*
  * Create a kernel thread.
  */
 pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
 {
-	return do_fork(flags|CLONE_VM|CLONE_UNTRACED, (unsigned long)fn,
-		(unsigned long)arg, NULL, NULL);
+	return _do_fork(flags|CLONE_VM|CLONE_UNTRACED, (unsigned long)fn,
+		(unsigned long)arg, NULL, NULL, 0);
 }
 
 #ifdef __ARCH_WANT_SYS_FORK
 SYSCALL_DEFINE0(fork)
 {
 #ifdef CONFIG_MMU
-	return do_fork(SIGCHLD, 0, 0, NULL, NULL);
+	return _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0);
 #else
 	/* can not support in nommu mode */
 	return -EINVAL;
@@ -1722,8 +1739,8 @@ SYSCALL_DEFINE0(fork)
 #ifdef __ARCH_WANT_SYS_VFORK
 SYSCALL_DEFINE0(vfork)
 {
-	return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, 0,
-			0, NULL, NULL);
+	return _do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, 0,
+			0, NULL, NULL, 0);
 }
 #endif
 
@@ -1731,27 +1748,27 @@ SYSCALL_DEFINE0(vfork)
 #ifdef CONFIG_CLONE_BACKWARDS
 SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
 		 int __user *, parent_tidptr,
-		 int, tls_val,
+		 unsigned long, tls,
 		 int __user *, child_tidptr)
 #elif defined(CONFIG_CLONE_BACKWARDS2)
 SYSCALL_DEFINE5(clone, unsigned long, newsp, unsigned long, clone_flags,
 		 int __user *, parent_tidptr,
 		 int __user *, child_tidptr,
-		 int, tls_val)
+		 unsigned long, tls)
 #elif defined(CONFIG_CLONE_BACKWARDS3)
 SYSCALL_DEFINE6(clone, unsigned long, clone_flags, unsigned long, newsp,
 		int, stack_size,
 		int __user *, parent_tidptr,
 		int __user *, child_tidptr,
-		int, tls_val)
+		unsigned long, tls)
 #else
 SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
 		 int __user *, parent_tidptr,
 		 int __user *, child_tidptr,
-		 int, tls_val)
+		 unsigned long, tls)
 #endif
 {
-	return do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr);
+	return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls);
 }
 #endif
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 2/6] x86: Opt into HAVE_COPY_THREAD_TLS, for both 32-bit and 64-bit
@ 2015-03-13  1:40   ` Josh Triplett
  0 siblings, 0 replies; 83+ messages in thread
From: Josh Triplett @ 2015-03-13  1:40 UTC (permalink / raw)
  To: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Thiago Macieira, Michael Kerrisk, linux-kernel,
	linux-api, linux-fsdevel, x86

For 32-bit userspace on a 64-bit kernel, this requires modifying
stub32_clone to actually swap the appropriate arguments to match
CONFIG_CLONE_BACKWARDS, rather than just leaving the C argument for tls
broken.
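
As an illustration of why an exchange is needed: with CONFIG_CLONE_BACKWARDS
the fourth and fifth clone arguments are (tls, child_tidptr), while the default
sys_clone order is (child_tidptr, tls).  The following is a userspace model
only, not kernel code; struct raw_args, struct clone_call and both helper
names are invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: the five clone arguments in the order userspace passed them. */
struct raw_args {
	uint64_t a[5];
};

/* The same arguments, named according to the default sys_clone order. */
struct clone_call {
	uint64_t flags, newsp, parent_tidptr, child_tidptr, tls;
};

/* Default order: flags, newsp, parent_tidptr, child_tidptr, tls. */
static struct clone_call decode_default_order(const struct raw_args *r)
{
	struct clone_call c;

	assert(r != NULL);
	c.flags = r->a[0];
	c.newsp = r->a[1];
	c.parent_tidptr = r->a[2];
	c.child_tidptr = r->a[3];
	c.tls = r->a[4];
	return c;
}

/*
 * CLONE_BACKWARDS order is flags, newsp, parent_tidptr, tls, child_tidptr.
 * Exchanging arguments 4 and 5 yields the default order; a one-way copy
 * would leave the fifth argument holding the wrong value.
 */
static struct raw_args swap_args_4_and_5(struct raw_args r)
{
	uint64_t tmp = r.a[3];

	r.a[3] = r.a[4];
	r.a[4] = tmp;
	return r;
}
```

Passing a CLONE_BACKWARDS-ordered argument set through swap_args_4_and_5()
and then decode_default_order() gives the tls and child_tidptr values the
caller intended.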

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Thiago Macieira <thiago.macieira@intel.com>
---
 arch/x86/Kconfig             | 1 +
 arch/x86/ia32/ia32entry.S    | 2 +-
 arch/x86/kernel/process_32.c | 6 +++---
 arch/x86/kernel/process_64.c | 8 ++++----
 4 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b7d31ca..4960b0d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -124,6 +124,7 @@ config X86
 	select MODULES_USE_ELF_REL if X86_32
 	select MODULES_USE_ELF_RELA if X86_64
 	select CLONE_BACKWARDS if X86_32
+	select HAVE_COPY_THREAD_TLS
 	select ARCH_USE_BUILTIN_BSWAP
 	select ARCH_USE_QUEUE_RWLOCK
 	select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 156ebca..0286735 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -487,7 +487,7 @@ GLOBAL(\label)
 	ALIGN
 GLOBAL(stub32_clone)
 	leaq sys_clone(%rip),%rax
-	mov	%r8, %rcx
+	xchg %r8, %rcx
 	jmp  ia32_ptregs_common	
 
 	ALIGN
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 603c4f9..ead28ff 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -129,8 +129,8 @@ void release_thread(struct task_struct *dead_task)
 	release_vm86_irqs(dead_task);
 }
 
-int copy_thread(unsigned long clone_flags, unsigned long sp,
-	unsigned long arg, struct task_struct *p)
+int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
+	unsigned long arg, struct task_struct *p, unsigned long tls)
 {
 	struct pt_regs *childregs = task_pt_regs(p);
 	struct task_struct *tsk;
@@ -185,7 +185,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp,
 	 */
 	if (clone_flags & CLONE_SETTLS)
 		err = do_set_thread_area(p, -1,
-			(struct user_desc __user *)childregs->si, 0);
+			(struct user_desc __user *)tls, 0);
 
 	if (err && p->thread.io_bitmap_ptr) {
 		kfree(p->thread.io_bitmap_ptr);
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 67fcc43..c69cabc 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -151,8 +151,8 @@ static inline u32 read_32bit_tls(struct task_struct *t, int tls)
 	return get_desc_base(&t->thread.tls_array[tls]);
 }
 
-int copy_thread(unsigned long clone_flags, unsigned long sp,
-		unsigned long arg, struct task_struct *p)
+int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
+		unsigned long arg, struct task_struct *p, unsigned long tls)
 {
 	int err;
 	struct pt_regs *childregs;
@@ -209,10 +209,10 @@ int copy_thread(unsigned long clone_flags, unsigned long sp,
 #ifdef CONFIG_IA32_EMULATION
 		if (test_thread_flag(TIF_IA32))
 			err = do_set_thread_area(p, -1,
-				(struct user_desc __user *)childregs->si, 0);
+				(struct user_desc __user *)tls, 0);
 		else
 #endif
-			err = do_arch_prctl(p, ARCH_SET_FS, childregs->r8);
+			err = do_arch_prctl(p, ARCH_SET_FS, tls);
 		if (err)
 			goto out;
 	}
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 3/6] Introduce a new clone4 syscall with more flag bits and extensible arguments
  2015-03-13  1:40 ` Josh Triplett
                   ` (2 preceding siblings ...)
  (?)
@ 2015-03-13  1:40 ` Josh Triplett
  -1 siblings, 0 replies; 83+ messages in thread
From: Josh Triplett @ 2015-03-13  1:40 UTC (permalink / raw)
  To: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Thiago Macieira, Michael Kerrisk, linux-kernel,
	linux-api, linux-fsdevel, x86

clone() has no more usable flags available.  It has three now-unused
flags (CLONE_PID, CLONE_DETACHED, and CLONE_STOPPED), but current
kernels just ignore those flags without returning an error like EINVAL,
so reusing those flags would not allow userspace to detect the
availability of the new functionality.

Introduce a new system call, clone4, which accepts a second 32-bit flags
field.  clone4 also returns EINVAL for the currently unused flags in
clone, allowing their reuse.
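
A minimal userspace sketch of the two steps described above, assuming the flag
values defined later in this patch: the two 32-bit words are combined into one
64-bit value, and any bit outside the valid set yields EINVAL.  The helper
names clone4_combine_flags and clone4_check_flags are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Flag values as defined in this patch. */
#define CLONE_PID	0x00001000
#define CLONE_DETACHED	0x00400000
#define CLONE_STOPPED	0x02000000
#define CLONE_VALID_FLAGS \
	(0xffffffffULL & ~(CLONE_PID | CLONE_DETACHED | CLONE_STOPPED))
#define CLONE4_VALID_FLAGS	CLONE_VALID_FLAGS

/* Combine the two 32-bit syscall arguments into one 64-bit flags value. */
static uint64_t clone4_combine_flags(uint32_t flags_high, uint32_t flags_low)
{
	return (uint64_t)flags_high << 32 | flags_low;
}

/* Return 0 if every set bit is a known flag, -EINVAL otherwise. */
static int clone4_check_flags(uint64_t flags)
{
	/* At this point in the series no flag lives in the high word. */
	assert((CLONE4_VALID_FLAGS >> 32) == 0);

	return (flags & ~CLONE4_VALID_FLAGS) ? -EINVAL : 0;
}
```

Because unknown bits are rejected rather than ignored, userspace can probe for
a future flag by setting it and checking for EINVAL.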

To process these new flags, change the flags argument of _do_fork to a
u64.  sys_clone and do_fork both still use "unsigned long" for flags as
they did before, truncating it to 32-bit and masking out the obsolete
flags to behave like clone currently does.
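
For illustration, the truncate-and-mask step for the old entry points could be
modelled in userspace as below.  This is a sketch of the behaviour described
in the paragraph above, using the flag values from this patch; the function
name squelch_clone_flags_sketch is made up to avoid implying it is the
in-kernel helper.

```c
#include <assert.h>
#include <stdint.h>

/* Flag values as defined in this patch. */
#define CLONE_PID	0x00001000
#define CLONE_DETACHED	0x00400000
#define CLONE_STOPPED	0x02000000
#define CLONE_VALID_FLAGS \
	(0xffffffffULL & ~(CLONE_PID | CLONE_DETACHED | CLONE_STOPPED))

/*
 * Truncate to 32 bits and drop the three obsolete flags, so that a legacy
 * caller passing them is silently tolerated instead of getting EINVAL.
 */
static uint64_t squelch_clone_flags_sketch(unsigned long clone_flags)
{
	uint64_t flags = (uint32_t)clone_flags & CLONE_VALID_FLAGS;

	assert((flags & (CLONE_PID | CLONE_DETACHED | CLONE_STOPPED)) == 0);
	return flags;
}
```

The result always passes a CLONE_VALID_FLAGS check, which is what keeps
clone() behaving as it does today.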

clone4 accepts its remaining arguments as a structure, and userspace
passes in the size of that structure.  clone4 has well-defined semantics
that allow extending that structure in the future.  New userspace
passing in a larger structure than the kernel expects will receive
EINVAL, and can use a smaller structure to work with old kernels.  New
kernels accept smaller argument structures passed by userspace, and any
un-passed arguments default to 0.
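
The size-versioned argument handling can be sketched in userspace as follows.
The structure mirrors the field names of struct clone4_args from this patch
but uses plain C types, memcpy stands in for the copy from userspace, and
fetch_clone4_args is a hypothetical name.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for struct clone4_args; plain C types replace __kernel_* types. */
struct clone4_args_sketch {
	int *ptid;
	int *ctid;
	unsigned long stack_start;
	unsigned long stack_size;
	unsigned long tls;
};

/*
 * Zero the kernel-side copy first, reject a caller-supplied size larger than
 * the structure this "kernel" knows, then copy only what the caller supplied.
 * Fields beyond args_size therefore keep their default of 0.
 */
static int fetch_clone4_args(struct clone4_args_sketch *kargs,
			     const void *uargs, size_t args_size)
{
	assert(kargs != NULL);
	memset(kargs, 0, sizeof(*kargs));

	if (args_size > sizeof(*kargs))
		return -EINVAL;
	if (args_size)
		memcpy(kargs, uargs, args_size);
	return 0;
}
```

A caller built against an older, shorter structure passes its smaller size and
gets zero for every field it did not know about; a caller built against a
newer, longer structure sees EINVAL and can retry with a smaller size.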

clone4 handles arguments in the same order on all architectures, with no
backwards variations; to do so, it depends on the new
HAVE_COPY_THREAD_TLS.

The new system call currently accepts exactly the same flags as clone;
future commits will introduce new flags for additional functionality.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Thiago Macieira <thiago.macieira@intel.com>
---
 arch/x86/ia32/ia32entry.S        |  1 +
 arch/x86/kernel/entry_64.S       |  1 +
 arch/x86/syscalls/syscall_32.tbl |  1 +
 arch/x86/syscalls/syscall_64.tbl |  2 ++
 include/linux/compat.h           | 12 ++++++++
 include/uapi/linux/sched.h       | 33 ++++++++++++++++++++--
 init/Kconfig                     | 10 +++++++
 kernel/fork.c                    | 60 +++++++++++++++++++++++++++++++++++++---
 kernel/sys_ni.c                  |  1 +
 9 files changed, 114 insertions(+), 7 deletions(-)

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 0286735..ba28306 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -483,6 +483,7 @@ GLOBAL(\label)
 	PTREGSCALL stub32_execveat, compat_sys_execveat
 	PTREGSCALL stub32_fork, sys_fork
 	PTREGSCALL stub32_vfork, sys_vfork
+	PTREGSCALL stub32_clone4, compat_sys_clone4
 
 	ALIGN
 GLOBAL(stub32_clone)
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 1d74d16..ead143f 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -520,6 +520,7 @@ END(\label)
 	FORK_LIKE  clone
 	FORK_LIKE  fork
 	FORK_LIKE  vfork
+	FORK_LIKE  clone4
 	FIXED_FRAME stub_iopl, sys_iopl
 
 ENTRY(stub_execve)
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index b3560ec..56fcc90 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -365,3 +365,4 @@
 356	i386	memfd_create		sys_memfd_create
 357	i386	bpf			sys_bpf
 358	i386	execveat		sys_execveat			stub32_execveat
+359	i386	clone4			sys_clone4			stub32_clone4
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 8d656fb..af15b0f 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
 320	common	kexec_file_load		sys_kexec_file_load
 321	common	bpf			sys_bpf
 322	64	execveat		stub_execveat
+323	64	clone4			stub_clone4
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
@@ -368,3 +369,4 @@
 543	x32	io_setup		compat_sys_io_setup
 544	x32	io_submit		compat_sys_io_submit
 545	x32	execveat		stub_x32_execveat
+546	x32	clone4			stub32_clone4
diff --git a/include/linux/compat.h b/include/linux/compat.h
index ab25814..6c4a68d 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -293,6 +293,14 @@ struct compat_old_sigaction {
 };
 #endif
 
+struct compat_clone4_args {
+	compat_uptr_t ptid;
+	compat_uptr_t ctid;
+	compat_ulong_t stack_start;
+	compat_ulong_t stack_size;
+	compat_ulong_t tls;
+};
+
 struct compat_statfs;
 struct compat_statfs64;
 struct compat_old_linux_dirent;
@@ -713,6 +721,10 @@ asmlinkage long compat_sys_sched_rr_get_interval(compat_pid_t pid,
 
 asmlinkage long compat_sys_fanotify_mark(int, unsigned int, __u32, __u32,
 					    int, const char __user *);
+
+asmlinkage long compat_sys_clone4(unsigned, unsigned, compat_ulong_t,
+				  struct compat_clone4_args __user *);
+
 #else
 
 #define is_compat_task() (0)
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index cc89dde..b5b8012 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -1,6 +1,8 @@
 #ifndef _UAPI_LINUX_SCHED_H
 #define _UAPI_LINUX_SCHED_H
 
+#include <linux/types.h>
+
 /*
  * cloning flags:
  */
@@ -18,11 +20,8 @@
 #define CLONE_SETTLS	0x00080000	/* create a new TLS for the child */
 #define CLONE_PARENT_SETTID	0x00100000	/* set the TID in the parent */
 #define CLONE_CHILD_CLEARTID	0x00200000	/* clear the TID in the child */
-#define CLONE_DETACHED		0x00400000	/* Unused, ignored */
 #define CLONE_UNTRACED		0x00800000	/* set if the tracing process can't force CLONE_PTRACE on this clone */
 #define CLONE_CHILD_SETTID	0x01000000	/* set the TID in the child */
-/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
-   and is now available for re-use. */
 #define CLONE_NEWUTS		0x04000000	/* New utsname namespace */
 #define CLONE_NEWIPC		0x08000000	/* New ipc namespace */
 #define CLONE_NEWUSER		0x10000000	/* New user namespace */
@@ -31,6 +30,34 @@
 #define CLONE_IO		0x80000000	/* Clone io context */
 
 /*
+ * Old flags, unused by current clone.  clone does not return EINVAL for these
+ * flags, so they can't easily be reused.  clone4 can use them.
+ */
+#define CLONE_PID	0x00001000
+#define CLONE_DETACHED	0x00400000
+#define CLONE_STOPPED	0x02000000
+
+/*
+ * Valid flags for clone and for clone4
+ */
+#define CLONE_VALID_FLAGS	(0xffffffffULL & ~(CLONE_PID | CLONE_DETACHED | CLONE_STOPPED))
+#define CLONE4_VALID_FLAGS	CLONE_VALID_FLAGS
+
+/*
+ * Structure passed to clone4 for additional arguments.  Initialized to 0,
+ * then overwritten with arguments from userspace, so arguments not supplied by
+ * userspace will remain 0.  New versions of the kernel may safely append new
+ * arguments to the end.
+ */
+struct clone4_args {
+	__kernel_pid_t __user *ptid;
+	__kernel_pid_t __user *ctid;
+	__kernel_ulong_t stack_start;
+	__kernel_ulong_t stack_size;
+	__kernel_ulong_t tls;
+};
+
+/*
  * Scheduling policies
  */
 #define SCHED_NORMAL		0
diff --git a/init/Kconfig b/init/Kconfig
index f5dbc6d..3ab6649 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1511,6 +1511,16 @@ config EVENTFD
 
 	  If unsure, say Y.
 
+config CLONE4
+	bool "Enable clone4() system call" if EXPERT
+	depends on HAVE_COPY_THREAD_TLS
+	default y
+	help
+	  Enable the clone4() system call, which supports passing additional
+	  flags.
+
+	  If unsure, say Y.
+
 # syscall, maps, verifier
 config BPF_SYSCALL
 	bool "Enable bpf() system call" if EXPERT
diff --git a/kernel/fork.c b/kernel/fork.c
index b3dadf4..e29edea 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1187,7 +1187,7 @@ init_task_pid(struct task_struct *task, enum pid_type type, struct pid *pid)
  * parts of the process environment (as per the clone
  * flags). The actual kick-off is left to the caller.
  */
-static struct task_struct *copy_process(unsigned long clone_flags,
+static struct task_struct *copy_process(u64 clone_flags,
 					unsigned long stack_start,
 					unsigned long stack_size,
 					int __user *child_tidptr,
@@ -1198,6 +1198,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	int retval;
 	struct task_struct *p;
 
+	if (clone_flags & ~CLONE4_VALID_FLAGS)
+		return ERR_PTR(-EINVAL);
+
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
 
@@ -1630,7 +1633,7 @@ struct task_struct *fork_idle(int cpu)
  * it and waits for it to finish using the VM if required.
  */
 static long _do_fork(
-		unsigned long clone_flags,
+		u64 clone_flags,
 		unsigned long stack_start,
 		unsigned long stack_size,
 		int __user *parent_tidptr,
@@ -1701,6 +1704,15 @@ static long _do_fork(
 	return nr;
 }
 
+/*
+ * Convenience function for callers passing unsigned long flags, to prevent old
+ * syscall entry points from unexpectedly returning EINVAL.
+ */
+static inline u64 squelch_clone_flags(unsigned long clone_flags)
+{
+	return (u32)(clone_flags & CLONE_VALID_FLAGS);
+}
+
 #ifndef CONFIG_HAVE_COPY_THREAD_TLS
 /* For compatibility with architectures that call do_fork directly rather than
  * using the syscall entry points below. */
@@ -1710,7 +1722,8 @@ long do_fork(unsigned long clone_flags,
 	      int __user *parent_tidptr,
 	      int __user *child_tidptr)
 {
-	return _do_fork(clone_flags, stack_start, stack_size,
+	return _do_fork(squelch_clone_flags(clone_flags),
+			stack_start, stack_size,
 			parent_tidptr, child_tidptr, 0);
 }
 #endif
@@ -1768,10 +1781,49 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
 		 unsigned long, tls)
 #endif
 {
-	return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls);
+	return _do_fork(squelch_clone_flags(clone_flags), newsp, 0,
+			parent_tidptr, child_tidptr, tls);
 }
 #endif
 
+#ifdef CONFIG_CLONE4
+SYSCALL_DEFINE4(clone4, unsigned, flags_high, unsigned, flags_low,
+		unsigned long, args_size, struct clone4_args __user *, args)
+{
+	struct clone4_args kargs = {};
+	if (args_size > sizeof(kargs)) {
+		return -EINVAL;
+	} else if (args_size) {
+		/* copy_from_user returns the number of bytes left uncopied */
+		if (copy_from_user(&kargs, args, args_size))
+			return -EFAULT;
+	}
+	return _do_fork((u64)flags_high << 32 | flags_low,
+			kargs.stack_start, kargs.stack_size,
+			kargs.ptid, kargs.ctid, kargs.tls);
+}
+
+#ifdef CONFIG_COMPAT
+COMPAT_SYSCALL_DEFINE4(clone4, unsigned, flags_high, unsigned, flags_low,
+			compat_ulong_t, args_size,
+			struct compat_clone4_args __user *, args)
+{
+	struct compat_clone4_args kargs = {};
+	if (args_size > sizeof(kargs)) {
+		return -EINVAL;
+	} else if (args_size) {
+		/* copy_from_user returns the number of bytes left uncopied */
+		if (copy_from_user(&kargs, args, args_size))
+			return -EFAULT;
+	}
+	return _do_fork((u64)flags_high << 32 | flags_low,
+			kargs.stack_start, kargs.stack_size,
+			compat_ptr(kargs.ptid), compat_ptr(kargs.ctid),
+			kargs.tls);
+}
+#endif /* CONFIG_COMPAT */
+#endif /* CONFIG_CLONE4 */
+
 #ifndef ARCH_MIN_MMSTRUCT_ALIGN
 #define ARCH_MIN_MMSTRUCT_ALIGN 0
 #endif
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 5adcb0a..5b5d2b9 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -159,6 +159,7 @@ cond_syscall(sys_uselib);
 cond_syscall(sys_fadvise64);
 cond_syscall(sys_fadvise64_64);
 cond_syscall(sys_madvise);
+cond_syscall(sys_clone4);
 
 /* arch-specific weak syscall entries */
 cond_syscall(sys_pciconfig_read);
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 4/6] signal: Factor out a helper function to process task_struct exit_code
  2015-03-13  1:40 ` Josh Triplett
                   ` (3 preceding siblings ...)
  (?)
@ 2015-03-13  1:40 ` Josh Triplett
  -1 siblings, 0 replies; 83+ messages in thread
From: Josh Triplett @ 2015-03-13  1:40 UTC (permalink / raw)
  To: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Thiago Macieira, Michael Kerrisk, linux-kernel,
	linux-api, linux-fsdevel, x86

do_notify_parent includes the code to convert the exit_code field of
struct task_struct to the code and status fields that accompany SIGCHLD.
Factor that out into a new helper function task_exit_code_status, to
allow other methods of task exit notification to share that code.
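
For reference, the translation being factored out can be exercised in
userspace with the sketch below.  It reproduces the same bit tests; the
SK_CLD_* names are local to this example (their values match the Linux CLD_*
constants) and exit_code_to_status is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same values as CLD_EXITED, CLD_KILLED and CLD_DUMPED on Linux. */
enum { SK_CLD_EXITED = 1, SK_CLD_KILLED = 2, SK_CLD_DUMPED = 3 };

/*
 * exit_code layout: bits 0-6 hold the terminating signal (0 for a normal
 * exit), bit 7 marks a core dump, bits 8 and up hold the exit status.
 */
static void exit_code_to_status(int exit_code, int32_t *code, int32_t *status)
{
	assert(code != NULL && status != NULL);

	*status = exit_code & 0x7f;
	if (exit_code & 0x80)
		*code = SK_CLD_DUMPED;
	else if (exit_code & 0x7f)
		*code = SK_CLD_KILLED;
	else {
		*code = SK_CLD_EXITED;
		*status = exit_code >> 8;
	}
}
```

For example, an exit_code of 3 << 8 decodes to an exited task with status 3,
while 9 decodes to a task killed by signal 9.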

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Thiago Macieira <thiago.macieira@intel.com>
---
 include/linux/sched.h |  1 +
 kernel/signal.c       | 24 +++++++++++++++---------
 2 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9ec36fd..668c58f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2386,6 +2386,7 @@ extern int kill_pid_info_as_cred(int, struct siginfo *, struct pid *,
 extern int kill_pgrp(struct pid *pid, int sig, int priv);
 extern int kill_pid(struct pid *pid, int sig, int priv);
 extern int kill_proc_info(int, struct siginfo *, pid_t);
+extern void task_exit_code_status(int exit_code, s32 *code, s32 *status);
 extern __must_check bool do_notify_parent(struct task_struct *, int);
 extern void __wake_up_parent(struct task_struct *p, struct task_struct *parent);
 extern void force_sig(int, struct task_struct *);
diff --git a/kernel/signal.c b/kernel/signal.c
index a390499..f959d07 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1613,6 +1613,20 @@ ret:
 	return ret;
 }
 
+/* Translate exit_code to code and status. */
+void task_exit_code_status(int exit_code, s32 *code, s32 *status)
+{
+	*status = exit_code & 0x7f;
+	if (exit_code & 0x80)
+		*code = CLD_DUMPED;
+	else if (exit_code & 0x7f)
+		*code = CLD_KILLED;
+	else {
+		*code = CLD_EXITED;
+		*status = exit_code >> 8;
+	}
+}
+
 /*
  * Let a parent know about the death of a child.
  * For a stopped/continued status change, use do_notify_parent_cldstop instead.
@@ -1668,15 +1682,7 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
 	info.si_utime = cputime_to_clock_t(utime + tsk->signal->utime);
 	info.si_stime = cputime_to_clock_t(stime + tsk->signal->stime);
 
-	info.si_status = tsk->exit_code & 0x7f;
-	if (tsk->exit_code & 0x80)
-		info.si_code = CLD_DUMPED;
-	else if (tsk->exit_code & 0x7f)
-		info.si_code = CLD_KILLED;
-	else {
-		info.si_code = CLD_EXITED;
-		info.si_status = tsk->exit_code >> 8;
-	}
+	task_exit_code_status(tsk->exit_code, &info.si_code, &info.si_status);
 
 	psig = tsk->parent->sighand;
 	spin_lock_irqsave(&psig->siglock, flags);
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 5/6] fs: Make alloc_fd non-private
@ 2015-03-13  1:40   ` Josh Triplett
  0 siblings, 0 replies; 83+ messages in thread
From: Josh Triplett @ 2015-03-13  1:40 UTC (permalink / raw)
  To: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Thiago Macieira, Michael Kerrisk, linux-kernel,
	linux-api, linux-fsdevel, x86

This allows callers to allocate a file descriptor with a defined minimum
value, without directly calling the lower-level __alloc_fd.
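
As a loose userspace analogy for the "defined minimum value" semantics,
fcntl(F_DUPFD) returns the lowest free descriptor at or above a given number.
The sketch below only illustrates that idea; it says nothing about how
in-kernel callers use alloc_fd, and dup_at_least is a hypothetical helper.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Duplicate fd onto the lowest free descriptor that is >= min_fd.
 * Returns the new descriptor, or -1 with errno set on failure.
 */
static int dup_at_least(int fd, int min_fd)
{
	assert(min_fd >= 0);
	return fcntl(fd, F_DUPFD, min_fd);
}
```

Calling dup_at_least(fd, 3), for instance, can never hand back descriptor 0,
1, or 2, even if one of them happens to be closed.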

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Thiago Macieira <thiago.macieira@intel.com>
---
 fs/file.c            | 2 +-
 include/linux/file.h | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/file.c b/fs/file.c
index ee738ea..583ba46 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -500,7 +500,7 @@ out:
 	return error;
 }
 
-static int alloc_fd(unsigned start, unsigned flags)
+int alloc_fd(unsigned start, unsigned flags)
 {
 	return __alloc_fd(current->files, start, rlimit(RLIMIT_NOFILE), flags);
 }
diff --git a/include/linux/file.h b/include/linux/file.h
index f87d308..d49f3bd 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -65,6 +65,7 @@ extern int replace_fd(unsigned fd, struct file *file, unsigned flags);
 extern void set_close_on_exec(unsigned int fd, int flag);
 extern bool get_close_on_exec(unsigned int fd);
 extern void put_filp(struct file *);
+extern int alloc_fd(unsigned start, unsigned flags);
 extern int get_unused_fd_flags(unsigned flags);
 extern void put_unused_fd(unsigned int fd);
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
  2015-03-13  1:40 ` Josh Triplett
                   ` (5 preceding siblings ...)
  (?)
@ 2015-03-13  1:41 ` Josh Triplett
  2015-03-13 16:21   ` Oleg Nesterov
  2015-03-14 14:35     ` Oleg Nesterov
  -1 siblings, 2 replies; 83+ messages in thread
From: Josh Triplett @ 2015-03-13  1:41 UTC (permalink / raw)
  To: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Thiago Macieira, Michael Kerrisk, linux-kernel,
	linux-api, linux-fsdevel, x86

When passed CLONE_FD, clone4 will return a file descriptor rather than a
PID.  When the child process exits, it gets automatically reaped, and
the file descriptor becomes readable, producing a structure containing
the exit code and user/system time.  The file descriptor also works in
epoll, poll, or select.
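
A consumer of such a descriptor might look roughly like the sketch below,
which waits with poll and then reads one record.  Since clone4 is not
available to test against, any readable descriptor (a pipe, say) can stand in.
struct exit_info_sketch is a hypothetical stand-in carrying only a code and a
status; the real structure also carries user/system time, which is not
modelled here.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical stand-in for the record read from the descriptor. */
struct exit_info_sketch {
	int32_t code;
	int32_t status;
};

/*
 * Wait up to timeout_ms for fd to become readable, then read one record.
 * Returns 0 on success, -1 on timeout, poll/read error, or short read.
 */
static int wait_for_exit(int fd, int timeout_ms, struct exit_info_sketch *info)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };

	assert(info != NULL);
	if (poll(&pfd, 1, timeout_ms) != 1 || !(pfd.revents & POLLIN))
		return -1;
	if (read(fd, info, sizeof(*info)) != (ssize_t)sizeof(*info))
		return -1;
	return 0;
}
```

The point of the design is that the descriptor slots into an existing event
loop next to sockets and timers, with no signal handler involved.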

This allows libraries to safely launch and manage child processes on
behalf of a caller, without taking over or interfering with process-wide
signal handling.  Without this, such a library would need to take over
or cooperate with the entire process's SIGCHLD handling, either via a
signal handler or a signalfd.

CLONE_FD will never return a file descriptor in the 0-2 range; thus, a 0
return from clone4 still indicates the child process.

Since a process created with CLONE_FD does not send any exit signal, the
low byte of the clone flags no longer needs to contain a signal number,
freeing it up for use as CLONE_FD-specific flags; use that to provide
the usual CLOEXEC and NONBLOCK flags.
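
The dual meaning of the low byte can be sketched as below, using the flag
values from this patch.  struct low_byte and decode_low_byte are hypothetical,
and the sketch does not model what happens to other low-byte bits when
CLONE_FD is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values as defined in this patch. */
#define CLONE_FD		0x00001000
#define CLONEFD_CLOEXEC		0x00000001
#define CLONEFD_NONBLOCK	0x00000002
#define LOW_BYTE_MASK		0x000000ff	/* where clone keeps the exit signal */

struct low_byte {
	int exit_signal;	/* meaningful only without CLONE_FD */
	bool cloexec;		/* meaningful only with CLONE_FD */
	bool nonblock;		/* meaningful only with CLONE_FD */
};

/* Interpret the low byte either as fd flags or as an exit signal number. */
static struct low_byte decode_low_byte(uint64_t flags)
{
	struct low_byte out = { 0, false, false };
	unsigned int low = (unsigned int)(flags & LOW_BYTE_MASK);

	assert(low <= LOW_BYTE_MASK);
	if (flags & CLONE_FD) {
		out.cloexec = (low & CLONEFD_CLOEXEC) != 0;
		out.nonblock = (low & CLONEFD_NONBLOCK) != 0;
	} else {
		out.exit_signal = (int)low;
	}
	return out;
}
```

With CLONE_FD | CLONEFD_CLOEXEC the result has cloexec set and no exit signal;
with a plain 17 (SIGCHLD on x86) it has exit_signal 17 and no fd flags.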

CLONE_FD takes the value of the unused CLONE_PID, so CLONE4_VALID_FLAGS
now includes CLONE_FD; CLONE_VALID_FLAGS still doesn't, and sys_clone
still ignores that flag, as only clone4 can use it.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Thiago Macieira <thiago.macieira@intel.com>
---
 include/linux/sched.h      |   5 ++
 include/uapi/linux/sched.h |  23 ++++++++-
 init/Kconfig               |  11 ++++
 kernel/Makefile            |   1 +
 kernel/clonefd.c           | 123 +++++++++++++++++++++++++++++++++++++++++++++
 kernel/clonefd.h           |  27 ++++++++++
 kernel/exit.c              |  10 +++-
 kernel/fork.c              |  40 ++++++++++++---
 8 files changed, 231 insertions(+), 9 deletions(-)
 create mode 100644 kernel/clonefd.c
 create mode 100644 kernel/clonefd.h

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 668c58f..55cf10bb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1351,6 +1351,9 @@ struct task_struct {
 #if defined(SPLIT_RSS_COUNTING)
 	struct task_rss_stat	rss_stat;
 #endif
+#ifdef CONFIG_CLONEFD
+	wait_queue_head_t clonefd_wqh;
+#endif
 /* task state */
 	int exit_state;
 	int exit_code, exit_signal;
@@ -1372,6 +1375,8 @@ struct task_struct {
 	unsigned memcg_kmem_skip_account:1;
 #endif
 
+	unsigned autoreap:1; /* Do not become a zombie on exit */
+
 	unsigned long atomic_flags; /* Flags needing atomic access. */
 
 	struct restart_block restart_block;
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index b5b8012..d2082c61 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -38,10 +38,31 @@
 #define CLONE_STOPPED	0x02000000
 
 /*
+ * Flags that only work with clone4.
+ */
+#define CLONE_FD	0x00001000	/* set if we want a file descriptor rather than a PID */
+
+/*
  * Valid flags for clone and for clone4
  */
 #define CLONE_VALID_FLAGS	(0xffffffffULL & ~(CLONE_PID | CLONE_DETACHED | CLONE_STOPPED))
-#define CLONE4_VALID_FLAGS	CLONE_VALID_FLAGS
+#define CLONE4_VALID_FLAGS	(CLONE_VALID_FLAGS | CLONE_FD)
+
+/*
+ * Flags passed in the low byte when using CLONE_FD, in place of the signal.
+ */
+#define CLONEFD_CLOEXEC		0x00000001	/* Used with CLONE_FD to set O_CLOEXEC on new fd */
+#define CLONEFD_NONBLOCK	0x00000002	/* Used with CLONE_FD to set O_NONBLOCK on new fd */
+
+/*
+ * Structure read from CLONE_FD file descriptor after process exits
+ */
+struct clonefd_info {
+        __s32 code;
+        __s32 status;
+        __u64 utime;
+        __u64 stime;
+};
 
 /*
  * Structure passed to clone4 for additional arguments.  Initialized to 0,
diff --git a/init/Kconfig b/init/Kconfig
index 3ab6649..b444280 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1521,6 +1521,17 @@ config CLONE4
 
 	  If unsure, say Y.
 
+config CLONEFD
+	bool "Enable CLONE_FD flag for clone4()" if EXPERT
+	depends on CLONE4
+	select ANON_INODES
+	default y
+	help
+	  Enable the CLONE_FD flag for clone4(), which creates a file descriptor
+	  to receive child exit events rather than receiving a signal.
+
+	  If unsure, say Y.
+
 # syscall, maps, verifier
 config BPF_SYSCALL
 	bool "Enable bpf() system call" if EXPERT
diff --git a/kernel/Makefile b/kernel/Makefile
index 1408b33..368986c 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -29,6 +29,7 @@ obj-y += rcu/
 obj-y += livepatch/
 
 obj-$(CONFIG_CHECKPOINT_RESTORE) += kcmp.o
+obj-$(CONFIG_CLONEFD) += clonefd.o
 obj-$(CONFIG_FREEZER) += freezer.o
 obj-$(CONFIG_PROFILING) += profile.o
 obj-$(CONFIG_STACKTRACE) += stacktrace.o
diff --git a/kernel/clonefd.c b/kernel/clonefd.c
new file mode 100644
index 0000000..78fb776
--- /dev/null
+++ b/kernel/clonefd.c
@@ -0,0 +1,123 @@
+/*
+ * Support functions for CLONE_FD
+ *
+ * Copyright (c) 2015 Intel Corporation
+ * Original authors: Josh Triplett <josh@joshtriplett.org>
+ *                   Thiago Macieira <thiago@macieira.org>
+ */
+#include <linux/anon_inodes.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include "clonefd.h"
+
+static int clonefd_release(struct inode *inode, struct file *file)
+{
+	put_task_struct(file->private_data);
+	return 0;
+}
+
+static unsigned int clonefd_poll(struct file *file, poll_table *wait)
+{
+	struct task_struct *p = file->private_data;
+	poll_wait(file, &p->clonefd_wqh, wait);
+	return p->exit_state == EXIT_DEAD ? (POLLIN | POLLRDNORM) : 0;
+}
+
+static ssize_t clonefd_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
+{
+	struct task_struct *p = file->private_data;
+	int ret = 0;
+
+	/* EOF after first read */
+	if (*ppos)
+		return 0;
+
+	if (file->f_flags & O_NONBLOCK)
+		ret = -EAGAIN;
+	else
+		ret = wait_event_interruptible(p->clonefd_wqh, p->exit_state == EXIT_DEAD);
+
+	if (p->exit_state == EXIT_DEAD) {
+		struct clonefd_info info = {};
+		cputime_t utime, stime;
+		task_exit_code_status(p->exit_code, &info.code, &info.status);
+		info.code &= ~__SI_MASK;
+		task_cputime(p, &utime, &stime);
+		info.utime = cputime_to_clock_t(utime + p->signal->utime);
+		info.stime = cputime_to_clock_t(stime + p->signal->stime);
+		ret = simple_read_from_buffer(buf, count, ppos, &info, sizeof(info));
+	}
+	return ret;
+}
+
+static const struct file_operations clonefd_fops = {
+	.release = clonefd_release,
+	.poll = clonefd_poll,
+	.read = clonefd_read,
+	.llseek = no_llseek,
+};
+
+/* Do process exit notification for clonefd. */
+void clonefd_do_notify(struct task_struct *p)
+{
+	if (p->autoreap)
+		wake_up_all(&p->clonefd_wqh);
+}
+
+/* Handle the CLONE_FD case for copy_process. */
+int clonefd_do_clone(u64 clone_flags, struct task_struct *p, struct clonefd_setup *setup)
+{
+	int flags;
+	struct file *file;
+	int fd;
+
+	if (!(clone_flags & CLONE_FD))
+		return 0;
+
+	p->autoreap = 1;
+	init_waitqueue_head(&p->clonefd_wqh);
+
+	get_task_struct(p);
+	flags = O_RDONLY | FMODE_ATOMIC_POS
+	      | (clone_flags & CLONEFD_CLOEXEC ? O_CLOEXEC : 0)
+	      | (clone_flags & CLONEFD_NONBLOCK ? O_NONBLOCK : 0);
+	file = anon_inode_getfile("[process]", &clonefd_fops, p, flags);
+	if (IS_ERR(file)) {
+		put_task_struct(p);
+		return PTR_ERR(file);
+	}
+
+	/*
+	 * We avoid allocating a low fd so that clone can still return 0 in the
+	 * child; the child shouldn't have to change just because the parent
+	 * used CLONE_FD.
+	 */
+	fd = alloc_fd(3, flags);
+	if (fd < 0) {
+		fput(file);
+		return fd;
+	}
+
+	setup->fd = fd;
+	setup->file = file;
+
+	return 0;
+}
+
+/* Clean up clonefd information after a partially complete clone */
+void clonefd_cleanup_failed_clone(struct task_struct *p, struct clonefd_setup *setup)
+{
+	if (setup->fd)
+		put_unused_fd(setup->fd);
+	if (setup->file)
+		fput(setup->file);
+}
+
+/* Finish setting up the clonefd */
+int clonefd_install_fd(struct task_struct *p, struct clonefd_setup *setup)
+{
+	fd_install(setup->fd, setup->file);
+	return setup->fd;
+}
diff --git a/kernel/clonefd.h b/kernel/clonefd.h
new file mode 100644
index 0000000..07bd31f
--- /dev/null
+++ b/kernel/clonefd.h
@@ -0,0 +1,27 @@
+/*
+ * Support functions for CLONE_FD
+ *
+ * Copyright (c) 2015 Intel Corporation
+ * Original authors: Josh Triplett <josh@joshtriplett.org>
+ *                   Thiago Macieira <thiago@macieira.org>
+ */
+#pragma once
+
+#include <linux/sched.h>
+
+#ifdef CONFIG_CLONEFD
+struct clonefd_setup {
+	int fd;
+	struct file *file;
+};
+int clonefd_do_clone(u64 clone_flags, struct task_struct *p, struct clonefd_setup *setup);
+void clonefd_cleanup_failed_clone(struct task_struct *p, struct clonefd_setup *setup);
+int clonefd_install_fd(struct task_struct *p, struct clonefd_setup *setup);
+void clonefd_do_notify(struct task_struct *p);
+#else /* CONFIG_CLONEFD */
+struct clonefd_setup {};
+static inline int clonefd_do_clone(u64 clone_flags, struct task_struct *p, struct clonefd_setup *setup) { return 0; }
+static inline void clonefd_cleanup_failed_clone(struct task_struct *p, struct clonefd_setup *setup) {}
+static inline int clonefd_install_fd(struct task_struct *p, struct clonefd_setup *setup) { return -EINVAL; }
+static inline void clonefd_do_notify(struct task_struct *p) {}
+#endif /* CONFIG_CLONEFD */
diff --git a/kernel/exit.c b/kernel/exit.c
index feff10b..a2c8520 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -59,6 +59,8 @@
 #include <asm/pgtable.h>
 #include <asm/mmu_context.h>
 
+#include "clonefd.h"
+
 static void exit_mm(struct task_struct *tsk);
 
 static void __unhash_process(struct task_struct *p, bool group_dead)
@@ -598,7 +600,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
 	if (group_dead)
 		kill_orphaned_pgrp(tsk->group_leader, NULL);
 
-	if (unlikely(tsk->ptrace)) {
+	if (tsk->autoreap) {
+		autoreap = true;
+	} else if (unlikely(tsk->ptrace)) {
 		int sig = thread_group_leader(tsk) &&
 				thread_group_empty(tsk) &&
 				!ptrace_reparented(tsk) ?
@@ -612,8 +616,10 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
 	}
 
 	tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE;
-	if (tsk->exit_state == EXIT_DEAD)
+	if (tsk->exit_state == EXIT_DEAD) {
 		list_add(&tsk->ptrace_entry, &dead);
+		clonefd_do_notify(tsk);
+	}
 
 	/* mt-exec, de_thread() is waiting for group leader */
 	if (unlikely(tsk->signal->notify_count < 0))
diff --git a/kernel/fork.c b/kernel/fork.c
index e29edea..00cab05 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -87,6 +87,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/task.h>
 
+#include "clonefd.h"
+
 /*
  * Protected counters by write_lock_irq(&tasklist_lock)
  */
@@ -321,6 +323,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
 	if (err)
 		goto free_ti;
 
+	tsk->autoreap = 0;
+
 	tsk->stack = ti;
 #ifdef CONFIG_SECCOMP
 	/*
@@ -1193,7 +1197,8 @@ static struct task_struct *copy_process(u64 clone_flags,
 					int __user *child_tidptr,
 					struct pid *pid,
 					int trace,
-					unsigned long tls)
+					unsigned long tls,
+					struct clonefd_setup *clonefd_setup)
 {
 	int retval;
 	struct task_struct *p;
@@ -1244,6 +1249,16 @@ static struct task_struct *copy_process(u64 clone_flags,
 			return ERR_PTR(-EINVAL);
 	}
 
+	/*
+	 * If using CLONE_FD, the low byte is used for additional flags; check
+	 * for unknown flags.
+	 */
+	if (clone_flags & CLONE_FD) {
+		if (!IS_ENABLED(CONFIG_CLONEFD) ||
+		    (clone_flags & CSIGNAL & ~(CLONEFD_CLOEXEC | CLONEFD_NONBLOCK)))
+			return ERR_PTR(-EINVAL);
+	}
+
 	retval = security_task_create(clone_flags);
 	if (retval)
 		goto fork_out;
@@ -1416,6 +1431,10 @@ static struct task_struct *copy_process(u64 clone_flags,
 			goto bad_fork_cleanup_io;
 	}
 
+	retval = clonefd_do_clone(clone_flags, p, clonefd_setup);
+	if (retval)
+		goto bad_fork_free_pid;
+
 	p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
 	/*
 	 * Clear TID on mm_release()?
@@ -1456,7 +1475,9 @@ static struct task_struct *copy_process(u64 clone_flags,
 		p->group_leader = current->group_leader;
 		p->tgid = current->tgid;
 	} else {
-		if (clone_flags & CLONE_PARENT)
+		if (clone_flags & CLONE_FD)
+			p->exit_signal = 0;
+		else if (clone_flags & CLONE_PARENT)
 			p->exit_signal = current->group_leader->exit_signal;
 		else
 			p->exit_signal = (clone_flags & CSIGNAL);
@@ -1508,7 +1529,7 @@ static struct task_struct *copy_process(u64 clone_flags,
 		spin_unlock(&current->sighand->siglock);
 		write_unlock_irq(&tasklist_lock);
 		retval = -ERESTARTNOINTR;
-		goto bad_fork_free_pid;
+		goto bad_fork_cleanup_clonefd;
 	}
 
 	if (likely(p->pid)) {
@@ -1560,6 +1581,8 @@ static struct task_struct *copy_process(u64 clone_flags,
 
 	return p;
 
+bad_fork_cleanup_clonefd:
+	clonefd_cleanup_failed_clone(p, clonefd_setup);
 bad_fork_free_pid:
 	if (pid != &init_struct_pid)
 		free_pid(pid);
@@ -1617,7 +1640,7 @@ static inline void init_idle_pids(struct pid_link *links)
 struct task_struct *fork_idle(int cpu)
 {
 	struct task_struct *task;
-	task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0, 0);
+	task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0, 0, NULL);
 	if (!IS_ERR(task)) {
 		init_idle_pids(task->pids);
 		init_idle(task, cpu);
@@ -1643,6 +1666,7 @@ static long _do_fork(
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
+	struct clonefd_setup clonefd_setup = {};
 
 	/*
 	 * Determine whether and which event to report to ptracer.  When
@@ -1653,7 +1677,8 @@ static long _do_fork(
 	if (!(clone_flags & CLONE_UNTRACED)) {
 		if (clone_flags & CLONE_VFORK)
 			trace = PTRACE_EVENT_VFORK;
-		else if ((clone_flags & CSIGNAL) != SIGCHLD)
+		else if ((clone_flags & CLONE_FD) ||
+			 (clone_flags & CSIGNAL) != SIGCHLD)
 			trace = PTRACE_EVENT_CLONE;
 		else
 			trace = PTRACE_EVENT_FORK;
@@ -1663,7 +1688,7 @@ static long _do_fork(
 	}
 
 	p = copy_process(clone_flags, stack_start, stack_size,
-			 child_tidptr, NULL, trace, tls);
+			 child_tidptr, NULL, trace, tls, &clonefd_setup);
 	/*
 	 * Do this prior waking up the new thread - the thread pointer
 	 * might get invalid after that point, if the thread exits quickly.
@@ -1686,6 +1711,9 @@ static long _do_fork(
 			get_task_struct(p);
 		}
 
+		if (clone_flags & CLONE_FD)
+			nr = clonefd_install_fd(p, &clonefd_setup);
+
 		wake_up_new_task(p);
 
 		/* forking complete and child started to run, tell ptracer */
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH] clone4.2: New manpage documenting clone4(2)
  2015-03-13  1:40 ` Josh Triplett
                   ` (6 preceding siblings ...)
  (?)
@ 2015-03-13  1:41 ` Josh Triplett
  -1 siblings, 0 replies; 83+ messages in thread
From: Josh Triplett @ 2015-03-13  1:41 UTC (permalink / raw)
  To: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Thiago Macieira, Michael Kerrisk, linux-kernel,
	linux-api, linux-man, linux-fsdevel, x86

Also includes new cross-reference from clone.2.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Thiago Macieira <thiago.macieira@intel.com>
---
 man2/clone.2  |   1 +
 man2/clone4.2 | 332 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 333 insertions(+)
 create mode 100644 man2/clone4.2

diff --git a/man2/clone.2 b/man2/clone.2
index 752c01e..7013885 100644
--- a/man2/clone.2
+++ b/man2/clone.2
@@ -1209,6 +1209,7 @@ main(int argc, char *argv[])
 }
 .fi
 .SH SEE ALSO
+.BR clone4 (2),
 .BR fork (2),
 .BR futex (2),
 .BR getpid (2),
diff --git a/man2/clone4.2 b/man2/clone4.2
new file mode 100644
index 0000000..c2ce188
--- /dev/null
+++ b/man2/clone4.2
@@ -0,0 +1,332 @@
+.\" Based on clone.2:
+.\" Copyright (c) 1992 Drew Eckhardt <drew@cs.colorado.edu>, March 28, 1992
+.\" and Copyright (c) Michael Kerrisk, 2001, 2002, 2005, 2013
+.\"
+.\" %%%LICENSE_START(GPL_NOVERSION_ONELINE)
+.\" May be distributed under the GNU General Public License.
+.\" %%%LICENSE_END
+.TH CLONE4 2 2015-03-01 "Linux" "Linux Programmer's Manual"
+.SH NAME
+clone4 \- create a child process
+.SH SYNOPSIS
+.nf
+/* Prototype for the glibc wrapper function */
+
+.B #define _GNU_SOURCE
+.B #include <sched.h>
+
+.BI "int clone4(uint64_t " flags ,
+.BI "           size_t " args_size ,
+.BI "           struct clone4_args *" args ,
+.BI "           int (*" "fn" ")(void *), void *" arg );
+
+/* Prototype for the raw system call */
+
+.BI "int clone4(unsigned " flags_high ", unsigned " flags_low ,
+.BI "           unsigned long " args_size ,
+.BI "           struct clone4_args *" args );
+
+struct clone4_args {
+    pid_t *ptid;
+    pid_t *ctid;
+    unsigned long stack_start;
+    unsigned long stack_size;
+    unsigned long tls;
+};
+
+.SH DESCRIPTION
+.BR clone4 ()
+creates a new process, similar to
+.BR clone (2)
+and
+.BR fork (2).
+.BR clone4 ()
+supports additional flags that
+.BR clone (2)
+does not, and accepts arguments via an extensible structure.
+
+.I args
+points to a
+.I clone4_args
+structure, and
+.I args_size
+must contain the size of that structure, as understood by the caller.  If the
+caller passes a shorter structure than the kernel expects, the remaining fields
+will default to 0.  If the caller passes a larger structure than the kernel
+expects (such as one from a newer kernel),
+.BR clone4 ()
+will return
+.BR EINVAL .
+The
+.I clone4_args
+structure may gain additional fields at the end in the future, and callers must
+only pass a size that encompasses the number of fields they understand.  If the
+caller passes 0 for
+.IR args_size ,
+.I args
+is ignored and may be NULL.
+
+In the
+.I clone4_args
+structure,
+.IR ptid ,
+.IR ctid ,
+.IR stack_start ,
+.IR stack_size ,
+and
+.I tls
+have the same semantics as they do with
+.BR clone (2)
+and
+.BR clone2 (2).
+
+In the glibc wrapper,
+.I fn
+and
+.I arg
+have the same semantics as they do with
+.BR clone (2).
+As with
+.BR clone (2),
+the underlying system call works more like
+.BR fork (2),
+returning 0 in the child process; the glibc wrapper simplifies thread execution
+by calling
+.IR fn ( arg )
+and exiting the child when that function exits.
+
+The 64-bit
+.I flags
+argument (split into the 32-bit
+.I flags_high
+and
+.I flags_low
+arguments in the kernel interface)
+accepts all the same flags as
+.BR clone (2),
+with the exception of the obsolete
+.BR CLONE_PID ,
+.BR CLONE_DETACHED ,
+and
+.BR CLONE_STOPPED .
+In addition,
+.I flags
+accepts the following flags:
+
+.TP
+.B CLONE_FD
+Instead of returning a process ID,
+.BR clone4 ()
+with the
+.B CLONE_FD
+flag returns a file descriptor associated with the new process.
+When the new process exits, the kernel will not send a signal to the parent
+process, and will not keep the new process around as a "zombie" process until a
+call to
+.BR waitpid (2)
+or similar.  Instead, the file descriptor will become available for reading,
+and the new process will be immediately reaped.
+
+Unlike using
+.BR signalfd (2)
+for the
+.B SIGCHLD
+signal,
+the file descriptor returned by
+.BR clone4 ()
+with the
+.B CLONE_FD
+flag works even with
+.B SIGCHLD
+unblocked in one or more threads of the parent process, and allows the process
+to have different handlers for different child processes, such as those created
+by a library, without introducing race conditions around process-wide signal
+handling.
+
+.BR clone4 ()
+will never return a file descriptor in the range 0-2 to the caller, to avoid
+ambiguity with the return of 0 in the child process.  Only the calling process
+will have the new file descriptor open; the child process will not.
+
+Since the kernel does not send a termination signal when a child process
+created with
+.B CLONE_FD
+exits, the low byte of flags does not contain a signal number.  Instead, the
+low byte of flags can contain the following additional flags for use with
+.BR CLONE_FD :
+
+.RS
+.TP
+.B CLONEFD_CLOEXEC
+Set the
+.B O_CLOEXEC
+flag on the new open file descriptor.
+See the description of the
+.B O_CLOEXEC
+flag in
+.BR open (2)
+for reasons why this may be useful.
+
+.TP
+.B CLONEFD_NONBLOCK
+Set the
+.B O_NONBLOCK
+flag on the new open file descriptor.
+Using this flag saves extra calls to
+.BR fcntl (2)
+to achieve the same result.
+.RE
+
+.IP
+.BR clone4 ()
+with the
+.B CLONE_FD
+flag returns a file descriptor that supports the following operations:
+.RS
+.TP
+.BR read "(2) (and similar)"
+When the new process exits, reading from the file descriptor produces
+a single
+.I clonefd_info
+structure:
+.nf
+
+struct clonefd_info {
+    int32_t code;    /* Signal code */
+    int32_t status;  /* Exit status or signal */
+    uint64_t utime;  /* User CPU time */
+    uint64_t stime;  /* System CPU time */
+};
+
+.fi
+.IP
+If the new process has not yet exited,
+.BR read (2)
+either blocks until it does,
+or fails with the error
+.B EAGAIN
+if the file descriptor has been made nonblocking.
+.IP
+Future kernels may extend
+.I clonefd_info
+by appending additional fields to the end.  Callers should read as many bytes
+as they understand; unread data will be discarded, and subsequent reads after
+the first will return 0 to indicate end-of-file.  Callers requesting more bytes
+than the kernel provides (such as callers expecting a newer
+.I clonefd_info
+structure) will receive a shorter structure from older kernels.
+.TP
+.BR poll "(2), " select "(2), " epoll "(7) (and similar)"
+The file descriptor is readable
+(the
+.BR select (2)
+.I readfds
+argument; the
+.BR poll (2)
+.B POLLIN
+flag)
+if the new process has exited.
+.TP
+.BR close (2)
+When the file descriptor is no longer required it should be closed.  If no
+process has a file descriptor open for the new process, no process will receive
+any notification when the new process exits.  The new process will still be
+immediately reaped.
+.RE
+
+.SS C library/kernel ABI differences
+As with
+.BR clone (2),
+the raw
+.BR clone4 ()
+system call corresponds more closely to
+.BR fork (2)
+in that execution in the child continues from the point of the call.
+
+Unlike
+.BR clone (2),
+the raw system call interface for
+.BR clone4 ()
+accepts arguments in the same order on all architectures.
+
+The raw system call accepts
+.I flags
+as two 32-bit arguments,
+.I flags_high
+and
+.IR flags_low ,
+to simplify portability across 32-bit and 64-bit architectures and calling
+conventions.  The glibc wrapper accepts
+.I flags
+as a single 64-bit argument for convenience.
+
+.SH RETURN VALUE
+For the glibc wrapper, on success,
+.BR clone4 ()
+returns the file descriptor (with
+.BR CLONE_FD )
+or new process ID
+(without
+.BR CLONE_FD ),
+and the child process begins running at the specified function.
+
+For the raw syscall, on success,
+.BR clone4 ()
+returns the file descriptor or new process ID to the calling process, and
+returns 0 in the new child process.
+
+On failure,
+.BR clone4 ()
+returns \-1 and sets
+.I errno
+accordingly.
+
+.SH ERRORS
+.BR clone4 ()
+can return any error from
+.BR clone (2),
+as well as the following additional errors:
+.TP
+.B EINVAL
+.I flags
+contained an unknown flag.
+.TP
+.B EINVAL
+.I flags
+included
+.BR CLONE_FD ,
+but the kernel configuration does not have the
+.B CONFIG_CLONEFD
+option enabled.
+.TP
+.B EMFILE
+.I flags
+included
+.BR CLONE_FD ,
+but the new file descriptor would exceed the process limit on open file descriptors.
+.TP
+.B ENFILE
+.I flags
+included
+.BR CLONE_FD ,
+but the new file descriptor would exceed the system-wide limit on open file descriptors.
+.TP
+.B ENODEV
+.I flags
+included
+.BR CLONE_FD ,
+but
+.BR clone4 ()
+could not mount the (internal) anonymous inode device.
+
+.SH CONFORMING TO
+.BR clone4 ()
+is Linux-specific and should not be used in programs intended to be portable.
+
+.SH SEE ALSO
+.BR clone (2),
+.BR epoll (7),
+.BR poll (2),
+.BR pthreads (7),
+.BR read (2),
+.BR select (2)
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor
@ 2015-03-13  2:07   ` Thiago Macieira
  0 siblings, 0 replies; 83+ messages in thread
From: Thiago Macieira @ 2015-03-13  2:07 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, linux-kernel, linux-api,
	linux-fsdevel, x86

On Thursday 12 March 2015 18:40:03 Josh Triplett wrote:
> This patch series introduces a new clone flag, CLONE_FD, which lets the
> caller handle child process exit notification via a file descriptor rather
> than SIGCHLD.  CLONE_FD makes it possible for libraries to safely launch
> and manage child processes on behalf of their caller, *without* taking over
> process-wide SIGCHLD handling (either via signal handler or signalfd).

FYI, the matching use of this feature in Qt can be found at:

	https://codereview.qt-project.org/108455
	https://codereview.qt-project.org/108456

The forkfd.c file that these changes modify aims to implement the semantics of
CLONE_FD for the fork case when the kernel lacks CLONE_FD support.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor
@ 2015-03-13 16:05   ` David Drysdale
  0 siblings, 0 replies; 83+ messages in thread
From: David Drysdale @ 2015-03-13 16:05 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Thiago Macieira, Michael Kerrisk, linux-kernel,
	Linux API, linux-fsdevel, X86 ML

On Fri, Mar 13, 2015 at 1:40 AM, Josh Triplett <josh@joshtriplett.org> wrote:
> This patch series introduces a new clone flag, CLONE_FD, which lets the caller
> handle child process exit notification via a file descriptor rather than
> SIGCHLD.  CLONE_FD makes it possible for libraries to safely launch and manage
> child processes on behalf of their caller, *without* taking over process-wide
> SIGCHLD handling (either via signal handler or signalfd).

Hi Josh,

From the overall description (i.e. I haven't looked at the code yet)
this looks very interesting.  However, it seems to cover a lot of the
same ground as the process descriptor feature that was added to FreeBSD
in 9.x/10.x:
  https://www.freebsd.org/cgi/man.cgi?query=pdfork&sektion=2

I think it would ideally be nice for a userspace library developer to be
able to do subprocess management (without SIGCHLD) in a similar way
across both platforms, without lots of complicated autoconf shenanigans.

So could we look at the overlap and see if we can come up with
something that covers your requirements and also allows for something
that looks like FreeBSD's process descriptors?

(I've actually got some rough patches to add process descriptor
functionality on Linux, so I can look at how the two approaches compare
and contrast.)

> Note that signalfd for SIGCHLD does not suffice here, because that still
> receives notification for all child processes, and interferes with process-wide
> signal handling.
>
> The CLONE_FD file descriptor uniquely identifies a process on the system in a
> race-free way, by holding a reference to the task_struct.  In the future, we
> may introduce APIs that support using process file descriptors instead of PIDs.

FreeBSD has pdkill(2) and (theoretically) pdwait4(2) along these lines.
I suspect we either need pdkill(2) or a way to retrieve a PID from
a process file descriptor, so that there's a way to send signals to the
child.

> Introducing CLONE_FD required two additional bits of yak shaving: Since clone
> has no more usable flags (with the three currently unused flags unusable
> because old kernels ignore them without EINVAL), also introduce a new clone4
> system call with more flag bits and an extensible argument structure.  And
> since the magic pt_regs-based syscall argument processing for clone's tls
> argument would otherwise prevent introducing a sane clone4 system call, fix
> that too.
>
> I tested the CLONE_SETTLS changes with a thread-local storage test program (two
> threads independently reading and writing a __thread variable), on both 32-bit
> and 64-bit, and I observed no issues there.

Worth preserving in tools/testing/selftests/ ?

> I tested clone4 and the new CLONE_FD call with several additional test
> programs, launching either a process or thread (in the former case using
> syscall(), in the latter case by calling clone4 via assembly and returning to
> C), sleeping in parent and child to test the case of either exiting first, and
> then printing the received clonefd_info structure.  Thiago also tested clone4
> with CLONE_FD with a modified version of libqt's process handling, which
> includes a test suite.
>
> I've also included the manpages patch at the end of this series.  (Note that
> the manpage documents the behavior of the future glibc wrapper as well as the
> raw syscall.)  Here's a formatted plain-text version of the manpage for
> reference:

FYI, I've added some comparisons with the FreeBSD equivalents below.

>
> CLONE4(2)                  Linux Programmer's Manual                 CLONE4(2)
>
>
>
> NAME
>        clone4 - create a child process
>
> SYNOPSIS
>        /* Prototype for the glibc wrapper function */
>
>        #define _GNU_SOURCE
>        #include <sched.h>
>
>        int clone4(uint64_t flags,
>                   size_t args_size,
>                   struct clone4_args *args,
>                   int (*fn)(void *), void *arg);
>
>        /* Prototype for the raw system call */
>
>        int clone4(unsigned flags_high, unsigned flags_low,
>                   unsigned long args_size,
>                   struct clone4_args *args);
>
>        struct clone4_args {
>            pid_t *ptid;
>            pid_t *ctid;
>            unsigned long stack_start;
>            unsigned long stack_size;
>            unsigned long tls;
>        };
>
>
> DESCRIPTION
>        clone4()  creates  a  new  process,  similar  to  clone(2) and fork(2).
>        clone4() supports additional flags that clone(2) does not, and  accepts
>        arguments via an extensible structure.
>
>        args  points to a clone4_args structure, and args_size must contain the
>        size of that structure, as understood by the  caller.   If  the  caller
>        passes  a  shorter  structure  than  the  kernel expects, the remaining
>        fields will default to 0.  If the caller passes a larger structure than
>        the  kernel  expects  (such  as one from a newer kernel), clone4() will
>        return EINVAL.  The clone4_args structure may gain additional fields at
>        the  end  in  the future, and callers must only pass a size that encom‐
>        passes the number of fields they understand.  If the  caller  passes  0
>        for args_size, args is ignored and may be NULL.
>
>        In  the clone4_args structure, ptid, ctid, stack_start, stack_size, and
>        tls have the same semantics as they do with clone(2) and clone2(2).
>
>        In the glibc wrapper, fn and arg have the same  semantics  as  they  do
>        with clone(2).  As with clone(2), the underlying system call works more
>        like fork(2), returning 0 in the child process; the glibc wrapper  sim‐
>        plifies  thread execution by calling fn(arg) and exiting the child when
>        that function exits.
>
>        The 64-bit  flags  argument  (split  into  the  32-bit  flags_high  and
>        flags_low arguments in the kernel interface) accepts all the same flags
>        as  clone(2),  with  the   exception   of   the   obsolete   CLONE_PID,
>        CLONE_DETACHED, and CLONE_STOPPED.  In addition, flags accepts the fol‐
>        lowing flags:
>
>
>        CLONE_FD
>               Instead of returning a process ID, clone4()  with  the  CLONE_FD
>               flag  returns a file descriptor associated with the new process.
>               When the new process exits, the kernel will not send a signal to
>               the  parent process, and will not keep the new process around as
>               a "zombie" process  until  a  call  to  waitpid(2)  or  similar.
>               Instead,  the file descriptor will become available for reading,
>               and the new process will be immediately reaped.

Just to confirm: presumably a waitpid(-1,...) call that's already in
progress won't return when one of these child processes exits?

>
>               Unlike using  signalfd(2)  for  the  SIGCHLD  signal,  the  file
>               descriptor  returned  by  clone4()  with the CLONE_FD flag works
>               even with SIGCHLD unblocked in one or more threads of the parent
>               process,  and  allows the process to have different handlers for
>               different child processes, such as those created by  a  library,
>               without  introducing  race conditions around process-wide signal
>               handling.
>
>               clone4() will never return a file descriptor in the range 0-2 to
>               the caller, to avoid ambiguity with the return of 0 in the child
>               process.  Only the  calling  process  will  have  the  new  file
>               descriptor open; the child process will not.

FreeBSD's pdfork(2) returns a PID but also takes an int *fdp argument to
return the file descriptor separately, which avoids the need for special
case processing for low FD values (and means that POSIX's "lowest file
descriptor not currently open" behaviour can be preserved if desired).

>               Since the kernel does not send a termination signal when a child
>               process created with CLONE_FD exits, the low byte of flags  does
>               not contain a signal number.  Instead, the low byte of flags can
>               contain the following additional flags for use with CLONE_FD:
>
>
>               CLONEFD_CLOEXEC
>                      Set the O_CLOEXEC flag on the new open  file  descriptor.
>                      See  the description of the O_CLOEXEC flag in open(2) for
>                      reasons why this may be useful.
>
>
>               CLONEFD_NONBLOCK
>                      Set the O_NONBLOCK flag on the new open file  descriptor.
>                      Using  this flag saves extra calls to fcntl(2) to achieve
>                      the same result.
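
For comparison, here is a minimal sketch of the fcntl(2) sequence that CLONEFD_CLOEXEC and CLONEFD_NONBLOCK are meant to replace.  The helper name set_cloexec_nonblock() is made up for illustration; only the standard fcntl(2) commands are used.

```c
#include <fcntl.h>

/*
 * Hypothetical helper: set close-on-exec and nonblocking mode on an
 * already-open descriptor.  Returns 0 on success, -1 with errno set.
 */
int set_cloexec_nonblock(int fd)
{
    /* FD_CLOEXEC is a per-descriptor flag (F_GETFD/F_SETFD). */
    int fdflags = fcntl(fd, F_GETFD);
    if (fdflags < 0 || fcntl(fd, F_SETFD, fdflags | FD_CLOEXEC) < 0)
        return -1;

    /* O_NONBLOCK is a file status flag (F_GETFL/F_SETFL). */
    int flflags = fcntl(fd, F_GETFL);
    if (flflags < 0 || fcntl(fd, F_SETFL, flflags | O_NONBLOCK) < 0)
        return -1;

    return 0;
}
```

That is four system calls after the descriptor already exists.  In a multithreaded program another thread can fork and exec before F_SETFD runs, which is the usual argument for setting close-on-exec atomically at creation time.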
>
>
>               clone4() with the CLONE_FD flag returns a file  descriptor  that
>               supports the following operations:
>
>               read(2) (and similar)
>                      When  the  new  process  exits,  reading  from  the  file
>                      descriptor produces a single clonefd_info structure:
>
>                      struct clonefd_info {
>                          uint32_t code;   /* Signal code */
>                          uint32_t status; /* Exit status or signal */
>                          uint64_t utime;  /* User CPU time */
>                          uint64_t stime;  /* System CPU time */
>                      };

Presumably there is no way to get full rusage information for the exited
process?

[FreeBSD theoretically has pdwait4(2) to do wait4-like operations on a
process descriptor, including rusage retrieval.  However, I don't think
they actually implemented it:
  http://fxr.watson.org/fxr/source/kern/syscalls.master#L928]

>
>                      If the new process has not  yet  exited,  read(2)  either
>                      blocks  until  it does, or fails with the error EAGAIN if
>                      the file descriptor has been made nonblocking.
>
>                      Future kernels may extend clonefd_info by appending addi‐
>                      tional  fields  to  the end.  Callers should read as many
>                      bytes as they understand; unread data will be  discarded,
>                      and  subsequent  reads  after  the first will return 0 to
>                      indicate end-of-file.  Callers requesting more bytes than
>                      the  kernel  provides  (such as callers expecting a newer
>                      clonefd_info structure) will receive a shorter  structure
>                      from older kernels.

FreeBSD also implements fstat(2) for its process descriptors, although
only a few of the fields get filled in.

>               poll(2), select(2), epoll(7) (and similar)
>                      The  file  descriptor  is readable (the select(2) readfds
>                      argument; the poll(2) POLLIN flag) if the new process has
>                      exited.

FreeBSD uses POLLHUP here.
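
A caller that wants to cope with either convention could treat both POLLIN and POLLHUP as "child has exited".  The following is only a sketch under that assumption; wait_for_exit_event() is a made-up name, and any pollable descriptor (a pipe, for instance) can stand in for the process descriptor when trying it out.

```c
#include <errno.h>
#include <poll.h>

/*
 * Hypothetical helper: wait until fd reports an exit event.
 * Returns 1 if the descriptor is readable or hung up, 0 on timeout,
 * and -1 on error (including POLLERR/POLLNVAL, where errno is not set).
 */
int wait_for_exit_event(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int rc;

    /* Retry if a signal interrupts poll(); this restarts the full timeout. */
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc <= 0)
        return rc;

    /* POLLHUP is reported in revents even though it was not requested. */
    return (pfd.revents & (POLLIN | POLLHUP)) ? 1 : -1;
}
```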

>               close(2)
>                      When  the file descriptor is no longer required it should
>                      be closed.  If no process has a file descriptor open  for
>                      the new process, no process will receive any notification
>                      when the new process exits.  The new process  will  still
>                      be immediately reaped.

FreeBSD has two different behaviours for close(2), depending on a flag
value (PD_DAEMON).  With the flag set it's roughly like this, but
without PD_DAEMON a close(2) operation on the (last open) file
descriptor terminates the child process.

This can be quite useful, particularly for the use case where some
userspace library has an FD-controlled subprocess -- if the application
using the library terminates, the process descriptor is closed and so
the subprocess is automatically terminated.

>
>    C library/kernel ABI differences
>        As with clone(2), the raw clone4() system call corresponds more closely
>        to fork(2) in that execution in the child continues from the  point  of
>        the call.
>
>        Unlike  clone(2),  the  raw  system call interface for clone4() accepts
>        arguments in the same order on all architectures.
>
>        The raw system call accepts flags as two 32-bit  arguments,  flags_high
>        and  flags_low, to simplify portability across 32-bit and 64-bit archi‐
>        tectures and calling conventions.  The glibc wrapper accepts flags as a
>        single 64-bit argument for convenience.
>
>
> RETURN VALUE
>        For the glibc wrapper, on success, clone4() returns the file descriptor
>        (with CLONE_FD) or new process ID (without  CLONE_FD),  and  the  child
>        process begins running at the specified function.
>
>        For  the  raw syscall, on success, clone4() returns the file descriptor
>        or new process ID to the calling process, and  returns  0  in  the  new
>        child process.
>
>        On failure, clone4() returns -1 and sets errno accordingly.
>
>
> ERRORS
>        clone4()  can  return any error from clone(2), as well as the following
>        additional errors:
>
>        EINVAL flags contained an unknown flag.
>
>        EINVAL flags included CLONE_FD, but the kernel configuration  does  not
>               have the CONFIG_CLONEFD option enabled.
>
>        EMFILE flags  included  CLONE_FD,  but  the  new  file descriptor would
>               exceed the process limit on open file descriptors.
>
>        ENFILE flags included CLONE_FD,  but  the  new  file  descriptor  would
>               exceed the system-wide limit on open file descriptors.
>
>        ENODEV flags  included  CLONE_FD,  but  clone4()  could  not  mount the
>               (internal) anonymous inode device.
>
>
> CONFORMING TO
>        clone4() is Linux-specific and should not be used in programs  intended
>        to be portable.
>
>
> SEE ALSO
>        clone(2), epoll(7), poll(2), pthreads(7), read(2), select(2)
>
>
>
> Linux                             2015-03-01                         CLONE4(2)

* Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor
@ 2015-03-13 16:05   ` David Drysdale
  0 siblings, 0 replies; 83+ messages in thread
From: David Drysdale @ 2015-03-13 16:05 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Thiago Macieira, Michael Kerrisk,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Linux API,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, X86 ML

On Fri, Mar 13, 2015 at 1:40 AM, Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org> wrote:
> This patch series introduces a new clone flag, CLONE_FD, which lets the caller
> handle child process exit notification via a file descriptor rather than
> SIGCHLD.  CLONE_FD makes it possible for libraries to safely launch and manage
> child processes on behalf of their caller, *without* taking over process-wide
> SIGCHLD handling (either via signal handler or signalfd).

Hi Josh,

From the overall description (i.e. I haven't looked at the code yet)
this looks very interesting.  However, it seems to cover a lot of the
same ground as the process descriptor feature that was added to FreeBSD
in 9.x/10.x:
  https://www.freebsd.org/cgi/man.cgi?query=pdfork&sektion=2

I think it would ideally be nice for a userspace library developer to be
able to do subprocess management (without SIGCHLD) in a similar way
across both platforms, without lots of complicated autoconf shenanigans.

So could we look at the overlap and see if we can come up with
something that covers your requirements and also allows for something
that looks like FreeBSD's process descriptors?

(I've actually got some rough patches to add process descriptor
functionality on Linux, so I can look at how the two approaches compare
and contrast.)

> Note that signalfd for SIGCHLD does not suffice here, because that still
> receives notification for all child processes, and interferes with process-wide
> signal handling.
>
> The CLONE_FD file descriptor uniquely identifies a process on the system in a
> race-free way, by holding a reference to the task_struct.  In the future, we
> may introduce APIs that support using process file descriptors instead of PIDs.

FreeBSD has pdkill(2) and (theoretically) pdwait4(2) along these lines.
I suspect we need either pdkill(2) or a way to retrieve a PID from
a process file descriptor, so that there's a way to send signals to the
child.

> Introducing CLONE_FD required two additional bits of yak shaving: Since clone
> has no more usable flags (with the three currently unused flags unusable
> because old kernels ignore them without EINVAL), also introduce a new clone4
> system call with more flag bits and an extensible argument structure.  And
> since the magic pt_regs-based syscall argument processing for clone's tls
> argument would otherwise prevent introducing a sane clone4 system call, fix
> that too.
>
> I tested the CLONE_SETTLS changes with a thread-local storage test program (two
> threads independently reading and writing a __thread variable), on both 32-bit
> and 64-bit, and I observed no issues there.

Worth preserving in tools/testing/selftests/ ?

> I tested clone4 and the new CLONE_FD call with several additional test
> programs, launching either a process or thread (in the former case using
> syscall(), in the latter case by calling clone4 via assembly and returning to
> C), sleeping in parent and child to test the case of either exiting first, and
> then printing the received clonefd_info structure.  Thiago also tested clone4
> with CLONE_FD with a modified version of libqt's process handling, which
> includes a test suite.
>
> I've also included the manpages patch at the end of this series.  (Note that
> the manpage documents the behavior of the future glibc wrapper as well as the
> raw syscall.)  Here's a formatted plain-text version of the manpage for
> reference:

FYI, I've added some comparisons with the FreeBSD equivalents below.

>
> CLONE4(2)                  Linux Programmer's Manual                 CLONE4(2)
>
>
>
> NAME
>        clone4 - create a child process
>
> SYNOPSIS
>        /* Prototype for the glibc wrapper function */
>
>        #define _GNU_SOURCE
>        #include <sched.h>
>
>        int clone4(uint64_t flags,
>                   size_t args_size,
>                   struct clone4_args *args,
>                   int (*fn)(void *), void *arg);
>
>        /* Prototype for the raw system call */
>
>        int clone4(unsigned flags_high, unsigned flags_low,
>                   unsigned long args_size,
>                   struct clone4_args *args);
>
>        struct clone4_args {
>            pid_t *ptid;
>            pid_t *ctid;
>            unsigned long stack_start;
>            unsigned long stack_size;
>            unsigned long tls;
>        };
>
>
> DESCRIPTION
>        clone4()  creates  a  new  process,  similar  to  clone(2) and fork(2).
>        clone4() supports additional flags that clone(2) does not, and  accepts
>        arguments via an extensible structure.
>
>        args  points to a clone4_args structure, and args_size must contain the
>        size of that structure, as understood by the  caller.   If  the  caller
>        passes  a  shorter  structure  than  the  kernel expects, the remaining
>        fields will default to 0.  If the caller passes a larger structure than
>        the  kernel  expects  (such  as one from a newer kernel), clone4() will
>        return EINVAL.  The clone4_args structure may gain additional fields at
>        the  end  in  the future, and callers must only pass a size that encom‐
>        passes the number of fields they understand.  If the  caller  passes  0
>        for args_size, args is ignored and may be NULL.
>
>        In  the clone4_args structure, ptid, ctid, stack_start, stack_size, and
>        tls have the same semantics as they do with clone(2) and clone2(2).
>
>        In the glibc wrapper, fn and arg have the same  semantics  as  they  do
>        with clone(2).  As with clone(2), the underlying system call works more
>        like fork(2), returning 0 in the child process; the glibc wrapper  sim‐
>        plifies  thread execution by calling fn(arg) and exiting the child when
>        that function exits.
>
>        The 64-bit  flags  argument  (split  into  the  32-bit  flags_high  and
>        flags_low arguments in the kernel interface) accepts all the same flags
>        as  clone(2),  with  the   exception   of   the   obsolete   CLONE_PID,
>        CLONE_DETACHED, and CLONE_STOPPED.  In addition, flags accepts the fol‐
>        lowing flags:
>
>
>        CLONE_FD
>               Instead of returning a process ID, clone4()  with  the  CLONE_FD
>               flag  returns a file descriptor associated with the new process.
>               When the new process exits, the kernel will not send a signal to
>               the  parent process, and will not keep the new process around as
>               a "zombie" process  until  a  call  to  waitpid(2)  or  similar.
>               Instead,  the file descriptor will become available for reading,
>               and the new process will be immediately reaped.

Just to confirm: presumably a waitpid(-1,...) call that's already in
progress won't return when one of these child processes exits?

>
>               Unlike using  signalfd(2)  for  the  SIGCHLD  signal,  the  file
>               descriptor  returned  by  clone4()  with the CLONE_FD flag works
>               even with SIGCHLD unblocked in one or more threads of the parent
>               process,  and  allows the process to have different handlers for
>               different child processes, such as those created by  a  library,
>               without  introducing  race conditions around process-wide signal
>               handling.
>
>               clone4() will never return a file descriptor in the range 0-2 to
>               the caller, to avoid ambiguity with the return of 0 in the child
>               process.  Only the  calling  process  will  have  the  new  file
>               descriptor open; the child process will not.

FreeBSD's pdfork(2) returns a PID but also takes an int *fdp argument to
return the file descriptor separately, which avoids the need for special
case processing for low FD values (and means that POSIX's "lowest file
descriptor not currently open" behaviour can be preserved if desired).

>               Since the kernel does not send a termination signal when a child
>               process created with CLONE_FD exits, the low byte of flags  does
>               not contain a signal number.  Instead, the low byte of flags can
>               contain the following additional flags for use with CLONE_FD:
>
>
>               CLONEFD_CLOEXEC
>                      Set the O_CLOEXEC flag on the new open  file  descriptor.
>                      See  the description of the O_CLOEXEC flag in open(2) for
>                      reasons why this may be useful.
>
>
>               CLONEFD_NONBLOCK
>                      Set the O_NONBLOCK flag on the new open file  descriptor.
>                      Using  this flag saves extra calls to fcntl(2) to achieve
>                      the same result.
>
>
>               clone4() with the CLONE_FD flag returns a file  descriptor  that
>               supports the following operations:
>
>               read(2) (and similar)
>                      When  the  new  process  exits,  reading  from  the  file
>                      descriptor produces a single clonefd_info structure:
>
>                      struct clonefd_info {
>                          uint32_t code;   /* Signal code */
>                          uint32_t status; /* Exit status or signal */
>                          uint64_t utime;  /* User CPU time */
>                          uint64_t stime;  /* System CPU time */
>                      };

Presumably there is no way to get full rusage information for the exited
process?

[FreeBSD theoretically has pdwait4(2) to do wait4-like operations on a
process descriptor, including rusage retrieval.  However, I don't think
they actually implemented it:
  http://fxr.watson.org/fxr/source/kern/syscalls.master#L928]

>
>                      If the new process has not  yet  exited,  read(2)  either
>                      blocks  until  it does, or fails with the error EAGAIN if
>                      the file descriptor has been made nonblocking.
>
>                      Future kernels may extend clonefd_info by appending addi‐
>                      tional  fields  to  the end.  Callers should read as many
>                      bytes as they understand; unread data will be  discarded,
>                      and  subsequent  reads  after  the first will return 0 to
>                      indicate end-of-file.  Callers requesting more bytes than
>                      the  kernel  provides  (such as callers expecting a newer
>                      clonefd_info structure) will receive a shorter  structure
>                      from older kernels.

FreeBSD also implements fstat(2) for its process descriptors, although
only a few of the fields get filled in.

>               poll(2), select(2), epoll(7) (and similar)
>                      The  file  descriptor  is readable (the select(2) readfds
>                      argument; the poll(2) POLLIN flag) if the new process has
>                      exited.

FreeBSD uses POLLHUP here.

>               close(2)
>                      When  the file descriptor is no longer required it should
>                      be closed.  If no process has a file descriptor open  for
>                      the new process, no process will receive any notification
>                      when the new process exits.  The new process  will  still
>                      be immediately reaped.

FreeBSD has two different behaviours for close(2), depending on a flag
value (PD_DAEMON).  With the flag set it's roughly like this, but
without PD_DAEMON a close(2) operation on the (last open) file
descriptor terminates the child process.

This can be quite useful, particularly for the use case where some
userspace library has an FD-controlled subprocess -- if the application
using the library terminates, the process descriptor is closed and so
the subprocess is automatically terminated.

>
>    C library/kernel ABI differences
>        As with clone(2), the raw clone4() system call corresponds more closely
>        to fork(2) in that execution in the child continues from the  point  of
>        the call.
>
>        Unlike  clone(2),  the  raw  system call interface for clone4() accepts
>        arguments in the same order on all architectures.
>
>        The raw system call accepts flags as two 32-bit  arguments,  flags_high
>        and  flags_low, to simplify portability across 32-bit and 64-bit archi‐
>        tectures and calling conventions.  The glibc wrapper accepts flags as a
>        single 64-bit argument for convenience.
>
>
> RETURN VALUE
>        For the glibc wrapper, on success, clone4() returns the file descriptor
>        (with CLONE_FD) or new process ID (without  CLONE_FD),  and  the  child
>        process begins running at the specified function.
>
>        For  the  raw syscall, on success, clone4() returns the file descriptor
>        or new process ID to the calling process, and  returns  0  in  the  new
>        child process.
>
>        On failure, clone4() returns -1 and sets errno accordingly.
>
>
> ERRORS
>        clone4()  can  return any error from clone(2), as well as the following
>        additional errors:
>
>        EINVAL flags contained an unknown flag.
>
>        EINVAL flags included CLONE_FD, but the kernel configuration  does  not
>               have the CONFIG_CLONEFD option enabled.
>
>        EMFILE flags  included  CLONE_FD,  but  the  new  file descriptor would
>               exceed the process limit on open file descriptors.
>
>        ENFILE flags included CLONE_FD,  but  the  new  file  descriptor  would
>               exceed the system-wide limit on open file descriptors.
>
>        ENODEV flags  included  CLONE_FD,  but  clone4()  could  not  mount the
>               (internal) anonymous inode device.
>
>
> CONFORMING TO
>        clone4() is Linux-specific and should not be used in programs  intended
>        to be portable.
>
>
> SEE ALSO
>        clone(2), epoll(7), poll(2), pthreads(7), read(2), select(2)
>
>
>
> Linux                             2015-03-01                         CLONE4(2)

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
  2015-03-13  1:41 ` [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd Josh Triplett
@ 2015-03-13 16:21   ` Oleg Nesterov
  2015-03-13 19:57     ` josh
  2015-03-14 14:35     ` Oleg Nesterov
  1 sibling, 1 reply; 83+ messages in thread
From: Oleg Nesterov @ 2015-03-13 16:21 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
	Thiago Macieira, Michael Kerrisk, linux-kernel, linux-api,
	linux-fsdevel, x86

Josh,

I'll certainly try to read this series, but not before next week.

But a couple of nits right now.

On 03/12, Josh Triplett wrote:
>
> When passed CLONE_FD, clone4 will return a file descriptor rather than a
> PID.  When the child process exits, it gets automatically reaped,

And even though I have no idea what you are actually doing, this doesn't
look right; see below.

> +static unsigned int clonefd_poll(struct file *file, poll_table *wait)
> +{
> +	struct task_struct *p = file->private_data;
> +	poll_wait(file, &p->clonefd_wqh, wait);
> +	return p->exit_state == EXIT_DEAD ? (POLLIN | POLLRDNORM) : 0;
> +}
> +
> +static ssize_t clonefd_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
> +{
> +	struct task_struct *p = file->private_data;
> +	int ret = 0;
> +
> +	/* EOF after first read */
> +	if (*ppos)
> +		return 0;
> +
> +	if (file->f_flags & O_NONBLOCK)
> +		ret = -EAGAIN;
> +	else
> +		ret = wait_event_interruptible(p->clonefd_wqh, p->exit_state == EXIT_DEAD);
> +
> +	if (p->exit_state == EXIT_DEAD) {

Again, I simply do not know what this code does at all. But I bet the usage
of EXIT_DEAD is wrong ;)

OK, OK, I can be wrong. But I simply do not see what protects this task_struct
if it is EXIT_DEAD (in fact even if it is EXIT_ZOMBIE).

> @@ -598,7 +600,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
>  	if (group_dead)
>  		kill_orphaned_pgrp(tsk->group_leader, NULL);
>  
> -	if (unlikely(tsk->ptrace)) {
> +	if (tsk->autoreap) {
> +		autoreap = true;

Debuggers won't be happy. A ptraced task should not autoreap itself.

Oleg.


* Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor
  2015-03-13 16:05   ` David Drysdale
  (?)
@ 2015-03-13 19:42   ` Josh Triplett
  2015-03-13 21:16     ` Thiago Macieira
                       ` (2 more replies)
  -1 siblings, 3 replies; 83+ messages in thread
From: Josh Triplett @ 2015-03-13 19:42 UTC (permalink / raw)
  To: David Drysdale
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Thiago Macieira, Michael Kerrisk, linux-kernel,
	Linux API, linux-fsdevel, X86 ML

On Fri, Mar 13, 2015 at 04:05:29PM +0000, David Drysdale wrote:
> On Fri, Mar 13, 2015 at 1:40 AM, Josh Triplett <josh@joshtriplett.org> wrote:
> > This patch series introduces a new clone flag, CLONE_FD, which lets the caller
> > handle child process exit notification via a file descriptor rather than
> > SIGCHLD.  CLONE_FD makes it possible for libraries to safely launch and manage
> > child processes on behalf of their caller, *without* taking over process-wide
> > SIGCHLD handling (either via signal handler or signalfd).
> 
> Hi Josh,
> 
> From the overall description (i.e. I haven't looked at the code yet)
> this looks very interesting.  However, it seems to cover a lot of the
> same ground as the process descriptor feature that was added to FreeBSD
> in 9.x/10.x:
>   https://www.freebsd.org/cgi/man.cgi?query=pdfork&sektion=2

Interesting.

> I think it would ideally be nice for a userspace library developer to be
> able to do subprocess management (without SIGCHLD) in a similar way
> across both platforms, without lots of complicated autoconf shenanigans.
>
> So could we look at the overlap and see if we can come up with
> something that covers your requirements and also allows for something
> that looks like FreeBSD's process descriptors?

Agreed; however, I think it's reasonable to provide appropriate Linux
system calls, and then let glibc or libbsd or similar provide the
BSD-compatible calls on top of those.  I don't think the kernel
interface needs to exactly match FreeBSD's, as long as it's a superset
of the functionality.

For example, pdfork can just call clone4 with CLONE_FD and return the
resulting file descriptor.

In my further comments below, I'll suggest ways that the FreeBSD library
calls could be implemented on top of Linux system calls.

> (I've actually got some rough patches to add process descriptor
> functionality on Linux, so I can look at how the two approaches compare
> and contrast.)
> 
> > Note that signalfd for SIGCHLD does not suffice here, because that still
> > receives notification for all child processes, and interferes with process-wide
> > signal handling.
> >
> > The CLONE_FD file descriptor uniquely identifies a process on the system in a
> > race-free way, by holding a reference to the task_struct.  In the future, we
> > may introduce APIs that support using process file descriptors instead of PIDs.
> 
> FreeBSD has pdkill(2) and (theoretically) pdwait4(2) along these lines.
> I suspect we need either pdkill(2) or a way to retrieve a PID from
> a process file descriptor, so that there's a way to send signals to the
> child.

The original caller of clone4 with CLONE_FD can pass CLONE_PARENT_SETTID
to get the PID.

In the future, I plan to add an fd-based equivalent of
rt_{,tg}sigqueueinfo (likely a single syscall with a flag to determine
whether to kill a process or thread) which is a superset of pdkill.
pdkill could then call that and just not pass the extra info.

A fair bit of pdwait4 could be implemented on top of read(), other than
the full rusage information (see below), and the ability to wait for
STOP/CONT (which the CLONE_FD file descriptor could support if desired,
but it'd have to be set via a flag at clone time).

I think it's a feature to use read() rather than an additional magic
system call.
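
As a rough illustration of the read()-based approach, the sketch below reads one exit record into the clonefd_info layout from the manpage draft.  The helper name read_exit_info() is made up, and since clone4 is not available outside this series, a pipe or any other readable descriptor has to stand in for the CLONE_FD descriptor when experimenting.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Layout copied from the clone4(2) manpage draft in this thread. */
struct clonefd_info {
    uint32_t code;   /* Signal code */
    uint32_t status; /* Exit status or signal */
    uint64_t utime;  /* User CPU time */
    uint64_t stime;  /* System CPU time */
};

/*
 * Hypothetical helper: read one exit record from fd.
 * Returns the number of bytes the descriptor supplied (0 means EOF),
 * or -1 with errno set (for example EAGAIN on a nonblocking descriptor).
 * Fields beyond the returned length are left zeroed, so a caller built
 * against a longer structure still gets defined values from a kernel
 * that supplies a shorter one.
 */
ssize_t read_exit_info(int fd, struct clonefd_info *info)
{
    ssize_t n;

    memset(info, 0, sizeof(*info));
    do {
        n = read(fd, info, sizeof(*info));
    } while (n < 0 && errno == EINTR);

    return n;
}
```

A pdwait4-style wrapper would then translate code and status into a wait status, and would have to leave the rusage fields other than utime and stime empty.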

> > Introducing CLONE_FD required two additional bits of yak shaving: Since clone
> > has no more usable flags (with the three currently unused flags unusable
> > because old kernels ignore them without EINVAL), also introduce a new clone4
> > system call with more flag bits and an extensible argument structure.  And
> > since the magic pt_regs-based syscall argument processing for clone's tls
> > argument would otherwise prevent introducing a sane clone4 system call, fix
> > that too.
> >
> > I tested the CLONE_SETTLS changes with a thread-local storage test program (two
> > threads independently reading and writing a __thread variable), on both 32-bit
> > and 64-bit, and I observed no issues there.
> 
> Worth preserving in tools/testing/selftests/ ?

Not really; it's just the following trivial program, which was faster to
write than to attempt to find somewhere:

#include <pthread.h>
#include <stdio.h>

__thread unsigned x = 0;

void *thread_func(void *unused)
{
    unsigned *tx = &x;
    for (; *tx < 10; (*tx)++)
        printf("child: tx=%p *tx=%u\n", (void *)tx, *tx);
    return NULL;
}

int main(void)
{
    unsigned *tx = &x;
    pthread_t thread;
    pthread_create(&thread, NULL, thread_func, NULL);
    for (; *tx < 10; (*tx)++)
        printf("main: tx=%p *tx=%u\n", (void *)tx, *tx);
    pthread_join(thread, NULL);
    return 0;
}

(I didn't bother with error handling, because I ran it under strace.)

> > I tested clone4 and the new CLONE_FD call with several additional test
> > programs, launching either a process or thread (in the former case using
> > syscall(), in the latter case by calling clone4 via assembly and returning to
> > C), sleeping in parent and child to test the case of either exiting first, and
> > then printing the received clonefd_info structure.  Thiago also tested clone4
> > with CLONE_FD with a modified version of libqt's process handling, which
> > includes a test suite.
> >
> > I've also included the manpages patch at the end of this series.  (Note that
> > the manpage documents the behavior of the future glibc wrapper as well as the
> > raw syscall.)  Here's a formatted plain-text version of the manpage for
> > reference:
> 
> FYI, I've added some comparisons with the FreeBSD equivalents below.

Thanks!

> > CLONE4(2)                  Linux Programmer's Manual                 CLONE4(2)
> >
> >
> >
> > NAME
> >        clone4 - create a child process
> >
> > SYNOPSIS
> >        /* Prototype for the glibc wrapper function */
> >
> >        #define _GNU_SOURCE
> >        #include <sched.h>
> >
> >        int clone4(uint64_t flags,
> >                   size_t args_size,
> >                   struct clone4_args *args,
> >                   int (*fn)(void *), void *arg);
> >
> >        /* Prototype for the raw system call */
> >
> >        int clone4(unsigned flags_high, unsigned flags_low,
> >                   unsigned long args_size,
> >                   struct clone4_args *args);
> >
> >        struct clone4_args {
> >            pid_t *ptid;
> >            pid_t *ctid;
> >            unsigned long stack_start;
> >            unsigned long stack_size;
> >            unsigned long tls;
> >        };
> >
> >
> > DESCRIPTION
> >        clone4()  creates  a  new  process,  similar  to  clone(2) and fork(2).
> >        clone4() supports additional flags that clone(2) does not, and  accepts
> >        arguments via an extensible structure.
> >
> >        args  points to a clone4_args structure, and args_size must contain the
> >        size of that structure, as understood by the  caller.   If  the  caller
> >        passes  a  shorter  structure  than  the  kernel expects, the remaining
> >        fields will default to 0.  If the caller passes a larger structure than
> >        the  kernel  expects  (such  as one from a newer kernel), clone4() will
> >        return EINVAL.  The clone4_args structure may gain additional fields at
> >        the  end  in  the future, and callers must only pass a size that encom‐
> >        passes the number of fields they understand.  If the  caller  passes  0
> >        for args_size, args is ignored and may be NULL.
> >
> >        In  the clone4_args structure, ptid, ctid, stack_start, stack_size, and
> >        tls have the same semantics as they do with clone(2) and clone2(2).
> >
> >        In the glibc wrapper, fn and arg have the same  semantics  as  they  do
> >        with clone(2).  As with clone(2), the underlying system call works more
> >        like fork(2), returning 0 in the child process; the glibc wrapper  sim‐
> >        plifies  thread execution by calling fn(arg) and exiting the child when
> >        that function exits.
> >
> >        The 64-bit  flags  argument  (split  into  the  32-bit  flags_high  and
> >        flags_low arguments in the kernel interface) accepts all the same flags
> >        as  clone(2),  with  the   exception   of   the   obsolete   CLONE_PID,
> >        CLONE_DETACHED, and CLONE_STOPPED.  In addition, flags accepts the fol‐
> >        lowing flags:
> >
> >
> >        CLONE_FD
> >               Instead of returning a process ID, clone4()  with  the  CLONE_FD
> >               flag  returns a file descriptor associated with the new process.
> >               When the new process exits, the kernel will not send a signal to
> >               the  parent process, and will not keep the new process around as
> >               a "zombie" process  until  a  call  to  waitpid(2)  or  similar.
> >               Instead,  the file descriptor will become available for reading,
> >               and the new process will be immediately reaped.
> 
> Just to confirm: presumably a waitpid(-1,...) call that's already in
> progress won't return when one of these child processes exits?

I agree, I don't think it should.  Because otherwise you'd also assume
you can waitpid() on the PID itself, and that'd be a race condition
since the process autoreaps.

> >               Unlike using  signalfd(2)  for  the  SIGCHLD  signal,  the  file
> >               descriptor  returned  by  clone4()  with the CLONE_FD flag works
> >               even with SIGCHLD unblocked in one or more threads of the parent
> >               process,  and  allows the process to have different handlers for
> >               different child processes, such as those created by  a  library,
> >               without  introducing  race conditions around process-wide signal
> >               handling.
> >
> >               clone4() will never return a file descriptor in the range 0-2 to
> >               the caller, to avoid ambiguity with the return of 0 in the child
> >               process.  Only the  calling  process  will  have  the  new  file
> >               descriptor open; the child process will not.
> 
> FreeBSD's pdfork(2) returns a PID but also takes an int *fdp argument to
> return the file descriptor separately, which avoids the need for special
> case processing for low FD values (and means that POSIX's "lowest file
> descriptor not currently open" behaviour can be preserved if desired).

That'd be easy to implement if desired, by adding an outbound pointer to
clone4_args.

The (very mild) reason I'd dropped the PID: with CLONE_FD and future
syscalls that use the fd as an identifier, PIDs can hopefully become
mostly unnecessary.  However, I'm not that attached to changing the
return value; it'd be trivial to switch to an outbound parameter
instead, and then drop the "not 0-2".
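
Appending an outbound pointer is compatible with the args_size rule in the
manpage text quoted above. Below is a minimal userspace model of that rule,
not the code from the series. The struct name, the trailing `clonefd` field
and the helper `copy_clone4_args` are all hypothetical names made up for
illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Fields from the manpage, plus one hypothetical trailing field. */
struct clone4_args_sketch {
    pid_t *ptid;
    pid_t *ctid;
    unsigned long stack_start;
    unsigned long stack_size;
    unsigned long tls;
    int *clonefd;   /* hypothetical outbound pointer, not in the posted series */
};

/*
 * Userspace model of the args_size rule: a shorter caller structure is
 * zero-extended, a larger one is rejected with EINVAL, and a size of 0
 * means the source pointer is ignored (it may be NULL).
 */
static int copy_clone4_args(struct clone4_args_sketch *dst,
                            const void *src, size_t src_size)
{
    if (src_size > sizeof(*dst))
        return -EINVAL;
    memset(dst, 0, sizeof(*dst));
    if (src_size)
        memcpy(dst, src, src_size);
    return 0;
}
```

A caller built against the five-field layout passes the smaller size and
sees the new field as NULL, so it keeps working unchanged.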

> >               Since the kernel does not send a termination signal when a child
> >               process created with CLONE_FD exits, the low byte of flags  does
> >               not contain a signal number.  Instead, the low byte of flags can
> >               contain the following additional flags for use with CLONE_FD:
> >
> >
> >               CLONEFD_CLOEXEC
> >                      Set the O_CLOEXEC flag on the new open  file  descriptor.
> >                      See  the description of the O_CLOEXEC flag in open(2) for
> >                      reasons why this may be useful.
> >
> >
> >               CLONEFD_NONBLOCK
> >                      Set the O_NONBLOCK flag on the new open file  descriptor.
> >                      Using  this flag saves extra calls to fcntl(2) to achieve
> >                      the same result.
> >
> >
> >               clone4() with the CLONE_FD flag returns a file  descriptor  that
> >               supports the following operations:
> >
> >               read(2) (and similar)
> >                      When  the  new  process  exits,  reading  from  the  file
> >                      descriptor produces a single clonefd_info structure:
> >
> >                      struct clonefd_info {
> >                          uint32_t code;   /* Signal code */
> >                          uint32_t status; /* Exit status or signal */
> >                          uint64_t utime;  /* User CPU time */
> >                          uint64_t stime;  /* System CPU time */
> >                      };
> 
> Presumably there is no way to get full rusage information for the exited
> process?

I focused on the information available via SIGCHLD.  Even utime and
stime are unnecessary for the primary use case of CLONE_FD, but I
included them because SIGCHLD does.  I'd like to avoid sending the much
larger rusage over the file descriptor when the caller may not care.

However, given that the task_struct sticks around as long as the
CLONE_FD file descriptor does, if that information is normally still
available from a dead-but-not-waited-on process, it should be trivial to
add an operation that takes the file descriptor and returns the full
rusage, if someone needs that.  I think that can be done as part of a
later patch series adding other operations for use with the file
descriptor, though.
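
For the fixed-size clonefd_info that read(2) does return, a caller has to
cope with the kernel supplying fewer bytes than it asked for. A minimal
sketch of that, using the struct layout from the manpage above; the helper
name `clonefd_info_from_bytes` is an assumption for illustration only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layout copied from the manpage text. */
struct clonefd_info {
    uint32_t code;   /* Signal code */
    uint32_t status; /* Exit status or signal */
    uint64_t utime;  /* User CPU time */
    uint64_t stime;  /* System CPU time */
};

/*
 * Interpret the first len bytes returned by read(2) as a clonefd_info.
 * Fields that were not supplied are left as zero; bytes beyond the
 * structure this caller understands are ignored. Returns the number of
 * bytes actually consumed.
 */
static size_t clonefd_info_from_bytes(struct clonefd_info *info,
                                      const void *buf, size_t len)
{
    size_t n = len < sizeof(*info) ? len : sizeof(*info);

    memset(info, 0, sizeof(*info));
    memcpy(info, buf, n);
    return n;
}
```

The caller would pass the return value of read(2) as len after checking it
for errors and for 0 (end-of-file).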

> [FreeBSD theoretically has pdwait4(2) to do wait4-like operations on a
> process descriptor, including rusage retrieval.  However, I don't think
> they actually implemented it:
>   http://fxr.watson.org/fxr/source/kern/syscalls.master#L928]

That's a pretty good argument that we don't need to either, at least not
yet.

> >                      If the new process has not  yet  exited,  read(2)  either
> >                      blocks  until  it does, or fails with the error EAGAIN if
> >                      the file descriptor has been made nonblocking.
> >
> >                      Future kernels may extend clonefd_info by appending addi‐
> >                      tional  fields  to  the end.  Callers should read as many
> >                      bytes as they understand; unread data will be  discarded,
> >                      and  subsequent  reads  after  the first will return 0 to
> >                      indicate end-of-file.  Callers requesting more bytes than
> >                      the  kernel  provides  (such as callers expecting a newer
> >                      clonefd_info structure) will receive a shorter  structure
> >                      from older kernels.
> 
> FreeBSD also implements fstat(2) for its process descriptors, although
> only a few of the fields get filled in.

I looked at what they provide, and that seems like more of a novelty
than something particularly useful (since most of the stat fields aren't
meaningful), but if that's useful for compatibility then adding it seems
fine.

> >               poll(2), select(2), epoll(7) (and similar)
> >                      The  file  descriptor  is readable (the select(2) readfds
> >                      argument; the poll(2) POLLIN flag) if the new process has
> >                      exited.
> 
> FreeBSD uses POLLHUP here.

That makes sense given that they provide the information via a separate
call rather than read.  Since the CLONE_FD file descriptor uses read, it
needs to provide POLLIN, but I have no objection to using *both* POLLIN
and POLLHUP if that'd be at all useful.
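
If both bits were reported, a caller could treat either one as "the child
is gone". A minimal sketch of such a wait, assuming nothing beyond standard
poll(2); the helper name `wait_for_exit_event` is made up, and it works on
any pollable descriptor, so a pipe can stand in for a CLONE_FD descriptor
when trying it out.

```c
#include <poll.h>

/*
 * Wait until fd has an event. Only POLLIN is requested; POLLHUP is
 * always reported by poll(2) when it applies. Returns the revents mask,
 * 0 on timeout, or -1 on error (errno set by poll).
 */
static int wait_for_exit_event(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int rc = poll(&pfd, 1, timeout_ms);

    if (rc <= 0)
        return rc;
    return pfd.revents;
}
```

A caller would then test the result with `& (POLLIN | POLLHUP)` and go on
to read(2) when POLLIN is set.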

> >               close(2)
> >                      When  the file descriptor is no longer required it should
> >                      be closed.  If no process has a file descriptor open  for
> >                      the new process, no process will receive any notification
> >                      when the new process exits.  The new process  will  still
> >                      be immediately reaped.
> 
> FreeBSD has two different behaviours for close(2), depending on a flag
> value (PD_DAEMON).  With the flag set it's roughly like this, but
> without PD_DAEMON a close(2) operation on the (last open) file
> descriptor terminates the child process.
> 
> This can be quite useful, particularly for the use case where some
> userspace library has an FD-controlled subprocess -- if the application
> using the library terminates, the process descriptor is closed and so
> the subprocess is automatically terminated.

That's an interesting idea.  I don't think it makes sense for that to be
the default behavior, but if someone wanted to add an additional flag
to implement that behavior, that seems fine.  A FreeBSD-compatible
pdfork could then use that flag when not passed PD_DAEMON and not use it
when passed PD_DAEMON.

How does it kill the process when the last open descriptor closes?
SIGKILL?  SIGTERM?  The former seems unfriendly (preventing graceful
termination), and the latter blockable.  There's a reason init systems
send TERM, then wait, then KILL.
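
For reference, the TERM, wait, KILL sequence looks roughly like this in
userspace today with PIDs. This is only a sketch of the conventional
pattern, not a proposal for what close(2) should do; the helper name
`terminate_child` and the 10 ms polling interval are arbitrary choices.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>

/*
 * Send SIGTERM to pid, poll for its exit for up to grace_ms, then send
 * SIGKILL and wait. Returns SIGTERM if the child was reaped during the
 * grace period, SIGKILL if SIGKILL had to be sent, or -1 on error.
 * *status receives the waitpid() status.
 */
static int terminate_child(pid_t pid, int grace_ms, int *status)
{
    const struct timespec tick = { 0, 10 * 1000 * 1000 }; /* 10 ms */
    int waited = 0;

    if (kill(pid, SIGTERM) < 0)
        return -1;
    while (waited < grace_ms) {
        pid_t r = waitpid(pid, status, WNOHANG);

        if (r == pid)
            return SIGTERM;
        if (r < 0 && errno != EINTR)
            return -1;
        nanosleep(&tick, NULL);
        waited += 10;
    }
    if (kill(pid, SIGKILL) < 0)
        return -1;
    while (waitpid(pid, status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return SIGKILL;
}
```

A single signal sent from close(2) cannot express the grace period in the
middle, which is what makes the choice of signal awkward.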

- Josh Triplett

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
  2015-03-13 16:21   ` Oleg Nesterov
@ 2015-03-13 19:57     ` josh
  2015-03-13 21:34         ` Andy Lutomirski
  2015-03-14 14:14         ` Oleg Nesterov
  0 siblings, 2 replies; 83+ messages in thread
From: josh @ 2015-03-13 19:57 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
	Thiago Macieira, Michael Kerrisk, linux-kernel, linux-api,
	linux-fsdevel, x86

On Fri, Mar 13, 2015 at 05:21:13PM +0100, Oleg Nesterov wrote:
> Josh,
> 
> I'll certainly try to read this series, but not before next week.

Thanks for looking at it.

> but a couple of nits right now.
> 
> On 03/12, Josh Triplett wrote:
> >
> > When passed CLONE_FD, clone4 will return a file descriptor rather than a
> > PID.  When the child process exits, it gets automatically reaped,
> 
> And even I have no idea what you are actually doing, this doesn't look
> right, see below.
> 
> > +static unsigned int clonefd_poll(struct file *file, poll_table *wait)
> > +{
> > +	struct task_struct *p = file->private_data;
> > +	poll_wait(file, &p->clonefd_wqh, wait);
> > +	return p->exit_state == EXIT_DEAD ? (POLLIN | POLLRDNORM) : 0;
> > +}
> > +
> > +static ssize_t clonefd_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
> > +{
> > +	struct task_struct *p = file->private_data;
> > +	int ret = 0;
> > +
> > +	/* EOF after first read */
> > +	if (*ppos)
> > +		return 0;
> > +
> > +	if (file->f_flags & O_NONBLOCK)
> > +		ret = -EAGAIN;
> > +	else
> > +		ret = wait_event_interruptible(p->clonefd_wqh, p->exit_state == EXIT_DEAD);
> > +
> > +	if (p->exit_state == EXIT_DEAD) {
> 
> Again, I simply do not know what this code does at all. But I bet the usage
> of EXIT_DEAD is wrong ;)
> 
> OK, OK, I can be wrong. But I simply do not see what protects this task_struct
> if it is EXIT_DEAD (in fact even if it is EXIT_ZOMBIE).

If by "what protects" you mean "what keeps it alive", the file
descriptor holds a reference to the task_struct by calling
get_task_struct when created and put_task_struct when released.

This wait_event_interruptible pairs with the wake_up_all called from
clonefd_do_notify, which exit_notify calls *after* setting the task's
exit_state to EXIT_DEAD.

Apart from that, what about what the code is doing isn't clear?

> > @@ -598,7 +600,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> >  	if (group_dead)
> >  		kill_orphaned_pgrp(tsk->group_leader, NULL);
> >  
> > -	if (unlikely(tsk->ptrace)) {
> > +	if (tsk->autoreap) {
> > +		autoreap = true;
> 
> Debuggers won't be happy. A ptraced task should not autoreap itself.

A process launching a new process with CLONE_FD is explicitly requesting
that the process be automatically reaped without any other process
having to wait on it.  The task needs to not become a zombie, because
otherwise, it'll show up in waitpid(-1, ...) calls in the parent
process, which would break the ability to use this to completely
encapsulate process management within a library and not interfere with
the parent's process handling via SIGCHLD and wait{pid,3,4}.
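
The interference is easy to reproduce with ordinary fork() today. A minimal
sketch (the function name is made up): one part of the program forks a
child, another part calls waitpid(-1, ...), and the first part can then no
longer collect its own child's status.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child "owned by a library", let an unrelated waitpid(-1) in the
 * same process reap it, then try to wait for it by PID. Returns the errno
 * from the second wait (ECHILD when the status was already consumed),
 * 0 if the second wait unexpectedly succeeded, or -1 if setup failed.
 */
static int demo_wait_any_steals_child(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(7);

    /* Application code: reap "any" child. */
    if (waitpid(-1, &status, 0) != pid)
        return -1;

    /* Library code: its child has already been reaped. */
    if (waitpid(pid, &status, 0) == pid)
        return 0;
    return errno;
}
```

With a zombie that waitpid(-1, ...) can see, the library loses the exit
status; with autoreap the status only ever travels through the file
descriptor.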

- Josh Triplett

* Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor
  2015-03-13 19:42   ` Josh Triplett
@ 2015-03-13 21:16     ` Thiago Macieira
  2015-03-13 21:44       ` josh
  2015-03-13 21:33     ` Andy Lutomirski
  2015-03-15  8:55       ` David Drysdale
  2 siblings, 1 reply; 83+ messages in thread
From: Thiago Macieira @ 2015-03-13 21:16 UTC (permalink / raw)
  To: Josh Triplett
  Cc: David Drysdale, Al Viro, Andrew Morton, Andy Lutomirski,
	Ingo Molnar, Kees Cook, Oleg Nesterov, Paul E. McKenney,
	H. Peter Anvin, Rik van Riel, Thomas Gleixner, Michael Kerrisk,
	linux-kernel, Linux API, linux-fsdevel, X86 ML

On Friday 13 March 2015 12:42:52 Josh Triplett wrote:
> > Hi Josh,
> > 
> > From the overall description (i.e. I haven't looked at the code yet)
> > this looks very interesting.  However, it seems to cover a lot of the
> > same ground as the process descriptor feature that was added to FreeBSD
> > in 9.x/10.x:
> >   https://www.freebsd.org/cgi/man.cgi?query=pdfork&sektion=2
> 
> Interesting.

Hi Josh, David

I wasn't aware of the FreeBSD implementation of pdfork(). It is actually
exactly what I need in userspace. The only difference between pdfork() and
my proposed forkfd() is where the PID and the file descriptor are
returned (meaning, which is optional and which isn't).

Josh and I opted to return the file descriptor in the regular return value in
forkfd and in clone4 because getting the file descriptor is the whole objective
of using forkfd or clone4-with-CLONE_FD in the first place: the file descriptor
is not optional, but the PID is.

> Agreed; however, I think it's reasonable to provide appropriate Linux
> system calls, and then let glibc or libbsd or similar provide the
> BSD-compatible calls on top of those.  I don't think the kernel
> interface needs to exactly match FreeBSD's, as long as it's a superset
> of the functionality.
> 
> For example, pdfork can just call clone4 with CLONE_FD and return the
> resulting file descriptor.

Agreed, we should recommend libc implement pdfork(), pdkill() and pdwait4().

I'm not too attached to the forkfd() interface, but I find it slightly superior 
for the reasons above.

If we want the PD_DAEMON flag, it will have to translate to a clone flag, like 
CLONEFD_DAEMON or inverted like CLONEFD_KILL_ON_CLOSE.

> In the future, I plan to add an fd-based equivalent of
> rt_{,tg}sigqueueinfo (likely a single syscall with a flag to determine
> whether to kill a process or thread) which is a superset of pdkill.
> pdkill could then call that and just not pass the extra info.
> 
> A fair bit of pdwait4 could be implemented on top of read(), other than
> the full rusage information (see below), and the ability to wait for
> STOP/CONT (which the CLONE_FD file descriptor could support if desired,
> but it'd have to be set via a flag at clone time).
> 
> I think it's a feature to use read() rather than an additional magic
> system call.

Indeed, even if the libc provides a wrapper for you, like glibc does for 
eventfd (eventfd_read, eventfd_write).

Josh and I didn't want to submit "killfd" (or pdkill in the FreeBSD name) in 
the initial patch set, but it was part of the plans.

> > >               clone4() will never return a file descriptor in the range 0-2 to
> > >               the caller, to avoid ambiguity with the return of 0 in the child
> > >               process.  Only the  calling  process  will  have  the  new  file
> > >               descriptor open; the child process will not.
> > 
> > FreeBSD's pdfork(2) returns a PID but also takes an int *fdp argument to
> > return the file descriptor separately, which avoids the need for special
> > case processing for low FD values (and means that POSIX's "lowest file
> > descriptor not currently open" behaviour can be preserved if desired).
> 
> That'd be easy to implement if desired, by adding an outbound pointer to
> clone4_args.
>
> The (very mild) reason I'd dropped the PID: with CLONE_FD and future
> syscalls that use the fd as an identifier, PIDs can hopefully become
> mostly unnecessary.  However, I'm not that attached to changing the
> return value; it'd be trivial to switch to an outbound parameter
> instead, and then drop the "not 0-2".

See above for more motivation on making the PID optional.

As for the file descriptor range, if we need to be able to return 0, we can 
implement a magic constant to mean the child process, like the userspace 
forkfd() does (FFD_CHILD_PROCESS). We'd probably choose the value -4096 on 
Linux, since that is neither a valid file descriptor nor a valid errno value.
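
A sketch of how a caller could tell the three cases apart under that
scheme, assuming the raw syscall convention of returning -errno and a
maximum errno of 4095. The constant and function names are hypothetical;
only the value -4096 comes from the paragraph above.

```c
/* Hypothetical "you are the child" marker; errno values use -1 .. -4095. */
#define CLONEFD_CHILD_SKETCH (-4096L)

enum clone_result {
    CLONE_RESULT_ERROR,
    CLONE_RESULT_CHILD,
    CLONE_RESULT_PARENT_FD
};

/* Classify a raw return value: fd (including 0-2), child marker, or error. */
static enum clone_result classify_clone_return(long ret)
{
    if (ret >= 0)
        return CLONE_RESULT_PARENT_FD;
    if (ret == CLONEFD_CHILD_SKETCH)
        return CLONE_RESULT_CHILD;
    return CLONE_RESULT_ERROR;
}
```

With this, 0, 1 and 2 become ordinary file descriptors in the parent and
the "not 0-2" restriction is no longer needed.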

> > [FreeBSD theoretically has pdwait4(2) to do wait4-like operations on a
> > process descriptor, including rusage retrieval.  However, I don't think
> > they actually implemented it:
> >   http://fxr.watson.org/fxr/source/kern/syscalls.master#L928]
> 
> That's a pretty good argument that we don't need to either, at least not
> yet.

pdwait4() can be implemented on top of read(), with the WNOHANG flag simply
toggling the O_NONBLOCK bit. The problem is the rest of the flags. We
could implement them via more ioctls issued prior to read() if we don't want
to add a syscall...
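
The WNOHANG part could look roughly like the following in a libc wrapper.
This is a sketch using only fcntl(2) and read(2); the helper names are
invented, and it works on any descriptor, so a pipe can stand in for the
CLONE_FD descriptor.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Set or clear O_NONBLOCK on fd, leaving other status flags alone.
 * Returns 0 on success, -1 on error.
 */
static int set_nonblocking(int fd, int enable)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags < 0)
        return -1;
    flags = enable ? (flags | O_NONBLOCK) : (flags & ~O_NONBLOCK);
    return fcntl(fd, F_SETFL, flags) < 0 ? -1 : 0;
}

/*
 * read(2) with the blocking mode chosen per call. With nohang set and
 * nothing to read yet, fails with errno EAGAIN instead of blocking.
 */
static ssize_t read_exit_info(int fd, void *buf, size_t len, int nohang)
{
    if (set_nonblocking(fd, nohang) < 0)
        return -1;
    return read(fd, buf, len);
}
```

One caveat: O_NONBLOCK lives on the open file description, so toggling it
per call also affects anyone else sharing that description (for example via
dup(2)).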

Another alternative is to add a P_PD flag that can be passed as the first 
argument to waitid(), making the second argument a file descriptor instead of a 
PID or pgrp.

> > FreeBSD also implements fstat(2) for its process descriptors, although
> > only a few of the fields get filled in.
> 
> I looked at what they provide, and that seems like more of a novelty
> than something particularly useful (since most of the stat fields aren't
> meaningful), but if that's useful for compatibility then adding it seems
> fine.

I don't think we need to do anything: anon_inode will do it for us.

If I stat an eventfd:

	stat("/proc/107751/fd/4", {st_dev=makedev(0, 9), st_ino=3943, 
st_mode=0600, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, 
st_size=0, st_atime=2015/03/07-16:40:28, st_mtime=2015/03/12-16:12:00, 
st_ctime=2015/03/12-16:12:00}) = 0

And just out of curiosity, in the following order: epoll, signalfd, timerfd 
and inotify:

	stat("/proc/1462/fd/4", {st_dev=makedev(0, 9), st_ino=3943, st_mode=0600, 
st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, 
st_atime=2015/03/07-16:40:28, st_mtime=2015/03/12-16:12:00, 
st_ctime=2015/03/12-16:12:00}) = 0
	stat("/proc/1462/fd/5", {st_dev=makedev(0, 9), st_ino=3943, st_mode=0600, 
st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, 
st_atime=2015/03/07-16:40:28, st_mtime=2015/03/12-16:12:00, 
st_ctime=2015/03/12-16:12:00}) = 0
	stat("/proc/1462/fd/7", {st_dev=makedev(0, 9), st_ino=3943, st_mode=0600, 
st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, 
st_atime=2015/03/07-16:40:28, st_mtime=2015/03/12-16:12:00, 
st_ctime=2015/03/12-16:12:00}) = 0
	stat("/proc/1462/fd/8", {st_dev=makedev(0, 9), st_ino=3943, st_mode=0600, 
st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, 
st_atime=2015/03/07-16:40:28, st_mtime=2015/03/12-16:12:00, 
st_ctime=2015/03/12-16:12:00}) = 0

(that process is systemd --user)
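
The same check can be made from inside a program with fstat(2) instead of
going through /proc. A minimal sketch using an eventfd as the example
anon_inode descriptor; the helper name `stat_anon_fd` is made up.

```c
#include <stdio.h>
#include <sys/eventfd.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create an eventfd, fstat() it, and print a few fields. Returns 0 on
 * success, -1 on error. *st receives the stat result.
 */
static int stat_anon_fd(struct stat *st)
{
    int fd = eventfd(0, EFD_CLOEXEC);
    int rc;

    if (fd < 0)
        return -1;
    rc = fstat(fd, st);
    close(fd);
    if (rc < 0)
        return -1;
    printf("st_mode=%o st_nlink=%lu st_size=%lld\n",
           (unsigned)st->st_mode, (unsigned long)st->st_nlink,
           (long long)st->st_size);
    return 0;
}
```

Note that all five descriptors in the strace output above report the same
st_ino, so fstat() on these descriptors does not identify the object behind
them.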

> > >               poll(2), select(2), epoll(7) (and similar)
> > >                      The  file  descriptor  is readable (the select(2) readfds
> > >                      argument; the poll(2) POLLIN flag) if the new process has
> > >                      exited.
> > 
> > FreeBSD uses POLLHUP here.
> 
> That makes sense given that they provide the information via a separate
> call rather than read.  Since the CLONE_FD file descriptor uses read, it
> needs to provide POLLIN, but I have no objection to using *both* POLLIN
> and POLLHUP if that'd be at all useful.

I think we should provide both, since we're notifying that there are things to 
be read and that the file descriptor has closed.

> > FreeBSD has two different behaviours for close(2), depending on a flag
> > value (PD_DAEMON).  With the flag set it's roughly like this, but
> > without PD_DAEMON a close(2) operation on the (last open) file
> > descriptor terminates the child process.
> > 
> > This can be quite useful, particularly for the use case where some
> > userspace library has an FD-controlled subprocess -- if the application
> > using the library terminates, the process descriptor is closed and so
> > the subprocess is automatically terminated.
> 
> That's an interesting idea.  I don't think it makes sense for that to be
> the default behavior, but if someone wanted to add an additional flag
> to implement that behavior, that seems fine.  A FreeBSD-compatible
> pdfork could then use that flag when not passed PD_DAEMON and not use it
> when passed PD_DAEMON.
> 
> How does it kill the process when the last open descriptor closes?
> SIGKILL?  SIGTERM?  The former seems unfriendly (preventing graceful
> termination), and the latter blockable.  There's a reason init systems
> send TERM, then wait, then KILL.

I was wondering if it shouldn't be a SIGHUP, since we're talking about a file
descriptor closing. We could make it configurable too, but I'd rather not use
the current CSIGNAL field. Better to move it to the arguments structure, so
that anyone still passing SIGCHLD in the flags gets EINVAL instead of
silently having SIGCHLD sent to the child process to ask it to terminate.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center


* Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor
  2015-03-13 19:42   ` Josh Triplett
  2015-03-13 21:16     ` Thiago Macieira
@ 2015-03-13 21:33     ` Andy Lutomirski
  2015-03-13 21:45         ` josh-iaAMLnmF4UmaiuxdJuQwMA
  2015-03-15  8:55       ` David Drysdale
  2 siblings, 1 reply; 83+ messages in thread
From: Andy Lutomirski @ 2015-03-13 21:33 UTC (permalink / raw)
  To: Josh Triplett
  Cc: David Drysdale, Al Viro, Andrew Morton, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Thiago Macieira, Michael Kerrisk, linux-kernel,
	Linux API, Linux FS Devel, X86 ML

On Fri, Mar 13, 2015 at 12:42 PM, Josh Triplett <josh@joshtriplett.org> wrote:
> On Fri, Mar 13, 2015 at 04:05:29PM +0000, David Drysdale wrote:
>> On Fri, Mar 13, 2015 at 1:40 AM, Josh Triplett <josh@joshtriplett.org> wrote:
>> > This patch series introduces a new clone flag, CLONE_FD, which lets the caller
>> > handle child process exit notification via a file descriptor rather than
>> > SIGCHLD.  CLONE_FD makes it possible for libraries to safely launch and manage
>> > child processes on behalf of their caller, *without* taking over process-wide
>> > SIGCHLD handling (either via signal handler or signalfd).
>>
>> Hi Josh,
>>
>> From the overall description (i.e. I haven't looked at the code yet)
>> this looks very interesting.  However, it seems to cover a lot of the
>> same ground as the process descriptor feature that was added to FreeBSD
>> in 9.x/10.x:
>>   https://www.freebsd.org/cgi/man.cgi?query=pdfork&sektion=2
>
> Interesting.
>
>> I think it would ideally be nice for a userspace library developer to be
>> able to do subprocess management (without SIGCHLD) in a similar way
>> across both platforms, without lots of complicated autoconf shenanigans.
>>
>> So could we look at the overlap and see if we can come up with
>> something that covers your requirements and also allows for something
>> that looks like FreeBSD's process descriptors?
>
> Agreed; however, I think it's reasonable to provide appropriate Linux
> system calls, and then let glibc or libbsd or similar provide the
> BSD-compatible calls on top of those.  I don't think the kernel
> interface needs to exactly match FreeBSD's, as long as it's a superset
> of the functionality.

We need to be careful with things like read(2), though.  It's hard to
write a glibc function that makes read(2) do something other than what
the kernel thinks.  Similarly, poll(2) is defined by the kernel.  It
would be really nice to be consistent here.

--Andy

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
  2015-03-13 19:57     ` josh
@ 2015-03-13 21:34         ` Andy Lutomirski
  2015-03-14 14:14         ` Oleg Nesterov
  1 sibling, 0 replies; 83+ messages in thread
From: Andy Lutomirski @ 2015-03-13 21:34 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Oleg Nesterov, Al Viro, Andrew Morton, Ingo Molnar, Kees Cook,
	Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
	Thiago Macieira, Michael Kerrisk, linux-kernel, Linux API,
	Linux FS Devel, X86 ML

On Fri, Mar 13, 2015 at 12:57 PM,  <josh@joshtriplett.org> wrote:
> On Fri, Mar 13, 2015 at 05:21:13PM +0100, Oleg Nesterov wrote:
>> Josh,
>>
>> I'll certainly try to read this series, but not before next week.
>
> Thanks for looking at it.
>
>> but a couple of nits right now.
>>
>> On 03/12, Josh Triplett wrote:
>> >
>> > When passed CLONE_FD, clone4 will return a file descriptor rather than a
>> > PID.  When the child process exits, it gets automatically reaped,
>>
>> And even I have no idea what you are actually doing, this doesn't look
>> right, see below.
>>
>> > +static unsigned int clonefd_poll(struct file *file, poll_table *wait)
>> > +{
>> > +   struct task_struct *p = file->private_data;
>> > +   poll_wait(file, &p->clonefd_wqh, wait);
>> > +   return p->exit_state == EXIT_DEAD ? (POLLIN | POLLRDNORM) : 0;
>> > +}
>> > +
>> > +static ssize_t clonefd_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
>> > +{
>> > +   struct task_struct *p = file->private_data;
>> > +   int ret = 0;
>> > +
>> > +   /* EOF after first read */
>> > +   if (*ppos)
>> > +           return 0;
>> > +
>> > +   if (file->f_flags & O_NONBLOCK)
>> > +           ret = -EAGAIN;
>> > +   else
>> > +           ret = wait_event_interruptible(p->clonefd_wqh, p->exit_state == EXIT_DEAD);
>> > +
>> > +   if (p->exit_state == EXIT_DEAD) {
>>
>> Again, I simply do not know what this code does at all. But I bet the usage
>> of EXIT_DEAD is wrong ;)
>>
>> OK, OK, I can be wrong. But I simply do not see what protects this task_struct
>> if it is EXIT_DEAD (in fact even if it is EXIT_ZOMBIE).
>
> If by "what protects" you mean "what keeps it alive", the file
> descriptor holds a reference to the task_struct by calling
> get_task_struct when created and put_task_struct when released.
>
> This wait_event_interruptible pairs with the wake_up_all called from
> clonefd_do_notify, which exit_notify calls *after* setting the task to
> TASK_DEAD.
>
> Apart from that, what about what the code is doing isn't clear?
>
>> > @@ -598,7 +600,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
>> >     if (group_dead)
>> >             kill_orphaned_pgrp(tsk->group_leader, NULL);
>> >
>> > -   if (unlikely(tsk->ptrace)) {
>> > +   if (tsk->autoreap) {
>> > +           autoreap = true;
>>
>> Debuggers won't be happy. A ptraced task should not autoreap itself.
>
> A process launching a new process with CLONE_FD is explicitly requesting
> that the process be automatically reaped without any other process
> having to wait on it.  The task needs to not become a zombie, because
> otherwise, it'll show up in waitpid(-1, ...) calls in the parent
> process, which would break the ability to use this to completely
> encapsulate process management within a library and not interfere with
> the parent's process handling via SIGCHLD and wait{pid,3,4}.

Wouldn't the correct behavior be to keep it alive as a zombie but
*not* show it in waitpid, etc?

--Andy

* Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor
  2015-03-13 21:16     ` Thiago Macieira
@ 2015-03-13 21:44       ` josh
  0 siblings, 0 replies; 83+ messages in thread
From: josh @ 2015-03-13 21:44 UTC (permalink / raw)
  To: Thiago Macieira
  Cc: David Drysdale, Al Viro, Andrew Morton, Andy Lutomirski,
	Ingo Molnar, Kees Cook, Oleg Nesterov, Paul E. McKenney,
	H. Peter Anvin, Rik van Riel, Thomas Gleixner, Michael Kerrisk,
	linux-kernel, Linux API, linux-fsdevel, X86 ML

On Fri, Mar 13, 2015 at 02:16:07PM -0700, Thiago Macieira wrote:
> On Friday 13 March 2015 12:42:52 Josh Triplett wrote:
> > > Hi Josh,
> > > 
> > > From the overall description (i.e. I haven't looked at the code yet)
> > > this looks very interesting.  However, it seems to cover a lot of the
> > > same ground as the process descriptor feature that was added to FreeBSD
> > > in 9.x/10.x:
> > >   https://www.freebsd.org/cgi/man.cgi?query=pdfork&sektion=2
> > 
> > Interesting.
> 
> I wasn't aware of the FreeBSD implementation of pdfork(). It is actually 
> exactly what I need in userspace.

Right; libqt should be able to use pdfork on FreeBSD and CLONE_FD on
Linux.

> The only difference between pdfork() and
> my proposed forkfd() is where the PID and where the file descriptor are 
> returned (meaning, which is optional and which isn't).
> 
> Josh and I opted to return the file descriptor in the regular return value in 
> forkfd and in clone4 because getting the file descriptor is the whole objective of
> using the forkfd or clone4-with-CLONE_FD in the first place: the file descriptor 
> is not optional, but the PID is.

And as long as you can get the fd, where it's returned really doesn't
matter.

> > Agreed; however, I think it's reasonable to provide appropriate Linux
> > system calls, and then let glibc or libbsd or similar provide the
> > BSD-compatible calls on top of those.  I don't think the kernel
> > interface needs to exactly match FreeBSD's, as long as it's a superset
> > of the functionality.
> > 
> > For example, pdfork can just call clone4 with CLONE_FD and return the
> > resulting file descriptor.
> 
> Agreed, we should recommend libc implement pdfork(), pdkill() and pdwait4().
> 
> I'm not too attached to the forkfd() interface, but I find it slightly superior 
> for the reasons above.

Agreed.

> If we want the PD_DAEMON flag, it will have to translate to a clone flag, like 
> CLONEFD_DAEMON or inverted like CLONEFD_KILL_ON_CLOSE.

I think the inverted version makes more sense, so that the default
behavior just changes exit notification without adding the kill-on-close
behavior.  And that kill-on-close behavior can come in a later patch. :)
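
As a rough illustration of how a libc-level pdfork() might translate its
flags under the inverted scheme, here is a minimal sketch.  Both constants
are assumptions for illustration: PD_DAEMON borrows the FreeBSD flag name,
CLONEFD_KILL_ON_CLOSE is the hypothetical flag named above, and neither
numeric value is part of any real Linux ABI.

```c
/* Illustrative constants only; the values are assumptions, not an ABI. */
#define PD_DAEMON              0x1u  /* FreeBSD-style: do NOT kill on close */
#define CLONEFD_KILL_ON_CLOSE  0x1u  /* hypothetical inverted Linux flag */

/* Translate pdfork()-style flags into hypothetical CLONE_FD flags. */
static unsigned int pdfork_flags_to_clonefd(unsigned int pd_flags)
{
	unsigned int clonefd_flags = 0;

	/*
	 * pdfork() kills the child on last close unless PD_DAEMON is set,
	 * so the inverted flag is requested only when PD_DAEMON is absent.
	 */
	if (!(pd_flags & PD_DAEMON))
		clonefd_flags |= CLONEFD_KILL_ON_CLOSE;

	return clonefd_flags;
}
```

With this shape, a plain CLONE_FD caller passes no extra flag and only gets
the changed exit notification, while a pdfork() wrapper opts in to
kill-on-close explicitly.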

> > In the future, I plan to add an fd-based equivalent of
> > rt_{,tg}sigqueueinfo (likely a single syscall with a flag to determine
> > whether to kill a process or thread) which is a superset of pdkill.
> > pdkill could then call that and just not pass the extra info.
> > 
> > A fair bit of pdwait4 could be implemented on top of read(), other than
> > the full rusage information (see below), and the ability to wait for
> > STOP/CONT (which the CLONE_FD file descriptor could support if desired,
> > but it'd have to be set via a flag at clone time).
> > 
> > I think it's a feature to use read() rather than an additional magic
> > system call.
> 
> Indeed, even if the libc provides a wrapper for you, like glibc does for 
> eventfd (eventfd_read, eventfd_write).
> 
> Josh and I didn't want to submit "killfd" (or pdkill in the FreeBSD name) in 
> the initial patch set, but it was part of the plans.
> 
> > > > >               clone4() will never return a file descriptor in the range 0-2 to
> > > > >               the caller, to avoid ambiguity with the return of 0 in the child
> > > > >               process.  Only the calling process will have the new file
> > > > >               descriptor open; the child process will not.
> > > 
> > > FreeBSD's pdfork(2) returns a PID but also takes an int *fdp argument to
> > > return the file descriptor separately, which avoids the need for special
> > > case processing for low FD values (and means that POSIX's "lowest file
> > > descriptor not currently open" behaviour can be preserved if desired).
> > 
> > That'd be easy to implement if desired, by adding an outbound pointer to
> > clone4_args.
> >
> > The (very mild) reason I'd dropped the PID: with CLONE_FD and future
> > syscalls that use the fd as an identifier, PIDs can hopefully become
> > mostly unnecessary.  However, I'm not that attached to changing the
> > return value; it'd be trivial to switch to an outbound parameter
> > instead, and then drop the "not 0-2".
> 
> See above for more motivation on making the PID optional.
> 
> As for the file descriptor range, if we need to be able to return 0, we can 
> implement a magic constant to mean the child process, like the userspace 
> forkfd() does (FFD_CHILD_PROCESS). We'd probably choose the value -4096 on 
> Linux, since that is neither a valid file descriptor nor a valid errno value.

I don't think that logic is worth implementing, though, since it would
require changing all the architecture-specific copy_thread
implementations.  If we really want to go this path, we should just
return the fd via an out parameter in the clone4_args structure.
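
For reference, the return-value encoding Thiago describes could be decoded
roughly as follows.  This is only a sketch of the idea, not something the
series implements: FFD_CHILD_PROCESS_SKETCH and the enum names are
hypothetical, and the only fact relied on is that the kernel reserves
-4095..-1 (MAX_ERRNO in include/linux/err.h) for error returns.

```c
/* Errors occupy [-MAX_ERRNO, -1]; -4096 falls just outside that range. */
#define MAX_ERRNO                 4095L
#define FFD_CHILD_PROCESS_SKETCH  (-4096L)  /* hypothetical magic value */

enum clone_ret_kind { RET_FD, RET_CHILD, RET_ERRNO, RET_INVALID };

/* Classify a raw return value under the hypothetical magic-constant scheme. */
static enum clone_ret_kind classify_clone_ret(long ret)
{
	if (ret >= 0)
		return RET_FD;        /* parent: any fd, including 0-2 */
	if (ret == FFD_CHILD_PROCESS_SKETCH)
		return RET_CHILD;     /* child: neither an fd nor an errno */
	if (ret >= -MAX_ERRNO)
		return RET_ERRNO;     /* ordinary -errno failure */
	return RET_INVALID;
}
```

The out-parameter approach avoids this decoding entirely, since the child
keeps seeing 0 in the return register as it does today.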

> > > [FreeBSD theoretically has pdwait4(2) to do wait4-like operations on a
> > > process descriptor, including rusage retrieval.  However, I don't think
> > > they actually implemented it:
> > >   http://fxr.watson.org/fxr/source/kern/syscalls.master#L928]
> > 
> > That's a pretty good argument that we don't need to either, at least not
> > yet.
> 
> pdwait4() can be implemented on top of read(), with the WNOHANG flag just 
> toggling the O_NONBLOCK bit. The problem is with the rest of the flags. We 
> could implement it via more ioctls to be done prior to read() if we don't want 
> to add a syscall...
> 
> Another alternative is to add a P_PD flag that can be passed as the first 
> argument to waitid(), making the second argument a file descriptor instead of a 
> PID or pgrp.

Or a flag that can be added to the options argument of wait4 to indicate
that the first argument is really a file descriptor.
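
The WNOHANG half of a read()-based pdwait4() is simple enough to sketch in
userspace.  The helper name apply_wait_options() is hypothetical; it only
uses fcntl(2) with F_GETFL/F_SETFL, and assumes the caller will follow it
with a read() on the same descriptor.

```c
#include <fcntl.h>
#include <sys/wait.h>

/*
 * Hypothetical helper: make a later read() on fd behave like a wait with
 * or without WNOHANG by setting or clearing O_NONBLOCK.
 * Returns 0 on success, -1 on error (errno set by fcntl).
 */
static int apply_wait_options(int fd, int options)
{
	int fl = fcntl(fd, F_GETFL);

	if (fl == -1)
		return -1;

	if (options & WNOHANG)
		fl |= O_NONBLOCK;
	else
		fl &= ~O_NONBLOCK;

	return fcntl(fd, F_SETFL, fl) == -1 ? -1 : 0;
}
```

One caveat with this approach: O_NONBLOCK lives on the open file
description, so toggling it affects every thread and every duplicate of the
descriptor, which is part of why a flag on wait4 or waitid may be cleaner.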

> > > FreeBSD also implements fstat(2) for its process descriptors, although
> > > only a few of the fields get filled in.
> > 
> > I looked at what they provide, and that seems like more of a novelty
> > than something particularly useful (since most of the stat fields aren't
> > meaningful), but if that's useful for compatibility then adding it seems
> > fine.
> 
> I don't think we need to do anything: anon_inode will do it for us.
> 
> If I stat an eventfd:
> 
> 	stat("/proc/107751/fd/4", {st_dev=makedev(0, 9), st_ino=3943, 
> st_mode=0600, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, 
> st_size=0, st_atime=2015/03/07-16:40:28, st_mtime=2015/03/12-16:12:00, 
> st_ctime=2015/03/12-16:12:00}) = 0
> 
> And just out of curiosity, in the following order: epoll, signalfd, timerfd 
> and inotify:
> 
> 	stat("/proc/1462/fd/4", {st_dev=makedev(0, 9), st_ino=3943, st_mode=0600, 
> st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, 
> st_atime=2015/03/07-16:40:28, st_mtime=2015/03/12-16:12:00, 
> st_ctime=2015/03/12-16:12:00}) = 0
> 	stat("/proc/1462/fd/5", {st_dev=makedev(0, 9), st_ino=3943, st_mode=0600, 
> st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, 
> st_atime=2015/03/07-16:40:28, st_mtime=2015/03/12-16:12:00, 
> st_ctime=2015/03/12-16:12:00}) = 0
> 	stat("/proc/1462/fd/7", {st_dev=makedev(0, 9), st_ino=3943, st_mode=0600, 
> st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, 
> st_atime=2015/03/07-16:40:28, st_mtime=2015/03/12-16:12:00, 
> st_ctime=2015/03/12-16:12:00}) = 0
> 	stat("/proc/1462/fd/8", {st_dev=makedev(0, 9), st_ino=3943, st_mode=0600, 
> st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, 
> st_atime=2015/03/07-16:40:28, st_mtime=2015/03/12-16:12:00, 
> st_ctime=2015/03/12-16:12:00}) = 0
> 
> (that process is systemd --user)

Interesting.  What does stat on a CLONE_FD file descriptor return?
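
A quick way to check is fstat(2) on the descriptor itself rather than going
through /proc.  The sketch below uses an eventfd as a stand-in, since
CLONE_FD only exists in this series; the function name is hypothetical, and
the printed values will depend on the kernel.

```c
#include <stdio.h>
#include <sys/eventfd.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Print a few stat fields of an anonymous-inode descriptor.
 * Returns 0 on success, -1 if eventfd() or fstat() fails.
 */
static int show_anon_fd_stat(void)
{
	struct stat st;
	int fd = eventfd(0, 0);
	int ret;

	if (fd == -1)
		return -1;

	ret = fstat(fd, &st);
	if (ret == 0)
		printf("st_mode=%o st_nlink=%lu st_size=%lld\n",
		       (unsigned int)st.st_mode,
		       (unsigned long)st.st_nlink,
		       (long long)st.st_size);

	close(fd);
	return ret;
}
```

Swapping the eventfd() call for the descriptor returned by clone4 with
CLONE_FD would answer the question directly on a patched kernel.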

> > > >               poll(2), select(2), epoll(7) (and similar)
> > > >               
> > > > >                      The file descriptor is readable (the select(2) readfds
> > > > >                      argument; the poll(2) POLLIN flag) if the new process has
> > > > >                      exited.
> > > 
> > > FreeBSD uses POLLHUP here.
> > 
> > That makes sense given that they provide the information via a separate
> > call rather than read.  Since the CLONE_FD file descriptor uses read, it
> > needs to provide POLLIN, but I have no objection to using *both* POLLIN
> > and POLLHUP if that'd be at all useful.
> 
> I think we should provide both, since we're notifying that there are things to 
> be read and that the file descriptor has closed.

"closed" in the "other end of the not-quite-a-pipe" sense, sure.  I'll
add that in a v2.
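
To show what a caller would observe with both bits set, here is a userspace
stand-in that runs on an unpatched kernel.  It is an emulation under stated
assumptions, not the CLONE_FD implementation: a pipe plays the role of the
descriptor, the child writes its exit code before exiting, and the function
name is hypothetical.

```c
#include <errno.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that reports exit_code through a pipe, then return the
 * poll(2) revents seen on the read end once the child is gone.
 * Returns -1 on failure.
 */
static int poll_events_after_child_exit(int exit_code)
{
	struct pollfd pfd;
	int pipefd[2];
	pid_t pid;
	int ret;

	if (pipe(pipefd) == -1)
		return -1;

	pid = fork();
	if (pid == -1) {
		close(pipefd[0]);
		close(pipefd[1]);
		return -1;
	}
	if (pid == 0) {
		/* Child: queue the exit information, then exit, which
		 * closes the last write end of the pipe. */
		close(pipefd[0]);
		if (write(pipefd[1], &exit_code, sizeof(exit_code)) !=
		    (ssize_t)sizeof(exit_code))
			_exit(127);
		_exit(exit_code);
	}

	close(pipefd[1]);	/* parent keeps only the read end */

	/* Block until the write end closes; POLLHUP is reported even
	 * with an empty events mask. */
	pfd.fd = pipefd[0];
	pfd.events = 0;
	do {
		pfd.revents = 0;
		ret = poll(&pfd, 1, -1);
	} while (ret == -1 && errno == EINTR);

	/* Sample the full state now that the child has exited. */
	if (ret != -1) {
		pfd.events = POLLIN;
		pfd.revents = 0;
		ret = poll(&pfd, 1, 0);
	}

	close(pipefd[0]);
	/* The emulation still has to reap; CLONE_FD would autoreap. */
	while (waitpid(pid, NULL, 0) == -1 && errno == EINTR)
		;

	return ret == -1 ? -1 : pfd.revents;
}
```

On Linux the returned mask contains both POLLIN (data to read) and POLLHUP
(no writer left), which matches the "other end of the not-quite-a-pipe"
reading above.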

> > > FreeBSD has two different behaviours for close(2), depending on a flag
> > > value (PD_DAEMON).  With the flag set it's roughly like this, but
> > > without PD_DAEMON a close(2) operation on the (last open) file
> > > descriptor terminates the child process.
> > > 
> > > This can be quite useful, particularly for the use case where some
> > > userspace library has an FD-controlled subprocess -- if the application
> > > using the library terminates, the process descriptor is closed and so
> > > the subprocess is automatically terminated.
> > 
> > That's an interesting idea.  I don't think it makes sense for that to be
> > the default behavior, but if someone wanted to add an additional flag
> > to implement that behavior, that seems fine.  A FreeBSD-compatible
> > pdfork could then use that flag when not passed PD_DAEMON and not use it
> > when passed PD_DAEMON.
> > 
> > How does it kill the process when the last open descriptor closes?
> > SIGKILL?  SIGTERM?  The former seems unfriendly (preventing graceful
> > termination), and the latter blockable.  There's a reason init systems
> > send TERM, then wait, then KILL.
> 
> I was wondering if it shouldn't be a SIGHUP, since we're talking about a file 
> descriptor closing. We could make it configurable too, but I'd rather not use 
> the current CSIGNAL field -- better move to the arguments structure, just in 
> case someone is passing SIGCHLD there, they should get EINVAL instead of 
> silently sending SIGCHLD to the child process to ask it to terminate.

That sounds like several good reasons right there to defer "kill on
close" to a future patch, the author of which should research how
FreeBSD implements this.

- Josh Triplett

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor
@ 2015-03-13 21:45         ` josh-iaAMLnmF4UmaiuxdJuQwMA
  0 siblings, 0 replies; 83+ messages in thread
From: josh @ 2015-03-13 21:45 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Drysdale, Al Viro, Andrew Morton, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Thiago Macieira, Michael Kerrisk, linux-kernel,
	Linux API, Linux FS Devel, X86 ML

On Fri, Mar 13, 2015 at 02:33:44PM -0700, Andy Lutomirski wrote:
> On Fri, Mar 13, 2015 at 12:42 PM, Josh Triplett <josh@joshtriplett.org> wrote:
> > On Fri, Mar 13, 2015 at 04:05:29PM +0000, David Drysdale wrote:
> >> On Fri, Mar 13, 2015 at 1:40 AM, Josh Triplett <josh@joshtriplett.org> wrote:
> >> > This patch series introduces a new clone flag, CLONE_FD, which lets the caller
> >> > handle child process exit notification via a file descriptor rather than
> >> > SIGCHLD.  CLONE_FD makes it possible for libraries to safely launch and manage
> >> > child processes on behalf of their caller, *without* taking over process-wide
> >> > SIGCHLD handling (either via signal handler or signalfd).
> >>
> >> Hi Josh,
> >>
> >> From the overall description (i.e. I haven't looked at the code yet)
> >> this looks very interesting.  However, it seems to cover a lot of the
> >> same ground as the process descriptor feature that was added to FreeBSD
> >> in 9.x/10.x:
> >>   https://www.freebsd.org/cgi/man.cgi?query=pdfork&sektion=2
> >
> > Interesting.
> >
> >> I think it would ideally be nice for a userspace library developer to be
> >> able to do subprocess management (without SIGCHLD) in a similar way
> >> across both platforms, without lots of complicated autoconf shenanigans.
> >>
> >> So could we look at the overlap and see if we can come up with
> >> something that covers your requirements and also allows for something
> >> that looks like FreeBSD's process descriptors?
> >
> > Agreed; however, I think it's reasonable to provide appropriate Linux
> > system calls, and then let glibc or libbsd or similar provide the
> > BSD-compatible calls on top of those.  I don't think the kernel
> > interface needs to exactly match FreeBSD's, as long as it's a superset
> > of the functionality.
> 
> We need to be careful with things like read(2), though.  It's hard to
> write a glibc function that makes read(2) do something other than what
> the kernel thinks.  Similarly, poll(2) is defined by the kernel.  It
> would be really nice to be consistent here.

It doesn't sound like FreeBSD implements read(2) on the pdfork file
descriptor at all.  If it does, yes, we're not going to be able to be
compatible with that.

- Josh Triplett

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor
  2015-03-13 21:45         ` josh-iaAMLnmF4UmaiuxdJuQwMA
@ 2015-03-13 21:51           ` Andy Lutomirski
  -1 siblings, 0 replies; 83+ messages in thread
From: Andy Lutomirski @ 2015-03-13 21:51 UTC (permalink / raw)
  To: Josh Triplett
  Cc: David Drysdale, Al Viro, Andrew Morton, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Thiago Macieira, Michael Kerrisk, linux-kernel,
	Linux API, Linux FS Devel, X86 ML

On Fri, Mar 13, 2015 at 2:45 PM,  <josh@joshtriplett.org> wrote:
> On Fri, Mar 13, 2015 at 02:33:44PM -0700, Andy Lutomirski wrote:
>> On Fri, Mar 13, 2015 at 12:42 PM, Josh Triplett <josh@joshtriplett.org> wrote:
>> > On Fri, Mar 13, 2015 at 04:05:29PM +0000, David Drysdale wrote:
>> >> On Fri, Mar 13, 2015 at 1:40 AM, Josh Triplett <josh@joshtriplett.org> wrote:
>> >> > This patch series introduces a new clone flag, CLONE_FD, which lets the caller
>> >> > handle child process exit notification via a file descriptor rather than
>> >> > SIGCHLD.  CLONE_FD makes it possible for libraries to safely launch and manage
>> >> > child processes on behalf of their caller, *without* taking over process-wide
>> >> > SIGCHLD handling (either via signal handler or signalfd).
>> >>
>> >> Hi Josh,
>> >>
>> >> From the overall description (i.e. I haven't looked at the code yet)
>> >> this looks very interesting.  However, it seems to cover a lot of the
>> >> same ground as the process descriptor feature that was added to FreeBSD
>> >> in 9.x/10.x:
>> >>   https://www.freebsd.org/cgi/man.cgi?query=pdfork&sektion=2
>> >
>> > Interesting.
>> >
>> >> I think it would ideally be nice for a userspace library developer to be
>> >> able to do subprocess management (without SIGCHLD) in a similar way
>> >> across both platforms, without lots of complicated autoconf shenanigans.
>> >>
>> >> So could we look at the overlap and see if we can come up with
>> >> something that covers your requirements and also allows for something
>> >> that looks like FreeBSD's process descriptors?
>> >
>> > Agreed; however, I think it's reasonable to provide appropriate Linux
>> > system calls, and then let glibc or libbsd or similar provide the
>> > BSD-compatible calls on top of those.  I don't think the kernel
>> > interface needs to exactly match FreeBSD's, as long as it's a superset
>> > of the functionality.
>>
>> We need to be careful with things like read(2), though.  It's hard to
>> write a glibc function that makes read(2) do something other than what
>> the kernel thinks.  Similarly, poll(2) is defined by the kernel.  It
>> would be really nice to be consistent here.
>
> It doesn't sound like FreeBSD implements read(2) on the pdfork file
> descriptor at all.  If it does, yes, we're not going to be able to be
> compatible with that.

There's an argument that using read(2) for stuff like this is a bad
idea.  If anyone tried to do this in C++ (or any other OO language):

class GenericInterface
{
public:
  virtual void DoAction(const char *value, size_t len) = 0;
};

class Process : public GenericInterface
{
public:
  virtual void DoAction(const char *value, size_t len) = 0;
};

void Kill(Process *p)
{
  p->DoAction("kill", 4);
}

They'd be re-educated very quickly.  This is like duck typing, but
taken to a whole new level: *everything* is a duck, and ducks have a
grand total of three operations.

On the other hand, this seems to be UNIX tradition.  It's not as if
using echo on pidfds is going to be a common idiom, though.

In any event, we should find out what FreeBSD does in response to
read(2) on the fd.

--Andy

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 2/6] x86: Opt into HAVE_COPY_THREAD_TLS, for both 32-bit and 64-bit
@ 2015-03-13 22:01     ` Andy Lutomirski
  0 siblings, 0 replies; 83+ messages in thread
From: Andy Lutomirski @ 2015-03-13 22:01 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Al Viro, Andrew Morton, Ingo Molnar, Kees Cook, Oleg Nesterov,
	Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
	Thiago Macieira, Michael Kerrisk, linux-kernel, Linux API,
	Linux FS Devel, X86 ML

On Thu, Mar 12, 2015 at 6:40 PM, Josh Triplett <josh@joshtriplett.org> wrote:
> For 32-bit userspace on a 64-bit kernel, this requires modifying
> stub32_clone to actually swap the appropriate arguments to match
> CONFIG_CLONE_BACKWARDS, rather than just leaving the C argument for tls
> broken.
>
> Signed-off-by: Josh Triplett <josh@joshtriplett.org>
> Signed-off-by: Thiago Macieira <thiago.macieira@intel.com>
> ---
>  arch/x86/Kconfig             | 1 +
>  arch/x86/ia32/ia32entry.S    | 2 +-
>  arch/x86/kernel/process_32.c | 6 +++---
>  arch/x86/kernel/process_64.c | 8 ++++----
>  4 files changed, 9 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index b7d31ca..4960b0d 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -124,6 +124,7 @@ config X86
>         select MODULES_USE_ELF_REL if X86_32
>         select MODULES_USE_ELF_RELA if X86_64
>         select CLONE_BACKWARDS if X86_32
> +       select HAVE_COPY_THREAD_TLS
>         select ARCH_USE_BUILTIN_BSWAP
>         select ARCH_USE_QUEUE_RWLOCK
>         select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION
> diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
> index 156ebca..0286735 100644
> --- a/arch/x86/ia32/ia32entry.S
> +++ b/arch/x86/ia32/ia32entry.S
> @@ -487,7 +487,7 @@ GLOBAL(\label)
>         ALIGN
>  GLOBAL(stub32_clone)
>         leaq sys_clone(%rip),%rax
> -       mov     %r8, %rcx
> +       xchg %r8, %rcx
>         jmp  ia32_ptregs_common

Do I understand correctly that whatever function this is a stub for just
takes its arguments in the wrong order?  If so, can we just fix it
instead of using xchg here?

In general, I much prefer C code to new asm where it makes sense to
make this tradeoff.

Other than that, this is a huge improvement.  You'll have minor
conflicts against -tip, though.
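
For illustration, the swap that the xchg performs could be written in C
along these lines.  This is a sketch, not the in-tree code: all names are
hypothetical, and it relies only on the two clone argument orders in
kernel/fork.c, where CONFIG_CLONE_BACKWARDS puts tls before child_tidptr
and the default order puts child_tidptr before tls.

```c
/* Records the arguments as the native (64-bit) entry point sees them. */
struct clone_args_sketch {
	unsigned long flags;
	unsigned long newsp;
	unsigned long parent_tidptr;
	unsigned long child_tidptr;
	unsigned long tls;
};

/* Stand-in for the native entry point: (..., child_tidptr, tls). */
static struct clone_args_sketch
native_clone_sketch(unsigned long flags, unsigned long newsp,
		    unsigned long parent_tidptr,
		    unsigned long child_tidptr, unsigned long tls)
{
	struct clone_args_sketch a = {
		flags, newsp, parent_tidptr, child_tidptr, tls
	};

	return a;
}

/*
 * Compat wrapper: takes the CLONE_BACKWARDS order (..., tls, child_tidptr)
 * used by 32-bit x86 and swaps the last two arguments, which is the effect
 * of exchanging %r8 and %rcx in the stub.
 */
static struct clone_args_sketch
compat_clone_sketch(unsigned long flags, unsigned long newsp,
		    unsigned long parent_tidptr,
		    unsigned long tls, unsigned long child_tidptr)
{
	return native_clone_sketch(flags, newsp, parent_tidptr,
				   child_tidptr, tls);
}
```

A wrapper of this shape keeps the argument shuffle visible to the compiler
and reviewers, at the cost of one extra call compared to the asm stub.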

--Andy

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
  2015-03-13 21:34         ` Andy Lutomirski
  (?)
@ 2015-03-13 22:20         ` josh
  2015-03-13 22:28             ` Andy Lutomirski
  -1 siblings, 1 reply; 83+ messages in thread
From: josh @ 2015-03-13 22:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Oleg Nesterov, Al Viro, Andrew Morton, Ingo Molnar, Kees Cook,
	Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
	Thiago Macieira, Michael Kerrisk, linux-kernel, Linux API,
	Linux FS Devel, X86 ML

On Fri, Mar 13, 2015 at 02:34:58PM -0700, Andy Lutomirski wrote:
> On Fri, Mar 13, 2015 at 12:57 PM,  <josh@joshtriplett.org> wrote:
> > A process launching a new process with CLONE_FD is explicitly requesting
> > that the process be automatically reaped without any other process
> > having to wait on it.  The task needs to not become a zombie, because
> > otherwise, it'll show up in waitpid(-1, ...) calls in the parent
> > process, which would break the ability to use this to completely
> > encapsulate process management within a library and not interfere with
> > the parent's process handling via SIGCHLD and wait{pid,3,4}.
> 
> Wouldn't the correct behavior be to keep it alive as a zombie but
> *not* show it in waitpid, etc?

That's a significant change to the semantics of waitpid.  And then
someone would still need to wait on the process, which we'd like to
avoid.  (We don't want to have magic "reap on read(2)" semantics,
because among other things, what if we add a means in the future to get
an additional file descriptor corresponding to an existing process?)

- Josh Triplett

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
  2015-03-13 22:20         ` josh
@ 2015-03-13 22:28             ` Andy Lutomirski
  0 siblings, 0 replies; 83+ messages in thread
From: Andy Lutomirski @ 2015-03-13 22:28 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Oleg Nesterov, Al Viro, Andrew Morton, Ingo Molnar, Kees Cook,
	Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
	Thiago Macieira, Michael Kerrisk, linux-kernel, Linux API,
	Linux FS Devel, X86 ML

On Fri, Mar 13, 2015 at 3:20 PM,  <josh@joshtriplett.org> wrote:
> On Fri, Mar 13, 2015 at 02:34:58PM -0700, Andy Lutomirski wrote:
>> On Fri, Mar 13, 2015 at 12:57 PM,  <josh@joshtriplett.org> wrote:
>> > A process launching a new process with CLONE_FD is explicitly requesting
>> > that the process be automatically reaped without any other process
>> > having to wait on it.  The task needs to not become a zombie, because
>> > otherwise, it'll show up in waitpid(-1, ...) calls in the parent
>> > process, which would break the ability to use this to completely
>> > encapsulate process management within a library and not interfere with
>> > the parent's process handling via SIGCHLD and wait{pid,3,4}.
>>
>> Wouldn't the correct behavior be to keep it alive as a zombie but
>> *not* show it in waitpid, etc?
>
> That's a significant change to the semantics of waitpid.  And then
> someone would still need to wait on the process, which we'd like to
> avoid.  (We don't want to have magic "reap on read(2)" semantics,
> because among other things, what if we add a means in the future to get
> an additional file descriptor corresponding to an existing process?)
>

Do we not already have a state "dead, successfully waited on by
parent, but still around because ptraced"?  If not, shouldn't we?
Isn't that what PTRACE_SEIZE does?  Or am I just confused?

--Andy

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 2/6] x86: Opt into HAVE_COPY_THREAD_TLS, for both 32-bit and 64-bit
  2015-03-13 22:01     ` Andy Lutomirski
  (?)
@ 2015-03-13 22:31     ` josh
  2015-03-13 22:38       ` Andy Lutomirski
  -1 siblings, 1 reply; 83+ messages in thread
From: josh @ 2015-03-13 22:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Al Viro, Andrew Morton, Ingo Molnar, Kees Cook, Oleg Nesterov,
	Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
	Thiago Macieira, Michael Kerrisk, linux-kernel, Linux API,
	Linux FS Devel, X86 ML

On Fri, Mar 13, 2015 at 03:01:16PM -0700, Andy Lutomirski wrote:
> On Thu, Mar 12, 2015 at 6:40 PM, Josh Triplett <josh@joshtriplett.org> wrote:
> > For 32-bit userspace on a 64-bit kernel, this requires modifying
> > stub32_clone to actually swap the appropriate arguments to match
> > CONFIG_CLONE_BACKWARDS, rather than just leaving the C argument for tls
> > broken.
> >
> > Signed-off-by: Josh Triplett <josh@joshtriplett.org>
> > Signed-off-by: Thiago Macieira <thiago.macieira@intel.com>
> > ---
> >  arch/x86/Kconfig             | 1 +
> >  arch/x86/ia32/ia32entry.S    | 2 +-
> >  arch/x86/kernel/process_32.c | 6 +++---
> >  arch/x86/kernel/process_64.c | 8 ++++----
> >  4 files changed, 9 insertions(+), 8 deletions(-)
> >
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index b7d31ca..4960b0d 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -124,6 +124,7 @@ config X86
> >         select MODULES_USE_ELF_REL if X86_32
> >         select MODULES_USE_ELF_RELA if X86_64
> >         select CLONE_BACKWARDS if X86_32
> > +       select HAVE_COPY_THREAD_TLS
> >         select ARCH_USE_BUILTIN_BSWAP
> >         select ARCH_USE_QUEUE_RWLOCK
> >         select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION
> > diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
> > index 156ebca..0286735 100644
> > --- a/arch/x86/ia32/ia32entry.S
> > +++ b/arch/x86/ia32/ia32entry.S
> > @@ -487,7 +487,7 @@ GLOBAL(\label)
> >         ALIGN
> >  GLOBAL(stub32_clone)
> >         leaq sys_clone(%rip),%rax
> > -       mov     %r8, %rcx
> > +       xchg %r8, %rcx
> >         jmp  ia32_ptregs_common
> 
> Do I understand correctly that whatever function this is a stub for just
> takes its arguments in the wrong order?  If so, can we just fix it
> instead of using xchg here?

32-bit x86 and 64-bit x86 take the arguments to clone in a different
order, and stub32_clone fixes up the argument order then calls the
64-bit sys_clone.

I'd love to see *all* the 32-on-64 compat stubs for clone rewritten in C
under CONFIG_COMPAT.  However, doing so would require encoding the
knowledge for each 64-bit architecture for how its corresponding 32-bit
architecture accepts arguments to clone, which is information that the
current CONFIG_CLONE_BACKWARDS{1,2,3} don't include; it would then
require cleaning up all the architecture-specific assembly stubs for
32-bit clone entry points.

In the meantime, doing that *just* for 32-bit x86 on 64-bit x86 doesn't
seem worth it, since it would require adding a new C entry point for
compat_sys_clone under arch/x86 somewhere.

One cleanup at a time. :)

> In general, I much prefer C code to new asm where it makes sense to
> make this tradeoff.

Agreed completely.  However, this is at least conservation-of-asm, or
reduction if you consider the pt_regs argument-grabbing hack to be
asm-esque code.

> Other than that, this is a huge improvement.  You'll have minor
> conflicts against -tip, though.

Right, I've seen your current changes there.  Should be a trivial merge
though.

Would you mind providing an ack for the series, or at least for the
first two patches?

(I'm wondering whose tree this series ought to go through, for that
matter.)

- Josh Triplett

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
@ 2015-03-13 22:34               ` josh-iaAMLnmF4UmaiuxdJuQwMA
  0 siblings, 0 replies; 83+ messages in thread
From: josh @ 2015-03-13 22:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Oleg Nesterov, Al Viro, Andrew Morton, Ingo Molnar, Kees Cook,
	Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
	Thiago Macieira, Michael Kerrisk, linux-kernel, Linux API,
	Linux FS Devel, X86 ML

On Fri, Mar 13, 2015 at 03:28:26PM -0700, Andy Lutomirski wrote:
> On Fri, Mar 13, 2015 at 3:20 PM,  <josh@joshtriplett.org> wrote:
> > On Fri, Mar 13, 2015 at 02:34:58PM -0700, Andy Lutomirski wrote:
> >> On Fri, Mar 13, 2015 at 12:57 PM,  <josh@joshtriplett.org> wrote:
> >> > A process launching a new process with CLONE_FD is explicitly requesting
> >> > that the process be automatically reaped without any other process
> >> > having to wait on it.  The task needs to not become a zombie, because
> >> > otherwise, it'll show up in waitpid(-1, ...) calls in the parent
> >> > process, which would break the ability to use this to completely
> >> > encapsulate process management within a library and not interfere with
> >> > the parent's process handling via SIGCHLD and wait{pid,3,4}.
> >>
> >> Wouldn't the correct behavior be to keep it alive as a zombie but
> >> *not* show it in waitpid, etc?
> >
> > That's a significant change to the semantics of waitpid.  And then
> > someone would still need to wait on the process, which we'd like to
> > avoid.  (We don't want to have magic "reap on read(2)" semantics,
> > because among other things, what if we add a means in the future to get
> > an additional file descriptor corresponding to an existing process?)
> 
> Do we not already have a state "dead, successfully waited on by
> parent, but still around because ptraced"?  If not, shouldn't we?
> Isn't that what PTRACE_SEIZE does?  Or am I just confused?

I don't think that affects the task's exit_state though.

- Josh Triplett

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 2/6] x86: Opt into HAVE_COPY_THREAD_TLS, for both 32-bit and 64-bit
  2015-03-13 22:31     ` josh
@ 2015-03-13 22:38       ` Andy Lutomirski
  2015-03-13 22:43           ` josh-iaAMLnmF4UmaiuxdJuQwMA
  0 siblings, 1 reply; 83+ messages in thread
From: Andy Lutomirski @ 2015-03-13 22:38 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Al Viro, Andrew Morton, Ingo Molnar, Kees Cook, Oleg Nesterov,
	Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
	Thiago Macieira, Michael Kerrisk, linux-kernel, Linux API,
	Linux FS Devel, X86 ML

On Fri, Mar 13, 2015 at 3:31 PM,  <josh@joshtriplett.org> wrote:
> On Fri, Mar 13, 2015 at 03:01:16PM -0700, Andy Lutomirski wrote:
>> On Thu, Mar 12, 2015 at 6:40 PM, Josh Triplett <josh@joshtriplett.org> wrote:
>> > For 32-bit userspace on a 64-bit kernel, this requires modifying
>> > stub32_clone to actually swap the appropriate arguments to match
>> > CONFIG_CLONE_BACKWARDS, rather than just leaving the C argument for tls
>> > broken.
>> >
>> > Signed-off-by: Josh Triplett <josh@joshtriplett.org>
>> > Signed-off-by: Thiago Macieira <thiago.macieira@intel.com>
>> > ---
>> >  arch/x86/Kconfig             | 1 +
>> >  arch/x86/ia32/ia32entry.S    | 2 +-
>> >  arch/x86/kernel/process_32.c | 6 +++---
>> >  arch/x86/kernel/process_64.c | 8 ++++----
>> >  4 files changed, 9 insertions(+), 8 deletions(-)
>> >
>> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> > index b7d31ca..4960b0d 100644
>> > --- a/arch/x86/Kconfig
>> > +++ b/arch/x86/Kconfig
>> > @@ -124,6 +124,7 @@ config X86
>> >         select MODULES_USE_ELF_REL if X86_32
>> >         select MODULES_USE_ELF_RELA if X86_64
>> >         select CLONE_BACKWARDS if X86_32
>> > +       select HAVE_COPY_THREAD_TLS
>> >         select ARCH_USE_BUILTIN_BSWAP
>> >         select ARCH_USE_QUEUE_RWLOCK
>> >         select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION
>> > diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
>> > index 156ebca..0286735 100644
>> > --- a/arch/x86/ia32/ia32entry.S
>> > +++ b/arch/x86/ia32/ia32entry.S
>> > @@ -487,7 +487,7 @@ GLOBAL(\label)
>> >         ALIGN
>> >  GLOBAL(stub32_clone)
>> >         leaq sys_clone(%rip),%rax
>> > -       mov     %r8, %rcx
>> > +       xchg %r8, %rcx
>> >         jmp  ia32_ptregs_common
>>
>> Do I understand correctly that whatever function this is a stub for just
>> takes its arguments in the wrong order?  If so, can we just fix it
>> instead of using xchg here?
>
> 32-bit x86 and 64-bit x86 take the arguments to clone in a different
> order, and stub32_clone fixes up the argument order then calls the
> 64-bit sys_clone.
>
> I'd love to see *all* the 32-on-64 compat stubs for clone rewritten in C
> under CONFIG_COMPAT.  However, doing so would require encoding the
> knowledge for each 64-bit architecture for how its corresponding 32-bit
> architecture accepts arguments to clone, which is information that the
> current CONFIG_CLONE_BACKWARDS{1,2,3} don't include; it would then
> require cleaning up all the architecture-specific assembly stubs for
> 32-bit clone entry points.
>
> In the meantime, doing that *just* for 32-bit x86 on 64-bit x86 doesn't
> seem worth it, since it would require adding a new C entry point for
> compat_sys_clone under arch/x86 somewhere.
>
> One cleanup at a time. :)

Fine w/ me.

>
>> In general, I much prefer C code to new asm where it makes sense to
>> make this tradeoff.
>
> Agreed completely.  However, this is at least conservation-of-asm, or
> reduction if you consider the pt_regs argument-grabbing hack to be
> asm-esque code.
>
>> Other than that, this is a huge improvement.  You'll have minor
>> conflicts against -tip, though.
>
> Right, I've seen your current changes there.  Should be a trivial merge
> though.
>
> Would you mind providing an ack for the series, or at least for the
> first two patches?

I can give you an ok-in-principle on the first two.  I'd need to stare
at the awful code for a bit to understand the @!*&! clone variants to
really ack them convincingly.

OTOH, it would be rather surprising if you messed it up in a way that
still boots on all three variants (native 32-bit, native 64-bit, and
compat).

So, for the first two patches:

Acked-by: Andy Lutomirski <luto@kernel.org> # assuming all bitnesses boot

--Andy

>
> (I'm wondering whose tree this series ought to go through, for that
> matter.)
>
> - Josh Triplett



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
  2015-03-13 22:34               ` josh-iaAMLnmF4UmaiuxdJuQwMA
  (?)
@ 2015-03-13 22:38               ` Andy Lutomirski
  -1 siblings, 0 replies; 83+ messages in thread
From: Andy Lutomirski @ 2015-03-13 22:38 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Oleg Nesterov, Al Viro, Andrew Morton, Ingo Molnar, Kees Cook,
	Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
	Thiago Macieira, Michael Kerrisk, linux-kernel, Linux API,
	Linux FS Devel, X86 ML

On Fri, Mar 13, 2015 at 3:34 PM,  <josh@joshtriplett.org> wrote:
> On Fri, Mar 13, 2015 at 03:28:26PM -0700, Andy Lutomirski wrote:
>> On Fri, Mar 13, 2015 at 3:20 PM,  <josh@joshtriplett.org> wrote:
>> > On Fri, Mar 13, 2015 at 02:34:58PM -0700, Andy Lutomirski wrote:
>> >> On Fri, Mar 13, 2015 at 12:57 PM,  <josh@joshtriplett.org> wrote:
>> >> > A process launching a new process with CLONE_FD is explicitly requesting
>> >> > that the process be automatically reaped without any other process
>> >> > having to wait on it.  The task needs to not become a zombie, because
>> >> > otherwise, it'll show up in waitpid(-1, ...) calls in the parent
>> >> > process, which would break the ability to use this to completely
>> >> > encapsulate process management within a library and not interfere with
>> >> > the parent's process handling via SIGCHLD and wait{pid,3,4}.
>> >>
>> >> Wouldn't the correct behavior be to keep it alive as a zombie but
>> >> *not* show it in waitpid, etc?
>> >
>> > That's a significant change to the semantics of waitpid.  And then
>> > someone would still need to wait on the process, which we'd like to
>> > avoid.  (We don't want to have magic "reap on read(2)" semantics,
>> > because among other things, what if we add a means in the future to get
>> > an additional file descriptor corresponding to an existing process?)
>>
>> Do we not already have a state "dead, successfully waited on by
>> parent, but still around because ptraced"?  If not, shouldn't we?
>> Isn't that what PTRACE_SEIZE does?  Or am I just confused?
>
> I don't think that affects the task's exit_state though.
>

That's a question for Oleg.  I have no idea how ptrace is actually implemented.

--Andy

> - Josh Triplett



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 2/6] x86: Opt into HAVE_COPY_THREAD_TLS, for both 32-bit and 64-bit
@ 2015-03-13 22:43           ` josh-iaAMLnmF4UmaiuxdJuQwMA
  0 siblings, 0 replies; 83+ messages in thread
From: josh @ 2015-03-13 22:43 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Al Viro, Andrew Morton, Ingo Molnar, Kees Cook, Oleg Nesterov,
	Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
	Thiago Macieira, Michael Kerrisk, linux-kernel, Linux API,
	Linux FS Devel, X86 ML

On Fri, Mar 13, 2015 at 03:38:31PM -0700, Andy Lutomirski wrote:
> On Fri, Mar 13, 2015 at 3:31 PM,  <josh@joshtriplett.org> wrote:
> > On Fri, Mar 13, 2015 at 03:01:16PM -0700, Andy Lutomirski wrote:
> >> On Thu, Mar 12, 2015 at 6:40 PM, Josh Triplett <josh@joshtriplett.org> wrote:
> >> > For 32-bit userspace on a 64-bit kernel, this requires modifying
> >> > stub32_clone to actually swap the appropriate arguments to match
> >> > CONFIG_CLONE_BACKWARDS, rather than just leaving the C argument for tls
> >> > broken.
> >> >
> >> > Signed-off-by: Josh Triplett <josh@joshtriplett.org>
> >> > Signed-off-by: Thiago Macieira <thiago.macieira@intel.com>
> >> > ---
> >> >  arch/x86/Kconfig             | 1 +
> >> >  arch/x86/ia32/ia32entry.S    | 2 +-
> >> >  arch/x86/kernel/process_32.c | 6 +++---
> >> >  arch/x86/kernel/process_64.c | 8 ++++----
> >> >  4 files changed, 9 insertions(+), 8 deletions(-)
> >> >
> >> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> >> > index b7d31ca..4960b0d 100644
> >> > --- a/arch/x86/Kconfig
> >> > +++ b/arch/x86/Kconfig
> >> > @@ -124,6 +124,7 @@ config X86
> >> >         select MODULES_USE_ELF_REL if X86_32
> >> >         select MODULES_USE_ELF_RELA if X86_64
> >> >         select CLONE_BACKWARDS if X86_32
> >> > +       select HAVE_COPY_THREAD_TLS
> >> >         select ARCH_USE_BUILTIN_BSWAP
> >> >         select ARCH_USE_QUEUE_RWLOCK
> >> >         select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION
> >> > diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
> >> > index 156ebca..0286735 100644
> >> > --- a/arch/x86/ia32/ia32entry.S
> >> > +++ b/arch/x86/ia32/ia32entry.S
> >> > @@ -487,7 +487,7 @@ GLOBAL(\label)
> >> >         ALIGN
> >> >  GLOBAL(stub32_clone)
> >> >         leaq sys_clone(%rip),%rax
> >> > -       mov     %r8, %rcx
> >> > +       xchg %r8, %rcx
> >> >         jmp  ia32_ptregs_common
> >>
> >> Do I understand correctly that whatever function this is a stub for just
> >> takes its arguments in the wrong order?  If so, can we just fix it
> >> instead of using xchg here?
> >
> > 32-bit x86 and 64-bit x86 take the arguments to clone in a different
> > order, and stub32_clone fixes up the argument order then calls the
> > 64-bit sys_clone.
> >
> > I'd love to see *all* the 32-on-64 compat stubs for clone rewritten in C
> > under CONFIG_COMPAT.  However, doing so would require encoding the
> > knowledge for each 64-bit architecture for how its corresponding 32-bit
> > architecture accepts arguments to clone, which is information that the
> > current CONFIG_CLONE_BACKWARDS{1,2,3} don't include; it would then
> > require cleaning up all the architecture-specific assembly stubs for
> > 32-bit clone entry points.
> >
> > In the meantime, doing that *just* for 32-bit x86 on 64-bit x86 doesn't
> > seem worth it, since it would require adding a new C entry point for
> > compat_sys_clone under arch/x86 somewhere.
> >
> > One cleanup at a time. :)
> 
> Fine w/ me.

Thanks.

> >
> >> In general, I much prefer C code to new asm where it makes sense to
> >> make this tradeoff.
> >
> > Agreed completely.  However, this is at least conservation-of-asm, or
> > reduction if you consider the pt_regs argument-grabbing hack to be
> > asm-esque code.
> >
> >> Other than that, this is a huge improvement.  You'll have minor
> >> conflicts against -tip, though.
> >
> > Right, I've seen your current changes there.  Should be a trivial merge
> > though.
> >
> > Would you mind providing an ack for the series, or at least for the
> > first two patches?
> 
> I can give you an ok-in-principle on the first two.  I'd need to stare
> at the awful code for a bit to understand the @!*&! clone variants to
> really ack them convincingly.

I'd definitely appreciate the staring. :)

> OTOH, it would be rather surprising if you messed it up in a way that
> still boots on all three variants (native 32-bit, native 64-bit, and
> compat).
> 
> So, for the first two patches:
> 
> Acked-by: Andy Lutomirski <luto@kernel.org> # assuming all bitnesses boot

I did test all three, not just with booting but with a thread-local
storage test.

- Josh Triplett

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 2/6] x86: Opt into HAVE_COPY_THREAD_TLS, for both 32-bit and 64-bit
  2015-03-13 22:43           ` josh-iaAMLnmF4UmaiuxdJuQwMA
@ 2015-03-13 22:45             ` Andy Lutomirski
  -1 siblings, 0 replies; 83+ messages in thread
From: Andy Lutomirski @ 2015-03-13 22:45 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Al Viro, Andrew Morton, Ingo Molnar, Kees Cook, Oleg Nesterov,
	Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
	Thiago Macieira, Michael Kerrisk, linux-kernel, Linux API,
	Linux FS Devel, X86 ML

On Fri, Mar 13, 2015 at 3:43 PM,  <josh@joshtriplett.org> wrote:
> On Fri, Mar 13, 2015 at 03:38:31PM -0700, Andy Lutomirski wrote:
>> On Fri, Mar 13, 2015 at 3:31 PM,  <josh@joshtriplett.org> wrote:
>> > On Fri, Mar 13, 2015 at 03:01:16PM -0700, Andy Lutomirski wrote:
>> >> On Thu, Mar 12, 2015 at 6:40 PM, Josh Triplett <josh@joshtriplett.org> wrote:
>> >> > For 32-bit userspace on a 64-bit kernel, this requires modifying
>> >> > stub32_clone to actually swap the appropriate arguments to match
>> >> > CONFIG_CLONE_BACKWARDS, rather than just leaving the C argument for tls
>> >> > broken.
>> >> >
>> >> > Signed-off-by: Josh Triplett <josh@joshtriplett.org>
>> >> > Signed-off-by: Thiago Macieira <thiago.macieira@intel.com>
>> >> > ---
>> >> >  arch/x86/Kconfig             | 1 +
>> >> >  arch/x86/ia32/ia32entry.S    | 2 +-
>> >> >  arch/x86/kernel/process_32.c | 6 +++---
>> >> >  arch/x86/kernel/process_64.c | 8 ++++----
>> >> >  4 files changed, 9 insertions(+), 8 deletions(-)
>> >> >
>> >> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> >> > index b7d31ca..4960b0d 100644
>> >> > --- a/arch/x86/Kconfig
>> >> > +++ b/arch/x86/Kconfig
>> >> > @@ -124,6 +124,7 @@ config X86
>> >> >         select MODULES_USE_ELF_REL if X86_32
>> >> >         select MODULES_USE_ELF_RELA if X86_64
>> >> >         select CLONE_BACKWARDS if X86_32
>> >> > +       select HAVE_COPY_THREAD_TLS
>> >> >         select ARCH_USE_BUILTIN_BSWAP
>> >> >         select ARCH_USE_QUEUE_RWLOCK
>> >> >         select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION
>> >> > diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
>> >> > index 156ebca..0286735 100644
>> >> > --- a/arch/x86/ia32/ia32entry.S
>> >> > +++ b/arch/x86/ia32/ia32entry.S
>> >> > @@ -487,7 +487,7 @@ GLOBAL(\label)
>> >> >         ALIGN
>> >> >  GLOBAL(stub32_clone)
>> >> >         leaq sys_clone(%rip),%rax
>> >> > -       mov     %r8, %rcx
>> >> > +       xchg %r8, %rcx
>> >> >         jmp  ia32_ptregs_common
>> >>
>> >> Do I understand correctly that whatever function this is a stub for just
>> >> takes its arguments in the wrong order?  If so, can we just fix it
>> >> instead of using xchg here?
>> >
>> > 32-bit x86 and 64-bit x86 take the arguments to clone in a different
>> > order, and stub32_clone fixes up the argument order then calls the
>> > 64-bit sys_clone.
>> >
>> > I'd love to see *all* the 32-on-64 compat stubs for clone rewritten in C
>> > under CONFIG_COMPAT.  However, doing so would require encoding the
>> > knowledge for each 64-bit architecture for how its corresponding 32-bit
>> > architecture accepts arguments to clone, which is information that the
>> > current CONFIG_CLONE_BACKWARDS{1,2,3} don't include; it would then
>> > require cleaning up all the architecture-specific assembly stubs for
>> > 32-bit clone entry points.
>> >
>> > In the meantime, doing that *just* for 32-bit x86 on 64-bit x86 doesn't
>> > seem worth it, since it would require adding a new C entry point for
>> > compat_sys_clone under arch/x86 somewhere.
>> >
>> > One cleanup at a time. :)
>>
>> Fine w/ me.
>
> Thanks.
>
>> >
>> >> In general, I much prefer C code to new asm where it makes sense to
>> >> make this tradeoff.
>> >
>> > Agreed completely.  However, this is at least conservation-of-asm, or
>> > reduction if you consider the pt_regs argument-grabbing hack to be
>> > asm-esque code.
>> >
>> >> Other than that, this is a huge improvement.  You'll have minor
>> >> conflicts against -tip, though.
>> >
>> > Right, I've seen your current changes there.  Should be a trivial merge
>> > though.
>> >
>> > Would you mind providing an ack for the series, or at least for the
>> > first two patches?
>>
>> I can give you an ok-in-principle on the first two.  I'd need to stare
>> at the awful code for a bit to understand the @!*&! clone variants to
>> really ack them convincingly.
>
> I'd definitely appreciate the staring. :)
>
>> OTOH, it would be rather surprising if you messed it up in a way that
>> still boots on all three variants (native 32-bit, native 64-bit, and
>> compat).
>>
>> So, for the first two patches:
>>
>> Acked-by: Andy Lutomirski <luto@kernel.org> # assuming all bitnesses boot
>
> I did test all three, not just with booting but with a thread-local
> storage test.

And it's fairly clear that no one ever tested clone-based TLS in 32
bits from a 64-bit ELF binary, because it was broken until very
recently :-/

This stuff is too magical and too poorly documented for my tastes.

--Andy

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 2/6] x86: Opt into HAVE_COPY_THREAD_TLS, for both 32-bit and 64-bit
@ 2015-03-13 23:01               ` josh-iaAMLnmF4UmaiuxdJuQwMA
  0 siblings, 0 replies; 83+ messages in thread
From: josh @ 2015-03-13 23:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Al Viro, Andrew Morton, Ingo Molnar, Kees Cook, Oleg Nesterov,
	Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
	Thiago Macieira, Michael Kerrisk, linux-kernel, Linux API,
	Linux FS Devel, X86 ML

On Fri, Mar 13, 2015 at 03:45:16PM -0700, Andy Lutomirski wrote:
> On Fri, Mar 13, 2015 at 3:43 PM,  <josh@joshtriplett.org> wrote:
> > On Fri, Mar 13, 2015 at 03:38:31PM -0700, Andy Lutomirski wrote:
> >> On Fri, Mar 13, 2015 at 3:31 PM,  <josh@joshtriplett.org> wrote:
> >> > On Fri, Mar 13, 2015 at 03:01:16PM -0700, Andy Lutomirski wrote:
> >> >> On Thu, Mar 12, 2015 at 6:40 PM, Josh Triplett <josh@joshtriplett.org> wrote:
> >> >> > For 32-bit userspace on a 64-bit kernel, this requires modifying
> >> >> > stub32_clone to actually swap the appropriate arguments to match
> >> >> > CONFIG_CLONE_BACKWARDS, rather than just leaving the C argument for tls
> >> >> > broken.
> >> >> >
> >> >> > Signed-off-by: Josh Triplett <josh@joshtriplett.org>
> >> >> > Signed-off-by: Thiago Macieira <thiago.macieira@intel.com>
> >> >> > ---
> >> >> >  arch/x86/Kconfig             | 1 +
> >> >> >  arch/x86/ia32/ia32entry.S    | 2 +-
> >> >> >  arch/x86/kernel/process_32.c | 6 +++---
> >> >> >  arch/x86/kernel/process_64.c | 8 ++++----
> >> >> >  4 files changed, 9 insertions(+), 8 deletions(-)
> >> >> >
> >> >> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> >> >> > index b7d31ca..4960b0d 100644
> >> >> > --- a/arch/x86/Kconfig
> >> >> > +++ b/arch/x86/Kconfig
> >> >> > @@ -124,6 +124,7 @@ config X86
> >> >> >         select MODULES_USE_ELF_REL if X86_32
> >> >> >         select MODULES_USE_ELF_RELA if X86_64
> >> >> >         select CLONE_BACKWARDS if X86_32
> >> >> > +       select HAVE_COPY_THREAD_TLS
> >> >> >         select ARCH_USE_BUILTIN_BSWAP
> >> >> >         select ARCH_USE_QUEUE_RWLOCK
> >> >> >         select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION
> >> >> > diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
> >> >> > index 156ebca..0286735 100644
> >> >> > --- a/arch/x86/ia32/ia32entry.S
> >> >> > +++ b/arch/x86/ia32/ia32entry.S
> >> >> > @@ -487,7 +487,7 @@ GLOBAL(\label)
> >> >> >         ALIGN
> >> >> >  GLOBAL(stub32_clone)
> >> >> >         leaq sys_clone(%rip),%rax
> >> >> > -       mov     %r8, %rcx
> >> >> > +       xchg %r8, %rcx
> >> >> >         jmp  ia32_ptregs_common
> >> >>
> >> >> Do I understand correct that whatever function this is a stub for just
> >> >> takes its arguments in the wrong order?  If so, can we just fix it
> >> >> instead of using xchg here?
> >> >
> >> > 32-bit x86 and 64-bit x86 take the arguments to clone in a different
> >> > order, and stub32_clone fixes up the argument order then calls the
> >> > 64-bit sys_clone.
> >> >
> >> > I'd love to see *all* the 32-on-64 compat stubs for clone rewritten in C
> >> > under CONFIG_COMPAT.  However, doing so would require encoding the
> >> > knowledge for each 64-bit architecture for how its corresponding 32-bit
> >> > architecture accepts arguments to clone, which is information that the
> >> > current CONFIG_CLONE_BACKWARDS{1,2,3} don't include; it would then
> >> > require cleaning up all the architecture-specific assembly stubs for
> >> > 32-bit clone entry points.
> >> >
> >> > In the meantime, doing that *just* for 32-bit x86 on 64-bit x86 doesn't
> >> > seem worth it, since it would require adding a new C entry point for
> >> > compat_sys_clone under arch/x86 somewhere.
> >> >
> >> > One cleanup at a time. :)
> >>
> >> Fine w/ me.
> >
> > Thanks.
> >
> >> >
> >> >> In general, I much prefer C code to new asm where it makes sense to
> >> >> make this tradeoff.
> >> >
> >> > Agreed completely.  However, this is at least conservation-of-asm, or
> >> > reduction if you consider the pt_regs argument-grabbing hack to be
> >> > asm-esque code.
> >> >
> >> >> Other than that, this is a huge improvement.  You'll have minor
> >> >> conflicts against -tip, though.
> >> >
> >> > Right, I've seen your current changes there.  Should be a trivial merge
> >> > though.
> >> >
> >> > Would you mind providing an ack for the series, or at least for the
> >> > first two patches?
> >>
> >> I can give you an ok-in-principle on the first two.  I'd need to stare
> >> at the awful code for a bit to understand the @!*&! clone variants to
> >> really ack them convincingly.
> >
> > I'd definitely appreciate the staring. :)
> >
> >> OTOH, it would be rather surprising if you messed it up in a way that
> >> still boots on all three variants (native 32-bit, native 64-bit, and
> >> compat).
> >>
> >> So, for the first two patches:
> >>
> >> Acked-by: Andy Lutomirski <luto@kernel.org> # assuming all bitnesses boot
> >
> > I did test all three, not just with booting but with a thread-local
> > storage test.
> 
> And it's fairly clear that no one ever tested clone-based TLS in 32
> bits from a 64-bit ELF binary, because it was broken until very
> recently :-/

I'm not sure *anyone* other than exploit-seekers test 32-bit system
calls from a 64-bit binary. :)

> This stuff is too magical and too poorly documented for my tastes.

Agreed.  That was my reaction when I figured out what was happening with
CLONE_SETTLS and pt_regs, and my goal with the first two patches in this
series was precisely to make it *less* magical. :)

- Josh Triplett

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor
@ 2015-03-14  1:11             ` Thiago Macieira
  0 siblings, 0 replies; 83+ messages in thread
From: Thiago Macieira @ 2015-03-14  1:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Josh Triplett, David Drysdale, Al Viro, Andrew Morton,
	Ingo Molnar, Kees Cook, Oleg Nesterov, Paul E. McKenney,
	H. Peter Anvin, Rik van Riel, Thomas Gleixner, Michael Kerrisk,
	linux-kernel, Linux API, Linux FS Devel, X86 ML

On Friday 13 March 2015 14:51:47 Andy Lutomirski wrote:
> In any event, we should find out what FreeBSD does in response to
> read(2) on the fd.

I've just successfully installed FreeBSD and compiled qtbase (main package of 
Qt 5) on it.

I'll test pdfork during the weekend and report its behaviour.
-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
  2015-03-13 19:57     ` josh
@ 2015-03-14 14:14         ` Oleg Nesterov
  2015-03-14 14:14         ` Oleg Nesterov
  1 sibling, 0 replies; 83+ messages in thread
From: Oleg Nesterov @ 2015-03-14 14:14 UTC (permalink / raw)
  To: josh
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
	Thiago Macieira, Michael Kerrisk, linux-kernel, linux-api,
	linux-fsdevel, x86

On 03/13, josh@joshtriplett.org wrote:
>
> On Fri, Mar 13, 2015 at 05:21:13PM +0100, Oleg Nesterov wrote:
> >
> > Again, I simply do not know what this code does at all. But I bet the usage
> > of EXIT_DEAD is wrong ;)
> >
> > OK, OK, I can be wrong. But I simply do not see what protects this task_struct
> > if it is EXIT_DEAD (in fact even if it is EXIT_ZOMBIE).
>
> If by "what protects" you mean "what keeps it alive", the file
> descriptor holds a reference to the task_struct by calling
> get_task_struct when created and put_task_struct when released.

OK, so I was wrong. Although I still have a gut feeling that the usage
of EXIT_DEAD can't be right. Because it was always wrong outside of core
"exit" code ;) Nevermind, I didn't read this series yet, forget.

> > > @@ -598,7 +600,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> > >  	if (group_dead)
> > >  		kill_orphaned_pgrp(tsk->group_leader, NULL);
> > >
> > > -	if (unlikely(tsk->ptrace)) {
> > > +	if (tsk->autoreap) {
> > > +		autoreap = true;
> >
> > Debuggers won't be happy. A ptraced task should not autoreap itself.
>
> A process launching a new process with CLONE_FD is explicitly requesting
> that the process be automatically reaped without any other process
> having to wait on it.  The task needs to not become a zombie, because
> otherwise, it'll show up in waitpid(-1, ...)

This is clear.

But please note that this task can be traced/debugged by unrelated process,
not its real_parent/creator. Say, the system admin does "strace -p". This
simply breaks the current API.

Again, again, I didn't read this series yet. But the proper solution (afaics)
should move this "autoreap" check into release_task/__ptrace_detach(). If the
task is traced, the debugger should check ->autoreap and skip another
do_notify_parent().


Speaking of autoreap... If ->exit_signal is zero, then the exiting child
doesn't send the notification to its parent, still it doesn't autoreap
itself. To me this looks strange, and in fact it seems to me that this
is only by mistake. I am wondering if we can treat ->exit_signal == 0
as "autoreap" too. As usual, most probably the answer is "no, because it
is too late to change the historical behaviour". But this is off-topic.

Oleg.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
@ 2015-03-14 14:32           ` Oleg Nesterov
  0 siblings, 0 replies; 83+ messages in thread
From: Oleg Nesterov @ 2015-03-14 14:32 UTC (permalink / raw)
  To: josh
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
	Thiago Macieira, Michael Kerrisk, linux-kernel, linux-api,
	linux-fsdevel, x86

And let me add another note before I forget...

On 03/14, Oleg Nesterov wrote:
>
> On 03/13, josh@joshtriplett.org wrote:
> >
> >
> > A process launching a new process with CLONE_FD is explicitly requesting
> > that the process be automatically reaped without any other process
> > having to wait on it.  The task needs to not become a zombie, because
> > otherwise, it'll show up in waitpid(-1, ...)
>
> This is clear.
>
> But please note that this task can be traced/debugged by unrelated process,
> not its real_parent/creator. Say, the system admin does "strace -p". This
> simply breaks the current API.
>
> Again, again, I didn't read this series yet. But the proper solution (afaics)
> should move this "autoreap" check in release_task/__ptrace_detach(). If the
> task is traced. Debugger should check ->autoreap and skip another
> do_notify_parent().
>
> Speaking of autoreap... If ->exit_signal is zero, then the exiting child
> doesn't send the notification to its parent, still it doesn't autoreap
> itself. To me this looks strange, and in fact it seems to me that this
> is only by mistake. I am wondering if we can treat ->exit_signal == 0
> as "autoreap" too. As usual, most probably the answer is "no, because it
> is too late to change the historical behaviour". But this is off-topic.


It is not clear to me what do_wait() should do with ->autoreap child, even
ignoring ptrace.

Just suppose that real_parent has a single "autoreap" child. Should wait(NULL)
hang then?

If yes, who will wake the parent up?

If no, I do not see the necessary changes in wait_consider_task().

Oleg.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
@ 2015-03-14 14:35     ` Oleg Nesterov
  0 siblings, 0 replies; 83+ messages in thread
From: Oleg Nesterov @ 2015-03-14 14:35 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
	Thiago Macieira, Michael Kerrisk, linux-kernel, linux-api,
	linux-fsdevel, x86

On 03/12, Josh Triplett wrote:
>
> @@ -598,7 +600,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
>  	if (group_dead)
>  		kill_orphaned_pgrp(tsk->group_leader, NULL);
>  
> -	if (unlikely(tsk->ptrace)) {
> +	if (tsk->autoreap) {
> +		autoreap = true;
> +	} else if (unlikely(tsk->ptrace)) {
>  		int sig = thread_group_leader(tsk) &&
>  				thread_group_empty(tsk) &&
>  				!ptrace_reparented(tsk) ?
> @@ -612,8 +616,10 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
>  	}
>  
>  	tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE;
> -	if (tsk->exit_state == EXIT_DEAD)
> +	if (tsk->exit_state == EXIT_DEAD) {
>  		list_add(&tsk->ptrace_entry, &dead);
> +		clonefd_do_notify(tsk);
> +	}

And even ignoring semantics issues, this change looks simply buggy anyway ;)

How can we do list_add(&tsk->ptrace_entry) if it is traced by _another_ task?
->ptrace_entry is used by debugger.

Oleg.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
  2015-03-14 14:32           ` Oleg Nesterov
  (?)
@ 2015-03-14 18:38           ` Thiago Macieira
  2015-03-14 18:54             ` Oleg Nesterov
  2015-03-14 19:01             ` Josh Triplett
  -1 siblings, 2 replies; 83+ messages in thread
From: Thiago Macieira @ 2015-03-14 18:38 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: josh, Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar,
	Kees Cook, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, linux-kernel, linux-api,
	linux-fsdevel, x86

On Saturday 14 March 2015 15:32:35 Oleg Nesterov wrote:
> It is not clear to me what do_wait() should do with ->autoreap child, even
> ignoring ptrace.
> 
> Just suppose that real_parent has a single "autoreap" child. Should
> wait(NULL) hang then?

It should ignore the child that is set to autoreap. wait(NULL) should return
-ECHILD, indicating there are no children waiting to be reaped.

But now I realise that this might have implications for session management and 
job control.

> If yes, who will wake the parent up?
> 
> If no, I do not see the necessary changes in wait_consider_task().

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
  2015-03-14 18:38           ` Thiago Macieira
@ 2015-03-14 18:54             ` Oleg Nesterov
  2015-03-14 22:03                 ` Josh Triplett
  2015-03-14 19:01             ` Josh Triplett
  1 sibling, 1 reply; 83+ messages in thread
From: Oleg Nesterov @ 2015-03-14 18:54 UTC (permalink / raw)
  To: Thiago Macieira
  Cc: josh, Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar,
	Kees Cook, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, linux-kernel, linux-api,
	linux-fsdevel, x86

On 03/14, Thiago Macieira wrote:
>
> On Saturday 14 March 2015 15:32:35 Oleg Nesterov wrote:
> > It is not clear to me what do_wait() should do with ->autoreap child, even
> > ignoring ptrace.
> >
> > Just suppose that real_parent has a single "autoreap" child. Should
> > wait(NULL) hang then?
>
> It should ignore the child that is set to autoreap. wait(NULL) should return
> -ECHILD, indicating there are no children waiting to be reaped.

I disagree. I won't really argue now, because I think that this needs
a separate discussion. And imo "autoreap" should come as a separate feature.

I think that wait(NULL) should hang like it hangs even if the parent ignores
SIGCHLD. But in this case the parent should be woken up when the "autoreap"
child exits.
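
For reference, a minimal user-space sketch of the existing behaviour this
comparison refers to, assuming Linux semantics for an ignored SIGCHLD. The
function name is made up for illustration, and nothing here uses CLONE_FD or
->autoreap. The child is reaped automatically, yet wait(NULL) still blocks
until it is gone and only then fails with ECHILD.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Ignore SIGCHLD, fork one short-lived child, then call wait(NULL).
 * Returns the errno reported by wait(), 0 if wait() unexpectedly
 * returned a pid, or -1 if the setup failed.
 * (Hypothetical helper name; illustration only.)
 */
int wait_errno_with_sigchld_ignored(void)
{
	struct sigaction ign, old;
	struct timespec delay = { 0, 50 * 1000 * 1000 };	/* 50 ms */
	pid_t pid;
	int err, rc;

	ign.sa_handler = SIG_IGN;
	ign.sa_flags = 0;
	sigemptyset(&ign.sa_mask);
	if (sigaction(SIGCHLD, &ign, &old) < 0)
		return -1;

	pid = fork();
	if (pid == 0) {
		/* Child: stay alive briefly so the parent really has to block. */
		nanosleep(&delay, NULL);
		_exit(7);
	}

	if (pid < 0)
		err = -1;
	else
		err = (wait(NULL) == -1) ? errno : 0;

	/* Put the caller's SIGCHLD disposition back. */
	rc = sigaction(SIGCHLD, &old, NULL);
	assert(rc == 0);
	(void)rc;
	return err;
}
```

Calling wait_errno_with_sigchld_ignored() from a process with no other
children is expected to return ECHILD, but only after the child has exited.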

If nothing else. Suppose that the parent does waitid(WEXITED|WSTOPPED).
Should WSTOPPED work? I think it should.

At the same time, if we add autoreap then probably it also makes sense to add
WEXITED_UNLESS_AUTOREAP.

In short: this all certainly needs more discussion, but (afaics) this patch
is wrong in any case.



In fact I have some concerns about the file descriptor from clone; it doesn't
look like a "right" interface to me. But I will not comment on this part until
I have at least read 0/6 ;)

Oleg.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
  2015-03-14 18:38           ` Thiago Macieira
  2015-03-14 18:54             ` Oleg Nesterov
@ 2015-03-14 19:01             ` Josh Triplett
  2015-03-14 19:18                 ` Oleg Nesterov
  1 sibling, 1 reply; 83+ messages in thread
From: Josh Triplett @ 2015-03-14 19:01 UTC (permalink / raw)
  To: Thiago Macieira
  Cc: Oleg Nesterov, Al Viro, Andrew Morton, Andy Lutomirski,
	Ingo Molnar, Kees Cook, Paul E. McKenney, H. Peter Anvin,
	Rik van Riel, Thomas Gleixner, Michael Kerrisk, linux-kernel,
	linux-api, linux-fsdevel, x86

On Sat, Mar 14, 2015 at 11:38:29AM -0700, Thiago Macieira wrote:
> On Saturday 14 March 2015 15:32:35 Oleg Nesterov wrote:
> > It is not clear to me what do_wait() should do with ->autoreap child, even
> > ignoring ptrace.
> > 
> > Just suppose that real_parent has a single "autoreap" child. Should
> > wait(NULL) hang then?
> 
> It should ignore the child that is set to autoreap. wait(NULL) should return
> -ECHILD, indicating there are no children waiting to be reaped.

Right.  And I don't think the current code does this.  I think we need
to change wait_consider_task to early-return for ->autoreap just as it
does for task_state == EXIT_DEAD.  I'll do that in v2.

- Josh Triplett

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor
  2015-03-14  1:11             ` Thiago Macieira
  (?)
@ 2015-03-14 19:03             ` Thiago Macieira
  2015-03-14 19:29                 ` Josh Triplett
  -1 siblings, 1 reply; 83+ messages in thread
From: Thiago Macieira @ 2015-03-14 19:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Josh Triplett, David Drysdale, Al Viro, Andrew Morton,
	Ingo Molnar, Kees Cook, Oleg Nesterov, Paul E. McKenney,
	H. Peter Anvin, Rik van Riel, Thomas Gleixner, Michael Kerrisk,
	linux-kernel, Linux API, Linux FS Devel, X86 ML

On Friday 13 March 2015 18:11:32 Thiago Macieira wrote:
> On Friday 13 March 2015 14:51:47 Andy Lutomirski wrote:
> > In any event, we should find out what FreeBSD does in response to
> > read(2) on the fd.
> 
> I've just successfully installed FreeBSD and compiled qtbase (main package
> of Qt 5) on it.
> 
> I'll test pdfork during the weekend and report its behaviour.

Here are my findings about pdfork.

Source: http://fxr.watson.org/fxr/source/kern/sys_procdesc.c?v=FREEBSD10
Qt adaptations: https://codereview.qt-project.org/108561

Processes created with pdfork() are normal processes that still send SIGCHLD 
to their parents. The only difference is that you get the extra file descriptor 
that can be passed to the pdgetpid() system call and works on select()/poll(). 
Trying to read from that file descriptor will result in EOPNOTSUPP.

Since they've never implemented pdwait4() (it's not even declared in the 
headers), the only way to reap a child if you only have the file descriptor is 
to first pdgetpid() and then call wait4() or wait6().

If you don't pass PD_DAEMON, the child process gets killed with SIGKILL when 
the file closes.

Conclusion: 
Pros: this is the bare minimum that we'd need to disentangle the SIGCHLD mess. 
As long as all child process activations use this feature, the problem is 
solved.

Cons: it requires cooperation from all child starters. If some other library 
or the application installs a global SIGCHLD handler that waits on all child 
processes, like libvlc used to do and Glib and Ecore still do, you won't be 
able to get the child exit status.
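
A minimal sketch of that failure mode on Linux or any POSIX system, without
pdfork(): one part of the program reaps every child, and the code that
actually started the child then cannot get its status. The function names are
made up for illustration, and the "global reaper" is called directly rather
than from a real SIGCHLD handler to keep the sketch deterministic.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Stand-in for the body of a process-wide SIGCHLD handler installed by
 * some other library: reap every child until none are left.
 */
static void reap_everything(void)
{
	while (waitpid(-1, NULL, 0) > 0 || errno == EINTR)
		;
	assert(errno == ECHILD);	/* no children remain */
}

/*
 * Fork a child that exits with status 42, let the "other library" reap
 * it, then try to collect the status by PID as the real owner would.
 * Returns the errno from the owner's waitpid(), 0 if it unexpectedly
 * succeeded, or -1 if fork() failed.
 */
int owner_errno_after_global_reap(void)
{
	pid_t pid = fork();
	int status;

	if (pid < 0)
		return -1;
	if (pid == 0)
		_exit(42);

	reap_everything();

	if (waitpid(pid, &status, 0) == pid)
		return 0;
	return errno;
}
```

The owner is expected to see ECHILD here: the exit status 42 has already been
consumed by the other waiter.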

I have not tested what happens if you try to pass the file descriptor to other 
processes (can you even do that on FreeBSD?). But even if you could and got 
notifications, you couldn't wait on the child to get its exit status -- unless 
they implement pdwait4.

 - pdfork: can be emulated with clone4 + CLONE_FD (+ CLONEFD_KILL_ON_CLOSE)
 - pdwait4: can be emulated with read()
 - pdgetpid: needs an ioctl
 - pdkill: needs an ioctl [or just write()]

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
@ 2015-03-14 19:15       ` Josh Triplett
  0 siblings, 0 replies; 83+ messages in thread
From: Josh Triplett @ 2015-03-14 19:15 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
	Thiago Macieira, Michael Kerrisk, linux-kernel, linux-api,
	linux-fsdevel, x86

On Sat, Mar 14, 2015 at 03:35:58PM +0100, Oleg Nesterov wrote:
> On 03/12, Josh Triplett wrote:
> >
> > @@ -598,7 +600,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> >  	if (group_dead)
> >  		kill_orphaned_pgrp(tsk->group_leader, NULL);
> >  
> > -	if (unlikely(tsk->ptrace)) {
> > +	if (tsk->autoreap) {
> > +		autoreap = true;
> > +	} else if (unlikely(tsk->ptrace)) {
> >  		int sig = thread_group_leader(tsk) &&
> >  				thread_group_empty(tsk) &&
> >  				!ptrace_reparented(tsk) ?
> > @@ -612,8 +616,10 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> >  	}
> >  
> >  	tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE;
> > -	if (tsk->exit_state == EXIT_DEAD)
> > +	if (tsk->exit_state == EXIT_DEAD) {
> >  		list_add(&tsk->ptrace_entry, &dead);
> > +		clonefd_do_notify(tsk);
> > +	}
> 
> And even ignoring semantics issues, this change looks simply buggy anyway ;)
> 
> How can we do list_add(&tsk->ptrace_entry) if it is traced by _another_ task?
> ->ptrace_entry is used by debugger.

That list_add was there before; I didn't change that.  I just added a
second line inside the EXIT_DEAD case, to call clonefd_do_notify (which
wakes up potential callers of poll/read).

- Josh Triplett

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
  2015-03-14 19:01             ` Josh Triplett
@ 2015-03-14 19:18                 ` Oleg Nesterov
  0 siblings, 0 replies; 83+ messages in thread
From: Oleg Nesterov @ 2015-03-14 19:18 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Thiago Macieira, Al Viro, Andrew Morton, Andy Lutomirski,
	Ingo Molnar, Kees Cook, Paul E. McKenney, H. Peter Anvin,
	Rik van Riel, Thomas Gleixner, Michael Kerrisk, linux-kernel,
	linux-api, linux-fsdevel, x86

On 03/14, Josh Triplett wrote:
>
> On Sat, Mar 14, 2015 at 11:38:29AM -0700, Thiago Macieira wrote:
> > On Saturday 14 March 2015 15:32:35 Oleg Nesterov wrote:
> > > It is not clear to me what do_wait() should do with ->autoreap child, even
> > > ignoring ptrace.
> > >
> > > Just suppose that real_parent has a single "autoreap" child. Should
> > > wait(NULL) hang then?
> >
> > It should ignore the child that is set to autoreap. wait(NULL) should return
> > -ECHILD, indicating there are no children waiting to be reaped.
>
> Right.  And I don't think the current code does this.  I think we need
> to change wait_consider_task to early-return for ->autoreap just as it
> does for task_state == EXIT_DEAD.

No. This EXIT_DEAD is absolutely different. And this is another indication
that you might use it wrongly ;)

What we actually want is BUG_ON(task_state == EXIT_DEAD) here. We do not
want the EXIT_DEAD tasks in ->children/ptraced lists. These EXIT_DEAD tasks
complicate the exit/wait/reparent paths.

However, currently this is TODO. The main problem is the locking in
wait_task_zombie(), we can set EXIT_DEAD and remove the task from list
under read_lock().

And please see my other email. So far I disagree that wait(NULL)
should return ECHILD unconditionally. At least unless this is discussed
separately.

Oleg.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
  2015-03-14 19:15       ` Josh Triplett
  (?)
@ 2015-03-14 19:24       ` Oleg Nesterov
  2015-03-14 19:48           ` Josh Triplett
  -1 siblings, 1 reply; 83+ messages in thread
From: Oleg Nesterov @ 2015-03-14 19:24 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
	Thiago Macieira, Michael Kerrisk, linux-kernel, linux-api,
	linux-fsdevel, x86

On 03/14, Josh Triplett wrote:
>
> On Sat, Mar 14, 2015 at 03:35:58PM +0100, Oleg Nesterov wrote:
> > On 03/12, Josh Triplett wrote:
> > >
> > > @@ -598,7 +600,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> > >  	if (group_dead)
> > >  		kill_orphaned_pgrp(tsk->group_leader, NULL);
> > >
> > > -	if (unlikely(tsk->ptrace)) {
> > > +	if (tsk->autoreap) {
> > > +		autoreap = true;
> > > +	} else if (unlikely(tsk->ptrace)) {
> > >  		int sig = thread_group_leader(tsk) &&
> > >  				thread_group_empty(tsk) &&
> > >  				!ptrace_reparented(tsk) ?
> > > @@ -612,8 +616,10 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> > >  	}
> > >
> > >  	tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE;
> > > -	if (tsk->exit_state == EXIT_DEAD)
> > > +	if (tsk->exit_state == EXIT_DEAD) {
> > >  		list_add(&tsk->ptrace_entry, &dead);
> > > +		clonefd_do_notify(tsk);
> > > +	}
> >
> > And even ignoring semantics issues, this change looks simply buggy anyway ;)
> >
> > How can we do list_add(&tsk->ptrace_entry) if it is traced by _another_ task?
> > ->ptrace_entry is used by debugger.
>
> That list_add was there before; I didn't change that.

But this doesn't matter,

> I just added a
> second line inside the EXIT_DEAD case, to call clonefd_do_notify (which
> wakes up potential callers of poll/read).

No. Please read this code before and after your patch. You also added

	if (tsk->autoreap)
		autoreap = true;

at the start. And this can trigger the _wrong_ list_add(&tsk->ptrace_entry),
when the task is traced by another thread.

The current code can only use ->ptrace_entry if it was untraced (by us).
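
A toy user-space model of the point above, not kernel code. Every name in it
is invented for illustration, the untraced branches of exit_notify() are
collapsed into a single boolean, and the "before" function simply encodes the
statement above that a still-traced task never reaches the list_add(). It only
shows how putting the ->autoreap test first lets a traced task through.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for the relevant task state. */
struct toy_task {
	bool autoreap;			/* the flag added by the patch */
	bool ptraced;			/* ptrace_entry is linked on a tracer's list */
	bool parent_ignores_exit;	/* untraced case: parent wants no zombie */
};

/* Pre-patch decision: a task that is still traced is never auto-reaped. */
static bool toy_autoreap_before(const struct toy_task *t)
{
	if (t->ptraced)
		return false;
	return t->parent_ignores_exit;
}

/* Patched decision: the new first branch is taken before ptrace is checked. */
static bool toy_autoreap_after(const struct toy_task *t)
{
	if (t->autoreap)
		return true;
	return toy_autoreap_before(t);
}

/*
 * True if the EXIT_DEAD path would do list_add(&tsk->ptrace_entry, &dead)
 * while ptrace_entry is still in use by the tracer.
 */
bool toy_reuses_live_ptrace_entry(const struct toy_task *t, bool patched)
{
	bool reap;

	assert(t != NULL);
	reap = patched ? toy_autoreap_after(t) : toy_autoreap_before(t);
	return reap && t->ptraced;
}
```

In this model a task with both autoreap and ptraced set is harmless before the
patch and hits the bad list_add() after it; an untraced autoreap task is fine
either way.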

Oleg.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor
  2015-03-14 19:03             ` Thiago Macieira
@ 2015-03-14 19:29                 ` Josh Triplett
  0 siblings, 0 replies; 83+ messages in thread
From: Josh Triplett @ 2015-03-14 19:29 UTC (permalink / raw)
  To: Thiago Macieira
  Cc: Andy Lutomirski, David Drysdale, Al Viro, Andrew Morton,
	Ingo Molnar, Kees Cook, Oleg Nesterov, Paul E. McKenney,
	H. Peter Anvin, Rik van Riel, Thomas Gleixner, Michael Kerrisk,
	linux-kernel, Linux API, Linux FS Devel, X86 ML

On Sat, Mar 14, 2015 at 12:03:12PM -0700, Thiago Macieira wrote:
> On Friday 13 March 2015 18:11:32 Thiago Macieira wrote:
> > On Friday 13 March 2015 14:51:47 Andy Lutomirski wrote:
> > > In any event, we should find out what FreeBSD does in response to
> > > read(2) on the fd.
> > 
> > I've just successfully installed FreeBSD and compiled qtbase (main package
> > of Qt 5) on it.
> > 
> > I'll test pdfork during the weekend and report its behaviour.
> 
> Here are my findings about pdfork.
> 
> Source: http://fxr.watson.org/fxr/source/kern/sys_procdesc.c?v=FREEBSD10
> Qt adaptations: https://codereview.qt-project.org/108561
> 
> Processes created with pdfork() are normal processes that still send SIGCHLD 
> to their parents. The only difference is that you get the extra file descriptor 
> that can be passed to the pdgetpid() system call and works on select()/poll(). 
> Trying to read from that file descriptor will result in EOPNOTSUPP.

OK, since read() doesn't work on a pdfork() file descriptor, we don't
have to worry about compatibility with pdfork()'s read result.

However, if the expectation is that pdfork()ed child processes still
send SIGCHLD, then I don't see how we can be compatible there, nor do I
think we want to; as you mention below, that breaks the ability to
encapsulate management of the created process entirely within a library.

> Since they've never implemented pdwait4() (it's not even declared in the 
> headers), the only way to reap a child if you only have the file descriptor is 
> to first pdgetpid() and then call wait4() or wait6().

Which suggests that we shouldn't try to implement pdwait4() in glibc
until FreeBSD implements it in their kernel, since we won't know the
exact semantics they expect.

> If you don't pass PD_DAEMON, the child process gets killed with SIGKILL when 
> the file closes.

OK, that makes sense.  We could certainly implement a
CLONE_FD_KILL_ON_CLOSE flag with those semantics, if we want one in the
future.

> Conclusion: 
> Pros: this is the bare minimum that we'd need to disentangle the SIGCHLD mess. 
> As long as all child process activations use this feature, the problem is 
> solved.
> 
> Cons: it requires cooperation from all child starters. If some other library 
> or the application installs a global SIGCHLD handler that waits on all child 
> processes, like libvlc used to do and Glib and Ecore still do, you won't be 
> able to get the child exit status.
> 
> I have not tested what happens if you try to pass the file descriptor to other 
> processes (can you even do that on FreeBSD?). But even if you could and got 
> notifications, you couldn't wait on the child to get its exit status -- unless 
> they implement pdwait4.

Even if they do implement pdwait4, they might not bypass the "must be
the parent process" restriction.  Let's wait to see what semantics they
go with.

>  - pdfork: can be emulated with clone4 + CLONE_FD (+ CLONEFD_KILL_ON_CLOSE)
>  - pdwait4: can be emulated with read()
>  - pdgetpid: needs an ioctl
>  - pdkill: needs an ioctl [or just write()]

I think that should be a dedicated syscall, not an ioctl.

It's unfortunate that rt_sigqueueinfo doesn't take a flags argument.
However, I just realized that it takes a 32-bit "int" for the signal
number, yet signal numbers fit in 8 bits.  So we could just add flags in
the high 24 bits of that argument, and in particular add a flag
indicating that the first argument is a file descriptor rather than a
PID.
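
A rough sketch of that encoding, to make the bit layout concrete. The macro
and function names, and the choice of bit 8 for the flag, are placeholders
invented here; nothing like this exists in the current rt_sigqueueinfo ABI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Low 8 bits carry the signal number; the high 24 bits are free for flags. */
#define SQ_SIGNO_MASK		0xffu
/* Hypothetical flag: the first syscall argument is an fd, not a PID. */
#define SQ_FLAG_TARGET_IS_FD	(1u << 8)

/* Pack a signal number and flag bits into one 32-bit argument. */
uint32_t sq_encode(unsigned int signo, uint32_t flags)
{
	assert(signo <= SQ_SIGNO_MASK);		/* must fit in 8 bits */
	assert((flags & SQ_SIGNO_MASK) == 0);	/* flags live above bit 7 */
	return flags | signo;
}

/* Recover the signal number from a packed argument. */
unsigned int sq_signo(uint32_t arg)
{
	return arg & SQ_SIGNO_MASK;
}

/* Check whether the packed argument marks the target as a file descriptor. */
bool sq_target_is_fd(uint32_t arg)
{
	return (arg & SQ_FLAG_TARGET_IS_FD) != 0;
}
```

For example, sq_encode(9, SQ_FLAG_TARGET_IS_FD) yields a value whose low byte
is still 9 and whose flag bit is set, while a plain sq_encode(15, 0) is just
the signal number, as callers pass today.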

- Josh Triplett

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
@ 2015-03-14 19:47                   ` Oleg Nesterov
  0 siblings, 0 replies; 83+ messages in thread
From: Oleg Nesterov @ 2015-03-14 19:47 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Thiago Macieira, Al Viro, Andrew Morton, Andy Lutomirski,
	Ingo Molnar, Kees Cook, Paul E. McKenney, H. Peter Anvin,
	Rik van Riel, Thomas Gleixner, Michael Kerrisk, linux-kernel,
	linux-api, linux-fsdevel, x86

On 03/14, Oleg Nesterov wrote:
>
> On 03/14, Josh Triplett wrote:
> >
> > On Sat, Mar 14, 2015 at 11:38:29AM -0700, Thiago Macieira wrote:
> > > On Saturday 14 March 2015 15:32:35 Oleg Nesterov wrote:
> > > > It is not clear to me what do_wait() should do with ->autoreap child, even
> > > > ignoring ptrace.
> > > >
> > > > Just suppose that real_parent has a single "autoreap" child. Should
> > > > wait(NULL) hang then?
> > >
> > > It should ignore the child that is set to autoreap. wait(NULL) should return
> > > -ECHILD, indicating there are no children waiting to be reaped.
> >
> > Right.  And I don't think the current code does this.  I think we need
> > to change wait_consider_task to early-return for ->autoreap just as it
> > does for task_state == EXIT_DEAD.
>
> No. This EXIT_DEAD is absolutely different. And this is another indication
> that you might use it wrongly ;)
>
> What we actually want is BUG_ON(task_state == EXIT_DEAD) here. We do not
> want the EXIT_DEAD tasks in ->children/ptraced lists. These EXIT_DEAD tasks
> complicate the exit/wait/reparent paths.
>
> However, currently this is TODO. The main problem is the locking in
> wait_task_zombie(), we can set EXIT_DEAD and remove the task from list
> under read_lock().

Let me clarify in case I confused you.

The EXIT_DEAD check in do_wait() paths doesn't mean "autoreap". It means
that this thread/process (depending on ptrace) was already reaped. It was
reaped by our sub-thread, or it was reaped because we ignore SIGCHLD, or
other reasons. This doesn't matter.

In short, EXIT_DEAD means: we have to keep this thread on lists until the
task which set this state calls release_task().

Oleg.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
@ 2015-03-14 19:48           ` Josh Triplett
  0 siblings, 0 replies; 83+ messages in thread
From: Josh Triplett @ 2015-03-14 19:48 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
	Thiago Macieira, Michael Kerrisk, linux-kernel, linux-api,
	linux-fsdevel, x86

On Sat, Mar 14, 2015 at 08:24:56PM +0100, Oleg Nesterov wrote:
> On 03/14, Josh Triplett wrote:
> >
> > On Sat, Mar 14, 2015 at 03:35:58PM +0100, Oleg Nesterov wrote:
> > > On 03/12, Josh Triplett wrote:
> > > >
> > > > @@ -598,7 +600,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> > > >  	if (group_dead)
> > > >  		kill_orphaned_pgrp(tsk->group_leader, NULL);
> > > >
> > > > -	if (unlikely(tsk->ptrace)) {
> > > > +	if (tsk->autoreap) {
> > > > +		autoreap = true;
> > > > +	} else if (unlikely(tsk->ptrace)) {
> > > >  		int sig = thread_group_leader(tsk) &&
> > > >  				thread_group_empty(tsk) &&
> > > >  				!ptrace_reparented(tsk) ?
> > > > @@ -612,8 +616,10 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> > > >  	}
> > > >
> > > >  	tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE;
> > > > -	if (tsk->exit_state == EXIT_DEAD)
> > > > +	if (tsk->exit_state == EXIT_DEAD) {
> > > >  		list_add(&tsk->ptrace_entry, &dead);
> > > > +		clonefd_do_notify(tsk);
> > > > +	}
> > >
> > > And even ignoring semantics issues, this change looks simply buggy anyway ;)
> > >
> > > How can we do list_add(&tsk->ptrace_entry) if it is traced by _another_ task?
> > > ->ptrace_entry is used by debugger.
> >
> > That list_add was there before; I didn't change that.
> 
> But this doesn't matter,
> 
> > I just added a
> > second line inside the EXIT_DEAD case, to call clonefd_do_notify (which
> > wakes up potential callers of poll/read).
> 
> No. Please read this code before and after your patch. You also added
> 
> 	if (tsk->autoreap)
> 		autoreap = true;
> 
> at the start. And this can trigger the _wrong_ list_add(&tsk->ptrace_entry),
> when the task is traced by another thread.
> 
> The current code can only use ->ptrace_entry if it was untraced (by us).

Ugh.  I finally realized just how magic the logic is there; thanks for
catching this.  The call to forget_original_parent at the top of
exit_notify can potentially add the process to the list "dead" here,
either in exit_ptrace() or in reparent_leader() (the latter of which has
its own duplicate of part of exit_notify's logic, including
do_notify_parent and setting exit_state).  Then exit_notify can add the
task to "dead" itself under some conditions that clearly depend on the
exact nature of the existing three-way conditional above.  And finally
exit_notify loops over "dead" and releases all the tasks there.

I'll investigate this further and make sure the ptrace case gets
handled correctly.

- Josh Triplett

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
@ 2015-03-14 20:03                   ` Josh Triplett
  0 siblings, 0 replies; 83+ messages in thread
From: Josh Triplett @ 2015-03-14 20:03 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Thiago Macieira, Al Viro, Andrew Morton, Andy Lutomirski,
	Ingo Molnar, Kees Cook, Paul E. McKenney, H. Peter Anvin,
	Rik van Riel, Thomas Gleixner, Michael Kerrisk, linux-kernel,
	linux-api, linux-fsdevel, x86

On Sat, Mar 14, 2015 at 08:18:36PM +0100, Oleg Nesterov wrote:
> On 03/14, Josh Triplett wrote:
> >
> > On Sat, Mar 14, 2015 at 11:38:29AM -0700, Thiago Macieira wrote:
> > > On Saturday 14 March 2015 15:32:35 Oleg Nesterov wrote:
> > > > It is not clear to me what do_wait() should do with ->autoreap child, even
> > > > ignoring ptrace.
> > > >
> > > > Just suppose that real_parent has a single "autoreap" child. Should
> > > > wait(NULL) hang then?
> > >
> > > It should ignore the child that is set to autoreap. wait(NULL) should return
> > > -ECHILD, indicating there are no children waiting to be reaped.
> >
> > Right.  And I don't think the current code does this.  I think we need
> > to change wait_consider_task to early-return for ->autoreap just as it
> > does for task_state == EXIT_DEAD.
> 
> No. This EXIT_DEAD is absolutely different. And this is another indication
> that you might use it wrongly ;)

Is there any information somewhere on how this state machine of doom is
*supposed* to work? :)  Why would "p->task_state == EXIT_DEAD" mean
something different in wait_consider_task?

> What we actually want is BUG_ON(task_state == EXIT_DEAD) here. We do not
> want the EXIT_DEAD tasks in ->children/ptraced lists. These EXIT_DEAD tasks
> complicate the exit/wait/reparent paths.

Pulling the EXIT_DEAD tasks out of those lists completely does sound
like a good simplification.  However, that doesn't seem to be the
current expectation in wait_consider_task, which just returns if
p->task_state == EXIT_DEAD to skip considering that task.

And an autoreaping task isn't necessarily dead yet; it just shouldn't be
waited on.

> However, currently this is TODO. The main problem is the locking in
> wait_task_zombie(), we can set EXIT_DEAD and remove the task from list
> under read_lock().

That appears to be only reachable for zombies, which an autoreaping task
should never become.

> And please see another email from me. So far  I disagree that wait(NULL)
> should return ECHILD unconditionally. At least unless this is discussed
> separately.

I'll respond in that separate thread, but one issue there: waiting for
any child process cannot safely return an autoreaping child process,
because that would introduce a race condition.  The PID the parent gets
back can disappear at any time, so there's nothing useful the parent can
do with it.

- Josh Triplett

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
@ 2015-03-14 20:14                     ` Josh Triplett
  0 siblings, 0 replies; 83+ messages in thread
From: Josh Triplett @ 2015-03-14 20:14 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Thiago Macieira, Al Viro, Andrew Morton, Andy Lutomirski,
	Ingo Molnar, Kees Cook, Paul E. McKenney, H. Peter Anvin,
	Rik van Riel, Thomas Gleixner, Michael Kerrisk, linux-kernel,
	linux-api, linux-fsdevel, x86

On Sat, Mar 14, 2015 at 08:47:21PM +0100, Oleg Nesterov wrote:
> On 03/14, Oleg Nesterov wrote:
> >
> > On 03/14, Josh Triplett wrote:
> > >
> > > On Sat, Mar 14, 2015 at 11:38:29AM -0700, Thiago Macieira wrote:
> > > > On Saturday 14 March 2015 15:32:35 Oleg Nesterov wrote:
> > > > > It is not clear to me what do_wait() should do with ->autoreap child, even
> > > > > ignoring ptrace.
> > > > >
> > > > > > Just suppose that real_parent has a single "autoreap" child. Should
> > > > > > wait(NULL) hang then?
> > > > >
> > > > > It should ignore the child that is set to autoreap. wait(NULL) should return
> > > > > -ECHILD, indicating there are no children waiting to be reaped.
> > >
> > > Right.  And I don't think the current code does this.  I think we need
> > > to change wait_consider_task to early-return for ->autoreap just as it
> > > does for task_state == EXIT_DEAD.
> >
> > No. This EXIT_DEAD is absolutely different. And this is another indication
> > that you might use it wrongly ;)
> >
> > What we actually want is BUG_ON(task_state == EXIT_DEAD) here. We do not
> > want the EXIT_DEAD tasks in ->children/ptraced lists. These EXIT_DEAD tasks
> > complicate the exit/wait/reparent paths.
> >
> > However, currently this is TODO. The main problem is the locking in
> > wait_task_zombie(), we can set EXIT_DEAD and remove the task from list
> > under read_lock().
> 
> Let me clarify in case I confused you.
> 
> The EXIT_DEAD check in do_wait() paths doesn't mean "autoreap". It means
> that this thread/process (depending on ptrace) was already reaped. It was
> reaped by our sub-thread, or it was reaped because we ignore SIGCHLD, or
> other reasons. This doesn't matter.
> 
> In short, EXIT_DEAD means: we have to keep this thread on lists until the
> task which set this state calls release_task().

That much I already understood from reading through the code, since
exit_notify doesn't set task_state to EXIT_DEAD until the task is
actually completely dead.  When wait_consider_task sees p->task_state ==
EXIT_DEAD, that task isn't eligible for waiting at all.

What I was proposing was that a task that isn't yet dead, but that is
going to be autoreaped, is not eligible for waiting either.  All the
various wait* family of system calls should pretend it doesn't exist at
all, because returning an autoreaped task from a wait* call introduces a
race condition if the parent tries to *do* anything with the returned
PID.  If you launch a process with CLONE_FD, you need to manage it
exclusively with that fd, not with the wait* family of system calls.
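
As a rough illustration of the proposal above, here is a toy user-space
model of a wait-for-any-child scan in which autoreap children are skipped
entirely.  This is only a sketch, not kernel code: the toy_* names, the
enum, and the return convention are invented for illustration and do not
correspond to the real wait_consider_task() implementation.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: these names are hypothetical, not kernel definitions. */
enum toy_exit_state { TOY_RUNNING, TOY_ZOMBIE, TOY_DEAD };

struct toy_task {
	int pid;
	enum toy_exit_state exit_state;
	bool autoreap;	/* assumed to be set for a child created with CLONE_FD */
};

/*
 * Models a wait-for-any-child scan under the proposal:
 *   > 0      pid of a zombie that could be reaped
 *   0        eligible children exist but none has exited (caller would block)
 *   -ECHILD  no eligible children at all
 */
int toy_wait_any(const struct toy_task *children, size_t n)
{
	int result = -ECHILD;

	for (size_t i = 0; i < n; i++) {
		const struct toy_task *p = &children[i];

		if (p->exit_state == TOY_DEAD)
			continue;	/* already reaped, release pending */
		if (p->autoreap)
			continue;	/* proposal: invisible to wait* */
		if (p->exit_state == TOY_ZOMBIE)
			return p->pid;
		result = 0;	/* a live, waitable child exists */
	}
	return result;
}
```

In this model a parent whose only child is autoreap gets -ECHILD
immediately, which is exactly the point still under discussion in this
thread.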

That also implies that the child-stop and child-continued mechanisms
(do_notify_parent_cldstop, WSTOPPED, WCONTINUED) should ignore the task
too.  In the future there could be a flag to clone4 that lets you get
stop and continue notifications through the file descriptor.

- Josh Triplett

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
  2015-03-14 20:03                   ` Josh Triplett
  (?)
@ 2015-03-14 20:20                   ` Oleg Nesterov
  -1 siblings, 0 replies; 83+ messages in thread
From: Oleg Nesterov @ 2015-03-14 20:20 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Thiago Macieira, Al Viro, Andrew Morton, Andy Lutomirski,
	Ingo Molnar, Kees Cook, Paul E. McKenney, H. Peter Anvin,
	Rik van Riel, Thomas Gleixner, Michael Kerrisk, linux-kernel,
	linux-api, linux-fsdevel, x86

On 03/14, Josh Triplett wrote:
>
> On Sat, Mar 14, 2015 at 08:18:36PM +0100, Oleg Nesterov wrote:
>
> Is there any information somewhere on how this state machine of doom is
> *supposed* to work? :)

This looks as if you think that other parts of this kernel differ ;)

> Why would "p->task_state == EXIT_DEAD" mean
> something different in wait_consider_task?

different? but once again, EXIT_DEAD never meant "autoreap". It always
means "already reaped". OK, more correctly it means "release_task() will
be called soon, please ignore me".

> Pulling the EXIT_DEAD tasks out of those lists completely does sound
> like a good simplification.  However, that doesn't seem to be the
> current expectation in wait_consider_task, which just returns if
> p->task_state == EXIT_DEAD to skip considering that task.

See above. And another email I sent.

> And an autoreaping task isn't necessarily dead yet; it just shouldn't be
> waited on.

Yes, sure, but please do not confuse this with EXIT_DEAD. In fact, please
do not confuse this with ->task_state.

> > However, currently this is TODO. The main problem is the locking in
> > wait_task_zombie(), we can set EXIT_DEAD and remove the task from list
                        ^^^^^^
I mean, "we can't".

> > under read_lock().
>
> That appears to be only reachable for zombies, which an autoreaping task
> should never become.

Sure. I meant that other paths which can set EXIT_DEAD are simpler wrt
"kill the EXIT_DEAD and EXIT_TRACE state" patch we actually want. But let's
not discuss this here, this is offtopic.

Oleg.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
  2015-03-14 20:14                     ` Josh Triplett
  (?)
@ 2015-03-14 20:30                     ` Oleg Nesterov
  2015-03-14 22:14                         ` Josh Triplett
  -1 siblings, 1 reply; 83+ messages in thread
From: Oleg Nesterov @ 2015-03-14 20:30 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Thiago Macieira, Al Viro, Andrew Morton, Andy Lutomirski,
	Ingo Molnar, Kees Cook, Paul E. McKenney, H. Peter Anvin,
	Rik van Riel, Thomas Gleixner, Michael Kerrisk, linux-kernel,
	linux-api, linux-fsdevel, x86

On 03/14, Josh Triplett wrote:
>
> What I was proposing was that a task that isn't yet dead, but that is
> going to be autoreaped, is not eligible for waiting either.  All the
> various wait* family of system calls should pretend it doesn't exist at
> all, because returning an autoreaped task from a wait* call introduces a
> race condition if the parent tries to *do* anything with the returned
> PID.  If you launch a process with CLONE_FD, you need to manage it
> exclusively with that fd, not with the wait* family of system calls.
>
> That also implies that the child-stop and child-continued mechanisms
> (do_notify_parent_cldstop, WSTOPPED, WCONTINUED) should ignore the task
> too.  In the future there could be a flag to clone4 that lets you get
> stop and continue notifications through the file descriptor.

So far I strongly disagree, and I think that "autoreap" feature should
not depend on CLONE_FD.

However, as I already said, perhaps something like W_IGNORE_AUTOREAP
for wait* makes sense.

Plus we should also discuss the reparenting. Ok, let me leave this
discussion until I read 0/4 at least.

Oleg.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
@ 2015-03-14 22:03                 ` Josh Triplett
  0 siblings, 0 replies; 83+ messages in thread
From: Josh Triplett @ 2015-03-14 22:03 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Thiago Macieira, Al Viro, Andrew Morton, Andy Lutomirski,
	Ingo Molnar, Kees Cook, Paul E. McKenney, H. Peter Anvin,
	Rik van Riel, Thomas Gleixner, Michael Kerrisk, linux-kernel,
	linux-api, linux-fsdevel, x86

On Sat, Mar 14, 2015 at 07:54:24PM +0100, Oleg Nesterov wrote:
> On 03/14, Thiago Macieira wrote:
> > On Saturday 14 March 2015 15:32:35 Oleg Nesterov wrote:
> > > It is not clear to me what do_wait() should do with ->autoreap child, even
> > > ignoring ptrace.
> > >
> > > Just suppose that real_parent has a single "autoreap" child. Should
> > > wait(NULL) hang then?
> >
> > It should ignore the child that is set to autoreap. wait(NULL) should return
> > -ECHILD, indicating there are no children waiting to be reaped.
> 
> I disagree. I won't really argue now, because I think that this needs
> a separate discussion.

We should certainly discuss it further, but why a "separate" discussion
rather than just discussing the semantics of autoreap and wait here?

> And imo "autoreap" should come as a separate feature.

Thinking about this further, I originally thought that CLONE_FD would
*have* to imply autoreap, because otherwise the calling process still
has to call a wait function on the process after getting the exit
notification via the file descriptor.  However, with the current version
(which holds a reference to the task via the task_struct and generates
the data in ->read), it could potentially make sense to have a file
descriptor for a process that still gets zombified until the parent
waits on it.

Autoreap would still be a potentially useful addition to simplify
process management; it would effectively become "always treat this child
as though the parent had the signal ignored or SA_NOCLDWAIT set", which
would just be a simple change to do_notify_parent, rather than a complex
one to exit_notify that potentially interacts with ptrace.  Matching the
semantics of SA_NOCLDWAIT seems reasonable.

Thiago, see below for a question about switching to the semantics of
SA_NOCLDWAIT.

> I think that wait(NULL) should hang like it hangs even if the parent ignores
> SIGCHLD. But in this case the parent should be woken up when the "autoreap"
> child exits.

I had to think about this for a while, but I think it makes sense now.
wait should *not* ever return the PID of an autoreaped process, because
that would introduce a race condition (the caller cannot safely do
*anything* with the PID of an autoreaped process, since by the time it
does, the process may be gone and the PID may be reused).  However, that
doesn't mean wait cannot block on the process, and then subsequently
wake up and return -ECHILD (or keep waiting on some other child process
if there is one).  That's apparently the semantic used with SA_NOCLDWAIT
or if you have SIGCHLD set to SIG_IGN, and matching that seems
appropriate.
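
The existing SIG_IGN behaviour referred to above can be observed from
user space.  A minimal sketch, assuming Linux semantics as documented in
wait(2); the function name is made up for illustration, and it assumes
the calling process has no other children:

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Ignore SIGCHLD, fork a child that lives for about child_ms milliseconds,
 * then call wait(NULL).  Stores how long wait() blocked (in ms) in
 * *blocked_ms and returns the errno wait() failed with, 0 if wait()
 * unexpectedly returned a PID, or -1 on setup failure.
 */
int wait_errno_with_sigchld_ignored(long child_ms, long *blocked_ms)
{
	struct sigaction ign, old;
	struct timespec t0, t1;
	pid_t pid, r;
	int err;

	memset(&ign, 0, sizeof(ign));
	ign.sa_handler = SIG_IGN;
	sigemptyset(&ign.sa_mask);
	if (sigaction(SIGCHLD, &ign, &old) == -1)
		return -1;

	pid = fork();
	if (pid == -1) {
		sigaction(SIGCHLD, &old, NULL);
		return -1;
	}
	if (pid == 0) {
		struct timespec nap = {
			.tv_sec = child_ms / 1000,
			.tv_nsec = (child_ms % 1000) * 1000000L,
		};
		nanosleep(&nap, NULL);
		_exit(0);
	}

	/* The child is reaped automatically; wait() never reports its PID. */
	clock_gettime(CLOCK_MONOTONIC, &t0);
	r = wait(NULL);
	err = (r == -1) ? errno : 0;
	clock_gettime(CLOCK_MONOTONIC, &t1);

	*blocked_ms = (t1.tv_sec - t0.tv_sec) * 1000L +
		      (t1.tv_nsec - t0.tv_nsec) / 1000000L;
	sigaction(SIGCHLD, &old, NULL);
	return err;
}
```

With a 200 ms child, the call is expected to block for roughly that long
and then fail with ECHILD rather than return the child's PID.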

Thiago, could your QProcess implementation handle that modified autoreap
semantic?  The downside there is that if your calling process has a
process-wide loop that waits for all processes (and explicitly passes
the Linux-specific __WCLONE or __WALL flag, since your processes
launched with a 0 signal would count as "clone" children), they'd get
back the processes you launch, too.  (That would happen with your
userspace-emulated version too for calls *without* __WCLONE or __WALL.)
You'd still get the exit status you need via the clonefd, without a
race, and you wouldn't need to touch process-wide signal handling, so I
think this should still work and avoid any races.

I'm going to try implementing that semantic, which should significantly
simplify the last patch of this series.

> If nothing else. Suppose that the parent does waitid(WEXITED|WSTOPPED).
> Should WSTOPPED work? I think it should.

Yeah, I guess it should.  Arguably there ought to be a clone flag that
lets you receive stop/continue notifications for that process via the
file descriptor instead (to allow a library to handle job control for a
process without touching process-wide signal handling), but that can
come later.

> At the same time, if we add autoreap then probably it also makes sense to add
> WEXITED_UNLESS_AUTOREAP.

Potentially, though for many applications you could also just pass a
signal of 0 and avoid passing __WALL or __WCLONE.

- Josh Triplett

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
  2015-03-14 14:14         ` Oleg Nesterov
  (?)
  (?)
@ 2015-03-14 22:09         ` Josh Triplett
  -1 siblings, 0 replies; 83+ messages in thread
From: Josh Triplett @ 2015-03-14 22:09 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
	Thiago Macieira, Michael Kerrisk, linux-kernel, linux-api,
	linux-fsdevel, x86

On Sat, Mar 14, 2015 at 03:14:14PM +0100, Oleg Nesterov wrote:
> Again, again, I didn't read this series yet. But the proper solution (afaics)
> should move this "autoreap" check into release_task/__ptrace_detach(). If the
> task is traced, the debugger should check ->autoreap and skip another
> do_notify_parent().

As mentioned in the mail I just sent, I think I can just move the
autoreap handling *into* do_notify_parent, and treat it as though the
parent had SA_NOCLDWAIT set.

> Speaking of autoreap... If ->exit_signal is zero, then the exiting child
> doesn't send the notification to its parent, still it doesn't autoreap
> itself. To me this looks strange, and in fact it seems to me that this
> is only by mistake. I am wondering if we can treat ->exit_signal == 0
> as "autoreap" too. As usual, most probably the answer is "no, because it
> is too late to change the historical behaviour". But this is off-topic.

Historical behavior, and potentially sensible behavior; you might not
want notification, but you might still want to get the child's exit
status by calling wait, which means you need the process to stick around
as a zombie until you wait on it.

That'd be the main advantage of adding a CLONE_AUTOREAP flag: it allows
you to get the same autoreaping behavior you'd get if you had SIGCHLD
ignored, but without actually sending a signal and without caring how
the process-wide signal handling is set up.  So you'd pass a 0 signal,
and CLONE_AUTOREAP.  And then if you *want* the exit notification, you
can get it via the file descriptor.
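
Since CLONE_FD and CLONE_AUTOREAP are only proposals in this series, the
following sketch does not call them.  It approximates "exit notification
via a file descriptor" with a plain pipe whose write end closes when the
child exits, and it still reaps with waitpid().  The names child_handle,
spawn_with_exit_fd and wait_on_exit_fd are assumptions for illustration,
not part of any patch.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical handle: the child's PID plus the fd that signals its exit. */
struct child_handle {
    pid_t pid;
    int exit_fd;
};

/* Start a child that exits with exit_code; the child holds the write end
 * of a pipe, so the read end sees a hangup once the child is gone. */
static int spawn_with_exit_fd(int exit_code, struct child_handle *out)
{
    int fds[2];

    if (pipe(fds) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        close(fds[0]);
        _exit(exit_code);       /* the write end is closed by the exit */
    }

    close(fds[1]);              /* parent keeps only the read end */
    out->pid = pid;
    out->exit_fd = fds[0];
    return 0;
}

/* Block on the fd (this could equally sit in an event loop), then collect
 * the exit status.  Returns the exit code, or -1 on error. */
static int wait_on_exit_fd(struct child_handle *h)
{
    struct pollfd pfd = { .fd = h->exit_fd, .events = POLLIN };
    int rc, status;

    do {
        rc = poll(&pfd, 1, -1);
    } while (rc < 0 && errno == EINTR);
    if (rc < 0)
        return -1;
    close(h->exit_fd);

    if (waitpid(h->pid, &status, 0) != h->pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Unlike the proposed clonefd, this emulation carries no exit status on the
descriptor, and the hangup is delayed if another process inherits the
write end; both gaps are part of the motivation for kernel support.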

- Josh Triplett

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
@ 2015-03-14 22:14                         ` Josh Triplett
  0 siblings, 0 replies; 83+ messages in thread
From: Josh Triplett @ 2015-03-14 22:14 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Thiago Macieira, Al Viro, Andrew Morton, Andy Lutomirski,
	Ingo Molnar, Kees Cook, Paul E. McKenney, H. Peter Anvin,
	Rik van Riel, Thomas Gleixner, Michael Kerrisk, linux-kernel,
	linux-api, linux-fsdevel, x86

On Sat, Mar 14, 2015 at 09:30:29PM +0100, Oleg Nesterov wrote:
> On 03/14, Josh Triplett wrote:
> >
> > What I was proposing was that a task that isn't yet dead, but that is
> > going to be autoreaped, is not eligible for waiting either.  All the
> > various wait* family of system calls should pretend it doesn't exist at
> > all, because returning an autoreaped task from a wait* call introduces a
> > race condition if the parent tries to *do* anything with the returned
> > PID.  If you launch a process with CLONE_FD, you need to manage it
> > exclusively with that fd, not with the wait* family of system calls.
> >
> > That also implies that the child-stop and child-continued mechanisms
> > (do_notify_parent_cldstop, WSTOPPED, WCONTINUED) should ignore the task
> > too.  In the future there could be a flag to clone4 that lets you get
> > stop and continue notifications through the file descriptor.
> 
> So far I strongly disagree, and I think that "autoreap" feature should
> not depend on CLONE_FD.

After reading your other mail and thinking about this, I agree that the
two can be separated; see my other mail for the details.

> Plus we should also discuss the reparenting. Ok, let me leave this
> discussion until I read 0/4 at least.

I think you can safely wait for v2 of the last patch, though the first
four patches should be almost completely identical in v2.

- Josh Triplett

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
  2015-03-14 22:03                 ` Josh Triplett
  (?)
@ 2015-03-14 22:26                 ` Thiago Macieira
  -1 siblings, 0 replies; 83+ messages in thread
From: Thiago Macieira @ 2015-03-14 22:26 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Oleg Nesterov, Al Viro, Andrew Morton, Andy Lutomirski,
	Ingo Molnar, Kees Cook, Paul E. McKenney, H. Peter Anvin,
	Rik van Riel, Thomas Gleixner, Michael Kerrisk, linux-kernel,
	linux-api, linux-fsdevel, x86

On Saturday 14 March 2015 15:03:08 Josh Triplett wrote:
> I had to think about this for a while, but I think it makes sense now.
> wait should *not* ever return the PID of an autoreaped process, because
> that would introduce a race condition (the caller cannot safely do
> *anything* with the PID of an autoreaped process, since by the time it
> does, the process may be gone and the PID may be reused).  However, that
> doesn't mean wait cannot block on the process, and then subsequently
> wake up and return -ECHILD (or keep waiting on some other child process
> if there is one).  That's apparently the semantic used with SA_NOCLDWAIT
> or if you have SIGCHLD set to SIG_IGN, and matching that seems
> appropriate.
> 
> Thiago, could your QProcess implementation handle that modified autoreap
> semantic?  The downside there is that if your calling process has a
> process-wide loop that waits for all processes (and explicitly passes
> the Linux-specific __WCLONE or __WALL flag, since your processes
> launched with a 0 signal would count as "clone" children), they'd get
> back the processes you launch, too.  (That would happen with your
> userspace-emulated version too for calls *without* __WCLONE or __WALL.)
> You'd still get the exit status you need via the clonefd, without a
> race, and you wouldn't need to touch process-wide signal handling, so I
> think this should still work and avoid any races.

I don't see why QProcess would have a problem. We don't have such a
process-wide wait loop with __WCLONE or __WALL and I can't think of any reason
why someone would do that and still expect NPTL to work. Or, put another way,
if they are using clone/clone4 directly and bypassing NPTL, they're probably in
a very specialised process that has no business running QProcess in the first
place. I wouldn't be too worried.

Inside glibc itself, __WCLONE is used only in unit tests and __WALL is used in 
a loop in elf/pldd.c, which is an independent application. Bionic has __WCLONE 
in tests only too.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor
  2015-03-13 19:42   ` Josh Triplett
@ 2015-03-15  8:55       ` David Drysdale
  2015-03-13 21:33     ` Andy Lutomirski
  2015-03-15  8:55       ` David Drysdale
  2 siblings, 0 replies; 83+ messages in thread
From: David Drysdale @ 2015-03-15  8:55 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Thiago Macieira, Michael Kerrisk, linux-kernel,
	Linux API, linux-fsdevel, X86 ML

On Fri, Mar 13, 2015 at 7:42 PM, Josh Triplett <josh@joshtriplett.org> wrote:
> On Fri, Mar 13, 2015 at 04:05:29PM +0000, David Drysdale wrote:
>> On Fri, Mar 13, 2015 at 1:40 AM, Josh Triplett <josh@joshtriplett.org> wrote:
>> > This patch series introduces a new clone flag, CLONE_FD, which lets the caller
>> > handle child process exit notification via a file descriptor rather than
>> > SIGCHLD.  CLONE_FD makes it possible for libraries to safely launch and manage
>> > child processes on behalf of their caller, *without* taking over process-wide
>> > SIGCHLD handling (either via signal handler or signalfd).
>>
>> Hi Josh,
>>
>> From the overall description (i.e. I haven't looked at the code yet)
>> this looks very interesting.  However, it seems to cover a lot of the
>> same ground as the process descriptor feature that was added to FreeBSD
>> in 9.x/10.x:
>>   https://www.freebsd.org/cgi/man.cgi?query=pdfork&sektion=2
>
> Interesting.
>
>> I think it would ideally be nice for a userspace library developer to be
>> able to do subprocess management (without SIGCHLD) in a similar way
>> across both platforms, without lots of complicated autoconf shenanigans.
>>
>> So could we look at the overlap and see if we can come up with
>> something that covers your requirements and also allows for something
>> that looks like FreeBSD's process descriptors?
>
> Agreed; however, I think it's reasonable to provide appropriate Linux
> system calls, and then let glibc or libbsd or similar provide the
> BSD-compatible calls on top of those.  I don't think the kernel
> interface needs to exactly match FreeBSD's, as long as it's a superset
> of the functionality.

Agreed -- if it's possible to implement equivalent process descriptor
functionality with a wrapper library, but the kernel interface is more
comprehensive and consistent with the rest of the Linux kernel, then
that's a big win.  So thanks for your work and for being willing to look
at the overlap!

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor
  2015-03-14 19:29                 ` Josh Triplett
@ 2015-03-15 10:18                   ` David Drysdale
  -1 siblings, 0 replies; 83+ messages in thread
From: David Drysdale @ 2015-03-15 10:18 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Thiago Macieira, Andy Lutomirski, Al Viro, Andrew Morton,
	Ingo Molnar, Kees Cook, Oleg Nesterov, Paul E. McKenney,
	H. Peter Anvin, Rik van Riel, Thomas Gleixner, Michael Kerrisk,
	linux-kernel, Linux API, Linux FS Devel, X86 ML

On Sat, Mar 14, 2015 at 7:29 PM, Josh Triplett <josh@joshtriplett.org> wrote:
> On Sat, Mar 14, 2015 at 12:03:12PM -0700, Thiago Macieira wrote:
>> On Friday 13 March 2015 18:11:32 Thiago Macieira wrote:
>> > On Friday 13 March 2015 14:51:47 Andy Lutomirski wrote:
>> > > In any event, we should find out what FreeBSD does in response to
>> > > read(2) on the fd.
>> >
>> > I've just successfully installed FreeBSD and compiled qtbase (main package
>> > of Qt 5) on it.
>> >
>> > I'll test pdfork during the weekend and report its behaviour.
>>
>> Here are my findings about pdfork.
>>
>> Source: http://fxr.watson.org/fxr/source/kern/sys_procdesc.c?v=FREEBSD10
>> Qt adaptations: https://codereview.qt-project.org/108561
>>
>> Processes created with pdfork() are normal processes that still send SIGCHLD
>> to their parents. The only difference is that you get the extra file descriptor
>> that can be passed to the pdgetpid() system call and works on select()/poll().
>> Trying to read from that file descriptor will result in EOPNOTSUPP.
>
> OK, since read() doesn't work on a pdfork() file descriptor, we don't
> have to worry about compatibility with pdfork()'s read result.
>
> However, if the expectation is that pdfork()ed child processes still
> send SIGCHLD, then I don't see how we can be compatible there, nor do I
> think we want to; as you mention below, that breaks the ability to
> encapsulate management of the created process entirely within a library.

I didn't think that was the case -- my understanding was that pdfork()ed
children would not generate SIGCHLD (and that does seem to be the
case with a quick test program).

As an aside, I do think there are some aspects of FreeBSD's process
descriptors that aren't quite right yet, particularly their interaction with
waitpid(-1, ...) -- IIRC pdfork()ed children are visible to it, but I'd expect
them not to be (to allow libraries to use sub-processes invisibly to the
programs using them). There's a thread at:
https://lists.cam.ac.uk/pipermail/cl-capsicum-discuss/2014-March/thread.html
but I'm not sure that anything came of that discussion.

As it happens, I'm meeting Robert Watson (one of the progenitors
of Capsicum/process descriptors) tomorrow, so I'll chase further.

>> Since they've never implemented pdwait4() (it's not even declared in the
>> headers), the only way to reap a child if you only have the file descriptor is
>> to first pdgetpid() and then call wait4() or wait6().
>
> Which suggests that we shouldn't try to implement pdwait4() in glibc
> until FreeBSD implements it in their kernel, since we won't know the
> exact semantics they expect.

By the way, I should point out one part of the FreeBSD design
which might help explain some of the semantics.

Process descriptors are particularly designed to be used with
Capsicum, which is a security framework where file descriptors
get extra rights associated with them, and the kernel polices
the use of those rights (e.g. you need CAP_READ for read(2)
operations; normal file descriptors implicitly have all of the
rights for back-compatibility).
  https://www.freebsd.org/cgi/man.cgi?query=capsicum&sektion=4

Capsicum also includes 'capability mode', where system calls
that access global namespaces are disabled -- including the
pid namespace.

So process descriptors are the only way to manipulate child
processes when a program is in capability mode -- and this
means that pdkill() is then genuinely needed over and above
kill(pdgetpid(),...).

>> If you don't pass PD_DAEMON, the child process gets killed with SIGKILL when
>> the file closes.
>
> OK, that makes sense.  We could certainly implement a
> CLONE_FD_KILL_ON_CLOSE flag with those semantics, if we want one in the
> future.
>
>> Conclusion:
>> Pros: this is the bare minimum that we'd need to disentangle the SIGCHLD mess.
>> As long as all child process activations use this feature, the problem is
>> solved.
>>
>> Cons: it requires cooperation from all child starters. If some other library
>> or the application installs a global SIGCHLD handler that waits on all child
>> processes, like libvlc used to do and Glib and Ecore still do, you won't be
>> able to get the child exit status.
>>
>> I have not tested what happens if you try to pass the file descriptor to other
>> processes (can you even do that on FreeBSD?). But even if you could and got
>> notifications, you couldn't wait on the child to get its exit status -- unless
>> they implement pdwait4.
>
> Even if they do implement pdwait4, they might not bypass the "must be
> the parent process" restriction.  Let's wait to see what semantics they
> go with.

Hmm, interesting point.  FreeBSD certainly allows FD passing, but
I'm not sure what the interactions are when it's a process descriptor
that's passed.

Given the object-capability background to Capsicum, I'd assume that a
holder of the process descriptor should be able to do whatever operations
are allowed by the rights associated with the descriptor (CAP_PDGETPID,
CAP_PDKILL and CAP_PDWAIT exist as specific rights allowing those
operations, and a non-restricted descriptor will have all of them by default).

But I'll add some test cases for this to the Capsicum test suite to check
whether theory matches practice...
  https://github.com/google/capsicum-test/blob/dev/procdesc.cc

>>  - pdfork: can be emulated with clone4 + CLONE_FD (+ CLONEFD_KILL_ON_CLOSE)
>>  - pdwait4: can be emulated with read()
>>  - pdgetpid: needs an ioctl
>>  - pdkill: needs an ioctl [or just write()]
>
> I think that should be a dedicated syscall, not an ioctl.
>
> It's unfortunate that rt_sigqueueinfo doesn't take a flags argument.
> However, I just realized that it takes a 32-bit "int" for the signal
> number, yet signal numbers fit in 8 bits.  So we could just add flags in
> the high 24 bits of that argument, and in particular add a flag
> indicating that the first argument is a file descriptor rather than a
> PID.
>
> - Josh Triplett
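
The flag-packing idea quoted above can be sketched as plain bit
manipulation.  Everything named here is hypothetical: SIGQ_SIGNO_MASK,
SIGQ_FLAG_PID_IS_FD and the helper functions are invented for
illustration and are not part of rt_sigqueueinfo or any kernel header.

```c
#include <assert.h>
#include <stdint.h>

/* HYPOTHETICAL layout: low 8 bits carry the signal number, the high
 * 24 bits carry flags.  None of these names exist in the kernel. */
#define SIGQ_SIGNO_MASK      0x000000ffu
#define SIGQ_FLAG_PID_IS_FD  (1u << 8)   /* first argument is an fd, not a PID */

/* Combine a signal number and flags into one 32-bit argument. */
static uint32_t sigq_pack(unsigned signo, uint32_t flags)
{
    assert(signo <= SIGQ_SIGNO_MASK);          /* must fit in 8 bits */
    assert((flags & SIGQ_SIGNO_MASK) == 0);    /* flags live above bit 7 */
    return flags | signo;
}

/* Extract the signal number from a packed argument. */
static unsigned sigq_signo(uint32_t arg)
{
    return arg & SIGQ_SIGNO_MASK;
}

/* Extract the flag bits from a packed argument. */
static uint32_t sigq_flags(uint32_t arg)
{
    return arg & ~SIGQ_SIGNO_MASK;
}
```

A caller that passes no flags produces exactly the value it passes today,
which is what would keep such an extension compatible with existing users.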

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor
  2015-03-15 10:18                   ` David Drysdale
  (?)
@ 2015-03-15 10:59                   ` Josh Triplett
  -1 siblings, 0 replies; 83+ messages in thread
From: Josh Triplett @ 2015-03-15 10:59 UTC (permalink / raw)
  To: David Drysdale
  Cc: Thiago Macieira, Andy Lutomirski, Al Viro, Andrew Morton,
	Ingo Molnar, Kees Cook, Oleg Nesterov, Paul E. McKenney,
	H. Peter Anvin, Rik van Riel, Thomas Gleixner, Michael Kerrisk,
	linux-kernel, Linux API, Linux FS Devel, X86 ML

On Sun, Mar 15, 2015 at 10:18:05AM +0000, David Drysdale wrote:
> On Sat, Mar 14, 2015 at 7:29 PM, Josh Triplett <josh@joshtriplett.org> wrote:
> > On Sat, Mar 14, 2015 at 12:03:12PM -0700, Thiago Macieira wrote:
> >> On Friday 13 March 2015 18:11:32 Thiago Macieira wrote:
> >> > On Friday 13 March 2015 14:51:47 Andy Lutomirski wrote:
> >> > > In any event, we should find out what FreeBSD does in response to
> >> > > read(2) on the fd.
> >> >
> >> > I've just successfully installed FreeBSD and compiled qtbase (main package
> >> > of Qt 5) on it.
> >> >
> >> > I'll test pdfork during the weekend and report its behaviour.
> >>
> >> Here are my findings about pdfork.
> >>
> >> Source: http://fxr.watson.org/fxr/source/kern/sys_procdesc.c?v=FREEBSD10
> >> Qt adaptations: https://codereview.qt-project.org/108561
> >>
> >> Processes created with pdfork() are normal processes that still send SIGCHLD
> >> to their parents. The only difference is that you get the extra file descriptor
> >> that can be passed to the pdgetpid() system call and works on select()/poll().
> >> Trying to read from that file descriptor will result in EOPNOTSUPP.
> >
> > OK, since read() doesn't work on a pdfork() file descriptor, we don't
> > have to worry about compatibility with pdfork()'s read result.
> >
> > However, if the expectation is that pdfork()ed child processes still
> > send SIGCHLD, then I don't see how we can be compatible there, nor do I
> > think we want to; as you mention below, that breaks the ability to
> > encapsulate management of the created process entirely within a library.
> 
> I didn't think that was the case -- my understanding was that pdfork()ed
> children would not generate SIGCHLD (and that does seem to be the
> case with a quick test program).

Well, either way, v2 of this series is capable of producing either
behavior.  You can have a clonefd and still receive SIGCHLD or any other
signal, or none at all, and you can decide independently from that if
you want autoreaping or waiting.

> As an aside, I do think there are some aspects of FreeBSD's process
> descriptors that aren't quite right yet, particularly their interaction with
> waitpid(-1, ...) -- IIRC pdfork()ed children are visible to it, but I'd expect
> them not to be (to allow libraries to use sub-processes invisibly to the
> programs using them). There's a thread at:
> https://lists.cam.ac.uk/pipermail/cl-capsicum-discuss/2014-March/thread.html
> but I'm not sure that anything came of that discussion.

As long as you don't use the Linux-specific flags __WALL or __WCLONE, a
process created with clone will be invisible to wait if it has an exit
signal other than SIGCHLD.  That's true independent of this patch
series.  So you can decide if you want processes visible to wait or not.
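
A minimal sketch of that existing behaviour, using the glibc clone()
wrapper with an exit signal of 0; it does not depend on this patch
series.  The names wait_result, child_main and probe_zero_signal_child
are assumptions for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for clone() */
#endif
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (64 * 1024)

/* Hypothetical result record for the two wait attempts. */
struct wait_result {
    int plain_errno;   /* errno from waitpid() without __WALL */
    int exit_code;     /* exit code collected with __WALL */
};

static int child_main(void *arg)
{
    (void)arg;
    return 7;                   /* becomes the child's exit code */
}

static int probe_zero_signal_child(struct wait_result *out)
{
    int status = 0;
    pid_t r;
    char *stack = malloc(CHILD_STACK_SIZE);

    if (!stack)
        return -1;

    /* flags == 0: separate address space, and an exit signal of 0. */
    pid_t pid = clone(child_main, stack + CHILD_STACK_SIZE, 0, NULL);
    if (pid < 0) {
        free(stack);
        return -1;
    }

    /* A plain wait does not consider the child at all. */
    errno = 0;
    r = waitpid(pid, &status, 0);
    out->plain_errno = (r < 0) ? errno : 0;

    /* With __WALL the same child is visible and can be reaped. */
    if (r != pid)
        r = waitpid(pid, &status, __WALL);
    free(stack);
    if (r != pid)
        return -1;

    out->exit_code = WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    return 0;
}
```

The plain waitpid() is expected to fail with ECHILD even though the child
exists, while the __WALL call returns its exit code.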

> As it happens, I'm meeting Robert Watson (one of the progenitors
> of Capsicum/process descriptors) tomorrow, so I'll chase further.

Sounds good.

> >> Since they've never implemented pdwait4() (it's not even declared in the
> >> headers), the only way to reap a child if you only have the file descriptor is
> >> to first pdgetpid() and then call wait4() or wait6().
> >
> > Which suggests that we shouldn't try to implement pdwait4() in glibc
> > until FreeBSD implements it in their kernel, since we won't know the
> > exact semantics they expect.
> 
> By the way, I should point out one part of the FreeBSD design
> which might help explain some of the semantics.
> 
> Process descriptors are particularly designed to be used with
> Capsicum, which is a security framework where file descriptors
> get extra rights associated with them, and the kernel polices
> the use of those rights (e.g. you need CAP_READ for read(2)
> operations; normal file descriptors implicitly have all of the
> rights for back-compatibility).
>   https://www.freebsd.org/cgi/man.cgi?query=capsicum&sektion=4
> 
> Capsicum also includes 'capability mode', where system calls
> that access global namespaces are disabled -- including the
> pid namespace.
> 
> So process descriptors are the only way to manipulate child
> processes when a program is in capability mode -- and this
> means that pdkill() is then genuinely needed over and above
> kill(pdgetpid(),...).

Thanks for the explanation.  I've seen some details about Capsicum, and
I found it quite interesting.  I'm particularly interested in the notion
of getting rid of global namespaces in favor of descriptors or similar
mechanisms that you need specific rights to.

Does Capsicum do anything to eliminate the global namespace of UIDs and
GIDs?

> >> If you don't pass PD_DAEMON, the child process gets killed with SIGKILL when
> >> the file closes.
> >
> > OK, that makes sense.  We could certainly implement a
> > CLONE_FD_KILL_ON_CLOSE flag with those semantics, if we want one in the
> > future.
> >
> >> Conclusion:
> >> Pros: this is the bare minimum that we'd need to disentangle the SIGCHLD mess.
> >> As long as all child process activations use this feature, the problem is
> >> solved.
> >>
> >> Cons: it requires cooperation from all child starters. If some other library
> >> or the application installs a global SIGCHLD handler that waits on all child
> >> processes, like libvlc used to do and Glib and Ecore still do, you won't be
> >> able to get the child exit status.
> >>
> >> I have not tested what happens if you try to pass the file descriptor to other
> >> processes (can you even do that on FreeBSD?). But even if you could and got
> >> notifications, you couldn't wait on the child to get its exit status -- unless
> >> they implement pdwait4.
> >
> > Even if they do implement pdwait4, they might not bypass the "must be
> > the parent process" restriction.  Let's wait to see what semantics they
> > go with.
> 
> Hmm, interesting point.  FreeBSD certainly allows FD passing, but
> I'm not sure what the interactions are when it's a process descriptor
> that's passed.
> 
> Given the object-capability background to Capsicum, I'd assume that a
> holder of the process descriptor should be able to do whatever operations
> are allowed by the rights associated with the descriptor (CAP_PDGETPID,
> CAP_PDKILL and CAP_PDWAIT exist as specific rights allowing those
> operations, and a non-restricted descriptor will have all of them by default).

Possibly, but given that pdwait4 isn't actually implemented yet, it
wouldn't surprise me if the future implementation looks up the process
and then calls the same internal function that wait4 does, with the same
"must be the parent process" restriction.

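For reference, here is a small sketch of how that "must be the parent
process" restriction shows up with today's waitpid() on Linux.  It says
nothing about what FreeBSD will do for pdwait4; the hypothetical helper
sibling_wait_errno() just forks two siblings and has one try to wait on
the other.  The helper name and the choice to report errno through the
sibling's exit status are assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns the errno that a non-parent process sees
 * when it calls waitpid() on a sibling, or -1 if the setup fails.
 */
int sibling_wait_errno(void)
{
    /* First child: the wait target; it stays alive until killed. */
    pid_t target = fork();
    if (target < 0)
        return -1;
    if (target == 0) {
        pause();
        _exit(0);
    }

    /* Second child: a sibling of the target, not its parent. */
    pid_t sibling = fork();
    if (sibling < 0) {
        kill(target, SIGKILL);
        waitpid(target, NULL, 0);
        return -1;
    }
    if (sibling == 0) {
        /* The target is alive, but it is not a child of this process. */
        int err = (waitpid(target, NULL, WNOHANG) == -1) ? errno : 0;
        _exit(err);   /* small errno values fit in the 8-bit exit status */
    }

    /* Only the real parent can collect either child. */
    int status = 0;
    pid_t r = waitpid(sibling, &status, 0);

    kill(target, SIGKILL);
    waitpid(target, NULL, 0);

    if (r != sibling || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

On Linux the sibling's waitpid() is expected to fail with ECHILD, since
the target is not one of its children.
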
> But I'll add some test cases for this to the Capsicum test suite to check
> whether theory matches practice...
>   https://github.com/google/capsicum-test/blob/dev/procdesc.cc

Excellent; that seems like a good way to make sure the current and
future behavior matches expectations.

- Josh Triplett


