All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/7] CLONE_FD: Task exit notification via file descriptor
@ 2015-03-15  7:59 ` Josh Triplett
  0 siblings, 0 replies; 63+ messages in thread
From: Josh Triplett @ 2015-03-15  7:59 UTC (permalink / raw)
  To: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, Thiago Macieira, linux-kernel,
	linux-api, linux-fsdevel, x86

This patch series introduces a new clone flag, CLONE_FD, which lets the caller
receive child process exit notification via a file descriptor rather than
SIGCHLD.  CLONE_FD makes it possible for libraries to safely launch and manage
child processes on behalf of their caller, *without* taking over process-wide
SIGCHLD handling (either via signal handler or signalfd).

Note that signalfd for SIGCHLD does not suffice here, because that still
receives notification for all child processes, and interferes with process-wide
signal handling.

The CLONE_FD file descriptor uniquely identifies a process on the system in a
race-free way, by holding a reference to the task_struct.  In the future, we
may introduce APIs that support using process file descriptors instead of PIDs.

This patch series also introduces a clone flag CLONE_AUTOREAP, which causes the
kernel to automatically reap the child process when it exits, just as it does
for processes using SIGCHLD when the parent has SIGCHLD ignored or marked as
SA_NOCLDSTOP.

Taken together, a library can launch a process with CLONE_FD, CLONE_AUTOREAP,
and no exit signal, and completely avoid affecting either process-wide signal
handling or an existing child wait loop.

Introducing CLONE_FD and CLONE_AUTOREAP required two additional bits of yak
shaving: Since clone has no more usable flags (with the three currently unused
flags unusable because old kernels ignore them without EINVAL), also introduce
a new clone4 system call with more flag bits and an extensible argument
structure.  And since the magic pt_regs-based syscall argument processing for
clone's tls argument would otherwise prevent introducing a sane clone4 system
call, fix that too.

I tested the CLONE_SETTLS changes with a thread-local storage test program (two
threads independently reading and writing a __thread variable), on both 32-bit
and 64-bit, and I observed no issues there.

I tested clone4 and the new flags with several additional test programs,
launching either a process or thread (in the former case using syscall(), in
the latter case by calling clone4 via assembly and returning to C), sleeping in
parent and child to test the case of either exiting first, and then printing
the received clone4_info structure.

Changes in v2:
- Split out autoreaping into a separate CLONE_AUTOREAP.  CLONE_FD no longer
  implies autoreaping and no exit signal, and CLONE_AUTOREAP does not affect
  ptracers or signal handling.  Thanks to Oleg Nesterov for careful
  investigation and discussion on v1.
- Accept O_CLOEXEC and O_NONBLOCK via a clonefd_flags parameter in clone4_args.
  Stop overloading the low byte of the main clone flags, since CLONE_FD now
  works with a non-zero signal.
- Return the file descriptor via an out parameter in clone4_args.
- Drop patch to export alloc_fd; CLONE_FD now uses the next available file
  descriptor, even if that's 0-2, since clone4 no longer needs to avoid
  ambiguity with the 0 return indicating the child process.
- Make poll on a CLONE_FD for an exited task also return POLLHUP, for
  compatibility with FreeBSD's pdfork.  Thanks to David Drysdale for calling
  attention to pdfork.
- Fix typo in squelch_clone_flags.
- Pass arguments to _do_fork and copy_process as a structure.
- Construct the 64-bit flags in a separate variable, rather than inline in the
  call to do_fork.
- Fix error return for copy_from_user faults.
- Add the new syscall to asm-generic.
- Add ack from Andy Lutomirski to patches 1 and 2.

I've included the manpages patch at the end of this series.  (Note that the
manpage documents the behavior of the future glibc wrapper as well as the raw
syscall.)  Here's a formatted plain-text version of the manpage for reference:

CLONE4(2)                  Linux Programmer's Manual                 CLONE4(2)



NAME
       clone4 - create a child process

SYNOPSIS
       /* Prototype for the glibc wrapper function */

       #define _GNU_SOURCE
       #include <sched.h>

       int clone4(uint64_t flags,
                  size_t args_size,
                  struct clone4_args *args,
                  int (*fn)(void *), void *arg);

       /* Prototype for the raw system call */

       int clone4(unsigned flags_high, unsigned flags_low,
                  unsigned long args_size,
                  struct clone4_args *args);

       struct clone4_args {
           pid_t *ptid;
           pid_t *ctid;
           unsigned long stack_start;
           unsigned long stack_size;
           unsigned long tls;
           int *clonefd;
           unsigned clonefd_flags;
       };


DESCRIPTION
       clone4()  creates  a  new  process,  similar  to  clone(2) and fork(2).
       clone4() supports additional flags that clone(2) does not, and  accepts
       arguments via an extensible structure.

       args  points to a clone4_args structure, and args_size must contain the
       size of that structure, as understood by the  caller.   If  the  caller
       passes  a  shorter  structure  than  the  kernel expects, the remaining
       fields will default to 0.  If the caller passes a larger structure than
       the  kernel  expects  (such  as one from a newer kernel), clone4() will
       return EINVAL.  The clone4_args structure may gain additional fields at
       the  end  in  the future, and callers must only pass a size that encom‐
       passes the number of fields they understand.  If the  caller  passes  0
       for args_size, args is ignored and may be NULL.

       In  the clone4_args structure, ptid, ctid, stack_start, stack_size, and
       tls have the same semantics as they do with clone(2) and clone2(2).

       In the glibc wrapper, fn and arg have the same  semantics  as  they  do
       with clone(2).  As with clone(2), the underlying system call works more
       like fork(2), returning 0 in the child process; the glibc wrapper  sim‐
       plifies  thread execution by calling fn(arg) and exiting the child when
       that function exits.

       The 64-bit  flags  argument  (split  into  the  32-bit  flags_high  and
       flags_low  arguments  in  the  kernel  interface for portability across
       architectures) accepts all the same flags as clone(2), with the  excep‐
       tion  of the obsolete CLONE_PID, CLONE_DETACHED, and CLONE_STOPPED.  In
       addition, flags accepts the following flags:


       CLONE_AUTOREAP
              When the new process exits, immediately  reap  it,  rather  than
              keeping  it  around  as a "zombie" until a call to waitpid(2) or
              similar.  Without this flag, the kernel will automatically  reap
              a  process if its exit signal is set to SIGCHLD, and if the par‐
              ent process has SIGCHLD set to SIG_IGN or has a SIGCHLD  handler
              installed  with SA_NOCLDWAIT (see sigaction(2)).  CLONE_AUTOREAP
              allows the calling process to enable automatic reaping  with  an
              exit  signal other than SIGCHLD (including 0 to disable the exit
              signal), and does not depend on the  configuration  of  process-
              wide signal handling.


       CLONE_FD
              Return  a file descriptor associated with the new process, stor‐
              ing it in location clonefd in the parent's address space.   When
              the new process exits, the file descriptor will become available
              for reading.

              Unlike using  signalfd(2)  for  the  SIGCHLD  signal,  the  file
              descriptor  returned  by  clone4()  with the CLONE_FD flag works
              even with SIGCHLD unblocked in one or more threads of the parent
              process,  allowing  the  process  to have different handlers for
              different child processes, such as those created by  a  library,
              without  introducing  race conditions around process-wide signal
              handling.

              clonefd_flags may contain the following additional flags for use
              with CLONE_FD:


              O_CLOEXEC
                     Set  the  close-on-exec  flag on the new file descriptor.
                     See the description of the O_CLOEXEC flag in open(2)  for
                     reasons why this may be useful.


              O_NONBLOCK
                     Set  the  O_NONBLOCK  flag  on  the  new file descriptor.
                     Using this flag saves extra calls to fcntl(2) to  achieve
                     the same result.


              The returned file descriptor supports the following operations:

              read(2) (and similar)
                     When  the  new  process  exits,  reading  from  the  file
                     descriptor produces a single clonefd_info structure:

                     struct clonefd_info {
                         uint32_t code;   /* Signal code */
                         uint32_t status; /* Exit status or signal */
                         uint64_t utime;  /* User CPU time */
                         uint64_t stime;  /* System CPU time */
                     };


                     If the new process has not  yet  exited,  read(2)  either
                     blocks  until  it does, or fails with the error EAGAIN if
                     the file descriptor has O_NONBLOCK set.

                     Future kernels may extend clonefd_info by appending addi‐
                     tional  fields  to  the end.  Callers should read as many
                     bytes as they understand; unread data will be  discarded,
                     and  subsequent  reads  after  the first will return 0 to
                     indicate end-of-file.  Callers requesting more bytes than
                     the  kernel  provides  (such as callers expecting a newer
                     clonefd_info structure) will receive a shorter  structure
                     from older kernels.

              poll(2), select(2), epoll(7) (and similar)
                     The  file  descriptor  is readable (the select(2) readfds
                     argument; the poll(2) POLLIN flag) if the new process has
                     exited.

              close(2)
                     When  the file descriptor is no longer required it should
                     be closed.


   C library/kernel ABI differences
       As with clone(2), the raw clone4() system call corresponds more closely
       to  fork(2)  in that execution in the child continues from the point of
       the call.

       Unlike clone(2), the raw system call  interface  for  clone4()  accepts
       arguments in the same order on all architectures.

       The  raw  system call accepts flags as two 32-bit arguments, flags_high
       and flags_low, to simplify portability across 32-bit and 64-bit  archi‐
       tectures and calling conventions.  The glibc wrapper accepts flags as a
       single 64-bit argument for convenience.


RETURN VALUE
       For the glibc wrapper, on success, clone4() returns the new process  ID
       to the calling process, and the new process begins running at the spec‐
       ified function.

       For the raw syscall, on success, clone4() returns the new process ID to
       the calling process, and returns 0 in the new process.

       On failure, clone4() returns -1 and sets errno accordingly.


ERRORS
       clone4()  can  return any error from clone(2), as well as the following
       additional errors:

       EFAULT args is outside your accessible address space.

       EINVAL flags contained an unknown flag.

       EINVAL flags included CLONE_FD and clonefd_flags contained  an  unknown
              flag.

       EINVAL flags  included  CLONE_FD, but the kernel configuration does not
              have the CONFIG_CLONEFD option enabled.

       EMFILE flags included CLONE_FD,  but  the  new  file  descriptor  would
              exceed the process limit on open file descriptors.

       ENFILE flags  included  CLONE_FD,  but  the  new  file descriptor would
              exceed the system-wide limit on open file descriptors.

       ENODEV flags included  CLONE_FD,  but  clone4()  could  not  mount  the
              (internal) anonymous inode device.


CONFORMING TO
       clone4()  is Linux-specific and should not be used in programs intended
       to be portable.


SEE ALSO
       clone(2), epoll(7), poll(2), pthreads(7), read(2), select(2)



Linux                             2015-03-14                         CLONE4(2)


Josh Triplett and Thiago Macieira (7):
  clone: Support passing tls argument via C rather than pt_regs magic
  x86: Opt into HAVE_COPY_THREAD_TLS, for both 32-bit and 64-bit
  Introduce a new clone4 syscall with more flag bits and extensible arguments
  kernel/fork.c: Pass arguments to _do_fork and copy_process using clone4_args
  clone4: Add a CLONE_AUTOREAP flag to automatically reap the child process
  signal: Factor out a helper function to process task_struct exit_code
  clone4: Add a CLONE_FD flag to get task exit notification via fd

 arch/Kconfig                      |   7 ++
 arch/x86/Kconfig                  |   1 +
 arch/x86/ia32/ia32entry.S         |   3 +-
 arch/x86/kernel/entry_64.S        |   1 +
 arch/x86/kernel/process_32.c      |   6 +-
 arch/x86/kernel/process_64.c      |   8 +--
 arch/x86/syscalls/syscall_32.tbl  |   1 +
 arch/x86/syscalls/syscall_64.tbl  |   2 +
 include/linux/compat.h            |  14 ++++
 include/linux/sched.h             |  22 ++++++
 include/linux/syscalls.h          |   6 +-
 include/uapi/asm-generic/unistd.h |   4 +-
 include/uapi/linux/sched.h        |  55 ++++++++++++++-
 init/Kconfig                      |  21 ++++++
 kernel/Makefile                   |   1 +
 kernel/clonefd.c                  | 121 ++++++++++++++++++++++++++++++++
 kernel/clonefd.h                  |  32 +++++++++
 kernel/exit.c                     |   4 ++
 kernel/fork.c                     | 142 ++++++++++++++++++++++++++++++--------
 kernel/signal.c                   |  26 ++++---
 kernel/sys_ni.c                   |   1 +
 21 files changed, 426 insertions(+), 52 deletions(-)
 create mode 100644 kernel/clonefd.c
 create mode 100644 kernel/clonefd.h

-- 
2.1.4


^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH v2 0/7] CLONE_FD: Task exit notification via file descriptor
@ 2015-03-15  7:59 ` Josh Triplett
  0 siblings, 0 replies; 63+ messages in thread
From: Josh Triplett @ 2015-03-15  7:59 UTC (permalink / raw)
  To: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, Thiago Macieira,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A

This patch series introduces a new clone flag, CLONE_FD, which lets the caller
receive child process exit notification via a file descriptor rather than
SIGCHLD.  CLONE_FD makes it possible for libraries to safely launch and manage
child processes on behalf of their caller, *without* taking over process-wide
SIGCHLD handling (either via signal handler or signalfd).

Note that signalfd for SIGCHLD does not suffice here, because that still
receives notification for all child processes, and interferes with process-wide
signal handling.

The CLONE_FD file descriptor uniquely identifies a process on the system in a
race-free way, by holding a reference to the task_struct.  In the future, we
may introduce APIs that support using process file descriptors instead of PIDs.

This patch series also introduces a clone flag CLONE_AUTOREAP, which causes the
kernel to automatically reap the child process when it exits, just as it does
for processes using SIGCHLD when the parent has SIGCHLD ignored or marked as
SA_NOCLDSTOP.

Taken together, a library can launch a process with CLONE_FD, CLONE_AUTOREAP,
and no exit signal, and completely avoid affecting either process-wide signal
handling or an existing child wait loop.

Introducing CLONE_FD and CLONE_AUTOREAP required two additional bits of yak
shaving: Since clone has no more usable flags (with the three currently unused
flags unusable because old kernels ignore them without EINVAL), also introduce
a new clone4 system call with more flag bits and an extensible argument
structure.  And since the magic pt_regs-based syscall argument processing for
clone's tls argument would otherwise prevent introducing a sane clone4 system
call, fix that too.

I tested the CLONE_SETTLS changes with a thread-local storage test program (two
threads independently reading and writing a __thread variable), on both 32-bit
and 64-bit, and I observed no issues there.

I tested clone4 and the new flags with several additional test programs,
launching either a process or thread (in the former case using syscall(), in
the latter case by calling clone4 via assembly and returning to C), sleeping in
parent and child to test the case of either exiting first, and then printing
the received clone4_info structure.

Changes in v2:
- Split out autoreaping into a separate CLONE_AUTOREAP.  CLONE_FD no longer
  implies autoreaping and no exit signal, and CLONE_AUTOREAP does not affect
  ptracers or signal handling.  Thanks to Oleg Nesterov for careful
  investigation and discussion on v1.
- Accept O_CLOEXEC and O_NONBLOCK via a clonefd_flags parameter in clone4_args.
  Stop overloading the low byte of the main clone flags, since CLONE_FD now
  works with a non-zero signal.
- Return the file descriptor via an out parameter in clone4_args.
- Drop patch to export alloc_fd; CLONE_FD now uses the next available file
  descriptor, even if that's 0-2, since clone4 no longer needs to avoid
  ambiguity with the 0 return indicating the child process.
- Make poll on a CLONE_FD for an exited task also return POLLHUP, for
  compatibility with FreeBSD's pdfork.  Thanks to David Drysdale for calling
  attention to pdfork.
- Fix typo in squelch_clone_flags.
- Pass arguments to _do_fork and copy_process as a structure.
- Construct the 64-bit flags in a separate variable, rather than inline in the
  call to do_fork.
- Fix error return for copy_from_user faults.
- Add the new syscall to asm-generic.
- Add ack from Andy Lutomirski to patches 1 and 2.

I've included the manpages patch at the end of this series.  (Note that the
manpage documents the behavior of the future glibc wrapper as well as the raw
syscall.)  Here's a formatted plain-text version of the manpage for reference:

CLONE4(2)                  Linux Programmer's Manual                 CLONE4(2)



NAME
       clone4 - create a child process

SYNOPSIS
       /* Prototype for the glibc wrapper function */

       #define _GNU_SOURCE
       #include <sched.h>

       int clone4(uint64_t flags,
                  size_t args_size,
                  struct clone4_args *args,
                  int (*fn)(void *), void *arg);

       /* Prototype for the raw system call */

       int clone4(unsigned flags_high, unsigned flags_low,
                  unsigned long args_size,
                  struct clone4_args *args);

       struct clone4_args {
           pid_t *ptid;
           pid_t *ctid;
           unsigned long stack_start;
           unsigned long stack_size;
           unsigned long tls;
           int *clonefd;
           unsigned clonefd_flags;
       };


DESCRIPTION
       clone4()  creates  a  new  process,  similar  to  clone(2) and fork(2).
       clone4() supports additional flags that clone(2) does not, and  accepts
       arguments via an extensible structure.

       args  points to a clone4_args structure, and args_size must contain the
       size of that structure, as understood by the  caller.   If  the  caller
       passes  a  shorter  structure  than  the  kernel expects, the remaining
       fields will default to 0.  If the caller passes a larger structure than
       the  kernel  expects  (such  as one from a newer kernel), clone4() will
       return EINVAL.  The clone4_args structure may gain additional fields at
       the  end  in  the future, and callers must only pass a size that encom‐
       passes the number of fields they understand.  If the  caller  passes  0
       for args_size, args is ignored and may be NULL.

       In  the clone4_args structure, ptid, ctid, stack_start, stack_size, and
       tls have the same semantics as they do with clone(2) and clone2(2).

       In the glibc wrapper, fn and arg have the same  semantics  as  they  do
       with clone(2).  As with clone(2), the underlying system call works more
       like fork(2), returning 0 in the child process; the glibc wrapper  sim‐
       plifies  thread execution by calling fn(arg) and exiting the child when
       that function exits.

       The 64-bit  flags  argument  (split  into  the  32-bit  flags_high  and
       flags_low  arguments  in  the  kernel  interface for portability across
       architectures) accepts all the same flags as clone(2), with the  excep‐
       tion  of the obsolete CLONE_PID, CLONE_DETACHED, and CLONE_STOPPED.  In
       addition, flags accepts the following flags:


       CLONE_AUTOREAP
              When the new process exits, immediately  reap  it,  rather  than
              keeping  it  around  as a "zombie" until a call to waitpid(2) or
              similar.  Without this flag, the kernel will automatically  reap
              a  process if its exit signal is set to SIGCHLD, and if the par‐
              ent process has SIGCHLD set to SIG_IGN or has a SIGCHLD  handler
              installed  with SA_NOCLDWAIT (see sigaction(2)).  CLONE_AUTOREAP
              allows the calling process to enable automatic reaping  with  an
              exit  signal other than SIGCHLD (including 0 to disable the exit
              signal), and does not depend on the  configuration  of  process-
              wide signal handling.


       CLONE_FD
              Return  a file descriptor associated with the new process, stor‐
              ing it in location clonefd in the parent's address space.   When
              the new process exits, the file descriptor will become available
              for reading.

              Unlike using  signalfd(2)  for  the  SIGCHLD  signal,  the  file
              descriptor  returned  by  clone4()  with the CLONE_FD flag works
              even with SIGCHLD unblocked in one or more threads of the parent
              process,  allowing  the  process  to have different handlers for
              different child processes, such as those created by  a  library,
              without  introducing  race conditions around process-wide signal
              handling.

              clonefd_flags may contain the following additional flags for use
              with CLONE_FD:


              O_CLOEXEC
                     Set  the  close-on-exec  flag on the new file descriptor.
                     See the description of the O_CLOEXEC flag in open(2)  for
                     reasons why this may be useful.


              O_NONBLOCK
                     Set  the  O_NONBLOCK  flag  on  the  new file descriptor.
                     Using this flag saves extra calls to fcntl(2) to  achieve
                     the same result.


              The returned file descriptor supports the following operations:

              read(2) (and similar)
                     When  the  new  process  exits,  reading  from  the  file
                     descriptor produces a single clonefd_info structure:

                     struct clonefd_info {
                         uint32_t code;   /* Signal code */
                         uint32_t status; /* Exit status or signal */
                         uint64_t utime;  /* User CPU time */
                         uint64_t stime;  /* System CPU time */
                     };


                     If the new process has not  yet  exited,  read(2)  either
                     blocks  until  it does, or fails with the error EAGAIN if
                     the file descriptor has O_NONBLOCK set.

                     Future kernels may extend clonefd_info by appending addi‐
                     tional  fields  to  the end.  Callers should read as many
                     bytes as they understand; unread data will be  discarded,
                     and  subsequent  reads  after  the first will return 0 to
                     indicate end-of-file.  Callers requesting more bytes than
                     the  kernel  provides  (such as callers expecting a newer
                     clonefd_info structure) will receive a shorter  structure
                     from older kernels.

              poll(2), select(2), epoll(7) (and similar)
                     The  file  descriptor  is readable (the select(2) readfds
                     argument; the poll(2) POLLIN flag) if the new process has
                     exited.

              close(2)
                     When  the file descriptor is no longer required it should
                     be closed.


   C library/kernel ABI differences
       As with clone(2), the raw clone4() system call corresponds more closely
       to  fork(2)  in that execution in the child continues from the point of
       the call.

       Unlike clone(2), the raw system call  interface  for  clone4()  accepts
       arguments in the same order on all architectures.

       The  raw  system call accepts flags as two 32-bit arguments, flags_high
       and flags_low, to simplify portability across 32-bit and 64-bit  archi‐
       tectures and calling conventions.  The glibc wrapper accepts flags as a
       single 64-bit argument for convenience.


RETURN VALUE
       For the glibc wrapper, on success, clone4() returns the new process  ID
       to the calling process, and the new process begins running at the spec‐
       ified function.

       For the raw syscall, on success, clone4() returns the new process ID to
       the calling process, and returns 0 in the new process.

       On failure, clone4() returns -1 and sets errno accordingly.


ERRORS
       clone4()  can  return any error from clone(2), as well as the following
       additional errors:

       EFAULT args is outside your accessible address space.

       EINVAL flags contained an unknown flag.

       EINVAL flags included CLONE_FD and clonefd_flags contained  an  unknown
              flag.

       EINVAL flags  included  CLONE_FD, but the kernel configuration does not
              have the CONFIG_CLONEFD option enabled.

       EMFILE flags included CLONE_FD,  but  the  new  file  descriptor  would
              exceed the process limit on open file descriptors.

       ENFILE flags  included  CLONE_FD,  but  the  new  file descriptor would
              exceed the system-wide limit on open file descriptors.

       ENODEV flags included  CLONE_FD,  but  clone4()  could  not  mount  the
              (internal) anonymous inode device.


CONFORMING TO
       clone4()  is Linux-specific and should not be used in programs intended
       to be portable.


SEE ALSO
       clone(2), epoll(7), poll(2), pthreads(7), read(2), select(2)



Linux                             2015-03-14                         CLONE4(2)


Josh Triplett and Thiago Macieira (7):
  clone: Support passing tls argument via C rather than pt_regs magic
  x86: Opt into HAVE_COPY_THREAD_TLS, for both 32-bit and 64-bit
  Introduce a new clone4 syscall with more flag bits and extensible arguments
  kernel/fork.c: Pass arguments to _do_fork and copy_process using clone4_args
  clone4: Add a CLONE_AUTOREAP flag to automatically reap the child process
  signal: Factor out a helper function to process task_struct exit_code
  clone4: Add a CLONE_FD flag to get task exit notification via fd

 arch/Kconfig                      |   7 ++
 arch/x86/Kconfig                  |   1 +
 arch/x86/ia32/ia32entry.S         |   3 +-
 arch/x86/kernel/entry_64.S        |   1 +
 arch/x86/kernel/process_32.c      |   6 +-
 arch/x86/kernel/process_64.c      |   8 +--
 arch/x86/syscalls/syscall_32.tbl  |   1 +
 arch/x86/syscalls/syscall_64.tbl  |   2 +
 include/linux/compat.h            |  14 ++++
 include/linux/sched.h             |  22 ++++++
 include/linux/syscalls.h          |   6 +-
 include/uapi/asm-generic/unistd.h |   4 +-
 include/uapi/linux/sched.h        |  55 ++++++++++++++-
 init/Kconfig                      |  21 ++++++
 kernel/Makefile                   |   1 +
 kernel/clonefd.c                  | 121 ++++++++++++++++++++++++++++++++
 kernel/clonefd.h                  |  32 +++++++++
 kernel/exit.c                     |   4 ++
 kernel/fork.c                     | 142 ++++++++++++++++++++++++++++++--------
 kernel/signal.c                   |  26 ++++---
 kernel/sys_ni.c                   |   1 +
 21 files changed, 426 insertions(+), 52 deletions(-)
 create mode 100644 kernel/clonefd.c
 create mode 100644 kernel/clonefd.h

-- 
2.1.4

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH v2 1/7] clone: Support passing tls argument via C rather than pt_regs magic
  2015-03-15  7:59 ` Josh Triplett
  (?)
@ 2015-03-15  7:59 ` Josh Triplett
  -1 siblings, 0 replies; 63+ messages in thread
From: Josh Triplett @ 2015-03-15  7:59 UTC (permalink / raw)
  To: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, Thiago Macieira, linux-kernel,
	linux-api, linux-fsdevel, x86

clone with CLONE_SETTLS accepts an argument to set the thread-local
storage area for the new thread.  sys_clone declares an int argument
tls_val in the appropriate point in the argument list (based on the
various CLONE_BACKWARDS variants), but doesn't actually use or pass
along that argument.  Instead, sys_clone calls do_fork, which calls
copy_process, which calls the arch-specific copy_thread, and copy_thread
pulls the corresponding syscall argument out of the pt_regs captured at
kernel entry (knowing what argument of clone that architecture passes
tls in).

Apart from being awful and inscrutable, that also only works because
only one code path into copy_thread can pass the CLONE_SETTLS flag, and
that code path comes from sys_clone with its architecture-specific
argument-passing order.  This prevents introducing a new version of the
clone system call without propagating the same architecture-specific
position of the tls argument.

However, there's no reason to pull the argument out of pt_regs when
sys_clone could just pass it down via C function call arguments.

Introduce a new CONFIG_HAVE_COPY_THREAD_TLS for architectures to opt
into, and a new copy_thread_tls that accepts the tls parameter as an
additional unsigned long (syscall-argument-sized) argument.
Change sys_clone's tls argument to an unsigned long (which does
not change the ABI), and pass that down to copy_thread_tls.

Architectures that don't opt into copy_thread_tls will continue to
ignore the C argument to sys_clone in favor of the pt_regs captured at
kernel entry, and thus will be unable to introduce new versions of the
clone syscall.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Thiago Macieira <thiago.macieira@intel.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
---
 arch/Kconfig             |  7 ++++++
 include/linux/sched.h    | 14 ++++++++++++
 include/linux/syscalls.h |  6 +++---
 kernel/fork.c            | 55 +++++++++++++++++++++++++++++++-----------------
 4 files changed, 60 insertions(+), 22 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 05d7a8a..4834a58 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -484,6 +484,13 @@ config HAVE_IRQ_EXIT_ON_IRQ_STACK
 	  This spares a stack switch and improves cache usage on softirq
 	  processing.
 
+config HAVE_COPY_THREAD_TLS
+	bool
+	help
+	  Architecture provides copy_thread_tls to accept tls argument via
+	  normal C parameter passing, rather than extracting the syscall
+	  argument from pt_regs.
+
 #
 # ABI hall of shame
 #
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d77432..9ec36fd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2479,8 +2479,22 @@ extern struct mm_struct *mm_access(struct task_struct *task, unsigned int mode);
 /* Remove the current tasks stale references to the old mm_struct */
 extern void mm_release(struct task_struct *, struct mm_struct *);
 
+#ifdef CONFIG_HAVE_COPY_THREAD_TLS
+extern int copy_thread_tls(unsigned long, unsigned long, unsigned long,
+			struct task_struct *, unsigned long);
+#else
 extern int copy_thread(unsigned long, unsigned long, unsigned long,
 			struct task_struct *);
+
+/* Architectures that haven't opted into copy_thread_tls get the tls argument
+ * via pt_regs, so ignore the tls argument passed via C. */
+static inline int copy_thread_tls(
+		unsigned long clone_flags, unsigned long sp, unsigned long arg,
+		struct task_struct *p, unsigned long tls)
+{
+	return copy_thread(clone_flags, sp, arg, p);
+}
+#endif
 extern void flush_thread(void);
 extern void exit_thread(void);
 
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 76d1e38..bb51bec 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -827,15 +827,15 @@ asmlinkage long sys_syncfs(int fd);
 asmlinkage long sys_fork(void);
 asmlinkage long sys_vfork(void);
 #ifdef CONFIG_CLONE_BACKWARDS
-asmlinkage long sys_clone(unsigned long, unsigned long, int __user *, int,
+asmlinkage long sys_clone(unsigned long, unsigned long, int __user *, unsigned long,
 	       int __user *);
 #else
 #ifdef CONFIG_CLONE_BACKWARDS3
 asmlinkage long sys_clone(unsigned long, unsigned long, int, int __user *,
-			  int __user *, int);
+			  int __user *, unsigned long);
 #else
 asmlinkage long sys_clone(unsigned long, unsigned long, int __user *,
-	       int __user *, int);
+	       int __user *, unsigned long);
 #endif
 #endif
 
diff --git a/kernel/fork.c b/kernel/fork.c
index cf65139..b3dadf4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1192,7 +1192,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 					unsigned long stack_size,
 					int __user *child_tidptr,
 					struct pid *pid,
-					int trace)
+					int trace,
+					unsigned long tls)
 {
 	int retval;
 	struct task_struct *p;
@@ -1401,7 +1402,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	retval = copy_io(clone_flags, p);
 	if (retval)
 		goto bad_fork_cleanup_namespaces;
-	retval = copy_thread(clone_flags, stack_start, stack_size, p);
+	retval = copy_thread_tls(clone_flags, stack_start, stack_size, p, tls);
 	if (retval)
 		goto bad_fork_cleanup_io;
 
@@ -1613,7 +1614,7 @@ static inline void init_idle_pids(struct pid_link *links)
 struct task_struct *fork_idle(int cpu)
 {
 	struct task_struct *task;
-	task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0);
+	task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0, 0);
 	if (!IS_ERR(task)) {
 		init_idle_pids(task->pids);
 		init_idle(task, cpu);
@@ -1628,11 +1629,13 @@ struct task_struct *fork_idle(int cpu)
  * It copies the process, and if successful kick-starts
  * it and waits for it to finish using the VM if required.
  */
-long do_fork(unsigned long clone_flags,
-	      unsigned long stack_start,
-	      unsigned long stack_size,
-	      int __user *parent_tidptr,
-	      int __user *child_tidptr)
+static long _do_fork(
+		unsigned long clone_flags,
+		unsigned long stack_start,
+		unsigned long stack_size,
+		int __user *parent_tidptr,
+		int __user *child_tidptr,
+		unsigned long tls)
 {
 	struct task_struct *p;
 	int trace = 0;
@@ -1657,7 +1660,7 @@ long do_fork(unsigned long clone_flags,
 	}
 
 	p = copy_process(clone_flags, stack_start, stack_size,
-			 child_tidptr, NULL, trace);
+			 child_tidptr, NULL, trace, tls);
 	/*
 	 * Do this prior waking up the new thread - the thread pointer
 	 * might get invalid after that point, if the thread exits quickly.
@@ -1698,20 +1701,34 @@ long do_fork(unsigned long clone_flags,
 	return nr;
 }
 
+#ifndef CONFIG_HAVE_COPY_THREAD_TLS
+/* For compatibility with architectures that call do_fork directly rather than
+ * using the syscall entry points below. */
+long do_fork(unsigned long clone_flags,
+	      unsigned long stack_start,
+	      unsigned long stack_size,
+	      int __user *parent_tidptr,
+	      int __user *child_tidptr)
+{
+	return _do_fork(clone_flags, stack_start, stack_size,
+			parent_tidptr, child_tidptr, 0);
+}
+#endif
+
 /*
  * Create a kernel thread.
  */
 pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
 {
-	return do_fork(flags|CLONE_VM|CLONE_UNTRACED, (unsigned long)fn,
-		(unsigned long)arg, NULL, NULL);
+	return _do_fork(flags|CLONE_VM|CLONE_UNTRACED, (unsigned long)fn,
+		(unsigned long)arg, NULL, NULL, 0);
 }
 
 #ifdef __ARCH_WANT_SYS_FORK
 SYSCALL_DEFINE0(fork)
 {
 #ifdef CONFIG_MMU
-	return do_fork(SIGCHLD, 0, 0, NULL, NULL);
+	return _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0);
 #else
 	/* can not support in nommu mode */
 	return -EINVAL;
@@ -1722,8 +1739,8 @@ SYSCALL_DEFINE0(fork)
 #ifdef __ARCH_WANT_SYS_VFORK
 SYSCALL_DEFINE0(vfork)
 {
-	return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, 0,
-			0, NULL, NULL);
+	return _do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, 0,
+			0, NULL, NULL, 0);
 }
 #endif
 
@@ -1731,27 +1748,27 @@ SYSCALL_DEFINE0(vfork)
 #ifdef CONFIG_CLONE_BACKWARDS
 SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
 		 int __user *, parent_tidptr,
-		 int, tls_val,
+		 unsigned long, tls,
 		 int __user *, child_tidptr)
 #elif defined(CONFIG_CLONE_BACKWARDS2)
 SYSCALL_DEFINE5(clone, unsigned long, newsp, unsigned long, clone_flags,
 		 int __user *, parent_tidptr,
 		 int __user *, child_tidptr,
-		 int, tls_val)
+		 unsigned long, tls)
 #elif defined(CONFIG_CLONE_BACKWARDS3)
 SYSCALL_DEFINE6(clone, unsigned long, clone_flags, unsigned long, newsp,
 		int, stack_size,
 		int __user *, parent_tidptr,
 		int __user *, child_tidptr,
-		int, tls_val)
+		unsigned long, tls)
 #else
 SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
 		 int __user *, parent_tidptr,
 		 int __user *, child_tidptr,
-		 int, tls_val)
+		 unsigned long, tls)
 #endif
 {
-	return do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr);
+	return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls);
 }
 #endif
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v2 2/7] x86: Opt into HAVE_COPY_THREAD_TLS, for both 32-bit and 64-bit
@ 2015-03-15  7:59   ` Josh Triplett
  0 siblings, 0 replies; 63+ messages in thread
From: Josh Triplett @ 2015-03-15  7:59 UTC (permalink / raw)
  To: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, Thiago Macieira, linux-kernel,
	linux-api, linux-fsdevel, x86

For 32-bit userspace on a 64-bit kernel, this requires modifying
stub32_clone to actually swap the appropriate arguments to match
CONFIG_CLONE_BACKWARDS, rather than just leaving the C argument for tls
broken.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Thiago Macieira <thiago.macieira@intel.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/Kconfig             | 1 +
 arch/x86/ia32/ia32entry.S    | 2 +-
 arch/x86/kernel/process_32.c | 6 +++---
 arch/x86/kernel/process_64.c | 8 ++++----
 4 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b7d31ca..4960b0d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -124,6 +124,7 @@ config X86
 	select MODULES_USE_ELF_REL if X86_32
 	select MODULES_USE_ELF_RELA if X86_64
 	select CLONE_BACKWARDS if X86_32
+	select HAVE_COPY_THREAD_TLS
 	select ARCH_USE_BUILTIN_BSWAP
 	select ARCH_USE_QUEUE_RWLOCK
 	select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 156ebca..0286735 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -487,7 +487,7 @@ GLOBAL(\label)
 	ALIGN
 GLOBAL(stub32_clone)
 	leaq sys_clone(%rip),%rax
-	mov	%r8, %rcx
+	xchg %r8, %rcx
 	jmp  ia32_ptregs_common	
 
 	ALIGN
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 603c4f9..ead28ff 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -129,8 +129,8 @@ void release_thread(struct task_struct *dead_task)
 	release_vm86_irqs(dead_task);
 }
 
-int copy_thread(unsigned long clone_flags, unsigned long sp,
-	unsigned long arg, struct task_struct *p)
+int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
+	unsigned long arg, struct task_struct *p, unsigned long tls)
 {
 	struct pt_regs *childregs = task_pt_regs(p);
 	struct task_struct *tsk;
@@ -185,7 +185,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp,
 	 */
 	if (clone_flags & CLONE_SETTLS)
 		err = do_set_thread_area(p, -1,
-			(struct user_desc __user *)childregs->si, 0);
+			(struct user_desc __user *)tls, 0);
 
 	if (err && p->thread.io_bitmap_ptr) {
 		kfree(p->thread.io_bitmap_ptr);
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 67fcc43..c69cabc 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -151,8 +151,8 @@ static inline u32 read_32bit_tls(struct task_struct *t, int tls)
 	return get_desc_base(&t->thread.tls_array[tls]);
 }
 
-int copy_thread(unsigned long clone_flags, unsigned long sp,
-		unsigned long arg, struct task_struct *p)
+int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
+		unsigned long arg, struct task_struct *p, unsigned long tls)
 {
 	int err;
 	struct pt_regs *childregs;
@@ -209,10 +209,10 @@ int copy_thread(unsigned long clone_flags, unsigned long sp,
 #ifdef CONFIG_IA32_EMULATION
 		if (test_thread_flag(TIF_IA32))
 			err = do_set_thread_area(p, -1,
-				(struct user_desc __user *)childregs->si, 0);
+				(struct user_desc __user *)tls, 0);
 		else
 #endif
-			err = do_arch_prctl(p, ARCH_SET_FS, childregs->r8);
+			err = do_arch_prctl(p, ARCH_SET_FS, tls);
 		if (err)
 			goto out;
 	}
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v2 2/7] x86: Opt into HAVE_COPY_THREAD_TLS, for both 32-bit and 64-bit
@ 2015-03-15  7:59   ` Josh Triplett
  0 siblings, 0 replies; 63+ messages in thread
From: Josh Triplett @ 2015-03-15  7:59 UTC (permalink / raw)
  To: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, Thiago Macieira,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A

For 32-bit userspace on a 64-bit kernel, this requires modifying
stub32_clone to actually swap the appropriate arguments to match
CONFIG_CLONE_BACKWARDS, rather than just leaving the C argument for tls
broken.

Signed-off-by: Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org>
Signed-off-by: Thiago Macieira <thiago.macieira-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Acked-by: Andy Lutomirski <luto-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 arch/x86/Kconfig             | 1 +
 arch/x86/ia32/ia32entry.S    | 2 +-
 arch/x86/kernel/process_32.c | 6 +++---
 arch/x86/kernel/process_64.c | 8 ++++----
 4 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b7d31ca..4960b0d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -124,6 +124,7 @@ config X86
 	select MODULES_USE_ELF_REL if X86_32
 	select MODULES_USE_ELF_RELA if X86_64
 	select CLONE_BACKWARDS if X86_32
+	select HAVE_COPY_THREAD_TLS
 	select ARCH_USE_BUILTIN_BSWAP
 	select ARCH_USE_QUEUE_RWLOCK
 	select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 156ebca..0286735 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -487,7 +487,7 @@ GLOBAL(\label)
 	ALIGN
 GLOBAL(stub32_clone)
 	leaq sys_clone(%rip),%rax
-	mov	%r8, %rcx
+	xchg %r8, %rcx
 	jmp  ia32_ptregs_common	
 
 	ALIGN
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 603c4f9..ead28ff 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -129,8 +129,8 @@ void release_thread(struct task_struct *dead_task)
 	release_vm86_irqs(dead_task);
 }
 
-int copy_thread(unsigned long clone_flags, unsigned long sp,
-	unsigned long arg, struct task_struct *p)
+int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
+	unsigned long arg, struct task_struct *p, unsigned long tls)
 {
 	struct pt_regs *childregs = task_pt_regs(p);
 	struct task_struct *tsk;
@@ -185,7 +185,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp,
 	 */
 	if (clone_flags & CLONE_SETTLS)
 		err = do_set_thread_area(p, -1,
-			(struct user_desc __user *)childregs->si, 0);
+			(struct user_desc __user *)tls, 0);
 
 	if (err && p->thread.io_bitmap_ptr) {
 		kfree(p->thread.io_bitmap_ptr);
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 67fcc43..c69cabc 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -151,8 +151,8 @@ static inline u32 read_32bit_tls(struct task_struct *t, int tls)
 	return get_desc_base(&t->thread.tls_array[tls]);
 }
 
-int copy_thread(unsigned long clone_flags, unsigned long sp,
-		unsigned long arg, struct task_struct *p)
+int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
+		unsigned long arg, struct task_struct *p, unsigned long tls)
 {
 	int err;
 	struct pt_regs *childregs;
@@ -209,10 +209,10 @@ int copy_thread(unsigned long clone_flags, unsigned long sp,
 #ifdef CONFIG_IA32_EMULATION
 		if (test_thread_flag(TIF_IA32))
 			err = do_set_thread_area(p, -1,
-				(struct user_desc __user *)childregs->si, 0);
+				(struct user_desc __user *)tls, 0);
 		else
 #endif
-			err = do_arch_prctl(p, ARCH_SET_FS, childregs->r8);
+			err = do_arch_prctl(p, ARCH_SET_FS, tls);
 		if (err)
 			goto out;
 	}
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v2 3/7] Introduce a new clone4 syscall with more flag bits and extensible arguments
  2015-03-15  7:59 ` Josh Triplett
                   ` (2 preceding siblings ...)
  (?)
@ 2015-03-15  7:59 ` Josh Triplett
  2015-03-23 14:11     ` David Drysdale
  -1 siblings, 1 reply; 63+ messages in thread
From: Josh Triplett @ 2015-03-15  7:59 UTC (permalink / raw)
  To: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, Thiago Macieira, linux-kernel,
	linux-api, linux-fsdevel, x86

clone() has no more usable flags available.  It has three now-unused
flags (CLONE_PID, CLONE_DETACHED, and CLONE_STOPPED), but current
kernels just ignore those flags without returning an error like EINVAL,
so reusing those flags would not allow userspace to detect the
availability of the new functionality.

Introduce a new system call, clone4, which accepts a second 32-bit flags
field.  clone4 also returns EINVAL for the currently unused flags in
clone, allowing their reuse.

To process these new flags, change the flags argument of _do_fork to a
u64.  sys_clone and do_fork both still use "unsigned long" for flags as
they did before, truncating it to 32-bit and masking out the obsolete
flags to behave like clone currently does.

clone4 accepts its remaining arguments as a structure, and userspace
passes in the size of that structure.  clone4 has well-defined semantics
that allow extending that structure in the future.  New userspace
passing in a larger structure than the kernel expects will receive
EINVAL, and can use a smaller structure to work with old kernels.  New
kernels accept smaller argument structures passed by userspace, and any
un-passed arguments default to 0.

clone4 handles arguments in the same order on all architectures, with no
backwards variations; to do so, it depends on the new
HAVE_COPY_THREAD_TLS.

The new system call currently accepts exactly the same flags as clone;
future commits will introduce new flags for additional functionality.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Thiago Macieira <thiago.macieira@intel.com>
---
 arch/x86/ia32/ia32entry.S         |  1 +
 arch/x86/kernel/entry_64.S        |  1 +
 arch/x86/syscalls/syscall_32.tbl  |  1 +
 arch/x86/syscalls/syscall_64.tbl  |  2 ++
 include/linux/compat.h            | 12 +++++++++
 include/uapi/asm-generic/unistd.h |  4 ++-
 include/uapi/linux/sched.h        | 36 ++++++++++++++++++++++---
 init/Kconfig                      | 10 +++++++
 kernel/fork.c                     | 56 ++++++++++++++++++++++++++++++++++++---
 kernel/sys_ni.c                   |  1 +
 10 files changed, 116 insertions(+), 8 deletions(-)

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 0286735..ba28306 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -483,6 +483,7 @@ GLOBAL(\label)
 	PTREGSCALL stub32_execveat, compat_sys_execveat
 	PTREGSCALL stub32_fork, sys_fork
 	PTREGSCALL stub32_vfork, sys_vfork
+	PTREGSCALL stub32_clone4, compat_sys_clone4
 
 	ALIGN
 GLOBAL(stub32_clone)
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 1d74d16..ead143f 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -520,6 +520,7 @@ END(\label)
 	FORK_LIKE  clone
 	FORK_LIKE  fork
 	FORK_LIKE  vfork
+	FORK_LIKE  clone4
 	FIXED_FRAME stub_iopl, sys_iopl
 
 ENTRY(stub_execve)
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index b3560ec..56fcc90 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -365,3 +365,4 @@
 356	i386	memfd_create		sys_memfd_create
 357	i386	bpf			sys_bpf
 358	i386	execveat		sys_execveat			stub32_execveat
+359	i386	clone4			sys_clone4			stub32_clone4
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 8d656fb..af15b0f 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
 320	common	kexec_file_load		sys_kexec_file_load
 321	common	bpf			sys_bpf
 322	64	execveat		stub_execveat
+323	64	clone4			stub_clone4
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
@@ -368,3 +369,4 @@
 543	x32	io_setup		compat_sys_io_setup
 544	x32	io_submit		compat_sys_io_submit
 545	x32	execveat		stub_x32_execveat
+546	x32	clone4			stub32_clone4
diff --git a/include/linux/compat.h b/include/linux/compat.h
index ab25814..6c4a68d 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -293,6 +293,14 @@ struct compat_old_sigaction {
 };
 #endif
 
+struct compat_clone4_args {
+	compat_uptr_t ptid;
+	compat_uptr_t ctid;
+	compat_ulong_t stack_start;
+	compat_ulong_t stack_size;
+	compat_ulong_t tls;
+};
+
 struct compat_statfs;
 struct compat_statfs64;
 struct compat_old_linux_dirent;
@@ -713,6 +721,10 @@ asmlinkage long compat_sys_sched_rr_get_interval(compat_pid_t pid,
 
 asmlinkage long compat_sys_fanotify_mark(int, unsigned int, __u32, __u32,
 					    int, const char __user *);
+
+asmlinkage long compat_sys_clone4(unsigned, unsigned, compat_ulong_t,
+				  struct compat_clone4_args __user *);
+
 #else
 
 #define is_compat_task() (0)
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index e016bd9..3740166 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -709,9 +709,11 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
 __SYSCALL(__NR_bpf, sys_bpf)
 #define __NR_execveat 281
 __SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
+#define __NR_clone4 282
+__SC_COMP(__NR_clone4, sys_clone4, compat_sys_clone4)
 
 #undef __NR_syscalls
-#define __NR_syscalls 282
+#define __NR_syscalls 283
 
 /*
  * All syscalls below here should go away really,
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index cc89dde..7656152 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -1,6 +1,8 @@
 #ifndef _UAPI_LINUX_SCHED_H
 #define _UAPI_LINUX_SCHED_H
 
+#include <linux/types.h>
+
 /*
  * cloning flags:
  */
@@ -18,11 +20,8 @@
 #define CLONE_SETTLS	0x00080000	/* create a new TLS for the child */
 #define CLONE_PARENT_SETTID	0x00100000	/* set the TID in the parent */
 #define CLONE_CHILD_CLEARTID	0x00200000	/* clear the TID in the child */
-#define CLONE_DETACHED		0x00400000	/* Unused, ignored */
 #define CLONE_UNTRACED		0x00800000	/* set if the tracing process can't force CLONE_PTRACE on this clone */
 #define CLONE_CHILD_SETTID	0x01000000	/* set the TID in the child */
-/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
-   and is now available for re-use. */
 #define CLONE_NEWUTS		0x04000000	/* New utsname namespace */
 #define CLONE_NEWIPC		0x08000000	/* New ipc namespace */
 #define CLONE_NEWUSER		0x10000000	/* New user namespace */
@@ -31,6 +30,37 @@
 #define CLONE_IO		0x80000000	/* Clone io context */
 
 /*
+ * Old flags, unused by current clone.  clone does not return EINVAL for these
+ * flags, so they can't easily be reused.  clone4 can use them.
+ */
+#define CLONE_PID	0x00001000
+#define CLONE_DETACHED	0x00400000
+#define CLONE_STOPPED	0x02000000
+
+#ifdef __KERNEL__
+/*
+ * Valid flags for clone and for clone4. Kept in this file next to the flag
+ * list above, but not exposed to userspace.
+ */
+#define CLONE_VALID_FLAGS	(0xffffffffULL & ~(CLONE_PID | CLONE_DETACHED | CLONE_STOPPED))
+#define CLONE4_VALID_FLAGS	CLONE_VALID_FLAGS
+#endif /* __KERNEL__ */
+
+/*
+ * Structure passed to clone4 for additional arguments.  Initialized to 0,
+ * then overwritten with arguments from userspace, so arguments not supplied by
+ * userspace will remain 0.  New versions of the kernel may safely append new
+ * arguments to the end.
+ */
+struct clone4_args {
+	__kernel_pid_t __user *ptid;
+	__kernel_pid_t __user *ctid;
+	__kernel_ulong_t stack_start;
+	__kernel_ulong_t stack_size;
+	__kernel_ulong_t tls;
+};
+
+/*
  * Scheduling policies
  */
 #define SCHED_NORMAL		0
diff --git a/init/Kconfig b/init/Kconfig
index f5dbc6d..3ab6649 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1511,6 +1511,16 @@ config EVENTFD
 
 	  If unsure, say Y.
 
+config CLONE4
+	bool "Enable clone4() system call" if EXPERT
+	depends on HAVE_COPY_THREAD_TLS
+	default y
+	help
+	  Enable the clone4() system call, which supports passing additional
+	  flags.
+
+	  If unsure, say Y.
+
 # syscall, maps, verifier
 config BPF_SYSCALL
 	bool "Enable bpf() system call" if EXPERT
diff --git a/kernel/fork.c b/kernel/fork.c
index b3dadf4..8a21f9e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1187,7 +1187,7 @@ init_task_pid(struct task_struct *task, enum pid_type type, struct pid *pid)
  * parts of the process environment (as per the clone
  * flags). The actual kick-off is left to the caller.
  */
-static struct task_struct *copy_process(unsigned long clone_flags,
+static struct task_struct *copy_process(u64 clone_flags,
 					unsigned long stack_start,
 					unsigned long stack_size,
 					int __user *child_tidptr,
@@ -1198,6 +1198,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	int retval;
 	struct task_struct *p;
 
+	if (clone_flags & ~CLONE4_VALID_FLAGS)
+		return ERR_PTR(-EINVAL);
+
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
 
@@ -1630,7 +1633,7 @@ struct task_struct *fork_idle(int cpu)
  * it and waits for it to finish using the VM if required.
  */
 static long _do_fork(
-		unsigned long clone_flags,
+		u64 clone_flags,
 		unsigned long stack_start,
 		unsigned long stack_size,
 		int __user *parent_tidptr,
@@ -1701,6 +1704,15 @@ static long _do_fork(
 	return nr;
 }
 
+/*
+ * Convenience function for callers passing unsigned long flags, to prevent old
+ * syscall entry points from unexpectedly returning EINVAL.
+ */
+static inline u64 squelch_clone_flags(unsigned long clone_flags)
+{
+	return clone_flags & CLONE_VALID_FLAGS;
+}
+
 #ifndef CONFIG_HAVE_COPY_THREAD_TLS
 /* For compatibility with architectures that call do_fork directly rather than
  * using the syscall entry points below. */
@@ -1710,7 +1722,8 @@ long do_fork(unsigned long clone_flags,
 	      int __user *parent_tidptr,
 	      int __user *child_tidptr)
 {
-	return _do_fork(clone_flags, stack_start, stack_size,
+	return _do_fork(squelch_clone_flags(clone_flags),
+			stack_start, stack_size,
 			parent_tidptr, child_tidptr, 0);
 }
 #endif
@@ -1768,10 +1781,45 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
 		 unsigned long, tls)
 #endif
 {
-	return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls);
+	return _do_fork(squelch_clone_flags(clone_flags), newsp, 0,
+			parent_tidptr, child_tidptr, tls);
 }
 #endif
 
+#ifdef CONFIG_CLONE4
+SYSCALL_DEFINE4(clone4, unsigned, flags_high, unsigned, flags_low,
+		unsigned long, args_size, struct clone4_args __user *, args)
+{
+	u64 flags = (u64)flags_high << 32 | flags_low;
+	struct clone4_args kargs = {};
+	if (args_size > sizeof(kargs))
+		return -EINVAL;
+	if (args_size && copy_from_user(&kargs, args, args_size))
+		return -EFAULT;
+	return _do_fork(flags, kargs.stack_start, kargs.stack_size,
+			kargs.ptid, kargs.ctid, kargs.tls);
+}
+
+#ifdef CONFIG_COMPAT
+COMPAT_SYSCALL_DEFINE4(clone4, unsigned, flags_high, unsigned, flags_low,
+			compat_ulong_t, args_size,
+			struct compat_clone4_args __user *, args)
+{
+	u64 flags = (u64)flags_high << 32 | flags_low;
+	struct compat_clone4_args compat_kargs = {};
+	if (args_size > sizeof(compat_kargs))
+		return -EINVAL;
+	if (args_size && copy_from_user(&compat_kargs, args, args_size))
+		return -EFAULT;
+	return _do_fork(flags, compat_kargs.stack_start,
+			compat_kargs.stack_size,
+			compat_ptr(compat_kargs.ptid),
+			compat_ptr(compat_kargs.ctid),
+			compat_kargs.tls);
+}
+#endif /* CONFIG_COMPAT */
+#endif /* CONFIG_CLONE4 */
+
 #ifndef ARCH_MIN_MMSTRUCT_ALIGN
 #define ARCH_MIN_MMSTRUCT_ALIGN 0
 #endif
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 5adcb0a..5b5d2b9 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -159,6 +159,7 @@ cond_syscall(sys_uselib);
 cond_syscall(sys_fadvise64);
 cond_syscall(sys_fadvise64_64);
 cond_syscall(sys_madvise);
+cond_syscall(sys_clone4);
 
 /* arch-specific weak syscall entries */
 cond_syscall(sys_pciconfig_read);
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v2 4/7] kernel/fork.c: Pass arguments to _do_fork and copy_process using clone4_args
  2015-03-15  7:59 ` Josh Triplett
                   ` (3 preceding siblings ...)
  (?)
@ 2015-03-15  7:59 ` Josh Triplett
  -1 siblings, 0 replies; 63+ messages in thread
From: Josh Triplett @ 2015-03-15  7:59 UTC (permalink / raw)
  To: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, Thiago Macieira, linux-kernel,
	linux-api, linux-fsdevel, x86

Rather than continuing to add arguments to _do_fork and copy_process for
future clone4 extensions, with corresponding churn in every caller, pass
the arguments using the clone4_args structure instead.  This allows
clone4 to avoid unpacking the arguments, and allows other callers to use
C99 structure initializers to only initialize the arguments they care
about.  Future extensions to clone4_args will thus not need to touch
clone4, fork, vfork, or other callers of _do_fork.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Thiago Macieira <thiago.macieira@intel.com>
---
 kernel/fork.c | 77 +++++++++++++++++++++++++++++++----------------------------
 1 file changed, 41 insertions(+), 36 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 8a21f9e..db9012a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1188,12 +1188,9 @@ init_task_pid(struct task_struct *task, enum pid_type type, struct pid *pid)
  * flags). The actual kick-off is left to the caller.
  */
 static struct task_struct *copy_process(u64 clone_flags,
-					unsigned long stack_start,
-					unsigned long stack_size,
-					int __user *child_tidptr,
+					struct clone4_args *args,
 					struct pid *pid,
-					int trace,
-					unsigned long tls)
+					int trace)
 {
 	int retval;
 	struct task_struct *p;
@@ -1405,7 +1402,7 @@ static struct task_struct *copy_process(u64 clone_flags,
 	retval = copy_io(clone_flags, p);
 	if (retval)
 		goto bad_fork_cleanup_namespaces;
-	retval = copy_thread_tls(clone_flags, stack_start, stack_size, p, tls);
+	retval = copy_thread_tls(clone_flags, args->stack_start, args->stack_size, p, args->tls);
 	if (retval)
 		goto bad_fork_cleanup_io;
 
@@ -1416,11 +1413,11 @@ static struct task_struct *copy_process(u64 clone_flags,
 			goto bad_fork_cleanup_io;
 	}
 
-	p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
+	p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? args->ctid : NULL;
 	/*
 	 * Clear TID on mm_release()?
 	 */
-	p->clear_child_tid = (clone_flags & CLONE_CHILD_CLEARTID) ? child_tidptr : NULL;
+	p->clear_child_tid = (clone_flags & CLONE_CHILD_CLEARTID) ? args->ctid : NULL;
 #ifdef CONFIG_BLOCK
 	p->plug = NULL;
 #endif
@@ -1617,7 +1614,8 @@ static inline void init_idle_pids(struct pid_link *links)
 struct task_struct *fork_idle(int cpu)
 {
 	struct task_struct *task;
-	task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0, 0);
+	struct clone4_args args = {};
+	task = copy_process(CLONE_VM, &args, &init_struct_pid, 0);
 	if (!IS_ERR(task)) {
 		init_idle_pids(task->pids);
 		init_idle(task, cpu);
@@ -1632,13 +1630,7 @@ struct task_struct *fork_idle(int cpu)
  * It copies the process, and if successful kick-starts
  * it and waits for it to finish using the VM if required.
  */
-static long _do_fork(
-		u64 clone_flags,
-		unsigned long stack_start,
-		unsigned long stack_size,
-		int __user *parent_tidptr,
-		int __user *child_tidptr,
-		unsigned long tls)
+static long _do_fork(u64 clone_flags, struct clone4_args *args)
 {
 	struct task_struct *p;
 	int trace = 0;
@@ -1662,8 +1654,7 @@ static long _do_fork(
 			trace = 0;
 	}
 
-	p = copy_process(clone_flags, stack_start, stack_size,
-			 child_tidptr, NULL, trace, tls);
+	p = copy_process(clone_flags, args, NULL, trace);
 	/*
 	 * Do this prior waking up the new thread - the thread pointer
 	 * might get invalid after that point, if the thread exits quickly.
@@ -1678,7 +1669,7 @@ static long _do_fork(
 		nr = pid_vnr(pid);
 
 		if (clone_flags & CLONE_PARENT_SETTID)
-			put_user(nr, parent_tidptr);
+			put_user(nr, args->ptid);
 
 		if (clone_flags & CLONE_VFORK) {
 			p->vfork_done = &vfork;
@@ -1722,9 +1713,13 @@ long do_fork(unsigned long clone_flags,
 	      int __user *parent_tidptr,
 	      int __user *child_tidptr)
 {
-	return _do_fork(squelch_clone_flags(clone_flags),
-			stack_start, stack_size,
-			parent_tidptr, child_tidptr, 0);
+	struct clone4_args kargs = {
+		.ptid = parent_tidptr,
+		.ctid = child_tidptr,
+		.stack_start = stack_start,
+		.stack_start = stack_size,
+	};
+	return _do_fork(squelch_clone_flags(clone_flags), &kargs);
 }
 #endif
 
@@ -1733,15 +1728,19 @@ long do_fork(unsigned long clone_flags,
  */
 pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
 {
-	return _do_fork(flags|CLONE_VM|CLONE_UNTRACED, (unsigned long)fn,
-		(unsigned long)arg, NULL, NULL, 0);
+	struct clone4_args kargs = {
+		.stack_start = (unsigned long)fn,
+		.stack_size = (unsigned long)arg,
+	};
+	return _do_fork(flags|CLONE_VM|CLONE_UNTRACED, &kargs);
 }
 
 #ifdef __ARCH_WANT_SYS_FORK
 SYSCALL_DEFINE0(fork)
 {
 #ifdef CONFIG_MMU
-	return _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0);
+	struct clone4_args kargs = {};
+	return _do_fork(SIGCHLD, &kargs);
 #else
 	/* can not support in nommu mode */
 	return -EINVAL;
@@ -1752,8 +1751,8 @@ SYSCALL_DEFINE0(fork)
 #ifdef __ARCH_WANT_SYS_VFORK
 SYSCALL_DEFINE0(vfork)
 {
-	return _do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, 0,
-			0, NULL, NULL, 0);
+	struct clone4_args kargs = {};
+	return _do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, &kargs);
 }
 #endif
 
@@ -1781,8 +1780,13 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
 		 unsigned long, tls)
 #endif
 {
-	return _do_fork(squelch_clone_flags(clone_flags), newsp, 0,
-			parent_tidptr, child_tidptr, tls);
+	struct clone4_args kargs = {
+		.ptid = parent_tidptr,
+		.ctid = child_tidptr,
+		.stack_start = newsp,
+		.tls = tls,
+	};
+	return _do_fork(squelch_clone_flags(clone_flags), &kargs);
 }
 #endif
 
@@ -1796,8 +1800,7 @@ SYSCALL_DEFINE4(clone4, unsigned, flags_high, unsigned, flags_low,
 		return -EINVAL;
 	if (args_size && copy_from_user(&kargs, args, args_size))
 		return -EFAULT;
-	return _do_fork(flags, kargs.stack_start, kargs.stack_size,
-			kargs.ptid, kargs.ctid, kargs.tls);
+	return _do_fork(flags, &kargs);
 }
 
 #ifdef CONFIG_COMPAT
@@ -1807,15 +1810,17 @@ COMPAT_SYSCALL_DEFINE4(clone4, unsigned, flags_high, unsigned, flags_low,
 {
 	u64 flags = (u64)flags_high << 32 | flags_low;
 	struct compat_clone4_args compat_kargs = {};
+	struct clone4_args kargs = {};
 	if (args_size > sizeof(compat_kargs))
 		return -EINVAL;
 	if (args_size && copy_from_user(&compat_kargs, args, args_size))
 		return -EFAULT;
-	return _do_fork(flags, compat_kargs.stack_start,
-			compat_kargs.stack_size,
-			compat_ptr(compat_kargs.ptid),
-			compat_ptr(compat_kargs.ctid),
-			compat_kargs.tls);
+	kargs.ptid = compat_ptr(compat_kargs.ptid);
+	kargs.ctid = compat_ptr(compat_kargs.ctid);
+	kargs.stack_start = compat_kargs.stack_start;
+	kargs.stack_size = compat_kargs.stack_size;
+	kargs.tls = compat_kargs.tls;
+	return _do_fork(flags, &kargs);
 }
 #endif /* CONFIG_COMPAT */
 #endif /* CONFIG_CLONE4 */
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v2 5/7] clone4: Add a CLONE_AUTOREAP flag to automatically reap the child process
  2015-03-15  7:59 ` Josh Triplett
                   ` (4 preceding siblings ...)
  (?)
@ 2015-03-15  8:00 ` Josh Triplett
  2015-03-15 14:52     ` Oleg Nesterov
  -1 siblings, 1 reply; 63+ messages in thread
From: Josh Triplett @ 2015-03-15  8:00 UTC (permalink / raw)
  To: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, Thiago Macieira, linux-kernel,
	linux-api, linux-fsdevel, x86

If a process launches a child process with the notification signal set
to SIGCHLD (e.g. with fork()), and then the parent process either
ignores SIGCHLD or sets a handler with SA_NOCLDWAIT, the child process
will get automatically reaped without waiting for the parent to wait on
it.

However, there's currently no way to get the same autoreaping behavior
if the signal is not set to SIGCHLD, including in particular if the
signal is set to 0 to disable notification.  Furthermore, the code
launching the child process may not own process-wide signal handling for
the parent process.

Add a CLONE_AUTOREAP flag to request this behavior unconditionally,
regardless of the notification signal or the state of the parent
process's signal handling when the process exits.

This is particularly useful for libraries that want to launch unattended
child processes without interfering with the calling process's signal
handling or wait loop.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Thiago Macieira <thiago.macieira@intel.com>
---
 include/linux/sched.h      | 2 ++
 include/uapi/linux/sched.h | 7 ++++++-
 kernel/fork.c              | 2 ++
 kernel/signal.c            | 2 ++
 4 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9ec36fd..66feeb7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1372,6 +1372,8 @@ struct task_struct {
 	unsigned memcg_kmem_skip_account:1;
 #endif
 
+	unsigned autoreap:1; /* Do not become a zombie on exit */
+
 	unsigned long atomic_flags; /* Flags needing atomic access. */
 
 	struct restart_block restart_block;
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 7656152..f606c0a 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -37,13 +37,18 @@
 #define CLONE_DETACHED	0x00400000
 #define CLONE_STOPPED	0x02000000
 
+/*
+ * Flags that only work with clone4.
+ */
+#define CLONE_AUTOREAP	0x00001000	/* Automatically reap the process */
+
 #ifdef __KERNEL__
 /*
  * Valid flags for clone and for clone4. Kept in this file next to the flag
  * list above, but not exposed to userspace.
  */
 #define CLONE_VALID_FLAGS	(0xffffffffULL & ~(CLONE_PID | CLONE_DETACHED | CLONE_STOPPED))
-#define CLONE4_VALID_FLAGS	CLONE_VALID_FLAGS
+#define CLONE4_VALID_FLAGS	(CLONE_VALID_FLAGS | CLONE_AUTOREAP)
 #endif /* __KERNEL__ */
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index db9012a..c297e5e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1461,6 +1461,8 @@ static struct task_struct *copy_process(u64 clone_flags,
 		p->tgid = p->pid;
 	}
 
+	p->autoreap = !!(clone_flags & CLONE_AUTOREAP);
+
 	p->nr_dirtied = 0;
 	p->nr_dirtied_pause = 128 >> (PAGE_SHIFT - 10);
 	p->dirty_paused_when = 0;
diff --git a/kernel/signal.c b/kernel/signal.c
index a390499..c0011c0 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1702,6 +1702,8 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
 		if (psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN)
 			sig = 0;
 	}
+	if (!tsk->ptrace && tsk->autoreap)
+		autoreap = true;
 	if (valid_signal(sig) && sig)
 		__group_send_sig_info(sig, &info, tsk->parent);
 	__wake_up_parent(tsk, tsk->parent);
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v2 6/7] signal: Factor out a helper function to process task_struct exit_code
  2015-03-15  7:59 ` Josh Triplett
                   ` (5 preceding siblings ...)
  (?)
@ 2015-03-15  8:00 ` Josh Triplett
  -1 siblings, 0 replies; 63+ messages in thread
From: Josh Triplett @ 2015-03-15  8:00 UTC (permalink / raw)
  To: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, Thiago Macieira, linux-kernel,
	linux-api, linux-fsdevel, x86

do_notify_parent includes the code to convert the exit_code field of
struct task_struct to the code and status fields that accompany SIGCHLD.
Factor that out into a new helper function task_exit_code_status, to
allow other methods of task exit notification to share that code.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Thiago Macieira <thiago.macieira@intel.com>
---
 include/linux/sched.h |  1 +
 kernel/signal.c       | 24 +++++++++++++++---------
 2 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 66feeb7..9daa017 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2388,6 +2388,7 @@ extern int kill_pid_info_as_cred(int, struct siginfo *, struct pid *,
 extern int kill_pgrp(struct pid *pid, int sig, int priv);
 extern int kill_pid(struct pid *pid, int sig, int priv);
 extern int kill_proc_info(int, struct siginfo *, pid_t);
+extern void task_exit_code_status(int exit_code, s32 *code, s32 *status);
 extern __must_check bool do_notify_parent(struct task_struct *, int);
 extern void __wake_up_parent(struct task_struct *p, struct task_struct *parent);
 extern void force_sig(int, struct task_struct *);
diff --git a/kernel/signal.c b/kernel/signal.c
index c0011c0..478cc09 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1613,6 +1613,20 @@ ret:
 	return ret;
 }
 
+/* Translate exit_code to code and status. */
+void task_exit_code_status(int exit_code, s32 *code, s32 *status)
+{
+	*status = exit_code & 0x7f;
+	if (exit_code & 0x80)
+		*code = CLD_DUMPED;
+	else if (exit_code & 0x7f)
+		*code = CLD_KILLED;
+	else {
+		*code = CLD_EXITED;
+		*status = exit_code >> 8;
+	}
+}
+
 /*
  * Let a parent know about the death of a child.
  * For a stopped/continued status change, use do_notify_parent_cldstop instead.
@@ -1668,15 +1682,7 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
 	info.si_utime = cputime_to_clock_t(utime + tsk->signal->utime);
 	info.si_stime = cputime_to_clock_t(stime + tsk->signal->stime);
 
-	info.si_status = tsk->exit_code & 0x7f;
-	if (tsk->exit_code & 0x80)
-		info.si_code = CLD_DUMPED;
-	else if (tsk->exit_code & 0x7f)
-		info.si_code = CLD_KILLED;
-	else {
-		info.si_code = CLD_EXITED;
-		info.si_status = tsk->exit_code >> 8;
-	}
+	task_exit_code_status(tsk->exit_code, &info.si_code, &info.si_status);
 
 	psig = tsk->parent->sighand;
 	spin_lock_irqsave(&psig->siglock, flags);
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v2 7/7] clone4: Add a CLONE_FD flag to get task exit notification via fd
  2015-03-15  7:59 ` Josh Triplett
                   ` (6 preceding siblings ...)
  (?)
@ 2015-03-15  8:00 ` Josh Triplett
  2015-03-23 17:38   ` David Drysdale
  2015-04-06  8:30     ` Sergey Senozhatsky
  -1 siblings, 2 replies; 63+ messages in thread
From: Josh Triplett @ 2015-03-15  8:00 UTC (permalink / raw)
  To: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, Thiago Macieira, linux-kernel,
	linux-api, linux-fsdevel, x86

When passed CLONE_FD, clone4 hands the caller a file descriptor
referring to the new process.  When the new process exits, the file
descriptor becomes readable, producing a structure containing the exit
status, exit code, and user/system times.  The file descriptor also
works in epoll, poll, and select.

This allows libraries to safely launch and manage child processes on
behalf of a caller, without taking over or interfering with process-wide
signal handling.  Without this, such a library would need to take over
or cooperate with the entire process's SIGCHLD handling, either via a
signal handler or a signalfd.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Thiago Macieira <thiago.macieira@intel.com>
---
 include/linux/compat.h     |   2 +
 include/linux/sched.h      |   5 ++
 include/uapi/linux/sched.h |  16 +++++-
 init/Kconfig               |  11 +++++
 kernel/Makefile            |   1 +
 kernel/clonefd.c           | 121 +++++++++++++++++++++++++++++++++++++++++++++
 kernel/clonefd.h           |  32 ++++++++++++
 kernel/exit.c              |   4 ++
 kernel/fork.c              |  22 +++++++--
 9 files changed, 209 insertions(+), 5 deletions(-)
 create mode 100644 kernel/clonefd.c
 create mode 100644 kernel/clonefd.h

diff --git a/include/linux/compat.h b/include/linux/compat.h
index 6c4a68d..c90df5a 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -299,6 +299,8 @@ struct compat_clone4_args {
 	compat_ulong_t stack_start;
 	compat_ulong_t stack_size;
 	compat_ulong_t tls;
+	compat_uptr_t clonefd;
+	u32 clonefd_flags;
 };
 
 struct compat_statfs;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9daa017..1dc680b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1374,6 +1374,11 @@ struct task_struct {
 
 	unsigned autoreap:1; /* Do not become a zombie on exit */
 
+#ifdef CONFIG_CLONEFD
+	unsigned clonefd:1; /* Notify clonefd_wqh on exit */
+	wait_queue_head_t clonefd_wqh;
+#endif
+
 	unsigned long atomic_flags; /* Flags needing atomic access. */
 
 	struct restart_block restart_block;
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index f606c0a..86627f0 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -41,6 +41,7 @@
  * Flags that only work with clone4.
  */
 #define CLONE_AUTOREAP	0x00001000	/* Automatically reap the process */
+#define CLONE_FD	0x00400000	/* Signal exit via file descriptor */
 
 #ifdef __KERNEL__
 /*
@@ -48,10 +49,21 @@
  * list above, but not exposed to userspace.
  */
 #define CLONE_VALID_FLAGS	(0xffffffffULL & ~(CLONE_PID | CLONE_DETACHED | CLONE_STOPPED))
-#define CLONE4_VALID_FLAGS	(CLONE_VALID_FLAGS | CLONE_AUTOREAP)
+#define CLONE4_VALID_FLAGS	(CLONE_VALID_FLAGS | CLONE_AUTOREAP | \
+				 (IS_ENABLED(CONFIG_CLONEFD) ? CLONE_FD : 0))
 #endif /* __KERNEL__ */
 
 /*
+ * Structure read from CLONE_FD file descriptor after process exits
+ */
+struct clonefd_info {
+	__s32 code;
+	__s32 status;
+	__u64 utime;
+	__u64 stime;
+};
+
+/*
  * Structure passed to clone4 for additional arguments.  Initialized to 0,
  * then overwritten with arguments from userspace, so arguments not supplied by
  * userspace will remain 0.  New versions of the kernel may safely append new
@@ -63,6 +75,8 @@ struct clone4_args {
 	__kernel_ulong_t stack_start;
 	__kernel_ulong_t stack_size;
 	__kernel_ulong_t tls;
+	int __user *clonefd;
+	__u32 clonefd_flags;
 };
 
 /*
diff --git a/init/Kconfig b/init/Kconfig
index 3ab6649..b444280 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1521,6 +1521,17 @@ config CLONE4
 
 	  If unsure, say Y.
 
+config CLONEFD
+	bool "Enable CLONE_FD flag for clone4()" if EXPERT
+	depends on CLONE4
+	select ANON_INODES
+	default y
+	help
+	  Enable the CLONE_FD flag for clone4(), which creates a file descriptor
+	  to receive child exit events rather than receiving a signal.
+
+	  If unsure, say Y.
+
 # syscall, maps, verifier
 config BPF_SYSCALL
 	bool "Enable bpf() system call" if EXPERT
diff --git a/kernel/Makefile b/kernel/Makefile
index 1408b33..368986c 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -29,6 +29,7 @@ obj-y += rcu/
 obj-y += livepatch/
 
 obj-$(CONFIG_CHECKPOINT_RESTORE) += kcmp.o
+obj-$(CONFIG_CLONEFD) += clonefd.o
 obj-$(CONFIG_FREEZER) += freezer.o
 obj-$(CONFIG_PROFILING) += profile.o
 obj-$(CONFIG_STACKTRACE) += stacktrace.o
diff --git a/kernel/clonefd.c b/kernel/clonefd.c
new file mode 100644
index 0000000..eac560c
--- /dev/null
+++ b/kernel/clonefd.c
@@ -0,0 +1,121 @@
+/*
+ * Support functions for CLONE_FD
+ *
+ * Copyright (c) 2015 Intel Corporation
+ * Original authors: Josh Triplett <josh@joshtriplett.org>
+ *                   Thiago Macieira <thiago@macieira.org>
+ */
+#include <linux/anon_inodes.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include "clonefd.h"
+
+static int clonefd_release(struct inode *inode, struct file *file)
+{
+	put_task_struct(file->private_data);
+	return 0;
+}
+
+static unsigned int clonefd_poll(struct file *file, poll_table *wait)
+{
+	struct task_struct *p = file->private_data;
+	poll_wait(file, &p->clonefd_wqh, wait);
+	return p->exit_state ? (POLLIN | POLLRDNORM | POLLHUP) : 0;
+}
+
+static ssize_t clonefd_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
+{
+	struct task_struct *p = file->private_data;
+	int ret = 0;
+
+	/* EOF after first read */
+	if (*ppos)
+		return 0;
+
+	if (file->f_flags & O_NONBLOCK)
+		ret = -EAGAIN;
+	else
+		ret = wait_event_interruptible(p->clonefd_wqh, p->exit_state);
+
+	if (p->exit_state) {
+		struct clonefd_info info = {};
+		cputime_t utime, stime;
+		task_exit_code_status(p->exit_code, &info.code, &info.status);
+		info.code &= ~__SI_MASK;
+		task_cputime(p, &utime, &stime);
+		info.utime = cputime_to_clock_t(utime + p->signal->utime);
+		info.stime = cputime_to_clock_t(stime + p->signal->stime);
+		ret = simple_read_from_buffer(buf, count, ppos, &info, sizeof(info));
+	}
+	return ret;
+}
+
+static struct file_operations clonefd_fops = {
+	.release = clonefd_release,
+	.poll = clonefd_poll,
+	.read = clonefd_read,
+	.llseek = no_llseek,
+};
+
+/* Do process exit notification for clonefd. */
+void clonefd_do_notify(struct task_struct *p)
+{
+	if (p->clonefd)
+		wake_up_all(&p->clonefd_wqh);
+}
+
+/* Handle the CLONE_FD case for copy_process. */
+int clonefd_do_clone(u64 clone_flags, struct task_struct *p,
+		     struct clone4_args *args, struct clonefd_setup *setup)
+{
+	int flags;
+	struct file *file;
+	int fd;
+
+	p->clonefd = !!(clone_flags & CLONE_FD);
+	if (!p->clonefd)
+		return 0;
+
+	if (args->clonefd_flags & ~(O_CLOEXEC | O_NONBLOCK))
+		return -EINVAL;
+
+	init_waitqueue_head(&p->clonefd_wqh);
+
+	get_task_struct(p);
+	flags = O_RDONLY | FMODE_ATOMIC_POS | args->clonefd_flags;
+	file = anon_inode_getfile("[process]", &clonefd_fops, p, flags);
+	if (IS_ERR(file)) {
+		put_task_struct(p);
+		return PTR_ERR(file);
+	}
+
+	fd = get_unused_fd_flags(flags);
+	if (fd < 0) {
+		fput(file);
+		return fd;
+	}
+
+	setup->fd = fd;
+	setup->file = file;
+	return 0;
+}
+
+/* Clean up clonefd information after a partially complete clone */
+void clonefd_cleanup_failed_clone(struct clonefd_setup *setup)
+{
+	if (setup->file) {
+		put_unused_fd(setup->fd);
+		fput(setup->file);
+	}
+}
+
+/* Finish setting up the clonefd */
+void clonefd_install_fd(struct clone4_args *args, struct clonefd_setup *setup)
+{
+	if (setup->file) {
+		fd_install(setup->fd, setup->file);
+		put_user(setup->fd, args->clonefd);
+	}
+}
diff --git a/kernel/clonefd.h b/kernel/clonefd.h
new file mode 100644
index 0000000..2d8a67c
--- /dev/null
+++ b/kernel/clonefd.h
@@ -0,0 +1,32 @@
+/*
+ * Support functions for CLONE_FD
+ *
+ * Copyright (c) 2015 Intel Corporation
+ * Original authors: Josh Triplett <josh@joshtriplett.org>
+ *                   Thiago Macieira <thiago@macieira.org>
+ */
+#pragma once
+
+#include <linux/sched.h>
+
+#ifdef CONFIG_CLONEFD
+struct clonefd_setup {
+	int fd;
+	struct file *file;
+};
+int clonefd_do_clone(u64 clone_flags, struct task_struct *p,
+		     struct clone4_args *args, struct clonefd_setup *setup);
+void clonefd_cleanup_failed_clone(struct clonefd_setup *setup);
+void clonefd_install_fd(struct clone4_args *args, struct clonefd_setup *setup);
+void clonefd_do_notify(struct task_struct *p);
+#else /* CONFIG_CLONEFD */
+struct clonefd_setup {};
+static inline int clonefd_do_clone(u64 clone_flags, struct task_struct *p,
+				   struct clone4_args *args, struct clonefd_setup *setup)
+{
+	return 0;
+}
+static inline void clonefd_cleanup_failed_clone(struct clonefd_setup *setup) {}
+static inline void clonefd_install_fd(struct clone4_args *args, struct clonefd_setup *setup) {}
+static inline void clonefd_do_notify(struct task_struct *p) {}
+#endif /* CONFIG_CLONEFD */
diff --git a/kernel/exit.c b/kernel/exit.c
index feff10b..83278b8 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -59,6 +59,8 @@
 #include <asm/pgtable.h>
 #include <asm/mmu_context.h>
 
+#include "clonefd.h"
+
 static void exit_mm(struct task_struct *tsk);
 
 static void __unhash_process(struct task_struct *p, bool group_dead)
@@ -615,6 +617,8 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
 	if (tsk->exit_state == EXIT_DEAD)
 		list_add(&tsk->ptrace_entry, &dead);
 
+	clonefd_do_notify(tsk);
+
 	/* mt-exec, de_thread() is waiting for group leader */
 	if (unlikely(tsk->signal->notify_count < 0))
 		wake_up_process(tsk->signal->group_exit_task);
diff --git a/kernel/fork.c b/kernel/fork.c
index c297e5e..8fdf0ac 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -87,6 +87,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/task.h>
 
+#include "clonefd.h"
+
 /*
  * Protected counters by write_lock_irq(&tasklist_lock)
  */
@@ -1190,7 +1192,8 @@ init_task_pid(struct task_struct *task, enum pid_type type, struct pid *pid)
 static struct task_struct *copy_process(u64 clone_flags,
 					struct clone4_args *args,
 					struct pid *pid,
-					int trace)
+					int trace,
+					struct clonefd_setup *clonefd_setup)
 {
 	int retval;
 	struct task_struct *p;
@@ -1413,6 +1416,10 @@ static struct task_struct *copy_process(u64 clone_flags,
 			goto bad_fork_cleanup_io;
 	}
 
+	retval = clonefd_do_clone(clone_flags, p, args, clonefd_setup);
+	if (retval)
+		goto bad_fork_free_pid;
+
 	p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? args->ctid : NULL;
 	/*
 	 * Clear TID on mm_release()?
@@ -1507,7 +1514,7 @@ static struct task_struct *copy_process(u64 clone_flags,
 		spin_unlock(&current->sighand->siglock);
 		write_unlock_irq(&tasklist_lock);
 		retval = -ERESTARTNOINTR;
-		goto bad_fork_free_pid;
+		goto bad_fork_cleanup_clonefd;
 	}
 
 	if (likely(p->pid)) {
@@ -1559,6 +1566,8 @@ static struct task_struct *copy_process(u64 clone_flags,
 
 	return p;
 
+bad_fork_cleanup_clonefd:
+	clonefd_cleanup_failed_clone(clonefd_setup);
 bad_fork_free_pid:
 	if (pid != &init_struct_pid)
 		free_pid(pid);
@@ -1617,7 +1626,7 @@ struct task_struct *fork_idle(int cpu)
 {
 	struct task_struct *task;
 	struct clone4_args args = {};
-	task = copy_process(CLONE_VM, &args, &init_struct_pid, 0);
+	task = copy_process(CLONE_VM, &args, &init_struct_pid, 0, NULL);
 	if (!IS_ERR(task)) {
 		init_idle_pids(task->pids);
 		init_idle(task, cpu);
@@ -1637,6 +1646,7 @@ static long _do_fork(u64 clone_flags, struct clone4_args *args)
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
+	struct clonefd_setup clonefd_setup = {};
 
 	/*
 	 * Determine whether and which event to report to ptracer.  When
@@ -1656,7 +1666,7 @@ static long _do_fork(u64 clone_flags, struct clone4_args *args)
 			trace = 0;
 	}
 
-	p = copy_process(clone_flags, args, NULL, trace);
+	p = copy_process(clone_flags, args, NULL, trace, &clonefd_setup);
 	/*
 	 * Do this prior waking up the new thread - the thread pointer
 	 * might get invalid after that point, if the thread exits quickly.
@@ -1679,6 +1689,8 @@ static long _do_fork(u64 clone_flags, struct clone4_args *args)
 			get_task_struct(p);
 		}
 
+		clonefd_install_fd(args, &clonefd_setup);
+
 		wake_up_new_task(p);
 
 		/* forking complete and child started to run, tell ptracer */
@@ -1822,6 +1834,8 @@ COMPAT_SYSCALL_DEFINE4(clone4, unsigned, flags_high, unsigned, flags_low,
 	kargs.stack_start = compat_kargs.stack_start;
 	kargs.stack_size = compat_kargs.stack_size;
 	kargs.tls = compat_kargs.tls;
+	kargs.clonefd = compat_ptr(compat_kargs.clonefd);
+	kargs.clonefd_flags = compat_kargs.clonefd_flags;
 	return _do_fork(flags, &kargs);
 }
 #endif /* CONFIG_COMPAT */
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v2 man-pages] clone4.2: New manpage documenting clone4(2)
  2015-03-15  7:59 ` Josh Triplett
                   ` (7 preceding siblings ...)
  (?)
@ 2015-03-15  8:00 ` Josh Triplett
  -1 siblings, 0 replies; 63+ messages in thread
From: Josh Triplett @ 2015-03-15  8:00 UTC (permalink / raw)
  To: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, Thiago Macieira, linux-kernel,
	linux-api, linux-fsdevel, x86

Also includes new cross-reference from clone.2.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
---
 man2/clone.2  |   1 +
 man2/clone4.2 | 345 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 346 insertions(+)
 create mode 100644 man2/clone4.2

diff --git a/man2/clone.2 b/man2/clone.2
index 752c01e..7013885 100644
--- a/man2/clone.2
+++ b/man2/clone.2
@@ -1209,6 +1209,7 @@ main(int argc, char *argv[])
 }
 .fi
 .SH SEE ALSO
+.BR clone4 (2),
 .BR fork (2),
 .BR futex (2),
 .BR getpid (2),
diff --git a/man2/clone4.2 b/man2/clone4.2
new file mode 100644
index 0000000..f237ebc
--- /dev/null
+++ b/man2/clone4.2
@@ -0,0 +1,345 @@
+.\" Based on clone.2:
+.\" Copyright (c) 1992 Drew Eckhardt <drew@cs.colorado.edu>, March 28, 1992
+.\" and Copyright (c) Michael Kerrisk, 2001, 2002, 2005, 2013
+.\"
+.\" %%%LICENSE_START(GPL_NOVERSION_ONELINE)
+.\" May be distributed under the GNU General Public License.
+.\" %%%LICENSE_END
+.TH CLONE4 2 2015-03-14 "Linux" "Linux Programmer's Manual"
+.SH NAME
+clone4 \- create a child process
+.SH SYNOPSIS
+.nf
+/* Prototype for the glibc wrapper function */
+
+.B #define _GNU_SOURCE
+.B #include <sched.h>
+
+.BI "int clone4(uint64_t " flags ,
+.BI "           size_t " args_size ,
+.BI "           struct clone4_args *" args ,
+.BI "           int (*" "fn" ")(void *), void *" arg );
+
+/* Prototype for the raw system call */
+
+.BI "int clone4(unsigned " flags_high ", unsigned " flags_low ,
+.BI "           unsigned long " args_size ,
+.BI "           struct clone4_args *" args );
+
+struct clone4_args {
+    pid_t *ptid;
+    pid_t *ctid;
+    unsigned long stack_start;
+    unsigned long stack_size;
+    unsigned long tls;
+    int *clonefd;
+    unsigned clonefd_flags;
+};
+
+.SH DESCRIPTION
+.BR clone4 ()
+creates a new process, similar to
+.BR clone (2)
+and
+.BR fork (2).
+.BR clone4 ()
+supports additional flags that
+.BR clone (2)
+does not, and accepts arguments via an extensible structure.
+
+.I args
+points to a
+.I clone4_args
+structure, and
+.I args_size
+must contain the size of that structure, as understood by the caller.  If the
+caller passes a shorter structure than the kernel expects, the remaining fields
+will default to 0.  If the caller passes a larger structure than the kernel
+expects (such as one from a newer kernel),
+.BR clone4 ()
+will return
+.BR EINVAL .
+The
+.I clone4_args
+structure may gain additional fields at the end in the future, and callers must
+only pass a size that encompasses the number of fields they understand.  If the
+caller passes 0 for
+.IR args_size ,
+.I args
+is ignored and may be NULL.
+
+In the
+.I clone4_args
+structure,
+.IR ptid ,
+.IR ctid ,
+.IR stack_start ,
+.IR stack_size ,
+and
+.I tls
+have the same semantics as they do with
+.BR clone (2)
+and
+.BR clone2 (2).
+
+In the glibc wrapper,
+.I fn
+and
+.I arg
+have the same semantics as they do with
+.BR clone (2).
+As with
+.BR clone (2),
+the underlying system call works more like
+.BR fork (2),
+returning 0 in the child process; the glibc wrapper simplifies thread execution
+by calling
+.IR fn ( arg )
+and exiting the child when that function exits.
+
+The 64-bit
+.I flags
+argument (split into the 32-bit
+.I flags_high
+and
+.I flags_low
+arguments in the kernel interface for portability across architectures)
+accepts all the same flags as
+.BR clone (2),
+with the exception of the obsolete
+.BR CLONE_PID ,
+.BR CLONE_DETACHED ,
+and
+.BR CLONE_STOPPED .
+In addition,
+.I flags
+accepts the following flags:
+
+.TP
+.B CLONE_AUTOREAP
+When the new process exits, immediately reap it, rather than keeping it around
+as a "zombie" until a call to
+.BR waitpid (2)
+or similar.  Without this flag, the kernel will automatically reap a process if
+its exit signal is set to
+.BR SIGCHLD ,
+and if the parent process has
+.B SIGCHLD
+set to
+.B SIG_IGN
+or has a
+.B SIGCHLD
+handler installed with
+.B SA_NOCLDWAIT
+(see
+.BR sigaction (2)).
+.B CLONE_AUTOREAP
+allows the calling process to enable automatic reaping with an exit signal
+other than
+.B SIGCHLD
+(including 0 to disable the exit signal), and does not depend on the
+configuration of process-wide signal handling.
+
+.TP
+.B CLONE_FD
+Return a file descriptor associated with the new process, storing it in
+location
+.I clonefd
+in the parent's address space.  When the new process exits, the file descriptor
+will become available for reading.
+
+Unlike using
+.BR signalfd (2)
+for the
+.B SIGCHLD
+signal,
+the file descriptor returned by
+.BR clone4 ()
+with the
+.B CLONE_FD
+flag works even with
+.B SIGCHLD
+unblocked in one or more threads of the parent process, allowing the process to
+have different handlers for different child processes, such as those created by
+a library, without introducing race conditions around process-wide signal
+handling.
+
+.I clonefd_flags
+may contain the following additional flags for use with
+.BR CLONE_FD :
+
+.RS
+.TP
+.B O_CLOEXEC
+Set the close-on-exec flag on the new file descriptor.  See the description of
+the
+.B O_CLOEXEC
+flag in
+.BR open (2)
+for reasons why this may be useful.
+
+.TP
+.B O_NONBLOCK
+Set the
+.B O_NONBLOCK
+flag on the new file descriptor.  Using this flag saves extra calls to
+.BR fcntl (2)
+to achieve the same result.
+.RE
+
+.IP
+The returned file descriptor supports the following operations:
+.RS
+.TP
+.BR read "(2) (and similar)"
+When the new process exits, reading from the file descriptor produces
+a single
+.I clonefd_info
+structure:
+.nf
+
+struct clonefd_info {
+    uint32_t code;   /* Signal code */
+    uint32_t status; /* Exit status or signal */
+    uint64_t utime;  /* User CPU time */
+    uint64_t stime;  /* System CPU time */
+};
+
+.fi
+.IP
+If the new process has not yet exited,
+.BR read (2)
+either blocks until it does, or fails with the error
+.B EAGAIN
+if the file descriptor has
+.B O_NONBLOCK
+set.
+.IP
+Future kernels may extend
+.I clonefd_info
+by appending additional fields to the end.  Callers should read as many bytes
+as they understand; unread data will be discarded, and subsequent reads after
+the first will return 0 to indicate end-of-file.  Callers requesting more bytes
+than the kernel provides (such as callers expecting a newer
+.I clonefd_info
+structure) will receive a shorter structure from older kernels.
+.TP
+.BR poll "(2), " select "(2), " epoll "(7) (and similar)"
+The file descriptor is readable
+(the
+.BR select (2)
+.I readfds
+argument; the
+.BR poll (2)
+.B POLLIN
+flag)
+if the new process has exited.
+.TP
+.BR close (2)
+When the file descriptor is no longer required it should be closed.
+.RE
+
+.SS C library/kernel ABI differences
+As with
+.BR clone (2),
+the raw
+.BR clone4 ()
+system call corresponds more closely to
+.BR fork (2)
+in that execution in the child continues from the point of the call.
+
+Unlike
+.BR clone (2),
+the raw system call interface for
+.BR clone4 ()
+accepts arguments in the same order on all architectures.
+
+The raw system call accepts
+.I flags
+as two 32-bit arguments,
+.I flags_high
+and
+.IR flags_low ,
+to simplify portability across 32-bit and 64-bit architectures and calling
+conventions.  The glibc wrapper accepts
+.I flags
+as a single 64-bit argument for convenience.
+
+.SH RETURN VALUE
+For the glibc wrapper, on success,
+.BR clone4 ()
+returns the new process ID to the calling process, and the new process begins
+running at the specified function.
+
+For the raw syscall, on success,
+.BR clone4 ()
+returns the new process ID to the calling process, and returns 0 in the new
+process.
+
+On failure,
+.BR clone4 ()
+returns \-1 and sets
+.I errno
+accordingly.
+
+.SH ERRORS
+.BR clone4 ()
+can return any error from
+.BR clone (2),
+as well as the following additional errors:
+.TP
+.B EFAULT
+.I args
+is outside your accessible address space.
+.TP
+.B EINVAL
+.I flags
+contained an unknown flag.
+.TP
+.B EINVAL
+.I flags
+included
+.B CLONE_FD
+and
+.I clonefd_flags
+contained an unknown flag.
+.TP
+.B EINVAL
+.I flags
+included
+.BR CLONE_FD,
+but the kernel configuration does not have the
+.B CONFIG_CLONEFD
+option enabled.
+.TP
+.B EMFILE
+.I flags
+included
+.BR CLONE_FD,
+but the new file descriptor would exceed the process limit on open file descriptors.
+.TP
+.B ENFILE
+.I flags
+included
+.BR CLONE_FD,
+but the new file descriptor would exceed the system-wide limit on open file descriptors.
+.TP
+.B ENODEV
+.I flags
+included
+.BR CLONE_FD,
+but
+.BR clone4 ()
+could not mount the (internal) anonymous inode device.
+
+.SH CONFORMING TO
+.BR clone4 ()
+is Linux-specific and should not be used in programs intended to be portable.
+
+.SH SEE ALSO
+.BR clone (2),
+.BR epoll (7),
+.BR poll (2),
+.BR pthreads (7),
+.BR read (2),
+.BR select (2)
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 0/7] CLONE_FD: Task exit notification via file descriptor
@ 2015-03-15  8:04   ` Josh Triplett
  0 siblings, 0 replies; 63+ messages in thread
From: Josh Triplett @ 2015-03-15  8:04 UTC (permalink / raw)
  To: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, Thiago Macieira, linux-kernel,
	linux-api, linux-fsdevel, x86

On Sun, Mar 15, 2015 at 12:59:17AM -0700, Josh Triplett wrote:
> This patch series also introduces a clone flag CLONE_AUTOREAP, which causes the
> kernel to automatically reap the child process when it exits, just as it does
> for processes using SIGCHLD when the parent has SIGCHLD ignored or marked as
> SA_NOCLDSTOP.

Typo: s/SA_NOCLDSTOP/SA_NOCLDWAIT/.

- Josh Triplett

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 0/7] CLONE_FD: Task exit notification via file descriptor
@ 2015-03-15  8:04   ` Josh Triplett
  0 siblings, 0 replies; 63+ messages in thread
From: Josh Triplett @ 2015-03-15  8:04 UTC (permalink / raw)
  To: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, Thiago Macieira,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A

On Sun, Mar 15, 2015 at 12:59:17AM -0700, Josh Triplett wrote:
> This patch series also introduces a clone flag CLONE_AUTOREAP, which causes the
> kernel to automatically reap the child process when it exits, just as it does
> for processes using SIGCHLD when the parent has SIGCHLD ignored or marked as
> SA_NOCLDSTOP.

Typo: s/SA_NOCLDSTOP/SA_NOCLDWAIT/.

- Josh Triplett

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 5/7] clone4: Add a CLONE_AUTOREAP flag to automatically reap the child process
@ 2015-03-15 14:52     ` Oleg Nesterov
  0 siblings, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2015-03-15 14:52 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
	Michael Kerrisk, Thiago Macieira, linux-kernel, linux-api,
	linux-fsdevel, x86

On 03/15, Josh Triplett wrote:
>
> Add a CLONE_AUTOREAP flag to request this behavior unconditionally,

Yes, CLONE_AUTOREAP is much better. And I agree (mostly) with that
we should rely on do_notify_parent().

Howver the patch still doesn't look right. First of all, ->autoreap
should be per-process, not per-thread. And there are ptrace/mt issues,
it seems. Just for example, we should avoid EXIT_TRACE if autoreap in
wait_task_zombie() even if we are going to re-notify parent.

Yes... and other problems with ptrace. So let me nack this patch for
the moment ;) But let me repeat that personally I agree with this
change "in general".

EXCEPT: do we really want SIGCHLD from the exiting child? I think we
do not. I won't really argue though, but this should be discussed and
documented. IIUC, with your patch it is still sent.

Josh, please give me some time to think and re-check, I'll write another
email next week. I am not sure this is really needed, but it seems to
me that we need the preparation patch to make this change clear/simple.

Oleg.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 5/7] clone4: Add a CLONE_AUTOREAP flag to automatically reap the child process
@ 2015-03-15 14:52     ` Oleg Nesterov
  0 siblings, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2015-03-15 14:52 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
	Michael Kerrisk, Thiago Macieira,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A

On 03/15, Josh Triplett wrote:
>
> Add a CLONE_AUTOREAP flag to request this behavior unconditionally,

Yes, CLONE_AUTOREAP is much better. And I agree (mostly) with that
we should rely on do_notify_parent().

Howver the patch still doesn't look right. First of all, ->autoreap
should be per-process, not per-thread. And there are ptrace/mt issues,
it seems. Just for example, we should avoid EXIT_TRACE if autoreap in
wait_task_zombie() even if we are going to re-notify parent.

Yes... and other problems with ptrace. So let me nack this patch for
the moment ;) But let me repeat that personally I agree with this
change "in general".

EXCEPT: do we really want SIGCHLD from the exiting child? I think we
do not. I won't really argue though, but this should be discussed and
documented. IIUC, with your patch it is still sent.

Josh, please give me some time to think and re-check, I'll write another
email next week. I am not sure this is really needed, but it seems to
me that we need the preparation patch to make this change clear/simple.

Oleg.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 5/7] clone4: Add a CLONE_AUTOREAP flag to automatically reap the child process
@ 2015-03-15 17:18       ` Josh Triplett
  0 siblings, 0 replies; 63+ messages in thread
From: Josh Triplett @ 2015-03-15 17:18 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
	Michael Kerrisk, Thiago Macieira, linux-kernel, linux-api,
	linux-fsdevel, x86

On Sun, Mar 15, 2015 at 03:52:23PM +0100, Oleg Nesterov wrote:
> On 03/15, Josh Triplett wrote:
> > Add a CLONE_AUTOREAP flag to request this behavior unconditionally,
> 
> Yes, CLONE_AUTOREAP is much better. And I agree (mostly) with that
> we should rely on do_notify_parent().
> 
> Howver the patch still doesn't look right. First of all, ->autoreap
> should be per-process, not per-thread.

Ah, you're thinking of the case where the parent process launches a
child with CLONE_AUTOREAP, that child process launches siblings with
CLONE_THREAD and without CLONE_AUTOREAP, and one of those siblings is
the last to exit?  That seems easy enough to handle: instead of setting
->autoreap unconditionally in copy_process, I can set it only in the
non-CLONE_THREAD case, and otherwise let it inherit.  Then every task in
the group will have the same value for autoreap.

(As an aside, what *is* the use case for CLONE_PARENT without
CLONE_THREAD?)

> And there are ptrace/mt issues,
> it seems. Just for example, we should avoid EXIT_TRACE if autoreap in
> wait_task_zombie() even if we are going to re-notify parent.

I don't see how EXIT_TRACE can happen in wait_task_zombie if autoreap is
set.  wait_task_zombie does a cmpxchg with exit_state and doesn't
proceed unless exit_state was EXIT_ZOMBIE, and I don't see how we can
ever reach the EXIT_ZOMBIE state if autoreap.

> EXCEPT: do we really want SIGCHLD from the exiting child? I think we
> do not. I won't really argue though, but this should be discussed and
> documented. IIUC, with your patch it is still sent.

I think we do, yes.  The caller of clone can already specify what signal
they want, including no signal at all.  If they specify a signal
(SIGCHLD or otherwise) along with CLONE_AUTOREAP, we can send that
signal.  I don't think that causes any particular problem.

That's the same semantic you'd get if you have a SIGCHLD handler with
SA_NOCLDWAIT: you'd still get the signal, even though you don't need to
(and can't) wait on the child process.

> Josh, please give me some time to think and re-check, I'll write another
> email next week. I am not sure this is really needed, but it seems to
> me that we need the preparation patch to make this change clear/simple.

I'd appreciate any feedback you can offer on this series, including any
potential subtle interactions with ptrace.

- Josh Triplett

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 5/7] clone4: Add a CLONE_AUTOREAP flag to automatically reap the child process
@ 2015-03-15 17:18       ` Josh Triplett
  0 siblings, 0 replies; 63+ messages in thread
From: Josh Triplett @ 2015-03-15 17:18 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
	Michael Kerrisk, Thiago Macieira,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A

On Sun, Mar 15, 2015 at 03:52:23PM +0100, Oleg Nesterov wrote:
> On 03/15, Josh Triplett wrote:
> > Add a CLONE_AUTOREAP flag to request this behavior unconditionally,
> 
> Yes, CLONE_AUTOREAP is much better. And I agree (mostly) with that
> we should rely on do_notify_parent().
> 
> Howver the patch still doesn't look right. First of all, ->autoreap
> should be per-process, not per-thread.

Ah, you're thinking of the case where the parent process launches a
child with CLONE_AUTOREAP, that child process launches siblings with
CLONE_THREAD and without CLONE_AUTOREAP, and one of those siblings is
the last to exit?  That seems easy enough to handle: instead of setting
->autoreap unconditionally in copy_process, I can set it only in the
non-CLONE_THREAD case, and otherwise let it inherit.  Then every task in
the group will have the same value for autoreap.

(As an aside, what *is* the use case for CLONE_PARENT without
CLONE_THREAD?)

> And there are ptrace/mt issues,
> it seems. Just for example, we should avoid EXIT_TRACE if autoreap in
> wait_task_zombie() even if we are going to re-notify parent.

I don't see how EXIT_TRACE can happen in wait_task_zombie if autoreap is
set.  wait_task_zombie does a cmpxchg with exit_state and doesn't
proceed unless exit_state was EXIT_ZOMBIE, and I don't see how we can
ever reach the EXIT_ZOMBIE state if autoreap.

> EXCEPT: do we really want SIGCHLD from the exiting child? I think we
> do not. I won't really argue though, but this should be discussed and
> documented. IIUC, with your patch it is still sent.

I think we do, yes.  The caller of clone can already specify what signal
they want, including no signal at all.  If they specify a signal
(SIGCHLD or otherwise) along with CLONE_AUTOREAP, we can send that
signal.  I don't think that causes any particular problem.

That's the same semantic you'd get if you have a SIGCHLD handler with
SA_NOCLDWAIT: you'd still get the signal, even though you don't need to
(and can't) wait on the child process.

> Josh, please give me some time to think and re-check, I'll write another
> email next week. I am not sure this is really needed, but it seems to
> me that we need the preparation patch to make this change clear/simple.

I'd appreciate any feedback you can offer on this series, including any
potential subtle interactions with ptrace.

- Josh Triplett

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 5/7] clone4: Add a CLONE_AUTOREAP flag to automatically reap the child process
  2015-03-15 17:18       ` Josh Triplett
@ 2015-03-15 19:55         ` Oleg Nesterov
  -1 siblings, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2015-03-15 19:55 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
	Michael Kerrisk, Thiago Macieira, linux-kernel, linux-api,
	linux-fsdevel, x86

On 03/15, Josh Triplett wrote:
>
> On Sun, Mar 15, 2015 at 03:52:23PM +0100, Oleg Nesterov wrote:
> > On 03/15, Josh Triplett wrote:
> > > Add a CLONE_AUTOREAP flag to request this behavior unconditionally,
> >
> > Yes, CLONE_AUTOREAP is much better. And I agree (mostly) with that
> > we should rely on do_notify_parent().
> >
> > Howver the patch still doesn't look right. First of all, ->autoreap
> > should be per-process, not per-thread.
>
> Ah, you're thinking of the case where the parent process launches a
> ...

Not really, although we probably need more sanity checks.

It should be per-process simply because this "autoreap" affects the whole
process. And the sub-threads are already "autoreap". And these 2 autoreap's
semantics differ, we should not confuse them.

> (As an aside, what *is* the use case for CLONE_PARENT without
> CLONE_THREAD?)

To me CLONE_PARENT is another historical mistake and the source of misc
problems ;)

> > And there are ptrace/mt issues,
> > it seems. Just for example, we should avoid EXIT_TRACE if autoreap in
> > wait_task_zombie() even if we are going to re-notify parent.
>
> I don't see how EXIT_TRACE can happen in wait_task_zombie if autoreap is
> set.  wait_task_zombie does a cmpxchg with exit_state and doesn't
> proceed unless exit_state was EXIT_ZOMBIE, and I don't see how we can
> ever reach the EXIT_ZOMBIE state if autoreap.

Because you again forgot about ptrace ;)

Josh. Let me try to summarise this later when I have time. Again, I am
not sure, perhaps this is even simpler than I currently think. And let
me apologize in advance, most probably I will be busy tomorrow.

> > EXCEPT: do we really want SIGCHLD from the exiting child? I think we
> > do not. I won't really argue though, but this should be discussed and
> > documented. IIUC, with your patch it is still sent.
>
> I think we do, yes.  The caller of clone can already specify what signal
> they want, including no signal at all.  If they specify a signal
> (SIGCHLD or otherwise) along with CLONE_AUTOREAP, we can send that
> signal.

OK. Agreed.

Oleg.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 5/7] clone4: Add a CLONE_AUTOREAP flag to automatically reap the child process
@ 2015-03-15 19:55         ` Oleg Nesterov
  0 siblings, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2015-03-15 19:55 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
	Michael Kerrisk, Thiago Macieira,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A

On 03/15, Josh Triplett wrote:
>
> On Sun, Mar 15, 2015 at 03:52:23PM +0100, Oleg Nesterov wrote:
> > On 03/15, Josh Triplett wrote:
> > > Add a CLONE_AUTOREAP flag to request this behavior unconditionally,
> >
> > Yes, CLONE_AUTOREAP is much better. And I agree (mostly) with that
> > we should rely on do_notify_parent().
> >
> > Howver the patch still doesn't look right. First of all, ->autoreap
> > should be per-process, not per-thread.
>
> Ah, you're thinking of the case where the parent process launches a
> ...

Not really, although we probably need more sanity checks.

It should be per-process simply because this "autoreap" affects the whole
process. And the sub-threads are already "autoreap". And these 2 autoreap's
semantics differ, we should not confuse them.

> (As an aside, what *is* the use case for CLONE_PARENT without
> CLONE_THREAD?)

To me CLONE_PARENT is another historical mistake and the source of misc
problems ;)

> > And there are ptrace/mt issues,
> > it seems. Just for example, we should avoid EXIT_TRACE if autoreap in
> > wait_task_zombie() even if we are going to re-notify parent.
>
> I don't see how EXIT_TRACE can happen in wait_task_zombie if autoreap is
> set.  wait_task_zombie does a cmpxchg with exit_state and doesn't
> proceed unless exit_state was EXIT_ZOMBIE, and I don't see how we can
> ever reach the EXIT_ZOMBIE state if autoreap.

Because you again forgot about ptrace ;)

Josh. Let me try to summarise this later when I have time. Again, I am
not sure, perhaps this is even simpler than I currently think. And let
me apologize in advance, most probably I will be busy tomorrow.

> > EXCEPT: do we really want SIGCHLD from the exiting child? I think we
> > do not. I won't really argue though, but this should be discussed and
> > documented. IIUC, with your patch it is still sent.
>
> I think we do, yes.  The caller of clone can already specify what signal
> they want, including no signal at all.  If they specify a signal
> (SIGCHLD or otherwise) along with CLONE_AUTOREAP, we can send that
> signal.

OK. Agreed.

Oleg.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 5/7] clone4: Add a CLONE_AUTOREAP flag to automatically reap the child process
@ 2015-03-15 23:34           ` Josh Triplett
  0 siblings, 0 replies; 63+ messages in thread
From: Josh Triplett @ 2015-03-15 23:34 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
	Michael Kerrisk, Thiago Macieira, linux-kernel, linux-api,
	linux-fsdevel, x86

On Sun, Mar 15, 2015 at 08:55:06PM +0100, Oleg Nesterov wrote:
> On 03/15, Josh Triplett wrote:
> > On Sun, Mar 15, 2015 at 03:52:23PM +0100, Oleg Nesterov wrote:
> > > On 03/15, Josh Triplett wrote:
> > > > Add a CLONE_AUTOREAP flag to request this behavior unconditionally,
> > >
> > > Yes, CLONE_AUTOREAP is much better. And I agree (mostly) with that
> > > we should rely on do_notify_parent().
> > >
> > > Howver the patch still doesn't look right. First of all, ->autoreap
> > > should be per-process, not per-thread.
> >
> > Ah, you're thinking of the case where the parent process launches a
> > ...
> 
> Not really, although we probably need more sanity checks.
> 
> It should be per-process simply because this "autoreap" affects the whole
> process. And the sub-threads are already "autoreap". And these 2 autoreap's
> semantics differ, we should not confuse them.

Will the approach I suggested, of having clones with CLONE_THREAD
inherit the autoreap value rather than setting it from CLONE_AUTOREAP,
implement the semantics you're looking for?

Also, are you suggesting that CLONE_AUTOREAP with CLONE_THREAD should
produce -EINVAL, or just that it should be ignored?

> > (As an aside, what *is* the use case for CLONE_PARENT without
> > CLONE_THREAD?)
> 
> To me CLONE_PARENT is another historical mistake and the source of misc
> problems ;)

I kinda figured. :)

> > > And there are ptrace/mt issues,
> > > it seems. Just for example, we should avoid EXIT_TRACE if autoreap in
> > > wait_task_zombie() even if we are going to re-notify parent.
> >
> > I don't see how EXIT_TRACE can happen in wait_task_zombie if autoreap is
> > set.  wait_task_zombie does a cmpxchg with exit_state and doesn't
> > proceed unless exit_state was EXIT_ZOMBIE, and I don't see how we can
> > ever reach the EXIT_ZOMBIE state if autoreap.
> 
> Because you again forgot about ptrace ;)
> 
> Josh. Let me try to summarise this later when I have time. Again, I am
> not sure, perhaps this is even simpler than I currently think. And let
> me apologize in advance, most probably I will be busy tomorrow.

I look forward to your later review and feedback.

- Josh Triplett

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 5/7] clone4: Add a CLONE_AUTOREAP flag to automatically reap the child process
@ 2015-03-15 23:34           ` Josh Triplett
  0 siblings, 0 replies; 63+ messages in thread
From: Josh Triplett @ 2015-03-15 23:34 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
	Michael Kerrisk, Thiago Macieira,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A

On Sun, Mar 15, 2015 at 08:55:06PM +0100, Oleg Nesterov wrote:
> On 03/15, Josh Triplett wrote:
> > On Sun, Mar 15, 2015 at 03:52:23PM +0100, Oleg Nesterov wrote:
> > > On 03/15, Josh Triplett wrote:
> > > > Add a CLONE_AUTOREAP flag to request this behavior unconditionally,
> > >
> > > Yes, CLONE_AUTOREAP is much better. And I agree (mostly) with that
> > > we should rely on do_notify_parent().
> > >
> > > Howver the patch still doesn't look right. First of all, ->autoreap
> > > should be per-process, not per-thread.
> >
> > Ah, you're thinking of the case where the parent process launches a
> > ...
> 
> Not really, although we probably need more sanity checks.
> 
> It should be per-process simply because this "autoreap" affects the whole
> process. And the sub-threads are already "autoreap". And these 2 autoreap's
> semantics differ, we should not confuse them.

Will the approach I suggested, of having clones with CLONE_THREAD
inherit the autoreap value rather than setting it from CLONE_AUTOREAP,
implement the semantics you're looking for?

Also, are you suggesting that CLONE_AUTOREAP with CLONE_THREAD should
produce -EINVAL, or just that it should be ignored?

> > (As an aside, what *is* the use case for CLONE_PARENT without
> > CLONE_THREAD?)
> 
> To me CLONE_PARENT is another historical mistake and the source of misc
> problems ;)

I kinda figured. :)

> > > And there are ptrace/mt issues,
> > > it seems. Just for example, we should avoid EXIT_TRACE if autoreap in
> > > wait_task_zombie() even if we are going to re-notify parent.
> >
> > I don't see how EXIT_TRACE can happen in wait_task_zombie if autoreap is
> > set.  wait_task_zombie does a cmpxchg with exit_state and doesn't
> > proceed unless exit_state was EXIT_ZOMBIE, and I don't see how we can
> > ever reach the EXIT_ZOMBIE state if autoreap.
> 
> Because you again forgot about ptrace ;)
> 
> Josh. Let me try to summarise this later when I have time. Again, I am
> not sure, perhaps this is even simpler than I currently think. And let
> me apologize in advance, most probably I will be busy tomorrow.

I look forward to your later review and feedback.

- Josh Triplett

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 0/7] CLONE_FD: Task exit notification via file descriptor
  2015-03-15  7:59 ` Josh Triplett
@ 2015-03-16 21:44   ` Kees Cook
  -1 siblings, 0 replies; 63+ messages in thread
From: Kees Cook @ 2015-03-16 21:44 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, Thiago Macieira, LKML,
	Linux API, linux-fsdevel, x86

On Sun, Mar 15, 2015 at 12:59 AM, Josh Triplett <josh@joshtriplett.org> wrote:
> This patch series introduces a new clone flag, CLONE_FD, which lets the caller
> receive child process exit notification via a file descriptor rather than
> SIGCHLD.  CLONE_FD makes it possible for libraries to safely launch and manage
> child processes on behalf of their caller, *without* taking over process-wide
> SIGCHLD handling (either via signal handler or signalfd).
>
> Note that signalfd for SIGCHLD does not suffice here, because that still
> receives notification for all child processes, and interferes with process-wide
> signal handling.
>
> The CLONE_FD file descriptor uniquely identifies a process on the system in a
> race-free way, by holding a reference to the task_struct.  In the future, we
> may introduce APIs that support using process file descriptors instead of PIDs.
>
> This patch series also introduces a clone flag CLONE_AUTOREAP, which causes the
> kernel to automatically reap the child process when it exits, just as it does
> for processes using SIGCHLD when the parent has SIGCHLD ignored or marked as
> SA_NOCLDSTOP.
>
> Taken together, a library can launch a process with CLONE_FD, CLONE_AUTOREAP,
> and no exit signal, and completely avoid affecting either process-wide signal
> handling or an existing child wait loop.
>
> Introducing CLONE_FD and CLONE_AUTOREAP required two additional bits of yak
> shaving: Since clone has no more usable flags (with the three currently unused
> flags unusable because old kernels ignore them without EINVAL), also introduce
> a new clone4 system call with more flag bits and an extensible argument
> structure.  And since the magic pt_regs-based syscall argument processing for
> clone's tls argument would otherwise prevent introducing a sane clone4 system
> call, fix that too.
>
> I tested the CLONE_SETTLS changes with a thread-local storage test program (two
> threads independently reading and writing a __thread variable), on both 32-bit
> and 64-bit, and I observed no issues there.
>
> I tested clone4 and the new flags with several additional test programs,
> launching either a process or thread (in the former case using syscall(), in
> the latter case by calling clone4 via assembly and returning to C), sleeping in
> parent and child to test the case of either exiting first, and then printing
> the received clone4_info structure.
>
> Changes in v2:
> - Split out autoreaping into a separate CLONE_AUTOREAP.  CLONE_FD no longer
>   implies autoreaping and no exit signal, and CLONE_AUTOREAP does not affect
>   ptracers or signal handling.  Thanks to Oleg Nesterov for careful
>   investigation and discussion on v1.
> - Accept O_CLOEXEC and O_NONBLOCK via a clonefd_flags parameter in clone4_args.
>   Stop overloading the low byte of the main clone flags, since CLONE_FD now
>   works with a non-zero signal.
> - Return the file descriptor via an out parameter in clone4_args.
> - Drop patch to export alloc_fd; CLONE_FD now uses the next available file
>   descriptor, even if that's 0-2, since clone4 no longer needs to avoid
>   ambiguity with the 0 return indicating the child process.
> - Make poll on a CLONE_FD for an exited task also return POLLHUP, for
>   compatibility with FreeBSD's pdfork.  Thanks to David Drysdale for calling
>   attention to pdfork.

I think POLLHUP should be mentioned in the manpage (now it only
mentions POLLIN).

> - Fix typo in squelch_clone_flags.
> - Pass arguments to _do_fork and copy_process as a structure.
> - Construct the 64-bit flags in a separate variable, rather than inline in the
>   call to do_fork.
> - Fix error return for copy_from_user faults.
> - Add the new syscall to asm-generic.
> - Add ack from Andy Lutomirski to patches 1 and 2.
>
> I've included the manpages patch at the end of this series.  (Note that the
> manpage documents the behavior of the future glibc wrapper as well as the raw
> syscall.)  Here's a formatted plain-text version of the manpage for reference:
>
> CLONE4(2)                  Linux Programmer's Manual                 CLONE4(2)
>
>
>
> NAME
>        clone4 - create a child process
>
> SYNOPSIS
>        /* Prototype for the glibc wrapper function */
>
>        #define _GNU_SOURCE
>        #include <sched.h>
>
>        int clone4(uint64_t flags,
>                   size_t args_size,
>                   struct clone4_args *args,
>                   int (*fn)(void *), void *arg);
>
>        /* Prototype for the raw system call */
>
>        int clone4(unsigned flags_high, unsigned flags_low,
>                   unsigned long args_size,
>                   struct clone4_args *args);
>
>        struct clone4_args {
>            pid_t *ptid;
>            pid_t *ctid;
>            unsigned long stack_start;
>            unsigned long stack_size;
>            unsigned long tls;
>            int *clonefd;
>            unsigned clonefd_flags;
>        };
>
>
> DESCRIPTION
>        clone4()  creates  a  new  process,  similar  to  clone(2) and fork(2).
>        clone4() supports additional flags that clone(2) does not, and  accepts
>        arguments via an extensible structure.
>
>        args  points to a clone4_args structure, and args_size must contain the
>        size of that structure, as understood by the  caller.   If  the  caller
>        passes  a  shorter  structure  than  the  kernel expects, the remaining
>        fields will default to 0.  If the caller passes a larger structure than
>        the  kernel  expects  (such  as one from a newer kernel), clone4() will
>        return EINVAL.  The clone4_args structure may gain additional fields at
>        the  end  in  the future, and callers must only pass a size that encom‐
>        passes the number of fields they understand.  If the  caller  passes  0
>        for args_size, args is ignored and may be NULL.
>
>        In  the clone4_args structure, ptid, ctid, stack_start, stack_size, and
>        tls have the same semantics as they do with clone(2) and clone2(2).
>
>        In the glibc wrapper, fn and arg have the same  semantics  as  they  do
>        with clone(2).  As with clone(2), the underlying system call works more
>        like fork(2), returning 0 in the child process; the glibc wrapper  sim‐
>        plifies  thread execution by calling fn(arg) and exiting the child when
>        that function exits.
>
>        The 64-bit  flags  argument  (split  into  the  32-bit  flags_high  and
>        flags_low  arguments  in  the  kernel  interface for portability across
>        architectures) accepts all the same flags as clone(2), with the  excep‐
>        tion  of the obsolete CLONE_PID, CLONE_DETACHED, and CLONE_STOPPED.  In
>        addition, flags accepts the following flags:
>
>
>        CLONE_AUTOREAP
>               When the new process exits, immediately  reap  it,  rather  than
>               keeping  it  around  as a "zombie" until a call to waitpid(2) or
>               similar.  Without this flag, the kernel will automatically  reap
>               a  process if its exit signal is set to SIGCHLD, and if the par‐
>               ent process has SIGCHLD set to SIG_IGN or has a SIGCHLD  handler
>               installed  with SA_NOCLDWAIT (see sigaction(2)).  CLONE_AUTOREAP
>               allows the calling process to enable automatic reaping  with  an
>               exit  signal other than SIGCHLD (including 0 to disable the exit
>               signal), and does not depend on the  configuration  of  process-
>               wide signal handling.
>
>
>        CLONE_FD
>               Return  a file descriptor associated with the new process, stor‐
>               ing it in location clonefd in the parent's address space.   When
>               the new process exits, the file descriptor will become available
>               for reading.
>
>               Unlike using  signalfd(2)  for  the  SIGCHLD  signal,  the  file
>               descriptor  returned  by  clone4()  with the CLONE_FD flag works
>               even with SIGCHLD unblocked in one or more threads of the parent
>               process,  allowing  the  process  to have different handlers for
>               different child processes, such as those created by  a  library,
>               without  introducing  race conditions around process-wide signal
>               handling.
>
>               clonefd_flags may contain the following additional flags for use
>               with CLONE_FD:
>
>
>               O_CLOEXEC
>                      Set  the  close-on-exec  flag on the new file descriptor.
>                      See the description of the O_CLOEXEC flag in open(2)  for
>                      reasons why this may be useful.

This begs the question: what happens when all CLONE_FD fds for a
process are closed? Will the parent get SIGCHLD instead, will it
auto-reap, or will it be un-wait-able (I assume not this...)

>
>
>               O_NONBLOCK
>                      Set  the  O_NONBLOCK  flag  on  the  new file descriptor.
>                      Using this flag saves extra calls to fcntl(2) to  achieve
>                      the same result.
>
>
>               The returned file descriptor supports the following operations:
>
>               read(2) (and similar)
>                      When  the  new  process  exits,  reading  from  the  file
>                      descriptor produces a single clonefd_info structure:
>
>                      struct clonefd_info {
>                          uint32_t code;   /* Signal code */
>                          uint32_t status; /* Exit status or signal */
>                          uint64_t utime;  /* User CPU time */
>                          uint64_t stime;  /* System CPU time */
>                      };
>
>
>                      If the new process has not  yet  exited,  read(2)  either
>                      blocks  until  it does, or fails with the error EAGAIN if
>                      the file descriptor has O_NONBLOCK set.
>
>                      Future kernels may extend clonefd_info by appending addi‐
>                      tional  fields  to  the end.  Callers should read as many
>                      bytes as they understand; unread data will be  discarded,
>                      and  subsequent  reads  after  the first will return 0 to
>                      indicate end-of-file.  Callers requesting more bytes than
>                      the  kernel  provides  (such as callers expecting a newer
>                      clonefd_info structure) will receive a shorter  structure
>                      from older kernels.
>
>               poll(2), select(2), epoll(7) (and similar)
>                      The  file  descriptor  is readable (the select(2) readfds
>                      argument; the poll(2) POLLIN flag) if the new process has
>                      exited.
>
>               close(2)
>                      When  the file descriptor is no longer required it should
>                      be closed.
>
>
>    C library/kernel ABI differences
>        As with clone(2), the raw clone4() system call corresponds more closely
>        to  fork(2)  in that execution in the child continues from the point of
>        the call.
>
>        Unlike clone(2), the raw system call  interface  for  clone4()  accepts
>        arguments in the same order on all architectures.
>
>        The  raw  system call accepts flags as two 32-bit arguments, flags_high
>        and flags_low, to simplify portability across 32-bit and 64-bit  archi‐
>        tectures and calling conventions.  The glibc wrapper accepts flags as a
>        single 64-bit argument for convenience.
>
>
> RETURN VALUE
>        For the glibc wrapper, on success, clone4() returns the new process  ID
>        to the calling process, and the new process begins running at the spec‐
>        ified function.
>
>        For the raw syscall, on success, clone4() returns the new process ID to
>        the calling process, and returns 0 in the new process.
>
>        On failure, clone4() returns -1 and sets errno accordingly.
>
>
> ERRORS
>        clone4()  can  return any error from clone(2), as well as the following
>        additional errors:
>
>        EFAULT args is outside your accessible address space.
>
>        EINVAL flags contained an unknown flag.
>
>        EINVAL flags included CLONE_FD and clonefd_flags contained  an  unknown
>               flag.
>
>        EINVAL flags  included  CLONE_FD, but the kernel configuration does not
>               have the CONFIG_CLONEFD option enabled.
>
>        EMFILE flags included CLONE_FD,  but  the  new  file  descriptor  would
>               exceed the process limit on open file descriptors.
>
>        ENFILE flags  included  CLONE_FD,  but  the  new  file descriptor would
>               exceed the system-wide limit on open file descriptors.
>
>        ENODEV flags included  CLONE_FD,  but  clone4()  could  not  mount  the
>               (internal) anonymous inode device.
>
>
> CONFORMING TO
>        clone4()  is Linux-specific and should not be used in programs intended
>        to be portable.
>
>
> SEE ALSO
>        clone(2), epoll(7), poll(2), pthreads(7), read(2), select(2)
>
>
>
> Linux                             2015-03-14                         CLONE4(2)
>
>
> Josh Triplett and Thiago Macieira (7):
>   clone: Support passing tls argument via C rather than pt_regs magic
>   x86: Opt into HAVE_COPY_THREAD_TLS, for both 32-bit and 64-bit
>   Introduce a new clone4 syscall with more flag bits and extensible arguments
>   kernel/fork.c: Pass arguments to _do_fork and copy_process using clone4_args
>   clone4: Add a CLONE_AUTOREAP flag to automatically reap the child process
>   signal: Factor out a helper function to process task_struct exit_code
>   clone4: Add a CLONE_FD flag to get task exit notification via fd
>
>  arch/Kconfig                      |   7 ++
>  arch/x86/Kconfig                  |   1 +
>  arch/x86/ia32/ia32entry.S         |   3 +-
>  arch/x86/kernel/entry_64.S        |   1 +
>  arch/x86/kernel/process_32.c      |   6 +-
>  arch/x86/kernel/process_64.c      |   8 +--
>  arch/x86/syscalls/syscall_32.tbl  |   1 +
>  arch/x86/syscalls/syscall_64.tbl  |   2 +
>  include/linux/compat.h            |  14 ++++
>  include/linux/sched.h             |  22 ++++++
>  include/linux/syscalls.h          |   6 +-
>  include/uapi/asm-generic/unistd.h |   4 +-
>  include/uapi/linux/sched.h        |  55 ++++++++++++++-
>  init/Kconfig                      |  21 ++++++
>  kernel/Makefile                   |   1 +
>  kernel/clonefd.c                  | 121 ++++++++++++++++++++++++++++++++
>  kernel/clonefd.h                  |  32 +++++++++
>  kernel/exit.c                     |   4 ++
>  kernel/fork.c                     | 142 ++++++++++++++++++++++++++++++--------
>  kernel/signal.c                   |  26 ++++---
>  kernel/sys_ni.c                   |   1 +
>  21 files changed, 426 insertions(+), 52 deletions(-)
>  create mode 100644 kernel/clonefd.c
>  create mode 100644 kernel/clonefd.h
>
> --
> 2.1.4
>

Looks promising!

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 0/7] CLONE_FD: Task exit notification via file descriptor
@ 2015-03-16 21:44   ` Kees Cook
  0 siblings, 0 replies; 63+ messages in thread
From: Kees Cook @ 2015-03-16 21:44 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, Thiago Macieira, LKML,
	Linux API, linux-fsdevel, x86

On Sun, Mar 15, 2015 at 12:59 AM, Josh Triplett <josh@joshtriplett.org> wrote:
> This patch series introduces a new clone flag, CLONE_FD, which lets the caller
> receive child process exit notification via a file descriptor rather than
> SIGCHLD.  CLONE_FD makes it possible for libraries to safely launch and manage
> child processes on behalf of their caller, *without* taking over process-wide
> SIGCHLD handling (either via signal handler or signalfd).
>
> Note that signalfd for SIGCHLD does not suffice here, because that still
> receives notification for all child processes, and interferes with process-wide
> signal handling.
>
> The CLONE_FD file descriptor uniquely identifies a process on the system in a
> race-free way, by holding a reference to the task_struct.  In the future, we
> may introduce APIs that support using process file descriptors instead of PIDs.
>
> This patch series also introduces a clone flag CLONE_AUTOREAP, which causes the
> kernel to automatically reap the child process when it exits, just as it does
> for processes using SIGCHLD when the parent has SIGCHLD ignored or marked as
> SA_NOCLDSTOP.
>
> Taken together, a library can launch a process with CLONE_FD, CLONE_AUTOREAP,
> and no exit signal, and completely avoid affecting either process-wide signal
> handling or an existing child wait loop.
>
> Introducing CLONE_FD and CLONE_AUTOREAP required two additional bits of yak
> shaving: Since clone has no more usable flags (with the three currently unused
> flags unusable because old kernels ignore them without EINVAL), also introduce
> a new clone4 system call with more flag bits and an extensible argument
> structure.  And since the magic pt_regs-based syscall argument processing for
> clone's tls argument would otherwise prevent introducing a sane clone4 system
> call, fix that too.
>
> I tested the CLONE_SETTLS changes with a thread-local storage test program (two
> threads independently reading and writing a __thread variable), on both 32-bit
> and 64-bit, and I observed no issues there.
>
> I tested clone4 and the new flags with several additional test programs,
> launching either a process or thread (in the former case using syscall(), in
> the latter case by calling clone4 via assembly and returning to C), sleeping in
> parent and child to test the case of either exiting first, and then printing
> the received clone4_info structure.
>
> Changes in v2:
> - Split out autoreaping into a separate CLONE_AUTOREAP.  CLONE_FD no longer
>   implies autoreaping and no exit signal, and CLONE_AUTOREAP does not affect
>   ptracers or signal handling.  Thanks to Oleg Nesterov for careful
>   investigation and discussion on v1.
> - Accept O_CLOEXEC and O_NONBLOCK via a clonefd_flags parameter in clone4_args.
>   Stop overloading the low byte of the main clone flags, since CLONE_FD now
>   works with a non-zero signal.
> - Return the file descriptor via an out parameter in clone4_args.
> - Drop patch to export alloc_fd; CLONE_FD now uses the next available file
>   descriptor, even if that's 0-2, since clone4 no longer needs to avoid
>   ambiguity with the 0 return indicating the child process.
> - Make poll on a CLONE_FD for an exited task also return POLLHUP, for
>   compatibility with FreeBSD's pdfork.  Thanks to David Drysdale for calling
>   attention to pdfork.

I think POLLHUP should be mentioned in the manpage (now it only
mentions POLLIN).

> - Fix typo in squelch_clone_flags.
> - Pass arguments to _do_fork and copy_process as a structure.
> - Construct the 64-bit flags in a separate variable, rather than inline in the
>   call to do_fork.
> - Fix error return for copy_from_user faults.
> - Add the new syscall to asm-generic.
> - Add ack from Andy Lutomirski to patches 1 and 2.
>
> I've included the manpages patch at the end of this series.  (Note that the
> manpage documents the behavior of the future glibc wrapper as well as the raw
> syscall.)  Here's a formatted plain-text version of the manpage for reference:
>
> CLONE4(2)                  Linux Programmer's Manual                 CLONE4(2)
>
>
>
> NAME
>        clone4 - create a child process
>
> SYNOPSIS
>        /* Prototype for the glibc wrapper function */
>
>        #define _GNU_SOURCE
>        #include <sched.h>
>
>        int clone4(uint64_t flags,
>                   size_t args_size,
>                   struct clone4_args *args,
>                   int (*fn)(void *), void *arg);
>
>        /* Prototype for the raw system call */
>
>        int clone4(unsigned flags_high, unsigned flags_low,
>                   unsigned long args_size,
>                   struct clone4_args *args);
>
>        struct clone4_args {
>            pid_t *ptid;
>            pid_t *ctid;
>            unsigned long stack_start;
>            unsigned long stack_size;
>            unsigned long tls;
>            int *clonefd;
>            unsigned clonefd_flags;
>        };
>
>
> DESCRIPTION
>        clone4()  creates  a  new  process,  similar  to  clone(2) and fork(2).
>        clone4() supports additional flags that clone(2) does not, and  accepts
>        arguments via an extensible structure.
>
>        args  points to a clone4_args structure, and args_size must contain the
>        size of that structure, as understood by the  caller.   If  the  caller
>        passes  a  shorter  structure  than  the  kernel expects, the remaining
>        fields will default to 0.  If the caller passes a larger structure than
>        the  kernel  expects  (such  as one from a newer kernel), clone4() will
>        return EINVAL.  The clone4_args structure may gain additional fields at
>        the  end  in  the future, and callers must only pass a size that encom‐
>        passes the number of fields they understand.  If the  caller  passes  0
>        for args_size, args is ignored and may be NULL.
>
>        In  the clone4_args structure, ptid, ctid, stack_start, stack_size, and
>        tls have the same semantics as they do with clone(2) and clone2(2).
>
>        In the glibc wrapper, fn and arg have the same  semantics  as  they  do
>        with clone(2).  As with clone(2), the underlying system call works more
>        like fork(2), returning 0 in the child process; the glibc wrapper  sim‐
>        plifies  thread execution by calling fn(arg) and exiting the child when
>        that function exits.
>
>        The 64-bit  flags  argument  (split  into  the  32-bit  flags_high  and
>        flags_low  arguments  in  the  kernel  interface for portability across
>        architectures) accepts all the same flags as clone(2), with the  excep‐
>        tion  of the obsolete CLONE_PID, CLONE_DETACHED, and CLONE_STOPPED.  In
>        addition, flags accepts the following flags:
>
>
>        CLONE_AUTOREAP
>               When the new process exits, immediately  reap  it,  rather  than
>               keeping  it  around  as a "zombie" until a call to waitpid(2) or
>               similar.  Without this flag, the kernel will automatically  reap
>               a  process if its exit signal is set to SIGCHLD, and if the par‐
>               ent process has SIGCHLD set to SIG_IGN or has a SIGCHLD  handler
>               installed  with SA_NOCLDWAIT (see sigaction(2)).  CLONE_AUTOREAP
>               allows the calling process to enable automatic reaping  with  an
>               exit  signal other than SIGCHLD (including 0 to disable the exit
>               signal), and does not depend on the  configuration  of  process-
>               wide signal handling.
>
>
>        CLONE_FD
>               Return  a file descriptor associated with the new process, stor‐
>               ing it in location clonefd in the parent's address space.   When
>               the new process exits, the file descriptor will become available
>               for reading.
>
>               Unlike using  signalfd(2)  for  the  SIGCHLD  signal,  the  file
>               descriptor  returned  by  clone4()  with the CLONE_FD flag works
>               even with SIGCHLD unblocked in one or more threads of the parent
>               process,  allowing  the  process  to have different handlers for
>               different child processes, such as those created by  a  library,
>               without  introducing  race conditions around process-wide signal
>               handling.
>
>               clonefd_flags may contain the following additional flags for use
>               with CLONE_FD:
>
>
>               O_CLOEXEC
>                      Set  the  close-on-exec  flag on the new file descriptor.
>                      See the description of the O_CLOEXEC flag in open(2)  for
>                      reasons why this may be useful.

This begs the question: what happens when all CLONE_FD fds for a
process are closed? Will the parent get SIGCHLD instead, will it
auto-reap, or will it be un-wait-able (I assume not this...)

>
>
>               O_NONBLOCK
>                      Set  the  O_NONBLOCK  flag  on  the  new file descriptor.
>                      Using this flag saves extra calls to fcntl(2) to  achieve
>                      the same result.
>
>
>               The returned file descriptor supports the following operations:
>
>               read(2) (and similar)
>                      When  the  new  process  exits,  reading  from  the  file
>                      descriptor produces a single clonefd_info structure:
>
>                      struct clonefd_info {
>                          uint32_t code;   /* Signal code */
>                          uint32_t status; /* Exit status or signal */
>                          uint64_t utime;  /* User CPU time */
>                          uint64_t stime;  /* System CPU time */
>                      };
>
>
>                      If the new process has not  yet  exited,  read(2)  either
>                      blocks  until  it does, or fails with the error EAGAIN if
>                      the file descriptor has O_NONBLOCK set.
>
>                      Future kernels may extend clonefd_info by appending addi‐
>                      tional  fields  to  the end.  Callers should read as many
>                      bytes as they understand; unread data will be  discarded,
>                      and  subsequent  reads  after  the first will return 0 to
>                      indicate end-of-file.  Callers requesting more bytes than
>                      the  kernel  provides  (such as callers expecting a newer
>                      clonefd_info structure) will receive a shorter  structure
>                      from older kernels.
>
>               poll(2), select(2), epoll(7) (and similar)
>                      The  file  descriptor  is readable (the select(2) readfds
>                      argument; the poll(2) POLLIN flag) if the new process has
>                      exited.
>
>               close(2)
>                      When  the file descriptor is no longer required it should
>                      be closed.
>
>
>    C library/kernel ABI differences
>        As with clone(2), the raw clone4() system call corresponds more closely
>        to  fork(2)  in that execution in the child continues from the point of
>        the call.
>
>        Unlike clone(2), the raw system call  interface  for  clone4()  accepts
>        arguments in the same order on all architectures.
>
>        The  raw  system call accepts flags as two 32-bit arguments, flags_high
>        and flags_low, to simplify portability across 32-bit and 64-bit  archi‐
>        tectures and calling conventions.  The glibc wrapper accepts flags as a
>        single 64-bit argument for convenience.
>
>
> RETURN VALUE
>        For the glibc wrapper, on success, clone4() returns the new process  ID
>        to the calling process, and the new process begins running at the spec‐
>        ified function.
>
>        For the raw syscall, on success, clone4() returns the new process ID to
>        the calling process, and returns 0 in the new process.
>
>        On failure, clone4() returns -1 and sets errno accordingly.
>
>
> ERRORS
>        clone4()  can  return any error from clone(2), as well as the following
>        additional errors:
>
>        EFAULT args is outside your accessible address space.
>
>        EINVAL flags contained an unknown flag.
>
>        EINVAL flags included CLONE_FD and clonefd_flags contained  an  unknown
>               flag.
>
>        EINVAL flags  included  CLONE_FD, but the kernel configuration does not
>               have the CONFIG_CLONEFD option enabled.
>
>        EMFILE flags included CLONE_FD,  but  the  new  file  descriptor  would
>               exceed the process limit on open file descriptors.
>
>        ENFILE flags  included  CLONE_FD,  but  the  new  file descriptor would
>               exceed the system-wide limit on open file descriptors.
>
>        ENODEV flags included  CLONE_FD,  but  clone4()  could  not  mount  the
>               (internal) anonymous inode device.
>
>
> CONFORMING TO
>        clone4()  is Linux-specific and should not be used in programs intended
>        to be portable.
>
>
> SEE ALSO
>        clone(2), epoll(7), poll(2), pthreads(7), read(2), select(2)
>
>
>
> Linux                             2015-03-14                         CLONE4(2)
>
>
> Josh Triplett and Thiago Macieira (7):
>   clone: Support passing tls argument via C rather than pt_regs magic
>   x86: Opt into HAVE_COPY_THREAD_TLS, for both 32-bit and 64-bit
>   Introduce a new clone4 syscall with more flag bits and extensible arguments
>   kernel/fork.c: Pass arguments to _do_fork and copy_process using clone4_args
>   clone4: Add a CLONE_AUTOREAP flag to automatically reap the child process
>   signal: Factor out a helper function to process task_struct exit_code
>   clone4: Add a CLONE_FD flag to get task exit notification via fd
>
>  arch/Kconfig                      |   7 ++
>  arch/x86/Kconfig                  |   1 +
>  arch/x86/ia32/ia32entry.S         |   3 +-
>  arch/x86/kernel/entry_64.S        |   1 +
>  arch/x86/kernel/process_32.c      |   6 +-
>  arch/x86/kernel/process_64.c      |   8 +--
>  arch/x86/syscalls/syscall_32.tbl  |   1 +
>  arch/x86/syscalls/syscall_64.tbl  |   2 +
>  include/linux/compat.h            |  14 ++++
>  include/linux/sched.h             |  22 ++++++
>  include/linux/syscalls.h          |   6 +-
>  include/uapi/asm-generic/unistd.h |   4 +-
>  include/uapi/linux/sched.h        |  55 ++++++++++++++-
>  init/Kconfig                      |  21 ++++++
>  kernel/Makefile                   |   1 +
>  kernel/clonefd.c                  | 121 ++++++++++++++++++++++++++++++++
>  kernel/clonefd.h                  |  32 +++++++++
>  kernel/exit.c                     |   4 ++
>  kernel/fork.c                     | 142 ++++++++++++++++++++++++++++++--------
>  kernel/signal.c                   |  26 ++++---
>  kernel/sys_ni.c                   |   1 +
>  21 files changed, 426 insertions(+), 52 deletions(-)
>  create mode 100644 kernel/clonefd.c
>  create mode 100644 kernel/clonefd.h
>
> --
> 2.1.4
>

Looks promising!

-Kees

-- 
Kees Cook
Chrome OS Security
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 0/7] CLONE_FD: Task exit notification via file descriptor
@ 2015-03-16 22:14     ` Thiago Macieira
  0 siblings, 0 replies; 63+ messages in thread
From: Thiago Macieira @ 2015-03-16 22:14 UTC (permalink / raw)
  To: Kees Cook
  Cc: Josh Triplett, Al Viro, Andrew Morton, Andy Lutomirski,
	Ingo Molnar, Oleg Nesterov, Paul E. McKenney, H. Peter Anvin,
	Rik van Riel, Thomas Gleixner, Michael Kerrisk, LKML, Linux API,
	linux-fsdevel, x86

On Monday 16 March 2015 14:44:20 Kees Cook wrote:
> >               O_CLOEXEC
> >                      Set  the  close-on-exec  flag on the new file
> >descriptor. See the description of the O_CLOEXEC flag in open(2)  for
> >reasons why this may be useful.
> 
> This begs the question: what happens when all CLONE_FD fds for a
> process are closed? Will the parent get SIGCHLD instead, will it
> auto-reap, or will it be un-wait-able (I assume not this...)

Depends on CLONE_AUTOREAP. If it's on, then no one gets SIGCHLD, no one can 
wait() on it and the process autoreaps itself.

If it's no active, then the old rules apply: parent gets SIGCHILD and can 
wait(). If the parent exited first, then the child gets reparented to init, 
which can do the wait().

A child without CLONE_AUTOREAP should be wait()able. If it gets wait()ed 
before the clonefd is read, the clonefd() will return a 0 read. If it gets 
read before wait, then wait() reaps another child or returns -ECHILD. That's 
no different than two threads doing simultaneous wait() on the same child.
-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 0/7] CLONE_FD: Task exit notification via file descriptor
@ 2015-03-16 22:14     ` Thiago Macieira
  0 siblings, 0 replies; 63+ messages in thread
From: Thiago Macieira @ 2015-03-16 22:14 UTC (permalink / raw)
  To: Kees Cook
  Cc: Josh Triplett, Al Viro, Andrew Morton, Andy Lutomirski,
	Ingo Molnar, Oleg Nesterov, Paul E. McKenney, H. Peter Anvin,
	Rik van Riel, Thomas Gleixner, Michael Kerrisk, LKML, Linux API,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A

On Monday 16 March 2015 14:44:20 Kees Cook wrote:
> >               O_CLOEXEC
> >                      Set  the  close-on-exec  flag on the new file
> >descriptor. See the description of the O_CLOEXEC flag in open(2)  for
> >reasons why this may be useful.
> 
> This begs the question: what happens when all CLONE_FD fds for a
> process are closed? Will the parent get SIGCHLD instead, will it
> auto-reap, or will it be un-wait-able (I assume not this...)

Depends on CLONE_AUTOREAP. If it's on, then no one gets SIGCHLD, no one can 
wait() on it and the process autoreaps itself.

If it's no active, then the old rules apply: parent gets SIGCHILD and can 
wait(). If the parent exited first, then the child gets reparented to init, 
which can do the wait().

A child without CLONE_AUTOREAP should be wait()able. If it gets wait()ed 
before the clonefd is read, the clonefd() will return a 0 read. If it gets 
read before wait, then wait() reaps another child or returns -ECHILD. That's 
no different than two threads doing simultaneous wait() on the same child.
-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 0/7] CLONE_FD: Task exit notification via file descriptor
  2015-03-16 22:14     ` Thiago Macieira
  (?)
@ 2015-03-16 22:36     ` Kees Cook
  2015-03-16 22:50         ` Thiago Macieira
  2015-03-16 23:35       ` josh
  -1 siblings, 2 replies; 63+ messages in thread
From: Kees Cook @ 2015-03-16 22:36 UTC (permalink / raw)
  To: Thiago Macieira
  Cc: Josh Triplett, Al Viro, Andrew Morton, Andy Lutomirski,
	Ingo Molnar, Oleg Nesterov, Paul E. McKenney, H. Peter Anvin,
	Rik van Riel, Thomas Gleixner, Michael Kerrisk, LKML, Linux API,
	linux-fsdevel, x86

On Mon, Mar 16, 2015 at 3:14 PM, Thiago Macieira
<thiago.macieira@intel.com> wrote:
> On Monday 16 March 2015 14:44:20 Kees Cook wrote:
>> >               O_CLOEXEC
>> >                      Set  the  close-on-exec  flag on the new file
>> >descriptor. See the description of the O_CLOEXEC flag in open(2)  for
>> >reasons why this may be useful.
>>
>> This begs the question: what happens when all CLONE_FD fds for a
>> process are closed? Will the parent get SIGCHLD instead, will it
>> auto-reap, or will it be un-wait-able (I assume not this...)
>
> Depends on CLONE_AUTOREAP. If it's on, then no one gets SIGCHLD, no one can
> wait() on it and the process autoreaps itself.
>
> If it's no active, then the old rules apply: parent gets SIGCHILD and can
> wait(). If the parent exited first, then the child gets reparented to init,
> which can do the wait().
>
> A child without CLONE_AUTOREAP should be wait()able. If it gets wait()ed
> before the clonefd is read, the clonefd() will return a 0 read. If it gets
> read before wait, then wait() reaps another child or returns -ECHILD. That's
> no different than two threads doing simultaneous wait() on the same child.

Cool. I think detailing this in the manpage would be helpful.

And just so I understand the races here, what happens in CLONE_FD
(without CLONE_AUTOREAP) case where the child dies, but the parent
never reads from the CLONE_FD fd, and closes it (or dies)? Will the
modes switch that late in the child's lifetime? (i.e. even though the
details were written to the fd, they were never read, yet it'll still
switch and generate a SIGCHLD, etc?)

Thanks!

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 0/7] CLONE_FD: Task exit notification via file descriptor
@ 2015-03-16 22:50         ` Thiago Macieira
  0 siblings, 0 replies; 63+ messages in thread
From: Thiago Macieira @ 2015-03-16 22:50 UTC (permalink / raw)
  To: Kees Cook
  Cc: Josh Triplett, Al Viro, Andrew Morton, Andy Lutomirski,
	Ingo Molnar, Oleg Nesterov, Paul E. McKenney, H. Peter Anvin,
	Rik van Riel, Thomas Gleixner, Michael Kerrisk, LKML, Linux API,
	linux-fsdevel, x86

On Monday 16 March 2015 15:36:16 Kees Cook wrote:
> And just so I understand the races here, what happens in CLONE_FD
> (without CLONE_AUTOREAP) case where the child dies, but the parent
> never reads from the CLONE_FD fd, and closes it (or dies)? Will the
> modes switch that late in the child's lifetime? (i.e. even though the
> details were written to the fd, they were never read, yet it'll still
> switch and generate a SIGCHLD, etc?)

What happens to a child that dies during the parent's lifetime but the parent 
exits without reaping the child?

The same should happen, whatever that behaviour is.
-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 0/7] CLONE_FD: Task exit notification via file descriptor
@ 2015-03-16 22:50         ` Thiago Macieira
  0 siblings, 0 replies; 63+ messages in thread
From: Thiago Macieira @ 2015-03-16 22:50 UTC (permalink / raw)
  To: Kees Cook
  Cc: Josh Triplett, Al Viro, Andrew Morton, Andy Lutomirski,
	Ingo Molnar, Oleg Nesterov, Paul E. McKenney, H. Peter Anvin,
	Rik van Riel, Thomas Gleixner, Michael Kerrisk, LKML, Linux API,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A

On Monday 16 March 2015 15:36:16 Kees Cook wrote:
> And just so I understand the races here, what happens in CLONE_FD
> (without CLONE_AUTOREAP) case where the child dies, but the parent
> never reads from the CLONE_FD fd, and closes it (or dies)? Will the
> modes switch that late in the child's lifetime? (i.e. even though the
> details were written to the fd, they were never read, yet it'll still
> switch and generate a SIGCHLD, etc?)

What happens to a child that dies during the parent's lifetime but the parent 
exits without reaping the child?

The same should happen, whatever that behaviour is.
-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 0/7] CLONE_FD: Task exit notification via file descriptor
@ 2015-03-16 23:25     ` josh-iaAMLnmF4UmaiuxdJuQwMA
  0 siblings, 0 replies; 63+ messages in thread
From: josh @ 2015-03-16 23:25 UTC (permalink / raw)
  To: Kees Cook
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, Thiago Macieira, LKML,
	Linux API, linux-fsdevel, x86

On Mon, Mar 16, 2015 at 02:44:20PM -0700, Kees Cook wrote:
> On Sun, Mar 15, 2015 at 12:59 AM, Josh Triplett <josh@joshtriplett.org> wrote:
> > - Make poll on a CLONE_FD for an exited task also return POLLHUP, for
> >   compatibility with FreeBSD's pdfork.  Thanks to David Drysdale for calling
> >   attention to pdfork.
> 
> I think POLLHUP should be mentioned in the manpage (now it only
> mentions POLLIN).

Added for v3.

> >        CLONE_FD
> >               Return  a file descriptor associated with the new process, stor‐
> >               ing it in location clonefd in the parent's address space.   When
> >               the new process exits, the file descriptor will become available
> >               for reading.
> >
> >               Unlike using  signalfd(2)  for  the  SIGCHLD  signal,  the  file
> >               descriptor  returned  by  clone4()  with the CLONE_FD flag works
> >               even with SIGCHLD unblocked in one or more threads of the parent
> >               process,  allowing  the  process  to have different handlers for
> >               different child processes, such as those created by  a  library,
> >               without  introducing  race conditions around process-wide signal
> >               handling.
> >
> >               clonefd_flags may contain the following additional flags for use
> >               with CLONE_FD:
> >
> >
> >               O_CLOEXEC
> >                      Set  the  close-on-exec  flag on the new file descriptor.
> >                      See the description of the O_CLOEXEC flag in open(2)  for
> >                      reasons why this may be useful.
> 
> This begs the question: what happens when all CLONE_FD fds for a
> process are closed? Will the parent get SIGCHLD instead, will it
> auto-reap, or will it be un-wait-able (I assume not this...)

Whether the parent gets SIGCHLD is determined only by what signal you
request in clone; if you clone with CLONE_FD | SIGCHLD (or
CLONE_AUTOREAP | CLONE_FD | SIGCHLD), you'll get notification via both
clonefd (if you have one) and signal (if you have a handler).  If you
pass a 0 signal (just CLONE_FD or CLONE_AUTOREAP | CLONE_FD), you'll
receive no signal, only the notification via clonefd.  Independently, if
you have CLONE_AUTOREAP set, the process will autoreap.

Those are all orthogonal now.

If you close the clonefd, nothing special happens other than a
put_task_struct.  While this is conceptually somewhat like a pipe, the
data is actually generated at read time, so the task exit doesn't care
whether there's a live clonefd or not.  (Or, in the future, if there are
multiple live clonefds for the same process.)

> Looks promising!

Thanks!

And thanks for catching the manpage issue.  I'd definitely welcome any
comments you have on the implementation as well.

- Josh Triplett

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 0/7] CLONE_FD: Task exit notification via file descriptor
@ 2015-03-16 23:25     ` josh-iaAMLnmF4UmaiuxdJuQwMA
  0 siblings, 0 replies; 63+ messages in thread
From: josh-iaAMLnmF4UmaiuxdJuQwMA @ 2015-03-16 23:25 UTC (permalink / raw)
  To: Kees Cook
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, Thiago Macieira, LKML,
	Linux API, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	x86-DgEjT+Ai2ygdnm+yROfE0A

On Mon, Mar 16, 2015 at 02:44:20PM -0700, Kees Cook wrote:
> On Sun, Mar 15, 2015 at 12:59 AM, Josh Triplett <josh-iaAMLnmF4Un9QnwtaJ6UQQ@public.gmane.orgg> wrote:
> > - Make poll on a CLONE_FD for an exited task also return POLLHUP, for
> >   compatibility with FreeBSD's pdfork.  Thanks to David Drysdale for calling
> >   attention to pdfork.
> 
> I think POLLHUP should be mentioned in the manpage (now it only
> mentions POLLIN).

Added for v3.

> >        CLONE_FD
> >               Return  a file descriptor associated with the new process, stor‐
> >               ing it in location clonefd in the parent's address space.   When
> >               the new process exits, the file descriptor will become available
> >               for reading.
> >
> >               Unlike using  signalfd(2)  for  the  SIGCHLD  signal,  the  file
> >               descriptor  returned  by  clone4()  with the CLONE_FD flag works
> >               even with SIGCHLD unblocked in one or more threads of the parent
> >               process,  allowing  the  process  to have different handlers for
> >               different child processes, such as those created by  a  library,
> >               without  introducing  race conditions around process-wide signal
> >               handling.
> >
> >               clonefd_flags may contain the following additional flags for use
> >               with CLONE_FD:
> >
> >
> >               O_CLOEXEC
> >                      Set  the  close-on-exec  flag on the new file descriptor.
> >                      See the description of the O_CLOEXEC flag in open(2)  for
> >                      reasons why this may be useful.
> 
> This begs the question: what happens when all CLONE_FD fds for a
> process are closed? Will the parent get SIGCHLD instead, will it
> auto-reap, or will it be un-wait-able (I assume not this...)

Whether the parent gets SIGCHLD is determined only by what signal you
request in clone; if you clone with CLONE_FD | SIGCHLD (or
CLONE_AUTOREAP | CLONE_FD | SIGCHLD), you'll get notification via both
clonefd (if you have one) and signal (if you have a handler).  If you
pass a 0 signal (just CLONE_FD or CLONE_AUTOREAP | CLONE_FD), you'll
receive no signal, only the notification via clonefd.  Independently, if
you have CLONE_AUTOREAP set, the process will autoreap.

Those are all orthogonal now.

If you close the clonefd, nothing special happens other than a
put_task_struct.  While this is conceptually somewhat like a pipe, the
data is actually generated at read time, so the task exit doesn't care
whether there's a live clonefd or not.  (Or, in the future, if there are
multiple live clonefds for the same process.)

> Looks promising!

Thanks!

And thanks for catching the manpage issue.  I'd definitely welcome any
comments you have on the implementation as well.

- Josh Triplett

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 0/7] CLONE_FD: Task exit notification via file descriptor
  2015-03-16 22:50         ` Thiago Macieira
  (?)
@ 2015-03-16 23:26         ` Kees Cook
  -1 siblings, 0 replies; 63+ messages in thread
From: Kees Cook @ 2015-03-16 23:26 UTC (permalink / raw)
  To: Thiago Macieira
  Cc: Josh Triplett, Al Viro, Andrew Morton, Andy Lutomirski,
	Ingo Molnar, Oleg Nesterov, Paul E. McKenney, H. Peter Anvin,
	Rik van Riel, Thomas Gleixner, Michael Kerrisk, LKML, Linux API,
	linux-fsdevel, x86

On Mon, Mar 16, 2015 at 3:50 PM, Thiago Macieira
<thiago.macieira@intel.com> wrote:
> On Monday 16 March 2015 15:36:16 Kees Cook wrote:
>> And just so I understand the races here, what happens in CLONE_FD
>> (without CLONE_AUTOREAP) case where the child dies, but the parent
>> never reads from the CLONE_FD fd, and closes it (or dies)? Will the
>> modes switch that late in the child's lifetime? (i.e. even though the
>> details were written to the fd, they were never read, yet it'll still
>> switch and generate a SIGCHLD, etc?)
>
> What happens to a child that dies during the parent's lifetime but the parent
> exits without reaping the child?
>
> The same should happen, whatever that behaviour is.

Okay, sounds like the waitpid internals are tied to the read, not the
child death. Perfect, that means it'll still be waitpid-able by
whatever process inherits the zombie. Sounds good! Thanks for
clarifying. :)

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 0/7] CLONE_FD: Task exit notification via file descriptor
  2015-03-16 22:14     ` Thiago Macieira
@ 2015-03-16 23:29       ` josh-iaAMLnmF4UmaiuxdJuQwMA
  -1 siblings, 0 replies; 63+ messages in thread
From: josh @ 2015-03-16 23:29 UTC (permalink / raw)
  To: Thiago Macieira
  Cc: Kees Cook, Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, LKML, Linux API, linux-fsdevel,
	x86

On Mon, Mar 16, 2015 at 03:14:14PM -0700, Thiago Macieira wrote:
> On Monday 16 March 2015 14:44:20 Kees Cook wrote:
> > >               O_CLOEXEC
> > >                      Set  the  close-on-exec  flag on the new file
> > >descriptor. See the description of the O_CLOEXEC flag in open(2)  for
> > >reasons why this may be useful.
> > 
> > This begs the question: what happens when all CLONE_FD fds for a
> > process are closed? Will the parent get SIGCHLD instead, will it
> > auto-reap, or will it be un-wait-able (I assume not this...)
> 
> Depends on CLONE_AUTOREAP. If it's on, then no one gets SIGCHLD, no one can 
> wait() on it and the process autoreaps itself.

Minor nit: CLONE_AUTOREAP makes the process autoreap and nobody can wait
on it, but if you pass SIGCHLD or some other exit signal to clone then
you'll still get that signal.

> If it's no active, then the old rules apply: parent gets SIGCHILD and can 
> wait(). If the parent exited first, then the child gets reparented to init, 
> which can do the wait().

Right.

> A child without CLONE_AUTOREAP should be wait()able. If it gets wait()ed 
> before the clonefd is read, the clonefd() will return a 0 read. If it gets 
> read before wait, then wait() reaps another child or returns -ECHILD. That's 
> no different than two threads doing simultaneous wait() on the same child.

Hrm?  That isn't the semantics we implemented; you'll *always* get an
exit notification via the clonefd if you have it open, with or without
autoreap and whether or not a wait has occurred yet.  And reading from
the clonefd does not serve as a wait; if you don't pass CLONE_AUTOREAP,
you'll still need to wait on the process.

- Josh Triplett

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 0/7] CLONE_FD: Task exit notification via file descriptor
@ 2015-03-16 23:29       ` josh-iaAMLnmF4UmaiuxdJuQwMA
  0 siblings, 0 replies; 63+ messages in thread
From: josh-iaAMLnmF4UmaiuxdJuQwMA @ 2015-03-16 23:29 UTC (permalink / raw)
  To: Thiago Macieira
  Cc: Kees Cook, Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, LKML, Linux API,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A

On Mon, Mar 16, 2015 at 03:14:14PM -0700, Thiago Macieira wrote:
> On Monday 16 March 2015 14:44:20 Kees Cook wrote:
> > >               O_CLOEXEC
> > >                      Set  the  close-on-exec  flag on the new file
> > >descriptor. See the description of the O_CLOEXEC flag in open(2)  for
> > >reasons why this may be useful.
> > 
> > This begs the question: what happens when all CLONE_FD fds for a
> > process are closed? Will the parent get SIGCHLD instead, will it
> > auto-reap, or will it be un-wait-able (I assume not this...)
> 
> Depends on CLONE_AUTOREAP. If it's on, then no one gets SIGCHLD, no one can 
> wait() on it and the process autoreaps itself.

Minor nit: CLONE_AUTOREAP makes the process autoreap and nobody can wait
on it, but if you pass SIGCHLD or some other exit signal to clone then
you'll still get that signal.

> If it's no active, then the old rules apply: parent gets SIGCHILD and can 
> wait(). If the parent exited first, then the child gets reparented to init, 
> which can do the wait().

Right.

> A child without CLONE_AUTOREAP should be wait()able. If it gets wait()ed 
> before the clonefd is read, the clonefd() will return a 0 read. If it gets 
> read before wait, then wait() reaps another child or returns -ECHILD. That's 
> no different than two threads doing simultaneous wait() on the same child.

Hrm?  That isn't the semantics we implemented; you'll *always* get an
exit notification via the clonefd if you have it open, with or without
autoreap and whether or not a wait has occurred yet.  And reading from
the clonefd does not serve as a wait; if you don't pass CLONE_AUTOREAP,
you'll still need to wait on the process.

- Josh Triplett

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 0/7] CLONE_FD: Task exit notification via file descriptor
  2015-03-16 22:36     ` Kees Cook
  2015-03-16 22:50         ` Thiago Macieira
@ 2015-03-16 23:35       ` josh
  1 sibling, 0 replies; 63+ messages in thread
From: josh @ 2015-03-16 23:35 UTC (permalink / raw)
  To: Kees Cook
  Cc: Thiago Macieira, Al Viro, Andrew Morton, Andy Lutomirski,
	Ingo Molnar, Oleg Nesterov, Paul E. McKenney, H. Peter Anvin,
	Rik van Riel, Thomas Gleixner, Michael Kerrisk, LKML, Linux API,
	linux-fsdevel, x86

On Mon, Mar 16, 2015 at 03:36:16PM -0700, Kees Cook wrote:
> On Mon, Mar 16, 2015 at 3:14 PM, Thiago Macieira
> <thiago.macieira@intel.com> wrote:
> > On Monday 16 March 2015 14:44:20 Kees Cook wrote:
> >> >               O_CLOEXEC
> >> >                      Set  the  close-on-exec  flag on the new file
> >> >descriptor. See the description of the O_CLOEXEC flag in open(2)  for
> >> >reasons why this may be useful.
> >>
> >> This begs the question: what happens when all CLONE_FD fds for a
> >> process are closed? Will the parent get SIGCHLD instead, will it
> >> auto-reap, or will it be un-wait-able (I assume not this...)
> >
> > Depends on CLONE_AUTOREAP. If it's on, then no one gets SIGCHLD, no one can
> > wait() on it and the process autoreaps itself.
> >
> > If it's no active, then the old rules apply: parent gets SIGCHILD and can
> > wait(). If the parent exited first, then the child gets reparented to init,
> > which can do the wait().
> >
> > A child without CLONE_AUTOREAP should be wait()able. If it gets wait()ed
> > before the clonefd is read, the clonefd() will return a 0 read. If it gets
> > read before wait, then wait() reaps another child or returns -ECHILD. That's
> > no different than two threads doing simultaneous wait() on the same child.
> 
> Cool. I think detailing this in the manpage would be helpful.
> 
> And just so I understand the races here, what happens in CLONE_FD
> (without CLONE_AUTOREAP) case where the child dies, but the parent
> never reads from the CLONE_FD fd, and closes it (or dies)? Will the
> modes switch that late in the child's lifetime? (i.e. even though the
> details were written to the fd, they were never read, yet it'll still
> switch and generate a SIGCHLD, etc?)

This doesn't actually work like a pipe; the details aren't "written" to
the fd.  The data is generated at read time, and if you never read,
that's fine.  There's no semantic meaning attached to reading from the
clonefd; you still have to wait on the process if you don't pass
CLONE_AUTOREAP.  (Or you can block SIGCHLD or use SA_NOCLDWAIT, if you
control the calling process's signal handling; AUTOREAP just lets you
avoid interacting with the calling process's signal handling.)

See my previous response for the rest.

- Josh Triplett

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 0/7] CLONE_FD: Task exit notification via file descriptor
  2015-03-16 23:29       ` josh-iaAMLnmF4UmaiuxdJuQwMA
@ 2015-03-17  0:49         ` Thiago Macieira
  -1 siblings, 0 replies; 63+ messages in thread
From: Thiago Macieira @ 2015-03-17  0:49 UTC (permalink / raw)
  To: josh
  Cc: Kees Cook, Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, LKML, Linux API, linux-fsdevel,
	x86

On Monday 16 March 2015 16:29:49 josh@joshtriplett.org wrote:
> > A child without CLONE_AUTOREAP should be wait()able. If it gets wait()ed 
> > before the clonefd is read, the clonefd() will return a 0 read. If it
> > gets 
> > read before wait, then wait() reaps another child or returns -ECHILD.
> > That's  no different than two threads doing simultaneous wait() on the
> > same child.
> Hrm?  That isn't the semantics we implemented; you'll *always* get an
> exit notification via the clonefd if you have it open, with or without
> autoreap and whether or not a wait has occurred yet.  And reading from
> the clonefd does not serve as a wait; if you don't pass CLONE_AUTOREAP,
> you'll still need to wait on the process.

Ah, I see what you're saying. Ok, I stand corrected: a child without 
CLONE_AUTOREAP must be wait()ed on and whoever waits on it will get 
information. In addition to that, the information is available on the clonefd 
and it can happen at any time, before or after the wait().

In the case of an orphaned child, the file descriptor will close, that's all. 
No modification is necessary to init.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 0/7] CLONE_FD: Task exit notification via file descriptor
@ 2015-03-17  0:49         ` Thiago Macieira
  0 siblings, 0 replies; 63+ messages in thread
From: Thiago Macieira @ 2015-03-17  0:49 UTC (permalink / raw)
  To: josh-iaAMLnmF4UmaiuxdJuQwMA
  Cc: Kees Cook, Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, LKML, Linux API,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A

On Monday 16 March 2015 16:29:49 josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org wrote:
> > A child without CLONE_AUTOREAP should be wait()able. If it gets wait()ed 
> > before the clonefd is read, the clonefd() will return a 0 read. If it
> > gets 
> > read before wait, then wait() reaps another child or returns -ECHILD.
> > That's  no different than two threads doing simultaneous wait() on the
> > same child.
> Hrm?  That isn't the semantics we implemented; you'll *always* get an
> exit notification via the clonefd if you have it open, with or without
> autoreap and whether or not a wait has occurred yet.  And reading from
> the clonefd does not serve as a wait; if you don't pass CLONE_AUTOREAP,
> you'll still need to wait on the process.

Ah, I see what you're saying. Ok, I stand corrected: a child without 
CLONE_AUTOREAP must be wait()ed on and whoever waits on it will get 
information. In addition to that, the information is available on the clonefd 
and it can happen at any time, before or after the wait().

In the case of an orphaned child, the file descriptor will close, that's all. 
No modification is necessary to init.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 5/7] clone4: Add a CLONE_AUTOREAP flag to automatically reap the child process
  2015-03-15 23:34           ` Josh Triplett
  (?)
@ 2015-03-20 18:14           ` Oleg Nesterov
  2015-03-20 18:46             ` Thiago Macieira
  -1 siblings, 1 reply; 63+ messages in thread
From: Oleg Nesterov @ 2015-03-20 18:14 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
	Michael Kerrisk, Thiago Macieira, linux-kernel, linux-api,
	linux-fsdevel, x86

Josh,

I am really sorry for delay.

On 03/15, Josh Triplett wrote:
>
> On Sun, Mar 15, 2015 at 08:55:06PM +0100, Oleg Nesterov wrote:
 >
> > It should be per-process simply because this "autoreap" affects the whole
> > process. And the sub-threads are already "autoreap". And these 2 autoreap's
> > semantics differ, we should not confuse them.
>
> Will the approach I suggested, of having clones with CLONE_THREAD
> inherit the autoreap value rather than setting it from CLONE_AUTOREAP,
> implement the semantics you're looking for?

Not sure I understand... CLONE_THREAD should not inherit the autoreap.
A sub-thread is always autoreapable.

> Also, are you suggesting that CLONE_AUTOREAP with CLONE_THREAD should
> produce -EINVAL, or just that it should be ignored?

Yes, I think CLONE_AUTOREAP | CLONE_THREAD should return -EINVAL. But
this all is minor...

The main problem is how/when we should check this "autoreap" without
making this code even more ugly.

I still think we need a preparation patch. I tried to make it today but
failed. Will try again on weekend...


Note that we can't solely rely on do_notify_parent() which (with your patch)
correctly checks !ptrace && autoreap.

Just for example. Please look at __ptrace_detach(). Note that if we add
CLONE_AUTOREAP this needs a fix in any case. The tracee can be "autoreap"
but zombie, because "autoreap" should be ignored until the tracer detaches.
But the "same_thread_group" should not call do_notify_parent() again. So
this needs another check.

And let me quote our discussion from the previous email:

	> > EXCEPT: do we really want SIGCHLD from the exiting child? I think we
	> > do not. I won't really argue though, but this should be discussed and
	> > documented. IIUC, with your patch it is still sent.
	>
	> I think we do, yes.  The caller of clone can already specify what signal
	> they want, including no signal at all.  If they specify a signal
	> (SIGCHLD or otherwise) along with CLONE_AUTOREAP, we can send that
	> signal.

	OK. Agreed.

Yes, I agree...

But the changes in __ptrace_detach() depend on whether we need to send a signal
or not. Either way the changle is simple, but looks ugly. It would be nice to
cleanup this somehow.

Also. I forgot that the kernel always resets ->exit_signal to SIGCHLD on exec
or reparenting. Reparenting is probably fine. But what about exec? Should it
keep ->exit_signal == 0 if "autoreap" ? I think it should not, to avoid the
strange special case.

> > > > And there are ptrace/mt issues,
> > > > it seems. Just for example, we should avoid EXIT_TRACE if autoreap in
> > > > wait_task_zombie() even if we are going to re-notify parent.
> > >
> > > I don't see how EXIT_TRACE can happen in wait_task_zombie if autoreap is
> > > set.  wait_task_zombie does a cmpxchg with exit_state and doesn't
> > > proceed unless exit_state was EXIT_ZOMBIE, and I don't see how we can
> > > ever reach the EXIT_ZOMBIE state if autoreap.
> >
> > Because you again forgot about ptrace ;)

And this too asks for preparation before CLONE_AUTOREAP...

So I'll try to think about this all again on weekend. I'll try very much
to not disappear again ;)

Oleg.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 5/7] clone4: Add a CLONE_AUTOREAP flag to automatically reap the child process
  2015-03-20 18:14           ` Oleg Nesterov
@ 2015-03-20 18:46             ` Thiago Macieira
  2015-03-20 19:09                 ` Oleg Nesterov
  0 siblings, 1 reply; 63+ messages in thread
From: Thiago Macieira @ 2015-03-20 18:46 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Josh Triplett, Al Viro, Andrew Morton, Andy Lutomirski,
	Ingo Molnar, Kees Cook, Paul E. McKenney, H. Peter Anvin,
	Rik van Riel, Thomas Gleixner, Michael Kerrisk, linux-kernel,
	linux-api, linux-fsdevel, x86

On Friday 20 March 2015 19:14:04 Oleg Nesterov wrote:
> Also. I forgot that the kernel always resets ->exit_signal to SIGCHLD on
> exec or reparenting. Reparenting is probably fine. But what about exec?
> Should it keep ->exit_signal == 0 if "autoreap" ? I think it should not, to
> avoid the strange special case.

Not delivering any signal was the objective of this patch series, so yes 
exit_signal == 0 should survive an exec and even re-exec.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 5/7] clone4: Add a CLONE_AUTOREAP flag to automatically reap the child process
  2015-03-20 18:46             ` Thiago Macieira
@ 2015-03-20 19:09                 ` Oleg Nesterov
  0 siblings, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2015-03-20 19:09 UTC (permalink / raw)
  To: Thiago Macieira
  Cc: Josh Triplett, Al Viro, Andrew Morton, Andy Lutomirski,
	Ingo Molnar, Kees Cook, Paul E. McKenney, H. Peter Anvin,
	Rik van Riel, Thomas Gleixner, Michael Kerrisk, linux-kernel,
	linux-api, linux-fsdevel, x86

On 03/20, Thiago Macieira wrote:
>
> On Friday 20 March 2015 19:14:04 Oleg Nesterov wrote:
> > Also. I forgot that the kernel always resets ->exit_signal to SIGCHLD on
> > exec or reparenting. Reparenting is probably fine. But what about exec?
> > Should it keep ->exit_signal == 0 if "autoreap" ? I think it should not, to
> > avoid the strange special case.
>
> Not delivering any signal was the objective of this patch series, so yes
> exit_signal == 0 should survive an exec and even re-exec.

OK, but then perhaps we should never send SIGCHLD (on exit) if "autoreap",
to make the logic simple.

And copy_process() should probably do

	if ((clone_flags & CSIGNAL) && (clone_flags && CLONE_AUTOREAP))
		return -EINVAL;

so that we still can change this behaviour later.

Oleg.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 5/7] clone4: Add a CLONE_AUTOREAP flag to automatically reap the child process
@ 2015-03-20 19:09                 ` Oleg Nesterov
  0 siblings, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2015-03-20 19:09 UTC (permalink / raw)
  To: Thiago Macieira
  Cc: Josh Triplett, Al Viro, Andrew Morton, Andy Lutomirski,
	Ingo Molnar, Kees Cook, Paul E. McKenney, H. Peter Anvin,
	Rik van Riel, Thomas Gleixner, Michael Kerrisk,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A

On 03/20, Thiago Macieira wrote:
>
> On Friday 20 March 2015 19:14:04 Oleg Nesterov wrote:
> > Also. I forgot that the kernel always resets ->exit_signal to SIGCHLD on
> > exec or reparenting. Reparenting is probably fine. But what about exec?
> > Should it keep ->exit_signal == 0 if "autoreap" ? I think it should not, to
> > avoid the strange special case.
>
> Not delivering any signal was the objective of this patch series, so yes
> exit_signal == 0 should survive an exec and even re-exec.

OK, but then perhaps we should never send SIGCHLD (on exit) if "autoreap",
to make the logic simple.

And copy_process() should probably do

	if ((clone_flags & CSIGNAL) && (clone_flags && CLONE_AUTOREAP))
		return -EINVAL;

so that we still can change this behaviour later.

Oleg.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 5/7] clone4: Add a CLONE_AUTOREAP flag to automatically reap the child process
  2015-03-20 19:09                 ` Oleg Nesterov
  (?)
@ 2015-03-20 21:10                 ` josh
  -1 siblings, 0 replies; 63+ messages in thread
From: josh @ 2015-03-20 21:10 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Thiago Macieira, Al Viro, Andrew Morton, Andy Lutomirski,
	Ingo Molnar, Kees Cook, Paul E. McKenney, H. Peter Anvin,
	Rik van Riel, Thomas Gleixner, Michael Kerrisk, linux-kernel,
	linux-api, linux-fsdevel, x86

On Fri, Mar 20, 2015 at 08:09:14PM +0100, Oleg Nesterov wrote:
> On 03/20, Thiago Macieira wrote:
> >
> > On Friday 20 March 2015 19:14:04 Oleg Nesterov wrote:
> > > Also. I forgot that the kernel always resets ->exit_signal to SIGCHLD on
> > > exec or reparenting. Reparenting is probably fine. But what about exec?
> > > Should it keep ->exit_signal == 0 if "autoreap" ? I think it should not, to
> > > avoid the strange special case.
> >
> > Not delivering any signal was the objective of this patch series, so yes
> > exit_signal == 0 should survive an exec and even re-exec.
> 
> OK, but then perhaps we should never send SIGCHLD (on exit) if "autoreap",
> to make the logic simple.
> 
> And copy_process() should probably do
> 
> 	if ((clone_flags & CSIGNAL) && (clone_flags && CLONE_AUTOREAP))
> 		return -EINVAL;
> 
> so that we still can change this behaviour later.

I'm fine with that, as it would handle the particular use case we care
about.

However, the reset-signal-on-reparent thing might still make sense,
particularly for the ptrace-reparent case (less so for the
reparent-to-child-reaper case).

- Josh Triplett

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 3/7] Introduce a new clone4 syscall with more flag bits and extensible arguments
@ 2015-03-23 14:11     ` David Drysdale
  0 siblings, 0 replies; 63+ messages in thread
From: David Drysdale @ 2015-03-23 14:11 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, Thiago Macieira, linux-kernel,
	Linux API, Linux FS Devel, X86 ML

On Sun, Mar 15, 2015 at 7:59 AM, Josh Triplett <josh@joshtriplett.org> wrote:
> diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
> index 0286735..ba28306 100644
> --- a/arch/x86/ia32/ia32entry.S
> +++ b/arch/x86/ia32/ia32entry.S
> @@ -483,6 +483,7 @@ GLOBAL(\label)
>         PTREGSCALL stub32_execveat, compat_sys_execveat
>         PTREGSCALL stub32_fork, sys_fork
>         PTREGSCALL stub32_vfork, sys_vfork
> +       PTREGSCALL stub32_clone4, compat_sys_clone4
>
>         ALIGN
>  GLOBAL(stub32_clone)
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index 1d74d16..ead143f 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -520,6 +520,7 @@ END(\label)
>         FORK_LIKE  clone
>         FORK_LIKE  fork
>         FORK_LIKE  vfork
> +       FORK_LIKE  clone4
>         FIXED_FRAME stub_iopl, sys_iopl
>
>  ENTRY(stub_execve)
> diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
> index b3560ec..56fcc90 100644
> --- a/arch/x86/syscalls/syscall_32.tbl
> +++ b/arch/x86/syscalls/syscall_32.tbl
> @@ -365,3 +365,4 @@
>  356    i386    memfd_create            sys_memfd_create
>  357    i386    bpf                     sys_bpf
>  358    i386    execveat                sys_execveat                    stub32_execveat
> +359    i386    clone4                  sys_clone4                      stub32_clone4
> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> index 8d656fb..af15b0f 100644
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -329,6 +329,7 @@
>  320    common  kexec_file_load         sys_kexec_file_load
>  321    common  bpf                     sys_bpf
>  322    64      execveat                stub_execveat
> +323    64      clone4                  stub_clone4
>
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> @@ -368,3 +369,4 @@
>  543    x32     io_setup                compat_sys_io_setup
>  544    x32     io_submit               compat_sys_io_submit
>  545    x32     execveat                stub_x32_execveat
> +546    x32     clone4                  stub32_clone4

Doesn't this need an x32 specific wrapper (to ensure the full
set of registers are saved)?

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 3/7] Introduce a new clone4 syscall with more flag bits and extensible arguments
@ 2015-03-23 14:11     ` David Drysdale
  0 siblings, 0 replies; 63+ messages in thread
From: David Drysdale @ 2015-03-23 14:11 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, Thiago Macieira,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Linux API, Linux FS Devel,
	X86 ML

On Sun, Mar 15, 2015 at 7:59 AM, Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org> wrote:
> diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
> index 0286735..ba28306 100644
> --- a/arch/x86/ia32/ia32entry.S
> +++ b/arch/x86/ia32/ia32entry.S
> @@ -483,6 +483,7 @@ GLOBAL(\label)
>         PTREGSCALL stub32_execveat, compat_sys_execveat
>         PTREGSCALL stub32_fork, sys_fork
>         PTREGSCALL stub32_vfork, sys_vfork
> +       PTREGSCALL stub32_clone4, compat_sys_clone4
>
>         ALIGN
>  GLOBAL(stub32_clone)
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index 1d74d16..ead143f 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -520,6 +520,7 @@ END(\label)
>         FORK_LIKE  clone
>         FORK_LIKE  fork
>         FORK_LIKE  vfork
> +       FORK_LIKE  clone4
>         FIXED_FRAME stub_iopl, sys_iopl
>
>  ENTRY(stub_execve)
> diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
> index b3560ec..56fcc90 100644
> --- a/arch/x86/syscalls/syscall_32.tbl
> +++ b/arch/x86/syscalls/syscall_32.tbl
> @@ -365,3 +365,4 @@
>  356    i386    memfd_create            sys_memfd_create
>  357    i386    bpf                     sys_bpf
>  358    i386    execveat                sys_execveat                    stub32_execveat
> +359    i386    clone4                  sys_clone4                      stub32_clone4
> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> index 8d656fb..af15b0f 100644
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -329,6 +329,7 @@
>  320    common  kexec_file_load         sys_kexec_file_load
>  321    common  bpf                     sys_bpf
>  322    64      execveat                stub_execveat
> +323    64      clone4                  stub_clone4
>
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> @@ -368,3 +369,4 @@
>  543    x32     io_setup                compat_sys_io_setup
>  544    x32     io_submit               compat_sys_io_submit
>  545    x32     execveat                stub_x32_execveat
> +546    x32     clone4                  stub32_clone4

Doesn't this need an x32 specific wrapper (to ensure the full
set of registers are saved)?

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 0/7] CLONE_FD: Task exit notification via file descriptor
  2015-03-16 23:29       ` josh-iaAMLnmF4UmaiuxdJuQwMA
  (?)
  (?)
@ 2015-03-23 14:12       ` David Drysdale
  2015-03-23 15:03         ` josh
  -1 siblings, 1 reply; 63+ messages in thread
From: David Drysdale @ 2015-03-23 14:12 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Thiago Macieira, Kees Cook, Al Viro, Andrew Morton,
	Andy Lutomirski, Ingo Molnar, Oleg Nesterov, Paul E. McKenney,
	H. Peter Anvin, Rik van Riel, Thomas Gleixner, Michael Kerrisk,
	LKML, Linux API, linux-fsdevel, x86

On Mon, Mar 16, 2015 at 11:29 PM,  <josh@joshtriplett.org> wrote:
> On Mon, Mar 16, 2015 at 03:14:14PM -0700, Thiago Macieira wrote:
>> On Monday 16 March 2015 14:44:20 Kees Cook wrote:
>> > >               O_CLOEXEC
>> > >                      Set  the  close-on-exec  flag on the new file
>> > >descriptor. See the description of the O_CLOEXEC flag in open(2)  for
>> > >reasons why this may be useful.
>> >
>> > This begs the question: what happens when all CLONE_FD fds for a
>> > process are closed? Will the parent get SIGCHLD instead, will it
>> > auto-reap, or will it be un-wait-able (I assume not this...)
>>
>> Depends on CLONE_AUTOREAP. If it's on, then no one gets SIGCHLD, no one can
>> wait() on it and the process autoreaps itself.
>
> Minor nit: CLONE_AUTOREAP makes the process autoreap and nobody can wait
> on it, but if you pass SIGCHLD or some other exit signal to clone then
> you'll still get that signal.

Quick query: does CLONE_AUTOREAP also affect waiting for non-exit
events (i.e. WUNTRACED / WCONTINUED), by original parent and/or ptracer?

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 0/7] CLONE_FD: Task exit notification via file descriptor
  2015-03-23 14:12       ` David Drysdale
@ 2015-03-23 15:03         ` josh
  0 siblings, 0 replies; 63+ messages in thread
From: josh @ 2015-03-23 15:03 UTC (permalink / raw)
  To: David Drysdale
  Cc: Thiago Macieira, Kees Cook, Al Viro, Andrew Morton,
	Andy Lutomirski, Ingo Molnar, Oleg Nesterov, Paul E. McKenney,
	H. Peter Anvin, Rik van Riel, Thomas Gleixner, Michael Kerrisk,
	LKML, Linux API, linux-fsdevel, x86

On Mon, Mar 23, 2015 at 02:12:34PM +0000, David Drysdale wrote:
> On Mon, Mar 16, 2015 at 11:29 PM,  <josh@joshtriplett.org> wrote:
> > On Mon, Mar 16, 2015 at 03:14:14PM -0700, Thiago Macieira wrote:
> >> On Monday 16 March 2015 14:44:20 Kees Cook wrote:
> >> > >               O_CLOEXEC
> >> > >                      Set  the  close-on-exec  flag on the new file
> >> > >descriptor. See the description of the O_CLOEXEC flag in open(2)  for
> >> > >reasons why this may be useful.
> >> >
> >> > This begs the question: what happens when all CLONE_FD fds for a
> >> > process are closed? Will the parent get SIGCHLD instead, will it
> >> > auto-reap, or will it be un-wait-able (I assume not this...)
> >>
> >> Depends on CLONE_AUTOREAP. If it's on, then no one gets SIGCHLD, no one can
> >> wait() on it and the process autoreaps itself.
> >
> > Minor nit: CLONE_AUTOREAP makes the process autoreap and nobody can wait
> > on it, but if you pass SIGCHLD or some other exit signal to clone then
> > you'll still get that signal.
> 
> Quick query: does CLONE_AUTOREAP also affect waiting for non-exit
> events (i.e. WUNTRACED / WCONTINUED), by original parent and/or ptracer?

It shouldn't, no.  You can't wait on the process to exit (you'll get
-ECHLD after it wakes up), but you can wait on it to continue or
similar; none of the autoreap changes should affect that.

- Josh Triplett

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 3/7] Introduce a new clone4 syscall with more flag bits and extensible arguments
  2015-03-23 14:11     ` David Drysdale
  (?)
@ 2015-03-23 15:05     ` josh
  2015-03-31 14:41       ` David Drysdale
  -1 siblings, 1 reply; 63+ messages in thread
From: josh @ 2015-03-23 15:05 UTC (permalink / raw)
  To: David Drysdale
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, Thiago Macieira, linux-kernel,
	Linux API, Linux FS Devel, X86 ML

On Mon, Mar 23, 2015 at 02:11:45PM +0000, David Drysdale wrote:
> On Sun, Mar 15, 2015 at 7:59 AM, Josh Triplett <josh@joshtriplett.org> wrote:
> > diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
> > index 0286735..ba28306 100644
> > --- a/arch/x86/ia32/ia32entry.S
> > +++ b/arch/x86/ia32/ia32entry.S
> > @@ -483,6 +483,7 @@ GLOBAL(\label)
> >         PTREGSCALL stub32_execveat, compat_sys_execveat
> >         PTREGSCALL stub32_fork, sys_fork
> >         PTREGSCALL stub32_vfork, sys_vfork
> > +       PTREGSCALL stub32_clone4, compat_sys_clone4
> >
> >         ALIGN
> >  GLOBAL(stub32_clone)
> > diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> > index 1d74d16..ead143f 100644
> > --- a/arch/x86/kernel/entry_64.S
> > +++ b/arch/x86/kernel/entry_64.S
> > @@ -520,6 +520,7 @@ END(\label)
> >         FORK_LIKE  clone
> >         FORK_LIKE  fork
> >         FORK_LIKE  vfork
> > +       FORK_LIKE  clone4
> >         FIXED_FRAME stub_iopl, sys_iopl
> >
> >  ENTRY(stub_execve)
> > diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
> > index b3560ec..56fcc90 100644
> > --- a/arch/x86/syscalls/syscall_32.tbl
> > +++ b/arch/x86/syscalls/syscall_32.tbl
> > @@ -365,3 +365,4 @@
> >  356    i386    memfd_create            sys_memfd_create
> >  357    i386    bpf                     sys_bpf
> >  358    i386    execveat                sys_execveat                    stub32_execveat
> > +359    i386    clone4                  sys_clone4                      stub32_clone4
> > diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> > index 8d656fb..af15b0f 100644
> > --- a/arch/x86/syscalls/syscall_64.tbl
> > +++ b/arch/x86/syscalls/syscall_64.tbl
> > @@ -329,6 +329,7 @@
> >  320    common  kexec_file_load         sys_kexec_file_load
> >  321    common  bpf                     sys_bpf
> >  322    64      execveat                stub_execveat
> > +323    64      clone4                  stub_clone4
> >
> >  #
> >  # x32-specific system call numbers start at 512 to avoid cache impact
> > @@ -368,3 +369,4 @@
> >  543    x32     io_setup                compat_sys_io_setup
> >  544    x32     io_submit               compat_sys_io_submit
> >  545    x32     execveat                stub_x32_execveat
> > +546    x32     clone4                  stub32_clone4
> 
> Doesn't this need an x32 specific wrapper (to ensure the full
> set of registers are saved)?

I'm not an x32 expert; I don't know how x32 interacts with pt_regs and
compat syscalls.  Could an x32 expert weigh in, please?

- Josh Triplett

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 7/7] clone4: Add a CLONE_FD flag to get task exit notification via fd
  2015-03-15  8:00 ` [PATCH v2 7/7] clone4: Add a CLONE_FD flag to get task exit notification via fd Josh Triplett
@ 2015-03-23 17:38   ` David Drysdale
  2015-03-25 14:53     ` Josh Triplett
  2015-04-06  8:30     ` Sergey Senozhatsky
  1 sibling, 1 reply; 63+ messages in thread
From: David Drysdale @ 2015-03-23 17:38 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, Thiago Macieira, linux-kernel,
	Linux API, Linux FS Devel, X86 ML

On Sun, Mar 15, 2015 at 8:00 AM, Josh Triplett <josh@joshtriplett.org> wrote:
> diff --git a/include/linux/compat.h b/include/linux/compat.h
> index 6c4a68d..c90df5a 100644
> --- a/include/linux/compat.h
> +++ b/include/linux/compat.h
> @@ -299,6 +299,8 @@ struct compat_clone4_args {
>         compat_ulong_t stack_start;
>         compat_ulong_t stack_size;
>         compat_ulong_t tls;
> +       compat_uptr_t clonefd;
> +       u32 clonefd_flags;
>  };
>
>  struct compat_statfs;
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 9daa017..1dc680b 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1374,6 +1374,11 @@ struct task_struct {
>
>         unsigned autoreap:1; /* Do not become a zombie on exit */
>
> +#ifdef CONFIG_CLONEFD
> +       unsigned clonefd:1; /* Notify clonefd_wqh on exit */
> +       wait_queue_head_t clonefd_wqh;
> +#endif
> +
>         unsigned long atomic_flags; /* Flags needing atomic access. */
>
>         struct restart_block restart_block;

Idle thought: are there any concerns about the occupancy
impact of adding a wait_queue_head to every task_struct,
whether it has a clonefd or not?

I guess we could reduce the size somewhat by just
storing a struct file *clonefd_file in the task, and then have
a separate structure (with the wqh and a task_struct*) referenced
by file->private_data.  Not sure whether the added complication
would be worthwhile, though.

> diff --git a/kernel/clonefd.c b/kernel/clonefd.c
> new file mode 100644
> index 0000000..eac560c
> --- /dev/null
> +++ b/kernel/clonefd.c
> @@ -0,0 +1,121 @@
> +/*
> + * Support functions for CLONE_FD
> + *
> + * Copyright (c) 2015 Intel Corporation
> + * Original authors: Josh Triplett <josh@joshtriplett.org>
> + *                   Thiago Macieira <thiago@macieira.org>
> + */
> +#include <linux/anon_inodes.h>
> +#include <linux/file.h>
> +#include <linux/fs.h>
> +#include <linux/poll.h>
> +#include <linux/slab.h>
> +#include "clonefd.h"
> +
> +static int clonefd_release(struct inode *inode, struct file *file)
> +{
> +       put_task_struct(file->private_data);
> +       return 0;
> +}
> +
> +static unsigned int clonefd_poll(struct file *file, poll_table *wait)
> +{
> +       struct task_struct *p = file->private_data;
> +       poll_wait(file, &p->clonefd_wqh, wait);
> +       return p->exit_state ? (POLLIN | POLLRDNORM | POLLHUP) : 0;
> +}
> +
> +static ssize_t clonefd_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
> +{
> +       struct task_struct *p = file->private_data;
> +       int ret = 0;
> +
> +       /* EOF after first read */
> +       if (*ppos)
> +               return 0;
> +
> +       if (file->f_flags & O_NONBLOCK)
> +               ret = -EAGAIN;
> +       else
> +               ret = wait_event_interruptible(p->clonefd_wqh, p->exit_state);
> +
> +       if (p->exit_state) {
> +               struct clonefd_info info = {};
> +               cputime_t utime, stime;
> +               task_exit_code_status(p->exit_code, &info.code, &info.status);
> +               info.code &= ~__SI_MASK;
> +               task_cputime(p, &utime, &stime);
> +               info.utime = cputime_to_clock_t(utime + p->signal->utime);
> +               info.stime = cputime_to_clock_t(stime + p->signal->stime);
> +               ret = simple_read_from_buffer(buf, count, ppos, &info, sizeof(info));
> +       }
> +       return ret;
> +}
> +
> +static struct file_operations clonefd_fops = {
> +       .release = clonefd_release,
> +       .poll = clonefd_poll,
> +       .read = clonefd_read,
> +       .llseek = no_llseek,
> +};

It might be nice to include a show_fdinfo() implementation that shows
(say) the pid that the clonefd refers to.  E.g. something like:

static void clonefd_show_fdinfo(struct seq_file *m, struct file *file)
{
    struct task_struct *p = file->private_data;

    seq_printf(m, "tid:\t%d\n", task_tgid_vnr(p));
}

> +
> +/* Do process exit notification for clonefd. */
> +void clonefd_do_notify(struct task_struct *p)
> +{
> +       if (p->clonefd)
> +               wake_up_all(&p->clonefd_wqh);
> +}
> +
> +/* Handle the CLONE_FD case for copy_process. */
> +int clonefd_do_clone(u64 clone_flags, struct task_struct *p,
> +                    struct clone4_args *args, struct clonefd_setup *setup)
> +{
> +       int flags;
> +       struct file *file;
> +       int fd;
> +
> +       p->clonefd = !!(clone_flags & CLONE_FD);
> +       if (!p->clonefd)
> +               return 0;
> +
> +       if (args->clonefd_flags & ~(O_CLOEXEC | O_NONBLOCK))
> +               return -EINVAL;
> +

Maybe also check for (args->clonefd == NULL) in advance, and
return -EINVAL or -EFAULT?

> +       init_waitqueue_head(&p->clonefd_wqh);
> +
> +       get_task_struct(p);
> +       flags = O_RDONLY | FMODE_ATOMIC_POS | args->clonefd_flags;
> +       file = anon_inode_getfile("[process]", &clonefd_fops, p, flags);
> +       if (IS_ERR(file)) {
> +               put_task_struct(p);
> +               return PTR_ERR(file);
> +       }
> +
> +       fd = get_unused_fd_flags(flags);
> +       if (fd < 0) {
> +               fput(file);
> +               return fd;
> +       }
> +
> +       setup->fd = fd;
> +       setup->file = file;
> +       return 0;
> +}
> +
> +/* Clean up clonefd information after a partially complete clone */
> +void clonefd_cleanup_failed_clone(struct clonefd_setup *setup)
> +{
> +       if (setup->file) {
> +               put_unused_fd(setup->fd);
> +               fput(setup->file);
> +       }
> +}
> +
> +/* Finish setting up the clonefd */
> +void clonefd_install_fd(struct clone4_args *args, struct clonefd_setup *setup)
> +{
> +       if (setup->file) {
> +               fd_install(setup->fd, setup->file);
> +               put_user(setup->fd, args->clonefd);
> +       }
> +}

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 7/7] clone4: Add a CLONE_FD flag to get task exit notification via fd
  2015-03-23 17:38   ` David Drysdale
@ 2015-03-25 14:53     ` Josh Triplett
  0 siblings, 0 replies; 63+ messages in thread
From: Josh Triplett @ 2015-03-25 14:53 UTC (permalink / raw)
  To: David Drysdale
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, Thiago Macieira, linux-kernel,
	Linux API, Linux FS Devel, X86 ML

On Mon, Mar 23, 2015 at 05:38:45PM +0000, David Drysdale wrote:
> On Sun, Mar 15, 2015 at 8:00 AM, Josh Triplett <josh@joshtriplett.org> wrote:
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 9daa017..1dc680b 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1374,6 +1374,11 @@ struct task_struct {
> >
> >         unsigned autoreap:1; /* Do not become a zombie on exit */
> >
> > +#ifdef CONFIG_CLONEFD
> > +       unsigned clonefd:1; /* Notify clonefd_wqh on exit */
> > +       wait_queue_head_t clonefd_wqh;
> > +#endif
> > +
> >         unsigned long atomic_flags; /* Flags needing atomic access. */
> >
> >         struct restart_block restart_block;
> 
> Idle thought: are there any concerns about the occupancy
> impact of adding a wait_queue_head to every task_struct,
> whether it has a clonefd or not?
> 
> I guess we could reduce the size somewhat by just
> storing a struct file *clonefd_file in the task, and then have
> a separate structure (with the wqh and a task_struct*) referenced
> by file->private_data.  Not sure whether the added complication
> would be worthwhile, though.

My original patches did exactly that (minus the reference back to the
task_struct).  However, there are a couple of problems with that
approach.  First, it assumes that a task_struct has only a single file
referencing it, but in the future I'd like to support obtaining a
clonefd for an existing task.  Second, the task_struct really shouldn't
have a reference to the actual struct file, when it only needs the
wait_queue_head_t.

Also, AFAICT a wait_queue_head_t is normally (in the absence of kernel
lock debugging options) the size of two pointers.  Adding an indirection
and an extra allocation to change that to the size of one pointer seems
iffy, especially when looking at the rest of what's directly in
task_struct that's far larger.

> > --- /dev/null
> > +++ b/kernel/clonefd.c
> > @@ -0,0 +1,121 @@
> > +/*
> > + * Support functions for CLONE_FD
> > + *
> > + * Copyright (c) 2015 Intel Corporation
> > + * Original authors: Josh Triplett <josh@joshtriplett.org>
> > + *                   Thiago Macieira <thiago@macieira.org>
> > + */
> > +#include <linux/anon_inodes.h>
> > +#include <linux/file.h>
> > +#include <linux/fs.h>
> > +#include <linux/poll.h>
> > +#include <linux/slab.h>
> > +#include "clonefd.h"
> > +
> > +static int clonefd_release(struct inode *inode, struct file *file)
> > +{
> > +       put_task_struct(file->private_data);
> > +       return 0;
> > +}
> > +
> > +static unsigned int clonefd_poll(struct file *file, poll_table *wait)
> > +{
> > +       struct task_struct *p = file->private_data;
> > +       poll_wait(file, &p->clonefd_wqh, wait);
> > +       return p->exit_state ? (POLLIN | POLLRDNORM | POLLHUP) : 0;
> > +}
> > +
> > +static ssize_t clonefd_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
> > +{
> > +       struct task_struct *p = file->private_data;
> > +       int ret = 0;
> > +
> > +       /* EOF after first read */
> > +       if (*ppos)
> > +               return 0;
> > +
> > +       if (file->f_flags & O_NONBLOCK)
> > +               ret = -EAGAIN;
> > +       else
> > +               ret = wait_event_interruptible(p->clonefd_wqh, p->exit_state);
> > +
> > +       if (p->exit_state) {
> > +               struct clonefd_info info = {};
> > +               cputime_t utime, stime;
> > +               task_exit_code_status(p->exit_code, &info.code, &info.status);
> > +               info.code &= ~__SI_MASK;
> > +               task_cputime(p, &utime, &stime);
> > +               info.utime = cputime_to_clock_t(utime + p->signal->utime);
> > +               info.stime = cputime_to_clock_t(stime + p->signal->stime);
> > +               ret = simple_read_from_buffer(buf, count, ppos, &info, sizeof(info));
> > +       }
> > +       return ret;
> > +}
> > +
> > +static struct file_operations clonefd_fops = {
> > +       .release = clonefd_release,
> > +       .poll = clonefd_poll,
> > +       .read = clonefd_read,
> > +       .llseek = no_llseek,
> > +};
> 
> It might be nice to include a show_fdinfo() implementation that shows
> (say) the pid that the clonefd refers to.  E.g. something like:
> 
> static void clonefd_show_fdinfo(struct seq_file *m, struct file *file)
> {
>     struct task_struct *p = file->private_data;
> 
>     seq_printf(m, "tid:\t%d\n", task_tgid_vnr(p));
> }

I thought about that, but that would add a couple of additional ifdefs
(CONFIG_PROC_FS), for an informational file of minimal value.  More
importantly, I don't want to add that until after adding an ioctl or
similar to programmatically obtain the pid from a clonefd; otherwise,
someone might try to use fdinfo as the "API" to do so, which would be
all kinds of awful.

So I'd prefer to add fdinfo in a future extension of clonefd, rather
than in the initial patch series.

> > +
> > +/* Do process exit notification for clonefd. */
> > +void clonefd_do_notify(struct task_struct *p)
> > +{
> > +       if (p->clonefd)
> > +               wake_up_all(&p->clonefd_wqh);
> > +}
> > +
> > +/* Handle the CLONE_FD case for copy_process. */
> > +int clonefd_do_clone(u64 clone_flags, struct task_struct *p,
> > +                    struct clone4_args *args, struct clonefd_setup *setup)
> > +{
> > +       int flags;
> > +       struct file *file;
> > +       int fd;
> > +
> > +       p->clonefd = !!(clone_flags & CLONE_FD);
> > +       if (!p->clonefd)
> > +               return 0;
> > +
> > +       if (args->clonefd_flags & ~(O_CLOEXEC | O_NONBLOCK))
> > +               return -EINVAL;
> > +
> 
> Maybe also check for (args->clonefd == NULL) in advance, and
> return -EINVAL or -EFAULT?

That wouldn't be consistent with how clone treats its various other
out argument pointers.

- Josh Triplett

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 3/7] Introduce a new clone4 syscall with more flag bits and extensible arguments
  2015-03-23 15:05     ` josh
@ 2015-03-31 14:41       ` David Drysdale
  0 siblings, 0 replies; 63+ messages in thread
From: David Drysdale @ 2015-03-31 14:41 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, Thiago Macieira, linux-kernel,
	Linux API, Linux FS Devel, X86 ML

On Mon, Mar 23, 2015 at 3:05 PM,  <josh@joshtriplett.org> wrote:
> On Mon, Mar 23, 2015 at 02:11:45PM +0000, David Drysdale wrote:
>> On Sun, Mar 15, 2015 at 7:59 AM, Josh Triplett <josh@joshtriplett.org> wrote:
>> > diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
>> > index 0286735..ba28306 100644
>> > --- a/arch/x86/ia32/ia32entry.S
>> > +++ b/arch/x86/ia32/ia32entry.S
>> > @@ -483,6 +483,7 @@ GLOBAL(\label)
>> >         PTREGSCALL stub32_execveat, compat_sys_execveat
>> >         PTREGSCALL stub32_fork, sys_fork
>> >         PTREGSCALL stub32_vfork, sys_vfork
>> > +       PTREGSCALL stub32_clone4, compat_sys_clone4
>> >
>> >         ALIGN
>> >  GLOBAL(stub32_clone)
>> > diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
>> > index 1d74d16..ead143f 100644
>> > --- a/arch/x86/kernel/entry_64.S
>> > +++ b/arch/x86/kernel/entry_64.S
>> > @@ -520,6 +520,7 @@ END(\label)
>> >         FORK_LIKE  clone
>> >         FORK_LIKE  fork
>> >         FORK_LIKE  vfork
>> > +       FORK_LIKE  clone4
>> >         FIXED_FRAME stub_iopl, sys_iopl
>> >
>> >  ENTRY(stub_execve)
>> > diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
>> > index b3560ec..56fcc90 100644
>> > --- a/arch/x86/syscalls/syscall_32.tbl
>> > +++ b/arch/x86/syscalls/syscall_32.tbl
>> > @@ -365,3 +365,4 @@
>> >  356    i386    memfd_create            sys_memfd_create
>> >  357    i386    bpf                     sys_bpf
>> >  358    i386    execveat                sys_execveat                    stub32_execveat
>> > +359    i386    clone4                  sys_clone4                      stub32_clone4
>> > diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
>> > index 8d656fb..af15b0f 100644
>> > --- a/arch/x86/syscalls/syscall_64.tbl
>> > +++ b/arch/x86/syscalls/syscall_64.tbl
>> > @@ -329,6 +329,7 @@
>> >  320    common  kexec_file_load         sys_kexec_file_load
>> >  321    common  bpf                     sys_bpf
>> >  322    64      execveat                stub_execveat
>> > +323    64      clone4                  stub_clone4
>> >
>> >  #
>> >  # x32-specific system call numbers start at 512 to avoid cache impact
>> > @@ -368,3 +369,4 @@
>> >  543    x32     io_setup                compat_sys_io_setup
>> >  544    x32     io_submit               compat_sys_io_submit
>> >  545    x32     execveat                stub_x32_execveat
>> > +546    x32     clone4                  stub32_clone4
>>
>> Doesn't this need an x32 specific wrapper (to ensure the full
>> set of registers are saved)?
>
> I'm not an x32 expert; I don't know how x32 interacts with pt_regs and
> compat syscalls.  Could an x32 expert weigh in, please?
>
> - Josh Triplett

(In the absence of an x32 expert chiming in...)

As I understand it:
 - stub32_clone4 expects 32-bit calling conventions and calls compat_sys_clone4
 - stub_clone4 expects 64-bit calling conventions and calls sys_clone4
 - stub_x32_clone4 would expect 64-bit calling conventions but call
   compat_sys_clone4.

Also, I have a suspicion that different field types in the [compat_]clone4_args
structure may cause problems -- I *think* its (final) layout will be 4+4+4+4+4+4
on 32-bit, 8+8+8+8+8+4 on 64-bit, but 4+4+8+8+4+4 on x32.

Have you tried running a test with a userspace program compiled with -mx32?

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 0/7] CLONE_FD: Task exit notification via file descriptor
  2015-03-15  7:59 ` Josh Triplett
                   ` (10 preceding siblings ...)
  (?)
@ 2015-03-31 20:08 ` Jonathan Corbet
  2015-03-31 22:02   ` josh
  -1 siblings, 1 reply; 63+ messages in thread
From: Jonathan Corbet @ 2015-03-31 20:08 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, Thiago Macieira, linux-kernel,
	linux-api, linux-fsdevel, x86

So I finally got around to having a look at this, and one thing caught my
eye:

>              read(2) (and similar)
>                      When  the  new  process  exits,  reading  from  the  file
>                      descriptor produces a single clonefd_info structure:
> 
>                      struct clonefd_info {
>                          uint32_t code;   /* Signal code */
>                          uint32_t status; /* Exit status or signal */
>                          uint64_t utime;  /* User CPU time */
>                          uint64_t stime;  /* System CPU time */
>                      };

This would appear to assume that a clonefd_info structure is the only
thing that will ever be read from this descriptor.  It seems to me that
there is the potential for, someday, wanting to be able to read and write
other things as well.  Should this structure be marked with type and
length fields so that other structures could be added in the future?

(I suppose we could just use ioctl() for any other functionality in the
future, though...:)

jon

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 0/7] CLONE_FD: Task exit notification via file descriptor
  2015-03-31 20:08 ` Jonathan Corbet
@ 2015-03-31 22:02   ` josh
  2015-04-01  7:24     ` Jonathan Corbet
  0 siblings, 1 reply; 63+ messages in thread
From: josh @ 2015-03-31 22:02 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, Thiago Macieira, linux-kernel,
	linux-api, linux-fsdevel, x86

On Tue, Mar 31, 2015 at 10:08:07PM +0200, Jonathan Corbet wrote:
> So I finally got around to having a look at this, and one thing caught my
> eye:
> 
> >              read(2) (and similar)
> >                      When  the  new  process  exits,  reading  from  the  file
> >                      descriptor produces a single clonefd_info structure:
> > 
> >                      struct clonefd_info {
> >                          uint32_t code;   /* Signal code */
> >                          uint32_t status; /* Exit status or signal */
> >                          uint64_t utime;  /* User CPU time */
> >                          uint64_t stime;  /* System CPU time */
> >                      };
> 
> This would appear to assume that a clonefd_info structure is the only
> thing that will ever be read from this descriptor.  It seems to me that
> there is the potential for, someday, wanting to be able to read and write
> other things as well.  Should this structure be marked with type and
> length fields so that other structures could be added in the future?

I don't think it makes sense for a caller to get an arbitrary structure
on read(), and have to figure out what they got and ignore something
they don't understand.  Instead, I think it makes more sense for the
caller to say "Hey, here's a flag saying I understand the new thing, go
ahead and give me the new thing".  So, for instance, if you want to
receive SIGSTOP/SIGCONT messages for child processes through this
descriptor, we could add a flag for that.

- Josh Triplett

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 0/7] CLONE_FD: Task exit notification via file descriptor
  2015-03-31 22:02   ` josh
@ 2015-04-01  7:24     ` Jonathan Corbet
  2015-04-09  2:19         ` Josh Triplett
  0 siblings, 1 reply; 63+ messages in thread
From: Jonathan Corbet @ 2015-04-01  7:24 UTC (permalink / raw)
  To: josh
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, Thiago Macieira, linux-kernel,
	linux-api, linux-fsdevel, x86

On Tue, 31 Mar 2015 15:02:24 -0700
josh@joshtriplett.org wrote:

> > This would appear to assume that a clonefd_info structure is the only
> > thing that will ever be read from this descriptor.  It seems to me that
> > there is the potential for, someday, wanting to be able to read and write
> > other things as well.  Should this structure be marked with type and
> > length fields so that other structures could be added in the future?  
> 
> I don't think it makes sense for a caller to get an arbitrary structure
> on read(), and have to figure out what they got and ignore something
> they don't understand.  Instead, I think it makes more sense for the
> caller to say "Hey, here's a flag saying I understand the new thing, go
> ahead and give me the new thing".  So, for instance, if you want to
> receive SIGSTOP/SIGCONT messages for child processes through this
> descriptor, we could add a flag for that.

The flag is fine, but, once we have set that flag saying we want those
messages, how do we know which type of structure we've gotten?  That's
the piece of the puzzle I'm missing, sorry if I'm being overly slow.

Thanks,

jon

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 7/7] clone4: Add a CLONE_FD flag to get task exit notification via fd
@ 2015-04-06  8:30     ` Sergey Senozhatsky
  0 siblings, 0 replies; 63+ messages in thread
From: Sergey Senozhatsky @ 2015-04-06  8:30 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, Thiago Macieira, linux-kernel,
	linux-api, linux-fsdevel, x86

On (03/15/15 01:00), Josh Triplett wrote:
[..]
> +
> +/* Handle the CLONE_FD case for copy_process. */
> +int clonefd_do_clone(u64 clone_flags, struct task_struct *p,
> +		     struct clone4_args *args, struct clonefd_setup *setup)
> +{
> +	int flags;
> +	struct file *file;
> +	int fd;
> +
> +	p->clonefd = !!(clone_flags & CLONE_FD);
> +	if (!p->clonefd)
> +		return 0;
> +
> +	if (args->clonefd_flags & ~(O_CLOEXEC | O_NONBLOCK))
> +		return -EINVAL;
> +
> +	init_waitqueue_head(&p->clonefd_wqh);
> +
> +	get_task_struct(p);
> +	flags = O_RDONLY | FMODE_ATOMIC_POS | args->clonefd_flags;
> +	file = anon_inode_getfile("[process]", &clonefd_fops, p, flags);
> +	if (IS_ERR(file)) {
> +		put_task_struct(p);
> +		return PTR_ERR(file);
> +	}
> +
> +	fd = get_unused_fd_flags(flags);
> +	if (fd < 0) {

+		put_task_struct(p); ?

> +		fput(file);
> +		return fd;
> +	}
> +
> +	setup->fd = fd;
> +	setup->file = file;
> +	return 0;
> +}
[..]

	-ss

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 7/7] clone4: Add a CLONE_FD flag to get task exit notification via fd
@ 2015-04-06  8:30     ` Sergey Senozhatsky
  0 siblings, 0 replies; 63+ messages in thread
From: Sergey Senozhatsky @ 2015-04-06  8:30 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, Thiago Macieira,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A

On (03/15/15 01:00), Josh Triplett wrote:
[..]
> +
> +/* Handle the CLONE_FD case for copy_process. */
> +int clonefd_do_clone(u64 clone_flags, struct task_struct *p,
> +		     struct clone4_args *args, struct clonefd_setup *setup)
> +{
> +	int flags;
> +	struct file *file;
> +	int fd;
> +
> +	p->clonefd = !!(clone_flags & CLONE_FD);
> +	if (!p->clonefd)
> +		return 0;
> +
> +	if (args->clonefd_flags & ~(O_CLOEXEC | O_NONBLOCK))
> +		return -EINVAL;
> +
> +	init_waitqueue_head(&p->clonefd_wqh);
> +
> +	get_task_struct(p);
> +	flags = O_RDONLY | FMODE_ATOMIC_POS | args->clonefd_flags;
> +	file = anon_inode_getfile("[process]", &clonefd_fops, p, flags);
> +	if (IS_ERR(file)) {
> +		put_task_struct(p);
> +		return PTR_ERR(file);
> +	}
> +
> +	fd = get_unused_fd_flags(flags);
> +	if (fd < 0) {

+		put_task_struct(p); ?

> +		fput(file);
> +		return fd;
> +	}
> +
> +	setup->fd = fd;
> +	setup->file = file;
> +	return 0;
> +}
[..]

	-ss

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 7/7] clone4: Add a CLONE_FD flag to get task exit notification via fd
  2015-04-06  8:30     ` Sergey Senozhatsky
@ 2015-04-06  9:31       ` Josh Triplett
  -1 siblings, 0 replies; 63+ messages in thread
From: Josh Triplett @ 2015-04-06  9:31 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, Thiago Macieira, linux-kernel,
	linux-api, linux-fsdevel, x86

On Mon, Apr 06, 2015 at 05:30:35PM +0900, Sergey Senozhatsky wrote:
> On (03/15/15 01:00), Josh Triplett wrote:
> [..]
> > +
> > +/* Handle the CLONE_FD case for copy_process. */
> > +int clonefd_do_clone(u64 clone_flags, struct task_struct *p,
> > +		     struct clone4_args *args, struct clonefd_setup *setup)
> > +{
> > +	int flags;
> > +	struct file *file;
> > +	int fd;
> > +
> > +	p->clonefd = !!(clone_flags & CLONE_FD);
> > +	if (!p->clonefd)
> > +		return 0;
> > +
> > +	if (args->clonefd_flags & ~(O_CLOEXEC | O_NONBLOCK))
> > +		return -EINVAL;
> > +
> > +	init_waitqueue_head(&p->clonefd_wqh);
> > +
> > +	get_task_struct(p);
> > +	flags = O_RDONLY | FMODE_ATOMIC_POS | args->clonefd_flags;
> > +	file = anon_inode_getfile("[process]", &clonefd_fops, p, flags);
> > +	if (IS_ERR(file)) {
> > +		put_task_struct(p);
> > +		return PTR_ERR(file);
> > +	}
> > +
> > +	fd = get_unused_fd_flags(flags);
> > +	if (fd < 0) {
> 
> +		put_task_struct(p); ?

No, once anon_inode_getfile has succeeded, the file owns the reference
to the task_struct, so fput(file) will call the release function which
calls put_task_struct.  Only the failure case for anon_inode_getfile
needs to call put_task_struct directly.

> > +		fput(file);
> > +		return fd;
> > +	}

- Josh Triplett

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 7/7] clone4: Add a CLONE_FD flag to get task exit notification via fd
@ 2015-04-06  9:31       ` Josh Triplett
  0 siblings, 0 replies; 63+ messages in thread
From: Josh Triplett @ 2015-04-06  9:31 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, Thiago Macieira,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A

On Mon, Apr 06, 2015 at 05:30:35PM +0900, Sergey Senozhatsky wrote:
> On (03/15/15 01:00), Josh Triplett wrote:
> [..]
> > +
> > +/* Handle the CLONE_FD case for copy_process. */
> > +int clonefd_do_clone(u64 clone_flags, struct task_struct *p,
> > +		     struct clone4_args *args, struct clonefd_setup *setup)
> > +{
> > +	int flags;
> > +	struct file *file;
> > +	int fd;
> > +
> > +	p->clonefd = !!(clone_flags & CLONE_FD);
> > +	if (!p->clonefd)
> > +		return 0;
> > +
> > +	if (args->clonefd_flags & ~(O_CLOEXEC | O_NONBLOCK))
> > +		return -EINVAL;
> > +
> > +	init_waitqueue_head(&p->clonefd_wqh);
> > +
> > +	get_task_struct(p);
> > +	flags = O_RDONLY | FMODE_ATOMIC_POS | args->clonefd_flags;
> > +	file = anon_inode_getfile("[process]", &clonefd_fops, p, flags);
> > +	if (IS_ERR(file)) {
> > +		put_task_struct(p);
> > +		return PTR_ERR(file);
> > +	}
> > +
> > +	fd = get_unused_fd_flags(flags);
> > +	if (fd < 0) {
> 
> +		put_task_struct(p); ?

No, once anon_inode_getfile has succeeded, the file owns the reference
to the task_struct, so fput(file) will call the release function which
calls put_task_struct.  Only the failure case for anon_inode_getfile
needs to call put_task_struct directly.

> > +		fput(file);
> > +		return fd;
> > +	}

- Josh Triplett

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 0/7] CLONE_FD: Task exit notification via file descriptor
@ 2015-04-09  2:19         ` Josh Triplett
  0 siblings, 0 replies; 63+ messages in thread
From: Josh Triplett @ 2015-04-09  2:19 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, Thiago Macieira, linux-kernel,
	linux-api, linux-fsdevel, x86

On Wed, Apr 01, 2015 at 09:24:20AM +0200, Jonathan Corbet wrote:
> On Tue, 31 Mar 2015 15:02:24 -0700
> josh@joshtriplett.org wrote:
> 
> > > This would appear to assume that a clonefd_info structure is the only
> > > thing that will ever be read from this descriptor.  It seems to me that
> > > there is the potential for, someday, wanting to be able to read and write
> > > other things as well.  Should this structure be marked with type and
> > > length fields so that other structures could be added in the future?  
> > 
> > I don't think it makes sense for a caller to get an arbitrary structure
> > on read(), and have to figure out what they got and ignore something
> > they don't understand.  Instead, I think it makes more sense for the
> > caller to say "Hey, here's a flag saying I understand the new thing, go
> > ahead and give me the new thing".  So, for instance, if you want to
> > receive SIGSTOP/SIGCONT messages for child processes through this
> > descriptor, we could add a flag for that.
> 
> The flag is fine, but, once we have set that flag saying we want those
> messages, how do we know which type of structure we've gotten?  That's
> the piece of the puzzle I'm missing, sorry if I'm being overly slow.

If you pass a flag saying you can handle a new set of potential
structures, those structures can then include any necessary
disambiguating flags/IDs/etc.  No need for them to match the current
clonefd_info structure if userspace has opted into a new version.

- Josh Triplett

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 0/7] CLONE_FD: Task exit notification via file descriptor
@ 2015-04-09  2:19         ` Josh Triplett
  0 siblings, 0 replies; 63+ messages in thread
From: Josh Triplett @ 2015-04-09  2:19 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, Thiago Macieira,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A

On Wed, Apr 01, 2015 at 09:24:20AM +0200, Jonathan Corbet wrote:
> On Tue, 31 Mar 2015 15:02:24 -0700
> josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org wrote:
> 
> > > This would appear to assume that a clonefd_info structure is the only
> > > thing that will ever be read from this descriptor.  It seems to me that
> > > there is the potential for, someday, wanting to be able to read and write
> > > other things as well.  Should this structure be marked with type and
> > > length fields so that other structures could be added in the future?  
> > 
> > I don't think it makes sense for a caller to get an arbitrary structure
> > on read(), and have to figure out what they got and ignore something
> > they don't understand.  Instead, I think it makes more sense for the
> > caller to say "Hey, here's a flag saying I understand the new thing, go
> > ahead and give me the new thing".  So, for instance, if you want to
> > receive SIGSTOP/SIGCONT messages for child processes through this
> > descriptor, we could add a flag for that.
> 
> The flag is fine, but, once we have set that flag saying we want those
> messages, how do we know which type of structure we've gotten?  That's
> the piece of the puzzle I'm missing, sorry if I'm being overly slow.

If you pass a flag saying you can handle a new set of potential
structures, those structures can then include any necessary
disambiguating flags/IDs/etc.  No need for them to match the current
clonefd_info structure if userspace has opted into a new version.

- Josh Triplett

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 0/7] CLONE_FD: Task exit notification via file descriptor
@ 2015-05-29  7:43   ` Florian Weimer
  0 siblings, 0 replies; 63+ messages in thread
From: Florian Weimer @ 2015-05-29  7:43 UTC (permalink / raw)
  To: Josh Triplett, Al Viro, Andrew Morton, Andy Lutomirski,
	Ingo Molnar, Kees Cook, Oleg Nesterov, Paul E. McKenney,
	H. Peter Anvin, Rik van Riel, Thomas Gleixner, Michael Kerrisk,
	Thiago Macieira, linux-kernel, linux-api, linux-fsdevel, x86

On 03/15/2015 08:59 AM, Josh Triplett wrote:
> This patch series introduces a new clone flag, CLONE_FD, which lets the caller
> receive child process exit notification via a file descriptor rather than
> SIGCHLD.  CLONE_FD makes it possible for libraries to safely launch and manage
> child processes on behalf of their caller, *without* taking over process-wide
> SIGCHLD handling (either via signal handler or signalfd).
> 
> Note that signalfd for SIGCHLD does not suffice here, because that still
> receives notification for all child processes, and interferes with process-wide
> signal handling.

It has been suggested (e.g.,
<https://sourceware.org/bugzilla/show_bug.cgi?id=15661#c3>) that you can
use the existing clone(2) without specifying SIGCHLD to create a new
process.  The resulting child process is not supposed to show up in
wait(2), only in a waitpid(2) (or similar) explicitly specifying the
PID.  Is this not the case?

-- 
Florian Weimer / Red Hat Product Security

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 0/7] CLONE_FD: Task exit notification via file descriptor
@ 2015-05-29  7:43   ` Florian Weimer
  0 siblings, 0 replies; 63+ messages in thread
From: Florian Weimer @ 2015-05-29  7:43 UTC (permalink / raw)
  To: Josh Triplett, Al Viro, Andrew Morton, Andy Lutomirski,
	Ingo Molnar, Kees Cook, Oleg Nesterov, Paul E. McKenney,
	H. Peter Anvin, Rik van Riel, Thomas Gleixner, Michael Kerrisk,
	Thiago Macieira, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A

On 03/15/2015 08:59 AM, Josh Triplett wrote:
> This patch series introduces a new clone flag, CLONE_FD, which lets the caller
> receive child process exit notification via a file descriptor rather than
> SIGCHLD.  CLONE_FD makes it possible for libraries to safely launch and manage
> child processes on behalf of their caller, *without* taking over process-wide
> SIGCHLD handling (either via signal handler or signalfd).
> 
> Note that signalfd for SIGCHLD does not suffice here, because that still
> receives notification for all child processes, and interferes with process-wide
> signal handling.

It has been suggested (e.g.,
<https://sourceware.org/bugzilla/show_bug.cgi?id=15661#c3>) that you can
use the existing clone(2) without specifying SIGCHLD to create a new
process.  The resulting child process is not supposed to show up in
wait(2), only in a waitpid(2) (or similar) explicitly specifying the
PID.  Is this not the case?

-- 
Florian Weimer / Red Hat Product Security

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 0/7] CLONE_FD: Task exit notification via file descriptor
  2015-05-29  7:43   ` Florian Weimer
  (?)
@ 2015-05-29 20:27   ` Thiago Macieira
  2015-06-15 10:06       ` Florian Weimer
  -1 siblings, 1 reply; 63+ messages in thread
From: Thiago Macieira @ 2015-05-29 20:27 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Josh Triplett, Al Viro, Andrew Morton, Andy Lutomirski,
	Ingo Molnar, Kees Cook, Oleg Nesterov, Paul E. McKenney,
	H. Peter Anvin, Rik van Riel, Thomas Gleixner, Michael Kerrisk,
	linux-kernel, linux-api, linux-fsdevel, x86

On Friday 29 May 2015 09:43:35 Florian Weimer wrote:
> On 03/15/2015 08:59 AM, Josh Triplett wrote:
> > This patch series introduces a new clone flag, CLONE_FD, which lets the
> > caller receive child process exit notification via a file descriptor
> > rather than SIGCHLD.  CLONE_FD makes it possible for libraries to safely
> > launch and manage child processes on behalf of their caller, *without*
> > taking over process-wide SIGCHLD handling (either via signal handler or
> > signalfd).
> > 
> > Note that signalfd for SIGCHLD does not suffice here, because that still
> > receives notification for all child processes, and interferes with
> > process-wide signal handling.
> 
> It has been suggested (e.g.,
> <https://sourceware.org/bugzilla/show_bug.cgi?id=15661#c3>) that you can
> use the existing clone(2) without specifying SIGCHLD to create a new
> process.  The resulting child process is not supposed to show up in
> wait(2), only in a waitpid(2) (or similar) explicitly specifying the
> PID.  Is this not the case?

Hi Florian

That sounds orthogonal to what we're looking for. Our objective is to get 
notification of when the child exited without resorting to SIGCHLD. If we use 
the regular clone(2) without SIGCHLD and without CLONE_FD, we get no 
notification. The only way to know of the child's termination is by a blocking 
waitpid(2), like you indicated, which is counter productive to our needs.

We need something we can select(2)/poll(2) on.
-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 0/7] CLONE_FD: Task exit notification via file descriptor
  2015-05-29 20:27   ` Thiago Macieira
@ 2015-06-15 10:06       ` Florian Weimer
  0 siblings, 0 replies; 63+ messages in thread
From: Florian Weimer @ 2015-06-15 10:06 UTC (permalink / raw)
  To: Thiago Macieira
  Cc: Josh Triplett, Al Viro, Andrew Morton, Andy Lutomirski,
	Ingo Molnar, Kees Cook, Oleg Nesterov, Paul E. McKenney,
	H. Peter Anvin, Rik van Riel, Thomas Gleixner, Michael Kerrisk,
	linux-kernel, linux-api, linux-fsdevel, x86

On 05/29/2015 10:27 PM, Thiago Macieira wrote:

>> It has been suggested (e.g.,
>> <https://sourceware.org/bugzilla/show_bug.cgi?id=15661#c3>) that you can
>> use the existing clone(2) without specifying SIGCHLD to create a new
>> process.  The resulting child process is not supposed to show up in
>> wait(2), only in a waitpid(2) (or similar) explicitly specifying the
>> PID.  Is this not the case?
> 
> Hi Florian
> 
> That sounds orthogonal to what we're looking for. Our objective is to get 
> notification of when the child exited without resorting to SIGCHLD. If we use 
> the regular clone(2) without SIGCHLD and without CLONE_FD, we get no 
> notification. The only way to know of the child's termination is by a blocking 
> waitpid(2), like you indicated, which is counter productive to our needs.
> 
> We need something we can select(2)/poll(2) on.

Thanks for the clarification.  I agree that this is a separate and quite
sensible use case.

-- 
Florian Weimer / Red Hat Product Security

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 0/7] CLONE_FD: Task exit notification via file descriptor
@ 2015-06-15 10:06       ` Florian Weimer
  0 siblings, 0 replies; 63+ messages in thread
From: Florian Weimer @ 2015-06-15 10:06 UTC (permalink / raw)
  To: Thiago Macieira
  Cc: Josh Triplett, Al Viro, Andrew Morton, Andy Lutomirski,
	Ingo Molnar, Kees Cook, Oleg Nesterov, Paul E. McKenney,
	H. Peter Anvin, Rik van Riel, Thomas Gleixner, Michael Kerrisk,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A

On 05/29/2015 10:27 PM, Thiago Macieira wrote:

>> It has been suggested (e.g.,
>> <https://sourceware.org/bugzilla/show_bug.cgi?id=15661#c3>) that you can
>> use the existing clone(2) without specifying SIGCHLD to create a new
>> process.  The resulting child process is not supposed to show up in
>> wait(2), only in a waitpid(2) (or similar) explicitly specifying the
>> PID.  Is this not the case?
> 
> Hi Florian
> 
> That sounds orthogonal to what we're looking for. Our objective is to get 
> notification of when the child exited without resorting to SIGCHLD. If we use 
> the regular clone(2) without SIGCHLD and without CLONE_FD, we get no 
> notification. The only way to know of the child's termination is by a blocking 
> waitpid(2), like you indicated, which is counter productive to our needs.
> 
> We need something we can select(2)/poll(2) on.

Thanks for the clarification.  I agree that this is a separate and quite
sensible use case.

-- 
Florian Weimer / Red Hat Product Security

^ permalink raw reply	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2015-06-15 10:06 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-15  7:59 [PATCH v2 0/7] CLONE_FD: Task exit notification via file descriptor Josh Triplett
2015-03-15  7:59 ` Josh Triplett
2015-03-15  7:59 ` [PATCH v2 1/7] clone: Support passing tls argument via C rather than pt_regs magic Josh Triplett
2015-03-15  7:59 ` [PATCH v2 2/7] x86: Opt into HAVE_COPY_THREAD_TLS, for both 32-bit and 64-bit Josh Triplett
2015-03-15  7:59   ` Josh Triplett
2015-03-15  7:59 ` [PATCH v2 3/7] Introduce a new clone4 syscall with more flag bits and extensible arguments Josh Triplett
2015-03-23 14:11   ` David Drysdale
2015-03-23 14:11     ` David Drysdale
2015-03-23 15:05     ` josh
2015-03-31 14:41       ` David Drysdale
2015-03-15  7:59 ` [PATCH v2 4/7] kernel/fork.c: Pass arguments to _do_fork and copy_process using clone4_args Josh Triplett
2015-03-15  8:00 ` [PATCH v2 5/7] clone4: Add a CLONE_AUTOREAP flag to automatically reap the child process Josh Triplett
2015-03-15 14:52   ` Oleg Nesterov
2015-03-15 14:52     ` Oleg Nesterov
2015-03-15 17:18     ` Josh Triplett
2015-03-15 17:18       ` Josh Triplett
2015-03-15 19:55       ` Oleg Nesterov
2015-03-15 19:55         ` Oleg Nesterov
2015-03-15 23:34         ` Josh Triplett
2015-03-15 23:34           ` Josh Triplett
2015-03-20 18:14           ` Oleg Nesterov
2015-03-20 18:46             ` Thiago Macieira
2015-03-20 19:09               ` Oleg Nesterov
2015-03-20 19:09                 ` Oleg Nesterov
2015-03-20 21:10                 ` josh
2015-03-15  8:00 ` [PATCH v2 6/7] signal: Factor out a helper function to process task_struct exit_code Josh Triplett
2015-03-15  8:00 ` [PATCH v2 7/7] clone4: Add a CLONE_FD flag to get task exit notification via fd Josh Triplett
2015-03-23 17:38   ` David Drysdale
2015-03-25 14:53     ` Josh Triplett
2015-04-06  8:30   ` Sergey Senozhatsky
2015-04-06  8:30     ` Sergey Senozhatsky
2015-04-06  9:31     ` Josh Triplett
2015-04-06  9:31       ` Josh Triplett
2015-03-15  8:00 ` [PATCH v2 man-pages] clone4.2: New manpage documenting clone4(2) Josh Triplett
2015-03-15  8:04 ` [PATCH v2 0/7] CLONE_FD: Task exit notification via file descriptor Josh Triplett
2015-03-15  8:04   ` Josh Triplett
2015-03-16 21:44 ` Kees Cook
2015-03-16 21:44   ` Kees Cook
2015-03-16 22:14   ` Thiago Macieira
2015-03-16 22:14     ` Thiago Macieira
2015-03-16 22:36     ` Kees Cook
2015-03-16 22:50       ` Thiago Macieira
2015-03-16 22:50         ` Thiago Macieira
2015-03-16 23:26         ` Kees Cook
2015-03-16 23:35       ` josh
2015-03-16 23:29     ` josh
2015-03-16 23:29       ` josh-iaAMLnmF4UmaiuxdJuQwMA
2015-03-17  0:49       ` Thiago Macieira
2015-03-17  0:49         ` Thiago Macieira
2015-03-23 14:12       ` David Drysdale
2015-03-23 15:03         ` josh
2015-03-16 23:25   ` josh
2015-03-16 23:25     ` josh-iaAMLnmF4UmaiuxdJuQwMA
2015-03-31 20:08 ` Jonathan Corbet
2015-03-31 22:02   ` josh
2015-04-01  7:24     ` Jonathan Corbet
2015-04-09  2:19       ` Josh Triplett
2015-04-09  2:19         ` Josh Triplett
2015-05-29  7:43 ` Florian Weimer
2015-05-29  7:43   ` Florian Weimer
2015-05-29 20:27   ` Thiago Macieira
2015-06-15 10:06     ` Florian Weimer
2015-06-15 10:06       ` Florian Weimer

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.