* For review: seccomp_user_notif(2) manual page
@ 2020-09-30 11:07 Michael Kerrisk (man-pages)
From: Michael Kerrisk (man-pages) @ 2020-09-30 11:07 UTC (permalink / raw)
  To: Tycho Andersen, Sargun Dhillon
  Cc: mtk.manpages, Kees Cook, Christian Brauner, linux-man, lkml,
	Aleksa Sarai, Jann Horn, Alexei Starovoitov, wad, bpf, Song Liu,
	Daniel Borkmann, Andy Lutomirski, Linux Containers,
	Giuseppe Scrivano, Robert Sesek

Hi Tycho, Sargun (and all),

I knew it would be a big ask, but below is kind of the manual page
I was hoping you might write [1] for the seccomp user-space notification
mechanism. Since you didn't (and because 5.9 adds various new pieces 
such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD 
that also will need documenting [2]), I did :-). But of course I may 
have made mistakes...

I've shown the rendered version of the page below, and would love
to receive review comments from you and others, and acks, etc.

There are a few FIXMEs sprinkled into the page, including one
that relates to what appears to me to be a misdesign (possibly 
fixable) in the operation of the SECCOMP_IOCTL_NOTIF_RECV 
operation. I would be especially interested in feedback on that
FIXME, and also of course the other FIXMEs.

The page includes an extensive (albeit slightly contrived)
example program, and I would be happy also to receive comments
on that program.

The page source currently sits in a branch (along with the text
that you sent me for the seccomp(2) page) at
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=seccomp_user_notif

Thanks,

Michael

[1] https://lore.kernel.org/linux-man/2cea5fec-e73e-5749-18af-15c35a4bd23c@gmail.com/#t
[2] Sargun, can you prepare something on SECCOMP_ADDFD_FLAG_SETFD
    and SECCOMP_IOCTL_NOTIF_ADDFD to be added to this page?

=====

NAME
       seccomp_user_notif - Seccomp user-space notification mechanism

SYNOPSIS
       #include <linux/seccomp.h>
       #include <linux/filter.h>
       #include <linux/audit.h>

       int seccomp(unsigned int operation, unsigned int flags, void *args);

DESCRIPTION
       This  page  describes  the user-space notification mechanism pro‐
       vided by the Secure Computing (seccomp) facility.  As well as the
       use   of  the  SECCOMP_FILTER_FLAG_NEW_LISTENER  flag,  the  SEC‐
       COMP_RET_USER_NOTIF action value, and the SECCOMP_GET_NOTIF_SIZES
       operation  described  in  seccomp(2), this mechanism involves the
       use of a number of related ioctl(2) operations (described below).

   Overview
       In conventional usage of a seccomp filter, the decision about how
       to  treat  a particular system call is made by the filter itself.
       The user-space notification mechanism allows the handling of  the
       system  call  to  instead  be handed off to a user-space process.
       The advantages of doing this are that, by contrast with the  sec‐
       comp  filter,  which  is  running on a virtual machine inside the
       kernel, the user-space process has access to information that  is
       unavailable to the seccomp filter and it can perform actions that
       can't be performed from the seccomp filter.

       In the discussion that follows, the process  that  has  installed
       the  seccomp filter is referred to as the target, and the process
       that is notified by  the  user-space  notification  mechanism  is
       referred  to  as  the  supervisor.  An overview of the steps per‐
       formed by these two processes is as follows:

       1. The target process establishes a seccomp filter in  the  usual
          manner, but with two differences:

          · The seccomp(2) flags argument includes the flag SECCOMP_FIL‐
            TER_FLAG_NEW_LISTENER.  Consequently, the return  value   of
            the  (successful)  seccomp(2) call is a new "listening" file
            descriptor that can be used to receive notifications.

          · In cases where it is appropriate, the seccomp filter returns
            the  action value SECCOMP_RET_USER_NOTIF.  This return value
            will trigger a notification event.

       2. In order that the supervisor process can obtain  notifications
          using  the  listening  file  descriptor, (a duplicate of) that
          file descriptor must be passed from the target process to  the
          supervisor process.  One way in which this could be done is by
          passing the file descriptor over a UNIX domain socket  connec‐
          tion between the two processes (using the SCM_RIGHTS ancillary
          message type described in unix(7)).   Another  possibility  is
          that  the  supervisor  might  inherit  the file descriptor via
          fork(2).

       3. The supervisor process will receive notification events on the
          listening  file  descriptor.   These  events  are  returned as
          structures of type seccomp_notif.  Because this structure  and
          its  size may evolve over kernel versions, the supervisor must
          first determine the size of  this  structure  using  the  sec‐
          comp(2)  SECCOMP_GET_NOTIF_SIZES  operation,  which  returns a
          structure of type seccomp_notif_sizes.  The  supervisor  allo‐
          cates a buffer of size seccomp_notif_sizes.seccomp_notif bytes
          to receive notification events.   In  addition, the supervisor
          allocates  another  buffer  of  size  seccomp_notif_sizes.sec‐
          comp_notif_resp  bytes  for  the  response  (a   struct   sec‐
          comp_notif_resp  structure) that it will provide to the kernel
          (and thus the target process).

       4. The target process then performs its workload, which  includes
          system  calls  that  will be controlled by the seccomp filter.
          Whenever one of these system calls causes the filter to return
          the  SECCOMP_RET_USER_NOTIF  action value, the kernel does not
          execute the system call;  instead,  execution  of  the  target
          process is temporarily blocked inside the kernel and a notifi‐
          cation event is generated on the listening file descriptor.

       5. The supervisor process can now repeatedly monitor the  listen‐
          ing   file   descriptor  for  SECCOMP_RET_USER_NOTIF-triggered
          events.   To  do  this,   the   supervisor   uses   the   SEC‐
          COMP_IOCTL_NOTIF_RECV  ioctl(2)  operation to read information
          about a notification event; this  operation  blocks  until  an
          event  is  available.   The  operation returns a seccomp_notif
          structure containing information about the system call that is
          being attempted by the target process.

       6. The    seccomp_notif    structure   returned   by   the   SEC‐
          COMP_IOCTL_NOTIF_RECV operation includes the same  information
          (a seccomp_data structure) that was passed to the seccomp fil‐
          ter.  This information allows the supervisor to  discover  the
          system  call number and the arguments for the target process's
          system call.  In addition, the notification event contains the
          PID of the target process.

          The  information  in  the notification can be used to discover
          the values of pointer arguments for the target process's  sys‐
          tem call.  (This is something that can't be done from within a
          seccomp filter.)  To do this (and  assuming  it  has  suitable
          permissions),   the   supervisor   opens   the   corresponding
          /proc/[pid]/mem file, seeks to the memory location that corre‐
          sponds to one of the pointer arguments whose value is supplied
          in the notification event, and reads bytes from that location.
          (The supervisor must be careful to avoid a race condition that
          can occur when doing this; see the  description  of  the  SEC‐
          COMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.)  In addi‐
          tion, the supervisor can access other system information  that
          is  visible  in  user space but which is not accessible from a
          seccomp filter.

          ┌─────────────────────────────────────────────────────┐
          │FIXME                                                │
          ├─────────────────────────────────────────────────────┤
          │Suppose we are reading a pathname from /proc/PID/mem │
          │for  a system call such as mkdir(). The pathname can │
          │be an arbitrary length. How do we know how much (how │
          │many pages) to read from /proc/PID/mem?              │
          └─────────────────────────────────────────────────────┘

       7. Having  obtained  information  as  per  the previous step, the
          supervisor may then choose to perform an action in response to
          the  target  process's  system call (which, as noted above, is
          not  executed  when  the  seccomp  filter  returns  the   SEC‐
          COMP_RET_USER_NOTIF action value).

          One  example  use case here relates to containers.  The target
          process may be located inside a container where  it  does  not
          have sufficient capabilities to mount a filesystem in the con‐
          tainer's mount namespace.  However, the supervisor  may  be  a
          more  privileged  process  that  does  have  sufficient  capa‐
          bilities to perform the mount operation.

       8. The supervisor then sends a response to the notification.  The
          information  in  this  response  is used by the kernel to con‐
          struct a return value for the target process's system call and
          provide a value that will be assigned to the errno variable of
          the target process.

          The  response  is  sent  using  the   SECCOMP_IOCTL_NOTIF_SEND
          ioctl(2)   operation,   which  is  used  to  transmit  a  sec‐
          comp_notif_resp  structure  to  the  kernel.   This  structure
          includes  a  cookie  value that the supervisor obtained in the
          seccomp_notif    structure    returned     by     the     SEC‐
          COMP_IOCTL_NOTIF_RECV operation.  This cookie value allows the
          kernel to associate the response with the target process.

       9. Once  the  response  has  been  sent,  the system call in the
          target  process  unblocks,  returning the information that was
          provided by the supervisor in the notification response.

       As a variation on the last two steps, the supervisor can  send  a
       response  that tells the kernel that it should execute the target
       process's   system   call;   see   the   discussion    of    SEC‐
       COMP_USER_NOTIF_FLAG_CONTINUE, below.
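
       The buffer allocation described in step 3 might be  sketched  as
       follows.   This is only an illustration under stated assumptions:
       the names seccompSys() and allocNotifBuffers()  are  hypothetical
       helpers  (not  part  of any library), and the sketch assumes that
       <sys/syscall.h> defines SYS_seccomp and that  <linux/seccomp.h>
       is recent enough to define SECCOMP_GET_NOTIF_SIZES.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/seccomp.h>

/* Hypothetical helper: invoke the seccomp() system call directly via
   syscall(2), since the C library may not provide a wrapper. */
static int
seccompSys(unsigned int operation, unsigned int flags, void *args)
{
    return (int) syscall(SYS_seccomp, operation, flags, args);
}

/* Hypothetical helper: ask the kernel for the notification structure
   sizes and allocate zeroed request and response buffers.
   Returns 0 on success, or -1 on error (with errno set). */
static int
allocNotifBuffers(struct seccomp_notif **req,
                  struct seccomp_notif_resp **resp,
                  struct seccomp_notif_sizes *sizes)
{
    size_t reqSize, respSize;

    if (seccompSys(SECCOMP_GET_NOTIF_SIZES, 0, sizes) == -1)
        return -1;

    /* Use the larger of the kernel's size and the size known to the
       headers this program was compiled against, so that accesses to
       the structure fields stay inside the buffer. */
    reqSize = sizes->seccomp_notif;
    if (reqSize < sizeof(struct seccomp_notif))
        reqSize = sizeof(struct seccomp_notif);

    respSize = sizes->seccomp_notif_resp;
    if (respSize < sizeof(struct seccomp_notif_resp))
        respSize = sizeof(struct seccomp_notif_resp);

    *req = calloc(1, reqSize);
    *resp = calloc(1, respSize);
    if (*req == NULL || *resp == NULL) {
        free(*req);
        free(*resp);
        *req = NULL;
        *resp = NULL;
        errno = ENOMEM;
        return -1;
    }

    return 0;
}
```

       A supervisor would typically call such a helper once, before  it
       starts receiving notifications, and reuse both buffers for every
       event.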

   ioctl(2) operations
       The following ioctl(2) operations are provided to support seccomp
       user-space notification.  For each of these operations, the first
       (file  descriptor)  argument  of  ioctl(2)  is the listening file
       descriptor returned by a call to seccomp(2) with the SECCOMP_FIL‐
       TER_FLAG_NEW_LISTENER flag.

       SECCOMP_IOCTL_NOTIF_RECV
              This operation is used to obtain a user-space notification
              event.  If no such event is currently pending, the  opera‐
              tion  blocks  until  an  event occurs.  The third ioctl(2)
              argument is a pointer to a structure of the following form
              which  contains  information about the event.  This struc‐
              ture must be zeroed out before the call.

                  struct seccomp_notif {
                      __u64  id;              /* Cookie */
                      __u32  pid;             /* PID of target process */
                      __u32  flags;           /* Currently unused (0) */
                      struct seccomp_data data;   /* See seccomp(2) */
                  };

              The fields in this structure are as follows:

              id     This is a cookie for the notification.   Each  such
                     cookie  is  guaranteed  to be unique for the corre‐
                     sponding seccomp  filter.   In  other  words,  this
                     cookie  is  unique for each notification event from
                     the target process.  The cookie value has the  fol‐
                     lowing uses:

                     · It     can     be     used    with    the    SEC‐
                       COMP_IOCTL_NOTIF_ID_VALID ioctl(2)  operation  to
                       verify that the target process is still alive.

                     · When  returning  a  notification  response to the
                       kernel, the supervisor must  include  the  cookie
                       value in the seccomp_notif_resp structure that is
                       specified   as   the   argument   of   the   SEC‐
                       COMP_IOCTL_NOTIF_SEND operation.

              pid    This  is  the  PID of the target process that trig‐
                     gered the notification event.

                     ┌─────────────────────────────────────────────────────┐
                     │FIXME                                                │
                     ├─────────────────────────────────────────────────────┤
                     │This is a thread ID, rather than a PID, right?       │
                     └─────────────────────────────────────────────────────┘

              flags  This is a  bit  mask  of  flags  providing  further
                     information on the event.  In the current implemen‐
                     tation, this field is always zero.

              data   This is a seccomp_data structure containing  infor‐
                     mation  about  the  system  call that triggered the
                     notification.  This is the same structure  that  is
                     passed  to  the seccomp filter.  See seccomp(2) for
                     details of this structure.

              On success, this operation returns 0; on  failure,  -1  is
              returned,  and  errno  is set to indicate the cause of the
              error.  This operation can fail with the following errors:

              EINVAL (since Linux 5.5)
                     The seccomp_notif structure that was passed to  the
                     call contained nonzero fields.

              ENOENT The  target  process  was killed by a signal as the
                     notification information was being generated.
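
              The zeroing requirement and the error return described
              above might be handled as in the following sketch.  The
              name recvNotification() is hypothetical, and retrying on
              EINTR is a design choice of this sketch, not something
              that the interface requires.

```c
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>

/* Hypothetical helper: fetch the next notification from 'notifyFd'
   into 'req', a buffer of 'reqSize' bytes (the size reported by
   SECCOMP_GET_NOTIF_SIZES). Returns 0 on success, or -1 on error
   (with errno set by ioctl()). */
static int
recvNotification(int notifyFd, struct seccomp_notif *req, size_t reqSize)
{
    for (;;) {
        /* The structure must be zeroed before every call */
        memset(req, 0, reqSize);

        if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_RECV, req) == 0)
            return 0;

        /* Retry if a signal handler interrupted the call; report any
           other error (for example, ENOENT) to the caller */
        if (errno != EINTR)
            return -1;
    }
}
```

              On return of -1 with errno set to ENOENT, a caller could
              reasonably go back to waiting for the next notification.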

       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │From my experiments,  it  appears  that  if  a  SEC‐ │
       │COMP_IOCTL_NOTIF_RECV   is  done  after  the  target │
       │process terminates, then the ioctl()  simply  blocks │
       │(rather than returning an error to indicate that the │
       │target process no longer exists).                    │
       │                                                     │
       │I found that surprising, and it required  some  con‐ │
       │tortions  in the example program.  It was not possi‐ │
       │ble to code my SIGCHLD handler (which reaps the zom‐ │
       │bie  when  the  worker/target process terminates) to │
       │simply set a flag checked in the main  handleNotifi‐ │
       │cations()  loop,  since  this created an unavoidable │
       │race where the child might terminate  just  after  I │
       │had  checked  the  flag,  but before I blocked (for‐ │
       │ever!) in  the  SECCOMP_IOCTL_NOTIF_RECV  operation. │
       │Instead,  I had to code the signal handler to simply │
       │call _exit(2)  in  order  to  terminate  the  parent │
       │process (the supervisor).                            │
       │                                                     │
       │Is  this  expected  behavior?  It seems to me rather │
       │desirable that SECCOMP_IOCTL_NOTIF_RECV should  give │
       │an error if the target process has terminated.       │
       └─────────────────────────────────────────────────────┘

       SECCOMP_IOCTL_NOTIF_ID_VALID
              This operation can be used to check that a notification ID
              returned by an earlier SECCOMP_IOCTL_NOTIF_RECV  operation
              is  still  valid  (i.e.,  that  the  target  process still
              exists).

              The third ioctl(2) argument is a  pointer  to  the  cookie
              (id) returned by the SECCOMP_IOCTL_NOTIF_RECV operation.

              This  operation is necessary to avoid race conditions that
              can  occur   when   the   pid   returned   by   the   SEC‐
              COMP_IOCTL_NOTIF_RECV   operation   terminates,  and  that
              process ID is reused by another process.   An  example  of
              this kind of race is the following:

              1. A  notification  is  generated  on  the  listening file
                 descriptor.  The returned  seccomp_notif  contains  the
                 PID of the target process.

              2. The target process terminates.

              3. Another process is created on the system that by chance
                 reuses the PID that was freed when the  target  process
                 terminates.

              4. The  supervisor  open(2)s  the /proc/[pid]/mem file for
                 the PID obtained in step 1, with the intention of (say)
                  inspecting the memory locations that contain the  argu‐
                  ments of the system call that triggered  the  notifica‐
                  tion in step 1.

              In the above scenario, the risk is that the supervisor may
              try to access the memory of a process other than the  tar‐
              get.   This  race  can be avoided by following the call to
              open with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to ver‐
              ify  that  the  process that generated the notification is
              still alive.  (Note that  if  the  target  process  subse‐
              quently  terminates, its PID won't be reused because there
              remains an open reference to the /proc/[pid]/mem file;  in
              this  case, a subsequent read(2) from the file will return
              0, indicating end of file.)

              On success (i.e., the notification  ID  is  still  valid),
              this operation returns 0.  On failure (i.e., the notifica‐
              tion ID is no longer valid), -1 is returned, and errno  is
              set to ENOENT.
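
              The open-then-validate sequence described above might be
              sketched as follows.  The name readTargetString() is
              hypothetical, and the policy of reading at most len - 1
              bytes and requiring a terminating null byte within them
              is an assumption of this sketch, not a rule of the
              interface.

```c
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>

/* Hypothetical helper: copy the null-terminated string located at
   address 'addr' in the target's memory into 'buf' (of size 'len').
   Returns 0 on success, or -1 on failure. */
static int
readTargetString(int notifyFd, const struct seccomp_notif *req,
                 __u64 addr, char *buf, size_t len)
{
    char path[PATH_MAX];
    __u64 id = req->id;
    ssize_t nread;
    int memFd;

    if (len == 0)
        return -1;

    snprintf(path, sizeof(path), "/proc/%u/mem", req->pid);
    memFd = open(path, O_RDONLY | O_CLOEXEC);
    if (memFd == -1)
        return -1;

    /* Only after the open() do we check that the notification ID is
       still valid; if it is not, the file may refer to the memory of
       an unrelated process and must not be read. */
    if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == -1) {
        close(memFd);
        return -1;
    }

    /* A short read is tolerated, since the string may lie close to
       the end of a mapping */
    nread = pread(memFd, buf, len - 1, (off_t) addr);
    close(memFd);
    if (nread <= 0)
        return -1;
    buf[nread] = '\0';

    /* Require that the terminator was among the bytes actually read */
    return (memchr(buf, '\0', nread) != NULL) ? 0 : -1;
}
```

              For a pathname argument, a caller might pass a buffer of
              PATH_MAX bytes and the relevant element of
              req->data.args as addr.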

       SECCOMP_IOCTL_NOTIF_SEND
              This  operation  is  used  to send a notification response
              back to the kernel.  The third ioctl(2) argument is a pointer
              to a structure of the following form:

                  struct seccomp_notif_resp {
                      __u64 id;               /* Cookie value */
                      __s64 val;              /* Success return value */
                      __s32 error;            /* 0 (success) or negative
                                                 error number */
                      __u32 flags;            /* See below */
                  };

              The fields of this structure are as follows:

              id     This is the cookie value that  was  obtained  using
                     the   SECCOMP_IOCTL_NOTIF_RECV   operation.    This
                     cookie value allows the kernel to  correctly  asso‐
                     ciate this response with the system call that trig‐
                     gered the user-space notification.

              val    This is the value that will be used for  a  spoofed
                     success  return  for  the  target  process's system
                     call; see below.

              error  This is the value that will be used  as  the  error
                     number  (errno)  for a spoofed error return for the
                     target process's system call; see below.

              flags  This is a bit mask that includes zero  or  more  of
                     the following flags:

                     SECCOMP_USER_NOTIF_FLAG_CONTINUE (since Linux 5.5)
                            Tell   the  kernel  to  execute  the  target
                            process's system call.

              Two kinds of response are possible:

              · A response to the kernel telling it to execute the  tar‐
                get  process's  system  call.   In  this case, the flags
                field includes SECCOMP_USER_NOTIF_FLAG_CONTINUE and  the
                error and val fields must be zero.

                This  kind  of response can be useful in cases where the
                supervisor needs to do deeper analysis of  the  target's
                system  call  than  is  possible  from  a seccomp filter
                (e.g., examining the values of pointer arguments),  and,
                having  verified that the system call is acceptable, the
                supervisor wants to allow it to proceed.

              · A spoofed return value for the target  process's  system
                call.   In  this  case,  the kernel does not execute the
                target process's system call, instead causing the system
                call to return a spoofed value as specified by fields of
                the seccomp_notif_resp structure.  The supervisor should
                set the fields of this structure as follows:

                +  flags  does  not contain SECCOMP_USER_NOTIF_FLAG_CON‐
                   TINUE.

                +  error is set either to  0  for  a  spoofed  "success"
                   return  or  to  a negative error number for a spoofed
                   "failure" return.  In the  former  case,  the  kernel
                   causes the target process's system call to return the
                   value specified in the val field.  In the latter case,
                   the kernel causes the target process's system call to
                   return -1, and errno is assigned  the  negated  error
                   value.

                +  val is set to a value that will be used as the return
                   value for a spoofed "success" return for  the  target
                   process's  system  call.   The value in this field is
                   ignored if the error field contains a nonzero value.

              On success, this operation returns 0; on  failure,  -1  is
              returned,  and  errno  is set to indicate the cause of the
              error.  This operation can fail with the following errors:

              EINPROGRESS
                     A response to this notification  has  already  been
                     sent.

              EINVAL An invalid value was specified in the flags field.

              EINVAL The       flags      field      contained      SEC‐
                     COMP_USER_NOTIF_FLAG_CONTINUE, and the error or val
                     field was not zero.

              ENOENT The  blocked  system call in the target process has
                     been interrupted by a signal handler.
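
              The two kinds of response described above might be
              prepared as in the following sketch.  The helper names
              respContinue(), respSuccess(), and respError() are
              hypothetical.  The fallback definition of the CONTINUE
              flag is an assumption for building against headers that
              predate Linux 5.5.

```c
#include <string.h>
#include <linux/seccomp.h>

#ifndef SECCOMP_USER_NOTIF_FLAG_CONTINUE
#define SECCOMP_USER_NOTIF_FLAG_CONTINUE (1UL << 0)
#endif

/* Hypothetical helper: tell the kernel to execute the target's system
   call; 'val' and 'error' must remain zero in this case. */
static void
respContinue(struct seccomp_notif_resp *resp, __u64 id)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
}

/* Hypothetical helper: spoof a "success" return of 'val' */
static void
respSuccess(struct seccomp_notif_resp *resp, __u64 id, __s64 val)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->val = val;
}

/* Hypothetical helper: spoof a "failure" return. 'err' is a positive
   errno value (e.g., EPERM); the structure carries its negation. */
static void
respError(struct seccomp_notif_resp *resp, __u64 id, int err)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->error = -err;
}
```

              After one of these helpers has filled in the structure,
              the supervisor would pass it to ioctl(2) with the
              SECCOMP_IOCTL_NOTIF_SEND operation, using the id taken
              from the corresponding seccomp_notif structure.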

NOTES
       The file descriptor returned when seccomp(2) is employed with the
       SECCOMP_FILTER_FLAG_NEW_LISTENER  flag  can  be  monitored  using
       poll(2), epoll(7), and select(2).  When a notification  is  pend‐
       ing,  these interfaces indicate that the file descriptor is read‐
       able.
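
       Monitoring the listening file descriptor with  poll(2)  might  be
       sketched  as  follows.  The name waitForNotification() is hypo‐
       thetical, and treating any result other than POLLIN as an  error
       is  a  choice  made  for this sketch.  Waiting with a timeout is
       one conceivable way for a supervisor to avoid blocking  indefi‐
       nitely; the sketch does not claim to cover every race.

```c
#include <errno.h>
#include <poll.h>

/* Hypothetical helper: wait up to 'timeoutMs' milliseconds (-1 means
   wait indefinitely) for a notification to become pending.
   Returns 1 if 'notifyFd' is readable, 0 on timeout, or -1 on error
   or if poll() reported only some other condition. */
static int
waitForNotification(int notifyFd, int timeoutMs)
{
    struct pollfd pfd = { .fd = notifyFd, .events = POLLIN };

    for (;;) {
        int nready = poll(&pfd, 1, timeoutMs);

        /* Restart (with a fresh timeout) if interrupted by a signal */
        if (nready == -1 && errno == EINTR)
            continue;

        if (nready <= 0)
            return nready;      /* 0: timeout; -1: error */

        return (pfd.revents & POLLIN) ? 1 : -1;
    }
}
```

       When the function returns 1, the supervisor could  proceed  to  a
       SECCOMP_IOCTL_NOTIF_RECV operation; on a return of 0 it could do
       other work (such as checking whether the target is still  alive)
       before waiting again.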

       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │Interestingly, after the event  had  been  received, │
       │the  file descriptor indicates as writable (verified │
       │from the source code and by experiment). How is this │
       │useful?                                              │
       └─────────────────────────────────────────────────────┘

EXAMPLES
       The (somewhat contrived) program shown below demonstrates the use
       of the interfaces described in this page.  The program creates  a
       child  process  that  serves  as the "target" process.  The child
       process  installs  a  seccomp  filter  that  returns   the   SEC‐
       COMP_RET_USER_NOTIF  action  value if a call is made to mkdir(2).
       The child process then calls mkdir(2) once for each of  the  sup‐
       plied  command-line arguments, and reports the result returned by
       the call.  After processing all arguments, the child process ter‐
       minates.

       The  parent  process  acts  as  the supervisor, listening for the
       notifications that are generated when the  target  process  calls
       mkdir(2).   When such a notification occurs, the supervisor exam‐
       ines the memory of the target process (using /proc/[pid]/mem)  to
       discover  the pathname argument that was supplied to the mkdir(2)
       call, and performs one of the following actions:

       · If the pathname begins with the prefix "/tmp/", then the super‐
         visor  attempts  to  create  the  specified directory, and then
         spoofs a return for the target  process  based  on  the  return
         value  of  the  supervisor's  mkdir(2) call.  In the event that
         that call succeeds, the spoofed success  return  value  is  the
         length of the pathname.

       · If  the pathname begins with "./" (i.e., it is a relative path‐
         name), the supervisor sends a  SECCOMP_USER_NOTIF_FLAG_CONTINUE
         response to the kernel to say that the kernel should execute the
         target process's mkdir(2) call.

       · If the pathname begins with some other prefix,  the  supervisor
         spoofs an error return for the target process, so that the tar‐
         get process's mkdir(2) call appears to fail with the error EOP‐
         NOTSUPP  ("Operation  not  supported").   Additionally,  if the
         specified pathname is exactly "/bye", then the supervisor  ter‐
         minates.
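
       The three rules above amount to a small  decision  function.   A
       minimal  sketch follows; the enum and the name classifyPath() are
       hypothetical and are not taken from the program source shown  in
       this page.

```c
#include <string.h>

/* Hypothetical names for the actions listed in the text */
enum pathAction {
    ACT_SUPERVISOR_MKDIR,   /* "/tmp/..." : supervisor creates the dir */
    ACT_CONTINUE,           /* "./..."    : kernel executes the call */
    ACT_DENY,               /* otherwise  : spoof EOPNOTSUPP */
    ACT_DENY_AND_EXIT       /* "/bye"     : spoof EOPNOTSUPP, then the
                               supervisor terminates */
};

/* Hypothetical helper: map a pathname to the supervisor's action */
static enum pathAction
classifyPath(const char *path)
{
    if (strncmp(path, "/tmp/", strlen("/tmp/")) == 0)
        return ACT_SUPERVISOR_MKDIR;

    if (strncmp(path, "./", strlen("./")) == 0)
        return ACT_CONTINUE;

    if (strcmp(path, "/bye") == 0)
        return ACT_DENY_AND_EXIT;

    return ACT_DENY;
}
```

       Note that under these rules a pathname such as "/tmp"  (with  no
       trailing slash) does not match the first rule and is denied.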

       This program can be used to demonstrate various aspects  of  the
       behavior of the seccomp user-space  notification  mechanism.   To
       aid such demonstrations, the program logs various messages
       to show the operation of the target process (lines prefixed "T:")
       and the supervisor (indented lines prefixed "S:").

       In  the  following  example,  the  target  attempts to create the
       directory /tmp/x.  Upon receiving the notification, the  supervi‐
       sor  creates  the  directory on the target's behalf, and spoofs a
       success return to be received by the  target  process's  mkdir(2)
       call.

           $ ./seccomp_unotify /tmp/x
           T: PID = 23168

           T: about to mkdir("/tmp/x")
                   S: got notification (ID 0x17445c4a0f4e0e3c) for PID 23168
                   S: executing: mkdir("/tmp/x", 0700)
                   S: success! spoofed return = 6
                   S: sending response (flags = 0; val = 6; error = 0)
           T: SUCCESS: mkdir(2) returned 6

           T: terminating
                   S: target has terminated; bye

       In  the  above output, note that the spoofed return value seen by
       the target process is 6 (the  length  of  the  pathname  /tmp/x),
       whereas a normal mkdir(2) call returns 0 on success.

       In  the  next  example, the target attempts to create a directory
       using the relative pathname ./sub.  Since  this  pathname  starts
       with  "./",  the  supervisor sends a SECCOMP_USER_NOTIF_FLAG_CON‐
       TINUE response to the kernel, and the kernel then  (successfully)
       executes the target process's mkdir(2) call.

           $ ./seccomp_unotify ./sub
           T: PID = 23204

           T: about to mkdir("./sub")
                   S: got notification (ID 0xddb16abe25b4c12) for PID 23204
                   S: target can execute system call
                   S: sending response (flags = 0x1; val = 0; error = 0)
           T: SUCCESS: mkdir(2) returned 0

           T: terminating
                   S: target has terminated; bye

       If the target process attempts to create a directory with a path‐
       name that doesn't start with "./" and doesn't begin with the pre‐
       fix  "/tmp/", then the supervisor spoofs an error return (EOPNOT‐
       SUPP, "Operation not supported") for the target's  mkdir(2)  call
       (which is not executed):

           $ ./seccomp_unotify /xxx
           T: PID = 23178

           T: about to mkdir("/xxx")
                   S: got notification (ID 0xe7dc095d1c524e80) for PID 23178
                   S: spoofing error response (Operation not supported)
                   S: sending response (flags = 0; val = 0; error = -95)
           T: ERROR: mkdir(2): Operation not supported

           T: terminating
                   S: target has terminated; bye

       In  the  next  example,  the  target process attempts to create a
       directory with the pathname /tmp/nosuchdir/b.  Upon receiving the
       notification,  the  supervisor attempts to create that directory,
       but the mkdir(2) call fails because the directory  /tmp/nosuchdir
       does  not  exist.   Consequently,  the supervisor spoofs an error
       return that passes the error that it received back to the  target
       process's mkdir(2) call.

           $ ./seccomp_unotify /tmp/nosuchdir/b
           T: PID = 23199

           T: about to mkdir("/tmp/nosuchdir/b")
                   S: got notification (ID 0x8744454293506046) for PID 23199
                   S: executing: mkdir("/tmp/nosuchdir/b", 0700)
                   S: failure! (errno = 2; No such file or directory)
                   S: sending response (flags = 0; val = 0; error = -2)
           T: ERROR: mkdir(2): No such file or directory

           T: terminating
                   S: target has terminated; bye

       If the supervisor receives a notification and sees that the argu‐
       ment of the target's mkdir(2) is the string "/bye", then (as well
       as  spoofing an EOPNOTSUPP error), the supervisor terminates.  If
       the target process subsequently executes  another  mkdir(2)  that
       triggers  its seccomp filter to return the SECCOMP_RET_USER_NOTIF
       action value, then the kernel causes the target process's  system
       call  to fail with the error ENOSYS ("Function not implemented").
       This is demonstrated by the following example:

           $ ./seccomp_unotify /bye /tmp/y
           T: PID = 23185

           T: about to mkdir("/bye")
                   S: got notification (ID 0xa81236b1d2f7b0f4) for PID 23185
                   S: spoofing error response (Operation not supported)
                   S: sending response (flags = 0; val = 0; error = -95)
                   S: terminating **********
           T: ERROR: mkdir(2): Operation not supported

           T: about to mkdir("/tmp/y")
           T: ERROR: mkdir(2): Function not implemented

           T: terminating

   Program source
       #define _GNU_SOURCE
       #include <sys/types.h>
       #include <sys/prctl.h>
       #include <fcntl.h>
       #include <limits.h>
       #include <signal.h>
       #include <stddef.h>
       #include <stdint.h>
       #include <stdbool.h>
       #include <linux/audit.h>
       #include <sys/syscall.h>
       #include <sys/stat.h>
       #include <linux/filter.h>
       #include <linux/seccomp.h>
       #include <sys/ioctl.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <unistd.h>
       #include <errno.h>
       #include <sys/socket.h>
       #include <sys/un.h>

       #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                               } while (0)

       /* Send the file descriptor 'fd' over the connected UNIX domain socket
          'sockfd'. Returns 0 on success, or -1 on error. */

       static int
       sendfd(int sockfd, int fd)
       {
           struct msghdr msgh;
           struct iovec iov;
           int data;
           struct cmsghdr *cmsgp;

           /* Allocate a char array of suitable size to hold the ancillary data.
              However, since this buffer is in reality a 'struct cmsghdr', use a
              union to ensure that it is suitably aligned. */
           union {
               char   buf[CMSG_SPACE(sizeof(int))];
                               /* Space large enough to hold an 'int' */
               struct cmsghdr align;
           } controlMsg;

           /* The 'msg_name' field can be used to specify the address of the
              destination socket when sending a datagram. However, we do not
              need to use this field because 'sockfd' is a connected socket. */

           msgh.msg_name = NULL;
           msgh.msg_namelen = 0;

           /* On Linux, we must transmit at least one byte of real data in
              order to send ancillary data. We transmit an arbitrary integer
              whose value is ignored by recvfd(). */

           msgh.msg_iov = &iov;
           msgh.msg_iovlen = 1;
           iov.iov_base = &data;
           iov.iov_len = sizeof(int);
           data = 12345;

           /* Set 'msghdr' fields that describe ancillary data */

           msgh.msg_control = controlMsg.buf;
           msgh.msg_controllen = sizeof(controlMsg.buf);

           /* Set up ancillary data describing file descriptor to send */

           cmsgp = CMSG_FIRSTHDR(&msgh);
           cmsgp->cmsg_level = SOL_SOCKET;
           cmsgp->cmsg_type = SCM_RIGHTS;
           cmsgp->cmsg_len = CMSG_LEN(sizeof(int));
           memcpy(CMSG_DATA(cmsgp), &fd, sizeof(int));

           /* Send real plus ancillary data */

           if (sendmsg(sockfd, &msgh, 0) == -1)
               return -1;

           return 0;
       }

       /* Receive a file descriptor on a connected UNIX domain socket. Returns
          the received file descriptor on success, or -1 on error. */

       static int
       recvfd(int sockfd)
       {
           struct msghdr msgh;
           struct iovec iov;
           int data, fd;
           ssize_t nr;

           /* Allocate a char buffer for the ancillary data. See the comments
              in sendfd() */
           union {
               char   buf[CMSG_SPACE(sizeof(int))];
               struct cmsghdr align;
           } controlMsg;
           struct cmsghdr *cmsgp;

           /* The 'msg_name' field can be used to obtain the address of the
              sending socket. However, we do not need this information. */

           msgh.msg_name = NULL;
           msgh.msg_namelen = 0;

           /* Specify buffer for receiving real data */

           msgh.msg_iov = &iov;
           msgh.msg_iovlen = 1;
           iov.iov_base = &data;       /* Real data is an 'int' */
           iov.iov_len = sizeof(int);

           /* Set 'msghdr' fields that describe ancillary data */

           msgh.msg_control = controlMsg.buf;
           msgh.msg_controllen = sizeof(controlMsg.buf);

           /* Receive real plus ancillary data; real data is ignored */

           nr = recvmsg(sockfd, &msgh, 0);
           if (nr == -1)
               return -1;

           cmsgp = CMSG_FIRSTHDR(&msgh);

           /* Check the validity of the 'cmsghdr' */

           if (cmsgp == NULL ||
                   cmsgp->cmsg_len != CMSG_LEN(sizeof(int)) ||
                   cmsgp->cmsg_level != SOL_SOCKET ||
                   cmsgp->cmsg_type != SCM_RIGHTS) {
               errno = EINVAL;
               return -1;
           }

           /* Return the received file descriptor to our caller */

           memcpy(&fd, CMSG_DATA(cmsgp), sizeof(int));
           return fd;
       }

       static void
       sigchldHandler(int sig)
       {
           char *msg  = "\tS: target has terminated; bye\n";

           write(STDOUT_FILENO, msg, strlen(msg));
           _exit(EXIT_SUCCESS);
       }

       static int
       seccomp(unsigned int operation, unsigned int flags, void *args)
       {
           return syscall(__NR_seccomp, operation, flags, args);
       }

       /* The following is the x86-64-specific BPF boilerplate code for checking
          that the BPF program is running on the right architecture + ABI. At
          completion of these instructions, the accumulator contains the system
          call number. */

       /* For the x32 ABI, all system call numbers have bit 30 set */

       #define X32_SYSCALL_BIT         0x40000000

       #define X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR \
               BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \
                       (offsetof(struct seccomp_data, arch))), \
               BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 2), \
               BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \
                        (offsetof(struct seccomp_data, nr))), \
               BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, X32_SYSCALL_BIT, 0, 1), \
               BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS)

       /* installNotifyFilter() installs a seccomp filter that generates
          user-space notifications (SECCOMP_RET_USER_NOTIF) when the process
          calls mkdir(2); the filter allows all other system calls.

          The function return value is a file descriptor from which the
          user-space notifications can be fetched. */

       static int
       installNotifyFilter(void)
       {
           struct sock_filter filter[] = {
               X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR,

               /* mkdir() triggers notification to user-space supervisor */

               BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mkdir, 0, 1),
               BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),

               /* Every other system call is allowed */

               BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
           };

           struct sock_fprog prog = {
               .len = sizeof(filter) / sizeof(filter[0]),
               .filter = filter,
           };

           /* Install the filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER flag;
              as a result, seccomp() returns a notification file descriptor. */

           int notifyFd = seccomp(SECCOMP_SET_MODE_FILTER,
                                  SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
           if (notifyFd == -1)
               errExit("seccomp-install-notify-filter");

           return notifyFd;
       }

       /* Close a pair of sockets created by socketpair() */

       static void
       closeSocketPair(int sockPair[2])
       {
           if (close(sockPair[0]) == -1)
               errExit("closeSocketPair-close-0");
           if (close(sockPair[1]) == -1)
               errExit("closeSocketPair-close-1");
       }

       /* Implementation of the target process; create a child process that:

          (1) installs a seccomp filter with the
              SECCOMP_FILTER_FLAG_NEW_LISTENER flag;
          (2) writes the seccomp notification file descriptor returned from
              the previous step onto the UNIX domain socket, 'sockPair[0]';
          (3) calls mkdir(2) for each element of 'argv'.

          The function return value in the parent is the PID of the child
          process; the child does not return from this function. */

       static pid_t
       targetProcess(int sockPair[2], char *argv[])
       {
           pid_t targetPid = fork();
           if (targetPid == -1)
               errExit("fork");

           if (targetPid > 0)          /* In parent, return PID of child */
               return targetPid;

           /* Child falls through to here */

           printf("T: PID = %ld\n", (long) getpid());

           /* Install seccomp filter(s) */

           if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
               errExit("prctl");

           int notifyFd = installNotifyFilter();

           /* Pass the notification file descriptor to the supervisor process
              over a UNIX domain socket */

           if (sendfd(sockPair[0], notifyFd) == -1)
               errExit("sendfd");

           /* Notification and socket FDs are no longer needed in target */

           if (close(notifyFd) == -1)
               errExit("close-target-notify-fd");

           closeSocketPair(sockPair);

           /* Perform a mkdir() call for each of the command-line arguments */

           for (char **ap = argv; *ap != NULL; ap++) {
               printf("\nT: about to mkdir(\"%s\")\n", *ap);

               int s = mkdir(*ap, 0700);
               if (s == -1)
                   perror("T: ERROR: mkdir(2)");
               else
                   printf("T: SUCCESS: mkdir(2) returned %d\n", s);
           }

           printf("\nT: terminating\n");
           exit(EXIT_SUCCESS);
       }

       /* Check that the notification ID provided by a SECCOMP_IOCTL_NOTIF_RECV
          operation is still valid. It will no longer be valid if the process
          has terminated. This operation can be used when accessing /proc/PID
          files in the target process in order to avoid TOCTOU race conditions
          where the PID that is returned by SECCOMP_IOCTL_NOTIF_RECV terminates
          and is reused by another process. */

       static void
       checkNotificationIdIsValid(int notifyFd, uint64_t id)
       {
           if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == -1) {
               fprintf(stderr, "\tS: notification ID check: "
                       "target has terminated!!!\n");

               exit(EXIT_FAILURE);
           }
       }

       /* Access the memory of the target process in order to discover the
          pathname that was given to mkdir() */

       static void
       getTargetPathname(struct seccomp_notif *req, int notifyFd,
                         char *path, size_t len)
       {
           char procMemPath[PATH_MAX];
           snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", req->pid);

           int procMemFd = open(procMemPath, O_RDONLY);
           if (procMemFd == -1)
               errExit("Supervisor: open");

           /* Check that the process whose info we are accessing is still alive.
              If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed
              in checkNotificationIdIsValid()) succeeds, we know that the
              /proc/PID/mem file descriptor that we opened corresponds to the
              process for which we received a notification. If that process
              subsequently terminates, then read() on that file descriptor
              will return 0 (EOF). */

           checkNotificationIdIsValid(notifyFd, req->id);

           /* Seek to the location containing the pathname argument (i.e., the
              first argument) of the mkdir(2) call and read that pathname */

           if (lseek(procMemFd, req->data.args[0], SEEK_SET) == -1)
               errExit("Supervisor: lseek");

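           ssize_t s = read(procMemFd, path, len);
           if (s == -1)
               errExit("read");

           if (s == 0) {
               fprintf(stderr, "\tS: read() of /proc/PID/mem "
                       "returned 0 (EOF)\n");
               exit(EXIT_FAILURE);
           }

           /* The target's memory is untrusted input: make sure that the
              bytes we read contain a terminating null byte */

           if (memchr(path, '\0', s) == NULL) {
               fprintf(stderr, "\tS: pathname is not null-terminated\n");
               exit(EXIT_FAILURE);
           }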

           if (close(procMemFd) == -1)
               errExit("close-/proc/PID/mem");
       }

       /* Handle notifications that arrive via the SECCOMP_RET_USER_NOTIF file
          descriptor, 'notifyFd'. */

       static void
       handleNotifications(int notifyFd)
       {
           struct seccomp_notif_sizes sizes;
           char path[PATH_MAX];
               /* For simplicity, we assume that the pathname given to mkdir()
                  is no more than PATH_MAX bytes; but this might not be true. */

           /* Discover the sizes of the structures that are used to receive
              notifications and send notification responses, and allocate
              buffers of those sizes. */

           if (seccomp(SECCOMP_GET_NOTIF_SIZES, 0, &sizes) == -1)
               errExit("\tS: seccomp-SECCOMP_GET_NOTIF_SIZES");

           struct seccomp_notif *req = malloc(sizes.seccomp_notif);
           if (req == NULL)
               errExit("\tS: malloc");

           struct seccomp_notif_resp *resp = malloc(sizes.seccomp_notif_resp);
           if (resp == NULL)
               errExit("\tS: malloc");

           /* Loop handling notifications */

           for (;;) {
               /* Wait for next notification, returning info in '*req' */

               memset(req, 0, sizes.seccomp_notif);
               if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_RECV, req) == -1) {
                   if (errno == EINTR)
                       continue;
                   errExit("Supervisor: ioctl-SECCOMP_IOCTL_NOTIF_RECV");
               }

               printf("\tS: got notification (ID %#llx) for PID %d\n",
                       req->id, req->pid);

               /* The only system call that can generate a notification event
                  is mkdir(2). Nevertheless, we check that the notified system
                  call is indeed mkdir() as kind of future-proofing of this
                  code in case the seccomp filter is later modified to
                  generate notifications for other system calls. */

               if (req->data.nr != __NR_mkdir) {
                   printf("\tS: notification contained unexpected "
                           "system call number; bye!!!\n");
                   exit(EXIT_FAILURE);
               }

               getTargetPathname(req, notifyFd, path, sizeof(path));

               /* Prepopulate some fields of the response */

               resp->id = req->id;     /* Response includes notification ID */
               resp->flags = 0;
               resp->val = 0;

               /* If the directory is in /tmp, then the supervisor creates it
                  on behalf of the target; if the pathname starts with "./",
                  tell the kernel to let the target process execute the
                  mkdir(); otherwise, spoof an error for a directory
                  pathname in any other location. */

               if (strncmp(path, "/tmp/", strlen("/tmp/")) == 0) {
                   printf("\tS: executing: mkdir(\"%s\", %#llo)\n",
                           path, req->data.args[1]);

                   if (mkdir(path, req->data.args[1]) == 0) {
                       resp->error = 0;            /* "Success" */
                       resp->val = strlen(path);   /* Used as return value of
                                                      mkdir() in target */
                       printf("\tS: success! spoofed return = %lld\n",
                               resp->val);
                   } else {

                       /* If mkdir() failed in the supervisor, pass the error
                          back to the target */

                       resp->error = -errno;
                       printf("\tS: failure! (errno = %d; %s)\n", errno,
                               strerror(errno));
                   }
               } else if (strncmp(path, "./", strlen("./")) == 0) {
                   resp->error = resp->val = 0;
                   resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
                   printf("\tS: target can execute system call\n");
               } else {
                   resp->error = -EOPNOTSUPP;
                   printf("\tS: spoofing error response (%s)\n",
                           strerror(-resp->error));
               }

               /* Send a response to the notification */

               printf("\tS: sending response "
                       "(flags = %#x; val = %lld; error = %d)\n",
                       resp->flags, resp->val, resp->error);

               if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp) == -1) {
                   if (errno == ENOENT)
                       printf("\tS: response failed with ENOENT; "
                               "perhaps target process's syscall was "
                               "interrupted by signal?\n");
                   else
                       perror("ioctl-SECCOMP_IOCTL_NOTIF_SEND");
               }

               /* If the pathname is just "/bye", then the supervisor
                  terminates. This allows us to see what happens if the
                  target process makes further calls to mkdir(2). */

               if (strcmp(path, "/bye") == 0) {
                   printf("\tS: terminating **********\n");
                   exit(EXIT_FAILURE);
               }
           }
       }

       /* Implementation of the supervisor process:

          (1) obtains the notification file descriptor from 'sockPair[1]'
          (2) handles notifications that arrive on that file descriptor. */

       static void
       supervisor(int sockPair[2])
       {
           int notifyFd = recvfd(sockPair[1]);
           if (notifyFd == -1)
               errExit("recvfd");

           closeSocketPair(sockPair);  /* We no longer need the socket pair */

           handleNotifications(notifyFd);
       }

       int
       main(int argc, char *argv[])
       {
           int sockPair[2];

           setbuf(stdout, NULL);

           if (argc < 2) {
               fprintf(stderr, "At least one pathname argument is required\n");
               exit(EXIT_FAILURE);
           }

           /* Create a UNIX domain socket that is used to pass the seccomp
              notification file descriptor from the target process to the
              supervisor process. */

           if (socketpair(AF_UNIX, SOCK_STREAM, 0, sockPair) == -1)
               errExit("socketpair");

           /* Create a child process--the "target"--that installs seccomp
              filtering. The target process writes the seccomp notification
              file descriptor onto 'sockPair[0]' and then calls mkdir(2) for
              each directory in the command-line arguments. */

           (void) targetProcess(sockPair, &argv[optind]);

           /* Catch SIGCHLD when the target terminates, so that the
              supervisor can also terminate. */

           struct sigaction sa;
           sa.sa_handler = sigchldHandler;
           sa.sa_flags = 0;
           sigemptyset(&sa.sa_mask);
           if (sigaction(SIGCHLD, &sa, NULL) == -1)
               errExit("sigaction");

           supervisor(sockPair);

           exit(EXIT_SUCCESS);
       }

SEE ALSO
       ioctl(2), seccomp(2)


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

* Re: For review: seccomp_user_notif(2) manual page
  2020-09-30 11:07 For review: seccomp_user_notif(2) manual page Michael Kerrisk (man-pages)
@ 2020-09-30 15:03 ` Tycho Andersen
  2020-09-30 15:11   ` Tycho Andersen
  2020-09-30 20:34   ` Michael Kerrisk (man-pages)
  2020-09-30 15:53 ` Jann Horn
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 52+ messages in thread
From: Tycho Andersen @ 2020-09-30 15:03 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Sargun Dhillon, Kees Cook, Christian Brauner, linux-man, lkml,
	Aleksa Sarai, Jann Horn, Alexei Starovoitov, wad, bpf, Song Liu,
	Daniel Borkmann, Andy Lutomirski, Linux Containers,
	Giuseppe Scrivano, Robert Sesek

On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
>        2. In order that the supervisor process can obtain  notifications
>           using  the  listening  file  descriptor, (a duplicate of) that
>           file descriptor must be passed from the target process to  the
>           supervisor process.  One way in which this could be done is by
>           passing the file descriptor over a UNIX domain socket  connec‐
>           tion between the two processes (using the SCM_RIGHTS ancillary
>           message type described in unix(7)).   Another  possibility  is
>           that  the  supervisor  might  inherit  the file descriptor via
>           fork(2).

It is technically possible to inherit the fd via fork, but is it
really that useful? The child process wouldn't be able to actually do
the syscall in question, since it would have the same filter.

>           The  information  in  the notification can be used to discover
>           the values of pointer arguments for the target process's  sys‐
>           tem call.  (This is something that can't be done from within a
>           seccomp filter.)  To do this (and  assuming  it  has  suitable

s/To do this/One way to accomplish this/ perhaps, since there are
others.

>           permissions),   the   supervisor   opens   the   corresponding
>           /proc/[pid]/mem file, seeks to the memory location that corre‐
>           sponds to one of the pointer arguments whose value is supplied
>           in the notification event, and reads bytes from that location.
>           (The supervisor must be careful to avoid a race condition that
>           can occur when doing this; see the  description  of  the  SEC‐
>           COMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.)  In addi‐
>           tion, the supervisor can access other system information  that
>           is  visible  in  user space but which is not accessible from a
>           seccomp filter.
> 
>           ┌─────────────────────────────────────────────────────┐
>           │FIXME                                                │
>           ├─────────────────────────────────────────────────────┤
>           │Suppose we are reading a pathname from /proc/PID/mem │
>           │for  a system call such as mkdir(). The pathname can │
>           │be an arbitrary length. How do we know how much (how │
>           │many pages) to read from /proc/PID/mem?              │
>           └─────────────────────────────────────────────────────┘

PATH_MAX, I suppose.
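
As a minimal sketch of the PATH_MAX approach (readTargetString() is a
hypothetical helper, not part of the page's example program), the
supervisor could read at most a caller-chosen number of bytes and treat
a missing terminator as an error, since the contents of the target's
memory are untrusted:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>          /* open(), used by callers to obtain 'memFd' */
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read a null-terminated string from offset 'addr'
   of 'memFd' (for example, an open /proc/PID/mem file descriptor) into
   'buf', reading at most 'len' bytes including the terminator.

   Returns the string length on success. Returns -1 on error, with errno
   set to ENAMETOOLONG if no terminator was found within 'len' bytes, or
   EIO if end-of-file was reached before a terminator was seen. */

static ssize_t
readTargetString(int memFd, off_t addr, char *buf, size_t len)
{
    size_t total = 0;

    assert(len > 0);

    while (total < len) {
        ssize_t nr = pread(memFd, buf + total, len - total,
                           addr + (off_t) total);
        if (nr == -1) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (nr == 0) {          /* EOF before a terminator was seen */
            errno = EIO;
            return -1;
        }

        /* Stop as soon as the newly read bytes contain the terminator */

        char *nul = memchr(buf + total, '\0', nr);
        if (nul != NULL)
            return nul - buf;

        total += nr;
    }

    errno = ENAMETOOLONG;       /* No terminator within 'len' bytes */
    return -1;
}
```

A caller would typically pass a buffer of PATH_MAX bytes, and would still
need to perform the SECCOMP_IOCTL_NOTIF_ID_VALID check between opening
/proc/PID/mem and reading from it.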

>        ┌─────────────────────────────────────────────────────┐
>        │FIXME                                                │
>        ├─────────────────────────────────────────────────────┤
>        │From my experiments,  it  appears  that  if  a  SEC‐ │
>        │COMP_IOCTL_NOTIF_RECV   is  done  after  the  target │
>        │process terminates, then the ioctl()  simply  blocks │
>        │(rather than returning an error to indicate that the │
>        │target process no longer exists).                    │

Yeah, I think Christian wanted to fix this at some point, but it's a
bit sticky to do. Note that if you e.g. rely on fork() above, the
filter is shared with your current process, and this notification
would never be possible. Perhaps another reason to omit that from the
man page.
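
One way a supervisor might avoid blocking indefinitely is to poll(2) the
notification file descriptor with a timeout before calling
SECCOMP_IOCTL_NOTIF_RECV. The sketch below assumes only that the
descriptor becomes readable when a notification is pending;
waitForInput() is a hypothetical name, and whether a hang-up is reported
once the target has exited depends on the kernel version and is not
relied upon here:

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: wait up to 'timeoutMs' milliseconds for 'fd' to
   become readable. Returns 1 if input is available, 0 if the timeout
   expired, and -1 on error or if poll() reported only POLLERR, POLLHUP,
   or POLLNVAL for the descriptor. */

static int
waitForInput(int fd, int timeoutMs)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    assert(timeoutMs >= 0);

    for (;;) {
        int ready = poll(&pfd, 1, timeoutMs);
        if (ready == -1) {
            if (errno == EINTR)
                continue;       /* Simplification: restarts the full timeout */
            return -1;
        }
        if (ready == 0)
            return 0;           /* Timed out; caller can check on the target */
        if (pfd.revents & POLLIN)
            return 1;           /* Something is pending; try to receive it */
        return -1;              /* Error or hang-up condition */
    }
}
```

On a 0 return, the supervisor can check by other means (for example,
waitpid(2) with WNOHANG, if the target is its child) whether the target
still exists before waiting again.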

>        SECCOMP_IOCTL_NOTIF_ID_VALID
>               This operation can be used to check that a notification ID
>               returned by an earlier SECCOMP_IOCTL_NOTIF_RECV  operation
>               is  still  valid  (i.e.,  that  the  target  process still
>               exists).
> 
>               The third ioctl(2) argument is a  pointer  to  the  cookie
>               (id) returned by the SECCOMP_IOCTL_NOTIF_RECV operation.
> 
>               This  operation is necessary to avoid race conditions that
>               can  occur   when   the   pid   returned   by   the   SEC‐
>               COMP_IOCTL_NOTIF_RECV   operation   terminates,  and  that
>               process ID is reused by another process.   An  example  of
>               this kind of race is the following
> 
>               1. A  notification  is  generated  on  the  listening file
>                  descriptor.  The returned  seccomp_notif  contains  the
>                  PID of the target process.
> 
>               2. The target process terminates.
> 
>               3. Another process is created on the system that by chance
>                  reuses the PID that was freed when the  target  process
>                  terminates.
> 
>               4. The  supervisor  open(2)s  the /proc/[pid]/mem file for
>                  the PID obtained in step 1, with the intention of (say)
>                  inspecting the memory locations that contains the argu‐
>                  ments of the system call that triggered  the  notifica‐
>                  tion in step 1.
> 
>               In the above scenario, the risk is that the supervisor may
>               try to access the memory of a process other than the  tar‐
>               get.   This  race  can be avoided by following the call to
>               open with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to ver‐
>               ify  that  the  process that generated the notification is
>               still alive.  (Note that  if  the  target  process  subse‐
>               quently  terminates, its PID won't be reused because there
>               remains an open reference to the /proc[pid]/mem  file;  in
>               this  case, a subsequent read(2) from the file will return
>               0, indicating end of file.)
> 
>               On success (i.e., the notification  ID  is  still  valid),
>               this  operation  returns 0 On failure (i.e., the notifica‐
                                          ^ need a period?

>        ┌─────────────────────────────────────────────────────┐
>        │FIXME                                                │
>        ├─────────────────────────────────────────────────────┤
>        │Interestingly, after the event  had  been  received, │
>        │the  file descriptor indicates as writable (verified │
>        │from the source code and by experiment). How is this │
>        │useful?                                              │

You're saying it should just do EPOLLOUT and not EPOLLWRNORM? Seems
reasonable.

> 
> EXAMPLES
>        The (somewhat contrived) program shown below demonstrates the use

May also be worth mentioning the example in
samples/seccomp/user-trap.c as well.

Tycho

* Re: For review: seccomp_user_notif(2) manual page
  2020-09-30 15:03 ` Tycho Andersen
@ 2020-09-30 15:11   ` Tycho Andersen
  2020-09-30 20:34   ` Michael Kerrisk (man-pages)
  1 sibling, 0 replies; 52+ messages in thread
From: Tycho Andersen @ 2020-09-30 15:11 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Sargun Dhillon, Kees Cook, Christian Brauner, linux-man, lkml,
	Aleksa Sarai, Jann Horn, Alexei Starovoitov, wad, bpf, Song Liu,
	Daniel Borkmann, Andy Lutomirski, Linux Containers,
	Giuseppe Scrivano, Robert Sesek

On Wed, Sep 30, 2020 at 09:03:36AM -0600, Tycho Andersen wrote:
> On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
> >        ┌─────────────────────────────────────────────────────┐
> >        │FIXME                                                │
> >        ├─────────────────────────────────────────────────────┤
> >        │Interestingly, after the event  had  been  received, │
> >        │the  file descriptor indicates as writable (verified │
> >        │from the source code and by experiment). How is this │
> >        │useful?                                              │
> 
> You're saying it should just do EPOLLOUT and not EPOLLWRNORM? Seems
> reasonable.

If we make this change, I suppose we should also drop EPOLLRDNORM from
things which have not been received yet, since they're not really
readable.
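
For anyone wanting to reproduce the observation, a non-blocking probe
along these lines (probeReadiness() is a hypothetical name) shows which
of POLLIN and POLLOUT a descriptor currently reports. The sketch makes
no claim about what the seccomp notification descriptor should report:

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical probe: report, without blocking, which of the requested
   conditions (POLLIN, POLLOUT) the kernel currently signals for 'fd'.
   Returns the 'revents' mask filled in by poll(), which may also include
   POLLERR, POLLHUP, or POLLNVAL; returns -1 if poll() itself fails. */

static int
probeReadiness(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN | POLLOUT };

    assert(fd >= 0);

    if (poll(&pfd, 1, 0) == -1)     /* Timeout of 0: return immediately */
        return -1;

    return pfd.revents;
}
```

Calling this on the notification descriptor before and after a
SECCOMP_IOCTL_NOTIF_RECV would show the transition described in the
FIXME.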

Tycho

* Re: For review: seccomp_user_notif(2) manual page
  2020-09-30 11:07 For review: seccomp_user_notif(2) manual page Michael Kerrisk (man-pages)
  2020-09-30 15:03 ` Tycho Andersen
@ 2020-09-30 15:53 ` Jann Horn
  2020-10-01 12:54   ` Christian Brauner
  2020-10-15 11:24   ` Michael Kerrisk (man-pages)
  2020-09-30 23:39 ` Kees Cook
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 52+ messages in thread
From: Jann Horn @ 2020-09-30 15:53 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Tycho Andersen, Sargun Dhillon, Kees Cook, Christian Brauner,
	linux-man, lkml, Aleksa Sarai, Alexei Starovoitov, Will Drewry,
	bpf, Song Liu, Daniel Borkmann, Andy Lutomirski,
	Linux Containers, Giuseppe Scrivano, Robert Sesek

On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
> I knew it would be a big ask, but below is kind of the manual page
> I was hoping you might write [1] for the seccomp user-space notification
> mechanism. Since you didn't (and because 5.9 adds various new pieces
> such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD
> that also will need documenting [2]), I did :-). But of course I may
> have made mistakes...
[...]
> NAME
>        seccomp_user_notif - Seccomp user-space notification mechanism
>
> SYNOPSIS
>        #include <linux/seccomp.h>
>        #include <linux/filter.h>
>        #include <linux/audit.h>
>
>        int seccomp(unsigned int operation, unsigned int flags, void *args);

Should the ioctl() calls be listed here, similar to e.g. the SYNOPSIS
of the ioctl_* manpages?

> DESCRIPTION
>        This  page  describes  the user-space notification mechanism pro‐
>        vided by the Secure Computing (seccomp) facility.  As well as the
>        use   of  the  SECCOMP_FILTER_FLAG_NEW_LISTENER  flag,  the  SEC‐
>        COMP_RET_USER_NOTIF action value, and the SECCOMP_GET_NOTIF_SIZES
>        operation  described  in  seccomp(2), this mechanism involves the
>        use of a number of related ioctl(2) operations (described below).
>
>    Overview
>        In conventional usage of a seccomp filter, the decision about how
>        to  treat  a particular system call is made by the filter itself.
>        The user-space notification mechanism allows the handling of  the
>        system  call  to  instead  be handed off to a user-space process.
>        The advantages of doing this are that, by contrast with the  sec‐
>        comp  filter,  which  is  running on a virtual machine inside the
>        kernel, the user-space process has access to information that  is
>        unavailable to the seccomp filter and it can perform actions that
>        can't be performed from the seccomp filter.
>
>        In the discussion that follows, the process  that  has  installed
>        the  seccomp filter is referred to as the target, and the process

Technically, this definition of "target" is a bit inaccurate because:

 - seccomp filters are inherited
 - seccomp filters apply to threads, not processes
 - seccomp filters can be semi-remotely installed via TSYNC

(I assume that in manpages, we should try to go for the "a task is a
thread and a thread group is a process" definition, right?)

Perhaps "the threads on which the seccomp filter is installed are
referred to as the target", or something like that would be better?

>        that is notified by  the  user-space  notification  mechanism  is
>        referred  to  as  the  supervisor.  An overview of the steps per‐
>        formed by these two processes is as follows:
>
>        1. The target process establishes a seccomp filter in  the  usual
>           manner, but with two differences:
>
>           · The seccomp(2) flags argument includes the flag SECCOMP_FIL‐
>             TER_FLAG_NEW_LISTENER.  Consequently, the return  value   of
>             the  (successful)  seccomp(2) call is a new "listening" file
>             descriptor that can be used to receive notifications.
>
>           · In cases where it is appropriate, the seccomp filter returns
>             the  action value SECCOMP_RET_USER_NOTIF.  This return value
>             will trigger a notification event.
>
>        2. In order that the supervisor process can obtain  notifications
>           using  the  listening  file  descriptor, (a duplicate of) that
>           file descriptor must be passed from the target process to  the
>           supervisor process.  One way in which this could be done is by
>           passing the file descriptor over a UNIX domain socket  connec‐
>           tion between the two processes (using the SCM_RIGHTS ancillary
>           message type described in unix(7)).   Another  possibility  is
>           that  the  supervisor  might  inherit  the file descriptor via
>           fork(2).

With the caveat that if the supervisor inherits the file descriptor
via fork(), that (more or less) implies that the supervisor is subject
to the same filter (although it could bypass the filter using a helper
thread that responds SECCOMP_USER_NOTIF_FLAG_CONTINUE, but I don't
expect any clean software to do that).

>        3. The supervisor process will receive notification events on the
>           listening  file  descriptor.   These  events  are  returned as
>           structures of type seccomp_notif.  Because this structure  and
>           its  size may evolve over kernel versions, the supervisor must
>           first determine the size of  this  structure  using  the  sec‐
>           comp(2)  SECCOMP_GET_NOTIF_SIZES  operation,  which  returns a
>           structure of type seccomp_notif_sizes.  The  supervisor  allo‐
>           cates a buffer of size seccomp_notif_sizes.seccomp_notif bytes
>           to receive notification events.   In  addition, the supervisor
>           allocates  another  buffer  of  size  seccomp_notif_sizes.sec‐
>           comp_notif_resp  bytes  for  the  response  (a   struct   sec‐
>           comp_notif_resp  structure) that it will provide to the kernel
>           (and thus the target process).
>
>        4. The target process then performs its workload, which  includes
>           system  calls  that  will be controlled by the seccomp filter.
>           Whenever one of these system calls causes the filter to return
>           the  SECCOMP_RET_USER_NOTIF  action value, the kernel does not
>           execute the system call;  instead,  execution  of  the  target
>           process is temporarily blocked inside the kernel and a notifi‐

where "blocked" refers to the interruptible, restartable kind - if the
child receives a signal with an SA_RESTART signal handler in the
meantime, it'll leave the syscall, go through the signal handler, then
restart the syscall again and send the same request to the supervisor
again. So the supervisor may see duplicate syscalls.

What's really gross here is that signal(7) promises that some syscalls
like epoll_wait(2) never restart, but seccomp doesn't know about that;
if userspace installs a filter that uses SECCOMP_RET_USER_NOTIF for a
non-restartable syscall, the result is that UAPI gets broken a little
bit. Luckily normal users of seccomp probably won't use
SECCOMP_RET_USER_NOTIF for non-restartable syscalls, but if someone does
want to do that, we might have to add some "suppress syscall
restarting" flag into the seccomp action value, or something like
that... yuck.

>           cation event is generated on the listening file descriptor.
>
>        5. The supervisor process can now repeatedly monitor the  listen‐
>           ing   file   descriptor  for  SECCOMP_RET_USER_NOTIF-triggered
>           events.   To  do  this,   the   supervisor   uses   the   SEC‐
>           COMP_IOCTL_NOTIF_RECV  ioctl(2)  operation to read information
>           about a notification event; this  operation  blocks  until  an

(interruptibly - but I guess that maybe doesn't have to be said
explicitly here?)

>           event  is  available.

Maybe we should note here that you can use the multi-fd-polling APIs
(select/poll/epoll) instead, and that if the notification goes away
before you call SECCOMP_IOCTL_NOTIF_RECV, the ioctl will return
-ENOENT instead of blocking, and therefore as long as nobody else
reads from the same fd, you can assume that after the fd reports as
readable, you can call SECCOMP_IOCTL_NOTIF_RECV once without blocking.

Exceeeeept that this part looks broken:

  if (mutex_lock_interruptible(&filter->notify_lock) < 0)
    return EPOLLERR;

which I think means that we can have a race where a signal arrives
while poll() is trying to add itself to the waitqueue of the seccomp
fd, and then we'll get a spurious error condition reported on the fd.
That's a kernel bug, I'd say.
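
The poll-then-receive pattern described above could be sketched roughly
as follows. This is only an illustration: the helper names
wait_readable() and recv_notification() are made up for this mail, and
the sketch assumes that nobody else reads from the same fd and that the
UAPI headers are from Linux 5.0 or later.

```c
#include <errno.h>
#include <poll.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>

/* Wait until 'fd' reports readable. Returns 1 if readable, 0 on timeout,
   -1 on error (errno set). Retries if poll() is interrupted by a signal. */
static int
wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    for (;;) {
        int n = poll(&pfd, 1, timeout_ms);
        if (n == -1 && errno == EINTR)
            continue;
        if (n <= 0)
            return n;               /* 0: timeout; -1: poll() error */
        if (pfd.revents & POLLIN)
            return 1;
        errno = EIO;                /* POLLERR/POLLHUP without POLLIN */
        return -1;
    }
}

/* Receive one notification into 'req' (a buffer of 'req_size' bytes).
   Returns 1 on success, 0 if the notification went away before it could
   be received (ENOENT), -1 on any other error. */
static int
recv_notification(int notify_fd, struct seccomp_notif *req, size_t req_size)
{
    if (wait_readable(notify_fd, -1) != 1)
        return -1;

    memset(req, 0, req_size);       /* The structure must be zeroed */
    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, req) == 0)
        return 1;

    return (errno == ENOENT) ? 0 : -1;
}
```

A caller would treat a return value of 0 as "nothing to do" and simply
go back to waiting.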

> The  operation returns a seccomp_notif
>           structure containing information about the system call that is
>           being attempted by the target process.
>
>        6. The    seccomp_notif    structure   returned   by   the   SEC‐
>           COMP_IOCTL_NOTIF_RECV operation includes the same  information
>           (a seccomp_data structure) that was passed to the seccomp fil‐
>           ter.  This information allows the supervisor to  discover  the
>           system  call number and the arguments for the target process's
>           system call.  In addition, the notification event contains the
>           PID of the target process.

That's a PIDTYPE_PID, which the manpages call a "thread ID".

>           The  information  in  the notification can be used to discover
>           the values of pointer arguments for the target process's  sys‐
>           tem call.  (This is something that can't be done from within a
>           seccomp filter.)  To do this (and  assuming  it  has  suitable
>           permissions),   the   supervisor   opens   the   corresponding
>           /proc/[pid]/mem file,

... which means that here we might have to get into the weeds of how
actually /proc has invisible directories for every TID, even though
only the ones for PIDs are visible, and therefore you can just open
/proc/[tid]/mem and it'll work fine?

> seeks to the memory location that corre‐
>           sponds to one of the pointer arguments whose value is supplied
>           in the notification event, and reads bytes from that location.
>           (The supervisor must be careful to avoid a race condition that
>           can occur when doing this; see the  description  of  the  SEC‐
>           COMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.)  In addi‐
>           tion, the supervisor can access other system information  that
>           is  visible  in  user space but which is not accessible from a
>           seccomp filter.
>
>           ┌─────────────────────────────────────────────────────┐
>           │FIXME                                                │
>           ├─────────────────────────────────────────────────────┤
>           │Suppose we are reading a pathname from /proc/PID/mem │
>           │for  a system call such as mkdir(). The pathname can │
>           │be an arbitrary length. How do we know how much (how │
>           │many pages) to read from /proc/PID/mem?              │
>           └─────────────────────────────────────────────────────┘

It can't be an arbitrary length. While pathnames *returned* from the
kernel in some places can have different limits, strings supplied as
path arguments *to* the kernel AFAIK always have an upper limit of
PATH_MAX, else you get -ENAMETOOLONG. See getname_flags().

>        7. Having  obtained  information  as  per  the previous step, the
>           supervisor may then choose to perform an action in response to
>           the  target  process's  system call (which, as noted above, is
>           not  executed  when  the  seccomp  filter  returns  the   SEC‐
>           COMP_RET_USER_NOTIF action value).

(unless SECCOMP_USER_NOTIF_FLAG_CONTINUE is used)

>           One  example  use case here relates to containers.  The target
>           process may be located inside a container where  it  does  not
>           have sufficient capabilities to mount a filesystem in the con‐
>           tainer's mount namespace.  However, the supervisor  may  be  a
>           more  privileged  process that that does have sufficient capa‐

nit: s/that that/that/

>           bilities to perform the mount operation.
>
>        8. The supervisor then sends a response to the notification.  The
>           information  in  this  response  is used by the kernel to con‐
>           struct a return value for the target process's system call and
>           provide a value that will be assigned to the errno variable of
>           the target process.
>
>           The  response  is  sent  using  the   SECCOMP_IOCTL_NOTIF_SEND
>           ioctl(2)   operation,   which  is  used  to  transmit  a  sec‐
>           comp_notif_resp  structure  to  the  kernel.   This  structure
>           includes  a  cookie  value that the supervisor obtained in the
>           seccomp_notif    structure    returned     by     the     SEC‐
>           COMP_IOCTL_NOTIF_RECV operation.  This cookie value allows the
>           kernel to associate the response with the target process.

(unless the target thread entered a signal handler or was killed in
the meantime)

>        9. Once the notification has been sent, the system  call  in  the
>           target  process  unblocks,  returning the information that was
>           provided by the supervisor in the notification response.
>
>        As a variation on the last two steps, the supervisor can  send  a
>        response  that tells the kernel that it should execute the target
>        process's   system   call;   see   the   discussion    of    SEC‐
>        COMP_USER_NOTIF_FLAG_CONTINUE, below.
>
>    ioctl(2) operations
>        The following ioctl(2) operations are provided to support seccomp
>        user-space notification.  For each of these operations, the first
>        (file  descriptor)  argument  of  ioctl(2)  is the listening file
>        descriptor returned by a call to seccomp(2) with the SECCOMP_FIL‐
>        TER_FLAG_NEW_LISTENER flag.
>
>        SECCOMP_IOCTL_NOTIF_RECV
>               This operation is used to obtain a user-space notification
>               event.  If no such event is currently pending, the  opera‐
>               tion  blocks  until  an  event occurs.

Not necessarily; for every time a process entered a signal handler or
was killed while a notification was pending, a call to
SECCOMP_IOCTL_NOTIF_RECV will return -ENOENT.

> The third ioctl(2)
>               argument is a pointer to a structure of the following form
>               which  contains  information about the event.  This struc‐
>               ture must be zeroed out before the call.
>
>                   struct seccomp_notif {
>                       __u64  id;              /* Cookie */
>                       __u32  pid;             /* PID of target process */

(TID, not PID)

>                       __u32  flags;           /* Currently unused (0) */
>                       struct seccomp_data data;   /* See seccomp(2) */
>                   };
>
>               The fields in this structure are as follows:
>
>               id     This is a cookie for the notification.   Each  such
>                      cookie  is  guaranteed  to be unique for the corre‐
>                      sponding seccomp  filter.   In  other  words,  this
>                      cookie  is  unique for each notification event from
>                      the target process.

That sentence about "target process" looks wrong to me. The cookies
are unique across notifications from the filter, but there can be
multiple filters per thread, and multiple threads per filter.

> The cookie value has the  fol‐
>                      lowing uses:
>
>                      · It     can     be     used    with    the    SEC‐
>                        COMP_IOCTL_NOTIF_ID_VALID ioctl(2)  operation  to
>                        verify that the target process is still alive.
>
>                      · When  returning  a  notification  response to the
>                        kernel, the supervisor must  include  the  cookie
>                        value in the seccomp_notif_resp structure that is
>                        specified   as   the   argument   of   the   SEC‐
>                        COMP_IOCTL_NOTIF_SEND operation.
>
>               pid    This  is  the  PID of the target process that trig‐
>                      gered the notification event.
>
>                      ┌─────────────────────────────────────────────────────┐
>                      │FIXME                                                │
>                      ├─────────────────────────────────────────────────────┤
>                      │This is a thread ID, rather than a PID, right?       │
>                      └─────────────────────────────────────────────────────┘

Yeah.

>
>               flags  This is a  bit  mask  of  flags  providing  further
>                      information on the event.  In the current implemen‐
>                      tation, this field is always zero.
>
>               data   This is a seccomp_data structure containing  infor‐
>                      mation  about  the  system  call that triggered the
>                      notification.  This is the same structure  that  is
>                      passed  to  the seccomp filter.  See seccomp(2) for
>                      details of this structure.
>
>               On success, this operation returns 0; on  failure,  -1  is
>               returned,  and  errno  is set to indicate the cause of the
>               error.  This operation can fail with the following errors:
>
>               EINVAL (since Linux 5.5)
>                      The seccomp_notif structure that was passed to  the
>                      call contained nonzero fields.
>
>               ENOENT The  target  process  was killed by a signal as the
>                      notification information was being generated.

Not just killed, interruption with a signal handler has the same effect.

>        ┌─────────────────────────────────────────────────────┐
>        │FIXME                                                │
>        ├─────────────────────────────────────────────────────┤
>        │From my experiments,  it  appears  that  if  a  SEC‐ │
>        │COMP_IOCTL_NOTIF_RECV   is  done  after  the  target │
>        │process terminates, then the ioctl()  simply  blocks │
>        │(rather than returning an error to indicate that the │
>        │target process no longer exists).                    │
>        │                                                     │
>        │I found that surprising, and it required  some  con‐ │
>        │tortions  in the example program.  It was not possi‐ │
>        │ble to code my SIGCHLD handler (which reaps the zom‐ │
>        │bie  when  the  worker/target process terminates) to │
>        │simply set a flag checked in the main  handleNotifi‐ │
>        │cations()  loop,  since  this created an unavoidable │
>        │race where the child might terminate  just  after  I │
>        │had  checked  the  flag,  but before I blocked (for‐ │
>        │ever!) in  the  SECCOMP_IOCTL_NOTIF_RECV  operation. │
>        │Instead,  I had to code the signal handler to simply │
>        │call _exit(2)  in  order  to  terminate  the  parent │
>        │process (the supervisor).                            │
>        │                                                     │
>        │Is  this  expected  behavior?  It seems to me rather │
>        │desirable that SECCOMP_IOCTL_NOTIF_RECV should  give │
>        │an error if the target process has terminated.       │
>        └─────────────────────────────────────────────────────┘

You could poll() the fd first. But yeah, it'd probably be a good idea
to change that.

>        SECCOMP_IOCTL_NOTIF_ID_VALID
[...]
>               In the above scenario, the risk is that the supervisor may
>               try to access the memory of a process other than the  tar‐
>               get.   This  race  can be avoided by following the call to
>               open with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to ver‐
>               ify  that  the  process that generated the notification is
>               still alive.  (Note that  if  the  target  process  subse‐
>               quently  terminates, its PID won't be reused because there

That's wrong, the PID can be reused, but the /proc/$pid directory is
internally not associated with the numeric PID, but, conceptually
speaking, with a specific incarnation of the PID, or something like
that. (Actually, it is associated with the "struct pid", which is not
reused, instead of the numeric PID.)

>               remains an open reference to the /proc/[pid]/mem file;  in
>               this  case, a subsequent read(2) from the file will return
>               0, indicating end of file.)
>
>               On success (i.e., the notification  ID  is  still  valid),
>               this  operation  returns 0 On failure (i.e., the notifica‐

nit: s/returns 0/returns 0./

>               tion ID is no longer valid), -1 is returned, and errno  is
>               set to ENOENT.
>
>        SECCOMP_IOCTL_NOTIF_SEND
[...]
>               Two kinds of response are possible:
>
>               · A response to the kernel telling it to execute the  tar‐
>                 get  process's  system  call.   In  this case, the flags
>                 field includes SECCOMP_USER_NOTIF_FLAG_CONTINUE and  the
>                 error and val fields must be zero.
>
>                 This  kind  of response can be useful in cases where the
>                 supervisor needs to do deeper analysis of  the  target's
>                 system  call  than  is  possible  from  a seccomp filter
>                 (e.g., examining the values of pointer arguments),  and,
>                 having  verified that the system call is acceptable, the
>                 supervisor wants to allow it to proceed.

"allow" sounds as if this is an access control thing, but this
mechanism should usually not be used for access control (unless the
"seccomp" syscall is blocked). Maybe reword as "having decided that
the system call does not require emulation by the supervisor, the
supervisor wants it to execute normally", or something like that?

[...]
>               On success, this operation returns 0; on  failure,  -1  is
>               returned,  and  errno  is set to indicate the cause of the
>               error.  This operation can fail with the following errors:
>
>               EINPROGRESS
>                      A response to this notification  has  already  been
>                      sent.
>
>               EINVAL An invalid value was specified in the flags field.
>
>               EINVAL The       flags      field      contained      SEC‐
>                      COMP_USER_NOTIF_FLAG_CONTINUE, and the error or val
>                      field was not zero.
>
>               ENOENT The  blocked  system call in the target process has
>                      been interrupted by a signal handler.

(you could also get this if a response has already been sent, instead
of EINPROGRESS - the only difference is whether the target thread has
picked up the response yet)

> NOTES
>        The file descriptor returned when seccomp(2) is employed with the
>        SECCOMP_FILTER_FLAG_NEW_LISTENER  flag  can  be  monitored  using
>        poll(2), epoll(7), and select(2).  When a notification  is  pend‐
>        ing,  these interfaces indicate that the file descriptor is read‐
>        able.

We should probably also point out somewhere that, as
include/uapi/linux/seccomp.h says:

 * Similar precautions should be applied when stacking SECCOMP_RET_USER_NOTIF
 * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the
 * same syscall, the most recently added filter takes precedence. This means
 * that the new SECCOMP_RET_USER_NOTIF filter can override any
 * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all
 * such filtered syscalls to be executed by sending the response
 * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can equally
 * be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.

In other words, from a security perspective, you must assume that the
target process can bypass any SECCOMP_RET_USER_NOTIF (or
SECCOMP_RET_TRACE) filters unless it is completely prohibited from
calling seccomp(). This should also be noted over in the main
seccomp(2) manpage, especially the SECCOMP_RET_TRACE part.


> EXAMPLES
[...]
>        This  program  can  used  to  demonstrate  various aspects of the

nit: "can be used to demonstrate", or alternatively just "demonstrates"

>        behavior of the seccomp user-space  notification  mechanism.   To
>        help  aid  such demonstrations, the program logs various messages
>        to show the operation of the target process (lines prefixed "T:")
>        and the supervisor (indented lines prefixed "S:").
[...]
>    Program source
[...]
>        #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
>                                } while (0)

Don't we have err() for this?

>        /* Send the file descriptor 'fd' over the connected UNIX domain socket
>           'sockfd'. Returns 0 on success, or -1 on error. */
>
>        static int
>        sendfd(int sockfd, int fd)
>        {
>            struct msghdr msgh;
>            struct iovec iov;
>            int data;
>            struct cmsghdr *cmsgp;
>
>            /* Allocate a char array of suitable size to hold the ancillary data.
>               However, since this buffer is in reality a 'struct cmsghdr', use a
>               union to ensure that it is suitable aligned. */

nit: suitably

>            union {
>                char   buf[CMSG_SPACE(sizeof(int))];
>                                /* Space large enough to hold an 'int' */
>                struct cmsghdr align;
>            } controlMsg;
>
>            /* The 'msg_name' field can be used to specify the address of the
>               destination socket when sending a datagram. However, we do not
>               need to use this field because 'sockfd' is a connected socket. */
>
>            msgh.msg_name = NULL;
>            msgh.msg_namelen = 0;
>
>            /* On Linux, we must transmit at least one byte of real data in
>               order to send ancillary data. We transmit an arbitrary integer
>               whose value is ignored by recvfd(). */
>
>            msgh.msg_iov = &iov;
>            msgh.msg_iovlen = 1;
>            iov.iov_base = &data;
>            iov.iov_len = sizeof(int);
>            data = 12345;
>
>            /* Set 'msghdr' fields that describe ancillary data */
>
>            msgh.msg_control = controlMsg.buf;
>            msgh.msg_controllen = sizeof(controlMsg.buf);
>
>            /* Set up ancillary data describing file descriptor to send */
>
>            cmsgp = CMSG_FIRSTHDR(&msgh);
>            cmsgp->cmsg_level = SOL_SOCKET;
>            cmsgp->cmsg_type = SCM_RIGHTS;
>            cmsgp->cmsg_len = CMSG_LEN(sizeof(int));
>            memcpy(CMSG_DATA(cmsgp), &fd, sizeof(int));
>
>            /* Send real plus ancillary data */
>
>            if (sendmsg(sockfd, &msgh, 0) == -1)
>                return -1;
>
>            return 0;
>        }

Instead of using unix domain sockets to send the fd to the parent, I
think you could also use clone3() with flags==CLONE_FILES|SIGCHLD,
dup2() the seccomp fd to an fd that was reserved in the parent, call
unshare(CLONE_FILES) in the child after setting up the seccomp fd, and
wake up the parent with something like pthread_cond_signal()? I'm not
sure whether that'd look better or worse in the end though, so maybe
just ignore this comment.

[...]
>        /* Access the memory of the target process in order to discover the
>           pathname that was given to mkdir() */
>
>        static void
>        getTargetPathname(struct seccomp_notif *req, int notifyFd,
>                          char *path, size_t len)
>        {
>            char procMemPath[PATH_MAX];
>            snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", req->pid);
>
>            int procMemFd = open(procMemPath, O_RDONLY);

Should example code like this maybe use O_CLOEXEC unless the fd in
question actually has to be inheritable? I know it doesn't actually
matter here, but if this code was used in a multi-threaded context, it
might.

>            if (procMemFd == -1)
>                errExit("Supervisor: open");
>
>            /* Check that the process whose info we are accessing is still alive.
>               If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed
>               in checkNotificationIdIsValid()) succeeds, we know that the
>               /proc/PID/mem file descriptor that we opened corresponds to the
>               process for which we received a notification. If that process
>               subsequently terminates, then read() on that file descriptor
>               will return 0 (EOF). */
>
>            checkNotificationIdIsValid(notifyFd, req->id);
>
>            /* Seek to the location containing the pathname argument (i.e., the
>               first argument) of the mkdir(2) call and read that pathname */
>
>            if (lseek(procMemFd, req->data.args[0], SEEK_SET) == -1)
>                errExit("Supervisor: lseek");
>
>            ssize_t s = read(procMemFd, path, PATH_MAX);
>            if (s == -1)
>                errExit("read");

Why not pread() instead of lseek()+read()?
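
That is, roughly the following, where read_target_mem() is a
hypothetical helper and 'mem_fd' is assumed to be an already opened
/proc/TID/mem descriptor:

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Read up to 'len' bytes at address 'addr' of the target, using the
   already opened /proc/TID/mem descriptor 'mem_fd'. pread() takes the
   offset directly, so the file offset of 'mem_fd' is left untouched and
   no separate lseek() call is needed. Returns the number of bytes read,
   0 at EOF, or -1 on error. */
static ssize_t
read_target_mem(int mem_fd, uint64_t addr, void *buf, size_t len)
{
    return pread(mem_fd, buf, len, (off_t) addr);
}
```

In the example program the call would look something like
read_target_mem(procMemFd, req->data.args[0], path, len).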

>            if (s == 0) {
>                fprintf(stderr, "\tS: read() of /proc/PID/mem "
>                        "returned 0 (EOF)\n");
>                exit(EXIT_FAILURE);
>            }
>
>            if (close(procMemFd) == -1)
>                errExit("close-/proc/PID/mem");

We should probably make sure here that the value we read is actually
NUL-terminated?
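
A check like the following would do, I think. The helper name
is_nul_terminated() is just for illustration; 'nread' is the return
value of the read from /proc/PID/mem.

```c
#include <string.h>
#include <sys/types.h>

/* Return 1 if the first 'nread' bytes of 'buf' contain a terminating
   NUL byte, 0 otherwise. A target could pass a pointer to memory that
   holds no terminator at all, so the supervisor must not treat the
   buffer as a C string until this has been checked. */
static int
is_nul_terminated(const char *buf, ssize_t nread)
{
    return nread > 0 && strnlen(buf, (size_t) nread) < (size_t) nread;
}
```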

>        }
>
>        /* Handle notifications that arrive via the SECCOMP_RET_USER_NOTIF file
>           descriptor, 'notifyFd'. */
>
>        static void
>        handleNotifications(int notifyFd)
>        {
>            struct seccomp_notif_sizes sizes;
>            char path[PATH_MAX];
>                /* For simplicity, we assume that the pathname given to mkdir()
>                   is no more than PATH_MAX bytes; but this might not be true. */

No, it has to be true, otherwise the kernel would fail the syscall if
it was executing normally.

>            /* Discover the sizes of the structures that are used to receive
>               notifications and send notification responses, and allocate
>               buffers of those sizes. */
>
>            if (seccomp(SECCOMP_GET_NOTIF_SIZES, 0, &sizes) == -1)
>                errExit("\tS: seccomp-SECCOMP_GET_NOTIF_SIZES");
>
>            struct seccomp_notif *req = malloc(sizes.seccomp_notif);
>            if (req == NULL)
>                errExit("\tS: malloc");
>
>            struct seccomp_notif_resp *resp = malloc(sizes.seccomp_notif_resp);

This should probably do something like max(sizes.seccomp_notif_resp,
sizeof(struct seccomp_notif_resp)) in case the program was built
against new UAPI headers that make struct seccomp_notif_resp big, but
is running under an old kernel where that struct is still smaller?
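
As a sketch of what I mean (resp_buf_size() is an invented name, and
'sizes' is assumed to have been filled in by SECCOMP_GET_NOTIF_SIZES):

```c
#include <stddef.h>
#include <linux/seccomp.h>

/* Size of the buffer to allocate for notification responses: the larger
   of the size reported by the running kernel and the size of the
   structure in the headers that the program was built against. */
static size_t
resp_buf_size(const struct seccomp_notif_sizes *sizes)
{
    size_t kernel_size = sizes->seccomp_notif_resp;
    size_t built_size = sizeof(struct seccomp_notif_resp);

    return kernel_size > built_size ? kernel_size : built_size;
}
```

The allocation would then become malloc(resp_buf_size(&sizes)), so that
writes to any field known at build time stay within the buffer.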

>            if (resp == NULL)
>                errExit("\tS: malloc");
[...]
>                    } else {
>
>                        /* If mkdir() failed in the supervisor, pass the error
>                           back to the target */
>
>                        resp->error = -errno;
>                        printf("\tS: failure! (errno = %d; %s)\n", errno,
>                                strerror(errno));
>                    }
>                                                             } else if (strncmp(path, "./", strlen("./")) == 0) {

nit: indent messed up

>                    resp->error = resp->val = 0;
>                    resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
>                    printf("\tS: target can execute system call\n");
[...]


* Re: For review: seccomp_user_notif(2) manual page
  2020-09-30 15:03 ` Tycho Andersen
  2020-09-30 15:11   ` Tycho Andersen
@ 2020-09-30 20:34   ` Michael Kerrisk (man-pages)
  2020-09-30 23:03     ` Tycho Andersen
  1 sibling, 1 reply; 52+ messages in thread
From: Michael Kerrisk (man-pages) @ 2020-09-30 20:34 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: mtk.manpages, Sargun Dhillon, Kees Cook, Christian Brauner,
	linux-man, lkml, Aleksa Sarai, Jann Horn, Alexei Starovoitov,
	wad, bpf, Song Liu, Daniel Borkmann, Andy Lutomirski,
	Linux Containers, Giuseppe Scrivano, Robert Sesek

Hi Tycho,

Thanks for taking time to look at the page!

On 9/30/20 5:03 PM, Tycho Andersen wrote:
> On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
>>        2. In order that the supervisor process can obtain  notifications
>>           using  the  listening  file  descriptor, (a duplicate of) that
>>           file descriptor must be passed from the target process to  the
>>           supervisor process.  One way in which this could be done is by
>>           passing the file descriptor over a UNIX domain socket  connec‐
>>           tion between the two processes (using the SCM_RIGHTS ancillary
>>           message type described in unix(7)).   Another  possibility  is
>>           that  the  supervisor  might  inherit  the file descriptor via
>>           fork(2).
> 
> It is technically possible to inherit the fd via fork, but is it
> really that useful? The child process wouldn't be able to actually do
> the syscall in question, since it would have the same filter.

D'oh! Yes, of course.

I think I was reaching because in an earlier conversation
you replied:

[[
> 3. The "target process" passes the "listening file descriptor"
>    to the "monitoring process" via the UNIX domain socket.

or some other means, it doesn't have to be with SCM_RIGHTS.
]]

So, what other means?

Anyway, I removed the sentence mentioning fork().

>>           The  information  in  the notification can be used to discover
>>           the values of pointer arguments for the target process's  sys‐
>>           tem call.  (This is something that can't be done from within a
>>           seccomp filter.)  To do this (and  assuming  it  has  suitable
> 
> s/To do this/One way to accomplish this/ perhaps, since there are
> others.

Yes, thanks, done.

>>           permissions),   the   supervisor   opens   the   corresponding
>>           /proc/[pid]/mem file, seeks to the memory location that corre‐
>>           sponds to one of the pointer arguments whose value is supplied
>>           in the notification event, and reads bytes from that location.
>>           (The supervisor must be careful to avoid a race condition that
>>           can occur when doing this; see the  description  of  the  SEC‐
>>           COMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.)  In addi‐
>>           tion, the supervisor can access other system information  that
>>           is  visible  in  user space but which is not accessible from a
>>           seccomp filter.
>>
>>           ┌─────────────────────────────────────────────────────┐
>>           │FIXME                                                │
>>           ├─────────────────────────────────────────────────────┤
>>           │Suppose we are reading a pathname from /proc/PID/mem │
>>           │for  a system call such as mkdir(). The pathname can │
>>           │be an arbitrary length. How do we know how much (how │
>>           │many pages) to read from /proc/PID/mem?              │
>>           └─────────────────────────────────────────────────────┘
> 
> PATH_MAX, I suppose.

Yes, I misunderstood a fundamental detail here, as Jann 
also confirmed.

>>        ┌─────────────────────────────────────────────────────┐
>>        │FIXME                                                │
>>        ├─────────────────────────────────────────────────────┤
>>        │From my experiments,  it  appears  that  if  a  SEC‐ │
>>        │COMP_IOCTL_NOTIF_RECV   is  done  after  the  target │
>>        │process terminates, then the ioctl()  simply  blocks │
>>        │(rather than returning an error to indicate that the │
>>        │target process no longer exists).                    │
> 
> Yeah, I think Christian wanted to fix this at some point,

Do you have a pointer to that discussion? I could not find it with a
quick search.

> but it's a
> bit sticky to do.

Can you say a few words about the nature of the problem?

In the meantime, I think this merits a note under BUGS, and
I've added one.

> Note that if you e.g. rely on fork() above, the
> filter is shared with your current process, and this notification
> would never be possible. Perhaps another reason to omit that from the
> man page.

(Yes, as noted above, I removed that sentence.)

>>        SECCOMP_IOCTL_NOTIF_ID_VALID
>>               This operation can be used to check that a notification ID
>>               returned by an earlier SECCOMP_IOCTL_NOTIF_RECV  operation
>>               is  still  valid  (i.e.,  that  the  target  process still
>>               exists).
>>
>>               The third ioctl(2) argument is a  pointer  to  the  cookie
>>               (id) returned by the SECCOMP_IOCTL_NOTIF_RECV operation.
>>
>>               This  operation is necessary to avoid race conditions that
>>               can  occur   when   the   pid   returned   by   the   SEC‐
>>               COMP_IOCTL_NOTIF_RECV   operation   terminates,  and  that
>>               process ID is reused by another process.   An  example  of
>>               this kind of race is the following:
>>
>>               1. A  notification  is  generated  on  the  listening file
>>                  descriptor.  The returned  seccomp_notif  contains  the
>>                  PID of the target process.
>>
>>               2. The target process terminates.
>>
>>               3. Another process is created on the system that by chance
>>                  reuses the PID that was freed when the  target  process
>>                  terminates.
>>
>>               4. The  supervisor  open(2)s  the /proc/[pid]/mem file for
>>                  the PID obtained in step 1, with the intention of (say)
>>                  inspecting the memory locations that contain the
>>                  arguments of the system call that triggered the
>>                  notification in step 1.
>>
>>               In the above scenario, the risk is that the supervisor may
>>               try to access the memory of a process other than the  tar‐
>>               get.   This  race  can be avoided by following the call to
>>               open with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to ver‐
>>               ify  that  the  process that generated the notification is
>>               still alive.  (Note that  if  the  target  process  subse‐
>>               quently  terminates, its PID won't be reused because there
>>               remains an open reference to the /proc/[pid]/mem file;  in
>>               this  case, a subsequent read(2) from the file will return
>>               0, indicating end of file.)
>>
>>               On success (i.e., the notification  ID  is  still  valid),
>>               this  operation  returns 0 On failure (i.e., the notifica‐
>                                           ^ need a period?
> 
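
The open-then-check ordering described in the quoted text can be sketched as
follows. This is only a sketch, not the page's example program: the helper
names notif_id_is_valid() and open_target_mem() are hypothetical, notify_fd
is assumed to be the listening file descriptor, and req is assumed to be the
struct seccomp_notif filled in by SECCOMP_IOCTL_NOTIF_RECV.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/seccomp.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Return true if the notification ID is still valid (i.e., the target
   process still exists). */
static bool
notif_id_is_valid(int notify_fd, __u64 id)
{
    return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == 0;
}

/* Open /proc/[pid]/mem for the target named in 'req', then confirm that
   the notification is still valid. Doing the check after the open() is
   what closes the PID-reuse window. Returns a file descriptor, or -1. */
static int
open_target_mem(int notify_fd, const struct seccomp_notif *req)
{
    char path[64];
    int fd;

    assert(req != NULL);

    snprintf(path, sizeof(path), "/proc/%u/mem", req->pid);
    fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd == -1)
        return -1;

    if (!notif_id_is_valid(notify_fd, req->id)) {
        /* The target is gone; the file may belong to an unrelated
           process that reused the PID. */
        close(fd);
        return -1;
    }

    return fd;
}
```

If the check were done before the open(), the target could still terminate
and have its PID reused between the two calls.
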
>>        ┌─────────────────────────────────────────────────────┐
>>        │FIXME                                                │
>>        ├─────────────────────────────────────────────────────┤
>>        │Interestingly, after the event  had  been  received, │
>>        │the  file descriptor indicates as writable (verified │
>>        │from the source code and by experiment). How is this │
>>        │useful?                                              │
> 
> You're saying it should just do EPOLLOUT and not EPOLLWRNORM? Seems
> reasonable.

No, I'm saying something more fundamental: why is the FD indicating as
writable? Can you write something to it? If yes, what? If not, then
why do these APIs want to say that the FD is writable?

>> EXAMPLES
>>        The (somewhat contrived) program shown below demonstrates the use
> 
> May also be worth mentioning the example in
> samples/seccomp/user-trap.c as well.

Oh -- I meant to do that! Thanks for reminding me.

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


* Re: For review: seccomp_user_notif(2) manual page
  2020-09-30 20:34   ` Michael Kerrisk (man-pages)
@ 2020-09-30 23:03     ` Tycho Andersen
  2020-09-30 23:11       ` Jann Horn
  2020-10-01  7:45       ` Michael Kerrisk (man-pages)
  0 siblings, 2 replies; 52+ messages in thread
From: Tycho Andersen @ 2020-09-30 23:03 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Sargun Dhillon, Kees Cook, Christian Brauner, linux-man, lkml,
	Aleksa Sarai, Jann Horn, Alexei Starovoitov, wad, bpf, Song Liu,
	Daniel Borkmann, Andy Lutomirski, Linux Containers,
	Giuseppe Scrivano, Robert Sesek

On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) wrote:
> Hi Tycho,
> 
> Thanks for taking time to look at the page!
> 
> On 9/30/20 5:03 PM, Tycho Andersen wrote:
> > On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
> >>        2. In order that the supervisor process can obtain  notifications
> >>           using  the  listening  file  descriptor, (a duplicate of) that
> >>           file descriptor must be passed from the target process to  the
> >>           supervisor process.  One way in which this could be done is by
> >>           passing the file descriptor over a UNIX domain socket  connec‐
> >>           tion between the two processes (using the SCM_RIGHTS ancillary
> >>           message type described in unix(7)).   Another  possibility  is
> >>           that  the  supervisor  might  inherit  the file descriptor via
> >>           fork(2).
> > 
> > It is technically possible to inherit the fd via fork, but is it
> > really that useful? The child process wouldn't be able to actually do
> > the syscall in question, since it would have the same filter.
> 
> D'oh! Yes, of course.
> 
> I think I was reaching because in an earlier conversation
> you replied:
> 
> [[
> > 3. The "target process" passes the "listening file descriptor"
> >    to the "monitoring process" via the UNIX domain socket.
> 
> or some other means, it doesn't have to be with SCM_RIGHTS.
> ]]
> 
> So, what other means?
> 
> Anyway, I removed the sentence mentioning fork().

Whatever means people want :). fork() could work (it's how some of the
tests for this feature work, but it's not particularly useful I don't
think), clone(CLONE_FILES) is similar, seccomp_putfd, or maybe even
cloning it via some pidfd interface that might be invented for
re-opening files.
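
For reference, the SCM_RIGHTS route mentioned in the quoted text looks
roughly like this. It is a generic sketch of fd passing over a UNIX domain
socket, not code from the page or from the kernel selftests; send_fd() and
recv_fd() are hypothetical helper names, and sock is assumed to be an
already-connected AF_UNIX socket.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control-message buffer, aligned suitably for struct cmsghdr. */
union fd_ctl {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send 'fd' over the UNIX domain socket 'sock'. Returns 0 or -1. */
static int
send_fd(int sock, int fd)
{
    char dummy = 'x';           /* at least one byte of real data */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_ctl ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(fd >= 0);

    memset(&ctl, 0, sizeof(ctl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a file descriptor from 'sock'. Returns the new fd or -1. */
static int
recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_ctl ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
            cmsg->cmsg_type != SCM_RIGHTS ||
            cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The target would call send_fd() with the listening fd returned by seccomp(2),
and the supervisor would call recv_fd() on its end of the socket.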

> >>        ┌─────────────────────────────────────────────────────┐
> >>        │FIXME                                                │
> >>        ├─────────────────────────────────────────────────────┤
> >>        │From my experiments,  it  appears  that  if  a  SEC‐ │
> >>        │COMP_IOCTL_NOTIF_RECV   is  done  after  the  target │
> >>        │process terminates, then the ioctl()  simply  blocks │
> >>        │(rather than returning an error to indicate that the │
> >>        │target process no longer exists).                    │
> > 
> > Yeah, I think Christian wanted to fix this at some point,
> 
> Do you have a pointer to that discussion? I could not find it with a
> quick search.
> 
> > but it's a
> > bit sticky to do.
> 
> Can you say a few words about the nature of the problem?

I remembered wrong, it's actually in the tree: 99cdb8b9a573 ("seccomp:
notify about unused filter"). So maybe there's a bug here?

> >>        ┌─────────────────────────────────────────────────────┐
> >>        │FIXME                                                │
> >>        ├─────────────────────────────────────────────────────┤
> >>        │Interestingly, after the event  had  been  received, │
> >>        │the  file descriptor indicates as writable (verified │
> >>        │from the source code and by experiment). How is this │
> >>        │useful?                                              │
> > 
> > You're saying it should just do EPOLLOUT and not EPOLLWRNORM? Seems
> > reasonable.
> 
> No, I'm saying something more fundamental: why is the FD indicating as
> writable? Can you write something to it? If yes, what? If not, then
> why do these APIs want to say that the FD is writable?

You can't via read(2) or write(2), but conceptually NOTIFY_RECV and
NOTIFY_SEND are reading and writing events from the fd. I don't know
that much about the poll interface though -- is it possible to
indicate "here's a pseudo-read event"? It didn't look like it, so I
just (ab-)used POLLIN and POLLOUT, but probably that's wrong.

Tycho


* Re: For review: seccomp_user_notif(2) manual page
  2020-09-30 23:03     ` Tycho Andersen
@ 2020-09-30 23:11       ` Jann Horn
  2020-09-30 23:24         ` Tycho Andersen
  2020-10-01  7:45       ` Michael Kerrisk (man-pages)
  1 sibling, 1 reply; 52+ messages in thread
From: Jann Horn @ 2020-09-30 23:11 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Michael Kerrisk (man-pages),
	Sargun Dhillon, Kees Cook, Christian Brauner, linux-man, lkml,
	Aleksa Sarai, Alexei Starovoitov, Will Drewry, bpf, Song Liu,
	Daniel Borkmann, Andy Lutomirski, Linux Containers,
	Giuseppe Scrivano, Robert Sesek

On Thu, Oct 1, 2020 at 1:03 AM Tycho Andersen <tycho@tycho.pizza> wrote:
> On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) wrote:
> > On 9/30/20 5:03 PM, Tycho Andersen wrote:
> > > On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
> > >>        ┌─────────────────────────────────────────────────────┐
> > >>        │FIXME                                                │
> > >>        ├─────────────────────────────────────────────────────┤
> > >>        │From my experiments,  it  appears  that  if  a  SEC‐ │
> > >>        │COMP_IOCTL_NOTIF_RECV   is  done  after  the  target │
> > >>        │process terminates, then the ioctl()  simply  blocks │
> > >>        │(rather than returning an error to indicate that the │
> > >>        │target process no longer exists).                    │
> > >
> > > Yeah, I think Christian wanted to fix this at some point,
> >
> > Do you have a pointer to that discussion? I could not find it with a
> > quick search.
> >
> > > but it's a
> > > bit sticky to do.
> >
> > Can you say a few words about the nature of the problem?
>
> I remembered wrong, it's actually in the tree: 99cdb8b9a573 ("seccomp:
> notify about unused filter"). So maybe there's a bug here?

That thing only notifies on ->poll, it doesn't unblock ioctls; and
Michael's sample code uses SECCOMP_IOCTL_NOTIF_RECV to wait. So that
commit doesn't have any effect on this kind of usage.
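
Put differently, a supervisor that wants to notice an unused filter would
have to poll() the listening fd first and only then call
SECCOMP_IOCTL_NOTIF_RECV. A rough sketch of that wait step, assuming the
kernel reports POLLHUP on the listening fd once the filter has no remaining
users (an assumption about the commit's behavior that is not spelled out in
this thread); wait_for_notification() and the NOTIF_* result codes are
hypothetical names:

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

enum {
    NOTIF_HANGUP  = -2,     /* POLLHUP/POLLERR/POLLNVAL: stop waiting */
    NOTIF_ERROR   = -1,     /* poll() itself failed */
    NOTIF_TIMEOUT =  0,     /* nothing happened within timeout_ms */
    NOTIF_READY   =  1      /* POLLIN: a notification can be received */
};

/* Wait until 'notify_fd' has a notification to receive, the timeout
   expires, or the fd signals hangup. A timeout of -1 waits forever. */
static int
wait_for_notification(int notify_fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = notify_fd, .events = POLLIN };
    int n;

    assert(notify_fd >= 0);

    /* Restart if interrupted by a signal handler (the timeout is not
       adjusted for the time already spent waiting). */
    do {
        n = poll(&pfd, 1, timeout_ms);
    } while (n == -1 && errno == EINTR);

    if (n == -1)
        return NOTIF_ERROR;
    if (n == 0)
        return NOTIF_TIMEOUT;
    if (pfd.revents & POLLIN)
        return NOTIF_READY;     /* deliver pending events first */

    return NOTIF_HANGUP;
}
```

On NOTIF_READY the supervisor would go on to SECCOMP_IOCTL_NOTIF_RECV; on
NOTIF_HANGUP it would close the fd instead of blocking in the ioctl().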


* Re: For review: seccomp_user_notif(2) manual page
  2020-09-30 23:11       ` Jann Horn
@ 2020-09-30 23:24         ` Tycho Andersen
  2020-10-01  1:52           ` Jann Horn
  0 siblings, 1 reply; 52+ messages in thread
From: Tycho Andersen @ 2020-09-30 23:24 UTC (permalink / raw)
  To: Jann Horn
  Cc: Michael Kerrisk (man-pages),
	Sargun Dhillon, Kees Cook, Christian Brauner, linux-man, lkml,
	Aleksa Sarai, Alexei Starovoitov, Will Drewry, bpf, Song Liu,
	Daniel Borkmann, Andy Lutomirski, Linux Containers,
	Giuseppe Scrivano, Robert Sesek

On Thu, Oct 01, 2020 at 01:11:33AM +0200, Jann Horn wrote:
> On Thu, Oct 1, 2020 at 1:03 AM Tycho Andersen <tycho@tycho.pizza> wrote:
> > On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) wrote:
> > > On 9/30/20 5:03 PM, Tycho Andersen wrote:
> > > > On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
> > > >>        ┌─────────────────────────────────────────────────────┐
> > > >>        │FIXME                                                │
> > > >>        ├─────────────────────────────────────────────────────┤
> > > >>        │From my experiments,  it  appears  that  if  a  SEC‐ │
> > > >>        │COMP_IOCTL_NOTIF_RECV   is  done  after  the  target │
> > > >>        │process terminates, then the ioctl()  simply  blocks │
> > > >>        │(rather than returning an error to indicate that the │
> > > >>        │target process no longer exists).                    │
> > > >
> > > > Yeah, I think Christian wanted to fix this at some point,
> > >
> > > Do you have a pointer to that discussion? I could not find it with a
> > > quick search.
> > >
> > > > but it's a
> > > > bit sticky to do.
> > >
> > > Can you say a few words about the nature of the problem?
> >
> > I remembered wrong, it's actually in the tree: 99cdb8b9a573 ("seccomp:
> > notify about unused filter"). So maybe there's a bug here?
> 
> That thing only notifies on ->poll, it doesn't unblock ioctls; and
> Michael's sample code uses SECCOMP_IOCTL_NOTIF_RECV to wait. So that
> commit doesn't have any effect on this kind of usage.

Yes, thanks. And the ones stuck in RECV are waiting on a semaphore so
we don't have a count of all of them, unfortunately.

We could maybe look inside the wait_list, but that will probably make
people angry :)

Tycho


* Re: For review: seccomp_user_notif(2) manual page
  2020-09-30 11:07 For review: seccomp_user_notif(2) manual page Michael Kerrisk (man-pages)
  2020-09-30 15:03 ` Tycho Andersen
  2020-09-30 15:53 ` Jann Horn
@ 2020-09-30 23:39 ` Kees Cook
  2020-10-15 11:24   ` Michael Kerrisk (man-pages)
  2020-10-01 12:36 ` Christian Brauner
  2020-10-01 21:06 ` Sargun Dhillon
  4 siblings, 1 reply; 52+ messages in thread
From: Kees Cook @ 2020-09-30 23:39 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Tycho Andersen, Sargun Dhillon, Christian Brauner, linux-man,
	lkml, Aleksa Sarai, Jann Horn, Alexei Starovoitov, wad, bpf,
	Song Liu, Daniel Borkmann, Andy Lutomirski, Linux Containers,
	Giuseppe Scrivano, Robert Sesek

On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
> [...] I did :-)

Yay! Thank you!

> [...]
>    Overview
>        In conventional usage of a seccomp filter, the decision about how
>        to  treat  a particular system call is made by the filter itself.
>        The user-space notification mechanism allows the handling of  the
>        system  call  to  instead  be handed off to a user-space process.
>        The advantages of doing this are that, by contrast with the  sec‐
>        comp  filter,  which  is  running on a virtual machine inside the
>        kernel, the user-space process has access to information that  is
>        unavailable to the seccomp filter and it can perform actions that
>        can't be performed from the seccomp filter.

I might clarify a bit with something like (though maybe the
target/supervisor paragraph needs to be moved to the start):

	This is used for performing syscalls on behalf of the target,
	rather than having the supervisor make security policy decisions
	about the syscall, which would be inherently race-prone. The
	target's syscall should either be handled by the supervisor or
	allowed to continue normally in the kernel (where standard security
	policies will be applied).

I'll comment more later, but I've run out of time today and I didn't see
anyone mention this detail yet in the existing threads... :)

-- 
Kees Cook


* Re: For review: seccomp_user_notif(2) manual page
  2020-09-30 23:24         ` Tycho Andersen
@ 2020-10-01  1:52           ` Jann Horn
  2020-10-01  2:14             ` Jann Horn
                               ` (2 more replies)
  0 siblings, 3 replies; 52+ messages in thread
From: Jann Horn @ 2020-10-01  1:52 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Michael Kerrisk (man-pages),
	Sargun Dhillon, Kees Cook, Christian Brauner, linux-man, lkml,
	Aleksa Sarai, Alexei Starovoitov, Will Drewry, bpf, Song Liu,
	Daniel Borkmann, Andy Lutomirski, Linux Containers,
	Giuseppe Scrivano, Robert Sesek

On Thu, Oct 1, 2020 at 1:25 AM Tycho Andersen <tycho@tycho.pizza> wrote:
> On Thu, Oct 01, 2020 at 01:11:33AM +0200, Jann Horn wrote:
> > On Thu, Oct 1, 2020 at 1:03 AM Tycho Andersen <tycho@tycho.pizza> wrote:
> > > On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) wrote:
> > > > On 9/30/20 5:03 PM, Tycho Andersen wrote:
> > > > > On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
> > > > >>        ┌─────────────────────────────────────────────────────┐
> > > > >>        │FIXME                                                │
> > > > >>        ├─────────────────────────────────────────────────────┤
> > > > >>        │From my experiments,  it  appears  that  if  a  SEC‐ │
> > > > >>        │COMP_IOCTL_NOTIF_RECV   is  done  after  the  target │
> > > > >>        │process terminates, then the ioctl()  simply  blocks │
> > > > >>        │(rather than returning an error to indicate that the │
> > > > >>        │target process no longer exists).                    │
> > > > >
> > > > > Yeah, I think Christian wanted to fix this at some point,
> > > >
> > > > Do you have a pointer to that discussion? I could not find it with a
> > > > quick search.
> > > >
> > > > > but it's a
> > > > > bit sticky to do.
> > > >
> > > > Can you say a few words about the nature of the problem?
> > >
> > > I remembered wrong, it's actually in the tree: 99cdb8b9a573 ("seccomp:
> > > notify about unused filter"). So maybe there's a bug here?
> >
> > That thing only notifies on ->poll, it doesn't unblock ioctls; and
> > Michael's sample code uses SECCOMP_IOCTL_NOTIF_RECV to wait. So that
> > commit doesn't have any effect on this kind of usage.
>
> Yes, thanks. And the ones stuck in RECV are waiting on a semaphore so
> we don't have a count of all of them, unfortunately.
>
> We could maybe look inside the wait_list, but that will probably make
> people angry :)

The easiest way would probably be to open-code the semaphore-ish part,
and let the semaphore and poll share the waitqueue. The current code
kind of mirrors the semaphore's waitqueue in the wqh - open-coding the
entire semaphore would IMO be cleaner than that. And it's not like
semaphore semantics are even a good fit for this code anyway.

Let's see... if we didn't have the existing UAPI to worry about, I'd
do it as follows (*completely* untested). That way, the ioctl would
block exactly until either there actually is a request to deliver or
there are no more users of the filter. The problem is that if we just
apply this patch, existing users of SECCOMP_IOCTL_NOTIF_RECV that use
an event loop and don't set O_NONBLOCK will be screwed. So we'd
probably also have to add some stupid counter in place of the
semaphore's counter that we can use to preserve the old behavior of
returning -ENOENT once for each cancelled request. :(

I guess this is a nice point in favor of Michael's usual complaint
that if there are no man pages for a feature by the time the feature
lands upstream, there's a higher chance that the UAPI will suck
forever...



diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 676d4af62103..f0f4c68e0bc6 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -138,7 +138,6 @@ struct seccomp_kaddfd {
  * @notifications: A list of struct seccomp_knotif elements.
  */
 struct notification {
-       struct semaphore request;
        u64 next_id;
        struct list_head notifications;
 };
@@ -859,7 +858,6 @@ static int seccomp_do_user_notification(int this_syscall,
        list_add(&n.list, &match->notif->notifications);
        INIT_LIST_HEAD(&n.addfd);

-       up(&match->notif->request);
        wake_up_poll(&match->wqh, EPOLLIN | EPOLLRDNORM);
        mutex_unlock(&match->notify_lock);

@@ -1175,9 +1173,10 @@ find_notification(struct seccomp_filter *filter, u64 id)


 static long seccomp_notify_recv(struct seccomp_filter *filter,
-                               void __user *buf)
+                               void __user *buf, bool blocking)
 {
        struct seccomp_knotif *knotif = NULL, *cur;
+       DECLARE_WAITQUEUE(wait, current);
        struct seccomp_notif unotif;
        ssize_t ret;

@@ -1190,11 +1189,9 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,

        memset(&unotif, 0, sizeof(unotif));

-       ret = down_interruptible(&filter->notif->request);
-       if (ret < 0)
-               return ret;
-
        mutex_lock(&filter->notify_lock);
+
+retry:
        list_for_each_entry(cur, &filter->notif->notifications, list) {
                if (cur->state == SECCOMP_NOTIFY_INIT) {
                        knotif = cur;
@@ -1202,14 +1199,32 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
                }
        }

-       /*
-        * If we didn't find a notification, it could be that the task was
-        * interrupted by a fatal signal between the time we were woken and
-        * when we were able to acquire the rw lock.
-        */
        if (!knotif) {
-               ret = -ENOENT;
-               goto out;
+               /* This has to happen before checking &filter->users. */
+               prepare_to_wait(&filter->wqh, &wait, TASK_INTERRUPTIBLE);
+
+               /*
+                * If all users of the filter are gone, throw an error instead
+                * of pointlessly continuing to block.
+                */
+               if (refcount_read(&filter->users) == 0) {
+                       ret = -ENOTCONN;
+                       goto out;
+               }
+               if (blocking) {
+                       /* No notifications pending - wait for one, then retry. */
+                       mutex_unlock(&filter->notify_lock);
+                       schedule();
+                       mutex_lock(&filter->notify_lock);
+                       if (signal_pending(current)) {
+                               ret = -EINTR;
+                               goto out;
+                       }
+                       goto retry;
+               } else {
+                       ret = -ENOENT;
+                       goto out;
+               }
        }

        unotif.id = knotif->id;
@@ -1220,6 +1235,7 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
        wake_up_poll(&filter->wqh, EPOLLOUT | EPOLLWRNORM);
        ret = 0;
 out:
+       finish_wait(&filter->wqh, &wait);
        mutex_unlock(&filter->notify_lock);

        if (ret == 0 && copy_to_user(buf, &unotif, sizeof(unotif))) {
@@ -1233,10 +1249,8 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
                 */
                mutex_lock(&filter->notify_lock);
                knotif = find_notification(filter, unotif.id);
-               if (knotif) {
+               if (knotif)
                        knotif->state = SECCOMP_NOTIFY_INIT;
-                       up(&filter->notif->request);
-               }
                mutex_unlock(&filter->notify_lock);
        }

@@ -1412,11 +1426,12 @@ static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
 {
        struct seccomp_filter *filter = file->private_data;
        void __user *buf = (void __user *)arg;
+       bool blocking = !(file->f_flags & O_NONBLOCK);

        /* Fixed-size ioctls */
        switch (cmd) {
        case SECCOMP_IOCTL_NOTIF_RECV:
-               return seccomp_notify_recv(filter, buf);
+               return seccomp_notify_recv(filter, buf, blocking);
        case SECCOMP_IOCTL_NOTIF_SEND:
                return seccomp_notify_send(filter, buf);
        case SECCOMP_IOCTL_NOTIF_ID_VALID_WRONG_DIR:
@@ -1485,7 +1500,6 @@ static struct file *init_listener(struct seccomp_filter *filter)
        if (!filter->notif)
                goto out;

-       sema_init(&filter->notif->request, 0);
        filter->notif->next_id = get_random_u64();
        INIT_LIST_HEAD(&filter->notif->notifications);


* Re: For review: seccomp_user_notif(2) manual page
  2020-10-01  1:52           ` Jann Horn
@ 2020-10-01  2:14             ` Jann Horn
  2020-10-25 16:31               ` Michael Kerrisk (man-pages)
  2020-10-01  7:49             ` Michael Kerrisk (man-pages)
  2020-10-26  0:32             ` Kees Cook
  2 siblings, 1 reply; 52+ messages in thread
From: Jann Horn @ 2020-10-01  2:14 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Michael Kerrisk (man-pages),
	Sargun Dhillon, Kees Cook, Christian Brauner, linux-man, lkml,
	Aleksa Sarai, Alexei Starovoitov, Will Drewry, bpf, Song Liu,
	Daniel Borkmann, Andy Lutomirski, Linux Containers,
	Giuseppe Scrivano, Robert Sesek

On Thu, Oct 1, 2020 at 3:52 AM Jann Horn <jannh@google.com> wrote:
> On Thu, Oct 1, 2020 at 1:25 AM Tycho Andersen <tycho@tycho.pizza> wrote:
> > On Thu, Oct 01, 2020 at 01:11:33AM +0200, Jann Horn wrote:
> > > On Thu, Oct 1, 2020 at 1:03 AM Tycho Andersen <tycho@tycho.pizza> wrote:
> > > > On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) wrote:
> > > > > On 9/30/20 5:03 PM, Tycho Andersen wrote:
> > > > > > On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
> > > > > >>        ┌─────────────────────────────────────────────────────┐
> > > > > >>        │FIXME                                                │
> > > > > >>        ├─────────────────────────────────────────────────────┤
> > > > > >>        │From my experiments,  it  appears  that  if  a  SEC‐ │
> > > > > >>        │COMP_IOCTL_NOTIF_RECV   is  done  after  the  target │
> > > > > >>        │process terminates, then the ioctl()  simply  blocks │
> > > > > >>        │(rather than returning an error to indicate that the │
> > > > > >>        │target process no longer exists).                    │
> > > > > >
> > > > > > Yeah, I think Christian wanted to fix this at some point,
> > > > >
> > > > > Do you have a pointer to that discussion? I could not find it with a
> > > > > quick search.
> > > > >
> > > > > > but it's a
> > > > > > bit sticky to do.
> > > > >
> > > > > Can you say a few words about the nature of the problem?
> > > >
> > > > I remembered wrong, it's actually in the tree: 99cdb8b9a573 ("seccomp:
> > > > notify about unused filter"). So maybe there's a bug here?
> > >
> > > That thing only notifies on ->poll, it doesn't unblock ioctls; and
> > > Michael's sample code uses SECCOMP_IOCTL_NOTIF_RECV to wait. So that
> > > commit doesn't have any effect on this kind of usage.
> >
> > Yes, thanks. And the ones stuck in RECV are waiting on a semaphore so
> > we don't have a count of all of them, unfortunately.
> >
> > We could maybe look inside the wait_list, but that will probably make
> > people angry :)
>
> The easiest way would probably be to open-code the semaphore-ish part,
> and let the semaphore and poll share the waitqueue. The current code
> kind of mirrors the semaphore's waitqueue in the wqh - open-coding the
> entire semaphore would IMO be cleaner than that. And it's not like
> semaphore semantics are even a good fit for this code anyway.
>
> Let's see... if we didn't have the existing UAPI to worry about, I'd
> do it as follows (*completely* untested). That way, the ioctl would
> block exactly until either there actually is a request to deliver or
> there are no more users of the filter. The problem is that if we just
> apply this patch, existing users of SECCOMP_IOCTL_NOTIF_RECV that use
> an event loop and don't set O_NONBLOCK will be screwed. So we'd
> probably also have to add some stupid counter in place of the
> semaphore's counter that we can use to preserve the old behavior of
> returning -ENOENT once for each cancelled request. :(
>
> I guess this is a nice point in favor of Michael's usual complaint
> that if there are no man pages for a feature by the time the feature
> lands upstream, there's a higher chance that the UAPI will suck
> forever...

And I guess this would be the UAPI-compatible version - not actually
as terrible as I thought it might be. Do y'all want this? If so, feel
free to either turn this into a proper patch with Co-developed-by, or
tell me that I should do it and I'll try to get around to turning it
into something proper.

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 676d4af62103..d08c453fcc2c 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -138,7 +138,7 @@ struct seccomp_kaddfd {
  * @notifications: A list of struct seccomp_knotif elements.
  */
 struct notification {
-       struct semaphore request;
+       bool canceled_reqs;
        u64 next_id;
        struct list_head notifications;
 };
@@ -859,7 +859,6 @@ static int seccomp_do_user_notification(int this_syscall,
        list_add(&n.list, &match->notif->notifications);
        INIT_LIST_HEAD(&n.addfd);

-       up(&match->notif->request);
        wake_up_poll(&match->wqh, EPOLLIN | EPOLLRDNORM);
        mutex_unlock(&match->notify_lock);

@@ -901,8 +900,20 @@ static int seccomp_do_user_notification(int this_syscall,
         * *reattach* to a notifier right now. If one is added, we'll need to
         * keep track of the notif itself and make sure they match here.
         */
-       if (match->notif)
+       if (match->notif) {
                list_del(&n.list);
+
+               /*
+                * We are stuck with a UAPI that requires that after a spurious
+                * wakeup, SECCOMP_IOCTL_NOTIF_RECV must return immediately.
+                * This is the tracking for that, keeping track of whether we
+                * canceled a request after waking waiters, but before userspace
+                * picked up the notification.
+                */
+               if (n.state == SECCOMP_NOTIFY_INIT)
+                       match->notif->canceled_reqs = true;
+       }
+
 out:
        mutex_unlock(&match->notify_lock);

@@ -1178,6 +1189,7 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
                                void __user *buf)
 {
        struct seccomp_knotif *knotif = NULL, *cur;
+       DECLARE_WAITQUEUE(wait, current);
        struct seccomp_notif unotif;
        ssize_t ret;

@@ -1190,11 +1202,9 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,

        memset(&unotif, 0, sizeof(unotif));

-       ret = down_interruptible(&filter->notif->request);
-       if (ret < 0)
-               return ret;
-
        mutex_lock(&filter->notify_lock);
+
+retry:
        list_for_each_entry(cur, &filter->notif->notifications, list) {
                if (cur->state == SECCOMP_NOTIFY_INIT) {
                        knotif = cur;
@@ -1202,14 +1212,32 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
                }
        }

-       /*
-        * If we didn't find a notification, it could be that the task was
-        * interrupted by a fatal signal between the time we were woken and
-        * when we were able to acquire the rw lock.
-        */
        if (!knotif) {
-               ret = -ENOENT;
-               goto out;
+               /* This has to happen before checking &filter->users. */
+               prepare_to_wait(&filter->wqh, &wait, TASK_INTERRUPTIBLE);
+
+               /*
+                * If all users of the filter are gone, throw an error instead
+                * of pointlessly continuing to block.
+                */
+               if (refcount_read(&filter->users) == 0) {
+                       ret = -ENOTCONN;
+                       goto out;
+               }
+               if (filter->notif->canceled_reqs) {
+                       ret = -ENOENT;
+                       goto out;
+               } else {
+                       /* No notifications pending - wait for one, then retry. */
+                       mutex_unlock(&filter->notify_lock);
+                       schedule();
+                       mutex_lock(&filter->notify_lock);
+                       if (signal_pending(current)) {
+                               ret = -EINTR;
+                               goto out;
+                       }
+                       goto retry;
+               }
        }

        unotif.id = knotif->id;
@@ -1220,6 +1248,8 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
        wake_up_poll(&filter->wqh, EPOLLOUT | EPOLLWRNORM);
        ret = 0;
 out:
+       filter->notif->canceled_reqs = false;
+       finish_wait(&filter->wqh, &wait);
        mutex_unlock(&filter->notify_lock);

        if (ret == 0 && copy_to_user(buf, &unotif, sizeof(unotif))) {
@@ -1233,10 +1263,8 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
                 */
                mutex_lock(&filter->notify_lock);
                knotif = find_notification(filter, unotif.id);
-               if (knotif) {
+               if (knotif)
                        knotif->state = SECCOMP_NOTIFY_INIT;
-                       up(&filter->notif->request);
-               }
                mutex_unlock(&filter->notify_lock);
        }

@@ -1485,7 +1513,6 @@ static struct file *init_listener(struct seccomp_filter *filter)
        if (!filter->notif)
                goto out;

-       sema_init(&filter->notif->request, 0);
        filter->notif->next_id = get_random_u64();
        INIT_LIST_HEAD(&filter->notif->notifications);


* Re: For review: seccomp_user_notif(2) manual page
  2020-09-30 23:03     ` Tycho Andersen
  2020-09-30 23:11       ` Jann Horn
@ 2020-10-01  7:45       ` Michael Kerrisk (man-pages)
  2020-10-14  4:40         ` Michael Kerrisk (man-pages)
  1 sibling, 1 reply; 52+ messages in thread
From: Michael Kerrisk (man-pages) @ 2020-10-01  7:45 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: mtk.manpages, Sargun Dhillon, Kees Cook, Christian Brauner,
	linux-man, lkml, Aleksa Sarai, Jann Horn, Alexei Starovoitov,
	wad, bpf, Song Liu, Daniel Borkmann, Andy Lutomirski,
	Linux Containers, Giuseppe Scrivano, Robert Sesek

On 10/1/20 1:03 AM, Tycho Andersen wrote:
> On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) wrote:
>> Hi Tycho,
>>
>> Thanks for taking time to look at the page!
>>
>> On 9/30/20 5:03 PM, Tycho Andersen wrote:
>>> On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:

[...]

>>>>        ┌─────────────────────────────────────────────────────┐
>>>>        │FIXME                                                │
>>>>        ├─────────────────────────────────────────────────────┤
>>>>        │Interestingly, after the event  had  been  received, │
>>>>        │the  file descriptor indicates as writable (verified │
>>>>        │from the source code and by experiment). How is this │
>>>>        │useful?                                              │
>>>
>>> You're saying it should just do EPOLLOUT and not EPOLLWRNORM? Seems
>>> reasonable.
>>
>> No, I'm saying something more fundamental: why is the FD indicating as
>> writable? Can you write something to it? If yes, what? If not, then
>> why do these APIs want to say that the FD is writable?
> 
> You can't via read(2) or write(2), but conceptually NOTIFY_RECV and
> NOTIFY_SEND are reading and writing events from the fd. I don't know
> that much about the poll interface though -- is it possible to
> indicate "here's a pseudo-read event"? It didn't look like it, so I
> just (ab-)used POLLIN and POLLOUT, but probably that's wrong.

I think the POLLIN thing is fine.

So, I think maybe I now understand what you intended with setting
POLLOUT: the notification has been received ("read") and now the
FD can be used to NOTIFY_SEND ("write") a response. Right?

If that's correct, I don't have a problem with it. I just wonder:
is it useful? IOW: are there situations where the process doing the
NOTIFY_SEND might want to test for POLLOUT because it doesn't
know whether a NOTIFY_RECV has occurred? 

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


* Re: For review: seccomp_user_notif(2) manual page
  2020-10-01  1:52           ` Jann Horn
  2020-10-01  2:14             ` Jann Horn
@ 2020-10-01  7:49             ` Michael Kerrisk (man-pages)
  2020-10-26  0:32             ` Kees Cook
  2 siblings, 0 replies; 52+ messages in thread
From: Michael Kerrisk (man-pages) @ 2020-10-01  7:49 UTC (permalink / raw)
  To: Jann Horn, Tycho Andersen
  Cc: mtk.manpages, Sargun Dhillon, Kees Cook, Christian Brauner,
	linux-man, lkml, Aleksa Sarai, Alexei Starovoitov, Will Drewry,
	bpf, Song Liu, Daniel Borkmann, Andy Lutomirski,
	Linux Containers, Giuseppe Scrivano, Robert Sesek

On 10/1/20 3:52 AM, Jann Horn wrote:

[...]

> I guess this is a nice point in favor of Michael's usual complaint
> that if there are no man pages for a feature by the time the feature
> lands upstream, there's a higher chance that the UAPI will suck
> forever...

Thanks for saving me the trouble of saying that (again).

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


* Re: For review: seccomp_user_notif(2) manual page
  2020-09-30 11:07 For review: seccomp_user_notif(2) manual page Michael Kerrisk (man-pages)
                   ` (2 preceding siblings ...)
  2020-09-30 23:39 ` Kees Cook
@ 2020-10-01 12:36 ` Christian Brauner
  2020-10-15 11:23   ` Michael Kerrisk (man-pages)
  2020-10-01 21:06 ` Sargun Dhillon
  4 siblings, 1 reply; 52+ messages in thread
From: Christian Brauner @ 2020-10-01 12:36 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Tycho Andersen, Sargun Dhillon, linux-man, Song Liu, wad,
	Kees Cook, Daniel Borkmann, Jann Horn, Robert Sesek,
	Linux Containers, lkml, Alexei Starovoitov, Giuseppe Scrivano,
	bpf, Andy Lutomirski, Christian Brauner

[I'm on vacation so I'll just give this a quick glance for now.]

On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
> Hi Tycho, Sargun (and all),
> 
> I knew it would be a big ask, but below is kind of the manual page
> I was hoping you might write [1] for the seccomp user-space notification
> mechanism. Since you didn't (and because 5.9 adds various new pieces 
> such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD 
> that also will need documenting [2]), I did :-). But of course I may 
> have made mistakes...
> 
> I've shown the rendered version of the page below, and would love
> to receive review comments from you and others, and acks, etc.
> 
> There are a few FIXMEs sprinkled into the page, including one
> that relates to what appears to me to be a misdesign (possibly 
> fixable) in the operation of the SECCOMP_IOCTL_NOTIF_RECV 
> operation. I would be especially interested in feedback on that
> FIXME, and also of course the other FIXMEs.
> 
> The page includes an extensive (albeit slightly contrived)
> example program, and I would be happy also to receive comments
> on that program.
> 
> The page source currently sits in a branch (along with the text
> that you sent me for the seccomp(2) page) at
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=seccomp_user_notif
> 
> Thanks,
> 
> Michael
> 
> [1] https://lore.kernel.org/linux-man/2cea5fec-e73e-5749-18af-15c35a4bd23c@gmail.com/#t
> [2] Sargun, can you prepare something on SECCOMP_ADDFD_FLAG_SETFD
>     and SECCOMP_IOCTL_NOTIF_ADDFD to be added to this page?
> 
> =====
> 
> NAME
>        seccomp_user_notif - Seccomp user-space notification mechanism
> 
> SYNOPSIS
>        #include <linux/seccomp.h>
>        #include <linux/filter.h>
>        #include <linux/audit.h>
> 
>        int seccomp(unsigned int operation, unsigned int flags, void *args);
> 
> DESCRIPTION
>        This  page  describes  the user-space notification mechanism pro‐
>        vided by the Secure Computing (seccomp) facility.  As well as the
>        use   of  the  SECCOMP_FILTER_FLAG_NEW_LISTENER  flag,  the  SEC‐
>        COMP_RET_USER_NOTIF action value, and the SECCOMP_GET_NOTIF_SIZES
>        operation  described  in  seccomp(2), this mechanism involves the
>        use of a number of related ioctl(2) operations (described below).
> 
>    Overview
>        In conventional usage of a seccomp filter, the decision about how
>        to  treat  a particular system call is made by the filter itself.
>        The user-space notification mechanism allows the handling of  the
>        system  call  to  instead  be handed off to a user-space process.

"In contrast, the user notification mechanism allows the handling of
a system call made by one process (the target) to be delegated to
another user-space process (the supervisor)."?

>        The advantages of doing this are that, by contrast with the  sec‐
>        comp  filter,  which  is  running on a virtual machine inside the
>        kernel, the user-space process has access to information that  is
>        unavailable to the seccomp filter and it can perform actions that
>        can't be performed from the seccomp filter.

This section is a bit difficult to read imho:
"A suitably privileged supervisor can use the user notification
mechanism to perform actions in lieu of the target. The supervisor will
usually be able to retrieve information about the target and the
performed system call that the seccomp filter itself cannot."

> 
>        In the discussion that follows, the process  that  has  installed
>        the  seccomp filter is referred to as the target, and the process
>        that is notified by  the  user-space  notification  mechanism  is
>        referred  to  as  the  supervisor.  An overview of the steps per‐
>        formed by these two processes is as follows:
> 
>        1. The target process establishes a seccomp filter in  the  usual
>           manner, but with two differences:
> 
>           · The seccomp(2) flags argument includes the flag SECCOMP_FIL‐
>             TER_FLAG_NEW_LISTENER.  Consequently, the return  value   of
>             the  (successful)  seccomp(2) call is a new "listening" file
>             descriptor that can be used to receive notifications.

I think it would be good to mention that seccomp notify fds are
O_CLOEXEC by default somewhere.
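
To make step 1 and the close-on-exec point concrete, here is a minimal
sketch, not a definitive implementation. It builds a filter that returns
SECCOMP_RET_USER_NOTIF for a single system call, installs it with
SECCOMP_FILTER_FLAG_NEW_LISTENER, and offers a helper to check
FD_CLOEXEC on the returned descriptor. The names build_notify_filter(),
install_notify_filter() and fd_is_cloexec() are made up for this
illustration. The architecture check that a real filter needs is
deliberately left out to keep the sketch short.

```c
#include <fcntl.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#define FILTER_LEN 4

/* Fill 'f' (FILTER_LEN entries) with a filter that requests a
   user-space notification for system call number 'nr' and allows
   everything else.
   ASSUMPTION: no seccomp_data.arch check; a real filter needs one. */
void build_notify_filter(struct sock_filter *f, int nr)
{
    struct sock_filter prog[FILTER_LEN] = {
        /* Load the system call number */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
        /* If it equals 'nr', fall through; otherwise skip one insn */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, nr, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };

    for (int i = 0; i < FILTER_LEN; i++)
        f[i] = prog[i];
}

/* Install the filter in the calling thread.
   Returns the listening file descriptor, or -1 with errno set. */
int install_notify_filter(int nr)
{
    struct sock_filter filter[FILTER_LEN];
    struct sock_fprog prog = { .len = FILTER_LEN, .filter = filter };

    build_notify_filter(filter, nr);

    /* Needed unless the caller has CAP_SYS_ADMIN */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1)
        return -1;

    return (int) syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
                         SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
}

/* Returns 1 if FD_CLOEXEC is set on 'fd', 0 if not, -1 on error. */
int fd_is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    return flags == -1 ? -1 : (flags & FD_CLOEXEC) != 0;
}
```

A target could call install_notify_filter(SYS_mkdirat), say, and pass
the result to fd_is_cloexec() to confirm the flag before handing the
descriptor to the supervisor.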

> 
>           · In cases where it is appropriate, the seccomp filter returns
>             the  action value SECCOMP_RET_USER_NOTIF.  This return value
>             will trigger a notification event.
> 
>        2. In order that the supervisor process can obtain  notifications
>           using  the  listening  file  descriptor, (a duplicate of) that
>           file descriptor must be passed from the target process to  the
>           supervisor process.  One way in which this could be done is by
>           passing the file descriptor over a UNIX domain socket  connec‐
>           tion between the two processes (using the SCM_RIGHTS ancillary
>           message type described in unix(7)).   Another  possibility  is
>           that  the  supervisor  might  inherit  the file descriptor via
>           fork(2).

I think a few people have already pointed out other ways of retrieving
an fd. :)
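
For the SCM_RIGHTS variant that the quoted text does describe, a rough
sketch could look like the following. send_fd() and recv_fd() are
illustrative names, not part of any API, and error handling is reduced
to returning -1.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send file descriptor 'fd' over the connected UNIX domain socket
   'sock'.  Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
    char dummy = 'x';   /* one byte of real data accompanies the fd */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;       /* forces suitable alignment */
    } ctl;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;

    memset(&ctl, 0, sizeof(ctl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a file descriptor from 'sock'.
   Returns the new descriptor, or -1 on error. */
int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctl;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;
    int fd;

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The target would call send_fd() with the listening descriptor after
installing the filter; the supervisor would call recv_fd() on the other
end of the socket.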

> 
>        3. The supervisor process will receive notification events on the
>           listening  file  descriptor.   These  events  are  returned as
>           structures of type seccomp_notif.  Because this structure  and
>           its  size may evolve over kernel versions, the supervisor must
>           first determine the size of  this  structure  using  the  sec‐
>           comp(2)  SECCOMP_GET_NOTIF_SIZES  operation,  which  returns a
>           structure of type seccomp_notif_sizes.  The  supervisor  allo‐
>           cates a buffer of size seccomp_notif_sizes.seccomp_notif bytes
>           to receive notification events.   In  addition, the supervisor
>           allocates  another  buffer  of  size  seccomp_notif_sizes.sec‐
>           comp_notif_resp  bytes  for  the  response  (a   struct   sec‐
>           comp_notif_resp  structure) that it will provide to the kernel
>           (and thus the target process).
> 
>        4. The target process then performs its workload, which  includes
>           system  calls  that  will be controlled by the seccomp filter.
>           Whenever one of these system calls causes the filter to return
>           the  SECCOMP_RET_USER_NOTIF  action value, the kernel does not
>           execute the system call;  instead,  execution  of  the  target
>           process is temporarily blocked inside the kernel and a notifi‐

Maybe mention that the task is killable when so blocked?

>           cation event is generated on the listening file descriptor.
> 
>        5. The supervisor process can now repeatedly monitor the  listen‐
>           ing   file   descriptor  for  SECCOMP_RET_USER_NOTIF-triggered
>           events.   To  do  this,   the   supervisor   uses   the   SEC‐
>           COMP_IOCTL_NOTIF_RECV  ioctl(2)  operation to read information
>           about a notification event; this  operation  blocks  until  an
>           event  is  available.   The  operation returns a seccomp_notif
>           structure containing information about the system call that is
>           being attempted by the target process.
> 
>        6. The    seccomp_notif    structure   returned   by   the   SEC‐
>           COMP_IOCTL_NOTIF_RECV operation includes the same  information
>           (a seccomp_data structure) that was passed to the seccomp fil‐
>           ter.  This information allows the supervisor to  discover  the
>           system  call number and the arguments for the target process's
>           system call.  In addition, the notification event contains the
>           PID of the target process.

(Technically TID.)
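
Steps 3 and 5 of the quoted overview might be sketched roughly as
below. alloc_notif_buffers() and recv_notif() are hypothetical helper
names. Allocating at least sizeof() of the structures known at compile
time is an extra precaution of this sketch, not something the page
requires.

```c
#include <linux/seccomp.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Ask the kernel for the structure sizes and allocate zeroed request
   and response buffers.  Returns 0 on success, -1 on error. */
int alloc_notif_buffers(struct seccomp_notif_sizes *sizes,
                        struct seccomp_notif **req,
                        struct seccomp_notif_resp **resp)
{
    size_t req_sz, resp_sz;

    if (syscall(SYS_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, sizes) == -1)
        return -1;

    /* Never allocate less than this program's own structure sizes */
    req_sz = sizes->seccomp_notif;
    if (req_sz < sizeof(struct seccomp_notif))
        req_sz = sizeof(struct seccomp_notif);
    resp_sz = sizes->seccomp_notif_resp;
    if (resp_sz < sizeof(struct seccomp_notif_resp))
        resp_sz = sizeof(struct seccomp_notif_resp);

    *req = calloc(1, req_sz);
    *resp = calloc(1, resp_sz);
    if (*req == NULL || *resp == NULL) {
        free(*req);
        free(*resp);
        *req = NULL;
        *resp = NULL;
        return -1;
    }
    return 0;
}

/* Block until a notification arrives on 'notify_fd'.  The buffer is
   zeroed first, as the kernel requires.  On success, req->pid holds
   the thread ID of the thread that made the system call.
   Returns 0 on success, -1 with errno set on error. */
int recv_notif(int notify_fd, struct seccomp_notif *req, size_t req_size)
{
    memset(req, 0, req_size);
    return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, req);
}
```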

> 
>           The  information  in  the notification can be used to discover
>           the values of pointer arguments for the target process's  sys‐
>           tem call.  (This is something that can't be done from within a
>           seccomp filter.)  To do this (and  assuming  it  has  suitable
>           permissions),   the   supervisor   opens   the   corresponding
>           /proc/[pid]/mem file, seeks to the memory location that corre‐
>           sponds to one of the pointer arguments whose value is supplied
>           in the notification event, and reads bytes from that location.
>           (The supervisor must be careful to avoid a race condition that
>           can occur when doing this; see the  description  of  the  SEC‐
>           COMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.)  In addi‐
>           tion, the supervisor can access other system information  that
>           is  visible  in  user space but which is not accessible from a
>           seccomp filter.
> 
>           ┌─────────────────────────────────────────────────────┐
>           │FIXME                                                │
>           ├─────────────────────────────────────────────────────┤
>           │Suppose we are reading a pathname from /proc/PID/mem │
>           │for  a system call such as mkdir(). The pathname can │
>           │be an arbitrary length. How do we know how much (how │
>           │many pages) to read from /proc/PID/mem?              │
>           └─────────────────────────────────────────────────────┘

This has already been answered, I believe.
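
For reference, one possible approach (an assumption of this sketch, not
necessarily what was agreed elsewhere in the thread) is to read up to a
fixed bound such as PATH_MAX, accept short reads, and then insist on
finding a terminating NUL in what was read. read_target_string() is an
illustrative name.

```c
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Read a NUL-terminated string that starts at offset 'addr' of
   'mem_fd' (for example an open /proc/[pid]/mem) into 'buf', which
   holds 'len' bytes.  Short reads are tolerated, since the string may
   sit just before an unmapped page.
   Returns the string length, or -1 if no terminating NUL was found
   in the bytes that could be read. */
ssize_t read_target_string(int mem_fd, off_t addr, char *buf, size_t len)
{
    size_t done = 0;
    char *end;

    while (done < len) {
        ssize_t n = pread(mem_fd, buf + done, len - done,
                          addr + (off_t) done);
        if (n <= 0)
            break;          /* error or nothing more to read */
        done += (size_t) n;
        if (memchr(buf + done - n, '\0', (size_t) n) != NULL)
            break;          /* terminator is in this chunk */
    }

    end = memchr(buf, '\0', done);
    return end != NULL ? end - buf : -1;
}
```

A supervisor would typically pass a PATH_MAX-sized buffer and treat a
-1 return as a reason to fail the target's system call.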

> 
>        7. Having  obtained  information  as  per  the previous step, the
>           supervisor may then choose to perform an action in response to
>           the  target  process's  system call (which, as noted above, is
>           not  executed  when  the  seccomp  filter  returns  the   SEC‐
>           COMP_RET_USER_NOTIF action value).

Nit: It is not _yet_ executed; it may very well be if the response is
"continue". This should either mention that when the fd becomes
_RECVable the system call is guaranteed to not have executed yet or
specify that it is not yet executed, I think.

> 
>           One  example  use case here relates to containers.  The target
>           process may be located inside a container where  it  does  not
>           have sufficient capabilities to mount a filesystem in the con‐
>           tainer's mount namespace.  However, the supervisor  may  be  a
>           more  privileged  process  that  does have sufficient capa‐
>           bilities to perform the mount operation.
> 
>        8. The supervisor then sends a response to the notification.  The
>           information  in  this  response  is used by the kernel to con‐
>           struct a return value for the target process's system call and
>           provide a value that will be assigned to the errno variable of
>           the target process.
> 
>           The  response  is  sent  using  the   SECCOMP_IOCTL_NOTIF_SEND
>           ioctl(2)   operation,   which  is  used  to  transmit  a  sec‐
>           comp_notif_resp  structure  to  the  kernel.   This  structure
>           includes  a  cookie  value that the supervisor obtained in the
>           seccomp_notif    structure    returned     by     the     SEC‐
>           COMP_IOCTL_NOTIF_RECV operation.  This cookie value allows the
>           kernel to associate the response with the target process.

I think here or above you should mention that the id or "cookie" _must_
be used when a file descriptor to /proc/<pid>/mem or any /proc/<pid>/*
is opened:
fd = open(/proc/pid/*);
verify_via_cookie_that_pid_still_alive(cookie);
operate_on(fd)

Otherwise this is a potential security issue.
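
In C, the ordering sketched in the pseudocode above could look roughly
like this. open_target_mem() is a hypothetical helper name; the only
kernel interfaces used are open(2) and the SECCOMP_IOCTL_NOTIF_ID_VALID
ioctl.

```c
#include <fcntl.h>
#include <linux/seccomp.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Open /proc/[tid]/mem for the thread named in 'req', then use the
   cookie to confirm that the notification is still valid.
   Returns a descriptor that refers to the target's memory,
   or -1 if the open failed or the target is gone. */
int open_target_mem(int notify_fd, const struct seccomp_notif *req)
{
    char path[64];
    __u64 id = req->id;
    int fd;

    snprintf(path, sizeof(path), "/proc/%u/mem", (unsigned int) req->pid);
    fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    /* The check must come AFTER open(): if the ID is still valid now,
       the descriptor cannot refer to a recycled PID. */
    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == -1) {
        close(fd);
        return -1;
    }

    return fd;
}
```

Only after open_target_mem() has returned a valid descriptor should the
supervisor read from it.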

> 
>        9. Once the notification has been sent, the system  call  in  the
>           target  process  unblocks,  returning the information that was
>           provided by the supervisor in the notification response.
> 
>        As a variation on the last two steps, the supervisor can  send  a
>        response  that tells the kernel that it should execute the target
>        process's   system   call;   see   the   discussion    of    SEC‐
>        COMP_USER_NOTIF_FLAG_CONTINUE, below.
> 
>    ioctl(2) operations
>        The following ioctl(2) operations are provided to support seccomp
>        user-space notification.  For each of these operations, the first
>        (file  descriptor)  argument  of  ioctl(2)  is the listening file
>        descriptor returned by a call to seccomp(2) with the SECCOMP_FIL‐
>        TER_FLAG_NEW_LISTENER flag.
> 
>        SECCOMP_IOCTL_NOTIF_RECV
>               This operation is used to obtain a user-space notification
>               event.  If no such event is currently pending, the  opera‐
>               tion  blocks  until  an  event occurs.  The third ioctl(2)
>               argument is a pointer to a structure of the following form
>               which  contains  information about the event.  This struc‐
>               ture must be zeroed out before the call.
> 
>                   struct seccomp_notif {
>                       __u64  id;              /* Cookie */
>                       __u32  pid;             /* PID of target process */
>                       __u32  flags;           /* Currently unused (0) */
>                       struct seccomp_data data;   /* See seccomp(2) */
>                   };
> 
>               The fields in this structure are as follows:
> 
>               id     This is a cookie for the notification.   Each  such
>                      cookie  is  guaranteed  to be unique for the corre‐
>                      sponding seccomp  filter.   In  other  words,  this
>                      cookie  is  unique for each notification event from
>                      the target process.  The cookie value has the  fol‐
>                      lowing uses:
> 
>                      · It     can     be     used    with    the    SEC‐
>                        COMP_IOCTL_NOTIF_ID_VALID ioctl(2)  operation  to
>                        verify that the target process is still alive.
> 
>                      · When  returning  a  notification  response to the
>                        kernel, the supervisor must  include  the  cookie
>                        value in the seccomp_notif_resp structure that is
>                        specified   as   the   argument   of   the   SEC‐
>                        COMP_IOCTL_NOTIF_SEND operation.
> 
>               pid    This  is  the  PID of the target process that trig‐
>                      gered the notification event.
> 
>                      ┌─────────────────────────────────────────────────────┐
>                      │FIXME                                                │
>                      ├─────────────────────────────────────────────────────┤
>                      │This is a thread ID, rather than a PID, right?       │
>                      └─────────────────────────────────────────────────────┘

Yes.

> 
>               flags  This is a  bit  mask  of  flags  providing  further
>                      information on the event.  In the current implemen‐
>                      tation, this field is always zero.
> 
>               data   This is a seccomp_data structure containing  infor‐
>                      mation  about  the  system  call that triggered the
>                      notification.  This is the same structure  that  is
>                      passed  to  the seccomp filter.  See seccomp(2) for
>                      details of this structure.
> 
>               On success, this operation returns 0; on  failure,  -1  is
>               returned,  and  errno  is set to indicate the cause of the
>               error.  This operation can fail with the following errors:
> 
>               EINVAL (since Linux 5.5)
>                      The seccomp_notif structure that was passed to  the
>                      call contained nonzero fields.
> 
>               ENOENT The  target  process  was killed by a signal as the
>                      notification information was being generated.
> 
>        ┌─────────────────────────────────────────────────────┐
>        │FIXME                                                │
>        ├─────────────────────────────────────────────────────┤
>        │From my experiments,  it  appears  that  if  a  SEC‐ │
>        │COMP_IOCTL_NOTIF_RECV   is  done  after  the  target │
>        │process terminates, then the ioctl()  simply  blocks │
>        │(rather than returning an error to indicate that the │
>        │target process no longer exists).                    │
>        │                                                     │
>        │I found that surprising, and it required  some  con‐ │
>        │tortions  in the example program.  It was not possi‐ │
>        │ble to code my SIGCHLD handler (which reaps the zom‐ │
>        │bie  when  the  worker/target process terminates) to │
>        │simply set a flag checked in the main  handleNotifi‐ │
>        │cations()  loop,  since  this created an unavoidable │
>        │race where the child might terminate  just  after  I │
>        │had  checked  the  flag,  but before I blocked (for‐ │
>        │ever!) in  the  SECCOMP_IOCTL_NOTIF_RECV  operation. │
>        │Instead,  I had to code the signal handler to simply │
>        │call _exit(2)  in  order  to  terminate  the  parent │
>        │process (the supervisor).                            │
>        │                                                     │
>        │Is  this  expected  behavior?  It seems to me rather │
>        │desirable that SECCOMP_IOCTL_NOTIF_RECV should  give │
>        │an error if the target process has terminated.       │
>        └─────────────────────────────────────────────────────┘

This has been discussed later in the thread too, I believe. My patchset
fixed a different but related bug in ->poll() when a filter becomes
unused. I hadn't noticed this behavior since I'm always polling. (Pure
ioctls() feel a bit fishy to me. :) But obviously a valid use.)

> 
>        SECCOMP_IOCTL_NOTIF_ID_VALID
>               This operation can be used to check that a notification ID
>               returned by an earlier SECCOMP_IOCTL_NOTIF_RECV  operation
>               is  still  valid  (i.e.,  that  the  target  process still
>               exists).
> 
>               The third ioctl(2) argument is a  pointer  to  the  cookie
>               (id) returned by the SECCOMP_IOCTL_NOTIF_RECV operation.
> 
>               This  operation is necessary to avoid race conditions that
>               can  occur   when   the   pid   returned   by   the   SEC‐
>               COMP_IOCTL_NOTIF_RECV   operation   terminates,  and  that
>               process ID is reused by another process.   An  example  of
>               this kind of race is the following:
> 
>               1. A  notification  is  generated  on  the  listening file
>                  descriptor.  The returned  seccomp_notif  contains  the
>                  PID of the target process.
> 
>               2. The target process terminates.
> 
>               3. Another process is created on the system that by chance
>                  reuses the PID that was freed when the  target  process
>                  terminates.
> 
>               4. The  supervisor  open(2)s  the /proc/[pid]/mem file for
>                  the PID obtained in step 1, with the intention of (say)
>                  inspecting the memory locations that contain  the argu‐
>                  ments of the system call that triggered  the  notifica‐
>                  tion in step 1.
> 
>               In the above scenario, the risk is that the supervisor may
>               try to access the memory of a process other than the  tar‐
>               get.   This  race  can be avoided by following the call to
>               open with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to ver‐
>               ify  that  the  process that generated the notification is
>               still alive.  (Note that  if  the  target  process  subse‐
>               quently  terminates, its PID won't be reused because there
>               remains an open reference to the /proc/[pid]/mem file;  in
>               this  case, a subsequent read(2) from the file will return
>               0, indicating end of file.)
> 
>               On success (i.e., the notification  ID  is  still  valid),
>               this  operation  returns 0 On failure (i.e., the notifica‐

Missing a ".", I think.

>               tion ID is no longer valid), -1 is returned, and errno  is
>               set to ENOENT.
> 
>        SECCOMP_IOCTL_NOTIF_SEND
>               This  operation  is  used  to send a notification response
>               back to the kernel.  The third ioctl(2) argument  of  this
>               operation  is  a  pointer  to a structure of the following
>               form:
> 
>                   struct seccomp_notif_resp {
>                       __u64 id;               /* Cookie value */
>                       __s64 val;              /* Success return value */
>                       __s32 error;            /* 0 (success) or negative
>                                                  error number */
>                       __u32 flags;            /* See below */
>                   };
> 
>               The fields of this structure are as follows:
> 
>               id     This is the cookie value that  was  obtained  using
>                      the   SECCOMP_IOCTL_NOTIF_RECV   operation.    This
>                      cookie value allows the kernel to  correctly  asso‐
>                      ciate this response with the system call that trig‐
>                      gered the user-space notification.
> 
>               val    This is the value that will be used for  a  spoofed
>                      success  return  for  the  target  process's system
>                      call; see below.
> 
>               error  This is the value that will be used  as  the  error
>                      number  (errno)  for a spoofed error return for the
>                      target process's system call; see below.

Nit: "val" is only used when "error" is not set.
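
A small sketch of how a supervisor might fill in the two kinds of
spoofed response, to make the val/error relationship explicit. The
helper names fill_success_resp(), fill_error_resp() and send_resp() are
invented for this example.

```c
#include <linux/seccomp.h>
#include <string.h>
#include <sys/ioctl.h>

/* Spoof a successful return of 'val' for notification 'id'. */
void fill_success_resp(struct seccomp_notif_resp *resp, __u64 id,
                       __s64 val)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->val = val;        /* used only because error stays 0 */
}

/* Spoof a failure: the target's system call returns -1 with errno set
   to 'err', a positive errno value such as EPERM. */
void fill_error_resp(struct seccomp_notif_resp *resp, __u64 id, int err)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->error = -err;     /* the kernel expects a negative number */
    /* val stays 0; it is ignored when error is nonzero */
}

/* Send the response.  Returns 0 on success, -1 with errno set. */
int send_resp(int notify_fd, struct seccomp_notif_resp *resp)
{
    return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, resp);
}
```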

> 
>               flags  This is a bit mask that includes zero  or  more  of
>                      the following flags:
> 
>                      SECCOMP_USER_NOTIF_FLAG_CONTINUE (since Linux 5.5)
>                             Tell   the  kernel  to  execute  the  target
>                             process's system call.
> 
>               Two kinds of response are possible:
> 
>               · A response to the kernel telling it to execute the  tar‐
>                 get  process's  system  call.   In  this case, the flags
>                 field includes SECCOMP_USER_NOTIF_FLAG_CONTINUE and  the
>                 error and val fields must be zero.
> 
>                 This  kind  of response can be useful in cases where the
>                 supervisor needs to do deeper analysis of  the  target's
>                 system  call  than  is  possible  from  a seccomp filter
>                 (e.g., examining the values of pointer arguments),  and,
>                 having  verified that the system call is acceptable, the
>                 supervisor wants to allow it to proceed.

I think Jann has pointed this out. This needs to come with a big warning
and I would explicitly put a:
"The user notification mechanism cannot be used to implement a syscall
security policy in user space!"
You might want to take a look at the seccomp.h header file where I
placed a giant warning about how to use this too.
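
For completeness, a "continue" response could be built as in the sketch
below. fill_continue_resp() is a made-up name, and the comment restates
the caveat above rather than adding a new guarantee.

```c
#include <linux/seccomp.h>
#include <string.h>

/* Tell the kernel to execute the target's system call after all.
   The error and val fields must be zero for this kind of response.

   CAVEAT: the target (or another thread sharing its memory) can change
   the memory behind pointer arguments after the supervisor inspected
   it and before the kernel executes the call, so this must not be
   relied on to enforce a security policy. */
void fill_continue_resp(struct seccomp_notif_resp *resp, __u64 id)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
}
```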

> 
>               · A spoofed return value for the target  process's  system
>                 call.   In  this  case,  the kernel does not execute the
>                 target process's system call, instead causing the system
>                 call to return a spoofed value as specified by fields of
>                 the seccomp_notif_resp structure.  The supervisor should
>                 set the fields of this structure as follows:
> 
>                 +  flags  does  not contain SECCOMP_USER_NOTIF_FLAG_CON‐
>                    TINUE.
> 
>                 +  error is set either to  0  for  a  spoofed  "success"
>                    return  or  to  a negative error number for a spoofed
>                    "failure" return.  In the  former  case,  the  kernel
>                    causes the target process's system call to return the
>                    value specified in the val field.  In the latter case,
>                    the kernel causes the target process's system call to
>                    return -1, and errno is assigned  the  negated  error
>                    value.
> 
>                 +  val is set to a value that will be used as the return
>                    value for a spoofed "success" return for  the  target
>                    process's  system  call.   The value in this field is
>                    ignored if the error field contains a nonzero value.
> 
>               On success, this operation returns 0; on  failure,  -1  is
>               returned,  and  errno  is set to indicate the cause of the
>               error.  This operation can fail with the following errors:
> 
>               EINPROGRESS
>                      A response to this notification  has  already  been
>                      sent.
> 
>               EINVAL An invalid value was specified in the flags field.
> 
>               EINVAL The       flags      field      contained      SEC‐
>                      COMP_USER_NOTIF_FLAG_CONTINUE, and the error or val
>                      field was not zero.
> 
>               ENOENT The  blocked  system call in the target process has
>                      been interrupted by a signal handler.
> 
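
For readers following along, the three kinds of response described above
could be filled in roughly as below. This is only a sketch: the helper
names (resp_spoof_success() and friends) are made up for illustration and
are not part of any API, and the fallback #define assumes the 5.5 UAPI
value of the flag for builds against older headers.

```c
#include <errno.h>      /* EPERM etc., for callers of resp_spoof_failure() */
#include <string.h>
#include <linux/seccomp.h>

#ifndef SECCOMP_USER_NOTIF_FLAG_CONTINUE    /* headers older than Linux 5.5 */
#define SECCOMP_USER_NOTIF_FLAG_CONTINUE (1UL << 0)
#endif

/* Hypothetical helper: spoof a "success" return of 'val'.
   'error' stays 0 and 'flags' stays clear. */
static void
resp_spoof_success(struct seccomp_notif_resp *resp, __u64 id, __s64 val)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->val = val;
}

/* Hypothetical helper: spoof a "failure" return. 'err' is a positive
   errno value such as EPERM; the field holds the negated value, and
   'val' is left at 0 because it is ignored when 'error' is nonzero. */
static void
resp_spoof_failure(struct seccomp_notif_resp *resp, __u64 id, int err)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->error = -err;
}

/* Hypothetical helper: ask the kernel to execute the system call.
   'error' and 'val' must be zero, otherwise the ioctl fails with EINVAL. */
static void
resp_continue(struct seccomp_notif_resp *resp, __u64 id)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
}
```

In each case the id argument is the cookie taken from the seccomp_notif
structure that SECCOMP_IOCTL_NOTIF_RECV returned, and the filled-in
structure is then passed to SECCOMP_IOCTL_NOTIF_SEND.
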
> NOTES
>        The file descriptor returned when seccomp(2) is employed with the
>        SECCOMP_FILTER_FLAG_NEW_LISTENER  flag  can  be  monitored  using
>        poll(2), epoll(7), and select(2).  When a notification  is  pend‐
>        ing,  these interfaces indicate that the file descriptor is read‐
>        able.

This should also note that when a filter becomes unused, i.e. the last
task using that filter in its filter hierarchy is dead (has been reaped
or autoreaped), ->poll() will notify with (E)POLLHUP.
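
A supervisor could tell the two conditions apart roughly like this. It is
a minimal sketch that assumes the (E)POLLHUP behaviour described above;
the enum and function names are invented for the example.

```c
#include <poll.h>

enum listener_state {
    LISTENER_IDLE,              /* nothing to do (timeout) */
    LISTENER_NOTIF_PENDING,     /* SECCOMP_IOCTL_NOTIF_RECV should not block */
    LISTENER_FILTER_UNUSED,     /* no task uses the filter any more */
    LISTENER_ERROR
};

/* Map poll(2) result bits to a supervisor action. POLLIN is checked
   first so that a pending notification is handled before a hangup. */
static enum listener_state
classify_revents(short revents)
{
    if (revents & POLLIN)
        return LISTENER_NOTIF_PENDING;
    if (revents & POLLHUP)
        return LISTENER_FILTER_UNUSED;
    if (revents & (POLLERR | POLLNVAL))
        return LISTENER_ERROR;
    return LISTENER_IDLE;
}

/* Wait up to 'timeout_ms' milliseconds on the listening descriptor. */
static enum listener_state
wait_on_listener(int notify_fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = notify_fd, .events = POLLIN };
    int n = poll(&pfd, 1, timeout_ms);

    if (n == -1)
        return LISTENER_ERROR;
    if (n == 0)
        return LISTENER_IDLE;
    return classify_revents(pfd.revents);
}
```

On LISTENER_FILTER_UNUSED the supervisor can close the descriptor and
stop, which avoids blocking forever in SECCOMP_IOCTL_NOTIF_RECV.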

> 
>        ┌─────────────────────────────────────────────────────┐
>        │FIXME                                                │
>        ├─────────────────────────────────────────────────────┤
>        │Interestingly, after the event  had  been  received, │
>        │the  file descriptor indicates as writable (verified │
>        │from the source code and by experiment). How is this │
>        │useful?                                              │
>        └─────────────────────────────────────────────────────┘
> 
> EXAMPLES
>        The (somewhat contrived) program shown below demonstrates the use
>        of the interfaces described in this page.  The program creates  a
>        child  process  that  serves  as the "target" process.  The child
>        process  installs  a  seccomp  filter  that  returns   the   SEC‐
>        COMP_RET_USER_NOTIF  action  value if a call is made to mkdir(2).
>        The child process then calls mkdir(2) once for each of  the  sup‐
>        plied  command-line arguments, and reports the result returned by
>        the call.  After processing all arguments, the child process ter‐
>        minates.
> 
>        The  parent  process  acts  as  the supervisor, listening for the
>        notifications that are generated when the  target  process  calls
>        mkdir(2).   When such a notification occurs, the supervisor exam‐
>        ines the memory of the target process (using /proc/[pid]/mem)  to
>        discover  the pathname argument that was supplied to the mkdir(2)
>        call, and performs one of the following actions:
> 
>        · If the pathname begins with the prefix "/tmp/", then the super‐
>          visor  attempts  to  create  the  specified directory, and then
>          spoofs a return for the target  process  based  on  the  return
>          value  of  the  supervisor's  mkdir(2) call.  In the event that
>          that call succeeds, the spoofed success  return  value  is  the
>          length of the pathname.
> 
>        · If  the pathname begins with "./" (i.e., it is a relative path‐
>          name), the supervisor sends a  SECCOMP_USER_NOTIF_FLAG_CONTINUE
>          response  to  the  kernel to say that kernel should execute the
>          target process's mkdir(2) call.

Potentially problematic if the two processes have the same privilege
level and the supervisor intends _CONTINUE to mean "is safe to execute".
An attacker could try to rewrite the arguments afaict.
A good and easy example is usually mknod() in a user namespace: a
_CONTINUE is always safe there since you can't create device nodes anyway.
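
To make the concern concrete, the decision the example supervisor takes
boils down to something like the sketch below. The names are hypothetical,
and the ACT_REJECT fallback for all other pathnames is my assumption about
what a supervisor would do, not something taken from the quoted text.

```c
#include <string.h>

enum notif_action {
    ACT_EMULATE,    /* supervisor performs mkdir() itself, spoofs result */
    ACT_CONTINUE,   /* reply with SECCOMP_USER_NOTIF_FLAG_CONTINUE */
    ACT_REJECT      /* assumed fallback: spoof an error return */
};

/* Decide what to do based on the supervisor's private copy of the
   pathname. The copy can already be stale: the target (or another
   thread sharing its memory) may rewrite the buffer after it was read,
   so ACT_CONTINUE must not be treated as a security verdict. */
static enum notif_action
choose_action(const char *path)
{
    if (strncmp(path, "/tmp/", 5) == 0)
        return ACT_EMULATE;
    if (strncmp(path, "./", 2) == 0)
        return ACT_CONTINUE;
    return ACT_REJECT;
}
```

With ACT_EMULATE the supervisor acts on its own copy of the string, so a
later rewrite in the target does not matter; with ACT_CONTINUE the kernel
re-reads the target's memory, which is where the rewrite would bite.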

Sorry, I can't review the rest in sufficient detail since I'm on
vacation still so I'm just going to shut up now. :)

Christian

* Re: For review: seccomp_user_notif(2) manual page
  2020-09-30 15:53 ` Jann Horn
@ 2020-10-01 12:54   ` Christian Brauner
  2020-10-01 15:47     ` Jann Horn
  2020-10-15 11:24   ` Michael Kerrisk (man-pages)
  1 sibling, 1 reply; 52+ messages in thread
From: Christian Brauner @ 2020-10-01 12:54 UTC (permalink / raw)
  To: Jann Horn
  Cc: Michael Kerrisk (man-pages),
	linux-man, Song Liu, Will Drewry, Kees Cook, Daniel Borkmann,
	Giuseppe Scrivano, Robert Sesek, Linux Containers, lkml,
	Alexei Starovoitov, bpf, Andy Lutomirski, Christian Brauner

On Wed, Sep 30, 2020 at 05:53:46PM +0200, Jann Horn via Containers wrote:
> On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
> <mtk.manpages@gmail.com> wrote:
> > I knew it would be a big ask, but below is kind of the manual page
> > I was hoping you might write [1] for the seccomp user-space notification
> > mechanism. Since you didn't (and because 5.9 adds various new pieces
> > such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD
> > that also will need documenting [2]), I did :-). But of course I may
> > have made mistakes...
> [...]
> > NAME
> >        seccomp_user_notif - Seccomp user-space notification mechanism
> >
> > SYNOPSIS
> >        #include <linux/seccomp.h>
> >        #include <linux/filter.h>
> >        #include <linux/audit.h>
> >
> >        int seccomp(unsigned int operation, unsigned int flags, void *args);
> 
> Should the ioctl() calls be listed here, similar to e.g. the SYNOPSIS
> of the ioctl_* manpages?
> 
> > DESCRIPTION
> >        This  page  describes  the user-space notification mechanism pro‐
> >        vided by the Secure Computing (seccomp) facility.  As well as the
> >        use   of  the  SECCOMP_FILTER_FLAG_NEW_LISTENER  flag,  the  SEC‐
> >        COMP_RET_USER_NOTIF action value, and the SECCOMP_GET_NOTIF_SIZES
> >        operation  described  in  seccomp(2), this mechanism involves the
> >        use of a number of related ioctl(2) operations (described below).
> >
> >    Overview
> >        In conventional usage of a seccomp filter, the decision about how
> >        to  treat  a particular system call is made by the filter itself.
> >        The user-space notification mechanism allows the handling of  the
> >        system  call  to  instead  be handed off to a user-space process.
> >        The advantages of doing this are that, by contrast with the  sec‐
> >        comp  filter,  which  is  running on a virtual machine inside the
> >        kernel, the user-space process has access to information that  is
> >        unavailable to the seccomp filter and it can perform actions that
> >        can't be performed from the seccomp filter.
> >
> >        In the discussion that follows, the process  that  has  installed
> >        the  seccomp filter is referred to as the target, and the process
> 
> Technically, this definition of "target" is a bit inaccurate because:
> 
>  - seccomp filters are inherited
>  - seccomp filters apply to threads, not processes
>  - seccomp filters can be semi-remotely installed via TSYNC
> 
> (I assume that in manpages, we should try to go for the "a task is a
> thread and a thread group is a process" definition, right?)
> 
> Perhaps "the threads on which the seccomp filter is installed are
> referred to as the target", or something like that would be better?
> 
> >        that is notified by  the  user-space  notification  mechanism  is
> >        referred  to  as  the  supervisor.  An overview of the steps per‐
> >        formed by these two processes is as follows:
> >
> >        1. The target process establishes a seccomp filter in  the  usual
> >           manner, but with two differences:
> >
> >           · The seccomp(2) flags argument includes the flag SECCOMP_FIL‐
> >             TER_FLAG_NEW_LISTENER.  Consequently, the return  value   of
> >             the  (successful)  seccomp(2) call is a new "listening" file
> >             descriptor that can be used to receive notifications.
> >
> >           · In cases where it is appropriate, the seccomp filter returns
> >             the  action value SECCOMP_RET_USER_NOTIF.  This return value
> >             will trigger a notification event.
> >
> >        2. In order that the supervisor process can obtain  notifications
> >           using  the  listening  file  descriptor, (a duplicate of) that
> >           file descriptor must be passed from the target process to  the
> >           supervisor process.  One way in which this could be done is by
> >           passing the file descriptor over a UNIX domain socket  connec‐
> >           tion between the two processes (using the SCM_RIGHTS ancillary
> >           message type described in unix(7)).   Another  possibility  is
> >           that  the  supervisor  might  inherit  the file descriptor via
> >           fork(2).
> 
> With the caveat that if the supervisor inherits the file descriptor
> via fork(), that (more or less) implies that the supervisor is subject
> to the same filter (although it could bypass the filter using a helper
> thread that responds SECCOMP_USER_NOTIF_FLAG_CONTINUE, but I don't
> expect any clean software to do that).
> 
> >        3. The supervisor process will receive notification events on the
> >           listening  file  descriptor.   These  events  are  returned as
> >           structures of type seccomp_notif.  Because this structure  and
> >           its  size may evolve over kernel versions, the supervisor must
> >           first determine the size of  this  structure  using  the  sec‐
> >           comp(2)  SECCOMP_GET_NOTIF_SIZES  operation,  which  returns a
> >           structure of type seccomp_notif_sizes.  The  supervisor  allo‐
> >           cates a buffer of size seccomp_notif_sizes.seccomp_notif bytes
> >           to receive notification events.  In  addition, the  supervisor
> >           allocates  another  buffer  of  size  seccomp_notif_sizes.sec‐
> >           comp_notif_resp  bytes  for  the  response  (a   struct   sec‐
> >           comp_notif_resp  structure) that it will provide to the kernel
> >           (and thus the target process).
> >
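
Step 3 could look roughly like the following. This is a sketch only: the
function names are invented, and it goes through syscall(2) on the
assumption that the C library provides no seccomp() wrapper.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for syscall() */
#endif
#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/seccomp.h>

/* Ask the kernel how large the notification structures are. */
static int
get_notif_sizes(struct seccomp_notif_sizes *sizes)
{
    if (syscall(SYS_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, sizes) == -1)
        return -1;
    return 0;
}

/* Allocate zeroed request and response buffers using the sizes the
   kernel reports, not sizeof(), since the structures may grow.
   Returns 0 on success; on failure returns -1 and sets both to NULL. */
static int
alloc_notif_buffers(struct seccomp_notif **req,
                    struct seccomp_notif_resp **resp)
{
    struct seccomp_notif_sizes sizes;

    *req = NULL;
    *resp = NULL;
    if (get_notif_sizes(&sizes) == -1)
        return -1;

    *req = calloc(1, sizes.seccomp_notif);
    *resp = calloc(1, sizes.seccomp_notif_resp);
    if (*req == NULL || *resp == NULL) {
        free(*req);
        free(*resp);
        *req = NULL;
        *resp = NULL;
        return -1;
    }
    return 0;
}
```

The caller keeps both buffers for the lifetime of the supervisor loop and
releases them with free(3).
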
> >        4. The target process then performs its workload, which  includes
> >           system  calls  that  will be controlled by the seccomp filter.
> >           Whenever one of these system calls causes the filter to return
> >           the  SECCOMP_RET_USER_NOTIF  action value, the kernel does not
> >           execute the system call;  instead,  execution  of  the  target
> >           process is temporarily blocked inside the kernel and a notifi‐
> 
> where "blocked" refers to the interruptible, restartable kind - if the
> child receives a signal with an SA_RESTART signal handler in the
> meantime, it'll leave the syscall, go through the signal handler, then
> restart the syscall again and send the same request to the supervisor
> again. so the supervisor may see duplicate syscalls.
> 
> What's really gross here is that signal(7) promises that some syscalls
> like epoll_wait(2) never restart, but seccomp doesn't know about that;
> if userspace installs a filter that uses SECCOMP_RET_USER_NOTIF for a
> non-restartable syscall, the result is that UAPI gets broken a little
> bit. Luckily normal users of seccomp probably won't use
> SECCOMP_RET_USER_NOTIF for non-restartable syscalls, but if someone does
> want to do that, we might have to add some "suppress syscall
> restarting" flag into the seccomp action value, or something like
> that... yuck.
> 
> >           cation event is generated on the listening file descriptor.
> >
> >        5. The supervisor process can now repeatedly monitor the  listen‐
> >           ing   file   descriptor  for  SECCOMP_RET_USER_NOTIF-triggered
> >           events.   To  do  this,   the   supervisor   uses   the   SEC‐
> >           COMP_IOCTL_NOTIF_RECV  ioctl(2)  operation to read information
> >           about a notification event; this  operation  blocks  until  an
> 
> (interruptibly - but I guess that maybe doesn't have to be said
> explicitly here?)
> 
> >           event  is  available.
> 
> Maybe we should note here that you can use the multi-fd-polling APIs
> (select/poll/epoll) instead, and that if the notification goes away
> before you call SECCOMP_IOCTL_NOTIF_RECV, the ioctl will return
> -ENOENT instead of blocking, and therefore as long as nobody else
> reads from the same fd, you can assume that after the fd reports as
> readable, you can call SECCOMP_IOCTL_NOTIF_RECV once without blocking.
> 
> Exceeeeept that this part looks broken:
> 
>   if (mutex_lock_interruptible(&filter->notify_lock) < 0)
>     return EPOLLERR;
> 
> which I think means that we can have a race where a signal arrives
> while poll() is trying to add itself to the waitqueue of the seccomp
> fd, and then we'll get a spurious error condition reported on the fd.
> That's a kernel bug, I'd say.
> 
> > The  operation returns a seccomp_notif
> >           structure containing information about the system call that is
> >           being attempted by the target process.
> >
> >        6. The    seccomp_notif    structure   returned   by   the   SEC‐
> >           COMP_IOCTL_NOTIF_RECV operation includes the same  information
> >           (a seccomp_data structure) that was passed to the seccomp fil‐
> >           ter.  This information allows the supervisor to  discover  the
> >           system  call number and the arguments for the target process's
> >           system call.  In addition, the notification event contains the
> >           PID of the target process.
> 
> That's a PIDTYPE_PID, which the manpages call a "thread ID".
> 
> >           The  information  in  the notification can be used to discover
> >           the values of pointer arguments for the target process's  sys‐
> >           tem call.  (This is something that can't be done from within a
> >           seccomp filter.)  To do this (and  assuming  it  has  suitable
> >           permissions),   the   supervisor   opens   the   corresponding
> >           /proc/[pid]/mem file,
> 
> ... which means that here we might have to get into the weeds of how
> actually /proc has invisible directories for every TID, even though
> only the ones for PIDs are visible, and therefore you can just open
> /proc/[tid]/mem and it'll work fine?
> 
> > seeks to the memory location that corre‐
> >           sponds to one of the pointer arguments whose value is supplied
> >           in the notification event, and reads bytes from that location.
> >           (The supervisor must be careful to avoid a race condition that
> >           can occur when doing this; see the  description  of  the  SEC‐
> >           COMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.)  In addi‐
> >           tion, the supervisor can access other system information  that
> >           is  visible  in  user space but which is not accessible from a
> >           seccomp filter.
> >
> >           ┌─────────────────────────────────────────────────────┐
> >           │FIXME                                                │
> >           ├─────────────────────────────────────────────────────┤
> >           │Suppose we are reading a pathname from /proc/PID/mem │
> >           │for  a system call such as mkdir(). The pathname can │
> >           │be an arbitrary length. How do we know how much (how │
> >           │many pages) to read from /proc/PID/mem?              │
> >           └─────────────────────────────────────────────────────┘
> 
> It can't be an arbitrary length. While pathnames *returned* from the
> kernel in some places can have different limits, strings supplied as
> path arguments *to* the kernel AFAIK always have an upper limit of
> PATH_MAX, else you get -ENAMETOOLONG. See getname_flags().
> 
> >        7. Having  obtained  information  as  per  the previous step, the
> >           supervisor may then choose to perform an action in response to
> >           the  target  process's  system call (which, as noted above, is
> >           not  executed  when  the  seccomp  filter  returns  the   SEC‐
> >           COMP_RET_USER_NOTIF action value).
> 
> (unless SECCOMP_USER_NOTIF_FLAG_CONTINUE is used)
> 
> >           One  example  use case here relates to containers.  The target
> >           process may be located inside a container where  it  does  not
> >           have sufficient capabilities to mount a filesystem in the con‐
> >           tainer's mount namespace.  However, the supervisor  may  be  a
> >           more  privileged  process that that does have sufficient capa‐
> 
> nit: s/that that/that/
> 
> >           bilities to perform the mount operation.
> >
> >        8. The supervisor then sends a response to the notification.  The
> >           information  in  this  response  is used by the kernel to con‐
> >           struct a return value for the target process's system call and
> >           provide a value that will be assigned to the errno variable of
> >           the target process.
> >
> >           The  response  is  sent  using  the   SECCOMP_IOCTL_NOTIF_SEND
> >           ioctl(2)   operation,   which  is  used  to  transmit  a  sec‐
> >           comp_notif_resp  structure  to  the  kernel.   This  structure
> >           includes  a  cookie  value that the supervisor obtained in the
> >           seccomp_notif    structure    returned     by     the     SEC‐
> >           COMP_IOCTL_NOTIF_RECV operation.  This cookie value allows the
> >           kernel to associate the response with the target process.
> 
> (unless the target thread entered a signal handler or was killed in
> the meantime)
> 
> >        9. Once the notification has been sent, the system  call  in  the
> >           target  process  unblocks,  returning the information that was
> >           provided by the supervisor in the notification response.
> >
> >        As a variation on the last two steps, the supervisor can  send  a
> >        response  that tells the kernel that it should execute the target
> >        process's   system   call;   see   the   discussion    of    SEC‐
> >        COMP_USER_NOTIF_FLAG_CONTINUE, below.
> >
> >    ioctl(2) operations
> >        The following ioctl(2) operations are provided to support seccomp
> >        user-space notification.  For each of these operations, the first
> >        (file  descriptor)  argument  of  ioctl(2)  is the listening file
> >        descriptor returned by a call to seccomp(2) with the SECCOMP_FIL‐
> >        TER_FLAG_NEW_LISTENER flag.
> >
> >        SECCOMP_IOCTL_NOTIF_RECV
> >               This operation is used to obtain a user-space notification
> >               event.  If no such event is currently pending, the  opera‐
> >               tion  blocks  until  an  event occurs.
> 
> Not necessarily; for every time a process entered a signal handler or
> was killed while a notification was pending, a call to
> SECCOMP_IOCTL_NOTIF_RECV will return -ENOENT.
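
A receive helper that treats ENOENT as "nothing to handle this time"
rather than as a fatal error might look like this. It is a sketch with an
invented name; 'req_size' is assumed to be the seccomp_notif size reported
by SECCOMP_GET_NOTIF_SIZES.

```c
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>

/* Returns 1 if a notification was stored in *req, 0 if the pending
   notification went away (ENOENT, e.g. the target entered a signal
   handler or was killed), and -1 on any other error with errno set. */
static int
recv_notif(int notify_fd, struct seccomp_notif *req, size_t req_size)
{
    /* The structure must be zeroed before every call. */
    memset(req, 0, req_size);

    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, req) == 0)
        return 1;
    return (errno == ENOENT) ? 0 : -1;
}
```

A supervisor loop would simply go back to waiting when this returns 0,
and decide for itself how to treat EINTR when it returns -1.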
> 
> > The third ioctl(2)
> >               argument is a pointer to a structure of the following form
> >               which  contains  information about the event.  This struc‐
> >               ture must be zeroed out before the call.
> >
> >                   struct seccomp_notif {
> >                       __u64  id;              /* Cookie */
> >                       __u32  pid;             /* PID of target process */
> 
> (TID, not PID)
> 
> >                       __u32  flags;           /* Currently unused (0) */
> >                       struct seccomp_data data;   /* See seccomp(2) */
> >                   };
> >
> >               The fields in this structure are as follows:
> >
> >               id     This is a cookie for the notification.   Each  such
> >                      cookie  is  guaranteed  to be unique for the corre‐
> >                      sponding seccomp  filter.   In  other  words,  this
> >                      cookie  is  unique for each notification event from
> >                      the target process.
> 
> That sentence about "target process" looks wrong to me. The cookies
> are unique across notifications from the filter, but there can be
> multiple filters per thread, and multiple threads per filter.
> 
> > The cookie value has the  fol‐
> >                      lowing uses:
> >
> >                      · It     can     be     used    with    the    SEC‐
> >                        COMP_IOCTL_NOTIF_ID_VALID ioctl(2)  operation  to
> >                        verify that the target process is still alive.
> >
> >                      · When  returning  a  notification  response to the
> >                        kernel, the supervisor must  include  the  cookie
> >                        value in the seccomp_notif_resp structure that is
> >                        specified   as   the   argument   of   the   SEC‐
> >                        COMP_IOCTL_NOTIF_SEND operation.
> >
> >               pid    This  is  the  PID of the target process that trig‐
> >                      gered the notification event.
> >
> >                      ┌─────────────────────────────────────────────────────┐
> >                      │FIXME                                                │
> >                      ├─────────────────────────────────────────────────────┤
> >                      │This is a thread ID, rather than a PID, right?       │
> >                      └─────────────────────────────────────────────────────┘
> 
> Yeah.
> 
> >
> >               flags  This is a  bit  mask  of  flags  providing  further
> >                      information on the event.  In the current implemen‐
> >                      tation, this field is always zero.
> >
> >               data   This is a seccomp_data structure containing  infor‐
> >                      mation  about  the  system  call that triggered the
> >                      notification.  This is the same structure  that  is
> >                      passed  to  the seccomp filter.  See seccomp(2) for
> >                      details of this structure.
> >
> >               On success, this operation returns 0; on  failure,  -1  is
> >               returned,  and  errno  is set to indicate the cause of the
> >               error.  This operation can fail with the following errors:
> >
> >               EINVAL (since Linux 5.5)
> >                      The seccomp_notif structure that was passed to  the
> >                      call contained nonzero fields.
> >
> >               ENOENT The  target  process  was killed by a signal as the
> >                      notification information was being generated.
> 
> Not just killed, interruption with a signal handler has the same effect.
> 
> >        ┌─────────────────────────────────────────────────────┐
> >        │FIXME                                                │
> >        ├─────────────────────────────────────────────────────┤
> >        │From my experiments,  it  appears  that  if  a  SEC‐ │
> >        │COMP_IOCTL_NOTIF_RECV   is  done  after  the  target │
> >        │process terminates, then the ioctl()  simply  blocks │
> >        │(rather than returning an error to indicate that the │
> >        │target process no longer exists).                    │
> >        │                                                     │
> >        │I found that surprising, and it required  some  con‐ │
> >        │tortions  in the example program.  It was not possi‐ │
> >        │ble to code my SIGCHLD handler (which reaps the zom‐ │
> >        │bie  when  the  worker/target process terminates) to │
> >        │simply set a flag checked in the main  handleNotifi‐ │
> >        │cations()  loop,  since  this created an unavoidable │
> >        │race where the child might terminate  just  after  I │
> >        │had  checked  the  flag,  but before I blocked (for‐ │
> >        │ever!) in  the  SECCOMP_IOCTL_NOTIF_RECV  operation. │
> >        │Instead,  I had to code the signal handler to simply │
> >        │call _exit(2)  in  order  to  terminate  the  parent │
> >        │process (the supervisor).                            │
> >        │                                                     │
> >        │Is  this  expected  behavior?  It seems to me rather │
> >        │desirable that SECCOMP_IOCTL_NOTIF_RECV should  give │
> >        │an error if the target process has terminated.       │
> >        └─────────────────────────────────────────────────────┘
> 
> You could poll() the fd first. But yeah, it'd probably be a good idea
> to change that.
> 
> >        SECCOMP_IOCTL_NOTIF_ID_VALID
> [...]
> >               In the above scenario, the risk is that the supervisor may
> >               try to access the memory of a process other than the  tar‐
> >               get.   This  race  can be avoided by following the call to
> >               open with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to ver‐
> >               ify  that  the  process that generated the notification is
> >               still alive.  (Note that  if  the  target  process  subse‐
> >               quently  terminates, its PID won't be reused because there
> 
> That's wrong, the PID can be reused, but the /proc/$pid directory is
> internally not associated with the numeric PID, but, conceptually
> speaking, with a specific incarnation of the PID, or something like
> that. (Actually, it is associated with the "struct pid", which is not
> reused, instead of the numeric PID.)
> 
> >               remains an open reference to the /proc/[pid]/mem file;  in
> >               this  case, a subsequent read(2) from the file will return
> >               0, indicating end of file.)
> >
> >               On success (i.e., the notification  ID  is  still  valid),
> >               this  operation  returns 0 On failure (i.e., the notifica‐
> 
> nit: s/returns 0/returns 0./
> 
> >               tion ID is no longer valid), -1 is returned, and errno  is
> >               set to ENOENT.
> >
> >        SECCOMP_IOCTL_NOTIF_SEND
> [...]
> >               Two kinds of response are possible:
> >
> >               · A response to the kernel telling it to execute the  tar‐
> >                 get  process's  system  call.   In  this case, the flags
> >                 field includes SECCOMP_USER_NOTIF_FLAG_CONTINUE and  the
> >                 error and val fields must be zero.
> >
> >                 This  kind  of response can be useful in cases where the
> >                 supervisor needs to do deeper analysis of  the  target's
> >                 system  call  than  is  possible  from  a seccomp filter
> >                 (e.g., examining the values of pointer arguments),  and,
> >                 having  verified that the system call is acceptable, the
> >                 supervisor wants to allow it to proceed.
> 
> "allow" sounds as if this is an access control thing, but this
> mechanism should usually not be used for access control (unless the
> "seccomp" syscall is blocked). Maybe reword as "having decided that
> the system call does not require emulation by the supervisor, the
> supervisor wants it to execute normally", or something like that?
> 
> [...]
> >               On success, this operation returns 0; on  failure,  -1  is
> >               returned,  and  errno  is set to indicate the cause of the
> >               error.  This operation can fail with the following errors:
> >
> >               EINPROGRESS
> >                      A response to this notification  has  already  been
> >                      sent.
> >
> >               EINVAL An invalid value was specified in the flags field.
> >
> >               EINVAL The       flags      field      contained      SEC‐
> >                      COMP_USER_NOTIF_FLAG_CONTINUE, and the error or val
> >                      field was not zero.
> >
> >               ENOENT The  blocked  system call in the target process has
> >                      been interrupted by a signal handler.
> 
> (you could also get this if a response has already been sent, instead
> of EINPROGRESS - the only difference is whether the target thread has
> picked up the response yet)
> 
> > NOTES
> >        The file descriptor returned when seccomp(2) is employed with the
> >        SECCOMP_FILTER_FLAG_NEW_LISTENER  flag  can  be  monitored  using
> >        poll(2), epoll(7), and select(2).  When a notification  is  pend‐
> >        ing,  these interfaces indicate that the file descriptor is read‐
> >        able.
> 
> We should probably also point out somewhere that, as
> include/uapi/linux/seccomp.h says:
> 
>  * Similar precautions should be applied when stacking SECCOMP_RET_USER_NOTIF
>  * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the
>  * same syscall, the most recently added filter takes precedence. This means
>  * that the new SECCOMP_RET_USER_NOTIF filter can override any
>  * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all
>  * such filtered syscalls to be executed by sending the response
>  * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can equally
>  * be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.
> 
> In other words, from a security perspective, you must assume that the
> target process can bypass any SECCOMP_RET_USER_NOTIF (or
> SECCOMP_RET_TRACE) filters unless it is completely prohibited from
> calling seccomp(). This should also be noted over in the main
> seccomp(2) manpage, especially the SECCOMP_RET_TRACE part.

So I was actually wondering about this when I skimmed this a while
ago, but then forgot about it again... Afaict, you can only ever load a
single filter with SECCOMP_FILTER_FLAG_NEW_LISTENER set. If there
already is a filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER property
in the task's filter hierarchy, then the kernel will refuse to load a new
one?

static struct file *init_listener(struct seccomp_filter *filter)
{
	struct file *ret = ERR_PTR(-EBUSY);
	struct seccomp_filter *cur;

	for (cur = current->seccomp.filter; cur; cur = cur->prev) {
		if (cur->notif)
			goto out;
	}

shouldn't that be sufficient to guarantee that USER_NOTIF filters can't
override each other for the same task simply because there can only ever
be a single one?

> 
> 
> > EXAMPLES
> [...]
> >        This  program  can  used  to  demonstrate  various aspects of the
> 
> nit: "can be used to demonstrate", or alternatively just "demonstrates"
> 
> >        behavior of the seccomp user-space  notification  mechanism.   To
> >        help  aid  such demonstrations, the program logs various messages
> >        to show the operation of the target process (lines prefixed "T:")
> >        and the supervisor (indented lines prefixed "S:").
> [...]
> >    Program source
> [...]
> >        #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
> >                                } while (0)
> 
> Don't we have err() for this?
> 
> >        /* Send the file descriptor 'fd' over the connected UNIX domain socket
> >           'sockfd'. Returns 0 on success, or -1 on error. */
> >
> >        static int
> >        sendfd(int sockfd, int fd)
> >        {
> >            struct msghdr msgh;
> >            struct iovec iov;
> >            int data;
> >            struct cmsghdr *cmsgp;
> >
> >            /* Allocate a char array of suitable size to hold the ancillary data.
> >               However, since this buffer is in reality a 'struct cmsghdr', use a
> >               union to ensure that it is suitable aligned. */
> 
> nit: suitably
> 
> >            union {
> >                char   buf[CMSG_SPACE(sizeof(int))];
> >                                /* Space large enough to hold an 'int' */
> >                struct cmsghdr align;
> >            } controlMsg;
> >
> >            /* The 'msg_name' field can be used to specify the address of the
> >               destination socket when sending a datagram. However, we do not
> >               need to use this field because 'sockfd' is a connected socket. */
> >
> >            msgh.msg_name = NULL;
> >            msgh.msg_namelen = 0;
> >
> >            /* On Linux, we must transmit at least one byte of real data in
> >               order to send ancillary data. We transmit an arbitrary integer
> >               whose value is ignored by recvfd(). */
> >
> >            msgh.msg_iov = &iov;
> >            msgh.msg_iovlen = 1;
> >            iov.iov_base = &data;
> >            iov.iov_len = sizeof(int);
> >            data = 12345;
> >
> >            /* Set 'msghdr' fields that describe ancillary data */
> >
> >            msgh.msg_control = controlMsg.buf;
> >            msgh.msg_controllen = sizeof(controlMsg.buf);
> >
> >            /* Set up ancillary data describing file descriptor to send */
> >
> >            cmsgp = CMSG_FIRSTHDR(&msgh);
> >            cmsgp->cmsg_level = SOL_SOCKET;
> >            cmsgp->cmsg_type = SCM_RIGHTS;
> >            cmsgp->cmsg_len = CMSG_LEN(sizeof(int));
> >            memcpy(CMSG_DATA(cmsgp), &fd, sizeof(int));
> >
> >            /* Send real plus ancillary data */
> >
> >            if (sendmsg(sockfd, &msgh, 0) == -1)
> >                return -1;
> >
> >            return 0;
> >        }
> 
> Instead of using unix domain sockets to send the fd to the parent, I
> think you could also use clone3() with flags==CLONE_FILES|SIGCHLD,
> dup2() the seccomp fd to an fd that was reserved in the parent, call
> unshare(CLONE_FILES) in the child after setting up the seccomp fd, and
> wake up the parent with something like pthread_cond_signal()? I'm not
> sure whether that'd look better or worse in the end though, so maybe
> just ignore this comment.

(If the target process calls exec fairly soon afterwards, then VFORK can
be useful.)

> 
> [...]
> >        /* Access the memory of the target process in order to discover the
> >           pathname that was given to mkdir() */
> >
> >        static void
> >        getTargetPathname(struct seccomp_notif *req, int notifyFd,
> >                          char *path, size_t len)
> >        {
> >            char procMemPath[PATH_MAX];
> >            snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", req->pid);
> >
> >            int procMemFd = open(procMemPath, O_RDONLY);
> 
> Should example code like this maybe use O_CLOEXEC unless the fd in
> question actually has to be inheritable? I know it doesn't actually
> matter here, but if this code was used in a multi-threaded context, it
> might.

Agreed, about the O_CLOEXEC part.

> 
> >            if (procMemFd == -1)
> >                errExit("Supervisor: open");
> >
> >            /* Check that the process whose info we are accessing is still alive.
> >               If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed
> >               in checkNotificationIdIsValid()) succeeds, we know that the
> >               /proc/PID/mem file descriptor that we opened corresponds to the
> >               process for which we received a notification. If that process
> >               subsequently terminates, then read() on that file descriptor
> >               will return 0 (EOF). */
> >
> >            checkNotificationIdIsValid(notifyFd, req->id);
> >
> >            /* Seek to the location containing the pathname argument (i.e., the
> >               first argument) of the mkdir(2) call and read that pathname */
> >
> >            if (lseek(procMemFd, req->data.args[0], SEEK_SET) == -1)
> >                errExit("Supervisor: lseek");
> >
> >            ssize_t s = read(procMemFd, path, PATH_MAX);
> >            if (s == -1)
> >                errExit("read");
> 
> Why not pread() instead of lseek()+read()?

If multiple arguments need to be read, process_vm_readv() should also be
considered.


* Re: For review: seccomp_user_notif(2) manual page
  2020-10-01 12:54   ` Christian Brauner
@ 2020-10-01 15:47     ` Jann Horn
  2020-10-01 16:58       ` Tycho Andersen
  2020-10-01 17:05       ` Christian Brauner
  0 siblings, 2 replies; 52+ messages in thread
From: Jann Horn @ 2020-10-01 15:47 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Michael Kerrisk (man-pages),
	linux-man, Song Liu, Will Drewry, Kees Cook, Daniel Borkmann,
	Giuseppe Scrivano, Robert Sesek, Linux Containers, lkml,
	Alexei Starovoitov, bpf, Andy Lutomirski, Christian Brauner

On Thu, Oct 1, 2020 at 2:54 PM Christian Brauner
<christian.brauner@canonical.com> wrote:
> On Wed, Sep 30, 2020 at 05:53:46PM +0200, Jann Horn via Containers wrote:
> > On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
> > <mtk.manpages@gmail.com> wrote:
> > > NOTES
> > >        The file descriptor returned when seccomp(2) is employed with the
> > >        SECCOMP_FILTER_FLAG_NEW_LISTENER  flag  can  be  monitored  using
> > >        poll(2), epoll(7), and select(2).  When a notification  is  pend‐
> > >        ing,  these interfaces indicate that the file descriptor is read‐
> > >        able.
> >
> > We should probably also point out somewhere that, as
> > include/uapi/linux/seccomp.h says:
> >
> >  * Similar precautions should be applied when stacking SECCOMP_RET_USER_NOTIF
> >  * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the
> >  * same syscall, the most recently added filter takes precedence. This means
> >  * that the new SECCOMP_RET_USER_NOTIF filter can override any
> >  * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all
> >  * such filtered syscalls to be executed by sending the response
> >  * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can equally
> >  * be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.
> >
> > In other words, from a security perspective, you must assume that the
> > target process can bypass any SECCOMP_RET_USER_NOTIF (or
> > SECCOMP_RET_TRACE) filters unless it is completely prohibited from
> > calling seccomp(). This should also be noted over in the main
> > seccomp(2) manpage, especially the SECCOMP_RET_TRACE part.
>
> So I was actually wondering about this when I skimmed this and a while
> ago but forgot about this again... Afaict, you can only ever load a
> single filter with SECCOMP_FILTER_FLAG_NEW_LISTENER set. If there
> already is a filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER property
> in the tasks filter hierarchy then the kernel will refuse to load a new
> one?
>
> static struct file *init_listener(struct seccomp_filter *filter)
> {
>         struct file *ret = ERR_PTR(-EBUSY);
>         struct seccomp_filter *cur;
>
>         for (cur = current->seccomp.filter; cur; cur = cur->prev) {
>                 if (cur->notif)
>                         goto out;
>         }
>
> shouldn't that be sufficient to guarantee that USER_NOTIF filters can't
> override each other for the same task simply because there can only ever
> be a single one?

Good point. Exceeeept that that check seems ineffective because this
happens before we take the locks that guard against TSYNC, and also
before we decide to which existing filter we want to chain the new
filter. So if two threads race with TSYNC, I think they'll be able to
chain two filters with listeners together.
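
To make the race concrete: this is a check-then-act sequence in which the
check runs before the lock is taken. Below is a rough user-space analogy of
doing the check and the update under a single lock hold. It is not kernel
code, and chain_lock, listeners, attach_listener_locked() and race_attach()
are made-up names used only for illustration.

```c
#include <pthread.h>
#include <stdbool.h>

/* User-space analogy only; none of these names exist in the kernel.
   'chain_lock' stands in for the lock that serializes filter attachment,
   'listeners' for the number of filters on the chain that own a listener. */
static pthread_mutex_t chain_lock = PTHREAD_MUTEX_INITIALIZER;
static int listeners;

/* Check and attach under one lock hold, so that no second caller can slip
   in between the test and the update. Returns true if this caller won. */
static bool
attach_listener_locked(void)
{
    bool ok = false;

    pthread_mutex_lock(&chain_lock);
    if (listeners == 0) {
        listeners++;
        ok = true;
    }
    pthread_mutex_unlock(&chain_lock);
    return ok;
}

static void *
racer(void *arg)
{
    bool *won = arg;

    *won = attach_listener_locked();
    return NULL;
}

/* Let 'n' (capped at 64) callers race to attach a listener and return how
   many of them succeeded. */
static int
race_attach(int n)
{
    pthread_t tids[64];
    bool started[64] = { false };
    bool won[64] = { false };
    int winners = 0;

    if (n > 64)
        n = 64;
    listeners = 0;

    for (int i = 0; i < n; i++) {
        started[i] = pthread_create(&tids[i], NULL, racer, &won[i]) == 0;
        if (!started[i])
            racer(&won[i]);     /* Fall back to attempting inline */
    }
    for (int i = 0; i < n; i++) {
        if (started[i])
            pthread_join(tids[i], NULL);
        winners += won[i];
    }
    return winners;
}
```

With the test and the increment inside the same critical section,
race_attach() reports a single winner no matter how many callers race.
Moving the "listeners == 0" test in front of pthread_mutex_lock() would
reintroduce the window described above.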

I don't know whether we want to eternalize this "only one listener
across all the filters" restriction in the manpage though, or whether
the man page should just say that the kernel currently doesn't support
it but that security-wise you should assume that it might at some
point.

[...]
> > >            if (procMemFd == -1)
> > >                errExit("Supervisor: open");
> > >
> > >            /* Check that the process whose info we are accessing is still alive.
> > >               If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed
> > >               in checkNotificationIdIsValid()) succeeds, we know that the
> > >               /proc/PID/mem file descriptor that we opened corresponds to the
> > >               process for which we received a notification. If that process
> > >               subsequently terminates, then read() on that file descriptor
> > >               will return 0 (EOF). */
> > >
> > >            checkNotificationIdIsValid(notifyFd, req->id);
> > >
> > >            /* Seek to the location containing the pathname argument (i.e., the
> > >               first argument) of the mkdir(2) call and read that pathname */
> > >
> > >            if (lseek(procMemFd, req->data.args[0], SEEK_SET) == -1)
> > >                errExit("Supervisor: lseek");
> > >
> > >            ssize_t s = read(procMemFd, path, PATH_MAX);
> > >            if (s == -1)
> > >                errExit("read");
> >
> > Why not pread() instead of lseek()+read()?
>
> With multiple arguments to be read process_vm_readv() should also be
> considered.

process_vm_readv() can end up doing each read against a different
process, which is sort of weird semantically. You would end up taking
page faults at random addresses in unrelated processes, blocking on
their mmap locks, potentially triggering their userfaultfd notifiers,
and so on.

Whereas if you first open /proc/$tid/mem, then re-check
SECCOMP_IOCTL_NOTIF_ID_VALID, and then do the read, you know that
you're only taking page faults on the process where you intended to do
it.

So until there is a variant of process_vm_readv() that operates on
pidfds, I would not recommend using that here.


* Re: For review: seccomp_user_notif(2) manual page
  2020-10-01 15:47     ` Jann Horn
@ 2020-10-01 16:58       ` Tycho Andersen
  2020-10-01 17:12         ` Christian Brauner
  2020-10-01 18:18         ` Jann Horn
  2020-10-01 17:05       ` Christian Brauner
  1 sibling, 2 replies; 52+ messages in thread
From: Tycho Andersen @ 2020-10-01 16:58 UTC (permalink / raw)
  To: Jann Horn
  Cc: Christian Brauner, linux-man, Song Liu, Will Drewry, Kees Cook,
	Daniel Borkmann, Giuseppe Scrivano, Robert Sesek,
	Linux Containers, lkml, Alexei Starovoitov,
	Michael Kerrisk (man-pages),
	bpf, Andy Lutomirski, Christian Brauner

On Thu, Oct 01, 2020 at 05:47:54PM +0200, Jann Horn via Containers wrote:
> On Thu, Oct 1, 2020 at 2:54 PM Christian Brauner
> <christian.brauner@canonical.com> wrote:
> > On Wed, Sep 30, 2020 at 05:53:46PM +0200, Jann Horn via Containers wrote:
> > > On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
> > > <mtk.manpages@gmail.com> wrote:
> > > > NOTES
> > > >        The file descriptor returned when seccomp(2) is employed with the
> > > >        SECCOMP_FILTER_FLAG_NEW_LISTENER  flag  can  be  monitored  using
> > > >        poll(2), epoll(7), and select(2).  When a notification  is  pend‐
> > > >        ing,  these interfaces indicate that the file descriptor is read‐
> > > >        able.
> > >
> > > We should probably also point out somewhere that, as
> > > include/uapi/linux/seccomp.h says:
> > >
> > >  * Similar precautions should be applied when stacking SECCOMP_RET_USER_NOTIF
> > >  * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the
> > >  * same syscall, the most recently added filter takes precedence. This means
> > >  * that the new SECCOMP_RET_USER_NOTIF filter can override any
> > >  * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all
> > >  * such filtered syscalls to be executed by sending the response
> > >  * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can equally
> > >  * be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.
> > >
> > > In other words, from a security perspective, you must assume that the
> > > target process can bypass any SECCOMP_RET_USER_NOTIF (or
> > > SECCOMP_RET_TRACE) filters unless it is completely prohibited from
> > > calling seccomp(). This should also be noted over in the main
> > > seccomp(2) manpage, especially the SECCOMP_RET_TRACE part.
> >
> > So I was actually wondering about this when I skimmed this and a while
> > ago but forgot about this again... Afaict, you can only ever load a
> > single filter with SECCOMP_FILTER_FLAG_NEW_LISTENER set. If there
> > already is a filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER property
> > in the tasks filter hierarchy then the kernel will refuse to load a new
> > one?
> >
> > static struct file *init_listener(struct seccomp_filter *filter)
> > {
> >         struct file *ret = ERR_PTR(-EBUSY);
> >         struct seccomp_filter *cur;
> >
> >         for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> >                 if (cur->notif)
> >                         goto out;
> >         }
> >
> > shouldn't that be sufficient to guarantee that USER_NOTIF filters can't
> > override each other for the same task simply because there can only ever
> > be a single one?
> 
> Good point. Exceeeept that that check seems ineffective because this
> happens before we take the locks that guard against TSYNC, and also
> before we decide to which existing filter we want to chain the new
> filter. So if two threads race with TSYNC, I think they'll be able to
> chain two filters with listeners together.

Yep, it seems the check also needs to be in seccomp_can_sync_threads() to
be totally effective.

> I don't know whether we want to eternalize this "only one listener
> across all the filters" restriction in the manpage though, or whether
> the man page should just say that the kernel currently doesn't support
> it but that security-wise you should assume that it might at some
> point.

This requirement originally came from Andy, arguing that the semantics
of this were/are confusing, which still makes sense to me. Perhaps we
should do something like the below?

Tycho


diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 3ee59ce0a323..7b107207c2b0 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -376,6 +376,18 @@ static int is_ancestor(struct seccomp_filter *parent,
 	return 0;
 }
 
+static bool has_listener_parent(struct seccomp_filter *child)
+{
+	struct seccomp_filter *cur;
+
+	for (cur = current->seccomp.filter; cur; cur = cur->prev) {
+		if (cur->notif)
+			return true;
+	}
+
+	return false;
+}
+
 /**
  * seccomp_can_sync_threads: checks if all threads can be synchronized
  *
@@ -385,7 +397,7 @@ static int is_ancestor(struct seccomp_filter *parent,
  * either not in the correct seccomp mode or did not have an ancestral
  * seccomp filter.
  */
-static inline pid_t seccomp_can_sync_threads(void)
+static inline pid_t seccomp_can_sync_threads(unsigned int flags)
 {
 	struct task_struct *thread, *caller;
 
@@ -407,6 +419,11 @@ static inline pid_t seccomp_can_sync_threads(void)
 				 caller->seccomp.filter)))
 			continue;
 
+		/* don't allow TSYNC to install multiple listeners */
+		if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER &&
+		    !has_listener_parent(thread->seccomp.filter))
+			continue;
+
 		/* Return the first thread that cannot be synchronized. */
 		failed = task_pid_vnr(thread);
 		/* If the pid cannot be resolved, then return -ESRCH */
@@ -637,7 +654,7 @@ static long seccomp_attach_filter(unsigned int flags,
 	if (flags & SECCOMP_FILTER_FLAG_TSYNC) {
 		int ret;
 
-		ret = seccomp_can_sync_threads();
+		ret = seccomp_can_sync_threads(flags);
 		if (ret) {
 			if (flags & SECCOMP_FILTER_FLAG_TSYNC_ESRCH)
 				return -ESRCH;
@@ -1462,12 +1479,9 @@ static const struct file_operations seccomp_notify_ops = {
 static struct file *init_listener(struct seccomp_filter *filter)
 {
 	struct file *ret = ERR_PTR(-EBUSY);
-	struct seccomp_filter *cur;
 
-	for (cur = current->seccomp.filter; cur; cur = cur->prev) {
-		if (cur->notif)
-			goto out;
-	}
+	if (has_listener_parent(current->seccomp.filter))
+		goto out;
 
 	ret = ERR_PTR(-ENOMEM);
 	filter->notif = kzalloc(sizeof(*(filter->notif)), GFP_KERNEL);


* Re: For review: seccomp_user_notif(2) manual page
  2020-10-01 15:47     ` Jann Horn
  2020-10-01 16:58       ` Tycho Andersen
@ 2020-10-01 17:05       ` Christian Brauner
  1 sibling, 0 replies; 52+ messages in thread
From: Christian Brauner @ 2020-10-01 17:05 UTC (permalink / raw)
  To: Jann Horn
  Cc: Michael Kerrisk (man-pages),
	linux-man, Song Liu, Will Drewry, Kees Cook, Daniel Borkmann,
	Giuseppe Scrivano, Robert Sesek, Linux Containers, lkml,
	Alexei Starovoitov, bpf, Andy Lutomirski, Christian Brauner

On Thu, Oct 01, 2020 at 05:47:54PM +0200, Jann Horn wrote:
> On Thu, Oct 1, 2020 at 2:54 PM Christian Brauner
> <christian.brauner@canonical.com> wrote:
> > On Wed, Sep 30, 2020 at 05:53:46PM +0200, Jann Horn via Containers wrote:
> > > On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
> > > <mtk.manpages@gmail.com> wrote:
> > > > NOTES
> > > >        The file descriptor returned when seccomp(2) is employed with the
> > > >        SECCOMP_FILTER_FLAG_NEW_LISTENER  flag  can  be  monitored  using
> > > >        poll(2), epoll(7), and select(2).  When a notification  is  pend‐
> > > >        ing,  these interfaces indicate that the file descriptor is read‐
> > > >        able.
> > >
> > > We should probably also point out somewhere that, as
> > > include/uapi/linux/seccomp.h says:
> > >
> > >  * Similar precautions should be applied when stacking SECCOMP_RET_USER_NOTIF
> > >  * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the
> > >  * same syscall, the most recently added filter takes precedence. This means
> > >  * that the new SECCOMP_RET_USER_NOTIF filter can override any
> > >  * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all
> > >  * such filtered syscalls to be executed by sending the response
> > >  * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can equally
> > >  * be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.
> > >
> > > In other words, from a security perspective, you must assume that the
> > > target process can bypass any SECCOMP_RET_USER_NOTIF (or
> > > SECCOMP_RET_TRACE) filters unless it is completely prohibited from
> > > calling seccomp(). This should also be noted over in the main
> > > seccomp(2) manpage, especially the SECCOMP_RET_TRACE part.
> >
> > So I was actually wondering about this when I skimmed this and a while
> > ago but forgot about this again... Afaict, you can only ever load a
> > single filter with SECCOMP_FILTER_FLAG_NEW_LISTENER set. If there
> > already is a filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER property
> > in the tasks filter hierarchy then the kernel will refuse to load a new
> > one?
> >
> > static struct file *init_listener(struct seccomp_filter *filter)
> > {
> >         struct file *ret = ERR_PTR(-EBUSY);
> >         struct seccomp_filter *cur;
> >
> >         for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> >                 if (cur->notif)
> >                         goto out;
> >         }
> >
> > shouldn't that be sufficient to guarantee that USER_NOTIF filters can't
> > override each other for the same task simply because there can only ever
> > be a single one?
> 
> Good point. Exceeeept that that check seems ineffective because this
> happens before we take the locks that guard against TSYNC, and also
> before we decide to which existing filter we want to chain the new
> filter. So if two threads race with TSYNC, I think they'll be able to
> chain two filters with listeners together.

That's a bug, imho. I don't have source code in front of me right now
though.

> 
> I don't know whether we want to eternalize this "only one listener
> across all the filters" restriction in the manpage though, or whether
> the man page should just say that the kernel currently doesn't support
> it but that security-wise you should assume that it might at some
> point.

Maybe. I would argue that it might be worth having at least a new
flag/option to indicate "this is a non-overridable filter" or, for the
seccomp notifier specifically, an option to indicate that no other
notifier can be installed.

Christian


* Re: For review: seccomp_user_notif(2) manual page
  2020-10-01 16:58       ` Tycho Andersen
@ 2020-10-01 17:12         ` Christian Brauner
  2020-10-14  5:41           ` Michael Kerrisk (man-pages)
  2020-10-01 18:18         ` Jann Horn
  1 sibling, 1 reply; 52+ messages in thread
From: Christian Brauner @ 2020-10-01 17:12 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Jann Horn, linux-man, Song Liu, Will Drewry, Kees Cook,
	Daniel Borkmann, Giuseppe Scrivano, Robert Sesek,
	Linux Containers, lkml, Alexei Starovoitov,
	Michael Kerrisk (man-pages),
	bpf, Andy Lutomirski, Christian Brauner

On Thu, Oct 01, 2020 at 10:58:50AM -0600, Tycho Andersen wrote:
> On Thu, Oct 01, 2020 at 05:47:54PM +0200, Jann Horn via Containers wrote:
> > On Thu, Oct 1, 2020 at 2:54 PM Christian Brauner
> > <christian.brauner@canonical.com> wrote:
> > > On Wed, Sep 30, 2020 at 05:53:46PM +0200, Jann Horn via Containers wrote:
> > > > On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
> > > > <mtk.manpages@gmail.com> wrote:
> > > > > NOTES
> > > > >        The file descriptor returned when seccomp(2) is employed with the
> > > > >        SECCOMP_FILTER_FLAG_NEW_LISTENER  flag  can  be  monitored  using
> > > > >        poll(2), epoll(7), and select(2).  When a notification  is  pend‐
> > > > >        ing,  these interfaces indicate that the file descriptor is read‐
> > > > >        able.
> > > >
> > > > We should probably also point out somewhere that, as
> > > > include/uapi/linux/seccomp.h says:
> > > >
> > > >  * Similar precautions should be applied when stacking SECCOMP_RET_USER_NOTIF
> > > >  * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the
> > > >  * same syscall, the most recently added filter takes precedence. This means
> > > >  * that the new SECCOMP_RET_USER_NOTIF filter can override any
> > > >  * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all
> > > >  * such filtered syscalls to be executed by sending the response
> > > >  * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can equally
> > > >  * be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.
> > > >
> > > > In other words, from a security perspective, you must assume that the
> > > > target process can bypass any SECCOMP_RET_USER_NOTIF (or
> > > > SECCOMP_RET_TRACE) filters unless it is completely prohibited from
> > > > calling seccomp(). This should also be noted over in the main
> > > > seccomp(2) manpage, especially the SECCOMP_RET_TRACE part.
> > >
> > > So I was actually wondering about this when I skimmed this and a while
> > > ago but forgot about this again... Afaict, you can only ever load a
> > > single filter with SECCOMP_FILTER_FLAG_NEW_LISTENER set. If there
> > > already is a filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER property
> > > in the tasks filter hierarchy then the kernel will refuse to load a new
> > > one?
> > >
> > > static struct file *init_listener(struct seccomp_filter *filter)
> > > {
> > >         struct file *ret = ERR_PTR(-EBUSY);
> > >         struct seccomp_filter *cur;
> > >
> > >         for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> > >                 if (cur->notif)
> > >                         goto out;
> > >         }
> > >
> > > shouldn't that be sufficient to guarantee that USER_NOTIF filters can't
> > > override each other for the same task simply because there can only ever
> > > be a single one?
> > 
> > Good point. Exceeeept that that check seems ineffective because this
> > happens before we take the locks that guard against TSYNC, and also
> > before we decide to which existing filter we want to chain the new
> > filter. So if two threads race with TSYNC, I think they'll be able to
> > chain two filters with listeners together.
> 
> Yep, seems the check needs to also be in seccomp_can_sync_threads() to
> be totally effective,
> 
> > I don't know whether we want to eternalize this "only one listener
> > across all the filters" restriction in the manpage though, or whether
> > the man page should just say that the kernel currently doesn't support
> > it but that security-wise you should assume that it might at some
> > point.
> 
> This requirement originally came from Andy, arguing that the semantics
> of this were/are confusing, which still makes sense to me. Perhaps we
> should do something like the below?

I think we should either keep this restriction and cement it in the
manpage, or add a flag to indicate that the notifier is
non-overridable.
I don't care much about the default, i.e. whether a notifier is
overridable by default and exclusive on opt-in, or the other way around.
But from a supervisor's perspective it'd be quite nice to be able to be
sure that a notifier can't be overridden by another notifier.

I think having a flag would provide the greatest flexibility, but I agree
that the semantics of multiple listeners are kinda odd.

The patch below looks sane to me, though again, I'm not sitting in front
of source code.

Christian

> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 3ee59ce0a323..7b107207c2b0 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -376,6 +376,18 @@ static int is_ancestor(struct seccomp_filter *parent,
>  	return 0;
>  }
>  
> +static bool has_listener_parent(struct seccomp_filter *child)
> +{
> +	struct seccomp_filter *cur;
> +
> +	for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> +		if (cur->notif)
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
>  /**
>   * seccomp_can_sync_threads: checks if all threads can be synchronized
>   *
> @@ -385,7 +397,7 @@ static int is_ancestor(struct seccomp_filter *parent,
>   * either not in the correct seccomp mode or did not have an ancestral
>   * seccomp filter.
>   */
> -static inline pid_t seccomp_can_sync_threads(void)
> +static inline pid_t seccomp_can_sync_threads(unsigned int flags)
>  {
>  	struct task_struct *thread, *caller;
>  
> @@ -407,6 +419,11 @@ static inline pid_t seccomp_can_sync_threads(void)
>  				 caller->seccomp.filter)))
>  			continue;
>  
> +		/* don't allow TSYNC to install multiple listeners */
> +		if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER &&
> +		    !has_listener_parent(thread->seccomp.filter))
> +			continue;
> +
>  		/* Return the first thread that cannot be synchronized. */
>  		failed = task_pid_vnr(thread);
>  		/* If the pid cannot be resolved, then return -ESRCH */
> @@ -637,7 +654,7 @@ static long seccomp_attach_filter(unsigned int flags,
>  	if (flags & SECCOMP_FILTER_FLAG_TSYNC) {
>  		int ret;
>  
> -		ret = seccomp_can_sync_threads();
> +		ret = seccomp_can_sync_threads(flags);
>  		if (ret) {
>  			if (flags & SECCOMP_FILTER_FLAG_TSYNC_ESRCH)
>  				return -ESRCH;
> @@ -1462,12 +1479,9 @@ static const struct file_operations seccomp_notify_ops = {
>  static struct file *init_listener(struct seccomp_filter *filter)
>  {
>  	struct file *ret = ERR_PTR(-EBUSY);
> -	struct seccomp_filter *cur;
>  
> -	for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> -		if (cur->notif)
> -			goto out;
> -	}
> +	if (has_listener_parent(current->seccomp.filter))
> +		goto out;
>  
>  	ret = ERR_PTR(-ENOMEM);
>  	filter->notif = kzalloc(sizeof(*(filter->notif)), GFP_KERNEL);


* Re: For review: seccomp_user_notif(2) manual page
  2020-10-01 16:58       ` Tycho Andersen
  2020-10-01 17:12         ` Christian Brauner
@ 2020-10-01 18:18         ` Jann Horn
  2020-10-01 18:56           ` Tycho Andersen
  1 sibling, 1 reply; 52+ messages in thread
From: Jann Horn @ 2020-10-01 18:18 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Christian Brauner, linux-man, Song Liu, Will Drewry, Kees Cook,
	Daniel Borkmann, Giuseppe Scrivano, Robert Sesek,
	Linux Containers, lkml, Alexei Starovoitov,
	Michael Kerrisk (man-pages),
	bpf, Andy Lutomirski, Christian Brauner

On Thu, Oct 1, 2020 at 6:58 PM Tycho Andersen <tycho@tycho.pizza> wrote:
> On Thu, Oct 01, 2020 at 05:47:54PM +0200, Jann Horn via Containers wrote:
> > On Thu, Oct 1, 2020 at 2:54 PM Christian Brauner
> > <christian.brauner@canonical.com> wrote:
> > > On Wed, Sep 30, 2020 at 05:53:46PM +0200, Jann Horn via Containers wrote:
> > > > On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
> > > > <mtk.manpages@gmail.com> wrote:
> > > > > NOTES
> > > > >        The file descriptor returned when seccomp(2) is employed with the
> > > > >        SECCOMP_FILTER_FLAG_NEW_LISTENER  flag  can  be  monitored  using
> > > > >        poll(2), epoll(7), and select(2).  When a notification  is  pend‐
> > > > >        ing,  these interfaces indicate that the file descriptor is read‐
> > > > >        able.
> > > >
> > > > We should probably also point out somewhere that, as
> > > > include/uapi/linux/seccomp.h says:
> > > >
> > > >  * Similar precautions should be applied when stacking SECCOMP_RET_USER_NOTIF
> > > >  * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the
> > > >  * same syscall, the most recently added filter takes precedence. This means
> > > >  * that the new SECCOMP_RET_USER_NOTIF filter can override any
> > > >  * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all
> > > >  * such filtered syscalls to be executed by sending the response
> > > >  * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can equally
> > > >  * be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.
> > > >
> > > > In other words, from a security perspective, you must assume that the
> > > > target process can bypass any SECCOMP_RET_USER_NOTIF (or
> > > > SECCOMP_RET_TRACE) filters unless it is completely prohibited from
> > > > calling seccomp(). This should also be noted over in the main
> > > > seccomp(2) manpage, especially the SECCOMP_RET_TRACE part.
> > >
> > > So I was actually wondering about this when I skimmed this and a while
> > > ago but forgot about this again... Afaict, you can only ever load a
> > > single filter with SECCOMP_FILTER_FLAG_NEW_LISTENER set. If there
> > > already is a filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER property
> > > in the tasks filter hierarchy then the kernel will refuse to load a new
> > > one?
> > >
> > > static struct file *init_listener(struct seccomp_filter *filter)
> > > {
> > >         struct file *ret = ERR_PTR(-EBUSY);
> > >         struct seccomp_filter *cur;
> > >
> > >         for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> > >                 if (cur->notif)
> > >                         goto out;
> > >         }
> > >
> > > shouldn't that be sufficient to guarantee that USER_NOTIF filters can't
> > > override each other for the same task simply because there can only ever
> > > be a single one?
> >
> > Good point. Exceeeept that that check seems ineffective because this
> > happens before we take the locks that guard against TSYNC, and also
> > before we decide to which existing filter we want to chain the new
> > filter. So if two threads race with TSYNC, I think they'll be able to
> > chain two filters with listeners together.
>
> Yep, seems the check needs to also be in seccomp_can_sync_threads() to
> be totally effective,
>
> > I don't know whether we want to eternalize this "only one listener
> > across all the filters" restriction in the manpage though, or whether
> > the man page should just say that the kernel currently doesn't support
> > it but that security-wise you should assume that it might at some
> > point.
>
> This requirement originally came from Andy, arguing that the semantics
> of this were/are confusing, which still makes sense to me. Perhaps we
> should do something like the below?
[...]
> +static bool has_listener_parent(struct seccomp_filter *child)
> +{
> +       struct seccomp_filter *cur;
> +
> +       for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> +               if (cur->notif)
> +                       return true;
> +       }
> +
> +       return false;
> +}
[...]
> @@ -407,6 +419,11 @@ static inline pid_t seccomp_can_sync_threads(void)
[...]
> +               /* don't allow TSYNC to install multiple listeners */
> +               if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER &&
> +                   !has_listener_parent(thread->seccomp.filter))
> +                       continue;
[...]
> @@ -1462,12 +1479,9 @@ static const struct file_operations seccomp_notify_ops = {
>  static struct file *init_listener(struct seccomp_filter *filter)
[...]
> -       for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> -               if (cur->notif)
> -                       goto out;
> -       }
> +       if (has_listener_parent(current->seccomp.filter))
> +               goto out;

I dislike this because it combines a non-locked check and a locked
check. And I don't think this will work in the case where TSYNC and
non-TSYNC race - if the non-TSYNC call nests around the TSYNC filter
installation, the thread that called seccomp in non-TSYNC mode will
still end up with two notifying filters. How about the following?


diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 676d4af62103..c49ad8ba0bc1 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1475,11 +1475,6 @@ static struct file *init_listener(struct seccomp_filter *filter)
        struct file *ret = ERR_PTR(-EBUSY);
        struct seccomp_filter *cur;

-       for (cur = current->seccomp.filter; cur; cur = cur->prev) {
-               if (cur->notif)
-                       goto out;
-       }
-
        ret = ERR_PTR(-ENOMEM);
        filter->notif = kzalloc(sizeof(*(filter->notif)), GFP_KERNEL);
        if (!filter->notif)
@@ -1504,6 +1499,31 @@ static struct file *init_listener(struct seccomp_filter *filter)
        return ret;
 }

+/*
+ * Does @new_child have a listener while an ancestor also has a listener?
+ * If so, we'll want to reject this filter.
+ * This only has to be tested for the current process, even in the TSYNC case,
+ * because TSYNC installs @new_child with the same parent on all threads.
+ * Note that @new_child is not hooked up to its parent at this point yet, so
+ * we use current->seccomp.filter.
+ */
+static bool has_duplicate_listener(struct seccomp_filter *new_child)
+{
+       struct seccomp_filter *cur;
+
+       /* must be protected against concurrent TSYNC */
+       lockdep_assert_held(&current->sighand->siglock);
+
+       if (!new_child->notif)
+               return false;
+       for (cur = current->seccomp.filter; cur; cur = cur->prev) {
+               if (cur->notif)
+                       return true;
+       }
+
+       return false;
+}
+
 /**
  * seccomp_set_mode_filter: internal function for setting seccomp filter
  * @flags:  flags to change filter behavior
@@ -1575,6 +1595,9 @@ static long seccomp_set_mode_filter(unsigned int flags,
        if (!seccomp_may_assign_mode(seccomp_mode))
                goto out;

+       if (has_duplicate_listener(prepared))
+               goto out;
+
        ret = seccomp_attach_filter(flags, prepared);
        if (ret)
                goto out;
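
The logic of the check in the patch above can be modelled in user space
for experimentation. This is only a minimal sketch under stated
assumptions: struct filter_model and would_duplicate_listener() are
hypothetical stand-ins for struct seccomp_filter and
has_duplicate_listener(), and the siglock requirement is reduced to a
comment because nothing here runs concurrently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct seccomp_filter. */
struct filter_model {
        struct filter_model *prev;      /* next-older filter in the chain */
        bool has_listener;              /* models filter->notif != NULL */
};

/*
 * Models has_duplicate_listener(): would attaching @new_child on top of
 * the chain headed by @installed leave the task with two listeners?
 * In the kernel this runs with siglock held; nothing is locked here.
 */
bool would_duplicate_listener(const struct filter_model *installed,
                              const struct filter_model *new_child)
{
        const struct filter_model *cur;

        assert(new_child != NULL);

        /* A filter without a listener can never create a duplicate. */
        if (!new_child->has_listener)
                return false;

        /* Walk from the newest installed filter back to the oldest. */
        for (cur = installed; cur != NULL; cur = cur->prev) {
                if (cur->has_listener)
                        return true;
        }

        return false;
}
```

The point of the model is the ordering: the new filter is inspected
first, and the walk over the installed chain only matters when the new
filter itself asks for a listener.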


* Re: For review: seccomp_user_notif(2) manual page
  2020-10-01 18:18         ` Jann Horn
@ 2020-10-01 18:56           ` Tycho Andersen
  0 siblings, 0 replies; 52+ messages in thread
From: Tycho Andersen @ 2020-10-01 18:56 UTC (permalink / raw)
  To: Jann Horn
  Cc: Christian Brauner, linux-man, Song Liu, Will Drewry, Kees Cook,
	Daniel Borkmann, Giuseppe Scrivano, Robert Sesek,
	Linux Containers, lkml, Alexei Starovoitov,
	Michael Kerrisk (man-pages),
	bpf, Andy Lutomirski, Christian Brauner

On Thu, Oct 01, 2020 at 08:18:49PM +0200, Jann Horn wrote:
> On Thu, Oct 1, 2020 at 6:58 PM Tycho Andersen <tycho@tycho.pizza> wrote:
> > On Thu, Oct 01, 2020 at 05:47:54PM +0200, Jann Horn via Containers wrote:
> > > On Thu, Oct 1, 2020 at 2:54 PM Christian Brauner
> > > <christian.brauner@canonical.com> wrote:
> > > > On Wed, Sep 30, 2020 at 05:53:46PM +0200, Jann Horn via Containers wrote:
> > > > > On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
> > > > > <mtk.manpages@gmail.com> wrote:
> > > > > > NOTES
> > > > > >        The file descriptor returned when seccomp(2) is employed with the
> > > > > >        SECCOMP_FILTER_FLAG_NEW_LISTENER  flag  can  be  monitored  using
> > > > > >        poll(2), epoll(7), and select(2).  When a notification  is  pend‐
> > > > > >        ing,  these interfaces indicate that the file descriptor is read‐
> > > > > >        able.
> > > > >
> > > > > We should probably also point out somewhere that, as
> > > > > include/uapi/linux/seccomp.h says:
> > > > >
> > > > >  * Similar precautions should be applied when stacking SECCOMP_RET_USER_NOTIF
> > > > >  * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the
> > > > >  * same syscall, the most recently added filter takes precedence. This means
> > > > >  * that the new SECCOMP_RET_USER_NOTIF filter can override any
> > > > >  * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all
> > > > >  * such filtered syscalls to be executed by sending the response
> > > > >  * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can equally
> > > > >  * be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.
> > > > >
> > > > > In other words, from a security perspective, you must assume that the
> > > > > target process can bypass any SECCOMP_RET_USER_NOTIF (or
> > > > > SECCOMP_RET_TRACE) filters unless it is completely prohibited from
> > > > > calling seccomp(). This should also be noted over in the main
> > > > > seccomp(2) manpage, especially the SECCOMP_RET_TRACE part.
> > > >
> > > > So I was actually wondering about this when I skimmed this and a while
> > > > ago but forgot about this again... Afaict, you can only ever load a
> > > > single filter with SECCOMP_FILTER_FLAG_NEW_LISTENER set. If there
> > > > already is a filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER property
> > > > in the tasks filter hierarchy then the kernel will refuse to load a new
> > > > one?
> > > >
> > > > static struct file *init_listener(struct seccomp_filter *filter)
> > > > {
> > > >         struct file *ret = ERR_PTR(-EBUSY);
> > > >         struct seccomp_filter *cur;
> > > >
> > > >         for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> > > >                 if (cur->notif)
> > > >                         goto out;
> > > >         }
> > > >
> > > > shouldn't that be sufficient to guarantee that USER_NOTIF filters can't
> > > > override each other for the same task simply because there can only ever
> > > > be a single one?
> > >
> > > Good point. Exceeeept that that check seems ineffective because this
> > > happens before we take the locks that guard against TSYNC, and also
> > > before we decide to which existing filter we want to chain the new
> > > filter. So if two threads race with TSYNC, I think they'll be able to
> > > chain two filters with listeners together.
> >
> > Yep, seems the check needs to also be in seccomp_can_sync_threads() to
> > be totally effective,
> >
> > > I don't know whether we want to eternalize this "only one listener
> > > across all the filters" restriction in the manpage though, or whether
> > > the man page should just say that the kernel currently doesn't support
> > > it but that security-wise you should assume that it might at some
> > > point.
> >
> > This requirement originally came from Andy, arguing that the semantics
> > of this were/are confusing, which still makes sense to me. Perhaps we
> > should do something like the below?
> [...]
> > +static bool has_listener_parent(struct seccomp_filter *child)
> > +{
> > +       struct seccomp_filter *cur;
> > +
> > +       for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> > +               if (cur->notif)
> > +                       return true;
> > +       }
> > +
> > +       return false;
> > +}
> [...]
> > @@ -407,6 +419,11 @@ static inline pid_t seccomp_can_sync_threads(void)
> [...]
> > +               /* don't allow TSYNC to install multiple listeners */
> > +               if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER &&
> > +                   !has_listener_parent(thread->seccomp.filter))
> > +                       continue;
> [...]
> > @@ -1462,12 +1479,9 @@ static const struct file_operations seccomp_notify_ops = {
> >  static struct file *init_listener(struct seccomp_filter *filter)
> [...]
> > -       for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> > -               if (cur->notif)
> > -                       goto out;
> > -       }
> > +       if (has_listener_parent(current->seccomp.filter))
> > +               goto out;
> 
> I dislike this because it combines a non-locked check and a locked
> check. And I don't think this will work in the case where TSYNC and
> non-TSYNC race - if the non-TSYNC call nests around the TSYNC filter
> installation, the thread that called seccomp in non-TSYNC mode will
> still end up with two notifying filters. How about the following?

Sure, you can add,

Reviewed-by: Tycho Andersen <tycho@tycho.pizza>

when you send it.

Tycho


* Re: For review: seccomp_user_notif(2) manual page
  2020-09-30 11:07 For review: seccomp_user_notif(2) manual page Michael Kerrisk (man-pages)
                   ` (3 preceding siblings ...)
  2020-10-01 12:36 ` Christian Brauner
@ 2020-10-01 21:06 ` Sargun Dhillon
  2020-10-01 23:19   ` Tycho Andersen
  4 siblings, 1 reply; 52+ messages in thread
From: Sargun Dhillon @ 2020-10-01 21:06 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Tycho Andersen, Kees Cook, Christian Brauner, linux-man, lkml,
	Aleksa Sarai, Jann Horn, Alexei Starovoitov, Will Drewry, bpf,
	Song Liu, Daniel Borkmann, Andy Lutomirski, Linux Containers,
	Giuseppe Scrivano, Robert Sesek

On Wed, Sep 30, 2020 at 4:07 AM Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
>
> Hi Tycho, Sargun (and all),
>
> I knew it would be a big ask, but below is kind of the manual page
> I was hoping you might write [1] for the seccomp user-space notification
> mechanism. Since you didn't (and because 5.9 adds various new pieces
> such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD
> that also will need documenting [2]), I did :-). But of course I may
> have made mistakes...
>
> I've shown the rendered version of the page below, and would love
> to receive review comments from you and others, and acks, etc.
>
> There are a few FIXMEs sprinkled into the page, including one
> that relates to what appears to me to be a misdesign (possibly
> fixable) in the operation of the SECCOMP_IOCTL_NOTIF_RECV
> operation. I would be especially interested in feedback on that
> FIXME, and also of course the other FIXMEs.
>
> The page includes an extensive (albeit slightly contrived)
> example program, and I would be happy also to receive comments
> on that program.
>
> The page source currently sits in a branch (along with the text
> that you sent me for the seccomp(2) page) at
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=seccomp_user_notif
>
> Thanks,
>
> Michael
>
> [1] https://lore.kernel.org/linux-man/2cea5fec-e73e-5749-18af-15c35a4bd23c@gmail.com/#t
> [2] Sargun, can you prepare something on SECCOMP_ADDFD_FLAG_SETFD
>     and SECCOMP_IOCTL_NOTIF_ADDFD to be added to this page?
>
> ====
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/

Should we consider the SECCOMP_GET_NOTIF_SIZES dance to be "deprecated" at
this point, given that the extensible ioctl mechanism works? If we add
new fields to the seccomp data structures, we would move them from
fixed-size ioctls to variable-sized ioctls that encode the data structure
size / length?

-- This is mostly a question for Kees and Tycho.


* Re: For review: seccomp_user_notif(2) manual page
  2020-10-01 21:06 ` Sargun Dhillon
@ 2020-10-01 23:19   ` Tycho Andersen
  0 siblings, 0 replies; 52+ messages in thread
From: Tycho Andersen @ 2020-10-01 23:19 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Michael Kerrisk (man-pages),
	Kees Cook, Christian Brauner, linux-man, lkml, Aleksa Sarai,
	Jann Horn, Alexei Starovoitov, Will Drewry, bpf, Song Liu,
	Daniel Borkmann, Andy Lutomirski, Linux Containers,
	Giuseppe Scrivano, Robert Sesek

On Thu, Oct 01, 2020 at 02:06:10PM -0700, Sargun Dhillon wrote:
> On Wed, Sep 30, 2020 at 4:07 AM Michael Kerrisk (man-pages)
> <mtk.manpages@gmail.com> wrote:
> >
> > Hi Tycho, Sargun (and all),
> >
> > I knew it would be a big ask, but below is kind of the manual page
> > I was hoping you might write [1] for the seccomp user-space notification
> > mechanism. Since you didn't (and because 5.9 adds various new pieces
> > such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD
> > that also will need documenting [2]), I did :-). But of course I may
> > have made mistakes...
> >
> > I've shown the rendered version of the page below, and would love
> > to receive review comments from you and others, and acks, etc.
> >
> > There are a few FIXMEs sprinkled into the page, including one
> > that relates to what appears to me to be a misdesign (possibly
> > fixable) in the operation of the SECCOMP_IOCTL_NOTIF_RECV
> > operation. I would be especially interested in feedback on that
> > FIXME, and also of course the other FIXMEs.
> >
> > The page includes an extensive (albeit slightly contrived)
> > example program, and I would be happy also to receive comments
> > on that program.
> >
> > The page source currently sits in a branch (along with the text
> > that you sent me for the seccomp(2) page) at
> > https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=seccomp_user_notif
> >
> > Thanks,
> >
> > Michael
> >
> > [1] https://lore.kernel.org/linux-man/2cea5fec-e73e-5749-18af-15c35a4bd23c@gmail.com/#t
> > [2] Sargun, can you prepare something on SECCOMP_ADDFD_FLAG_SETFD
> >     and SECCOMP_IOCTL_NOTIF_ADDFD to be added to this page?
> >
> > ====
> >
> > --
> > Michael Kerrisk
> > Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> > Linux/UNIX System Programming Training: http://man7.org/training/
> 
> Should we consider the SECCOMP_GET_NOTIF_SIZES dance to be "deprecated" at
> this point, given that the extensible ioctl mechanism works? If we add
> new fields to the seccomp data structures, we would move them from
> fixed-size ioctls to variable-sized ioctls that encode the data structure
> size / length?
> 
> -- This is mostly a question for Kees and Tycho.

It will tell you how big struct seccomp_data in the currently running
kernel is, so it still seems useful/necessary to me, unless there's
another way to figure that out.

But I agree, I don't think the intent is to add anything else to
struct seccomp_notif. (I don't know that it ever was.)
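
For reference, a minimal sketch of that size query. It assumes UAPI
headers from Linux 5.0 or later (for SECCOMP_GET_NOTIF_SIZES and struct
seccomp_notif_sizes); query_notif_sizes() is a hypothetical helper name,
not an existing API.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Ask the running kernel how large its user-notification structures are.
 * Returns 0 on success, or -1 with errno set (for example on a kernel
 * that predates SECCOMP_GET_NOTIF_SIZES).
 */
int query_notif_sizes(struct seccomp_notif_sizes *sizes)
{
        assert(sizes != NULL);

        /* glibc has no seccomp() wrapper, so go through syscall(2). */
        return (int) syscall(SYS_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, sizes);
}
```

On success, sizes->seccomp_data is the running kernel's idea of struct
seccomp_data, while sizes->seccomp_notif and sizes->seccomp_notif_resp
give the buffer sizes a supervisor would allocate.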

Tycho


* Re: For review: seccomp_user_notif(2) manual page
  2020-10-01  7:45       ` Michael Kerrisk (man-pages)
@ 2020-10-14  4:40         ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 52+ messages in thread
From: Michael Kerrisk (man-pages) @ 2020-10-14  4:40 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: mtk.manpages, Sargun Dhillon, Kees Cook, Christian Brauner,
	linux-man, lkml, Aleksa Sarai, Jann Horn, Alexei Starovoitov,
	wad, bpf, Song Liu, Daniel Borkmann, Andy Lutomirski,
	Linux Containers, Giuseppe Scrivano, Robert Sesek

Hi Tycho,

Ping on the question below!

Thanks,

Michael

On 10/1/20 9:45 AM, Michael Kerrisk (man-pages) wrote:
> On 10/1/20 1:03 AM, Tycho Andersen wrote:
>> On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) wrote:
>>> Hi Tycho,
>>>
>>> Thanks for taking time to look at the page!
>>>
>>> On 9/30/20 5:03 PM, Tycho Andersen wrote:
>>>> On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
> 
> [...]
> 
>>>>>        ┌─────────────────────────────────────────────────────┐
>>>>>        │FIXME                                                │
>>>>>        ├─────────────────────────────────────────────────────┤
>>>>>        │Interestingly, after the event  had  been  received, │
>>>>>        │the  file descriptor indicates as writable (verified │
>>>>>        │from the source code and by experiment). How is this │
>>>>>        │useful?                                              │
>>>>
>>>> You're saying it should just do EPOLLOUT and not EPOLLWRNORM? Seems
>>>> reasonable.
>>>
>>> No, I'm saying something more fundamental: why is the FD indicating as
>>> writable? Can you write something to it? If yes, what? If not, then
>>> why do these APIs want to say that the FD is writable?
>>
>> You can't via read(2) or write(2), but conceptually NOTIFY_RECV and
>> NOTIFY_SEND are reading and writing events from the fd. I don't know
>> that much about the poll interface though -- is it possible to
>> indicate "here's a pseudo-read event"? It didn't look like it, so I
>> just (ab-)used POLLIN and POLLOUT, but probably that's wrong.
> 
> I think the POLLIN thing is fine.
> 
> So, I think maybe I now understand what you intended with setting
> POLLOUT: the notification has been received ("read") and now the
> FD can be used to NOTIFY_SEND ("write") a response. Right?
> 
> If that's correct, I don't have a problem with it. I just wonder:
> is it useful? IOW: are there situations where the process doing the
> NOTIFY_SEND might want to test for POLLOUT because it doesn't
> know whether a NOTIFY_RECV has occurred? 
> 
> Thanks,
> 
> Michael
> 
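
To make the POLLIN/POLLOUT question concrete, here is a minimal sketch of
how a supervisor might check both conditions in one call. The helper name
poll_listener() is hypothetical, and nothing in it is specific to
seccomp: it works on any pollable file descriptor, so it can be tried out
on a pipe as a stand-in for the listening descriptor.

```c
#include <assert.h>
#include <poll.h>

/*
 * Wait up to @timeout_ms milliseconds for @fd to become readable or
 * writable. Returns the revents mask reported by poll(2), 0 on timeout,
 * or -1 on error (with errno set by poll).
 *
 * For a seccomp listening descriptor, POLLIN would mean that a
 * notification is pending; per the discussion above, POLLOUT is reported
 * once a notification has been received and awaits a response.
 */
int poll_listener(int fd, int timeout_ms)
{
        struct pollfd pfd = { .fd = fd, .events = POLLIN | POLLOUT };
        int n;

        assert(fd >= 0);

        n = poll(&pfd, 1, timeout_ms);
        if (n <= 0)
                return n;       /* timeout (0) or error (-1) */

        return pfd.revents;
}
```

A caller would test the result with, for example,
(mask > 0 && (mask & POLLIN)) before issuing SECCOMP_IOCTL_NOTIF_RECV.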


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


* Re: For review: seccomp_user_notif(2) manual page
  2020-10-01 17:12         ` Christian Brauner
@ 2020-10-14  5:41           ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 52+ messages in thread
From: Michael Kerrisk (man-pages) @ 2020-10-14  5:41 UTC (permalink / raw)
  To: Christian Brauner, Tycho Andersen
  Cc: mtk.manpages, Jann Horn, linux-man, Song Liu, Will Drewry,
	Kees Cook, Daniel Borkmann, Giuseppe Scrivano, Robert Sesek,
	Linux Containers, lkml, Alexei Starovoitov, bpf, Andy Lutomirski,
	Christian Brauner

On 10/1/20 7:12 PM, Christian Brauner wrote:
> On Thu, Oct 01, 2020 at 10:58:50AM -0600, Tycho Andersen wrote:
>> On Thu, Oct 01, 2020 at 05:47:54PM +0200, Jann Horn via Containers wrote:
>>> On Thu, Oct 1, 2020 at 2:54 PM Christian Brauner
>>> <christian.brauner@canonical.com> wrote:
>>>> On Wed, Sep 30, 2020 at 05:53:46PM +0200, Jann Horn via Containers wrote:
>>>>> On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
>>>>> <mtk.manpages@gmail.com> wrote:
>>>>>> NOTES
>>>>>>        The file descriptor returned when seccomp(2) is employed with the
>>>>>>        SECCOMP_FILTER_FLAG_NEW_LISTENER  flag  can  be  monitored  using
>>>>>>        poll(2), epoll(7), and select(2).  When a notification  is  pend‐
>>>>>>        ing,  these interfaces indicate that the file descriptor is read‐
>>>>>>        able.
>>>>>
>>>>> We should probably also point out somewhere that, as
>>>>> include/uapi/linux/seccomp.h says:
>>>>>
>>>>>  * Similar precautions should be applied when stacking SECCOMP_RET_USER_NOTIF
>>>>>  * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the
>>>>>  * same syscall, the most recently added filter takes precedence. This means
>>>>>  * that the new SECCOMP_RET_USER_NOTIF filter can override any
>>>>>  * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all
>>>>>  * such filtered syscalls to be executed by sending the response
>>>>>  * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can equally
>>>>>  * be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.
>>>>>
>>>>> In other words, from a security perspective, you must assume that the
>>>>> target process can bypass any SECCOMP_RET_USER_NOTIF (or
>>>>> SECCOMP_RET_TRACE) filters unless it is completely prohibited from
>>>>> calling seccomp(). This should also be noted over in the main
>>>>> seccomp(2) manpage, especially the SECCOMP_RET_TRACE part.
>>>>
>>>> So I was actually wondering about this when I skimmed this and a while
>>>> ago but forgot about this again... Afaict, you can only ever load a
>>>> single filter with SECCOMP_FILTER_FLAG_NEW_LISTENER set. If there
>>>> already is a filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER property
>>>> in the tasks filter hierarchy then the kernel will refuse to load a new
>>>> one?
>>>>
>>>> static struct file *init_listener(struct seccomp_filter *filter)
>>>> {
>>>>         struct file *ret = ERR_PTR(-EBUSY);
>>>>         struct seccomp_filter *cur;
>>>>
>>>>         for (cur = current->seccomp.filter; cur; cur = cur->prev) {
>>>>                 if (cur->notif)
>>>>                         goto out;
>>>>         }
>>>>
>>>> shouldn't that be sufficient to guarantee that USER_NOTIF filters can't
>>>> override each other for the same task simply because there can only ever
>>>> be a single one?
>>>
>>> Good point. Exceeeept that that check seems ineffective because this
>>> happens before we take the locks that guard against TSYNC, and also
>>> before we decide to which existing filter we want to chain the new
>>> filter. So if two threads race with TSYNC, I think they'll be able to
>>> chain two filters with listeners together.
>>
>> Yep, seems the check needs to also be in seccomp_can_sync_threads() to
>> be totally effective,
>>
>>> I don't know whether we want to eternalize this "only one listener
>>> across all the filters" restriction in the manpage though, or whether
>>> the man page should just say that the kernel currently doesn't support
>>> it but that security-wise you should assume that it might at some
>>> point.
>>
>> This requirement originally came from Andy, arguing that the semantics
>> of this were/are confusing, which still makes sense to me. Perhaps we
>> should do something like the below?
> 
> I think we should either keep up this restriction and then cement it in
> the manpage or add a flag to indicate that the notifier is
> non-overridable.
> I don't care about the default too much, i.e. whether it's overridable
> by default and exclusive if opting in or the other way around doesn't
> matter too much. But from a supervisor's perspective it'd be quite nice
> to be able to be sure that a notifier can't be overridden by another
> notifier.
> 
> I think having a flag would provide the greatest flexibility but I agree
> that the semantics of multiple listeners are kinda odd.

So, for now, I have applied the patch at the foot of this mail
to the pages. Does this seem correct?

> Below looks sane to me though again, I'm not sitting in front of source
> code.
[...]

Thanks,

Michael

PS Jann, if you see this, I'm still working through your (extensive
and very helpful) review comments. I will be sending a response.

======

diff --git a/man2/seccomp.2 b/man2/seccomp.2
index 9ab07f4ab..45a6984df 100644
--- a/man2/seccomp.2
+++ b/man2/seccomp.2
@@ -221,6 +221,11 @@ return a new user-space notification file descriptor.
 When the filter returns
 .BR SECCOMP_RET_USER_NOTIF
 a notification will be sent to this file descriptor.
+.IP
+At most one seccomp filter using the
+.BR SECCOMP_FILTER_FLAG_NEW_LISTENER
+flag can be installed for a thread.
+.IP
 See
 .BR seccomp_user_notif (2)
 for further details.
@@ -789,6 +794,12 @@ capability in its user namespace, or had not set
 before using
 .BR SECCOMP_SET_MODE_FILTER .
 .TP
+.BR EBUSY
+While installing a new filter, the
+.BR SECCOMP_FILTER_FLAG_NEW_LISTENER
+flag was specified,
+but a previous filter had already been installed with that flag.
+.TP
 .BR EFAULT
 .IR args
 was not a valid address.
diff --git a/man2/seccomp_user_notif.2 b/man2/seccomp_user_notif.2
index a6025e4d4..d1a406f46 100644
--- a/man2/seccomp_user_notif.2
+++ b/man2/seccomp_user_notif.2
@@ -92,6 +92,7 @@ Consequently, the return value  of the (successful)
 .BR seccomp (2)
 call is a new "listening"
 file descriptor that can be used to receive notifications.
+Only one such "listener" can be established.
 .IP \(bu
 In cases where it is appropriate, the seccomp filter returns the action value
 .BR SECCOMP_RET_USER_NOTIF .
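
The EBUSY case documented above can be observed with a small experiment.
This is a sketch under stated assumptions, not part of the patch: it
assumes UAPI headers from Linux 5.0 or later, a kernel that enforces the
one-listener restriction, and a caller that may install filters;
install_listener_filter() and second_listener_errno() are hypothetical
helper names. Note that the allow-everything filter it installs stays
attached to the calling thread.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Install an allow-everything filter that requests a listening fd. */
static int install_listener_filter(void)
{
        struct sock_filter insns[] = {
                BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        };
        struct sock_fprog prog = {
                .len = sizeof(insns) / sizeof(insns[0]),
                .filter = insns,
        };

        return (int) syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
                             SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
}

/*
 * Returns the errno of a second NEW_LISTENER install (EBUSY expected),
 * 0 if the second install unexpectedly succeeded, or -1 if the first
 * install could not be performed at all (old kernel, sandbox, ...).
 */
int second_listener_errno(void)
{
        int first, second, err = 0;

        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1)
                return -1;

        first = install_listener_filter();
        if (first == -1)
                return -1;
        assert(first >= 0);     /* a listening fd was returned */

        /* Keep the first fd open so its listener stays attached. */
        second = install_listener_filter();
        if (second == -1)
                err = errno;
        else
                close(second);

        close(first);
        return err;
}
```

The first descriptor is deliberately kept open across the second
attempt, so that the original listener is still in place when the kernel
performs its check.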

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


* Re: For review: seccomp_user_notif(2) manual page
  2020-10-01 12:36 ` Christian Brauner
@ 2020-10-15 11:23   ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 52+ messages in thread
From: Michael Kerrisk (man-pages) @ 2020-10-15 11:23 UTC (permalink / raw)
  To: Christian Brauner
  Cc: mtk.manpages, Tycho Andersen, Sargun Dhillon, linux-man,
	Song Liu, wad, Kees Cook, Daniel Borkmann, Jann Horn,
	Robert Sesek, Linux Containers, lkml, Alexei Starovoitov,
	Giuseppe Scrivano, bpf, Andy Lutomirski, Christian Brauner

Hello Christian,

On 10/1/20 2:36 PM, Christian Brauner wrote:
> [I'm on vacation so I'll just give this a quick glance for now.]
> 
> On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
>> Hi Tycho, Sargun (and all),
>>
>> I knew it would be a big ask, but below is kind of the manual page
>> I was hoping you might write [1] for the seccomp user-space notification
>> mechanism. Since you didn't (and because 5.9 adds various new pieces 
>> such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD 
>> that also will need documenting [2]), I did :-). But of course I may 
>> have made mistakes...
>>
>> I've shown the rendered version of the page below, and would love
>> to receive review comments from you and others, and acks, etc.
>>
>> There are a few FIXMEs sprinkled into the page, including one
>> that relates to what appears to me to be a misdesign (possibly 
>> fixable) in the operation of the SECCOMP_IOCTL_NOTIF_RECV 
>> operation. I would be especially interested in feedback on that
>> FIXME, and also of course the other FIXMEs.
>>
>> The page includes an extensive (albeit slightly contrived)
>> example program, and I would be happy also to receive comments
>> on that program.
>>
>> The page source currently sits in a branch (along with the text
>> that you sent me for the seccomp(2) page) at
>> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=seccomp_user_notif
>>
>> Thanks,
>>
>> Michael
>>
>> [1] https://lore.kernel.org/linux-man/2cea5fec-e73e-5749-18af-15c35a4bd23c@gmail.com/#t
>> [2] Sargun, can you prepare something on SECCOMP_ADDFD_FLAG_SETFD
>>     and SECCOMP_IOCTL_NOTIF_ADDFD to be added to this page?
>>
>> =====
>>
>> NAME
>>        seccomp_user_notif - Seccomp user-space notification mechanism
>>
>> SYNOPSIS
>>        #include <linux/seccomp.h>
>>        #include <linux/filter.h>
>>        #include <linux/audit.h>
>>
>>        int seccomp(unsigned int operation, unsigned int flags, void *args);
>>
>> DESCRIPTION
>>        This  page  describes  the user-space notification mechanism pro‐
>>        vided by the Secure Computing (seccomp) facility.  As well as the
>>        use   of  the  SECCOMP_FILTER_FLAG_NEW_LISTENER  flag,  the  SEC‐
>>        COMP_RET_USER_NOTIF action value, and the SECCOMP_GET_NOTIF_SIZES
>>        operation  described  in  seccomp(2), this mechanism involves the
>>        use of a number of related ioctl(2) operations (described below).
>>
>>    Overview
>>        In conventional usage of a seccomp filter, the decision about how
>>        to  treat  a particular system call is made by the filter itself.
>>        The user-space notification mechanism allows the handling of  the
>>        system  call  to  instead  be handed off to a user-space process.
> 
> "In contrast, the user notification mechanism allows to delegate the
> handling of the system call of one process (target) to another
> user-space process (supervisor)."?

Thanks. I've reworded similarly to what you suggest.

>>        The advantages of doing this are that, by contrast with the  sec‐
>>        comp  filter,  which  is  running on a virtual machine inside the
>>        kernel, the user-space process has access to information that  is
>>        unavailable to the seccomp filter and it can perform actions that
>>        can't be performed from the seccomp filter.
> 
> This section reads a bit difficult imho:
> "A suitably privileged supervisor can use the user notification
> mechanism to perform actions in lieu of the target. The supervisor will
> usually be able to retrieve information about the target and the
> performed system call that the seccomp filter itself cannot."

Thanks. Again I've done some rewording.

>>        In the discussion that follows, the process  that  has  installed
>>        the  seccomp filter is referred to as the target, and the process
>>        that is notified by  the  user-space  notification  mechanism  is
>>        referred  to  as  the  supervisor.  An overview of the steps per‐
>>        formed by these two processes is as follows:

After the various rewordings, the opening paragraphs now read:

       In conventional usage of a seccomp filter, the decision about  how
       to treat a system call is made by the filter itself.  By contrast,
       the user-space notification mechanism allows the seccomp filter to
       delegate  the  handling  of  the system call to another user-space
       process.

       In the discussion that follows, the thread(s) on which the seccomp
       filter  is  installed  is (are) referred to as the target, and the
       process that is notified by the user-space notification  mechanism
       is referred to as the supervisor.

       A  suitably privileged supervisor can use the user-space notifica‐
       tion mechanism to perform actions on behalf of  the  target.   The
       advantage  of  the  user-space  notification mechanism is that the
       supervisor will usually be able to retrieve information about  the
       target  and  the  performed  system  call  that the seccomp filter
       itself cannot.  (A seccomp filter is limited in the information it
       can  obtain and the actions that it can perform because it is run‐
       ning on a virtual machine inside the kernel.)

       An overview of the steps performed by the target and the  supervi‐
       sor is as follows:

>>        1. The target process establishes a seccomp filter in  the  usual
>>           manner, but with two differences:
>>
>>           · The seccomp(2) flags argument includes the flag SECCOMP_FIL‐
>>             TER_FLAG_NEW_LISTENER.  Consequently, the return  value   of
>>             the  (successful)  seccomp(2) call is a new "listening" file
>>             descriptor that can be used to receive notifications.
> 
> I think it would be good to mention that seccomp notify fds are
> O_CLOEXEC by default somewhere.

Yep. This is already noted in seccomp(2).

>>           · In cases where it is appropriate, the seccomp filter returns
>>             the  action value SECCOMP_RET_USER_NOTIF.  This return value
>>             will trigger a notification event.
>>
>>        2. In order that the supervisor process can obtain  notifications
>>           using  the  listening  file  descriptor, (a duplicate of) that
>>           file descriptor must be passed from the target process to  the
>>           supervisor process.  One way in which this could be done is by
>>           passing the file descriptor over a UNIX domain socket  connec‐
>>           tion between the two processes (using the SCM_RIGHTS ancillary
>>           message type described in unix(7)).   Another  possibility  is
>>           that  the  supervisor  might  inherit  the file descriptor via
>>           fork(2).
> 
> I think a few people have already pointed out other ways of retrieving
> an fd. :)

Yup.

>>        3. The supervisor process will receive notification events on the
>>           listening  file  descriptor.   These  events  are  returned as
>>           structures of type seccomp_notif.  Because this structure  and
>>           its  size may evolve over kernel versions, the supervisor must
>>           first determine the size of  this  structure  using  the  sec‐
>>           comp(2)  SECCOMP_GET_NOTIF_SIZES  operation,  which  returns a
>>           structure of type seccomp_notif_sizes.  The  supervisor  allo‐
>>           cates a buffer of size seccomp_notif_sizes.seccomp_notif bytes
>>           to receive notification events.  In  addition,  the  supervisor
>>           allocates  another  buffer  of  size  seccomp_notif_sizes.sec‐
>>           comp_notif_resp  bytes  for  the  response  (a   struct   sec‐
>>           comp_notif_resp  structure) that it will provide to the kernel
>>           (and thus the target process).
>>
>>        4. The target process then performs its workload, which  includes
>>           system  calls  that  will be controlled by the seccomp filter.
>>           Whenever one of these system calls causes the filter to return
>>           the  SECCOMP_RET_USER_NOTIF  action value, the kernel does not
>>           execute the system call;  instead,  execution  of  the  target
>>           process is temporarily blocked inside the kernel and a notifi‐
> 
> Maybe mention that the task is killable when so blocked?

Jann also noted this, and I thought it could be presumed, and so was
not thinking to add anything to the text. But, since you mention it too,
I've added some words to note that the sleep state is interruptible by
signals.

>>           cation event is generated on the listening file descriptor.
>>
>>        5. The supervisor process can now repeatedly monitor the  listen‐
>>           ing   file   descriptor  for  SECCOMP_RET_USER_NOTIF-triggered
>>           events.   To  do  this,   the   supervisor   uses   the   SEC‐
>>           COMP_IOCTL_NOTIF_RECV  ioctl(2)  operation to read information
>>           about a notification event; this  operation  blocks  until  an
>>           event  is  available.   The  operation returns a seccomp_notif
>>           structure containing information about the system call that is
>>           being attempted by the target process.
>>
>>        6. The    seccomp_notif    structure   returned   by   the   SEC‐
>>           COMP_IOCTL_NOTIF_RECV operation includes the same  information
>>           (a seccomp_data structure) that was passed to the seccomp fil‐
>>           ter.  This information allows the supervisor to  discover  the
>>           system  call number and the arguments for the target process's
>>           system call.  In addition, the notification event contains the
>>           PID of the target process.
> 
> (Technically TID.)

Yep. I've already made various fixes after comments from Jann.

>>           The  information  in  the notification can be used to discover
>>           the values of pointer arguments for the target process's  sys‐
>>           tem call.  (This is something that can't be done from within a
>>           seccomp filter.)  To do this (and  assuming  it  has  suitable
>>           permissions),   the   supervisor   opens   the   corresponding
>>           /proc/[pid]/mem file, seeks to the memory location that corre‐
>>           sponds to one of the pointer arguments whose value is supplied
>>           in the notification event, and reads bytes from that location.
>>           (The supervisor must be careful to avoid a race condition that
>>           can occur when doing this; see the  description  of  the  SEC‐
>>           COMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.)  In addi‐
>>           tion, the supervisor can access other system information  that
>>           is  visible  in  user space but which is not accessible from a
>>           seccomp filter.
>>
>>           ┌─────────────────────────────────────────────────────┐
>>           │FIXME                                                │
>>           ├─────────────────────────────────────────────────────┤
>>           │Suppose we are reading a pathname from /proc/PID/mem │
>>           │for  a system call such as mkdir(). The pathname can │
>>           │be an arbitrary length. How do we know how much (how │
>>           │many pages) to read from /proc/PID/mem?              │
>>           └─────────────────────────────────────────────────────┘
> 
> This has already been answered, I believe.

Yep.

>>
>>        7. Having  obtained  information  as  per  the previous step, the
>>           supervisor may then choose to perform an action in response to
>>           the  target  process's  system call (which, as noted above, is
>>           not  executed  when  the  seccomp  filter  returns  the   SEC‐
>>           COMP_RET_USER_NOTIF action value).
> 
> Nit: It is not _yet_ executed; it may very well be if the response is
> "continue".

Okay. I've added the word "yet" in point 4. I already elaborate on
the "continue" details later.

> This should either mention that when the fd becomes
> _RECVable the system call is guaranteed to not have executed yet or
> specify that it is not yet executed, I think.

I'm not sure that I understand your point here. I mean, doesn't the
arrival of the notification already imply that the system call hasn't
yet been executed? You seem to be drawing some distinction between
the notification vs FD being RECVable, but I don't understand what
that distinction is. Can you elaborate please...

>>           One  example  use case here relates to containers.  The target
>>           process may be located inside a container where  it  does  not
>>           have sufficient capabilities to mount a filesystem in the con‐
>>           tainer's mount namespace.  However, the supervisor  may  be  a
>>           more privileged process that does have sufficient capabilities
>>           to perform the mount operation.
>>
>>        8. The supervisor then sends a response to the notification.  The
>>           information  in  this  response  is used by the kernel to con‐
>>           struct a return value for the target process's system call and
>>           provide a value that will be assigned to the errno variable of
>>           the target process.
>>
>>           The  response  is  sent  using  the   SECCOMP_IOCTL_NOTIF_SEND
>>           ioctl(2)   operation,   which  is  used  to  transmit  a  sec‐
>>           comp_notif_resp  structure  to  the  kernel.   This  structure
>>           includes  a  cookie  value that the supervisor obtained in the
>>           seccomp_notif    structure    returned     by     the     SEC‐
>>           COMP_IOCTL_NOTIF_RECV operation.  This cookie value allows the
>>           kernel to associate the response with the target process.
> 
> I think here or above you should mention that the id or "cookie" _must_
> be used when a file descriptor to /proc/<pid>/mem or any /proc/<pid>/*
> is opened:
> fd = open(/proc/pid/*);
> verify_via_cookie_that_pid_still_alive(cookie);
> operate_on(fd)
> 
> Otherwise this is a potential security issue.

Yes, but already in point 6 above I say:

           (The supervisor must be careful to avoid a race condition that
           can occur when doing this; see the  description  of  the  SEC‐
           COMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.)  In addi‐

And then I say more about the ioctl() later. So, I think that I've
covered this point sufficiently (?). Maybe you missed some of
that text. Or do you think there's still something I should add?

>>        9. Once the notification has been sent, the system  call  in  the
>>           target  process  unblocks,  returning the information that was
>>           provided by the supervisor in the notification response.
>>
>>        As a variation on the last two steps, the supervisor can  send  a
>>        response  that tells the kernel that it should execute the target
>>        process's   system   call;   see   the   discussion    of    SEC‐
>>        COMP_USER_NOTIF_FLAG_CONTINUE, below.
>>
>>    ioctl(2) operations
>>        The following ioctl(2) operations are provided to support seccomp
>>        user-space notification.  For each of these operations, the first
>>        (file  descriptor)  argument  of  ioctl(2)  is the listening file
>>        descriptor returned by a call to seccomp(2) with the SECCOMP_FIL‐
>>        TER_FLAG_NEW_LISTENER flag.
>>
>>        SECCOMP_IOCTL_NOTIF_RECV
>>               This operation is used to obtain a user-space notification
>>               event.  If no such event is currently pending, the  opera‐
>>               tion  blocks  until  an  event occurs.  The third ioctl(2)
>>               argument is a pointer to a structure of the following form
>>               which  contains  information about the event.  This struc‐
>>               ture must be zeroed out before the call.
>>
>>                   struct seccomp_notif {
>>                       __u64  id;              /* Cookie */
>>                       __u32  pid;             /* PID of target process */
>>                       __u32  flags;           /* Currently unused (0) */
>>                       struct seccomp_data data;   /* See seccomp(2) */
>>                   };
>>
>>               The fields in this structure are as follows:
>>
>>               id     This is a cookie for the notification.   Each  such
>>                      cookie  is  guaranteed  to be unique for the corre‐
>>                      sponding seccomp  filter.   In  other  words,  this
>>                      cookie  is  unique for each notification event from
>>                      the target process.  The cookie value has the  fol‐
>>                      lowing uses:
>>
>>                      · It     can     be     used    with    the    SEC‐
>>                        COMP_IOCTL_NOTIF_ID_VALID ioctl(2)  operation  to
>>                        verify that the target process is still alive.
>>
>>                      · When  returning  a  notification  response to the
>>                        kernel, the supervisor must  include  the  cookie
>>                        value in the seccomp_notif_resp structure that is
>>                        specified   as   the   argument   of   the   SEC‐
>>                        COMP_IOCTL_NOTIF_SEND operation.
>>
>>               pid    This  is  the  PID of the target process that trig‐
>>                      gered the notification event.
>>
>>                      ┌─────────────────────────────────────────────────────┐
>>                      │FIXME                                                │
>>                      ├─────────────────────────────────────────────────────┤
>>                      │This is a thread ID, rather than a PID, right?       │
>>                      └─────────────────────────────────────────────────────┘
> 
> Yes.
> 
>>
>>               flags  This is a  bit  mask  of  flags  providing  further
>>                      information on the event.  In the current implemen‐
>>                      tation, this field is always zero.
>>
>>               data   This is a seccomp_data structure containing  infor‐
>>                      mation  about  the  system  call that triggered the
>>                      notification.  This is the same structure  that  is
>>                      passed  to  the seccomp filter.  See seccomp(2) for
>>                      details of this structure.
>>
>>               On success, this operation returns 0; on  failure,  -1  is
>>               returned,  and  errno  is set to indicate the cause of the
>>               error.  This operation can fail with the following errors:
>>
>>               EINVAL (since Linux 5.5)
>>                      The seccomp_notif structure that was passed to  the
>>                      call contained nonzero fields.
>>
>>               ENOENT The  target  process  was killed by a signal as the
>>                      notification information was being generated.
>>
>>        ┌─────────────────────────────────────────────────────┐
>>        │FIXME                                                │
>>        ├─────────────────────────────────────────────────────┤
>>        │From my experiments,  it  appears  that  if  a  SEC‐ │
>>        │COMP_IOCTL_NOTIF_RECV   is  done  after  the  target │
>>        │process terminates, then the ioctl()  simply  blocks │
>>        │(rather than returning an error to indicate that the │
>>        │target process no longer exists).                    │
>>        │                                                     │
>>        │I found that surprising, and it required  some  con‐ │
>>        │tortions  in the example program.  It was not possi‐ │
>>        │ble to code my SIGCHLD handler (which reaps the zom‐ │
>>        │bie  when  the  worker/target process terminates) to │
>>        │simply set a flag checked in the main  handleNotifi‐ │
>>        │cations()  loop,  since  this created an unavoidable │
>>        │race where the child might terminate  just  after  I │
>>        │had  checked  the  flag,  but before I blocked (for‐ │
>>        │ever!) in  the  SECCOMP_IOCTL_NOTIF_RECV  operation. │
>>        │Instead,  I had to code the signal handler to simply │
>>        │call _exit(2)  in  order  to  terminate  the  parent │
>>        │process (the supervisor).                            │
>>        │                                                     │
>>        │Is  this  expected  behavior?  It seems to me rather │
>>        │desirable that SECCOMP_IOCTL_NOTIF_RECV should  give │
>>        │an error if the target process has terminated.       │
>>        └─────────────────────────────────────────────────────┘
> 
> This has been discussed later in the thread too, I believe. My patchset
> fixed a different but related bug in ->poll() when a filter becomes
> unused. I hadn't noticed this behavior since I'm always polling. (Pure
> ioctls() feel a bit fishy to me. :) But obviously a valid use.)

Yes, I hope the ioctl() can be fixed.

>>        SECCOMP_IOCTL_NOTIF_ID_VALID
>>               This operation can be used to check that a notification ID
>>               returned by an earlier SECCOMP_IOCTL_NOTIF_RECV  operation
>>               is  still  valid  (i.e.,  that  the  target  process still
>>               exists).
>>
>>               The third ioctl(2) argument is a  pointer  to  the  cookie
>>               (id) returned by the SECCOMP_IOCTL_NOTIF_RECV operation.
>>
>>               This  operation is necessary to avoid race conditions that
>>               can  occur   when   the   pid   returned   by   the   SEC‐
>>               COMP_IOCTL_NOTIF_RECV   operation   terminates,  and  that
>>               process ID is reused by another process.   An  example  of
>>               this kind of race is the following:
>>
>>               1. A  notification  is  generated  on  the  listening file
>>                  descriptor.  The returned  seccomp_notif  contains  the
>>                  PID of the target process.
>>
>>               2. The target process terminates.
>>
>>               3. Another process is created on the system that by chance
>>                  reuses the PID that was freed when the  target  process
>>                  terminates.
>>
>>               4. The  supervisor  open(2)s  the /proc/[pid]/mem file for
>>                  the PID obtained in step 1, with the intention of (say)
>>                  inspecting the memory locations that contain the
>>                  arguments of the system call that triggered the
>>                  notification in step 1.
>>
>>               In the above scenario, the risk is that the supervisor may
>>               try to access the memory of a process other than the  tar‐
>>               get.   This  race  can be avoided by following the call to
>>               open with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to ver‐
>>               ify  that  the  process that generated the notification is
>>               still alive.  (Note that  if  the  target  process  subse‐
>>               quently  terminates, its PID won't be reused because there
>>               remains an open reference to the /proc/[pid]/mem file;  in
>>               this  case, a subsequent read(2) from the file will return
>>               0, indicating end of file.)
>>
>>               On success (i.e., the notification  ID  is  still  valid),
>>               this  operation  returns 0 On failure (i.e., the notifica‐
> 
> Missing a ".", I think.

(Yup. Already fixed.)

>>               tion ID is no longer valid), -1 is returned, and errno  is
>>               set to ENOENT.
>>
>>        SECCOMP_IOCTL_NOTIF_SEND
>>               This  operation  is  used  to send a notification response
>>               back to the kernel.  The third ioctl(2) argument  of  this
>>               structure  is  a  pointer  to a structure of the following
>>               form:
>>
>>                   struct seccomp_notif_resp {
>>                       __u64 id;               /* Cookie value */
>>                       __s64 val;              /* Success return value */
>>                       __s32 error;            /* 0 (success) or negative
>>                                                  error number */
>>                       __u32 flags;            /* See below */
>>                   };
>>
>>               The fields of this structure are as follows:
>>
>>               id     This is the cookie value that  was  obtained  using
>>                      the   SECCOMP_IOCTL_NOTIF_RECV   operation.    This
>>                      cookie value allows the kernel to  correctly  asso‐
>>                      ciate this response with the system call that trig‐
>>                      gered the user-space notification.
>>
>>               val    This is the value that will be used for  a  spoofed
>>                      success  return  for  the  target  process's system
>>                      call; see below.
>>
>>               error  This is the value that will be used  as  the  error
>>                      number  (errno)  for a spoofed error return for the
>>                      target process's system call; see below.
> 
> Nit: "val" is only used when "error" is not set.

Yes. I note that below. I don't want to clutter this part of the page with
too many details.

>>               flags  This is a bit mask that includes zero  or  more  of
>>                      the following flags
>>
>>                      SECCOMP_USER_NOTIF_FLAG_CONTINUE (since Linux 5.5)
>>                             Tell   the  kernel  to  execute  the  target
>>                             process's system call.
>>
>>               Two kinds of response are possible:
>>
>>               · A response to the kernel telling it to execute the  tar‐
>>                 get  process's  system  call.   In  this case, the flags
>>                 field includes SECCOMP_USER_NOTIF_FLAG_CONTINUE and  the
>>                 error and val fields must be zero.
>>
>>                 This  kind  of response can be useful in cases where the
>>                 supervisor needs to do deeper analysis of  the  target's
>>                 system  call  than  is  possible  from  a seccomp filter
>>                 (e.g., examining the values of pointer arguments),  and,
>>                 having  verified that the system call is acceptable, the
>>                 supervisor wants to allow it to proceed.
> 
> I think Jann has pointed this out. This needs to come with a big warning
> and I would explicitly put a:
> "The user notification mechanism cannot be used to implement a syscall
> security policy in user space!"
> You might want to take a look at the seccomp.h header file where I
> placed a giant warning about how to use this too.

Yes. Kees also raised this. See my reply to Jann (who pasted in a copy 
of part of your comment from seccomp.h). I'm going to freely reuse the
text from your comment. Please take a look at the text in my reply to Jann,
and let me know what you think.

>>               · A spoofed return value for the target  process's  system
>>                 call.   In  this  case,  the kernel does not execute the
>>                 target process's system call, instead causing the system
>>                 call to return a spoofed value as specified by fields of
>>                 the seccomp_notif_resp structure.  The supervisor should
>>                 set the fields of this structure as follows:
>>
>>                 +  flags  does  not contain SECCOMP_USER_NOTIF_FLAG_CON‐
>>                    TINUE.
>>
>>                 +  error is set either to  0  for  a  spoofed  "success"
>>                    return  or  to  a negative error number for a spoofed
>>                    "failure" return.  In the  former  case,  the  kernel
>>                    causes the target process's system call to return the
>>                    value specified in the val field.  In the latter case,
>>                    the kernel causes the target process's system call to
>>                    return -1, and errno is assigned  the  negated  error
>>                    value.
>>
>>                 +  val is set to a value that will be used as the return
>>                    value for a spoofed "success" return for  the  target
>>                    process's  system  call.   The value in this field is
>>                    ignored if the error field contains a nonzero value.
>>
>>               On success, this operation returns 0; on  failure,  -1  is
>>               returned,  and  errno  is set to indicate the cause of the
>>               error.  This operation can fail with the following errors:
>>
>>               EINPROGRESS
>>                      A response to this notification  has  already  been
>>                      sent.
>>
>>               EINVAL An invalid value was specified in the flags field.
>>
>>               EINVAL The       flags      field      contained      SEC‐
>>                      COMP_USER_NOTIF_FLAG_CONTINUE, and the error or val
>>                      field was not zero.
>>
>>               ENOENT The  blocked  system call in the target process has
>>                      been interrupted by a signal handler.
>>
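
The two kinds of response described for SECCOMP_IOCTL_NOTIF_SEND can be
summarized in code. The helper names below are illustrative only, and
the fallback definition of SECCOMP_USER_NOTIF_FLAG_CONTINUE is there
solely for builds against older kernel headers.

```c
#include <errno.h>
#include <linux/seccomp.h>
#include <string.h>

#ifndef SECCOMP_USER_NOTIF_FLAG_CONTINUE
#define SECCOMP_USER_NOTIF_FLAG_CONTINUE (1UL << 0)
#endif

/* Ask the kernel to execute the target's system call.
   The 'error' and 'val' fields must be zero in this case. */
void resp_continue(struct seccomp_notif_resp *resp, __u64 id)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
}

/* Spoof a successful return of 'val' from the target's system call. */
void resp_spoof_success(struct seccomp_notif_resp *resp, __u64 id, __s64 val)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->val = val;            /* Used because 'error' is 0 */
}

/* Spoof a failed return; 'err' is a positive errno value such as EEXIST.
   The target sees a return value of -1 with errno set to 'err'. */
void resp_spoof_failure(struct seccomp_notif_resp *resp, __u64 id, int err)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->error = -err;         /* The kernel expects a negative number */
}
```

In each case the supervisor would then pass the structure to
ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, resp).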
>> NOTES
>>        The file descriptor returned when seccomp(2) is employed with the
>>        SECCOMP_FILTER_FLAG_NEW_LISTENER  flag  can  be  monitored  using
>>        poll(2), epoll(7), and select(2).  When a notification  is  pend‐
>>        ing,  these interfaces indicate that the file descriptor is read‐
>>        able.
> 
> This should also note that when a filter becomes unused, i.e. the last
> task using that filter in its filter hierarchy is dead (been
> reaped/autoreaped) ->poll() will notify with (E)POLLHUP.

Ahh! Now I understand. I was unaware of this. Jann commented that
poll() could be used as well, but you provided enough detail that
now I understand how this works. I added the following in NOTES
where poll/select/epoll are described:

       · After the last thread using the filter has terminated  and  been
         reaped using waitpid(2) (or similar), the file descriptor
         indicates an end-of-file condition (readable in select(2);
         POLLHUP/EPOLLHUP in poll(2)/epoll_wait(2)).
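
A supervisor that polls before calling SECCOMP_IOCTL_NOTIF_RECV can use
the hang-up indication to avoid blocking once no target remains. A
minimal sketch follows; the enum and function names are invented for
illustration.

```c
#include <errno.h>
#include <poll.h>

enum notif_state {              /* Illustrative names */
    NOTIF_ERROR  = -1,
    NOTIF_HANGUP = 0,           /* Filter is no longer in use */
    NOTIF_READY  = 1            /* A notification can be received */
};

/* Block until the listening file descriptor is readable or reports a
   hang-up.  A pending notification takes priority over a hang-up. */
int wait_for_notification(int notify_fd)
{
    struct pollfd pfd = { .fd = notify_fd, .events = POLLIN };

    for (;;) {
        int n = poll(&pfd, 1, -1);

        if (n == -1) {
            if (errno == EINTR)
                continue;       /* Interrupted by a signal handler */
            return NOTIF_ERROR;
        }
        if (pfd.revents & POLLIN)
            return NOTIF_READY;
        if (pfd.revents & POLLHUP)
            return NOTIF_HANGUP;
        if (pfd.revents & (POLLERR | POLLNVAL))
            return NOTIF_ERROR;
    }
}
```

The supervisor would call SECCOMP_IOCTL_NOTIF_RECV only after
NOTIF_READY, and would stop its loop on NOTIF_HANGUP.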

>>        ┌─────────────────────────────────────────────────────┐
>>        │FIXME                                                │
>>        ├─────────────────────────────────────────────────────┤
>>        │Interestingly, after the event  had  been  received, │
>>        │the  file descriptor indicates as writable (verified │
>>        │from the source code and by experiment). How is this │
>>        │useful?                                              │
>>        └─────────────────────────────────────────────────────┘
>>
>> EXAMPLES
>>        The (somewhat contrived) program shown below demonstrates the use
>>        of the interfaces described in this page.  The program creates  a
>>        child  process  that  serves  as the "target" process.  The child
>>        process  installs  a  seccomp  filter  that  returns   the   SEC‐
>>        COMP_RET_USER_NOTIF  action  value if a call is made to mkdir(2).
>>        The child process then calls mkdir(2) once for each of  the  sup‐
>>        plied  command-line arguments, and reports the result returned by
>>        the call.  After processing all arguments, the child process ter‐
>>        minates.
>>
>>        The  parent  process  acts  as  the supervisor, listening for the
>>        notifications that are generated when the  target  process  calls
>>        mkdir(2).   When such a notification occurs, the supervisor exam‐
>>        ines the memory of the target process (using /proc/[pid]/mem)  to
>>        discover  the pathname argument that was supplied to the mkdir(2)
>>        call, and performs one of the following actions:
>>
>>        · If the pathname begins with the prefix "/tmp/", then the super‐
>>          visor  attempts  to  create  the  specified directory, and then
>>          spoofs a return for the target  process  based  on  the  return
>>          value  of  the  supervisor's  mkdir(2) call.  In the event that
>>          that call succeeds, the spoofed success  return  value  is  the
>>          length of the pathname.
>>
>>        · If  the pathname begins with "./" (i.e., it is a relative path‐
>>          name), the supervisor sends a  SECCOMP_USER_NOTIF_FLAG_CONTINUE
>>          response to the kernel to say that the kernel should execute the
>>          target process's mkdir(2) call.
> 
> Potentially problematic if the two processes have the same privilege
> level and the supervisor intends _CONTINUE to mean "is safe to execute".

Understood. But I think that needs to be clarified elsewhere in the
page, since it's essentially the same point as "The user notification
mechanism cannot be used to implement a syscall security policy in 
user space!" See my reply to Jann.

> An attacker could try to re-write arguments afaict.

By an attacker, I presume you mean a malign supervisor, right?
Sure, it looks to me as though rewriting arguments could be 
possible. But, if you had privilege to do that, you'd presumably
have privileges for any number of other nefarious activities, right?
(So, I don't think anything special needs to be said here; let me
know if you feel something does need to be said.)

> A good and easy example is usually mknod() in a user namespace. A
> _CONTINUE is always safe since you can't create device nodes anyway.

Okay -- but I wanted to provide an example (admittedly very
contrived) to show how the supervisor could either do the system call
on behalf of the target, or leave things to the target to execute
the system call. Do you feel that the example is leading people
astray?
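
To be concrete about the two paths the example is meant to contrast, the
supervisor's decision could be expressed as the following sketch. The
enum and function names are illustrative, and the ACT_REJECT fallback
for all other pathnames is an assumption made here for completeness; it
is not described in the text quoted above.

```c
#include <string.h>

enum action {                   /* Illustrative names */
    ACT_EMULATE,                /* Supervisor performs mkdir(2) itself
                                   and spoofs the return value */
    ACT_CONTINUE,               /* Target's own mkdir(2) is executed */
    ACT_REJECT                  /* Assumed fallback: spoof an error */
};

/* Decide how to handle a mkdir(2) notification, based only on the
   pathname prefix, as in the example program's description. */
enum action classify_pathname(const char *path)
{
    if (strncmp(path, "/tmp/", 5) == 0)
        return ACT_EMULATE;
    if (strncmp(path, "./", 2) == 0)
        return ACT_CONTINUE;
    return ACT_REJECT;
}
```

As discussed above, the ACT_CONTINUE branch must not be read as a
security decision, since the target could change the pathname after the
supervisor has inspected it.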

> Sorry, I can't review the rest in sufficient detail since I'm on
> vacation still so I'm just going to shut up now. :)

Well, thanks already, because your comments were already very
useful! I will send out a new draft shortly :-).

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

* Re: For review: seccomp_user_notif(2) manual page
  2020-09-30 23:39 ` Kees Cook
@ 2020-10-15 11:24   ` Michael Kerrisk (man-pages)
  2020-10-26  0:19     ` Kees Cook
  0 siblings, 1 reply; 52+ messages in thread
From: Michael Kerrisk (man-pages) @ 2020-10-15 11:24 UTC (permalink / raw)
  To: Kees Cook
  Cc: mtk.manpages, Tycho Andersen, Sargun Dhillon, Christian Brauner,
	linux-man, lkml, Aleksa Sarai, Jann Horn, Alexei Starovoitov,
	wad, bpf, Song Liu, Daniel Borkmann, Andy Lutomirski,
	Linux Containers, Giuseppe Scrivano, Robert Sesek

Hello Kees,

On 10/1/20 1:39 AM, Kees Cook wrote:
> On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
>> [...] I did :-)
> 
> Yay! Thank you!

You're welcome :-)

>> [...]
>>    Overview
>>        In conventional usage of a seccomp filter, the decision about how
>>        to  treat  a particular system call is made by the filter itself.
>>        The user-space notification mechanism allows the handling of  the
>>        system  call  to  instead  be handed off to a user-space process.
>>        The advantages of doing this are that, by contrast with the  sec‐
>>        comp  filter,  which  is  running on a virtual machine inside the
>>        kernel, the user-space process has access to information that  is
>>        unavailable to the seccomp filter and it can perform actions that
>>        can't be performed from the seccomp filter.
> 
> I might clarify a bit with something like (though maybe the
> target/supervisor paragraph needs to be moved to the start):
> 
> 	This is used for performing syscalls on behalf of the target,
> 	rather than having the supervisor make security policy decisions
> 	about the syscall, which would be inherently race-prone. The
> 	target's syscall should either be handled by the supervisor or
> 	allowed to continue normally in the kernel (where standard security
> 	policies will be applied).

You, Christian, and Jann all pulled me up on this point. And thanks; 
I'm going to use some of your words above. See my reply to Jann, sent
at about the same time as this reply. Please take a look at the text
in my reply to Jann, and let me know what you think.

> I'll comment more later, but I've run out of time today and I didn't see
> anyone mention this detail yet in the existing threads... :)

Later never came :-). But, I hope you may have comments for the 
next draft, which I will send out soon.

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

* Re: For review: seccomp_user_notif(2) manual page
  2020-09-30 15:53 ` Jann Horn
  2020-10-01 12:54   ` Christian Brauner
@ 2020-10-15 11:24   ` Michael Kerrisk (man-pages)
  2020-10-15 20:32     ` Jann Horn
  1 sibling, 1 reply; 52+ messages in thread
From: Michael Kerrisk (man-pages) @ 2020-10-15 11:24 UTC (permalink / raw)
  To: Jann Horn
  Cc: mtk.manpages, Tycho Andersen, Sargun Dhillon, Kees Cook,
	Christian Brauner, linux-man, lkml, Aleksa Sarai,
	Alexei Starovoitov, Will Drewry, bpf, Song Liu, Daniel Borkmann,
	Andy Lutomirski, Linux Containers, Giuseppe Scrivano,
	Robert Sesek

Hi Jann,

So, first off, thank you for the detailed review. I really 
appreciate it! I've changed various pieces, and still have
a few questions below.

On 9/30/20 5:53 PM, Jann Horn wrote:
> On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
> <mtk.manpages@gmail.com> wrote:
>> I knew it would be a big ask, but below is kind of the manual page
>> I was hoping you might write [1] for the seccomp user-space notification
>> mechanism. Since you didn't (and because 5.9 adds various new pieces
>> such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD
>> that also will need documenting [2]), I did :-). But of course I may
>> have made mistakes...
> [...]
>> NAME
>>        seccomp_user_notif - Seccomp user-space notification mechanism
>>
>> SYNOPSIS
>>        #include <linux/seccomp.h>
>>        #include <linux/filter.h>
>>        #include <linux/audit.h>
>>
>>        int seccomp(unsigned int operation, unsigned int flags, void *args);
> 
> Should the ioctl() calls be listed here, similar to e.g. the SYNOPSIS
> of the ioctl_* manpages?

Yes, good idea. I added:

       int ioctl(int fd, SECCOMP_IOCTL_NOTIF_RECV,
                 struct seccomp_notif *req);
       int ioctl(int fd, SECCOMP_IOCTL_NOTIF_SEND,
                 struct seccomp_notif_resp *req);
       int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ID_VALID, __u64 *id);
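
Since glibc has traditionally not provided a wrapper for seccomp(2),
callers of these interfaces typically go through syscall(2). The sketch
below shows such a wrapper together with the SECCOMP_GET_NOTIF_SIZES
buffer allocation described in the overview. The names sys_seccomp(),
struct notif_buffers, and alloc_notif_buffers() are made up for
illustration, and taking the larger of the kernel-reported and
header-defined sizes is a defensive assumption rather than something the
draft requires.

```c
#define _GNU_SOURCE
#include <linux/seccomp.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper around the raw system call */
int sys_seccomp(unsigned int operation, unsigned int flags, void *args)
{
    return (int) syscall(SYS_seccomp, operation, flags, args);
}

struct notif_buffers {          /* Illustrative type */
    struct seccomp_notif      *req;
    struct seccomp_notif_resp *resp;
    size_t req_size;
    size_t resp_size;
};

/* Ask the kernel for the structure sizes and allocate zeroed buffers.
   Returns 0 on success, -1 on error. */
int alloc_notif_buffers(struct notif_buffers *b)
{
    struct seccomp_notif_sizes sizes;

    if (sys_seccomp(SECCOMP_GET_NOTIF_SIZES, 0, &sizes) == -1)
        return -1;

    /* Assumption: never allocate less than the structures this program
       was compiled against, in case the kernel's versions are smaller. */
    b->req_size = sizes.seccomp_notif;
    if (b->req_size < sizeof(struct seccomp_notif))
        b->req_size = sizeof(struct seccomp_notif);

    b->resp_size = sizes.seccomp_notif_resp;
    if (b->resp_size < sizeof(struct seccomp_notif_resp))
        b->resp_size = sizeof(struct seccomp_notif_resp);

    b->req = calloc(1, b->req_size);
    b->resp = calloc(1, b->resp_size);
    if (b->req == NULL || b->resp == NULL) {
        free(b->req);
        free(b->resp);
        return -1;
    }
    return 0;
}
```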
> 
>> DESCRIPTION
>>        This  page  describes  the user-space notification mechanism pro‐
>>        vided by the Secure Computing (seccomp) facility.  As well as the
>>        use   of  the  SECCOMP_FILTER_FLAG_NEW_LISTENER  flag,  the  SEC‐
>>        COMP_RET_USER_NOTIF action value, and the SECCOMP_GET_NOTIF_SIZES
>>        operation  described  in  seccomp(2), this mechanism involves the
>>        use of a number of related ioctl(2) operations (described below).
>>
>>    Overview
>>        In conventional usage of a seccomp filter, the decision about how
>>        to  treat  a particular system call is made by the filter itself.
>>        The user-space notification mechanism allows the handling of  the
>>        system  call  to  instead  be handed off to a user-space process.
>>        The advantages of doing this are that, by contrast with the  sec‐
>>        comp  filter,  which  is  running on a virtual machine inside the
>>        kernel, the user-space process has access to information that  is
>>        unavailable to the seccomp filter and it can perform actions that
>>        can't be performed from the seccomp filter.
>>
>>        In the discussion that follows, the process  that  has  installed
>>        the  seccomp filter is referred to as the target, and the process
> 
> Technically, this definition of "target" is a bit inaccurate because:
> 
>  - seccomp filters are inherited
>  - seccomp filters apply to threads, not processes
>  - seccomp filters can be semi-remotely installed via TSYNC

(Nice summary.)

> (I assume that in manpages, we should try to go for the "a task is a
> thread and a thread group is a process" definition, right?)

Exactly.

> Perhaps "the threads on which the seccomp filter is installed are
> referred to as the target", or something like that would be better?

Thanks. It's always hugely helpful to get a suggested wording, even
if I still feel the need to rework it (which I don't in this case).
The sentence now reads:

       In the discussion that follows, the thread(s) on which the seccomp
       filter is installed are referred to as the target, and the process
       that is notified  by  the  user-space  notification  mechanism  is
       referred to as the supervisor.

>>        that is notified by  the  user-space  notification  mechanism  is
>>        referred  to  as  the  supervisor.  An overview of the steps per‐
>>        formed by these two processes is as follows:
>>
>>        1. The target process establishes a seccomp filter in  the  usual
>>           manner, but with two differences:
>>
>>           · The seccomp(2) flags argument includes the flag SECCOMP_FIL‐
>>             TER_FLAG_NEW_LISTENER.  Consequently, the return  value   of
>>             the  (successful)  seccomp(2) call is a new "listening" file
>>             descriptor that can be used to receive notifications.
>>
>>           · In cases where it is appropriate, the seccomp filter returns
>>             the  action value SECCOMP_RET_USER_NOTIF.  This return value
>>             will trigger a notification event.
>>
>>        2. In order that the supervisor process can obtain  notifications
>>           using  the  listening  file  descriptor, (a duplicate of) that
>>           file descriptor must be passed from the target process to  the
>>           supervisor process.  One way in which this could be done is by
>>           passing the file descriptor over a UNIX domain socket  connec‐
>>           tion between the two processes (using the SCM_RIGHTS ancillary
>>           message type described in unix(7)).   Another  possibility  is
>>           that  the  supervisor  might  inherit  the file descriptor via
>>           fork(2).
> 
> With the caveat that if the supervisor inherits the file descriptor
> via fork(), that (more or less) implies that the supervisor is subject
> to the same filter (although it could bypass the filter using a helper
> thread that responds SECCOMP_USER_NOTIF_FLAG_CONTINUE, but I don't
> expect any clean software to do that).

It's a good thing no one ever writes unclean software...

Thanks for catching this; Tycho did also. It was a thinko on my part
to forget that if one used fork(), the supervisor would inherit the
filter. I've simply removed the sentence mentioning fork().
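
For reference, the SCM_RIGHTS approach that remains in the text could be
sketched roughly as follows. This is only a minimal illustration of the
general unix(7) idiom, not code from the page's example program; the
helper names send_fd() and recv_fd() are made up for this mail, and the
sketch passes exactly one descriptor per message.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send file descriptor 'fd' over the UNIX domain socket 'sock'.
   Returns 0 on success, -1 on error. (Hypothetical helper.) */
static int
send_fd(int sock, int fd)
{
    char dummy = 'x';       /* One byte of real data accompanies the fd */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;   /* Forces suitable alignment of 'buf' */
    } ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&ctl, 0, sizeof(ctl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return (sendmsg(sock, &msg, 0) == 1) ? 0 : -1;
}

/* Receive one file descriptor from 'sock'.
   Returns the new descriptor, or -1 on error. (Hypothetical helper.) */
static int
recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    /* Accept only a single SCM_RIGHTS message carrying one descriptor */
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
            cmsg->cmsg_type != SCM_RIGHTS ||
            cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

In this arrangement the target would call send_fd() with the listening
descriptor returned by seccomp(2), and the supervisor (which is not
subject to the filter) would call recv_fd() on its end of the socket.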


>>        3. The supervisor process will receive notification events on the
>>           listening  file  descriptor.   These  events  are  returned as
>>           structures of type seccomp_notif.  Because this structure  and
>>           its  size may evolve over kernel versions, the supervisor must
>>           first determine the size of  this  structure  using  the  sec‐
>>           comp(2)  SECCOMP_GET_NOTIF_SIZES  operation,  which  returns a
>>           structure of type seccomp_notif_sizes.  The  supervisor  allo‐
>>           cates a buffer of size seccomp_notif_sizes.seccomp_notif bytes
>>           to receive notification events.   In  addition, the supervisor
>>           allocates  another  buffer  of  size  seccomp_notif_sizes.sec‐
>>           comp_notif_resp  bytes  for  the  response  (a   struct   sec‐
>>           comp_notif_resp  structure) that it will provide to the kernel
>>           (and thus the target process).
>>
>>        4. The target process then performs its workload, which  includes
>>           system  calls  that  will be controlled by the seccomp filter.
>>           Whenever one of these system calls causes the filter to return
>>           the  SECCOMP_RET_USER_NOTIF  action value, the kernel does not
>>           execute the system call;  instead,  execution  of  the  target
>>           process is temporarily blocked inside the kernel and a notifi‐
> 
> where "blocked" refers to the interruptible, restartable kind - if the
> child receives a signal with an SA_RESTART signal handler in the
> meantime, it'll leave the syscall, go through the signal handler, then
> restart the syscall again and send the same request to the supervisor
> again. so the supervisor may see duplicate syscalls.

So, I partially demonstrated what you describe here, for two example
system calls (epoll_wait() and pause()). But I could not exactly 
demonstrate things as I understand you to be describing them. (So,
I'm not sure whether I have not understood you correctly, or
if things are not exactly as you describe them.)

Here's a scenario (A) that I tested:

1. Target installs seccomp filters for a blocking syscall
   (epoll_wait() or pause(), both of which should never restart,
   regardless of SA_RESTART)
2. Target installs SIGINT handler with SA_RESTART
3. Supervisor is sleeping (i.e., is not blocked in
   SECCOMP_IOCTL_NOTIF_RECV operation).
4. Target makes a blocking system call (epoll_wait() or pause()).
5. SIGINT gets delivered to target; handler gets called;
   ***and syscall gets restarted by the kernel***

That last should never happen, of course, and is a result of the
combination of both the user-notify filter and the SA_RESTART flag.
If one or other is not present, then the system call is not
restarted.

So, as you note below, the UAPI gets broken a little.

However, from your description above I had understood that 
something like the following scenario (B) could occur:

1. Target installs seccomp filters for a blocking syscall
   (epoll_wait() or pause(), both of which should never restart,
   regardless of SA_RESTART)
2. Target installs SIGINT handler with SA_RESTART
3. Supervisor performs SECCOMP_IOCTL_NOTIF_RECV operation (which
   blocks).
4. Target makes a blocking system call (epoll_wait() or pause()).
5. Supervisor gets seccomp user-space notification (i.e.,
   SECCOMP_IOCTL_NOTIF_RECV ioctl() returns).
6. SIGINT gets delivered to target; handler gets called;
   and syscall gets restarted by the kernel
7. Supervisor performs another SECCOMP_IOCTL_NOTIF_RECV operation
   which gets another notification for the restarted system call.

However, I don't observe such behavior. In step 6, the syscall
does not get restarted by the kernel, but instead returns -1/EINTR.
Perhaps I have misconstructed my experiment in the second case, or
perhaps I've misunderstood what you meant, or is it possibly the
case that things are not quite as you said?
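
In case it helps anyone reproduce the two scenarios, step 2 (installing
the SIGINT handler with SA_RESTART) looks roughly like the sketch below.
This is a reduced illustration rather than my actual test program; the
names got_sigint and install_restart_handler() are invented here.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>

/* Set by the handler so that the caller can see that it ran */
static volatile sig_atomic_t got_sigint = 0;

static void
sigint_handler(int sig)
{
    (void) sig;
    got_sigint = 1;
}

/* Install a SIGINT handler with the SA_RESTART flag.
   Returns 0 on success, -1 on error. (Hypothetical helper.) */
static int
install_restart_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sigint_handler;
    sa.sa_flags = SA_RESTART;
    sigemptyset(&sa.sa_mask);

    return sigaction(SIGINT, &sa, NULL);
}
```

After installing the handler, the target makes the blocking call
(pause() or epoll_wait()) and reports whether it returned -1/EINTR or
never returned at all.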

> What's really gross here is that signal(7) promises that some syscalls
> like epoll_wait(2) never restart, but seccomp doesn't know about that;
> if userspace installs a filter that uses SECCOMP_RET_USER_NOTIF for a
> non-restartable syscall, the result is that UAPI gets broken a little
> bit. Luckily normal users of seccomp probably won't use
> SECCOMP_RET_USER_NOTIF for restartable syscalls, but if someone does
> want to do that, we might have to add some "suppress syscall
> restarting" flag into the seccomp action value, or something like
> that... yuck.

Yes, the UAPI breakage is a bit sad (although likely to be rarely
encountered, as you note). I'm inclined to add a note about this in
BUGS, but beforehand I'm interested in hearing your thoughts on
scenario B above.

>>           cation event is generated on the listening file descriptor.
>>
>>        5. The supervisor process can now repeatedly monitor the  listen‐
>>           ing   file   descriptor  for  SECCOMP_RET_USER_NOTIF-triggered
>>           events.   To  do  this,   the   supervisor   uses   the   SEC‐
>>           COMP_IOCTL_NOTIF_RECV  ioctl(2)  operation to read information
>>           about a notification event; this  operation  blocks  until  an
> 
> (interruptably - but I guess that maybe doesn't have to be said
> explicitly here?)

Yes, I think so. The general assumption is that syscalls block
interruptibly, unless text in a manual page says
"uninterruptible". (Postscript: Christian made a similar comment,
so I decided to explicitly note that it's an interruptible sleep.)

>>           event  is  available.
> 
> Maybe we should note here that you can use the multi-fd-polling APIs
> (select/poll/epoll) instead, and that if the notification goes away
> before you call SECCOMP_IOCTL_NOTIF_RECV, the ioctl will return
> -ENOENT instead of blocking, and therefore as long as nobody else
> reads from the same fd, you can assume that after the fd reports as
> readable, you can call SECCOMP_IOCTL_NOTIF_RECV once without blocking.

I'd rather not add this info in the overview section, which is
already longer than I would like. But I did add some details
in NOTES:

[[
       The file descriptor returned when seccomp(2) is employed with  the
       SECCOMP_FILTER_FLAG_NEW_LISTENER   flag  can  be  monitored  using
       poll(2), epoll(7), and select(2).  When a notification is pending,
       these  interfaces  indicate  that the file descriptor is readable.
       Following    such    an    indication,    a    subsequent     SEC‐
       COMP_IOCTL_NOTIF_RECV  ioctl(2)  will  not block, returning either
       information about a notification or else failing  with  the  error
       EINTR  if  the  target  process has been killed by a signal or its
       system call has been interrupted by a signal handler.
]]

Okay?
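
To make the intended usage concrete, a supervisor loop body along these
lines is what I have in mind. It is only a sketch under assumptions: the
function name wait_for_notification() and its return convention are
invented for this mail, and it treats both ENOENT (the error you mention)
and EINTR as "nothing to handle right now" rather than as hard errors.

```c
#include <errno.h>
#include <poll.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>

/* Wait up to 'timeout_ms' for a notification on 'notify_fd'.
   Returns 1 if 'req' was filled in, 0 if there is nothing to handle
   (timeout, or the notification went away), -1 on any other error.
   (Hypothetical helper; 'req_size' is the buffer size obtained via
   SECCOMP_GET_NOTIF_SIZES.) */
static int
wait_for_notification(int notify_fd, struct seccomp_notif *req,
                      size_t req_size, int timeout_ms)
{
    struct pollfd pfd = { .fd = notify_fd, .events = POLLIN };
    int n;

    n = poll(&pfd, 1, timeout_ms);
    if (n == 0)
        return 0;                       /* Timed out */
    if (n == -1)
        return (errno == EINTR) ? 0 : -1;

    if (!(pfd.revents & POLLIN))
        return -1;                      /* POLLERR, POLLHUP, or POLLNVAL */

    /* The structure must be zeroed before the call */
    memset(req, 0, req_size);

    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, req) == -1)
        return (errno == ENOENT || errno == EINTR) ? 0 : -1;

    return 1;
}
```

As long as no other thread reads from the same file descriptor, the
ioctl() here should not block once poll() has reported the descriptor as
readable.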

> Exceeeeept that this part looks broken:
> 
>   if (mutex_lock_interruptible(&filter->notify_lock) < 0)
>     return EPOLLERR;
> 
> which I think means that we can have a race where a signal arrives
> while poll() is trying to add itself to the waitqueue of the seccomp
> fd, and then we'll get a spurious error condition reported on the fd.
> That's a kernel bug, I'd say.

Sigh... Writing documentation helps find bugs. Who knew?

>> The  operation returns a seccomp_notif
>>           structure containing information about the system call that is
>>           being attempted by the target process.
>>
>>        6. The    seccomp_notif    structure   returned   by   the   SEC‐
>>           COMP_IOCTL_NOTIF_RECV operation includes the same  information
>>           (a seccomp_data structure) that was passed to the seccomp fil‐
>>           ter.  This information allows the supervisor to  discover  the
>>           system  call number and the arguments for the target process's
>>           system call.  In addition, the notification event contains the
>>           PID of the target process.
> 
> That's a PIDTYPE_PID, which the manpages call a "thread ID".

Yes. Fixed now. More generally, I've swept through the page replacing
various instances of "target process" with either "target thread", or
often just "target".

>>           The  information  in  the notification can be used to discover
>>           the values of pointer arguments for the target process's  sys‐
>>           tem call.  (This is something that can't be done from within a
>>           seccomp filter.)  To do this (and  assuming  it  has  suitable
>>           permissions),   the   supervisor   opens   the   corresponding
>>           /proc/[pid]/mem file,
> 
> ... which means that here we might have to get into the weeds of how
> actually /proc has invisible directories for every TID, even though
> only the ones for PIDs are visible, and therefore you can just open
> /proc/[tid]/mem and it'll work fine?

I myself was unaware of this for years until I *accidentally* made use
of the feature in one of my test programs and then a while later got to
asking myself "how come that worked?".

About two years ago, I added some text (@) to explain this in proc(5)
near the start of the page:

   Overview
       Underneath /proc, there are the following general groups of  files
       and subdirectories:

       /proc/[pid] subdirectories
              [...]
              Underneath each of the /proc/[pid] directories, a task sub‐
              directory contains subdirectories of the  form  task/[tid],
              [...]

              The  /proc/[pid]  subdirectories are visible when iterating
              through /proc with getdents(2) (and thus are  visible  when
              one uses ls(1) to view the contents of /proc).

       /proc/[tid] subdirectories
@             Each  one of these subdirectories contains files and subdi‐
@             rectories exposing information about the  thread  with  the
@             corresponding thread ID.  The contents of these directories
@             are the same as  the  corresponding  /proc/[pid]/task/[tid]
@             directories.

@             The /proc/[tid] subdirectories are not visible when iterat‐
@             ing through /proc with getdents(2) (and thus are not  visi‐
@             ble when one uses ls(1) to view the contents of /proc).

I think I'll just drop a cross reference to proc(5) into the text in
seccomp_user_notif.

>> seeks to the memory location that corre‐
>>           sponds to one of the pointer arguments whose value is supplied
>>           in the notification event, and reads bytes from that location.
>>           (The supervisor must be careful to avoid a race condition that
>>           can occur when doing this; see the  description  of  the  SEC‐
>>           COMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.)  In addi‐
>>           tion, the supervisor can access other system information  that
>>           is  visible  in  user space but which is not accessible from a
>>           seccomp filter.
>>
>>           ┌─────────────────────────────────────────────────────┐
>>           │FIXME                                                │
>>           ├─────────────────────────────────────────────────────┤
>>           │Suppose we are reading a pathname from /proc/PID/mem │
>>           │for  a system call such as mkdir(). The pathname can │
>>           │be an arbitrary length. How do we know how much (how │
>>           │many pages) to read from /proc/PID/mem?              │
>>           └─────────────────────────────────────────────────────┘
> 
> It can't be an arbitrary length. While pathnames *returned* from the
> kernel in some places can have different limits, strings supplied as
> path arguments *to* the kernel AFAIK always have an upper limit of
> PATH_MAX, else you get -ENAMETOOLONG. See getname_flags().

Yes, another thinko on my part. I removed this FIXME.
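
So, given the PATH_MAX limit, reading a pathname argument could be
sketched as below. This is an illustration only: read_target_path() is a
name invented here, the sketch assumes a 64-bit off_t, and it omits the
SECCOMP_IOCTL_NOTIF_ID_VALID check that a real supervisor needs after
the open().

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy the null-terminated string at address 'addr' in the memory of
   thread 'tid' into 'path'. Returns the string length, or -1 on error
   or if no terminator is found within PATH_MAX bytes.
   (Hypothetical helper.) */
static ssize_t
read_target_path(pid_t tid, uint64_t addr, char path[PATH_MAX])
{
    char mem_file[64];
    ssize_t n;
    int fd;

    /* /proc/[tid]/mem works even though /proc/[tid] is not listed */
    snprintf(mem_file, sizeof(mem_file), "/proc/%ld/mem", (long) tid);

    fd = open(mem_file, O_RDONLY | O_CLOEXEC);
    if (fd == -1)
        return -1;

    /* A real supervisor would verify the notification ID here */

    /* A short read is possible if the string lies near the end of a
       mapping; that is fine as long as the terminator was read */
    n = pread(fd, path, PATH_MAX, (off_t) addr);
    close(fd);

    if (n <= 0)
        return -1;
    if (memchr(path, '\0', (size_t) n) == NULL)
        return -1;      /* Not a valid pathname argument */

    return (ssize_t) strlen(path);
}
```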

>>        7. Having  obtained  information  as  per  the previous step, the
>>           supervisor may then choose to perform an action in response to
>>           the  target  process's  system call (which, as noted above, is
>>           not  executed  when  the  seccomp  filter  returns  the   SEC‐
>>           COMP_RET_USER_NOTIF action value).
> 
> (unless SECCOMP_USER_NOTIF_FLAG_CONTINUE is used)

As you probably saw, I give SECCOMP_USER_NOTIF_FLAG_CONTINUE a brief 
mention a couple of paragraphs later, and then go into rather more
detail later in the page. (Or do you still think something needs
fixing?)

>>           One  example  use case here relates to containers.  The target
>>           process may be located inside a container where  it  does  not
>>           have sufficient capabilities to mount a filesystem in the con‐
>>           tainer's mount namespace.  However, the supervisor  may  be  a
>>           more  privileged  process that that does have sufficient capa‐
> 
> nit: s/that that/that/

Thanks. Fixed.

>>           bilities to perform the mount operation.
>>
>>        8. The supervisor then sends a response to the notification.  The
>>           information  in  this  response  is used by the kernel to con‐
>>           struct a return value for the target process's system call and
>>           provide a value that will be assigned to the errno variable of
>>           the target process.
>>
>>           The  response  is  sent  using  the   SECCOMP_IOCTL_NOTIF_SEND
>>           ioctl(2)   operation,   which  is  used  to  transmit  a  sec‐
>>           comp_notif_resp  structure  to  the  kernel.   This  structure
>>           includes  a  cookie  value that the supervisor obtained in the
>>           seccomp_notif    structure    returned     by     the     SEC‐
>>           COMP_IOCTL_NOTIF_RECV operation.  This cookie value allows the
>>           kernel to associate the response with the target process.
> 
> (unless if the target thread entered a signal handler or was killed in
> the meantime)

Yes, but I think I have this adequately covered in the errors described
later in the page for SECCOMP_IOCTL_NOTIF_RECV. (I have now added the
target-process-terminated case to the error text.)

              ENOENT The blocked system  call  in  the  target  has  been
                     interrupted  by  a  signal  handler  or  the  target
                     process has terminated.

Is that sufficient?

>>        9. Once the notification has been sent, the system  call  in  the
>>           target  process  unblocks,  returning the information that was
>>           provided by the supervisor in the notification response.
>>
>>        As a variation on the last two steps, the supervisor can  send  a
>>        response  that tells the kernel that it should execute the target
>>        process's   system   call;   see   the   discussion    of    SEC‐
>>        COMP_USER_NOTIF_FLAG_CONTINUE, below.
>>
>>    ioctl(2) operations
>>        The following ioctl(2) operations are provided to support seccomp
>>        user-space notification.  For each of these operations, the first
>>        (file  descriptor)  argument  of  ioctl(2)  is the listening file
>>        descriptor returned by a call to seccomp(2) with the SECCOMP_FIL‐
>>        TER_FLAG_NEW_LISTENER flag.
>>
>>        SECCOMP_IOCTL_NOTIF_RECV
>>               This operation is used to obtain a user-space notification
>>               event.  If no such event is currently pending, the  opera‐
>>               tion  blocks  until  an  event occurs.
> 
> Not necessarily; for every time a process entered a signal handler or
> was killed while a notification was pending, a call to
> SECCOMP_IOCTL_NOTIF_RECV will return -ENOENT.

Yes, but do you not consider this sufficiently covered by the
(updated) error text that appears later? (See below.)

>> The third ioctl(2)
>>               argument is a pointer to a structure of the following form
>>               which  contains  information about the event.  This struc‐
>>               ture must be zeroed out before the call.
>>
>>                   struct seccomp_notif {
>>                       __u64  id;              /* Cookie */
>>                       __u32  pid;             /* PID of target process */
> 
> (TID, not PID)

Thanks. Fixed.

>>                       __u32  flags;           /* Currently unused (0) */
>>                       struct seccomp_data data;   /* See seccomp(2) */
>>                   };
>>
>>               The fields in this structure are as follows:
>>
>>               id     This is a cookie for the notification.   Each  such
>>                      cookie  is  guaranteed  to be unique for the corre‐
>>                      sponding seccomp  filter.   In  other  words,  this
>>                      cookie  is  unique for each notification event from
>>                      the target process.
> 
> That sentence about "target process" looks wrong to me. The cookies
> are unique across notifications from the filter, but there can be
> multiple filters per thread, and multiple threads per filter.

Thanks. I simply removed that last sentence.

>> The cookie value has the  fol‐
>>                      lowing uses:
>>
>>                      · It     can     be     used    with    the    SEC‐
>>                        COMP_IOCTL_NOTIF_ID_VALID ioctl(2)  operation  to
>>                        verify that the target process is still alive.
>>
>>                      · When  returning  a  notification  response to the
>>                        kernel, the supervisor must  include  the  cookie
>>                        value in the seccomp_notif_resp structure that is
>>                        specified   as   the   argument   of   the   SEC‐
>>                        COMP_IOCTL_NOTIF_SEND operation.
>>
>>               pid    This  is  the  PID of the target process that trig‐
>>                      gered the notification event.
>>
>>                      ┌─────────────────────────────────────────────────────┐
>>                      │FIXME                                                │
>>                      ├─────────────────────────────────────────────────────┤
>>                      │This is a thread ID, rather than a PID, right?       │
>>                      └─────────────────────────────────────────────────────┘
> 
> Yeah.

Thanks. I've made various fixes.

>>               flags  This is a  bit  mask  of  flags  providing  further
>>                      information on the event.  In the current implemen‐
>>                      tation, this field is always zero.
>>
>>               data   This is a seccomp_data structure containing  infor‐
>>                      mation  about  the  system  call that triggered the
>>                      notification.  This is the same structure  that  is
>>                      passed  to  the seccomp filter.  See seccomp(2) for
>>                      details of this structure.
>>
>>               On success, this operation returns 0; on  failure,  -1  is
>>               returned,  and  errno  is set to indicate the cause of the
>>               error.  This operation can fail with the following errors:
>>
>>               EINVAL (since Linux 5.5)
>>                      The seccomp_notif structure that was passed to  the
>>                      call contained nonzero fields.
>>
>>               ENOENT The  target  process  was killed by a signal as the
>>                      notification information was being generated.
> 
> Not just killed, interruption with a signal handler has the same effect.

Ah yes! Thanks. I added that as well.

[[
              ENOENT The target thread was killed  by  a  signal  as  the
                     notification information was being generated, or the
                     target's (blocked) system call was interrupted by  a
                     signal handler.
]]

Okay?

>>        ┌─────────────────────────────────────────────────────┐
>>        │FIXME                                                │
>>        ├─────────────────────────────────────────────────────┤
>>        │From my experiments,  it  appears  that  if  a  SEC‐ │
>>        │COMP_IOCTL_NOTIF_RECV   is  done  after  the  target │
>>        │process terminates, then the ioctl()  simply  blocks │
>>        │(rather than returning an error to indicate that the │
>>        │target process no longer exists).                    │
>>        │                                                     │
>>        │I found that surprising, and it required  some  con‐ │
>>        │tortions  in the example program.  It was not possi‐ │
>>        │ble to code my SIGCHLD handler (which reaps the zom‐ │
>>        │bie  when  the  worker/target process terminates) to │
>>        │simply set a flag checked in the main  handleNotifi‐ │
>>        │cations()  loop,  since  this created an unavoidable │
>>        │race where the child might terminate  just  after  I │
>>        │had  checked  the  flag,  but before I blocked (for‐ │
>>        │ever!) in  the  SECCOMP_IOCTL_NOTIF_RECV  operation. │
>>        │Instead,  I had to code the signal handler to simply │
>>        │call _exit(2)  in  order  to  terminate  the  parent │
>>        │process (the supervisor).                            │
>>        │                                                     │
>>        │Is  this  expected  behavior?  It seems to me rather │
>>        │desirable that SECCOMP_IOCTL_NOTIF_RECV should  give │
>>        │an error if the target process has terminated.       │
>>        └─────────────────────────────────────────────────────┘
> 
> You could poll() the fd first. But yeah, it'd probably be a good idea
> to change that.

Ah! It was only after reading some comments from Christian that I
realized how poll() works here. I'll make some additions to the
page about the poll() details. (See my reply to Christian that should
land at about the same time as this mail.)
  
>>        SECCOMP_IOCTL_NOTIF_ID_VALID
> [...]
>>               In the above scenario, the risk is that the supervisor may
>>               try to access the memory of a process other than the  tar‐
>>               get.   This  race  can be avoided by following the call to
>>               open with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to ver‐
>>               ify  that  the  process that generated the notification is
>>               still alive.  (Note that  if  the  target  process  subse‐
>>               quently  terminates, its PID won't be reused because there
> 
> That's wrong, the PID can be reused, but the /proc/$pid directory is
> internally not associated with the numeric PID, but, conceptually
> speaking, with a specific incarnation of the PID, or something like
> that. (Actually, it is associated with the "struct pid", which is not
> reused, instead of the numeric PID.)

Thanks. I simplified the last sentence of the paragraph:

              In  the above scenario, the risk is that the supervisor may
              try to access the memory of a process other than  the  tar‐
              get.   This  race  can  be avoided by following the call to
              open(2) with a  SECCOMP_IOCTL_NOTIF_ID_VALID  operation  to
              verify  that the process that generated the notification is
              still alive.  (Note that if the target terminates after the
              latter  step, a subsequent read(2) from the file descriptor
              will return 0, indicating end of file.)

I think that's probably enough detail.
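
The ordering that the paragraph describes (open first, then validate the
cookie) could be sketched as follows. The helper name open_target_mem()
is invented for this mail, and the code is a minimal illustration rather
than what the example program in the page does verbatim.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/seccomp.h>

/* Open /proc/[tid]/mem for the target that generated 'req', then check
   that the notification is still valid. Returns a file descriptor, or
   -1 on error or if the notification ID is no longer valid.
   (Hypothetical helper.) */
static int
open_target_mem(int notify_fd, const struct seccomp_notif *req)
{
    char mem_file[64];
    __u64 id = req->id;
    int mem_fd;

    snprintf(mem_file, sizeof(mem_file), "/proc/%u/mem",
             (unsigned int) req->pid);

    mem_fd = open(mem_file, O_RDONLY | O_CLOEXEC);
    if (mem_fd == -1)
        return -1;

    /* Only after the open() succeeds do we check the cookie; if the
       check passes, the open file refers to the target's memory */
    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == -1) {
        close(mem_fd);
        return -1;
    }

    return mem_fd;
}
```

If the target terminates after this function returns, a later read(2)
on the returned descriptor returns 0 (end of file), as the revised text
says.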

>>               remains an open reference to the /proc[pid]/mem  file;  in
>>               this  case, a subsequent read(2) from the file will return
>>               0, indicating end of file.)
>>
>>               On success (i.e., the notification  ID  is  still  valid),
>>               this  operation  returns 0 On failure (i.e., the notifica‐
> 
> nit: s/returns 0/returns 0./

Thanks. Fixed.

>>               tion ID is no longer valid), -1 is returned, and errno  is
>>               set to ENOENT.
>>
>>        SECCOMP_IOCTL_NOTIF_SEND
> [...]
>>               Two kinds of response are possible:
>>
>>               · A response to the kernel telling it to execute the  tar‐
>>                 get  process's  system  call.   In  this case, the flags
>>                 field includes SECCOMP_USER_NOTIF_FLAG_CONTINUE and  the
>>                 error and val fields must be zero.
>>
>>                 This  kind  of response can be useful in cases where the
>>                 supervisor needs to do deeper analysis of  the  target's
>>                 system  call  than  is  possible  from  a seccomp filter
>>                 (e.g., examining the values of pointer arguments),  and,
>>                 having  verified that the system call is acceptable, the
>>                 supervisor wants to allow it to proceed.
> 
> "allow" sounds as if this is an access control thing, but this
> mechanism should usually not be used for access control (unless the
> "seccomp" syscall is blocked).

Yes, Kees has also raised this point.

> Maybe reword as "having decided that
> the system call does not require emulation by the supervisor, the
> supervisor wants it to execute normally", or something like that?

Great! More suggested wordings! Thank you :-).

I tweaked slightly:

    ... having decided that the system call does not require emulation
    by the supervisor, the supervisor wants the system call to 
    be executed normally in the target.
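
For concreteness, building and sending such a "continue" response might
look like the sketch below. The helper names are invented for this mail,
the fallback #define is only there for older header files, and the
sketch zeroes just the compile-time size of the structure (a real
supervisor would use the size reported by SECCOMP_GET_NOTIF_SIZES).

```c
#include <string.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>

#ifndef SECCOMP_USER_NOTIF_FLAG_CONTINUE
#define SECCOMP_USER_NOTIF_FLAG_CONTINUE (1UL << 0)
#endif

/* Prepare a response telling the kernel to execute the target's system
   call normally. 'id' is the cookie from the seccomp_notif structure.
   (Hypothetical helper.) */
static void
fill_continue_response(struct seccomp_notif_resp *resp, __u64 id)
{
    /* The error and val fields must be zero with the CONTINUE flag */
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
}

/* Send a prepared response. Returns 0 on success, -1 on error.
   (Hypothetical helper.) */
static int
send_response(int notify_fd, struct seccomp_notif_resp *resp)
{
    return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, resp);
}
```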

> [...]
>>               On success, this operation returns 0; on  failure,  -1  is
>>               returned,  and  errno  is set to indicate the cause of the
>>               error.  This operation can fail with the following errors:
>>
>>               EINPROGRESS
>>                      A response to this notification  has  already  been
>>                      sent.
>>
>>               EINVAL An invalid value was specified in the flags field.
>>
>>               EINVAL The       flags      field      contained      SEC‐
>>                      COMP_USER_NOTIF_FLAG_CONTINUE, and the error or val
>>                      field was not zero.
>>
>>               ENOENT The  blocked  system call in the target process has
>>                      been interrupted by a signal handler.
> 
> (you could also get this if a response has already been sent, instead
> of EINPROGRESS - the only difference is whether the target thread has
> picked up the response yet)

Got it. I don't think I'll try to work that detail into the page
(unless you really think I should, but since you made this a
parenthetical comment, perhaps you don't think it's necessary).

>> NOTES
>>        The file descriptor returned when seccomp(2) is employed with the
>>        SECCOMP_FILTER_FLAG_NEW_LISTENER  flag  can  be  monitored  using
>>        poll(2), epoll(7), and select(2).  When a notification  is  pend‐
>>        ing,  these interfaces indicate that the file descriptor is read‐
>>        able.
> 
> We should probably also point out somewhere that, as
> include/uapi/linux/seccomp.h says:
> 
>  * Similar precautions should be applied when stacking SECCOMP_RET_USER_NOTIF
>  * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the
>  * same syscall, the most recently added filter takes precedence. This means
>  * that the new SECCOMP_RET_USER_NOTIF filter can override any
>  * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all

My takeaway from Christian's comments is that this comment in the kernel
source is partially wrong, since it is not possible to install multiple
filters with SECCOMP_RET_USER_NOTIF, right?

>  * such filtered syscalls to be executed by sending the response
>  * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can equally
>  * be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.
> 
> In other words, from a security perspective, you must assume that the
> target process can bypass any SECCOMP_RET_USER_NOTIF (or
> SECCOMP_RET_TRACE) filters unless it is completely prohibited from
> calling seccomp(). 

Drawing on text from Christian's comment in seccomp.h and Kees's mail,
I added the following in NOTES:

   Design goals; use of SECCOMP_USER_NOTIF_FLAG_CONTINUE
       The intent of the user-space notification feature is to allow sys‐
       tem calls to be performed on behalf of the target.   The  target's
       system  call should either be handled by the supervisor or allowed
       to continue normally in the kernel (where standard security  poli‐
       cies will be applied).

       Note well: this mechanism must not be used to make security policy
       decisions about the system call, which would be  inherently  race-
       prone for reasons described next.

       The  SECCOMP_USER_NOTIF_FLAG_CONTINUE  flag must be used with cau‐
       tion.  If set by the supervisor, the  target's  system  call  will
       continue.   However,  there  is  a time-of-check, time-of-use race
       here, since an attacker could exploit the interval of  time  where
       the  target  is  blocked  waiting on the "continue" response to do
       things such as rewriting the system call arguments.

       Note furthermore that a user-space notifier can be bypassed if the
       existing  filters  allow  the  use  of  seccomp(2)  or prctl(2) to
       install a filter that returns an action value with a higher prece‐
       dence than SECCOMP_RET_USER_NOTIF (see seccomp(2)).

       It  should  thus  be  absolutely clear that the seccomp user-space
       notification mechanism can not be used  to  implement  a  security
       policy!   It  should  only  ever be used in scenarios where a more
       privileged process supervises the system calls of a lesser  privi‐
       leged  target  to get around kernel-enforced security restrictions
       when the supervisor deems this safe.  In other words, in order  to
       continue a system call, the supervisor should be sure that another
       security mechanism or the kernel itself  will  sufficiently  block
       the  system  call  if  its  arguments  are  rewritten to something
       unsafe.

Seem okay?

> This should also be noted over in the main
> seccomp(2) manpage, especially the SECCOMP_RET_TRACE part.

I added some words in seccomp(2) to emphasize this.

>> EXAMPLES
> [...]
>>        This  program  can  used  to  demonstrate  various aspects of the
> 
> nit: "can be used to demonstrate", or alternatively just "demonstrates"

Thanks. Fixed (added "be")

>>        behavior of the seccomp user-space  notification  mechanism.   To
>>        help  aid  such demonstrations, the program logs various messages
>>        to show the operation of the target process (lines prefixed "T:")
>>        and the supervisor (indented lines prefixed "S:").
> [...]
>>    Program source
> [...]
>>        #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
>>                                } while (0)
> 
> Don't we have err() for this?

I tend to avoid the use of err() because it's a nonstandard BSDism.
Perhaps by this point this is as much a habit as anything rational.

>>        /* Send the file descriptor 'fd' over the connected UNIX domain socket
>>           'sockfd'. Returns 0 on success, or -1 on error. */
>>
>>        static int
>>        sendfd(int sockfd, int fd)
>>        {
>>            struct msghdr msgh;
>>            struct iovec iov;
>>            int data;
>>            struct cmsghdr *cmsgp;
>>
>>            /* Allocate a char array of suitable size to hold the ancillary data.
>>               However, since this buffer is in reality a 'struct cmsghdr', use a
>>               union to ensure that it is suitable aligned. */
> 
> nit: suitably

Thanks. Fixed.

>>            union {
>>                char   buf[CMSG_SPACE(sizeof(int))];
>>                                /* Space large enough to hold an 'int' */
>>                struct cmsghdr align;
>>            } controlMsg;
>>
>>            /* The 'msg_name' field can be used to specify the address of the
>>               destination socket when sending a datagram. However, we do not
>>               need to use this field because 'sockfd' is a connected socket. */
>>
>>            msgh.msg_name = NULL;
>>            msgh.msg_namelen = 0;
>>
>>            /* On Linux, we must transmit at least one byte of real data in
>>               order to send ancillary data. We transmit an arbitrary integer
>>               whose value is ignored by recvfd(). */
>>
>>            msgh.msg_iov = &iov;
>>            msgh.msg_iovlen = 1;
>>            iov.iov_base = &data;
>>            iov.iov_len = sizeof(int);
>>            data = 12345;
>>
>>            /* Set 'msghdr' fields that describe ancillary data */
>>
>>            msgh.msg_control = controlMsg.buf;
>>            msgh.msg_controllen = sizeof(controlMsg.buf);
>>
>>            /* Set up ancillary data describing file descriptor to send */
>>
>>            cmsgp = CMSG_FIRSTHDR(&msgh);
>>            cmsgp->cmsg_level = SOL_SOCKET;
>>            cmsgp->cmsg_type = SCM_RIGHTS;
>>            cmsgp->cmsg_len = CMSG_LEN(sizeof(int));
>>            memcpy(CMSG_DATA(cmsgp), &fd, sizeof(int));
>>
>>            /* Send real plus ancillary data */
>>
>>            if (sendmsg(sockfd, &msgh, 0) == -1)
>>                return -1;
>>
>>            return 0;
>>        }
> 
> Instead of using unix domain sockets to send the fd to the parent, I
> think you could also use clone3() with flags==CLONE_FILES|SIGCHLD,
> dup2() the seccomp fd to an fd that was reserved in the parent, call
> unshare(CLONE_FILES) in the child after setting up the seccomp fd, and
> wake up the parent with something like pthread_cond_signal()? I'm not
> sure whether that'd look better or worse in the end though, so maybe
> just ignore this comment.

Ahh -- nice. That answers in detail a question I also had for Tycho.
I won't make any changes to the page (since I'm not sure it would 
look better), but I will add that detail in a comment in the page
source. Perhaps I'll do something with that in the future.

> [...]
>>        /* Access the memory of the target process in order to discover the
>>           pathname that was given to mkdir() */
>>
>>        static void
>>        getTargetPathname(struct seccomp_notif *req, int notifyFd,
>>                          char *path, size_t len)
>>        {
>>            char procMemPath[PATH_MAX];
>>            snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", req->pid);
>>
>>            int procMemFd = open(procMemPath, O_RDONLY);
> 
> Should example code like this maybe use O_CLOEXEC unless the fd in
> question actually has to be inheritable? I know it doesn't actually
> matter here, but if this code was used in a multi-threaded context, it
> might.

Yes, good point. I changed this.

>>            if (procMemFd == -1)
>>                errExit("Supervisor: open");
>>
>>            /* Check that the process whose info we are accessing is still alive.
>>               If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed
>>               in checkNotificationIdIsValid()) succeeds, we know that the
>>               /proc/PID/mem file descriptor that we opened corresponds to the
>>               process for which we received a notification. If that process
>>               subsequently terminates, then read() on that file descriptor
>>               will return 0 (EOF). */
>>
>>            checkNotificationIdIsValid(notifyFd, req->id);
>>
>>            /* Seek to the location containing the pathname argument (i.e., the
>>               first argument) of the mkdir(2) call and read that pathname */
>>
>>            if (lseek(procMemFd, req->data.args[0], SEEK_SET) == -1)
>>                errExit("Supervisor: lseek");
>>
>>            ssize_t s = read(procMemFd, path, PATH_MAX);
>>            if (s == -1)
>>                errExit("read");
> 
> Why not pread() instead of lseek()+read()?

No good reason! I changed it to:

           /* Read bytes at the location containing the pathname argument
              (i.e., the first argument) of the mkdir(2) call */

           ssize_t s = pread(procMemFd, path, PATH_MAX, req->data.args[0]);
           if (s == -1)
               errExit("pread");

           if (s == 0) {
               fprintf(stderr, "\tS: pread() of /proc/PID/mem "
                       "returned 0 (EOF)\n");
               exit(EXIT_FAILURE);
           }

Thanks!

>>            if (s == 0) {
>>                fprintf(stderr, "\tS: read() of /proc/PID/mem "
>>                        "returned 0 (EOF)\n");
>>                exit(EXIT_FAILURE);
>>            }
>>
>>            if (close(procMemFd) == -1)
>>                errExit("close-/proc/PID/mem");
> 
> We should probably make sure here that the value we read is actually
> NUL-terminated?

So, I was curious about that point also. But, (why) are we not
guaranteed that it will be NUL-terminated?

>>        }
>>
>>        /* Handle notifications that arrive via the SECCOMP_RET_USER_NOTIF file
>>           descriptor, 'notifyFd'. */
>>
>>        static void
>>        handleNotifications(int notifyFd)
>>        {
>>            struct seccomp_notif_sizes sizes;
>>            char path[PATH_MAX];
>>                /* For simplicity, we assume that the pathname given to mkdir()
>>                   is no more than PATH_MAX bytes; but this might not be true. */
> 
> No, it has to be true, otherwise the kernel would fail the syscall if
> it was executing normally.

Yes. I removed that comment.

>>            /* Discover the sizes of the structures that are used to receive
>>               notifications and send notification responses, and allocate
>>               buffers of those sizes. */
>>
>>            if (seccomp(SECCOMP_GET_NOTIF_SIZES, 0, &sizes) == -1)
>>                errExit("\tS: seccomp-SECCOMP_GET_NOTIF_SIZES");
>>
>>            struct seccomp_notif *req = malloc(sizes.seccomp_notif);
>>            if (req == NULL)
>>                errExit("\tS: malloc");
>>
>>            struct seccomp_notif_resp *resp = malloc(sizes.seccomp_notif_resp);
> 
> This should probably do something like max(sizes.seccomp_notif_resp,
> sizeof(struct seccomp_notif_resp)) in case the program was built
> against new UAPI headers that make struct seccomp_notif_resp big, but
> is running under an old kernel where that struct is still smaller?

I'm confused. Why? I mean, if the running kernel says that it expects
a buffer of a certain size, and we allocate a buffer of that size,
what's the problem?
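
For reference, the suggestion quoted above could be sketched roughly as
follows. This is only an illustration of taking the larger of the two
sizes; the helper names respBufferSize() and allocNotifResp() are made up,
and 'kernelSize' is assumed to be the seccomp_notif_resp value obtained
from SECCOMP_GET_NOTIF_SIZES:

```c
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: return the larger of the size reported by the
   running kernel and the size of the structure that this program was
   compiled against. */

size_t
respBufferSize(size_t kernelSize)
{
    size_t compiledSize = sizeof(struct seccomp_notif_resp);

    return (kernelSize > compiledSize) ? kernelSize : compiledSize;
}

/* Hypothetical helper: allocate a zeroed response buffer of that size.
   Returns NULL if the allocation fails. */

struct seccomp_notif_resp *
allocNotifResp(size_t kernelSize)
{
    return calloc(1, respBufferSize(kernelSize));
}
```

With a buffer sized this way, accesses made through the compiled-in
structure type stay inside the allocation even if the kernel reports a
smaller size.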

>>            if (resp == NULL)
>>                errExit("\tS: malloc");
> [...]
>>                    } else {
>>
>>                        /* If mkdir() failed in the supervisor, pass the error
>>                           back to the target */
>>
>>                        resp->error = -errno;
>>                        printf("\tS: failure! (errno = %d; %s)\n", errno,
>>                                strerror(errno));
>>                    }
>>                                                             } else if (strncmp(path, "./", strlen("./")) == 0) {
> 
> nit: indent messed up

Thanks. Fixed.

And thanks again for the detailed review, Jann.

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


* Re: For review: seccomp_user_notif(2) manual page
  2020-10-15 11:24   ` Michael Kerrisk (man-pages)
@ 2020-10-15 20:32     ` Jann Horn
  2020-10-16 18:29       ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 52+ messages in thread
From: Jann Horn @ 2020-10-15 20:32 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Tycho Andersen, Sargun Dhillon, Kees Cook, Christian Brauner,
	linux-man, lkml, Aleksa Sarai, Alexei Starovoitov, Will Drewry,
	bpf, Song Liu, Daniel Borkmann, Andy Lutomirski,
	Linux Containers, Giuseppe Scrivano, Robert Sesek

On Thu, Oct 15, 2020 at 1:24 PM Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
> On 9/30/20 5:53 PM, Jann Horn wrote:
> > On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
> > <mtk.manpages@gmail.com> wrote:
> >> I knew it would be a big ask, but below is kind of the manual page
> >> I was hoping you might write [1] for the seccomp user-space notification
> >> mechanism. Since you didn't (and because 5.9 adds various new pieces
> >> such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD
> >> that also will need documenting [2]), I did :-). But of course I may
> >> have made mistakes...
[...]
> >>        3. The supervisor process will receive notification events on the
> >>           listening  file  descriptor.   These  events  are  returned as
> >>           structures of type seccomp_notif.  Because this structure  and
> >>           its  size may evolve over kernel versions, the supervisor must
> >>           first determine the size of  this  structure  using  the  sec‐
> >>           comp(2)  SECCOMP_GET_NOTIF_SIZES  operation,  which  returns a
> >>           structure of type seccomp_notif_sizes.  The  supervisor  allo‐
> >>           cates a buffer of size seccomp_notif_sizes.seccomp_notif bytes
> >>           to receive notification events.   In addition, the  supervisor
> >>           allocates  another  buffer  of  size  seccomp_notif_sizes.sec‐
> >>           comp_notif_resp  bytes  for  the  response  (a   struct   sec‐
> >>           comp_notif_resp  structure) that it will provide to the kernel
> >>           (and thus the target process).
> >>
> >>        4. The target process then performs its workload, which  includes
> >>           system  calls  that  will be controlled by the seccomp filter.
> >>           Whenever one of these system calls causes the filter to return
> >>           the  SECCOMP_RET_USER_NOTIF  action value, the kernel does not
> >>           execute the system call;  instead,  execution  of  the  target
> >>           process is temporarily blocked inside the kernel and a notifi‐
> >
> > where "blocked" refers to the interruptible, restartable kind - if the
> > child receives a signal with an SA_RESTART signal handler in the
> > meantime, it'll leave the syscall, go through the signal handler, then
> > restart the syscall again and send the same request to the supervisor
> > again. so the supervisor may see duplicate syscalls.
>
> So, I partially demonstrated what you describe here, for two example
> system calls (epoll_wait() and pause()). But I could not exactly
> demonstrate things as I understand you to be describing them. (So,
> I'm not sure whether I have not understood you correctly, or
> if things are not exactly as you describe them.)
>
> Here's a scenario (A) that I tested:
>
> 1. Target installs seccomp filters for a blocking syscall
>    (epoll_wait() or pause(), both of which should never restart,
>    regardless of SA_RESTART)
> 2. Target installs SIGINT handler with SA_RESTART
> 3. Supervisor is sleeping (i.e., is not blocked in
>    SECCOMP_IOCTL_NOTIF_RECV operation).
> 4. Target makes a blocking system call (epoll_wait() or pause()).
> 5. SIGINT gets delivered to target; handler gets called;
>    ***and syscall gets restarted by the kernel***
>
> That last should never happen, of course, and is a result of the
> combination of both the user-notify filter and the SA_RESTART flag.
> If one or other is not present, then the system call is not
> restarted.
>
> So, as you note below, the UAPI gets broken a little.
>
> However, from your description above I had understood that
> something like the following scenario (B) could occur:
>
> 1. Target installs seccomp filters for a blocking syscall
>    (epoll_wait() or pause(), both of which should never restart,
>    regardless of SA_RESTART)
> 2. Target installs SIGINT handler with SA_RESTART
> 3. Supervisor performs SECCOMP_IOCTL_NOTIF_RECV operation (which
>    blocks).
> 4. Target makes a blocking system call (epoll_wait() or pause()).
> 5. Supervisor gets seccomp user-space notification (i.e.,
>    SECCOMP_IOCTL_NOTIF_RECV ioctl() returns)
> 6. SIGINT gets delivered to target; handler gets called;
>    and syscall gets restarted by the kernel
> 7. Supervisor performs another SECCOMP_IOCTL_NOTIF_RECV operation
>    which gets another notification for the restarted system call.
>
> However, I don't observe such behavior. In step 6, the syscall
> does not get restarted by the kernel, but instead returns -1/EINTR.
> Perhaps I have misconstructed my experiment in the second case, or
> perhaps I've misunderstood what you meant, or is it possibly the
> case that things are not quite as you said?

user@vm:~/test/seccomp-notify-interrupt$ cat seccomp-notify-interrupt.c
#define _GNU_SOURCE
#include <stdio.h>
#include <signal.h>
#include <err.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>
#include <sched.h>
#include <stddef.h>
#include <limits.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/prctl.h>
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <linux/futex.h>

struct {
  int seccomp_fd;
} *shared;

static void handle_signal(int sig, siginfo_t *info, void *uctx) {
  printf("signal handler invoked\n");
}

int main(void) {
  setbuf(stdout, NULL);

  shared = mmap(NULL, 0x1000, PROT_READ|PROT_WRITE,
                MAP_ANONYMOUS|MAP_SHARED, -1, 0);
  if (shared == MAP_FAILED)
    err(1, "mmap");
  shared->seccomp_fd = -1;

  /* glibc's clone() wrapper doesn't support fork()-style usage */
  pid_t child = syscall(__NR_clone, CLONE_FILES|SIGCHLD,
                        NULL, NULL, NULL, 0);
  if (child == -1) err(1, "clone");
  if (child == 0) {
    /* don't outlive the parent */
    prctl(PR_SET_PDEATHSIG, SIGKILL);
    if (getppid() == 1) exit(0);

    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
    struct sock_filter insns[] = {
      BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)),
      BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_pause, 0, 1),
      BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_USER_NOTIF),
      BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW)
    };
    struct sock_fprog prog = {
      .len = sizeof(insns)/sizeof(insns[0]),
      .filter = insns
    };
    int seccomp_ret = syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
                              SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
    if (seccomp_ret < 0)
      err(1, "install");
    printf("installed seccomp: fd %d\n", seccomp_ret);

    __atomic_store(&shared->seccomp_fd, &seccomp_ret, __ATOMIC_RELEASE);
    int futex_ret = syscall(__NR_futex, &shared->seccomp_fd, FUTEX_WAKE,
                            INT_MAX, NULL, NULL, 0);
    printf("woke %d waiters\n", futex_ret);

    struct sigaction act = {
      .sa_sigaction = handle_signal,
      .sa_flags = SA_RESTART|SA_SIGINFO
    };
    if (sigaction(SIGUSR1, &act, NULL))
      err(1, "sigaction");

    pause();
    perror("pause returned");
    exit(0);
  }

  int futex_ret = syscall(__NR_futex, &shared->seccomp_fd, FUTEX_WAIT,
                          -1, NULL, NULL, 0);
  if (futex_ret == -1 && errno != EAGAIN)
    err(1, "futex wait");
  int fd = __atomic_load_n(&shared->seccomp_fd, __ATOMIC_ACQUIRE);
  printf("child installed seccomp fd %d\n", fd);

  sleep(1);
  printf("going to send SIGUSR1...\n");
  kill(child, SIGUSR1);
  sleep(1);

  exit(0);
}
user@vm:~/test/seccomp-notify-interrupt$ gcc -o seccomp-notify-interrupt seccomp-notify-interrupt.c -Wall
user@vm:~/test/seccomp-notify-interrupt$ strace -f ./seccomp-notify-interrupt >/dev/null
execve("./seccomp-notify-interrupt", ["./seccomp-notify-interrupt"],
0x7ffcb31a0d08 /* 42 vars */) = 0
brk(NULL)                               = 0x5565864b2000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=89296, ...}) = 0
mmap(NULL, 89296, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f7e688e7000
close(3)                                = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260A\2\0\0\0\0\0"...,
832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1824496, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x7f7e688e5000
mmap(NULL, 1837056, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f7e68724000
mprotect(0x7f7e68746000, 1658880, PROT_NONE) = 0
mmap(0x7f7e68746000, 1343488, PROT_READ|PROT_EXEC,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x22000) = 0x7f7e68746000
mmap(0x7f7e6888e000, 311296, PROT_READ,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x16a000) = 0x7f7e6888e000
mmap(0x7f7e688db000, 24576, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b6000) = 0x7f7e688db000
mmap(0x7f7e688e1000, 14336, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f7e688e1000
close(3)                                = 0
arch_prctl(ARCH_SET_FS, 0x7f7e688e6500) = 0
mprotect(0x7f7e688db000, 16384, PROT_READ) = 0
mprotect(0x556585183000, 4096, PROT_READ) = 0
mprotect(0x7f7e68924000, 4096, PROT_READ) = 0
munmap(0x7f7e688e7000, 89296)           = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, -1,
0) = 0x7f7e688fc000
clone(child_stack=NULL, flags=CLONE_FILES|SIGCHLD) = 2558
futex(0x7f7e688fc000, FUTEX_WAIT, 4294967295, NULLstrace: Process 2558 attached
 <unfinished ...>
[pid  2558] prctl(PR_SET_PDEATHSIG, SIGKILL) = 0
[pid  2558] getppid()                   = 2557
[pid  2558] prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) = 0
[pid  2558] seccomp(SECCOMP_SET_MODE_FILTER, 0x8 /*
SECCOMP_FILTER_FLAG_??? */, {len=4, filter=0x7ffdf7cc9b50}) = 3
[pid  2558] write(1, "installed seccomp: fd 3\n", 24) = 24
[pid  2558] futex(0x7f7e688fc000, FUTEX_WAKE, 2147483647 <unfinished ...>
[pid  2557] <... futex resumed> )       = 0
[pid  2558] <... futex resumed> )       = 1
[pid  2558] write(1, "woke 1 waiters\n", 15) = 15
[pid  2557] write(1, "child installed seccomp fd 3\n", 29) = 29
[pid  2558] rt_sigaction(SIGUSR1, {sa_handler=0x556585181215,
sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART|SA_SIGINFO,
sa_restorer=0x7f7e6875b840}, NULL, 8) = 0
[pid  2557] nanosleep({tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid  2558] pause( <unfinished ...>
[pid  2557] <... nanosleep resumed> 0x7ffdf7cc9b10) = 0
[pid  2557] write(1, "going to send SIGUSR1...", 24) = 24
[pid  2557] write(1, "\n", 1)           = 1
[pid  2557] kill(2558, SIGUSR1)         = 0
[pid  2557] nanosleep({tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid  2558] <... pause resumed> )       = ? ERESTARTSYS (To be
restarted if SA_RESTART is set)
[pid  2558] --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER,
si_pid=2557, si_uid=1000} ---
[pid  2558] write(1, "signal handler invoked", 22) = 22
[pid  2558] write(1, "\n", 1)           = 1
[pid  2558] rt_sigreturn({mask=[]})     = 34
[pid  2558] pause( <unfinished ...>
[pid  2557] <... nanosleep resumed> 0x7ffdf7cc9b10) = 0
[pid  2557] exit_group(0)               = ?
[pid  2557] +++ exited with 0 +++
<... pause resumed>)                    = ?
+++ killed by SIGKILL +++
user@vm:~/test/seccomp-notify-interrupt$


[...]
> >>           event  is  available.
> >
> > Maybe we should note here that you can use the multi-fd-polling APIs
> > (select/poll/epoll) instead, and that if the notification goes away
> > before you call SECCOMP_IOCTL_NOTIF_RECV, the ioctl will return
> > -ENOENT instead of blocking, and therefore as long as nobody else
> > reads from the same fd, you can assume that after the fd reports as
> > readable, you can call SECCOMP_IOCTL_NOTIF_RECV once without blocking.
>
> I'd rather not add this info in the overview section, which is
> already longer than I would like. But I did add some details
> in NOTES:
>
> [[
>        The file descriptor returned when seccomp(2) is employed with  the
>        SECCOMP_FILTER_FLAG_NEW_LISTENER   flag  can  be  monitored  using
>        poll(2), epoll(7), and select(2).  When a notification is pending,
>        these  interfaces  indicate  that the file descriptor is readable.
>        Following    such    an    indication,    a    subsequent     SEC‐
>        COMP_IOCTL_NOTIF_RECV  ioctl(2)  will  not block, returning either
>        information about a notification or else failing  with  the  error
>        EINTR  if  the  target  process has been killed by a signal or its
>        system call has been interrupted by a signal handler.
> ]]
>
> Okay?

Sounds good.
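
A rough sketch of that poll-then-receive pattern, in case it helps. The
helper names waitReadable() and recvNotification() are invented for this
sketch; it assumes nobody else reads from the same fd, and it treats
ENOENT as "the notification went away", as described in my earlier
comment:

```c
#include <errno.h>
#include <linux/seccomp.h>
#include <poll.h>
#include <sys/ioctl.h>

/* Wait until 'fd' is reported readable. Returns 1 if readable, 0 on
   timeout, or -1 on error. A 'timeoutMs' of -1 waits indefinitely. */

int
waitReadable(int fd, int timeoutMs)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n;

    do {
        n = poll(&pfd, 1, timeoutMs);
    } while (n == -1 && errno == EINTR);

    if (n <= 0)
        return n;

    return (pfd.revents & POLLIN) ? 1 : -1;
}

/* Fetch one notification after 'notifyFd' was reported readable. 'req'
   must point to a zeroed buffer of at least the size reported by
   SECCOMP_GET_NOTIF_SIZES. Returns 1 if a notification was received,
   0 if it went away in the meantime (ENOENT), or -1 on any other error. */

int
recvNotification(int notifyFd, struct seccomp_notif *req)
{
    if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_RECV, req) == 0)
        return 1;

    return (errno == ENOENT) ? 0 : -1;
}
```

A supervisor loop would call waitReadable() on the listening fd and then
recvNotification() once per readable indication, going back to polling
when the latter returns 0.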

[...]
> >>           bilities to perform the mount operation.
> >>
> >>        8. The supervisor then sends a response to the notification.  The
> >>           information  in  this  response  is used by the kernel to con‐
> >>           struct a return value for the target process's system call and
> >>           provide a value that will be assigned to the errno variable of
> >>           the target process.
> >>
> >>           The  response  is  sent  using  the   SECCOMP_IOCTL_NOTIF_SEND
> >>           ioctl(2)   operation,   which  is  used  to  transmit  a  sec‐
> >>           comp_notif_resp  structure  to  the  kernel.   This  structure
> >>           includes  a  cookie  value that the supervisor obtained in the
> >>           seccomp_notif    structure    returned     by     the     SEC‐
> >>           COMP_IOCTL_NOTIF_RECV operation.  This cookie value allows the
> >>           kernel to associate the response with the target process.
> >
> > (unless if the target thread entered a signal handler or was killed in
> > the meantime)
>
> Yes, but I think I have this adequately covered in the errors described
> later in the page for SECCOMP_IOCTL_NOTIF_SEND. (I have now added the
> target-process-terminated case to the error text.)
>
>               ENOENT The blocked system  call  in  the  target  has  been
>                      interrupted  by  a  signal  handler  or  the  target
>                      process has terminated.
>
> Is that sufficient?

Ah, right.

[...]
> >>               ENOENT The  target  process  was killed by a signal as the
> >>                      notification information was being generated.
> >
> > Not just killed, interruption with a signal handler has the same effect.
>
> Ah yes! Thanks. I added that as well.
>
> [[
>               ENOENT The target thread was killed  by  a  signal  as  the
>                      notification information was being generated, or the
>                      target's (blocked) system call was interrupted by  a
>                      signal handler.
> ]]
>
> Okay?

Yeah, sounds good.

[...]
> >>               In the above scenario, the risk is that the supervisor may
> >>               try to access the memory of a process other than the  tar‐
> >>               get.   This  race  can be avoided by following the call to
> >>               open with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to ver‐
> >>               ify  that  the  process that generated the notification is
> >>               still alive.  (Note that  if  the  target  process  subse‐
> >>               quently  terminates, its PID won't be reused because there
> >
> > That's wrong, the PID can be reused, but the /proc/$pid directory is
> > internally not associated with the numeric PID, but, conceptually
> > speaking, with a specific incarnation of the PID, or something like
> > that. (Actually, it is associated with the "struct pid", which is not
> > reused, instead of the numeric PID.)
>
> Thanks. I simplified the last sentence of the paragraph:
>
>               In  the above scenario, the risk is that the supervisor may
>               try to access the memory of a process other than  the  tar‐
>               get.   This  race  can  be avoided by following the call to
>               open(2) with a  SECCOMP_IOCTL_NOTIF_ID_VALID  operation  to
>               verify  that the process that generated the notification is
>               still alive.  (Note that if the target terminates after the
>               latter  step, a subsequent read(2) from the file descriptor
>               will return 0, indicating end of file.)
>
> I think that's probably enough detail.

Maybe make that "may return 0" instead of "will return 0" - reading
from /proc/$pid/mem can only return 0 in the following cases AFAICS:

1. task->mm was already gone at open() time
2. mm->mm_users has dropped to zero (the mm only has lazytlb users;
   page tables and VMAs are being blown away or have been blown away)
3. the syscall was called with length 0

When a process has gone away, normally mm->mm_users will drop to zero,
but someone else could theoretically still be holding a reference to
the mm (e.g. someone else in the middle of accessing /proc/$pid/mem).
(Such references should normally not be very long-lived though.)

Additionally, in the unlikely case that the OOM killer just chomped
through the page tables of the target process, I think the read will
return -EIO (same error as if the address was simply unmapped) if the
address is within a non-shared mapping. (Maybe that's something procfs
could do better...)
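
In code terms, a supervisor probably wants to keep those outcomes apart
rather than assume that a dead target always shows up as EOF. A small
sketch (the enum and the helper name readTargetMem() are made up for
illustration):

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

enum memReadStatus {
    MEM_READ_OK,        /* At least one byte was read */
    MEM_READ_EOF,       /* pread() returned 0 (e.g., the mm is gone) */
    MEM_READ_ERROR      /* pread() failed; errno is set (e.g., EIO) */
};

/* Hypothetical helper: read up to 'len' bytes at offset 'addr' from an
   open /proc/PID/mem file descriptor and classify the result. The raw
   pread() return value is stored in '*nread'. */

enum memReadStatus
readTargetMem(int procMemFd, void *buf, size_t len, off_t addr,
              ssize_t *nread)
{
    ssize_t s = pread(procMemFd, buf, len, addr);

    *nread = s;

    if (s == -1)
        return MEM_READ_ERROR;
    if (s == 0)
        return MEM_READ_EOF;

    return MEM_READ_OK;
}
```

Note that MEM_READ_OK may still be a short read, so the caller has to
look at '*nread' before using the buffer.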

[...]
> >> NOTES
> >>        The file descriptor returned when seccomp(2) is employed with the
> >>        SECCOMP_FILTER_FLAG_NEW_LISTENER  flag  can  be  monitored  using
> >>        poll(2), epoll(7), and select(2).  When a notification  is  pend‐
> >>        ing,  these interfaces indicate that the file descriptor is read‐
> >>        able.
> >
> > We should probably also point out somewhere that, as
> > include/uapi/linux/seccomp.h says:
> >
> >  * Similar precautions should be applied when stacking SECCOMP_RET_USER_NOTIF
> >  * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the
> >  * same syscall, the most recently added filter takes precedence. This means
> >  * that the new SECCOMP_RET_USER_NOTIF filter can override any
> >  * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all
>
> My takeaway from Chritian's comments is that this comment in the kernel
> source is partially wrong, since it is not possible to install multiple
> filters with SECCOMP_RET_USER_NOTIF, right?

Yeah. (Well, AFAICS technically, you can add more filters that return
SECCOMP_RET_USER_NOTIF, but when a filter returns that without having
a notifier fd attached, seccomp blocks the syscall with -ENOSYS; it
won't use the notifier fd attached to a different filter in the
chain.)

> >  * such filtered syscalls to be executed by sending the response
> >  * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can equally
> >  * be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.
> >
> > In other words, from a security perspective, you must assume that the
> > target process can bypass any SECCOMP_RET_USER_NOTIF (or
> > SECCOMP_RET_TRACE) filters unless it is completely prohibited from
> > calling seccomp().
>
> Drawing on text from Chrstian's comment in seccomp.h and Kees's mail,
> I added the following in NOTES:
>
>    Design goals; use of SECCOMP_USER_NOTIF_FLAG_CONTINUE
>        The intent of the user-space notification feature is to allow sys‐
>        tem calls to be performed on behalf of the target.   The  target's
>        system  call should either be handled by the supervisor or allowed
>        to continue normally in the kernel (where standard security  poli‐
>        cies will be applied).
>
>        Note well: this mechanism must not be used to make security policy
>        decisions about the system call, which would be  inherently  race-
>        prone for reasons described next.
>
>        The  SECCOMP_USER_NOTIF_FLAG_CONTINUE  flag must be used with cau‐
>        tion.  If set by the supervisor, the  target's  system  call  will
>        continue.   However,  there  is  a time-of-check, time-of-use race
>        here, since an attacker could exploit the interval of  time  where
>        the  target  is  blocked  waiting on the "continue" response to do
>        things such as rewriting the system call arguments.
>
>        Note furthermore that a user-space notifier can be bypassed if the
>        existing  filters  allow  the  use  of  seccomp(2)  or prctl(2) to
>        install a filter that returns an action value with a higher prece‐
>        dence than SECCOMP_RET_USER_NOTIF (see seccomp(2)).
>
>        It  should  thus  be  absolutely clear that the seccomp user-space
>        notification mechanism can not be used  to  implement  a  security
>        policy!   It  should  only  ever be used in scenarios where a more
>        privileged process supervises the system calls of a lesser  privi‐
>        leged  target  to get around kernel-enforced security restrictions
>        when the supervisor deems this safe.  In other words, in order  to
>        continue a system call, the supervisor should be sure that another
>        security mechanism or the kernel itself  will  sufficiently  block
>        the  system  call  if  its  arguments  are  rewritten to something
>        unsafe.
>
> Seem okay?

Yeah, sounds good.

[...]
> >>            if (s == 0) {
> >>                fprintf(stderr, "\tS: read() of /proc/PID/mem "
> >>                        "returned 0 (EOF)\n");
> >>                exit(EXIT_FAILURE);
> >>            }
> >>
> >>            if (close(procMemFd) == -1)
> >>                errExit("close-/proc/PID/mem");
> >
> > We should probably make sure here that the value we read is actually
> > NUL-terminated?
>
> So, I was curious about that point also. But, (why) are we not
> guaranteed that it will be NUL-terminated?

Because it's random memory filled by another process, which we don't
necessarily trust. While seccomp notifiers aren't usable for applying
*extra* security restrictions, the supervisor will still often be more
privileged than the supervised process.

[...]
> >>            /* Discover the sizes of the structures that are used to receive
> >>               notifications and send notification responses, and allocate
> >>               buffers of those sizes. */
> >>
> >>            if (seccomp(SECCOMP_GET_NOTIF_SIZES, 0, &sizes) == -1)
> >>                errExit("\tS: seccomp-SECCOMP_GET_NOTIF_SIZES");
> >>
> >>            struct seccomp_notif *req = malloc(sizes.seccomp_notif);
> >>            if (req == NULL)
> >>                errExit("\tS: malloc");
> >>
> >>            struct seccomp_notif_resp *resp = malloc(sizes.seccomp_notif_resp);
> >
> > This should probably do something like max(sizes.seccomp_notif_resp,
> > sizeof(struct seccomp_notif_resp)) in case the program was built
> > against new UAPI headers that make struct seccomp_notif_resp big, but
> > is running under an old kernel where that struct is still smaller?
>
> I'm confused. Why? I mean, if the running kernel says that it expects
> a buffer of a certain size, and we allocate a buffer of that size,
> what's the problem?

Because in userspace, we cast the result of malloc() to a "struct
seccomp_notif_resp *". If the kernel tells us that it expects a size
smaller than sizeof(struct seccomp_notif_resp), then we end up with a
pointer to a struct that consists partly of allocated memory, partly
of out-of-bounds memory, which is generally a bad idea - I'm not sure
whether the C standard permits that. And if userspace then e.g.
decides to access some member of that struct that is beyond what the
kernel thinks is the struct size, we get actual OOB memory accesses.

* Re: For review: seccomp_user_notif(2) manual page
  2020-10-15 20:32     ` Jann Horn
@ 2020-10-16 18:29       ` Michael Kerrisk (man-pages)
  2020-10-17  0:25         ` Jann Horn
  0 siblings, 1 reply; 52+ messages in thread
From: Michael Kerrisk (man-pages) @ 2020-10-16 18:29 UTC (permalink / raw)
  To: Jann Horn
  Cc: mtk.manpages, Tycho Andersen, Sargun Dhillon, Kees Cook,
	Christian Brauner, linux-man, lkml, Aleksa Sarai,
	Alexei Starovoitov, Will Drewry, bpf, Song Liu, Daniel Borkmann,
	Andy Lutomirski, Linux Containers, Giuseppe Scrivano,
	Robert Sesek

Hello Jann,

Thanks for your reply!

On 10/15/20 10:32 PM, Jann Horn wrote:
> On Thu, Oct 15, 2020 at 1:24 PM Michael Kerrisk (man-pages)
> <mtk.manpages@gmail.com> wrote:
>> On 9/30/20 5:53 PM, Jann Horn wrote:
>>> On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
>>> <mtk.manpages@gmail.com> wrote:
>>>> I knew it would be a big ask, but below is kind of the manual page
>>>> I was hoping you might write [1] for the seccomp user-space notification
>>>> mechanism. Since you didn't (and because 5.9 adds various new pieces
>>>> such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD
>>>> that also will need documenting [2]), I did :-). But of course I may
>>>> have made mistakes...
> [...]
>>>>        3. The supervisor process will receive notification events on the
>>>>           listening  file  descriptor.   These  events  are  returned as
>>>>           structures of type seccomp_notif.  Because this structure  and
>>>>           its  size may evolve over kernel versions, the supervisor must
>>>>           first determine the size of  this  structure  using  the  sec‐
>>>>           comp(2)  SECCOMP_GET_NOTIF_SIZES  operation,  which  returns a
>>>>           structure of type seccomp_notif_sizes.  The  supervisor  allo‐
>>>>           cates a buffer of size seccomp_notif_sizes.seccomp_notif bytes
>>>>           to receive notification events.   In  addition, the supervisor
>>>>           allocates  another  buffer  of  size  seccomp_notif_sizes.sec‐
>>>>           comp_notif_resp  bytes  for  the  response  (a   struct   sec‐
>>>>           comp_notif_resp  structure) that it will provide to the kernel
>>>>           (and thus the target process).
>>>>
>>>>        4. The target process then performs its workload, which  includes
>>>>           system  calls  that  will be controlled by the seccomp filter.
>>>>           Whenever one of these system calls causes the filter to return
>>>>           the  SECCOMP_RET_USER_NOTIF  action value, the kernel does not
>>>>           execute the system call;  instead,  execution  of  the  target
>>>>           process is temporarily blocked inside the kernel and a notifi‐
>>>
>>> where "blocked" refers to the interruptible, restartable kind - if the
>>> child receives a signal with an SA_RESTART signal handler in the
>>> meantime, it'll leave the syscall, go through the signal handler, then
>>> restart the syscall again and send the same request to the supervisor
>>> again. so the supervisor may see duplicate syscalls.
>>
>> So, I partially demonstrated what you describe here, for two example
>> system calls (epoll_wait() and pause()). But I could not exactly
>> demonstrate things as I understand you to be describing them. (So,
>> I'm not sure whether I have not understood you correctly, or
>> if things are not exactly as you describe them.)
>>
>> Here's a scenario (A) that I tested:
>>
>> 1. Target installs seccomp filters for a blocking syscall
>>    (epoll_wait() or pause(), both of which should never restart,
>>    regardless of SA_RESTART)
>> 2. Target installs SIGINT handler with SA_RESTART
>> 3. Supervisor is sleeping (i.e., is not blocked in
>>    SECCOMP_IOCTL_NOTIF_RECV operation).
>> 4. Target makes a blocking system call (epoll_wait() or pause()).
>> 5. SIGINT gets delivered to target; handler gets called;
>>    ***and syscall gets restarted by the kernel***
>>
>> That last should never happen, of course, and is a result of the
>> combination of both the user-notify filter and the SA_RESTART flag.
>> If one or other is not present, then the system call is not
>> restarted.
>>
>> So, as you note below, the UAPI gets broken a little.
>>
>> However, from your description above I had understood that
>> something like the following scenario (B) could occur:
>>
>> 1. Target installs seccomp filters for a blocking syscall
>>    (epoll_wait() or pause(), both of which should never restart,
>>    regardless of SA_RESTART)
>> 2. Target installs SIGINT handler with SA_RESTART
>> 3. Supervisor performs SECCOMP_IOCTL_NOTIF_RECV operation (which
>>    blocks).
>> 4. Target makes a blocking system call (epoll_wait() or pause()).
>> 5. Supervisor gets seccomp user-space notification (i.e.,
>>    SECCOMP_IOCTL_NOTIF_RECV ioctl() returns).
>> 6. SIGINT gets delivered to target; handler gets called;
>>    and syscall gets restarted by the kernel
>> 7. Supervisor performs another SECCOMP_IOCTL_NOTIF_RECV operation
>>    which gets another notification for the restarted system call.
>>
>> However, I don't observe such behavior. In step 6, the syscall
>> does not get restarted by the kernel, but instead returns -1/EINTR.
>> Perhaps I have misconstructed my experiment in the second case, or
>> perhaps I've misunderstood what you meant, or is it possibly the
>> case that things are not quite as you said?

Thanks for the code, Jann (including the demo of the CLONE_FILES
technique to pass the notification FD to the supervisor).

But I think your code just demonstrates what I described in
scenario A. So, it seems that I both understood what you
meant (because my code demonstrates the same thing) and
also misunderstood what you said (because I thought you
were meaning something more like scenario B).

I'm not sure if I should write anything about this small UAPI
breakage in BUGS, or not. Your thoughts?

> user@vm:~/test/seccomp-notify-interrupt$ cat seccomp-notify-interrupt.c
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <signal.h>
> #include <err.h>
> #include <errno.h>
> #include <unistd.h>
> #include <stdlib.h>
> #include <sched.h>
> #include <stddef.h>
> #include <limits.h>
> #include <sys/mman.h>
> #include <sys/syscall.h>
> #include <sys/prctl.h>
> #include <linux/seccomp.h>
> #include <linux/filter.h>
> #include <linux/futex.h>
> 
> struct {
>   int seccomp_fd;
> } *shared;
> 
> static void handle_signal(int sig, siginfo_t *info, void *uctx) {
>   printf("signal handler invoked\n");
> }
> 
> int main(void) {
>   setbuf(stdout, NULL);
> 
>   shared = mmap(NULL, 0x1000, PROT_READ|PROT_WRITE,
>                 MAP_ANONYMOUS|MAP_SHARED, -1, 0);
>   if (shared == MAP_FAILED)
>     err(1, "mmap");
>   shared->seccomp_fd = -1;
> 
>   /* glibc's clone() wrapper doesn't support fork()-style usage */
>   pid_t child = syscall(__NR_clone, CLONE_FILES|SIGCHLD,
>                         NULL, NULL, NULL, 0);
>   if (child == -1) err(1, "clone");
>   if (child == 0) {
>     /* don't outlive the parent */
>     prctl(PR_SET_PDEATHSIG, SIGKILL);
>     if (getppid() == 1) exit(0);
> 
>     prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
>     struct sock_filter insns[] = {
>       BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)),
>       BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_pause, 0, 1),
>       BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_USER_NOTIF),
>       BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW)
>     };
>     struct sock_fprog prog = {
>       .len = sizeof(insns)/sizeof(insns[0]),
>       .filter = insns
>     };
>     int seccomp_ret = syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
>                               SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
>     if (seccomp_ret < 0)
>       err(1, "install");
>     printf("installed seccomp: fd %d\n", seccomp_ret);
> 
>     __atomic_store(&shared->seccomp_fd, &seccomp_ret, __ATOMIC_RELEASE);
>     int futex_ret = syscall(__NR_futex, &shared->seccomp_fd, FUTEX_WAKE,
>                             INT_MAX, NULL, NULL, 0);
>     printf("woke %d waiters\n", futex_ret);
> 
>     struct sigaction act = {
>       .sa_sigaction = handle_signal,
>       .sa_flags = SA_RESTART|SA_SIGINFO
>     };
>     if (sigaction(SIGUSR1, &act, NULL))
>       err(1, "sigaction");
> 
>     pause();
>     perror("pause returned");
>     exit(0);
>   }
> 
>   int futex_ret = syscall(__NR_futex, &shared->seccomp_fd, FUTEX_WAIT,
>                           -1, NULL, NULL, 0);
>   if (futex_ret == -1 && errno != EAGAIN)
>     err(1, "futex wait");
>   int fd = __atomic_load_n(&shared->seccomp_fd, __ATOMIC_ACQUIRE);
>   printf("child installed seccomp fd %d\n", fd);
> 
>   sleep(1);
>   printf("going to send SIGUSR1...\n");
>   kill(child, SIGUSR1);
>   sleep(1);
> 
>   exit(0);
> }
> user@vm:~/test/seccomp-notify-interrupt$ gcc -o
> seccomp-notify-interrupt seccomp-notify-interrupt.c -Wall
> user@vm:~/test/seccomp-notify-interrupt$ strace -f
> ./seccomp-notify-interrupt >/dev/null
> execve("./seccomp-notify-interrupt", ["./seccomp-notify-interrupt"],
> 0x7ffcb31a0d08 /* 42 vars */) = 0
> brk(NULL)                               = 0x5565864b2000
> access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
> openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
> fstat(3, {st_mode=S_IFREG|0644, st_size=89296, ...}) = 0
> mmap(NULL, 89296, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f7e688e7000
> close(3)                                = 0
> openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
> read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260A\2\0\0\0\0\0"...,
> 832) = 832
> fstat(3, {st_mode=S_IFREG|0755, st_size=1824496, ...}) = 0
> mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
> 0) = 0x7f7e688e5000
> mmap(NULL, 1837056, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f7e68724000
> mprotect(0x7f7e68746000, 1658880, PROT_NONE) = 0
> mmap(0x7f7e68746000, 1343488, PROT_READ|PROT_EXEC,
> MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x22000) = 0x7f7e68746000
> mmap(0x7f7e6888e000, 311296, PROT_READ,
> MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x16a000) = 0x7f7e6888e000
> mmap(0x7f7e688db000, 24576, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b6000) = 0x7f7e688db000
> mmap(0x7f7e688e1000, 14336, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f7e688e1000
> close(3)                                = 0
> arch_prctl(ARCH_SET_FS, 0x7f7e688e6500) = 0
> mprotect(0x7f7e688db000, 16384, PROT_READ) = 0
> mprotect(0x556585183000, 4096, PROT_READ) = 0
> mprotect(0x7f7e68924000, 4096, PROT_READ) = 0
> munmap(0x7f7e688e7000, 89296)           = 0
> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, -1,
> 0) = 0x7f7e688fc000
> clone(child_stack=NULL, flags=CLONE_FILES|SIGCHLD) = 2558
> futex(0x7f7e688fc000, FUTEX_WAIT, 4294967295, NULLstrace: Process 2558 attached
>  <unfinished ...>
> [pid  2558] prctl(PR_SET_PDEATHSIG, SIGKILL) = 0
> [pid  2558] getppid()                   = 2557
> [pid  2558] prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) = 0
> [pid  2558] seccomp(SECCOMP_SET_MODE_FILTER, 0x8 /*
> SECCOMP_FILTER_FLAG_??? */, {len=4, filter=0x7ffdf7cc9b50}) = 3
> [pid  2558] write(1, "installed seccomp: fd 3\n", 24) = 24
> [pid  2558] futex(0x7f7e688fc000, FUTEX_WAKE, 2147483647 <unfinished ...>
> [pid  2557] <... futex resumed> )       = 0
> [pid  2558] <... futex resumed> )       = 1
> [pid  2558] write(1, "woke 1 waiters\n", 15) = 15
> [pid  2557] write(1, "child installed seccomp fd 3\n", 29) = 29
> [pid  2558] rt_sigaction(SIGUSR1, {sa_handler=0x556585181215,
> sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART|SA_SIGINFO,
> sa_restorer=0x7f7e6875b840}, NULL, 8) = 0
> [pid  2557] nanosleep({tv_sec=1, tv_nsec=0},  <unfinished ...>
> [pid  2558] pause( <unfinished ...>
> [pid  2557] <... nanosleep resumed> 0x7ffdf7cc9b10) = 0
> [pid  2557] write(1, "going to send SIGUSR1...", 24) = 24
> [pid  2557] write(1, "\n", 1)           = 1
> [pid  2557] kill(2558, SIGUSR1)         = 0
> [pid  2557] nanosleep({tv_sec=1, tv_nsec=0},  <unfinished ...>
> [pid  2558] <... pause resumed> )       = ? ERESTARTSYS (To be
> restarted if SA_RESTART is set)
> [pid  2558] --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER,
> si_pid=2557, si_uid=1000} ---
> [pid  2558] write(1, "signal handler invoked", 22) = 22
> [pid  2558] write(1, "\n", 1)           = 1
> [pid  2558] rt_sigreturn({mask=[]})     = 34
> [pid  2558] pause( <unfinished ...>
> [pid  2557] <... nanosleep resumed> 0x7ffdf7cc9b10) = 0
> [pid  2557] exit_group(0)               = ?
> [pid  2557] +++ exited with 0 +++
> <... pause resumed>)                    = ?
> +++ killed by SIGKILL +++
> user@vm:~/test/seccomp-notify-interrupt$

[...]

>>>>               In the above scenario, the risk is that the supervisor may
>>>>               try to access the memory of a process other than the  tar‐
>>>>               get.   This  race  can be avoided by following the call to
>>>>               open with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to ver‐
>>>>               ify  that  the  process that generated the notification is
>>>>               still alive.  (Note that  if  the  target  process  subse‐
>>>>               quently  terminates, its PID won't be reused because there
>>>
>>> That's wrong, the PID can be reused, but the /proc/$pid directory is
>>> internally not associated with the numeric PID, but, conceptually
>>> speaking, with a specific incarnation of the PID, or something like
>>> that. (Actually, it is associated with the "struct pid", which is not
>>> reused, instead of the numeric PID.)
>>
>> Thanks. I simplified the last sentence of the paragraph:
>>
>>               In  the above scenario, the risk is that the supervisor may
>>               try to access the memory of a process other than  the  tar‐
>>               get.   This  race  can  be avoided by following the call to
>>               open(2) with a  SECCOMP_IOCTL_NOTIF_ID_VALID  operation  to
>>               verify  that the process that generated the notification is
>>               still alive.  (Note that if the target terminates after the
>>               latter  step, a subsequent read(2) from the file descriptor
>>               will return 0, indicating end of file.)
>>
>> I think that's probably enough detail.
> 
> Maybe make that "may return 0" instead of "will return 0" - reading
> from /proc/$pid/mem can only return 0 in the following cases AFAICS:
> 
> 1. task->mm was already gone at open() time
> 2. mm->mm_users has dropped to zero (the mm only has lazytlb users;
>    page tables and VMAs are being blown away or have been blown away)
> 3. the syscall was called with length 0
> 
> When a process has gone away, normally mm->mm_users will drop to zero,
> but someone else could theoretically still be holding a reference to
> the mm (e.g. someone else in the middle of accessing /proc/$pid/mem).
> (Such references should normally not be very long-lived though.)
> 
> Additionally, in the unlikely case that the OOM killer just chomped
> through the page tables of the target process, I think the read will
> return -EIO (same error as if the address was simply unmapped) if the
> address is within a non-shared mapping. (Maybe that's something procfs
> could do better...)

Thanks for all the detail! I changed the text to say "may" 
instead of "will".
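
To make the ordering concrete, here is a minimal sketch of the
open/validate/read sequence described in that paragraph. The helper
names (open_target_mem(), notif_id_valid(), read_target_mem()) are
hypothetical and are not taken from the example program; the sketch
assumes kernel headers that define SECCOMP_IOCTL_NOTIF_ID_VALID.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/seccomp.h>

/* Step 1: open /proc/PID/mem of the target.
   Returns a file descriptor, or -1 with errno set. */
int
open_target_mem(pid_t pid)
{
    char path[64];

    snprintf(path, sizeof(path), "/proc/%ld/mem", (long) pid);
    return open(path, O_RDONLY | O_CLOEXEC);
}

/* Step 2: after the open, check that the notification is still live,
   so that the descriptor is known to refer to the target and not to
   some other process. Returns 1 if 'id' is still valid, 0 otherwise. */
int
notif_id_valid(int notify_fd, uint64_t id)
{
    return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == 0;
}

/* Step 3: read 'len' bytes at address 'addr' in the target.
   Returns > 0 (number of bytes read), 0 (end of file; the target may
   have gone away), or -1 with errno set (for example EIO). The caller
   must handle all three outcomes. */
ssize_t
read_target_mem(int mem_fd, uint64_t addr, void *buf, size_t len)
{
    assert(buf != NULL);

    return pread(mem_fd, buf, len, (off_t) addr);
}
```

A supervisor would call these in the order shown, giving up on the
notification if notif_id_valid() returns 0, and treating both a 0 and a
-1 return from read_target_mem() as "could not fetch the argument".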

> [...]
>>>> NOTES
>>>>        The file descriptor returned when seccomp(2) is employed with the
>>>>        SECCOMP_FILTER_FLAG_NEW_LISTENER  flag  can  be  monitored  using
>>>>        poll(2), epoll(7), and select(2).  When a notification  is  pend‐
>>>>        ing,  these interfaces indicate that the file descriptor is read‐
>>>>        able.
>>>
>>> We should probably also point out somewhere that, as
>>> include/uapi/linux/seccomp.h says:
>>>
>>>  * Similar precautions should be applied when stacking SECCOMP_RET_USER_NOTIF
>>>  * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the
>>>  * same syscall, the most recently added filter takes precedence. This means
>>>  * that the new SECCOMP_RET_USER_NOTIF filter can override any
>>>  * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all
>>
>> My takeaway from Christian's comments is that this comment in the kernel
>> source is partially wrong, since it is not possible to install multiple
>> filters with SECCOMP_RET_USER_NOTIF, right?
> 
> Yeah. (Well, AFAICS technically, you can add more filters that return
> SECCOMP_RET_USER_NOTIF, but when a filter returns that without having
> a notifier fd attached, seccomp blocks the syscall with -ENOSYS; it
> won't use the notifier fd attached to a different filter in the
> chain.)

Ah yes. I misspoke. I meant to say that only one filter can be installed
with SECCOMP_FILTER_FLAG_NEW_LISTENER (and that's what seccomp(2)
currently says). Also, I just checked, and I have already added the
detail about ENOSYS in seccomp(2).

       SECCOMP_RET_USER_NOTIF (since Linux 5.0)
              ...
              If there is no attached  supervisor  (either  because  the
              filter   was   not   installed   with   the   SECCOMP_FIL‐
              TER_FLAG_NEW_LISTENER flag or because the file  descriptor
              was  closed),  the  filter returns ENOSYS (similar to what
              happens when a filter returns SECCOMP_RET_TRACE and  there
              is  no  tracer).   See  seccomp_user_notif(2)  for further
              details.

[...]

>>>>            if (s == 0) {
>>>>                fprintf(stderr, "\tS: read() of /proc/PID/mem "
>>>>                        "returned 0 (EOF)\n");
>>>>                exit(EXIT_FAILURE);
>>>>            }
>>>>
>>>>            if (close(procMemFd) == -1)
>>>>                errExit("close-/proc/PID/mem");
>>>
>>> We should probably make sure here that the value we read is actually
>>> NUL-terminated?
>>
>> So, I was curious about that point also. But, (why) are we not
>> guaranteed that it will be NUL-terminated?
> 
> Because it's random memory filled by another process, which we don't
> necessarily trust. While seccomp notifiers aren't usable for applying
> *extra* security restrictions, the supervisor will still often be more
> privileged than the supervised process.

D'oh! Yes, I see that I failed my Security Engineering 101 exam.

How about:

    /* We have no guarantees about what was in the memory of the target
       process. Therefore, we ensure that 'path' is null-terminated. Such
       precautions are particularly important in cases where (as is
       common) the supervisor is running at a higher privilege level
       than the target. */

    // 'len' is size of buffer; 's' is return value from pread()
    int zeroIdx = len - 1;
    if (s < zeroIdx)
        zeroIdx = s;
    path[zeroIdx] = '\0';

Or just simply:

    path[len - 1] = '\0';

?
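
For comparison, the first variant can be sketched as a standalone
helper. terminate_untrusted() is a hypothetical name, not something in
the example program. The practical difference between the two variants
is that, after a short read, writing only path[len - 1] leaves the bytes
between the end of the data and the end of the buffer as whatever the
buffer held before the read (assuming it was not zeroed beforehand),
whereas this variant cuts the string off at the end of the data that was
actually read.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical helper: NUL-terminate 'buf' (capacity 'len' bytes) after
   a read that returned 'nread' bytes (0 <= nread <= len). The terminator
   goes right after the data that was read, or into the last byte of the
   buffer if the read filled it completely. Returns the index at which
   the terminator was written. */
size_t
terminate_untrusted(char *buf, size_t len, ssize_t nread)
{
    assert(buf != NULL && len > 0 && nread >= 0);

    size_t zero_idx = len - 1;
    if ((size_t) nread < zero_idx)
        zero_idx = (size_t) nread;
    buf[zero_idx] = '\0';
    return zero_idx;
}
```

Either way, the contents of the buffer remain untrusted input; the
terminator only guarantees that string functions stop inside the buffer.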

>>>>            /* Discover the sizes of the structures that are used to receive
>>>>               notifications and send notification responses, and allocate
>>>>               buffers of those sizes. */
>>>>
>>>>            if (seccomp(SECCOMP_GET_NOTIF_SIZES, 0, &sizes) == -1)
>>>>                errExit("\tS: seccomp-SECCOMP_GET_NOTIF_SIZES");
>>>>
>>>>            struct seccomp_notif *req = malloc(sizes.seccomp_notif);
>>>>            if (req == NULL)
>>>>                errExit("\tS: malloc");
>>>>
>>>>            struct seccomp_notif_resp *resp = malloc(sizes.seccomp_notif_resp);
>>>
>>> This should probably do something like max(sizes.seccomp_notif_resp,
>>> sizeof(struct seccomp_notif_resp)) in case the program was built
>>> against new UAPI headers that make struct seccomp_notif_resp big, but
>>> is running under an old kernel where that struct is still smaller?
>>
>> I'm confused. Why? I mean, if the running kernel says that it expects
>> a buffer of a certain size, and we allocate a buffer of that size,
>> what's the problem?
> 
> Because in userspace, we cast the result of malloc() to a "struct
> seccomp_notif_resp *". If the kernel tells us that it expects a size
> smaller than sizeof(struct seccomp_notif_resp), then we end up with a
> pointer to a struct that consists partly of allocated memory, partly
> of out-of-bounds memory, which is generally a bad idea - I'm not sure
> whether the C standard permits that. And if userspace then e.g.
> decides to access some member of that struct that is beyond what the
> kernel thinks is the struct size, we get actual OOB memory accesses.

Thanks. Got it. (But gosh, this seems like a fragile API mess.)

I added the following to the code:

    /* When allocating the response buffer, we must allow for the fact
       that the user-space binary may have been built with user-space
       headers where 'struct seccomp_notif_resp' is bigger than the
       response buffer expected by the (older) kernel. Therefore, we
       allocate a buffer that is the maximum of the two sizes. This
       ensures that if the supervisor places bytes into the response
       structure that are past the response size that the kernel expects,
       then the supervisor is not touching an invalid memory location. */

    size_t resp_size = sizes.seccomp_notif_resp;
    if (sizeof(struct seccomp_notif_resp) > resp_size)
        resp_size = sizeof(struct seccomp_notif_resp);

    struct seccomp_notif_resp *resp = malloc(resp_size);

Okay?

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

* Re: For review: seccomp_user_notif(2) manual page
  2020-10-16 18:29       ` Michael Kerrisk (man-pages)
@ 2020-10-17  0:25         ` Jann Horn
  2020-10-24 12:52           ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 52+ messages in thread
From: Jann Horn @ 2020-10-17  0:25 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Tycho Andersen, Sargun Dhillon, Kees Cook, Christian Brauner,
	linux-man, lkml, Aleksa Sarai, Alexei Starovoitov, Will Drewry,
	bpf, Song Liu, Daniel Borkmann, Andy Lutomirski,
	Linux Containers, Giuseppe Scrivano, Robert Sesek

On Fri, Oct 16, 2020 at 8:29 PM Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
> On 10/15/20 10:32 PM, Jann Horn wrote:
> > On Thu, Oct 15, 2020 at 1:24 PM Michael Kerrisk (man-pages)
> > <mtk.manpages@gmail.com> wrote:
> >> On 9/30/20 5:53 PM, Jann Horn wrote:
> >>> On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
> >>> <mtk.manpages@gmail.com> wrote:
> >>>> I knew it would be a big ask, but below is kind of the manual page
> >>>> I was hoping you might write [1] for the seccomp user-space notification
> >>>> mechanism. Since you didn't (and because 5.9 adds various new pieces
> >>>> such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD
> >>>> that also will need documenting [2]), I did :-). But of course I may
> >>>> have made mistakes...
> > [...]
> >>>>        3. The supervisor process will receive notification events on the
> >>>>           listening  file  descriptor.   These  events  are  returned as
> >>>>           structures of type seccomp_notif.  Because this structure  and
> >>>>           its  size may evolve over kernel versions, the supervisor must
> >>>>           first determine the size of  this  structure  using  the  sec‐
> >>>>           comp(2)  SECCOMP_GET_NOTIF_SIZES  operation,  which  returns a
> >>>>           structure of type seccomp_notif_sizes.  The  supervisor  allo‐
> >>>>           cates a buffer of size seccomp_notif_sizes.seccomp_notif bytes
> >>>>           to receive notification events.   In  addition, the supervisor
> >>>>           allocates  another  buffer  of  size  seccomp_notif_sizes.sec‐
> >>>>           comp_notif_resp  bytes  for  the  response  (a   struct   sec‐
> >>>>           comp_notif_resp  structure) that it will provide to the kernel
> >>>>           (and thus the target process).
> >>>>
> >>>>        4. The target process then performs its workload, which  includes
> >>>>           system  calls  that  will be controlled by the seccomp filter.
> >>>>           Whenever one of these system calls causes the filter to return
> >>>>           the  SECCOMP_RET_USER_NOTIF  action value, the kernel does not
> >>>>           execute the system call;  instead,  execution  of  the  target
> >>>>           process is temporarily blocked inside the kernel and a notifi‐
> >>>
> >>> where "blocked" refers to the interruptible, restartable kind - if the
> >>> child receives a signal with an SA_RESTART signal handler in the
> >>> meantime, it'll leave the syscall, go through the signal handler, then
> >>> restart the syscall again and send the same request to the supervisor
> >>> again. so the supervisor may see duplicate syscalls.
> >>
> >> So, I partially demonstrated what you describe here, for two example
> >> system calls (epoll_wait() and pause()). But I could not exactly
> >> demonstrate things as I understand you to be describing them. (So,
> >> I'm not sure whether I have not understood you correctly, or
> >> if things are not exactly as you describe them.)
> >>
> >> Here's a scenario (A) that I tested:
> >>
> >> 1. Target installs seccomp filters for a blocking syscall
> >>    (epoll_wait() or pause(), both of which should never restart,
> >>    regardless of SA_RESTART)
> >> 2. Target installs SIGINT handler with SA_RESTART
> >> 3. Supervisor is sleeping (i.e., is not blocked in
> >>    SECCOMP_IOCTL_NOTIF_RECV operation).
> >> 4. Target makes a blocking system call (epoll_wait() or pause()).
> >> 5. SIGINT gets delivered to target; handler gets called;
> >>    ***and syscall gets restarted by the kernel***
> >>
> >> That last should never happen, of course, and is a result of the
> >> combination of both the user-notify filter and the SA_RESTART flag.
> >> If one or other is not present, then the system call is not
> >> restarted.
> >>
> >> So, as you note below, the UAPI gets broken a little.
> >>
> >> However, from your description above I had understood that
> >> something like the following scenario (B) could occur:
> >>
> >> 1. Target installs seccomp filters for a blocking syscall
> >>    (epoll_wait() or pause(), both of which should never restart,
> >>    regardless of SA_RESTART)
> >> 2. Target installs SIGINT handler with SA_RESTART
> >> 3. Supervisor performs SECCOMP_IOCTL_NOTIF_RECV operation (which
> >>    blocks).
> >> 4. Target makes a blocking system call (epoll_wait() or pause()).
> >> 5. Supervisor gets seccomp user-space notification (i.e.,
> >>    SECCOMP_IOCTL_NOTIF_RECV ioctl() returns).
> >> 6. SIGINT gets delivered to target; handler gets called;
> >>    and syscall gets restarted by the kernel
> >> 7. Supervisor performs another SECCOMP_IOCTL_NOTIF_RECV operation
> >>    which gets another notification for the restarted system call.
> >>
> >> However, I don't observe such behavior. In step 6, the syscall
> >> does not get restarted by the kernel, but instead returns -1/EINTR.
> >> Perhaps I have misconstructed my experiment in the second case, or
> >> perhaps I've misunderstood what you meant, or is it possibly the
> >> case that things are not quite as you said?
>
> Thanks for the code, Jann (including the demo of the CLONE_FILES
> technique to pass the notification FD to the supervisor).
>
> But I think your code just demonstrates what I described in
> scenario A. So, it seems that I both understood what you
> meant (because my code demonstrates the same thing) and
> also misunderstood what you said (because I thought you
> were meaning something more like scenario B).

Ahh, sorry, I should've read your mail more carefully. Indeed, that
testcase only shows scenario A. But the following shows scenario B...



user@vm:~/test/seccomp-notify-interrupt$ cat seccomp-notify-interrupt-b.c
#define _GNU_SOURCE
#include <stdio.h>
#include <signal.h>
#include <err.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>
#include <sched.h>
#include <stddef.h>
#include <string.h>
#include <limits.h>
#include <inttypes.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <sys/prctl.h>
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <linux/futex.h>

struct {
  int seccomp_fd;
} *shared;

static void handle_signal(int sig, siginfo_t *info, void *uctx) {
  const char *msg = "signal handler invoked\n";
  write(1, msg, strlen(msg));
}

static size_t max_size(size_t a, size_t b) {
  return (a > b) ? a : b;
}

int main(void) {
  setbuf(stdout, NULL);

  shared = mmap(NULL, 0x1000, PROT_READ|PROT_WRITE,
                MAP_ANONYMOUS|MAP_SHARED, -1, 0);
  if (shared == MAP_FAILED)
    err(1, "mmap");
  shared->seccomp_fd = -1;

  /* glibc's clone() wrapper doesn't support fork()-style usage */
  pid_t child = syscall(__NR_clone, CLONE_FILES|SIGCHLD,
                        NULL, NULL, NULL, 0);
  if (child == -1) err(1, "clone");
  if (child == 0) {
    /* don't outlive the parent */
    prctl(PR_SET_PDEATHSIG, SIGKILL);
    if (getppid() == 1) exit(0);

    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
    struct sock_filter insns[] = {
      BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)),
      BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_pause, 0, 1),
      BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_USER_NOTIF),
      BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW)
    };
    struct sock_fprog prog = {
      .len = sizeof(insns)/sizeof(insns[0]),
      .filter = insns
    };
    int seccomp_ret = syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
                              SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
    if (seccomp_ret < 0)
      err(1, "install");
    printf("installed seccomp: fd %d\n", seccomp_ret);

    __atomic_store(&shared->seccomp_fd, &seccomp_ret, __ATOMIC_RELEASE);
    int futex_ret = syscall(__NR_futex, &shared->seccomp_fd, FUTEX_WAKE,
                            INT_MAX, NULL, NULL, 0);
    printf("woke %d waiters\n", futex_ret);

    struct sigaction act = {
      .sa_sigaction = handle_signal,
      .sa_flags = SA_RESTART|SA_SIGINFO
    };
    if (sigaction(SIGUSR1, &act, NULL))
      err(1, "sigaction");

    pause();
    perror("pause returned");
    exit(0);
  }

  int futex_ret = syscall(__NR_futex, &shared->seccomp_fd, FUTEX_WAIT,
                          -1, NULL, NULL, 0);
  if (futex_ret == -1 && errno != EAGAIN)
    err(1, "futex wait");
  int fd = __atomic_load_n(&shared->seccomp_fd, __ATOMIC_ACQUIRE);
  printf("child installed seccomp fd %d\n", fd);

  struct seccomp_notif_sizes sizes;
  if (syscall(__NR_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, &sizes))
    err(1, "notif_sizes");
  struct seccomp_notif *notif = malloc(max_size(
    sizeof(struct seccomp_notif),
    sizes.seccomp_notif
  ));
  if (!notif)
    err(1, "malloc");
  for (int i=0; i<4; i++) {
    memset(notif, '\0', sizes.seccomp_notif);
    if (ioctl(fd, SECCOMP_IOCTL_NOTIF_RECV, notif))
      err(1, "notif_recv");
    printf("got notif: id=%" PRIu64 " pid=%u nr=%d\n",
           notif->id, notif->pid, notif->data.nr);
    sleep(1);
    printf("going to send SIGUSR1...\n");
    kill(child, SIGUSR1);
  }
  sleep(1);

  exit(0);
}
user@vm:~/test/seccomp-notify-interrupt$ gcc -o seccomp-notify-interrupt-b seccomp-notify-interrupt-b.c
user@vm:~/test/seccomp-notify-interrupt$ ./seccomp-notify-interrupt-b
installed seccomp: fd 3
woke 1 waiters
child installed seccomp fd 3
got notif: id=4490537653766950251 pid=2641 nr=34
going to send SIGUSR1...
signal handler invoked
got notif: id=4490537653766950252 pid=2641 nr=34
going to send SIGUSR1...
signal handler invoked
got notif: id=4490537653766950253 pid=2641 nr=34
going to send SIGUSR1...
signal handler invoked
got notif: id=4490537653766950254 pid=2641 nr=34
going to send SIGUSR1...
signal handler invoked
user@vm:~/test/seccomp-notify-interrupt$



> I'm not sure if I should write anything about this small UAPI
> breakage in BUGS, or not. Your thoughts?

Thinking about it a bit more: Any code that relies on pause() or
epoll_wait() not restarting is buggy anyway, right? Because a signal
could also arrive directly before entering the syscall, while
userspace code is still executing? So one could argue that we're just
enlarging a preexisting race. (Unless the signal handler checks the
interrupted register state to figure out whether we already entered
syscall handling?)

If userspace relies on non-restarting behavior, it should be using
something like epoll_pwait(). And that stuff only unblocks signals
after we've already passed the seccomp checks on entry. (I guess this
also means that anything that uses pause() properly effectively has to
either run pause() in a loop with nothing else [iow, not care whether
pause() restarts] or siglongjmp() out of the signal handler [iow,
unwind through the signal frame]?)

So we should probably document the restarting behavior as something
the supervisor has to deal with in the manpage; but for the
"non-restarting syscalls can restart from the target's perspective"
aspect, it might be enough to document this as quirky behavior that
can't actually break correct code? (Or not document it at all. Dunno.)

[...]
> >>>>            if (s == 0) {
> >>>>                fprintf(stderr, "\tS: read() of /proc/PID/mem "
> >>>>                        "returned 0 (EOF)\n");
> >>>>                exit(EXIT_FAILURE);
> >>>>            }
> >>>>
> >>>>            if (close(procMemFd) == -1)
> >>>>                errExit("close-/proc/PID/mem");
> >>>
> >>> We should probably make sure here that the value we read is actually
> >>> NUL-terminated?
> >>
> >> So, I was curious about that point also. But, (why) are we not
> >> guaranteed that it will be NUL-terminated?
> >
> > Because it's random memory filled by another process, which we don't
> > necessarily trust. While seccomp notifiers aren't usable for applying
> > *extra* security restrictions, the supervisor will still often be more
> > privileged than the supervised process.
>
> D'oh! Yes, I see that I failed my Security Engineering 101 exam.
>
> How about:
>
>     /* We have no guarantees about what was in the memory of the target
>        process. Therefore, we ensure that 'path' is null-terminated. Such
>        precautions are particularly important in cases where (as is
>        common) the supervisor is running at a higher privilege level
>        than the target. */
>
>     // 'len' is size of buffer; 's' is return value from pread()
>     int zeroIdx = len - 1;
>     if (s < zeroIdx)
>         zeroIdx = s;
>     path[zeroIdx] = '\0';
>
> Or just simply:
>
>     path[len - 1] = '\0';
>
> ?

I'd either do "path[s-1] = '\0'" or bail out if "path[s - 1] != '\0'".
Especially if we haven't NUL-terminated the buffer before reading into
it, we shouldn't write a nullbyte to path[len - 1], since the bytes in
front of that will stay uninitialized.

(Oh, by the way: In general, reading path buffers like this (with the
read potentially going beyond the end of the actual buffer) can
have... interesting interactions with userfaultfd. If the path is
stored in one page, starting at a non-zero offset inside the page, our
read will always overlap into the second page. That second page might
belong to a completely different VMA. If that VMA has a userfaultfd
handler, we'll take a userfaultfd fault and wait for the userfaultfd
handler to service the fault. Normally that's fine-ish; but if the
target thread is supposed to *be* the thread handling userfaultfd
faults in its process (and it never intentionally accesses any
userfaultfd regions, only other threads do that), userspace will
deadlock, because the thread waiting for userfaultfd fault resolution
is the same one that's blocked on the userfaultfd. But this is not
special to seccomp; there are syscalls that do the same thing,
although their over-reads are typically smaller. E.g.
do_strncpy_from_user() over-reads by up to 7 bytes. But when this came
up in a discussion with Linus Torvalds, he said it was a theoretical
concern; so I guess if the kernel seems fine with doing that in
practice, we probably don't care too much here either.)

> >>>>            /* Discover the sizes of the structures that are used to receive
> >>>>               notifications and send notification responses, and allocate
> >>>>               buffers of those sizes. */
> >>>>
> >>>>            if (seccomp(SECCOMP_GET_NOTIF_SIZES, 0, &sizes) == -1)
> >>>>                errExit("\tS: seccomp-SECCOMP_GET_NOTIF_SIZES");
> >>>>
> >>>>            struct seccomp_notif *req = malloc(sizes.seccomp_notif);
> >>>>            if (req == NULL)
> >>>>                errExit("\tS: malloc");
> >>>>
> >>>>            struct seccomp_notif_resp *resp = malloc(sizes.seccomp_notif_resp);
> >>>
> >>> This should probably do something like max(sizes.seccomp_notif_resp,
> >>> sizeof(struct seccomp_notif_resp)) in case the program was built
> >>> against new UAPI headers that make struct seccomp_notif_resp big, but
> >>> is running under an old kernel where that struct is still smaller?
> >>
> >> I'm confused. Why? I mean, if the running kernel says that it expects
> >> a buffer of a certain size, and we allocate a buffer of that size,
> >> what's the problem?
> >
> > Because in userspace, we cast the result of malloc() to a "struct
> > seccomp_notif_resp *". If the kernel tells us that it expects a size
> > smaller than sizeof(struct seccomp_notif_resp), then we end up with a
> > pointer to a struct that consists partly of allocated memory, partly
> > of out-of-bounds memory, which is generally a bad idea - I'm not sure
> > whether the C standard permits that. And if userspace then e.g.
> > decides to access some member of that struct that is beyond what the
> > kernel thinks is the struct size, we get actual OOB memory accesses.
>
> Thanks. Got it. (But gosh, this seems like a fragile API mess.)
>
> I added the following to the code:
>
>     /* When allocating the response buffer, we must allow for the fact
>        that the user-space binary may have been built with user-space
>        headers where 'struct seccomp_notif_resp' is bigger than the
>        response buffer expected by the (older) kernel. Therefore, we
>        allocate a buffer that is the maximum of the two sizes. This
>        ensures that if the supervisor places bytes into the response
>        structure that are past the response size that the kernel expects,
>        then the supervisor is not touching an invalid memory location. */
>
>     size_t resp_size = sizes.seccomp_notif_resp;
>     if (sizeof(struct seccomp_notif_resp) > resp_size)
>         resp_size = sizeof(struct seccomp_notif_resp);
>
>     struct seccomp_notif_resp *resp = malloc(resp_size);
>
> Okay?

Looks good.


* Re: For review: seccomp_user_notif(2) manual page
  2020-10-17  0:25         ` Jann Horn
@ 2020-10-24 12:52           ` Michael Kerrisk (man-pages)
  2020-10-26  9:32             ` Jann Horn
  0 siblings, 1 reply; 52+ messages in thread
From: Michael Kerrisk (man-pages) @ 2020-10-24 12:52 UTC (permalink / raw)
  To: Jann Horn
  Cc: mtk.manpages, Tycho Andersen, Sargun Dhillon, Kees Cook,
	Christian Brauner, linux-man, lkml, Aleksa Sarai,
	Alexei Starovoitov, Will Drewry, bpf, Song Liu, Daniel Borkmann,
	Andy Lutomirski, Linux Containers, Giuseppe Scrivano,
	Robert Sesek

Hello Jann,

On 10/17/20 2:25 AM, Jann Horn wrote:
> On Fri, Oct 16, 2020 at 8:29 PM Michael Kerrisk (man-pages)
> <mtk.manpages@gmail.com> wrote:
>> On 10/15/20 10:32 PM, Jann Horn wrote:
>>> On Thu, Oct 15, 2020 at 1:24 PM Michael Kerrisk (man-pages)
>>> <mtk.manpages@gmail.com> wrote:
>>>> On 9/30/20 5:53 PM, Jann Horn wrote:
>>>>> On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
>>>>> <mtk.manpages@gmail.com> wrote:
>>>>>> I knew it would be a big ask, but below is kind of the manual page
>>>>>> I was hoping you might write [1] for the seccomp user-space notification
>>>>>> mechanism. Since you didn't (and because 5.9 adds various new pieces
>>>>>> such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD
>>>>>> that also will need documenting [2]), I did :-). But of course I may
>>>>>> have made mistakes...
>>> [...]
>>>>>>        3. The supervisor process will receive notification events on the
>>>>>>           listening  file  descriptor.   These  events  are  returned as
>>>>>>           structures of type seccomp_notif.  Because this structure  and
>>>>>>           its  size may evolve over kernel versions, the supervisor must
>>>>>>           first determine the size of  this  structure  using  the  sec‐
>>>>>>           comp(2)  SECCOMP_GET_NOTIF_SIZES  operation,  which  returns a
>>>>>>           structure of type seccomp_notif_sizes.  The  supervisor  allo‐
>>>>>>           cates a buffer of size seccomp_notif_sizes.seccomp_notif bytes
>>>>>>           to receive notification events.  In  addition,  the  supervisor
>>>>>>           allocates  another  buffer  of  size  seccomp_notif_sizes.sec‐
>>>>>>           comp_notif_resp  bytes  for  the  response  (a   struct   sec‐
>>>>>>           comp_notif_resp  structure) that it will provide to the kernel
>>>>>>           (and thus the target process).
>>>>>>
>>>>>>        4. The target process then performs its workload, which  includes
>>>>>>           system  calls  that  will be controlled by the seccomp filter.
>>>>>>           Whenever one of these system calls causes the filter to return
>>>>>>           the  SECCOMP_RET_USER_NOTIF  action value, the kernel does not
>>>>>>           execute the system call;  instead,  execution  of  the  target
>>>>>>           process is temporarily blocked inside the kernel and a notifi‐
>>>>>
>>>>> where "blocked" refers to the interruptible, restartable kind - if the
>>>>> child receives a signal with an SA_RESTART signal handler in the
>>>>> meantime, it'll leave the syscall, go through the signal handler, then
>>>>> restart the syscall again and send the same request to the supervisor
>>>>> again. so the supervisor may see duplicate syscalls.
>>>>
>>>> So, I partially demonstrated what you describe here, for two example
>>>> system calls (epoll_wait() and pause()). But I could not exactly
>>>> demonstrate things as I understand you to be describing them. (So,
>>>> I'm not sure whether I have not understood you correctly, or
>>>> if things are not exactly as you describe them.)
>>>>
>>>> Here's a scenario (A) that I tested:
>>>>
>>>> 1. Target installs seccomp filters for a blocking syscall
>>>>    (epoll_wait() or pause(), both of which should never restart,
>>>>    regardless of SA_RESTART)
>>>> 2. Target installs SIGINT handler with SA_RESTART
>>>> 3. Supervisor is sleeping (i.e., is not blocked in
>>>>    SECCOMP_IOCTL_NOTIF_RECV operation).
>>>> 4. Target makes a blocking system call (epoll_wait() or pause()).
>>>> 5. SIGINT gets delivered to target; handler gets called;
>>>>    ***and syscall gets restarted by the kernel***
>>>>
>>>> That last should never happen, of course, and is a result of the
>>>> combination of both the user-notify filter and the SA_RESTART flag.
>>>> If one or other is not present, then the system call is not
>>>> restarted.
>>>>
>>>> So, as you note below, the UAPI gets broken a little.
>>>>
>>>> However, from your description above I had understood that
>>>> something like the following scenario (B) could occur:
>>>>
>>>> 1. Target installs seccomp filters for a blocking syscall
>>>>    (epoll_wait() or pause(), both of which should never restart,
>>>>    regardless of SA_RESTART)
>>>> 2. Target installs SIGINT handler with SA_RESTART
>>>> 3. Supervisor performs SECCOMP_IOCTL_NOTIF_RECV operation (which
>>>>    blocks).
>>>> 4. Target makes a blocking system call (epoll_wait() or pause()).
>>>> 5. Supervisor gets seccomp user-space notification (i.e.,
>>>>    SECCOMP_IOCTL_NOTIF_RECV ioctl() returns)
>>>> 6. SIGINT gets delivered to target; handler gets called;
>>>>    and syscall gets restarted by the kernel
>>>> 7. Supervisor performs another SECCOMP_IOCTL_NOTIF_RECV operation
>>>>    which gets another notification for the restarted system call.
>>>>
>>>> However, I don't observe such behavior. In step 6, the syscall
>>>> does not get restarted by the kernel, but instead returns -1/EINTR.
>>>> Perhaps I have misconstructed my experiment in the second case, or
>>>> perhaps I've misunderstood what you meant, or is it possibly the
>>>> case that things are not quite as you said?
>>
>> Thanks for the code, Jann (including the demo of the CLONE_FILES
>> technique to pass the notification FD to the supervisor).
>>
>> But I think your code just demonstrates what I described in
>> scenario A. So, it seems that I both understood what you
>> meant (because my code demonstrates the same thing) and
>> also misunderstood what you said (because I thought you
>> were meaning something more like scenario B).
> 
> Ahh, sorry, I should've read your mail more carefully. Indeed, that
> testcase only shows scenario A. But the following shows scenario B...
> 
> user@vm:~/test/seccomp-notify-interrupt$ cat seccomp-notify-interrupt-b.c
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <signal.h>
> #include <err.h>
> #include <errno.h>
> #include <unistd.h>
> #include <stdlib.h>
> #include <sched.h>
> #include <stddef.h>
> #include <string.h>
> #include <limits.h>
> #include <inttypes.h>
> #include <sys/mman.h>
> #include <sys/syscall.h>
> #include <sys/ioctl.h>
> #include <sys/prctl.h>
> #include <linux/seccomp.h>
> #include <linux/filter.h>
> #include <linux/futex.h>
> 
> struct {
>   int seccomp_fd;
> } *shared;
> 
> static void handle_signal(int sig, siginfo_t *info, void *uctx) {
>   const char *msg = "signal handler invoked\n";
>   write(1, msg, strlen(msg));
> }
> 
> static size_t max_size(size_t a, size_t b) {
>   return (a > b) ? a : b;
> }
> 
> int main(void) {
>   setbuf(stdout, NULL);
> 
>   shared = mmap(NULL, 0x1000, PROT_READ|PROT_WRITE,
>                 MAP_ANONYMOUS|MAP_SHARED, -1, 0);
>   if (shared == MAP_FAILED)
>     err(1, "mmap");
>   shared->seccomp_fd = -1;
> 
>   /* glibc's clone() wrapper doesn't support fork()-style usage */
>   pid_t child = syscall(__NR_clone, CLONE_FILES|SIGCHLD,
>                         NULL, NULL, NULL, 0);
>   if (child == -1) err(1, "clone");
>   if (child == 0) {
>     /* don't outlive the parent */
>     prctl(PR_SET_PDEATHSIG, SIGKILL);
>     if (getppid() == 1) exit(0);
> 
>     prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
>     struct sock_filter insns[] = {
>       BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)),
>       BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_pause, 0, 1),
>       BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_USER_NOTIF),
>       BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW)
>     };
>     struct sock_fprog prog = {
>       .len = sizeof(insns)/sizeof(insns[0]),
>       .filter = insns
>     };
>     int seccomp_ret = syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
>                               SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
>     if (seccomp_ret < 0)
>       err(1, "install");
>     printf("installed seccomp: fd %d\n", seccomp_ret);
> 
>     __atomic_store(&shared->seccomp_fd, &seccomp_ret, __ATOMIC_RELEASE);
>     int futex_ret = syscall(__NR_futex, &shared->seccomp_fd, FUTEX_WAKE,
>                             INT_MAX, NULL, NULL, 0);
>     printf("woke %d waiters\n", futex_ret);
> 
>     struct sigaction act = {
>       .sa_sigaction = handle_signal,
>       .sa_flags = SA_RESTART|SA_SIGINFO
>     };
>     if (sigaction(SIGUSR1, &act, NULL))
>       err(1, "sigaction");
> 
>     pause();
>     perror("pause returned");
>     exit(0);
>   }
> 
>   int futex_ret = syscall(__NR_futex, &shared->seccomp_fd, FUTEX_WAIT,
>                           -1, NULL, NULL, 0);
>   if (futex_ret == -1 && errno != EAGAIN)
>     err(1, "futex wait");
>   int fd = __atomic_load_n(&shared->seccomp_fd, __ATOMIC_ACQUIRE);
>   printf("child installed seccomp fd %d\n", fd);
> 
>   struct seccomp_notif_sizes sizes;
>   if (syscall(__NR_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, &sizes))
>     err(1, "notif_sizes");
>   struct seccomp_notif *notif = malloc(max_size(
>     sizeof(struct seccomp_notif),
>     sizes.seccomp_notif
>   ));
>   if (!notif)
>     err(1, "malloc");
>   for (int i=0; i<4; i++) {
>     memset(notif, '\0', sizes.seccomp_notif);
>     if (ioctl(fd, SECCOMP_IOCTL_NOTIF_RECV, notif))
>       err(1, "notif_recv");
>     printf("got notif: id=%" PRIu64 " pid=%u nr=%d\n",
>            notif->id, notif->pid, notif->data.nr);
>     sleep(1);
>     printf("going to send SIGUSR1...\n");
>     kill(child, SIGUSR1);
>   }
>   sleep(1);
> 
>   exit(0);
> }
> user@vm:~/test/seccomp-notify-interrupt$ gcc -o seccomp-notify-interrupt-b seccomp-notify-interrupt-b.c
> user@vm:~/test/seccomp-notify-interrupt$ ./seccomp-notify-interrupt-b
> installed seccomp: fd 3
> woke 1 waiters
> child installed seccomp fd 3
> got notif: id=4490537653766950251 pid=2641 nr=34
> going to send SIGUSR1...
> signal handler invoked
> got notif: id=4490537653766950252 pid=2641 nr=34
> going to send SIGUSR1...
> signal handler invoked
> got notif: id=4490537653766950253 pid=2641 nr=34
> going to send SIGUSR1...
> signal handler invoked
> got notif: id=4490537653766950254 pid=2641 nr=34
> going to send SIGUSR1...
> signal handler invoked
> user@vm:~/test/seccomp-notify-interrupt$

Thanks for that! Clearly I must have messed up something when
I tried to construct the code to test that scenario.

>> I'm not sure if I should write anything about this small UAPI
>> breakage in BUGS, or not. Your thoughts?
> 
> Thinking about it a bit more: Any code that relies on pause() or
> epoll_wait() not restarting is buggy anyway, right? Because a signal
> could also arrive directly before entering the syscall, while
> userspace code is still executing? So one could argue that we're just
> enlarging a preexisting race. (Unless the signal handler checks the
> interrupted register state to figure out whether we already entered
> syscall handling?)

Yes, that all makes sense.

> If userspace relies on non-restarting behavior, it should be using
> something like epoll_pwait(). And that stuff only unblocks signals
> after we've already passed the seccomp checks on entry.

Thanks for elaborating that detail, since as soon as you talked 
about "enlarging a preexisting race" above, I immediately wondered
about sigsuspend(), pselect(), etc.

(Mind you, I still wonder about the effect on system calls that
are normally nonrestartable because they have timeouts. My
understanding is that the kernel doesn't restart those system
calls because it's impossible for the kernel to restart the call
with the right timeout value. I wonder what happens when those
system calls are restarted in the scenario we're discussing.)

Anyway, returning to your point... So, to be clear (and to
quickly remind myself in case I one day reread this thread),
there is not a problem with sigsuspend(), pselect(), ppoll(),
and epoll_pwait() since:

* Before the syscall, signals are blocked in the target.
* Inside the syscall, signals are still blocked at the time 
  the check is made for seccomp filters.
* If a seccomp user-space notification event kicks in, the target
  is put to sleep with the signals still blocked.
* The signal will only get delivered after the supervisor either
  triggers a spoofed success/failure return in the target or the
  supervisor sends a CONTINUE response to the kernel telling it
  to execute the target's system call. Either way, there won't be
  any restarting of the target's system call (and the supervisor
  thus won't see multiple notifications).

(Right?)
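
As a rough illustration of the pattern that the list above relies on
(the signal is blocked before the call and is only unblocked atomically
inside it), here is a minimal sketch using sigsuspend(). The names
wait_for_signal_race_free(), on_signal() and got_signal are made up for
this illustration and do not come from the example program in the page.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <stddef.h>

/* Hypothetical flag set by the handler; not part of the manual page. */
static volatile sig_atomic_t got_signal;

static void on_signal(int sig)
{
    (void) sig;
    got_signal = 1;
}

/* Wait until 'sig' has been handled, without the window that a plain
   pause() leaves open. Returns 0 on success, -1 on a setup failure. */
static int wait_for_signal_race_free(int sig)
{
    struct sigaction act;
    sigset_t block, orig, wait_mask;

    act.sa_handler = on_signal;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;
    if (sigaction(sig, &act, NULL) == -1)
        return -1;

    /* Block the signal first: if it arrives from here on, it stays
       pending instead of running the handler before we start waiting. */
    sigemptyset(&block);
    sigaddset(&block, sig);
    if (sigprocmask(SIG_BLOCK, &block, &orig) == -1)
        return -1;

    /* sigsuspend() installs 'wait_mask' and sleeps in one atomic step,
       so the signal can only be delivered once we are inside the call. */
    wait_mask = orig;
    sigdelset(&wait_mask, sig);
    while (!got_signal)
        sigsuspend(&wait_mask);     /* always returns -1 with EINTR */

    return sigprocmask(SIG_SETMASK, &orig, NULL);
}
```

With a seccomp user-notification filter on sigsuspend(), the target
would, per the reasoning above, sit in the notification wait with the
signal still blocked, so the supervisor should see a single
notification for the call.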

> (I guess this
> also means that anything that uses pause() properly effectively has to
> either run pause() in a loop with nothing else [iow, not care whether
> pause() restarts] or siglongjmp() out of the signal handler [iow,
> unwind through the signal frame]?)

Yes, that's my understanding. Simple pause() (vs sigsuspend())
is always racy.

> So we should probably document the restarting behavior as something
> the supervisor has to deal with in the manpage; but for the
> "non-restarting syscalls can restart from the target's perspective"
> aspect, it might be enough to document this as quirky behavior that
> can't actually break correct code? (Or not document it at all. Dunno.)

So, I've added the following to the page:

   Interaction with SA_RESTART signal handlers
       Consider the following scenario:

       · The target process has used sigaction(2)  to  install  a  signal
         handler with the SA_RESTART flag.

       · The target has made a system call that triggered a seccomp user-
         space notification and the target is currently blocked until the
         supervisor sends a notification response.

       · A  signal  is  delivered to the target and the signal handler is
         executed.

       · When  (if)  the  supervisor  attempts  to  send  a  notification
         response,  the  SECCOMP_IOCTL_NOTIF_SEND ioctl(2) operation will
         fail with the ENOENT error.

       In this scenario, the kernel  will  restart  the  target's  system
       call.   Consequently,  the  supervisor  will receive another user-
       space notification.  Thus, depending on how many times the blocked
       system call is interrupted by a signal handler, the supervisor may
       receive multiple notifications for the same  system  call  in  the
       target.

       One  oddity  is  that  system call restarting as described in this
       scenario will occur even for the blocking system calls  listed  in
       signal(7) that would never normally be restarted by the SA_RESTART
       flag.

Does that seem okay?

In addition, I've queued a cross-reference in signal(7):

       In certain circumstances, the seccomp(2) user-space notifi‐
       cation  feature can lead to restarting of system calls that
       would otherwise  never  be  restarted  by  SA_RESTART;  for
       details, see seccomp_user_notif(2).
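
On the supervisor side, the SA_RESTART scenario described above means
that an ENOENT failure from SECCOMP_IOCTL_NOTIF_SEND is not fatal: the
notification is simply gone, and a new one may follow for the restarted
system call. A minimal sketch of how a supervisor might classify the
outcome is shown below; the enum and the function name
send_notif_response() are hypothetical and not part of the page's
example program.

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>

/* Hypothetical result codes, for illustration only. */
enum send_result {
    SEND_OK,            /* response was delivered to the target */
    SEND_TARGET_GONE,   /* ENOENT: the notification is no longer valid */
    SEND_ERROR          /* any other failure */
};

static enum send_result
send_notif_response(int notify_fd, struct seccomp_notif_resp *resp)
{
    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, resp) == 0)
        return SEND_OK;

    /* ENOENT: the blocked system call was interrupted by a signal
       handler (or the target terminated). The supervisor should drop
       this request and go back to SECCOMP_IOCTL_NOTIF_RECV, where the
       restarted system call may show up as a new notification. */
    if (errno == ENOENT)
        return SEND_TARGET_GONE;

    return SEND_ERROR;
}
```

A supervisor loop would typically treat SEND_TARGET_GONE as "nothing to
do" and carry on, and reserve error reporting for SEND_ERROR.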

> [...]
>>>>>>            if (s == 0) {
>>>>>>                fprintf(stderr, "\tS: read() of /proc/PID/mem "
>>>>>>                        "returned 0 (EOF)\n");
>>>>>>                exit(EXIT_FAILURE);
>>>>>>            }
>>>>>>
>>>>>>            if (close(procMemFd) == -1)
>>>>>>                errExit("close-/proc/PID/mem");
>>>>>
>>>>> We should probably make sure here that the value we read is actually
>>>>> NUL-terminated?
>>>>
>>>> So, I was curious about that point also. But, (why) are we not
>>>> guaranteed that it will be NUL-terminated?
>>>
>>> Because it's random memory filled by another process, which we don't
>>> necessarily trust. While seccomp notifiers aren't usable for applying
>>> *extra* security restrictions, the supervisor will still often be more
>>> privileged than the supervised process.
>>
>> D'oh! Yes, I see that I failed my Security Engineering 101 exam.
>>
>> How about:
>>
>>     /* We have no guarantees about what was in the memory of the target
>>        process. Therefore, we ensure that 'path' is null-terminated. Such
>>        precautions are particularly important in cases where (as is
>>        common) the supervisor is running at a higher privilege level
>>        than the target. */
>>
>>     // 'len' is size of buffer; 's' is return value from pread()
>>     int zeroIdx = len - 1;
>>     if (s < zeroIdx)
>>         zeroIdx = s;
>>     path[zeroIdx] = '\0';
>>
>> Or just simply:
>>
>>     path[len - 1] = '\0';
>>
>> ?
> 
> I'd either do "path[s-1] = '\0'" or bail out if "path[s - 1] != '\0'".
> Especially if we haven't NUL-terminated the buffer before reading into
> it, we shouldn't write a nullbyte to path[len - 1], since the bytes in
> front of that will stay uninitialized.

I realized by the way that I made a thinko. In the usual case,
read(fd, buf, PATH_MAX) will return PATH_MAX bytes that include
trailing garbage after the pathname. So the right check is I think
to scan from the start of the buffer to see if there's a NUL, and
error if there is not, and that's how I have modified the example
program.
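
A minimal sketch of the check described above, assuming 'buf' holds the
bytes returned by the read from /proc/PID/mem and 'nread' is the return
value of that read. The helper name path_is_terminated() is made up for
this illustration; the example program in the page may do this
differently.

```c
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>

/* Return true only if a NUL byte occurs within the 'nread' bytes that
   were actually read. Bytes beyond 'nread' are uninitialized and are
   never examined. */
static bool path_is_terminated(const char *buf, ssize_t nread)
{
    if (nread <= 0)
        return false;

    return memchr(buf, '\0', (size_t) nread) != NULL;
}
```

If the function returns false, the supervisor should treat the pathname
as invalid rather than writing a terminator itself, since the target
controls the contents of that memory.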

> (Oh, by the way: In general, reading path buffers like this (with the
> read potentially going beyond the end of the actual buffer) can
> have... interesting interactions with userfaultfd. If the path is
> stored in one page, starting at a non-zero offset inside the page, our
> read will always overlap into the second page. That second page might
> belong to a completely different VMA. If that VMA has a userfaultfd
> handler, we'll take a userfaultfd fault and wait for the userfaultfd
> handler to service the fault. Normally that's fine-ish; but if the
> target thread is supposed to *be* the thread handling userfaultfd
> faults in its process (and it never intentionally accesses any
> userfaultfd regions, only other threads do that), userspace will
> deadlock, because the thread waiting for userfaultfd fault resolution
> is the same one that's blocked on the userfaultfd. But this is not
> special to seccomp; there are syscalls that do the same thing,
> although their over-reads are typically smaller. E.g.
> do_strncpy_from_user() over-reads by up to 7 bytes. But when this came
> up in a discussion with Linus Torvalds, he said it was a theoretical
> concern; so I guess if the kernel seems fine with doing that in
> practice, we probably don't care too much here either.)

Thanks for the background info. Indeed there are some 
bizarre corner cases...

[...]

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


* Re: For review: seccomp_user_notif(2) manual page
  2020-10-01  2:14             ` Jann Horn
@ 2020-10-25 16:31               ` Michael Kerrisk (man-pages)
  2020-10-26 15:54                 ` Jann Horn
  0 siblings, 1 reply; 52+ messages in thread
From: Michael Kerrisk (man-pages) @ 2020-10-25 16:31 UTC (permalink / raw)
  To: Jann Horn, Tycho Andersen
  Cc: mtk.manpages, Sargun Dhillon, Kees Cook, Christian Brauner,
	linux-man, lkml, Aleksa Sarai, Alexei Starovoitov, Will Drewry,
	bpf, Song Liu, Daniel Borkmann, Andy Lutomirski,
	Linux Containers, Giuseppe Scrivano, Robert Sesek

Hi Jann,

On 10/1/20 4:14 AM, Jann Horn wrote:
> On Thu, Oct 1, 2020 at 3:52 AM Jann Horn <jannh@google.com> wrote:
>> On Thu, Oct 1, 2020 at 1:25 AM Tycho Andersen <tycho@tycho.pizza> wrote:
>>> On Thu, Oct 01, 2020 at 01:11:33AM +0200, Jann Horn wrote:
>>>> On Thu, Oct 1, 2020 at 1:03 AM Tycho Andersen <tycho@tycho.pizza> wrote:
>>>>> On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) wrote:
>>>>>> On 9/30/20 5:03 PM, Tycho Andersen wrote:
>>>>>>> On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
>>>>>>>>        ┌─────────────────────────────────────────────────────┐
>>>>>>>>        │FIXME                                                │
>>>>>>>>        ├─────────────────────────────────────────────────────┤
>>>>>>>>        │From my experiments,  it  appears  that  if  a  SEC‐ │
>>>>>>>>        │COMP_IOCTL_NOTIF_RECV   is  done  after  the  target │
>>>>>>>>        │process terminates, then the ioctl()  simply  blocks │
>>>>>>>>        │(rather than returning an error to indicate that the │
>>>>>>>>        │target process no longer exists).                    │
>>>>>>>
>>>>>>> Yeah, I think Christian wanted to fix this at some point,
>>>>>>
>>>>>> Do you have a pointer to that discussion? I could not find it with a
>>>>>> quick search.
>>>>>>
>>>>>>> but it's a
>>>>>>> bit sticky to do.
>>>>>>
>>>>>> Can you say a few words about the nature of the problem?
>>>>>
>>>>> I remembered wrong, it's actually in the tree: 99cdb8b9a573 ("seccomp:
>>>>> notify about unused filter"). So maybe there's a bug here?
>>>>
>>>> That thing only notifies on ->poll, it doesn't unblock ioctls; and
>>>> Michael's sample code uses SECCOMP_IOCTL_NOTIF_RECV to wait. So that
>>>> commit doesn't have any effect on this kind of usage.
>>>
>>> Yes, thanks. And the ones stuck in RECV are waiting on a semaphore so
>>> we don't have a count of all of them, unfortunately.
>>>
>>> We could maybe look inside the wait_list, but that will probably make
>>> people angry :)
>>
>> The easiest way would probably be to open-code the semaphore-ish part,
>> and let the semaphore and poll share the waitqueue. The current code
>> kind of mirrors the semaphore's waitqueue in the wqh - open-coding the
>> entire semaphore would IMO be cleaner than that. And it's not like
>> semaphore semantics are even a good fit for this code anyway.
>>
>> Let's see... if we didn't have the existing UAPI to worry about, I'd
>> do it as follows (*completely* untested). That way, the ioctl would
>> block exactly until either there actually is a request to deliver or
>> there are no more users of the filter. The problem is that if we just
>> apply this patch, existing users of SECCOMP_IOCTL_NOTIF_RECV that use
>> an event loop and don't set O_NONBLOCK will be screwed. So we'd
>> probably also have to add some stupid counter in place of the
>> semaphore's counter that we can use to preserve the old behavior of
>> returning -ENOENT once for each cancelled request. :(
>>
>> I guess this is a nice point in favor of Michael's usual complaint
>> that if there are no man pages for a feature by the time the feature
>> lands upstream, there's a higher chance that the UAPI will suck
>> forever...
> 
> And I guess this would be the UAPI-compatible version - not actually
> as terrible as I thought it might be. Do y'all want this? If so, feel
> free to either turn this into a proper patch with Co-developed-by, or
> tell me that I should do it and I'll try to get around to turning it
> into something proper.

Thanks for taking a shot at this.

I tried applying the patch below to vanilla 5.9.0.
(There's one typo: s/ENOTCON/ENOTCONN).

It seems not to work though; when I send a signal to my test
target process that is sleeping waiting for the notification
response, the process enters the uninterruptible D state.
Any thoughts?

Thanks,

Michael

> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 676d4af62103..d08c453fcc2c 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -138,7 +138,7 @@ struct seccomp_kaddfd {
>   * @notifications: A list of struct seccomp_knotif elements.
>   */
>  struct notification {
> -       struct semaphore request;
> +       bool canceled_reqs;
>         u64 next_id;
>         struct list_head notifications;
>  };
> @@ -859,7 +859,6 @@ static int seccomp_do_user_notification(int this_syscall,
>         list_add(&n.list, &match->notif->notifications);
>         INIT_LIST_HEAD(&n.addfd);
> 
> -       up(&match->notif->request);
>         wake_up_poll(&match->wqh, EPOLLIN | EPOLLRDNORM);
>         mutex_unlock(&match->notify_lock);
> 
> @@ -901,8 +900,20 @@ static int seccomp_do_user_notification(int this_syscall,
>          * *reattach* to a notifier right now. If one is added, we'll need to
>          * keep track of the notif itself and make sure they match here.
>          */
> -       if (match->notif)
> +       if (match->notif) {
>                 list_del(&n.list);
> +
> +               /*
> +                * We are stuck with a UAPI that requires that after a spurious
> +                * wakeup, SECCOMP_IOCTL_NOTIF_RECV must return immediately.
> +                * This is the tracking for that, keeping track of whether we
> +                * canceled a request after waking waiters, but before userspace
> +                * picked up the notification.
> +                */
> +               if (n.state == SECCOMP_NOTIFY_INIT)
> +                       match->notif->canceled_reqs = true;
> +       }
> +
>  out:
>         mutex_unlock(&match->notify_lock);
> 
> @@ -1178,6 +1189,7 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
>                                 void __user *buf)
>  {
>         struct seccomp_knotif *knotif = NULL, *cur;
> +       DECLARE_WAITQUEUE(wait, current);
>         struct seccomp_notif unotif;
>         ssize_t ret;
> 
> @@ -1190,11 +1202,9 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
> 
>         memset(&unotif, 0, sizeof(unotif));
> 
> -       ret = down_interruptible(&filter->notif->request);
> -       if (ret < 0)
> -               return ret;
> -
>         mutex_lock(&filter->notify_lock);
> +
> +retry:
>         list_for_each_entry(cur, &filter->notif->notifications, list) {
>                 if (cur->state == SECCOMP_NOTIFY_INIT) {
>                         knotif = cur;
> @@ -1202,14 +1212,32 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
>                 }
>         }
> 
> -       /*
> -        * If we didn't find a notification, it could be that the task was
> -        * interrupted by a fatal signal between the time we were woken and
> -        * when we were able to acquire the rw lock.
> -        */
>         if (!knotif) {
> -               ret = -ENOENT;
> -               goto out;
> +               /* This has to happen before checking &filter->users. */
> +               prepare_to_wait(&filter->wqh, &wait, TASK_INTERRUPTIBLE);
> +
> +               /*
> +                * If all users of the filter are gone, throw an error instead
> +                * of pointlessly continuing to block.
> +                */
> +               if (refcount_read(&filter->users) == 0) {
> +                       ret = -ENOTCON;
> +                       goto out;
> +               }
> +               if (filter->notif->canceled_reqs) {
> +                       ret = -ENOENT;
> +                       goto out;
> +               } else {
> +                       /* No notifications pending - wait for one, then retry. */
> +                       mutex_unlock(&filter->notify_lock);
> +                       schedule();
> +                       mutex_lock(&filter->notify_lock);
> +                       if (signal_pending(current)) {
> +                               ret = -EINTR;
> +                               goto out;
> +                       }
> +                       goto retry;
> +               }
>         }
> 
>         unotif.id = knotif->id;
> @@ -1220,6 +1248,8 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
>         wake_up_poll(&filter->wqh, EPOLLOUT | EPOLLWRNORM);
>         ret = 0;
>  out:
> +       filter->notif->canceled_reqs = false;
> +       finish_wait(&filter->wqh, &wait);
>         mutex_unlock(&filter->notify_lock);
> 
>         if (ret == 0 && copy_to_user(buf, &unotif, sizeof(unotif))) {
> @@ -1233,10 +1263,8 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
>                  */
>                 mutex_lock(&filter->notify_lock);
>                 knotif = find_notification(filter, unotif.id);
> -               if (knotif) {
> +               if (knotif)
>                         knotif->state = SECCOMP_NOTIFY_INIT;
> -                       up(&filter->notif->request);
> -               }
>                 mutex_unlock(&filter->notify_lock);
>         }
> 
> @@ -1485,7 +1513,6 @@ static struct file *init_listener(struct seccomp_filter *filter)
>         if (!filter->notif)
>                 goto out;
> 
> -       sema_init(&filter->notif->request, 0);
>         filter->notif->next_id = get_random_u64();
>         INIT_LIST_HEAD(&filter->notif->notifications);
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


* Re: For review: seccomp_user_notif(2) manual page
  2020-10-15 11:24   ` Michael Kerrisk (man-pages)
@ 2020-10-26  0:19     ` Kees Cook
  2020-10-26  9:39       ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 52+ messages in thread
From: Kees Cook @ 2020-10-26  0:19 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Tycho Andersen, Sargun Dhillon, Christian Brauner, linux-man,
	lkml, Aleksa Sarai, Jann Horn, Alexei Starovoitov, wad, bpf,
	Song Liu, Daniel Borkmann, Andy Lutomirski, Linux Containers,
	Giuseppe Scrivano, Robert Sesek

On Thu, Oct 15, 2020 at 01:24:03PM +0200, Michael Kerrisk (man-pages) wrote:
> On 10/1/20 1:39 AM, Kees Cook wrote:
> > I'll comment more later, but I've run out of time today and I didn't see
> > anyone mention this detail yet in the existing threads... :)
> 
> Later never came :-). But, I hope you may have comments for the 
> next draft, which I will send out soon.

Later is now, and Soon approaches!

I finally caught up and read through this whole thread. Thank you all
for the bug fix[1], and I'm looking forward to more[2]. :)

For my reply I figured I'd base it on the current draft, so here's a
simulated quote based on the seccomp_user_notif branch of
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git
through commit 71101158fe330af5a26552447a0bb433b69e15b7
$ COLUMNS=75 man --nh --nj man2/seccomp_user_notif.2 | sed 's/^/> /'

On Sun, Oct 25, 2020 at 01:54:05PM +0100, Michael Kerrisk (man-pages) wrote:
> SECCOMP_USER_NOTIF(2)   Linux Programmer's Manual   SECCOMP_USER_NOTIF(2)
> 
> NAME
>        seccomp_user_notif - Seccomp user-space notification mechanism
> 
> SYNOPSIS
>        #include <linux/seccomp.h>
>        #include <linux/filter.h>
>        #include <linux/audit.h>
> 
>        int seccomp(unsigned int operation, unsigned int flags, void *args);
> 
>        #include <sys/ioctl.h>
> 
>        int ioctl(int fd, SECCOMP_IOCTL_NOTIF_RECV,
>                  struct seccomp_notif *req);
>        int ioctl(int fd, SECCOMP_IOCTL_NOTIF_SEND,
>                  struct seccomp_notif_resp *resp);
>        int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ID_VALID, __u64 *id);
> 
> DESCRIPTION
>        This page describes the user-space notification mechanism provided
>        by the Secure Computing (seccomp) facility.  As well as the use of
>        the SECCOMP_FILTER_FLAG_NEW_LISTENER flag, the
>        SECCOMP_RET_USER_NOTIF action value, and the
>        SECCOMP_GET_NOTIF_SIZES operation described in seccomp(2), this
>        mechanism involves the use of a number of related ioctl(2)
>        operations (described below).
> 
>    Overview
>        In conventional usage of a seccomp filter, the decision about how
>        to treat a system call is made by the filter itself.  By contrast,
>        the user-space notification mechanism allows the seccomp filter to
>        delegate the handling of the system call to another user-space
>        process.  Note that this mechanism is explicitly not intended as a
>        method implementing security policy; see NOTES.
> 
>        In the discussion that follows, the thread(s) on which the seccomp
>        filter is installed is (are) referred to as the target, and the
>        process that is notified by the user-space notification mechanism
>        is referred to as the supervisor.
> 
>        A suitably privileged supervisor can use the user-space
>        notification mechanism to perform actions on behalf of the target.
>        The advantage of the user-space notification mechanism is that the
>        supervisor will usually be able to retrieve information about the
>        target and the performed system call that the seccomp filter
>        itself cannot.  (A seccomp filter is limited in the information it
>        can obtain and the actions that it can perform because it is
>        running on a virtual machine inside the kernel.)
> 
>        An overview of the steps performed by the target and the
>        supervisor is as follows:
> 
>        1. The target establishes a seccomp filter in the usual manner,
>           but with two differences:
> 
>           • The seccomp(2) flags argument includes the flag
>             SECCOMP_FILTER_FLAG_NEW_LISTENER.  Consequently, the return
>             value  of the (successful) seccomp(2) call is a new

nit: extra space

>             "listening" file descriptor that can be used to receive
>             notifications.  Only one "listening" seccomp filter can be
>             installed for a thread.

I like this limitation, but I expect that it'll need to change in the
future. Even with LSMs, we see the need for arbitrary stacking, and the
idea of there being only 1 supervisor will eventually break down. Right
now there is only 1 because only container managers are using this
feature. But if some daemon starts using it to isolate some thread,
suddenly it might break if a container manager is trying to listen to it
too, etc. I expect it won't be needed soon, but I do think it'll change.

> 
>           • In cases where it is appropriate, the seccomp filter returns
>             the action value SECCOMP_RET_USER_NOTIF.  This return value
>             will trigger a notification event.
> 
>        2. In order that the supervisor can obtain notifications using the
>           listening file descriptor, (a duplicate of) that file
>           descriptor must be passed from the target to the supervisor.

Yet another reason to have an "activate on exec" mode for seccomp. With
no_new_privs _not_ being delayed in such a way, I think it'd be safe to
add. The supervisor would get the fd immediately, and then once it
fork/execed suddenly the whole thing would activate, and no fd passing
needed.

The "on exec" boundary is really only needed for oblivious targets. For
a coordinated target, I've thought it might be nice to have an arbitrary
"go" point, where the target could call seccomp() with something like a
SECCOMP_ACTIVATE_DELAYED_FILTERS operation. This lets any process
initialization happen that might need to do things that would be blocked
by filters, etc.

Before:

	fork
	install some filters that don't block initialization
	exec
	do some initialization
	install more filters, maybe block exec, seccomp
	run

After:

	fork
	install delayed filters
	exec
	do some initialization
	activate delayed filters
	run

In practice, the two-stage filter application has been fine, if
sometimes a bit complex (e.g. for user_notif, "do some initialization"
includes figuring out how to pass the fd back to the supervisor, etc).

>           One way in which this could be done is by passing the file
>           descriptor over a UNIX domain socket connection between the
>           target and the supervisor (using the SCM_RIGHTS ancillary
>           message type described in unix(7)).
> 
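
For reference, the fd-passing step can be sketched roughly as below. This
is only a minimal sketch of the SCM_RIGHTS technique from unix(7); the
helper names send_fd() and recv_fd() are made up for illustration and are
not taken from the page's example program.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: send one file descriptor over a connected UNIX
 * domain socket.  Returns 0 on success, -1 on error. */
static int send_fd(int sock, int fd)
{
	char dummy = 'x';	/* one byte of real data accompanies the fd */
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {			/* union guarantees cmsghdr alignment */
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} ctl;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&ctl, 0, sizeof(ctl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one file descriptor sent by send_fd().
 * Returns the new descriptor, or -1 on error. */
static int recv_fd(int sock)
{
	char dummy;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} ctl;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&ctl, 0, sizeof(ctl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);

	if (recvmsg(sock, &msg, 0) <= 0)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The target would call send_fd() with the listening descriptor returned by
seccomp(2), and the supervisor would call recv_fd() on the other end of
the socket; the descriptor the supervisor gets back is a duplicate.
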
>        3. The supervisor will receive notification events on the
>           listening file descriptor.  These events are returned as
>           structures of type seccomp_notif.  Because this structure and
>           its size may evolve over kernel versions, the supervisor must
>           first determine the size of this structure using the seccomp(2)
>           SECCOMP_GET_NOTIF_SIZES operation, which returns a structure of
>           type seccomp_notif_sizes.  The supervisor allocates a buffer of
>           size seccomp_notif_sizes.seccomp_notif bytes to receive
>           notification events.  In addition, the supervisor allocates
>           another buffer of size seccomp_notif_sizes.seccomp_notif_resp
>           bytes for the response (a struct seccomp_notif_resp structure)
>           that it will provide to the kernel (and thus the target).
> 
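
A rough sketch of that allocation step, in case it helps readers. The
struct notif_bufs wrapper and alloc_notif_bufs() are hypothetical names,
and allocating the larger of the kernel-reported size and the
compile-time sizeof is my own defensive assumption, not something the
draft requires.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/seccomp.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical container for the two buffers the supervisor needs. */
struct notif_bufs {
	struct seccomp_notif_sizes sizes;
	struct seccomp_notif *req;
	struct seccomp_notif_resp *resp;
};

static size_t max_size(size_t a, size_t b)
{
	return a > b ? a : b;
}

/* Ask the kernel for the structure sizes, then allocate zeroed buffers.
 * Returns 0 on success, -1 on error (nothing is left allocated). */
static int alloc_notif_bufs(struct notif_bufs *b)
{
	b->req = NULL;
	b->resp = NULL;

	/* glibc has no seccomp() wrapper, so go through syscall(2). */
	if (syscall(SYS_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, &b->sizes) == -1)
		return -1;

	/* Assumption: take the larger of the two sizes so that neither the
	 * kernel nor this program writes past the end of a buffer. */
	b->req = calloc(1, max_size(b->sizes.seccomp_notif, sizeof(*b->req)));
	b->resp = calloc(1, max_size(b->sizes.seccomp_notif_resp,
				     sizeof(*b->resp)));
	if (b->req == NULL || b->resp == NULL) {
		free(b->req);
		free(b->resp);
		b->req = NULL;
		b->resp = NULL;
		return -1;
	}
	return 0;
}
```

calloc() also takes care of the "must be zeroed out before the call"
requirement for the first SECCOMP_IOCTL_NOTIF_RECV; the buffer has to be
cleared again before each later RECV.
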
>        4. The target then performs its workload, which includes system
>           calls that will be controlled by the seccomp filter.  Whenever
>           one of these system calls causes the filter to return the
>           SECCOMP_RET_USER_NOTIF action value, the kernel does not (yet)
>           execute the system call; instead, execution of the target is
>           temporarily blocked inside the kernel (in a sleep state that is
>           interruptible by signals) and a notification event is generated
>           on the listening file descriptor.
> 
>        5. The supervisor can now repeatedly monitor the listening file
>           descriptor for SECCOMP_RET_USER_NOTIF-triggered events.  To do
>           this, the supervisor uses the SECCOMP_IOCTL_NOTIF_RECV ioctl(2)
>           operation to read information about a notification event; this
>           operation blocks until an event is available.  The operation
>           returns a seccomp_notif structure containing information about
>           the system call that is being attempted by the target.
> 
>        6. The seccomp_notif structure returned by the
>           SECCOMP_IOCTL_NOTIF_RECV operation includes the same
>           information (a seccomp_data structure) that was passed to the
>           seccomp filter.  This information allows the supervisor to
>           discover the system call number and the arguments for the
>           target's system call.  In addition, the notification event
>           contains the ID of the thread that triggered the notification.

Should "cookie" be at least named here, just to provide a bit more
context for when it is mentioned in 8 below? E.g.:

			       ... In addition, the notification event
	    contains the triggering thread's ID and a unique cookie to be
	    used in subsequent SECCOMP_IOCTL_NOTIF_ID_VALID and
	    SECCOMP_IOCTL_NOTIF_SEND operations.

> 
>           The information in the notification can be used to discover the
>           values of pointer arguments for the target's system call.
>           (This is something that can't be done from within a seccomp
>           filter.)  One way in which the supervisor can do this is to
>           open the corresponding /proc/[tid]/mem file (see proc(5)) and
>           read bytes from the location that corresponds to one of the
>           pointer arguments whose value is supplied in the notification
>           event.  (The supervisor must be careful to avoid a race
>           condition that can occur when doing this; see the description
>           of the SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.)
>           In addition, the supervisor can access other system information
>           that is visible in user space but which is not accessible from
>           a seccomp filter.
> 
>        7. Having obtained information as per the previous step, the
>           supervisor may then choose to perform an action in response to
>           the target's system call (which, as noted above, is not
>           executed when the seccomp filter returns the
>           SECCOMP_RET_USER_NOTIF action value).
> 
>           One example use case here relates to containers.  The target
>           may be located inside a container where it does not have
>           sufficient capabilities to mount a filesystem in the
>           container's mount namespace.  However, the supervisor may be a
>           more privileged process that does have sufficient capabilities
>           to perform the mount operation.
> 
>        8. The supervisor then sends a response to the notification.  The
>           information in this response is used by the kernel to construct
>           a return value for the target's system call and provide a value
>           that will be assigned to the errno variable of the target.
> 
>           The response is sent using the SECCOMP_IOCTL_NOTIF_SEND
>           ioctl(2) operation, which is used to transmit a
>           seccomp_notif_resp structure to the kernel.  This structure
>           includes a cookie value that the supervisor obtained in the
>           seccomp_notif structure returned by the
>           SECCOMP_IOCTL_NOTIF_RECV operation.  This cookie value allows
>           the kernel to associate the response with the target.

Describing where the cookie came from seems like it should live in 6
above. A reader would have to take this new info and figure out where
SECCOMP_IOCTL_NOTIF_RECV was described and piece it together. With the
suggestion to 6 above, maybe:

                                                     ... This structure
            must include the cookie value that the supervisor obtained in
            the seccomp_notif structure returned by the
	    SECCOMP_IOCTL_NOTIF_RECV operation, which allows the kernel
            to associate the response with the target.

> 
>        9. Once the notification has been sent, the system call in the
>           target thread unblocks, returning the information that was
>           provided by the supervisor in the notification response.
> 
>        As a variation on the last two steps, the supervisor can send a
>        response that tells the kernel that it should execute the target
>        thread's system call; see the discussion of
>        SECCOMP_USER_NOTIF_FLAG_CONTINUE, below.
> 
>    ioctl(2) operations
>        The following ioctl(2) operations are provided to support seccomp
>        user-space notification.  For each of these operations, the first
>        (file descriptor) argument of ioctl(2) is the listening file
>        descriptor returned by a call to seccomp(2) with the
>        SECCOMP_FILTER_FLAG_NEW_LISTENER flag.
> 
>        SECCOMP_IOCTL_NOTIF_RECV
>               This operation is used to obtain a user-space notification
>               event.  If no such event is currently pending, the
>               operation blocks until an event occurs.  The third ioctl(2)
>               argument is a pointer to a structure of the following form
>               which contains information about the event.  This structure
>               must be zeroed out before the call.
> 
>                   struct seccomp_notif {
>                       __u64  id;              /* Cookie */
>                       __u32  pid;             /* TID of target thread */

Should we rename this variable from pid to tid? Yes it's UAPI, but yay for
anonymous unions:

struct seccomp_notif {
	__u64		id;		/* Cookie */
	union {
		__u32	pid;
		__u32	tid;		/* TID of target thread */
	};
	__u32  flags;			/* Currently unused (0) */
	struct seccomp_data data;	/* See seccomp(2) */
};
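
To convince myself the union doesn't move anything, here is a small
layout check. It is only a sketch: the demo_* structures are stand-ins I
made up that mirror the field order above (demo_data mimics struct
seccomp_data), so that it compiles without touching the real UAPI header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct seccomp_data (same field order and widths). */
struct demo_data {
	int32_t  nr;
	uint32_t arch;
	uint64_t instruction_pointer;
	uint64_t args[6];
};

/* Layout as it exists today. */
struct demo_notif_old {
	uint64_t id;
	uint32_t pid;
	uint32_t flags;
	struct demo_data data;
};

/* Layout with the proposed anonymous union (C11). */
struct demo_notif_new {
	uint64_t id;
	union {
		uint32_t pid;
		uint32_t tid;	/* TID of target thread */
	};
	uint32_t flags;
	struct demo_data data;
};

/* The union must not change the size or any field offset. */
_Static_assert(sizeof(struct demo_notif_old) ==
	       sizeof(struct demo_notif_new), "size changed");
_Static_assert(offsetof(struct demo_notif_old, pid) ==
	       offsetof(struct demo_notif_new, tid), "pid/tid offset differs");
_Static_assert(offsetof(struct demo_notif_old, flags) ==
	       offsetof(struct demo_notif_new, flags), "flags moved");
_Static_assert(offsetof(struct demo_notif_old, data) ==
	       offsetof(struct demo_notif_new, data), "data moved");
```

Old code that reads .pid and new code that reads .tid would see the same
four bytes.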

>                       __u32  flags;           /* Currently unused (0) */
>                       struct seccomp_data data;   /* See seccomp(2) */
>                   };
> 
>               The fields in this structure are as follows:
> 
>               id     This is a cookie for the notification.  Each such
>                      cookie is guaranteed to be unique for the
>                      corresponding seccomp filter.
> 
>                      • It can be used with the
>                        SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation to
>                        verify that the target is still alive.
> 
>                      • When returning a notification response to the
>                        kernel, the supervisor must include the cookie
>                        value in the seccomp_notif_resp structure that is
>                        specified as the argument of the
>                        SECCOMP_IOCTL_NOTIF_SEND operation.
> 
>               pid    This is the thread ID of the target thread that
>                      triggered the notification event.
> 
>               flags  This is a bit mask of flags providing further
>                      information on the event.  In the current
>                      implementation, this field is always zero.
> 
>               data   This is a seccomp_data structure containing
>                      information about the system call that triggered the
>                      notification.  This is the same structure that is
>                      passed to the seccomp filter.  See seccomp(2) for
>                      details of this structure.
> 
>               On success, this operation returns 0; on failure, -1 is
>               returned, and errno is set to indicate the cause of the
>               error.  This operation can fail with the following errors:
> 
>               EINVAL (since Linux 5.5)
>                      The seccomp_notif structure that was passed to the
>                      call contained nonzero fields.
> 
>               ENOENT The target thread was killed by a signal as the
>                      notification information was being generated, or the
>                      target's (blocked) system call was interrupted by a
>                      signal handler.
> 
>        SECCOMP_IOCTL_NOTIF_ID_VALID
>               This operation can be used to check that a notification ID
>               returned by an earlier SECCOMP_IOCTL_NOTIF_RECV operation
>               is still valid (i.e., that the target still exists).

Maybe clarify a bit more, since it's covering more than just "is the
target still alive", but also "is that syscall still waiting for a
response":

                is still valid (i.e., that the target still exists and
		the syscall is still blocked waiting for a response).


> 
>               The third ioctl(2) argument is a pointer to the cookie (id)
>               returned by the SECCOMP_IOCTL_NOTIF_RECV operation.
> 
>               This operation is necessary to avoid race conditions that
>               can occur when the pid returned by the
>               SECCOMP_IOCTL_NOTIF_RECV operation terminates, and that
>               process ID is reused by another process.  An example of
>               this kind of race is the following:
> 
>               1. A notification is generated on the listening file
>                  descriptor.  The returned seccomp_notif contains the TID
>                  of the target thread (in the pid field of the
>                  structure).
> 
>               2. The target terminates.
> 
>               3. Another thread or process is created on the system that
>                  by chance reuses the TID that was freed when the target
>                  terminated.
> 
>               4. The supervisor open(2)s the /proc/[tid]/mem file for the
>                  TID obtained in step 1, with the intention of (say)
>                  inspecting the memory location(s) that contain the
>                  argument(s) of the system call that triggered the
>                  notification in step 1.
> 
>               In the above scenario, the risk is that the supervisor may
>               try to access the memory of a process other than the
>               target.  This race can be avoided by following the call to
>               open(2) with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to
>               verify that the process that generated the notification is
>               still alive.  (Note that if the target terminates after the
>               latter step, a subsequent read(2) from the file descriptor
>               may return 0, indicating end of file.)
> 
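
The open-then-validate ordering described above could look something like
the following. This is a minimal sketch; read_target_mem() is a
hypothetical helper name, and the error handling is deliberately crude.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/seccomp.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read len bytes at address addr in the memory of
 * the thread named by req->pid.  Returns bytes read, or -1 on error or
 * if the notification is no longer valid. */
static ssize_t read_target_mem(int notify_fd, const struct seccomp_notif *req,
			       __u64 addr, void *buf, size_t len)
{
	char path[64];
	__u64 id = req->id;
	ssize_t n;
	int mem_fd;

	snprintf(path, sizeof(path), "/proc/%u/mem", req->pid);
	mem_fd = open(path, O_RDONLY | O_CLOEXEC);
	if (mem_fd == -1)
		return -1;

	/* Validate the cookie *after* open(): if it is still valid, the
	 * TID had not been recycled when the file was opened, so mem_fd
	 * refers to the target's memory. */
	if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == -1) {
		close(mem_fd);
		return -1;
	}

	n = pread(mem_fd, buf, len, (off_t) addr);
	close(mem_fd);
	return n;
}
```

A careful supervisor may also want to repeat the validity check after the
read, since the target's system call can be interrupted at any point.
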
>               On success (i.e., the notification ID is still valid), this
>               operation returns 0.  On failure (i.e., the notification ID
>               is no longer valid), -1 is returned, and errno is set to
>               ENOENT.
> 
>        SECCOMP_IOCTL_NOTIF_SEND
>               This operation is used to send a notification response back
>               to the kernel.  The third ioctl(2) argument of this
>               operation is a pointer to a structure of the following
>               form:
> 
>                   struct seccomp_notif_resp {
>                       __u64 id;               /* Cookie value */
>                       __s64 val;              /* Success return value */
>                       __s32 error;            /* 0 (success) or negative
>                                                  error number */
>                       __u32 flags;            /* See below */
>                   };
> 
>               The fields of this structure are as follows:
> 
>               id     This is the cookie value that was obtained using the
>                      SECCOMP_IOCTL_NOTIF_RECV operation.  This cookie
>                      value allows the kernel to correctly associate this
>                      response with the system call that triggered the
>                      user-space notification.
> 
>               val    This is the value that will be used for a spoofed
>                      success return for the target's system call; see
>                      below.
> 
>               error  This is the value that will be used as the error
>                      number (errno) for a spoofed error return for the
>                      target's system call; see below.
> 
>               flags  This is a bit mask that includes zero or more of the
>                      following flags:
> 
>                      SECCOMP_USER_NOTIF_FLAG_CONTINUE (since Linux 5.5)
>                             Tell the kernel to execute the target's
>                             system call.
> 
>               Two kinds of response are possible:
> 
>               • A response to the kernel telling it to execute the
>                 target's system call.  In this case, the flags field
>                 includes SECCOMP_USER_NOTIF_FLAG_CONTINUE and the error
>                 and val fields must be zero.
> 
>                 This kind of response can be useful in cases where the
>                 supervisor needs to do deeper analysis of the target's
>                 system call than is possible from a seccomp filter (e.g.,
>                 examining the values of pointer arguments), and, having
>                 decided that the system call does not require emulation
>                 by the supervisor, the supervisor wants the system call
>                 to be executed normally in the target.
> 
>                 The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag should be used
>                 with caution; see NOTES.
> 
>               • A spoofed return value for the target's system call.  In
>                 this case, the kernel does not execute the target's
>                 system call, instead causing the system call to return a
>                 spoofed value as specified by fields of the
>                 seccomp_notif_resp structure.  The supervisor should set
>                 the fields of this structure as follows:
> 
>                 +  flags does not contain
>                    SECCOMP_USER_NOTIF_FLAG_CONTINUE.
> 
>                 +  error is set either to 0 for a spoofed "success"
>                    return or to a negative error number for a spoofed
>                    "failure" return.  In the former case, the kernel
>                    causes the target's system call to return the value
>                    specified in the val field.  In the latter case, the
>                    kernel causes the target's system call to return -1,
>                    and errno is assigned the negated error value.
> 
>                 +  val is set to a value that will be used as the return
>                    value for a spoofed "success" return for the target's
>                    system call.  The value in this field is ignored if
>                    the error field contains a nonzero value.

Strictly speaking, this is architecture specific, but all architectures
do it this way. Should seccomp enforce val == 0 when err != 0 ?
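
Until the kernel enforces anything, a supervisor can keep itself honest
with small helpers that never set val together with a nonzero error. A
sketch, with made-up helper names (resp_success(), resp_failure(),
resp_continue()); the fallback #define is only for older headers and
matches the value in the 5.5 UAPI header.

```c
#include <assert.h>
#include <errno.h>
#include <linux/seccomp.h>
#include <string.h>

#ifndef SECCOMP_USER_NOTIF_FLAG_CONTINUE
#define SECCOMP_USER_NOTIF_FLAG_CONTINUE (1UL << 0)	/* headers older than 5.5 */
#endif

/* Spoofed "success": the target's system call returns val. */
static void resp_success(struct seccomp_notif_resp *r, __u64 id, __s64 val)
{
	memset(r, 0, sizeof(*r));
	r->id = id;
	r->val = val;
}

/* Spoofed "failure": err is a positive errno value such as EPERM.
 * The error field carries the negated value, and val stays 0. */
static void resp_failure(struct seccomp_notif_resp *r, __u64 id, int err)
{
	memset(r, 0, sizeof(*r));
	r->id = id;
	r->error = -err;
}

/* Let the kernel execute the target's system call; error and val
 * must both be zero in this case. */
static void resp_continue(struct seccomp_notif_resp *r, __u64 id)
{
	memset(r, 0, sizeof(*r));
	r->id = id;
	r->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
}
```

The filled-in structure is then handed to SECCOMP_IOCTL_NOTIF_SEND; the
id argument is the cookie from the matching SECCOMP_IOCTL_NOTIF_RECV.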

> 
>               On success, this operation returns 0; on failure, -1 is
>               returned, and errno is set to indicate the cause of the
>               error.  This operation can fail with the following errors:
> 
>               EINPROGRESS
>                      A response to this notification has already been
>                      sent.
> 
>               EINVAL An invalid value was specified in the flags field.
> 
>               EINVAL The flags field contained
>                      SECCOMP_USER_NOTIF_FLAG_CONTINUE, and the error or
>                      val field was not zero.
> 
>               ENOENT The blocked system call in the target has been
>                      interrupted by a signal handler or the target has
>                      terminated.
> 
> NOTES
>    select()/poll()/epoll semantics
>        The file descriptor returned when seccomp(2) is employed with the
>        SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be monitored using
>        poll(2), epoll(7), and select(2).  These interfaces indicate that
>        the file descriptor is ready as follows:
> 
>        • When a notification is pending, these interfaces indicate that
>          the file descriptor is readable.  Following such an indication,
>          a subsequent SECCOMP_IOCTL_NOTIF_RECV ioctl(2) will not block,
>          returning either information about a notification or else
>          failing with the error ENOENT if the target has been killed by a
>          signal or its system call has been interrupted by a signal
>          handler.
> 
>        • After the notification has been received (i.e., by the
>          SECCOMP_IOCTL_NOTIF_RECV ioctl(2) operation), these interfaces
>          indicate that the file descriptor is writable, meaning that a
>          notification response can be sent using the
>          SECCOMP_IOCTL_NOTIF_SEND ioctl(2) operation.
> 
>        • After the last thread using the filter has terminated and been
>          reaped using waitpid(2) (or similar), the file descriptor
>          indicates an end-of-file condition (readable in select(2);
>          POLLHUP/EPOLLHUP in poll(2)/ epoll_wait(2)).

I'll reply separately about the "ioctl() does not terminate when all
filters have terminated" case.
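
In the meantime, a supervisor that wants to avoid sitting in a blocked
ioctl() can poll() the descriptor first. A minimal sketch of the
readable and hang-up cases from the bullets above, assuming only what
the quoted text says about poll(2); the enum and function names are
made up for illustration:

```c
#include <poll.h>

/* Hypothetical result codes for this sketch */
enum notifyFdState { NFD_TIMEOUT, NFD_READABLE, NFD_HANGUP, NFD_ERROR };

/* Wait up to 'timeoutMs' milliseconds for 'fd' to become readable
   (a notification is pending) or to report a hang-up (no thread is
   using the filter any more). */
static enum notifyFdState
waitOnNotifyFd(int fd, int timeoutMs)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    int n = poll(&pfd, 1, timeoutMs);
    if (n == -1)
        return NFD_ERROR;       /* Includes EINTR; caller may retry */
    if (n == 0)
        return NFD_TIMEOUT;

    if (pfd.revents & POLLIN)
        return NFD_READABLE;    /* Safe to do SECCOMP_IOCTL_NOTIF_RECV */
    if (pfd.revents & POLLHUP)
        return NFD_HANGUP;      /* Target is gone; stop listening */

    return NFD_ERROR;           /* POLLERR or POLLNVAL */
}
```

The supervisor would call SECCOMP_IOCTL_NOTIF_RECV only on
NFD_READABLE and exit its loop on NFD_HANGUP.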

> 
>    Design goals; use of SECCOMP_USER_NOTIF_FLAG_CONTINUE
>        The intent of the user-space notification feature is to allow
>        system calls to be performed on behalf of the target.  The
>        target's system call should either be handled by the supervisor or
>        allowed to continue normally in the kernel (where standard
>        security policies will be applied).
> 
>        Note well: this mechanism must not be used to make security policy
>        decisions about the system call, which would be inherently race-
>        prone for reasons described next.
> 
>        The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag must be used with
>        caution.  If set by the supervisor, the target's system call will
>        continue.  However, there is a time-of-check, time-of-use race
>        here, since an attacker could exploit the interval of time where
>        the target is blocked waiting on the "continue" response to do
>        things such as rewriting the system call arguments.
> 
>        Note furthermore that a user-space notifier can be bypassed if the
>        existing filters allow the use of seccomp(2) or prctl(2) to
>        install a filter that returns an action value with a higher
>        precedence than SECCOMP_RET_USER_NOTIF (see seccomp(2)).
> 
>        It should thus be absolutely clear that the seccomp user-space
>        notification mechanism can not be used to implement a security
>        policy!  It should only ever be used in scenarios where a more
>        privileged process supervises the system calls of a lesser
>        privileged target to get around kernel-enforced security
>        restrictions when the supervisor deems this safe.  In other words,
>        in order to continue a system call, the supervisor should be sure
>        that another security mechanism or the kernel itself will
>        sufficiently block the system call if its arguments are rewritten
>        to something unsafe.
> 
>    Interaction with SA_RESTART signal handlers
>        Consider the following scenario:
> 
>        • The target process has used sigaction(2) to install a signal
>          handler with the SA_RESTART flag.
> 
>        • The target has made a system call that triggered a seccomp user-
>          space notification and the target is currently blocked until the
>          supervisor sends a notification response.
> 
>        • A signal is delivered to the target and the signal handler is
>          executed.
> 
>        • When (if) the supervisor attempts to send a notification
>          response, the SECCOMP_IOCTL_NOTIF_SEND ioctl(2) operation will
>          fail with the ENOENT error.
> 
>        In this scenario, the kernel will restart the target's system
>        call.  Consequently, the supervisor will receive another user-
>        space notification.  Thus, depending on how many times the blocked
>        system call is interrupted by a signal handler, the supervisor may
>        receive multiple notifications for the same system call in the

maybe "... for the same instance of a system call in the target." for
clarity?

>        target.
> 
>        One oddity is that system call restarting as described in this
>        scenario will occur even for the blocking system calls listed in
>        signal(7) that would never normally be restarted by the SA_RESTART
>        flag.

Does this need fixing? I imagine the correct behavior for this case
would be for SECCOMP_IOCTL_NOTIF_SEND to fail with EINPROGRESS, while
the target sees EINTR as it normally would?

I mean, it's not like seccomp doesn't already expose weirdness with
syscall restarts. Not even arm64 compat agrees[3] with arm32 in this
regard. :(

> BUGS
>        If a SECCOMP_IOCTL_NOTIF_RECV ioctl(2) operation is performed
>        after the target terminates, then the ioctl(2) call simply blocks
>        (rather than returning an error to indicate that the target no
>        longer exists).

I want this fixed. It caused me no end of pain when building the
selftests, and led to my implementing a global test timeout in
kselftest. :P Before the usage counter refactor, there was no sane
way to deal with this, but now I think we're close[2]. I'll reply
separately about this.

> 
> EXAMPLES
>        The (somewhat contrived) program shown below demonstrates the use
>        of the interfaces described in this page.  The program creates a
>        child process that serves as the "target" process.  The child
>        process installs a seccomp filter that returns the
>        SECCOMP_RET_USER_NOTIF action value if a call is made to mkdir(2).
>        The child process then calls mkdir(2) once for each of the
>        supplied command-line arguments, and reports the result returned
>        by the call.  After processing all arguments, the child process
>        terminates.
> 
>        The parent process acts as the supervisor, listening for the
>        notifications that are generated when the target process calls
>        mkdir(2).  When such a notification occurs, the supervisor
>        examines the memory of the target process (using /proc/[pid]/mem)
>        to discover the pathname argument that was supplied to the
>        mkdir(2) call, and performs one of the following actions:

I like this example! It's simple enough to be understandable and complex
enough to show the purpose of user_notif. :)

> 
>        • If the pathname begins with the prefix "/tmp/", then the
>          supervisor attempts to create the specified directory, and then
>          spoofs a return for the target process based on the return value
>          of the supervisor's mkdir(2) call.  In the event that that call
>          succeeds, the spoofed success return value is the length of the
>          pathname.
> 
>        • If the pathname begins with "./" (i.e., it is a relative
>          pathname), the supervisor sends a
>          SECCOMP_USER_NOTIF_FLAG_CONTINUE response to the kernel to say
>          that the kernel should execute the target process's mkdir(2)
>          call.
> 
>        • If the pathname begins with some other prefix, the supervisor
>          spoofs an error return for the target process, so that the
>          target process's mkdir(2) call appears to fail with the error
>          EOPNOTSUPP ("Operation not supported").  Additionally, if the
>          specified pathname is exactly "/bye", then the supervisor
>          terminates.
> 
>        This program can be used to demonstrate various aspects of the
>        behavior of the seccomp user-space notification mechanism.  To
>        help aid such demonstrations, the program logs various messages to
>        show the operation of the target process (lines prefixed "T:") and
>        the supervisor (indented lines prefixed "S:").
> 
>        In the following example, the target attempts to create the
>        directory /tmp/x.  Upon receiving the notification, the supervisor
>        creates the directory on the target's behalf, and spoofs a success
>        return to be received by the target process's mkdir(2) call.
> 
>            $ ./seccomp_unotify /tmp/x
>            T: PID = 23168
> 
>            T: about to mkdir("/tmp/x")
>                    S: got notification (ID 0x17445c4a0f4e0e3c) for PID 23168
>                    S: executing: mkdir("/tmp/x", 0700)
>                    S: success! spoofed return = 6
>                    S: sending response (flags = 0; val = 6; error = 0)
>            T: SUCCESS: mkdir(2) returned 6
> 
>            T: terminating
>                    S: target has terminated; bye
> 
>        In the above output, note that the spoofed return value seen by
>        the target process is 6 (the length of the pathname /tmp/x),
>        whereas a normal mkdir(2) call returns 0 on success.
> 
>        In the next example, the target attempts to create a directory
>        using the relative pathname ./sub.  Since this pathname starts
>        with "./", the supervisor sends a SECCOMP_USER_NOTIF_FLAG_CONTINUE
>        response to the kernel, and the kernel then (successfully)
>        executes the target process's mkdir(2) call.
> 
>            $ ./seccomp_unotify ./sub
>            T: PID = 23204
> 
>            T: about to mkdir("./sub")
>                    S: got notification (ID 0xddb16abe25b4c12) for PID 23204
>                    S: target can execute system call
>                    S: sending response (flags = 0x1; val = 0; error = 0)
>            T: SUCCESS: mkdir(2) returned 0
> 
>            T: terminating
>                    S: target has terminated; bye
> 
>        If the target process attempts to create a directory with a
>        pathname that doesn't start with "." and doesn't begin with the
>        prefix "/tmp/", then the supervisor spoofs an error return
>        (EOPNOTSUPP, "Operation not supported") for the target's mkdir(2)
>        call (which is not executed):
> 
>            $ ./seccomp_unotify /xxx
>            T: PID = 23178
> 
>            T: about to mkdir("/xxx")
>                    S: got notification (ID 0xe7dc095d1c524e80) for PID 23178
>                    S: spoofing error response (Operation not supported)
>                    S: sending response (flags = 0; val = 0; error = -95)
>            T: ERROR: mkdir(2): Operation not supported
> 
>            T: terminating
>                    S: target has terminated; bye
> 
>        In the next example, the target process attempts to create a
>        directory with the pathname /tmp/nosuchdir/b.  Upon receiving the
>        notification, the supervisor attempts to create that directory,
>        but the mkdir(2) call fails because the directory /tmp/nosuchdir
>        does not exist.  Consequently, the supervisor spoofs an error
>        return that passes the error that it received back to the target
>        process's mkdir(2) call.
> 
>            $ ./seccomp_unotify /tmp/nosuchdir/b
>            T: PID = 23199
> 
>            T: about to mkdir("/tmp/nosuchdir/b")
>                    S: got notification (ID 0x8744454293506046) for PID 23199
>                    S: executing: mkdir("/tmp/nosuchdir/b", 0700)
>                    S: failure! (errno = 2; No such file or directory)
>                    S: sending response (flags = 0; val = 0; error = -2)
>            T: ERROR: mkdir(2): No such file or directory
> 
>            T: terminating
>                    S: target has terminated; bye
> 
>        If the supervisor receives a notification and sees that the
>        argument of the target's mkdir(2) is the string "/bye", then (as
>        well as spoofing an EOPNOTSUPP error), the supervisor terminates.
>        If the target process subsequently executes another mkdir(2) that
>        triggers its seccomp filter to return the SECCOMP_RET_USER_NOTIF
>        action value, then the kernel causes the target process's system
>        call to fail with the error ENOSYS ("Function not implemented").
>        This is demonstrated by the following example:
> 
>            $ ./seccomp_unotify /bye /tmp/y
>            T: PID = 23185
> 
>            T: about to mkdir("/bye")
>                    S: got notification (ID 0xa81236b1d2f7b0f4) for PID 23185
>                    S: spoofing error response (Operation not supported)
>                    S: sending response (flags = 0; val = 0; error = -95)
>                    S: terminating **********
>            T: ERROR: mkdir(2): Operation not supported
> 
>            T: about to mkdir("/tmp/y")
>            T: ERROR: mkdir(2): Function not implemented
> 
>            T: terminating
> 
>    Program source
>        #define _GNU_SOURCE
>        #include <sys/types.h>
>        #include <sys/prctl.h>
>        #include <fcntl.h>
>        #include <limits.h>
>        #include <signal.h>
>        #include <stddef.h>
>        #include <stdint.h>
>        #include <stdbool.h>
>        #include <linux/audit.h>
>        #include <sys/syscall.h>
>        #include <sys/stat.h>
>        #include <linux/filter.h>
>        #include <linux/seccomp.h>
>        #include <sys/ioctl.h>
>        #include <stdio.h>
>        #include <stdlib.h>
>        #include <unistd.h>
>        #include <errno.h>
>        #include <sys/socket.h>
>        #include <sys/un.h>
>        #include <string.h>
> 
>        #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
>                                } while (0)

Because I love macros, you can expand this to make it take a format
string:

#define errExit(fmt, ...)	do {					\
		char __err[64];						\
		strerror_r(errno, __err, sizeof(__err));		\
		fprintf(stderr, fmt ": %s\n", ##__VA_ARGS__, __err);	\
		exit(EXIT_FAILURE);					\
	} while (0)
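
One caveat: the program defines _GNU_SOURCE, so on glibc it gets the
GNU strerror_r(), which returns a char * and is not guaranteed to fill
the buffer. A sketch that sidesteps this by using plain strerror(),
which is fine in a single-threaded example. errMsg() is a helper name
invented here, and ##__VA_ARGS__ is a GNU extension:

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Print a formatted message followed by the text for the current
   errno; evaluates to the return value of fprintf(). */
#define errMsg(fmt, ...) \
        fprintf(stderr, fmt ": %s\n", ##__VA_ARGS__, strerror(errno))

/* As errMsg(), but terminate the process afterward */
#define errExit(fmt, ...)   do {                        \
            errMsg(fmt, ##__VA_ARGS__);                 \
            exit(EXIT_FAILURE);                         \
        } while (0)
```

Existing callers such as errExit("fork") keep working unchanged, and
new ones can write errExit("open %s", procMemPath).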

> 
>        /* Send the file descriptor 'fd' over the connected UNIX domain socket
>           'sockfd'. Returns 0 on success, or -1 on error. */
> 
>        static int
>        sendfd(int sockfd, int fd)
>        {
>            struct msghdr msgh;
>            struct iovec iov;
>            int data;
>            struct cmsghdr *cmsgp;
> 
>            /* Allocate a char array of suitable size to hold the ancillary data.
>               However, since this buffer is in reality a 'struct cmsghdr', use a
>               union to ensure that it is suitably aligned. */
>            union {
>                char   buf[CMSG_SPACE(sizeof(int))];
>                                /* Space large enough to hold an 'int' */
>                struct cmsghdr align;
>            } controlMsg;
> 
>            /* The 'msg_name' field can be used to specify the address of the
>               destination socket when sending a datagram. However, we do not
>               need to use this field because 'sockfd' is a connected socket. */
> 
>            msgh.msg_name = NULL;
>            msgh.msg_namelen = 0;
> 
>            /* On Linux, we must transmit at least one byte of real data in
>               order to send ancillary data. We transmit an arbitrary integer
>               whose value is ignored by recvfd(). */
> 
>            msgh.msg_iov = &iov;
>            msgh.msg_iovlen = 1;
>            iov.iov_base = &data;
>            iov.iov_len = sizeof(int);
>            data = 12345;
> 
>            /* Set 'msghdr' fields that describe ancillary data */
> 
>            msgh.msg_control = controlMsg.buf;
>            msgh.msg_controllen = sizeof(controlMsg.buf);
> 
>            /* Set up ancillary data describing file descriptor to send */
> 
>            cmsgp = CMSG_FIRSTHDR(&msgh);
>            cmsgp->cmsg_level = SOL_SOCKET;
>            cmsgp->cmsg_type = SCM_RIGHTS;
>            cmsgp->cmsg_len = CMSG_LEN(sizeof(int));
>            memcpy(CMSG_DATA(cmsgp), &fd, sizeof(int));
> 
>            /* Send real plus ancillary data */
> 
>            if (sendmsg(sockfd, &msgh, 0) == -1)
>                return -1;
> 
>            return 0;
>        }
> 
>        /* Receive a file descriptor on a connected UNIX domain socket. Returns
>           the received file descriptor on success, or -1 on error. */
> 
>        static int
>        recvfd(int sockfd)
>        {
>            struct msghdr msgh;
>            struct iovec iov;
>            int data, fd;
>            ssize_t nr;
> 
>            /* Allocate a char buffer for the ancillary data. See the comments
>               in sendfd() */
>            union {
>                char   buf[CMSG_SPACE(sizeof(int))];
>                struct cmsghdr align;
>            } controlMsg;
>            struct cmsghdr *cmsgp;
> 
>            /* The 'msg_name' field can be used to obtain the address of the
>               sending socket. However, we do not need this information. */
> 
>            msgh.msg_name = NULL;
>            msgh.msg_namelen = 0;
> 
>            /* Specify buffer for receiving real data */
> 
>            msgh.msg_iov = &iov;
>            msgh.msg_iovlen = 1;
>            iov.iov_base = &data;       /* Real data is an 'int' */
>            iov.iov_len = sizeof(int);
> 
>            /* Set 'msghdr' fields that describe ancillary data */
> 
>            msgh.msg_control = controlMsg.buf;
>            msgh.msg_controllen = sizeof(controlMsg.buf);
> 
>            /* Receive real plus ancillary data; real data is ignored */
> 
>            nr = recvmsg(sockfd, &msgh, 0);
>            if (nr == -1)
>                return -1;
> 
>            cmsgp = CMSG_FIRSTHDR(&msgh);
> 
>            /* Check the validity of the 'cmsghdr' */
> 
>            if (cmsgp == NULL ||
>                    cmsgp->cmsg_len != CMSG_LEN(sizeof(int)) ||
>                    cmsgp->cmsg_level != SOL_SOCKET ||
>                    cmsgp->cmsg_type != SCM_RIGHTS) {
>                errno = EINVAL;
>                return -1;
>            }
> 
>            /* Return the received file descriptor to our caller */
> 
>            memcpy(&fd, CMSG_DATA(cmsgp), sizeof(int));
>            return fd;
>        }
> 
>        static void
>        sigchldHandler(int sig)
>        {
>            char *msg  = "\tS: target has terminated; bye\n";
> 
>            write(STDOUT_FILENO, msg, strlen(msg));

white space nit: extra space before "="
efficiency nit: strlen isn't needed, since it can be done with
compile-time constants:

             char msg[] = "\tS: target has terminated; bye\n";
             write(STDOUT_FILENO, msg, sizeof(msg) - 1);

(some optimization levels may already replace the strlen() with sizeof - 1)
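
Spelled out as a standalone helper, in case it is useful for the page.
This is only a sketch; writeBye() is a made-up name, and it takes the
descriptor as a parameter purely so it is easy to exercise:

```c
#include <unistd.h>

/* Write the supervisor's farewell message to 'fd'. Because 'msg' is an
   array rather than a pointer, sizeof(msg) is the string length plus
   the terminating null byte, so no strlen() call is needed. write(2)
   is async-signal-safe, so this is usable from a signal handler. */
static ssize_t
writeBye(int fd)
{
    static const char msg[] = "\tS: target has terminated; bye\n";

    return write(fd, msg, sizeof(msg) - 1);
}
```

Note that this works only with the array form; with "char *msg",
sizeof would yield the size of the pointer.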

>            _exit(EXIT_SUCCESS);
>        }
> 
>        static int
>        seccomp(unsigned int operation, unsigned int flags, void *args)
>        {
>            return syscall(__NR_seccomp, operation, flags, args);
>        }
> 
>        /* The following is the x86-64-specific BPF boilerplate code for checking
>           that the BPF program is running on the right architecture + ABI. At
>           completion of these instructions, the accumulator contains the system
>           call number. */
> 
>        /* For the x32 ABI, all system call numbers have bit 30 set */
> 
>        #define X32_SYSCALL_BIT         0x40000000
> 
>        #define X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR \
>                BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \
>                        (offsetof(struct seccomp_data, arch))), \
>                BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 2), \
>                BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \
>                         (offsetof(struct seccomp_data, nr))), \
>                BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, X32_SYSCALL_BIT, 0, 1), \
>                BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS)
> 
>        /* installNotifyFilter() installs a seccomp filter that generates
>           user-space notifications (SECCOMP_RET_USER_NOTIF) when the process
>           calls mkdir(2); the filter allows all other system calls.
> 
>           The function return value is a file descriptor from which the
>           user-space notifications can be fetched. */
> 
>        static int
>        installNotifyFilter(void)
>        {
>            struct sock_filter filter[] = {
>                X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR,
> 
>                /* mkdir() triggers notification to user-space supervisor */
> 
>                BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mkdir, 0, 1),
>                BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_USER_NOTIF),
> 
>                /* Every other system call is allowed */
> 
>                BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
>            };
> 
>            struct sock_fprog prog = {
>                .len = sizeof(filter) / sizeof(filter[0]),
>                .filter = filter,
>            };
> 
>            /* Install the filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER flag;
>               as a result, seccomp() returns a notification file descriptor. */
> 
>            int notifyFd = seccomp(SECCOMP_SET_MODE_FILTER,
>                                   SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
>            if (notifyFd == -1)
>                errExit("seccomp-install-notify-filter");
> 
>            return notifyFd;
>        }
> 
>        /* Close a pair of sockets created by socketpair() */
> 
>        static void
>        closeSocketPair(int sockPair[2])
>        {
>            if (close(sockPair[0]) == -1)
>                errExit("closeSocketPair-close-0");
>            if (close(sockPair[1]) == -1)
>                errExit("closeSocketPair-close-1");
>        }
> 
>        /* Implementation of the target process; create a child process that:
> 
>           (1) installs a seccomp filter with the
>               SECCOMP_FILTER_FLAG_NEW_LISTENER flag;
>           (2) writes the seccomp notification file descriptor returned from
>               the previous step onto the UNIX domain socket, 'sockPair[0]';
>           (3) calls mkdir(2) for each element of 'argv'.
> 
>           The function return value in the parent is the PID of the child
>           process; the child does not return from this function. */
> 
>        static pid_t
>        targetProcess(int sockPair[2], char *argv[])
>        {
>            pid_t targetPid = fork();
>            if (targetPid == -1)
>                errExit("fork");
> 
>            if (targetPid > 0)          /* In parent, return PID of child */
>                return targetPid;
> 
>            /* Child falls through to here */
> 
>            printf("T: PID = %ld\n", (long) getpid());
> 
>            /* Install seccomp filter(s) */
> 
>            if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
>                errExit("prctl");
> 
>            int notifyFd = installNotifyFilter();
> 
>            /* Pass the notification file descriptor to the tracing process over
>               a UNIX domain socket */
> 
>            if (sendfd(sockPair[0], notifyFd) == -1)
>                errExit("sendfd");
> 
>            /* Notification and socket FDs are no longer needed in target */
> 
>            if (close(notifyFd) == -1)
>                errExit("close-target-notify-fd");
> 
>            closeSocketPair(sockPair);
> 
>            /* Perform a mkdir() call for each of the command-line arguments */
> 
>            for (char **ap = argv; *ap != NULL; ap++) {
>                printf("\nT: about to mkdir(\"%s\")\n", *ap);
> 
>                int s = mkdir(*ap, 0700);
>                if (s == -1)
>                    perror("T: ERROR: mkdir(2)");
>                else
>                    printf("T: SUCCESS: mkdir(2) returned %d\n", s);
>            }
> 
>            printf("\nT: terminating\n");
>            exit(EXIT_SUCCESS);
>        }
> 
>        /* Check that the notification ID provided by a SECCOMP_IOCTL_NOTIF_RECV
>           operation is still valid. It will no longer be valid if the process
>           has terminated. This operation can be used when accessing /proc/PID
>           files in the target process in order to avoid TOCTOU race conditions
>           where the PID that is returned by SECCOMP_IOCTL_NOTIF_RECV terminates
>           and is reused by another process. */
> 
>        static void
>        checkNotificationIdIsValid(int notifyFd, uint64_t id)
>        {
>            if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == -1) {
>                fprintf(stderr, "\tS: notification ID check: "
>                        "target has terminated!!!\n");
> 
>                exit(EXIT_FAILURE);

And now you can do:

		errExit("\tS: notification ID check: "
			"target has terminated! ioctl");

;)

>            }
>        }
> 
>        /* Access the memory of the target process in order to discover the
>           pathname that was given to mkdir() */
> 
>        static bool
>        getTargetPathname(struct seccomp_notif *req, int notifyFd,
>                          char *path, size_t len)
>        {
>            char procMemPath[PATH_MAX];
> 
>            snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", req->pid);
> 
>            int procMemFd = open(procMemPath, O_RDONLY);
>            if (procMemFd == -1)
>                errExit("Supervisor: open");
> 
>            /* Check that the process whose info we are accessing is still alive.
>               If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed
>               in checkNotificationIdIsValid()) succeeds, we know that the
>               /proc/PID/mem file descriptor that we opened corresponds to the
>               process for which we received a notification. If that process
>               subsequently terminates, then read() on that file descriptor
>               will return 0 (EOF). */
> 
>            checkNotificationIdIsValid(notifyFd, req->id);
> 
>            /* Read bytes at the location containing the pathname argument
>               (i.e., the first argument) of the mkdir(2) call */
> 
>            ssize_t nread = pread(procMemFd, path, len, req->data.args[0]);
>            if (nread == -1)
>                errExit("pread");
> 
>            if (nread == 0) {
>                fprintf(stderr, "\tS: pread() of /proc/PID/mem "
>                        "returned 0 (EOF)\n");
>                exit(EXIT_FAILURE);
>            }
> 
>            if (close(procMemFd) == -1)
>                errExit("close-/proc/PID/mem");
> 
>            /* We have no guarantees about what was in the memory of the target
>               process. We therefore treat the buffer returned by pread() as
>               untrusted input. The buffer should be terminated by a null byte;
>               if not, then we will trigger an error for the target process. */
> 
>            for (int j = 0; j < nread; j++)
>                if (path[j] == ' ')

This rendering typo (' ' vs '\0') ends up manifesting badly. ;) The man
source shows:

        if (path[j] == \(aq\0\(aq)

I think this needs to be \\0?

Or it could also be tested as:

	if (strnlen(path, nread) < nread)
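
As a self-contained sketch of that test (pathIsTerminated() is a
made-up name, and the casts are there because pread() returns a
ssize_t while strnlen() works in size_t):

```c
#define _POSIX_C_SOURCE 200809L     /* For strnlen() */
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>

/* Return true if the first 'nread' bytes of 'buf' contain a null
   byte, that is, if 'buf' holds a complete C string. */
static bool
pathIsTerminated(const char *buf, ssize_t nread)
{
    if (nread <= 0)
        return false;

    /* strnlen() never looks past 'nread' bytes; it returns 'nread'
       itself when no null byte was found in that range. */
    return strnlen(buf, (size_t) nread) < (size_t) nread;
}
```

getTargetPathname() could then end with
"return pathIsTerminated(path, nread);" in place of the loop.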

>                    return true;
> 
>            return false;
>        }
> 
>        /* Handle notifications that arrive via the SECCOMP_RET_USER_NOTIF file
>           descriptor, 'notifyFd'. */
> 
>        static void
>        handleNotifications(int notifyFd)
>        {
>            struct seccomp_notif_sizes sizes;
>            char path[PATH_MAX];
> 
>            /* Discover the sizes of the structures that are used to receive
>               notifications and send notification responses, and allocate
>               buffers of those sizes. */
> 
>            if (seccomp(SECCOMP_GET_NOTIF_SIZES, 0, &sizes) == -1)
>                errExit("\tS: seccomp-SECCOMP_GET_NOTIF_SIZES");
> 
>            struct seccomp_notif *req = malloc(sizes.seccomp_notif);
>            if (req == NULL)
>                errExit("\tS: malloc");
> 
>            /* When allocating the response buffer, we must allow for the fact
>               that the user-space binary may have been built with user-space
>               headers where 'struct seccomp_notif_resp' is bigger than the
>               response buffer expected by the (older) kernel. Therefore, we
>               allocate a buffer that is the maximum of the two sizes. This
>               ensures that if the supervisor places bytes into the response
>               structure that are past the response size that the kernel expects,
>               then the supervisor is not touching an invalid memory location. */
> 
>            size_t resp_size = sizes.seccomp_notif_resp;
>            if (sizeof(struct seccomp_notif_resp) > resp_size)
>                resp_size = sizeof(struct seccomp_notif_resp);
> 
>            struct seccomp_notif_resp *resp = malloc(resp_size);
>            if (resp == NULL)
>                errExit("\tS: malloc");
> 
>            /* Loop handling notifications */
> 
>            for (;;) {
>                /* Wait for next notification, returning info in '*req' */
> 
>                memset(req, 0, sizes.seccomp_notif);
>                if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_RECV, req) == -1) {
>                    if (errno == EINTR)
>                        continue;
>                    errExit("Supervisor: ioctl-SECCOMP_IOCTL_NOTIF_RECV");
>                }
> 
>                printf("\tS: got notification (ID %#llx) for PID %d\n",
>                        req->id, req->pid);
> 
>                /* The only system call that can generate a notification event
>                   is mkdir(2). Nevertheless, we check that the notified system
>                   call is indeed mkdir() as kind of future-proofing of this
>                   code in case the seccomp filter is later modified to
>                   generate notifications for other system calls. */
> 
>                if (req->data.nr != __NR_mkdir) {
>                    printf("\tS: notification contained unexpected "
>                            "system call number; bye!!!\n");
>                    exit(EXIT_FAILURE);
>                }
> 
>                bool pathOK = getTargetPathname(req, notifyFd, path,
>                                                sizeof(path));
> 
>                /* Prepopulate some fields of the response */
> 
>                resp->id = req->id;     /* Response includes notification ID */
>                resp->flags = 0;
>                resp->val = 0;
> 
>                /* If the target pathname was not valid, trigger an EINVAL error;
>                   if the directory is in /tmp, then create it on behalf of the
>                   supervisor; if the pathname starts with '.', tell the kernel
>                   to let the target process execute the mkdir(); otherwise, give
>                   an error for a directory pathname in any other location. */
> 
>                if (!pathOK) {
>                    resp->error = -EINVAL;
>                    printf("\tS: spoofing error for invalid pathname (%s)\n",
>                            strerror(-resp->error));
>                } else if (strncmp(path, "/tmp/", strlen("/tmp/")) == 0) {
>                    printf("\tS: executing: mkdir(\"%s\", %#llo)\n",
>                            path, req->data.args[1]);
> 
>                    if (mkdir(path, req->data.args[1]) == 0) {
>                        resp->error = 0;            /* "Success" */
>                        resp->val = strlen(path);   /* Used as return value of
>                                                       mkdir() in target */
>                        printf("\tS: success! spoofed return = %lld\n",
>                                resp->val);
>                    } else {
> 
>                        /* If mkdir() failed in the supervisor, pass the error
>                           back to the target */
> 
>                        resp->error = -errno;
>                        printf("\tS: failure! (errno = %d; %s)\n", errno,
>                                strerror(errno));
>                    }
>                } else if (strncmp(path, "./", strlen("./")) == 0) {
>                    resp->error = resp->val = 0;
>                    resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
>                    printf("\tS: target can execute system call\n");
>                } else {
>                    resp->error = -EOPNOTSUPP;
>                    printf("\tS: spoofing error response (%s)\n",
>                            strerror(-resp->error));
>                }
> 
>                /* Send a response to the notification */
> 
>                printf("\tS: sending response "
>                        "(flags = %#x; val = %lld; error = %d)\n",
>                        resp->flags, resp->val, resp->error);
> 
>                if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp) == -1) {
>                    if (errno == ENOENT)
>                        printf("\tS: response failed with ENOENT; "
>                                "perhaps target process's syscall was "
>                                "interrupted by a signal?\n");
>                    else
>                        perror("ioctl-SECCOMP_IOCTL_NOTIF_SEND");
>                }
> 
>                /* If the pathname is just "/bye", then the supervisor
>                   terminates. This allows us to see what happens if the
>                   target process makes further calls to mkdir(2). */
> 
>                if (strcmp(path, "/bye") == 0) {
>                    printf("\tS: terminating **********\n");
>                    exit(EXIT_FAILURE);
>                }
>            }
>        }
> 
>        /* Implementation of the supervisor process:
> 
>           (1) obtains the notification file descriptor from 'sockPair[1]'
>           (2) handles notifications that arrive on that file descriptor. */
> 
>        static void
>        supervisor(int sockPair[2])
>        {
>            int notifyFd = recvfd(sockPair[1]);
>            if (notifyFd == -1)
>                errExit("recvfd");
> 
>            closeSocketPair(sockPair);  /* We no longer need the socket pair */
> 
>            handleNotifications(notifyFd);
>        }
> 
>        int
>        main(int argc, char *argv[])
>        {
>            int sockPair[2];
> 
>            setbuf(stdout, NULL);
> 
>            if (argc < 2) {
>                fprintf(stderr, "At least one pathname argument is required\n");
>                exit(EXIT_FAILURE);
>            }
> 
>            /* Create a UNIX domain socket that is used to pass the seccomp
>               notification file descriptor from the target process to the
>               supervisor process. */
> 
>            if (socketpair(AF_UNIX, SOCK_STREAM, 0, sockPair) == -1)
>                errExit("socketpair");
> 
>            /* Create a child process--the "target"--that installs seccomp
>               filtering. The target process writes the seccomp notification
>               file descriptor onto 'sockPair[0]' and then calls mkdir(2) for
>               each directory in the command-line arguments. */
> 
>            (void) targetProcess(sockPair, &argv[optind]);
> 
>            /* Catch SIGCHLD when the target terminates, so that the
>               supervisor can also terminate. */
> 
>            struct sigaction sa;
>            sa.sa_handler = sigchldHandler;
>            sa.sa_flags = 0;
>            sigemptyset(&sa.sa_mask);
>            if (sigaction(SIGCHLD, &sa, NULL) == -1)
>                errExit("sigaction");
> 
>            supervisor(sockPair);
> 
>            exit(EXIT_SUCCESS);
>        }
> 
> SEE ALSO
>        ioctl(2), seccomp(2)
> 
>        A further example program can be found in the kernel source file
>        samples/seccomp/user-trap.c.
> 
> Linux                           2020-10-01          SECCOMP_USER_NOTIF(2)

Thank you so much for this documentation and example! :)

-Kees

[1] https://git.kernel.org/linus/dfe719fef03d752f1682fa8aeddf30ba501c8555
[2] https://lore.kernel.org/lkml/CAG48ez3kpEDO1x_HfvOM2R9M78Ach9O_4+Pjs-vLLfqvZL+13A@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CAGXu5jKzif=vp6gn5ZtrTx-JTN367qFphobnt9s=awbaafwoUw@mail.gmail.com/

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: For review: seccomp_user_notif(2) manual page
  2020-10-01  1:52           ` Jann Horn
  2020-10-01  2:14             ` Jann Horn
  2020-10-01  7:49             ` Michael Kerrisk (man-pages)
@ 2020-10-26  0:32             ` Kees Cook
  2020-10-26  9:51               ` Jann Horn
  2 siblings, 1 reply; 52+ messages in thread
From: Kees Cook @ 2020-10-26  0:32 UTC (permalink / raw)
  To: Jann Horn
  Cc: Tycho Andersen, Michael Kerrisk (man-pages),
	Sargun Dhillon, Christian Brauner, linux-man, lkml, Aleksa Sarai,
	Alexei Starovoitov, Will Drewry, bpf, Song Liu, Daniel Borkmann,
	Andy Lutomirski, Linux Containers, Giuseppe Scrivano,
	Robert Sesek

On Thu, Oct 01, 2020 at 03:52:02AM +0200, Jann Horn wrote:
> On Thu, Oct 1, 2020 at 1:25 AM Tycho Andersen <tycho@tycho.pizza> wrote:
> > On Thu, Oct 01, 2020 at 01:11:33AM +0200, Jann Horn wrote:
> > > On Thu, Oct 1, 2020 at 1:03 AM Tycho Andersen <tycho@tycho.pizza> wrote:
> > > > On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) wrote:
> > > > > On 9/30/20 5:03 PM, Tycho Andersen wrote:
> > > > > > On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
> > > > > >>        ┌─────────────────────────────────────────────────────┐
> > > > > >>        │FIXME                                                │
> > > > > >>        ├─────────────────────────────────────────────────────┤
> > > > > >>        │From my experiments,  it  appears  that  if  a  SEC‐ │
> > > > > >>        │COMP_IOCTL_NOTIF_RECV   is  done  after  the  target │
> > > > > >>        │process terminates, then the ioctl()  simply  blocks │
> > > > > >>        │(rather than returning an error to indicate that the │
> > > > > >>        │target process no longer exists).                    │
> > > > > >
> > > > > > Yeah, I think Christian wanted to fix this at some point,
> > > > >
> > > > > > Do you have a pointer to that discussion? I could not find it with a
> > > > > quick search.
> > > > >
> > > > > > but it's a
> > > > > > bit sticky to do.
> > > > >
> > > > > Can you say a few words about the nature of the problem?
> > > >
> > > > I remembered wrong, it's actually in the tree: 99cdb8b9a573 ("seccomp:
> > > > notify about unused filter"). So maybe there's a bug here?
> > >
> > > That thing only notifies on ->poll, it doesn't unblock ioctls; and
> > > Michael's sample code uses SECCOMP_IOCTL_NOTIF_RECV to wait. So that
> > > commit doesn't have any effect on this kind of usage.
> >
> > Yes, thanks. And the ones stuck in RECV are waiting on a semaphore so
> > we don't have a count of all of them, unfortunately.
> >
> > We could maybe look inside the wait_list, but that will probably make
> > people angry :)
> 
> The easiest way would probably be to open-code the semaphore-ish part,
> and let the semaphore and poll share the waitqueue. The current code
> kind of mirrors the semaphore's waitqueue in the wqh - open-coding the
> entire semaphore would IMO be cleaner than that. And it's not like
> semaphore semantics are even a good fit for this code anyway.
> 
> Let's see... if we didn't have the existing UAPI to worry about, I'd
> do it as follows (*completely* untested). That way, the ioctl would
> block exactly until either there actually is a request to deliver or
> there are no more users of the filter. The problem is that if we just
> apply this patch, existing users of SECCOMP_IOCTL_NOTIF_RECV that use
> an event loop and don't set O_NONBLOCK will be screwed. So we'd

Wait, why? Do you mean an ioctl-calling loop (rather than a poll event
loop)? I think poll would be fine, but a "try calling RECV and expect to
return ENOENT" loop would change. But I don't think anyone would do this
exactly because it _currently_ acts like O_NONBLOCK, yes?
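
To make the distinction concrete, here is a minimal sketch of the poll-style
wait that would be unaffected by the change. It assumes a kernel that has
99cdb8b9a573, so that the listening fd reports a hang-up once the filter is
unused; waitForEvent() is a hypothetical helper name, not something from
the page or from the kernel. A supervisor would call it on the listening
fd and only issue SECCOMP_IOCTL_NOTIF_RECV when it returns 1.

```c
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: wait until 'fd' has an event to read.
   Returns 1 if 'fd' is readable, 0 on timeout, and -1 on error or
   if the other side is gone (POLLHUP/POLLERR without POLLIN). */
static int
waitForEvent(int fd, int timeoutMs)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    for (;;) {
        int n = poll(&pfd, 1, timeoutMs);

        if (n == -1 && errno == EINTR)
            continue;       /* Interrupted by a signal; retry (this
                               restarts the full timeout) */
        if (n <= 0)
            return n;       /* 0 = timeout, -1 = poll() error */
        if (pfd.revents & POLLIN)
            return 1;       /* Something is ready to be received */
        return -1;          /* Hang-up or error, nothing to receive */
    }
}
```

With this shape, a supervisor never sits inside a blocking RECV after the
target has gone away, which is the case the FIXME describes.
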

> probably also have to add some stupid counter in place of the
> semaphore's counter that we can use to preserve the old behavior of
> returning -ENOENT once for each cancelled request. :(

I only see this in Debian Code Search:
https://sources.debian.org/src/crun/0.15+dfsg-1/src/libcrun/seccomp_notify.c/?hl=166#L166
which is using epoll_wait():
https://sources.debian.org/src/crun/0.15+dfsg-1/src/libcrun/container.c/?hl=1326#L1326

I expect LXC is using it. :)

Let's change it ASAP! ;)

-Kees

> 
> I guess this is a nice point in favor of Michael's usual complaint
> that if there are no man pages for a feature by the time the feature
> lands upstream, there's a higher chance that the UAPI will suck
> forever...
> 
> 
> 
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 676d4af62103..f0f4c68e0bc6 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -138,7 +138,6 @@ struct seccomp_kaddfd {
>   * @notifications: A list of struct seccomp_knotif elements.
>   */
>  struct notification {
> -       struct semaphore request;
>         u64 next_id;
>         struct list_head notifications;
>  };
> @@ -859,7 +858,6 @@ static int seccomp_do_user_notification(int this_syscall,
>         list_add(&n.list, &match->notif->notifications);
>         INIT_LIST_HEAD(&n.addfd);
> 
> -       up(&match->notif->request);
>         wake_up_poll(&match->wqh, EPOLLIN | EPOLLRDNORM);
>         mutex_unlock(&match->notify_lock);
> 
> @@ -1175,9 +1173,10 @@ find_notification(struct seccomp_filter *filter, u64 id)
> 
> 
>  static long seccomp_notify_recv(struct seccomp_filter *filter,
> -                               void __user *buf)
> +                               void __user *buf, bool blocking)
>  {
>         struct seccomp_knotif *knotif = NULL, *cur;
> +       DECLARE_WAITQUEUE(wait, current);
>         struct seccomp_notif unotif;
>         ssize_t ret;
> 
> @@ -1190,11 +1189,9 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
> 
>         memset(&unotif, 0, sizeof(unotif));
> 
> -       ret = down_interruptible(&filter->notif->request);
> -       if (ret < 0)
> -               return ret;
> -
>         mutex_lock(&filter->notify_lock);
> +
> +retry:
>         list_for_each_entry(cur, &filter->notif->notifications, list) {
>                 if (cur->state == SECCOMP_NOTIFY_INIT) {
>                         knotif = cur;
> @@ -1202,14 +1199,32 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
>                 }
>         }
> 
> -       /*
> -        * If we didn't find a notification, it could be that the task was
> -        * interrupted by a fatal signal between the time we were woken and
> -        * when we were able to acquire the rw lock.
> -        */
>         if (!knotif) {
> -               ret = -ENOENT;
> -               goto out;
> +               /* This has to happen before checking &filter->users. */
> +               prepare_to_wait(&filter->wqh, &wait, TASK_INTERRUPTIBLE);
> +
> +               /*
> +                * If all users of the filter are gone, throw an error instead
> +                * of pointlessly continuing to block.
> +                */
> +               if (refcount_read(&filter->users) == 0) {
> +                       ret = -ENOTCONN;
> +                       goto out;
> +               }
> +               if (blocking) {
> +                       /* No notifications pending - wait for one, then retry. */
> +                       mutex_unlock(&filter->notify_lock);
> +                       schedule();
> +                       mutex_lock(&filter->notify_lock);
> +                       if (signal_pending(current)) {
> +                               ret = -EINTR;
> +                               goto out;
> +                       }
> +                       goto retry;
> +               } else {
> +                       ret = -ENOENT;
> +                       goto out;
> +               }
>         }
> 
>         unotif.id = knotif->id;
> @@ -1220,6 +1235,7 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
>         wake_up_poll(&filter->wqh, EPOLLOUT | EPOLLWRNORM);
>         ret = 0;
>  out:
> +       finish_wait(&filter->wqh, &wait);
>         mutex_unlock(&filter->notify_lock);
> 
>         if (ret == 0 && copy_to_user(buf, &unotif, sizeof(unotif))) {
> @@ -1233,10 +1249,8 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
>                  */
>                 mutex_lock(&filter->notify_lock);
>                 knotif = find_notification(filter, unotif.id);
> -               if (knotif) {
> +               if (knotif)
>                         knotif->state = SECCOMP_NOTIFY_INIT;
> -                       up(&filter->notif->request);
> -               }
>                 mutex_unlock(&filter->notify_lock);
>         }
> 
> @@ -1412,11 +1426,12 @@ static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
>  {
>         struct seccomp_filter *filter = file->private_data;
>         void __user *buf = (void __user *)arg;
> +       bool blocking = !(file->f_flags & O_NONBLOCK);
> 
>         /* Fixed-size ioctls */
>         switch (cmd) {
>         case SECCOMP_IOCTL_NOTIF_RECV:
> -               return seccomp_notify_recv(filter, buf);
> +               return seccomp_notify_recv(filter, buf, blocking);
>         case SECCOMP_IOCTL_NOTIF_SEND:
>                 return seccomp_notify_send(filter, buf);
>         case SECCOMP_IOCTL_NOTIF_ID_VALID_WRONG_DIR:
> @@ -1485,7 +1500,6 @@ static struct file *init_listener(struct seccomp_filter *filter)
>         if (!filter->notif)
>                 goto out;
> 
> -       sema_init(&filter->notif->request, 0);
>         filter->notif->next_id = get_random_u64();
>         INIT_LIST_HEAD(&filter->notif->notifications);

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: For review: seccomp_user_notif(2) manual page
  2020-10-24 12:52           ` Michael Kerrisk (man-pages)
@ 2020-10-26  9:32             ` Jann Horn
  2020-10-26  9:47               ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 52+ messages in thread
From: Jann Horn @ 2020-10-26  9:32 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Tycho Andersen, Sargun Dhillon, Kees Cook, Christian Brauner,
	linux-man, lkml, Aleksa Sarai, Alexei Starovoitov, Will Drewry,
	bpf, Song Liu, Daniel Borkmann, Andy Lutomirski,
	Linux Containers, Giuseppe Scrivano, Robert Sesek

On Sat, Oct 24, 2020 at 2:53 PM Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
> On 10/17/20 2:25 AM, Jann Horn wrote:
> > On Fri, Oct 16, 2020 at 8:29 PM Michael Kerrisk (man-pages)
> > <mtk.manpages@gmail.com> wrote:
[...]
> >> I'm not sure if I should write anything about this small UAPI
> >> breakage in BUGS, or not. Your thoughts?
> >
> > Thinking about it a bit more: Any code that relies on pause() or
> > epoll_wait() not restarting is buggy anyway, right? Because a signal
> > could also arrive directly before entering the syscall, while
> > userspace code is still executing? So one could argue that we're just
> > enlarging a preexisting race. (Unless the signal handler checks the
> > interrupted register state to figure out whether we already entered
> > syscall handling?)
>
> Yes, that all makes sense.
>
> > If userspace relies on non-restarting behavior, it should be using
> > something like epoll_pwait(). And that stuff only unblocks signals
> > > after we've already passed the seccomp checks on entry.
>
> Thanks for elaborating that detail, since as soon as you talked
> about "enlarging a preexisting race" above, I immediately wondered
> about sigsuspend(), pselect(), etc.
>
> (Mind you, I still wonder about the effect on system calls that
> are normally nonrestartable because they have timeouts. My
> understanding is that the kernel doesn't restart those system
> calls because it's impossible for the kernel to restart the call
> with the right timeout value. I wonder what happens when those
> system calls are restarted in the scenario we're discussing.)

Ah, that's an interesting edge case...

> Anyway, returning to your point... So, to be clear (and to
> quickly remind myself in case I one day reread this thread),
> there is not a problem with sigsuspend(), pselect(), ppoll(),
> and epoll_pwait() since:
>
> * Before the syscall, signals are blocked in the target.
> * Inside the syscall, signals are still blocked at the time
>   the check is made for seccomp filters.
> * If a seccomp user-space notification event kicks in, the target
>   is put to sleep with the signals still blocked.
> * The signal will only get delivered after the supervisor either
>   triggers a spoofed success/failure return in the target or the
>   supervisor sends a CONTINUE response to the kernel telling it
>   to execute the target's system call. Either way, there won't be
>   any restarting of the target's system call (and the supervisor
>   thus won't see multiple notifications).
>
> (Right?)

Yeah.

[...]
> > So we should probably document the restarting behavior as something
> > the supervisor has to deal with in the manpage; but for the
> > "non-restarting syscalls can restart from the target's perspective"
> > aspect, it might be enough to document this as quirky behavior that
> > can't actually break correct code? (Or not document it at all. Dunno.)
>
> So, I've added the following to the page:
>
>    Interaction with SA_RESTART signal handlers
>        Consider the following scenario:
>
>        · The target process has used sigaction(2)  to  install  a  signal
>          handler with the SA_RESTART flag.
>
>        · The target has made a system call that triggered a seccomp user-
>          space notification and the target is currently blocked until the
>          supervisor sends a notification response.
>
>        · A  signal  is  delivered to the target and the signal handler is
>          executed.
>
>        · When  (if)  the  supervisor  attempts  to  send  a  notification
>          response,  the SECCOMP_IOCTL_NOTIF_SEND ioctl(2)  operation will
>          fail with the ENOENT error.
>
>        In this scenario, the kernel  will  restart  the  target's  system
>        call.   Consequently,  the  supervisor  will receive another user-
>        space notification.  Thus, depending on how many times the blocked
>        system call is interrupted by a signal handler, the supervisor may
>        receive multiple notifications for the same  system  call  in  the
>        target.
>
>        One  oddity  is  that  system call restarting as described in this
>        scenario will occur even for the blocking system calls  listed  in
>        signal(7) that would never normally be restarted by the SA_RESTART
>        flag.
>
> Does that seem okay?

Sounds good to me.

> In addition, I've queued a cross-reference in signal(7):
>
>        In certain circumstances, the seccomp(2) user-space notifi‐
>        cation  feature can lead to restarting of system calls that
>        would otherwise  never  be  restarted  by  SA_RESTART;  for
>        details, see seccomp_user_notif(2).

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: For review: seccomp_user_notif(2) manual page
  2020-10-26  0:19     ` Kees Cook
@ 2020-10-26  9:39       ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 52+ messages in thread
From: Michael Kerrisk (man-pages) @ 2020-10-26  9:39 UTC (permalink / raw)
  To: Kees Cook
  Cc: mtk.manpages, Tycho Andersen, Sargun Dhillon, Christian Brauner,
	linux-man, lkml, Aleksa Sarai, Jann Horn, Alexei Starovoitov,
	wad, bpf, Song Liu, Daniel Borkmann, Andy Lutomirski,
	Linux Containers, Giuseppe Scrivano, Robert Sesek

Hello Kees,

On 10/26/20 1:19 AM, Kees Cook wrote:
> On Thu, Oct 15, 2020 at 01:24:03PM +0200, Michael Kerrisk (man-pages) wrote:
>> On 10/1/20 1:39 AM, Kees Cook wrote:
>>> I'll comment more later, but I've run out of time today and I didn't see
>>> anyone mention this detail yet in the existing threads... :)
>>
>> Later never came :-). But, I hope you may have comments for the 
>> next draft, which I will send out soon.
> 
> Later is now, and Soon approaches!
> 
> I finally caught up and read through this whole thread. Thank you all
> for the bug fix[1], and I'm looking forward to more[2]. :)


> For my reply I figured I'd base it on the current draft, so here's a
> simulated quote based on the seccomp_user_notif branch of
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git
> through commit 71101158fe330af5a26552447a0bb433b69e15b7
> $ COLUMNS=75 man --nh --nj man2/seccomp_user_notif.2 | sed 's/^/> /'

Thanks for reviewing the latest version!

> On Sun, Oct 25, 2020 at 01:54:05PM +0100, Michael Kerrisk (man-pages) wrote:
>> SECCOMP_USER_NOTIF(2)   Linux Programmer's Manual   SECCOMP_USER_NOTIF(2)
>>
>> NAME
>>        seccomp_user_notif - Seccomp user-space notification mechanism
>>
>> SYNOPSIS
>>        #include <linux/seccomp.h>
>>        #include <linux/filter.h>
>>        #include <linux/audit.h>
>>
>>        int seccomp(unsigned int operation, unsigned int flags, void *args);
>>
>>        #include <sys/ioctl.h>
>>
>>        int ioctl(int fd, SECCOMP_IOCTL_NOTIF_RECV,
>>                  struct seccomp_notif *req);
>>        int ioctl(int fd, SECCOMP_IOCTL_NOTIF_SEND,
>>                  struct seccomp_notif_resp *resp);
>>        int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ID_VALID, __u64 *id);
>>
>> DESCRIPTION
>>        This page describes the user-space notification mechanism provided
>>        by the Secure Computing (seccomp) facility.  As well as the use of
>>        the SECCOMP_FILTER_FLAG_NEW_LISTENER flag, the
>>        SECCOMP_RET_USER_NOTIF action value, and the
>>        SECCOMP_GET_NOTIF_SIZES operation described in seccomp(2), this
>>        mechanism involves the use of a number of related ioctl(2)
>>        operations (described below).
>>
>>    Overview
>>        In conventional usage of a seccomp filter, the decision about how
>>        to treat a system call is made by the filter itself.  By contrast,
>>        the user-space notification mechanism allows the seccomp filter to
>>        delegate the handling of the system call to another user-space
>>        process.  Note that this mechanism is explicitly not intended as a
>>        method for implementing security policy; see NOTES.
>>
>>        In the discussion that follows, the thread(s) on which the seccomp
>>        filter is installed is (are) referred to as the target, and the
>>        process that is notified by the user-space notification mechanism
>>        is referred to as the supervisor.
>>
>>        A suitably privileged supervisor can use the user-space
>>        notification mechanism to perform actions on behalf of the target.
>>        The advantage of the user-space notification mechanism is that the
>>        supervisor will usually be able to retrieve information about the
>>        target and the performed system call that the seccomp filter
>>        itself cannot.  (A seccomp filter is limited in the information it
>>        can obtain and the actions that it can perform because it is
>>        running on a virtual machine inside the kernel.)
>>
>>        An overview of the steps performed by the target and the
>>        supervisor is as follows:
>>
>>        1. The target establishes a seccomp filter in the usual manner,
>>           but with two differences:
>>
>>           • The seccomp(2) flags argument includes the flag
>>             SECCOMP_FILTER_FLAG_NEW_LISTENER.  Consequently, the return
>>             value  of the (successful) seccomp(2) call is a new
> 
> nit: extra space

Thanks. Fixed.

>>             "listening" file descriptor that can be used to receive
>>             notifications.  Only one "listening" seccomp filter can be
>>             installed for a thread.
> 
> I like this limitation, but I expect that it'll need to change in the
> future. Even with LSMs, we see the need for arbitrary stacking, and the
> idea of there being only 1 supervisor will eventually break down. Right
> now there is only 1 because only container managers are using this
> feature. But if some daemon starts using it to isolate some thread,
> suddenly it might break if a container manager is trying to listen to it
> too, etc. I expect it won't be needed soon, but I do think it'll change.

Thanks for the background. (I added your text in a comment in the
page, just for my own reference in the future.)

>>           • In cases where it is appropriate, the seccomp filter returns
>>             the action value SECCOMP_RET_USER_NOTIF.  This return value
>>             will trigger a notification event.
>>
>>        2. In order that the supervisor can obtain notifications using the
>>           listening file descriptor, (a duplicate of) that file
>>           descriptor must be passed from the target to the supervisor.
> 
> Yet another reason to have an "activate on exec" mode for seccomp. With

Funnily enough, I was having an in-person conversation just last week
with someone else who was interested in "activate-on-exec".

> no_new_privs _not_ being delayed in such a way, I think it'd be safe to
> add. The supervisor would get the fd immediately, and then once it
> fork/execed suddenly the whole thing would activate, and no fd passing
> needed.
> 
> The "on exec" boundary is really only needed for oblivious targets. For
> a coordinated target, I've thought it might be nice to have an arbitrary
> "go" point, where the target could call seccomp() with something like a
> SECCOMP_ACTIVATE_DELAYED_FILTERS operation. This lets any process
> initialization happen that might need to do things that would be blocked
> by filters, etc.
> 
> Before:
> 
> 	fork
> 	install some filters that don't block initialization
> 	exec
> 	do some initialization
> 	install more filters, maybe block exec, seccomp
> 	run
> 
> After:
> 
> 	fork
> 	install delayed filters
> 	exec
> 	do some initialization
> 	activate delayed filters
> 	run
> 
> In practice, the two-stage filter application has been fine, if
> sometimes a bit complex (e.g. for user_notif, "do some initialization"
> includes figuring out how to pass the fd back to the supervisor, etc).

Yes, something like what you describe above would certainly make some
uses easier. Activate-on-exec seems to me the most compelling need
though.

>>           One way in which this could be done is by passing the file
>>           descriptor over a UNIX domain socket connection between the
>>           target and the supervisor (using the SCM_RIGHTS ancillary
>>           message type described in unix(7)).
>>
>>        3. The supervisor will receive notification events on the
>>           listening file descriptor.  These events are returned as
>>           structures of type seccomp_notif.  Because this structure and
>>           its size may evolve over kernel versions, the supervisor must
>>           first determine the size of this structure using the seccomp(2)
>>           SECCOMP_GET_NOTIF_SIZES operation, which returns a structure of
>>           type seccomp_notif_sizes.  The supervisor allocates a buffer of
>>           size seccomp_notif_sizes.seccomp_notif bytes to receive
>>           notification events.  In addition, the supervisor allocates
>>           another buffer of size seccomp_notif_sizes.seccomp_notif_resp
>>           bytes for the response (a struct seccomp_notif_resp structure)
>>           that it will provide to the kernel (and thus the target).
>>
>>        4. The target then performs its workload, which includes system
>>           calls that will be controlled by the seccomp filter.  Whenever
>>           one of these system calls causes the filter to return the
>>           SECCOMP_RET_USER_NOTIF action value, the kernel does not (yet)
>>           execute the system call; instead, execution of the target is
>>           temporarily blocked inside the kernel (in a sleep state that is
>>           interruptible by signals) and a notification event is generated
>>           on the listening file descriptor.
>>
>>        5. The supervisor can now repeatedly monitor the listening file
>>           descriptor for SECCOMP_RET_USER_NOTIF-triggered events.  To do
>>           this, the supervisor uses the SECCOMP_IOCTL_NOTIF_RECV ioctl(2)
>>           operation to read information about a notification event; this
>>           operation blocks until an event is available.  The operation
>>           returns a seccomp_notif structure containing information about
>>           the system call that is being attempted by the target.
>>
>>        6. The seccomp_notif structure returned by the
>>           SECCOMP_IOCTL_NOTIF_RECV operation includes the same
>>           information (a seccomp_data structure) that was passed to the
>>           seccomp filter.  This information allows the supervisor to
>>           discover the system call number and the arguments for the
>>           target's system call.  In addition, the notification event
>>           contains the ID of the thread that triggered the notification.
> 
> Should "cookie" be at least named here, just to provide a bit more
> context for when it is mentioned in 8 below? E.g.:
> 
> 			       ... In addition, the notification event
> 	    contains the triggering thread's ID and a unique cookie to be
> 	    used in subsequent SECCOMP_IOCTL_NOTIF_ID_VALID and
> 	    SECCOMP_IOCTL_NOTIF_SEND operations.

Good catch! Changed as you suggest. (And thanks so much for all
your suggested rewordings; that makes things *much* easier for me.)

>>           The information in the notification can be used to discover the
>>           values of pointer arguments for the target's system call.
>>           (This is something that can't be done from within a seccomp
>>           filter.)  One way in which the supervisor can do this is to
>>           open the corresponding /proc/[tid]/mem file (see proc(5)) and
>>           read bytes from the location that corresponds to one of the
>>           pointer arguments whose value is supplied in the notification
>>           event.  (The supervisor must be careful to avoid a race
>>           condition that can occur when doing this; see the description
>>           of the SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.)
>>           In addition, the supervisor can access other system information
>>           that is visible in user space but which is not accessible from
>>           a seccomp filter.
>>
>>        7. Having obtained information as per the previous step, the
>>           supervisor may then choose to perform an action in response to
>>           the target's system call (which, as noted above, is not
>>           executed when the seccomp filter returns the
>>           SECCOMP_RET_USER_NOTIF action value).
>>
>>           One example use case here relates to containers.  The target
>>           may be located inside a container where it does not have
>>           sufficient capabilities to mount a filesystem in the
>>           container's mount namespace.  However, the supervisor may be a
>>           more privileged process that does have sufficient capabilities
>>           to perform the mount operation.
>>
>>        8. The supervisor then sends a response to the notification.  The
>>           information in this response is used by the kernel to construct
>>           a return value for the target's system call and provide a value
>>           that will be assigned to the errno variable of the target.
>>
>>           The response is sent using the SECCOMP_IOCTL_NOTIF_SEND
>>           ioctl(2) operation, which is used to transmit a
>>           seccomp_notif_resp structure to the kernel.  This structure
>>           includes a cookie value that the supervisor obtained in the
>>           seccomp_notif structure returned by the
>>           SECCOMP_IOCTL_NOTIF_RECV operation.  This cookie value allows
>>           the kernel to associate the response with the target.
> 
> Describing where the cookie came from seems like it should live in 6
> above. A reader would have to take this new info and figure out where
> SECCOMP_IOCTL_NOTIF_RECV was described and piece it together.

Yeah. I hate it when the documentation loses the reader like that :-}.

> With the
> suggestion to 6 above, maybe:
> 
>                                                      ... This structure
>             must include the cookie value that the supervisor obtained in
>             the seccomp_notif structure returned by the
> 	    SECCOMP_IOCTL_NOTIF_RECV operation, which allows the kernel
>             to associate the response with the target.

Great! Changed.

>>        9. Once the notification has been sent, the system call in the
>>           target thread unblocks, returning the information that was
>>           provided by the supervisor in the notification response.
>>
>>        As a variation on the last two steps, the supervisor can send a
>>        response that tells the kernel that it should execute the target
>>        thread's system call; see the discussion of
>>        SECCOMP_USER_NOTIF_FLAG_CONTINUE, below.
>>
>>    ioctl(2) operations
>>        The following ioctl(2) operations are provided to support seccomp
>>        user-space notification.  For each of these operations, the first
>>        (file descriptor) argument of ioctl(2) is the listening file
>>        descriptor returned by a call to seccomp(2) with the
>>        SECCOMP_FILTER_FLAG_NEW_LISTENER flag.
>>
>>        SECCOMP_IOCTL_NOTIF_RECV
>>               This operation is used to obtain a user-space notification
>>               event.  If no such event is currently pending, the
>>               operation blocks until an event occurs.  The third ioctl(2)
>>               argument is a pointer to a structure of the following form
>>               which contains information about the event.  This structure
>>               must be zeroed out before the call.
>>
>>                   struct seccomp_notif {
>>                       __u64  id;              /* Cookie */
>>                       __u32  pid;             /* TID of target thread */
> 
> Should we rename this variable from pid to tid? Yes it's UAPI, but yay for
> anonymous unions:
> 
> struct seccomp_notif {
> 	__u64		id;		/* Cookie */
> 	union {
> 		__u32	pid;
> 		__u32	tid;		/* TID of target thread */
> 	};
> 	__u32  flags;			/* Currently unused (0) */
> 	struct seccomp_data data;	/* See seccomp(2) */
> };

Yes, it would be nice to make this change. But, already there
are so many places in the UAPI where the pid/tid is messed up :-(.

>>                       __u32  flags;           /* Currently unused (0) */
>>                       struct seccomp_data data;   /* See seccomp(2) */
>>                   };
>>
>>               The fields in this structure are as follows:
>>
>>               id     This is a cookie for the notification.  Each such
>>                      cookie is guaranteed to be unique for the
>>                      corresponding seccomp filter.
>>
>>                      • It can be used with the
>>                        SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation to
>>                        verify that the target is still alive.
>>
>>                      • When returning a notification response to the
>>                        kernel, the supervisor must include the cookie
>>                        value in the seccomp_notif_resp structure that is
>>                        specified as the argument of the
>>                        SECCOMP_IOCTL_NOTIF_SEND operation.
>>
>>               pid    This is the thread ID of the target thread that
>>                      triggered the notification event.
>>
>>               flags  This is a bit mask of flags providing further
>>                      information on the event.  In the current
>>                      implementation, this field is always zero.
>>
>>               data   This is a seccomp_data structure containing
>>                      information about the system call that triggered the
>>                      notification.  This is the same structure that is
>>                      passed to the seccomp filter.  See seccomp(2) for
>>                      details of this structure.
>>
>>               On success, this operation returns 0; on failure, -1 is
>>               returned, and errno is set to indicate the cause of the
>>               error.  This operation can fail with the following errors:
>>
>>               EINVAL (since Linux 5.5)
>>                      The seccomp_notif structure that was passed to the
>>                      call contained nonzero fields.
>>
>>               ENOENT The target thread was killed by a signal as the
>>                      notification information was being generated, or the
>>                      target's (blocked) system call was interrupted by a
>>                      signal handler.
>>
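
For illustration only, here is a minimal sketch of how a supervisor
might wrap the SECCOMP_IOCTL_NOTIF_RECV operation described above. The
helper name recvNotification() is hypothetical. The sketch assumes that
a statically sized struct seccomp_notif is adequate, and that retrying
is the right reaction when the supervisor's own ioctl(2) call is
interrupted by a signal (EINTR); all other errors, including ENOENT,
are passed back to the caller.

```c
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>

/* Hypothetical helper: fetch one notification from the listening file
   descriptor. Returns 0 on success, or -1 with errno set. */

static int
recvNotification(int notifyFd, struct seccomp_notif *req)
{
    for (;;) {
        /* The structure must be zeroed before every call; since
           Linux 5.5, nonzero fields cause the call to fail with EINVAL */

        memset(req, 0, sizeof(*req));

        if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_RECV, req) == 0)
            return 0;

        /* Assumption: EINTR means that the supervisor itself was
           interrupted by a signal handler, so we simply try again.
           Any other error (e.g., ENOENT) is left to the caller. */

        if (errno != EINTR)
            return -1;
    }
}
```

A caller that sees -1 with errno set to ENOENT would typically just go
back to waiting for the next notification.
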
>>        SECCOMP_IOCTL_NOTIF_ID_VALID
>>               This operation can be used to check that a notification ID
>>               returned by an earlier SECCOMP_IOCTL_NOTIF_RECV operation
>>               is still valid (i.e., that the target still exists).
> 
> Maybe clarify a bit more, since it's covering more than just "is the
> target still alive", but also "is that syscall still waiting for a
> response":
> 
>                 is still valid (i.e., that the target still exists and
> 		the syscall is still blocked waiting for a response).

Thanks. I made it:

(i.e., that the target still exists and its system call
is still blocked waiting for a response).

>>               The third ioctl(2) argument is a pointer to the cookie (id)
>>               returned by the SECCOMP_IOCTL_NOTIF_RECV operation.
>>
>>               This operation is necessary to avoid race conditions that
>>               can occur when the pid returned by the
>>               SECCOMP_IOCTL_NOTIF_RECV operation terminates, and that
>>               process ID is reused by another process.  An example of
>>               this kind of race is the following:
>>
>>               1. A notification is generated on the listening file
>>                  descriptor.  The returned seccomp_notif contains the TID
>>                  of the target thread (in the pid field of the
>>                  structure).
>>
>>               2. The target terminates.
>>
>>               3. Another thread or process is created on the system that
>>                  by chance reuses the TID that was freed when the target
>>                  terminated.
>>
>>               4. The supervisor open(2)s the /proc/[tid]/mem file for the
>>                  TID obtained in step 1, with the intention of (say)
>>                  inspecting the memory location(s) that contain the
>>                  argument(s) of the system call that triggered the
>>                  notification in step 1.
>>
>>               In the above scenario, the risk is that the supervisor may
>>               try to access the memory of a process other than the
>>               target.  This race can be avoided by following the call to
>>               open(2) with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to
>>               verify that the process that generated the notification is
>>               still alive.  (Note that if the target terminates after the
>>               latter step, a subsequent read(2) from the file descriptor
>>               may return 0, indicating end of file.)
>>
>>               On success (i.e., the notification ID is still valid), this
>>               operation returns 0.  On failure (i.e., the notification ID
>>               is no longer valid), -1 is returned, and errno is set to
>>               ENOENT.
>>
>>        SECCOMP_IOCTL_NOTIF_SEND
>>               This operation is used to send a notification response back
>>               to the kernel.  The third ioctl(2) argument of this
>>               operation is a pointer to a structure of the following
>>               form:
>>
>>                   struct seccomp_notif_resp {
>>                       __u64 id;               /* Cookie value */
>>                       __s64 val;              /* Success return value */
>>                       __s32 error;            /* 0 (success) or negative
>>                                                  error number */
>>                       __u32 flags;            /* See below */
>>                   };
>>
>>               The fields of this structure are as follows:
>>
>>               id     This is the cookie value that was obtained using the
>>                      SECCOMP_IOCTL_NOTIF_RECV operation.  This cookie
>>                      value allows the kernel to correctly associate this
>>                      response with the system call that triggered the
>>                      user-space notification.
>>
>>               val    This is the value that will be used for a spoofed
>>                      success return for the target's system call; see
>>                      below.
>>
>>               error  This is the value that will be used as the error
>>                      number (errno) for a spoofed error return for the
>>                      target's system call; see below.
>>
>>               flags  This is a bit mask that includes zero or more of the
>>                      following flags:
>>
>>                      SECCOMP_USER_NOTIF_FLAG_CONTINUE (since Linux 5.5)
>>                             Tell the kernel to execute the target's
>>                             system call.
>>
>>               Two kinds of response are possible:
>>
>>               • A response to the kernel telling it to execute the
>>                 target's system call.  In this case, the flags field
>>                 includes SECCOMP_USER_NOTIF_FLAG_CONTINUE and the error
>>                 and val fields must be zero.
>>
>>                 This kind of response can be useful in cases where the
>>                 supervisor needs to do deeper analysis of the target's
>>                 system call than is possible from a seccomp filter (e.g.,
>>                 examining the values of pointer arguments), and, having
>>                 decided that the system call does not require emulation
>>                 by the supervisor, the supervisor wants the system call
>>                 to be executed normally in the target.
>>
>>                 The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag should be used
>>                 with caution; see NOTES.
>>
>>               • A spoofed return value for the target's system call.  In
>>                 this case, the kernel does not execute the target's
>>                 system call, instead causing the system call to return a
>>                 spoofed value as specified by fields of the
>>                 seccomp_notif_resp structure.  The supervisor should set
>>                 the fields of this structure as follows:
>>
>>                 +  flags does not contain
>>                    SECCOMP_USER_NOTIF_FLAG_CONTINUE.
>>
>>                 +  error is set either to 0 for a spoofed "success"
>>                    return or to a negative error number for a spoofed
>>                    "failure" return.  In the former case, the kernel
>>                    causes the target's system call to return the value
>>                    specified in the val field.  In the latter case, the
>>                    kernel causes the target's system call to return -1,
>>                    and errno is assigned the negated error value.
>>
>>                 +  val is set to a value that will be used as the return
>>                    value for a spoofed "success" return for the target's
>>                    system call.  The value in this field is ignored if
>>                    the error field contains a nonzero value.
> 
> Strictly speaking, this is architecture specific, but all architectures
> do it this way. Should seccomp enforce val == 0 when err != 0 ?

That seems a reasonable check to add. Initially, I found the absence of
such a check confusing, since it left me wondering: have I understood
the kernel code correctly?
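
To make the three kinds of response concrete, here is a small sketch
(not taken from the page) of helpers that fill in a seccomp_notif_resp
structure. The helper names are hypothetical. The fallback definition
of SECCOMP_USER_NOTIF_FLAG_CONTINUE is only there for headers older
than Linux 5.5. Note that spoofError() leaves val as zero, so it would
still work if the kernel started enforcing val == 0 when error != 0.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <linux/seccomp.h>

#ifndef SECCOMP_USER_NOTIF_FLAG_CONTINUE    /* Headers older than Linux 5.5 */
#define SECCOMP_USER_NOTIF_FLAG_CONTINUE (1UL << 0)
#endif

/* Spoofed "success": the target's system call returns 'val' */

static struct seccomp_notif_resp
spoofSuccess(uint64_t id, int64_t val)
{
    struct seccomp_notif_resp resp;

    memset(&resp, 0, sizeof(resp));
    resp.id = id;               /* Cookie obtained from NOTIF_RECV */
    resp.val = val;
    return resp;
}

/* Spoofed "failure": 'err' is a positive errno value (e.g., EOPNOTSUPP);
   it is stored negated. The target sees -1, with errno set to 'err'. */

static struct seccomp_notif_resp
spoofError(uint64_t id, int err)
{
    struct seccomp_notif_resp resp;

    memset(&resp, 0, sizeof(resp));
    resp.id = id;
    resp.error = -err;          /* 'val' deliberately stays 0 */
    return resp;
}

/* "Continue": the kernel executes the target's system call; 'error'
   and 'val' must both be zero in this case */

static struct seccomp_notif_resp
continueSyscall(uint64_t id)
{
    struct seccomp_notif_resp resp;

    memset(&resp, 0, sizeof(resp));
    resp.id = id;
    resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
    return resp;
}
```

The resulting structure would then be passed by address as the third
argument of a SECCOMP_IOCTL_NOTIF_SEND ioctl(2) call on the listening
file descriptor.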

>>               On success, this operation returns 0; on failure, -1 is
>>               returned, and errno is set to indicate the cause of the
>>               error.  This operation can fail with the following errors:
>>
>>               EINPROGRESS
>>                      A response to this notification has already been
>>                      sent.
>>
>>               EINVAL An invalid value was specified in the flags field.
>>
>>               EINVAL The flags field contained
>>                      SECCOMP_USER_NOTIF_FLAG_CONTINUE, and the error or
>>                      val field was not zero.
>>
>>               ENOENT The blocked system call in the target has been
>>                      interrupted by a signal handler or the target has
>>                      terminated.
>>
>> NOTES
>>    select()/poll()/epoll semantics
>>        The file descriptor returned when seccomp(2) is employed with the
>>        SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be monitored using
>>        poll(2), epoll(7), and select(2).  These interfaces indicate that
>>        the file descriptor is ready as follows:
>>
>>        • When a notification is pending, these interfaces indicate that
>>          the file descriptor is readable.  Following such an indication,
>>          a subsequent SECCOMP_IOCTL_NOTIF_RECV ioctl(2) will not block,
>>          returning either information about a notification or else
>>          failing with the error EINTR if the target has been killed by a
>>          signal or its system call has been interrupted by a signal
>>          handler.
>>
>>        • After the notification has been received (i.e., by the
>>          SECCOMP_IOCTL_NOTIF_RECV ioctl(2) operation), these interfaces
>>          indicate that the file descriptor is writable, meaning that a
>>          notification response can be sent using the
>>          SECCOMP_IOCTL_NOTIF_SEND ioctl(2) operation.
>>
>>        • After the last thread using the filter has terminated and been
>>          reaped using waitpid(2) (or similar), the file descriptor
>>          indicates an end-of-file condition (readable in select(2);
>>          POLLHUP/EPOLLHUP in poll(2)/ epoll_wait(2)).
> 
> I'll reply separately about the "ioctl() does not terminate when all
> filters have terminated" case.

Okay.
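
As a rough sketch of the readiness semantics quoted above, a supervisor
could classify the state of the listening file descriptor with poll(2)
as follows. The enum and function names are hypothetical, and only
POLLIN is requested, so the "writable" indication is not covered. A
"readable" result should be treated as a hint only: the notification
may still be cancelled before SECCOMP_IOCTL_NOTIF_RECV is called.

```c
#include <errno.h>
#include <poll.h>

enum notifyFdState {
    NOTIFY_TIMEOUT,     /* Nothing happened within the timeout */
    NOTIFY_READABLE,    /* A notification is pending */
    NOTIFY_HANGUP,      /* No more users of the filter (POLLHUP) */
    NOTIFY_ERROR        /* poll() failed, or POLLERR/POLLNVAL */
};

/* Hypothetical helper: wait up to 'timeoutMs' milliseconds (-1 means
   wait indefinitely) for something to happen on 'notifyFd' */

static enum notifyFdState
waitOnNotifyFd(int notifyFd, int timeoutMs)
{
    struct pollfd pfd = { .fd = notifyFd, .events = POLLIN };
    int n;

    do {
        n = poll(&pfd, 1, timeoutMs);
    } while (n == -1 && errno == EINTR);    /* Restart if interrupted */

    if (n == -1)
        return NOTIFY_ERROR;
    if (n == 0)
        return NOTIFY_TIMEOUT;
    if (pfd.revents & POLLIN)
        return NOTIFY_READABLE;
    if (pfd.revents & POLLHUP)
        return NOTIFY_HANGUP;
    return NOTIFY_ERROR;
}
```

Checking POLLIN before POLLHUP means that a pending notification is
reported even if the hangup condition is flagged at the same time.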

>>    Design goals; use of SECCOMP_USER_NOTIF_FLAG_CONTINUE
>>        The intent of the user-space notification feature is to allow
>>        system calls to be performed on behalf of the target.  The
>>        target's system call should either be handled by the supervisor or
>>        allowed to continue normally in the kernel (where standard
>>        security policies will be applied).
>>
>>        Note well: this mechanism must not be used to make security policy
>>        decisions about the system call, which would be inherently race-
>>        prone for reasons described next.
>>
>>        The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag must be used with
>>        caution.  If set by the supervisor, the target's system call will
>>        continue.  However, there is a time-of-check, time-of-use race
>>        here, since an attacker could exploit the interval of time where
>>        the target is blocked waiting on the "continue" response to do
>>        things such as rewriting the system call arguments.
>>
>>        Note furthermore that a user-space notifier can be bypassed if the
>>        existing filters allow the use of seccomp(2) or prctl(2) to
>>        install a filter that returns an action value with a higher
>>        precedence than SECCOMP_RET_USER_NOTIF (see seccomp(2)).
>>
>>        It should thus be absolutely clear that the seccomp user-space
>>        notification mechanism cannot be used to implement a security
>>        policy!  It should only ever be used in scenarios where a more
>>        privileged process supervises the system calls of a lesser
>>        privileged target to get around kernel-enforced security
>>        restrictions when the supervisor deems this safe.  In other words,
>>        in order to continue a system call, the supervisor should be sure
>>        that another security mechanism or the kernel itself will
>>        sufficiently block the system call if its arguments are rewritten
>>        to something unsafe.
>>
>>    Interaction with SA_RESTART signal handlers
>>        Consider the following scenario:
>>
>>        • The target process has used sigaction(2) to install a signal
>>          handler with the SA_RESTART flag.
>>
>>        • The target has made a system call that triggered a seccomp user-
>>          space notification and the target is currently blocked until the
>>          supervisor sends a notification response.
>>
>>        • A signal is delivered to the target and the signal handler is
>>          executed.
>>
>>        • When (if) the supervisor attempts to send a notification
>>          response, the SECCOMP_IOCTL_NOTIF_SEND ioctl(2) operation will
>>          fail with the ENOENT error.
>>
>>        In this scenario, the kernel will restart the target's system
>>        call.  Consequently, the supervisor will receive another user-
>>        space notification.  Thus, depending on how many times the blocked
>>        system call is interrupted by a signal handler, the supervisor may
>>        receive multiple notifications for the same system call in the
> 
> maybe "... for the same instance of a system call in the target." for
> clarity?

Yes, that's a nice clarification.

>>        target.
>>
>>        One oddity is that system call restarting as described in this
>>        scenario will occur even for the blocking system calls listed in
>>        signal(7) that would never normally be restarted by the SA_RESTART
>>        flag.
> 
> Does this need fixing? I imagine the correct behavior for this case
> would be a response to _SEND of EINPROGRESS and the target would see
> EINTR normally?

That sounds reasonable.

> I mean, it's not like seccomp doesn't already expose weirdness with
> syscall restarts. Not even arm64 compat agrees[3] with arm32 in this
> regard. :(

I've added the above comments as a FIXME in the page.
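
From the supervisor's side, the practical consequence of the SA_RESTART
scenario is that ENOENT from SECCOMP_IOCTL_NOTIF_SEND is not fatal: the
response is simply dropped and the restarted system call shows up as a
new notification. A minimal sketch of that decision, with hypothetical
names, and assuming the supervisor wants to treat EINPROGRESS as
harmless too:

```c
#include <errno.h>

enum sendOutcome {
    SEND_OK,            /* Response was delivered */
    SEND_STALE,         /* ENOENT: target terminated, or its blocked
                           system call was interrupted by a handler */
    SEND_DUPLICATE,     /* EINPROGRESS: a response was already sent */
    SEND_FATAL          /* Any other error (e.g., EINVAL) */
};

/* Hypothetical helper: 'ret' is the return value of the
   SECCOMP_IOCTL_NOTIF_SEND ioctl(), and 'err' is the value that errno
   had immediately after that call */

static enum sendOutcome
classifySendResult(int ret, int err)
{
    if (ret == 0)
        return SEND_OK;

    switch (err) {
    case ENOENT:
        return SEND_STALE;      /* Go back and wait for the next event */
    case EINPROGRESS:
        return SEND_DUPLICATE;
    default:
        return SEND_FATAL;
    }
}
```

The ioctl() result should be stored in a variable first and errno read
afterward; passing both directly as function arguments would leave the
order of evaluation unspecified.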

>> BUGS
>>        If a SECCOMP_IOCTL_NOTIF_RECV ioctl(2) operation is performed
>>        after the target terminates, then the ioctl(2) call simply blocks
>>        (rather than returning an error to indicate that the target no
>>        longer exists).
> 
> I want this fixed. It caused me no end of pain when building the
> selftests, and ended up spawning my implementing a global test timeout
> in kselftest. :P Before the usage counter refactor, there was no sane
> way to deal with this, but now I think we're close[2]. I'll reply
> separately about this.

Also added as FIXME comment in the page :-).

The behavior here is surprising, and caused me some
confusion until I worked out what was going on.

>> EXAMPLES
>>        The (somewhat contrived) program shown below demonstrates the use
>>        of the interfaces described in this page.  The program creates a
>>        child process that serves as the "target" process.  The child
>>        process installs a seccomp filter that returns the
>>        SECCOMP_RET_USER_NOTIF action value if a call is made to mkdir(2).
>>        The child process then calls mkdir(2) once for each of the
>>        supplied command-line arguments, and reports the result returned
>>        by the call.  After processing all arguments, the child process
>>        terminates.
>>
>>        The parent process acts as the supervisor, listening for the
>>        notifications that are generated when the target process calls
>>        mkdir(2).  When such a notification occurs, the supervisor
>>        examines the memory of the target process (using /proc/[pid]/mem)
>>        to discover the pathname argument that was supplied to the
>>        mkdir(2) call, and performs one of the following actions:
> 
> I like this example! It's simple enough to be understandable and complex
> enough to show the purpose of user_notif. :)

Precisely my aim. Thank you for noticing and appreciating :-).

>>        • If the pathname begins with the prefix "/tmp/", then the
>>          supervisor attempts to create the specified directory, and then
>>          spoofs a return for the target process based on the return value
>>          of the supervisor's mkdir(2) call.  In the event that that call
>>          succeeds, the spoofed success return value is the length of the
>>          pathname.
>>
>>        • If the pathname begins with "./" (i.e., it is a relative
>>          pathname), the supervisor sends a
>>          SECCOMP_USER_NOTIF_FLAG_CONTINUE response to the kernel to say
>>          that the kernel should execute the target process's mkdir(2)
>>          call.
>>
>>        • If the pathname begins with some other prefix, the supervisor
>>          spoofs an error return for the target process, so that the
>>          target process's mkdir(2) call appears to fail with the error
>>          EOPNOTSUPP ("Operation not supported").  Additionally, if the
>>          specified pathname is exactly "/bye", then the supervisor
>>          terminates.

[...]

>>    Program source
>>        #define _GNU_SOURCE
>>        #include <sys/types.h>
>>        #include <sys/prctl.h>
>>        #include <fcntl.h>
>>        #include <limits.h>
>>        #include <signal.h>
>>        #include <stddef.h>
>>        #include <stdint.h>
>>        #include <stdbool.h>
>>        #include <linux/audit.h>
>>        #include <sys/syscall.h>
>>        #include <sys/stat.h>
>>        #include <linux/filter.h>
>>        #include <linux/seccomp.h>
>>        #include <sys/ioctl.h>
>>        #include <stdio.h>
>>        #include <stdlib.h>
>>        #include <unistd.h>
>>        #include <errno.h>
>>        #include <sys/socket.h>
>>        #include <sys/un.h>
>>
>>        #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
>>                                } while (0)
> 
> Because I love macros, you can expand this to make it take a format
> string:
> 
> #define errExit(fmt, ...)	do {					\
> 		char __err[64];						\
> 		strerror_r(errno, __err, sizeof(__err));		\
> 		fprintf(stderr, fmt ": %s\n", ##__VA_ARGS__, __err);	\
> 		exit(EXIT_FAILURE);					\
> 	} while (0)

I'm a bit divided about this. I don't want to distract the reader by
requiring them to understand the macro. I'll leave this for the moment.

[...]

>>        static void
>>        sigchldHandler(int sig)
>>        {
>>            char *msg  = "\tS: target has terminated; bye\n";
>>
>>            write(STDOUT_FILENO, msg, strlen(msg));
> 
> white space nit: extra space before "="

Thanks!

> efficiency nit: strlen isn't needed, since it can be done with
> compile-time constants:
> 
>              char msg[] = "\tS: target has terminated; bye\n";
>              write(STDOUT_FILENO, msg, sizeof(msg) - 1);
> 
> (some optimization levels may already replace the strlen with a sizeof - 1)

Changed as you suggest. Thanks!

>>            _exit(EXIT_SUCCESS);
>>        }

[...]

>>        static void
>>        checkNotificationIdIsValid(int notifyFd, uint64_t id)
>>        {
>>            if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == -1) {
>>                fprintf(stderr, "\tS: notification ID check: "
>>                        "target has terminated!!!\n");
>>
>>                exit(EXIT_FAILURE);
> 
> And now you can do:
> 
> 		errExit("\tS: notification ID check: "
> 			"target has terminated! ioctl");
> 
> ;)

Thanks. Changed as you suggest.

>>            }
>>        }
>>
>>        /* Access the memory of the target process in order to discover the
>>           pathname that was given to mkdir() */
>>
>>        static bool
>>        getTargetPathname(struct seccomp_notif *req, int notifyFd,
>>                          char *path, size_t len)
>>        {
>>            char procMemPath[PATH_MAX];
>>
>>            snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", req->pid);
>>
>>            int procMemFd = open(procMemPath, O_RDONLY);
>>            if (procMemFd == -1)
>>                errExit("Supervisor: open");
>>
>>            /* Check that the process whose info we are accessing is still alive.
>>               If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed
>>               in checkNotificationIdIsValid()) succeeds, we know that the
>>               /proc/PID/mem file descriptor that we opened corresponds to the
>>               process for which we received a notification. If that process
>>               subsequently terminates, then read() on that file descriptor
>>               will return 0 (EOF). */
>>
>>            checkNotificationIdIsValid(notifyFd, req->id);
>>
>>            /* Read bytes at the location containing the pathname argument
>>               (i.e., the first argument) of the mkdir(2) call */
>>
>>            ssize_t nread = pread(procMemFd, path, len, req->data.args[0]);
>>            if (nread == -1)
>>                errExit("pread");
>>
>>            if (nread == 0) {
>>                fprintf(stderr, "\tS: pread() of /proc/PID/mem "
>>                        "returned 0 (EOF)\n");
>>                exit(EXIT_FAILURE);
>>            }
>>
>>            if (close(procMemFd) == -1)
>>                errExit("close-/proc/PID/mem");
>>
>>            /* We have no guarantees about what was in the memory of the target
>>               process. We therefore treat the buffer returned by pread() as
>>               untrusted input. The buffer should be terminated by a null byte;
>>               if not, then we will trigger an error for the target process. */
>>
>>            for (int j = 0; j < nread; j++)
>>                if (path[j] == ' ')
> 
> This rendering typo (' ' vs '\0') ends up manifesting badly. ;) The man
> source shows:
> 
>         if (path[j] == \(aq\0\(aq)
> 
> I think this needs to be \\0 ?

Yes, that was the intention.

> Or it could also be tested as:
> 
> 	if (strnlen(path, nread) < nread)
>

Good point. Changed to:

    if (strnlen(path, nread) < nread)
        return true;
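
For anyone following along, here is a self-contained sketch of that
check, pulled out into a hypothetical helper so that it can be tried in
isolation. It assumes 'nread' is the value returned by pread(), and it
additionally rejects nread <= 0, which the example program handles
separately before reaching this point.

```c
#define _GNU_SOURCE
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: return true if the first 'nread' bytes of 'buf'
   contain a terminating null byte, so that 'buf' can safely be treated
   as a C string */

static bool
bufferHoldsTerminatedString(const char *buf, ssize_t nread)
{
    if (nread <= 0)
        return false;

    /* strnlen() never looks beyond 'nread' bytes; if it stops early,
       it must have found a null byte inside the buffer */

    return strnlen(buf, (size_t) nread) < (size_t) nread;
}
```

Unlike strlen(), strnlen() is safe on untrusted input here because it
cannot run past the bytes that were actually read from the target.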

[...]

> 
> Thank you so much for this documentation and example! :)

You're welcome. It's been "interesting" uncovering the glitches :-).

Cheers,

Michael


> [1] https://git.kernel.org/linus/dfe719fef03d752f1682fa8aeddf30ba501c8555
> [2] https://lore.kernel.org/lkml/CAG48ez3kpEDO1x_HfvOM2R9M78Ach9O_4+Pjs-vLLfqvZL+13A@mail.gmail.com/
> [3] https://lore.kernel.org/lkml/CAGXu5jKzif=vp6gn5ZtrTx-JTN367qFphobnt9s=awbaafwoUw@mail.gmail.com/
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: For review: seccomp_user_notif(2) manual page
  2020-10-26  9:32             ` Jann Horn
@ 2020-10-26  9:47               ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 52+ messages in thread
From: Michael Kerrisk (man-pages) @ 2020-10-26  9:47 UTC (permalink / raw)
  To: Jann Horn
  Cc: mtk.manpages, Tycho Andersen, Sargun Dhillon, Kees Cook,
	Christian Brauner, linux-man, lkml, Aleksa Sarai,
	Alexei Starovoitov, Will Drewry, bpf, Song Liu, Daniel Borkmann,
	Andy Lutomirski, Linux Containers, Giuseppe Scrivano,
	Robert Sesek

Hi Jann,

On 10/26/20 10:32 AM, Jann Horn wrote:
> On Sat, Oct 24, 2020 at 2:53 PM Michael Kerrisk (man-pages)
> <mtk.manpages@gmail.com> wrote:
>> On 10/17/20 2:25 AM, Jann Horn wrote:
>>> On Fri, Oct 16, 2020 at 8:29 PM Michael Kerrisk (man-pages)
>>> <mtk.manpages@gmail.com> wrote:
> [...]
>>>> I'm not sure if I should write anything about this small UAPI
>>>> breakage in BUGS, or not. Your thoughts?
>>>
>>> Thinking about it a bit more: Any code that relies on pause() or
>>> epoll_wait() not restarting is buggy anyway, right? Because a signal
>>> could also arrive directly before entering the syscall, while
>>> userspace code is still executing? So one could argue that we're just
>>> enlarging a preexisting race. (Unless the signal handler checks the
>>> interrupted register state to figure out whether we already entered
>>> syscall handling?)
>>
>> Yes, that all makes sense.
>>
>>> If userspace relies on non-restarting behavior, it should be using
>>> something like epoll_pwait(). And that stuff only unblocks signals
>>> after we've already passed the seccomp checks on entry.
>>
>> Thanks for elaborating that detail, since as soon as you talked
>> about "enlarging a preexisting race" above, I immediately wondered
>> about sigsuspend(), pselect(), etc.
>>
>> (Mind you, I still wonder about the effect on system calls that
>> are normally nonrestartable because they have timeouts. My
>> understanding is that the kernel doesn't restart those system
>> calls because it's impossible for the kernel to restart the call
>> with the right timeout value. I wonder what happens when those
>> system calls are restarted in the scenario we're discussing.)
> 
> Ah, that's an interesting edge case...

I'm going to drop a FIXME into the page source so that
there's a reminder of this issue in the next draft of 
the page, which I'm about to send out.

[...]

Thanks for checking the other pieces, Jann.

Cheers,

Michael



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: For review: seccomp_user_notif(2) manual page
  2020-10-26  0:32             ` Kees Cook
@ 2020-10-26  9:51               ` Jann Horn
  2020-10-26 10:31                 ` Jann Horn
  2020-10-28 22:53                 ` Kees Cook
  0 siblings, 2 replies; 52+ messages in thread
From: Jann Horn @ 2020-10-26  9:51 UTC (permalink / raw)
  To: Kees Cook
  Cc: Tycho Andersen, Michael Kerrisk (man-pages),
	Sargun Dhillon, Christian Brauner, linux-man, lkml, Aleksa Sarai,
	Alexei Starovoitov, Will Drewry, bpf, Song Liu, Daniel Borkmann,
	Andy Lutomirski, Linux Containers, Giuseppe Scrivano,
	Robert Sesek

On Mon, Oct 26, 2020 at 1:32 AM Kees Cook <keescook@chromium.org> wrote:
> On Thu, Oct 01, 2020 at 03:52:02AM +0200, Jann Horn wrote:
> > On Thu, Oct 1, 2020 at 1:25 AM Tycho Andersen <tycho@tycho.pizza> wrote:
> > > On Thu, Oct 01, 2020 at 01:11:33AM +0200, Jann Horn wrote:
> > > > On Thu, Oct 1, 2020 at 1:03 AM Tycho Andersen <tycho@tycho.pizza> wrote:
> > > > > On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) wrote:
> > > > > > On 9/30/20 5:03 PM, Tycho Andersen wrote:
> > > > > > > On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
> > > > > > >>        ┌─────────────────────────────────────────────────────┐
> > > > > > >>        │FIXME                                                │
> > > > > > >>        ├─────────────────────────────────────────────────────┤
> > > > > > >>        │From my experiments,  it  appears  that  if  a  SEC‐ │
> > > > > > >>        │COMP_IOCTL_NOTIF_RECV   is  done  after  the  target │
> > > > > > >>        │process terminates, then the ioctl()  simply  blocks │
> > > > > > >>        │(rather than returning an error to indicate that the │
> > > > > > >>        │target process no longer exists).                    │
> > > > > > >
> > > > > > > Yeah, I think Christian wanted to fix this at some point,
> > > > > >
> > > > > > Do you have a pointer to that discussion? I could not find it with a
> > > > > > quick search.
> > > > > >
> > > > > > > but it's a
> > > > > > > bit sticky to do.
> > > > > >
> > > > > > Can you say a few words about the nature of the problem?
> > > > >
> > > > > I remembered wrong, it's actually in the tree: 99cdb8b9a573 ("seccomp:
> > > > > notify about unused filter"). So maybe there's a bug here?
> > > >
> > > > That thing only notifies on ->poll, it doesn't unblock ioctls; and
> > > > Michael's sample code uses SECCOMP_IOCTL_NOTIF_RECV to wait. So that
> > > > commit doesn't have any effect on this kind of usage.
> > >
> > > Yes, thanks. And the ones stuck in RECV are waiting on a semaphore so
> > > we don't have a count of all of them, unfortunately.
> > >
> > > We could maybe look inside the wait_list, but that will probably make
> > > people angry :)
> >
> > The easiest way would probably be to open-code the semaphore-ish part,
> > and let the semaphore and poll share the waitqueue. The current code
> > kind of mirrors the semaphore's waitqueue in the wqh - open-coding the
> > entire semaphore would IMO be cleaner than that. And it's not like
> > semaphore semantics are even a good fit for this code anyway.
> >
> > Let's see... if we didn't have the existing UAPI to worry about, I'd
> > do it as follows (*completely* untested). That way, the ioctl would
> > block exactly until either there actually is a request to deliver or
> > there are no more users of the filter. The problem is that if we just
> > apply this patch, existing users of SECCOMP_IOCTL_NOTIF_RECV that use
> > an event loop and don't set O_NONBLOCK will be screwed. So we'd
>
> Wait, why? Do you mean an ioctl calling loop (rather than a poll event
> loop)?

No, I'm talking about poll event loops.

> I think poll would be fine, but a "try calling RECV and expect to
> return ENOENT" loop would change. But I don't think anyone would do this
> exactly because it _currently_ acts like O_NONBLOCK, yes?
>
> > probably also have to add some stupid counter in place of the
> > semaphore's counter that we can use to preserve the old behavior of
> > returning -ENOENT once for each cancelled request. :(
>
> I only see this in Debian Code Search:
> https://sources.debian.org/src/crun/0.15+dfsg-1/src/libcrun/seccomp_notify.c/?hl=166#L166
> which is using epoll_wait():
> https://sources.debian.org/src/crun/0.15+dfsg-1/src/libcrun/container.c/?hl=1326#L1326
>
> I expect LXC is using it. :)

The problem is the scenario where a process is interrupted while it's
waiting for the supervisor to reply.

Consider the following scenario (with supervisor "S" and target "T"; S
wants to wait for events on two file descriptors seccomp_fd and
other_fd):

S: starts poll() to wait for events on seccomp_fd and other_fd
T: performs a syscall that's filtered with RET_USER_NOTIF
S: poll() returns and signals readiness of seccomp_fd
T: receives signal SIGUSR1
T: syscall aborts, enters signal handler
T: signal handler blocks on unfiltered syscall (e.g. write())
S: starts SECCOMP_IOCTL_NOTIF_RECV
S: blocks because no syscalls are pending

Depending on what other_fd is, this could in a worst case even lead to
a deadlock (if e.g. the signal handler wants to write to stdout, but
the stdout fd is hooked up to other_fd in the supervisor, but the
supervisor can't consume the data written because it's stuck in
seccomp handling).

So we have to ensure that when existing code (like that crun code you
linked to) triggers this case, SECCOMP_IOCTL_NOTIF_RECV returns
immediately instead of blocking.

(Oh, but by the way, that crun code looks broken anyway, because
AFAICS it treats all error returns from SECCOMP_IOCTL_NOTIF_RECV
equally by bailing out; and it kinda looks like that bailout path then
nukes the container, or something? So that needs to be fixed either
way.)
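
For illustration, here is a rough sketch of the error handling this
implies for an event-loop supervisor. It is not taken from crun or from
any patch in this thread; classify_recv() and the enum are hypothetical
names. The only kernel behaviour it relies on is the one described
above: SECCOMP_IOCTL_NOTIF_RECV fails with ENOENT when the request that
made the fd readable has been cancelled in the meantime.

```c
#include <errno.h>

/* Hypothetical helper: what an event loop should do after one
 * ioctl(fd, SECCOMP_IOCTL_NOTIF_RECV, &req) attempt. */
enum recv_action {
	RECV_HANDLE,	/* ioctl succeeded: process the request */
	RECV_REPOLL,	/* nothing to handle: go back to poll() */
	RECV_FATAL	/* unexpected error: tear down */
};

/* ret is the ioctl() return value, err is errno saved right after it. */
static enum recv_action classify_recv(int ret, int err)
{
	if (ret == 0)
		return RECV_HANDLE;
	switch (err) {
	case ENOENT:	/* target was interrupted; the request is gone */
	case EINTR:	/* the supervisor itself caught a signal */
		return RECV_REPOLL;
	default:
		return RECV_FATAL;
	}
}
```

The point is only that ENOENT is an expected outcome of the race in the
scenario above, so it should send the loop back to poll() rather than
down the bailout path.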


* Re: For review: seccomp_user_notif(2) manual page
  2020-10-26  9:51               ` Jann Horn
@ 2020-10-26 10:31                 ` Jann Horn
  2020-10-28 22:56                   ` Kees Cook
       [not found]                   ` <20201029021348.GB25673@cisco>
  2020-10-28 22:53                 ` Kees Cook
  1 sibling, 2 replies; 52+ messages in thread
From: Jann Horn @ 2020-10-26 10:31 UTC (permalink / raw)
  To: Kees Cook
  Cc: Tycho Andersen, Michael Kerrisk (man-pages),
	Sargun Dhillon, Christian Brauner, linux-man, lkml, Aleksa Sarai,
	Alexei Starovoitov, Will Drewry, bpf, Song Liu, Daniel Borkmann,
	Andy Lutomirski, Linux Containers, Giuseppe Scrivano,
	Robert Sesek

On Mon, Oct 26, 2020 at 10:51 AM Jann Horn <jannh@google.com> wrote:
> On Mon, Oct 26, 2020 at 1:32 AM Kees Cook <keescook@chromium.org> wrote:
> > On Thu, Oct 01, 2020 at 03:52:02AM +0200, Jann Horn wrote:
> > > On Thu, Oct 1, 2020 at 1:25 AM Tycho Andersen <tycho@tycho.pizza> wrote:
> > > > On Thu, Oct 01, 2020 at 01:11:33AM +0200, Jann Horn wrote:
> > > > > On Thu, Oct 1, 2020 at 1:03 AM Tycho Andersen <tycho@tycho.pizza> wrote:
> > > > > > On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) wrote:
> > > > > > > On 9/30/20 5:03 PM, Tycho Andersen wrote:
> > > > > > > > On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
> > > > > > > >>        ┌─────────────────────────────────────────────────────┐
> > > > > > > >>        │FIXME                                                │
> > > > > > > >>        ├─────────────────────────────────────────────────────┤
> > > > > > > >>        │From my experiments,  it  appears  that  if  a  SEC‐ │
> > > > > > > >>        │COMP_IOCTL_NOTIF_RECV   is  done  after  the  target │
> > > > > > > >>        │process terminates, then the ioctl()  simply  blocks │
> > > > > > > >>        │(rather than returning an error to indicate that the │
> > > > > > > >>        │target process no longer exists).                    │
> > > > > > > >
> > > > > > > > Yeah, I think Christian wanted to fix this at some point,
> > > > > > >
> > > > > > > > Do you have a pointer to that discussion? I could not find it with a
> > > > > > > quick search.
> > > > > > >
> > > > > > > > but it's a
> > > > > > > > bit sticky to do.
> > > > > > >
> > > > > > > Can you say a few words about the nature of the problem?
> > > > > >
> > > > > > I remembered wrong, it's actually in the tree: 99cdb8b9a573 ("seccomp:
> > > > > > notify about unused filter"). So maybe there's a bug here?
> > > > >
> > > > > That thing only notifies on ->poll, it doesn't unblock ioctls; and
> > > > > Michael's sample code uses SECCOMP_IOCTL_NOTIF_RECV to wait. So that
> > > > > commit doesn't have any effect on this kind of usage.
> > > >
> > > > Yes, thanks. And the ones stuck in RECV are waiting on a semaphore so
> > > > we don't have a count of all of them, unfortunately.
> > > >
> > > > We could maybe look inside the wait_list, but that will probably make
> > > > people angry :)
> > >
> > > The easiest way would probably be to open-code the semaphore-ish part,
> > > and let the semaphore and poll share the waitqueue. The current code
> > > kind of mirrors the semaphore's waitqueue in the wqh - open-coding the
> > > entire semaphore would IMO be cleaner than that. And it's not like
> > > semaphore semantics are even a good fit for this code anyway.
> > >
> > > Let's see... if we didn't have the existing UAPI to worry about, I'd
> > > do it as follows (*completely* untested). That way, the ioctl would
> > > block exactly until either there actually is a request to deliver or
> > > there are no more users of the filter. The problem is that if we just
> > > apply this patch, existing users of SECCOMP_IOCTL_NOTIF_RECV that use
> > > an event loop and don't set O_NONBLOCK will be screwed. So we'd
> >
> > Wait, why? Do you mean an ioctl calling loop (rather than a poll event
> > loop)?
>
> No, I'm talking about poll event loops.
>
> > I think poll would be fine, but a "try calling RECV and expect to
> > return ENOENT" loop would change. But I don't think anyone would do this
> > exactly because it _currently_ acts like O_NONBLOCK, yes?
> >
> > > probably also have to add some stupid counter in place of the
> > > semaphore's counter that we can use to preserve the old behavior of
> > > returning -ENOENT once for each cancelled request. :(
> >
> > I only see this in Debian Code Search:
> > https://sources.debian.org/src/crun/0.15+dfsg-1/src/libcrun/seccomp_notify.c/?hl=166#L166
> > which is using epoll_wait():
> > https://sources.debian.org/src/crun/0.15+dfsg-1/src/libcrun/container.c/?hl=1326#L1326
> >
> > I expect LXC is using it. :)
>
> The problem is the scenario where a process is interrupted while it's
> waiting for the supervisor to reply.
>
> Consider the following scenario (with supervisor "S" and target "T"; S
> wants to wait for events on two file descriptors seccomp_fd and
> other_fd):
>
> S: starts poll() to wait for events on seccomp_fd and other_fd
> T: performs a syscall that's filtered with RET_USER_NOTIF
> S: poll() returns and signals readiness of seccomp_fd
> T: receives signal SIGUSR1
> T: syscall aborts, enters signal handler
> T: signal handler blocks on unfiltered syscall (e.g. write())
> S: starts SECCOMP_IOCTL_NOTIF_RECV
> S: blocks because no syscalls are pending
>
> Depending on what other_fd is, this could in a worst case even lead to
> a deadlock (if e.g. the signal handler wants to write to stdout, but
> the stdout fd is hooked up to other_fd in the supervisor, but the
> supervisor can't consume the data written because it's stuck in
> seccomp handling).
>
> So we have to ensure that when existing code (like that crun code you
> linked to) triggers this case, SECCOMP_IOCTL_NOTIF_RECV returns
> immediately instead of blocking.

Or I guess we could also just set O_NONBLOCK on the fd by default?
Since the one existing user is eventloop-based...
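
For comparison, this is roughly what a supervisor would have to do on
its own side to opt in explicitly. It is a generic fcntl() sketch;
set_nonblocking() is a hypothetical helper name, and whether
SECCOMP_IOCTL_NOTIF_RECV honours O_NONBLOCK on the notification fd is
exactly the behaviour under discussion here, not something this snippet
can guarantee.

```c
#include <fcntl.h>

/* Set O_NONBLOCK on fd, keeping the other file status flags.
 * Returns 0 on success, -1 on error (errno set by fcntl()). */
static int set_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags == -1)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}
```

Setting the flag by default in the kernel would spare existing
event-loop users from having to add a call like this.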


* Re: For review: seccomp_user_notif(2) manual page
  2020-10-25 16:31               ` Michael Kerrisk (man-pages)
@ 2020-10-26 15:54                 ` Jann Horn
  2020-10-27  6:14                   ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 52+ messages in thread
From: Jann Horn @ 2020-10-26 15:54 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Tycho Andersen, Sargun Dhillon, Kees Cook, Christian Brauner,
	linux-man, lkml, Aleksa Sarai, Alexei Starovoitov, Will Drewry,
	bpf, Song Liu, Daniel Borkmann, Andy Lutomirski,
	Linux Containers, Giuseppe Scrivano, Robert Sesek

On Sun, Oct 25, 2020 at 5:32 PM Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
> On 10/1/20 4:14 AM, Jann Horn wrote:
> > On Thu, Oct 1, 2020 at 3:52 AM Jann Horn <jannh@google.com> wrote:
> >> On Thu, Oct 1, 2020 at 1:25 AM Tycho Andersen <tycho@tycho.pizza> wrote:
> >>> On Thu, Oct 01, 2020 at 01:11:33AM +0200, Jann Horn wrote:
> >>>> On Thu, Oct 1, 2020 at 1:03 AM Tycho Andersen <tycho@tycho.pizza> wrote:
> >>>>> On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) wrote:
> >>>>>> On 9/30/20 5:03 PM, Tycho Andersen wrote:
> >>>>>>> On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
> >>>>>>>>        ┌─────────────────────────────────────────────────────┐
> >>>>>>>>        │FIXME                                                │
> >>>>>>>>        ├─────────────────────────────────────────────────────┤
> >>>>>>>>        │From my experiments,  it  appears  that  if  a  SEC‐ │
> >>>>>>>>        │COMP_IOCTL_NOTIF_RECV   is  done  after  the  target │
> >>>>>>>>        │process terminates, then the ioctl()  simply  blocks │
> >>>>>>>>        │(rather than returning an error to indicate that the │
> >>>>>>>>        │target process no longer exists).                    │
> >>>>>>>
> >>>>>>> Yeah, I think Christian wanted to fix this at some point,
> >>>>>>
> >>>>>> Do you have a pointer to that discussion? I could not find it with a
> >>>>>> quick search.
> >>>>>>
> >>>>>>> but it's a
> >>>>>>> bit sticky to do.
> >>>>>>
> >>>>>> Can you say a few words about the nature of the problem?
> >>>>>
> >>>>> I remembered wrong, it's actually in the tree: 99cdb8b9a573 ("seccomp:
> >>>>> notify about unused filter"). So maybe there's a bug here?
> >>>>
> >>>> That thing only notifies on ->poll, it doesn't unblock ioctls; and
> >>>> Michael's sample code uses SECCOMP_IOCTL_NOTIF_RECV to wait. So that
> >>>> commit doesn't have any effect on this kind of usage.
> >>>
> >>> Yes, thanks. And the ones stuck in RECV are waiting on a semaphore so
> >>> we don't have a count of all of them, unfortunately.
> >>>
> >>> We could maybe look inside the wait_list, but that will probably make
> >>> people angry :)
> >>
> >> The easiest way would probably be to open-code the semaphore-ish part,
> >> and let the semaphore and poll share the waitqueue. The current code
> >> kind of mirrors the semaphore's waitqueue in the wqh - open-coding the
> >> entire semaphore would IMO be cleaner than that. And it's not like
> >> semaphore semantics are even a good fit for this code anyway.
> >>
> >> Let's see... if we didn't have the existing UAPI to worry about, I'd
> >> do it as follows (*completely* untested). That way, the ioctl would
> >> block exactly until either there actually is a request to deliver or
> >> there are no more users of the filter. The problem is that if we just
> >> apply this patch, existing users of SECCOMP_IOCTL_NOTIF_RECV that use
> >> an event loop and don't set O_NONBLOCK will be screwed. So we'd
> >> probably also have to add some stupid counter in place of the
> >> semaphore's counter that we can use to preserve the old behavior of
> >> returning -ENOENT once for each cancelled request. :(
> >>
> >> I guess this is a nice point in favor of Michael's usual complaint
> >> that if there are no man pages for a feature by the time the feature
> >> lands upstream, there's a higher chance that the UAPI will suck
> >> forever...
> >
> > And I guess this would be the UAPI-compatible version - not actually
> > as terrible as I thought it might be. Do y'all want this? If so, feel
> > free to either turn this into a proper patch with Co-developed-by, or
> > tell me that I should do it and I'll try to get around to turning it
> > into something proper.
>
> Thanks for taking a shot at this.
>
> I tried applying the patch below to vanilla 5.9.0.
> (There's one typo: s/ENOTCON/ENOTCONN).
>
> It seems not to work though; when I send a signal to my test
> target process that is sleeping waiting for the notification
> response, the process enters the uninterruptible D state.
> Any thoughts?

Ah, yeah, I think I was completely misusing the wait API. I'll go change that.

(Btw, in general, for reports about hangs like that, it can be helpful
to have the contents of /proc/$pid/stack. And for cases where CPUs are
spinning, the relevant part from the output of the "L" sysrq, or
something like that.)

Also, I guess we can probably break this part of UAPI after all, since
the only user of this interface seems to currently be completely
broken in this case anyway? So I think we want the other
implementation without the ->canceled_reqs logic after all.

I'm a bit on the fence now on whether non-blocking mode should use
ENOTCONN or not... I guess if we returned ENOENT even when there are
no more listeners, you'd have to disambiguate through the poll()
revents, which would be kinda ugly?

I'll try to turn this into a proper patch submission...


* Re: For review: seccomp_user_notif(2) manual page
  2020-10-26 15:54                 ` Jann Horn
@ 2020-10-27  6:14                   ` Michael Kerrisk (man-pages)
  2020-10-27 10:28                     ` Jann Horn
  0 siblings, 1 reply; 52+ messages in thread
From: Michael Kerrisk (man-pages) @ 2020-10-27  6:14 UTC (permalink / raw)
  To: Jann Horn
  Cc: mtk.manpages, Tycho Andersen, Sargun Dhillon, Kees Cook,
	Christian Brauner, linux-man, lkml, Aleksa Sarai,
	Alexei Starovoitov, Will Drewry, bpf, Song Liu, Daniel Borkmann,
	Andy Lutomirski, Linux Containers, Giuseppe Scrivano,
	Robert Sesek

On 10/26/20 4:54 PM, Jann Horn wrote:
> On Sun, Oct 25, 2020 at 5:32 PM Michael Kerrisk (man-pages)
> <mtk.manpages@gmail.com> wrote:
[...]
>> I tried applying the patch below to vanilla 5.9.0.
>> (There's one typo: s/ENOTCON/ENOTCONN).
>>
>> It seems not to work though; when I send a signal to my test
>> target process that is sleeping waiting for the notification
>> response, the process enters the uninterruptible D state.
>> Any thoughts?
> 
> Ah, yeah, I think I was completely misusing the wait API. I'll go change that.
> 
> (Btw, in general, for reports about hangs like that, it can be helpful
> to have the contents of /proc/$pid/stack. And for cases where CPUs are
> spinning, the relevant part from the output of the "L" sysrq, or
> something like that.)

Thanks for the tips!

> Also, I guess we can probably break this part of UAPI after all, since
> the only user of this interface seems to currently be completely
> broken in this case anyway? So I think we want the other
> implementation without the ->canceled_reqs logic after all.

Okay.

> I'm a bit on the fence now on whether non-blocking mode should use
> ENOTCONN or not... I guess if we returned ENOENT even when there are
> no more listeners, you'd have to disambiguate through the poll()
> revents, which would be kinda ugly?

I must confess, I'm not quite clear on which two cases you 
are trying to distinguish. Can you elaborate?

> I'll try to turn this into a proper patch submission...

Thank you!!

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


* Re: For review: seccomp_user_notif(2) manual page
  2020-10-27  6:14                   ` Michael Kerrisk (man-pages)
@ 2020-10-27 10:28                     ` Jann Horn
  2020-10-28  6:31                       ` Sargun Dhillon
  0 siblings, 1 reply; 52+ messages in thread
From: Jann Horn @ 2020-10-27 10:28 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Tycho Andersen, Sargun Dhillon, Kees Cook, Christian Brauner,
	linux-man, lkml, Aleksa Sarai, Alexei Starovoitov, Will Drewry,
	bpf, Song Liu, Daniel Borkmann, Andy Lutomirski,
	Linux Containers, Giuseppe Scrivano, Robert Sesek

On Tue, Oct 27, 2020 at 7:14 AM Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
> On 10/26/20 4:54 PM, Jann Horn wrote:
> > I'm a bit on the fence now on whether non-blocking mode should use
> > ENOTCONN or not... I guess if we returned ENOENT even when there are
> > no more listeners, you'd have to disambiguate through the poll()
> > revents, which would be kinda ugly?
>
> I must confess, I'm not quite clear on which two cases you
> are trying to distinguish. Can you elaborate?

Let's say someone writes a program whose responsibilities are just to
handle seccomp events and to listen on some other fd for commands. And
this is implemented with an event loop. Then once all the target
processes are gone (including zombie reaping), we'll start getting
EPOLLERR.

If NOTIF_RECV starts returning -ENOTCONN at this point, the event loop
can just call into the seccomp logic without any arguments; it can
just call NOTIF_RECV one more time, see the -ENOTCONN, and terminate.
The downside is that there's one more error code userspace has to
special-case.
This would be more consistent with what we'd be doing in the blocking case.

If NOTIF_RECV keeps returning -ENOENT, the event loop has to also tell
the seccomp logic what the revents are.

I guess it probably doesn't really matter much.
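
To make the two options concrete, a small sketch of the "are we done?"
check the seccomp-handling code would need in each case. Both function
names are hypothetical. The ENOTCONN return is the proposal being
discussed here, not an existing kernel guarantee. The second variant
tests the error and hangup bits together so that it does not depend on
which of the two poll() actually reports.

```c
#include <errno.h>
#include <poll.h>
#include <stdbool.h>

/* Variant A (proposed in this thread): NOTIF_RECV fails with ENOTCONN
 * once no task uses the filter any more. errno alone is enough. */
static bool done_enotconn(int recv_errno)
{
	return recv_errno == ENOTCONN;
}

/* Variant B: NOTIF_RECV keeps failing with ENOENT, so the event loop
 * must also pass down the revents that poll() reported for the fd. */
static bool done_enoent(int recv_errno, short revents)
{
	return recv_errno == ENOENT && (revents & (POLLERR | POLLHUP)) != 0;
}
```

With variant A the seccomp logic stays independent of the event loop;
with variant B every caller has to thread revents through.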


* Re: For review: seccomp_user_notif(2) manual page
  2020-10-27 10:28                     ` Jann Horn
@ 2020-10-28  6:31                       ` Sargun Dhillon
  2020-10-28  9:43                         ` Jann Horn
  0 siblings, 1 reply; 52+ messages in thread
From: Sargun Dhillon @ 2020-10-28  6:31 UTC (permalink / raw)
  To: Jann Horn
  Cc: Michael Kerrisk (man-pages),
	Tycho Andersen, Kees Cook, Christian Brauner, linux-man, lkml,
	Aleksa Sarai, Alexei Starovoitov, Will Drewry, bpf, Song Liu,
	Daniel Borkmann, Andy Lutomirski, Linux Containers,
	Giuseppe Scrivano, Robert Sesek

On Tue, Oct 27, 2020 at 3:28 AM Jann Horn <jannh@google.com> wrote:
>
> On Tue, Oct 27, 2020 at 7:14 AM Michael Kerrisk (man-pages)
> <mtk.manpages@gmail.com> wrote:
> > On 10/26/20 4:54 PM, Jann Horn wrote:
> > > I'm a bit on the fence now on whether non-blocking mode should use
> > > ENOTCONN or not... I guess if we returned ENOENT even when there are
> > > no more listeners, you'd have to disambiguate through the poll()
> > > revents, which would be kinda ugly?
> >
> > I must confess, I'm not quite clear on which two cases you
> > are trying to distinguish. Can you elaborate?
>
> Let's say someone writes a program whose responsibilities are just to
> handle seccomp events and to listen on some other fd for commands. And
> this is implemented with an event loop. Then once all the target
> processes are gone (including zombie reaping), we'll start getting
> EPOLLERR.
>
> If NOTIF_RECV starts returning -ENOTCONN at this point, the event loop
> can just call into the seccomp logic without any arguments; it can
> just call NOTIF_RECV one more time, see the -ENOTCONN, and terminate.
> The downside is that there's one more error code userspace has to
> special-case.
> This would be more consistent with what we'd be doing in the blocking case.
>
> If NOTIF_RECV keeps returning -ENOENT, the event loop has to also tell
> the seccomp logic what the revents are.
>
> I guess it probably doesn't really matter much.

So, in practice, if you're emulating a blocking syscall (such as open,
perf_event_open, or any of a number of other syscalls), you probably
have to do it on a separate thread in the supervisor because you want
to continue to be able to receive new notifications if any other process
generates a seccomp notification event that you need to handle.

In addition to that, some of these syscalls are preemptible, so you need
to poll SECCOMP_IOCTL_NOTIF_ID_VALID to make sure that the program
under supervision hasn't left the syscall.

If we're to implement a mechanism that makes the seccomp ioctl receive
non-blocking, it would be valuable to address this problem as well (getting
a notification when the supervisor is processing a syscall and needs to
preempt it). In the best case, this can be a minor inconvenience, and
in the worst case this can result in weird errors where you're keeping
resources open that the container expects to be closed.


* Re: For review: seccomp_user_notif(2) manual page
  2020-10-28  6:31                       ` Sargun Dhillon
@ 2020-10-28  9:43                         ` Jann Horn
  2020-10-28 17:43                           ` Sargun Dhillon
  0 siblings, 1 reply; 52+ messages in thread
From: Jann Horn @ 2020-10-28  9:43 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Michael Kerrisk (man-pages),
	Tycho Andersen, Kees Cook, Christian Brauner, linux-man, lkml,
	Aleksa Sarai, Alexei Starovoitov, Will Drewry, bpf, Song Liu,
	Daniel Borkmann, Andy Lutomirski, Linux Containers,
	Giuseppe Scrivano, Robert Sesek

On Wed, Oct 28, 2020 at 7:32 AM Sargun Dhillon <sargun@sargun.me> wrote:
> On Tue, Oct 27, 2020 at 3:28 AM Jann Horn <jannh@google.com> wrote:
> > On Tue, Oct 27, 2020 at 7:14 AM Michael Kerrisk (man-pages)
> > <mtk.manpages@gmail.com> wrote:
> > > On 10/26/20 4:54 PM, Jann Horn wrote:
> > > > I'm a bit on the fence now on whether non-blocking mode should use
> > > > ENOTCONN or not... I guess if we returned ENOENT even when there are
> > > > no more listeners, you'd have to disambiguate through the poll()
> > > > revents, which would be kinda ugly?
> > >
> > > I must confess, I'm not quite clear on which two cases you
> > > are trying to distinguish. Can you elaborate?
> >
> > Let's say someone writes a program whose responsibilities are just to
> > handle seccomp events and to listen on some other fd for commands. And
> > this is implemented with an event loop. Then once all the target
> > processes are gone (including zombie reaping), we'll start getting
> > EPOLLERR.
> >
> > If NOTIF_RECV starts returning -ENOTCONN at this point, the event loop
> > can just call into the seccomp logic without any arguments; it can
> > just call NOTIF_RECV one more time, see the -ENOTCONN, and terminate.
> > The downside is that there's one more error code userspace has to
> > special-case.
> > This would be more consistent with what we'd be doing in the blocking case.
> >
> > If NOTIF_RECV keeps returning -ENOENT, the event loop has to also tell
> > the seccomp logic what the revents are.
> >
> > I guess it probably doesn't really matter much.
>
> So, in practice, if you're emulating a blocking syscall (such as open,
> perf_event_open, or any of a number of other syscalls), you probably
> have to do it on a separate thread in the supervisor because you want
> to continue to be able to receive new notifications if any other process
> generates a seccomp notification event that you need to handle.
>
> In addition to that, some of these syscalls are preemptible, so you need
> to poll SECCOMP_IOCTL_NOTIF_ID_VALID to make sure that the program
> under supervision hasn't left the syscall.
>
> If we're to implement a mechanism that makes the seccomp ioctl receive
> non-blocking, it would be valuable to address this problem as well (getting
> a notification when the supervisor is processing a syscall and needs to
> preempt it). In the best case, this can be a minor inconvenience, and
> in the worst case this can result in weird errors where you're keeping
> resources open that the container expects to be closed.

Does "a notification" mean signals? Or would you want to have a second
thread in userspace that poll()s for cancellation events on the
seccomp fd and then somehow takes care of interrupting the first
thread, or something like that?

Either way, I think your proposal goes beyond the scope of patching
the existing weirdness, and should be a separate patch.


* Re: For review: seccomp_user_notif(2) manual page
  2020-10-28  9:43                         ` Jann Horn
@ 2020-10-28 17:43                           ` Sargun Dhillon
  2020-10-28 18:20                             ` Jann Horn
  0 siblings, 1 reply; 52+ messages in thread
From: Sargun Dhillon @ 2020-10-28 17:43 UTC (permalink / raw)
  To: Jann Horn
  Cc: Michael Kerrisk (man-pages),
	Tycho Andersen, Kees Cook, Christian Brauner, linux-man, lkml,
	Aleksa Sarai, Alexei Starovoitov, Will Drewry, bpf, Song Liu,
	Daniel Borkmann, Andy Lutomirski, Linux Containers,
	Giuseppe Scrivano, Robert Sesek

On Wed, Oct 28, 2020 at 2:43 AM Jann Horn <jannh@google.com> wrote:
>
> On Wed, Oct 28, 2020 at 7:32 AM Sargun Dhillon <sargun@sargun.me> wrote:
> > On Tue, Oct 27, 2020 at 3:28 AM Jann Horn <jannh@google.com> wrote:
> > > On Tue, Oct 27, 2020 at 7:14 AM Michael Kerrisk (man-pages)
> > > <mtk.manpages@gmail.com> wrote:
> > > > On 10/26/20 4:54 PM, Jann Horn wrote:
> > > > > I'm a bit on the fence now on whether non-blocking mode should use
> > > > > ENOTCONN or not... I guess if we returned ENOENT even when there are
> > > > > no more listeners, you'd have to disambiguate through the poll()
> > > > > revents, which would be kinda ugly?
> > > >
> > > > I must confess, I'm not quite clear on which two cases you
> > > > are trying to distinguish. Can you elaborate?
> > >
> > > Let's say someone writes a program whose responsibilities are just to
> > > handle seccomp events and to listen on some other fd for commands. And
> > > this is implemented with an event loop. Then once all the target
> > > processes are gone (including zombie reaping), we'll start getting
> > > EPOLLERR.
> > >
> > > If NOTIF_RECV starts returning -ENOTCONN at this point, the event loop
> > > can just call into the seccomp logic without any arguments; it can
> > > just call NOTIF_RECV one more time, see the -ENOTCONN, and terminate.
> > > The downside is that there's one more error code userspace has to
> > > special-case.
> > > This would be more consistent with what we'd be doing in the blocking case.
> > >
> > > If NOTIF_RECV keeps returning -ENOENT, the event loop has to also tell
> > > the seccomp logic what the revents are.
> > >
> > > I guess it probably doesn't really matter much.
> >
> > So, in practice, if you're emulating a blocking syscall (such as open,
> > perf_event_open, or any of a number of other syscalls), you probably
> > have to do it on a separate thread in the supervisor because you want
> > to continue to be able to receive new notifications if any other process
> > generates a seccomp notification event that you need to handle.
> >
> > In addition to that, some of these syscalls are preemptible, so you need
> > to poll SECCOMP_IOCTL_NOTIF_ID_VALID to make sure that the program
> > under supervision hasn't left the syscall.
> >
> > If we're to implement a mechanism that makes the seccomp ioctl receive
> > non-blocking, it would be valuable to address this problem as well (getting
> > a notification when the supervisor is processing a syscall and needs to
> > preempt it). In the best case, this can be a minor inconvenience, and
> > in the worst case this can result in weird errors where you're keeping
> > resources open that the container expects to be closed.
>
> Does "a notification" mean signals? Or would you want to have a second
> thread in userspace that poll()s for cancellation events on the
> seccomp fd and then somehow takes care of interrupting the first
> thread, or something like that?

I would be reluctant to be prescriptive in that it be a signal. Right
now, it's implemented as a second thread in userspace that does an
ioctl(...) and checks if the notification is valid / alive, and does
what's required if the notification has died (interrupting the first
thread).

>
> Either way, I think your proposal goes beyond the scope of patching
> the existing weirdness, and should be a separate patch.

I agree it should be a separate patch, but I think that it'd be nice if there
was a way to do something like:
* opt-in to getting another message after receiving the notification
  that indicates the program has left the syscall
* when you do the RECV, you can specify a flag or some such asking
  that you get signaled / notified about the program leaving the syscall
* a multiplexed receive that can say if an existing notification in progress
  has left the valid state.

---
The reason I bring this up as part of this current thread / discussion is that
I think that they may be related in terms of how we want the behaviour to act.

I would love to hear how people think this should work, or better suggestions
than the second thread approach above, or the alternative approach of
polling all the notifications in progress on some interval [and relying on
epoll timeout to trigger that interval].
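
For reference, the interval-polling alternative looks roughly like the
sketch below. struct inflight, notif_id_valid() and sweep_inflight() are
hypothetical names, not code from an existing supervisor; the only real
interface used is SECCOMP_IOCTL_NOTIF_ID_VALID, which fails with ENOENT
once the notification is no longer valid.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>
#include <linux/types.h>

/* Hypothetical bookkeeping for one request being emulated. */
struct inflight {
	__u64 id;	/* seccomp_notif.id as returned by NOTIF_RECV */
	int cancelled;	/* set once the target has left the syscall */
};

/* Returns 1 if the notification is still live, 0 if the kernel
 * reports ENOENT (target left the syscall), -1 on any other error. */
static int notif_id_valid(int notify_fd, __u64 id)
{
	if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == 0)
		return 1;
	return errno == ENOENT ? 0 : -1;
}

/* Call when epoll_wait() times out: mark every request whose
 * notification has died. Returns how many were newly marked. */
static size_t sweep_inflight(int notify_fd, struct inflight *reqs, size_t n)
{
	size_t newly = 0;

	for (size_t i = 0; i < n; i++) {
		if (reqs[i].cancelled)
			continue;
		if (notif_id_valid(notify_fd, reqs[i].id) == 0) {
			reqs[i].cancelled = 1;
			newly++;
		}
	}
	return newly;
}
```

The cost is one ioctl per in-flight request per interval, plus up to one
interval of latency before a cancellation is noticed, which is why an
opt-in kernel notification would be nicer.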


* Re: For review: seccomp_user_notif(2) manual page
  2020-10-28 17:43                           ` Sargun Dhillon
@ 2020-10-28 18:20                             ` Jann Horn
  0 siblings, 0 replies; 52+ messages in thread
From: Jann Horn @ 2020-10-28 18:20 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Michael Kerrisk (man-pages),
	Tycho Andersen, Kees Cook, Christian Brauner, linux-man, lkml,
	Aleksa Sarai, Alexei Starovoitov, Will Drewry, bpf, Song Liu,
	Daniel Borkmann, Andy Lutomirski, Linux Containers,
	Giuseppe Scrivano, Robert Sesek

On Wed, Oct 28, 2020 at 6:44 PM Sargun Dhillon <sargun@sargun.me> wrote:
> On Wed, Oct 28, 2020 at 2:43 AM Jann Horn <jannh@google.com> wrote:
> > On Wed, Oct 28, 2020 at 7:32 AM Sargun Dhillon <sargun@sargun.me> wrote:
> > > On Tue, Oct 27, 2020 at 3:28 AM Jann Horn <jannh@google.com> wrote:
> > > > On Tue, Oct 27, 2020 at 7:14 AM Michael Kerrisk (man-pages)
> > > > <mtk.manpages@gmail.com> wrote:
> > > > > On 10/26/20 4:54 PM, Jann Horn wrote:
> > > > > > I'm a bit on the fence now on whether non-blocking mode should use
> > > > > > ENOTCONN or not... I guess if we returned ENOENT even when there are
> > > > > > no more listeners, you'd have to disambiguate through the poll()
> > > > > > revents, which would be kinda ugly?
> > > > >
> > > > > I must confess, I'm not quite clear on which two cases you
> > > > > are trying to distinguish. Can you elaborate?
> > > >
> > > > Let's say someone writes a program whose responsibilities are just to
> > > > handle seccomp events and to listen on some other fd for commands. And
> > > > this is implemented with an event loop. Then once all the target
> > > > processes are gone (including zombie reaping), we'll start getting
> > > > EPOLLERR.
> > > >
> > > > If NOTIF_RECV starts returning -ENOTCONN at this point, the event loop
> > > > can just call into the seccomp logic without any arguments; it can
> > > > just call NOTIF_RECV one more time, see the -ENOTCONN, and terminate.
> > > > The downside is that there's one more error code userspace has to
> > > > special-case.
> > > > This would be more consistent with what we'd be doing in the blocking case.
> > > >
> > > > If NOTIF_RECV keeps returning -ENOENT, the event loop has to also tell
> > > > the seccomp logic what the revents are.
> > > >
> > > > I guess it probably doesn't really matter much.
> > >
> > > So, in practice, if you're emulating a blocking syscall (such as open,
> > > perf_event_open, or any of a number of other syscalls), you probably
> > > have to do it on a separate thread in the supervisor because you want
> > > to continue to be able to receive new notifications if any other process
> > > generates a seccomp notification event that you need to handle.
> > >
> > > In addition to that, some of these syscalls are preemptible, so you need
> > > to poll SECCOMP_IOCTL_NOTIF_ID_VALID to make sure that the program
> > > under supervision hasn't left the syscall.
> > >
> > > If we're to implement a mechanism that makes the seccomp ioctl receive
> > > non-blocking, it would be valuable to address this problem as well (getting
> > > a notification when the supervisor is processing a syscall and needs to
> > > preempt it). In the best case, this can be a minor inconvenience, and
> > > in the worst case this can result in weird errors where you're keeping
> > > resources open that the container expects to be closed.
> >
> > Does "a notification" mean signals? Or would you want to have a second
> > thread in userspace that poll()s for cancellation events on the
> > seccomp fd and then somehow takes care of interrupting the first
> > thread, or something like that?
>
> I would be reluctant to be prescriptive in that it be a signal. Right
> now, it's implemented as a second thread in userspace that does an
> ioctl(...) and checks if the notification is valid / alive, and does
> what's required if the notification has died (interrupting the first
> thread).
>
> >
> > Either way, I think your proposal goes beyond the scope of patching
> > the existing weirdness, and should be a separate patch.
>
> I agree it should be a separate patch, but I think that it'd be nice if there
> was a way to do something like:
> * opt-in to getting another message after receiving the notification
>   that indicates the program has left the syscall

I guess to do that cleanly, we'd want something like an array
associated with the seccomp filter that has a size N that's determined
when the filter is set up... and then when a received but unanswered
notification is cancelled, we'd insert its identifier into that array.
And if we enforce that the supervisor can never have more than N
pending messages (by just not delivering new ones if there are N old
ones pending), we'll know that any possible cancellation will always
fit, and we don't need to worry about dynamic memory allocation.

And we could raise EPOLLPRI on the file descriptor when the array is
non-empty, so that if userspace doesn't currently want to handle new
notifications (because it's already dealing with a bunch of them),
userspace can do that, too.
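
The bounded-array idea can be modelled in plain user-space C. This is only
a sketch of the invariant, not kernel code: the names cancel_slots,
MAX_PENDING, deliver(), cancel() and pop_cancellation() are hypothetical,
and it assumes that a cancelled notification keeps occupying its slot until
the supervisor has consumed the cancellation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 4   /* hypothetical N, fixed when the filter is set up */

struct cancel_slots {
    uint64_t cancelled[MAX_PENDING]; /* IDs of cancelled notifications   */
    size_t n_cancelled;              /* cancellations not yet consumed   */
    size_t n_in_flight;              /* received, unanswered, not cancelled */
};

/* Hand out a new notification only while a later cancellation must fit. */
static bool deliver(struct cancel_slots *s)
{
    if (s->n_in_flight + s->n_cancelled >= MAX_PENDING)
        return false;                /* supervisor is at its limit */
    s->n_in_flight++;
    return true;
}

/* The target abandoned the syscall. The caller must pass the ID of an
 * in-flight notification; the invariant then guarantees a free slot,
 * so no allocation is needed and this cannot fail. */
static void cancel(struct cancel_slots *s, uint64_t id)
{
    s->n_in_flight--;
    s->cancelled[s->n_cancelled++] = id;
}

/* Stand-in for the condition that would raise EPOLLPRI. */
static bool has_cancellations(const struct cancel_slots *s)
{
    return s->n_cancelled > 0;
}

/* Stand-in for a separate "receive cancellation" operation. */
static bool pop_cancellation(struct cancel_slots *s, uint64_t *id)
{
    if (s->n_cancelled == 0)
        return false;
    *id = s->cancelled[--s->n_cancelled];
    return true;
}
```

Because deliver() refuses new work once in-flight plus unconsumed
cancellations reach MAX_PENDING, cancel() never needs to allocate.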

> * when you do the RECV, you can specify a flag or some such asking
>   that you get signaled / notified about the program leaving the syscall

I think filter setup time is easier to deal with than RECV time.

> * a multiplexed receive that can say if an existing notification in progress
>   has left the valid state.

Or alternatively a separate ioctl for receiving cancellation messages,
which you'd only call on EPOLLPRI.

* Re: For review: seccomp_user_notif(2) manual page
  2020-10-26  9:51               ` Jann Horn
  2020-10-26 10:31                 ` Jann Horn
@ 2020-10-28 22:53                 ` Kees Cook
  2020-10-29  1:25                   ` Jann Horn
  1 sibling, 1 reply; 52+ messages in thread
From: Kees Cook @ 2020-10-28 22:53 UTC (permalink / raw)
  To: Jann Horn
  Cc: Tycho Andersen, Michael Kerrisk (man-pages),
	Sargun Dhillon, Christian Brauner, linux-man, lkml, Aleksa Sarai,
	Alexei Starovoitov, Will Drewry, bpf, Song Liu, Daniel Borkmann,
	Andy Lutomirski, Linux Containers, Giuseppe Scrivano,
	Robert Sesek

On Mon, Oct 26, 2020 at 10:51:02AM +0100, Jann Horn wrote:
> The problem is the scenario where a process is interrupted while it's
> waiting for the supervisor to reply.
> 
> Consider the following scenario (with supervisor "S" and target "T"; S
> wants to wait for events on two file descriptors seccomp_fd and
> other_fd):
> 
> S: starts poll() to wait for events on seccomp_fd and other_fd
> T: performs a syscall that's filtered with RET_USER_NOTIF
> S: poll() returns and signals readiness of seccomp_fd
> T: receives signal SIGUSR1
> T: syscall aborts, enters signal handler
> T: signal handler blocks on unfiltered syscall (e.g. write())
> S: starts SECCOMP_IOCTL_NOTIF_RECV
> S: blocks because no syscalls are pending

Oooh, yes, ew. Thanks for the illustration.

Thinking about this from userspace's least-surprise view, I would expect
the "recv" to stay "queued", in the sense we'd see this:

S: starts poll() to wait for events on seccomp_fd and other_fd
T: performs a syscall that's filtered with RET_USER_NOTIF
S: poll() returns and signals readiness of seccomp_fd
T: receives signal SIGUSR1
T: syscall aborts, enters signal handler
T: signal handler blocks on unfiltered syscall (e.g. write())
S: starts SECCOMP_IOCTL_NOTIF_RECV
S: gets (stale) seccomp_notif from seccomp_fd
S: sends seccomp_notif_resp, receives ENOENT (or some better errno?)

This is not at all how things are designed internally right now, but
that behavior would work, yes?

-- 
Kees Cook

* Re: For review: seccomp_user_notif(2) manual page
  2020-10-26 10:31                 ` Jann Horn
@ 2020-10-28 22:56                   ` Kees Cook
  2020-10-29  1:11                     ` Jann Horn
       [not found]                   ` <20201029021348.GB25673@cisco>
  1 sibling, 1 reply; 52+ messages in thread
From: Kees Cook @ 2020-10-28 22:56 UTC (permalink / raw)
  To: Jann Horn
  Cc: Tycho Andersen, Michael Kerrisk (man-pages),
	Sargun Dhillon, Christian Brauner, linux-man, lkml, Aleksa Sarai,
	Alexei Starovoitov, Will Drewry, bpf, Song Liu, Daniel Borkmann,
	Andy Lutomirski, Linux Containers, Giuseppe Scrivano,
	Robert Sesek

On Mon, Oct 26, 2020 at 11:31:01AM +0100, Jann Horn wrote:
> Or I guess we could also just set O_NONBLOCK on the fd by default?
> Since the one existing user is eventloop-based...

I thought about that initially, but it rubs me the wrong way: it
violates least-surprise for me. File descriptors are expected to be
default-blocking. It *is* a special fd, though, so maybe it could work.
The only case I can think of that it would break is the ioctl-loop case,
which is already buggy in that it doesn't handle non-zero returns?

-- 
Kees Cook

* Re: For review: seccomp_user_notif(2) manual page
  2020-10-28 22:56                   ` Kees Cook
@ 2020-10-29  1:11                     ` Jann Horn
  0 siblings, 0 replies; 52+ messages in thread
From: Jann Horn @ 2020-10-29  1:11 UTC (permalink / raw)
  To: Kees Cook
  Cc: Tycho Andersen, Michael Kerrisk (man-pages),
	Sargun Dhillon, Christian Brauner, linux-man, lkml, Aleksa Sarai,
	Alexei Starovoitov, Will Drewry, bpf, Song Liu, Daniel Borkmann,
	Andy Lutomirski, Linux Containers, Giuseppe Scrivano,
	Robert Sesek

On Wed, Oct 28, 2020 at 11:56 PM Kees Cook <keescook@chromium.org> wrote:
> On Mon, Oct 26, 2020 at 11:31:01AM +0100, Jann Horn wrote:
> > Or I guess we could also just set O_NONBLOCK on the fd by default?
> > Since the one existing user is eventloop-based...
>
> I thought about that initially, but it rubs me the wrong way: it
> violates least-surprise for me. File descriptors are expected to be
> default-blocking. It *is* a special fd, though, so maybe it could work.
> The only case I can think of that it would break is the ioctl-loop case,
> which is already buggy in that it doesn't handle non-zero returns?

We don't have any actual users that use the API that way outside of
the kernel's selftest/sample code, right?

* Re: For review: seccomp_user_notif(2) manual page
  2020-10-28 22:53                 ` Kees Cook
@ 2020-10-29  1:25                   ` Jann Horn
  0 siblings, 0 replies; 52+ messages in thread
From: Jann Horn @ 2020-10-29  1:25 UTC (permalink / raw)
  To: Kees Cook, Michael Kerrisk (man-pages)
  Cc: Tycho Andersen, Sargun Dhillon, Christian Brauner, linux-man,
	lkml, Aleksa Sarai, Alexei Starovoitov, Will Drewry, bpf,
	Song Liu, Daniel Borkmann, Andy Lutomirski, Linux Containers,
	Giuseppe Scrivano, Robert Sesek

On Wed, Oct 28, 2020 at 11:53 PM Kees Cook <keescook@chromium.org> wrote:
> On Mon, Oct 26, 2020 at 10:51:02AM +0100, Jann Horn wrote:
> > The problem is the scenario where a process is interrupted while it's
> > waiting for the supervisor to reply.
> >
> > Consider the following scenario (with supervisor "S" and target "T"; S
> > wants to wait for events on two file descriptors seccomp_fd and
> > other_fd):
> >
> > S: starts poll() to wait for events on seccomp_fd and other_fd
> > T: performs a syscall that's filtered with RET_USER_NOTIF
> > S: poll() returns and signals readiness of seccomp_fd
> > T: receives signal SIGUSR1
> > T: syscall aborts, enters signal handler
> > T: signal handler blocks on unfiltered syscall (e.g. write())
> > S: starts SECCOMP_IOCTL_NOTIF_RECV
> > S: blocks because no syscalls are pending
>
> Oooh, yes, ew. Thanks for the illustration.
>
> Thinking about this from userspace's least-surprise view, I would expect
> the "recv" to stay "queued", in the sense we'd see this:
>
> S: starts poll() to wait for events on seccomp_fd and other_fd
> T: performs a syscall that's filtered with RET_USER_NOTIF
> S: poll() returns and signals readiness of seccomp_fd
> T: receives signal SIGUSR1
> T: syscall aborts, enters signal handler
> T: signal handler blocks on unfiltered syscall (e.g. write())
> S: starts SECCOMP_IOCTL_NOTIF_RECV
> S: gets (stale) seccomp_notif from seccomp_fd
> S: sends seccomp_notif_resp, receives ENOENT (or some better errno?)
>
> This is not at all how things are designed internally right now, but
> that behavior would work, yes?

It would be really ugly, but it could theoretically be made to work,
to some degree.


The first bit of trouble is that currently the notification lives on
the stack of the target process. If we want to be able to show
userspace the stale notification, we'd have to store it elsewhere. And
since we really don't want to start randomly throwing -ENOMEM in any
of this stuff, we'd basically have to store it in pre-allocated memory
inside the filter.


The second bit of trouble is that if the supervisor is so oblivious
that it doesn't realize that syscalls can be interrupted, it'll run
into other problems. Let's say the target process does something like
this:

int func(void) {
  char pathbuf[4096];  /* on the stack: gone once func() returns */
  snprintf(pathbuf, sizeof(pathbuf), "/tmp/blah.%d", some_number);
  return mount("foo", pathbuf, ...);
}

and mount() is handled with a notification. If the supervisor just
reads the path string and immediately passes it into the real mount()
syscall, something like this can happen:

target: starts mount()
target: receives signal, aborts mount()
target: runs signal handler, returns from signal handler
target: returns out of func()
supervisor: receives notification
supervisor: reads path from remote buffer
supervisor: calls mount()

but because the stack allocation has already been freed by the time
the supervisor reads it, the supervisor just reads random garbage, and
beautiful fireworks ensue.

So the supervisor *fundamentally* has to be written to expect that at
*any* time, the target can abandon a syscall. And every read of remote
memory has to be separated from uses of that remote memory by a
notification ID recheck.
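
A minimal sketch of that recheck discipline, assuming the supervisor reads
the target's memory through /proc/PID/mem. The helper names notif_id_valid()
and read_target_string() are hypothetical; only
SECCOMP_IOCTL_NOTIF_ID_VALID and struct seccomp_notif come from the kernel
API.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/seccomp.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if the notification is still live, 0 otherwise. */
static int notif_id_valid(int notify_fd, __u64 id)
{
    return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == 0;
}

/*
 * Hypothetical helper: copy the NUL-terminated string that syscall
 * argument arg_num points to in the target into buf. Returns 0 on
 * success, -1 if the target left the syscall or the read failed.
 */
static int read_target_string(int notify_fd, const struct seccomp_notif *req,
                              int arg_num, char *buf, size_t len)
{
    char mem_path[64];
    ssize_t n;
    int mem_fd;

    snprintf(mem_path, sizeof(mem_path), "/proc/%d/mem", (int)req->pid);
    mem_fd = open(mem_path, O_RDONLY | O_CLOEXEC);
    if (mem_fd == -1)
        return -1;

    /* The PID may have been recycled before open(); confirm that the
     * fd refers to the task that is still blocked in the syscall. */
    if (!notif_id_valid(notify_fd, req->id)) {
        close(mem_fd);
        return -1;
    }

    n = pread(mem_fd, buf, len, (off_t)req->data.args[arg_num]);
    close(mem_fd);
    if (n <= 0)
        return -1;

    /* Recheck before use: the target may have abandoned the syscall
     * (and reused the memory) while the read was in progress. */
    if (!notif_id_valid(notify_fd, req->id))
        return -1;

    /* Reject data with no terminating NUL inside what was read. */
    if (memchr(buf, '\0', (size_t)n) == NULL)
        return -1;
    return 0;
}
```

Only after read_target_string() returns 0 would the supervisor act on buf;
on -1 it should drop the request rather than guess.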

And at that point, I think it's reasonable to expect the supervisor to
also be able to handle that a syscall can be aborted before the
notification is delivered.
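
On the supervisor side this can be as small as one errno classification.
The sketch below assumes that the kernel reports an abandoned syscall as
ENOENT from the RECV or SEND ioctl; the enum and the function
classify_notif_errno() are hypothetical names used only for illustration.

```c
#include <errno.h>

enum notif_outcome {
    NOTIF_RETRY,  /* supervisor was interrupted: go back to poll()   */
    NOTIF_GONE,   /* target abandoned the syscall: drop the request  */
    NOTIF_FATAL   /* anything else: report a real error              */
};

/* Map the errno of a failed notification ioctl to the action the
 * supervisor's event loop should take. */
static enum notif_outcome classify_notif_errno(int err)
{
    switch (err) {
    case EINTR:
        return NOTIF_RETRY;
    case ENOENT:  /* assumption: the notification is no longer pending */
        return NOTIF_GONE;
    default:
        return NOTIF_FATAL;
    }
}
```

An event loop would call this after any failed ioctl() on the seccomp fd
and treat NOTIF_GONE as a normal outcome, not an error.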

* Re: For review: seccomp_user_notif(2) manual page
       [not found]                   ` <20201029021348.GB25673@cisco>
@ 2020-10-29  4:26                     ` Jann Horn
  0 siblings, 0 replies; 52+ messages in thread
From: Jann Horn @ 2020-10-29  4:26 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Kees Cook, Michael Kerrisk (man-pages),
	Sargun Dhillon, Christian Brauner, linux-man, lkml, Aleksa Sarai,
	Alexei Starovoitov, Will Drewry, bpf, Song Liu, Daniel Borkmann,
	Andy Lutomirski, Linux Containers, Giuseppe Scrivano,
	Robert Sesek

On Thu, Oct 29, 2020 at 3:13 AM Tycho Andersen <tycho@tycho.pizza> wrote:
> > > Consider the following scenario (with supervisor "S" and target "T"; S
> > > wants to wait for events on two file descriptors seccomp_fd and
> > > other_fd):
> > >
> > > S: starts poll() to wait for events on seccomp_fd and other_fd
> > > T: performs a syscall that's filtered with RET_USER_NOTIF
> > > S: poll() returns and signals readiness of seccomp_fd
> > > T: receives signal SIGUSR1
> > > T: syscall aborts, enters signal handler
> > > T: signal handler blocks on unfiltered syscall (e.g. write())
> > > S: starts SECCOMP_IOCTL_NOTIF_RECV
> > > S: blocks because no syscalls are pending
> > >
> > > Depending on what other_fd is, this could in a worst case even lead to
> > > a deadlock (if e.g. the signal handler wants to write to stdout, but
> > > the stdout fd is hooked up to other_fd in the supervisor, but the
> > > supervisor can't consume the data written because it's stuck in
> > > seccomp handling).
> > >
> > > So we have to ensure that when existing code (like that crun code you
> > > linked to) triggers this case, SECCOMP_IOCTL_NOTIF_RECV returns
> > > immediately instead of blocking.
> >
> > Or I guess we could also just set O_NONBLOCK on the fd by default?
> > Since the one existing user is eventloop-based...
>
> I feel like it's ok to return an error from the RECV ioctl() if
> there's never going to be any more events on the fd; was there
> something fundamentally wrong with your patch here:
> https://lore.kernel.org/bpf/CAG48ez2xn+_KznEztJ-eVTsTzkbf9CVgPqaAk7TpRNAqbdaRoA@mail.gmail.com/
> ?

No, I have a new version of that about 80% done and hope to send it
out soonish. (There's some stuff around tests that I still need to
cobble together).
