For review: seccomp_user_notif(2) manual page

* For review: seccomp_user_notif(2) manual page
@ 2020-09-30 11:07 Michael Kerrisk (man-pages)
  2020-09-30 15:03 ` Tycho Andersen
                   ` (4 more replies)
  0 siblings, 5 replies; 52+ messages in thread
From: Michael Kerrisk (man-pages) @ 2020-09-30 11:07 UTC (permalink / raw)
  To: Tycho Andersen, Sargun Dhillon
  Cc: mtk.manpages, Kees Cook, Christian Brauner, linux-man, lkml,
	Aleksa Sarai, Jann Horn, Alexei Starovoitov, wad, bpf, Song Liu,
	Daniel Borkmann, Andy Lutomirski, Linux Containers,
	Giuseppe Scrivano, Robert Sesek

Hi Tycho, Sargun (and all),

I knew it would be a big ask, but below is kind of the manual page
I was hoping you might write [1] for the seccomp user-space notification
mechanism. Since you didn't (and because 5.9 adds various new pieces 
such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD 
that also will need documenting [2]), I did :-). But of course I may 
have made mistakes...

I've shown the rendered version of the page below, and would love
to receive review comments from you and others, and acks, etc.

There are a few FIXMEs sprinkled into the page, including one
that relates to what appears to me to be a misdesign (possibly 
fixable) in the operation of the SECCOMP_IOCTL_NOTIF_RECV 
operation. I would be especially interested in feedback on that
FIXME, and also of course the other FIXMEs.

The page includes an extensive (albeit slightly contrived)
example program, and I would be happy also to receive comments
on that program.

The page source currently sits in a branch (along with the text
that you sent me for the seccomp(2) page) at
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=seccomp_user_notif

Thanks,

Michael

[1] https://lore.kernel.org/linux-man/2cea5fec-e73e-5749-18af-15c35a4bd23c@gmail.com/#t
[2] Sargun, can you prepare something on SECCOMP_ADDFD_FLAG_SETFD
    and SECCOMP_IOCTL_NOTIF_ADDFD to be added to this page?

=====

NAME
       seccomp_user_notif - Seccomp user-space notification mechanism

SYNOPSIS
       #include <linux/seccomp.h>
       #include <linux/filter.h>
       #include <linux/audit.h>

       int seccomp(unsigned int operation, unsigned int flags, void *args);

DESCRIPTION
       This  page  describes  the user-space notification mechanism pro‐
       vided by the Secure Computing (seccomp) facility.  As well as the
       use   of  the  SECCOMP_FILTER_FLAG_NEW_LISTENER  flag,  the  SEC‐
       COMP_RET_USER_NOTIF action value, and the SECCOMP_GET_NOTIF_SIZES
       operation  described  in  seccomp(2), this mechanism involves the
       use of a number of related ioctl(2) operations (described below).

   Overview
       In conventional usage of a seccomp filter, the decision about how
       to  treat  a particular system call is made by the filter itself.
       The user-space notification mechanism allows the handling of  the
       system  call  to  instead  be handed off to a user-space process.
       The advantages of doing this are that, by contrast with the  sec‐
       comp  filter,  which  is  running on a virtual machine inside the
       kernel, the user-space process has access to information that  is
       unavailable to the seccomp filter and it can perform actions that
       can't be performed from the seccomp filter.

       In the discussion that follows, the process  that  has  installed
       the  seccomp filter is referred to as the target, and the process
       that is notified by  the  user-space  notification  mechanism  is
       referred  to  as  the  supervisor.  An overview of the steps per‐
       formed by these two processes is as follows:

       1. The target process establishes a seccomp filter in  the  usual
          manner, but with two differences:

          · The seccomp(2) flags argument includes the flag SECCOMP_FIL‐
            TER_FLAG_NEW_LISTENER.  Consequently, the return  value   of
            the  (successful)  seccomp(2) call is a new "listening" file
            descriptor that can be used to receive notifications.

          · In cases where it is appropriate, the seccomp filter returns
            the  action value SECCOMP_RET_USER_NOTIF.  This return value
            will trigger a notification event.

       2. In order that the supervisor process can obtain  notifications
          using  the  listening  file  descriptor, (a duplicate of) that
          file descriptor must be passed from the target process to  the
          supervisor process.  One way in which this could be done is by
          passing the file descriptor over a UNIX domain socket  connec‐
          tion between the two processes (using the SCM_RIGHTS ancillary
          message type described in unix(7)).   Another  possibility  is
          that  the  supervisor  might  inherit  the file descriptor via
          fork(2).

       3. The supervisor process will receive notification events on the
          listening  file  descriptor.   These  events  are  returned as
          structures of type seccomp_notif.  Because this structure  and
          its  size may evolve over kernel versions, the supervisor must
          first determine the size of  this  structure  using  the  sec‐
          comp(2)  SECCOMP_GET_NOTIF_SIZES  operation,  which  returns a
          structure of type seccomp_notif_sizes.  The  supervisor  allo‐
          cates a buffer of size seccomp_notif_sizes.seccomp_notif bytes
          to receive notification events.   In  addition,the  supervisor
          allocates  another  buffer  of  size  seccomp_notif_sizes.sec‐
          comp_notif_resp  bytes  for  the  response  (a   struct   sec‐
          comp_notif_resp  structure) that it will provide to the kernel
          (and thus the target process).

       4. The target process then performs its workload, which  includes
          system  calls  that  will be controlled by the seccomp filter.
          Whenever one of these system calls causes the filter to return
          the  SECCOMP_RET_USER_NOTIF  action value, the kernel does not
          execute the system call;  instead,  execution  of  the  target
          process is temporarily blocked inside the kernel and a notifi‐
          cation event is generated on the listening file descriptor.

       5. The supervisor process can now repeatedly monitor the  listen‐
          ing   file   descriptor  for  SECCOMP_RET_USER_NOTIF-triggered
          events.   To  do  this,   the   supervisor   uses   the   SEC‐
          COMP_IOCTL_NOTIF_RECV  ioctl(2)  operation to read information
          about a notification event; this  operation  blocks  until  an
          event  is  available.   The  operation returns a seccomp_notif
          structure containing information about the system call that is
          being attempted by the target process.

       6. The    seccomp_notif    structure   returned   by   the   SEC‐
          COMP_IOCTL_NOTIF_RECV operation includes the same  information
          (a seccomp_data structure) that was passed to the seccomp fil‐
          ter.  This information allows the supervisor to  discover  the
          system  call number and the arguments for the target process's
          system call.  In addition, the notification event contains the
          PID of the target process.

          The  information  in  the notification can be used to discover
          the values of pointer arguments for the target process's  sys‐
          tem call.  (This is something that can't be done from within a
          seccomp filter.)  To do this (and  assuming  it  has  suitable
          permissions),   the   supervisor   opens   the   corresponding
          /proc/[pid]/mem file, seeks to the memory location that corre‐
          sponds to one of the pointer arguments whose value is supplied
          in the notification event, and reads bytes from that location.
          (The supervisor must be careful to avoid a race condition that
          can occur when doing this; see the  description  of  the  SEC‐
          COMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.)  In addi‐
          tion, the supervisor can access other system information  that
          is  visible  in  user space but which is not accessible from a
          seccomp filter.

          ┌─────────────────────────────────────────────────────┐
          │FIXME                                                │
          ├─────────────────────────────────────────────────────┤
          │Suppose we are reading a pathname from /proc/PID/mem │
          │for  a system call such as mkdir(). The pathname can │
          │be an arbitrary length. How do we know how much (how │
          │many pages) to read from /proc/PID/mem?              │
          └─────────────────────────────────────────────────────┘

       7. Having  obtained  information  as  per  the previous step, the
          supervisor may then choose to perform an action in response to
          the  target  process's  system call (which, as noted above, is
          not  executed  when  the  seccomp  filter  returns  the   SEC‐
          COMP_RET_USER_NOTIF action value).

          One  example  use case here relates to containers.  The target
          process may be located inside a container where  it  does  not
          have sufficient capabilities to mount a filesystem in the con‐
          tainer's mount namespace.  However, the supervisor  may  be  a
          more  privileged  process that that does have sufficient capa‐
          bilities to perform the mount operation.

       8. The supervisor then sends a response to the notification.  The
          information  in  this  response  is used by the kernel to con‐
          struct a return value for the target process's system call and
          provide a value that will be assigned to the errno variable of
          the target process.

          The  response  is  sent  using  the   SECCOMP_IOCTL_NOTIF_RECV
          ioctl(2)   operation,   which  is  used  to  transmit  a  sec‐
          comp_notif_resp  structure  to  the  kernel.   This  structure
          includes  a  cookie  value that the supervisor obtained in the
          seccomp_notif    structure    returned     by     the     SEC‐
          COMP_IOCTL_NOTIF_RECV operation.  This cookie value allows the
          kernel to associate the response with the target process.

       9. Once the notification has been sent, the system  call  in  the
          target  process  unblocks,  returning the information that was
          provided by the supervisor in the notification response.

       As a variation on the last two steps, the supervisor can  send  a
       response  that tells the kernel that it should execute the target
       process's   system   call;   see   the   discussion    of    SEC‐
       COMP_USER_NOTIF_FLAG_CONTINUE, below.

   ioctl(2) operations
       The following ioctl(2) operations are provided to support seccomp
       user-space notification.  For each of these operations, the first
       (file  descriptor)  argument  of  ioctl(2)  is the listening file
       descriptor returned by a call to seccomp(2) with the SECCOMP_FIL‐
       TER_FLAG_NEW_LISTENER flag.

       SECCOMP_IOCTL_NOTIF_RECV
              This operation is used to obtain a user-space notification
              event.  If no such event is currently pending, the  opera‐
              tion  blocks  until  an  event occurs.  The third ioctl(2)
              argument is a pointer to a structure of the following form
              which  contains  information about the event.  This struc‐
              ture must be zeroed out before the call.

                  struct seccomp_notif {
                      __u64  id;              /* Cookie */
                      __u32  pid;             /* PID of target process */
                      __u32  flags;           /* Currently unused (0) */
                      struct seccomp_data data;   /* See seccomp(2) */
                  };

              The fields in this structure are as follows:

              id     This is a cookie for the notification.   Each  such
                     cookie  is  guaranteed  to be unique for the corre‐
                     sponding seccomp  filter.   In  other  words,  this
                     cookie  is  unique for each notification event from
                     the target process.  The cookie value has the  fol‐
                     lowing uses:

                     · It     can     be     used    with    the    SEC‐
                       COMP_IOCTL_NOTIF_ID_VALID ioctl(2)  operation  to
                       verify that the target process is still alive.

                     · When  returning  a  notification  response to the
                       kernel, the supervisor must  include  the  cookie
                       value in the seccomp_notif_resp structure that is
                       specified   as   the   argument   of   the   SEC‐
                       COMP_IOCTL_NOTIF_SEND operation.

              pid    This  is  the  PID of the target process that trig‐
                     gered the notification event.

                     ┌─────────────────────────────────────────────────────┐
                     │FIXME                                                │
                     ├─────────────────────────────────────────────────────┤
                     │This is a thread ID, rather than a PID, right?       │
                     └─────────────────────────────────────────────────────┘

              flags  This is a  bit  mask  of  flags  providing  further
                     information on the event.  In the current implemen‐
                     tation, this field is always zero.

              data   This is a seccomp_data structure containing  infor‐
                     mation  about  the  system  call that triggered the
                     notification.  This is the same structure  that  is
                     passed  to  the seccomp filter.  See seccomp(2) for
                     details of this structure.

              On success, this operation returns 0; on  failure,  -1  is
              returned,  and  errno  is set to indicate the cause of the
              error.  This operation can fail with the following errors:

              EINVAL (since Linux 5.5)
                     The seccomp_notif structure that was passed to  the
                     call contained nonzero fields.

              ENOENT The  target  process  was killed by a signal as the
                     notification information was being generated.

       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │From my experiments,  it  appears  that  if  a  SEC‐ │
       │COMP_IOCTL_NOTIF_RECV   is  done  after  the  target │
       │process terminates, then the ioctl()  simply  blocks │
       │(rather than returning an error to indicate that the │
       │target process no longer exists).                    │
       │                                                     │
       │I found that surprising, and it required  some  con‐ │
       │tortions  in the example program.  It was not possi‐ │
       │ble to code my SIGCHLD handler (which reaps the zom‐ │
       │bie  when  the  worker/target process terminates) to │
       │simply set a flag checked in the main  handleNotifi‐ │
       │cations()  loop,  since  this created an unavoidable │
       │race where the child might terminate  just  after  I │
       │had  checked  the  flag,  but before I blocked (for‐ │
       │ever!) in  the  SECCOMP_IOCTL_NOTIF_RECV  operation. │
       │Instead,  I had to code the signal handler to simply │
       │call _exit(2)  in  order  to  terminate  the  parent │
       │process (the supervisor).                            │
       │                                                     │
       │Is  this  expected  behavior?  It seems to me rather │
       │desirable that SECCOMP_IOCTL_NOTIF_RECV should  give │
       │an error if the target process has terminated.       │
       └─────────────────────────────────────────────────────┘

       SECCOMP_IOCTL_NOTIF_ID_VALID
              This operation can be used to check that a notification ID
              returned by an earlier SECCOMP_IOCTL_NOTIF_RECV  operation
              is  still  valid  (i.e.,  that  the  target  process still
              exists).

              The third ioctl(2) argument is a  pointer  to  the  cookie
              (id) returned by the SECCOMP_IOCTL_NOTIF_RECV operation.

              This  operation is necessary to avoid race conditions that
              can  occur   when   the   pid   returned   by   the   SEC‐
              COMP_IOCTL_NOTIF_RECV   operation   terminates,  and  that
              process ID is reused by another process.   An  example  of
              this kind of race is the following

              1. A  notification  is  generated  on  the  listening file
                 descriptor.  The returned  seccomp_notif  contains  the
                 PID of the target process.

              2. The target process terminates.

              3. Another process is created on the system that by chance
                 reuses the PID that was freed when the  target  process
                 terminates.

              4. The  supervisor  open(2)s  the /proc/[pid]/mem file for
                 the PID obtained in step 1, with the intention of (say)
                 inspecting the memory locations that contains the argu‐
                 ments of the system call that triggered  the  notifica‐
                 tion in step 1.

              In the above scenario, the risk is that the supervisor may
              try to access the memory of a process other than the  tar‐
              get.   This  race  can be avoided by following the call to
              open with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to ver‐
              ify  that  the  process that generated the notification is
              still alive.  (Note that  if  the  target  process  subse‐
              quently  terminates, its PID won't be reused because there
              remains an open reference to the /proc[pid]/mem  file;  in
              this  case, a subsequent read(2) from the file will return
              0, indicating end of file.)

              On success (i.e., the notification  ID  is  still  valid),
              this  operation  returns 0 On failure (i.e., the notifica‐
              tion ID is no longer valid), -1 is returned, and errno  is
              set to ENOENT.

       SECCOMP_IOCTL_NOTIF_SEND
              This  operation  is  used  to send a notification response
              back to the kernel.  The third ioctl(2) argument  of  this
              structure  is  a  pointer  to a structure of the following
              form:

                  struct seccomp_notif_resp {
                      __u64 id;               /* Cookie value */
                      __s64 val;              /* Success return value */
                      __s32 error;            /* 0 (success) or negative
                                                 error number */
                      __u32 flags;            /* See below */
                  };

              The fields of this structure are as follows:

              id     This is the cookie value that  was  obtained  using
                     the   SECCOMP_IOCTL_NOTIF_RECV   operation.    This
                     cookie value allows the kernel to  correctly  asso‐
                     ciate this response with the system call that trig‐
                     gered the user-space notification.

              val    This is the value that will be used for  a  spoofed
                     success  return  for  the  target  process's system
                     call; see below.

              error  This is the value that will be used  as  the  error
                     number  (errno)  for a spoofed error return for the
                     target process's system call; see below.

              flags  This is a bit mask that includes zero  or  more  of
                     the following flags

                     SECCOMP_USER_NOTIF_FLAG_CONTINUE (since Linux 5.5)
                            Tell   the  kernel  to  execute  the  target
                            process's system call.

              Two kinds of response are possible:

              · A response to the kernel telling it to execute the  tar‐
                get  process's  system  call.   In  this case, the flags
                field includes SECCOMP_USER_NOTIF_FLAG_CONTINUE and  the
                error and val fields must be zero.

                This  kind  of response can be useful in cases where the
                supervisor needs to do deeper analysis of  the  target's
                system  call  than  is  possible  from  a seccomp filter
                (e.g., examining the values of pointer arguments),  and,
                having  verified that the system call is acceptable, the
                supervisor wants to allow it to proceed.

              · A spoofed return value for the target  process's  system
                call.   In  this  case,  the kernel does not execute the
                target process's system call, instead causing the system
                call to return a spoofed value as specified by fields of
                the seccomp_notif_resp structure.  The supervisor should
                set the fields of this structure as follows:

                +  flags  does  not contain SECCOMP_USER_NOTIF_FLAG_CON‐
                   TINUE.

                +  error is set either to  0  for  a  spoofed  "success"
                   return  or  to  a negative error number for a spoofed
                   "failure" return.  In the  former  case,  the  kernel
                   causes the target process's system call to return the
                   value specified in the val field.  In the later case,
                   the kernel causes the target process's system call to
                   return -1, and errno is assigned  the  negated  error
                   value.

                +  val is set to a value that will be used as the return
                   value for a spoofed "success" return for  the  target
                   process's  system  call.   The value in this field is
                   ignored if the error field contains a nonzero value.

              On success, this operation returns 0; on  failure,  -1  is
              returned,  and  errno  is set to indicate the cause of the
              error.  This operation can fail with the following errors:

              EINPROGRESS
                     A response to this notification  has  already  been
                     sent.

              EINVAL An invalid value was specified in the flags field.

              EINVAL The       flags      field      contained      SEC‐
                     COMP_USER_NOTIF_FLAG_CONTINUE, and the error or val
                     field was not zero.

              ENOENT The  blocked  system call in the target process has
                     been interrupted by a signal handler.

NOTES
       The file descriptor returned when seccomp(2) is employed with the
       SECCOMP_FILTER_FLAG_NEW_LISTENER  flag  can  be  monitored  using
       poll(2), epoll(7), and select(2).  When a notification  is  pend‐
       ing,  these interfaces indicate that the file descriptor is read‐
       able.

       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │Interestingly, after the event  had  been  received, │
       │the  file descriptor indicates as writable (verified │
       │from the source code and by experiment). How is this │
       │useful?                                              │
       └─────────────────────────────────────────────────────┘

EXAMPLES
       The (somewhat contrived) program shown below demonstrates the use
       of the interfaces described in this page.  The program creates  a
       child  process  that  serves  as the "target" process.  The child
       process  installs  a  seccomp  filter  that  returns   the   SEC‐
       COMP_RET_USER_NOTIF  action  value if a call is made to mkdir(2).
       The child process then calls mkdir(2) once for each of  the  sup‐
       plied  command-line arguments, and reports the result returned by
       the call.  After processing all arguments, the child process ter‐
       minates.

       The  parent  process  acts  as  the supervisor, listening for the
       notifications that are generated when the  target  process  calls
       mkdir(2).   When such a notification occurs, the supervisor exam‐
       ines the memory of the target process (using /proc/[pid]/mem)  to
       discover  the pathname argument that was supplied to the mkdir(2)
       call, and performs one of the following actions:

       · If the pathname begins with the prefix "/tmp/", then the super‐
         visor  attempts  to  create  the  specified directory, and then
         spoofs a return for the target  process  based  on  the  return
         value  of  the  supervisor's  mkdir(2) call.  In the event that
         that call succeeds, the spoofed success  return  value  is  the
         length of the pathname.

       · If  the pathname begins with "./" (i.e., it is a relative path‐
         name), the supervisor sends a  SECCOMP_USER_NOTIF_FLAG_CONTINUE
         response  to  the  kernel to say that kernel should execute the
         target process's mkdir(2) call.

       · If the pathname begins with some other prefix,  the  supervisor
         spoofs an error return for the target process, so that the tar‐
         get process's mkdir(2) call appears to fail with the error EOP‐
         NOTSUPP  ("Operation  not  supported").   Additionally,  if the
         specified pathname is exactly "/bye", then the supervisor  ter‐
         minates.

       This  program  can  used  to  demonstrate  various aspects of the
       behavior of the seccomp user-space  notification  mechanism.   To
       help  aid  such demonstrations, the program logs various messages
       to show the operation of the target process (lines prefixed "T:")
       and the supervisor (indented lines prefixed "S:").

       In  the  following  example,  the  target  attempts to create the
       directory /tmp/x.  Upon receiving the notification, the  supervi‐
       sor  creates  the  directory on the target's behalf, and spoofs a
       success return to be received by the  target  process's  mkdir(2)
       call.

           $ ./seccomp_unotify /tmp/x
           T: PID = 23168

           T: about to mkdir("/tmp/x")
                   S: got notification (ID 0x17445c4a0f4e0e3c) for PID 23168
                   S: executing: mkdir("/tmp/x", 0700)
                   S: success! spoofed return = 6
                   S: sending response (flags = 0; val = 6; error = 0)
           T: SUCCESS: mkdir(2) returned 6

           T: terminating
                   S: target has terminated; bye

       In  the  above output, note that the spoofed return value seen by
       the target process is 6 (the  length  of  the  pathname  /tmp/x),
       whereas a normal mkdir(2) call returns 0 on success.

       In  the  next  example, the target attempts to create a directory
       using the relative pathname ./sub.  Since  this  pathname  starts
       with  "./",  the  supervisor sends a SECCOMP_USER_NOTIF_FLAG_CON‐
       TINUE response to the kernel, and the kernel then  (successfully)
       executes the target process's mkdir(2) call.

           $ ./seccomp_unotify ./sub
           T: PID = 23204

           T: about to mkdir("./sub")
                   S: got notification (ID 0xddb16abe25b4c12) for PID 23204
                   S: target can execute system call
                   S: sending response (flags = 0x1; val = 0; error = 0)
           T: SUCCESS: mkdir(2) returned 0

           T: terminating
                   S: target has terminated; bye

       If the target process attempts to create a directory with a path‐
       name that doesn't start with "." and doesn't begin with the  pre‐
       fix  "/tmp/", then the supervisor spoofs an error return (EOPNOT‐
       SUPP, "Operation not  supported") for the target's mkdir(2)  call
       (which is not executed):

           $ ./seccomp_unotify /xxx
           T: PID = 23178

           T: about to mkdir("/xxx")
                   S: got notification (ID 0xe7dc095d1c524e80) for PID 23178
                   S: spoofing error response (Operation not supported)
                   S: sending response (flags = 0; val = 0; error = -95)
           T: ERROR: mkdir(2): Operation not supported

           T: terminating
                   S: target has terminated; bye

       In  the  next  example,  the  target process attempts to create a
       directory with the pathname /tmp/nosuchdir/b.  Upon receiving the
       notification,  the  supervisor attempts to create that directory,
       but the mkdir(2) call fails because the directory  /tmp/nosuchdir
       does  not  exist.   Consequently,  the supervisor spoofs an error
       return that passes the error that it received back to the  target
       process's mkdir(2) call.

           $ ./seccomp_unotify /tmp/nosuchdir/b
           T: PID = 23199

           T: about to mkdir("/tmp/nosuchdir/b")
                   S: got notification (ID 0x8744454293506046) for PID 23199
                   S: executing: mkdir("/tmp/nosuchdir/b", 0700)
                   S: failure! (errno = 2; No such file or directory)
                   S: sending response (flags = 0; val = 0; error = -2)
           T: ERROR: mkdir(2): No such file or directory

           T: terminating
                   S: target has terminated; bye

       If the supervisor receives a notification and sees that the argu‐
       ment of the target's mkdir(2) is the string "/bye", then (as well
       as  spoofing an EOPNOTSUPP error), the supervisor terminates.  If
       the target process subsequently executes  another  mkdir(2)  that
       triggers  its seccomp filter to return the SECCOMP_RET_USER_NOTIF
       action value, then the kernel causes the target process's  system
       call  to fail with the error ENOSYS ("Function not implemented").
       This is demonstrated by the following example:

           $ ./seccomp_unotify /bye /tmp/y
           T: PID = 23185

           T: about to mkdir("/bye")
                   S: got notification (ID 0xa81236b1d2f7b0f4) for PID 23185
                   S: spoofing error response (Operation not supported)
                   S: sending response (flags = 0; val = 0; error = -95)
                   S: terminating **********
           T: ERROR: mkdir(2): Operation not supported

           T: about to mkdir("/tmp/y")
           T: ERROR: mkdir(2): Function not implemented

           T: terminating

   Program source
       #define _GNU_SOURCE
       #include <sys/types.h>
       #include <sys/prctl.h>
       #include <fcntl.h>
       #include <limits.h>
       #include <signal.h>
       #include <stddef.h>
       #include <stdint.h>
       #include <stdbool.h>
       #include <linux/audit.h>
       #include <sys/syscall.h>
       #include <sys/stat.h>
       #include <linux/filter.h>
       #include <linux/seccomp.h>
       #include <sys/ioctl.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <unistd.h>
       #include <errno.h>
       #include <sys/socket.h>
       #include <sys/un.h>

       #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                               } while (0)

       /* Send the file descriptor 'fd' over the connected UNIX domain socket
          'sockfd'. Returns 0 on success, or -1 on error. */

       static int
       sendfd(int sockfd, int fd)
       {
           struct msghdr msgh;
           struct iovec iov;
           int data;
           struct cmsghdr *cmsgp;

           /* Allocate a char array of suitable size to hold the ancillary data.
              However, since this buffer is in reality a 'struct cmsghdr', use a
              union to ensure that it is suitable aligned. */
           union {
               char   buf[CMSG_SPACE(sizeof(int))];
                               /* Space large enough to hold an 'int' */
               struct cmsghdr align;
           } controlMsg;

           /* The 'msg_name' field can be used to specify the address of the
              destination socket when sending a datagram. However, we do not
              need to use this field because 'sockfd' is a connected socket. */

           msgh.msg_name = NULL;
           msgh.msg_namelen = 0;

           /* On Linux, we must transmit at least one byte of real data in
              order to send ancillary data. We transmit an arbitrary integer
              whose value is ignored by recvfd(). */

           msgh.msg_iov = &iov;
           msgh.msg_iovlen = 1;
           iov.iov_base = &data;
           iov.iov_len = sizeof(int);
           data = 12345;

           /* Set 'msghdr' fields that describe ancillary data */

           msgh.msg_control = controlMsg.buf;
           msgh.msg_controllen = sizeof(controlMsg.buf);

           /* Set up ancillary data describing file descriptor to send */

           cmsgp = CMSG_FIRSTHDR(&msgh);
           cmsgp->cmsg_level = SOL_SOCKET;
           cmsgp->cmsg_type = SCM_RIGHTS;
           cmsgp->cmsg_len = CMSG_LEN(sizeof(int));
           memcpy(CMSG_DATA(cmsgp), &fd, sizeof(int));

           /* Send real plus ancillary data */

           if (sendmsg(sockfd, &msgh, 0) == -1)
               return -1;

           return 0;
       }

       /* Receive a file descriptor on a connected UNIX domain socket. Returns
          the received file descriptor on success, or -1 on error. */

       static int
       recvfd(int sockfd)
       {
           struct msghdr msgh;
           struct iovec iov;
           int data, fd;
           ssize_t nr;

           /* Allocate a char buffer for the ancillary data. See the comments
              in sendfd() */
           union {
               char   buf[CMSG_SPACE(sizeof(int))];
               struct cmsghdr align;
           } controlMsg;
           struct cmsghdr *cmsgp;

           /* The 'msg_name' field can be used to obtain the address of the
              sending socket. However, we do not need this information. */

           msgh.msg_name = NULL;
           msgh.msg_namelen = 0;

           /* Specify buffer for receiving real data */

           msgh.msg_iov = &iov;
           msgh.msg_iovlen = 1;
           iov.iov_base = &data;       /* Real data is an 'int' */
           iov.iov_len = sizeof(int);

           /* Set 'msghdr' fields that describe ancillary data */

           msgh.msg_control = controlMsg.buf;
           msgh.msg_controllen = sizeof(controlMsg.buf);

           /* Receive real plus ancillary data; real data is ignored */

           nr = recvmsg(sockfd, &msgh, 0);
           if (nr == -1)
               return -1;

           cmsgp = CMSG_FIRSTHDR(&msgh);

           /* Check the validity of the 'cmsghdr' */

           if (cmsgp == NULL ||
                   cmsgp->cmsg_len != CMSG_LEN(sizeof(int)) ||
                   cmsgp->cmsg_level != SOL_SOCKET ||
                   cmsgp->cmsg_type != SCM_RIGHTS) {
               errno = EINVAL;
               return -1;
           }

           /* Return the received file descriptor to our caller */

           memcpy(&fd, CMSG_DATA(cmsgp), sizeof(int));
           return fd;
       }

       static void
       sigchldHandler(int sig)
       {
           char *msg  = "\tS: target has terminated; bye\n";

           write(STDOUT_FILENO, msg, strlen(msg));
           _exit(EXIT_SUCCESS);
       }

       static int
       seccomp(unsigned int operation, unsigned int flags, void *args)
       {
           return syscall(__NR_seccomp, operation, flags, args);
       }

       /* The following is the x86-64-specific BPF boilerplate code for checking
          that the BPF program is running on the right architecture + ABI. At
          completion of these instructions, the accumulator contains the system
          call number. */

       /* For the x32 ABI, all system call numbers have bit 30 set */

       #define X32_SYSCALL_BIT         0x40000000

       #define X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR \
               BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \
                       (offsetof(struct seccomp_data, arch))), \
               BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 2), \
               BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \
                        (offsetof(struct seccomp_data, nr))), \
               BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, X32_SYSCALL_BIT, 0, 1), \
               BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS)

       /* installNotifyFilter() installs a seccomp filter that generates
          user-space notifications (SECCOMP_RET_USER_NOTIF) when the process
          calls mkdir(2); the filter allows all other system calls.

          The function return value is a file descriptor from which the
          user-space notifications can be fetched. */

       static int
       installNotifyFilter(void)
       {
           struct sock_filter filter[] = {
               X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR,

               /* mkdir() triggers notification to user-space supervisor */

               BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mkdir, 0, 1),
               BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_USER_NOTIF),

               /* Every other system call is allowed */

               BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
           };

           struct sock_fprog prog = {
               .len = sizeof(filter) / sizeof(filter[0]),
               .filter = filter,
           };

           /* Install the filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER flag;
              as a result, seccomp() returns a notification file descriptor. */

           int notifyFd = seccomp(SECCOMP_SET_MODE_FILTER,
                                  SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
           if (notifyFd == -1)
               errExit("seccomp-install-notify-filter");

           return notifyFd;
       }

       /* Close a pair of sockets created by socketpair() */

       static void
       closeSocketPair(int sockPair[2])
       {
           if (close(sockPair[0]) == -1)
               errExit("closeSocketPair-close-0");
           if (close(sockPair[1]) == -1)
               errExit("closeSocketPair-close-1");
       }

       /* Implementation of the target process; create a child process that:

          (1) installs a seccomp filter with the
              SECCOMP_FILTER_FLAG_NEW_LISTENER flag;
          (2) writes the seccomp notification file descriptor returned from
              the previous step onto the UNIX domain socket, 'sockPair[0]';
          (3) calls mkdir(2) for each element of 'argv'.

          The function return value in the parent is the PID of the child
          process; the child does not return from this function. */

       static pid_t
       targetProcess(int sockPair[2], char *argv[])
       {
           pid_t targetPid = fork();
           if (targetPid == -1)
               errExit("fork");

           if (targetPid > 0)          /* In parent, return PID of child */
               return targetPid;

           /* Child falls through to here */

           printf("T: PID = %ld\n", (long) getpid());

           /* Install seccomp filter(s) */

           if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
               errExit("prctl");

           int notifyFd = installNotifyFilter();

           /* Pass the notification file descriptor to the tracing process over
              a UNIX domain socket */

           if (sendfd(sockPair[0], notifyFd) == -1)
               errExit("sendfd");

           /* Notification and socket FDs are no longer needed in target */

           if (close(notifyFd) == -1)
               errExit("close-target-notify-fd");

           closeSocketPair(sockPair);

           /* Perform a mkdir() call for each of the command-line arguments */

           for (char **ap = argv; *ap != NULL; ap++) {
               printf("\nT: about to mkdir(\"%s\")\n", *ap);

               int s = mkdir(*ap, 0700);
               if (s == -1)
                   perror("T: ERROR: mkdir(2)");
               else
                   printf("T: SUCCESS: mkdir(2) returned %d\n", s);
           }

           printf("\nT: terminating\n");
           exit(EXIT_SUCCESS);
       }

       /* Check that the notification ID provided by a SECCOMP_IOCTL_NOTIF_RECV
          operation is still valid. It will no longer be valid if the process
          has terminated. This operation can be used when accessing /proc/PID
          files in the target process in order to avoid TOCTOU race conditions
          where the PID that is returned by SECCOMP_IOCTL_NOTIF_RECV terminates
          and is reused by another process. */

       static void
       checkNotificationIdIsValid(int notifyFd, uint64_t id)
       {
           if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == -1) {
               fprintf(stderr, "\tS: notification ID check: "
                       "target has terminated!!!\n");

               exit(EXIT_FAILURE);
           }
       }

       /* Access the memory of the target process in order to discover the
          pathname that was given to mkdir() */

       static void
       getTargetPathname(struct seccomp_notif *req, int notifyFd,
                         char *path, size_t len)
       {
           char procMemPath[PATH_MAX];
           snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", req->pid);

           int procMemFd = open(procMemPath, O_RDONLY);
           if (procMemFd == -1)
               errExit("Supervisor: open");

           /* Check that the process whose info we are accessing is still alive.
              If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed
              in checkNotificationIdIsValid()) succeeds, we know that the
              /proc/PID/mem file descriptor that we opened corresponds to the
              process for which we received a notification. If that process
              subsequently terminates, then read() on that file descriptor
              will return 0 (EOF). */

           checkNotificationIdIsValid(notifyFd, req->id);

           /* Seek to the location containing the pathname argument (i.e., the
              first argument) of the mkdir(2) call and read that pathname */

           if (lseek(procMemFd, req->data.args[0], SEEK_SET) == -1)
               errExit("Supervisor: lseek");

           ssize_t s = read(procMemFd, path, PATH_MAX);
           if (s == -1)
               errExit("read");

           if (s == 0) {
               fprintf(stderr, "\tS: read() of /proc/PID/mem "
                       "returned 0 (EOF)\n");
               exit(EXIT_FAILURE);
           }

           if (close(procMemFd) == -1)
               errExit("close-/proc/PID/mem");
       }

       /* Handle notifications that arrive via the SECCOMP_RET_USER_NOTIF file
          descriptor, 'notifyFd'. */

       static void
       handleNotifications(int notifyFd)
       {
           struct seccomp_notif_sizes sizes;
           char path[PATH_MAX];
               /* For simplicity, we assume that the pathname given to mkdir()
                  is no more than PATH_MAX bytes; but this might not be true. */

           /* Discover the sizes of the structures that are used to receive
              notifications and send notification responses, and allocate
              buffers of those sizes. */

           if (seccomp(SECCOMP_GET_NOTIF_SIZES, 0, &sizes) == -1)
               errExit("\tS: seccomp-SECCOMP_GET_NOTIF_SIZES");

           struct seccomp_notif *req = malloc(sizes.seccomp_notif);
           if (req == NULL)
               errExit("\tS: malloc");

           struct seccomp_notif_resp *resp = malloc(sizes.seccomp_notif_resp);
           if (resp == NULL)
               errExit("\tS: malloc");

           /* Loop handling notifications */

           for (;;) {
               /* Wait for next notification, returning info in '*req' */

               memset(req, 0, sizes.seccomp_notif);
               if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_RECV, req) == -1) {
                   if (errno == EINTR)
                       continue;
                   errExit("Supervisor: ioctl-SECCOMP_IOCTL_NOTIF_RECV");
               }

               printf("\tS: got notification (ID %#llx) for PID %d\n",
                       req->id, req->pid);

               /* The only system call that can generate a notification event
                  is mkdir(2). Nevertheless, we check that the notified system
                  call is indeed mkdir() as kind of future-proofing of this
                  code in case the seccomp filter is later modified to
                  generate notifications for other system calls. */

               if (req->data.nr != __NR_mkdir) {
                   printf("\tS: notification contained unexpected "
                           "system call number; bye!!!\n");
                   exit(EXIT_FAILURE);
               }

               getTargetPathname(req, notifyFd, path, sizeof(path));

               /* Prepopulate some fields of the response */

               resp->id = req->id;     /* Response includes notification ID */
               resp->flags = 0;
               resp->val = 0;

               /* If the directory is in /tmp, then create it on behalf of
                  the supervisor; if the pathname starts with '.', tell the
                  kernel to let the target process execute the mkdir();
                  otherwise, give an error for a directory pathname in
                  any other location. */

               if (strncmp(path, "/tmp/", strlen("/tmp/")) == 0) {
                   printf("\tS: executing: mkdir(\"%s\", %#llo)\n",
                           path, req->data.args[1]);

                   if (mkdir(path, req->data.args[1]) == 0) {
                       resp->error = 0;            /* "Success" */
                       resp->val = strlen(path);   /* Used as return value of
                                                      mkdir() in target */
                       printf("\tS: success! spoofed return = %lld\n",
                               resp->val);
                   } else {

                       /* If mkdir() failed in the supervisor, pass the error
                          back to the target */

                       resp->error = -errno;
                       printf("\tS: failure! (errno = %d; %s)\n", errno,
                               strerror(errno));
                   }
                                                            } else if (strncmp(path, "./", strlen("./")) == 0) {
                   resp->error = resp->val = 0;
                   resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
                   printf("\tS: target can execute system call\n");
               } else {
                   resp->error = -EOPNOTSUPP;
                   printf("\tS: spoofing error response (%s)\n",
                           strerror(-resp->error));
               }

               /* Send a response to the notification */

               printf("\tS: sending response "
                       "(flags = %#x; val = %lld; error = %d)\n",
                       resp->flags, resp->val, resp->error);

               if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp) == -1) {
                   if (errno == ENOENT)
                       printf("\tS: response failed with ENOENT; "
                               "perhaps target process's syscall was "
                               "interrupted by signal?\n");
                   else
                       perror("ioctl-SECCOMP_IOCTL_NOTIF_SEND");
               }

               /* If the pathname is just "/bye", then the supervisor
                  terminates. This allows us to see what happens if the
                  target process makes further calls to mkdir(2). */

               if (strcmp(path, "/bye") == 0) {
                   printf("\tS: terminating **********\n");
                   exit(EXIT_FAILURE);
               }
           }
       }

       /* Implementation of the supervisor process:

          (1) obtains the notification file descriptor from 'sockPair[1]'
          (2) handles notifications that arrive on that file descriptor. */

       static void
       supervisor(int sockPair[2])
       {
           int notifyFd = recvfd(sockPair[1]);
           if (notifyFd == -1)
               errExit("recvfd");

           closeSocketPair(sockPair);  /* We no longer need the socket pair */

           handleNotifications(notifyFd);
       }

       int
       main(int argc, char *argv[])
       {
           int sockPair[2];

           setbuf(stdout, NULL);

           if (argc < 2) {
               fprintf(stderr, "At least one pathname argument is required\n");
               exit(EXIT_FAILURE);
           }

           /* Create a UNIX domain socket that is used to pass the seccomp
              notification file descriptor from the target process to the
              supervisor process. */

           if (socketpair(AF_UNIX, SOCK_STREAM, 0, sockPair) == -1)
               errExit("socketpair");

           /* Create a child process--the "target"--that installs seccomp
              filtering. The target process writes the seccomp notification
              file descriptor onto 'sockPair[0]' and then calls mkdir(2) for
              each directory in the command-line arguments. */

           (void) targetProcess(sockPair, &argv[optind]);

           /* Catch SIGCHLD when the target terminates, so that the
              supervisor can also terminate. */

           struct sigaction sa;
           sa.sa_handler = sigchldHandler;
           sa.sa_flags = 0;
           sigemptyset(&sa.sa_mask);
           if (sigaction(SIGCHLD, &sa, NULL) == -1)
               errExit("sigaction");

           supervisor(sockPair);

           exit(EXIT_SUCCESS);
       }

SEE ALSO
       ioctl(2), seccomp(2)

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 52+ messages in thread