From: Aleksa Sarai <>
To: Al Viro <>,
	Jeff Layton <>,
	"J. Bruce Fields" <>,
	Arnd Bergmann <>,
	David Howells <>,
	Shuah Khan <>,
	Shuah Khan <>,
	Ingo Molnar <>,
	Peter Zijlstra <>
Cc: Aleksa Sarai <>,
	Eric Biederman <>,
	Andy Lutomirski <>,
	Andrew Morton <>,
	Alexei Starovoitov <>,
	Kees Cook <>, Jann Horn <>,
	Tycho Andersen <>,
	David Drysdale <>,
	Chanho Min <>, Oleg Nesterov <>,
	Rasmus Villemoes <>,
	Alexander Shishkin <>,
	Jiri Olsa <>, Namhyung Kim <>,
	Christian Brauner <>,
	Aleksa Sarai <>,
	Linus Torvalds <>
Subject: [PATCH v14 0/6] open: introduce openat2(2) syscall
Date: Thu, 10 Oct 2019 16:41:34 +1100
Message-ID: <>

This patchset is being developed here:

Patch changelog:
  * The magic-link changes (and O_EMPTYPATH) have been dropped from this series
    -- they will be developed and sent separately. The main reason is that we
    need to restrict things other than open(2) (examples include truncate(2) as
    well as mount(MS_BIND)). This will require a fair amount of extra work, and
    there's no point stalling openat2(2) for that work to be completed.
  * Minor rework of 'struct open_how':
    * To avoid future headaches, make it a non-const argument.
    * Expand ->flags and ->resolve to 64-bit fields to allow for more flag
      extensions without needing to add separate fields too early. This
      requires adding a bit of explicit padding (32 bits) to avoid userspace
      putting garbage in the alignment padding -- this can be repurposed for
      future extensions.
    * upgrade_mask is dropped (and will be a separate field when we add it
      again in the future) to avoid userspace foot-guns.
    * Expand -EINVAL checks in build_open_flags(). Rather than silently
      ignoring silly flag combinations (such as O_TMPFILE|O_PATH or
      O_PATH|<most flags>), give an -EINVAL. All of the silent ignore semantics
      were added to open(2) because we couldn't return -EINVAL -- but we can
      for openat2(2).
      * open(2) and openat(2) clean up their flags before passing them to
        build_open_flags(), so all mixed flags will continue to work. There is
        one exception which is (O_PATH|O_TMPFILE) -- this is no longer
        permitted (as far as I can tell this appears to be a bug, and there are
        no userspace users that I've hit after running this code for a few
        days). If it turns out that userspace does depend on (O_PATH|O_TMPFILE)
        working, we can only disallow it for openat2(2).
  * Don't zero out nd->root in complete_walk() for RCU-walk if we're doing a
    scoped-lookup (this prevents a needless REF-walk retry).
  * Attempt all tests on kernels that don't have openat2(2), rather than just
    skipping everything.
 v13: <>
 v12: <>
 v11: <>
 v10: <>
 v09: <>
 v08: <>
 v07: <>
 v06: <>
 v05: <>
 v04: <>
 v03: <>
 v02: <>
 v01: <>

For a very long time, extending openat(2) with new features has been
incredibly frustrating. This stems from the fact that openat(2) is
possibly the most famous counter-example to the mantra "don't silently
accept garbage from userspace" -- it doesn't check whether unknown flags
are present[1].

This means that (generally) the addition of new flags to openat(2) has
been fraught with backwards-compatibility issues (O_TMPFILE has to be
defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
kernels gave errors, since it's insecure to silently ignore the
flag[2]). All new security-related flags therefore have a tough road to
being added to openat(2).

Furthermore, the need for some sort of control over VFS's path resolution (to
avoid malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications. This patchset is a revival
of Al Viro's old AT_NO_JUMPS[3] patchset (which was a variant of David
Drysdale's O_BENEATH patchset[4] which was a spin-off of the Capsicum
project[5]) with a few additions and changes made based on the previous
discussion within [6] as well as others I felt were useful.

In line with the conclusions of the original discussion of AT_NO_JUMPS, the
flag has been split up into separate flags. However, instead of being an
openat(2) flag it is provided through a new syscall openat2(2) which provides
several other improvements to the openat(2) interface (see the patch
description for more details). The following new LOOKUP_* flags are added:

  * LOOKUP_NO_XDEV blocks all mountpoint crossings (upwards, downwards,
    or through absolute links). Absolute pathnames alone in openat(2) do not
    trigger this. Magic-link traversal which implies a vfsmount jump is also
    blocked (though magic-link jumps on the same vfsmount are permitted).

  * LOOKUP_NO_MAGICLINKS blocks resolution through /proc/$pid/fd-style
    links. This is done by blocking the usage of nd_jump_link() during
    resolution in a filesystem. The term "magic-links" is used to match
    with the only reference to these links in Documentation/, but I'm
    happy to change the name.

    It should be noted that this is different to the scope of
    ~LOOKUP_FOLLOW in that it applies to all path components. However,
    you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
    will *not* fail (assuming that no parent component was a
    magic-link), and you will have an fd for the magic-link.

    In order to correctly detect magic-links, the introduction of a new
    LOOKUP_MAGICLINK_JUMPED state flag was required.

  * LOOKUP_BENEATH disallows escapes to outside the starting dirfd's
    tree, using techniques such as ".." or absolute links. Absolute
    paths in openat(2) are also disallowed. Conceptually this flag is to
    ensure you "stay below" a certain point in the filesystem tree --
    but this requires some additional work to protect against various races
    that would allow escape using "..".

    Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because magic-links
    can trivially beam you around the filesystem (breaking the
    protection). In future, there might be similar safety checks done as
    in LOOKUP_IN_ROOT, but that requires more discussion.

In addition, two new flags are added that expand on the above ideas:

  * LOOKUP_NO_SYMLINKS does what it says on the tin. No symlink
    resolution is allowed at all, including magic-links. Just as with
    LOOKUP_NO_MAGICLINKS this can still be used with NOFOLLOW to open an
    fd for the symlink as long as no parent path component was a symlink.

  * LOOKUP_IN_ROOT is an extension of LOOKUP_BENEATH that, rather than
    blocking attempts to move past the root, forces all such movements
    to be scoped to the starting point. This provides chroot(2)-like
    protection but without the cost of a chroot(2) for each filesystem
    operation, as well as being safe against race attacks that chroot(2)
    is not.

    If a race is detected (as with LOOKUP_BENEATH) then an error is
    generated, and similar to LOOKUP_BENEATH it is not permitted to cross
    magic-links with LOOKUP_IN_ROOT.

    The primary need for this is from container runtimes, which
    currently need to do symlink scoping in userspace[7] when opening
    paths in a potentially malicious container. There is a long list of
    CVEs that could have been mitigated by having LOOKUP_IN_ROOT
    (such as CVE-2017-1002101, CVE-2017-1002102, CVE-2018-15664, and
    CVE-2019-5736, just to name a few).
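
The difference between LOOKUP_BENEATH and LOOKUP_IN_ROOT described above can
be illustrated with a purely lexical userspace sketch. This is not how the
kernel implements either flag (it ignores symlinks, mounts and races
entirely); scope_lexically() and MAX_DEPTH are hypothetical names used only
for illustration. With in_root false, any escape attempt is an error; with
in_root true, it is clamped to the starting point.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_DEPTH 64 /* hypothetical limit for this sketch */

/*
 * Lexically normalise `path` relative to a starting directory.
 *   in_root == false: BENEATH-like, escapes and absolute paths give -EXDEV.
 *   in_root == true:  IN_ROOT-like, escapes are clamped to the start.
 * Writes the normalised relative path ("." if empty) into `out`.
 */
static int scope_lexically(const char *path, bool in_root,
                           char *out, size_t outlen)
{
    char buf[4096];
    const char *stack[MAX_DEPTH];
    size_t depth = 0, used = 0;
    char *save = NULL;

    if (outlen < 2)
        return -EINVAL;
    if (strlen(path) >= sizeof(buf))
        return -ENAMETOOLONG;
    if (path[0] == '/' && !in_root)
        return -EXDEV;          /* absolute paths are rejected */
    strcpy(buf, path);

    for (char *c = strtok_r(buf, "/", &save); c != NULL;
         c = strtok_r(NULL, "/", &save)) {
        if (strcmp(c, ".") == 0)
            continue;
        if (strcmp(c, "..") == 0) {
            if (depth > 0)
                depth--;        /* ordinary step upwards */
            else if (!in_root)
                return -EXDEV;  /* would leave the starting tree */
            continue;           /* in_root: stay at the root */
        }
        if (depth == MAX_DEPTH)
            return -ENAMETOOLONG;
        stack[depth++] = c;
    }

    out[0] = '\0';
    for (size_t i = 0; i < depth; i++) {
        int n = snprintf(out + used, outlen - used, "%s%s",
                         i ? "/" : "", stack[i]);
        if (n < 0 || (size_t)n >= outlen - used)
            return -ENAMETOOLONG;
        used += (size_t)n;
    }
    if (depth == 0)
        snprintf(out, outlen, ".");
    return 0;
}
```

For example, "a/../../etc/passwd" is rejected in the BENEATH-like mode but
becomes "etc/passwd" under the starting point in the IN_ROOT-like mode.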

In order to make all of the above more usable, I'm working on
libpathrs[8] which is a C-friendly library for safe path resolution. It
features a userspace-emulated backend if the kernel doesn't support
openat2(2). Hopefully we can get userspace to switch to using it, and
thus get openat2(2) support for free once it's ready.


The current draft of the openat2(2) man-page is included below.

OPENAT2(2)                          Linux Programmer's Manual                          OPENAT2(2)

NAME
       openat2 - open and possibly create a file (extended)

SYNOPSIS
       #include <sys/types.h>
       #include <sys/stat.h>
       #include <fcntl.h>

       int openat2(int dirfd, const char *pathname, struct open_how *how, size_t size);

       Note: There is no glibc wrapper for this system call; see NOTES.

DESCRIPTION
       The  openat2()  system  call  opens the file specified by pathname.  If the specified file
       does not exist, it may optionally (if O_CREAT is specified in  how.flags)  be  created  by
       openat2().

       As  with  openat(2),  if  pathname  is  relative,  then  it is interpreted relative to the
       directory referred to by the file descriptor dirfd (or the current  working  directory  of
       the  calling  process,  if dirfd is the special value AT_FDCWD.)  If pathname is absolute,
       then dirfd is ignored (unless how.resolve contains RESOLVE_IN_ROOT, in which case pathname
       is resolved relative to dirfd.)

       The  openat2()  system  call  is  an extension of openat(2) and provides a superset of its
       functionality.  Rather than taking a single flag argument, an extensible  structure  (how)
       is  passed  instead  to  allow  for  future extensions.  size must be set to sizeof(struct
       open_how), to facilitate future extensions (see the "Extensibility" section of  the  NOTES
       for more detail on how extensions are handled.)

   The open_how structure
       The following structure indicates how pathname should be opened, and acts as a superset of
       the flag and mode arguments to openat(2).

           struct open_how {
               __aligned_u64 flags;         /* O_* flags. */
               __u16         mode;          /* Mode for O_{CREAT,TMPFILE}. */
               __u16         __padding[3];  /* Must be zeroed. */
               __aligned_u64 resolve;       /* RESOLVE_* flags. */
           };

       Any future extensions to openat2() will be implemented as new fields appended to the above
       structure (or through reuse of pre-existing padding space), with the zero value of the new
       fields acting as though the extension were not present.
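
       The layout implied by the structure above can be checked at compile time. The following is
       a minimal sketch, assuming a GCC-compatible compiler; struct open_how_mirror is a
       hypothetical userspace mirror written for illustration, not the UAPI definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical mirror of the draft structure. The aligned(8) attribute
 * stands in for __aligned_u64, which matters on ABIs where a plain
 * 64-bit integer is only 4-byte aligned.
 */
struct open_how_mirror {
    uint64_t flags __attribute__((aligned(8)));   /* O_* flags */
    uint16_t mode;                                /* mode for O_CREAT/O_TMPFILE */
    uint16_t padding[3];                          /* must be zeroed */
    uint64_t resolve __attribute__((aligned(8))); /* RESOLVE_* flags */
};

/* The explicit padding leaves no implicit holes for userspace garbage. */
_Static_assert(sizeof(struct open_how_mirror) == 24, "unexpected size");
_Static_assert(offsetof(struct open_how_mirror, mode) == 8, "mode offset");
_Static_assert(offsetof(struct open_how_mirror, padding) == 10, "padding offset");
_Static_assert(offsetof(struct open_how_mirror, resolve) == 16, "resolve offset");
```

       Because every byte of the structure belongs to a named field, zero-initialising it (for
       example with "= {0}") is enough to satisfy the "must be zeroed" requirement.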

       The meaning of each field is as follows:

              flags
                     The file creation and status flags to use for this operation.   All  of  the
                     O_* flags defined for openat(2) are valid openat2() flag values.

                     Unlike openat(2), it is an error to provide openat2() unknown or conflicting
                     flags in flags.

              mode
                     File mode for the new file, with identical semantics to the mode argument to
                     openat(2).   However,  unlike openat(2), it is an error to provide openat2()
                     with a mode which contains bits other than 0777.

                     It is an error to provide openat2()  a  non-zero  mode  if  flags  does  not
                     contain O_CREAT or O_TMPFILE.

              resolve
                     Change   how   the   components   of   pathname   will   be   resolved  (see
                     path_resolution(7) for background information.)  The primary  use  case  for
                     these flags is to allow trusted programs to restrict how untrusted paths (or
                     paths inside untrusted directories) are resolved.  The full list of  resolve
                     flags is given below.

                     RESOLVE_NO_XDEV
                            Disallow  traversal of mount points during path resolution (including
                            all bind mounts).

                            Users of this flag  are  encouraged  to  make  its  use  configurable
                            (unless  it  is used for a specific security purpose), as bind mounts
                            are very widely used by end-users.  Setting this flag indiscriminately
                            for   all  uses  of  openat2()  may  result  in  spurious  errors  on
                            previously-functional systems.

                     RESOLVE_NO_SYMLINKS
                            Disallow resolution of symbolic links during path  resolution.   This
                            option implies RESOLVE_NO_MAGICLINKS.

                            If the trailing component is a symbolic link, and flags contains both
                            O_PATH and O_NOFOLLOW, then an O_PATH file descriptor referencing the
                            symbolic link will be returned.

                            Users  of  this  flag  are  encouraged  to  make its use configurable
                            (unless it is used for a  specific  security  purpose),  as  symbolic
                            links   are  very  widely  used  by  end-users.   Setting  this  flag
                            indiscriminately for all uses of openat2()  may  result  in  spurious
                            errors on previously-functional systems.

                     RESOLVE_NO_MAGICLINKS
                            Disallow all magic link resolution during path resolution.

                            If  the  trailing  component is a magic link, and flags contains both
                            O_PATH and O_NOFOLLOW, then an O_PATH file descriptor referencing the
                            magic link will be returned.

                            Magic-links  are  symbolic  link-like  objects  that are most notably
                            found   in   proc(5)   (examples    include    /proc/[pid]/exe    and
                            /proc/[pid]/fd/*.)   Due  to  the  potential  danger  of  unknowingly
                            opening these magic links, it may be preferable for users to  disable
                            their resolution entirely (see symlink(7) for more details.)

                     RESOLVE_BENEATH
                            Do  not permit the path resolution to succeed if any component of the
                            resolution is not a descendant of the directory indicated  by  dirfd.
                            This  causes  absolute  symbolic  links  (and  absolute  values   of
                            pathname) to be rejected.

                            Currently, this flag also disables magic link  resolution.   However,
                            this  may change in the future.  The caller should explicitly specify
                            RESOLVE_NO_MAGICLINKS to ensure that magic links are not resolved.

                     RESOLVE_IN_ROOT
                            Treat dirfd as the root directory while resolving pathname (as though
                            the  user  called  chroot(2)  with  dirfd as the argument.)  Absolute
                            symbolic links and ".." path components will be scoped to dirfd.   If
                            pathname is an absolute path, it is also treated relative to dirfd.

                            However,   unlike   chroot(2)  (which  changes  the  filesystem  root
                            permanently for a  process),  RESOLVE_IN_ROOT  allows  a  program  to
                            efficiently restrict path resolution for only certain operations.  It
                            also has several hardening features (such as detecting escape attempts
                            during ..  resolution) which chroot(2) does not.

                            Currently,  this  flag also disables magic link resolution.  However,
                            this may change in the future.  The caller should explicitly  specify
                            RESOLVE_NO_MAGICLINKS to ensure that magic links are not resolved.

                     It is an error to provide openat2() unknown flags in resolve.
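
       The validity rules stated above for mode and resolve could be modelled in userspace
       roughly as follows. This is a sketch only: check_open_how() and the SKETCH_RESOLVE_*
       constants are hypothetical, their bit values are placeholders rather than values taken
       from the patch, and the check for unknown or conflicting O_* flags is omitted.

```c
#define _GNU_SOURCE /* for O_TMPFILE in <fcntl.h> */
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>

/* Placeholder bit values, for illustration only. */
#define SKETCH_RESOLVE_NO_XDEV       0x01ULL
#define SKETCH_RESOLVE_NO_MAGICLINKS 0x02ULL
#define SKETCH_RESOLVE_NO_SYMLINKS   0x04ULL
#define SKETCH_RESOLVE_BENEATH       0x08ULL
#define SKETCH_RESOLVE_IN_ROOT       0x10ULL
#define SKETCH_RESOLVE_ALL           0x1fULL

/* Returns 0 if the arguments pass the documented checks, else -EINVAL. */
static int check_open_how(uint64_t flags, uint16_t mode, uint64_t resolve)
{
    /* Unknown resolve flags are an error. */
    if (resolve & ~SKETCH_RESOLVE_ALL)
        return -EINVAL;

    /* mode may not contain bits other than 0777. */
    if (mode & ~0777)
        return -EINVAL;

    /* A non-zero mode requires O_CREAT or O_TMPFILE. */
    int creates = (flags & O_CREAT) != 0 ||
                  (flags & O_TMPFILE) == O_TMPFILE;
    if (mode != 0 && !creates)
        return -EINVAL;

    return 0;
}
```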

RETURN VALUE
       On success, a new file descriptor is returned.  On error, -1 is returned, and errno is set
       appropriately.

ERRORS
       The set of errors returned by openat2() includes all of the errors returned by  openat(2),
       as well as the following additional errors:

       EINVAL An unknown flag or invalid value was specified in how.

       EINVAL mode is non-zero, but flags does not contain O_CREAT or O_TMPFILE.

       EINVAL size was smaller than any known version of struct open_how.

       E2BIG  An  extension  was specified in how, which the current kernel does not support (see
              the "Extensibility" section of the NOTES for more  detail  on  how  extensions  are
              handled.)

       EAGAIN resolve  contains  either  RESOLVE_IN_ROOT or RESOLVE_BENEATH, and the kernel could
              not ensure that a ".."  component  didn't  escape  (due  to  a  race  condition  or
              potential attack.)  Callers may choose to retry the openat2() call.

       EXDEV  resolve  contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and an escape from the
              root during path resolution was detected.

       EXDEV  resolve contains RESOLVE_NO_XDEV, and a path component attempted to cross  a  mount
              point.

       ELOOP  resolve contains RESOLVE_NO_SYMLINKS, and one of the path components was a symbolic
              link (or magic link).

       ELOOP  resolve contains RESOLVE_NO_MAGICLINKS, and one of the path components was a  magic
              link.
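
       Since EAGAIN indicates a transient condition that callers may retry, a bounded retry loop
       is one possible way to handle it. The sketch below assumes the actual open is supplied as
       a callback; retry_on_eagain() and flaky_attempt() are hypothetical helpers, and
       flaky_attempt() merely stands in for a real openat2() call.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Callback performing one open attempt: returns an fd, or -1 with errno set. */
typedef int (*open_attempt_fn)(void *ctx);

/* Retry `attempt` while it fails with EAGAIN, at most `max_tries` times. */
static int retry_on_eagain(open_attempt_fn attempt, void *ctx, int max_tries)
{
    int fd = -1;

    assert(attempt != NULL && max_tries > 0);
    for (int i = 0; i < max_tries; i++) {
        fd = attempt(ctx);
        if (fd >= 0 || errno != EAGAIN)
            return fd;  /* success, or an error that is not worth retrying */
    }
    return fd;          /* still -1 with errno == EAGAIN */
}

/*
 * Stand-in for a real openat2() call: fails with EAGAIN as many times as
 * the counter in `ctx` says, then "succeeds" with a fake descriptor 3.
 */
static int flaky_attempt(void *ctx)
{
    int *failures_left = ctx;

    if (*failures_left > 0) {
        (*failures_left)--;
        errno = EAGAIN;
        return -1;
    }
    return 3;
}
```

       Bounding the number of attempts matters: an attacker who can keep renaming directories
       could otherwise keep a caller spinning indefinitely.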

VERSIONS
       openat2() was added to Linux in kernel 5.FOO.

CONFORMING TO
       This system call is Linux-specific.

       The semantics of RESOLVE_BENEATH were modelled after FreeBSD's O_BENEATH.

NOTES
       Glibc does not provide a wrapper for this system call; call it using syscall(2).
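
       A raw wrapper might look like the following sketch. Several things here are assumptions:
       struct open_how_draft is a hypothetical local mirror of the draft structure (the merged
       UAPI header may differ), openat2_raw() is a hypothetical name, and the fallback syscall
       number 437 is only correct on architectures where a later kernel assigned that number.

```c
#define _GNU_SOURCE
#include <errno.h>        /* callers inspect errno on failure */
#include <fcntl.h>        /* AT_FDCWD and the O_* flags */
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* ASSUMPTION: local mirror of the draft struct open_how. */
struct open_how_draft {
    uint64_t flags;       /* O_* flags */
    uint16_t mode;        /* mode for O_CREAT/O_TMPFILE */
    uint16_t padding[3];  /* must be zeroed */
    uint64_t resolve;     /* RESOLVE_* flags */
};

#ifndef SYS_openat2
#define SYS_openat2 437   /* ASSUMPTION: not defined by this patch series */
#endif

/*
 * Thin wrapper around the raw system call. Returns a new file descriptor,
 * or -1 with errno set (ENOSYS on kernels without openat2()).
 */
static int openat2_raw(int dirfd, const char *pathname,
                       struct open_how_draft *how)
{
    return (int)syscall(SYS_openat2, dirfd, pathname, how, sizeof(*how));
}
```

       Callers should zero-initialise the structure before setting fields, and be prepared to
       fall back to openat(2) when errno is ENOSYS.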

   Extensibility
       In order to allow for struct open_how to be extended in future kernel revisions, openat2()
       requires userspace to specify the size of the struct open_how structure it is passing.  By
       providing  this  information,  it  is possible for openat2() to provide both forwards- and
       backwards-compatibility — with size acting as an  implicit  version  number  (because  new
       extension  fields  will  always  be  appended,  the  size  will  always  increase.)   This
       extensibility design is very similar  to  other  system  calls  such  as  sched_setattr(2),
       perf_event_open(2), and clone3(2).

       If  we let usize be the size of the structure according to userspace and ksize be the size
       of the structure which the kernel supports, then there are only three cases to consider:

              *  If ksize equals usize, then there is no version mismatch and  how  can  be  used
                 verbatim.

              *  If  ksize  is  larger  than  usize,  then  there  are some extensions the kernel
                 supports which the userspace program is unaware of.  Because all extensions must
                 have their zero values be a no-op, the kernel treats all of the extension fields
                 not  set  by  userspace  to  have  zero  values.    This   provides   backwards-
                 compatibility.

              *  If  ksize  is  smaller  than  usize,  then  there  are some extensions which the
                 userspace program is aware of but the kernel  does  not  support.   Because  all
                 extensions  must have their zero values be a no-op, the kernel can safely ignore
                 the unsupported extension fields if  they  are  all-zero.   If  any  unsupported
                 extension  fields  are  non-zero, then -1 is returned and errno is set to E2BIG.
                 This provides forwards-compatibility.
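
       The three cases above can be modelled in userspace as a single copy routine. This is a
       sketch of the described semantics, not the kernel's implementation;
       copy_versioned_struct() is a hypothetical name, and OPEN_HOW_SIZE_VER0 assumes the
       first published structure is 24 bytes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define OPEN_HOW_SIZE_VER0 24 /* ASSUMPTION: size of the first version */

/*
 * Copy a userspace structure of `usize` bytes into a kernel-side
 * structure of `ksize` bytes, following the three documented cases.
 * Returns 0 on success, -EINVAL or -E2BIG on failure.
 */
static int copy_versioned_struct(void *dst, size_t ksize,
                                 const void *src, size_t usize)
{
    const unsigned char *bytes = src;
    size_t common = usize < ksize ? usize : ksize;

    assert(dst != NULL && src != NULL);

    /* Smaller than any known version of the structure. */
    if (usize < OPEN_HOW_SIZE_VER0)
        return -EINVAL;

    /* ksize < usize: unknown extension fields must all be zero. */
    for (size_t i = ksize; i < usize; i++) {
        if (bytes[i] != 0)
            return -E2BIG;
    }

    /* ksize > usize: fields userspace did not supply read as zero. */
    memset(dst, 0, ksize);
    memcpy(dst, src, common);
    return 0;
}
```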

       Therefore, most userspace  programs  will  not  need  to  have  any  special  handling  of
       extensions.   However,  if  a  userspace  program  wishes to determine what extensions the
       running kernel supports, it may conduct a binary search on size  (to  find  the  largest
       value which doesn't produce an error of E2BIG.)
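
       Such a search could be sketched as follows, assuming the probe is supplied as a callback.
       largest_supported_size() and fake_kernel_probe() are hypothetical; a real probe would need
       to place a non-zero byte in the region beyond the known fields, since all-zero extension
       fields are accepted by any kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* One probe: returns 0 if `size` is accepted, or -E2BIG if it is not. */
typedef int (*size_probe_fn)(size_t size, void *ctx);

/*
 * Find the largest size in [lo, hi] that the probe accepts, assuming
 * the probe accepts `lo` and acceptance is monotonic in size.
 */
static size_t largest_supported_size(size_probe_fn probe, void *ctx,
                                     size_t lo, size_t hi)
{
    assert(probe != NULL && lo <= hi);
    while (lo < hi) {
        /* Round the midpoint up so that the loop always makes progress. */
        size_t mid = lo + (hi - lo + 1) / 2;

        if (probe(mid, ctx) == 0)
            lo = mid;       /* mid is supported; look higher */
        else
            hi = mid - 1;   /* mid is too large; look lower */
    }
    return lo;
}

/* Stand-in for the kernel: accepts any size up to the ksize held in ctx. */
static int fake_kernel_probe(size_t size, void *ctx)
{
    size_t ksize = *(size_t *)ctx;

    return size <= ksize ? 0 : -E2BIG;
}
```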

SEE ALSO
       openat(2), path_resolution(7), symlink(7)

Linux                                       2019-10-10                                 OPENAT2(2)

Aleksa Sarai (6):
  namei: O_BENEATH-style resolution restriction flags
  namei: LOOKUP_IN_ROOT: chroot-like path resolution
  namei: permit ".." resolution with LOOKUP_{IN_ROOT,BENEATH}
  open: introduce openat2(2) syscall
  selftests: add openat2(2) selftests
  Documentation: path-lookup: mention LOOKUP_MAGICLINK_JUMPED

 CREDITS                                       |   4 +-
 Documentation/filesystems/path-lookup.rst     |  18 +-
 arch/alpha/kernel/syscalls/syscall.tbl        |   1 +
 arch/arm/tools/syscall.tbl                    |   1 +
 arch/arm64/include/asm/unistd.h               |   2 +-
 arch/arm64/include/asm/unistd32.h             |   2 +
 arch/ia64/kernel/syscalls/syscall.tbl         |   1 +
 arch/m68k/kernel/syscalls/syscall.tbl         |   1 +
 arch/microblaze/kernel/syscalls/syscall.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl     |   1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl     |   1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl     |   1 +
 arch/parisc/kernel/syscalls/syscall.tbl       |   1 +
 arch/powerpc/kernel/syscalls/syscall.tbl      |   1 +
 arch/s390/kernel/syscalls/syscall.tbl         |   1 +
 arch/sh/kernel/syscalls/syscall.tbl           |   1 +
 arch/sparc/kernel/syscalls/syscall.tbl        |   1 +
 arch/x86/entry/syscalls/syscall_32.tbl        |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
 arch/xtensa/kernel/syscalls/syscall.tbl       |   1 +
 fs/namei.c                                    | 167 +++++-
 fs/open.c                                     | 154 ++++--
 include/linux/fcntl.h                         |  12 +-
 include/linux/namei.h                         |  12 +
 include/linux/syscalls.h                      |   3 +
 include/uapi/asm-generic/unistd.h             |   5 +-
 include/uapi/linux/fcntl.h                    |  41 ++
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/openat2/.gitignore    |   1 +
 tools/testing/selftests/openat2/Makefile      |   8 +
 tools/testing/selftests/openat2/helpers.c     | 109 ++++
 tools/testing/selftests/openat2/helpers.h     | 107 ++++
 .../testing/selftests/openat2/openat2_test.c  | 297 ++++++++++
 .../selftests/openat2/rename_attack_test.c    | 160 ++++++
 .../testing/selftests/openat2/resolve_test.c  | 523 ++++++++++++++++++
 35 files changed, 1571 insertions(+), 71 deletions(-)
 create mode 100644 tools/testing/selftests/openat2/.gitignore
 create mode 100644 tools/testing/selftests/openat2/Makefile
 create mode 100644 tools/testing/selftests/openat2/helpers.c
 create mode 100644 tools/testing/selftests/openat2/helpers.h
 create mode 100644 tools/testing/selftests/openat2/openat2_test.c
 create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
 create mode 100644 tools/testing/selftests/openat2/resolve_test.c



Thread overview: 11+ messages
2019-10-10  5:41 Aleksa Sarai [this message]
2019-10-10  5:41 ` [PATCH v14 1/6] namei: O_BENEATH-style resolution restriction flags Aleksa Sarai
2019-10-10  5:41 ` [PATCH v14 2/6] namei: LOOKUP_IN_ROOT: chroot-like path resolution Aleksa Sarai
2019-10-10 17:07   ` Linus Torvalds
2019-10-12  4:08     ` Aleksa Sarai
2019-10-12  4:15       ` Aleksa Sarai
2019-10-24  7:06         ` Aleksa Sarai
2019-10-10  5:41 ` [PATCH v14 3/6] namei: permit ".." resolution with LOOKUP_{IN_ROOT,BENEATH} Aleksa Sarai
2019-10-10  5:41 ` [PATCH v14 4/6] open: introduce openat2(2) syscall Aleksa Sarai
2019-10-10  5:41 ` [PATCH v14 5/6] selftests: add openat2(2) selftests Aleksa Sarai
2019-10-10  5:41 ` [PATCH v14 6/6] Documentation: path-lookup: mention LOOKUP_MAGICLINK_JUMPED Aleksa Sarai
