Linux-parisc archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v15 0/9] open: introduce openat2(2) syscall
@ 2019-11-05  9:05 Aleksa Sarai
  2019-11-05  9:05 ` [PATCH v15 1/9] namei: LOOKUP_NO_SYMLINKS: block symlink resolution Aleksa Sarai
                   ` (9 more replies)
  0 siblings, 10 replies; 49+ messages in thread
From: Aleksa Sarai @ 2019-11-05  9:05 UTC (permalink / raw)
  To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
	Peter Zijlstra
  Cc: Aleksa Sarai, Eric Biederman, Andy Lutomirski, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	David Drysdale, Chanho Min, Oleg Nesterov, Rasmus Villemoes,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Christian Brauner,
	Aleksa Sarai, Linus Torvalds, containers, linux-alpha, linux-api,
	libc-alpha, linux-arch, linux-arm-kernel, linux-fsdevel,
	linux-ia64, linux-kernel, linux-kselftest, linux-m68k,
	linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
	linux-xtensa, sparclinux

This patchset is being developed here:
  <https://github.com/cyphar/linux/tree/openat2/master>

Patch changelog:
 v15:
  * Fix code style for LOOKUP_IN_ROOT handling in path_init(). [Linus Torvalds]
  * Split out patches for each individual LOOKUP flag.
  * Reword commit messages to give more background information about the
    series, as well as mention the semantics of each flag in more detail.
 v14: <https://lore.kernel.org/lkml/20191010054140.8483-1-cyphar@cyphar.com/>
      <https://lore.kernel.org/lkml/20191026185700.10708-1-cyphar@cyphar.com>
 v13: <https://lore.kernel.org/lkml/20190930183316.10190-1-cyphar@cyphar.com/>
 v12: <https://lore.kernel.org/lkml/20190904201933.10736-1-cyphar@cyphar.com/>
 v11: <https://lore.kernel.org/lkml/20190820033406.29796-1-cyphar@cyphar.com/>
      <https://lore.kernel.org/lkml/20190728010207.9781-1-cyphar@cyphar.com/>
 v10: <https://lore.kernel.org/lkml/20190719164225.27083-1-cyphar@cyphar.com/>
 v09: <https://lore.kernel.org/lkml/20190706145737.5299-1-cyphar@cyphar.com/>
 v08: <https://lore.kernel.org/lkml/20190520133305.11925-1-cyphar@cyphar.com/>
 v07: <https://lore.kernel.org/lkml/20190507164317.13562-1-cyphar@cyphar.com/>
 v06: <https://lore.kernel.org/lkml/20190506165439.9155-1-cyphar@cyphar.com/>
 v05: <https://lore.kernel.org/lkml/20190320143717.2523-1-cyphar@cyphar.com/>
 v04: <https://lore.kernel.org/lkml/20181112142654.341-1-cyphar@cyphar.com/>
 v03: <https://lore.kernel.org/lkml/20181009070230.12884-1-cyphar@cyphar.com/>
 v02: <https://lore.kernel.org/lkml/20181009065300.11053-1-cyphar@cyphar.com/>
 v01: <https://lore.kernel.org/lkml/20180929103453.12025-1-cyphar@cyphar.com/>

For a very long time, extending openat(2) with new features has been
incredibly frustrating. This stems from the fact that openat(2) is
possibly the most famous counter-example to the mantra "don't silently
accept garbage from userspace" -- it doesn't check whether unknown flags
are present[1].

This means that (generally) the addition of new flags to openat(2) has
been fraught with backwards-compatibility issues (O_TMPFILE has to be
defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
kernels gave errors, since it's insecure to silently ignore the
flag[2]). All new security-related flags therefore have a tough road to
being added to openat(2).

Furthermore, the need for some sort of control over VFS's path resolution (to
avoid malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications. This patchset is a revival
of Al Viro's old AT_NO_JUMPS[3] patchset (which was a variant of David
Drysdale's O_BENEATH patchset[4] which was a spin-off of the Capsicum
project[5]) with a few additions and changes made based on the previous
discussion within [6] as well as others I felt were useful.

In line with the conclusions of the original discussion of AT_NO_JUMPS, the
flag has been split up into separate flags. However, instead of being an
openat(2) flag it is provided through a new syscall openat2(2) which provides
several other improvements to the openat(2) interface (see the patch
description for more details). The following new LOOKUP_* flags are added:

  * LOOKUP_NO_XDEV blocks all mountpoint crossings (upwards, downwards,
    or through absolute links). Absolute pathnames alone in openat(2) do not
    trigger this. Magic-link traversal which implies a vfsmount jump is also
    blocked (though magic-link jumps on the same vfsmount are permitted).

  * LOOKUP_NO_MAGICLINKS blocks resolution through /proc/$pid/fd-style
    links. This is done by blocking the usage of nd_jump_link() during
    resolution in a filesystem. The term "magic-links" is used to match
    with the only reference to these links in Documentation/, but I'm
    happy to change the name.

    It should be noted that this is different to the scope of
    ~LOOKUP_FOLLOW in that it applies to all path components. However,
    you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
    will *not* fail (assuming that no parent component was a
    magic-link), and you will have an fd for the magic-link.

    In order to correctly detect magic-links, the introduction of a new
    LOOKUP_MAGICLINK_JUMPED state flag was required.

  * LOOKUP_BENEATH disallows escapes to outside the starting dirfd's
    tree, using techniques such as ".." or absolute links. Absolute
    paths in openat(2) are also disallowed. Conceptually this flag is to
    ensure you "stay below" a certain point in the filesystem tree --
    but this requires some additional to protect against various races
    that would allow escape using "..".

    Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it
    can trivially beam you around the filesystem (breaking the
    protection). In future, there might be similar safety checks done as
    in LOOKUP_IN_ROOT, but that requires more discussion.

In addition, two new flags are added that expand on the above ideas:

  * LOOKUP_NO_SYMLINKS does what it says on the tin. No symlink
    resolution is allowed at all, including magic-links. Just as with
    LOOKUP_NO_MAGICLINKS this can still be used with NOFOLLOW to open an
    fd for the symlink as long as no parent path had a symlink
    component.

  * LOOKUP_IN_ROOT is an extension of LOOKUP_BENEATH that, rather than
    blocking attempts to move past the root, forces all such movements
    to be scoped to the starting point. This provides chroot(2)-like
    protection but without the cost of a chroot(2) for each filesystem
    operation, as well as being safe against race attacks that chroot(2)
    is not.

    If a race is detected (as with LOOKUP_BENEATH) then an error is
    generated, and similar to LOOKUP_BENEATH it is not permitted to cross
    magic-links with LOOKUP_IN_ROOT.

    The primary need for this is from container runtimes, which
    currently need to do symlink scoping in userspace[7] when opening
    paths in a potentially malicious container. There is a long list of
    CVEs that could have bene mitigated by having RESOLVE_THIS_ROOT
    (such as CVE-2017-1002101, CVE-2017-1002102, CVE-2018-15664, and
    CVE-2019-5736, just to name a few).

In order to make all of the above more usable, I'm working on
libpathrs[8] which is a C-friendly library for safe path resolution. It
features a userspace-emulated backend if the kernel doesn't support
openat2(2). Hopefully we can get userspace to switch to using it, and
thus get openat2(2) support for free once it's ready.

Future work would include implementing things like RESOLVE_NO_AUTOMOUNT and
possibly a RESOLVE_NO_REMOTE (to allow programs to be sure they don't hit DoSes
though stale NFS handles).

[1]: https://lwn.net/Articles/588444/
[2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
[3]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk
[4]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@google.com
[5]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@google.com
[6]: https://lwn.net/Articles/723057/
[7]: https://github.com/cyphar/filepath-securejoin
[8]: https://github.com/openSUSE/libpathrs

The current draft of the openat2(2) man-page is included below.

--8<---------------------------------------------------------------------------
OPENAT2(2)                          Linux Programmer's Manual                          OPENAT2(2)

NAME
       openat2 - open and possibly create a file (extended)

SYNOPSIS
       #include <sys/types.h>
       #include <sys/stat.h>
       #include <fcntl.h>

       int openat2(int dirfd, const char *pathname, struct open_how *how, size_t size);

       Note: There is no glibc wrapper for this system call; see NOTES.

DESCRIPTION
       The  openat2()  system  call  opens the file specified by pathname.  If the specified file
       does not exist, it may optionally (if O_CREAT is specified in  how.flags)  be  created  by
       openat2().

       As  with openat(2), if pathname is relative, then it is interpreted relative to the direc-
       tory referred to by the file descriptor dirfd (or the current  working  directory  of  the
       calling  process,  if dirfd is the special value AT_FDCWD.)  If pathname is absolute, then
       dirfd is ignored (unless how.resolve contains RESOLVE_IN_ROOT, in which case  pathname  is
       resolved relative to dirfd.)

       The  openat2()  system  call  is  an extension of openat(2) and provides a superset of its
       functionality.  Rather than taking a single flag argument, an extensible  structure  (how)
       is  passed  instead  to  allow  for  future extensions.  size must be set to sizeof(struct
       open_how), to facilitate future extensions (see the "Extensibility" section of  the  NOTES
       for more detail on how extensions are handled.)

   The open_how structure
       The following structure indicates how pathname should be opened, and acts as a superset of
       the flag and mode arguments to openat(2).

           struct open_how {
               __aligned_u64 flags;         /* O_* flags. */
               __u16         mode;          /* Mode for O_{CREAT,TMPFILE}. */
               __u16         __padding[3];  /* Must be zeroed. */
               __aligned_u64 resolve;       /* RESOLVE_* flags. */
           };

       Any future extensions to openat2() will be implemented as new fields appended to the above
       structure (or through reuse of pre-existing padding space), with the zero value of the new
       fields acting as though the extension were not present.

       The meaning of each field is as follows:

              flags
                     The file creation and status flags to use for this operation.   All  of  the
                     O_* flags defined for openat(2) are valid openat2() flag values.

                     Unlike openat(2), it is an error to provide openat2() unknown or conflicting
                     flags in flags.

              mode
                     File mode for the new file, with identical semantics to the mode argument to
                     openat(2).   However,  unlike openat(2), it is an error to provide openat2()
                     with a mode which contains bits other than 0777.

                     It is an error to provide openat2() a non-zero mode if flags does  not  con-
                     tain O_CREAT or O_TMPFILE.

              resolve
                     Change  how  the  components  of pathname will be resolved (see path_resolu-
                     tion(7) for background information.)  The primary use case for  these  flags
                     is  to  allow trusted programs to restrict how untrusted paths (or paths in-
                     side untrusted directories) are resolved.  The full list of resolve flags is
                     given below.

                     RESOLVE_NO_XDEV
                            Disallow  traversal of mount points during path resolution (including
                            all bind mounts).

                            Users of this flag are encouraged to make its use  configurable  (un-
                            less  it is used for a specific security purpose), as bind mounts are
                            very widely used by end-users.  Setting this flag indiscrimnately for
                            all  uses  of  openat2() may result in spurious errors on previously-
                            functional systems.

                     RESOLVE_NO_SYMLINKS
                            Disallow resolution of symbolic links during path  resolution.   This
                            option implies RESOLVE_NO_MAGICLINKS.

                            If the trailing component is a symbolic link, and flags contains both
                            O_PATH and O_NOFOLLOW, then an O_PATH file descriptor referencing the
                            symbolic link will be returned.

                            Users  of  this flag are encouraged to make its use configurable (un-
                            less it is used for a specific security purpose), as  symbolic  links
                            are very widely used by end-users.  Setting this flag indiscrimnately
                            for all uses of openat2() may result in  spurious  errors  on  previ-
                            ously-functional systems.

                     RESOLVE_NO_MAGICLINKS
                            Disallow all magic link resolution during path resolution.

                            If  the  trailing  component is a magic link, and flags contains both
                            O_PATH and O_NOFOLLOW, then an O_PATH file descriptor referencing the
                            magic link will be returned.

                            Magic-links  are  symbolic  link-like  objects  that are most notably
                            found   in   proc(5)   (examples    include    /proc/[pid]/exe    and
                            /proc/[pid]/fd/*.)   Due to the potential danger of unknowingly open-
                            ing these magic links, it may be  preferable  for  users  to  disable
                            their resolution entirely (see symboliclink(7) for more details.)

                     RESOLVE_BENEATH
                            Do  not permit the path resolution to succeed if any component of the
                            resolution is not a descendant of the directory indicated  by  dirfd.
                            This results in absolute symbolic links (and absolute values of path-
                            name) to be rejected.

                            Currently, this flag also disables magic link  resolution.   However,
                            this  may change in the future.  The caller should explicitly specify
                            RESOLVE_NO_MAGICLINKS to ensure that magic links are not resolved.

                     RESOLVE_IN_ROOT
                            Treat dirfd as the root directory while resolving pathname (as though
                            the user called chroot(2) with dirfd as the argument.)  Absolute sym-
                            bolic links and ".." path components will be  scoped  to  dirfd.   If
                            pathname is an absolute path, it is also treated relative to dirfd.

                            However,  unlike  chroot(2) (which changes the filesystem root perma-
                            nently for a process), RESOLVE_IN_ROOT  allows  a  program  to  effi-
                            ciently  restrict  path  resolution  for only certain operations.  It
                            also has several hardening features (such detecting  escape  attempts
                            during ..  resolution) which chroot(2) does not.

                            Currently,  this  flag also disables magic link resolution.  However,
                            this may change in the future.  The caller should explicitly  specify
                            RESOLVE_NO_MAGICLINKS to ensure that magic links are not resolved.

                     It is an error to provide openat2() unknown flags in resolve.

RETURN VALUE
       On success, a new file descriptor is returned.  On error, -1 is returned, and errno is set
       appropriately.

ERRORS
       The set of errors returned by openat2() includes all of the errors returned by  openat(2),
       as well as the following additional errors:

       EINVAL An unknown flag or invalid value was specified in how.

       EINVAL mode is non-zero, but flags does not contain O_CREAT or O_TMPFILE.

       EINVAL size was smaller than any known version of struct open_how.

       E2BIG  An  extension  was specified in how, which the current kernel does not support (see
              the "Extensibility" section of the NOTES for more detail on how extensions are han-
              dled.)

       EAGAIN resolve  contains  either  RESOLVE_IN_ROOT or RESOLVE_BENEATH, and the kernel could
              not ensure that a ".." component didn't escape (due to a race condition  or  poten-
              tial attack.)  Callers may choose to retry the openat2() call.

       EXDEV  resolve  contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and an escape from the
              root during path resolution was detected.

       EXDEV  resolve contains RESOLVE_NO_XDEV, and a path component attempted to cross  a  mount
              point.

       ELOOP  resolve contains RESOLVE_NO_SYMLINKS, and one of the path components was a symbolic
              link (or magic link).

       ELOOP  resolve contains RESOLVE_NO_MAGICLINKS, and one of the path components was a  magic
              link.

VERSIONS
       openat2() was added to Linux in kernel 5.FOO.

CONFORMING TO
       This system call is Linux-specific.

       The semantics of RESOLVE_BENEATH were modelled after FreeBSD's O_BENEATH.

NOTES
       Glibc does not provide a wrapper for this system call; call it using systemcall(2).

   Extensibility
       In order to allow for struct open_how to be extended in future kernel revisions, openat2()
       requires userspace to specify the size of struct open_how structure they are passing.   By
       providing  this  information,  it  is possible for openat2() to provide both forwards- and
       backwards-compatibility — with size acting as an implicit version number (because new  ex-
       tension  fields will always be appended, the size will always increase.)  This extensibil-
       ity  design  is  very  similar  to   other   system   calls   such   as   perf_setattr(2),
       perf_event_open(2), and clone(3).

       If  we let usize be the size of the structure according to userspace and ksize be the size
       of the structure which the kernel supports, then there are only three cases to consider:

              *  If ksize equals usize, then there is no version mismatch and  how  can  be  used
                 verbatim.

              *  If  ksize  is  larger than usize, then there are some extensions the kernel sup-
                 ports which the userspace program is unaware of.  Because  all  extensions  must
                 have their zero values be a no-op, the kernel treats all of the extension fields
                 not set by userspace to have zero values.  This  provides  backwards-compatibil-
                 ity.

              *  If  ksize  is  smaller  than  usize,  then  there  are some extensions which the
                 userspace program is aware of but the kernel does not support.  Because all  ex-
                 tensions  must  have  their zero values be a no-op, the kernel can safely ignore
                 the unsupported extension fields if they are all-zero.  If any  unsupported  ex-
                 tension  fields  are  non-zero,  then  -1 is returned and errno is set to E2BIG.
                 This provides forwards-compatibility.

       Therefore, most userspace programs will not need to have any special  handling  of  exten-
       sions.   However,  if  a userspace program wishes to determine what extensions the running
       kernel supports, they may conduct a binary search on size (to find the largest value which
       doesn't produce an error of E2BIG.)

SEE ALSO
       openat(2), path_resolution(7), symlink(7)

Linux                                       2019-11-05                                 OPENAT2(2)
--8<---------------------------------------------------------------------------

Aleksa Sarai (9):
  namei: LOOKUP_NO_SYMLINKS: block symlink resolution
  namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution
  namei: LOOKUP_NO_XDEV: block mountpoint crossing
  namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
  namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
  namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
  open: introduce openat2(2) syscall
  selftests: add openat2(2) selftests
  Documentation: path-lookup: mention LOOKUP_MAGICLINK_JUMPED

 CREDITS                                       |   4 +-
 Documentation/filesystems/path-lookup.rst     |  18 +-
 arch/alpha/kernel/syscalls/syscall.tbl        |   1 +
 arch/arm/tools/syscall.tbl                    |   1 +
 arch/arm64/include/asm/unistd.h               |   2 +-
 arch/arm64/include/asm/unistd32.h             |   2 +
 arch/ia64/kernel/syscalls/syscall.tbl         |   1 +
 arch/m68k/kernel/syscalls/syscall.tbl         |   1 +
 arch/microblaze/kernel/syscalls/syscall.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl     |   1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl     |   1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl     |   1 +
 arch/parisc/kernel/syscalls/syscall.tbl       |   1 +
 arch/powerpc/kernel/syscalls/syscall.tbl      |   1 +
 arch/s390/kernel/syscalls/syscall.tbl         |   1 +
 arch/sh/kernel/syscalls/syscall.tbl           |   1 +
 arch/sparc/kernel/syscalls/syscall.tbl        |   1 +
 arch/x86/entry/syscalls/syscall_32.tbl        |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
 arch/xtensa/kernel/syscalls/syscall.tbl       |   1 +
 fs/namei.c                                    | 176 +++++-
 fs/open.c                                     | 149 +++--
 include/linux/fcntl.h                         |  12 +-
 include/linux/namei.h                         |  11 +
 include/linux/syscalls.h                      |   3 +
 include/uapi/asm-generic/unistd.h             |   5 +-
 include/uapi/linux/fcntl.h                    |  41 ++
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/openat2/.gitignore    |   1 +
 tools/testing/selftests/openat2/Makefile      |   8 +
 tools/testing/selftests/openat2/helpers.c     | 109 ++++
 tools/testing/selftests/openat2/helpers.h     | 107 ++++
 .../testing/selftests/openat2/openat2_test.c  | 316 +++++++++++
 .../selftests/openat2/rename_attack_test.c    | 160 ++++++
 .../testing/selftests/openat2/resolve_test.c  | 523 ++++++++++++++++++
 35 files changed, 1591 insertions(+), 73 deletions(-)
 create mode 100644 tools/testing/selftests/openat2/.gitignore
 create mode 100644 tools/testing/selftests/openat2/Makefile
 create mode 100644 tools/testing/selftests/openat2/helpers.c
 create mode 100644 tools/testing/selftests/openat2/helpers.h
 create mode 100644 tools/testing/selftests/openat2/openat2_test.c
 create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
 create mode 100644 tools/testing/selftests/openat2/resolve_test.c


base-commit: a99d8080aaf358d5d23581244e5da23b35e340b9
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v15 1/9] namei: LOOKUP_NO_SYMLINKS: block symlink resolution
  2019-11-05  9:05 [PATCH v15 0/9] open: introduce openat2(2) syscall Aleksa Sarai
@ 2019-11-05  9:05 ` Aleksa Sarai
  2019-11-05  9:05 ` [PATCH v15 2/9] namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution Aleksa Sarai
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 49+ messages in thread
From: Aleksa Sarai @ 2019-11-05  9:05 UTC (permalink / raw)
  To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
	Peter Zijlstra
  Cc: Aleksa Sarai, Christian Brauner, Linus Torvalds, Eric Biederman,
	Andy Lutomirski, Andrew Morton, Alexei Starovoitov, Kees Cook,
	Jann Horn, Tycho Andersen, David Drysdale, Chanho Min,
	Oleg Nesterov, Rasmus Villemoes, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Christian Brauner, Aleksa Sarai, containers,
	linux-alpha, linux-api, libc-alpha, linux-arch, linux-arm-kernel,
	linux-fsdevel, linux-ia64, linux-kernel, linux-kselftest,
	linux-m68k, linux-mips, linux-parisc, linuxppc-dev, linux-s390,
	linux-sh, linux-xtensa, sparclinux

/* Background. */
Userspace cannot easily resolve a path without resolving symlinks, and
would have to manually resolve each path component with O_PATH and
O_NOFOLLOW. This is clearly inefficient, and can be fairly easy to screw
up (resulting in possible security bugs). Linus has mentioned that Git
has a particular need for this kind of flag[1]. It also resolves a
fairly long-standing perceived deficiency in O_NOFOLLOw -- that it only
blocks the opening of trailing symlinks.

This is part of a refresh of Al's AT_NO_JUMPS patchset[2] (which was a
variation on David Drysdale's O_BENEATH patchset[3], which in turn was
based on the Capsicum project[4]).

/* Userspace API. */
LOOKUP_NO_SYMLINKS will be exposed to userspace through openat2(2).

/* Semantics. */
Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
LOOKUP_NO_SYMLINKS applies to all components of the path.

With LOOKUP_NO_SYMLINKS, any symlink path component encountered during
path resolution will yield -ELOOP. If the trailing component is a
symlink (and no other components were symlinks), then O_PATH|O_NOFOLLOW
will not error out and will instead provide a handle to the trailing
symlink -- without resolving it.

/* Testing. */
LOOKUP_NO_SYMLINKS is tested as part of the openat2(2) selftests.

[1]: https://lore.kernel.org/lkml/CA+55aFyOKM7DW7+0sdDFKdZFXgptb5r1id9=Wvhd8AgSP7qjwQ@mail.gmail.com/
[2]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/
[3]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@google.com/
[4]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@google.com/

Cc: Christian Brauner <christian.brauner@ubuntu.com>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/namei.c            | 3 +++
 include/linux/namei.h | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/fs/namei.c b/fs/namei.c
index 671c3c1a3425..4e85d6fa4048 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1045,6 +1045,9 @@ const char *get_link(struct nameidata *nd)
 	int error;
 	const char *res;
 
+	if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS))
+		return ERR_PTR(-ELOOP);
+
 	if (!(nd->flags & LOOKUP_RCU)) {
 		touch_atime(&last->link);
 		cond_resched();
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 397a08ade6a2..ee2e35af387f 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -39,6 +39,9 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 #define LOOKUP_ROOT		0x2000
 #define LOOKUP_ROOT_GRABBED	0x0008
 
+/* Scoping flags for lookup. */
+#define LOOKUP_NO_SYMLINKS	0x020000 /* No symlink crossing. */
+
 extern int path_pts(struct path *path);
 
 extern int user_path_at_empty(int, const char __user *, unsigned, struct path *, int *empty);
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v15 2/9] namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution
  2019-11-05  9:05 [PATCH v15 0/9] open: introduce openat2(2) syscall Aleksa Sarai
  2019-11-05  9:05 ` [PATCH v15 1/9] namei: LOOKUP_NO_SYMLINKS: block symlink resolution Aleksa Sarai
@ 2019-11-05  9:05 ` Aleksa Sarai
  2019-11-13  1:24   ` Al Viro
  2019-11-05  9:05 ` [PATCH v15 3/9] namei: LOOKUP_NO_XDEV: block mountpoint crossing Aleksa Sarai
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 49+ messages in thread
From: Aleksa Sarai @ 2019-11-05  9:05 UTC (permalink / raw)
  To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
	Peter Zijlstra
  Cc: Aleksa Sarai, Christian Brauner, David Drysdale, Andy Lutomirski,
	Linus Torvalds, Eric Biederman, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	Chanho Min, Oleg Nesterov, Rasmus Villemoes, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Christian Brauner, Aleksa Sarai,
	containers, linux-alpha, linux-api, libc-alpha, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-ia64, linux-kernel,
	linux-kselftest, linux-m68k, linux-mips, linux-parisc,
	linuxppc-dev, linux-s390, linux-sh, linux-xtensa, sparclinux

/* Background. */
There has always been a special class of symlink-like objects in procfs
(and a few other pseudo-filesystems) which allow for non-lexical
resolution of paths using nd_jump_link(). These "magic-links" do not
follow traditional mount namespace boundaries, and have been used
consistently in container escape attacks because they can be used to
trick unsuspecting privileged processes into resolving unexpected paths.

It is also non-trivial for userspace to unambiguously avoid resolving
magic-links, because they do not have a reliable indication that they
are a magic-link (in order to verify them you'd have to manually open
the path given by readlink(2) and then verify that the two file
descriptors reference the same underlying file, which is plagued with
possible race conditions or supplementary attack scenarios).

It would therefore be very helpful for userspace to be able to avoid
these symlinks easily, thus hopefully removing a tool from attackers'
toolboxes.

This is part of a refresh of Al's AT_NO_JUMPS patchset[1] (which was a
variation on David Drysdale's O_BENEATH patchset[2], which in turn was
based on the Capsicum project[3]).

/* Userspace API. */
LOOKUP_NO_MAGICLINKS will be exposed to userspace through openat2(2).

/* Semantics. */
Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
LOOKUP_NO_MAGICLINKS applies to all components of the path.

With LOOKUP_NO_MAGICLINKS, any magic-link path component encountered
during path resolution will yield -ELOOP. The handling of ~LOOKUP_FOLLOW
for a trailing magic-link is identical to LOOKUP_NO_SYMLINKS.

LOOKUP_NO_SYMLINKS implies LOOKUP_NO_MAGICLINKS.

/* Testing. */
LOOKUP_NO_MAGICLINKS is tested as part of the openat2(2) selftests.

[1]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/
[2]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@google.com/
[3]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@google.com/

Cc: Christian Brauner <christian.brauner@ubuntu.com>
Suggested-by: David Drysdale <drysdale@google.com>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/namei.c            | 7 ++++++-
 include/linux/namei.h | 2 ++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/namei.c b/fs/namei.c
index 4e85d6fa4048..1f0d871199e5 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -866,7 +866,7 @@ void nd_jump_link(struct path *path)
 
 	nd->path = *path;
 	nd->inode = nd->path.dentry->d_inode;
-	nd->flags |= LOOKUP_JUMPED;
+	nd->flags |= LOOKUP_JUMPED | LOOKUP_MAGICLINK_JUMPED;
 }
 
 static inline void put_link(struct nameidata *nd)
@@ -1063,6 +1063,7 @@ const char *get_link(struct nameidata *nd)
 		return ERR_PTR(error);
 
 	nd->last_type = LAST_BIND;
+	nd->flags &= ~LOOKUP_MAGICLINK_JUMPED;
 	res = READ_ONCE(inode->i_link);
 	if (!res) {
 		const char * (*get)(struct dentry *, struct inode *,
@@ -1078,6 +1079,10 @@ const char *get_link(struct nameidata *nd)
 		} else {
 			res = get(dentry, inode, &last->done);
 		}
+		if (nd->flags & LOOKUP_MAGICLINK_JUMPED) {
+			if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
+				return ERR_PTR(-ELOOP);
+		}
 		if (IS_ERR_OR_NULL(res))
 			return res;
 	}
diff --git a/include/linux/namei.h b/include/linux/namei.h
index ee2e35af387f..a8b3f93338da 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -38,9 +38,11 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 #define LOOKUP_JUMPED		0x1000
 #define LOOKUP_ROOT		0x2000
 #define LOOKUP_ROOT_GRABBED	0x0008
+#define LOOKUP_MAGICLINK_JUMPED	0x10000
 
 /* Scoping flags for lookup. */
 #define LOOKUP_NO_SYMLINKS	0x020000 /* No symlink crossing. */
+#define LOOKUP_NO_MAGICLINKS	0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */
 
 extern int path_pts(struct path *path);
 
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v15 3/9] namei: LOOKUP_NO_XDEV: block mountpoint crossing
  2019-11-05  9:05 [PATCH v15 0/9] open: introduce openat2(2) syscall Aleksa Sarai
  2019-11-05  9:05 ` [PATCH v15 1/9] namei: LOOKUP_NO_SYMLINKS: block symlink resolution Aleksa Sarai
  2019-11-05  9:05 ` [PATCH v15 2/9] namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution Aleksa Sarai
@ 2019-11-05  9:05 ` Aleksa Sarai
  2019-11-13  1:36   ` Al Viro
  2019-11-05  9:05 ` [PATCH v15 4/9] namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution Aleksa Sarai
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 49+ messages in thread
From: Aleksa Sarai @ 2019-11-05  9:05 UTC (permalink / raw)
  To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
	Peter Zijlstra
  Cc: Aleksa Sarai, Christian Brauner, David Drysdale, Andy Lutomirski,
	Linus Torvalds, Eric Biederman, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	Chanho Min, Oleg Nesterov, Rasmus Villemoes, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Christian Brauner, Aleksa Sarai,
	containers, linux-alpha, linux-api, libc-alpha, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-ia64, linux-kernel,
	linux-kselftest, linux-m68k, linux-mips, linux-parisc,
	linuxppc-dev, linux-s390, linux-sh, linux-xtensa, sparclinux

/* Background. */
The need to contain path operations within a mountpoint has been a
long-standing usecase that userspace has historically implemented
manually with liberal usage of stat(). find, rsync, tar and
many other programs implement these semantics -- but it'd be much
simpler to have a fool-proof way of refusing to open a path if it
crosses a mountpoint.

This is part of a refresh of Al's AT_NO_JUMPS patchset[1] (which was a
variation on David Drysdale's O_BENEATH patchset[2], which in turn was
based on the Capsicum project[3]).

/* Userspace API. */
LOOKUP_NO_XDEV will be exposed to userspace through openat2(2).

/* Semantics. */
Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
LOOKUP_NO_XDEV applies to all components of the path.

With LOOKUP_NO_XDEV, any path component which crosses a mount-point
during path resolution (including "..") will yield an -EXDEV. Absolute
paths, absolute symlinks, and magic-links will only yield an -EXDEV if
the jump involved changing mount-points.

/* Testing. */
LOOKUP_NO_XDEV is tested as part of the openat2(2) selftests.

[1]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/
[2]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@google.com/
[3]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@google.com/

Cc: Christian Brauner <christian.brauner@ubuntu.com>
Suggested-by: David Drysdale <drysdale@google.com>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/namei.c            | 34 ++++++++++++++++++++++++++++++----
 include/linux/namei.h |  1 +
 2 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 1f0d871199e5..b73ee1601bd4 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -504,6 +504,9 @@ struct nameidata {
 	struct filename	*name;
 	struct nameidata *saved;
 	struct inode	*link_inode;
+	struct {
+		bool same_mnt;
+	} last_magiclink;
 	unsigned	root_seq;
 	int		dfd;
 } __randomize_layout;
@@ -837,6 +840,11 @@ static inline void path_to_nameidata(const struct path *path,
 
 static int nd_jump_root(struct nameidata *nd)
 {
+	if (unlikely(nd->flags & LOOKUP_NO_XDEV)) {
+		/* Absolute path arguments to path_init() are allowed. */
+		if (nd->path.mnt != NULL && nd->path.mnt != nd->root.mnt)
+			return -EXDEV;
+	}
 	if (nd->flags & LOOKUP_RCU) {
 		struct dentry *d;
 		nd->path = nd->root;
@@ -862,6 +870,8 @@ static int nd_jump_root(struct nameidata *nd)
 void nd_jump_link(struct path *path)
 {
 	struct nameidata *nd = current->nameidata;
+
+	nd->last_magiclink.same_mnt = (nd->path.mnt == path->mnt);
 	path_put(&nd->path);
 
 	nd->path = *path;
@@ -1082,6 +1092,10 @@ const char *get_link(struct nameidata *nd)
 		if (nd->flags & LOOKUP_MAGICLINK_JUMPED) {
 			if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
 				return ERR_PTR(-ELOOP);
+			if (unlikely(nd->flags & LOOKUP_NO_XDEV)) {
+				if (!nd->last_magiclink.same_mnt)
+					return ERR_PTR(-EXDEV);
+			}
 		}
 		if (IS_ERR_OR_NULL(res))
 			return res;
@@ -1271,12 +1285,16 @@ static int follow_managed(struct path *path, struct nameidata *nd)
 		break;
 	}
 
-	if (need_mntput && path->mnt == mnt)
-		mntput(path->mnt);
+	if (need_mntput) {
+		if (path->mnt == mnt)
+			mntput(path->mnt);
+		if (unlikely(nd->flags & LOOKUP_NO_XDEV))
+			ret = -EXDEV;
+		else
+			nd->flags |= LOOKUP_JUMPED;
+	}
 	if (ret == -EISDIR || !ret)
 		ret = 1;
-	if (need_mntput)
-		nd->flags |= LOOKUP_JUMPED;
 	if (unlikely(ret < 0))
 		path_put_conditional(path, nd);
 	return ret;
@@ -1333,6 +1351,8 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
 		mounted = __lookup_mnt(path->mnt, path->dentry);
 		if (!mounted)
 			break;
+		if (unlikely(nd->flags & LOOKUP_NO_XDEV))
+			return false;
 		path->mnt = &mounted->mnt;
 		path->dentry = mounted->mnt.mnt_root;
 		nd->flags |= LOOKUP_JUMPED;
@@ -1379,6 +1399,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
 				return -ECHILD;
 			if (&mparent->mnt == nd->path.mnt)
 				break;
+			if (unlikely(nd->flags & LOOKUP_NO_XDEV))
+				return -EXDEV;
 			/* we know that mountpoint was pinned */
 			nd->path.dentry = mountpoint;
 			nd->path.mnt = &mparent->mnt;
@@ -1393,6 +1415,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
 			return -ECHILD;
 		if (!mounted)
 			break;
+		if (unlikely(nd->flags & LOOKUP_NO_XDEV))
+			return -EXDEV;
 		nd->path.mnt = &mounted->mnt;
 		nd->path.dentry = mounted->mnt.mnt_root;
 		inode = nd->path.dentry->d_inode;
@@ -1491,6 +1515,8 @@ static int follow_dotdot(struct nameidata *nd)
 		}
 		if (!follow_up(&nd->path))
 			break;
+		if (unlikely(nd->flags & LOOKUP_NO_XDEV))
+			return -EXDEV;
 	}
 	follow_mount(&nd->path);
 	nd->inode = nd->path.dentry->d_inode;
diff --git a/include/linux/namei.h b/include/linux/namei.h
index a8b3f93338da..6105c8a59fc8 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -43,6 +43,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 /* Scoping flags for lookup. */
 #define LOOKUP_NO_SYMLINKS	0x020000 /* No symlink crossing. */
 #define LOOKUP_NO_MAGICLINKS	0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */
+#define LOOKUP_NO_XDEV		0x080000 /* No mountpoint crossing. */
 
 extern int path_pts(struct path *path);
 
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v15 4/9] namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
  2019-11-05  9:05 [PATCH v15 0/9] open: introduce openat2(2) syscall Aleksa Sarai
                   ` (2 preceding siblings ...)
  2019-11-05  9:05 ` [PATCH v15 3/9] namei: LOOKUP_NO_XDEV: block mountpoint crossing Aleksa Sarai
@ 2019-11-05  9:05 ` Aleksa Sarai
  2019-11-13  1:55   ` Al Viro
  2019-11-05  9:05 ` [PATCH v15 5/9] namei: LOOKUP_IN_ROOT: chroot-like " Aleksa Sarai
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 49+ messages in thread
From: Aleksa Sarai @ 2019-11-05  9:05 UTC (permalink / raw)
  To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
	Peter Zijlstra
  Cc: Aleksa Sarai, Christian Brauner, David Drysdale, Andy Lutomirski,
	Linus Torvalds, Eric Biederman, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	Chanho Min, Oleg Nesterov, Rasmus Villemoes, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Christian Brauner, Aleksa Sarai,
	containers, linux-alpha, linux-api, libc-alpha, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-ia64, linux-kernel,
	linux-kselftest, linux-m68k, linux-mips, linux-parisc,
	linuxppc-dev, linux-s390, linux-sh, linux-xtensa, sparclinux

/* Background. */
There are many circumstances when userspace wants to resolve a path and
ensure that it doesn't go outside of a particular root directory during
resolution. Obvious examples include archive extraction tools, as well as
other security-conscious userspace programs. FreeBSD spun out O_BENEATH
from their Capsicum project[1,2], so it also seems reasonable to
implement similar functionality for Linux.

This is part of a refresh of Al's AT_NO_JUMPS patchset[3] (which was a
variation on David Drysdale's O_BENEATH patchset[4], which in turn was
based on the Capsicum project[5]).

/* Userspace API. */
LOOKUP_BENEATH will be exposed to userspace through openat2(2).

/* Semantics. */
Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
LOOKUP_BENEATH applies to all components of the path.

With LOOKUP_BENEATH, any path component which attempts to "escape" the
starting point of the filesystem lookup (the dirfd passed to openat)
will yield -EXDEV. Thus, all absolute paths and symlinks are disallowed.

Due to a security concern brought up by Jann[6], any ".." path
components are also blocked. This restriction will be lifted in a future
patch, but requires more work to ensure that permitting ".." is done
safely.

Magic-link jumps are also blocked, because they can beam the path lookup
across the starting point. It would be possible to detect and block
only the "bad" crossings with path_is_under() checks, but it's unclear
whether it makes sense to permit magic-links at all. However, userspace
is recommended to pass LOOKUP_NO_MAGICLINKS if they want to ensure that
magic-link crossing is entirely disabled.

/* Testing. */
LOOKUP_BENEATH is tested as part of the openat2(2) selftests.

[1]: https://reviews.freebsd.org/D2808
[2]: https://reviews.freebsd.org/D17547
[3]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/
[4]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@google.com/
[5]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@google.com/
[6]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6xa53w@mail.gmail.com/

Cc: Christian Brauner <christian.brauner@ubuntu.com>
Suggested-by: David Drysdale <drysdale@google.com>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/namei.c            | 91 +++++++++++++++++++++++++++++++++++--------
 include/linux/namei.h |  4 ++
 2 files changed, 79 insertions(+), 16 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index b73ee1601bd4..54fdbdfbeb94 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -644,6 +644,14 @@ static bool legitimize_links(struct nameidata *nd)
 
 static bool legitimize_root(struct nameidata *nd)
 {
+	/*
+	 * For scoped-lookups (where nd->root has been zeroed), we need to
+	 * restart the whole lookup from scratch -- because set_root() is wrong
+	 * for these lookups (nd->dfd is the root, not the filesystem root).
+	 */
+	if (!nd->root.mnt && (nd->flags & LOOKUP_IS_SCOPED))
+		return false;
+	/* Nothing to do if nd->root is zero or is managed by the VFS user. */
 	if (!nd->root.mnt || (nd->flags & LOOKUP_ROOT))
 		return true;
 	nd->flags |= LOOKUP_ROOT_GRABBED;
@@ -779,7 +787,11 @@ static int complete_walk(struct nameidata *nd)
 	int status;
 
 	if (nd->flags & LOOKUP_RCU) {
-		if (!(nd->flags & LOOKUP_ROOT))
+		/*
+		 * We don't want to zero nd->root for scoped-lookups or
+		 * externally-managed nd->root.
+		 */
+		if (!(nd->flags & (LOOKUP_ROOT | LOOKUP_IS_SCOPED)))
 			nd->root.mnt = NULL;
 		if (unlikely(unlazy_walk(nd)))
 			return -ECHILD;
@@ -801,10 +813,18 @@ static int complete_walk(struct nameidata *nd)
 	return status;
 }
 
-static void set_root(struct nameidata *nd)
+static int set_root(struct nameidata *nd)
 {
 	struct fs_struct *fs = current->fs;
 
+	/*
+	 * Jumping to the real root in a scoped-lookup is a BUG in namei, but we
+	 * still have to ensure it doesn't happen because it will cause a breakout
+	 * from the dirfd.
+	 */
+	if (WARN_ON(nd->flags & LOOKUP_IS_SCOPED))
+		return -ENOTRECOVERABLE;
+
 	if (nd->flags & LOOKUP_RCU) {
 		unsigned seq;
 
@@ -817,6 +837,7 @@ static void set_root(struct nameidata *nd)
 		get_fs_root(fs, &nd->root);
 		nd->flags |= LOOKUP_ROOT_GRABBED;
 	}
+	return 0;
 }
 
 static void path_put_conditional(struct path *path, struct nameidata *nd)
@@ -840,11 +861,18 @@ static inline void path_to_nameidata(const struct path *path,
 
 static int nd_jump_root(struct nameidata *nd)
 {
+	if (unlikely(nd->flags & LOOKUP_BENEATH))
+		return -EXDEV;
 	if (unlikely(nd->flags & LOOKUP_NO_XDEV)) {
 		/* Absolute path arguments to path_init() are allowed. */
 		if (nd->path.mnt != NULL && nd->path.mnt != nd->root.mnt)
 			return -EXDEV;
 	}
+	if (!nd->root.mnt) {
+		int error = set_root(nd);
+		if (error)
+			return error;
+	}
 	if (nd->flags & LOOKUP_RCU) {
 		struct dentry *d;
 		nd->path = nd->root;
@@ -1096,15 +1124,17 @@ const char *get_link(struct nameidata *nd)
 				if (!nd->last_magiclink.same_mnt)
 					return ERR_PTR(-EXDEV);
 			}
+			/* Not currently safe for scoped-lookups. */
+			if (unlikely(nd->flags & LOOKUP_IS_SCOPED))
+				return ERR_PTR(-EXDEV);
 		}
 		if (IS_ERR_OR_NULL(res))
 			return res;
 	}
 	if (*res == '/') {
-		if (!nd->root.mnt)
-			set_root(nd);
-		if (unlikely(nd_jump_root(nd)))
-			return ERR_PTR(-ECHILD);
+		error = nd_jump_root(nd);
+		if (unlikely(error))
+			return ERR_PTR(error);
 		while (unlikely(*++res == '/'))
 			;
 	}
@@ -1373,8 +1403,11 @@ static int follow_dotdot_rcu(struct nameidata *nd)
 	struct inode *inode = nd->inode;
 
 	while (1) {
-		if (path_equal(&nd->path, &nd->root))
+		if (path_equal(&nd->path, &nd->root)) {
+			if (unlikely(nd->flags & LOOKUP_BENEATH))
+				return -EXDEV;
 			break;
+		}
 		if (nd->path.dentry != nd->path.mnt->mnt_root) {
 			struct dentry *old = nd->path.dentry;
 			struct dentry *parent = old->d_parent;
@@ -1505,8 +1538,11 @@ static int path_parent_directory(struct path *path)
 static int follow_dotdot(struct nameidata *nd)
 {
 	while(1) {
-		if (path_equal(&nd->path, &nd->root))
+		if (path_equal(&nd->path, &nd->root)) {
+			if (unlikely(nd->flags & LOOKUP_BENEATH))
+				return -EXDEV;
 			break;
+		}
 		if (nd->path.dentry != nd->path.mnt->mnt_root) {
 			int ret = path_parent_directory(&nd->path);
 			if (ret)
@@ -1731,8 +1767,20 @@ static inline int may_lookup(struct nameidata *nd)
 static inline int handle_dots(struct nameidata *nd, int type)
 {
 	if (type == LAST_DOTDOT) {
-		if (!nd->root.mnt)
-			set_root(nd);
+		int error = 0;
+
+		/*
+		 * Scoped-lookup flags resolving ".." is not currently safe --
+		 * races can cause our parent to have moved outside of the root
+		 * and us to skip over it.
+		 */
+		if (unlikely(nd->flags & LOOKUP_IS_SCOPED))
+			return -EXDEV;
+		if (!nd->root.mnt) {
+			error = set_root(nd);
+			if (error)
+				return error;
+		}
 		if (nd->flags & LOOKUP_RCU) {
 			return follow_dotdot_rcu(nd);
 		} else
@@ -2195,6 +2243,7 @@ static int link_path_walk(const char *name, struct nameidata *nd)
 /* must be paired with terminate_walk() */
 static const char *path_init(struct nameidata *nd, unsigned flags)
 {
+	int error;
 	const char *s = nd->name->name;
 
 	if (!*s)
@@ -2227,11 +2276,12 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 	nd->path.dentry = NULL;
 
 	nd->m_seq = read_seqbegin(&mount_lock);
+
+	/* Figure out the starting path and root (if needed). */
 	if (*s == '/') {
-		set_root(nd);
-		if (likely(!nd_jump_root(nd)))
-			return s;
-		return ERR_PTR(-ECHILD);
+		error = nd_jump_root(nd);
+		if (unlikely(error))
+			return ERR_PTR(error);
 	} else if (nd->dfd == AT_FDCWD) {
 		if (flags & LOOKUP_RCU) {
 			struct fs_struct *fs = current->fs;
@@ -2247,7 +2297,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 			get_fs_pwd(current->fs, &nd->path);
 			nd->inode = nd->path.dentry->d_inode;
 		}
-		return s;
 	} else {
 		/* Caller must check execute permissions on the starting path component */
 		struct fd f = fdget_raw(nd->dfd);
@@ -2272,8 +2321,18 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 			nd->inode = nd->path.dentry->d_inode;
 		}
 		fdput(f);
-		return s;
 	}
+	/* For scoped-lookups we need to set the root to the dirfd as well. */
+	if (flags & LOOKUP_IS_SCOPED) {
+		nd->root = nd->path;
+		if (flags & LOOKUP_RCU) {
+			nd->root_seq = nd->seq;
+		} else {
+			path_get(&nd->root);
+			nd->flags |= LOOKUP_ROOT_GRABBED;
+		}
+	}
+	return s;
 }
 
 static const char *trailing_symlink(struct nameidata *nd)
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 6105c8a59fc8..12f4f36835c2 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -2,6 +2,7 @@
 #ifndef _LINUX_NAMEI_H
 #define _LINUX_NAMEI_H
 
+#include <linux/fs.h>
 #include <linux/kernel.h>
 #include <linux/path.h>
 #include <linux/fcntl.h>
@@ -44,6 +45,9 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 #define LOOKUP_NO_SYMLINKS	0x020000 /* No symlink crossing. */
 #define LOOKUP_NO_MAGICLINKS	0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */
 #define LOOKUP_NO_XDEV		0x080000 /* No mountpoint crossing. */
+#define LOOKUP_BENEATH		0x100000 /* No escaping from starting point. */
+/* LOOKUP_* flags which do scope-related checks based on the dirfd. */
+#define LOOKUP_IS_SCOPED LOOKUP_BENEATH
 
 extern int path_pts(struct path *path);
 
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v15 5/9] namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
  2019-11-05  9:05 [PATCH v15 0/9] open: introduce openat2(2) syscall Aleksa Sarai
                   ` (3 preceding siblings ...)
  2019-11-05  9:05 ` [PATCH v15 4/9] namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution Aleksa Sarai
@ 2019-11-05  9:05 ` " Aleksa Sarai
  2019-11-13  2:03   ` Al Viro
  2019-11-05  9:05 ` [PATCH v15 6/9] namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution Aleksa Sarai
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 49+ messages in thread
From: Aleksa Sarai @ 2019-11-05  9:05 UTC (permalink / raw)
  To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
	Peter Zijlstra
  Cc: Aleksa Sarai, Christian Brauner, Eric Biederman, Andy Lutomirski,
	Andrew Morton, Alexei Starovoitov, Kees Cook, Jann Horn,
	Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
	Rasmus Villemoes, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Christian Brauner, Aleksa Sarai, Linus Torvalds, containers,
	linux-alpha, linux-api, libc-alpha, linux-arch, linux-arm-kernel,
	linux-fsdevel, linux-ia64, linux-kernel, linux-kselftest,
	linux-m68k, linux-mips, linux-parisc, linuxppc-dev, linux-s390,
	linux-sh, linux-xtensa, sparclinux

/* Background. */
Container runtimes or other administrative management processes will
often interact with root filesystems while in the host mount namespace,
because the cost of doing a chroot(2) on every operation is too
prohibitive (especially in Go, which cannot safely use vfork). However,
a malicious program can trick the management process into doing
operations on files outside of the root filesystem through careful
crafting of symlinks.

Most programs that need this feature have attempted to make this process
safe, by doing all of the path resolution in userspace (with symlinks
being scoped to the root of the malicious root filesystem).
Unfortunately, this method is prone to foot-guns and usually such
implementations have subtle security bugs.

Thus, what userspace needs is a way to resolve a path as though it were
in a chroot(2) -- with all absolute symlinks being resolved relative to
the dirfd root (and ".." components being stuck under the dirfd root).
It is much simpler and more straight-forward to provide this
functionality in-kernel (because it can be done far more cheaply and
correctly).

More classical applications that also have this problem (which have
their own potentially buggy userspace path sanitisation code) include
web servers, archive extraction tools, network file servers, and so on.

/* Userspace API. */
LOOKUP_IN_ROOT will be exposed to userspace through openat2(2).

/* Semantics. */
Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
LOOKUP_IN_ROOT applies to all components of the path.

With LOOKUP_IN_ROOT, any path component which attempts to cross the
starting point of the pathname lookup (the dirfd passed to openat) will
remain at the starting point. Thus, all absolute paths and symlinks will
be scoped within the starting point.

There is a slight change in behaviour regarding pathnames -- if the
pathname is absolute then the dirfd is still used as the root of
resolution of LOOKUP_IN_ROOT is specified (this is to avoid obvious
foot-guns, at the cost of a minor API inconsistency).

As with LOOKUP_BENEATH, Jann's security concern about ".."[1] applies to
LOOKUP_IN_ROOT -- therefore ".." resolution is blocked. This restriction
will be lifted in a future patch, but requires more work to ensure that
permitting ".." is done safely.

Magic-link jumps are also blocked, because they can beam the path lookup
across the starting point. It would be possible to detect and block
only the "bad" crossings with path_is_under() checks, but it's unclear
whether it makes sense to permit magic-links at all. However, userspace
is recommended to pass LOOKUP_NO_MAGICLINKS if they want to ensure that
magic-link crossing is entirely disabled.

/* Testing. */
LOOKUP_IN_ROOT is tested as part of the openat2(2) selftests.

[1]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6xa53w@mail.gmail.com/

Cc: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/namei.c            | 15 ++++++++++++---
 include/linux/namei.h |  3 ++-
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 54fdbdfbeb94..a3d199a60708 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2277,12 +2277,20 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 
 	nd->m_seq = read_seqbegin(&mount_lock);
 
-	/* Figure out the starting path and root (if needed). */
-	if (*s == '/') {
+	/* Absolute pathname -- fetch the root. */
+	if (flags & LOOKUP_IN_ROOT) {
+		/* With LOOKUP_IN_ROOT, act as a relative path. */
+		while (*s == '/')
+			s++;
+	} else if (*s == '/') {
 		error = nd_jump_root(nd);
 		if (unlikely(error))
 			return ERR_PTR(error);
-	} else if (nd->dfd == AT_FDCWD) {
+		return s;
+	}
+
+	/* Relative pathname -- get the starting-point it is relative to. */
+	if (nd->dfd == AT_FDCWD) {
 		if (flags & LOOKUP_RCU) {
 			struct fs_struct *fs = current->fs;
 			unsigned seq;
@@ -2322,6 +2330,7 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 		}
 		fdput(f);
 	}
+
 	/* For scoped-lookups we need to set the root to the dirfd as well. */
 	if (flags & LOOKUP_IS_SCOPED) {
 		nd->root = nd->path;
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 12f4f36835c2..96b374e08230 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -46,8 +46,9 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 #define LOOKUP_NO_MAGICLINKS	0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */
 #define LOOKUP_NO_XDEV		0x080000 /* No mountpoint crossing. */
 #define LOOKUP_BENEATH		0x100000 /* No escaping from starting point. */
+#define LOOKUP_IN_ROOT		0x200000 /* Treat dirfd as %current->fs->root. */
 /* LOOKUP_* flags which do scope-related checks based on the dirfd. */
-#define LOOKUP_IS_SCOPED LOOKUP_BENEATH
+#define LOOKUP_IS_SCOPED (LOOKUP_BENEATH | LOOKUP_IN_ROOT)
 
 extern int path_pts(struct path *path);
 
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v15 6/9] namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
  2019-11-05  9:05 [PATCH v15 0/9] open: introduce openat2(2) syscall Aleksa Sarai
                   ` (4 preceding siblings ...)
  2019-11-05  9:05 ` [PATCH v15 5/9] namei: LOOKUP_IN_ROOT: chroot-like " Aleksa Sarai
@ 2019-11-05  9:05 ` Aleksa Sarai
  2019-11-13  2:09   ` Al Viro
  2019-11-05  9:05 ` [PATCH v15 7/9] open: introduce openat2(2) syscall Aleksa Sarai
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 49+ messages in thread
From: Aleksa Sarai @ 2019-11-05  9:05 UTC (permalink / raw)
  To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
	Peter Zijlstra
  Cc: Aleksa Sarai, Christian Brauner, Jann Horn, Linus Torvalds,
	Eric Biederman, Andy Lutomirski, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Tycho Andersen, David Drysdale,
	Chanho Min, Oleg Nesterov, Rasmus Villemoes, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Christian Brauner, Aleksa Sarai,
	containers, linux-alpha, linux-api, libc-alpha, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-ia64, linux-kernel,
	linux-kselftest, linux-m68k, linux-mips, linux-parisc,
	linuxppc-dev, linux-s390, linux-sh, linux-xtensa, sparclinux

Allow LOOKUP_BENEATH and LOOKUP_IN_ROOT to safely permit ".." resolution
(in the case of LOOKUP_BENEATH the resolution will still fail if ".."
resolution would resolve a path outside of the root -- while
LOOKUP_IN_ROOT will chroot(2)-style scope it). Magic-link jumps are
still disallowed entirely[*].

As Jann explains[1,2], the need for this patch (and the original no-".."
restriction) is explained by observing there is a fairly easy-to-exploit
race condition with chroot(2) (and thus by extension LOOKUP_IN_ROOT and
LOOKUP_BENEATH if ".." is allowed) where a rename(2) of a path can be
used to "skip over" nd->root and thus escape to the filesystem above
nd->root.

  thread1 [attacker]:
    for (;;)
      renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
  thread2 [victim]:
    for (;;)
      openat2(dirb, "b/c/../../etc/shadow",
              { .flags = O_PATH, .resolve = RESOLVE_IN_ROOT } );

With fairly significant regularity, thread2 will resolve to
"/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar
(though somewhat more privileged) attack using MS_MOVE.

With this patch, such cases will be detected *during* ".." resolution
and will return -EAGAIN for userspace to decide to either retry or abort
the lookup. It should be noted that ".." is the weak point of chroot(2)
-- walking *into* a subdirectory tautologically cannot result in you
walking *outside* nd->root (except through a bind-mount or magic-link).
There is also no other way for a directory's parent to change (which is
the primary worry with ".." resolution here) other than a rename or
MS_MOVE.

This is a first-pass implementation, where -EAGAIN will be returned if
any rename or mount occurs anywhere on the host (in any namespace). This
will result in spurious errors, but there isn't a satisfactory
alternative (other than denying ".." altogether).

One other possible alternative (which previous versions of this patch
used) would be to check with path_is_under() if there was a racing
rename or mount (after re-taking the relevant seqlocks). While this does
work, it results in possible O(n*m) behaviour if there are many renames
or mounts occuring *anywhere on the system*.

A variant of the above attack is included in the selftests for
openat2(2) later in this patch series. I've run this test on several
machines for several days and no instances of a breakout were detected.
While this is not concrete proof that this is safe, when combined with
the above argument it should lend some trustworthiness to this
construction.

[*] It may be acceptable in the future to do a path_is_under() check (as
    with the alternative solution for "..") for magic-links after they
    are resolved. However this seems unlikely to be a feature that
    people *really* need -- it can be added later if it turns out a lot
    of people want it.

[1]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6xa53w@mail.gmail.com/
[2]: https://lore.kernel.org/lkml/CAG48ez30WJhbsro2HOc_DR7V91M+hNFzBP5ogRMZaxbAORvqzg@mail.gmail.com/

Cc: Christian Brauner <christian.brauner@ubuntu.com>
Suggested-by: Jann Horn <jannh@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/namei.c | 42 +++++++++++++++++++++++++++++-------------
 1 file changed, 29 insertions(+), 13 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index a3d199a60708..174d69cf9084 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -491,7 +491,7 @@ struct nameidata {
 	struct path	root;
 	struct inode	*inode; /* path.dentry.d_inode */
 	unsigned int	flags;
-	unsigned	seq, m_seq;
+	unsigned	seq, m_seq, r_seq;
 	int		last_type;
 	unsigned	depth;
 	int		total_link_count;
@@ -1769,22 +1769,35 @@ static inline int handle_dots(struct nameidata *nd, int type)
 	if (type == LAST_DOTDOT) {
 		int error = 0;
 
-		/*
-		 * Scoped-lookup flags resolving ".." is not currently safe --
-		 * races can cause our parent to have moved outside of the root
-		 * and us to skip over it.
-		 */
-		if (unlikely(nd->flags & LOOKUP_IS_SCOPED))
-			return -EXDEV;
 		if (!nd->root.mnt) {
 			error = set_root(nd);
 			if (error)
 				return error;
 		}
-		if (nd->flags & LOOKUP_RCU) {
-			return follow_dotdot_rcu(nd);
-		} else
-			return follow_dotdot(nd);
+		if (nd->flags & LOOKUP_RCU)
+			error = follow_dotdot_rcu(nd);
+		else
+			error = follow_dotdot(nd);
+		if (error)
+			return error;
+
+		if (unlikely(nd->flags & LOOKUP_IS_SCOPED)) {
+			bool m_retry = read_seqretry(&mount_lock, nd->m_seq);
+			bool r_retry = read_seqretry(&rename_lock, nd->r_seq);
+
+			/*
+			 * If there was a racing rename or mount along our
+			 * path, then we can't be sure that ".." hasn't jumped
+			 * above nd->root (and so userspace should retry or use
+			 * some fallback).
+			 *
+			 * In future we could do a path_is_under() check here
+			 * instead, but there are O(n*m) performance
+			 * considerations with such a setup.
+			 */
+			if (unlikely(m_retry || r_retry))
+				return -EAGAIN;
+		}
 	}
 	return 0;
 }
@@ -2254,6 +2267,10 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 	nd->last_type = LAST_ROOT; /* if there are only slashes... */
 	nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT;
 	nd->depth = 0;
+
+	nd->m_seq = read_seqbegin(&mount_lock);
+	nd->r_seq = read_seqbegin(&rename_lock);
+
 	if (flags & LOOKUP_ROOT) {
 		struct dentry *root = nd->root.dentry;
 		struct inode *inode = root->d_inode;
@@ -2275,7 +2292,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 	nd->path.mnt = NULL;
 	nd->path.dentry = NULL;
 
-	nd->m_seq = read_seqbegin(&mount_lock);
 
 	/* Absolute pathname -- fetch the root. */
 	if (flags & LOOKUP_IN_ROOT) {
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v15 7/9] open: introduce openat2(2) syscall
  2019-11-05  9:05 [PATCH v15 0/9] open: introduce openat2(2) syscall Aleksa Sarai
                   ` (5 preceding siblings ...)
  2019-11-05  9:05 ` [PATCH v15 6/9] namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution Aleksa Sarai
@ 2019-11-05  9:05 ` Aleksa Sarai
  2019-11-13  2:29   ` Al Viro
  2019-11-05  9:05 ` [PATCH v15 8/9] selftests: add openat2(2) selftests Aleksa Sarai
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 49+ messages in thread
From: Aleksa Sarai @ 2019-11-05  9:05 UTC (permalink / raw)
  To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
	Peter Zijlstra
  Cc: Aleksa Sarai, Christian Brauner, Eric Biederman, Andy Lutomirski,
	Andrew Morton, Alexei Starovoitov, Kees Cook, Jann Horn,
	Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
	Rasmus Villemoes, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Aleksa Sarai, Linus Torvalds, containers, linux-alpha, linux-api,
	libc-alpha, linux-arch, linux-arm-kernel, linux-fsdevel,
	linux-ia64, linux-kernel, linux-kselftest, linux-m68k,
	linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
	linux-xtensa, sparclinux

/* Background. */
For a very long time, extending openat(2) with new features has been
incredibly frustrating. This stems from the fact that openat(2) is
possibly the most famous counter-example to the mantra "don't silently
accept garbage from userspace" -- it doesn't check whether unknown flags
are present[1].

This means that (generally) the addition of new flags to openat(2) has
been fraught with backwards-compatibility issues (O_TMPFILE has to be
defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
kernels gave errors, since it's insecure to silently ignore the
flag[2]). All new security-related flags therefore have a tough road to
being added to openat(2).

Userspace also has a hard time figuring out whether a particular flag is
supported on a particular kernel. While it is now possible with
contemporary kernels (thanks to [3]), older kernels will expose unknown
flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during
openat(2) time matches modern syscall designs and is far more
fool-proof.

In addition, the newly-added path resolution restriction LOOKUP flags
(which we would like to expose to user-space) don't feel related to the
pre-existing O_* flag set -- they affect all components of path lookup.
We'd therefore like to add a new flag argument.

Adding a new syscall allows us to finally fix the flag-ignoring problem,
and we can make it extensible enough so that we will hopefully never
need an openat3(2).

/* Syscall Prototype. */
  /*
   * open_how is an extensible structure (similar in interface to
   * clone3(2) or sched_setattr(2)). The size parameter must be set to
   * sizeof(struct open_how), to allow for future extensions. All future
   * extensions will be appended to open_how, with their zero value
   * acting as a no-op default.
   */
  struct open_how { /* ... */ };

  int openat2(int dfd, const char *pathname,
              struct open_how *how, size_t size);

/* Description. */
The initial version of 'struct open_how' contains the following fields:

  flags
    Used to specify openat(2)-style flags. However, any unknown flag
    bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR)
    will result in -EINVAL. In addition, this field is 64-bits wide to
    allow for more O_ flags than currently permitted with openat(2).

  mode
    The file mode for O_CREAT or O_TMPFILE.

    Must be set to zero if flags does not contain O_CREAT or O_TMPFILE.

  __padding
    Must be set to all zeroes.

  resolve
    Restrict path resolution (in contrast to O_* flags they affect all
    path components). The current set of flags are as follows (at the
    moment, all of the RESOLVE_ flags are implemented as just passing
    the corresponding LOOKUP_ flag).

    RESOLVE_NO_XDEV       => LOOKUP_NO_XDEV
    RESOLVE_NO_SYMLINKS   => LOOKUP_NO_SYMLINKS
    RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS
    RESOLVE_BENEATH       => LOOKUP_BENEATH
    RESOLVE_IN_ROOT       => LOOKUP_IN_ROOT

open_how does not contain an embedded size field, because it is of
little benefit (userspace can figure out the kernel open_how size at
runtime fairly easily without it).

Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE
is no longer permitted for openat(2). As far as I can tell, this has
always been a bug and appears to not be used by userspace (and I've not
seen any problems on my machines by disallowing it). If it turns out
this breaks something, we can special-case it and only permit it for
openat(2) but not openat2(2).

/* Testing. */
In a follow-up patch there are over 200 selftests which ensure that this
syscall has the correct semantics and will correctly handle several
attack scenarios.

In addition, I've written a userspace library[4] which provides
convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary
because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care
must be taken when using RESOLVE_IN_ROOT'd file descriptors with other
syscalls). During the development of this patch, I've run numerous
verification tests using libpathrs (showing that the API is reasonably
usable by userspace).

/* Future Work. */
Additional RESOLVE_ flags have been suggested during the review period.
These can be easily implemented separately (such as blocking auto-mount
during resolution).

Furthermore, there are some other proposed changes to the openat(2)
interface (the most obvious example is magic-link hardening[5]) which
would be a good opportunity to add a way for userspace to restrict how
O_PATH file descriptors can be re-opened.

[1]: https://lwn.net/Articles/588444/
[2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
[3]: commit 629e014bb834 ("fs: completely ignore unknown open flags")
[4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523
[5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/

Suggested-by: Christian Brauner <christian@brauner.io>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 CREDITS                                     |   4 +-
 arch/alpha/kernel/syscalls/syscall.tbl      |   1 +
 arch/arm/tools/syscall.tbl                  |   1 +
 arch/arm64/include/asm/unistd.h             |   2 +-
 arch/arm64/include/asm/unistd32.h           |   2 +
 arch/ia64/kernel/syscalls/syscall.tbl       |   1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |   1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |   1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |   1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |   1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |   1 +
 arch/s390/kernel/syscalls/syscall.tbl       |   1 +
 arch/sh/kernel/syscalls/syscall.tbl         |   1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |   1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |   1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |   1 +
 fs/open.c                                   | 149 +++++++++++++++-----
 include/linux/fcntl.h                       |  12 +-
 include/linux/syscalls.h                    |   3 +
 include/uapi/asm-generic/unistd.h           |   5 +-
 include/uapi/linux/fcntl.h                  |  41 ++++++
 24 files changed, 196 insertions(+), 38 deletions(-)

diff --git a/CREDITS b/CREDITS
index 031605d46b4d..a048e001d726 100644
--- a/CREDITS
+++ b/CREDITS
@@ -3301,7 +3301,9 @@ S: France
 N: Aleksa Sarai
 E: cyphar@cyphar.com
 W: https://www.cyphar.com/
-D: `pids` cgroup subsystem
+D: /sys/fs/cgroup/pids
+D: openat2(2)
+S: Sydney, Australia
 
 N: Dipankar Sarma
 E: dipankar@in.ibm.com
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 728fe028c02c..9f374f7d9514 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -475,3 +475,4 @@
 543	common	fspick				sys_fspick
 544	common	pidfd_open			sys_pidfd_open
 # 545 reserved for clone3
+547	common	openat2				sys_openat2
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 6da7dc4d79cc..4ba54bc7e19a 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -449,3 +449,4 @@
 433	common	fspick				sys_fspick
 434	common	pidfd_open			sys_pidfd_open
 435	common	clone3				sys_clone3
+437	common	openat2				sys_openat2
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 2629a68b8724..8aa00ccb0b96 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
 #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls		436
+#define __NR_compat_syscalls		438
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 94ab29cf4f00..57f6f592d460 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -879,6 +879,8 @@ __SYSCALL(__NR_fspick, sys_fspick)
 __SYSCALL(__NR_pidfd_open, sys_pidfd_open)
 #define __NR_clone3 435
 __SYSCALL(__NR_clone3, sys_clone3)
+#define __NR_openat2 437
+__SYSCALL(__NR_openat2, sys_openat2)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index 36d5faf4c86c..8d36f2e2dc89 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -356,3 +356,4 @@
 433	common	fspick				sys_fspick
 434	common	pidfd_open			sys_pidfd_open
 # 435 reserved for clone3
+437	common	openat2				sys_openat2
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index a88a285a0e5f..2559925f1924 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -435,3 +435,4 @@
 433	common	fspick				sys_fspick
 434	common	pidfd_open			sys_pidfd_open
 # 435 reserved for clone3
+437	common	openat2				sys_openat2
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 09b0cd7dab0a..c04385e60833 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -441,3 +441,4 @@
 433	common	fspick				sys_fspick
 434	common	pidfd_open			sys_pidfd_open
 435	common	clone3				sys_clone3
+437	common	openat2				sys_openat2
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index e7c5ab38e403..68c9ec06851f 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -374,3 +374,4 @@
 433	n32	fspick				sys_fspick
 434	n32	pidfd_open			sys_pidfd_open
 435	n32	clone3				__sys_clone3
+437	n32	openat2				sys_openat2
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 13cd66581f3b..42a72d010050 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -350,3 +350,4 @@
 433	n64	fspick				sys_fspick
 434	n64	pidfd_open			sys_pidfd_open
 435	n64	clone3				__sys_clone3
+437	n64	openat2				sys_openat2
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 353539ea4140..f114c4aed0ed 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -423,3 +423,4 @@
 433	o32	fspick				sys_fspick
 434	o32	pidfd_open			sys_pidfd_open
 435	o32	clone3				__sys_clone3
+437	o32	openat2				sys_openat2
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index 285ff516150c..b550ae9a7fea 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -433,3 +433,4 @@
 433	common	fspick				sys_fspick
 434	common	pidfd_open			sys_pidfd_open
 435	common	clone3				sys_clone3_wrapper
+437	common	openat2				sys_openat2
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 43f736ed47f2..a8b5ecb5b602 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -517,3 +517,4 @@
 433	common	fspick				sys_fspick
 434	common	pidfd_open			sys_pidfd_open
 435	nospu	clone3				ppc_clone3
+437	common	openat2				sys_openat2
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 3054e9c035a3..16b571c06161 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -438,3 +438,4 @@
 433  common	fspick			sys_fspick			sys_fspick
 434  common	pidfd_open		sys_pidfd_open			sys_pidfd_open
 435  common	clone3			sys_clone3			sys_clone3
+437  common	openat2			sys_openat2			sys_openat2
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index b5ed26c4c005..a7185cc18626 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -438,3 +438,4 @@
 433	common	fspick				sys_fspick
 434	common	pidfd_open			sys_pidfd_open
 # 435 reserved for clone3
+437	common	openat2				sys_openat2
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 8c8cc7537fb2..b11c19552022 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -481,3 +481,4 @@
 433	common	fspick				sys_fspick
 434	common	pidfd_open			sys_pidfd_open
 # 435 reserved for clone3
+437	common	openat2			sys_openat2
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 3fe02546aed3..e5c022e9a5c4 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -440,3 +440,4 @@
 433	i386	fspick			sys_fspick			__ia32_sys_fspick
 434	i386	pidfd_open		sys_pidfd_open			__ia32_sys_pidfd_open
 435	i386	clone3			sys_clone3			__ia32_sys_clone3
+437	i386	openat2			sys_openat2			__ia32_sys_openat2
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c29976eca4a8..9035647ef236 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -357,6 +357,7 @@
 433	common	fspick			__x64_sys_fspick
 434	common	pidfd_open		__x64_sys_pidfd_open
 435	common	clone3			__x64_sys_clone3/ptregs
+437	common	openat2			__x64_sys_openat2
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 25f4de729a6d..f0a68013c038 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -406,3 +406,4 @@
 433	common	fspick				sys_fspick
 434	common	pidfd_open			sys_pidfd_open
 435	common	clone3				sys_clone3
+437	common	openat2				sys_openat2
diff --git a/fs/open.c b/fs/open.c
index b62f5c0923a8..50a46501bcc9 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -955,48 +955,86 @@ struct file *open_with_fake_path(const struct path *path, int flags,
 }
 EXPORT_SYMBOL(open_with_fake_path);
 
-static inline int build_open_flags(int flags, umode_t mode, struct open_flags *op)
+#define WILL_CREATE(flags)	(flags & (O_CREAT | __O_TMPFILE))
+#define O_PATH_FLAGS		(O_DIRECTORY | O_NOFOLLOW | O_PATH | O_CLOEXEC)
+
+static inline struct open_how build_open_how(int flags, umode_t mode)
+{
+	struct open_how how = {
+		.flags = flags & VALID_OPEN_FLAGS,
+		.mode = mode & S_IALLUGO,
+	};
+
+	/* O_PATH beats everything else. */
+	if (how.flags & O_PATH)
+		how.flags &= O_PATH_FLAGS;
+	/* Modes should only be set for create-like flags. */
+	if (!WILL_CREATE(how.flags))
+		how.mode = 0;
+	return how;
+}
+
+static inline int build_open_flags(const struct open_how *how,
+				   struct open_flags *op)
 {
+	int flags = how->flags;
 	int lookup_flags = 0;
 	int acc_mode = ACC_MODE(flags);
 
+	/* Must never be set by userspace */
+	flags &= ~(FMODE_NONOTIFY | O_CLOEXEC);
+
 	/*
-	 * Clear out all open flags we don't know about so that we don't report
-	 * them in fcntl(F_GETFD) or similar interfaces.
+	 * Older syscalls implicitly clear all of the invalid flags or argument
+	 * values before calling build_open_flags(), but openat2(2) checks all
+	 * of its arguments.
 	 */
-	flags &= VALID_OPEN_FLAGS;
+	if (flags & ~VALID_OPEN_FLAGS)
+		return -EINVAL;
+	if (how->resolve & ~VALID_RESOLVE_FLAGS)
+		return -EINVAL;
+	if (memchr_inv(how->__padding, 0, sizeof(how->__padding)))
+		return -EINVAL;
 
-	if (flags & (O_CREAT | __O_TMPFILE))
-		op->mode = (mode & S_IALLUGO) | S_IFREG;
-	else
+	/* Deal with the mode. */
+	if (WILL_CREATE(flags)) {
+		if (how->mode & ~S_IALLUGO)
+			return -EINVAL;
+		op->mode = how->mode | S_IFREG;
+	} else {
+		if (how->mode != 0)
+			return -EINVAL;
 		op->mode = 0;
-
-	/* Must never be set by userspace */
-	flags &= ~FMODE_NONOTIFY & ~O_CLOEXEC;
+	}
 
 	/*
-	 * O_SYNC is implemented as __O_SYNC|O_DSYNC.  As many places only
-	 * check for O_DSYNC if the need any syncing at all we enforce it's
-	 * always set instead of having to deal with possibly weird behaviour
-	 * for malicious applications setting only __O_SYNC.
+	 * In order to ensure programs get explicit errors when trying to use
+	 * O_TMPFILE on old kernels, O_TMPFILE is implemented such that it
+	 * looks like (O_DIRECTORY|O_RDWR & ~O_CREAT) to old kernels. But we
+	 * have to require userspace to explicitly set it.
 	 */
-	if (flags & __O_SYNC)
-		flags |= O_DSYNC;
-
 	if (flags & __O_TMPFILE) {
 		if ((flags & O_TMPFILE_MASK) != O_TMPFILE)
 			return -EINVAL;
 		if (!(acc_mode & MAY_WRITE))
 			return -EINVAL;
-	} else if (flags & O_PATH) {
-		/*
-		 * If we have O_PATH in the open flag. Then we
-		 * cannot have anything other than the below set of flags
-		 */
-		flags &= O_DIRECTORY | O_NOFOLLOW | O_PATH;
+	}
+	if (flags & O_PATH) {
+		/* O_PATH only permits certain other flags to be set. */
+		if (flags & ~O_PATH_FLAGS)
+			return -EINVAL;
 		acc_mode = 0;
 	}
 
+	/*
+	 * O_SYNC is implemented as __O_SYNC|O_DSYNC.  As many places only
+	 * check for O_DSYNC if the need any syncing at all we enforce it's
+	 * always set instead of having to deal with possibly weird behaviour
+	 * for malicious applications setting only __O_SYNC.
+	 */
+	if (flags & __O_SYNC)
+		flags |= O_DSYNC;
+
 	op->open_flag = flags;
 
 	/* O_TRUNC implies we need access checks for write permissions */
@@ -1022,6 +1060,18 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
 		lookup_flags |= LOOKUP_DIRECTORY;
 	if (!(flags & O_NOFOLLOW))
 		lookup_flags |= LOOKUP_FOLLOW;
+
+	if (how->resolve & RESOLVE_NO_XDEV)
+		lookup_flags |= LOOKUP_NO_XDEV;
+	if (how->resolve & RESOLVE_NO_MAGICLINKS)
+		lookup_flags |= LOOKUP_NO_MAGICLINKS;
+	if (how->resolve & RESOLVE_NO_SYMLINKS)
+		lookup_flags |= LOOKUP_NO_SYMLINKS;
+	if (how->resolve & RESOLVE_BENEATH)
+		lookup_flags |= LOOKUP_BENEATH;
+	if (how->resolve & RESOLVE_IN_ROOT)
+		lookup_flags |= LOOKUP_IN_ROOT;
+
 	op->lookup_flags = lookup_flags;
 	return 0;
 }
@@ -1040,8 +1090,11 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
 struct file *file_open_name(struct filename *name, int flags, umode_t mode)
 {
 	struct open_flags op;
-	int err = build_open_flags(flags, mode, &op);
-	return err ? ERR_PTR(err) : do_filp_open(AT_FDCWD, name, &op);
+	struct open_how how = build_open_how(flags, mode);
+	int err = build_open_flags(&how, &op);
+	if (err)
+		return ERR_PTR(err);
+	return do_filp_open(AT_FDCWD, name, &op);
 }
 
 /**
@@ -1072,17 +1125,19 @@ struct file *file_open_root(struct dentry *dentry, struct vfsmount *mnt,
 			    const char *filename, int flags, umode_t mode)
 {
 	struct open_flags op;
-	int err = build_open_flags(flags, mode, &op);
+	struct open_how how = build_open_how(flags, mode);
+	int err = build_open_flags(&how, &op);
 	if (err)
 		return ERR_PTR(err);
 	return do_file_open_root(dentry, mnt, filename, &op);
 }
 EXPORT_SYMBOL(file_open_root);
 
-long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
+static long do_sys_openat2(int dfd, const char __user *filename,
+			   struct open_how *how)
 {
 	struct open_flags op;
-	int fd = build_open_flags(flags, mode, &op);
+	int fd = build_open_flags(how, &op);
 	struct filename *tmp;
 
 	if (fd)
@@ -1092,7 +1147,7 @@ long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
 	if (IS_ERR(tmp))
 		return PTR_ERR(tmp);
 
-	fd = get_unused_fd_flags(flags);
+	fd = get_unused_fd_flags(how->flags);
 	if (fd >= 0) {
 		struct file *f = do_filp_open(dfd, tmp, &op);
 		if (IS_ERR(f)) {
@@ -1107,12 +1162,16 @@ long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
 	return fd;
 }
 
-SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode)
+long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
 {
-	if (force_o_largefile())
-		flags |= O_LARGEFILE;
+	struct open_how how = build_open_how(flags, mode);
+	return do_sys_openat2(dfd, filename, &how);
+}
 
-	return do_sys_open(AT_FDCWD, filename, flags, mode);
+
+SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode)
+{
+	return ksys_open(filename, flags, mode);
 }
 
 SYSCALL_DEFINE4(openat, int, dfd, const char __user *, filename, int, flags,
@@ -1120,10 +1179,32 @@ SYSCALL_DEFINE4(openat, int, dfd, const char __user *, filename, int, flags,
 {
 	if (force_o_largefile())
 		flags |= O_LARGEFILE;
-
 	return do_sys_open(dfd, filename, flags, mode);
 }
 
+SYSCALL_DEFINE4(openat2, int, dfd, const char __user *, filename,
+		struct open_how __user *, how, size_t, usize)
+{
+	int err;
+	struct open_how tmp;
+
+	BUILD_BUG_ON(sizeof(struct open_how) < OPEN_HOW_SIZE_VER0);
+	BUILD_BUG_ON(sizeof(struct open_how) != OPEN_HOW_SIZE_LATEST);
+
+	if (unlikely(usize < OPEN_HOW_SIZE_VER0))
+		return -EINVAL;
+
+	err = copy_struct_from_user(&tmp, sizeof(tmp), how, usize);
+	if (err)
+		return err;
+
+	/* O_LARGEFILE is only allowed for non-O_PATH. */
+	if (!(tmp.flags & O_PATH) && force_o_largefile())
+		tmp.flags |= O_LARGEFILE;
+
+	return do_sys_openat2(dfd, filename, &tmp);
+}
+
 #ifdef CONFIG_COMPAT
 /*
  * Exactly like sys_open(), except that it doesn't set the
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index d019df946cb2..f2eb05bd3af3 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -2,15 +2,25 @@
 #ifndef _LINUX_FCNTL_H
 #define _LINUX_FCNTL_H
 
+#include <linux/stat.h>
 #include <uapi/linux/fcntl.h>
 
-/* list of all valid flags for the open/openat flags argument: */
+/* List of all valid flags for the open/openat flags argument: */
 #define VALID_OPEN_FLAGS \
 	(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
 	 O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
 	 FASYNC	| O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
 	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
 
+/* List of all valid flags for the how->upgrade_mask argument: */
+#define VALID_UPGRADE_FLAGS \
+	(UPGRADE_NOWRITE | UPGRADE_NOREAD)
+
+/* List of all valid flags for the how->resolve argument: */
+#define VALID_RESOLVE_FLAGS \
+	(RESOLVE_NO_XDEV | RESOLVE_NO_MAGICLINKS | RESOLVE_NO_SYMLINKS | \
+	 RESOLVE_BENEATH | RESOLVE_IN_ROOT)
+
 #ifndef force_o_largefile
 #define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T))
 #endif
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index f7c561c4dcdd..808f103b7a62 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -69,6 +69,7 @@ struct rseq;
 union bpf_attr;
 struct io_uring_params;
 struct clone_args;
+struct open_how;
 
 #include <linux/types.h>
 #include <linux/aio_abi.h>
@@ -439,6 +440,8 @@ asmlinkage long sys_fchownat(int dfd, const char __user *filename, uid_t user,
 asmlinkage long sys_fchown(unsigned int fd, uid_t user, gid_t group);
 asmlinkage long sys_openat(int dfd, const char __user *filename, int flags,
 			   umode_t mode);
+asmlinkage long sys_openat2(int dfd, const char __user *filename,
+			    struct open_how *how, size_t size);
 asmlinkage long sys_close(unsigned int fd);
 asmlinkage long sys_vhangup(void);
 
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 1fc8faa6e973..d4122c091472 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -851,8 +851,11 @@ __SYSCALL(__NR_pidfd_open, sys_pidfd_open)
 __SYSCALL(__NR_clone3, sys_clone3)
 #endif
 
+#define __NR_openat2 437
+__SYSCALL(__NR_openat2, sys_openat2)
+
 #undef __NR_syscalls
-#define __NR_syscalls 436
+#define __NR_syscalls 438
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 1d338357df8a..5de8b0006a95 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -93,5 +93,46 @@
 
 #define AT_RECURSIVE		0x8000	/* Apply to the entire subtree */
 
+/*
+ * Arguments for how openat2(2) should open the target path. If @resolve is
+ * zero, then openat2(2) operates very similarly to openat(2).
+ *
+ * However, unlike openat(2), unknown bits in @flags result in -EINVAL rather
+ * than being silently ignored. @mode must be zero unless one of {O_CREAT,
+ * O_TMPFILE} are set, and @upgrade_mask must be zero unless O_PATH is set.
+ *
+ * @flags: O_* flags.
+ * @mode: O_CREAT/O_TMPFILE file mode.
+ * @upgrade_mask: UPGRADE_* flags (to restrict O_PATH re-opening).
+ * @resolve: RESOLVE_* flags.
+ */
+struct open_how {
+	__aligned_u64 flags;
+	__u16 mode;
+	__u16 __padding[3]; /* must be zeroed */
+	__aligned_u64 resolve;
+};
+
+#define OPEN_HOW_SIZE_VER0	24 /* sizeof first published struct */
+#define OPEN_HOW_SIZE_LATEST	OPEN_HOW_SIZE_VER0
+
+/* how->resolve flags for openat2(2). */
+#define RESOLVE_NO_XDEV		0x01 /* Block mount-point crossings
+					(includes bind-mounts). */
+#define RESOLVE_NO_MAGICLINKS	0x02 /* Block traversal through procfs-style
+					"magic-links". */
+#define RESOLVE_NO_SYMLINKS	0x04 /* Block traversal through all symlinks
+					(implies OEXT_NO_MAGICLINKS) */
+#define RESOLVE_BENEATH		0x08 /* Block "lexical" trickery like
+					"..", symlinks, and absolute
+					paths which escape the dirfd. */
+#define RESOLVE_IN_ROOT		0x10 /* Make all jumps to "/" and ".."
+					be scoped inside the dirfd
+					(similar to chroot(2)). */
+
+/* how->upgrade flags for openat2(2). */
+/* First bit is reserved for a future UPGRADE_NOEXEC flag. */
+#define UPGRADE_NOREAD		0x02 /* Block re-opening with MAY_READ. */
+#define UPGRADE_NOWRITE		0x04 /* Block re-opening with MAY_WRITE. */
 
 #endif /* _UAPI_LINUX_FCNTL_H */
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v15 8/9] selftests: add openat2(2) selftests
  2019-11-05  9:05 [PATCH v15 0/9] open: introduce openat2(2) syscall Aleksa Sarai
                   ` (6 preceding siblings ...)
  2019-11-05  9:05 ` [PATCH v15 7/9] open: introduce openat2(2) syscall Aleksa Sarai
@ 2019-11-05  9:05 ` Aleksa Sarai
  2019-11-05  9:05 ` [PATCH v15 9/9] Documentation: path-lookup: mention LOOKUP_MAGICLINK_JUMPED Aleksa Sarai
  2019-11-11 13:24 ` [PATCH v15 0/9] open: introduce openat2(2) syscall Aleksa Sarai
  9 siblings, 0 replies; 49+ messages in thread
From: Aleksa Sarai @ 2019-11-05  9:05 UTC (permalink / raw)
  To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
	Peter Zijlstra
  Cc: Aleksa Sarai, Eric Biederman, Andy Lutomirski, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	David Drysdale, Chanho Min, Oleg Nesterov, Rasmus Villemoes,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Christian Brauner,
	Aleksa Sarai, Linus Torvalds, containers, linux-alpha, linux-api,
	libc-alpha, linux-arch, linux-arm-kernel, linux-fsdevel,
	linux-ia64, linux-kernel, linux-kselftest, linux-m68k,
	linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
	linux-xtensa, sparclinux

Test all of the various openat2(2) flags. A small stress-test of a
symlink-rename attack is included to show that the protections against
".."-based attacks are sufficient.

The main things these self-tests are enforcing are:

  * The struct+usize ABI for openat2(2) and copy_struct_from_user() to
    ensure that upgrades will be handled gracefully (in addition,
    ensuring that misaligned structures are also handled correctly).

  * The -EINVAL checks for openat2(2) are all correctly handled to avoid
    userspace passing unknown or conflicting flag sets (most
    importantly, ensuring that invalid flag combinations are checked).

  * All of the RESOLVE_* semantics (including errno values) are
    correctly handled with various combinations of paths and flags.

  * RESOLVE_IN_ROOT correctly protects against the symlink rename(2)
    attack that has been responsible for several CVEs (and likely will
    be responsible for several more).

Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/openat2/.gitignore    |   1 +
 tools/testing/selftests/openat2/Makefile      |   8 +
 tools/testing/selftests/openat2/helpers.c     | 109 ++++
 tools/testing/selftests/openat2/helpers.h     | 107 ++++
 .../testing/selftests/openat2/openat2_test.c  | 316 +++++++++++
 .../selftests/openat2/rename_attack_test.c    | 160 ++++++
 .../testing/selftests/openat2/resolve_test.c  | 523 ++++++++++++++++++
 8 files changed, 1225 insertions(+)
 create mode 100644 tools/testing/selftests/openat2/.gitignore
 create mode 100644 tools/testing/selftests/openat2/Makefile
 create mode 100644 tools/testing/selftests/openat2/helpers.c
 create mode 100644 tools/testing/selftests/openat2/helpers.h
 create mode 100644 tools/testing/selftests/openat2/openat2_test.c
 create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
 create mode 100644 tools/testing/selftests/openat2/resolve_test.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 4cdbae6f4e61..28996856ed5e 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -37,6 +37,7 @@ TARGETS += powerpc
 TARGETS += proc
 TARGETS += pstore
 TARGETS += ptrace
+TARGETS += openat2
 TARGETS += rseq
 TARGETS += rtc
 TARGETS += seccomp
diff --git a/tools/testing/selftests/openat2/.gitignore b/tools/testing/selftests/openat2/.gitignore
new file mode 100644
index 000000000000..bd68f6c3fd07
--- /dev/null
+++ b/tools/testing/selftests/openat2/.gitignore
@@ -0,0 +1 @@
+/*_test
diff --git a/tools/testing/selftests/openat2/Makefile b/tools/testing/selftests/openat2/Makefile
new file mode 100644
index 000000000000..4b93b1417b86
--- /dev/null
+++ b/tools/testing/selftests/openat2/Makefile
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+CFLAGS += -Wall -O2 -g -fsanitize=address -fsanitize=undefined
+TEST_GEN_PROGS := openat2_test resolve_test rename_attack_test
+
+include ../lib.mk
+
+$(TEST_GEN_PROGS): helpers.c
diff --git a/tools/testing/selftests/openat2/helpers.c b/tools/testing/selftests/openat2/helpers.c
new file mode 100644
index 000000000000..e9a6557ab16f
--- /dev/null
+++ b/tools/testing/selftests/openat2/helpers.c
@@ -0,0 +1,109 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <cyphar@cyphar.com>
+ * Copyright (C) 2018-2019 SUSE LLC.
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <stdbool.h>
+#include <string.h>
+#include <syscall.h>
+#include <limits.h>
+
+#include "helpers.h"
+
+bool needs_openat2(const struct open_how *how)
+{
+	return how->resolve != 0;
+}
+
+int raw_openat2(int dfd, const char *path, void *how, size_t size)
+{
+	int ret = syscall(__NR_openat2, dfd, path, how, size);
+	return ret >= 0 ? ret : -errno;
+}
+
+int sys_openat2(int dfd, const char *path, struct open_how *how)
+{
+	return raw_openat2(dfd, path, how, sizeof(*how));
+}
+
+int sys_openat(int dfd, const char *path, struct open_how *how)
+{
+	int ret = openat(dfd, path, how->flags, how->mode);
+	return ret >= 0 ? ret : -errno;
+}
+
+int sys_renameat2(int olddirfd, const char *oldpath,
+		  int newdirfd, const char *newpath, unsigned int flags)
+{
+	int ret = syscall(__NR_renameat2, olddirfd, oldpath,
+					  newdirfd, newpath, flags);
+	return ret >= 0 ? ret : -errno;
+}
+
+int touchat(int dfd, const char *path)
+{
+	int fd = openat(dfd, path, O_CREAT);
+	if (fd >= 0)
+		close(fd);
+	return fd;
+}
+
+char *fdreadlink(int fd)
+{
+	char *target, *tmp;
+
+	E_asprintf(&tmp, "/proc/self/fd/%d", fd);
+
+	target = malloc(PATH_MAX);
+	if (!target)
+		ksft_exit_fail_msg("fdreadlink: malloc failed\n");
+	memset(target, 0, PATH_MAX);
+
+	E_readlink(tmp, target, PATH_MAX);
+	free(tmp);
+	return target;
+}
+
+bool fdequal(int fd, int dfd, const char *path)
+{
+	char *fdpath, *dfdpath, *other;
+	bool cmp;
+
+	fdpath = fdreadlink(fd);
+	dfdpath = fdreadlink(dfd);
+
+	if (!path)
+		E_asprintf(&other, "%s", dfdpath);
+	else if (*path == '/')
+		E_asprintf(&other, "%s", path);
+	else
+		E_asprintf(&other, "%s/%s", dfdpath, path);
+
+	cmp = !strcmp(fdpath, other);
+
+	free(fdpath);
+	free(dfdpath);
+	free(other);
+	return cmp;
+}
+
+bool openat2_supported = false;
+
+void __attribute__((constructor)) init(void)
+{
+	struct open_how how = {};
+	int fd;
+
+	BUILD_BUG_ON(sizeof(struct open_how) != OPEN_HOW_SIZE_VER0);
+
+	/* Check openat2(2) support. */
+	fd = sys_openat2(AT_FDCWD, ".", &how);
+	openat2_supported = (fd >= 0);
+
+	if (fd >= 0)
+		close(fd);
+}
diff --git a/tools/testing/selftests/openat2/helpers.h b/tools/testing/selftests/openat2/helpers.h
new file mode 100644
index 000000000000..43ca5ceab6e3
--- /dev/null
+++ b/tools/testing/selftests/openat2/helpers.h
@@ -0,0 +1,107 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <cyphar@cyphar.com>
+ * Copyright (C) 2018-2019 SUSE LLC.
+ */
+
+#ifndef __RESOLVEAT_H__
+#define __RESOLVEAT_H__
+
+#define _GNU_SOURCE
+#include <stdint.h>
+#include <errno.h>
+#include <linux/types.h>
+#include "../kselftest.h"
+
+#define ARRAY_LEN(X) (sizeof (X) / sizeof (*(X)))
+#define BUILD_BUG_ON(e) ((void)(sizeof(struct { int:(-!!(e)); })))
+
+#ifndef SYS_openat2
+#ifndef __NR_openat2
+#define __NR_openat2 437
+#endif /* __NR_openat2 */
+#define SYS_openat2 __NR_openat2
+#endif /* SYS_openat2 */
+
+/*
+ * Arguments for how openat2(2) should open the target path. If @resolve is
+ * zero, then openat2(2) operates very similarly to openat(2).
+ *
+ * However, unlike openat(2), unknown bits in @flags result in -EINVAL rather
+ * than being silently ignored. @mode must be zero unless one of {O_CREAT,
+ * O_TMPFILE} are set.
+ *
+ * @flags: O_* flags.
+ * @mode: O_CREAT/O_TMPFILE file mode.
+ * @resolve: RESOLVE_* flags.
+ */
+struct open_how {
+	__aligned_u64 flags;
+	__u16 mode;
+	__u16 __padding[3]; /* must be zeroed */
+	__aligned_u64 resolve;
+};
+
+#define OPEN_HOW_SIZE_VER0	24 /* sizeof first published struct */
+#define OPEN_HOW_SIZE_LATEST	OPEN_HOW_SIZE_VER0
+
+bool needs_openat2(const struct open_how *how);
+
+#ifndef RESOLVE_IN_ROOT
+/* how->resolve flags for openat2(2). */
+#define RESOLVE_NO_XDEV		0x01 /* Block mount-point crossings
+					(includes bind-mounts). */
+#define RESOLVE_NO_MAGICLINKS	0x02 /* Block traversal through procfs-style
+					"magic-links". */
+#define RESOLVE_NO_SYMLINKS	0x04 /* Block traversal through all symlinks
+					(implies OEXT_NO_MAGICLINKS) */
+#define RESOLVE_BENEATH		0x08 /* Block "lexical" trickery like
+					"..", symlinks, and absolute
+					paths which escape the dirfd. */
+#define RESOLVE_IN_ROOT		0x10 /* Make all jumps to "/" and ".."
+					be scoped inside the dirfd
+					(similar to chroot(2)). */
+#endif /* RESOLVE_IN_ROOT */
+
+#define E_func(func, ...)						\
+	do {								\
+		if (func(__VA_ARGS__) < 0)				\
+			ksft_exit_fail_msg("%s:%d %s failed\n", \
+					   __FILE__, __LINE__, #func);\
+	} while (0)
+
+#define E_asprintf(...)		E_func(asprintf,	__VA_ARGS__)
+#define E_chmod(...)		E_func(chmod,		__VA_ARGS__)
+#define E_dup2(...)		E_func(dup2,		__VA_ARGS__)
+#define E_fchdir(...)		E_func(fchdir,		__VA_ARGS__)
+#define E_fstatat(...)		E_func(fstatat,		__VA_ARGS__)
+#define E_kill(...)		E_func(kill,		__VA_ARGS__)
+#define E_mkdirat(...)		E_func(mkdirat,		__VA_ARGS__)
+#define E_mount(...)		E_func(mount,		__VA_ARGS__)
+#define E_prctl(...)		E_func(prctl,		__VA_ARGS__)
+#define E_readlink(...)		E_func(readlink,	__VA_ARGS__)
+#define E_setresuid(...)	E_func(setresuid,	__VA_ARGS__)
+#define E_symlinkat(...)	E_func(symlinkat,	__VA_ARGS__)
+#define E_touchat(...)		E_func(touchat,		__VA_ARGS__)
+#define E_unshare(...)		E_func(unshare,		__VA_ARGS__)
+
+#define E_assert(expr, msg, ...)					\
+	do {								\
+		if (!(expr))						\
+			ksft_exit_fail_msg("ASSERT(%s:%d) failed (%s): " msg "\n", \
+					   __FILE__, __LINE__, #expr, ##__VA_ARGS__); \
+	} while (0)
+
+int raw_openat2(int dfd, const char *path, void *how, size_t size);
+int sys_openat2(int dfd, const char *path, struct open_how *how);
+int sys_openat(int dfd, const char *path, struct open_how *how);
+int sys_renameat2(int olddirfd, const char *oldpath,
+		  int newdirfd, const char *newpath, unsigned int flags);
+
+int touchat(int dfd, const char *path);
+char *fdreadlink(int fd);
+bool fdequal(int fd, int dfd, const char *path);
+
+extern bool openat2_supported;
+
+#endif /* __RESOLVEAT_H__ */
diff --git a/tools/testing/selftests/openat2/openat2_test.c b/tools/testing/selftests/openat2/openat2_test.c
new file mode 100644
index 000000000000..8a641acb0d6c
--- /dev/null
+++ b/tools/testing/selftests/openat2/openat2_test.c
@@ -0,0 +1,316 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <cyphar@cyphar.com>
+ * Copyright (C) 2018-2019 SUSE LLC.
+ */
+
+#define _GNU_SOURCE
+#include <fcntl.h>
+#include <sched.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/mount.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+
+#include "../kselftest.h"
+#include "helpers.h"
+
+/*
+ * O_LARGEFILE is set to 0 by glibc.
+ * XXX: This is wrong on {mips, parisc, powerpc, sparc}.
+ */
+#undef	O_LARGEFILE
+#define	O_LARGEFILE 0x8000
+
+struct open_how_ext {
+	struct open_how inner;
+	uint32_t extra1;
+	char pad1[128];
+	uint32_t extra2;
+	char pad2[128];
+	uint32_t extra3;
+};
+
+struct struct_test {
+	const char *name;
+	struct open_how_ext arg;
+	size_t size;
+	int err;
+};
+
+#define NUM_OPENAT2_STRUCT_TESTS 9
+#define NUM_OPENAT2_STRUCT_VARIATIONS 13
+
+void test_openat2_struct(void)
+{
+	int misalignments[] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 17, 87 };
+
+	struct struct_test tests[] = {
+		/* Normal struct. */
+		{ .name = "normal struct",
+		  .arg.inner.flags = O_RDONLY,
+		  .size = sizeof(struct open_how) },
+		/* Bigger struct, with zeroed out end. */
+		{ .name = "bigger struct (zeroed out)",
+		  .arg.inner.flags = O_RDONLY,
+		  .size = sizeof(struct open_how_ext) },
+
+		/* Normal struct with broken padding. */
+		{ .name = "normal struct (non-zero padding[0])",
+		  .arg.inner.flags = O_RDONLY,
+		  .arg.inner.__padding = {0xa0, 0x00},
+		  .size = sizeof(struct open_how_ext), .err = -EINVAL },
+		{ .name = "normal struct (non-zero padding[1])",
+		  .arg.inner.flags = O_RDONLY,
+		  .arg.inner.__padding = {0x00, 0x1a},
+		  .size = sizeof(struct open_how_ext), .err = -EINVAL },
+
+		/* TODO: Once expanded, check zero-padding. */
+
+		/* Smaller than version-0 struct. */
+		{ .name = "zero-sized 'struct'",
+		  .arg.inner.flags = O_RDONLY, .size = 0, .err = -EINVAL },
+		{ .name = "smaller-than-v0 struct",
+		  .arg.inner.flags = O_RDONLY,
+		  .size = OPEN_HOW_SIZE_VER0 - 1, .err = -EINVAL },
+
+		/* Bigger struct, with non-zero trailing bytes. */
+		{ .name = "bigger struct (non-zero data in first 'future field')",
+		  .arg.inner.flags = O_RDONLY, .arg.extra1 = 0xdeadbeef,
+		  .size = sizeof(struct open_how_ext), .err = -E2BIG },
+		{ .name = "bigger struct (non-zero data in middle of 'future fields')",
+		  .arg.inner.flags = O_RDONLY, .arg.extra2 = 0xfeedcafe,
+		  .size = sizeof(struct open_how_ext), .err = -E2BIG },
+		{ .name = "bigger struct (non-zero data at end of 'future fields')",
+		  .arg.inner.flags = O_RDONLY, .arg.extra3 = 0xabad1dea,
+		  .size = sizeof(struct open_how_ext), .err = -E2BIG },
+	};
+
+	BUILD_BUG_ON(ARRAY_LEN(misalignments) != NUM_OPENAT2_STRUCT_VARIATIONS);
+	BUILD_BUG_ON(ARRAY_LEN(tests) != NUM_OPENAT2_STRUCT_TESTS);
+
+	for (int i = 0; i < ARRAY_LEN(tests); i++) {
+		struct struct_test *test = &tests[i];
+		struct open_how_ext how_ext = test->arg;
+
+		for (int j = 0; j < ARRAY_LEN(misalignments); j++) {
+			int fd, misalign = misalignments[j];
+			char *fdpath = NULL;
+			bool failed;
+			void (*resultfn)(const char *msg, ...) = ksft_test_result_pass;
+
+			void *copy = NULL, *how_copy = &how_ext;
+
+			if (!openat2_supported) {
+				ksft_print_msg("openat2(2) unsupported\n");
+				resultfn = ksft_test_result_skip;
+				goto skip;
+			}
+
+			if (misalign) {
+				/*
+				 * Explicitly misalign the structure copying it with the given
+				 * (mis)alignment offset. The other data is set to be non-zero to
+				 * make sure that non-zero bytes outside the struct aren't checked
+				 *
+				 * This is effectively to check that is_zeroed_user() works.
+				 */
+				copy = malloc(misalign + sizeof(how_ext));
+				how_copy = copy + misalign;
+				memset(copy, 0xff, misalign);
+				memcpy(how_copy, &how_ext, sizeof(how_ext));
+			}
+
+			fd = raw_openat2(AT_FDCWD, ".", how_copy, test->size);
+			if (test->err >= 0)
+				failed = (fd < 0);
+			else
+				failed = (fd != test->err);
+			if (fd >= 0) {
+				fdpath = fdreadlink(fd);
+				close(fd);
+			}
+
+			if (failed) {
+				resultfn = ksft_test_result_fail;
+
+				ksft_print_msg("openat2 unexpectedly returned ");
+				if (fdpath)
+					ksft_print_msg("%d['%s']\n", fd, fdpath);
+				else
+					ksft_print_msg("%d (%s)\n", fd, strerror(-fd));
+			}
+
+skip:
+			if (test->err >= 0)
+				resultfn("openat2 with %s argument [misalign=%d] succeeds\n",
+					 test->name, misalign);
+			else
+				resultfn("openat2 with %s argument [misalign=%d] fails with %d (%s)\n",
+					 test->name, misalign, test->err,
+					 strerror(-test->err));
+
+			free(copy);
+			free(fdpath);
+			fflush(stdout);
+		}
+	}
+}
+
+struct flag_test {
+	const char *name;
+	struct open_how how;
+	int err;
+};
+
+#define NUM_OPENAT2_FLAG_TESTS 21
+
+void test_openat2_flags(void)
+{
+	struct flag_test tests[] = {
+		/* O_TMPFILE is incompatible with O_PATH and O_CREAT. */
+		{ .name = "incompatible flags (O_TMPFILE | O_PATH)",
+		  .how.flags = O_TMPFILE | O_PATH | O_RDWR, .err = -EINVAL },
+		{ .name = "incompatible flags (O_TMPFILE | O_CREAT)",
+		  .how.flags = O_TMPFILE | O_CREAT | O_RDWR, .err = -EINVAL },
+
+		/* O_PATH only permits certain other flags to be set ... */
+		{ .name = "compatible flags (O_PATH | O_CLOEXEC)",
+		  .how.flags = O_PATH | O_CLOEXEC },
+		{ .name = "compatible flags (O_PATH | O_DIRECTORY)",
+		  .how.flags = O_PATH | O_DIRECTORY },
+		{ .name = "compatible flags (O_PATH | O_NOFOLLOW)",
+		  .how.flags = O_PATH | O_NOFOLLOW },
+		/* ... and others are absolutely not permitted. */
+		{ .name = "incompatible flags (O_PATH | O_RDWR)",
+		  .how.flags = O_PATH | O_RDWR, .err = -EINVAL },
+		{ .name = "incompatible flags (O_PATH | O_CREAT)",
+		  .how.flags = O_PATH | O_CREAT, .err = -EINVAL },
+		{ .name = "incompatible flags (O_PATH | O_EXCL)",
+		  .how.flags = O_PATH | O_EXCL, .err = -EINVAL },
+		{ .name = "incompatible flags (O_PATH | O_NOCTTY)",
+		  .how.flags = O_PATH | O_NOCTTY, .err = -EINVAL },
+		{ .name = "incompatible flags (O_PATH | O_DIRECT)",
+		  .how.flags = O_PATH | O_DIRECT, .err = -EINVAL },
+		{ .name = "incompatible flags (O_PATH | O_LARGEFILE)",
+		  .how.flags = O_PATH | O_LARGEFILE, .err = -EINVAL },
+
+		/* ->mode must only be set with O_{CREAT,TMPFILE}. */
+		{ .name = "non-zero how.mode and O_RDONLY",
+		  .how.flags = O_RDONLY, .how.mode = 0600, .err = -EINVAL },
+		{ .name = "non-zero how.mode and O_PATH",
+		  .how.flags = O_PATH,   .how.mode = 0600, .err = -EINVAL },
+		{ .name = "valid how.mode and O_CREAT",
+		  .how.flags = O_CREAT,  .how.mode = 0600 },
+		{ .name = "valid how.mode and O_TMPFILE",
+		  .how.flags = O_TMPFILE | O_RDWR, .how.mode = 0600 },
+		/* ->mode must only contain 0777 bits. */
+		{ .name = "invalid how.mode and O_CREAT",
+		  .how.flags = O_CREAT,
+		  .how.mode = 0xFFFF, .err = -EINVAL },
+		{ .name = "invalid how.mode and O_TMPFILE",
+		  .how.flags = O_TMPFILE | O_RDWR,
+		  .how.mode = 0x1337, .err = -EINVAL },
+
+		/* ->resolve must only contain RESOLVE_* flags. */
+		{ .name = "invalid how.resolve and O_RDONLY",
+		  .how.flags = O_RDONLY,
+		  .how.resolve = 0x1337, .err = -EINVAL },
+		{ .name = "invalid how.resolve and O_CREAT",
+		  .how.flags = O_CREAT,
+		  .how.resolve = 0x1337, .err = -EINVAL },
+		{ .name = "invalid how.resolve and O_TMPFILE",
+		  .how.flags = O_TMPFILE | O_RDWR,
+		  .how.resolve = 0x1337, .err = -EINVAL },
+		{ .name = "invalid how.resolve and O_PATH",
+		  .how.flags = O_PATH,
+		  .how.resolve = 0x1337, .err = -EINVAL },
+	};
+
+	BUILD_BUG_ON(ARRAY_LEN(tests) != NUM_OPENAT2_FLAG_TESTS);
+
+	for (int i = 0; i < ARRAY_LEN(tests); i++) {
+		int fd, fdflags = -1;
+		char *path, *fdpath = NULL;
+		bool failed = false;
+		struct flag_test *test = &tests[i];
+		void (*resultfn)(const char *msg, ...) = ksft_test_result_pass;
+
+		if (!openat2_supported) {
+			ksft_print_msg("openat2(2) unsupported\n");
+			resultfn = ksft_test_result_skip;
+			goto skip;
+		}
+
+		path = (test->how.flags & O_CREAT) ? "/tmp/ksft.openat2_tmpfile" : ".";
+		unlink(path);
+
+		fd = sys_openat2(AT_FDCWD, path, &test->how);
+		if (test->err >= 0)
+			failed = (fd < 0);
+		else
+			failed = (fd != test->err);
+		if (fd >= 0) {
+			int otherflags;
+
+			fdpath = fdreadlink(fd);
+			fdflags = fcntl(fd, F_GETFL);
+			otherflags = fcntl(fd, F_GETFD);
+			close(fd);
+
+			E_assert(fdflags >= 0, "fcntl F_GETFL of new fd");
+			E_assert(otherflags >= 0, "fcntl F_GETFD of new fd");
+
+			/* O_CLOEXEC isn't shown in F_GETFL. */
+			if (otherflags & FD_CLOEXEC)
+				fdflags |= O_CLOEXEC;
+			/* O_CREAT is hidden from F_GETFL. */
+			if (test->how.flags & O_CREAT)
+				fdflags |= O_CREAT;
+			if (!(test->how.flags & O_LARGEFILE))
+				fdflags &= ~O_LARGEFILE;
+			failed |= (fdflags != test->how.flags);
+		}
+
+		if (failed) {
+			resultfn = ksft_test_result_fail;
+
+			ksft_print_msg("openat2 unexpectedly returned ");
+			if (fdpath)
+				ksft_print_msg("%d['%s'] with %X (!= %X)\n",
+					       fd, fdpath, fdflags,
+					       test->how.flags);
+			else
+				ksft_print_msg("%d (%s)\n", fd, strerror(-fd));
+		}
+
+skip:
+		if (test->err >= 0)
+			resultfn("openat2 with %s succeeds\n", test->name);
+		else
+			resultfn("openat2 with %s fails with %d (%s)\n",
+				 test->name, test->err, strerror(-test->err));
+
+		free(fdpath);
+		fflush(stdout);
+	}
+}
+
+#define NUM_TESTS (NUM_OPENAT2_STRUCT_VARIATIONS * NUM_OPENAT2_STRUCT_TESTS + \
+		   NUM_OPENAT2_FLAG_TESTS)
+
+int main(int argc, char **argv)
+{
+	ksft_print_header();
+	ksft_set_plan(NUM_TESTS);
+
+	test_openat2_struct();
+	test_openat2_flags();
+
+	if (ksft_get_fail_cnt() + ksft_get_error_cnt() > 0)
+		ksft_exit_fail();
+	else
+		ksft_exit_pass();
+}
diff --git a/tools/testing/selftests/openat2/rename_attack_test.c b/tools/testing/selftests/openat2/rename_attack_test.c
new file mode 100644
index 000000000000..0a770728b436
--- /dev/null
+++ b/tools/testing/selftests/openat2/rename_attack_test.c
@@ -0,0 +1,160 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <cyphar@cyphar.com>
+ * Copyright (C) 2018-2019 SUSE LLC.
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <sched.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/mount.h>
+#include <sys/mman.h>
+#include <sys/prctl.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <syscall.h>
+#include <limits.h>
+#include <unistd.h>
+
+#include "../kselftest.h"
+#include "helpers.h"
+
+/* Construct a test directory with the following structure:
+ *
+ * root/
+ * |-- a/
+ * |   `-- c/
+ * `-- b/
+ */
+int setup_testdir(void)
+{
+	int dfd;
+	char dirname[] = "/tmp/ksft-openat2-rename-attack.XXXXXX";
+
+	/* Make the top-level directory. */
+	if (!mkdtemp(dirname))
+		ksft_exit_fail_msg("setup_testdir: failed to create tmpdir\n");
+	dfd = open(dirname, O_PATH | O_DIRECTORY);
+	if (dfd < 0)
+		ksft_exit_fail_msg("setup_testdir: failed to open tmpdir\n");
+
+	E_mkdirat(dfd, "a", 0755);
+	E_mkdirat(dfd, "b", 0755);
+	E_mkdirat(dfd, "a/c", 0755);
+
+	return dfd;
+}
+
+/* Swap @dirfd/@a and @dirfd/@b constantly. Parent must kill this process. */
+pid_t spawn_attack(int dirfd, char *a, char *b)
+{
+	pid_t child = fork();
+	if (child != 0)
+		return child;
+
+	/* If the parent (the test process) dies, kill ourselves too. */
+	E_prctl(PR_SET_PDEATHSIG, SIGKILL);
+
+	/* Swap @a and @b. */
+	for (;;)
+		renameat2(dirfd, a, dirfd, b, RENAME_EXCHANGE);
+	exit(1);
+}
+
+#define NUM_RENAME_TESTS 2
+#define ROUNDS 400000
+
+const char *flagname(int resolve)
+{
+	switch (resolve) {
+	case RESOLVE_IN_ROOT:
+		return "RESOLVE_IN_ROOT";
+	case RESOLVE_BENEATH:
+		return "RESOLVE_BENEATH";
+	}
+	return "(unknown)";
+}
+
+void test_rename_attack(int resolve)
+{
+	int dfd, afd;
+	pid_t child;
+	void (*resultfn)(const char *msg, ...) = ksft_test_result_pass;
+	int escapes = 0, other_errs = 0, exdevs = 0, eagains = 0, successes = 0;
+
+	struct open_how how = {
+		.flags = O_PATH,
+		.resolve = resolve,
+	};
+
+	if (!openat2_supported) {
+		how.resolve = 0;
+		ksft_print_msg("openat2(2) unsupported -- using openat(2) instead\n");
+	}
+
+	dfd = setup_testdir();
+	afd = openat(dfd, "a", O_PATH);
+	if (afd < 0)
+		ksft_exit_fail_msg("test_rename_attack: failed to open 'a'\n");
+
+	child = spawn_attack(dfd, "a/c", "b");
+
+	for (int i = 0; i < ROUNDS; i++) {
+		int fd;
+		char *victim_path = "c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../..";
+
+		if (openat2_supported)
+			fd = sys_openat2(afd, victim_path, &how);
+		else
+			fd = sys_openat(afd, victim_path, &how);
+
+		if (fd < 0) {
+			if (fd == -EAGAIN)
+				eagains++;
+			else if (fd == -EXDEV)
+				exdevs++;
+			else if (fd == -ENOENT)
+				escapes++; /* escaped outside and got ENOENT... */
+			else
+				other_errs++; /* unexpected error */
+		} else {
+			if (fdequal(fd, afd, NULL))
+				successes++;
+			else
+				escapes++; /* we got an unexpected fd */
+		}
+		close(fd);
+	}
+
+	if (escapes > 0)
+		resultfn = ksft_test_result_fail;
+	ksft_print_msg("non-escapes: EAGAIN=%d EXDEV=%d E<other>=%d success=%d\n",
+		       eagains, exdevs, other_errs, successes);
+	resultfn("rename attack with %s (%d runs, got %d escapes)\n",
+		 flagname(resolve), ROUNDS, escapes);
+
+	/* Should be killed anyway, but might as well make sure. */
+	E_kill(child, SIGKILL);
+}
+
+#define NUM_TESTS NUM_RENAME_TESTS
+
+int main(int argc, char **argv)
+{
+	ksft_print_header();
+	ksft_set_plan(NUM_TESTS);
+
+	test_rename_attack(RESOLVE_BENEATH);
+	test_rename_attack(RESOLVE_IN_ROOT);
+
+	if (ksft_get_fail_cnt() + ksft_get_error_cnt() > 0)
+		ksft_exit_fail();
+	else
+		ksft_exit_pass();
+}
diff --git a/tools/testing/selftests/openat2/resolve_test.c b/tools/testing/selftests/openat2/resolve_test.c
new file mode 100644
index 000000000000..7a94b1da8e7b
--- /dev/null
+++ b/tools/testing/selftests/openat2/resolve_test.c
@@ -0,0 +1,523 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <cyphar@cyphar.com>
+ * Copyright (C) 2018-2019 SUSE LLC.
+ */
+
+#define _GNU_SOURCE
+#include <fcntl.h>
+#include <sched.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/mount.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+
+#include "../kselftest.h"
+#include "helpers.h"
+
+/*
+ * Construct a test directory with the following structure:
+ *
+ * root/
+ * |-- procexe -> /proc/self/exe
+ * |-- procroot -> /proc/self/root
+ * |-- root/
+ * |-- mnt/ [mountpoint]
+ * |   |-- self -> ../mnt/
+ * |   `-- absself -> /mnt/
+ * |-- etc/
+ * |   `-- passwd
+ * |-- creatlink -> /newfile3
+ * |-- reletc -> etc/
+ * |-- relsym -> etc/passwd
+ * |-- absetc -> /etc/
+ * |-- abssym -> /etc/passwd
+ * |-- abscheeky -> /cheeky
+ * `-- cheeky/
+ *     |-- absself -> /
+ *     |-- self -> ../../root/
+ *     |-- garbageself -> /../../root/
+ *     |-- passwd -> ../cheeky/../cheeky/../etc/../etc/passwd
+ *     |-- abspasswd -> /../cheeky/../cheeky/../etc/../etc/passwd
+ *     |-- dotdotlink -> ../../../../../../../../../../../../../../etc/passwd
+ *     `-- garbagelink -> /../../../../../../../../../../../../../../etc/passwd
+ */
+int setup_testdir(void)
+{
+	int dfd, tmpfd;
+	char dirname[] = "/tmp/ksft-openat2-testdir.XXXXXX";
+
+	/* Unshare and make /tmp a new directory. */
+	E_unshare(CLONE_NEWNS);
+	E_mount("", "/tmp", "", MS_PRIVATE, "");
+
+	/* Make the top-level directory. */
+	if (!mkdtemp(dirname))
+		ksft_exit_fail_msg("setup_testdir: failed to create tmpdir\n");
+	dfd = open(dirname, O_PATH | O_DIRECTORY);
+	if (dfd < 0)
+		ksft_exit_fail_msg("setup_testdir: failed to open tmpdir\n");
+
+	/* A sub-directory which is actually used for tests. */
+	E_mkdirat(dfd, "root", 0755);
+	tmpfd = openat(dfd, "root", O_PATH | O_DIRECTORY);
+	if (tmpfd < 0)
+		ksft_exit_fail_msg("setup_testdir: failed to open tmpdir\n");
+	close(dfd);
+	dfd = tmpfd;
+
+	E_symlinkat("/proc/self/exe", dfd, "procexe");
+	E_symlinkat("/proc/self/root", dfd, "procroot");
+	E_mkdirat(dfd, "root", 0755);
+
+	/* There is no mountat(2), so use chdir. */
+	E_mkdirat(dfd, "mnt", 0755);
+	E_fchdir(dfd);
+	E_mount("tmpfs", "./mnt", "tmpfs", MS_NOSUID | MS_NODEV, "");
+	E_symlinkat("../mnt/", dfd, "mnt/self");
+	E_symlinkat("/mnt/", dfd, "mnt/absself");
+
+	E_mkdirat(dfd, "etc", 0755);
+	E_touchat(dfd, "etc/passwd");
+
+	E_symlinkat("/newfile3", dfd, "creatlink");
+	E_symlinkat("etc/", dfd, "reletc");
+	E_symlinkat("etc/passwd", dfd, "relsym");
+	E_symlinkat("/etc/", dfd, "absetc");
+	E_symlinkat("/etc/passwd", dfd, "abssym");
+	E_symlinkat("/cheeky", dfd, "abscheeky");
+
+	E_mkdirat(dfd, "cheeky", 0755);
+
+	E_symlinkat("/", dfd, "cheeky/absself");
+	E_symlinkat("../../root/", dfd, "cheeky/self");
+	E_symlinkat("/../../root/", dfd, "cheeky/garbageself");
+
+	E_symlinkat("../cheeky/../etc/../etc/passwd", dfd, "cheeky/passwd");
+	E_symlinkat("/../cheeky/../etc/../etc/passwd", dfd, "cheeky/abspasswd");
+
+	E_symlinkat("../../../../../../../../../../../../../../etc/passwd",
+		    dfd, "cheeky/dotdotlink");
+	E_symlinkat("/../../../../../../../../../../../../../../etc/passwd",
+		    dfd, "cheeky/garbagelink");
+
+	return dfd;
+}
+
+struct basic_test {
+	const char *name;
+	const char *dir;
+	const char *path;
+	struct open_how how;
+	bool pass;
+	union {
+		int err;
+		const char *path;
+	} out;
+};
+
+#define NUM_OPENAT2_OPATH_TESTS 88
+
+void test_openat2_opath_tests(void)
+{
+	int rootfd, hardcoded_fd;
+	char *procselfexe, *hardcoded_fdpath;
+
+	E_asprintf(&procselfexe, "/proc/%d/exe", getpid());
+	rootfd = setup_testdir();
+
+	hardcoded_fd = open("/dev/null", O_RDONLY);
+	E_assert(hardcoded_fd >= 0, "open fd to hardcode");
+	E_asprintf(&hardcoded_fdpath, "self/fd/%d", hardcoded_fd);
+
+	struct basic_test tests[] = {
+		/** RESOLVE_BENEATH **/
+		/* Attempts to cross dirfd should be blocked. */
+		{ .name = "[beneath] jump to /",
+		  .path = "/",			.how.resolve = RESOLVE_BENEATH,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[beneath] absolute link to $root",
+		  .path = "cheeky/absself",	.how.resolve = RESOLVE_BENEATH,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[beneath] chained absolute links to $root",
+		  .path = "abscheeky/absself",	.how.resolve = RESOLVE_BENEATH,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[beneath] jump outside $root",
+		  .path = "..",			.how.resolve = RESOLVE_BENEATH,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[beneath] temporary jump outside $root",
+		  .path = "../root/",		.how.resolve = RESOLVE_BENEATH,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[beneath] symlink temporary jump outside $root",
+		  .path = "cheeky/self",	.how.resolve = RESOLVE_BENEATH,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[beneath] chained symlink temporary jump outside $root",
+		  .path = "abscheeky/self",	.how.resolve = RESOLVE_BENEATH,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[beneath] garbage links to $root",
+		  .path = "cheeky/garbageself",	.how.resolve = RESOLVE_BENEATH,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[beneath] chained garbage links to $root",
+		  .path = "abscheeky/garbageself", .how.resolve = RESOLVE_BENEATH,
+		  .out.err = -EXDEV,		.pass = false },
+		/* Only relative paths that stay inside dirfd should work. */
+		{ .name = "[beneath] ordinary path to 'root'",
+		  .path = "root",		.how.resolve = RESOLVE_BENEATH,
+		  .out.path = "root",		.pass = true },
+		{ .name = "[beneath] ordinary path to 'etc'",
+		  .path = "etc",		.how.resolve = RESOLVE_BENEATH,
+		  .out.path = "etc",		.pass = true },
+		{ .name = "[beneath] ordinary path to 'etc/passwd'",
+		  .path = "etc/passwd",		.how.resolve = RESOLVE_BENEATH,
+		  .out.path = "etc/passwd",	.pass = true },
+		{ .name = "[beneath] relative symlink inside $root",
+		  .path = "relsym",		.how.resolve = RESOLVE_BENEATH,
+		  .out.path = "etc/passwd",	.pass = true },
+		{ .name = "[beneath] chained-'..' relative symlink inside $root",
+		  .path = "cheeky/passwd",	.how.resolve = RESOLVE_BENEATH,
+		  .out.path = "etc/passwd",	.pass = true },
+		{ .name = "[beneath] absolute symlink component outside $root",
+		  .path = "abscheeky/passwd",	.how.resolve = RESOLVE_BENEATH,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[beneath] absolute symlink target outside $root",
+		  .path = "abssym",		.how.resolve = RESOLVE_BENEATH,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[beneath] absolute path outside $root",
+		  .path = "/etc/passwd",	.how.resolve = RESOLVE_BENEATH,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[beneath] cheeky absolute path outside $root",
+		  .path = "cheeky/abspasswd",	.how.resolve = RESOLVE_BENEATH,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[beneath] chained cheeky absolute path outside $root",
+		  .path = "abscheeky/abspasswd", .how.resolve = RESOLVE_BENEATH,
+		  .out.err = -EXDEV,		.pass = false },
+		/* Tricky paths should fail. */
+		{ .name = "[beneath] tricky '..'-chained symlink outside $root",
+		  .path = "cheeky/dotdotlink",	.how.resolve = RESOLVE_BENEATH,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[beneath] tricky absolute + '..'-chained symlink outside $root",
+		  .path = "abscheeky/dotdotlink", .how.resolve = RESOLVE_BENEATH,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[beneath] tricky garbage link outside $root",
+		  .path = "cheeky/garbagelink",	.how.resolve = RESOLVE_BENEATH,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[beneath] tricky absolute + garbage link outside $root",
+		  .path = "abscheeky/garbagelink", .how.resolve = RESOLVE_BENEATH,
+		  .out.err = -EXDEV,		.pass = false },
+
+		/** RESOLVE_IN_ROOT **/
+		/* All attempts to cross the dirfd will be scoped-to-root. */
+		{ .name = "[in_root] jump to /",
+		  .path = "/",			.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = NULL,		.pass = true },
+		{ .name = "[in_root] absolute symlink to /root",
+		  .path = "cheeky/absself",	.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = NULL,		.pass = true },
+		{ .name = "[in_root] chained absolute symlinks to /root",
+		  .path = "abscheeky/absself",	.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = NULL,		.pass = true },
+		{ .name = "[in_root] '..' at root",
+		  .path = "..",			.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = NULL,		.pass = true },
+		{ .name = "[in_root] '../root' at root",
+		  .path = "../root/",		.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "root",		.pass = true },
+		{ .name = "[in_root] relative symlink containing '..' above root",
+		  .path = "cheeky/self",	.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "root",		.pass = true },
+		{ .name = "[in_root] garbage link to /root",
+		  .path = "cheeky/garbageself",	.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "root",		.pass = true },
+		{ .name = "[in_root] chainged garbage links to /root",
+		  .path = "abscheeky/garbageself", .how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "root",		.pass = true },
+		{ .name = "[in_root] relative path to 'root'",
+		  .path = "root",		.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "root",		.pass = true },
+		{ .name = "[in_root] relative path to 'etc'",
+		  .path = "etc",		.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "etc",		.pass = true },
+		{ .name = "[in_root] relative path to 'etc/passwd'",
+		  .path = "etc/passwd",		.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "etc/passwd",	.pass = true },
+		{ .name = "[in_root] relative symlink to 'etc/passwd'",
+		  .path = "relsym",		.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "etc/passwd",	.pass = true },
+		{ .name = "[in_root] chained-'..' relative symlink to 'etc/passwd'",
+		  .path = "cheeky/passwd",	.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "etc/passwd",	.pass = true },
+		{ .name = "[in_root] chained-'..' absolute + relative symlink to 'etc/passwd'",
+		  .path = "abscheeky/passwd",	.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "etc/passwd",	.pass = true },
+		{ .name = "[in_root] absolute symlink to 'etc/passwd'",
+		  .path = "abssym",		.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "etc/passwd",	.pass = true },
+		{ .name = "[in_root] absolute path 'etc/passwd'",
+		  .path = "/etc/passwd",	.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "etc/passwd",	.pass = true },
+		{ .name = "[in_root] cheeky absolute path 'etc/passwd'",
+		  .path = "cheeky/abspasswd",	.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "etc/passwd",	.pass = true },
+		{ .name = "[in_root] chained cheeky absolute path 'etc/passwd'",
+		  .path = "abscheeky/abspasswd", .how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "etc/passwd",	.pass = true },
+		{ .name = "[in_root] tricky '..'-chained symlink outside $root",
+		  .path = "cheeky/dotdotlink",	.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "etc/passwd",	.pass = true },
+		{ .name = "[in_root] tricky absolute + '..'-chained symlink outside $root",
+		  .path = "abscheeky/dotdotlink", .how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "etc/passwd",	.pass = true },
+		{ .name = "[in_root] tricky absolute path + absolute + '..'-chained symlink outside $root",
+		  .path = "/../../../../abscheeky/dotdotlink", .how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "etc/passwd",	.pass = true },
+		{ .name = "[in_root] tricky garbage link outside $root",
+		  .path = "cheeky/garbagelink",	.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "etc/passwd",	.pass = true },
+		{ .name = "[in_root] tricky absolute + garbage link outside $root",
+		  .path = "abscheeky/garbagelink", .how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "etc/passwd",	.pass = true },
+		{ .name = "[in_root] tricky absolute path + absolute + garbage link outside $root",
+		  .path = "/../../../../abscheeky/garbagelink", .how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "etc/passwd",	.pass = true },
+		/* O_CREAT should handle trailing symlinks correctly. */
+		{ .name = "[in_root] O_CREAT of relative path inside $root",
+		  .path = "newfile1",		.how.flags = O_CREAT,
+						.how.mode = 0700,
+						.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "newfile1",	.pass = true },
+		{ .name = "[in_root] O_CREAT of absolute path",
+		  .path = "/newfile2",		.how.flags = O_CREAT,
+						.how.mode = 0700,
+						.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "newfile2",	.pass = true },
+		{ .name = "[in_root] O_CREAT of tricky symlink outside root",
+		  .path = "/creatlink",		.how.flags = O_CREAT,
+						.how.mode = 0700,
+						.how.resolve = RESOLVE_IN_ROOT,
+		  .out.path = "newfile3",	.pass = true },
+
+		/** RESOLVE_NO_XDEV **/
+		/* Crossing *down* into a mountpoint is disallowed. */
+		{ .name = "[no_xdev] cross into $mnt",
+		  .path = "mnt",		.how.resolve = RESOLVE_NO_XDEV,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[no_xdev] cross into $mnt/",
+		  .path = "mnt/",		.how.resolve = RESOLVE_NO_XDEV,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[no_xdev] cross into $mnt/.",
+		  .path = "mnt/.",		.how.resolve = RESOLVE_NO_XDEV,
+		  .out.err = -EXDEV,		.pass = false },
+		/* Crossing *up* out of a mountpoint is disallowed. */
+		{ .name = "[no_xdev] goto mountpoint root",
+		  .dir = "mnt", .path = ".",	.how.resolve = RESOLVE_NO_XDEV,
+		  .out.path = "mnt",		.pass = true },
+		{ .name = "[no_xdev] cross up through '..'",
+		  .dir = "mnt", .path = "..",	.how.resolve = RESOLVE_NO_XDEV,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[no_xdev] temporary cross up through '..'",
+		  .dir = "mnt", .path = "../mnt", .how.resolve = RESOLVE_NO_XDEV,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[no_xdev] temporary relative symlink cross up",
+		  .dir = "mnt", .path = "self",	.how.resolve = RESOLVE_NO_XDEV,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[no_xdev] temporary absolute symlink cross up",
+		  .dir = "mnt", .path = "absself", .how.resolve = RESOLVE_NO_XDEV,
+		  .out.err = -EXDEV,		.pass = false },
+		/* Jumping to "/" is ok, but later components cannot cross. */
+		{ .name = "[no_xdev] jump to / directly",
+		  .dir = "mnt", .path = "/",	.how.resolve = RESOLVE_NO_XDEV,
+		  .out.path = "/",		.pass = true },
+		{ .name = "[no_xdev] jump to / (from /) directly",
+		  .dir = "/", .path = "/",	.how.resolve = RESOLVE_NO_XDEV,
+		  .out.path = "/",		.pass = true },
+		{ .name = "[no_xdev] jump to / then proc",
+		  .path = "/proc/1",		.how.resolve = RESOLVE_NO_XDEV,
+		  .out.err = -EXDEV,		.pass = false },
+		{ .name = "[no_xdev] jump to / then tmp",
+		  .path = "/tmp",		.how.resolve = RESOLVE_NO_XDEV,
+		  .out.err = -EXDEV,		.pass = false },
+		/* Magic-links are blocked since they can switch vfsmounts. */
+		{ .name = "[no_xdev] cross through magic-link to self/root",
+		  .dir = "/proc", .path = "self/root", 	.how.resolve = RESOLVE_NO_XDEV,
+		  .out.err = -EXDEV,			.pass = false },
+		{ .name = "[no_xdev] cross through magic-link to self/cwd",
+		  .dir = "/proc", .path = "self/cwd",	.how.resolve = RESOLVE_NO_XDEV,
+		  .out.err = -EXDEV,			.pass = false },
+		/* Except magic-link jumps inside the same vfsmount. */
+		{ .name = "[no_xdev] jump through magic-link to same procfs",
+		  .dir = "/proc", .path = hardcoded_fdpath, .how.resolve = RESOLVE_NO_XDEV,
+		  .out.path = "/proc",			    .pass = true, },
+
+		/** RESOLVE_NO_MAGICLINKS **/
+		/* Regular symlinks should work. */
+		{ .name = "[no_magiclinks] ordinary relative symlink",
+		  .path = "relsym",		.how.resolve = RESOLVE_NO_MAGICLINKS,
+		  .out.path = "etc/passwd",	.pass = true },
+		/* Magic-links should not work. */
+		{ .name = "[no_magiclinks] symlink to magic-link",
+		  .path = "procexe",		.how.resolve = RESOLVE_NO_MAGICLINKS,
+		  .out.err = -ELOOP,		.pass = false },
+		{ .name = "[no_magiclinks] normal path to magic-link",
+		  .path = "/proc/self/exe",	.how.resolve = RESOLVE_NO_MAGICLINKS,
+		  .out.err = -ELOOP,		.pass = false },
+		{ .name = "[no_magiclinks] normal path to magic-link with O_NOFOLLOW",
+		  .path = "/proc/self/exe",	.how.flags = O_NOFOLLOW,
+						.how.resolve = RESOLVE_NO_MAGICLINKS,
+		  .out.path = procselfexe,	.pass = true },
+		{ .name = "[no_magiclinks] symlink to magic-link path component",
+		  .path = "procroot/etc",	.how.resolve = RESOLVE_NO_MAGICLINKS,
+		  .out.err = -ELOOP,		.pass = false },
+		{ .name = "[no_magiclinks] magic-link path component",
+		  .path = "/proc/self/root/etc", .how.resolve = RESOLVE_NO_MAGICLINKS,
+		  .out.err = -ELOOP,		.pass = false },
+		{ .name = "[no_magiclinks] magic-link path component with O_NOFOLLOW",
+		  .path = "/proc/self/root/etc", .how.flags = O_NOFOLLOW,
+						 .how.resolve = RESOLVE_NO_MAGICLINKS,
+		  .out.err = -ELOOP,		.pass = false },
+
+		/** RESOLVE_NO_SYMLINKS **/
+		/* Normal paths should work. */
+		{ .name = "[no_symlinks] ordinary path to '.'",
+		  .path = ".",			.how.resolve = RESOLVE_NO_SYMLINKS,
+		  .out.path = NULL,		.pass = true },
+		{ .name = "[no_symlinks] ordinary path to 'root'",
+		  .path = "root",		.how.resolve = RESOLVE_NO_SYMLINKS,
+		  .out.path = "root",		.pass = true },
+		{ .name = "[no_symlinks] ordinary path to 'etc'",
+		  .path = "etc",		.how.resolve = RESOLVE_NO_SYMLINKS,
+		  .out.path = "etc",		.pass = true },
+		{ .name = "[no_symlinks] ordinary path to 'etc/passwd'",
+		  .path = "etc/passwd",		.how.resolve = RESOLVE_NO_SYMLINKS,
+		  .out.path = "etc/passwd",	.pass = true },
+		/* Regular symlinks are blocked. */
+		{ .name = "[no_symlinks] relative symlink target",
+		  .path = "relsym",		.how.resolve = RESOLVE_NO_SYMLINKS,
+		  .out.err = -ELOOP,		.pass = false },
+		{ .name = "[no_symlinks] relative symlink component",
+		  .path = "reletc/passwd",	.how.resolve = RESOLVE_NO_SYMLINKS,
+		  .out.err = -ELOOP,		.pass = false },
+		{ .name = "[no_symlinks] absolute symlink target",
+		  .path = "abssym",		.how.resolve = RESOLVE_NO_SYMLINKS,
+		  .out.err = -ELOOP,		.pass = false },
+		{ .name = "[no_symlinks] absolute symlink component",
+		  .path = "absetc/passwd",	.how.resolve = RESOLVE_NO_SYMLINKS,
+		  .out.err = -ELOOP,		.pass = false },
+		{ .name = "[no_symlinks] cheeky garbage link",
+		  .path = "cheeky/garbagelink",	.how.resolve = RESOLVE_NO_SYMLINKS,
+		  .out.err = -ELOOP,		.pass = false },
+		{ .name = "[no_symlinks] cheeky absolute + garbage link",
+		  .path = "abscheeky/garbagelink", .how.resolve = RESOLVE_NO_SYMLINKS,
+		  .out.err = -ELOOP,		.pass = false },
+		{ .name = "[no_symlinks] cheeky absolute + absolute symlink",
+		  .path = "abscheeky/absself",	.how.resolve = RESOLVE_NO_SYMLINKS,
+		  .out.err = -ELOOP,		.pass = false },
+		/* Trailing symlinks with NO_FOLLOW. */
+		{ .name = "[no_symlinks] relative symlink with O_NOFOLLOW",
+		  .path = "relsym",		.how.flags = O_NOFOLLOW,
+						.how.resolve = RESOLVE_NO_SYMLINKS,
+		  .out.path = "relsym",		.pass = true },
+		{ .name = "[no_symlinks] absolute symlink with O_NOFOLLOW",
+		  .path = "abssym",		.how.flags = O_NOFOLLOW,
+						.how.resolve = RESOLVE_NO_SYMLINKS,
+		  .out.path = "abssym",		.pass = true },
+		{ .name = "[no_symlinks] trailing symlink with O_NOFOLLOW",
+		  .path = "cheeky/garbagelink",	.how.flags = O_NOFOLLOW,
+						.how.resolve = RESOLVE_NO_SYMLINKS,
+		  .out.path = "cheeky/garbagelink", .pass = true },
+		{ .name = "[no_symlinks] multiple symlink components with O_NOFOLLOW",
+		  .path = "abscheeky/absself",	.how.flags = O_NOFOLLOW,
+						.how.resolve = RESOLVE_NO_SYMLINKS,
+		  .out.err = -ELOOP,		.pass = false },
+		{ .name = "[no_symlinks] multiple symlink (and garbage link) components with O_NOFOLLOW",
+		  .path = "abscheeky/garbagelink", .how.flags = O_NOFOLLOW,
+						   .how.resolve = RESOLVE_NO_SYMLINKS,
+		  .out.err = -ELOOP,		.pass = false },
+	};
+
+	BUILD_BUG_ON(ARRAY_LEN(tests) != NUM_OPENAT2_OPATH_TESTS);
+
+	for (int i = 0; i < ARRAY_LEN(tests); i++) {
+		int dfd, fd;
+		char *fdpath = NULL;
+		bool failed;
+		void (*resultfn)(const char *msg, ...) = ksft_test_result_pass;
+		struct basic_test *test = &tests[i];
+
+		if (!openat2_supported) {
+			ksft_print_msg("openat2(2) unsupported\n");
+			resultfn = ksft_test_result_skip;
+			goto skip;
+		}
+
+		/* Auto-set O_PATH. */
+		if (!(test->how.flags & O_CREAT))
+			test->how.flags |= O_PATH;
+
+		if (test->dir)
+			dfd = openat(rootfd, test->dir, O_PATH | O_DIRECTORY);
+		else
+			dfd = dup(rootfd);
+		E_assert(dfd, "failed to openat root '%s': %m", test->dir);
+
+		E_dup2(dfd, hardcoded_fd);
+
+		fd = sys_openat2(dfd, test->path, &test->how);
+		if (test->pass)
+			failed = (fd < 0 || !fdequal(fd, rootfd, test->out.path));
+		else
+			failed = (fd != test->out.err);
+		if (fd >= 0) {
+			fdpath = fdreadlink(fd);
+			close(fd);
+		}
+		close(dfd);
+
+		if (failed) {
+			resultfn = ksft_test_result_fail;
+
+			ksft_print_msg("openat2 unexpectedly returned ");
+			if (fdpath)
+				ksft_print_msg("%d['%s']\n", fd, fdpath);
+			else
+				ksft_print_msg("%d (%s)\n", fd, strerror(-fd));
+		}
+
+skip:
+		if (test->pass)
+			resultfn("%s gives path '%s'\n", test->name,
+				 test->out.path ?: ".");
+		else
+			resultfn("%s fails with %d (%s)\n", test->name,
+				 test->out.err, strerror(-test->out.err));
+
+		fflush(stdout);
+		free(fdpath);
+	}
+
+	free(procselfexe);
+	close(rootfd);
+
+	free(hardcoded_fdpath);
+	close(hardcoded_fd);
+}
+
+#define NUM_TESTS NUM_OPENAT2_OPATH_TESTS
+
+int main(int argc, char **argv)
+{
+	ksft_print_header();
+	ksft_set_plan(NUM_TESTS);
+
+	/* NOTE: We should be checking for CAP_SYS_ADMIN here... */
+	if (geteuid() != 0)
+		ksft_exit_skip("all tests require euid == 0\n");
+
+	test_openat2_opath_tests();
+
+	if (ksft_get_fail_cnt() + ksft_get_error_cnt() > 0)
+		ksft_exit_fail();
+	else
+		ksft_exit_pass();
+}
-- 
2.23.0



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v15 9/9] Documentation: path-lookup: mention LOOKUP_MAGICLINK_JUMPED
  2019-11-05  9:05 [PATCH v15 0/9] open: introduce openat2(2) syscall Aleksa Sarai
                   ` (7 preceding siblings ...)
  2019-11-05  9:05 ` [PATCH v15 8/9] selftests: add openat2(2) selftests Aleksa Sarai
@ 2019-11-05  9:05 ` Aleksa Sarai
  2019-11-11 13:24 ` [PATCH v15 0/9] open: introduce openat2(2) syscall Aleksa Sarai
  9 siblings, 0 replies; 49+ messages in thread
From: Aleksa Sarai @ 2019-11-05  9:05 UTC (permalink / raw)
  To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
	Peter Zijlstra
  Cc: Aleksa Sarai, Eric Biederman, Andy Lutomirski, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	David Drysdale, Chanho Min, Oleg Nesterov, Rasmus Villemoes,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Christian Brauner,
	Aleksa Sarai, Linus Torvalds, containers, linux-alpha, linux-api,
	libc-alpha, linux-arch, linux-arm-kernel, linux-fsdevel,
	linux-ia64, linux-kernel, linux-kselftest, linux-m68k,
	linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
	linux-xtensa, sparclinux

Now that we have a special flag to signify magic-link jumps, mention it
within the path-lookup docs. And now that "magic link" is the correct
term for nd_jump_link()-style symlinks, clean up references to this type
of "symlink".

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 Documentation/filesystems/path-lookup.rst | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst
index 434a07b0002b..2c32795389bd 100644
--- a/Documentation/filesystems/path-lookup.rst
+++ b/Documentation/filesystems/path-lookup.rst
@@ -405,6 +405,10 @@ is requested.  Keeping a reference in the ``nameidata`` ensures that
 only one root is in effect for the entire path walk, even if it races
 with a ``chroot()`` system call.
 
+It should be noted that in the case of ``LOOKUP_IN_ROOT`` or
+``LOOKUP_BENEATH``, the effective root becomes the directory file descriptor
+passed to ``openat2()`` (which exposes these ``LOOKUP_`` flags).
+
 The root is needed when either of two conditions holds: (1) either the
 pathname or a symbolic link starts with a "'/'", or (2) a "``..``"
 component is being handled, since "``..``" from the root must always stay
@@ -1149,7 +1153,7 @@ so ``NULL`` is returned to indicate that the symlink can be released and
 the stack frame discarded.
 
 The other case involves things in ``/proc`` that look like symlinks but
-aren't really::
+aren't really (and are therefore commonly referred to as "magic-links")::
 
      $ ls -l /proc/self/fd/1
      lrwx------ 1 neilb neilb 64 Jun 13 10:19 /proc/self/fd/1 -> /dev/pts/4
@@ -1310,12 +1314,14 @@ longer needed.
 ``LOOKUP_JUMPED`` means that the current dentry was chosen not because
 it had the right name but for some other reason.  This happens when
 following "``..``", following a symlink to ``/``, crossing a mount point
-or accessing a "``/proc/$PID/fd/$FD``" symlink.  In this case the
-filesystem has not been asked to revalidate the name (with
-``d_revalidate()``).  In such cases the inode may still need to be
-revalidated, so ``d_op->d_weak_revalidate()`` is called if
+or accessing a "``/proc/$PID/fd/$FD``" symlink (also known as a "magic
+link"). In this case the filesystem has not been asked to revalidate the
+name (with ``d_revalidate()``).  In such cases the inode may still need
+to be revalidated, so ``d_op->d_weak_revalidate()`` is called if
 ``LOOKUP_JUMPED`` is set when the look completes - which may be at the
-final component or, when creating, unlinking, or renaming, at the penultimate component.
+final component or, when creating, unlinking, or renaming, at the
+penultimate component. ``LOOKUP_MAGICLINK_JUMPED`` is set alongside
+``LOOKUP_JUMPED`` if a magic-link was traversed.
 
 Final-component flags
 ~~~~~~~~~~~~~~~~~~~~~
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 0/9] open: introduce openat2(2) syscall
  2019-11-05  9:05 [PATCH v15 0/9] open: introduce openat2(2) syscall Aleksa Sarai
                   ` (8 preceding siblings ...)
  2019-11-05  9:05 ` [PATCH v15 9/9] Documentation: path-lookup: mention LOOKUP_MAGICLINK_JUMPED Aleksa Sarai
@ 2019-11-11 13:24 ` Aleksa Sarai
  2019-11-12 23:01   ` Kees Cook
  9 siblings, 1 reply; 49+ messages in thread
From: Aleksa Sarai @ 2019-11-11 13:24 UTC (permalink / raw)
  To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
	Peter Zijlstra
  Cc: Eric Biederman, Andy Lutomirski, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	David Drysdale, Chanho Min, Oleg Nesterov, Rasmus Villemoes,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Christian Brauner,
	Aleksa Sarai, Linus Torvalds, containers, linux-alpha, linux-api,
	libc-alpha, linux-arch, linux-arm-kernel, linux-fsdevel,
	linux-ia64, linux-kernel, linux-kselftest, linux-m68k,
	linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
	linux-xtensa, sparclinux

[-- Attachment #1: Type: text/plain, Size: 24141 bytes --]

On 2019-11-05, Aleksa Sarai <cyphar@cyphar.com> wrote:
> This patchset is being developed here:
>   <https://github.com/cyphar/linux/tree/openat2/master>
> 
> Patch changelog:
>  v15:
>   * Fix code style for LOOKUP_IN_ROOT handling in path_init(). [Linus Torvalds]
>   * Split out patches for each individual LOOKUP flag.
>   * Reword commit messages to give more background information about the
>     series, as well as mention the semantics of each flag in more detail.
>  v14: <https://lore.kernel.org/lkml/20191010054140.8483-1-cyphar@cyphar.com/>
>       <https://lore.kernel.org/lkml/20191026185700.10708-1-cyphar@cyphar.com>
>  v13: <https://lore.kernel.org/lkml/20190930183316.10190-1-cyphar@cyphar.com/>
>  v12: <https://lore.kernel.org/lkml/20190904201933.10736-1-cyphar@cyphar.com/>
>  v11: <https://lore.kernel.org/lkml/20190820033406.29796-1-cyphar@cyphar.com/>
>       <https://lore.kernel.org/lkml/20190728010207.9781-1-cyphar@cyphar.com/>
>  v10: <https://lore.kernel.org/lkml/20190719164225.27083-1-cyphar@cyphar.com/>
>  v09: <https://lore.kernel.org/lkml/20190706145737.5299-1-cyphar@cyphar.com/>
>  v08: <https://lore.kernel.org/lkml/20190520133305.11925-1-cyphar@cyphar.com/>
>  v07: <https://lore.kernel.org/lkml/20190507164317.13562-1-cyphar@cyphar.com/>
>  v06: <https://lore.kernel.org/lkml/20190506165439.9155-1-cyphar@cyphar.com/>
>  v05: <https://lore.kernel.org/lkml/20190320143717.2523-1-cyphar@cyphar.com/>
>  v04: <https://lore.kernel.org/lkml/20181112142654.341-1-cyphar@cyphar.com/>
>  v03: <https://lore.kernel.org/lkml/20181009070230.12884-1-cyphar@cyphar.com/>
>  v02: <https://lore.kernel.org/lkml/20181009065300.11053-1-cyphar@cyphar.com/>
>  v01: <https://lore.kernel.org/lkml/20180929103453.12025-1-cyphar@cyphar.com/>
> 
> For a very long time, extending openat(2) with new features has been
> incredibly frustrating. This stems from the fact that openat(2) is
> possibly the most famous counter-example to the mantra "don't silently
> accept garbage from userspace" -- it doesn't check whether unknown flags
> are present[1].
> 
> This means that (generally) the addition of new flags to openat(2) has
> been fraught with backwards-compatibility issues (O_TMPFILE has to be
> defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
> kernels gave errors, since it's insecure to silently ignore the
> flag[2]). All new security-related flags therefore have a tough road to
> being added to openat(2).
> 
> Furthermore, the need for some sort of control over VFS's path resolution (to
> avoid malicious paths resulting in inadvertent breakouts) has been a very
> long-standing desire of many userspace applications. This patchset is a revival
> of Al Viro's old AT_NO_JUMPS[3] patchset (which was a variant of David
> Drysdale's O_BENEATH patchset[4] which was a spin-off of the Capsicum
> project[5]) with a few additions and changes made based on the previous
> discussion within [6] as well as others I felt were useful.
> 
> In line with the conclusions of the original discussion of AT_NO_JUMPS, the
> flag has been split up into separate flags. However, instead of being an
> openat(2) flag it is provided through a new syscall openat2(2) which provides
> several other improvements to the openat(2) interface (see the patch
> description for more details). The following new LOOKUP_* flags are added:
> 
>   * LOOKUP_NO_XDEV blocks all mountpoint crossings (upwards, downwards,
>     or through absolute links). Absolute pathnames alone in openat(2) do not
>     trigger this. Magic-link traversal which implies a vfsmount jump is also
>     blocked (though magic-link jumps on the same vfsmount are permitted).
> 
>   * LOOKUP_NO_MAGICLINKS blocks resolution through /proc/$pid/fd-style
>     links. This is done by blocking the usage of nd_jump_link() during
>     resolution in a filesystem. The term "magic-links" is used to match
>     with the only reference to these links in Documentation/, but I'm
>     happy to change the name.
> 
>     It should be noted that this is different to the scope of
>     ~LOOKUP_FOLLOW in that it applies to all path components. However,
>     you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
>     will *not* fail (assuming that no parent component was a
>     magic-link), and you will have an fd for the magic-link.
> 
>     In order to correctly detect magic-links, the introduction of a new
>     LOOKUP_MAGICLINK_JUMPED state flag was required.
> 
>   * LOOKUP_BENEATH disallows escapes to outside the starting dirfd's
>     tree, using techniques such as ".." or absolute links. Absolute
>     paths in openat(2) are also disallowed. Conceptually this flag is to
>     ensure you "stay below" a certain point in the filesystem tree --
>     but this requires some additional to protect against various races
>     that would allow escape using "..".
> 
>     Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it
>     can trivially beam you around the filesystem (breaking the
>     protection). In future, there might be similar safety checks done as
>     in LOOKUP_IN_ROOT, but that requires more discussion.
> 
> In addition, two new flags are added that expand on the above ideas:
> 
>   * LOOKUP_NO_SYMLINKS does what it says on the tin. No symlink
>     resolution is allowed at all, including magic-links. Just as with
>     LOOKUP_NO_MAGICLINKS this can still be used with NOFOLLOW to open an
>     fd for the symlink as long as no parent path had a symlink
>     component.
> 
>   * LOOKUP_IN_ROOT is an extension of LOOKUP_BENEATH that, rather than
>     blocking attempts to move past the root, forces all such movements
>     to be scoped to the starting point. This provides chroot(2)-like
>     protection but without the cost of a chroot(2) for each filesystem
>     operation, as well as being safe against race attacks that chroot(2)
>     is not.
> 
>     If a race is detected (as with LOOKUP_BENEATH) then an error is
>     generated, and similar to LOOKUP_BENEATH it is not permitted to cross
>     magic-links with LOOKUP_IN_ROOT.
> 
>     The primary need for this is from container runtimes, which
>     currently need to do symlink scoping in userspace[7] when opening
>     paths in a potentially malicious container. There is a long list of
>     CVEs that could have bene mitigated by having RESOLVE_THIS_ROOT
>     (such as CVE-2017-1002101, CVE-2017-1002102, CVE-2018-15664, and
>     CVE-2019-5736, just to name a few).
> 
> In order to make all of the above more usable, I'm working on
> libpathrs[8] which is a C-friendly library for safe path resolution. It
> features a userspace-emulated backend if the kernel doesn't support
> openat2(2). Hopefully we can get userspace to switch to using it, and
> thus get openat2(2) support for free once it's ready.
> 
> Future work would include implementing things like RESOLVE_NO_AUTOMOUNT and
> possibly a RESOLVE_NO_REMOTE (to allow programs to be sure they don't hit DoSes
> though stale NFS handles).
> 
> [1]: https://lwn.net/Articles/588444/
> [2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
> [3]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk
> [4]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@google.com
> [5]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@google.com
> [6]: https://lwn.net/Articles/723057/
> [7]: https://github.com/cyphar/filepath-securejoin
> [8]: https://github.com/openSUSE/libpathrs
> 
> The current draft of the openat2(2) man-page is included below.
> 
> --8<---------------------------------------------------------------------------
> OPENAT2(2)                          Linux Programmer's Manual                          OPENAT2(2)
> 
> NAME
>        openat2 - open and possibly create a file (extended)
> 
> SYNOPSIS
>        #include <sys/types.h>
>        #include <sys/stat.h>
>        #include <fcntl.h>
> 
>        int openat2(int dirfd, const char *pathname, struct open_how *how, size_t size);
> 
>        Note: There is no glibc wrapper for this system call; see NOTES.
> 
> DESCRIPTION
>        The  openat2()  system  call  opens the file specified by pathname.  If the specified file
>        does not exist, it may optionally (if O_CREAT is specified in  how.flags)  be  created  by
>        openat2().
> 
>        As  with openat(2), if pathname is relative, then it is interpreted relative to the direc-
>        tory referred to by the file descriptor dirfd (or the current  working  directory  of  the
>        calling  process,  if dirfd is the special value AT_FDCWD.)  If pathname is absolute, then
>        dirfd is ignored (unless how.resolve contains RESOLVE_IN_ROOT, in which case  pathname  is
>        resolved relative to dirfd.)
> 
>        The  openat2()  system  call  is  an extension of openat(2) and provides a superset of its
>        functionality.  Rather than taking a single flag argument, an extensible  structure  (how)
>        is  passed  instead  to  allow  for  future extensions.  size must be set to sizeof(struct
>        open_how), to facilitate future extensions (see the "Extensibility" section of  the  NOTES
>        for more detail on how extensions are handled.)
> 
>    The open_how structure
>        The following structure indicates how pathname should be opened, and acts as a superset of
>        the flag and mode arguments to openat(2).
> 
>            struct open_how {
>                __aligned_u64 flags;         /* O_* flags. */
>                __u16         mode;          /* Mode for O_{CREAT,TMPFILE}. */
>                __u16         __padding[3];  /* Must be zeroed. */
>                __aligned_u64 resolve;       /* RESOLVE_* flags. */
>            };
> 
>        Any future extensions to openat2() will be implemented as new fields appended to the above
>        structure (or through reuse of pre-existing padding space), with the zero value of the new
>        fields acting as though the extension were not present.
> 
>        The meaning of each field is as follows:
> 
>               flags
>                      The file creation and status flags to use for this operation.   All  of  the
>                      O_* flags defined for openat(2) are valid openat2() flag values.
> 
>                      Unlike openat(2), it is an error to provide openat2() unknown or conflicting
>                      flags in flags.
> 
>               mode
>                      File mode for the new file, with identical semantics to the mode argument to
>                      openat(2).   However,  unlike openat(2), it is an error to provide openat2()
>                      with a mode which contains bits other than 0777.
> 
>                      It is an error to provide openat2() a non-zero mode if flags does  not  con-
>                      tain O_CREAT or O_TMPFILE.
> 
>               resolve
>                      Change  how  the  components  of pathname will be resolved (see path_resolu-
>                      tion(7) for background information.)  The primary use case for  these  flags
>                      is  to  allow trusted programs to restrict how untrusted paths (or paths in-
>                      side untrusted directories) are resolved.  The full list of resolve flags is
>                      given below.
> 
>                      RESOLVE_NO_XDEV
>                             Disallow  traversal of mount points during path resolution (including
>                             all bind mounts).
> 
>                             Users of this flag are encouraged to make its use  configurable  (un-
>                             less  it is used for a specific security purpose), as bind mounts are
>                             very widely used by end-users.  Setting this flag indiscrimnately for
>                             all  uses  of  openat2() may result in spurious errors on previously-
>                             functional systems.
> 
>                      RESOLVE_NO_SYMLINKS
>                             Disallow resolution of symbolic links during path  resolution.   This
>                             option implies RESOLVE_NO_MAGICLINKS.
> 
>                             If the trailing component is a symbolic link, and flags contains both
>                             O_PATH and O_NOFOLLOW, then an O_PATH file descriptor referencing the
>                             symbolic link will be returned.
> 
>                             Users  of  this flag are encouraged to make its use configurable (un-
>                             less it is used for a specific security purpose), as  symbolic  links
>                             are very widely used by end-users.  Setting this flag indiscrimnately
>                             for all uses of openat2() may result in  spurious  errors  on  previ-
>                             ously-functional systems.
> 
>                      RESOLVE_NO_MAGICLINKS
>                             Disallow all magic link resolution during path resolution.
> 
>                             If  the  trailing  component is a magic link, and flags contains both
>                             O_PATH and O_NOFOLLOW, then an O_PATH file descriptor referencing the
>                             magic link will be returned.
> 
>                             Magic-links  are  symbolic  link-like  objects  that are most notably
>                             found   in   proc(5)   (examples    include    /proc/[pid]/exe    and
>                             /proc/[pid]/fd/*.)   Due to the potential danger of unknowingly open-
>                             ing these magic links, it may be  preferable  for  users  to  disable
>                             their resolution entirely (see symboliclink(7) for more details.)
> 
>                      RESOLVE_BENEATH
>                             Do  not permit the path resolution to succeed if any component of the
>                             resolution is not a descendant of the directory indicated  by  dirfd.
>                             This results in absolute symbolic links (and absolute values of path-
>                             name) to be rejected.
> 
>                             Currently, this flag also disables magic link  resolution.   However,
>                             this  may change in the future.  The caller should explicitly specify
>                             RESOLVE_NO_MAGICLINKS to ensure that magic links are not resolved.
> 
>                      RESOLVE_IN_ROOT
>                             Treat dirfd as the root directory while resolving pathname (as though
>                             the user called chroot(2) with dirfd as the argument.)  Absolute sym-
>                             bolic links and ".." path components will be  scoped  to  dirfd.   If
>                             pathname is an absolute path, it is also treated relative to dirfd.
> 
>                             However,  unlike  chroot(2) (which changes the filesystem root perma-
>                             nently for a process), RESOLVE_IN_ROOT  allows  a  program  to  effi-
>                             ciently  restrict  path  resolution  for only certain operations.  It
>                             also has several hardening features (such detecting  escape  attempts
>                             during ..  resolution) which chroot(2) does not.
> 
>                             Currently,  this  flag also disables magic link resolution.  However,
>                             this may change in the future.  The caller should explicitly  specify
>                             RESOLVE_NO_MAGICLINKS to ensure that magic links are not resolved.
> 
>                      It is an error to provide openat2() unknown flags in resolve.
> 
> RETURN VALUE
>        On success, a new file descriptor is returned.  On error, -1 is returned, and errno is set
>        appropriately.
> 
> ERRORS
>        The set of errors returned by openat2() includes all of the errors returned by  openat(2),
>        as well as the following additional errors:
> 
>        EINVAL An unknown flag or invalid value was specified in how.
> 
>        EINVAL mode is non-zero, but flags does not contain O_CREAT or O_TMPFILE.
> 
>        EINVAL size was smaller than any known version of struct open_how.
> 
>        E2BIG  An  extension  was specified in how, which the current kernel does not support (see
>               the "Extensibility" section of the NOTES for more detail on how extensions are han-
>               dled.)
> 
>        EAGAIN resolve  contains  either  RESOLVE_IN_ROOT or RESOLVE_BENEATH, and the kernel could
>               not ensure that a ".." component didn't escape (due to a race condition  or  poten-
>               tial attack.)  Callers may choose to retry the openat2() call.
> 
>        EXDEV  resolve  contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and an escape from the
>               root during path resolution was detected.
> 
>        EXDEV  resolve contains RESOLVE_NO_XDEV, and a path component attempted to cross  a  mount
>               point.
> 
>        ELOOP  resolve contains RESOLVE_NO_SYMLINKS, and one of the path components was a symbolic
>               link (or magic link).
> 
>        ELOOP  resolve contains RESOLVE_NO_MAGICLINKS, and one of the path components was a  magic
>               link.
> 
> VERSIONS
>        openat2() was added to Linux in kernel 5.FOO.
> 
> CONFORMING TO
>        This system call is Linux-specific.
> 
>        The semantics of RESOLVE_BENEATH were modelled after FreeBSD's O_BENEATH.
> 
> NOTES
>        Glibc does not provide a wrapper for this system call; call it using systemcall(2).
> 
>    Extensibility
>        In order to allow for struct open_how to be extended in future kernel revisions, openat2()
>        requires userspace to specify the size of struct open_how structure they are passing.   By
>        providing  this  information,  it  is possible for openat2() to provide both forwards- and
>        backwards-compatibility — with size acting as an implicit version number (because new  ex-
>        tension  fields will always be appended, the size will always increase.)  This extensibil-
>        ity  design  is  very  similar  to   other   system   calls   such   as   perf_setattr(2),
>        perf_event_open(2), and clone(3).
> 
>        If  we let usize be the size of the structure according to userspace and ksize be the size
>        of the structure which the kernel supports, then there are only three cases to consider:
> 
>               *  If ksize equals usize, then there is no version mismatch and  how  can  be  used
>                  verbatim.
> 
>               *  If  ksize  is  larger than usize, then there are some extensions the kernel sup-
>                  ports which the userspace program is unaware of.  Because  all  extensions  must
>                  have their zero values be a no-op, the kernel treats all of the extension fields
>                  not set by userspace to have zero values.  This  provides  backwards-compatibil-
>                  ity.
> 
>               *  If  ksize  is  smaller  than  usize,  then  there  are some extensions which the
>                  userspace program is aware of but the kernel does not support.  Because all  ex-
>                  tensions  must  have  their zero values be a no-op, the kernel can safely ignore
>                  the unsupported extension fields if they are all-zero.  If any  unsupported  ex-
>                  tension  fields  are  non-zero,  then  -1 is returned and errno is set to E2BIG.
>                  This provides forwards-compatibility.
> 
>        Therefore, most userspace programs will not need to have any special  handling  of  exten-
>        sions.   However,  if  a userspace program wishes to determine what extensions the running
>        kernel supports, they may conduct a binary search on size (to find the largest value which
>        doesn't produce an error of E2BIG.)
> 
> SEE ALSO
>        openat(2), path_resolution(7), symlink(7)
> 
> Linux                                       2019-11-05                                 OPENAT2(2)
> --8<---------------------------------------------------------------------------
> 
> Aleksa Sarai (9):
>   namei: LOOKUP_NO_SYMLINKS: block symlink resolution
>   namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution
>   namei: LOOKUP_NO_XDEV: block mountpoint crossing
>   namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
>   namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
>   namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
>   open: introduce openat2(2) syscall
>   selftests: add openat2(2) selftests
>   Documentation: path-lookup: mention LOOKUP_MAGICLINK_JUMPED
> 
>  CREDITS                                       |   4 +-
>  Documentation/filesystems/path-lookup.rst     |  18 +-
>  arch/alpha/kernel/syscalls/syscall.tbl        |   1 +
>  arch/arm/tools/syscall.tbl                    |   1 +
>  arch/arm64/include/asm/unistd.h               |   2 +-
>  arch/arm64/include/asm/unistd32.h             |   2 +
>  arch/ia64/kernel/syscalls/syscall.tbl         |   1 +
>  arch/m68k/kernel/syscalls/syscall.tbl         |   1 +
>  arch/microblaze/kernel/syscalls/syscall.tbl   |   1 +
>  arch/mips/kernel/syscalls/syscall_n32.tbl     |   1 +
>  arch/mips/kernel/syscalls/syscall_n64.tbl     |   1 +
>  arch/mips/kernel/syscalls/syscall_o32.tbl     |   1 +
>  arch/parisc/kernel/syscalls/syscall.tbl       |   1 +
>  arch/powerpc/kernel/syscalls/syscall.tbl      |   1 +
>  arch/s390/kernel/syscalls/syscall.tbl         |   1 +
>  arch/sh/kernel/syscalls/syscall.tbl           |   1 +
>  arch/sparc/kernel/syscalls/syscall.tbl        |   1 +
>  arch/x86/entry/syscalls/syscall_32.tbl        |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
>  arch/xtensa/kernel/syscalls/syscall.tbl       |   1 +
>  fs/namei.c                                    | 176 +++++-
>  fs/open.c                                     | 149 +++--
>  include/linux/fcntl.h                         |  12 +-
>  include/linux/namei.h                         |  11 +
>  include/linux/syscalls.h                      |   3 +
>  include/uapi/asm-generic/unistd.h             |   5 +-
>  include/uapi/linux/fcntl.h                    |  41 ++
>  tools/testing/selftests/Makefile              |   1 +
>  tools/testing/selftests/openat2/.gitignore    |   1 +
>  tools/testing/selftests/openat2/Makefile      |   8 +
>  tools/testing/selftests/openat2/helpers.c     | 109 ++++
>  tools/testing/selftests/openat2/helpers.h     | 107 ++++
>  .../testing/selftests/openat2/openat2_test.c  | 316 +++++++++++
>  .../selftests/openat2/rename_attack_test.c    | 160 ++++++
>  .../testing/selftests/openat2/resolve_test.c  | 523 ++++++++++++++++++
>  35 files changed, 1591 insertions(+), 73 deletions(-)
>  create mode 100644 tools/testing/selftests/openat2/.gitignore
>  create mode 100644 tools/testing/selftests/openat2/Makefile
>  create mode 100644 tools/testing/selftests/openat2/helpers.c
>  create mode 100644 tools/testing/selftests/openat2/helpers.h
>  create mode 100644 tools/testing/selftests/openat2/openat2_test.c
>  create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
>  create mode 100644 tools/testing/selftests/openat2/resolve_test.c
> 
> 
> base-commit: a99d8080aaf358d5d23581244e5da23b35e340b9

Ping -- this patch hasn't been touched for a week. Thanks.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 0/9] open: introduce openat2(2) syscall
  2019-11-11 13:24 ` [PATCH v15 0/9] open: introduce openat2(2) syscall Aleksa Sarai
@ 2019-11-12 23:01   ` Kees Cook
  2019-11-12 23:01     ` Kees Cook
                       ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Kees Cook @ 2019-11-12 23:01 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
	Peter Zijlstra, Eric Biederman, Andy Lutomirski, Andrew Morton,
	Alexei Starovoitov, Jann Horn, Tycho Andersen, David Drysdale,
	Chanho Min, Oleg Nesterov, Rasmus Villemoes, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Christian Brauner, Aleksa Sarai,
	Linus Torvalds, containers, linux-alpha, linux-api, libc-alpha,
	linux-arch, linux-arm-kernel, linux-fsdevel, linux-ia64,
	linux-kernel, linux-kselftest, linux-m68k, linux-mips,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, linux-xtensa,
	sparclinux

On Tue, Nov 12, 2019 at 12:24:04AM +1100, Aleksa Sarai wrote:
> On 2019-11-05, Aleksa Sarai <cyphar@cyphar.com> wrote:
> > This patchset is being developed here:
> >   <https://github.com/cyphar/linux/tree/openat2/master>
> > 
> > Patch changelog:
> >  v15:
> >   * Fix code style for LOOKUP_IN_ROOT handling in path_init(). [Linus Torvalds]
> >   * Split out patches for each individual LOOKUP flag.
> >   * Reword commit messages to give more background information about the
> >     series, as well as mention the semantics of each flag in more detail.
> > [...]
> 
> Ping -- this patch hasn't been touched for a week. Thanks.

If I've been following correctly, everyone is happy with this series.
(i.e. Linus's comment appear to have been addressed.)

Perhaps the next question is should this go via a pull request by you to
Linus directly during the v5.5 merge window, via akpm, via akpm, via
Christian, or some other path? Besides Linus, it's not been clear who
should "claim" this series. :)

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 0/9] open: introduce openat2(2) syscall
  2019-11-12 23:01   ` Kees Cook
@ 2019-11-12 23:01     ` Kees Cook
  2019-11-12 23:06     ` Christian Brauner
  2019-11-13  0:46     ` Aleksa Sarai
  2 siblings, 0 replies; 49+ messages in thread
From: Kees Cook @ 2019-11-12 23:01 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
	Peter Zijlstra, Eric Biederman, Andy Lutomirski, Andrew Morton,
	Alexei Starovoitov, Jann Horn, Tycho Andersen, David Drysdale,
	Chanho Min, Oleg Nesterov, Rasmus Villemoes, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Christian Brauner, Aleksa Sarai,
	Linus Torvalds, containers, linux-alpha, linux-api, libc-alpha,
	linux-arch, linux-arm-kernel, linux-fsdevel, linux-ia64,
	linux-kernel, linux-kselftest, linux-m68k, linux-mips,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, linux-xtensa,
	sparclinux

On Tue, Nov 12, 2019 at 12:24:04AM +1100, Aleksa Sarai wrote:
> On 2019-11-05, Aleksa Sarai <cyphar@cyphar.com> wrote:
> > This patchset is being developed here:
> >   <https://github.com/cyphar/linux/tree/openat2/master>
> > 
> > Patch changelog:
> >  v15:
> >   * Fix code style for LOOKUP_IN_ROOT handling in path_init(). [Linus Torvalds]
> >   * Split out patches for each individual LOOKUP flag.
> >   * Reword commit messages to give more background information about the
> >     series, as well as mention the semantics of each flag in more detail.
> > [...]
> 
> Ping -- this patch hasn't been touched for a week. Thanks.

If I've been following correctly, everyone is happy with this series.
(i.e. Linus's comment appear to have been addressed.)

Perhaps the next question is should this go via a pull request by you to
Linus directly during the v5.5 merge window, via akpm, via akpm, via
Christian, or some other path? Besides Linus, it's not been clear who
should "claim" this series. :)

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 0/9] open: introduce openat2(2) syscall
  2019-11-12 23:01   ` Kees Cook
  2019-11-12 23:01     ` Kees Cook
@ 2019-11-12 23:06     ` Christian Brauner
  2019-11-12 23:06       ` Christian Brauner
  2019-11-13  0:46     ` Aleksa Sarai
  2 siblings, 1 reply; 49+ messages in thread
From: Christian Brauner @ 2019-11-12 23:06 UTC (permalink / raw)
  To: Kees Cook, Al Viro, Linus Torvalds
  Cc: Aleksa Sarai, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
	Peter Zijlstra, Eric Biederman, Andy Lutomirski, Andrew Morton,
	Alexei Starovoitov, Jann Horn, Tycho Andersen, David Drysdale,
	Chanho Min, Oleg Nesterov, Rasmus Villemoes, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Christian Brauner, Aleksa Sarai,
	containers, linux-alpha, linux-api, libc-alpha, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-ia64, linux-kernel,
	linux-kselftest, linux-m68k, linux-mips, linux-parisc,
	linuxppc-dev, linux-s390, linux-sh, linux-xtensa, sparclinux

On Tue, Nov 12, 2019 at 03:01:26PM -0800, Kees Cook wrote:
> On Tue, Nov 12, 2019 at 12:24:04AM +1100, Aleksa Sarai wrote:
> > On 2019-11-05, Aleksa Sarai <cyphar@cyphar.com> wrote:
> > > This patchset is being developed here:
> > >   <https://github.com/cyphar/linux/tree/openat2/master>
> > > 
> > > Patch changelog:
> > >  v15:
> > >   * Fix code style for LOOKUP_IN_ROOT handling in path_init(). [Linus Torvalds]
> > >   * Split out patches for each individual LOOKUP flag.
> > >   * Reword commit messages to give more background information about the
> > >     series, as well as mention the semantics of each flag in more detail.
> > > [...]
> > 
> > Ping -- this patch hasn't been touched for a week. Thanks.
> 
> If I've been following correctly, everyone is happy with this series.
> (i.e. Linus's comment appear to have been addressed.)
> 
> Perhaps the next question is should this go via a pull request by you to
> Linus directly during the v5.5 merge window, via akpm, via akpm, via
> Christian, or some other path? Besides Linus, it's not been clear who
> should "claim" this series. :)

I like this series and the same with the copy_struct_from_user() part of
it I've taken I'm happy to stuff this into a dedicated branch, merge it
into my for-next and send it for v5.5.
Though I'd _much_ rather see Al pick this up or have him give his
blessing first.

Christian

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 0/9] open: introduce openat2(2) syscall
  2019-11-12 23:06     ` Christian Brauner
@ 2019-11-12 23:06       ` Christian Brauner
  0 siblings, 0 replies; 49+ messages in thread
From: Christian Brauner @ 2019-11-12 23:06 UTC (permalink / raw)
  To: Kees Cook, Al Viro, Linus Torvalds
  Cc: Aleksa Sarai, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
	Peter Zijlstra, Eric Biederman, Andy Lutomirski, Andrew Morton,
	Alexei Starovoitov, Jann Horn, Tycho Andersen, David Drysdale,
	Chanho Min, Oleg Nesterov, Rasmus Villemoes, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Christian Brauner, Aleksa Sarai,
	containers, linux-alpha, linux-api, libc-alpha, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-ia64, linux-kernel,
	linux-kselftest, linux-m68k, linux-mips, linux-parisc,
	linuxppc-dev, linux-s390, linux-sh, linux-xtensa, sparclinux

On Tue, Nov 12, 2019 at 03:01:26PM -0800, Kees Cook wrote:
> On Tue, Nov 12, 2019 at 12:24:04AM +1100, Aleksa Sarai wrote:
> > On 2019-11-05, Aleksa Sarai <cyphar@cyphar.com> wrote:
> > > This patchset is being developed here:
> > >   <https://github.com/cyphar/linux/tree/openat2/master>
> > > 
> > > Patch changelog:
> > >  v15:
> > >   * Fix code style for LOOKUP_IN_ROOT handling in path_init(). [Linus Torvalds]
> > >   * Split out patches for each individual LOOKUP flag.
> > >   * Reword commit messages to give more background information about the
> > >     series, as well as mention the semantics of each flag in more detail.
> > > [...]
> > 
> > Ping -- this patch hasn't been touched for a week. Thanks.
> 
> If I've been following correctly, everyone is happy with this series.
> (i.e. Linus's comment appear to have been addressed.)
> 
> Perhaps the next question is should this go via a pull request by you to
> Linus directly during the v5.5 merge window, via akpm, via akpm, via
> Christian, or some other path? Besides Linus, it's not been clear who
> should "claim" this series. :)

I like this series and the same with the copy_struct_from_user() part of
it I've taken I'm happy to stuff this into a dedicated branch, merge it
into my for-next and send it for v5.5.
Though I'd _much_ rather see Al pick this up or have him give his
blessing first.

Christian

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 0/9] open: introduce openat2(2) syscall
  2019-11-12 23:01   ` Kees Cook
  2019-11-12 23:01     ` Kees Cook
  2019-11-12 23:06     ` Christian Brauner
@ 2019-11-13  0:46     ` Aleksa Sarai
  2019-11-13  0:46       ` Aleksa Sarai
  2 siblings, 1 reply; 49+ messages in thread
From: Aleksa Sarai @ 2019-11-13  0:46 UTC (permalink / raw)
  To: Kees Cook
  Cc: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
	Peter Zijlstra, Eric Biederman, Andy Lutomirski, Andrew Morton,
	Alexei Starovoitov, Jann Horn, Tycho Andersen, David Drysdale,
	Chanho Min, Oleg Nesterov, Rasmus Villemoes, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Christian Brauner, Aleksa Sarai,
	Linus Torvalds, containers, linux-alpha, linux-api, libc-alpha,
	linux-arch, linux-arm-kernel, linux-fsdevel, linux-ia64,
	linux-kernel, linux-kselftest, linux-m68k, linux-mips,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, linux-xtensa,
	sparclinux

[-- Attachment #1: Type: text/plain, Size: 1757 bytes --]

On 2019-11-12, Kees Cook <keescook@chromium.org> wrote:
> On Tue, Nov 12, 2019 at 12:24:04AM +1100, Aleksa Sarai wrote:
> > On 2019-11-05, Aleksa Sarai <cyphar@cyphar.com> wrote:
> > > This patchset is being developed here:
> > >   <https://github.com/cyphar/linux/tree/openat2/master>
> > > 
> > > Patch changelog:
> > >  v15:
> > >   * Fix code style for LOOKUP_IN_ROOT handling in path_init(). [Linus Torvalds]
> > >   * Split out patches for each individual LOOKUP flag.
> > >   * Reword commit messages to give more background information about the
> > >     series, as well as mention the semantics of each flag in more detail.
> > > [...]
> > 
> > Ping -- this patch hasn't been touched for a week. Thanks.
> 
> If I've been following correctly, everyone is happy with this series.
> (i.e. Linus's comment appear to have been addressed.)
> 
> Perhaps the next question is should this go via a pull request by you to
> Linus directly during the v5.5 merge window, via akpm, via akpm, via
> Christian, or some other path? Besides Linus, it's not been clear who
> should "claim" this series. :)

Given the namei changes, I wanted to avoid stepping on Al's toes. Though
he did review the series a few versions ago, the discussion didn't focus
on the openat2(2) semantics (which have also changed since then). I'm
not sure whether to interpret the silence to mean he's satisfied with
things as they are, or if he hasn't had more time to look at the series.

As for which tree it should be routed to, I don't mind -- Christian is
the most straight-forward choice (but if Al wants to route it, that's
fine with me too).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 0/9] open: introduce openat2(2) syscall
  2019-11-13  0:46     ` Aleksa Sarai
@ 2019-11-13  0:46       ` Aleksa Sarai
  0 siblings, 0 replies; 49+ messages in thread
From: Aleksa Sarai @ 2019-11-13  0:46 UTC (permalink / raw)
  To: Kees Cook
  Cc: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
	Peter Zijlstra, Eric Biederman, Andy Lutomirski, Andrew Morton,
	Alexei Starovoitov, Jann Horn, Tycho Andersen, David Drysdale,
	Chanho Min, Oleg Nesterov, Rasmus Villemoes, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Christian Brauner, Aleksa Sarai,
	Linus Torvalds, containers, linux-alpha, linux-api, libc-alpha,
	linux-arch, linux-arm-kernel, linux-fsdevel, linux-ia64,
	linux-kernel, linux-kselftest, linux-m68k, linux-mips,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, linux-xtensa,
	sparclinux

[-- Attachment #1: Type: text/plain, Size: 1757 bytes --]

On 2019-11-12, Kees Cook <keescook@chromium.org> wrote:
> On Tue, Nov 12, 2019 at 12:24:04AM +1100, Aleksa Sarai wrote:
> > On 2019-11-05, Aleksa Sarai <cyphar@cyphar.com> wrote:
> > > This patchset is being developed here:
> > >   <https://github.com/cyphar/linux/tree/openat2/master>
> > > 
> > > Patch changelog:
> > >  v15:
> > >   * Fix code style for LOOKUP_IN_ROOT handling in path_init(). [Linus Torvalds]
> > >   * Split out patches for each individual LOOKUP flag.
> > >   * Reword commit messages to give more background information about the
> > >     series, as well as mention the semantics of each flag in more detail.
> > > [...]
> > 
> > Ping -- this patch hasn't been touched for a week. Thanks.
> 
> If I've been following correctly, everyone is happy with this series.
> (i.e. Linus's comment appear to have been addressed.)
> 
> Perhaps the next question is should this go via a pull request by you to
> Linus directly during the v5.5 merge window, via akpm, via akpm, via
> Christian, or some other path? Besides Linus, it's not been clear who
> should "claim" this series. :)

Given the namei changes, I wanted to avoid stepping on Al's toes. Though
he did review the series a few versions ago, the discussion didn't focus
on the openat2(2) semantics (which have also changed since then). I'm
not sure whether to interpret the silence to mean he's satisfied with
things as they are, or if he hasn't had more time to look at the series.

As for which tree it should be routed to, I don't mind -- Christian is
the most straight-forward choice (but if Al wants to route it, that's
fine with me too).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 2/9] namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution
  2019-11-05  9:05 ` [PATCH v15 2/9] namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution Aleksa Sarai
@ 2019-11-13  1:24   ` Al Viro
  2019-11-13  1:24     ` Al Viro
  0 siblings, 1 reply; 49+ messages in thread
From: Al Viro @ 2019-11-13  1:24 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Ingo Molnar, Peter Zijlstra,
	Christian Brauner, David Drysdale, Andy Lutomirski,
	Linus Torvalds, Eric Biederman, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	Chanho Min, Oleg Nesterov, Rasmus Villemoes, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Christian Brauner, Aleksa Sarai,
	containers, linux-alpha, linux-api, libc-alpha, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-ia64, linux-kernel,
	linux-kselftest, linux-m68k, linux-mips, linux-parisc,
	linuxppc-dev, linux-s390, linux-sh, linux-xtensa, sparclinux

On Tue, Nov 05, 2019 at 08:05:46PM +1100, Aleksa Sarai wrote:
> @@ -1078,6 +1079,10 @@ const char *get_link(struct nameidata *nd)
>  		} else {
>  			res = get(dentry, inode, &last->done);
>  		}
> +		if (nd->flags & LOOKUP_MAGICLINK_JUMPED) {
> +			if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
> +				return ERR_PTR(-ELOOP);
> +		}

Minor nit - the first check probably wants unlikely() more than the
second one; it's probably noise anyway, but most of the symlinks
traversed are not going to be procfs ones, so you get test + branch
taken most of the time.

OTOH, that just might compile into
	fetch nd->flags
	and with LOOKUP_MAGICLINK_JUMPED | LOOKUP_NO_MAGICLINKS
	compare with the same constant
	unlikely branch when equal

Anyway, that's no more than a minor nit and can be dealt with later (if
at all)

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 2/9] namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution
  2019-11-13  1:24   ` Al Viro
@ 2019-11-13  1:24     ` Al Viro
  0 siblings, 0 replies; 49+ messages in thread
From: Al Viro @ 2019-11-13  1:24 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Ingo Molnar, Peter Zijlstra,
	Christian Brauner, David Drysdale, Andy Lutomirski,
	Linus Torvalds, Eric Biederman, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	Chanho Min, Oleg Nesterov, Rasmus Villemoes, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Christian Brauner, Aleksa Sarai,
	containers, linux-alpha, linux-api, libc-alpha, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-ia64, linux-kernel,
	linux-kselftest, linux-m68k, linux-mips, linux-parisc,
	linuxppc-dev, linux-s390, linux-sh, linux-xtensa, sparclinux

On Tue, Nov 05, 2019 at 08:05:46PM +1100, Aleksa Sarai wrote:
> @@ -1078,6 +1079,10 @@ const char *get_link(struct nameidata *nd)
>  		} else {
>  			res = get(dentry, inode, &last->done);
>  		}
> +		if (nd->flags & LOOKUP_MAGICLINK_JUMPED) {
> +			if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
> +				return ERR_PTR(-ELOOP);
> +		}

Minor nit - the first check probably wants unlikely() more than the
second one; it's probably noise anyway, but most of the symlinks
traversed are not going to be procfs ones, so you get test + branch
taken most of the time.

OTOH, that just might compile into
	fetch nd->flags
	and with LOOKUP_MAGICLINK_JUMPED | LOOKUP_NO_MAGICLINKS
	compare with the same constant
	unlikely branch when equal

Anyway, that's no more than a minor nit and can be dealt with later (if
at all)

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 3/9] namei: LOOKUP_NO_XDEV: block mountpoint crossing
  2019-11-05  9:05 ` [PATCH v15 3/9] namei: LOOKUP_NO_XDEV: block mountpoint crossing Aleksa Sarai
@ 2019-11-13  1:36   ` Al Viro
  2019-11-13  1:36     ` Al Viro
  2019-11-14  4:49     ` Aleksa Sarai
  0 siblings, 2 replies; 49+ messages in thread
From: Al Viro @ 2019-11-13  1:36 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Ingo Molnar, Peter Zijlstra,
	Christian Brauner, David Drysdale, Andy Lutomirski,
	Linus Torvalds, Eric Biederman, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	Chanho Min, Oleg Nesterov, Rasmus Villemoes, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Christian Brauner, Aleksa Sarai,
	containers, linux-alpha, linux-api, libc-alpha, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-ia64, linux-kernel,
	linux-kselftest, linux-m68k, linux-mips, linux-parisc,
	linuxppc-dev, linux-s390, linux-sh, linux-xtensa, sparclinux

On Tue, Nov 05, 2019 at 08:05:47PM +1100, Aleksa Sarai wrote:

> @@ -862,6 +870,8 @@ static int nd_jump_root(struct nameidata *nd)
>  void nd_jump_link(struct path *path)
>  {
>  	struct nameidata *nd = current->nameidata;
> +
> +	nd->last_magiclink.same_mnt = (nd->path.mnt == path->mnt);
>  	path_put(&nd->path);
>  
>  	nd->path = *path;
> @@ -1082,6 +1092,10 @@ const char *get_link(struct nameidata *nd)
>  		if (nd->flags & LOOKUP_MAGICLINK_JUMPED) {
>  			if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
>  				return ERR_PTR(-ELOOP);
> +			if (unlikely(nd->flags & LOOKUP_NO_XDEV)) {
> +				if (!nd->last_magiclink.same_mnt)
> +					return ERR_PTR(-EXDEV);
> +			}
>  		}

Ugh...  Wouldn't it be better to take that logics (some equivalent thereof)
into nd_jump_link()?  Or just have nd_jump_link() return an error...

I mean, look at the callers of nd_jump_link().
static const char *policy_get_link(struct dentry *dentry,
                                   struct inode *inode,
                                   struct delayed_call *done)
{
        struct aa_ns *ns;
        struct path path;  

        if (!dentry)   
                return ERR_PTR(-ECHILD);
        ns = aa_get_current_ns();
        path.mnt = mntget(aafs_mnt);
        path.dentry = dget(ns_dir(ns));
        nd_jump_link(&path); 
        aa_put_ns(ns);

        return NULL;
}
- very close to the end of ->get_link() instance.

static const char *proc_pid_get_link(struct dentry *dentry,
                                     struct inode *inode,
                                     struct delayed_call *done)
{ 
        struct path path;
        int error = -EACCES;

        if (!dentry)
                return ERR_PTR(-ECHILD);

        /* Are we allowed to snoop on the tasks file descriptors? */
        if (!proc_fd_access_allowed(inode))
                goto out;

        error = PROC_I(inode)->op.proc_get_link(dentry, &path);
        if (error)
                goto out;

        nd_jump_link(&path);
        return NULL;
out:   
        return ERR_PTR(error);
}
Ditto.

static const char *proc_ns_get_link(struct dentry *dentry,
                                    struct inode *inode,
                                    struct delayed_call *done)
{
        const struct proc_ns_operations *ns_ops = PROC_I(inode)->ns_ops;
        struct task_struct *task;
        struct path ns_path;
        void *error = ERR_PTR(-EACCES);

        if (!dentry)
                return ERR_PTR(-ECHILD);

        task = get_proc_task(inode);
        if (!task)
                return error;

        if (ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS)) {
                error = ns_get_path(&ns_path, task, ns_ops);
                if (!error)
                        nd_jump_link(&ns_path);
        }
        put_task_struct(task);
        return error;
}

The same.  And that's it - there's no more of them.  So how about
this in the beginning of the series, then having your magiclink
error handling done in nd_jump_link()?

diff --git a/fs/namei.c b/fs/namei.c
index 671c3c1a3425..8ec924813c30 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -859,7 +859,7 @@ static int nd_jump_root(struct nameidata *nd)
  * Helper to directly jump to a known parsed path from ->get_link,
  * caller must have taken a reference to path beforehand.
  */
-void nd_jump_link(struct path *path)
+const char *nd_jump_link(struct path *path)
 {
 	struct nameidata *nd = current->nameidata;
 	path_put(&nd->path);
@@ -867,6 +867,7 @@ void nd_jump_link(struct path *path)
 	nd->path = *path;
 	nd->inode = nd->path.dentry->d_inode;
 	nd->flags |= LOOKUP_JUMPED;
+	return NULL;
 }
 
 static inline void put_link(struct nameidata *nd)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index ebea9501afb8..ac4e57a3dfa5 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1626,8 +1626,7 @@ static const char *proc_pid_get_link(struct dentry *dentry,
 	if (error)
 		goto out;
 
-	nd_jump_link(&path);
-	return NULL;
+	return nd_jump_link(&path);
 out:
 	return ERR_PTR(error);
 }
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index dd2b35f78b09..dde0c501b2f3 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -54,7 +54,7 @@ static const char *proc_ns_get_link(struct dentry *dentry,
 	if (ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS)) {
 		error = ns_get_path(&ns_path, task, ns_ops);
 		if (!error)
-			nd_jump_link(&ns_path);
+			error = nd_jump_link(&ns_path);
 	}
 	put_task_struct(task);
 	return error;
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 397a08ade6a2..f3e8438e5631 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -68,7 +68,7 @@ extern int follow_up(struct path *);
 extern struct dentry *lock_rename(struct dentry *, struct dentry *);
 extern void unlock_rename(struct dentry *, struct dentry *);
 
-extern void nd_jump_link(struct path *path);
+extern const char *nd_jump_link(struct path *path);
 
 static inline void nd_terminate_link(void *name, size_t len, size_t maxlen)
 {
diff --git a/security/apparmor/apparmorfs.c b/security/apparmor/apparmorfs.c
index 45d13b6462aa..98aef94b4777 100644
--- a/security/apparmor/apparmorfs.c
+++ b/security/apparmor/apparmorfs.c
@@ -2453,18 +2453,16 @@ static const char *policy_get_link(struct dentry *dentry,
 				   struct inode *inode,
 				   struct delayed_call *done)
 {
-	struct aa_ns *ns;
-	struct path path;
-
-	if (!dentry)
-		return ERR_PTR(-ECHILD);
-	ns = aa_get_current_ns();
-	path.mnt = mntget(aafs_mnt);
-	path.dentry = dget(ns_dir(ns));
-	nd_jump_link(&path);
-	aa_put_ns(ns);
-
-	return NULL;
+	const char *err = ERR_PTR(-ECHILD);
+
+	if (dentry) {
+		struct aa_ns *ns = aa_get_current_ns();
+		struct path path = {.mnt = mntget(aafs_mnt),
+				    .dentry = ns_dir(ns)};
+		err = nd_jump_link(&path);
+		aa_put_ns(ns);
+	}
+	return err;
 }
 
 static int policy_readlink(struct dentry *dentry, char __user *buffer,

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 3/9] namei: LOOKUP_NO_XDEV: block mountpoint crossing
  2019-11-13  1:36   ` Al Viro
@ 2019-11-13  1:36     ` Al Viro
  2019-11-14  4:49     ` Aleksa Sarai
  1 sibling, 0 replies; 49+ messages in thread
From: Al Viro @ 2019-11-13  1:36 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Ingo Molnar, Peter Zijlstra,
	Christian Brauner, David Drysdale, Andy Lutomirski,
	Linus Torvalds, Eric Biederman, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	Chanho Min, Oleg Nesterov, Rasmus Villemoes, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Christian Brauner, Aleksa Sarai,
	containers, linux-alpha, linux-api, libc-alpha, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-ia64, linux-kernel,
	linux-kselftest, linux-m68k, linux-mips, linux-parisc,
	linuxppc-dev, linux-s390, linux-sh, linux-xtensa, sparclinux

On Tue, Nov 05, 2019 at 08:05:47PM +1100, Aleksa Sarai wrote:

> @@ -862,6 +870,8 @@ static int nd_jump_root(struct nameidata *nd)
>  void nd_jump_link(struct path *path)
>  {
>  	struct nameidata *nd = current->nameidata;
> +
> +	nd->last_magiclink.same_mnt = (nd->path.mnt == path->mnt);
>  	path_put(&nd->path);
>  
>  	nd->path = *path;
> @@ -1082,6 +1092,10 @@ const char *get_link(struct nameidata *nd)
>  		if (nd->flags & LOOKUP_MAGICLINK_JUMPED) {
>  			if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
>  				return ERR_PTR(-ELOOP);
> +			if (unlikely(nd->flags & LOOKUP_NO_XDEV)) {
> +				if (!nd->last_magiclink.same_mnt)
> +					return ERR_PTR(-EXDEV);
> +			}
>  		}

Ugh...  Wouldn't it be better to take that logics (some equivalent thereof)
into nd_jump_link()?  Or just have nd_jump_link() return an error...

I mean, look at the callers of nd_jump_link().
static const char *policy_get_link(struct dentry *dentry,
                                   struct inode *inode,
                                   struct delayed_call *done)
{
        struct aa_ns *ns;
        struct path path;  

        if (!dentry)   
                return ERR_PTR(-ECHILD);
        ns = aa_get_current_ns();
        path.mnt = mntget(aafs_mnt);
        path.dentry = dget(ns_dir(ns));
        nd_jump_link(&path); 
        aa_put_ns(ns);

        return NULL;
}
- very close to the end of ->get_link() instance.

static const char *proc_pid_get_link(struct dentry *dentry,
                                     struct inode *inode,
                                     struct delayed_call *done)
{ 
        struct path path;
        int error = -EACCES;

        if (!dentry)
                return ERR_PTR(-ECHILD);

        /* Are we allowed to snoop on the tasks file descriptors? */
        if (!proc_fd_access_allowed(inode))
                goto out;

        error = PROC_I(inode)->op.proc_get_link(dentry, &path);
        if (error)
                goto out;

        nd_jump_link(&path);
        return NULL;
out:   
        return ERR_PTR(error);
}
Ditto.

static const char *proc_ns_get_link(struct dentry *dentry,
                                    struct inode *inode,
                                    struct delayed_call *done)
{
        const struct proc_ns_operations *ns_ops = PROC_I(inode)->ns_ops;
        struct task_struct *task;
        struct path ns_path;
        void *error = ERR_PTR(-EACCES);

        if (!dentry)
                return ERR_PTR(-ECHILD);

        task = get_proc_task(inode);
        if (!task)
                return error;

        if (ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS)) {
                error = ns_get_path(&ns_path, task, ns_ops);
                if (!error)
                        nd_jump_link(&ns_path);
        }
        put_task_struct(task);
        return error;
}

The same.  And that's it - there's no more of them.  So how about
this in the beginning of the series, then having your magiclink
error handling done in nd_jump_link()?

diff --git a/fs/namei.c b/fs/namei.c
index 671c3c1a3425..8ec924813c30 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -859,7 +859,7 @@ static int nd_jump_root(struct nameidata *nd)
  * Helper to directly jump to a known parsed path from ->get_link,
  * caller must have taken a reference to path beforehand.
  */
-void nd_jump_link(struct path *path)
+const char *nd_jump_link(struct path *path)
 {
 	struct nameidata *nd = current->nameidata;
 	path_put(&nd->path);
@@ -867,6 +867,7 @@ void nd_jump_link(struct path *path)
 	nd->path = *path;
 	nd->inode = nd->path.dentry->d_inode;
 	nd->flags |= LOOKUP_JUMPED;
+	return NULL;
 }
 
 static inline void put_link(struct nameidata *nd)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index ebea9501afb8..ac4e57a3dfa5 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1626,8 +1626,7 @@ static const char *proc_pid_get_link(struct dentry *dentry,
 	if (error)
 		goto out;
 
-	nd_jump_link(&path);
-	return NULL;
+	return nd_jump_link(&path);
 out:
 	return ERR_PTR(error);
 }
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index dd2b35f78b09..dde0c501b2f3 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -54,7 +54,7 @@ static const char *proc_ns_get_link(struct dentry *dentry,
 	if (ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS)) {
 		error = ns_get_path(&ns_path, task, ns_ops);
 		if (!error)
-			nd_jump_link(&ns_path);
+			error = nd_jump_link(&ns_path);
 	}
 	put_task_struct(task);
 	return error;
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 397a08ade6a2..f3e8438e5631 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -68,7 +68,7 @@ extern int follow_up(struct path *);
 extern struct dentry *lock_rename(struct dentry *, struct dentry *);
 extern void unlock_rename(struct dentry *, struct dentry *);
 
-extern void nd_jump_link(struct path *path);
+extern const char *nd_jump_link(struct path *path);
 
 static inline void nd_terminate_link(void *name, size_t len, size_t maxlen)
 {
diff --git a/security/apparmor/apparmorfs.c b/security/apparmor/apparmorfs.c
index 45d13b6462aa..98aef94b4777 100644
--- a/security/apparmor/apparmorfs.c
+++ b/security/apparmor/apparmorfs.c
@@ -2453,18 +2453,16 @@ static const char *policy_get_link(struct dentry *dentry,
 				   struct inode *inode,
 				   struct delayed_call *done)
 {
-	struct aa_ns *ns;
-	struct path path;
-
-	if (!dentry)
-		return ERR_PTR(-ECHILD);
-	ns = aa_get_current_ns();
-	path.mnt = mntget(aafs_mnt);
-	path.dentry = dget(ns_dir(ns));
-	nd_jump_link(&path);
-	aa_put_ns(ns);
-
-	return NULL;
+	const char *err = ERR_PTR(-ECHILD);
+
+	if (dentry) {
+		struct aa_ns *ns = aa_get_current_ns();
+		struct path path = {.mnt = mntget(aafs_mnt),
+				    .dentry = ns_dir(ns)};
+		err = nd_jump_link(&path);
+		aa_put_ns(ns);
+	}
+	return err;
 }
 
 static int policy_readlink(struct dentry *dentry, char __user *buffer,

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 4/9] namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
  2019-11-05  9:05 ` [PATCH v15 4/9] namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution Aleksa Sarai
@ 2019-11-13  1:55   ` Al Viro
  2019-11-13  1:55     ` Al Viro
  2019-11-13  7:47     ` Aleksa Sarai
  0 siblings, 2 replies; 49+ messages in thread
From: Al Viro @ 2019-11-13  1:55 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Ingo Molnar, Peter Zijlstra,
	Christian Brauner, David Drysdale, Andy Lutomirski,
	Linus Torvalds, Eric Biederman, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	Chanho Min, Oleg Nesterov, Rasmus Villemoes, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Christian Brauner, Aleksa Sarai,
	containers, linux-alpha, linux-api, libc-alpha, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-ia64, linux-kernel,
	linux-kselftest, linux-m68k, linux-mips, linux-parisc,
	linuxppc-dev, linux-s390, linux-sh, linux-xtensa, sparclinux

On Tue, Nov 05, 2019 at 08:05:48PM +1100, Aleksa Sarai wrote:

Minor nit here - I'd split "move the conditional call of set_root()
into nd_jump_root()" into a separate patch before that one.  Makes
for fewer distractions in this one.  I'd probably fold "and be
ready for errors other than -ECHILD" into the same preliminary
patch.

> +			/* Not currently safe for scoped-lookups. */
> +			if (unlikely(nd->flags & LOOKUP_IS_SCOPED))
> +				return ERR_PTR(-EXDEV);

Also a candidate for doing in nd_jump_link()...

> @@ -1373,8 +1403,11 @@ static int follow_dotdot_rcu(struct nameidata *nd)
>  	struct inode *inode = nd->inode;
>  
>  	while (1) {
> -		if (path_equal(&nd->path, &nd->root))
> +		if (path_equal(&nd->path, &nd->root)) {
> +			if (unlikely(nd->flags & LOOKUP_BENEATH))
> +				return -EXDEV;

Umm...  Are you sure it's not -ECHILD?

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 4/9] namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
  2019-11-13  1:55   ` Al Viro
@ 2019-11-13  1:55     ` Al Viro
  2019-11-13  7:47     ` Aleksa Sarai
  1 sibling, 0 replies; 49+ messages in thread
From: Al Viro @ 2019-11-13  1:55 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Ingo Molnar, Peter Zijlstra,
	Christian Brauner, David Drysdale, Andy Lutomirski,
	Linus Torvalds, Eric Biederman, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	Chanho Min, Oleg Nesterov, Rasmus Villemoes, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Christian Brauner, Aleksa Sarai,
	containers, linux-alpha, linux-api, libc-alpha, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-ia64, linux-kernel,
	linux-kselftest, linux-m68k, linux-mips, linux-parisc,
	linuxppc-dev, linux-s390, linux-sh, linux-xtensa, sparclinux

On Tue, Nov 05, 2019 at 08:05:48PM +1100, Aleksa Sarai wrote:

Minor nit here - I'd split "move the conditional call of set_root()
into nd_jump_root()" into a separate patch before that one.  Makes
for fewer distractions in this one.  I'd probably fold "and be
ready for errors other than -ECHILD" into the same preliminary
patch.

> +			/* Not currently safe for scoped-lookups. */
> +			if (unlikely(nd->flags & LOOKUP_IS_SCOPED))
> +				return ERR_PTR(-EXDEV);

Also a candidate for doing in nd_jump_link()...

> @@ -1373,8 +1403,11 @@ static int follow_dotdot_rcu(struct nameidata *nd)
>  	struct inode *inode = nd->inode;
>  
>  	while (1) {
> -		if (path_equal(&nd->path, &nd->root))
> +		if (path_equal(&nd->path, &nd->root)) {
> +			if (unlikely(nd->flags & LOOKUP_BENEATH))
> +				return -EXDEV;

Umm...  Are you sure it's not -ECHILD?

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 5/9] namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
  2019-11-05  9:05 ` [PATCH v15 5/9] namei: LOOKUP_IN_ROOT: chroot-like " Aleksa Sarai
@ 2019-11-13  2:03   ` Al Viro
  2019-11-13  2:03     ` Al Viro
  2019-11-13  2:44     ` Aleksa Sarai
  0 siblings, 2 replies; 49+ messages in thread
From: Al Viro @ 2019-11-13  2:03 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Ingo Molnar, Peter Zijlstra,
	Christian Brauner, Eric Biederman, Andy Lutomirski,
	Andrew Morton, Alexei Starovoitov, Kees Cook, Jann Horn,
	Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
	Rasmus Villemoes, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Christian Brauner, Aleksa Sarai, Linus Torvalds, containers,
	linux-alpha, linux-api, libc-alpha, linux-arch, linux-arm-kernel,
	linux-fsdevel, linux-ia64, linux-kernel, linux-kselftest,
	linux-m68k, linux-mips, linux-parisc, linuxppc-dev, linux-s390,
	linux-sh, linux-xtensa, sparclinux

On Tue, Nov 05, 2019 at 08:05:49PM +1100, Aleksa Sarai wrote:

> @@ -2277,12 +2277,20 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
>  
>  	nd->m_seq = read_seqbegin(&mount_lock);
>  
> -	/* Figure out the starting path and root (if needed). */
> -	if (*s == '/') {
> +	/* Absolute pathname -- fetch the root. */
> +	if (flags & LOOKUP_IN_ROOT) {
> +		/* With LOOKUP_IN_ROOT, act as a relative path. */
> +		while (*s == '/')
> +			s++;

Er...  Why bother skipping slashes?  I mean, not only link_path_walk()
will skip them just fine, you are actually risking breakage in this:
                if (*s && unlikely(!d_can_lookup(dentry))) {
                        fdput(f);
                        return ERR_PTR(-ENOTDIR);
                }
which is downstream from there with you patch, AFAICS.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 5/9] namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
  2019-11-13  2:03   ` Al Viro
@ 2019-11-13  2:03     ` Al Viro
  2019-11-13  2:44     ` Aleksa Sarai
  1 sibling, 0 replies; 49+ messages in thread
From: Al Viro @ 2019-11-13  2:03 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Ingo Molnar, Peter Zijlstra,
	Christian Brauner, Eric Biederman, Andy Lutomirski,
	Andrew Morton, Alexei Starovoitov, Kees Cook, Jann Horn,
	Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
	Rasmus Villemoes, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Christian Brauner, Aleksa Sarai, Linus Torvalds, containers,
	linux-alpha, linux-api, libc-alpha, linux-arch, linux-arm-kernel,
	linux-fsdevel, linux-ia64, linux-kernel, linux-kselftest,
	linux-m68k, linux-mips, linux-parisc, linuxppc-dev, linux-s390,
	linux-sh, linux-xtensa, sparclinux

On Tue, Nov 05, 2019 at 08:05:49PM +1100, Aleksa Sarai wrote:

> @@ -2277,12 +2277,20 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
>  
>  	nd->m_seq = read_seqbegin(&mount_lock);
>  
> -	/* Figure out the starting path and root (if needed). */
> -	if (*s == '/') {
> +	/* Absolute pathname -- fetch the root. */
> +	if (flags & LOOKUP_IN_ROOT) {
> +		/* With LOOKUP_IN_ROOT, act as a relative path. */
> +		while (*s == '/')
> +			s++;

Er...  Why bother skipping slashes?  I mean, not only link_path_walk()
will skip them just fine, you are actually risking breakage in this:
                if (*s && unlikely(!d_can_lookup(dentry))) {
                        fdput(f);
                        return ERR_PTR(-ENOTDIR);
                }
which is downstream from there with you patch, AFAICS.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 6/9] namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
  2019-11-05  9:05 ` [PATCH v15 6/9] namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution Aleksa Sarai
@ 2019-11-13  2:09   ` Al Viro
  2019-11-13  2:09     ` Al Viro
  2019-11-13  7:52     ` Aleksa Sarai
  0 siblings, 2 replies; 49+ messages in thread
From: Al Viro @ 2019-11-13  2:09 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Ingo Molnar, Peter Zijlstra,
	Christian Brauner, Jann Horn, Linus Torvalds, Eric Biederman,
	Andy Lutomirski, Andrew Morton, Alexei Starovoitov, Kees Cook,
	Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
	Rasmus Villemoes, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Christian Brauner, Aleksa Sarai, containers, linux-alpha,
	linux-api, libc-alpha, linux-arch, linux-arm-kernel,
	linux-fsdevel, linux-ia64, linux-kernel, linux-kselftest,
	linux-m68k, linux-mips, linux-parisc, linuxppc-dev, linux-s390,
	linux-sh, linux-xtensa, sparclinux

On Tue, Nov 05, 2019 at 08:05:50PM +1100, Aleksa Sarai wrote:

> One other possible alternative (which previous versions of this patch
> used) would be to check with path_is_under() if there was a racing
> rename or mount (after re-taking the relevant seqlocks). While this does
> work, it results in possible O(n*m) behaviour if there are many renames
> or mounts occuring *anywhere on the system*.

BTW, do you realize that open-by-fhandle (or working nfsd, for that matter)
will trigger arseloads of write_seqlock(&rename_lock) simply on d_splice_alias()
bringing disconnected subtrees in contact with parent?

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 6/9] namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
  2019-11-13  2:09   ` Al Viro
@ 2019-11-13  2:09     ` Al Viro
  2019-11-13  7:52     ` Aleksa Sarai
  1 sibling, 0 replies; 49+ messages in thread
From: Al Viro @ 2019-11-13  2:09 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Ingo Molnar, Peter Zijlstra,
	Christian Brauner, Jann Horn, Linus Torvalds, Eric Biederman,
	Andy Lutomirski, Andrew Morton, Alexei Starovoitov, Kees Cook,
	Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
	Rasmus Villemoes, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Christian Brauner, Aleksa Sarai, containers, linux-alpha,
	linux-api, libc-alpha, linux-arch, linux-arm-kernel,
	linux-fsdevel, linux-ia64, linux-kernel, linux-kselftest,
	linux-m68k, linux-mips, linux-parisc, linuxppc-dev, linux-s390,
	linux-sh, linux-xtensa, sparclinux

On Tue, Nov 05, 2019 at 08:05:50PM +1100, Aleksa Sarai wrote:

> One other possible alternative (which previous versions of this patch
> used) would be to check with path_is_under() if there was a racing
> rename or mount (after re-taking the relevant seqlocks). While this does
> work, it results in possible O(n*m) behaviour if there are many renames
> or mounts occuring *anywhere on the system*.

BTW, do you realize that open-by-fhandle (or working nfsd, for that matter)
will trigger arseloads of write_seqlock(&rename_lock) simply on d_splice_alias()
bringing disconnected subtrees in contact with parent?

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 7/9] open: introduce openat2(2) syscall
  2019-11-05  9:05 ` [PATCH v15 7/9] open: introduce openat2(2) syscall Aleksa Sarai
@ 2019-11-13  2:29   ` Al Viro
  2019-11-13  2:29     ` Al Viro
  2019-11-13  2:35     ` Aleksa Sarai
  0 siblings, 2 replies; 49+ messages in thread
From: Al Viro @ 2019-11-13  2:29 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Ingo Molnar, Peter Zijlstra,
	Christian Brauner, Eric Biederman, Andy Lutomirski,
	Andrew Morton, Alexei Starovoitov, Kees Cook, Jann Horn,
	Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
	Rasmus Villemoes, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Aleksa Sarai, Linus Torvalds, containers, linux-alpha, linux-api,
	libc-alpha, linux-arch, linux-arm-kernel, linux-fsdevel,
	linux-ia64, linux-kernel, linux-kselftest, linux-m68k,
	linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
	linux-xtensa, sparclinux

On Tue, Nov 05, 2019 at 08:05:51PM +1100, Aleksa Sarai wrote:
  
> +/*
> + * Arguments for how openat2(2) should open the target path. If @resolve is
> + * zero, then openat2(2) operates very similarly to openat(2).
> + *
> + * However, unlike openat(2), unknown bits in @flags result in -EINVAL rather
> + * than being silently ignored. @mode must be zero unless one of {O_CREAT,
> + * O_TMPFILE} are set, and @upgrade_mask must be zero unless O_PATH is set.
> + *
> + * @flags: O_* flags.
> + * @mode: O_CREAT/O_TMPFILE file mode.
> + * @upgrade_mask: UPGRADE_* flags (to restrict O_PATH re-opening).

???

> + * @resolve: RESOLVE_* flags.
> + */
> +struct open_how {
> +	__aligned_u64 flags;
> +	__u16 mode;
> +	__u16 __padding[3]; /* must be zeroed */
> +	__aligned_u64 resolve;
> +};

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 7/9] open: introduce openat2(2) syscall
  2019-11-13  2:29   ` Al Viro
@ 2019-11-13  2:29     ` Al Viro
  2019-11-13  2:35     ` Aleksa Sarai
  1 sibling, 0 replies; 49+ messages in thread
From: Al Viro @ 2019-11-13  2:29 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Ingo Molnar, Peter Zijlstra,
	Christian Brauner, Eric Biederman, Andy Lutomirski,
	Andrew Morton, Alexei Starovoitov, Kees Cook, Jann Horn,
	Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
	Rasmus Villemoes, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Aleksa Sarai, Linus Torvalds, containers, linux-alpha, linux-api,
	libc-alpha, linux-arch, linux-arm-kernel, linux-fsdevel,
	linux-ia64, linux-kernel, linux-kselftest, linux-m68k,
	linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
	linux-xtensa, sparclinux

On Tue, Nov 05, 2019 at 08:05:51PM +1100, Aleksa Sarai wrote:
  
> +/*
> + * Arguments for how openat2(2) should open the target path. If @resolve is
> + * zero, then openat2(2) operates very similarly to openat(2).
> + *
> + * However, unlike openat(2), unknown bits in @flags result in -EINVAL rather
> + * than being silently ignored. @mode must be zero unless one of {O_CREAT,
> + * O_TMPFILE} are set, and @upgrade_mask must be zero unless O_PATH is set.
> + *
> + * @flags: O_* flags.
> + * @mode: O_CREAT/O_TMPFILE file mode.
> + * @upgrade_mask: UPGRADE_* flags (to restrict O_PATH re-opening).

???

> + * @resolve: RESOLVE_* flags.
> + */
> +struct open_how {
> +	__aligned_u64 flags;
> +	__u16 mode;
> +	__u16 __padding[3]; /* must be zeroed */
> +	__aligned_u64 resolve;
> +};

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 7/9] open: introduce openat2(2) syscall
  2019-11-13  2:29   ` Al Viro
  2019-11-13  2:29     ` Al Viro
@ 2019-11-13  2:35     ` Aleksa Sarai
  2019-11-13  2:35       ` Aleksa Sarai
  1 sibling, 1 reply; 49+ messages in thread
From: Aleksa Sarai @ 2019-11-13  2:35 UTC (permalink / raw)
  To: Al Viro
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Ingo Molnar, Peter Zijlstra,
	Christian Brauner, Eric Biederman, Andy Lutomirski,
	Andrew Morton, Alexei Starovoitov, Kees Cook, Jann Horn,
	Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
	Rasmus Villemoes, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Aleksa Sarai, Linus Torvalds, containers, linux-alpha, linux-api,
	libc-alpha, linux-arch, linux-arm-kernel, linux-fsdevel,
	linux-ia64, linux-kernel, linux-kselftest, linux-m68k,
	linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
	linux-xtensa, sparclinux

[-- Attachment #1: Type: text/plain, Size: 1144 bytes --]

On 2019-11-13, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Nov 05, 2019 at 08:05:51PM +1100, Aleksa Sarai wrote:
> > +/*
> > + * Arguments for how openat2(2) should open the target path. If @resolve is
> > + * zero, then openat2(2) operates very similarly to openat(2).
> > + *
> > + * However, unlike openat(2), unknown bits in @flags result in -EINVAL rather
> > + * than being silently ignored. @mode must be zero unless one of {O_CREAT,
> > + * O_TMPFILE} are set, and @upgrade_mask must be zero unless O_PATH is set.
> > + *
> > + * @flags: O_* flags.
> > + * @mode: O_CREAT/O_TMPFILE file mode.
> > + * @upgrade_mask: UPGRADE_* flags (to restrict O_PATH re-opening).
> 
> ???

Sorry, that was left over from a previous revision (where the magic-link
re-opening restrictions were part of this series).

> > + * @resolve: RESOLVE_* flags.
> > + */
> > +struct open_how {
> > +	__aligned_u64 flags;
> > +	__u16 mode;
> > +	__u16 __padding[3]; /* must be zeroed */
> > +	__aligned_u64 resolve;
> > +};


-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 7/9] open: introduce openat2(2) syscall
  2019-11-13  2:35     ` Aleksa Sarai
@ 2019-11-13  2:35       ` Aleksa Sarai
  0 siblings, 0 replies; 49+ messages in thread
From: Aleksa Sarai @ 2019-11-13  2:35 UTC (permalink / raw)
  To: Al Viro
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Ingo Molnar, Peter Zijlstra,
	Christian Brauner, Eric Biederman, Andy Lutomirski,
	Andrew Morton, Alexei Starovoitov, Kees Cook, Jann Horn,
	Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
	Rasmus Villemoes, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Aleksa Sarai, Linus Torvalds, containers, linux-alpha, linux-api,
	libc-alpha, linux-arch, linux-arm-kernel, linux-fsdevel,
	linux-ia64, linux-kernel, linux-kselftest, linux-m68k,
	linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
	linux-xtensa, sparclinux

[-- Attachment #1: Type: text/plain, Size: 1144 bytes --]

On 2019-11-13, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Nov 05, 2019 at 08:05:51PM +1100, Aleksa Sarai wrote:
> > +/*
> > + * Arguments for how openat2(2) should open the target path. If @resolve is
> > + * zero, then openat2(2) operates very similarly to openat(2).
> > + *
> > + * However, unlike openat(2), unknown bits in @flags result in -EINVAL rather
> > + * than being silently ignored. @mode must be zero unless one of {O_CREAT,
> > + * O_TMPFILE} are set, and @upgrade_mask must be zero unless O_PATH is set.
> > + *
> > + * @flags: O_* flags.
> > + * @mode: O_CREAT/O_TMPFILE file mode.
> > + * @upgrade_mask: UPGRADE_* flags (to restrict O_PATH re-opening).
> 
> ???

Sorry, that was left over from a previous revision (where the magic-link
re-opening restrictions were part of this series).

> > + * @resolve: RESOLVE_* flags.
> > + */
> > +struct open_how {
> > +	__aligned_u64 flags;
> > +	__u16 mode;
> > +	__u16 __padding[3]; /* must be zeroed */
> > +	__aligned_u64 resolve;
> > +};


-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 5/9] namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
  2019-11-13  2:03   ` Al Viro
  2019-11-13  2:03     ` Al Viro
@ 2019-11-13  2:44     ` Aleksa Sarai
  2019-11-13  2:44       ` Aleksa Sarai
  2019-11-13  2:59       ` Al Viro
  1 sibling, 2 replies; 49+ messages in thread
From: Aleksa Sarai @ 2019-11-13  2:44 UTC (permalink / raw)
  To: Al Viro
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Ingo Molnar, Peter Zijlstra,
	Christian Brauner, Eric Biederman, Andy Lutomirski,
	Andrew Morton, Alexei Starovoitov, Kees Cook, Jann Horn,
	Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
	Rasmus Villemoes, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Christian Brauner, Aleksa Sarai, Linus Torvalds, containers,
	linux-alpha, linux-api, libc-alpha, linux-arch, linux-arm-kernel,
	linux-fsdevel, linux-ia64, linux-kernel, linux-kselftest,
	linux-m68k, linux-mips, linux-parisc, linuxppc-dev, linux-s390,
	linux-sh, linux-xtensa, sparclinux

[-- Attachment #1: Type: text/plain, Size: 1371 bytes --]

On 2019-11-13, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Nov 05, 2019 at 08:05:49PM +1100, Aleksa Sarai wrote:
> 
> > @@ -2277,12 +2277,20 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> >  
> >  	nd->m_seq = read_seqbegin(&mount_lock);
> >  
> > -	/* Figure out the starting path and root (if needed). */
> > -	if (*s == '/') {
> > +	/* Absolute pathname -- fetch the root. */
> > +	if (flags & LOOKUP_IN_ROOT) {
> > +		/* With LOOKUP_IN_ROOT, act as a relative path. */
> > +		while (*s == '/')
> > +			s++;
> 
> Er...  Why bother skipping slashes?  I mean, not only link_path_walk()
> will skip them just fine, you are actually risking breakage in this:
>                 if (*s && unlikely(!d_can_lookup(dentry))) {
>                         fdput(f);
>                         return ERR_PTR(-ENOTDIR);
>                 }
> which is downstream from there with you patch, AFAICS.

I switched to stripping the slashes at your suggestion a few revisions
ago[1], and had (wrongly) assumed we needed to handle "/" somehow in
path_init(). But you're quite right about link_path_walk() -- and I'd be
more than happy to drop it.

[1]: https://lore.kernel.org/lkml/20190712125552.GL17978@ZenIV.linux.org.uk/

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 5/9] namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
  2019-11-13  2:44     ` Aleksa Sarai
@ 2019-11-13  2:44       ` Aleksa Sarai
  2019-11-13  2:59       ` Al Viro
  1 sibling, 0 replies; 49+ messages in thread
From: Aleksa Sarai @ 2019-11-13  2:44 UTC (permalink / raw)
  To: Al Viro
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Ingo Molnar, Peter Zijlstra,
	Christian Brauner, Eric Biederman, Andy Lutomirski,
	Andrew Morton, Alexei Starovoitov, Kees Cook, Jann Horn,
	Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
	Rasmus Villemoes, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Christian Brauner, Aleksa Sarai, Linus Torvalds, containers,
	linux-alpha, linux-api, libc-alpha, linux-arch, linux-arm-kernel,
	linux-fsdevel, linux-ia64, linux-kernel, linux-kselftest,
	linux-m68k, linux-mips, linux-parisc, linuxppc-dev, linux-s390,
	linux-sh, linux-xtensa, sparclinux

[-- Attachment #1: Type: text/plain, Size: 1371 bytes --]

On 2019-11-13, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Nov 05, 2019 at 08:05:49PM +1100, Aleksa Sarai wrote:
> 
> > @@ -2277,12 +2277,20 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> >  
> >  	nd->m_seq = read_seqbegin(&mount_lock);
> >  
> > -	/* Figure out the starting path and root (if needed). */
> > -	if (*s == '/') {
> > +	/* Absolute pathname -- fetch the root. */
> > +	if (flags & LOOKUP_IN_ROOT) {
> > +		/* With LOOKUP_IN_ROOT, act as a relative path. */
> > +		while (*s == '/')
> > +			s++;
> 
> Er...  Why bother skipping slashes?  I mean, not only link_path_walk()
> will skip them just fine, you are actually risking breakage in this:
>                 if (*s && unlikely(!d_can_lookup(dentry))) {
>                         fdput(f);
>                         return ERR_PTR(-ENOTDIR);
>                 }
> which is downstream from there with you patch, AFAICS.

I switched to stripping the slashes at your suggestion a few revisions
ago[1], and had (wrongly) assumed we needed to handle "/" somehow in
path_init(). But you're quite right about link_path_walk() -- and I'd be
more than happy to drop it.

[1]: https://lore.kernel.org/lkml/20190712125552.GL17978@ZenIV.linux.org.uk/

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 5/9] namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
  2019-11-13  2:44     ` Aleksa Sarai
  2019-11-13  2:44       ` Aleksa Sarai
@ 2019-11-13  2:59       ` Al Viro
  2019-11-13  2:59         ` Al Viro
  2019-11-13  3:55         ` Aleksa Sarai
  1 sibling, 2 replies; 49+ messages in thread
From: Al Viro @ 2019-11-13  2:59 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Ingo Molnar, Peter Zijlstra,
	Christian Brauner, Eric Biederman, Andy Lutomirski,
	Andrew Morton, Alexei Starovoitov, Kees Cook, Jann Horn,
	Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
	Rasmus Villemoes, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Christian Brauner, Aleksa Sarai, Linus Torvalds, containers,
	linux-alpha, linux-api, libc-alpha, linux-arch, linux-arm-kernel,
	linux-fsdevel, linux-ia64, linux-kernel, linux-kselftest,
	linux-m68k, linux-mips, linux-parisc, linuxppc-dev, linux-s390,
	linux-sh, linux-xtensa, sparclinux

On Wed, Nov 13, 2019 at 01:44:14PM +1100, Aleksa Sarai wrote:
> On 2019-11-13, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Tue, Nov 05, 2019 at 08:05:49PM +1100, Aleksa Sarai wrote:
> > 
> > > @@ -2277,12 +2277,20 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> > >  
> > >  	nd->m_seq = read_seqbegin(&mount_lock);
> > >  
> > > -	/* Figure out the starting path and root (if needed). */
> > > -	if (*s == '/') {
> > > +	/* Absolute pathname -- fetch the root. */
> > > +	if (flags & LOOKUP_IN_ROOT) {
> > > +		/* With LOOKUP_IN_ROOT, act as a relative path. */
> > > +		while (*s == '/')
> > > +			s++;
> > 
> > Er...  Why bother skipping slashes?  I mean, not only link_path_walk()
> > will skip them just fine, you are actually risking breakage in this:
> >                 if (*s && unlikely(!d_can_lookup(dentry))) {
> >                         fdput(f);
> >                         return ERR_PTR(-ENOTDIR);
> >                 }
> > which is downstream from there with you patch, AFAICS.
> 
> I switched to stripping the slashes at your suggestion a few revisions
> ago[1], and had (wrongly) assumed we needed to handle "/" somehow in
> path_init(). But you're quite right about link_path_walk() -- and I'd be
> more than happy to drop it.

That, IIRC, was about untangling the weirdness around multiple calls of
dirfd_path_init() and basically went "we might want just strip the slashes
in case of that flag very early in the entire thing, so that later the
normal logics for absolute/relative would DTRT".  Since your check is
right next to checking for absolute pathnames (and not in the very
beginning of path_init()), we might as well turn the check for
absolute pathname into *s == '/' && !(flags & LOOKUP_IN_ROOT) and be
done with that.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 5/9] namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
  2019-11-13  2:59       ` Al Viro
@ 2019-11-13  2:59         ` Al Viro
  2019-11-13  3:55         ` Aleksa Sarai
  1 sibling, 0 replies; 49+ messages in thread
From: Al Viro @ 2019-11-13  2:59 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Ingo Molnar, Peter Zijlstra,
	Christian Brauner, Eric Biederman, Andy Lutomirski,
	Andrew Morton, Alexei Starovoitov, Kees Cook, Jann Horn,
	Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
	Rasmus Villemoes, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Christian Brauner, Aleksa Sarai, Linus Torvalds, containers,
	linux-alpha, linux-api, libc-alpha, linux-arch, linux-arm-kernel,
	linux-fsdevel, linux-ia64, linux-kernel, linux-kselftest,
	linux-m68k, linux-mips, linux-parisc, linuxppc-dev, linux-s390,
	linux-sh, linux-xtensa, sparclinux

On Wed, Nov 13, 2019 at 01:44:14PM +1100, Aleksa Sarai wrote:
> On 2019-11-13, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Tue, Nov 05, 2019 at 08:05:49PM +1100, Aleksa Sarai wrote:
> > 
> > > @@ -2277,12 +2277,20 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> > >  
> > >  	nd->m_seq = read_seqbegin(&mount_lock);
> > >  
> > > -	/* Figure out the starting path and root (if needed). */
> > > -	if (*s == '/') {
> > > +	/* Absolute pathname -- fetch the root. */
> > > +	if (flags & LOOKUP_IN_ROOT) {
> > > +		/* With LOOKUP_IN_ROOT, act as a relative path. */
> > > +		while (*s == '/')
> > > +			s++;
> > 
> > Er...  Why bother skipping slashes?  I mean, not only link_path_walk()
> > will skip them just fine, you are actually risking breakage in this:
> >                 if (*s && unlikely(!d_can_lookup(dentry))) {
> >                         fdput(f);
> >                         return ERR_PTR(-ENOTDIR);
> >                 }
> > which is downstream from there with you patch, AFAICS.
> 
> I switched to stripping the slashes at your suggestion a few revisions
> ago[1], and had (wrongly) assumed we needed to handle "/" somehow in
> path_init(). But you're quite right about link_path_walk() -- and I'd be
> more than happy to drop it.

That, IIRC, was about untangling the weirdness around multiple calls of
dirfd_path_init() and basically went "we might want just strip the slashes
in case of that flag very early in the entire thing, so that later the
normal logics for absolute/relative would DTRT".  Since your check is
right next to checking for absolute pathnames (and not in the very
beginning of path_init()), we might as well turn the check for
absolute pathname into *s == '/' && !(flags & LOOKUP_IN_ROOT) and be
done with that.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 5/9] namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
  2019-11-13  2:59       ` Al Viro
  2019-11-13  2:59         ` Al Viro
@ 2019-11-13  3:55         ` Aleksa Sarai
  2019-11-13  3:55           ` Aleksa Sarai
  1 sibling, 1 reply; 49+ messages in thread
From: Aleksa Sarai @ 2019-11-13  3:55 UTC (permalink / raw)
  To: Al Viro
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Ingo Molnar, Peter Zijlstra,
	Christian Brauner, Eric Biederman, Andy Lutomirski,
	Andrew Morton, Alexei Starovoitov, Kees Cook, Jann Horn,
	Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
	Rasmus Villemoes, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Christian Brauner, Aleksa Sarai, Linus Torvalds, containers,
	linux-alpha, linux-api, libc-alpha, linux-arch, linux-arm-kernel,
	linux-fsdevel, linux-ia64, linux-kernel, linux-kselftest,
	linux-m68k, linux-mips, linux-parisc, linuxppc-dev, linux-s390,
	linux-sh, linux-xtensa, sparclinux

[-- Attachment #1: Type: text/plain, Size: 2141 bytes --]

On 2019-11-13, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Wed, Nov 13, 2019 at 01:44:14PM +1100, Aleksa Sarai wrote:
> > On 2019-11-13, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > On Tue, Nov 05, 2019 at 08:05:49PM +1100, Aleksa Sarai wrote:
> > > 
> > > > @@ -2277,12 +2277,20 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> > > >  
> > > >  	nd->m_seq = read_seqbegin(&mount_lock);
> > > >  
> > > > -	/* Figure out the starting path and root (if needed). */
> > > > -	if (*s == '/') {
> > > > +	/* Absolute pathname -- fetch the root. */
> > > > +	if (flags & LOOKUP_IN_ROOT) {
> > > > +		/* With LOOKUP_IN_ROOT, act as a relative path. */
> > > > +		while (*s == '/')
> > > > +			s++;
> > > 
> > > Er...  Why bother skipping slashes?  I mean, not only link_path_walk()
> > > will skip them just fine, you are actually risking breakage in this:
> > >                 if (*s && unlikely(!d_can_lookup(dentry))) {
> > >                         fdput(f);
> > >                         return ERR_PTR(-ENOTDIR);
> > >                 }
> > > which is downstream from there with you patch, AFAICS.
> > 
> > I switched to stripping the slashes at your suggestion a few revisions
> > ago[1], and had (wrongly) assumed we needed to handle "/" somehow in
> > path_init(). But you're quite right about link_path_walk() -- and I'd be
> > more than happy to drop it.
> 
> That, IIRC, was about untangling the weirdness around multiple calls of
> dirfd_path_init() and basically went "we might want just strip the slashes
> in case of that flag very early in the entire thing, so that later the
> normal logics for absolute/relative would DTRT".

Ah okay, I'd misunderstood the point you were making in that thread.

> Since your check is right next to checking for absolute pathnames (and
> not in the very beginning of path_init()), we might as well turn the
> check for absolute pathname into *s == '/' && !(flags &
> LOOKUP_IN_ROOT) and be done with that.

Yup, agreed.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 5/9] namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
  2019-11-13  3:55         ` Aleksa Sarai
@ 2019-11-13  3:55           ` Aleksa Sarai
  0 siblings, 0 replies; 49+ messages in thread
From: Aleksa Sarai @ 2019-11-13  3:55 UTC (permalink / raw)
  To: Al Viro
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Ingo Molnar, Peter Zijlstra,
	Christian Brauner, Eric Biederman, Andy Lutomirski,
	Andrew Morton, Alexei Starovoitov, Kees Cook, Jann Horn,
	Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
	Rasmus Villemoes, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Christian Brauner, Aleksa Sarai, Linus Torvalds, containers,
	linux-alpha, linux-api, libc-alpha, linux-arch, linux-arm-kernel,
	linux-fsdevel, linux-ia64, linux-kernel, linux-kselftest,
	linux-m68k, linux-mips, linux-parisc, linuxppc-dev, linux-s390,
	linux-sh, linux-xtensa, sparclinux

[-- Attachment #1: Type: text/plain, Size: 2141 bytes --]

On 2019-11-13, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Wed, Nov 13, 2019 at 01:44:14PM +1100, Aleksa Sarai wrote:
> > On 2019-11-13, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > On Tue, Nov 05, 2019 at 08:05:49PM +1100, Aleksa Sarai wrote:
> > > 
> > > > @@ -2277,12 +2277,20 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> > > >  
> > > >  	nd->m_seq = read_seqbegin(&mount_lock);
> > > >  
> > > > -	/* Figure out the starting path and root (if needed). */
> > > > -	if (*s == '/') {
> > > > +	/* Absolute pathname -- fetch the root. */
> > > > +	if (flags & LOOKUP_IN_ROOT) {
> > > > +		/* With LOOKUP_IN_ROOT, act as a relative path. */
> > > > +		while (*s == '/')
> > > > +			s++;
> > > 
> > > Er...  Why bother skipping slashes?  I mean, not only link_path_walk()
> > > will skip them just fine, you are actually risking breakage in this:
> > >                 if (*s && unlikely(!d_can_lookup(dentry))) {
> > >                         fdput(f);
> > >                         return ERR_PTR(-ENOTDIR);
> > >                 }
> > > which is downstream from there with you patch, AFAICS.
> > 
> > I switched to stripping the slashes at your suggestion a few revisions
> > ago[1], and had (wrongly) assumed we needed to handle "/" somehow in
> > path_init(). But you're quite right about link_path_walk() -- and I'd be
> > more than happy to drop it.
> 
> That, IIRC, was about untangling the weirdness around multiple calls of
> dirfd_path_init() and basically went "we might want just strip the slashes
> in case of that flag very early in the entire thing, so that later the
> normal logics for absolute/relative would DTRT".

Ah okay, I'd misunderstood the point you were making in that thread.

> Since your check is right next to checking for absolute pathnames (and
> not in the very beginning of path_init()), we might as well turn the
> check for absolute pathname into *s == '/' && !(flags &
> LOOKUP_IN_ROOT) and be done with that.

Yup, agreed.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 4/9] namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
  2019-11-13  1:55   ` Al Viro
  2019-11-13  1:55     ` Al Viro
@ 2019-11-13  7:47     ` Aleksa Sarai
  2019-11-13  7:47       ` Aleksa Sarai
  2019-11-14  4:57       ` Aleksa Sarai
  1 sibling, 2 replies; 49+ messages in thread
From: Aleksa Sarai @ 2019-11-13  7:47 UTC (permalink / raw)
  To: Al Viro
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Ingo Molnar, Peter Zijlstra,
	Christian Brauner, David Drysdale, Andy Lutomirski,
	Linus Torvalds, Eric Biederman, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	Chanho Min, Oleg Nesterov, Rasmus Villemoes, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Christian Brauner, Aleksa Sarai,
	containers, linux-alpha, linux-api, libc-alpha, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-ia64, linux-kernel,
	linux-kselftest, linux-m68k, linux-mips, linux-parisc,
	linuxppc-dev, linux-s390, linux-sh, linux-xtensa, sparclinux

[-- Attachment #1: Type: text/plain, Size: 1296 bytes --]

On 2019-11-13, Al Viro <viro@zeniv.linux.org.uk> wrote:
> Minor nit here - I'd split "move the conditional call of set_root()
> into nd_jump_root()" into a separate patch before that one.  Makes
> for fewer distractions in this one.  I'd probably fold "and be
> ready for errors other than -ECHILD" into the same preliminary
> patch.

Will do.

> > +			/* Not currently safe for scoped-lookups. */
> > +			if (unlikely(nd->flags & LOOKUP_IS_SCOPED))
> > +				return ERR_PTR(-EXDEV);
> 
> Also a candidate for doing in nd_jump_link()...
> 
> > @@ -1373,8 +1403,11 @@ static int follow_dotdot_rcu(struct nameidata *nd)
> >  	struct inode *inode = nd->inode;
> >  
> >  	while (1) {
> > -		if (path_equal(&nd->path, &nd->root))
> > +		if (path_equal(&nd->path, &nd->root)) {
> > +			if (unlikely(nd->flags & LOOKUP_BENEATH))
> > +				return -EXDEV;
> 
> Umm...  Are you sure it's not -ECHILD?

It wouldn't hurt to be -ECHILD -- though it's not clear to me how likely
a success would be in REF-walk if the parent components didn't already
trigger an unlazy_walk() in RCU-walk.

I guess that also means LOOKUP_NO_XDEV should trigger -ECHILD in
follow_dotdot_rcu()?

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 4/9] namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
  2019-11-13  7:47     ` Aleksa Sarai
@ 2019-11-13  7:47       ` Aleksa Sarai
  2019-11-14  4:57       ` Aleksa Sarai
  1 sibling, 0 replies; 49+ messages in thread
From: Aleksa Sarai @ 2019-11-13  7:47 UTC (permalink / raw)
  To: Al Viro
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Ingo Molnar, Peter Zijlstra,
	Christian Brauner, David Drysdale, Andy Lutomirski,
	Linus Torvalds, Eric Biederman, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	Chanho Min, Oleg Nesterov, Rasmus Villemoes, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Christian Brauner, Aleksa Sarai,
	containers, linux-alpha, linux-api, libc-alpha, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-ia64, linux-kernel,
	linux-kselftest, linux-m68k, linux-mips, linux-parisc,
	linuxppc-dev, linux-s390, linux-sh, linux-xtensa, sparclinux

[-- Attachment #1: Type: text/plain, Size: 1296 bytes --]

On 2019-11-13, Al Viro <viro@zeniv.linux.org.uk> wrote:
> Minor nit here - I'd split "move the conditional call of set_root()
> into nd_jump_root()" into a separate patch before that one.  Makes
> for fewer distractions in this one.  I'd probably fold "and be
> ready for errors other than -ECHILD" into the same preliminary
> patch.

Will do.

> > +			/* Not currently safe for scoped-lookups. */
> > +			if (unlikely(nd->flags & LOOKUP_IS_SCOPED))
> > +				return ERR_PTR(-EXDEV);
> 
> Also a candidate for doing in nd_jump_link()...
> 
> > @@ -1373,8 +1403,11 @@ static int follow_dotdot_rcu(struct nameidata *nd)
> >  	struct inode *inode = nd->inode;
> >  
> >  	while (1) {
> > -		if (path_equal(&nd->path, &nd->root))
> > +		if (path_equal(&nd->path, &nd->root)) {
> > +			if (unlikely(nd->flags & LOOKUP_BENEATH))
> > +				return -EXDEV;
> 
> Umm...  Are you sure it's not -ECHILD?

It wouldn't hurt to be -ECHILD -- though it's not clear to me how likely
a success would be in REF-walk if the parent components didn't already
trigger an unlazy_walk() in RCU-walk.

I guess that also means LOOKUP_NO_XDEV should trigger -ECHILD in
follow_dotdot_rcu()?

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 6/9] namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
  2019-11-13  2:09   ` Al Viro
  2019-11-13  2:09     ` Al Viro
@ 2019-11-13  7:52     ` Aleksa Sarai
  2019-11-13  7:52       ` Aleksa Sarai
  1 sibling, 1 reply; 49+ messages in thread
From: Aleksa Sarai @ 2019-11-13  7:52 UTC (permalink / raw)
  To: Al Viro
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Ingo Molnar, Peter Zijlstra,
	Christian Brauner, Jann Horn, Linus Torvalds, Eric Biederman,
	Andy Lutomirski, Andrew Morton, Alexei Starovoitov, Kees Cook,
	Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
	Rasmus Villemoes, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Christian Brauner, Aleksa Sarai, containers, linux-alpha,
	linux-api, libc-alpha, linux-arch, linux-arm-kernel,
	linux-fsdevel, linux-ia64, linux-kernel, linux-kselftest,
	linux-m68k, linux-mips, linux-parisc, linuxppc-dev, linux-s390,
	linux-sh, linux-xtensa, sparclinux

[-- Attachment #1: Type: text/plain, Size: 1005 bytes --]

On 2019-11-13, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Nov 05, 2019 at 08:05:50PM +1100, Aleksa Sarai wrote:
> 
> > One other possible alternative (which previous versions of this patch
> > used) would be to check with path_is_under() if there was a racing
> > rename or mount (after re-taking the relevant seqlocks). While this does
> > work, it results in possible O(n*m) behaviour if there are many renames
> > or mounts occuring *anywhere on the system*.
> 
> BTW, do you realize that open-by-fhandle (or working nfsd, for that matter)
> will trigger arseloads of write_seqlock(&rename_lock) simply on d_splice_alias()
> bringing disconnected subtrees in contact with parent?

I wasn't aware of that -- that makes path_is_under() even less viable.
I'll reword it to be clearer that path_is_under() isn't a good idea and
why we went with -EAGAIN over an in-kernel retry.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 6/9] namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
  2019-11-13  7:52     ` Aleksa Sarai
@ 2019-11-13  7:52       ` Aleksa Sarai
  0 siblings, 0 replies; 49+ messages in thread
From: Aleksa Sarai @ 2019-11-13  7:52 UTC (permalink / raw)
  To: Al Viro
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Ingo Molnar, Peter Zijlstra,
	Christian Brauner, Jann Horn, Linus Torvalds, Eric Biederman,
	Andy Lutomirski, Andrew Morton, Alexei Starovoitov, Kees Cook,
	Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
	Rasmus Villemoes, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Christian Brauner, Aleksa Sarai, containers, linux-alpha,
	linux-api, libc-alpha, linux-arch, linux-arm-kernel,
	linux-fsdevel, linux-ia64, linux-kernel, linux-kselftest,
	linux-m68k, linux-mips, linux-parisc, linuxppc-dev, linux-s390,
	linux-sh, linux-xtensa, sparclinux

[-- Attachment #1: Type: text/plain, Size: 1005 bytes --]

On 2019-11-13, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Nov 05, 2019 at 08:05:50PM +1100, Aleksa Sarai wrote:
> 
> > One other possible alternative (which previous versions of this patch
> > used) would be to check with path_is_under() if there was a racing
> > rename or mount (after re-taking the relevant seqlocks). While this does
> > work, it results in possible O(n*m) behaviour if there are many renames
> > or mounts occuring *anywhere on the system*.
> 
> BTW, do you realize that open-by-fhandle (or working nfsd, for that matter)
> will trigger arseloads of write_seqlock(&rename_lock) simply on d_splice_alias()
> bringing disconnected subtrees in contact with parent?

I wasn't aware of that -- that makes path_is_under() even less viable.
I'll reword it to be clearer that path_is_under() isn't a good idea and
why we went with -EAGAIN over an in-kernel retry.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 3/9] namei: LOOKUP_NO_XDEV: block mountpoint crossing
  2019-11-13  1:36   ` Al Viro
  2019-11-13  1:36     ` Al Viro
@ 2019-11-14  4:49     ` Aleksa Sarai
  2019-11-14  4:49       ` Aleksa Sarai
  2019-11-14  5:43       ` Al Viro
  1 sibling, 2 replies; 49+ messages in thread
From: Aleksa Sarai @ 2019-11-14  4:49 UTC (permalink / raw)
  To: Al Viro
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Ingo Molnar, Peter Zijlstra,
	Christian Brauner, David Drysdale, Andy Lutomirski,
	Linus Torvalds, Eric Biederman, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	Chanho Min, Oleg Nesterov, Rasmus Villemoes, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Christian Brauner, Aleksa Sarai,
	containers, linux-alpha, linux-api, libc-alpha, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-ia64, linux-kernel,
	linux-kselftest, linux-m68k, linux-mips, linux-parisc,
	linuxppc-dev, linux-s390, linux-sh, linux-xtensa, sparclinux

[-- Attachment #1: Type: text/plain, Size: 1920 bytes --]

On 2019-11-13, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Nov 05, 2019 at 08:05:47PM +1100, Aleksa Sarai wrote:
> 
> > @@ -862,6 +870,8 @@ static int nd_jump_root(struct nameidata *nd)
> >  void nd_jump_link(struct path *path)
> >  {
> >  	struct nameidata *nd = current->nameidata;
> > +
> > +	nd->last_magiclink.same_mnt = (nd->path.mnt == path->mnt);
> >  	path_put(&nd->path);
> >  
> >  	nd->path = *path;
> > @@ -1082,6 +1092,10 @@ const char *get_link(struct nameidata *nd)
> >  		if (nd->flags & LOOKUP_MAGICLINK_JUMPED) {
> >  			if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
> >  				return ERR_PTR(-ELOOP);
> > +			if (unlikely(nd->flags & LOOKUP_NO_XDEV)) {
> > +				if (!nd->last_magiclink.same_mnt)
> > +					return ERR_PTR(-EXDEV);
> > +			}
> >  		}
> 
> Ugh...  Wouldn't it be better to take that logics (some equivalent thereof)
> into nd_jump_link()?  Or just have nd_jump_link() return an error...

This could be done, but the reason for stashing it away in
last_magiclink is because of the future magic-link re-opening patches
which can't be implemented like that without putting the open_flags
inside nameidata (which was decided to be too ugly a while ago).

My point being that I could implement it this way for this series, but
I'd have to implement something like last_magiclink when I end up
re-posting the magic-link stuff in a few weeks.

Looking at all the nd_jump_link() users, the other option is to just
disallow magic-link crossings entirely for LOOKUP_NO_XDEV. The only
thing allowing them permits is to resolve file descriptors that are
pointing to the same procfs mount -- and it's unclear to me how useful
that really is (apparmorfs and nsfs will always give -EXDEV because
aafs_mnt and nsfs_mnt are internal kernel vfsmounts).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 3/9] namei: LOOKUP_NO_XDEV: block mountpoint crossing
  2019-11-14  4:49     ` Aleksa Sarai
@ 2019-11-14  4:49       ` Aleksa Sarai
  2019-11-14  5:43       ` Al Viro
  1 sibling, 0 replies; 49+ messages in thread
From: Aleksa Sarai @ 2019-11-14  4:49 UTC (permalink / raw)
  To: Al Viro
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Ingo Molnar, Peter Zijlstra,
	Christian Brauner, David Drysdale, Andy Lutomirski,
	Linus Torvalds, Eric Biederman, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	Chanho Min, Oleg Nesterov, Rasmus Villemoes, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Christian Brauner, Aleksa Sarai,
	containers, linux-alpha, linux-api, libc-alpha, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-ia64, linux-kernel,
	linux-kselftest, linux-m68k, linux-mips, linux-parisc,
	linuxppc-dev, linux-s390, linux-sh, linux-xtensa, sparclinux

[-- Attachment #1: Type: text/plain, Size: 1920 bytes --]

On 2019-11-13, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Nov 05, 2019 at 08:05:47PM +1100, Aleksa Sarai wrote:
> 
> > @@ -862,6 +870,8 @@ static int nd_jump_root(struct nameidata *nd)
> >  void nd_jump_link(struct path *path)
> >  {
> >  	struct nameidata *nd = current->nameidata;
> > +
> > +	nd->last_magiclink.same_mnt = (nd->path.mnt == path->mnt);
> >  	path_put(&nd->path);
> >  
> >  	nd->path = *path;
> > @@ -1082,6 +1092,10 @@ const char *get_link(struct nameidata *nd)
> >  		if (nd->flags & LOOKUP_MAGICLINK_JUMPED) {
> >  			if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
> >  				return ERR_PTR(-ELOOP);
> > +			if (unlikely(nd->flags & LOOKUP_NO_XDEV)) {
> > +				if (!nd->last_magiclink.same_mnt)
> > +					return ERR_PTR(-EXDEV);
> > +			}
> >  		}
> 
> Ugh...  Wouldn't it be better to take that logics (some equivalent thereof)
> into nd_jump_link()?  Or just have nd_jump_link() return an error...

This could be done, but the reason for stashing it away in
last_magiclink is because of the future magic-link re-opening patches
which can't be implemented like that without putting the open_flags
inside nameidata (which was decided to be too ugly a while ago).

My point being that I could implement it this way for this series, but
I'd have to implement something like last_magiclink when I end up
re-posting the magic-link stuff in a few weeks.

Looking at all the nd_jump_link() users, the other option is to just
disallow magic-link crossings entirely for LOOKUP_NO_XDEV. The only
thing allowing them permits is to resolve file descriptors that are
pointing to the same procfs mount -- and it's unclear to me how useful
that really is (apparmorfs and nsfs will always give -EXDEV because
aafs_mnt and nsfs_mnt are internal kernel vfsmounts).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 4/9] namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
  2019-11-13  7:47     ` Aleksa Sarai
  2019-11-13  7:47       ` Aleksa Sarai
@ 2019-11-14  4:57       ` Aleksa Sarai
  2019-11-14  4:57         ` Aleksa Sarai
  1 sibling, 1 reply; 49+ messages in thread
From: Aleksa Sarai @ 2019-11-14  4:57 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-ia64, linux-sh, Rasmus Villemoes, Alexei Starovoitov,
	linux-kernel, David Howells, linux-kselftest, sparclinux,
	Christian Brauner, Shuah Khan, linux-arch, linux-s390, Jiri Olsa,
	Alexander Shishkin, Ingo Molnar, linux-arm-kernel, linux-mips,
	linux-xtensa, Kees Cook, Arnd Bergmann, Jann Horn, linuxppc-dev,
	linux-m68k, Andy Lutomirski, Shuah Khan, Namhyung Kim,
	David Drysdale, J. Bruce Fields, libc-alpha, linux-parisc,
	linux-api, Chanho Min, Jeff Layton, Oleg Nesterov,
	Eric Biederman, linux-alpha, linux-fsdevel, Andrew Morton,
	Linus Torvalds, containers

[-- Attachment #1: Type: text/plain, Size: 1595 bytes --]

On 2019-11-13, Aleksa Sarai <cyphar@cyphar.com> wrote:
> On 2019-11-13, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > Minor nit here - I'd split "move the conditional call of set_root()
> > into nd_jump_root()" into a separate patch before that one.  Makes
> > for fewer distractions in this one.  I'd probably fold "and be
> > ready for errors other than -ECHILD" into the same preliminary
> > patch.
> 
> Will do.
> 
> > > +			/* Not currently safe for scoped-lookups. */
> > > +			if (unlikely(nd->flags & LOOKUP_IS_SCOPED))
> > > +				return ERR_PTR(-EXDEV);
> > 
> > Also a candidate for doing in nd_jump_link()...
> > 
> > > @@ -1373,8 +1403,11 @@ static int follow_dotdot_rcu(struct nameidata *nd)
> > >  	struct inode *inode = nd->inode;
> > >  
> > >  	while (1) {
> > > -		if (path_equal(&nd->path, &nd->root))
> > > +		if (path_equal(&nd->path, &nd->root)) {
> > > +			if (unlikely(nd->flags & LOOKUP_BENEATH))
> > > +				return -EXDEV;
> > 
> > Umm...  Are you sure it's not -ECHILD?
> 
> It wouldn't hurt to be -ECHILD -- though it's not clear to me how likely
> a success would be in REF-walk if the parent components didn't already
> trigger an unlazy_walk() in RCU-walk.
> 
> I guess that also means LOOKUP_NO_XDEV should trigger -ECHILD in
> follow_dotdot_rcu()?

Scratch the last question -- AFAICS we don't need to do that for
LOOKUP_NO_XDEV because we check against mount_lock so it's very unlikely
that -ECHILD will have any benefit.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 4/9] namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
  2019-11-14  4:57       ` Aleksa Sarai
@ 2019-11-14  4:57         ` Aleksa Sarai
  0 siblings, 0 replies; 49+ messages in thread
From: Aleksa Sarai @ 2019-11-14  4:57 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-ia64, linux-sh, Rasmus Villemoes, Alexei Starovoitov,
	linux-kernel, David Howells, linux-kselftest, sparclinux,
	Christian Brauner, Shuah Khan, linux-arch, linux-s390, Jiri Olsa,
	Alexander Shishkin, Ingo Molnar, linux-arm-kernel, linux-mips,
	linux-xtensa, Kees Cook, Arnd Bergmann, Jann Horn, linuxppc-dev,
	linux-m68k, Andy Lutomirski, Shuah Khan, Namhyung Kim,
	David Drysdale, J. Bruce Fields, libc-alpha, linux-parisc,
	linux-api, Chanho Min, Jeff Layton, Oleg Nesterov,
	Eric Biederman, linux-alpha, linux-fsdevel, Andrew Morton,
	Linus Torvalds, containers

[-- Attachment #1: Type: text/plain, Size: 1595 bytes --]

On 2019-11-13, Aleksa Sarai <cyphar@cyphar.com> wrote:
> On 2019-11-13, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > Minor nit here - I'd split "move the conditional call of set_root()
> > into nd_jump_root()" into a separate patch before that one.  Makes
> > for fewer distractions in this one.  I'd probably fold "and be
> > ready for errors other than -ECHILD" into the same preliminary
> > patch.
> 
> Will do.
> 
> > > +			/* Not currently safe for scoped-lookups. */
> > > +			if (unlikely(nd->flags & LOOKUP_IS_SCOPED))
> > > +				return ERR_PTR(-EXDEV);
> > 
> > Also a candidate for doing in nd_jump_link()...
> > 
> > > @@ -1373,8 +1403,11 @@ static int follow_dotdot_rcu(struct nameidata *nd)
> > >  	struct inode *inode = nd->inode;
> > >  
> > >  	while (1) {
> > > -		if (path_equal(&nd->path, &nd->root))
> > > +		if (path_equal(&nd->path, &nd->root)) {
> > > +			if (unlikely(nd->flags & LOOKUP_BENEATH))
> > > +				return -EXDEV;
> > 
> > Umm...  Are you sure it's not -ECHILD?
> 
> It wouldn't hurt to be -ECHILD -- though it's not clear to me how likely
> a success would be in REF-walk if the parent components didn't already
> trigger an unlazy_walk() in RCU-walk.
> 
> I guess that also means LOOKUP_NO_XDEV should trigger -ECHILD in
> follow_dotdot_rcu()?

Scratch the last question -- AFAICS we don't need to do that for
LOOKUP_NO_XDEV because we check against mount_lock so it's very unlikely
that -ECHILD will have any benefit.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 3/9] namei: LOOKUP_NO_XDEV: block mountpoint crossing
  2019-11-14  4:49     ` Aleksa Sarai
  2019-11-14  4:49       ` Aleksa Sarai
@ 2019-11-14  5:43       ` Al Viro
  2019-11-14  5:43         ` Al Viro
  2019-11-14 13:33         ` Aleksa Sarai
  1 sibling, 2 replies; 49+ messages in thread
From: Al Viro @ 2019-11-14  5:43 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Ingo Molnar, Peter Zijlstra,
	Christian Brauner, David Drysdale, Andy Lutomirski,
	Linus Torvalds, Eric Biederman, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	Chanho Min, Oleg Nesterov, Rasmus Villemoes, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Christian Brauner, Aleksa Sarai,
	containers, linux-alpha, linux-api, libc-alpha, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-ia64, linux-kernel,
	linux-kselftest, linux-m68k, linux-mips, linux-parisc,
	linuxppc-dev, linux-s390, linux-sh, linux-xtensa, sparclinux

On Thu, Nov 14, 2019 at 03:49:45PM +1100, Aleksa Sarai wrote:
> On 2019-11-13, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Tue, Nov 05, 2019 at 08:05:47PM +1100, Aleksa Sarai wrote:
> > 
> > > @@ -862,6 +870,8 @@ static int nd_jump_root(struct nameidata *nd)
> > >  void nd_jump_link(struct path *path)
> > >  {
> > >  	struct nameidata *nd = current->nameidata;
> > > +
> > > +	nd->last_magiclink.same_mnt = (nd->path.mnt == path->mnt);
> > >  	path_put(&nd->path);
> > >  
> > >  	nd->path = *path;
> > > @@ -1082,6 +1092,10 @@ const char *get_link(struct nameidata *nd)
> > >  		if (nd->flags & LOOKUP_MAGICLINK_JUMPED) {
> > >  			if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
> > >  				return ERR_PTR(-ELOOP);
> > > +			if (unlikely(nd->flags & LOOKUP_NO_XDEV)) {
> > > +				if (!nd->last_magiclink.same_mnt)
> > > +					return ERR_PTR(-EXDEV);
> > > +			}
> > >  		}
> > 
> > Ugh...  Wouldn't it be better to take that logics (some equivalent thereof)
> > into nd_jump_link()?  Or just have nd_jump_link() return an error...
> 
> This could be done, but the reason for stashing it away in
> last_magiclink is because of the future magic-link re-opening patches
> which can't be implemented like that without putting the open_flags
> inside nameidata (which was decided to be too ugly a while ago).
> 
> My point being that I could implement it this way for this series, but
> I'd have to implement something like last_magiclink when I end up
> re-posting the magic-link stuff in a few weeks.
> 
> Looking at all the nd_jump_link() users, the other option is to just
> disallow magic-link crossings entirely for LOOKUP_NO_XDEV. The only
> thing allowing them permits is to resolve file descriptors that are
> pointing to the same procfs mount -- and it's unclear to me how useful
> that really is (apparmorfs and nsfs will always give -EXDEV because
> aafs_mnt and nsfs_mnt are internal kernel vfsmounts).

I would rather keep the entire if (nd->flags & LOOKUP_MAGICLINK_JUMPED)
out of the get_link().  If you want to generate some error if
nd_jump_link() has been called, just do it right there.  The fewer
pieces of state need to be carried around, the better...

And as for opening them...  Why would you need full open_flags in there?
Details, please...

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 3/9] namei: LOOKUP_NO_XDEV: block mountpoint crossing
  2019-11-14  5:43       ` Al Viro
@ 2019-11-14  5:43         ` Al Viro
  2019-11-14 13:33         ` Aleksa Sarai
  1 sibling, 0 replies; 49+ messages in thread
From: Al Viro @ 2019-11-14  5:43 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Ingo Molnar, Peter Zijlstra,
	Christian Brauner, David Drysdale, Andy Lutomirski,
	Linus Torvalds, Eric Biederman, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	Chanho Min, Oleg Nesterov, Rasmus Villemoes, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Christian Brauner, Aleksa Sarai,
	containers, linux-alpha, linux-api, libc-alpha, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-ia64, linux-kernel,
	linux-kselftest, linux-m68k, linux-mips, linux-parisc,
	linuxppc-dev, linux-s390, linux-sh, linux-xtensa, sparclinux

On Thu, Nov 14, 2019 at 03:49:45PM +1100, Aleksa Sarai wrote:
> On 2019-11-13, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Tue, Nov 05, 2019 at 08:05:47PM +1100, Aleksa Sarai wrote:
> > 
> > > @@ -862,6 +870,8 @@ static int nd_jump_root(struct nameidata *nd)
> > >  void nd_jump_link(struct path *path)
> > >  {
> > >  	struct nameidata *nd = current->nameidata;
> > > +
> > > +	nd->last_magiclink.same_mnt = (nd->path.mnt == path->mnt);
> > >  	path_put(&nd->path);
> > >  
> > >  	nd->path = *path;
> > > @@ -1082,6 +1092,10 @@ const char *get_link(struct nameidata *nd)
> > >  		if (nd->flags & LOOKUP_MAGICLINK_JUMPED) {
> > >  			if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
> > >  				return ERR_PTR(-ELOOP);
> > > +			if (unlikely(nd->flags & LOOKUP_NO_XDEV)) {
> > > +				if (!nd->last_magiclink.same_mnt)
> > > +					return ERR_PTR(-EXDEV);
> > > +			}
> > >  		}
> > 
> > Ugh...  Wouldn't it be better to take that logics (some equivalent thereof)
> > into nd_jump_link()?  Or just have nd_jump_link() return an error...
> 
> This could be done, but the reason for stashing it away in
> last_magiclink is because of the future magic-link re-opening patches
> which can't be implemented like that without putting the open_flags
> inside nameidata (which was decided to be too ugly a while ago).
> 
> My point being that I could implement it this way for this series, but
> I'd have to implement something like last_magiclink when I end up
> re-posting the magic-link stuff in a few weeks.
> 
> Looking at all the nd_jump_link() users, the other option is to just
> disallow magic-link crossings entirely for LOOKUP_NO_XDEV. The only
> thing allowing them permits is to resolve file descriptors that are
> pointing to the same procfs mount -- and it's unclear to me how useful
> that really is (apparmorfs and nsfs will always give -EXDEV because
> aafs_mnt and nsfs_mnt are internal kernel vfsmounts).

I would rather keep the entire if (nd->flags & LOOKUP_MAGICLINK_JUMPED)
out of the get_link().  If you want to generate some error if
nd_jump_link() has been called, just do it right there.  The fewer
pieces of state need to be carried around, the better...

And as for opening them...  Why would you need full open_flags in there?
Details, please...

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 3/9] namei: LOOKUP_NO_XDEV: block mountpoint crossing
  2019-11-14  5:43       ` Al Viro
  2019-11-14  5:43         ` Al Viro
@ 2019-11-14 13:33         ` Aleksa Sarai
  2019-11-14 13:33           ` Aleksa Sarai
  1 sibling, 1 reply; 49+ messages in thread
From: Aleksa Sarai @ 2019-11-14 13:33 UTC (permalink / raw)
  To: Al Viro
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Ingo Molnar, Peter Zijlstra,
	Christian Brauner, David Drysdale, Andy Lutomirski,
	Linus Torvalds, Eric Biederman, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	Chanho Min, Oleg Nesterov, Rasmus Villemoes, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Christian Brauner, Aleksa Sarai,
	containers, linux-alpha, linux-api, libc-alpha, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-ia64, linux-kernel,
	linux-kselftest, linux-m68k, linux-mips, linux-parisc,
	linuxppc-dev, linux-s390, linux-sh, linux-xtensa, sparclinux

[-- Attachment #1: Type: text/plain, Size: 3959 bytes --]

On 2019-11-14, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Thu, Nov 14, 2019 at 03:49:45PM +1100, Aleksa Sarai wrote:
> > On 2019-11-13, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > On Tue, Nov 05, 2019 at 08:05:47PM +1100, Aleksa Sarai wrote:
> > > 
> > > > @@ -862,6 +870,8 @@ static int nd_jump_root(struct nameidata *nd)
> > > >  void nd_jump_link(struct path *path)
> > > >  {
> > > >  	struct nameidata *nd = current->nameidata;
> > > > +
> > > > +	nd->last_magiclink.same_mnt = (nd->path.mnt == path->mnt);
> > > >  	path_put(&nd->path);
> > > >  
> > > >  	nd->path = *path;
> > > > @@ -1082,6 +1092,10 @@ const char *get_link(struct nameidata *nd)
> > > >  		if (nd->flags & LOOKUP_MAGICLINK_JUMPED) {
> > > >  			if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
> > > >  				return ERR_PTR(-ELOOP);
> > > > +			if (unlikely(nd->flags & LOOKUP_NO_XDEV)) {
> > > > +				if (!nd->last_magiclink.same_mnt)
> > > > +					return ERR_PTR(-EXDEV);
> > > > +			}
> > > >  		}
> > > 
> > > Ugh...  Wouldn't it be better to take that logics (some equivalent thereof)
> > > into nd_jump_link()?  Or just have nd_jump_link() return an error...
> > 
> > This could be done, but the reason for stashing it away in
> > last_magiclink is because of the future magic-link re-opening patches
> > which can't be implemented like that without putting the open_flags
> > inside nameidata (which was decided to be too ugly a while ago).
> > 
> > My point being that I could implement it this way for this series, but
> > I'd have to implement something like last_magiclink when I end up
> > re-posting the magic-link stuff in a few weeks.
> > 
> > Looking at all the nd_jump_link() users, the other option is to just
> > disallow magic-link crossings entirely for LOOKUP_NO_XDEV. The only
> > thing allowing them permits is to resolve file descriptors that are
> > pointing to the same procfs mount -- and it's unclear to me how useful
> > that really is (apparmorfs and nsfs will always give -EXDEV because
> > aafs_mnt and nsfs_mnt are internal kernel vfsmounts).
> 
> I would rather keep the entire if (nd->flags & LOOKUP_MAGICLINK_JUMPED)
> out of the get_link().  If you want to generate some error if
> nd_jump_link() has been called, just do it right there.  The fewer
> pieces of state need to be carried around, the better...

Sure, I can make nd_jump_link() give -ELOOP and drop the current need
for LOOKUP_MAGICLINK_JUMPED -- if necessary we can re-add it for the
magic-link reopening patches.

> And as for opening them...  Why would you need full open_flags in there?
> Details, please...

I was referring to [1] which has been dropped from this series. I
misspoke -- you don't need the full open_flags, you just need acc_mode
in nameidata -- but from memory you (understandably) weren't in favour
of that either because it further muddled the open semantics with namei.

So the solution I went with was to stash away the i_mode of the
magiclink in nd->last_magiclink.mode (though to avoid a race which Jann
found, you actually need to recalculate it when you call nd_jump_link()
but that's a different topic) and then check it in trailing_magiclink().

However, I've since figured out that we need to restrict things like
bind-mounts and truncate() because they can be used to get around the
restrictions. I dropped that patch from this series so that I could work
on implementing the restrictions for the other relevant VFS syscalls
separately from openat2 (upgrade_mask will be re-added to open_how with
those patches).

My point was that AFAICS we will either have to have nd->acc_mode (or
something similar) or have nd->last_magiclink in order to implement the
magic-link reopening hardening.

[1]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v15 3/9] namei: LOOKUP_NO_XDEV: block mountpoint crossing
  2019-11-14 13:33         ` Aleksa Sarai
@ 2019-11-14 13:33           ` Aleksa Sarai
  0 siblings, 0 replies; 49+ messages in thread
From: Aleksa Sarai @ 2019-11-14 13:33 UTC (permalink / raw)
  To: Al Viro
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Ingo Molnar, Peter Zijlstra,
	Christian Brauner, David Drysdale, Andy Lutomirski,
	Linus Torvalds, Eric Biederman, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	Chanho Min, Oleg Nesterov, Rasmus Villemoes, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Christian Brauner, Aleksa Sarai,
	containers, linux-alpha, linux-api, libc-alpha, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-ia64, linux-kernel,
	linux-kselftest, linux-m68k, linux-mips, linux-parisc,
	linuxppc-dev, linux-s390, linux-sh, linux-xtensa, sparclinux

[-- Attachment #1: Type: text/plain, Size: 3959 bytes --]

On 2019-11-14, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Thu, Nov 14, 2019 at 03:49:45PM +1100, Aleksa Sarai wrote:
> > On 2019-11-13, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > On Tue, Nov 05, 2019 at 08:05:47PM +1100, Aleksa Sarai wrote:
> > > 
> > > > @@ -862,6 +870,8 @@ static int nd_jump_root(struct nameidata *nd)
> > > >  void nd_jump_link(struct path *path)
> > > >  {
> > > >  	struct nameidata *nd = current->nameidata;
> > > > +
> > > > +	nd->last_magiclink.same_mnt = (nd->path.mnt == path->mnt);
> > > >  	path_put(&nd->path);
> > > >  
> > > >  	nd->path = *path;
> > > > @@ -1082,6 +1092,10 @@ const char *get_link(struct nameidata *nd)
> > > >  		if (nd->flags & LOOKUP_MAGICLINK_JUMPED) {
> > > >  			if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
> > > >  				return ERR_PTR(-ELOOP);
> > > > +			if (unlikely(nd->flags & LOOKUP_NO_XDEV)) {
> > > > +				if (!nd->last_magiclink.same_mnt)
> > > > +					return ERR_PTR(-EXDEV);
> > > > +			}
> > > >  		}
> > > 
> > > Ugh...  Wouldn't it be better to take that logics (some equivalent thereof)
> > > into nd_jump_link()?  Or just have nd_jump_link() return an error...
> > 
> > This could be done, but the reason for stashing it away in
> > last_magiclink is because of the future magic-link re-opening patches
> > which can't be implemented like that without putting the open_flags
> > inside nameidata (which was decided to be too ugly a while ago).
> > 
> > My point being that I could implement it this way for this series, but
> > I'd have to implement something like last_magiclink when I end up
> > re-posting the magic-link stuff in a few weeks.
> > 
> > Looking at all the nd_jump_link() users, the other option is to just
> > disallow magic-link crossings entirely for LOOKUP_NO_XDEV. The only
> > thing allowing them permits is to resolve file descriptors that are
> > pointing to the same procfs mount -- and it's unclear to me how useful
> > that really is (apparmorfs and nsfs will always give -EXDEV because
> > aafs_mnt and nsfs_mnt are internal kernel vfsmounts).
> 
> I would rather keep the entire if (nd->flags & LOOKUP_MAGICLINK_JUMPED)
> out of the get_link().  If you want to generate some error if
> nd_jump_link() has been called, just do it right there.  The fewer
> pieces of state need to be carried around, the better...

Sure, I can make nd_jump_link() give -ELOOP and drop the current need
for LOOKUP_MAGICLINK_JUMPED -- if necessary we can re-add it for the
magic-link reopening patches.

> And as for opening them...  Why would you need full open_flags in there?
> Details, please...

I was referring to [1] which has been dropped from this series. I
misspoke -- you don't need the full open_flags, you just need acc_mode
in nameidata -- but from memory you (understandably) weren't in favour
of that either because it further muddled the open semantics with namei.

So the solution I went with was to stash away the i_mode of the
magiclink in nd->last_magiclink.mode (though to avoid a race which Jann
found, you actually need to recalculate it when you call nd_jump_link()
but that's a different topic) and then check it in trailing_magiclink().

However, I've since figured out that we need to restrict things like
bind-mounts and truncate() because they can be used to get around the
restrictions. I dropped that patch from this series so that I could work
on implementing the restrictions for the other relevant VFS syscalls
separately from openat2 (upgrade_mask will be re-added to open_how with
those patches).

My point was that AFAICS we will either have to have nd->acc_mode (or
something similar) or have nd->last_magiclink in order to implement the
magic-link reopening hardening.

[1]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, back to index

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-05  9:05 [PATCH v15 0/9] open: introduce openat2(2) syscall Aleksa Sarai
2019-11-05  9:05 ` [PATCH v15 1/9] namei: LOOKUP_NO_SYMLINKS: block symlink resolution Aleksa Sarai
2019-11-05  9:05 ` [PATCH v15 2/9] namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution Aleksa Sarai
2019-11-13  1:24   ` Al Viro
2019-11-13  1:24     ` Al Viro
2019-11-05  9:05 ` [PATCH v15 3/9] namei: LOOKUP_NO_XDEV: block mountpoint crossing Aleksa Sarai
2019-11-13  1:36   ` Al Viro
2019-11-13  1:36     ` Al Viro
2019-11-14  4:49     ` Aleksa Sarai
2019-11-14  4:49       ` Aleksa Sarai
2019-11-14  5:43       ` Al Viro
2019-11-14  5:43         ` Al Viro
2019-11-14 13:33         ` Aleksa Sarai
2019-11-14 13:33           ` Aleksa Sarai
2019-11-05  9:05 ` [PATCH v15 4/9] namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution Aleksa Sarai
2019-11-13  1:55   ` Al Viro
2019-11-13  1:55     ` Al Viro
2019-11-13  7:47     ` Aleksa Sarai
2019-11-13  7:47       ` Aleksa Sarai
2019-11-14  4:57       ` Aleksa Sarai
2019-11-14  4:57         ` Aleksa Sarai
2019-11-05  9:05 ` [PATCH v15 5/9] namei: LOOKUP_IN_ROOT: chroot-like " Aleksa Sarai
2019-11-13  2:03   ` Al Viro
2019-11-13  2:03     ` Al Viro
2019-11-13  2:44     ` Aleksa Sarai
2019-11-13  2:44       ` Aleksa Sarai
2019-11-13  2:59       ` Al Viro
2019-11-13  2:59         ` Al Viro
2019-11-13  3:55         ` Aleksa Sarai
2019-11-13  3:55           ` Aleksa Sarai
2019-11-05  9:05 ` [PATCH v15 6/9] namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution Aleksa Sarai
2019-11-13  2:09   ` Al Viro
2019-11-13  2:09     ` Al Viro
2019-11-13  7:52     ` Aleksa Sarai
2019-11-13  7:52       ` Aleksa Sarai
2019-11-05  9:05 ` [PATCH v15 7/9] open: introduce openat2(2) syscall Aleksa Sarai
2019-11-13  2:29   ` Al Viro
2019-11-13  2:29     ` Al Viro
2019-11-13  2:35     ` Aleksa Sarai
2019-11-13  2:35       ` Aleksa Sarai
2019-11-05  9:05 ` [PATCH v15 8/9] selftests: add openat2(2) selftests Aleksa Sarai
2019-11-05  9:05 ` [PATCH v15 9/9] Documentation: path-lookup: mention LOOKUP_MAGICLINK_JUMPED Aleksa Sarai
2019-11-11 13:24 ` [PATCH v15 0/9] open: introduce openat2(2) syscall Aleksa Sarai
2019-11-12 23:01   ` Kees Cook
2019-11-12 23:01     ` Kees Cook
2019-11-12 23:06     ` Christian Brauner
2019-11-12 23:06       ` Christian Brauner
2019-11-13  0:46     ` Aleksa Sarai
2019-11-13  0:46       ` Aleksa Sarai

Linux-parisc archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-parisc/0 linux-parisc/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-parisc linux-parisc/ https://lore.kernel.org/linux-parisc \
		linux-parisc@vger.kernel.org
	public-inbox-index linux-parisc

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-parisc


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git