On 2019-11-05, Aleksa Sarai wrote: > This patchset is being developed here: > > > Patch changelog: > v15: > * Fix code style for LOOKUP_IN_ROOT handling in path_init(). [Linus Torvalds] > * Split out patches for each individual LOOKUP flag. > * Reword commit messages to give more background information about the > series, as well as mention the semantics of each flag in more detail. > v14: > > v13: > v12: > v11: > > v10: > v09: > v08: > v07: > v06: > v05: > v04: > v03: > v02: > v01: > > For a very long time, extending openat(2) with new features has been > incredibly frustrating. This stems from the fact that openat(2) is > possibly the most famous counter-example to the mantra "don't silently > accept garbage from userspace" -- it doesn't check whether unknown flags > are present[1]. > > This means that (generally) the addition of new flags to openat(2) has > been fraught with backwards-compatibility issues (O_TMPFILE has to be > defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old > kernels gave errors, since it's insecure to silently ignore the > flag[2]). All new security-related flags therefore have a tough road to > being added to openat(2). > > Furthermore, the need for some sort of control over VFS's path resolution (to > avoid malicious paths resulting in inadvertent breakouts) has been a very > long-standing desire of many userspace applications. This patchset is a revival > of Al Viro's old AT_NO_JUMPS[3] patchset (which was a variant of David > Drysdale's O_BENEATH patchset[4] which was a spin-off of the Capsicum > project[5]) with a few additions and changes made based on the previous > discussion within [6] as well as others I felt were useful. > > In line with the conclusions of the original discussion of AT_NO_JUMPS, the > flag has been split up into separate flags. However, instead of being an > openat(2) flag it is provided through a new syscall openat2(2) which provides > several other improvements to the openat(2) interface (see the patch > description for more details). The following new LOOKUP_* flags are added: > > * LOOKUP_NO_XDEV blocks all mountpoint crossings (upwards, downwards, > or through absolute links). Absolute pathnames alone in openat(2) do not > trigger this. Magic-link traversal which implies a vfsmount jump is also > blocked (though magic-link jumps on the same vfsmount are permitted). > > * LOOKUP_NO_MAGICLINKS blocks resolution through /proc/$pid/fd-style > links. This is done by blocking the usage of nd_jump_link() during > resolution in a filesystem. The term "magic-links" is used to match > with the only reference to these links in Documentation/, but I'm > happy to change the name. > > It should be noted that this is different to the scope of > ~LOOKUP_FOLLOW in that it applies to all path components. However, > you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it > will *not* fail (assuming that no parent component was a > magic-link), and you will have an fd for the magic-link. > > In order to correctly detect magic-links, the introduction of a new > LOOKUP_MAGICLINK_JUMPED state flag was required. > > * LOOKUP_BENEATH disallows escapes to outside the starting dirfd's > tree, using techniques such as ".." or absolute links. Absolute > paths in openat(2) are also disallowed. Conceptually this flag is to > ensure you "stay below" a certain point in the filesystem tree -- > but this requires some additional to protect against various races > that would allow escape using "..". > > Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it > can trivially beam you around the filesystem (breaking the > protection). In future, there might be similar safety checks done as > in LOOKUP_IN_ROOT, but that requires more discussion. > > In addition, two new flags are added that expand on the above ideas: > > * LOOKUP_NO_SYMLINKS does what it says on the tin. No symlink > resolution is allowed at all, including magic-links. Just as with > LOOKUP_NO_MAGICLINKS this can still be used with NOFOLLOW to open an > fd for the symlink as long as no parent path had a symlink > component. > > * LOOKUP_IN_ROOT is an extension of LOOKUP_BENEATH that, rather than > blocking attempts to move past the root, forces all such movements > to be scoped to the starting point. This provides chroot(2)-like > protection but without the cost of a chroot(2) for each filesystem > operation, as well as being safe against race attacks that chroot(2) > is not. > > If a race is detected (as with LOOKUP_BENEATH) then an error is > generated, and similar to LOOKUP_BENEATH it is not permitted to cross > magic-links with LOOKUP_IN_ROOT. > > The primary need for this is from container runtimes, which > currently need to do symlink scoping in userspace[7] when opening > paths in a potentially malicious container. There is a long list of > CVEs that could have bene mitigated by having RESOLVE_THIS_ROOT > (such as CVE-2017-1002101, CVE-2017-1002102, CVE-2018-15664, and > CVE-2019-5736, just to name a few). > > In order to make all of the above more usable, I'm working on > libpathrs[8] which is a C-friendly library for safe path resolution. It > features a userspace-emulated backend if the kernel doesn't support > openat2(2). Hopefully we can get userspace to switch to using it, and > thus get openat2(2) support for free once it's ready. > > Future work would include implementing things like RESOLVE_NO_AUTOMOUNT and > possibly a RESOLVE_NO_REMOTE (to allow programs to be sure they don't hit DoSes > though stale NFS handles). > > [1]: https://lwn.net/Articles/588444/ > [2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com > [3]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk > [4]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@google.com > [5]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@google.com > [6]: https://lwn.net/Articles/723057/ > [7]: https://github.com/cyphar/filepath-securejoin > [8]: https://github.com/openSUSE/libpathrs > > The current draft of the openat2(2) man-page is included below. > > --8<--------------------------------------------------------------------------- > OPENAT2(2) Linux Programmer's Manual OPENAT2(2) > > NAME > openat2 - open and possibly create a file (extended) > > SYNOPSIS > #include > #include > #include > > int openat2(int dirfd, const char *pathname, struct open_how *how, size_t size); > > Note: There is no glibc wrapper for this system call; see NOTES. > > DESCRIPTION > The openat2() system call opens the file specified by pathname. If the specified file > does not exist, it may optionally (if O_CREAT is specified in how.flags) be created by > openat2(). > > As with openat(2), if pathname is relative, then it is interpreted relative to the direc- > tory referred to by the file descriptor dirfd (or the current working directory of the > calling process, if dirfd is the special value AT_FDCWD.) If pathname is absolute, then > dirfd is ignored (unless how.resolve contains RESOLVE_IN_ROOT, in which case pathname is > resolved relative to dirfd.) > > The openat2() system call is an extension of openat(2) and provides a superset of its > functionality. Rather than taking a single flag argument, an extensible structure (how) > is passed instead to allow for future extensions. size must be set to sizeof(struct > open_how), to facilitate future extensions (see the "Extensibility" section of the NOTES > for more detail on how extensions are handled.) > > The open_how structure > The following structure indicates how pathname should be opened, and acts as a superset of > the flag and mode arguments to openat(2). > > struct open_how { > __aligned_u64 flags; /* O_* flags. */ > __u16 mode; /* Mode for O_{CREAT,TMPFILE}. */ > __u16 __padding[3]; /* Must be zeroed. */ > __aligned_u64 resolve; /* RESOLVE_* flags. */ > }; > > Any future extensions to openat2() will be implemented as new fields appended to the above > structure (or through reuse of pre-existing padding space), with the zero value of the new > fields acting as though the extension were not present. > > The meaning of each field is as follows: > > flags > The file creation and status flags to use for this operation. All of the > O_* flags defined for openat(2) are valid openat2() flag values. > > Unlike openat(2), it is an error to provide openat2() unknown or conflicting > flags in flags. > > mode > File mode for the new file, with identical semantics to the mode argument to > openat(2). However, unlike openat(2), it is an error to provide openat2() > with a mode which contains bits other than 0777. > > It is an error to provide openat2() a non-zero mode if flags does not con- > tain O_CREAT or O_TMPFILE. > > resolve > Change how the components of pathname will be resolved (see path_resolu- > tion(7) for background information.) The primary use case for these flags > is to allow trusted programs to restrict how untrusted paths (or paths in- > side untrusted directories) are resolved. The full list of resolve flags is > given below. > > RESOLVE_NO_XDEV > Disallow traversal of mount points during path resolution (including > all bind mounts). > > Users of this flag are encouraged to make its use configurable (un- > less it is used for a specific security purpose), as bind mounts are > very widely used by end-users. Setting this flag indiscrimnately for > all uses of openat2() may result in spurious errors on previously- > functional systems. > > RESOLVE_NO_SYMLINKS > Disallow resolution of symbolic links during path resolution. This > option implies RESOLVE_NO_MAGICLINKS. > > If the trailing component is a symbolic link, and flags contains both > O_PATH and O_NOFOLLOW, then an O_PATH file descriptor referencing the > symbolic link will be returned. > > Users of this flag are encouraged to make its use configurable (un- > less it is used for a specific security purpose), as symbolic links > are very widely used by end-users. Setting this flag indiscrimnately > for all uses of openat2() may result in spurious errors on previ- > ously-functional systems. > > RESOLVE_NO_MAGICLINKS > Disallow all magic link resolution during path resolution. > > If the trailing component is a magic link, and flags contains both > O_PATH and O_NOFOLLOW, then an O_PATH file descriptor referencing the > magic link will be returned. > > Magic-links are symbolic link-like objects that are most notably > found in proc(5) (examples include /proc/[pid]/exe and > /proc/[pid]/fd/*.) Due to the potential danger of unknowingly open- > ing these magic links, it may be preferable for users to disable > their resolution entirely (see symboliclink(7) for more details.) > > RESOLVE_BENEATH > Do not permit the path resolution to succeed if any component of the > resolution is not a descendant of the directory indicated by dirfd. > This results in absolute symbolic links (and absolute values of path- > name) to be rejected. > > Currently, this flag also disables magic link resolution. However, > this may change in the future. The caller should explicitly specify > RESOLVE_NO_MAGICLINKS to ensure that magic links are not resolved. > > RESOLVE_IN_ROOT > Treat dirfd as the root directory while resolving pathname (as though > the user called chroot(2) with dirfd as the argument.) Absolute sym- > bolic links and ".." path components will be scoped to dirfd. If > pathname is an absolute path, it is also treated relative to dirfd. > > However, unlike chroot(2) (which changes the filesystem root perma- > nently for a process), RESOLVE_IN_ROOT allows a program to effi- > ciently restrict path resolution for only certain operations. It > also has several hardening features (such detecting escape attempts > during .. resolution) which chroot(2) does not. > > Currently, this flag also disables magic link resolution. However, > this may change in the future. The caller should explicitly specify > RESOLVE_NO_MAGICLINKS to ensure that magic links are not resolved. > > It is an error to provide openat2() unknown flags in resolve. > > RETURN VALUE > On success, a new file descriptor is returned. On error, -1 is returned, and errno is set > appropriately. > > ERRORS > The set of errors returned by openat2() includes all of the errors returned by openat(2), > as well as the following additional errors: > > EINVAL An unknown flag or invalid value was specified in how. > > EINVAL mode is non-zero, but flags does not contain O_CREAT or O_TMPFILE. > > EINVAL size was smaller than any known version of struct open_how. > > E2BIG An extension was specified in how, which the current kernel does not support (see > the "Extensibility" section of the NOTES for more detail on how extensions are han- > dled.) > > EAGAIN resolve contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and the kernel could > not ensure that a ".." component didn't escape (due to a race condition or poten- > tial attack.) Callers may choose to retry the openat2() call. > > EXDEV resolve contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and an escape from the > root during path resolution was detected. > > EXDEV resolve contains RESOLVE_NO_XDEV, and a path component attempted to cross a mount > point. > > ELOOP resolve contains RESOLVE_NO_SYMLINKS, and one of the path components was a symbolic > link (or magic link). > > ELOOP resolve contains RESOLVE_NO_MAGICLINKS, and one of the path components was a magic > link. > > VERSIONS > openat2() was added to Linux in kernel 5.FOO. > > CONFORMING TO > This system call is Linux-specific. > > The semantics of RESOLVE_BENEATH were modelled after FreeBSD's O_BENEATH. > > NOTES > Glibc does not provide a wrapper for this system call; call it using systemcall(2). > > Extensibility > In order to allow for struct open_how to be extended in future kernel revisions, openat2() > requires userspace to specify the size of struct open_how structure they are passing. By > providing this information, it is possible for openat2() to provide both forwards- and > backwards-compatibility — with size acting as an implicit version number (because new ex- > tension fields will always be appended, the size will always increase.) This extensibil- > ity design is very similar to other system calls such as perf_setattr(2), > perf_event_open(2), and clone(3). > > If we let usize be the size of the structure according to userspace and ksize be the size > of the structure which the kernel supports, then there are only three cases to consider: > > * If ksize equals usize, then there is no version mismatch and how can be used > verbatim. > > * If ksize is larger than usize, then there are some extensions the kernel sup- > ports which the userspace program is unaware of. Because all extensions must > have their zero values be a no-op, the kernel treats all of the extension fields > not set by userspace to have zero values. This provides backwards-compatibil- > ity. > > * If ksize is smaller than usize, then there are some extensions which the > userspace program is aware of but the kernel does not support. Because all ex- > tensions must have their zero values be a no-op, the kernel can safely ignore > the unsupported extension fields if they are all-zero. If any unsupported ex- > tension fields are non-zero, then -1 is returned and errno is set to E2BIG. > This provides forwards-compatibility. > > Therefore, most userspace programs will not need to have any special handling of exten- > sions. However, if a userspace program wishes to determine what extensions the running > kernel supports, they may conduct a binary search on size (to find the largest value which > doesn't produce an error of E2BIG.) > > SEE ALSO > openat(2), path_resolution(7), symlink(7) > > Linux 2019-11-05 OPENAT2(2) > --8<--------------------------------------------------------------------------- > > Aleksa Sarai (9): > namei: LOOKUP_NO_SYMLINKS: block symlink resolution > namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution > namei: LOOKUP_NO_XDEV: block mountpoint crossing > namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution > namei: LOOKUP_IN_ROOT: chroot-like scoped resolution > namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution > open: introduce openat2(2) syscall > selftests: add openat2(2) selftests > Documentation: path-lookup: mention LOOKUP_MAGICLINK_JUMPED > > CREDITS | 4 +- > Documentation/filesystems/path-lookup.rst | 18 +- > arch/alpha/kernel/syscalls/syscall.tbl | 1 + > arch/arm/tools/syscall.tbl | 1 + > arch/arm64/include/asm/unistd.h | 2 +- > arch/arm64/include/asm/unistd32.h | 2 + > arch/ia64/kernel/syscalls/syscall.tbl | 1 + > arch/m68k/kernel/syscalls/syscall.tbl | 1 + > arch/microblaze/kernel/syscalls/syscall.tbl | 1 + > arch/mips/kernel/syscalls/syscall_n32.tbl | 1 + > arch/mips/kernel/syscalls/syscall_n64.tbl | 1 + > arch/mips/kernel/syscalls/syscall_o32.tbl | 1 + > arch/parisc/kernel/syscalls/syscall.tbl | 1 + > arch/powerpc/kernel/syscalls/syscall.tbl | 1 + > arch/s390/kernel/syscalls/syscall.tbl | 1 + > arch/sh/kernel/syscalls/syscall.tbl | 1 + > arch/sparc/kernel/syscalls/syscall.tbl | 1 + > arch/x86/entry/syscalls/syscall_32.tbl | 1 + > arch/x86/entry/syscalls/syscall_64.tbl | 1 + > arch/xtensa/kernel/syscalls/syscall.tbl | 1 + > fs/namei.c | 176 +++++- > fs/open.c | 149 +++-- > include/linux/fcntl.h | 12 +- > include/linux/namei.h | 11 + > include/linux/syscalls.h | 3 + > include/uapi/asm-generic/unistd.h | 5 +- > include/uapi/linux/fcntl.h | 41 ++ > tools/testing/selftests/Makefile | 1 + > tools/testing/selftests/openat2/.gitignore | 1 + > tools/testing/selftests/openat2/Makefile | 8 + > tools/testing/selftests/openat2/helpers.c | 109 ++++ > tools/testing/selftests/openat2/helpers.h | 107 ++++ > .../testing/selftests/openat2/openat2_test.c | 316 +++++++++++ > .../selftests/openat2/rename_attack_test.c | 160 ++++++ > .../testing/selftests/openat2/resolve_test.c | 523 ++++++++++++++++++ > 35 files changed, 1591 insertions(+), 73 deletions(-) > create mode 100644 tools/testing/selftests/openat2/.gitignore > create mode 100644 tools/testing/selftests/openat2/Makefile > create mode 100644 tools/testing/selftests/openat2/helpers.c > create mode 100644 tools/testing/selftests/openat2/helpers.h > create mode 100644 tools/testing/selftests/openat2/openat2_test.c > create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c > create mode 100644 tools/testing/selftests/openat2/resolve_test.c > > > base-commit: a99d8080aaf358d5d23581244e5da23b35e340b9 Ping -- this patch hasn't been touched for a week. Thanks. -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH