Patch changelog: v6: * Drop O_* flags API to the new LOOKUP_ path scoping bits and instead introduce resolveat(2) as an alternative method of obtaining an O_PATH. The justification for this is included in patch 6 (though switching back to O_* flags is trivial). v5: * In response to CVE-2019-5736 (one of the vectors showed that open(2)+fexec(3) cannot be used to scope binfmt_script's implicit open_exec()), AT_* flags have been re-added and are now piped through to binfmt_script (and other binfmt_* that use open_exec) but are only supported for execveat(2) for now. v4: * Remove AT_* flag reservations, as they require more discussion. * Switch to path_is_under() over __d_path() for breakout checking. * Make O_XDEV no longer block openat("/tmp", "/", O_XDEV) -- dirfd is now ignored for absolute paths to match other flags. * Improve the dirfd_path_init() refactor and move it to a separate commit. * Remove reference to Linux-capsicum. * Switch "proclink" name to "magic link". v3: [resend] v2: * Made ".." resolution with AT_THIS_ROOT and AT_BENEATH safe(r) with some semi-aggressive __d_path checking (see patch 3). * Disallowed "proclinks" with AT_THIS_ROOT and AT_BENEATH, in the hopes they can be re-enabled once safe. * Removed the selftests as they will be reimplemented as xfstests. * Removed stat(2) support, since you can already get it through O_PATH and fstatat(2). The need for some sort of control over VFS's path resolution (to avoid malicious paths resulting in inadvertent breakouts) has been a very long-standing desire of many userspace applications. This patchset is a revival of Al Viro's old AT_NO_JUMPS[1,2] patchset (which was a variant of David Drysdale's O_BENEATH patchset[3] which was a spin-off of the Capsicum project[4]) with a few additions and changes made based on the previous discussion within [5] as well as others I felt were useful. In line with the conclusions of the original discussion of AT_NO_JUMPS, the flag has been split up into separate flags. However, instead of being an openat(2) flag it is provided through a new syscall resolveat(2) which provides an alternative way to get an O_PATH file descriptor (the reasoning for doing this is included in patch 6). The following new LOOKUP_ (and corresponding uapi) flags are added: * LOOKUP_XDEV blocks all mountpoint crossings (upwards, downwards, or through absolute links). Absolute pathnames alone in openat(2) do not trigger this. * LOOKUP_NO_MAGICLINKS blocks resolution through /proc/$pid/fd-style links. This is done by blocking the usage of nd_jump_link() during resolution in a filesystem. The term "magic links" is used to match with the only reference to these links in Documentation/, but I'm happy to change the name. It should be noted that this is different to the scope of ~LOOKUP_FOLLOW in that it applies to all path components. However, you can do resolveat(NOFOLLOW|NO_MAGICLINKS) on a "magic link" and it will *not* fail (assuming that no parent component was a "magic link"), and you will have an fd for the "magic link". * LOOKUP_BENEATH disallows escapes to outside the starting dirfd's tree, using techniques such as ".." or absolute links. Absolute paths in openat(2) are also disallowed. Conceptually this flag is to ensure you "stay below" a certain point in the filesystem tree -- but this requires some additional to protect against various races that would allow escape using ".." (see patch 4 for more detail). Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it can trivially beam you around the filesystem (breaking the protection). In future, there might be similar safety checks as in patch 4, but that requires more discussion. In addition, two new flags were added that expand on the above ideas: * LOOKUP_NO_SYMLINKS does what it says on the tin. No symlink resolution is allowed at all, including "magic links". Just as with LOOKUP_NO_MAGICLINKS this can still be used with NOFOLLOW to open an fd for the symlink as long as no parent path had a symlink component. * LOOKUP_IN_ROOT is an extension of LOOKUP_BENEATH that, rather than blocking attempts to move past the root, forces all such movements to be scoped to the starting point. This provides chroot(2)-like protection but without the cost of a chroot(2) for each filesystem operation, as well as being safe against race attacks that chroot(2) is not. If a race is detected (as with LOOKUP_BENEATH) then an error is generated, and similar to LOOKUP_BENEATH it is not permitted to cross "magic links" with LOOKUP_IN_ROOT. The primary need for this is from container runtimes, which currently need to do symlink scoping in userspace[6] when opening paths in a potentially malicious container. There is a long list of CVEs that could have bene mitigated by having O_THISROOT (such as CVE-2017-1002101, CVE-2017-1002102, CVE-2018-15664, and CVE-2019-5736, just to name a few). In addition, a mirror set of AT_* flags have been added (though currently these are only supported for execveat(2) -- and not for any other syscall). The need for these is explained in patch 5 (it's motivated by CVE-2019-5736). Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: David Howells <dhowells@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Christian Brauner <christian@brauner.io> Cc: David Drysdale <drysdale@google.com> Cc: Tycho Andersen <tycho@tycho.ws> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: <containers@lists.linux-foundation.org> Cc: <linux-fsdevel@vger.kernel.org> Cc: <linux-api@vger.kernel.org> [1]: https://lwn.net/Articles/721443/ [2]: https://lore.kernel.org/patchwork/patch/784221/ [3]: https://lwn.net/Articles/619151/ [4]: https://lwn.net/Articles/603929/ [5]: https://lwn.net/Articles/723057/ [6]: https://github.com/cyphar/filepath-securejoin Aleksa Sarai (6): namei: split out nd->dfd handling to dirfd_path_init namei: O_BENEATH-style path resolution flags namei: LOOKUP_IN_ROOT: chroot-like path resolution namei: aggressively check for nd->root escape on ".." resolution binfmt_*: scope path resolution of interpreters namei: resolveat(2) syscall arch/alpha/kernel/syscalls/syscall.tbl | 1 + arch/arm/tools/syscall.tbl | 1 + arch/ia64/kernel/syscalls/syscall.tbl | 1 + arch/m68k/kernel/syscalls/syscall.tbl | 1 + arch/microblaze/kernel/syscalls/syscall.tbl | 1 + arch/mips/kernel/syscalls/syscall_n32.tbl | 1 + arch/mips/kernel/syscalls/syscall_n64.tbl | 1 + arch/mips/kernel/syscalls/syscall_o32.tbl | 1 + arch/parisc/kernel/syscalls/syscall.tbl | 1 + arch/powerpc/kernel/syscalls/syscall.tbl | 1 + arch/s390/kernel/syscalls/syscall.tbl | 1 + arch/sh/kernel/syscalls/syscall.tbl | 1 + arch/sparc/kernel/syscalls/syscall.tbl | 1 + arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + arch/xtensa/kernel/syscalls/syscall.tbl | 1 + fs/binfmt_elf.c | 2 +- fs/binfmt_elf_fdpic.c | 2 +- fs/binfmt_em86.c | 4 +- fs/binfmt_misc.c | 2 +- fs/binfmt_script.c | 2 +- fs/exec.c | 26 +- fs/namei.c | 251 +++++++++++++++----- include/linux/binfmts.h | 1 + include/linux/fs.h | 9 +- include/linux/namei.h | 8 + include/uapi/linux/fcntl.h | 18 ++ 27 files changed, 270 insertions(+), 71 deletions(-) -- 2.21.0
Previously, path_init's handling of *at(dfd, ...) was only done once, but with O_BENEATH (and O_THISROOT) we have to parse the initial nd->path at different times (before or after absolute path handling) depending on whether we have been asked to scope resolution within a root. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> --- fs/namei.c | 103 ++++++++++++++++++++++++++++++----------------------- 1 file changed, 59 insertions(+), 44 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index 5ebd64b21970..2a91b72aa5e9 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -2166,9 +2166,59 @@ static int link_path_walk(const char *name, struct nameidata *nd) } } +/* + * Configure nd->path based on the nd->dfd. This is only used as part of + * path_init(). + */ +static inline int dirfd_path_init(struct nameidata *nd) +{ + if (nd->dfd == AT_FDCWD) { + if (nd->flags & LOOKUP_RCU) { + struct fs_struct *fs = current->fs; + unsigned seq; + + do { + seq = read_seqcount_begin(&fs->seq); + nd->path = fs->pwd; + nd->inode = nd->path.dentry->d_inode; + nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq); + } while (read_seqcount_retry(&fs->seq, seq)); + } else { + get_fs_pwd(current->fs, &nd->path); + nd->inode = nd->path.dentry->d_inode; + } + } else { + /* Caller must check execute permissions on the starting path component */ + struct fd f = fdget_raw(nd->dfd); + struct dentry *dentry; + + if (!f.file) + return -EBADF; + + dentry = f.file->f_path.dentry; + + if (*nd->name->name && unlikely(!d_can_lookup(dentry))) { + fdput(f); + return -ENOTDIR; + } + + nd->path = f.file->f_path; + if (nd->flags & LOOKUP_RCU) { + nd->inode = nd->path.dentry->d_inode; + nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq); + } else { + path_get(&nd->path); + nd->inode = nd->path.dentry->d_inode; + } + fdput(f); + } + return 0; +} + /* must be paired with terminate_walk() */ static const char *path_init(struct nameidata *nd, unsigned flags) { + int error; const char *s = nd->name->name; if (!*s) @@ -2202,52 +2252,17 @@ static const char *path_init(struct nameidata *nd, unsigned flags) nd->m_seq = read_seqbegin(&mount_lock); if (*s == '/') { - set_root(nd); - if (likely(!nd_jump_root(nd))) - return s; - return ERR_PTR(-ECHILD); - } else if (nd->dfd == AT_FDCWD) { - if (flags & LOOKUP_RCU) { - struct fs_struct *fs = current->fs; - unsigned seq; - - do { - seq = read_seqcount_begin(&fs->seq); - nd->path = fs->pwd; - nd->inode = nd->path.dentry->d_inode; - nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq); - } while (read_seqcount_retry(&fs->seq, seq)); - } else { - get_fs_pwd(current->fs, &nd->path); - nd->inode = nd->path.dentry->d_inode; - } - return s; - } else { - /* Caller must check execute permissions on the starting path component */ - struct fd f = fdget_raw(nd->dfd); - struct dentry *dentry; - - if (!f.file) - return ERR_PTR(-EBADF); - - dentry = f.file->f_path.dentry; - - if (*s && unlikely(!d_can_lookup(dentry))) { - fdput(f); - return ERR_PTR(-ENOTDIR); - } - - nd->path = f.file->f_path; - if (flags & LOOKUP_RCU) { - nd->inode = nd->path.dentry->d_inode; - nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq); - } else { - path_get(&nd->path); - nd->inode = nd->path.dentry->d_inode; - } - fdput(f); + if (likely(!nd->root.mnt)) + set_root(nd); + error = nd_jump_root(nd); + if (unlikely(error)) + s = ERR_PTR(error); return s; } + error = dirfd_path_init(nd); + if (unlikely(error)) + return ERR_PTR(error); + return s; } static const char *trailing_symlink(struct nameidata *nd) -- 2.21.0
Add the following flags to allow various restrictions on path resolution (these affect the *entire* resolution, rather than just the final path component -- as is the case with most other AT_* flags). The primary justification for these flags is to allow for programs to be far more strict about how they want path resolution to handle symlinks, mountpoint crossings, and paths that escape the dirfd (through an absolute path or ".." shenanigans). This is of particular concern to container runtimes that want to be very careful about malicious root filesystems that a container's init might have screwed around with (and there is no real way to protect against this in userspace if you consider potential races against a malicious container's init). More classical applications (which have their own potentially buggy userspace path sanitisation code) include web servers, archive extraction tools, network file servers, and so on. These flags are exposed to userspace in a later patchset. * LOOKUP_XDEV: Disallow mount-point crossing (both *down* into one, or *up* from one). The primary "scoping" use is to blocking resolution that crosses a bind-mount, which has a similar property to a symlink (in the way that it allows for escape from the starting-point). Since it is not possible to differentiate bind-mounts However since bind-mounting requires privileges (in ways symlinks don't) this has been split from LOOKUP_BENEATH. The naming is based on "find -xdev" as well as -EXDEV (though find(1) doesn't walk upwards, the semantics seem obvious). * LOOKUP_NO_MAGICLINKS: Disallows ->get_link "symlink" jumping. This is a very specific restriction, and it exists because /proc/$pid/fd/... "symlinks" allow for access outside nd->root and pose risk to container runtimes that don't want to be tricked into accessing a host path (but do want to allow no-funny-business symlink resolution). * LOOKUP_NO_SYMLINKS: Disallows symlink jumping *of any kind*. Implies LOOKUP_NO_MAGICLINKS (obviously). * LOOKUP_BENEATH: Disallow "escapes" from the starting point of the filesystem tree during resolution (you must stay "beneath" the starting point at all times). Currently this is done by disallowing ".." and absolute paths (either in the given path or found during symlink resolution) entirely, as well as all "magic link" jumping. The wholesale banning of ".." is because it is currently not safe to allow ".." resolution (races can cause the path to be moved outside of the root -- this is conceptually similar to historical chroot(2) escape attacks). Future patches in this series will address this, and will re-enable ".." resolution once it is safe. With those patches, ".." resolution will only be allowed if it remains in the root throughout resolution (such as "a/../b" not "a/../../outside/b"). The banning of "magic link" jumping is done because it is not clear whether semantically they should be allowed -- while some "magic links" are safe there are many that can cause escapes (and once a resolution is outside of the root, O_BENEATH will no longer detect it). Future patches may re-enable "magic link" jumping when such jumps would remain inside the root. The LOOKUP_NO_*LINK flags return -ELOOP if path resolution would violates their requirement, while the others all return -EXDEV. This is a refresh of Al's AT_NO_JUMPS patchset[1] (which was a variation on David Drysdale's O_BENEATH patchset[2], which in turn was based on the Capsicum project[3]). Input from Linus and Andy in the AT_NO_JUMPS thread[4] determined most of the API changes made in this refresh. [1]: https://lwn.net/Articles/721443/ [2]: https://lwn.net/Articles/619151/ [3]: https://lwn.net/Articles/603929/ [4]: https://lwn.net/Articles/723057/ Cc: Eric Biederman <ebiederm@xmission.com> Cc: Christian Brauner <christian@brauner.io> Cc: Kees Cook <keescook@chromium.org> Suggested-by: David Drysdale <drysdale@google.com> Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Suggested-by: Andy Lutomirski <luto@kernel.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> --- fs/namei.c | 76 ++++++++++++++++++++++++++++++++++++------- include/linux/namei.h | 7 ++++ 2 files changed, 72 insertions(+), 11 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index 2a91b72aa5e9..e13a02720a9d 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -843,6 +843,12 @@ static inline void path_to_nameidata(const struct path *path, static int nd_jump_root(struct nameidata *nd) { + if (unlikely(nd->flags & LOOKUP_BENEATH)) + return -EXDEV; + if (unlikely(nd->flags & LOOKUP_XDEV)) { + if (nd->path.mnt != nd->root.mnt) + return -EXDEV; + } if (nd->flags & LOOKUP_RCU) { struct dentry *d; nd->path = nd->root; @@ -1051,6 +1057,9 @@ const char *get_link(struct nameidata *nd) int error; const char *res; + if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS)) + return ERR_PTR(-ELOOP); + if (!(nd->flags & LOOKUP_RCU)) { touch_atime(&last->link); cond_resched(); @@ -1081,14 +1090,23 @@ const char *get_link(struct nameidata *nd) } else { res = get(dentry, inode, &last->done); } + /* If we just jumped it was because of a procfs-style link. */ + if (unlikely(nd->flags & LOOKUP_JUMPED)) { + if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS)) + return ERR_PTR(-ELOOP); + /* Not currently safe. */ + if (unlikely(nd->flags & LOOKUP_BENEATH)) + return ERR_PTR(-EXDEV); + } if (IS_ERR_OR_NULL(res)) return res; } if (*res == '/') { if (!nd->root.mnt) set_root(nd); - if (unlikely(nd_jump_root(nd))) - return ERR_PTR(-ECHILD); + error = nd_jump_root(nd); + if (unlikely(error)) + return ERR_PTR(error); while (unlikely(*++res == '/')) ; } @@ -1269,12 +1287,16 @@ static int follow_managed(struct path *path, struct nameidata *nd) break; } - if (need_mntput && path->mnt == mnt) - mntput(path->mnt); + if (need_mntput) { + if (path->mnt == mnt) + mntput(path->mnt); + if (unlikely(nd->flags & LOOKUP_XDEV)) + ret = -EXDEV; + else + nd->flags |= LOOKUP_JUMPED; + } if (ret == -EISDIR || !ret) ret = 1; - if (need_mntput) - nd->flags |= LOOKUP_JUMPED; if (unlikely(ret < 0)) path_put_conditional(path, nd); return ret; @@ -1331,6 +1353,8 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path, mounted = __lookup_mnt(path->mnt, path->dentry); if (!mounted) break; + if (unlikely(nd->flags & LOOKUP_XDEV)) + return false; path->mnt = &mounted->mnt; path->dentry = mounted->mnt.mnt_root; nd->flags |= LOOKUP_JUMPED; @@ -1351,8 +1375,11 @@ static int follow_dotdot_rcu(struct nameidata *nd) struct inode *inode = nd->inode; while (1) { - if (path_equal(&nd->path, &nd->root)) + if (path_equal(&nd->path, &nd->root)) { + if (unlikely(nd->flags & LOOKUP_BENEATH)) + return -EXDEV; break; + } if (nd->path.dentry != nd->path.mnt->mnt_root) { struct dentry *old = nd->path.dentry; struct dentry *parent = old->d_parent; @@ -1377,6 +1404,8 @@ static int follow_dotdot_rcu(struct nameidata *nd) return -ECHILD; if (&mparent->mnt == nd->path.mnt) break; + if (unlikely(nd->flags & LOOKUP_XDEV)) + return -EXDEV; /* we know that mountpoint was pinned */ nd->path.dentry = mountpoint; nd->path.mnt = &mparent->mnt; @@ -1391,6 +1420,8 @@ static int follow_dotdot_rcu(struct nameidata *nd) return -ECHILD; if (!mounted) break; + if (unlikely(nd->flags & LOOKUP_XDEV)) + return -EXDEV; nd->path.mnt = &mounted->mnt; nd->path.dentry = mounted->mnt.mnt_root; inode = nd->path.dentry->d_inode; @@ -1479,8 +1510,11 @@ static int path_parent_directory(struct path *path) static int follow_dotdot(struct nameidata *nd) { while(1) { - if (path_equal(&nd->path, &nd->root)) + if (path_equal(&nd->path, &nd->root)) { + if (unlikely(nd->flags & LOOKUP_BENEATH)) + return -EXDEV; break; + } if (nd->path.dentry != nd->path.mnt->mnt_root) { int ret = path_parent_directory(&nd->path); if (ret) @@ -1489,6 +1523,8 @@ static int follow_dotdot(struct nameidata *nd) } if (!follow_up(&nd->path)) break; + if (unlikely(nd->flags & LOOKUP_XDEV)) + return -EXDEV; } follow_mount(&nd->path); nd->inode = nd->path.dentry->d_inode; @@ -1703,6 +1739,13 @@ static inline int may_lookup(struct nameidata *nd) static inline int handle_dots(struct nameidata *nd, int type) { if (type == LAST_DOTDOT) { + /* + * LOOKUP_BENEATH resolving ".." is not currently safe -- races can + * cause our parent to have moved outside of the root and us to skip + * over it. + */ + if (unlikely(nd->flags & LOOKUP_BENEATH)) + return -EXDEV; if (!nd->root.mnt) set_root(nd); if (nd->flags & LOOKUP_RCU) { @@ -2251,6 +2294,15 @@ static const char *path_init(struct nameidata *nd, unsigned flags) nd->path.dentry = NULL; nd->m_seq = read_seqbegin(&mount_lock); + + if (unlikely(nd->flags & LOOKUP_BENEATH)) { + error = dirfd_path_init(nd); + if (unlikely(error)) + return ERR_PTR(error); + nd->root = nd->path; + if (!(nd->flags & LOOKUP_RCU)) + path_get(&nd->root); + } if (*s == '/') { if (likely(!nd->root.mnt)) set_root(nd); @@ -2259,9 +2311,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags) s = ERR_PTR(error); return s; } - error = dirfd_path_init(nd); - if (unlikely(error)) - return ERR_PTR(error); + if (likely(!nd->path.mnt)) { + error = dirfd_path_init(nd); + if (unlikely(error)) + return ERR_PTR(error); + } return s; } diff --git a/include/linux/namei.h b/include/linux/namei.h index 9138b4471dbf..7bc819ad0cd3 100644 --- a/include/linux/namei.h +++ b/include/linux/namei.h @@ -50,6 +50,13 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND}; #define LOOKUP_EMPTY 0x4000 #define LOOKUP_DOWN 0x8000 +/* Scoping flags for lookup. */ +#define LOOKUP_BENEATH 0x010000 /* No escaping from starting point. */ +#define LOOKUP_XDEV 0x020000 /* No mountpoint crossing. */ +#define LOOKUP_NO_MAGICLINKS 0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */ +#define LOOKUP_NO_SYMLINKS 0x080000 /* No symlink crossing *at all*. + Implies LOOKUP_NO_MAGICLINKS. */ + extern int path_pts(struct path *path); extern int user_path_at_empty(int, const char __user *, unsigned, struct path *, int *empty); -- 2.21.0
The primary motivation for the need for this flag is container runtimes which have to interact with malicious root filesystems in the host namespaces. One of the first requirements for a container runtime to be secure against a malicious rootfs is that they correctly scope symlinks (that is, they should be scoped as though they are chroot(2)ed into the container's rootfs) and ".."-style paths[*]. The already-existing O_XDEV and O_NOMAGICLINKS[**] help defend against other potential attacks in a malicious rootfs scenario. Currently most container runtimes try to do this resolution in userspace[1], causing many potential race conditions. In addition, the "obvious" alternative (actually performing a {ch,pivot_}root(2)) requires a fork+exec (for some runtimes) which is *very* costly if necessary for every filesystem operation involving a container. [*] At the moment, ".." and "magic link" jumping are disallowed for the same reason it is disabled for LOOKUP_BENEATH -- currently it is not safe to allow it. Future patches may enable it unconditionally once we have resolved the possible races (for "..") and semantics (for "magic link" jumping). The most significant openat(2) semantic change with LOOKUP_THISROOT is that absolute pathnames no longer cause dirfd to be ignored completely. The rationale is that LOOKUP_THISROOT must necessarily chroot-scope symlinks with absolute paths to dirfd, and so doing it for the base path seems to be the most consistent behaviour (and also avoids foot-gunning users who want to scope paths that are absolute). [1]: https://github.com/cyphar/filepath-securejoin Cc: Eric Biederman <ebiederm@xmission.com> Cc: Christian Brauner <christian@brauner.io> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> --- fs/namei.c | 6 +++--- include/linux/namei.h | 1 + 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index e13a02720a9d..3a3cba593b85 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -1095,7 +1095,7 @@ const char *get_link(struct nameidata *nd) if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS)) return ERR_PTR(-ELOOP); /* Not currently safe. */ - if (unlikely(nd->flags & LOOKUP_BENEATH)) + if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))) return ERR_PTR(-EXDEV); } if (IS_ERR_OR_NULL(res)) @@ -1744,7 +1744,7 @@ static inline int handle_dots(struct nameidata *nd, int type) * cause our parent to have moved outside of the root and us to skip * over it. */ - if (unlikely(nd->flags & LOOKUP_BENEATH)) + if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))) return -EXDEV; if (!nd->root.mnt) set_root(nd); @@ -2295,7 +2295,7 @@ static const char *path_init(struct nameidata *nd, unsigned flags) nd->m_seq = read_seqbegin(&mount_lock); - if (unlikely(nd->flags & LOOKUP_BENEATH)) { + if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))) { error = dirfd_path_init(nd); if (unlikely(error)) return ERR_PTR(error); diff --git a/include/linux/namei.h b/include/linux/namei.h index 7bc819ad0cd3..4b1ee717cb14 100644 --- a/include/linux/namei.h +++ b/include/linux/namei.h @@ -56,6 +56,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND}; #define LOOKUP_NO_MAGICLINKS 0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */ #define LOOKUP_NO_SYMLINKS 0x080000 /* No symlink crossing *at all*. Implies LOOKUP_NO_MAGICLINKS. */ +#define LOOKUP_IN_ROOT 0x100000 /* Treat dirfd as %current->fs->root. */ extern int path_pts(struct path *path); -- 2.21.0
This patch allows for O_BENEATH and O_THISROOT to safely permit ".." resolution (in the case of O_BENEATH the resolution will still fail if ".." resolution would resolve a path outside of the root -- while O_THISROOT will chroot(2)-style scope it). "magic link" jumps are still disallowed entirely because now they could result in inconsistent behaviour if resolution encounters a subsequent "..". The need for this patch is explained by observing there is a fairly easy-to-exploit race condition with chroot(2) (and thus by extension O_THISROOT and O_BENEATH) where a rename(2) of a path can be used to "skip over" nd->root and thus escape to the filesystem above nd->root. thread1 [attacker]: for (;;) renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE); thread2 [victim]: for (;;) openat(dirb, "b/c/../../etc/shadow", O_THISROOT); With fairly significant regularity, thread2 will resolve to "/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar (though somewhat more privileged) attack using MS_MOVE. With this patch, such cases will be detected *during* ".." resolution (which is the weak point of chroot(2) -- since walking *into* a subdirectory tautologically cannot result in you walking *outside* nd->root -- except through a bind-mount or "magic link"). By detecting this at ".." resolution (rather than checking only at the end of the entire resolution) we can both correct escapes by jumping back to the root (in the case of O_THISROOT), as well as avoid revealing to attackers the structure of the filesystem outside of the root (through timing attacks for instance). In order to avoid a quadratic lookup with each ".." entry, we only activate the slow path if a write through &rename_lock or &mount_lock have occurred during path resolution (&rename_lock and &mount_lock are re-taken to further optimise the lookup). Since the primary attack being protected against is MS_MOVE or rename(2), not doing additional checks unless a mount or rename have occurred avoids making the common case slow. The use of path_is_under() here might seem suspect, but on further inspection of the most important race (a path was *inside* the root but is now *outside*), there appears to be no attack potential. If path_is_under() occurs before the rename, then the path will be resolved but since the path was originally inside the root there is no escape. Subsequent ".." jumps are guaranteed to check path_is_under() (by construction, &rename_lock or &mount_lock must have been taken by the attacker after path_is_under() returned in the victim), and thus will not be able to escape from the previously-inside-root path. Walking down is still safe since the entire subtree was moved (either by rename(2) or MS_MOVE) and because (as discussed above) walking down is safe. Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Jann Horn <jannh@google.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> --- fs/namei.c | 48 +++++++++++++++++++++++++++++++++--------------- 1 file changed, 33 insertions(+), 15 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index 3a3cba593b85..2b6a1bf4e745 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -491,7 +491,7 @@ struct nameidata { struct path root; struct inode *inode; /* path.dentry.d_inode */ unsigned int flags; - unsigned seq, m_seq; + unsigned seq, m_seq, r_seq; int last_type; unsigned depth; int total_link_count; @@ -1739,19 +1739,35 @@ static inline int may_lookup(struct nameidata *nd) static inline int handle_dots(struct nameidata *nd, int type) { if (type == LAST_DOTDOT) { - /* - * LOOKUP_BENEATH resolving ".." is not currently safe -- races can - * cause our parent to have moved outside of the root and us to skip - * over it. - */ - if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))) - return -EXDEV; + int error = 0; + if (!nd->root.mnt) set_root(nd); - if (nd->flags & LOOKUP_RCU) { - return follow_dotdot_rcu(nd); - } else - return follow_dotdot(nd); + if (nd->flags & LOOKUP_RCU) + error = follow_dotdot_rcu(nd); + else + error = follow_dotdot(nd); + if (error) + return error; + + if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))) { + bool m_retry = read_seqretry(&mount_lock, nd->m_seq); + bool r_retry = read_seqretry(&rename_lock, nd->r_seq); + + /* + * Don't bother checking unless there's a racing + * rename(2) or MS_MOVE. + */ + if (likely(!m_retry && !r_retry)) + return 0; + + if (m_retry && !(nd->flags & LOOKUP_RCU)) + nd->m_seq = read_seqbegin(&mount_lock); + if (r_retry) + nd->r_seq = read_seqbegin(&rename_lock); + if (!path_is_under(&nd->path, &nd->root)) + return -EXDEV; + } } return 0; } @@ -2272,6 +2288,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags) nd->last_type = LAST_ROOT; /* if there are only slashes... */ nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT; nd->depth = 0; + + nd->m_seq = read_seqbegin(&mount_lock); + if (unlikely(flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))) + nd->r_seq = read_seqbegin(&rename_lock); + if (flags & LOOKUP_ROOT) { struct dentry *root = nd->root.dentry; struct inode *inode = root->d_inode; @@ -2282,7 +2303,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags) if (flags & LOOKUP_RCU) { nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq); nd->root_seq = nd->seq; - nd->m_seq = read_seqbegin(&mount_lock); } else { path_get(&nd->path); } @@ -2293,8 +2313,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags) nd->path.mnt = NULL; nd->path.dentry = NULL; - nd->m_seq = read_seqbegin(&mount_lock); - if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))) { error = dirfd_path_init(nd); if (unlikely(error)) -- 2.21.0
The need to be able to scope path resolution of interpreters became clear with one of the possible vectors used in CVE-2019-5736 (which most major container runtimes were vulnerable to). Naively, it might seem that openat(2) -- which supports path scoping -- can be combined with execveat(AT_EMPTY_PATH) to trivially scope the binary being executed. Unfortunately, a "bad binary" (usually a symlink) could be written as a #!-style script with the symlink target as the interpreter -- which would be completely missed by just scoping the openat(2). An example of this being exploitable is CVE-2019-5736. In order to get around this, we need to pass down to each binfmt_* implementation the scoping flags requested in execveat(2). In order to maintain backwards-compatibility we only pass the scoping AT_* flags. To avoid breaking userspace (in the exceptionally rare cases where you have #!-scripts with a relative path being execveat(2)-ed with dfd != AT_FDCWD), we only pass dfd down to binfmt_* if any of our new flags are set in execveat(2). Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> --- fs/binfmt_elf.c | 2 +- fs/binfmt_elf_fdpic.c | 2 +- fs/binfmt_em86.c | 4 ++-- fs/binfmt_misc.c | 2 +- fs/binfmt_script.c | 2 +- fs/exec.c | 26 ++++++++++++++++++++++---- include/linux/binfmts.h | 1 + include/linux/fs.h | 9 +++++++-- include/uapi/linux/fcntl.h | 6 ++++++ 9 files changed, 42 insertions(+), 12 deletions(-) diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c index 7d09d125f148..473420eed477 100644 --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -772,7 +772,7 @@ static int load_elf_binary(struct linux_binprm *bprm) if (elf_interpreter[elf_ppnt->p_filesz - 1] != '\0') goto out_free_interp; - interpreter = open_exec(elf_interpreter); + interpreter = openat_exec(bprm->dfd, elf_interpreter, bprm->flags); retval = PTR_ERR(interpreter); if (IS_ERR(interpreter)) goto out_free_interp; diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c index b53bb3729ac1..c463c6428f77 100644 --- a/fs/binfmt_elf_fdpic.c +++ b/fs/binfmt_elf_fdpic.c @@ -263,7 +263,7 @@ static int load_elf_fdpic_binary(struct linux_binprm *bprm) kdebug("Using ELF interpreter %s", interpreter_name); /* replace the program with the interpreter */ - interpreter = open_exec(interpreter_name); + interpreter = openat_exec(bprm->dfd, interpreter_name, bprm->flags); retval = PTR_ERR(interpreter); if (IS_ERR(interpreter)) { interpreter = NULL; diff --git a/fs/binfmt_em86.c b/fs/binfmt_em86.c index dd2d3f0cd55d..3ee46b0dc0d4 100644 --- a/fs/binfmt_em86.c +++ b/fs/binfmt_em86.c @@ -81,10 +81,10 @@ static int load_em86(struct linux_binprm *bprm) /* * OK, now restart the process with the interpreter's inode. - * Note that we use open_exec() as the name is now in kernel + * Note that we use openat_exec() as the name is now in kernel * space, and we don't need to copy it. */ - file = open_exec(interp); + file = openat_exec(binprm->dfd, interp, binprm->flags); if (IS_ERR(file)) return PTR_ERR(file); diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c index aa4a7a23ff99..573ef06ff5a1 100644 --- a/fs/binfmt_misc.c +++ b/fs/binfmt_misc.c @@ -209,7 +209,7 @@ static int load_misc_binary(struct linux_binprm *bprm) if (!IS_ERR(interp_file)) deny_write_access(interp_file); } else { - interp_file = open_exec(fmt->interpreter); + interp_file = openat_exec(bprm->dfd, fmt->interpreter, bprm->flags); } retval = PTR_ERR(interp_file); if (IS_ERR(interp_file)) diff --git a/fs/binfmt_script.c b/fs/binfmt_script.c index e996174cbfc0..aaff12d12bfd 100644 --- a/fs/binfmt_script.c +++ b/fs/binfmt_script.c @@ -137,7 +137,7 @@ static int load_script(struct linux_binprm *bprm) /* * OK, now restart the process with the interpreter's dentry. */ - file = open_exec(i_name); + file = openat_exec(bprm->dfd, i_name, bprm->flags); if (IS_ERR(file)) return PTR_ERR(file); diff --git a/fs/exec.c b/fs/exec.c index 2e0033348d8e..d0f244d9a541 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -846,12 +846,24 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags) .lookup_flags = LOOKUP_FOLLOW, }; - if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH)) != 0) + if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH | AT_BENEATH | + AT_XDEV | AT_NO_MAGICLINKS | AT_NO_SYMLINKS | + AT_THIS_ROOT)) != 0) return ERR_PTR(-EINVAL); if (flags & AT_SYMLINK_NOFOLLOW) open_exec_flags.lookup_flags &= ~LOOKUP_FOLLOW; if (flags & AT_EMPTY_PATH) open_exec_flags.lookup_flags |= LOOKUP_EMPTY; + if (flags & AT_BENEATH) + open_exec_flags.lookup_flags |= LOOKUP_BENEATH; + if (flags & AT_XDEV) + open_exec_flags.lookup_flags |= LOOKUP_XDEV; + if (flags & AT_NO_MAGICLINKS) + open_exec_flags.lookup_flags |= LOOKUP_NO_MAGICLINKS; + if (flags & AT_NO_SYMLINKS) + open_exec_flags.lookup_flags |= LOOKUP_NO_SYMLINKS; + if (flags & AT_THIS_ROOT) + open_exec_flags.lookup_flags |= LOOKUP_IN_ROOT; file = do_filp_open(fd, name, &open_exec_flags); if (IS_ERR(file)) @@ -879,18 +891,18 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags) return ERR_PTR(err); } -struct file *open_exec(const char *name) +struct file *openat_exec(int dfd, const char *name, int flags) { struct filename *filename = getname_kernel(name); struct file *f = ERR_CAST(filename); if (!IS_ERR(filename)) { - f = do_open_execat(AT_FDCWD, filename, 0); + f = do_open_execat(dfd, filename, flags); putname(filename); } return f; } -EXPORT_SYMBOL(open_exec); +EXPORT_SYMBOL(openat_exec); int kernel_read_file(struct file *file, void **buf, loff_t *size, loff_t max_size, enum kernel_read_file_id id) @@ -1762,6 +1774,12 @@ static int __do_execve_file(int fd, struct filename *filename, sched_exec(); + bprm->flags = flags & (AT_XDEV | AT_NO_MAGICLINKS | AT_NO_SYMLINKS | + AT_THIS_ROOT); + bprm->dfd = AT_FDCWD; + if (bprm->flags) + bprm->dfd = fd; + bprm->file = file; if (!filename) { bprm->filename = "none"; diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h index 688ab0de7810..e4da2d36e97f 100644 --- a/include/linux/binfmts.h +++ b/include/linux/binfmts.h @@ -50,6 +50,7 @@ struct linux_binprm { unsigned int taso:1; #endif unsigned int recursion_depth; /* only for search_binary_handler() */ + int dfd, flags; /* passed down to execat_open() */ struct file * file; struct cred *cred; /* new credentials */ int unsafe; /* how unsafe this exec is (mask of LSM_UNSAFE_*) */ diff --git a/include/linux/fs.h b/include/linux/fs.h index fe94c48481a6..cdcd2e021101 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2945,8 +2945,13 @@ extern int kernel_read_file_from_fd(int, void **, loff_t *, loff_t, extern ssize_t kernel_read(struct file *, void *, size_t, loff_t *); extern ssize_t kernel_write(struct file *, const void *, size_t, loff_t *); extern ssize_t __kernel_write(struct file *, const void *, size_t, loff_t *); -extern struct file * open_exec(const char *); - + +extern struct file *openat_exec(int, const char *, int); +static inline struct file *open_exec(const char *name) +{ + return openat_exec(AT_FDCWD, name, 0); +} + /* fs/dcache.c -- generic fs support functions */ extern bool is_subdir(struct dentry *, struct dentry *); extern bool path_is_under(const struct path *, const struct path *); diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 1d338357df8a..bfca7f87cd2a 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -93,5 +93,11 @@ #define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */ +#define AT_RESOLUTION_TYPE 0x1F0000 /* Type of path-resolution scoping we are applying. */ +#define AT_BENEATH 0x010000 /* - Block "lexical" trickery like "..", symlinks, absolute paths, etc. */ +#define AT_XDEV 0x020000 /* - Block mount-point crossings (includes bind-mounts). */ +#define AT_NO_MAGICLINKS 0x040000 /* - Block procfs-style "magic" symlinks. */ +#define AT_NO_SYMLINKS 0x080000 /* - Block all symlinks (implies AT_NO_MAGICLINKS). */ +#define AT_THIS_ROOT 0x100000 /* - Scope ".." resolution to dirfd (like chroot(2)). */ #endif /* _UAPI_LINUX_FCNTL_H */ -- 2.21.0
The most obvious syscall to add support for the new LOOKUP_* scoping flags would be openat(2) (along with the required execveat(2) change included in this series). However, there are a few reasons to not do this: * The new LOOKUP_* flags are intended to be security features, and openat(2) will silently ignore all unknown flags. This means that users would need to avoid foot-gunning themselves constantly when using this interface if it were part of openat(2). * Resolution scoping feels like a different operation to the existing O_* flags. And since openat(2) has limited flag space, it seems to be quite wasteful to clutter it with 5 flags that are all resolution-related. Arguably O_NOFOLLOw is also a resolution flag but its entire purpose is to error out if you encounter a trailing symlink not to scope resolution. * Other systems would be able to reimplement this syscall allowing for cross-OS standardisation rather than being hidden amongst O_* flags which may result in it not being used by all the parties that might want to use it (file servers, web servers, container runtimes, etc). * It gives us the opportunity to iterate on the O_PATH interface in the future. There are some potential security improvements that can be made to O_PATH (handling /proc/self/fd re-opening of file descriptors much more sanely) which could be made even better with some other bits (such as ACC_MODE bits which work for O_PATH). To this end, we introduce the resolveat(2) syscall. At the moment it's effectively another way of getting a bog-standard O_PATH descriptor but with the ability to use the new LOOKUP_* flags. Because resolveat(2) only provides the ability to get O_PATH descriptors, users will need to get creative with /proc/self/fd in order to get a usable file descriptor for other uses. However, in future we can add O_EMPTYPATH support to openat(2) which would allow for re-opening without procfs (though as mentioned above there are some security improvements that should be made to the interfaces). NOTE: This patch adds the syscall to all architectures using the new unified syscall numbering, but several architectures are missing newer (nr > 423) syscalls -- hence the uneven gaps in the syscall tables. Cc: Christian Brauner <christian@brauner.io> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> --- arch/alpha/kernel/syscalls/syscall.tbl | 1 + arch/arm/tools/syscall.tbl | 1 + arch/ia64/kernel/syscalls/syscall.tbl | 1 + arch/m68k/kernel/syscalls/syscall.tbl | 1 + arch/microblaze/kernel/syscalls/syscall.tbl | 1 + arch/mips/kernel/syscalls/syscall_n32.tbl | 1 + arch/mips/kernel/syscalls/syscall_n64.tbl | 1 + arch/mips/kernel/syscalls/syscall_o32.tbl | 1 + arch/parisc/kernel/syscalls/syscall.tbl | 1 + arch/powerpc/kernel/syscalls/syscall.tbl | 1 + arch/s390/kernel/syscalls/syscall.tbl | 1 + arch/sh/kernel/syscalls/syscall.tbl | 1 + arch/sparc/kernel/syscalls/syscall.tbl | 1 + arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + arch/xtensa/kernel/syscalls/syscall.tbl | 1 + fs/namei.c | 46 +++++++++++++++++++++ include/uapi/linux/fcntl.h | 12 ++++++ 18 files changed, 74 insertions(+) diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl index 63ed39cbd3bd..72f431b1dc9c 100644 --- a/arch/alpha/kernel/syscalls/syscall.tbl +++ b/arch/alpha/kernel/syscalls/syscall.tbl @@ -461,5 +461,6 @@ 530 common getegid sys_getegid 531 common geteuid sys_geteuid 532 common getppid sys_getppid +533 common resolveat sys_resolveat # all other architectures have common numbers for new syscall, alpha # is the exception. diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl index 9016f4081bb9..1bc0282a67f7 100644 --- a/arch/arm/tools/syscall.tbl +++ b/arch/arm/tools/syscall.tbl @@ -437,3 +437,4 @@ 421 common rt_sigtimedwait_time64 sys_rt_sigtimedwait 422 common futex_time64 sys_futex 423 common sched_rr_get_interval_time64 sys_sched_rr_get_interval +428 common resolveat sys_resolveat diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl index ab9cda5f6136..d3ae73ffaf48 100644 --- a/arch/ia64/kernel/syscalls/syscall.tbl +++ b/arch/ia64/kernel/syscalls/syscall.tbl @@ -344,3 +344,4 @@ 332 common pkey_free sys_pkey_free 333 common rseq sys_rseq # 334 through 423 are reserved to sync up with other architectures +428 common resolveat sys_resolveat diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl index 125c14178979..81b7389e9e58 100644 --- a/arch/m68k/kernel/syscalls/syscall.tbl +++ b/arch/m68k/kernel/syscalls/syscall.tbl @@ -423,3 +423,4 @@ 421 common rt_sigtimedwait_time64 sys_rt_sigtimedwait 422 common futex_time64 sys_futex 423 common sched_rr_get_interval_time64 sys_sched_rr_get_interval +428 common resolveat sys_resolveat diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl index 8ee3a8c18498..626aed10e2b5 100644 --- a/arch/microblaze/kernel/syscalls/syscall.tbl +++ b/arch/microblaze/kernel/syscalls/syscall.tbl @@ -429,3 +429,4 @@ 421 common rt_sigtimedwait_time64 sys_rt_sigtimedwait 422 common futex_time64 sys_futex 423 common sched_rr_get_interval_time64 sys_sched_rr_get_interval +428 common resolveat sys_resolveat diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl index 15f4117900ee..8cbd6032d6bf 100644 --- a/arch/mips/kernel/syscalls/syscall_n32.tbl +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl @@ -362,3 +362,4 @@ 421 n32 rt_sigtimedwait_time64 compat_sys_rt_sigtimedwait_time64 422 n32 futex_time64 sys_futex 423 n32 sched_rr_get_interval_time64 sys_sched_rr_get_interval +428 n32 resolveat sys_resolveat diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl index c85502e67b44..234923a1fc88 100644 --- a/arch/mips/kernel/syscalls/syscall_n64.tbl +++ b/arch/mips/kernel/syscalls/syscall_n64.tbl @@ -338,3 +338,4 @@ 327 n64 rseq sys_rseq 328 n64 io_pgetevents sys_io_pgetevents # 329 through 423 are reserved to sync up with other architectures +428 n64 resolveat sys_resolveat diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl index 2e063d0f837e..7b4586acf35d 100644 --- a/arch/mips/kernel/syscalls/syscall_o32.tbl +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl @@ -411,3 +411,4 @@ 421 o32 rt_sigtimedwait_time64 sys_rt_sigtimedwait compat_sys_rt_sigtimedwait_time64 422 o32 futex_time64 sys_futex sys_futex 423 o32 sched_rr_get_interval_time64 sys_sched_rr_get_interval sys_sched_rr_get_interval +428 o32 resolveat sys_resolveat sys_resolveat diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl index b26766c6647d..19a9a92dc5f8 100644 --- a/arch/parisc/kernel/syscalls/syscall.tbl +++ b/arch/parisc/kernel/syscalls/syscall.tbl @@ -420,3 +420,4 @@ 421 32 rt_sigtimedwait_time64 sys_rt_sigtimedwait compat_sys_rt_sigtimedwait_time64 422 32 futex_time64 sys_futex sys_futex 423 32 sched_rr_get_interval_time64 sys_sched_rr_get_interval sys_sched_rr_get_interval +428 common resolveat sys_resolveat sys_resolveat diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl index b18abb0c3dae..bfcd75b928de 100644 --- a/arch/powerpc/kernel/syscalls/syscall.tbl +++ b/arch/powerpc/kernel/syscalls/syscall.tbl @@ -505,3 +505,4 @@ 421 32 rt_sigtimedwait_time64 sys_rt_sigtimedwait compat_sys_rt_sigtimedwait_time64 422 32 futex_time64 sys_futex sys_futex 423 32 sched_rr_get_interval_time64 sys_sched_rr_get_interval sys_sched_rr_get_interval +428 common resolveat sys_resolveat sys_resolveat diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl index 02579f95f391..084e51f02e65 100644 --- a/arch/s390/kernel/syscalls/syscall.tbl +++ b/arch/s390/kernel/syscalls/syscall.tbl @@ -426,3 +426,4 @@ 421 32 rt_sigtimedwait_time64 - compat_sys_rt_sigtimedwait_time64 422 32 futex_time64 - sys_futex 423 32 sched_rr_get_interval_time64 - sys_sched_rr_get_interval +428 common resolveat sys_resolveat - diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl index bfda678576e4..e9115c5cec72 100644 --- a/arch/sh/kernel/syscalls/syscall.tbl +++ b/arch/sh/kernel/syscalls/syscall.tbl @@ -426,3 +426,4 @@ 421 common rt_sigtimedwait_time64 sys_rt_sigtimedwait 422 common futex_time64 sys_futex 423 common sched_rr_get_interval_time64 sys_sched_rr_get_interval +428 common resolveat sys_resolveat diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl index b9a5a04b2d2c..2d3fdd913d89 100644 --- a/arch/sparc/kernel/syscalls/syscall.tbl +++ b/arch/sparc/kernel/syscalls/syscall.tbl @@ -469,3 +469,4 @@ 421 32 rt_sigtimedwait_time64 sys_rt_sigtimedwait compat_sys_rt_sigtimedwait_time64 422 32 futex_time64 sys_futex sys_futex 423 32 sched_rr_get_interval_time64 sys_sched_rr_get_interval sys_sched_rr_get_interval +428 common resolveat sys_resolveat sys_resolveat diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 4cd5f982b1e5..101feef22473 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -438,3 +438,4 @@ 425 i386 io_uring_setup sys_io_uring_setup __ia32_sys_io_uring_setup 426 i386 io_uring_enter sys_io_uring_enter __ia32_sys_io_uring_enter 427 i386 io_uring_register sys_io_uring_register __ia32_sys_io_uring_register +428 i386 resolveat sys_resolveat __ia32_sys_resolveat diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index 64ca0d06259a..c7c197bf07bc 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -355,6 +355,7 @@ 425 common io_uring_setup __x64_sys_io_uring_setup 426 common io_uring_enter __x64_sys_io_uring_enter 427 common io_uring_register __x64_sys_io_uring_register +428 common resolveat __x64_sys_resolveat # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl index 6af49929de85..86c44f15dfa4 100644 --- a/arch/xtensa/kernel/syscalls/syscall.tbl +++ b/arch/xtensa/kernel/syscalls/syscall.tbl @@ -394,3 +394,4 @@ 421 common rt_sigtimedwait_time64 sys_rt_sigtimedwait 422 common futex_time64 sys_futex 423 common sched_rr_get_interval_time64 sys_sched_rr_get_interval +428 common resolveat sys_resolveat diff --git a/fs/namei.c b/fs/namei.c index 2b6a1bf4e745..30db5833aac3 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -3656,6 +3656,52 @@ struct file *do_filp_open(int dfd, struct filename *pathname, return filp; } +SYSCALL_DEFINE3(resolveat, int, dfd, const char __user *, path, + unsigned long, flags) +{ + int fd; + struct filename *tmp; + struct open_flags op = { + .open_flag = O_PATH, + }; + + if (flags & ~VALID_RESOLVE_FLAGS) + return -EINVAL; + + if (flags & RESOLVE_CLOEXEC) + op.open_flag |= O_CLOEXEC; + if (!(flags & RESOLVE_NOFOLLOW)) + op.lookup_flags |= LOOKUP_FOLLOW; + if (flags & RESOLVE_BENEATH) + op.lookup_flags |= LOOKUP_BENEATH; + if (flags & RESOLVE_XDEV) + op.lookup_flags |= LOOKUP_XDEV; + if (flags & RESOLVE_NO_MAGICLINKS) + op.lookup_flags |= LOOKUP_NO_MAGICLINKS; + if (flags & RESOLVE_NO_SYMLINKS) + op.lookup_flags |= LOOKUP_NO_SYMLINKS; + if (flags & RESOLVE_THIS_ROOT) + op.lookup_flags |= LOOKUP_IN_ROOT; + + tmp = getname(path); + if (IS_ERR(tmp)) + return PTR_ERR(tmp); + + fd = get_unused_fd_flags(op.open_flag); + if (fd >= 0) { + struct file *f = do_filp_open(dfd, tmp, &op); + if (IS_ERR(f)) { + put_unused_fd(fd); + fd = PTR_ERR(f); + } else { + fsnotify_open(f); + fd_install(fd, f); + } + } + putname(tmp); + return fd; +} + struct file *do_file_open_root(struct dentry *dentry, struct vfsmount *mnt, const char *name, const struct open_flags *op) { diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index bfca7f87cd2a..d5a2a97929c6 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -100,4 +100,16 @@ #define AT_NO_SYMLINKS 0x080000 /* - Block all symlinks (implies AT_NO_MAGICLINKS). */ #define AT_THIS_ROOT 0x100000 /* - Scope ".." resolution to dirfd (like chroot(2)). */ +/* First two bits of RESOLVE_* are reserved for future ACC_MODE extensions. */ +#define RESOLVE_CLOEXEC 0x004 /* Set O_CLOEXEC on the returned fd. */ +#define RESOLVE_NOFOLLOW 0x008 /* Don't follow trailing symlinks. */ +#define RESOLVE_RESOLUTION_TYPE 0x1F0 /* Type of path-resolution scoping we are applying. */ +#define RESOLVE_BENEATH 0x010 /* - Block "lexical" trickery like "..", symlinks, absolute paths, etc. */ +#define RESOLVE_XDEV 0x020 /* - Block mount-point crossings (includes bind-mounts). */ +#define RESOLVE_NO_MAGICLINKS 0x040 /* - Block procfs-style "magic" symlinks. */ +#define RESOLVE_NO_SYMLINKS 0x080 /* - Block all symlinks (implies AT_NO_MAGICLINKS). */ +#define RESOLVE_THIS_ROOT 0x100 /* - Scope ".." resolution to dirfd (like chroot(2)). */ + +#define VALID_RESOLVE_FLAGS (RESOLVE_CLOEXEC | RESOLVE_NOFOLLOW | RESOLVE_RESOLUTION_TYPE) + #endif /* _UAPI_LINUX_FCNTL_H */ -- 2.21.0
On Mon, May 6, 2019 at 6:56 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > The need to be able to scope path resolution of interpreters became > clear with one of the possible vectors used in CVE-2019-5736 (which > most major container runtimes were vulnerable to). > > Naively, it might seem that openat(2) -- which supports path scoping -- > can be combined with execveat(AT_EMPTY_PATH) to trivially scope the > binary being executed. Unfortunately, a "bad binary" (usually a symlink) > could be written as a #!-style script with the symlink target as the > interpreter -- which would be completely missed by just scoping the > openat(2). An example of this being exploitable is CVE-2019-5736. > > In order to get around this, we need to pass down to each binfmt_* > implementation the scoping flags requested in execveat(2). In order to > maintain backwards-compatibility we only pass the scoping AT_* flags. > > To avoid breaking userspace (in the exceptionally rare cases where you > have #!-scripts with a relative path being execveat(2)-ed with dfd != > AT_FDCWD), we only pass dfd down to binfmt_* if any of our new flags are > set in execveat(2). This seems extremely dangerous. I like the overall series, but not this patch. > @@ -1762,6 +1774,12 @@ static int __do_execve_file(int fd, struct filename *filename, > > sched_exec(); > > + bprm->flags = flags & (AT_XDEV | AT_NO_MAGICLINKS | AT_NO_SYMLINKS | > + AT_THIS_ROOT); [...] > +#define AT_THIS_ROOT 0x100000 /* - Scope ".." resolution to dirfd (like chroot(2)). */ So now what happens if there is a setuid root ELF binary with program interpreter "/lib64/ld-linux-x86-64.so.2" (like /bin/su), and an unprivileged user runs it with execveat(..., AT_THIS_ROOT)? Is that going to let the unprivileged user decide which interpreter the setuid-root process should use? From a high-level perspective, opening the interpreter should be controlled by the program that is being loaded, not by the program that invoked it. In my opinion, CVE-2019-5736 points out two different problems: The big problem: The __ptrace_may_access() logic has a special-case short-circuit for "introspection" that you can't opt out of; this makes it possible to open things in procfs that are related to the current process even if the credentials of the process wouldn't permit accessing another process like it. I think the proper fix to deal with this would be to add a prctl() flag for "set whether introspection is allowed for this process", and if userspace has manually un-set that flag, any introspection special-case logic would be skipped. An additional problem: /proc/*/exe can be used to open a file for writing; I think it may have been Andy Lutomirski who pointed out some time ago that it would be nice if you couldn't use /proc/*/fd/* to re-open files with more privileges, which is sort of the same thing.
[-- Attachment #1: Type: text/plain, Size: 4411 bytes --] On 2019-05-06, Jann Horn <jannh@google.com> wrote: > On Mon, May 6, 2019 at 6:56 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > > The need to be able to scope path resolution of interpreters became > > clear with one of the possible vectors used in CVE-2019-5736 (which > > most major container runtimes were vulnerable to). > > > > Naively, it might seem that openat(2) -- which supports path scoping -- > > can be combined with execveat(AT_EMPTY_PATH) to trivially scope the > > binary being executed. Unfortunately, a "bad binary" (usually a symlink) > > could be written as a #!-style script with the symlink target as the > > interpreter -- which would be completely missed by just scoping the > > openat(2). An example of this being exploitable is CVE-2019-5736. > > > > In order to get around this, we need to pass down to each binfmt_* > > implementation the scoping flags requested in execveat(2). In order to > > maintain backwards-compatibility we only pass the scoping AT_* flags. > > > > To avoid breaking userspace (in the exceptionally rare cases where you > > have #!-scripts with a relative path being execveat(2)-ed with dfd != > > AT_FDCWD), we only pass dfd down to binfmt_* if any of our new flags are > > set in execveat(2). > > This seems extremely dangerous. I like the overall series, but not this patch. > > > @@ -1762,6 +1774,12 @@ static int __do_execve_file(int fd, struct filename *filename, > > > > sched_exec(); > > > > + bprm->flags = flags & (AT_XDEV | AT_NO_MAGICLINKS | AT_NO_SYMLINKS | > > + AT_THIS_ROOT); > [...] > > +#define AT_THIS_ROOT 0x100000 /* - Scope ".." resolution to dirfd (like chroot(2)). */ > > So now what happens if there is a setuid root ELF binary with program > interpreter "/lib64/ld-linux-x86-64.so.2" (like /bin/su), and an > unprivileged user runs it with execveat(..., AT_THIS_ROOT)? Is that > going to let the unprivileged user decide which interpreter the > setuid-root process should use? From a high-level perspective, opening > the interpreter should be controlled by the program that is being > loaded, not by the program that invoked it. I went a bit nuts with openat_exec(), and I did end up adding it to the ELF interpreter lookup (and you're completely right that this is a bad idea -- I will drop it from this patch if it's included in the next series). The proposed solutions you give below are much nicer than this patch so I can drop it and work on fixing those issues separately. > In my opinion, CVE-2019-5736 points out two different problems: > > The big problem: The __ptrace_may_access() logic has a special-case > short-circuit for "introspection" that you can't opt out of; this > makes it possible to open things in procfs that are related to the > current process even if the credentials of the process wouldn't permit > accessing another process like it. I think the proper fix to deal with > this would be to add a prctl() flag for "set whether introspection is > allowed for this process", and if userspace has manually un-set that > flag, any introspection special-case logic would be skipped. We could do PR_SET_DUMPABLE=3 for this, I guess? > An additional problem: /proc/*/exe can be used to open a file for > writing; I think it may have been Andy Lutomirski who pointed out some > time ago that it would be nice if you couldn't use /proc/*/fd/* to > re-open files with more privileges, which is sort of the same thing. This is something I'm currently working on a series for, which would boil down to some restrictions on how re-opening of file descriptors works through procfs. However, execveat() of a procfs magiclink is a bit hard to block -- there is no way for userspace to to represent a file being "open for execute" so they are all "open for execute" by default and blocking it outright seems a bit extreme (though I actually hope to eventually add the ability to mark an O_PATH as "open for X" to resolveat(2) -- hence why I've reserved some bits). (Thinking more about it, there is an argument that I should include the above patch into this series so that we can block re-opening of fds opened through resolveat(2) without explicit flags from the outset.) -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH <https://www.cyphar.com/> [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --]
> On May 6, 2019, at 12:17 PM, Aleksa Sarai <cyphar@cyphar.com> wrote: > >> On 2019-05-06, Jann Horn <jannh@google.com> wrote: >>> On Mon, May 6, 2019 at 6:56 PM Aleksa Sarai <cyphar@cyphar.com> wrote: >>> The need to be able to scope path resolution of interpreters became >>> clear with one of the possible vectors used in CVE-2019-5736 (which >>> most major container runtimes were vulnerable to). >>> >>> Naively, it might seem that openat(2) -- which supports path scoping -- >>> can be combined with execveat(AT_EMPTY_PATH) to trivially scope the >>> binary being executed. Unfortunately, a "bad binary" (usually a symlink) >>> could be written as a #!-style script with the symlink target as the >>> interpreter -- which would be completely missed by just scoping the >>> openat(2). An example of this being exploitable is CVE-2019-5736. >>> >>> In order to get around this, we need to pass down to each binfmt_* >>> implementation the scoping flags requested in execveat(2). In order to >>> maintain backwards-compatibility we only pass the scoping AT_* flags. >>> >>> To avoid breaking userspace (in the exceptionally rare cases where you >>> have #!-scripts with a relative path being execveat(2)-ed with dfd != >>> AT_FDCWD), we only pass dfd down to binfmt_* if any of our new flags are >>> set in execveat(2). >> >> This seems extremely dangerous. I like the overall series, but not this patch. >> >>> @@ -1762,6 +1774,12 @@ static int __do_execve_file(int fd, struct filename *filename, >>> >>> sched_exec(); >>> >>> + bprm->flags = flags & (AT_XDEV | AT_NO_MAGICLINKS | AT_NO_SYMLINKS | >>> + AT_THIS_ROOT); >> [...] >>> +#define AT_THIS_ROOT 0x100000 /* - Scope ".." resolution to dirfd (like chroot(2)). */ >> >> So now what happens if there is a setuid root ELF binary with program >> interpreter "/lib64/ld-linux-x86-64.so.2" (like /bin/su), and an >> unprivileged user runs it with execveat(..., AT_THIS_ROOT)? Is that >> going to let the unprivileged user decide which interpreter the >> setuid-root process should use? From a high-level perspective, opening >> the interpreter should be controlled by the program that is being >> loaded, not by the program that invoked it. > > I went a bit nuts with openat_exec(), and I did end up adding it to the > ELF interpreter lookup (and you're completely right that this is a bad > idea -- I will drop it from this patch if it's included in the next > series). > > The proposed solutions you give below are much nicer than this patch so > I can drop it and work on fixing those issues separately. > >> In my opinion, CVE-2019-5736 points out two different problems: >> >> The big problem: The __ptrace_may_access() logic has a special-case >> short-circuit for "introspection" that you can't opt out of; this >> makes it possible to open things in procfs that are related to the >> current process even if the credentials of the process wouldn't permit >> accessing another process like it. I think the proper fix to deal with >> this would be to add a prctl() flag for "set whether introspection is >> allowed for this process", and if userspace has manually un-set that >> flag, any introspection special-case logic would be skipped. > > We could do PR_SET_DUMPABLE=3 for this, I guess? > >> An additional problem: /proc/*/exe can be used to open a file for >> writing; I think it may have been Andy Lutomirski who pointed out some >> time ago that it would be nice if you couldn't use /proc/*/fd/* to >> re-open files with more privileges, which is sort of the same thing. > > This is something I'm currently working on a series for, which would > boil down to some restrictions on how re-opening of file descriptors > works through procfs. > > However, execveat() of a procfs magiclink is a bit hard to block -- > there is no way for userspace to to represent a file being "open for > execute" so they are all "open for execute" by default and blocking it > outright seems a bit extreme (though I actually hope to eventually add > the ability to mark an O_PATH as "open for X" to resolveat(2) -- hence > why I've reserved some bits). There’s an O_MAYEXEC series floating around. > > (Thinking more about it, there is an argument that I should include the > above patch into this series so that we can block re-opening of fds > opened through resolveat(2) without explicit flags from the outset.) > > -- > Aleksa Sarai > Senior Software Engineer (Containers) > SUSE Linux GmbH > <https://www.cyphar.com/>
Jann Horn <jannh@google.com> writes:
>
> In my opinion, CVE-2019-5736 points out two different problems:
>
> The big problem: The __ptrace_may_access() logic has a special-case
> short-circuit for "introspection" that you can't opt out of;
Once upon a time in a galaxy far far away I fixed a bug where we missing
ptrace_may_access checks on various proc files and systems using selinux
stopped working. At the time selinux did not allow ptrace like access
to yourself. The "introspection" special case was the quick and simple
work-around.
There is nothing fundamental in having the "introspection" special case
except that various lsms have probably grown to depend upon it being
there. I expect without difficulty we could move the check down
into the various lsms. Which would get that check out of the core
kernel code.
Then the special case would the lsms challenge to keep or remove.
Eric
[-- Attachment #1: Type: text/plain, Size: 3207 bytes --] On 2019-05-07, Aleksa Sarai <cyphar@cyphar.com> wrote: > On 2019-05-06, Jann Horn <jannh@google.com> wrote: > > On Mon, May 6, 2019 at 6:56 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > > > The need to be able to scope path resolution of interpreters became > > > clear with one of the possible vectors used in CVE-2019-5736 (which > > > most major container runtimes were vulnerable to). > > > > > > Naively, it might seem that openat(2) -- which supports path scoping -- > > > can be combined with execveat(AT_EMPTY_PATH) to trivially scope the > > > binary being executed. Unfortunately, a "bad binary" (usually a symlink) > > > could be written as a #!-style script with the symlink target as the > > > interpreter -- which would be completely missed by just scoping the > > > openat(2). An example of this being exploitable is CVE-2019-5736. > > > > > > In order to get around this, we need to pass down to each binfmt_* > > > implementation the scoping flags requested in execveat(2). In order to > > > maintain backwards-compatibility we only pass the scoping AT_* flags. > > > > > > To avoid breaking userspace (in the exceptionally rare cases where you > > > have #!-scripts with a relative path being execveat(2)-ed with dfd != > > > AT_FDCWD), we only pass dfd down to binfmt_* if any of our new flags are > > > set in execveat(2). > > > > This seems extremely dangerous. I like the overall series, but not this patch. > > > > > @@ -1762,6 +1774,12 @@ static int __do_execve_file(int fd, struct filename *filename, > > > > > > sched_exec(); > > > > > > + bprm->flags = flags & (AT_XDEV | AT_NO_MAGICLINKS | AT_NO_SYMLINKS | > > > + AT_THIS_ROOT); > > [...] > > > +#define AT_THIS_ROOT 0x100000 /* - Scope ".." resolution to dirfd (like chroot(2)). */ > > > > So now what happens if there is a setuid root ELF binary with program > > interpreter "/lib64/ld-linux-x86-64.so.2" (like /bin/su), and an > > unprivileged user runs it with execveat(..., AT_THIS_ROOT)? Is that > > going to let the unprivileged user decide which interpreter the > > setuid-root process should use? From a high-level perspective, opening > > the interpreter should be controlled by the program that is being > > loaded, not by the program that invoked it. > > I went a bit nuts with openat_exec(), and I did end up adding it to the > ELF interpreter lookup (and you're completely right that this is a bad > idea -- I will drop it from this patch if it's included in the next > series). > > The proposed solutions you give below are much nicer than this patch so > I can drop it and work on fixing those issues separately. Another possible solution would be to only allow (for instance) AT_NO_MAGICLINKS for execveat(2). That way you cannot scope the resolution but you can block the most concerning cases -- those involving /proc-related access. I've posted a v7 with this patch dropped (because we can always add AT_* flags later in time), but I think having at least NO_MAGICLINKS would be useful. -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH <https://www.cyphar.com/> [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --]
On Tue, May 07, 2019 at 07:38:58PM -0500, Eric W. Biederman wrote:
> Jann Horn <jannh@google.com> writes:
> > In my opinion, CVE-2019-5736 points out two different problems:
> >
> > The big problem: The __ptrace_may_access() logic has a special-case
> > short-circuit for "introspection" that you can't opt out of;
>
> Once upon a time in a galaxy far far away I fixed a bug where we missing
> ptrace_may_access checks on various proc files and systems using selinux
> stopped working. At the time selinux did not allow ptrace like access
> to yourself. The "introspection" special case was the quick and simple
> work-around.
>
> There is nothing fundamental in having the "introspection" special case
> except that various lsms have probably grown to depend upon it being
> there. I expect without difficulty we could move the check down
> into the various lsms. Which would get that check out of the core
> kernel code.
Oh, if that's an option, that would be great, I think.
But this means, for example, that a non-root, non-dumpable process can't
open /proc/self/maps anymore, or open /proc/self/fd/*, and things like
that, without making itself dumpable. I would be surprised if there is
no code out there that relies on that.
From what I can tell, without the introspection special case,
introspection would fail in the following cases (assuming that the
process is not capable and isn't using sys_setfs[ug]id()):
- ruid/euid/suid are not all the same
- rgid/egid/sgid are not all the same
- process is not dumpable
I think that there probably should be some way for a non-dumpable
process to look at its own procfs entries? If we could start from a
clean slate, I'd propose an opt-in flag to openat() for that, but
since we don't have a clean slate, I'd be afraid of breaking things
with that. But maybe I'm just being overly careful here?
On Tue, May 07, 2019 at 05:17:35AM +1000, Aleksa Sarai wrote: > On 2019-05-06, Jann Horn <jannh@google.com> wrote: > > In my opinion, CVE-2019-5736 points out two different problems: > > > > The big problem: The __ptrace_may_access() logic has a special-case > > short-circuit for "introspection" that you can't opt out of; this > > makes it possible to open things in procfs that are related to the > > current process even if the credentials of the process wouldn't permit > > accessing another process like it. I think the proper fix to deal with > > this would be to add a prctl() flag for "set whether introspection is > > allowed for this process", and if userspace has manually un-set that > > flag, any introspection special-case logic would be skipped. > > We could do PR_SET_DUMPABLE=3 for this, I guess? Hmm... I'd make it a new prctl() command, since introspection is somewhat orthogonal to dumpability. Also, dumpability is per-mm, and I think the introspection flag should be per-thread. > > An additional problem: /proc/*/exe can be used to open a file for > > writing; I think it may have been Andy Lutomirski who pointed out some > > time ago that it would be nice if you couldn't use /proc/*/fd/* to > > re-open files with more privileges, which is sort of the same thing. > > This is something I'm currently working on a series for, which would > boil down to some restrictions on how re-opening of file descriptors > works through procfs. Ah, nice! > However, execveat() of a procfs magiclink is a bit hard to block -- > there is no way for userspace to to represent a file being "open for > execute" so they are all "open for execute" by default and blocking it > outright seems a bit extreme (though I actually hope to eventually add > the ability to mark an O_PATH as "open for X" to resolveat(2) -- hence > why I've reserved some bits). (For what it's worth, I'm mostly concerned about read vs write, not really about execute, since execute really is just another form of reading in my opinion.) > (Thinking more about it, there is an argument that I should include the > above patch into this series so that we can block re-opening of fds > opened through resolveat(2) without explicit flags from the outset.)
On Fri, May 10, 2019 at 1:41 PM Jann Horn <jannh@google.com> wrote:
>
> On Tue, May 07, 2019 at 05:17:35AM +1000, Aleksa Sarai wrote:
> > On 2019-05-06, Jann Horn <jannh@google.com> wrote:
> > > In my opinion, CVE-2019-5736 points out two different problems:
> > >
> > > The big problem: The __ptrace_may_access() logic has a special-case
> > > short-circuit for "introspection" that you can't opt out of; this
> > > makes it possible to open things in procfs that are related to the
> > > current process even if the credentials of the process wouldn't permit
> > > accessing another process like it. I think the proper fix to deal with
> > > this would be to add a prctl() flag for "set whether introspection is
> > > allowed for this process", and if userspace has manually un-set that
> > > flag, any introspection special-case logic would be skipped.
> >
> > We could do PR_SET_DUMPABLE=3 for this, I guess?
>
> Hmm... I'd make it a new prctl() command, since introspection is
> somewhat orthogonal to dumpability. Also, dumpability is per-mm, and I
> think the introspection flag should be per-thread.
I've lost track of the context here, but it seems to me that
mitigating attacks involving accidental following of /proc links
shouldn't depend on dumpability. What's the actual problem this is
trying to solve again?
On Fri, May 10, 2019 at 02:20:23PM -0700, Andy Lutomirski wrote: > On Fri, May 10, 2019 at 1:41 PM Jann Horn <jannh@google.com> wrote: > > > > On Tue, May 07, 2019 at 05:17:35AM +1000, Aleksa Sarai wrote: > > > On 2019-05-06, Jann Horn <jannh@google.com> wrote: > > > > In my opinion, CVE-2019-5736 points out two different problems: > > > > > > > > The big problem: The __ptrace_may_access() logic has a special-case > > > > short-circuit for "introspection" that you can't opt out of; this > > > > makes it possible to open things in procfs that are related to the > > > > current process even if the credentials of the process wouldn't permit > > > > accessing another process like it. I think the proper fix to deal with > > > > this would be to add a prctl() flag for "set whether introspection is > > > > allowed for this process", and if userspace has manually un-set that > > > > flag, any introspection special-case logic would be skipped. > > > > > > We could do PR_SET_DUMPABLE=3 for this, I guess? > > > > Hmm... I'd make it a new prctl() command, since introspection is > > somewhat orthogonal to dumpability. Also, dumpability is per-mm, and I > > think the introspection flag should be per-thread. > > I've lost track of the context here, but it seems to me that > mitigating attacks involving accidental following of /proc links > shouldn't depend on dumpability. What's the actual problem this is > trying to solve again? The one actual security problem that I've seen related to this is CVE-2019-5736. There is a write-up of it at <https://blog.dragonsector.pl/2019/02/cve-2019-5736-escape-from-docker-and.html> under "Successful approach", but it goes more or less as follows: A container is running that doesn't use user namespaces (because for some reason I don't understand, apparently some people still do that). An evil process is running inside the container with UID 0 (as in, GLOBAL_ROOT_UID); so if the evil process inside the container was able to reach root-owned files on the host filesystem, it could write into them. The container engine wants to spawn a new process inside the container. It forks off a child that joins the container's namespaces (including PID and mount namespaces), and then the child calls execve() on some path in the container. The attacker replaces the executable in the container with a symlink to /proc/self/exe and replaces a library inside the container with a malicious one. When the container engine calls execve(), intending to run an executable inside the container, it instead goes through ptrace_may_access() using the introspection short-circuit and re-executes its own executable through the jumped symlink /proc/self/exe (which is normally unreachable for the container). After the execve(), the process loads an evil library from inside the container and is under the control of the container. Now the container controls a process whose /proc/self/exe is a jumped symlink to a host executable, and the container can write into it. Some container engines are now using an extremely ugly hack to work around this - whenever they want to enter a container, they copy the host binary into a new memfd and execute that to avoid exposing the original host binary to containers: <https://github.com/opencontainers/runc/commit/0a8e4117e7f715d5fbeef398405813ce8e88558b> In my opinion, the problems here are: - Apparently some people run untrusted containers without user namespaces. It would be really nice if people could not do that. (Probably the biggest problem here.) - ptrace_may_access() has a short-circuit that permits a process to unintentionally look at itself even if it has dropped privileges - here, it permits the execve("/proc/self/exe", ...) that would normally be blocked by the check for CAP_SYS_PTRACE if the process is nondumpable. - You can use /proc/*/exe to get a writable fd.
On Sat, May 11, 2019 at 12:55 AM Jann Horn <jannh@google.com> wrote:
>
> On Fri, May 10, 2019 at 02:20:23PM -0700, Andy Lutomirski wrote:
> > On Fri, May 10, 2019 at 1:41 PM Jann Horn <jannh@google.com> wrote:
> > >
> > > On Tue, May 07, 2019 at 05:17:35AM +1000, Aleksa Sarai wrote:
> > > > On 2019-05-06, Jann Horn <jannh@google.com> wrote:
> > > > > In my opinion, CVE-2019-5736 points out two different problems:
> > > > >
> > > > > The big problem: The __ptrace_may_access() logic has a special-case
> > > > > short-circuit for "introspection" that you can't opt out of; this
> > > > > makes it possible to open things in procfs that are related to the
> > > > > current process even if the credentials of the process wouldn't permit
> > > > > accessing another process like it. I think the proper fix to deal with
> > > > > this would be to add a prctl() flag for "set whether introspection is
> > > > > allowed for this process", and if userspace has manually un-set that
> > > > > flag, any introspection special-case logic would be skipped.
> > > >
> > > > We could do PR_SET_DUMPABLE=3 for this, I guess?
> > >
> > > Hmm... I'd make it a new prctl() command, since introspection is
> > > somewhat orthogonal to dumpability. Also, dumpability is per-mm, and I
> > > think the introspection flag should be per-thread.
> >
> > I've lost track of the context here, but it seems to me that
> > mitigating attacks involving accidental following of /proc links
> > shouldn't depend on dumpability. What's the actual problem this is
> > trying to solve again?
>
> The one actual security problem that I've seen related to this is
> CVE-2019-5736. There is a write-up of it at
> <https://blog.dragonsector.pl/2019/02/cve-2019-5736-escape-from-docker-and.html>
> under "Successful approach", but it goes more or less as follows:
>
> A container is running that doesn't use user namespaces (because for
> some reason I don't understand, apparently some people still do that).
> An evil process is running inside the container with UID 0 (as in,
> GLOBAL_ROOT_UID); so if the evil process inside the container was able
> to reach root-owned files on the host filesystem, it could write into
> them.
>
> The container engine wants to spawn a new process inside the container.
> It forks off a child that joins the container's namespaces (including
> PID and mount namespaces), and then the child calls execve() on some
> path in the container.
> The attacker replaces the executable in the container with a symlink
> to /proc/self/exe and replaces a library inside the container with a
> malicious one.
> When the container engine calls execve(), intending to run an executable
> inside the container, it instead goes through ptrace_may_access() using
> the introspection short-circuit and re-executes its own executable
> through the jumped symlink /proc/self/exe (which is normally unreachable
> for the container). After the execve(), the process loads an evil
> library from inside the container and is under the control of the
> container.
> Now the container controls a process whose /proc/self/exe is a jumped
> symlink to a host executable, and the container can write into it.
>
> Some container engines are now using an extremely ugly hack to work
> around this - whenever they want to enter a container, they copy the
> host binary into a new memfd and execute that to avoid exposing the
> original host binary to containers:
> <https://github.com/opencontainers/runc/commit/0a8e4117e7f715d5fbeef398405813ce8e88558b>
>
>
> In my opinion, the problems here are:
>
> - Apparently some people run untrusted containers without user
> namespaces. It would be really nice if people could not do that.
> (Probably the biggest problem here.)
I know I sound like a broken record since I've been going on about this
forever together with a lot of other people but honestly,
the fact that people are running untrusted workloads in privileged containers
is the real issue here.
Aleksa is a good friend of mine and we have discussed this a lot so I hope
he doesn't hate me for saying this again: it is crazy that there are container
runtimes out there that promise (or at least do not state the opposite)
containers without user namespaces or containers with user namespaces
that allow to map the host root id to anything can be safe. They cannot.
Even if this /proc/*/exe thing is somehow blocked there
are other ways of escaping from a privileged container.
We (i.e. LXC) literally do not accept CVEs for privileged containers
because we do not consider them safe by design.
It seems to me to be heading in the wrong direction to keep up the
illusion that with enough effort we can make this all nice and safe.
Yes, the userspace memfd hack we came up with is as ugly as a security
patch can be but if you make promises you can't keep you better be
prepared to pay the price when things start to fall apart.
So if this part of the patch is just needed to handle this do we really
want to do all that tricky work or is there more to gain from this that
makes it worth it.
Christian
[-- Attachment #1: Type: text/plain, Size: 3202 bytes --] On 2019-05-11, Christian Brauner <christian@brauner.io> wrote: > > In my opinion, the problems here are: > > > > - Apparently some people run untrusted containers without user > > namespaces. It would be really nice if people could not do that. > > (Probably the biggest problem here.) > > I know I sound like a broken record since I've been going on about this > forever together with a lot of other people but honestly, > the fact that people are running untrusted workloads in privileged containers > is the real issue here. I completely agree. It's a shit-show, and it's caused by bad defaults in Docker and (now) podman. To be fair, they both now support rootless containers but the default is still privileged containers. They do support user namespaces (though it should be noted that LXD's support is much nicer from a security standpoint) but unless it's the default the support is almost pointless. In the case of Docker it can lead to some usability issues when you enable it (which I believe is the main justification for it not being the default). > Aleksa is a good friend of mine and we have discussed this a lot so I hope > he doesn't hate me for saying this again: it is crazy that there are container > runtimes out there that promise (or at least do not state the opposite) > containers without user namespaces or containers with user namespaces > that allow to map the host root id to anything can be safe. They cannot. Yeah, the fact that we (runc) don't scream from the rooftops that this setup is insecure is definitely a problem. I have mentioned this whenever I've had a chance, but the fact that the most popular runtimes (which use runc) don't use user namespaces compounds the issue. I'm willing to bet that >90% of users of runc-based runtimes don't use user namespaces at all, and this is all down to bad defaults. There are also some other misfeatures we have in runc that we're basically forced to support because some users use them, and we can't really break entire projects (even though it's the projects' fault they have an insecure setup). > It seems to me to be heading in the wrong direction to keep up the > illusion that with enough effort we can make this all nice and safe. > Yes, the userspace memfd hack we came up with is as ugly as a security > patch can be but if you make promises you can't keep you better be > prepared to pay the price when things start to fall apart. > So if this part of the patch is just needed to handle this do we really > want to do all that tricky work or is there more to gain from this that > makes it worth it. I dropped this patch in v7, I don't think it's required for the overarching feature. Looking back on it, it doesn't make much sense given the context that privileged containers are unsafe in the first place. I do think that being able to block introspection might be a useful hardening feature though. During attachment it would be nice to be sure that nothing will be able to touch the attaching process's /proc/$pid -- even itself. -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH <https://www.cyphar.com/> [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --]
> On May 10, 2019, at 3:55 PM, Jann Horn <jannh@google.com> wrote: > >> On Fri, May 10, 2019 at 02:20:23PM -0700, Andy Lutomirski wrote: >>> On Fri, May 10, 2019 at 1:41 PM Jann Horn <jannh@google.com> wrote: >>> >>>> On Tue, May 07, 2019 at 05:17:35AM +1000, Aleksa Sarai wrote: >>>>> On 2019-05-06, Jann Horn <jannh@google.com> wrote: >>>>> In my opinion, CVE-2019-5736 points out two different problems: >>>>> >>>>> The big problem: The __ptrace_may_access() logic has a special-case >>>>> short-circuit for "introspection" that you can't opt out of; this >>>>> makes it possible to open things in procfs that are related to the >>>>> current process even if the credentials of the process wouldn't permit >>>>> accessing another process like it. I think the proper fix to deal with >>>>> this would be to add a prctl() flag for "set whether introspection is >>>>> allowed for this process", and if userspace has manually un-set that >>>>> flag, any introspection special-case logic would be skipped. >>>> >>>> We could do PR_SET_DUMPABLE=3 for this, I guess? >>> >>> Hmm... I'd make it a new prctl() command, since introspection is >>> somewhat orthogonal to dumpability. Also, dumpability is per-mm, and I >>> think the introspection flag should be per-thread. >> >> I've lost track of the context here, but it seems to me that >> mitigating attacks involving accidental following of /proc links >> shouldn't depend on dumpability. What's the actual problem this is >> trying to solve again? > > The one actual security problem that I've seen related to this is > CVE-2019-5736. There is a write-up of it at > <https://blog.dragonsector.pl/2019/02/cve-2019-5736-escape-from-docker-and.html> > under "Successful approach", but it goes more or less as follows: > > A container is running that doesn't use user namespaces (because for > some reason I don't understand, apparently some people still do that). > An evil process is running inside the container with UID 0 (as in, > GLOBAL_ROOT_UID); so if the evil process inside the container was able > to reach root-owned files on the host filesystem, it could write into > them. > > The container engine wants to spawn a new process inside the container. > It forks off a child that joins the container's namespaces (including > PID and mount namespaces), and then the child calls execve() on some > path in the container. I think that, at this point, the task should be considered owned by the container. Maybe we should have a better API than execve() to execute a program in a safer way, but fiddling with dumpability seems like a band-aid. In fact, the process is arguably pwned even *before* execve. A better “spawn” API should fix this. In the mean time, I think it should be assumed that, if you join a container’s namespaces, you are at its mercy. > The attacker replaces the executable in the container with a symlink > to /proc/self/exe and replaces a library inside the container with a > malicious one. Cute. > When the container engine calls execve(), intending to run an executable > inside the container, it instead goes through ptrace_may_access() using > the introspection short-circuit and re-executes its own executable > through the jumped symlink /proc/self/exe (which is normally unreachable > for the container). After the execve(), the process loads an evil > library from inside the container and is under the control of the > container. > Now the container controls a process whose /proc/self/exe is a jumped > symlink to a host executable, and the container can write into it. > > Some container engines are now using an extremely ugly hack to work > around this - whenever they want to enter a container, they copy the > host binary into a new memfd and execute that to avoid exposing the > original host binary to containers: > <https://github.com/opencontainers/runc/commit/0a8e4117e7f715d5fbeef398405813ce8e88558b> > > > In my opinion, the problems here are: > > - Apparently some people run untrusted containers without user > namespaces. It would be really nice if people could not do that. > (Probably the biggest problem here.) > - ptrace_may_access() has a short-circuit that permits a process to > unintentionally look at itself even if it has dropped privileges - > here, it permits the execve("/proc/self/exe", ...) that would > normally be blocked by the check for CAP_SYS_PTRACE if the process > is nondumpable. I don’t see this as a problem. Dumpable is about protecting a task from others, not about protecting a task against itself. > - You can use /proc/*/exe to get a writable fd. This is IMO the real bug.
On Sat, May 11, 2019 at 1:00 PM Andy Lutomirski <luto@amacapital.net> wrote:
>
> A better “spawn” API should fix this.
Andy, stop with the "spawn would be better".
Spawn is garbage. It's garbage because it's fundamentally too
inflexible, and it's garbage because it is quite complex to try to
work around the inflexibility by having those complex "action pointer
arrays" to make up for its failings.
And spawn() would fundamentally have all the same permission issues
that you now point to execve() as having, so it doesn't even *solve*
anything.
You've said this whole "spawn would fix things" thing before, and it's
wrong. Spawn isn't better. Really. If fixes absolutely zero things,
and the only reason for spawn existing is because VMS and NT had that
broken and inflexible model.
There's at least one paper from some MS people about how "spawn()" is
wonderful, and maybe you bought into the garbage from that. But that
paper is about how they hate fork(), not because of execve(). And if
you hate fork, use "vfork()" instead (preferably with an immediate
call to a non-returning function in the child to avoid the stack
re-use issue that makes it so simple to screw up vfork() in hard to
debug ways).
execve() is a _fine_ model. That's not the problem in this whole issue
at all - never was, and never will be.
The problem in this discussion is (a) having privileges you shouldn't
have and (b) having other interfaces that make it easyish to change
the filesystem layout to confuse those entities with privileges.
So the reason the open flags can be problematic is exactly because
they effectively change filesystem layout. And no, it's not just
AT_THIS_ROOT, although that's the obvious one. Things like "you can't
follow symlinks" can also effectively change the layout: imagine if
you have a PATH-like lookup model, and you end up having symlinks as
part of the standard filesystem layout.. Now a "don't follow symlinks"
can turn the *standard* executable into something that isn't found,
and then you might end up executing something else instead (think root
having '.' as the last entry in path, which some people used to
suggest as the fix for the completely bad "first entry" case)..
Notice? None of the real problems are about execve or would be solved
by any spawn API. You just think that because you've apparently been
talking to too many MS people that think fork (and thus indirectly
execve()) is bad process management.
Linus
[-- Attachment #1: Type: text/plain, Size: 3017 bytes --] On 2019-05-11, Andy Lutomirski <luto@amacapital.net> wrote: > >> I've lost track of the context here, but it seems to me that > >> mitigating attacks involving accidental following of /proc links > >> shouldn't depend on dumpability. What's the actual problem this is > >> trying to solve again? > > > > The one actual security problem that I've seen related to this is > > CVE-2019-5736. There is a write-up of it at > > <https://blog.dragonsector.pl/2019/02/cve-2019-5736-escape-from-docker-and.html> > > under "Successful approach", but it goes more or less as follows: > > > > A container is running that doesn't use user namespaces (because for > > some reason I don't understand, apparently some people still do that). > > An evil process is running inside the container with UID 0 (as in, > > GLOBAL_ROOT_UID); so if the evil process inside the container was able > > to reach root-owned files on the host filesystem, it could write into > > them. > > > > The container engine wants to spawn a new process inside the container. > > It forks off a child that joins the container's namespaces (including > > PID and mount namespaces), and then the child calls execve() on some > > path in the container. > > I think that, at this point, the task should be considered owned by > the container. Maybe we should have a better API than execve() to > execute a program in a safer way, but fiddling with dumpability seems > like a band-aid. In fact, the process is arguably pwned even *before* > execve. Yeah, execve is just the vector (though in this case it's done in order to clear mm->dumpable). An earlier CVE (CVE-2016-9962) was very similar but was attacking a dirfd that runc had open into the container (LXC had a very similar bug too) -- setting !mm->dumpable was one of the workarounds we had for this. > A better “spawn” API should fix this. In the mean time, I think it > should be assumed that, if you join a container’s namespaces, you are > at its mercy. This is generally how we treat containers as runtime authors, but it's not a trivial thing to get right. In many cases the kernel APIs are working against you -- Christian and myself have written a fair few patches to fix holes in the kernel APIs so we can avoid these kinds of assumptions. But yes, one of the most risky parts of a container runtime is when you're attaching to a running container because all of the helpful introspection APIs in /proc/ suddenly become a security nightmare. A better "spawn a process in these namespaces" API might help improve the situation (or at least, I hope it would). > > - You can use /proc/*/exe to get a writable fd. > > This is IMO the real bug. I will try to send an RFC of the patchset I have for this next week or so. Funnily enough, currently /proc/*/exe has the write bit set in its "mode" (my series fixes this). -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH <https://www.cyphar.com/> [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --]
On Sat, May 11, 2019 at 1:21 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Notice? None of the real problems are about execve or would be solved
> by any spawn API. You just think that because you've apparently been
> talking to too many MS people that think fork (and thus indirectly
> execve()) is bad process management.
Side note: a good policy has been (and remains) to make suid binaries
not be dynamically linked. And in the absence of that, the dynamic
linker at least resets the library path when it notices itself being
dynamic, and it certainly doesn't inherit any open flags from the
non-trusted environment.
And by the same logic, a suid interpreter must *definitely* should not
inherit any execve() flags from the non-trusted environment. So I
think Aleksa's patch to use the passed-in open flags is *exactly* the
wrong thing to do for security reasons. It doesn't close holes, it
opens them.
Linus
[-- Attachment #1: Type: text/plain, Size: 1371 bytes --] On 2019-05-11, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sat, May 11, 2019 at 1:21 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > Notice? None of the real problems are about execve or would be solved > > by any spawn API. You just think that because you've apparently been > > talking to too many MS people that think fork (and thus indirectly > > execve()) is bad process management. > > Side note: a good policy has been (and remains) to make suid binaries > not be dynamically linked. And in the absence of that, the dynamic > linker at least resets the library path when it notices itself being > dynamic, and it certainly doesn't inherit any open flags from the > non-trusted environment. > > And by the same logic, a suid interpreter must *definitely* should not > inherit any execve() flags from the non-trusted environment. So I > think Aleksa's patch to use the passed-in open flags is *exactly* the > wrong thing to do for security reasons. It doesn't close holes, it > opens them. Yup, I've dropped the patch for the next version. (To be honest, I'm not sure why I included any of the other flags -- the only one that would've been necessary to deal with CVE-2019-5736 was AT_NO_MAGICLINKS.) -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH <https://www.cyphar.com/> [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --]
On Sat, May 11, 2019 at 1:31 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> Yup, I've dropped the patch for the next version. (To be honest, I'm not
> sure why I included any of the other flags -- the only one that would've
> been necessary to deal with CVE-2019-5736 was AT_NO_MAGICLINKS.)
I do wonder if we could try to just set AT_NO_MAGICLINKS
unconditionally for execve() (and certainly for the suid case).
I'd rather try to do these things across the board, than have "suid
binaries are treated specially" if at all possible.
The main use case for having /proc/<pid>/exe thing is for finding open
file descriptors, and for 'ps' kind of use, or to find the startup
directory when people don't populate the execve() environment fully
(ie "readlink(/proc/self/exe)" is afaik pretty common.
Sadly, googling for
execve /proc/self/exe
does actually find hits, including one that implies that chrome does
exactly that. So it might not be possible.
Somewhat odd, but it does just confirm the whole "users will at some
point do everything in their power to use every odd special case,
intended or not".
Linus
On May 11, 2019 7:43:44 PM GMT+02:00, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>On Sat, May 11, 2019 at 1:31 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
>>
>> Yup, I've dropped the patch for the next version. (To be honest, I'm
>not
>> sure why I included any of the other flags -- the only one that
>would've
>> been necessary to deal with CVE-2019-5736 was AT_NO_MAGICLINKS.)
>
>I do wonder if we could try to just set AT_NO_MAGICLINKS
>unconditionally for execve() (and certainly for the suid case).
>
>I'd rather try to do these things across the board, than have "suid
>binaries are treated specially" if at all possible.
>
>The main use case for having /proc/<pid>/exe thing is for finding open
>file descriptors, and for 'ps' kind of use, or to find the startup
>directory when people don't populate the execve() environment fully
>(ie "readlink(/proc/self/exe)" is afaik pretty common.
>
>Sadly, googling for
>
> execve /proc/self/exe
>
>does actually find hits, including one that implies that chrome does
>exactly that. So it might not be possible.
>
>Somewhat odd, but it does just confirm the whole "users will at some
>point do everything in their power to use every odd special case,
>intended or not".
>
> Linus
Sadly I have to admit that we are using this.
Also, execveat on glibc is implemented via
/proc/self/fd/<nr> on kernels that do not
have a proper execveat.
See fexecve...
Christian
[-- Attachment #1: Type: text/plain, Size: 2050 bytes --] On 2019-05-11, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sat, May 11, 2019 at 1:31 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > > Yup, I've dropped the patch for the next version. (To be honest, I'm not > > sure why I included any of the other flags -- the only one that would've > > been necessary to deal with CVE-2019-5736 was AT_NO_MAGICLINKS.) > > I do wonder if we could try to just set AT_NO_MAGICLINKS > unconditionally for execve() (and certainly for the suid case). > > I'd rather try to do these things across the board, than have "suid > binaries are treated specially" if at all possible. > > The main use case for having /proc/<pid>/exe thing is for finding open > file descriptors, and for 'ps' kind of use, or to find the startup > directory when people don't populate the execve() environment fully > (ie "readlink(/proc/self/exe)" is afaik pretty common. > > Sadly, googling for > > execve /proc/self/exe > > does actually find hits, including one that implies that chrome does > exactly that. So it might not be possible. > > Somewhat odd, but it does just confirm the whole "users will at some > point do everything in their power to use every odd special case, > intended or not". *sheepishly* Actually we use this in runc very liberally. It's done because we need to run namespace-related code but runc is written in Go so (long story short) we re-exec ourselves in order to run some __attribute__((constructor)) code which sets up the namespaces and then lets the Go runtime boot. I suspect just writing everything in C would've been orders of magnitude simpler, but I wasn't around when that decision was made. :P Also as Christian mentioned, fexecve(3) in glibc is implemented using /proc/self/fd on old kernels (then again, if we change the behaviour on new kernels it won't matter because glibc uses execveat(AT_EMPTY_PATH) if it's available). -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH <https://www.cyphar.com/> [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --]
> On May 11, 2019, at 10:21 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > >> On Sat, May 11, 2019 at 1:00 PM Andy Lutomirski <luto@amacapital.net> wrote: >> >> A better “spawn” API should fix this. > > Andy, stop with the "spawn would be better". It doesn’t have to be spawn per se. But the current situation sucks. > > Notice? None of the real problems are about execve or would be solved > by any spawn API. You just think that because you've apparently been > talking to too many MS people that think fork (and thus indirectly > execve()) is bad process management. > > I’ve literally never spoken to an MS person about it. What container managers and init systems *want* is a way to drop privileges, change namespaces, etc and then run something in a controlled way so that the intermediate states aren’t dangerous. An API for this could be spawn-like or exec-like — that particular distinction is beside the point. Having personally written code that mucks with namepsaces, I've wanted two particular abilities that are both quite awkward: a) Change all my UIDs and GIDs to match a container, enter that container's namespaces, and run some binary in the container's filesystem, all atomically enough that I don't need to worry about accidentally leaking privileges into the container. A super-duper-non-dumpable mode would kind of allow this, but I'd worry that there's some other hole besides ptrace() and /proc/self. b) Change all my UIDs and GIDs to match a container, enter that container's namespaces, and run some binary that is *not* in the container's filesystem. This happens, for example, if the container's mount namespace has no exec mounts at all. We don't have a fantastic way to do this at all right now due to /proc/self/exe. Regardless, the actual CVE at hand would have been nicely avoided if writing to /proc/self/exe didn’t work, and I see no reason we can’t make that happen. I suppose we could also consider a change to disable /proc/self/exe if it's not reachable from /proc/self/root. By "disable", I mean that readlink() should maybe still work, but actually trying to open it could probably fail safely.
On Sat, May 11, 2019 at 07:08:49PM -0400, Linus Torvalds wrote: > [ on mobile again, power is off and the wifi doesn't work, so I'm reading > email on my phone and apologizing once more for horrible html email.. ] > > On Sat, May 11, 2019, 18:40 Andy Lutomirski <luto@kernel.org> wrote: > > > > > a) Change all my UIDs and GIDs to match a container, enter that > > container's namespaces, and run some binary in the container's For the namespace part, an idea I had and presented at LPC a while ago was to make setns() interpret the nstype argument as a flag argument and to introduce an fd that can refer to multiple namespaces at the same time. This way you could do: setns(namespaces_fd, CLONE_NEWNS | CLONE_NEWUSER | CLONE_NEWPID) that could still be done. But I since have come to think that there's a better way now that we have CLONE_PIDFD. We should just make setns() take a pidfd as argument and then be able to call: setns(pidfd, 0); which would cause the calling thread to join all of the namespaces of the process referred to by pidfd at the same time. That really shouldn't be hard to do. I could probably get this going rather quickly and it would really help out container managers a lot. > > filesystem, all atomically enough that I don't need to worry about > > accidentally leaking privileges into the container. A > > super-duper-non-dumpable mode would kind of allow this, but I'd worry > > that there's some other hole besides ptrace() and /proc/self. > > > > So I would expect that you'd want to do this even *without* doing an execve > at all, which is why I still don't think this is actually about > spawn/execve at all. > > But I agree that what you that what you want sounds reasonable. But I think I have pitched an api like that a while ago (see [1]) - when I first started this fd for processes thing - that would allow you to do things like that. The gist was: 1. int pidfd_create aka CLONE_PIDFD now will return an fd that creates a process context. The fd returned by does not refer to any existing process and has not actually been started yet. So non-configuration operations on it or trying to interact with it would fail with e.g. ESRCH/EINVAL. We essentially have this now with CLONE_PIDFD. The bit that is missing is an "unscheduled" process. 2. int pidfd_config takes a CLONE_PIDFD and can be used to configure a process context before it is alive. Some things that I would like to be able to do with this syscall are: - configure signals - set clone flags - write idmappings if the process runs in a new user namespace - configure what happens when all procfds referring to the process are gone - ... 3. int pidfd_info 4. int pidfd_manage Parts of that are captured in pidfd_send_signal(). Just to get a very rough feel for this without detailing parameters right now: /* process should have own mountns */ pidfd_config(fd, PROC_SET_NAMESPACE, CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWUSR, <potentially additional arguments>) /* process should get SIGKILL when all procfds are closed */ pidfd_config(fd, PROC_SET_CLOSE, SIGKILL, <potentially additional arguments>) After the caller is done configuring the process there would be a final step: pidfd_config(fd, PROC_CREATE, 0, <potentially additional arguments>) which would create the process and (either as return value or through a parameter) return the pid of the newly created process. I had more thoughts on this and had started prototyping some of it but then there wasn't much interest it seemed. Maybe because it's crazy. [1]: https://lore.kernel.org/lkml/20181118174148.nvkc4ox2uorfatbm@brauner.io/ > the "dumpable" flag has always been a complete hack, and very unreliable > with random meanings (and random ways to then get around it). > > We have actually had lots of people wanting to run "lists of system calls" > kinds of things. Sometimes for performance reasons, sometimes for other > random reasons Maybe that "atomicity" angle would be another one, although > we would have to make the setuid etc things be something that happens at > the end. > > So rather than "spawn", is much rather see people think about ways to just > batch operations in general, rather than think it is about batching things > just before a process change. > > b) Change all my UIDs and GIDs to match a container, enter that > > container's namespaces, and run some binary that is *not* in the > > container's filesystem. > > > > Hey, you could probably do something very close to that by opening the > executable you want to run first, making it O_CLOEXEC, then doing all the > other things (which now makes the executable inaccessible), and finally > doing execveat() on the file descriptor.. I think that's somewhat similar to what I've suggested above. > > I say "something very close" because even though it's O_CLOEXEC, only the > file would be closed, and /proc/self/exe would still exist. > > But we really could make that execveat of a O_CLOEXEC file thing also > disable access to /proc/*/exe, and it sounds like a sane and simple > extension in general to do.. > > Linus > > >
On Sat, May 11, 2019 at 7:37 PM Andy Lutomirski <luto@amacapital.net> wrote:
>
> I bet this will break something that already exists. An execveat() flag to turn off /proc/self/exe would do the trick, though.
Thinking more about it, I suspect it is (once again) wrong to let the
thing that does the execve() control that bit.
Generally, the less we allow people to affect the lifetime and
environment of a suid executable, the better off we are.
But maybe we could limit /proc/*/exe to at least not honor suid'ness
of the target? Or does chrome/runc depend on that too?
Linus
[-- Attachment #1: Type: text/plain, Size: 2131 bytes --] On 2019-05-12, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sat, May 11, 2019 at 7:37 PM Andy Lutomirski <luto@amacapital.net> wrote: > > I bet this will break something that already exists. An execveat() > > flag to turn off /proc/self/exe would do the trick, though. > > Thinking more about it, I suspect it is (once again) wrong to let the > thing that does the execve() control that bit. > > Generally, the less we allow people to affect the lifetime and > environment of a suid executable, the better off we are. > > But maybe we could limit /proc/*/exe to at least not honor suid'ness > of the target? Or does chrome/runc depend on that too? Speaking on the runc side, we don't depend on this. It's possible someone depends on this for fexecve(3) -- but as mentioned before in newer kernels glibc uses execve(AT_EMPTY_PATH). I would like to point out though that I'm a little bit cautious about /proc/self/exe-specific restrictions -- because a trivial way to get around them would be to just open it with O_PATH (and you end up with a /proc/self/fd/ which is equivalent). Unfortunately blocking setuid exec on all O_PATH descriptors would break even execve(AT_EMPTY_PATH) of setuid descriptors. The patches I mentioned (which Andy and I discussed off-list) would effectively make the magiclink modes in /proc/ affect how you can operate on the path (no write bit in the mode, cannot re-open it write). One aspect of this is how to handle O_PATH and in particular how do we handle an O_PATH re-open of an already-restricted magiclink. Maybe we could make it so that setuid is disallowed if you are dealing with an O_PATH fd which was a magiclink. Effectively, on O_PATH open you get an fmode_t saying FMODE_SETUID_EXEC_ALLOWED *but* if the path is a magiclink this fmode gets dropped and when the fd is given to execveat(AT_EMPTY_PATH) the fmode is checked and setuid-exec is not allowed. [I assume in this discussion "setuid" means "setuid + setcap", right?] -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH <https://www.cyphar.com/> [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --]
[-- Attachment #1: Type: text/plain, Size: 2459 bytes --] On 2019-05-12, Aleksa Sarai <cyphar@cyphar.com> wrote: > On 2019-05-12, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Sat, May 11, 2019 at 7:37 PM Andy Lutomirski <luto@amacapital.net> wrote: > > > I bet this will break something that already exists. An execveat() > > > flag to turn off /proc/self/exe would do the trick, though. > > > > Thinking more about it, I suspect it is (once again) wrong to let the > > thing that does the execve() control that bit. > > > > Generally, the less we allow people to affect the lifetime and > > environment of a suid executable, the better off we are. > > > > But maybe we could limit /proc/*/exe to at least not honor suid'ness > > of the target? Or does chrome/runc depend on that too? > > Speaking on the runc side, we don't depend on this. It's possible > someone depends on this for fexecve(3) -- but as mentioned before in > newer kernels glibc uses execve(AT_EMPTY_PATH). > > I would like to point out though that I'm a little bit cautious about > /proc/self/exe-specific restrictions -- because a trivial way to get > around them would be to just open it with O_PATH (and you end up with a > /proc/self/fd/ which is equivalent). Unfortunately blocking setuid exec > on all O_PATH descriptors would break even execve(AT_EMPTY_PATH) of > setuid descriptors. > > The patches I mentioned (which Andy and I discussed off-list) would > effectively make the magiclink modes in /proc/ affect how you can > operate on the path (no write bit in the mode, cannot re-open it write). > One aspect of this is how to handle O_PATH and in particular how do we > handle an O_PATH re-open of an already-restricted magiclink. > > Maybe we could make it so that setuid is disallowed if you are dealing > with an O_PATH fd which was a magiclink. Effectively, on O_PATH open you > get an fmode_t saying FMODE_SETUID_EXEC_ALLOWED *but* if the path is a > magiclink this fmode gets dropped and when the fd is given to > execveat(AT_EMPTY_PATH) the fmode is checked and setuid-exec is not > allowed. ... and obviously /proc/self/exe would have an fmode ~FMODE_SETUID_EXEC_ALLOWED from the outset. The reason for this slightly odd semantic would be to continue to allow O_PATH setuid-exec as long as the O_PATH was opened from an actual path rather than a magiclink. -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH <https://www.cyphar.com/> [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --]
> On May 12, 2019, at 6:35 AM, Aleksa Sarai <cyphar@cyphar.com> wrote:
>
>> On 2019-05-12, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>> On Sat, May 11, 2019 at 7:37 PM Andy Lutomirski <luto@amacapital.net> wrote:
>>> I bet this will break something that already exists. An execveat()
>>> flag to turn off /proc/self/exe would do the trick, though.
>>
>> Thinking more about it, I suspect it is (once again) wrong to let the
>> thing that does the execve() control that bit.
>>
>> Generally, the less we allow people to affect the lifetime and
>> environment of a suid executable, the better off we are.
>>
>> But maybe we could limit /proc/*/exe to at least not honor suid'ness
>> of the target? Or does chrome/runc depend on that too?
>
> Speaking on the runc side, we don't depend on this. It's possible
> someone depends on this for fexecve(3) -- but as mentioned before in
> newer kernels glibc uses execve(AT_EMPTY_PATH).
Why are we concerned about suid? Don’t we already block suid if the path being execed doesn’t come from the current mountns? That should mostly cover the things we care about, no?
I suppose we could also block suid for deleted files, so that deleting a known-buggy suid binary becomes more reliable. But every sensible package tool should already be chmoding the suid away before unlinking.