linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH RESEND v5 0/5] namei: vfs flags to restrict path resolution
@ 2019-03-20 14:37 Aleksa Sarai
  2019-03-20 14:37 ` [PATCH RESEND v5 1/5] namei: split out nd->dfd handling to dirfd_path_init Aleksa Sarai
                   ` (5 more replies)
  0 siblings, 6 replies; 14+ messages in thread
From: Aleksa Sarai @ 2019-03-20 14:37 UTC (permalink / raw)
  To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells
  Cc: Aleksa Sarai, Eric Biederman, Andy Lutomirski, Jann Horn,
	Christian Brauner, David Drysdale, Tycho Andersen, Kees Cook,
	containers, linux-fsdevel, linux-api, Andrew Morton,
	Alexei Starovoitov, Chanho Min, Oleg Nesterov, Aleksa Sarai,
	Linus Torvalds, linux-kernel, linux-arch

Now that the holiday break is over, it's time to re-send this patch
series (with a few additions, due to new information we got from
CVE-2019-5736 -- which this patchset mostly protected against but had
some holes with regards to #!-style scripts).

Patch changelog:
  v5:
    * In response to CVE-2019-5736 (one of the vectors showed that
      open(2)+fexec(3) cannot be used to scope binfmt_script's implicit
      open_exec()), AT_* flags have been re-added and are now piped
      through to binfmt_script (and other binfmt_* that use open_exec)
      but are only supported for execveat(2) for now.
  v4:
    * Remove AT_* flag reservations, as they require more discussion.
    * Switch to path_is_under() over __d_path() for breakout checking.
    * Make O_XDEV no longer block openat("/tmp", "/", O_XDEV) -- dirfd
      is now ignored for absolute paths to match other flags.
    * Improve the dirfd_path_init() refactor and move it to a separate
      commit.
    * Remove reference to Linux-capsicum.
    * Switch "proclink" name to "magic link".
  v3: [resend]
  v2:
    * Made ".." resolution with AT_THIS_ROOT and AT_BENEATH safe(r) with
      some semi-aggressive __d_path checking (see patch 3).
    * Disallowed "proclinks" with AT_THIS_ROOT and AT_BENEATH, in the
      hopes they can be re-enabled once safe.
    * Removed the selftests as they will be reimplemented as xfstests.
    * Removed stat(2) support, since you can already get it through
      O_PATH and fstatat(2).

The need for some sort of control over VFS's path resolution (to avoid
malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications. This patchset is a
revival of Al Viro's old AT_NO_JUMPS[1,2] patchset (which was a variant
of David Drysdale's O_BENEATH patchset[3] which was a spin-off of the
Capsicum project[4]) with a few additions and changes made based on the
previous discussion within [5] as well as others I felt were useful.

In line with the conclusions of the original discussion of AT_NO_JUMPS,
the flag has been split up into separate flags:

  * O_XDEV blocks all mountpoint crossings (upwards, downwards, or
    through absolute links). Absolute pathnames alone in openat(2) do
    not trigger this.

  * O_NOMAGICLINKS blocks resolution through /proc/$pid/fd-style links.
    This is done by blocking the usage of nd_jump_link() during
    resolution in a filesystem. The term "magic links" is used to match
    with the only reference to these links in Documentation/, but I'm
    happy to change the name.

    It should be noted that this is different to the scope of O_NOFOLLOW
    in that it applies to all path components. However, you can do
    open(O_NOFOLLOW|O_NOMAGICLINKS|O_PATH) on a "magic link" and it will
    *not* fail (assuming that no parent component was a "magic link"),
    and you will have an fd for the "magic link".

  * O_BENEATH disallows escapes to outside the starting dirfd's tree,
    using techniques such as ".." or absolute links. Absolute paths in
    openat(2) are also disallowed. Conceptually this flag is to ensure
    you "stay below" a certain point in the filesystem tree -- but this
    requires some additional to protect against various races that would
    allow escape using ".." (see patch 4 for more detail).

    Currently O_BENEATH implies O_NOMAGICLINKS, because it can trivially
    beam you around the filesystem (breaking the protection). In future,
    there might be similar safety checks as in patch 4, but that
    requires more discussion.

In addition, two new flags were added that expand on the above ideas:

  * O_NOSYMLINKS does what it says on the tin. No symlink resolution is
    allowed at all, including "magic links". Just as with O_NOMAGICLINKS
    this can still be used with (O_PATH|O_NOFOLLOW) to open an fd for
    the symlink as long as no parent path had a symlink component.

  * O_THISROOT is an extension of O_BENEATH that, rather than blocking
    attempts to move past the root, forces all such movements to be
    scoped to the starting point. This provides chroot(2)-like
    protection but without the cost of a chroot(2) for each filesystem
    operation, as well as being safe against race attacks that chroot(2)
    is not.

    If a race is detected (as with O_BENEATH) then an error is
    generated, and similar to O_BENEATH it is not permitted to cross
    "magic links" with O_THISROOT.

    The primary need for this is from container runtimes, which
    currently need to do symlink scoping in userspace[6] when opening
    paths in a potentially malicious container. There is a long list of
    CVEs that could have bene mitigated by having O_THISROOT (such as
    CVE-2017-1002101, CVE-2017-1002102, CVE-2018-15664, and
    CVE-2019-5736, just to name a few).

In addition, a mirror set of AT_* flags have been added (though
currently these are only supported for execveat(2) -- and not for any
other syscall). The need for these is explained in the final patch in
the series (it's motivated by CVE-2019-5736).

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: David Drysdale <drysdale@google.com>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Kees Cook <keescook@chromium.org>
Cc: <containers@lists.linux-foundation.org>
Cc: <linux-fsdevel@vger.kernel.org>
Cc: <linux-api@vger.kernel.org>

[1]: https://lwn.net/Articles/721443/
[2]: https://lore.kernel.org/patchwork/patch/784221/
[3]: https://lwn.net/Articles/619151/
[4]: https://lwn.net/Articles/603929/
[5]: https://lwn.net/Articles/723057/
[6]: https://github.com/cyphar/filepath-securejoin

Aleksa Sarai (5):
  namei: split out nd->dfd handling to dirfd_path_init
  namei: O_BENEATH-style path resolution flags
  namei: O_THISROOT: chroot-like path resolution
  namei: aggressively check for nd->root escape on ".." resolution
  binfmt_*: scope path resolution of interpreters

 fs/binfmt_elf.c                  |   2 +-
 fs/binfmt_elf_fdpic.c            |   2 +-
 fs/binfmt_em86.c                 |   4 +-
 fs/binfmt_misc.c                 |   2 +-
 fs/binfmt_script.c               |   2 +-
 fs/exec.c                        |  26 +++-
 fs/fcntl.c                       |   2 +-
 fs/namei.c                       | 205 ++++++++++++++++++++++---------
 fs/open.c                        |  13 +-
 include/linux/binfmts.h          |   1 +
 include/linux/fcntl.h            |   3 +-
 include/linux/fs.h               |   9 +-
 include/linux/namei.h            |   8 ++
 include/uapi/asm-generic/fcntl.h |  17 +++
 include/uapi/linux/fcntl.h       |   6 +
 15 files changed, 228 insertions(+), 74 deletions(-)

-- 
2.21.0


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH RESEND v5 1/5] namei: split out nd->dfd handling to dirfd_path_init
  2019-03-20 14:37 [PATCH RESEND v5 0/5] namei: vfs flags to restrict path resolution Aleksa Sarai
@ 2019-03-20 14:37 ` Aleksa Sarai
  2019-03-20 14:37 ` [PATCH RESEND v5 2/5] namei: O_BENEATH-style path resolution flags Aleksa Sarai
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Aleksa Sarai @ 2019-03-20 14:37 UTC (permalink / raw)
  To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells
  Cc: Aleksa Sarai, Eric Biederman, Andy Lutomirski, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Christian Brauner,
	David Drysdale, Chanho Min, Oleg Nesterov, Aleksa Sarai,
	Linus Torvalds, containers, linux-fsdevel, linux-api,
	linux-kernel, linux-arch

Previously, path_init's handling of *at(dfd, ...) was only done once,
but with O_BENEATH (and O_THISROOT) we have to parse the initial
nd->path at different times (before or after absolute path handling)
depending on whether we have been asked to scope resolution within a
root.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/namei.c | 103 ++++++++++++++++++++++++++++++-----------------------
 1 file changed, 59 insertions(+), 44 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index a85deb55d0c9..4fdcb36f7c01 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2168,9 +2168,59 @@ static int link_path_walk(const char *name, struct nameidata *nd)
 	}
 }
 
+/*
+ * Configure nd->path based on the nd->dfd. This is only used as part of
+ * path_init().
+ */
+static inline int dirfd_path_init(struct nameidata *nd)
+{
+	if (nd->dfd == AT_FDCWD) {
+		if (nd->flags & LOOKUP_RCU) {
+			struct fs_struct *fs = current->fs;
+			unsigned seq;
+
+			do {
+				seq = read_seqcount_begin(&fs->seq);
+				nd->path = fs->pwd;
+				nd->inode = nd->path.dentry->d_inode;
+				nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
+			} while (read_seqcount_retry(&fs->seq, seq));
+		} else {
+			get_fs_pwd(current->fs, &nd->path);
+			nd->inode = nd->path.dentry->d_inode;
+		}
+	} else {
+		/* Caller must check execute permissions on the starting path component */
+		struct fd f = fdget_raw(nd->dfd);
+		struct dentry *dentry;
+
+		if (!f.file)
+			return -EBADF;
+
+		dentry = f.file->f_path.dentry;
+
+		if (*nd->name->name && unlikely(!d_can_lookup(dentry))) {
+			fdput(f);
+			return -ENOTDIR;
+		}
+
+		nd->path = f.file->f_path;
+		if (nd->flags & LOOKUP_RCU) {
+			nd->inode = nd->path.dentry->d_inode;
+			nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
+		} else {
+			path_get(&nd->path);
+			nd->inode = nd->path.dentry->d_inode;
+		}
+		fdput(f);
+	}
+	return 0;
+}
+
 /* must be paired with terminate_walk() */
 static const char *path_init(struct nameidata *nd, unsigned flags)
 {
+	int error;
 	const char *s = nd->name->name;
 
 	if (!*s)
@@ -2204,52 +2254,17 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 
 	nd->m_seq = read_seqbegin(&mount_lock);
 	if (*s == '/') {
-		set_root(nd);
-		if (likely(!nd_jump_root(nd)))
-			return s;
-		return ERR_PTR(-ECHILD);
-	} else if (nd->dfd == AT_FDCWD) {
-		if (flags & LOOKUP_RCU) {
-			struct fs_struct *fs = current->fs;
-			unsigned seq;
-
-			do {
-				seq = read_seqcount_begin(&fs->seq);
-				nd->path = fs->pwd;
-				nd->inode = nd->path.dentry->d_inode;
-				nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
-			} while (read_seqcount_retry(&fs->seq, seq));
-		} else {
-			get_fs_pwd(current->fs, &nd->path);
-			nd->inode = nd->path.dentry->d_inode;
-		}
-		return s;
-	} else {
-		/* Caller must check execute permissions on the starting path component */
-		struct fd f = fdget_raw(nd->dfd);
-		struct dentry *dentry;
-
-		if (!f.file)
-			return ERR_PTR(-EBADF);
-
-		dentry = f.file->f_path.dentry;
-
-		if (*s && unlikely(!d_can_lookup(dentry))) {
-			fdput(f);
-			return ERR_PTR(-ENOTDIR);
-		}
-
-		nd->path = f.file->f_path;
-		if (flags & LOOKUP_RCU) {
-			nd->inode = nd->path.dentry->d_inode;
-			nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
-		} else {
-			path_get(&nd->path);
-			nd->inode = nd->path.dentry->d_inode;
-		}
-		fdput(f);
+		if (likely(!nd->root.mnt))
+			set_root(nd);
+		error = nd_jump_root(nd);
+		if (unlikely(error))
+			s = ERR_PTR(error);
 		return s;
 	}
+	error = dirfd_path_init(nd);
+	if (unlikely(error))
+		return ERR_PTR(error);
+	return s;
 }
 
 static const char *trailing_symlink(struct nameidata *nd)
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH RESEND v5 2/5] namei: O_BENEATH-style path resolution flags
  2019-03-20 14:37 [PATCH RESEND v5 0/5] namei: vfs flags to restrict path resolution Aleksa Sarai
  2019-03-20 14:37 ` [PATCH RESEND v5 1/5] namei: split out nd->dfd handling to dirfd_path_init Aleksa Sarai
@ 2019-03-20 14:37 ` Aleksa Sarai
  2019-03-20 14:37 ` [PATCH RESEND v5 3/5] namei: O_THISROOT: chroot-like path resolution Aleksa Sarai
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Aleksa Sarai @ 2019-03-20 14:37 UTC (permalink / raw)
  To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells
  Cc: Aleksa Sarai, Eric Biederman, Christian Brauner, Kees Cook,
	David Drysdale, Andy Lutomirski, Linus Torvalds, Andrew Morton,
	Alexei Starovoitov, Jann Horn, Chanho Min, Oleg Nesterov,
	Aleksa Sarai, containers, linux-fsdevel, linux-api, linux-kernel,
	linux-arch

Add the following flags to allow various restrictions on path
resolution (these affect the *entire* resolution, rather than just the
final path component -- as is the case with most other AT_* flags).

The primary justification for these flags is to allow for programs to be
far more strict about how they want path resolution to handle symlinks,
mountpoint crossings, and paths that escape the dirfd (through an
absolute path or ".." shenanigans).

This is of particular concern to container runtimes that want to be very
careful about malicious root filesystems that a container's init might
have screwed around with (and there is no real way to protect against
this in userspace if you consider potential races against a malicious
container's init). More classical applications (which have their own
potentially buggy userspace path sanitisation code) include web
servers, archive extraction tools, network file servers, and so on.

* O_XDEV: Disallow mount-point crossing (both *down* into one, or *up*
  from one). The primary "scoping" use is to blocking resolution that
  crosses a bind-mount, which has a similar property to a symlink (in
  the way that it allows for escape from the starting-point). Since it
  is not possible to differentiate bind-mounts However since
  bind-mounting requires privileges (in ways symlinks don't) this has
  been split from LOOKUP_BENEATH. The naming is based on "find -xdev"
  (though find(1) doesn't walk upwards, the semantics seem obvious).

* O_NOMAGICLINKS: Disallows ->get_link "symlink" jumping. This is a very
  specific restriction, and it exists because /proc/$pid/fd/...
  "symlinks" allow for access outside nd->root and pose risk to
  container runtimes that don't want to be tricked into accessing a host
  path (but do want to allow no-funny-business symlink resolution).

* O_NOSYMLINKS: Disallows symlink jumping *of any kind*. Implies
  O_NOMAGICLINKS (obviously).

* O_BENEATH: Disallow "escapes" from the starting point of the
  filesystem tree during resolution (you must stay "beneath" the
  starting point at all times). Currently this is done by disallowing
  ".." and absolute paths (either in the given path or found during
  symlink resolution) entirely, as well as all "magic link" jumping.

  The wholesale banning of ".." is because it is currently not safe to
  allow ".." resolution (races can cause the path to be moved outside of
  the root -- this is conceptually similar to historical chroot(2)
  escape attacks). Future patches in this series will address this, and
  will re-enable ".." resolution once it is safe. With those patches,
  ".." resolution will only be allowed if it remains in the root
  throughout resolution (such as "a/../b" not "a/../../outside/b").

  The banning of "magic link" jumping is done because it is not clear
  whether semantically they should be allowed -- while some "magic
  links" are safe there are many that can cause escapes (and once a
  resolution is outside of the root, O_BENEATH will no longer detect
  it). Future patches may re-enable "magic link" jumping when such jumps
  would remain inside the root.

The O_NO*LINK flags return -ELOOP if path resolution would violates
their requirement, while the others all return -EXDEV. Currently these
are only enabled for openat(2). In future it may be necessary to enable
these for *at(2) flags, as well as expanding support for AT_EMPTY_PATH.

This is a refresh of Al's AT_NO_JUMPS patchset[1] (which was a variation
on David Drysdale's O_BENEATH patchset[2], which in turn was based on
the Capsicum project[3]). Input from Linus and Andy in the AT_NO_JUMPS
thread[4] determined most of the API changes made in this refresh.

[1]: https://lwn.net/Articles/721443/
[2]: https://lwn.net/Articles/619151/
[3]: https://lwn.net/Articles/603929/
[4]: https://lwn.net/Articles/723057/

Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: Kees Cook <keescook@chromium.org>
Suggested-by: David Drysdale <drysdale@google.com>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/fcntl.c                       |  2 +-
 fs/namei.c                       | 76 +++++++++++++++++++++++++++-----
 fs/open.c                        | 11 ++++-
 include/linux/fcntl.h            |  3 +-
 include/linux/namei.h            |  7 +++
 include/uapi/asm-generic/fcntl.h | 14 ++++++
 6 files changed, 99 insertions(+), 14 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 083185174c6d..f782e99700f0 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1031,7 +1031,7 @@ static int __init fcntl_init(void)
 	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 	 * is defined as O_NONBLOCK on some platforms and not on others.
 	 */
-	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
+	BUILD_BUG_ON(25 - 1 /* for O_RDONLY being 0 */ !=
 		HWEIGHT32(
 			(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
 			__FMODE_EXEC | __FMODE_NONOTIFY));
diff --git a/fs/namei.c b/fs/namei.c
index 4fdcb36f7c01..c9a07a8c005a 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -845,6 +845,12 @@ static inline void path_to_nameidata(const struct path *path,
 
 static int nd_jump_root(struct nameidata *nd)
 {
+	if (unlikely(nd->flags & LOOKUP_BENEATH))
+		return -EXDEV;
+	if (unlikely(nd->flags & LOOKUP_XDEV)) {
+		if (nd->path.mnt != nd->root.mnt)
+			return -EXDEV;
+	}
 	if (nd->flags & LOOKUP_RCU) {
 		struct dentry *d;
 		nd->path = nd->root;
@@ -1053,6 +1059,9 @@ const char *get_link(struct nameidata *nd)
 	int error;
 	const char *res;
 
+	if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS))
+		return ERR_PTR(-ELOOP);
+
 	if (!(nd->flags & LOOKUP_RCU)) {
 		touch_atime(&last->link);
 		cond_resched();
@@ -1083,14 +1092,23 @@ const char *get_link(struct nameidata *nd)
 		} else {
 			res = get(dentry, inode, &last->done);
 		}
+		/* If we just jumped it was because of a procfs-style link. */
+		if (unlikely(nd->flags & LOOKUP_JUMPED)) {
+			if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
+				return ERR_PTR(-ELOOP);
+			/* Not currently safe. */
+			if (unlikely(nd->flags & LOOKUP_BENEATH))
+				return ERR_PTR(-EXDEV);
+		}
 		if (IS_ERR_OR_NULL(res))
 			return res;
 	}
 	if (*res == '/') {
 		if (!nd->root.mnt)
 			set_root(nd);
-		if (unlikely(nd_jump_root(nd)))
-			return ERR_PTR(-ECHILD);
+		error = nd_jump_root(nd);
+		if (unlikely(error))
+			return ERR_PTR(error);
 		while (unlikely(*++res == '/'))
 			;
 	}
@@ -1271,12 +1289,16 @@ static int follow_managed(struct path *path, struct nameidata *nd)
 		break;
 	}
 
-	if (need_mntput && path->mnt == mnt)
-		mntput(path->mnt);
+	if (need_mntput) {
+		if (path->mnt == mnt)
+			mntput(path->mnt);
+		if (unlikely(nd->flags & LOOKUP_XDEV))
+			ret = -EXDEV;
+		else
+			nd->flags |= LOOKUP_JUMPED;
+	}
 	if (ret == -EISDIR || !ret)
 		ret = 1;
-	if (need_mntput)
-		nd->flags |= LOOKUP_JUMPED;
 	if (unlikely(ret < 0))
 		path_put_conditional(path, nd);
 	return ret;
@@ -1333,6 +1355,8 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
 		mounted = __lookup_mnt(path->mnt, path->dentry);
 		if (!mounted)
 			break;
+		if (unlikely(nd->flags & LOOKUP_XDEV))
+			return false;
 		path->mnt = &mounted->mnt;
 		path->dentry = mounted->mnt.mnt_root;
 		nd->flags |= LOOKUP_JUMPED;
@@ -1353,8 +1377,11 @@ static int follow_dotdot_rcu(struct nameidata *nd)
 	struct inode *inode = nd->inode;
 
 	while (1) {
-		if (path_equal(&nd->path, &nd->root))
+		if (path_equal(&nd->path, &nd->root)) {
+			if (unlikely(nd->flags & LOOKUP_BENEATH))
+				return -EXDEV;
 			break;
+		}
 		if (nd->path.dentry != nd->path.mnt->mnt_root) {
 			struct dentry *old = nd->path.dentry;
 			struct dentry *parent = old->d_parent;
@@ -1379,6 +1406,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
 				return -ECHILD;
 			if (&mparent->mnt == nd->path.mnt)
 				break;
+			if (unlikely(nd->flags & LOOKUP_XDEV))
+				return -EXDEV;
 			/* we know that mountpoint was pinned */
 			nd->path.dentry = mountpoint;
 			nd->path.mnt = &mparent->mnt;
@@ -1393,6 +1422,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
 			return -ECHILD;
 		if (!mounted)
 			break;
+		if (unlikely(nd->flags & LOOKUP_XDEV))
+			return -EXDEV;
 		nd->path.mnt = &mounted->mnt;
 		nd->path.dentry = mounted->mnt.mnt_root;
 		inode = nd->path.dentry->d_inode;
@@ -1481,8 +1512,11 @@ static int path_parent_directory(struct path *path)
 static int follow_dotdot(struct nameidata *nd)
 {
 	while(1) {
-		if (path_equal(&nd->path, &nd->root))
+		if (path_equal(&nd->path, &nd->root)) {
+			if (unlikely(nd->flags & LOOKUP_BENEATH))
+				return -EXDEV;
 			break;
+		}
 		if (nd->path.dentry != nd->path.mnt->mnt_root) {
 			int ret = path_parent_directory(&nd->path);
 			if (ret)
@@ -1491,6 +1525,8 @@ static int follow_dotdot(struct nameidata *nd)
 		}
 		if (!follow_up(&nd->path))
 			break;
+		if (unlikely(nd->flags & LOOKUP_XDEV))
+			return -EXDEV;
 	}
 	follow_mount(&nd->path);
 	nd->inode = nd->path.dentry->d_inode;
@@ -1705,6 +1741,13 @@ static inline int may_lookup(struct nameidata *nd)
 static inline int handle_dots(struct nameidata *nd, int type)
 {
 	if (type == LAST_DOTDOT) {
+		/*
+		 * LOOKUP_BENEATH resolving ".." is not currently safe -- races can
+		 * cause our parent to have moved outside of the root and us to skip
+		 * over it.
+		 */
+		if (unlikely(nd->flags & LOOKUP_BENEATH))
+			return -EXDEV;
 		if (!nd->root.mnt)
 			set_root(nd);
 		if (nd->flags & LOOKUP_RCU) {
@@ -2253,6 +2296,15 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 	nd->path.dentry = NULL;
 
 	nd->m_seq = read_seqbegin(&mount_lock);
+
+	if (unlikely(nd->flags & LOOKUP_BENEATH)) {
+		error = dirfd_path_init(nd);
+		if (unlikely(error))
+			return ERR_PTR(error);
+		nd->root = nd->path;
+		if (!(nd->flags & LOOKUP_RCU))
+			path_get(&nd->root);
+	}
 	if (*s == '/') {
 		if (likely(!nd->root.mnt))
 			set_root(nd);
@@ -2261,9 +2313,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 			s = ERR_PTR(error);
 		return s;
 	}
-	error = dirfd_path_init(nd);
-	if (unlikely(error))
-		return ERR_PTR(error);
+	if (likely(!nd->path.mnt)) {
+		error = dirfd_path_init(nd);
+		if (unlikely(error))
+			return ERR_PTR(error);
+	}
 	return s;
 }
 
diff --git a/fs/open.c b/fs/open.c
index 0285ce7dbd51..3e73f940f56e 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -959,7 +959,8 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
 		 * If we have O_PATH in the open flag. Then we
 		 * cannot have anything other than the below set of flags
 		 */
-		flags &= O_DIRECTORY | O_NOFOLLOW | O_PATH;
+		flags &= O_DIRECTORY | O_NOFOLLOW | O_PATH | O_BENEATH |
+			 O_XDEV | O_NOSYMLINKS | O_NOMAGICLINKS;
 		acc_mode = 0;
 	}
 
@@ -988,6 +989,14 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
 		lookup_flags |= LOOKUP_DIRECTORY;
 	if (!(flags & O_NOFOLLOW))
 		lookup_flags |= LOOKUP_FOLLOW;
+	if (flags & O_BENEATH)
+		lookup_flags |= LOOKUP_BENEATH;
+	if (flags & O_XDEV)
+		lookup_flags |= LOOKUP_XDEV;
+	if (flags & O_NOMAGICLINKS)
+		lookup_flags |= LOOKUP_NO_MAGICLINKS;
+	if (flags & O_NOSYMLINKS)
+		lookup_flags |= LOOKUP_NO_SYMLINKS;
 	op->lookup_flags = lookup_flags;
 	return 0;
 }
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index 27dc7a60693e..864399c2fdd2 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -9,7 +9,8 @@
 	(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
 	 O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
 	 FASYNC	| O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
-	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
+	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_BENEATH | O_XDEV | \
+	 O_NOMAGICLINKS | O_NOSYMLINKS)
 
 #ifndef force_o_largefile
 #define force_o_largefile() (BITS_PER_LONG != 32)
diff --git a/include/linux/namei.h b/include/linux/namei.h
index a78606e8e3df..82b5039d27a6 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -47,6 +47,13 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 #define LOOKUP_EMPTY		0x4000
 #define LOOKUP_DOWN		0x8000
 
+/* Scoping flags for lookup. */
+#define LOOKUP_BENEATH		0x010000 /* No escaping from starting point. */
+#define LOOKUP_XDEV		0x020000 /* No mountpoint crossing. */
+#define LOOKUP_NO_MAGICLINKS	0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */
+#define LOOKUP_NO_SYMLINKS	0x080000 /* No symlink crossing *at all*.
+					    Implies LOOKUP_NO_MAGICLINKS. */
+
 extern int path_pts(struct path *path);
 
 extern int user_path_at_empty(int, const char __user *, unsigned, struct path *, int *empty);
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 9dc0bf0c5a6e..ba53e0836cd4 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -97,6 +97,20 @@
 #define O_NDELAY	O_NONBLOCK
 #endif
 
+/* Type of path-resolution scoping we are applying. */
+#ifndef O_BENEATH
+#define O_BENEATH	00040000000 /* - Block "lexical" trickery like "..", symlinks, absolute paths, etc. */
+#endif
+#ifndef O_XDEV
+#define O_XDEV		00100000000 /* - Block mount-point crossings (includes bind-mounts). */
+#endif
+#ifndef O_NOMAGICLINKS
+#define O_NOMAGICLINKS	00200000000 /* - Block procfs-style "magic" symlinks. */
+#endif
+#ifndef O_NOSYMLINKS
+#define O_NOSYMLINKS	01000000000 /* - Block all symlinks (implies AT_NO_MAGICLINKS). */
+#endif
+
 #define F_DUPFD		0	/* dup */
 #define F_GETFD		1	/* get close_on_exec */
 #define F_SETFD		2	/* set/clear close_on_exec */
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH RESEND v5 3/5] namei: O_THISROOT: chroot-like path resolution
  2019-03-20 14:37 [PATCH RESEND v5 0/5] namei: vfs flags to restrict path resolution Aleksa Sarai
  2019-03-20 14:37 ` [PATCH RESEND v5 1/5] namei: split out nd->dfd handling to dirfd_path_init Aleksa Sarai
  2019-03-20 14:37 ` [PATCH RESEND v5 2/5] namei: O_BENEATH-style path resolution flags Aleksa Sarai
@ 2019-03-20 14:37 ` Aleksa Sarai
  2019-03-20 14:37 ` [PATCH RESEND v5 4/5] namei: aggressively check for nd->root escape on ".." resolution Aleksa Sarai
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Aleksa Sarai @ 2019-03-20 14:37 UTC (permalink / raw)
  To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells
  Cc: Aleksa Sarai, Eric Biederman, Christian Brauner, Kees Cook,
	Andy Lutomirski, Andrew Morton, Alexei Starovoitov, Jann Horn,
	David Drysdale, Chanho Min, Oleg Nesterov, Aleksa Sarai,
	Linus Torvalds, containers, linux-fsdevel, linux-api,
	linux-kernel, linux-arch

The primary motivation for the need for this flag is container runtimes
which have to interact with malicious root filesystems in the host
namespaces. One of the first requirements for a container runtime to be
secure against a malicious rootfs is that they correctly scope symlinks
(that is, they should be scoped as though they are chroot(2)ed into the
container's rootfs) and ".."-style paths[*]. The already-existing O_XDEV
and O_NOMAGICLINKS[**] help defend against other potential attacks in a
malicious rootfs scenario.

Currently most container runtimes try to do this resolution in
userspace[1], causing many potential race conditions. In addition, the
"obvious" alternative (actually performing a {ch,pivot_}root(2))
requires a fork+exec (for some runtimes) which is *very* costly if
necessary for every filesystem operation involving a container.

[*] At the moment, ".." and "magic link" jumping are disallowed for the
    same reason it is disabled for O_BENEATH -- currently it is not safe
    to allow it. Future patches may enable it unconditionally once we
    have resolved the possible races (for "..") and semantics (for
    "magic link" jumping).

The most significant openat(2) semantic change with O_THISROOT is that
absolute pathnames no longer cause dirfd to be ignored completely. The
rationale is that O_THISROOT must necessarily chroot-scope symlinks with
absolute paths to dirfd, and so doing it for the base path seems to be
the most consistent behaviour (and also avoids foot-gunning users who
want to scope paths that are absolute).

Currently this is only enabled for openat(2), and similar to O_BENEATH
and family requires more discussion about extending it to more *at(2)
syscalls as well as extending AT_EMPTY_PATH support.

[1]: https://github.com/cyphar/filepath-securejoin

Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/fcntl.c                       | 2 +-
 fs/namei.c                       | 6 +++---
 fs/open.c                        | 4 +++-
 include/linux/fcntl.h            | 2 +-
 include/linux/namei.h            | 1 +
 include/uapi/asm-generic/fcntl.h | 3 +++
 6 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index f782e99700f0..a6b4565a903d 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1031,7 +1031,7 @@ static int __init fcntl_init(void)
 	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 	 * is defined as O_NONBLOCK on some platforms and not on others.
 	 */
-	BUILD_BUG_ON(25 - 1 /* for O_RDONLY being 0 */ !=
+	BUILD_BUG_ON(26 - 1 /* for O_RDONLY being 0 */ !=
 		HWEIGHT32(
 			(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
 			__FMODE_EXEC | __FMODE_NONOTIFY));
diff --git a/fs/namei.c b/fs/namei.c
index c9a07a8c005a..798eb1702a0c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1097,7 +1097,7 @@ const char *get_link(struct nameidata *nd)
 			if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
 				return ERR_PTR(-ELOOP);
 			/* Not currently safe. */
-			if (unlikely(nd->flags & LOOKUP_BENEATH))
+			if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT)))
 				return ERR_PTR(-EXDEV);
 		}
 		if (IS_ERR_OR_NULL(res))
@@ -1746,7 +1746,7 @@ static inline int handle_dots(struct nameidata *nd, int type)
 		 * cause our parent to have moved outside of the root and us to skip
 		 * over it.
 		 */
-		if (unlikely(nd->flags & LOOKUP_BENEATH))
+		if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT)))
 			return -EXDEV;
 		if (!nd->root.mnt)
 			set_root(nd);
@@ -2297,7 +2297,7 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 
 	nd->m_seq = read_seqbegin(&mount_lock);
 
-	if (unlikely(nd->flags & LOOKUP_BENEATH)) {
+	if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))) {
 		error = dirfd_path_init(nd);
 		if (unlikely(error))
 			return ERR_PTR(error);
diff --git a/fs/open.c b/fs/open.c
index 3e73f940f56e..5914aed2fac8 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -960,7 +960,7 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
 		 * cannot have anything other than the below set of flags
 		 */
 		flags &= O_DIRECTORY | O_NOFOLLOW | O_PATH | O_BENEATH |
-			 O_XDEV | O_NOSYMLINKS | O_NOMAGICLINKS;
+			 O_XDEV | O_NOSYMLINKS | O_NOMAGICLINKS | O_THISROOT;
 		acc_mode = 0;
 	}
 
@@ -997,6 +997,8 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
 		lookup_flags |= LOOKUP_NO_MAGICLINKS;
 	if (flags & O_NOSYMLINKS)
 		lookup_flags |= LOOKUP_NO_SYMLINKS;
+	if (flags & O_THISROOT)
+		lookup_flags |= LOOKUP_IN_ROOT;
 	op->lookup_flags = lookup_flags;
 	return 0;
 }
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index 864399c2fdd2..46c92bbfce4a 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -10,7 +10,7 @@
 	 O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
 	 FASYNC	| O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
 	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_BENEATH | O_XDEV | \
-	 O_NOMAGICLINKS | O_NOSYMLINKS)
+	 O_NOMAGICLINKS | O_NOSYMLINKS | O_THISROOT)
 
 #ifndef force_o_largefile
 #define force_o_largefile() (BITS_PER_LONG != 32)
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 82b5039d27a6..0969313b518f 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -53,6 +53,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 #define LOOKUP_NO_MAGICLINKS	0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */
 #define LOOKUP_NO_SYMLINKS	0x080000 /* No symlink crossing *at all*.
 					    Implies LOOKUP_NO_MAGICLINKS. */
+#define LOOKUP_IN_ROOT		0x100000 /* Treat dirfd as %current->fs->root. */
 
 extern int path_pts(struct path *path);
 
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index ba53e0836cd4..3d0fe0de2ba2 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -110,6 +110,9 @@
 #ifndef O_NOSYMLINKS
 #define O_NOSYMLINKS	01000000000 /* - Block all symlinks (implies AT_NO_MAGICLINKS). */
 #endif
+#ifndef O_THISROOT
+#define O_THISROOT	02000000000 /* - Scope ".." resolution to dirfd (like chroot(2)). */
+#endif
 
 #define F_DUPFD		0	/* dup */
 #define F_GETFD		1	/* get close_on_exec */
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH RESEND v5 4/5] namei: aggressively check for nd->root escape on ".." resolution
  2019-03-20 14:37 [PATCH RESEND v5 0/5] namei: vfs flags to restrict path resolution Aleksa Sarai
                   ` (2 preceding siblings ...)
  2019-03-20 14:37 ` [PATCH RESEND v5 3/5] namei: O_THISROOT: chroot-like path resolution Aleksa Sarai
@ 2019-03-20 14:37 ` Aleksa Sarai
  2019-03-20 14:37 ` [PATCH RESEND v5 5/5] binfmt_*: scope path resolution of interpreters Aleksa Sarai
  2019-03-21 17:06 ` [PATCH RESEND v5 0/5] namei: vfs flags to restrict path resolution Andy Lutomirski
  5 siblings, 0 replies; 14+ messages in thread
From: Aleksa Sarai @ 2019-03-20 14:37 UTC (permalink / raw)
  To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells
  Cc: Aleksa Sarai, Jann Horn, Kees Cook, Eric Biederman,
	Andy Lutomirski, Andrew Morton, Alexei Starovoitov,
	Christian Brauner, David Drysdale, Chanho Min, Oleg Nesterov,
	Aleksa Sarai, Linus Torvalds, containers, linux-fsdevel,
	linux-api, linux-kernel, linux-arch

This patch allows for O_BENEATH and O_THISROOT to safely permit ".."
resolution (in the case of O_BENEATH the resolution will still fail if
".." resolution would resolve a path outside of the root -- while
O_THISROOT will chroot(2)-style scope it). "magic link" jumps are still
disallowed entirely because now they could result in inconsistent
behaviour if resolution encounters a subsequent "..".

The need for this patch is explained by observing there is a fairly
easy-to-exploit race condition with chroot(2) (and thus by extension
O_THISROOT and O_BENEATH) where a rename(2) of a path can be used to
"skip over" nd->root and thus escape to the filesystem above nd->root.

  thread1 [attacker]:
    for (;;)
      renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
  thread2 [victim]:
    for (;;)
      openat(dirb, "b/c/../../etc/shadow", O_THISROOT);

With fairly significant regularity, thread2 will resolve to
"/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar
(though somewhat more privileged) attack using MS_MOVE.

With this patch, such cases will be detected *during* ".." resolution
(which is the weak point of chroot(2) -- since walking *into* a
subdirectory tautologically cannot result in you walking *outside*
nd->root -- except through a bind-mount or "magic link"). By detecting
this at ".." resolution (rather than checking only at the end of the
entire resolution) we can both correct escapes by jumping back to the
root (in the case of O_THISROOT), as well as avoid revealing to
attackers the structure of the filesystem outside of the root (through
timing attacks for instance).

In order to avoid a quadratic lookup with each ".." entry, we only
activate the slow path if a write through &rename_lock or &mount_lock
have occurred during path resolution (&rename_lock and &mount_lock are
re-taken to further optimise the lookup). Since the primary attack being
protected against is MS_MOVE or rename(2), not doing additional checks
unless a mount or rename have occurred avoids making the common case
slow.

The use of path_is_under() here might seem suspect, but on further
inspection of the most important race (a path was *inside* the root but
is now *outside*), there appears to be no attack potential. If
path_is_under() occurs before the rename, then the path will be resolved
but since the path was originally inside the root there is no escape.
Subsequent ".." jumps are guaranteed to check path_is_under() (by
construction, &rename_lock or &mount_lock must have been taken by the
attacker after path_is_under() returned in the victim), and thus will
not be able to escape from the previously-inside-root path. Walking down
is still safe since the entire subtree was moved (either by rename(2) or
MS_MOVE) and because (as discussed above) walking down is safe.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jann Horn <jannh@google.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/namei.c | 48 +++++++++++++++++++++++++++++++++---------------
 1 file changed, 33 insertions(+), 15 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 798eb1702a0c..af06b10b13a0 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -493,7 +493,7 @@ struct nameidata {
 	struct path	root;
 	struct inode	*inode; /* path.dentry.d_inode */
 	unsigned int	flags;
-	unsigned	seq, m_seq;
+	unsigned	seq, m_seq, r_seq;
 	int		last_type;
 	unsigned	depth;
 	int		total_link_count;
@@ -1741,19 +1741,35 @@ static inline int may_lookup(struct nameidata *nd)
 static inline int handle_dots(struct nameidata *nd, int type)
 {
 	if (type == LAST_DOTDOT) {
-		/*
-		 * LOOKUP_BENEATH resolving ".." is not currently safe -- races can
-		 * cause our parent to have moved outside of the root and us to skip
-		 * over it.
-		 */
-		if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT)))
-			return -EXDEV;
+		int error = 0;
+
 		if (!nd->root.mnt)
 			set_root(nd);
-		if (nd->flags & LOOKUP_RCU) {
-			return follow_dotdot_rcu(nd);
-		} else
-			return follow_dotdot(nd);
+		if (nd->flags & LOOKUP_RCU)
+			error = follow_dotdot_rcu(nd);
+		else
+			error = follow_dotdot(nd);
+		if (error)
+			return error;
+
+		if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))) {
+			bool m_retry = read_seqretry(&mount_lock, nd->m_seq);
+			bool r_retry = read_seqretry(&rename_lock, nd->r_seq);
+
+			/*
+			 * Don't bother checking unless there's a racing
+			 * rename(2) or MS_MOVE.
+			 */
+			if (likely(!m_retry && !r_retry))
+				return 0;
+
+			if (m_retry && !(nd->flags & LOOKUP_RCU))
+				nd->m_seq = read_seqbegin(&mount_lock);
+			if (r_retry)
+				nd->r_seq = read_seqbegin(&rename_lock);
+			if (!path_is_under(&nd->path, &nd->root))
+				return -EXDEV;
+		}
 	}
 	return 0;
 }
@@ -2274,6 +2290,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 	nd->last_type = LAST_ROOT; /* if there are only slashes... */
 	nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT;
 	nd->depth = 0;
+
+	nd->m_seq = read_seqbegin(&mount_lock);
+	if (unlikely(flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT)))
+		nd->r_seq = read_seqbegin(&rename_lock);
+
 	if (flags & LOOKUP_ROOT) {
 		struct dentry *root = nd->root.dentry;
 		struct inode *inode = root->d_inode;
@@ -2284,7 +2305,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 		if (flags & LOOKUP_RCU) {
 			nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
 			nd->root_seq = nd->seq;
-			nd->m_seq = read_seqbegin(&mount_lock);
 		} else {
 			path_get(&nd->path);
 		}
@@ -2295,8 +2315,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 	nd->path.mnt = NULL;
 	nd->path.dentry = NULL;
 
-	nd->m_seq = read_seqbegin(&mount_lock);
-
 	if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))) {
 		error = dirfd_path_init(nd);
 		if (unlikely(error))
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH RESEND v5 5/5] binfmt_*: scope path resolution of interpreters
  2019-03-20 14:37 [PATCH RESEND v5 0/5] namei: vfs flags to restrict path resolution Aleksa Sarai
                   ` (3 preceding siblings ...)
  2019-03-20 14:37 ` [PATCH RESEND v5 4/5] namei: aggressively check for nd->root escape on ".." resolution Aleksa Sarai
@ 2019-03-20 14:37 ` Aleksa Sarai
  2019-03-21 17:06 ` [PATCH RESEND v5 0/5] namei: vfs flags to restrict path resolution Andy Lutomirski
  5 siblings, 0 replies; 14+ messages in thread
From: Aleksa Sarai @ 2019-03-20 14:37 UTC (permalink / raw)
  To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells
  Cc: Aleksa Sarai, Eric Biederman, Andy Lutomirski, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Christian Brauner,
	David Drysdale, Chanho Min, Oleg Nesterov, Aleksa Sarai,
	Linus Torvalds, containers, linux-fsdevel, linux-api,
	linux-kernel, linux-arch

The need to be able to scope path resolution of interpreters became
clear with one of the possible vectors used in CVE-2019-5736 (which
most major container runtimes were vulnerable to).

Naively, it might seem that openat(2) -- which supports path scoping --
can be combined with execveat(AT_EMPTY_PATH) to trivially scope the
binary being executed. Unfortunately, a "bad binary" (usually a symlink)
could be written as a #!-style script with the symlink target as the
interpreter -- which would be completely missed by just scoping the
openat(2). An example of this being exploitable is CVE-2019-5736.

In order to get around this, we need to pass down to each binfmt_*
implementation the scoping flags requested in execveat(2). In order to
maintain backwards-compatibility we only pass the scoping AT_* flags.

To avoid breaking userspace (in the exceptionally rare cases where you
have #!-scripts with a relative path being execveat(2)-ed with dfd !=
AT_FDCWD), we only pass dfd down to binfmt_* if any of our new flags are
set in execveat(2).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/binfmt_elf.c            |  2 +-
 fs/binfmt_elf_fdpic.c      |  2 +-
 fs/binfmt_em86.c           |  4 ++--
 fs/binfmt_misc.c           |  2 +-
 fs/binfmt_script.c         |  2 +-
 fs/exec.c                  | 26 ++++++++++++++++++++++----
 include/linux/binfmts.h    |  1 +
 include/linux/fs.h         |  9 +++++++--
 include/uapi/linux/fcntl.h |  6 ++++++
 9 files changed, 42 insertions(+), 12 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 54207327f98f..eef86ffa38c8 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -777,7 +777,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
 			if (elf_interpreter[elf_ppnt->p_filesz - 1] != '\0')
 				goto out_free_interp;
 
-			interpreter = open_exec(elf_interpreter);
+			interpreter = openat_exec(bprm->dfd, elf_interpreter, bprm->flags);
 			retval = PTR_ERR(interpreter);
 			if (IS_ERR(interpreter))
 				goto out_free_interp;
diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
index b53bb3729ac1..c463c6428f77 100644
--- a/fs/binfmt_elf_fdpic.c
+++ b/fs/binfmt_elf_fdpic.c
@@ -263,7 +263,7 @@ static int load_elf_fdpic_binary(struct linux_binprm *bprm)
 			kdebug("Using ELF interpreter %s", interpreter_name);
 
 			/* replace the program with the interpreter */
-			interpreter = open_exec(interpreter_name);
+			interpreter = openat_exec(bprm->dfd, interpreter_name, bprm->flags);
 			retval = PTR_ERR(interpreter);
 			if (IS_ERR(interpreter)) {
 				interpreter = NULL;
diff --git a/fs/binfmt_em86.c b/fs/binfmt_em86.c
index dd2d3f0cd55d..3ee46b0dc0d4 100644
--- a/fs/binfmt_em86.c
+++ b/fs/binfmt_em86.c
@@ -81,10 +81,10 @@ static int load_em86(struct linux_binprm *bprm)
 
 	/*
 	 * OK, now restart the process with the interpreter's inode.
-	 * Note that we use open_exec() as the name is now in kernel
+	 * Note that we use openat_exec() as the name is now in kernel
 	 * space, and we don't need to copy it.
 	 */
-	file = open_exec(interp);
+	file = openat_exec(binprm->dfd, interp, binprm->flags);
 	if (IS_ERR(file))
 		return PTR_ERR(file);
 
diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index aa4a7a23ff99..573ef06ff5a1 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -209,7 +209,7 @@ static int load_misc_binary(struct linux_binprm *bprm)
 		if (!IS_ERR(interp_file))
 			deny_write_access(interp_file);
 	} else {
-		interp_file = open_exec(fmt->interpreter);
+		interp_file = openat_exec(bprm->dfd, fmt->interpreter, bprm->flags);
 	}
 	retval = PTR_ERR(interp_file);
 	if (IS_ERR(interp_file))
diff --git a/fs/binfmt_script.c b/fs/binfmt_script.c
index d0078cbb718b..340f63635aac 100644
--- a/fs/binfmt_script.c
+++ b/fs/binfmt_script.c
@@ -102,7 +102,7 @@ static int load_script(struct linux_binprm *bprm)
 	/*
 	 * OK, now restart the process with the interpreter's dentry.
 	 */
-	file = open_exec(i_name);
+	file = openat_exec(bprm->dfd, i_name, bprm->flags);
 	if (IS_ERR(file))
 		return PTR_ERR(file);
 
diff --git a/fs/exec.c b/fs/exec.c
index bcf383730bea..e63063b2de23 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -846,12 +846,24 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags)
 		.lookup_flags = LOOKUP_FOLLOW,
 	};
 
-	if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH)) != 0)
+	if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH | AT_BENEATH |
+		       AT_XDEV | AT_NO_MAGICLINKS | AT_NO_SYMLINKS |
+		       AT_THIS_ROOT)) != 0)
 		return ERR_PTR(-EINVAL);
 	if (flags & AT_SYMLINK_NOFOLLOW)
 		open_exec_flags.lookup_flags &= ~LOOKUP_FOLLOW;
 	if (flags & AT_EMPTY_PATH)
 		open_exec_flags.lookup_flags |= LOOKUP_EMPTY;
+	if (flags & AT_BENEATH)
+		open_exec_flags.lookup_flags |= LOOKUP_BENEATH;
+	if (flags & AT_XDEV)
+		open_exec_flags.lookup_flags |= LOOKUP_XDEV;
+	if (flags & AT_NO_MAGICLINKS)
+		open_exec_flags.lookup_flags |= LOOKUP_NO_MAGICLINKS;
+	if (flags & AT_NO_SYMLINKS)
+		open_exec_flags.lookup_flags |= LOOKUP_NO_SYMLINKS;
+	if (flags & AT_THIS_ROOT)
+		open_exec_flags.lookup_flags |= LOOKUP_IN_ROOT;
 
 	file = do_filp_open(fd, name, &open_exec_flags);
 	if (IS_ERR(file))
@@ -879,18 +891,18 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags)
 	return ERR_PTR(err);
 }
 
-struct file *open_exec(const char *name)
+struct file *openat_exec(int dfd, const char *name, int flags)
 {
 	struct filename *filename = getname_kernel(name);
 	struct file *f = ERR_CAST(filename);
 
 	if (!IS_ERR(filename)) {
-		f = do_open_execat(AT_FDCWD, filename, 0);
+		f = do_open_execat(dfd, filename, flags);
 		putname(filename);
 	}
 	return f;
 }
-EXPORT_SYMBOL(open_exec);
+EXPORT_SYMBOL(openat_exec);
 
 int kernel_read_file(struct file *file, void **buf, loff_t *size,
 		     loff_t max_size, enum kernel_read_file_id id)
@@ -1762,6 +1774,12 @@ static int __do_execve_file(int fd, struct filename *filename,
 
 	sched_exec();
 
+	bprm->flags = flags & (AT_XDEV | AT_NO_MAGICLINKS | AT_NO_SYMLINKS |
+			       AT_THIS_ROOT);
+	bprm->dfd = AT_FDCWD;
+	if (bprm->flags)
+		bprm->dfd = fd;
+
 	bprm->file = file;
 	if (!filename) {
 		bprm->filename = "none";
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 688ab0de7810..e4da2d36e97f 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -50,6 +50,7 @@ struct linux_binprm {
 	unsigned int taso:1;
 #endif
 	unsigned int recursion_depth; /* only for search_binary_handler() */
+	int dfd, flags;		/* passed down to execat_open() */
 	struct file * file;
 	struct cred *cred;	/* new credentials */
 	int unsafe;		/* how unsafe this exec is (mask of LSM_UNSAFE_*) */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3e85cb8e8c20..a82c8dd44ad9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2937,8 +2937,13 @@ extern int kernel_read_file_from_fd(int, void **, loff_t *, loff_t,
 extern ssize_t kernel_read(struct file *, void *, size_t, loff_t *);
 extern ssize_t kernel_write(struct file *, const void *, size_t, loff_t *);
 extern ssize_t __kernel_write(struct file *, const void *, size_t, loff_t *);
-extern struct file * open_exec(const char *);
- 
+
+extern struct file *openat_exec(int, const char *, int);
+static inline struct file *open_exec(const char *name)
+{
+	return openat_exec(AT_FDCWD, name, 0);
+}
+
 /* fs/dcache.c -- generic fs support functions */
 extern bool is_subdir(struct dentry *, struct dentry *);
 extern bool path_is_under(const struct path *, const struct path *);
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 6448cdd9a350..607bc98813e3 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -90,5 +90,11 @@
 #define AT_STATX_FORCE_SYNC	0x2000	/* - Force the attributes to be sync'd with the server */
 #define AT_STATX_DONT_SYNC	0x4000	/* - Don't sync attributes with the server */
 
+#define AT_RESOLUTION_TYPE	0xF8000 /* Type of path-resolution scoping we are applying. */
+#define AT_BENEATH		0x08000 /* - Block "lexical" trickery like "..", symlinks, absolute paths, etc. */
+#define AT_XDEV			0x10000 /* - Block mount-point crossings (includes bind-mounts). */
+#define AT_NO_MAGICLINKS	0x20000 /* - Block procfs-style "magic" symlinks. */
+#define AT_NO_SYMLINKS		0x40000 /* - Block all symlinks (implies AT_NO_MAGICLINKS). */
+#define AT_THIS_ROOT		0x80000 /* - Scope ".." resolution to dirfd (like chroot(2)). */
 
 #endif /* _UAPI_LINUX_FCNTL_H */
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH RESEND v5 0/5] namei: vfs flags to restrict path resolution
  2019-03-20 14:37 [PATCH RESEND v5 0/5] namei: vfs flags to restrict path resolution Aleksa Sarai
                   ` (4 preceding siblings ...)
  2019-03-20 14:37 ` [PATCH RESEND v5 5/5] binfmt_*: scope path resolution of interpreters Aleksa Sarai
@ 2019-03-21 17:06 ` Andy Lutomirski
  2019-03-25 13:04   ` Aleksa Sarai
  5 siblings, 1 reply; 14+ messages in thread
From: Andy Lutomirski @ 2019-03-21 17:06 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Eric Biederman, Andy Lutomirski, Jann Horn,
	Christian Brauner, David Drysdale, Tycho Andersen, Kees Cook,
	Linux Containers, Linux FS Devel, Linux API, Andrew Morton,
	Alexei Starovoitov, Chanho Min, Oleg Nesterov, Aleksa Sarai,
	Linus Torvalds, LKML, linux-arch

On Wed, Mar 20, 2019 at 7:38 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> Now that the holiday break is over, it's time to re-send this patch
> series (with a few additions, due to new information we got from
> CVE-2019-5736 -- which this patchset mostly protected against but had
> some holes with regards to #!-style scripts).

I generally like this, but, as Linus pointed out, it will be
unfortunate if application authors see this as just another
non-portable weird Linux API and don't use it.  Would it be worthwhile
to put some thought into making it an API that other OSes might be
willing to implement?  As it stands, the openat(2) flags are getting
rather crazy in this patch set.

Aleksa had a resolveat(2) proposal that really didn't seem too bad.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RESEND v5 0/5] namei: vfs flags to restrict path resolution
  2019-03-21 17:06 ` [PATCH RESEND v5 0/5] namei: vfs flags to restrict path resolution Andy Lutomirski
@ 2019-03-25 13:04   ` Aleksa Sarai
  2019-04-23 20:13     ` Kees Cook
  0 siblings, 1 reply; 14+ messages in thread
From: Aleksa Sarai @ 2019-03-25 13:04 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Eric Biederman, Jann Horn, Christian Brauner,
	David Drysdale, Tycho Andersen, Kees Cook, Linux Containers,
	Linux FS Devel, Linux API, Andrew Morton, Alexei Starovoitov,
	Chanho Min, Oleg Nesterov, Aleksa Sarai, Linus Torvalds, LKML,
	linux-arch

[-- Attachment #1: Type: text/plain, Size: 2976 bytes --]

On 2019-03-21, Andy Lutomirski <luto@kernel.org> wrote:
> On Wed, Mar 20, 2019 at 7:38 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > Now that the holiday break is over, it's time to re-send this patch
> > series (with a few additions, due to new information we got from
> > CVE-2019-5736 -- which this patchset mostly protected against but had
> > some holes with regards to #!-style scripts).
> 
> I generally like this, but, as Linus pointed out, it will be
> unfortunate if application authors see this as just another
> non-portable weird Linux API and don't use it.  Would it be worthwhile
> to put some thought into making it an API that other OSes might be
> willing to implement?  As it stands, the openat(2) flags are getting
> rather crazy in this patch set.
> 
> Aleksa had a resolveat(2) proposal that really didn't seem too bad.

I agree having a bunch of O_* flags for resolution feels quite ugly (and
crazy) -- but the last time I pitched resolveat(2) the reaction was
lukewarm to say the least. It's basically an O_PATHv2 and I don't know
how popular that suggestion might be. I wouldn't mind pitching it again
though (now that I have a better idea of how to handle some of the UX
worries I had).

But, if we have a resolveat(2) we'll need to add O_EMPTYPATH support for
openat(2) so that you can open scoped paths without using the
/proc/self/fd/$n trick. However, you run into reopening permission
issues (as we saw in CVE-2019-5736 and countless other CVEs). There
might be a solution for this which Andy and I have talked about
off-list.

There is also the problem of the execveat(2) attack -- which is dealt
with in patch 5 of this series. In order to scope #!-script resolution
we need to have AT_* flags for the same resolution restriction
(defeating the point of a resolveat(2) slightly).

There is an argument that we shouldn't need AT_THIS_ROOT or AT_BENEATH
support (because ideally you should be doing execve(2) in a pivot_root
anyway) but AT_NO_MAGICLINKS is pretty important -- since it allows you
to block the circumvention mm->dumpable in certain situations (as we saw
in CVE-2019-5736).

The solution Andy and I have discussed is a way to make fd-reopening (a
seemingly little-known trick outside of container runtimes) much safer
such that we don't need mitigations like AT_NO_MAGICLINKS on
execveat(2). If we assume that idea works out (I'm still trying to get a
working patch for that idea) then resolveat(2) would be sufficient and
we don't need AT_* flags on execveat(2).

tl;dr: I think resolveat(2) is much nicer, and assuming it's not an
	   unpopular idea I think we should go for it. But there are several
	   other threads of discussion that we might want to have first
	   (such as how to improve the fd-reopening design so it's safer
	   before we expose it to everyone).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RESEND v5 0/5] namei: vfs flags to restrict path resolution
  2019-03-25 13:04   ` Aleksa Sarai
@ 2019-04-23 20:13     ` Kees Cook
  2019-04-23 20:24       ` Christian Brauner
  2019-04-24 15:38       ` Aleksa Sarai
  0 siblings, 2 replies; 14+ messages in thread
From: Kees Cook @ 2019-04-23 20:13 UTC (permalink / raw)
  To: Aleksa Sarai, Andy Lutomirski
  Cc: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Eric Biederman, Jann Horn, Christian Brauner,
	David Drysdale, Tycho Andersen, Kees Cook, Linux Containers,
	Linux FS Devel, Linux API, Andrew Morton, Alexei Starovoitov,
	Chanho Min, Oleg Nesterov, Aleksa Sarai, Linus Torvalds, LKML,
	linux-arch

On Mon, Mar 25, 2019 at 6:05 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> On 2019-03-21, Andy Lutomirski <luto@kernel.org> wrote:
> > On Wed, Mar 20, 2019 at 7:38 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > > Now that the holiday break is over, it's time to re-send this patch
> > > series (with a few additions, due to new information we got from
> > > CVE-2019-5736 -- which this patchset mostly protected against but had
> > > some holes with regards to #!-style scripts).
> >
> > I generally like this, but, as Linus pointed out, it will be
> > unfortunate if application authors see this as just another
> > non-portable weird Linux API and don't use it.  Would it be worthwhile
> > to put some thought into making it an API that other OSes might be
> > willing to implement?  As it stands, the openat(2) flags are getting
> > rather crazy in this patch set.

I think many of the issues are specific to Linux (and Linux containers
especially), so I'm not sure this should get blocked because we want
something more portable.

This series provides solutions to so many different race and confusion
issues, I'd really like to see it land. What's the next step here? Is
this planned to go directly to Linus for v5.2, or is it going to live
in -mm for a while? I'd really like to see this moving forward.

Thanks for continuing to work on it!

-Kees

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RESEND v5 0/5] namei: vfs flags to restrict path resolution
  2019-04-23 20:13     ` Kees Cook
@ 2019-04-23 20:24       ` Christian Brauner
  2019-04-24 15:38       ` Aleksa Sarai
  1 sibling, 0 replies; 14+ messages in thread
From: Christian Brauner @ 2019-04-23 20:24 UTC (permalink / raw)
  To: Kees Cook, Al Viro
  Cc: Aleksa Sarai, Andy Lutomirski, Al Viro, Jeff Layton,
	J. Bruce Fields, Arnd Bergmann, David Howells, Eric Biederman,
	Jann Horn, David Drysdale, Tycho Andersen, Linux Containers,
	Linux FS Devel, Linux API, Andrew Morton, Alexei Starovoitov,
	Chanho Min, Oleg Nesterov, Aleksa Sarai, Linus Torvalds, LKML,
	linux-arch

On Tue, Apr 23, 2019 at 01:13:52PM -0700, Kees Cook wrote:
> On Mon, Mar 25, 2019 at 6:05 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
> >
> > On 2019-03-21, Andy Lutomirski <luto@kernel.org> wrote:
> > > On Wed, Mar 20, 2019 at 7:38 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > > > Now that the holiday break is over, it's time to re-send this patch
> > > > series (with a few additions, due to new information we got from
> > > > CVE-2019-5736 -- which this patchset mostly protected against but had
> > > > some holes with regards to #!-style scripts).
> > >
> > > I generally like this, but, as Linus pointed out, it will be
> > > unfortunate if application authors see this as just another
> > > non-portable weird Linux API and don't use it.  Would it be worthwhile
> > > to put some thought into making it an API that other OSes might be
> > > willing to implement?  As it stands, the openat(2) flags are getting
> > > rather crazy in this patch set.
> 
> I think many of the issues are specific to Linux (and Linux containers
> especially), so I'm not sure this should get blocked because we want
> something more portable.
> 
> This series provides solutions to so many different race and confusion
> issues, I'd really like to see it land. What's the next step here? Is
> this planned to go directly to Linus for v5.2, or is it going to live
> in -mm for a while? I'd really like to see this moving forward.

Yeah, it would be good to move this forward. But since this is pretty
much core-vfs we really need Al to take a look at this.

Christian

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RESEND v5 0/5] namei: vfs flags to restrict path resolution
  2019-04-23 20:13     ` Kees Cook
  2019-04-23 20:24       ` Christian Brauner
@ 2019-04-24 15:38       ` Aleksa Sarai
  2019-04-25 13:22         ` Adam Borowski
  2019-04-25 19:45         ` Aleksa Sarai
  1 sibling, 2 replies; 14+ messages in thread
From: Aleksa Sarai @ 2019-04-24 15:38 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andy Lutomirski, Al Viro, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, David Howells, Eric Biederman, Jann Horn,
	Christian Brauner, David Drysdale, Tycho Andersen,
	Linux Containers, Linux FS Devel, Linux API, Andrew Morton,
	Alexei Starovoitov, Chanho Min, Oleg Nesterov, Aleksa Sarai,
	Linus Torvalds, LKML, linux-arch

[-- Attachment #1: Type: text/plain, Size: 3814 bytes --]

On 2019-04-23, Kees Cook <keescook@chromium.org> wrote:
> On Mon, Mar 25, 2019 at 6:05 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > On 2019-03-21, Andy Lutomirski <luto@kernel.org> wrote:
> > > On Wed, Mar 20, 2019 at 7:38 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > > > Now that the holiday break is over, it's time to re-send this patch
> > > > series (with a few additions, due to new information we got from
> > > > CVE-2019-5736 -- which this patchset mostly protected against but had
> > > > some holes with regards to #!-style scripts).
> > >
> > > I generally like this, but, as Linus pointed out, it will be
> > > unfortunate if application authors see this as just another
> > > non-portable weird Linux API and don't use it.  Would it be worthwhile
> > > to put some thought into making it an API that other OSes might be
> > > willing to implement?  As it stands, the openat(2) flags are getting
> > > rather crazy in this patch set.
> 
> I think many of the issues are specific to Linux (and Linux containers
> especially), so I'm not sure this should get blocked because we want
> something more portable.

I agree these issues are quite Linux-specific (*especially* the ability
to re-open fds through /proc and the existence of "magic links").

However, I feel there are a few more good reasons for resolveat(2):

 * openat(2) ignores unknown flags, meaning that old kernels will ignore
   new programs trying to use O_THISROOT and might end up causing
   security issues. Yes, it'd be trivial to check whether the new O_*
   flags are supported at start-up, but I think a security feature
   shouldn't have a foot-gun associated with it. In fact, I didn't know
   openat(2) ignored unknown flags until I wrote this patchset -- I
   doubt many other userspace developers do either.

 * resolveat(2) allows for improvement to the O_PATH interface, which I
   think might be necessary (completely separately to this patchset).
   I've been working on a patchset which would make nd_jump_link()
   transitions in trailing_symlink() depend on the mode of the magic
   link being traversed through (this would allow us to block a
   read-only fd being re-opened as a read-write fd or similar such
   nonsense). One aspect of this could be to allow userspace to enable
   certain re-opening operations by passing a "link mode" to
   resolveat(2).

 * I would argue that O_PATH should've been a separate syscall from the
   beginning, given how different its semantics are to other openat(2)
   flags (not to mention how O_PATH is incompatible with and thus
   ignores so many other openat(2) flags).

 * If we end up needing a resolveat(2) for any of the above reasons,
   then we will have wasted quite a few openat(2) flag slots for naught.
   (Then again, there are plenty of flag slots still left.)

All of that aside, what I'd really like to know is what I should do to
get this patchset reviewed and merged. It's been largely radio-silence
for the last few revisions.

A simple resolveat(2) is fairly trivial (I have a version of it lying
around somewhere), but it doesn't make sense to polish it if there's no
chance Al is interested in it.

> This series provides solutions to so many different race and confusion
> issues, I'd really like to see it land. What's the next step here? Is
> this planned to go directly to Linus for v5.2, or is it going to live
> in -mm for a while? I'd really like to see this moving forward.

Given some of the security requirements of this interface, I think
getting it to live in -mm wouldn't be a bad idea so folks can shake the
bugs out before it's depended on by container runtimes.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RESEND v5 0/5] namei: vfs flags to restrict path resolution
  2019-04-24 15:38       ` Aleksa Sarai
@ 2019-04-25 13:22         ` Adam Borowski
  2019-04-25 19:45         ` Aleksa Sarai
  1 sibling, 0 replies; 14+ messages in thread
From: Adam Borowski @ 2019-04-25 13:22 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Kees Cook, Andy Lutomirski, Al Viro, Jeff Layton,
	J. Bruce Fields, Arnd Bergmann, David Howells, Eric Biederman,
	Jann Horn, Christian Brauner, David Drysdale, Tycho Andersen,
	Linux Containers, Linux FS Devel, Linux API, Andrew Morton,
	Alexei Starovoitov, Chanho Min, Oleg Nesterov, Aleksa Sarai,
	Linus Torvalds, LKML, linux-arch

On Thu, Apr 25, 2019 at 01:38:06AM +1000, Aleksa Sarai wrote:
>  * openat(2) ignores unknown flags, meaning that old kernels will ignore
>    new programs trying to use O_THISROOT and might end up causing
>    security issues. Yes, it'd be trivial to check whether the new O_*
>    flags are supported at start-up, but I think a security feature
>    shouldn't have a foot-gun associated with it. In fact, I didn't know
>    openat(2) ignored unknown flags until I wrote this patchset -- I
>    doubt many other userspace developers do either.

For this reason, I propose every new syscall that has flags to follow a
bitmask scheme, where any flag assigned a bit in the upper half returns
EOPNOTSUPP when called on an old kernel.  That would allow defining which
flags can be safely ignored and which can't.

It otherwise takes major hacks to implement a fail-if-not-supported flag
while keeping compat with old kernels.  For example, for mmap(), MAP_SHARED
has been duplicated as MAP_SHARED_VALIDATE just to allow an unrelated flag
(MAP_SYNC) to fail on old kernels.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ Imagine there are bandits in your house, your kid is bleeding out,
⢿⡄⠘⠷⠚⠋⠀ the house is on fire, and seven giant trumpets are playing in the
⠈⠳⣄⠀⠀⠀⠀ sky.  Your cat demands food.  The priority should be obvious...

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RESEND v5 0/5] namei: vfs flags to restrict path resolution
  2019-04-24 15:38       ` Aleksa Sarai
  2019-04-25 13:22         ` Adam Borowski
@ 2019-04-25 19:45         ` Aleksa Sarai
  1 sibling, 0 replies; 14+ messages in thread
From: Aleksa Sarai @ 2019-04-25 19:45 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andy Lutomirski, Al Viro, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, David Howells, Eric Biederman, Jann Horn,
	Christian Brauner, David Drysdale, Tycho Andersen,
	Linux Containers, Linux FS Devel, Linux API, Andrew Morton,
	Alexei Starovoitov, Chanho Min, Oleg Nesterov, Aleksa Sarai,
	Linus Torvalds, LKML, linux-arch

[-- Attachment #1: Type: text/plain, Size: 1043 bytes --]

On 2019-04-25, Aleksa Sarai <cyphar@cyphar.com> wrote:
> On 2019-04-23, Kees Cook <keescook@chromium.org> wrote:
> > This series provides solutions to so many different race and confusion
> > issues, I'd really like to see it land. What's the next step here? Is
> > this planned to go directly to Linus for v5.2, or is it going to live
> > in -mm for a while? I'd really like to see this moving forward.
> 
> Given some of the security requirements of this interface, I think
> getting it to live in -mm wouldn't be a bad idea so folks can shake the
> bugs out before it's depended on by container runtimes.

Scratch my mention of -mm, it should be in Al's tree since it touches
quite a few of the namei seqlocks. My point was that it should live in
someone's tree for a little bit before it goes into a release.

I will put together a PoC of a resolveat(2) variation of this series and
re-send it out with both versions.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH RESEND v5 4/5] namei: aggressively check for nd->root escape on ".." resolution
  2019-03-06 19:12 Aleksa Sarai
@ 2019-03-06 19:12 ` Aleksa Sarai
  0 siblings, 0 replies; 14+ messages in thread
From: Aleksa Sarai @ 2019-03-06 19:12 UTC (permalink / raw)
  To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells
  Cc: Aleksa Sarai, Jann Horn, Kees Cook, Eric Biederman,
	Andy Lutomirski, Andrew Morton, Alexei Starovoitov,
	Christian Brauner, David Drysdale, Chanho Min, Oleg Nesterov,
	Aleksa Sarai, containers, linux-fsdevel, linux-api, linux-kernel,
	linux-arch

This patch allows for O_BENEATH and O_THISROOT to safely permit ".."
resolution (in the case of O_BENEATH the resolution will still fail if
".." resolution would resolve a path outside of the root -- while
O_THISROOT will chroot(2)-style scope it). "magic link" jumps are still
disallowed entirely because now they could result in inconsistent
behaviour if resolution encounters a subsequent "..".

The need for this patch is explained by observing there is a fairly
easy-to-exploit race condition with chroot(2) (and thus by extension
O_THISROOT and O_BENEATH) where a rename(2) of a path can be used to
"skip over" nd->root and thus escape to the filesystem above nd->root.

  thread1 [attacker]:
    for (;;)
      renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
  thread2 [victim]:
    for (;;)
      openat(dirb, "b/c/../../etc/shadow", O_THISROOT);

With fairly significant regularity, thread2 will resolve to
"/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar
(though somewhat more privileged) attack using MS_MOVE.

With this patch, such cases will be detected *during* ".." resolution
(which is the weak point of chroot(2) -- since walking *into* a
subdirectory tautologically cannot result in you walking *outside*
nd->root -- except through a bind-mount or "magic link"). By detecting
this at ".." resolution (rather than checking only at the end of the
entire resolution) we can both correct escapes by jumping back to the
root (in the case of O_THISROOT), as well as avoid revealing to
attackers the structure of the filesystem outside of the root (through
timing attacks for instance).

In order to avoid a quadratic lookup with each ".." entry, we only
activate the slow path if a write through &rename_lock or &mount_lock
have occurred during path resolution (&rename_lock and &mount_lock are
re-taken to further optimise the lookup). Since the primary attack being
protected against is MS_MOVE or rename(2), not doing additional checks
unless a mount or rename have occurred avoids making the common case
slow.

The use of path_is_under() here might seem suspect, but on further
inspection of the most important race (a path was *inside* the root but
is now *outside*), there appears to be no attack potential. If
path_is_under() occurs before the rename, then the path will be resolved
but since the path was originally inside the root there is no escape.
Subsequent ".." jumps are guaranteed to check path_is_under() (by
construction, &rename_lock or &mount_lock must have been taken by the
attacker after path_is_under() returned in the victim), and thus will
not be able to escape from the previously-inside-root path. Walking down
is still safe since the entire subtree was moved (either by rename(2) or
MS_MOVE) and because (as discussed above) walking down is safe.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jann Horn <jannh@google.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/namei.c | 48 +++++++++++++++++++++++++++++++++---------------
 1 file changed, 33 insertions(+), 15 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 798eb1702a0c..af06b10b13a0 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -493,7 +493,7 @@ struct nameidata {
 	struct path	root;
 	struct inode	*inode; /* path.dentry.d_inode */
 	unsigned int	flags;
-	unsigned	seq, m_seq;
+	unsigned	seq, m_seq, r_seq;
 	int		last_type;
 	unsigned	depth;
 	int		total_link_count;
@@ -1741,19 +1741,35 @@ static inline int may_lookup(struct nameidata *nd)
 static inline int handle_dots(struct nameidata *nd, int type)
 {
 	if (type == LAST_DOTDOT) {
-		/*
-		 * LOOKUP_BENEATH resolving ".." is not currently safe -- races can
-		 * cause our parent to have moved outside of the root and us to skip
-		 * over it.
-		 */
-		if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT)))
-			return -EXDEV;
+		int error = 0;
+
 		if (!nd->root.mnt)
 			set_root(nd);
-		if (nd->flags & LOOKUP_RCU) {
-			return follow_dotdot_rcu(nd);
-		} else
-			return follow_dotdot(nd);
+		if (nd->flags & LOOKUP_RCU)
+			error = follow_dotdot_rcu(nd);
+		else
+			error = follow_dotdot(nd);
+		if (error)
+			return error;
+
+		if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))) {
+			bool m_retry = read_seqretry(&mount_lock, nd->m_seq);
+			bool r_retry = read_seqretry(&rename_lock, nd->r_seq);
+
+			/*
+			 * Don't bother checking unless there's a racing
+			 * rename(2) or MS_MOVE.
+			 */
+			if (likely(!m_retry && !r_retry))
+				return 0;
+
+			if (m_retry && !(nd->flags & LOOKUP_RCU))
+				nd->m_seq = read_seqbegin(&mount_lock);
+			if (r_retry)
+				nd->r_seq = read_seqbegin(&rename_lock);
+			if (!path_is_under(&nd->path, &nd->root))
+				return -EXDEV;
+		}
 	}
 	return 0;
 }
@@ -2274,6 +2290,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 	nd->last_type = LAST_ROOT; /* if there are only slashes... */
 	nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT;
 	nd->depth = 0;
+
+	nd->m_seq = read_seqbegin(&mount_lock);
+	if (unlikely(flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT)))
+		nd->r_seq = read_seqbegin(&rename_lock);
+
 	if (flags & LOOKUP_ROOT) {
 		struct dentry *root = nd->root.dentry;
 		struct inode *inode = root->d_inode;
@@ -2284,7 +2305,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 		if (flags & LOOKUP_RCU) {
 			nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
 			nd->root_seq = nd->seq;
-			nd->m_seq = read_seqbegin(&mount_lock);
 		} else {
 			path_get(&nd->path);
 		}
@@ -2295,8 +2315,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 	nd->path.mnt = NULL;
 	nd->path.dentry = NULL;
 
-	nd->m_seq = read_seqbegin(&mount_lock);
-
 	if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))) {
 		error = dirfd_path_init(nd);
 		if (unlikely(error))
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2019-04-25 19:46 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-20 14:37 [PATCH RESEND v5 0/5] namei: vfs flags to restrict path resolution Aleksa Sarai
2019-03-20 14:37 ` [PATCH RESEND v5 1/5] namei: split out nd->dfd handling to dirfd_path_init Aleksa Sarai
2019-03-20 14:37 ` [PATCH RESEND v5 2/5] namei: O_BENEATH-style path resolution flags Aleksa Sarai
2019-03-20 14:37 ` [PATCH RESEND v5 3/5] namei: O_THISROOT: chroot-like path resolution Aleksa Sarai
2019-03-20 14:37 ` [PATCH RESEND v5 4/5] namei: aggressively check for nd->root escape on ".." resolution Aleksa Sarai
2019-03-20 14:37 ` [PATCH RESEND v5 5/5] binfmt_*: scope path resolution of interpreters Aleksa Sarai
2019-03-21 17:06 ` [PATCH RESEND v5 0/5] namei: vfs flags to restrict path resolution Andy Lutomirski
2019-03-25 13:04   ` Aleksa Sarai
2019-04-23 20:13     ` Kees Cook
2019-04-23 20:24       ` Christian Brauner
2019-04-24 15:38       ` Aleksa Sarai
2019-04-25 13:22         ` Adam Borowski
2019-04-25 19:45         ` Aleksa Sarai
  -- strict thread matches above, loose matches on Subject: below --
2019-03-06 19:12 Aleksa Sarai
2019-03-06 19:12 ` [PATCH RESEND v5 4/5] namei: aggressively check for nd->root escape on ".." resolution Aleksa Sarai

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).