linux-arch.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 0/3] namei: implement various lookup restriction AT_* flags
@ 2018-10-09  7:02 Aleksa Sarai
  2018-10-09  7:02 ` Aleksa Sarai
                   ` (4 more replies)
  0 siblings, 5 replies; 32+ messages in thread
From: Aleksa Sarai @ 2018-10-09  7:02 UTC (permalink / raw)
  To: Al Viro, Eric Biederman
  Cc: Aleksa Sarai, Andy Lutomirski, David Howells, Jann Horn,
	Christian Brauner, David Drysdale, containers, linux-fsdevel,
	linux-api, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Tycho Andersen, dev, linux-kernel, linux-arch

The need for some sort of control over VFS's path resolution (to avoid
malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications. This patchset is a
revival of Al Viro's old AT_NO_JUMPS[1,2] patchset (which was a variant
of David Drysdale's O_BENEATH patchset[3] which was a spin-off of the
Capsicum project[4]) with a few additions and changes made based on the
previous discussion within [5] as well as others I felt were useful.

As per the discussion in the AT_NO_JUMPS thread, AT_NO_JUMPS has been
split into separate flags.

  * AT_XDEV blocks mountpoint crossings (both upwards and downwards).
      openat("/", "tmp", AT_XDEV); // blocked
      openat("/tmp", "..", AT_XDEV); // blocked
      openat("/tmp", "/", AT_XDEV); // blocked

  * AT_NO_PROCLINKS blocks all resolution through /proc/$pid/fd/$fd
	"symlinks". Specifically, this blocks all jumps caused by a
	filesystem using nd_jump_link() to shove you around in the
	filesystem tree (these are referred to as "proclinks" in lieu of a
	better name).
	  openat(AT_FDCWD, "/proc/self/root", AT_NO_PROCLINKS); // blocked
	  openat(AT_FDCWD, "/proc/self/fd/0", AT_NO_PROCLINKS); // blocked
	  openat(AT_FDCWD, "/proc/self/ns/mnt", AT_NO_PROCLINKS); // blocked

  * AT_BENEATH disallows escapes from the starting dirfd using ".." or
	absolute paths (either in the path or during symlink resolution).
	Conceptually this flag ensures that you "stay below" the starting
	point in the filesystem tree. ".." resolution is allowed if it
	doesn't land you outside of the starting point (this is made safe
	against races by patch 3 in this series).
	  openat("/root", "foo", AT_BENEATH); // *not* blocked
	  openat("/root", "a/../b", AT_BENEATH); // *not* blocked
	  openat("/root", "a/../../root/b", AT_BENEATH); // blocked
	  openat("/root", "/root", AT_BENEATH); // blocked

	AT_BENEATH also currently disallows all "proclink" resolution
	because they can trivially throw you outside of the starting point.
	In a future patch we might allow such resolution (as long as it
	stays within the root).
	  openat("/", "proc/self/exe", AT_BENEATH); // blocked

In addition, two more flags have been added to the series:

  * AT_NO_SYMLINKS disallows *all* symlink resolution, and thus implies
	AT_NO_PROCLINKS. Linus mentioned this is something that git would
	like to have in the original discussion[5].
	  // assuming 'ln -s / /usr'
	  openat("/", "/usr/bin", AT_NO_SYMLINKS); // blocked
	  openat("/", "/proc/self/root", AT_NO_PROCLINKS); // blocked

  * AT_THIS_ROOT is a very similar idea to AT_BENEATH, but it serves a
    very different purpose. Rather than blocking resolutions if they
	would go outside of the starting point, it treats the starting point
	as a form of chroot(2). Container runtimes are one of the primary
	justifications for this flag, as they currently have to implement
	this sort of path handling racily in userspace[6].

	The restrictions on "proclink" resolution are the same as with
	AT_BENEATH (though in AT_THIS_ROOT's case it's not really clear how
	"proclink" jumps outside of the root should be handled), and patch 3
	in this series was also required to make ".." resolution safe.

Currently all of these flags are only enabled for openat(2) (and thus
have their own O_* flag names), but the corresponding AT_* flags have
been reserved so they can be added to syscalls where openat(O_PATH) is
not sufficient.

Patch changelog:
  v2:
    * Made ".." resolution with AT_THIS_ROOT and AT_BENEATH safe(r) with
	  some semi-aggressive __d_path checking (see patch 3).
    * Disallowed "proclinks" with AT_THIS_ROOT and AT_BENEATH, in the
	  hopes they can be re-enabled once safe.
    * Removed the selftests as they will be reimplemented as xfstests.
	* Removed stat(2) support, since you can already get it through
	  O_PATH and fstatat(2).

[1]: https://lwn.net/Articles/721443/
[2]: https://lore.kernel.org/patchwork/patch/784221/
[3]: https://lwn.net/Articles/619151/
[4]: https://lwn.net/Articles/603929/
[5]: https://lwn.net/Articles/723057/
[6]: https://github.com/cyphar/filepath-securejoin

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: David Drysdale <drysdale@google.com>
Cc: <containers@lists.linux-foundation.org>
Cc: <linux-fsdevel@vger.kernel.org>
Cc: <linux-api@vger.kernel.org>

Aleksa Sarai (3):
  namei: implement O_BENEATH-style AT_* flags
  namei: implement AT_THIS_ROOT chroot-like path resolution
  namei: aggressively check for nd->root escape on ".." resolution

 fs/fcntl.c                       |   2 +-
 fs/namei.c                       | 241 +++++++++++++++++++++++--------
 fs/open.c                        |  10 ++
 fs/stat.c                        |   4 +-
 include/linux/fcntl.h            |   3 +-
 include/linux/namei.h            |   8 +
 include/uapi/asm-generic/fcntl.h |  20 +++
 include/uapi/linux/fcntl.h       |  10 ++
 8 files changed, 230 insertions(+), 68 deletions(-)

-- 
2.19.0

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v3 0/3] namei: implement various lookup restriction AT_* flags
  2018-10-09  7:02 [PATCH v3 0/3] namei: implement various lookup restriction AT_* flags Aleksa Sarai
@ 2018-10-09  7:02 ` Aleksa Sarai
  2018-10-09  7:02 ` [PATCH v3 1/3] namei: implement O_BENEATH-style " Aleksa Sarai
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 32+ messages in thread
From: Aleksa Sarai @ 2018-10-09  7:02 UTC (permalink / raw)
  To: Al Viro, Eric Biederman
  Cc: Aleksa Sarai, Andy Lutomirski, David Howells, Jann Horn,
	Christian Brauner, David Drysdale, containers, linux-fsdevel,
	linux-api, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Tycho Andersen, dev, linux-kernel, linux-arch

The need for some sort of control over VFS's path resolution (to avoid
malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications. This patchset is a
revival of Al Viro's old AT_NO_JUMPS[1,2] patchset (which was a variant
of David Drysdale's O_BENEATH patchset[3] which was a spin-off of the
Capsicum project[4]) with a few additions and changes made based on the
previous discussion within [5] as well as others I felt were useful.

As per the discussion in the AT_NO_JUMPS thread, AT_NO_JUMPS has been
split into separate flags.

  * AT_XDEV blocks mountpoint crossings (both upwards and downwards).
      openat("/", "tmp", AT_XDEV); // blocked
      openat("/tmp", "..", AT_XDEV); // blocked
      openat("/tmp", "/", AT_XDEV); // blocked

  * AT_NO_PROCLINKS blocks all resolution through /proc/$pid/fd/$fd
	"symlinks". Specifically, this blocks all jumps caused by a
	filesystem using nd_jump_link() to shove you around in the
	filesystem tree (these are referred to as "proclinks" in lieu of a
	better name).
	  openat(AT_FDCWD, "/proc/self/root", AT_NO_PROCLINKS); // blocked
	  openat(AT_FDCWD, "/proc/self/fd/0", AT_NO_PROCLINKS); // blocked
	  openat(AT_FDCWD, "/proc/self/ns/mnt", AT_NO_PROCLINKS); // blocked

  * AT_BENEATH disallows escapes from the starting dirfd using ".." or
	absolute paths (either in the path or during symlink resolution).
	Conceptually this flag ensures that you "stay below" the starting
	point in the filesystem tree. ".." resolution is allowed if it
	doesn't land you outside of the starting point (this is made safe
	against races by patch 3 in this series).
	  openat("/root", "foo", AT_BENEATH); // *not* blocked
	  openat("/root", "a/../b", AT_BENEATH); // *not* blocked
	  openat("/root", "a/../../root/b", AT_BENEATH); // blocked
	  openat("/root", "/root", AT_BENEATH); // blocked

	AT_BENEATH also currently disallows all "proclink" resolution
	because they can trivially throw you outside of the starting point.
	In a future patch we might allow such resolution (as long as it
	stays within the root).
	  openat("/", "proc/self/exe", AT_BENEATH); // blocked

In addition, two more flags have been added to the series:

  * AT_NO_SYMLINKS disallows *all* symlink resolution, and thus implies
	AT_NO_PROCLINKS. Linus mentioned this is something that git would
	like to have in the original discussion[5].
	  // assuming 'ln -s / /usr'
	  openat("/", "/usr/bin", AT_NO_SYMLINKS); // blocked
	  openat("/", "/proc/self/root", AT_NO_PROCLINKS); // blocked

  * AT_THIS_ROOT is a very similar idea to AT_BENEATH, but it serves a
    very different purpose. Rather than blocking resolutions if they
	would go outside of the starting point, it treats the starting point
	as a form of chroot(2). Container runtimes are one of the primary
	justifications for this flag, as they currently have to implement
	this sort of path handling racily in userspace[6].

	The restrictions on "proclink" resolution are the same as with
	AT_BENEATH (though in AT_THIS_ROOT's case it's not really clear how
	"proclink" jumps outside of the root should be handled), and patch 3
	in this series was also required to make ".." resolution safe.

Currently all of these flags are only enabled for openat(2) (and thus
have their own O_* flag names), but the corresponding AT_* flags have
been reserved so they can be added to syscalls where openat(O_PATH) is
not sufficient.

Patch changelog:
  v2:
    * Made ".." resolution with AT_THIS_ROOT and AT_BENEATH safe(r) with
	  some semi-aggressive __d_path checking (see patch 3).
    * Disallowed "proclinks" with AT_THIS_ROOT and AT_BENEATH, in the
	  hopes they can be re-enabled once safe.
    * Removed the selftests as they will be reimplemented as xfstests.
	* Removed stat(2) support, since you can already get it through
	  O_PATH and fstatat(2).

[1]: https://lwn.net/Articles/721443/
[2]: https://lore.kernel.org/patchwork/patch/784221/
[3]: https://lwn.net/Articles/619151/
[4]: https://lwn.net/Articles/603929/
[5]: https://lwn.net/Articles/723057/
[6]: https://github.com/cyphar/filepath-securejoin

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: David Drysdale <drysdale@google.com>
Cc: <containers@lists.linux-foundation.org>
Cc: <linux-fsdevel@vger.kernel.org>
Cc: <linux-api@vger.kernel.org>

Aleksa Sarai (3):
  namei: implement O_BENEATH-style AT_* flags
  namei: implement AT_THIS_ROOT chroot-like path resolution
  namei: aggressively check for nd->root escape on ".." resolution

 fs/fcntl.c                       |   2 +-
 fs/namei.c                       | 241 +++++++++++++++++++++++--------
 fs/open.c                        |  10 ++
 fs/stat.c                        |   4 +-
 include/linux/fcntl.h            |   3 +-
 include/linux/namei.h            |   8 +
 include/uapi/asm-generic/fcntl.h |  20 +++
 include/uapi/linux/fcntl.h       |  10 ++
 8 files changed, 230 insertions(+), 68 deletions(-)

-- 
2.19.0

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v3 1/3] namei: implement O_BENEATH-style AT_* flags
  2018-10-09  7:02 [PATCH v3 0/3] namei: implement various lookup restriction AT_* flags Aleksa Sarai
  2018-10-09  7:02 ` Aleksa Sarai
@ 2018-10-09  7:02 ` Aleksa Sarai
  2018-10-09  7:02   ` Aleksa Sarai
  2018-10-13  7:33   ` Al Viro
  2018-10-09  7:02 ` [PATCH v3 2/3] namei: implement AT_THIS_ROOT chroot-like path resolution Aleksa Sarai
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 32+ messages in thread
From: Aleksa Sarai @ 2018-10-09  7:02 UTC (permalink / raw)
  To: Al Viro, Eric Biederman
  Cc: Aleksa Sarai, Christian Brauner, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, Andy Lutomirski, David Howells, Jann Horn,
	Tycho Andersen, David Drysdale, dev, containers, linux-fsdevel,
	linux-kernel, linux-arch, linux-api

Add the following flags to allow various restrictions on path
resolution (these affect the *entire* resolution, rather than just the
final path component -- as is the case with most other AT_* flags).

The primary justification for these flags is to allow for programs to be
far more strict about how they want path resolution to handle symlinks,
mountpoint crossings, and paths that escape the dirfd (through an
absolute path or ".." shenanigans).

This is of particular concern to container runtimes that want to be very
careful about malicious root filesystems that a container's init might
have screwed around with (and there is no real way to protect against
this in userspace if you consider potential races against a malicious
container's init). More classical applications (which have their own
potentially buggy userspace path sanitisation code) include web
servers, archive extraction tools, network file servers, and so on.

* AT_XDEV: Disallow mount-point crossing (both *down* into one, or *up*
  from one). The primary "scoping" use is to blocking resolution that
  crosses a bind-mount, which has a similar property to a symlink (in
  the way that it allows for escape from the starting-point). Since it
  is not possible to differentiate bind-mounts However since
  bind-mounting requires privileges (in ways symlinks don't) this has
  been split from LOOKUP_BENEATH. The naming is based on "find -xdev"
  (though find(1) doesn't walk upwards, the semantics seem obvious).

* AT_NO_PROCLINK: Disallows ->get_link "symlink" jumping. This is a very
  specific restriction, and it exists because /proc/$pid/fd/...
  "symlinks" allow for access outside nd->root and pose risk to
  container runtimes that don't want to be tricked into accessing a host
  path (but do want to allow no-funny-business symlink resolution).

* AT_NO_SYMLINK: Disallows symlink jumping *of any kind*. Implies
  AT_NO_PROCLINK (obviously).

* AT_BENEATH: Disallow "escapes" from the starting point of the
  filesystem tree during resolution (you must stay "beneath" the
  starting point at all times). Currently this is done by disallowing
  ".." and absolute paths (either in the given path or found during
  symlink resolution) entirely, as well as all "proclink" jumping.

  The wholesale banning of ".." is because it is currently not safe to
  allow ".." resolution (races can cause the path to be moved outside of
  the root -- this is conceptually similar to historical chroot(2)
  escape attacks). Future patches in this series will address this, and
  will re-enable ".." resolution once it is safe. With those patches,
  ".." resolution will only be allowed if it remains in the root
  throughout resolution (such as "a/../b" not "a/../../outside/b").

  The banning of "proclink" jumping is done because it is not clear
  whether semantically they should be allowed -- while some "proclinks"
  are safe there are many that can cause escapes (and once a resolution
  is outside of the root, AT_BENEATH will no longer detect it). Future
  patches may re-enable "proclink" jumping when such jumps would remain
  inside the root.

The AT_NO_*LINK flags return -ELOOP if path resolution would violates
their requirement, while the others all return -EXDEV. Currently these
are only enabled for openat(2) (which has its own brand of O_* flags
with the same semantics). However the AT_* flags have been reserved for
future support in other *at(2) syscalls (though because of AT_EMPTY_PATH
many *at(2) operations will not need to support these flags directly).

This is a refresh of Al's AT_NO_JUMPS patchset[1] (which was a variation
on David Drysdale's O_BENEATH patchset[2], which in turn was based on
the Capsicum project[3]). Input from Linus and Andy in the AT_NO_JUMPS
thread[4] determined most of the API changes made in this refresh.

[1]: https://lwn.net/Articles/721443/
[2]: https://lwn.net/Articles/619151/
[3]: https://lwn.net/Articles/603929/
[4]: https://lwn.net/Articles/723057/

Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Christian Brauner <christian@brauner.io>
Suggested-by: David Drysdale <drysdale@google.com>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/fcntl.c                       |   2 +-
 fs/namei.c                       | 174 ++++++++++++++++++++++---------
 fs/open.c                        |   8 ++
 fs/stat.c                        |   4 +-
 include/linux/fcntl.h            |   3 +-
 include/linux/namei.h            |   7 ++
 include/uapi/asm-generic/fcntl.h |  17 +++
 include/uapi/linux/fcntl.h       |   8 ++
 8 files changed, 167 insertions(+), 56 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 4137d96534a6..e343618736f7 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1031,7 +1031,7 @@ static int __init fcntl_init(void)
 	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 	 * is defined as O_NONBLOCK on some platforms and not on others.
 	 */
-	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
+	BUILD_BUG_ON(25 - 1 /* for O_RDONLY being 0 */ !=
 		HWEIGHT32(
 			(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
 			__FMODE_EXEC | __FMODE_NONOTIFY));
diff --git a/fs/namei.c b/fs/namei.c
index fb913148d4d1..76eacd3af89b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -845,6 +845,12 @@ static inline void path_to_nameidata(const struct path *path,
 
 static int nd_jump_root(struct nameidata *nd)
 {
+	if (unlikely(nd->flags & LOOKUP_BENEATH))
+		return -EXDEV;
+	if (unlikely(nd->flags & LOOKUP_XDEV)) {
+		if (nd->path.mnt != nd->root.mnt)
+			return -EXDEV;
+	}
 	if (nd->flags & LOOKUP_RCU) {
 		struct dentry *d;
 		nd->path = nd->root;
@@ -1083,14 +1089,23 @@ const char *get_link(struct nameidata *nd)
 		} else {
 			res = get(dentry, inode, &last->done);
 		}
+		/* If we just jumped it was because of a procfs-style link. */
+		if (unlikely(nd->flags & LOOKUP_JUMPED)) {
+			if (unlikely(nd->flags & LOOKUP_NO_PROCLINKS))
+				return ERR_PTR(-ELOOP);
+			/* Not currently safe. */
+			if (unlikely(nd->flags & LOOKUP_BENEATH))
+				return ERR_PTR(-EXDEV);
+		}
 		if (IS_ERR_OR_NULL(res))
 			return res;
 	}
 	if (*res == '/') {
 		if (!nd->root.mnt)
 			set_root(nd);
-		if (unlikely(nd_jump_root(nd)))
-			return ERR_PTR(-ECHILD);
+		error = nd_jump_root(nd);
+		if (unlikely(error))
+			return ERR_PTR(error);
 		while (unlikely(*++res == '/'))
 			;
 	}
@@ -1271,12 +1286,16 @@ static int follow_managed(struct path *path, struct nameidata *nd)
 		break;
 	}
 
-	if (need_mntput && path->mnt == mnt)
-		mntput(path->mnt);
+	if (need_mntput) {
+		if (path->mnt == mnt)
+			mntput(path->mnt);
+		if (unlikely(nd->flags & LOOKUP_XDEV))
+			ret = -EXDEV;
+		else
+			nd->flags |= LOOKUP_JUMPED;
+	}
 	if (ret == -EISDIR || !ret)
 		ret = 1;
-	if (need_mntput)
-		nd->flags |= LOOKUP_JUMPED;
 	if (unlikely(ret < 0))
 		path_put_conditional(path, nd);
 	return ret;
@@ -1333,6 +1352,8 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
 		mounted = __lookup_mnt(path->mnt, path->dentry);
 		if (!mounted)
 			break;
+		if (unlikely(nd->flags & LOOKUP_XDEV))
+			return false;
 		path->mnt = &mounted->mnt;
 		path->dentry = mounted->mnt.mnt_root;
 		nd->flags |= LOOKUP_JUMPED;
@@ -1353,8 +1374,11 @@ static int follow_dotdot_rcu(struct nameidata *nd)
 	struct inode *inode = nd->inode;
 
 	while (1) {
-		if (path_equal(&nd->path, &nd->root))
+		if (path_equal(&nd->path, &nd->root)) {
+			if (unlikely(nd->flags & LOOKUP_BENEATH))
+				return -EXDEV;
 			break;
+		}
 		if (nd->path.dentry != nd->path.mnt->mnt_root) {
 			struct dentry *old = nd->path.dentry;
 			struct dentry *parent = old->d_parent;
@@ -1379,6 +1403,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
 				return -ECHILD;
 			if (&mparent->mnt == nd->path.mnt)
 				break;
+			if (unlikely(nd->flags & LOOKUP_XDEV))
+				return -EXDEV;
 			/* we know that mountpoint was pinned */
 			nd->path.dentry = mountpoint;
 			nd->path.mnt = &mparent->mnt;
@@ -1393,6 +1419,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
 			return -ECHILD;
 		if (!mounted)
 			break;
+		if (unlikely(nd->flags & LOOKUP_XDEV))
+			return -EXDEV;
 		nd->path.mnt = &mounted->mnt;
 		nd->path.dentry = mounted->mnt.mnt_root;
 		inode = nd->path.dentry->d_inode;
@@ -1481,8 +1509,11 @@ static int path_parent_directory(struct path *path)
 static int follow_dotdot(struct nameidata *nd)
 {
 	while(1) {
-		if (path_equal(&nd->path, &nd->root))
+		if (path_equal(&nd->path, &nd->root)) {
+			if (unlikely(nd->flags & LOOKUP_BENEATH))
+				return -EXDEV;
 			break;
+		}
 		if (nd->path.dentry != nd->path.mnt->mnt_root) {
 			int ret = path_parent_directory(&nd->path);
 			if (ret)
@@ -1491,6 +1522,8 @@ static int follow_dotdot(struct nameidata *nd)
 		}
 		if (!follow_up(&nd->path))
 			break;
+		if (unlikely(nd->flags & LOOKUP_XDEV))
+			return -EXDEV;
 	}
 	follow_mount(&nd->path);
 	nd->inode = nd->path.dentry->d_inode;
@@ -1705,6 +1738,12 @@ static inline int may_lookup(struct nameidata *nd)
 static inline int handle_dots(struct nameidata *nd, int type)
 {
 	if (type == LAST_DOTDOT) {
+		/*
+		 * AT_BENEATH resolving ".." is not currently safe -- races can cause
+		 * our parent to have moved outside of the root and us to skip over it.
+		 */
+		if (unlikely(nd->flags & LOOKUP_BENEATH))
+			return -EXDEV;
 		if (!nd->root.mnt)
 			set_root(nd);
 		if (nd->flags & LOOKUP_RCU) {
@@ -1720,6 +1759,8 @@ static int pick_link(struct nameidata *nd, struct path *link,
 {
 	int error;
 	struct saved *last;
+	if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS))
+		return -ELOOP;
 	if (unlikely(nd->total_link_count++ >= MAXSYMLINKS)) {
 		path_to_nameidata(link, nd);
 		return -ELOOP;
@@ -2168,13 +2209,70 @@ static int link_path_walk(const char *name, struct nameidata *nd)
 	}
 }
 
+/*
+ * Configure nd->path based on the nd->dfd. This is only used as part of
+ * path_init().
+ */
+static inline int dirfd_path_init(struct nameidata *nd)
+{
+	if (nd->dfd == AT_FDCWD) {
+		if (nd->flags & LOOKUP_RCU) {
+			struct fs_struct *fs = current->fs;
+			unsigned seq;
+
+			do {
+				seq = read_seqcount_begin(&fs->seq);
+				nd->path = fs->pwd;
+				nd->inode = nd->path.dentry->d_inode;
+				nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
+			} while (read_seqcount_retry(&fs->seq, seq));
+		} else {
+			get_fs_pwd(current->fs, &nd->path);
+			nd->inode = nd->path.dentry->d_inode;
+		}
+	} else {
+		/* Caller must check execute permissions on the starting path component */
+		struct fd f = fdget_raw(nd->dfd);
+		struct dentry *dentry;
+
+		if (!f.file)
+			return -EBADF;
+
+		dentry = f.file->f_path.dentry;
+
+		if (*nd->name->name && unlikely(!d_can_lookup(dentry))) {
+			fdput(f);
+			return -ENOTDIR;
+		}
+
+		nd->path = f.file->f_path;
+		if (nd->flags & LOOKUP_RCU) {
+			nd->inode = nd->path.dentry->d_inode;
+			nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
+		} else {
+			path_get(&nd->path);
+			nd->inode = nd->path.dentry->d_inode;
+		}
+		fdput(f);
+	}
+	if (unlikely(nd->flags & LOOKUP_BENEATH)) {
+		nd->root = nd->path;
+		if (!(nd->flags & LOOKUP_RCU))
+			path_get(&nd->root);
+	}
+	return 0;
+}
+
 /* must be paired with terminate_walk() */
 static const char *path_init(struct nameidata *nd, unsigned flags)
 {
+	int error;
 	const char *s = nd->name->name;
 
 	if (!*s)
 		flags &= ~LOOKUP_RCU;
+	if (flags & LOOKUP_NO_SYMLINKS)
+		flags |= LOOKUP_NO_PROCLINKS;
 	if (flags & LOOKUP_RCU)
 		rcu_read_lock();
 
@@ -2203,53 +2301,25 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 	nd->path.dentry = NULL;
 
 	nd->m_seq = read_seqbegin(&mount_lock);
+	if (unlikely(flags & LOOKUP_XDEV)) {
+		error = dirfd_path_init(nd);
+		if (unlikely(error))
+			return ERR_PTR(error);
+	}
 	if (*s == '/') {
-		set_root(nd);
-		if (likely(!nd_jump_root(nd)))
-			return s;
-		return ERR_PTR(-ECHILD);
-	} else if (nd->dfd == AT_FDCWD) {
-		if (flags & LOOKUP_RCU) {
-			struct fs_struct *fs = current->fs;
-			unsigned seq;
-
-			do {
-				seq = read_seqcount_begin(&fs->seq);
-				nd->path = fs->pwd;
-				nd->inode = nd->path.dentry->d_inode;
-				nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
-			} while (read_seqcount_retry(&fs->seq, seq));
-		} else {
-			get_fs_pwd(current->fs, &nd->path);
-			nd->inode = nd->path.dentry->d_inode;
-		}
-		return s;
-	} else {
-		/* Caller must check execute permissions on the starting path component */
-		struct fd f = fdget_raw(nd->dfd);
-		struct dentry *dentry;
-
-		if (!f.file)
-			return ERR_PTR(-EBADF);
-
-		dentry = f.file->f_path.dentry;
-
-		if (*s && unlikely(!d_can_lookup(dentry))) {
-			fdput(f);
-			return ERR_PTR(-ENOTDIR);
-		}
-
-		nd->path = f.file->f_path;
-		if (flags & LOOKUP_RCU) {
-			nd->inode = nd->path.dentry->d_inode;
-			nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
-		} else {
-			path_get(&nd->path);
-			nd->inode = nd->path.dentry->d_inode;
-		}
-		fdput(f);
+		if (likely(!nd->root.mnt))
+			set_root(nd);
+		error = nd_jump_root(nd);
+		if (unlikely(error))
+			s = ERR_PTR(error);
 		return s;
 	}
+	if (likely(!nd->path.mnt)) {
+		error = dirfd_path_init(nd);
+		if (unlikely(error))
+			return ERR_PTR(error);
+	}
+	return s;
 }
 
 static const char *trailing_symlink(struct nameidata *nd)
diff --git a/fs/open.c b/fs/open.c
index 0285ce7dbd51..80f5f566a5ff 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -988,6 +988,14 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
 		lookup_flags |= LOOKUP_DIRECTORY;
 	if (!(flags & O_NOFOLLOW))
 		lookup_flags |= LOOKUP_FOLLOW;
+	if (flags & O_BENEATH)
+		lookup_flags |= LOOKUP_BENEATH;
+	if (flags & O_XDEV)
+		lookup_flags |= LOOKUP_XDEV;
+	if (flags & O_NOPROCLINKS)
+		lookup_flags |= LOOKUP_NO_PROCLINKS;
+	if (flags & O_NOSYMLINKS)
+		lookup_flags |= LOOKUP_NO_SYMLINKS;
 	op->lookup_flags = lookup_flags;
 	return 0;
 }
diff --git a/fs/stat.c b/fs/stat.c
index f8e6fb2c3657..d319a468c704 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -170,8 +170,8 @@ int vfs_statx(int dfd, const char __user *filename, int flags,
 	int error = -EINVAL;
 	unsigned int lookup_flags = LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT;
 
-	if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT |
-		       AT_EMPTY_PATH | KSTAT_QUERY_FLAGS)) != 0)
+	if (flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT | AT_EMPTY_PATH |
+		      KSTAT_QUERY_FLAGS))
 		return -EINVAL;
 
 	if (flags & AT_SYMLINK_NOFOLLOW)
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index 27dc7a60693e..ad5bba4b5b12 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -9,7 +9,8 @@
 	(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
 	 O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
 	 FASYNC	| O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
-	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
+	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_BENEATH | O_XDEV | \
+	 O_NOPROCLINKS | O_NOSYMLINKS)
 
 #ifndef force_o_largefile
 #define force_o_largefile() (BITS_PER_LONG != 32)
diff --git a/include/linux/namei.h b/include/linux/namei.h
index a78606e8e3df..5ff7f3362d1b 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -47,6 +47,13 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 #define LOOKUP_EMPTY		0x4000
 #define LOOKUP_DOWN		0x8000
 
+/* Scoping flags for lookup. */
+#define LOOKUP_BENEATH		0x010000 /* No escaping from starting point. */
+#define LOOKUP_XDEV		0x020000 /* No mountpoint crossing. */
+#define LOOKUP_NO_PROCLINKS	0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */
+#define LOOKUP_NO_SYMLINKS	0x080000 /* No symlink crossing *at all*.
+					    Implies LOOKUP_NO_PROCLINKS. */
+
 extern int path_pts(struct path *path);
 
 extern int user_path_at_empty(int, const char __user *, unsigned, struct path *, int *empty);
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 9dc0bf0c5a6e..c2bf5983e46a 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -97,6 +97,23 @@
 #define O_NDELAY	O_NONBLOCK
 #endif
 
+/*
+ * These are identical to their AT_* counterparts (which affect the entireity
+ * of path resolution).
+ */
+#ifndef O_BENEATH
+#define O_BENEATH	00040000000 /* *Not* the same as capsicum's O_BENEATH! */
+#endif
+#ifndef O_XDEV
+#define O_XDEV		00100000000
+#endif
+#ifndef O_NOPROCLINKS
+#define O_NOPROCLINKS	00200000000
+#endif
+#ifndef O_NOSYMLINKS
+#define O_NOSYMLINKS	01000000000
+#endif
+
 #define F_DUPFD		0	/* dup */
 #define F_GETFD		1	/* get close_on_exec */
 #define F_SETFD		2	/* set/clear close_on_exec */
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 594b85f7cb86..551a9e2166a8 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -92,5 +92,13 @@
 
 #define AT_RECURSIVE		0x8000	/* Apply to the entire subtree */
 
+/* Flags which affect path *resolution*, not just last-component handling. */
+#define AT_BENEATH		0x10000	/* No absolute paths or ".." escaping
+					   (in-path or through symlinks) */
+#define AT_XDEV			0x20000 /* No mountpoint crossing. */
+#define AT_NO_PROCLINKS		0x40000 /* No /proc/$pid/fd/... "symlinks". */
+#define AT_NO_SYMLINKS		0x80000 /* No symlinks *at all*.
+					   Implies AT_NO_PROCLINKS. */
+
 
 #endif /* _UAPI_LINUX_FCNTL_H */
-- 
2.19.0

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v3 1/3] namei: implement O_BENEATH-style AT_* flags
  2018-10-09  7:02 ` [PATCH v3 1/3] namei: implement O_BENEATH-style " Aleksa Sarai
@ 2018-10-09  7:02   ` Aleksa Sarai
  2018-10-13  7:33   ` Al Viro
  1 sibling, 0 replies; 32+ messages in thread
From: Aleksa Sarai @ 2018-10-09  7:02 UTC (permalink / raw)
  To: Al Viro, Eric Biederman
  Cc: Aleksa Sarai, Christian Brauner, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, Andy Lutomirski, David Howells, Jann Horn,
	Tycho Andersen, David Drysdale, dev, containers, linux-fsdevel,
	linux-kernel, linux-arch, linux-api

Add the following flags to allow various restrictions on path
resolution (these affect the *entire* resolution, rather than just the
final path component -- as is the case with most other AT_* flags).

The primary justification for these flags is to allow for programs to be
far more strict about how they want path resolution to handle symlinks,
mountpoint crossings, and paths that escape the dirfd (through an
absolute path or ".." shenanigans).

This is of particular concern to container runtimes that want to be very
careful about malicious root filesystems that a container's init might
have screwed around with (and there is no real way to protect against
this in userspace if you consider potential races against a malicious
container's init). More classical applications (which have their own
potentially buggy userspace path sanitisation code) include web
servers, archive extraction tools, network file servers, and so on.

* AT_XDEV: Disallow mount-point crossing (both *down* into one, or *up*
  from one). The primary "scoping" use is to blocking resolution that
  crosses a bind-mount, which has a similar property to a symlink (in
  the way that it allows for escape from the starting-point). Since it
  is not possible to differentiate bind-mounts However since
  bind-mounting requires privileges (in ways symlinks don't) this has
  been split from LOOKUP_BENEATH. The naming is based on "find -xdev"
  (though find(1) doesn't walk upwards, the semantics seem obvious).

* AT_NO_PROCLINK: Disallows ->get_link "symlink" jumping. This is a very
  specific restriction, and it exists because /proc/$pid/fd/...
  "symlinks" allow for access outside nd->root and pose risk to
  container runtimes that don't want to be tricked into accessing a host
  path (but do want to allow no-funny-business symlink resolution).

* AT_NO_SYMLINK: Disallows symlink jumping *of any kind*. Implies
  AT_NO_PROCLINK (obviously).

* AT_BENEATH: Disallow "escapes" from the starting point of the
  filesystem tree during resolution (you must stay "beneath" the
  starting point at all times). Currently this is done by disallowing
  ".." and absolute paths (either in the given path or found during
  symlink resolution) entirely, as well as all "proclink" jumping.

  The wholesale banning of ".." is because it is currently not safe to
  allow ".." resolution (races can cause the path to be moved outside of
  the root -- this is conceptually similar to historical chroot(2)
  escape attacks). Future patches in this series will address this, and
  will re-enable ".." resolution once it is safe. With those patches,
  ".." resolution will only be allowed if it remains in the root
  throughout resolution (such as "a/../b" not "a/../../outside/b").

  The banning of "proclink" jumping is done because it is not clear
  whether semantically they should be allowed -- while some "proclinks"
  are safe there are many that can cause escapes (and once a resolution
  is outside of the root, AT_BENEATH will no longer detect it). Future
  patches may re-enable "proclink" jumping when such jumps would remain
  inside the root.

The AT_NO_*LINK flags return -ELOOP if path resolution would violates
their requirement, while the others all return -EXDEV. Currently these
are only enabled for openat(2) (which has its own brand of O_* flags
with the same semantics). However the AT_* flags have been reserved for
future support in other *at(2) syscalls (though because of AT_EMPTY_PATH
many *at(2) operations will not need to support these flags directly).

This is a refresh of Al's AT_NO_JUMPS patchset[1] (which was a variation
on David Drysdale's O_BENEATH patchset[2], which in turn was based on
the Capsicum project[3]). Input from Linus and Andy in the AT_NO_JUMPS
thread[4] determined most of the API changes made in this refresh.

[1]: https://lwn.net/Articles/721443/
[2]: https://lwn.net/Articles/619151/
[3]: https://lwn.net/Articles/603929/
[4]: https://lwn.net/Articles/723057/

Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Christian Brauner <christian@brauner.io>
Suggested-by: David Drysdale <drysdale@google.com>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/fcntl.c                       |   2 +-
 fs/namei.c                       | 174 ++++++++++++++++++++++---------
 fs/open.c                        |   8 ++
 fs/stat.c                        |   4 +-
 include/linux/fcntl.h            |   3 +-
 include/linux/namei.h            |   7 ++
 include/uapi/asm-generic/fcntl.h |  17 +++
 include/uapi/linux/fcntl.h       |   8 ++
 8 files changed, 167 insertions(+), 56 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 4137d96534a6..e343618736f7 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1031,7 +1031,7 @@ static int __init fcntl_init(void)
 	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 	 * is defined as O_NONBLOCK on some platforms and not on others.
 	 */
-	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
+	BUILD_BUG_ON(25 - 1 /* for O_RDONLY being 0 */ !=
 		HWEIGHT32(
 			(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
 			__FMODE_EXEC | __FMODE_NONOTIFY));
diff --git a/fs/namei.c b/fs/namei.c
index fb913148d4d1..76eacd3af89b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -845,6 +845,12 @@ static inline void path_to_nameidata(const struct path *path,
 
 static int nd_jump_root(struct nameidata *nd)
 {
+	if (unlikely(nd->flags & LOOKUP_BENEATH))
+		return -EXDEV;
+	if (unlikely(nd->flags & LOOKUP_XDEV)) {
+		if (nd->path.mnt != nd->root.mnt)
+			return -EXDEV;
+	}
 	if (nd->flags & LOOKUP_RCU) {
 		struct dentry *d;
 		nd->path = nd->root;
@@ -1083,14 +1089,23 @@ const char *get_link(struct nameidata *nd)
 		} else {
 			res = get(dentry, inode, &last->done);
 		}
+		/* If we just jumped it was because of a procfs-style link. */
+		if (unlikely(nd->flags & LOOKUP_JUMPED)) {
+			if (unlikely(nd->flags & LOOKUP_NO_PROCLINKS))
+				return ERR_PTR(-ELOOP);
+			/* Not currently safe. */
+			if (unlikely(nd->flags & LOOKUP_BENEATH))
+				return ERR_PTR(-EXDEV);
+		}
 		if (IS_ERR_OR_NULL(res))
 			return res;
 	}
 	if (*res == '/') {
 		if (!nd->root.mnt)
 			set_root(nd);
-		if (unlikely(nd_jump_root(nd)))
-			return ERR_PTR(-ECHILD);
+		error = nd_jump_root(nd);
+		if (unlikely(error))
+			return ERR_PTR(error);
 		while (unlikely(*++res == '/'))
 			;
 	}
@@ -1271,12 +1286,16 @@ static int follow_managed(struct path *path, struct nameidata *nd)
 		break;
 	}
 
-	if (need_mntput && path->mnt == mnt)
-		mntput(path->mnt);
+	if (need_mntput) {
+		if (path->mnt == mnt)
+			mntput(path->mnt);
+		if (unlikely(nd->flags & LOOKUP_XDEV))
+			ret = -EXDEV;
+		else
+			nd->flags |= LOOKUP_JUMPED;
+	}
 	if (ret == -EISDIR || !ret)
 		ret = 1;
-	if (need_mntput)
-		nd->flags |= LOOKUP_JUMPED;
 	if (unlikely(ret < 0))
 		path_put_conditional(path, nd);
 	return ret;
@@ -1333,6 +1352,8 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
 		mounted = __lookup_mnt(path->mnt, path->dentry);
 		if (!mounted)
 			break;
+		if (unlikely(nd->flags & LOOKUP_XDEV))
+			return false;
 		path->mnt = &mounted->mnt;
 		path->dentry = mounted->mnt.mnt_root;
 		nd->flags |= LOOKUP_JUMPED;
@@ -1353,8 +1374,11 @@ static int follow_dotdot_rcu(struct nameidata *nd)
 	struct inode *inode = nd->inode;
 
 	while (1) {
-		if (path_equal(&nd->path, &nd->root))
+		if (path_equal(&nd->path, &nd->root)) {
+			if (unlikely(nd->flags & LOOKUP_BENEATH))
+				return -EXDEV;
 			break;
+		}
 		if (nd->path.dentry != nd->path.mnt->mnt_root) {
 			struct dentry *old = nd->path.dentry;
 			struct dentry *parent = old->d_parent;
@@ -1379,6 +1403,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
 				return -ECHILD;
 			if (&mparent->mnt == nd->path.mnt)
 				break;
+			if (unlikely(nd->flags & LOOKUP_XDEV))
+				return -EXDEV;
 			/* we know that mountpoint was pinned */
 			nd->path.dentry = mountpoint;
 			nd->path.mnt = &mparent->mnt;
@@ -1393,6 +1419,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
 			return -ECHILD;
 		if (!mounted)
 			break;
+		if (unlikely(nd->flags & LOOKUP_XDEV))
+			return -EXDEV;
 		nd->path.mnt = &mounted->mnt;
 		nd->path.dentry = mounted->mnt.mnt_root;
 		inode = nd->path.dentry->d_inode;
@@ -1481,8 +1509,11 @@ static int path_parent_directory(struct path *path)
 static int follow_dotdot(struct nameidata *nd)
 {
 	while(1) {
-		if (path_equal(&nd->path, &nd->root))
+		if (path_equal(&nd->path, &nd->root)) {
+			if (unlikely(nd->flags & LOOKUP_BENEATH))
+				return -EXDEV;
 			break;
+		}
 		if (nd->path.dentry != nd->path.mnt->mnt_root) {
 			int ret = path_parent_directory(&nd->path);
 			if (ret)
@@ -1491,6 +1522,8 @@ static int follow_dotdot(struct nameidata *nd)
 		}
 		if (!follow_up(&nd->path))
 			break;
+		if (unlikely(nd->flags & LOOKUP_XDEV))
+			return -EXDEV;
 	}
 	follow_mount(&nd->path);
 	nd->inode = nd->path.dentry->d_inode;
@@ -1705,6 +1738,12 @@ static inline int may_lookup(struct nameidata *nd)
 static inline int handle_dots(struct nameidata *nd, int type)
 {
 	if (type == LAST_DOTDOT) {
+		/*
+		 * AT_BENEATH resolving ".." is not currently safe -- races can cause
+		 * our parent to have moved outside of the root and us to skip over it.
+		 */
+		if (unlikely(nd->flags & LOOKUP_BENEATH))
+			return -EXDEV;
 		if (!nd->root.mnt)
 			set_root(nd);
 		if (nd->flags & LOOKUP_RCU) {
@@ -1720,6 +1759,8 @@ static int pick_link(struct nameidata *nd, struct path *link,
 {
 	int error;
 	struct saved *last;
+	if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS))
+		return -ELOOP;
 	if (unlikely(nd->total_link_count++ >= MAXSYMLINKS)) {
 		path_to_nameidata(link, nd);
 		return -ELOOP;
@@ -2168,13 +2209,70 @@ static int link_path_walk(const char *name, struct nameidata *nd)
 	}
 }
 
+/*
+ * Configure nd->path based on the nd->dfd. This is only used as part of
+ * path_init().
+ */
+static inline int dirfd_path_init(struct nameidata *nd)
+{
+	if (nd->dfd == AT_FDCWD) {
+		if (nd->flags & LOOKUP_RCU) {
+			struct fs_struct *fs = current->fs;
+			unsigned seq;
+
+			do {
+				seq = read_seqcount_begin(&fs->seq);
+				nd->path = fs->pwd;
+				nd->inode = nd->path.dentry->d_inode;
+				nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
+			} while (read_seqcount_retry(&fs->seq, seq));
+		} else {
+			get_fs_pwd(current->fs, &nd->path);
+			nd->inode = nd->path.dentry->d_inode;
+		}
+	} else {
+		/* Caller must check execute permissions on the starting path component */
+		struct fd f = fdget_raw(nd->dfd);
+		struct dentry *dentry;
+
+		if (!f.file)
+			return -EBADF;
+
+		dentry = f.file->f_path.dentry;
+
+		if (*nd->name->name && unlikely(!d_can_lookup(dentry))) {
+			fdput(f);
+			return -ENOTDIR;
+		}
+
+		nd->path = f.file->f_path;
+		if (nd->flags & LOOKUP_RCU) {
+			nd->inode = nd->path.dentry->d_inode;
+			nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
+		} else {
+			path_get(&nd->path);
+			nd->inode = nd->path.dentry->d_inode;
+		}
+		fdput(f);
+	}
+	if (unlikely(nd->flags & LOOKUP_BENEATH)) {
+		nd->root = nd->path;
+		if (!(nd->flags & LOOKUP_RCU))
+			path_get(&nd->root);
+	}
+	return 0;
+}
+
 /* must be paired with terminate_walk() */
 static const char *path_init(struct nameidata *nd, unsigned flags)
 {
+	int error;
 	const char *s = nd->name->name;
 
 	if (!*s)
 		flags &= ~LOOKUP_RCU;
+	if (flags & LOOKUP_NO_SYMLINKS)
+		flags |= LOOKUP_NO_PROCLINKS;
 	if (flags & LOOKUP_RCU)
 		rcu_read_lock();
 
@@ -2203,53 +2301,25 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 	nd->path.dentry = NULL;
 
 	nd->m_seq = read_seqbegin(&mount_lock);
+	if (unlikely(flags & LOOKUP_XDEV)) {
+		error = dirfd_path_init(nd);
+		if (unlikely(error))
+			return ERR_PTR(error);
+	}
 	if (*s == '/') {
-		set_root(nd);
-		if (likely(!nd_jump_root(nd)))
-			return s;
-		return ERR_PTR(-ECHILD);
-	} else if (nd->dfd == AT_FDCWD) {
-		if (flags & LOOKUP_RCU) {
-			struct fs_struct *fs = current->fs;
-			unsigned seq;
-
-			do {
-				seq = read_seqcount_begin(&fs->seq);
-				nd->path = fs->pwd;
-				nd->inode = nd->path.dentry->d_inode;
-				nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
-			} while (read_seqcount_retry(&fs->seq, seq));
-		} else {
-			get_fs_pwd(current->fs, &nd->path);
-			nd->inode = nd->path.dentry->d_inode;
-		}
-		return s;
-	} else {
-		/* Caller must check execute permissions on the starting path component */
-		struct fd f = fdget_raw(nd->dfd);
-		struct dentry *dentry;
-
-		if (!f.file)
-			return ERR_PTR(-EBADF);
-
-		dentry = f.file->f_path.dentry;
-
-		if (*s && unlikely(!d_can_lookup(dentry))) {
-			fdput(f);
-			return ERR_PTR(-ENOTDIR);
-		}
-
-		nd->path = f.file->f_path;
-		if (flags & LOOKUP_RCU) {
-			nd->inode = nd->path.dentry->d_inode;
-			nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
-		} else {
-			path_get(&nd->path);
-			nd->inode = nd->path.dentry->d_inode;
-		}
-		fdput(f);
+		if (likely(!nd->root.mnt))
+			set_root(nd);
+		error = nd_jump_root(nd);
+		if (unlikely(error))
+			s = ERR_PTR(error);
 		return s;
 	}
+	if (likely(!nd->path.mnt)) {
+		error = dirfd_path_init(nd);
+		if (unlikely(error))
+			return ERR_PTR(error);
+	}
+	return s;
 }
 
 static const char *trailing_symlink(struct nameidata *nd)
diff --git a/fs/open.c b/fs/open.c
index 0285ce7dbd51..80f5f566a5ff 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -988,6 +988,14 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
 		lookup_flags |= LOOKUP_DIRECTORY;
 	if (!(flags & O_NOFOLLOW))
 		lookup_flags |= LOOKUP_FOLLOW;
+	if (flags & O_BENEATH)
+		lookup_flags |= LOOKUP_BENEATH;
+	if (flags & O_XDEV)
+		lookup_flags |= LOOKUP_XDEV;
+	if (flags & O_NOPROCLINKS)
+		lookup_flags |= LOOKUP_NO_PROCLINKS;
+	if (flags & O_NOSYMLINKS)
+		lookup_flags |= LOOKUP_NO_SYMLINKS;
 	op->lookup_flags = lookup_flags;
 	return 0;
 }
diff --git a/fs/stat.c b/fs/stat.c
index f8e6fb2c3657..d319a468c704 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -170,8 +170,8 @@ int vfs_statx(int dfd, const char __user *filename, int flags,
 	int error = -EINVAL;
 	unsigned int lookup_flags = LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT;
 
-	if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT |
-		       AT_EMPTY_PATH | KSTAT_QUERY_FLAGS)) != 0)
+	if (flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT | AT_EMPTY_PATH |
+		      KSTAT_QUERY_FLAGS))
 		return -EINVAL;
 
 	if (flags & AT_SYMLINK_NOFOLLOW)
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index 27dc7a60693e..ad5bba4b5b12 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -9,7 +9,8 @@
 	(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
 	 O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
 	 FASYNC	| O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
-	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
+	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_BENEATH | O_XDEV | \
+	 O_NOPROCLINKS | O_NOSYMLINKS)
 
 #ifndef force_o_largefile
 #define force_o_largefile() (BITS_PER_LONG != 32)
diff --git a/include/linux/namei.h b/include/linux/namei.h
index a78606e8e3df..5ff7f3362d1b 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -47,6 +47,13 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 #define LOOKUP_EMPTY		0x4000
 #define LOOKUP_DOWN		0x8000
 
+/* Scoping flags for lookup. */
+#define LOOKUP_BENEATH		0x010000 /* No escaping from starting point. */
+#define LOOKUP_XDEV		0x020000 /* No mountpoint crossing. */
+#define LOOKUP_NO_PROCLINKS	0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */
+#define LOOKUP_NO_SYMLINKS	0x080000 /* No symlink crossing *at all*.
+					    Implies LOOKUP_NO_PROCLINKS. */
+
 extern int path_pts(struct path *path);
 
 extern int user_path_at_empty(int, const char __user *, unsigned, struct path *, int *empty);
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 9dc0bf0c5a6e..c2bf5983e46a 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -97,6 +97,23 @@
 #define O_NDELAY	O_NONBLOCK
 #endif
 
+/*
+ * These are identical to their AT_* counterparts (which affect the entireity
+ * of path resolution).
+ */
+#ifndef O_BENEATH
+#define O_BENEATH	00040000000 /* *Not* the same as capsicum's O_BENEATH! */
+#endif
+#ifndef O_XDEV
+#define O_XDEV		00100000000
+#endif
+#ifndef O_NOPROCLINKS
+#define O_NOPROCLINKS	00200000000
+#endif
+#ifndef O_NOSYMLINKS
+#define O_NOSYMLINKS	01000000000
+#endif
+
 #define F_DUPFD		0	/* dup */
 #define F_GETFD		1	/* get close_on_exec */
 #define F_SETFD		2	/* set/clear close_on_exec */
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 594b85f7cb86..551a9e2166a8 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -92,5 +92,13 @@
 
 #define AT_RECURSIVE		0x8000	/* Apply to the entire subtree */
 
+/* Flags which affect path *resolution*, not just last-component handling. */
+#define AT_BENEATH		0x10000	/* No absolute paths or ".." escaping
+					   (in-path or through symlinks) */
+#define AT_XDEV			0x20000 /* No mountpoint crossing. */
+#define AT_NO_PROCLINKS		0x40000 /* No /proc/$pid/fd/... "symlinks". */
+#define AT_NO_SYMLINKS		0x80000 /* No symlinks *at all*.
+					   Implies AT_NO_PROCLINKS. */
+
 
 #endif /* _UAPI_LINUX_FCNTL_H */
-- 
2.19.0

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v3 2/3] namei: implement AT_THIS_ROOT chroot-like path resolution
  2018-10-09  7:02 [PATCH v3 0/3] namei: implement various lookup restriction AT_* flags Aleksa Sarai
  2018-10-09  7:02 ` Aleksa Sarai
  2018-10-09  7:02 ` [PATCH v3 1/3] namei: implement O_BENEATH-style " Aleksa Sarai
@ 2018-10-09  7:02 ` Aleksa Sarai
  2018-10-09  7:02   ` Aleksa Sarai
  2018-10-09  7:02 ` [PATCH v3 3/3] namei: aggressively check for nd->root escape on ".." resolution Aleksa Sarai
  2018-10-17 15:23 ` [PATCH v3 0/3] namei: implement various lookup restriction AT_* flags Aleksa Sarai
  4 siblings, 1 reply; 32+ messages in thread
From: Aleksa Sarai @ 2018-10-09  7:02 UTC (permalink / raw)
  To: Al Viro, Eric Biederman
  Cc: Aleksa Sarai, Christian Brauner, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, Andy Lutomirski, David Howells, Jann Horn,
	Tycho Andersen, David Drysdale, dev, containers, linux-fsdevel,
	linux-kernel, linux-arch, linux-api

The primary motivation for the need for this flag is container runtimes
which have to interact with malicious root filesystems in the host
namespaces. One of the first requirements for a container runtime to be
secure against a malicious rootfs is that they correctly scope symlinks
(that is, they should be scoped as though they are chroot(2)ed into the
container's rootfs) and ".."-style paths[*]. The already-existing
AT_XDEV and AT_NO_PROCLINKS[**] help defend against other potential
attacks in a malicious rootfs scenario.

Currently most container runtimes try to do this resolution in
userspace[1], causing many potential race conditions. In addition, the
"obvious" alternative (actually performing a {ch,pivot_}root(2))
requires a fork+exec (for some runtimes) which is *very* costly if
necessary for every filesystem operation involving a container.

[*] At the moment, ".." and "proclink" jumping are disallowed for the
    same reason it is disabled for AT_BENEATH -- currently it is not
    safe to allow it. Future patches may enable it unconditionally once
    we have resolved the possible races (for "..") and semantics (for
    "proclink" jumping).

The most significant openat(2) semantic change with AT_THIS_ROOT is that
absolute pathnames no longer cause dirfd to be ignored completely. The
rationale is that AT_THIS_ROOT must necessarily chroot-scope symlinks
with absolute paths to dirfd, and so doing it for the base path seems to
be the most consistent behaviour (and also avoids foot-gunning users who
want to scope paths that are absolute).

Currently this is only enabled for openat(2) (which has its own flag
O_THISROOT with the same semantics). However the AT_* flags have been
reserved for future support in other *at(2) syscalls (because of
AT_EMPTY_PATH many *at(2) operations do not need to support these flags
directly).

[1]: https://github.com/cyphar/filepath-securejoin

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Christian Brauner <christian@brauner.io>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/fcntl.c                       | 2 +-
 fs/namei.c                       | 8 ++++----
 fs/open.c                        | 2 ++
 include/linux/fcntl.h            | 2 +-
 include/linux/namei.h            | 1 +
 include/uapi/asm-generic/fcntl.h | 3 +++
 include/uapi/linux/fcntl.h       | 2 ++
 7 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index e343618736f7..4c36c5b9fdb9 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1031,7 +1031,7 @@ static int __init fcntl_init(void)
 	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 	 * is defined as O_NONBLOCK on some platforms and not on others.
 	 */
-	BUILD_BUG_ON(25 - 1 /* for O_RDONLY being 0 */ !=
+	BUILD_BUG_ON(26 - 1 /* for O_RDONLY being 0 */ !=
 		HWEIGHT32(
 			(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
 			__FMODE_EXEC | __FMODE_NONOTIFY));
diff --git a/fs/namei.c b/fs/namei.c
index 76eacd3af89b..b31aef27df22 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1094,7 +1094,7 @@ const char *get_link(struct nameidata *nd)
 			if (unlikely(nd->flags & LOOKUP_NO_PROCLINKS))
 				return ERR_PTR(-ELOOP);
 			/* Not currently safe. */
-			if (unlikely(nd->flags & LOOKUP_BENEATH))
+			if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT)))
 				return ERR_PTR(-EXDEV);
 		}
 		if (IS_ERR_OR_NULL(res))
@@ -1742,7 +1742,7 @@ static inline int handle_dots(struct nameidata *nd, int type)
 		 * AT_BENEATH resolving ".." is not currently safe -- races can cause
 		 * our parent to have moved outside of the root and us to skip over it.
 		 */
-		if (unlikely(nd->flags & LOOKUP_BENEATH))
+		if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT)))
 			return -EXDEV;
 		if (!nd->root.mnt)
 			set_root(nd);
@@ -2255,7 +2255,7 @@ static inline int dirfd_path_init(struct nameidata *nd)
 		}
 		fdput(f);
 	}
-	if (unlikely(nd->flags & LOOKUP_BENEATH)) {
+	if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT))) {
 		nd->root = nd->path;
 		if (!(nd->flags & LOOKUP_RCU))
 			path_get(&nd->root);
@@ -2301,7 +2301,7 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 	nd->path.dentry = NULL;
 
 	nd->m_seq = read_seqbegin(&mount_lock);
-	if (unlikely(flags & LOOKUP_XDEV)) {
+	if (unlikely(flags & (LOOKUP_CHROOT | LOOKUP_XDEV))) {
 		error = dirfd_path_init(nd);
 		if (unlikely(error))
 			return ERR_PTR(error);
diff --git a/fs/open.c b/fs/open.c
index 80f5f566a5ff..81d148f626cd 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -996,6 +996,8 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
 		lookup_flags |= LOOKUP_NO_PROCLINKS;
 	if (flags & O_NOSYMLINKS)
 		lookup_flags |= LOOKUP_NO_SYMLINKS;
+	if (flags & O_THISROOT)
+		lookup_flags |= LOOKUP_CHROOT;
 	op->lookup_flags = lookup_flags;
 	return 0;
 }
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index ad5bba4b5b12..95480cd4c09d 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -10,7 +10,7 @@
 	 O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
 	 FASYNC	| O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
 	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_BENEATH | O_XDEV | \
-	 O_NOPROCLINKS | O_NOSYMLINKS)
+	 O_NOPROCLINKS | O_NOSYMLINKS | O_THISROOT)
 
 #ifndef force_o_largefile
 #define force_o_largefile() (BITS_PER_LONG != 32)
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 5ff7f3362d1b..7ec9e2d84649 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -53,6 +53,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 #define LOOKUP_NO_PROCLINKS	0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */
 #define LOOKUP_NO_SYMLINKS	0x080000 /* No symlink crossing *at all*.
 					    Implies LOOKUP_NO_PROCLINKS. */
+#define LOOKUP_CHROOT		0x100000 /* Treat dirfd as %current->fs->root. */
 
 extern int path_pts(struct path *path);
 
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index c2bf5983e46a..11206b0e927c 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -113,6 +113,9 @@
 #ifndef O_NOSYMLINKS
 #define O_NOSYMLINKS	01000000000
 #endif
+#ifndef O_THISROOT
+#define O_THISROOT	02000000000
+#endif
 
 #define F_DUPFD		0	/* dup */
 #define F_GETFD		1	/* get close_on_exec */
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 551a9e2166a8..ea978457b68f 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -99,6 +99,8 @@
 #define AT_NO_PROCLINKS		0x40000 /* No /proc/$pid/fd/... "symlinks". */
 #define AT_NO_SYMLINKS		0x80000 /* No symlinks *at all*.
 					   Implies AT_NO_PROCLINKS. */
+#define AT_THIS_ROOT		0x100000 /* Path resolution acts as though
+					   it is chroot-ed into dirfd. */
 
 
 #endif /* _UAPI_LINUX_FCNTL_H */
-- 
2.19.0

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v3 2/3] namei: implement AT_THIS_ROOT chroot-like path resolution
  2018-10-09  7:02 ` [PATCH v3 2/3] namei: implement AT_THIS_ROOT chroot-like path resolution Aleksa Sarai
@ 2018-10-09  7:02   ` Aleksa Sarai
  0 siblings, 0 replies; 32+ messages in thread
From: Aleksa Sarai @ 2018-10-09  7:02 UTC (permalink / raw)
  To: Al Viro, Eric Biederman
  Cc: Aleksa Sarai, Christian Brauner, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, Andy Lutomirski, David Howells, Jann Horn,
	Tycho Andersen, David Drysdale, dev, containers, linux-fsdevel,
	linux-kernel, linux-arch, linux-api

The primary motivation for the need for this flag is container runtimes
which have to interact with malicious root filesystems in the host
namespaces. One of the first requirements for a container runtime to be
secure against a malicious rootfs is that they correctly scope symlinks
(that is, they should be scoped as though they are chroot(2)ed into the
container's rootfs) and ".."-style paths[*]. The already-existing
AT_XDEV and AT_NO_PROCLINKS[**] help defend against other potential
attacks in a malicious rootfs scenario.

Currently most container runtimes try to do this resolution in
userspace[1], causing many potential race conditions. In addition, the
"obvious" alternative (actually performing a {ch,pivot_}root(2))
requires a fork+exec (for some runtimes) which is *very* costly if
necessary for every filesystem operation involving a container.

[*] At the moment, ".." and "proclink" jumping are disallowed for the
    same reason it is disabled for AT_BENEATH -- currently it is not
    safe to allow it. Future patches may enable it unconditionally once
    we have resolved the possible races (for "..") and semantics (for
    "proclink" jumping).

The most significant openat(2) semantic change with AT_THIS_ROOT is that
absolute pathnames no longer cause dirfd to be ignored completely. The
rationale is that AT_THIS_ROOT must necessarily chroot-scope symlinks
with absolute paths to dirfd, and so doing it for the base path seems to
be the most consistent behaviour (and also avoids foot-gunning users who
want to scope paths that are absolute).

Currently this is only enabled for openat(2) (which has its own flag
O_THISROOT with the same semantics). However the AT_* flags have been
reserved for future support in other *at(2) syscalls (because of
AT_EMPTY_PATH many *at(2) operations do not need to support these flags
directly).

[1]: https://github.com/cyphar/filepath-securejoin

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Christian Brauner <christian@brauner.io>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/fcntl.c                       | 2 +-
 fs/namei.c                       | 8 ++++----
 fs/open.c                        | 2 ++
 include/linux/fcntl.h            | 2 +-
 include/linux/namei.h            | 1 +
 include/uapi/asm-generic/fcntl.h | 3 +++
 include/uapi/linux/fcntl.h       | 2 ++
 7 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index e343618736f7..4c36c5b9fdb9 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1031,7 +1031,7 @@ static int __init fcntl_init(void)
 	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 	 * is defined as O_NONBLOCK on some platforms and not on others.
 	 */
-	BUILD_BUG_ON(25 - 1 /* for O_RDONLY being 0 */ !=
+	BUILD_BUG_ON(26 - 1 /* for O_RDONLY being 0 */ !=
 		HWEIGHT32(
 			(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
 			__FMODE_EXEC | __FMODE_NONOTIFY));
diff --git a/fs/namei.c b/fs/namei.c
index 76eacd3af89b..b31aef27df22 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1094,7 +1094,7 @@ const char *get_link(struct nameidata *nd)
 			if (unlikely(nd->flags & LOOKUP_NO_PROCLINKS))
 				return ERR_PTR(-ELOOP);
 			/* Not currently safe. */
-			if (unlikely(nd->flags & LOOKUP_BENEATH))
+			if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT)))
 				return ERR_PTR(-EXDEV);
 		}
 		if (IS_ERR_OR_NULL(res))
@@ -1742,7 +1742,7 @@ static inline int handle_dots(struct nameidata *nd, int type)
 		 * AT_BENEATH resolving ".." is not currently safe -- races can cause
 		 * our parent to have moved outside of the root and us to skip over it.
 		 */
-		if (unlikely(nd->flags & LOOKUP_BENEATH))
+		if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT)))
 			return -EXDEV;
 		if (!nd->root.mnt)
 			set_root(nd);
@@ -2255,7 +2255,7 @@ static inline int dirfd_path_init(struct nameidata *nd)
 		}
 		fdput(f);
 	}
-	if (unlikely(nd->flags & LOOKUP_BENEATH)) {
+	if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT))) {
 		nd->root = nd->path;
 		if (!(nd->flags & LOOKUP_RCU))
 			path_get(&nd->root);
@@ -2301,7 +2301,7 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 	nd->path.dentry = NULL;
 
 	nd->m_seq = read_seqbegin(&mount_lock);
-	if (unlikely(flags & LOOKUP_XDEV)) {
+	if (unlikely(flags & (LOOKUP_CHROOT | LOOKUP_XDEV))) {
 		error = dirfd_path_init(nd);
 		if (unlikely(error))
 			return ERR_PTR(error);
diff --git a/fs/open.c b/fs/open.c
index 80f5f566a5ff..81d148f626cd 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -996,6 +996,8 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
 		lookup_flags |= LOOKUP_NO_PROCLINKS;
 	if (flags & O_NOSYMLINKS)
 		lookup_flags |= LOOKUP_NO_SYMLINKS;
+	if (flags & O_THISROOT)
+		lookup_flags |= LOOKUP_CHROOT;
 	op->lookup_flags = lookup_flags;
 	return 0;
 }
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index ad5bba4b5b12..95480cd4c09d 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -10,7 +10,7 @@
 	 O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
 	 FASYNC	| O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
 	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_BENEATH | O_XDEV | \
-	 O_NOPROCLINKS | O_NOSYMLINKS)
+	 O_NOPROCLINKS | O_NOSYMLINKS | O_THISROOT)
 
 #ifndef force_o_largefile
 #define force_o_largefile() (BITS_PER_LONG != 32)
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 5ff7f3362d1b..7ec9e2d84649 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -53,6 +53,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 #define LOOKUP_NO_PROCLINKS	0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */
 #define LOOKUP_NO_SYMLINKS	0x080000 /* No symlink crossing *at all*.
 					    Implies LOOKUP_NO_PROCLINKS. */
+#define LOOKUP_CHROOT		0x100000 /* Treat dirfd as %current->fs->root. */
 
 extern int path_pts(struct path *path);
 
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index c2bf5983e46a..11206b0e927c 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -113,6 +113,9 @@
 #ifndef O_NOSYMLINKS
 #define O_NOSYMLINKS	01000000000
 #endif
+#ifndef O_THISROOT
+#define O_THISROOT	02000000000
+#endif
 
 #define F_DUPFD		0	/* dup */
 #define F_GETFD		1	/* get close_on_exec */
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 551a9e2166a8..ea978457b68f 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -99,6 +99,8 @@
 #define AT_NO_PROCLINKS		0x40000 /* No /proc/$pid/fd/... "symlinks". */
 #define AT_NO_SYMLINKS		0x80000 /* No symlinks *at all*.
 					   Implies AT_NO_PROCLINKS. */
+#define AT_THIS_ROOT		0x100000 /* Path resolution acts as though
+					   it is chroot-ed into dirfd. */
 
 
 #endif /* _UAPI_LINUX_FCNTL_H */
-- 
2.19.0

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v3 3/3] namei: aggressively check for nd->root escape on ".." resolution
  2018-10-09  7:02 [PATCH v3 0/3] namei: implement various lookup restriction AT_* flags Aleksa Sarai
                   ` (2 preceding siblings ...)
  2018-10-09  7:02 ` [PATCH v3 2/3] namei: implement AT_THIS_ROOT chroot-like path resolution Aleksa Sarai
@ 2018-10-09  7:02 ` Aleksa Sarai
  2018-10-09  7:02   ` Aleksa Sarai
  2018-10-09 15:19   ` Jann Horn
  2018-10-17 15:23 ` [PATCH v3 0/3] namei: implement various lookup restriction AT_* flags Aleksa Sarai
  4 siblings, 2 replies; 32+ messages in thread
From: Aleksa Sarai @ 2018-10-09  7:02 UTC (permalink / raw)
  To: Al Viro, Eric Biederman
  Cc: Aleksa Sarai, Jann Horn, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, Andy Lutomirski, David Howells, Christian Brauner,
	Tycho Andersen, David Drysdale, dev, containers, linux-fsdevel,
	linux-kernel, linux-arch, linux-api

This patch allows for AT_BENEATH and AT_THIS_ROOT to safely permit ".."
resolution (in the case of AT_BENEATH the resolution will still fail if
".." resolution would resolve a path outside of the root -- while
AT_THIS_ROOT will chroot(2)-style scope it). "proclink" jumps are still
disallowed entirely because now they could result in inconsistent
behaviour if resolution encounters a subsequent "..".

The need for this patch is explained by observing there is a fairly
easy-to-exploit race condition with chroot(2) (and thus by extension
AT_THIS_ROOT and AT_BENEATH) where a rename(2) of a path can be used to
"skip over" nd->root and thus escape to the filesystem above nd->root.

  thread1 [attacker]:
    for (;;)
      renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
  thread2 [victim]:
    for (;;)
      openat(dirb, "b/c/../../etc/shadow", O_THISROOT);

With fairly significant regularity, thread2 will resolve to
"/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar
(though somewhat more privileged) attack using MS_MOVE.

With this patch, such cases will be detected *during* ".." resolution
(which is the weak point of chroot(2) -- since walking *into* a
subdirectory tautologically cannot result in you walking *outside*
nd->root -- except through a bind-mount or "proclink"). By detecting
this at ".." resolution (rather than checking only at the end of the
entire resolution) we can both correct escapes by jumping back to the
root (in the case of AT_THIS_ROOT), as well as avoid revealing to
attackers the structure of the filesystem outside of the root (through
timing attacks for instance).

In order to avoid a quadratic lookup with each ".." entry, we only
activate the slow path if a write through &rename_lock or &mount_lock
have occurred during path resolution (&rename_lock and &mount_lock are
re-taken to further optimise the lookup). Since the primary attack being
protected against is MS_MOVE or rename(2), not doing additional checks
unless a mount or rename have occurred avoids making the common case
slow.

The use of __d_path here might seem suspect, but on further inspection
of the most important race (a path was *inside* the root but is now
*outside*), there appears to be no attack potential. If __d_path occurs
before the rename, then the path will be resolved but since the path was
originally inside the root there is no escape. Subsequent ".." jumps are
guaranteed to check __d_path reachable (by construction, &rename_lock or
&mount_lock must have been taken after __d_path returned), and thus will
not be able to escape from the previously-inside-root path. Walking down
is still safe since the entire subtree was moved (either by rename(2) or
MS_MOVE) and because (as discussed above) walking down is safe.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/namei.c | 79 +++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 61 insertions(+), 18 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index b31aef27df22..12cd8d8987ea 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -493,7 +493,7 @@ struct nameidata {
 	struct path	root;
 	struct inode	*inode; /* path.dentry.d_inode */
 	unsigned int	flags;
-	unsigned	seq, m_seq;
+	unsigned	seq, m_seq, r_seq;
 	int		last_type;
 	unsigned	depth;
 	int		total_link_count;
@@ -508,6 +508,7 @@ struct nameidata {
 	struct inode	*link_inode;
 	unsigned	root_seq;
 	int		dfd;
+	char		*dpathbuf;
 } __randomize_layout;
 
 static void set_nameidata(struct nameidata *p, int dfd, struct filename *name)
@@ -530,6 +531,7 @@ static void restore_nameidata(void)
 		old->total_link_count = now->total_link_count;
 	if (now->stack != now->internal)
 		kfree(now->stack);
+	kfree(now->dpathbuf);
 }
 
 static int __nd_alloc_stack(struct nameidata *nd)
@@ -580,6 +582,22 @@ static inline int nd_alloc_stack(struct nameidata *nd)
 	return __nd_alloc_stack(nd);
 }
 
+static inline int nd_alloc_dpathbuf(struct nameidata *nd)
+{
+	if (unlikely(!nd->dpathbuf)) {
+		if (nd->flags & LOOKUP_RCU) {
+			nd->dpathbuf = kmalloc(PATH_MAX, GFP_ATOMIC);
+			if (unlikely(!nd->dpathbuf))
+				return -ECHILD;
+		} else {
+			nd->dpathbuf = kmalloc(PATH_MAX, GFP_KERNEL);
+			if (unlikely(!nd->dpathbuf))
+				return -ENOMEM;
+		}
+	}
+	return 0;
+}
+
 static void drop_links(struct nameidata *nd)
 {
 	int i = nd->depth;
@@ -1738,18 +1756,40 @@ static inline int may_lookup(struct nameidata *nd)
 static inline int handle_dots(struct nameidata *nd, int type)
 {
 	if (type == LAST_DOTDOT) {
-		/*
-		 * AT_BENEATH resolving ".." is not currently safe -- races can cause
-		 * our parent to have moved outside of the root and us to skip over it.
-		 */
-		if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT)))
-			return -EXDEV;
+		int error = 0;
+
 		if (!nd->root.mnt)
 			set_root(nd);
-		if (nd->flags & LOOKUP_RCU) {
-			return follow_dotdot_rcu(nd);
-		} else
-			return follow_dotdot(nd);
+		if (nd->flags & LOOKUP_RCU)
+			error = follow_dotdot_rcu(nd);
+		else
+			error = follow_dotdot(nd);
+		if (error)
+			return error;
+
+		if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT))) {
+			char *pathptr;
+			bool m_retry = read_seqretry(&mount_lock, nd->m_seq);
+			bool r_retry = read_seqretry(&rename_lock, nd->r_seq);
+
+			/* Racing rename(2) or MS_MOVE? */
+			if (likely(!m_retry && !r_retry))
+				return 0;
+			if (m_retry && !(nd->flags & LOOKUP_RCU))
+				nd->m_seq = read_seqbegin(&mount_lock);
+			if (r_retry)
+				nd->r_seq = read_seqbegin(&rename_lock);
+
+			error = nd_alloc_dpathbuf(nd);
+			if (error)
+				return error;
+			pathptr = __d_path(&nd->path, &nd->root, nd->dpathbuf, PATH_MAX);
+			if (unlikely(!pathptr))
+				/* Breakout -- go back to root! */
+				return nd_jump_root(nd);
+			if (unlikely(IS_ERR(pathptr)))
+				return PTR_ERR(pathptr);
+		}
 	}
 	return 0;
 }
@@ -2279,6 +2319,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 	nd->last_type = LAST_ROOT; /* if there are only slashes... */
 	nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT;
 	nd->depth = 0;
+
+	nd->m_seq = read_seqbegin(&mount_lock);
+	if (unlikely(flags & (LOOKUP_BENEATH | LOOKUP_CHROOT)))
+		nd->r_seq = read_seqbegin(&rename_lock);
+
 	if (flags & LOOKUP_ROOT) {
 		struct dentry *root = nd->root.dentry;
 		struct inode *inode = root->d_inode;
@@ -2289,7 +2334,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 		if (flags & LOOKUP_RCU) {
 			nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
 			nd->root_seq = nd->seq;
-			nd->m_seq = read_seqbegin(&mount_lock);
 		} else {
 			path_get(&nd->path);
 		}
@@ -2300,7 +2344,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 	nd->path.mnt = NULL;
 	nd->path.dentry = NULL;
 
-	nd->m_seq = read_seqbegin(&mount_lock);
 	if (unlikely(flags & (LOOKUP_CHROOT | LOOKUP_XDEV))) {
 		error = dirfd_path_init(nd);
 		if (unlikely(error))
@@ -2407,7 +2450,7 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags,
 		    struct path *path, struct path *root)
 {
 	int retval;
-	struct nameidata nd;
+	struct nameidata nd = {};
 	if (IS_ERR(name))
 		return PTR_ERR(name);
 	if (unlikely(root)) {
@@ -2450,7 +2493,7 @@ static struct filename *filename_parentat(int dfd, struct filename *name,
 				struct qstr *last, int *type)
 {
 	int retval;
-	struct nameidata nd;
+	struct nameidata nd = {};
 
 	if (IS_ERR(name))
 		return name;
@@ -2779,7 +2822,7 @@ static int
 filename_mountpoint(int dfd, struct filename *name, struct path *path,
 			unsigned int flags)
 {
-	struct nameidata nd;
+	struct nameidata nd = {};
 	int error;
 	if (IS_ERR(name))
 		return PTR_ERR(name);
@@ -3626,7 +3669,7 @@ static struct file *path_openat(struct nameidata *nd,
 struct file *do_filp_open(int dfd, struct filename *pathname,
 		const struct open_flags *op)
 {
-	struct nameidata nd;
+	struct nameidata nd = {};
 	int flags = op->lookup_flags;
 	struct file *filp;
 
@@ -3643,7 +3686,7 @@ struct file *do_filp_open(int dfd, struct filename *pathname,
 struct file *do_file_open_root(struct dentry *dentry, struct vfsmount *mnt,
 		const char *name, const struct open_flags *op)
 {
-	struct nameidata nd;
+	struct nameidata nd = {};
 	struct file *file;
 	struct filename *filename;
 	int flags = op->lookup_flags | LOOKUP_ROOT;
-- 
2.19.0

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v3 3/3] namei: aggressively check for nd->root escape on ".." resolution
  2018-10-09  7:02 ` [PATCH v3 3/3] namei: aggressively check for nd->root escape on ".." resolution Aleksa Sarai
@ 2018-10-09  7:02   ` Aleksa Sarai
  2018-10-09 15:19   ` Jann Horn
  1 sibling, 0 replies; 32+ messages in thread
From: Aleksa Sarai @ 2018-10-09  7:02 UTC (permalink / raw)
  To: Al Viro, Eric Biederman
  Cc: Aleksa Sarai, Jann Horn, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, Andy Lutomirski, David Howells, Christian Brauner,
	Tycho Andersen, David Drysdale, dev, containers, linux-fsdevel,
	linux-kernel, linux-arch, linux-api

This patch allows for AT_BENEATH and AT_THIS_ROOT to safely permit ".."
resolution (in the case of AT_BENEATH the resolution will still fail if
".." resolution would resolve a path outside of the root -- while
AT_THIS_ROOT will chroot(2)-style scope it). "proclink" jumps are still
disallowed entirely because now they could result in inconsistent
behaviour if resolution encounters a subsequent "..".

The need for this patch is explained by observing there is a fairly
easy-to-exploit race condition with chroot(2) (and thus by extension
AT_THIS_ROOT and AT_BENEATH) where a rename(2) of a path can be used to
"skip over" nd->root and thus escape to the filesystem above nd->root.

  thread1 [attacker]:
    for (;;)
      renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
  thread2 [victim]:
    for (;;)
      openat(dirb, "b/c/../../etc/shadow", O_THISROOT);

With fairly significant regularity, thread2 will resolve to
"/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar
(though somewhat more privileged) attack using MS_MOVE.

With this patch, such cases will be detected *during* ".." resolution
(which is the weak point of chroot(2) -- since walking *into* a
subdirectory tautologically cannot result in you walking *outside*
nd->root -- except through a bind-mount or "proclink"). By detecting
this at ".." resolution (rather than checking only at the end of the
entire resolution) we can both correct escapes by jumping back to the
root (in the case of AT_THIS_ROOT), as well as avoid revealing to
attackers the structure of the filesystem outside of the root (through
timing attacks for instance).

In order to avoid a quadratic lookup with each ".." entry, we only
activate the slow path if a write through &rename_lock or &mount_lock
have occurred during path resolution (&rename_lock and &mount_lock are
re-taken to further optimise the lookup). Since the primary attack being
protected against is MS_MOVE or rename(2), not doing additional checks
unless a mount or rename have occurred avoids making the common case
slow.

The use of __d_path here might seem suspect, but on further inspection
of the most important race (a path was *inside* the root but is now
*outside*), there appears to be no attack potential. If __d_path occurs
before the rename, then the path will be resolved but since the path was
originally inside the root there is no escape. Subsequent ".." jumps are
guaranteed to check __d_path reachable (by construction, &rename_lock or
&mount_lock must have been taken after __d_path returned), and thus will
not be able to escape from the previously-inside-root path. Walking down
is still safe since the entire subtree was moved (either by rename(2) or
MS_MOVE) and because (as discussed above) walking down is safe.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/namei.c | 79 +++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 61 insertions(+), 18 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index b31aef27df22..12cd8d8987ea 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -493,7 +493,7 @@ struct nameidata {
 	struct path	root;
 	struct inode	*inode; /* path.dentry.d_inode */
 	unsigned int	flags;
-	unsigned	seq, m_seq;
+	unsigned	seq, m_seq, r_seq;
 	int		last_type;
 	unsigned	depth;
 	int		total_link_count;
@@ -508,6 +508,7 @@ struct nameidata {
 	struct inode	*link_inode;
 	unsigned	root_seq;
 	int		dfd;
+	char		*dpathbuf;
 } __randomize_layout;
 
 static void set_nameidata(struct nameidata *p, int dfd, struct filename *name)
@@ -530,6 +531,7 @@ static void restore_nameidata(void)
 		old->total_link_count = now->total_link_count;
 	if (now->stack != now->internal)
 		kfree(now->stack);
+	kfree(now->dpathbuf);
 }
 
 static int __nd_alloc_stack(struct nameidata *nd)
@@ -580,6 +582,22 @@ static inline int nd_alloc_stack(struct nameidata *nd)
 	return __nd_alloc_stack(nd);
 }
 
+static inline int nd_alloc_dpathbuf(struct nameidata *nd)
+{
+	if (unlikely(!nd->dpathbuf)) {
+		if (nd->flags & LOOKUP_RCU) {
+			nd->dpathbuf = kmalloc(PATH_MAX, GFP_ATOMIC);
+			if (unlikely(!nd->dpathbuf))
+				return -ECHILD;
+		} else {
+			nd->dpathbuf = kmalloc(PATH_MAX, GFP_KERNEL);
+			if (unlikely(!nd->dpathbuf))
+				return -ENOMEM;
+		}
+	}
+	return 0;
+}
+
 static void drop_links(struct nameidata *nd)
 {
 	int i = nd->depth;
@@ -1738,18 +1756,40 @@ static inline int may_lookup(struct nameidata *nd)
 static inline int handle_dots(struct nameidata *nd, int type)
 {
 	if (type == LAST_DOTDOT) {
-		/*
-		 * AT_BENEATH resolving ".." is not currently safe -- races can cause
-		 * our parent to have moved outside of the root and us to skip over it.
-		 */
-		if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT)))
-			return -EXDEV;
+		int error = 0;
+
 		if (!nd->root.mnt)
 			set_root(nd);
-		if (nd->flags & LOOKUP_RCU) {
-			return follow_dotdot_rcu(nd);
-		} else
-			return follow_dotdot(nd);
+		if (nd->flags & LOOKUP_RCU)
+			error = follow_dotdot_rcu(nd);
+		else
+			error = follow_dotdot(nd);
+		if (error)
+			return error;
+
+		if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT))) {
+			char *pathptr;
+			bool m_retry = read_seqretry(&mount_lock, nd->m_seq);
+			bool r_retry = read_seqretry(&rename_lock, nd->r_seq);
+
+			/* Racing rename(2) or MS_MOVE? */
+			if (likely(!m_retry && !r_retry))
+				return 0;
+			if (m_retry && !(nd->flags & LOOKUP_RCU))
+				nd->m_seq = read_seqbegin(&mount_lock);
+			if (r_retry)
+				nd->r_seq = read_seqbegin(&rename_lock);
+
+			error = nd_alloc_dpathbuf(nd);
+			if (error)
+				return error;
+			pathptr = __d_path(&nd->path, &nd->root, nd->dpathbuf, PATH_MAX);
+			if (unlikely(!pathptr))
+				/* Breakout -- go back to root! */
+				return nd_jump_root(nd);
+			if (unlikely(IS_ERR(pathptr)))
+				return PTR_ERR(pathptr);
+		}
 	}
 	return 0;
 }
@@ -2279,6 +2319,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 	nd->last_type = LAST_ROOT; /* if there are only slashes... */
 	nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT;
 	nd->depth = 0;
+
+	nd->m_seq = read_seqbegin(&mount_lock);
+	if (unlikely(flags & (LOOKUP_BENEATH | LOOKUP_CHROOT)))
+		nd->r_seq = read_seqbegin(&rename_lock);
+
 	if (flags & LOOKUP_ROOT) {
 		struct dentry *root = nd->root.dentry;
 		struct inode *inode = root->d_inode;
@@ -2289,7 +2334,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 		if (flags & LOOKUP_RCU) {
 			nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
 			nd->root_seq = nd->seq;
-			nd->m_seq = read_seqbegin(&mount_lock);
 		} else {
 			path_get(&nd->path);
 		}
@@ -2300,7 +2344,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 	nd->path.mnt = NULL;
 	nd->path.dentry = NULL;
 
-	nd->m_seq = read_seqbegin(&mount_lock);
 	if (unlikely(flags & (LOOKUP_CHROOT | LOOKUP_XDEV))) {
 		error = dirfd_path_init(nd);
 		if (unlikely(error))
@@ -2407,7 +2450,7 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags,
 		    struct path *path, struct path *root)
 {
 	int retval;
-	struct nameidata nd;
+	struct nameidata nd = {};
 	if (IS_ERR(name))
 		return PTR_ERR(name);
 	if (unlikely(root)) {
@@ -2450,7 +2493,7 @@ static struct filename *filename_parentat(int dfd, struct filename *name,
 				struct qstr *last, int *type)
 {
 	int retval;
-	struct nameidata nd;
+	struct nameidata nd = {};
 
 	if (IS_ERR(name))
 		return name;
@@ -2779,7 +2822,7 @@ static int
 filename_mountpoint(int dfd, struct filename *name, struct path *path,
 			unsigned int flags)
 {
-	struct nameidata nd;
+	struct nameidata nd = {};
 	int error;
 	if (IS_ERR(name))
 		return PTR_ERR(name);
@@ -3626,7 +3669,7 @@ static struct file *path_openat(struct nameidata *nd,
 struct file *do_filp_open(int dfd, struct filename *pathname,
 		const struct open_flags *op)
 {
-	struct nameidata nd;
+	struct nameidata nd = {};
 	int flags = op->lookup_flags;
 	struct file *filp;
 
@@ -3643,7 +3686,7 @@ struct file *do_filp_open(int dfd, struct filename *pathname,
 struct file *do_file_open_root(struct dentry *dentry, struct vfsmount *mnt,
 		const char *name, const struct open_flags *op)
 {
-	struct nameidata nd;
+	struct nameidata nd = {};
 	struct file *file;
 	struct filename *filename;
 	int flags = op->lookup_flags | LOOKUP_ROOT;
-- 
2.19.0

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 3/3] namei: aggressively check for nd->root escape on ".." resolution
  2018-10-09  7:02 ` [PATCH v3 3/3] namei: aggressively check for nd->root escape on ".." resolution Aleksa Sarai
  2018-10-09  7:02   ` Aleksa Sarai
@ 2018-10-09 15:19   ` Jann Horn
  2018-10-09 15:19     ` Jann Horn
  2018-10-09 15:37     ` Aleksa Sarai
  1 sibling, 2 replies; 32+ messages in thread
From: Jann Horn @ 2018-10-09 15:19 UTC (permalink / raw)
  To: cyphar
  Cc: Al Viro, Eric W. Biederman, jlayton, Bruce Fields, Arnd Bergmann,
	Andy Lutomirski, David Howells, christian, Tycho Andersen,
	David Drysdale, dev, containers, linux-fsdevel, kernel list,
	linux-arch, Linux API

On Tue, Oct 9, 2018 at 9:03 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
> This patch allows for AT_BENEATH and AT_THIS_ROOT to safely permit ".."
> resolution (in the case of AT_BENEATH the resolution will still fail if
> ".." resolution would resolve a path outside of the root -- while
> AT_THIS_ROOT will chroot(2)-style scope it). "proclink" jumps are still
> disallowed entirely because now they could result in inconsistent
> behaviour if resolution encounters a subsequent "..".
>
> The need for this patch is explained by observing there is a fairly
> easy-to-exploit race condition with chroot(2) (and thus by extension
> AT_THIS_ROOT and AT_BENEATH) where a rename(2) of a path can be used to
> "skip over" nd->root and thus escape to the filesystem above nd->root.
>
>   thread1 [attacker]:
>     for (;;)
>       renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
>   thread2 [victim]:
>     for (;;)
>       openat(dirb, "b/c/../../etc/shadow", O_THISROOT);
>
> With fairly significant regularity, thread2 will resolve to
> "/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar
> (though somewhat more privileged) attack using MS_MOVE.
>
> With this patch, such cases will be detected *during* ".." resolution
> (which is the weak point of chroot(2) -- since walking *into* a
> subdirectory tautologically cannot result in you walking *outside*
> nd->root -- except through a bind-mount or "proclink"). By detecting
> this at ".." resolution (rather than checking only at the end of the
> entire resolution) we can both correct escapes by jumping back to the
> root (in the case of AT_THIS_ROOT), as well as avoid revealing to
> attackers the structure of the filesystem outside of the root (through
> timing attacks for instance).
>
> In order to avoid a quadratic lookup with each ".." entry, we only
> activate the slow path if a write through &rename_lock or &mount_lock
> have occurred during path resolution (&rename_lock and &mount_lock are
> re-taken to further optimise the lookup). Since the primary attack being
> protected against is MS_MOVE or rename(2), not doing additional checks
> unless a mount or rename have occurred avoids making the common case
> slow.
>
> The use of __d_path here might seem suspect, but on further inspection
> of the most important race (a path was *inside* the root but is now
> *outside*), there appears to be no attack potential. If __d_path occurs
> before the rename, then the path will be resolved but since the path was
> originally inside the root there is no escape. Subsequent ".." jumps are
> guaranteed to check __d_path reachable (by construction, &rename_lock or
> &mount_lock must have been taken after __d_path returned),

"after"? Don't you mean "before"? Otherwise I don't understand what
you're saying here.

> and thus will
> not be able to escape from the previously-inside-root path. Walking down
> is still safe since the entire subtree was moved (either by rename(2) or
> MS_MOVE) and because (as discussed above) walking down is safe.
>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Jann Horn <jannh@google.com>
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> ---
>  fs/namei.c | 79 +++++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 61 insertions(+), 18 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index b31aef27df22..12cd8d8987ea 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
[...]
> +static inline int nd_alloc_dpathbuf(struct nameidata *nd)
> +{
> +       if (unlikely(!nd->dpathbuf)) {
> +               if (nd->flags & LOOKUP_RCU) {
> +                       nd->dpathbuf = kmalloc(PATH_MAX, GFP_ATOMIC);
> +                       if (unlikely(!nd->dpathbuf))
> +                               return -ECHILD;
> +               } else {
> +                       nd->dpathbuf = kmalloc(PATH_MAX, GFP_KERNEL);
> +                       if (unlikely(!nd->dpathbuf))
> +                               return -ENOMEM;
> +               }
> +       }
> +       return 0;
> +}

Note that a fixed-size path buffer means that if the path is very
long, e.g. because you followed long symlinks on the way down, this
can cause lookup failures.

>  static void drop_links(struct nameidata *nd)
>  {
>         int i = nd->depth;
> @@ -1738,18 +1756,40 @@ static inline int may_lookup(struct nameidata *nd)
>  static inline int handle_dots(struct nameidata *nd, int type)
>  {
>         if (type == LAST_DOTDOT) {
> -               /*
> -                * AT_BENEATH resolving ".." is not currently safe -- races can cause
> -                * our parent to have moved outside of the root and us to skip over it.
> -                */
> -               if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT)))
> -                       return -EXDEV;
> +               int error = 0;
> +
>                 if (!nd->root.mnt)
>                         set_root(nd);
> -               if (nd->flags & LOOKUP_RCU) {
> -                       return follow_dotdot_rcu(nd);
> -               } else
> -                       return follow_dotdot(nd);
> +               if (nd->flags & LOOKUP_RCU)
> +                       error = follow_dotdot_rcu(nd);
> +               else
> +                       error = follow_dotdot(nd);
> +               if (error)
> +                       return error;
> +
> +               if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT))) {
> +                       char *pathptr;
> +                       bool m_retry = read_seqretry(&mount_lock, nd->m_seq);
> +                       bool r_retry = read_seqretry(&rename_lock, nd->r_seq);
> +
> +                       /* Racing rename(2) or MS_MOVE? */
> +                       if (likely(!m_retry && !r_retry))
> +                               return 0;
> +                       if (m_retry && !(nd->flags & LOOKUP_RCU))
> +                               nd->m_seq = read_seqbegin(&mount_lock);
> +                       if (r_retry)
> +                               nd->r_seq = read_seqbegin(&rename_lock);
> +
> +                       error = nd_alloc_dpathbuf(nd);
> +                       if (error)
> +                               return error;
> +                       pathptr = __d_path(&nd->path, &nd->root, nd->dpathbuf, PATH_MAX);
> +                       if (unlikely(!pathptr))
> +                               /* Breakout -- go back to root! */
> +                               return nd_jump_root(nd);

I find the semantics of this check odd - especially in the
LOOKUP_BENEATH case, but also in the LOOKUP_CHROOT case. Wouldn't it
make more sense to just throw an error here? Making /.. go back to the
root is one thing, but making ".." from anything that has escaped from
the root go back to the root seems less logical to me.

Imagine, for example:

Thread A (a webserver or whatever) looks up
"example.org/images/foo/../bar.png" under "/var/www" with
LOOKUP_BENEATH.
Thread B concurrently moves "/var/www/example.org/images" to
"/var/backup/example.org/images".
Now thread A can accidentally resolve its path to "/var/www/bar.png"
if the race happens the wrong way?

> +                       if (unlikely(IS_ERR(pathptr)))
> +                               return PTR_ERR(pathptr);
> +               }
>         }
>         return 0;
>  }

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 3/3] namei: aggressively check for nd->root escape on ".." resolution
  2018-10-09 15:19   ` Jann Horn
@ 2018-10-09 15:19     ` Jann Horn
  2018-10-09 15:37     ` Aleksa Sarai
  1 sibling, 0 replies; 32+ messages in thread
From: Jann Horn @ 2018-10-09 15:19 UTC (permalink / raw)
  To: cyphar
  Cc: Al Viro, Eric W. Biederman, jlayton, Bruce Fields, Arnd Bergmann,
	Andy Lutomirski, David Howells, christian, Tycho Andersen,
	David Drysdale, dev, containers, linux-fsdevel, kernel list,
	linux-arch, Linux API

On Tue, Oct 9, 2018 at 9:03 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
> This patch allows for AT_BENEATH and AT_THIS_ROOT to safely permit ".."
> resolution (in the case of AT_BENEATH the resolution will still fail if
> ".." resolution would resolve a path outside of the root -- while
> AT_THIS_ROOT will chroot(2)-style scope it). "proclink" jumps are still
> disallowed entirely because now they could result in inconsistent
> behaviour if resolution encounters a subsequent "..".
>
> The need for this patch is explained by observing there is a fairly
> easy-to-exploit race condition with chroot(2) (and thus by extension
> AT_THIS_ROOT and AT_BENEATH) where a rename(2) of a path can be used to
> "skip over" nd->root and thus escape to the filesystem above nd->root.
>
>   thread1 [attacker]:
>     for (;;)
>       renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
>   thread2 [victim]:
>     for (;;)
>       openat(dirb, "b/c/../../etc/shadow", O_THISROOT);
>
> With fairly significant regularity, thread2 will resolve to
> "/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar
> (though somewhat more privileged) attack using MS_MOVE.
>
> With this patch, such cases will be detected *during* ".." resolution
> (which is the weak point of chroot(2) -- since walking *into* a
> subdirectory tautologically cannot result in you walking *outside*
> nd->root -- except through a bind-mount or "proclink"). By detecting
> this at ".." resolution (rather than checking only at the end of the
> entire resolution) we can both correct escapes by jumping back to the
> root (in the case of AT_THIS_ROOT), as well as avoid revealing to
> attackers the structure of the filesystem outside of the root (through
> timing attacks for instance).
>
> In order to avoid a quadratic lookup with each ".." entry, we only
> activate the slow path if a write through &rename_lock or &mount_lock
> have occurred during path resolution (&rename_lock and &mount_lock are
> re-taken to further optimise the lookup). Since the primary attack being
> protected against is MS_MOVE or rename(2), not doing additional checks
> unless a mount or rename have occurred avoids making the common case
> slow.
>
> The use of __d_path here might seem suspect, but on further inspection
> of the most important race (a path was *inside* the root but is now
> *outside*), there appears to be no attack potential. If __d_path occurs
> before the rename, then the path will be resolved but since the path was
> originally inside the root there is no escape. Subsequent ".." jumps are
> guaranteed to check __d_path reachable (by construction, &rename_lock or
> &mount_lock must have been taken after __d_path returned),

"after"? Don't you mean "before"? Otherwise I don't understand what
you're saying here.

> and thus will
> not be able to escape from the previously-inside-root path. Walking down
> is still safe since the entire subtree was moved (either by rename(2) or
> MS_MOVE) and because (as discussed above) walking down is safe.
>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Jann Horn <jannh@google.com>
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> ---
>  fs/namei.c | 79 +++++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 61 insertions(+), 18 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index b31aef27df22..12cd8d8987ea 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
[...]
> +static inline int nd_alloc_dpathbuf(struct nameidata *nd)
> +{
> +       if (unlikely(!nd->dpathbuf)) {
> +               if (nd->flags & LOOKUP_RCU) {
> +                       nd->dpathbuf = kmalloc(PATH_MAX, GFP_ATOMIC);
> +                       if (unlikely(!nd->dpathbuf))
> +                               return -ECHILD;
> +               } else {
> +                       nd->dpathbuf = kmalloc(PATH_MAX, GFP_KERNEL);
> +                       if (unlikely(!nd->dpathbuf))
> +                               return -ENOMEM;
> +               }
> +       }
> +       return 0;
> +}

Note that a fixed-size path buffer means that if the path is very
long, e.g. because you followed long symlinks on the way down, this
can cause lookup failures.

>  static void drop_links(struct nameidata *nd)
>  {
>         int i = nd->depth;
> @@ -1738,18 +1756,40 @@ static inline int may_lookup(struct nameidata *nd)
>  static inline int handle_dots(struct nameidata *nd, int type)
>  {
>         if (type == LAST_DOTDOT) {
> -               /*
> -                * AT_BENEATH resolving ".." is not currently safe -- races can cause
> -                * our parent to have moved outside of the root and us to skip over it.
> -                */
> -               if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT)))
> -                       return -EXDEV;
> +               int error = 0;
> +
>                 if (!nd->root.mnt)
>                         set_root(nd);
> -               if (nd->flags & LOOKUP_RCU) {
> -                       return follow_dotdot_rcu(nd);
> -               } else
> -                       return follow_dotdot(nd);
> +               if (nd->flags & LOOKUP_RCU)
> +                       error = follow_dotdot_rcu(nd);
> +               else
> +                       error = follow_dotdot(nd);
> +               if (error)
> +                       return error;
> +
> +               if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT))) {
> +                       char *pathptr;
> +                       bool m_retry = read_seqretry(&mount_lock, nd->m_seq);
> +                       bool r_retry = read_seqretry(&rename_lock, nd->r_seq);
> +
> +                       /* Racing rename(2) or MS_MOVE? */
> +                       if (likely(!m_retry && !r_retry))
> +                               return 0;
> +                       if (m_retry && !(nd->flags & LOOKUP_RCU))
> +                               nd->m_seq = read_seqbegin(&mount_lock);
> +                       if (r_retry)
> +                               nd->r_seq = read_seqbegin(&rename_lock);
> +
> +                       error = nd_alloc_dpathbuf(nd);
> +                       if (error)
> +                               return error;
> +                       pathptr = __d_path(&nd->path, &nd->root, nd->dpathbuf, PATH_MAX);
> +                       if (unlikely(!pathptr))
> +                               /* Breakout -- go back to root! */
> +                               return nd_jump_root(nd);

I find the semantics of this check odd - especially in the
LOOKUP_BENEATH case, but also in the LOOKUP_CHROOT case. Wouldn't it
make more sense to just throw an error here? Making /.. go back to the
root is one thing, but making ".." from anything that has escaped from
the root go back to the root seems less logical to me.

Imagine, for example:

Thread A (a webserver or whatever) looks up
"example.org/images/foo/../bar.png" under "/var/www" with
LOOKUP_BENEATH.
Thread B concurrently moves "/var/www/example.org/images" to
"/var/backup/example.org/images".
Now thread A can accidentally resolve its path to "/var/www/bar.png"
if the race happens the wrong way?

> +                       if (unlikely(IS_ERR(pathptr)))
> +                               return PTR_ERR(pathptr);
> +               }
>         }
>         return 0;
>  }

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 3/3] namei: aggressively check for nd->root escape on ".." resolution
  2018-10-09 15:19   ` Jann Horn
  2018-10-09 15:19     ` Jann Horn
@ 2018-10-09 15:37     ` Aleksa Sarai
  2018-10-09 15:37       ` Aleksa Sarai
                         ` (2 more replies)
  1 sibling, 3 replies; 32+ messages in thread
From: Aleksa Sarai @ 2018-10-09 15:37 UTC (permalink / raw)
  To: Jann Horn
  Cc: cyphar, Al Viro, Eric W. Biederman, jlayton, Bruce Fields,
	Arnd Bergmann, Andy Lutomirski, David Howells, christian,
	Tycho Andersen, David Drysdale, dev, containers, linux-fsdevel,
	kernel list, linux-arch, Linux API

[-- Attachment #1: Type: text/plain, Size: 6238 bytes --]

On 2018-10-09, 'Jann Horn' via dev <jannh@google.com> wrote:
> On Tue, Oct 9, 2018 at 9:03 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > This patch allows for AT_BENEATH and AT_THIS_ROOT to safely permit ".."
> > resolution (in the case of AT_BENEATH the resolution will still fail if
> > ".." resolution would resolve a path outside of the root -- while
> > AT_THIS_ROOT will chroot(2)-style scope it). "proclink" jumps are still
> > disallowed entirely because now they could result in inconsistent
> > behaviour if resolution encounters a subsequent "..".
> >
> > The need for this patch is explained by observing there is a fairly
> > easy-to-exploit race condition with chroot(2) (and thus by extension
> > AT_THIS_ROOT and AT_BENEATH) where a rename(2) of a path can be used to
> > "skip over" nd->root and thus escape to the filesystem above nd->root.
> >
> >   thread1 [attacker]:
> >     for (;;)
> >       renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
> >   thread2 [victim]:
> >     for (;;)
> >       openat(dirb, "b/c/../../etc/shadow", O_THISROOT);
> >
> > With fairly significant regularity, thread2 will resolve to
> > "/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar
> > (though somewhat more privileged) attack using MS_MOVE.
> >
> > With this patch, such cases will be detected *during* ".." resolution
> > (which is the weak point of chroot(2) -- since walking *into* a
> > subdirectory tautologically cannot result in you walking *outside*
> > nd->root -- except through a bind-mount or "proclink"). By detecting
> > this at ".." resolution (rather than checking only at the end of the
> > entire resolution) we can both correct escapes by jumping back to the
> > root (in the case of AT_THIS_ROOT), as well as avoid revealing to
> > attackers the structure of the filesystem outside of the root (through
> > timing attacks for instance).
> >
> > In order to avoid a quadratic lookup with each ".." entry, we only
> > activate the slow path if a write through &rename_lock or &mount_lock
> > have occurred during path resolution (&rename_lock and &mount_lock are
> > re-taken to further optimise the lookup). Since the primary attack being
> > protected against is MS_MOVE or rename(2), not doing additional checks
> > unless a mount or rename have occurred avoids making the common case
> > slow.
> >
> > The use of __d_path here might seem suspect, but on further inspection
> > of the most important race (a path was *inside* the root but is now
> > *outside*), there appears to be no attack potential. If __d_path occurs
> > before the rename, then the path will be resolved but since the path was
> > originally inside the root there is no escape. Subsequent ".." jumps are
> > guaranteed to check __d_path reachable (by construction, &rename_lock or
> > &mount_lock must have been taken after __d_path returned),
> 
> "after"? Don't you mean "before"? Otherwise I don't understand what
> you're saying here.

I meant that the attacker doing the rename must've taken &rename_lock
or &mount_lock after __d_path returns in the target process (because the
race being examined is that the rename occurs *after* __d_path) and thus
are guaranteed to be detected).

Maybe there's a better way to phrase what I mean...

> > +static inline int nd_alloc_dpathbuf(struct nameidata *nd)
> > +{
> > +       if (unlikely(!nd->dpathbuf)) {
> > +               if (nd->flags & LOOKUP_RCU) {
> > +                       nd->dpathbuf = kmalloc(PATH_MAX, GFP_ATOMIC);
> > +                       if (unlikely(!nd->dpathbuf))
> > +                               return -ECHILD;
> > +               } else {
> > +                       nd->dpathbuf = kmalloc(PATH_MAX, GFP_KERNEL);
> > +                       if (unlikely(!nd->dpathbuf))
> > +                               return -ENOMEM;
> > +               }
> > +       }
> > +       return 0;
> > +}
> 
> Note that a fixed-size path buffer means that if the path is very
> long, e.g. because you followed long symlinks on the way down, this
> can cause lookup failures.

This is already an issue with __d_path (even if the buffer was larger)
because it will not output a path longer than PATH_MAX. I imagine this
is a pretty strong argument for why we should refactor __d_path so that
we can *just* use the escape checking to avoid -ENAMETOOLONG.

I can work on this, but I'll wait until after LPC to see what the
discussion there brings up.

> > +                       error = nd_alloc_dpathbuf(nd);
> > +                       if (error)
> > +                               return error;
> > +                       pathptr = __d_path(&nd->path, &nd->root, nd->dpathbuf, PATH_MAX);
> > +                       if (unlikely(!pathptr))
> > +                               /* Breakout -- go back to root! */
> > +                               return nd_jump_root(nd);
> 
> I find the semantics of this check odd - especially in the
> LOOKUP_BENEATH case, but also in the LOOKUP_CHROOT case. Wouldn't it
> make more sense to just throw an error here? Making /.. go back to the
> root is one thing, but making ".." from anything that has escaped from
> the root go back to the root seems less logical to me.

For AT_BENEATH, nd_jump_root() will always return -EXDEV -- but I'll
take your point for AT_THIS_ROOT below.

> Thread A (a webserver or whatever) looks up
> "example.org/images/foo/../bar.png" under "/var/www" with
> LOOKUP_BENEATH.
> Thread B concurrently moves "/var/www/example.org/images" to
> "/var/backup/example.org/images".
> Now thread A can accidentally resolve its path to "/var/www/bar.png"
> if the race happens the wrong way?

This is a good point. When I changed this from always being -EXDEV in my
other postings, I was thinking about "/.." rather than the case you've
outlined where ".." is in the path but it's not actually meant to go to
the root.

I agree -EXDEV makes much more sense here. I will add this to my tree
(but I'll wait until after LPC before I send out a new series).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 3/3] namei: aggressively check for nd->root escape on ".." resolution
  2018-10-09 15:37     ` Aleksa Sarai
@ 2018-10-09 15:37       ` Aleksa Sarai
  2018-10-09 16:46       ` Jann Horn
  2018-10-13  8:22       ` Al Viro
  2 siblings, 0 replies; 32+ messages in thread
From: Aleksa Sarai @ 2018-10-09 15:37 UTC (permalink / raw)
  To: Jann Horn
  Cc: cyphar, Al Viro, Eric W. Biederman, jlayton, Bruce Fields,
	Arnd Bergmann, Andy Lutomirski, David Howells, christian,
	Tycho Andersen, David Drysdale, dev, containers, linux-fsdevel,
	kernel list, linux-arch, Linux API

[-- Attachment #1: Type: text/plain, Size: 6238 bytes --]

On 2018-10-09, 'Jann Horn' via dev <jannh@google.com> wrote:
> On Tue, Oct 9, 2018 at 9:03 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > This patch allows for AT_BENEATH and AT_THIS_ROOT to safely permit ".."
> > resolution (in the case of AT_BENEATH the resolution will still fail if
> > ".." resolution would resolve a path outside of the root -- while
> > AT_THIS_ROOT will chroot(2)-style scope it). "proclink" jumps are still
> > disallowed entirely because now they could result in inconsistent
> > behaviour if resolution encounters a subsequent "..".
> >
> > The need for this patch is explained by observing there is a fairly
> > easy-to-exploit race condition with chroot(2) (and thus by extension
> > AT_THIS_ROOT and AT_BENEATH) where a rename(2) of a path can be used to
> > "skip over" nd->root and thus escape to the filesystem above nd->root.
> >
> >   thread1 [attacker]:
> >     for (;;)
> >       renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
> >   thread2 [victim]:
> >     for (;;)
> >       openat(dirb, "b/c/../../etc/shadow", O_THISROOT);
> >
> > With fairly significant regularity, thread2 will resolve to
> > "/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar
> > (though somewhat more privileged) attack using MS_MOVE.
> >
> > With this patch, such cases will be detected *during* ".." resolution
> > (which is the weak point of chroot(2) -- since walking *into* a
> > subdirectory tautologically cannot result in you walking *outside*
> > nd->root -- except through a bind-mount or "proclink"). By detecting
> > this at ".." resolution (rather than checking only at the end of the
> > entire resolution) we can both correct escapes by jumping back to the
> > root (in the case of AT_THIS_ROOT), as well as avoid revealing to
> > attackers the structure of the filesystem outside of the root (through
> > timing attacks for instance).
> >
> > In order to avoid a quadratic lookup with each ".." entry, we only
> > activate the slow path if a write through &rename_lock or &mount_lock
> > have occurred during path resolution (&rename_lock and &mount_lock are
> > re-taken to further optimise the lookup). Since the primary attack being
> > protected against is MS_MOVE or rename(2), not doing additional checks
> > unless a mount or rename have occurred avoids making the common case
> > slow.
> >
> > The use of __d_path here might seem suspect, but on further inspection
> > of the most important race (a path was *inside* the root but is now
> > *outside*), there appears to be no attack potential. If __d_path occurs
> > before the rename, then the path will be resolved but since the path was
> > originally inside the root there is no escape. Subsequent ".." jumps are
> > guaranteed to check __d_path reachable (by construction, &rename_lock or
> > &mount_lock must have been taken after __d_path returned),
> 
> "after"? Don't you mean "before"? Otherwise I don't understand what
> you're saying here.

I meant that the attacker doing the rename must've taken &rename_lock
or &mount_lock after __d_path returns in the target process (because the
race being examined is that the rename occurs *after* __d_path) and thus
are guaranteed to be detected).

Maybe there's a better way to phrase what I mean...

> > +static inline int nd_alloc_dpathbuf(struct nameidata *nd)
> > +{
> > +       if (unlikely(!nd->dpathbuf)) {
> > +               if (nd->flags & LOOKUP_RCU) {
> > +                       nd->dpathbuf = kmalloc(PATH_MAX, GFP_ATOMIC);
> > +                       if (unlikely(!nd->dpathbuf))
> > +                               return -ECHILD;
> > +               } else {
> > +                       nd->dpathbuf = kmalloc(PATH_MAX, GFP_KERNEL);
> > +                       if (unlikely(!nd->dpathbuf))
> > +                               return -ENOMEM;
> > +               }
> > +       }
> > +       return 0;
> > +}
> 
> Note that a fixed-size path buffer means that if the path is very
> long, e.g. because you followed long symlinks on the way down, this
> can cause lookup failures.

This is already an issue with __d_path (even if the buffer was larger)
because it will not output a path longer than PATH_MAX. I imagine this
is a pretty strong argument for why we should refactor __d_path so that
we can *just* use the escape checking to avoid -ENAMETOOLONG.

I can work on this, but I'll wait until after LPC to see what the
discussion there brings up.

> > +                       error = nd_alloc_dpathbuf(nd);
> > +                       if (error)
> > +                               return error;
> > +                       pathptr = __d_path(&nd->path, &nd->root, nd->dpathbuf, PATH_MAX);
> > +                       if (unlikely(!pathptr))
> > +                               /* Breakout -- go back to root! */
> > +                               return nd_jump_root(nd);
> 
> I find the semantics of this check odd - especially in the
> LOOKUP_BENEATH case, but also in the LOOKUP_CHROOT case. Wouldn't it
> make more sense to just throw an error here? Making /.. go back to the
> root is one thing, but making ".." from anything that has escaped from
> the root go back to the root seems less logical to me.

For AT_BENEATH, nd_jump_root() will always return -EXDEV -- but I'll
take your point for AT_THIS_ROOT below.

> Thread A (a webserver or whatever) looks up
> "example.org/images/foo/../bar.png" under "/var/www" with
> LOOKUP_BENEATH.
> Thread B concurrently moves "/var/www/example.org/images" to
> "/var/backup/example.org/images".
> Now thread A can accidentally resolve its path to "/var/www/bar.png"
> if the race happens the wrong way?

This is a good point. When I changed this from always being -EXDEV in my
other postings, I was thinking about "/.." rather than the case you've
outlined where ".." is in the path but it's not actually meant to go to
the root.

I agree -EXDEV makes much more sense here. I will add this to my tree
(but I'll wait until after LPC before I send out a new series).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 3/3] namei: aggressively check for nd->root escape on ".." resolution
  2018-10-09 15:37     ` Aleksa Sarai
  2018-10-09 15:37       ` Aleksa Sarai
@ 2018-10-09 16:46       ` Jann Horn
  2018-10-09 16:46         ` Jann Horn
  2018-10-13  8:22       ` Al Viro
  2 siblings, 1 reply; 32+ messages in thread
From: Jann Horn @ 2018-10-09 16:46 UTC (permalink / raw)
  To: asarai
  Cc: cyphar, Al Viro, Eric W. Biederman, jlayton, Bruce Fields,
	Arnd Bergmann, Andy Lutomirski, David Howells, christian,
	Tycho Andersen, David Drysdale, dev, containers, linux-fsdevel,
	kernel list, linux-arch, Linux API

On Tue, Oct 9, 2018 at 5:36 PM Aleksa Sarai <asarai@suse.de> wrote:
> On 2018-10-09, 'Jann Horn' via dev <jannh@google.com> wrote:
> > On Tue, Oct 9, 2018 at 9:03 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > > This patch allows for AT_BENEATH and AT_THIS_ROOT to safely permit ".."
> > > resolution (in the case of AT_BENEATH the resolution will still fail if
> > > ".." resolution would resolve a path outside of the root -- while
> > > AT_THIS_ROOT will chroot(2)-style scope it). "proclink" jumps are still
> > > disallowed entirely because now they could result in inconsistent
> > > behaviour if resolution encounters a subsequent "..".
> > >
> > > The need for this patch is explained by observing there is a fairly
> > > easy-to-exploit race condition with chroot(2) (and thus by extension
> > > AT_THIS_ROOT and AT_BENEATH) where a rename(2) of a path can be used to
> > > "skip over" nd->root and thus escape to the filesystem above nd->root.
> > >
> > >   thread1 [attacker]:
> > >     for (;;)
> > >       renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
> > >   thread2 [victim]:
> > >     for (;;)
> > >       openat(dirb, "b/c/../../etc/shadow", O_THISROOT);
> > >
> > > With fairly significant regularity, thread2 will resolve to
> > > "/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar
> > > (though somewhat more privileged) attack using MS_MOVE.
> > >
> > > With this patch, such cases will be detected *during* ".." resolution
> > > (which is the weak point of chroot(2) -- since walking *into* a
> > > subdirectory tautologically cannot result in you walking *outside*
> > > nd->root -- except through a bind-mount or "proclink"). By detecting
> > > this at ".." resolution (rather than checking only at the end of the
> > > entire resolution) we can both correct escapes by jumping back to the
> > > root (in the case of AT_THIS_ROOT), as well as avoid revealing to
> > > attackers the structure of the filesystem outside of the root (through
> > > timing attacks for instance).
> > >
> > > In order to avoid a quadratic lookup with each ".." entry, we only
> > > activate the slow path if a write through &rename_lock or &mount_lock
> > > have occurred during path resolution (&rename_lock and &mount_lock are
> > > re-taken to further optimise the lookup). Since the primary attack being
> > > protected against is MS_MOVE or rename(2), not doing additional checks
> > > unless a mount or rename have occurred avoids making the common case
> > > slow.
> > >
> > > The use of __d_path here might seem suspect, but on further inspection
> > > of the most important race (a path was *inside* the root but is now
> > > *outside*), there appears to be no attack potential. If __d_path occurs
> > > before the rename, then the path will be resolved but since the path was
> > > originally inside the root there is no escape. Subsequent ".." jumps are
> > > guaranteed to check __d_path reachable (by construction, &rename_lock or
> > > &mount_lock must have been taken after __d_path returned),
> >
> > "after"? Don't you mean "before"? Otherwise I don't understand what
> > you're saying here.
>
> I meant that the attacker doing the rename must've taken &rename_lock
> or &mount_lock after __d_path returns in the target process (because the
> race being examined is that the rename occurs *after* __d_path) and thus
> are guaranteed to be detected).
>
> Maybe there's a better way to phrase what I mean...

Aah, I thought you were referring to what the victim process is doing,
not what the racing attacker is doing.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 3/3] namei: aggressively check for nd->root escape on ".." resolution
  2018-10-09 16:46       ` Jann Horn
@ 2018-10-09 16:46         ` Jann Horn
  0 siblings, 0 replies; 32+ messages in thread
From: Jann Horn @ 2018-10-09 16:46 UTC (permalink / raw)
  To: asarai
  Cc: cyphar, Al Viro, Eric W. Biederman, jlayton, Bruce Fields,
	Arnd Bergmann, Andy Lutomirski, David Howells, christian,
	Tycho Andersen, David Drysdale, dev, containers, linux-fsdevel,
	kernel list, linux-arch, Linux API

On Tue, Oct 9, 2018 at 5:36 PM Aleksa Sarai <asarai@suse.de> wrote:
> On 2018-10-09, 'Jann Horn' via dev <jannh@google.com> wrote:
> > On Tue, Oct 9, 2018 at 9:03 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > > This patch allows for AT_BENEATH and AT_THIS_ROOT to safely permit ".."
> > > resolution (in the case of AT_BENEATH the resolution will still fail if
> > > ".." resolution would resolve a path outside of the root -- while
> > > AT_THIS_ROOT will chroot(2)-style scope it). "proclink" jumps are still
> > > disallowed entirely because now they could result in inconsistent
> > > behaviour if resolution encounters a subsequent "..".
> > >
> > > The need for this patch is explained by observing there is a fairly
> > > easy-to-exploit race condition with chroot(2) (and thus by extension
> > > AT_THIS_ROOT and AT_BENEATH) where a rename(2) of a path can be used to
> > > "skip over" nd->root and thus escape to the filesystem above nd->root.
> > >
> > >   thread1 [attacker]:
> > >     for (;;)
> > >       renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
> > >   thread2 [victim]:
> > >     for (;;)
> > >       openat(dirb, "b/c/../../etc/shadow", O_THISROOT);
> > >
> > > With fairly significant regularity, thread2 will resolve to
> > > "/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar
> > > (though somewhat more privileged) attack using MS_MOVE.
> > >
> > > With this patch, such cases will be detected *during* ".." resolution
> > > (which is the weak point of chroot(2) -- since walking *into* a
> > > subdirectory tautologically cannot result in you walking *outside*
> > > nd->root -- except through a bind-mount or "proclink"). By detecting
> > > this at ".." resolution (rather than checking only at the end of the
> > > entire resolution) we can both correct escapes by jumping back to the
> > > root (in the case of AT_THIS_ROOT), as well as avoid revealing to
> > > attackers the structure of the filesystem outside of the root (through
> > > timing attacks for instance).
> > >
> > > In order to avoid a quadratic lookup with each ".." entry, we only
> > > activate the slow path if a write through &rename_lock or &mount_lock
> > > have occurred during path resolution (&rename_lock and &mount_lock are
> > > re-taken to further optimise the lookup). Since the primary attack being
> > > protected against is MS_MOVE or rename(2), not doing additional checks
> > > unless a mount or rename have occurred avoids making the common case
> > > slow.
> > >
> > > The use of __d_path here might seem suspect, but on further inspection
> > > of the most important race (a path was *inside* the root but is now
> > > *outside*), there appears to be no attack potential. If __d_path occurs
> > > before the rename, then the path will be resolved but since the path was
> > > originally inside the root there is no escape. Subsequent ".." jumps are
> > > guaranteed to check __d_path reachable (by construction, &rename_lock or
> > > &mount_lock must have been taken after __d_path returned),
> >
> > "after"? Don't you mean "before"? Otherwise I don't understand what
> > you're saying here.
>
> I meant that the attacker doing the rename must've taken &rename_lock
> or &mount_lock after __d_path returns in the target process (because the
> race being examined is that the rename occurs *after* __d_path) and thus
> are guaranteed to be detected).
>
> Maybe there's a better way to phrase what I mean...

Aah, I thought you were referring to what the victim process is doing,
not what the racing attacker is doing.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 1/3] namei: implement O_BENEATH-style AT_* flags
  2018-10-09  7:02 ` [PATCH v3 1/3] namei: implement O_BENEATH-style " Aleksa Sarai
  2018-10-09  7:02   ` Aleksa Sarai
@ 2018-10-13  7:33   ` Al Viro
  2018-10-13  7:33     ` Al Viro
                       ` (2 more replies)
  1 sibling, 3 replies; 32+ messages in thread
From: Al Viro @ 2018-10-13  7:33 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Eric Biederman, Christian Brauner, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, Andy Lutomirski, David Howells, Jann Horn,
	Tycho Andersen, David Drysdale, dev, containers, linux-fsdevel,
	linux-kernel, linux-arch, linux-api

On Tue, Oct 09, 2018 at 06:02:28PM +1100, Aleksa Sarai wrote:

First of all, dirfd_path_init() part should be in a separate commit.  And I'm
really not happy with the logics in there.  dirfd_path_init() itself is
kinda-sorta reasonable.  It is equivalent to setting the starting point for
relative pathnames + setting ->root for LOOKUP_BENEATH, right?

But the part in path_init() is too bloody convoluted for its own good.  Let me
try to translate:

> +	if (unlikely(flags & LOOKUP_XDEV)) {
> +		error = dirfd_path_init(nd);
> +		if (unlikely(error))
> +			return ERR_PTR(error);
> +	}

* if LOOKUP_XDEV is set, set the starting point as if it was a relative
  pathname.  If LOOKUP_BENEATH was set as well, set ->root to the same
  point.
* if it's an absolute pathname, 
>  	if (*s == '/') {
... and we hadn't come here with LOOKUP_XDEV + LOOKUP_BENEATH, set ->root.
> +		if (likely(!nd->root.mnt))
> +			set_root(nd);
* if it's an absolute pathname, set the starting point to ->root.  Note that
if we came here with LOOKUP_XDEV, we'll discard the starting point we'd
calculated.
> +		error = nd_jump_root(nd);
> +		if (unlikely(error))
> +			s = ERR_PTR(error);
>  		return s;
>  	}
> +	if (likely(!nd->path.mnt)) {
* if we didn't have LOOKUP_XDEV, set the starting point as if it was a relative
pathname (which it is) and, if LOOKUP_BENEATH is also there, set ->root there
as well.
> +		error = dirfd_path_init(nd);
> +		if (unlikely(error))
> +			return ERR_PTR(error);
> +	}
> +	return s;
>  }

Pardon me, but... huh?  The reason for your two calls of dirfd_path_init() is,
AFAICS, the combination of absolute pathname with both LOOKUP_XDEV and
LOOKUP_BENEATH at the same time.  That combination is treated as if the pathname
had been relative.  Note that LOOKUP_BENEATH alone is ignored for absolute ones
(and with a good reason - it's a no-op on path_init() level in that case).

What the hell?  It complicates your code and doesn't seem to provide any benefits
whatsoever -- you could bloody well have passed the relative pathname to start with.

IDGI...  Without that kludge it becomes simply "do as we currently do for absolute
pathnames, call dirfd_path_init() for relative ones".  And I would argue that
taking LOOKUP_BENEATH handling out of dirfd_path_init() into path_init() (relative)
case would be a good idea.

As it is, the logics is very hard to follow.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 1/3] namei: implement O_BENEATH-style AT_* flags
  2018-10-13  7:33   ` Al Viro
@ 2018-10-13  7:33     ` Al Viro
  2018-10-13  8:05     ` Al Viro
  2018-10-13  8:09     ` Aleksa Sarai
  2 siblings, 0 replies; 32+ messages in thread
From: Al Viro @ 2018-10-13  7:33 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Eric Biederman, Christian Brauner, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, Andy Lutomirski, David Howells, Jann Horn,
	Tycho Andersen, David Drysdale, dev, containers, linux-fsdevel,
	linux-kernel, linux-arch, linux-api

On Tue, Oct 09, 2018 at 06:02:28PM +1100, Aleksa Sarai wrote:

First of all, dirfd_path_init() part should be in a separate commit.  And I'm
really not happy with the logics in there.  dirfd_path_init() itself is
kinda-sorta reasonable.  It is equivalent to setting the starting point for
relative pathnames + setting ->root for LOOKUP_BENEATH, right?

But the part in path_init() is too bloody convoluted for its own good.  Let me
try to translate:

> +	if (unlikely(flags & LOOKUP_XDEV)) {
> +		error = dirfd_path_init(nd);
> +		if (unlikely(error))
> +			return ERR_PTR(error);
> +	}

* if LOOKUP_XDEV is set, set the starting point as if it was a relative
  pathname.  If LOOKUP_BENEATH was set as well, set ->root to the same
  point.
* if it's an absolute pathname, 
>  	if (*s == '/') {
... and we hadn't come here with LOOKUP_XDEV + LOOKUP_BENEATH, set ->root.
> +		if (likely(!nd->root.mnt))
> +			set_root(nd);
* if it's an absolute pathname, set the starting point to ->root.  Note that
if we came here with LOOKUP_XDEV, we'll discard the starting point we'd
calculated.
> +		error = nd_jump_root(nd);
> +		if (unlikely(error))
> +			s = ERR_PTR(error);
>  		return s;
>  	}
> +	if (likely(!nd->path.mnt)) {
* if we didn't have LOOKUP_XDEV, set the starting point as if it was a relative
pathname (which it is) and, if LOOKUP_BENEATH is also there, set ->root there
as well.
> +		error = dirfd_path_init(nd);
> +		if (unlikely(error))
> +			return ERR_PTR(error);
> +	}
> +	return s;
>  }

Pardon me, but... huh?  The reason for your two calls of dirfd_path_init() is,
AFAICS, the combination of absolute pathname with both LOOKUP_XDEV and
LOOKUP_BENEATH at the same time.  That combination is treated as if the pathname
had been relative.  Note that LOOKUP_BENEATH alone is ignored for absolute ones
(and with a good reason - it's a no-op on path_init() level in that case).

What the hell?  It complicates your code and doesn't seem to provide any benefits
whatsoever -- you could bloody well have passed the relative pathname to start with.

IDGI...  Without that kludge it becomes simply "do as we currently do for absolute
pathnames, call dirfd_path_init() for relative ones".  And I would argue that
taking LOOKUP_BENEATH handling out of dirfd_path_init() into path_init() (relative)
case would be a good idea.

As it is, the logics is very hard to follow.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 1/3] namei: implement O_BENEATH-style AT_* flags
  2018-10-13  7:33   ` Al Viro
  2018-10-13  7:33     ` Al Viro
@ 2018-10-13  8:05     ` Al Viro
  2018-10-13  8:05       ` Al Viro
  2018-10-13  8:20       ` Aleksa Sarai
  2018-10-13  8:09     ` Aleksa Sarai
  2 siblings, 2 replies; 32+ messages in thread
From: Al Viro @ 2018-10-13  8:05 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Eric Biederman, Christian Brauner, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, Andy Lutomirski, David Howells, Jann Horn,
	Tycho Andersen, David Drysdale, dev, containers, linux-fsdevel,
	linux-kernel, linux-arch, linux-api

On Sat, Oct 13, 2018 at 08:33:19AM +0100, Al Viro wrote:

> Pardon me, but... huh?  The reason for your two calls of dirfd_path_init() is,
> AFAICS, the combination of absolute pathname with both LOOKUP_XDEV and
> LOOKUP_BENEATH at the same time.  That combination is treated as if the pathname
> had been relative.  Note that LOOKUP_BENEATH alone is ignored for absolute ones
> (and with a good reason - it's a no-op on path_init() level in that case).
> 
> What the hell?  It complicates your code and doesn't seem to provide any benefits
> whatsoever -- you could bloody well have passed the relative pathname to start with.
> 
> IDGI...  Without that kludge it becomes simply "do as we currently do for absolute
> pathnames, call dirfd_path_init() for relative ones".  And I would argue that
> taking LOOKUP_BENEATH handling out of dirfd_path_init() into path_init() (relative)
> case would be a good idea.
> 
> As it is, the logics is very hard to follow.

	... and it fails on LOOKUP_BENEATH anyway.  Egads...  So that's for your
LOOKUP_CHROOT ;-/  IMO that's awful, especially with the way you've spread those
LOOKUP_CHROOT cases between these two.

	Why not simply have O_THISROOT pick root by dirfd and call file_open_root()?
And if something wants it for stat(), etc. just have them use it combined with
O_PATH and pass the result to ...at()...

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 1/3] namei: implement O_BENEATH-style AT_* flags
  2018-10-13  8:05     ` Al Viro
@ 2018-10-13  8:05       ` Al Viro
  2018-10-13  8:20       ` Aleksa Sarai
  1 sibling, 0 replies; 32+ messages in thread
From: Al Viro @ 2018-10-13  8:05 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Eric Biederman, Christian Brauner, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, Andy Lutomirski, David Howells, Jann Horn,
	Tycho Andersen, David Drysdale, dev, containers, linux-fsdevel,
	linux-kernel, linux-arch, linux-api

On Sat, Oct 13, 2018 at 08:33:19AM +0100, Al Viro wrote:

> Pardon me, but... huh?  The reason for your two calls of dirfd_path_init() is,
> AFAICS, the combination of absolute pathname with both LOOKUP_XDEV and
> LOOKUP_BENEATH at the same time.  That combination is treated as if the pathname
> had been relative.  Note that LOOKUP_BENEATH alone is ignored for absolute ones
> (and with a good reason - it's a no-op on path_init() level in that case).
> 
> What the hell?  It complicates your code and doesn't seem to provide any benefits
> whatsoever -- you could bloody well have passed the relative pathname to start with.
> 
> IDGI...  Without that kludge it becomes simply "do as we currently do for absolute
> pathnames, call dirfd_path_init() for relative ones".  And I would argue that
> taking LOOKUP_BENEATH handling out of dirfd_path_init() into path_init() (relative)
> case would be a good idea.
> 
> As it is, the logics is very hard to follow.

	... and it fails on LOOKUP_BENEATH anyway.  Egads...  So that's for your
LOOKUP_CHROOT ;-/  IMO that's awful, especially with the way you've spread those
LOOKUP_CHROOT cases between these two.

	Why not simply have O_THISROOT pick root by dirfd and call file_open_root()?
And if something wants it for stat(), etc. just have them use it combined with
O_PATH and pass the result to ...at()...

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 1/3] namei: implement O_BENEATH-style AT_* flags
  2018-10-13  7:33   ` Al Viro
  2018-10-13  7:33     ` Al Viro
  2018-10-13  8:05     ` Al Viro
@ 2018-10-13  8:09     ` Aleksa Sarai
  2018-10-13  8:09       ` Aleksa Sarai
  2 siblings, 1 reply; 32+ messages in thread
From: Aleksa Sarai @ 2018-10-13  8:09 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric Biederman, Christian Brauner, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, Andy Lutomirski, David Howells, Jann Horn,
	Tycho Andersen, David Drysdale, dev, containers, linux-fsdevel,
	linux-kernel, linux-arch, linux-api

[-- Attachment #1: Type: text/plain, Size: 5408 bytes --]

On 2018-10-13, Al Viro <viro@ZenIV.linux.org.uk> wrote:
> First of all, dirfd_path_init() part should be in a separate commit.  And I'm
> really not happy with the logics in there.  dirfd_path_init() itself is
> kinda-sorta reasonable.

Sure, I can do that.

> It is equivalent to setting the starting point for
> relative pathnames + setting ->root for LOOKUP_BENEATH, right?

Right.

> But the part in path_init() is too bloody convoluted for its own good.  Let me
> try to translate:
> 
> > +	if (unlikely(flags & LOOKUP_XDEV)) {
> > +		error = dirfd_path_init(nd);
> > +		if (unlikely(error))
> > +			return ERR_PTR(error);
> > +	}
> 
> * if LOOKUP_XDEV is set, set the starting point as if it was a relative
>   pathname.  If LOOKUP_BENEATH was set as well, set ->root to the same
>   point.

Right. This is for two reasons (though if you disagree with these
semantics we can change this as well):

1. It's not clear to me whether openat(somefd->"/", "/tmp", O_XDEV)
   should return an -EXDEV or completely ignore the starting point. Same
   argument with AT_FDCWD. I opted to make it so that the starting point
   has to be on the same mountpoint, but I totally understand if you
   feel this is insane -- and I'd be happy to change it. The real
   problem comes from (2).

2. AT_THIS_ROOT chroot-scope absolute paths, and so in the second patch
   LOOKUP_CHROOT also triggers this codepath. The main argument for this
   semantic is somewhat elaborated in the cover letter -- but
   the short version is because AT_THIS_ROOT has to chroot-scope
   absolute symlinks it would be somewhat strange if it didn't scope
   absolute paths you give it -- otherwise it could either be a footgun
   or would require always returning -EXDEV here.

   Though, as above, if you feel that the current semantics (absolute
   paths override whatever dirfd you give), then -EXDEV is the
   alternative I would pitch.

> * if it's an absolute pathname, 
> >  	if (*s == '/') {
> ... and we hadn't come here with LOOKUP_XDEV + LOOKUP_BENEATH, set ->root.
> > +		if (likely(!nd->root.mnt))
> > +			set_root(nd);
> * if it's an absolute pathname, set the starting point to ->root.  Note that
> if we came here with LOOKUP_XDEV, we'll discard the starting point we'd
> calculated.

We wouldn't discard it -- nd_jump_root() will check whether a mount
crossing was implied here (otherwise an absolute symlink could cause you
to cross a mountpoint).

But as above, if you'd prefer that absolute paths disable all dirfd
handling (as is the case now), I can remove this semantic.

> > +		error = nd_jump_root(nd);
> > +		if (unlikely(error))
> > +			s = ERR_PTR(error);
> >  		return s;
> >  	}
> > +	if (likely(!nd->path.mnt)) {
> * if we didn't have LOOKUP_XDEV, set the starting point as if it was a relative
> pathname (which it is) and, if LOOKUP_BENEATH is also there, set ->root there
> as well.
> > +		error = dirfd_path_init(nd);
> > +		if (unlikely(error))
> > +			return ERR_PTR(error);
> > +	}
> > +	return s;
> >  }
> 
> Pardon me, but... huh?  The reason for your two calls of dirfd_path_init() is,
> AFAICS, the combination of absolute pathname with both LOOKUP_XDEV and
> LOOKUP_BENEATH at the same time.  That combination is treated as if the pathname
> had been relative.  Note that LOOKUP_BENEATH alone is ignored for absolute ones
> (and with a good reason - it's a no-op on path_init() level in that case).
> 
> What the hell?  It complicates your code and doesn't seem to provide any benefits
> whatsoever

The reasoning for this is because of how AT_THIS_ROOT uses both of these
codepaths (it causes dirfd_path_init() to be called before the absolute
check, and also causes ->root to be set).

I wrote the features in parallel and then split out the code for
AT_THIS_ROOT so it could be discussed separately (and so removing it if
it was rejected would be simpler). But unfortunately this does result in
the dirfd_path_init() code looking completely superfluous without seeing
the second patch.

> -- you could bloody well have passed the relative pathname to start with.

(I think you mean always doing dirfd_path_init() first here?)

Right, but I didn't want to discard nd->path unnecessarily -- if we do
all of the code to grab AT_FDCWD and then it is completely unused (not
even in the AT_XDEV sense of "unused") it seems like a waste.

Did I misunderstand your suggestion? Were you referring to userspace
just being able to "[pass] the relative pathname to start with"?

> IDGI...  Without that kludge it becomes simply "do as we currently do for absolute
> pathnames, call dirfd_path_init() for relative ones".  And I would argue that
> taking LOOKUP_BENEATH handling out of dirfd_path_init() into path_init() (relative)
> case would be a good idea.

Right, I could definitely do that -- though for AT_THIS_ROOT we'd
duplicate the ->root setting in both places.

> As it is, the logics is very hard to follow.

Sorry about that. Would you prefer if the two patches (AT_BENEATH family
and AT_THIS_ROOT) were sent as a single patch -- with the
dirfd_path_init() code split out? Or that the second patch do all of the
structural changes to refactor dirfd_path_init() usage?

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 1/3] namei: implement O_BENEATH-style AT_* flags
  2018-10-13  8:09     ` Aleksa Sarai
@ 2018-10-13  8:09       ` Aleksa Sarai
  0 siblings, 0 replies; 32+ messages in thread
From: Aleksa Sarai @ 2018-10-13  8:09 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric Biederman, Christian Brauner, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, Andy Lutomirski, David Howells, Jann Horn,
	Tycho Andersen, David Drysdale, dev, containers, linux-fsdevel,
	linux-kernel, linux-arch, linux-api

[-- Attachment #1: Type: text/plain, Size: 5408 bytes --]

On 2018-10-13, Al Viro <viro@ZenIV.linux.org.uk> wrote:
> First of all, dirfd_path_init() part should be in a separate commit.  And I'm
> really not happy with the logics in there.  dirfd_path_init() itself is
> kinda-sorta reasonable.

Sure, I can do that.

> It is equivalent to setting the starting point for
> relative pathnames + setting ->root for LOOKUP_BENEATH, right?

Right.

> But the part in path_init() is too bloody convoluted for its own good.  Let me
> try to translate:
> 
> > +	if (unlikely(flags & LOOKUP_XDEV)) {
> > +		error = dirfd_path_init(nd);
> > +		if (unlikely(error))
> > +			return ERR_PTR(error);
> > +	}
> 
> * if LOOKUP_XDEV is set, set the starting point as if it was a relative
>   pathname.  If LOOKUP_BENEATH was set as well, set ->root to the same
>   point.

Right. This is for two reasons (though if you disagree with these
semantics we can change this as well):

1. It's not clear to me whether openat(somefd->"/", "/tmp", O_XDEV)
   should return an -EXDEV or completely ignore the starting point. Same
   argument with AT_FDCWD. I opted to make it so that the starting point
   has to be on the same mountpoint, but I totally understand if you
   feel this is insane -- and I'd be happy to change it. The real
   problem comes from (2).

2. AT_THIS_ROOT chroot-scope absolute paths, and so in the second patch
   LOOKUP_CHROOT also triggers this codepath. The main argument for this
   semantic is somewhat elaborated in the cover letter -- but
   the short version is because AT_THIS_ROOT has to chroot-scope
   absolute symlinks it would be somewhat strange if it didn't scope
   absolute paths you give it -- otherwise it could either be a footgun
   or would require always returning -EXDEV here.

   Though, as above, if you feel that the current semantics (absolute
   paths override whatever dirfd you give), then -EXDEV is the
   alternative I would pitch.

> * if it's an absolute pathname, 
> >  	if (*s == '/') {
> ... and we hadn't come here with LOOKUP_XDEV + LOOKUP_BENEATH, set ->root.
> > +		if (likely(!nd->root.mnt))
> > +			set_root(nd);
> * if it's an absolute pathname, set the starting point to ->root.  Note that
> if we came here with LOOKUP_XDEV, we'll discard the starting point we'd
> calculated.

We wouldn't discard it -- nd_jump_root() will check whether a mount
crossing was implied here (otherwise an absolute symlink could cause you
to cross a mountpoint).

But as above, if you'd prefer that absolute paths disable all dirfd
handling (as is the case now), I can remove this semantic.

> > +		error = nd_jump_root(nd);
> > +		if (unlikely(error))
> > +			s = ERR_PTR(error);
> >  		return s;
> >  	}
> > +	if (likely(!nd->path.mnt)) {
> * if we didn't have LOOKUP_XDEV, set the starting point as if it was a relative
> pathname (which it is) and, if LOOKUP_BENEATH is also there, set ->root there
> as well.
> > +		error = dirfd_path_init(nd);
> > +		if (unlikely(error))
> > +			return ERR_PTR(error);
> > +	}
> > +	return s;
> >  }
> 
> Pardon me, but... huh?  The reason for your two calls of dirfd_path_init() is,
> AFAICS, the combination of absolute pathname with both LOOKUP_XDEV and
> LOOKUP_BENEATH at the same time.  That combination is treated as if the pathname
> had been relative.  Note that LOOKUP_BENEATH alone is ignored for absolute ones
> (and with a good reason - it's a no-op on path_init() level in that case).
> 
> What the hell?  It complicates your code and doesn't seem to provide any benefits
> whatsoever

The reasoning for this is because of how AT_THIS_ROOT uses both of these
codepaths (it causes dirfd_path_init() to be called before the absolute
check, and also causes ->root to be set).

I wrote the features in parallel and then split out the code for
AT_THIS_ROOT so it could be discussed separately (and so removing it if
it was rejected would be simpler). But unfortunately this does result in
the dirfd_path_init() code looking completely superfluous without seeing
the second patch.

> -- you could bloody well have passed the relative pathname to start with.

(I think you mean always doing dirfd_path_init() first here?)

Right, but I didn't want to discard nd->path unnecessarily -- if we do
all of the code to grab AT_FDCWD and then it is completely unused (not
even in the AT_XDEV sense of "unused") it seems like a waste.

Did I misunderstand your suggestion? Were you referring to userspace
just being able to "[pass] the relative pathname to start with"?

> IDGI...  Without that kludge it becomes simply "do as we currently do for absolute
> pathnames, call dirfd_path_init() for relative ones".  And I would argue that
> taking LOOKUP_BENEATH handling out of dirfd_path_init() into path_init() (relative)
> case would be a good idea.

Right, I could definitely do that -- though for AT_THIS_ROOT we'd
duplicate the ->root setting in both places.

> As it is, the logics is very hard to follow.

Sorry about that. Would you prefer if the two patches (AT_BENEATH family
and AT_THIS_ROOT) were sent as a single patch -- with the
dirfd_path_init() code split out? Or that the second patch do all of the
structural changes to refactor dirfd_path_init() usage?

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 1/3] namei: implement O_BENEATH-style AT_* flags
  2018-10-13  8:05     ` Al Viro
  2018-10-13  8:05       ` Al Viro
@ 2018-10-13  8:20       ` Aleksa Sarai
  2018-10-13  8:20         ` Aleksa Sarai
  1 sibling, 1 reply; 32+ messages in thread
From: Aleksa Sarai @ 2018-10-13  8:20 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric Biederman, Christian Brauner, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, Andy Lutomirski, David Howells, Jann Horn,
	Tycho Andersen, David Drysdale, dev, containers, linux-fsdevel,
	linux-kernel, linux-arch, linux-api

[-- Attachment #1: Type: text/plain, Size: 2126 bytes --]

On 2018-10-13, Al Viro <viro@ZenIV.linux.org.uk> wrote:
> > Pardon me, but... huh?  The reason for your two calls of dirfd_path_init() is,
> > AFAICS, the combination of absolute pathname with both LOOKUP_XDEV and
> > LOOKUP_BENEATH at the same time.  That combination is treated as if the pathname
> > had been relative.  Note that LOOKUP_BENEATH alone is ignored for absolute ones
> > (and with a good reason - it's a no-op on path_init() level in that case).
> > 
> > What the hell?  It complicates your code and doesn't seem to provide any benefits
> > whatsoever -- you could bloody well have passed the relative pathname to start with.
> > 
> > IDGI...  Without that kludge it becomes simply "do as we currently do for absolute
> > pathnames, call dirfd_path_init() for relative ones".  And I would argue that
> > taking LOOKUP_BENEATH handling out of dirfd_path_init() into path_init() (relative)
> > case would be a good idea.
> > 
> > As it is, the logics is very hard to follow.
> 
> 	... and it fails on LOOKUP_BENEATH anyway.  Egads...  So that's for your
> LOOKUP_CHROOT ;-/  IMO that's awful, especially with the way you've spread those
> LOOKUP_CHROOT cases between these two.

Yeah, the ->root setting in dirfd_path_init() is ugly. :/

> 	Why not simply have O_THISROOT pick root by dirfd and call file_open_root()?

Wouldn't this require replicating the dirfd_path_init()-like code inside
all of the path_*at() callers which use path_init()? Or is there another
common place we could put it?

> And if something wants it for stat(), etc. just have them use it combined with
> O_PATH and pass the result to ...at()...

This works for stat and quite a few other things (which is why I only
added openat(2) support for the moment), but I think we'd eventually
need something like this for renameat2(2) as well as a few other choice
*at(2) syscalls. Though I also think that more AT_EMPTY_PATH support
would removed the need for _most_ *at(2) implementations to use this.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 1/3] namei: implement O_BENEATH-style AT_* flags
  2018-10-13  8:20       ` Aleksa Sarai
@ 2018-10-13  8:20         ` Aleksa Sarai
  0 siblings, 0 replies; 32+ messages in thread
From: Aleksa Sarai @ 2018-10-13  8:20 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric Biederman, Christian Brauner, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, Andy Lutomirski, David Howells, Jann Horn,
	Tycho Andersen, David Drysdale, dev, containers, linux-fsdevel,
	linux-kernel, linux-arch, linux-api

[-- Attachment #1: Type: text/plain, Size: 2126 bytes --]

On 2018-10-13, Al Viro <viro@ZenIV.linux.org.uk> wrote:
> > Pardon me, but... huh?  The reason for your two calls of dirfd_path_init() is,
> > AFAICS, the combination of absolute pathname with both LOOKUP_XDEV and
> > LOOKUP_BENEATH at the same time.  That combination is treated as if the pathname
> > had been relative.  Note that LOOKUP_BENEATH alone is ignored for absolute ones
> > (and with a good reason - it's a no-op on path_init() level in that case).
> > 
> > What the hell?  It complicates your code and doesn't seem to provide any benefits
> > whatsoever -- you could bloody well have passed the relative pathname to start with.
> > 
> > IDGI...  Without that kludge it becomes simply "do as we currently do for absolute
> > pathnames, call dirfd_path_init() for relative ones".  And I would argue that
> > taking LOOKUP_BENEATH handling out of dirfd_path_init() into path_init() (relative)
> > case would be a good idea.
> > 
> > As it is, the logics is very hard to follow.
> 
> 	... and it fails on LOOKUP_BENEATH anyway.  Egads...  So that's for your
> LOOKUP_CHROOT ;-/  IMO that's awful, especially with the way you've spread those
> LOOKUP_CHROOT cases between these two.

Yeah, the ->root setting in dirfd_path_init() is ugly. :/

> 	Why not simply have O_THISROOT pick root by dirfd and call file_open_root()?

Wouldn't this require replicating the dirfd_path_init()-like code inside
all of the path_*at() callers which use path_init()? Or is there another
common place we could put it?

> And if something wants it for stat(), etc. just have them use it combined with
> O_PATH and pass the result to ...at()...

This works for stat and quite a few other things (which is why I only
added openat(2) support for the moment), but I think we'd eventually
need something like this for renameat2(2) as well as a few other choice
*at(2) syscalls. Though I also think that more AT_EMPTY_PATH support
would removed the need for _most_ *at(2) implementations to use this.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 3/3] namei: aggressively check for nd->root escape on ".." resolution
  2018-10-09 15:37     ` Aleksa Sarai
  2018-10-09 15:37       ` Aleksa Sarai
  2018-10-09 16:46       ` Jann Horn
@ 2018-10-13  8:22       ` Al Viro
  2018-10-13  8:22         ` Al Viro
  2018-10-13  8:53         ` Aleksa Sarai
  2 siblings, 2 replies; 32+ messages in thread
From: Al Viro @ 2018-10-13  8:22 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Jann Horn, cyphar, Eric W. Biederman, jlayton, Bruce Fields,
	Arnd Bergmann, Andy Lutomirski, David Howells, christian,
	Tycho Andersen, David Drysdale, dev, containers, linux-fsdevel,
	kernel list, linux-arch, Linux API

On Wed, Oct 10, 2018 at 02:37:28AM +1100, Aleksa Sarai wrote:
> > > +static inline int nd_alloc_dpathbuf(struct nameidata *nd)
> > > +{
> > > +       if (unlikely(!nd->dpathbuf)) {
> > > +               if (nd->flags & LOOKUP_RCU) {
> > > +                       nd->dpathbuf = kmalloc(PATH_MAX, GFP_ATOMIC);
> > > +                       if (unlikely(!nd->dpathbuf))
> > > +                               return -ECHILD;
> > > +               } else {
> > > +                       nd->dpathbuf = kmalloc(PATH_MAX, GFP_KERNEL);
> > > +                       if (unlikely(!nd->dpathbuf))
> > > +                               return -ENOMEM;
> > > +               }
> > > +       }
> > > +       return 0;
> > > +}
> > 
> > Note that a fixed-size path buffer means that if the path is very
> > long, e.g. because you followed long symlinks on the way down, this
> > can cause lookup failures.
> 
> This is already an issue with __d_path (even if the buffer was larger)
> because it will not output a path longer than PATH_MAX. I imagine this
> is a pretty strong argument for why we should refactor __d_path so that
> we can *just* use the escape checking to avoid -ENAMETOOLONG.

Let me get it straight - the whole point of that buffer is to check
if __d_path() returns NULL?  So you allocate it so that you would have
place to copy the path components into... only to have them completely
ignored?

How is that different from path_is_under()?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 3/3] namei: aggressively check for nd->root escape on ".." resolution
  2018-10-13  8:22       ` Al Viro
@ 2018-10-13  8:22         ` Al Viro
  2018-10-13  8:53         ` Aleksa Sarai
  1 sibling, 0 replies; 32+ messages in thread
From: Al Viro @ 2018-10-13  8:22 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Jann Horn, cyphar, Eric W. Biederman, jlayton, Bruce Fields,
	Arnd Bergmann, Andy Lutomirski, David Howells, christian,
	Tycho Andersen, David Drysdale, dev, containers, linux-fsdevel,
	kernel list, linux-arch, Linux API

On Wed, Oct 10, 2018 at 02:37:28AM +1100, Aleksa Sarai wrote:
> > > +static inline int nd_alloc_dpathbuf(struct nameidata *nd)
> > > +{
> > > +       if (unlikely(!nd->dpathbuf)) {
> > > +               if (nd->flags & LOOKUP_RCU) {
> > > +                       nd->dpathbuf = kmalloc(PATH_MAX, GFP_ATOMIC);
> > > +                       if (unlikely(!nd->dpathbuf))
> > > +                               return -ECHILD;
> > > +               } else {
> > > +                       nd->dpathbuf = kmalloc(PATH_MAX, GFP_KERNEL);
> > > +                       if (unlikely(!nd->dpathbuf))
> > > +                               return -ENOMEM;
> > > +               }
> > > +       }
> > > +       return 0;
> > > +}
> > 
> > Note that a fixed-size path buffer means that if the path is very
> > long, e.g. because you followed long symlinks on the way down, this
> > can cause lookup failures.
> 
> This is already an issue with __d_path (even if the buffer was larger)
> because it will not output a path longer than PATH_MAX. I imagine this
> is a pretty strong argument for why we should refactor __d_path so that
> we can *just* use the escape checking to avoid -ENAMETOOLONG.

Let me get it straight - the whole point of that buffer is to check
if __d_path() returns NULL?  So you allocate it so that you would have
place to copy the path components into... only to have them completely
ignored?

How is that different from path_is_under()?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 3/3] namei: aggressively check for nd->root escape on ".." resolution
  2018-10-13  8:22       ` Al Viro
  2018-10-13  8:22         ` Al Viro
@ 2018-10-13  8:53         ` Aleksa Sarai
  2018-10-13  8:53           ` Aleksa Sarai
  2018-10-13  9:04           ` Al Viro
  1 sibling, 2 replies; 32+ messages in thread
From: Aleksa Sarai @ 2018-10-13  8:53 UTC (permalink / raw)
  To: Al Viro
  Cc: Aleksa Sarai, Jann Horn, Eric W. Biederman, jlayton,
	Bruce Fields, Arnd Bergmann, Andy Lutomirski, David Howells,
	christian, Tycho Andersen, David Drysdale, dev, containers,
	linux-fsdevel, kernel list, linux-arch, Linux API

[-- Attachment #1: Type: text/plain, Size: 2623 bytes --]

On 2018-10-13, Al Viro <viro@ZenIV.linux.org.uk> wrote:
> > > > +static inline int nd_alloc_dpathbuf(struct nameidata *nd)
> > > > +{
> > > > +       if (unlikely(!nd->dpathbuf)) {
> > > > +               if (nd->flags & LOOKUP_RCU) {
> > > > +                       nd->dpathbuf = kmalloc(PATH_MAX, GFP_ATOMIC);
> > > > +                       if (unlikely(!nd->dpathbuf))
> > > > +                               return -ECHILD;
> > > > +               } else {
> > > > +                       nd->dpathbuf = kmalloc(PATH_MAX, GFP_KERNEL);
> > > > +                       if (unlikely(!nd->dpathbuf))
> > > > +                               return -ENOMEM;
> > > > +               }
> > > > +       }
> > > > +       return 0;
> > > > +}
> > > 
> > > Note that a fixed-size path buffer means that if the path is very
> > > long, e.g. because you followed long symlinks on the way down, this
> > > can cause lookup failures.
> > 
> > This is already an issue with __d_path (even if the buffer was larger)
> > because it will not output a path longer than PATH_MAX. I imagine this
> > is a pretty strong argument for why we should refactor __d_path so that
> > we can *just* use the escape checking to avoid -ENAMETOOLONG.
> 
> Let me get it straight - the whole point of that buffer is to check
> if __d_path() returns NULL?  So you allocate it so that you would have
> place to copy the path components into... only to have them completely
> ignored?

Yes (and it was definitely the wrong thing to do).

Since writing that mail, I changed it to not have to allocate a buffer
-- though this is done in the fairly ugly way of changing prepend_path()
to be able to take @buffer==NULL which then skips all of the
string-related code, and then having a dumb wrapper which calls
prepend_path(root, path, NULL, NULL).

I was planning on sending out the updated patches after LPC.

> How is that different from path_is_under()?

I didn't know about path_is_under() -- I just checked and it appears to
not take &rename_lock? From my understanding, in order to protect
against the rename attack you need to take &rename_lock (or check
against &rename_lock at least and retry if it changed).

I could definitely use path_is_under() if you prefer, though I think
that in this case we'd need to take &rename_lock (right?). Also is there
a speed issue with taking the write-side of a seqlock when we are just
reading -- is this more efficient than doing a retry like in __d_path?

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 3/3] namei: aggressively check for nd->root escape on ".." resolution
  2018-10-13  8:53         ` Aleksa Sarai
@ 2018-10-13  8:53           ` Aleksa Sarai
  2018-10-13  9:04           ` Al Viro
  1 sibling, 0 replies; 32+ messages in thread
From: Aleksa Sarai @ 2018-10-13  8:53 UTC (permalink / raw)
  To: Al Viro
  Cc: Aleksa Sarai, Jann Horn, Eric W. Biederman, jlayton,
	Bruce Fields, Arnd Bergmann, Andy Lutomirski, David Howells,
	christian, Tycho Andersen, David Drysdale, dev, containers,
	linux-fsdevel, kernel list, linux-arch, Linux API

[-- Attachment #1: Type: text/plain, Size: 2623 bytes --]

On 2018-10-13, Al Viro <viro@ZenIV.linux.org.uk> wrote:
> > > > +static inline int nd_alloc_dpathbuf(struct nameidata *nd)
> > > > +{
> > > > +       if (unlikely(!nd->dpathbuf)) {
> > > > +               if (nd->flags & LOOKUP_RCU) {
> > > > +                       nd->dpathbuf = kmalloc(PATH_MAX, GFP_ATOMIC);
> > > > +                       if (unlikely(!nd->dpathbuf))
> > > > +                               return -ECHILD;
> > > > +               } else {
> > > > +                       nd->dpathbuf = kmalloc(PATH_MAX, GFP_KERNEL);
> > > > +                       if (unlikely(!nd->dpathbuf))
> > > > +                               return -ENOMEM;
> > > > +               }
> > > > +       }
> > > > +       return 0;
> > > > +}
> > > 
> > > Note that a fixed-size path buffer means that if the path is very
> > > long, e.g. because you followed long symlinks on the way down, this
> > > can cause lookup failures.
> > 
> > This is already an issue with __d_path (even if the buffer was larger)
> > because it will not output a path longer than PATH_MAX. I imagine this
> > is a pretty strong argument for why we should refactor __d_path so that
> > we can *just* use the escape checking to avoid -ENAMETOOLONG.
> 
> Let me get it straight - the whole point of that buffer is to check
> if __d_path() returns NULL?  So you allocate it so that you would have
> place to copy the path components into... only to have them completely
> ignored?

Yes (and it was definitely the wrong thing to do).

Since writing that mail, I changed it to not have to allocate a buffer
-- though this is done in the fairly ugly way of changing prepend_path()
to be able to take @buffer==NULL which then skips all of the
string-related code, and then having a dumb wrapper which calls
prepend_path(root, path, NULL, NULL).

I was planning on sending out the updated patches after LPC.

> How is that different from path_is_under()?

I didn't know about path_is_under() -- I just checked and it appears to
not take &rename_lock? From my understanding, in order to protect
against the rename attack you need to take &rename_lock (or check
against &rename_lock at least and retry if it changed).

I could definitely use path_is_under() if you prefer, though I think
that in this case we'd need to take &rename_lock (right?). Also is there
a speed issue with taking the write-side of a seqlock when we are just
reading -- is this more efficient than doing a retry like in __d_path?

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 3/3] namei: aggressively check for nd->root escape on ".." resolution
  2018-10-13  8:53         ` Aleksa Sarai
  2018-10-13  8:53           ` Aleksa Sarai
@ 2018-10-13  9:04           ` Al Viro
  2018-10-13  9:04             ` Al Viro
  2018-10-13  9:27             ` Aleksa Sarai
  1 sibling, 2 replies; 32+ messages in thread
From: Al Viro @ 2018-10-13  9:04 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Aleksa Sarai, Jann Horn, Eric W. Biederman, jlayton,
	Bruce Fields, Arnd Bergmann, Andy Lutomirski, David Howells,
	christian, Tycho Andersen, David Drysdale, dev, containers,
	linux-fsdevel, kernel list, linux-arch, Linux API

On Sat, Oct 13, 2018 at 07:53:26PM +1100, Aleksa Sarai wrote:

> I didn't know about path_is_under() -- I just checked and it appears to
> not take &rename_lock? From my understanding, in order to protect
> against the rename attack you need to take &rename_lock (or check
> against &rename_lock at least and retry if it changed).
> 
> I could definitely use path_is_under() if you prefer, though I think
> that in this case we'd need to take &rename_lock (right?). Also is there
> a speed issue with taking the write-side of a seqlock when we are just
> reading -- is this more efficient than doing a retry like in __d_path?

???

1) it uses is_subdir(), which does deal with rename_lock
2) what it does is taking mount_lock.lock.  I.e. the same
thing as the second retry in __d_path().  _If_ it shows
up in profiles, we can switch it to read_seqbegin_or_lock(),
but I'd like to see the profiling data first.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 3/3] namei: aggressively check for nd->root escape on ".." resolution
  2018-10-13  9:04           ` Al Viro
@ 2018-10-13  9:04             ` Al Viro
  2018-10-13  9:27             ` Aleksa Sarai
  1 sibling, 0 replies; 32+ messages in thread
From: Al Viro @ 2018-10-13  9:04 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Aleksa Sarai, Jann Horn, Eric W. Biederman, jlayton,
	Bruce Fields, Arnd Bergmann, Andy Lutomirski, David Howells,
	christian, Tycho Andersen, David Drysdale, dev, containers,
	linux-fsdevel, kernel list, linux-arch, Linux API

On Sat, Oct 13, 2018 at 07:53:26PM +1100, Aleksa Sarai wrote:

> I didn't know about path_is_under() -- I just checked and it appears to
> not take &rename_lock? From my understanding, in order to protect
> against the rename attack you need to take &rename_lock (or check
> against &rename_lock at least and retry if it changed).
> 
> I could definitely use path_is_under() if you prefer, though I think
> that in this case we'd need to take &rename_lock (right?). Also is there
> a speed issue with taking the write-side of a seqlock when we are just
> reading -- is this more efficient than doing a retry like in __d_path?

???

1) it uses is_subdir(), which does deal with rename_lock
2) what it does is taking mount_lock.lock.  I.e. the same
thing as the second retry in __d_path().  _If_ it shows
up in profiles, we can switch it to read_seqbegin_or_lock(),
but I'd like to see the profiling data first.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 3/3] namei: aggressively check for nd->root escape on ".." resolution
  2018-10-13  9:04           ` Al Viro
  2018-10-13  9:04             ` Al Viro
@ 2018-10-13  9:27             ` Aleksa Sarai
  2018-10-13  9:27               ` Aleksa Sarai
  1 sibling, 1 reply; 32+ messages in thread
From: Aleksa Sarai @ 2018-10-13  9:27 UTC (permalink / raw)
  To: Al Viro
  Cc: Aleksa Sarai, Jann Horn, Eric W. Biederman, jlayton,
	Bruce Fields, Arnd Bergmann, Andy Lutomirski, David Howells,
	christian, Tycho Andersen, David Drysdale, dev, containers,
	linux-fsdevel, kernel list, linux-arch, Linux API

[-- Attachment #1: Type: text/plain, Size: 1237 bytes --]

On 2018-10-13, Al Viro <viro@ZenIV.linux.org.uk> wrote:
> On Sat, Oct 13, 2018 at 07:53:26PM +1100, Aleksa Sarai wrote:
> 
> > I didn't know about path_is_under() -- I just checked and it appears to
> > not take &rename_lock? From my understanding, in order to protect
> > against the rename attack you need to take &rename_lock (or check
> > against &rename_lock at least and retry if it changed).
> > 
> > I could definitely use path_is_under() if you prefer, though I think
> > that in this case we'd need to take &rename_lock (right?). Also is there
> > a speed issue with taking the write-side of a seqlock when we are just
> > reading -- is this more efficient than doing a retry like in __d_path?
> 
> ???
> 
> 1) it uses is_subdir(), which does deal with rename_lock

Oh -- complete brain-fart on my part. Sorry about that.

> 2) what it does is taking mount_lock.lock.  I.e. the same
> thing as the second retry in __d_path().  _If_ it shows
> up in profiles, we can switch it to read_seqbegin_or_lock(),
> but I'd like to see the profiling data first.

Sure, I'll switch it to use path_is_under().

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 3/3] namei: aggressively check for nd->root escape on ".." resolution
  2018-10-13  9:27             ` Aleksa Sarai
@ 2018-10-13  9:27               ` Aleksa Sarai
  0 siblings, 0 replies; 32+ messages in thread
From: Aleksa Sarai @ 2018-10-13  9:27 UTC (permalink / raw)
  To: Al Viro
  Cc: Aleksa Sarai, Jann Horn, Eric W. Biederman, jlayton,
	Bruce Fields, Arnd Bergmann, Andy Lutomirski, David Howells,
	christian, Tycho Andersen, David Drysdale, dev, containers,
	linux-fsdevel, kernel list, linux-arch, Linux API

[-- Attachment #1: Type: text/plain, Size: 1237 bytes --]

On 2018-10-13, Al Viro <viro@ZenIV.linux.org.uk> wrote:
> On Sat, Oct 13, 2018 at 07:53:26PM +1100, Aleksa Sarai wrote:
> 
> > I didn't know about path_is_under() -- I just checked and it appears to
> > not take &rename_lock? From my understanding, in order to protect
> > against the rename attack you need to take &rename_lock (or check
> > against &rename_lock at least and retry if it changed).
> > 
> > I could definitely use path_is_under() if you prefer, though I think
> > that in this case we'd need to take &rename_lock (right?). Also is there
> > a speed issue with taking the write-side of a seqlock when we are just
> > reading -- is this more efficient than doing a retry like in __d_path?
> 
> ???
> 
> 1) it uses is_subdir(), which does deal with rename_lock

Oh -- complete brain-fart on my part. Sorry about that.

> 2) what it does is taking mount_lock.lock.  I.e. the same
> thing as the second retry in __d_path().  _If_ it shows
> up in profiles, we can switch it to read_seqbegin_or_lock(),
> but I'd like to see the profiling data first.

Sure, I'll switch it to use path_is_under().

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 0/3] namei: implement various lookup restriction AT_* flags
  2018-10-09  7:02 [PATCH v3 0/3] namei: implement various lookup restriction AT_* flags Aleksa Sarai
                   ` (3 preceding siblings ...)
  2018-10-09  7:02 ` [PATCH v3 3/3] namei: aggressively check for nd->root escape on ".." resolution Aleksa Sarai
@ 2018-10-17 15:23 ` Aleksa Sarai
  2018-10-17 15:23   ` Aleksa Sarai
  4 siblings, 1 reply; 32+ messages in thread
From: Aleksa Sarai @ 2018-10-17 15:23 UTC (permalink / raw)
  To: Al Viro, Eric Biederman
  Cc: Andy Lutomirski, David Howells, Jann Horn, Christian Brauner,
	David Drysdale, containers, linux-fsdevel, linux-api,
	Jeff Layton, J. Bruce Fields, Arnd Bergmann, Tycho Andersen, dev,
	linux-kernel, linux-arch

[-- Attachment #1: Type: text/plain, Size: 2445 bytes --]

On 2018-10-09, Aleksa Sarai <cyphar@cyphar.com> wrote:
> The need for some sort of control over VFS's path resolution (to avoid
> malicious paths resulting in inadvertent breakouts) has been a very
> long-standing desire of many userspace applications. This patchset is a
> revival of Al Viro's old AT_NO_JUMPS[1,2] patchset (which was a variant
> of David Drysdale's O_BENEATH patchset[3] which was a spin-off of the
> Capsicum project[4]) with a few additions and changes made based on the
> previous discussion within [5] as well as others I felt were useful.

I've been thinking about this problem a little more (from the UX side of
things) and I have a feeling that adding 5 different O_* flags related
to resolution -- rather than properties related to opening the file --
might be less than ideal (even though, as discussed in previous threads,
there is a need for these flags and for them to be separated).

There is *some* precedence for this with O_PATH[**] changing fairly
large semantics of openat(2) but there are some things about O_PATH
which I think could be improved.

What if we had a resolveat(2) which acted like openat(..., O_PATH) *but*
it allowed us to have new flags and to separate the scoping flags from
the (fairly limited) space of O_* flags. Then O_PATH could effectively
just be a legacy way of doing resolveat(2) -- with only O_CLOEXEC,
O_DIRECTORY, and O_NOFOLLOW support.

And the main things we could add would be:

  * These resolution flags, with only support available from
	resolveat(2) for the moment. The idea would be that AT_EMPTY_PATH
	would be the recommended way to make use of this.

  * Support for RESOLVE_{NOPERM,RDONLY,WRONLY,RDWR} (which after some
	discussions with Eric last year might be necessary in order to make
	/proc/$pid/fd/$fd re-opening of O_PATH descriptors safer -- which is
	something that we use in both runc and LXC).

Is this idea palatable, or was this something considered during the
development of O_PATH and someone had an argument why augmenting O_PATH
is better than a new syscall?

[**] And while writing this paragraph I noticed that I didn't update the
     O_PATH "flag whitelist" to allow the scoping flags to affect it. I
	 will include a fix for this in v4 (I must've lost it in an early
	 rebase before I sent v1).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 0/3] namei: implement various lookup restriction AT_* flags
  2018-10-17 15:23 ` [PATCH v3 0/3] namei: implement various lookup restriction AT_* flags Aleksa Sarai
@ 2018-10-17 15:23   ` Aleksa Sarai
  0 siblings, 0 replies; 32+ messages in thread
From: Aleksa Sarai @ 2018-10-17 15:23 UTC (permalink / raw)
  To: Al Viro, Eric Biederman
  Cc: Andy Lutomirski, David Howells, Jann Horn, Christian Brauner,
	David Drysdale, containers, linux-fsdevel, linux-api,
	Jeff Layton, J. Bruce Fields, Arnd Bergmann, Tycho Andersen, dev,
	linux-kernel, linux-arch

[-- Attachment #1: Type: text/plain, Size: 2445 bytes --]

On 2018-10-09, Aleksa Sarai <cyphar@cyphar.com> wrote:
> The need for some sort of control over VFS's path resolution (to avoid
> malicious paths resulting in inadvertent breakouts) has been a very
> long-standing desire of many userspace applications. This patchset is a
> revival of Al Viro's old AT_NO_JUMPS[1,2] patchset (which was a variant
> of David Drysdale's O_BENEATH patchset[3] which was a spin-off of the
> Capsicum project[4]) with a few additions and changes made based on the
> previous discussion within [5] as well as others I felt were useful.

I've been thinking about this problem a little more (from the UX side of
things) and I have a feeling that adding 5 different O_* flags related
to resolution -- rather than properties related to opening the file --
might be less than ideal (even though, as discussed in previous threads,
there is a need for these flags and for them to be separated).

There is *some* precedence for this with O_PATH[**] changing fairly
large semantics of openat(2) but there are some things about O_PATH
which I think could be improved.

What if we had a resolveat(2) which acted like openat(..., O_PATH) *but*
it allowed us to have new flags and to separate the scoping flags from
the (fairly limited) space of O_* flags. Then O_PATH could effectively
just be a legacy way of doing resolveat(2) -- with only O_CLOEXEC,
O_DIRECTORY, and O_NOFOLLOW support.

And the main things we could add would be:

  * These resolution flags, with only support available from
	resolveat(2) for the moment. The idea would be that AT_EMPTY_PATH
	would be the recommended way to make use of this.

  * Support for RESOLVE_{NOPERM,RDONLY,WRONLY,RDWR} (which after some
	discussions with Eric last year might be necessary in order to make
	/proc/$pid/fd/$fd re-opening of O_PATH descriptors safer -- which is
	something that we use in both runc and LXC).

Is this idea palatable, or was this something considered during the
development of O_PATH and someone had an argument why augmenting O_PATH
is better than a new syscall?

[**] And while writing this paragraph I noticed that I didn't update the
     O_PATH "flag whitelist" to allow the scoping flags to affect it. I
	 will include a fix for this in v4 (I must've lost it in an early
	 rebase before I sent v1).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2018-10-17 23:19 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-09  7:02 [PATCH v3 0/3] namei: implement various lookup restriction AT_* flags Aleksa Sarai
2018-10-09  7:02 ` Aleksa Sarai
2018-10-09  7:02 ` [PATCH v3 1/3] namei: implement O_BENEATH-style " Aleksa Sarai
2018-10-09  7:02   ` Aleksa Sarai
2018-10-13  7:33   ` Al Viro
2018-10-13  7:33     ` Al Viro
2018-10-13  8:05     ` Al Viro
2018-10-13  8:05       ` Al Viro
2018-10-13  8:20       ` Aleksa Sarai
2018-10-13  8:20         ` Aleksa Sarai
2018-10-13  8:09     ` Aleksa Sarai
2018-10-13  8:09       ` Aleksa Sarai
2018-10-09  7:02 ` [PATCH v3 2/3] namei: implement AT_THIS_ROOT chroot-like path resolution Aleksa Sarai
2018-10-09  7:02   ` Aleksa Sarai
2018-10-09  7:02 ` [PATCH v3 3/3] namei: aggressively check for nd->root escape on ".." resolution Aleksa Sarai
2018-10-09  7:02   ` Aleksa Sarai
2018-10-09 15:19   ` Jann Horn
2018-10-09 15:19     ` Jann Horn
2018-10-09 15:37     ` Aleksa Sarai
2018-10-09 15:37       ` Aleksa Sarai
2018-10-09 16:46       ` Jann Horn
2018-10-09 16:46         ` Jann Horn
2018-10-13  8:22       ` Al Viro
2018-10-13  8:22         ` Al Viro
2018-10-13  8:53         ` Aleksa Sarai
2018-10-13  8:53           ` Aleksa Sarai
2018-10-13  9:04           ` Al Viro
2018-10-13  9:04             ` Al Viro
2018-10-13  9:27             ` Aleksa Sarai
2018-10-13  9:27               ` Aleksa Sarai
2018-10-17 15:23 ` [PATCH v3 0/3] namei: implement various lookup restriction AT_* flags Aleksa Sarai
2018-10-17 15:23   ` Aleksa Sarai

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).