linux-arch.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/3] namei: implement various lookup restriction AT_* flags
@ 2018-10-09  6:52 Aleksa Sarai
  2018-10-09  6:52 ` Aleksa Sarai
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Aleksa Sarai @ 2018-10-09  6:52 UTC (permalink / raw)
  To: Al Viro, Eric Biederman
  Cc: Aleksa Sarai, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Andy Lutomirski, David Howells, Jann Horn, Christian Brauner,
	Tycho Andersen, David Drysdale, dev, containers, linux-fsdevel,
	linux-kernel, linux-arch, linux-api

The need for some sort of control over VFS's path resolution (to avoid
malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications. This patchset is a
revival of Al Viro's old AT_NO_JUMPS[1,2] patchset (which was a variant
of David Drysdale's O_BENEATH patchset[3] which was a spin-off of the
capsicum patchset[4]) with a few additions and changes made based on the
previous discussion within [5] as well as others I felt were useful.

As per the discussion in the AT_NO_JUMPS thread, AT_NO_JUMPS has been
split into separate flags.

  * AT_XDEV blocks mountpoint crossings (both upwards and downwards).
      openat("/", "tmp", AT_XDEV); // blocked
      openat("/tmp", "..", AT_XDEV); // blocked
      openat("/tmp", "/", AT_XDEV); // blocked

  * AT_NO_PROCLINKS blocks all resolution through /proc/$pid/fd/$fd
	"symlinks". Specifically, this blocks all jumps caused by a
	filesystem using nd_jump_link() to shove you around in the
	filesystem tree (these are referred to as "proclinks" in lieu of a
	better name).

  * AT_BENEATH disallows escapes from the starting dirfd using ".." or
	absolute paths (either in the path or during symlink resolution).
	Conceptually this flag ensures that you "stay below" the starting
	point in the filesystem tree. ".." resolution is allowed if it
	doesn't land you outside of the starting point (this is made safe
	against races by patch 3 in this series).

	AT_BENEATH also currently disallows all "proclink" resolution
	because they can trivially throw you outside of the starting point.
	In a future patch we might allow such resolution (as long as it
	stays within the root).

In addition, two more flags have been added to the series:

  * AT_NO_SYMLINKS disallows *all* symlink resolution, and thus implies
	AT_NO_PROCLINKS. Linus mentioned this is something that git would
	like to have in the original discussion[5].

  * AT_THIS_ROOT is a very similar idea to AT_BENEATH, but it serves a
    very different purpose. Rather than blocking resolutions if they
	would go outside of the starting point, it treats the starting point
	as a form of chroot(2). Container runtimes are one of the primary
	justifications for this flag, as they currently have to implement
	this sort of path handling racily in userspace[6].

	The restrictions on "proclink" resolution are the same as with
	AT_BENEATH (though in AT_THIS_ROOT's case it's not really clear how
	"proclink" jumps outside of the root should be handled), and patch 3
	in this series was also required to make ".." resolution safe.

Patch changelog:
  v2:
    * Made ".." resolution with AT_THIS_ROOT and AT_BENEATH safe by
	  through __d_path checking (see patch 3).
    * Disallowed "proclinks" with AT_THIS_ROOT and AT_BENEATH, in the
	  hopes they can be re-enabled once safe.
    * Removed the selftests as they will be reimplemented as xfstests.

[1]: https://lwn.net/Articles/721443/
[2]: https://lore.kernel.org/patchwork/patch/784221/
[3]: https://lwn.net/Articles/619151/
[4]: https://lwn.net/Articles/603929/
[5]: https://lwn.net/Articles/723057/
[6]: https://github.com/cyphar/filepath-securejoin

Aleksa Sarai (3):
  namei: implement O_BENEATH-style AT_* flags
  namei: implement AT_THIS_ROOT chroot-like path resolution
  namei: aggressively check nd->root on ".." resolution

 fs/fcntl.c                       |   2 +-
 fs/namei.c                       | 192 ++++++++++++++++++++++---------
 fs/open.c                        |  10 ++
 fs/stat.c                        |   4 +-
 include/linux/fcntl.h            |   3 +-
 include/linux/namei.h            |   8 ++
 include/uapi/asm-generic/fcntl.h |  20 ++++
 include/uapi/linux/fcntl.h       |  10 ++
 8 files changed, 193 insertions(+), 56 deletions(-)

-- 
2.19.0

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 0/3] namei: implement various lookup restriction AT_* flags
  2018-10-09  6:52 [PATCH v2 0/3] namei: implement various lookup restriction AT_* flags Aleksa Sarai
@ 2018-10-09  6:52 ` Aleksa Sarai
  2018-10-09  6:52 ` Aleksa Sarai
  2018-10-09  6:52 ` [PATCH v2 1/3] namei: implement O_BENEATH-style " Aleksa Sarai
  2 siblings, 0 replies; 14+ messages in thread
From: Aleksa Sarai @ 2018-10-09  6:52 UTC (permalink / raw)
  To: Al Viro, Eric Biederman
  Cc: Aleksa Sarai, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Andy Lutomirski, David Howells, Jann Horn, Christian Brauner,
	Tycho Andersen, David Drysdale, dev, containers, linux-fsdevel,
	linux-kernel, linux-arch, linux-api

The need for some sort of control over VFS's path resolution (to avoid
malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications. This patchset is a
revival of Al Viro's old AT_NO_JUMPS[1,2] patchset (which was a variant
of David Drysdale's O_BENEATH patchset[3] which was a spin-off of the
capsicum patchset[4]) with a few additions and changes made based on the
previous discussion within [5] as well as others I felt were useful.

As per the discussion in the AT_NO_JUMPS thread, AT_NO_JUMPS has been
split into separate flags.

  * AT_XDEV blocks mountpoint crossings (both upwards and downwards).
      openat("/", "tmp", AT_XDEV); // blocked
      openat("/tmp", "..", AT_XDEV); // blocked
      openat("/tmp", "/", AT_XDEV); // blocked

  * AT_NO_PROCLINKS blocks all resolution through /proc/$pid/fd/$fd
	"symlinks". Specifically, this blocks all jumps caused by a
	filesystem using nd_jump_link() to shove you around in the
	filesystem tree (these are referred to as "proclinks" in lieu of a
	better name).

  * AT_BENEATH disallows escapes from the starting dirfd using ".." or
	absolute paths (either in the path or during symlink resolution).
	Conceptually this flag ensures that you "stay below" the starting
	point in the filesystem tree. ".." resolution is allowed if it
	doesn't land you outside of the starting point (this is made safe
	against races by patch 3 in this series).

	AT_BENEATH also currently disallows all "proclink" resolution
	because they can trivially throw you outside of the starting point.
	In a future patch we might allow such resolution (as long as it
	stays within the root).

In addition, two more flags have been added to the series:

  * AT_NO_SYMLINKS disallows *all* symlink resolution, and thus implies
	AT_NO_PROCLINKS. Linus mentioned this is something that git would
	like to have in the original discussion[5].

  * AT_THIS_ROOT is a very similar idea to AT_BENEATH, but it serves a
    very different purpose. Rather than blocking resolutions if they
	would go outside of the starting point, it treats the starting point
	as a form of chroot(2). Container runtimes are one of the primary
	justifications for this flag, as they currently have to implement
	this sort of path handling racily in userspace[6].

	The restrictions on "proclink" resolution are the same as with
	AT_BENEATH (though in AT_THIS_ROOT's case it's not really clear how
	"proclink" jumps outside of the root should be handled), and patch 3
	in this series was also required to make ".." resolution safe.

Patch changelog:
  v2:
    * Made ".." resolution with AT_THIS_ROOT and AT_BENEATH safe by
	  through __d_path checking (see patch 3).
    * Disallowed "proclinks" with AT_THIS_ROOT and AT_BENEATH, in the
	  hopes they can be re-enabled once safe.
    * Removed the selftests as they will be reimplemented as xfstests.

[1]: https://lwn.net/Articles/721443/
[2]: https://lore.kernel.org/patchwork/patch/784221/
[3]: https://lwn.net/Articles/619151/
[4]: https://lwn.net/Articles/603929/
[5]: https://lwn.net/Articles/723057/
[6]: https://github.com/cyphar/filepath-securejoin

Aleksa Sarai (3):
  namei: implement O_BENEATH-style AT_* flags
  namei: implement AT_THIS_ROOT chroot-like path resolution
  namei: aggressively check nd->root on ".." resolution

 fs/fcntl.c                       |   2 +-
 fs/namei.c                       | 192 ++++++++++++++++++++++---------
 fs/open.c                        |  10 ++
 fs/stat.c                        |   4 +-
 include/linux/fcntl.h            |   3 +-
 include/linux/namei.h            |   8 ++
 include/uapi/asm-generic/fcntl.h |  20 ++++
 include/uapi/linux/fcntl.h       |  10 ++
 8 files changed, 193 insertions(+), 56 deletions(-)

-- 
2.19.0

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 0/3] namei: implement various lookup restriction AT_* flags
  2018-10-09  6:52 [PATCH v2 0/3] namei: implement various lookup restriction AT_* flags Aleksa Sarai
  2018-10-09  6:52 ` Aleksa Sarai
@ 2018-10-09  6:52 ` Aleksa Sarai
  2018-10-09  6:52   ` Aleksa Sarai
  2018-10-09  6:52 ` [PATCH v2 1/3] namei: implement O_BENEATH-style " Aleksa Sarai
  2 siblings, 1 reply; 14+ messages in thread
From: Aleksa Sarai @ 2018-10-09  6:52 UTC (permalink / raw)
  To: Al Viro, Eric Biederman
  Cc: Aleksa Sarai, Andy Lutomirski, David Howells, Jann Horn,
	Christian Brauner, David Drysdale, containers, linux-fsdevel,
	linux-api, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Tycho Andersen, dev, linux-kernel, linux-arch

The need for some sort of control over VFS's path resolution (to avoid
malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications. This patchset is a
revival of Al Viro's old AT_NO_JUMPS[1,2] patchset (which was a variant
of David Drysdale's O_BENEATH patchset[3] which was a spin-off of the
Capsicum project[4]) with a few additions and changes made based on the
previous discussion within [5] as well as others I felt were useful.

As per the discussion in the AT_NO_JUMPS thread, AT_NO_JUMPS has been
split into separate flags.

  * AT_XDEV blocks mountpoint crossings (both upwards and downwards).
      openat("/", "tmp", AT_XDEV); // blocked
      openat("/tmp", "..", AT_XDEV); // blocked
      openat("/tmp", "/", AT_XDEV); // blocked

  * AT_NO_PROCLINKS blocks all resolution through /proc/$pid/fd/$fd
	"symlinks". Specifically, this blocks all jumps caused by a
	filesystem using nd_jump_link() to shove you around in the
	filesystem tree (these are referred to as "proclinks" in lieu of a
	better name).
	  openat(AT_FDCWD, "/proc/self/root", AT_NO_PROCLINKS); // blocked
	  openat(AT_FDCWD, "/proc/self/fd/0", AT_NO_PROCLINKS); // blocked
	  openat(AT_FDCWD, "/proc/self/ns/mnt", AT_NO_PROCLINKS); // blocked

  * AT_BENEATH disallows escapes from the starting dirfd using ".." or
	absolute paths (either in the path or during symlink resolution).
	Conceptually this flag ensures that you "stay below" the starting
	point in the filesystem tree. ".." resolution is allowed if it
	doesn't land you outside of the starting point (this is made safe
	against races by patch 3 in this series).
	  openat("/root", "foo", AT_BENEATH); // *not* blocked
	  openat("/root", "a/../b", AT_BENEATH); // *not* blocked
	  openat("/root", "a/../../root/b", AT_BENEATH); // blocked
	  openat("/root", "/root", AT_BENEATH); // blocked

	AT_BENEATH also currently disallows all "proclink" resolution
	because they can trivially throw you outside of the starting point.
	In a future patch we might allow such resolution (as long as it
	stays within the root).
	  openat("/", "proc/self/exe", AT_BENEATH); // blocked

In addition, two more flags have been added to the series:

  * AT_NO_SYMLINKS disallows *all* symlink resolution, and thus implies
	AT_NO_PROCLINKS. Linus mentioned this is something that git would
	like to have in the original discussion[5].
	  // assuming 'ln -s / /usr'
	  openat("/", "/usr/bin", AT_NO_SYMLINKS); // blocked
	  openat("/", "/proc/self/root", AT_NO_PROCLINKS); // blocked

  * AT_THIS_ROOT is a very similar idea to AT_BENEATH, but it serves a
    very different purpose. Rather than blocking resolutions if they
	would go outside of the starting point, it treats the starting point
	as a form of chroot(2). Container runtimes are one of the primary
	justifications for this flag, as they currently have to implement
	this sort of path handling racily in userspace[6].

	The restrictions on "proclink" resolution are the same as with
	AT_BENEATH (though in AT_THIS_ROOT's case it's not really clear how
	"proclink" jumps outside of the root should be handled), and patch 3
	in this series was also required to make ".." resolution safe.

Currently all of these flags are only enabled for openat(2) (and thus
have their own O_* flag names), but the corresponding AT_* flags have
been reserved so they can be added to syscalls where openat(O_PATH) is
not sufficient.

Patch changelog:
  v2:
    * Made ".." resolution with AT_THIS_ROOT and AT_BENEATH safe(r) with
	  some semi-aggressive __d_path checking (see patch 3).
    * Disallowed "proclinks" with AT_THIS_ROOT and AT_BENEATH, in the
	  hopes they can be re-enabled once safe.
    * Removed the selftests as they will be reimplemented as xfstests.
	* Removed stat(2) support, since you can already get it through
	  O_PATH and fstatat(2).

[1]: https://lwn.net/Articles/721443/
[2]: https://lore.kernel.org/patchwork/patch/784221/
[3]: https://lwn.net/Articles/619151/
[4]: https://lwn.net/Articles/603929/
[5]: https://lwn.net/Articles/723057/
[6]: https://github.com/cyphar/filepath-securejoin

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: David Drysdale <drysdale@google.com>
Cc: <containers@lists.linux-foundation.org>
Cc: <linux-fsdevel@vger.kernel.org>
Cc: <linux-api@vger.kernel.org>

Aleksa Sarai (3):
  namei: implement O_BENEATH-style AT_* flags
  namei: implement AT_THIS_ROOT chroot-like path resolution
  namei: aggressively check for nd->root escape on ".." resolution

 fs/fcntl.c                       |   2 +-
 fs/namei.c                       | 241 +++++++++++++++++++++++--------
 fs/open.c                        |  10 ++
 fs/stat.c                        |   4 +-
 include/linux/fcntl.h            |   3 +-
 include/linux/namei.h            |   8 +
 include/uapi/asm-generic/fcntl.h |  20 +++
 include/uapi/linux/fcntl.h       |  10 ++
 8 files changed, 230 insertions(+), 68 deletions(-)

-- 
2.19.0

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 0/3] namei: implement various lookup restriction AT_* flags
  2018-10-09  6:52 ` Aleksa Sarai
@ 2018-10-09  6:52   ` Aleksa Sarai
  0 siblings, 0 replies; 14+ messages in thread
From: Aleksa Sarai @ 2018-10-09  6:52 UTC (permalink / raw)
  To: Al Viro, Eric Biederman
  Cc: Aleksa Sarai, Andy Lutomirski, David Howells, Jann Horn,
	Christian Brauner, David Drysdale, containers, linux-fsdevel,
	linux-api, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Tycho Andersen, dev, linux-kernel, linux-arch

The need for some sort of control over VFS's path resolution (to avoid
malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications. This patchset is a
revival of Al Viro's old AT_NO_JUMPS[1,2] patchset (which was a variant
of David Drysdale's O_BENEATH patchset[3] which was a spin-off of the
Capsicum project[4]) with a few additions and changes made based on the
previous discussion within [5] as well as others I felt were useful.

As per the discussion in the AT_NO_JUMPS thread, AT_NO_JUMPS has been
split into separate flags.

  * AT_XDEV blocks mountpoint crossings (both upwards and downwards).
      openat("/", "tmp", AT_XDEV); // blocked
      openat("/tmp", "..", AT_XDEV); // blocked
      openat("/tmp", "/", AT_XDEV); // blocked

  * AT_NO_PROCLINKS blocks all resolution through /proc/$pid/fd/$fd
	"symlinks". Specifically, this blocks all jumps caused by a
	filesystem using nd_jump_link() to shove you around in the
	filesystem tree (these are referred to as "proclinks" in lieu of a
	better name).
	  openat(AT_FDCWD, "/proc/self/root", AT_NO_PROCLINKS); // blocked
	  openat(AT_FDCWD, "/proc/self/fd/0", AT_NO_PROCLINKS); // blocked
	  openat(AT_FDCWD, "/proc/self/ns/mnt", AT_NO_PROCLINKS); // blocked

  * AT_BENEATH disallows escapes from the starting dirfd using ".." or
	absolute paths (either in the path or during symlink resolution).
	Conceptually this flag ensures that you "stay below" the starting
	point in the filesystem tree. ".." resolution is allowed if it
	doesn't land you outside of the starting point (this is made safe
	against races by patch 3 in this series).
	  openat("/root", "foo", AT_BENEATH); // *not* blocked
	  openat("/root", "a/../b", AT_BENEATH); // *not* blocked
	  openat("/root", "a/../../root/b", AT_BENEATH); // blocked
	  openat("/root", "/root", AT_BENEATH); // blocked

	AT_BENEATH also currently disallows all "proclink" resolution
	because they can trivially throw you outside of the starting point.
	In a future patch we might allow such resolution (as long as it
	stays within the root).
	  openat("/", "proc/self/exe", AT_BENEATH); // blocked

In addition, two more flags have been added to the series:

  * AT_NO_SYMLINKS disallows *all* symlink resolution, and thus implies
	AT_NO_PROCLINKS. Linus mentioned this is something that git would
	like to have in the original discussion[5].
	  // assuming 'ln -s / /usr'
	  openat("/", "/usr/bin", AT_NO_SYMLINKS); // blocked
	  openat("/", "/proc/self/root", AT_NO_PROCLINKS); // blocked

  * AT_THIS_ROOT is a very similar idea to AT_BENEATH, but it serves a
    very different purpose. Rather than blocking resolutions if they
	would go outside of the starting point, it treats the starting point
	as a form of chroot(2). Container runtimes are one of the primary
	justifications for this flag, as they currently have to implement
	this sort of path handling racily in userspace[6].

	The restrictions on "proclink" resolution are the same as with
	AT_BENEATH (though in AT_THIS_ROOT's case it's not really clear how
	"proclink" jumps outside of the root should be handled), and patch 3
	in this series was also required to make ".." resolution safe.

Currently all of these flags are only enabled for openat(2) (and thus
have their own O_* flag names), but the corresponding AT_* flags have
been reserved so they can be added to syscalls where openat(O_PATH) is
not sufficient.

Patch changelog:
  v2:
    * Made ".." resolution with AT_THIS_ROOT and AT_BENEATH safe(r) with
	  some semi-aggressive __d_path checking (see patch 3).
    * Disallowed "proclinks" with AT_THIS_ROOT and AT_BENEATH, in the
	  hopes they can be re-enabled once safe.
    * Removed the selftests as they will be reimplemented as xfstests.
	* Removed stat(2) support, since you can already get it through
	  O_PATH and fstatat(2).

[1]: https://lwn.net/Articles/721443/
[2]: https://lore.kernel.org/patchwork/patch/784221/
[3]: https://lwn.net/Articles/619151/
[4]: https://lwn.net/Articles/603929/
[5]: https://lwn.net/Articles/723057/
[6]: https://github.com/cyphar/filepath-securejoin

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: David Drysdale <drysdale@google.com>
Cc: <containers@lists.linux-foundation.org>
Cc: <linux-fsdevel@vger.kernel.org>
Cc: <linux-api@vger.kernel.org>

Aleksa Sarai (3):
  namei: implement O_BENEATH-style AT_* flags
  namei: implement AT_THIS_ROOT chroot-like path resolution
  namei: aggressively check for nd->root escape on ".." resolution

 fs/fcntl.c                       |   2 +-
 fs/namei.c                       | 241 +++++++++++++++++++++++--------
 fs/open.c                        |  10 ++
 fs/stat.c                        |   4 +-
 include/linux/fcntl.h            |   3 +-
 include/linux/namei.h            |   8 +
 include/uapi/asm-generic/fcntl.h |  20 +++
 include/uapi/linux/fcntl.h       |  10 ++
 8 files changed, 230 insertions(+), 68 deletions(-)

-- 
2.19.0

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 1/3] namei: implement O_BENEATH-style AT_* flags
  2018-10-09  6:52 [PATCH v2 0/3] namei: implement various lookup restriction AT_* flags Aleksa Sarai
  2018-10-09  6:52 ` Aleksa Sarai
  2018-10-09  6:52 ` Aleksa Sarai
@ 2018-10-09  6:52 ` Aleksa Sarai
  2018-10-09  6:52   ` Aleksa Sarai
  2018-10-09 19:25   ` Andy Lutomirski
  2 siblings, 2 replies; 14+ messages in thread
From: Aleksa Sarai @ 2018-10-09  6:52 UTC (permalink / raw)
  To: Al Viro, Eric Biederman
  Cc: Aleksa Sarai, Christian Brauner, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, Andy Lutomirski, David Howells, Jann Horn,
	Tycho Andersen, David Drysdale, dev, containers, linux-fsdevel,
	linux-kernel, linux-arch, linux-api

Add the following flags to allow various restrictions on path
resolution (these affect the *entire* resolution, rather than just the
final path component -- as is the case with most other AT_* flags).

The primary justification for these flags is to allow for programs to be
far more strict about how they want path resolution to handle symlinks,
mountpoint crossings, and paths that escape the dirfd (through an
absolute path or ".." shenanigans).

This is of particular concern to container runtimes that want to be very
careful about malicious root filesystems that a container's init might
have screwed around with (and there is no real way to protect against
this in userspace if you consider potential races against a malicious
container's init). More classical applications (which have their own
potentially buggy userspace path sanitisation code) include web
servers, archive extraction tools, network file servers, and so on.

* AT_XDEV: Disallow mount-point crossing (both *down* into one, or *up*
  from one). The primary "scoping" use is to blocking resolution that
  crosses a bind-mount, which has a similar property to a symlink (in
  the way that it allows for escape from the starting-point). Since it
  is not possible to differentiate bind-mounts However since
  bind-mounting requires privileges (in ways symlinks don't) this has
  been split from LOOKUP_BENEATH. The naming is based on "find -xdev"
  (though find(1) doesn't walk upwards, the semantics seem obvious).

* AT_NO_PROCLINK: Disallows ->get_link "symlink" jumping. This is a very
  specific restriction, and it exists because /proc/$pid/fd/...
  "symlinks" allow for access outside nd->root and pose risk to
  container runtimes that don't want to be tricked into accessing a host
  path (but do want to allow no-funny-business symlink resolution).

* AT_NO_SYMLINK: Disallows symlink jumping *of any kind*. Implies
  AT_NO_PROCLINK (obviously).

* AT_BENEATH: Disallow "escapes" from the starting point of the
  filesystem tree during resolution (you must stay "beneath" the
  starting point at all times). Currently this is done by disallowing
  ".." and absolute paths (either in the given path or found during
  symlink resolution) entirely, as well as all "proclink" jumping.

  The wholesale banning of ".." is because it is currently not safe to
  allow ".." resolution (races can cause the path to be moved outside of
  the root -- this is conceptually similar to historical chroot(2)
  escape attacks). Future patches in this series will address this, and
  will re-enable ".." resolution once it is safe. With those patches,
  ".." resolution will only be allowed if it remains in the root
  throughout resolution (such as "a/../b" not "a/../../outside/b").

  The banning of "proclink" jumping is done because it is not clear
  whether semantically they should be allowed -- while some "proclinks"
  are safe there are many that can cause escapes (and once a resolution
  is outside of the root, AT_BENEATH will no longer detect it). Future
  patches may re-enable "proclink" jumping when such jumps would remain
  inside the root.

The AT_NO_*LINK flags return -ELOOP if path resolution would violates
their requirement, while the others all return -EXDEV. Currently these
are only enabled for openat(2) (which has its own brand of O_* flags
with the same semantics). However the AT_* flags have been reserved for
future support in other *at(2) syscalls (though because of AT_EMPTY_PATH
many *at(2) operations will not need to support these flags directly).

This is a refresh of Al's AT_NO_JUMPS patchset[1] (which was a variation
on David Drysdale's O_BENEATH patchset[2], which in turn was based on
the Capsicum project[3]). Input from Linus and Andy in the AT_NO_JUMPS
thread[4] determined most of the API changes made in this refresh.

[1]: https://lwn.net/Articles/721443/
[2]: https://lwn.net/Articles/619151/
[3]: https://lwn.net/Articles/603929/
[4]: https://lwn.net/Articles/723057/

Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Christian Brauner <christian@brauner.io>
Suggested-by: David Drysdale <drysdale@google.com>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/fcntl.c                       |   2 +-
 fs/namei.c                       | 174 ++++++++++++++++++++++---------
 fs/open.c                        |   8 ++
 fs/stat.c                        |   4 +-
 include/linux/fcntl.h            |   3 +-
 include/linux/namei.h            |   7 ++
 include/uapi/asm-generic/fcntl.h |  17 +++
 include/uapi/linux/fcntl.h       |   8 ++
 8 files changed, 167 insertions(+), 56 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 4137d96534a6..e343618736f7 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1031,7 +1031,7 @@ static int __init fcntl_init(void)
 	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 	 * is defined as O_NONBLOCK on some platforms and not on others.
 	 */
-	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
+	BUILD_BUG_ON(25 - 1 /* for O_RDONLY being 0 */ !=
 		HWEIGHT32(
 			(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
 			__FMODE_EXEC | __FMODE_NONOTIFY));
diff --git a/fs/namei.c b/fs/namei.c
index fb913148d4d1..76eacd3af89b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -845,6 +845,12 @@ static inline void path_to_nameidata(const struct path *path,
 
 static int nd_jump_root(struct nameidata *nd)
 {
+	if (unlikely(nd->flags & LOOKUP_BENEATH))
+		return -EXDEV;
+	if (unlikely(nd->flags & LOOKUP_XDEV)) {
+		if (nd->path.mnt != nd->root.mnt)
+			return -EXDEV;
+	}
 	if (nd->flags & LOOKUP_RCU) {
 		struct dentry *d;
 		nd->path = nd->root;
@@ -1083,14 +1089,23 @@ const char *get_link(struct nameidata *nd)
 		} else {
 			res = get(dentry, inode, &last->done);
 		}
+		/* If we just jumped it was because of a procfs-style link. */
+		if (unlikely(nd->flags & LOOKUP_JUMPED)) {
+			if (unlikely(nd->flags & LOOKUP_NO_PROCLINKS))
+				return ERR_PTR(-ELOOP);
+			/* Not currently safe. */
+			if (unlikely(nd->flags & LOOKUP_BENEATH))
+				return ERR_PTR(-EXDEV);
+		}
 		if (IS_ERR_OR_NULL(res))
 			return res;
 	}
 	if (*res == '/') {
 		if (!nd->root.mnt)
 			set_root(nd);
-		if (unlikely(nd_jump_root(nd)))
-			return ERR_PTR(-ECHILD);
+		error = nd_jump_root(nd);
+		if (unlikely(error))
+			return ERR_PTR(error);
 		while (unlikely(*++res == '/'))
 			;
 	}
@@ -1271,12 +1286,16 @@ static int follow_managed(struct path *path, struct nameidata *nd)
 		break;
 	}
 
-	if (need_mntput && path->mnt == mnt)
-		mntput(path->mnt);
+	if (need_mntput) {
+		if (path->mnt == mnt)
+			mntput(path->mnt);
+		if (unlikely(nd->flags & LOOKUP_XDEV))
+			ret = -EXDEV;
+		else
+			nd->flags |= LOOKUP_JUMPED;
+	}
 	if (ret == -EISDIR || !ret)
 		ret = 1;
-	if (need_mntput)
-		nd->flags |= LOOKUP_JUMPED;
 	if (unlikely(ret < 0))
 		path_put_conditional(path, nd);
 	return ret;
@@ -1333,6 +1352,8 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
 		mounted = __lookup_mnt(path->mnt, path->dentry);
 		if (!mounted)
 			break;
+		if (unlikely(nd->flags & LOOKUP_XDEV))
+			return false;
 		path->mnt = &mounted->mnt;
 		path->dentry = mounted->mnt.mnt_root;
 		nd->flags |= LOOKUP_JUMPED;
@@ -1353,8 +1374,11 @@ static int follow_dotdot_rcu(struct nameidata *nd)
 	struct inode *inode = nd->inode;
 
 	while (1) {
-		if (path_equal(&nd->path, &nd->root))
+		if (path_equal(&nd->path, &nd->root)) {
+			if (unlikely(nd->flags & LOOKUP_BENEATH))
+				return -EXDEV;
 			break;
+		}
 		if (nd->path.dentry != nd->path.mnt->mnt_root) {
 			struct dentry *old = nd->path.dentry;
 			struct dentry *parent = old->d_parent;
@@ -1379,6 +1403,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
 				return -ECHILD;
 			if (&mparent->mnt == nd->path.mnt)
 				break;
+			if (unlikely(nd->flags & LOOKUP_XDEV))
+				return -EXDEV;
 			/* we know that mountpoint was pinned */
 			nd->path.dentry = mountpoint;
 			nd->path.mnt = &mparent->mnt;
@@ -1393,6 +1419,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
 			return -ECHILD;
 		if (!mounted)
 			break;
+		if (unlikely(nd->flags & LOOKUP_XDEV))
+			return -EXDEV;
 		nd->path.mnt = &mounted->mnt;
 		nd->path.dentry = mounted->mnt.mnt_root;
 		inode = nd->path.dentry->d_inode;
@@ -1481,8 +1509,11 @@ static int path_parent_directory(struct path *path)
 static int follow_dotdot(struct nameidata *nd)
 {
 	while(1) {
-		if (path_equal(&nd->path, &nd->root))
+		if (path_equal(&nd->path, &nd->root)) {
+			if (unlikely(nd->flags & LOOKUP_BENEATH))
+				return -EXDEV;
 			break;
+		}
 		if (nd->path.dentry != nd->path.mnt->mnt_root) {
 			int ret = path_parent_directory(&nd->path);
 			if (ret)
@@ -1491,6 +1522,8 @@ static int follow_dotdot(struct nameidata *nd)
 		}
 		if (!follow_up(&nd->path))
 			break;
+		if (unlikely(nd->flags & LOOKUP_XDEV))
+			return -EXDEV;
 	}
 	follow_mount(&nd->path);
 	nd->inode = nd->path.dentry->d_inode;
@@ -1705,6 +1738,12 @@ static inline int may_lookup(struct nameidata *nd)
 static inline int handle_dots(struct nameidata *nd, int type)
 {
 	if (type == LAST_DOTDOT) {
+		/*
+		 * AT_BENEATH resolving ".." is not currently safe -- races can cause
+		 * our parent to have moved outside of the root and us to skip over it.
+		 */
+		if (unlikely(nd->flags & LOOKUP_BENEATH))
+			return -EXDEV;
 		if (!nd->root.mnt)
 			set_root(nd);
 		if (nd->flags & LOOKUP_RCU) {
@@ -1720,6 +1759,8 @@ static int pick_link(struct nameidata *nd, struct path *link,
 {
 	int error;
 	struct saved *last;
+	if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS))
+		return -ELOOP;
 	if (unlikely(nd->total_link_count++ >= MAXSYMLINKS)) {
 		path_to_nameidata(link, nd);
 		return -ELOOP;
@@ -2168,13 +2209,70 @@ static int link_path_walk(const char *name, struct nameidata *nd)
 	}
 }
 
+/*
+ * Configure nd->path based on the nd->dfd. This is only used as part of
+ * path_init().
+ */
+static inline int dirfd_path_init(struct nameidata *nd)
+{
+	if (nd->dfd == AT_FDCWD) {
+		if (nd->flags & LOOKUP_RCU) {
+			struct fs_struct *fs = current->fs;
+			unsigned seq;
+
+			do {
+				seq = read_seqcount_begin(&fs->seq);
+				nd->path = fs->pwd;
+				nd->inode = nd->path.dentry->d_inode;
+				nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
+			} while (read_seqcount_retry(&fs->seq, seq));
+		} else {
+			get_fs_pwd(current->fs, &nd->path);
+			nd->inode = nd->path.dentry->d_inode;
+		}
+	} else {
+		/* Caller must check execute permissions on the starting path component */
+		struct fd f = fdget_raw(nd->dfd);
+		struct dentry *dentry;
+
+		if (!f.file)
+			return -EBADF;
+
+		dentry = f.file->f_path.dentry;
+
+		if (*nd->name->name && unlikely(!d_can_lookup(dentry))) {
+			fdput(f);
+			return -ENOTDIR;
+		}
+
+		nd->path = f.file->f_path;
+		if (nd->flags & LOOKUP_RCU) {
+			nd->inode = nd->path.dentry->d_inode;
+			nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
+		} else {
+			path_get(&nd->path);
+			nd->inode = nd->path.dentry->d_inode;
+		}
+		fdput(f);
+	}
+	if (unlikely(nd->flags & LOOKUP_BENEATH)) {
+		nd->root = nd->path;
+		if (!(nd->flags & LOOKUP_RCU))
+			path_get(&nd->root);
+	}
+	return 0;
+}
+
 /* must be paired with terminate_walk() */
 static const char *path_init(struct nameidata *nd, unsigned flags)
 {
+	int error;
 	const char *s = nd->name->name;
 
 	if (!*s)
 		flags &= ~LOOKUP_RCU;
+	if (flags & LOOKUP_NO_SYMLINKS)
+		flags |= LOOKUP_NO_PROCLINKS;
 	if (flags & LOOKUP_RCU)
 		rcu_read_lock();
 
@@ -2203,53 +2301,25 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 	nd->path.dentry = NULL;
 
 	nd->m_seq = read_seqbegin(&mount_lock);
+	if (unlikely(flags & LOOKUP_XDEV)) {
+		error = dirfd_path_init(nd);
+		if (unlikely(error))
+			return ERR_PTR(error);
+	}
 	if (*s == '/') {
-		set_root(nd);
-		if (likely(!nd_jump_root(nd)))
-			return s;
-		return ERR_PTR(-ECHILD);
-	} else if (nd->dfd == AT_FDCWD) {
-		if (flags & LOOKUP_RCU) {
-			struct fs_struct *fs = current->fs;
-			unsigned seq;
-
-			do {
-				seq = read_seqcount_begin(&fs->seq);
-				nd->path = fs->pwd;
-				nd->inode = nd->path.dentry->d_inode;
-				nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
-			} while (read_seqcount_retry(&fs->seq, seq));
-		} else {
-			get_fs_pwd(current->fs, &nd->path);
-			nd->inode = nd->path.dentry->d_inode;
-		}
-		return s;
-	} else {
-		/* Caller must check execute permissions on the starting path component */
-		struct fd f = fdget_raw(nd->dfd);
-		struct dentry *dentry;
-
-		if (!f.file)
-			return ERR_PTR(-EBADF);
-
-		dentry = f.file->f_path.dentry;
-
-		if (*s && unlikely(!d_can_lookup(dentry))) {
-			fdput(f);
-			return ERR_PTR(-ENOTDIR);
-		}
-
-		nd->path = f.file->f_path;
-		if (flags & LOOKUP_RCU) {
-			nd->inode = nd->path.dentry->d_inode;
-			nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
-		} else {
-			path_get(&nd->path);
-			nd->inode = nd->path.dentry->d_inode;
-		}
-		fdput(f);
+		if (likely(!nd->root.mnt))
+			set_root(nd);
+		error = nd_jump_root(nd);
+		if (unlikely(error))
+			s = ERR_PTR(error);
 		return s;
 	}
+	if (likely(!nd->path.mnt)) {
+		error = dirfd_path_init(nd);
+		if (unlikely(error))
+			return ERR_PTR(error);
+	}
+	return s;
 }
 
 static const char *trailing_symlink(struct nameidata *nd)
diff --git a/fs/open.c b/fs/open.c
index 0285ce7dbd51..80f5f566a5ff 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -988,6 +988,14 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
 		lookup_flags |= LOOKUP_DIRECTORY;
 	if (!(flags & O_NOFOLLOW))
 		lookup_flags |= LOOKUP_FOLLOW;
+	if (flags & O_BENEATH)
+		lookup_flags |= LOOKUP_BENEATH;
+	if (flags & O_XDEV)
+		lookup_flags |= LOOKUP_XDEV;
+	if (flags & O_NOPROCLINKS)
+		lookup_flags |= LOOKUP_NO_PROCLINKS;
+	if (flags & O_NOSYMLINKS)
+		lookup_flags |= LOOKUP_NO_SYMLINKS;
 	op->lookup_flags = lookup_flags;
 	return 0;
 }
diff --git a/fs/stat.c b/fs/stat.c
index f8e6fb2c3657..d319a468c704 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -170,8 +170,8 @@ int vfs_statx(int dfd, const char __user *filename, int flags,
 	int error = -EINVAL;
 	unsigned int lookup_flags = LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT;
 
-	if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT |
-		       AT_EMPTY_PATH | KSTAT_QUERY_FLAGS)) != 0)
+	if (flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT | AT_EMPTY_PATH |
+		      KSTAT_QUERY_FLAGS))
 		return -EINVAL;
 
 	if (flags & AT_SYMLINK_NOFOLLOW)
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index 27dc7a60693e..ad5bba4b5b12 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -9,7 +9,8 @@
 	(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
 	 O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
 	 FASYNC	| O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
-	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
+	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_BENEATH | O_XDEV | \
+	 O_NOPROCLINKS | O_NOSYMLINKS)
 
 #ifndef force_o_largefile
 #define force_o_largefile() (BITS_PER_LONG != 32)
diff --git a/include/linux/namei.h b/include/linux/namei.h
index a78606e8e3df..5ff7f3362d1b 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -47,6 +47,13 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 #define LOOKUP_EMPTY		0x4000
 #define LOOKUP_DOWN		0x8000
 
+/* Scoping flags for lookup. */
+#define LOOKUP_BENEATH		0x010000 /* No escaping from starting point. */
+#define LOOKUP_XDEV		0x020000 /* No mountpoint crossing. */
+#define LOOKUP_NO_PROCLINKS	0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */
+#define LOOKUP_NO_SYMLINKS	0x080000 /* No symlink crossing *at all*.
+					    Implies LOOKUP_NO_PROCLINKS. */
+
 extern int path_pts(struct path *path);
 
 extern int user_path_at_empty(int, const char __user *, unsigned, struct path *, int *empty);
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 9dc0bf0c5a6e..c2bf5983e46a 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -97,6 +97,23 @@
 #define O_NDELAY	O_NONBLOCK
 #endif
 
+/*
+ * These are identical to their AT_* counterparts (which affect the entireity
+ * of path resolution).
+ */
+#ifndef O_BENEATH
+#define O_BENEATH	00040000000 /* *Not* the same as capsicum's O_BENEATH! */
+#endif
+#ifndef O_XDEV
+#define O_XDEV		00100000000
+#endif
+#ifndef O_NOPROCLINKS
+#define O_NOPROCLINKS	00200000000
+#endif
+#ifndef O_NOSYMLINKS
+#define O_NOSYMLINKS	01000000000
+#endif
+
 #define F_DUPFD		0	/* dup */
 #define F_GETFD		1	/* get close_on_exec */
 #define F_SETFD		2	/* set/clear close_on_exec */
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 594b85f7cb86..551a9e2166a8 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -92,5 +92,13 @@
 
 #define AT_RECURSIVE		0x8000	/* Apply to the entire subtree */
 
+/* Flags which affect path *resolution*, not just last-component handling. */
+#define AT_BENEATH		0x10000	/* No absolute paths or ".." escaping
+					   (in-path or through symlinks) */
+#define AT_XDEV			0x20000 /* No mountpoint crossing. */
+#define AT_NO_PROCLINKS		0x40000 /* No /proc/$pid/fd/... "symlinks". */
+#define AT_NO_SYMLINKS		0x80000 /* No symlinks *at all*.
+					   Implies AT_NO_PROCLINKS. */
+
 
 #endif /* _UAPI_LINUX_FCNTL_H */
-- 
2.19.0

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 1/3] namei: implement O_BENEATH-style AT_* flags
  2018-10-09  6:52 ` [PATCH v2 1/3] namei: implement O_BENEATH-style " Aleksa Sarai
@ 2018-10-09  6:52   ` Aleksa Sarai
  2018-10-09 19:25   ` Andy Lutomirski
  1 sibling, 0 replies; 14+ messages in thread
From: Aleksa Sarai @ 2018-10-09  6:52 UTC (permalink / raw)
  To: Al Viro, Eric Biederman
  Cc: Aleksa Sarai, Christian Brauner, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, Andy Lutomirski, David Howells, Jann Horn,
	Tycho Andersen, David Drysdale, dev, containers, linux-fsdevel,
	linux-kernel, linux-arch, linux-api

Add the following flags to allow various restrictions on path
resolution (these affect the *entire* resolution, rather than just the
final path component -- as is the case with most other AT_* flags).

The primary justification for these flags is to allow for programs to be
far more strict about how they want path resolution to handle symlinks,
mountpoint crossings, and paths that escape the dirfd (through an
absolute path or ".." shenanigans).

This is of particular concern to container runtimes that want to be very
careful about malicious root filesystems that a container's init might
have screwed around with (and there is no real way to protect against
this in userspace if you consider potential races against a malicious
container's init). More classical applications (which have their own
potentially buggy userspace path sanitisation code) include web
servers, archive extraction tools, network file servers, and so on.

* AT_XDEV: Disallow mount-point crossing (both *down* into one, or *up*
  from one). The primary "scoping" use is to blocking resolution that
  crosses a bind-mount, which has a similar property to a symlink (in
  the way that it allows for escape from the starting-point). Since it
  is not possible to differentiate bind-mounts However since
  bind-mounting requires privileges (in ways symlinks don't) this has
  been split from LOOKUP_BENEATH. The naming is based on "find -xdev"
  (though find(1) doesn't walk upwards, the semantics seem obvious).

* AT_NO_PROCLINK: Disallows ->get_link "symlink" jumping. This is a very
  specific restriction, and it exists because /proc/$pid/fd/...
  "symlinks" allow for access outside nd->root and pose risk to
  container runtimes that don't want to be tricked into accessing a host
  path (but do want to allow no-funny-business symlink resolution).

* AT_NO_SYMLINK: Disallows symlink jumping *of any kind*. Implies
  AT_NO_PROCLINK (obviously).

* AT_BENEATH: Disallow "escapes" from the starting point of the
  filesystem tree during resolution (you must stay "beneath" the
  starting point at all times). Currently this is done by disallowing
  ".." and absolute paths (either in the given path or found during
  symlink resolution) entirely, as well as all "proclink" jumping.

  The wholesale banning of ".." is because it is currently not safe to
  allow ".." resolution (races can cause the path to be moved outside of
  the root -- this is conceptually similar to historical chroot(2)
  escape attacks). Future patches in this series will address this, and
  will re-enable ".." resolution once it is safe. With those patches,
  ".." resolution will only be allowed if it remains in the root
  throughout resolution (such as "a/../b" not "a/../../outside/b").

  The banning of "proclink" jumping is done because it is not clear
  whether semantically they should be allowed -- while some "proclinks"
  are safe there are many that can cause escapes (and once a resolution
  is outside of the root, AT_BENEATH will no longer detect it). Future
  patches may re-enable "proclink" jumping when such jumps would remain
  inside the root.

The AT_NO_*LINK flags return -ELOOP if path resolution would violates
their requirement, while the others all return -EXDEV. Currently these
are only enabled for openat(2) (which has its own brand of O_* flags
with the same semantics). However the AT_* flags have been reserved for
future support in other *at(2) syscalls (though because of AT_EMPTY_PATH
many *at(2) operations will not need to support these flags directly).

This is a refresh of Al's AT_NO_JUMPS patchset[1] (which was a variation
on David Drysdale's O_BENEATH patchset[2], which in turn was based on
the Capsicum project[3]). Input from Linus and Andy in the AT_NO_JUMPS
thread[4] determined most of the API changes made in this refresh.

[1]: https://lwn.net/Articles/721443/
[2]: https://lwn.net/Articles/619151/
[3]: https://lwn.net/Articles/603929/
[4]: https://lwn.net/Articles/723057/

Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Christian Brauner <christian@brauner.io>
Suggested-by: David Drysdale <drysdale@google.com>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/fcntl.c                       |   2 +-
 fs/namei.c                       | 174 ++++++++++++++++++++++---------
 fs/open.c                        |   8 ++
 fs/stat.c                        |   4 +-
 include/linux/fcntl.h            |   3 +-
 include/linux/namei.h            |   7 ++
 include/uapi/asm-generic/fcntl.h |  17 +++
 include/uapi/linux/fcntl.h       |   8 ++
 8 files changed, 167 insertions(+), 56 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 4137d96534a6..e343618736f7 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1031,7 +1031,7 @@ static int __init fcntl_init(void)
 	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 	 * is defined as O_NONBLOCK on some platforms and not on others.
 	 */
-	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
+	BUILD_BUG_ON(25 - 1 /* for O_RDONLY being 0 */ !=
 		HWEIGHT32(
 			(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
 			__FMODE_EXEC | __FMODE_NONOTIFY));
diff --git a/fs/namei.c b/fs/namei.c
index fb913148d4d1..76eacd3af89b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -845,6 +845,12 @@ static inline void path_to_nameidata(const struct path *path,
 
 static int nd_jump_root(struct nameidata *nd)
 {
+	if (unlikely(nd->flags & LOOKUP_BENEATH))
+		return -EXDEV;
+	if (unlikely(nd->flags & LOOKUP_XDEV)) {
+		if (nd->path.mnt != nd->root.mnt)
+			return -EXDEV;
+	}
 	if (nd->flags & LOOKUP_RCU) {
 		struct dentry *d;
 		nd->path = nd->root;
@@ -1083,14 +1089,23 @@ const char *get_link(struct nameidata *nd)
 		} else {
 			res = get(dentry, inode, &last->done);
 		}
+		/* If we just jumped it was because of a procfs-style link. */
+		if (unlikely(nd->flags & LOOKUP_JUMPED)) {
+			if (unlikely(nd->flags & LOOKUP_NO_PROCLINKS))
+				return ERR_PTR(-ELOOP);
+			/* Not currently safe. */
+			if (unlikely(nd->flags & LOOKUP_BENEATH))
+				return ERR_PTR(-EXDEV);
+		}
 		if (IS_ERR_OR_NULL(res))
 			return res;
 	}
 	if (*res == '/') {
 		if (!nd->root.mnt)
 			set_root(nd);
-		if (unlikely(nd_jump_root(nd)))
-			return ERR_PTR(-ECHILD);
+		error = nd_jump_root(nd);
+		if (unlikely(error))
+			return ERR_PTR(error);
 		while (unlikely(*++res == '/'))
 			;
 	}
@@ -1271,12 +1286,16 @@ static int follow_managed(struct path *path, struct nameidata *nd)
 		break;
 	}
 
-	if (need_mntput && path->mnt == mnt)
-		mntput(path->mnt);
+	if (need_mntput) {
+		if (path->mnt == mnt)
+			mntput(path->mnt);
+		if (unlikely(nd->flags & LOOKUP_XDEV))
+			ret = -EXDEV;
+		else
+			nd->flags |= LOOKUP_JUMPED;
+	}
 	if (ret == -EISDIR || !ret)
 		ret = 1;
-	if (need_mntput)
-		nd->flags |= LOOKUP_JUMPED;
 	if (unlikely(ret < 0))
 		path_put_conditional(path, nd);
 	return ret;
@@ -1333,6 +1352,8 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
 		mounted = __lookup_mnt(path->mnt, path->dentry);
 		if (!mounted)
 			break;
+		if (unlikely(nd->flags & LOOKUP_XDEV))
+			return false;
 		path->mnt = &mounted->mnt;
 		path->dentry = mounted->mnt.mnt_root;
 		nd->flags |= LOOKUP_JUMPED;
@@ -1353,8 +1374,11 @@ static int follow_dotdot_rcu(struct nameidata *nd)
 	struct inode *inode = nd->inode;
 
 	while (1) {
-		if (path_equal(&nd->path, &nd->root))
+		if (path_equal(&nd->path, &nd->root)) {
+			if (unlikely(nd->flags & LOOKUP_BENEATH))
+				return -EXDEV;
 			break;
+		}
 		if (nd->path.dentry != nd->path.mnt->mnt_root) {
 			struct dentry *old = nd->path.dentry;
 			struct dentry *parent = old->d_parent;
@@ -1379,6 +1403,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
 				return -ECHILD;
 			if (&mparent->mnt == nd->path.mnt)
 				break;
+			if (unlikely(nd->flags & LOOKUP_XDEV))
+				return -EXDEV;
 			/* we know that mountpoint was pinned */
 			nd->path.dentry = mountpoint;
 			nd->path.mnt = &mparent->mnt;
@@ -1393,6 +1419,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
 			return -ECHILD;
 		if (!mounted)
 			break;
+		if (unlikely(nd->flags & LOOKUP_XDEV))
+			return -EXDEV;
 		nd->path.mnt = &mounted->mnt;
 		nd->path.dentry = mounted->mnt.mnt_root;
 		inode = nd->path.dentry->d_inode;
@@ -1481,8 +1509,11 @@ static int path_parent_directory(struct path *path)
 static int follow_dotdot(struct nameidata *nd)
 {
 	while(1) {
-		if (path_equal(&nd->path, &nd->root))
+		if (path_equal(&nd->path, &nd->root)) {
+			if (unlikely(nd->flags & LOOKUP_BENEATH))
+				return -EXDEV;
 			break;
+		}
 		if (nd->path.dentry != nd->path.mnt->mnt_root) {
 			int ret = path_parent_directory(&nd->path);
 			if (ret)
@@ -1491,6 +1522,8 @@ static int follow_dotdot(struct nameidata *nd)
 		}
 		if (!follow_up(&nd->path))
 			break;
+		if (unlikely(nd->flags & LOOKUP_XDEV))
+			return -EXDEV;
 	}
 	follow_mount(&nd->path);
 	nd->inode = nd->path.dentry->d_inode;
@@ -1705,6 +1738,12 @@ static inline int may_lookup(struct nameidata *nd)
 static inline int handle_dots(struct nameidata *nd, int type)
 {
 	if (type == LAST_DOTDOT) {
+		/*
+		 * AT_BENEATH resolving ".." is not currently safe -- races can cause
+		 * our parent to have moved outside of the root and us to skip over it.
+		 */
+		if (unlikely(nd->flags & LOOKUP_BENEATH))
+			return -EXDEV;
 		if (!nd->root.mnt)
 			set_root(nd);
 		if (nd->flags & LOOKUP_RCU) {
@@ -1720,6 +1759,8 @@ static int pick_link(struct nameidata *nd, struct path *link,
 {
 	int error;
 	struct saved *last;
+	if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS))
+		return -ELOOP;
 	if (unlikely(nd->total_link_count++ >= MAXSYMLINKS)) {
 		path_to_nameidata(link, nd);
 		return -ELOOP;
@@ -2168,13 +2209,70 @@ static int link_path_walk(const char *name, struct nameidata *nd)
 	}
 }
 
+/*
+ * Configure nd->path based on the nd->dfd. This is only used as part of
+ * path_init().
+ */
+static inline int dirfd_path_init(struct nameidata *nd)
+{
+	if (nd->dfd == AT_FDCWD) {
+		if (nd->flags & LOOKUP_RCU) {
+			struct fs_struct *fs = current->fs;
+			unsigned seq;
+
+			do {
+				seq = read_seqcount_begin(&fs->seq);
+				nd->path = fs->pwd;
+				nd->inode = nd->path.dentry->d_inode;
+				nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
+			} while (read_seqcount_retry(&fs->seq, seq));
+		} else {
+			get_fs_pwd(current->fs, &nd->path);
+			nd->inode = nd->path.dentry->d_inode;
+		}
+	} else {
+		/* Caller must check execute permissions on the starting path component */
+		struct fd f = fdget_raw(nd->dfd);
+		struct dentry *dentry;
+
+		if (!f.file)
+			return -EBADF;
+
+		dentry = f.file->f_path.dentry;
+
+		if (*nd->name->name && unlikely(!d_can_lookup(dentry))) {
+			fdput(f);
+			return -ENOTDIR;
+		}
+
+		nd->path = f.file->f_path;
+		if (nd->flags & LOOKUP_RCU) {
+			nd->inode = nd->path.dentry->d_inode;
+			nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
+		} else {
+			path_get(&nd->path);
+			nd->inode = nd->path.dentry->d_inode;
+		}
+		fdput(f);
+	}
+	if (unlikely(nd->flags & LOOKUP_BENEATH)) {
+		nd->root = nd->path;
+		if (!(nd->flags & LOOKUP_RCU))
+			path_get(&nd->root);
+	}
+	return 0;
+}
+
 /* must be paired with terminate_walk() */
 static const char *path_init(struct nameidata *nd, unsigned flags)
 {
+	int error;
 	const char *s = nd->name->name;
 
 	if (!*s)
 		flags &= ~LOOKUP_RCU;
+	if (flags & LOOKUP_NO_SYMLINKS)
+		flags |= LOOKUP_NO_PROCLINKS;
 	if (flags & LOOKUP_RCU)
 		rcu_read_lock();
 
@@ -2203,53 +2301,25 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 	nd->path.dentry = NULL;
 
 	nd->m_seq = read_seqbegin(&mount_lock);
+	if (unlikely(flags & LOOKUP_XDEV)) {
+		error = dirfd_path_init(nd);
+		if (unlikely(error))
+			return ERR_PTR(error);
+	}
 	if (*s == '/') {
-		set_root(nd);
-		if (likely(!nd_jump_root(nd)))
-			return s;
-		return ERR_PTR(-ECHILD);
-	} else if (nd->dfd == AT_FDCWD) {
-		if (flags & LOOKUP_RCU) {
-			struct fs_struct *fs = current->fs;
-			unsigned seq;
-
-			do {
-				seq = read_seqcount_begin(&fs->seq);
-				nd->path = fs->pwd;
-				nd->inode = nd->path.dentry->d_inode;
-				nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
-			} while (read_seqcount_retry(&fs->seq, seq));
-		} else {
-			get_fs_pwd(current->fs, &nd->path);
-			nd->inode = nd->path.dentry->d_inode;
-		}
-		return s;
-	} else {
-		/* Caller must check execute permissions on the starting path component */
-		struct fd f = fdget_raw(nd->dfd);
-		struct dentry *dentry;
-
-		if (!f.file)
-			return ERR_PTR(-EBADF);
-
-		dentry = f.file->f_path.dentry;
-
-		if (*s && unlikely(!d_can_lookup(dentry))) {
-			fdput(f);
-			return ERR_PTR(-ENOTDIR);
-		}
-
-		nd->path = f.file->f_path;
-		if (flags & LOOKUP_RCU) {
-			nd->inode = nd->path.dentry->d_inode;
-			nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
-		} else {
-			path_get(&nd->path);
-			nd->inode = nd->path.dentry->d_inode;
-		}
-		fdput(f);
+		if (likely(!nd->root.mnt))
+			set_root(nd);
+		error = nd_jump_root(nd);
+		if (unlikely(error))
+			s = ERR_PTR(error);
 		return s;
 	}
+	if (likely(!nd->path.mnt)) {
+		error = dirfd_path_init(nd);
+		if (unlikely(error))
+			return ERR_PTR(error);
+	}
+	return s;
 }
 
 static const char *trailing_symlink(struct nameidata *nd)
diff --git a/fs/open.c b/fs/open.c
index 0285ce7dbd51..80f5f566a5ff 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -988,6 +988,14 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
 		lookup_flags |= LOOKUP_DIRECTORY;
 	if (!(flags & O_NOFOLLOW))
 		lookup_flags |= LOOKUP_FOLLOW;
+	if (flags & O_BENEATH)
+		lookup_flags |= LOOKUP_BENEATH;
+	if (flags & O_XDEV)
+		lookup_flags |= LOOKUP_XDEV;
+	if (flags & O_NOPROCLINKS)
+		lookup_flags |= LOOKUP_NO_PROCLINKS;
+	if (flags & O_NOSYMLINKS)
+		lookup_flags |= LOOKUP_NO_SYMLINKS;
 	op->lookup_flags = lookup_flags;
 	return 0;
 }
diff --git a/fs/stat.c b/fs/stat.c
index f8e6fb2c3657..d319a468c704 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -170,8 +170,8 @@ int vfs_statx(int dfd, const char __user *filename, int flags,
 	int error = -EINVAL;
 	unsigned int lookup_flags = LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT;
 
-	if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT |
-		       AT_EMPTY_PATH | KSTAT_QUERY_FLAGS)) != 0)
+	if (flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT | AT_EMPTY_PATH |
+		      KSTAT_QUERY_FLAGS))
 		return -EINVAL;
 
 	if (flags & AT_SYMLINK_NOFOLLOW)
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index 27dc7a60693e..ad5bba4b5b12 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -9,7 +9,8 @@
 	(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
 	 O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
 	 FASYNC	| O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
-	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
+	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_BENEATH | O_XDEV | \
+	 O_NOPROCLINKS | O_NOSYMLINKS)
 
 #ifndef force_o_largefile
 #define force_o_largefile() (BITS_PER_LONG != 32)
diff --git a/include/linux/namei.h b/include/linux/namei.h
index a78606e8e3df..5ff7f3362d1b 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -47,6 +47,13 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 #define LOOKUP_EMPTY		0x4000
 #define LOOKUP_DOWN		0x8000
 
+/* Scoping flags for lookup. */
+#define LOOKUP_BENEATH		0x010000 /* No escaping from starting point. */
+#define LOOKUP_XDEV		0x020000 /* No mountpoint crossing. */
+#define LOOKUP_NO_PROCLINKS	0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */
+#define LOOKUP_NO_SYMLINKS	0x080000 /* No symlink crossing *at all*.
+					    Implies LOOKUP_NO_PROCLINKS. */
+
 extern int path_pts(struct path *path);
 
 extern int user_path_at_empty(int, const char __user *, unsigned, struct path *, int *empty);
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 9dc0bf0c5a6e..c2bf5983e46a 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -97,6 +97,23 @@
 #define O_NDELAY	O_NONBLOCK
 #endif
 
+/*
+ * These are identical to their AT_* counterparts (which affect the entireity
+ * of path resolution).
+ */
+#ifndef O_BENEATH
+#define O_BENEATH	00040000000 /* *Not* the same as capsicum's O_BENEATH! */
+#endif
+#ifndef O_XDEV
+#define O_XDEV		00100000000
+#endif
+#ifndef O_NOPROCLINKS
+#define O_NOPROCLINKS	00200000000
+#endif
+#ifndef O_NOSYMLINKS
+#define O_NOSYMLINKS	01000000000
+#endif
+
 #define F_DUPFD		0	/* dup */
 #define F_GETFD		1	/* get close_on_exec */
 #define F_SETFD		2	/* set/clear close_on_exec */
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 594b85f7cb86..551a9e2166a8 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -92,5 +92,13 @@
 
 #define AT_RECURSIVE		0x8000	/* Apply to the entire subtree */
 
+/* Flags which affect path *resolution*, not just last-component handling. */
+#define AT_BENEATH		0x10000	/* No absolute paths or ".." escaping
+					   (in-path or through symlinks) */
+#define AT_XDEV			0x20000 /* No mountpoint crossing. */
+#define AT_NO_PROCLINKS		0x40000 /* No /proc/$pid/fd/... "symlinks". */
+#define AT_NO_SYMLINKS		0x80000 /* No symlinks *at all*.
+					   Implies AT_NO_PROCLINKS. */
+
 
 #endif /* _UAPI_LINUX_FCNTL_H */
-- 
2.19.0

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/3] namei: implement O_BENEATH-style AT_* flags
  2018-10-09  6:52 ` [PATCH v2 1/3] namei: implement O_BENEATH-style " Aleksa Sarai
  2018-10-09  6:52   ` Aleksa Sarai
@ 2018-10-09 19:25   ` Andy Lutomirski
  2018-10-09 19:25     ` Andy Lutomirski
  2018-10-10  7:07     ` Aleksa Sarai
  1 sibling, 2 replies; 14+ messages in thread
From: Andy Lutomirski @ 2018-10-09 19:25 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Al Viro, Eric W. Biederman, Christian Brauner, Jeff Layton,
	J. Bruce Fields, Arnd Bergmann, Andrew Lutomirski, David Howells,
	Jann Horn, Tycho Andersen, David Drysdale, dev, Linux Containers,
	Linux FS Devel, LKML, linux-arch, Linux API

On Mon, Oct 8, 2018 at 11:53 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> * AT_NO_PROCLINK: Disallows ->get_link "symlink" jumping. This is a very
>   specific restriction, and it exists because /proc/$pid/fd/...
>   "symlinks" allow for access outside nd->root and pose risk to
>   container runtimes that don't want to be tricked into accessing a host
>   path (but do want to allow no-funny-business symlink resolution).

Can you elaborate on the use case?

If I'm set up a container namespace and walk it for real (through the
outside /proc/PID/root or otherwise starting from an fd that points
into that namespace), and I walk through that namespace's /proc, I'm
going to see the same thing that the processes in the namespace would
see.  So what's the issue?

Similarly, if I somehow manage to walk into the outside /proc, then
I've pretty much lost regardless of the links.

--Andy

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/3] namei: implement O_BENEATH-style AT_* flags
  2018-10-09 19:25   ` Andy Lutomirski
@ 2018-10-09 19:25     ` Andy Lutomirski
  2018-10-10  7:07     ` Aleksa Sarai
  1 sibling, 0 replies; 14+ messages in thread
From: Andy Lutomirski @ 2018-10-09 19:25 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Al Viro, Eric W. Biederman, Christian Brauner, Jeff Layton,
	J. Bruce Fields, Arnd Bergmann, Andrew Lutomirski, David Howells,
	Jann Horn, Tycho Andersen, David Drysdale, dev, Linux Containers,
	Linux FS Devel, LKML, linux-arch, Linux API

On Mon, Oct 8, 2018 at 11:53 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> * AT_NO_PROCLINK: Disallows ->get_link "symlink" jumping. This is a very
>   specific restriction, and it exists because /proc/$pid/fd/...
>   "symlinks" allow for access outside nd->root and pose risk to
>   container runtimes that don't want to be tricked into accessing a host
>   path (but do want to allow no-funny-business symlink resolution).

Can you elaborate on the use case?

If I'm set up a container namespace and walk it for real (through the
outside /proc/PID/root or otherwise starting from an fd that points
into that namespace), and I walk through that namespace's /proc, I'm
going to see the same thing that the processes in the namespace would
see.  So what's the issue?

Similarly, if I somehow manage to walk into the outside /proc, then
I've pretty much lost regardless of the links.

--Andy

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/3] namei: implement O_BENEATH-style AT_* flags
  2018-10-09 19:25   ` Andy Lutomirski
  2018-10-09 19:25     ` Andy Lutomirski
@ 2018-10-10  7:07     ` Aleksa Sarai
  2018-10-10  7:07       ` Aleksa Sarai
                         ` (2 more replies)
  1 sibling, 3 replies; 14+ messages in thread
From: Aleksa Sarai @ 2018-10-10  7:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Al Viro, Eric W. Biederman, Christian Brauner, Jeff Layton,
	J. Bruce Fields, Arnd Bergmann, David Howells, Jann Horn,
	Tycho Andersen, David Drysdale, dev, Linux Containers,
	Linux FS Devel, LKML, linux-arch, Linux API

[-- Attachment #1: Type: text/plain, Size: 3338 bytes --]

On 2018-10-09, Andy Lutomirski <luto@kernel.org> wrote:
> On Mon, Oct 8, 2018 at 11:53 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > * AT_NO_PROCLINK: Disallows ->get_link "symlink" jumping. This is a very
> >   specific restriction, and it exists because /proc/$pid/fd/...
> >   "symlinks" allow for access outside nd->root and pose risk to
> >   container runtimes that don't want to be tricked into accessing a host
> >   path (but do want to allow no-funny-business symlink resolution).
> 
> Can you elaborate on the use case?
> 
> If I'm set up a container namespace and walk it for real (through the
> outside /proc/PID/root or otherwise starting from an fd that points
> into that namespace), and I walk through that namespace's /proc, I'm
> going to see the same thing that the processes in the namespace would
> see.  So what's the issue?
> 
> Similarly, if I somehow manage to walk into the outside /proc, then
> I've pretty much lost regardless of the links.

Well, there's a couple of reasons:

* The original AT_NO_JUMPS patchset similarly disabled "proclinks" but
  it was sort of all contained within AT_NO_JUMPS. In order to have a
  precise 1:1 feature mapping we need this in *some* form (and in v1 the
  only way to get it was to add a separate flag). According to the
  original O_BENEATH changelog, both you and Al pushed for this to be
  part of O_BENEATH. :P

  *However* in v2 of the patchset, proclinks are also disabled by
  AT_BENEATH (because it's not really safe or consistent to allow them
  at the moment -- we'd need to add __d_path checks when jumping through
  them as well if we wanted them to be consistent) -- so the need for
  this flag (purely for AT_NO_JUMPS compatibility) is reduced.

* There were cases in the past where races caused (temporarily)
  something like /proc/self/exe (or a file descriptor referencing the
  host filesystem) to be exposed into a container -- but because of
  set_dumpable they were blocked. CVE-2016-9962 was an example of this
  (it wasn't blocked by set_dumpable -- but the fix used set_dumpable).

  In those cases, if you can trick a host-side process to open that
  procfs file through a symlink/bind-mount (which is technically
  "accessible" but not actually usable by the container process), you
  can trick the resolution to resolve the host filesystem (and this
  might be a file which is unlinked and thus there's no way for __d_path
  checking to verify whether it is safe or not).

  I think that AT_BENEATH allowing only proclinks that result in you
  being under the root is something we might want in the future, but I
  think there are some cases where you want to be _very_ sure you don't
  follow a proclink (now or in the future).

* And finally, some containers run with the host's pidns. This is not a
  usecase that I'm particularly fond of, but some folks do use this (as
  far as I'm aware this is one of the reasons why the subreaper concept
  exists). In those cases, the procfs mount would be able to see the
  host processes -- and thus /proc/self would resolve (as would the
  host's init and so on).

I will admit that this flag is more paranoid than the others though.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/3] namei: implement O_BENEATH-style AT_* flags
  2018-10-10  7:07     ` Aleksa Sarai
@ 2018-10-10  7:07       ` Aleksa Sarai
  2018-10-10  7:28       ` Aleksa Sarai
  2018-10-12  1:12       ` Andy Lutomirski
  2 siblings, 0 replies; 14+ messages in thread
From: Aleksa Sarai @ 2018-10-10  7:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Al Viro, Eric W. Biederman, Christian Brauner, Jeff Layton,
	J. Bruce Fields, Arnd Bergmann, David Howells, Jann Horn,
	Tycho Andersen, David Drysdale, dev, Linux Containers,
	Linux FS Devel, LKML, linux-arch, Linux API

[-- Attachment #1: Type: text/plain, Size: 3338 bytes --]

On 2018-10-09, Andy Lutomirski <luto@kernel.org> wrote:
> On Mon, Oct 8, 2018 at 11:53 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > * AT_NO_PROCLINK: Disallows ->get_link "symlink" jumping. This is a very
> >   specific restriction, and it exists because /proc/$pid/fd/...
> >   "symlinks" allow for access outside nd->root and pose risk to
> >   container runtimes that don't want to be tricked into accessing a host
> >   path (but do want to allow no-funny-business symlink resolution).
> 
> Can you elaborate on the use case?
> 
> If I'm set up a container namespace and walk it for real (through the
> outside /proc/PID/root or otherwise starting from an fd that points
> into that namespace), and I walk through that namespace's /proc, I'm
> going to see the same thing that the processes in the namespace would
> see.  So what's the issue?
> 
> Similarly, if I somehow manage to walk into the outside /proc, then
> I've pretty much lost regardless of the links.

Well, there's a couple of reasons:

* The original AT_NO_JUMPS patchset similarly disabled "proclinks" but
  it was sort of all contained within AT_NO_JUMPS. In order to have a
  precise 1:1 feature mapping we need this in *some* form (and in v1 the
  only way to get it was to add a separate flag). According to the
  original O_BENEATH changelog, both you and Al pushed for this to be
  part of O_BENEATH. :P

  *However* in v2 of the patchset, proclinks are also disabled by
  AT_BENEATH (because it's not really safe or consistent to allow them
  at the moment -- we'd need to add __d_path checks when jumping through
  them as well if we wanted them to be consistent) -- so the need for
  this flag (purely for AT_NO_JUMPS compatibility) is reduced.

* There were cases in the past where races caused (temporarily)
  something like /proc/self/exe (or a file descriptor referencing the
  host filesystem) to be exposed into a container -- but because of
  set_dumpable they were blocked. CVE-2016-9962 was an example of this
  (it wasn't blocked by set_dumpable -- but the fix used set_dumpable).

  In those cases, if you can trick a host-side process to open that
  procfs file through a symlink/bind-mount (which is technically
  "accessible" but not actually usable by the container process), you
  can trick the resolution to resolve the host filesystem (and this
  might be a file which is unlinked and thus there's no way for __d_path
  checking to verify whether it is safe or not).

  I think that AT_BENEATH allowing only proclinks that result in you
  being under the root is something we might want in the future, but I
  think there are some cases where you want to be _very_ sure you don't
  follow a proclink (now or in the future).

* And finally, some containers run with the host's pidns. This is not a
  usecase that I'm particularly fond of, but some folks do use this (as
  far as I'm aware this is one of the reasons why the subreaper concept
  exists). In those cases, the procfs mount would be able to see the
  host processes -- and thus /proc/self would resolve (as would the
  host's init and so on).

I will admit that this flag is more paranoid than the others though.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/3] namei: implement O_BENEATH-style AT_* flags
  2018-10-10  7:07     ` Aleksa Sarai
  2018-10-10  7:07       ` Aleksa Sarai
@ 2018-10-10  7:28       ` Aleksa Sarai
  2018-10-10  7:28         ` Aleksa Sarai
  2018-10-12  1:12       ` Andy Lutomirski
  2 siblings, 1 reply; 14+ messages in thread
From: Aleksa Sarai @ 2018-10-10  7:28 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Al Viro, Eric W. Biederman, Christian Brauner, Jeff Layton,
	J. Bruce Fields, Arnd Bergmann, David Howells, Jann Horn,
	Tycho Andersen, David Drysdale, dev, Linux Containers,
	Linux FS Devel, LKML, linux-arch, Linux API

[-- Attachment #1: Type: text/plain, Size: 2031 bytes --]

On 2018-10-10, Aleksa Sarai <cyphar@cyphar.com> wrote:
> On 2018-10-09, Andy Lutomirski <luto@kernel.org> wrote:
> > On Mon, Oct 8, 2018 at 11:53 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > > * AT_NO_PROCLINK: Disallows ->get_link "symlink" jumping. This is a very
> > >   specific restriction, and it exists because /proc/$pid/fd/...
> > >   "symlinks" allow for access outside nd->root and pose risk to
> > >   container runtimes that don't want to be tricked into accessing a host
> > >   path (but do want to allow no-funny-business symlink resolution).
> > 
> > Can you elaborate on the use case?
> > 
> [...]
>   I think that AT_BENEATH allowing only proclinks that result in you
>   being under the root is something we might want in the future, but I
>   think there are some cases where you want to be _very_ sure you don't
>   follow a proclink (now or in the future).
> [...]

Sorry, just to clarify this point a bit more.

At the moment, "proclinks" are entirely disabled with AT_BENEATH. This
is a (hopefully) temporary measure until it's decided _how_ they should
be allowed. Personally I think we should allow them if they follow the
same requirement as ".." escapes (that __d_path can resolve them).

But then the question arises -- what if we're looking at a never-mounted
pseudo-filesystem dentry (see the ->d_dname code in d_path)? If we don't
allow it then we'd probably disallow quite a few cases where you'd want
to allow access (nsfs proclinks come immediately to mind).

*But* if we allow it then there's no real way to tell if the container
process has tricked us into opening something we shouldn't (like an open
file descriptor to a memfd or pipe related to some host service). Maybe
we should still allow them in that case because the likelihood of such a
case is very small (and allowing them would let you open nsfs links with
AT_BENEATH), but I'm not sure.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/3] namei: implement O_BENEATH-style AT_* flags
  2018-10-10  7:28       ` Aleksa Sarai
@ 2018-10-10  7:28         ` Aleksa Sarai
  0 siblings, 0 replies; 14+ messages in thread
From: Aleksa Sarai @ 2018-10-10  7:28 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Al Viro, Eric W. Biederman, Christian Brauner, Jeff Layton,
	J. Bruce Fields, Arnd Bergmann, David Howells, Jann Horn,
	Tycho Andersen, David Drysdale, dev, Linux Containers,
	Linux FS Devel, LKML, linux-arch, Linux API

[-- Attachment #1: Type: text/plain, Size: 2031 bytes --]

On 2018-10-10, Aleksa Sarai <cyphar@cyphar.com> wrote:
> On 2018-10-09, Andy Lutomirski <luto@kernel.org> wrote:
> > On Mon, Oct 8, 2018 at 11:53 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > > * AT_NO_PROCLINK: Disallows ->get_link "symlink" jumping. This is a very
> > >   specific restriction, and it exists because /proc/$pid/fd/...
> > >   "symlinks" allow for access outside nd->root and pose risk to
> > >   container runtimes that don't want to be tricked into accessing a host
> > >   path (but do want to allow no-funny-business symlink resolution).
> > 
> > Can you elaborate on the use case?
> > 
> [...]
>   I think that AT_BENEATH allowing only proclinks that result in you
>   being under the root is something we might want in the future, but I
>   think there are some cases where you want to be _very_ sure you don't
>   follow a proclink (now or in the future).
> [...]

Sorry, just to clarify this point a bit more.

At the moment, "proclinks" are entirely disabled with AT_BENEATH. This
is a (hopefully) temporary measure until it's decided _how_ they should
be allowed. Personally I think we should allow them if they follow the
same requirement as ".." escapes (that __d_path can resolve them).

But then the question arises -- what if we're looking at a never-mounted
pseudo-filesystem dentry (see the ->d_dname code in d_path)? If we don't
allow it then we'd probably disallow quite a few cases where you'd want
to allow access (nsfs proclinks come immediately to mind).

*But* if we allow it then there's no real way to tell if the container
process has tricked us into opening something we shouldn't (like an open
file descriptor to a memfd or pipe related to some host service). Maybe
we should still allow them in that case because the likelihood of such a
case is very small (and allowing them would let you open nsfs links with
AT_BENEATH), but I'm not sure.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/3] namei: implement O_BENEATH-style AT_* flags
  2018-10-10  7:07     ` Aleksa Sarai
  2018-10-10  7:07       ` Aleksa Sarai
  2018-10-10  7:28       ` Aleksa Sarai
@ 2018-10-12  1:12       ` Andy Lutomirski
  2018-10-12  1:12         ` Andy Lutomirski
  2 siblings, 1 reply; 14+ messages in thread
From: Andy Lutomirski @ 2018-10-12  1:12 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Andrew Lutomirski, Al Viro, Eric W. Biederman, Christian Brauner,
	Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Jann Horn, Tycho Andersen, David Drysdale, dev, Linux Containers,
	Linux FS Devel, LKML, linux-arch, Linux API

On Wed, Oct 10, 2018 at 12:08 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> On 2018-10-09, Andy Lutomirski <luto@kernel.org> wrote:
> > On Mon, Oct 8, 2018 at 11:53 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > > * AT_NO_PROCLINK: Disallows ->get_link "symlink" jumping. This is a very
> > >   specific restriction, and it exists because /proc/$pid/fd/...
> > >   "symlinks" allow for access outside nd->root and pose risk to
> > >   container runtimes that don't want to be tricked into accessing a host
> > >   path (but do want to allow no-funny-business symlink resolution).
> >
> > Can you elaborate on the use case?
> >
> > If I'm set up a container namespace and walk it for real (through the
> > outside /proc/PID/root or otherwise starting from an fd that points
> > into that namespace), and I walk through that namespace's /proc, I'm
> > going to see the same thing that the processes in the namespace would
> > see.  So what's the issue?
> >
> > Similarly, if I somehow manage to walk into the outside /proc, then
> > I've pretty much lost regardless of the links.
>
> Well, there's a couple of reasons:
>
> * The original AT_NO_JUMPS patchset similarly disabled "proclinks" but
>   it was sort of all contained within AT_NO_JUMPS. In order to have a
>   precise 1:1 feature mapping we need this in *some* form (and in v1 the
>   only way to get it was to add a separate flag). According to the
>   original O_BENEATH changelog, both you and Al pushed for this to be
>   part of O_BENEATH. :P

:)

Now that you mention it, I *think* my reasoning involved a rather
different use case: sandboxing.  If a task is Capsicum-ified or
seccomp()ed such that it can *only* use O_BENEATH or AT_BENEATH, this
restriction considerably strengthens the resulting security.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/3] namei: implement O_BENEATH-style AT_* flags
  2018-10-12  1:12       ` Andy Lutomirski
@ 2018-10-12  1:12         ` Andy Lutomirski
  0 siblings, 0 replies; 14+ messages in thread
From: Andy Lutomirski @ 2018-10-12  1:12 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Andrew Lutomirski, Al Viro, Eric W. Biederman, Christian Brauner,
	Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Jann Horn, Tycho Andersen, David Drysdale, dev, Linux Containers,
	Linux FS Devel, LKML, linux-arch, Linux API

On Wed, Oct 10, 2018 at 12:08 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> On 2018-10-09, Andy Lutomirski <luto@kernel.org> wrote:
> > On Mon, Oct 8, 2018 at 11:53 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > > * AT_NO_PROCLINK: Disallows ->get_link "symlink" jumping. This is a very
> > >   specific restriction, and it exists because /proc/$pid/fd/...
> > >   "symlinks" allow for access outside nd->root and pose risk to
> > >   container runtimes that don't want to be tricked into accessing a host
> > >   path (but do want to allow no-funny-business symlink resolution).
> >
> > Can you elaborate on the use case?
> >
> > If I'm set up a container namespace and walk it for real (through the
> > outside /proc/PID/root or otherwise starting from an fd that points
> > into that namespace), and I walk through that namespace's /proc, I'm
> > going to see the same thing that the processes in the namespace would
> > see.  So what's the issue?
> >
> > Similarly, if I somehow manage to walk into the outside /proc, then
> > I've pretty much lost regardless of the links.
>
> Well, there's a couple of reasons:
>
> * The original AT_NO_JUMPS patchset similarly disabled "proclinks" but
>   it was sort of all contained within AT_NO_JUMPS. In order to have a
>   precise 1:1 feature mapping we need this in *some* form (and in v1 the
>   only way to get it was to add a separate flag). According to the
>   original O_BENEATH changelog, both you and Al pushed for this to be
>   part of O_BENEATH. :P

:)

Now that you mention it, I *think* my reasoning involved a rather
different use case: sandboxing.  If a task is Capsicum-ified or
seccomp()ed such that it can *only* use O_BENEATH or AT_BENEATH, this
restriction considerably strengthens the resulting security.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2018-10-12  8:42 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-09  6:52 [PATCH v2 0/3] namei: implement various lookup restriction AT_* flags Aleksa Sarai
2018-10-09  6:52 ` Aleksa Sarai
2018-10-09  6:52 ` Aleksa Sarai
2018-10-09  6:52   ` Aleksa Sarai
2018-10-09  6:52 ` [PATCH v2 1/3] namei: implement O_BENEATH-style " Aleksa Sarai
2018-10-09  6:52   ` Aleksa Sarai
2018-10-09 19:25   ` Andy Lutomirski
2018-10-09 19:25     ` Andy Lutomirski
2018-10-10  7:07     ` Aleksa Sarai
2018-10-10  7:07       ` Aleksa Sarai
2018-10-10  7:28       ` Aleksa Sarai
2018-10-10  7:28         ` Aleksa Sarai
2018-10-12  1:12       ` Andy Lutomirski
2018-10-12  1:12         ` Andy Lutomirski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).