archive mirror
 help / color / mirror / Atom feed
From: Aleksa Sarai <>
To: Jann Horn <>
Cc: "Eric W. Biederman" <>,, Bruce Fields <>,
	Al Viro <>, Arnd Bergmann <>,, David Howells <>,
	Andy Lutomirski <>,, Tycho Andersen <>,
	kernel list <>,,
	linux-arch <>,,,,
	Linux API <>
Subject: Re: [PATCH 2/3] namei: implement AT_THIS_ROOT chroot-like path resolution
Date: Mon, 1 Oct 2018 15:44:28 +1000	[thread overview]
Message-ID: <20181001054246.gfinmx3api7kjhmc@ryuk> (raw)
In-Reply-To: <>

[-- Attachment #1: Type: text/plain, Size: 6954 bytes --]

On 2018-09-29, Jann Horn <> wrote:
> The problem is what happens if a folder you are walking through is
> concurrently moved out of the chroot. Consider the following scenario:
> You attempt to open "C/../../etc/passwd" under the root "/A/B".
> Something else concurrently moves /A/B/C to /A/C. This can result in
> the following:
> 1. You start the path walk and reach /A/B/C.
> 2. The other process moves /A/B/C to /A/C. Your path walk is now at /A/C.
> 3. Your path walk follows the first ".." up into /A. This is outside
> the process root, but you never actually encountered the process root,
> so you don't notice.
> 4. Your path walk follows the second ".." up to /. Again, this is
> outside the process root, but you don't notice.
> 5. Your path walk walks down to /etc/passwd, and the open completes
> successfully. You now have an fd pointing outside your chroot.
> If the root of your walk is below an attacker-controlled directory,
> this of course means that you lose instantly. If you point the root of
> the walk at a directory out of which a process in the container
> wouldn't be able to move the file, you're probably kinda mostly fine -
> as long as you know, for certain, that nothing else on the system
> would ever do that. But I still wouldn't feel good about that.

Please correct me if I'm wrong here (this is the first patch I've
written for VFS). Isn't the retry/LOOKUP_REVAL code meant to handle this
-- or does that only handle if a particular path component changes
*while* it's being walked through? Is it possible for a path walk to
succeed after a path component was unmounted (obviously you can't delete
a directory path component since you'd get -ENOTEMPTY)?

If this is an issue for AT_THIS_ROOT, I believe this might also be an
issue for AT_BENEATH since they are effectively both using the same
nd->root trick (so you could similarly trick AT_BENEATH to not error
out). So we'd need to figure out how to solve this problem in order for
AT_BENEATH to be safe.

Speaking naively, doesn't it make sense to invalidate the walk if a path
component was modified? Or is this something that would be far too
costly with little benefit? What if we do more aggressive nd->root
checks when resolving with AT_BENEATH or AT_THIS_ROOT (or if nd->root !=

Regarding chroot attacks, I was aware of the trivial
chroot-open-chroot-fchdir attack but I was not aware that there was a
rename attack for chroot. Thanks for bringing this up!

> I believe that the only way to robustly use this would be to point the
> dirfd at a mount point, such that you know that being moved out of the
> chroot is impossible because the mount point limits movement of
> directories under it. (Well, technically, it doesn't, but it ensures
> that if a directory does dangerously move away, the syscall fails.) It
> might make sense to hardcode this constraint in the implementation of
> AT_THIS_ROOT, to keep people from shooting themselves in the foot.

Unless I'm missing something, would this not also affect using a
mountpoint as a dirfd-root (with MS_MOVE of an already-walked-through
path component) -- or does MS_MOVE cause a rewalk in a way that rename
does not?

I wouldn't mind tying AT_THIS_ROOT to only work on mountpoints (I
thought that bind-mounts would be an issue but you also get -EXDEV when
trying to rename across bind-mounts even if they are on the same
underlying filesystem). But AT_BENEATH might be a more bitter pill to
swallow. I'm not sure.

In the usecase of container runtimes, we wouldn't generally be doing
resolution of attacker-controlled paths but it still definitely doesn't
hurt to consider this part of the threat model -- to avoid foot-gunning
as you've said. (There also might be some nested-container cases where
you might want to do that.)

> > Currently most container runtimes try to do this resolution in
> > userspace[1], causing many potential race conditions. In addition, the
> > "obvious" alternative (actually performing a {ch,pivot_}root(2))
> > requires a fork+exec which is *very* costly if necessary for every
> > filesystem operation involving a container.
> Wait. fork() I understand, but why exec? And actually, you don't need
> a full fork() either, clone() lets you do this with some process parts
> shared. And then you also shouldn't need to use SCM_RIGHTS, just keep
> the file descriptor table shared. And why chroot()/pivot_root(),
> wouldn't you want to use setns()?

You're right about this -- for C runtimes. In Go we cannot do a raw
clone() or fork() (if you do it manually with RawSyscall you'll end with
broken runtime state). So you're forced to do fork+exec (which then
means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes

(It should be noted that multi-threaded C runtimes have somewhat similar
issues -- AFAIK you can technically only use AS-Safe glibc functions
after a fork() but that's more of a theoretical concern here. If you
just use raw syscalls there isn't an issue.)

As for why use setns() rather than pivot_root(), there are cases where
you're operating on a container's image without a running container
(think image extraction or snapshotting tools). In those cases, you
would need to set up a dummy container process in order to setns() into
its namespaces. You are right that setns() would be a better option if
you want the truthful state of what mounts the container sees.

[I also don't like the idea of joining the user namespace of a malicious
container unless it's necessary but that's probably just needless
paranoia more than anything -- since you're not joining the pidns you
aren't trivially addressable by a malicious container.]

> // Ensure that we are non-dumpable. Together with
> // commit bfedb589252c, this ensures that container root
> // can't trace our child once it enters the container.
> // My patch
> //
> // would make this unnecessary, but that patch didn't
> // land because Eric nacked it (for political reasons,
> // because people incorrectly claimed that this was a
> // security fix):

Unless I'm very much mistaken this was fixed by bfedb589252c ("mm: Add a
user_ns owner to mm_struct and fix ptrace permission checks"). If you
join a user namespace then processes within that user namespace won't
have ptrace_may_access() permissions because your mm is owned by an
ancestor user namespace -- only after exec() will you be traceable.

We still use PR_SET_DUMPABLE in runc but that's because we support older
kernels (and people don't use user namespaces under Docker) but with
user namespaces this should not be required anymore.

Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

  parent reply	other threads:[~2018-10-01  5:44 UTC|newest]

Thread overview: 53+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-09-29 10:34 [PATCH 0/3] namei: implement various scoping AT_* flags Aleksa Sarai
2018-09-29 10:34 ` [PATCH 1/3] namei: implement O_BENEATH-style " Aleksa Sarai
2018-09-29 14:48   ` Christian Brauner
2018-09-29 15:34     ` Aleksa Sarai
2018-09-30  4:38   ` Aleksa Sarai
2018-10-01 12:28   ` Jann Horn
2018-10-01 13:00     ` Christian Brauner
2018-10-01 16:04       ` Aleksa Sarai
2018-10-04 17:20         ` Christian Brauner
2018-09-29 13:15 ` [PATCH 2/3] namei: implement AT_THIS_ROOT chroot-like path resolution Aleksa Sarai
2018-09-29 13:15   ` [PATCH 3/3] selftests: vfs: add AT_* path resolution tests Aleksa Sarai
2018-09-29 16:35   ` [PATCH 2/3] namei: implement AT_THIS_ROOT chroot-like path resolution Jann Horn
2018-09-29 17:25     ` Andy Lutomirski
2018-10-01  9:46       ` Aleksa Sarai
2018-10-01  5:44     ` Aleksa Sarai [this message]
2018-10-01 10:13       ` Jann Horn
2018-10-01 16:18         ` Aleksa Sarai
2018-10-04 17:27           ` Christian Brauner
2018-10-01 10:42       ` Christian Brauner
2018-10-01 11:29         ` Jann Horn
2018-10-01 12:35           ` Christian Brauner
2018-10-01 13:55       ` Bruce Fields
2018-10-01 14:28       ` Andy Lutomirski
2018-10-02  7:32         ` Aleksa Sarai
2018-10-03 22:09           ` Andy Lutomirski
2018-10-06 20:56           ` Florian Weimer
2018-10-06 21:49             ` Christian Brauner
2018-10-01 14:00     ` Christian Brauner
2018-10-04 16:26     ` Aleksa Sarai
2018-10-04 17:31       ` Christian Brauner
2018-10-04 18:26       ` Jann Horn
2018-10-05 15:07         ` Aleksa Sarai
2018-10-05 15:55           ` Jann Horn
2018-10-06  2:10             ` Aleksa Sarai
2018-10-08 10:50               ` Jann Horn
2018-09-29 14:25 ` [PATCH 0/3] namei: implement various scoping AT_* flags Andy Lutomirski
2018-09-29 15:45   ` Aleksa Sarai
2018-09-29 16:34     ` Andy Lutomirski
2018-09-29 19:44       ` Matthew Wilcox
2018-09-29 14:38 ` Christian Brauner
2018-09-30  4:44   ` Aleksa Sarai
2018-09-30 13:54 ` Alban Crequy
2018-09-30 14:02   ` Christian Brauner
2018-09-30 19:45 ` Mickaël Salaün
2018-09-30 21:46   ` Jann Horn
2018-09-30 22:37     ` Mickaël Salaün
2018-10-01 20:14       ` James Morris
2018-10-01  4:08 ` Dave Chinner
2018-10-01  5:47   ` Aleksa Sarai
2018-10-01  6:14     ` Dave Chinner
2018-10-01 13:28 ` David Laight
2018-10-01 16:15   ` Aleksa Sarai
2018-10-03 13:21     ` David Laight

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20181001054246.gfinmx3api7kjhmc@ryuk \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).