On Tue, Jan 19, 2016 at 04:47:32PM -0600, Eric W. Biederman wrote:
> Dan Carpenter writes:
>
> > I like to look back over old CVEs to see how we could do better.  Here
> > is the list from 2015.  I got most of this information from the Ubuntu
> > CVE tracker.  Thanks Ubuntu!.  If it doesn't have a hash that means it
> > might not be fixed yet.
> >
> > CVE-2015-8709 : ptrace: race in user namespaces let's users trace root processes
>
> As this isn't a kernel bug,

I agree that it's not a kernel bug and not a kernel race - userspace
developers assumed security guarantees that the kernel didn't actually
provide. However, I think that the kernel is missing documentation here
and that namespaces are designed somewhat unfortunately.

As far as I can tell, a container that can be created and securely,
robustly entered by an unprivileged user would have to work like this
under the current rules:

To create the container:

 - setsid [prevent tty pushback via /dev/tty]
 - set up tty IO forwarding if necessary [prevents tty pushback,
   possibly additional filtering]
 - unshare(CLONE_NEWUSER) to create a "purgatory" user ns. Map the
   container owner to uid 0, map all uids that should be mapped into
   the container (including the container root) to 1 and higher
   (where 1 is the container root).
 - stash FD to the purgatory user namespace somewhere in the outer ns
 - drop all privileges (open fds, ...)
 - setresuid(1,1,1) [still protected against ptrace by nondumpability]
 - unshare(CLONE_NEWUSER) to create the container's user ns
   [From here on, we can be ptraced by the ns root user from outside.
   The ns root user could ptrace us from outside at this point and
   see the outer namespaces through us, but that's okay, he'd have
   to already be in the outer user ns for that.]
 - set up other namespaces for the container
 - stash FDs to the container namespaces in the purgatory ns
 - let a process in the purgatory map the container uids and gids
 - do security-relevant setup work (set up bind mounts, ...)
   [be careful here, don't trust any files in container-controlled
   filesystem parts]
 - do security-irrelevant setup work
 - execlp("init")

Then, to enter the container:

 - setsid [prevent tty pushback via /dev/tty]
 - set up tty IO forwarding if necessary [prevents tty pushback,
   possibly additional filtering]
 - enter the purgatory user ns, referenced through an FD
 - setresuid(1, 1, 1) [still protected against ptrace by
   nondumpability]
 - enter container namespaces, but not the user namespace yet
   [We don't really trust the namespace FDs supplied by the setup
   process because they were sent after the ns root user gained
   ptrace access, but that's okay because we can only move downward
   using setns(), so we end up in namespaces below the purgatory
   that are owned by the namespace root. That's good enough.]
 - drop privileges (open fds, ...)
 - enter container user namespace [ns root gains ptrace access]

The purgatory user ns is necessary for two reasons. First, without
privileges in the container's parent user namespace, it's not possible
to switch to the container root uid prior to entering it (except with
an ugly hack involving a temporary namespace, newuidmap and a (possibly
temporary) setuid binary). Second, and more importantly, even given
access to the container's root uid, it's not possible to actually enter
the container without having the container owner's euid unless you have
CAP_SYS_ADMIN in the outer namespace.

(Of course, this could be simplified with a setuid root helper, but I
don't think anyone wants more of those to be necessary.)

> and is not a race, and no one has even
> bothered to see if any userspace processes are this stupid I don't even
> think that qualifies as a CVE.

I know of at least two projects that enter user namespaces without the
necessary care; one of them is LXC.

> There is room for improvement in this area but I don't see how this
> qualifies as a CVE.

I think I agree with that.
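
To make the two-level uid mapping from the creation steps a bit more
concrete, here is a rough sketch of what the two maps could look like.
It only formats the text that would be written to /proc/<pid>/uid_map
(one "inside-id outside-id length" triple per line) and does not call
unshare() or write anything. The function names and the example numbers
(owner uid 1000, a subordinate range of 65536 uids starting at 100000)
are invented for illustration. Note that writing the multi-line
purgatory map needs privilege in the parent namespace, e.g. via
newuidmap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: uid_map text for the "purgatory" user ns.
 * Purgatory uid 0 is the container owner; purgatory uids 1.. are the
 * uids destined for the container (purgatory uid 1 = container root).
 * Returns the string length, or -1 if buf is too small.
 */
int purgatory_uid_map(char *buf, size_t size, unsigned owner_uid,
                      unsigned sub_start, unsigned sub_count)
{
    int n = snprintf(buf, size, "0 %u 1\n1 %u %u\n",
                     owner_uid, sub_start, sub_count);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}

/*
 * Hypothetical helper: uid_map text for the container's user ns, as
 * written by a process in the purgatory. Container uid 0 maps to
 * purgatory uid 1, and so on for `count` uids.
 * Returns the string length, or -1 if buf is too small.
 */
int container_uid_map(char *buf, size_t size, unsigned count)
{
    int n = snprintf(buf, size, "0 1 %u\n", count);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}

/* Usage example with made-up uids; asserts document the expected text. */
void uid_map_example(void)
{
    char buf[64];

    assert(purgatory_uid_map(buf, sizeof buf, 1000, 100000, 65536) > 0);
    assert(strcmp(buf, "0 1000 1\n1 100000 65536\n") == 0);

    assert(container_uid_map(buf, sizeof buf, 65536) > 0);
    assert(strcmp(buf, "0 1 65536\n") == 0);
}
```

With these maps, the container owner appears only as uid 0 of the
purgatory and has no uid inside the container at all, which is the
point of the extra namespace level.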