On 2019-05-11, Andy Lutomirski wrote:
> >> I've lost track of the context here, but it seems to me that
> >> mitigating attacks involving accidental following of /proc links
> >> shouldn't depend on dumpability. What's the actual problem this is
> >> trying to solve again?
> >
> > The one actual security problem that I've seen related to this is
> > CVE-2019-5736. There is a write-up of it at
> >
> > under "Successful approach", but it goes more or less as follows:
> >
> > A container is running that doesn't use user namespaces (because for
> > some reason I don't understand, apparently some people still do that).
> > An evil process is running inside the container with UID 0 (as in,
> > GLOBAL_ROOT_UID); so if the evil process inside the container was able
> > to reach root-owned files on the host filesystem, it could write into
> > them.
> >
> > The container engine wants to spawn a new process inside the container.
> > It forks off a child that joins the container's namespaces (including
> > PID and mount namespaces), and then the child calls execve() on some
> > path in the container.
>
> I think that, at this point, the task should be considered owned by
> the container. Maybe we should have a better API than execve() to
> execute a program in a safer way, but fiddling with dumpability seems
> like a band-aid. In fact, the process is arguably pwned even *before*
> execve.

Yeah, execve is just the vector (though in this case it's done in order
to clear mm->dumpable). An earlier CVE (CVE-2016-9962) was very similar
but was attacking a dirfd that runc had open into the container (LXC had
a very similar bug too) -- setting !mm->dumpable was one of the
workarounds we had for this.

> A better “spawn” API should fix this. In the mean time, I think it
> should be assumed that, if you join a container’s namespaces, you are
> at its mercy.

This is generally how we treat containers as runtime authors, but it's
not a trivial thing to get right.
In many cases the kernel APIs are working against you -- Christian and
myself have written a fair few patches to fix holes in the kernel APIs
so we can avoid these kinds of assumptions.

But yes, one of the most risky parts of a container runtime is when
you're attaching to a running container because all of the helpful
introspection APIs in /proc/ suddenly become a security nightmare. A
better "spawn a process in these namespaces" API might help improve the
situation (or at least, I hope it would).

> > - You can use /proc/*/exe to get a writable fd.
>
> This is IMO the real bug.

I will try to send an RFC of the patchset I have for this next week or
so. Funnily enough, currently /proc/*/exe has the write bit set in its
"mode" (my series fixes this).

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
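
For anyone who wants to look at the "mode" of /proc/*/exe from userspace, here is a minimal sketch in Python. It assumes a Linux system with /proc mounted; the helper name link_mode is made up for illustration, and os.lstat() is used so that the link itself is inspected rather than the binary it points at.

```python
import os
import stat


def link_mode(path):
    """Return the ls-style mode string of the link itself, not its target.

    link_mode is a hypothetical helper name used only for this sketch.
    """
    st = os.lstat(path)  # lstat() does not follow the (magic) link
    return stat.filemode(st.st_mode)


if __name__ == "__main__":
    path = "/proc/self/exe"
    # Only meaningful on Linux with /proc mounted; skip quietly otherwise.
    if os.path.lexists(path):
        print(path, link_mode(path))
    else:
        print(path, "not available on this system")
```

Ordinary symlinks also report rwx for everyone, so the mode string on its own is only a hint and not a statement of what an open() through the link will be allowed to do.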