Re: [RFC PATCH] Implement /proc/pid/kill

From: Joel Fernandes <joel@joelfernandes.org>
To: Aleksa Sarai <cyphar@cyphar.com>
Cc: Daniel Colascione <dancol@google.com>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Tim Murray <timmurray@google.com>,
	Suren Baghdasaryan <surenb@google.com>
Subject: Re: [RFC PATCH] Implement /proc/pid/kill
Date: Tue, 30 Oct 2018 14:42:43 -0700
Message-ID: <20181030214243.GB32621@google.com>
In-Reply-To: <20181030204501.jnbe7dyqui47hd2x@yavin>

On Wed, Oct 31, 2018 at 07:45:01AM +1100, Aleksa Sarai wrote:
[...] 
> > > (Unfortunately
> > > there are lots of things that make it a bit difficult to use /proc/$pid
> > > exclusively for introspection of a process -- especially in the context
> > > of containers.)
> > 
> > Tons of things already break without a working /proc. What do you have in mind?
> 
> Heh, if only that was the only blocker. :P
> 
> The basic problem is that currently container runtimes either depend on
> some non-transient on-disk state (which becomes invalid on machine
> reboots or dead processes and so on), or on long-running processes that
> keep file descriptors required for administration of a container alive
> (think O_PATH to /dev/pts/ptmx to avoid malicious container filesystem
> attacks). Usually both.
> 
> What would be really useful would be having some way of "hiding away" a
> mount namespace (of the pid1 of the container) that has all of the
> information and bind-mounts-to-file-descriptors that are necessary for
> administration. If the container's pid1 dies all of the transient state
> has disappeared automatically -- because the stashed mount namespace has
> died. In addition, if this was done the way I'm thinking with (and this
> is the contentious bit) hierarchical mount namespaces you could make it
> so that the pid1 could not manipulate its current mount namespace to
> confuse the administrative process. You would also then create an
> intermediate user namespace to help with several race conditions (that
> have caused security bugs like CVE-2016-9962) we've seen when joining
> containers.
> 
> Unfortunately this all depends on hierarchical mount namespaces (and
> note that this would just be that NS_GET_PARENT gives you the mount
> namespace that it was created in -- I'm not suggesting we redesign peers
> or anything like that). This makes it basically a non-starter.
> 
> But if, on top of this ground-work, we then referenced containers
> entirely via an fd to /proc/$pid then you could also avoid PID reuse
> races (as well as being able to find out implicitly whether a container
> has died thanks to the error semantics of /proc/$pid). And that's the
> way I would suggest doing it (if we had these other things in place).

I didn't fully follow what you mean. Could you explain it for the layman
who doesn't have much experience with containers?

Are you saying that keeping a /proc/$pid directory handle open is not
sufficient to prevent PID reuse while the proc entries under /proc/$pid are
being read? If it's not sufficient, then isn't that a bug? If it is
sufficient, then can we not just keep the handle open while we do whatever we
want under /proc/$pid?
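
To make the question concrete, here is a rough userspace sketch of the
pattern I have in mind: open the /proc/$pid directory once, then reach every
entry relative to that handle (openat-style) instead of re-resolving the
numeric PID through a path each time. The helper names open_proc_dir and
read_entry and the proc_root parameter are made up for illustration. The
sketch only shows the mechanics; whether holding the handle really guards
against PID reuse is exactly what I am asking.

```python
import os


def open_proc_dir(pid, proc_root="/proc"):
    """Open <proc_root>/<pid> as a directory handle and return its fd.

    proc_root is a hypothetical parameter (defaults to /proc) so the
    helper can be pointed at another directory tree when experimenting.
    """
    path = os.path.join(proc_root, str(pid))
    return os.open(path, os.O_RDONLY | os.O_DIRECTORY)


def read_entry(dir_fd, name):
    """Read an entry relative to an already-open directory fd.

    The lookup goes through dir_fd (openat-style), so the numeric PID
    is never resolved through a path again after the initial open.
    """
    fd = os.open(name, os.O_RDONLY, dir_fd=dir_fd)
    try:
        chunks = []
        while True:
            chunk = os.read(fd, 4096)
            if not chunk:
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        os.close(fd)
```

Usage would be something like dir_fd = open_proc_dir(os.getpid()), followed
by read_entry(dir_fd, "status") for each entry of interest, and os.close(dir_fd)
once done.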

- Joel