linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Aleksa Sarai <cyphar@cyphar.com>
To: Joel Fernandes <joel@joelfernandes.org>
Cc: Daniel Colascione <dancol@google.com>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Tim Murray <timmurray@google.com>,
	Suren Baghdasaryan <surenb@google.com>
Subject: Re: [RFC PATCH] Implement /proc/pid/kill
Date: Wed, 31 Oct 2018 09:49:08 +1100	[thread overview]
Message-ID: <20181030224908.5rsldg4jsos7o5sa@yavin> (raw)
In-Reply-To: <20181030223343.GB105735@joelaf.mtv.corp.google.com>

[-- Attachment #1: Type: text/plain, Size: 4748 bytes --]

On 2018-10-30, Joel Fernandes <joel@joelfernandes.org> wrote:
> > > [...] 
> > > > > > (Unfortunately
> > > > > > there are lots of things that make it a bit difficult to use /proc/$pid
> > > > > > exclusively for introspection of a process -- especially in the context
> > > > > > of containers.)
> > > > > 
> > > > > Tons of things already break without a working /proc. What do you have in mind?
> > > > 
> > > > Heh, if only that was the only blocker. :P
> > > > 
> > > > The basic problem is that currently container runtimes either depend on
> > > > some non-transient on-disk state (which becomes invalid on machine
> > > > reboots or dead processes and so on), or on long-running processes that
> > > > keep file descriptors required for administration of a container alive
> > > > (think O_PATH to /dev/pts/ptmx to avoid malicious container filesystem
> > > > attacks). Usually both.
> > > > 
> > > > What would be really useful would be having some way of "hiding away" a
> > > > mount namespace (of the pid1 of the container) that has all of the
> > > > information and bind-mounts-to-file-descriptors that are necessary for
> > > > administration. If the container's pid1 dies all of the transient state
> > > > has disappeared automatically -- because the stashed mount namespace has
> > > > died. In addition, if this was done the way I'm thinking with (and this
> > > > is the contentious bit) hierarchical mount namespaces you could make it
> > > > so that the pid1 could not manipulate its current mount namespace to
> > > > confuse the administrative process. You would also then create an
> > > > intermediate user namespace to help with several race conditions (that
> > > > have caused security bugs like CVE-2016-9962) we've seen when joining
> > > > containers.
> > > > 
> > > > Unfortunately this all depends on hierarchical mount namespaces (and
> > > > note that this would just be that NS_GET_PARENT gives you the mount
> > > > namespace that it was created in -- I'm not suggesting we redesign peers
> > > > or anything like that). This makes it basically a non-starter.
> > > > 
> > > > But if, on top of this ground-work, we then referenced containers
> > > > entirely via an fd to /proc/$pid then you could also avoid PID reuse
> > > > races (as well as being able to find out implicitly whether a container
> > > > has died thanks to the error semantics of /proc/$pid). And that's the
> > > > way I would suggest doing it (if we had these other things in place).
> > > 
> > > I didn't fully follow exactly what you mean. If you can explain for the
> > > layman who doesn't know much experience with containers..
> > > 
> > > Are you saying that keeping open a /proc/$pid directory handle is not
> > > sufficient to prevent PID reuse while the proc entries under /proc/$pid are
> > > being looked into? If its not sufficient, then isn't that a bug? If it is
> > > sufficient, then can we not just keep the handle open while we do whatever we
> > > want under /proc/$pid ?
> > 
> > Sorry, I went on a bit of a tangent about various internals of container
> > runtimes. My main point is that I would love to use /proc/$pid because
> > it makes reuse handling very trivial and is always correct, but that
> > there are things which stop us from being able to use it for everything
> > (which is what my incoherent rambling was on about).
> 
> Ok thanks. So I am guessing if the following sequence works, then Dan's patch is not
> needed.
> 
> 1. open /proc/<pid> directory
> 2. inspect /proc/<pid> or do whatever with <pid>
> 3. Issue the kill on <pid>
> 4. Close the /proc/<pid> directory opened in step 1.
> 
> So unless I missed something, the above sequence will not cause any PID reuse
> races.

(Sorry, I misunderstood your original question.)

The problem is that holding /proc/$pid doesn't stop the PID from dying
and being reused. The benefit of holding open /proc/$pid is that you
will get an error if you try to use it *after* the PID has died -- which
means that you don't need to worry about explicitly checking for PID
reuse if you are only operating with the file descriptor and not the
PID.

So that sequence won't always work. There is a race where the pid might
die and be recycled by the time you call kill(2) -- after you've done
step 2. By tying step 2 and 3 together -- in this patch -- you remove
the race (since in order to resolve the "kill" procfs file VFS must
resolve the PID first -- atomically).

Though this race window is likely very tiny, and I wonder how much PID
churn you really need to hit it.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

  reply	other threads:[~2018-10-30 22:49 UTC|newest]

Thread overview: 54+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-10-29 22:10 [RFC PATCH] Implement /proc/pid/kill Daniel Colascione
2018-10-30  3:21 ` Joel Fernandes
2018-10-30  8:50   ` Daniel Colascione
2018-10-30 10:39     ` Christian Brauner
2018-10-30 10:40       ` Christian Brauner
2018-10-30 10:48         ` Daniel Colascione
2018-10-30 11:04           ` Christian Brauner
2018-10-30 11:12             ` Daniel Colascione
2018-10-30 11:19               ` Christian Brauner
2018-10-31  5:00                 ` Eric W. Biederman
2018-10-30 17:01     ` Joel Fernandes
2018-10-30  5:00 ` Aleksa Sarai
2018-10-30  9:05   ` Daniel Colascione
2018-10-30 20:45     ` Aleksa Sarai
2018-10-30 21:42       ` Joel Fernandes
2018-10-30 22:23         ` Aleksa Sarai
2018-10-30 22:33           ` Joel Fernandes
2018-10-30 22:49             ` Aleksa Sarai [this message]
2018-10-31  0:42               ` Joel Fernandes
2018-10-31  1:59                 ` Daniel Colascione
2018-10-30 23:10             ` Daniel Colascione
2018-10-30 23:23               ` Christian Brauner
2018-10-30 23:55                 ` Daniel Colascione
2018-10-31  2:56                 ` Aleksa Sarai
2018-10-31  4:24                   ` Joel Fernandes
2018-11-01 20:40                     ` Joel Fernandes
2018-11-02  9:46                       ` Christian Brauner
2018-11-02 14:34                         ` Serge E. Hallyn
2018-10-31  0:57               ` Joel Fernandes
2018-10-31  1:56                 ` Daniel Colascione
2018-10-31  4:47   ` Eric W. Biederman
2018-10-31  4:44 ` Eric W. Biederman
2018-10-31 12:44   ` Oleg Nesterov
2018-10-31 13:27     ` Daniel Colascione
2018-10-31 15:10       ` Oleg Nesterov
2018-10-31 15:16         ` Daniel Colascione
2018-10-31 15:49           ` Oleg Nesterov
2018-11-01 11:53       ` David Laight
2018-11-01 15:50         ` Daniel Colascione
2018-10-31 14:37 ` [PATCH v2] " Daniel Colascione
2018-10-31 15:05   ` Joel Fernandes
2018-10-31 17:33     ` Aleksa Sarai
2018-10-31 21:47       ` Joel Fernandes
2018-10-31 15:59 ` [PATCH v3] " Daniel Colascione
2018-10-31 17:54   ` Tycho Andersen
2018-10-31 18:00     ` Daniel Colascione
2018-10-31 18:17       ` Tycho Andersen
2018-10-31 19:33         ` Daniel Colascione
2018-10-31 20:06           ` Tycho Andersen
2018-11-01 11:33           ` David Laight
2018-11-12  1:19             ` Eric W. Biederman
2018-10-31 16:22 ` [RFC PATCH] " Jann Horn
2018-11-01  4:53   ` Andy Lutomirski
2018-11-12 23:13 ` Pavel Machek

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20181030224908.5rsldg4jsos7o5sa@yavin \
    --to=cyphar@cyphar.com \
    --cc=dancol@google.com \
    --cc=joel@joelfernandes.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=surenb@google.com \
    --cc=timmurray@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).