Re: [RFC PATCH v2] Minimal non-child process exit notification support

From: Aleksa Sarai <cyphar@cyphar.com>
To: Daniel Colascione <dancol@google.com>
Cc: linux-kernel <linux-kernel@vger.kernel.org>,
	Tim Murray <timmurray@google.com>,
	Joel Fernandes <joelaf@google.com>
Subject: Re: [RFC PATCH v2] Minimal non-child process exit notification support
Date: Thu, 1 Nov 2018 21:47:51 +1100
Message-ID: <20181101104750.q23rb3hczx2tzakq@yavin>
In-Reply-To: <CAKOZueszfoSM0pxhmuFLOuPmJqSfYXxgutstyCgqxAyoUi4h3w@mail.gmail.com>

On 2018-11-01, Daniel Colascione <dancol@google.com> wrote:
> On Thu, Nov 1, 2018 at 7:00 AM, Aleksa Sarai <cyphar@cyphar.com> wrote:
> > On 2018-10-29, Daniel Colascione <dancol@google.com> wrote:
> >> This patch adds a new file under /proc/pid, /proc/pid/exithand.
> >> Attempting to read from an exithand file will block until the
> >> corresponding process exits, at which point the read will successfully
> >> complete with EOF.  The file descriptor supports both blocking
> >> operations and poll(2). It's intended to be a minimal interface for
> >> allowing a program to wait for the exit of a process that is not one
> >> of its children.
> >>
> >> Why might we want this interface? Android's lmkd kills processes in
> >> order to free memory in response to various memory pressure
> >> signals. It's desirable to wait until a killed process actually exits
> >> before moving on (if needed) to killing the next process. Since the
> >> processes that lmkd kills are not lmkd's children, lmkd currently
> >> lacks a way to wait for a process to actually die after being sent
> >> SIGKILL; today, lmkd resorts to polling the proc filesystem pid
> >> entry. This interface allows lmkd to give up polling and instead block
> >> and wait for process death.
> >
> > I agree with the need for this interface (with a few caveats), but there
> > are a few points I'd like to make:
> >
> >  * I don't think that making a new procfile is necessary. When you open
> >    /proc/$pid you already have a handle for the underlying process, and
> >    you can already poll to check whether the process has died (fstatat
> >    fails for instance). What if we just used an inotify event to tell
> >    userspace that the process has died -- to avoid userspace doing a
> >    poll loop?
> 
> I'm trying to make a simple interface. The basic unix data access
> model is that a userspace application wants information (e.g., next
> bunch of bytes in a file, next packet from a socket, next signal from
> a signal FD, etc.), and tells the kernel so by making a system call on
> a file descriptor. Ordinarily, the kernel returns to userspace with
> the requested information when it's available, potentially after
> blocking until the information is available. Sometimes userspace
> doesn't want to block, so it adds O_NONBLOCK to the open file mode,
> and in this mode, the kernel can tell the userspace requestor "try
> again later", but the source of truth is still that
> ordinarily-blocking system call. How does userspace know when to try
> again in the "try again later" case? By using
> select/poll/epoll/whatever, which suggests a good time for that "try
> again later" retry, but is not dispositive about it, since that
> ordinarily-blocking system call is still the sole source of truth, and
> that poll is allowed to report spurious readability.
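
For concreteness, the model described in the quoted paragraph (poll is
only a hint, the ordinarily-blocking call stays the source of truth)
looks roughly like the sketch below. It runs on a plain pipe, not on the
proposed exithand file, and set_nonblocking() and read_when_ready() are
names made up for this illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: add O_NONBLOCK to an already-open descriptor. */
int set_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Hypothetical helper: read() is always tried first and decides the
 * outcome; poll() is only used to pick a good time to retry.
 * Returns bytes read (> 0), 0 on EOF, or -1 on error. A timeout is
 * reported as -1 with errno set to ETIMEDOUT.
 */
ssize_t read_when_ready(int fd, void *buf, size_t len, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };

	assert(buf != NULL);

	for (;;) {
		ssize_t n = read(fd, buf, len);

		if (n >= 0)
			return n;		/* data, or EOF when n == 0 */
		if (errno == EINTR)
			continue;
		if (errno != EAGAIN)
			return -1;		/* a real error */

		/* "Try again later": wait for a hint, then retry the read. */
		int r = poll(&pfd, 1, timeout_ms);

		if (r == 0) {
			errno = ETIMEDOUT;
			return -1;
		}
		if (r < 0 && errno != EINTR)
			return -1;
		/* Readiness may be spurious, so loop back to read(). */
	}
}
```

With the file from the patch, a caller would presumably treat the 0
(EOF) return as "the process has exited".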

inotify gives you an event if a file or directory is deleted. A pid
dying semantically is similar to the idea of a /proc/$pid being deleted.
I don't see how a blocking read on a new procfile is simpler than using
the existing notification-on-file-events infrastructure -- not to
mention that the idea of "this file blocks until the thing we are
indirectly referencing by this file is gone" seems to me to be a really
strange interface.

Sure, it uses read(2) -- but is that the only constraint on designing
simple interfaces?
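
To show what I mean by the existing infrastructure, here is a rough
sketch of waiting for IN_DELETE_SELF on an ordinary file. Whether
/proc/$pid would deliver this event is exactly what I am proposing, so
do not read this as something that works on procfs today.
watch_delete_self() and wait_delete_self() are names invented for the
example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdlib.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Hypothetical helper: watch `path` for its own deletion.
 * Returns an inotify fd, or -1 on error. */
int watch_delete_self(const char *path)
{
	assert(path != NULL);

	int ifd = inotify_init1(IN_CLOEXEC);

	if (ifd < 0)
		return -1;
	if (inotify_add_watch(ifd, path, IN_DELETE_SELF) < 0) {
		int saved = errno;

		close(ifd);
		errno = saved;
		return -1;
	}
	return ifd;
}

/*
 * Hypothetical helper: wait up to timeout_ms for IN_DELETE_SELF.
 * Returns 1 if the watched object was deleted, 0 on timeout,
 * -1 on error.
 */
int wait_delete_self(int ifd, int timeout_ms)
{
	char buf[4096]
		__attribute__((aligned(__alignof__(struct inotify_event))));
	struct pollfd pfd = { .fd = ifd, .events = POLLIN };

	for (;;) {
		int n = poll(&pfd, 1, timeout_ms);

		if (n < 0 && errno == EINTR)
			continue;
		if (n <= 0)
			return n;	/* 0 = timeout, -1 = error */

		ssize_t len = read(ifd, buf, sizeof(buf));

		if (len < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}

		/* Walk the variable-length events returned by read(). */
		for (char *p = buf; p < buf + len; ) {
			const struct inotify_event *ev =
				(const struct inotify_event *)p;

			if (ev->mask & IN_DELETE_SELF)
				return 1;
			p += sizeof(*ev) + ev->len;
		}
	}
}
```

The inotify fd is pollable like any other, so it slots into an existing
epoll loop without a dedicated thread.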

> The event file I'm proposing is so ordinary, in fact, that it works
> from the shell. Without some specific technical reason to do something
> different, we shouldn't do something unusual.

inotify-tools are available on effectively every distribution.

> Given that we *can*, cheaply, provide a clean and consistent API to
> userspace, why would we instead want to inflict some exotic and
> hard-to-use interface on userspace instead? Asking that userspace poll
> on a directory file descriptor and, when poll returns, check by
> looking for certain errors (we'd have to spec which ones) from fstatat
> is awkward. /proc/pid is a directory. In what other context does the
> kernel ask userspace to use a directory this way?

I'm not sure you understood my proposal. I said that we need an
interface to do this, and I was trying to explain (by noting what the
current way of doing it would be) what I think the interface should be.
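
For reference, the current way of doing it looks roughly like the
sketch below: hold /proc/$pid open as a handle and keep probing it with
fstatat until the lookup fails. The function names are invented for the
example, and treating ENOENT and ESRCH as "gone" is my assumption about
which errors to expect. Note that a zombie still has its /proc entry, so
the lookup only starts failing once the process has been reaped. The
demo uses a child purely so that the lifetime is known; the handle
logic does not depend on that.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: open /proc/<pid> as a directory handle. */
int open_proc_handle(pid_t pid)
{
	char path[64];

	snprintf(path, sizeof(path), "/proc/%ld", (long)pid);
	return open(path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
}

/* Returns 1 if a lookup through the handle still works, 0 if the
 * process is gone, -1 on any other error. */
int proc_handle_alive(int dirfd)
{
	struct stat st;

	assert(dirfd >= 0);

	if (fstatat(dirfd, "stat", &st, 0) == 0)
		return 1;
	return (errno == ENOENT || errno == ESRCH) ? 0 : -1;
}

/* The poll loop itself. Returns 1 once the process is gone, 0 if it
 * is still there after max_tries probes, -1 on error. */
int poll_until_gone(int dirfd, int interval_ms, int max_tries)
{
	struct timespec ts = {
		.tv_sec = interval_ms / 1000,
		.tv_nsec = (interval_ms % 1000) * 1000000L,
	};

	for (int i = 0; i < max_tries; i++) {
		int alive = proc_handle_alive(dirfd);

		if (alive <= 0)
			return alive == 0 ? 1 : -1;
		nanosleep(&ts, NULL);
	}
	return 0;
}

/* Demo: fork a target, kill and reap it, then poll the handle.
 * Returns 1 if the handle was alive before and gone afterwards. */
int demo_kill_and_poll(void)
{
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		pause();
		_exit(0);
	}

	int dirfd = open_proc_handle(pid);
	int before = dirfd >= 0 ? proc_handle_alive(dirfd) : -1;

	kill(pid, SIGKILL);
	waitpid(pid, NULL, 0);		/* reap, so the /proc entry goes away */

	int gone = dirfd >= 0 ? poll_until_gone(dirfd, 10, 100) : -1;

	if (dirfd >= 0)
		close(dirfd);
	return (before == 1 && gone == 1) ? 1 : 0;
}
```

The sleep interval is the annoyance: it trades wakeups against latency,
which is what an event would avoid.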

To reiterate, I believe that having an inotify event (IN_DELETE_SELF on
/proc/$pid) would be in keeping with the current way of doing things
while allowing userspace to avoid all of the annoyances you just
mentioned and that I was alluding to.

I *don't* think that the current scheme of looping on fstatat is the way
it should be left. And there is an argument that inotify is not
sufficient to 

> > I'm really not a huge fan of the "blocking read" semantic (though if we
> > have to have it, can we at least provide as much information as you get
> > from proc_connector -- such as the exit status?).
> [...]
> The exit status in /proc/pid/stat is zeroed out for readers that fail
> do_task_stat's ptrace_may_access call. (Falsifying the exit status in
> stat when a privilege check fails seems like a bad idea from a
> correctness POV.)

It's not clear to me what the purpose of that field is within procfs for
*dead* processes -- which is what we're discussing here. As far as I can
tell, you will get an ESRCH when you try to read it. When testing this
it also looked like you didn't even get the exit_status as a zombie but
I might be mistaken.

So while it is masked for !ptrace_may_access, it's also zero (or
unreadable) for almost every case outside of stopped processes (AFAICS).
Am I missing something?

> Should open() on exithand perform the same ptrace_may_access privilege
> check? What if the process *becomes* untraceable during its lifetime
> (e.g., with setuid)? Should that read() on the exithand FD still yield
> a siginfo_t? Just having exithand yield EOF all the time punts the
> privilege problem to a later discussion because this approach doesn't
> leak information. We can always add an "exithand_full" or something
> that actually yields a siginfo_t.

I agree that read(2) makes this hard. I don't think we should use it.
But if we have to use it, I would like us to have feature parity with
features that FreeBSD had 18 years ago.

> Another option would be to make exithand's read() always yield a
> siginfo_t, but have the open() just fail if the caller couldn't
> ptrace_may_access it. But why shouldn't you be able to wait on other
> processes? If you can see it in /proc, you should be able to wait on
> it exiting.

I would suggest looking at FreeBSD's kevent semantics for inspiration
(or at least to see an alternative way of doing things). In particular,
EVFILT_PROC+NOTE_EXIT -- which is attached to a particular process. I
wonder what their view is on these sorts of questions.

> > Also maybe we should
> > integrate this into the exit machinery instead of this loop...
> 
> I don't know what you mean. It's already integrated into the exit
> machinery: it's what runs the waitqueue.

My mistake, I missed the last hunk of the patch.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
