Re: [RFC PATCH] Minimal non-child process exit notification support

From: Daniel Colascione <dancol@google.com>
To: David Laight <David.Laight@aculab.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"timmurray@google.com" <timmurray@google.com>,
	"joelaf@google.com" <joelaf@google.com>
Subject: Re: [RFC PATCH] Minimal non-child process exit notification support
Date: Wed, 31 Oct 2018 14:48:34 +0000	[thread overview]
Message-ID: <CAKOZueuuAGSDwXc2jdZM_=EJdsG0HeGZFLaFWDFNHLLKXJpwbg@mail.gmail.com> (raw)
In-Reply-To: <4ed9af67cf4a46708905d3f392344bcb@AcuMS.aculab.com>

On Wed, Oct 31, 2018 at 2:25 PM, David Laight <David.Laight@aculab.com> wrote:
> From: Daniel Colascione
>> Sent: 31 October 2018 12:56
>> On Wed, Oct 31, 2018 at 12:27 PM, David Laight <David.Laight@aculab.com> wrote:
>> > From: Daniel Colascione
>> >> Sent: 29 October 2018 17:53
>> >>
>> >> This patch adds a new file under /proc/pid, /proc/pid/exithand.
>> >> Attempting to read from an exithand file will block until the
>> >> corresponding process exits, at which point the read will successfully
>> >> complete with EOF.  The file descriptor supports both blocking
>> >> operations and poll(2). It's intended to be a minimal interface for
>> >> allowing a program to wait for the exit of a process that is not one
>> >> of its children.
>> >
>> > Why do you need an extra file?
>>
>> Because no current file suffices.
>
> That doesn't stop you making something work on any/all of the existing files.

Why overload?

>> > It ought to be possible to use poll() to wait for POLLERR having set
>> > 'events' to zero on any of the nodes in /proc/pid - or even on
>> > the directory itself.
>>
>> That doesn't actually work today. And waiting on a directory with
>> POLLERR would be very weird, since directories in general don't do
>> things like blocking reads or poll support. A separate file with
>> self-contained, well-defined semantics is cleaner.
>
> Device drivers will (well ought to) return POLLERR when a device
> is removed.
> Making procfs behave the same way wouldn't be too stupid.

That's overkill. You'd need to add poll support to files throughout
/proc/pid, and I don't think doing so would add any new capabilities
over keeping the changes localized to one new place.

>> > Indeed, to avoid killing the wrong process you need to have opened
>> > some node of /proc/pid/* (maybe cmdline) before sending the kill
>> > signal.
>>
>> The kernel really needs better documentation of the semantics of
>> procfs file descriptors. You're not the only person to think,
>> mistakenly, that keeping a reference to a /proc/$PID/something FD
>> reserves $PID and prevents it being used for another process. Procfs
>> FDs do no such thing. kill(2) is unsafe whether or not
>> /proc/pid/cmdline or any other /proc file is open.
>
> Interesting.
> Linux 'fixed' the problem of pid reuse in the kernel by adding (IIRC)
> 'struct pid' that reference counts the pid stopping reuse.

Struct pid doesn't stop PID reuse. It just allows the kernel to
distinguish between a stale and current reference to a given PID. In a
sense, the "struct pid*" is the "real" name of a process and the
numeric PID is just a convenient way to find a struct pid.

> But since the pids are still allocated sequentially userspace can
> still reference a pid that is freed and immediately reused.
> I'd have thought that procfs nodes held a reference count on the 'struct pid'.
> There's probably no reason why it shouldn't.
>
> Annoyingly non-GPL drivers can't release references to 'struct pid' so
> are very constrained about which processes they can signal.

If I had my way, I'd s/EXPORT_SYMBOL_GPL/EXPORT_SYMBOL/ in pid.c.
These routines seem generally useful. As an alternative, I suppose
drivers could just use the new /proc/pid/kill interface, with
sufficient contortions, just as userspace can.

> I also managed to use a stale 'struct pid' and kill the wrong process
> - much more likely that the pid number being reused.

That shouldn't be possible, unless by "stale 'struct pid'" you mean a
struct pid* referring to a struct pid that's been released and reused
for some other object (possibly a different struct pid instances), and
that's UB.

> If you look at the NetBSD pid allocator you'll see that it uses the
> low pid bits to index an array and the high bits as a sequence number.
> The array slots are also reused LIFO, so you always need a significant
> number of pid allocate/free before a number is reused.
> The non-sequential allocation also makes it significantly more difficult
> to predict when a pid will be reused.
> The table size is doubled when it gets nearly full.

NetBSD is still just papering over the problem. The real issue is that
the whole PID-based process API model is unsafe, and a clever PID
allocator doesn't address the fundamental race condition. As long as
PID reuse is possible at all, there's a potential race condition, and
correctness depends on hope. The only way you could address the PID
race problem while not changing the Unix process API is by making
pid_t ridiculously wide so that it never wraps around.