On 2018-11-18, Daniel Colascione <dancol@google.com> wrote:
> > The gist is to have file descriptors for processes which is obviously not a new
> > idea. This has been done before in other OSes and it has been tried before in
> > Linux [2], [3] (Thanks to Kees for pointing out these patches.). So I want to
> > make it very clear that I'm not laying claim to this being my or even a novel
> > idea in any way. However, I want to diverge from previous approaches with my
> > suggestion. (Though I can't be sure that there's not something similar in other
> > OSes already.)
> 
> Windows works basically as you describe. You can create a process in a
> suspended state, configure it however you want, then let it run.
> CreateProcess (and even more so, NtCreateProcess) also provide a rich
> (and *extensible*) interface for pre-creation process configuration.
> 
> > One of the main motivations for having procfds is to have a race-free way of
> > configuring, starting, polling, and killing a process. Basically, a process
> > lifecycle api if you want to think about it that way. The api should also be
> > easily extendable in the future to avoid running into the limitations we
> > currently see with the clone*() syscall(s) again.
> >
> > One of the crucial points of the api is to *separate the configuration
> > of a process through a procfd from actually creating the process*.
> > This is a crucial property expressed in the open*() system calls. First, get a
> > stable handle on an object then allow for ways to configure it. As such the
> > procfd api shares the same insight with Al's and David's new mount api.
> > (Fwiw, Andy also pointed out similarities with posix_spawn().)
> > What I envisioned was to have the following syscalls (multiple name suggestions):
> >
> > 1. int process_open / proc_open / procopen
> > 2. int process_config / proc_config / procconfig or ioctl()-based
> > 3. int process_info / proc_info / procinfo or ioctl()-based
> > 4. int process_manage / proc_manage / procmanage or ioctl()-based
> 
> The API you've proposed seems fine to me, although I'd either 1)
> consolidate operations further into one system call, or 2) separate
> the different management operations into more and different system
> calls that can be audited independently. The grouping you've proposed
> seems to have the worst aspects of API splitting and API multiplexing.
> But I have no objection to it in spirit.

I think combining it all into one API is going to be a soft re-invention
of ioctls, but specifically for procfds. This would be an improvement
over just ioctls (since the current ioctl namespacing is based on
well-behaved drivers and hoping we never have more than 256 ioctl
drivers), but I'm not sure it would help make the API nicer than having
separate syscalls (we'd have to do something similar to bpf(2) which I'm
not a huge fan of).
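
To make the 256 figure concrete: as far as I know, the limit comes from
the 8-bit "type" (magic) field in the ioctl command encoding. Here is a
rough sketch of that encoding, in Python for brevity. The field widths
are the ones from asm-generic/ioctl.h (a few architectures use different
size/direction widths), and the helper names ioc() and ioc_type() are
mine, not kernel API:

```python
# Field widths as in asm-generic/ioctl.h (assumption: generic layout;
# some architectures use 13 size bits and 3 direction bits instead).
IOC_NRBITS = 8
IOC_TYPEBITS = 8
IOC_SIZEBITS = 14
IOC_DIRBITS = 2

IOC_NRSHIFT = 0
IOC_TYPESHIFT = IOC_NRSHIFT + IOC_NRBITS
IOC_SIZESHIFT = IOC_TYPESHIFT + IOC_TYPEBITS
IOC_DIRSHIFT = IOC_SIZESHIFT + IOC_SIZEBITS

IOC_NONE, IOC_WRITE, IOC_READ = 0, 1, 2

# Number of distinct "type" (magic) values available to all drivers.
MAX_IOC_TYPES = 1 << IOC_TYPEBITS


def ioc(direction, type_, nr, size):
    """Pack an ioctl command number the way the _IOC() macro does."""
    fields = (
        (direction, IOC_DIRBITS),
        (type_, IOC_TYPEBITS),
        (nr, IOC_NRBITS),
        (size, IOC_SIZEBITS),
    )
    for value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError("field value %r does not fit in %d bits" % (value, bits))
    return (
        (direction << IOC_DIRSHIFT)
        | (size << IOC_SIZESHIFT)
        | (type_ << IOC_TYPESHIFT)
        | (nr << IOC_NRSHIFT)
    )


def ioc_type(cmd):
    """Extract the driver "type" (magic) byte from a command number."""
    return (cmd >> IOC_TYPESHIFT) & (MAX_IOC_TYPES - 1)
```

Every driver has to pick its magic from those 8 bits, and nothing but
convention stops two drivers from picking the same one, which is why I
say the namespacing depends on drivers being well-behaved.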

> That said, while I do want to fix process configuration and startup
> generally, I want to fix specific holes in the existing API surface
> first. The two patches I've already sent do that, and this work
> shouldn't wait on an ideal larger process-API overhaul that may or may
> not arrive. Based on previous history, I suspect that an API of the
> scope you're proposing would take years to overcome all LKML
> objections and land. I don't want to wait that long when we can make
> smaller fixes that would not conflict with the general architecture.

I believe this is precisely what Christian is trying to do with this
patch (and you say as much later in your mail).

I think that adding all of {sighand,sighand_exitcode,kill,...} would not
make it any easier to land a much larger API change. We should instead
think about the API we want at the end of the day, and then land smaller
changes which are long-term compatible (and won't just become deprecated
APIs -- there are far too many of them, let's not add more needlessly).

If the plan is to have an ioctl API we should merge minor ioctls first.
If the idea is to have an explosion of syscalls, then we should merge
minor syscalls first. We shouldn't merge procfiles if the end goal is to
not use them.

> Next, I want to merge my exithand proposal, or something like it. It's
> likewise a simple change that, in a minimal way, addresses a
> longstanding API deficiency. I'm very strongly against the
> POLLERR-on-directory variant of the idea.

I agree with you on this need. I will admit I do somewhat like the EOF
solution (mainly because it seamlessly deals with the multi-reader case)
but I'm still not sure I like /proc/$pid/exit_sighand. As mentioned in
the other discussion, ideally we would be only ever operating with the 

An ugly strawman of an alternative would be an interface that gave you
an fd that you could similarly wait-until-EOF on, but that's probably
not a major API improvement unless we also make said API provide exit
status information in a way that works with the
multiple-readers-with-different-creds usecase.
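
For anyone unfamiliar with the wait-until-EOF idea, the closest thing
you can do today is the old pipe trick: the child holds the only write
end, and every holder of the read end sees EOF once the child is gone.
A minimal sketch follows (in Python for brevity; spawn_with_exit_fd()
and wait_for_eof() are hypothetical helper names, and this is only an
analogy, not the interface being proposed):

```python
import os
import select


def spawn_with_exit_fd(child_fn):
    """Fork a child; return (pid, read_fd). read_fd hits EOF on child exit."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: keep only the write end, so it is closed when we exit.
        os.close(read_fd)
        try:
            child_fn()
        finally:
            os._exit(0)
    # Parent: drop our copy of the write end, or EOF would never arrive.
    os.close(write_fd)
    return pid, read_fd


def wait_for_eof(fd, timeout_ms=None):
    """Block until fd reports EOF. Return False if the timeout expires."""
    poller = select.poll()
    poller.register(fd, select.POLLIN | select.POLLHUP)
    if not poller.poll(timeout_ms):
        return False
    # A zero-length read means every write end has been closed.
    return os.read(fd, 1) == b""
```

Several readers can each hold a dup() of read_fd (or receive it over
SCM_RIGHTS) and all of them see the same EOF. What none of them get is
the exit status, which is exactly the gap I mean above.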

One other thing I think we should eventually consider is to provide an
API which pings a listener whenever a process does an execve() (and
possibly fork()). This is something you can get from FreeBSD's kqueue --
and is something that we have in a really neutered form in the "proc
connector". But of course we can discuss this separately, especially if
we have an extensible API idea in mind when we start.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>