linux-kernel.vger.kernel.org archive mirror
* forkat(int pidfd), execveat(int pidfd), other awful things?
@ 2021-02-01 17:47 Jason A. Donenfeld
  2021-02-01 17:51 ` Jason A. Donenfeld
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Jason A. Donenfeld @ 2021-02-01 17:47 UTC (permalink / raw)
  To: Kernel Hardening, Andy Lutomirski; +Cc: LKML, Jann Horn, Christian Brauner

Hi Andy & others,

I was reversing some NT stuff recently and marveling over how wild and
crazy things are over in Windows-land. A few things related to process
creation caught my interest:

- It's possible to create a new process with an *arbitrary parent
process*, which means it'll then inherit various things like handles
and security attributes and tokens from that new parent process.

- It's possible to create a new process with the memory space handle
of a different process. Consider this on Linux, and you have some
abomination like `forkat(int pidfd)`.

The big question is "why!?" At first I was just amused by its presence
in NT. Everything is an object and you can usually freely mix and
match things, and it's very flexible, which is cool. But this is NT,
not Linux.

Jann and I were discussing, though, that maybe some variant of these
features might be useful to get rid of setuid executables. Imagine
something like `systemd-sudod`, forked off of PID 1 very early.
Subsequently all new processes on the system run with
PR_SET_NO_NEW_PRIVS or similar policies to prevent non-root->root
transition. Then, if you want to transition, you ask systemd-sudod (or
polkitd, or whatever else you have in mind) to make you a new process,
and it then does the various policy checks, and executes a new process
for you, with the requesting process as its parent.
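
For reference, the no_new_privs half of this scheme already exists today.
A minimal sketch, assuming Linux 3.5 or later; the helper name
enable_no_new_privs() is made up for illustration and is not part of any
proposal in this thread:

```c
#include <stdio.h>
#include <sys/prctl.h>

/*
 * Set no_new_privs on the calling thread and return the flag as read
 * back from the kernel (1 when set), or -1 on error. The flag cannot
 * be cleared again once it is set.
 */
int enable_no_new_privs(void)
{
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0) {
        perror("prctl(PR_SET_NO_NEW_PRIVS)");
        return -1;
    }
    return prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
}
```

Once set, the flag is inherited by children and preserved across execve(),
so setuid bits and file capabilities on executables no longer grant
privileges; any transition would have to go through a broker such as the
hypothetical systemd-sudod.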

So how would that work? Well, executing processes with arbitrary
parents would be part of it, as above. But we'd probably want to more
carefully control that new process. Which chroot is it in? How do
cgroups work? And so on. And ultimately this design leads to something
like ZwCreateProcess, where you have several arguments, each to a
handle to some part of the new process state, or null to be inherited
from its parent.

int execve_parent(int parent_pidfd, int root_dirfd, int cgroup_fd,
                  int namespace_fd, const char *pathname,
                  char *const argv[], char *const envp[]);

One could imagine this growing pretty unwieldy. There's also this
other design aspect of Linux that's worth considering. Namespaces and
other process-inherited resources are generally hierarchical, with
children getting the resource from their parent. This makes sense and
is simple to conceptualize. Every time we add a new thing_fd as a
pointer to one of these resources, and allow it to be used outside of
that hierarchy, it introduces a kind of "escape hatch". That might be
considered "bad design" by some; it might not be by others. Seen this
way, NT is one massive escape hatch, with pretty much everything being
an object with a handle.

But! Maybe this is nonetheless an interesting design avenue to
explore. The introduction of pidfd is sort of just the "beginning" of
that kind of design.

Is any of this interesting to you as a future of privilege escalation
and management on Linux?

Jason


* Re: forkat(int pidfd), execveat(int pidfd), other awful things?
  2021-02-01 17:47 forkat(int pidfd), execveat(int pidfd), other awful things? Jason A. Donenfeld
@ 2021-02-01 17:51 ` Jason A. Donenfeld
  2021-02-01 18:20 ` Christian Brauner
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Jason A. Donenfeld @ 2021-02-01 17:51 UTC (permalink / raw)
  To: Kernel Hardening, Andy Lutomirski; +Cc: LKML, Jann Horn, Christian Brauner

> int execve_parent(int parent_pidfd, int root_dirfd, int cgroup_fd, int
> namespace_fd, const char *pathname, char *const argv[], char *const
> envp[]);

A variant on the same scheme would be:

int execve_remote(int pidfd, int root_dirfd, int cgroup_fd,
                  int namespace_fd, const char *pathname,
                  char *const argv[], char *const envp[]);

Unpriv'd process calls fork(), and from that fork sends its pidfd
through a unix socket to systemd-sudod, which then calls execve_remote
on that pidfd.

There are a lot of (potentially very bad) ways to skin this cat.
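
The hand-off step (sending the pidfd through a unix socket) works like
passing any other file descriptor. A minimal sketch using SCM_RIGHTS;
send_fd() and recv_fd() are illustrative helper names, and execve_remote()
itself remains hypothetical:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send one fd (for example a pidfd) over a connected AF_UNIX socket.
 * Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
    char dummy = 'x';   /* one byte of ordinary data carries the control message */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {             /* union guarantees cmsghdr alignment for the buffer */
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));
    memset(&u, 0, sizeof(u));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd sent by send_fd(). Returns the new fd, or -1 on error. */
int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;
    struct msghdr msg;
    int fd;

    memset(&msg, 0, sizeof(msg));
    memset(&u, 0, sizeof(u));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiver gets its own descriptor referring to the same underlying
object, so a daemon holding the received pidfd refers to exactly the
process the sender forked, with no PID-reuse race.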


* Re: forkat(int pidfd), execveat(int pidfd), other awful things?
  2021-02-01 17:47 forkat(int pidfd), execveat(int pidfd), other awful things? Jason A. Donenfeld
  2021-02-01 17:51 ` Jason A. Donenfeld
@ 2021-02-01 18:20 ` Christian Brauner
  2021-02-01 18:29 ` Andy Lutomirski
  2021-02-01 18:32 ` forkat(int pidfd), execveat(int pidfd), other awful things? Casey Schaufler
  3 siblings, 0 replies; 13+ messages in thread
From: Christian Brauner @ 2021-02-01 18:20 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: Kernel Hardening, Andy Lutomirski, LKML, Jann Horn

On Mon, Feb 01, 2021 at 06:47:17PM +0100, Jason A. Donenfeld wrote:
> Hi Andy & others,
> 
> I was reversing some NT stuff recently and marveling over how wild and
> crazy things are over in Windows-land. A few things related to process
> creation caught my interest:
> 
> - It's possible to create a new process with an *arbitrary parent
> process*, which means it'll then inherit various things like handles
> and security attributes and tokens from that new parent process.
> 
> - It's possible to create a new process with the memory space handle
> of a different process. Consider this on Linux, and you have some
> abomination like `forkat(int pidfd)`.
> 
> The big question is "why!?" At first I was just amused by its presence
> in NT. Everything is an object and you can usually freely mix and
> match things, and it's very flexible, which is cool. But this is NT,
> not Linux.
> 
> Jann and I were discussing, though, that maybe some variant of these
> features might be useful to get rid of setuid executables. Imagine
> something like `systemd-sudod`, forked off of PID 1 very early.
> Subsequently all new processes on the system run with
> PR_SET_NO_NEW_PRIVS or similar policies to prevent non-root->root
> transition. Then, if you want to transition, you ask systemd-sudod (or
> polkitd, or whatever else you have in mind) to make you a new process,
> and it then does the various policy checks, and executes a new process
> for you as the parent of the requesting process.
> 
> So how would that work? Well, executing processes with arbitrary
> parents would be part of it, as above. But we'd probably want to more
> carefully control that new process. Which chroot is it in? How do
> cgroups work? And so on. And ultimately this design leads to something
> like ZwCreateProcess, where you have several arguments, each to a
> handle to some part of the new process state, or null to be inherited
> from its parent.
> 
> int execve_parent(int parent_pidfd, int root_dirfd, int cgroup_fd, int
> namespace_fd, const char *pathname, char *const argv[], char *const
> envp[]);
> 
> One could imagine this growing pretty unwieldy. There's also this
> other design aspect of Linux that's worth considering. Namespaces and
> other process-inherited resources are generally hierarchical, with
> children getting the resource from their parent. This makes sense and
> is simple to conceptualize. Everytime we add a new thing_fd as a
> pointer to one of these resources, and allow it to be used outside of
> that hierarchy, it introduces a kind of "escape hatch". That might be
> considered "bad design" by some; it might not be by others. Seen this
> way, NT is one massive escape hatch, with pretty much everything being
> an object with a handle.
> 
> But! Maybe this is nonetheless an interesting design avenue to
> explore. The introduction of pidfd is sort of just the "beginning" of
> that kind of design.
> 
> Is any of this interesting to you as a future of privilege escalation
> and management on Linux?

A bunch of this was discussed in a breakout room during Linux Plumbers
last year and I also had discussions with Lennart about this a little
while ago.

One API I had proposed was to extend pidfd_open() to give you a
pidfd that does not yet refer to any process, i.e. instead of

int pidfd = pidfd_open(1234, 0);

you could do

int pidfd = pidfd_open(-1/-ESRCH, 0);

which would give you an empty process handle without any mentionable
properties.

A simple/dumb design would then be to let clone3() not just return
pidfds but also take pidfds as an argument. You could then hand-off the
pidfd to another process via SCM_RIGHTS/pidfd_getfd() and have it create a
process for you with the privileges of the caller, you'd still be the
parent.

Or, in addition to pidfd_open(), we add new syscalls to configure a
process context, e.g. pidfd_configure() or something similar. I initially
proposed this design before we ended up with what we have now.

So yes, I would love to have at least the concept to create a process
for another process, delegated fork, essentially.

Christian


* Re: forkat(int pidfd), execveat(int pidfd), other awful things?
  2021-02-01 17:47 forkat(int pidfd), execveat(int pidfd), other awful things? Jason A. Donenfeld
  2021-02-01 17:51 ` Jason A. Donenfeld
  2021-02-01 18:20 ` Christian Brauner
@ 2021-02-01 18:29 ` Andy Lutomirski
  2021-02-02  9:23   ` David Laight
  2021-02-01 18:32 ` forkat(int pidfd), execveat(int pidfd), other awful things? Casey Schaufler
  3 siblings, 1 reply; 13+ messages in thread
From: Andy Lutomirski @ 2021-02-01 18:29 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: Kernel Hardening, LKML, Jann Horn, Christian Brauner

On Mon, Feb 1, 2021 at 9:47 AM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> Hi Andy & others,
>
> I was reversing some NT stuff recently and marveling over how wild and
> crazy things are over in Windows-land. A few things related to process
> creation caught my interest:
>
> - It's possible to create a new process with an *arbitrary parent
> process*, which means it'll then inherit various things like handles
> and security attributes and tokens from that new parent process.
>
> - It's possible to create a new process with the memory space handle
> of a different process. Consider this on Linux, and you have some
> abomination like `forkat(int pidfd)`.

My general thought is that this is an excellent idea, but maybe not
quite in this form.  I do rather like a lot about the NT design,
although I have to say that their actual taste in the structures
passed into APIs is baroque at best.

If we're going to do this, though, can we stay away from fork and
exec entirely?  Fork is cute but inefficient, and exec is the source
of neverending complexity and bugs in the kernel.  But I also think
that whole project can be decoupled into two almost-orthogonal pieces:

1. Inserting new processes into unusual places in the process tree.
The only part of setuid that really needs kernel help to replace is
for the daemon to be able to make its newly-spawned child be a child
of the process that called out to the daemon. Christian's pidfd
proposal could help here, and there could be a new API that is only a
minor tweak to existing fork/exec to fork-and-reparent.

2. A sane process creation API.  It would be delightful to be able to
create a fully-specified process without forking.  This might end up
being a fairly complicated project, though -- there are a lot of
inherited process properties to be enumerated.

(Bonus #3): binfmts are a pretty big attack surface.  Having a way to
handle all the binfmt magic in userspace might be a nice extension to
#2.

--Andy


* Re: forkat(int pidfd), execveat(int pidfd), other awful things?
  2021-02-01 17:47 forkat(int pidfd), execveat(int pidfd), other awful things? Jason A. Donenfeld
                   ` (2 preceding siblings ...)
  2021-02-01 18:29 ` Andy Lutomirski
@ 2021-02-01 18:32 ` Casey Schaufler
  3 siblings, 0 replies; 13+ messages in thread
From: Casey Schaufler @ 2021-02-01 18:32 UTC (permalink / raw)
  To: Jason A. Donenfeld, Kernel Hardening, Andy Lutomirski
  Cc: LKML, Jann Horn, Christian Brauner

On 2/1/2021 9:47 AM, Jason A. Donenfeld wrote:
> Hi Andy & others,
>
> I was reversing some NT stuff recently and marveling over how wild and
> crazy things are over in Windows-land. A few things related to process
> creation caught my interest:
>
> - It's possible to create a new process with an *arbitrary parent
> process*, which means it'll then inherit various things like handles
> and security attributes and tokens from that new parent process.
>
> - It's possible to create a new process with the memory space handle
> of a different process. Consider this on Linux, and you have some
> abomination like `forkat(int pidfd)`.
>
> The big question is "why!?" At first I was just amused by its presence
> in NT. Everything is an object and you can usually freely mix and
> match things, and it's very flexible, which is cool. But this is NT,
> not Linux.
>
> Jann and I were discussing, though, that maybe some variant of these
> features might be useful to get rid of setuid executables. Imagine
> something like `systemd-sudod`, forked off of PID 1 very early.
> Subsequently all new processes on the system run with
> PR_SET_NO_NEW_PRIVS or similar policies to prevent non-root->root
> transition. Then, if you want to transition, you ask systemd-sudod (or
> polkitd, or whatever else you have in mind) to make you a new process,
> and it then does the various policy checks, and executes a new process
> for you as the parent of the requesting process.
>
> So how would that work? Well, executing processes with arbitrary
> parents would be part of it, as above. But we'd probably want to more
> carefully control that new process. Which chroot is it in? How do
> cgroups work? And so on. And ultimately this design leads to something
> like ZwCreateProcess, where you have several arguments, each to a
> handle to some part of the new process state, or null to be inherited
> from its parent.
>
> int execve_parent(int parent_pidfd, int root_dirfd, int cgroup_fd, int
> namespace_fd, const char *pathname, char *const argv[], char *const
> envp[]);
>
> One could imagine this growing pretty unwieldy. There's also this
> other design aspect of Linux that's worth considering. Namespaces and
> other process-inherited resources are generally hierarchical, with
> children getting the resource from their parent. This makes sense and
> is simple to conceptualize. Everytime we add a new thing_fd as a
> pointer to one of these resources, and allow it to be used outside of
> that hierarchy, it introduces a kind of "escape hatch". That might be
> considered "bad design" by some; it might not be by others. Seen this
> way, NT is one massive escape hatch, with pretty much everything being
> an object with a handle.
>
> But! Maybe this is nonetheless an interesting design avenue to
> explore. The introduction of pidfd is sort of just the "beginning" of
> that kind of design.
>
> Is any of this interesting to you as a future of privilege escalation
> and management on Linux?

TL;DR - We have plenty of flayed cats.

My brief analysis of your proposal doesn't lead me to think
that there's anything you couldn't already do with systemd and
an application launcher. We already have a bunch of security
mechanisms and behaviors that the masses have decided are too
complicated or dangerous to use. And some that *are* too
complicated or dangerous to use. I wouldn't see these mechanisms
as "hardening" the kernel. I would see them as complicating
what passes for the Linux security policy.

>
> Jason



* RE: forkat(int pidfd), execveat(int pidfd), other awful things?
  2021-02-01 18:29 ` Andy Lutomirski
@ 2021-02-02  9:23   ` David Laight
  2021-07-28 16:37     ` Leveraging pidfs for process creation without fork John Cotton Ericson
  0 siblings, 1 reply; 13+ messages in thread
From: David Laight @ 2021-02-02  9:23 UTC (permalink / raw)
  To: 'Andy Lutomirski', Jason A. Donenfeld
  Cc: Kernel Hardening, LKML, Jann Horn, Christian Brauner

From: Andy Lutomirski
> Sent: 01 February 2021 18:30
...
> 2. A sane process creation API.  It would be delightful to be able to
> create a fully-specified process without forking.  This might end up
> being a fairly complicated project, though -- there are a lot of
> inherited process properties to be enumerated.

Since you are going to (eventually) load in a program image,
having to do several system calls to create the process isn't
likely to be a problem.
So using separate calls for each property isn't really an issue
and solves the horrid problem of the API structure.

So you could create an embryonic process that inherits a lot
of stuff from the current process, then do actions that
sort out the fds, argv, namespace etc.,
finally running the new program.

It would probably make implementing posix_spawn() easier.
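
The configure-then-run shape described here already exists at the library
level in posix_spawn(): the caller records file actions and attributes
first, and they are applied when the child is created. A minimal sketch;
spawn_capture() is an illustrative helper name, and it assumes a POSIX
"sh" is on PATH:

```c
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/*
 * Run `sh -c cmd` with stdout redirected into a pipe and copy up to
 * cap-1 bytes of its output into out (NUL-terminated).
 * Returns the child's exit status, or -1 on error.
 */
int spawn_capture(const char *cmd, char *out, size_t cap)
{
    int pipefd[2];
    if (cap == 0 || pipe(pipefd) != 0)
        return -1;

    /* Describe the child's fd table before the child exists. */
    posix_spawn_file_actions_t fa;
    posix_spawn_file_actions_init(&fa);
    posix_spawn_file_actions_adddup2(&fa, pipefd[1], STDOUT_FILENO);
    posix_spawn_file_actions_addclose(&fa, pipefd[0]);
    posix_spawn_file_actions_addclose(&fa, pipefd[1]);

    char *const argv[] = { "sh", "-c", (char *)cmd, NULL };
    pid_t pid;
    int err = posix_spawnp(&pid, "sh", &fa, NULL, argv, environ);
    posix_spawn_file_actions_destroy(&fa);
    close(pipefd[1]);
    if (err != 0) {
        close(pipefd[0]);
        return -1;
    }

    /* Collect the child's output until EOF or the buffer is full. */
    size_t len = 0;
    ssize_t n;
    while (len < cap - 1 &&
           (n = read(pipefd[0], out + len, cap - 1 - len)) > 0)
        len += (size_t)n;
    out[len] = '\0';
    close(pipefd[0]);

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

An embryonic-process API would let each of these recorded actions become
its own call against a not-yet-running process, instead of a list that
libc replays between fork and exec.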

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


* Leveraging pidfs for process creation without fork
  2021-02-02  9:23   ` David Laight
@ 2021-07-28 16:37     ` John Cotton Ericson
  2021-07-29 14:24       ` Christian Brauner
  0 siblings, 1 reply; 13+ messages in thread
From: John Cotton Ericson @ 2021-07-28 16:37 UTC (permalink / raw)
  To: LKML
  Cc: David Laight, Andy Lutomirski, Jason A. Donenfeld,
	Kernel Hardening, Jann Horn, Christian Brauner

Hi,

I was excited to learn about pidfds the other day, precisely in
hopes that it would open the door to such a "sane process creation API". 
I searched the LKML, found this thread, and now hope to rekindle the 
discussion; my apologies if there has been more discussion since that I 
missed and I am making redundant noise.

----

On Tue, Feb 2, 2021, at 4:23 AM, David Laight wrote:
> From: Andy Lutomirski
> > Sent: 01 February 2021 18:30
> ...
> > 2. A sane process creation API.  It would be delightful to be able to
> > create a fully-specified process without forking.  This might end up
> > being a fairly complicated project, though -- there are a lot of
> > inherited process properties to be enumerated.
> 
> Since you are going to (eventually) load in a program image
> have to do several system calls to create the process isn't
> likely to be a problem.
> So using separate calls for each property isn't really an issue
> and solves the horrid problem of the API structure.

I definitely concur that creating an embryonic process and then setting the
properties separately sounds like the right approach. I'm no expert, but
I gather from afar that between BPF and io_uring, plenty of people are 
investigating general methods of batched/pipelined communication with 
the kernel, and so there's little reason to go around making more ad-hoc 
mammoth syscalls for specific sets of tasks.

----

> So you could create an embryonic process that inherits a lot
> of stuff from the current process, the do actions that
> sort out the fds, argv, namespace etc.
> Finally running the new program.

All that sounds good, but I wonder if it would be possible to have a 
flag such that inheritance (where practical) would *not* be the default 
for new processes. I'm convinced that better security will always be an 
uphill battle until privileges/capabilities/resources are *not* shared 
by default. Only when more sharing requires monotonically more 
programmer effort will productivity/laziness align with the principle of 
least privilege.

With fork/exec, there's no good way to achieve this, I think it's safe 
to say. But with the embryonic processes method, where one has the 
ability to e.g. set/unset file descriptors on the embryo under 
construction, it seems quite natural.

This is one wrinkle of interface evolution --- as new sandboxing 
mechanisms / namespaces are created, we would either need to create 
yet-new "no really, default no-share" flags, or arguably be causing API 
breakage as previously "leaking" privileges are patched up. I am hopeful 
that either having versioned flags, or thoroughly documenting up-front 
that the exact behavior is subject to change as "leaks are plugged" is 
OK, but I recognize that the former might be too much complexity and the 
latter too weasel-wordy, and therefore the whole idea of "opt-in sharing
only" will have to wait.

----

The security <-> ergonomics aspect is the main point of interest for me, 
but there are a few random ideas:

1. I originally thought an fd to an embryonic process should in fact 
point to the task_struct rather than pid, since there is no risk of the 
data becoming useless asynchronously --- an embryonic process is never 
scheduled and cannot do anything like exiting on its own. But there is
no reason an embryonic process need start with just one thread, so 
allowing entire embryonic thread groups might actually be virtuous. I 
don't know for sure, but I figure in that case it is simpler to just 
stick with the pid indirection.

2. Embryonic processes can be "forked at rest" (i.e. just duplicated), 
which would allow a regime where they are used as templates for process 
creation, duplicated ("forked at rest"), and sent around for other tasks 
to spawn processes themselves. If my idea for "opt-in sharing only" 
fails per the above, sending around an "as isolated as possible" embryo 
template could be a decent fallback.

That's all I got. I hope continuing this design process is of interest 
to others.

Cheers,

John


* Re: Leveraging pidfs for process creation without fork
  2021-07-28 16:37     ` Leveraging pidfs for process creation without fork John Cotton Ericson
@ 2021-07-29 14:24       ` Christian Brauner
  2021-07-29 14:54         ` John Ericson
  2021-07-30  1:41         ` Al Viro
  0 siblings, 2 replies; 13+ messages in thread
From: Christian Brauner @ 2021-07-29 14:24 UTC (permalink / raw)
  To: John Cotton Ericson
  Cc: LKML, David Laight, Andy Lutomirski, Jason A. Donenfeld,
	Kernel Hardening, Jann Horn, Christian Brauner

On Wed, Jul 28, 2021 at 12:37:57PM -0400, John Cotton Ericson wrote:
> Hi,
> 
> I was excited to learn about about pidfds the other day, precisely in hopes
> that it would open the door to such a "sane process creation API". I
> searched the LKML, found this thread, and now hope to rekindle the
> discussion; my apologies if there has been more discussion since that I

Yeah, I haven't forgotten this discussion. A proposal is on my todo list
for this year. So far I've scheduled some time to work on this in the
fall.

Thanks!
Christian


* Re: Leveraging pidfs for process creation without fork
  2021-07-29 14:24       ` Christian Brauner
@ 2021-07-29 14:54         ` John Ericson
  2021-07-30  1:41         ` Al Viro
  1 sibling, 0 replies; 13+ messages in thread
From: John Ericson @ 2021-07-29 14:54 UTC (permalink / raw)
  To: Christian Brauner
  Cc: LKML, David Laight, Andy Lutomirski, Jason A. Donenfeld,
	Kernel Hardening, Jann Horn, Christian Brauner

Wonderful, looking forward to reading it then!

John

On Thu, Jul 29, 2021, at 10:24 AM, Christian Brauner wrote:
> On Wed, Jul 28, 2021 at 12:37:57PM -0400, John Cotton Ericson wrote:
> > Hi,
> > 
> > I was excited to learn about about pidfds the other day, precisely in hopes
> > that it would open the door to such a "sane process creation API". I
> > searched the LKML, found this thread, and now hope to rekindle the
> > discussion; my apologies if there has been more discussion since that I
> 
> Yeah, I haven't forgotten this discussion. A proposal is on my todo list
> for this year. So far I've scheduled some time to work on this in the
> fall.
> 
> Thanks!
> Christian


* Re: Leveraging pidfs for process creation without fork
  2021-07-29 14:24       ` Christian Brauner
  2021-07-29 14:54         ` John Ericson
@ 2021-07-30  1:41         ` Al Viro
       [not found]           ` <1468d75c-57ae-42aa-85ce-2bee8d403763@www.fastmail.com>
  1 sibling, 1 reply; 13+ messages in thread
From: Al Viro @ 2021-07-30  1:41 UTC (permalink / raw)
  To: Christian Brauner
  Cc: John Cotton Ericson, LKML, David Laight, Andy Lutomirski,
	Jason A. Donenfeld, Kernel Hardening, Jann Horn,
	Christian Brauner

On Thu, Jul 29, 2021 at 04:24:15PM +0200, Christian Brauner wrote:
> On Wed, Jul 28, 2021 at 12:37:57PM -0400, John Cotton Ericson wrote:
> > Hi,
> > 
> > I was excited to learn about about pidfds the other day, precisely in hopes
> > that it would open the door to such a "sane process creation API". I
> > searched the LKML, found this thread, and now hope to rekindle the
> > discussion; my apologies if there has been more discussion since that I
> 
> Yeah, I haven't forgotten this discussion. A proposal is on my todo list
> for this year. So far I've scheduled some time to work on this in the
> fall.

Keep in mind that quite a few places in kernel/exit.c very much rely upon the
lack of anything outside of thread group adding threads into it.  Same for
fs/exec.c.


* Re: Leveraging pidfs for process creation without fork
       [not found]           ` <1468d75c-57ae-42aa-85ce-2bee8d403763@www.fastmail.com>
@ 2021-07-31 22:42             ` Al Viro
  2021-08-02 12:19               ` Christian Brauner
  0 siblings, 1 reply; 13+ messages in thread
From: Al Viro @ 2021-07-31 22:42 UTC (permalink / raw)
  To: John Ericson
  Cc: Christian Brauner, LKML, David Laight, Andy Lutomirski,
	Jason A. Donenfeld, Kernel Hardening, Jann Horn,
	Christian Brauner

On Sat, Jul 31, 2021 at 03:11:03PM -0700, John Ericson wrote:
> Do you mind pointing out one of those examples? I'm new to this, but if they follow a pattern I should be able to find the other examples based off it. I'm certainly curious to take a look :).
> 
> I hope these issues aren't too deep. Ideally there's a nice decoupling so the creating process is just manipulating "inert" data structures for the embryo that the scheduler doesn't even need to see, and then after the embryonic process is submitted, when the context switches to it for the first time, that's a completely normal process without special cases.
> 
> The place where complexity is hardest to avoid, I think, would be cleaning up the yet-unborn embryonic processes orphaned by exited parent(s), because that will have to handle all the semi-initialized states those could be in (as opposed to real processes).

	It's more on the exit/exec/coredump side, actually.  For
exit we want to be sure that no new live threads will appear in a
group once the last live thread has entered do_exit().  For
exec (de_thread(), for starters) you want to have all threads
except for the one that does execve() to be killed and your
thread to take over as group leader.  Look for the machinery there
and in do_exit()/release_task() involved in that.  For coredump
you want all threads except for dumper to be brought into do_exit()
and stopped there, for dumping one to be able to access their state.

	Then there's fun with ->sighand treatment - the whole thing
critically relies upon ->sighand being shared for the entire thread
group; look at the ->sighand->siglock uses.

	The whole area is full of rather subtle places.  Again, the
real headache comes from the exit and execve.  Embryonic threads are
passive; it's the ones already running that can (and do) cause PITA.

	What do you want that for, BTW?


* Re: Leveraging pidfs for process creation without fork
  2021-07-31 22:42             ` Al Viro
@ 2021-08-02 12:19               ` Christian Brauner
  2021-08-03  6:00                 ` John Cotton Ericson
  0 siblings, 1 reply; 13+ messages in thread
From: Christian Brauner @ 2021-08-02 12:19 UTC (permalink / raw)
  To: Al Viro
  Cc: John Ericson, LKML, David Laight, Andy Lutomirski,
	Jason A. Donenfeld, Kernel Hardening, Jann Horn,
	Christian Brauner

On Sat, Jul 31, 2021 at 10:42:16PM +0000, Al Viro wrote:
> On Sat, Jul 31, 2021 at 03:11:03PM -0700, John Ericson wrote:
> > Do you mind pointing out one of those examples? I'm new to this, but if they follow a pattern I should be able to find the other examples based off it. I'm certainly curious to take a look :).
> > 
> > I hope these issues aren't to deep. Ideally there's a nice decoupling so the creating process is just manipulating "inert" data structures for the embryo that scheduler doesn't even need see, and then after the embryonic process is submitted, when the context switches to it for the first time that's a completely normal process without special cases.
> > 
> > The place complexity is hardest to avoid I think would be cleaning up the yet-unborn embryonic processes orphaned by exitted parent(s), because that will have to handle all the semi-initialized states those could be in (as opposed to real processes).
> 
> 	It's more on the exit/exec/coredump side, actually.  For
> exit we want to be sure that no new live threads will appear in a
> group once the last live thread has entered do_exit().  For
> exec (de_thread(), for starters) you want to have all threads
> except for the one that does execve() to be killed and your
> thread to take over as group leader.  Look for the machinery there
> and in do_exit()/release_task() involved into that.  For coredump
> you want all threads except for dumper to be brought into do_exit()
> and stopped there, for dumping one to be able to access their state.
> 
> 	Then there's fun with ->sighand treatment - the whole thing
> critically relies upon ->sighand being shared for the entire thread
> group; look at the ->sighand->siglock uses.
> 
> 	The whole area is full of rather subtle places.  Again, the
> real headache comes from the exit and execve.  Embryonic threads are
> passive; it's the ones already running that can (and do) cause PITA.

Iiuc, you're talking about adding a thread into a thread-group tg1 from
a thread in another thread-group tg2. I don't think that's a very
pressing use-case and I agree that that sounds rather nasty right now.
Unless I'm missing something, a simple api to create something like a
processes configuration context doesn't require this.


* Re: Leveraging pidfs for process creation without fork
  2021-08-02 12:19               ` Christian Brauner
@ 2021-08-03  6:00                 ` John Cotton Ericson
  0 siblings, 0 replies; 13+ messages in thread
From: John Cotton Ericson @ 2021-08-03  6:00 UTC (permalink / raw)
  To: Christian Brauner, Al Viro
  Cc: LKML, David Laight, Andy Lutomirski, Jason A. Donenfeld,
	Kernel Hardening, Jann Horn, Christian Brauner

On Mon, Aug 2, 2021, at 8:19 AM, Christian Brauner wrote:
> On Sat, Jul 31, 2021 at 10:42:16PM +0000, Al Viro wrote:
> > 
> > It's more on the exit/exec/coredump side, actually.  For
> > exit we want to be sure that no new live threads will appear in a
> > group once the last live thread has entered do_exit().  For
> > exec (de_thread(), for starters) you want to have all threads
> > except for the one that does execve() to be killed and your
> > thread to take over as group leader.  Look for the machinery there
> > and in do_exit()/release_task() involved into that.  For coredump
> > you want all threads except for dumper to be brought into do_exit()
> > and stopped there, for dumping one to be able to access their state.
> > 
> > Then there's fun with ->sighand treatment - the whole thing
> > critically relies upon ->sighand being shared for the entire thread
> > group; look at the ->sighand->siglock uses.
> > 
> > The whole area is full of rather subtle places.  Again, the
> > real headache comes from the exit and execve.  Embryonic threads are
> > passive; it's the ones already running that can (and do) cause PITA.

I took a look at de_thread and begin_new_exec. It does seem that
whatever trouble there is stems from a bit of mixing of concerns, as I
thought.

Most of begin_new_exec seems to be about wiping clean the current
process's state, including the de_thread and unsharing various things.
But then operations like that first bprm_creds_from_file call (of
perhaps more recent vintage [1]) are about initializing new state from
the binprm argument.

It is interesting to me to note that some of the "unsharing" happens at 
clone time (the namespaces), and some happens (also) at exec time (file 
table, signal handlers). To me this is more good, concrete evidence that
fork + exec is awkward and scatters concerns.
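
To make the exec-time half of that concrete, here is a minimal userspace
sketch (an illustration only, not code from the kernel or from anyone's
proposal). It assumes Linux with /proc mounted and `cat` in PATH; the
helper name handler_survives_exec is made up. The child installs a
SIGUSR1 handler, then execs `cat /proc/self/status`, so the new image
reports its own caught-signal mask (SigCgt) back through a pipe:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void on_usr1(int sig) { (void)sig; }

/*
 * Hypothetical helper (name and shape are assumptions for illustration).
 * Returns 1 if SIGUSR1 is still caught after exec, 0 if exec reset it,
 * and -1 on any error.
 */
int handler_survives_exec(void)
{
	int pfd[2], status, found = 0;
	unsigned long long caught = 0;
	char line[256];
	FILE *in;
	pid_t pid;

	if (pipe(pfd) < 0)
		return -1;

	pid = fork();
	if (pid < 0) {
		close(pfd[0]);
		close(pfd[1]);
		return -1;
	}
	if (pid == 0) {
		struct sigaction sa;

		/* The handler is installed in the pre-exec image. */
		memset(&sa, 0, sizeof(sa));
		sa.sa_handler = on_usr1;
		sigemptyset(&sa.sa_mask);
		if (sigaction(SIGUSR1, &sa, NULL) < 0)
			_exit(126);

		/* Route the exec'd program's stdout to the parent. */
		if (dup2(pfd[1], STDOUT_FILENO) < 0)
			_exit(126);
		close(pfd[0]);
		if (pfd[1] != STDOUT_FILENO)
			close(pfd[1]);

		execlp("cat", "cat", "/proc/self/status", (char *)NULL);
		_exit(127);
	}

	close(pfd[1]);
	in = fdopen(pfd[0], "r");
	if (!in) {
		close(pfd[0]);
		waitpid(pid, &status, 0);
		return -1;
	}
	/* SigCgt is a hex bitmask; bit (signo - 1) is set if signo is caught. */
	while (fgets(line, sizeof(line), in))
		if (sscanf(line, "SigCgt: %llx", &caught) == 1)
			found = 1;
	fclose(in);

	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status) ||
	    WEXITSTATUS(status) != 0 || !found)
		return -1;
	return (int)((caught >> (SIGUSR1 - 1)) & 1);
}
```

A caller can check it with `assert(handler_survives_exec() == 0);`. The
handler was inherited across fork but is gone after exec, without the
program ever asking for that.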

There will perhaps be some subtleties about the order in which state
can be set up on the embryonic process, but I don't think any de_thread
will be needed, because there will never be threads from a "previous"
state lying around. Indeed there is no "previous" anything, just the
current everything-inert embryonic process.

I would propose trying to rip up begin_new_exec so that the unsharing,
de_thread-ing etc. are done only in the traditional exec path, and just
the bprm bits, operating on a non-current fresh embryonic task_struct,
are done in the new one.

[1]: 56305aa9b6fab91a5555a45796b79c1b0a6353d1

 > Iiuc, you're talking about adding a thread into a thread-group tg1 from
 > a thread in another thread-group tg2. I don't think that's a very
 > pressing use-case and I agree that that sounds rather nasty right now.
 > Unless I'm missing something, a simple api to create something like a
 > processes configuration context doesn't require this.

Agreed.

I did mention embryonic processes with multiple threads, but that was
just a shower thought and not something I really care about. Also,
since that would entail adding a thread to an inert thread group the
creator has full powers over (it's "on the operating table"), I don't
think it would be so bad.

(To keep this new surgery metaphor going, exec would be self-surgery,
and adding a thread to a *live* thread group would be surgery without
anesthesia.)

 > a processes configuration context

This phrase stuck with me, Christian. Not to rush you on your concrete
proposal, but it sounds like you are envisioning building up a separate
struct with instructions on how to produce a process, rather than
mutating unscheduled but otherwise genuine `task_struct`s?
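
As a userspace point of comparison only (not a claim about what the
kernel interface would look like), posix_spawn already has the
"instructions first, process later" shape: file actions and attributes
are recorded in separate objects, and a process only exists once
posix_spawnp consumes them. A minimal sketch, where the helper name
spawn_from_description and the choice of `sh -c` as the child are
assumptions for illustration:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/*
 * Hypothetical helper: build a description of the child, spawn from it,
 * and return the child's exit status (or -1 on error).
 */
int spawn_from_description(void)
{
	posix_spawn_file_actions_t fa;
	posix_spawnattr_t attr;
	sigset_t dfl;
	pid_t pid;
	int status, rc;
	char *argv[] = { "sh", "-c", "echo unseen; exit 7", NULL };

	if (posix_spawn_file_actions_init(&fa) != 0)
		return -1;
	if (posix_spawnattr_init(&attr) != 0) {
		posix_spawn_file_actions_destroy(&fa);
		return -1;
	}

	/* Instruction 1: the child's stdout is /dev/null. */
	rc = posix_spawn_file_actions_addopen(&fa, STDOUT_FILENO,
					      "/dev/null", O_WRONLY, 0);

	/* Instruction 2: SIGUSR1 starts at its default disposition. */
	sigemptyset(&dfl);
	sigaddset(&dfl, SIGUSR1);
	if (rc == 0)
		rc = posix_spawnattr_setsigdefault(&attr, &dfl);
	if (rc == 0)
		rc = posix_spawnattr_setflags(&attr, POSIX_SPAWN_SETSIGDEF);

	/* Only here does a process come into existence. */
	if (rc == 0)
		rc = posix_spawnp(&pid, "sh", &fa, &attr, argv, environ);

	posix_spawn_file_actions_destroy(&fa);
	posix_spawnattr_destroy(&attr);
	if (rc != 0)
		return -1;

	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

With the assumed `exit 7` child, `assert(spawn_from_description() == 7);`
holds. The contrast with the other approach is that nothing here is a
half-built process the caller can poke at; the description is plain
data until the final call.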

 > > What do you want that for, BTW?

Those security + ergonomic things I mentioned in my original email are 
the main goal.

I have a personal *long*-term goal to see something like CloudABI
resurrected. I think it got most of the interfaces right, but not
process management, and now that there are pidfds, we have a chance to
do better.

I'm in no rush, so I'm happy to just see very Linux-specific interfaces
evolve in a good direction for now. Writing a personality or some other
shim is not the interesting part, to say the least, so I'm happy to wait
ages before doing that while the internals marinate.

John

Thread overview: 13+ messages
2021-02-01 17:47 forkat(int pidfd), execveat(int pidfd), other awful things? Jason A. Donenfeld
2021-02-01 17:51 ` Jason A. Donenfeld
2021-02-01 18:20 ` Christian Brauner
2021-02-01 18:29 ` Andy Lutomirski
2021-02-02  9:23   ` David Laight
2021-07-28 16:37     ` Leveraging pidfs for process creation without fork John Cotton Ericson
2021-07-29 14:24       ` Christian Brauner
2021-07-29 14:54         ` John Ericson
2021-07-30  1:41         ` Al Viro
     [not found]           ` <1468d75c-57ae-42aa-85ce-2bee8d403763@www.fastmail.com>
2021-07-31 22:42             ` Al Viro
2021-08-02 12:19               ` Christian Brauner
2021-08-03  6:00                 ` John Cotton Ericson
2021-02-01 18:32 ` forkat(int pidfd), execveat(int pidfd), other awful things? Casey Schaufler
