linux-embedded.vger.kernel.org archive mirror
* execve(NULL, argv, envp) for nommu?
@ 2017-09-05  7:34 Rob Landley
  2017-09-05  9:00 ` Geert Uytterhoeven
  0 siblings, 1 reply; 13+ messages in thread
From: Rob Landley @ 2017-09-05  7:34 UTC (permalink / raw)
  To: linux-embedded

For years I've wanted an execve() system call modification that let me
pass a NULL as the first argument to say "re-exec this program please".
Because on nommu you've got to exec something to unblock vfork(), and
daemons (or things like busybox and toybox) want to re-exec themselves.
(I just hit this again trying to implement a nommu-friendly strace: the
one on github doesn't SIGSTOP the child before the execve() of the
process to trace, because of vfork(), and just races and misses the
first few system calls on nommu instead...)

The problem with exec /proc/self/exe is A) I haven't necessarily got
/proc mounted, B) in a chroot the original binary might not be in scope
anymore. But I'm already _running_ this program. If I could fork() I
could already get a second copy of the sucker and call main() again
myself if necessary, but I can't, so...

I'm aware there's a possible "but what if it was suid and it's already
dropped privileges" argument, and I'm fine with execve(NULL) not
honoring the suid bit if people feel that way. I just wanna unblock
vfork() while still running this code. (A way to detect I did this would
be great too, but the normal tweaking of argv[] or envp[] to let main
know we're a child still works.)
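
For reference, the /proc/self/exe workaround combined with the argv[]
marker trick might look roughly like the sketch below. It assumes /proc
is mounted and the binary is still reachable from the current root, which
are exactly the two conditions that can fail. The marker string and the
function names are hypothetical, chosen only for illustration.

```c
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical marker; any argv[] or envp[] convention would do. */
#define CHILD_MARKER "--reexec-child"

/* Returns 1 if main() is running as the re-exec'd child. */
int is_reexec_child(int argc, char *const argv[])
{
	return argc > 1 && strcmp(argv[1], CHILD_MARKER) == 0;
}

/*
 * vfork(), then re-exec this binary through /proc/self/exe.
 * Returns the child's pid in the parent, or -1 if vfork() failed.
 * The child never returns from this function: it execs or _exit()s.
 */
pid_t spawn_self(void)
{
	/* Built before vfork() so the child only has to exec or _exit. */
	char *const child_argv[] = { "self", CHILD_MARKER, NULL };
	extern char **environ;
	pid_t pid = vfork();

	if (pid == 0) {
		/* Needs /proc mounted and the binary visible from this root. */
		execve("/proc/self/exe", child_argv, environ);
		_exit(127);	/* exec failed; the parent resumes */
	}
	return pid;
}
```

main() would call is_reexec_child() first and branch to the child's work
if it returns 1.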

Is there a _reason_ the kernel doesn't do this, or has nobody bothered
to code it up yet?

Rob


* Re: execve(NULL, argv, envp) for nommu?
  2017-09-05  7:34 execve(NULL, argv, envp) for nommu? Rob Landley
@ 2017-09-05  9:00 ` Geert Uytterhoeven
  2017-09-05 13:24   ` Alan Cox
  0 siblings, 1 reply; 13+ messages in thread
From: Geert Uytterhoeven @ 2017-09-05  9:00 UTC (permalink / raw)
  To: Rob Landley; +Cc: Linux Embedded, Oleg Nesterov, linux-kernel

CC Oleg, lkml

On Tue, Sep 5, 2017 at 9:34 AM, Rob Landley <rob@landley.net> wrote:
> For years I've wanted an execve() system call modification that let me
> pass a NULL as the first argument to say "re-exec this program please".
> Because on nommu you've got to exec something to unblock vfork(), and
> daemons (or things like busybox and toybox) want to re-exec themselves.
> I just hit this again trying to implement a nommu-friendly strace(): the
> one on github doesn't SIGSTOP the child before the execve() of the
> process to trace because vfork(), and just races and misses the first
> few system calls on nommu instead...)
>
> The problem with exec /proc/self/exe is A) I haven't necessarily got
> /proc mounted, B) in a chroot the original binary might not be in scope
> anymore. But I'm already _running_ this program. If I could fork() I
> could already get a second copy of the sucker and call main() again
> myself if necessary, but I can't, so...
>
> I'm aware there's a possible "but what if it was suid and it's already
> dropped privileges" argument, and I'm fine with execve(NULL) not
> honoring the suid bit if people feel that way. I just wanna unblock
> vfork() while still running this code. (A way to detect I did this would
> be great too, but the normal tweaking of argv[] or envp[] to let main
> know we're a child still works.)
>
> Is there a _reason_ the kernel doesn't do this, or has nobody bothered
> to code it up yet?
>
> Rob


* Re: execve(NULL, argv, envp) for nommu?
  2017-09-05  9:00 ` Geert Uytterhoeven
@ 2017-09-05 13:24   ` Alan Cox
  2017-09-06  1:12     ` Rob Landley
  0 siblings, 1 reply; 13+ messages in thread
From: Alan Cox @ 2017-09-05 13:24 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Rob Landley, Linux Embedded, Oleg Nesterov, linux-kernel

> > anymore. But I'm already _running_ this program. If I could fork() I
> > could already get a second copy of the sucker and call main() again
> > myself if necessary, but I can't, so...

You can - ptrace 8)

> > honoring the suid bit if people feel that way. I just wanna unblock
> > vfork() while still running this code. 

Would it make more sense to have a way to promote your vfork into a
fork when you hit these cases? (I appreciate that fork on NOMMU has a much
higher performance cost as you start having to softmmu copy or swap
pages.)

Alan


* Re: execve(NULL, argv, envp) for nommu?
  2017-09-05 13:24   ` Alan Cox
@ 2017-09-06  1:12     ` Rob Landley
  2017-09-08 21:18       ` Rob Landley
  2017-09-11 18:14       ` Alan Cox
  0 siblings, 2 replies; 13+ messages in thread
From: Rob Landley @ 2017-09-06  1:12 UTC (permalink / raw)
  To: Alan Cox, Geert Uytterhoeven
  Cc: Linux Embedded, Oleg Nesterov, dalias, linux-kernel

On 09/05/2017 08:24 AM, Alan Cox wrote:
>>> anymore. But I'm already _running_ this program. If I could fork() I
>>> could already get a second copy of the sucker and call main() again
>>> myself if necessary, but I can't, so...
> 
> You can - ptrace 8)

Oh I can call clone() with various flags and try to fake it myself, it
just won't do what I want. :)

>>> honoring the suid bit if people feel that way. I just wanna unblock
>>> vfork() while still running this code. 
> 
> Would it make more sense to have a way to promote your vfork into a
> fork when you hit these cases (I appreciate that fork on NOMMU has a much
> higher performance cost as you start having to softmmu copy or swap
> pages).

It's not the performance cost, it's rewriting all the pointers.

Without address translation, copying the existing mappings to a new
range requires finding and adjusting every pointer to the old data,
which you can do for the executable mappings in PIE* binaries, but
tracking down all the pointers on the stack, heap, and in your global
variables? Flaming pain.

Making fork() work on nommu is basically the same problem as making
garbage collection work in C on mmu. That's why some of us keep defending
vfork() from the people who don't understand why it exists and
periodically suggest removing it.

> Alan

Rob

* or FDPIC, which is basically just PIE with 4 individually relocatable
text/data/rodata/bss segments instead of one big mapping you relocate as
a contiguous block; both work on nommu but fdpic can fit into more
fragmented memory, and because the segments are independent it lets
nommu share some segments between processes (code+rodata**) without
sharing others (data and bss). That's why nommu can't run normal elf but
can run PIE or FDPIC binaries. Or binflt which is the old a.out version.

** Don't ask me what happens when rodata contains a constant pointer to
a bss or data object. I'm guessing the compiler Does A Thing. Ask Rich
Felker?


* Re: execve(NULL, argv, envp) for nommu?
  2017-09-06  1:12     ` Rob Landley
@ 2017-09-08 21:18       ` Rob Landley
  2017-09-11 15:15         ` Oleg Nesterov
  2017-09-11 18:14       ` Alan Cox
  1 sibling, 1 reply; 13+ messages in thread
From: Rob Landley @ 2017-09-08 21:18 UTC (permalink / raw)
  To: Alan Cox, Geert Uytterhoeven
  Cc: Linux Embedded, Oleg Nesterov, dalias, linux-kernel

On 09/05/2017 08:12 PM, Rob Landley wrote:
> On 09/05/2017 08:24 AM, Alan Cox wrote:
>>>> honoring the suid bit if people feel that way. I just wanna unblock
>>>> vfork() while still running this code. 
>>
>> Would it make more sense to have a way to promote your vfork into a
>> fork when you hit these cases (I appreciate that fork on NOMMU has a much
>> higher performance cost as you start having to softmmu copy or swap
>> pages).
> 
> It's not the performance cost, it's rewriting all the pointers.
> 
> Without address translation, copying the existing mappings to a new
> range requires finding and adjusting every pointer to the old data,
> which you can do for the executable mappings in PIE* binaries, but
> tracking down all the pointers on the stack, heap, and in your global
> variables? Flaming pain.
> 
> Making fork() work on nommu is basically the same problem as making
> garbage collection work in C on mmu. Thus those of us who defend vfork()
> from the people who don't understand why it exists periodically
> suggesting we remove it.

So is exec(NULL, argv, envp) a reasonable thing to want?

Rob


* Re: execve(NULL, argv, envp) for nommu?
  2017-09-08 21:18       ` Rob Landley
@ 2017-09-11 15:15         ` Oleg Nesterov
  2017-09-12 10:48           ` Rob Landley
  0 siblings, 1 reply; 13+ messages in thread
From: Oleg Nesterov @ 2017-09-11 15:15 UTC (permalink / raw)
  To: Rob Landley
  Cc: Alan Cox, Geert Uytterhoeven, Linux Embedded, dalias, linux-kernel

On 09/08, Rob Landley wrote:
>
> So is exec(NULL, argv, envp) a reasonable thing to want?

I think that something like prctl(PR_OPEN_EXE_FILE) which does

	dentry_open(current->mm->exe_file->path, O_PATH)

and returns an fd makes more sense.

Then you can do execveat(fd, "", ..., AT_EMPTY_PATH).

But to be honest, I can't understand the problem, because I know nothing
about nommu.

You need to unblock parent sleeping in vfork(), and you can't do another
fork (I don't understand why).

Perhaps the child can create another thread? The main thread can exit
after that and unblock the parent. Or perhaps even something like
clone(CLONE_VM | CLONE_PARENT), I dunno...

Oleg.



* Re: execve(NULL, argv, envp) for nommu?
  2017-09-06  1:12     ` Rob Landley
  2017-09-08 21:18       ` Rob Landley
@ 2017-09-11 18:14       ` Alan Cox
  1 sibling, 0 replies; 13+ messages in thread
From: Alan Cox @ 2017-09-11 18:14 UTC (permalink / raw)
  To: Rob Landley
  Cc: Geert Uytterhoeven, Linux Embedded, Oleg Nesterov, dalias, linux-kernel

> It's not the performance cost, it's rewriting all the pointers.

Which you don't need to do

> Without address translation, copying the existing mappings to a new
> range requires finding and adjusting every pointer to the old data,

No it doesn't. See Minix.

When you fork() rather than vfork you stick a copy of any non-relocatable
elements (typically DATA copy + BSS + stack with a sane CPU and compiler)
into a buffer and you swap them over with the real copy when you task
switch to the one in the wrong place. If you start the child first you
usually only take one copy.
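
As a rough illustration of the swap scheme described above, here is a
toy user-space simulation. It is only a sketch: the fixed "live" buffer
stands in for the one address range the program's pointers are valid
for, and the names (live, struct task, toy_fork, toy_switch) are
hypothetical, not taken from Minix or any kernel.

```c
#include <string.h>

#define IMAGE_SIZE 64

/* The one address range the program's data pointers are valid for. */
static unsigned char live[IMAGE_SIZE];

struct task {
	unsigned char saved[IMAGE_SIZE];	/* image while not running */
};

/* fork(): the child starts as a byte-for-byte copy of the live image. */
void toy_fork(struct task *child)
{
	memcpy(child->saved, live, IMAGE_SIZE);
}

/* Context switch: park the outgoing image, bring the incoming one in. */
void toy_switch(struct task *out, struct task *in)
{
	memcpy(out->saved, live, IMAGE_SIZE);
	memcpy(live, in->saved, IMAGE_SIZE);
}
```

Because both tasks only ever run with their data at the live address, no
pointer inside the image needs adjusting; the cost is a copy in each
direction on every switch between the two.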

I've always been amused that Linux NOMMU hasn't managed to grow a feature
that people successfully implemented on 68000 long long ago, and I
believe some other processors back to v6/v7 days.

Alan


* Re: execve(NULL, argv, envp) for nommu?
  2017-09-11 15:15         ` Oleg Nesterov
@ 2017-09-12 10:48           ` Rob Landley
  2017-09-12 11:30             ` Geert Uytterhoeven
  2017-09-12 15:45             ` Oleg Nesterov
  0 siblings, 2 replies; 13+ messages in thread
From: Rob Landley @ 2017-09-12 10:48 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Alan Cox, Geert Uytterhoeven, Linux Embedded, dalias, linux-kernel

On 09/11/2017 10:15 AM, Oleg Nesterov wrote:
> On 09/08, Rob Landley wrote:
>>
>> So is exec(NULL, argv, envp) a reasonable thing to want?
> 
> I think that something like prctl(PR_OPEN_EXE_FILE) which does
> 
> 	dentry_open(current->mm->exe_file->path, O_PATH)
> 
> and returns fd make more sense.
> 
> Then you can do execveat(fd, "", ..., AT_EMPTY_PATH).

I'm all for it? That sounds like a cosmetic difference, a more verbose
way of achieving the same outcome.

(Of course now you've got a filehandle you can read xattrs and such
through from otherwise jailed contexts letting you do things you
couldn't necessarily do before, but I assume you know the security
implications of that more than I do. I tried to suggest something that
_didn't_ create new capabilities, just let nommu do a thing that mmu
could already do.)

> But to be honest, I can't understand the problem, because I know nothing
> about nommu.
> 
> You need to unblock parent sleeping in vfork(), and you can't do another
> fork (I don't understand why).

A nommu system doesn't have a memory management unit, so all addresses
are physical addresses. This means two processes can't see different
things at the same address: either they see the same thing or one of
them can't see that address (due to a range register masking it).

Conventional fork() creates copy on write mappings of all the existing
writable memory of the parent process. So when the new PID dirties a
page, the old page gets copied by the fault handler. The problem isn't
the copies (that's just slow), the problem is two processes seeing
different things at the same address. That requires an MMU with a TLB
loaded from page tables.

If you create _new_ mappings and copy the data over, they'll have
different addresses. But any pointers you copied will point to the _old_
addresses. Finding and adjusting all those pointers to point to the new
addresses instead is basically the same problem as doing garbage
collection in C.

Your stack has pointers. Your heap has pointers. Your data and bss (once
initialized) can have pointers. These pointers can be in the middle of
malloc()'ed structures so no ELF table anywhere knows anything about
them. A long variable containing a value that _could_ point into one of
these ranges isn't guaranteed to _be_ a pointer, in which case adjusting
it is breakage. Tracking them all down and fixing up just the right ones
without missing any or changing data you shouldn't is REALLY HARD.
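
A small sketch of the stale-pointer problem, under the assumption that
two stack arrays can stand in for the old and new mappings. The helper
names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

struct node {
	struct node *next;
	int value;
};

/* True if p points somewhere inside the n-byte region starting at base. */
int points_into(const void *p, const void *base, size_t n)
{
	uintptr_t a = (uintptr_t)p, b = (uintptr_t)base;

	return a >= b && a < b + n;
}

/*
 * Copy a two-node "heap" to a new location, as a nommu fork() would have to.
 * Returns 1 if the copied list still points back into the old region.
 */
int copy_leaves_stale_pointer(void)
{
	struct node old_heap[2], new_heap[2];

	old_heap[0].next = &old_heap[1];
	old_heap[0].value = 1;
	old_heap[1].next = NULL;
	old_heap[1].value = 2;

	memcpy(new_heap, old_heap, sizeof(old_heap));

	/* new_heap[0].next was copied verbatim, so it still names old_heap[1]. */
	return points_into(new_heap[0].next, old_heap, sizeof(old_heap)) &&
	       !points_into(new_heap[0].next, new_heap, sizeof(new_heap));
}
```

Here the fixup would be easy because the layout is known. The hard part
is that in a real process nothing records which words are pointers.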

The vfork() system call is what you use on nommu instead: it creates a
child process that uses its parent's memory mappings. The parent process
is stopped until the child calls _exit() or exec(), either of which
means it stops using those mappings and the parent can go back to using
them without the two stomping on each other. (Usually they even share
the same stack, so the child shouldn't return from the function that
called vfork() or it'll corrupt the stack for the parent process. And be
careful about changing local variables, the parent might see the changes
when it resumes. Some vfork() implementations provide a small new stack,
ala signal handlers or kernel interrupts, so you can't guarantee your
parent will see your local variable changes, but you still can't return
from the function that called vfork() in either case.)

So after calling vfork(), the child _must_ call exec() in order for
there to be two independent processes running at the same time. Until
then, the parent is stopped.
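
The usual vfork() discipline can be sketched like this: the child does
nothing except exec or _exit(), and the parent only continues once one
of those has happened. run_and_wait is a hypothetical helper name.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run path with argv in a vfork()ed child and return its exit status,
 * or -1 on error or abnormal termination.
 */
int run_and_wait(const char *path, char *const argv[])
{
	int status;
	pid_t pid = vfork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* Child: borrowing the parent's memory, so exec or _exit only. */
		execv(path, argv);
		/* _exit(), not exit(): leave the parent's stdio and atexit alone. */
		_exit(127);
	}
	/* Parent: only resumes once the child has exec'd or exited. */
	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```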

The real problem with implementing full fork() isn't the expense of
copying the data (although if you fork and exec from a mozilla style pig
process, you could copy hundreds of megabytes of data and then
immediately discard it again; that's why fork() doesn't usually do that;
oh and on nommu systems you need _contiguous_ memory blocks for the data
because it can't collect disparate pages together into a longer mapping,
so this is actually a largeish real-world issue on those systems, not
merely slow and expensive.) The hard problem is translating the pointers
so the new mapping doesn't read/write objects in the old mapping.

> Perhaps the child can create another thread? The main thread can exit
> after that and unblock the parent. Or perhaps even something like
> clone(CLONE_VM | CLONE_PARENT), I dunno...

Launching a new thread doesn't unblock the parent. A second vfork() from
the child wouldn't unblock the parent. Your mappings are still
overcommitted, only _exit() or execve() releases the child process's use
of those mappings.

You can create threads on nommu because they're designed to share the
same mappings. In that case you're guaranteed a new stack, and not
stomping the parent's data is your problem.

But if you exec() from a thread, POSIX says it kills all the other threads:

http://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html

And even without that, we're still in the "vfork but add concurrency"
territory. Your threads don't have their own independent mappings,
they're sharing and stomping each other's data unless you add locking
and write your program to know about the other threads. To get two
independent process contexts running the same executable but with
different mappings (I.E. the goal we started with), you still need the
child to exec. And the start of this thread was "exec what"?

> Oleg.

Rob


* Re: execve(NULL, argv, envp) for nommu?
  2017-09-12 10:48           ` Rob Landley
@ 2017-09-12 11:30             ` Geert Uytterhoeven
  2017-09-12 13:45               ` Rob Landley
  2017-09-12 15:45             ` Oleg Nesterov
  1 sibling, 1 reply; 13+ messages in thread
From: Geert Uytterhoeven @ 2017-09-12 11:30 UTC (permalink / raw)
  To: Rob Landley
  Cc: Oleg Nesterov, Alan Cox, Linux Embedded, Rich Felker, linux-kernel

Hi Rob,

On Tue, Sep 12, 2017 at 12:48 PM, Rob Landley <rob@landley.net> wrote:
> A nommu system doesn't have a memory management unit, so all addresses
> are physical addresses. This means two processes can't see different
> things at the same address: either they see the same thing or one of
> them can't see that address (due to a range register masking it).
>
> Conventional fork() creates copy on write mappings of all the existing
> writable memory of the parent process. So when the new PID dirties a
> page, the old page gets copied by the fault handler. The problem isn't
> the copies (that's just slow), the problem is two processes seeing
> different things at the same address. That requires an MMU with a TLB
> loaded from page tables.
>
> If you create _new_ mappings and copy the data over, they'll have
> different addresses. But any pointers you copied will point to the _old_
> addresses. Finding and adjusting all those pointers to point to the new
> addresses instead is basically the same problem as doing garbage
> collection in C.
>
> Your stack has pointers. Your heap has pointers. Your data and bss (once
> initialized) can have pointers. These pointers can be in the middle of
> malloc()'ed structures so no ELF table anywhere knows anything about
> them. A long variable containing a value that _could_ point into one of
> these ranges isn't guaranteed to _be_ a pointer, in which case adjusting
> it is breakage. Tracking them all down and fixing up just the right ones
> without missing any or changing data you shouldn't is REALLY HARD.

Hence (make the compiler) never store pointers, only offsets relative to a
base register. So after making copies of stack, data/bss, and heap, all you
need to do is adjust these base registers for the child process.
Nothing in main memory needs to be modified.

Text accesses can be PC-relative => nothing to adjust.
Local variable accesses are stack-relative => nothing to adjust.
Data/bss accesses can be relative to a reserved register that stores the
data base address => only adjust the base register, nothing in RAM to adjust.
Heap accesses can be relative to a reserved register that stores the heap
base address => only adjust the base register, nothing in RAM to adjust.
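
A hand-written sketch of the offset idea, assuming a single arena and
doing in plain C what the compiler would have to do implicitly. The
names (struct onode, at, build_list, sum_list) are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* A list node that links by byte offset from the arena base, not by address. */
struct onode {
	size_t next_off;	/* offset of next node, or 0 for end of list */
	int value;
};

#define ARENA_NODES 4	/* slot 0 is reserved so that offset 0 can mean "none" */

/* Resolve an offset against whichever base this process was given. */
static struct onode *at(void *base, size_t off)
{
	return (struct onode *)((char *)base + off);
}

/* Build the list 10 -> 20 -> 30 in slots 1..3 and return the head offset. */
size_t build_list(struct onode *arena)
{
	for (size_t i = 1; i < ARENA_NODES; i++) {
		arena[i].value = (int)i * 10;
		arena[i].next_off =
			(i + 1 < ARENA_NODES) ? (i + 1) * sizeof(*arena) : 0;
	}
	return sizeof(*arena);
}

/* Walk the list using only the base and offsets. */
int sum_list(void *base, size_t head_off)
{
	int sum = 0;

	for (size_t off = head_off; off != 0; off = at(base, off)->next_off)
		sum += at(base, off)->value;
	return sum;
}
```

After a memcpy() of the arena, sum_list() on the copy gives the same
result with only the base argument changed. This only covers data that
lives in one arena with one base.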

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds


* Re: execve(NULL, argv, envp) for nommu?
  2017-09-12 11:30             ` Geert Uytterhoeven
@ 2017-09-12 13:45               ` Rob Landley
  2017-09-13 19:33                 ` Alan Cox
  0 siblings, 1 reply; 13+ messages in thread
From: Rob Landley @ 2017-09-12 13:45 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Oleg Nesterov, Alan Cox, Linux Embedded, Rich Felker, linux-kernel

On 09/12/2017 06:30 AM, Geert Uytterhoeven wrote:
> Hi Rob,
> 
> On Tue, Sep 12, 2017 at 12:48 PM, Rob Landley <rob@landley.net> wrote:
>> Your stack has pointers. Your heap has pointers. Your data and bss (once
>> initialized) can have pointers. These pointers can be in the middle of
>> malloc()'ed structures so no ELF table anywhere knows anything about
>> them. A long variable containing a value that _could_ point into one of
>> these ranges isn't guaranteed to _be_ a pointer, in which case adjusting
>> it is breakage. Tracking them all down and fixing up just the right ones
>> without missing any or changing data you shouldn't is REALLY HARD.
> 
> Hence (make the compiler) never store pointers, only offsets relative to a
> base register. So after making copies of stack, data/bss, and heap, all you
> need to do is adjust these base registers for the child process.
> Nothing in main memory needs to be modified.

Ok, I'll bite. How do you set a signal handler under this regime, since
that needs to pass a function pointer to the syscall? Have a different
function pointer type for when you want a real pointer instead of an
offset pointer? Perhaps label them "near" and "far" pointers, since
there's precedent for that back under DOS?

When you call printf(), how does it accept both a "string constant"
living in rodata and a char array on the stack? Two printf functions
with different argument types? If it _does_ take an actual memory
address, rather than an offset that isn't always relative to the same
segment, then you've written pointers to the stack...

You're also requiring static linking: shared libraries work just fine
with fdpic, but under your segment:offset addressing system all text has
to be relative to the same code segment.

Plus there's still the "fork() off of mozilla" problem that you may copy
lots of data just to immediately discard it as the common case (unless
you'd still use vfork() for most things), and you still need contiguous
blocks of memory for each segment (nommu is vulnerable to fragmentation,
increasingly so as the system stays up longer) so your fork() will fail
where vfork() succeeds. But that just makes it really slow and
unreliable, rather than requiring a large rewrite of the C language.

> Text accesses can be PC-relative => nothing to adjust.
> Local variable accesses are stack-relative => nothing to adjust.
> Data/bss accesses can be relative to a reserved register that stores the
> data base address => only adjust the base register, nothing in RAM to adjust.

Does this compiler setup you're describing actually exist?

Instead of making a minor adjustment to one system call, it's better to
extensively rewrite compilers and calling conventions, ignoring the way
C traditionally treats strings and arrays as pointers where pointers
into data, bss, heap, and stack are all used interchangeably...

> Heap accesses can be relative to a reserved register that stores the heap
> base address => only adjust the base register, nothing in RAM to adjust.

Query: if you implement a linked list ala:

struct blah {
  struct blah *next;
  char *key, *value;
};

If next points to a malloc(), key is a constant string in rodata, and
value was strchr(getenv(key), '=')+1 (with appropriate error checking of
course), how does your compiler know which segment each pointer in that
structure is offset from? (What segment IS your environment space
relative to, anyway? It's not the _current_ value of your stack pointer,
that moves.)
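
To make the question concrete, here is a sketch that fills in one such
node. The variable name BLAH_DEMO and the function make_node are
hypothetical, and the sketch uses getenv()'s return value directly,
since that already points at the text after the '='.

```c
#include <stdlib.h>
#include <string.h>

struct blah {
	struct blah *next;
	char *key, *value;
};

#define DEMO_KEY "BLAH_DEMO"	/* hypothetical environment variable */

/*
 * Build one node whose three pointers land in three different regions:
 * next -> heap, key -> a string literal (typically rodata),
 * value -> the environment block.
 * Returns NULL if the variable is unset or allocation fails.
 */
struct blah *make_node(void)
{
	char *val = getenv(DEMO_KEY);
	struct blah *node;

	if (!val)
		return NULL;
	node = malloc(sizeof(*node));
	if (!node)
		return NULL;
	node->next = calloc(1, sizeof(*node->next));	/* heap */
	node->key = DEMO_KEY;				/* string literal */
	node->value = val;				/* environment */
	return node;
}
```

All three members have ordinary pointer types, so nothing in the struct
says which region each one refers to.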

How does your proposed compiler rewrite handle mmap()? You can do
MAP_SHARED just fine on nommu today, it's only MAP_PRIVATE that requires
copy on write. (Yes MAP_SHARED can be read only.)

You're aware that most heap implementations can have more than one
underlying mmap(), right?

  http://git.musl-libc.org/cgit/musl/tree/src/malloc/malloc.c#n320

https://github.com/kraj/uClibc/blob/master/libc/stdlib/malloc/malloc.c#L121

So when you say _the_ heap base address above, which chunk are you
referring to?

Rob


* Re: execve(NULL, argv, envp) for nommu?
  2017-09-12 10:48           ` Rob Landley
  2017-09-12 11:30             ` Geert Uytterhoeven
@ 2017-09-12 15:45             ` Oleg Nesterov
  2017-09-13 14:20               ` Oleg Nesterov
  1 sibling, 1 reply; 13+ messages in thread
From: Oleg Nesterov @ 2017-09-12 15:45 UTC (permalink / raw)
  To: Rob Landley
  Cc: Alan Cox, Geert Uytterhoeven, Linux Embedded, dalias, linux-kernel

On 09/12, Rob Landley wrote:
>
> On 09/11/2017 10:15 AM, Oleg Nesterov wrote:
> > On 09/08, Rob Landley wrote:
> >>
> >> So is exec(NULL, argv, envp) a reasonable thing to want?
> >
> > I think that something like prctl(PR_OPEN_EXE_FILE) which does
> >
> > 	dentry_open(current->mm->exe_file->path, O_PATH)
> >
> > and returns fd make more sense.
> >
> > Then you can do execveat(fd, "", ..., AT_EMPTY_PATH).
> I'm all for it? That sounds like a cosmetic difference, a more verbose
> way of achieving the same outcome.

Simpler to implement. Something like the (untested) patch below. Not sure
it is correct, not sure it is a good idea, etc.

> (Of course now you've got a filehandle you can read xattrs and such
> through from otherwise jailed contexts letting you do things you
> couldn't necessarily do before,

I could easily be wrong, this is not my area, but afaics no. Note that
you get the FMODE_PATH file (see O_PATH), you can do almost nothing
with it.

So. IIUC with this patch you can do

	fd = prctl(PR_OPEN_EXE_FILE);

	execveat(fd, "", NULL, NULL, AT_EMPTY_PATH);

and execveat should succeed even if the binary was unlinked/renamed in
between.

otoh it should fail if, say, you do "chmod a-x exename" in between.

However. This won't work after chroot() so I am not sure this solves your
problems.
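
For illustration, the execveat() half of this can already be exercised
from user space. The sketch below takes an already-open fd as a
parameter, because PR_OPEN_EXE_FILE is only proposed in this thread. It
uses the raw syscall on the assumption that the C library may lack an
execveat() wrapper; run_fd is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000	/* value from the Linux UAPI headers */
#endif

/*
 * Exec the file that fd refers to in a vfork()ed child, wait for it,
 * and return its exit status (or -1 on error).
 */
int run_fd(int fd, char *const argv[], char *const envp[])
{
	int status;
	pid_t pid = vfork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* Empty path plus AT_EMPTY_PATH means "exec fd itself". */
		syscall(SYS_execveat, fd, "", argv, envp, AT_EMPTY_PATH);
		_exit(127);	/* only reached if the exec failed */
	}
	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

With /proc mounted, an fd from open("/proc/self/exe", O_PATH) can be
passed in today; the proposed prctl would supply the fd without needing
/proc.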

> but I assume you know the security
> implications of that more than I do.

Unlikely ;)


> > But to be honest, I can't understand the problem, because I know nothing
> > about nommu.
> >
> > You need to unblock parent sleeping in vfork(), and you can't do another
> > fork (I don't understand why).
>
> A nommu system doesn't have a memory management unit, so all addresses
> are physical addresses. This means two processes can't see different
> things at the same address: either they see the same thing or one of
> them can't see that address (due to a range register masking it).

Yes, yes, I understand, and thanks for your detailed explanation...

> > Perhaps the child can create another thread? The main thread can exit
> > after that and unblock the parent. Or perhaps even something like
> > clone(CLONE_VM | CLONE_PARENT), I dunno...
>
> Launching a new thread doesn't unblock the parent.

Well, this doesn't really matter, but see above, the main thread can exit
after that. This should unblock the parent.

> And even without that, we're still in the "vfork but add concurrency"
> territory. Your threads don't have their own independent mappings,

Of course!

I just misinterpreted your initial email as if this is fine for your
use-case, and all you need is to unblock the parent and nothing else.

Oleg.
---


--- x/kernel/sys.c
+++ x/kernel/sys.c
@@ -2183,6 +2183,40 @@ static int propagate_has_child_subreaper(struct task_struct *p, void *data)
 	return 1;
 }
 
+static int open_mm_exe_file(void)
+{
+	struct file *exe_file, *file;
+	struct path *path;
+	int fd = -ENOENT;
+
+	exe_file = get_mm_exe_file(current->mm);
+	if (!exe_file)
+		goto out;
+
+	path = &exe_file->f_path;
+	if (!path->dentry)
+		goto put_exe_file;
+
+	fd = get_unused_fd_flags(O_CLOEXEC); // flags?
+	if (fd < 0)
+		goto put_exe_file;
+
+	file = dentry_open(path, O_PATH, current_cred());
+	if (IS_ERR(file)) {
+		put_unused_fd(fd);
+		fd = PTR_ERR(file);
+		goto put_exe_file;
+	}
+
+	path_get(path);
+	fd_install(fd, file);
+
+put_exe_file:
+	fput(exe_file);
+out:
+	return fd;
+}
+
 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
 {
@@ -2196,6 +2230,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 
 	error = 0;
 	switch (option) {
+	case PR_OPEN_EXE_FILE:
+		error = open_mm_exe_file();
+		break;
 	case PR_SET_PDEATHSIG:
 		if (!valid_signal(arg2)) {
 			error = -EINVAL;



* Re: execve(NULL, argv, envp) for nommu?
  2017-09-12 15:45             ` Oleg Nesterov
@ 2017-09-13 14:20               ` Oleg Nesterov
  0 siblings, 0 replies; 13+ messages in thread
From: Oleg Nesterov @ 2017-09-13 14:20 UTC (permalink / raw)
  To: Rob Landley
  Cc: Alan Cox, Geert Uytterhoeven, Linux Embedded, dalias, linux-kernel

On 09/12, Oleg Nesterov wrote:
>
> On 09/12, Rob Landley wrote:
> >
> > On 09/11/2017 10:15 AM, Oleg Nesterov wrote:
> > > On 09/08, Rob Landley wrote:
> > >>
> > >> So is exec(NULL, argv, envp) a reasonable thing to want?
> > >
> > > I think that something like prctl(PR_OPEN_EXE_FILE) which does
> > >
> > > 	dentry_open(current->mm->exe_file->path, O_PATH)
> > >
> > > and returns fd make more sense.
> > >
> > > Then you can do execveat(fd, "", ..., AT_EMPTY_PATH).
> > I'm all for it? That sounds like a cosmetic difference, a more verbose
> > way of achieving the same outcome.
>
> Simpler to implement. Something like the (untested) patch below. Not sure
> it is correct, not sure it is good idea, etc.

OTOH... with the trivial patch below

	execveat(AT_FDCWD, "", NULL, NULL, AT_EMPTY_PATH);

should always work, even if the binary is not in scope after chroot, or if
it is no longer executable, or unlinked. But I am not sure what else we
should do to avoid the security problems.

Oleg.


--- x/fs/exec.c
+++ x/fs/exec.c
@@ -832,23 +832,32 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags)
 {
 	struct file *file;
 	int err;
-	struct open_flags open_exec_flags = {
-		.open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
-		.acc_mode = MAY_EXEC,
-		.intent = LOOKUP_OPEN,
-		.lookup_flags = LOOKUP_FOLLOW,
-	};
-
-	if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH)) != 0)
-		return ERR_PTR(-EINVAL);
-	if (flags & AT_SYMLINK_NOFOLLOW)
-		open_exec_flags.lookup_flags &= ~LOOKUP_FOLLOW;
-	if (flags & AT_EMPTY_PATH)
-		open_exec_flags.lookup_flags |= LOOKUP_EMPTY;
 
-	file = do_filp_open(fd, name, &open_exec_flags);
-	if (IS_ERR(file))
-		goto out;
+	if (fd == AT_FDCWD && name->name[0] == '\0' && flags == AT_EMPTY_PATH) {
+		file = get_mm_exe_file(current->mm);
+		if (!file) {
+			file = ERR_PTR(-ENOENT);
+			goto out;
+		}
+	} else {
+		struct open_flags open_exec_flags = {
+			.open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
+			.acc_mode = MAY_EXEC,
+			.intent = LOOKUP_OPEN,
+			.lookup_flags = LOOKUP_FOLLOW,
+		};
+
+		if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH)) != 0)
+			return ERR_PTR(-EINVAL);
+		if (flags & AT_SYMLINK_NOFOLLOW)
+			open_exec_flags.lookup_flags &= ~LOOKUP_FOLLOW;
+		if (flags & AT_EMPTY_PATH)
+			open_exec_flags.lookup_flags |= LOOKUP_EMPTY;
+
+		file = do_filp_open(fd, name, &open_exec_flags);
+		if (IS_ERR(file))
+			goto out;
+	}
 
 	err = -EACCES;
 	if (!S_ISREG(file_inode(file)->i_mode))



* Re: execve(NULL, argv, envp) for nommu?
  2017-09-12 13:45               ` Rob Landley
@ 2017-09-13 19:33                 ` Alan Cox
  0 siblings, 0 replies; 13+ messages in thread
From: Alan Cox @ 2017-09-13 19:33 UTC (permalink / raw)
  To: Rob Landley
  Cc: Geert Uytterhoeven, Oleg Nesterov, Linux Embedded, Rich Felker,
	linux-kernel

> Ok, I'll bite. How do you set a signal handler under this regime, since
> that needs to pass a function pointer to the syscall? Have a different
> function pointer type for when you want a real pointer instead of an
> offset pointer? Perhaps label them "near" and "far" pointers, since
> there's precedent for that back under DOS?

A function pointer is an offset relative to the base of the code (but the
other comments are mostly valid)

For most hardware it's cheaper to just do it the way Minix did,
especially as all the hard work in being able to share code and
copy/migrate data happens to have been done in order to make XIP work. A
modern CPU can copy memory a lot faster than an 8MHz 68K which couldn't
even manage to move 16bits/clock.

> You're also requiring static linking: shared libraries work just fine
> with fdpic, but under your segment:offset addressing system all text has
> to be relative to the same code segment.

No - see the Windows 16bit approach to this. Bring a bucket though 8)

> Plus there's still the "fork() off of mozilla" problem that you may copy
> lots of data just to immediately discard it as the common case (unless
> you'd still use vfork() for most things), and you still need contiguous
> blocks of memory for each segment (nommu is vulnerable to fragmentation,
> increasingly so as the system stays up longer) so your fork() will fail
> where vfork() succeeds. But that just makes it really slow and

If you just do copies and scheduling time swaps of memory blocks then
fragmentation isn't a problem because you can fragment the copy not
currently running. In fact you can (as MAPUX did) extend this to
completely kill the fragmentation problem at the cost of turning
sustained high memory usage with few process deaths into very poor
performance. MAPUX algorithm works very hard to keep stuff unfragmented
but is prepared to move chunks of other processes temporarily around in
order to keep the running process where it should be. In effect it
implements a software paged MMU with an allocator that tries to achieve a
1:1 mapping of the virt/phys of the process.

POSIX tries to sidestep all of this by providing a single function
(posix_spawn) that combines fork, messing with the child's file handles
etc., and execve. An MMUless system can implement it to provide the usual
functionality of fork() / execve(), such as handle redirection. There are
also other ways to implement that with threads not sharing file handles
if you have enough thread capability (something posix_spawn can't assume).
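
A minimal posix_spawn() sketch showing handle redirection without any
fork() or vfork() in user code. spawn_redirected is a hypothetical
helper name.

```c
#include <fcntl.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/*
 * Spawn path with argv, with the child's stdout redirected to out_path.
 * Returns the child's exit status, or -1 on error.
 */
int spawn_redirected(const char *path, char *const argv[],
		     const char *out_path)
{
	posix_spawn_file_actions_t fa;
	pid_t pid;
	int status, err;

	if (posix_spawn_file_actions_init(&fa) != 0)
		return -1;
	/* The open happens on the child's side, replacing its stdout. */
	err = posix_spawn_file_actions_addopen(&fa, STDOUT_FILENO, out_path,
					       O_WRONLY | O_CREAT | O_TRUNC,
					       0644);
	if (err == 0)
		err = posix_spawn(&pid, path, &fa, NULL, argv, environ);
	posix_spawn_file_actions_destroy(&fa);
	if (err != 0)
		return -1;
	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

It still needs a path to exec, so it does not by itself answer the
"exec what?" question for re-running the current binary.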

Alan
