Re: [GIT PULL] execve updates for v6.8-rc1

From: Linus Torvalds <torvalds@linux-foundation.org>
To: Kees Cook <kees@kernel.org>
Cc: Kees Cook <keescook@chromium.org>,
	linux-kernel@vger.kernel.org,
	Alexey Dobriyan <adobriyan@gmail.com>,
	Josh Triplett <josh@joshtriplett.org>
Subject: Re: [GIT PULL] execve updates for v6.8-rc1
Date: Mon, 8 Jan 2024 19:28:25 -0800
Message-ID: <CAHk-=wj1vvUhNd_+s8Cuvh-wO2iT7bN-vFfxhVXNOsGGGigDZg@mail.gmail.com>
In-Reply-To: <CAHk-=wgnLA7Jhjiuz8W76PRyQheLCkNS__=D1onenqbhpiXsVQ@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 3428 bytes --]

On Mon, 8 Jan 2024 at 17:53, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Because I *guarantee* that we can trivially write another benchmark
> that shows that looking up the pathname twice is worse.

Ok, so I just took a look at the alleged benchmark that was used for
the "look up twice" argument.

It looks quite broken.

What it seems to do is to "fork+execve" on a small file, and do
clock_gettime() in the parent and in the child, and add up the
differences between the times.

But that's just testing random scheduler interactions, not the speed
of fork/exec.

IOW, that one improves performance if you always run the child first
after the fork(), so that the child runs immediately, finishes the
work, and when the parent then resumes, it reads the completed result
from the pipe.

It will give wildly different results for any change in scheduling
behavior - like running the child concurrently on another CPU vs
running it immediately on the same CPU etc etc.

Using "vfork()" instead of "fork()" will remove *one* variable, in
that it will force that "child runs first" behavior that you want, and
would likely help performance a lot. But even then you'll end up with
a scheduling benchmark: when the child does "execve()" that will now
wake up the parent again, and the *optimal* behavior is probably to
run the child fully until it does "exit" (well, at least until it runs
"clock_gettime()") before scheduling the parent.

You might get that by just forcing it all to run on one single CPU,
unless the wakeup by the execve() synchronously wakes up the parent.

IOW, you can probably get closer to the numbers you want with vfork(),
but even then it's a crap-shoot and depends on scheduling.

If you want to actually test execve() itself, you shouldn't use fork()
at all - you should literally execve() in a loop, using the execve()
argument as the "loop variable". That will actually test execve(), not
the scheduling of the child, which will be pretty random.

IOW, something (truly stupid) like the attached, and then you do

    $ gcc -O2 --static t.c
    $ time ./a.out 100000 1

to time a hundred thousand execve() calls.

Look ma, no fork, vfork, or scheduler interactions.

Of course, if you then want to check the pathname lookup failure cost,
you'd need to change the "execve()" into an "execvpe()" and play around
with the PATH variable, putting "." in different places etc. And you
might want to write your own PATH lookup, to make sure it actually
uses the "execve()" system call and not "stat()" to find the
executable.

.. and do you want to then check using "execveat()" (new model) vs
"path string created by appending in user space" (classic model)?

Tons of variables. For example, modern "execveat()" behavior is
*probably* using a small pathname that is looked up by opening the
different directories in $PATH, but the old-school thing that creates
pathnames all in user space and then does "execve()" on them will
probably have fairly heavy path lookup costs.

So now the whole "look up path twice" might be very differently
expensive depending on just how you ended up dealing with the $PATH
components. It *could* be cheap. Or it might be looking up a long
path.

End result: there's a million interactions here. You need to decide
what you want to test. But you *definitely* shouldn't decide to test
some random scheduler behavior and call it "execve cost".

                Linus

[-- Attachment #2: t.c --]
[-- Type: text/x-c-code, Size: 360 bytes --]

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char **argv, char **envp)
{
	char buffer[16];
	int n, m;

	if (argc < 3)
		exit(1);
	n = atoi(argv[1]);
	if (n <= 0)
		exit(2);
	m = atoi(argv[2]);
	if (m >= n)
		exit(0);
	snprintf(buffer, sizeof(buffer), "%d", m+1);
	argv[2] = buffer;
	execve("./a.out", argv, envp);
	exit(3);
}