From: Linus Torvalds <torvalds@linux-foundation.org>
To: Josh Triplett <josh@joshtriplett.org>
Cc: Kees Cook <kees@kernel.org>, Kees Cook <keescook@chromium.org>,
	linux-kernel@vger.kernel.org,
	 Alexey Dobriyan <adobriyan@gmail.com>
Subject: Re: [GIT PULL] execve updates for v6.8-rc1
Date: Tue, 9 Jan 2024 15:40:28 -0800
Message-ID: <CAHk-=wi75tFVtZdzFRr4hsDeUKmeACbgD46rLe+2bcd=4mHBBw@mail.gmail.com>
In-Reply-To: <ZZ2W_xzCSyOgltad@localhost>

On Tue, 9 Jan 2024 at 10:57, Josh Triplett <josh@joshtriplett.org> wrote:
>
> But I do think the spawnbench
> benchmark I provided (which has fork-execvpe and vfork-execvpe and
> posix_spawnp variants) is representative of real-world patterns for how
> programs execute other programs on $PATH.

No, it really isn't.

I already pointed out that the benchmark is entirely broken, because
what it times is broken.

It basically shows the difference in times between the parent and
child both doing a clock_gettime(), but what happens in between those
is
 - the parent doing a fork
 - the child doing an execve
 - the child reading the time
and that makes it an "at which point did we happen to schedule"
benchmark, not an execve() benchmark.
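
To make that concrete, the pattern in question is roughly this (a
minimal sketch, assuming spawnbench does the obvious thing; the real
benchmark also does the $PATH lookup and has vfork and posix_spawnp
variants):

    /*
     * Minimal sketch of the fork+execve timing pattern being
     * discussed; the real spawnbench differs in the details.
     */
    #include <stdio.h>
    #include <sys/wait.h>
    #include <time.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        struct timespec ts;

        if (argc > 1) {
            /* exec'ed child: read the clock as the "end" time */
            clock_gettime(CLOCK_MONOTONIC, &ts);
            printf("child:  %ld.%09ld\n", (long)ts.tv_sec, ts.tv_nsec);
            return 0;
        }

        /* parent: take the "start" time, then fork and exec self */
        clock_gettime(CLOCK_MONOTONIC, &ts);
        printf("parent: %ld.%09ld\n", (long)ts.tv_sec, ts.tv_nsec);
        if (fork() == 0) {
            char *args[] = { argv[0], "child", NULL };
            execv(argv[0], args);
            _exit(127);
        }
        wait(NULL);
        return 0;
    }

The reported number is the difference between those two timestamps,
and everything the scheduler does in between is part of what gets
measured.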

Just as an example, imagine if we schedule the child immediately after
the fork, and the parent doesn't run at all.

That obviously gives the minimum time difference - what your benchmark
then treats as "best". Agreed?

Except does it?

You have random details like "who happens to take the page fault and
has to copy the stack page that has been marked COW, and that both
the parent and child have mapped and will both write to immediately
after the fork()".

If the child runs first, the child will take that page fault, and the
child will have to do the page copy and insert it into its own VM.

So maybe it's better if the parent runs first and takes the page fault
and does the copy, and the child runs on another CPU just a little bit
later, and sees that it now has an exclusive page and doesn't need to
copy it at all? Maybe it gets to the execve() faster that way, and
gets a lower time difference just by pure luck? Or at least has a CPU
core of its own while the parent does something else?

Now, with "fork()" *something* has to do the page copy before the
execve() happens, unless it's all very lucky and the child happens to
run with the stack just at a page boundary and just gets its own page
that way.
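
In toy form, the fork-side pattern that triggers all this is
(hypothetical example, not spawnbench itself):

    /*
     * Toy illustration: after fork() the stack page holding
     * 'buf' is shared COW. Whoever writes first takes the page
     * fault and pays for the copy; if the other side has
     * already exec'ed or exited by then, the later writer may
     * find the page exclusive and skip the copy entirely.
     */
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[64];

        memset(buf, 1, sizeof(buf));  /* stack page mapped and dirty */
        if (fork() == 0) {
            buf[0] = 'c';  /* child write: fault + copy if it runs first */
            _exit(0);
        }
        buf[0] = 'p';      /* parent write: same story, other side */
        return 0;
    }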

I suspect you'll get the best performance if you run everything on
just one CPU, and don't try to spread things out, at least if your L2
caches are big enough for the whole working set to fit - just for the
best cache utilization.

Because if you try to run the two loads on different CPU cores (and
maybe avoid HT siblings too, to get the best throughput), you'll have
to push all the cached contents from the parent to the child.

And maybe that's ok on this one. It's almost certainly a good thing on
*some* loads, particularly if the child then ends up having more work
it does longer-term.

And yes, our scheduler tries to actually take cache affinity etc into
account, although the interaction with fork() may or may not be
optimal.

But my point is that what you are testing isn't actually the execve()
cycle, you're basically testing all these scheduler interactions on a
benchmark that doesn't actually match any real load.

Now, using vfork() instead of fork() will improve things, both from a
performance standpoint and from a "not as much a scheduler benchmark"
standpoint.

At least we don't have the issue with COW pages and trying to aim for
cache re-use, because there will be no overlap in execution of the
child and parent while they share the same VM. The parent is going to
stop in vfork(), the child is obviously best run on the same CPU until
it does an execve() and releases the parent, and at that point it's
*probably* best to try to run the new child on a different CPU, and
bring the parent back on the original CPU.
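
In code, the vfork() pattern is basically this (sketch only, with
/bin/true just standing in for whatever gets spawned):

    /*
     * vfork() sketch: the parent is suspended, lending its VM
     * to the child, until the child execs or exits - so no COW
     * stack copy ever happens.
     */
    #include <unistd.h>

    int main(void)
    {
        char *args[] = { "/bin/true", NULL };  /* stand-in target */
        pid_t pid = vfork();

        if (pid == 0) {
            /* child: borrowing the parent's VM, may only exec or _exit */
            execv("/bin/true", args);
            _exit(127);
        }
        /* parent resumes here only after the execv() above */
        return 0;
    }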

Except that behavior (which sounds to me like the best option in
general) is not AT ALL what your benchmark would consider the best
option - because all your spawnbench thing looks at is how quickly
the child gets to run, so things like "cache placement for parent" no
longer matter at all for spawnbench.

So for that benchmark, instead of maybe trying to keep the parent
local to its own caches, and run the (new) child with no cache
footprint on another CPU, the best numbers for your benchmark probably
come from running the new execve() on the same CPU and not running the
parent at all until later.

And those are good numbers for spawnbench just because the process
was already on that CPU in the kernel, so not running the parent even
where it would make sense looks good, because all that matters by
then is that you want to run the child asap.

See? Your benchmark doesn't actually even *attempt* to time how good
our fork-execve sequence is. It times something entirely different. It
basically gives the best score to a scheduler decision that probably
doesn't even make sense.

Or maybe it does. Who knows? Maybe we *should* change the scheduler to
do what is best for spawnbench.

But do you see why I think it's at least as much a scheduler benchmark
as it is an execve() one, and why I think it's likely not a very good
benchmark at all, because I _suspect_ that the best numbers come from
doing things that may or may not make sense..

Now, I sent out that other benchmark, which at least avoids the whole
scheduler thing, because it does everything as one single process. I'm
not saying that's a sensible benchmark _either_, but at least it's
targeted to just execve().
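
The shape of it is roughly this - not the exact code, just the idea
of a single process re-exec'ing itself so that execve() is the only
thing being exercised:

    /*
     * Shape of a single-process execve() loop: re-exec self N
     * times, then exit. No fork, so no scheduler interaction;
     * time the whole run from the shell with time(1).
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        long count = argc > 1 ? atol(argv[1]) : 0;
        char buf[32];

        if (count >= 10000)
            return 0;      /* done: ~10000 execve() round trips */

        snprintf(buf, sizeof(buf), "%ld", count + 1);
        char *next[] = { argv[0], buf, NULL };
        execv(argv[0], next);
        return 1;          /* only reached if execv() failed */
    }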

Another option would be to not time the whole "parent clock_gettime ->
child clock_gettime" sequence that makes no sense, but to just time
the whole "fork-execve-exit-wait" sequence (which you can do in the
parent).
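
That is, something like this sketch (with /bin/true again standing in
for the real target):

    /*
     * Time the full fork+execve+exit+wait cycle from the
     * parent: whatever the scheduler decides, all the work
     * lands inside the measured window.
     */
    #include <stdio.h>
    #include <sys/wait.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        struct timespec t0, t1;
        char *args[] = { "/bin/true", NULL };  /* stand-in target */
        pid_t pid;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        pid = fork();
        if (pid == 0) {
            execv("/bin/true", args);
            _exit(127);
        }
        waitpid(pid, NULL, 0);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("%.9f s\n", (double)(t1.tv_sec - t0.tv_sec) +
                           (t1.tv_nsec - t0.tv_nsec) / 1e9);
        return 0;
    }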

Because at that point, you're not timing the difference between two
random points (where scheduling decisions will change what happens
between them), you're actually timing the cost of the *whole*
sequence. Sure, scheduling will still matter for the details, but at
least you've timed the whole work, rather than timed a random *part*
of the work where other things are then ignored entirely.

For example, once you time the whole thing, it's no longer a "did the
parent or the child do the COW copy" question. We don't care. One or the other
has to take the cost, and it's part of the *whole* cost of the
operation. Sure, scheduling decisions will still end up mattering, so
it's not a pure execve() benchmark, but at least now it's a benchmark
for the whole load, not just a random part of it.

              Linus
