Eric Biggers writes:
> If (more likely) you're talking about things like "use this NEON implementation
> on Cortex-A7 but this other NEON implementation on Cortex-A53", it's up to the
> developers and community to test different CPUs and make appropriate decisions,
> and yes it can be very useful to have external benchmarks like SUPERCOP to refer
> to, and I appreciate your work in that area.

You seem to be talking about a process that selects (e.g.) ChaCha20
implementations as follows: manually inspect benchmarks of various
implementations on various CPUs, manually write code to map CPUs to
implementations, manually update the code as necessary for new CPUs, and
of course manually do the same for every other primitive that can see
differences between microarchitectures (which isn't something weird---
it's the normal situation after enough optimization effort).

This is quite a bit of manual work, so the kernel often doesn't do it,
and we end up with unhappy people talking about performance regressions.

For comparison, imagine one simple central piece of code in the kernel
to automatically do the following:

   When a CPU core is booted:
     For each primitive:
       Benchmark all implementations of the primitive on the core.
       Select the fastest for subsequent use on the core.

If this is a general-purpose mechanism (as in SUPERCOP, NaCl, and
libpqcrypto) rather than something ad-hoc (as in raid6), then there's no
manual work per primitive, and no work per implementation. Each CPU, old
or new, automatically obtains the fastest available code for that CPU.
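
As a rough illustration of the loop above, here is a minimal userspace
sketch in C. Everything in it is hypothetical: the impl_fn signature,
the two toy "implementations" (plain XOR loops standing in for real
code), and the use of clock() as the timer are assumptions made for the
example, not code from SUPERCOP or the kernel.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical common signature shared by every implementation of one primitive. */
typedef void (*impl_fn)(uint8_t *out, const uint8_t *in, size_t len);

struct impl {
    const char *name;
    impl_fn fn;
};

/* Two toy stand-ins for competing implementations (not real ciphers). */
static void xor_bytewise(uint8_t *out, const uint8_t *in, size_t len)
{
    for (size_t i = 0; i < len; i++)
        out[i] = in[i] ^ 0x5a;
}

static void xor_unrolled(uint8_t *out, const uint8_t *in, size_t len)
{
    size_t i = 0;
    for (; i + 4 <= len; i += 4) {
        out[i]     = in[i]     ^ 0x5a;
        out[i + 1] = in[i + 1] ^ 0x5a;
        out[i + 2] = in[i + 2] ^ 0x5a;
        out[i + 3] = in[i + 3] ^ 0x5a;
    }
    for (; i < len; i++)
        out[i] = in[i] ^ 0x5a;
}

/* Keeps the compiler from discarding the benchmarked work. */
static volatile uint8_t sink;

/* Best-of-five timing of one implementation, in clock() ticks. */
static double time_impl(impl_fn fn)
{
    static uint8_t in[4096], out[4096];
    double best = 0.0;

    for (int trial = 0; trial < 5; trial++) {
        clock_t start = clock();
        for (int rep = 0; rep < 200; rep++)
            fn(out, in, sizeof in);
        double elapsed = (double)(clock() - start);
        sink = out[0];
        if (trial == 0 || elapsed < best)
            best = elapsed;
    }
    return best;
}

/* Generic selection: no per-primitive or per-implementation logic here. */
static const struct impl *select_fastest(const struct impl *impls, size_t n)
{
    const struct impl *winner = NULL;
    double winner_time = 0.0;

    for (size_t i = 0; i < n; i++) {
        double t = time_impl(impls[i].fn);
        if (winner == NULL || t < winner_time) {
            winner = &impls[i];
            winner_time = t;
        }
    }
    return winner;
}

static const struct impl toy_impls[] = {
    { "bytewise", xor_bytewise },
    { "unrolled", xor_unrolled },
};
```

Calling select_fastest(toy_impls, 2) once per core and caching the
returned pointer is the whole mechanism. The selection code never names
a CPU or an implementation, so adding an implementation means adding
one table entry.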

The only cost is a moment of benchmarking at boot time. _If_ this is a
noticeable cost then there are many ways to speed it up: for example,
automatically copy the results across identical cores, automatically
copy the results across boots if the cores are unchanged, automatically
copy results from a central database indexed by CPU identifiers, etc.
The SUPERCOP database is evolving towards enabling this type of sharing.

> A lot of code can be shared, but in practice different environments have
> different constraints, and kernel programming in particular has some distinct
> differences from userspace programming.  For example, you cannot just use the
> FPU (including SSE, AVX, NEON, etc.) registers whenever you want to, since on
> most architectures they can't be used in some contexts such as hardirq context,
> and even when they *can* be used you have to run special code before and after
> which does things like saving all the FPU registers to the task_struct,
> disabling preemption, and/or enabling the FPU.

Is there some reason that each implementor is being pestered to handle
all this? Detecting FPU usage is a simple static-analysis exercise, and
the rest sounds like straightforward boilerplate that should be handled
centrally.
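
To make "handled centrally" concrete, here is a minimal userspace
sketch, assuming each primitive registers one vector implementation and
one scalar fallback. The names simd_usable(), simd_begin(), and
simd_end() are hypothetical stubs invented for this example; in the
kernel the corresponding hooks would be things like kernel_fpu_begin()
and kernel_fpu_end() on x86. The toy implementations just add 1 to each
byte.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*impl_fn)(uint8_t *out, const uint8_t *in, size_t len);

/* Hypothetical stand-ins for the platform hooks; counters are for illustration. */
static bool simd_allowed = true;
static int simd_depth;            /* > 0 while inside a begin/end section */
static int simd_sections;         /* number of sections entered so far */
static int simd_calls_in_section; /* vector calls that ran inside a section */

static bool simd_usable(void) { return simd_allowed; }
static void simd_begin(void)  { simd_depth++; simd_sections++; }
static void simd_end(void)    { simd_depth--; }

struct primitive {
    impl_fn simd;   /* would use vector registers in real code */
    impl_fn scalar; /* safe in any context */
};

static void toy_scalar(uint8_t *out, const uint8_t *in, size_t len)
{
    for (size_t i = 0; i < len; i++)
        out[i] = (uint8_t)(in[i] + 1);
}

/* Same result as toy_scalar; records whether it ran inside a section. */
static void toy_simd(uint8_t *out, const uint8_t *in, size_t len)
{
    if (simd_depth > 0)
        simd_calls_in_section++;
    for (size_t i = 0; i < len; i++)
        out[i] = (uint8_t)(in[i] + 1);
}

/* The one place that knows about the save/restore boilerplate. */
static void primitive_run(const struct primitive *p,
                          uint8_t *out, const uint8_t *in, size_t len)
{
    if (p->simd != NULL && simd_usable()) {
        simd_begin();
        p->simd(out, in, len);
        simd_end();
    } else {
        p->scalar(out, in, len);
    }
}
```

With this shape, an implementor supplies toy_simd and toy_scalar and
nothing else; the context check and the begin/end pairing live in
primitive_run() only.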

> But disabling preemption for
> long periods of time hurts responsiveness, so it's also desirable to yield the
> processor occasionally, which means that assembly implementations should be
> incremental rather than having a single entry point that does everything.

Doing this rewrite automatically is a bit more of a code-analysis
challenge, but the alternative approach of doing it by hand is insanely
error-prone. See, e.g., https://eprint.iacr.org/2017/891.
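
For the easy case, the chunking can at least be written once instead of
once per implementation. The sketch below is a minimal illustration
under a strong assumption: the implementation carries no state between
calls, so splitting the input changes nothing. Real primitives with
counters or chaining values need that state threaded through each call,
which is exactly where hand-written versions go wrong. simd_begin(),
simd_end(), and CHUNK_BYTES are hypothetical names for this example.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*impl_fn)(uint8_t *out, const uint8_t *in, size_t len);

/* Hypothetical stubs; "sections" counts how often the vector unit is claimed. */
static int sections;
static void simd_begin(void) { sections++; }
static void simd_end(void)   { }

/* Assumed upper bound on work done with preemption disabled. */
#define CHUNK_BYTES 4096

/* Stateless toy implementation: output depends only on the matching input byte. */
static void toy_impl(uint8_t *out, const uint8_t *in, size_t len)
{
    for (size_t i = 0; i < len; i++)
        out[i] = (uint8_t)(in[i] ^ 0xff);
}

/* Central wrapper: at most CHUNK_BYTES are processed per begin/end section. */
static void run_chunked(impl_fn fn, uint8_t *out, const uint8_t *in, size_t len)
{
    while (len > 0) {
        size_t n = len < CHUNK_BYTES ? len : CHUNK_BYTES;

        simd_begin();
        fn(out, in, n);
        simd_end(); /* the scheduler gets a chance to run between chunks */

        out += n;
        in += n;
        len -= n;
    }
}
```

A 10000-byte input goes through three sections here (4096 + 4096 +
1808 bytes), and toy_impl never needs to know that it was split.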

> Many people may have contributed to SUPERCOP already, but that doesn't mean
> there aren't things you could do to make it more appealing to contributors and
> more of a community project,

The logic in this sentence is impeccable, and is already illustrated by
many SUPERCOP improvements through the years from an increasing number
of contributors, as summarized in the 87 release announcements so far on
the relevant public mailing list, which you're welcome to study in
detail along with the 400 megabytes of current code and as many previous
versions as you're interested in. That's also the mailing list where
people are told to send patches, as you'll see if you RTFM.

> So Linux distributions may not want to take on the legal risk of
> distributing it

This is a puzzling comment. A moment ago we were talking about the
possibility of useful sharing of (e.g.) ChaCha20 implementations between
SUPERCOP and the Linux kernel, avoiding pointless fracturing of the
community's development process for these implementations. This doesn't
mean that the kernel should be grabbing implementations willy-nilly from
SUPERCOP---surely the kernel should be doing security audits, and the
kernel already has various coding requirements, and the kernel requires
GPL compatibility, while putting any of these requirements into SUPERCOP
would be counterproductive.

If you mean having the entire SUPERCOP benchmarking package distributed
through Linux distributions, I have no idea what your motivation is or
how this is supposed to be connected to anything else we're discussing.
Obviously SUPERCOP's broad code-inclusion policies make this idea a
non-starter.

> nor may companies want to take on the risk of contributing.

RTFM. People who submit code are authorizing public redistribution for
benchmarking. It's up to them to decide if they want to allow more.

---Dan