LKML Archive on lore.kernel.org
* Re: [PATCH v1 2/3] zinc: Introduce minimal cryptography library
@ 2018-08-01  7:22 Eric Biggers
  2018-08-01 17:02 ` Andy Lutomirski
  2018-08-03  2:33 ` Jason A. Donenfeld
  0 siblings, 2 replies; 20+ messages in thread
From: Eric Biggers @ 2018-08-01  7:22 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: linux-crypto, linux-kernel, netdev, davem, Andy Lutomirski,
	Greg KH, Samuel Neves, D . J . Bernstein, Tanja Lange,
	Jean-Philippe Aumasson, Karthikeyan Bhargavan

[+Cc linux-crypto]

Hi Jason,

Apologies for starting a new thread, but this patch apparently wasn't
Cc'ed to linux-crypto, despite adding over 24000 lines of crypto code.
So much for WireGuard being only 4000 lines :-)

(For anyone else like me who didn't receive the patch, it can be found at
https://patchwork.ozlabs.org/patch/951763/)

I have some preliminary comments below.

On Tue, 31 Jul 2018 21:11:01 +0200, Jason A. Donenfeld wrote:
> [PATCH v1 2/3] zinc: Introduce minimal cryptography library 
> 
> Zinc stands for "Zinc Is Not crypto/". It's also short, easy to type,
> and plays nicely with the recent trend of naming crypto libraries after
> elements. The guiding principle is "don't overdo it". It's less of a
> library and more of a directory tree for organizing well-curated direct
> implementations of cryptography primitives.
> 
> Zinc is a new cryptography API that is much more minimal and lower-level
> than the current one. It intends to complement it and provide a basis
> upon which the current crypto API might build, and perhaps someday Zinc
> may altogether supplant the current crypto API. It is motivated by
> three primary observations in crypto API design:
> 
>   * Highly composable "cipher modes" and related abstractions from
>     90s cryptographers did not turn out to be as terrific an idea as
>     hoped, leading to a host of API misuse problems.
> 
>   * Most programmers are afraid of crypto code, and so prefer to
>     integrate it into libraries in a highly abstracted manner, so as to
>     shield themselves from implementation details. Cryptographers, on
>     the other hand, prefer simple direct implementations, which they're
>     able to verify for high assurance and optimize in accordance with
>     their expertise.
> 
>   * Overly abstracted and flexible cryptography APIs lead to a host of
>     dangerous problems and performance issues. The kernel is usually
>     not in the business of coming up with new uses of crypto, but
>     rather of implementing various constructions, which means it
>     essentially needs a library of primitives, not a highly abstracted
>     enterprise-ready pluggable system, with a few particular exceptions.
> 
> This last observation has played out over and over again within the
> kernel:
> 
>   * The perennial move of actual primitives away from crypto/ and into
>     lib/, so that users can actually call these functions directly with
>     no overhead and without lots of allocations, function pointers,
>     string specifier parsing, and general clunkiness. For example:
>     sha256, chacha20, siphash, sha1, and so forth live in lib/ rather
>     than in crypto/. Zinc intends to stop the cluttering of lib/ and
>     introduce these direct primitives into their proper place, lib/zinc/.
> 
>   * An abundance of misuse bugs with the present crypto API that have
>     been very unpleasant to clean up.
> 
>   * A hesitance to even use cryptography, because of the overhead and
>     headaches involved in accessing the routines.
> 
> Zinc goes in a rather different direction. Rather than providing a
> thoroughly designed and abstracted API, Zinc gives you simple functions,
> which implement some primitive, or some particular and specific
> construction of primitives. It is not dynamic in the least, though one
> could imagine implementing a complex dynamic dispatch mechanism (such as
> the current crypto API) on top of these basic functions. After all,
> dynamic dispatch is usually needed for applications with cipher agility,
> such as IPsec, dm-crypt, AF_ALG, and so forth, and the existing crypto
> API will continue to play that role. However, Zinc will provide a
> non-haphazard way of directly utilizing crypto routines in
> applications that have neither need nor desire for abstraction and
> dynamic dispatch.
> 
> It also organizes the implementations in a simple, straightforward,
> and direct manner, making it enjoyable and intuitive to work on.
> Rather than moving optimized assembly implementations into arch/, it
> keeps them all together in lib/zinc/, making it simple and obvious to
> compare and contrast what's happening. This is, notably, exactly what
> the lib/raid6/ tree does, and that seems to work out rather well. It's
> also the pattern of most successful crypto libraries. Likewise, the
> cascade of architecture-specific functions is done as ifdefs within one
> file, so that it's easy and obvious and clear what's happening, how each
> architecture differs, and how to optimize for shared patterns. This is
> very much preferable to architecture-specific file splitting.
> 
> All implementations have been extensively tested and fuzzed, and are
> selected for their quality, trustworthiness, and performance. Wherever
> possible and performant, formally verified implementations are used,
> such as those from HACL* [1] and Fiat-Crypto [2]. The routines also take
> special care to zero out secrets using memzero_explicit (and future work
> is planned to have gcc do this more reliably and performantly with
> compiler plugins). The performance of the selected implementations is
> state-of-the-art and unrivaled. Each implementation also comes with
> extensive self-tests and crafted test vectors, pulled from various
> places such as Wycheproof [9].
> 
> Regularity of function signatures is important, so that users can easily
> "guess" the name of the function they want. Though, individual
> primitives are oftentimes not trivially interchangeable, having been
> designed for different things and requiring different parameters and
> semantics, and so the function signatures they provide will directly
> reflect the realities of the primitives' usage, rather than hiding them
> behind (inevitably leaky) abstractions. Also, in contrast to the current
> crypto API, Zinc functions can work on stack buffers, and can be called
> with different keys, without requiring allocations or locking.
> 
> SIMD is used automatically when available, though some routines may
> benefit from either having their SIMD disabled for particular
> invocations, or from having the SIMD initialization calls amortized
> over several invocations of the function, and so Zinc provides helpers
> and function signatures enabling that.
> 
> More generally, Zinc provides function signatures that allow just what
> is required by the various callers. This isn't to say that users of the
> functions will be permitted to pollute the function semantics with weird
> particular needs, but we are trying very hard not to overdo it, and that
> means looking carefully at what's actually necessary, and doing just that,
> and not much more than that. Remember: practicality and cleanliness rather
> than over-zealous infrastructure.
> 
> Zinc also provides an opening for the best implementers in academia to
> contribute their time and effort to the kernel, by being sufficiently
> simple and inviting. In discussing this commit with some of the best and
> brightest over the last few years, there are many who are eager to
> devote rare talent and energy to this effort.
> 
> This initial commit adds implementations of the primitives used by WireGuard:
> 
>   * Curve25519 [3]: formally verified 64-bit C, formally verified 32-bit C,
>     x86_64 BMI2, x86_64 ADX, ARM NEON.
> 
>   * ChaCha20 [4]: generic C, x86_64 SSSE3, x86_64 AVX-2, x86_64 AVX-512F,
>     x86_64 AVX-512VL, ARM NEON, ARM64 NEON, MIPS.
> 
>   * HChaCha20 [5]: generic C, x86_64 SSSE3.
> 
>   * Poly1305 [6]: generic C, x86_64, x86_64 AVX, x86_64 AVX-2, x86_64
>     AVX-512F, ARM NEON, ARM64 NEON, MIPS, MIPS64.
> 
>   * BLAKE2s [7]: generic C, x86_64 AVX, x86_64 AVX-512VL.
> 
>   * ChaCha20Poly1305 [8]: generic C construction for both full buffers and
>     scatter gather.
> 
>   * XChaCha20Poly1305 [5]: generic C construction.
> 
> Following the merging of this, I first expect the primitives that
> currently exist in lib/ to work their way into lib/zinc/, after intense
> scrutiny of each implementation, potentially replacing them with either
> formally-verified implementations, or better studied and faster
> state-of-the-art implementations. In a phase after that, I envision that
> certain instances from crypto/ will want to rebase themselves to simply
> be abstracted crypto API wrappers using the lower level Zinc functions.
> This is already what various crypto/ implementations do with the
> existing code in lib/.
> 
> Currently Zinc exists as a single un-menued option, CONFIG_ZINC, but as
> this grows we will inevitably want to make that more granular. This will
> happen at the appropriate time, rather than doing so prematurely. There
> is also a menued option, CONFIG_ZINC_DEBUG, which performs several
> intensive tests at startup and enables various BUG_ONs.
> 
> [1] https://github.com/project-everest/hacl-star
> [2] https://github.com/mit-plv/fiat-crypto
> [3] https://cr.yp.to/ecdh.html
> [4] https://cr.yp.to/chacha.html
> [5] https://cr.yp.to/snuffle/xsalsa-20081128.pdf
> [6] https://cr.yp.to/mac.html
> [7] https://blake2.net/
> [8] https://tools.ietf.org/html/rfc8439
> [9] https://github.com/google/wycheproof
> 
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Greg KH <gregkh@linuxfoundation.org>
> Cc: Samuel Neves <sneves@dei.uc.pt>
> Cc: D. J. Bernstein <djb@cr.yp.to>
> Cc: Tanja Lange <tanja@hyperelliptic.org>
> Cc: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com>
> Cc: Karthikeyan Bhargavan <karthik.bhargavan@gmail.com>
> ---
>  MAINTAINERS                             |    7 +
>  include/zinc/blake2s.h                  |   94 +
>  include/zinc/chacha20.h                 |   46 +
>  include/zinc/chacha20poly1305.h         |   51 +
>  include/zinc/curve25519.h               |   25 +
>  include/zinc/poly1305.h                 |   34 +
>  include/zinc/simd.h                     |   60 +
>  lib/Kconfig                             |   21 +
>  lib/Makefile                            |    2 +
>  lib/zinc/Makefile                       |   28 +
>  lib/zinc/blake2s/blake2s-x86_64.S       |  685 ++++++
>  lib/zinc/blake2s/blake2s.c              |  292 +++
>  lib/zinc/chacha20/chacha20-arm.S        | 1471 ++++++++++++
>  lib/zinc/chacha20/chacha20-arm64.S      | 1940 ++++++++++++++++
>  lib/zinc/chacha20/chacha20-mips.S       |  474 ++++
>  lib/zinc/chacha20/chacha20-x86_64.S     | 2630 +++++++++++++++++++++
>  lib/zinc/chacha20/chacha20.c            |  242 ++
>  lib/zinc/chacha20poly1305.c             |  286 +++
>  lib/zinc/curve25519/curve25519-arm.S    | 2110 +++++++++++++++++
>  lib/zinc/curve25519/curve25519-arm.h    |   14 +
>  lib/zinc/curve25519/curve25519-fiat32.h |  838 +++++++
>  lib/zinc/curve25519/curve25519-hacl64.h |  751 ++++++
>  lib/zinc/curve25519/curve25519-x86_64.h | 2060 +++++++++++++++++
>  lib/zinc/curve25519/curve25519.c        |   86 +
>  lib/zinc/main.c                         |   36 +
>  lib/zinc/poly1305/poly1305-arm.S        | 1115 +++++++++
>  lib/zinc/poly1305/poly1305-arm64.S      |  820 +++++++
>  lib/zinc/poly1305/poly1305-mips.S       |  417 ++++
>  lib/zinc/poly1305/poly1305-mips64.S     |  357 +++
>  lib/zinc/poly1305/poly1305-x86_64.S     | 2790 +++++++++++++++++++++++
>  lib/zinc/poly1305/poly1305.c            |  377 +++
>  lib/zinc/selftest/blake2s.h             |  559 +++++
>  lib/zinc/selftest/chacha20poly1305.h    | 1559 +++++++++++++
>  lib/zinc/selftest/curve25519.h          |  607 +++++
>  lib/zinc/selftest/poly1305.h            | 1568 +++++++++++++
>  35 files changed, 24452 insertions(+)
>  create mode 100644 include/zinc/blake2s.h
>  create mode 100644 include/zinc/chacha20.h
>  create mode 100644 include/zinc/chacha20poly1305.h
>  create mode 100644 include/zinc/curve25519.h
>  create mode 100644 include/zinc/poly1305.h
>  create mode 100644 include/zinc/simd.h
>  create mode 100644 lib/zinc/Makefile
>  create mode 100644 lib/zinc/blake2s/blake2s-x86_64.S
>  create mode 100644 lib/zinc/blake2s/blake2s.c
>  create mode 100644 lib/zinc/chacha20/chacha20-arm.S
>  create mode 100644 lib/zinc/chacha20/chacha20-arm64.S
>  create mode 100644 lib/zinc/chacha20/chacha20-mips.S
>  create mode 100644 lib/zinc/chacha20/chacha20-x86_64.S
>  create mode 100644 lib/zinc/chacha20/chacha20.c
>  create mode 100644 lib/zinc/chacha20poly1305.c
>  create mode 100644 lib/zinc/curve25519/curve25519-arm.S
>  create mode 100644 lib/zinc/curve25519/curve25519-arm.h
>  create mode 100644 lib/zinc/curve25519/curve25519-fiat32.h
>  create mode 100644 lib/zinc/curve25519/curve25519-hacl64.h
>  create mode 100644 lib/zinc/curve25519/curve25519-x86_64.h
>  create mode 100644 lib/zinc/curve25519/curve25519.c
>  create mode 100644 lib/zinc/main.c
>  create mode 100644 lib/zinc/poly1305/poly1305-arm.S
>  create mode 100644 lib/zinc/poly1305/poly1305-arm64.S
>  create mode 100644 lib/zinc/poly1305/poly1305-mips.S
>  create mode 100644 lib/zinc/poly1305/poly1305-mips64.S
>  create mode 100644 lib/zinc/poly1305/poly1305-x86_64.S
>  create mode 100644 lib/zinc/poly1305/poly1305.c
>  create mode 100644 lib/zinc/selftest/blake2s.h
>  create mode 100644 lib/zinc/selftest/chacha20poly1305.h
>  create mode 100644 lib/zinc/selftest/curve25519.h
>  create mode 100644 lib/zinc/selftest/poly1305.h
[...]

In general this is great work, and I'm very excited for WireGuard to be
upstreamed!  But for the new crypto code, I think a few things are on
the wrong track, for example treating it as a special library.  Even the
name is contradicting itself: Zinc is "not crypto/", yet as you stated
it's intended that the "Zinc" algorithms will be exposed through the
crypto API -- just like how most of the existing crypto code in lib/ is
also exposed through the crypto API.  So, I think that what you're doing
isn't actually *that* different from what already exists in some cases;
and pretending that it is very different is just going to cause
problems.  Rather, the actual truly new thing seems to be that the
dispatch to architecture specific implementations is done at the lib/
level instead of handled by the crypto API priority numbers.

So, I don't see why you don't just add lib/blake2s/, lib/chacha20/,
lib/poly1305/, etc., without pretending that they all have some special
new "Zinc" thing in common and are part of some holy crusade against the
crypto API.

They could even still go in subdirectory lib/crypto/ -- but just for
logical code organization purposes, as opposed to a special library with
a name that isn't self-explanatory and sounds like some third-party
library rather than first-class kernel code.

CONFIG_ZINC also needs to go.  Algorithms will need to be independently
configurable as soon as anything other than WireGuard needs to use any
of them, so you might as well do it right from the start with
CONFIG_BLAKE2, CONFIG_CHACHA20, CONFIG_POLY1305, etc.
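Hypothetically, such a per-algorithm split might look like the Kconfig
fragment below (symbol names and help text are invented for
illustration, not taken from the patch); WireGuard's own entry would
then select just the symbols it needs:

```kconfig
config CHACHA20
	tristate
	help
	  ChaCha20 stream cipher routines in lib/, intended to be
	  selected by users (e.g. WireGuard) rather than enabled by hand.

config POLY1305
	tristate
	help
	  Poly1305 one-time authenticator routines in lib/.
```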

I think the above changes would also naturally lead to a much saner
patch series where each algorithm is added by its own patch, rather than
one monster patch that adds many algorithms and 24000 lines of code.

Note that adding all the algorithms in one patch also makes the
description of them conflated, e.g. you wrote that "formally verified"
implementations were used whenever possible, but AFAICS that actually
only applies to the C implementations of Curve25519, and even those have
your copyright statement so presumably you had to change something from
the "formally verified" code :-).  Note also that Poly1305
implementations are somewhat error-prone, since there can be overflow
bugs that are extremely rarely hit; see e.g. how OpenSSL's Poly1305 NEON
implementation was initially buggy and had to be fixed:
https://mta.openssl.org/pipermail/openssl-commits/2016-April/006639.html.
Not to mention that C glue code is error-prone, especially with the tons
of #ifdefs.  So, I'd strongly prefer that you don't oversell the crypto
code you're adding, e.g. by implying that most of it is formally
verified, as it likely still has bugs, like any other code...
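The hazard with limb arithmetic can be made concrete with a toy check
(the limb widths here are generic examples, not taken from any
implementation in this patch): products of two 26-bit limbs need up to
52 bits, so carries are silently lost if they are accumulated in 32-bit
words, and the inputs that actually trigger a lost carry can be
vanishingly rare.

```c
#include <assert.h>
#include <stdint.h>

/* Does the full product of two limbs fit the 32-bit word it would be
 * accumulated in? */
int limb_product_fits_u32(uint32_t a, uint32_t b)
{
	return (uint64_t)a * b <= UINT32_MAX;
}
```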

I also want to compare the performance of some of the assembly
implementations you're adding to the existing implementations in the
kernel they duplicate.  I'm especially interested in the NEON
implementation of ChaCha20.  But adding 11 implementations in one single
patch means there isn't really a chance to comment on them individually.

Also, earlier when I tested OpenSSL's ChaCha NEON implementation on ARM
Cortex-A7 it was actually quite a bit slower than the one in the Linux
kernel written by Ard Biesheuvel... I trust that when claiming the
performance of all implementations you're adding is "state-of-the-art
and unrivaled", you actually compared them to the ones already in the
Linux kernel which you're advocating replacing, right? :-)

Your patch description is also missing any mention of crypto accelerator
hardware.  Quite a bit of the complexity in the crypto API, such as
scatterlist support and asynchronous execution, exists because it
supports crypto accelerators.  AFAICS your new APIs cannot support
crypto accelerators, as your APIs are synchronous and operate on virtual
addresses.  I assume your justification is that "djb algorithms" like
ChaCha and Poly1305 don't need crypto accelerators as they are fast in
software.  But you never explicitly stated this and discussed the
tradeoffs.  Since this is basically the foundation for the design you've
chosen, it really needs to be addressed.

As for doing the architecture-specific dispatch in lib/ rather than
through the crypto API, there definitely are some arguments in favor of
it.  The main problem, though, is that your code is a mess due to all
the #ifdefs, and it will only get worse as people add more
architectures.  You may think you already added all the architectures
that matter, but tomorrow people will come and want to add PowerPC,
RISC-V, etc.  I really think you should consider splitting up
implementations by architecture; this would *not*, however, preclude the
implementations from still being accessed through a single top-level
"glue" function.  For example chacha20() could look like:

void chacha20(struct chacha20_ctx *state, u8 *dst, const u8 *src, u32 len,
	      bool have_simd)
{
	if (chacha20_arch(dst, src, len, state->key, state->counter, have_simd))
		goto out;

	chacha20_generic(dst, src, len, state->key, state->counter);

out:
	state->counter[0] += (len + 63) / 64;
}

So, each architecture would optionally define its own chacha20_arch()
that returns true if the data was processed, or false if not.  (The data
wouldn't be processed if, for example, 'have_simd' was false but only
SIMD implementations are available; or if the input was too short for an
SIMD implementation to be faster than the generic one.)  Note that this
would make the code much more consistent with the usual Linux kernel
coding style, which strongly prefers calling functions unconditionally
rather than having core logic littered with unmaintainable #ifdefs.
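Concretely, that contract might look like the following self-contained
userspace sketch; chacha20_arch() and chacha20_generic() are placeholder
stubs rather than real implementations, and only the dispatch flow and
the counter arithmetic are the point:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef uint8_t u8;
typedef uint32_t u32;

struct chacha20_ctx {
	u32 key[8];
	u32 counter[4];
};

/* Arch-optimized path: returns true only if it handled the request.
 * Here it simply declines when SIMD is unavailable. */
static bool chacha20_arch(u8 *dst, const u8 *src, u32 len,
			  const u32 key[8], const u32 counter[4],
			  bool have_simd)
{
	if (!have_simd)
		return false;
	memmove(dst, src, len);	/* stub for the real SIMD code */
	return true;
}

/* Portable fallback. */
static void chacha20_generic(u8 *dst, const u8 *src, u32 len,
			     const u32 key[8], const u32 counter[4])
{
	memmove(dst, src, len);	/* stub for the real generic code */
}

void chacha20(struct chacha20_ctx *state, u8 *dst, const u8 *src, u32 len,
	      bool have_simd)
{
	if (chacha20_arch(dst, src, len, state->key, state->counter,
			  have_simd))
		goto out;

	chacha20_generic(dst, src, len, state->key, state->counter);

out:
	/* Both paths consume len rounded up to whole 64-byte blocks. */
	state->counter[0] += (len + 63) / 64;
}
```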

I'd also strongly prefer the patchset to include converting the crypto
API versions of ChaCha20 and Poly1305 over to use your new lib/ code, to
show that it's really possible.  You mentioned that it's planned, but if
it's not done right away there will be things that were missed and will
require changes when someone finally does it.  IMO it's not acceptable
to add your own completely separate ChaCha20 and Poly1305 just because
you don't like that the existing ones are part of the crypto API.  You
need to refactor things properly.  I think you'd need to expose the new
code under a cra_driver_name like "chacha20-software" and
"poly1305-software" to reflect that they use the fastest available
implementation of the algorithm on the CPU, e.g. "chacha20-generic",
"chacha20-neon", and "chacha20-simd" would all be replaced by a single
"chacha20-software".  Is that what you had in mind?

I'm also wondering about the origin and licensing of some of the
assembly language files.  Many have an OpenSSL copyright statement.
But, the OpenSSL license is often thought to be incompatible with GPL,
so using OpenSSL assembly code in the kernel has in the past required
getting special permission from Andy Polyakov (the person who's written
most of OpenSSL's assembly code so holds the copyright on it).  As one
example, see arch/arm/crypto/sha256-armv4.pl: the file explicitly states
that Andy has relicensed it under GPLv2.  For your new OpenSSL-derived
files, have you gone through and explicitly gotten GPLv2 permission from
Andy / the copyright holders?

Each assembly language file should also explicitly state where it came
from.  For example lib/zinc/curve25519/curve25519-arm.S has your
copyright statement and says it's "Based on algorithms from Daniel J.
Bernstein and Peter Schwabe.", but it's not clarified whether the *code*
was written by those other people, as opposed to those people designing
the *algorithms* and then you writing the code; and if you didn't write
it, where you retrieved the file from and when, what license it had
(even if it was "public domain" like some of djb's code, this should be
mentioned for informational purposes), and what changes you made if any.

Oh, and please use 80-character lines like the rest of the kernel, so
that people's eyes don't bleed when reading your code :-)

Thanks for all your hard work on WireGuard!

- Eric

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v1 2/3] zinc: Introduce minimal cryptography library
  2018-08-01  7:22 [PATCH v1 2/3] zinc: Introduce minimal cryptography library Eric Biggers
@ 2018-08-01 17:02 ` Andy Lutomirski
  2018-08-03  2:48   ` Jason A. Donenfeld
  2018-08-03  2:33 ` Jason A. Donenfeld
  1 sibling, 1 reply; 20+ messages in thread
From: Andy Lutomirski @ 2018-08-01 17:02 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Jason A. Donenfeld, Linux Crypto Mailing List, LKML,
	Network Development, David S. Miller, Andy Lutomirski, Greg KH,
	Samuel Neves, D . J . Bernstein, Tanja Lange,
	Jean-Philippe Aumasson, Karthikeyan Bhargavan

[I reordered the snippets below for readability.]

> On Aug 1, 2018, at 12:22 AM, Eric Biggers <ebiggers3@gmail.com> wrote:
> In general this is great work, and I'm very excited for WireGuard to be
> upstreamed!  But for the new crypto code, I think a few things are on
> the wrong track, for example treating it as a special library.  Even the
> name is contradicting itself: Zinc is "not crypto/", yet as you stated
> it's intended that the "Zinc" algorithms will be exposed through the
> crypto API -- just like how most of the existing crypto code in lib/ is
> also exposed through the crypto API.  So, I think that what you're doing
> isn't actually *that* different from what already exists in some cases;
> and pretending that it is very different is just going to cause
> problems.  Rather, the actual truly new thing seems to be that the
> dispatch to architecture specific implementations is done at the lib/
> level instead of handled by the crypto API priority numbers.

...

>
>
> I think the above changes would also naturally lead to a much saner
> patch series where each algorithm is added by its own patch, rather than
> one monster patch that adds many algorithms and 24000 lines of code.
>

Yes, please.


>
> As for doing the architecture-specific dispatch in lib/ rather than
> through the crypto API, there definitely are some arguments in favor of
> it.  The main problem, though, is that your code is a mess due to all
> the #ifdefs, and it will only get worse as people add more
> architectures.  You may think you already added all the architectures
> that matter, but tomorrow people will come and want to add PowerPC,
> RISC-V, etc.  I really think you should consider splitting up
> implementations by architecture; this would *not*, however, preclude the
> implementations from still being accessed through a single top-level
> "glue" function.  For example chacha20() could look like:
>
> void chacha20(struct chacha20_ctx *state, u8 *dst, const u8 *src, u32 len,
>          bool have_simd)
> {
>    if (chacha20_arch(dst, src, len, state->key, state->counter, have_simd))
>        goto out;
>
>    chacha20_generic(dst, src, len, state->key, state->counter);
>
> out:
>    state->counter[0] += (len + 63) / 64;
> }

I like this a *lot*.  (But why are you passing have_simd?  Shouldn't
that check live in chacha20_arch?  If there's some init code needed,
then chacha20_arch() should just return false before the init code
runs.  Once the arch does whatever feature detection it needs, it can
make chacha20_arch() start returning true.)
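Andy's variant might be sketched like this in self-contained userspace
C; a plain bool stands in for whatever mechanism the kernel would
really use (likely a static key), and cpu_has_neon() is an invented
probe:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef uint8_t u8;
typedef uint32_t u32;

static bool chacha20_use_neon;	/* false until init has run */

/* Stand-in for a real CPU-feature probe (CPUID, HWCAP, ...). */
static bool cpu_has_neon(void)
{
	return true;	/* assume the feature exists for this demo */
}

static void chacha20_mod_init(void)
{
	chacha20_use_neon = cpu_has_neon();
}

/* Returns false (so the caller falls back to the generic code) until
 * feature detection has flipped the flag. */
static bool chacha20_arch(u8 *dst, const u8 *src, u32 len)
{
	if (!chacha20_use_neon)
		return false;
	memmove(dst, src, len);	/* stub for the NEON implementation */
	return true;
}
```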

As I see it, there are two truly new things in the zinc patchset:
the direct (in the direct call sense) arch dispatch, and the fact that
the functions can be called directly, without allocating contexts,
using function pointers, etc.

In fact, I had a previous patch set that added such an interface for SHA256.

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=crypto/sha256_bpf&id=8c59a4dd8b7ba4f2e5a6461132bbd16c83ff7c1f

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=crypto/sha256_bpf&id=7e5fbc02972b03727b71bc71f84175c36cbf01f5

>
> Your patch description is also missing any mention of crypto accelerator
> hardware.  Quite a bit of the complexity in the crypto API, such as
> scatterlist support and asynchronous execution, exists because it
> supports crypto accelerators.  AFAICS your new APIs cannot support
> crypto accelerators, as your APIs are synchronous and operate on virtual
> addresses.  I assume your justification is that "djb algorithms" like
> ChaCha and Poly1305 don't need crypto accelerators as they are fast in
> software.  But you never explicitly stated this and discussed the
> tradeoffs.  Since this is basically the foundation for the design you've
> chosen, it really needs to be addressed.

I see this as an advantage, not a disadvantage.  A very large majority
of in-kernel crypto users (by number of call sites under a *very*
brief survey, not by number of CPU cycles) just want to do some
synchronous crypto on a buffer that is addressed by a regular pointer.
Most of these users would be slowed down if they used any form of
async crypto, since the CPU can complete the whole operation faster
than it could plausibly initiate and complete anything asynchronous.
And, right now, they suffer the full overhead of allocating a context
(often with alloca!), looking up (or caching) some crypto API data
structures, dispatching the operation, and cleaning up.
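That overhead can be caricatured in self-contained C; the toy_* names
and the FNV-style stand-in "hash" are invented, and bear no relation to
the real crypto API:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Direct, library-style call: no allocation, works on any buffer. */
static uint32_t toy_hash(const uint8_t *data, size_t len)
{
	uint32_t h = 2166136261u;	/* FNV-1a as a toy stand-in */

	while (len--)
		h = (h ^ *data++) * 16777619u;
	return h;
}

/* Crypto-API-style indirection: allocate a context, then go through a
 * function pointer for every operation. */
struct toy_tfm {
	uint32_t (*digest)(const uint8_t *data, size_t len);
};

static struct toy_tfm *toy_alloc_tfm(void)
{
	struct toy_tfm *tfm = malloc(sizeof(*tfm));

	if (tfm)
		tfm->digest = toy_hash;	/* "priority" selection result */
	return tfm;
}

static uint32_t toy_api_digest(struct toy_tfm *tfm,
			       const uint8_t *data, size_t len)
{
	return tfm->digest(data, len);
}
```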

So I think the right way to do it is to have directly callable
functions like zinc uses and to have the fancy crypto API layer on top
of them.  So if you actually want async accelerated crypto with
scatterlists or whatever, you can call into the fancy API, and the
fancy API can dispatch to hardware or it can dispatch to the normal
static API.

In fact, this is exactly what my patch above did.  But Jason's work is
more complete than mine, and mine wasn't really done because it had
some configury issues.

All that being said, for algorithms where the crypto API already has a
reasonable implementation, I think the patch series should first
restructure the code (so the actual software implementation is moved
away from the crypto API wrappers) and then, if needed, replace the
implementation with something better.  The latter should be its own
patch with its own justification.


* Re: [PATCH v1 2/3] zinc: Introduce minimal cryptography library
  2018-08-01  7:22 [PATCH v1 2/3] zinc: Introduce minimal cryptography library Eric Biggers
  2018-08-01 17:02 ` Andy Lutomirski
@ 2018-08-03  2:33 ` Jason A. Donenfeld
  2018-08-14 21:12   ` Eric Biggers
  1 sibling, 1 reply; 20+ messages in thread
From: Jason A. Donenfeld @ 2018-08-03  2:33 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Linux Crypto Mailing List, LKML, Netdev, David Miller,
	Andrew Lutomirski, Greg Kroah-Hartman, Samuel Neves,
	Daniel J . Bernstein, Tanja Lange, Jean-Philippe Aumasson,
	Karthikeyan Bhargavan

Hi Eric,

Thanks for your feedback. Responses below.

On Wed, Aug 1, 2018 at 9:22 AM Eric Biggers <ebiggers3@gmail.com> wrote:
> it's intended that the "Zinc" algorithms will be exposed through the
> crypto API -- just like how most of the existing crypto code in lib/ is
> also exposed through the crypto API.  So, I think that what you're doing
> isn't actually *that* different from what already exists in some cases;
> and pretending that it is very different is just going to cause
> problems.

I think as an eventual thing, it'd be great to see the whole of the
software part of the Crypto API rebased on top of Zinc's software
implementations. But the Crypto API is huge, and it's not really
realistic to do this all at once. I'd also like to have a bit more
scrutiny of each individual Zinc submission, rather than just dumping
it all in there wholesale. Additionally, it seems reasonable to do
this in accordance with actual need for using the algorithms from
Zinc, i.e., not rushing toward DES.

> Rather, the actual truly new thing seems to be that the [...]
> [...] are part of some holy crusade against the crypto API.
> So much for WireGuard being only

I sense a bit of snark in your comments, and I am sorry if I ruffled
some feathers. I can understand a number of things might have rubbed
you the wrong way. But I am pretty committed to getting this done
right, taking into account your and Andy's opinions, as well as
working hard to keep Zinc close to its design goals. I think you'll
find we're closer to being on the same page than not, and that I only
want to make things better rather than be on some kind of "crusade"
against anything.

> They could even still go in subdirectory lib/crypto/ -- but just for
> logical code organization purposes,

Indeed a subdirectory is important for organization. We can bikeshed
about names, I suppose. I'm partial to zinc, and I think that it might
actually help foster contributors and interested academics. Anyway,
let's dive into the code issues first, though, as I think that'll be a
productive way to get going.

> CONFIG_ZINC also needs to go.  Algorithms will need to be independently
> configurable as soon as anything other than WireGuard needs to use any
> of them, so you might as well do it right from the start with
> CONFIG_BLAKE2, CONFIG_CHACHA20, CONFIG_POLY1305, etc.

Okay, I can add this all as individual knobs. I'll do that for v2.

I'm tempted to still have this built as one module, though. For
example, should WireGuard depend on 5 mini-modules (blake2s, chacha20,
poly1305, chacha20poly1305, curve25519)? Rather, it seems a bit
sleeker to just have a single zinc.ko, and allow what's built into it
be configurable with the various algorithm configuration options. What
would you think of this? Or do you think you'd still prefer
one-algo-per-ko approach like with the current crypto API? I can see
good arguments both ways.

> I think the above changes would also naturally lead to a much saner
> patch series where each algorithm is added by its own patch, rather than
> one monster patch that adds many algorithms and 24000 lines of code.

Yes, absolutely! And sorry for that monster patch; it even got
rejected from LKML as you pointed out. I had started to split things
up, but I wanted to send the patchset foremost to netdev, to start
getting some feedback on WireGuard, and wanted to keep the
organization simple. I'll have this split up for v2.

>
> Note that adding all the algorithms in one patch also makes the
> description of them conflated, e.g. you wrote that "formally verified"
> implementations were used whenever possible, but AFAICS that actually
> only applies to the C implementations of Curve25519,

Yes, the commit message mentioned which implementations have been
verified. It's quite likely that in the near future we'll be adding
more formally verified implementations, and even formally verified
assembly implementations. While BLAKE2s and ChaCha20 are somewhat
straightforward, bigint stuff, such as Curve25519 and Poly1305,
always makes me a bit nervous, and so we'll probably incorporate a
formally verified Poly1305 first of the bunch.

Either way, _all_ of those implementations have been reviewed in some
depth, fuzzed absurd amounts, and fed some pretty nasty test vectors.
I've spent a lot of time with all of them, and most have received a
good amount of external scrutiny, due to their origin.

> So, I'd strongly prefer that you don't oversell the crypto
> code you're adding, e.g. by implying that most of it is formally
> verified, as it likely still has bugs, like any other code...

This sounds like snark that's going to devolve into endless
bikeshedding, but I do note that I mentioned formally verified
implementations in these 3 contexts, which I think were appropriate
each time:

1) "Wherever possible and performant, formally verified
implementations are used, such as those from HACL* and Fiat-Crypto."

So, trying to state that we use it when we can, but we can't always. I
think it's important to mention it here, as I've already been
contacted by formal verification people who are eager to help.
This is exactly the kind of positive outcome I was hoping for.

2) "Curve25519: formally verified 64-bit C, formally verified 32-bit C, [...]"

Pointing out that these 2 implementations, in contradistinction to the
subsequent ones listed, are from formally verified C.

3) "potentially replacing them with either formally-verified
implementations [...]"

Again here, I indicate that I'd like to replace non formally-verified
implementations with formally-verified implementations wherever
there's such a potentiality. And like the previous comment, I believe
this has been interpreted as a welcome call to action by the people
who are actually working on this stuff.

So I don't think I'm "overselling" it. Formal verification is a good
thing. We have a little bit of it in Zinc. I'd like to have a lot more
of it. And all the while we're doing the best with what we've got.

Generally speaking, I believe the choices I've made on the individual
implementations in this submission have been made with a lot of care,
and I'm happy to work with you on any concerns you have about them and
do what's needed to get them in the proper shape.

> I also want to compare the performance of some of the assembly
> implementations you're adding to the existing implementations in the
> kernel they duplicate.  I'm especially interested in the NEON
> implementation of ChaCha20.  But adding 11 implementations in one single
> patch means there isn't really a chance to comment on them individually.

As noted, I'll certainly split this up for v2. With ChaCha20, my
benchmarks have indicated Zinc is faster than the Crypto API's
implementations. But there's also another benefit: the ChaCha20 and
Poly1305 implementations are based on Andy Polyakov's ones, which
means they benefit from the *huge* amount of scrutiny those have
already received and continue to receive. In fact, there are people
working on doing formally verified reproductions of some of his
implementations, and that alone is hugely valuable. The
implementations generally found in Linux haven't received nearly as
much scrutiny and attention, putting them at a bit of a disadvantage,
I think. In other words, there's a strong benefit from working
together and sharing in the enterprise with other projects.

There's another plus, too, which is that they all have a very common
interface, making the glue code simpler and more uniform. This is
probably mostly just a benefit for me, as a lazy coder who would
prefer an easier job, but usually less complexity and fewer
differences results in fewer bugs, too.

> Also, earlier when I tested OpenSSL's ChaCha NEON implementation on ARM
> Cortex-A7 it was actually quite a bit slower than the one in the Linux
> kernel written by Ard Biesheuvel... I trust that when claiming the
> performance of all implementations you're adding is "state-of-the-art
> and unrivaled", you actually compared them to the ones already in the
> Linux kernel which you're advocating replacing, right? :-)

Yes, I have, and my results don't corroborate your findings. It will
be interesting to get a wider variety of hardware out for comparisons.
I suspect, also, that if the snarky emoticons subside, AndyP would be
very interested in whatever we find and could have interest in
improving implementations, should we actually find performance
differences.

> Your patch description is also missing any mention of crypto accelerator
> hardware.  Quite a bit of the complexity in the crypto API, such as
> scatterlist support and asynchronous execution, exists because it
> supports crypto accelerators.

I can augment that description, if you like, for v2. But with regards
to crypto accelerators -- I think the asynchronous API causes all
sorts of problems for some of the most common use cases, and so Zinc
is a software-based ordinary synchronous API. I think this is an
advantage. Of course you can continue to support wild async IPSec
accelerator hardware and such in the crypto API, but that's
intentionally not the focus of Zinc.


> I assume your justification is that "djb algorithms" like
> ChaCha and Poly1305 don't need crypto accelerators as they are fast in
> software.  But you never explicitly stated this and discussed the
> tradeoffs.

It was already a pretty huge wall of text in the monster patch, but
even so there's probably tons of discussion about different details,
both large and small, that I must have left out. Sorry about that.
Hopefully these discussions here can help fill them in.

> The main problem, though, is that your code is a mess due to all
> the #ifdefs, and it will only get worse as people add more
> architectures.

I really thought the ifdefs were cleaner than the three other ways I
had implemented it. When I showed this all to Greg KH
several months ago, he also thought the ifdefs were disgusting. It
seems you agree with him, and I'll certainly change this for v2. I
already have most of that code written, and based on your comments and
suggestions below, I'm sure you'll find it much better.

> Note that this
> would make the code much more consistent with the usual Linux kernel
> coding style, which strongly prefers calling functions unconditionally
> rather than having core logic littered with unmaintainable #ifdefs.

Noted. Will be changed.

> I'd also strongly prefer the patchset to include converting the crypto
> API versions of ChaCha20 and Poly1305 over to use your new lib/ code, to
> show that it's really possible.  You mentioned that it's planned, but if
> it's not done right away there will be things that were missed and will
> require changes when someone finally does it.  IMO it's not acceptable
> to add your own completely separate ChaCha20 and Poly1305 just because
> you don't like that the existing ones are part of the crypto API.

Alright. I didn't really want to do this right off the bat, but you
make a good argument that doing so is important in working out the
required details of just knowing how it's done. So I'll take a stab at
it for v2. I'll let you know if I run into any snags; we might be able
to collaborate nicely here.

> You
> need to refactor things properly.  I think you'd need to expose the new
> code under a cra_driver_name like "chacha20-software" and
> "poly1305-software" to reflect that they use the fastest available
> implementation of the algorithm on the CPU, e.g. "chacha20-generic",
> "chacha20-neon", and "chacha20-simd" would all be replaced by a single
> "chacha20-software".  Is that what you had in mind?

Calling it "chacha20-software" sounds like a reasonable way to name it.

>
> I'm also wondering about the origin and licensing of some of the
> assembly language files.  Many have an OpenSSL copyright statement.
> But, the OpenSSL license is often thought to be incompatible with GPL,
> so using OpenSSL assembly code in the kernel has in the past required
> getting special permission from Andy Polyakov (the person who's written
> most of OpenSSL's assembly code so holds the copyright on it).  As one
> example, see arch/arm/crypto/sha256-armv4.pl: the file explicitly states
> that Andy has relicensed it under GPLv2.  For your new OpenSSL-derived
> files, have you gone through and explicitly gotten GPLv2 permission from
> Andy / the copyright holders?

Andy and I spoke in January about this and he was okay with the usual
CRYPTOGAMS GPLv2 origin dance for that code.

> For example lib/zinc/curve25519/curve25519-arm.S has your
> copyright statement and says it's "Based on algorithms from Daniel J.
> Bernstein and Peter Schwabe.", but it's not clarified whether the *code*
> was written by those other people, as opposed to those people designing
> the *algorithms* and then you writing the code; and if you didn't write
> it, where you retrieved the file from and when, what license it had
> (even if it was "public domain" like some of djb's code, this should be
> mentioned for informational purposes), and what changes you made if any.

It's indeed based on their public domain code. I'll change that
wording to make it very clear that it's from public domain code.

> Oh, and please use 80-character lines like the rest of the kernel, so
> that people's eyes don't bleed when reading your code :-)

Ack. Will be so in v2.

>
> Thanks for all your hard work on WireGuard!

Thanks for your excellent review. I really appreciate you taking
the time. Hopefully v2 will start looking a lot more interesting to
you!

Regards,
Jason

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v1 2/3] zinc: Introduce minimal cryptography library
  2018-08-01 17:02 ` Andy Lutomirski
@ 2018-08-03  2:48   ` Jason A. Donenfeld
  2018-08-03 21:29     ` Andy Lutomirski
  0 siblings, 1 reply; 20+ messages in thread
From: Jason A. Donenfeld @ 2018-08-03  2:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Eric Biggers, Linux Crypto Mailing List, LKML, Netdev,
	David Miller, Andrew Lutomirski, Greg Kroah-Hartman,
	Samuel Neves, Daniel J . Bernstein, Tanja Lange,
	Jean-Philippe Aumasson, Karthikeyan Bhargavan

Hey Andy,

Thanks too for the feedback. Responses below:

On Wed, Aug 1, 2018 at 7:09 PM Andy Lutomirski <luto@amacapital.net> wrote:
> > I think the above changes would also naturally lead to a much saner
> > patch series where each algorithm is added by its own patch, rather than
> > one monster patch that adds many algorithms and 24000 lines of code.
> >
>
> Yes, please.

Ack, will be in v2.


> I like this a *lot*.  (But why are you passing have_simd?  Shouldn't
> that check live in chacha20_arch?  If there's some init code needed,
> then chacha20_arch() should just return false before the init code
> runs.  Once the arch does whatever feature detection it needs, it can
> make chacha20_arch() start returning true.)

The have_simd stuff is so that the FPU state can be amortized across
several calls to the crypto functions. Here's a snippet from
WireGuard's send.c:

void packet_encrypt_worker(struct work_struct *work)
{
    struct crypt_queue *queue =
        container_of(work, struct multicore_worker, work)->ptr;
    struct sk_buff *first, *skb, *next;
    bool have_simd = simd_get();

    while ((first = ptr_ring_consume_bh(&queue->ring)) != NULL) {
        enum packet_state state = PACKET_STATE_CRYPTED;

        skb_walk_null_queue_safe(first, skb, next) {
            if (likely(skb_encrypt(skb, PACKET_CB(first)->keypair, have_simd)))
                skb_reset(skb);
            else {
                state = PACKET_STATE_DEAD;
                break;
            }
        }
        queue_enqueue_per_peer(&PACKET_PEER(first)->tx_queue, first, state);

        have_simd = simd_relax(have_simd);
    }
    simd_put(have_simd);
}

simd_get() and simd_put() do the usual irq_fpu_usable/kernel_fpu_begin
dance and return/take a boolean accordingly. simd_relax(on) is:

static inline bool simd_relax(bool was_on)
{
#ifdef CONFIG_PREEMPT
    if (was_on && need_resched()) {
        simd_put(true);
        return simd_get();
    }
#endif
    return was_on;
}

With this, we most of the time get the FPU amortization, while still
doing the right thing for the preemption case (since kernel_fpu_begin
disables preemption). This is quite an important performance
optimization. I'd prefer the lazy FPU restoration proposal
discussed a few months ago, but it looks like that hasn't progressed,
hence the need for FPU call amortization:

https://lore.kernel.org/lkml/CALCETrU+2mBPDfkBz1i_GT1EOJau+mzj4yOK8N0UnT2pGjiUWQ@mail.gmail.com/

>
> As I see it, there are two truly new things in the zinc patchset:
> the direct (in the direct call sense) arch dispatch, and the fact that
> the functions can be called directly, without allocating contexts,
> using function pointers, etc.
>
> In fact, I had a previous patch set that added such an interface for SHA256.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=crypto/sha256_bpf&id=8c59a4dd8b7ba4f2e5a6461132bbd16c83ff7c1f
>
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=crypto/sha256_bpf&id=7e5fbc02972b03727b71bc71f84175c36cbf01f5

Seems like SHA256 will be a natural next candidate for Zinc, given the demand.


> > Your patch description is also missing any mention of crypto accelerator
> > hardware.  Quite a bit of the complexity in the crypto API, such as
> > scatterlist support and asynchronous execution, exists because it
> > supports crypto accelerators.  AFAICS your new APIs cannot support
> > crypto accelerators, as your APIs are synchronous and operate on virtual
> > addresses.  I assume your justification is that "djb algorithms" like
> > ChaCha and Poly1305 don't need crypto accelerators as they are fast in
> > software.  But you never explicitly stated this and discussed the
> > tradeoffs.  Since this is basically the foundation for the design you've
> > chosen, it really needs to be addressed.
>
> I see this as an advantage, not a disadvantage.  A very large majority
> of in-kernel crypto users (by number of call sites under a *very*
> brief survey, not by number of CPU cycles) just want to do some
> synchronous crypto on a buffer that is addressed by a regular pointer.
> Most of these users would be slowed down if they used any form of
> async crypto, since the CPU can complete the whole operation faster
> than it could plausibly initiate and complete anything asynchronous.
> And, right now, they suffer the full overhead of allocating a context
> (often with alloca!), looking up (or caching) some crypto API data
> structures, dispatching the operation, and cleaning up.
>
> So I think the right way to do it is to have directly callable
> functions like zinc uses and to have the fancy crypto API layer on top
> of them.  So if you actually want async accelerated crypto with
> scatterlists or whatever, you can call into the fancy API, and the
> fancy API can dispatch to hardware or it can dispatch to the normal
> static API.

Yes, exactly this.

Regards,
Jason


* Re: [PATCH v1 2/3] zinc: Introduce minimal cryptography library
  2018-08-03  2:48   ` Jason A. Donenfeld
@ 2018-08-03 21:29     ` Andy Lutomirski
  2018-08-03 22:10       ` Jason A. Donenfeld
  0 siblings, 1 reply; 20+ messages in thread
From: Andy Lutomirski @ 2018-08-03 21:29 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Eric Biggers, Linux Crypto Mailing List, LKML, Netdev,
	David Miller, Andrew Lutomirski, Greg Kroah-Hartman,
	Samuel Neves, Daniel J . Bernstein, Tanja Lange,
	Jean-Philippe Aumasson, Karthikeyan Bhargavan

On Thu, Aug 2, 2018 at 7:48 PM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> Hey Andy,
>
> Thanks too for the feedback. Responses below:
>
> On Wed, Aug 1, 2018 at 7:09 PM Andy Lutomirski <luto@amacapital.net> wrote:
>> > I think the above changes would also naturally lead to a much saner
>> > patch series where each algorithm is added by its own patch, rather than
>> > one monster patch that adds many algorithms and 24000 lines of code.
>> >
>>
>> Yes, please.
>
> Ack, will be in v2.
>
>
>> I like this a *lot*.  (But why are you passing have_simd?  Shouldn't
>> that check live in chacha20_arch?  If there's some init code needed,
>> then chacha20_arch() should just return false before the init code
>> runs.  Once the arch does whatever feature detection it needs, it can
>> make chacha20_arch() start returning true.)
>
> The have_simd stuff is so that the FPU state can be amortized across
> several calls to the crypto functions. Here's a snippet from
> WireGuard's send.c:
>
> void packet_encrypt_worker(struct work_struct *work)
> {
>     struct crypt_queue *queue =
>         container_of(work, struct multicore_worker, work)->ptr;
>     struct sk_buff *first, *skb, *next;
>     bool have_simd = simd_get();

Gotcha.  That was very hidden in the 24k lines.  Please make this (and
any similar goodies) be their own patches.

Also, please consider spelling it differently:

simd_context_t simd_context = simd_get();

Because we'll feel very silly the first time some architecture has
more than one possible state.  (It wouldn't be entirely insane for x86
to distinguish between "no SIMD", "XMM only", and "go to town!", for
example.)


* Re: [PATCH v1 2/3] zinc: Introduce minimal cryptography library
  2018-08-03 21:29     ` Andy Lutomirski
@ 2018-08-03 22:10       ` Jason A. Donenfeld
  2018-08-07 18:54         ` Jason A. Donenfeld
  0 siblings, 1 reply; 20+ messages in thread
From: Jason A. Donenfeld @ 2018-08-03 22:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Eric Biggers, Linux Crypto Mailing List, LKML, Netdev,
	David Miller, Andrew Lutomirski, Greg Kroah-Hartman,
	Samuel Neves, Daniel J . Bernstein, Tanja Lange,
	Jean-Philippe Aumasson, Karthikeyan Bhargavan

On Fri, Aug 3, 2018 at 11:29 PM Andy Lutomirski <luto@amacapital.net> wrote:
> Gotcha.  That was very hidden in the 24k lines.  Please make this (and
> any similar goodies) be their own patches.

I know, sorry, and I certainly will have this split out. The above
code snippet was from the much much shorter WireGuard patch (3/3), but
indeed the simd_get/put/relax code is in the monster patch here, so
that'll certainly be split for v2.

>
> Also, please consider spelling it differently:
>
> simd_context_t simd_context = simd_get();
>
> Because we'll feel very silly the first time some architecture has
> more than one possible state.  (It wouldn't be entirely insane for x86
> to distinguish between "no SIMD", "XMM only", and "go to town!", for
> example.)

That's a great idea. It'll also make it clearer that users shouldn't
just dump a raw "true" in there; raw bools are more tempting to abuse
like that.


* Re: [PATCH v1 2/3] zinc: Introduce minimal cryptography library
  2018-08-03 22:10       ` Jason A. Donenfeld
@ 2018-08-07 18:54         ` Jason A. Donenfeld
  2018-08-07 19:43           ` Andy Lutomirski
  0 siblings, 1 reply; 20+ messages in thread
From: Jason A. Donenfeld @ 2018-08-07 18:54 UTC (permalink / raw)
  To: Andy Lutomirski, Eric Biggers
  Cc: Linux Crypto Mailing List, LKML, Netdev, David Miller,
	Andrew Lutomirski, Greg Kroah-Hartman, Samuel Neves,
	Daniel J . Bernstein, Tanja Lange, Jean-Philippe Aumasson,
	Karthikeyan Bhargavan

Hey Andy, Eric, & all,

I've started the work of separating this out into 16 individual
commits, have addressed numerous other things brought up like the
ifdef maze, and have begun rewriting (parts of) the original commit
message. It's still a work in progress, and I still have some work to
do, but if you want to follow along, things are happening here:

https://git.zx2c4.com/linux-dev/log/?h=jd/wireguard

I'll be rebasing and reworking this continuously, but that's how it's
shaping up.

As I'm still working on it, I won't be submitting v2 today, but if you
have suggestions or concerns or want to poke me while I'm working on
v2, don't hesitate to send me private email or ping me in IRC (I'm
"zx2c4" there) to chat.

Regards,
Jason


* Re: [PATCH v1 2/3] zinc: Introduce minimal cryptography library
  2018-08-07 18:54         ` Jason A. Donenfeld
@ 2018-08-07 19:43           ` Andy Lutomirski
  2018-08-07 23:48             ` Jason A. Donenfeld
  0 siblings, 1 reply; 20+ messages in thread
From: Andy Lutomirski @ 2018-08-07 19:43 UTC (permalink / raw)
  To: Jason A. Donenfeld, Ingo Molnar, Thomas Gleixner, linux-arch
  Cc: Eric Biggers, Linux Crypto Mailing List, LKML, Netdev,
	David Miller, Andrew Lutomirski, Greg Kroah-Hartman,
	Samuel Neves, Daniel J . Bernstein, Tanja Lange,
	Jean-Philippe Aumasson, Karthikeyan Bhargavan

On Tue, Aug 7, 2018 at 11:54 AM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> Hey Andy, Eric, & all,
>
> I've started the work of separating this out into 16 individual
> commits, have addressed numerous other things brought up like the
> ifdef maze, and have begun rewriting (parts of) the original commit
> message. It's still a work in progress, and I still have some work to
> do, but if you want to follow along, things are happening here:
>
> https://git.zx2c4.com/linux-dev/log/?h=jd/wireguard
>
> I'll be rebasing and reworking this continuously, but that's how it's
> shaping up.
>
> As I'm still working on it, I won't be submitting v2 today, but if you
> have suggestions or concerns or want to poke me while I'm working on
> v2, don't hesitate to send me private email or ping me in IRC (I'm
> "zx2c4" there) to chat.
>
> Regards,
> Jason

For "zinc: add simd helper", I think it should be in include/linux,
and include/linux/simd.h should (immediately or maybe in the future)
include <asm/simd.h> to pick up arch-specific stuff.  And the patch
should get sent to linux-arch@vger.kernel.org.

In your blake2s_arch() implementation, you're not passing in a
simd_context_t.  Is that still a work in progress?  I thought the plan
was to pass it in rather than doing the check in the _arch()
functions.


* Re: [PATCH v1 2/3] zinc: Introduce minimal cryptography library
  2018-08-07 19:43           ` Andy Lutomirski
@ 2018-08-07 23:48             ` Jason A. Donenfeld
  2018-08-08  1:48               ` Andy Lutomirski
  0 siblings, 1 reply; 20+ messages in thread
From: Jason A. Donenfeld @ 2018-08-07 23:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Thomas Gleixner, linux-arch, Eric Biggers,
	Linux Crypto Mailing List, LKML, Netdev, David Miller,
	Andrew Lutomirski, Greg Kroah-Hartman, Samuel Neves,
	Daniel J . Bernstein, Tanja Lange, Jean-Philippe Aumasson,
	Karthikeyan Bhargavan

Hey Andy,

On Tue, Aug 7, 2018 at 12:43 PM Andy Lutomirski <luto@amacapital.net> wrote:
> For "zinc: add simd helper", I think it should be in include/linux,
> and include/linux/simd.h should (immediately or maybe in the future)
> include <asm/simd.h> to pick up arch-specific stuff.  And the patch
> should get sent to linux-arch@vger.kernel.org.

I guess you saw my prompt about that in the previous commit message?
Based on your encouragement, I implemented it:
https://git.zx2c4.com/linux-dev/commit/?h=simd This is _far_ more
invasive than I wanted to be, as I don't want this patch submission to
grow unwieldy and never be merged, but I guess we can roll with this
for now...

> In your blake2s_arch() implementation, you're not passing in a
> simd_context_t.  Is that still a work in progress?  I thought the plan
> was to pass it in rather than doing the check in the _arch()
> functions.

I'm inclined to do the explicit context passing only when a function
is likely to be used in that kind of environment, and adjust as
needed. Long term, anyway, that API will be removed once the x86 guys
figure out lazy FPU restoration and the amortization doesn't add
anything.

Jason


* Re: [PATCH v1 2/3] zinc: Introduce minimal cryptography library
  2018-08-07 23:48             ` Jason A. Donenfeld
@ 2018-08-08  1:48               ` Andy Lutomirski
  2018-08-08  1:51                 ` Jason A. Donenfeld
  0 siblings, 1 reply; 20+ messages in thread
From: Andy Lutomirski @ 2018-08-08  1:48 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Ingo Molnar, Thomas Gleixner, linux-arch, Eric Biggers,
	Linux Crypto Mailing List, LKML, Netdev, David Miller,
	Andrew Lutomirski, Greg Kroah-Hartman, Samuel Neves,
	Daniel J . Bernstein, Tanja Lange, Jean-Philippe Aumasson,
	Karthikeyan Bhargavan



> On Aug 7, 2018, at 4:48 PM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> 
> Hey Andy,
> 
>> On Tue, Aug 7, 2018 at 12:43 PM Andy Lutomirski <luto@amacapital.net> wrote:
>> For "zinc: add simd helper", I think it should be in include/linux,
>> and include/linux/simd.h should (immediately or maybe in the future)
>> include <asm/simd.h> to pick up arch-specific stuff.  And the patch
>> should get sent to linux-arch@vger.kernel.org.
> 
> I guess you saw my prompt about that in the previous commit message?
> Based on your encouragement, I implemented it:
> https://git.zx2c4.com/linux-dev/commit/?h=simd This is _far_ more
> invasive than I wanted to be, as I don't want this patch submission to
> grow unwieldy and never be merged, but I guess we can roll with this
> for now...
> 

I really wish we had a way to see that we use asm-generic’s copy of a header in all cases except where an arch opts out.

>> In your blake2s_arch() implementation, you're not passing in a
>> simd_context_t.  Is that still a work in progress?  I thought the plan
>> was to pass it in rather than doing the check in the _arch()
>> functions.
> 
> I'm inclined to do the explicit context passing only when a function
> is likely to be used in that kind of environment, and adjust as
> needed. Long term, anyway, that API will be removed once the x86 guys
> figure out lazy FPU restoration and the amortization doesn't add
> anything.

Fair enough.

> 
> Jason


* Re: [PATCH v1 2/3] zinc: Introduce minimal cryptography library
  2018-08-08  1:48               ` Andy Lutomirski
@ 2018-08-08  1:51                 ` Jason A. Donenfeld
  2018-08-09 18:08                   ` Andy Lutomirski
  0 siblings, 1 reply; 20+ messages in thread
From: Jason A. Donenfeld @ 2018-08-08  1:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Thomas Gleixner, linux-arch, Eric Biggers,
	Linux Crypto Mailing List, LKML, Netdev, David Miller,
	Andrew Lutomirski, Greg Kroah-Hartman, Samuel Neves,
	Daniel J . Bernstein, Tanja Lange, Jean-Philippe Aumasson,
	Karthikeyan Bhargavan

On Tue, Aug 7, 2018 at 6:49 PM Andy Lutomirski <luto@amacapital.net> wrote:
> I really wish we had a way to see that we use asm-generic’s copy of a header in all cases except where an arch opts out.

It's really not that hard to do -- symlink asm-generic to a target
called "asm" inside an otherwise empty directory, and add that
otherwise empty directory to the -I paths just after arch/include.
Since it's searched second, it's only used if the first fails. Maybe
I'm missing something though, as this seems a bit too obvious. Perhaps
a project for another day.


* Re: [PATCH v1 2/3] zinc: Introduce minimal cryptography library
  2018-08-08  1:51                 ` Jason A. Donenfeld
@ 2018-08-09 18:08                   ` Andy Lutomirski
  0 siblings, 0 replies; 20+ messages in thread
From: Andy Lutomirski @ 2018-08-09 18:08 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Ingo Molnar, Thomas Gleixner, linux-arch, Eric Biggers,
	Linux Crypto Mailing List, LKML, Netdev, David Miller,
	Andrew Lutomirski, Greg Kroah-Hartman, Samuel Neves,
	Daniel J . Bernstein, Tanja Lange, Jean-Philippe Aumasson,
	Karthikeyan Bhargavan

On Tue, Aug 7, 2018 at 6:51 PM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> On Tue, Aug 7, 2018 at 6:49 PM Andy Lutomirski <luto@amacapital.net> wrote:
>> I really wish we had a way to see that we use asm-generic’s copy of a header in all cases except where an arch opts out.
>
> It's really not that hard to do -- symlink asm-generic to a target
> called "asm" inside an otherwise empty directory, and add that
> otherwise empty directory to the -I paths just after arch/include.
> Since it's searched second, it's only used if the first fails. Maybe
> I'm missing something though, as this seems a bit too obvious. Perhaps
> a project for another day.

The problem here (I think) is that it's preferable for stray files in
the kernel tree to have no effect and, with this scheme, stray files
in arch/*/include/asm could affect the build.  So something explicit
has an advantage.  I just think there should be a way to say "this
asm-generic file should affect all arches unless they generic-n or
override-generic-y it or whatever".


* Re: [PATCH v1 2/3] zinc: Introduce minimal cryptography library
  2018-08-03  2:33 ` Jason A. Donenfeld
@ 2018-08-14 21:12   ` Eric Biggers
  2018-08-15 16:28     ` D. J. Bernstein
  2018-08-16  6:31     ` Jason A. Donenfeld
  0 siblings, 2 replies; 20+ messages in thread
From: Eric Biggers @ 2018-08-14 21:12 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Eric Biggers, Linux Crypto Mailing List, LKML, Netdev,
	David Miller, Andrew Lutomirski, Greg Kroah-Hartman,
	Samuel Neves, Daniel J . Bernstein, Tanja Lange,
	Jean-Philippe Aumasson, Karthikeyan Bhargavan

On Fri, Aug 03, 2018 at 04:33:50AM +0200, Jason A. Donenfeld wrote:
> 
> > Also, earlier when I tested OpenSSL's ChaCha NEON implementation on ARM
> > Cortex-A7 it was actually quite a bit slower than the one in the Linux
> > kernel written by Ard Biesheuvel... I trust that when claiming the
> > performance of all implementations you're adding is "state-of-the-art
> > and unrivaled", you actually compared them to the ones already in the
> > Linux kernel which you're advocating replacing, right? :-)
> 
> Yes, I have, and my results don't corroborate your findings. It will
> be interesting to get a wider variety of hardware out for comparisons.
> I suspect, also, that if the snarky emoticons subside, AndyP would be
> very interested in whatever we find and could have interest in
> improving implementations, should we actually find performance
> differences.
> 

On ARM Cortex-A7, OpenSSL's ChaCha20 implementation is 13.9 cpb (cycles per
byte), whereas Linux's is faster: 11.9 cpb.  I've also recently improved the
Linux implementation to 11.3 cpb and would like to send out a patch soon...
I've also written a scalar ChaCha20 implementation (no NEON instructions!) that
is 12.2 cpb on one block at a time on Cortex-A7, taking advantage of the free
rotates; that would be useful for the single permutation used to compute
XChaCha's subkey, and also for the ends of messages.

The reason Linux's ChaCha20 NEON implementation is faster than OpenSSL's is that
Linux's does 4 blocks at once using NEON instructions, and the words are
de-interleaved so the rows don't need to be shifted between each round.
OpenSSL's implementation, on the other hand, only does 3 blocks at once with
NEON instructions and has to shift the rows between each round.  OpenSSL's
implementation also does a 4th block at the same time using regular ARM
instructions, but that doesn't help on Cortex-A7; it just makes it slower.

I understand there are tradeoffs, and different implementations can be faster on
different CPUs.  Just know that from my point of view, switching to the OpenSSL
implementation actually introduces a performance regression, and we care a *lot*
about this since we need ChaCha to be absolutely as fast as possible for HPolyC
disk encryption.  So if your proposal goes in, I'd likely need to write a patch
to get the old performance back, at least on Cortex-A7...

Also, I don't know whether Andy P. considered the 4xNEON implementation
technique.  It could even be fastest on other ARM CPUs too, I don't know.

- Eric

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v1 2/3] zinc: Introduce minimal cryptography library
  2018-08-14 21:12   ` Eric Biggers
@ 2018-08-15 16:28     ` D. J. Bernstein
  2018-08-15 19:57       ` Eric Biggers
  2018-08-16  6:31     ` Jason A. Donenfeld
  1 sibling, 1 reply; 20+ messages in thread
From: D. J. Bernstein @ 2018-08-15 16:28 UTC (permalink / raw)
  To: Eric Biggers, Jason A. Donenfeld, Eric Biggers,
	Linux Crypto Mailing List, LKML, Netdev, David Miller,
	Andrew Lutomirski, Greg Kroah-Hartman, Samuel Neves, Tanja Lange,
	Jean-Philippe Aumasson, Karthikeyan Bhargavan


Eric Biggers writes:
> I've also written a scalar ChaCha20 implementation (no NEON instructions!) that
> is 12.2 cpb on one block at a time on Cortex-A7, taking advantage of the free
> rotates; that would be useful for the single permutation used to compute
> XChaCha's subkey, and also for the ends of messages.

This is also how ends of messages are handled in the 2012 implementation
crypto_stream/salsa20/armneon6 (see "mainloop1") inside the SUPERCOP
benchmarking framework:

   https://bench.cr.yp.to/supercop.html

This code is marginally different from Eric's new code because the
occasional loads and stores are scheduled for the Cortex-A8 rather than
the Cortex-A7, and because it's Salsa20 rather than ChaCha20.

The bigger picture is that there are 63 implementations of Salsa20 and
ChaCha20 in SUPERCOP from 10 authors showing various implementation
techniques, including all the techniques that have been mentioned in
this thread; and centralized benchmarks on (e.g.)

   https://bench.cr.yp.to/results-stream.html#amd64-kizomba
   https://bench.cr.yp.to/web-impl/amd64-kizomba-crypto_stream-salsa20.html

showing what's fastest on various platforms, using well-developed
benchmarking tools that produce repeatable, meaningful measurements.
There are also various papers explaining the main techniques.

Of course it's possible that new code will do better, especially on
platforms with different performance characteristics from the platforms
previously targeted. Contributing new implementations to SUPERCOP is
easy---which is why SUPERCOP already has thousands of implementations of
hundreds of cryptographic functions---and is a more effective way to
advertise speedups than adding code merely to (e.g.) the Linux kernel.
Infrastructure is centralized in SUPERCOP to minimize per-implementation
work. There's no risk of being rejected on the basis of cryptographic
concerns (MD5, Speck, and RSA-512 are included in the benchmarks) or
code-style concerns. Users can then decide which implementations best
meet their requirements.

"Do better" seems to be what's happened for the Cortex-A7. The best
SUPERCOP speeds (from code targeting the Cortex-A8 etc.) are 13.42
cycles/byte for 4096 bytes for ChaCha20; 12.2, 11.9, and 11.3 sound
noticeably better. The Cortex-A7 is an interesting case because it's
simultaneously (1) widely deployed---more than a billion units sold---
but (2) poorly documented. If you want to know, e.g., which instructions
can dual-issue with loads/FPU moves/..., then you won't be able to find
anything from ARM giving the answer. I've started building an automated
tool to compute the full CPU pipeline structure from benchmarks, but
this isn't ready yet.

---Dan

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v1 2/3] zinc: Introduce minimal cryptography library
  2018-08-15 16:28     ` D. J. Bernstein
@ 2018-08-15 19:57       ` Eric Biggers
  2018-08-16  4:24         ` D. J. Bernstein
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Biggers @ 2018-08-15 19:57 UTC (permalink / raw)
  To: D. J. Bernstein
  Cc: Jason A. Donenfeld, Eric Biggers, Linux Crypto Mailing List,
	LKML, Netdev, David Miller, Andrew Lutomirski,
	Greg Kroah-Hartman, Samuel Neves, Tanja Lange,
	Jean-Philippe Aumasson, Karthikeyan Bhargavan

On Wed, Aug 15, 2018 at 04:28:19PM -0000, D. J. Bernstein wrote:
> Eric Biggers writes:
> > I've also written a scalar ChaCha20 implementation (no NEON instructions!) that
> > is 12.2 cpb on one block at a time on Cortex-A7, taking advantage of the free
> > rotates; that would be useful for the single permutation used to compute
> > XChaCha's subkey, and also for the ends of messages.
> 
> This is also how ends of messages are handled in the 2012 implementation
> crypto_stream/salsa20/armneon6 (see "mainloop1") inside the SUPERCOP
> benchmarking framework:
> 
>    https://bench.cr.yp.to/supercop.html
> 
> This code is marginally different from Eric's new code because the
> occasional loads and stores are scheduled for the Cortex-A8 rather than
> the Cortex-A7, and because it's Salsa20 rather than ChaCha20.
> 
> The bigger picture is that there are 63 implementations of Salsa20 and
> ChaCha20 in SUPERCOP from 10 authors showing various implementation
> techniques, including all the techniques that have been mentioned in
> this thread; and centralized benchmarks on (e.g.)
> 
>    https://bench.cr.yp.to/results-stream.html#amd64-kizomba
>    https://bench.cr.yp.to/web-impl/amd64-kizomba-crypto_stream-salsa20.html
> 
> showing what's fastest on various platforms, using well-developed
> benchmarking tools that produce repeatable, meaningful measurements.
> There are also various papers explaining the main techniques.
> 
> Of course it's possible that new code will do better, especially on
> platforms with different performance characteristics from the platforms
> previously targeted. Contributing new implementations to SUPERCOP is
> easy---which is why SUPERCOP already has thousands of implementations of
> hundreds of cryptographic functions---and is a more effective way to
> advertise speedups than adding code merely to (e.g.) the Linux kernel.
> Infrastructure is centralized in SUPERCOP to minimize per-implementation
> work. There's no risk of being rejected on the basis of cryptographic
> concerns (MD5, Speck, and RSA-512 are included in the benchmarks) or
> code-style concerns. Users can then decide which implementations best
> meet their requirements.
> 
> "Do better" seems to be what's happened for the Cortex-A7. The best
> SUPERCOP speeds (from code targeting the Cortex-A8 etc.) are 13.42
> cycles/byte for 4096 bytes for ChaCha20; 12.2, 11.9, and 11.3 sound
> noticeably better. The Cortex-A7 is an interesting case because it's
> simultaneously (1) widely deployed---more than a billion units sold---
> but (2) poorly documented. If you want to know, e.g., which instructions
> can dual-issue with loads/FPU moves/..., then you won't be able to find
> anything from ARM giving the answer. I've started building an automated
> tool to compute the full CPU pipeline structure from benchmarks, but
> this isn't ready yet.
> 

Hi Dan, are you saying I should contribute my scalar ChaCha implementation,
and/or the Linux kernel's ChaCha NEON implementation, to SUPERCOP?  Just a few
comments: there doesn't appear to be an official git repository for SUPERCOP,
nor is there any mention of how to send patches, nor is there any COPYING or
LICENSE file, nor even a README file.  So, while I'm interested, from my
perspective SUPERCOP doesn't seem friendly to contributions.  You'd probably
attract more contributors if you followed established open source conventions.

Another issue is that the ChaCha code in SUPERCOP is duplicated for each number
of rounds: 8, 12, and 20.  So if anyone adds a new ChaCha implementation, they'd
apparently have to add 3 copies of it, which isn't very maintainable.

There are actually only two ARM NEON implementations of ChaCha20 in SUPERCOP:
(1) the one in crypto_stream/chacha20/moon/neon/32/chacha.S, which looks like
'objdump' output, with no comments or anything else (macros, register aliases)
that would make it readable; and (2) the one in
crypto_stream/chacha20/dolbeau/arm-neon/, which uses a method similar to the
Linux implementation but it uses GCC intrinsics, so its performance will heavily
depend on how the compiler assigns and spills registers, which can vary greatly
depending on the compiler version and options.

I understand that Salsa20 is similar to ChaCha, and that ideas from Salsa20
implementations often apply to ChaCha too.  But it's not always obvious what
carries over and what doesn't; the rotation amounts can matter a lot, for
example, as different rotations can be implemented in different ways.  Nor is it
always obvious which ideas from SSE2 or AVX2 implementations (for example) carry
over to NEON implementations, as these instruction sets are different enough
that each has its own unique quirks and optimizations.

Previously I also found that OpenSSL's ARM NEON implementation of Poly1305 is
much faster than the implementations in SUPERCOP, as well as more
understandable.  (I don't know the 'qhasm' language, for example.)  So from my
perspective, I've had more luck with OpenSSL than SUPERCOP when looking for fast
implementations of crypto algorithms.  Have you considered adding the OpenSSL
implementations to SUPERCOP?

In the end though, both Linux and OpenSSL don't need every implementation under
the sun, but rather a small number of implementations that provide "good"
performance on nearly all CPUs, while not necessarily being optimal on every CPU
out there.  I.e., some tradeoffs are necessary and acceptable.  So for
ChaCha{12,20} on ARM we should choose which 2-3 implementations are most
valuable to have and make them as generally well optimized as possible.  Based
on the research I've seen and done, I think they're likely to be:

	- a 1x_scalar implementation
	- a 3x_NEON+1x_scalar implementation like OpenSSL's
	- a 4x_NEON implementation like Linux's current one

Currently my main argument is just that having the 3x_NEON+1x_scalar method be
the only NEON implementation is probably insufficient, as there are important
CPUs where the 4x_NEON method is significantly faster.

Thanks,

Eric

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v1 2/3] zinc: Introduce minimal cryptography library
  2018-08-15 19:57       ` Eric Biggers
@ 2018-08-16  4:24         ` D. J. Bernstein
  2018-08-16 19:46           ` Eric Biggers
  0 siblings, 1 reply; 20+ messages in thread
From: D. J. Bernstein @ 2018-08-16  4:24 UTC (permalink / raw)
  To: Eric Biggers, Jason A. Donenfeld, Eric Biggers,
	Linux Crypto Mailing List, LKML, Netdev, David Miller,
	Andrew Lutomirski, Greg Kroah-Hartman, Samuel Neves, Tanja Lange,
	Jean-Philippe Aumasson, Karthikeyan Bhargavan


Eric Biggers writes:
> You'd probably attract more contributors if you followed established
> open source conventions.

SUPERCOP already has thousands of implementations from hundreds of
contributors. New speed records are more likely to appear in SUPERCOP
than in any other cryptographic software collection. The API is shared
by state-of-the-art benchmarks, state-of-the-art tests, three ongoing
competitions, and increasingly popular production libraries.

Am I correctly gathering from this thread that someone adding a new
implementation of a crypto primitive to the kernel has to worry about
checking the architecture and CPU features to figure out whether the
implementation will run? Wouldn't it make more sense to take this
error-prone work away from the implementor and have a robust automated
central testing mechanism, as in SUPERCOP?

Am I also correctly gathering that adding an extra implementation to the
kernel can hurt performance, unless the implementor goes to extra effort
to check for the CPUs where the previous implementation is faster---or
to build some ad-hoc timing mechanism ("raid6: using algorithm avx2x4
gen() 31737 MB/s")? Wouldn't it make more sense to take this error-prone
work away from the implementor and have a robust automated central
timing mechanism, as in SUPERCOP?

I also didn't notice anyone disputing Jason's comment about the "general
clunkiness" of the kernel's internal crypto API---but is there really no
consensus as to what the replacement API is supposed to be? Someone who
simply wants to implement some primitives has to decide on function-call
details, argue about the software location, add configuration options,
etc.? Wouldn't it make more sense to do this centrally, as in SUPERCOP?

And then there's the bigger question of how the community is organizing
ongoing work on accelerating---and auditing, and fixing, and hopefully
verifying---implementations of cryptographic primitives. Does it really
make sense that people looking for what's already been done have to go
poking around a bunch of separate libraries? Wouldn't it make more sense
to have one central collection of code, as in SUPERCOP? Is there any
fundamental obstacle to having libraries share code for primitives?

> there doesn't appear to be an official git repository for SUPERCOP,
> nor is there any mention of how to send patches, nor is there any
> COPYING or LICENSE file, nor even a README file.

https://bench.cr.yp.to/call-stream.html explains the API and submission
procedure for stream ciphers. There are similar pages for other types of
cryptographic primitives. https://bench.cr.yp.to/tips.html explains the
develop-test cycle and various useful options.

Licenses vary across implementations. There's a minimum requirement of
public distribution for verifiability of benchmark results, but it's up
to individual implementors to decide what they'll allow beyond that.
Patent status also varies; constant-time status varies; verification
status varies; code quality varies; cryptographic security varies; etc.
As I mentioned, SUPERCOP includes MD5 and Speck and RSA-512.

For comparison, where can I find an explanation of how to test kernel
crypto patches, and how fast is the develop-test cycle? Okay, I don't
have a kernel crypto patch, but I did write a kernel patch recently that
(I think) fixes some recent Lenovo ACPI stupidity:

   https://marc.info/?l=qubes-users&m=153308905514481

I'd propose this for review and upstream adoption _if_ it survives
enough tests---but what's the right test procedure? I see superficial
documentation of where to submit a patch for review, but am I really
supposed to do this before serious testing? The patch works on my
laptop, and several other people say it works, but obviously this is
missing the big question of whether the patch breaks _other_ laptops.
I see an online framework for testing, but using it looks awfully
complicated, and the level of coverage is unclear to me. Has anyone
tried to virtualize kernel testing---to capture hardware data from many
machines and then centrally simulate kernels running on those machines,
for example to check that those machines don't take certain code paths?
I suppose that people who work with the kernel all the time would know
what to do, but for me the lack of information was enough of a deterrent
that I switched to doing something else.

> Another issue is that the ChaCha code in SUPERCOP is duplicated for
> each number of rounds: 8, 12, and 20.

These are auto-generated, of course.

To understand this API detail, consider some of the possibilities for
the round counts supported by compiled code:

   * 20
   * 12
   * 8
   * caller selection from among 20 and 12 and 8
   * caller selection of any multiple of 4
   * caller selection of any multiple of 2
   * caller selection of anything

I hope that in the long term everyone is simply using 20, and then the
pure 20 is the simplest and smallest and most easily verified code, but
obviously there are other implementations today. An API with a separate
function for each round count allows any of these implementations to be
trivially benchmarked and used, whereas an API that insists on passing
the round count as an argument prohibits at least the first three and
maybe more.

> crypto_stream/chacha20/dolbeau/arm-neon/, which uses a method similar to the
> Linux implementation but it uses GCC intrinsics, so its performance will heavily
> depend on how the compiler assigns and spills registers, which can vary greatly
> depending on the compiler version and options.

Sure. The damage done by incompetent compilers is particularly clear for
in-order CPUs such as the Cortex-A7.

> I understand that Salsa20 is similar to ChaCha, and that ideas from Salsa20
> implementations often apply to ChaCha too.  But it's not always obvious what
> carries over and what doesn't; the rotation amounts can matter a lot, for
> example, as different rotations can be implemented in different ways.

This sounds backwards to me. ChaCha20 supports essentially all the
Salsa20 implementation techniques plus some extra streamlining: often a
bit less register pressure, often less data reorganization, and often
some rotation speedups.

> Nor is it always obvious which ideas from SSE2 or AVX2 implementations
> (for example) carry over to NEON implementations, as these instruction
> sets are different enough that each has its own unique quirks and
> optimizations.

Of course.

> Previously I also found that OpenSSL's ARM NEON implementation of Poly1305 is
> much faster than the implementations in SUPERCOP, as well as more
> understandable.  (I don't know the 'qhasm' language, for example.)  So from my
> perspective, I've had more luck with OpenSSL than SUPERCOP when looking for fast
> implementations of crypto algorithms.  Have you considered adding the OpenSSL
> implementations to SUPERCOP?

Almost all of the implementations in SUPERCOP were submitted by the
implementors, with a few exceptions for wrappers. Realistically, the
implementors are in the best position to check that they're getting the
expected results and to be in control of any necessary updates.

---Dan

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v1 2/3] zinc: Introduce minimal cryptography library
  2018-08-14 21:12   ` Eric Biggers
  2018-08-15 16:28     ` D. J. Bernstein
@ 2018-08-16  6:31     ` Jason A. Donenfeld
  1 sibling, 0 replies; 20+ messages in thread
From: Jason A. Donenfeld @ 2018-08-16  6:31 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Eric Biggers, Linux Crypto Mailing List, LKML, Netdev,
	David Miller, Andrew Lutomirski, Greg Kroah-Hartman,
	Samuel Neves, Daniel J . Bernstein, Tanja Lange,
	Jean-Philippe Aumasson, Karthikeyan Bhargavan

Hi Eric,

On Tue, Aug 14, 2018 at 2:12 PM Eric Biggers <ebiggers@kernel.org> wrote:
> On ARM Cortex-A7, OpenSSL's ChaCha20 implementation is 13.9 cpb (cycles per
> byte), whereas Linux's is faster: 11.9 cpb.
>
> The reason Linux's ChaCha20 NEON implementation is faster than OpenSSL's
>
> I understand there are tradeoffs, and different implementations can be faster on
> different CPUs.
>
> So if your proposal goes in, I'd likely need to write a patch
> to get the old performance back, at least on Cortex-A7...

Yes, absolutely. Different CPUs behave differently indeed, but if you
have improvements for hardware that matters to you, we should
certainly incorporate these, and also loop Andy Polyakov in (I've
added him to the CC for the WIP v2). ChaCha is generally pretty
obvious, but for big integer algorithms -- like Poly1305 and
Curve25519 -- I think it's all the more important to involve Andy and
the rest of the world in general, so that Linux benefits from bug
research and fuzzing in places that are typically and classically
prone to nasty issues. In other words, let's definitely incorporate
your improvements after the patchset goes in, and at the same time
we'll try to bring Andy and others into the fold, so that our
improvements can track each other's.

> Also, I don't know whether Andy P. considered the 4xNEON implementation
> technique.  It could even be fastest on other ARM CPUs too, I don't know.

After v2, when he's CC'd in, let's plan to start discussing this with him.

Jason

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v1 2/3] zinc: Introduce minimal cryptography library
  2018-08-16  4:24         ` D. J. Bernstein
@ 2018-08-16 19:46           ` Eric Biggers
  2018-08-17  7:31             ` D. J. Bernstein
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Biggers @ 2018-08-16 19:46 UTC (permalink / raw)
  To: D. J. Bernstein
  Cc: Jason A. Donenfeld, Eric Biggers, Linux Crypto Mailing List,
	LKML, Netdev, David Miller, Andrew Lutomirski,
	Greg Kroah-Hartman, Samuel Neves, Tanja Lange,
	Jean-Philippe Aumasson, Karthikeyan Bhargavan

Hi Dan,

(I reordered your responses slightly to group together similar topics)

On Thu, Aug 16, 2018 at 04:24:54AM -0000, D. J. Bernstein wrote:
> Eric Biggers writes:
> > You'd probably attract more contributors if you followed established
> > open source conventions.
> 
> SUPERCOP already has thousands of implementations from hundreds of
> contributors. New speed records are more likely to appear in SUPERCOP
> than in any other cryptographic software collection. The API is shared
> by state-of-the-art benchmarks, state-of-the-art tests, three ongoing
> competitions, and increasingly popular production libraries.
[...]
> > there doesn't appear to be an official git repository for SUPERCOP,
> > nor is there any mention of how to send patches, nor is there any
> > COPYING or LICENSE file, nor even a README file.
> 
> https://bench.cr.yp.to/call-stream.html explains the API and submission
> procedure for stream ciphers. There are similar pages for other types of
> cryptographic primitives. https://bench.cr.yp.to/tips.html explains the
> develop-test cycle and various useful options.
> 
> Licenses vary across implementations. There's a minimum requirement of
> public distribution for verifiability of benchmark results, but it's up
> to individual implementors to decide what they'll allow beyond that.
> Patent status also varies; constant-time status varies; verification
> status varies; code quality varies; cryptographic security varies; etc.
> As I mentioned, SUPERCOP includes MD5 and Speck and RSA-512.

Many people may have contributed to SUPERCOP already, but that doesn't mean
there aren't things you could do to make it more appealing to contributors and
more of a community project, especially since you're suggesting it to the Linux
kernel community.  There should be a git repository with the development
history, including all documentation files so they are available even if your
website goes offline (i.e. the "bus factor" should be > 1), and there should be
a way to contribute by a conventional means that invites public review, e.g.
sending a patch to a mailing list or opening a Github pull request.

The apparent lack of a license for the SUPERCOP benchmarking framework itself is
also problematic.  Though you've marked some individual files as "Public
domain", not all files are so marked (e.g. crypto_stream/measure.c), so AFAIK
there is no explicit permission given to redistribute them.  In some people's
view, license-less projects like this aren't really free software.  So Linux
distributions may not want to take on the legal risk of distributing it, nor may
companies want to take on the risk of contributing.  It may seem silly, but some
people do take these things *very* seriously!

Fortunately, it's easy for you to fix the licensing: just add a standard license
like the MIT license as a file named COPYING in the top-level directory.

> Am I correctly gathering from this thread that someone adding a new
> implementation of a crypto primitive to the kernel has to worry about
> checking the architecture and CPU features to figure out whether the
> implementation will run? Wouldn't it make more sense to take this
> error-prone work away from the implementor and have a robust automated
> central testing mechanism, as in SUPERCOP?
> 
> Am I also correctly gathering that adding an extra implementation to the
> kernel can hurt performance, unless the implementor goes to extra effort
> to check for the CPUs where the previous implementation is faster---or
> to build some ad-hoc timing mechanism ("raid6: using algorithm avx2x4
> gen() 31737 MB/s")? Wouldn't it make more sense to take this error-prone
> work away from the implementor and have a robust automated central
> timing mechanism, as in SUPERCOP?

If you're talking about things like "don't use the AVX2 implementation if the
CPU doesn't support AVX2", then of course that has to be checked, but it's
straightforward.

If (more likely) you're talking about things like "use this NEON implementation
on Cortex-A7 but this other NEON implementation on Cortex-A53", it's up to the
developers and community to test different CPUs and make appropriate decisions,
and yes it can be very useful to have external benchmarks like SUPERCOP to refer
to, and I appreciate your work in that area.  Note that in practice, the
priority order in Linux's crypto API has actually tended to be straightforward,
like generic -> SSE2 -> SSE3 -> AVX2, since historically people haven't cared
quite enough about crypto performance in the kernel to microoptimize it for
individual CPU microarchitectures, nor have there been many weird cases like ARM
NEON where scalar instructions are free when interleaved with vector
instructions on some CPUs but not others.  Maybe that's changing as more people
need optimal crypto performance in the kernel.

> I also didn't notice anyone disputing Jason's comment about the "general
> clunkiness" of the kernel's internal crypto API---but is there really no
> consensus as to what the replacement API is supposed to be? Someone who
> simply wants to implement some primitives has to decide on function-call
> details, argue about the software location, add configuration options,
> etc.? Wouldn't it make more sense to do this centrally, as in SUPERCOP?

Not yet; that's the purpose of code review, so that a consensus can be reached.
But if/when the new APIs land, the next contributor who wants to do things
similarly will have a much easier time.

As a side note, are you certain you've received and read all responses to this
thread?  I haven't always bothered replying to your "qsecretary notice", and I
suspect that many others do the same.  (Imagine if everyone used that, so
everyone had to reply to 10 "qsecretary notices" on threads like this!)

> And then there's the bigger question of how the community is organizing
> ongoing work on accelerating---and auditing, and fixing, and hopefully
> verifying---implementations of cryptographic primitives. Does it really
> make sense that people looking for what's already been done have to go
> poking around a bunch of separate libraries? Wouldn't it make more sense
> to have one central collection of code, as in SUPERCOP? Is there any
> fundamental obstacle to having libraries share code for primitives?

A lot of code can be shared, but in practice different environments have
different constraints, and kernel programming in particular has some distinct
differences from userspace programming.  For example, you cannot just use the
FPU (including SSE, AVX, NEON, etc.) registers whenever you want to, since on
most architectures they can't be used in some contexts such as hardirq context,
and even when they *can* be used you have to run special code before and after
which does things like saving all the FPU registers to the task_struct,
disabling preemption, and/or enabling the FPU.  But disabling preemption for
long periods of time hurts responsiveness, so it's also desirable to yield the
processor occasionally, which means that assembly implementations should be
incremental rather than having a single entry point that does everything.

There are also many other rules that must be followed in kernel code, like not
being free to use the %rbp register on x86 for anything other than the frame
base pointer, or having to make indirect calls in asm code through a special
macro that mitigates the Spectre vulnerability.

So yes, crypto code can sometimes be shared, but changes often do need to be
made for the kernel.

> For comparison, where can I find an explanation of how to test kernel
> crypto patches, and how fast is the develop-test cycle? Okay, I don't
> have a kernel crypto patch, but I did write a kernel patch recently that
> (I think) fixes some recent Lenovo ACPI stupidity:
> 
>    https://marc.info/?l=qubes-users&m=153308905514481
> 
> I'd propose this for review and upstream adoption _if_ it survives
> enough tests---but what's the right test procedure? I see superficial
> documentation of where to submit a patch for review, but am I really
> supposed to do this before serious testing? The patch works on my
> laptop, and several other people say it works, but obviously this is
> missing the big question of whether the patch breaks _other_ laptops.
> I see an online framework for testing, but using it looks awfully
> complicated, and the level of coverage is unclear to me. Has anyone
> tried to virtualize kernel testing---to capture hardware data from many
> machines and then centrally simulate kernels running on those machines,
> for example to check that those machines don't take certain code paths?
> I suppose that people who work with the kernel all the time would know
> what to do, but for me the lack of information was enough of a deterrent
> that I switched to doing something else.

The process for submitting Linux kernel patches is very well documented.  If
you're interested in contributing, you're welcome to read
Documentation/process/submitting-patches.rst and the other documentation.

As for testing, things are different for different parts of the kernel.  Some
parts are covered well by automated tests, others rely more on manual tests, and
others rely more on a large community running the kernel, especially -rc
(release candidate) kernels, and reporting any problems.  For the crypto API
specifically, there are correctness self-tests that are automatically run when
an option is set in the kernel .config.  Developers should always run them, but
even if they don't, other people are running them too on various hardware.

ACPI workarounds for firmware bugs are a somewhat different story...  AFAIK,
those types of patches rely on testing and review done by the developer, the
subsystem maintainers, and the broader community; as well as the community
reporting any problems they experience in -rc kernels.  Many people, such as
myself, run -rc kernels on their computer(s) and will report any problems found.
Yes, the standards of correctness for those types of things are not as high as
for crypto algorithms, which is a shame; but when you're talking about hacking
around bugs in specific firmware versions in specific laptops, there's probably
not much more that can be done in practice...  So I encourage you to still send
out your ACPI patch!

Thanks,

- Eric

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v1 2/3] zinc: Introduce minimal cryptography library
  2018-08-16 19:46           ` Eric Biggers
@ 2018-08-17  7:31             ` D. J. Bernstein
  2018-08-18  8:13               ` Ard Biesheuvel
  0 siblings, 1 reply; 20+ messages in thread
From: D. J. Bernstein @ 2018-08-17  7:31 UTC (permalink / raw)
  To: Eric Biggers, Jason A. Donenfeld, Eric Biggers,
	Linux Crypto Mailing List, LKML, Netdev, David Miller,
	Andrew Lutomirski, Greg Kroah-Hartman, Samuel Neves, Tanja Lange,
	Jean-Philippe Aumasson, Karthikeyan Bhargavan

Eric Biggers writes:
> If (more likely) you're talking about things like "use this NEON implementation
> on Cortex-A7 but this other NEON implementation on Cortex-A53", it's up to the
> developers and community to test different CPUs and make appropriate decisions,
> and yes it can be very useful to have external benchmarks like SUPERCOP to refer
> to, and I appreciate your work in that area.

You seem to be talking about a process that selects (e.g.) ChaCha20
implementations as follows: manually inspect benchmarks of various
implementations on various CPUs, manually write code to map CPUs to
implementations, manually update the code as necessary for new CPUs, and
of course manually do the same for every other primitive that can see
differences between microarchitectures (which isn't something weird---
it's the normal situation after enough optimization effort).

This is quite a bit of manual work, so the kernel often doesn't do it,
so we end up with unhappy people talking about performance regressions.

For comparison, imagine one simple central piece of code in the kernel
to automatically do the following:

   When a CPU core is booted:
     For each primitive:
       Benchmark all implementations of the primitive on the core.
       Select the fastest for subsequent use on the core.

If this is a general-purpose mechanism (as in SUPERCOP, NaCl, and
libpqcrypto) rather than something ad-hoc (as in raid6), then there's no
manual work per primitive, and no work per implementation. Each CPU, old
or new, automatically obtains the fastest available code for that CPU.
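Concretely, the selection loop can be very small. Here is a standalone sketch of the mechanism (all names and the trivial stand-in "permutation" bodies are invented for illustration; in-tree, the closest existing analogue is raid6_select_algo() in lib/raid6/algos.c):

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical registry of candidate implementations of one primitive. */
struct chacha20_impl {
	const char *name;
	void (*permute)(uint32_t state[16]);
};

static void chacha20_generic(uint32_t state[16])
{
	/* stand-in for a real permutation: a trivial mixing loop */
	for (int i = 0; i < 16; i++)
		state[i] = (state[i] << 7) ^ state[(i + 1) % 16];
}

static void chacha20_unrolled(uint32_t state[16])
{
	chacha20_generic(state);	/* pretend this is an optimized variant */
}

static const struct chacha20_impl impls[] = {
	{ "generic",  chacha20_generic  },
	{ "unrolled", chacha20_unrolled },
};

/* Run once per core as it comes online; cache the winner for that core. */
static const struct chacha20_impl *select_fastest(void)
{
	const struct chacha20_impl *best = NULL;
	clock_t best_time = 0;
	uint32_t state[16] = { 1 };

	for (size_t i = 0; i < sizeof(impls) / sizeof(impls[0]); i++) {
		clock_t t0 = clock();

		for (int rep = 0; rep < 100000; rep++)
			impls[i].permute(state);

		clock_t dt = clock() - t0;
		if (!best || dt < best_time) {
			best = &impls[i];
			best_time = dt;
		}
	}
	return best;
}
```

With a generic registry like this, adding a new implementation is one array entry; no per-CPU dispatch tables need to be written or maintained by hand.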

The only cost is a moment of benchmarking at boot time. _If_ this is a
noticeable cost then there are many ways to speed it up: for example,
automatically copy the results across identical cores, automatically
copy the results across boots if the cores are unchanged, automatically
copy results from a central database indexed by CPU identifiers, etc.
The SUPERCOP database is evolving towards enabling this type of sharing.

> A lot of code can be shared, but in practice different environments have
> different constraints, and kernel programming in particular has some distinct
> differences from userspace programming.  For example, you cannot just use the
> FPU (including SSE, AVX, NEON, etc.) registers whenever you want to, since on
> most architectures they can't be used in some contexts such as hardirq context,
> and even when they *can* be used you have to run special code before and after
> which does things like saving all the FPU registers to the task_struct,
> disabling preemption, and/or enabling the FPU.

Is there some reason that each implementor is being pestered to handle
all this? Detecting FPU usage is a simple static-analysis exercise, and
the rest sounds like straightforward boilerplate that should be handled
centrally.
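To illustrate, the centrally handled boilerplate could reduce to one dispatcher that individual implementors never see. This is a self-contained sketch: simd_usable()/simd_begin()/simd_end() are stubs standing in for the kernel's irq_fpu_usable()/kernel_fpu_begin()/kernel_fpu_end() (or arch equivalents), and the hash bodies are placeholders:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static bool simd_usable(void) { return true; } /* stub: e.g. irq_fpu_usable() */
static void simd_begin(void)  { }              /* stub: save FPU state, disable preemption */
static void simd_end(void)    { }              /* stub: restore FPU state */

static void hash_simd(const uint8_t *d, size_t n, uint8_t out[32])
{
	(void)d; (void)n;
	for (size_t i = 0; i < 32; i++)
		out[i] = 1;	/* placeholder for a vectorized hash */
}

static void hash_generic(const uint8_t *d, size_t n, uint8_t out[32])
{
	(void)d; (void)n;
	for (size_t i = 0; i < 32; i++)
		out[i] = 1;	/* placeholder for a portable hash */
}

/* The central dispatcher: callers and implementations alike see none
 * of the FPU bookkeeping. */
static void hash(const uint8_t *data, size_t len, uint8_t out[32])
{
	if (simd_usable()) {
		simd_begin();
		hash_simd(data, len, out);
		simd_end();
	} else {
		hash_generic(data, len, out);
	}
}
```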

> But disabling preemption for
> long periods of time hurts responsiveness, so it's also desirable to yield the
> processor occasionally, which means that assembly implementations should be
> incremental rather than having a single entry point that does everything.

Doing this rewrite automatically is a bit more of a code-analysis
challenge, but the alternative approach of doing it by hand is insanely
error-prone. See, e.g., https://eprint.iacr.org/2017/891.
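The hand-written shape under discussion is roughly the following chunked loop (a sketch: simd_begin()/simd_end() are stubs for kernel_fpu_begin()/kernel_fpu_end(), the keystream is a placeholder, and the 4 KiB bound is illustrative):

```c
#include <stddef.h>
#include <stdint.h>

#define CHUNK_MAX 4096u		/* illustrative per-section bound */

static void simd_begin(void) { }	/* stub: kernel_fpu_begin() */
static void simd_end(void)   { }	/* stub: kernel_fpu_end() */

static void stream_xor_chunk(uint8_t *dst, const uint8_t *src, size_t n)
{
	for (size_t i = 0; i < n; i++)
		dst[i] = src[i] ^ 0x5a;	/* placeholder keystream byte */
}

/* Process at most one bounded chunk per begin/end section, ending the
 * SIMD section between chunks so the scheduler gets a preemption point. */
static void stream_xor(uint8_t *dst, const uint8_t *src, size_t len)
{
	while (len) {
		size_t n = len > CHUNK_MAX ? CHUNK_MAX : len;

		simd_begin();
		stream_xor_chunk(dst, src, n);
		simd_end();	/* preemption point between chunks */

		dst += n;
		src += n;
		len -= n;
	}
}
```

Every implementation that gets this loop wrong in a slightly different way is a separate opportunity for the kind of bug catalogued in the paper above, which is exactly the argument for generating or centralizing it.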

> Many people may have contributed to SUPERCOP already, but that doesn't mean
> there aren't things you could do to make it more appealing to contributors and
> more of a community project,

The logic in this sentence is impeccable, and is already illustrated by
many SUPERCOP improvements through the years from an increasing number
of contributors, as summarized in the 87 release announcements so far on
the relevant public mailing list, which you're welcome to study in
detail along with the 400 megabytes of current code and as many previous
versions as you're interested in. That's also the mailing list where
people are told to send patches, as you'll see if you RTFM.

> So Linux distributions may not want to take on the legal risk of
> distributing it

This is a puzzling comment. A moment ago we were talking about the
possibility of useful sharing of (e.g.) ChaCha20 implementations between
SUPERCOP and the Linux kernel, avoiding pointless fracturing of the
community's development process for these implementations. This doesn't
mean that the kernel should be grabbing implementations willy-nilly from
SUPERCOP---surely the kernel should be doing security audits, and the
kernel already has various coding requirements, and the kernel requires
GPL compatibility, while putting any of these requirements into SUPERCOP
would be counterproductive.

If you mean having the entire SUPERCOP benchmarking package distributed
through Linux distributions, I have no idea what your motivation is or
how this is supposed to be connected to anything else we're discussing.
Obviously SUPERCOP's broad code-inclusion policies make this idea a
non-starter.

> nor may companies want to take on the risk of contributing.

RTFM. People who submit code are authorizing public redistribution for
benchmarking. It's up to them to decide if they want to allow more.

---Dan



* Re: [PATCH v1 2/3] zinc: Introduce minimal cryptography library
  2018-08-17  7:31             ` D. J. Bernstein
@ 2018-08-18  8:13               ` Ard Biesheuvel
  0 siblings, 0 replies; 20+ messages in thread
From: Ard Biesheuvel @ 2018-08-18  8:13 UTC (permalink / raw)
  To: D. J. Bernstein
  Cc: Eric Biggers, Jason A. Donenfeld, Eric Biggers,
	Linux Crypto Mailing List, LKML, Netdev, David Miller,
	Andrew Lutomirski, Greg Kroah-Hartman, Samuel Neves, Tanja Lange,
	Jean-Philippe Aumasson, Karthikeyan Bhargavan



> On 17 Aug 2018, at 10:31, D. J. Bernstein <djb@cr.yp.to> wrote:
> 
> Eric Biggers writes:
>> If (more likely) you're talking about things like "use this NEON implementation
>> on Cortex-A7 but this other NEON implementation on Cortex-A53", it's up to the
>> developers and community to test different CPUs and make appropriate decisions,
>> and yes it can be very useful to have external benchmarks like SUPERCOP to refer
>> to, and I appreciate your work in that area.
> 
> You seem to be talking about a process that selects (e.g.) ChaCha20
> implementations as follows: manually inspect benchmarks of various
> implementations on various CPUs, manually write code to map CPUs to
> implementations, manually update the code as necessary for new CPUs, and
> of course manually do the same for every other primitive that can see
> differences between microarchitectures (which isn't something weird---
> it's the normal situation after enough optimization effort).
> 
> This is quite a bit of manual work, so the kernel often doesn't do it,
> so we end up with unhappy people talking about performance regressions.
> 
> For comparison, imagine one simple central piece of code in the kernel
> to automatically do the following:
> 
>   When a CPU core is booted:
>     For each primitive:
>       Benchmark all implementations of the primitive on the core.
>       Select the fastest for subsequent use on the core.
> 
> If this is a general-purpose mechanism (as in SUPERCOP, NaCl, and
> libpqcrypto) rather than something ad-hoc (as in raid6), then there's no
> manual work per primitive, and no work per implementation. Each CPU, old
> or new, automatically obtains the fastest available code for that CPU.
> 
> The only cost is a moment of benchmarking at boot time. _If_ this is a
> noticeable cost then there are many ways to speed it up: for example,
> automatically copy the results across identical cores, automatically
> copy the results across boots if the cores are unchanged, automatically
> copy results from a central database indexed by CPU identifiers, etc.
> The SUPERCOP database is evolving towards enabling this type of sharing.
> 

‘Fastest’ does not imply ‘preferred’. For instance, the table-based, cache-thrashing generic AES implementation may be fast, but it may put a disproportionate load on, e.g., a hyperthreading system, and, as you have pointed out yourself, it is time-variant as well. Then there is the power-consumption aspect: NEON bit-sliced AES may be faster, but it does a lot more work, and does it on the SIMD unit, which could potentially be turned off entirely otherwise. Only implementations based on h/w instructions can generally be assumed optimal in all senses, and there is no real point in benchmarking those against pure software implementations.

Then there is the aspect of accelerators: the kernel’s crypto API seamlessly supports crypto peripherals, which may be slower or faster, may have more or fewer queues than the number of CPUs, and may offer additional benefits such as protected AES keys, etc.

In the Linux kernel, we generally try to stay away from policy decisions and instead offer the controls that allow userland to take charge of this. The modularized crypto code can be blacklisted per algorithm implementation if desired, and beyond that, we simply try to offer functionality that covers the common case.

>> A lot of code can be shared, but in practice different environments have
>> different constraints, and kernel programming in particular has some distinct
>> differences from userspace programming.  For example, you cannot just use the
>> FPU (including SSE, AVX, NEON, etc.) registers whenever you want to, since on
>> most architectures they can't be used in some contexts such as hardirq context,
>> and even when they *can* be used you have to run special code before and after
>> which does things like saving all the FPU registers to the task_struct,
>> disabling preemption, and/or enabling the FPU.
> 
> Is there some reason that each implementor is being pestered to handle
> all this? Detecting FPU usage is a simple static-analysis exercise, and
> the rest sounds like straightforward boilerplate that should be handled
> centrally.
> 

Detecting it is easy, but that does not mean you can use SIMD in any context, and whether a certain function may ever be called from such a context cannot be decided by static analysis. Also, there are performance and latency concerns which need to be taken into account.

In the kernel, we simply cannot write our algorithm as if our code is the only thing running on the system.

>> But disabling preemption for
>> long periods of time hurts responsiveness, so it's also desirable to yield the
>> processor occasionally, which means that assembly implementations should be
>> incremental rather than having a single entry point that does everything.
> 
> Doing this rewrite automatically is a bit more of a code-analysis
> challenge, but the alternative approach of doing it by hand is insanely
> error-prone. See, e.g., https://eprint.iacr.org/2017/891.
> 
>> Many people may have contributed to SUPERCOP already, but that doesn't mean
>> there aren't things you could do to make it more appealing to contributors and
>> more of a community project,
> 
> The logic in this sentence is impeccable, and is already illustrated by
> many SUPERCOP improvements through the years from an increasing number
> of contributors, as summarized in the 87 release announcements so far on
> the relevant public mailing list, which you're welcome to study in
> detail along with the 400 megabytes of current code and as many previous
> versions as you're interested in. That's also the mailing list where
> people are told to send patches, as you'll see if you RTFM.
> 
>> So Linux distributions may not want to take on the legal risk of
>> distributing it
> 
> This is a puzzling comment. A moment ago we were talking about the
> possibility of useful sharing of (e.g.) ChaCha20 implementations between
> SUPERCOP and the Linux kernel, avoiding pointless fracturing of the
> community's development process for these implementations. This doesn't
> mean that the kernel should be grabbing implementations willy-nilly from
> SUPERCOP---surely the kernel should be doing security audits, and the
> kernel already has various coding requirements, and the kernel requires
> GPL compatibility, while putting any of these requirements into SUPERCOP
> would be counterproductive.
> 
> If you mean having the entire SUPERCOP benchmarking package distributed
> through Linux distributions, I have no idea what your motivation is or
> how this is supposed to be connected to anything else we're discussing.
> Obviously SUPERCOP's broad code-inclusion policies make this idea a
> non-starter.
> 
>> nor may companies want to take on the risk of contributing.
> 
> RTFM. People who submit code are authorizing public redistribution for
> benchmarking. It's up to them to decide if they want to allow more.
> 
> ---Dan


end of thread

Thread overview: 20+ messages
2018-08-01  7:22 [PATCH v1 2/3] zinc: Introduce minimal cryptography library Eric Biggers
2018-08-01 17:02 ` Andy Lutomirski
2018-08-03  2:48   ` Jason A. Donenfeld
2018-08-03 21:29     ` Andy Lutomirski
2018-08-03 22:10       ` Jason A. Donenfeld
2018-08-07 18:54         ` Jason A. Donenfeld
2018-08-07 19:43           ` Andy Lutomirski
2018-08-07 23:48             ` Jason A. Donenfeld
2018-08-08  1:48               ` Andy Lutomirski
2018-08-08  1:51                 ` Jason A. Donenfeld
2018-08-09 18:08                   ` Andy Lutomirski
2018-08-03  2:33 ` Jason A. Donenfeld
2018-08-14 21:12   ` Eric Biggers
2018-08-15 16:28     ` D. J. Bernstein
2018-08-15 19:57       ` Eric Biggers
2018-08-16  4:24         ` D. J. Bernstein
2018-08-16 19:46           ` Eric Biggers
2018-08-17  7:31             ` D. J. Bernstein
2018-08-18  8:13               ` Ard Biesheuvel
2018-08-16  6:31     ` Jason A. Donenfeld
