[+Cc linux-crypto]

Hi Jason,

Apologies for starting a new thread, but this patch apparently wasn't Cc'ed to linux-crypto, despite adding over 24000 lines of crypto code. So much for WireGuard being only 4000 lines :-) (For anyone else like me who didn't receive the patch, it can be found at https://patchwork.ozlabs.org/patch/951763/) I have some preliminary comments below.

On Tue, 31 Jul 2018 21:11:01 +0200, Jason A. Donenfeld wrote:
> [PATCH v1 2/3] zinc: Introduce minimal cryptography library
>
> Zinc stands for "Zinc Is Not crypto/". It's also short, easy to type, and plays nicely with the recent trend of naming crypto libraries after elements. The guiding principle is "don't overdo it". It's less of a library and more of a directory tree for organizing well-curated direct implementations of cryptography primitives.
>
> Zinc is a new cryptography API that is much more minimal and lower-level than the current one. It intends to complement it and provide a basis upon which the current crypto API might build, and perhaps someday Zinc may altogether supplant the current crypto API. It is motivated by three primary observations in crypto API design:
>
> * Highly composable "cipher modes" and related abstractions from 90s cryptographers did not turn out to be as terrific an idea as hoped, leading to a host of API misuse problems.
>
> * Most programmers are afraid of crypto code, and so prefer to integrate it into libraries in a highly abstracted manner, so as to shield themselves from implementation details. Cryptographers, on the other hand, prefer simple direct implementations, which they're able to verify for high assurance and optimize in accordance with their expertise.
>
> * Overly abstracted and flexible cryptography APIs lead to a host of dangerous problems and performance issues.
> The kernel is usually in the business not of coming up with new uses of crypto, but rather of implementing various constructions, which means it essentially needs a library of primitives, not a highly abstracted enterprise-ready pluggable system, with a few particular exceptions.
>
> This last observation has played out over and over again within the kernel:
>
> * The perennial move of actual primitives away from crypto/ and into lib/, so that users can actually call these functions directly with no overhead and without lots of allocations, function pointers, string specifier parsing, and general clunkiness. For example: sha256, chacha20, siphash, sha1, and so forth live in lib/ rather than in crypto/. Zinc intends to stop the cluttering of lib/ and introduce these direct primitives into their proper place, lib/zinc/.
>
> * An abundance of misuse bugs with the present crypto API that have been very unpleasant to clean up.
>
> * A hesitance to even use cryptography, because of the overhead and headaches involved in accessing the routines.
>
> Zinc goes in a rather different direction. Rather than providing a thoroughly designed and abstracted API, Zinc gives you simple functions, which implement some primitive, or some particular and specific construction of primitives. It is not dynamic in the least, though one could imagine implementing a complex dynamic dispatch mechanism (such as the current crypto API) on top of these basic functions. After all, dynamic dispatch is usually needed for applications with cipher agility, such as IPsec, dm-crypt, AF_ALG, and so forth, and the existing crypto API will continue to play that role. However, Zinc will provide a non-haphazard way of directly utilizing crypto routines in applications that have neither need nor desire for abstraction and dynamic dispatch.
>
> It also organizes the implementations in a simple, straightforward, and direct manner, making it enjoyable and intuitive to work on. Rather than moving optimized assembly implementations into arch/, it keeps them all together in lib/zinc/, making it simple and obvious to compare and contrast what's happening. This is, notably, exactly what the lib/raid6/ tree does, and that seems to work out rather well. It's also the pattern of most successful crypto libraries. Likewise, the cascade of architecture-specific functions is done as ifdefs within one file, so that it's easy and obvious and clear what's happening, how each architecture differs, and how to optimize for shared patterns. This is very much preferable to architecture-specific file splitting.
>
> All implementations have been extensively tested and fuzzed, and are selected for their quality, trustworthiness, and performance. Wherever possible and performant, formally verified implementations are used, such as those from HACL* [1] and Fiat-Crypto [2]. The routines also take special care to zero out secrets using memzero_explicit (and future work is planned to have gcc do this more reliably and performantly with compiler plugins). The performance of the selected implementations is state-of-the-art and unrivaled. Each implementation also comes with extensive self-tests and crafted test vectors, pulled from various places such as Wycheproof [9].
>
> Regularity of function signatures is important, so that users can easily "guess" the name of the function they want. Individual primitives, though, are oftentimes not trivially interchangeable, having been designed for different things and requiring different parameters and semantics, and so the function signatures they provide will directly reflect the realities of the primitives' usages, rather than hiding them behind (inevitably leaky) abstractions.
> Also, in contrast to the current crypto API, Zinc functions can work on stack buffers, and can be called with different keys, without requiring allocations or locking.
>
> SIMD is used automatically when available, though some routines may benefit either from having their SIMD disabled for particular invocations, or from having the SIMD initialization calls amortized over several invocations of the function, and so Zinc provides helpers and function signatures enabling that.
>
> More generally, Zinc provides function signatures that allow just what is required by the various callers. This isn't to say that users of the functions will be permitted to pollute the function semantics with weird particular needs, but we are trying very hard not to overdo it, and that means looking carefully at what's actually necessary, and doing just that, and not much more than that. Remember: practicality and cleanliness rather than over-zealous infrastructure.
>
> Zinc also provides an opening for the best implementers in academia to contribute their time and effort to the kernel, by being sufficiently simple and inviting. In discussing this commit with some of the best and brightest over the last few years, there are many who are eager to devote rare talent and energy to this effort.
>
> This initial commit adds implementations of the primitives used by WireGuard:
>
> * Curve25519 [3]: formally verified 64-bit C, formally verified 32-bit C, x86_64 BMI2, x86_64 ADX, ARM NEON.
>
> * ChaCha20 [4]: generic C, x86_64 SSSE3, x86_64 AVX-2, x86_64 AVX-512F, x86_64 AVX-512VL, ARM NEON, ARM64 NEON, MIPS.
>
> * HChaCha20 [5]: generic C, x86_64 SSSE3.
>
> * Poly1305 [6]: generic C, x86_64, x86_64 AVX, x86_64 AVX-2, x86_64 AVX-512F, ARM NEON, ARM64 NEON, MIPS, MIPS64.
>
> * BLAKE2s [7]: generic C, x86_64 AVX, x86_64 AVX-512VL.
>
> * ChaCha20Poly1305 [8]: generic C construction for both full buffers and scatter gather.
>
> * XChaCha20Poly1305 [5]: generic C construction.
>
> Following the merging of this, I first expect the primitives that currently exist in lib/ to work their way into lib/zinc/, after intense scrutiny of each implementation, potentially replacing them with either formally-verified implementations, or better studied and faster state-of-the-art implementations. In a phase after that, I envision that certain instances from crypto/ will want to rebase themselves to simply be abstracted crypto API wrappers using the lower level Zinc functions. This is already what various crypto/ implementations do with the existing code in lib/.
>
> Currently Zinc exists as a single un-menued option, CONFIG_ZINC, but as this grows we will inevitably want to make that more granular. This will happen at the appropriate time, rather than doing so prematurely. There is also a CONFIG_ZINC_DEBUG menued option, performing several intense tests at startup and enabling various BUG_ONs.
>
> [1] https://github.com/project-everest/hacl-star
> [2] https://github.com/mit-plv/fiat-crypto
> [3] https://cr.yp.to/ecdh.html
> [4] https://cr.yp.to/chacha.html
> [5] https://cr.yp.to/snuffle/xsalsa-20081128.pdf
> [6] https://cr.yp.to/mac.html
> [7] https://blake2.net/
> [8] https://tools.ietf.org/html/rfc8439
> [9] https://github.com/google/wycheproof
>
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Greg KH <gregkh@linuxfoundation.org>
> Cc: Samuel Neves <sneves@dei.uc.pt>
> Cc: D. J. Bernstein <djb@cr.yp.to>
> Cc: Tanja Lange <tanja@hyperelliptic.org>
> Cc: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com>
> Cc: Karthikeyan Bhargavan <karthik.bhargavan@gmail.com>
> ---
>  MAINTAINERS                             |    7 +
>  include/zinc/blake2s.h                  |   94 +
>  include/zinc/chacha20.h                 |   46 +
>  include/zinc/chacha20poly1305.h         |   51 +
>  include/zinc/curve25519.h               |   25 +
>  include/zinc/poly1305.h                 |   34 +
>  include/zinc/simd.h                     |   60 +
>  lib/Kconfig                             |   21 +
>  lib/Makefile                            |    2 +
>  lib/zinc/Makefile                       |   28 +
>  lib/zinc/blake2s/blake2s-x86_64.S       |  685 ++++++
>  lib/zinc/blake2s/blake2s.c              |  292 +++
>  lib/zinc/chacha20/chacha20-arm.S        | 1471 ++++++++++++
>  lib/zinc/chacha20/chacha20-arm64.S      | 1940 ++++++++++++++++
>  lib/zinc/chacha20/chacha20-mips.S       |  474 ++++
>  lib/zinc/chacha20/chacha20-x86_64.S     | 2630 +++++++++++++++++++++
>  lib/zinc/chacha20/chacha20.c            |  242 ++
>  lib/zinc/chacha20poly1305.c             |  286 +++
>  lib/zinc/curve25519/curve25519-arm.S    | 2110 +++++++++++++++++
>  lib/zinc/curve25519/curve25519-arm.h    |   14 +
>  lib/zinc/curve25519/curve25519-fiat32.h |  838 +++++++
>  lib/zinc/curve25519/curve25519-hacl64.h |  751 ++++++
>  lib/zinc/curve25519/curve25519-x86_64.h | 2060 +++++++++++++++++
>  lib/zinc/curve25519/curve25519.c        |   86 +
>  lib/zinc/main.c                         |   36 +
>  lib/zinc/poly1305/poly1305-arm.S        | 1115 +++++++++
>  lib/zinc/poly1305/poly1305-arm64.S      |  820 +++++++
>  lib/zinc/poly1305/poly1305-mips.S       |  417 ++++
>  lib/zinc/poly1305/poly1305-mips64.S     |  357 +++
>  lib/zinc/poly1305/poly1305-x86_64.S     | 2790 +++++++++++++++++++++++
>  lib/zinc/poly1305/poly1305.c            |  377 +++
>  lib/zinc/selftest/blake2s.h             |  559 +++++
>  lib/zinc/selftest/chacha20poly1305.h    | 1559 +++++++++++++
>  lib/zinc/selftest/curve25519.h          |  607 +++++
>  lib/zinc/selftest/poly1305.h            | 1568 +++++++++++++
>  35 files changed, 24452 insertions(+)
>  create mode 100644 include/zinc/blake2s.h
>  create mode 100644 include/zinc/chacha20.h
>  create mode 100644 include/zinc/chacha20poly1305.h
>  create mode 100644 include/zinc/curve25519.h
>  create mode 100644 include/zinc/poly1305.h
>  create mode 100644 include/zinc/simd.h
>  create mode 100644 lib/zinc/Makefile
>  create mode 100644 lib/zinc/blake2s/blake2s-x86_64.S
>  create mode 100644 lib/zinc/blake2s/blake2s.c
>  create mode 100644 lib/zinc/chacha20/chacha20-arm.S
>  create mode 100644 lib/zinc/chacha20/chacha20-arm64.S
>  create mode 100644 lib/zinc/chacha20/chacha20-mips.S
>  create mode 100644 lib/zinc/chacha20/chacha20-x86_64.S
>  create mode 100644 lib/zinc/chacha20/chacha20.c
>  create mode 100644 lib/zinc/chacha20poly1305.c
>  create mode 100644 lib/zinc/curve25519/curve25519-arm.S
>  create mode 100644 lib/zinc/curve25519/curve25519-arm.h
>  create mode 100644 lib/zinc/curve25519/curve25519-fiat32.h
>  create mode 100644 lib/zinc/curve25519/curve25519-hacl64.h
>  create mode 100644 lib/zinc/curve25519/curve25519-x86_64.h
>  create mode 100644 lib/zinc/curve25519/curve25519.c
>  create mode 100644 lib/zinc/main.c
>  create mode 100644 lib/zinc/poly1305/poly1305-arm.S
>  create mode 100644 lib/zinc/poly1305/poly1305-arm64.S
>  create mode 100644 lib/zinc/poly1305/poly1305-mips.S
>  create mode 100644 lib/zinc/poly1305/poly1305-mips64.S
>  create mode 100644 lib/zinc/poly1305/poly1305-x86_64.S
>  create mode 100644 lib/zinc/poly1305/poly1305.c
>  create mode 100644 lib/zinc/selftest/blake2s.h
>  create mode 100644 lib/zinc/selftest/chacha20poly1305.h
>  create mode 100644 lib/zinc/selftest/curve25519.h
>  create mode 100644 lib/zinc/selftest/poly1305.h

[...]

In general this is great work, and I'm very excited for WireGuard to be upstreamed! But for the new crypto code, I think a few things are on the wrong track, for example treating it as a special library. Even the name contradicts itself: Zinc is "not crypto/", yet as you stated it's intended that the "Zinc" algorithms will be exposed through the crypto API -- just like how most of the existing crypto code in lib/ is also exposed through the crypto API.
So, I think that what you're doing isn't actually *that* different from what already exists in some cases; and pretending that it is very different is just going to cause problems. Rather, the actual truly new thing seems to be that the dispatch to architecture specific implementations is done at the lib/ level instead of handled by the crypto API priority numbers. So, I don't see why you don't just add lib/blake2s/, lib/chacha20/, lib/poly1305/, etc., without pretending that they all have some special new "Zinc" thing in common and are part of some holy crusade against the crypto API. They could even still go in subdirectory lib/crypto/ -- but just for logical code organization purposes, as opposed to a special library with a name that isn't self-explanatory and sounds like some third-party library rather than first-class kernel code. CONFIG_ZINC also needs to go. Algorithms will need to be independently configurable as soon as anything other than WireGuard needs to use any of them, so you might as well do it right from the start with CONFIG_BLAKE2, CONFIG_CHACHA20, CONFIG_POLY1305, etc. I think the above changes would also naturally lead to a much saner patch series where each algorithm is added by its own patch, rather than one monster patch that adds many algorithms and 24000 lines of code. Note that adding all the algorithms in one patch also makes the description of them conflated, e.g. you wrote that "formally verified" implementations were used whenever possible, but AFAICS that actually only applies to the C implementations of Curve25519, and even those have your copyright statement so presumably you had to change something from the "formally verified" code :-). Note also that Poly1305 implementations are somewhat error-prone, since there can be overflow bugs that are extremely rarely hit; see e.g. how OpenSSL's Poly1305 NEON implementation was initially buggy and had to be fixed: https://mta.openssl.org/pipermail/openssl-commits/2016-April/006639.html. 
Not to mention that C glue code is error-prone, especially with the tons of #ifdefs. So, I'd strongly prefer that you don't oversell the crypto code you're adding, e.g. by implying that most of it is formally verified, as it likely still has bugs, like any other code... I also want to compare the performance of some of the assembly implementations you're adding to the existing implementations in the kernel they duplicate. I'm especially interested in the NEON implementation of ChaCha20. But adding 11 implementations in one single patch means there isn't really a chance to comment on them individually. Also, earlier when I tested OpenSSL's ChaCha NEON implementation on ARM Cortex-A7 it was actually quite a bit slower than the one in the Linux kernel written by Ard Biesheuvel... I trust that when claiming the performance of all implementations you're adding is "state-of-the-art and unrivaled", you actually compared them to the ones already in the Linux kernel which you're advocating replacing, right? :-) Your patch description is also missing any mention of crypto accelerator hardware. Quite a bit of the complexity in the crypto API, such as scatterlist support and asynchronous execution, exists because it supports crypto accelerators. AFAICS your new APIs cannot support crypto accelerators, as your APIs are synchronous and operate on virtual addresses. I assume your justification is that "djb algorithms" like ChaCha and Poly1305 don't need crypto accelerators as they are fast in software. But you never explicitly stated this and discussed the tradeoffs. Since this is basically the foundation for the design you've chosen, it really needs to be addressed. As for doing the architecture-specific dispatch in lib/ rather than through the crypto API, there definitely are some arguments in favor of it. The main problem, though, is that your code is a mess due to all the #ifdefs, and it will only get worse as people add more architectures. 
You may think you already added all the architectures that matter, but tomorrow people will come and want to add PowerPC, RISC-V, etc. I really think you should consider splitting up implementations by architecture; this would *not*, however, preclude the implementations from still being accessed through a single top-level "glue" function. For example chacha20() could look like:

	void chacha20(struct chacha20_ctx *state, u8 *dst, const u8 *src, u32 len,
		      bool have_simd)
	{
		if (chacha20_arch(dst, src, len, state->key, state->counter, have_simd))
			goto out;

		chacha20_generic(dst, src, len, state->key, state->counter);

	out:
		state->counter[0] += (len + 63) / 64;
	}

So, each architecture would optionally define its own chacha20_arch() that returns true if the data was processed, or false if not. (The data wouldn't be processed if, for example, 'have_simd' was false but only SIMD implementations are available; or if the input was too short for an SIMD implementation to be faster than the generic one.) Note that this would make the code much more consistent with the usual Linux kernel coding style, which strongly prefers calling functions unconditionally rather than having core logic littered with unmaintainable #ifdefs.

I'd also strongly prefer the patchset to include converting the crypto API versions of ChaCha20 and Poly1305 over to use your new lib/ code, to show that it's really possible. You mentioned that it's planned, but if it's not done right away there will be things that were missed and will require changes when someone finally does it. IMO it's not acceptable to add your own completely separate ChaCha20 and Poly1305 just because you don't like that the existing ones are part of the crypto API. You need to refactor things properly. I think you'd need to expose the new code under a cra_driver_name like "chacha20-software" and "poly1305-software" to reflect that they use the fastest available implementation of the algorithm on the CPU, e.g.
"chacha20-generic", "chacha20-neon", and "chacha20-simd" would all be replaced by a single "chacha20-software". Is that what you had in mind? I'm also wondering about the origin and licensing of some of the assembly language files. Many have an OpenSSL copyright statement. But, the OpenSSL license is often thought to be incompatible with GPL, so using OpenSSL assembly code in the kernel has in the past required getting special permission from Andy Polyakov (the person who's written most of OpenSSL's assembly code so holds the copyright on it). As one example, see arch/arm/crypto/sha256-armv4.pl: the file explicitly states that Andy has relicensed it under GPLv2. For your new OpenSSL-derived files, have you gone through and explicitly gotten GPLv2 permission from Andy / the copyright holders? Each assembly language file should also explicitly state where it came from. For example lib/zinc/curve25519/curve25519-arm.S has your copyright statement and says it's "Based on algorithms from Daniel J. Bernstein and Peter Schwabe.", but it's not clarified whether the *code* was written by those other people, as opposed to those people designing the *algorithms* and then you writing the code; and if you didn't write it, where you retrieved the file from and when, what license it had (even if it was "public domain" like some of djb's code, this should be mentioned for informational purposes), and what changes you made if any. Oh, and please use 80-character lines like the rest of the kernel, so that people's eyes don't bleed when reading your code :-) Thanks for all your hard work on WireGuard! - Eric
[I reordered the snippets below for readability.] > On Aug 1, 2018, at 12:22 AM, Eric Biggers <ebiggers3@gmail.com> wrote: > In general this is great work, and I'm very excited for WireGuard to be > upstreamed! But for the new crypto code, I think a few things are on > the wrong track, for example treating it is a special library. Even the > name is contradicting itself: Zinc is "not crypto/", yet as you stated > it's intended that the "Zinc" algorithms will be exposed through the > crypto API -- just like how most of the existing crypto code in lib/ is > also exposed through the crypto API. So, I think that what you're doing > isn't actually *that* different from what already exists in some cases; > and pretending that it is very different is just going to cause > problems. Rather, the actual truly new thing seems to be that the > dispatch to architecture specific implementations is done at the lib/ > level instead of handled by the crypto API priority numbers. ... > > > I think the above changes would also naturally lead to a much saner > patch series where each algorithm is added by its own patch, rather than > one monster patch that adds many algorithms and 24000 lines of code. > Yes, please. > > As for doing the architecture-specific dispatch in lib/ rather than > through the crypto API, there definitely are some arguments in favor of > it. The main problem, though, is that your code is a mess due to all > the #ifdefs, and it will only get worse as people add more > architectures. You may think you already added all the architectures > that matter, but tomorrow people will come and want to add PowerPC, > RISC-V, etc. I really think you should consider splitting up > implementations by architecture; this would *not*, however, preclude the > implementations from still being accessed through a single top-level > "glue" function. 
For example chacha20() could look like:
>
> 	void chacha20(struct chacha20_ctx *state, u8 *dst, const u8 *src, u32 len,
> 		      bool have_simd)
> 	{
> 		if (chacha20_arch(dst, src, len, state->key, state->counter, have_simd))
> 			goto out;
>
> 		chacha20_generic(dst, src, len, state->key, state->counter);
>
> 	out:
> 		state->counter[0] += (len + 63) / 64;
> 	}

I like this a *lot*. (But why are you passing have_simd? Shouldn't that check live in chacha20_arch()? If there's some init code needed, then chacha20_arch() should just return false before the init code runs. Once the arch does whatever feature detection it needs, it can make chacha20_arch() start returning true.)

As I see it, there are two truly new things in the zinc patchset: the direct (in the direct-call sense) arch dispatch, and the fact that the functions can be called directly, without allocating contexts, using function pointers, etc. In fact, I had a previous patch set that added such an interface for SHA256.

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=crypto/sha256_bpf&id=8c59a4dd8b7ba4f2e5a6461132bbd16c83ff7c1f
https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=crypto/sha256_bpf&id=7e5fbc02972b03727b71bc71f84175c36cbf01f5

> Your patch description is also missing any mention of crypto accelerator hardware. Quite a bit of the complexity in the crypto API, such as scatterlist support and asynchronous execution, exists because it supports crypto accelerators. AFAICS your new APIs cannot support crypto accelerators, as your APIs are synchronous and operate on virtual addresses. I assume your justification is that "djb algorithms" like ChaCha and Poly1305 don't need crypto accelerators as they are fast in software. But you never explicitly stated this and discussed the tradeoffs. Since this is basically the foundation for the design you've chosen, it really needs to be addressed.

I see this as an advantage, not a disadvantage.
A very large majority of in-kernel crypto users (by number of call sites under a *very* brief survey, not by number of CPU cycles) just want to do some synchronous crypto on a buffer that is addressed by a regular pointer. Most of these users would be slowed down if they used any form of async crypto, since the CPU can complete the whole operation faster than it could plausibly initiate and complete anything asynchronous. And, right now, they suffer the full overhead of allocating a context (often with alloca!), looking up (or caching) some crypto API data structures, dispatching the operation, and cleaning up. So I think the right way to do it is to have directly callable functions like zinc uses and to have the fancy crypto API layer on top of them. So if you actually want async accelerated crypto with scatterlists or whatever, you can call into the fancy API, and the fancy API can dispatch to hardware or it can dispatch to the normal static API. In fact, this is exactly what my patch above did. But Jason's work is more complete than mine, and mine wasn't really done because it had some configury issues. All that being said, for algorithms where the crypto API already has a reasonable implementation, I think the patch series should first restructure the code (so the actual software implementation is moved away from the crypto API wrappers) and then, if needed, replace the implementation with something better. The latter should be its own patch with its own justification.
Hi Eric,

Thanks for your feedback. Responses below.

On Wed, Aug 1, 2018 at 9:22 AM Eric Biggers <ebiggers3@gmail.com> wrote:
> it's intended that the "Zinc" algorithms will be exposed through the crypto API -- just like how most of the existing crypto code in lib/ is also exposed through the crypto API. So, I think that what you're doing isn't actually *that* different from what already exists in some cases; and pretending that it is very different is just going to cause problems.

I think as an eventual thing, it'd be great to see the whole of the software part of the Crypto API rebased on top of Zinc's software implementations. But the Crypto API is huge, and it's not really realistic to do this all at once. I'd also like to have a bit more scrutiny of each individual Zinc submission, rather than just dumping it all in there wholesale. Additionally, it seems reasonable to do this in accordance with actual need for using the algorithms from Zinc, i.e., not rushing toward DES.

> Rather, the actual truly new thing seems to be that the [...]
> [...] are part of some holy crusade against the crypto API.
> So much for WireGuard being only

I sense a bit of snark in your comments, and I am sorry if I ruffled some feathers. I can understand a number of things might have rubbed you the wrong way. But I am pretty committed to getting this done right, taking into account your and Andy's opinions, as well as working hard to keep Zinc close to its design goals. I think you'll find we're closer to being on the same page than not, and that I only want to make things better rather than be on some kind of "crusade" against anything.

> They could even still go in subdirectory lib/crypto/ -- but just for logical code organization purposes,

Indeed a subdirectory is important for organization. We can bikeshed about names, I suppose. I'm partial to zinc, and I think that it might actually help foster contributors and interested academics.
Anyway, let's dive into the code issues first, though, as I think that'll be a productive way to get going.

> CONFIG_ZINC also needs to go. Algorithms will need to be independently configurable as soon as anything other than WireGuard needs to use any of them, so you might as well do it right from the start with CONFIG_BLAKE2, CONFIG_CHACHA20, CONFIG_POLY1305, etc.

Okay, I can add this all as individual knobs. I'll do that for v2. I'm tempted to still have this built as one module, though. For example, should WireGuard depend on 5 mini-modules (blake2s, chacha20, poly1305, chacha20poly1305, curve25519)? Rather, it seems a bit sleeker to just have a single zinc.ko, and allow what's built into it to be configurable with the various algorithm configuration options. What would you think of this? Or do you think you'd still prefer a one-algo-per-ko approach like with the current crypto API? I can see good arguments both ways.

> I think the above changes would also naturally lead to a much saner patch series where each algorithm is added by its own patch, rather than one monster patch that adds many algorithms and 24000 lines of code.

Yes, absolutely! And sorry for that monster patch; it even got rejected from LKML as you pointed out. I had started to split things up, but I wanted to send the patchset foremost to netdev, to start getting some feedback on WireGuard, and wanted to keep the organization simple. I'll have this split up for v2.

> Note that adding all the algorithms in one patch also makes the description of them conflated, e.g. you wrote that "formally verified" implementations were used whenever possible, but AFAICS that actually only applies to the C implementations of Curve25519,

Yes, the commit message mentioned which implementations have been verified. It's quite likely that in the near future we'll be adding more formally verified implementations, and even formally verified assembly implementations.
While BLAKE2s and ChaCha20 are somewhat straightforward, the bigint stuff, such as Curve25519 and Poly1305, always makes me a bit nervous, and so we'll probably additionally incorporate a formally verified Poly1305 first of the bunch. Either way, _all_ of those implementations have been reviewed in some depth, fuzzed absurd amounts, and fed some pretty nasty test vectors. I've spent a lot of time with all of them, and most have received a good amount of external scrutiny, due to their origin.

> So, I'd strongly prefer that you don't oversell the crypto code you're adding, e.g. by implying that most of it is formally verified, as it likely still has bugs, like any other code...

This sounds like snark that's going to devolve into endless bikeshedding, but I do note that I mentioned formally verified implementations in these 3 contexts, which I think were appropriate each time:

1) "Wherever possible and performant, formally verified implementations are used, such as those from HACL* and Fiat-Crypto." So, trying to state that we use it when we can, but we can't always. I think it's important to mention it here, as I've already been contacted by formal verification people who are eager to help. This is exactly the kind of positive outcome I was hoping for.

2) "Curve25519: formally verified 64-bit C, formally verified 32-bit C, [...]" Pointing out that these 2 implementations, in contradistinction to the subsequent ones listed, are from formally verified C.

3) "potentially replacing them with either formally-verified implementations [...]" Again here, I indicate that I'd like to replace non-formally-verified implementations with formally-verified implementations wherever there's such a potentiality. And like the previous comment, I believe this has been interpreted as a welcome call to action by the people who are actually working on this stuff.

So I don't think I'm "overselling" it. Formal verification is a good thing. We have a little bit of it in Zinc.
I'd like to have a lot more of it. And all the while we're doing the best with what we've got. Generally speaking, I believe the choices I've made on the individual implementations in this submission have been made with a lot of care, and I'm happy to work with you on any concerns you have about them and do what's needed to get them into the proper shape.

> I also want to compare the performance of some of the assembly implementations you're adding to the existing implementations in the kernel they duplicate. I'm especially interested in the NEON implementation of ChaCha20. But adding 11 implementations in one single patch means there isn't really a chance to comment on them individually.

As noted, I'll certainly split this up for v2. With ChaCha20, my benchmarks have indicated Zinc is faster than the Crypto API's implementations. But there's also another benefit: the ChaCha20 and Poly1305 implementations are based on Andy Polyakov's ones, which means they benefit from the *huge* amount of scrutiny those have already received and continue to receive. In fact, there are people working on formally verified reproductions of some of his implementations, and that alone is hugely valuable. The implementations generally found in Linux haven't received nearly as much scrutiny and attention, putting them at a bit of a disadvantage, I think. In other words, there's a strong benefit from working together and sharing in the enterprise with other projects. There's another plus, too, which is that they all have a very common interface, making the glue code simpler and more uniform. This is probably mostly just a benefit for me, as a lazy coder who would prefer an easier job, but usually less complexity and fewer differences result in fewer bugs, too.
> I trust that when claiming the
> performance of all implementations you're adding is "state-of-the-art
> and unrivaled", you actually compared them to the ones already in the
> Linux kernel which you're advocating replacing, right? :-)

Yes, I have, and my results don't corroborate your findings. It will be interesting to get a wider variety of hardware for comparisons. I suspect, also, that if the snarky emoticons subside, AndyP would be very interested in whatever we find and could have interest in improving implementations, should we actually find performance differences.

> Your patch description is also missing any mention of crypto accelerator
> hardware. Quite a bit of the complexity in the crypto API, such as
> scatterlist support and asynchronous execution, exists because it
> supports crypto accelerators.

I can augment that description for v2, if you like. But with regard to crypto accelerators: I think the asynchronous API causes all sorts of problems for some of the most common use cases, and so Zinc is an ordinary, synchronous, software-based API. I think this is an advantage. Of course you can continue to support wild async IPsec accelerator hardware and such in the crypto API, but that's intentionally not the focus of Zinc.

> I assume your justification is that "djb algorithms" like
> ChaCha and Poly1305 don't need crypto accelerators as they are fast in
> software. But you never explicitly stated this and discussed the
> tradeoffs.

It was already a pretty huge wall of text in the monster patch, but even so, there are probably tons of discussions about different details, both large and small, that I left out. Sorry about that. Hopefully these discussions here can help fill them in.

> The main problem, though, is that your code is a mess due to all
> the #ifdefs, and it will only get worse as people add more
> architectures.

I really thought the ifdefs were cleaner than the three other approaches I had implemented for doing this. When I showed this all to Greg KH several months ago, he also thought the ifdefs were disgusting.
It seems you agree with him, and I'll certainly change this for v2. I already have most of that code written, and based on your comments and suggestions below, I'm sure you'll find it much better.

> Note that this
> would make the code much more consistent with the usual Linux kernel
> coding style, which strongly prefers calling functions unconditionally
> rather than having core logic littered with unmaintainable #ifdefs.

Noted. Will be changed.

> I'd also strongly prefer the patchset to include converting the crypto
> API versions of ChaCha20 and Poly1305 over to use your new lib/ code, to
> show that it's really possible. You mentioned that it's planned, but if
> it's not done right away there will be things that were missed and will
> require changes when someone finally does it. IMO it's not acceptable
> to add your own completely separate ChaCha20 and Poly1305 just because
> you don't like that the existing ones are part of the crypto API.

Alright. I didn't really want to do this right off the bat, but you make a good argument that doing so is important for working out the required details and knowing how it's done. So I'll take a stab at it for v2. I'll let you know if I run into any snags; we might be able to collaborate nicely here.

> You
> need to refactor things properly. I think you'd need to expose the new
> code under a cra_driver_name like "chacha20-software" and
> "poly1305-software" to reflect that they use the fastest available
> implementation of the algorithm on the CPU, e.g. "chacha20-generic",
> "chacha20-neon", and "chacha20-simd" would all be replaced by a single
> "chacha20-software".

Is that what you had in mind? Calling it "chacha20-software" sounds like a reasonable way to name it.

> I'm also wondering about the origin and licensing of some of the
> assembly language files. Many have an OpenSSL copyright statement.
> But, the OpenSSL license is often thought to be incompatible with GPL,
> so using OpenSSL assembly code in the kernel has in the past required
> getting special permission from Andy Polyakov (the person who's written
> most of OpenSSL's assembly code so holds the copyright on it). As one
> example, see arch/arm/crypto/sha256-armv4.pl: the file explicitly states
> that Andy has relicensed it under GPLv2. For your new OpenSSL-derived
> files, have you gone through and explicitly gotten GPLv2 permission from
> Andy / the copyright holders?

Andy and I spoke about this in January, and he was okay with the usual CRYPTOGAMS GPLv2 origin dance for that code.

> For example lib/zinc/curve25519/curve25519-arm.S has your
> copyright statement and says it's "Based on algorithms from Daniel J.
> Bernstein and Peter Schwabe.", but it's not clarified whether the *code*
> was written by those other people, as opposed to those people designing
> the *algorithms* and then you writing the code; and if you didn't write
> it, where you retrieved the file from and when, what license it had
> (even if it was "public domain" like some of djb's code, this should be
> mentioned for informational purposes), and what changes you made if any.

It's indeed based on their public domain code. I'll change that wording to make it very clear that it's derived from public domain code.

> Oh, and please use 80-character lines like the rest of the kernel, so
> that people's eyes don't bleed when reading your code :-)

Ack. Will be so in v2.

> Thanks for all your hard work on WireGuard!

Thanks for your excellent review. I really appreciate you taking the time. Hopefully v2 will start looking a lot more interesting to you!

Regards,
Jason
Hey Andy,

Thanks too for the feedback. Responses below:

On Wed, Aug 1, 2018 at 7:09 PM Andy Lutomirski <luto@amacapital.net> wrote:
> > I think the above changes would also naturally lead to a much saner
> > patch series where each algorithm is added by its own patch, rather than
> > one monster patch that adds many algorithms and 24000 lines of code.
>
> Yes, please.

Ack, will be in v2.

> I like this a *lot*. (But why are you passing have_simd? Shouldn't
> that check live in chacha20_arch? If there's some init code needed,
> then chacha20_arch() should just return false before the init code
> runs. Once the arch does whatever feature detection it needs, it can
> make chacha20_arch() start returning true.)

The have_simd stuff is so that the FPU state can be amortized across several calls to the crypto functions. Here's a snippet from WireGuard's send.c:

void packet_encrypt_worker(struct work_struct *work)
{
    struct crypt_queue *queue =
        container_of(work, struct multicore_worker, work)->ptr;
    struct sk_buff *first, *skb, *next;
    bool have_simd = simd_get();

    while ((first = ptr_ring_consume_bh(&queue->ring)) != NULL) {
        enum packet_state state = PACKET_STATE_CRYPTED;

        skb_walk_null_queue_safe(first, skb, next) {
            if (likely(skb_encrypt(skb, PACKET_CB(first)->keypair,
                                   have_simd)))
                skb_reset(skb);
            else {
                state = PACKET_STATE_DEAD;
                break;
            }
        }
        queue_enqueue_per_peer(&PACKET_PEER(first)->tx_queue,
                               first, state);
        have_simd = simd_relax(have_simd);
    }
    simd_put(have_simd);
}

simd_get() and simd_put() do the usual irq_fpu_usable/kernel_fpu_begin dance and return/take a boolean accordingly. simd_relax(on) is:

static inline bool simd_relax(bool was_on)
{
#ifdef CONFIG_PREEMPT
    if (was_on && need_resched()) {
        simd_put(true);
        return simd_get();
    }
#endif
    return was_on;
}

With this, we get the FPU amortization most of the time, while still doing the right thing in the preemption case (since kernel_fpu_begin disables preemption). This is quite an important performance optimization.
However, I'd prefer the lazy FPU restoration proposal discussed a few months ago, but it looks like that hasn't progressed, hence the need for FPU call amortization:

https://lore.kernel.org/lkml/CALCETrU+2mBPDfkBz1i_GT1EOJau+mzj4yOK8N0UnT2pGjiUWQ@mail.gmail.com/

> As I see it, there are two truly new things in the zinc patchset:
> the direct (in the direct call sense) arch dispatch, and the fact that
> the functions can be called directly, without allocating contexts,
> using function pointers, etc.
>
> In fact, I had a previous patch set that added such an interface for SHA256:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=crypto/sha256_bpf&id=8c59a4dd8b7ba4f2e5a6461132bbd16c83ff7c1f
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=crypto/sha256_bpf&id=7e5fbc02972b03727b71bc71f84175c36cbf01f5

Seems like SHA256 will be a natural next candidate for Zinc, given the demand.

> > Your patch description is also missing any mention of crypto accelerator
> > hardware. Quite a bit of the complexity in the crypto API, such as
> > scatterlist support and asynchronous execution, exists because it
> > supports crypto accelerators. AFAICS your new APIs cannot support
> > crypto accelerators, as your APIs are synchronous and operate on virtual
> > addresses. I assume your justification is that "djb algorithms" like
> > ChaCha and Poly1305 don't need crypto accelerators as they are fast in
> > software. But you never explicitly stated this and discussed the
> > tradeoffs. Since this is basically the foundation for the design you've
> > chosen, it really needs to be addressed.
>
> I see this as an advantage, not a disadvantage. A very large majority
> of in-kernel crypto users (by number of call sites under a *very*
> brief survey, not by number of CPU cycles) just want to do some
> synchronous crypto on a buffer that is addressed by a regular pointer.
> Most of these users would be slowed down if they used any form of
> async crypto, since the CPU can complete the whole operation faster
> than it could plausibly initiate and complete anything asynchronous.
> And, right now, they suffer the full overhead of allocating a context
> (often with alloca!), looking up (or caching) some crypto API data
> structures, dispatching the operation, and cleaning up.
>
> So I think the right way to do it is to have directly callable
> functions like zinc uses and to have the fancy crypto API layer on top
> of them. So if you actually want async accelerated crypto with
> scatterlists or whatever, you can call into the fancy API, and the
> fancy API can dispatch to hardware or it can dispatch to the normal
> static API.

Yes, exactly this.

Regards,
Jason
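The direct-call layering discussed above can be sketched in plain C. This is only an illustration with hypothetical names (counters stand in for real cipher work, and the actual Zinc glue differs): a portable implementation that is always available, plus an arch hook that returns false until feature detection has enabled it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Counters so the dispatch order is observable in this sketch. */
static int arch_calls, generic_calls;
static bool chacha20_use_arch; /* flipped by hypothetical arch init code */

/* Arch hook: returns false when it did not handle the request,
 * e.g. before feature detection has run or on unsupported CPUs. */
static bool chacha20_arch(uint8_t *dst, const uint8_t *src, size_t len)
{
	(void)dst; (void)src; (void)len;
	if (!chacha20_use_arch)
		return false;
	arch_calls++;
	/* ...SIMD implementation would go here... */
	return true;
}

/* Portable C implementation, always available. */
static void chacha20_generic(uint8_t *dst, const uint8_t *src, size_t len)
{
	(void)dst; (void)src; (void)len;
	generic_calls++;
	/* ...reference implementation would go here... */
}

/* The one public entry point: a direct, synchronous call. */
void chacha20(uint8_t *dst, const uint8_t *src, size_t len)
{
	if (!chacha20_arch(dst, src, len))
		chacha20_generic(dst, src, len);
}
```

Callers never see the dispatch, and a dynamic layer like the existing crypto API could still wrap this entry point where async or scatterlist support is needed.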
On Thu, Aug 2, 2018 at 7:48 PM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> Hey Andy,
>
> Thanks too for the feedback. Responses below:
>
> On Wed, Aug 1, 2018 at 7:09 PM Andy Lutomirski <luto@amacapital.net> wrote:
>> > I think the above changes would also naturally lead to a much saner
>> > patch series where each algorithm is added by its own patch, rather than
>> > one monster patch that adds many algorithms and 24000 lines of code.
>> >
>>
>> Yes, please.
>
> Ack, will be in v2.
>
>
>> I like this a *lot*. (But why are you passing have_simd? Shouldn't
>> that check live in chacha20_arch? If there's some init code needed,
>> then chacha20_arch() should just return false before the init code
>> runs. Once the arch does whatever feature detection it needs, it can
>> make chacha20_arch() start returning true.)
>
> The have_simd stuff is so that the FPU state can be amortized across
> several calls to the crypto functions. Here's a snippet from
> WireGuard's send.c:
>
> void packet_encrypt_worker(struct work_struct *work)
> {
> struct crypt_queue *queue = container_of(work, struct
> multicore_worker, work)->ptr;
> struct sk_buff *first, *skb, *next;
> bool have_simd = simd_get();
Gotcha. That was very hidden in the 24k lines. Please make this (and
any similar goodies) be their own patches.
Also, please consider spelling it differently:
simd_context_t simd_context = simd_get();
Because we'll feel very silly the first time some architecture has
more than one possible state. (It wouldn't be entirely insane for x86
to distinguish between "no SIMD", "XMM only", and "go to town!", for
example.)
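A small user-space model of the suggestion (all names hypothetical; this only models the state transitions, not real kernel FPU handling) might look like:

```c
#include <assert.h>
#include <stdbool.h>

/* An enum instead of a bare bool, leaving room for finer-grained
 * states later (e.g. a hypothetical "XMM only" level on x86). */
typedef enum {
	HAVE_NO_SIMD,
	HAVE_FULL_SIMD,
} simd_context_t;

/* Stand-ins for irq_fpu_usable()/kernel_fpu_begin()/kernel_fpu_end(). */
static bool fpu_usable = true;
static int fpu_regions;

static simd_context_t simd_get(void)
{
	if (!fpu_usable)
		return HAVE_NO_SIMD;
	fpu_regions++;          /* kernel_fpu_begin() */
	return HAVE_FULL_SIMD;
}

static void simd_put(simd_context_t ctx)
{
	if (ctx != HAVE_NO_SIMD)
		fpu_regions--;  /* kernel_fpu_end() */
}

/* Briefly drop the FPU so preemption can happen, then re-acquire.
 * The real helper would consult need_resched(); here it's a parameter. */
static simd_context_t simd_relax(simd_context_t ctx, bool need_resched)
{
	if (ctx != HAVE_NO_SIMD && need_resched) {
		simd_put(ctx);
		return simd_get();
	}
	return ctx;
}
```

The point of the enum is exactly Andy's: callers compare against named states rather than dumping a raw "true" into the context.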
On Fri, Aug 3, 2018 at 11:29 PM Andy Lutomirski <luto@amacapital.net> wrote:
> Gotcha. That was very hidden in the 24k lines. Please make this (and
> any similar goodies) be their own patches.

I know, sorry, and I certainly will have this split out. The above code snippet was from the much, much shorter WireGuard patch (3/3), but indeed the simd_get/put/relax code is in the monster patch here, so that'll certainly be split for v2.

> Also, please consider spelling it differently:
>
> simd_context_t simd_context = simd_get();
>
> Because we'll feel very silly the first time some architecture has
> more than one possible state. (It wouldn't be entirely insane for x86
> to distinguish between "no SIMD", "XMM only", and "go to town!", for
> example.)

That's a great idea. It'll also make it clearer that users shouldn't just dump a raw "true" in there; raw bools are more tempting to abuse like that.
Hey Andy, Eric, & all, I've started the work of separating this out into 16 individual commits, have addressed numerous other things brought up like the ifdef maze, and have begun rewriting (parts of) the original commit message. It's still a work in progress, and I still have some work to do, but if you want to follow along, things are happening here: https://git.zx2c4.com/linux-dev/log/?h=jd/wireguard I'll be rebasing and reworking this continuously, but that's how it's shaping up. As I'm still working on it, I won't be submitting v2 today, but if you have suggestions or concerns or want to poke me while I'm working on v2, don't hesitate to send me private email or ping me in IRC (I'm "zx2c4" there) to chat. Regards, Jason
On Tue, Aug 7, 2018 at 11:54 AM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> Hey Andy, Eric, & all,
>
> I've started the work of separating this out into 16 individual
> commits, have addressed numerous other things brought up like the
> ifdef maze, and have begun rewriting (parts of) the original commit
> message. It's still a work in progress, and I still have some work to
> do, but if you want to follow along, things are happening here:
>
> https://git.zx2c4.com/linux-dev/log/?h=jd/wireguard
>
> I'll be rebasing and reworking this continuously, but that's how it's
> shaping up.
>
> As I'm still working on it, I won't be submitting v2 today, but if you
> have suggestions or concerns or want to poke me while I'm working on
> v2, don't hesitate to send me private email or ping me in IRC (I'm
> "zx2c4" there) to chat.
>
> Regards,
> Jason
For "zinc: add simd helper", I think it should be in include/linux,
and include/linux/simd.h should (immediately or maybe in the future)
include <asm/simd.h> to pick up arch-specific stuff. And the patch
should get sent to linux-arch@vger.kernel.org.
In your blake2s_arch() implementation, you're not passing in a
simd_context_t. Is that still a work in progress? I thought the plan
was to pass it in rather than doing the check in the _arch()
functions.
Hey Andy,

On Tue, Aug 7, 2018 at 12:43 PM Andy Lutomirski <luto@amacapital.net> wrote:
> For "zinc: add simd helper", I think it should be in include/linux,
> and include/linux/simd.h should (immediately or maybe in the future)
> include <asm/simd.h> to pick up arch-specific stuff. And the patch
> should get sent to linux-arch@vger.kernel.org.

I guess you saw my prompt about that in the previous commit message? Based on your encouragement, I implemented it:

https://git.zx2c4.com/linux-dev/commit/?h=simd

This is _far_ more invasive than I wanted to be, as I don't want this patch submission to grow unwieldy and never be merged, but I guess we can roll with this for now...

> In your blake2s_arch() implementation, you're not passing in a
> simd_context_t. Is that still a work in progress? I thought the plan
> was to pass it in rather than doing the check in the _arch()
> functions.

I'm inclined to do the explicit context passing only when a function is likely to be used in that kind of environment, and to adjust as needed. Long term, anyway, that API will be removed once the x86 guys figure out lazy FPU restoration, at which point the amortization won't add anything.

Jason
> On Aug 7, 2018, at 4:48 PM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> Hey Andy,
>
>> On Tue, Aug 7, 2018 at 12:43 PM Andy Lutomirski <luto@amacapital.net> wrote:
>> For "zinc: add simd helper", I think it should be in include/linux,
>> and include/linux/simd.h should (immediately or maybe in the future)
>> include <asm/simd.h> to pick up arch-specific stuff. And the patch
>> should get sent to linux-arch@vger.kernel.org.
>
> I guess you saw my prompt about that in the previous commit message?
> Based on your encouragement, I implemented it:
> https://git.zx2c4.com/linux-dev/commit/?h=simd This is _far_ more
> invasive than I wanted to be, as I don't want this patch submission to
> grow unwieldy and never be merged, but I guess we can roll with this
> for now...

I really wish we had a way to say that we use asm-generic’s copy of a header in all cases except where an arch opts out.

>> In your blake2s_arch() implementation, you're not passing in a
>> simd_context_t. Is that still a work in progress? I thought the plan
>> was to pass it in rather than doing the check in the _arch()
>> functions.
>
> I'm inclined to do the explicit context passing only when a function
> is likely to be used in that kind of environment, and adjust as
> needed. Long term, anyway, that API will be removed once the x86 guys
> figure out lazy FPU restoration and the amortization doesn't add
> anything.

Fair enough.

> Jason
On Tue, Aug 7, 2018 at 6:49 PM Andy Lutomirski <luto@amacapital.net> wrote:
> I really wish we had a way to say that we use asm-generic’s copy of a header in all cases except where an arch opts out.
It's really not that hard to do -- symlink asm-generic to a target
called "asm" inside an otherwise empty directory, and add that
otherwise empty directory to the -I paths just after arch/include.
Since it's searched second, it's only used if the first fails. Maybe
I'm missing something though, as this seems a bit too obvious. Perhaps
a project for another day.
On Tue, Aug 7, 2018 at 6:51 PM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> On Tue, Aug 7, 2018 at 6:49 PM Andy Lutomirski <luto@amacapital.net> wrote:
>> I really wish we had a way to say that we use asm-generic’s copy of a header in all cases except where an arch opts out.
>
> It's really not that hard to do -- symlink asm-generic to a target
> called "asm" inside an otherwise empty directory, and add that
> otherwise empty directory to the -I paths just after arch/include.
> Since it's searched second, it's only used if the first fails. Maybe
> I'm missing something though, as this seems a bit too obvious. Perhaps
> a project for another day.
The problem here (I think) is that it's preferable for stray files in
the kernel tree to have no effect and, with this scheme, stray files
in arch/*/include/asm could affect the build. So something explicit
has an advantage. I just think there should be a way to say "this
asm-generic file should affect all arches unless they generic-n or
override-generic-y it or whatever".
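For context, the existing kbuild mechanism is per-arch opt-in rather than the opt-out Andy is wishing for: each architecture's asm/Kbuild lists the asm-generic headers it wants generated, along the lines of this illustrative fragment (a "generic-n"-style opt-out does not exist today):

```make
# arch/<arch>/include/asm/Kbuild -- opt in to the asm-generic version
# of a header; anything not listed must be provided by the arch itself.
generic-y += simd.h
```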
On Fri, Aug 03, 2018 at 04:33:50AM +0200, Jason A. Donenfeld wrote:
>
> > Also, earlier when I tested OpenSSL's ChaCha NEON implementation on ARM
> > Cortex-A7 it was actually quite a bit slower than the one in the Linux
> > kernel written by Ard Biesheuvel... I trust that when claiming the
> > performance of all implementations you're adding is "state-of-the-art
> > and unrivaled", you actually compared them to the ones already in the
> > Linux kernel which you're advocating replacing, right? :-)
>
> Yes, I have, and my results don't corroborate your findings. It will
> be interesting to get a wider variety of hardware for comparisons.
> I suspect, also, that if the snarky emoticons subside, AndyP would be
> very interested in whatever we find and could have interest in
> improving implementations, should we actually find performance
> differences.
>
On ARM Cortex-A7, OpenSSL's ChaCha20 implementation is 13.9 cpb (cycles per
byte), whereas Linux's is faster: 11.9 cpb. I've also recently improved the
Linux implementation to 11.3 cpb and would like to send out a patch soon...
I've also written a scalar ChaCha20 implementation (no NEON instructions!) that
is 12.2 cpb on one block at a time on Cortex-A7, taking advantage of the free
rotates; that would be useful for the single permutation used to compute
XChaCha's subkey, and also for the ends of messages.
The reason Linux's ChaCha20 NEON implementation is faster than OpenSSL's is that
Linux's does 4 blocks at once using NEON instructions, and the words are
de-interleaved so the rows don't need to be shifted between each round.
OpenSSL's implementation, on the other hand, only does 3 blocks at once with
NEON instructions and has to shift the rows between each round. OpenSSL's
implementation also does a 4th block at the same time using regular ARM
instructions, but that doesn't help on Cortex-A7; it just makes it slower.
I understand there are tradeoffs, and different implementations can be faster on
different CPUs. Just know that from my point of view, switching to the OpenSSL
implementation actually introduces a performance regression, and we care a *lot*
about this since we need ChaCha to be absolutely as fast as possible for HPolyC
disk encryption. So if your proposal goes in, I'd likely need to write a patch
to get the old performance back, at least on Cortex-A7...
Also, I don't know whether Andy P. considered the 4x NEON implementation
technique; it could well be the fastest on other ARM CPUs too.
- Eric
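Eric's "free rotates" remark refers to ARM's ability to fold a rotate into most ALU instructions at no extra cost, and each ChaCha quarter round is four add/xor/rotate triples, so that matters a lot. A portable C sketch of the quarter round (the structure only, not any of the optimized implementations under discussion) also shows why the round pattern matters for NEON: a column round applies it to the four columns and a diagonal round to the four diagonals, so code that keeps one block per vector must shuffle rows between rounds, while de-interleaved 4x code operates lane-wise and needn't.

```c
#include <assert.h>
#include <stdint.h>

#define ROTL32(v, n) (((v) << (n)) | ((v) >> (32 - (n))))

/* One ChaCha quarter round on four words of the 4x4 state matrix.
 * On ARM, each ROTL32 can be folded into the preceding ALU op. */
static void chacha_quarter_round(uint32_t x[16], int a, int b, int c, int d)
{
	x[a] += x[b]; x[d] = ROTL32(x[d] ^ x[a], 16);
	x[c] += x[d]; x[b] = ROTL32(x[b] ^ x[c], 12);
	x[a] += x[b]; x[d] = ROTL32(x[d] ^ x[a], 8);
	x[c] += x[d]; x[b] = ROTL32(x[b] ^ x[c], 7);
}
```

This matches the quarter-round test vector in RFC 8439 section 2.1.1 (inputs 0x11111111, 0x01020304, 0x9b8d6f43, 0x01234567).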
Eric Biggers writes:
> I've also written a scalar ChaCha20 implementation (no NEON instructions!) that
> is 12.2 cpb on one block at a time on Cortex-A7, taking advantage of the free
> rotates; that would be useful for the single permutation used to compute
> XChaCha's subkey, and also for the ends of messages.

This is also how ends of messages are handled in the 2012 implementation crypto_stream/salsa20/armneon6 (see "mainloop1") inside the SUPERCOP benchmarking framework:

https://bench.cr.yp.to/supercop.html

This code is marginally different from Eric's new code because the occasional loads and stores are scheduled for the Cortex-A8 rather than the Cortex-A7, and because it's Salsa20 rather than ChaCha20.

The bigger picture is that there are 63 implementations of Salsa20 and ChaCha20 in SUPERCOP from 10 authors showing various implementation techniques, including all the techniques that have been mentioned in this thread; and centralized benchmarks on (e.g.)

https://bench.cr.yp.to/results-stream.html#amd64-kizomba
https://bench.cr.yp.to/web-impl/amd64-kizomba-crypto_stream-salsa20.html

showing what's fastest on various platforms, using well-developed benchmarking tools that produce repeatable, meaningful measurements. There are also various papers explaining the main techniques.

Of course it's possible that new code will do better, especially on platforms with different performance characteristics from the platforms previously targeted. Contributing new implementations to SUPERCOP is easy---which is why SUPERCOP already has thousands of implementations of hundreds of cryptographic functions---and is a more effective way to advertise speedups than adding code merely to (e.g.) the Linux kernel. Infrastructure is centralized in SUPERCOP to minimize per-implementation work.
There's no risk of being rejected on the basis of cryptographic concerns (MD5, Speck, and RSA-512 are included in the benchmarks) or code-style concerns. Users can then decide which implementations best meet their requirements.

"Do better" seems to be what's happened for the Cortex-A7. The best SUPERCOP speeds (from code targeting the Cortex-A8 etc.) are 13.42 cycles/byte for 4096 bytes for ChaCha20; 12.2, 11.9, and 11.3 sound noticeably better. The Cortex-A7 is an interesting case because it's simultaneously (1) widely deployed---more than a billion units sold---but (2) poorly documented. If you want to know, e.g., which instructions can dual-issue with loads/FPU moves/..., then you won't be able to find anything from ARM giving the answer. I've started building an automated tool to compute the full CPU pipeline structure from benchmarks, but this isn't ready yet.

---Dan
On Wed, Aug 15, 2018 at 04:28:19PM -0000, D. J. Bernstein wrote:
> Eric Biggers writes:
> > I've also written a scalar ChaCha20 implementation (no NEON instructions!) that
> > is 12.2 cpb on one block at a time on Cortex-A7, taking advantage of the free
> > rotates; that would be useful for the single permutation used to compute
> > XChaCha's subkey, and also for the ends of messages.
>
> This is also how ends of messages are handled in the 2012 implementation
> crypto_stream/salsa20/armneon6 (see "mainloop1") inside the SUPERCOP
> benchmarking framework:
>
> https://bench.cr.yp.to/supercop.html
>
> This code is marginally different from Eric's new code because the
> occasional loads and stores are scheduled for the Cortex-A8 rather than
> the Cortex-A7, and because it's Salsa20 rather than ChaCha20.
>
> The bigger picture is that there are 63 implementations of Salsa20 and
> ChaCha20 in SUPERCOP from 10 authors showing various implementation
> techniques, including all the techniques that have been mentioned in
> this thread; and centralized benchmarks on (e.g.)
>
> https://bench.cr.yp.to/results-stream.html#amd64-kizomba
> https://bench.cr.yp.to/web-impl/amd64-kizomba-crypto_stream-salsa20.html
>
> showing what's fastest on various platforms, using well-developed
> benchmarking tools that produce repeatable, meaningful measurements.
> There are also various papers explaining the main techniques.
>
> Of course it's possible that new code will do better, especially on
> platforms with different performance characteristics from the platforms
> previously targeted. Contributing new implementations to SUPERCOP is
> easy---which is why SUPERCOP already has thousands of implementations of
> hundreds of cryptographic functions---and is a more effective way to
> advertise speedups than adding code merely to (e.g.) the Linux kernel.
> Infrastructure is centralized in SUPERCOP to minimize per-implementation
> work. There's no risk of being rejected on the basis of cryptographic
> concerns (MD5, Speck, and RSA-512 are included in the benchmarks) or
> code-style concerns. Users can then decide which implementations best
> meet their requirements.
>
> "Do better" seems to be what's happened for the Cortex-A7. The best
> SUPERCOP speeds (from code targeting the Cortex-A8 etc.) are 13.42
> cycles/byte for 4096 bytes for ChaCha20; 12.2, 11.9, and 11.3 sound
> noticeably better. The Cortex-A7 is an interesting case because it's
> simultaneously (1) widely deployed---more than a billion units sold---
> but (2) poorly documented. If you want to know, e.g., which instructions
> can dual-issue with loads/FPU moves/..., then you won't be able to find
> anything from ARM giving the answer. I've started building an automated
> tool to compute the full CPU pipeline structure from benchmarks, but
> this isn't ready yet.
>
Hi Dan, are you saying I should contribute my scalar ChaCha implementation,
and/or the Linux kernel's ChaCha NEON implementation, to SUPERCOP? Just a few
comments: there doesn't appear to be an official git repository for SUPERCOP,
nor is there any mention of how to send patches, nor is there any COPYING or
LICENSE file, nor even a README file. So, while I'm interested, from my
perspective SUPERCOP doesn't seem friendly to contributions. You'd probably
attract more contributors if you followed established open source conventions.
Another issue is that the ChaCha code in SUPERCOP is duplicated for each number
of rounds: 8, 12, and 20. So if anyone adds a new ChaCha implementation, they'd
apparently have to add 3 copies of it, which isn't very maintainable.
There are actually only two ARM NEON implementations of ChaCha20 in SUPERCOP:
(1) the one in crypto_stream/chacha20/moon/neon/32/chacha.S which looks like an
'objdump' output as it has no comments or anything that would make it readable
like macros and register aliases; and (2) the one in
crypto_stream/chacha20/dolbeau/arm-neon/, which uses a method similar to the
Linux implementation but it uses GCC intrinsics, so its performance will heavily
depend on how the compiler assigns and spills registers, which can vary greatly
depending on the compiler version and options.
I understand that Salsa20 is similar to ChaCha, and that ideas from Salsa20
implementations often apply to ChaCha too. But it's not always obvious what
carries over and what doesn't; the rotation amounts can matter a lot, for
example, as different rotations can be implemented in different ways. Nor is it
always obvious which ideas from SSE2 or AVX2 implementations (for example) carry
over to NEON implementations, as these instruction sets are different enough
that each has its own unique quirks and optimizations.
Previously I also found that OpenSSL's ARM NEON implementation of Poly1305 is
much faster than the implementations in SUPERCOP, as well as more
understandable. (I don't know the 'qhasm' language, for example.) So from my
perspective, I've had more luck with OpenSSL than SUPERCOP when looking for fast
implementations of crypto algorithms. Have you considered adding the OpenSSL
implementations to SUPERCOP?
In the end though, both Linux and OpenSSL don't need every implementation under
the sun, but rather a small number of implementations that provide "good"
performance on nearly all CPUs, while not necessarily being optimal on every CPU
out there. I.e., some tradeoffs are necessary and acceptable. So for
ChaCha{12,20} on ARM we should choose which 2-3 implementations are most
valuable to have and make them as generally well optimized as possible. Based
on the research I've seen and done, I think they're likely to be:
- a 1x_scalar implementation
- a 3x_NEON+1x_scalar implementation like OpenSSL's
- a 4x_NEON implementation like Linux's current one
Currently my main argument is just that having the 3x_NEON+1x_scalar method be
the only NEON implementation is probably insufficient, as there are important
CPUs where the 4x_NEON method is significantly faster.
Thanks,
Eric
Eric Biggers writes:
> You'd probably attract more contributors if you followed established
> open source conventions.

SUPERCOP already has thousands of implementations from hundreds of contributors. New speed records are more likely to appear in SUPERCOP than in any other cryptographic software collection. The API is shared by state-of-the-art benchmarks, state-of-the-art tests, three ongoing competitions, and increasingly popular production libraries.

Am I correctly gathering from this thread that someone adding a new implementation of a crypto primitive to the kernel has to worry about checking the architecture and CPU features to figure out whether the implementation will run? Wouldn't it make more sense to take this error-prone work away from the implementor and have a robust automated central testing mechanism, as in SUPERCOP?

Am I also correctly gathering that adding an extra implementation to the kernel can hurt performance, unless the implementor goes to extra effort to check for the CPUs where the previous implementation is faster---or to build some ad-hoc timing mechanism ("raid6: using algorithm avx2x4 gen() 31737 MB/s")? Wouldn't it make more sense to take this error-prone work away from the implementor and have a robust automated central timing mechanism, as in SUPERCOP?

I also didn't notice anyone disputing Jason's comment about the "general clunkiness" of the kernel's internal crypto API---but is there really no consensus as to what the replacement API is supposed to be? Someone who simply wants to implement some primitives has to decide on function-call details, argue about the software location, add configuration options, etc.? Wouldn't it make more sense to do this centrally, as in SUPERCOP?

And then there's the bigger question of how the community is organizing ongoing work on accelerating---and auditing, and fixing, and hopefully verifying---implementations of cryptographic primitives.
Does it really make sense that people looking for what's already been done have to go poking around a bunch of separate libraries? Wouldn't it make more sense to have one central collection of code, as in SUPERCOP? Is there any fundamental obstacle to having libraries share code for primitives?

> there doesn't appear to be an official git repository for SUPERCOP,
> nor is there any mention of how to send patches, nor is there any
> COPYING or LICENSE file, nor even a README file.

https://bench.cr.yp.to/call-stream.html explains the API and submission procedure for stream ciphers. There are similar pages for other types of cryptographic primitives. https://bench.cr.yp.to/tips.html explains the develop-test cycle and various useful options.

Licenses vary across implementations. There's a minimum requirement of public distribution for verifiability of benchmark results, but it's up to individual implementors to decide what they'll allow beyond that. Patent status also varies; constant-time status varies; verification status varies; code quality varies; cryptographic security varies; etc. As I mentioned, SUPERCOP includes MD5 and Speck and RSA-512.

For comparison, where can I find an explanation of how to test kernel crypto patches, and how fast is the develop-test cycle? Okay, I don't have a kernel crypto patch, but I did write a kernel patch recently that (I think) fixes some recent Lenovo ACPI stupidity:

   https://marc.info/?l=qubes-users&m=153308905514481

I'd propose this for review and upstream adoption _if_ it survives enough tests---but what's the right test procedure? I see superficial documentation of where to submit a patch for review, but am I really supposed to do this before serious testing? The patch works on my laptop, and several other people say it works, but obviously this is missing the big question of whether the patch breaks _other_ laptops.
I see an online framework for testing, but using it looks awfully complicated, and the level of coverage is unclear to me. Has anyone tried to virtualize kernel testing---to capture hardware data from many machines and then centrally simulate kernels running on those machines, for example to check that those machines don't take certain code paths? I suppose that people who work with the kernel all the time would know what to do, but for me the lack of information was enough of a deterrent that I switched to doing something else.

> Another issue is that the ChaCha code in SUPERCOP is duplicated for
> each number of rounds: 8, 12, and 20.

These are auto-generated, of course. To understand this API detail, consider some of the possibilities for the round counts supported by compiled code:

   * 20
   * 12
   * 8
   * caller selection from among 20 and 12 and 8
   * caller selection of any multiple of 4
   * caller selection of any multiple of 2
   * caller selection of anything

I hope that in the long term everyone is simply using 20, and then the pure 20 is the simplest and smallest and most easily verified code, but obviously there are other implementations today. An API with a separate function for each round count allows any of these implementations to be trivially benchmarked and used, whereas an API that insists on passing the round count as an argument prohibits at least the first three and maybe more.

> crypto_stream/chacha20/dolbeau/arm-neon/, which uses a method similar to the
> Linux implementation but it uses GCC intrinsics, so its performance will heavily
> depend on how the compiler assigns and spills registers, which can vary greatly
> depending on the compiler version and options.

Sure. The damage done by incompetent compilers is particularly clear for in-order CPUs such as the Cortex-A7.

> I understand that Salsa20 is similar to ChaCha, and that ideas from Salsa20
> implementations often apply to ChaCha too.
> But it's not always obvious what
> carries over and what doesn't; the rotation amounts can matter a lot, for
> example, as different rotations can be implemented in different ways.

This sounds backwards to me. ChaCha20 supports essentially all the Salsa20 implementation techniques plus some extra streamlining: often a bit less register pressure, often less data reorganization, and often some rotation speedups.

> Nor is it always obvious which ideas from SSE2 or AVX2 implementations
> (for example) carry over to NEON implementations, as these instruction
> sets are different enough that each has its own unique quirks and
> optimizations.

Of course.

> Previously I also found that OpenSSL's ARM NEON implementation of Poly1305 is
> much faster than the implementations in SUPERCOP, as well as more
> understandable. (I don't know the 'qhasm' language, for example.) So from my
> perspective, I've had more luck with OpenSSL than SUPERCOP when looking for fast
> implementations of crypto algorithms. Have you considered adding the OpenSSL
> implementations to SUPERCOP?

Almost all of the implementations in SUPERCOP were submitted by the implementors, with a few exceptions for wrappers. Realistically, the implementors are in the best position to check that they're getting the expected results and to be in control of any necessary updates.

---Dan
Hi Eric,

On Tue, Aug 14, 2018 at 2:12 PM Eric Biggers <ebiggers@kernel.org> wrote:
> On ARM Cortex-A7, OpenSSL's ChaCha20 implementation is 13.9 cpb (cycles per
> byte), whereas Linux's is faster: 11.9 cpb.
>
> The reason Linux's ChaCha20 NEON implementation is faster than OpenSSL's
>
> I understand there are tradeoffs, and different implementations can be faster on
> different CPUs.
>
> So if your proposal goes in, I'd likely need to write a patch
> to get the old performance back, at least on Cortex-A7...

Yes, absolutely. Different CPUs behave differently indeed, but if you have improvements for hardware that matters to you, we should certainly incorporate these, and also loop Andy Polyakov in (I've added him to the CC for the WIP v2). ChaCha is generally pretty obvious, but for big integer algorithms -- like Poly1305 and Curve25519 -- I think it's all the more important to involve Andy and the rest of the world in general, so that Linux benefits from bug research and fuzzing in places that are typically and classically prone to nasty issues. In other words, let's definitely incorporate your improvements after the patchset goes in, and at the same time we'll try to bring Andy and others into the fold, where our improvements can generally track each other's.

> Also, I don't know whether Andy P. considered the 4xNEON implementation
> technique. It could even be fastest on other ARM CPUs too, I don't know.

After v2, when he's CC'd in, let's plan to start discussing this with him.

Jason
Hi Dan,

(I reordered your responses slightly to group together similar topics)

On Thu, Aug 16, 2018 at 04:24:54AM -0000, D. J. Bernstein wrote:
> Eric Biggers writes:
> > You'd probably attract more contributors if you followed established
> > open source conventions.
>
> SUPERCOP already has thousands of implementations from hundreds of
> contributors. New speed records are more likely to appear in SUPERCOP
> than in any other cryptographic software collection. The API is shared
> by state-of-the-art benchmarks, state-of-the-art tests, three ongoing
> competitions, and increasingly popular production libraries.

[...]

> > there doesn't appear to be an official git repository for SUPERCOP,
> > nor is there any mention of how to send patches, nor is there any
> > COPYING or LICENSE file, nor even a README file.
>
> https://bench.cr.yp.to/call-stream.html explains the API and submission
> procedure for stream ciphers. There are similar pages for other types of
> cryptographic primitives. https://bench.cr.yp.to/tips.html explains the
> develop-test cycle and various useful options.
>
> Licenses vary across implementations. There's a minimum requirement of
> public distribution for verifiability of benchmark results, but it's up
> to individual implementors to decide what they'll allow beyond that.
> Patent status also varies; constant-time status varies; verification
> status varies; code quality varies; cryptographic security varies; etc.
> As I mentioned, SUPERCOP includes MD5 and Speck and RSA-512.

Many people may have contributed to SUPERCOP already, but that doesn't mean there aren't things you could do to make it more appealing to contributors and more of a community project, especially since you're suggesting it to the Linux kernel community. There should be a git repository with the development history, including all documentation files so they are available even if your website goes offline (i.e.
the "bus factor" should be > 1), and there should be a way to contribute by a conventional means that invites public review, e.g. sending a patch to a mailing list or opening a Github pull request.

The apparent lack of a license for the SUPERCOP benchmarking framework itself is also problematic. Though you've marked some individual files as "Public domain", not all files are so marked (e.g. crypto_stream/measure.c), so AFAIK there is no explicit permission given to redistribute them. In some people's view, license-less projects like this aren't really free software. So Linux distributions may not want to take on the legal risk of distributing it, nor may companies want to take on the risk of contributing. It may seem silly, but some people do take these things *very* seriously! Fortunately, it's easy for you to fix the licensing: just add a standard license like the MIT license as a file named COPYING in the top-level directory.

> Am I correctly gathering from this thread that someone adding a new
> implementation of a crypto primitive to the kernel has to worry about
> checking the architecture and CPU features to figure out whether the
> implementation will run? Wouldn't it make more sense to take this
> error-prone work away from the implementor and have a robust automated
> central testing mechanism, as in SUPERCOP?
>
> Am I also correctly gathering that adding an extra implementation to the
> kernel can hurt performance, unless the implementor goes to extra effort
> to check for the CPUs where the previous implementation is faster---or
> to build some ad-hoc timing mechanism ("raid6: using algorithm avx2x4
> gen() 31737 MB/s")? Wouldn't it make more sense to take this error-prone
> work away from the implementor and have a robust automated central
> timing mechanism, as in SUPERCOP?

If you're talking about things like "don't use the AVX2 implementation if the CPU doesn't support AVX2", then of course that has to be checked, but that's straightforward.
If (more likely) you're talking about things like "use this NEON implementation on Cortex-A7 but this other NEON implementation on Cortex-A53", it's up to the developers and community to test different CPUs and make appropriate decisions, and yes it can be very useful to have external benchmarks like SUPERCOP to refer to, and I appreciate your work in that area.

Note that in practice, the priority order in Linux's crypto API has actually tended to be straightforward, like generic -> SSE2 -> SSE3 -> AVX2, since historically people haven't cared quite enough about crypto performance in the kernel to microoptimize it for individual CPU microarchitectures, nor have there been many weird cases like ARM NEON where scalar instructions are free when interleaved with vector instructions on some CPUs but not others. Maybe that's changing as more people need optimal crypto performance in the kernel.

> I also didn't notice anyone disputing Jason's comment about the "general
> clunkiness" of the kernel's internal crypto API---but is there really no
> consensus as to what the replacement API is supposed to be? Someone who
> simply wants to implement some primitives has to decide on function-call
> details, argue about the software location, add configuration options,
> etc.? Wouldn't it make more sense to do this centrally, as in SUPERCOP?

Not yet; that's the purpose of code review, so that a consensus can be reached. But if/when the new APIs land, the next contributor who wants to do things similarly will have a much easier time.

As a side note, are you certain you've received and read all responses to this thread? I haven't always bothered replying to your "qsecretary notice", and I suspect that many others do the same. (Imagine if everyone used that, so everyone had to reply to 10 "qsecretary notices" on threads like this!)
> And then there's the bigger question of how the community is organizing
> ongoing work on accelerating---and auditing, and fixing, and hopefully
> verifying---implementations of cryptographic primitives. Does it really
> make sense that people looking for what's already been done have to go
> poking around a bunch of separate libraries? Wouldn't it make more sense
> to have one central collection of code, as in SUPERCOP? Is there any
> fundamental obstacle to having libraries share code for primitives?

A lot of code can be shared, but in practice different environments have different constraints, and kernel programming in particular has some distinct differences from userspace programming. For example, you cannot just use the FPU (including SSE, AVX, NEON, etc.) registers whenever you want to, since on most architectures they can't be used in some contexts such as hardirq context, and even when they *can* be used you have to run special code before and after which does things like saving all the FPU registers to the task_struct, disabling preemption, and/or enabling the FPU. But disabling preemption for long periods of time hurts responsiveness, so it's also desirable to yield the processor occasionally, which means that assembly implementations should be incremental rather than having a single entry point that does everything.

There are also many other rules that must be followed in kernel code, like not being free to use the %rbp register on x86 for anything other than the frame base pointer, or having to make indirect calls in asm code through a special macro that mitigates the Spectre vulnerability. So yes, crypto code can sometimes be shared, but changes often do need to be made for the kernel.

> For comparison, where can I find an explanation of how to test kernel
> crypto patches, and how fast is the develop-test cycle?
> Okay, I don't
> have a kernel crypto patch, but I did write a kernel patch recently that
> (I think) fixes some recent Lenovo ACPI stupidity:
>
> https://marc.info/?l=qubes-users&m=153308905514481
>
> I'd propose this for review and upstream adoption _if_ it survives
> enough tests---but what's the right test procedure? I see superficial
> documentation of where to submit a patch for review, but am I really
> supposed to do this before serious testing? The patch works on my
> laptop, and several other people say it works, but obviously this is
> missing the big question of whether the patch breaks _other_ laptops.
> I see an online framework for testing, but using it looks awfully
> complicated, and the level of coverage is unclear to me. Has anyone
> tried to virtualize kernel testing---to capture hardware data from many
> machines and then centrally simulate kernels running on those machines,
> for example to check that those machines don't take certain code paths?
> I suppose that people who work with the kernel all the time would know
> what to do, but for me the lack of information was enough of a deterrent
> that I switched to doing something else.

The process for submitting Linux kernel patches is very well documented. If you're interested in contributing, you're welcome to read Documentation/process/submitting-patches.rst and the other documentation.

As for testing, things are different for different parts of the kernel. Some parts are covered well by automated tests, others rely more on manual tests, and others rely more on a large community running the kernel, especially -rc (release candidate) kernels, and reporting any problems. For the crypto API specifically, there are correctness self-tests that are automatically run when an option is set in the kernel .config. Developers should always run them, but even if they don't, other people are running them too on various hardware.

ACPI workarounds for firmware bugs are a somewhat different story...
AFAIK, those types of patches rely on testing and review done by the developer, the subsystem maintainers, and the broader community, as well as the community reporting any problems they experience in -rc kernels. Many people, such as myself, run -rc kernels on their computer(s) and will report any problems found.

Yes, the standards of correctness for those types of things are not as high as for crypto algorithms, which is a shame; but when you're talking about hacking around bugs in specific firmware versions in specific laptops, there's probably not much more that can be done in practice... So I encourage you to still send out your ACPI patch!

Thanks,

- Eric
Eric Biggers writes:
> If (more likely) you're talking about things like "use this NEON implementation
> on Cortex-A7 but this other NEON implementation on Cortex-A53", it's up the
> developers and community to test different CPUs and make appropriate decisions,
> and yes it can be very useful to have external benchmarks like SUPERCOP to refer
> to, and I appreciate your work in that area.

You seem to be talking about a process that selects (e.g.) ChaCha20 implementations as follows: manually inspect benchmarks of various implementations on various CPUs, manually write code to map CPUs to implementations, manually update the code as necessary for new CPUs, and of course manually do the same for every other primitive that can see differences between microarchitectures (which isn't something weird---it's the normal situation after enough optimization effort).

This is quite a bit of manual work, so the kernel often doesn't do it, so we end up with unhappy people talking about performance regressions.

For comparison, imagine one simple central piece of code in the kernel to automatically do the following:

   When a CPU core is booted:
     For each primitive:
       Benchmark all implementations of the primitive on the core.
       Select the fastest for subsequent use on the core.

If this is a general-purpose mechanism (as in SUPERCOP, NaCl, and libpqcrypto) rather than something ad-hoc (as in raid6), then there's no manual work per primitive, and no work per implementation. Each CPU, old or new, automatically obtains the fastest available code for that CPU.

The only cost is a moment of benchmarking at boot time. _If_ this is a noticeable cost then there are many ways to speed it up: for example, automatically copy the results across identical cores, automatically copy the results across boots if the cores are unchanged, automatically copy results from a central database indexed by CPU identifiers, etc.
The SUPERCOP database is evolving towards enabling this type of sharing.

> A lot of code can be shared, but in practice different environments have
> different constraints, and kernel programming in particular has some distinct
> differences from userspace programming. For example, you cannot just use the
> FPU (including SSE, AVX, NEON, etc.) registers whenever you want to, since on
> most architectures they can't be used in some contexts such as hardirq context,
> and even when they *can* be used you have to run special code before and after
> which does things like saving all the FPU registers to the task_struct,
> disabling preemption, and/or enabling the FPU.

Is there some reason that each implementor is being pestered to handle all this? Detecting FPU usage is a simple static-analysis exercise, and the rest sounds like straightforward boilerplate that should be handled centrally.

> But disabling preemption for
> long periods of time hurts responsiveness, so it's also desirable to yield the
> processor occasionally, which means that assembly implementations should be
> incremental rather than having a single entry point that does everything.

Doing this rewrite automatically is a bit more of a code-analysis challenge, but the alternative approach of doing it by hand is insanely error-prone. See, e.g., https://eprint.iacr.org/2017/891.

> Many people may have contributed to SUPERCOP already, but that doesn't mean
> there aren't things you could do to make it more appealing to contributors and
> more of a community project,

The logic in this sentence is impeccable, and is already illustrated by many SUPERCOP improvements through the years from an increasing number of contributors, as summarized in the 87 release announcements so far on the relevant public mailing list, which you're welcome to study in detail along with the 400 megabytes of current code and as many previous versions as you're interested in.
That's also the mailing list where people are told to send patches, as you'll see if you RTFM.

> So Linux distributions may not want to take on the legal risk of
> distributing it

This is a puzzling comment. A moment ago we were talking about the possibility of useful sharing of (e.g.) ChaCha20 implementations between SUPERCOP and the Linux kernel, avoiding pointless fracturing of the community's development process for these implementations. This doesn't mean that the kernel should be grabbing implementations willy-nilly from SUPERCOP---surely the kernel should be doing security audits, and the kernel already has various coding requirements, and the kernel requires GPL compatibility, while putting any of these requirements into SUPERCOP would be counterproductive.

If you mean having the entire SUPERCOP benchmarking package distributed through Linux distributions, I have no idea what your motivation is or how this is supposed to be connected to anything else we're discussing. Obviously SUPERCOP's broad code-inclusion policies make this idea a non-starter.

> nor may companies want to take on the risk of contributing.

RTFM. People who submit code are authorizing public redistribution for benchmarking. It's up to them to decide if they want to allow more.

---Dan
> On 17 Aug 2018, at 10:31, D. J. Bernstein <djb@cr.yp.to> wrote:
>
> Eric Biggers writes:
>> If (more likely) you're talking about things like "use this NEON implementation
>> on Cortex-A7 but this other NEON implementation on Cortex-A53", it's up the
>> developers and community to test different CPUs and make appropriate decisions,
>> and yes it can be very useful to have external benchmarks like SUPERCOP to refer
>> to, and I appreciate your work in that area.
>
> You seem to be talking about a process that selects (e.g.) ChaCha20
> implementations as follows: manually inspect benchmarks of various
> implementations on various CPUs, manually write code to map CPUs to
> implementations, manually update the code as necessary for new CPUs, and
> of course manually do the same for every other primitive that can see
> differences between microarchitectures (which isn't something weird---
> it's the normal situation after enough optimization effort).
>
> This is quite a bit of manual work, so the kernel often doesn't do it,
> so we end up with unhappy people talking about performance regressions.
>
> For comparison, imagine one simple central piece of code in the kernel
> to automatically do the following:
>
>   When a CPU core is booted:
>     For each primitive:
>       Benchmark all implementations of the primitive on the core.
>       Select the fastest for subsequent use on the core.
>
> If this is a general-purpose mechanism (as in SUPERCOP, NaCl, and
> libpqcrypto) rather than something ad-hoc (as in raid6), then there's no
> manual work per primitive, and no work per implementation. Each CPU, old
> or new, automatically obtains the fastest available code for that CPU.
>
> The only cost is a moment of benchmarking at boot time.
> _If_ this is a
> noticeable cost then there are many ways to speed it up: for example,
> automatically copy the results across identical cores, automatically
> copy the results across boots if the cores are unchanged, automatically
> copy results from a central database indexed by CPU identifiers, etc.
> The SUPERCOP database is evolving towards enabling this type of sharing.

‘Fastest’ does not imply ‘preferred’. For instance, running the table-based, cache-thrashing generic AES implementation may be fast, but may put a disproportionate load on, e.g., a hyperthreading system, and as you have pointed out yourself, it is time-variant as well. Then, there is the power consumption aspect: NEON bit-sliced AES may be faster, but does a lot more work, and does it on the SIMD unit, which could potentially be turned off entirely otherwise. Only the implementations based on h/w instructions can generally be assumed optimal in all senses, and there is no real point in benchmarking those against pure software implementations.

Then, there is the aspect of accelerators: the kernel’s crypto API seamlessly supports crypto peripherals, which may be slower or faster, have more or fewer queues than the number of CPUs, may offer additional benefits such as protected AES keys, etc.

In the Linux kernel, we generally try to stay away from policy decisions, and offer the controls to allow userland to take charge of this. The modularized crypto code can be blacklisted per algo implementation if desired, and beyond that, we simply try to offer functionality that covers the common case.

>> A lot of code can be shared, but in practice different environments have
>> different constraints, and kernel programming in particular has some distinct
>> differences from userspace programming. For example, you cannot just use the
>> FPU (including SSE, AVX, NEON, etc.)
>> registers whenever you want to, since on
>> most architectures they can't be used in some contexts such as hardirq context,
>> and even when they *can* be used you have to run special code before and after
>> which does things like saving all the FPU registers to the task_struct,
>> disabling preemption, and/or enabling the FPU.
>
> Is there some reason that each implementor is being pestered to handle
> all this? Detecting FPU usage is a simple static-analysis exercise, and
> the rest sounds like straightforward boilerplate that should be handled
> centrally.

Detecting it is easy but that does not mean that you can use SIMD in any context, and whether a certain function may ever be called from such a context cannot be decided by static analysis. Also, there are performance and latency concerns which need to be taken into account. In the kernel, we simply cannot write our algorithm as if our code is the only thing running on the system.

>> But disabling preemption for
>> long periods of time hurts responsiveness, so it's also desirable to yield the
>> processor occasionally, which means that assembly implementations should be
>> incremental rather than having a single entry point that does everything.
>
> Doing this rewrite automatically is a bit more of a code-analysis
> challenge, but the alternative approach of doing it by hand is insanely
> error-prone. See, e.g., https://eprint.iacr.org/2017/891.
>
>> Many people may have contributed to SUPERCOP already, but that doesn't mean
>> there aren't things you could do to make it more appealing to contributors and
>> more of a community project,
>
> The logic in this sentence is impeccable, and is already illustrated by
> many SUPERCOP improvements through the years from an increasing number
> of contributors, as summarized in the 87 release announcements so far on
> the relevant public mailing list, which you're welcome to study in
> detail along with the 400 megabytes of current code and as many previous
> versions as you're interested in. That's also the mailing list where
> people are told to send patches, as you'll see if you RTFM.
>
>> So Linux distributions may not want to take on the legal risk of
>> distributing it
>
> This is a puzzling comment. A moment ago we were talking about the
> possibility of useful sharing of (e.g.) ChaCha20 implementations between
> SUPERCOP and the Linux kernel, avoiding pointless fracturing of the
> community's development process for these implementations. This doesn't
> mean that the kernel should be grabbing implementations willy-nilly from
> SUPERCOP---surely the kernel should be doing security audits, and the
> kernel already has various coding requirements, and the kernel requires
> GPL compatibility, while putting any of these requirements into SUPERCOP
> would be counterproductive.
>
> If you mean having the entire SUPERCOP benchmarking package distributed
> through Linux distributions, I have no idea what your motivation is or
> how this is supposed to be connected to anything else we're discussing.
> Obviously SUPERCOP's broad code-inclusion policies make this idea a
> non-starter.
>
>> nor may companies want to take on the risk of contributing.
>
> RTFM. People who submit code are authorizing public redistribution for
> benchmarking. It's up to them to decide if they want to allow more.
>
> ---Dan