Linux-Crypto Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v4 00/35] crypto: crypto API library interfaces for WireGuard
@ 2019-10-17 19:08 Ard Biesheuvel
  2019-10-17 19:08 ` [PATCH v4 01/35] crypto: tidy up lib/crypto Kconfig and Makefile Ard Biesheuvel
                   ` (34 more replies)
  0 siblings, 35 replies; 50+ messages in thread
From: Ard Biesheuvel @ 2019-10-17 19:08 UTC (permalink / raw)
  To: linux-crypto
  Cc: Ard Biesheuvel, Herbert Xu, David Miller, Jason A . Donenfeld,
	Samuel Neves, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Martin Willi, Rene van Dorst, David Sterba

This series implements the crypto library abstractions that are needed to
incorporate WireGuard into the mainline kernel.

Changes since v3:
- Unify the way the generic vs arch libraries are organized between ChaCha20
  and Poly1305 on the one hand and Curve25519 and Blake2s on the other.
  All are now made up of a generic library, a generic crypto API driver
  (skcipher for [X]ChaCha, shash for Poly1305 and Blake2s and kpp for
  Curve25519) and optional per-arch versions providing both the library and
  the crypto API interfaces while potentially relying on the generic *library*
  only as a fallback (and not on the generic crypto API driver). Implementations
  of the libary interface that don't require the fallback don't pull in the
  generic code at all, but the generic crypto API drivers are tied to the
  generic implementations directly (this is necessary since we fuzz test the
  accelerated implementations against the generic implementations)
- Provide testmgr test vectors for the Curve25519 and Blake2s crypto API
  drivers that were added in this revision. This also required some changes
  to the KPP test routines so we can test for failures as well.
- Update to the latest version of Andy Polyakov's Poly1305 implementation for
  MIPS that incorporates Rene's improvements for 32r2
- Remove logic in the x86 and ARM implementations of ChaCha and Poly1305 to
  prefer the non-SIMD path for short inputs. This is no longer necessary, and
  even undesirable since it forced ChaCha20Poly1305's ChaCha pass generating
  the Poly1305 nonce to always take the slower scalar path.

Changes since v2:
- Reduce the cc: audience a bit, since I assumed that not everyone is
  interested in discussing the details of this.
- Incorporate scalar ARM code for ChaCha, and the 64-bit MIPS code for
  Poly1305. NOTE: the Cryptogams MIPS code now supports 32-bit MIPS as well,
  and not just 32r2, so I omitted Rene's Poly1305 implementation for now, and
  used Andy's code for everything.
- Incorporate NEON opt-out for Cortex-A5/A7. Note that the code is still
  exposed via the crypto API, but with a low prioririty, so it is still
  available and still gets test coverage, but is not used by default.
- Use static keys (*not* static calls) in the SIMD and bmi2/adx drivers to
  keep track of which implementation is being used, to avoid the memory
  load on each call.
- Defer using weak references or static calls until the dust around this has
  settled. Instead, rely on Kconfig constraints and symbol dependencies to
  ensure that the arch code is always used when it is loaded. This means
  you can only opt out of using the arch code if you disable it in Kconfig
  but this is something I can live with for now.
- Refactor the Curve25519 glue code slightly so that the call sites branch to
  the arch or generic code directly.
- Split up the Poly1305 refactoring patches so they can be reviewed more
  easily.

Changes since RFC/v1:
- dropped the WireGuard patch itself, and the followup patches - since the
  purpose was to illustrate the extent of the required changes, there is no
  reason to keep including them.
- import the MIPS 32r2 versions of ChaCha and Poly1305, but expose both the
  crypto API and library interfaces so that not only WireGuard but also IPsec
  and Adiantum can benefit immediately. (The latter required adding support for
  the reduced round version of ChaCha to the MIPS asm code)
- fix up various minor kconfig/build issues found in randconfig testing
  (thanks Arnd!)

Patches can be found here:
https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=wireguard-crypto-library-api-v4

Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: David Miller <davem@davemloft.net>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Samuel Neves <sneves@dei.uc.pt>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Martin Willi <martin@strongswan.org>
Cc: Rene van Dorst <opensource@vdorst.com>
Cc: David Sterba <dsterba@suse.com>

Ard Biesheuvel (28):
  crypto: tidy up lib/crypto Kconfig and Makefile
  crypto: chacha - move existing library code into lib/crypto
  crypto: x86/chacha - depend on generic chacha library instead of
    crypto driver
  crypto: x86/chacha - expose SIMD ChaCha routine as library function
  crypto: arm64/chacha - depend on generic chacha library instead of
    crypto driver
  crypto: arm64/chacha - expose arm64 ChaCha routine as library function
  crypto: arm/chacha - import Eric Biggers's scalar accelerated ChaCha
    code
  crypto: arm/chacha - remove dependency on generic ChaCha driver
  crypto: arm/chacha - expose ARM ChaCha routine as library function
  crypto: mips/chacha - wire up accelerated 32r2 code from Zinc
  crypto: chacha - unexport chacha_generic routines
  crypto: poly1305 - move core routines into a separate library
  crypto: x86/poly1305 - unify Poly1305 state struct with generic code
  crypto: poly1305 - expose init/update/final library interface
  crypto: x86/poly1305 - depend on generic library not generic shash
  crypto: x86/poly1305 - expose existing driver as poly1305 library
  crypto: arm64/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON
    implementation
  crypto: arm/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON
    implementation
  crypto: mips/poly1305 - incorporate OpenSSL/CRYPTOGAMS optimized
    implementation
  int128: move __uint128_t compiler test to Kconfig
  crypto: testmgr - add test cases for Blake2s
  crypto: blake2s - implement generic shash driver
  crypto: testmgr - implement testing for KPP failures
  crypto: curve25519 - add kpp selftest
  crypto: curve25519 - implement generic KPP driver
  crypto: lib/curve25519 - work around Clang stack spilling issue
  crypto: chacha20poly1305 - import construction and selftest from Zinc
  crypto: lib/chacha20poly1305 - reimplement crypt_from_sg() routine

Jason A. Donenfeld (7):
  crypto: mips/chacha - import 32r2 ChaCha code from Zinc
  crypto: BLAKE2s - generic C library implementation and selftest
  crypto: BLAKE2s - x86_64 SIMD implementation
  crypto: Curve25519 - generic C library implementations
  crypto: Curve25519 - x86_64 library and KPP implementations
  crypto: arm - import Bernstein and Schwabe's Curve25519 ARM
    implementation
  crypto: arm/Curve25519 - wire up NEON implementation

 arch/arm/crypto/Kconfig                   |   16 +-
 arch/arm/crypto/Makefile                  |   17 +-
 arch/arm/crypto/chacha-glue.c             |  355 +
 arch/arm/crypto/chacha-neon-glue.c        |  202 -
 arch/arm/crypto/chacha-scalar-core.S      |  460 ++
 arch/arm/crypto/curve25519-core.S         | 2062 ++++++
 arch/arm/crypto/curve25519-glue.c         |  136 +
 arch/arm/crypto/poly1305-armv4.pl         | 1236 ++++
 arch/arm/crypto/poly1305-core.S_shipped   | 1158 +++
 arch/arm/crypto/poly1305-glue.c           |  276 +
 arch/arm64/Kconfig                        |    2 +-
 arch/arm64/crypto/Kconfig                 |    9 +-
 arch/arm64/crypto/Makefile                |   10 +-
 arch/arm64/crypto/chacha-neon-glue.c      |   96 +-
 arch/arm64/crypto/poly1305-armv8.pl       |  913 +++
 arch/arm64/crypto/poly1305-core.S_shipped |  835 +++
 arch/arm64/crypto/poly1305-glue.c         |  237 +
 arch/mips/Makefile                        |    2 +-
 arch/mips/crypto/Makefile                 |   18 +
 arch/mips/crypto/chacha-core.S            |  497 ++
 arch/mips/crypto/chacha-glue.c            |  162 +
 arch/mips/crypto/poly1305-glue.c          |  203 +
 arch/mips/crypto/poly1305-mips.pl         | 1273 ++++
 arch/riscv/Kconfig                        |    2 +-
 arch/x86/Kconfig                          |    2 +-
 arch/x86/crypto/Makefile                  |    3 +
 arch/x86/crypto/blake2s-core.S            |  685 ++
 arch/x86/crypto/blake2s-glue.c            |  235 +
 arch/x86/crypto/chacha_glue.c             |  189 +-
 arch/x86/crypto/curve25519-x86_64.c       | 2480 +++++++
 arch/x86/crypto/poly1305_glue.c           |  199 +-
 crypto/Kconfig                            |   57 +-
 crypto/Makefile                           |    2 +
 crypto/adiantum.c                         |    5 +-
 crypto/blake2s-generic.c                  |  171 +
 crypto/chacha_generic.c                   |   84 +-
 crypto/curve25519-generic.c               |   92 +
 crypto/ecc.c                              |    2 +-
 crypto/nhpoly1305.c                       |    3 +-
 crypto/poly1305_generic.c                 |  228 +-
 crypto/testmgr.c                          |   44 +-
 crypto/testmgr.h                          | 1856 +++++
 include/crypto/blake2s.h                  |  108 +
 include/crypto/chacha.h                   |   77 +-
 include/crypto/chacha20poly1305.h         |   48 +
 include/crypto/curve25519.h               |   71 +
 include/crypto/internal/blake2s.h         |   24 +
 include/crypto/internal/chacha.h          |   31 +
 include/crypto/internal/poly1305.h        |   58 +
 include/crypto/poly1305.h                 |   69 +-
 init/Kconfig                              |    4 +
 lib/Makefile                              |    3 +-
 lib/crypto/Kconfig                        |  130 +
 lib/crypto/Makefile                       |   42 +-
 lib/crypto/blake2s-generic.c              |  111 +
 lib/crypto/blake2s-selftest.c             | 2093 ++++++
 lib/crypto/blake2s.c                      |  126 +
 lib/{ => crypto}/chacha.c                 |   20 +-
 lib/crypto/chacha20poly1305-selftest.c    | 7393 ++++++++++++++++++++
 lib/crypto/chacha20poly1305.c             |  369 +
 lib/crypto/curve25519-fiat32.c            |  864 +++
 lib/crypto/curve25519-hacl64.c            |  788 +++
 lib/crypto/curve25519.c                   |   25 +
 lib/crypto/libchacha.c                    |   35 +
 lib/crypto/poly1305.c                     |  232 +
 lib/ubsan.c                               |    2 +-
 lib/ubsan.h                               |    2 +-
 67 files changed, 28470 insertions(+), 769 deletions(-)
 create mode 100644 arch/arm/crypto/chacha-glue.c
 delete mode 100644 arch/arm/crypto/chacha-neon-glue.c
 create mode 100644 arch/arm/crypto/chacha-scalar-core.S
 create mode 100644 arch/arm/crypto/curve25519-core.S
 create mode 100644 arch/arm/crypto/curve25519-glue.c
 create mode 100644 arch/arm/crypto/poly1305-armv4.pl
 create mode 100644 arch/arm/crypto/poly1305-core.S_shipped
 create mode 100644 arch/arm/crypto/poly1305-glue.c
 create mode 100644 arch/arm64/crypto/poly1305-armv8.pl
 create mode 100644 arch/arm64/crypto/poly1305-core.S_shipped
 create mode 100644 arch/arm64/crypto/poly1305-glue.c
 create mode 100644 arch/mips/crypto/chacha-core.S
 create mode 100644 arch/mips/crypto/chacha-glue.c
 create mode 100644 arch/mips/crypto/poly1305-glue.c
 create mode 100644 arch/mips/crypto/poly1305-mips.pl
 create mode 100644 arch/x86/crypto/blake2s-core.S
 create mode 100644 arch/x86/crypto/blake2s-glue.c
 create mode 100644 arch/x86/crypto/curve25519-x86_64.c
 create mode 100644 crypto/blake2s-generic.c
 create mode 100644 crypto/curve25519-generic.c
 create mode 100644 include/crypto/blake2s.h
 create mode 100644 include/crypto/chacha20poly1305.h
 create mode 100644 include/crypto/curve25519.h
 create mode 100644 include/crypto/internal/blake2s.h
 create mode 100644 include/crypto/internal/chacha.h
 create mode 100644 include/crypto/internal/poly1305.h
 create mode 100644 lib/crypto/Kconfig
 create mode 100644 lib/crypto/blake2s-generic.c
 create mode 100644 lib/crypto/blake2s-selftest.c
 create mode 100644 lib/crypto/blake2s.c
 rename lib/{ => crypto}/chacha.c (88%)
 create mode 100644 lib/crypto/chacha20poly1305-selftest.c
 create mode 100644 lib/crypto/chacha20poly1305.c
 create mode 100644 lib/crypto/curve25519-fiat32.c
 create mode 100644 lib/crypto/curve25519-hacl64.c
 create mode 100644 lib/crypto/curve25519.c
 create mode 100644 lib/crypto/libchacha.c
 create mode 100644 lib/crypto/poly1305.c

-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4 01/35] crypto: tidy up lib/crypto Kconfig and Makefile
  2019-10-17 19:08 [PATCH v4 00/35] crypto: crypto API library interfaces for WireGuard Ard Biesheuvel
@ 2019-10-17 19:08 ` Ard Biesheuvel
  2019-10-17 19:08 ` [PATCH v4 02/35] crypto: chacha - move existing library code into lib/crypto Ard Biesheuvel
                   ` (33 subsequent siblings)
  34 siblings, 0 replies; 50+ messages in thread
From: Ard Biesheuvel @ 2019-10-17 19:08 UTC (permalink / raw)
  To: linux-crypto
  Cc: Ard Biesheuvel, Herbert Xu, David Miller, Jason A . Donenfeld,
	Samuel Neves, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Martin Willi, Rene van Dorst, David Sterba

In preparation of introducing a set of crypto library interfaces, tidy
up the Makefile and split off the Kconfig symbols into a separate file.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 crypto/Kconfig      | 13 +------------
 lib/crypto/Kconfig  | 15 +++++++++++++++
 lib/crypto/Makefile | 16 ++++++++--------
 3 files changed, 24 insertions(+), 20 deletions(-)

diff --git a/crypto/Kconfig b/crypto/Kconfig
index 29472fb795f3..70b4f7b238fc 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -878,9 +878,6 @@ config CRYPTO_SHA1_PPC_SPE
 	  SHA-1 secure hash standard (DFIPS 180-4) implemented
 	  using powerpc SPE SIMD instruction set.
 
-config CRYPTO_LIB_SHA256
-	tristate
-
 config CRYPTO_SHA256
 	tristate "SHA224 and SHA256 digest algorithm"
 	select CRYPTO_HASH
@@ -1019,9 +1016,6 @@ config CRYPTO_GHASH_CLMUL_NI_INTEL
 
 comment "Ciphers"
 
-config CRYPTO_LIB_AES
-	tristate
-
 config CRYPTO_AES
 	tristate "AES cipher algorithms"
 	select CRYPTO_ALGAPI
@@ -1150,9 +1144,6 @@ config CRYPTO_ANUBIS
 	  <https://www.cosic.esat.kuleuven.be/nessie/reports/>
 	  <http://www.larc.usp.br/~pbarreto/AnubisPage.html>
 
-config CRYPTO_LIB_ARC4
-	tristate
-
 config CRYPTO_ARC4
 	tristate "ARC4 cipher algorithm"
 	select CRYPTO_BLKCIPHER
@@ -1339,9 +1330,6 @@ config CRYPTO_CAST6_AVX_X86_64
 	  This module provides the Cast6 cipher algorithm that processes
 	  eight blocks parallel using the AVX instruction set.
 
-config CRYPTO_LIB_DES
-	tristate
-
 config CRYPTO_DES
 	tristate "DES and Triple DES EDE cipher algorithms"
 	select CRYPTO_ALGAPI
@@ -1845,6 +1833,7 @@ config CRYPTO_STATS
 config CRYPTO_HASH_INFO
 	bool
 
+source "lib/crypto/Kconfig"
 source "drivers/crypto/Kconfig"
 source "crypto/asymmetric_keys/Kconfig"
 source "certs/Kconfig"
diff --git a/lib/crypto/Kconfig b/lib/crypto/Kconfig
new file mode 100644
index 000000000000..261430051595
--- /dev/null
+++ b/lib/crypto/Kconfig
@@ -0,0 +1,15 @@
+# SPDX-License-Identifier: GPL-2.0
+
+comment "Crypto library routines"
+
+config CRYPTO_LIB_AES
+	tristate
+
+config CRYPTO_LIB_ARC4
+	tristate
+
+config CRYPTO_LIB_DES
+	tristate
+
+config CRYPTO_LIB_SHA256
+	tristate
diff --git a/lib/crypto/Makefile b/lib/crypto/Makefile
index cbe0b6a6450d..63de4cb3fcf8 100644
--- a/lib/crypto/Makefile
+++ b/lib/crypto/Makefile
@@ -1,13 +1,13 @@
 # SPDX-License-Identifier: GPL-2.0
 
-obj-$(CONFIG_CRYPTO_LIB_AES) += libaes.o
-libaes-y := aes.o
+obj-$(CONFIG_CRYPTO_LIB_AES)			+= libaes.o
+libaes-y					:= aes.o
 
-obj-$(CONFIG_CRYPTO_LIB_ARC4) += libarc4.o
-libarc4-y := arc4.o
+obj-$(CONFIG_CRYPTO_LIB_ARC4)			+= libarc4.o
+libarc4-y					:= arc4.o
 
-obj-$(CONFIG_CRYPTO_LIB_DES) += libdes.o
-libdes-y := des.o
+obj-$(CONFIG_CRYPTO_LIB_DES)			+= libdes.o
+libdes-y					:= des.o
 
-obj-$(CONFIG_CRYPTO_LIB_SHA256) += libsha256.o
-libsha256-y := sha256.o
+obj-$(CONFIG_CRYPTO_LIB_SHA256)			+= libsha256.o
+libsha256-y					:= sha256.o
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4 02/35] crypto: chacha - move existing library code into lib/crypto
  2019-10-17 19:08 [PATCH v4 00/35] crypto: crypto API library interfaces for WireGuard Ard Biesheuvel
  2019-10-17 19:08 ` [PATCH v4 01/35] crypto: tidy up lib/crypto Kconfig and Makefile Ard Biesheuvel
@ 2019-10-17 19:08 ` Ard Biesheuvel
  2019-10-23  3:05   ` Eric Biggers
  2019-10-23  3:12   ` Eric Biggers
  2019-10-17 19:09 ` [PATCH v4 03/35] crypto: x86/chacha - depend on generic chacha library instead of crypto driver Ard Biesheuvel
                   ` (32 subsequent siblings)
  34 siblings, 2 replies; 50+ messages in thread
From: Ard Biesheuvel @ 2019-10-17 19:08 UTC (permalink / raw)
  To: linux-crypto
  Cc: Ard Biesheuvel, Herbert Xu, David Miller, Jason A . Donenfeld,
	Samuel Neves, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Martin Willi, Rene van Dorst, David Sterba

Currently, our generic ChaCha implementation consists of a permute
function in lib/chacha.c that operates on the 64-byte ChaCha state
directly [and which is always included into the core kernel since it
is used by the /dev/random driver], and the crypto API plumbing to
expose it as a skcipher.

In order to support in-kernel users that need the ChaCha streamcipher
but have no need [or tolerance] for going through the abstractions of
the crypto API, let's expose the streamcipher bits via a library API
as well, in a way that permits the implementation to be superseded by
an architecture specific one if provided.

So move the streamcipher code into a separate module in lib/crypto,
and expose the init() and crypt() routines to users of the library.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm/crypto/chacha-neon-glue.c   |  2 +-
 arch/arm64/crypto/chacha-neon-glue.c |  2 +-
 arch/x86/crypto/chacha_glue.c        |  2 +-
 crypto/Kconfig                       |  1 +
 crypto/chacha_generic.c              | 60 ++---------------
 include/crypto/chacha.h              | 71 +++++++++++++++-----
 include/crypto/internal/chacha.h     | 41 +++++++++++
 lib/Makefile                         |  3 +-
 lib/crypto/Kconfig                   | 26 +++++++
 lib/crypto/Makefile                  |  4 ++
 lib/{ => crypto}/chacha.c            | 20 +++---
 lib/crypto/libchacha.c               | 35 ++++++++++
 12 files changed, 181 insertions(+), 86 deletions(-)

diff --git a/arch/arm/crypto/chacha-neon-glue.c b/arch/arm/crypto/chacha-neon-glue.c
index a8e9b534c8da..26576772f18b 100644
--- a/arch/arm/crypto/chacha-neon-glue.c
+++ b/arch/arm/crypto/chacha-neon-glue.c
@@ -20,7 +20,7 @@
  */
 
 #include <crypto/algapi.h>
-#include <crypto/chacha.h>
+#include <crypto/internal/chacha.h>
 #include <crypto/internal/simd.h>
 #include <crypto/internal/skcipher.h>
 #include <linux/kernel.h>
diff --git a/arch/arm64/crypto/chacha-neon-glue.c b/arch/arm64/crypto/chacha-neon-glue.c
index 1495d2b18518..d4cc61bfe79d 100644
--- a/arch/arm64/crypto/chacha-neon-glue.c
+++ b/arch/arm64/crypto/chacha-neon-glue.c
@@ -20,7 +20,7 @@
  */
 
 #include <crypto/algapi.h>
-#include <crypto/chacha.h>
+#include <crypto/internal/chacha.h>
 #include <crypto/internal/simd.h>
 #include <crypto/internal/skcipher.h>
 #include <linux/kernel.h>
diff --git a/arch/x86/crypto/chacha_glue.c b/arch/x86/crypto/chacha_glue.c
index 388f95a4ec24..bc62daa8dafd 100644
--- a/arch/x86/crypto/chacha_glue.c
+++ b/arch/x86/crypto/chacha_glue.c
@@ -7,7 +7,7 @@
  */
 
 #include <crypto/algapi.h>
-#include <crypto/chacha.h>
+#include <crypto/internal/chacha.h>
 #include <crypto/internal/simd.h>
 #include <crypto/internal/skcipher.h>
 #include <linux/kernel.h>
diff --git a/crypto/Kconfig b/crypto/Kconfig
index 70b4f7b238fc..153cdefcbd78 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1393,6 +1393,7 @@ config CRYPTO_SALSA20
 
 config CRYPTO_CHACHA20
 	tristate "ChaCha stream cipher algorithms"
+	select CRYPTO_LIB_CHACHA_GENERIC
 	select CRYPTO_BLKCIPHER
 	help
 	  The ChaCha20, XChaCha20, and XChaCha12 stream cipher algorithms.
diff --git a/crypto/chacha_generic.c b/crypto/chacha_generic.c
index 085d8d219987..ebae6d9d9b32 100644
--- a/crypto/chacha_generic.c
+++ b/crypto/chacha_generic.c
@@ -8,29 +8,10 @@
 
 #include <asm/unaligned.h>
 #include <crypto/algapi.h>
-#include <crypto/chacha.h>
+#include <crypto/internal/chacha.h>
 #include <crypto/internal/skcipher.h>
 #include <linux/module.h>
 
-static void chacha_docrypt(u32 *state, u8 *dst, const u8 *src,
-			   unsigned int bytes, int nrounds)
-{
-	/* aligned to potentially speed up crypto_xor() */
-	u8 stream[CHACHA_BLOCK_SIZE] __aligned(sizeof(long));
-
-	while (bytes >= CHACHA_BLOCK_SIZE) {
-		chacha_block(state, stream, nrounds);
-		crypto_xor_cpy(dst, src, stream, CHACHA_BLOCK_SIZE);
-		bytes -= CHACHA_BLOCK_SIZE;
-		dst += CHACHA_BLOCK_SIZE;
-		src += CHACHA_BLOCK_SIZE;
-	}
-	if (bytes) {
-		chacha_block(state, stream, nrounds);
-		crypto_xor_cpy(dst, src, stream, bytes);
-	}
-}
-
 static int chacha_stream_xor(struct skcipher_request *req,
 			     const struct chacha_ctx *ctx, const u8 *iv)
 {
@@ -48,8 +29,8 @@ static int chacha_stream_xor(struct skcipher_request *req,
 		if (nbytes < walk.total)
 			nbytes = round_down(nbytes, CHACHA_BLOCK_SIZE);
 
-		chacha_docrypt(state, walk.dst.virt.addr, walk.src.virt.addr,
-			       nbytes, ctx->nrounds);
+		chacha_crypt_generic(state, walk.dst.virt.addr,
+				     walk.src.virt.addr, nbytes, ctx->nrounds);
 		err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
 	}
 
@@ -58,41 +39,10 @@ static int chacha_stream_xor(struct skcipher_request *req,
 
 void crypto_chacha_init(u32 *state, const struct chacha_ctx *ctx, const u8 *iv)
 {
-	state[0]  = 0x61707865; /* "expa" */
-	state[1]  = 0x3320646e; /* "nd 3" */
-	state[2]  = 0x79622d32; /* "2-by" */
-	state[3]  = 0x6b206574; /* "te k" */
-	state[4]  = ctx->key[0];
-	state[5]  = ctx->key[1];
-	state[6]  = ctx->key[2];
-	state[7]  = ctx->key[3];
-	state[8]  = ctx->key[4];
-	state[9]  = ctx->key[5];
-	state[10] = ctx->key[6];
-	state[11] = ctx->key[7];
-	state[12] = get_unaligned_le32(iv +  0);
-	state[13] = get_unaligned_le32(iv +  4);
-	state[14] = get_unaligned_le32(iv +  8);
-	state[15] = get_unaligned_le32(iv + 12);
+	chacha_init_generic(state, ctx->key, iv);
 }
 EXPORT_SYMBOL_GPL(crypto_chacha_init);
 
-static int chacha_setkey(struct crypto_skcipher *tfm, const u8 *key,
-			 unsigned int keysize, int nrounds)
-{
-	struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
-	int i;
-
-	if (keysize != CHACHA_KEY_SIZE)
-		return -EINVAL;
-
-	for (i = 0; i < ARRAY_SIZE(ctx->key); i++)
-		ctx->key[i] = get_unaligned_le32(key + i * sizeof(u32));
-
-	ctx->nrounds = nrounds;
-	return 0;
-}
-
 int crypto_chacha20_setkey(struct crypto_skcipher *tfm, const u8 *key,
 			   unsigned int keysize)
 {
@@ -126,7 +76,7 @@ int crypto_xchacha_crypt(struct skcipher_request *req)
 
 	/* Compute the subkey given the original key and first 128 nonce bits */
 	crypto_chacha_init(state, ctx, req->iv);
-	hchacha_block(state, subctx.key, ctx->nrounds);
+	hchacha_block_generic(state, subctx.key, ctx->nrounds);
 	subctx.nrounds = ctx->nrounds;
 
 	/* Build the real IV */
diff --git a/include/crypto/chacha.h b/include/crypto/chacha.h
index d1e723c6a37d..78cb9d549b1b 100644
--- a/include/crypto/chacha.h
+++ b/include/crypto/chacha.h
@@ -15,9 +15,8 @@
 #ifndef _CRYPTO_CHACHA_H
 #define _CRYPTO_CHACHA_H
 
-#include <crypto/skcipher.h>
+#include <asm/unaligned.h>
 #include <linux/types.h>
-#include <linux/crypto.h>
 
 /* 32-bit stream position, then 96-bit nonce (RFC7539 convention) */
 #define CHACHA_IV_SIZE		16
@@ -29,26 +28,64 @@
 /* 192-bit nonce, then 64-bit stream position */
 #define XCHACHA_IV_SIZE		32
 
-struct chacha_ctx {
-	u32 key[8];
-	int nrounds;
-};
-
-void chacha_block(u32 *state, u8 *stream, int nrounds);
+void chacha_block_generic(u32 *state, u8 *stream, int nrounds);
 static inline void chacha20_block(u32 *state, u8 *stream)
 {
-	chacha_block(state, stream, 20);
+	chacha_block_generic(state, stream, 20);
+}
+
+void hchacha_block_arch(const u32 *state, u32 *out, int nrounds);
+void hchacha_block_generic(const u32 *state, u32 *out, int nrounds);
+
+static inline void hchacha_block(const u32 *state, u32 *out, int nrounds)
+{
+	if (IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_CHACHA))
+		hchacha_block_arch(state, out, nrounds);
+	else
+		hchacha_block_generic(state, out, nrounds);
+}
+
+void chacha_init_arch(u32 *state, const u32 *key, const u8 *iv);
+static inline void chacha_init_generic(u32 *state, const u32 *key, const u8 *iv)
+{
+	state[0]  = 0x61707865; /* "expa" */
+	state[1]  = 0x3320646e; /* "nd 3" */
+	state[2]  = 0x79622d32; /* "2-by" */
+	state[3]  = 0x6b206574; /* "te k" */
+	state[4]  = key[0];
+	state[5]  = key[1];
+	state[6]  = key[2];
+	state[7]  = key[3];
+	state[8]  = key[4];
+	state[9]  = key[5];
+	state[10] = key[6];
+	state[11] = key[7];
+	state[12] = get_unaligned_le32(iv +  0);
+	state[13] = get_unaligned_le32(iv +  4);
+	state[14] = get_unaligned_le32(iv +  8);
+	state[15] = get_unaligned_le32(iv + 12);
 }
-void hchacha_block(const u32 *in, u32 *out, int nrounds);
 
-void crypto_chacha_init(u32 *state, const struct chacha_ctx *ctx, const u8 *iv);
+static inline void chacha_init(u32 *state, const u32 *key, const u8 *iv)
+{
+	if (IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_CHACHA))
+		chacha_init_arch(state, key, iv);
+	else
+		chacha_init_generic(state, key, iv);
+}
 
-int crypto_chacha20_setkey(struct crypto_skcipher *tfm, const u8 *key,
-			   unsigned int keysize);
-int crypto_chacha12_setkey(struct crypto_skcipher *tfm, const u8 *key,
-			   unsigned int keysize);
+void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src,
+		       unsigned int bytes, int nrounds);
+void chacha_crypt_generic(u32 *state, u8 *dst, const u8 *src,
+			  unsigned int bytes, int nrounds);
 
-int crypto_chacha_crypt(struct skcipher_request *req);
-int crypto_xchacha_crypt(struct skcipher_request *req);
+static inline void chacha_crypt(u32 *state, u8 *dst, const u8 *src,
+				unsigned int bytes, int nrounds)
+{
+	if (IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_CHACHA))
+		chacha_crypt_arch(state, dst, src, bytes, nrounds);
+	else
+		chacha_crypt_generic(state, dst, src, bytes, nrounds);
+}
 
 #endif /* _CRYPTO_CHACHA_H */
diff --git a/include/crypto/internal/chacha.h b/include/crypto/internal/chacha.h
new file mode 100644
index 000000000000..8724545bc959
--- /dev/null
+++ b/include/crypto/internal/chacha.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _CRYPTO_INTERNAL_CHACHA_H
+#define _CRYPTO_INTERNAL_CHACHA_H
+
+#include <crypto/chacha.h>
+#include <crypto/internal/skcipher.h>
+#include <linux/crypto.h>
+
+struct chacha_ctx {
+	u32 key[8];
+	int nrounds;
+};
+
+void crypto_chacha_init(u32 *state, const struct chacha_ctx *ctx, const u8 *iv);
+
+static inline int chacha_setkey(struct crypto_skcipher *tfm, const u8 *key,
+				unsigned int keysize, int nrounds)
+{
+	struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
+	int i;
+
+	if (keysize != CHACHA_KEY_SIZE)
+		return -EINVAL;
+
+	for (i = 0; i < ARRAY_SIZE(ctx->key); i++)
+		ctx->key[i] = get_unaligned_le32(key + i * sizeof(u32));
+
+	ctx->nrounds = nrounds;
+	return 0;
+}
+
+int crypto_chacha20_setkey(struct crypto_skcipher *tfm, const u8 *key,
+			   unsigned int keysize);
+int crypto_chacha12_setkey(struct crypto_skcipher *tfm, const u8 *key,
+			   unsigned int keysize);
+
+int crypto_chacha_crypt(struct skcipher_request *req);
+int crypto_xchacha_crypt(struct skcipher_request *req);
+
+#endif /* _CRYPTO_CHACHA_H */
diff --git a/lib/Makefile b/lib/Makefile
index c5892807e06f..5af38fd5cc60 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -26,8 +26,7 @@ endif
 
 lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 rbtree.o radix-tree.o timerqueue.o xarray.o \
-	 idr.o extable.o \
-	 sha1.o chacha.o irq_regs.o argv_split.o \
+	 idr.o extable.o sha1.o irq_regs.o argv_split.o \
 	 flex_proportions.o ratelimit.o show_mem.o \
 	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
 	 earlycpio.o seq_buf.o siphash.o dec_and_lock.o \
diff --git a/lib/crypto/Kconfig b/lib/crypto/Kconfig
index 261430051595..6a11931ae105 100644
--- a/lib/crypto/Kconfig
+++ b/lib/crypto/Kconfig
@@ -8,6 +8,32 @@ config CRYPTO_LIB_AES
 config CRYPTO_LIB_ARC4
 	tristate
 
+config CRYPTO_ARCH_HAVE_LIB_CHACHA
+	tristate
+	help
+	  Declares whether the architecture provides an arch-specific
+	  accelerated implementation of the ChaCha library interface,
+	  either builtin or as a module.
+
+config CRYPTO_LIB_CHACHA_GENERIC
+	tristate
+	select CRYPTO_ALGAPI
+	help
+	  This symbol can be depended upon by arch implementations of the
+	  ChaCha library interface that require the generic code as a
+	  fallback, e.g., for SIMD implementations. If no arch specific
+	  implementation is enabled, this implementation serves the users
+	  of CRYPTO_LIB_CHACHA.
+
+config CRYPTO_LIB_CHACHA
+	tristate "ChaCha library interface"
+	depends on CRYPTO_ARCH_HAVE_LIB_CHACHA || !CRYPTO_ARCH_HAVE_LIB_CHACHA
+	select CRYPTO_LIB_CHACHA_GENERIC if CRYPTO_ARCH_HAVE_LIB_CHACHA=n
+	help
+	  Enable the ChaCha library interface. This interface may be fulfilled
+	  by either the generic implementation or an arch-specific one, if one
+	  is available and enabled.
+
 config CRYPTO_LIB_DES
 	tristate
 
diff --git a/lib/crypto/Makefile b/lib/crypto/Makefile
index 63de4cb3fcf8..0ce40604e104 100644
--- a/lib/crypto/Makefile
+++ b/lib/crypto/Makefile
@@ -1,5 +1,9 @@
 # SPDX-License-Identifier: GPL-2.0
 
+# chacha is used by the /dev/random driver which is always builtin
+obj-y						+= chacha.o
+obj-$(CONFIG_CRYPTO_LIB_CHACHA_GENERIC)		+= libchacha.o
+
 obj-$(CONFIG_CRYPTO_LIB_AES)			+= libaes.o
 libaes-y					:= aes.o
 
diff --git a/lib/chacha.c b/lib/crypto/chacha.c
similarity index 88%
rename from lib/chacha.c
rename to lib/crypto/chacha.c
index c7c9826564d3..65ead6b0c7e0 100644
--- a/lib/chacha.c
+++ b/lib/crypto/chacha.c
@@ -5,9 +5,11 @@
  * Copyright (C) 2015 Martin Willi
  */
 
+#include <linux/bug.h>
 #include <linux/kernel.h>
 #include <linux/export.h>
 #include <linux/bitops.h>
+#include <linux/string.h>
 #include <linux/cryptohash.h>
 #include <asm/unaligned.h>
 #include <crypto/chacha.h>
@@ -72,7 +74,7 @@ static void chacha_permute(u32 *x, int nrounds)
  * The caller has already converted the endianness of the input.  This function
  * also handles incrementing the block counter in the input matrix.
  */
-void chacha_block(u32 *state, u8 *stream, int nrounds)
+void chacha_block_generic(u32 *state, u8 *stream, int nrounds)
 {
 	u32 x[16];
 	int i;
@@ -86,11 +88,11 @@ void chacha_block(u32 *state, u8 *stream, int nrounds)
 
 	state[12]++;
 }
-EXPORT_SYMBOL(chacha_block);
+EXPORT_SYMBOL(chacha_block_generic);
 
 /**
- * hchacha_block - abbreviated ChaCha core, for XChaCha
- * @in: input state matrix (16 32-bit words)
+ * hchacha_block_generic - abbreviated ChaCha core, for XChaCha
+ * @state: input state matrix (16 32-bit words)
  * @out: output (8 32-bit words)
  * @nrounds: number of rounds (20 or 12; 20 is recommended)
  *
@@ -99,15 +101,15 @@ EXPORT_SYMBOL(chacha_block);
  * skips the final addition of the initial state, and outputs only certain words
  * of the state.  It should not be used for streaming directly.
  */
-void hchacha_block(const u32 *in, u32 *out, int nrounds)
+void hchacha_block_generic(const u32 *state, u32 *stream, int nrounds)
 {
 	u32 x[16];
 
-	memcpy(x, in, 64);
+	memcpy(x, state, 64);
 
 	chacha_permute(x, nrounds);
 
-	memcpy(&out[0], &x[0], 16);
-	memcpy(&out[4], &x[12], 16);
+	memcpy(&stream[0], &x[0], 16);
+	memcpy(&stream[4], &x[12], 16);
 }
-EXPORT_SYMBOL(hchacha_block);
+EXPORT_SYMBOL(hchacha_block_generic);
diff --git a/lib/crypto/libchacha.c b/lib/crypto/libchacha.c
new file mode 100644
index 000000000000..7b4b97811f71
--- /dev/null
+++ b/lib/crypto/libchacha.c
@@ -0,0 +1,35 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * The "hash function" used as the core of the ChaCha stream cipher (RFC7539)
+ *
+ * Copyright (C) 2015 Martin Willi
+ */
+
+#include <linux/kernel.h>
+#include <linux/export.h>
+#include <linux/module.h>
+
+#include <crypto/algapi.h> // for crypto_xor_cpy
+#include <crypto/chacha.h>
+
+void chacha_crypt_generic(u32 *state, u8 *dst, const u8 *src,
+			  unsigned int bytes, int nrounds)
+{
+	/* aligned to potentially speed up crypto_xor() */
+	u8 stream[CHACHA_BLOCK_SIZE] __aligned(sizeof(long));
+
+	while (bytes >= CHACHA_BLOCK_SIZE) {
+		chacha_block_generic(state, stream, nrounds);
+		crypto_xor_cpy(dst, src, stream, CHACHA_BLOCK_SIZE);
+		bytes -= CHACHA_BLOCK_SIZE;
+		dst += CHACHA_BLOCK_SIZE;
+		src += CHACHA_BLOCK_SIZE;
+	}
+	if (bytes) {
+		chacha_block_generic(state, stream, nrounds);
+		crypto_xor_cpy(dst, src, stream, bytes);
+	}
+}
+EXPORT_SYMBOL(chacha_crypt_generic);
+
+MODULE_LICENSE("GPL");
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4 03/35] crypto: x86/chacha - depend on generic chacha library instead of crypto driver
  2019-10-17 19:08 [PATCH v4 00/35] crypto: crypto API library interfaces for WireGuard Ard Biesheuvel
  2019-10-17 19:08 ` [PATCH v4 01/35] crypto: tidy up lib/crypto Kconfig and Makefile Ard Biesheuvel
  2019-10-17 19:08 ` [PATCH v4 02/35] crypto: chacha - move existing library code into lib/crypto Ard Biesheuvel
@ 2019-10-17 19:09 ` Ard Biesheuvel
  2019-10-17 19:09 ` [PATCH v4 04/35] crypto: x86/chacha - expose SIMD ChaCha routine as library function Ard Biesheuvel
                   ` (31 subsequent siblings)
  34 siblings, 0 replies; 50+ messages in thread
From: Ard Biesheuvel @ 2019-10-17 19:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: Ard Biesheuvel, Herbert Xu, David Miller, Jason A . Donenfeld,
	Samuel Neves, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Martin Willi, Rene van Dorst, David Sterba

In preparation of extending the x86 ChaCha driver to also expose the ChaCha
library interface, drop the dependency on the chacha_generic crypto driver
as a non-SIMD fallback, and depend on the generic ChaCha library directly.
This way, we only pull in the code we actually need, without registering
a set of ChaCha skciphers that we will never use.

Since turning the FPU on and off is cheap these days, simplify the SIMD
routine by dropping the per-page yield, which makes for a cleaner switch
to the library API as well. This also allows use to invoke the skcipher
walk routines in non-atomic mode.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/x86/crypto/chacha_glue.c | 98 +++++++++-----------
 crypto/Kconfig                |  2 +-
 2 files changed, 46 insertions(+), 54 deletions(-)

diff --git a/arch/x86/crypto/chacha_glue.c b/arch/x86/crypto/chacha_glue.c
index bc62daa8dafd..a264dcc64679 100644
--- a/arch/x86/crypto/chacha_glue.c
+++ b/arch/x86/crypto/chacha_glue.c
@@ -123,37 +123,38 @@ static void chacha_dosimd(u32 *state, u8 *dst, const u8 *src,
 	}
 }
 
-static int chacha_simd_stream_xor(struct skcipher_walk *walk,
+static int chacha_simd_stream_xor(struct skcipher_request *req,
 				  const struct chacha_ctx *ctx, const u8 *iv)
 {
 	u32 *state, state_buf[16 + 2] __aligned(8);
-	int next_yield = 4096; /* bytes until next FPU yield */
-	int err = 0;
+	struct skcipher_walk walk;
+	int err;
+
+	err = skcipher_walk_virt(&walk, req, false);
 
 	BUILD_BUG_ON(CHACHA_STATE_ALIGN != 16);
 	state = PTR_ALIGN(state_buf + 0, CHACHA_STATE_ALIGN);
 
-	crypto_chacha_init(state, ctx, iv);
+	chacha_init_generic(state, ctx->key, iv);
 
-	while (walk->nbytes > 0) {
-		unsigned int nbytes = walk->nbytes;
+	while (walk.nbytes > 0) {
+		unsigned int nbytes = walk.nbytes;
 
-		if (nbytes < walk->total) {
-			nbytes = round_down(nbytes, walk->stride);
-			next_yield -= nbytes;
-		}
+		if (nbytes < walk.total)
+			nbytes = round_down(nbytes, walk.stride);
 
-		chacha_dosimd(state, walk->dst.virt.addr, walk->src.virt.addr,
-			      nbytes, ctx->nrounds);
-
-		if (next_yield <= 0) {
-			/* temporarily allow preemption */
-			kernel_fpu_end();
+		if (!crypto_simd_usable()) {
+			chacha_crypt_generic(state, walk.dst.virt.addr,
+					     walk.src.virt.addr, nbytes,
+					     ctx->nrounds);
+		} else {
 			kernel_fpu_begin();
-			next_yield = 4096;
+			chacha_dosimd(state, walk.dst.virt.addr,
+				      walk.src.virt.addr, nbytes,
+				      ctx->nrounds);
+			kernel_fpu_end();
 		}
-
-		err = skcipher_walk_done(walk, walk->nbytes - nbytes);
+		err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
 	}
 
 	return err;
@@ -163,55 +164,46 @@ static int chacha_simd(struct skcipher_request *req)
 {
 	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
 	struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
-	struct skcipher_walk walk;
-	int err;
-
-	if (req->cryptlen <= CHACHA_BLOCK_SIZE || !crypto_simd_usable())
-		return crypto_chacha_crypt(req);
 
-	err = skcipher_walk_virt(&walk, req, true);
-	if (err)
-		return err;
-
-	kernel_fpu_begin();
-	err = chacha_simd_stream_xor(&walk, ctx, req->iv);
-	kernel_fpu_end();
-	return err;
+	return chacha_simd_stream_xor(req, ctx, req->iv);
 }
 
 static int xchacha_simd(struct skcipher_request *req)
 {
 	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
 	struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
-	struct skcipher_walk walk;
-	struct chacha_ctx subctx;
 	u32 *state, state_buf[16 + 2] __aligned(8);
+	struct chacha_ctx subctx;
 	u8 real_iv[16];
-	int err;
-
-	if (req->cryptlen <= CHACHA_BLOCK_SIZE || !crypto_simd_usable())
-		return crypto_xchacha_crypt(req);
-
-	err = skcipher_walk_virt(&walk, req, true);
-	if (err)
-		return err;
 
 	BUILD_BUG_ON(CHACHA_STATE_ALIGN != 16);
 	state = PTR_ALIGN(state_buf + 0, CHACHA_STATE_ALIGN);
-	crypto_chacha_init(state, ctx, req->iv);
-
-	kernel_fpu_begin();
-
-	hchacha_block_ssse3(state, subctx.key, ctx->nrounds);
+	chacha_init_generic(state, ctx->key, req->iv);
+
+	if (req->cryptlen > CHACHA_BLOCK_SIZE && crypto_simd_usable()) {
+		kernel_fpu_begin();
+		hchacha_block_ssse3(state, subctx.key, ctx->nrounds);
+		kernel_fpu_end();
+	} else {
+		hchacha_block_generic(state, subctx.key, ctx->nrounds);
+	}
 	subctx.nrounds = ctx->nrounds;
 
 	memcpy(&real_iv[0], req->iv + 24, 8);
 	memcpy(&real_iv[8], req->iv + 16, 8);
-	err = chacha_simd_stream_xor(&walk, &subctx, real_iv);
+	return chacha_simd_stream_xor(req, &subctx, real_iv);
+}
 
-	kernel_fpu_end();
+static int chacha20_setkey(struct crypto_skcipher *tfm, const u8 *key,
+		    unsigned int keysize)
+{
+	return chacha_setkey(tfm, key, keysize, 20);
+}
 
-	return err;
+static int chacha12_setkey(struct crypto_skcipher *tfm, const u8 *key,
+		    unsigned int keysize)
+{
+	return chacha_setkey(tfm, key, keysize, 12);
 }
 
 static struct skcipher_alg algs[] = {
@@ -227,7 +219,7 @@ static struct skcipher_alg algs[] = {
 		.max_keysize		= CHACHA_KEY_SIZE,
 		.ivsize			= CHACHA_IV_SIZE,
 		.chunksize		= CHACHA_BLOCK_SIZE,
-		.setkey			= crypto_chacha20_setkey,
+		.setkey			= chacha20_setkey,
 		.encrypt		= chacha_simd,
 		.decrypt		= chacha_simd,
 	}, {
@@ -242,7 +234,7 @@ static struct skcipher_alg algs[] = {
 		.max_keysize		= CHACHA_KEY_SIZE,
 		.ivsize			= XCHACHA_IV_SIZE,
 		.chunksize		= CHACHA_BLOCK_SIZE,
-		.setkey			= crypto_chacha20_setkey,
+		.setkey			= chacha20_setkey,
 		.encrypt		= xchacha_simd,
 		.decrypt		= xchacha_simd,
 	}, {
@@ -257,7 +249,7 @@ static struct skcipher_alg algs[] = {
 		.max_keysize		= CHACHA_KEY_SIZE,
 		.ivsize			= XCHACHA_IV_SIZE,
 		.chunksize		= CHACHA_BLOCK_SIZE,
-		.setkey			= crypto_chacha12_setkey,
+		.setkey			= chacha12_setkey,
 		.encrypt		= xchacha_simd,
 		.decrypt		= xchacha_simd,
 	},
diff --git a/crypto/Kconfig b/crypto/Kconfig
index 153cdefcbd78..9da4b67ac8e2 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1417,7 +1417,7 @@ config CRYPTO_CHACHA20_X86_64
 	tristate "ChaCha stream cipher algorithms (x86_64/SSSE3/AVX2/AVX-512VL)"
 	depends on X86 && 64BIT
 	select CRYPTO_BLKCIPHER
-	select CRYPTO_CHACHA20
+	select CRYPTO_LIB_CHACHA_GENERIC
 	help
 	  SSSE3, AVX2, and AVX-512VL optimized implementations of the ChaCha20,
 	  XChaCha20, and XChaCha12 stream ciphers.
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4 04/35] crypto: x86/chacha - expose SIMD ChaCha routine as library function
  2019-10-17 19:08 [PATCH v4 00/35] crypto: crypto API library interfaces for WireGuard Ard Biesheuvel
                   ` (2 preceding siblings ...)
  2019-10-17 19:09 ` [PATCH v4 03/35] crypto: x86/chacha - depend on generic chacha library instead of crypto driver Ard Biesheuvel
@ 2019-10-17 19:09 ` Ard Biesheuvel
  2019-10-23  3:10   ` Eric Biggers
  2019-10-23  4:40   ` Eric Biggers
  2019-10-17 19:09 ` [PATCH v4 05/35] crypto: arm64/chacha - depend on generic chacha library instead of crypto driver Ard Biesheuvel
                   ` (30 subsequent siblings)
  34 siblings, 2 replies; 50+ messages in thread
From: Ard Biesheuvel @ 2019-10-17 19:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: Ard Biesheuvel, Herbert Xu, David Miller, Jason A . Donenfeld,
	Samuel Neves, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Martin Willi, Rene van Dorst, David Sterba

Wire the existing x86 SIMD ChaCha code into the new ChaCha library
interface, so that users of the library interface will get the
accelerated version when available.

Given that calls into the library API will always go through the
routines in this module if it is enabled, switch to static keys
to select the optimal implementation available (which may be none
at all, in which case we defer to the generic implementation for
all invocations).

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/x86/crypto/chacha_glue.c | 91 ++++++++++++++------
 crypto/Kconfig                |  1 +
 include/crypto/chacha.h       |  6 ++
 3 files changed, 73 insertions(+), 25 deletions(-)

diff --git a/arch/x86/crypto/chacha_glue.c b/arch/x86/crypto/chacha_glue.c
index a264dcc64679..b788260a1243 100644
--- a/arch/x86/crypto/chacha_glue.c
+++ b/arch/x86/crypto/chacha_glue.c
@@ -21,24 +21,24 @@ asmlinkage void chacha_block_xor_ssse3(u32 *state, u8 *dst, const u8 *src,
 asmlinkage void chacha_4block_xor_ssse3(u32 *state, u8 *dst, const u8 *src,
 					unsigned int len, int nrounds);
 asmlinkage void hchacha_block_ssse3(const u32 *state, u32 *out, int nrounds);
-#ifdef CONFIG_AS_AVX2
+
 asmlinkage void chacha_2block_xor_avx2(u32 *state, u8 *dst, const u8 *src,
 				       unsigned int len, int nrounds);
 asmlinkage void chacha_4block_xor_avx2(u32 *state, u8 *dst, const u8 *src,
 				       unsigned int len, int nrounds);
 asmlinkage void chacha_8block_xor_avx2(u32 *state, u8 *dst, const u8 *src,
 				       unsigned int len, int nrounds);
-static bool chacha_use_avx2;
-#ifdef CONFIG_AS_AVX512
+
 asmlinkage void chacha_2block_xor_avx512vl(u32 *state, u8 *dst, const u8 *src,
 					   unsigned int len, int nrounds);
 asmlinkage void chacha_4block_xor_avx512vl(u32 *state, u8 *dst, const u8 *src,
 					   unsigned int len, int nrounds);
 asmlinkage void chacha_8block_xor_avx512vl(u32 *state, u8 *dst, const u8 *src,
 					   unsigned int len, int nrounds);
-static bool chacha_use_avx512vl;
-#endif
-#endif
+
+static __ro_after_init DEFINE_STATIC_KEY_FALSE(chacha_use_simd);
+static __ro_after_init DEFINE_STATIC_KEY_FALSE(chacha_use_avx2);
+static __ro_after_init DEFINE_STATIC_KEY_FALSE(chacha_use_avx512vl);
 
 static unsigned int chacha_advance(unsigned int len, unsigned int maxblocks)
 {
@@ -49,9 +49,8 @@ static unsigned int chacha_advance(unsigned int len, unsigned int maxblocks)
 static void chacha_dosimd(u32 *state, u8 *dst, const u8 *src,
 			  unsigned int bytes, int nrounds)
 {
-#ifdef CONFIG_AS_AVX2
-#ifdef CONFIG_AS_AVX512
-	if (chacha_use_avx512vl) {
+	if (IS_ENABLED(CONFIG_AS_AVX512) &&
+	    static_branch_likely(&chacha_use_avx512vl)) {
 		while (bytes >= CHACHA_BLOCK_SIZE * 8) {
 			chacha_8block_xor_avx512vl(state, dst, src, bytes,
 						   nrounds);
@@ -79,8 +78,9 @@ static void chacha_dosimd(u32 *state, u8 *dst, const u8 *src,
 			return;
 		}
 	}
-#endif
-	if (chacha_use_avx2) {
+
+	if (IS_ENABLED(CONFIG_AS_AVX2) &&
+	    static_branch_likely(&chacha_use_avx2)) {
 		while (bytes >= CHACHA_BLOCK_SIZE * 8) {
 			chacha_8block_xor_avx2(state, dst, src, bytes, nrounds);
 			bytes -= CHACHA_BLOCK_SIZE * 8;
@@ -104,7 +104,7 @@ static void chacha_dosimd(u32 *state, u8 *dst, const u8 *src,
 			return;
 		}
 	}
-#endif
+
 	while (bytes >= CHACHA_BLOCK_SIZE * 4) {
 		chacha_4block_xor_ssse3(state, dst, src, bytes, nrounds);
 		bytes -= CHACHA_BLOCK_SIZE * 4;
@@ -123,6 +123,43 @@ static void chacha_dosimd(u32 *state, u8 *dst, const u8 *src,
 	}
 }
 
+void hchacha_block_arch(const u32 *state, u32 *stream, int nrounds)
+{
+	state = PTR_ALIGN(state, CHACHA_STATE_ALIGN);
+
+	if (!static_branch_likely(&chacha_use_simd) || !crypto_simd_usable()) {
+		hchacha_block_generic(state, stream, nrounds);
+	} else {
+		kernel_fpu_begin();
+		hchacha_block_ssse3(state, stream, nrounds);
+		kernel_fpu_end();
+	}
+}
+EXPORT_SYMBOL(hchacha_block_arch);
+
+void chacha_init_arch(u32 *state, const u32 *key, const u8 *iv)
+{
+	state = PTR_ALIGN(state, CHACHA_STATE_ALIGN);
+
+	chacha_init_generic(state, key, iv);
+}
+EXPORT_SYMBOL(chacha_init_arch);
+
+void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes,
+		       int nrounds)
+{
+	state = PTR_ALIGN(state, CHACHA_STATE_ALIGN);
+
+	if (!static_branch_likely(&chacha_use_simd) || !crypto_simd_usable() ||
+	    bytes <= CHACHA_BLOCK_SIZE)
+		return chacha_crypt_generic(state, dst, src, bytes, nrounds);
+
+	kernel_fpu_begin();
+	chacha_dosimd(state, dst, src, bytes, nrounds);
+	kernel_fpu_end();
+}
+EXPORT_SYMBOL(chacha_crypt_arch);
+
 static int chacha_simd_stream_xor(struct skcipher_request *req,
 				  const struct chacha_ctx *ctx, const u8 *iv)
 {
@@ -143,7 +180,8 @@ static int chacha_simd_stream_xor(struct skcipher_request *req,
 		if (nbytes < walk.total)
 			nbytes = round_down(nbytes, walk.stride);
 
-		if (!crypto_simd_usable()) {
+		if (!static_branch_likely(&chacha_use_simd) ||
+		    !crypto_simd_usable()) {
 			chacha_crypt_generic(state, walk.dst.virt.addr,
 					     walk.src.virt.addr, nbytes,
 					     ctx->nrounds);
@@ -258,18 +296,21 @@ static struct skcipher_alg algs[] = {
 static int __init chacha_simd_mod_init(void)
 {
 	if (!boot_cpu_has(X86_FEATURE_SSSE3))
-		return -ENODEV;
-
-#ifdef CONFIG_AS_AVX2
-	chacha_use_avx2 = boot_cpu_has(X86_FEATURE_AVX) &&
-			  boot_cpu_has(X86_FEATURE_AVX2) &&
-			  cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, NULL);
-#ifdef CONFIG_AS_AVX512
-	chacha_use_avx512vl = chacha_use_avx2 &&
-			      boot_cpu_has(X86_FEATURE_AVX512VL) &&
-			      boot_cpu_has(X86_FEATURE_AVX512BW); /* kmovq */
-#endif
-#endif
+		return 0;
+
+	static_branch_enable(&chacha_use_simd);
+
+	if (IS_ENABLED(CONFIG_AS_AVX2) &&
+	    boot_cpu_has(X86_FEATURE_AVX) &&
+	    boot_cpu_has(X86_FEATURE_AVX2) &&
+	    cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, NULL)) {
+		static_branch_enable(&chacha_use_avx2);
+
+		if (IS_ENABLED(CONFIG_AS_AVX512) &&
+		    boot_cpu_has(X86_FEATURE_AVX512VL) &&
+		    boot_cpu_has(X86_FEATURE_AVX512BW)) /* kmovq */
+			static_branch_enable(&chacha_use_avx512vl);
+	}
 	return crypto_register_skciphers(algs, ARRAY_SIZE(algs));
 }
 
diff --git a/crypto/Kconfig b/crypto/Kconfig
index 9da4b67ac8e2..4f7212fa0170 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1418,6 +1418,7 @@ config CRYPTO_CHACHA20_X86_64
 	depends on X86 && 64BIT
 	select CRYPTO_BLKCIPHER
 	select CRYPTO_LIB_CHACHA_GENERIC
+	select CRYPTO_ARCH_HAVE_LIB_CHACHA
 	help
 	  SSSE3, AVX2, and AVX-512VL optimized implementations of the ChaCha20,
 	  XChaCha20, and XChaCha12 stream ciphers.
diff --git a/include/crypto/chacha.h b/include/crypto/chacha.h
index 78cb9d549b1b..33034b69b087 100644
--- a/include/crypto/chacha.h
+++ b/include/crypto/chacha.h
@@ -25,6 +25,12 @@
 #define CHACHA_BLOCK_SIZE	64
 #define CHACHAPOLY_IV_SIZE	12
 
+#ifdef CONFIG_X86_64
+#define CHACHA_STATE_WORDS	((CHACHA_BLOCK_SIZE + 12) / sizeof(u32))
+#else
+#define CHACHA_STATE_WORDS	(CHACHA_BLOCK_SIZE / sizeof(u32))
+#endif
+
 /* 192-bit nonce, then 64-bit stream position */
 #define XCHACHA_IV_SIZE		32
 
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4 05/35] crypto: arm64/chacha - depend on generic chacha library instead of crypto driver
  2019-10-17 19:08 [PATCH v4 00/35] crypto: crypto API library interfaces for WireGuard Ard Biesheuvel
                   ` (3 preceding siblings ...)
  2019-10-17 19:09 ` [PATCH v4 04/35] crypto: x86/chacha - expose SIMD ChaCha routine as library function Ard Biesheuvel
@ 2019-10-17 19:09 ` Ard Biesheuvel
  2019-10-17 19:09 ` [PATCH v4 06/35] crypto: arm64/chacha - expose arm64 ChaCha routine as library function Ard Biesheuvel
                   ` (29 subsequent siblings)
  34 siblings, 0 replies; 50+ messages in thread
From: Ard Biesheuvel @ 2019-10-17 19:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: Ard Biesheuvel, Herbert Xu, David Miller, Jason A . Donenfeld,
	Samuel Neves, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Martin Willi, Rene van Dorst, David Sterba

Depend on the generic ChaCha library routines instead of pulling in the
generic ChaCha skcipher driver, which is more than we need, and makes
managing the dependencies between the generic library, generic driver,
accelerated library and driver more complicated.

While at it, drop the logic to prefer the scalar code on short inputs.
Turning the NEON on and off is cheap these days, and one major use case
for ChaCha20 is ChaCha20-Poly1305, which is guaranteed to hit the scalar
path upon every invocation  (when doing the Poly1305 nonce generation)

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/crypto/Kconfig            |  2 +-
 arch/arm64/crypto/chacha-neon-glue.c | 52 +++++++++++++-------
 2 files changed, 35 insertions(+), 19 deletions(-)

diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index 4922c4451e7c..fdf52d5f18f9 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -103,7 +103,7 @@ config CRYPTO_CHACHA20_NEON
 	tristate "ChaCha20, XChaCha20, and XChaCha12 stream ciphers using NEON instructions"
 	depends on KERNEL_MODE_NEON
 	select CRYPTO_BLKCIPHER
-	select CRYPTO_CHACHA20
+	select CRYPTO_LIB_CHACHA_GENERIC
 
 config CRYPTO_NHPOLY1305_NEON
 	tristate "NHPoly1305 hash function using NEON instructions (for Adiantum)"
diff --git a/arch/arm64/crypto/chacha-neon-glue.c b/arch/arm64/crypto/chacha-neon-glue.c
index d4cc61bfe79d..36189514a616 100644
--- a/arch/arm64/crypto/chacha-neon-glue.c
+++ b/arch/arm64/crypto/chacha-neon-glue.c
@@ -68,7 +68,7 @@ static int chacha_neon_stream_xor(struct skcipher_request *req,
 
 	err = skcipher_walk_virt(&walk, req, false);
 
-	crypto_chacha_init(state, ctx, iv);
+	chacha_init_generic(state, ctx->key, iv);
 
 	while (walk.nbytes > 0) {
 		unsigned int nbytes = walk.nbytes;
@@ -76,10 +76,16 @@ static int chacha_neon_stream_xor(struct skcipher_request *req,
 		if (nbytes < walk.total)
 			nbytes = rounddown(nbytes, walk.stride);
 
-		kernel_neon_begin();
-		chacha_doneon(state, walk.dst.virt.addr, walk.src.virt.addr,
-			      nbytes, ctx->nrounds);
-		kernel_neon_end();
+		if (!crypto_simd_usable()) {
+			chacha_crypt_generic(state, walk.dst.virt.addr,
+					     walk.src.virt.addr, nbytes,
+					     ctx->nrounds);
+		} else {
+			kernel_neon_begin();
+			chacha_doneon(state, walk.dst.virt.addr,
+				      walk.src.virt.addr, nbytes, ctx->nrounds);
+			kernel_neon_end();
+		}
 		err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
 	}
 
@@ -91,9 +97,6 @@ static int chacha_neon(struct skcipher_request *req)
 	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
 	struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
 
-	if (req->cryptlen <= CHACHA_BLOCK_SIZE || !crypto_simd_usable())
-		return crypto_chacha_crypt(req);
-
 	return chacha_neon_stream_xor(req, ctx, req->iv);
 }
 
@@ -105,14 +108,15 @@ static int xchacha_neon(struct skcipher_request *req)
 	u32 state[16];
 	u8 real_iv[16];
 
-	if (req->cryptlen <= CHACHA_BLOCK_SIZE || !crypto_simd_usable())
-		return crypto_xchacha_crypt(req);
+	chacha_init_generic(state, ctx->key, req->iv);
 
-	crypto_chacha_init(state, ctx, req->iv);
-
-	kernel_neon_begin();
-	hchacha_block_neon(state, subctx.key, ctx->nrounds);
-	kernel_neon_end();
+	if (crypto_simd_usable()) {
+		kernel_neon_begin();
+		hchacha_block_neon(state, subctx.key, ctx->nrounds);
+		kernel_neon_end();
+	} else {
+		hchacha_block_generic(state, subctx.key, ctx->nrounds);
+	}
 	subctx.nrounds = ctx->nrounds;
 
 	memcpy(&real_iv[0], req->iv + 24, 8);
@@ -120,6 +124,18 @@ static int xchacha_neon(struct skcipher_request *req)
 	return chacha_neon_stream_xor(req, &subctx, real_iv);
 }
 
+static int chacha20_setkey(struct crypto_skcipher *tfm, const u8 *key,
+		    unsigned int keysize)
+{
+	return chacha_setkey(tfm, key, keysize, 20);
+}
+
+static int chacha12_setkey(struct crypto_skcipher *tfm, const u8 *key,
+		    unsigned int keysize)
+{
+	return chacha_setkey(tfm, key, keysize, 12);
+}
+
 static struct skcipher_alg algs[] = {
 	{
 		.base.cra_name		= "chacha20",
@@ -134,7 +150,7 @@ static struct skcipher_alg algs[] = {
 		.ivsize			= CHACHA_IV_SIZE,
 		.chunksize		= CHACHA_BLOCK_SIZE,
 		.walksize		= 5 * CHACHA_BLOCK_SIZE,
-		.setkey			= crypto_chacha20_setkey,
+		.setkey			= chacha20_setkey,
 		.encrypt		= chacha_neon,
 		.decrypt		= chacha_neon,
 	}, {
@@ -150,7 +166,7 @@ static struct skcipher_alg algs[] = {
 		.ivsize			= XCHACHA_IV_SIZE,
 		.chunksize		= CHACHA_BLOCK_SIZE,
 		.walksize		= 5 * CHACHA_BLOCK_SIZE,
-		.setkey			= crypto_chacha20_setkey,
+		.setkey			= chacha20_setkey,
 		.encrypt		= xchacha_neon,
 		.decrypt		= xchacha_neon,
 	}, {
@@ -166,7 +182,7 @@ static struct skcipher_alg algs[] = {
 		.ivsize			= XCHACHA_IV_SIZE,
 		.chunksize		= CHACHA_BLOCK_SIZE,
 		.walksize		= 5 * CHACHA_BLOCK_SIZE,
-		.setkey			= crypto_chacha12_setkey,
+		.setkey			= chacha12_setkey,
 		.encrypt		= xchacha_neon,
 		.decrypt		= xchacha_neon,
 	}
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4 06/35] crypto: arm64/chacha - expose arm64 ChaCha routine as library function
  2019-10-17 19:08 [PATCH v4 00/35] crypto: crypto API library interfaces for WireGuard Ard Biesheuvel
                   ` (4 preceding siblings ...)
  2019-10-17 19:09 ` [PATCH v4 05/35] crypto: arm64/chacha - depend on generic chacha library instead of crypto driver Ard Biesheuvel
@ 2019-10-17 19:09 ` Ard Biesheuvel
  2019-10-23  3:16   ` Eric Biggers
  2019-10-17 19:09 ` [PATCH v4 07/35] crypto: arm/chacha - import Eric Biggers's scalar accelerated ChaCha code Ard Biesheuvel
                   ` (28 subsequent siblings)
  34 siblings, 1 reply; 50+ messages in thread
From: Ard Biesheuvel @ 2019-10-17 19:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: Ard Biesheuvel, Herbert Xu, David Miller, Jason A . Donenfeld,
	Samuel Neves, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Martin Willi, Rene van Dorst, David Sterba

Expose the accelerated NEON ChaCha routine directly as a symbol
export so that users of the ChaCha library API can use it directly.

Given that calls into the library API will always go through the
routines in this module if it is enabled, switch to static keys
to select the optimal implementation available (which may be none
at all, in which case we defer to the generic implementation for
all invocations).

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/crypto/Kconfig            |  1 +
 arch/arm64/crypto/chacha-neon-glue.c | 46 ++++++++++++++++++--
 2 files changed, 43 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index fdf52d5f18f9..17bada4b9dd2 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -104,6 +104,7 @@ config CRYPTO_CHACHA20_NEON
 	depends on KERNEL_MODE_NEON
 	select CRYPTO_BLKCIPHER
 	select CRYPTO_LIB_CHACHA_GENERIC
+	select CRYPTO_ARCH_HAVE_LIB_CHACHA
 
 config CRYPTO_NHPOLY1305_NEON
 	tristate "NHPoly1305 hash function using NEON instructions (for Adiantum)"
diff --git a/arch/arm64/crypto/chacha-neon-glue.c b/arch/arm64/crypto/chacha-neon-glue.c
index 36189514a616..d1310389cf87 100644
--- a/arch/arm64/crypto/chacha-neon-glue.c
+++ b/arch/arm64/crypto/chacha-neon-glue.c
@@ -23,6 +23,7 @@
 #include <crypto/internal/chacha.h>
 #include <crypto/internal/simd.h>
 #include <crypto/internal/skcipher.h>
+#include <linux/jump_label.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 
@@ -36,6 +37,8 @@ asmlinkage void chacha_4block_xor_neon(u32 *state, u8 *dst, const u8 *src,
 				       int nrounds, int bytes);
 asmlinkage void hchacha_block_neon(const u32 *state, u32 *out, int nrounds);
 
+static __ro_after_init DEFINE_STATIC_KEY_FALSE(have_neon);
+
 static void chacha_doneon(u32 *state, u8 *dst, const u8 *src,
 			  int bytes, int nrounds)
 {
@@ -59,6 +62,37 @@ static void chacha_doneon(u32 *state, u8 *dst, const u8 *src,
 	}
 }
 
+void hchacha_block_arch(const u32 *state, u32 *stream, int nrounds)
+{
+	if (!static_branch_likely(&have_neon) || !crypto_simd_usable()) {
+		hchacha_block_generic(state, stream, nrounds);
+	} else {
+		kernel_neon_begin();
+		hchacha_block_neon(state, stream, nrounds);
+		kernel_neon_end();
+	}
+}
+EXPORT_SYMBOL(hchacha_block_arch);
+
+void chacha_init_arch(u32 *state, const u32 *key, const u8 *iv)
+{
+	chacha_init_generic(state, key, iv);
+}
+EXPORT_SYMBOL(chacha_init_arch);
+
+void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes,
+		       int nrounds)
+{
+	if (!static_branch_likely(&have_neon) || bytes <= CHACHA_BLOCK_SIZE ||
+	    !crypto_simd_usable())
+		return chacha_crypt_generic(state, dst, src, bytes, nrounds);
+
+	kernel_neon_begin();
+	chacha_doneon(state, dst, src, bytes, nrounds);
+	kernel_neon_end();
+}
+EXPORT_SYMBOL(chacha_crypt_arch);
+
 static int chacha_neon_stream_xor(struct skcipher_request *req,
 				  const struct chacha_ctx *ctx, const u8 *iv)
 {
@@ -76,7 +110,8 @@ static int chacha_neon_stream_xor(struct skcipher_request *req,
 		if (nbytes < walk.total)
 			nbytes = rounddown(nbytes, walk.stride);
 
-		if (!crypto_simd_usable()) {
+		if (!static_branch_likely(&have_neon) ||
+		    !crypto_simd_usable()) {
 			chacha_crypt_generic(state, walk.dst.virt.addr,
 					     walk.src.virt.addr, nbytes,
 					     ctx->nrounds);
@@ -110,7 +145,7 @@ static int xchacha_neon(struct skcipher_request *req)
 
 	chacha_init_generic(state, ctx->key, req->iv);
 
-	if (crypto_simd_usable()) {
+	if (static_branch_likely(&have_neon) && crypto_simd_usable()) {
 		kernel_neon_begin();
 		hchacha_block_neon(state, subctx.key, ctx->nrounds);
 		kernel_neon_end();
@@ -191,14 +226,17 @@ static struct skcipher_alg algs[] = {
 static int __init chacha_simd_mod_init(void)
 {
 	if (!cpu_have_named_feature(ASIMD))
-		return -ENODEV;
+		return 0;
+
+	static_branch_enable(&have_neon);
 
 	return crypto_register_skciphers(algs, ARRAY_SIZE(algs));
 }
 
 static void __exit chacha_simd_mod_fini(void)
 {
-	crypto_unregister_skciphers(algs, ARRAY_SIZE(algs));
+	if (cpu_have_named_feature(ASIMD))
+		crypto_unregister_skciphers(algs, ARRAY_SIZE(algs));
 }
 
 module_init(chacha_simd_mod_init);
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4 07/35] crypto: arm/chacha - import Eric Biggers's scalar accelerated ChaCha code
  2019-10-17 19:08 [PATCH v4 00/35] crypto: crypto API library interfaces for WireGuard Ard Biesheuvel
                   ` (5 preceding siblings ...)
  2019-10-17 19:09 ` [PATCH v4 06/35] crypto: arm64/chacha - expose arm64 ChaCha routine as library function Ard Biesheuvel
@ 2019-10-17 19:09 ` Ard Biesheuvel
  2019-10-17 19:09 ` [PATCH v4 08/35] crypto: arm/chacha - remove dependency on generic ChaCha driver Ard Biesheuvel
                   ` (27 subsequent siblings)
  34 siblings, 0 replies; 50+ messages in thread
From: Ard Biesheuvel @ 2019-10-17 19:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: Ard Biesheuvel, Herbert Xu, David Miller, Jason A . Donenfeld,
	Samuel Neves, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Martin Willi, Rene van Dorst, David Sterba

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm/crypto/chacha-scalar-core.S | 461 ++++++++++++++++++++
 1 file changed, 461 insertions(+)

diff --git a/arch/arm/crypto/chacha-scalar-core.S b/arch/arm/crypto/chacha-scalar-core.S
new file mode 100644
index 000000000000..2140319b64a0
--- /dev/null
+++ b/arch/arm/crypto/chacha-scalar-core.S
@@ -0,0 +1,461 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2018 Google, Inc.
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+
+/*
+ * Design notes:
+ *
+ * 16 registers would be needed to hold the state matrix, but only 14 are
+ * available because 'sp' and 'pc' cannot be used.  So we spill the elements
+ * (x8, x9) to the stack and swap them out with (x10, x11).  This adds one
+ * 'ldrd' and one 'strd' instruction per round.
+ *
+ * All rotates are performed using the implicit rotate operand accepted by the
+ * 'add' and 'eor' instructions.  This is faster than using explicit rotate
+ * instructions.  To make this work, we allow the values in the second and last
+ * rows of the ChaCha state matrix (rows 'b' and 'd') to temporarily have the
+ * wrong rotation amount.  The rotation amount is then fixed up just in time
+ * when the values are used.  'brot' is the number of bits the values in row 'b'
+ * need to be rotated right to arrive at the correct values, and 'drot'
+ * similarly for row 'd'.  (brot, drot) start out as (0, 0) but we make it such
+ * that they end up as (25, 24) after every round.
+ */
+
+	// ChaCha state registers
+	X0	.req	r0
+	X1	.req	r1
+	X2	.req	r2
+	X3	.req	r3
+	X4	.req	r4
+	X5	.req	r5
+	X6	.req	r6
+	X7	.req	r7
+	X8_X10	.req	r8	// shared by x8 and x10
+	X9_X11	.req	r9	// shared by x9 and x11
+	X12	.req	r10
+	X13	.req	r11
+	X14	.req	r12
+	X15	.req	r14
+
+.Lexpand_32byte_k:
+	// "expand 32-byte k"
+	.word	0x61707865, 0x3320646e, 0x79622d32, 0x6b206574
+
+#ifdef __thumb2__
+#  define adrl adr
+#endif
+
+.macro __rev		out, in,  t0, t1, t2
+.if __LINUX_ARM_ARCH__ >= 6
+	rev		\out, \in
+.else
+	lsl		\t0, \in, #24
+	and		\t1, \in, #0xff00
+	and		\t2, \in, #0xff0000
+	orr		\out, \t0, \in, lsr #24
+	orr		\out, \out, \t1, lsl #8
+	orr		\out, \out, \t2, lsr #8
+.endif
+.endm
+
+.macro _le32_bswap	x,  t0, t1, t2
+#ifdef __ARMEB__
+	__rev		\x, \x,  \t0, \t1, \t2
+#endif
+.endm
+
+.macro _le32_bswap_4x	a, b, c, d,  t0, t1, t2
+	_le32_bswap	\a,  \t0, \t1, \t2
+	_le32_bswap	\b,  \t0, \t1, \t2
+	_le32_bswap	\c,  \t0, \t1, \t2
+	_le32_bswap	\d,  \t0, \t1, \t2
+.endm
+
+.macro __ldrd		a, b, src, offset
+#if __LINUX_ARM_ARCH__ >= 6
+	ldrd		\a, \b, [\src, #\offset]
+#else
+	ldr		\a, [\src, #\offset]
+	ldr		\b, [\src, #\offset + 4]
+#endif
+.endm
+
+.macro __strd		a, b, dst, offset
+#if __LINUX_ARM_ARCH__ >= 6
+	strd		\a, \b, [\dst, #\offset]
+#else
+	str		\a, [\dst, #\offset]
+	str		\b, [\dst, #\offset + 4]
+#endif
+.endm
+
+.macro _halfround	a1, b1, c1, d1,  a2, b2, c2, d2
+
+	// a += b; d ^= a; d = rol(d, 16);
+	add		\a1, \a1, \b1, ror #brot
+	add		\a2, \a2, \b2, ror #brot
+	eor		\d1, \a1, \d1, ror #drot
+	eor		\d2, \a2, \d2, ror #drot
+	// drot == 32 - 16 == 16
+
+	// c += d; b ^= c; b = rol(b, 12);
+	add		\c1, \c1, \d1, ror #16
+	add		\c2, \c2, \d2, ror #16
+	eor		\b1, \c1, \b1, ror #brot
+	eor		\b2, \c2, \b2, ror #brot
+	// brot == 32 - 12 == 20
+
+	// a += b; d ^= a; d = rol(d, 8);
+	add		\a1, \a1, \b1, ror #20
+	add		\a2, \a2, \b2, ror #20
+	eor		\d1, \a1, \d1, ror #16
+	eor		\d2, \a2, \d2, ror #16
+	// drot == 32 - 8 == 24
+
+	// c += d; b ^= c; b = rol(b, 7);
+	add		\c1, \c1, \d1, ror #24
+	add		\c2, \c2, \d2, ror #24
+	eor		\b1, \c1, \b1, ror #20
+	eor		\b2, \c2, \b2, ror #20
+	// brot == 32 - 7 == 25
+.endm
+
+.macro _doubleround
+
+	// column round
+
+	// quarterrounds: (x0, x4, x8, x12) and (x1, x5, x9, x13)
+	_halfround	X0, X4, X8_X10, X12,  X1, X5, X9_X11, X13
+
+	// save (x8, x9); restore (x10, x11)
+	__strd		X8_X10, X9_X11, sp, 0
+	__ldrd		X8_X10, X9_X11, sp, 8
+
+	// quarterrounds: (x2, x6, x10, x14) and (x3, x7, x11, x15)
+	_halfround	X2, X6, X8_X10, X14,  X3, X7, X9_X11, X15
+
+	.set brot, 25
+	.set drot, 24
+
+	// diagonal round
+
+	// quarterrounds: (x0, x5, x10, x15) and (x1, x6, x11, x12)
+	_halfround	X0, X5, X8_X10, X15,  X1, X6, X9_X11, X12
+
+	// save (x10, x11); restore (x8, x9)
+	__strd		X8_X10, X9_X11, sp, 8
+	__ldrd		X8_X10, X9_X11, sp, 0
+
+	// quarterrounds: (x2, x7, x8, x13) and (x3, x4, x9, x14)
+	_halfround	X2, X7, X8_X10, X13,  X3, X4, X9_X11, X14
+.endm
+
+.macro _chacha_permute	nrounds
+	.set brot, 0
+	.set drot, 0
+	.rept \nrounds / 2
+	 _doubleround
+	.endr
+.endm
+
+.macro _chacha		nrounds
+
+.Lnext_block\@:
+	// Stack: unused0-unused1 x10-x11 x0-x15 OUT IN LEN
+	// Registers contain x0-x9,x12-x15.
+
+	// Do the core ChaCha permutation to update x0-x15.
+	_chacha_permute	\nrounds
+
+	add		sp, #8
+	// Stack: x10-x11 orig_x0-orig_x15 OUT IN LEN
+	// Registers contain x0-x9,x12-x15.
+	// x4-x7 are rotated by 'brot'; x12-x15 are rotated by 'drot'.
+
+	// Free up some registers (r8-r12,r14) by pushing (x8-x9,x12-x15).
+	push		{X8_X10, X9_X11, X12, X13, X14, X15}
+
+	// Load (OUT, IN, LEN).
+	ldr		r14, [sp, #96]
+	ldr		r12, [sp, #100]
+	ldr		r11, [sp, #104]
+
+	orr		r10, r14, r12
+
+	// Use slow path if fewer than 64 bytes remain.
+	cmp		r11, #64
+	blt		.Lxor_slowpath\@
+
+	// Use slow path if IN and/or OUT isn't 4-byte aligned.  Needed even on
+	// ARMv6+, since ldmia and stmia (used below) still require alignment.
+	tst		r10, #3
+	bne		.Lxor_slowpath\@
+
+	// Fast path: XOR 64 bytes of aligned data.
+
+	// Stack: x8-x9 x12-x15 x10-x11 orig_x0-orig_x15 OUT IN LEN
+	// Registers: r0-r7 are x0-x7; r8-r11 are free; r12 is IN; r14 is OUT.
+	// x4-x7 are rotated by 'brot'; x12-x15 are rotated by 'drot'.
+
+	// x0-x3
+	__ldrd		r8, r9, sp, 32
+	__ldrd		r10, r11, sp, 40
+	add		X0, X0, r8
+	add		X1, X1, r9
+	add		X2, X2, r10
+	add		X3, X3, r11
+	_le32_bswap_4x	X0, X1, X2, X3,  r8, r9, r10
+	ldmia		r12!, {r8-r11}
+	eor		X0, X0, r8
+	eor		X1, X1, r9
+	eor		X2, X2, r10
+	eor		X3, X3, r11
+	stmia		r14!, {X0-X3}
+
+	// x4-x7
+	__ldrd		r8, r9, sp, 48
+	__ldrd		r10, r11, sp, 56
+	add		X4, r8, X4, ror #brot
+	add		X5, r9, X5, ror #brot
+	ldmia		r12!, {X0-X3}
+	add		X6, r10, X6, ror #brot
+	add		X7, r11, X7, ror #brot
+	_le32_bswap_4x	X4, X5, X6, X7,  r8, r9, r10
+	eor		X4, X4, X0
+	eor		X5, X5, X1
+	eor		X6, X6, X2
+	eor		X7, X7, X3
+	stmia		r14!, {X4-X7}
+
+	// x8-x15
+	pop		{r0-r7}			// (x8-x9,x12-x15,x10-x11)
+	__ldrd		r8, r9, sp, 32
+	__ldrd		r10, r11, sp, 40
+	add		r0, r0, r8		// x8
+	add		r1, r1, r9		// x9
+	add		r6, r6, r10		// x10
+	add		r7, r7, r11		// x11
+	_le32_bswap_4x	r0, r1, r6, r7,  r8, r9, r10
+	ldmia		r12!, {r8-r11}
+	eor		r0, r0, r8		// x8
+	eor		r1, r1, r9		// x9
+	eor		r6, r6, r10		// x10
+	eor		r7, r7, r11		// x11
+	stmia		r14!, {r0,r1,r6,r7}
+	ldmia		r12!, {r0,r1,r6,r7}
+	__ldrd		r8, r9, sp, 48
+	__ldrd		r10, r11, sp, 56
+	add		r2, r8, r2, ror #drot	// x12
+	add		r3, r9, r3, ror #drot	// x13
+	add		r4, r10, r4, ror #drot	// x14
+	add		r5, r11, r5, ror #drot	// x15
+	_le32_bswap_4x	r2, r3, r4, r5,  r9, r10, r11
+	  ldr		r9, [sp, #72]		// load LEN
+	eor		r2, r2, r0		// x12
+	eor		r3, r3, r1		// x13
+	eor		r4, r4, r6		// x14
+	eor		r5, r5, r7		// x15
+	  subs		r9, #64			// decrement and check LEN
+	stmia		r14!, {r2-r5}
+
+	beq		.Ldone\@
+
+.Lprepare_for_next_block\@:
+
+	// Stack: x0-x15 OUT IN LEN
+
+	// Increment block counter (x12)
+	add		r8, #1
+
+	// Store updated (OUT, IN, LEN)
+	str		r14, [sp, #64]
+	str		r12, [sp, #68]
+	str		r9, [sp, #72]
+
+	  mov		r14, sp
+
+	// Store updated block counter (x12)
+	str		r8, [sp, #48]
+
+	  sub		sp, #16
+
+	// Reload state and do next block
+	ldmia		r14!, {r0-r11}		// load x0-x11
+	__strd		r10, r11, sp, 8		// store x10-x11 before state
+	ldmia		r14, {r10-r12,r14}	// load x12-x15
+	b		.Lnext_block\@
+
+.Lxor_slowpath\@:
+	// Slow path: < 64 bytes remaining, or unaligned input or output buffer.
+	// We handle it by storing the 64 bytes of keystream to the stack, then
+	// XOR-ing the needed portion with the data.
+
+	// Allocate keystream buffer
+	sub		sp, #64
+	mov		r14, sp
+
+	// Stack: ks0-ks15 x8-x9 x12-x15 x10-x11 orig_x0-orig_x15 OUT IN LEN
+	// Registers: r0-r7 are x0-x7; r8-r11 are free; r12 is IN; r14 is &ks0.
+	// x4-x7 are rotated by 'brot'; x12-x15 are rotated by 'drot'.
+
+	// Save keystream for x0-x3
+	__ldrd		r8, r9, sp, 96
+	__ldrd		r10, r11, sp, 104
+	add		X0, X0, r8
+	add		X1, X1, r9
+	add		X2, X2, r10
+	add		X3, X3, r11
+	_le32_bswap_4x	X0, X1, X2, X3,  r8, r9, r10
+	stmia		r14!, {X0-X3}
+
+	// Save keystream for x4-x7
+	__ldrd		r8, r9, sp, 112
+	__ldrd		r10, r11, sp, 120
+	add		X4, r8, X4, ror #brot
+	add		X5, r9, X5, ror #brot
+	add		X6, r10, X6, ror #brot
+	add		X7, r11, X7, ror #brot
+	_le32_bswap_4x	X4, X5, X6, X7,  r8, r9, r10
+	  add		r8, sp, #64
+	stmia		r14!, {X4-X7}
+
+	// Save keystream for x8-x15
+	ldm		r8, {r0-r7}		// (x8-x9,x12-x15,x10-x11)
+	__ldrd		r8, r9, sp, 128
+	__ldrd		r10, r11, sp, 136
+	add		r0, r0, r8		// x8
+	add		r1, r1, r9		// x9
+	add		r6, r6, r10		// x10
+	add		r7, r7, r11		// x11
+	_le32_bswap_4x	r0, r1, r6, r7,  r8, r9, r10
+	stmia		r14!, {r0,r1,r6,r7}
+	__ldrd		r8, r9, sp, 144
+	__ldrd		r10, r11, sp, 152
+	add		r2, r8, r2, ror #drot	// x12
+	add		r3, r9, r3, ror #drot	// x13
+	add		r4, r10, r4, ror #drot	// x14
+	add		r5, r11, r5, ror #drot	// x15
+	_le32_bswap_4x	r2, r3, r4, r5,  r9, r10, r11
+	stmia		r14, {r2-r5}
+
+	// Stack: ks0-ks15 unused0-unused7 x0-x15 OUT IN LEN
+	// Registers: r8 is block counter, r12 is IN.
+
+	ldr		r9, [sp, #168]		// LEN
+	ldr		r14, [sp, #160]		// OUT
+	cmp		r9, #64
+	  mov		r0, sp
+	movle		r1, r9
+	movgt		r1, #64
+	// r1 is number of bytes to XOR, in range [1, 64]
+
+.if __LINUX_ARM_ARCH__ < 6
+	orr		r2, r12, r14
+	tst		r2, #3			// IN or OUT misaligned?
+	bne		.Lxor_next_byte\@
+.endif
+
+	// XOR a word at a time
+.rept 16
+	subs		r1, #4
+	blt		.Lxor_words_done\@
+	ldr		r2, [r12], #4
+	ldr		r3, [r0], #4
+	eor		r2, r2, r3
+	str		r2, [r14], #4
+.endr
+	b		.Lxor_slowpath_done\@
+.Lxor_words_done\@:
+	ands		r1, r1, #3
+	beq		.Lxor_slowpath_done\@
+
+	// XOR a byte at a time
+.Lxor_next_byte\@:
+	ldrb		r2, [r12], #1
+	ldrb		r3, [r0], #1
+	eor		r2, r2, r3
+	strb		r2, [r14], #1
+	subs		r1, #1
+	bne		.Lxor_next_byte\@
+
+.Lxor_slowpath_done\@:
+	subs		r9, #64
+	add		sp, #96
+	bgt		.Lprepare_for_next_block\@
+
+.Ldone\@:
+.endm	// _chacha
+
+/*
+ * void chacha20_arm(u8 *out, const u8 *in, size_t len, const u32 key[8],
+ *		     const u32 iv[4]);
+ */
+ENTRY(chacha20_arm)
+	cmp		r2, #0			// len == 0?
+	reteq		lr
+
+	push		{r0-r2,r4-r11,lr}
+
+	// Push state x0-x15 onto stack.
+	// Also store an extra copy of x10-x11 just before the state.
+
+	ldr		r4, [sp, #48]		// iv
+	mov		r0, sp
+	sub		sp, #80
+
+	// iv: x12-x15
+	ldm		r4, {X12,X13,X14,X15}
+	stmdb		r0!, {X12,X13,X14,X15}
+
+	// key: x4-x11
+	__ldrd		X8_X10, X9_X11, r3, 24
+	__strd		X8_X10, X9_X11, sp, 8
+	stmdb		r0!, {X8_X10, X9_X11}
+	ldm		r3, {X4-X9_X11}
+	stmdb		r0!, {X4-X9_X11}
+
+	// constants: x0-x3
+	adrl		X3, .Lexpand_32byte_k
+	ldm		X3, {X0-X3}
+	__strd		X0, X1, sp, 16
+	__strd		X2, X3, sp, 24
+
+	_chacha		20
+
+	add		sp, #76
+	pop		{r4-r11, pc}
+ENDPROC(chacha20_arm)
+
+/*
+ * void hchacha20_arm(const u32 state[16], u32 out[8]);
+ */
+ENTRY(hchacha20_arm)
+	push		{r1,r4-r11,lr}
+
+	mov		r14, r0
+	ldmia		r14!, {r0-r11}		// load x0-x11
+	push		{r10-r11}		// store x10-x11 to stack
+	ldm		r14, {r10-r12,r14}	// load x12-x15
+	sub		sp, #8
+
+	_chacha_permute	20
+
+	// Skip over (unused0-unused1, x10-x11)
+	add		sp, #16
+
+	// Fix up rotations of x12-x15
+	ror		X12, X12, #drot
+	ror		X13, X13, #drot
+	  pop		{r4}			// load 'out'
+	ror		X14, X14, #drot
+	ror		X15, X15, #drot
+
+	// Store (x0-x3,x12-x15) to 'out'
+	stm		r4, {X0,X1,X2,X3,X12,X13,X14,X15}
+
+	pop		{r4-r11,pc}
+ENDPROC(hchacha20_arm)
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4 08/35] crypto: arm/chacha - remove dependency on generic ChaCha driver
  2019-10-17 19:08 [PATCH v4 00/35] crypto: crypto API library interfaces for WireGuard Ard Biesheuvel
                   ` (6 preceding siblings ...)
  2019-10-17 19:09 ` [PATCH v4 07/35] crypto: arm/chacha - import Eric Biggers's scalar accelerated ChaCha code Ard Biesheuvel
@ 2019-10-17 19:09 ` Ard Biesheuvel
  2019-10-23  3:21   ` Eric Biggers
  2019-10-17 19:09 ` [PATCH v4 09/35] crypto: arm/chacha - expose ARM ChaCha routine as library function Ard Biesheuvel
                   ` (26 subsequent siblings)
  34 siblings, 1 reply; 50+ messages in thread
From: Ard Biesheuvel @ 2019-10-17 19:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: Ard Biesheuvel, Herbert Xu, David Miller, Jason A . Donenfeld,
	Samuel Neves, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Martin Willi, Rene van Dorst, David Sterba

Instead of falling back to the generic ChaCha skcipher driver for
non-SIMD cases, use a fast scalar implementation for ARM authored
by Eric Biggers. This removes the module dependency on chacha-generic
altogether, which also simplifies things when we expose the ChaCha
library interface from this module.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm/crypto/Kconfig              |   4 +-
 arch/arm/crypto/Makefile             |   3 +-
 arch/arm/crypto/chacha-glue.c        | 316 ++++++++++++++++++++
 arch/arm/crypto/chacha-neon-glue.c   | 202 -------------
 arch/arm/crypto/chacha-scalar-core.S |  65 ++--
 5 files changed, 351 insertions(+), 239 deletions(-)

diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
index b24df84a1d7a..a31b0b95548d 100644
--- a/arch/arm/crypto/Kconfig
+++ b/arch/arm/crypto/Kconfig
@@ -126,10 +126,8 @@ config CRYPTO_CRC32_ARM_CE
 	select CRYPTO_HASH
 
 config CRYPTO_CHACHA20_NEON
-	tristate "NEON accelerated ChaCha stream cipher algorithms"
-	depends on KERNEL_MODE_NEON
+	tristate "NEON and scalar accelerated ChaCha stream cipher algorithms"
 	select CRYPTO_BLKCIPHER
-	select CRYPTO_CHACHA20
 
 config CRYPTO_NHPOLY1305_NEON
 	tristate "NEON accelerated NHPoly1305 hash function (for Adiantum)"
diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
index 4180f3a13512..6b97dffcf90f 100644
--- a/arch/arm/crypto/Makefile
+++ b/arch/arm/crypto/Makefile
@@ -53,7 +53,8 @@ aes-arm-ce-y	:= aes-ce-core.o aes-ce-glue.o
 ghash-arm-ce-y	:= ghash-ce-core.o ghash-ce-glue.o
 crct10dif-arm-ce-y	:= crct10dif-ce-core.o crct10dif-ce-glue.o
 crc32-arm-ce-y:= crc32-ce-core.o crc32-ce-glue.o
-chacha-neon-y := chacha-neon-core.o chacha-neon-glue.o
+chacha-neon-y := chacha-scalar-core.o chacha-glue.o
+chacha-neon-$(CONFIG_KERNEL_MODE_NEON) += chacha-neon-core.o
 nhpoly1305-neon-y := nh-neon-core.o nhpoly1305-neon-glue.o
 
 ifdef REGENERATE_ARM_CRYPTO
diff --git a/arch/arm/crypto/chacha-glue.c b/arch/arm/crypto/chacha-glue.c
new file mode 100644
index 000000000000..2c0f76bc08b2
--- /dev/null
+++ b/arch/arm/crypto/chacha-glue.c
@@ -0,0 +1,316 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * ARM NEON accelerated ChaCha and XChaCha stream ciphers,
+ * including ChaCha20 (RFC7539)
+ *
+ * Copyright (C) 2016-2019 Linaro, Ltd. <ard.biesheuvel@linaro.org>
+ * Copyright (C) 2015 Martin Willi
+ */
+
+#include <crypto/algapi.h>
+#include <crypto/internal/chacha.h>
+#include <crypto/internal/simd.h>
+#include <crypto/internal/skcipher.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+
+#include <asm/cputype.h>
+#include <asm/hwcap.h>
+#include <asm/neon.h>
+#include <asm/simd.h>
+
+asmlinkage void chacha_block_xor_neon(const u32 *state, u8 *dst, const u8 *src,
+				      int nrounds);
+asmlinkage void chacha_4block_xor_neon(const u32 *state, u8 *dst, const u8 *src,
+				       int nrounds);
+asmlinkage void hchacha_block_arm(const u32 *state, u32 *out, int nrounds);
+asmlinkage void hchacha_block_neon(const u32 *state, u32 *out, int nrounds);
+
+asmlinkage void chacha_doarm(u8 *dst, const u8 *src, unsigned int bytes,
+			     const u32 *state, int nrounds);
+
+static inline bool neon_usable(void)
+{
+	return crypto_simd_usable();
+}
+
+static void chacha_doneon(u32 *state, u8 *dst, const u8 *src,
+			  unsigned int bytes, int nrounds)
+{
+	u8 buf[CHACHA_BLOCK_SIZE];
+
+	while (bytes >= CHACHA_BLOCK_SIZE * 4) {
+		chacha_4block_xor_neon(state, dst, src, nrounds);
+		bytes -= CHACHA_BLOCK_SIZE * 4;
+		src += CHACHA_BLOCK_SIZE * 4;
+		dst += CHACHA_BLOCK_SIZE * 4;
+		state[12] += 4;
+	}
+	while (bytes >= CHACHA_BLOCK_SIZE) {
+		chacha_block_xor_neon(state, dst, src, nrounds);
+		bytes -= CHACHA_BLOCK_SIZE;
+		src += CHACHA_BLOCK_SIZE;
+		dst += CHACHA_BLOCK_SIZE;
+		state[12]++;
+	}
+	if (bytes) {
+		memcpy(buf, src, bytes);
+		chacha_block_xor_neon(state, buf, buf, nrounds);
+		memcpy(dst, buf, bytes);
+	}
+}
+
+static int chacha_stream_xor(struct skcipher_request *req,
+			     const struct chacha_ctx *ctx, const u8 *iv,
+			     bool neon)
+{
+	struct skcipher_walk walk;
+	u32 state[16];
+	int err;
+
+	err = skcipher_walk_virt(&walk, req, false);
+
+	chacha_init_generic(state, ctx->key, iv);
+
+	while (walk.nbytes > 0) {
+		unsigned int nbytes = walk.nbytes;
+
+		if (nbytes < walk.total)
+			nbytes = round_down(nbytes, walk.stride);
+
+		if (!neon) {
+			chacha_doarm(walk.dst.virt.addr, walk.src.virt.addr,
+				     nbytes, state, ctx->nrounds);
+			state[12] += DIV_ROUND_UP(nbytes, CHACHA_BLOCK_SIZE);
+		} else {
+			kernel_neon_begin();
+			chacha_doneon(state, walk.dst.virt.addr,
+				      walk.src.virt.addr, nbytes, ctx->nrounds);
+			kernel_neon_end();
+		}
+		err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
+	}
+
+	return err;
+}
+
+static int do_chacha(struct skcipher_request *req, bool neon)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+	return chacha_stream_xor(req, ctx, req->iv, neon);
+}
+
+static int chacha_arm(struct skcipher_request *req)
+{
+	return do_chacha(req, false);
+}
+
+static int chacha_neon(struct skcipher_request *req)
+{
+	return do_chacha(req, neon_usable());
+}
+
+static int do_xchacha(struct skcipher_request *req, bool neon)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct chacha_ctx subctx;
+	u32 state[16];
+	u8 real_iv[16];
+
+	chacha_init_generic(state, ctx->key, req->iv);
+
+	if (!neon) {
+		hchacha_block_arm(state, subctx.key, ctx->nrounds);
+	} else {
+		kernel_neon_begin();
+		hchacha_block_neon(state, subctx.key, ctx->nrounds);
+		kernel_neon_end();
+	}
+	subctx.nrounds = ctx->nrounds;
+
+	memcpy(&real_iv[0], req->iv + 24, 8);
+	memcpy(&real_iv[8], req->iv + 16, 8);
+	return chacha_stream_xor(req, &subctx, real_iv, neon);
+}
+
+static int xchacha_arm(struct skcipher_request *req)
+{
+	return do_xchacha(req, false);
+}
+
+static int xchacha_neon(struct skcipher_request *req)
+{
+	return do_xchacha(req, neon_usable());
+}
+
+static int chacha20_setkey(struct crypto_skcipher *tfm, const u8 *key,
+		    unsigned int keysize)
+{
+	return chacha_setkey(tfm, key, keysize, 20);
+}
+
+static int chacha12_setkey(struct crypto_skcipher *tfm, const u8 *key,
+		    unsigned int keysize)
+{
+	return chacha_setkey(tfm, key, keysize, 12);
+}
+
+static struct skcipher_alg arm_algs[] = {
+	{
+		.base.cra_name		= "chacha20",
+		.base.cra_driver_name	= "chacha20-arm",
+		.base.cra_priority	= 200,
+		.base.cra_blocksize	= 1,
+		.base.cra_ctxsize	= sizeof(struct chacha_ctx),
+		.base.cra_module	= THIS_MODULE,
+
+		.min_keysize		= CHACHA_KEY_SIZE,
+		.max_keysize		= CHACHA_KEY_SIZE,
+		.ivsize			= CHACHA_IV_SIZE,
+		.chunksize		= CHACHA_BLOCK_SIZE,
+		.setkey			= chacha20_setkey,
+		.encrypt		= chacha_arm,
+		.decrypt		= chacha_arm,
+	}, {
+		.base.cra_name		= "xchacha20",
+		.base.cra_driver_name	= "xchacha20-arm",
+		.base.cra_priority	= 200,
+		.base.cra_blocksize	= 1,
+		.base.cra_ctxsize	= sizeof(struct chacha_ctx),
+		.base.cra_module	= THIS_MODULE,
+
+		.min_keysize		= CHACHA_KEY_SIZE,
+		.max_keysize		= CHACHA_KEY_SIZE,
+		.ivsize			= XCHACHA_IV_SIZE,
+		.chunksize		= CHACHA_BLOCK_SIZE,
+		.setkey			= chacha20_setkey,
+		.encrypt		= xchacha_arm,
+		.decrypt		= xchacha_arm,
+	}, {
+		.base.cra_name		= "xchacha12",
+		.base.cra_driver_name	= "xchacha12-arm",
+		.base.cra_priority	= 200,
+		.base.cra_blocksize	= 1,
+		.base.cra_ctxsize	= sizeof(struct chacha_ctx),
+		.base.cra_module	= THIS_MODULE,
+
+		.min_keysize		= CHACHA_KEY_SIZE,
+		.max_keysize		= CHACHA_KEY_SIZE,
+		.ivsize			= XCHACHA_IV_SIZE,
+		.chunksize		= CHACHA_BLOCK_SIZE,
+		.setkey			= chacha12_setkey,
+		.encrypt		= xchacha_arm,
+		.decrypt		= xchacha_arm,
+	},
+};
+
+static struct skcipher_alg neon_algs[] = {
+	{
+		.base.cra_name		= "chacha20",
+		.base.cra_driver_name	= "chacha20-neon",
+		.base.cra_priority	= 300,
+		.base.cra_blocksize	= 1,
+		.base.cra_ctxsize	= sizeof(struct chacha_ctx),
+		.base.cra_module	= THIS_MODULE,
+
+		.min_keysize		= CHACHA_KEY_SIZE,
+		.max_keysize		= CHACHA_KEY_SIZE,
+		.ivsize			= CHACHA_IV_SIZE,
+		.chunksize		= CHACHA_BLOCK_SIZE,
+		.walksize		= 4 * CHACHA_BLOCK_SIZE,
+		.setkey			= chacha20_setkey,
+		.encrypt		= chacha_neon,
+		.decrypt		= chacha_neon,
+	}, {
+		.base.cra_name		= "xchacha20",
+		.base.cra_driver_name	= "xchacha20-neon",
+		.base.cra_priority	= 300,
+		.base.cra_blocksize	= 1,
+		.base.cra_ctxsize	= sizeof(struct chacha_ctx),
+		.base.cra_module	= THIS_MODULE,
+
+		.min_keysize		= CHACHA_KEY_SIZE,
+		.max_keysize		= CHACHA_KEY_SIZE,
+		.ivsize			= XCHACHA_IV_SIZE,
+		.chunksize		= CHACHA_BLOCK_SIZE,
+		.walksize		= 4 * CHACHA_BLOCK_SIZE,
+		.setkey			= chacha20_setkey,
+		.encrypt		= xchacha_neon,
+		.decrypt		= xchacha_neon,
+	}, {
+		.base.cra_name		= "xchacha12",
+		.base.cra_driver_name	= "xchacha12-neon",
+		.base.cra_priority	= 300,
+		.base.cra_blocksize	= 1,
+		.base.cra_ctxsize	= sizeof(struct chacha_ctx),
+		.base.cra_module	= THIS_MODULE,
+
+		.min_keysize		= CHACHA_KEY_SIZE,
+		.max_keysize		= CHACHA_KEY_SIZE,
+		.ivsize			= XCHACHA_IV_SIZE,
+		.chunksize		= CHACHA_BLOCK_SIZE,
+		.walksize		= 4 * CHACHA_BLOCK_SIZE,
+		.setkey			= chacha12_setkey,
+		.encrypt		= xchacha_neon,
+		.decrypt		= xchacha_neon,
+	}
+};
+
+static int __init chacha_simd_mod_init(void)
+{
+	int err;
+
+	err = crypto_register_skciphers(arm_algs, ARRAY_SIZE(arm_algs));
+	if (err)
+		return err;
+
+	if (IS_ENABLED(CONFIG_KERNEL_MODE_NEON) && (elf_hwcap & HWCAP_NEON)) {
+		int i;
+
+		switch (read_cpuid_part()) {
+		case ARM_CPU_PART_CORTEX_A7:
+		case ARM_CPU_PART_CORTEX_A5:
+			/*
+			 * The Cortex-A7 and Cortex-A5 do not perform well with
+			 * the NEON implementation but do incredibly with the
+			 * scalar one and use less power.
+			 */
+			for (i = 0; i < ARRAY_SIZE(neon_algs); i++)
+				neon_algs[i].base.cra_priority = 0;
+			break;
+		}
+
+		err = crypto_register_skciphers(neon_algs, ARRAY_SIZE(neon_algs));
+		if (err)
+			crypto_unregister_skciphers(arm_algs, ARRAY_SIZE(arm_algs));
+	}
+	return err;
+}
+
+static void __exit chacha_simd_mod_fini(void)
+{
+	crypto_unregister_skciphers(arm_algs, ARRAY_SIZE(arm_algs));
+	if (IS_ENABLED(CONFIG_KERNEL_MODE_NEON) && (elf_hwcap & HWCAP_NEON))
+		crypto_unregister_skciphers(neon_algs, ARRAY_SIZE(neon_algs));
+}
+
+module_init(chacha_simd_mod_init);
+module_exit(chacha_simd_mod_fini);
+
+MODULE_DESCRIPTION("ChaCha and XChaCha stream ciphers (scalar and NEON accelerated)");
+MODULE_AUTHOR("Ard Biesheuvel <ard.biesheuvel@linaro.org>");
+MODULE_LICENSE("GPL v2");
+MODULE_ALIAS_CRYPTO("chacha20");
+MODULE_ALIAS_CRYPTO("chacha20-arm");
+MODULE_ALIAS_CRYPTO("xchacha20");
+MODULE_ALIAS_CRYPTO("xchacha20-arm");
+MODULE_ALIAS_CRYPTO("xchacha12");
+MODULE_ALIAS_CRYPTO("xchacha12-arm");
+#ifdef CONFIG_KERNEL_MODE_NEON
+MODULE_ALIAS_CRYPTO("chacha20-neon");
+MODULE_ALIAS_CRYPTO("xchacha20-neon");
+MODULE_ALIAS_CRYPTO("xchacha12-neon");
+#endif
diff --git a/arch/arm/crypto/chacha-neon-glue.c b/arch/arm/crypto/chacha-neon-glue.c
deleted file mode 100644
index 26576772f18b..000000000000
--- a/arch/arm/crypto/chacha-neon-glue.c
+++ /dev/null
@@ -1,202 +0,0 @@
-/*
- * ARM NEON accelerated ChaCha and XChaCha stream ciphers,
- * including ChaCha20 (RFC7539)
- *
- * Copyright (C) 2016 Linaro, Ltd. <ard.biesheuvel@linaro.org>
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
- *
- * Based on:
- * ChaCha20 256-bit cipher algorithm, RFC7539, SIMD glue code
- *
- * Copyright (C) 2015 Martin Willi
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- */
-
-#include <crypto/algapi.h>
-#include <crypto/internal/chacha.h>
-#include <crypto/internal/simd.h>
-#include <crypto/internal/skcipher.h>
-#include <linux/kernel.h>
-#include <linux/module.h>
-
-#include <asm/hwcap.h>
-#include <asm/neon.h>
-#include <asm/simd.h>
-
-asmlinkage void chacha_block_xor_neon(const u32 *state, u8 *dst, const u8 *src,
-				      int nrounds);
-asmlinkage void chacha_4block_xor_neon(const u32 *state, u8 *dst, const u8 *src,
-				       int nrounds);
-asmlinkage void hchacha_block_neon(const u32 *state, u32 *out, int nrounds);
-
-static void chacha_doneon(u32 *state, u8 *dst, const u8 *src,
-			  unsigned int bytes, int nrounds)
-{
-	u8 buf[CHACHA_BLOCK_SIZE];
-
-	while (bytes >= CHACHA_BLOCK_SIZE * 4) {
-		chacha_4block_xor_neon(state, dst, src, nrounds);
-		bytes -= CHACHA_BLOCK_SIZE * 4;
-		src += CHACHA_BLOCK_SIZE * 4;
-		dst += CHACHA_BLOCK_SIZE * 4;
-		state[12] += 4;
-	}
-	while (bytes >= CHACHA_BLOCK_SIZE) {
-		chacha_block_xor_neon(state, dst, src, nrounds);
-		bytes -= CHACHA_BLOCK_SIZE;
-		src += CHACHA_BLOCK_SIZE;
-		dst += CHACHA_BLOCK_SIZE;
-		state[12]++;
-	}
-	if (bytes) {
-		memcpy(buf, src, bytes);
-		chacha_block_xor_neon(state, buf, buf, nrounds);
-		memcpy(dst, buf, bytes);
-	}
-}
-
-static int chacha_neon_stream_xor(struct skcipher_request *req,
-				  const struct chacha_ctx *ctx, const u8 *iv)
-{
-	struct skcipher_walk walk;
-	u32 state[16];
-	int err;
-
-	err = skcipher_walk_virt(&walk, req, false);
-
-	crypto_chacha_init(state, ctx, iv);
-
-	while (walk.nbytes > 0) {
-		unsigned int nbytes = walk.nbytes;
-
-		if (nbytes < walk.total)
-			nbytes = round_down(nbytes, walk.stride);
-
-		kernel_neon_begin();
-		chacha_doneon(state, walk.dst.virt.addr, walk.src.virt.addr,
-			      nbytes, ctx->nrounds);
-		kernel_neon_end();
-		err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
-	}
-
-	return err;
-}
-
-static int chacha_neon(struct skcipher_request *req)
-{
-	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
-	struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
-
-	if (req->cryptlen <= CHACHA_BLOCK_SIZE || !crypto_simd_usable())
-		return crypto_chacha_crypt(req);
-
-	return chacha_neon_stream_xor(req, ctx, req->iv);
-}
-
-static int xchacha_neon(struct skcipher_request *req)
-{
-	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
-	struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
-	struct chacha_ctx subctx;
-	u32 state[16];
-	u8 real_iv[16];
-
-	if (req->cryptlen <= CHACHA_BLOCK_SIZE || !crypto_simd_usable())
-		return crypto_xchacha_crypt(req);
-
-	crypto_chacha_init(state, ctx, req->iv);
-
-	kernel_neon_begin();
-	hchacha_block_neon(state, subctx.key, ctx->nrounds);
-	kernel_neon_end();
-	subctx.nrounds = ctx->nrounds;
-
-	memcpy(&real_iv[0], req->iv + 24, 8);
-	memcpy(&real_iv[8], req->iv + 16, 8);
-	return chacha_neon_stream_xor(req, &subctx, real_iv);
-}
-
-static struct skcipher_alg algs[] = {
-	{
-		.base.cra_name		= "chacha20",
-		.base.cra_driver_name	= "chacha20-neon",
-		.base.cra_priority	= 300,
-		.base.cra_blocksize	= 1,
-		.base.cra_ctxsize	= sizeof(struct chacha_ctx),
-		.base.cra_module	= THIS_MODULE,
-
-		.min_keysize		= CHACHA_KEY_SIZE,
-		.max_keysize		= CHACHA_KEY_SIZE,
-		.ivsize			= CHACHA_IV_SIZE,
-		.chunksize		= CHACHA_BLOCK_SIZE,
-		.walksize		= 4 * CHACHA_BLOCK_SIZE,
-		.setkey			= crypto_chacha20_setkey,
-		.encrypt		= chacha_neon,
-		.decrypt		= chacha_neon,
-	}, {
-		.base.cra_name		= "xchacha20",
-		.base.cra_driver_name	= "xchacha20-neon",
-		.base.cra_priority	= 300,
-		.base.cra_blocksize	= 1,
-		.base.cra_ctxsize	= sizeof(struct chacha_ctx),
-		.base.cra_module	= THIS_MODULE,
-
-		.min_keysize		= CHACHA_KEY_SIZE,
-		.max_keysize		= CHACHA_KEY_SIZE,
-		.ivsize			= XCHACHA_IV_SIZE,
-		.chunksize		= CHACHA_BLOCK_SIZE,
-		.walksize		= 4 * CHACHA_BLOCK_SIZE,
-		.setkey			= crypto_chacha20_setkey,
-		.encrypt		= xchacha_neon,
-		.decrypt		= xchacha_neon,
-	}, {
-		.base.cra_name		= "xchacha12",
-		.base.cra_driver_name	= "xchacha12-neon",
-		.base.cra_priority	= 300,
-		.base.cra_blocksize	= 1,
-		.base.cra_ctxsize	= sizeof(struct chacha_ctx),
-		.base.cra_module	= THIS_MODULE,
-
-		.min_keysize		= CHACHA_KEY_SIZE,
-		.max_keysize		= CHACHA_KEY_SIZE,
-		.ivsize			= XCHACHA_IV_SIZE,
-		.chunksize		= CHACHA_BLOCK_SIZE,
-		.walksize		= 4 * CHACHA_BLOCK_SIZE,
-		.setkey			= crypto_chacha12_setkey,
-		.encrypt		= xchacha_neon,
-		.decrypt		= xchacha_neon,
-	}
-};
-
-static int __init chacha_simd_mod_init(void)
-{
-	if (!(elf_hwcap & HWCAP_NEON))
-		return -ENODEV;
-
-	return crypto_register_skciphers(algs, ARRAY_SIZE(algs));
-}
-
-static void __exit chacha_simd_mod_fini(void)
-{
-	crypto_unregister_skciphers(algs, ARRAY_SIZE(algs));
-}
-
-module_init(chacha_simd_mod_init);
-module_exit(chacha_simd_mod_fini);
-
-MODULE_DESCRIPTION("ChaCha and XChaCha stream ciphers (NEON accelerated)");
-MODULE_AUTHOR("Ard Biesheuvel <ard.biesheuvel@linaro.org>");
-MODULE_LICENSE("GPL v2");
-MODULE_ALIAS_CRYPTO("chacha20");
-MODULE_ALIAS_CRYPTO("chacha20-neon");
-MODULE_ALIAS_CRYPTO("xchacha20");
-MODULE_ALIAS_CRYPTO("xchacha20-neon");
-MODULE_ALIAS_CRYPTO("xchacha12");
-MODULE_ALIAS_CRYPTO("xchacha12-neon");
diff --git a/arch/arm/crypto/chacha-scalar-core.S b/arch/arm/crypto/chacha-scalar-core.S
index 2140319b64a0..0970ae107590 100644
--- a/arch/arm/crypto/chacha-scalar-core.S
+++ b/arch/arm/crypto/chacha-scalar-core.S
@@ -41,14 +41,6 @@
 	X14	.req	r12
 	X15	.req	r14
 
-.Lexpand_32byte_k:
-	// "expand 32-byte k"
-	.word	0x61707865, 0x3320646e, 0x79622d32, 0x6b206574
-
-#ifdef __thumb2__
-#  define adrl adr
-#endif
-
 .macro __rev		out, in,  t0, t1, t2
 .if __LINUX_ARM_ARCH__ >= 6
 	rev		\out, \in
@@ -391,61 +383,65 @@
 .endm	// _chacha
 
 /*
- * void chacha20_arm(u8 *out, const u8 *in, size_t len, const u32 key[8],
- *		     const u32 iv[4]);
+ * void chacha_doarm(u8 *dst, const u8 *src, unsigned int bytes,
+ *		     u32 *state, int nrounds);
  */
-ENTRY(chacha20_arm)
+ENTRY(chacha_doarm)
 	cmp		r2, #0			// len == 0?
 	reteq		lr
 
+	ldr		ip, [sp]
+	cmp		ip, #12
+
 	push		{r0-r2,r4-r11,lr}
 
 	// Push state x0-x15 onto stack.
 	// Also store an extra copy of x10-x11 just before the state.
 
-	ldr		r4, [sp, #48]		// iv
-	mov		r0, sp
-	sub		sp, #80
-
-	// iv: x12-x15
-	ldm		r4, {X12,X13,X14,X15}
-	stmdb		r0!, {X12,X13,X14,X15}
+	add		X12, r3, #48
+	ldm		X12, {X12,X13,X14,X15}
+	push		{X12,X13,X14,X15}
+	sub		sp, sp, #64
 
-	// key: x4-x11
-	__ldrd		X8_X10, X9_X11, r3, 24
+	__ldrd		X8_X10, X9_X11, r3, 40
 	__strd		X8_X10, X9_X11, sp, 8
-	stmdb		r0!, {X8_X10, X9_X11}
-	ldm		r3, {X4-X9_X11}
-	stmdb		r0!, {X4-X9_X11}
-
-	// constants: x0-x3
-	adrl		X3, .Lexpand_32byte_k
-	ldm		X3, {X0-X3}
+	__strd		X8_X10, X9_X11, sp, 56
+	ldm		r3, {X0-X9_X11}
 	__strd		X0, X1, sp, 16
 	__strd		X2, X3, sp, 24
+	__strd		X4, X5, sp, 32
+	__strd		X6, X7, sp, 40
+	__strd		X8_X10, X9_X11, sp, 48
 
+	beq		1f
 	_chacha		20
 
-	add		sp, #76
+0:	add		sp, #76
 	pop		{r4-r11, pc}
-ENDPROC(chacha20_arm)
+
+1:	_chacha		12
+	b		0b
+ENDPROC(chacha_doarm)
 
 /*
- * void hchacha20_arm(const u32 state[16], u32 out[8]);
+ * void hchacha_block_arm(const u32 state[16], u32 out[8], int nrounds);
  */
-ENTRY(hchacha20_arm)
+ENTRY(hchacha_block_arm)
 	push		{r1,r4-r11,lr}
 
+	cmp		r2, #12			// ChaCha12 ?
+
 	mov		r14, r0
 	ldmia		r14!, {r0-r11}		// load x0-x11
 	push		{r10-r11}		// store x10-x11 to stack
 	ldm		r14, {r10-r12,r14}	// load x12-x15
 	sub		sp, #8
 
+	beq		1f
 	_chacha_permute	20
 
 	// Skip over (unused0-unused1, x10-x11)
-	add		sp, #16
+0:	add		sp, #16
 
 	// Fix up rotations of x12-x15
 	ror		X12, X12, #drot
@@ -458,4 +454,7 @@ ENTRY(hchacha20_arm)
 	stm		r4, {X0,X1,X2,X3,X12,X13,X14,X15}
 
 	pop		{r4-r11,pc}
-ENDPROC(hchacha20_arm)
+
+1:	_chacha_permute	12
+	b		0b
+ENDPROC(hchacha_block_arm)
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4 09/35] crypto: arm/chacha - expose ARM ChaCha routine as library function
  2019-10-17 19:08 [PATCH v4 00/35] crypto: crypto API library interfaces for WireGuard Ard Biesheuvel
                   ` (7 preceding siblings ...)
  2019-10-17 19:09 ` [PATCH v4 08/35] crypto: arm/chacha - remove dependency on generic ChaCha driver Ard Biesheuvel
@ 2019-10-17 19:09 ` Ard Biesheuvel
  2019-10-17 19:09 ` [PATCH v4 10/35] crypto: mips/chacha - import 32r2 ChaCha code from Zinc Ard Biesheuvel
                   ` (25 subsequent siblings)
  34 siblings, 0 replies; 50+ messages in thread
From: Ard Biesheuvel @ 2019-10-17 19:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: Ard Biesheuvel, Herbert Xu, David Miller, Jason A . Donenfeld,
	Samuel Neves, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Martin Willi, Rene van Dorst, David Sterba

Expose the accelerated NEON ChaCha routine directly as a symbol
export so that users of the ChaCha library API can use it directly.

Given that calls into the library API will always go through the
routines in this module if it is enabled, switch to static keys
to select the optimal implementation available (which may be none
at all, in which case we defer to the generic implementation for
all invocations).

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm/crypto/Kconfig       |  1 +
 arch/arm/crypto/chacha-glue.c | 41 +++++++++++++++++++-
 2 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
index a31b0b95548d..265da3801e4f 100644
--- a/arch/arm/crypto/Kconfig
+++ b/arch/arm/crypto/Kconfig
@@ -128,6 +128,7 @@ config CRYPTO_CRC32_ARM_CE
 config CRYPTO_CHACHA20_NEON
 	tristate "NEON and scalar accelerated ChaCha stream cipher algorithms"
 	select CRYPTO_BLKCIPHER
+	select CRYPTO_ARCH_HAVE_LIB_CHACHA
 
 config CRYPTO_NHPOLY1305_NEON
 	tristate "NEON accelerated NHPoly1305 hash function (for Adiantum)"
diff --git a/arch/arm/crypto/chacha-glue.c b/arch/arm/crypto/chacha-glue.c
index 2c0f76bc08b2..fc6e69184d59 100644
--- a/arch/arm/crypto/chacha-glue.c
+++ b/arch/arm/crypto/chacha-glue.c
@@ -11,6 +11,7 @@
 #include <crypto/internal/chacha.h>
 #include <crypto/internal/simd.h>
 #include <crypto/internal/skcipher.h>
+#include <linux/jump_label.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 
@@ -29,9 +30,11 @@ asmlinkage void hchacha_block_neon(const u32 *state, u32 *out, int nrounds);
 asmlinkage void chacha_doarm(u8 *dst, const u8 *src, unsigned int bytes,
 			     const u32 *state, int nrounds);
 
+static __ro_after_init DEFINE_STATIC_KEY_FALSE(use_neon);
+
 static inline bool neon_usable(void)
 {
-	return crypto_simd_usable();
+	return static_branch_likely(&use_neon) && crypto_simd_usable();
 }
 
 static void chacha_doneon(u32 *state, u8 *dst, const u8 *src,
@@ -60,6 +63,40 @@ static void chacha_doneon(u32 *state, u8 *dst, const u8 *src,
 	}
 }
 
+void hchacha_block_arch(const u32 *state, u32 *stream, int nrounds)
+{
+	if (!IS_ENABLED(CONFIG_KERNEL_MODE_NEON) || !neon_usable()) {
+		hchacha_block_arm(state, stream, nrounds);
+	} else {
+		kernel_neon_begin();
+		hchacha_block_neon(state, stream, nrounds);
+		kernel_neon_end();
+	}
+}
+EXPORT_SYMBOL(hchacha_block_arch);
+
+void chacha_init_arch(u32 *state, const u32 *key, const u8 *iv)
+{
+	chacha_init_generic(state, key, iv);
+}
+EXPORT_SYMBOL(chacha_init_arch);
+
+void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes,
+		       int nrounds)
+{
+	if (!IS_ENABLED(CONFIG_KERNEL_MODE_NEON) || !neon_usable() ||
+	    bytes <= CHACHA_BLOCK_SIZE) {
+		chacha_doarm(dst, src, bytes, state, nrounds);
+		state[12] += DIV_ROUND_UP(bytes, CHACHA_BLOCK_SIZE);
+		return;
+	}
+
+	kernel_neon_begin();
+	chacha_doneon(state, dst, src, bytes, nrounds);
+	kernel_neon_end();
+}
+EXPORT_SYMBOL(chacha_crypt_arch);
+
 static int chacha_stream_xor(struct skcipher_request *req,
 			     const struct chacha_ctx *ctx, const u8 *iv,
 			     bool neon)
@@ -281,6 +318,8 @@ static int __init chacha_simd_mod_init(void)
 			for (i = 0; i < ARRAY_SIZE(neon_algs); i++)
 				neon_algs[i].base.cra_priority = 0;
 			break;
+		default:
+			static_branch_enable(&use_neon);
 		}
 
 		err = crypto_register_skciphers(neon_algs, ARRAY_SIZE(neon_algs));
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4 10/35] crypto: mips/chacha - import 32r2 ChaCha code from Zinc
  2019-10-17 19:08 [PATCH v4 00/35] crypto: crypto API library interfaces for WireGuard Ard Biesheuvel
                   ` (8 preceding siblings ...)
  2019-10-17 19:09 ` [PATCH v4 09/35] crypto: arm/chacha - expose ARM ChaCha routine as library function Ard Biesheuvel
@ 2019-10-17 19:09 ` Ard Biesheuvel
  2019-10-17 19:09 ` [PATCH v4 11/35] crypto: mips/chacha - wire up accelerated 32r2 " Ard Biesheuvel
                   ` (24 subsequent siblings)
  34 siblings, 0 replies; 50+ messages in thread
From: Ard Biesheuvel @ 2019-10-17 19:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: Ard Biesheuvel, Herbert Xu, David Miller, Jason A . Donenfeld,
	Samuel Neves, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Martin Willi, Rene van Dorst, David Sterba

From: "Jason A. Donenfeld" <Jason@zx2c4.com>

This imports the accelerated MIPS 32r2 ChaCha20 implementation from the
Zinc patch set.

Co-developed-by: René van Dorst <opensource@vdorst.com>
Signed-off-by: René van Dorst <opensource@vdorst.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/mips/crypto/chacha-core.S | 424 ++++++++++++++++++++
 1 file changed, 424 insertions(+)

diff --git a/arch/mips/crypto/chacha-core.S b/arch/mips/crypto/chacha-core.S
new file mode 100644
index 000000000000..a81e02db95e7
--- /dev/null
+++ b/arch/mips/crypto/chacha-core.S
@@ -0,0 +1,424 @@
+/* SPDX-License-Identifier: GPL-2.0 OR MIT */
+/*
+ * Copyright (C) 2016-2018 René van Dorst <opensource@vdorst.com>. All Rights Reserved.
+ * Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#define MASK_U32		0x3c
+#define CHACHA20_BLOCK_SIZE	64
+#define STACK_SIZE		32
+
+#define X0	$t0
+#define X1	$t1
+#define X2	$t2
+#define X3	$t3
+#define X4	$t4
+#define X5	$t5
+#define X6	$t6
+#define X7	$t7
+#define X8	$t8
+#define X9	$t9
+#define X10	$v1
+#define X11	$s6
+#define X12	$s5
+#define X13	$s4
+#define X14	$s3
+#define X15	$s2
+/* Use regs which are overwritten on exit for Tx so we don't leak clear data. */
+#define T0	$s1
+#define T1	$s0
+#define T(n)	T ## n
+#define X(n)	X ## n
+
+/* Input arguments */
+#define STATE		$a0
+#define OUT		$a1
+#define IN		$a2
+#define BYTES		$a3
+
+/* Output argument */
+/* NONCE[0] is kept in a register and not in memory.
+ * We don't want to touch original value in memory.
+ * Must be incremented every loop iteration.
+ */
+#define NONCE_0		$v0
+
+/* SAVED_X and SAVED_CA are set in the jump table.
+ * Use regs which are overwritten on exit else we don't leak clear data.
+ * They are used to handling the last bytes which are not multiple of 4.
+ */
+#define SAVED_X		X15
+#define SAVED_CA	$s7
+
+#define IS_UNALIGNED	$s7
+
+#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
+#define MSB 0
+#define LSB 3
+#define ROTx rotl
+#define ROTR(n) rotr n, 24
+#define	CPU_TO_LE32(n) \
+	wsbh	n; \
+	rotr	n, 16;
+#else
+#define MSB 3
+#define LSB 0
+#define ROTx rotr
+#define CPU_TO_LE32(n)
+#define ROTR(n)
+#endif
+
+#define FOR_EACH_WORD(x) \
+	x( 0); \
+	x( 1); \
+	x( 2); \
+	x( 3); \
+	x( 4); \
+	x( 5); \
+	x( 6); \
+	x( 7); \
+	x( 8); \
+	x( 9); \
+	x(10); \
+	x(11); \
+	x(12); \
+	x(13); \
+	x(14); \
+	x(15);
+
+#define FOR_EACH_WORD_REV(x) \
+	x(15); \
+	x(14); \
+	x(13); \
+	x(12); \
+	x(11); \
+	x(10); \
+	x( 9); \
+	x( 8); \
+	x( 7); \
+	x( 6); \
+	x( 5); \
+	x( 4); \
+	x( 3); \
+	x( 2); \
+	x( 1); \
+	x( 0);
+
+#define PLUS_ONE_0	 1
+#define PLUS_ONE_1	 2
+#define PLUS_ONE_2	 3
+#define PLUS_ONE_3	 4
+#define PLUS_ONE_4	 5
+#define PLUS_ONE_5	 6
+#define PLUS_ONE_6	 7
+#define PLUS_ONE_7	 8
+#define PLUS_ONE_8	 9
+#define PLUS_ONE_9	10
+#define PLUS_ONE_10	11
+#define PLUS_ONE_11	12
+#define PLUS_ONE_12	13
+#define PLUS_ONE_13	14
+#define PLUS_ONE_14	15
+#define PLUS_ONE_15	16
+#define PLUS_ONE(x)	PLUS_ONE_ ## x
+#define _CONCAT3(a,b,c)	a ## b ## c
+#define CONCAT3(a,b,c)	_CONCAT3(a,b,c)
+
+#define STORE_UNALIGNED(x) \
+CONCAT3(.Lchacha20_mips_xor_unaligned_, PLUS_ONE(x), _b: ;) \
+	.if (x != 12); \
+		lw	T0, (x*4)(STATE); \
+	.endif; \
+	lwl	T1, (x*4)+MSB ## (IN); \
+	lwr	T1, (x*4)+LSB ## (IN); \
+	.if (x == 12); \
+		addu	X ## x, NONCE_0; \
+	.else; \
+		addu	X ## x, T0; \
+	.endif; \
+	CPU_TO_LE32(X ## x); \
+	xor	X ## x, T1; \
+	swl	X ## x, (x*4)+MSB ## (OUT); \
+	swr	X ## x, (x*4)+LSB ## (OUT);
+
+#define STORE_ALIGNED(x) \
+CONCAT3(.Lchacha20_mips_xor_aligned_, PLUS_ONE(x), _b: ;) \
+	.if (x != 12); \
+		lw	T0, (x*4)(STATE); \
+	.endif; \
+	lw	T1, (x*4) ## (IN); \
+	.if (x == 12); \
+		addu	X ## x, NONCE_0; \
+	.else; \
+		addu	X ## x, T0; \
+	.endif; \
+	CPU_TO_LE32(X ## x); \
+	xor	X ## x, T1; \
+	sw	X ## x, (x*4) ## (OUT);
+
+/* Jump table macro.
+ * Used for setup and handling the last bytes, which are not multiple of 4.
+ * X15 is free to store Xn
+ * Every jumptable entry must be equal in size.
+ */
+#define JMPTBL_ALIGNED(x) \
+.Lchacha20_mips_jmptbl_aligned_ ## x: ; \
+	.set	noreorder; \
+	b	.Lchacha20_mips_xor_aligned_ ## x ## _b; \
+	.if (x == 12); \
+		addu	SAVED_X, X ## x, NONCE_0; \
+	.else; \
+		addu	SAVED_X, X ## x, SAVED_CA; \
+	.endif; \
+	.set	reorder
+
+#define JMPTBL_UNALIGNED(x) \
+.Lchacha20_mips_jmptbl_unaligned_ ## x: ; \
+	.set	noreorder; \
+	b	.Lchacha20_mips_xor_unaligned_ ## x ## _b; \
+	.if (x == 12); \
+		addu	SAVED_X, X ## x, NONCE_0; \
+	.else; \
+		addu	SAVED_X, X ## x, SAVED_CA; \
+	.endif; \
+	.set	reorder
+
+#define AXR(A, B, C, D,  K, L, M, N,  V, W, Y, Z,  S) \
+	addu	X(A), X(K); \
+	addu	X(B), X(L); \
+	addu	X(C), X(M); \
+	addu	X(D), X(N); \
+	xor	X(V), X(A); \
+	xor	X(W), X(B); \
+	xor	X(Y), X(C); \
+	xor	X(Z), X(D); \
+	rotl	X(V), S;    \
+	rotl	X(W), S;    \
+	rotl	X(Y), S;    \
+	rotl	X(Z), S;
+
+.text
+.set	reorder
+.set	noat
+.globl	chacha20_mips
+.ent	chacha20_mips
+chacha20_mips:
+	.frame	$sp, STACK_SIZE, $ra
+
+	addiu	$sp, -STACK_SIZE
+
+	/* Return bytes = 0. */
+	beqz	BYTES, .Lchacha20_mips_end
+
+	lw	NONCE_0, 48(STATE)
+
+	/* Save s0-s7 */
+	sw	$s0,  0($sp)
+	sw	$s1,  4($sp)
+	sw	$s2,  8($sp)
+	sw	$s3, 12($sp)
+	sw	$s4, 16($sp)
+	sw	$s5, 20($sp)
+	sw	$s6, 24($sp)
+	sw	$s7, 28($sp)
+
+	/* Test IN or OUT is unaligned.
+	 * IS_UNALIGNED = ( IN | OUT ) & 0x00000003
+	 */
+	or	IS_UNALIGNED, IN, OUT
+	andi	IS_UNALIGNED, 0x3
+
+	/* Set number of rounds */
+	li	$at, 20
+
+	b	.Lchacha20_rounds_start
+
+.align 4
+.Loop_chacha20_rounds:
+	addiu	IN,  CHACHA20_BLOCK_SIZE
+	addiu	OUT, CHACHA20_BLOCK_SIZE
+	addiu	NONCE_0, 1
+
+.Lchacha20_rounds_start:
+	lw	X0,  0(STATE)
+	lw	X1,  4(STATE)
+	lw	X2,  8(STATE)
+	lw	X3,  12(STATE)
+
+	lw	X4,  16(STATE)
+	lw	X5,  20(STATE)
+	lw	X6,  24(STATE)
+	lw	X7,  28(STATE)
+	lw	X8,  32(STATE)
+	lw	X9,  36(STATE)
+	lw	X10, 40(STATE)
+	lw	X11, 44(STATE)
+
+	move	X12, NONCE_0
+	lw	X13, 52(STATE)
+	lw	X14, 56(STATE)
+	lw	X15, 60(STATE)
+
+.Loop_chacha20_xor_rounds:
+	addiu	$at, -2
+	AXR( 0, 1, 2, 3,  4, 5, 6, 7, 12,13,14,15, 16);
+	AXR( 8, 9,10,11, 12,13,14,15,  4, 5, 6, 7, 12);
+	AXR( 0, 1, 2, 3,  4, 5, 6, 7, 12,13,14,15,  8);
+	AXR( 8, 9,10,11, 12,13,14,15,  4, 5, 6, 7,  7);
+	AXR( 0, 1, 2, 3,  5, 6, 7, 4, 15,12,13,14, 16);
+	AXR(10,11, 8, 9, 15,12,13,14,  5, 6, 7, 4, 12);
+	AXR( 0, 1, 2, 3,  5, 6, 7, 4, 15,12,13,14,  8);
+	AXR(10,11, 8, 9, 15,12,13,14,  5, 6, 7, 4,  7);
+	bnez	$at, .Loop_chacha20_xor_rounds
+
+	addiu	BYTES, -(CHACHA20_BLOCK_SIZE)
+
+	/* Is data src/dst unaligned? Jump */
+	bnez	IS_UNALIGNED, .Loop_chacha20_unaligned
+
+	/* Set number rounds here to fill delayslot. */
+	li	$at, 20
+
+	/* BYTES < 0, it has no full block. */
+	bltz	BYTES, .Lchacha20_mips_no_full_block_aligned
+
+	FOR_EACH_WORD_REV(STORE_ALIGNED)
+
+	/* BYTES > 0? Loop again. */
+	bgtz	BYTES, .Loop_chacha20_rounds
+
+	/* Place this here to fill delay slot */
+	addiu	NONCE_0, 1
+
+	/* BYTES < 0? Handle last bytes */
+	bltz	BYTES, .Lchacha20_mips_xor_bytes
+
+.Lchacha20_mips_xor_done:
+	/* Restore used registers */
+	lw	$s0,  0($sp)
+	lw	$s1,  4($sp)
+	lw	$s2,  8($sp)
+	lw	$s3, 12($sp)
+	lw	$s4, 16($sp)
+	lw	$s5, 20($sp)
+	lw	$s6, 24($sp)
+	lw	$s7, 28($sp)
+
+	/* Write NONCE_0 back to right location in state */
+	sw	NONCE_0, 48(STATE)
+
+.Lchacha20_mips_end:
+	addiu	$sp, STACK_SIZE
+	jr	$ra
+
+.Lchacha20_mips_no_full_block_aligned:
+	/* Restore the offset on BYTES */
+	addiu	BYTES, CHACHA20_BLOCK_SIZE
+
+	/* Get number of full WORDS */
+	andi	$at, BYTES, MASK_U32
+
+	/* Load upper half of jump table addr */
+	lui	T0, %hi(.Lchacha20_mips_jmptbl_aligned_0)
+
+	/* Calculate lower half jump table offset */
+	ins	T0, $at, 1, 6
+
+	/* Add offset to STATE */
+	addu	T1, STATE, $at
+
+	/* Add lower half jump table addr */
+	addiu	T0, %lo(.Lchacha20_mips_jmptbl_aligned_0)
+
+	/* Read value from STATE */
+	lw	SAVED_CA, 0(T1)
+
+	/* Store remaining bytecounter as negative value */
+	subu	BYTES, $at, BYTES
+
+	jr	T0
+
+	/* Jump table */
+	FOR_EACH_WORD(JMPTBL_ALIGNED)
+
+
+.Loop_chacha20_unaligned:
+	/* Set number rounds here to fill delayslot. */
+	li	$at, 20
+
+	/* BYTES > 0, it has no full block. */
+	bltz	BYTES, .Lchacha20_mips_no_full_block_unaligned
+
+	FOR_EACH_WORD_REV(STORE_UNALIGNED)
+
+	/* BYTES > 0? Loop again. */
+	bgtz	BYTES, .Loop_chacha20_rounds
+
+	/* Write NONCE_0 back to right location in state */
+	sw	NONCE_0, 48(STATE)
+
+	.set noreorder
+	/* Fall through to byte handling */
+	bgez	BYTES, .Lchacha20_mips_xor_done
+.Lchacha20_mips_xor_unaligned_0_b:
+.Lchacha20_mips_xor_aligned_0_b:
+	/* Place this here to fill delay slot */
+	addiu	NONCE_0, 1
+	.set reorder
+
+.Lchacha20_mips_xor_bytes:
+	addu	IN, $at
+	addu	OUT, $at
+	/* First byte */
+	lbu	T1, 0(IN)
+	addiu	$at, BYTES, 1
+	CPU_TO_LE32(SAVED_X)
+	ROTR(SAVED_X)
+	xor	T1, SAVED_X
+	sb	T1, 0(OUT)
+	beqz	$at, .Lchacha20_mips_xor_done
+	/* Second byte */
+	lbu	T1, 1(IN)
+	addiu	$at, BYTES, 2
+	ROTx	SAVED_X, 8
+	xor	T1, SAVED_X
+	sb	T1, 1(OUT)
+	beqz	$at, .Lchacha20_mips_xor_done
+	/* Third byte */
+	lbu	T1, 2(IN)
+	ROTx	SAVED_X, 8
+	xor	T1, SAVED_X
+	sb	T1, 2(OUT)
+	b	.Lchacha20_mips_xor_done
+
+.Lchacha20_mips_no_full_block_unaligned:
+	/* Restore the offset on BYTES */
+	addiu	BYTES, CHACHA20_BLOCK_SIZE
+
+	/* Get number of full WORDS */
+	andi	$at, BYTES, MASK_U32
+
+	/* Load upper half of jump table addr */
+	lui	T0, %hi(.Lchacha20_mips_jmptbl_unaligned_0)
+
+	/* Calculate lower half jump table offset */
+	ins	T0, $at, 1, 6
+
+	/* Add offset to STATE */
+	addu	T1, STATE, $at
+
+	/* Add lower half jump table addr */
+	addiu	T0, %lo(.Lchacha20_mips_jmptbl_unaligned_0)
+
+	/* Read value from STATE */
+	lw	SAVED_CA, 0(T1)
+
+	/* Store remaining bytecounter as negative value */
+	subu	BYTES, $at, BYTES
+
+	jr	T0
+
+	/* Jump table */
+	FOR_EACH_WORD(JMPTBL_UNALIGNED)
+.end chacha20_mips
+.set at
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4 11/35] crypto: mips/chacha - wire up accelerated 32r2 code from Zinc
  2019-10-17 19:08 [PATCH v4 00/35] crypto: crypto API library interfaces for WireGuard Ard Biesheuvel
                   ` (9 preceding siblings ...)
  2019-10-17 19:09 ` [PATCH v4 10/35] crypto: mips/chacha - import 32r2 ChaCha code from Zinc Ard Biesheuvel
@ 2019-10-17 19:09 ` " Ard Biesheuvel
  2019-10-17 19:09 ` [PATCH v4 12/35] crypto: chacha - unexport chacha_generic routines Ard Biesheuvel
                   ` (23 subsequent siblings)
  34 siblings, 0 replies; 50+ messages in thread
From: Ard Biesheuvel @ 2019-10-17 19:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: Ard Biesheuvel, Herbert Xu, David Miller, Jason A . Donenfeld,
	Samuel Neves, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Martin Willi, Rene van Dorst, David Sterba

This integrates the accelerated MIPS 32r2 implementation of ChaCha
into both the API and library interfaces of the kernel crypto stack.

The significance of this is that, in addition to becoming available
as an accelerated library implementation, it can also be used by
existing crypto API code such as Adiantum (for block encryption on
ultra low performance cores) or IPsec using chacha20poly1305. These
are use cases that have already opted into using the abstract crypto
API. In order to support Adiantum, the core assembler routine has
been adapted to take the round count as a function argument rather
than hardcoding it to 20.

Co-developed-by: René van Dorst <opensource@vdorst.com>
Signed-off-by: René van Dorst <opensource@vdorst.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/mips/Makefile             |   2 +-
 arch/mips/crypto/Makefile      |   4 +
 arch/mips/crypto/chacha-core.S | 159 +++++++++++++------
 arch/mips/crypto/chacha-glue.c | 162 ++++++++++++++++++++
 crypto/Kconfig                 |   6 +
 5 files changed, 289 insertions(+), 44 deletions(-)

diff --git a/arch/mips/Makefile b/arch/mips/Makefile
index cdc09b71febe..8584c047ea59 100644
--- a/arch/mips/Makefile
+++ b/arch/mips/Makefile
@@ -323,7 +323,7 @@ libs-$(CONFIG_MIPS_FP_SUPPORT) += arch/mips/math-emu/
 # See arch/mips/Kbuild for content of core part of the kernel
 core-y += arch/mips/
 
-drivers-$(CONFIG_MIPS_CRC_SUPPORT) += arch/mips/crypto/
+drivers-y			+= arch/mips/crypto/
 drivers-$(CONFIG_OPROFILE)	+= arch/mips/oprofile/
 
 # suspend and hibernation support
diff --git a/arch/mips/crypto/Makefile b/arch/mips/crypto/Makefile
index e07aca572c2e..b528b9d300f1 100644
--- a/arch/mips/crypto/Makefile
+++ b/arch/mips/crypto/Makefile
@@ -4,3 +4,7 @@
 #
 
 obj-$(CONFIG_CRYPTO_CRC32_MIPS) += crc32-mips.o
+
+obj-$(CONFIG_CRYPTO_CHACHA_MIPS) += chacha-mips.o
+chacha-mips-y := chacha-core.o chacha-glue.o
+AFLAGS_chacha-core.o += -O2 # needed to fill branch delay slots
diff --git a/arch/mips/crypto/chacha-core.S b/arch/mips/crypto/chacha-core.S
index a81e02db95e7..5755f69cfe00 100644
--- a/arch/mips/crypto/chacha-core.S
+++ b/arch/mips/crypto/chacha-core.S
@@ -125,7 +125,7 @@
 #define CONCAT3(a,b,c)	_CONCAT3(a,b,c)
 
 #define STORE_UNALIGNED(x) \
-CONCAT3(.Lchacha20_mips_xor_unaligned_, PLUS_ONE(x), _b: ;) \
+CONCAT3(.Lchacha_mips_xor_unaligned_, PLUS_ONE(x), _b: ;) \
 	.if (x != 12); \
 		lw	T0, (x*4)(STATE); \
 	.endif; \
@@ -142,7 +142,7 @@ CONCAT3(.Lchacha20_mips_xor_unaligned_, PLUS_ONE(x), _b: ;) \
 	swr	X ## x, (x*4)+LSB ## (OUT);
 
 #define STORE_ALIGNED(x) \
-CONCAT3(.Lchacha20_mips_xor_aligned_, PLUS_ONE(x), _b: ;) \
+CONCAT3(.Lchacha_mips_xor_aligned_, PLUS_ONE(x), _b: ;) \
 	.if (x != 12); \
 		lw	T0, (x*4)(STATE); \
 	.endif; \
@@ -162,9 +162,9 @@ CONCAT3(.Lchacha20_mips_xor_aligned_, PLUS_ONE(x), _b: ;) \
  * Every jumptable entry must be equal in size.
  */
 #define JMPTBL_ALIGNED(x) \
-.Lchacha20_mips_jmptbl_aligned_ ## x: ; \
+.Lchacha_mips_jmptbl_aligned_ ## x: ; \
 	.set	noreorder; \
-	b	.Lchacha20_mips_xor_aligned_ ## x ## _b; \
+	b	.Lchacha_mips_xor_aligned_ ## x ## _b; \
 	.if (x == 12); \
 		addu	SAVED_X, X ## x, NONCE_0; \
 	.else; \
@@ -173,9 +173,9 @@ CONCAT3(.Lchacha20_mips_xor_aligned_, PLUS_ONE(x), _b: ;) \
 	.set	reorder
 
 #define JMPTBL_UNALIGNED(x) \
-.Lchacha20_mips_jmptbl_unaligned_ ## x: ; \
+.Lchacha_mips_jmptbl_unaligned_ ## x: ; \
 	.set	noreorder; \
-	b	.Lchacha20_mips_xor_unaligned_ ## x ## _b; \
+	b	.Lchacha_mips_xor_unaligned_ ## x ## _b; \
 	.if (x == 12); \
 		addu	SAVED_X, X ## x, NONCE_0; \
 	.else; \
@@ -200,15 +200,18 @@ CONCAT3(.Lchacha20_mips_xor_aligned_, PLUS_ONE(x), _b: ;) \
 .text
 .set	reorder
 .set	noat
-.globl	chacha20_mips
-.ent	chacha20_mips
-chacha20_mips:
+.globl	chacha_crypt_arch
+.ent	chacha_crypt_arch
+chacha_crypt_arch:
 	.frame	$sp, STACK_SIZE, $ra
 
+	/* Load number of rounds */
+	lw	$at, 16($sp)
+
 	addiu	$sp, -STACK_SIZE
 
 	/* Return bytes = 0. */
-	beqz	BYTES, .Lchacha20_mips_end
+	beqz	BYTES, .Lchacha_mips_end
 
 	lw	NONCE_0, 48(STATE)
 
@@ -228,18 +231,15 @@ chacha20_mips:
 	or	IS_UNALIGNED, IN, OUT
 	andi	IS_UNALIGNED, 0x3
 
-	/* Set number of rounds */
-	li	$at, 20
-
-	b	.Lchacha20_rounds_start
+	b	.Lchacha_rounds_start
 
 .align 4
-.Loop_chacha20_rounds:
+.Loop_chacha_rounds:
 	addiu	IN,  CHACHA20_BLOCK_SIZE
 	addiu	OUT, CHACHA20_BLOCK_SIZE
 	addiu	NONCE_0, 1
 
-.Lchacha20_rounds_start:
+.Lchacha_rounds_start:
 	lw	X0,  0(STATE)
 	lw	X1,  4(STATE)
 	lw	X2,  8(STATE)
@@ -259,7 +259,7 @@ chacha20_mips:
 	lw	X14, 56(STATE)
 	lw	X15, 60(STATE)
 
-.Loop_chacha20_xor_rounds:
+.Loop_chacha_xor_rounds:
 	addiu	$at, -2
 	AXR( 0, 1, 2, 3,  4, 5, 6, 7, 12,13,14,15, 16);
 	AXR( 8, 9,10,11, 12,13,14,15,  4, 5, 6, 7, 12);
@@ -269,31 +269,31 @@ chacha20_mips:
 	AXR(10,11, 8, 9, 15,12,13,14,  5, 6, 7, 4, 12);
 	AXR( 0, 1, 2, 3,  5, 6, 7, 4, 15,12,13,14,  8);
 	AXR(10,11, 8, 9, 15,12,13,14,  5, 6, 7, 4,  7);
-	bnez	$at, .Loop_chacha20_xor_rounds
+	bnez	$at, .Loop_chacha_xor_rounds
 
 	addiu	BYTES, -(CHACHA20_BLOCK_SIZE)
 
 	/* Is data src/dst unaligned? Jump */
-	bnez	IS_UNALIGNED, .Loop_chacha20_unaligned
+	bnez	IS_UNALIGNED, .Loop_chacha_unaligned
 
 	/* Set number rounds here to fill delayslot. */
-	li	$at, 20
+	lw	$at, (STACK_SIZE+16)($sp)
 
 	/* BYTES < 0, it has no full block. */
-	bltz	BYTES, .Lchacha20_mips_no_full_block_aligned
+	bltz	BYTES, .Lchacha_mips_no_full_block_aligned
 
 	FOR_EACH_WORD_REV(STORE_ALIGNED)
 
 	/* BYTES > 0? Loop again. */
-	bgtz	BYTES, .Loop_chacha20_rounds
+	bgtz	BYTES, .Loop_chacha_rounds
 
 	/* Place this here to fill delay slot */
 	addiu	NONCE_0, 1
 
 	/* BYTES < 0? Handle last bytes */
-	bltz	BYTES, .Lchacha20_mips_xor_bytes
+	bltz	BYTES, .Lchacha_mips_xor_bytes
 
-.Lchacha20_mips_xor_done:
+.Lchacha_mips_xor_done:
 	/* Restore used registers */
 	lw	$s0,  0($sp)
 	lw	$s1,  4($sp)
@@ -307,11 +307,11 @@ chacha20_mips:
 	/* Write NONCE_0 back to right location in state */
 	sw	NONCE_0, 48(STATE)
 
-.Lchacha20_mips_end:
+.Lchacha_mips_end:
 	addiu	$sp, STACK_SIZE
 	jr	$ra
 
-.Lchacha20_mips_no_full_block_aligned:
+.Lchacha_mips_no_full_block_aligned:
 	/* Restore the offset on BYTES */
 	addiu	BYTES, CHACHA20_BLOCK_SIZE
 
@@ -319,7 +319,7 @@ chacha20_mips:
 	andi	$at, BYTES, MASK_U32
 
 	/* Load upper half of jump table addr */
-	lui	T0, %hi(.Lchacha20_mips_jmptbl_aligned_0)
+	lui	T0, %hi(.Lchacha_mips_jmptbl_aligned_0)
 
 	/* Calculate lower half jump table offset */
 	ins	T0, $at, 1, 6
@@ -328,7 +328,7 @@ chacha20_mips:
 	addu	T1, STATE, $at
 
 	/* Add lower half jump table addr */
-	addiu	T0, %lo(.Lchacha20_mips_jmptbl_aligned_0)
+	addiu	T0, %lo(.Lchacha_mips_jmptbl_aligned_0)
 
 	/* Read value from STATE */
 	lw	SAVED_CA, 0(T1)
@@ -342,31 +342,31 @@ chacha20_mips:
 	FOR_EACH_WORD(JMPTBL_ALIGNED)
 
 
-.Loop_chacha20_unaligned:
+.Loop_chacha_unaligned:
 	/* Set number rounds here to fill delayslot. */
-	li	$at, 20
+	lw	$at, (STACK_SIZE+16)($sp)
 
 	/* BYTES > 0, it has no full block. */
-	bltz	BYTES, .Lchacha20_mips_no_full_block_unaligned
+	bltz	BYTES, .Lchacha_mips_no_full_block_unaligned
 
 	FOR_EACH_WORD_REV(STORE_UNALIGNED)
 
 	/* BYTES > 0? Loop again. */
-	bgtz	BYTES, .Loop_chacha20_rounds
+	bgtz	BYTES, .Loop_chacha_rounds
 
 	/* Write NONCE_0 back to right location in state */
 	sw	NONCE_0, 48(STATE)
 
 	.set noreorder
 	/* Fall through to byte handling */
-	bgez	BYTES, .Lchacha20_mips_xor_done
-.Lchacha20_mips_xor_unaligned_0_b:
-.Lchacha20_mips_xor_aligned_0_b:
+	bgez	BYTES, .Lchacha_mips_xor_done
+.Lchacha_mips_xor_unaligned_0_b:
+.Lchacha_mips_xor_aligned_0_b:
 	/* Place this here to fill delay slot */
 	addiu	NONCE_0, 1
 	.set reorder
 
-.Lchacha20_mips_xor_bytes:
+.Lchacha_mips_xor_bytes:
 	addu	IN, $at
 	addu	OUT, $at
 	/* First byte */
@@ -376,22 +376,22 @@ chacha20_mips:
 	ROTR(SAVED_X)
 	xor	T1, SAVED_X
 	sb	T1, 0(OUT)
-	beqz	$at, .Lchacha20_mips_xor_done
+	beqz	$at, .Lchacha_mips_xor_done
 	/* Second byte */
 	lbu	T1, 1(IN)
 	addiu	$at, BYTES, 2
 	ROTx	SAVED_X, 8
 	xor	T1, SAVED_X
 	sb	T1, 1(OUT)
-	beqz	$at, .Lchacha20_mips_xor_done
+	beqz	$at, .Lchacha_mips_xor_done
 	/* Third byte */
 	lbu	T1, 2(IN)
 	ROTx	SAVED_X, 8
 	xor	T1, SAVED_X
 	sb	T1, 2(OUT)
-	b	.Lchacha20_mips_xor_done
+	b	.Lchacha_mips_xor_done
 
-.Lchacha20_mips_no_full_block_unaligned:
+.Lchacha_mips_no_full_block_unaligned:
 	/* Restore the offset on BYTES */
 	addiu	BYTES, CHACHA20_BLOCK_SIZE
 
@@ -399,7 +399,7 @@ chacha20_mips:
 	andi	$at, BYTES, MASK_U32
 
 	/* Load upper half of jump table addr */
-	lui	T0, %hi(.Lchacha20_mips_jmptbl_unaligned_0)
+	lui	T0, %hi(.Lchacha_mips_jmptbl_unaligned_0)
 
 	/* Calculate lower half jump table offset */
 	ins	T0, $at, 1, 6
@@ -408,7 +408,7 @@ chacha20_mips:
 	addu	T1, STATE, $at
 
 	/* Add lower half jump table addr */
-	addiu	T0, %lo(.Lchacha20_mips_jmptbl_unaligned_0)
+	addiu	T0, %lo(.Lchacha_mips_jmptbl_unaligned_0)
 
 	/* Read value from STATE */
 	lw	SAVED_CA, 0(T1)
@@ -420,5 +420,78 @@ chacha20_mips:
 
 	/* Jump table */
 	FOR_EACH_WORD(JMPTBL_UNALIGNED)
-.end chacha20_mips
+.end chacha_crypt_arch
+.set at
+
+/* Input arguments
+ * STATE	$a0
+ * OUT		$a1
+ * NROUND	$a2
+ */
+
+#undef X12
+#undef X13
+#undef X14
+#undef X15
+
+#define X12	$a3
+#define X13	$at
+#define X14	$v0
+#define X15	STATE
+
+.set noat
+.globl	hchacha_block_arch
+.ent	hchacha_block_arch
+hchacha_block_arch:
+	.frame	$sp, STACK_SIZE, $ra
+
+	addiu	$sp, -STACK_SIZE
+
+	/* Save X11(s6) */
+	sw	X11, 0($sp)
+
+	lw	X0,  0(STATE)
+	lw	X1,  4(STATE)
+	lw	X2,  8(STATE)
+	lw	X3,  12(STATE)
+	lw	X4,  16(STATE)
+	lw	X5,  20(STATE)
+	lw	X6,  24(STATE)
+	lw	X7,  28(STATE)
+	lw	X8,  32(STATE)
+	lw	X9,  36(STATE)
+	lw	X10, 40(STATE)
+	lw	X11, 44(STATE)
+	lw	X12, 48(STATE)
+	lw	X13, 52(STATE)
+	lw	X14, 56(STATE)
+	lw	X15, 60(STATE)
+
+.Loop_hchacha_xor_rounds:
+	addiu	$a2, -2
+	AXR( 0, 1, 2, 3,  4, 5, 6, 7, 12,13,14,15, 16);
+	AXR( 8, 9,10,11, 12,13,14,15,  4, 5, 6, 7, 12);
+	AXR( 0, 1, 2, 3,  4, 5, 6, 7, 12,13,14,15,  8);
+	AXR( 8, 9,10,11, 12,13,14,15,  4, 5, 6, 7,  7);
+	AXR( 0, 1, 2, 3,  5, 6, 7, 4, 15,12,13,14, 16);
+	AXR(10,11, 8, 9, 15,12,13,14,  5, 6, 7, 4, 12);
+	AXR( 0, 1, 2, 3,  5, 6, 7, 4, 15,12,13,14,  8);
+	AXR(10,11, 8, 9, 15,12,13,14,  5, 6, 7, 4,  7);
+	bnez	$a2, .Loop_hchacha_xor_rounds
+
+	/* Restore used register */
+	lw	X11, 0($sp)
+
+	sw	X0,  0(OUT)
+	sw	X1,  4(OUT)
+	sw	X2,  8(OUT)
+	sw	X3,  12(OUT)
+	sw	X12, 16(OUT)
+	sw	X13, 20(OUT)
+	sw	X14, 24(OUT)
+	sw	X15, 28(OUT)
+
+	addiu	$sp, STACK_SIZE
+	jr	$ra
+.end hchacha_block_arch
 .set at
diff --git a/arch/mips/crypto/chacha-glue.c b/arch/mips/crypto/chacha-glue.c
new file mode 100644
index 000000000000..2cf64af9dd89
--- /dev/null
+++ b/arch/mips/crypto/chacha-glue.c
@@ -0,0 +1,162 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * MIPS accelerated ChaCha and XChaCha stream ciphers,
+ * including ChaCha20 (RFC7539)
+ *
+ * Copyright (C) 2019 Linaro, Ltd. <ard.biesheuvel@linaro.org>
+ */
+
+#include <asm/byteorder.h>
+#include <crypto/algapi.h>
+#include <crypto/internal/chacha.h>
+#include <crypto/internal/skcipher.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+
+asmlinkage void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src,
+				  unsigned int bytes, int nrounds);
+EXPORT_SYMBOL(chacha_crypt_arch);
+
+asmlinkage void hchacha_block_arch(const u32 *state, u32 *stream, int nrounds);
+EXPORT_SYMBOL(hchacha_block_arch);
+
+void chacha_init_arch(u32 *state, const u32 *key, const u8 *iv)
+{
+	chacha_init_generic(state, key, iv);
+}
+EXPORT_SYMBOL(chacha_init_arch);
+
+static int chacha_mips_stream_xor(struct skcipher_request *req,
+				  const struct chacha_ctx *ctx, const u8 *iv)
+{
+	struct skcipher_walk walk;
+	u32 state[16];
+	int err;
+
+	err = skcipher_walk_virt(&walk, req, false);
+
+	chacha_init_generic(state, ctx->key, iv);
+
+	while (walk.nbytes > 0) {
+		unsigned int nbytes = walk.nbytes;
+
+		if (nbytes < walk.total)
+			nbytes = round_down(nbytes, walk.stride);
+
+		chacha_crypt(state, walk.dst.virt.addr, walk.src.virt.addr,
+			     nbytes, ctx->nrounds);
+		err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
+	}
+
+	return err;
+}
+
+static int chacha_mips(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+	return chacha_mips_stream_xor(req, ctx, req->iv);
+}
+
+static int xchacha_mips(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct chacha_ctx subctx;
+	u32 state[16];
+	u8 real_iv[16];
+
+	chacha_init_generic(state, ctx->key, req->iv);
+
+	hchacha_block(state, subctx.key, ctx->nrounds);
+	subctx.nrounds = ctx->nrounds;
+
+	memcpy(&real_iv[0], req->iv + 24, 8);
+	memcpy(&real_iv[8], req->iv + 16, 8);
+	return chacha_mips_stream_xor(req, &subctx, real_iv);
+}
+
+static int chacha20_setkey(struct crypto_skcipher *tfm, const u8 *key,
+		    unsigned int keysize)
+{
+	return chacha_setkey(tfm, key, keysize, 20);
+}
+
+static int chacha12_setkey(struct crypto_skcipher *tfm, const u8 *key,
+		    unsigned int keysize)
+{
+	return chacha_setkey(tfm, key, keysize, 12);
+}
+
+static struct skcipher_alg algs[] = {
+	{
+		.base.cra_name		= "chacha20",
+		.base.cra_driver_name	= "chacha20-mips",
+		.base.cra_priority	= 200,
+		.base.cra_blocksize	= 1,
+		.base.cra_ctxsize	= sizeof(struct chacha_ctx),
+		.base.cra_module	= THIS_MODULE,
+
+		.min_keysize		= CHACHA_KEY_SIZE,
+		.max_keysize		= CHACHA_KEY_SIZE,
+		.ivsize			= CHACHA_IV_SIZE,
+		.chunksize		= CHACHA_BLOCK_SIZE,
+		.setkey			= chacha20_setkey,
+		.encrypt		= chacha_mips,
+		.decrypt		= chacha_mips,
+	}, {
+		.base.cra_name		= "xchacha20",
+		.base.cra_driver_name	= "xchacha20-mips",
+		.base.cra_priority	= 200,
+		.base.cra_blocksize	= 1,
+		.base.cra_ctxsize	= sizeof(struct chacha_ctx),
+		.base.cra_module	= THIS_MODULE,
+
+		.min_keysize		= CHACHA_KEY_SIZE,
+		.max_keysize		= CHACHA_KEY_SIZE,
+		.ivsize			= XCHACHA_IV_SIZE,
+		.chunksize		= CHACHA_BLOCK_SIZE,
+		.setkey			= chacha20_setkey,
+		.encrypt		= xchacha_mips,
+		.decrypt		= xchacha_mips,
+	}, {
+		.base.cra_name		= "xchacha12",
+		.base.cra_driver_name	= "xchacha12-mips",
+		.base.cra_priority	= 200,
+		.base.cra_blocksize	= 1,
+		.base.cra_ctxsize	= sizeof(struct chacha_ctx),
+		.base.cra_module	= THIS_MODULE,
+
+		.min_keysize		= CHACHA_KEY_SIZE,
+		.max_keysize		= CHACHA_KEY_SIZE,
+		.ivsize			= XCHACHA_IV_SIZE,
+		.chunksize		= CHACHA_BLOCK_SIZE,
+		.setkey			= chacha12_setkey,
+		.encrypt		= xchacha_mips,
+		.decrypt		= xchacha_mips,
+	}
+};
+
+static int __init chacha_simd_mod_init(void)
+{
+	return crypto_register_skciphers(algs, ARRAY_SIZE(algs));
+}
+
+static void __exit chacha_simd_mod_fini(void)
+{
+	crypto_unregister_skciphers(algs, ARRAY_SIZE(algs));
+}
+
+module_init(chacha_simd_mod_init);
+module_exit(chacha_simd_mod_fini);
+
+MODULE_DESCRIPTION("ChaCha and XChaCha stream ciphers (MIPS accelerated)");
+MODULE_AUTHOR("Ard Biesheuvel <ard.biesheuvel@linaro.org>");
+MODULE_LICENSE("GPL v2");
+MODULE_ALIAS_CRYPTO("chacha20");
+MODULE_ALIAS_CRYPTO("chacha20-mips");
+MODULE_ALIAS_CRYPTO("xchacha20");
+MODULE_ALIAS_CRYPTO("xchacha20-mips");
+MODULE_ALIAS_CRYPTO("xchacha12");
+MODULE_ALIAS_CRYPTO("xchacha12-mips");
diff --git a/crypto/Kconfig b/crypto/Kconfig
index 4f7212fa0170..46944c6c8f9b 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1423,6 +1423,12 @@ config CRYPTO_CHACHA20_X86_64
 	  SSSE3, AVX2, and AVX-512VL optimized implementations of the ChaCha20,
 	  XChaCha20, and XChaCha12 stream ciphers.
 
+config CRYPTO_CHACHA_MIPS
+	tristate "ChaCha stream cipher algorithms (MIPS 32r2 optimized)"
+	depends on CPU_MIPS32_R2
+	select CRYPTO_BLKCIPHER
+	select CRYPTO_ARCH_HAVE_LIB_CHACHA
+
 config CRYPTO_SEED
 	tristate "SEED cipher algorithm"
 	select CRYPTO_ALGAPI
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4 12/35] crypto: chacha - unexport chacha_generic routines
  2019-10-17 19:08 [PATCH v4 00/35] crypto: crypto API library interfaces for WireGuard Ard Biesheuvel
                   ` (10 preceding siblings ...)
  2019-10-17 19:09 ` [PATCH v4 11/35] crypto: mips/chacha - wire up accelerated 32r2 " Ard Biesheuvel
@ 2019-10-17 19:09 ` Ard Biesheuvel
  2019-10-17 19:09 ` [PATCH v4 13/35] crypto: poly1305 - move core routines into a separate library Ard Biesheuvel
                   ` (22 subsequent siblings)
  34 siblings, 0 replies; 50+ messages in thread
From: Ard Biesheuvel @ 2019-10-17 19:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: Ard Biesheuvel, Herbert Xu, David Miller, Jason A . Donenfeld,
	Samuel Neves, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Martin Willi, Rene van Dorst, David Sterba

Now that all users of generic ChaCha code have moved to the core library,
there is no longer a need for the generic ChaCha skcpiher driver to
export parts of it implementation for reuse by other drivers. So drop
the exports, and make the symbols static.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 crypto/chacha_generic.c          | 26 ++++++--------------
 include/crypto/internal/chacha.h | 10 --------
 2 files changed, 8 insertions(+), 28 deletions(-)

diff --git a/crypto/chacha_generic.c b/crypto/chacha_generic.c
index ebae6d9d9b32..c1b147318393 100644
--- a/crypto/chacha_generic.c
+++ b/crypto/chacha_generic.c
@@ -21,7 +21,7 @@ static int chacha_stream_xor(struct skcipher_request *req,
 
 	err = skcipher_walk_virt(&walk, req, false);
 
-	crypto_chacha_init(state, ctx, iv);
+	chacha_init_generic(state, ctx->key, iv);
 
 	while (walk.nbytes > 0) {
 		unsigned int nbytes = walk.nbytes;
@@ -37,36 +37,27 @@ static int chacha_stream_xor(struct skcipher_request *req,
 	return err;
 }
 
-void crypto_chacha_init(u32 *state, const struct chacha_ctx *ctx, const u8 *iv)
-{
-	chacha_init_generic(state, ctx->key, iv);
-}
-EXPORT_SYMBOL_GPL(crypto_chacha_init);
-
-int crypto_chacha20_setkey(struct crypto_skcipher *tfm, const u8 *key,
-			   unsigned int keysize)
+static int crypto_chacha20_setkey(struct crypto_skcipher *tfm, const u8 *key,
+				  unsigned int keysize)
 {
 	return chacha_setkey(tfm, key, keysize, 20);
 }
-EXPORT_SYMBOL_GPL(crypto_chacha20_setkey);
 
-int crypto_chacha12_setkey(struct crypto_skcipher *tfm, const u8 *key,
-			   unsigned int keysize)
+static int crypto_chacha12_setkey(struct crypto_skcipher *tfm, const u8 *key,
+				 unsigned int keysize)
 {
 	return chacha_setkey(tfm, key, keysize, 12);
 }
-EXPORT_SYMBOL_GPL(crypto_chacha12_setkey);
 
-int crypto_chacha_crypt(struct skcipher_request *req)
+static int crypto_chacha_crypt(struct skcipher_request *req)
 {
 	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
 	struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
 
 	return chacha_stream_xor(req, ctx, req->iv);
 }
-EXPORT_SYMBOL_GPL(crypto_chacha_crypt);
 
-int crypto_xchacha_crypt(struct skcipher_request *req)
+static int crypto_xchacha_crypt(struct skcipher_request *req)
 {
 	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
 	struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
@@ -75,7 +66,7 @@ int crypto_xchacha_crypt(struct skcipher_request *req)
 	u8 real_iv[16];
 
 	/* Compute the subkey given the original key and first 128 nonce bits */
-	crypto_chacha_init(state, ctx, req->iv);
+	chacha_init_generic(state, ctx->key, req->iv);
 	hchacha_block_generic(state, subctx.key, ctx->nrounds);
 	subctx.nrounds = ctx->nrounds;
 
@@ -86,7 +77,6 @@ int crypto_xchacha_crypt(struct skcipher_request *req)
 	/* Generate the stream and XOR it with the data */
 	return chacha_stream_xor(req, &subctx, real_iv);
 }
-EXPORT_SYMBOL_GPL(crypto_xchacha_crypt);
 
 static struct skcipher_alg algs[] = {
 	{
diff --git a/include/crypto/internal/chacha.h b/include/crypto/internal/chacha.h
index 8724545bc959..194980d58174 100644
--- a/include/crypto/internal/chacha.h
+++ b/include/crypto/internal/chacha.h
@@ -12,8 +12,6 @@ struct chacha_ctx {
 	int nrounds;
 };
 
-void crypto_chacha_init(u32 *state, const struct chacha_ctx *ctx, const u8 *iv);
-
 static inline int chacha_setkey(struct crypto_skcipher *tfm, const u8 *key,
 				unsigned int keysize, int nrounds)
 {
@@ -30,12 +28,4 @@ static inline int chacha_setkey(struct crypto_skcipher *tfm, const u8 *key,
 	return 0;
 }
 
-int crypto_chacha20_setkey(struct crypto_skcipher *tfm, const u8 *key,
-			   unsigned int keysize);
-int crypto_chacha12_setkey(struct crypto_skcipher *tfm, const u8 *key,
-			   unsigned int keysize);
-
-int crypto_chacha_crypt(struct skcipher_request *req);
-int crypto_xchacha_crypt(struct skcipher_request *req);
-
 #endif /* _CRYPTO_CHACHA_H */
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4 13/35] crypto: poly1305 - move core routines into a separate library
  2019-10-17 19:08 [PATCH v4 00/35] crypto: crypto API library interfaces for WireGuard Ard Biesheuvel
                   ` (11 preceding siblings ...)
  2019-10-17 19:09 ` [PATCH v4 12/35] crypto: chacha - unexport chacha_generic routines Ard Biesheuvel
@ 2019-10-17 19:09 ` Ard Biesheuvel
  2019-10-17 19:09 ` [PATCH v4 14/35] crypto: x86/poly1305 - unify Poly1305 state struct with generic code Ard Biesheuvel
                   ` (21 subsequent siblings)
  34 siblings, 0 replies; 50+ messages in thread
From: Ard Biesheuvel @ 2019-10-17 19:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: Ard Biesheuvel, Herbert Xu, David Miller, Jason A . Donenfeld,
	Samuel Neves, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Martin Willi, Rene van Dorst, David Sterba

Move the core Poly1305 routines shared between the generic Poly1305
shash driver and the Adiantum and NHPoly1305 drivers into a separate
library so that using just this pieces does not pull in the crypto
API pieces of the generic Poly1305 routine.

In a subsequent patch, we will augment this generic library with
init/update/final routines so that Poyl1305 algorithm can be used
directly without the need for using the crypto API's shash abstraction.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/x86/crypto/poly1305_glue.c    |   2 +-
 crypto/Kconfig                     |   5 +-
 crypto/adiantum.c                  |   5 +-
 crypto/nhpoly1305.c                |   3 +-
 crypto/poly1305_generic.c          | 195 +-------------------
 include/crypto/internal/poly1305.h |  67 +++++++
 include/crypto/poly1305.h          |  23 ---
 lib/crypto/Kconfig                 |   3 +
 lib/crypto/Makefile                |   3 +
 lib/crypto/poly1305.c              | 158 ++++++++++++++++
 10 files changed, 248 insertions(+), 216 deletions(-)

diff --git a/arch/x86/crypto/poly1305_glue.c b/arch/x86/crypto/poly1305_glue.c
index 4a1c05dce950..6ccf8eb26324 100644
--- a/arch/x86/crypto/poly1305_glue.c
+++ b/arch/x86/crypto/poly1305_glue.c
@@ -7,8 +7,8 @@
 
 #include <crypto/algapi.h>
 #include <crypto/internal/hash.h>
+#include <crypto/internal/poly1305.h>
 #include <crypto/internal/simd.h>
-#include <crypto/poly1305.h>
 #include <linux/crypto.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
diff --git a/crypto/Kconfig b/crypto/Kconfig
index 46944c6c8f9b..d71b8afa01cd 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -446,7 +446,7 @@ config CRYPTO_KEYWRAP
 config CRYPTO_NHPOLY1305
 	tristate
 	select CRYPTO_HASH
-	select CRYPTO_POLY1305
+	select CRYPTO_LIB_POLY1305_GENERIC
 
 config CRYPTO_NHPOLY1305_SSE2
 	tristate "NHPoly1305 hash function (x86_64 SSE2 implementation)"
@@ -467,7 +467,7 @@ config CRYPTO_NHPOLY1305_AVX2
 config CRYPTO_ADIANTUM
 	tristate "Adiantum support"
 	select CRYPTO_CHACHA20
-	select CRYPTO_POLY1305
+	select CRYPTO_LIB_POLY1305_GENERIC
 	select CRYPTO_NHPOLY1305
 	select CRYPTO_MANAGER
 	help
@@ -686,6 +686,7 @@ config CRYPTO_GHASH
 config CRYPTO_POLY1305
 	tristate "Poly1305 authenticator algorithm"
 	select CRYPTO_HASH
+	select CRYPTO_LIB_POLY1305_GENERIC
 	help
 	  Poly1305 authenticator algorithm, RFC7539.
 
diff --git a/crypto/adiantum.c b/crypto/adiantum.c
index 395a3ddd3707..aded26092268 100644
--- a/crypto/adiantum.c
+++ b/crypto/adiantum.c
@@ -33,6 +33,7 @@
 #include <crypto/b128ops.h>
 #include <crypto/chacha.h>
 #include <crypto/internal/hash.h>
+#include <crypto/internal/poly1305.h>
 #include <crypto/internal/skcipher.h>
 #include <crypto/nhpoly1305.h>
 #include <crypto/scatterwalk.h>
@@ -242,11 +243,11 @@ static void adiantum_hash_header(struct skcipher_request *req)
 
 	BUILD_BUG_ON(sizeof(header) % POLY1305_BLOCK_SIZE != 0);
 	poly1305_core_blocks(&state, &tctx->header_hash_key,
-			     &header, sizeof(header) / POLY1305_BLOCK_SIZE);
+			     &header, sizeof(header) / POLY1305_BLOCK_SIZE, 1);
 
 	BUILD_BUG_ON(TWEAK_SIZE % POLY1305_BLOCK_SIZE != 0);
 	poly1305_core_blocks(&state, &tctx->header_hash_key, req->iv,
-			     TWEAK_SIZE / POLY1305_BLOCK_SIZE);
+			     TWEAK_SIZE / POLY1305_BLOCK_SIZE, 1);
 
 	poly1305_core_emit(&state, &rctx->header_hash);
 }
diff --git a/crypto/nhpoly1305.c b/crypto/nhpoly1305.c
index 9ab4e07cde4d..f6b6a52092b4 100644
--- a/crypto/nhpoly1305.c
+++ b/crypto/nhpoly1305.c
@@ -33,6 +33,7 @@
 #include <asm/unaligned.h>
 #include <crypto/algapi.h>
 #include <crypto/internal/hash.h>
+#include <crypto/internal/poly1305.h>
 #include <crypto/nhpoly1305.h>
 #include <linux/crypto.h>
 #include <linux/kernel.h>
@@ -78,7 +79,7 @@ static void process_nh_hash_value(struct nhpoly1305_state *state,
 	BUILD_BUG_ON(NH_HASH_BYTES % POLY1305_BLOCK_SIZE != 0);
 
 	poly1305_core_blocks(&state->poly_state, &key->poly_key, state->nh_hash,
-			     NH_HASH_BYTES / POLY1305_BLOCK_SIZE);
+			     NH_HASH_BYTES / POLY1305_BLOCK_SIZE, 1);
 }
 
 /*
diff --git a/crypto/poly1305_generic.c b/crypto/poly1305_generic.c
index adc40298c749..067f493c2504 100644
--- a/crypto/poly1305_generic.c
+++ b/crypto/poly1305_generic.c
@@ -13,27 +13,12 @@
 
 #include <crypto/algapi.h>
 #include <crypto/internal/hash.h>
-#include <crypto/poly1305.h>
+#include <crypto/internal/poly1305.h>
 #include <linux/crypto.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <asm/unaligned.h>
 
-static inline u64 mlt(u64 a, u64 b)
-{
-	return a * b;
-}
-
-static inline u32 sr(u64 v, u_char n)
-{
-	return v >> n;
-}
-
-static inline u32 and(u32 v, u32 mask)
-{
-	return v & mask;
-}
-
 int crypto_poly1305_init(struct shash_desc *desc)
 {
 	struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
@@ -47,124 +32,8 @@ int crypto_poly1305_init(struct shash_desc *desc)
 }
 EXPORT_SYMBOL_GPL(crypto_poly1305_init);
 
-void poly1305_core_setkey(struct poly1305_key *key, const u8 *raw_key)
-{
-	/* r &= 0xffffffc0ffffffc0ffffffc0fffffff */
-	key->r[0] = (get_unaligned_le32(raw_key +  0) >> 0) & 0x3ffffff;
-	key->r[1] = (get_unaligned_le32(raw_key +  3) >> 2) & 0x3ffff03;
-	key->r[2] = (get_unaligned_le32(raw_key +  6) >> 4) & 0x3ffc0ff;
-	key->r[3] = (get_unaligned_le32(raw_key +  9) >> 6) & 0x3f03fff;
-	key->r[4] = (get_unaligned_le32(raw_key + 12) >> 8) & 0x00fffff;
-}
-EXPORT_SYMBOL_GPL(poly1305_core_setkey);
-
-/*
- * Poly1305 requires a unique key for each tag, which implies that we can't set
- * it on the tfm that gets accessed by multiple users simultaneously. Instead we
- * expect the key as the first 32 bytes in the update() call.
- */
-unsigned int crypto_poly1305_setdesckey(struct poly1305_desc_ctx *dctx,
-					const u8 *src, unsigned int srclen)
-{
-	if (!dctx->sset) {
-		if (!dctx->rset && srclen >= POLY1305_BLOCK_SIZE) {
-			poly1305_core_setkey(&dctx->r, src);
-			src += POLY1305_BLOCK_SIZE;
-			srclen -= POLY1305_BLOCK_SIZE;
-			dctx->rset = true;
-		}
-		if (srclen >= POLY1305_BLOCK_SIZE) {
-			dctx->s[0] = get_unaligned_le32(src +  0);
-			dctx->s[1] = get_unaligned_le32(src +  4);
-			dctx->s[2] = get_unaligned_le32(src +  8);
-			dctx->s[3] = get_unaligned_le32(src + 12);
-			src += POLY1305_BLOCK_SIZE;
-			srclen -= POLY1305_BLOCK_SIZE;
-			dctx->sset = true;
-		}
-	}
-	return srclen;
-}
-EXPORT_SYMBOL_GPL(crypto_poly1305_setdesckey);
-
-static void poly1305_blocks_internal(struct poly1305_state *state,
-				     const struct poly1305_key *key,
-				     const void *src, unsigned int nblocks,
-				     u32 hibit)
-{
-	u32 r0, r1, r2, r3, r4;
-	u32 s1, s2, s3, s4;
-	u32 h0, h1, h2, h3, h4;
-	u64 d0, d1, d2, d3, d4;
-
-	if (!nblocks)
-		return;
-
-	r0 = key->r[0];
-	r1 = key->r[1];
-	r2 = key->r[2];
-	r3 = key->r[3];
-	r4 = key->r[4];
-
-	s1 = r1 * 5;
-	s2 = r2 * 5;
-	s3 = r3 * 5;
-	s4 = r4 * 5;
-
-	h0 = state->h[0];
-	h1 = state->h[1];
-	h2 = state->h[2];
-	h3 = state->h[3];
-	h4 = state->h[4];
-
-	do {
-		/* h += m[i] */
-		h0 += (get_unaligned_le32(src +  0) >> 0) & 0x3ffffff;
-		h1 += (get_unaligned_le32(src +  3) >> 2) & 0x3ffffff;
-		h2 += (get_unaligned_le32(src +  6) >> 4) & 0x3ffffff;
-		h3 += (get_unaligned_le32(src +  9) >> 6) & 0x3ffffff;
-		h4 += (get_unaligned_le32(src + 12) >> 8) | hibit;
-
-		/* h *= r */
-		d0 = mlt(h0, r0) + mlt(h1, s4) + mlt(h2, s3) +
-		     mlt(h3, s2) + mlt(h4, s1);
-		d1 = mlt(h0, r1) + mlt(h1, r0) + mlt(h2, s4) +
-		     mlt(h3, s3) + mlt(h4, s2);
-		d2 = mlt(h0, r2) + mlt(h1, r1) + mlt(h2, r0) +
-		     mlt(h3, s4) + mlt(h4, s3);
-		d3 = mlt(h0, r3) + mlt(h1, r2) + mlt(h2, r1) +
-		     mlt(h3, r0) + mlt(h4, s4);
-		d4 = mlt(h0, r4) + mlt(h1, r3) + mlt(h2, r2) +
-		     mlt(h3, r1) + mlt(h4, r0);
-
-		/* (partial) h %= p */
-		d1 += sr(d0, 26);     h0 = and(d0, 0x3ffffff);
-		d2 += sr(d1, 26);     h1 = and(d1, 0x3ffffff);
-		d3 += sr(d2, 26);     h2 = and(d2, 0x3ffffff);
-		d4 += sr(d3, 26);     h3 = and(d3, 0x3ffffff);
-		h0 += sr(d4, 26) * 5; h4 = and(d4, 0x3ffffff);
-		h1 += h0 >> 26;       h0 = h0 & 0x3ffffff;
-
-		src += POLY1305_BLOCK_SIZE;
-	} while (--nblocks);
-
-	state->h[0] = h0;
-	state->h[1] = h1;
-	state->h[2] = h2;
-	state->h[3] = h3;
-	state->h[4] = h4;
-}
-
-void poly1305_core_blocks(struct poly1305_state *state,
-			  const struct poly1305_key *key,
-			  const void *src, unsigned int nblocks)
-{
-	poly1305_blocks_internal(state, key, src, nblocks, 1 << 24);
-}
-EXPORT_SYMBOL_GPL(poly1305_core_blocks);
-
-static void poly1305_blocks(struct poly1305_desc_ctx *dctx,
-			    const u8 *src, unsigned int srclen, u32 hibit)
+static void poly1305_blocks(struct poly1305_desc_ctx *dctx, const u8 *src,
+			    unsigned int srclen)
 {
 	unsigned int datalen;
 
@@ -174,8 +43,8 @@ static void poly1305_blocks(struct poly1305_desc_ctx *dctx,
 		srclen = datalen;
 	}
 
-	poly1305_blocks_internal(&dctx->h, &dctx->r,
-				 src, srclen / POLY1305_BLOCK_SIZE, hibit);
+	poly1305_core_blocks(&dctx->h, &dctx->r, src,
+			     srclen / POLY1305_BLOCK_SIZE, 1);
 }
 
 int crypto_poly1305_update(struct shash_desc *desc,
@@ -193,13 +62,13 @@ int crypto_poly1305_update(struct shash_desc *desc,
 
 		if (dctx->buflen == POLY1305_BLOCK_SIZE) {
 			poly1305_blocks(dctx, dctx->buf,
-					POLY1305_BLOCK_SIZE, 1 << 24);
+					POLY1305_BLOCK_SIZE);
 			dctx->buflen = 0;
 		}
 	}
 
 	if (likely(srclen >= POLY1305_BLOCK_SIZE)) {
-		poly1305_blocks(dctx, src, srclen, 1 << 24);
+		poly1305_blocks(dctx, src, srclen);
 		src += srclen - (srclen % POLY1305_BLOCK_SIZE);
 		srclen %= POLY1305_BLOCK_SIZE;
 	}
@@ -213,54 +82,6 @@ int crypto_poly1305_update(struct shash_desc *desc,
 }
 EXPORT_SYMBOL_GPL(crypto_poly1305_update);
 
-void poly1305_core_emit(const struct poly1305_state *state, void *dst)
-{
-	u32 h0, h1, h2, h3, h4;
-	u32 g0, g1, g2, g3, g4;
-	u32 mask;
-
-	/* fully carry h */
-	h0 = state->h[0];
-	h1 = state->h[1];
-	h2 = state->h[2];
-	h3 = state->h[3];
-	h4 = state->h[4];
-
-	h2 += (h1 >> 26);     h1 = h1 & 0x3ffffff;
-	h3 += (h2 >> 26);     h2 = h2 & 0x3ffffff;
-	h4 += (h3 >> 26);     h3 = h3 & 0x3ffffff;
-	h0 += (h4 >> 26) * 5; h4 = h4 & 0x3ffffff;
-	h1 += (h0 >> 26);     h0 = h0 & 0x3ffffff;
-
-	/* compute h + -p */
-	g0 = h0 + 5;
-	g1 = h1 + (g0 >> 26);             g0 &= 0x3ffffff;
-	g2 = h2 + (g1 >> 26);             g1 &= 0x3ffffff;
-	g3 = h3 + (g2 >> 26);             g2 &= 0x3ffffff;
-	g4 = h4 + (g3 >> 26) - (1 << 26); g3 &= 0x3ffffff;
-
-	/* select h if h < p, or h + -p if h >= p */
-	mask = (g4 >> ((sizeof(u32) * 8) - 1)) - 1;
-	g0 &= mask;
-	g1 &= mask;
-	g2 &= mask;
-	g3 &= mask;
-	g4 &= mask;
-	mask = ~mask;
-	h0 = (h0 & mask) | g0;
-	h1 = (h1 & mask) | g1;
-	h2 = (h2 & mask) | g2;
-	h3 = (h3 & mask) | g3;
-	h4 = (h4 & mask) | g4;
-
-	/* h = h % (2^128) */
-	put_unaligned_le32((h0 >>  0) | (h1 << 26), dst +  0);
-	put_unaligned_le32((h1 >>  6) | (h2 << 20), dst +  4);
-	put_unaligned_le32((h2 >> 12) | (h3 << 14), dst +  8);
-	put_unaligned_le32((h3 >> 18) | (h4 <<  8), dst + 12);
-}
-EXPORT_SYMBOL_GPL(poly1305_core_emit);
-
 int crypto_poly1305_final(struct shash_desc *desc, u8 *dst)
 {
 	struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
@@ -274,7 +95,7 @@ int crypto_poly1305_final(struct shash_desc *desc, u8 *dst)
 		dctx->buf[dctx->buflen++] = 1;
 		memset(dctx->buf + dctx->buflen, 0,
 		       POLY1305_BLOCK_SIZE - dctx->buflen);
-		poly1305_blocks(dctx, dctx->buf, POLY1305_BLOCK_SIZE, 0);
+		poly1305_core_blocks(&dctx->h, &dctx->r, dctx->buf, 1, 0);
 	}
 
 	poly1305_core_emit(&dctx->h, digest);
diff --git a/include/crypto/internal/poly1305.h b/include/crypto/internal/poly1305.h
new file mode 100644
index 000000000000..cb58e61f73a7
--- /dev/null
+++ b/include/crypto/internal/poly1305.h
@@ -0,0 +1,67 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Common values for the Poly1305 algorithm
+ */
+
+#ifndef _CRYPTO_INTERNAL_POLY1305_H
+#define _CRYPTO_INTERNAL_POLY1305_H
+
+#include <asm/unaligned.h>
+#include <linux/types.h>
+#include <crypto/poly1305.h>
+
+struct shash_desc;
+
+/*
+ * Poly1305 core functions.  These implement the ε-almost-∆-universal hash
+ * function underlying the Poly1305 MAC, i.e. they don't add an encrypted nonce
+ * ("s key") at the end.  They also only support block-aligned inputs.
+ */
+void poly1305_core_setkey(struct poly1305_key *key, const u8 *raw_key);
+static inline void poly1305_core_init(struct poly1305_state *state)
+{
+	*state = (struct poly1305_state){};
+}
+
+void poly1305_core_blocks(struct poly1305_state *state,
+			  const struct poly1305_key *key, const void *src,
+			  unsigned int nblocks, u32 hibit);
+void poly1305_core_emit(const struct poly1305_state *state, void *dst);
+
+/* Crypto API helper functions for the Poly1305 MAC */
+int crypto_poly1305_init(struct shash_desc *desc);
+
+int crypto_poly1305_update(struct shash_desc *desc,
+			   const u8 *src, unsigned int srclen);
+int crypto_poly1305_final(struct shash_desc *desc, u8 *dst);
+
+/*
+ * Poly1305 requires a unique key for each tag, which implies that we can't set
+ * it on the tfm that gets accessed by multiple users simultaneously. Instead we
+ * expect the key as the first 32 bytes in the update() call.
+ */
+static inline
+unsigned int crypto_poly1305_setdesckey(struct poly1305_desc_ctx *dctx,
+					const u8 *src, unsigned int srclen)
+{
+	if (!dctx->sset) {
+		if (!dctx->rset && srclen >= POLY1305_BLOCK_SIZE) {
+			poly1305_core_setkey(&dctx->r, src);
+			src += POLY1305_BLOCK_SIZE;
+			srclen -= POLY1305_BLOCK_SIZE;
+			dctx->rset = true;
+		}
+		if (srclen >= POLY1305_BLOCK_SIZE) {
+			dctx->s[0] = get_unaligned_le32(src +  0);
+			dctx->s[1] = get_unaligned_le32(src +  4);
+			dctx->s[2] = get_unaligned_le32(src +  8);
+			dctx->s[3] = get_unaligned_le32(src + 12);
+			src += POLY1305_BLOCK_SIZE;
+			srclen -= POLY1305_BLOCK_SIZE;
+			dctx->sset = true;
+		}
+	}
+	return srclen;
+}
+
+#endif
diff --git a/include/crypto/poly1305.h b/include/crypto/poly1305.h
index 34317ed2071e..f5a4319c2a1f 100644
--- a/include/crypto/poly1305.h
+++ b/include/crypto/poly1305.h
@@ -38,27 +38,4 @@ struct poly1305_desc_ctx {
 	bool sset;
 };
 
-/*
- * Poly1305 core functions.  These implement the ε-almost-∆-universal hash
- * function underlying the Poly1305 MAC, i.e. they don't add an encrypted nonce
- * ("s key") at the end.  They also only support block-aligned inputs.
- */
-void poly1305_core_setkey(struct poly1305_key *key, const u8 *raw_key);
-static inline void poly1305_core_init(struct poly1305_state *state)
-{
-	memset(state->h, 0, sizeof(state->h));
-}
-void poly1305_core_blocks(struct poly1305_state *state,
-			  const struct poly1305_key *key,
-			  const void *src, unsigned int nblocks);
-void poly1305_core_emit(const struct poly1305_state *state, void *dst);
-
-/* Crypto API helper functions for the Poly1305 MAC */
-int crypto_poly1305_init(struct shash_desc *desc);
-unsigned int crypto_poly1305_setdesckey(struct poly1305_desc_ctx *dctx,
-					const u8 *src, unsigned int srclen);
-int crypto_poly1305_update(struct shash_desc *desc,
-			   const u8 *src, unsigned int srclen);
-int crypto_poly1305_final(struct shash_desc *desc, u8 *dst);
-
 #endif
diff --git a/lib/crypto/Kconfig b/lib/crypto/Kconfig
index 6a11931ae105..c4882d29879e 100644
--- a/lib/crypto/Kconfig
+++ b/lib/crypto/Kconfig
@@ -37,5 +37,8 @@ config CRYPTO_LIB_CHACHA
 config CRYPTO_LIB_DES
 	tristate
 
+config CRYPTO_LIB_POLY1305_GENERIC
+	tristate
+
 config CRYPTO_LIB_SHA256
 	tristate
diff --git a/lib/crypto/Makefile b/lib/crypto/Makefile
index 0ce40604e104..b58ab6843a9d 100644
--- a/lib/crypto/Makefile
+++ b/lib/crypto/Makefile
@@ -13,5 +13,8 @@ libarc4-y					:= arc4.o
 obj-$(CONFIG_CRYPTO_LIB_DES)			+= libdes.o
 libdes-y					:= des.o
 
+obj-$(CONFIG_CRYPTO_LIB_POLY1305_GENERIC)	+= libpoly1305.o
+libpoly1305-y					:= poly1305.o
+
 obj-$(CONFIG_CRYPTO_LIB_SHA256)			+= libsha256.o
 libsha256-y					:= sha256.o
diff --git a/lib/crypto/poly1305.c b/lib/crypto/poly1305.c
new file mode 100644
index 000000000000..f019a57dbc1b
--- /dev/null
+++ b/lib/crypto/poly1305.c
@@ -0,0 +1,158 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Poly1305 authenticator algorithm, RFC7539
+ *
+ * Copyright (C) 2015 Martin Willi
+ *
+ * Based on public domain code by Andrew Moon and Daniel J. Bernstein.
+ */
+
+#include <crypto/internal/poly1305.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <asm/unaligned.h>
+
+static inline u64 mlt(u64 a, u64 b)
+{
+	return a * b;
+}
+
+static inline u32 sr(u64 v, u_char n)
+{
+	return v >> n;
+}
+
+static inline u32 and(u32 v, u32 mask)
+{
+	return v & mask;
+}
+
+void poly1305_core_setkey(struct poly1305_key *key, const u8 *raw_key)
+{
+	/* r &= 0xffffffc0ffffffc0ffffffc0fffffff */
+	key->r[0] = (get_unaligned_le32(raw_key +  0) >> 0) & 0x3ffffff;
+	key->r[1] = (get_unaligned_le32(raw_key +  3) >> 2) & 0x3ffff03;
+	key->r[2] = (get_unaligned_le32(raw_key +  6) >> 4) & 0x3ffc0ff;
+	key->r[3] = (get_unaligned_le32(raw_key +  9) >> 6) & 0x3f03fff;
+	key->r[4] = (get_unaligned_le32(raw_key + 12) >> 8) & 0x00fffff;
+}
+EXPORT_SYMBOL_GPL(poly1305_core_setkey);
+
+void poly1305_core_blocks(struct poly1305_state *state,
+			  const struct poly1305_key *key, const void *src,
+			  unsigned int nblocks, u32 hibit)
+{
+	u32 r0, r1, r2, r3, r4;
+	u32 s1, s2, s3, s4;
+	u32 h0, h1, h2, h3, h4;
+	u64 d0, d1, d2, d3, d4;
+
+	if (!nblocks)
+		return;
+
+	r0 = key->r[0];
+	r1 = key->r[1];
+	r2 = key->r[2];
+	r3 = key->r[3];
+	r4 = key->r[4];
+
+	s1 = r1 * 5;
+	s2 = r2 * 5;
+	s3 = r3 * 5;
+	s4 = r4 * 5;
+
+	h0 = state->h[0];
+	h1 = state->h[1];
+	h2 = state->h[2];
+	h3 = state->h[3];
+	h4 = state->h[4];
+
+	do {
+		/* h += m[i] */
+		h0 += (get_unaligned_le32(src +  0) >> 0) & 0x3ffffff;
+		h1 += (get_unaligned_le32(src +  3) >> 2) & 0x3ffffff;
+		h2 += (get_unaligned_le32(src +  6) >> 4) & 0x3ffffff;
+		h3 += (get_unaligned_le32(src +  9) >> 6) & 0x3ffffff;
+		h4 += (get_unaligned_le32(src + 12) >> 8) | (hibit << 24);
+
+		/* h *= r */
+		d0 = mlt(h0, r0) + mlt(h1, s4) + mlt(h2, s3) +
+		     mlt(h3, s2) + mlt(h4, s1);
+		d1 = mlt(h0, r1) + mlt(h1, r0) + mlt(h2, s4) +
+		     mlt(h3, s3) + mlt(h4, s2);
+		d2 = mlt(h0, r2) + mlt(h1, r1) + mlt(h2, r0) +
+		     mlt(h3, s4) + mlt(h4, s3);
+		d3 = mlt(h0, r3) + mlt(h1, r2) + mlt(h2, r1) +
+		     mlt(h3, r0) + mlt(h4, s4);
+		d4 = mlt(h0, r4) + mlt(h1, r3) + mlt(h2, r2) +
+		     mlt(h3, r1) + mlt(h4, r0);
+
+		/* (partial) h %= p */
+		d1 += sr(d0, 26);     h0 = and(d0, 0x3ffffff);
+		d2 += sr(d1, 26);     h1 = and(d1, 0x3ffffff);
+		d3 += sr(d2, 26);     h2 = and(d2, 0x3ffffff);
+		d4 += sr(d3, 26);     h3 = and(d3, 0x3ffffff);
+		h0 += sr(d4, 26) * 5; h4 = and(d4, 0x3ffffff);
+		h1 += h0 >> 26;       h0 = h0 & 0x3ffffff;
+
+		src += POLY1305_BLOCK_SIZE;
+	} while (--nblocks);
+
+	state->h[0] = h0;
+	state->h[1] = h1;
+	state->h[2] = h2;
+	state->h[3] = h3;
+	state->h[4] = h4;
+}
+EXPORT_SYMBOL_GPL(poly1305_core_blocks);
+
+void poly1305_core_emit(const struct poly1305_state *state, void *dst)
+{
+	u32 h0, h1, h2, h3, h4;
+	u32 g0, g1, g2, g3, g4;
+	u32 mask;
+
+	/* fully carry h */
+	h0 = state->h[0];
+	h1 = state->h[1];
+	h2 = state->h[2];
+	h3 = state->h[3];
+	h4 = state->h[4];
+
+	h2 += (h1 >> 26);     h1 = h1 & 0x3ffffff;
+	h3 += (h2 >> 26);     h2 = h2 & 0x3ffffff;
+	h4 += (h3 >> 26);     h3 = h3 & 0x3ffffff;
+	h0 += (h4 >> 26) * 5; h4 = h4 & 0x3ffffff;
+	h1 += (h0 >> 26);     h0 = h0 & 0x3ffffff;
+
+	/* compute h + -p */
+	g0 = h0 + 5;
+	g1 = h1 + (g0 >> 26);             g0 &= 0x3ffffff;
+	g2 = h2 + (g1 >> 26);             g1 &= 0x3ffffff;
+	g3 = h3 + (g2 >> 26);             g2 &= 0x3ffffff;
+	g4 = h4 + (g3 >> 26) - (1 << 26); g3 &= 0x3ffffff;
+
+	/* select h if h < p, or h + -p if h >= p */
+	mask = (g4 >> ((sizeof(u32) * 8) - 1)) - 1;
+	g0 &= mask;
+	g1 &= mask;
+	g2 &= mask;
+	g3 &= mask;
+	g4 &= mask;
+	mask = ~mask;
+	h0 = (h0 & mask) | g0;
+	h1 = (h1 & mask) | g1;
+	h2 = (h2 & mask) | g2;
+	h3 = (h3 & mask) | g3;
+	h4 = (h4 & mask) | g4;
+
+	/* h = h % (2^128) */
+	put_unaligned_le32((h0 >>  0) | (h1 << 26), dst +  0);
+	put_unaligned_le32((h1 >>  6) | (h2 << 20), dst +  4);
+	put_unaligned_le32((h2 >> 12) | (h3 << 14), dst +  8);
+	put_unaligned_le32((h3 >> 18) | (h4 <<  8), dst + 12);
+}
+EXPORT_SYMBOL_GPL(poly1305_core_emit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Martin Willi <martin@strongswan.org>");
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4 14/35] crypto: x86/poly1305 - unify Poly1305 state struct with generic code
  2019-10-17 19:08 [PATCH v4 00/35] crypto: crypto API library interfaces for WireGuard Ard Biesheuvel
                   ` (12 preceding siblings ...)
  2019-10-17 19:09 ` [PATCH v4 13/35] crypto: poly1305 - move core routines into a separate library Ard Biesheuvel
@ 2019-10-17 19:09 ` Ard Biesheuvel
  2019-10-17 19:09 ` [PATCH v4 15/35] crypto: poly1305 - expose init/update/final library interface Ard Biesheuvel
                   ` (20 subsequent siblings)
  34 siblings, 0 replies; 50+ messages in thread
From: Ard Biesheuvel @ 2019-10-17 19:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: Ard Biesheuvel, Herbert Xu, David Miller, Jason A . Donenfeld,
	Samuel Neves, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Martin Willi, Rene van Dorst, David Sterba

In preparation of exposing a Poly1305 library interface directly from
the accelerated x86 driver, align the state descriptor of the x86 code
with the one used by the generic driver. This is needed to make the
library interface unified between all implementations.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/x86/crypto/poly1305_glue.c    | 88 +++++++-------------
 crypto/poly1305_generic.c          |  6 +-
 include/crypto/internal/poly1305.h |  4 +-
 include/crypto/poly1305.h          | 18 ++--
 4 files changed, 43 insertions(+), 73 deletions(-)

diff --git a/arch/x86/crypto/poly1305_glue.c b/arch/x86/crypto/poly1305_glue.c
index 6ccf8eb26324..b43b93c95e79 100644
--- a/arch/x86/crypto/poly1305_glue.c
+++ b/arch/x86/crypto/poly1305_glue.c
@@ -14,40 +14,14 @@
 #include <linux/module.h>
 #include <asm/simd.h>
 
-struct poly1305_simd_desc_ctx {
-	struct poly1305_desc_ctx base;
-	/* derived key u set? */
-	bool uset;
-#ifdef CONFIG_AS_AVX2
-	/* derived keys r^3, r^4 set? */
-	bool wset;
-#endif
-	/* derived Poly1305 key r^2 */
-	u32 u[5];
-	/* ... silently appended r^3 and r^4 when using AVX2 */
-};
-
 asmlinkage void poly1305_block_sse2(u32 *h, const u8 *src,
 				    const u32 *r, unsigned int blocks);
 asmlinkage void poly1305_2block_sse2(u32 *h, const u8 *src, const u32 *r,
 				     unsigned int blocks, const u32 *u);
-#ifdef CONFIG_AS_AVX2
 asmlinkage void poly1305_4block_avx2(u32 *h, const u8 *src, const u32 *r,
 				     unsigned int blocks, const u32 *u);
-static bool poly1305_use_avx2;
-#endif
 
-static int poly1305_simd_init(struct shash_desc *desc)
-{
-	struct poly1305_simd_desc_ctx *sctx = shash_desc_ctx(desc);
-
-	sctx->uset = false;
-#ifdef CONFIG_AS_AVX2
-	sctx->wset = false;
-#endif
-
-	return crypto_poly1305_init(desc);
-}
+static bool poly1305_use_avx2 __ro_after_init;
 
 static void poly1305_simd_mult(u32 *a, const u32 *b)
 {
@@ -63,53 +37,49 @@ static void poly1305_simd_mult(u32 *a, const u32 *b)
 static unsigned int poly1305_simd_blocks(struct poly1305_desc_ctx *dctx,
 					 const u8 *src, unsigned int srclen)
 {
-	struct poly1305_simd_desc_ctx *sctx;
 	unsigned int blocks, datalen;
 
-	BUILD_BUG_ON(offsetof(struct poly1305_simd_desc_ctx, base));
-	sctx = container_of(dctx, struct poly1305_simd_desc_ctx, base);
-
 	if (unlikely(!dctx->sset)) {
 		datalen = crypto_poly1305_setdesckey(dctx, src, srclen);
 		src += srclen - datalen;
 		srclen = datalen;
 	}
 
-#ifdef CONFIG_AS_AVX2
-	if (poly1305_use_avx2 && srclen >= POLY1305_BLOCK_SIZE * 4) {
-		if (unlikely(!sctx->wset)) {
-			if (!sctx->uset) {
-				memcpy(sctx->u, dctx->r.r, sizeof(sctx->u));
-				poly1305_simd_mult(sctx->u, dctx->r.r);
-				sctx->uset = true;
+	if (IS_ENABLED(CONFIG_AS_AVX2) &&
+	    poly1305_use_avx2 &&
+	    srclen >= POLY1305_BLOCK_SIZE * 4) {
+		if (unlikely(dctx->rset < 4)) {
+			if (dctx->rset < 2) {
+				dctx->r[1] = dctx->r[0];
+				poly1305_simd_mult(dctx->r[1].r, dctx->r[0].r);
 			}
-			memcpy(sctx->u + 5, sctx->u, sizeof(sctx->u));
-			poly1305_simd_mult(sctx->u + 5, dctx->r.r);
-			memcpy(sctx->u + 10, sctx->u + 5, sizeof(sctx->u));
-			poly1305_simd_mult(sctx->u + 10, dctx->r.r);
-			sctx->wset = true;
+			dctx->r[2] = dctx->r[1];
+			poly1305_simd_mult(dctx->r[2].r, dctx->r[0].r);
+			dctx->r[3] = dctx->r[2];
+			poly1305_simd_mult(dctx->r[3].r, dctx->r[0].r);
+			dctx->rset = 4;
 		}
 		blocks = srclen / (POLY1305_BLOCK_SIZE * 4);
-		poly1305_4block_avx2(dctx->h.h, src, dctx->r.r, blocks,
-				     sctx->u);
+		poly1305_4block_avx2(dctx->h.h, src, dctx->r[0].r, blocks,
+				     dctx->r[1].r);
 		src += POLY1305_BLOCK_SIZE * 4 * blocks;
 		srclen -= POLY1305_BLOCK_SIZE * 4 * blocks;
 	}
-#endif
+
 	if (likely(srclen >= POLY1305_BLOCK_SIZE * 2)) {
-		if (unlikely(!sctx->uset)) {
-			memcpy(sctx->u, dctx->r.r, sizeof(sctx->u));
-			poly1305_simd_mult(sctx->u, dctx->r.r);
-			sctx->uset = true;
+		if (unlikely(dctx->rset < 2)) {
+			dctx->r[1] = dctx->r[0];
+			poly1305_simd_mult(dctx->r[1].r, dctx->r[0].r);
+			dctx->rset = 2;
 		}
 		blocks = srclen / (POLY1305_BLOCK_SIZE * 2);
-		poly1305_2block_sse2(dctx->h.h, src, dctx->r.r, blocks,
-				     sctx->u);
+		poly1305_2block_sse2(dctx->h.h, src, dctx->r[0].r,
+				     blocks, dctx->r[1].r);
 		src += POLY1305_BLOCK_SIZE * 2 * blocks;
 		srclen -= POLY1305_BLOCK_SIZE * 2 * blocks;
 	}
 	if (srclen >= POLY1305_BLOCK_SIZE) {
-		poly1305_block_sse2(dctx->h.h, src, dctx->r.r, 1);
+		poly1305_block_sse2(dctx->h.h, src, dctx->r[0].r, 1);
 		srclen -= POLY1305_BLOCK_SIZE;
 	}
 	return srclen;
@@ -159,10 +129,10 @@ static int poly1305_simd_update(struct shash_desc *desc,
 
 static struct shash_alg alg = {
 	.digestsize	= POLY1305_DIGEST_SIZE,
-	.init		= poly1305_simd_init,
+	.init		= crypto_poly1305_init,
 	.update		= poly1305_simd_update,
 	.final		= crypto_poly1305_final,
-	.descsize	= sizeof(struct poly1305_simd_desc_ctx),
+	.descsize	= sizeof(struct poly1305_desc_ctx),
 	.base		= {
 		.cra_name		= "poly1305",
 		.cra_driver_name	= "poly1305-simd",
@@ -177,14 +147,14 @@ static int __init poly1305_simd_mod_init(void)
 	if (!boot_cpu_has(X86_FEATURE_XMM2))
 		return -ENODEV;
 
-#ifdef CONFIG_AS_AVX2
-	poly1305_use_avx2 = boot_cpu_has(X86_FEATURE_AVX) &&
+	poly1305_use_avx2 = IS_ENABLED(CONFIG_AS_AVX2) &&
+			    boot_cpu_has(X86_FEATURE_AVX) &&
 			    boot_cpu_has(X86_FEATURE_AVX2) &&
 			    cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, NULL);
-	alg.descsize = sizeof(struct poly1305_simd_desc_ctx);
+	alg.descsize = sizeof(struct poly1305_desc_ctx) + 5 * sizeof(u32);
 	if (poly1305_use_avx2)
 		alg.descsize += 10 * sizeof(u32);
-#endif
+
 	return crypto_register_shash(&alg);
 }
 
diff --git a/crypto/poly1305_generic.c b/crypto/poly1305_generic.c
index 067f493c2504..f3fcd9578a47 100644
--- a/crypto/poly1305_generic.c
+++ b/crypto/poly1305_generic.c
@@ -25,7 +25,7 @@ int crypto_poly1305_init(struct shash_desc *desc)
 
 	poly1305_core_init(&dctx->h);
 	dctx->buflen = 0;
-	dctx->rset = false;
+	dctx->rset = 0;
 	dctx->sset = false;
 
 	return 0;
@@ -43,7 +43,7 @@ static void poly1305_blocks(struct poly1305_desc_ctx *dctx, const u8 *src,
 		srclen = datalen;
 	}
 
-	poly1305_core_blocks(&dctx->h, &dctx->r, src,
+	poly1305_core_blocks(&dctx->h, dctx->r, src,
 			     srclen / POLY1305_BLOCK_SIZE, 1);
 }
 
@@ -95,7 +95,7 @@ int crypto_poly1305_final(struct shash_desc *desc, u8 *dst)
 		dctx->buf[dctx->buflen++] = 1;
 		memset(dctx->buf + dctx->buflen, 0,
 		       POLY1305_BLOCK_SIZE - dctx->buflen);
-		poly1305_core_blocks(&dctx->h, &dctx->r, dctx->buf, 1, 0);
+		poly1305_core_blocks(&dctx->h, dctx->r, dctx->buf, 1, 0);
 	}
 
 	poly1305_core_emit(&dctx->h, digest);
diff --git a/include/crypto/internal/poly1305.h b/include/crypto/internal/poly1305.h
index cb58e61f73a7..04fa269e5534 100644
--- a/include/crypto/internal/poly1305.h
+++ b/include/crypto/internal/poly1305.h
@@ -46,10 +46,10 @@ unsigned int crypto_poly1305_setdesckey(struct poly1305_desc_ctx *dctx,
 {
 	if (!dctx->sset) {
 		if (!dctx->rset && srclen >= POLY1305_BLOCK_SIZE) {
-			poly1305_core_setkey(&dctx->r, src);
+			poly1305_core_setkey(dctx->r, src);
 			src += POLY1305_BLOCK_SIZE;
 			srclen -= POLY1305_BLOCK_SIZE;
-			dctx->rset = true;
+			dctx->rset = 1;
 		}
 		if (srclen >= POLY1305_BLOCK_SIZE) {
 			dctx->s[0] = get_unaligned_le32(src +  0);
diff --git a/include/crypto/poly1305.h b/include/crypto/poly1305.h
index f5a4319c2a1f..36b5886cb50c 100644
--- a/include/crypto/poly1305.h
+++ b/include/crypto/poly1305.h
@@ -22,20 +22,20 @@ struct poly1305_state {
 };
 
 struct poly1305_desc_ctx {
-	/* key */
-	struct poly1305_key r;
-	/* finalize key */
-	u32 s[4];
-	/* accumulator */
-	struct poly1305_state h;
 	/* partial buffer */
 	u8 buf[POLY1305_BLOCK_SIZE];
 	/* bytes used in partial buffer */
 	unsigned int buflen;
-	/* r key has been set */
-	bool rset;
-	/* s key has been set */
+	/* how many keys have been set in r[] */
+	unsigned short rset;
+	/* whether s[] has been set */
 	bool sset;
+	/* finalize key */
+	u32 s[4];
+	/* accumulator */
+	struct poly1305_state h;
+	/* key */
+	struct poly1305_key r[1];
 };
 
 #endif
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4 15/35] crypto: poly1305 - expose init/update/final library interface
  2019-10-17 19:08 [PATCH v4 00/35] crypto: crypto API library interfaces for WireGuard Ard Biesheuvel
                   ` (13 preceding siblings ...)
  2019-10-17 19:09 ` [PATCH v4 14/35] crypto: x86/poly1305 - unify Poly1305 state struct with generic code Ard Biesheuvel
@ 2019-10-17 19:09 ` Ard Biesheuvel
  2019-10-17 19:09 ` [PATCH v4 16/35] crypto: x86/poly1305 - depend on generic library not generic shash Ard Biesheuvel
                   ` (19 subsequent siblings)
  34 siblings, 0 replies; 50+ messages in thread
From: Ard Biesheuvel @ 2019-10-17 19:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: Ard Biesheuvel, Herbert Xu, David Miller, Jason A . Donenfeld,
	Samuel Neves, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Martin Willi, Rene van Dorst, David Sterba

Expose the existing generic Poly1305 code via a init/update/final
library interface so that callers are not required to go through
the crypto API's shash abstraction to access it. At the same time,
make some preparations so that the library implementation can be
superseded by an accelerated arch-specific version in the future.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 crypto/poly1305_generic.c | 22 +-----
 include/crypto/poly1305.h | 38 +++++++++-
 lib/crypto/Kconfig        | 26 +++++++
 lib/crypto/poly1305.c     | 74 ++++++++++++++++++++
 4 files changed, 138 insertions(+), 22 deletions(-)

diff --git a/crypto/poly1305_generic.c b/crypto/poly1305_generic.c
index f3fcd9578a47..afe9a9e576dd 100644
--- a/crypto/poly1305_generic.c
+++ b/crypto/poly1305_generic.c
@@ -85,31 +85,11 @@ EXPORT_SYMBOL_GPL(crypto_poly1305_update);
 int crypto_poly1305_final(struct shash_desc *desc, u8 *dst)
 {
 	struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
-	__le32 digest[4];
-	u64 f = 0;
 
 	if (unlikely(!dctx->sset))
 		return -ENOKEY;
 
-	if (unlikely(dctx->buflen)) {
-		dctx->buf[dctx->buflen++] = 1;
-		memset(dctx->buf + dctx->buflen, 0,
-		       POLY1305_BLOCK_SIZE - dctx->buflen);
-		poly1305_core_blocks(&dctx->h, dctx->r, dctx->buf, 1, 0);
-	}
-
-	poly1305_core_emit(&dctx->h, digest);
-
-	/* mac = (h + s) % (2^128) */
-	f = (f >> 32) + le32_to_cpu(digest[0]) + dctx->s[0];
-	put_unaligned_le32(f, dst + 0);
-	f = (f >> 32) + le32_to_cpu(digest[1]) + dctx->s[1];
-	put_unaligned_le32(f, dst + 4);
-	f = (f >> 32) + le32_to_cpu(digest[2]) + dctx->s[2];
-	put_unaligned_le32(f, dst + 8);
-	f = (f >> 32) + le32_to_cpu(digest[3]) + dctx->s[3];
-	put_unaligned_le32(f, dst + 12);
-
+	poly1305_final_generic(dctx, dst);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(crypto_poly1305_final);
diff --git a/include/crypto/poly1305.h b/include/crypto/poly1305.h
index 36b5886cb50c..74c6e1cd73ee 100644
--- a/include/crypto/poly1305.h
+++ b/include/crypto/poly1305.h
@@ -35,7 +35,43 @@ struct poly1305_desc_ctx {
 	/* accumulator */
 	struct poly1305_state h;
 	/* key */
-	struct poly1305_key r[1];
+	struct poly1305_key r[CONFIG_CRYPTO_LIB_POLY1305_RSIZE];
 };
 
+void poly1305_init_arch(struct poly1305_desc_ctx *desc, const u8 *key);
+void poly1305_init_generic(struct poly1305_desc_ctx *desc, const u8 *key);
+
+static inline void poly1305_init(struct poly1305_desc_ctx *desc, const u8 *key)
+{
+	if (IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_POLY1305))
+		poly1305_init_arch(desc, key);
+	else
+		poly1305_init_generic(desc, key);
+}
+
+void poly1305_update_arch(struct poly1305_desc_ctx *desc, const u8 *src,
+			  unsigned int nbytes);
+void poly1305_update_generic(struct poly1305_desc_ctx *desc, const u8 *src,
+			     unsigned int nbytes);
+
+static inline void poly1305_update(struct poly1305_desc_ctx *desc,
+				   const u8 *src, unsigned int nbytes)
+{
+	if (IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_POLY1305))
+		poly1305_update_arch(desc, src, nbytes);
+	else
+		poly1305_update_generic(desc, src, nbytes);
+}
+
+void poly1305_final_arch(struct poly1305_desc_ctx *desc, u8 *digest);
+void poly1305_final_generic(struct poly1305_desc_ctx *desc, u8 *digest);
+
+static inline void poly1305_final(struct poly1305_desc_ctx *desc, u8 *digest)
+{
+	if (IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_POLY1305))
+		poly1305_final_arch(desc, digest);
+	else
+		poly1305_final_generic(desc, digest);
+}
+
 #endif
diff --git a/lib/crypto/Kconfig b/lib/crypto/Kconfig
index c4882d29879e..a731ea36bd5c 100644
--- a/lib/crypto/Kconfig
+++ b/lib/crypto/Kconfig
@@ -37,8 +37,34 @@ config CRYPTO_LIB_CHACHA
 config CRYPTO_LIB_DES
 	tristate
 
+config CRYPTO_LIB_POLY1305_RSIZE
+	int
+	default 1
+
+config CRYPTO_ARCH_HAVE_LIB_POLY1305
+	tristate
+	help
+	  Declares whether the architecture provides an arch-specific
+	  accelerated implementation of the Poly1305 library interface,
+	  either builtin or as a module.
+
 config CRYPTO_LIB_POLY1305_GENERIC
 	tristate
+	help
+	  This symbol can be depended upon by arch implementations of the
+	  Poly1305 library interface that require the generic code as a
+	  fallback, e.g., for SIMD implementations. If no arch specific
+	  implementation is enabled, this implementation serves the users
+	  of CRYPTO_LIB_POLY1305.
+
+config CRYPTO_LIB_POLY1305
+	tristate "Poly1305 library interface"
+	depends on CRYPTO_ARCH_HAVE_LIB_POLY1305 || !CRYPTO_ARCH_HAVE_LIB_POLY1305
+	select CRYPTO_LIB_POLY1305_GENERIC if CRYPTO_ARCH_HAVE_LIB_POLY1305=n
+	help
+	  Enable the Poly1305 library interface. This interface may be fulfilled
+	  by either the generic implementation or an arch-specific one, if one
+	  is available and enabled.
 
 config CRYPTO_LIB_SHA256
 	tristate
diff --git a/lib/crypto/poly1305.c b/lib/crypto/poly1305.c
index f019a57dbc1b..32ec293c65ae 100644
--- a/lib/crypto/poly1305.c
+++ b/lib/crypto/poly1305.c
@@ -154,5 +154,79 @@ void poly1305_core_emit(const struct poly1305_state *state, void *dst)
 }
 EXPORT_SYMBOL_GPL(poly1305_core_emit);
 
+void poly1305_init_generic(struct poly1305_desc_ctx *desc, const u8 *key)
+{
+	poly1305_core_setkey(desc->r, key);
+	desc->s[0] = get_unaligned_le32(key + 16);
+	desc->s[1] = get_unaligned_le32(key + 20);
+	desc->s[2] = get_unaligned_le32(key + 24);
+	desc->s[3] = get_unaligned_le32(key + 28);
+	poly1305_core_init(&desc->h);
+	desc->buflen = 0;
+	desc->sset = true;
+	desc->rset = 1;
+}
+EXPORT_SYMBOL_GPL(poly1305_init_generic);
+
+void poly1305_update_generic(struct poly1305_desc_ctx *desc, const u8 *src,
+			     unsigned int nbytes)
+{
+	unsigned int bytes;
+
+	if (unlikely(desc->buflen)) {
+		bytes = min(nbytes, POLY1305_BLOCK_SIZE - desc->buflen);
+		memcpy(desc->buf + desc->buflen, src, bytes);
+		src += bytes;
+		nbytes -= bytes;
+		desc->buflen += bytes;
+
+		if (desc->buflen == POLY1305_BLOCK_SIZE) {
+			poly1305_core_blocks(&desc->h, desc->r, desc->buf, 1, 1);
+			desc->buflen = 0;
+		}
+	}
+
+	if (likely(nbytes >= POLY1305_BLOCK_SIZE)) {
+		poly1305_core_blocks(&desc->h, desc->r, src,
+				     nbytes / POLY1305_BLOCK_SIZE, 1);
+		src += nbytes - (nbytes % POLY1305_BLOCK_SIZE);
+		nbytes %= POLY1305_BLOCK_SIZE;
+	}
+
+	if (unlikely(nbytes)) {
+		desc->buflen = nbytes;
+		memcpy(desc->buf, src, nbytes);
+	}
+}
+EXPORT_SYMBOL_GPL(poly1305_update_generic);
+
+void poly1305_final_generic(struct poly1305_desc_ctx *desc, u8 *dst)
+{
+	__le32 digest[4];
+	u64 f = 0;
+
+	if (unlikely(desc->buflen)) {
+		desc->buf[desc->buflen++] = 1;
+		memset(desc->buf + desc->buflen, 0,
+		       POLY1305_BLOCK_SIZE - desc->buflen);
+		poly1305_core_blocks(&desc->h, desc->r, desc->buf, 1, 0);
+	}
+
+	poly1305_core_emit(&desc->h, digest);
+
+	/* mac = (h + s) % (2^128) */
+	f = (f >> 32) + le32_to_cpu(digest[0]) + desc->s[0];
+	put_unaligned_le32(f, dst + 0);
+	f = (f >> 32) + le32_to_cpu(digest[1]) + desc->s[1];
+	put_unaligned_le32(f, dst + 4);
+	f = (f >> 32) + le32_to_cpu(digest[2]) + desc->s[2];
+	put_unaligned_le32(f, dst + 8);
+	f = (f >> 32) + le32_to_cpu(digest[3]) + desc->s[3];
+	put_unaligned_le32(f, dst + 12);
+
+	*desc = (struct poly1305_desc_ctx){};
+}
+EXPORT_SYMBOL_GPL(poly1305_final_generic);
+
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Martin Willi <martin@strongswan.org>");
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4 16/35] crypto: x86/poly1305 - depend on generic library not generic shash
  2019-10-17 19:08 [PATCH v4 00/35] crypto: crypto API library interfaces for WireGuard Ard Biesheuvel
                   ` (14 preceding siblings ...)
  2019-10-17 19:09 ` [PATCH v4 15/35] crypto: poly1305 - expose init/update/final library interface Ard Biesheuvel
@ 2019-10-17 19:09 ` Ard Biesheuvel
  2019-10-17 19:09 ` [PATCH v4 17/35] crypto: x86/poly1305 - expose existing driver as poly1305 library Ard Biesheuvel
                   ` (18 subsequent siblings)
  34 siblings, 0 replies; 50+ messages in thread
From: Ard Biesheuvel @ 2019-10-17 19:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: Ard Biesheuvel, Herbert Xu, David Miller, Jason A . Donenfeld,
	Samuel Neves, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Martin Willi, Rene van Dorst, David Sterba

Remove the dependency on the generic Poly1305 driver. Instead, depend
on the generic library so that we only reuse code without pulling in
the generic skcipher implementation as well.

While at it, remove the logic that prefers the non-SIMD path for short
inputs - this is no longer necessary after recent FPU handling changes
on x86.

Since this removes the last remaining user of the routines exported
by the generic shash driver, unexport them and make them static.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/x86/crypto/poly1305_glue.c    | 66 ++++++++++++++++----
 crypto/Kconfig                     |  2 +-
 crypto/poly1305_generic.c          | 11 ++--
 include/crypto/internal/poly1305.h |  9 ---
 4 files changed, 60 insertions(+), 28 deletions(-)

diff --git a/arch/x86/crypto/poly1305_glue.c b/arch/x86/crypto/poly1305_glue.c
index b43b93c95e79..a5b3a054604c 100644
--- a/arch/x86/crypto/poly1305_glue.c
+++ b/arch/x86/crypto/poly1305_glue.c
@@ -34,6 +34,24 @@ static void poly1305_simd_mult(u32 *a, const u32 *b)
 	poly1305_block_sse2(a, m, b, 1);
 }
 
+static unsigned int poly1305_scalar_blocks(struct poly1305_desc_ctx *dctx,
+					   const u8 *src, unsigned int srclen)
+{
+	unsigned int datalen;
+
+	if (unlikely(!dctx->sset)) {
+		datalen = crypto_poly1305_setdesckey(dctx, src, srclen);
+		src += srclen - datalen;
+		srclen = datalen;
+	}
+	if (srclen >= POLY1305_BLOCK_SIZE) {
+		poly1305_core_blocks(&dctx->h, dctx->r, src,
+				     srclen / POLY1305_BLOCK_SIZE, 1);
+		srclen %= POLY1305_BLOCK_SIZE;
+	}
+	return srclen;
+}
+
 static unsigned int poly1305_simd_blocks(struct poly1305_desc_ctx *dctx,
 					 const u8 *src, unsigned int srclen)
 {
@@ -91,12 +109,6 @@ static int poly1305_simd_update(struct shash_desc *desc,
 	struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
 	unsigned int bytes;
 
-	/* kernel_fpu_begin/end is costly, use fallback for small updates */
-	if (srclen <= 288 || !crypto_simd_usable())
-		return crypto_poly1305_update(desc, src, srclen);
-
-	kernel_fpu_begin();
-
 	if (unlikely(dctx->buflen)) {
 		bytes = min(srclen, POLY1305_BLOCK_SIZE - dctx->buflen);
 		memcpy(dctx->buf + dctx->buflen, src, bytes);
@@ -105,25 +117,57 @@ static int poly1305_simd_update(struct shash_desc *desc,
 		dctx->buflen += bytes;
 
 		if (dctx->buflen == POLY1305_BLOCK_SIZE) {
-			poly1305_simd_blocks(dctx, dctx->buf,
-					     POLY1305_BLOCK_SIZE);
+			if (likely(crypto_simd_usable())) {
+				kernel_fpu_begin();
+				poly1305_simd_blocks(dctx, dctx->buf,
+						     POLY1305_BLOCK_SIZE);
+				kernel_fpu_end();
+			} else {
+				poly1305_scalar_blocks(dctx, dctx->buf,
+						       POLY1305_BLOCK_SIZE);
+			}
 			dctx->buflen = 0;
 		}
 	}
 
 	if (likely(srclen >= POLY1305_BLOCK_SIZE)) {
-		bytes = poly1305_simd_blocks(dctx, src, srclen);
+		if (likely(crypto_simd_usable())) {
+			kernel_fpu_begin();
+			bytes = poly1305_simd_blocks(dctx, src, srclen);
+			kernel_fpu_end();
+		} else {
+			bytes = poly1305_scalar_blocks(dctx, src, srclen);
+		}
 		src += srclen - bytes;
 		srclen = bytes;
 	}
 
-	kernel_fpu_end();
-
 	if (unlikely(srclen)) {
 		dctx->buflen = srclen;
 		memcpy(dctx->buf, src, srclen);
 	}
+}
+
+static int crypto_poly1305_init(struct shash_desc *desc)
+{
+	struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
+
+	poly1305_core_init(&dctx->h);
+	dctx->buflen = 0;
+	dctx->rset = 0;
+	dctx->sset = false;
+
+	return 0;
+}
+
+static int crypto_poly1305_final(struct shash_desc *desc, u8 *dst)
+{
+	struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
+
+	if (unlikely(!dctx->sset))
+		return -ENOKEY;
 
+	poly1305_final_generic(dctx, dst);
 	return 0;
 }
 
diff --git a/crypto/Kconfig b/crypto/Kconfig
index d71b8afa01cd..950cf1a8dfc0 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -697,7 +697,7 @@ config CRYPTO_POLY1305
 config CRYPTO_POLY1305_X86_64
 	tristate "Poly1305 authenticator algorithm (x86_64/SSE2/AVX2)"
 	depends on X86 && 64BIT
-	select CRYPTO_POLY1305
+	select CRYPTO_LIB_POLY1305_GENERIC
 	help
 	  Poly1305 authenticator algorithm, RFC7539.
 
diff --git a/crypto/poly1305_generic.c b/crypto/poly1305_generic.c
index afe9a9e576dd..21edbd8c99fb 100644
--- a/crypto/poly1305_generic.c
+++ b/crypto/poly1305_generic.c
@@ -19,7 +19,7 @@
 #include <linux/module.h>
 #include <asm/unaligned.h>
 
-int crypto_poly1305_init(struct shash_desc *desc)
+static int crypto_poly1305_init(struct shash_desc *desc)
 {
 	struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
 
@@ -30,7 +30,6 @@ int crypto_poly1305_init(struct shash_desc *desc)
 
 	return 0;
 }
-EXPORT_SYMBOL_GPL(crypto_poly1305_init);
 
 static void poly1305_blocks(struct poly1305_desc_ctx *dctx, const u8 *src,
 			    unsigned int srclen)
@@ -47,8 +46,8 @@ static void poly1305_blocks(struct poly1305_desc_ctx *dctx, const u8 *src,
 			     srclen / POLY1305_BLOCK_SIZE, 1);
 }
 
-int crypto_poly1305_update(struct shash_desc *desc,
-			   const u8 *src, unsigned int srclen)
+static int crypto_poly1305_update(struct shash_desc *desc,
+				  const u8 *src, unsigned int srclen)
 {
 	struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
 	unsigned int bytes;
@@ -80,9 +79,8 @@ int crypto_poly1305_update(struct shash_desc *desc,
 
 	return 0;
 }
-EXPORT_SYMBOL_GPL(crypto_poly1305_update);
 
-int crypto_poly1305_final(struct shash_desc *desc, u8 *dst)
+static int crypto_poly1305_final(struct shash_desc *desc, u8 *dst)
 {
 	struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
 
@@ -92,7 +90,6 @@ int crypto_poly1305_final(struct shash_desc *desc, u8 *dst)
 	poly1305_final_generic(dctx, dst);
 	return 0;
 }
-EXPORT_SYMBOL_GPL(crypto_poly1305_final);
 
 static struct shash_alg poly1305_alg = {
 	.digestsize	= POLY1305_DIGEST_SIZE,
diff --git a/include/crypto/internal/poly1305.h b/include/crypto/internal/poly1305.h
index 04fa269e5534..479b0cab2a1a 100644
--- a/include/crypto/internal/poly1305.h
+++ b/include/crypto/internal/poly1305.h
@@ -10,8 +10,6 @@
 #include <linux/types.h>
 #include <crypto/poly1305.h>
 
-struct shash_desc;
-
 /*
  * Poly1305 core functions.  These implement the ε-almost-∆-universal hash
  * function underlying the Poly1305 MAC, i.e. they don't add an encrypted nonce
@@ -28,13 +26,6 @@ void poly1305_core_blocks(struct poly1305_state *state,
 			  unsigned int nblocks, u32 hibit);
 void poly1305_core_emit(const struct poly1305_state *state, void *dst);
 
-/* Crypto API helper functions for the Poly1305 MAC */
-int crypto_poly1305_init(struct shash_desc *desc);
-
-int crypto_poly1305_update(struct shash_desc *desc,
-			   const u8 *src, unsigned int srclen);
-int crypto_poly1305_final(struct shash_desc *desc, u8 *dst);
-
 /*
  * Poly1305 requires a unique key for each tag, which implies that we can't set
  * it on the tfm that gets accessed by multiple users simultaneously. Instead we
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4 17/35] crypto: x86/poly1305 - expose existing driver as poly1305 library
  2019-10-17 19:08 [PATCH v4 00/35] crypto: crypto API library interfaces for WireGuard Ard Biesheuvel
                   ` (15 preceding siblings ...)
  2019-10-17 19:09 ` [PATCH v4 16/35] crypto: x86/poly1305 - depend on generic library not generic shash Ard Biesheuvel
@ 2019-10-17 19:09 ` Ard Biesheuvel
  2019-10-17 19:09 ` [PATCH v4 18/35] crypto: arm64/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation Ard Biesheuvel
                   ` (17 subsequent siblings)
  34 siblings, 0 replies; 50+ messages in thread
From: Ard Biesheuvel @ 2019-10-17 19:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: Ard Biesheuvel, Herbert Xu, David Miller, Jason A . Donenfeld,
	Samuel Neves, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Martin Willi, Rene van Dorst, David Sterba

Implement the arch init/update/final Poly1305 library routines in the
accelerated SIMD driver for x86 so they are accessible to users of
the Poly1305 library interface as well.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/x86/crypto/poly1305_glue.c | 57 ++++++++++++++------
 crypto/Kconfig                  |  1 +
 lib/crypto/Kconfig              |  1 +
 3 files changed, 43 insertions(+), 16 deletions(-)

diff --git a/arch/x86/crypto/poly1305_glue.c b/arch/x86/crypto/poly1305_glue.c
index a5b3a054604c..370cd88068ec 100644
--- a/arch/x86/crypto/poly1305_glue.c
+++ b/arch/x86/crypto/poly1305_glue.c
@@ -10,6 +10,7 @@
 #include <crypto/internal/poly1305.h>
 #include <crypto/internal/simd.h>
 #include <linux/crypto.h>
+#include <linux/jump_label.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <asm/simd.h>
@@ -21,7 +22,8 @@ asmlinkage void poly1305_2block_sse2(u32 *h, const u8 *src, const u32 *r,
 asmlinkage void poly1305_4block_avx2(u32 *h, const u8 *src, const u32 *r,
 				     unsigned int blocks, const u32 *u);
 
-static bool poly1305_use_avx2 __ro_after_init;
+static __ro_after_init DEFINE_STATIC_KEY_FALSE(poly1305_use_simd);
+static __ro_after_init DEFINE_STATIC_KEY_FALSE(poly1305_use_avx2);
 
 static void poly1305_simd_mult(u32 *a, const u32 *b)
 {
@@ -64,7 +66,7 @@ static unsigned int poly1305_simd_blocks(struct poly1305_desc_ctx *dctx,
 	}
 
 	if (IS_ENABLED(CONFIG_AS_AVX2) &&
-	    poly1305_use_avx2 &&
+	    static_branch_likely(&poly1305_use_avx2) &&
 	    srclen >= POLY1305_BLOCK_SIZE * 4) {
 		if (unlikely(dctx->rset < 4)) {
 			if (dctx->rset < 2) {
@@ -103,10 +105,15 @@ static unsigned int poly1305_simd_blocks(struct poly1305_desc_ctx *dctx,
 	return srclen;
 }
 
-static int poly1305_simd_update(struct shash_desc *desc,
-				const u8 *src, unsigned int srclen)
+void poly1305_init_arch(struct poly1305_desc_ctx *desc, const u8 *key)
+{
+	poly1305_init_generic(desc, key);
+}
+EXPORT_SYMBOL(poly1305_init_arch);
+
+void poly1305_update_arch(struct poly1305_desc_ctx *dctx, const u8 *src,
+			  unsigned int srclen)
 {
-	struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
 	unsigned int bytes;
 
 	if (unlikely(dctx->buflen)) {
@@ -117,7 +124,8 @@ static int poly1305_simd_update(struct shash_desc *desc,
 		dctx->buflen += bytes;
 
 		if (dctx->buflen == POLY1305_BLOCK_SIZE) {
-			if (likely(crypto_simd_usable())) {
+			if (static_branch_likely(&poly1305_use_simd) &&
+			    likely(crypto_simd_usable())) {
 				kernel_fpu_begin();
 				poly1305_simd_blocks(dctx, dctx->buf,
 						     POLY1305_BLOCK_SIZE);
@@ -131,7 +139,8 @@ static int poly1305_simd_update(struct shash_desc *desc,
 	}
 
 	if (likely(srclen >= POLY1305_BLOCK_SIZE)) {
-		if (likely(crypto_simd_usable())) {
+		if (static_branch_likely(&poly1305_use_simd) &&
+		    likely(crypto_simd_usable())) {
 			kernel_fpu_begin();
 			bytes = poly1305_simd_blocks(dctx, src, srclen);
 			kernel_fpu_end();
@@ -147,6 +156,13 @@ static int poly1305_simd_update(struct shash_desc *desc,
 		memcpy(dctx->buf, src, srclen);
 	}
 }
+EXPORT_SYMBOL(poly1305_update_arch);
+
+void poly1305_final_arch(struct poly1305_desc_ctx *desc, u8 *digest)
+{
+	poly1305_final_generic(desc, digest);
+}
+EXPORT_SYMBOL(poly1305_final_arch);
 
 static int crypto_poly1305_init(struct shash_desc *desc)
 {
@@ -171,6 +187,15 @@ static int crypto_poly1305_final(struct shash_desc *desc, u8 *dst)
 	return 0;
 }
 
+static int poly1305_simd_update(struct shash_desc *desc,
+				const u8 *src, unsigned int srclen)
+{
+	struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
+
+	poly1305_update_arch(dctx, src, srclen);
+	return 0;
+}
+
 static struct shash_alg alg = {
 	.digestsize	= POLY1305_DIGEST_SIZE,
 	.init		= crypto_poly1305_init,
@@ -189,15 +214,15 @@ static struct shash_alg alg = {
 static int __init poly1305_simd_mod_init(void)
 {
 	if (!boot_cpu_has(X86_FEATURE_XMM2))
-		return -ENODEV;
-
-	poly1305_use_avx2 = IS_ENABLED(CONFIG_AS_AVX2) &&
-			    boot_cpu_has(X86_FEATURE_AVX) &&
-			    boot_cpu_has(X86_FEATURE_AVX2) &&
-			    cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, NULL);
-	alg.descsize = sizeof(struct poly1305_desc_ctx) + 5 * sizeof(u32);
-	if (poly1305_use_avx2)
-		alg.descsize += 10 * sizeof(u32);
+		return 0;
+
+	static_branch_enable(&poly1305_use_simd);
+
+	if (IS_ENABLED(CONFIG_AS_AVX2) &&
+	    boot_cpu_has(X86_FEATURE_AVX) &&
+	    boot_cpu_has(X86_FEATURE_AVX2) &&
+	    cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, NULL))
+		static_branch_enable(&poly1305_use_avx2);
 
 	return crypto_register_shash(&alg);
 }
diff --git a/crypto/Kconfig b/crypto/Kconfig
index 950cf1a8dfc0..325cd31f487f 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -698,6 +698,7 @@ config CRYPTO_POLY1305_X86_64
 	tristate "Poly1305 authenticator algorithm (x86_64/SSE2/AVX2)"
 	depends on X86 && 64BIT
 	select CRYPTO_LIB_POLY1305_GENERIC
+	select CRYPTO_ARCH_HAVE_LIB_POLY1305
 	help
 	  Poly1305 authenticator algorithm, RFC7539.
 
diff --git a/lib/crypto/Kconfig b/lib/crypto/Kconfig
index a731ea36bd5c..181754615f73 100644
--- a/lib/crypto/Kconfig
+++ b/lib/crypto/Kconfig
@@ -39,6 +39,7 @@ config CRYPTO_LIB_DES
 
 config CRYPTO_LIB_POLY1305_RSIZE
 	int
+	default 4 if X86_64
 	default 1
 
 config CRYPTO_ARCH_HAVE_LIB_POLY1305
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4 18/35] crypto: arm64/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation
  2019-10-17 19:08 [PATCH v4 00/35] crypto: crypto API library interfaces for WireGuard Ard Biesheuvel
                   ` (16 preceding siblings ...)
  2019-10-17 19:09 ` [PATCH v4 17/35] crypto: x86/poly1305 - expose existing driver as poly1305 library Ard Biesheuvel
@ 2019-10-17 19:09 ` Ard Biesheuvel
  2019-10-17 19:09 ` [PATCH v4 19/35] crypto: arm/poly1305 " Ard Biesheuvel
                   ` (16 subsequent siblings)
  34 siblings, 0 replies; 50+ messages in thread
From: Ard Biesheuvel @ 2019-10-17 19:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: Ard Biesheuvel, Herbert Xu, David Miller, Jason A . Donenfeld,
	Samuel Neves, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Martin Willi, Rene van Dorst, David Sterba, Andy Polyakov

This is a straight import of the OpenSSL/CRYPTOGAMS Poly1305 implementation
for NEON authored by Andy Polyakov, and contributed by him to the OpenSSL
project. The file 'poly1305-armv8.pl' is taken straight from this upstream
GitHub repository [0] at commit ec55a08dc0244ce570c4fc7cade330c60798952f,
and already contains all the changes required to build it as part of a
Linux kernel module.

[0] https://github.com/dot-asm/cryptogams

Co-developed-by: Andy Polyakov <appro@cryptogams.org>
Signed-off-by: Andy Polyakov <appro@cryptogams.org>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/crypto/Kconfig                 |   6 +
 arch/arm64/crypto/Makefile                |  10 +-
 arch/arm64/crypto/poly1305-armv8.pl       | 913 ++++++++++++++++++++
 arch/arm64/crypto/poly1305-core.S_shipped | 835 ++++++++++++++++++
 arch/arm64/crypto/poly1305-glue.c         | 237 +++++
 lib/crypto/Kconfig                        |   1 +
 6 files changed, 2001 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index 17bada4b9dd2..30d9b24ee86e 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -106,6 +106,12 @@ config CRYPTO_CHACHA20_NEON
 	select CRYPTO_LIB_CHACHA_GENERIC
 	select CRYPTO_ARCH_HAVE_LIB_CHACHA
 
+config CRYPTO_POLY1305_NEON
+	tristate "Poly1305 hash function using scalar or NEON instructions"
+	depends on KERNEL_MODE_NEON
+	select CRYPTO_HASH
+	select CRYPTO_ARCH_HAVE_LIB_POLY1305
+
 config CRYPTO_NHPOLY1305_NEON
 	tristate "NHPoly1305 hash function using NEON instructions (for Adiantum)"
 	depends on KERNEL_MODE_NEON
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index 0435f2a0610e..d0901e610df3 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -50,6 +50,10 @@ sha512-arm64-y := sha512-glue.o sha512-core.o
 obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha-neon.o
 chacha-neon-y := chacha-neon-core.o chacha-neon-glue.o
 
+obj-$(CONFIG_CRYPTO_POLY1305_NEON) += poly1305-neon.o
+poly1305-neon-y := poly1305-core.o poly1305-glue.o
+AFLAGS_poly1305-core.o += -Dpoly1305_init=poly1305_init_arm64
+
 obj-$(CONFIG_CRYPTO_NHPOLY1305_NEON) += nhpoly1305-neon.o
 nhpoly1305-neon-y := nh-neon-core.o nhpoly1305-neon-glue.o
 
@@ -68,11 +72,15 @@ ifdef REGENERATE_ARM64_CRYPTO
 quiet_cmd_perlasm = PERLASM $@
       cmd_perlasm = $(PERL) $(<) void $(@)
 
+$(src)/poly1305-core.S_shipped: $(src)/poly1305-armv8.pl
+	$(call cmd,perlasm)
+
 $(src)/sha256-core.S_shipped: $(src)/sha512-armv8.pl
 	$(call cmd,perlasm)
 
 $(src)/sha512-core.S_shipped: $(src)/sha512-armv8.pl
 	$(call cmd,perlasm)
+
 endif
 
-clean-files += sha256-core.S sha512-core.S
+clean-files += poly1305-core.S sha256-core.S sha512-core.S
diff --git a/arch/arm64/crypto/poly1305-armv8.pl b/arch/arm64/crypto/poly1305-armv8.pl
new file mode 100644
index 000000000000..6e5576d19af8
--- /dev/null
+++ b/arch/arm64/crypto/poly1305-armv8.pl
@@ -0,0 +1,913 @@
+#!/usr/bin/env perl
+# SPDX-License-Identifier: GPL-1.0+ OR BSD-3-Clause
+#
+# ====================================================================
+# Written by Andy Polyakov, @dot-asm, initially for the OpenSSL
+# project.
+# ====================================================================
+#
+# This module implements Poly1305 hash for ARMv8.
+#
+# June 2015
+#
+# Numbers are cycles per processed byte with poly1305_blocks alone.
+#
+#		IALU/gcc-4.9	NEON
+#
+# Apple A7	1.86/+5%	0.72
+# Cortex-A53	2.69/+58%	1.47
+# Cortex-A57	2.70/+7%	1.14
+# Denver	1.64/+50%	1.18(*)
+# X-Gene	2.13/+68%	2.27
+# Mongoose	1.77/+75%	1.12
+# Kryo		2.70/+55%	1.13
+# ThunderX2	1.17/+95%	1.36
+#
+# (*)	estimate based on resources availability is less than 1.0,
+#	i.e. measured result is worse than expected, presumably binary
+#	translator is not almighty;
+
+$flavour=shift;
+$output=shift;
+
+if ($flavour && $flavour ne "void") {
+    $0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1;
+    ( $xlate="${dir}arm-xlate.pl" and -f $xlate ) or
+    ( $xlate="${dir}../../perlasm/arm-xlate.pl" and -f $xlate) or
+    die "can't locate arm-xlate.pl";
+
+    open STDOUT,"| \"$^X\" $xlate $flavour $output";
+} else {
+    open STDOUT,">$output";
+}
+
+my ($ctx,$inp,$len,$padbit) = map("x$_",(0..3));
+my ($mac,$nonce)=($inp,$len);
+
+my ($h0,$h1,$h2,$r0,$r1,$s1,$t0,$t1,$d0,$d1,$d2) = map("x$_",(4..14));
+
+$code.=<<___;
+#ifndef __KERNEL__
+# include "arm_arch.h"
+.extern	OPENSSL_armcap_P
+#endif
+
+.text
+
+// forward "declarations" are required for Apple
+.globl	poly1305_blocks
+.globl	poly1305_emit
+
+.globl	poly1305_init
+.type	poly1305_init,%function
+.align	5
+poly1305_init:
+	cmp	$inp,xzr
+	stp	xzr,xzr,[$ctx]		// zero hash value
+	stp	xzr,xzr,[$ctx,#16]	// [along with is_base2_26]
+
+	csel	x0,xzr,x0,eq
+	b.eq	.Lno_key
+
+#ifndef	__KERNEL__
+	adrp	x17,OPENSSL_armcap_P
+	ldr	w17,[x17,#:lo12:OPENSSL_armcap_P]
+#endif
+
+	ldp	$r0,$r1,[$inp]		// load key
+	mov	$s1,#0xfffffffc0fffffff
+	movk	$s1,#0x0fff,lsl#48
+#ifdef	__AARCH64EB__
+	rev	$r0,$r0			// flip bytes
+	rev	$r1,$r1
+#endif
+	and	$r0,$r0,$s1		// &=0ffffffc0fffffff
+	and	$s1,$s1,#-4
+	and	$r1,$r1,$s1		// &=0ffffffc0ffffffc
+	mov	w#$s1,#-1
+	stp	$r0,$r1,[$ctx,#32]	// save key value
+	str	w#$s1,[$ctx,#48]	// impossible key power value
+
+#ifndef	__KERNEL__
+	tst	w17,#ARMV7_NEON
+
+	adr	$d0,.Lpoly1305_blocks
+	adr	$r0,.Lpoly1305_blocks_neon
+	adr	$d1,.Lpoly1305_emit
+
+	csel	$d0,$d0,$r0,eq
+
+# ifdef	__ILP32__
+	stp	w#$d0,w#$d1,[$len]
+# else
+	stp	$d0,$d1,[$len]
+# endif
+#endif
+	mov	x0,#1
+.Lno_key:
+	ret
+.size	poly1305_init,.-poly1305_init
+
+.type	poly1305_blocks,%function
+.align	5
+poly1305_blocks:
+.Lpoly1305_blocks:
+	ands	$len,$len,#-16
+	b.eq	.Lno_data
+
+	ldp	$h0,$h1,[$ctx]		// load hash value
+	ldp	$h2,x17,[$ctx,#16]	// [along with is_base2_26]
+	ldp	$r0,$r1,[$ctx,#32]	// load key value
+
+#ifdef	__AARCH64EB__
+	lsr	$d0,$h0,#32
+	mov	w#$d1,w#$h0
+	lsr	$d2,$h1,#32
+	mov	w15,w#$h1
+	lsr	x16,$h2,#32
+#else
+	mov	w#$d0,w#$h0
+	lsr	$d1,$h0,#32
+	mov	w#$d2,w#$h1
+	lsr	x15,$h1,#32
+	mov	w16,w#$h2
+#endif
+
+	add	$d0,$d0,$d1,lsl#26	// base 2^26 -> base 2^64
+	lsr	$d1,$d2,#12
+	adds	$d0,$d0,$d2,lsl#52
+	add	$d1,$d1,x15,lsl#14
+	adc	$d1,$d1,xzr
+	lsr	$d2,x16,#24
+	adds	$d1,$d1,x16,lsl#40
+	adc	$d2,$d2,xzr
+
+	cmp	x17,#0			// is_base2_26?
+	add	$s1,$r1,$r1,lsr#2	// s1 = r1 + (r1 >> 2)
+	csel	$h0,$h0,$d0,eq		// choose between radixes
+	csel	$h1,$h1,$d1,eq
+	csel	$h2,$h2,$d2,eq
+
+.Loop:
+	ldp	$t0,$t1,[$inp],#16	// load input
+	sub	$len,$len,#16
+#ifdef	__AARCH64EB__
+	rev	$t0,$t0
+	rev	$t1,$t1
+#endif
+	adds	$h0,$h0,$t0		// accumulate input
+	adcs	$h1,$h1,$t1
+
+	mul	$d0,$h0,$r0		// h0*r0
+	adc	$h2,$h2,$padbit
+	umulh	$d1,$h0,$r0
+
+	mul	$t0,$h1,$s1		// h1*5*r1
+	umulh	$t1,$h1,$s1
+
+	adds	$d0,$d0,$t0
+	mul	$t0,$h0,$r1		// h0*r1
+	adc	$d1,$d1,$t1
+	umulh	$d2,$h0,$r1
+
+	adds	$d1,$d1,$t0
+	mul	$t0,$h1,$r0		// h1*r0
+	adc	$d2,$d2,xzr
+	umulh	$t1,$h1,$r0
+
+	adds	$d1,$d1,$t0
+	mul	$t0,$h2,$s1		// h2*5*r1
+	adc	$d2,$d2,$t1
+	mul	$t1,$h2,$r0		// h2*r0
+
+	adds	$d1,$d1,$t0
+	adc	$d2,$d2,$t1
+
+	and	$t0,$d2,#-4		// final reduction
+	and	$h2,$d2,#3
+	add	$t0,$t0,$d2,lsr#2
+	adds	$h0,$d0,$t0
+	adcs	$h1,$d1,xzr
+	adc	$h2,$h2,xzr
+
+	cbnz	$len,.Loop
+
+	stp	$h0,$h1,[$ctx]		// store hash value
+	stp	$h2,xzr,[$ctx,#16]	// [and clear is_base2_26]
+
+.Lno_data:
+	ret
+.size	poly1305_blocks,.-poly1305_blocks
+
+.type	poly1305_emit,%function
+.align	5
+poly1305_emit:
+.Lpoly1305_emit:
+	ldp	$h0,$h1,[$ctx]		// load hash base 2^64
+	ldp	$h2,$r0,[$ctx,#16]	// [along with is_base2_26]
+	ldp	$t0,$t1,[$nonce]	// load nonce
+
+#ifdef	__AARCH64EB__
+	lsr	$d0,$h0,#32
+	mov	w#$d1,w#$h0
+	lsr	$d2,$h1,#32
+	mov	w15,w#$h1
+	lsr	x16,$h2,#32
+#else
+	mov	w#$d0,w#$h0
+	lsr	$d1,$h0,#32
+	mov	w#$d2,w#$h1
+	lsr	x15,$h1,#32
+	mov	w16,w#$h2
+#endif
+
+	add	$d0,$d0,$d1,lsl#26	// base 2^26 -> base 2^64
+	lsr	$d1,$d2,#12
+	adds	$d0,$d0,$d2,lsl#52
+	add	$d1,$d1,x15,lsl#14
+	adc	$d1,$d1,xzr
+	lsr	$d2,x16,#24
+	adds	$d1,$d1,x16,lsl#40
+	adc	$d2,$d2,xzr
+
+	cmp	$r0,#0			// is_base2_26?
+	csel	$h0,$h0,$d0,eq		// choose between radixes
+	csel	$h1,$h1,$d1,eq
+	csel	$h2,$h2,$d2,eq
+
+	adds	$d0,$h0,#5		// compare to modulus
+	adcs	$d1,$h1,xzr
+	adc	$d2,$h2,xzr
+
+	tst	$d2,#-4			// see if it's carried/borrowed
+
+	csel	$h0,$h0,$d0,eq
+	csel	$h1,$h1,$d1,eq
+
+#ifdef	__AARCH64EB__
+	ror	$t0,$t0,#32		// flip nonce words
+	ror	$t1,$t1,#32
+#endif
+	adds	$h0,$h0,$t0		// accumulate nonce
+	adc	$h1,$h1,$t1
+#ifdef	__AARCH64EB__
+	rev	$h0,$h0			// flip output bytes
+	rev	$h1,$h1
+#endif
+	stp	$h0,$h1,[$mac]		// write result
+
+	ret
+.size	poly1305_emit,.-poly1305_emit
+___
+my ($R0,$R1,$S1,$R2,$S2,$R3,$S3,$R4,$S4) = map("v$_.4s",(0..8));
+my ($IN01_0,$IN01_1,$IN01_2,$IN01_3,$IN01_4) = map("v$_.2s",(9..13));
+my ($IN23_0,$IN23_1,$IN23_2,$IN23_3,$IN23_4) = map("v$_.2s",(14..18));
+my ($ACC0,$ACC1,$ACC2,$ACC3,$ACC4) = map("v$_.2d",(19..23));
+my ($H0,$H1,$H2,$H3,$H4) = map("v$_.2s",(24..28));
+my ($T0,$T1,$MASK) = map("v$_",(29..31));
+
+my ($in2,$zeros)=("x16","x17");
+my $is_base2_26 = $zeros;		# borrow
+
+$code.=<<___;
+.type	poly1305_mult,%function
+.align	5
+poly1305_mult:
+	mul	$d0,$h0,$r0		// h0*r0
+	umulh	$d1,$h0,$r0
+
+	mul	$t0,$h1,$s1		// h1*5*r1
+	umulh	$t1,$h1,$s1
+
+	adds	$d0,$d0,$t0
+	mul	$t0,$h0,$r1		// h0*r1
+	adc	$d1,$d1,$t1
+	umulh	$d2,$h0,$r1
+
+	adds	$d1,$d1,$t0
+	mul	$t0,$h1,$r0		// h1*r0
+	adc	$d2,$d2,xzr
+	umulh	$t1,$h1,$r0
+
+	adds	$d1,$d1,$t0
+	mul	$t0,$h2,$s1		// h2*5*r1
+	adc	$d2,$d2,$t1
+	mul	$t1,$h2,$r0		// h2*r0
+
+	adds	$d1,$d1,$t0
+	adc	$d2,$d2,$t1
+
+	and	$t0,$d2,#-4		// final reduction
+	and	$h2,$d2,#3
+	add	$t0,$t0,$d2,lsr#2
+	adds	$h0,$d0,$t0
+	adcs	$h1,$d1,xzr
+	adc	$h2,$h2,xzr
+
+	ret
+.size	poly1305_mult,.-poly1305_mult
+
+.type	poly1305_splat,%function
+.align	4
+poly1305_splat:
+	and	x12,$h0,#0x03ffffff	// base 2^64 -> base 2^26
+	ubfx	x13,$h0,#26,#26
+	extr	x14,$h1,$h0,#52
+	and	x14,x14,#0x03ffffff
+	ubfx	x15,$h1,#14,#26
+	extr	x16,$h2,$h1,#40
+
+	str	w12,[$ctx,#16*0]	// r0
+	add	w12,w13,w13,lsl#2	// r1*5
+	str	w13,[$ctx,#16*1]	// r1
+	add	w13,w14,w14,lsl#2	// r2*5
+	str	w12,[$ctx,#16*2]	// s1
+	str	w14,[$ctx,#16*3]	// r2
+	add	w14,w15,w15,lsl#2	// r3*5
+	str	w13,[$ctx,#16*4]	// s2
+	str	w15,[$ctx,#16*5]	// r3
+	add	w15,w16,w16,lsl#2	// r4*5
+	str	w14,[$ctx,#16*6]	// s3
+	str	w16,[$ctx,#16*7]	// r4
+	str	w15,[$ctx,#16*8]	// s4
+
+	ret
+.size	poly1305_splat,.-poly1305_splat
+
+#ifdef	__KERNEL__
+.globl	poly1305_blocks_neon
+#endif
+.type	poly1305_blocks_neon,%function
+.align	5
+poly1305_blocks_neon:
+.Lpoly1305_blocks_neon:
+	ldr	$is_base2_26,[$ctx,#24]
+	cmp	$len,#128
+	b.lo	.Lpoly1305_blocks
+
+	.inst	0xd503233f		// paciasp
+	stp	x29,x30,[sp,#-80]!
+	add	x29,sp,#0
+
+	stp	d8,d9,[sp,#16]		// meet ABI requirements
+	stp	d10,d11,[sp,#32]
+	stp	d12,d13,[sp,#48]
+	stp	d14,d15,[sp,#64]
+
+	cbz	$is_base2_26,.Lbase2_64_neon
+
+	ldp	w10,w11,[$ctx]		// load hash value base 2^26
+	ldp	w12,w13,[$ctx,#8]
+	ldr	w14,[$ctx,#16]
+
+	tst	$len,#31
+	b.eq	.Leven_neon
+
+	ldp	$r0,$r1,[$ctx,#32]	// load key value
+
+	add	$h0,x10,x11,lsl#26	// base 2^26 -> base 2^64
+	lsr	$h1,x12,#12
+	adds	$h0,$h0,x12,lsl#52
+	add	$h1,$h1,x13,lsl#14
+	adc	$h1,$h1,xzr
+	lsr	$h2,x14,#24
+	adds	$h1,$h1,x14,lsl#40
+	adc	$d2,$h2,xzr		// can be partially reduced...
+
+	ldp	$d0,$d1,[$inp],#16	// load input
+	sub	$len,$len,#16
+	add	$s1,$r1,$r1,lsr#2	// s1 = r1 + (r1 >> 2)
+
+#ifdef	__AARCH64EB__
+	rev	$d0,$d0
+	rev	$d1,$d1
+#endif
+	adds	$h0,$h0,$d0		// accumulate input
+	adcs	$h1,$h1,$d1
+	adc	$h2,$h2,$padbit
+
+	bl	poly1305_mult
+
+	and	x10,$h0,#0x03ffffff	// base 2^64 -> base 2^26
+	ubfx	x11,$h0,#26,#26
+	extr	x12,$h1,$h0,#52
+	and	x12,x12,#0x03ffffff
+	ubfx	x13,$h1,#14,#26
+	extr	x14,$h2,$h1,#40
+
+	b	.Leven_neon
+
+.align	4
+.Lbase2_64_neon:
+	ldp	$r0,$r1,[$ctx,#32]	// load key value
+
+	ldp	$h0,$h1,[$ctx]		// load hash value base 2^64
+	ldr	$h2,[$ctx,#16]
+
+	tst	$len,#31
+	b.eq	.Linit_neon
+
+	ldp	$d0,$d1,[$inp],#16	// load input
+	sub	$len,$len,#16
+	add	$s1,$r1,$r1,lsr#2	// s1 = r1 + (r1 >> 2)
+#ifdef	__AARCH64EB__
+	rev	$d0,$d0
+	rev	$d1,$d1
+#endif
+	adds	$h0,$h0,$d0		// accumulate input
+	adcs	$h1,$h1,$d1
+	adc	$h2,$h2,$padbit
+
+	bl	poly1305_mult
+
+.Linit_neon:
+	ldr	w17,[$ctx,#48]		// first table element
+	and	x10,$h0,#0x03ffffff	// base 2^64 -> base 2^26
+	ubfx	x11,$h0,#26,#26
+	extr	x12,$h1,$h0,#52
+	and	x12,x12,#0x03ffffff
+	ubfx	x13,$h1,#14,#26
+	extr	x14,$h2,$h1,#40
+
+	cmp	w17,#-1			// is value impossible?
+	b.ne	.Leven_neon
+
+	fmov	${H0},x10
+	fmov	${H1},x11
+	fmov	${H2},x12
+	fmov	${H3},x13
+	fmov	${H4},x14
+
+	////////////////////////////////// initialize r^n table
+	mov	$h0,$r0			// r^1
+	add	$s1,$r1,$r1,lsr#2	// s1 = r1 + (r1 >> 2)
+	mov	$h1,$r1
+	mov	$h2,xzr
+	add	$ctx,$ctx,#48+12
+	bl	poly1305_splat
+
+	bl	poly1305_mult		// r^2
+	sub	$ctx,$ctx,#4
+	bl	poly1305_splat
+
+	bl	poly1305_mult		// r^3
+	sub	$ctx,$ctx,#4
+	bl	poly1305_splat
+
+	bl	poly1305_mult		// r^4
+	sub	$ctx,$ctx,#4
+	bl	poly1305_splat
+	sub	$ctx,$ctx,#48		// restore original $ctx
+	b	.Ldo_neon
+
+.align	4
+.Leven_neon:
+	fmov	${H0},x10
+	fmov	${H1},x11
+	fmov	${H2},x12
+	fmov	${H3},x13
+	fmov	${H4},x14
+
+.Ldo_neon:
+	ldp	x8,x12,[$inp,#32]	// inp[2:3]
+	subs	$len,$len,#64
+	ldp	x9,x13,[$inp,#48]
+	add	$in2,$inp,#96
+	adr	$zeros,.Lzeros
+
+	lsl	$padbit,$padbit,#24
+	add	x15,$ctx,#48
+
+#ifdef	__AARCH64EB__
+	rev	x8,x8
+	rev	x12,x12
+	rev	x9,x9
+	rev	x13,x13
+#endif
+	and	x4,x8,#0x03ffffff	// base 2^64 -> base 2^26
+	and	x5,x9,#0x03ffffff
+	ubfx	x6,x8,#26,#26
+	ubfx	x7,x9,#26,#26
+	add	x4,x4,x5,lsl#32		// bfi	x4,x5,#32,#32
+	extr	x8,x12,x8,#52
+	extr	x9,x13,x9,#52
+	add	x6,x6,x7,lsl#32		// bfi	x6,x7,#32,#32
+	fmov	$IN23_0,x4
+	and	x8,x8,#0x03ffffff
+	and	x9,x9,#0x03ffffff
+	ubfx	x10,x12,#14,#26
+	ubfx	x11,x13,#14,#26
+	add	x12,$padbit,x12,lsr#40
+	add	x13,$padbit,x13,lsr#40
+	add	x8,x8,x9,lsl#32		// bfi	x8,x9,#32,#32
+	fmov	$IN23_1,x6
+	add	x10,x10,x11,lsl#32	// bfi	x10,x11,#32,#32
+	add	x12,x12,x13,lsl#32	// bfi	x12,x13,#32,#32
+	fmov	$IN23_2,x8
+	fmov	$IN23_3,x10
+	fmov	$IN23_4,x12
+
+	ldp	x8,x12,[$inp],#16	// inp[0:1]
+	ldp	x9,x13,[$inp],#48
+
+	ld1	{$R0,$R1,$S1,$R2},[x15],#64
+	ld1	{$S2,$R3,$S3,$R4},[x15],#64
+	ld1	{$S4},[x15]
+
+#ifdef	__AARCH64EB__
+	rev	x8,x8
+	rev	x12,x12
+	rev	x9,x9
+	rev	x13,x13
+#endif
+	and	x4,x8,#0x03ffffff	// base 2^64 -> base 2^26
+	and	x5,x9,#0x03ffffff
+	ubfx	x6,x8,#26,#26
+	ubfx	x7,x9,#26,#26
+	add	x4,x4,x5,lsl#32		// bfi	x4,x5,#32,#32
+	extr	x8,x12,x8,#52
+	extr	x9,x13,x9,#52
+	add	x6,x6,x7,lsl#32		// bfi	x6,x7,#32,#32
+	fmov	$IN01_0,x4
+	and	x8,x8,#0x03ffffff
+	and	x9,x9,#0x03ffffff
+	ubfx	x10,x12,#14,#26
+	ubfx	x11,x13,#14,#26
+	add	x12,$padbit,x12,lsr#40
+	add	x13,$padbit,x13,lsr#40
+	add	x8,x8,x9,lsl#32		// bfi	x8,x9,#32,#32
+	fmov	$IN01_1,x6
+	add	x10,x10,x11,lsl#32	// bfi	x10,x11,#32,#32
+	add	x12,x12,x13,lsl#32	// bfi	x12,x13,#32,#32
+	movi	$MASK.2d,#-1
+	fmov	$IN01_2,x8
+	fmov	$IN01_3,x10
+	fmov	$IN01_4,x12
+	ushr	$MASK.2d,$MASK.2d,#38
+
+	b.ls	.Lskip_loop
+
+.align	4
+.Loop_neon:
+	////////////////////////////////////////////////////////////////
+	// ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2
+	// ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^3+inp[7]*r
+	//   \___________________/
+	// ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2+inp[8])*r^2
+	// ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^4+inp[7]*r^2+inp[9])*r
+	//   \___________________/ \____________________/
+	//
+	// Note that we start with inp[2:3]*r^2. This is because it
+	// doesn't depend on reduction in previous iteration.
+	////////////////////////////////////////////////////////////////
+	// d4 = h0*r4 + h1*r3   + h2*r2   + h3*r1   + h4*r0
+	// d3 = h0*r3 + h1*r2   + h2*r1   + h3*r0   + h4*5*r4
+	// d2 = h0*r2 + h1*r1   + h2*r0   + h3*5*r4 + h4*5*r3
+	// d1 = h0*r1 + h1*r0   + h2*5*r4 + h3*5*r3 + h4*5*r2
+	// d0 = h0*r0 + h1*5*r4 + h2*5*r3 + h3*5*r2 + h4*5*r1
+
+	subs	$len,$len,#64
+	umull	$ACC4,$IN23_0,${R4}[2]
+	csel	$in2,$zeros,$in2,lo
+	umull	$ACC3,$IN23_0,${R3}[2]
+	umull	$ACC2,$IN23_0,${R2}[2]
+	 ldp	x8,x12,[$in2],#16	// inp[2:3] (or zero)
+	umull	$ACC1,$IN23_0,${R1}[2]
+	 ldp	x9,x13,[$in2],#48
+	umull	$ACC0,$IN23_0,${R0}[2]
+#ifdef	__AARCH64EB__
+	 rev	x8,x8
+	 rev	x12,x12
+	 rev	x9,x9
+	 rev	x13,x13
+#endif
+
+	umlal	$ACC4,$IN23_1,${R3}[2]
+	 and	x4,x8,#0x03ffffff	// base 2^64 -> base 2^26
+	umlal	$ACC3,$IN23_1,${R2}[2]
+	 and	x5,x9,#0x03ffffff
+	umlal	$ACC2,$IN23_1,${R1}[2]
+	 ubfx	x6,x8,#26,#26
+	umlal	$ACC1,$IN23_1,${R0}[2]
+	 ubfx	x7,x9,#26,#26
+	umlal	$ACC0,$IN23_1,${S4}[2]
+	 add	x4,x4,x5,lsl#32		// bfi	x4,x5,#32,#32
+
+	umlal	$ACC4,$IN23_2,${R2}[2]
+	 extr	x8,x12,x8,#52
+	umlal	$ACC3,$IN23_2,${R1}[2]
+	 extr	x9,x13,x9,#52
+	umlal	$ACC2,$IN23_2,${R0}[2]
+	 add	x6,x6,x7,lsl#32		// bfi	x6,x7,#32,#32
+	umlal	$ACC1,$IN23_2,${S4}[2]
+	 fmov	$IN23_0,x4
+	umlal	$ACC0,$IN23_2,${S3}[2]
+	 and	x8,x8,#0x03ffffff
+
+	umlal	$ACC4,$IN23_3,${R1}[2]
+	 and	x9,x9,#0x03ffffff
+	umlal	$ACC3,$IN23_3,${R0}[2]
+	 ubfx	x10,x12,#14,#26
+	umlal	$ACC2,$IN23_3,${S4}[2]
+	 ubfx	x11,x13,#14,#26
+	umlal	$ACC1,$IN23_3,${S3}[2]
+	 add	x8,x8,x9,lsl#32		// bfi	x8,x9,#32,#32
+	umlal	$ACC0,$IN23_3,${S2}[2]
+	 fmov	$IN23_1,x6
+
+	add	$IN01_2,$IN01_2,$H2
+	 add	x12,$padbit,x12,lsr#40
+	umlal	$ACC4,$IN23_4,${R0}[2]
+	 add	x13,$padbit,x13,lsr#40
+	umlal	$ACC3,$IN23_4,${S4}[2]
+	 add	x10,x10,x11,lsl#32	// bfi	x10,x11,#32,#32
+	umlal	$ACC2,$IN23_4,${S3}[2]
+	 add	x12,x12,x13,lsl#32	// bfi	x12,x13,#32,#32
+	umlal	$ACC1,$IN23_4,${S2}[2]
+	 fmov	$IN23_2,x8
+	umlal	$ACC0,$IN23_4,${S1}[2]
+	 fmov	$IN23_3,x10
+
+	////////////////////////////////////////////////////////////////
+	// (hash+inp[0:1])*r^4 and accumulate
+
+	add	$IN01_0,$IN01_0,$H0
+	 fmov	$IN23_4,x12
+	umlal	$ACC3,$IN01_2,${R1}[0]
+	 ldp	x8,x12,[$inp],#16	// inp[0:1]
+	umlal	$ACC0,$IN01_2,${S3}[0]
+	 ldp	x9,x13,[$inp],#48
+	umlal	$ACC4,$IN01_2,${R2}[0]
+	umlal	$ACC1,$IN01_2,${S4}[0]
+	umlal	$ACC2,$IN01_2,${R0}[0]
+#ifdef	__AARCH64EB__
+	 rev	x8,x8
+	 rev	x12,x12
+	 rev	x9,x9
+	 rev	x13,x13
+#endif
+
+	add	$IN01_1,$IN01_1,$H1
+	umlal	$ACC3,$IN01_0,${R3}[0]
+	umlal	$ACC4,$IN01_0,${R4}[0]
+	 and	x4,x8,#0x03ffffff	// base 2^64 -> base 2^26
+	umlal	$ACC2,$IN01_0,${R2}[0]
+	 and	x5,x9,#0x03ffffff
+	umlal	$ACC0,$IN01_0,${R0}[0]
+	 ubfx	x6,x8,#26,#26
+	umlal	$ACC1,$IN01_0,${R1}[0]
+	 ubfx	x7,x9,#26,#26
+
+	add	$IN01_3,$IN01_3,$H3
+	 add	x4,x4,x5,lsl#32		// bfi	x4,x5,#32,#32
+	umlal	$ACC3,$IN01_1,${R2}[0]
+	 extr	x8,x12,x8,#52
+	umlal	$ACC4,$IN01_1,${R3}[0]
+	 extr	x9,x13,x9,#52
+	umlal	$ACC0,$IN01_1,${S4}[0]
+	 add	x6,x6,x7,lsl#32		// bfi	x6,x7,#32,#32
+	umlal	$ACC2,$IN01_1,${R1}[0]
+	 fmov	$IN01_0,x4
+	umlal	$ACC1,$IN01_1,${R0}[0]
+	 and	x8,x8,#0x03ffffff
+
+	add	$IN01_4,$IN01_4,$H4
+	 and	x9,x9,#0x03ffffff
+	umlal	$ACC3,$IN01_3,${R0}[0]
+	 ubfx	x10,x12,#14,#26
+	umlal	$ACC0,$IN01_3,${S2}[0]
+	 ubfx	x11,x13,#14,#26
+	umlal	$ACC4,$IN01_3,${R1}[0]
+	 add	x8,x8,x9,lsl#32		// bfi	x8,x9,#32,#32
+	umlal	$ACC1,$IN01_3,${S3}[0]
+	 fmov	$IN01_1,x6
+	umlal	$ACC2,$IN01_3,${S4}[0]
+	 add	x12,$padbit,x12,lsr#40
+
+	umlal	$ACC3,$IN01_4,${S4}[0]
+	 add	x13,$padbit,x13,lsr#40
+	umlal	$ACC0,$IN01_4,${S1}[0]
+	 add	x10,x10,x11,lsl#32	// bfi	x10,x11,#32,#32
+	umlal	$ACC4,$IN01_4,${R0}[0]
+	 add	x12,x12,x13,lsl#32	// bfi	x12,x13,#32,#32
+	umlal	$ACC1,$IN01_4,${S2}[0]
+	 fmov	$IN01_2,x8
+	umlal	$ACC2,$IN01_4,${S3}[0]
+	 fmov	$IN01_3,x10
+	 fmov	$IN01_4,x12
+
+	/////////////////////////////////////////////////////////////////
+	// lazy reduction as discussed in "NEON crypto" by D.J. Bernstein
+	// and P. Schwabe
+	//
+	// [see discussion in poly1305-armv4 module]
+
+	ushr	$T0.2d,$ACC3,#26
+	xtn	$H3,$ACC3
+	 ushr	$T1.2d,$ACC0,#26
+	 and	$ACC0,$ACC0,$MASK.2d
+	add	$ACC4,$ACC4,$T0.2d	// h3 -> h4
+	bic	$H3,#0xfc,lsl#24	// &=0x03ffffff
+	 add	$ACC1,$ACC1,$T1.2d	// h0 -> h1
+
+	ushr	$T0.2d,$ACC4,#26
+	xtn	$H4,$ACC4
+	 ushr	$T1.2d,$ACC1,#26
+	 xtn	$H1,$ACC1
+	bic	$H4,#0xfc,lsl#24
+	 add	$ACC2,$ACC2,$T1.2d	// h1 -> h2
+
+	add	$ACC0,$ACC0,$T0.2d
+	shl	$T0.2d,$T0.2d,#2
+	 shrn	$T1.2s,$ACC2,#26
+	 xtn	$H2,$ACC2
+	add	$ACC0,$ACC0,$T0.2d	// h4 -> h0
+	 bic	$H1,#0xfc,lsl#24
+	 add	$H3,$H3,$T1.2s		// h2 -> h3
+	 bic	$H2,#0xfc,lsl#24
+
+	shrn	$T0.2s,$ACC0,#26
+	xtn	$H0,$ACC0
+	 ushr	$T1.2s,$H3,#26
+	 bic	$H3,#0xfc,lsl#24
+	 bic	$H0,#0xfc,lsl#24
+	add	$H1,$H1,$T0.2s		// h0 -> h1
+	 add	$H4,$H4,$T1.2s		// h3 -> h4
+
+	b.hi	.Loop_neon
+
+.Lskip_loop:
+	dup	$IN23_2,${IN23_2}[0]
+	add	$IN01_2,$IN01_2,$H2
+
+	////////////////////////////////////////////////////////////////
+	// multiply (inp[0:1]+hash) or inp[2:3] by r^2:r^1
+
+	adds	$len,$len,#32
+	b.ne	.Long_tail
+
+	dup	$IN23_2,${IN01_2}[0]
+	add	$IN23_0,$IN01_0,$H0
+	add	$IN23_3,$IN01_3,$H3
+	add	$IN23_1,$IN01_1,$H1
+	add	$IN23_4,$IN01_4,$H4
+
+.Long_tail:
+	dup	$IN23_0,${IN23_0}[0]
+	umull2	$ACC0,$IN23_2,${S3}
+	umull2	$ACC3,$IN23_2,${R1}
+	umull2	$ACC4,$IN23_2,${R2}
+	umull2	$ACC2,$IN23_2,${R0}
+	umull2	$ACC1,$IN23_2,${S4}
+
+	dup	$IN23_1,${IN23_1}[0]
+	umlal2	$ACC0,$IN23_0,${R0}
+	umlal2	$ACC2,$IN23_0,${R2}
+	umlal2	$ACC3,$IN23_0,${R3}
+	umlal2	$ACC4,$IN23_0,${R4}
+	umlal2	$ACC1,$IN23_0,${R1}
+
+	dup	$IN23_3,${IN23_3}[0]
+	umlal2	$ACC0,$IN23_1,${S4}
+	umlal2	$ACC3,$IN23_1,${R2}
+	umlal2	$ACC2,$IN23_1,${R1}
+	umlal2	$ACC4,$IN23_1,${R3}
+	umlal2	$ACC1,$IN23_1,${R0}
+
+	dup	$IN23_4,${IN23_4}[0]
+	umlal2	$ACC3,$IN23_3,${R0}
+	umlal2	$ACC4,$IN23_3,${R1}
+	umlal2	$ACC0,$IN23_3,${S2}
+	umlal2	$ACC1,$IN23_3,${S3}
+	umlal2	$ACC2,$IN23_3,${S4}
+
+	umlal2	$ACC3,$IN23_4,${S4}
+	umlal2	$ACC0,$IN23_4,${S1}
+	umlal2	$ACC4,$IN23_4,${R0}
+	umlal2	$ACC1,$IN23_4,${S2}
+	umlal2	$ACC2,$IN23_4,${S3}
+
+	b.eq	.Lshort_tail
+
+	////////////////////////////////////////////////////////////////
+	// (hash+inp[0:1])*r^4:r^3 and accumulate
+
+	add	$IN01_0,$IN01_0,$H0
+	umlal	$ACC3,$IN01_2,${R1}
+	umlal	$ACC0,$IN01_2,${S3}
+	umlal	$ACC4,$IN01_2,${R2}
+	umlal	$ACC1,$IN01_2,${S4}
+	umlal	$ACC2,$IN01_2,${R0}
+
+	add	$IN01_1,$IN01_1,$H1
+	umlal	$ACC3,$IN01_0,${R3}
+	umlal	$ACC0,$IN01_0,${R0}
+	umlal	$ACC4,$IN01_0,${R4}
+	umlal	$ACC1,$IN01_0,${R1}
+	umlal	$ACC2,$IN01_0,${R2}
+
+	add	$IN01_3,$IN01_3,$H3
+	umlal	$ACC3,$IN01_1,${R2}
+	umlal	$ACC0,$IN01_1,${S4}
+	umlal	$ACC4,$IN01_1,${R3}
+	umlal	$ACC1,$IN01_1,${R0}
+	umlal	$ACC2,$IN01_1,${R1}
+
+	add	$IN01_4,$IN01_4,$H4
+	umlal	$ACC3,$IN01_3,${R0}
+	umlal	$ACC0,$IN01_3,${S2}
+	umlal	$ACC4,$IN01_3,${R1}
+	umlal	$ACC1,$IN01_3,${S3}
+	umlal	$ACC2,$IN01_3,${S4}
+
+	umlal	$ACC3,$IN01_4,${S4}
+	umlal	$ACC0,$IN01_4,${S1}
+	umlal	$ACC4,$IN01_4,${R0}
+	umlal	$ACC1,$IN01_4,${S2}
+	umlal	$ACC2,$IN01_4,${S3}
+
+.Lshort_tail:
+	////////////////////////////////////////////////////////////////
+	// horizontal add
+
+	addp	$ACC3,$ACC3,$ACC3
+	 ldp	d8,d9,[sp,#16]		// meet ABI requirements
+	addp	$ACC0,$ACC0,$ACC0
+	 ldp	d10,d11,[sp,#32]
+	addp	$ACC4,$ACC4,$ACC4
+	 ldp	d12,d13,[sp,#48]
+	addp	$ACC1,$ACC1,$ACC1
+	 ldp	d14,d15,[sp,#64]
+	addp	$ACC2,$ACC2,$ACC2
+	 ldr	x30,[sp,#8]
+	 .inst	0xd50323bf		// autiasp
+
+	////////////////////////////////////////////////////////////////
+	// lazy reduction, but without narrowing
+
+	ushr	$T0.2d,$ACC3,#26
+	and	$ACC3,$ACC3,$MASK.2d
+	 ushr	$T1.2d,$ACC0,#26
+	 and	$ACC0,$ACC0,$MASK.2d
+
+	add	$ACC4,$ACC4,$T0.2d	// h3 -> h4
+	 add	$ACC1,$ACC1,$T1.2d	// h0 -> h1
+
+	ushr	$T0.2d,$ACC4,#26
+	and	$ACC4,$ACC4,$MASK.2d
+	 ushr	$T1.2d,$ACC1,#26
+	 and	$ACC1,$ACC1,$MASK.2d
+	 add	$ACC2,$ACC2,$T1.2d	// h1 -> h2
+
+	add	$ACC0,$ACC0,$T0.2d
+	shl	$T0.2d,$T0.2d,#2
+	 ushr	$T1.2d,$ACC2,#26
+	 and	$ACC2,$ACC2,$MASK.2d
+	add	$ACC0,$ACC0,$T0.2d	// h4 -> h0
+	 add	$ACC3,$ACC3,$T1.2d	// h2 -> h3
+
+	ushr	$T0.2d,$ACC0,#26
+	and	$ACC0,$ACC0,$MASK.2d
+	 ushr	$T1.2d,$ACC3,#26
+	 and	$ACC3,$ACC3,$MASK.2d
+	add	$ACC1,$ACC1,$T0.2d	// h0 -> h1
+	 add	$ACC4,$ACC4,$T1.2d	// h3 -> h4
+
+	////////////////////////////////////////////////////////////////
+	// write the result, can be partially reduced
+
+	st4	{$ACC0,$ACC1,$ACC2,$ACC3}[0],[$ctx],#16
+	mov	x4,#1
+	st1	{$ACC4}[0],[$ctx]
+	str	x4,[$ctx,#8]		// set is_base2_26
+
+	ldr	x29,[sp],#80
+	ret
+.size	poly1305_blocks_neon,.-poly1305_blocks_neon
+
+.align	5
+.Lzeros:
+.long	0,0,0,0,0,0,0,0
+.asciz	"Poly1305 for ARMv8, CRYPTOGAMS by \@dot-asm"
+.align	2
+#if !defined(__KERNEL__) && !defined(_WIN64)
+.comm	OPENSSL_armcap_P,4,4
+.hidden	OPENSSL_armcap_P
+#endif
+___
+
+foreach (split("\n",$code)) {
+	s/\b(shrn\s+v[0-9]+)\.[24]d/$1.2s/			or
+	s/\b(fmov\s+)v([0-9]+)[^,]*,\s*x([0-9]+)/$1d$2,x$3/	or
+	(m/\bdup\b/ and (s/\.[24]s/.2d/g or 1))			or
+	(m/\b(eor|and)/ and (s/\.[248][sdh]/.16b/g or 1))	or
+	(m/\bum(ul|la)l\b/ and (s/\.4s/.2s/g or 1))		or
+	(m/\bum(ul|la)l2\b/ and (s/\.2s/.4s/g or 1))		or
+	(m/\bst[1-4]\s+{[^}]+}\[/ and (s/\.[24]d/.s/g or 1));
+
+	s/\.[124]([sd])\[/.$1\[/;
+	s/w#x([0-9]+)/w$1/g;
+
+	print $_,"\n";
+}
+close STDOUT;
diff --git a/arch/arm64/crypto/poly1305-core.S_shipped b/arch/arm64/crypto/poly1305-core.S_shipped
new file mode 100644
index 000000000000..8d1c4e420ccd
--- /dev/null
+++ b/arch/arm64/crypto/poly1305-core.S_shipped
@@ -0,0 +1,835 @@
+#ifndef __KERNEL__
+# include "arm_arch.h"
+.extern	OPENSSL_armcap_P
+#endif
+
+.text
+
+// forward "declarations" are required for Apple
+.globl	poly1305_blocks
+.globl	poly1305_emit
+
+.globl	poly1305_init
+.type	poly1305_init,%function
+.align	5
+poly1305_init:
+	cmp	x1,xzr
+	stp	xzr,xzr,[x0]		// zero hash value
+	stp	xzr,xzr,[x0,#16]	// [along with is_base2_26]
+
+	csel	x0,xzr,x0,eq
+	b.eq	.Lno_key
+
+#ifndef	__KERNEL__
+	adrp	x17,OPENSSL_armcap_P
+	ldr	w17,[x17,#:lo12:OPENSSL_armcap_P]
+#endif
+
+	ldp	x7,x8,[x1]		// load key
+	mov	x9,#0xfffffffc0fffffff
+	movk	x9,#0x0fff,lsl#48
+#ifdef	__AARCH64EB__
+	rev	x7,x7			// flip bytes
+	rev	x8,x8
+#endif
+	and	x7,x7,x9		// &=0ffffffc0fffffff
+	and	x9,x9,#-4
+	and	x8,x8,x9		// &=0ffffffc0ffffffc
+	mov	w9,#-1
+	stp	x7,x8,[x0,#32]	// save key value
+	str	w9,[x0,#48]	// impossible key power value
+
+#ifndef	__KERNEL__
+	tst	w17,#ARMV7_NEON
+
+	adr	x12,.Lpoly1305_blocks
+	adr	x7,.Lpoly1305_blocks_neon
+	adr	x13,.Lpoly1305_emit
+
+	csel	x12,x12,x7,eq
+
+# ifdef	__ILP32__
+	stp	w12,w13,[x2]
+# else
+	stp	x12,x13,[x2]
+# endif
+#endif
+	mov	x0,#1
+.Lno_key:
+	ret
+.size	poly1305_init,.-poly1305_init
+
+.type	poly1305_blocks,%function
+.align	5
+poly1305_blocks:
+.Lpoly1305_blocks:
+	ands	x2,x2,#-16
+	b.eq	.Lno_data
+
+	ldp	x4,x5,[x0]		// load hash value
+	ldp	x6,x17,[x0,#16]	// [along with is_base2_26]
+	ldp	x7,x8,[x0,#32]	// load key value
+
+#ifdef	__AARCH64EB__
+	lsr	x12,x4,#32
+	mov	w13,w4
+	lsr	x14,x5,#32
+	mov	w15,w5
+	lsr	x16,x6,#32
+#else
+	mov	w12,w4
+	lsr	x13,x4,#32
+	mov	w14,w5
+	lsr	x15,x5,#32
+	mov	w16,w6
+#endif
+
+	add	x12,x12,x13,lsl#26	// base 2^26 -> base 2^64
+	lsr	x13,x14,#12
+	adds	x12,x12,x14,lsl#52
+	add	x13,x13,x15,lsl#14
+	adc	x13,x13,xzr
+	lsr	x14,x16,#24
+	adds	x13,x13,x16,lsl#40
+	adc	x14,x14,xzr
+
+	cmp	x17,#0			// is_base2_26?
+	add	x9,x8,x8,lsr#2	// s1 = r1 + (r1 >> 2)
+	csel	x4,x4,x12,eq		// choose between radixes
+	csel	x5,x5,x13,eq
+	csel	x6,x6,x14,eq
+
+.Loop:
+	ldp	x10,x11,[x1],#16	// load input
+	sub	x2,x2,#16
+#ifdef	__AARCH64EB__
+	rev	x10,x10
+	rev	x11,x11
+#endif
+	adds	x4,x4,x10		// accumulate input
+	adcs	x5,x5,x11
+
+	mul	x12,x4,x7		// h0*r0
+	adc	x6,x6,x3
+	umulh	x13,x4,x7
+
+	mul	x10,x5,x9		// h1*5*r1
+	umulh	x11,x5,x9
+
+	adds	x12,x12,x10
+	mul	x10,x4,x8		// h0*r1
+	adc	x13,x13,x11
+	umulh	x14,x4,x8
+
+	adds	x13,x13,x10
+	mul	x10,x5,x7		// h1*r0
+	adc	x14,x14,xzr
+	umulh	x11,x5,x7
+
+	adds	x13,x13,x10
+	mul	x10,x6,x9		// h2*5*r1
+	adc	x14,x14,x11
+	mul	x11,x6,x7		// h2*r0
+
+	adds	x13,x13,x10
+	adc	x14,x14,x11
+
+	and	x10,x14,#-4		// final reduction
+	and	x6,x14,#3
+	add	x10,x10,x14,lsr#2
+	adds	x4,x12,x10
+	adcs	x5,x13,xzr
+	adc	x6,x6,xzr
+
+	cbnz	x2,.Loop
+
+	stp	x4,x5,[x0]		// store hash value
+	stp	x6,xzr,[x0,#16]	// [and clear is_base2_26]
+
+.Lno_data:
+	ret
+.size	poly1305_blocks,.-poly1305_blocks
+
+.type	poly1305_emit,%function
+.align	5
+poly1305_emit:
+.Lpoly1305_emit:
+	ldp	x4,x5,[x0]		// load hash base 2^64
+	ldp	x6,x7,[x0,#16]	// [along with is_base2_26]
+	ldp	x10,x11,[x2]	// load nonce
+
+#ifdef	__AARCH64EB__
+	lsr	x12,x4,#32
+	mov	w13,w4
+	lsr	x14,x5,#32
+	mov	w15,w5
+	lsr	x16,x6,#32
+#else
+	mov	w12,w4
+	lsr	x13,x4,#32
+	mov	w14,w5
+	lsr	x15,x5,#32
+	mov	w16,w6
+#endif
+
+	add	x12,x12,x13,lsl#26	// base 2^26 -> base 2^64
+	lsr	x13,x14,#12
+	adds	x12,x12,x14,lsl#52
+	add	x13,x13,x15,lsl#14
+	adc	x13,x13,xzr
+	lsr	x14,x16,#24
+	adds	x13,x13,x16,lsl#40
+	adc	x14,x14,xzr
+
+	cmp	x7,#0			// is_base2_26?
+	csel	x4,x4,x12,eq		// choose between radixes
+	csel	x5,x5,x13,eq
+	csel	x6,x6,x14,eq
+
+	adds	x12,x4,#5		// compare to modulus
+	adcs	x13,x5,xzr
+	adc	x14,x6,xzr
+
+	tst	x14,#-4			// see if it's carried/borrowed
+
+	csel	x4,x4,x12,eq
+	csel	x5,x5,x13,eq
+
+#ifdef	__AARCH64EB__
+	ror	x10,x10,#32		// flip nonce words
+	ror	x11,x11,#32
+#endif
+	adds	x4,x4,x10		// accumulate nonce
+	adc	x5,x5,x11
+#ifdef	__AARCH64EB__
+	rev	x4,x4			// flip output bytes
+	rev	x5,x5
+#endif
+	stp	x4,x5,[x1]		// write result
+
+	ret
+.size	poly1305_emit,.-poly1305_emit
+.type	poly1305_mult,%function
+.align	5
+poly1305_mult:
+	mul	x12,x4,x7		// h0*r0
+	umulh	x13,x4,x7
+
+	mul	x10,x5,x9		// h1*5*r1
+	umulh	x11,x5,x9
+
+	adds	x12,x12,x10
+	mul	x10,x4,x8		// h0*r1
+	adc	x13,x13,x11
+	umulh	x14,x4,x8
+
+	adds	x13,x13,x10
+	mul	x10,x5,x7		// h1*r0
+	adc	x14,x14,xzr
+	umulh	x11,x5,x7
+
+	adds	x13,x13,x10
+	mul	x10,x6,x9		// h2*5*r1
+	adc	x14,x14,x11
+	mul	x11,x6,x7		// h2*r0
+
+	adds	x13,x13,x10
+	adc	x14,x14,x11
+
+	and	x10,x14,#-4		// final reduction
+	and	x6,x14,#3
+	add	x10,x10,x14,lsr#2
+	adds	x4,x12,x10
+	adcs	x5,x13,xzr
+	adc	x6,x6,xzr
+
+	ret
+.size	poly1305_mult,.-poly1305_mult
+
+.type	poly1305_splat,%function
+.align	4
+poly1305_splat:
+	and	x12,x4,#0x03ffffff	// base 2^64 -> base 2^26
+	ubfx	x13,x4,#26,#26
+	extr	x14,x5,x4,#52
+	and	x14,x14,#0x03ffffff
+	ubfx	x15,x5,#14,#26
+	extr	x16,x6,x5,#40
+
+	str	w12,[x0,#16*0]	// r0
+	add	w12,w13,w13,lsl#2	// r1*5
+	str	w13,[x0,#16*1]	// r1
+	add	w13,w14,w14,lsl#2	// r2*5
+	str	w12,[x0,#16*2]	// s1
+	str	w14,[x0,#16*3]	// r2
+	add	w14,w15,w15,lsl#2	// r3*5
+	str	w13,[x0,#16*4]	// s2
+	str	w15,[x0,#16*5]	// r3
+	add	w15,w16,w16,lsl#2	// r4*5
+	str	w14,[x0,#16*6]	// s3
+	str	w16,[x0,#16*7]	// r4
+	str	w15,[x0,#16*8]	// s4
+
+	ret
+.size	poly1305_splat,.-poly1305_splat
+
+#ifdef	__KERNEL__
+.globl	poly1305_blocks_neon
+#endif
+.type	poly1305_blocks_neon,%function
+.align	5
+poly1305_blocks_neon:
+.Lpoly1305_blocks_neon:
+	ldr	x17,[x0,#24]
+	cmp	x2,#128
+	b.lo	.Lpoly1305_blocks
+
+	.inst	0xd503233f		// paciasp
+	stp	x29,x30,[sp,#-80]!
+	add	x29,sp,#0
+
+	stp	d8,d9,[sp,#16]		// meet ABI requirements
+	stp	d10,d11,[sp,#32]
+	stp	d12,d13,[sp,#48]
+	stp	d14,d15,[sp,#64]
+
+	cbz	x17,.Lbase2_64_neon
+
+	ldp	w10,w11,[x0]		// load hash value base 2^26
+	ldp	w12,w13,[x0,#8]
+	ldr	w14,[x0,#16]
+
+	tst	x2,#31
+	b.eq	.Leven_neon
+
+	ldp	x7,x8,[x0,#32]	// load key value
+
+	add	x4,x10,x11,lsl#26	// base 2^26 -> base 2^64
+	lsr	x5,x12,#12
+	adds	x4,x4,x12,lsl#52
+	add	x5,x5,x13,lsl#14
+	adc	x5,x5,xzr
+	lsr	x6,x14,#24
+	adds	x5,x5,x14,lsl#40
+	adc	x14,x6,xzr		// can be partially reduced...
+
+	ldp	x12,x13,[x1],#16	// load input
+	sub	x2,x2,#16
+	add	x9,x8,x8,lsr#2	// s1 = r1 + (r1 >> 2)
+
+#ifdef	__AARCH64EB__
+	rev	x12,x12
+	rev	x13,x13
+#endif
+	adds	x4,x4,x12		// accumulate input
+	adcs	x5,x5,x13
+	adc	x6,x6,x3
+
+	bl	poly1305_mult
+
+	and	x10,x4,#0x03ffffff	// base 2^64 -> base 2^26
+	ubfx	x11,x4,#26,#26
+	extr	x12,x5,x4,#52
+	and	x12,x12,#0x03ffffff
+	ubfx	x13,x5,#14,#26
+	extr	x14,x6,x5,#40
+
+	b	.Leven_neon
+
+.align	4
+.Lbase2_64_neon:
+	ldp	x7,x8,[x0,#32]	// load key value
+
+	ldp	x4,x5,[x0]		// load hash value base 2^64
+	ldr	x6,[x0,#16]
+
+	tst	x2,#31
+	b.eq	.Linit_neon
+
+	ldp	x12,x13,[x1],#16	// load input
+	sub	x2,x2,#16
+	add	x9,x8,x8,lsr#2	// s1 = r1 + (r1 >> 2)
+#ifdef	__AARCH64EB__
+	rev	x12,x12
+	rev	x13,x13
+#endif
+	adds	x4,x4,x12		// accumulate input
+	adcs	x5,x5,x13
+	adc	x6,x6,x3
+
+	bl	poly1305_mult
+
+.Linit_neon:
+	ldr	w17,[x0,#48]		// first table element
+	and	x10,x4,#0x03ffffff	// base 2^64 -> base 2^26
+	ubfx	x11,x4,#26,#26
+	extr	x12,x5,x4,#52
+	and	x12,x12,#0x03ffffff
+	ubfx	x13,x5,#14,#26
+	extr	x14,x6,x5,#40
+
+	cmp	w17,#-1			// is value impossible?
+	b.ne	.Leven_neon
+
+	fmov	d24,x10
+	fmov	d25,x11
+	fmov	d26,x12
+	fmov	d27,x13
+	fmov	d28,x14
+
+	////////////////////////////////// initialize r^n table
+	mov	x4,x7			// r^1
+	add	x9,x8,x8,lsr#2	// s1 = r1 + (r1 >> 2)
+	mov	x5,x8
+	mov	x6,xzr
+	add	x0,x0,#48+12
+	bl	poly1305_splat
+
+	bl	poly1305_mult		// r^2
+	sub	x0,x0,#4
+	bl	poly1305_splat
+
+	bl	poly1305_mult		// r^3
+	sub	x0,x0,#4
+	bl	poly1305_splat
+
+	bl	poly1305_mult		// r^4
+	sub	x0,x0,#4
+	bl	poly1305_splat
+	sub	x0,x0,#48		// restore original x0
+	b	.Ldo_neon
+
+.align	4
+.Leven_neon:
+	fmov	d24,x10
+	fmov	d25,x11
+	fmov	d26,x12
+	fmov	d27,x13
+	fmov	d28,x14
+
+.Ldo_neon:
+	ldp	x8,x12,[x1,#32]	// inp[2:3]
+	subs	x2,x2,#64
+	ldp	x9,x13,[x1,#48]
+	add	x16,x1,#96
+	adr	x17,.Lzeros
+
+	lsl	x3,x3,#24
+	add	x15,x0,#48
+
+#ifdef	__AARCH64EB__
+	rev	x8,x8
+	rev	x12,x12
+	rev	x9,x9
+	rev	x13,x13
+#endif
+	and	x4,x8,#0x03ffffff	// base 2^64 -> base 2^26
+	and	x5,x9,#0x03ffffff
+	ubfx	x6,x8,#26,#26
+	ubfx	x7,x9,#26,#26
+	add	x4,x4,x5,lsl#32		// bfi	x4,x5,#32,#32
+	extr	x8,x12,x8,#52
+	extr	x9,x13,x9,#52
+	add	x6,x6,x7,lsl#32		// bfi	x6,x7,#32,#32
+	fmov	d14,x4
+	and	x8,x8,#0x03ffffff
+	and	x9,x9,#0x03ffffff
+	ubfx	x10,x12,#14,#26
+	ubfx	x11,x13,#14,#26
+	add	x12,x3,x12,lsr#40
+	add	x13,x3,x13,lsr#40
+	add	x8,x8,x9,lsl#32		// bfi	x8,x9,#32,#32
+	fmov	d15,x6
+	add	x10,x10,x11,lsl#32	// bfi	x10,x11,#32,#32
+	add	x12,x12,x13,lsl#32	// bfi	x12,x13,#32,#32
+	fmov	d16,x8
+	fmov	d17,x10
+	fmov	d18,x12
+
+	ldp	x8,x12,[x1],#16	// inp[0:1]
+	ldp	x9,x13,[x1],#48
+
+	ld1	{v0.4s,v1.4s,v2.4s,v3.4s},[x15],#64
+	ld1	{v4.4s,v5.4s,v6.4s,v7.4s},[x15],#64
+	ld1	{v8.4s},[x15]
+
+#ifdef	__AARCH64EB__
+	rev	x8,x8
+	rev	x12,x12
+	rev	x9,x9
+	rev	x13,x13
+#endif
+	and	x4,x8,#0x03ffffff	// base 2^64 -> base 2^26
+	and	x5,x9,#0x03ffffff
+	ubfx	x6,x8,#26,#26
+	ubfx	x7,x9,#26,#26
+	add	x4,x4,x5,lsl#32		// bfi	x4,x5,#32,#32
+	extr	x8,x12,x8,#52
+	extr	x9,x13,x9,#52
+	add	x6,x6,x7,lsl#32		// bfi	x6,x7,#32,#32
+	fmov	d9,x4
+	and	x8,x8,#0x03ffffff
+	and	x9,x9,#0x03ffffff
+	ubfx	x10,x12,#14,#26
+	ubfx	x11,x13,#14,#26
+	add	x12,x3,x12,lsr#40
+	add	x13,x3,x13,lsr#40
+	add	x8,x8,x9,lsl#32		// bfi	x8,x9,#32,#32
+	fmov	d10,x6
+	add	x10,x10,x11,lsl#32	// bfi	x10,x11,#32,#32
+	add	x12,x12,x13,lsl#32	// bfi	x12,x13,#32,#32
+	movi	v31.2d,#-1
+	fmov	d11,x8
+	fmov	d12,x10
+	fmov	d13,x12
+	ushr	v31.2d,v31.2d,#38
+
+	b.ls	.Lskip_loop
+
+.align	4
+.Loop_neon:
+	////////////////////////////////////////////////////////////////
+	// ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2
+	// ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^3+inp[7]*r
+	//   ___________________/
+	// ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2+inp[8])*r^2
+	// ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^4+inp[7]*r^2+inp[9])*r
+	//   ___________________/ ____________________/
+	//
+	// Note that we start with inp[2:3]*r^2. This is because it
+	// doesn't depend on reduction in previous iteration.
+	////////////////////////////////////////////////////////////////
+	// d4 = h0*r4 + h1*r3   + h2*r2   + h3*r1   + h4*r0
+	// d3 = h0*r3 + h1*r2   + h2*r1   + h3*r0   + h4*5*r4
+	// d2 = h0*r2 + h1*r1   + h2*r0   + h3*5*r4 + h4*5*r3
+	// d1 = h0*r1 + h1*r0   + h2*5*r4 + h3*5*r3 + h4*5*r2
+	// d0 = h0*r0 + h1*5*r4 + h2*5*r3 + h3*5*r2 + h4*5*r1
+
+	subs	x2,x2,#64
+	umull	v23.2d,v14.2s,v7.s[2]
+	csel	x16,x17,x16,lo
+	umull	v22.2d,v14.2s,v5.s[2]
+	umull	v21.2d,v14.2s,v3.s[2]
+	 ldp	x8,x12,[x16],#16	// inp[2:3] (or zero)
+	umull	v20.2d,v14.2s,v1.s[2]
+	 ldp	x9,x13,[x16],#48
+	umull	v19.2d,v14.2s,v0.s[2]
+#ifdef	__AARCH64EB__
+	 rev	x8,x8
+	 rev	x12,x12
+	 rev	x9,x9
+	 rev	x13,x13
+#endif
+
+	umlal	v23.2d,v15.2s,v5.s[2]
+	 and	x4,x8,#0x03ffffff	// base 2^64 -> base 2^26
+	umlal	v22.2d,v15.2s,v3.s[2]
+	 and	x5,x9,#0x03ffffff
+	umlal	v21.2d,v15.2s,v1.s[2]
+	 ubfx	x6,x8,#26,#26
+	umlal	v20.2d,v15.2s,v0.s[2]
+	 ubfx	x7,x9,#26,#26
+	umlal	v19.2d,v15.2s,v8.s[2]
+	 add	x4,x4,x5,lsl#32		// bfi	x4,x5,#32,#32
+
+	umlal	v23.2d,v16.2s,v3.s[2]
+	 extr	x8,x12,x8,#52
+	umlal	v22.2d,v16.2s,v1.s[2]
+	 extr	x9,x13,x9,#52
+	umlal	v21.2d,v16.2s,v0.s[2]
+	 add	x6,x6,x7,lsl#32		// bfi	x6,x7,#32,#32
+	umlal	v20.2d,v16.2s,v8.s[2]
+	 fmov	d14,x4
+	umlal	v19.2d,v16.2s,v6.s[2]
+	 and	x8,x8,#0x03ffffff
+
+	umlal	v23.2d,v17.2s,v1.s[2]
+	 and	x9,x9,#0x03ffffff
+	umlal	v22.2d,v17.2s,v0.s[2]
+	 ubfx	x10,x12,#14,#26
+	umlal	v21.2d,v17.2s,v8.s[2]
+	 ubfx	x11,x13,#14,#26
+	umlal	v20.2d,v17.2s,v6.s[2]
+	 add	x8,x8,x9,lsl#32		// bfi	x8,x9,#32,#32
+	umlal	v19.2d,v17.2s,v4.s[2]
+	 fmov	d15,x6
+
+	add	v11.2s,v11.2s,v26.2s
+	 add	x12,x3,x12,lsr#40
+	umlal	v23.2d,v18.2s,v0.s[2]
+	 add	x13,x3,x13,lsr#40
+	umlal	v22.2d,v18.2s,v8.s[2]
+	 add	x10,x10,x11,lsl#32	// bfi	x10,x11,#32,#32
+	umlal	v21.2d,v18.2s,v6.s[2]
+	 add	x12,x12,x13,lsl#32	// bfi	x12,x13,#32,#32
+	umlal	v20.2d,v18.2s,v4.s[2]
+	 fmov	d16,x8
+	umlal	v19.2d,v18.2s,v2.s[2]
+	 fmov	d17,x10
+
+	////////////////////////////////////////////////////////////////
+	// (hash+inp[0:1])*r^4 and accumulate
+
+	add	v9.2s,v9.2s,v24.2s
+	 fmov	d18,x12
+	umlal	v22.2d,v11.2s,v1.s[0]
+	 ldp	x8,x12,[x1],#16	// inp[0:1]
+	umlal	v19.2d,v11.2s,v6.s[0]
+	 ldp	x9,x13,[x1],#48
+	umlal	v23.2d,v11.2s,v3.s[0]
+	umlal	v20.2d,v11.2s,v8.s[0]
+	umlal	v21.2d,v11.2s,v0.s[0]
+#ifdef	__AARCH64EB__
+	 rev	x8,x8
+	 rev	x12,x12
+	 rev	x9,x9
+	 rev	x13,x13
+#endif
+
+	add	v10.2s,v10.2s,v25.2s
+	umlal	v22.2d,v9.2s,v5.s[0]
+	umlal	v23.2d,v9.2s,v7.s[0]
+	 and	x4,x8,#0x03ffffff	// base 2^64 -> base 2^26
+	umlal	v21.2d,v9.2s,v3.s[0]
+	 and	x5,x9,#0x03ffffff
+	umlal	v19.2d,v9.2s,v0.s[0]
+	 ubfx	x6,x8,#26,#26
+	umlal	v20.2d,v9.2s,v1.s[0]
+	 ubfx	x7,x9,#26,#26
+
+	add	v12.2s,v12.2s,v27.2s
+	 add	x4,x4,x5,lsl#32		// bfi	x4,x5,#32,#32
+	umlal	v22.2d,v10.2s,v3.s[0]
+	 extr	x8,x12,x8,#52
+	umlal	v23.2d,v10.2s,v5.s[0]
+	 extr	x9,x13,x9,#52
+	umlal	v19.2d,v10.2s,v8.s[0]
+	 add	x6,x6,x7,lsl#32		// bfi	x6,x7,#32,#32
+	umlal	v21.2d,v10.2s,v1.s[0]
+	 fmov	d9,x4
+	umlal	v20.2d,v10.2s,v0.s[0]
+	 and	x8,x8,#0x03ffffff
+
+	add	v13.2s,v13.2s,v28.2s
+	 and	x9,x9,#0x03ffffff
+	umlal	v22.2d,v12.2s,v0.s[0]
+	 ubfx	x10,x12,#14,#26
+	umlal	v19.2d,v12.2s,v4.s[0]
+	 ubfx	x11,x13,#14,#26
+	umlal	v23.2d,v12.2s,v1.s[0]
+	 add	x8,x8,x9,lsl#32		// bfi	x8,x9,#32,#32
+	umlal	v20.2d,v12.2s,v6.s[0]
+	 fmov	d10,x6
+	umlal	v21.2d,v12.2s,v8.s[0]
+	 add	x12,x3,x12,lsr#40
+
+	umlal	v22.2d,v13.2s,v8.s[0]
+	 add	x13,x3,x13,lsr#40
+	umlal	v19.2d,v13.2s,v2.s[0]
+	 add	x10,x10,x11,lsl#32	// bfi	x10,x11,#32,#32
+	umlal	v23.2d,v13.2s,v0.s[0]
+	 add	x12,x12,x13,lsl#32	// bfi	x12,x13,#32,#32
+	umlal	v20.2d,v13.2s,v4.s[0]
+	 fmov	d11,x8
+	umlal	v21.2d,v13.2s,v6.s[0]
+	 fmov	d12,x10
+	 fmov	d13,x12
+
+	/////////////////////////////////////////////////////////////////
+	// lazy reduction as discussed in "NEON crypto" by D.J. Bernstein
+	// and P. Schwabe
+	//
+	// [see discussion in poly1305-armv4 module]
+
+	ushr	v29.2d,v22.2d,#26
+	xtn	v27.2s,v22.2d
+	 ushr	v30.2d,v19.2d,#26
+	 and	v19.16b,v19.16b,v31.16b
+	add	v23.2d,v23.2d,v29.2d	// h3 -> h4
+	bic	v27.2s,#0xfc,lsl#24	// &=0x03ffffff
+	 add	v20.2d,v20.2d,v30.2d	// h0 -> h1
+
+	ushr	v29.2d,v23.2d,#26
+	xtn	v28.2s,v23.2d
+	 ushr	v30.2d,v20.2d,#26
+	 xtn	v25.2s,v20.2d
+	bic	v28.2s,#0xfc,lsl#24
+	 add	v21.2d,v21.2d,v30.2d	// h1 -> h2
+
+	add	v19.2d,v19.2d,v29.2d
+	shl	v29.2d,v29.2d,#2
+	 shrn	v30.2s,v21.2d,#26
+	 xtn	v26.2s,v21.2d
+	add	v19.2d,v19.2d,v29.2d	// h4 -> h0
+	 bic	v25.2s,#0xfc,lsl#24
+	 add	v27.2s,v27.2s,v30.2s		// h2 -> h3
+	 bic	v26.2s,#0xfc,lsl#24
+
+	shrn	v29.2s,v19.2d,#26
+	xtn	v24.2s,v19.2d
+	 ushr	v30.2s,v27.2s,#26
+	 bic	v27.2s,#0xfc,lsl#24
+	 bic	v24.2s,#0xfc,lsl#24
+	add	v25.2s,v25.2s,v29.2s		// h0 -> h1
+	 add	v28.2s,v28.2s,v30.2s		// h3 -> h4
+
+	b.hi	.Loop_neon
+
+.Lskip_loop:
+	dup	v16.2d,v16.d[0]
+	add	v11.2s,v11.2s,v26.2s
+
+	////////////////////////////////////////////////////////////////
+	// multiply (inp[0:1]+hash) or inp[2:3] by r^2:r^1
+
+	adds	x2,x2,#32
+	b.ne	.Long_tail
+
+	dup	v16.2d,v11.d[0]
+	add	v14.2s,v9.2s,v24.2s
+	add	v17.2s,v12.2s,v27.2s
+	add	v15.2s,v10.2s,v25.2s
+	add	v18.2s,v13.2s,v28.2s
+
+.Long_tail:
+	dup	v14.2d,v14.d[0]
+	umull2	v19.2d,v16.4s,v6.4s
+	umull2	v22.2d,v16.4s,v1.4s
+	umull2	v23.2d,v16.4s,v3.4s
+	umull2	v21.2d,v16.4s,v0.4s
+	umull2	v20.2d,v16.4s,v8.4s
+
+	dup	v15.2d,v15.d[0]
+	umlal2	v19.2d,v14.4s,v0.4s
+	umlal2	v21.2d,v14.4s,v3.4s
+	umlal2	v22.2d,v14.4s,v5.4s
+	umlal2	v23.2d,v14.4s,v7.4s
+	umlal2	v20.2d,v14.4s,v1.4s
+
+	dup	v17.2d,v17.d[0]
+	umlal2	v19.2d,v15.4s,v8.4s
+	umlal2	v22.2d,v15.4s,v3.4s
+	umlal2	v21.2d,v15.4s,v1.4s
+	umlal2	v23.2d,v15.4s,v5.4s
+	umlal2	v20.2d,v15.4s,v0.4s
+
+	dup	v18.2d,v18.d[0]
+	umlal2	v22.2d,v17.4s,v0.4s
+	umlal2	v23.2d,v17.4s,v1.4s
+	umlal2	v19.2d,v17.4s,v4.4s
+	umlal2	v20.2d,v17.4s,v6.4s
+	umlal2	v21.2d,v17.4s,v8.4s
+
+	umlal2	v22.2d,v18.4s,v8.4s
+	umlal2	v19.2d,v18.4s,v2.4s
+	umlal2	v23.2d,v18.4s,v0.4s
+	umlal2	v20.2d,v18.4s,v4.4s
+	umlal2	v21.2d,v18.4s,v6.4s
+
+	b.eq	.Lshort_tail
+
+	////////////////////////////////////////////////////////////////
+	// (hash+inp[0:1])*r^4:r^3 and accumulate
+
+	add	v9.2s,v9.2s,v24.2s
+	umlal	v22.2d,v11.2s,v1.2s
+	umlal	v19.2d,v11.2s,v6.2s
+	umlal	v23.2d,v11.2s,v3.2s
+	umlal	v20.2d,v11.2s,v8.2s
+	umlal	v21.2d,v11.2s,v0.2s
+
+	add	v10.2s,v10.2s,v25.2s
+	umlal	v22.2d,v9.2s,v5.2s
+	umlal	v19.2d,v9.2s,v0.2s
+	umlal	v23.2d,v9.2s,v7.2s
+	umlal	v20.2d,v9.2s,v1.2s
+	umlal	v21.2d,v9.2s,v3.2s
+
+	add	v12.2s,v12.2s,v27.2s
+	umlal	v22.2d,v10.2s,v3.2s
+	umlal	v19.2d,v10.2s,v8.2s
+	umlal	v23.2d,v10.2s,v5.2s
+	umlal	v20.2d,v10.2s,v0.2s
+	umlal	v21.2d,v10.2s,v1.2s
+
+	add	v13.2s,v13.2s,v28.2s
+	umlal	v22.2d,v12.2s,v0.2s
+	umlal	v19.2d,v12.2s,v4.2s
+	umlal	v23.2d,v12.2s,v1.2s
+	umlal	v20.2d,v12.2s,v6.2s
+	umlal	v21.2d,v12.2s,v8.2s
+
+	umlal	v22.2d,v13.2s,v8.2s
+	umlal	v19.2d,v13.2s,v2.2s
+	umlal	v23.2d,v13.2s,v0.2s
+	umlal	v20.2d,v13.2s,v4.2s
+	umlal	v21.2d,v13.2s,v6.2s
+
+.Lshort_tail:
+	////////////////////////////////////////////////////////////////
+	// horizontal add
+
+	addp	v22.2d,v22.2d,v22.2d
+	 ldp	d8,d9,[sp,#16]		// meet ABI requirements
+	addp	v19.2d,v19.2d,v19.2d
+	 ldp	d10,d11,[sp,#32]
+	addp	v23.2d,v23.2d,v23.2d
+	 ldp	d12,d13,[sp,#48]
+	addp	v20.2d,v20.2d,v20.2d
+	 ldp	d14,d15,[sp,#64]
+	addp	v21.2d,v21.2d,v21.2d
+	 ldr	x30,[sp,#8]
+	 .inst	0xd50323bf		// autiasp
+
+	////////////////////////////////////////////////////////////////
+	// lazy reduction, but without narrowing
+
+	ushr	v29.2d,v22.2d,#26
+	and	v22.16b,v22.16b,v31.16b
+	 ushr	v30.2d,v19.2d,#26
+	 and	v19.16b,v19.16b,v31.16b
+
+	add	v23.2d,v23.2d,v29.2d	// h3 -> h4
+	 add	v20.2d,v20.2d,v30.2d	// h0 -> h1
+
+	ushr	v29.2d,v23.2d,#26
+	and	v23.16b,v23.16b,v31.16b
+	 ushr	v30.2d,v20.2d,#26
+	 and	v20.16b,v20.16b,v31.16b
+	 add	v21.2d,v21.2d,v30.2d	// h1 -> h2
+
+	add	v19.2d,v19.2d,v29.2d
+	shl	v29.2d,v29.2d,#2
+	 ushr	v30.2d,v21.2d,#26
+	 and	v21.16b,v21.16b,v31.16b
+	add	v19.2d,v19.2d,v29.2d	// h4 -> h0
+	 add	v22.2d,v22.2d,v30.2d	// h2 -> h3
+
+	ushr	v29.2d,v19.2d,#26
+	and	v19.16b,v19.16b,v31.16b
+	 ushr	v30.2d,v22.2d,#26
+	 and	v22.16b,v22.16b,v31.16b
+	add	v20.2d,v20.2d,v29.2d	// h0 -> h1
+	 add	v23.2d,v23.2d,v30.2d	// h3 -> h4
+
+	////////////////////////////////////////////////////////////////
+	// write the result, can be partially reduced
+
+	st4	{v19.s,v20.s,v21.s,v22.s}[0],[x0],#16
+	mov	x4,#1
+	st1	{v23.s}[0],[x0]
+	str	x4,[x0,#8]		// set is_base2_26
+
+	ldr	x29,[sp],#80
+	ret
+.size	poly1305_blocks_neon,.-poly1305_blocks_neon
+
+.align	5
+.Lzeros:
+.long	0,0,0,0,0,0,0,0
+.asciz	"Poly1305 for ARMv8, CRYPTOGAMS by @dot-asm"
+.align	2
+#if !defined(__KERNEL__) && !defined(_WIN64)
+.comm	OPENSSL_armcap_P,4,4
+.hidden	OPENSSL_armcap_P
+#endif
diff --git a/arch/arm64/crypto/poly1305-glue.c b/arch/arm64/crypto/poly1305-glue.c
new file mode 100644
index 000000000000..dd843d0ee83a
--- /dev/null
+++ b/arch/arm64/crypto/poly1305-glue.c
@@ -0,0 +1,237 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * OpenSSL/Cryptogams accelerated Poly1305 transform for arm64
+ *
+ * Copyright (C) 2019 Linaro Ltd. <ard.biesheuvel@linaro.org>
+ */
+
+#include <asm/hwcap.h>
+#include <asm/neon.h>
+#include <asm/simd.h>
+#include <asm/unaligned.h>
+#include <crypto/algapi.h>
+#include <crypto/internal/hash.h>
+#include <crypto/internal/poly1305.h>
+#include <crypto/internal/simd.h>
+#include <linux/cpufeature.h>
+#include <linux/crypto.h>
+#include <linux/jump_label.h>
+#include <linux/module.h>
+
+asmlinkage void poly1305_init_arm64(void *state, const u8 *key);
+asmlinkage void poly1305_blocks(void *state, const u8 *src, u32 len, u32 hibit);
+asmlinkage void poly1305_blocks_neon(void *state, const u8 *src, u32 len, u32 hibit);
+asmlinkage void poly1305_emit(void *state, __le32 *digest, const u32 *nonce);
+
+static __ro_after_init DEFINE_STATIC_KEY_FALSE(have_neon);
+
+void poly1305_init_arch(struct poly1305_desc_ctx *dctx, const u8 *key)
+{
+	poly1305_init_arm64(&dctx->h, key);
+	dctx->s[0] = get_unaligned_le32(key + 16);
+	dctx->s[1] = get_unaligned_le32(key + 20);
+	dctx->s[2] = get_unaligned_le32(key + 24);
+	dctx->s[3] = get_unaligned_le32(key + 28);
+	dctx->buflen = 0;
+}
+EXPORT_SYMBOL(poly1305_init_arch);
+
+static int neon_poly1305_init(struct shash_desc *desc)
+{
+	struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
+
+	dctx->buflen = 0;
+	dctx->rset = 0;
+	dctx->sset = false;
+
+	return 0;
+}
+
+static void neon_poly1305_blocks(struct poly1305_desc_ctx *dctx, const u8 *src,
+				 u32 len, u32 hibit, bool do_neon)
+{
+	if (unlikely(!dctx->sset)) {
+		if (!dctx->rset) {
+			poly1305_init_arch(dctx, src);
+			src += POLY1305_BLOCK_SIZE;
+			len -= POLY1305_BLOCK_SIZE;
+			dctx->rset = 1;
+		}
+		if (len >= POLY1305_BLOCK_SIZE) {
+			dctx->s[0] = get_unaligned_le32(src +  0);
+			dctx->s[1] = get_unaligned_le32(src +  4);
+			dctx->s[2] = get_unaligned_le32(src +  8);
+			dctx->s[3] = get_unaligned_le32(src + 12);
+			src += POLY1305_BLOCK_SIZE;
+			len -= POLY1305_BLOCK_SIZE;
+			dctx->sset = true;
+		}
+		if (len < POLY1305_BLOCK_SIZE)
+			return;
+	}
+
+	len &= ~(POLY1305_BLOCK_SIZE - 1);
+
+	if (static_branch_likely(&have_neon) && likely(do_neon))
+		poly1305_blocks_neon(&dctx->h, src, len, hibit);
+	else
+		poly1305_blocks(&dctx->h, src, len, hibit);
+}
+
+static void neon_poly1305_do_update(struct poly1305_desc_ctx *dctx,
+				    const u8 *src, u32 len, bool do_neon)
+{
+	if (unlikely(dctx->buflen)) {
+		u32 bytes = min(len, POLY1305_BLOCK_SIZE - dctx->buflen);
+
+		memcpy(dctx->buf + dctx->buflen, src, bytes);
+		src += bytes;
+		len -= bytes;
+		dctx->buflen += bytes;
+
+		if (dctx->buflen == POLY1305_BLOCK_SIZE) {
+			neon_poly1305_blocks(dctx, dctx->buf,
+					     POLY1305_BLOCK_SIZE, 1, false);
+			dctx->buflen = 0;
+		}
+	}
+
+	if (likely(len >= POLY1305_BLOCK_SIZE)) {
+		neon_poly1305_blocks(dctx, src, len, 1, do_neon);
+		src += round_down(len, POLY1305_BLOCK_SIZE);
+		len %= POLY1305_BLOCK_SIZE;
+	}
+
+	if (unlikely(len)) {
+		dctx->buflen = len;
+		memcpy(dctx->buf, src, len);
+	}
+}
+
+static int neon_poly1305_update(struct shash_desc *desc,
+				const u8 *src, unsigned int srclen)
+{
+	bool do_neon = crypto_simd_usable() && srclen > 128;
+	struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
+
+	if (static_branch_likely(&have_neon) && do_neon)
+		kernel_neon_begin();
+	neon_poly1305_do_update(dctx, src, srclen, do_neon);
+	if (static_branch_likely(&have_neon) && do_neon)
+		kernel_neon_end();
+	return 0;
+}
+
+void poly1305_update_arch(struct poly1305_desc_ctx *dctx, const u8 *src,
+			  unsigned int nbytes)
+{
+	if (unlikely(dctx->buflen)) {
+		u32 bytes = min(nbytes, POLY1305_BLOCK_SIZE - dctx->buflen);
+
+		memcpy(dctx->buf + dctx->buflen, src, bytes);
+		src += bytes;
+		nbytes -= bytes;
+		dctx->buflen += bytes;
+
+		if (dctx->buflen == POLY1305_BLOCK_SIZE) {
+			poly1305_blocks(&dctx->h, dctx->buf, POLY1305_BLOCK_SIZE, 1);
+			dctx->buflen = 0;
+		}
+	}
+
+	if (likely(nbytes >= POLY1305_BLOCK_SIZE)) {
+		unsigned int len = round_down(nbytes, POLY1305_BLOCK_SIZE);
+
+		if (static_branch_likely(&have_neon) && crypto_simd_usable()) {
+			kernel_neon_begin();
+			poly1305_blocks_neon(&dctx->h, src, len, 1);
+			kernel_neon_end();
+		} else {
+			poly1305_blocks(&dctx->h, src, len, 1);
+		}
+		src += len;
+		nbytes %= POLY1305_BLOCK_SIZE;
+	}
+
+	if (unlikely(nbytes)) {
+		dctx->buflen = nbytes;
+		memcpy(dctx->buf, src, nbytes);
+	}
+}
+EXPORT_SYMBOL(poly1305_update_arch);
+
+void poly1305_final_arch(struct poly1305_desc_ctx *dctx, u8 *dst)
+{
+	__le32 digest[4];
+	u64 f = 0;
+
+	if (unlikely(dctx->buflen)) {
+		dctx->buf[dctx->buflen++] = 1;
+		memset(dctx->buf + dctx->buflen, 0,
+		       POLY1305_BLOCK_SIZE - dctx->buflen);
+		poly1305_blocks(&dctx->h, dctx->buf, POLY1305_BLOCK_SIZE, 0);
+	}
+
+	poly1305_emit(&dctx->h, digest, dctx->s);
+
+	/* mac = (h + s) % (2^128) */
+	f = (f >> 32) + le32_to_cpu(digest[0]);
+	put_unaligned_le32(f, dst);
+	f = (f >> 32) + le32_to_cpu(digest[1]);
+	put_unaligned_le32(f, dst + 4);
+	f = (f >> 32) + le32_to_cpu(digest[2]);
+	put_unaligned_le32(f, dst + 8);
+	f = (f >> 32) + le32_to_cpu(digest[3]);
+	put_unaligned_le32(f, dst + 12);
+
+	*dctx = (struct poly1305_desc_ctx){};
+}
+EXPORT_SYMBOL(poly1305_final_arch);
+
+static int neon_poly1305_final(struct shash_desc *desc, u8 *dst)
+{
+	struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
+
+	if (unlikely(!dctx->sset))
+		return -ENOKEY;
+
+	poly1305_final_arch(dctx, dst);
+	return 0;
+}
+
+static struct shash_alg neon_poly1305_alg = {
+	.init			= neon_poly1305_init,
+	.update			= neon_poly1305_update,
+	.final			= neon_poly1305_final,
+	.digestsize		= POLY1305_DIGEST_SIZE,
+	.descsize		= sizeof(struct poly1305_desc_ctx),
+
+	.base.cra_name		= "poly1305",
+	.base.cra_driver_name	= "poly1305-neon",
+	.base.cra_priority	= 200,
+	.base.cra_blocksize	= POLY1305_BLOCK_SIZE,
+	.base.cra_module	= THIS_MODULE,
+};
+
+static int __init neon_poly1305_mod_init(void)
+{
+	if (!cpu_have_named_feature(ASIMD))
+		return 0;
+
+	static_branch_enable(&have_neon);
+
+	return crypto_register_shash(&neon_poly1305_alg);
+}
+
+static void __exit neon_poly1305_mod_exit(void)
+{
+	if (cpu_have_named_feature(ASIMD))
+		crypto_unregister_shash(&neon_poly1305_alg);
+}
+
+module_init(neon_poly1305_mod_init);
+module_exit(neon_poly1305_mod_exit);
+
+MODULE_LICENSE("GPL v2");
+MODULE_ALIAS_CRYPTO("poly1305");
+MODULE_ALIAS_CRYPTO("poly1305-neon");
diff --git a/lib/crypto/Kconfig b/lib/crypto/Kconfig
index 181754615f73..9923445e8225 100644
--- a/lib/crypto/Kconfig
+++ b/lib/crypto/Kconfig
@@ -40,6 +40,7 @@ config CRYPTO_LIB_DES
 config CRYPTO_LIB_POLY1305_RSIZE
 	int
 	default 4 if X86_64
+	default 9 if ARM64
 	default 1
 
 config CRYPTO_ARCH_HAVE_LIB_POLY1305
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4 19/35] crypto: arm/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation
  2019-10-17 19:08 [PATCH v4 00/35] crypto: crypto API library interfaces for WireGuard Ard Biesheuvel
                   ` (17 preceding siblings ...)
  2019-10-17 19:09 ` [PATCH v4 18/35] crypto: arm64/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation Ard Biesheuvel
@ 2019-10-17 19:09 ` " Ard Biesheuvel
  2019-10-17 19:09 ` [PATCH v4 20/35] crypto: mips/poly1305 - incorporate OpenSSL/CRYPTOGAMS optimized implementation Ard Biesheuvel
                   ` (15 subsequent siblings)
  34 siblings, 0 replies; 50+ messages in thread
From: Ard Biesheuvel @ 2019-10-17 19:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: Ard Biesheuvel, Herbert Xu, David Miller, Jason A . Donenfeld,
	Samuel Neves, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Martin Willi, Rene van Dorst, David Sterba, Andy Polyakov

This is a straight import of the OpenSSL/CRYPTOGAMS Poly1305 implementation
for NEON authored by Andy Polyakov, and contributed by him to the OpenSSL
project. The file 'poly1305-armv4.pl' is taken straight from this upstream
GitHub repository [0] at commit ec55a08dc0244ce570c4fc7cade330c60798952f,
and already contains all the changes required to build it as part of a
Linux kernel module.

[0] https://github.com/dot-asm/cryptogams

Co-developed-by: Andy Polyakov <appro@cryptogams.org>
Signed-off-by: Andy Polyakov <appro@cryptogams.org>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm/crypto/Kconfig                 |    5 +
 arch/arm/crypto/Makefile                |   12 +-
 arch/arm/crypto/poly1305-armv4.pl       | 1236 ++++++++++++++++++++
 arch/arm/crypto/poly1305-core.S_shipped | 1158 ++++++++++++++++++
 arch/arm/crypto/poly1305-glue.c         |  276 +++++
 lib/crypto/Kconfig                      |    2 +-
 6 files changed, 2687 insertions(+), 2 deletions(-)

diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
index 265da3801e4f..19411f82468e 100644
--- a/arch/arm/crypto/Kconfig
+++ b/arch/arm/crypto/Kconfig
@@ -130,6 +130,11 @@ config CRYPTO_CHACHA20_NEON
 	select CRYPTO_BLKCIPHER
 	select CRYPTO_ARCH_HAVE_LIB_CHACHA
 
+config CRYPTO_POLY1305_ARM
+	tristate "Accelerated scalar and SIMD Poly1305 hash implementations"
+	select CRYPTO_HASH
+	select CRYPTO_ARCH_HAVE_LIB_POLY1305
+
 config CRYPTO_NHPOLY1305_NEON
 	tristate "NEON accelerated NHPoly1305 hash function (for Adiantum)"
 	depends on KERNEL_MODE_NEON
diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
index 6b97dffcf90f..4f6a8a81dabc 100644
--- a/arch/arm/crypto/Makefile
+++ b/arch/arm/crypto/Makefile
@@ -10,6 +10,7 @@ obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
 obj-$(CONFIG_CRYPTO_SHA256_ARM) += sha256-arm.o
 obj-$(CONFIG_CRYPTO_SHA512_ARM) += sha512-arm.o
 obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha-neon.o
+obj-$(CONFIG_CRYPTO_POLY1305_ARM) += poly1305-arm.o
 obj-$(CONFIG_CRYPTO_NHPOLY1305_NEON) += nhpoly1305-neon.o
 
 ce-obj-$(CONFIG_CRYPTO_AES_ARM_CE) += aes-arm-ce.o
@@ -55,12 +56,16 @@ crct10dif-arm-ce-y	:= crct10dif-ce-core.o crct10dif-ce-glue.o
 crc32-arm-ce-y:= crc32-ce-core.o crc32-ce-glue.o
 chacha-neon-y := chacha-scalar-core.o chacha-glue.o
 chacha-neon-$(CONFIG_KERNEL_MODE_NEON) += chacha-neon-core.o
+poly1305-arm-y := poly1305-core.o poly1305-glue.o
 nhpoly1305-neon-y := nh-neon-core.o nhpoly1305-neon-glue.o
 
 ifdef REGENERATE_ARM_CRYPTO
 quiet_cmd_perl = PERL    $@
       cmd_perl = $(PERL) $(<) > $(@)
 
+$(src)/poly1305-core.S_shipped: $(src)/poly1305-armv4.pl
+	$(call cmd,perl)
+
 $(src)/sha256-core.S_shipped: $(src)/sha256-armv4.pl
 	$(call cmd,perl)
 
@@ -68,4 +73,9 @@ $(src)/sha512-core.S_shipped: $(src)/sha512-armv4.pl
 	$(call cmd,perl)
 endif
 
-clean-files += sha256-core.S sha512-core.S
+clean-files += poly1305-core.S sha256-core.S sha512-core.S
+
+# massage the perlasm code a bit so we only get the NEON routine if we need it
+poly1305-aflags-$(CONFIG_CPU_V7) := -U__LINUX_ARM_ARCH__ -D__LINUX_ARM_ARCH__=5
+poly1305-aflags-$(CONFIG_KERNEL_MODE_NEON) := -U__LINUX_ARM_ARCH__ -D__LINUX_ARM_ARCH__=7
+AFLAGS_poly1305-core.o += $(poly1305-aflags-y)
diff --git a/arch/arm/crypto/poly1305-armv4.pl b/arch/arm/crypto/poly1305-armv4.pl
new file mode 100644
index 000000000000..6d79498d3115
--- /dev/null
+++ b/arch/arm/crypto/poly1305-armv4.pl
@@ -0,0 +1,1236 @@
+#!/usr/bin/env perl
+# SPDX-License-Identifier: GPL-1.0+ OR BSD-3-Clause
+#
+# ====================================================================
+# Written by Andy Polyakov, @dot-asm, initially for the OpenSSL
+# project.
+# ====================================================================
+#
+#			IALU(*)/gcc-4.4		NEON
+#
+# ARM11xx(ARMv6)	7.78/+100%		-
+# Cortex-A5		6.35/+130%		3.00
+# Cortex-A8		6.25/+115%		2.36
+# Cortex-A9		5.10/+95%		2.55
+# Cortex-A15		3.85/+85%		1.25(**)
+# Snapdragon S4		5.70/+100%		1.48(**)
+#
+# (*)	this is for -march=armv6, i.e. with bunch of ldrb loading data;
+# (**)	these are trade-off results, they can be improved by ~8% but at
+#	the cost of 15/12% regression on Cortex-A5/A7, it's even possible
+#	to improve Cortex-A9 result, but then A5/A7 loose more than 20%;
+
+$flavour = shift;
+if ($flavour=~/\w[\w\-]*\.\w+$/) { $output=$flavour; undef $flavour; }
+else { while (($output=shift) && ($output!~/\w[\w\-]*\.\w+$/)) {} }
+
+if ($flavour && $flavour ne "void") {
+    $0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1;
+    ( $xlate="${dir}arm-xlate.pl" and -f $xlate ) or
+    ( $xlate="${dir}../../perlasm/arm-xlate.pl" and -f $xlate) or
+    die "can't locate arm-xlate.pl";
+
+    open STDOUT,"| \"$^X\" $xlate $flavour $output";
+} else {
+    open STDOUT,">$output";
+}
+
+($ctx,$inp,$len,$padbit)=map("r$_",(0..3));
+
+$code.=<<___;
+#ifndef	__KERNEL__
+# include "arm_arch.h"
+#else
+# define __ARM_ARCH__ __LINUX_ARM_ARCH__
+# define __ARM_MAX_ARCH__ __LINUX_ARM_ARCH__
+# define poly1305_init   poly1305_init_arm
+# define poly1305_blocks poly1305_blocks_arm
+# define poly1305_emit   poly1305_emit_arm
+.globl	poly1305_blocks_neon
+#endif
+
+#if defined(__thumb2__)
+.syntax	unified
+.thumb
+#else
+.code	32
+#endif
+
+.text
+
+.globl	poly1305_emit
+.globl	poly1305_blocks
+.globl	poly1305_init
+.type	poly1305_init,%function
+.align	5
+poly1305_init:
+.Lpoly1305_init:
+	stmdb	sp!,{r4-r11}
+
+	eor	r3,r3,r3
+	cmp	$inp,#0
+	str	r3,[$ctx,#0]		@ zero hash value
+	str	r3,[$ctx,#4]
+	str	r3,[$ctx,#8]
+	str	r3,[$ctx,#12]
+	str	r3,[$ctx,#16]
+	str	r3,[$ctx,#36]		@ clear is_base2_26
+	add	$ctx,$ctx,#20
+
+#ifdef	__thumb2__
+	it	eq
+#endif
+	moveq	r0,#0
+	beq	.Lno_key
+
+#if	__ARM_MAX_ARCH__>=7
+	mov	r3,#-1
+	str	r3,[$ctx,#28]		@ impossible key power value
+# ifndef __KERNEL__
+	adr	r11,.Lpoly1305_init
+	ldr	r12,.LOPENSSL_armcap
+# endif
+#endif
+	ldrb	r4,[$inp,#0]
+	mov	r10,#0x0fffffff
+	ldrb	r5,[$inp,#1]
+	and	r3,r10,#-4		@ 0x0ffffffc
+	ldrb	r6,[$inp,#2]
+	ldrb	r7,[$inp,#3]
+	orr	r4,r4,r5,lsl#8
+	ldrb	r5,[$inp,#4]
+	orr	r4,r4,r6,lsl#16
+	ldrb	r6,[$inp,#5]
+	orr	r4,r4,r7,lsl#24
+	ldrb	r7,[$inp,#6]
+	and	r4,r4,r10
+
+#if	__ARM_MAX_ARCH__>=7 && !defined(__KERNEL__)
+# if !defined(_WIN32)
+	ldr	r12,[r11,r12]		@ OPENSSL_armcap_P
+# endif
+# if defined(__APPLE__) || defined(_WIN32)
+	ldr	r12,[r12]
+# endif
+#endif
+	ldrb	r8,[$inp,#7]
+	orr	r5,r5,r6,lsl#8
+	ldrb	r6,[$inp,#8]
+	orr	r5,r5,r7,lsl#16
+	ldrb	r7,[$inp,#9]
+	orr	r5,r5,r8,lsl#24
+	ldrb	r8,[$inp,#10]
+	and	r5,r5,r3
+
+#if	__ARM_MAX_ARCH__>=7 && !defined(__KERNEL__)
+	tst	r12,#ARMV7_NEON		@ check for NEON
+# ifdef	__thumb2__
+	adr	r9,.Lpoly1305_blocks_neon
+	adr	r11,.Lpoly1305_blocks
+	it	ne
+	movne	r11,r9
+	adr	r12,.Lpoly1305_emit
+	orr	r11,r11,#1		@ thumb-ify addresses
+	orr	r12,r12,#1
+# else
+	add	r12,r11,#(.Lpoly1305_emit-.Lpoly1305_init)
+	ite	eq
+	addeq	r11,r11,#(.Lpoly1305_blocks-.Lpoly1305_init)
+	addne	r11,r11,#(.Lpoly1305_blocks_neon-.Lpoly1305_init)
+# endif
+#endif
+	ldrb	r9,[$inp,#11]
+	orr	r6,r6,r7,lsl#8
+	ldrb	r7,[$inp,#12]
+	orr	r6,r6,r8,lsl#16
+	ldrb	r8,[$inp,#13]
+	orr	r6,r6,r9,lsl#24
+	ldrb	r9,[$inp,#14]
+	and	r6,r6,r3
+
+	ldrb	r10,[$inp,#15]
+	orr	r7,r7,r8,lsl#8
+	str	r4,[$ctx,#0]
+	orr	r7,r7,r9,lsl#16
+	str	r5,[$ctx,#4]
+	orr	r7,r7,r10,lsl#24
+	str	r6,[$ctx,#8]
+	and	r7,r7,r3
+	str	r7,[$ctx,#12]
+#if	__ARM_MAX_ARCH__>=7 && !defined(__KERNEL__)
+	stmia	r2,{r11,r12}		@ fill functions table
+	mov	r0,#1
+#else
+	mov	r0,#0
+#endif
+.Lno_key:
+	ldmia	sp!,{r4-r11}
+#if	__ARM_ARCH__>=5
+	ret				@ bx	lr
+#else
+	tst	lr,#1
+	moveq	pc,lr			@ be binary compatible with V4, yet
+	bx	lr			@ interoperable with Thumb ISA:-)
+#endif
+.size	poly1305_init,.-poly1305_init
+___
+{
+my ($h0,$h1,$h2,$h3,$h4,$r0,$r1,$r2,$r3)=map("r$_",(4..12));
+my ($s1,$s2,$s3)=($r1,$r2,$r3);
+
+$code.=<<___;
+.type	poly1305_blocks,%function
+.align	5
+poly1305_blocks:
+.Lpoly1305_blocks:
+	stmdb	sp!,{r3-r11,lr}
+
+	ands	$len,$len,#-16
+	beq	.Lno_data
+
+	add	$len,$len,$inp		@ end pointer
+	sub	sp,sp,#32
+
+#if __ARM_ARCH__<7
+	ldmia	$ctx,{$h0-$r3}		@ load context
+	add	$ctx,$ctx,#20
+	str	$len,[sp,#16]		@ offload stuff
+	str	$ctx,[sp,#12]
+#else
+	ldr	lr,[$ctx,#36]		@ is_base2_26
+	ldmia	$ctx!,{$h0-$h4}		@ load hash value
+	str	$len,[sp,#16]		@ offload stuff
+	str	$ctx,[sp,#12]
+
+	adds	$r0,$h0,$h1,lsl#26	@ base 2^26 -> base 2^32
+	mov	$r1,$h1,lsr#6
+	adcs	$r1,$r1,$h2,lsl#20
+	mov	$r2,$h2,lsr#12
+	adcs	$r2,$r2,$h3,lsl#14
+	mov	$r3,$h3,lsr#18
+	adcs	$r3,$r3,$h4,lsl#8
+	mov	$len,#0
+	teq	lr,#0
+	str	$len,[$ctx,#16]		@ clear is_base2_26
+	adc	$len,$len,$h4,lsr#24
+
+	itttt	ne
+	movne	$h0,$r0			@ choose between radixes
+	movne	$h1,$r1
+	movne	$h2,$r2
+	movne	$h3,$r3
+	ldmia	$ctx,{$r0-$r3}		@ load key
+	it	ne
+	movne	$h4,$len
+#endif
+
+	mov	lr,$inp
+	cmp	$padbit,#0
+	str	$r1,[sp,#20]
+	str	$r2,[sp,#24]
+	str	$r3,[sp,#28]
+	b	.Loop
+
+.align	4
+.Loop:
+#if __ARM_ARCH__<7
+	ldrb	r0,[lr],#16		@ load input
+# ifdef	__thumb2__
+	it	hi
+# endif
+	addhi	$h4,$h4,#1		@ 1<<128
+	ldrb	r1,[lr,#-15]
+	ldrb	r2,[lr,#-14]
+	ldrb	r3,[lr,#-13]
+	orr	r1,r0,r1,lsl#8
+	ldrb	r0,[lr,#-12]
+	orr	r2,r1,r2,lsl#16
+	ldrb	r1,[lr,#-11]
+	orr	r3,r2,r3,lsl#24
+	ldrb	r2,[lr,#-10]
+	adds	$h0,$h0,r3		@ accumulate input
+
+	ldrb	r3,[lr,#-9]
+	orr	r1,r0,r1,lsl#8
+	ldrb	r0,[lr,#-8]
+	orr	r2,r1,r2,lsl#16
+	ldrb	r1,[lr,#-7]
+	orr	r3,r2,r3,lsl#24
+	ldrb	r2,[lr,#-6]
+	adcs	$h1,$h1,r3
+
+	ldrb	r3,[lr,#-5]
+	orr	r1,r0,r1,lsl#8
+	ldrb	r0,[lr,#-4]
+	orr	r2,r1,r2,lsl#16
+	ldrb	r1,[lr,#-3]
+	orr	r3,r2,r3,lsl#24
+	ldrb	r2,[lr,#-2]
+	adcs	$h2,$h2,r3
+
+	ldrb	r3,[lr,#-1]
+	orr	r1,r0,r1,lsl#8
+	str	lr,[sp,#8]		@ offload input pointer
+	orr	r2,r1,r2,lsl#16
+	add	$s1,$r1,$r1,lsr#2
+	orr	r3,r2,r3,lsl#24
+#else
+	ldr	r0,[lr],#16		@ load input
+	it	hi
+	addhi	$h4,$h4,#1		@ padbit
+	ldr	r1,[lr,#-12]
+	ldr	r2,[lr,#-8]
+	ldr	r3,[lr,#-4]
+# ifdef	__ARMEB__
+	rev	r0,r0
+	rev	r1,r1
+	rev	r2,r2
+	rev	r3,r3
+# endif
+	adds	$h0,$h0,r0		@ accumulate input
+	str	lr,[sp,#8]		@ offload input pointer
+	adcs	$h1,$h1,r1
+	add	$s1,$r1,$r1,lsr#2
+	adcs	$h2,$h2,r2
+#endif
+	add	$s2,$r2,$r2,lsr#2
+	adcs	$h3,$h3,r3
+	add	$s3,$r3,$r3,lsr#2
+
+	umull	r2,r3,$h1,$r0
+	 adc	$h4,$h4,#0
+	umull	r0,r1,$h0,$r0
+	umlal	r2,r3,$h4,$s1
+	umlal	r0,r1,$h3,$s1
+	ldr	$r1,[sp,#20]		@ reload $r1
+	umlal	r2,r3,$h2,$s3
+	umlal	r0,r1,$h1,$s3
+	umlal	r2,r3,$h3,$s2
+	umlal	r0,r1,$h2,$s2
+	umlal	r2,r3,$h0,$r1
+	str	r0,[sp,#0]		@ future $h0
+	 mul	r0,$s2,$h4
+	ldr	$r2,[sp,#24]		@ reload $r2
+	adds	r2,r2,r1		@ d1+=d0>>32
+	 eor	r1,r1,r1
+	adc	lr,r3,#0		@ future $h2
+	str	r2,[sp,#4]		@ future $h1
+
+	mul	r2,$s3,$h4
+	eor	r3,r3,r3
+	umlal	r0,r1,$h3,$s3
+	ldr	$r3,[sp,#28]		@ reload $r3
+	umlal	r2,r3,$h3,$r0
+	umlal	r0,r1,$h2,$r0
+	umlal	r2,r3,$h2,$r1
+	umlal	r0,r1,$h1,$r1
+	umlal	r2,r3,$h1,$r2
+	umlal	r0,r1,$h0,$r2
+	umlal	r2,r3,$h0,$r3
+	ldr	$h0,[sp,#0]
+	mul	$h4,$r0,$h4
+	ldr	$h1,[sp,#4]
+
+	adds	$h2,lr,r0		@ d2+=d1>>32
+	ldr	lr,[sp,#8]		@ reload input pointer
+	adc	r1,r1,#0
+	adds	$h3,r2,r1		@ d3+=d2>>32
+	ldr	r0,[sp,#16]		@ reload end pointer
+	adc	r3,r3,#0
+	add	$h4,$h4,r3		@ h4+=d3>>32
+
+	and	r1,$h4,#-4
+	and	$h4,$h4,#3
+	add	r1,r1,r1,lsr#2		@ *=5
+	adds	$h0,$h0,r1
+	adcs	$h1,$h1,#0
+	adcs	$h2,$h2,#0
+	adcs	$h3,$h3,#0
+	adc	$h4,$h4,#0
+
+	cmp	r0,lr			@ done yet?
+	bhi	.Loop
+
+	ldr	$ctx,[sp,#12]
+	add	sp,sp,#32
+	stmdb	$ctx,{$h0-$h4}		@ store the result
+
+.Lno_data:
+#if	__ARM_ARCH__>=5
+	ldmia	sp!,{r3-r11,pc}
+#else
+	ldmia	sp!,{r3-r11,lr}
+	tst	lr,#1
+	moveq	pc,lr			@ be binary compatible with V4, yet
+	bx	lr			@ interoperable with Thumb ISA:-)
+#endif
+.size	poly1305_blocks,.-poly1305_blocks
+___
+}
+{
+my ($ctx,$mac,$nonce)=map("r$_",(0..2));
+my ($h0,$h1,$h2,$h3,$h4,$g0,$g1,$g2,$g3)=map("r$_",(3..11));
+my $g4=$ctx;
+
+$code.=<<___;
+.type	poly1305_emit,%function
+.align	5
+poly1305_emit:
+.Lpoly1305_emit:
+	stmdb	sp!,{r4-r11}
+
+	ldmia	$ctx,{$h0-$h4}
+
+#if __ARM_ARCH__>=7
+	ldr	ip,[$ctx,#36]		@ is_base2_26
+
+	adds	$g0,$h0,$h1,lsl#26	@ base 2^26 -> base 2^32
+	mov	$g1,$h1,lsr#6
+	adcs	$g1,$g1,$h2,lsl#20
+	mov	$g2,$h2,lsr#12
+	adcs	$g2,$g2,$h3,lsl#14
+	mov	$g3,$h3,lsr#18
+	adcs	$g3,$g3,$h4,lsl#8
+	mov	$g4,#0
+	adc	$g4,$g4,$h4,lsr#24
+
+	tst	ip,ip
+	itttt	ne
+	movne	$h0,$g0
+	movne	$h1,$g1
+	movne	$h2,$g2
+	movne	$h3,$g3
+	it	ne
+	movne	$h4,$g4
+#endif
+
+	adds	$g0,$h0,#5		@ compare to modulus
+	adcs	$g1,$h1,#0
+	adcs	$g2,$h2,#0
+	adcs	$g3,$h3,#0
+	adc	$g4,$h4,#0
+	tst	$g4,#4			@ did it carry/borrow?
+
+#ifdef	__thumb2__
+	it	ne
+#endif
+	movne	$h0,$g0
+	ldr	$g0,[$nonce,#0]
+#ifdef	__thumb2__
+	it	ne
+#endif
+	movne	$h1,$g1
+	ldr	$g1,[$nonce,#4]
+#ifdef	__thumb2__
+	it	ne
+#endif
+	movne	$h2,$g2
+	ldr	$g2,[$nonce,#8]
+#ifdef	__thumb2__
+	it	ne
+#endif
+	movne	$h3,$g3
+	ldr	$g3,[$nonce,#12]
+
+	adds	$h0,$h0,$g0
+	adcs	$h1,$h1,$g1
+	adcs	$h2,$h2,$g2
+	adc	$h3,$h3,$g3
+
+#if __ARM_ARCH__>=7
+# ifdef __ARMEB__
+	rev	$h0,$h0
+	rev	$h1,$h1
+	rev	$h2,$h2
+	rev	$h3,$h3
+# endif
+	str	$h0,[$mac,#0]
+	str	$h1,[$mac,#4]
+	str	$h2,[$mac,#8]
+	str	$h3,[$mac,#12]
+#else
+	strb	$h0,[$mac,#0]
+	mov	$h0,$h0,lsr#8
+	strb	$h1,[$mac,#4]
+	mov	$h1,$h1,lsr#8
+	strb	$h2,[$mac,#8]
+	mov	$h2,$h2,lsr#8
+	strb	$h3,[$mac,#12]
+	mov	$h3,$h3,lsr#8
+
+	strb	$h0,[$mac,#1]
+	mov	$h0,$h0,lsr#8
+	strb	$h1,[$mac,#5]
+	mov	$h1,$h1,lsr#8
+	strb	$h2,[$mac,#9]
+	mov	$h2,$h2,lsr#8
+	strb	$h3,[$mac,#13]
+	mov	$h3,$h3,lsr#8
+
+	strb	$h0,[$mac,#2]
+	mov	$h0,$h0,lsr#8
+	strb	$h1,[$mac,#6]
+	mov	$h1,$h1,lsr#8
+	strb	$h2,[$mac,#10]
+	mov	$h2,$h2,lsr#8
+	strb	$h3,[$mac,#14]
+	mov	$h3,$h3,lsr#8
+
+	strb	$h0,[$mac,#3]
+	strb	$h1,[$mac,#7]
+	strb	$h2,[$mac,#11]
+	strb	$h3,[$mac,#15]
+#endif
+	ldmia	sp!,{r4-r11}
+#if	__ARM_ARCH__>=5
+	ret				@ bx	lr
+#else
+	tst	lr,#1
+	moveq	pc,lr			@ be binary compatible with V4, yet
+	bx	lr			@ interoperable with Thumb ISA:-)
+#endif
+.size	poly1305_emit,.-poly1305_emit
+___
+{
+my ($R0,$R1,$S1,$R2,$S2,$R3,$S3,$R4,$S4) = map("d$_",(0..9));
+my ($D0,$D1,$D2,$D3,$D4, $H0,$H1,$H2,$H3,$H4) = map("q$_",(5..14));
+my ($T0,$T1,$MASK) = map("q$_",(15,4,0));
+
+my ($in2,$zeros,$tbl0,$tbl1) = map("r$_",(4..7));
+
+$code.=<<___;
+#if	__ARM_MAX_ARCH__>=7
+.fpu	neon
+
+.type	poly1305_init_neon,%function
+.align	5
+poly1305_init_neon:
+.Lpoly1305_init_neon:
+	ldr	r3,[$ctx,#48]		@ first table element
+	cmp	r3,#-1			@ is value impossible?
+	bne	.Lno_init_neon
+
+	ldr	r4,[$ctx,#20]		@ load key base 2^32
+	ldr	r5,[$ctx,#24]
+	ldr	r6,[$ctx,#28]
+	ldr	r7,[$ctx,#32]
+
+	and	r2,r4,#0x03ffffff	@ base 2^32 -> base 2^26
+	mov	r3,r4,lsr#26
+	mov	r4,r5,lsr#20
+	orr	r3,r3,r5,lsl#6
+	mov	r5,r6,lsr#14
+	orr	r4,r4,r6,lsl#12
+	mov	r6,r7,lsr#8
+	orr	r5,r5,r7,lsl#18
+	and	r3,r3,#0x03ffffff
+	and	r4,r4,#0x03ffffff
+	and	r5,r5,#0x03ffffff
+
+	vdup.32	$R0,r2			@ r^1 in both lanes
+	add	r2,r3,r3,lsl#2		@ *5
+	vdup.32	$R1,r3
+	add	r3,r4,r4,lsl#2
+	vdup.32	$S1,r2
+	vdup.32	$R2,r4
+	add	r4,r5,r5,lsl#2
+	vdup.32	$S2,r3
+	vdup.32	$R3,r5
+	add	r5,r6,r6,lsl#2
+	vdup.32	$S3,r4
+	vdup.32	$R4,r6
+	vdup.32	$S4,r5
+
+	mov	$zeros,#2		@ counter
+
+.Lsquare_neon:
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4
+	@ d1 = h1*r0 + h0*r1   + h4*5*r2 + h3*5*r3 + h2*5*r4
+	@ d2 = h2*r0 + h1*r1   + h0*r2   + h4*5*r3 + h3*5*r4
+	@ d3 = h3*r0 + h2*r1   + h1*r2   + h0*r3   + h4*5*r4
+	@ d4 = h4*r0 + h3*r1   + h2*r2   + h1*r3   + h0*r4
+
+	vmull.u32	$D0,$R0,${R0}[1]
+	vmull.u32	$D1,$R1,${R0}[1]
+	vmull.u32	$D2,$R2,${R0}[1]
+	vmull.u32	$D3,$R3,${R0}[1]
+	vmull.u32	$D4,$R4,${R0}[1]
+
+	vmlal.u32	$D0,$R4,${S1}[1]
+	vmlal.u32	$D1,$R0,${R1}[1]
+	vmlal.u32	$D2,$R1,${R1}[1]
+	vmlal.u32	$D3,$R2,${R1}[1]
+	vmlal.u32	$D4,$R3,${R1}[1]
+
+	vmlal.u32	$D0,$R3,${S2}[1]
+	vmlal.u32	$D1,$R4,${S2}[1]
+	vmlal.u32	$D3,$R1,${R2}[1]
+	vmlal.u32	$D2,$R0,${R2}[1]
+	vmlal.u32	$D4,$R2,${R2}[1]
+
+	vmlal.u32	$D0,$R2,${S3}[1]
+	vmlal.u32	$D3,$R0,${R3}[1]
+	vmlal.u32	$D1,$R3,${S3}[1]
+	vmlal.u32	$D2,$R4,${S3}[1]
+	vmlal.u32	$D4,$R1,${R3}[1]
+
+	vmlal.u32	$D3,$R4,${S4}[1]
+	vmlal.u32	$D0,$R1,${S4}[1]
+	vmlal.u32	$D1,$R2,${S4}[1]
+	vmlal.u32	$D2,$R3,${S4}[1]
+	vmlal.u32	$D4,$R0,${R4}[1]
+
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ lazy reduction as discussed in "NEON crypto" by D.J. Bernstein
+	@ and P. Schwabe
+	@
+	@ H0>>+H1>>+H2>>+H3>>+H4
+	@ H3>>+H4>>*5+H0>>+H1
+	@
+	@ Trivia.
+	@
+	@ Result of multiplication of n-bit number by m-bit number is
+	@ n+m bits wide. However! Even though 2^n is a n+1-bit number,
+	@ m-bit number multiplied by 2^n is still n+m bits wide.
+	@
+	@ Sum of two n-bit numbers is n+1 bits wide, sum of three - n+2,
+	@ and so is sum of four. Sum of 2^m n-m-bit numbers and n-bit
+	@ one is n+1 bits wide.
+	@
+	@ >>+ denotes Hnext += Hn>>26, Hn &= 0x3ffffff. This means that
+	@ H0, H2, H3 are guaranteed to be 26 bits wide, while H1 and H4
+	@ can be 27. However! In cases when their width exceeds 26 bits
+	@ they are limited by 2^26+2^6. This in turn means that *sum*
+	@ of the products with these values can still be viewed as sum
+	@ of 52-bit numbers as long as the amount of addends is not a
+	@ power of 2. For example,
+	@
+	@ H4 = H4*R0 + H3*R1 + H2*R2 + H1*R3 + H0 * R4,
+	@
+	@ which can't be larger than 5 * (2^26 + 2^6) * (2^26 + 2^6), or
+	@ 5 * (2^52 + 2*2^32 + 2^12), which in turn is smaller than
+	@ 8 * (2^52) or 2^55. However, the value is then multiplied by
+	@ by 5, so we should be looking at 5 * 5 * (2^52 + 2^33 + 2^12),
+	@ which is less than 32 * (2^52) or 2^57. And when processing
+	@ data we are looking at triple as many addends...
+	@
+	@ In key setup procedure pre-reduced H0 is limited by 5*4+1 and
+	@ 5*H4 - by 5*5 52-bit addends, or 57 bits. But when hashing the
+	@ input H0 is limited by (5*4+1)*3 addends, or 58 bits, while
+	@ 5*H4 by 5*5*3, or 59[!] bits. How is this relevant? vmlal.u32
+	@ instruction accepts 2x32-bit input and writes 2x64-bit result.
+	@ This means that result of reduction have to be compressed upon
+	@ loop wrap-around. This can be done in the process of reduction
+	@ to minimize amount of instructions [as well as amount of
+	@ 128-bit instructions, which benefits low-end processors], but
+	@ one has to watch for H2 (which is narrower than H0) and 5*H4
+	@ not being wider than 58 bits, so that result of right shift
+	@ by 26 bits fits in 32 bits. This is also useful on x86,
+	@ because it allows to use paddd in place for paddq, which
+	@ benefits Atom, where paddq is ridiculously slow.
+
+	vshr.u64	$T0,$D3,#26
+	vmovn.i64	$D3#lo,$D3
+	 vshr.u64	$T1,$D0,#26
+	 vmovn.i64	$D0#lo,$D0
+	vadd.i64	$D4,$D4,$T0		@ h3 -> h4
+	vbic.i32	$D3#lo,#0xfc000000	@ &=0x03ffffff
+	 vadd.i64	$D1,$D1,$T1		@ h0 -> h1
+	 vbic.i32	$D0#lo,#0xfc000000
+
+	vshrn.u64	$T0#lo,$D4,#26
+	vmovn.i64	$D4#lo,$D4
+	 vshr.u64	$T1,$D1,#26
+	 vmovn.i64	$D1#lo,$D1
+	 vadd.i64	$D2,$D2,$T1		@ h1 -> h2
+	vbic.i32	$D4#lo,#0xfc000000
+	 vbic.i32	$D1#lo,#0xfc000000
+
+	vadd.i32	$D0#lo,$D0#lo,$T0#lo
+	vshl.u32	$T0#lo,$T0#lo,#2
+	 vshrn.u64	$T1#lo,$D2,#26
+	 vmovn.i64	$D2#lo,$D2
+	vadd.i32	$D0#lo,$D0#lo,$T0#lo	@ h4 -> h0
+	 vadd.i32	$D3#lo,$D3#lo,$T1#lo	@ h2 -> h3
+	 vbic.i32	$D2#lo,#0xfc000000
+
+	vshr.u32	$T0#lo,$D0#lo,#26
+	vbic.i32	$D0#lo,#0xfc000000
+	 vshr.u32	$T1#lo,$D3#lo,#26
+	 vbic.i32	$D3#lo,#0xfc000000
+	vadd.i32	$D1#lo,$D1#lo,$T0#lo	@ h0 -> h1
+	 vadd.i32	$D4#lo,$D4#lo,$T1#lo	@ h3 -> h4
+
+	subs		$zeros,$zeros,#1
+	beq		.Lsquare_break_neon
+
+	add		$tbl0,$ctx,#(48+0*9*4)
+	add		$tbl1,$ctx,#(48+1*9*4)
+
+	vtrn.32		$R0,$D0#lo		@ r^2:r^1
+	vtrn.32		$R2,$D2#lo
+	vtrn.32		$R3,$D3#lo
+	vtrn.32		$R1,$D1#lo
+	vtrn.32		$R4,$D4#lo
+
+	vshl.u32	$S2,$R2,#2		@ *5
+	vshl.u32	$S3,$R3,#2
+	vshl.u32	$S1,$R1,#2
+	vshl.u32	$S4,$R4,#2
+	vadd.i32	$S2,$S2,$R2
+	vadd.i32	$S1,$S1,$R1
+	vadd.i32	$S3,$S3,$R3
+	vadd.i32	$S4,$S4,$R4
+
+	vst4.32		{${R0}[0],${R1}[0],${S1}[0],${R2}[0]},[$tbl0]!
+	vst4.32		{${R0}[1],${R1}[1],${S1}[1],${R2}[1]},[$tbl1]!
+	vst4.32		{${S2}[0],${R3}[0],${S3}[0],${R4}[0]},[$tbl0]!
+	vst4.32		{${S2}[1],${R3}[1],${S3}[1],${R4}[1]},[$tbl1]!
+	vst1.32		{${S4}[0]},[$tbl0,:32]
+	vst1.32		{${S4}[1]},[$tbl1,:32]
+
+	b		.Lsquare_neon
+
+.align	4
+.Lsquare_break_neon:
+	add		$tbl0,$ctx,#(48+2*4*9)
+	add		$tbl1,$ctx,#(48+3*4*9)
+
+	vmov		$R0,$D0#lo		@ r^4:r^3
+	vshl.u32	$S1,$D1#lo,#2		@ *5
+	vmov		$R1,$D1#lo
+	vshl.u32	$S2,$D2#lo,#2
+	vmov		$R2,$D2#lo
+	vshl.u32	$S3,$D3#lo,#2
+	vmov		$R3,$D3#lo
+	vshl.u32	$S4,$D4#lo,#2
+	vmov		$R4,$D4#lo
+	vadd.i32	$S1,$S1,$D1#lo
+	vadd.i32	$S2,$S2,$D2#lo
+	vadd.i32	$S3,$S3,$D3#lo
+	vadd.i32	$S4,$S4,$D4#lo
+
+	vst4.32		{${R0}[0],${R1}[0],${S1}[0],${R2}[0]},[$tbl0]!
+	vst4.32		{${R0}[1],${R1}[1],${S1}[1],${R2}[1]},[$tbl1]!
+	vst4.32		{${S2}[0],${R3}[0],${S3}[0],${R4}[0]},[$tbl0]!
+	vst4.32		{${S2}[1],${R3}[1],${S3}[1],${R4}[1]},[$tbl1]!
+	vst1.32		{${S4}[0]},[$tbl0]
+	vst1.32		{${S4}[1]},[$tbl1]
+
+.Lno_init_neon:
+	ret				@ bx	lr
+.size	poly1305_init_neon,.-poly1305_init_neon
+
+.type	poly1305_blocks_neon,%function
+.align	5
+poly1305_blocks_neon:
+.Lpoly1305_blocks_neon:
+	ldr	ip,[$ctx,#36]		@ is_base2_26
+
+	cmp	$len,#64
+	blo	.Lpoly1305_blocks
+
+	stmdb	sp!,{r4-r7}
+	vstmdb	sp!,{d8-d15}		@ ABI specification says so
+
+	tst	ip,ip			@ is_base2_26?
+	bne	.Lbase2_26_neon
+
+	stmdb	sp!,{r1-r3,lr}
+	bl	.Lpoly1305_init_neon
+
+	ldr	r4,[$ctx,#0]		@ load hash value base 2^32
+	ldr	r5,[$ctx,#4]
+	ldr	r6,[$ctx,#8]
+	ldr	r7,[$ctx,#12]
+	ldr	ip,[$ctx,#16]
+
+	and	r2,r4,#0x03ffffff	@ base 2^32 -> base 2^26
+	mov	r3,r4,lsr#26
+	 veor	$D0#lo,$D0#lo,$D0#lo
+	mov	r4,r5,lsr#20
+	orr	r3,r3,r5,lsl#6
+	 veor	$D1#lo,$D1#lo,$D1#lo
+	mov	r5,r6,lsr#14
+	orr	r4,r4,r6,lsl#12
+	 veor	$D2#lo,$D2#lo,$D2#lo
+	mov	r6,r7,lsr#8
+	orr	r5,r5,r7,lsl#18
+	 veor	$D3#lo,$D3#lo,$D3#lo
+	and	r3,r3,#0x03ffffff
+	orr	r6,r6,ip,lsl#24
+	 veor	$D4#lo,$D4#lo,$D4#lo
+	and	r4,r4,#0x03ffffff
+	mov	r1,#1
+	and	r5,r5,#0x03ffffff
+	str	r1,[$ctx,#36]		@ set is_base2_26
+
+	vmov.32	$D0#lo[0],r2
+	vmov.32	$D1#lo[0],r3
+	vmov.32	$D2#lo[0],r4
+	vmov.32	$D3#lo[0],r5
+	vmov.32	$D4#lo[0],r6
+	adr	$zeros,.Lzeros
+
+	ldmia	sp!,{r1-r3,lr}
+	b	.Lhash_loaded
+
+.align	4
+.Lbase2_26_neon:
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ load hash value
+
+	veor		$D0#lo,$D0#lo,$D0#lo
+	veor		$D1#lo,$D1#lo,$D1#lo
+	veor		$D2#lo,$D2#lo,$D2#lo
+	veor		$D3#lo,$D3#lo,$D3#lo
+	veor		$D4#lo,$D4#lo,$D4#lo
+	vld4.32		{$D0#lo[0],$D1#lo[0],$D2#lo[0],$D3#lo[0]},[$ctx]!
+	adr		$zeros,.Lzeros
+	vld1.32		{$D4#lo[0]},[$ctx]
+	sub		$ctx,$ctx,#16		@ rewind
+
+.Lhash_loaded:
+	add		$in2,$inp,#32
+	mov		$padbit,$padbit,lsl#24
+	tst		$len,#31
+	beq		.Leven
+
+	vld4.32		{$H0#lo[0],$H1#lo[0],$H2#lo[0],$H3#lo[0]},[$inp]!
+	vmov.32		$H4#lo[0],$padbit
+	sub		$len,$len,#16
+	add		$in2,$inp,#32
+
+# ifdef	__ARMEB__
+	vrev32.8	$H0,$H0
+	vrev32.8	$H3,$H3
+	vrev32.8	$H1,$H1
+	vrev32.8	$H2,$H2
+# endif
+	vsri.u32	$H4#lo,$H3#lo,#8	@ base 2^32 -> base 2^26
+	vshl.u32	$H3#lo,$H3#lo,#18
+
+	vsri.u32	$H3#lo,$H2#lo,#14
+	vshl.u32	$H2#lo,$H2#lo,#12
+	vadd.i32	$H4#hi,$H4#lo,$D4#lo	@ add hash value and move to #hi
+
+	vbic.i32	$H3#lo,#0xfc000000
+	vsri.u32	$H2#lo,$H1#lo,#20
+	vshl.u32	$H1#lo,$H1#lo,#6
+
+	vbic.i32	$H2#lo,#0xfc000000
+	vsri.u32	$H1#lo,$H0#lo,#26
+	vadd.i32	$H3#hi,$H3#lo,$D3#lo
+
+	vbic.i32	$H0#lo,#0xfc000000
+	vbic.i32	$H1#lo,#0xfc000000
+	vadd.i32	$H2#hi,$H2#lo,$D2#lo
+
+	vadd.i32	$H0#hi,$H0#lo,$D0#lo
+	vadd.i32	$H1#hi,$H1#lo,$D1#lo
+
+	mov		$tbl1,$zeros
+	add		$tbl0,$ctx,#48
+
+	cmp		$len,$len
+	b		.Long_tail
+
+.align	4
+.Leven:
+	subs		$len,$len,#64
+	it		lo
+	movlo		$in2,$zeros
+
+	vmov.i32	$H4,#1<<24		@ padbit, yes, always
+	vld4.32		{$H0#lo,$H1#lo,$H2#lo,$H3#lo},[$inp]	@ inp[0:1]
+	add		$inp,$inp,#64
+	vld4.32		{$H0#hi,$H1#hi,$H2#hi,$H3#hi},[$in2]	@ inp[2:3] (or 0)
+	add		$in2,$in2,#64
+	itt		hi
+	addhi		$tbl1,$ctx,#(48+1*9*4)
+	addhi		$tbl0,$ctx,#(48+3*9*4)
+
+# ifdef	__ARMEB__
+	vrev32.8	$H0,$H0
+	vrev32.8	$H3,$H3
+	vrev32.8	$H1,$H1
+	vrev32.8	$H2,$H2
+# endif
+	vsri.u32	$H4,$H3,#8		@ base 2^32 -> base 2^26
+	vshl.u32	$H3,$H3,#18
+
+	vsri.u32	$H3,$H2,#14
+	vshl.u32	$H2,$H2,#12
+
+	vbic.i32	$H3,#0xfc000000
+	vsri.u32	$H2,$H1,#20
+	vshl.u32	$H1,$H1,#6
+
+	vbic.i32	$H2,#0xfc000000
+	vsri.u32	$H1,$H0,#26
+
+	vbic.i32	$H0,#0xfc000000
+	vbic.i32	$H1,#0xfc000000
+
+	bls		.Lskip_loop
+
+	vld4.32		{${R0}[1],${R1}[1],${S1}[1],${R2}[1]},[$tbl1]!	@ load r^2
+	vld4.32		{${R0}[0],${R1}[0],${S1}[0],${R2}[0]},[$tbl0]!	@ load r^4
+	vld4.32		{${S2}[1],${R3}[1],${S3}[1],${R4}[1]},[$tbl1]!
+	vld4.32		{${S2}[0],${R3}[0],${S3}[0],${R4}[0]},[$tbl0]!
+	b		.Loop_neon
+
+.align	5
+.Loop_neon:
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2
+	@ ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^3+inp[7]*r
+	@   \___________________/
+	@ ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2+inp[8])*r^2
+	@ ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^4+inp[7]*r^2+inp[9])*r
+	@   \___________________/ \____________________/
+	@
+	@ Note that we start with inp[2:3]*r^2. This is because it
+	@ doesn't depend on reduction in previous iteration.
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ d4 = h4*r0 + h3*r1   + h2*r2   + h1*r3   + h0*r4
+	@ d3 = h3*r0 + h2*r1   + h1*r2   + h0*r3   + h4*5*r4
+	@ d2 = h2*r0 + h1*r1   + h0*r2   + h4*5*r3 + h3*5*r4
+	@ d1 = h1*r0 + h0*r1   + h4*5*r2 + h3*5*r3 + h2*5*r4
+	@ d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4
+
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ inp[2:3]*r^2
+
+	vadd.i32	$H2#lo,$H2#lo,$D2#lo	@ accumulate inp[0:1]
+	vmull.u32	$D2,$H2#hi,${R0}[1]
+	vadd.i32	$H0#lo,$H0#lo,$D0#lo
+	vmull.u32	$D0,$H0#hi,${R0}[1]
+	vadd.i32	$H3#lo,$H3#lo,$D3#lo
+	vmull.u32	$D3,$H3#hi,${R0}[1]
+	vmlal.u32	$D2,$H1#hi,${R1}[1]
+	vadd.i32	$H1#lo,$H1#lo,$D1#lo
+	vmull.u32	$D1,$H1#hi,${R0}[1]
+
+	vadd.i32	$H4#lo,$H4#lo,$D4#lo
+	vmull.u32	$D4,$H4#hi,${R0}[1]
+	subs		$len,$len,#64
+	vmlal.u32	$D0,$H4#hi,${S1}[1]
+	it		lo
+	movlo		$in2,$zeros
+	vmlal.u32	$D3,$H2#hi,${R1}[1]
+	vld1.32		${S4}[1],[$tbl1,:32]
+	vmlal.u32	$D1,$H0#hi,${R1}[1]
+	vmlal.u32	$D4,$H3#hi,${R1}[1]
+
+	vmlal.u32	$D0,$H3#hi,${S2}[1]
+	vmlal.u32	$D3,$H1#hi,${R2}[1]
+	vmlal.u32	$D4,$H2#hi,${R2}[1]
+	vmlal.u32	$D1,$H4#hi,${S2}[1]
+	vmlal.u32	$D2,$H0#hi,${R2}[1]
+
+	vmlal.u32	$D3,$H0#hi,${R3}[1]
+	vmlal.u32	$D0,$H2#hi,${S3}[1]
+	vmlal.u32	$D4,$H1#hi,${R3}[1]
+	vmlal.u32	$D1,$H3#hi,${S3}[1]
+	vmlal.u32	$D2,$H4#hi,${S3}[1]
+
+	vmlal.u32	$D3,$H4#hi,${S4}[1]
+	vmlal.u32	$D0,$H1#hi,${S4}[1]
+	vmlal.u32	$D4,$H0#hi,${R4}[1]
+	vmlal.u32	$D1,$H2#hi,${S4}[1]
+	vmlal.u32	$D2,$H3#hi,${S4}[1]
+
+	vld4.32		{$H0#hi,$H1#hi,$H2#hi,$H3#hi},[$in2]	@ inp[2:3] (or 0)
+	add		$in2,$in2,#64
+
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ (hash+inp[0:1])*r^4 and accumulate
+
+	vmlal.u32	$D3,$H3#lo,${R0}[0]
+	vmlal.u32	$D0,$H0#lo,${R0}[0]
+	vmlal.u32	$D4,$H4#lo,${R0}[0]
+	vmlal.u32	$D1,$H1#lo,${R0}[0]
+	vmlal.u32	$D2,$H2#lo,${R0}[0]
+	vld1.32		${S4}[0],[$tbl0,:32]
+
+	vmlal.u32	$D3,$H2#lo,${R1}[0]
+	vmlal.u32	$D0,$H4#lo,${S1}[0]
+	vmlal.u32	$D4,$H3#lo,${R1}[0]
+	vmlal.u32	$D1,$H0#lo,${R1}[0]
+	vmlal.u32	$D2,$H1#lo,${R1}[0]
+
+	vmlal.u32	$D3,$H1#lo,${R2}[0]
+	vmlal.u32	$D0,$H3#lo,${S2}[0]
+	vmlal.u32	$D4,$H2#lo,${R2}[0]
+	vmlal.u32	$D1,$H4#lo,${S2}[0]
+	vmlal.u32	$D2,$H0#lo,${R2}[0]
+
+	vmlal.u32	$D3,$H0#lo,${R3}[0]
+	vmlal.u32	$D0,$H2#lo,${S3}[0]
+	vmlal.u32	$D4,$H1#lo,${R3}[0]
+	vmlal.u32	$D1,$H3#lo,${S3}[0]
+	vmlal.u32	$D3,$H4#lo,${S4}[0]
+
+	vmlal.u32	$D2,$H4#lo,${S3}[0]
+	vmlal.u32	$D0,$H1#lo,${S4}[0]
+	vmlal.u32	$D4,$H0#lo,${R4}[0]
+	vmov.i32	$H4,#1<<24		@ padbit, yes, always
+	vmlal.u32	$D1,$H2#lo,${S4}[0]
+	vmlal.u32	$D2,$H3#lo,${S4}[0]
+
+	vld4.32		{$H0#lo,$H1#lo,$H2#lo,$H3#lo},[$inp]	@ inp[0:1]
+	add		$inp,$inp,#64
+# ifdef	__ARMEB__
+	vrev32.8	$H0,$H0
+	vrev32.8	$H1,$H1
+	vrev32.8	$H2,$H2
+	vrev32.8	$H3,$H3
+# endif
+
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ lazy reduction interleaved with base 2^32 -> base 2^26 of
+	@ inp[0:3] previously loaded to $H0-$H3 and smashed to $H0-$H4.
+
+	vshr.u64	$T0,$D3,#26
+	vmovn.i64	$D3#lo,$D3
+	 vshr.u64	$T1,$D0,#26
+	 vmovn.i64	$D0#lo,$D0
+	vadd.i64	$D4,$D4,$T0		@ h3 -> h4
+	vbic.i32	$D3#lo,#0xfc000000
+	  vsri.u32	$H4,$H3,#8		@ base 2^32 -> base 2^26
+	 vadd.i64	$D1,$D1,$T1		@ h0 -> h1
+	  vshl.u32	$H3,$H3,#18
+	 vbic.i32	$D0#lo,#0xfc000000
+
+	vshrn.u64	$T0#lo,$D4,#26
+	vmovn.i64	$D4#lo,$D4
+	 vshr.u64	$T1,$D1,#26
+	 vmovn.i64	$D1#lo,$D1
+	 vadd.i64	$D2,$D2,$T1		@ h1 -> h2
+	  vsri.u32	$H3,$H2,#14
+	vbic.i32	$D4#lo,#0xfc000000
+	  vshl.u32	$H2,$H2,#12
+	 vbic.i32	$D1#lo,#0xfc000000
+
+	vadd.i32	$D0#lo,$D0#lo,$T0#lo
+	vshl.u32	$T0#lo,$T0#lo,#2
+	  vbic.i32	$H3,#0xfc000000
+	 vshrn.u64	$T1#lo,$D2,#26
+	 vmovn.i64	$D2#lo,$D2
+	vaddl.u32	$D0,$D0#lo,$T0#lo	@ h4 -> h0 [widen for a sec]
+	  vsri.u32	$H2,$H1,#20
+	 vadd.i32	$D3#lo,$D3#lo,$T1#lo	@ h2 -> h3
+	  vshl.u32	$H1,$H1,#6
+	 vbic.i32	$D2#lo,#0xfc000000
+	  vbic.i32	$H2,#0xfc000000
+
+	vshrn.u64	$T0#lo,$D0,#26		@ re-narrow
+	vmovn.i64	$D0#lo,$D0
+	  vsri.u32	$H1,$H0,#26
+	  vbic.i32	$H0,#0xfc000000
+	 vshr.u32	$T1#lo,$D3#lo,#26
+	 vbic.i32	$D3#lo,#0xfc000000
+	vbic.i32	$D0#lo,#0xfc000000
+	vadd.i32	$D1#lo,$D1#lo,$T0#lo	@ h0 -> h1
+	 vadd.i32	$D4#lo,$D4#lo,$T1#lo	@ h3 -> h4
+	  vbic.i32	$H1,#0xfc000000
+
+	bhi		.Loop_neon
+
+.Lskip_loop:
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ multiply (inp[0:1]+hash) or inp[2:3] by r^2:r^1
+
+	add		$tbl1,$ctx,#(48+0*9*4)
+	add		$tbl0,$ctx,#(48+1*9*4)
+	adds		$len,$len,#32
+	it		ne
+	movne		$len,#0
+	bne		.Long_tail
+
+	vadd.i32	$H2#hi,$H2#lo,$D2#lo	@ add hash value and move to #hi
+	vadd.i32	$H0#hi,$H0#lo,$D0#lo
+	vadd.i32	$H3#hi,$H3#lo,$D3#lo
+	vadd.i32	$H1#hi,$H1#lo,$D1#lo
+	vadd.i32	$H4#hi,$H4#lo,$D4#lo
+
+.Long_tail:
+	vld4.32		{${R0}[1],${R1}[1],${S1}[1],${R2}[1]},[$tbl1]!	@ load r^1
+	vld4.32		{${R0}[0],${R1}[0],${S1}[0],${R2}[0]},[$tbl0]!	@ load r^2
+
+	vadd.i32	$H2#lo,$H2#lo,$D2#lo	@ can be redundant
+	vmull.u32	$D2,$H2#hi,$R0
+	vadd.i32	$H0#lo,$H0#lo,$D0#lo
+	vmull.u32	$D0,$H0#hi,$R0
+	vadd.i32	$H3#lo,$H3#lo,$D3#lo
+	vmull.u32	$D3,$H3#hi,$R0
+	vadd.i32	$H1#lo,$H1#lo,$D1#lo
+	vmull.u32	$D1,$H1#hi,$R0
+	vadd.i32	$H4#lo,$H4#lo,$D4#lo
+	vmull.u32	$D4,$H4#hi,$R0
+
+	vmlal.u32	$D0,$H4#hi,$S1
+	vld4.32		{${S2}[1],${R3}[1],${S3}[1],${R4}[1]},[$tbl1]!
+	vmlal.u32	$D3,$H2#hi,$R1
+	vld4.32		{${S2}[0],${R3}[0],${S3}[0],${R4}[0]},[$tbl0]!
+	vmlal.u32	$D1,$H0#hi,$R1
+	vmlal.u32	$D4,$H3#hi,$R1
+	vmlal.u32	$D2,$H1#hi,$R1
+
+	vmlal.u32	$D3,$H1#hi,$R2
+	vld1.32		${S4}[1],[$tbl1,:32]
+	vmlal.u32	$D0,$H3#hi,$S2
+	vld1.32		${S4}[0],[$tbl0,:32]
+	vmlal.u32	$D4,$H2#hi,$R2
+	vmlal.u32	$D1,$H4#hi,$S2
+	vmlal.u32	$D2,$H0#hi,$R2
+
+	vmlal.u32	$D3,$H0#hi,$R3
+	 it		ne
+	 addne		$tbl1,$ctx,#(48+2*9*4)
+	vmlal.u32	$D0,$H2#hi,$S3
+	 it		ne
+	 addne		$tbl0,$ctx,#(48+3*9*4)
+	vmlal.u32	$D4,$H1#hi,$R3
+	vmlal.u32	$D1,$H3#hi,$S3
+	vmlal.u32	$D2,$H4#hi,$S3
+
+	vmlal.u32	$D3,$H4#hi,$S4
+	 vorn		$MASK,$MASK,$MASK	@ all-ones, can be redundant
+	vmlal.u32	$D0,$H1#hi,$S4
+	 vshr.u64	$MASK,$MASK,#38
+	vmlal.u32	$D4,$H0#hi,$R4
+	vmlal.u32	$D1,$H2#hi,$S4
+	vmlal.u32	$D2,$H3#hi,$S4
+
+	beq		.Lshort_tail
+
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ (hash+inp[0:1])*r^4:r^3 and accumulate
+
+	vld4.32		{${R0}[1],${R1}[1],${S1}[1],${R2}[1]},[$tbl1]!	@ load r^3
+	vld4.32		{${R0}[0],${R1}[0],${S1}[0],${R2}[0]},[$tbl0]!	@ load r^4
+
+	vmlal.u32	$D2,$H2#lo,$R0
+	vmlal.u32	$D0,$H0#lo,$R0
+	vmlal.u32	$D3,$H3#lo,$R0
+	vmlal.u32	$D1,$H1#lo,$R0
+	vmlal.u32	$D4,$H4#lo,$R0
+
+	vmlal.u32	$D0,$H4#lo,$S1
+	vld4.32		{${S2}[1],${R3}[1],${S3}[1],${R4}[1]},[$tbl1]!
+	vmlal.u32	$D3,$H2#lo,$R1
+	vld4.32		{${S2}[0],${R3}[0],${S3}[0],${R4}[0]},[$tbl0]!
+	vmlal.u32	$D1,$H0#lo,$R1
+	vmlal.u32	$D4,$H3#lo,$R1
+	vmlal.u32	$D2,$H1#lo,$R1
+
+	vmlal.u32	$D3,$H1#lo,$R2
+	vld1.32		${S4}[1],[$tbl1,:32]
+	vmlal.u32	$D0,$H3#lo,$S2
+	vld1.32		${S4}[0],[$tbl0,:32]
+	vmlal.u32	$D4,$H2#lo,$R2
+	vmlal.u32	$D1,$H4#lo,$S2
+	vmlal.u32	$D2,$H0#lo,$R2
+
+	vmlal.u32	$D3,$H0#lo,$R3
+	vmlal.u32	$D0,$H2#lo,$S3
+	vmlal.u32	$D4,$H1#lo,$R3
+	vmlal.u32	$D1,$H3#lo,$S3
+	vmlal.u32	$D2,$H4#lo,$S3
+
+	vmlal.u32	$D3,$H4#lo,$S4
+	 vorn		$MASK,$MASK,$MASK	@ all-ones
+	vmlal.u32	$D0,$H1#lo,$S4
+	 vshr.u64	$MASK,$MASK,#38
+	vmlal.u32	$D4,$H0#lo,$R4
+	vmlal.u32	$D1,$H2#lo,$S4
+	vmlal.u32	$D2,$H3#lo,$S4
+
+.Lshort_tail:
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ horizontal addition
+
+	vadd.i64	$D3#lo,$D3#lo,$D3#hi
+	vadd.i64	$D0#lo,$D0#lo,$D0#hi
+	vadd.i64	$D4#lo,$D4#lo,$D4#hi
+	vadd.i64	$D1#lo,$D1#lo,$D1#hi
+	vadd.i64	$D2#lo,$D2#lo,$D2#hi
+
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ lazy reduction, but without narrowing
+
+	vshr.u64	$T0,$D3,#26
+	vand.i64	$D3,$D3,$MASK
+	 vshr.u64	$T1,$D0,#26
+	 vand.i64	$D0,$D0,$MASK
+	vadd.i64	$D4,$D4,$T0		@ h3 -> h4
+	 vadd.i64	$D1,$D1,$T1		@ h0 -> h1
+
+	vshr.u64	$T0,$D4,#26
+	vand.i64	$D4,$D4,$MASK
+	 vshr.u64	$T1,$D1,#26
+	 vand.i64	$D1,$D1,$MASK
+	 vadd.i64	$D2,$D2,$T1		@ h1 -> h2
+
+	vadd.i64	$D0,$D0,$T0
+	vshl.u64	$T0,$T0,#2
+	 vshr.u64	$T1,$D2,#26
+	 vand.i64	$D2,$D2,$MASK
+	vadd.i64	$D0,$D0,$T0		@ h4 -> h0
+	 vadd.i64	$D3,$D3,$T1		@ h2 -> h3
+
+	vshr.u64	$T0,$D0,#26
+	vand.i64	$D0,$D0,$MASK
+	 vshr.u64	$T1,$D3,#26
+	 vand.i64	$D3,$D3,$MASK
+	vadd.i64	$D1,$D1,$T0		@ h0 -> h1
+	 vadd.i64	$D4,$D4,$T1		@ h3 -> h4
+
+	cmp		$len,#0
+	bne		.Leven
+
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ store hash value
+
+	vst4.32		{$D0#lo[0],$D1#lo[0],$D2#lo[0],$D3#lo[0]},[$ctx]!
+	vst1.32		{$D4#lo[0]},[$ctx]
+
+	vldmia	sp!,{d8-d15}			@ epilogue
+	ldmia	sp!,{r4-r7}
+	ret					@ bx	lr
+.size	poly1305_blocks_neon,.-poly1305_blocks_neon
+
+.align	5
+.Lzeros:
+.long	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
+#ifndef	__KERNEL__
+.LOPENSSL_armcap:
+# ifdef	_WIN32
+.word	OPENSSL_armcap_P
+# else
+.word	OPENSSL_armcap_P-.Lpoly1305_init
+# endif
+.comm	OPENSSL_armcap_P,4,4
+.hidden	OPENSSL_armcap_P
+#endif
+#endif
+___
+}	}
+$code.=<<___;
+.asciz	"Poly1305 for ARMv4/NEON, CRYPTOGAMS by \@dot-asm"
+.align	2
+___
+
+foreach (split("\n",$code)) {
+	s/\`([^\`]*)\`/eval $1/geo;
+
+	s/\bq([0-9]+)#(lo|hi)/sprintf "d%d",2*$1+($2 eq "hi")/geo	or
+	s/\bret\b/bx	lr/go						or
+	s/\bbx\s+lr\b/.word\t0xe12fff1e/go;	# make it possible to compile with -march=armv4
+
+	print $_,"\n";
+}
+close STDOUT; # enforce flush
diff --git a/arch/arm/crypto/poly1305-core.S_shipped b/arch/arm/crypto/poly1305-core.S_shipped
new file mode 100644
index 000000000000..37b71d990293
--- /dev/null
+++ b/arch/arm/crypto/poly1305-core.S_shipped
@@ -0,0 +1,1158 @@
+#ifndef	__KERNEL__
+# include "arm_arch.h"
+#else
+# define __ARM_ARCH__ __LINUX_ARM_ARCH__
+# define __ARM_MAX_ARCH__ __LINUX_ARM_ARCH__
+# define poly1305_init   poly1305_init_arm
+# define poly1305_blocks poly1305_blocks_arm
+# define poly1305_emit   poly1305_emit_arm
+.globl	poly1305_blocks_neon
+#endif
+
+#if defined(__thumb2__)
+.syntax	unified
+.thumb
+#else
+.code	32
+#endif
+
+.text
+
+.globl	poly1305_emit
+.globl	poly1305_blocks
+.globl	poly1305_init
+.type	poly1305_init,%function
+.align	5
+poly1305_init:
+.Lpoly1305_init:
+	stmdb	sp!,{r4-r11}
+
+	eor	r3,r3,r3
+	cmp	r1,#0
+	str	r3,[r0,#0]		@ zero hash value
+	str	r3,[r0,#4]
+	str	r3,[r0,#8]
+	str	r3,[r0,#12]
+	str	r3,[r0,#16]
+	str	r3,[r0,#36]		@ clear is_base2_26
+	add	r0,r0,#20
+
+#ifdef	__thumb2__
+	it	eq
+#endif
+	moveq	r0,#0
+	beq	.Lno_key
+
+#if	__ARM_MAX_ARCH__>=7
+	mov	r3,#-1
+	str	r3,[r0,#28]		@ impossible key power value
+# ifndef __KERNEL__
+	adr	r11,.Lpoly1305_init
+	ldr	r12,.LOPENSSL_armcap
+# endif
+#endif
+	ldrb	r4,[r1,#0]
+	mov	r10,#0x0fffffff
+	ldrb	r5,[r1,#1]
+	and	r3,r10,#-4		@ 0x0ffffffc
+	ldrb	r6,[r1,#2]
+	ldrb	r7,[r1,#3]
+	orr	r4,r4,r5,lsl#8
+	ldrb	r5,[r1,#4]
+	orr	r4,r4,r6,lsl#16
+	ldrb	r6,[r1,#5]
+	orr	r4,r4,r7,lsl#24
+	ldrb	r7,[r1,#6]
+	and	r4,r4,r10
+
+#if	__ARM_MAX_ARCH__>=7 && !defined(__KERNEL__)
+# if !defined(_WIN32)
+	ldr	r12,[r11,r12]		@ OPENSSL_armcap_P
+# endif
+# if defined(__APPLE__) || defined(_WIN32)
+	ldr	r12,[r12]
+# endif
+#endif
+	ldrb	r8,[r1,#7]
+	orr	r5,r5,r6,lsl#8
+	ldrb	r6,[r1,#8]
+	orr	r5,r5,r7,lsl#16
+	ldrb	r7,[r1,#9]
+	orr	r5,r5,r8,lsl#24
+	ldrb	r8,[r1,#10]
+	and	r5,r5,r3
+
+#if	__ARM_MAX_ARCH__>=7 && !defined(__KERNEL__)
+	tst	r12,#ARMV7_NEON		@ check for NEON
+# ifdef	__thumb2__
+	adr	r9,.Lpoly1305_blocks_neon
+	adr	r11,.Lpoly1305_blocks
+	it	ne
+	movne	r11,r9
+	adr	r12,.Lpoly1305_emit
+	orr	r11,r11,#1		@ thumb-ify addresses
+	orr	r12,r12,#1
+# else
+	add	r12,r11,#(.Lpoly1305_emit-.Lpoly1305_init)
+	ite	eq
+	addeq	r11,r11,#(.Lpoly1305_blocks-.Lpoly1305_init)
+	addne	r11,r11,#(.Lpoly1305_blocks_neon-.Lpoly1305_init)
+# endif
+#endif
+	ldrb	r9,[r1,#11]
+	orr	r6,r6,r7,lsl#8
+	ldrb	r7,[r1,#12]
+	orr	r6,r6,r8,lsl#16
+	ldrb	r8,[r1,#13]
+	orr	r6,r6,r9,lsl#24
+	ldrb	r9,[r1,#14]
+	and	r6,r6,r3
+
+	ldrb	r10,[r1,#15]
+	orr	r7,r7,r8,lsl#8
+	str	r4,[r0,#0]
+	orr	r7,r7,r9,lsl#16
+	str	r5,[r0,#4]
+	orr	r7,r7,r10,lsl#24
+	str	r6,[r0,#8]
+	and	r7,r7,r3
+	str	r7,[r0,#12]
+#if	__ARM_MAX_ARCH__>=7 && !defined(__KERNEL__)
+	stmia	r2,{r11,r12}		@ fill functions table
+	mov	r0,#1
+#else
+	mov	r0,#0
+#endif
+.Lno_key:
+	ldmia	sp!,{r4-r11}
+#if	__ARM_ARCH__>=5
+	bx	lr				@ bx	lr
+#else
+	tst	lr,#1
+	moveq	pc,lr			@ be binary compatible with V4, yet
+	.word	0xe12fff1e			@ interoperable with Thumb ISA:-)
+#endif
+.size	poly1305_init,.-poly1305_init
+.type	poly1305_blocks,%function
+.align	5
+poly1305_blocks:
+.Lpoly1305_blocks:
+	stmdb	sp!,{r3-r11,lr}
+
+	ands	r2,r2,#-16
+	beq	.Lno_data
+
+	add	r2,r2,r1		@ end pointer
+	sub	sp,sp,#32
+
+#if __ARM_ARCH__<7
+	ldmia	r0,{r4-r12}		@ load context
+	add	r0,r0,#20
+	str	r2,[sp,#16]		@ offload stuff
+	str	r0,[sp,#12]
+#else
+	ldr	lr,[r0,#36]		@ is_base2_26
+	ldmia	r0!,{r4-r8}		@ load hash value
+	str	r2,[sp,#16]		@ offload stuff
+	str	r0,[sp,#12]
+
+	adds	r9,r4,r5,lsl#26	@ base 2^26 -> base 2^32
+	mov	r10,r5,lsr#6
+	adcs	r10,r10,r6,lsl#20
+	mov	r11,r6,lsr#12
+	adcs	r11,r11,r7,lsl#14
+	mov	r12,r7,lsr#18
+	adcs	r12,r12,r8,lsl#8
+	mov	r2,#0
+	teq	lr,#0
+	str	r2,[r0,#16]		@ clear is_base2_26
+	adc	r2,r2,r8,lsr#24
+
+	itttt	ne
+	movne	r4,r9			@ choose between radixes
+	movne	r5,r10
+	movne	r6,r11
+	movne	r7,r12
+	ldmia	r0,{r9-r12}		@ load key
+	it	ne
+	movne	r8,r2
+#endif
+
+	mov	lr,r1
+	cmp	r3,#0
+	str	r10,[sp,#20]
+	str	r11,[sp,#24]
+	str	r12,[sp,#28]
+	b	.Loop
+
+.align	4
+.Loop:
+#if __ARM_ARCH__<7
+	ldrb	r0,[lr],#16		@ load input
+# ifdef	__thumb2__
+	it	hi
+# endif
+	addhi	r8,r8,#1		@ 1<<128
+	ldrb	r1,[lr,#-15]
+	ldrb	r2,[lr,#-14]
+	ldrb	r3,[lr,#-13]
+	orr	r1,r0,r1,lsl#8
+	ldrb	r0,[lr,#-12]
+	orr	r2,r1,r2,lsl#16
+	ldrb	r1,[lr,#-11]
+	orr	r3,r2,r3,lsl#24
+	ldrb	r2,[lr,#-10]
+	adds	r4,r4,r3		@ accumulate input
+
+	ldrb	r3,[lr,#-9]
+	orr	r1,r0,r1,lsl#8
+	ldrb	r0,[lr,#-8]
+	orr	r2,r1,r2,lsl#16
+	ldrb	r1,[lr,#-7]
+	orr	r3,r2,r3,lsl#24
+	ldrb	r2,[lr,#-6]
+	adcs	r5,r5,r3
+
+	ldrb	r3,[lr,#-5]
+	orr	r1,r0,r1,lsl#8
+	ldrb	r0,[lr,#-4]
+	orr	r2,r1,r2,lsl#16
+	ldrb	r1,[lr,#-3]
+	orr	r3,r2,r3,lsl#24
+	ldrb	r2,[lr,#-2]
+	adcs	r6,r6,r3
+
+	ldrb	r3,[lr,#-1]
+	orr	r1,r0,r1,lsl#8
+	str	lr,[sp,#8]		@ offload input pointer
+	orr	r2,r1,r2,lsl#16
+	add	r10,r10,r10,lsr#2
+	orr	r3,r2,r3,lsl#24
+#else
+	ldr	r0,[lr],#16		@ load input
+	it	hi
+	addhi	r8,r8,#1		@ padbit
+	ldr	r1,[lr,#-12]
+	ldr	r2,[lr,#-8]
+	ldr	r3,[lr,#-4]
+# ifdef	__ARMEB__
+	rev	r0,r0
+	rev	r1,r1
+	rev	r2,r2
+	rev	r3,r3
+# endif
+	adds	r4,r4,r0		@ accumulate input
+	str	lr,[sp,#8]		@ offload input pointer
+	adcs	r5,r5,r1
+	add	r10,r10,r10,lsr#2
+	adcs	r6,r6,r2
+#endif
+	add	r11,r11,r11,lsr#2
+	adcs	r7,r7,r3
+	add	r12,r12,r12,lsr#2
+
+	umull	r2,r3,r5,r9
+	 adc	r8,r8,#0
+	umull	r0,r1,r4,r9
+	umlal	r2,r3,r8,r10
+	umlal	r0,r1,r7,r10
+	ldr	r10,[sp,#20]		@ reload r10
+	umlal	r2,r3,r6,r12
+	umlal	r0,r1,r5,r12
+	umlal	r2,r3,r7,r11
+	umlal	r0,r1,r6,r11
+	umlal	r2,r3,r4,r10
+	str	r0,[sp,#0]		@ future r4
+	 mul	r0,r11,r8
+	ldr	r11,[sp,#24]		@ reload r11
+	adds	r2,r2,r1		@ d1+=d0>>32
+	 eor	r1,r1,r1
+	adc	lr,r3,#0		@ future r6
+	str	r2,[sp,#4]		@ future r5
+
+	mul	r2,r12,r8
+	eor	r3,r3,r3
+	umlal	r0,r1,r7,r12
+	ldr	r12,[sp,#28]		@ reload r12
+	umlal	r2,r3,r7,r9
+	umlal	r0,r1,r6,r9
+	umlal	r2,r3,r6,r10
+	umlal	r0,r1,r5,r10
+	umlal	r2,r3,r5,r11
+	umlal	r0,r1,r4,r11
+	umlal	r2,r3,r4,r12
+	ldr	r4,[sp,#0]
+	mul	r8,r9,r8
+	ldr	r5,[sp,#4]
+
+	adds	r6,lr,r0		@ d2+=d1>>32
+	ldr	lr,[sp,#8]		@ reload input pointer
+	adc	r1,r1,#0
+	adds	r7,r2,r1		@ d3+=d2>>32
+	ldr	r0,[sp,#16]		@ reload end pointer
+	adc	r3,r3,#0
+	add	r8,r8,r3		@ h4+=d3>>32
+
+	and	r1,r8,#-4
+	and	r8,r8,#3
+	add	r1,r1,r1,lsr#2		@ *=5
+	adds	r4,r4,r1
+	adcs	r5,r5,#0
+	adcs	r6,r6,#0
+	adcs	r7,r7,#0
+	adc	r8,r8,#0
+
+	cmp	r0,lr			@ done yet?
+	bhi	.Loop
+
+	ldr	r0,[sp,#12]
+	add	sp,sp,#32
+	stmdb	r0,{r4-r8}		@ store the result
+
+.Lno_data:
+#if	__ARM_ARCH__>=5
+	ldmia	sp!,{r3-r11,pc}
+#else
+	ldmia	sp!,{r3-r11,lr}
+	tst	lr,#1
+	moveq	pc,lr			@ be binary compatible with V4, yet
+	.word	0xe12fff1e			@ interoperable with Thumb ISA:-)
+#endif
+.size	poly1305_blocks,.-poly1305_blocks
+.type	poly1305_emit,%function
+.align	5
+poly1305_emit:
+.Lpoly1305_emit:
+	stmdb	sp!,{r4-r11}
+
+	ldmia	r0,{r3-r7}
+
+#if __ARM_ARCH__>=7
+	ldr	ip,[r0,#36]		@ is_base2_26
+
+	adds	r8,r3,r4,lsl#26	@ base 2^26 -> base 2^32
+	mov	r9,r4,lsr#6
+	adcs	r9,r9,r5,lsl#20
+	mov	r10,r5,lsr#12
+	adcs	r10,r10,r6,lsl#14
+	mov	r11,r6,lsr#18
+	adcs	r11,r11,r7,lsl#8
+	mov	r0,#0
+	adc	r0,r0,r7,lsr#24
+
+	tst	ip,ip
+	itttt	ne
+	movne	r3,r8
+	movne	r4,r9
+	movne	r5,r10
+	movne	r6,r11
+	it	ne
+	movne	r7,r0
+#endif
+
+	adds	r8,r3,#5		@ compare to modulus
+	adcs	r9,r4,#0
+	adcs	r10,r5,#0
+	adcs	r11,r6,#0
+	adc	r0,r7,#0
+	tst	r0,#4			@ did it carry/borrow?
+
+#ifdef	__thumb2__
+	it	ne
+#endif
+	movne	r3,r8
+	ldr	r8,[r2,#0]
+#ifdef	__thumb2__
+	it	ne
+#endif
+	movne	r4,r9
+	ldr	r9,[r2,#4]
+#ifdef	__thumb2__
+	it	ne
+#endif
+	movne	r5,r10
+	ldr	r10,[r2,#8]
+#ifdef	__thumb2__
+	it	ne
+#endif
+	movne	r6,r11
+	ldr	r11,[r2,#12]
+
+	adds	r3,r3,r8
+	adcs	r4,r4,r9
+	adcs	r5,r5,r10
+	adc	r6,r6,r11
+
+#if __ARM_ARCH__>=7
+# ifdef __ARMEB__
+	rev	r3,r3
+	rev	r4,r4
+	rev	r5,r5
+	rev	r6,r6
+# endif
+	str	r3,[r1,#0]
+	str	r4,[r1,#4]
+	str	r5,[r1,#8]
+	str	r6,[r1,#12]
+#else
+	strb	r3,[r1,#0]
+	mov	r3,r3,lsr#8
+	strb	r4,[r1,#4]
+	mov	r4,r4,lsr#8
+	strb	r5,[r1,#8]
+	mov	r5,r5,lsr#8
+	strb	r6,[r1,#12]
+	mov	r6,r6,lsr#8
+
+	strb	r3,[r1,#1]
+	mov	r3,r3,lsr#8
+	strb	r4,[r1,#5]
+	mov	r4,r4,lsr#8
+	strb	r5,[r1,#9]
+	mov	r5,r5,lsr#8
+	strb	r6,[r1,#13]
+	mov	r6,r6,lsr#8
+
+	strb	r3,[r1,#2]
+	mov	r3,r3,lsr#8
+	strb	r4,[r1,#6]
+	mov	r4,r4,lsr#8
+	strb	r5,[r1,#10]
+	mov	r5,r5,lsr#8
+	strb	r6,[r1,#14]
+	mov	r6,r6,lsr#8
+
+	strb	r3,[r1,#3]
+	strb	r4,[r1,#7]
+	strb	r5,[r1,#11]
+	strb	r6,[r1,#15]
+#endif
+	ldmia	sp!,{r4-r11}
+#if	__ARM_ARCH__>=5
+	bx	lr				@ bx	lr
+#else
+	tst	lr,#1
+	moveq	pc,lr			@ be binary compatible with V4, yet
+	.word	0xe12fff1e			@ interoperable with Thumb ISA:-)
+#endif
+.size	poly1305_emit,.-poly1305_emit
+#if	__ARM_MAX_ARCH__>=7
+.fpu	neon
+
+.type	poly1305_init_neon,%function
+.align	5
+poly1305_init_neon:
+.Lpoly1305_init_neon:
+	ldr	r3,[r0,#48]		@ first table element
+	cmp	r3,#-1			@ is value impossible?
+	bne	.Lno_init_neon
+
+	ldr	r4,[r0,#20]		@ load key base 2^32
+	ldr	r5,[r0,#24]
+	ldr	r6,[r0,#28]
+	ldr	r7,[r0,#32]
+
+	and	r2,r4,#0x03ffffff	@ base 2^32 -> base 2^26
+	mov	r3,r4,lsr#26
+	mov	r4,r5,lsr#20
+	orr	r3,r3,r5,lsl#6
+	mov	r5,r6,lsr#14
+	orr	r4,r4,r6,lsl#12
+	mov	r6,r7,lsr#8
+	orr	r5,r5,r7,lsl#18
+	and	r3,r3,#0x03ffffff
+	and	r4,r4,#0x03ffffff
+	and	r5,r5,#0x03ffffff
+
+	vdup.32	d0,r2			@ r^1 in both lanes
+	add	r2,r3,r3,lsl#2		@ *5
+	vdup.32	d1,r3
+	add	r3,r4,r4,lsl#2
+	vdup.32	d2,r2
+	vdup.32	d3,r4
+	add	r4,r5,r5,lsl#2
+	vdup.32	d4,r3
+	vdup.32	d5,r5
+	add	r5,r6,r6,lsl#2
+	vdup.32	d6,r4
+	vdup.32	d7,r6
+	vdup.32	d8,r5
+
+	mov	r5,#2		@ counter
+
+.Lsquare_neon:
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4
+	@ d1 = h1*r0 + h0*r1   + h4*5*r2 + h3*5*r3 + h2*5*r4
+	@ d2 = h2*r0 + h1*r1   + h0*r2   + h4*5*r3 + h3*5*r4
+	@ d3 = h3*r0 + h2*r1   + h1*r2   + h0*r3   + h4*5*r4
+	@ d4 = h4*r0 + h3*r1   + h2*r2   + h1*r3   + h0*r4
+
+	vmull.u32	q5,d0,d0[1]
+	vmull.u32	q6,d1,d0[1]
+	vmull.u32	q7,d3,d0[1]
+	vmull.u32	q8,d5,d0[1]
+	vmull.u32	q9,d7,d0[1]
+
+	vmlal.u32	q5,d7,d2[1]
+	vmlal.u32	q6,d0,d1[1]
+	vmlal.u32	q7,d1,d1[1]
+	vmlal.u32	q8,d3,d1[1]
+	vmlal.u32	q9,d5,d1[1]
+
+	vmlal.u32	q5,d5,d4[1]
+	vmlal.u32	q6,d7,d4[1]
+	vmlal.u32	q8,d1,d3[1]
+	vmlal.u32	q7,d0,d3[1]
+	vmlal.u32	q9,d3,d3[1]
+
+	vmlal.u32	q5,d3,d6[1]
+	vmlal.u32	q8,d0,d5[1]
+	vmlal.u32	q6,d5,d6[1]
+	vmlal.u32	q7,d7,d6[1]
+	vmlal.u32	q9,d1,d5[1]
+
+	vmlal.u32	q8,d7,d8[1]
+	vmlal.u32	q5,d1,d8[1]
+	vmlal.u32	q6,d3,d8[1]
+	vmlal.u32	q7,d5,d8[1]
+	vmlal.u32	q9,d0,d7[1]
+
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ lazy reduction as discussed in "NEON crypto" by D.J. Bernstein
+	@ and P. Schwabe
+	@
+	@ H0>>+H1>>+H2>>+H3>>+H4
+	@ H3>>+H4>>*5+H0>>+H1
+	@
+	@ Trivia.
+	@
+	@ Result of multiplication of n-bit number by m-bit number is
+	@ n+m bits wide. However! Even though 2^n is a n+1-bit number,
+	@ m-bit number multiplied by 2^n is still n+m bits wide.
+	@
+	@ Sum of two n-bit numbers is n+1 bits wide, sum of three - n+2,
+	@ and so is sum of four. Sum of 2^m n-m-bit numbers and n-bit
+	@ one is n+1 bits wide.
+	@
+	@ >>+ denotes Hnext += Hn>>26, Hn &= 0x3ffffff. This means that
+	@ H0, H2, H3 are guaranteed to be 26 bits wide, while H1 and H4
+	@ can be 27. However! In cases when their width exceeds 26 bits
+	@ they are limited by 2^26+2^6. This in turn means that *sum*
+	@ of the products with these values can still be viewed as sum
+	@ of 52-bit numbers as long as the amount of addends is not a
+	@ power of 2. For example,
+	@
+	@ H4 = H4*R0 + H3*R1 + H2*R2 + H1*R3 + H0 * R4,
+	@
+	@ which can't be larger than 5 * (2^26 + 2^6) * (2^26 + 2^6), or
+	@ 5 * (2^52 + 2*2^32 + 2^12), which in turn is smaller than
+	@ 8 * (2^52) or 2^55. However, the value is then multiplied by
+	@ by 5, so we should be looking at 5 * 5 * (2^52 + 2^33 + 2^12),
+	@ which is less than 32 * (2^52) or 2^57. And when processing
+	@ data we are looking at triple as many addends...
+	@
+	@ In key setup procedure pre-reduced H0 is limited by 5*4+1 and
+	@ 5*H4 - by 5*5 52-bit addends, or 57 bits. But when hashing the
+	@ input H0 is limited by (5*4+1)*3 addends, or 58 bits, while
+	@ 5*H4 by 5*5*3, or 59[!] bits. How is this relevant? vmlal.u32
+	@ instruction accepts 2x32-bit input and writes 2x64-bit result.
+	@ This means that result of reduction have to be compressed upon
+	@ loop wrap-around. This can be done in the process of reduction
+	@ to minimize amount of instructions [as well as amount of
+	@ 128-bit instructions, which benefits low-end processors], but
+	@ one has to watch for H2 (which is narrower than H0) and 5*H4
+	@ not being wider than 58 bits, so that result of right shift
+	@ by 26 bits fits in 32 bits. This is also useful on x86,
+	@ because it allows to use paddd in place for paddq, which
+	@ benefits Atom, where paddq is ridiculously slow.
+
+	vshr.u64	q15,q8,#26
+	vmovn.i64	d16,q8
+	 vshr.u64	q4,q5,#26
+	 vmovn.i64	d10,q5
+	vadd.i64	q9,q9,q15		@ h3 -> h4
+	vbic.i32	d16,#0xfc000000	@ &=0x03ffffff
+	 vadd.i64	q6,q6,q4		@ h0 -> h1
+	 vbic.i32	d10,#0xfc000000
+
+	vshrn.u64	d30,q9,#26
+	vmovn.i64	d18,q9
+	 vshr.u64	q4,q6,#26
+	 vmovn.i64	d12,q6
+	 vadd.i64	q7,q7,q4		@ h1 -> h2
+	vbic.i32	d18,#0xfc000000
+	 vbic.i32	d12,#0xfc000000
+
+	vadd.i32	d10,d10,d30
+	vshl.u32	d30,d30,#2
+	 vshrn.u64	d8,q7,#26
+	 vmovn.i64	d14,q7
+	vadd.i32	d10,d10,d30	@ h4 -> h0
+	 vadd.i32	d16,d16,d8	@ h2 -> h3
+	 vbic.i32	d14,#0xfc000000
+
+	vshr.u32	d30,d10,#26
+	vbic.i32	d10,#0xfc000000
+	 vshr.u32	d8,d16,#26
+	 vbic.i32	d16,#0xfc000000
+	vadd.i32	d12,d12,d30	@ h0 -> h1
+	 vadd.i32	d18,d18,d8	@ h3 -> h4
+
+	subs		r5,r5,#1
+	beq		.Lsquare_break_neon
+
+	add		r6,r0,#(48+0*9*4)
+	add		r7,r0,#(48+1*9*4)
+
+	vtrn.32		d0,d10		@ r^2:r^1
+	vtrn.32		d3,d14
+	vtrn.32		d5,d16
+	vtrn.32		d1,d12
+	vtrn.32		d7,d18
+
+	vshl.u32	d4,d3,#2		@ *5
+	vshl.u32	d6,d5,#2
+	vshl.u32	d2,d1,#2
+	vshl.u32	d8,d7,#2
+	vadd.i32	d4,d4,d3
+	vadd.i32	d2,d2,d1
+	vadd.i32	d6,d6,d5
+	vadd.i32	d8,d8,d7
+
+	vst4.32		{d0[0],d1[0],d2[0],d3[0]},[r6]!
+	vst4.32		{d0[1],d1[1],d2[1],d3[1]},[r7]!
+	vst4.32		{d4[0],d5[0],d6[0],d7[0]},[r6]!
+	vst4.32		{d4[1],d5[1],d6[1],d7[1]},[r7]!
+	vst1.32		{d8[0]},[r6,:32]
+	vst1.32		{d8[1]},[r7,:32]
+
+	b		.Lsquare_neon
+
+.align	4
+.Lsquare_break_neon:
+	add		r6,r0,#(48+2*4*9)
+	add		r7,r0,#(48+3*4*9)
+
+	vmov		d0,d10		@ r^4:r^3
+	vshl.u32	d2,d12,#2		@ *5
+	vmov		d1,d12
+	vshl.u32	d4,d14,#2
+	vmov		d3,d14
+	vshl.u32	d6,d16,#2
+	vmov		d5,d16
+	vshl.u32	d8,d18,#2
+	vmov		d7,d18
+	vadd.i32	d2,d2,d12
+	vadd.i32	d4,d4,d14
+	vadd.i32	d6,d6,d16
+	vadd.i32	d8,d8,d18
+
+	vst4.32		{d0[0],d1[0],d2[0],d3[0]},[r6]!
+	vst4.32		{d0[1],d1[1],d2[1],d3[1]},[r7]!
+	vst4.32		{d4[0],d5[0],d6[0],d7[0]},[r6]!
+	vst4.32		{d4[1],d5[1],d6[1],d7[1]},[r7]!
+	vst1.32		{d8[0]},[r6]
+	vst1.32		{d8[1]},[r7]
+
+.Lno_init_neon:
+	bx	lr				@ bx	lr
+.size	poly1305_init_neon,.-poly1305_init_neon
+
+.type	poly1305_blocks_neon,%function
+.align	5
+poly1305_blocks_neon:
+.Lpoly1305_blocks_neon:
+	ldr	ip,[r0,#36]		@ is_base2_26
+
+	cmp	r2,#64
+	blo	.Lpoly1305_blocks
+
+	stmdb	sp!,{r4-r7}
+	vstmdb	sp!,{d8-d15}		@ ABI specification says so
+
+	tst	ip,ip			@ is_base2_26?
+	bne	.Lbase2_26_neon
+
+	stmdb	sp!,{r1-r3,lr}
+	bl	.Lpoly1305_init_neon
+
+	ldr	r4,[r0,#0]		@ load hash value base 2^32
+	ldr	r5,[r0,#4]
+	ldr	r6,[r0,#8]
+	ldr	r7,[r0,#12]
+	ldr	ip,[r0,#16]
+
+	and	r2,r4,#0x03ffffff	@ base 2^32 -> base 2^26
+	mov	r3,r4,lsr#26
+	 veor	d10,d10,d10
+	mov	r4,r5,lsr#20
+	orr	r3,r3,r5,lsl#6
+	 veor	d12,d12,d12
+	mov	r5,r6,lsr#14
+	orr	r4,r4,r6,lsl#12
+	 veor	d14,d14,d14
+	mov	r6,r7,lsr#8
+	orr	r5,r5,r7,lsl#18
+	 veor	d16,d16,d16
+	and	r3,r3,#0x03ffffff
+	orr	r6,r6,ip,lsl#24
+	 veor	d18,d18,d18
+	and	r4,r4,#0x03ffffff
+	mov	r1,#1
+	and	r5,r5,#0x03ffffff
+	str	r1,[r0,#36]		@ set is_base2_26
+
+	vmov.32	d10[0],r2
+	vmov.32	d12[0],r3
+	vmov.32	d14[0],r4
+	vmov.32	d16[0],r5
+	vmov.32	d18[0],r6
+	adr	r5,.Lzeros
+
+	ldmia	sp!,{r1-r3,lr}
+	b	.Lhash_loaded
+
+.align	4
+.Lbase2_26_neon:
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ load hash value
+
+	veor		d10,d10,d10
+	veor		d12,d12,d12
+	veor		d14,d14,d14
+	veor		d16,d16,d16
+	veor		d18,d18,d18
+	vld4.32		{d10[0],d12[0],d14[0],d16[0]},[r0]!
+	adr		r5,.Lzeros
+	vld1.32		{d18[0]},[r0]
+	sub		r0,r0,#16		@ rewind
+
+.Lhash_loaded:
+	add		r4,r1,#32
+	mov		r3,r3,lsl#24
+	tst		r2,#31
+	beq		.Leven
+
+	vld4.32		{d20[0],d22[0],d24[0],d26[0]},[r1]!
+	vmov.32		d28[0],r3
+	sub		r2,r2,#16
+	add		r4,r1,#32
+
+# ifdef	__ARMEB__
+	vrev32.8	q10,q10
+	vrev32.8	q13,q13
+	vrev32.8	q11,q11
+	vrev32.8	q12,q12
+# endif
+	vsri.u32	d28,d26,#8	@ base 2^32 -> base 2^26
+	vshl.u32	d26,d26,#18
+
+	vsri.u32	d26,d24,#14
+	vshl.u32	d24,d24,#12
+	vadd.i32	d29,d28,d18	@ add hash value and move to #hi
+
+	vbic.i32	d26,#0xfc000000
+	vsri.u32	d24,d22,#20
+	vshl.u32	d22,d22,#6
+
+	vbic.i32	d24,#0xfc000000
+	vsri.u32	d22,d20,#26
+	vadd.i32	d27,d26,d16
+
+	vbic.i32	d20,#0xfc000000
+	vbic.i32	d22,#0xfc000000
+	vadd.i32	d25,d24,d14
+
+	vadd.i32	d21,d20,d10
+	vadd.i32	d23,d22,d12
+
+	mov		r7,r5
+	add		r6,r0,#48
+
+	cmp		r2,r2
+	b		.Long_tail
+
+.align	4
+.Leven:
+	subs		r2,r2,#64
+	it		lo
+	movlo		r4,r5
+
+	vmov.i32	q14,#1<<24		@ padbit, yes, always
+	vld4.32		{d20,d22,d24,d26},[r1]	@ inp[0:1]
+	add		r1,r1,#64
+	vld4.32		{d21,d23,d25,d27},[r4]	@ inp[2:3] (or 0)
+	add		r4,r4,#64
+	itt		hi
+	addhi		r7,r0,#(48+1*9*4)
+	addhi		r6,r0,#(48+3*9*4)
+
+# ifdef	__ARMEB__
+	vrev32.8	q10,q10
+	vrev32.8	q13,q13
+	vrev32.8	q11,q11
+	vrev32.8	q12,q12
+# endif
+	vsri.u32	q14,q13,#8		@ base 2^32 -> base 2^26
+	vshl.u32	q13,q13,#18
+
+	vsri.u32	q13,q12,#14
+	vshl.u32	q12,q12,#12
+
+	vbic.i32	q13,#0xfc000000
+	vsri.u32	q12,q11,#20
+	vshl.u32	q11,q11,#6
+
+	vbic.i32	q12,#0xfc000000
+	vsri.u32	q11,q10,#26
+
+	vbic.i32	q10,#0xfc000000
+	vbic.i32	q11,#0xfc000000
+
+	bls		.Lskip_loop
+
+	vld4.32		{d0[1],d1[1],d2[1],d3[1]},[r7]!	@ load r^2
+	vld4.32		{d0[0],d1[0],d2[0],d3[0]},[r6]!	@ load r^4
+	vld4.32		{d4[1],d5[1],d6[1],d7[1]},[r7]!
+	vld4.32		{d4[0],d5[0],d6[0],d7[0]},[r6]!
+	b		.Loop_neon
+
+.align	5
+.Loop_neon:
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2
+	@ ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^3+inp[7]*r
+	@   ___________________/
+	@ ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2+inp[8])*r^2
+	@ ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^4+inp[7]*r^2+inp[9])*r
+	@   ___________________/ ____________________/
+	@
+	@ Note that we start with inp[2:3]*r^2. This is because it
+	@ doesn't depend on reduction in previous iteration.
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ d4 = h4*r0 + h3*r1   + h2*r2   + h1*r3   + h0*r4
+	@ d3 = h3*r0 + h2*r1   + h1*r2   + h0*r3   + h4*5*r4
+	@ d2 = h2*r0 + h1*r1   + h0*r2   + h4*5*r3 + h3*5*r4
+	@ d1 = h1*r0 + h0*r1   + h4*5*r2 + h3*5*r3 + h2*5*r4
+	@ d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4
+
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ inp[2:3]*r^2
+
+	vadd.i32	d24,d24,d14	@ accumulate inp[0:1]
+	vmull.u32	q7,d25,d0[1]
+	vadd.i32	d20,d20,d10
+	vmull.u32	q5,d21,d0[1]
+	vadd.i32	d26,d26,d16
+	vmull.u32	q8,d27,d0[1]
+	vmlal.u32	q7,d23,d1[1]
+	vadd.i32	d22,d22,d12
+	vmull.u32	q6,d23,d0[1]
+
+	vadd.i32	d28,d28,d18
+	vmull.u32	q9,d29,d0[1]
+	subs		r2,r2,#64
+	vmlal.u32	q5,d29,d2[1]
+	it		lo
+	movlo		r4,r5
+	vmlal.u32	q8,d25,d1[1]
+	vld1.32		d8[1],[r7,:32]
+	vmlal.u32	q6,d21,d1[1]
+	vmlal.u32	q9,d27,d1[1]
+
+	vmlal.u32	q5,d27,d4[1]
+	vmlal.u32	q8,d23,d3[1]
+	vmlal.u32	q9,d25,d3[1]
+	vmlal.u32	q6,d29,d4[1]
+	vmlal.u32	q7,d21,d3[1]
+
+	vmlal.u32	q8,d21,d5[1]
+	vmlal.u32	q5,d25,d6[1]
+	vmlal.u32	q9,d23,d5[1]
+	vmlal.u32	q6,d27,d6[1]
+	vmlal.u32	q7,d29,d6[1]
+
+	vmlal.u32	q8,d29,d8[1]
+	vmlal.u32	q5,d23,d8[1]
+	vmlal.u32	q9,d21,d7[1]
+	vmlal.u32	q6,d25,d8[1]
+	vmlal.u32	q7,d27,d8[1]
+
+	vld4.32		{d21,d23,d25,d27},[r4]	@ inp[2:3] (or 0)
+	add		r4,r4,#64
+
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ (hash+inp[0:1])*r^4 and accumulate
+
+	vmlal.u32	q8,d26,d0[0]
+	vmlal.u32	q5,d20,d0[0]
+	vmlal.u32	q9,d28,d0[0]
+	vmlal.u32	q6,d22,d0[0]
+	vmlal.u32	q7,d24,d0[0]
+	vld1.32		d8[0],[r6,:32]
+
+	vmlal.u32	q8,d24,d1[0]
+	vmlal.u32	q5,d28,d2[0]
+	vmlal.u32	q9,d26,d1[0]
+	vmlal.u32	q6,d20,d1[0]
+	vmlal.u32	q7,d22,d1[0]
+
+	vmlal.u32	q8,d22,d3[0]
+	vmlal.u32	q5,d26,d4[0]
+	vmlal.u32	q9,d24,d3[0]
+	vmlal.u32	q6,d28,d4[0]
+	vmlal.u32	q7,d20,d3[0]
+
+	vmlal.u32	q8,d20,d5[0]
+	vmlal.u32	q5,d24,d6[0]
+	vmlal.u32	q9,d22,d5[0]
+	vmlal.u32	q6,d26,d6[0]
+	vmlal.u32	q8,d28,d8[0]
+
+	vmlal.u32	q7,d28,d6[0]
+	vmlal.u32	q5,d22,d8[0]
+	vmlal.u32	q9,d20,d7[0]
+	vmov.i32	q14,#1<<24		@ padbit, yes, always
+	vmlal.u32	q6,d24,d8[0]
+	vmlal.u32	q7,d26,d8[0]
+
+	vld4.32		{d20,d22,d24,d26},[r1]	@ inp[0:1]
+	add		r1,r1,#64
+# ifdef	__ARMEB__
+	vrev32.8	q10,q10
+	vrev32.8	q11,q11
+	vrev32.8	q12,q12
+	vrev32.8	q13,q13
+# endif
+
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ lazy reduction interleaved with base 2^32 -> base 2^26 of
+	@ inp[0:3] previously loaded to q10-q13 and smashed to q10-q14.
+
+	vshr.u64	q15,q8,#26
+	vmovn.i64	d16,q8
+	 vshr.u64	q4,q5,#26
+	 vmovn.i64	d10,q5
+	vadd.i64	q9,q9,q15		@ h3 -> h4
+	vbic.i32	d16,#0xfc000000
+	  vsri.u32	q14,q13,#8		@ base 2^32 -> base 2^26
+	 vadd.i64	q6,q6,q4		@ h0 -> h1
+	  vshl.u32	q13,q13,#18
+	 vbic.i32	d10,#0xfc000000
+
+	vshrn.u64	d30,q9,#26
+	vmovn.i64	d18,q9
+	 vshr.u64	q4,q6,#26
+	 vmovn.i64	d12,q6
+	 vadd.i64	q7,q7,q4		@ h1 -> h2
+	  vsri.u32	q13,q12,#14
+	vbic.i32	d18,#0xfc000000
+	  vshl.u32	q12,q12,#12
+	 vbic.i32	d12,#0xfc000000
+
+	vadd.i32	d10,d10,d30
+	vshl.u32	d30,d30,#2
+	  vbic.i32	q13,#0xfc000000
+	 vshrn.u64	d8,q7,#26
+	 vmovn.i64	d14,q7
+	vaddl.u32	q5,d10,d30	@ h4 -> h0 [widen for a sec]
+	  vsri.u32	q12,q11,#20
+	 vadd.i32	d16,d16,d8	@ h2 -> h3
+	  vshl.u32	q11,q11,#6
+	 vbic.i32	d14,#0xfc000000
+	  vbic.i32	q12,#0xfc000000
+
+	vshrn.u64	d30,q5,#26		@ re-narrow
+	vmovn.i64	d10,q5
+	  vsri.u32	q11,q10,#26
+	  vbic.i32	q10,#0xfc000000
+	 vshr.u32	d8,d16,#26
+	 vbic.i32	d16,#0xfc000000
+	vbic.i32	d10,#0xfc000000
+	vadd.i32	d12,d12,d30	@ h0 -> h1
+	 vadd.i32	d18,d18,d8	@ h3 -> h4
+	  vbic.i32	q11,#0xfc000000
+
+	bhi		.Loop_neon
+
+.Lskip_loop:
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ multiply (inp[0:1]+hash) or inp[2:3] by r^2:r^1
+
+	add		r7,r0,#(48+0*9*4)
+	add		r6,r0,#(48+1*9*4)
+	adds		r2,r2,#32
+	it		ne
+	movne		r2,#0
+	bne		.Long_tail
+
+	vadd.i32	d25,d24,d14	@ add hash value and move to #hi
+	vadd.i32	d21,d20,d10
+	vadd.i32	d27,d26,d16
+	vadd.i32	d23,d22,d12
+	vadd.i32	d29,d28,d18
+
+.Long_tail:
+	vld4.32		{d0[1],d1[1],d2[1],d3[1]},[r7]!	@ load r^1
+	vld4.32		{d0[0],d1[0],d2[0],d3[0]},[r6]!	@ load r^2
+
+	vadd.i32	d24,d24,d14	@ can be redundant
+	vmull.u32	q7,d25,d0
+	vadd.i32	d20,d20,d10
+	vmull.u32	q5,d21,d0
+	vadd.i32	d26,d26,d16
+	vmull.u32	q8,d27,d0
+	vadd.i32	d22,d22,d12
+	vmull.u32	q6,d23,d0
+	vadd.i32	d28,d28,d18
+	vmull.u32	q9,d29,d0
+
+	vmlal.u32	q5,d29,d2
+	vld4.32		{d4[1],d5[1],d6[1],d7[1]},[r7]!
+	vmlal.u32	q8,d25,d1
+	vld4.32		{d4[0],d5[0],d6[0],d7[0]},[r6]!
+	vmlal.u32	q6,d21,d1
+	vmlal.u32	q9,d27,d1
+	vmlal.u32	q7,d23,d1
+
+	vmlal.u32	q8,d23,d3
+	vld1.32		d8[1],[r7,:32]
+	vmlal.u32	q5,d27,d4
+	vld1.32		d8[0],[r6,:32]
+	vmlal.u32	q9,d25,d3
+	vmlal.u32	q6,d29,d4
+	vmlal.u32	q7,d21,d3
+
+	vmlal.u32	q8,d21,d5
+	 it		ne
+	 addne		r7,r0,#(48+2*9*4)
+	vmlal.u32	q5,d25,d6
+	 it		ne
+	 addne		r6,r0,#(48+3*9*4)
+	vmlal.u32	q9,d23,d5
+	vmlal.u32	q6,d27,d6
+	vmlal.u32	q7,d29,d6
+
+	vmlal.u32	q8,d29,d8
+	 vorn		q0,q0,q0	@ all-ones, can be redundant
+	vmlal.u32	q5,d23,d8
+	 vshr.u64	q0,q0,#38
+	vmlal.u32	q9,d21,d7
+	vmlal.u32	q6,d25,d8
+	vmlal.u32	q7,d27,d8
+
+	beq		.Lshort_tail
+
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ (hash+inp[0:1])*r^4:r^3 and accumulate
+
+	vld4.32		{d0[1],d1[1],d2[1],d3[1]},[r7]!	@ load r^3
+	vld4.32		{d0[0],d1[0],d2[0],d3[0]},[r6]!	@ load r^4
+
+	vmlal.u32	q7,d24,d0
+	vmlal.u32	q5,d20,d0
+	vmlal.u32	q8,d26,d0
+	vmlal.u32	q6,d22,d0
+	vmlal.u32	q9,d28,d0
+
+	vmlal.u32	q5,d28,d2
+	vld4.32		{d4[1],d5[1],d6[1],d7[1]},[r7]!
+	vmlal.u32	q8,d24,d1
+	vld4.32		{d4[0],d5[0],d6[0],d7[0]},[r6]!
+	vmlal.u32	q6,d20,d1
+	vmlal.u32	q9,d26,d1
+	vmlal.u32	q7,d22,d1
+
+	vmlal.u32	q8,d22,d3
+	vld1.32		d8[1],[r7,:32]
+	vmlal.u32	q5,d26,d4
+	vld1.32		d8[0],[r6,:32]
+	vmlal.u32	q9,d24,d3
+	vmlal.u32	q6,d28,d4
+	vmlal.u32	q7,d20,d3
+
+	vmlal.u32	q8,d20,d5
+	vmlal.u32	q5,d24,d6
+	vmlal.u32	q9,d22,d5
+	vmlal.u32	q6,d26,d6
+	vmlal.u32	q7,d28,d6
+
+	vmlal.u32	q8,d28,d8
+	 vorn		q0,q0,q0	@ all-ones
+	vmlal.u32	q5,d22,d8
+	 vshr.u64	q0,q0,#38
+	vmlal.u32	q9,d20,d7
+	vmlal.u32	q6,d24,d8
+	vmlal.u32	q7,d26,d8
+
+.Lshort_tail:
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ horizontal addition
+
+	vadd.i64	d16,d16,d17
+	vadd.i64	d10,d10,d11
+	vadd.i64	d18,d18,d19
+	vadd.i64	d12,d12,d13
+	vadd.i64	d14,d14,d15
+
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ lazy reduction, but without narrowing
+
+	vshr.u64	q15,q8,#26
+	vand.i64	q8,q8,q0
+	 vshr.u64	q4,q5,#26
+	 vand.i64	q5,q5,q0
+	vadd.i64	q9,q9,q15		@ h3 -> h4
+	 vadd.i64	q6,q6,q4		@ h0 -> h1
+
+	vshr.u64	q15,q9,#26
+	vand.i64	q9,q9,q0
+	 vshr.u64	q4,q6,#26
+	 vand.i64	q6,q6,q0
+	 vadd.i64	q7,q7,q4		@ h1 -> h2
+
+	vadd.i64	q5,q5,q15
+	vshl.u64	q15,q15,#2
+	 vshr.u64	q4,q7,#26
+	 vand.i64	q7,q7,q0
+	vadd.i64	q5,q5,q15		@ h4 -> h0
+	 vadd.i64	q8,q8,q4		@ h2 -> h3
+
+	vshr.u64	q15,q5,#26
+	vand.i64	q5,q5,q0
+	 vshr.u64	q4,q8,#26
+	 vand.i64	q8,q8,q0
+	vadd.i64	q6,q6,q15		@ h0 -> h1
+	 vadd.i64	q9,q9,q4		@ h3 -> h4
+
+	cmp		r2,#0
+	bne		.Leven
+
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ store hash value
+
+	vst4.32		{d10[0],d12[0],d14[0],d16[0]},[r0]!
+	vst1.32		{d18[0]},[r0]
+
+	vldmia	sp!,{d8-d15}			@ epilogue
+	ldmia	sp!,{r4-r7}
+	bx	lr					@ bx	lr
+.size	poly1305_blocks_neon,.-poly1305_blocks_neon
+
+.align	5
+.Lzeros:
+.long	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
+#ifndef	__KERNEL__
+.LOPENSSL_armcap:
+# ifdef	_WIN32
+.word	OPENSSL_armcap_P
+# else
+.word	OPENSSL_armcap_P-.Lpoly1305_init
+# endif
+.comm	OPENSSL_armcap_P,4,4
+.hidden	OPENSSL_armcap_P
+#endif
+#endif
+.asciz	"Poly1305 for ARMv4/NEON, CRYPTOGAMS by @dot-asm"
+.align	2
diff --git a/arch/arm/crypto/poly1305-glue.c b/arch/arm/crypto/poly1305-glue.c
new file mode 100644
index 000000000000..74a725ac89c9
--- /dev/null
+++ b/arch/arm/crypto/poly1305-glue.c
@@ -0,0 +1,276 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * OpenSSL/Cryptogams accelerated Poly1305 transform for ARM
+ *
+ * Copyright (C) 2019 Linaro Ltd. <ard.biesheuvel@linaro.org>
+ */
+
+#include <asm/hwcap.h>
+#include <asm/neon.h>
+#include <asm/simd.h>
+#include <asm/unaligned.h>
+#include <crypto/algapi.h>
+#include <crypto/internal/hash.h>
+#include <crypto/internal/poly1305.h>
+#include <crypto/internal/simd.h>
+#include <linux/cpufeature.h>
+#include <linux/crypto.h>
+#include <linux/jump_label.h>
+#include <linux/module.h>
+
+void poly1305_init_arm(void *state, const u8 *key);
+void poly1305_blocks_arm(void *state, const u8 *src, u32 len, u32 hibit);
+void poly1305_emit_arm(void *state, __le32 *digest, const u32 *nonce);
+
+void __weak poly1305_blocks_neon(void *state, const u8 *src, u32 len, u32 hibit)
+{
+}
+
+static __ro_after_init DEFINE_STATIC_KEY_FALSE(have_neon);
+
+void poly1305_init_arch(struct poly1305_desc_ctx *dctx, const u8 *key)
+{
+	poly1305_init_arm(&dctx->h, key);
+	dctx->s[0] = get_unaligned_le32(key + 16);
+	dctx->s[1] = get_unaligned_le32(key + 20);
+	dctx->s[2] = get_unaligned_le32(key + 24);
+	dctx->s[3] = get_unaligned_le32(key + 28);
+	dctx->buflen = 0;
+}
+EXPORT_SYMBOL(poly1305_init_arch);
+
+static int arm_poly1305_init(struct shash_desc *desc)
+{
+	struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
+
+	dctx->buflen = 0;
+	dctx->rset = 0;
+	dctx->sset = false;
+
+	return 0;
+}
+
+static void arm_poly1305_blocks(struct poly1305_desc_ctx *dctx, const u8 *src,
+				 u32 len, u32 hibit, bool do_neon)
+{
+	if (unlikely(!dctx->sset)) {
+		if (!dctx->rset) {
+			poly1305_init_arm(&dctx->h, src);
+			src += POLY1305_BLOCK_SIZE;
+			len -= POLY1305_BLOCK_SIZE;
+			dctx->rset = 1;
+		}
+		if (len >= POLY1305_BLOCK_SIZE) {
+			dctx->s[0] = get_unaligned_le32(src +  0);
+			dctx->s[1] = get_unaligned_le32(src +  4);
+			dctx->s[2] = get_unaligned_le32(src +  8);
+			dctx->s[3] = get_unaligned_le32(src + 12);
+			src += POLY1305_BLOCK_SIZE;
+			len -= POLY1305_BLOCK_SIZE;
+			dctx->sset = true;
+		}
+		if (len < POLY1305_BLOCK_SIZE)
+			return;
+	}
+
+	len &= ~(POLY1305_BLOCK_SIZE - 1);
+
+	if (static_branch_likely(&have_neon) && likely(do_neon))
+		poly1305_blocks_neon(&dctx->h, src, len, hibit);
+	else
+		poly1305_blocks_arm(&dctx->h, src, len, hibit);
+}
+
+static void arm_poly1305_do_update(struct poly1305_desc_ctx *dctx,
+				    const u8 *src, u32 len, bool do_neon)
+{
+	if (unlikely(dctx->buflen)) {
+		u32 bytes = min(len, POLY1305_BLOCK_SIZE - dctx->buflen);
+
+		memcpy(dctx->buf + dctx->buflen, src, bytes);
+		src += bytes;
+		len -= bytes;
+		dctx->buflen += bytes;
+
+		if (dctx->buflen == POLY1305_BLOCK_SIZE) {
+			arm_poly1305_blocks(dctx, dctx->buf,
+					    POLY1305_BLOCK_SIZE, 1, false);
+			dctx->buflen = 0;
+		}
+	}
+
+	if (likely(len >= POLY1305_BLOCK_SIZE)) {
+		arm_poly1305_blocks(dctx, src, len, 1, do_neon);
+		src += round_down(len, POLY1305_BLOCK_SIZE);
+		len %= POLY1305_BLOCK_SIZE;
+	}
+
+	if (unlikely(len)) {
+		dctx->buflen = len;
+		memcpy(dctx->buf, src, len);
+	}
+}
+
+static int arm_poly1305_update(struct shash_desc *desc,
+			       const u8 *src, unsigned int srclen)
+{
+	struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
+
+	arm_poly1305_do_update(dctx, src, srclen, false);
+	return 0;
+}
+
+static int __maybe_unused arm_poly1305_update_neon(struct shash_desc *desc,
+						   const u8 *src,
+						   unsigned int srclen)
+{
+	struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
+	bool do_neon = crypto_simd_usable() && srclen > 128;
+
+	if (static_branch_likely(&have_neon) && do_neon)
+		kernel_neon_begin();
+	arm_poly1305_do_update(dctx, src, srclen, do_neon);
+	if (static_branch_likely(&have_neon) && do_neon)
+		kernel_neon_end();
+	return 0;
+}
+
+void poly1305_update_arch(struct poly1305_desc_ctx *dctx, const u8 *src,
+			  unsigned int nbytes)
+{
+	bool do_neon = IS_ENABLED(CONFIG_KERNEL_MODE_NEON) &&
+		       crypto_simd_usable();
+
+	if (unlikely(dctx->buflen)) {
+		u32 bytes = min(nbytes, POLY1305_BLOCK_SIZE - dctx->buflen);
+
+		memcpy(dctx->buf + dctx->buflen, src, bytes);
+		src += bytes;
+		nbytes -= bytes;
+		dctx->buflen += bytes;
+
+		if (dctx->buflen == POLY1305_BLOCK_SIZE) {
+			poly1305_blocks_arm(&dctx->h, dctx->buf,
+					    POLY1305_BLOCK_SIZE, 1);
+			dctx->buflen = 0;
+		}
+	}
+
+	if (likely(nbytes >= POLY1305_BLOCK_SIZE)) {
+		unsigned int len = round_down(nbytes, POLY1305_BLOCK_SIZE);
+
+		if (static_branch_likely(&have_neon) && do_neon) {
+			kernel_neon_begin();
+			poly1305_blocks_neon(&dctx->h, src, len, 1);
+			kernel_neon_end();
+		} else {
+			poly1305_blocks_arm(&dctx->h, src, len, 1);
+		}
+		src += len;
+		nbytes %= POLY1305_BLOCK_SIZE;
+	}
+
+	if (unlikely(nbytes)) {
+		dctx->buflen = nbytes;
+		memcpy(dctx->buf, src, nbytes);
+	}
+}
+EXPORT_SYMBOL(poly1305_update_arch);
+
+void poly1305_final_arch(struct poly1305_desc_ctx *dctx, u8 *dst)
+{
+	__le32 digest[4];
+	u64 f = 0;
+
+	if (unlikely(dctx->buflen)) {
+		dctx->buf[dctx->buflen++] = 1;
+		memset(dctx->buf + dctx->buflen, 0,
+		       POLY1305_BLOCK_SIZE - dctx->buflen);
+		poly1305_blocks_arm(&dctx->h, dctx->buf, POLY1305_BLOCK_SIZE, 0);
+	}
+
+	poly1305_emit_arm(&dctx->h, digest, dctx->s);
+
+	/* mac = (h + s) % (2^128) */
+	f = (f >> 32) + le32_to_cpu(digest[0]);
+	put_unaligned_le32(f, dst);
+	f = (f >> 32) + le32_to_cpu(digest[1]);
+	put_unaligned_le32(f, dst + 4);
+	f = (f >> 32) + le32_to_cpu(digest[2]);
+	put_unaligned_le32(f, dst + 8);
+	f = (f >> 32) + le32_to_cpu(digest[3]);
+	put_unaligned_le32(f, dst + 12);
+
+	*dctx = (struct poly1305_desc_ctx){};
+}
+EXPORT_SYMBOL(poly1305_final_arch);
+
+static int arm_poly1305_final(struct shash_desc *desc, u8 *dst)
+{
+	struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
+
+	if (unlikely(!dctx->sset))
+		return -ENOKEY;
+
+	poly1305_final_arch(dctx, dst);
+	return 0;
+}
+
+static struct shash_alg arm_poly1305_algs[] = {{
+	.init			= arm_poly1305_init,
+	.update			= arm_poly1305_update,
+	.final			= arm_poly1305_final,
+	.digestsize		= POLY1305_DIGEST_SIZE,
+	.descsize		= sizeof(struct poly1305_desc_ctx),
+
+	.base.cra_name		= "poly1305",
+	.base.cra_driver_name	= "poly1305-arm",
+	.base.cra_priority	= 150,
+	.base.cra_blocksize	= POLY1305_BLOCK_SIZE,
+	.base.cra_module	= THIS_MODULE,
+#ifdef CONFIG_KERNEL_MODE_NEON
+}, {
+	.init			= arm_poly1305_init,
+	.update			= arm_poly1305_update_neon,
+	.final			= arm_poly1305_final,
+	.digestsize		= POLY1305_DIGEST_SIZE,
+	.descsize		= sizeof(struct poly1305_desc_ctx),
+
+	.base.cra_name		= "poly1305",
+	.base.cra_driver_name	= "poly1305-neon",
+	.base.cra_priority	= 200,
+	.base.cra_blocksize	= POLY1305_BLOCK_SIZE,
+	.base.cra_module	= THIS_MODULE,
+#endif
+}};
+
+static int __init arm_poly1305_mod_init(void)
+{
+	if (IS_ENABLED(CONFIG_KERNEL_MODE_NEON) &&
+	    (elf_hwcap & HWCAP_NEON))
+		static_branch_enable(&have_neon);
+	else
+		/* register only the first entry */
+		return crypto_register_shash(&arm_poly1305_algs[0]);
+
+	return crypto_register_shashes(arm_poly1305_algs,
+				       ARRAY_SIZE(arm_poly1305_algs));
+}
+
+static void __exit arm_poly1305_mod_exit(void)
+{
+	if (!static_branch_likely(&have_neon)) {
+		crypto_unregister_shash(&arm_poly1305_algs[0]);
+		return;
+	}
+	crypto_unregister_shashes(arm_poly1305_algs,
+				  ARRAY_SIZE(arm_poly1305_algs));
+}
+
+module_init(arm_poly1305_mod_init);
+module_exit(arm_poly1305_mod_exit);
+
+MODULE_LICENSE("GPL v2");
+MODULE_ALIAS_CRYPTO("poly1305");
+MODULE_ALIAS_CRYPTO("poly1305-arm");
+MODULE_ALIAS_CRYPTO("poly1305-neon");
diff --git a/lib/crypto/Kconfig b/lib/crypto/Kconfig
index 9923445e8225..9bd15b227e78 100644
--- a/lib/crypto/Kconfig
+++ b/lib/crypto/Kconfig
@@ -40,7 +40,7 @@ config CRYPTO_LIB_DES
 config CRYPTO_LIB_POLY1305_RSIZE
 	int
 	default 4 if X86_64
-	default 9 if ARM64
+	default 9 if ARM || ARM64
 	default 1
 
 config CRYPTO_ARCH_HAVE_LIB_POLY1305
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4 20/35] crypto: mips/poly1305 - incorporate OpenSSL/CRYPTOGAMS optimized implementation
  2019-10-17 19:08 [PATCH v4 00/35] crypto: crypto API library interfaces for WireGuard Ard Biesheuvel
                   ` (18 preceding siblings ...)
  2019-10-17 19:09 ` [PATCH v4 19/35] crypto: arm/poly1305 " Ard Biesheuvel
@ 2019-10-17 19:09 ` Ard Biesheuvel
  2019-10-17 19:09 ` [PATCH v4 21/35] int128: move __uint128_t compiler test to Kconfig Ard Biesheuvel
                   ` (14 subsequent siblings)
  34 siblings, 0 replies; 50+ messages in thread
From: Ard Biesheuvel @ 2019-10-17 19:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: Ard Biesheuvel, Herbert Xu, David Miller, Jason A . Donenfeld,
	Samuel Neves, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Martin Willi, Rene van Dorst, David Sterba, Andy Polyakov

This is a straight import of the OpenSSL/CRYPTOGAMS Poly1305 implementation for
MIPS authored by Andy Polyakov, a prior 64-bit only version of which has been
contributed by him to the OpenSSL project. The file 'poly1305-mips.pl' is taken
straight from this upstream GitHub repository [0] at commit
d22ade312a7af958ec955620b0d241cf42c37feb, and already contains all the changes
required to build it as part of a Linux kernel module.

[0] https://github.com/dot-asm/cryptogams

Co-developed-by: Andy Polyakov <appro@cryptogams.org>
Signed-off-by: Andy Polyakov <appro@cryptogams.org>
Co-developed-by: René van Dorst <opensource@vdorst.com>
Signed-off-by: René van Dorst <opensource@vdorst.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/mips/crypto/Makefile         |   14 +
 arch/mips/crypto/poly1305-glue.c  |  203 ++++
 arch/mips/crypto/poly1305-mips.pl | 1273 ++++++++++++++++++++
 crypto/Kconfig                    |    5 +
 lib/crypto/Kconfig                |    1 +
 5 files changed, 1496 insertions(+)

diff --git a/arch/mips/crypto/Makefile b/arch/mips/crypto/Makefile
index b528b9d300f1..8e1deaf00e0c 100644
--- a/arch/mips/crypto/Makefile
+++ b/arch/mips/crypto/Makefile
@@ -8,3 +8,17 @@ obj-$(CONFIG_CRYPTO_CRC32_MIPS) += crc32-mips.o
 obj-$(CONFIG_CRYPTO_CHACHA_MIPS) += chacha-mips.o
 chacha-mips-y := chacha-core.o chacha-glue.o
 AFLAGS_chacha-core.o += -O2 # needed to fill branch delay slots
+
+obj-$(CONFIG_CRYPTO_POLY1305_MIPS) += poly1305-mips.o
+poly1305-mips-y := poly1305-core.o poly1305-glue.o
+
+perlasm-flavour-$(CONFIG_CPU_MIPS32) := o32
+perlasm-flavour-$(CONFIG_CPU_MIPS64) := 64
+
+quiet_cmd_perlasm = PERLASM $@
+      cmd_perlasm = $(PERL) $(<) $(perlasm-flavour-y) $(@)
+
+$(obj)/poly1305-core.S: $(src)/poly1305-mips.pl FORCE
+	$(call if_changed,perlasm)
+
+targets += poly1305-core.S
diff --git a/arch/mips/crypto/poly1305-glue.c b/arch/mips/crypto/poly1305-glue.c
new file mode 100644
index 000000000000..b759b6ccc361
--- /dev/null
+++ b/arch/mips/crypto/poly1305-glue.c
@@ -0,0 +1,203 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * OpenSSL/Cryptogams accelerated Poly1305 transform for MIPS
+ *
+ * Copyright (C) 2019 Linaro Ltd. <ard.biesheuvel@linaro.org>
+ */
+
+#include <asm/unaligned.h>
+#include <crypto/algapi.h>
+#include <crypto/internal/hash.h>
+#include <crypto/internal/poly1305.h>
+#include <linux/cpufeature.h>
+#include <linux/crypto.h>
+#include <linux/module.h>
+
+asmlinkage void poly1305_init_mips(void *state, const u8 *key);
+asmlinkage void poly1305_blocks_mips(void *state, const u8 *src, u32 len, u32 hibit);
+asmlinkage void poly1305_emit_mips(void *state, __le32 *digest, const u32 *nonce);
+
+void poly1305_init_arch(struct poly1305_desc_ctx *dctx, const u8 *key)
+{
+	poly1305_init_mips(&dctx->h, key);
+	dctx->s[0] = get_unaligned_le32(key + 16);
+	dctx->s[1] = get_unaligned_le32(key + 20);
+	dctx->s[2] = get_unaligned_le32(key + 24);
+	dctx->s[3] = get_unaligned_le32(key + 28);
+	dctx->buflen = 0;
+}
+EXPORT_SYMBOL(poly1305_init_arch);
+
+static int mips_poly1305_init(struct shash_desc *desc)
+{
+	struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
+
+	dctx->buflen = 0;
+	dctx->rset = 0;
+	dctx->sset = false;
+
+	return 0;
+}
+
+static void mips_poly1305_blocks(struct poly1305_desc_ctx *dctx, const u8 *src,
+				 u32 len, u32 hibit)
+{
+	if (unlikely(!dctx->sset)) {
+		if (!dctx->rset) {
+			poly1305_init_mips(&dctx->h, src);
+			src += POLY1305_BLOCK_SIZE;
+			len -= POLY1305_BLOCK_SIZE;
+			dctx->rset = 1;
+		}
+		if (len >= POLY1305_BLOCK_SIZE) {
+			dctx->s[0] = get_unaligned_le32(src +  0);
+			dctx->s[1] = get_unaligned_le32(src +  4);
+			dctx->s[2] = get_unaligned_le32(src +  8);
+			dctx->s[3] = get_unaligned_le32(src + 12);
+			src += POLY1305_BLOCK_SIZE;
+			len -= POLY1305_BLOCK_SIZE;
+			dctx->sset = true;
+		}
+		if (len < POLY1305_BLOCK_SIZE)
+			return;
+	}
+
+	len &= ~(POLY1305_BLOCK_SIZE - 1);
+
+	poly1305_blocks_mips(&dctx->h, src, len, hibit);
+}
+
+static int mips_poly1305_update(struct shash_desc *desc, const u8 *src,
+				unsigned int len)
+{
+	struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
+
+	if (unlikely(dctx->buflen)) {
+		u32 bytes = min(len, POLY1305_BLOCK_SIZE - dctx->buflen);
+
+		memcpy(dctx->buf + dctx->buflen, src, bytes);
+		src += bytes;
+		len -= bytes;
+		dctx->buflen += bytes;
+
+		if (dctx->buflen == POLY1305_BLOCK_SIZE) {
+			mips_poly1305_blocks(dctx, dctx->buf, POLY1305_BLOCK_SIZE, 1);
+			dctx->buflen = 0;
+		}
+	}
+
+	if (likely(len >= POLY1305_BLOCK_SIZE)) {
+		mips_poly1305_blocks(dctx, src, len, 1);
+		src += round_down(len, POLY1305_BLOCK_SIZE);
+		len %= POLY1305_BLOCK_SIZE;
+	}
+
+	if (unlikely(len)) {
+		dctx->buflen = len;
+		memcpy(dctx->buf, src, len);
+	}
+	return 0;
+}
+
+void poly1305_update_arch(struct poly1305_desc_ctx *dctx, const u8 *src,
+			  unsigned int nbytes)
+{
+	if (unlikely(dctx->buflen)) {
+		u32 bytes = min(nbytes, POLY1305_BLOCK_SIZE - dctx->buflen);
+
+		memcpy(dctx->buf + dctx->buflen, src, bytes);
+		src += bytes;
+		nbytes -= bytes;
+		dctx->buflen += bytes;
+
+		if (dctx->buflen == POLY1305_BLOCK_SIZE) {
+			poly1305_blocks_mips(&dctx->h, dctx->buf,
+					     POLY1305_BLOCK_SIZE, 1);
+			dctx->buflen = 0;
+		}
+	}
+
+	if (likely(nbytes >= POLY1305_BLOCK_SIZE)) {
+		unsigned int len = round_down(nbytes, POLY1305_BLOCK_SIZE);
+
+		poly1305_blocks_mips(&dctx->h, src, len, 1);
+		src += len;
+		nbytes %= POLY1305_BLOCK_SIZE;
+	}
+
+	if (unlikely(nbytes)) {
+		dctx->buflen = nbytes;
+		memcpy(dctx->buf, src, nbytes);
+	}
+}
+EXPORT_SYMBOL(poly1305_update_arch);
+
+void poly1305_final_arch(struct poly1305_desc_ctx *dctx, u8 *dst)
+{
+	__le32 digest[4];
+	u64 f = 0;
+
+	if (unlikely(dctx->buflen)) {
+		dctx->buf[dctx->buflen++] = 1;
+		memset(dctx->buf + dctx->buflen, 0,
+		       POLY1305_BLOCK_SIZE - dctx->buflen);
+		poly1305_blocks_mips(&dctx->h, dctx->buf, POLY1305_BLOCK_SIZE, 0);
+	}
+
+	poly1305_emit_mips(&dctx->h, digest, dctx->s);
+
+	/* mac = (h + s) % (2^128) */
+	f = (f >> 32) + le32_to_cpu(digest[0]);
+	put_unaligned_le32(f, dst);
+	f = (f >> 32) + le32_to_cpu(digest[1]);
+	put_unaligned_le32(f, dst + 4);
+	f = (f >> 32) + le32_to_cpu(digest[2]);
+	put_unaligned_le32(f, dst + 8);
+	f = (f >> 32) + le32_to_cpu(digest[3]);
+	put_unaligned_le32(f, dst + 12);
+
+	*dctx = (struct poly1305_desc_ctx){};
+}
+EXPORT_SYMBOL(poly1305_final_arch);
+
+static int mips_poly1305_final(struct shash_desc *desc, u8 *dst)
+{
+	struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
+
+	if (unlikely(!dctx->sset))
+		return -ENOKEY;
+
+	poly1305_final_arch(dctx, dst);
+	return 0;
+}
+
+static struct shash_alg mips_poly1305_alg = {
+	.init			= mips_poly1305_init,
+	.update			= mips_poly1305_update,
+	.final			= mips_poly1305_final,
+	.digestsize		= POLY1305_DIGEST_SIZE,
+	.descsize		= sizeof(struct poly1305_desc_ctx),
+
+	.base.cra_name		= "poly1305",
+	.base.cra_driver_name	= "poly1305-mips",
+	.base.cra_priority	= 200,
+	.base.cra_blocksize	= POLY1305_BLOCK_SIZE,
+	.base.cra_module	= THIS_MODULE,
+};
+
+static int __init mips_poly1305_mod_init(void)
+{
+	return crypto_register_shash(&mips_poly1305_alg);
+}
+
+static void __exit mips_poly1305_mod_exit(void)
+{
+	crypto_unregister_shash(&mips_poly1305_alg);
+}
+
+module_init(mips_poly1305_mod_init);
+module_exit(mips_poly1305_mod_exit);
+
+MODULE_LICENSE("GPL v2");
+MODULE_ALIAS_CRYPTO("poly1305");
+MODULE_ALIAS_CRYPTO("poly1305-mips");
diff --git a/arch/mips/crypto/poly1305-mips.pl b/arch/mips/crypto/poly1305-mips.pl
new file mode 100644
index 000000000000..b05bab884ed2
--- /dev/null
+++ b/arch/mips/crypto/poly1305-mips.pl
@@ -0,0 +1,1273 @@
+#!/usr/bin/env perl
+# SPDX-License-Identifier: GPL-1.0+ OR BSD-3-Clause
+#
+# ====================================================================
+# Written by Andy Polyakov, @dot-asm, originally for the OpenSSL
+# project.
+# ====================================================================
+
+# Poly1305 hash for MIPS.
+#
+# May 2016
+#
+# Numbers are cycles per processed byte with poly1305_blocks alone.
+#
+#		IALU/gcc
+# R1x000	~5.5/+130%	(big-endian)
+# Octeon II	2.50/+70%	(little-endian)
+#
+# March 2019
+#
+# Add 32-bit code path.
+#
+# October 2019
+#
+# Modulo-scheduling reduction allows to omit dependency chain at the
+# end of inner loop and improve performance. Also optimize MIPS32R2
+# code path for MIPS 1004K core. Per René von Dorst's suggestions.
+#
+#		IALU/gcc
+# R1x000	~9.8/?		(big-endian)
+# Octeon II	3.65/+140%	(little-endian)
+# MT7621/1004K	4.75/?		(little-endian)
+#
+######################################################################
+# There is a number of MIPS ABI in use, O32 and N32/64 are most
+# widely used. Then there is a new contender: NUBI. It appears that if
+# one picks the latter, it's possible to arrange code in ABI neutral
+# manner. Therefore let's stick to NUBI register layout:
+#
+($zero,$at,$t0,$t1,$t2)=map("\$$_",(0..2,24,25));
+($a0,$a1,$a2,$a3,$a4,$a5,$a6,$a7)=map("\$$_",(4..11));
+($s0,$s1,$s2,$s3,$s4,$s5,$s6,$s7,$s8,$s9,$s10,$s11)=map("\$$_",(12..23));
+($gp,$tp,$sp,$fp,$ra)=map("\$$_",(3,28..31));
+#
+# The return value is placed in $a0. Following coding rules facilitate
+# interoperability:
+#
+# - never ever touch $tp, "thread pointer", former $gp [o32 can be
+#   excluded from the rule, because it's specified volatile];
+# - copy return value to $t0, former $v0 [or to $a0 if you're adapting
+#   old code];
+# - on O32 populate $a4-$a7 with 'lw $aN,4*N($sp)' if necessary;
+#
+# For reference here is register layout for N32/64 MIPS ABIs:
+#
+# ($zero,$at,$v0,$v1)=map("\$$_",(0..3));
+# ($a0,$a1,$a2,$a3,$a4,$a5,$a6,$a7)=map("\$$_",(4..11));
+# ($t0,$t1,$t2,$t3,$t8,$t9)=map("\$$_",(12..15,24,25));
+# ($s0,$s1,$s2,$s3,$s4,$s5,$s6,$s7)=map("\$$_",(16..23));
+# ($gp,$sp,$fp,$ra)=map("\$$_",(28..31));
+#
+# <appro@openssl.org>
+#
+######################################################################
+
+$flavour = shift || "64"; # supported flavours are o32,n32,64,nubi32,nubi64
+
+$v0 = ($flavour =~ /nubi/i) ? $a0 : $t0;
+
+if ($flavour =~ /64|n32/i) {{{
+######################################################################
+# 64-bit code path
+#
+
+my ($ctx,$inp,$len,$padbit) = ($a0,$a1,$a2,$a3);
+my ($in0,$in1,$tmp0,$tmp1,$tmp2,$tmp3,$tmp4) = ($a4,$a5,$a6,$a7,$at,$t0,$t1);
+
+$code.=<<___;
+#if (defined(_MIPS_ARCH_MIPS64R3) || defined(_MIPS_ARCH_MIPS64R5) || \\
+     defined(_MIPS_ARCH_MIPS64R6)) \\
+     && !defined(_MIPS_ARCH_MIPS64R2)
+# define _MIPS_ARCH_MIPS64R2
+#endif
+
+#if defined(_MIPS_ARCH_MIPS64R6)
+# define dmultu(rs,rt)
+# define mflo(rd,rs,rt)	dmulu	rd,rs,rt
+# define mfhi(rd,rs,rt)	dmuhu	rd,rs,rt
+#else
+# define dmultu(rs,rt)		dmultu	rs,rt
+# define mflo(rd,rs,rt)	mflo	rd
+# define mfhi(rd,rs,rt)	mfhi	rd
+#endif
+
+#ifdef	__KERNEL__
+# define poly1305_init   poly1305_init_mips
+# define poly1305_blocks poly1305_blocks_mips
+# define poly1305_emit   poly1305_emit_mips
+#endif
+
+#if defined(__MIPSEB__) && !defined(MIPSEB)
+# define MIPSEB
+#endif
+
+#ifdef MIPSEB
+# define MSB 0
+# define LSB 7
+#else
+# define MSB 7
+# define LSB 0
+#endif
+
+.text
+.set	noat
+.set	noreorder
+
+.align	5
+.globl	poly1305_init
+.ent	poly1305_init
+poly1305_init:
+	.frame	$sp,0,$ra
+	.set	reorder
+
+	sd	$zero,0($ctx)
+	sd	$zero,8($ctx)
+	sd	$zero,16($ctx)
+
+	beqz	$inp,.Lno_key
+
+#if defined(_MIPS_ARCH_MIPS64R6)
+	andi	$tmp0,$inp,7		# $inp % 8
+	dsubu	$inp,$inp,$tmp0		# align $inp
+	sll	$tmp0,$tmp0,3		# byte to bit offset
+	ld	$in0,0($inp)
+	ld	$in1,8($inp)
+	beqz	$tmp0,.Laligned_key
+	ld	$tmp2,16($inp)
+
+	subu	$tmp1,$zero,$tmp0
+# ifdef	MIPSEB
+	dsllv	$in0,$in0,$tmp0
+	dsrlv	$tmp3,$in1,$tmp1
+	dsllv	$in1,$in1,$tmp0
+	dsrlv	$tmp2,$tmp2,$tmp1
+# else
+	dsrlv	$in0,$in0,$tmp0
+	dsllv	$tmp3,$in1,$tmp1
+	dsrlv	$in1,$in1,$tmp0
+	dsllv	$tmp2,$tmp2,$tmp1
+# endif
+	or	$in0,$in0,$tmp3
+	or	$in1,$in1,$tmp2
+.Laligned_key:
+#else
+	ldl	$in0,0+MSB($inp)
+	ldl	$in1,8+MSB($inp)
+	ldr	$in0,0+LSB($inp)
+	ldr	$in1,8+LSB($inp)
+#endif
+#ifdef	MIPSEB
+# if defined(_MIPS_ARCH_MIPS64R2)
+	dsbh	$in0,$in0		# byte swap
+	 dsbh	$in1,$in1
+	dshd	$in0,$in0
+	 dshd	$in1,$in1
+# else
+	ori	$tmp0,$zero,0xFF
+	dsll	$tmp2,$tmp0,32
+	or	$tmp0,$tmp2		# 0x000000FF000000FF
+
+	and	$tmp1,$in0,$tmp0	# byte swap
+	 and	$tmp3,$in1,$tmp0
+	dsrl	$tmp2,$in0,24
+	 dsrl	$tmp4,$in1,24
+	dsll	$tmp1,24
+	 dsll	$tmp3,24
+	and	$tmp2,$tmp0
+	 and	$tmp4,$tmp0
+	dsll	$tmp0,8			# 0x0000FF000000FF00
+	or	$tmp1,$tmp2
+	 or	$tmp3,$tmp4
+	and	$tmp2,$in0,$tmp0
+	 and	$tmp4,$in1,$tmp0
+	dsrl	$in0,8
+	 dsrl	$in1,8
+	dsll	$tmp2,8
+	 dsll	$tmp4,8
+	and	$in0,$tmp0
+	 and	$in1,$tmp0
+	or	$tmp1,$tmp2
+	 or	$tmp3,$tmp4
+	or	$in0,$tmp1
+	 or	$in1,$tmp3
+	dsrl	$tmp1,$in0,32
+	 dsrl	$tmp3,$in1,32
+	dsll	$in0,32
+	 dsll	$in1,32
+	or	$in0,$tmp1
+	 or	$in1,$tmp3
+# endif
+#endif
+	li	$tmp0,1
+	dsll	$tmp0,32		# 0x0000000100000000
+	daddiu	$tmp0,-63		# 0x00000000ffffffc1
+	dsll	$tmp0,28		# 0x0ffffffc10000000
+	daddiu	$tmp0,-1		# 0x0ffffffc0fffffff
+
+	and	$in0,$tmp0
+	daddiu	$tmp0,-3		# 0x0ffffffc0ffffffc
+	and	$in1,$tmp0
+
+	sd	$in0,24($ctx)
+	dsrl	$tmp0,$in1,2
+	sd	$in1,32($ctx)
+	daddu	$tmp0,$in1		# s1 = r1 + (r1 >> 2)
+	sd	$tmp0,40($ctx)
+
+.Lno_key:
+	li	$v0,0			# return 0
+	jr	$ra
+.end	poly1305_init
+___
+{
+my $SAVED_REGS_MASK = ($flavour =~ /nubi/i) ? "0x0003f000" : "0x00030000";
+
+my ($h0,$h1,$h2,$r0,$r1,$rs1,$d0,$d1,$d2) =
+   ($s0,$s1,$s2,$s3,$s4,$s5,$in0,$in1,$t2);
+my ($shr,$shl) = ($s6,$s7);		# used on R6
+
+$code.=<<___;
+.align	5
+.globl	poly1305_blocks
+.ent	poly1305_blocks
+poly1305_blocks:
+	.set	noreorder
+	dsrl	$len,4			# number of complete blocks
+	bnez	$len,poly1305_blocks_internal
+	nop
+	jr	$ra
+	nop
+.end	poly1305_blocks
+
+.align	5
+.ent	poly1305_blocks_internal
+poly1305_blocks_internal:
+	.set	noreorder
+#if defined(_MIPS_ARCH_MIPS64R6)
+	.frame	$sp,8*8,$ra
+	.mask	$SAVED_REGS_MASK|0x000c0000,-8
+	dsubu	$sp,8*8
+	sd	$s7,56($sp)
+	sd	$s6,48($sp)
+#else
+	.frame	$sp,6*8,$ra
+	.mask	$SAVED_REGS_MASK,-8
+	dsubu	$sp,6*8
+#endif
+	sd	$s5,40($sp)
+	sd	$s4,32($sp)
+___
+$code.=<<___ if ($flavour =~ /nubi/i);	# optimize non-nubi prologue
+	sd	$s3,24($sp)
+	sd	$s2,16($sp)
+	sd	$s1,8($sp)
+	sd	$s0,0($sp)
+___
+$code.=<<___;
+	.set	reorder
+
+#if defined(_MIPS_ARCH_MIPS64R6)
+	andi	$shr,$inp,7
+	dsubu	$inp,$inp,$shr		# align $inp
+	sll	$shr,$shr,3		# byte to bit offset
+	subu	$shl,$zero,$shr
+#endif
+
+	ld	$h0,0($ctx)		# load hash value
+	ld	$h1,8($ctx)
+	ld	$h2,16($ctx)
+
+	ld	$r0,24($ctx)		# load key
+	ld	$r1,32($ctx)
+	ld	$rs1,40($ctx)
+
+	dsll	$len,4
+	daddu	$len,$inp		# end of buffer
+	b	.Loop
+
+.align	4
+.Loop:
+#if defined(_MIPS_ARCH_MIPS64R6)
+	ld	$in0,0($inp)		# load input
+	ld	$in1,8($inp)
+	beqz	$shr,.Laligned_inp
+
+	ld	$tmp2,16($inp)
+# ifdef	MIPSEB
+	dsllv	$in0,$in0,$shr
+	dsrlv	$tmp3,$in1,$shl
+	dsllv	$in1,$in1,$shr
+	dsrlv	$tmp2,$tmp2,$shl
+# else
+	dsrlv	$in0,$in0,$shr
+	dsllv	$tmp3,$in1,$shl
+	dsrlv	$in1,$in1,$shr
+	dsllv	$tmp2,$tmp2,$shl
+# endif
+	or	$in0,$in0,$tmp3
+	or	$in1,$in1,$tmp2
+.Laligned_inp:
+#else
+	ldl	$in0,0+MSB($inp)	# load input
+	ldl	$in1,8+MSB($inp)
+	ldr	$in0,0+LSB($inp)
+	ldr	$in1,8+LSB($inp)
+#endif
+	daddiu	$inp,16
+#ifdef	MIPSEB
+# if defined(_MIPS_ARCH_MIPS64R2)
+	dsbh	$in0,$in0		# byte swap
+	 dsbh	$in1,$in1
+	dshd	$in0,$in0
+	 dshd	$in1,$in1
+# else
+	ori	$tmp0,$zero,0xFF
+	dsll	$tmp2,$tmp0,32
+	or	$tmp0,$tmp2		# 0x000000FF000000FF
+
+	and	$tmp1,$in0,$tmp0	# byte swap
+	 and	$tmp3,$in1,$tmp0
+	dsrl	$tmp2,$in0,24
+	 dsrl	$tmp4,$in1,24
+	dsll	$tmp1,24
+	 dsll	$tmp3,24
+	and	$tmp2,$tmp0
+	 and	$tmp4,$tmp0
+	dsll	$tmp0,8			# 0x0000FF000000FF00
+	or	$tmp1,$tmp2
+	 or	$tmp3,$tmp4
+	and	$tmp2,$in0,$tmp0
+	 and	$tmp4,$in1,$tmp0
+	dsrl	$in0,8
+	 dsrl	$in1,8
+	dsll	$tmp2,8
+	 dsll	$tmp4,8
+	and	$in0,$tmp0
+	 and	$in1,$tmp0
+	or	$tmp1,$tmp2
+	 or	$tmp3,$tmp4
+	or	$in0,$tmp1
+	 or	$in1,$tmp3
+	dsrl	$tmp1,$in0,32
+	 dsrl	$tmp3,$in1,32
+	dsll	$in0,32
+	 dsll	$in1,32
+	or	$in0,$tmp1
+	 or	$in1,$tmp3
+# endif
+#endif
+	dsrl	$tmp1,$h2,2		# modulo-scheduled reduction
+	andi	$h2,$h2,3
+	dsll	$tmp0,$tmp1,2
+
+	daddu	$d0,$h0,$in0		# accumulate input
+	 daddu	$tmp1,$tmp0
+	sltu	$tmp0,$d0,$h0
+	daddu	$d0,$d0,$tmp1		# ... and residue
+	sltu	$tmp1,$d0,$tmp1
+	daddu	$d1,$h1,$in1
+	daddu	$tmp0,$tmp1
+	sltu	$tmp1,$d1,$h1
+	daddu	$d1,$tmp0
+
+	dmultu	($r0,$d0)		# h0*r0
+	 daddu	$d2,$h2,$padbit
+	 sltu	$tmp0,$d1,$tmp0
+	mflo	($h0,$r0,$d0)
+	mfhi	($h1,$r0,$d0)
+
+	dmultu	($rs1,$d1)		# h1*5*r1
+	 daddu	$d2,$tmp1
+	 daddu	$d2,$tmp0
+	mflo	($tmp0,$rs1,$d1)
+	mfhi	($tmp1,$rs1,$d1)
+
+	dmultu	($r1,$d0)		# h0*r1
+	mflo	($tmp2,$r1,$d0)
+	mfhi	($h2,$r1,$d0)
+	 daddu	$h0,$tmp0
+	 daddu	$h1,$tmp1
+	 sltu	$tmp0,$h0,$tmp0
+
+	dmultu	($r0,$d1)		# h1*r0
+	 daddu	$h1,$tmp0
+	 daddu	$h1,$tmp2
+	mflo	($tmp0,$r0,$d1)
+	mfhi	($tmp1,$r0,$d1)
+
+	dmultu	($rs1,$d2)		# h2*5*r1
+	 sltu	$tmp2,$h1,$tmp2
+	 daddu	$h2,$tmp2
+	mflo	($tmp2,$rs1,$d2)
+
+	dmultu	($r0,$d2)		# h2*r0
+	 daddu	$h1,$tmp0
+	 daddu	$h2,$tmp1
+	mflo	($tmp3,$r0,$d2)
+	 sltu	$tmp0,$h1,$tmp0
+	 daddu	$h2,$tmp0
+
+	daddu	$h1,$tmp2
+	sltu	$tmp2,$h1,$tmp2
+	daddu	$h2,$tmp2
+	daddu	$h2,$tmp3
+
+	bne	$inp,$len,.Loop
+
+	sd	$h0,0($ctx)		# store hash value
+	sd	$h1,8($ctx)
+	sd	$h2,16($ctx)
+
+	.set	noreorder
+#if defined(_MIPS_ARCH_MIPS64R6)
+	ld	$s7,56($sp)
+	ld	$s6,48($sp)
+#endif
+	ld	$s5,40($sp)		# epilogue
+	ld	$s4,32($sp)
+___
+$code.=<<___ if ($flavour =~ /nubi/i);	# optimize non-nubi epilogue
+	ld	$s3,24($sp)
+	ld	$s2,16($sp)
+	ld	$s1,8($sp)
+	ld	$s0,0($sp)
+___
+$code.=<<___;
+	jr	$ra
+#if defined(_MIPS_ARCH_MIPS64R6)
+	daddu	$sp,8*8
+#else
+	daddu	$sp,6*8
+#endif
+.end	poly1305_blocks_internal
+___
+}
+{
+my ($ctx,$mac,$nonce) = ($a0,$a1,$a2);
+
+$code.=<<___;
+.align	5
+.globl	poly1305_emit
+.ent	poly1305_emit
+poly1305_emit:
+	.frame	$sp,0,$ra
+	.set	reorder
+
+	ld	$tmp2,16($ctx)
+	ld	$tmp0,0($ctx)
+	ld	$tmp1,8($ctx)
+
+	li	$in0,-4			# final reduction
+	dsrl	$in1,$tmp2,2
+	and	$in0,$tmp2
+	andi	$tmp2,$tmp2,3
+	daddu	$in0,$in1
+
+	daddu	$tmp0,$tmp0,$in0
+	sltu	$in1,$tmp0,$in0
+	 daddiu	$in0,$tmp0,5		# compare to modulus
+	daddu	$tmp1,$tmp1,$in1
+	 sltiu	$tmp3,$in0,5
+	sltu	$tmp4,$tmp1,$in1
+	 daddu	$in1,$tmp1,$tmp3
+	daddu	$tmp2,$tmp2,$tmp4
+	 sltu	$tmp3,$in1,$tmp3
+	 daddu	$tmp2,$tmp2,$tmp3
+
+	dsrl	$tmp2,2			# see if it carried/borrowed
+	dsubu	$tmp2,$zero,$tmp2
+
+	xor	$in0,$tmp0
+	xor	$in1,$tmp1
+	and	$in0,$tmp2
+	and	$in1,$tmp2
+	xor	$in0,$tmp0
+	xor	$in1,$tmp1
+
+	lwu	$tmp0,0($nonce)		# load nonce
+	lwu	$tmp1,4($nonce)
+	lwu	$tmp2,8($nonce)
+	lwu	$tmp3,12($nonce)
+	dsll	$tmp1,32
+	dsll	$tmp3,32
+	or	$tmp0,$tmp1
+	or	$tmp2,$tmp3
+
+	daddu	$in0,$tmp0		# accumulate nonce
+	daddu	$in1,$tmp2
+	sltu	$tmp0,$in0,$tmp0
+	daddu	$in1,$tmp0
+
+	dsrl	$tmp0,$in0,8		# write mac value
+	dsrl	$tmp1,$in0,16
+	dsrl	$tmp2,$in0,24
+	sb	$in0,0($mac)
+	dsrl	$tmp3,$in0,32
+	sb	$tmp0,1($mac)
+	dsrl	$tmp0,$in0,40
+	sb	$tmp1,2($mac)
+	dsrl	$tmp1,$in0,48
+	sb	$tmp2,3($mac)
+	dsrl	$tmp2,$in0,56
+	sb	$tmp3,4($mac)
+	dsrl	$tmp3,$in1,8
+	sb	$tmp0,5($mac)
+	dsrl	$tmp0,$in1,16
+	sb	$tmp1,6($mac)
+	dsrl	$tmp1,$in1,24
+	sb	$tmp2,7($mac)
+
+	sb	$in1,8($mac)
+	dsrl	$tmp2,$in1,32
+	sb	$tmp3,9($mac)
+	dsrl	$tmp3,$in1,40
+	sb	$tmp0,10($mac)
+	dsrl	$tmp0,$in1,48
+	sb	$tmp1,11($mac)
+	dsrl	$tmp1,$in1,56
+	sb	$tmp2,12($mac)
+	sb	$tmp3,13($mac)
+	sb	$tmp0,14($mac)
+	sb	$tmp1,15($mac)
+
+	jr	$ra
+.end	poly1305_emit
+.rdata
+.asciiz	"Poly1305 for MIPS64, CRYPTOGAMS by \@dot-asm"
+.align	2
+___
+}
+}}} else {{{
+######################################################################
+# 32-bit code path
+#
+
+my ($ctx,$inp,$len,$padbit) = ($a0,$a1,$a2,$a3);
+my ($in0,$in1,$in2,$in3,$tmp0,$tmp1,$tmp2,$tmp3) =
+   ($a4,$a5,$a6,$a7,$at,$t0,$t1,$t2);
+
+$code.=<<___;
+#if (defined(_MIPS_ARCH_MIPS32R3) || defined(_MIPS_ARCH_MIPS32R5) || \\
+     defined(_MIPS_ARCH_MIPS32R6)) \\
+     && !defined(_MIPS_ARCH_MIPS32R2)
+# define _MIPS_ARCH_MIPS32R2
+#endif
+
+#if defined(_MIPS_ARCH_MIPS32R6)
+# define multu(rs,rt)
+# define mflo(rd,rs,rt)	mulu	rd,rs,rt
+# define mfhi(rd,rs,rt)	muhu	rd,rs,rt
+#else
+# define multu(rs,rt)	multu	rs,rt
+# define mflo(rd,rs,rt)	mflo	rd
+# define mfhi(rd,rs,rt)	mfhi	rd
+#endif
+
+#ifdef	__KERNEL__
+# define poly1305_init   poly1305_init_mips
+# define poly1305_blocks poly1305_blocks_mips
+# define poly1305_emit   poly1305_emit_mips
+#endif
+
+#if defined(__MIPSEB__) && !defined(MIPSEB)
+# define MIPSEB
+#endif
+
+#ifdef MIPSEB
+# define MSB 0
+# define LSB 3
+#else
+# define MSB 3
+# define LSB 0
+#endif
+
+.text
+.set	noat
+.set	noreorder
+
+.align	5
+.globl	poly1305_init
+.ent	poly1305_init
+poly1305_init:
+	.frame	$sp,0,$ra
+	.set	reorder
+
+	sw	$zero,0($ctx)
+	sw	$zero,4($ctx)
+	sw	$zero,8($ctx)
+	sw	$zero,12($ctx)
+	sw	$zero,16($ctx)
+
+	beqz	$inp,.Lno_key
+
+#if defined(_MIPS_ARCH_MIPS32R6)
+	andi	$tmp0,$inp,3		# $inp % 4
+	subu	$inp,$inp,$tmp0		# align $inp
+	sll	$tmp0,$tmp0,3		# byte to bit offset
+	lw	$in0,0($inp)
+	lw	$in1,4($inp)
+	lw	$in2,8($inp)
+	lw	$in3,12($inp)
+	beqz	$tmp0,.Laligned_key
+
+	lw	$tmp2,16($inp)
+	subu	$tmp1,$zero,$tmp0
+# ifdef	MIPSEB
+	sllv	$in0,$in0,$tmp0
+	srlv	$tmp3,$in1,$tmp1
+	sllv	$in1,$in1,$tmp0
+	or	$in0,$in0,$tmp3
+	srlv	$tmp3,$in2,$tmp1
+	sllv	$in2,$in2,$tmp0
+	or	$in1,$in1,$tmp3
+	srlv	$tmp3,$in3,$tmp1
+	sllv	$in3,$in3,$tmp0
+	or	$in2,$in2,$tmp3
+	srlv	$tmp2,$tmp2,$tmp1
+	or	$in3,$in3,$tmp2
+# else
+	srlv	$in0,$in0,$tmp0
+	sllv	$tmp3,$in1,$tmp1
+	srlv	$in1,$in1,$tmp0
+	or	$in0,$in0,$tmp3
+	sllv	$tmp3,$in2,$tmp1
+	srlv	$in2,$in2,$tmp0
+	or	$in1,$in1,$tmp3
+	sllv	$tmp3,$in3,$tmp1
+	srlv	$in3,$in3,$tmp0
+	or	$in2,$in2,$tmp3
+	sllv	$tmp2,$tmp2,$tmp1
+	or	$in3,$in3,$tmp2
+# endif
+.Laligned_key:
+#else
+	lwl	$in0,0+MSB($inp)
+	lwl	$in1,4+MSB($inp)
+	lwl	$in2,8+MSB($inp)
+	lwl	$in3,12+MSB($inp)
+	lwr	$in0,0+LSB($inp)
+	lwr	$in1,4+LSB($inp)
+	lwr	$in2,8+LSB($inp)
+	lwr	$in3,12+LSB($inp)
+#endif
+#ifdef	MIPSEB
+# if defined(_MIPS_ARCH_MIPS32R2)
+	wsbh	$in0,$in0		# byte swap
+	wsbh	$in1,$in1
+	wsbh	$in2,$in2
+	wsbh	$in3,$in3
+	rotr	$in0,$in0,16
+	rotr	$in1,$in1,16
+	rotr	$in2,$in2,16
+	rotr	$in3,$in3,16
+# else
+	srl	$tmp0,$in0,24		# byte swap
+	srl	$tmp1,$in0,8
+	andi	$tmp2,$in0,0xFF00
+	sll	$in0,$in0,24
+	andi	$tmp1,0xFF00
+	sll	$tmp2,$tmp2,8
+	or	$in0,$tmp0
+	 srl	$tmp0,$in1,24
+	or	$tmp1,$tmp2
+	 srl	$tmp2,$in1,8
+	or	$in0,$tmp1
+	 andi	$tmp1,$in1,0xFF00
+	 sll	$in1,$in1,24
+	 andi	$tmp2,0xFF00
+	 sll	$tmp1,$tmp1,8
+	 or	$in1,$tmp0
+	srl	$tmp0,$in2,24
+	 or	$tmp2,$tmp1
+	srl	$tmp1,$in2,8
+	 or	$in1,$tmp2
+	andi	$tmp2,$in2,0xFF00
+	sll	$in2,$in2,24
+	andi	$tmp1,0xFF00
+	sll	$tmp2,$tmp2,8
+	or	$in2,$tmp0
+	 srl	$tmp0,$in3,24
+	or	$tmp1,$tmp2
+	 srl	$tmp2,$in3,8
+	or	$in2,$tmp1
+	 andi	$tmp1,$in3,0xFF00
+	 sll	$in3,$in3,24
+	 andi	$tmp2,0xFF00
+	 sll	$tmp1,$tmp1,8
+	 or	$in3,$tmp0
+	 or	$tmp2,$tmp1
+	 or	$in3,$tmp2
+# endif
+#endif
+	lui	$tmp0,0x0fff
+	ori	$tmp0,0xffff		# 0x0fffffff
+	and	$in0,$in0,$tmp0
+	subu	$tmp0,3			# 0x0ffffffc
+	and	$in1,$in1,$tmp0
+	and	$in2,$in2,$tmp0
+	and	$in3,$in3,$tmp0
+
+	sw	$in0,20($ctx)
+	sw	$in1,24($ctx)
+	sw	$in2,28($ctx)
+	sw	$in3,32($ctx)
+
+	srl	$tmp1,$in1,2
+	srl	$tmp2,$in2,2
+	srl	$tmp3,$in3,2
+	addu	$in1,$in1,$tmp1		# s1 = r1 + (r1 >> 2)
+	addu	$in2,$in2,$tmp2
+	addu	$in3,$in3,$tmp3
+	sw	$in1,36($ctx)
+	sw	$in2,40($ctx)
+	sw	$in3,44($ctx)
+.Lno_key:
+	li	$v0,0
+	jr	$ra
+.end	poly1305_init
+___
+{
+my $SAVED_REGS_MASK = ($flavour =~ /nubi/i) ? "0x00fff000" : "0x00ff0000";
+
+my ($h0,$h1,$h2,$h3,$h4, $r0,$r1,$r2,$r3, $rs1,$rs2,$rs3) =
+   ($s0,$s1,$s2,$s3,$s4, $s5,$s6,$s7,$s8, $s9,$s10,$s11);
+my ($d0,$d1,$d2,$d3) =
+   ($a4,$a5,$a6,$a7);
+my $shr = $t2;		# used on R6
+my $one = $t2;		# used on R2
+
+$code.=<<___;
+.globl	poly1305_blocks
+.align	5
+.ent	poly1305_blocks
+poly1305_blocks:
+	.frame	$sp,16*4,$ra
+	.mask	$SAVED_REGS_MASK,-4
+	.set	noreorder
+	subu	$sp, $sp,4*12
+	sw	$s11,4*11($sp)
+	sw	$s10,4*10($sp)
+	sw	$s9, 4*9($sp)
+	sw	$s8, 4*8($sp)
+	sw	$s7, 4*7($sp)
+	sw	$s6, 4*6($sp)
+	sw	$s5, 4*5($sp)
+	sw	$s4, 4*4($sp)
+___
+$code.=<<___ if ($flavour =~ /nubi/i);	# optimize non-nubi prologue
+	sw	$s3, 4*3($sp)
+	sw	$s2, 4*2($sp)
+	sw	$s1, 4*1($sp)
+	sw	$s0, 4*0($sp)
+___
+$code.=<<___;
+	.set	reorder
+
+	srl	$len,4			# number of complete blocks
+	li	$one,1
+	beqz	$len,.Labort
+
+#if defined(_MIPS_ARCH_MIPS32R6)
+	andi	$shr,$inp,3
+	subu	$inp,$inp,$shr		# align $inp
+	sll	$shr,$shr,3		# byte to bit offset
+#endif
+
+	lw	$h0,0($ctx)		# load hash value
+	lw	$h1,4($ctx)
+	lw	$h2,8($ctx)
+	lw	$h3,12($ctx)
+	lw	$h4,16($ctx)
+
+	lw	$r0,20($ctx)		# load key
+	lw	$r1,24($ctx)
+	lw	$r2,28($ctx)
+	lw	$r3,32($ctx)
+	lw	$rs1,36($ctx)
+	lw	$rs2,40($ctx)
+	lw	$rs3,44($ctx)
+
+	sll	$len,4
+	addu	$len,$len,$inp		# end of buffer
+	b	.Loop
+
+.align	4
+.Loop:
+#if defined(_MIPS_ARCH_MIPS32R6)
+	lw	$d0,0($inp)		# load input
+	lw	$d1,4($inp)
+	lw	$d2,8($inp)
+	lw	$d3,12($inp)
+	beqz	$shr,.Laligned_inp
+
+	lw	$t0,16($inp)
+	subu	$t1,$zero,$shr
+# ifdef	MIPSEB
+	sllv	$d0,$d0,$shr
+	srlv	$at,$d1,$t1
+	sllv	$d1,$d1,$shr
+	or	$d0,$d0,$at
+	srlv	$at,$d2,$t1
+	sllv	$d2,$d2,$shr
+	or	$d1,$d1,$at
+	srlv	$at,$d3,$t1
+	sllv	$d3,$d3,$shr
+	or	$d2,$d2,$at
+	srlv	$t0,$t0,$t1
+	or	$d3,$d3,$t0
+# else
+	srlv	$d0,$d0,$shr
+	sllv	$at,$d1,$t1
+	srlv	$d1,$d1,$shr
+	or	$d0,$d0,$at
+	sllv	$at,$d2,$t1
+	srlv	$d2,$d2,$shr
+	or	$d1,$d1,$at
+	sllv	$at,$d3,$t1
+	srlv	$d3,$d3,$shr
+	or	$d2,$d2,$at
+	sllv	$t0,$t0,$t1
+	or	$d3,$d3,$t0
+# endif
+.Laligned_inp:
+#else
+	lwl	$d0,0+MSB($inp)		# load input
+	lwl	$d1,4+MSB($inp)
+	lwl	$d2,8+MSB($inp)
+	lwl	$d3,12+MSB($inp)
+	lwr	$d0,0+LSB($inp)
+	lwr	$d1,4+LSB($inp)
+	lwr	$d2,8+LSB($inp)
+	lwr	$d3,12+LSB($inp)
+#endif
+#ifdef	MIPSEB
+# if defined(_MIPS_ARCH_MIPS32R2)
+	wsbh	$d0,$d0			# byte swap
+	wsbh	$d1,$d1
+	wsbh	$d2,$d2
+	wsbh	$d3,$d3
+	rotr	$d0,$d0,16
+	rotr	$d1,$d1,16
+	rotr	$d2,$d2,16
+	rotr	$d3,$d3,16
+# else
+	srl	$at,$d0,24		# byte swap
+	srl	$t0,$d0,8
+	andi	$t1,$d0,0xFF00
+	sll	$d0,$d0,24
+	andi	$t0,0xFF00
+	sll	$t1,$t1,8
+	or	$d0,$at
+	 srl	$at,$d1,24
+	or	$t0,$t1
+	 srl	$t1,$d1,8
+	or	$d0,$t0
+	 andi	$t0,$d1,0xFF00
+	 sll	$d1,$d1,24
+	 andi	$t1,0xFF00
+	 sll	$t0,$t0,8
+	 or	$d1,$at
+	srl	$at,$d2,24
+	 or	$t1,$t0
+	srl	$t0,$d2,8
+	 or	$d1,$t1
+	andi	$t1,$d2,0xFF00
+	sll	$d2,$d2,24
+	andi	$t0,0xFF00
+	sll	$t1,$t1,8
+	or	$d2,$at
+	 srl	$at,$d3,24
+	or	$t0,$t1
+	 srl	$t1,$d3,8
+	or	$d2,$t0
+	 andi	$t0,$d3,0xFF00
+	 sll	$d3,$d3,24
+	 andi	$t1,0xFF00
+	 sll	$t0,$t0,8
+	 or	$d3,$at
+	 or	$t1,$t0
+	 or	$d3,$t1
+# endif
+#endif
+	srl	$t0,$h4,2		# modulo-scheduled reduction
+	andi	$h4,$h4,3
+	sll	$at,$t0,2
+
+	addu	$d0,$d0,$h0		# accumulate input
+	 addu	$t0,$t0,$at
+	sltu	$h0,$d0,$h0
+	addu	$d0,$d0,$t0		# ... and residue
+	sltu	$at,$d0,$t0
+
+	addu	$d1,$d1,$h1
+	 addu	$h0,$h0,$at		# carry
+	sltu	$h1,$d1,$h1
+	addu	$d1,$d1,$h0
+	sltu	$h0,$d1,$h0
+
+	addu	$d2,$d2,$h2
+	 addu	$h1,$h1,$h0		# carry
+	sltu	$h2,$d2,$h2
+	addu	$d2,$d2,$h1
+	sltu	$h1,$d2,$h1
+
+	addu	$d3,$d3,$h3
+	 addu	$h2,$h2,$h1		# carry
+	sltu	$h3,$d3,$h3
+	addu	$d3,$d3,$h2
+
+#if defined(_MIPS_ARCH_MIPS32R2) && !defined(_MIPS_ARCH_MIPS32R6)
+	multu	$r0,$d0			# d0*r0
+	 sltu	$h2,$d3,$h2
+	maddu	$rs3,$d1		# d1*s3
+	 addu	$h3,$h3,$h2		# carry
+	maddu	$rs2,$d2		# d2*s2
+	 addu	$h4,$h4,$padbit
+	maddu	$rs1,$d3		# d3*s1
+	 addu	$h4,$h4,$h3
+	mfhi	$at
+	mflo	$h0
+
+	multu	$r1,$d0			# d0*r1
+	maddu	$r0,$d1			# d1*r0
+	maddu	$rs3,$d2		# d2*s3
+	maddu	$rs2,$d3		# d3*s2
+	maddu	$rs1,$h4		# h4*s1
+	maddu	$at,$one		# hi*1
+	mfhi	$at
+	mflo	$h1
+
+	multu	$r2,$d0			# d0*r2
+	maddu	$r1,$d1			# d1*r1
+	maddu	$r0,$d2			# d2*r0
+	maddu	$rs3,$d3		# d3*s3
+	maddu	$rs2,$h4		# h4*s2
+	maddu	$at,$one		# hi*1
+	mfhi	$at
+	mflo	$h2
+
+	mul	$t0,$r0,$h4		# h4*r0
+
+	multu	$r3,$d0			# d0*r3
+	maddu	$r2,$d1			# d1*r2
+	maddu	$r1,$d2			# d2*r1
+	maddu	$r0,$d3			# d3*r0
+	maddu	$rs3,$h4		# h4*s3
+	maddu	$at,$one		# hi*1
+	mfhi	$at
+	mflo	$h3
+
+	 addiu	$inp,$inp,16
+
+	addu	$h4,$t0,$at
+#else
+	multu	($r0,$d0)		# d0*r0
+	mflo	($h0,$r0,$d0)
+	mfhi	($h1,$r0,$d0)
+
+	 sltu	$h2,$d3,$h2
+	 addu	$h3,$h3,$h2		# carry
+
+	multu	($rs3,$d1)		# d1*s3
+	mflo	($at,$rs3,$d1)
+	mfhi	($t0,$rs3,$d1)
+
+	 addu	$h4,$h4,$padbit
+	 addiu	$inp,$inp,16
+	 addu	$h4,$h4,$h3
+
+	multu	($rs2,$d2)		# d2*s2
+	mflo	($a3,$rs2,$d2)
+	mfhi	($t1,$rs2,$d2)
+	 addu	$h0,$h0,$at
+	 addu	$h1,$h1,$t0
+	multu	($rs1,$d3)		# d3*s1
+	 sltu	$at,$h0,$at
+	 addu	$h1,$h1,$at
+
+	mflo	($at,$rs1,$d3)
+	mfhi	($t0,$rs1,$d3)
+	 addu	$h0,$h0,$a3
+	 addu	$h1,$h1,$t1
+	multu	($r1,$d0)		# d0*r1
+	 sltu	$a3,$h0,$a3
+	 addu	$h1,$h1,$a3
+
+
+	mflo	($a3,$r1,$d0)
+	mfhi	($h2,$r1,$d0)
+	 addu	$h0,$h0,$at
+	 addu	$h1,$h1,$t0
+	multu	($r0,$d1)		# d1*r0
+	 sltu	$at,$h0,$at
+	 addu	$h1,$h1,$at
+
+	mflo	($at,$r0,$d1)
+	mfhi	($t0,$r0,$d1)
+	 addu	$h1,$h1,$a3
+	 sltu	$a3,$h1,$a3
+	multu	($rs3,$d2)		# d2*s3
+	 addu	$h2,$h2,$a3
+
+	mflo	($a3,$rs3,$d2)
+	mfhi	($t1,$rs3,$d2)
+	 addu	$h1,$h1,$at
+	 addu	$h2,$h2,$t0
+	multu	($rs2,$d3)		# d3*s2
+	 sltu	$at,$h1,$at
+	 addu	$h2,$h2,$at
+
+	mflo	($at,$rs2,$d3)
+	mfhi	($t0,$rs2,$d3)
+	 addu	$h1,$h1,$a3
+	 addu	$h2,$h2,$t1
+	multu	($rs1,$h4)		# h4*s1
+	 sltu	$a3,$h1,$a3
+	 addu	$h2,$h2,$a3
+
+	mflo	($a3,$rs1,$h4)
+	 addu	$h1,$h1,$at
+	 addu	$h2,$h2,$t0
+	multu	($r2,$d0)		# d0*r2
+	 sltu	$at,$h1,$at
+	 addu	$h2,$h2,$at
+
+
+	mflo	($at,$r2,$d0)
+	mfhi	($h3,$r2,$d0)
+	 addu	$h1,$h1,$a3
+	 sltu	$a3,$h1,$a3
+	multu	($r1,$d1)		# d1*r1
+	 addu	$h2,$h2,$a3
+
+	mflo	($a3,$r1,$d1)
+	mfhi	($t1,$r1,$d1)
+	 addu	$h2,$h2,$at
+	 sltu	$at,$h2,$at
+	multu	($r0,$d2)		# d2*r0
+	 addu	$h3,$h3,$at
+
+	mflo	($at,$r0,$d2)
+	mfhi	($t0,$r0,$d2)
+	 addu	$h2,$h2,$a3
+	 addu	$h3,$h3,$t1
+	multu	($rs3,$d3)		# d3*s3
+	 sltu	$a3,$h2,$a3
+	 addu	$h3,$h3,$a3
+
+	mflo	($a3,$rs3,$d3)
+	mfhi	($t1,$rs3,$d3)
+	 addu	$h2,$h2,$at
+	 addu	$h3,$h3,$t0
+	multu	($rs2,$h4)		# h4*s2
+	 sltu	$at,$h2,$at
+	 addu	$h3,$h3,$at
+
+	mflo	($at,$rs2,$h4)
+	 addu	$h2,$h2,$a3
+	 addu	$h3,$h3,$t1
+	multu	($r3,$d0)		# d0*r3
+	 sltu	$a3,$h2,$a3
+	 addu	$h3,$h3,$a3
+
+
+	mflo	($a3,$r3,$d0)
+	mfhi	($t1,$r3,$d0)
+	 addu	$h2,$h2,$at
+	 sltu	$at,$h2,$at
+	multu	($r2,$d1)		# d1*r2
+	 addu	$h3,$h3,$at
+
+	mflo	($at,$r2,$d1)
+	mfhi	($t0,$r2,$d1)
+	 addu	$h3,$h3,$a3
+	 sltu	$a3,$h3,$a3
+	multu	($r0,$d3)		# d3*r0
+	 addu	$t1,$t1,$a3
+
+	mflo	($a3,$r0,$d3)
+	mfhi	($d3,$r0,$d3)
+	 addu	$h3,$h3,$at
+	 addu	$t1,$t1,$t0
+	multu	($r1,$d2)		# d2*r1
+	 sltu	$at,$h3,$at
+	 addu	$t1,$t1,$at
+
+	mflo	($at,$r1,$d2)
+	mfhi	($t0,$r1,$d2)
+	 addu	$h3,$h3,$a3
+	 addu	$t1,$t1,$d3
+	multu	($rs3,$h4)		# h4*s3
+	 sltu	$a3,$h3,$a3
+	 addu	$t1,$t1,$a3
+
+	mflo	($a3,$rs3,$h4)
+	 addu	$h3,$h3,$at
+	 addu	$t1,$t1,$t0
+	multu	($r0,$h4)		# h4*r0
+	 sltu	$at,$h3,$at
+	 addu	$t1,$t1,$at
+
+
+	mflo	($h4,$r0,$h4)
+	 addu	$h3,$h3,$a3
+	 sltu	$a3,$h3,$a3
+	 addu	$t1,$t1,$a3
+	addu	$h4,$h4,$t1
+
+	li	$padbit,1		# if we loop, padbit is 1
+#endif
+	bne	$inp,$len,.Loop
+
+	sw	$h0,0($ctx)		# store hash value
+	sw	$h1,4($ctx)
+	sw	$h2,8($ctx)
+	sw	$h3,12($ctx)
+	sw	$h4,16($ctx)
+
+	.set	noreorder
+.Labort:
+	lw	$s11,4*11($sp)
+	lw	$s10,4*10($sp)
+	lw	$s9, 4*9($sp)
+	lw	$s8, 4*8($sp)
+	lw	$s7, 4*7($sp)
+	lw	$s6, 4*6($sp)
+	lw	$s5, 4*5($sp)
+	lw	$s4, 4*4($sp)
+___
+$code.=<<___ if ($flavour =~ /nubi/i);	# optimize non-nubi prologue
+	lw	$s3, 4*3($sp)
+	lw	$s2, 4*2($sp)
+	lw	$s1, 4*1($sp)
+	lw	$s0, 4*0($sp)
+___
+$code.=<<___;
+	jr	$ra
+	addu	$sp,$sp,4*12
+.end	poly1305_blocks
+___
+}
+{
+my ($ctx,$mac,$nonce,$tmp4) = ($a0,$a1,$a2,$a3);
+
+$code.=<<___;
+.align	5
+.globl	poly1305_emit
+.ent	poly1305_emit
+poly1305_emit:
+	.frame	$sp,0,$ra
+	.set	reorder
+
+	lw	$tmp4,16($ctx)
+	lw	$tmp0,0($ctx)
+	lw	$tmp1,4($ctx)
+	lw	$tmp2,8($ctx)
+	lw	$tmp3,12($ctx)
+
+	li	$in0,-4			# final reduction
+	srl	$ctx,$tmp4,2
+	and	$in0,$in0,$tmp4
+	andi	$tmp4,$tmp4,3
+	addu	$ctx,$ctx,$in0
+
+	addu	$tmp0,$tmp0,$ctx
+	sltu	$ctx,$tmp0,$ctx
+	 addiu	$in0,$tmp0,5		# compare to modulus
+	addu	$tmp1,$tmp1,$ctx
+	 sltiu	$in1,$in0,5
+	sltu	$ctx,$tmp1,$ctx
+	 addu	$in1,$in1,$tmp1
+	addu	$tmp2,$tmp2,$ctx
+	 sltu	$in2,$in1,$tmp1
+	sltu	$ctx,$tmp2,$ctx
+	 addu	$in2,$in2,$tmp2
+	addu	$tmp3,$tmp3,$ctx
+	 sltu	$in3,$in2,$tmp2
+	sltu	$ctx,$tmp3,$ctx
+	 addu	$in3,$in3,$tmp3
+	addu	$tmp4,$tmp4,$ctx
+	 sltu	$ctx,$in3,$tmp3
+	 addu	$ctx,$tmp4
+
+	srl	$ctx,2			# see if it carried/borrowed
+	subu	$ctx,$zero,$ctx
+
+	xor	$in0,$tmp0
+	xor	$in1,$tmp1
+	xor	$in2,$tmp2
+	xor	$in3,$tmp3
+	and	$in0,$ctx
+	and	$in1,$ctx
+	and	$in2,$ctx
+	and	$in3,$ctx
+	xor	$in0,$tmp0
+	xor	$in1,$tmp1
+	xor	$in2,$tmp2
+	xor	$in3,$tmp3
+
+	lw	$tmp0,0($nonce)		# load nonce
+	lw	$tmp1,4($nonce)
+	lw	$tmp2,8($nonce)
+	lw	$tmp3,12($nonce)
+
+	addu	$in0,$tmp0		# accumulate nonce
+	sltu	$ctx,$in0,$tmp0
+
+	addu	$in1,$tmp1
+	sltu	$tmp1,$in1,$tmp1
+	addu	$in1,$ctx
+	sltu	$ctx,$in1,$ctx
+	addu	$ctx,$tmp1
+
+	addu	$in2,$tmp2
+	sltu	$tmp2,$in2,$tmp2
+	addu	$in2,$ctx
+	sltu	$ctx,$in2,$ctx
+	addu	$ctx,$tmp2
+
+	addu	$in3,$tmp3
+	addu	$in3,$ctx
+
+	srl	$tmp0,$in0,8		# write mac value
+	srl	$tmp1,$in0,16
+	srl	$tmp2,$in0,24
+	sb	$in0, 0($mac)
+	sb	$tmp0,1($mac)
+	srl	$tmp0,$in1,8
+	sb	$tmp1,2($mac)
+	srl	$tmp1,$in1,16
+	sb	$tmp2,3($mac)
+	srl	$tmp2,$in1,24
+	sb	$in1, 4($mac)
+	sb	$tmp0,5($mac)
+	srl	$tmp0,$in2,8
+	sb	$tmp1,6($mac)
+	srl	$tmp1,$in2,16
+	sb	$tmp2,7($mac)
+	srl	$tmp2,$in2,24
+	sb	$in2, 8($mac)
+	sb	$tmp0,9($mac)
+	srl	$tmp0,$in3,8
+	sb	$tmp1,10($mac)
+	srl	$tmp1,$in3,16
+	sb	$tmp2,11($mac)
+	srl	$tmp2,$in3,24
+	sb	$in3, 12($mac)
+	sb	$tmp0,13($mac)
+	sb	$tmp1,14($mac)
+	sb	$tmp2,15($mac)
+
+	jr	$ra
+.end	poly1305_emit
+.rdata
+.asciiz	"Poly1305 for MIPS32, CRYPTOGAMS by \@dot-asm"
+.align	2
+___
+}
+}}}
+
+$output=pop and open STDOUT,">$output";
+print $code;
+close STDOUT;
diff --git a/crypto/Kconfig b/crypto/Kconfig
index 325cd31f487f..9fc0f722f1b2 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -707,6 +707,11 @@ config CRYPTO_POLY1305_X86_64
 	  in IETF protocols. This is the x86_64 assembler implementation using SIMD
 	  instructions.
 
+config CRYPTO_POLY1305_MIPS
+	tristate "Poly1305 authenticator algorithm (MIPS optimized)"
+	depends on CPU_MIPS32 || (CPU_MIPS64 && 64BIT)
+	select CRYPTO_ARCH_HAVE_LIB_POLY1305
+
 config CRYPTO_MD4
 	tristate "MD4 digest algorithm"
 	select CRYPTO_HASH
diff --git a/lib/crypto/Kconfig b/lib/crypto/Kconfig
index 9bd15b227e78..d15ec5382986 100644
--- a/lib/crypto/Kconfig
+++ b/lib/crypto/Kconfig
@@ -39,6 +39,7 @@ config CRYPTO_LIB_DES
 
 config CRYPTO_LIB_POLY1305_RSIZE
 	int
+	default 2 if MIPS
 	default 4 if X86_64
 	default 9 if ARM || ARM64
 	default 1
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4 21/35] int128: move __uint128_t compiler test to Kconfig
  2019-10-17 19:08 [PATCH v4 00/35] crypto: crypto API library interfaces for WireGuard Ard Biesheuvel
                   ` (19 preceding siblings ...)
  2019-10-17 19:09 ` [PATCH v4 20/35] crypto: mips/poly1305 - incorporate OpenSSL/CRYPTOGAMS optimized implementation Ard Biesheuvel
@ 2019-10-17 19:09 ` Ard Biesheuvel
  2019-10-17 19:09 ` [PATCH v4 22/35] crypto: BLAKE2s - generic C library implementation and selftest Ard Biesheuvel
                   ` (13 subsequent siblings)
  34 siblings, 0 replies; 50+ messages in thread
From: Ard Biesheuvel @ 2019-10-17 19:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: Ard Biesheuvel, Herbert Xu, David Miller, Jason A . Donenfeld,
	Samuel Neves, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Martin Willi, Rene van Dorst, David Sterba, Masahiro Yamada

In order to use 128-bit integer arithmetic in C code, the architecture
needs to have declared support for it by setting ARCH_SUPPORTS_INT128,
and it requires a version of the toolchain that supports this at build
time. This is why all existing tests for ARCH_SUPPORTS_INT128 also test
whether __SIZEOF_INT128__ is defined, since this is only the case for
compilers that can support 128-bit integers.

Let's fold this additional test into the Kconfig declaration of
ARCH_SUPPORTS_INT128 so that we can also use the symbol in Makefiles,
e.g., to decide whether a certain object needs to be included in the
first place.

Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/Kconfig | 2 +-
 arch/riscv/Kconfig | 2 +-
 arch/x86/Kconfig   | 2 +-
 crypto/ecc.c       | 2 +-
 init/Kconfig       | 4 ++++
 lib/ubsan.c        | 2 +-
 lib/ubsan.h        | 2 +-
 7 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 41a9b4257b72..a591a0673694 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -67,7 +67,7 @@ config ARM64
 	select ARCH_USE_QUEUED_SPINLOCKS
 	select ARCH_SUPPORTS_MEMORY_FAILURE
 	select ARCH_SUPPORTS_ATOMIC_RMW
-	select ARCH_SUPPORTS_INT128 if GCC_VERSION >= 50000 || CC_IS_CLANG
+	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 && (GCC_VERSION >= 50000 || CC_IS_CLANG)
 	select ARCH_SUPPORTS_NUMA_BALANCING
 	select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
 	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 8eebbc8860bb..75a6c9117622 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -164,7 +164,7 @@ config ARCH_RV32I
 config ARCH_RV64I
 	bool "RV64I"
 	select 64BIT
-	select ARCH_SUPPORTS_INT128 if GCC_VERSION >= 50000
+	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 && GCC_VERSION >= 50000
 	select HAVE_FUNCTION_TRACER
 	select HAVE_FUNCTION_GRAPH_TRACER
 	select HAVE_FTRACE_MCOUNT_RECORD
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d6e1faa28c58..f4d9d1e55e5c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -24,7 +24,7 @@ config X86_64
 	depends on 64BIT
 	# Options that are inherently 64-bit kernel only:
 	select ARCH_HAS_GIGANTIC_PAGE
-	select ARCH_SUPPORTS_INT128
+	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
 	select ARCH_USE_CMPXCHG_LOCKREF
 	select HAVE_ARCH_SOFT_DIRTY
 	select MODULES_USE_ELF_RELA
diff --git a/crypto/ecc.c b/crypto/ecc.c
index dfe114bc0c4a..6e6aab6c987c 100644
--- a/crypto/ecc.c
+++ b/crypto/ecc.c
@@ -336,7 +336,7 @@ static u64 vli_usub(u64 *result, const u64 *left, u64 right,
 static uint128_t mul_64_64(u64 left, u64 right)
 {
 	uint128_t result;
-#if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
+#if defined(CONFIG_ARCH_SUPPORTS_INT128)
 	unsigned __int128 m = (unsigned __int128)left * right;
 
 	result.m_low  = m;
diff --git a/init/Kconfig b/init/Kconfig
index b4daad2bac23..020526f681c0 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -785,6 +785,10 @@ config ARCH_SUPPORTS_NUMA_BALANCING
 config ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 	bool
 
+config CC_HAS_INT128
+	def_bool y
+	depends on !$(cc-option,-D__SIZEOF_INT128__=0)
+
 #
 # For architectures that know their GCC __int128 support is sound
 #
diff --git a/lib/ubsan.c b/lib/ubsan.c
index e7d31735950d..b652cc14dd60 100644
--- a/lib/ubsan.c
+++ b/lib/ubsan.c
@@ -119,7 +119,7 @@ static void val_to_string(char *str, size_t size, struct type_descriptor *type,
 {
 	if (type_is_int(type)) {
 		if (type_bit_width(type) == 128) {
-#if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
+#if defined(CONFIG_ARCH_SUPPORTS_INT128)
 			u_max val = get_unsigned_val(type, value);
 
 			scnprintf(str, size, "0x%08x%08x%08x%08x",
diff --git a/lib/ubsan.h b/lib/ubsan.h
index b8fa83864467..7b56c09473a9 100644
--- a/lib/ubsan.h
+++ b/lib/ubsan.h
@@ -78,7 +78,7 @@ struct invalid_value_data {
 	struct type_descriptor *type;
 };
 
-#if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
+#if defined(CONFIG_ARCH_SUPPORTS_INT128)
 typedef __int128 s_max;
 typedef unsigned __int128 u_max;
 #else
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4 22/35] crypto: BLAKE2s - generic C library implementation and selftest
  2019-10-17 19:08 [PATCH v4 00/35] crypto: crypto API library interfaces for WireGuard Ard Biesheuvel
                   ` (20 preceding siblings ...)
  2019-10-17 19:09 ` [PATCH v4 21/35] int128: move __uint128_t compiler test to Kconfig Ard Biesheuvel
@ 2019-10-17 19:09 ` Ard Biesheuvel
  2019-10-23  4:51   ` Eric Biggers
  2019-10-17 19:09 ` [PATCH v4 23/35] crypto: testmgr - add test cases for Blake2s Ard Biesheuvel
                   ` (12 subsequent siblings)
  34 siblings, 1 reply; 50+ messages in thread
From: Ard Biesheuvel @ 2019-10-17 19:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: Ard Biesheuvel, Herbert Xu, David Miller, Jason A . Donenfeld,
	Samuel Neves, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Martin Willi, Rene van Dorst, David Sterba

From: "Jason A. Donenfeld" <Jason@zx2c4.com>

The C implementation was originally based on Samuel Neves' public
domain reference implementation but has since been heavily modified
for the kernel. We're able to do compile-time optimizations by moving
some scaffolding around the final function into the header file.

Information: https://blake2.net/

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
[ardb: move from lib/zinc to lib/crypto and remove simd handling]
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 include/crypto/blake2s.h          |  108 +
 include/crypto/internal/blake2s.h |   19 +
 lib/crypto/Kconfig                |   25 +
 lib/crypto/Makefile               |   10 +
 lib/crypto/blake2s-generic.c      |  111 ++
 lib/crypto/blake2s-selftest.c     | 2093 ++++++++++++++++++++
 lib/crypto/blake2s.c              |  126 ++
 7 files changed, 2492 insertions(+)

diff --git a/include/crypto/blake2s.h b/include/crypto/blake2s.h
new file mode 100644
index 000000000000..7a2d810deeca
--- /dev/null
+++ b/include/crypto/blake2s.h
@@ -0,0 +1,108 @@
+/* SPDX-License-Identifier: GPL-2.0 OR MIT */
+/*
+ * Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#ifndef BLAKE2S_H
+#define BLAKE2S_H
+
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+
+#include <asm/bug.h>
+
+enum blake2s_lengths {
+	BLAKE2S_BLOCK_SIZE = 64,
+	BLAKE2S_HASH_SIZE = 32,
+	BLAKE2S_KEY_SIZE = 32,
+
+	BLAKE2S_128_HASH_SIZE = 16,
+	BLAKE2S_160_HASH_SIZE = 20,
+	BLAKE2S_224_HASH_SIZE = 28,
+	BLAKE2S_256_HASH_SIZE = 32,
+};
+
+struct blake2s_state {
+	u32 h[8];
+	u32 t[2];
+	u32 f[2];
+	u8 buf[BLAKE2S_BLOCK_SIZE];
+	unsigned int buflen;
+	unsigned int outlen;
+};
+
+enum blake2s_iv {
+	BLAKE2S_IV0 = 0x6A09E667UL,
+	BLAKE2S_IV1 = 0xBB67AE85UL,
+	BLAKE2S_IV2 = 0x3C6EF372UL,
+	BLAKE2S_IV3 = 0xA54FF53AUL,
+	BLAKE2S_IV4 = 0x510E527FUL,
+	BLAKE2S_IV5 = 0x9B05688CUL,
+	BLAKE2S_IV6 = 0x1F83D9ABUL,
+	BLAKE2S_IV7 = 0x5BE0CD19UL,
+};
+
+void blake2s_update(struct blake2s_state *state, const u8 *in, size_t inlen);
+void blake2s_final(struct blake2s_state *state, u8 *out);
+
+static inline void blake2s_init_param(struct blake2s_state *state,
+				      const u32 param)
+{
+	*state = (struct blake2s_state){{
+		BLAKE2S_IV0 ^ param,
+		BLAKE2S_IV1,
+		BLAKE2S_IV2,
+		BLAKE2S_IV3,
+		BLAKE2S_IV4,
+		BLAKE2S_IV5,
+		BLAKE2S_IV6,
+		BLAKE2S_IV7,
+	}};
+}
+
+static inline void blake2s_init(struct blake2s_state *state,
+				const size_t outlen)
+{
+	WARN_ON(IS_ENABLED(DEBUG) && (!outlen || outlen > BLAKE2S_HASH_SIZE));
+
+	blake2s_init_param(state, 0x01010000 | outlen);
+	state->outlen = outlen;
+}
+
+static inline void blake2s_init_key(struct blake2s_state *state,
+				    const size_t outlen, const void *key,
+				    const size_t keylen)
+{
+	WARN_ON(IS_ENABLED(DEBUG) && (!outlen || outlen > BLAKE2S_HASH_SIZE ||
+		!key || !keylen || keylen > BLAKE2S_KEY_SIZE));
+
+	blake2s_init_param(state, 0x01010000 | keylen << 8 | outlen);
+	memcpy(state->buf, key, keylen);
+	state->buflen = BLAKE2S_BLOCK_SIZE;
+	state->outlen = outlen;
+}
+
+static inline void blake2s(u8 *out, const u8 *in, const u8 *key,
+			   const size_t outlen, const size_t inlen,
+			   const size_t keylen)
+{
+	struct blake2s_state state;
+
+	WARN_ON(IS_ENABLED(DEBUG) && ((!in && inlen > 0) || !out || !outlen ||
+		outlen > BLAKE2S_HASH_SIZE || keylen > BLAKE2S_KEY_SIZE ||
+		(!key && keylen)));
+
+	if (keylen)
+		blake2s_init_key(&state, outlen, key, keylen);
+	else
+		blake2s_init(&state, outlen);
+
+	blake2s_update(&state, in, inlen);
+	blake2s_final(&state, out);
+}
+
+void blake2s_hmac(u8 *out, const u8 *in, const u8 *key, const size_t outlen,
+		  const size_t inlen, const size_t keylen);
+
+#endif /* BLAKE2S_H */
diff --git a/include/crypto/internal/blake2s.h b/include/crypto/internal/blake2s.h
new file mode 100644
index 000000000000..941693effc7d
--- /dev/null
+++ b/include/crypto/internal/blake2s.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 OR MIT */
+
+#ifndef BLAKE2S_INTERNAL_H
+#define BLAKE2S_INTERNAL_H
+
+#include <crypto/blake2s.h>
+
+void blake2s_compress_generic(struct blake2s_state *state,const u8 *block,
+			      size_t nblocks, const u32 inc);
+
+void blake2s_compress_arch(struct blake2s_state *state,const u8 *block,
+			   size_t nblocks, const u32 inc);
+
+static inline void blake2s_set_lastblock(struct blake2s_state *state)
+{
+	state->f[0] = -1;
+}
+
+#endif /* BLAKE2S_INTERNAL_H */
diff --git a/lib/crypto/Kconfig b/lib/crypto/Kconfig
index d15ec5382986..7ad98b624e55 100644
--- a/lib/crypto/Kconfig
+++ b/lib/crypto/Kconfig
@@ -8,6 +8,31 @@ config CRYPTO_LIB_AES
 config CRYPTO_LIB_ARC4
 	tristate
 
+config CRYPTO_ARCH_HAVE_LIB_BLAKE2S
+	tristate
+	help
+	  Declares whether the architecture provides an arch-specific
+	  accelerated implementation of the Blake2s library interface,
+	  either builtin or as a module.
+
+config CRYPTO_LIB_BLAKE2S_GENERIC
+	tristate
+	help
+	  This symbol can be depended upon by arch implementations of the
+	  Blake2s library interface that require the generic code as a
+	  fallback, e.g., for SIMD implementations. If no arch specific
+	  implementation is enabled, this implementation serves the users
+	  of CRYPTO_LIB_BLAKE2S.
+
+config CRYPTO_LIB_BLAKE2S
+	tristate "BLAKE2s hash function library"
+	depends on CRYPTO_ARCH_HAVE_LIB_BLAKE2S || !CRYPTO_ARCH_HAVE_LIB_BLAKE2S
+	select CRYPTO_LIB_BLAKE2S_GENERIC if CRYPTO_ARCH_HAVE_LIB_BLAKE2S=n
+	help
+	  Enable the Blake2s library interface. This interface may be fulfilled
+	  by either the generic implementation or an arch-specific one, if one
+	  is available and enabled.
+
 config CRYPTO_ARCH_HAVE_LIB_CHACHA
 	tristate
 	help
diff --git a/lib/crypto/Makefile b/lib/crypto/Makefile
index b58ab6843a9d..8ca66b5f9807 100644
--- a/lib/crypto/Makefile
+++ b/lib/crypto/Makefile
@@ -10,6 +10,12 @@ libaes-y					:= aes.o
 obj-$(CONFIG_CRYPTO_LIB_ARC4)			+= libarc4.o
 libarc4-y					:= arc4.o
 
+obj-$(CONFIG_CRYPTO_LIB_BLAKE2S_GENERIC)	+= libblake2s-generic.o
+libblake2s-generic-y				+= blake2s-generic.o
+
+obj-$(CONFIG_CRYPTO_LIB_BLAKE2S)		+= libblake2s.o
+libblake2s-y					+= blake2s.o
+
 obj-$(CONFIG_CRYPTO_LIB_DES)			+= libdes.o
 libdes-y					:= des.o
 
@@ -18,3 +24,7 @@ libpoly1305-y					:= poly1305.o
 
 obj-$(CONFIG_CRYPTO_LIB_SHA256)			+= libsha256.o
 libsha256-y					:= sha256.o
+
+ifneq ($(CONFIG_CRYPTO_MANAGER_DISABLE_TESTS),y)
+libblake2s-y					+= blake2s-selftest.o
+endif
diff --git a/lib/crypto/blake2s-generic.c b/lib/crypto/blake2s-generic.c
new file mode 100644
index 000000000000..04ff8df24513
--- /dev/null
+++ b/lib/crypto/blake2s-generic.c
@@ -0,0 +1,111 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/*
+ * Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ *
+ * This is an implementation of the BLAKE2s hash and PRF functions.
+ *
+ * Information: https://blake2.net/
+ *
+ */
+
+#include <crypto/internal/blake2s.h>
+#include <linux/types.h>
+#include <linux/string.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/bug.h>
+#include <asm/unaligned.h>
+
+static const u8 blake2s_sigma[10][16] = {
+	{ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 },
+	{ 14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3 },
+	{ 11, 8, 12, 0, 5, 2, 15, 13, 10, 14, 3, 6, 7, 1, 9, 4 },
+	{ 7, 9, 3, 1, 13, 12, 11, 14, 2, 6, 5, 10, 4, 0, 15, 8 },
+	{ 9, 0, 5, 7, 2, 4, 10, 15, 14, 1, 11, 12, 6, 8, 3, 13 },
+	{ 2, 12, 6, 10, 0, 11, 8, 3, 4, 13, 7, 5, 15, 14, 1, 9 },
+	{ 12, 5, 1, 15, 14, 13, 4, 10, 0, 7, 6, 3, 9, 2, 8, 11 },
+	{ 13, 11, 7, 14, 12, 1, 3, 9, 5, 0, 15, 4, 8, 6, 2, 10 },
+	{ 6, 15, 14, 9, 11, 3, 0, 8, 12, 2, 13, 7, 1, 4, 10, 5 },
+	{ 10, 2, 8, 4, 7, 6, 1, 5, 15, 11, 9, 14, 3, 12, 13, 0 },
+};
+
+static inline void blake2s_increment_counter(struct blake2s_state *state,
+					     const u32 inc)
+{
+	state->t[0] += inc;
+	state->t[1] += (state->t[0] < inc);
+}
+
+void blake2s_compress_generic(struct blake2s_state *state,const u8 *block,
+			      size_t nblocks, const u32 inc)
+{
+	u32 m[16];
+	u32 v[16];
+	int i;
+
+	WARN_ON(IS_ENABLED(DEBUG) &&
+		(nblocks > 1 && inc != BLAKE2S_BLOCK_SIZE));
+
+	while (nblocks > 0) {
+		blake2s_increment_counter(state, inc);
+		memcpy(m, block, BLAKE2S_BLOCK_SIZE);
+		le32_to_cpu_array(m, ARRAY_SIZE(m));
+		memcpy(v, state->h, 32);
+		v[ 8] = BLAKE2S_IV0;
+		v[ 9] = BLAKE2S_IV1;
+		v[10] = BLAKE2S_IV2;
+		v[11] = BLAKE2S_IV3;
+		v[12] = BLAKE2S_IV4 ^ state->t[0];
+		v[13] = BLAKE2S_IV5 ^ state->t[1];
+		v[14] = BLAKE2S_IV6 ^ state->f[0];
+		v[15] = BLAKE2S_IV7 ^ state->f[1];
+
+#define G(r, i, a, b, c, d) do { \
+	a += b + m[blake2s_sigma[r][2 * i + 0]]; \
+	d = ror32(d ^ a, 16); \
+	c += d; \
+	b = ror32(b ^ c, 12); \
+	a += b + m[blake2s_sigma[r][2 * i + 1]]; \
+	d = ror32(d ^ a, 8); \
+	c += d; \
+	b = ror32(b ^ c, 7); \
+} while (0)
+
+#define ROUND(r) do { \
+	G(r, 0, v[0], v[ 4], v[ 8], v[12]); \
+	G(r, 1, v[1], v[ 5], v[ 9], v[13]); \
+	G(r, 2, v[2], v[ 6], v[10], v[14]); \
+	G(r, 3, v[3], v[ 7], v[11], v[15]); \
+	G(r, 4, v[0], v[ 5], v[10], v[15]); \
+	G(r, 5, v[1], v[ 6], v[11], v[12]); \
+	G(r, 6, v[2], v[ 7], v[ 8], v[13]); \
+	G(r, 7, v[3], v[ 4], v[ 9], v[14]); \
+} while (0)
+		ROUND(0);
+		ROUND(1);
+		ROUND(2);
+		ROUND(3);
+		ROUND(4);
+		ROUND(5);
+		ROUND(6);
+		ROUND(7);
+		ROUND(8);
+		ROUND(9);
+
+#undef G
+#undef ROUND
+
+		for (i = 0; i < 8; ++i)
+			state->h[i] ^= v[i] ^ v[i + 8];
+
+		block += BLAKE2S_BLOCK_SIZE;
+		--nblocks;
+	}
+}
+
+EXPORT_SYMBOL(blake2s_compress_generic);
+
+MODULE_LICENSE("GPL v2");
+MODULE_DESCRIPTION("BLAKE2s hash function");
+MODULE_AUTHOR("Jason A. Donenfeld <Jason@zx2c4.com>");
diff --git a/lib/crypto/blake2s-selftest.c b/lib/crypto/blake2s-selftest.c
new file mode 100644
index 000000000000..7ba00fcc6b60
--- /dev/null
+++ b/lib/crypto/blake2s-selftest.c
@@ -0,0 +1,2093 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/*
+ * Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#include <crypto/blake2s.h>
+#include <linux/string.h>
+
+static const u8 blake2s_testvecs[][BLAKE2S_HASH_SIZE] __initconst = {
+	{ 0x69, 0x21, 0x7a, 0x30, 0x79, 0x90, 0x80, 0x94,
+	  0xe1, 0x11, 0x21, 0xd0, 0x42, 0x35, 0x4a, 0x7c,
+	  0x1f, 0x55, 0xb6, 0x48, 0x2c, 0xa1, 0xa5, 0x1e,
+	  0x1b, 0x25, 0x0d, 0xfd, 0x1e, 0xd0, 0xee, 0xf9 },
+	{ 0xe3, 0x4d, 0x74, 0xdb, 0xaf, 0x4f, 0xf4, 0xc6,
+	  0xab, 0xd8, 0x71, 0xcc, 0x22, 0x04, 0x51, 0xd2,
+	  0xea, 0x26, 0x48, 0x84, 0x6c, 0x77, 0x57, 0xfb,
+	  0xaa, 0xc8, 0x2f, 0xe5, 0x1a, 0xd6, 0x4b, 0xea },
+	{ 0xdd, 0xad, 0x9a, 0xb1, 0x5d, 0xac, 0x45, 0x49,
+	  0xba, 0x42, 0xf4, 0x9d, 0x26, 0x24, 0x96, 0xbe,
+	  0xf6, 0xc0, 0xba, 0xe1, 0xdd, 0x34, 0x2a, 0x88,
+	  0x08, 0xf8, 0xea, 0x26, 0x7c, 0x6e, 0x21, 0x0c },
+	{ 0xe8, 0xf9, 0x1c, 0x6e, 0xf2, 0x32, 0xa0, 0x41,
+	  0x45, 0x2a, 0xb0, 0xe1, 0x49, 0x07, 0x0c, 0xdd,
+	  0x7d, 0xd1, 0x76, 0x9e, 0x75, 0xb3, 0xa5, 0x92,
+	  0x1b, 0xe3, 0x78, 0x76, 0xc4, 0x5c, 0x99, 0x00 },
+	{ 0x0c, 0xc7, 0x0e, 0x00, 0x34, 0x8b, 0x86, 0xba,
+	  0x29, 0x44, 0xd0, 0xc3, 0x20, 0x38, 0xb2, 0x5c,
+	  0x55, 0x58, 0x4f, 0x90, 0xdf, 0x23, 0x04, 0xf5,
+	  0x5f, 0xa3, 0x32, 0xaf, 0x5f, 0xb0, 0x1e, 0x20 },
+	{ 0xec, 0x19, 0x64, 0x19, 0x10, 0x87, 0xa4, 0xfe,
+	  0x9d, 0xf1, 0xc7, 0x95, 0x34, 0x2a, 0x02, 0xff,
+	  0xc1, 0x91, 0xa5, 0xb2, 0x51, 0x76, 0x48, 0x56,
+	  0xae, 0x5b, 0x8b, 0x57, 0x69, 0xf0, 0xc6, 0xcd },
+	{ 0xe1, 0xfa, 0x51, 0x61, 0x8d, 0x7d, 0xf4, 0xeb,
+	  0x70, 0xcf, 0x0d, 0x5a, 0x9e, 0x90, 0x6f, 0x80,
+	  0x6e, 0x9d, 0x19, 0xf7, 0xf4, 0xf0, 0x1e, 0x3b,
+	  0x62, 0x12, 0x88, 0xe4, 0x12, 0x04, 0x05, 0xd6 },
+	{ 0x59, 0x80, 0x01, 0xfa, 0xfb, 0xe8, 0xf9, 0x4e,
+	  0xc6, 0x6d, 0xc8, 0x27, 0xd0, 0x12, 0xcf, 0xcb,
+	  0xba, 0x22, 0x28, 0x56, 0x9f, 0x44, 0x8e, 0x89,
+	  0xea, 0x22, 0x08, 0xc8, 0xbf, 0x76, 0x92, 0x93 },
+	{ 0xc7, 0xe8, 0x87, 0xb5, 0x46, 0x62, 0x36, 0x35,
+	  0xe9, 0x3e, 0x04, 0x95, 0x59, 0x8f, 0x17, 0x26,
+	  0x82, 0x19, 0x96, 0xc2, 0x37, 0x77, 0x05, 0xb9,
+	  0x3a, 0x1f, 0x63, 0x6f, 0x87, 0x2b, 0xfa, 0x2d },
+	{ 0xc3, 0x15, 0xa4, 0x37, 0xdd, 0x28, 0x06, 0x2a,
+	  0x77, 0x0d, 0x48, 0x19, 0x67, 0x13, 0x6b, 0x1b,
+	  0x5e, 0xb8, 0x8b, 0x21, 0xee, 0x53, 0xd0, 0x32,
+	  0x9c, 0x58, 0x97, 0x12, 0x6e, 0x9d, 0xb0, 0x2c },
+	{ 0xbb, 0x47, 0x3d, 0xed, 0xdc, 0x05, 0x5f, 0xea,
+	  0x62, 0x28, 0xf2, 0x07, 0xda, 0x57, 0x53, 0x47,
+	  0xbb, 0x00, 0x40, 0x4c, 0xd3, 0x49, 0xd3, 0x8c,
+	  0x18, 0x02, 0x63, 0x07, 0xa2, 0x24, 0xcb, 0xff },
+	{ 0x68, 0x7e, 0x18, 0x73, 0xa8, 0x27, 0x75, 0x91,
+	  0xbb, 0x33, 0xd9, 0xad, 0xf9, 0xa1, 0x39, 0x12,
+	  0xef, 0xef, 0xe5, 0x57, 0xca, 0xfc, 0x39, 0xa7,
+	  0x95, 0x26, 0x23, 0xe4, 0x72, 0x55, 0xf1, 0x6d },
+	{ 0x1a, 0xc7, 0xba, 0x75, 0x4d, 0x6e, 0x2f, 0x94,
+	  0xe0, 0xe8, 0x6c, 0x46, 0xbf, 0xb2, 0x62, 0xab,
+	  0xbb, 0x74, 0xf4, 0x50, 0xef, 0x45, 0x6d, 0x6b,
+	  0x4d, 0x97, 0xaa, 0x80, 0xce, 0x6d, 0xa7, 0x67 },
+	{ 0x01, 0x2c, 0x97, 0x80, 0x96, 0x14, 0x81, 0x6b,
+	  0x5d, 0x94, 0x94, 0x47, 0x7d, 0x4b, 0x68, 0x7d,
+	  0x15, 0xb9, 0x6e, 0xb6, 0x9c, 0x0e, 0x80, 0x74,
+	  0xa8, 0x51, 0x6f, 0x31, 0x22, 0x4b, 0x5c, 0x98 },
+	{ 0x91, 0xff, 0xd2, 0x6c, 0xfa, 0x4d, 0xa5, 0x13,
+	  0x4c, 0x7e, 0xa2, 0x62, 0xf7, 0x88, 0x9c, 0x32,
+	  0x9f, 0x61, 0xf6, 0xa6, 0x57, 0x22, 0x5c, 0xc2,
+	  0x12, 0xf4, 0x00, 0x56, 0xd9, 0x86, 0xb3, 0xf4 },
+	{ 0xd9, 0x7c, 0x82, 0x8d, 0x81, 0x82, 0xa7, 0x21,
+	  0x80, 0xa0, 0x6a, 0x78, 0x26, 0x83, 0x30, 0x67,
+	  0x3f, 0x7c, 0x4e, 0x06, 0x35, 0x94, 0x7c, 0x04,
+	  0xc0, 0x23, 0x23, 0xfd, 0x45, 0xc0, 0xa5, 0x2d },
+	{ 0xef, 0xc0, 0x4c, 0xdc, 0x39, 0x1c, 0x7e, 0x91,
+	  0x19, 0xbd, 0x38, 0x66, 0x8a, 0x53, 0x4e, 0x65,
+	  0xfe, 0x31, 0x03, 0x6d, 0x6a, 0x62, 0x11, 0x2e,
+	  0x44, 0xeb, 0xeb, 0x11, 0xf9, 0xc5, 0x70, 0x80 },
+	{ 0x99, 0x2c, 0xf5, 0xc0, 0x53, 0x44, 0x2a, 0x5f,
+	  0xbc, 0x4f, 0xaf, 0x58, 0x3e, 0x04, 0xe5, 0x0b,
+	  0xb7, 0x0d, 0x2f, 0x39, 0xfb, 0xb6, 0xa5, 0x03,
+	  0xf8, 0x9e, 0x56, 0xa6, 0x3e, 0x18, 0x57, 0x8a },
+	{ 0x38, 0x64, 0x0e, 0x9f, 0x21, 0x98, 0x3e, 0x67,
+	  0xb5, 0x39, 0xca, 0xcc, 0xae, 0x5e, 0xcf, 0x61,
+	  0x5a, 0xe2, 0x76, 0x4f, 0x75, 0xa0, 0x9c, 0x9c,
+	  0x59, 0xb7, 0x64, 0x83, 0xc1, 0xfb, 0xc7, 0x35 },
+	{ 0x21, 0x3d, 0xd3, 0x4c, 0x7e, 0xfe, 0x4f, 0xb2,
+	  0x7a, 0x6b, 0x35, 0xf6, 0xb4, 0x00, 0x0d, 0x1f,
+	  0xe0, 0x32, 0x81, 0xaf, 0x3c, 0x72, 0x3e, 0x5c,
+	  0x9f, 0x94, 0x74, 0x7a, 0x5f, 0x31, 0xcd, 0x3b },
+	{ 0xec, 0x24, 0x6e, 0xee, 0xb9, 0xce, 0xd3, 0xf7,
+	  0xad, 0x33, 0xed, 0x28, 0x66, 0x0d, 0xd9, 0xbb,
+	  0x07, 0x32, 0x51, 0x3d, 0xb4, 0xe2, 0xfa, 0x27,
+	  0x8b, 0x60, 0xcd, 0xe3, 0x68, 0x2a, 0x4c, 0xcd },
+	{ 0xac, 0x9b, 0x61, 0xd4, 0x46, 0x64, 0x8c, 0x30,
+	  0x05, 0xd7, 0x89, 0x2b, 0xf3, 0xa8, 0x71, 0x9f,
+	  0x4c, 0x81, 0x81, 0xcf, 0xdc, 0xbc, 0x2b, 0x79,
+	  0xfe, 0xf1, 0x0a, 0x27, 0x9b, 0x91, 0x10, 0x95 },
+	{ 0x7b, 0xf8, 0xb2, 0x29, 0x59, 0xe3, 0x4e, 0x3a,
+	  0x43, 0xf7, 0x07, 0x92, 0x23, 0xe8, 0x3a, 0x97,
+	  0x54, 0x61, 0x7d, 0x39, 0x1e, 0x21, 0x3d, 0xfd,
+	  0x80, 0x8e, 0x41, 0xb9, 0xbe, 0xad, 0x4c, 0xe7 },
+	{ 0x68, 0xd4, 0xb5, 0xd4, 0xfa, 0x0e, 0x30, 0x2b,
+	  0x64, 0xcc, 0xc5, 0xaf, 0x79, 0x29, 0x13, 0xac,
+	  0x4c, 0x88, 0xec, 0x95, 0xc0, 0x7d, 0xdf, 0x40,
+	  0x69, 0x42, 0x56, 0xeb, 0x88, 0xce, 0x9f, 0x3d },
+	{ 0xb2, 0xc2, 0x42, 0x0f, 0x05, 0xf9, 0xab, 0xe3,
+	  0x63, 0x15, 0x91, 0x93, 0x36, 0xb3, 0x7e, 0x4e,
+	  0x0f, 0xa3, 0x3f, 0xf7, 0xe7, 0x6a, 0x49, 0x27,
+	  0x67, 0x00, 0x6f, 0xdb, 0x5d, 0x93, 0x54, 0x62 },
+	{ 0x13, 0x4f, 0x61, 0xbb, 0xd0, 0xbb, 0xb6, 0x9a,
+	  0xed, 0x53, 0x43, 0x90, 0x45, 0x51, 0xa3, 0xe6,
+	  0xc1, 0xaa, 0x7d, 0xcd, 0xd7, 0x7e, 0x90, 0x3e,
+	  0x70, 0x23, 0xeb, 0x7c, 0x60, 0x32, 0x0a, 0xa7 },
+	{ 0x46, 0x93, 0xf9, 0xbf, 0xf7, 0xd4, 0xf3, 0x98,
+	  0x6a, 0x7d, 0x17, 0x6e, 0x6e, 0x06, 0xf7, 0x2a,
+	  0xd1, 0x49, 0x0d, 0x80, 0x5c, 0x99, 0xe2, 0x53,
+	  0x47, 0xb8, 0xde, 0x77, 0xb4, 0xdb, 0x6d, 0x9b },
+	{ 0x85, 0x3e, 0x26, 0xf7, 0x41, 0x95, 0x3b, 0x0f,
+	  0xd5, 0xbd, 0xb4, 0x24, 0xe8, 0xab, 0x9e, 0x8b,
+	  0x37, 0x50, 0xea, 0xa8, 0xef, 0x61, 0xe4, 0x79,
+	  0x02, 0xc9, 0x1e, 0x55, 0x4e, 0x9c, 0x73, 0xb9 },
+	{ 0xf7, 0xde, 0x53, 0x63, 0x61, 0xab, 0xaa, 0x0e,
+	  0x15, 0x81, 0x56, 0xcf, 0x0e, 0xa4, 0xf6, 0x3a,
+	  0x99, 0xb5, 0xe4, 0x05, 0x4f, 0x8f, 0xa4, 0xc9,
+	  0xd4, 0x5f, 0x62, 0x85, 0xca, 0xd5, 0x56, 0x94 },
+	{ 0x4c, 0x23, 0x06, 0x08, 0x86, 0x0a, 0x99, 0xae,
+	  0x8d, 0x7b, 0xd5, 0xc2, 0xcc, 0x17, 0xfa, 0x52,
+	  0x09, 0x6b, 0x9a, 0x61, 0xbe, 0xdb, 0x17, 0xcb,
+	  0x76, 0x17, 0x86, 0x4a, 0xd2, 0x9c, 0xa7, 0xa6 },
+	{ 0xae, 0xb9, 0x20, 0xea, 0x87, 0x95, 0x2d, 0xad,
+	  0xb1, 0xfb, 0x75, 0x92, 0x91, 0xe3, 0x38, 0x81,
+	  0x39, 0xa8, 0x72, 0x86, 0x50, 0x01, 0x88, 0x6e,
+	  0xd8, 0x47, 0x52, 0xe9, 0x3c, 0x25, 0x0c, 0x2a },
+	{ 0xab, 0xa4, 0xad, 0x9b, 0x48, 0x0b, 0x9d, 0xf3,
+	  0xd0, 0x8c, 0xa5, 0xe8, 0x7b, 0x0c, 0x24, 0x40,
+	  0xd4, 0xe4, 0xea, 0x21, 0x22, 0x4c, 0x2e, 0xb4,
+	  0x2c, 0xba, 0xe4, 0x69, 0xd0, 0x89, 0xb9, 0x31 },
+	{ 0x05, 0x82, 0x56, 0x07, 0xd7, 0xfd, 0xf2, 0xd8,
+	  0x2e, 0xf4, 0xc3, 0xc8, 0xc2, 0xae, 0xa9, 0x61,
+	  0xad, 0x98, 0xd6, 0x0e, 0xdf, 0xf7, 0xd0, 0x18,
+	  0x98, 0x3e, 0x21, 0x20, 0x4c, 0x0d, 0x93, 0xd1 },
+	{ 0xa7, 0x42, 0xf8, 0xb6, 0xaf, 0x82, 0xd8, 0xa6,
+	  0xca, 0x23, 0x57, 0xc5, 0xf1, 0xcf, 0x91, 0xde,
+	  0xfb, 0xd0, 0x66, 0x26, 0x7d, 0x75, 0xc0, 0x48,
+	  0xb3, 0x52, 0x36, 0x65, 0x85, 0x02, 0x59, 0x62 },
+	{ 0x2b, 0xca, 0xc8, 0x95, 0x99, 0x00, 0x0b, 0x42,
+	  0xc9, 0x5a, 0xe2, 0x38, 0x35, 0xa7, 0x13, 0x70,
+	  0x4e, 0xd7, 0x97, 0x89, 0xc8, 0x4f, 0xef, 0x14,
+	  0x9a, 0x87, 0x4f, 0xf7, 0x33, 0xf0, 0x17, 0xa2 },
+	{ 0xac, 0x1e, 0xd0, 0x7d, 0x04, 0x8f, 0x10, 0x5a,
+	  0x9e, 0x5b, 0x7a, 0xb8, 0x5b, 0x09, 0xa4, 0x92,
+	  0xd5, 0xba, 0xff, 0x14, 0xb8, 0xbf, 0xb0, 0xe9,
+	  0xfd, 0x78, 0x94, 0x86, 0xee, 0xa2, 0xb9, 0x74 },
+	{ 0xe4, 0x8d, 0x0e, 0xcf, 0xaf, 0x49, 0x7d, 0x5b,
+	  0x27, 0xc2, 0x5d, 0x99, 0xe1, 0x56, 0xcb, 0x05,
+	  0x79, 0xd4, 0x40, 0xd6, 0xe3, 0x1f, 0xb6, 0x24,
+	  0x73, 0x69, 0x6d, 0xbf, 0x95, 0xe0, 0x10, 0xe4 },
+	{ 0x12, 0xa9, 0x1f, 0xad, 0xf8, 0xb2, 0x16, 0x44,
+	  0xfd, 0x0f, 0x93, 0x4f, 0x3c, 0x4a, 0x8f, 0x62,
+	  0xba, 0x86, 0x2f, 0xfd, 0x20, 0xe8, 0xe9, 0x61,
+	  0x15, 0x4c, 0x15, 0xc1, 0x38, 0x84, 0xed, 0x3d },
+	{ 0x7c, 0xbe, 0xe9, 0x6e, 0x13, 0x98, 0x97, 0xdc,
+	  0x98, 0xfb, 0xef, 0x3b, 0xe8, 0x1a, 0xd4, 0xd9,
+	  0x64, 0xd2, 0x35, 0xcb, 0x12, 0x14, 0x1f, 0xb6,
+	  0x67, 0x27, 0xe6, 0xe5, 0xdf, 0x73, 0xa8, 0x78 },
+	{ 0xeb, 0xf6, 0x6a, 0xbb, 0x59, 0x7a, 0xe5, 0x72,
+	  0xa7, 0x29, 0x7c, 0xb0, 0x87, 0x1e, 0x35, 0x5a,
+	  0xcc, 0xaf, 0xad, 0x83, 0x77, 0xb8, 0xe7, 0x8b,
+	  0xf1, 0x64, 0xce, 0x2a, 0x18, 0xde, 0x4b, 0xaf },
+	{ 0x71, 0xb9, 0x33, 0xb0, 0x7e, 0x4f, 0xf7, 0x81,
+	  0x8c, 0xe0, 0x59, 0xd0, 0x08, 0x82, 0x9e, 0x45,
+	  0x3c, 0x6f, 0xf0, 0x2e, 0xc0, 0xa7, 0xdb, 0x39,
+	  0x3f, 0xc2, 0xd8, 0x70, 0xf3, 0x7a, 0x72, 0x86 },
+	{ 0x7c, 0xf7, 0xc5, 0x13, 0x31, 0x22, 0x0b, 0x8d,
+	  0x3e, 0xba, 0xed, 0x9c, 0x29, 0x39, 0x8a, 0x16,
+	  0xd9, 0x81, 0x56, 0xe2, 0x61, 0x3c, 0xb0, 0x88,
+	  0xf2, 0xb0, 0xe0, 0x8a, 0x1b, 0xe4, 0xcf, 0x4f },
+	{ 0x3e, 0x41, 0xa1, 0x08, 0xe0, 0xf6, 0x4a, 0xd2,
+	  0x76, 0xb9, 0x79, 0xe1, 0xce, 0x06, 0x82, 0x79,
+	  0xe1, 0x6f, 0x7b, 0xc7, 0xe4, 0xaa, 0x1d, 0x21,
+	  0x1e, 0x17, 0xb8, 0x11, 0x61, 0xdf, 0x16, 0x02 },
+	{ 0x88, 0x65, 0x02, 0xa8, 0x2a, 0xb4, 0x7b, 0xa8,
+	  0xd8, 0x67, 0x10, 0xaa, 0x9d, 0xe3, 0xd4, 0x6e,
+	  0xa6, 0x5c, 0x47, 0xaf, 0x6e, 0xe8, 0xde, 0x45,
+	  0x0c, 0xce, 0xb8, 0xb1, 0x1b, 0x04, 0x5f, 0x50 },
+	{ 0xc0, 0x21, 0xbc, 0x5f, 0x09, 0x54, 0xfe, 0xe9,
+	  0x4f, 0x46, 0xea, 0x09, 0x48, 0x7e, 0x10, 0xa8,
+	  0x48, 0x40, 0xd0, 0x2f, 0x64, 0x81, 0x0b, 0xc0,
+	  0x8d, 0x9e, 0x55, 0x1f, 0x7d, 0x41, 0x68, 0x14 },
+	{ 0x20, 0x30, 0x51, 0x6e, 0x8a, 0x5f, 0xe1, 0x9a,
+	  0xe7, 0x9c, 0x33, 0x6f, 0xce, 0x26, 0x38, 0x2a,
+	  0x74, 0x9d, 0x3f, 0xd0, 0xec, 0x91, 0xe5, 0x37,
+	  0xd4, 0xbd, 0x23, 0x58, 0xc1, 0x2d, 0xfb, 0x22 },
+	{ 0x55, 0x66, 0x98, 0xda, 0xc8, 0x31, 0x7f, 0xd3,
+	  0x6d, 0xfb, 0xdf, 0x25, 0xa7, 0x9c, 0xb1, 0x12,
+	  0xd5, 0x42, 0x58, 0x60, 0x60, 0x5c, 0xba, 0xf5,
+	  0x07, 0xf2, 0x3b, 0xf7, 0xe9, 0xf4, 0x2a, 0xfe },
+	{ 0x2f, 0x86, 0x7b, 0xa6, 0x77, 0x73, 0xfd, 0xc3,
+	  0xe9, 0x2f, 0xce, 0xd9, 0x9a, 0x64, 0x09, 0xad,
+	  0x39, 0xd0, 0xb8, 0x80, 0xfd, 0xe8, 0xf1, 0x09,
+	  0xa8, 0x17, 0x30, 0xc4, 0x45, 0x1d, 0x01, 0x78 },
+	{ 0x17, 0x2e, 0xc2, 0x18, 0xf1, 0x19, 0xdf, 0xae,
+	  0x98, 0x89, 0x6d, 0xff, 0x29, 0xdd, 0x98, 0x76,
+	  0xc9, 0x4a, 0xf8, 0x74, 0x17, 0xf9, 0xae, 0x4c,
+	  0x70, 0x14, 0xbb, 0x4e, 0x4b, 0x96, 0xaf, 0xc7 },
+	{ 0x3f, 0x85, 0x81, 0x4a, 0x18, 0x19, 0x5f, 0x87,
+	  0x9a, 0xa9, 0x62, 0xf9, 0x5d, 0x26, 0xbd, 0x82,
+	  0xa2, 0x78, 0xf2, 0xb8, 0x23, 0x20, 0x21, 0x8f,
+	  0x6b, 0x3b, 0xd6, 0xf7, 0xf6, 0x67, 0xa6, 0xd9 },
+	{ 0x1b, 0x61, 0x8f, 0xba, 0xa5, 0x66, 0xb3, 0xd4,
+	  0x98, 0xc1, 0x2e, 0x98, 0x2c, 0x9e, 0xc5, 0x2e,
+	  0x4d, 0xa8, 0x5a, 0x8c, 0x54, 0xf3, 0x8f, 0x34,
+	  0xc0, 0x90, 0x39, 0x4f, 0x23, 0xc1, 0x84, 0xc1 },
+	{ 0x0c, 0x75, 0x8f, 0xb5, 0x69, 0x2f, 0xfd, 0x41,
+	  0xa3, 0x57, 0x5d, 0x0a, 0xf0, 0x0c, 0xc7, 0xfb,
+	  0xf2, 0xcb, 0xe5, 0x90, 0x5a, 0x58, 0x32, 0x3a,
+	  0x88, 0xae, 0x42, 0x44, 0xf6, 0xe4, 0xc9, 0x93 },
+	{ 0xa9, 0x31, 0x36, 0x0c, 0xad, 0x62, 0x8c, 0x7f,
+	  0x12, 0xa6, 0xc1, 0xc4, 0xb7, 0x53, 0xb0, 0xf4,
+	  0x06, 0x2a, 0xef, 0x3c, 0xe6, 0x5a, 0x1a, 0xe3,
+	  0xf1, 0x93, 0x69, 0xda, 0xdf, 0x3a, 0xe2, 0x3d },
+	{ 0xcb, 0xac, 0x7d, 0x77, 0x3b, 0x1e, 0x3b, 0x3c,
+	  0x66, 0x91, 0xd7, 0xab, 0xb7, 0xe9, 0xdf, 0x04,
+	  0x5c, 0x8b, 0xa1, 0x92, 0x68, 0xde, 0xd1, 0x53,
+	  0x20, 0x7f, 0x5e, 0x80, 0x43, 0x52, 0xec, 0x5d },
+	{ 0x23, 0xa1, 0x96, 0xd3, 0x80, 0x2e, 0xd3, 0xc1,
+	  0xb3, 0x84, 0x01, 0x9a, 0x82, 0x32, 0x58, 0x40,
+	  0xd3, 0x2f, 0x71, 0x95, 0x0c, 0x45, 0x80, 0xb0,
+	  0x34, 0x45, 0xe0, 0x89, 0x8e, 0x14, 0x05, 0x3c },
+	{ 0xf4, 0x49, 0x54, 0x70, 0xf2, 0x26, 0xc8, 0xc2,
+	  0x14, 0xbe, 0x08, 0xfd, 0xfa, 0xd4, 0xbc, 0x4a,
+	  0x2a, 0x9d, 0xbe, 0xa9, 0x13, 0x6a, 0x21, 0x0d,
+	  0xf0, 0xd4, 0xb6, 0x49, 0x29, 0xe6, 0xfc, 0x14 },
+	{ 0xe2, 0x90, 0xdd, 0x27, 0x0b, 0x46, 0x7f, 0x34,
+	  0xab, 0x1c, 0x00, 0x2d, 0x34, 0x0f, 0xa0, 0x16,
+	  0x25, 0x7f, 0xf1, 0x9e, 0x58, 0x33, 0xfd, 0xbb,
+	  0xf2, 0xcb, 0x40, 0x1c, 0x3b, 0x28, 0x17, 0xde },
+	{ 0x9f, 0xc7, 0xb5, 0xde, 0xd3, 0xc1, 0x50, 0x42,
+	  0xb2, 0xa6, 0x58, 0x2d, 0xc3, 0x9b, 0xe0, 0x16,
+	  0xd2, 0x4a, 0x68, 0x2d, 0x5e, 0x61, 0xad, 0x1e,
+	  0xff, 0x9c, 0x63, 0x30, 0x98, 0x48, 0xf7, 0x06 },
+	{ 0x8c, 0xca, 0x67, 0xa3, 0x6d, 0x17, 0xd5, 0xe6,
+	  0x34, 0x1c, 0xb5, 0x92, 0xfd, 0x7b, 0xef, 0x99,
+	  0x26, 0xc9, 0xe3, 0xaa, 0x10, 0x27, 0xea, 0x11,
+	  0xa7, 0xd8, 0xbd, 0x26, 0x0b, 0x57, 0x6e, 0x04 },
+	{ 0x40, 0x93, 0x92, 0xf5, 0x60, 0xf8, 0x68, 0x31,
+	  0xda, 0x43, 0x73, 0xee, 0x5e, 0x00, 0x74, 0x26,
+	  0x05, 0x95, 0xd7, 0xbc, 0x24, 0x18, 0x3b, 0x60,
+	  0xed, 0x70, 0x0d, 0x45, 0x83, 0xd3, 0xf6, 0xf0 },
+	{ 0x28, 0x02, 0x16, 0x5d, 0xe0, 0x90, 0x91, 0x55,
+	  0x46, 0xf3, 0x39, 0x8c, 0xd8, 0x49, 0x16, 0x4a,
+	  0x19, 0xf9, 0x2a, 0xdb, 0xc3, 0x61, 0xad, 0xc9,
+	  0x9b, 0x0f, 0x20, 0xc8, 0xea, 0x07, 0x10, 0x54 },
+	{ 0xad, 0x83, 0x91, 0x68, 0xd9, 0xf8, 0xa4, 0xbe,
+	  0x95, 0xba, 0x9e, 0xf9, 0xa6, 0x92, 0xf0, 0x72,
+	  0x56, 0xae, 0x43, 0xfe, 0x6f, 0x98, 0x64, 0xe2,
+	  0x90, 0x69, 0x1b, 0x02, 0x56, 0xce, 0x50, 0xa9 },
+	{ 0x75, 0xfd, 0xaa, 0x50, 0x38, 0xc2, 0x84, 0xb8,
+	  0x6d, 0x6e, 0x8a, 0xff, 0xe8, 0xb2, 0x80, 0x7e,
+	  0x46, 0x7b, 0x86, 0x60, 0x0e, 0x79, 0xaf, 0x36,
+	  0x89, 0xfb, 0xc0, 0x63, 0x28, 0xcb, 0xf8, 0x94 },
+	{ 0xe5, 0x7c, 0xb7, 0x94, 0x87, 0xdd, 0x57, 0x90,
+	  0x24, 0x32, 0xb2, 0x50, 0x73, 0x38, 0x13, 0xbd,
+	  0x96, 0xa8, 0x4e, 0xfc, 0xe5, 0x9f, 0x65, 0x0f,
+	  0xac, 0x26, 0xe6, 0x69, 0x6a, 0xef, 0xaf, 0xc3 },
+	{ 0x56, 0xf3, 0x4e, 0x8b, 0x96, 0x55, 0x7e, 0x90,
+	  0xc1, 0xf2, 0x4b, 0x52, 0xd0, 0xc8, 0x9d, 0x51,
+	  0x08, 0x6a, 0xcf, 0x1b, 0x00, 0xf6, 0x34, 0xcf,
+	  0x1d, 0xde, 0x92, 0x33, 0xb8, 0xea, 0xaa, 0x3e },
+	{ 0x1b, 0x53, 0xee, 0x94, 0xaa, 0xf3, 0x4e, 0x4b,
+	  0x15, 0x9d, 0x48, 0xde, 0x35, 0x2c, 0x7f, 0x06,
+	  0x61, 0xd0, 0xa4, 0x0e, 0xdf, 0xf9, 0x5a, 0x0b,
+	  0x16, 0x39, 0xb4, 0x09, 0x0e, 0x97, 0x44, 0x72 },
+	{ 0x05, 0x70, 0x5e, 0x2a, 0x81, 0x75, 0x7c, 0x14,
+	  0xbd, 0x38, 0x3e, 0xa9, 0x8d, 0xda, 0x54, 0x4e,
+	  0xb1, 0x0e, 0x6b, 0xc0, 0x7b, 0xae, 0x43, 0x5e,
+	  0x25, 0x18, 0xdb, 0xe1, 0x33, 0x52, 0x53, 0x75 },
+	{ 0xd8, 0xb2, 0x86, 0x6e, 0x8a, 0x30, 0x9d, 0xb5,
+	  0x3e, 0x52, 0x9e, 0xc3, 0x29, 0x11, 0xd8, 0x2f,
+	  0x5c, 0xa1, 0x6c, 0xff, 0x76, 0x21, 0x68, 0x91,
+	  0xa9, 0x67, 0x6a, 0xa3, 0x1a, 0xaa, 0x6c, 0x42 },
+	{ 0xf5, 0x04, 0x1c, 0x24, 0x12, 0x70, 0xeb, 0x04,
+	  0xc7, 0x1e, 0xc2, 0xc9, 0x5d, 0x4c, 0x38, 0xd8,
+	  0x03, 0xb1, 0x23, 0x7b, 0x0f, 0x29, 0xfd, 0x4d,
+	  0xb3, 0xeb, 0x39, 0x76, 0x69, 0xe8, 0x86, 0x99 },
+	{ 0x9a, 0x4c, 0xe0, 0x77, 0xc3, 0x49, 0x32, 0x2f,
+	  0x59, 0x5e, 0x0e, 0xe7, 0x9e, 0xd0, 0xda, 0x5f,
+	  0xab, 0x66, 0x75, 0x2c, 0xbf, 0xef, 0x8f, 0x87,
+	  0xd0, 0xe9, 0xd0, 0x72, 0x3c, 0x75, 0x30, 0xdd },
+	{ 0x65, 0x7b, 0x09, 0xf3, 0xd0, 0xf5, 0x2b, 0x5b,
+	  0x8f, 0x2f, 0x97, 0x16, 0x3a, 0x0e, 0xdf, 0x0c,
+	  0x04, 0xf0, 0x75, 0x40, 0x8a, 0x07, 0xbb, 0xeb,
+	  0x3a, 0x41, 0x01, 0xa8, 0x91, 0x99, 0x0d, 0x62 },
+	{ 0x1e, 0x3f, 0x7b, 0xd5, 0xa5, 0x8f, 0xa5, 0x33,
+	  0x34, 0x4a, 0xa8, 0xed, 0x3a, 0xc1, 0x22, 0xbb,
+	  0x9e, 0x70, 0xd4, 0xef, 0x50, 0xd0, 0x04, 0x53,
+	  0x08, 0x21, 0x94, 0x8f, 0x5f, 0xe6, 0x31, 0x5a },
+	{ 0x80, 0xdc, 0xcf, 0x3f, 0xd8, 0x3d, 0xfd, 0x0d,
+	  0x35, 0xaa, 0x28, 0x58, 0x59, 0x22, 0xab, 0x89,
+	  0xd5, 0x31, 0x39, 0x97, 0x67, 0x3e, 0xaf, 0x90,
+	  0x5c, 0xea, 0x9c, 0x0b, 0x22, 0x5c, 0x7b, 0x5f },
+	{ 0x8a, 0x0d, 0x0f, 0xbf, 0x63, 0x77, 0xd8, 0x3b,
+	  0xb0, 0x8b, 0x51, 0x4b, 0x4b, 0x1c, 0x43, 0xac,
+	  0xc9, 0x5d, 0x75, 0x17, 0x14, 0xf8, 0x92, 0x56,
+	  0x45, 0xcb, 0x6b, 0xc8, 0x56, 0xca, 0x15, 0x0a },
+	{ 0x9f, 0xa5, 0xb4, 0x87, 0x73, 0x8a, 0xd2, 0x84,
+	  0x4c, 0xc6, 0x34, 0x8a, 0x90, 0x19, 0x18, 0xf6,
+	  0x59, 0xa3, 0xb8, 0x9e, 0x9c, 0x0d, 0xfe, 0xea,
+	  0xd3, 0x0d, 0xd9, 0x4b, 0xcf, 0x42, 0xef, 0x8e },
+	{ 0x80, 0x83, 0x2c, 0x4a, 0x16, 0x77, 0xf5, 0xea,
+	  0x25, 0x60, 0xf6, 0x68, 0xe9, 0x35, 0x4d, 0xd3,
+	  0x69, 0x97, 0xf0, 0x37, 0x28, 0xcf, 0xa5, 0x5e,
+	  0x1b, 0x38, 0x33, 0x7c, 0x0c, 0x9e, 0xf8, 0x18 },
+	{ 0xab, 0x37, 0xdd, 0xb6, 0x83, 0x13, 0x7e, 0x74,
+	  0x08, 0x0d, 0x02, 0x6b, 0x59, 0x0b, 0x96, 0xae,
+	  0x9b, 0xb4, 0x47, 0x72, 0x2f, 0x30, 0x5a, 0x5a,
+	  0xc5, 0x70, 0xec, 0x1d, 0xf9, 0xb1, 0x74, 0x3c },
+	{ 0x3e, 0xe7, 0x35, 0xa6, 0x94, 0xc2, 0x55, 0x9b,
+	  0x69, 0x3a, 0xa6, 0x86, 0x29, 0x36, 0x1e, 0x15,
+	  0xd1, 0x22, 0x65, 0xad, 0x6a, 0x3d, 0xed, 0xf4,
+	  0x88, 0xb0, 0xb0, 0x0f, 0xac, 0x97, 0x54, 0xba },
+	{ 0xd6, 0xfc, 0xd2, 0x32, 0x19, 0xb6, 0x47, 0xe4,
+	  0xcb, 0xd5, 0xeb, 0x2d, 0x0a, 0xd0, 0x1e, 0xc8,
+	  0x83, 0x8a, 0x4b, 0x29, 0x01, 0xfc, 0x32, 0x5c,
+	  0xc3, 0x70, 0x19, 0x81, 0xca, 0x6c, 0x88, 0x8b },
+	{ 0x05, 0x20, 0xec, 0x2f, 0x5b, 0xf7, 0xa7, 0x55,
+	  0xda, 0xcb, 0x50, 0xc6, 0xbf, 0x23, 0x3e, 0x35,
+	  0x15, 0x43, 0x47, 0x63, 0xdb, 0x01, 0x39, 0xcc,
+	  0xd9, 0xfa, 0xef, 0xbb, 0x82, 0x07, 0x61, 0x2d },
+	{ 0xaf, 0xf3, 0xb7, 0x5f, 0x3f, 0x58, 0x12, 0x64,
+	  0xd7, 0x66, 0x16, 0x62, 0xb9, 0x2f, 0x5a, 0xd3,
+	  0x7c, 0x1d, 0x32, 0xbd, 0x45, 0xff, 0x81, 0xa4,
+	  0xed, 0x8a, 0xdc, 0x9e, 0xf3, 0x0d, 0xd9, 0x89 },
+	{ 0xd0, 0xdd, 0x65, 0x0b, 0xef, 0xd3, 0xba, 0x63,
+	  0xdc, 0x25, 0x10, 0x2c, 0x62, 0x7c, 0x92, 0x1b,
+	  0x9c, 0xbe, 0xb0, 0xb1, 0x30, 0x68, 0x69, 0x35,
+	  0xb5, 0xc9, 0x27, 0xcb, 0x7c, 0xcd, 0x5e, 0x3b },
+	{ 0xe1, 0x14, 0x98, 0x16, 0xb1, 0x0a, 0x85, 0x14,
+	  0xfb, 0x3e, 0x2c, 0xab, 0x2c, 0x08, 0xbe, 0xe9,
+	  0xf7, 0x3c, 0xe7, 0x62, 0x21, 0x70, 0x12, 0x46,
+	  0xa5, 0x89, 0xbb, 0xb6, 0x73, 0x02, 0xd8, 0xa9 },
+	{ 0x7d, 0xa3, 0xf4, 0x41, 0xde, 0x90, 0x54, 0x31,
+	  0x7e, 0x72, 0xb5, 0xdb, 0xf9, 0x79, 0xda, 0x01,
+	  0xe6, 0xbc, 0xee, 0xbb, 0x84, 0x78, 0xea, 0xe6,
+	  0xa2, 0x28, 0x49, 0xd9, 0x02, 0x92, 0x63, 0x5c },
+	{ 0x12, 0x30, 0xb1, 0xfc, 0x8a, 0x7d, 0x92, 0x15,
+	  0xed, 0xc2, 0xd4, 0xa2, 0xde, 0xcb, 0xdd, 0x0a,
+	  0x6e, 0x21, 0x6c, 0x92, 0x42, 0x78, 0xc9, 0x1f,
+	  0xc5, 0xd1, 0x0e, 0x7d, 0x60, 0x19, 0x2d, 0x94 },
+	{ 0x57, 0x50, 0xd7, 0x16, 0xb4, 0x80, 0x8f, 0x75,
+	  0x1f, 0xeb, 0xc3, 0x88, 0x06, 0xba, 0x17, 0x0b,
+	  0xf6, 0xd5, 0x19, 0x9a, 0x78, 0x16, 0xbe, 0x51,
+	  0x4e, 0x3f, 0x93, 0x2f, 0xbe, 0x0c, 0xb8, 0x71 },
+	{ 0x6f, 0xc5, 0x9b, 0x2f, 0x10, 0xfe, 0xba, 0x95,
+	  0x4a, 0xa6, 0x82, 0x0b, 0x3c, 0xa9, 0x87, 0xee,
+	  0x81, 0xd5, 0xcc, 0x1d, 0xa3, 0xc6, 0x3c, 0xe8,
+	  0x27, 0x30, 0x1c, 0x56, 0x9d, 0xfb, 0x39, 0xce },
+	{ 0xc7, 0xc3, 0xfe, 0x1e, 0xeb, 0xdc, 0x7b, 0x5a,
+	  0x93, 0x93, 0x26, 0xe8, 0xdd, 0xb8, 0x3e, 0x8b,
+	  0xf2, 0xb7, 0x80, 0xb6, 0x56, 0x78, 0xcb, 0x62,
+	  0xf2, 0x08, 0xb0, 0x40, 0xab, 0xdd, 0x35, 0xe2 },
+	{ 0x0c, 0x75, 0xc1, 0xa1, 0x5c, 0xf3, 0x4a, 0x31,
+	  0x4e, 0xe4, 0x78, 0xf4, 0xa5, 0xce, 0x0b, 0x8a,
+	  0x6b, 0x36, 0x52, 0x8e, 0xf7, 0xa8, 0x20, 0x69,
+	  0x6c, 0x3e, 0x42, 0x46, 0xc5, 0xa1, 0x58, 0x64 },
+	{ 0x21, 0x6d, 0xc1, 0x2a, 0x10, 0x85, 0x69, 0xa3,
+	  0xc7, 0xcd, 0xde, 0x4a, 0xed, 0x43, 0xa6, 0xc3,
+	  0x30, 0x13, 0x9d, 0xda, 0x3c, 0xcc, 0x4a, 0x10,
+	  0x89, 0x05, 0xdb, 0x38, 0x61, 0x89, 0x90, 0x50 },
+	{ 0xa5, 0x7b, 0xe6, 0xae, 0x67, 0x56, 0xf2, 0x8b,
+	  0x02, 0xf5, 0x9d, 0xad, 0xf7, 0xe0, 0xd7, 0xd8,
+	  0x80, 0x7f, 0x10, 0xfa, 0x15, 0xce, 0xd1, 0xad,
+	  0x35, 0x85, 0x52, 0x1a, 0x1d, 0x99, 0x5a, 0x89 },
+	{ 0x81, 0x6a, 0xef, 0x87, 0x59, 0x53, 0x71, 0x6c,
+	  0xd7, 0xa5, 0x81, 0xf7, 0x32, 0xf5, 0x3d, 0xd4,
+	  0x35, 0xda, 0xb6, 0x6d, 0x09, 0xc3, 0x61, 0xd2,
+	  0xd6, 0x59, 0x2d, 0xe1, 0x77, 0x55, 0xd8, 0xa8 },
+	{ 0x9a, 0x76, 0x89, 0x32, 0x26, 0x69, 0x3b, 0x6e,
+	  0xa9, 0x7e, 0x6a, 0x73, 0x8f, 0x9d, 0x10, 0xfb,
+	  0x3d, 0x0b, 0x43, 0xae, 0x0e, 0x8b, 0x7d, 0x81,
+	  0x23, 0xea, 0x76, 0xce, 0x97, 0x98, 0x9c, 0x7e },
+	{ 0x8d, 0xae, 0xdb, 0x9a, 0x27, 0x15, 0x29, 0xdb,
+	  0xb7, 0xdc, 0x3b, 0x60, 0x7f, 0xe5, 0xeb, 0x2d,
+	  0x32, 0x11, 0x77, 0x07, 0x58, 0xdd, 0x3b, 0x0a,
+	  0x35, 0x93, 0xd2, 0xd7, 0x95, 0x4e, 0x2d, 0x5b },
+	{ 0x16, 0xdb, 0xc0, 0xaa, 0x5d, 0xd2, 0xc7, 0x74,
+	  0xf5, 0x05, 0x10, 0x0f, 0x73, 0x37, 0x86, 0xd8,
+	  0xa1, 0x75, 0xfc, 0xbb, 0xb5, 0x9c, 0x43, 0xe1,
+	  0xfb, 0xff, 0x3e, 0x1e, 0xaf, 0x31, 0xcb, 0x4a },
+	{ 0x86, 0x06, 0xcb, 0x89, 0x9c, 0x6a, 0xea, 0xf5,
+	  0x1b, 0x9d, 0xb0, 0xfe, 0x49, 0x24, 0xa9, 0xfd,
+	  0x5d, 0xab, 0xc1, 0x9f, 0x88, 0x26, 0xf2, 0xbc,
+	  0x1c, 0x1d, 0x7d, 0xa1, 0x4d, 0x2c, 0x2c, 0x99 },
+	{ 0x84, 0x79, 0x73, 0x1a, 0xed, 0xa5, 0x7b, 0xd3,
+	  0x7e, 0xad, 0xb5, 0x1a, 0x50, 0x7e, 0x30, 0x7f,
+	  0x3b, 0xd9, 0x5e, 0x69, 0xdb, 0xca, 0x94, 0xf3,
+	  0xbc, 0x21, 0x72, 0x60, 0x66, 0xad, 0x6d, 0xfd },
+	{ 0x58, 0x47, 0x3a, 0x9e, 0xa8, 0x2e, 0xfa, 0x3f,
+	  0x3b, 0x3d, 0x8f, 0xc8, 0x3e, 0xd8, 0x86, 0x31,
+	  0x27, 0xb3, 0x3a, 0xe8, 0xde, 0xae, 0x63, 0x07,
+	  0x20, 0x1e, 0xdb, 0x6d, 0xde, 0x61, 0xde, 0x29 },
+	{ 0x9a, 0x92, 0x55, 0xd5, 0x3a, 0xf1, 0x16, 0xde,
+	  0x8b, 0xa2, 0x7c, 0xe3, 0x5b, 0x4c, 0x7e, 0x15,
+	  0x64, 0x06, 0x57, 0xa0, 0xfc, 0xb8, 0x88, 0xc7,
+	  0x0d, 0x95, 0x43, 0x1d, 0xac, 0xd8, 0xf8, 0x30 },
+	{ 0x9e, 0xb0, 0x5f, 0xfb, 0xa3, 0x9f, 0xd8, 0x59,
+	  0x6a, 0x45, 0x49, 0x3e, 0x18, 0xd2, 0x51, 0x0b,
+	  0xf3, 0xef, 0x06, 0x5c, 0x51, 0xd6, 0xe1, 0x3a,
+	  0xbe, 0x66, 0xaa, 0x57, 0xe0, 0x5c, 0xfd, 0xb7 },
+	{ 0x81, 0xdc, 0xc3, 0xa5, 0x05, 0xea, 0xce, 0x3f,
+	  0x87, 0x9d, 0x8f, 0x70, 0x27, 0x76, 0x77, 0x0f,
+	  0x9d, 0xf5, 0x0e, 0x52, 0x1d, 0x14, 0x28, 0xa8,
+	  0x5d, 0xaf, 0x04, 0xf9, 0xad, 0x21, 0x50, 0xe0 },
+	{ 0xe3, 0xe3, 0xc4, 0xaa, 0x3a, 0xcb, 0xbc, 0x85,
+	  0x33, 0x2a, 0xf9, 0xd5, 0x64, 0xbc, 0x24, 0x16,
+	  0x5e, 0x16, 0x87, 0xf6, 0xb1, 0xad, 0xcb, 0xfa,
+	  0xe7, 0x7a, 0x8f, 0x03, 0xc7, 0x2a, 0xc2, 0x8c },
+	{ 0x67, 0x46, 0xc8, 0x0b, 0x4e, 0xb5, 0x6a, 0xea,
+	  0x45, 0xe6, 0x4e, 0x72, 0x89, 0xbb, 0xa3, 0xed,
+	  0xbf, 0x45, 0xec, 0xf8, 0x20, 0x64, 0x81, 0xff,
+	  0x63, 0x02, 0x12, 0x29, 0x84, 0xcd, 0x52, 0x6a },
+	{ 0x2b, 0x62, 0x8e, 0x52, 0x76, 0x4d, 0x7d, 0x62,
+	  0xc0, 0x86, 0x8b, 0x21, 0x23, 0x57, 0xcd, 0xd1,
+	  0x2d, 0x91, 0x49, 0x82, 0x2f, 0x4e, 0x98, 0x45,
+	  0xd9, 0x18, 0xa0, 0x8d, 0x1a, 0xe9, 0x90, 0xc0 },
+	{ 0xe4, 0xbf, 0xe8, 0x0d, 0x58, 0xc9, 0x19, 0x94,
+	  0x61, 0x39, 0x09, 0xdc, 0x4b, 0x1a, 0x12, 0x49,
+	  0x68, 0x96, 0xc0, 0x04, 0xaf, 0x7b, 0x57, 0x01,
+	  0x48, 0x3d, 0xe4, 0x5d, 0x28, 0x23, 0xd7, 0x8e },
+	{ 0xeb, 0xb4, 0xba, 0x15, 0x0c, 0xef, 0x27, 0x34,
+	  0x34, 0x5b, 0x5d, 0x64, 0x1b, 0xbe, 0xd0, 0x3a,
+	  0x21, 0xea, 0xfa, 0xe9, 0x33, 0xc9, 0x9e, 0x00,
+	  0x92, 0x12, 0xef, 0x04, 0x57, 0x4a, 0x85, 0x30 },
+	{ 0x39, 0x66, 0xec, 0x73, 0xb1, 0x54, 0xac, 0xc6,
+	  0x97, 0xac, 0x5c, 0xf5, 0xb2, 0x4b, 0x40, 0xbd,
+	  0xb0, 0xdb, 0x9e, 0x39, 0x88, 0x36, 0xd7, 0x6d,
+	  0x4b, 0x88, 0x0e, 0x3b, 0x2a, 0xf1, 0xaa, 0x27 },
+	{ 0xef, 0x7e, 0x48, 0x31, 0xb3, 0xa8, 0x46, 0x36,
+	  0x51, 0x8d, 0x6e, 0x4b, 0xfc, 0xe6, 0x4a, 0x43,
+	  0xdb, 0x2a, 0x5d, 0xda, 0x9c, 0xca, 0x2b, 0x44,
+	  0xf3, 0x90, 0x33, 0xbd, 0xc4, 0x0d, 0x62, 0x43 },
+	{ 0x7a, 0xbf, 0x6a, 0xcf, 0x5c, 0x8e, 0x54, 0x9d,
+	  0xdb, 0xb1, 0x5a, 0xe8, 0xd8, 0xb3, 0x88, 0xc1,
+	  0xc1, 0x97, 0xe6, 0x98, 0x73, 0x7c, 0x97, 0x85,
+	  0x50, 0x1e, 0xd1, 0xf9, 0x49, 0x30, 0xb7, 0xd9 },
+	{ 0x88, 0x01, 0x8d, 0xed, 0x66, 0x81, 0x3f, 0x0c,
+	  0xa9, 0x5d, 0xef, 0x47, 0x4c, 0x63, 0x06, 0x92,
+	  0x01, 0x99, 0x67, 0xb9, 0xe3, 0x68, 0x88, 0xda,
+	  0xdd, 0x94, 0x12, 0x47, 0x19, 0xb6, 0x82, 0xf6 },
+	{ 0x39, 0x30, 0x87, 0x6b, 0x9f, 0xc7, 0x52, 0x90,
+	  0x36, 0xb0, 0x08, 0xb1, 0xb8, 0xbb, 0x99, 0x75,
+	  0x22, 0xa4, 0x41, 0x63, 0x5a, 0x0c, 0x25, 0xec,
+	  0x02, 0xfb, 0x6d, 0x90, 0x26, 0xe5, 0x5a, 0x97 },
+	{ 0x0a, 0x40, 0x49, 0xd5, 0x7e, 0x83, 0x3b, 0x56,
+	  0x95, 0xfa, 0xc9, 0x3d, 0xd1, 0xfb, 0xef, 0x31,
+	  0x66, 0xb4, 0x4b, 0x12, 0xad, 0x11, 0x24, 0x86,
+	  0x62, 0x38, 0x3a, 0xe0, 0x51, 0xe1, 0x58, 0x27 },
+	{ 0x81, 0xdc, 0xc0, 0x67, 0x8b, 0xb6, 0xa7, 0x65,
+	  0xe4, 0x8c, 0x32, 0x09, 0x65, 0x4f, 0xe9, 0x00,
+	  0x89, 0xce, 0x44, 0xff, 0x56, 0x18, 0x47, 0x7e,
+	  0x39, 0xab, 0x28, 0x64, 0x76, 0xdf, 0x05, 0x2b },
+	{ 0xe6, 0x9b, 0x3a, 0x36, 0xa4, 0x46, 0x19, 0x12,
+	  0xdc, 0x08, 0x34, 0x6b, 0x11, 0xdd, 0xcb, 0x9d,
+	  0xb7, 0x96, 0xf8, 0x85, 0xfd, 0x01, 0x93, 0x6e,
+	  0x66, 0x2f, 0xe2, 0x92, 0x97, 0xb0, 0x99, 0xa4 },
+	{ 0x5a, 0xc6, 0x50, 0x3b, 0x0d, 0x8d, 0xa6, 0x91,
+	  0x76, 0x46, 0xe6, 0xdc, 0xc8, 0x7e, 0xdc, 0x58,
+	  0xe9, 0x42, 0x45, 0x32, 0x4c, 0xc2, 0x04, 0xf4,
+	  0xdd, 0x4a, 0xf0, 0x15, 0x63, 0xac, 0xd4, 0x27 },
+	{ 0xdf, 0x6d, 0xda, 0x21, 0x35, 0x9a, 0x30, 0xbc,
+	  0x27, 0x17, 0x80, 0x97, 0x1c, 0x1a, 0xbd, 0x56,
+	  0xa6, 0xef, 0x16, 0x7e, 0x48, 0x08, 0x87, 0x88,
+	  0x8e, 0x73, 0xa8, 0x6d, 0x3b, 0xf6, 0x05, 0xe9 },
+	{ 0xe8, 0xe6, 0xe4, 0x70, 0x71, 0xe7, 0xb7, 0xdf,
+	  0x25, 0x80, 0xf2, 0x25, 0xcf, 0xbb, 0xed, 0xf8,
+	  0x4c, 0xe6, 0x77, 0x46, 0x62, 0x66, 0x28, 0xd3,
+	  0x30, 0x97, 0xe4, 0xb7, 0xdc, 0x57, 0x11, 0x07 },
+	{ 0x53, 0xe4, 0x0e, 0xad, 0x62, 0x05, 0x1e, 0x19,
+	  0xcb, 0x9b, 0xa8, 0x13, 0x3e, 0x3e, 0x5c, 0x1c,
+	  0xe0, 0x0d, 0xdc, 0xad, 0x8a, 0xcf, 0x34, 0x2a,
+	  0x22, 0x43, 0x60, 0xb0, 0xac, 0xc1, 0x47, 0x77 },
+	{ 0x9c, 0xcd, 0x53, 0xfe, 0x80, 0xbe, 0x78, 0x6a,
+	  0xa9, 0x84, 0x63, 0x84, 0x62, 0xfb, 0x28, 0xaf,
+	  0xdf, 0x12, 0x2b, 0x34, 0xd7, 0x8f, 0x46, 0x87,
+	  0xec, 0x63, 0x2b, 0xb1, 0x9d, 0xe2, 0x37, 0x1a },
+	{ 0xcb, 0xd4, 0x80, 0x52, 0xc4, 0x8d, 0x78, 0x84,
+	  0x66, 0xa3, 0xe8, 0x11, 0x8c, 0x56, 0xc9, 0x7f,
+	  0xe1, 0x46, 0xe5, 0x54, 0x6f, 0xaa, 0xf9, 0x3e,
+	  0x2b, 0xc3, 0xc4, 0x7e, 0x45, 0x93, 0x97, 0x53 },
+	{ 0x25, 0x68, 0x83, 0xb1, 0x4e, 0x2a, 0xf4, 0x4d,
+	  0xad, 0xb2, 0x8e, 0x1b, 0x34, 0xb2, 0xac, 0x0f,
+	  0x0f, 0x4c, 0x91, 0xc3, 0x4e, 0xc9, 0x16, 0x9e,
+	  0x29, 0x03, 0x61, 0x58, 0xac, 0xaa, 0x95, 0xb9 },
+	{ 0x44, 0x71, 0xb9, 0x1a, 0xb4, 0x2d, 0xb7, 0xc4,
+	  0xdd, 0x84, 0x90, 0xab, 0x95, 0xa2, 0xee, 0x8d,
+	  0x04, 0xe3, 0xef, 0x5c, 0x3d, 0x6f, 0xc7, 0x1a,
+	  0xc7, 0x4b, 0x2b, 0x26, 0x91, 0x4d, 0x16, 0x41 },
+	{ 0xa5, 0xeb, 0x08, 0x03, 0x8f, 0x8f, 0x11, 0x55,
+	  0xed, 0x86, 0xe6, 0x31, 0x90, 0x6f, 0xc1, 0x30,
+	  0x95, 0xf6, 0xbb, 0xa4, 0x1d, 0xe5, 0xd4, 0xe7,
+	  0x95, 0x75, 0x8e, 0xc8, 0xc8, 0xdf, 0x8a, 0xf1 },
+	{ 0xdc, 0x1d, 0xb6, 0x4e, 0xd8, 0xb4, 0x8a, 0x91,
+	  0x0e, 0x06, 0x0a, 0x6b, 0x86, 0x63, 0x74, 0xc5,
+	  0x78, 0x78, 0x4e, 0x9a, 0xc4, 0x9a, 0xb2, 0x77,
+	  0x40, 0x92, 0xac, 0x71, 0x50, 0x19, 0x34, 0xac },
+	{ 0x28, 0x54, 0x13, 0xb2, 0xf2, 0xee, 0x87, 0x3d,
+	  0x34, 0x31, 0x9e, 0xe0, 0xbb, 0xfb, 0xb9, 0x0f,
+	  0x32, 0xda, 0x43, 0x4c, 0xc8, 0x7e, 0x3d, 0xb5,
+	  0xed, 0x12, 0x1b, 0xb3, 0x98, 0xed, 0x96, 0x4b },
+	{ 0x02, 0x16, 0xe0, 0xf8, 0x1f, 0x75, 0x0f, 0x26,
+	  0xf1, 0x99, 0x8b, 0xc3, 0x93, 0x4e, 0x3e, 0x12,
+	  0x4c, 0x99, 0x45, 0xe6, 0x85, 0xa6, 0x0b, 0x25,
+	  0xe8, 0xfb, 0xd9, 0x62, 0x5a, 0xb6, 0xb5, 0x99 },
+	{ 0x38, 0xc4, 0x10, 0xf5, 0xb9, 0xd4, 0x07, 0x20,
+	  0x50, 0x75, 0x5b, 0x31, 0xdc, 0xa8, 0x9f, 0xd5,
+	  0x39, 0x5c, 0x67, 0x85, 0xee, 0xb3, 0xd7, 0x90,
+	  0xf3, 0x20, 0xff, 0x94, 0x1c, 0x5a, 0x93, 0xbf },
+	{ 0xf1, 0x84, 0x17, 0xb3, 0x9d, 0x61, 0x7a, 0xb1,
+	  0xc1, 0x8f, 0xdf, 0x91, 0xeb, 0xd0, 0xfc, 0x6d,
+	  0x55, 0x16, 0xbb, 0x34, 0xcf, 0x39, 0x36, 0x40,
+	  0x37, 0xbc, 0xe8, 0x1f, 0xa0, 0x4c, 0xec, 0xb1 },
+	{ 0x1f, 0xa8, 0x77, 0xde, 0x67, 0x25, 0x9d, 0x19,
+	  0x86, 0x3a, 0x2a, 0x34, 0xbc, 0xc6, 0x96, 0x2a,
+	  0x2b, 0x25, 0xfc, 0xbf, 0x5c, 0xbe, 0xcd, 0x7e,
+	  0xde, 0x8f, 0x1f, 0xa3, 0x66, 0x88, 0xa7, 0x96 },
+	{ 0x5b, 0xd1, 0x69, 0xe6, 0x7c, 0x82, 0xc2, 0xc2,
+	  0xe9, 0x8e, 0xf7, 0x00, 0x8b, 0xdf, 0x26, 0x1f,
+	  0x2d, 0xdf, 0x30, 0xb1, 0xc0, 0x0f, 0x9e, 0x7f,
+	  0x27, 0x5b, 0xb3, 0xe8, 0xa2, 0x8d, 0xc9, 0xa2 },
+	{ 0xc8, 0x0a, 0xbe, 0xeb, 0xb6, 0x69, 0xad, 0x5d,
+	  0xee, 0xb5, 0xf5, 0xec, 0x8e, 0xa6, 0xb7, 0xa0,
+	  0x5d, 0xdf, 0x7d, 0x31, 0xec, 0x4c, 0x0a, 0x2e,
+	  0xe2, 0x0b, 0x0b, 0x98, 0xca, 0xec, 0x67, 0x46 },
+	{ 0xe7, 0x6d, 0x3f, 0xbd, 0xa5, 0xba, 0x37, 0x4e,
+	  0x6b, 0xf8, 0xe5, 0x0f, 0xad, 0xc3, 0xbb, 0xb9,
+	  0xba, 0x5c, 0x20, 0x6e, 0xbd, 0xec, 0x89, 0xa3,
+	  0xa5, 0x4c, 0xf3, 0xdd, 0x84, 0xa0, 0x70, 0x16 },
+	{ 0x7b, 0xba, 0x9d, 0xc5, 0xb5, 0xdb, 0x20, 0x71,
+	  0xd1, 0x77, 0x52, 0xb1, 0x04, 0x4c, 0x1e, 0xce,
+	  0xd9, 0x6a, 0xaf, 0x2d, 0xd4, 0x6e, 0x9b, 0x43,
+	  0x37, 0x50, 0xe8, 0xea, 0x0d, 0xcc, 0x18, 0x70 },
+	{ 0xf2, 0x9b, 0x1b, 0x1a, 0xb9, 0xba, 0xb1, 0x63,
+	  0x01, 0x8e, 0xe3, 0xda, 0x15, 0x23, 0x2c, 0xca,
+	  0x78, 0xec, 0x52, 0xdb, 0xc3, 0x4e, 0xda, 0x5b,
+	  0x82, 0x2e, 0xc1, 0xd8, 0x0f, 0xc2, 0x1b, 0xd0 },
+	{ 0x9e, 0xe3, 0xe3, 0xe7, 0xe9, 0x00, 0xf1, 0xe1,
+	  0x1d, 0x30, 0x8c, 0x4b, 0x2b, 0x30, 0x76, 0xd2,
+	  0x72, 0xcf, 0x70, 0x12, 0x4f, 0x9f, 0x51, 0xe1,
+	  0xda, 0x60, 0xf3, 0x78, 0x46, 0xcd, 0xd2, 0xf4 },
+	{ 0x70, 0xea, 0x3b, 0x01, 0x76, 0x92, 0x7d, 0x90,
+	  0x96, 0xa1, 0x85, 0x08, 0xcd, 0x12, 0x3a, 0x29,
+	  0x03, 0x25, 0x92, 0x0a, 0x9d, 0x00, 0xa8, 0x9b,
+	  0x5d, 0xe0, 0x42, 0x73, 0xfb, 0xc7, 0x6b, 0x85 },
+	{ 0x67, 0xde, 0x25, 0xc0, 0x2a, 0x4a, 0xab, 0xa2,
+	  0x3b, 0xdc, 0x97, 0x3c, 0x8b, 0xb0, 0xb5, 0x79,
+	  0x6d, 0x47, 0xcc, 0x06, 0x59, 0xd4, 0x3d, 0xff,
+	  0x1f, 0x97, 0xde, 0x17, 0x49, 0x63, 0xb6, 0x8e },
+	{ 0xb2, 0x16, 0x8e, 0x4e, 0x0f, 0x18, 0xb0, 0xe6,
+	  0x41, 0x00, 0xb5, 0x17, 0xed, 0x95, 0x25, 0x7d,
+	  0x73, 0xf0, 0x62, 0x0d, 0xf8, 0x85, 0xc1, 0x3d,
+	  0x2e, 0xcf, 0x79, 0x36, 0x7b, 0x38, 0x4c, 0xee },
+	{ 0x2e, 0x7d, 0xec, 0x24, 0x28, 0x85, 0x3b, 0x2c,
+	  0x71, 0x76, 0x07, 0x45, 0x54, 0x1f, 0x7a, 0xfe,
+	  0x98, 0x25, 0xb5, 0xdd, 0x77, 0xdf, 0x06, 0x51,
+	  0x1d, 0x84, 0x41, 0xa9, 0x4b, 0xac, 0xc9, 0x27 },
+	{ 0xca, 0x9f, 0xfa, 0xc4, 0xc4, 0x3f, 0x0b, 0x48,
+	  0x46, 0x1d, 0xc5, 0xc2, 0x63, 0xbe, 0xa3, 0xf6,
+	  0xf0, 0x06, 0x11, 0xce, 0xac, 0xab, 0xf6, 0xf8,
+	  0x95, 0xba, 0x2b, 0x01, 0x01, 0xdb, 0xb6, 0x8d },
+	{ 0x74, 0x10, 0xd4, 0x2d, 0x8f, 0xd1, 0xd5, 0xe9,
+	  0xd2, 0xf5, 0x81, 0x5c, 0xb9, 0x34, 0x17, 0x99,
+	  0x88, 0x28, 0xef, 0x3c, 0x42, 0x30, 0xbf, 0xbd,
+	  0x41, 0x2d, 0xf0, 0xa4, 0xa7, 0xa2, 0x50, 0x7a },
+	{ 0x50, 0x10, 0xf6, 0x84, 0x51, 0x6d, 0xcc, 0xd0,
+	  0xb6, 0xee, 0x08, 0x52, 0xc2, 0x51, 0x2b, 0x4d,
+	  0xc0, 0x06, 0x6c, 0xf0, 0xd5, 0x6f, 0x35, 0x30,
+	  0x29, 0x78, 0xdb, 0x8a, 0xe3, 0x2c, 0x6a, 0x81 },
+	{ 0xac, 0xaa, 0xb5, 0x85, 0xf7, 0xb7, 0x9b, 0x71,
+	  0x99, 0x35, 0xce, 0xb8, 0x95, 0x23, 0xdd, 0xc5,
+	  0x48, 0x27, 0xf7, 0x5c, 0x56, 0x88, 0x38, 0x56,
+	  0x15, 0x4a, 0x56, 0xcd, 0xcd, 0x5e, 0xe9, 0x88 },
+	{ 0x66, 0x6d, 0xe5, 0xd1, 0x44, 0x0f, 0xee, 0x73,
+	  0x31, 0xaa, 0xf0, 0x12, 0x3a, 0x62, 0xef, 0x2d,
+	  0x8b, 0xa5, 0x74, 0x53, 0xa0, 0x76, 0x96, 0x35,
+	  0xac, 0x6c, 0xd0, 0x1e, 0x63, 0x3f, 0x77, 0x12 },
+	{ 0xa6, 0xf9, 0x86, 0x58, 0xf6, 0xea, 0xba, 0xf9,
+	  0x02, 0xd8, 0xb3, 0x87, 0x1a, 0x4b, 0x10, 0x1d,
+	  0x16, 0x19, 0x6e, 0x8a, 0x4b, 0x24, 0x1e, 0x15,
+	  0x58, 0xfe, 0x29, 0x96, 0x6e, 0x10, 0x3e, 0x8d },
+	{ 0x89, 0x15, 0x46, 0xa8, 0xb2, 0x9f, 0x30, 0x47,
+	  0xdd, 0xcf, 0xe5, 0xb0, 0x0e, 0x45, 0xfd, 0x55,
+	  0x75, 0x63, 0x73, 0x10, 0x5e, 0xa8, 0x63, 0x7d,
+	  0xfc, 0xff, 0x54, 0x7b, 0x6e, 0xa9, 0x53, 0x5f },
+	{ 0x18, 0xdf, 0xbc, 0x1a, 0xc5, 0xd2, 0x5b, 0x07,
+	  0x61, 0x13, 0x7d, 0xbd, 0x22, 0xc1, 0x7c, 0x82,
+	  0x9d, 0x0f, 0x0e, 0xf1, 0xd8, 0x23, 0x44, 0xe9,
+	  0xc8, 0x9c, 0x28, 0x66, 0x94, 0xda, 0x24, 0xe8 },
+	{ 0xb5, 0x4b, 0x9b, 0x67, 0xf8, 0xfe, 0xd5, 0x4b,
+	  0xbf, 0x5a, 0x26, 0x66, 0xdb, 0xdf, 0x4b, 0x23,
+	  0xcf, 0xf1, 0xd1, 0xb6, 0xf4, 0xaf, 0xc9, 0x85,
+	  0xb2, 0xe6, 0xd3, 0x30, 0x5a, 0x9f, 0xf8, 0x0f },
+	{ 0x7d, 0xb4, 0x42, 0xe1, 0x32, 0xba, 0x59, 0xbc,
+	  0x12, 0x89, 0xaa, 0x98, 0xb0, 0xd3, 0xe8, 0x06,
+	  0x00, 0x4f, 0x8e, 0xc1, 0x28, 0x11, 0xaf, 0x1e,
+	  0x2e, 0x33, 0xc6, 0x9b, 0xfd, 0xe7, 0x29, 0xe1 },
+	{ 0x25, 0x0f, 0x37, 0xcd, 0xc1, 0x5e, 0x81, 0x7d,
+	  0x2f, 0x16, 0x0d, 0x99, 0x56, 0xc7, 0x1f, 0xe3,
+	  0xeb, 0x5d, 0xb7, 0x45, 0x56, 0xe4, 0xad, 0xf9,
+	  0xa4, 0xff, 0xaf, 0xba, 0x74, 0x01, 0x03, 0x96 },
+	{ 0x4a, 0xb8, 0xa3, 0xdd, 0x1d, 0xdf, 0x8a, 0xd4,
+	  0x3d, 0xab, 0x13, 0xa2, 0x7f, 0x66, 0xa6, 0x54,
+	  0x4f, 0x29, 0x05, 0x97, 0xfa, 0x96, 0x04, 0x0e,
+	  0x0e, 0x1d, 0xb9, 0x26, 0x3a, 0xa4, 0x79, 0xf8 },
+	{ 0xee, 0x61, 0x72, 0x7a, 0x07, 0x66, 0xdf, 0x93,
+	  0x9c, 0xcd, 0xc8, 0x60, 0x33, 0x40, 0x44, 0xc7,
+	  0x9a, 0x3c, 0x9b, 0x15, 0x62, 0x00, 0xbc, 0x3a,
+	  0xa3, 0x29, 0x73, 0x48, 0x3d, 0x83, 0x41, 0xae },
+	{ 0x3f, 0x68, 0xc7, 0xec, 0x63, 0xac, 0x11, 0xeb,
+	  0xb9, 0x8f, 0x94, 0xb3, 0x39, 0xb0, 0x5c, 0x10,
+	  0x49, 0x84, 0xfd, 0xa5, 0x01, 0x03, 0x06, 0x01,
+	  0x44, 0xe5, 0xa2, 0xbf, 0xcc, 0xc9, 0xda, 0x95 },
+	{ 0x05, 0x6f, 0x29, 0x81, 0x6b, 0x8a, 0xf8, 0xf5,
+	  0x66, 0x82, 0xbc, 0x4d, 0x7c, 0xf0, 0x94, 0x11,
+	  0x1d, 0xa7, 0x73, 0x3e, 0x72, 0x6c, 0xd1, 0x3d,
+	  0x6b, 0x3e, 0x8e, 0xa0, 0x3e, 0x92, 0xa0, 0xd5 },
+	{ 0xf5, 0xec, 0x43, 0xa2, 0x8a, 0xcb, 0xef, 0xf1,
+	  0xf3, 0x31, 0x8a, 0x5b, 0xca, 0xc7, 0xc6, 0x6d,
+	  0xdb, 0x52, 0x30, 0xb7, 0x9d, 0xb2, 0xd1, 0x05,
+	  0xbc, 0xbe, 0x15, 0xf3, 0xc1, 0x14, 0x8d, 0x69 },
+	{ 0x2a, 0x69, 0x60, 0xad, 0x1d, 0x8d, 0xd5, 0x47,
+	  0x55, 0x5c, 0xfb, 0xd5, 0xe4, 0x60, 0x0f, 0x1e,
+	  0xaa, 0x1c, 0x8e, 0xda, 0x34, 0xde, 0x03, 0x74,
+	  0xec, 0x4a, 0x26, 0xea, 0xaa, 0xa3, 0x3b, 0x4e },
+	{ 0xdc, 0xc1, 0xea, 0x7b, 0xaa, 0xb9, 0x33, 0x84,
+	  0xf7, 0x6b, 0x79, 0x68, 0x66, 0x19, 0x97, 0x54,
+	  0x74, 0x2f, 0x7b, 0x96, 0xd6, 0xb4, 0xc1, 0x20,
+	  0x16, 0x5c, 0x04, 0xa6, 0xc4, 0xf5, 0xce, 0x10 },
+	{ 0x13, 0xd5, 0xdf, 0x17, 0x92, 0x21, 0x37, 0x9c,
+	  0x6a, 0x78, 0xc0, 0x7c, 0x79, 0x3f, 0xf5, 0x34,
+	  0x87, 0xca, 0xe6, 0xbf, 0x9f, 0xe8, 0x82, 0x54,
+	  0x1a, 0xb0, 0xe7, 0x35, 0xe3, 0xea, 0xda, 0x3b },
+	{ 0x8c, 0x59, 0xe4, 0x40, 0x76, 0x41, 0xa0, 0x1e,
+	  0x8f, 0xf9, 0x1f, 0x99, 0x80, 0xdc, 0x23, 0x6f,
+	  0x4e, 0xcd, 0x6f, 0xcf, 0x52, 0x58, 0x9a, 0x09,
+	  0x9a, 0x96, 0x16, 0x33, 0x96, 0x77, 0x14, 0xe1 },
+	{ 0x83, 0x3b, 0x1a, 0xc6, 0xa2, 0x51, 0xfd, 0x08,
+	  0xfd, 0x6d, 0x90, 0x8f, 0xea, 0x2a, 0x4e, 0xe1,
+	  0xe0, 0x40, 0xbc, 0xa9, 0x3f, 0xc1, 0xa3, 0x8e,
+	  0xc3, 0x82, 0x0e, 0x0c, 0x10, 0xbd, 0x82, 0xea },
+	{ 0xa2, 0x44, 0xf9, 0x27, 0xf3, 0xb4, 0x0b, 0x8f,
+	  0x6c, 0x39, 0x15, 0x70, 0xc7, 0x65, 0x41, 0x8f,
+	  0x2f, 0x6e, 0x70, 0x8e, 0xac, 0x90, 0x06, 0xc5,
+	  0x1a, 0x7f, 0xef, 0xf4, 0xaf, 0x3b, 0x2b, 0x9e },
+	{ 0x3d, 0x99, 0xed, 0x95, 0x50, 0xcf, 0x11, 0x96,
+	  0xe6, 0xc4, 0xd2, 0x0c, 0x25, 0x96, 0x20, 0xf8,
+	  0x58, 0xc3, 0xd7, 0x03, 0x37, 0x4c, 0x12, 0x8c,
+	  0xe7, 0xb5, 0x90, 0x31, 0x0c, 0x83, 0x04, 0x6d },
+	{ 0x2b, 0x35, 0xc4, 0x7d, 0x7b, 0x87, 0x76, 0x1f,
+	  0x0a, 0xe4, 0x3a, 0xc5, 0x6a, 0xc2, 0x7b, 0x9f,
+	  0x25, 0x83, 0x03, 0x67, 0xb5, 0x95, 0xbe, 0x8c,
+	  0x24, 0x0e, 0x94, 0x60, 0x0c, 0x6e, 0x33, 0x12 },
+	{ 0x5d, 0x11, 0xed, 0x37, 0xd2, 0x4d, 0xc7, 0x67,
+	  0x30, 0x5c, 0xb7, 0xe1, 0x46, 0x7d, 0x87, 0xc0,
+	  0x65, 0xac, 0x4b, 0xc8, 0xa4, 0x26, 0xde, 0x38,
+	  0x99, 0x1f, 0xf5, 0x9a, 0xa8, 0x73, 0x5d, 0x02 },
+	{ 0xb8, 0x36, 0x47, 0x8e, 0x1c, 0xa0, 0x64, 0x0d,
+	  0xce, 0x6f, 0xd9, 0x10, 0xa5, 0x09, 0x62, 0x72,
+	  0xc8, 0x33, 0x09, 0x90, 0xcd, 0x97, 0x86, 0x4a,
+	  0xc2, 0xbf, 0x14, 0xef, 0x6b, 0x23, 0x91, 0x4a },
+	{ 0x91, 0x00, 0xf9, 0x46, 0xd6, 0xcc, 0xde, 0x3a,
+	  0x59, 0x7f, 0x90, 0xd3, 0x9f, 0xc1, 0x21, 0x5b,
+	  0xad, 0xdc, 0x74, 0x13, 0x64, 0x3d, 0x85, 0xc2,
+	  0x1c, 0x3e, 0xee, 0x5d, 0x2d, 0xd3, 0x28, 0x94 },
+	{ 0xda, 0x70, 0xee, 0xdd, 0x23, 0xe6, 0x63, 0xaa,
+	  0x1a, 0x74, 0xb9, 0x76, 0x69, 0x35, 0xb4, 0x79,
+	  0x22, 0x2a, 0x72, 0xaf, 0xba, 0x5c, 0x79, 0x51,
+	  0x58, 0xda, 0xd4, 0x1a, 0x3b, 0xd7, 0x7e, 0x40 },
+	{ 0xf0, 0x67, 0xed, 0x6a, 0x0d, 0xbd, 0x43, 0xaa,
+	  0x0a, 0x92, 0x54, 0xe6, 0x9f, 0xd6, 0x6b, 0xdd,
+	  0x8a, 0xcb, 0x87, 0xde, 0x93, 0x6c, 0x25, 0x8c,
+	  0xfb, 0x02, 0x28, 0x5f, 0x2c, 0x11, 0xfa, 0x79 },
+	{ 0x71, 0x5c, 0x99, 0xc7, 0xd5, 0x75, 0x80, 0xcf,
+	  0x97, 0x53, 0xb4, 0xc1, 0xd7, 0x95, 0xe4, 0x5a,
+	  0x83, 0xfb, 0xb2, 0x28, 0xc0, 0xd3, 0x6f, 0xbe,
+	  0x20, 0xfa, 0xf3, 0x9b, 0xdd, 0x6d, 0x4e, 0x85 },
+	{ 0xe4, 0x57, 0xd6, 0xad, 0x1e, 0x67, 0xcb, 0x9b,
+	  0xbd, 0x17, 0xcb, 0xd6, 0x98, 0xfa, 0x6d, 0x7d,
+	  0xae, 0x0c, 0x9b, 0x7a, 0xd6, 0xcb, 0xd6, 0x53,
+	  0x96, 0x34, 0xe3, 0x2a, 0x71, 0x9c, 0x84, 0x92 },
+	{ 0xec, 0xe3, 0xea, 0x81, 0x03, 0xe0, 0x24, 0x83,
+	  0xc6, 0x4a, 0x70, 0xa4, 0xbd, 0xce, 0xe8, 0xce,
+	  0xb6, 0x27, 0x8f, 0x25, 0x33, 0xf3, 0xf4, 0x8d,
+	  0xbe, 0xed, 0xfb, 0xa9, 0x45, 0x31, 0xd4, 0xae },
+	{ 0x38, 0x8a, 0xa5, 0xd3, 0x66, 0x7a, 0x97, 0xc6,
+	  0x8d, 0x3d, 0x56, 0xf8, 0xf3, 0xee, 0x8d, 0x3d,
+	  0x36, 0x09, 0x1f, 0x17, 0xfe, 0x5d, 0x1b, 0x0d,
+	  0x5d, 0x84, 0xc9, 0x3b, 0x2f, 0xfe, 0x40, 0xbd },
+	{ 0x8b, 0x6b, 0x31, 0xb9, 0xad, 0x7c, 0x3d, 0x5c,
+	  0xd8, 0x4b, 0xf9, 0x89, 0x47, 0xb9, 0xcd, 0xb5,
+	  0x9d, 0xf8, 0xa2, 0x5f, 0xf7, 0x38, 0x10, 0x10,
+	  0x13, 0xbe, 0x4f, 0xd6, 0x5e, 0x1d, 0xd1, 0xa3 },
+	{ 0x06, 0x62, 0x91, 0xf6, 0xbb, 0xd2, 0x5f, 0x3c,
+	  0x85, 0x3d, 0xb7, 0xd8, 0xb9, 0x5c, 0x9a, 0x1c,
+	  0xfb, 0x9b, 0xf1, 0xc1, 0xc9, 0x9f, 0xb9, 0x5a,
+	  0x9b, 0x78, 0x69, 0xd9, 0x0f, 0x1c, 0x29, 0x03 },
+	{ 0xa7, 0x07, 0xef, 0xbc, 0xcd, 0xce, 0xed, 0x42,
+	  0x96, 0x7a, 0x66, 0xf5, 0x53, 0x9b, 0x93, 0xed,
+	  0x75, 0x60, 0xd4, 0x67, 0x30, 0x40, 0x16, 0xc4,
+	  0x78, 0x0d, 0x77, 0x55, 0xa5, 0x65, 0xd4, 0xc4 },
+	{ 0x38, 0xc5, 0x3d, 0xfb, 0x70, 0xbe, 0x7e, 0x79,
+	  0x2b, 0x07, 0xa6, 0xa3, 0x5b, 0x8a, 0x6a, 0x0a,
+	  0xba, 0x02, 0xc5, 0xc5, 0xf3, 0x8b, 0xaf, 0x5c,
+	  0x82, 0x3f, 0xdf, 0xd9, 0xe4, 0x2d, 0x65, 0x7e },
+	{ 0xf2, 0x91, 0x13, 0x86, 0x50, 0x1d, 0x9a, 0xb9,
+	  0xd7, 0x20, 0xcf, 0x8a, 0xd1, 0x05, 0x03, 0xd5,
+	  0x63, 0x4b, 0xf4, 0xb7, 0xd1, 0x2b, 0x56, 0xdf,
+	  0xb7, 0x4f, 0xec, 0xc6, 0xe4, 0x09, 0x3f, 0x68 },
+	{ 0xc6, 0xf2, 0xbd, 0xd5, 0x2b, 0x81, 0xe6, 0xe4,
+	  0xf6, 0x59, 0x5a, 0xbd, 0x4d, 0x7f, 0xb3, 0x1f,
+	  0x65, 0x11, 0x69, 0xd0, 0x0f, 0xf3, 0x26, 0x92,
+	  0x6b, 0x34, 0x94, 0x7b, 0x28, 0xa8, 0x39, 0x59 },
+	{ 0x29, 0x3d, 0x94, 0xb1, 0x8c, 0x98, 0xbb, 0x32,
+	  0x23, 0x36, 0x6b, 0x8c, 0xe7, 0x4c, 0x28, 0xfb,
+	  0xdf, 0x28, 0xe1, 0xf8, 0x4a, 0x33, 0x50, 0xb0,
+	  0xeb, 0x2d, 0x18, 0x04, 0xa5, 0x77, 0x57, 0x9b },
+	{ 0x2c, 0x2f, 0xa5, 0xc0, 0xb5, 0x15, 0x33, 0x16,
+	  0x5b, 0xc3, 0x75, 0xc2, 0x2e, 0x27, 0x81, 0x76,
+	  0x82, 0x70, 0xa3, 0x83, 0x98, 0x5d, 0x13, 0xbd,
+	  0x6b, 0x67, 0xb6, 0xfd, 0x67, 0xf8, 0x89, 0xeb },
+	{ 0xca, 0xa0, 0x9b, 0x82, 0xb7, 0x25, 0x62, 0xe4,
+	  0x3f, 0x4b, 0x22, 0x75, 0xc0, 0x91, 0x91, 0x8e,
+	  0x62, 0x4d, 0x91, 0x16, 0x61, 0xcc, 0x81, 0x1b,
+	  0xb5, 0xfa, 0xec, 0x51, 0xf6, 0x08, 0x8e, 0xf7 },
+	{ 0x24, 0x76, 0x1e, 0x45, 0xe6, 0x74, 0x39, 0x53,
+	  0x79, 0xfb, 0x17, 0x72, 0x9c, 0x78, 0xcb, 0x93,
+	  0x9e, 0x6f, 0x74, 0xc5, 0xdf, 0xfb, 0x9c, 0x96,
+	  0x1f, 0x49, 0x59, 0x82, 0xc3, 0xed, 0x1f, 0xe3 },
+	{ 0x55, 0xb7, 0x0a, 0x82, 0x13, 0x1e, 0xc9, 0x48,
+	  0x88, 0xd7, 0xab, 0x54, 0xa7, 0xc5, 0x15, 0x25,
+	  0x5c, 0x39, 0x38, 0xbb, 0x10, 0xbc, 0x78, 0x4d,
+	  0xc9, 0xb6, 0x7f, 0x07, 0x6e, 0x34, 0x1a, 0x73 },
+	{ 0x6a, 0xb9, 0x05, 0x7b, 0x97, 0x7e, 0xbc, 0x3c,
+	  0xa4, 0xd4, 0xce, 0x74, 0x50, 0x6c, 0x25, 0xcc,
+	  0xcd, 0xc5, 0x66, 0x49, 0x7c, 0x45, 0x0b, 0x54,
+	  0x15, 0xa3, 0x94, 0x86, 0xf8, 0x65, 0x7a, 0x03 },
+	{ 0x24, 0x06, 0x6d, 0xee, 0xe0, 0xec, 0xee, 0x15,
+	  0xa4, 0x5f, 0x0a, 0x32, 0x6d, 0x0f, 0x8d, 0xbc,
+	  0x79, 0x76, 0x1e, 0xbb, 0x93, 0xcf, 0x8c, 0x03,
+	  0x77, 0xaf, 0x44, 0x09, 0x78, 0xfc, 0xf9, 0x94 },
+	{ 0x20, 0x00, 0x0d, 0x3f, 0x66, 0xba, 0x76, 0x86,
+	  0x0d, 0x5a, 0x95, 0x06, 0x88, 0xb9, 0xaa, 0x0d,
+	  0x76, 0xcf, 0xea, 0x59, 0xb0, 0x05, 0xd8, 0x59,
+	  0x91, 0x4b, 0x1a, 0x46, 0x65, 0x3a, 0x93, 0x9b },
+	{ 0xb9, 0x2d, 0xaa, 0x79, 0x60, 0x3e, 0x3b, 0xdb,
+	  0xc3, 0xbf, 0xe0, 0xf4, 0x19, 0xe4, 0x09, 0xb2,
+	  0xea, 0x10, 0xdc, 0x43, 0x5b, 0xee, 0xfe, 0x29,
+	  0x59, 0xda, 0x16, 0x89, 0x5d, 0x5d, 0xca, 0x1c },
+	{ 0xe9, 0x47, 0x94, 0x87, 0x05, 0xb2, 0x06, 0xd5,
+	  0x72, 0xb0, 0xe8, 0xf6, 0x2f, 0x66, 0xa6, 0x55,
+	  0x1c, 0xbd, 0x6b, 0xc3, 0x05, 0xd2, 0x6c, 0xe7,
+	  0x53, 0x9a, 0x12, 0xf9, 0xaa, 0xdf, 0x75, 0x71 },
+	{ 0x3d, 0x67, 0xc1, 0xb3, 0xf9, 0xb2, 0x39, 0x10,
+	  0xe3, 0xd3, 0x5e, 0x6b, 0x0f, 0x2c, 0xcf, 0x44,
+	  0xa0, 0xb5, 0x40, 0xa4, 0x5c, 0x18, 0xba, 0x3c,
+	  0x36, 0x26, 0x4d, 0xd4, 0x8e, 0x96, 0xaf, 0x6a },
+	{ 0xc7, 0x55, 0x8b, 0xab, 0xda, 0x04, 0xbc, 0xcb,
+	  0x76, 0x4d, 0x0b, 0xbf, 0x33, 0x58, 0x42, 0x51,
+	  0x41, 0x90, 0x2d, 0x22, 0x39, 0x1d, 0x9f, 0x8c,
+	  0x59, 0x15, 0x9f, 0xec, 0x9e, 0x49, 0xb1, 0x51 },
+	{ 0x0b, 0x73, 0x2b, 0xb0, 0x35, 0x67, 0x5a, 0x50,
+	  0xff, 0x58, 0xf2, 0xc2, 0x42, 0xe4, 0x71, 0x0a,
+	  0xec, 0xe6, 0x46, 0x70, 0x07, 0x9c, 0x13, 0x04,
+	  0x4c, 0x79, 0xc9, 0xb7, 0x49, 0x1f, 0x70, 0x00 },
+	{ 0xd1, 0x20, 0xb5, 0xef, 0x6d, 0x57, 0xeb, 0xf0,
+	  0x6e, 0xaf, 0x96, 0xbc, 0x93, 0x3c, 0x96, 0x7b,
+	  0x16, 0xcb, 0xe6, 0xe2, 0xbf, 0x00, 0x74, 0x1c,
+	  0x30, 0xaa, 0x1c, 0x54, 0xba, 0x64, 0x80, 0x1f },
+	{ 0x58, 0xd2, 0x12, 0xad, 0x6f, 0x58, 0xae, 0xf0,
+	  0xf8, 0x01, 0x16, 0xb4, 0x41, 0xe5, 0x7f, 0x61,
+	  0x95, 0xbf, 0xef, 0x26, 0xb6, 0x14, 0x63, 0xed,
+	  0xec, 0x11, 0x83, 0xcd, 0xb0, 0x4f, 0xe7, 0x6d },
+	{ 0xb8, 0x83, 0x6f, 0x51, 0xd1, 0xe2, 0x9b, 0xdf,
+	  0xdb, 0xa3, 0x25, 0x56, 0x53, 0x60, 0x26, 0x8b,
+	  0x8f, 0xad, 0x62, 0x74, 0x73, 0xed, 0xec, 0xef,
+	  0x7e, 0xae, 0xfe, 0xe8, 0x37, 0xc7, 0x40, 0x03 },
+	{ 0xc5, 0x47, 0xa3, 0xc1, 0x24, 0xae, 0x56, 0x85,
+	  0xff, 0xa7, 0xb8, 0xed, 0xaf, 0x96, 0xec, 0x86,
+	  0xf8, 0xb2, 0xd0, 0xd5, 0x0c, 0xee, 0x8b, 0xe3,
+	  0xb1, 0xf0, 0xc7, 0x67, 0x63, 0x06, 0x9d, 0x9c },
+	{ 0x5d, 0x16, 0x8b, 0x76, 0x9a, 0x2f, 0x67, 0x85,
+	  0x3d, 0x62, 0x95, 0xf7, 0x56, 0x8b, 0xe4, 0x0b,
+	  0xb7, 0xa1, 0x6b, 0x8d, 0x65, 0xba, 0x87, 0x63,
+	  0x5d, 0x19, 0x78, 0xd2, 0xab, 0x11, 0xba, 0x2a },
+	{ 0xa2, 0xf6, 0x75, 0xdc, 0x73, 0x02, 0x63, 0x8c,
+	  0xb6, 0x02, 0x01, 0x06, 0x4c, 0xa5, 0x50, 0x77,
+	  0x71, 0x4d, 0x71, 0xfe, 0x09, 0x6a, 0x31, 0x5f,
+	  0x2f, 0xe7, 0x40, 0x12, 0x77, 0xca, 0xa5, 0xaf },
+	{ 0xc8, 0xaa, 0xb5, 0xcd, 0x01, 0x60, 0xae, 0x78,
+	  0xcd, 0x2e, 0x8a, 0xc5, 0xfb, 0x0e, 0x09, 0x3c,
+	  0xdb, 0x5c, 0x4b, 0x60, 0x52, 0xa0, 0xa9, 0x7b,
+	  0xb0, 0x42, 0x16, 0x82, 0x6f, 0xa7, 0xa4, 0x37 },
+	{ 0xff, 0x68, 0xca, 0x40, 0x35, 0xbf, 0xeb, 0x43,
+	  0xfb, 0xf1, 0x45, 0xfd, 0xdd, 0x5e, 0x43, 0xf1,
+	  0xce, 0xa5, 0x4f, 0x11, 0xf7, 0xbe, 0xe1, 0x30,
+	  0x58, 0xf0, 0x27, 0x32, 0x9a, 0x4a, 0x5f, 0xa4 },
+	{ 0x1d, 0x4e, 0x54, 0x87, 0xae, 0x3c, 0x74, 0x0f,
+	  0x2b, 0xa6, 0xe5, 0x41, 0xac, 0x91, 0xbc, 0x2b,
+	  0xfc, 0xd2, 0x99, 0x9c, 0x51, 0x8d, 0x80, 0x7b,
+	  0x42, 0x67, 0x48, 0x80, 0x3a, 0x35, 0x0f, 0xd4 },
+	{ 0x6d, 0x24, 0x4e, 0x1a, 0x06, 0xce, 0x4e, 0xf5,
+	  0x78, 0xdd, 0x0f, 0x63, 0xaf, 0xf0, 0x93, 0x67,
+	  0x06, 0x73, 0x51, 0x19, 0xca, 0x9c, 0x8d, 0x22,
+	  0xd8, 0x6c, 0x80, 0x14, 0x14, 0xab, 0x97, 0x41 },
+	{ 0xde, 0xcf, 0x73, 0x29, 0xdb, 0xcc, 0x82, 0x7b,
+	  0x8f, 0xc5, 0x24, 0xc9, 0x43, 0x1e, 0x89, 0x98,
+	  0x02, 0x9e, 0xce, 0x12, 0xce, 0x93, 0xb7, 0xb2,
+	  0xf3, 0xe7, 0x69, 0xa9, 0x41, 0xfb, 0x8c, 0xea },
+	{ 0x2f, 0xaf, 0xcc, 0x0f, 0x2e, 0x63, 0xcb, 0xd0,
+	  0x77, 0x55, 0xbe, 0x7b, 0x75, 0xec, 0xea, 0x0a,
+	  0xdf, 0xf9, 0xaa, 0x5e, 0xde, 0x2a, 0x52, 0xfd,
+	  0xab, 0x4d, 0xfd, 0x03, 0x74, 0xcd, 0x48, 0x3f },
+	{ 0xaa, 0x85, 0x01, 0x0d, 0xd4, 0x6a, 0x54, 0x6b,
+	  0x53, 0x5e, 0xf4, 0xcf, 0x5f, 0x07, 0xd6, 0x51,
+	  0x61, 0xe8, 0x98, 0x28, 0xf3, 0xa7, 0x7d, 0xb7,
+	  0xb9, 0xb5, 0x6f, 0x0d, 0xf5, 0x9a, 0xae, 0x45 },
+	{ 0x07, 0xe8, 0xe1, 0xee, 0x73, 0x2c, 0xb0, 0xd3,
+	  0x56, 0xc9, 0xc0, 0xd1, 0x06, 0x9c, 0x89, 0xd1,
+	  0x7a, 0xdf, 0x6a, 0x9a, 0x33, 0x4f, 0x74, 0x5e,
+	  0xc7, 0x86, 0x73, 0x32, 0x54, 0x8c, 0xa8, 0xe9 },
+	{ 0x0e, 0x01, 0xe8, 0x1c, 0xad, 0xa8, 0x16, 0x2b,
+	  0xfd, 0x5f, 0x8a, 0x8c, 0x81, 0x8a, 0x6c, 0x69,
+	  0xfe, 0xdf, 0x02, 0xce, 0xb5, 0x20, 0x85, 0x23,
+	  0xcb, 0xe5, 0x31, 0x3b, 0x89, 0xca, 0x10, 0x53 },
+	{ 0x6b, 0xb6, 0xc6, 0x47, 0x26, 0x55, 0x08, 0x43,
+	  0x99, 0x85, 0x2e, 0x00, 0x24, 0x9f, 0x8c, 0xb2,
+	  0x47, 0x89, 0x6d, 0x39, 0x2b, 0x02, 0xd7, 0x3b,
+	  0x7f, 0x0d, 0xd8, 0x18, 0xe1, 0xe2, 0x9b, 0x07 },
+	{ 0x42, 0xd4, 0x63, 0x6e, 0x20, 0x60, 0xf0, 0x8f,
+	  0x41, 0xc8, 0x82, 0xe7, 0x6b, 0x39, 0x6b, 0x11,
+	  0x2e, 0xf6, 0x27, 0xcc, 0x24, 0xc4, 0x3d, 0xd5,
+	  0xf8, 0x3a, 0x1d, 0x1a, 0x7e, 0xad, 0x71, 0x1a },
+	{ 0x48, 0x58, 0xc9, 0xa1, 0x88, 0xb0, 0x23, 0x4f,
+	  0xb9, 0xa8, 0xd4, 0x7d, 0x0b, 0x41, 0x33, 0x65,
+	  0x0a, 0x03, 0x0b, 0xd0, 0x61, 0x1b, 0x87, 0xc3,
+	  0x89, 0x2e, 0x94, 0x95, 0x1f, 0x8d, 0xf8, 0x52 },
+	{ 0x3f, 0xab, 0x3e, 0x36, 0x98, 0x8d, 0x44, 0x5a,
+	  0x51, 0xc8, 0x78, 0x3e, 0x53, 0x1b, 0xe3, 0xa0,
+	  0x2b, 0xe4, 0x0c, 0xd0, 0x47, 0x96, 0xcf, 0xb6,
+	  0x1d, 0x40, 0x34, 0x74, 0x42, 0xd3, 0xf7, 0x94 },
+	{ 0xeb, 0xab, 0xc4, 0x96, 0x36, 0xbd, 0x43, 0x3d,
+	  0x2e, 0xc8, 0xf0, 0xe5, 0x18, 0x73, 0x2e, 0xf8,
+	  0xfa, 0x21, 0xd4, 0xd0, 0x71, 0xcc, 0x3b, 0xc4,
+	  0x6c, 0xd7, 0x9f, 0xa3, 0x8a, 0x28, 0xb8, 0x10 },
+	{ 0xa1, 0xd0, 0x34, 0x35, 0x23, 0xb8, 0x93, 0xfc,
+	  0xa8, 0x4f, 0x47, 0xfe, 0xb4, 0xa6, 0x4d, 0x35,
+	  0x0a, 0x17, 0xd8, 0xee, 0xf5, 0x49, 0x7e, 0xce,
+	  0x69, 0x7d, 0x02, 0xd7, 0x91, 0x78, 0xb5, 0x91 },
+	{ 0x26, 0x2e, 0xbf, 0xd9, 0x13, 0x0b, 0x7d, 0x28,
+	  0x76, 0x0d, 0x08, 0xef, 0x8b, 0xfd, 0x3b, 0x86,
+	  0xcd, 0xd3, 0xb2, 0x11, 0x3d, 0x2c, 0xae, 0xf7,
+	  0xea, 0x95, 0x1a, 0x30, 0x3d, 0xfa, 0x38, 0x46 },
+	{ 0xf7, 0x61, 0x58, 0xed, 0xd5, 0x0a, 0x15, 0x4f,
+	  0xa7, 0x82, 0x03, 0xed, 0x23, 0x62, 0x93, 0x2f,
+	  0xcb, 0x82, 0x53, 0xaa, 0xe3, 0x78, 0x90, 0x3e,
+	  0xde, 0xd1, 0xe0, 0x3f, 0x70, 0x21, 0xa2, 0x57 },
+	{ 0x26, 0x17, 0x8e, 0x95, 0x0a, 0xc7, 0x22, 0xf6,
+	  0x7a, 0xe5, 0x6e, 0x57, 0x1b, 0x28, 0x4c, 0x02,
+	  0x07, 0x68, 0x4a, 0x63, 0x34, 0xa1, 0x77, 0x48,
+	  0xa9, 0x4d, 0x26, 0x0b, 0xc5, 0xf5, 0x52, 0x74 },
+	{ 0xc3, 0x78, 0xd1, 0xe4, 0x93, 0xb4, 0x0e, 0xf1,
+	  0x1f, 0xe6, 0xa1, 0x5d, 0x9c, 0x27, 0x37, 0xa3,
+	  0x78, 0x09, 0x63, 0x4c, 0x5a, 0xba, 0xd5, 0xb3,
+	  0x3d, 0x7e, 0x39, 0x3b, 0x4a, 0xe0, 0x5d, 0x03 },
+	{ 0x98, 0x4b, 0xd8, 0x37, 0x91, 0x01, 0xbe, 0x8f,
+	  0xd8, 0x06, 0x12, 0xd8, 0xea, 0x29, 0x59, 0xa7,
+	  0x86, 0x5e, 0xc9, 0x71, 0x85, 0x23, 0x55, 0x01,
+	  0x07, 0xae, 0x39, 0x38, 0xdf, 0x32, 0x01, 0x1b },
+	{ 0xc6, 0xf2, 0x5a, 0x81, 0x2a, 0x14, 0x48, 0x58,
+	  0xac, 0x5c, 0xed, 0x37, 0xa9, 0x3a, 0x9f, 0x47,
+	  0x59, 0xba, 0x0b, 0x1c, 0x0f, 0xdc, 0x43, 0x1d,
+	  0xce, 0x35, 0xf9, 0xec, 0x1f, 0x1f, 0x4a, 0x99 },
+	{ 0x92, 0x4c, 0x75, 0xc9, 0x44, 0x24, 0xff, 0x75,
+	  0xe7, 0x4b, 0x8b, 0x4e, 0x94, 0x35, 0x89, 0x58,
+	  0xb0, 0x27, 0xb1, 0x71, 0xdf, 0x5e, 0x57, 0x89,
+	  0x9a, 0xd0, 0xd4, 0xda, 0xc3, 0x73, 0x53, 0xb6 },
+	{ 0x0a, 0xf3, 0x58, 0x92, 0xa6, 0x3f, 0x45, 0x93,
+	  0x1f, 0x68, 0x46, 0xed, 0x19, 0x03, 0x61, 0xcd,
+	  0x07, 0x30, 0x89, 0xe0, 0x77, 0x16, 0x57, 0x14,
+	  0xb5, 0x0b, 0x81, 0xa2, 0xe3, 0xdd, 0x9b, 0xa1 },
+	{ 0xcc, 0x80, 0xce, 0xfb, 0x26, 0xc3, 0xb2, 0xb0,
+	  0xda, 0xef, 0x23, 0x3e, 0x60, 0x6d, 0x5f, 0xfc,
+	  0x80, 0xfa, 0x17, 0x42, 0x7d, 0x18, 0xe3, 0x04,
+	  0x89, 0x67, 0x3e, 0x06, 0xef, 0x4b, 0x87, 0xf7 },
+	{ 0xc2, 0xf8, 0xc8, 0x11, 0x74, 0x47, 0xf3, 0x97,
+	  0x8b, 0x08, 0x18, 0xdc, 0xf6, 0xf7, 0x01, 0x16,
+	  0xac, 0x56, 0xfd, 0x18, 0x4d, 0xd1, 0x27, 0x84,
+	  0x94, 0xe1, 0x03, 0xfc, 0x6d, 0x74, 0xa8, 0x87 },
+	{ 0xbd, 0xec, 0xf6, 0xbf, 0xc1, 0xba, 0x0d, 0xf6,
+	  0xe8, 0x62, 0xc8, 0x31, 0x99, 0x22, 0x07, 0x79,
+	  0x6a, 0xcc, 0x79, 0x79, 0x68, 0x35, 0x88, 0x28,
+	  0xc0, 0x6e, 0x7a, 0x51, 0xe0, 0x90, 0x09, 0x8f },
+	{ 0x24, 0xd1, 0xa2, 0x6e, 0x3d, 0xab, 0x02, 0xfe,
+	  0x45, 0x72, 0xd2, 0xaa, 0x7d, 0xbd, 0x3e, 0xc3,
+	  0x0f, 0x06, 0x93, 0xdb, 0x26, 0xf2, 0x73, 0xd0,
+	  0xab, 0x2c, 0xb0, 0xc1, 0x3b, 0x5e, 0x64, 0x51 },
+	{ 0xec, 0x56, 0xf5, 0x8b, 0x09, 0x29, 0x9a, 0x30,
+	  0x0b, 0x14, 0x05, 0x65, 0xd7, 0xd3, 0xe6, 0x87,
+	  0x82, 0xb6, 0xe2, 0xfb, 0xeb, 0x4b, 0x7e, 0xa9,
+	  0x7a, 0xc0, 0x57, 0x98, 0x90, 0x61, 0xdd, 0x3f },
+	{ 0x11, 0xa4, 0x37, 0xc1, 0xab, 0xa3, 0xc1, 0x19,
+	  0xdd, 0xfa, 0xb3, 0x1b, 0x3e, 0x8c, 0x84, 0x1d,
+	  0xee, 0xeb, 0x91, 0x3e, 0xf5, 0x7f, 0x7e, 0x48,
+	  0xf2, 0xc9, 0xcf, 0x5a, 0x28, 0xfa, 0x42, 0xbc },
+	{ 0x53, 0xc7, 0xe6, 0x11, 0x4b, 0x85, 0x0a, 0x2c,
+	  0xb4, 0x96, 0xc9, 0xb3, 0xc6, 0x9a, 0x62, 0x3e,
+	  0xae, 0xa2, 0xcb, 0x1d, 0x33, 0xdd, 0x81, 0x7e,
+	  0x47, 0x65, 0xed, 0xaa, 0x68, 0x23, 0xc2, 0x28 },
+	{ 0x15, 0x4c, 0x3e, 0x96, 0xfe, 0xe5, 0xdb, 0x14,
+	  0xf8, 0x77, 0x3e, 0x18, 0xaf, 0x14, 0x85, 0x79,
+	  0x13, 0x50, 0x9d, 0xa9, 0x99, 0xb4, 0x6c, 0xdd,
+	  0x3d, 0x4c, 0x16, 0x97, 0x60, 0xc8, 0x3a, 0xd2 },
+	{ 0x40, 0xb9, 0x91, 0x6f, 0x09, 0x3e, 0x02, 0x7a,
+	  0x87, 0x86, 0x64, 0x18, 0x18, 0x92, 0x06, 0x20,
+	  0x47, 0x2f, 0xbc, 0xf6, 0x8f, 0x70, 0x1d, 0x1b,
+	  0x68, 0x06, 0x32, 0xe6, 0x99, 0x6b, 0xde, 0xd3 },
+	{ 0x24, 0xc4, 0xcb, 0xba, 0x07, 0x11, 0x98, 0x31,
+	  0xa7, 0x26, 0xb0, 0x53, 0x05, 0xd9, 0x6d, 0xa0,
+	  0x2f, 0xf8, 0xb1, 0x48, 0xf0, 0xda, 0x44, 0x0f,
+	  0xe2, 0x33, 0xbc, 0xaa, 0x32, 0xc7, 0x2f, 0x6f },
+	{ 0x5d, 0x20, 0x15, 0x10, 0x25, 0x00, 0x20, 0xb7,
+	  0x83, 0x68, 0x96, 0x88, 0xab, 0xbf, 0x8e, 0xcf,
+	  0x25, 0x94, 0xa9, 0x6a, 0x08, 0xf2, 0xbf, 0xec,
+	  0x6c, 0xe0, 0x57, 0x44, 0x65, 0xdd, 0xed, 0x71 },
+	{ 0x04, 0x3b, 0x97, 0xe3, 0x36, 0xee, 0x6f, 0xdb,
+	  0xbe, 0x2b, 0x50, 0xf2, 0x2a, 0xf8, 0x32, 0x75,
+	  0xa4, 0x08, 0x48, 0x05, 0xd2, 0xd5, 0x64, 0x59,
+	  0x62, 0x45, 0x4b, 0x6c, 0x9b, 0x80, 0x53, 0xa0 },
+	{ 0x56, 0x48, 0x35, 0xcb, 0xae, 0xa7, 0x74, 0x94,
+	  0x85, 0x68, 0xbe, 0x36, 0xcf, 0x52, 0xfc, 0xdd,
+	  0x83, 0x93, 0x4e, 0xb0, 0xa2, 0x75, 0x12, 0xdb,
+	  0xe3, 0xe2, 0xdb, 0x47, 0xb9, 0xe6, 0x63, 0x5a },
+	{ 0xf2, 0x1c, 0x33, 0xf4, 0x7b, 0xde, 0x40, 0xa2,
+	  0xa1, 0x01, 0xc9, 0xcd, 0xe8, 0x02, 0x7a, 0xaf,
+	  0x61, 0xa3, 0x13, 0x7d, 0xe2, 0x42, 0x2b, 0x30,
+	  0x03, 0x5a, 0x04, 0xc2, 0x70, 0x89, 0x41, 0x83 },
+	{ 0x9d, 0xb0, 0xef, 0x74, 0xe6, 0x6c, 0xbb, 0x84,
+	  0x2e, 0xb0, 0xe0, 0x73, 0x43, 0xa0, 0x3c, 0x5c,
+	  0x56, 0x7e, 0x37, 0x2b, 0x3f, 0x23, 0xb9, 0x43,
+	  0xc7, 0x88, 0xa4, 0xf2, 0x50, 0xf6, 0x78, 0x91 },
+	{ 0xab, 0x8d, 0x08, 0x65, 0x5f, 0xf1, 0xd3, 0xfe,
+	  0x87, 0x58, 0xd5, 0x62, 0x23, 0x5f, 0xd2, 0x3e,
+	  0x7c, 0xf9, 0xdc, 0xaa, 0xd6, 0x58, 0x87, 0x2a,
+	  0x49, 0xe5, 0xd3, 0x18, 0x3b, 0x6c, 0xce, 0xbd },
+	{ 0x6f, 0x27, 0xf7, 0x7e, 0x7b, 0xcf, 0x46, 0xa1,
+	  0xe9, 0x63, 0xad, 0xe0, 0x30, 0x97, 0x33, 0x54,
+	  0x30, 0x31, 0xdc, 0xcd, 0xd4, 0x7c, 0xaa, 0xc1,
+	  0x74, 0xd7, 0xd2, 0x7c, 0xe8, 0x07, 0x7e, 0x8b },
+	{ 0xe3, 0xcd, 0x54, 0xda, 0x7e, 0x44, 0x4c, 0xaa,
+	  0x62, 0x07, 0x56, 0x95, 0x25, 0xa6, 0x70, 0xeb,
+	  0xae, 0x12, 0x78, 0xde, 0x4e, 0x3f, 0xe2, 0x68,
+	  0x4b, 0x3e, 0x33, 0xf5, 0xef, 0x90, 0xcc, 0x1b },
+	{ 0xb2, 0xc3, 0xe3, 0x3a, 0x51, 0xd2, 0x2c, 0x4c,
+	  0x08, 0xfc, 0x09, 0x89, 0xc8, 0x73, 0xc9, 0xcc,
+	  0x41, 0x50, 0x57, 0x9b, 0x1e, 0x61, 0x63, 0xfa,
+	  0x69, 0x4a, 0xd5, 0x1d, 0x53, 0xd7, 0x12, 0xdc },
+	{ 0xbe, 0x7f, 0xda, 0x98, 0x3e, 0x13, 0x18, 0x9b,
+	  0x4c, 0x77, 0xe0, 0xa8, 0x09, 0x20, 0xb6, 0xe0,
+	  0xe0, 0xea, 0x80, 0xc3, 0xb8, 0x4d, 0xbe, 0x7e,
+	  0x71, 0x17, 0xd2, 0x53, 0xf4, 0x81, 0x12, 0xf4 },
+	{ 0xb6, 0x00, 0x8c, 0x28, 0xfa, 0xe0, 0x8a, 0xa4,
+	  0x27, 0xe5, 0xbd, 0x3a, 0xad, 0x36, 0xf1, 0x00,
+	  0x21, 0xf1, 0x6c, 0x77, 0xcf, 0xea, 0xbe, 0xd0,
+	  0x7f, 0x97, 0xcc, 0x7d, 0xc1, 0xf1, 0x28, 0x4a },
+	{ 0x6e, 0x4e, 0x67, 0x60, 0xc5, 0x38, 0xf2, 0xe9,
+	  0x7b, 0x3a, 0xdb, 0xfb, 0xbc, 0xde, 0x57, 0xf8,
+	  0x96, 0x6b, 0x7e, 0xa8, 0xfc, 0xb5, 0xbf, 0x7e,
+	  0xfe, 0xc9, 0x13, 0xfd, 0x2a, 0x2b, 0x0c, 0x55 },
+	{ 0x4a, 0xe5, 0x1f, 0xd1, 0x83, 0x4a, 0xa5, 0xbd,
+	  0x9a, 0x6f, 0x7e, 0xc3, 0x9f, 0xc6, 0x63, 0x33,
+	  0x8d, 0xc5, 0xd2, 0xe2, 0x07, 0x61, 0x56, 0x6d,
+	  0x90, 0xcc, 0x68, 0xb1, 0xcb, 0x87, 0x5e, 0xd8 },
+	{ 0xb6, 0x73, 0xaa, 0xd7, 0x5a, 0xb1, 0xfd, 0xb5,
+	  0x40, 0x1a, 0xbf, 0xa1, 0xbf, 0x89, 0xf3, 0xad,
+	  0xd2, 0xeb, 0xc4, 0x68, 0xdf, 0x36, 0x24, 0xa4,
+	  0x78, 0xf4, 0xfe, 0x85, 0x9d, 0x8d, 0x55, 0xe2 },
+	{ 0x13, 0xc9, 0x47, 0x1a, 0x98, 0x55, 0x91, 0x35,
+	  0x39, 0x83, 0x66, 0x60, 0x39, 0x8d, 0xa0, 0xf3,
+	  0xf9, 0x9a, 0xda, 0x08, 0x47, 0x9c, 0x69, 0xd1,
+	  0xb7, 0xfc, 0xaa, 0x34, 0x61, 0xdd, 0x7e, 0x59 },
+	{ 0x2c, 0x11, 0xf4, 0xa7, 0xf9, 0x9a, 0x1d, 0x23,
+	  0xa5, 0x8b, 0xb6, 0x36, 0x35, 0x0f, 0xe8, 0x49,
+	  0xf2, 0x9c, 0xba, 0xc1, 0xb2, 0xa1, 0x11, 0x2d,
+	  0x9f, 0x1e, 0xd5, 0xbc, 0x5b, 0x31, 0x3c, 0xcd },
+	{ 0xc7, 0xd3, 0xc0, 0x70, 0x6b, 0x11, 0xae, 0x74,
+	  0x1c, 0x05, 0xa1, 0xef, 0x15, 0x0d, 0xd6, 0x5b,
+	  0x54, 0x94, 0xd6, 0xd5, 0x4c, 0x9a, 0x86, 0xe2,
+	  0x61, 0x78, 0x54, 0xe6, 0xae, 0xee, 0xbb, 0xd9 },
+	{ 0x19, 0x4e, 0x10, 0xc9, 0x38, 0x93, 0xaf, 0xa0,
+	  0x64, 0xc3, 0xac, 0x04, 0xc0, 0xdd, 0x80, 0x8d,
+	  0x79, 0x1c, 0x3d, 0x4b, 0x75, 0x56, 0xe8, 0x9d,
+	  0x8d, 0x9c, 0xb2, 0x25, 0xc4, 0xb3, 0x33, 0x39 },
+	{ 0x6f, 0xc4, 0x98, 0x8b, 0x8f, 0x78, 0x54, 0x6b,
+	  0x16, 0x88, 0x99, 0x18, 0x45, 0x90, 0x8f, 0x13,
+	  0x4b, 0x6a, 0x48, 0x2e, 0x69, 0x94, 0xb3, 0xd4,
+	  0x83, 0x17, 0xbf, 0x08, 0xdb, 0x29, 0x21, 0x85 },
+	{ 0x56, 0x65, 0xbe, 0xb8, 0xb0, 0x95, 0x55, 0x25,
+	  0x81, 0x3b, 0x59, 0x81, 0xcd, 0x14, 0x2e, 0xd4,
+	  0xd0, 0x3f, 0xba, 0x38, 0xa6, 0xf3, 0xe5, 0xad,
+	  0x26, 0x8e, 0x0c, 0xc2, 0x70, 0xd1, 0xcd, 0x11 },
+	{ 0xb8, 0x83, 0xd6, 0x8f, 0x5f, 0xe5, 0x19, 0x36,
+	  0x43, 0x1b, 0xa4, 0x25, 0x67, 0x38, 0x05, 0x3b,
+	  0x1d, 0x04, 0x26, 0xd4, 0xcb, 0x64, 0xb1, 0x6e,
+	  0x83, 0xba, 0xdc, 0x5e, 0x9f, 0xbe, 0x3b, 0x81 },
+	{ 0x53, 0xe7, 0xb2, 0x7e, 0xa5, 0x9c, 0x2f, 0x6d,
+	  0xbb, 0x50, 0x76, 0x9e, 0x43, 0x55, 0x4d, 0xf3,
+	  0x5a, 0xf8, 0x9f, 0x48, 0x22, 0xd0, 0x46, 0x6b,
+	  0x00, 0x7d, 0xd6, 0xf6, 0xde, 0xaf, 0xff, 0x02 },
+	{ 0x1f, 0x1a, 0x02, 0x29, 0xd4, 0x64, 0x0f, 0x01,
+	  0x90, 0x15, 0x88, 0xd9, 0xde, 0xc2, 0x2d, 0x13,
+	  0xfc, 0x3e, 0xb3, 0x4a, 0x61, 0xb3, 0x29, 0x38,
+	  0xef, 0xbf, 0x53, 0x34, 0xb2, 0x80, 0x0a, 0xfa },
+	{ 0xc2, 0xb4, 0x05, 0xaf, 0xa0, 0xfa, 0x66, 0x68,
+	  0x85, 0x2a, 0xee, 0x4d, 0x88, 0x04, 0x08, 0x53,
+	  0xfa, 0xb8, 0x00, 0xe7, 0x2b, 0x57, 0x58, 0x14,
+	  0x18, 0xe5, 0x50, 0x6f, 0x21, 0x4c, 0x7d, 0x1f },
+	{ 0xc0, 0x8a, 0xa1, 0xc2, 0x86, 0xd7, 0x09, 0xfd,
+	  0xc7, 0x47, 0x37, 0x44, 0x97, 0x71, 0x88, 0xc8,
+	  0x95, 0xba, 0x01, 0x10, 0x14, 0x24, 0x7e, 0x4e,
+	  0xfa, 0x8d, 0x07, 0xe7, 0x8f, 0xec, 0x69, 0x5c },
+	{ 0xf0, 0x3f, 0x57, 0x89, 0xd3, 0x33, 0x6b, 0x80,
+	  0xd0, 0x02, 0xd5, 0x9f, 0xdf, 0x91, 0x8b, 0xdb,
+	  0x77, 0x5b, 0x00, 0x95, 0x6e, 0xd5, 0x52, 0x8e,
+	  0x86, 0xaa, 0x99, 0x4a, 0xcb, 0x38, 0xfe, 0x2d }
+};
+
+static const u8 blake2s_keyed_testvecs[][BLAKE2S_HASH_SIZE] __initconst = {
+	{ 0x48, 0xa8, 0x99, 0x7d, 0xa4, 0x07, 0x87, 0x6b,
+	  0x3d, 0x79, 0xc0, 0xd9, 0x23, 0x25, 0xad, 0x3b,
+	  0x89, 0xcb, 0xb7, 0x54, 0xd8, 0x6a, 0xb7, 0x1a,
+	  0xee, 0x04, 0x7a, 0xd3, 0x45, 0xfd, 0x2c, 0x49 },
+	{ 0x40, 0xd1, 0x5f, 0xee, 0x7c, 0x32, 0x88, 0x30,
+	  0x16, 0x6a, 0xc3, 0xf9, 0x18, 0x65, 0x0f, 0x80,
+	  0x7e, 0x7e, 0x01, 0xe1, 0x77, 0x25, 0x8c, 0xdc,
+	  0x0a, 0x39, 0xb1, 0x1f, 0x59, 0x80, 0x66, 0xf1 },
+	{ 0x6b, 0xb7, 0x13, 0x00, 0x64, 0x4c, 0xd3, 0x99,
+	  0x1b, 0x26, 0xcc, 0xd4, 0xd2, 0x74, 0xac, 0xd1,
+	  0xad, 0xea, 0xb8, 0xb1, 0xd7, 0x91, 0x45, 0x46,
+	  0xc1, 0x19, 0x8b, 0xbe, 0x9f, 0xc9, 0xd8, 0x03 },
+	{ 0x1d, 0x22, 0x0d, 0xbe, 0x2e, 0xe1, 0x34, 0x66,
+	  0x1f, 0xdf, 0x6d, 0x9e, 0x74, 0xb4, 0x17, 0x04,
+	  0x71, 0x05, 0x56, 0xf2, 0xf6, 0xe5, 0xa0, 0x91,
+	  0xb2, 0x27, 0x69, 0x74, 0x45, 0xdb, 0xea, 0x6b },
+	{ 0xf6, 0xc3, 0xfb, 0xad, 0xb4, 0xcc, 0x68, 0x7a,
+	  0x00, 0x64, 0xa5, 0xbe, 0x6e, 0x79, 0x1b, 0xec,
+	  0x63, 0xb8, 0x68, 0xad, 0x62, 0xfb, 0xa6, 0x1b,
+	  0x37, 0x57, 0xef, 0x9c, 0xa5, 0x2e, 0x05, 0xb2 },
+	{ 0x49, 0xc1, 0xf2, 0x11, 0x88, 0xdf, 0xd7, 0x69,
+	  0xae, 0xa0, 0xe9, 0x11, 0xdd, 0x6b, 0x41, 0xf1,
+	  0x4d, 0xab, 0x10, 0x9d, 0x2b, 0x85, 0x97, 0x7a,
+	  0xa3, 0x08, 0x8b, 0x5c, 0x70, 0x7e, 0x85, 0x98 },
+	{ 0xfd, 0xd8, 0x99, 0x3d, 0xcd, 0x43, 0xf6, 0x96,
+	  0xd4, 0x4f, 0x3c, 0xea, 0x0f, 0xf3, 0x53, 0x45,
+	  0x23, 0x4e, 0xc8, 0xee, 0x08, 0x3e, 0xb3, 0xca,
+	  0xda, 0x01, 0x7c, 0x7f, 0x78, 0xc1, 0x71, 0x43 },
+	{ 0xe6, 0xc8, 0x12, 0x56, 0x37, 0x43, 0x8d, 0x09,
+	  0x05, 0xb7, 0x49, 0xf4, 0x65, 0x60, 0xac, 0x89,
+	  0xfd, 0x47, 0x1c, 0xf8, 0x69, 0x2e, 0x28, 0xfa,
+	  0xb9, 0x82, 0xf7, 0x3f, 0x01, 0x9b, 0x83, 0xa9 },
+	{ 0x19, 0xfc, 0x8c, 0xa6, 0x97, 0x9d, 0x60, 0xe6,
+	  0xed, 0xd3, 0xb4, 0x54, 0x1e, 0x2f, 0x96, 0x7c,
+	  0xed, 0x74, 0x0d, 0xf6, 0xec, 0x1e, 0xae, 0xbb,
+	  0xfe, 0x81, 0x38, 0x32, 0xe9, 0x6b, 0x29, 0x74 },
+	{ 0xa6, 0xad, 0x77, 0x7c, 0xe8, 0x81, 0xb5, 0x2b,
+	  0xb5, 0xa4, 0x42, 0x1a, 0xb6, 0xcd, 0xd2, 0xdf,
+	  0xba, 0x13, 0xe9, 0x63, 0x65, 0x2d, 0x4d, 0x6d,
+	  0x12, 0x2a, 0xee, 0x46, 0x54, 0x8c, 0x14, 0xa7 },
+	{ 0xf5, 0xc4, 0xb2, 0xba, 0x1a, 0x00, 0x78, 0x1b,
+	  0x13, 0xab, 0xa0, 0x42, 0x52, 0x42, 0xc6, 0x9c,
+	  0xb1, 0x55, 0x2f, 0x3f, 0x71, 0xa9, 0xa3, 0xbb,
+	  0x22, 0xb4, 0xa6, 0xb4, 0x27, 0x7b, 0x46, 0xdd },
+	{ 0xe3, 0x3c, 0x4c, 0x9b, 0xd0, 0xcc, 0x7e, 0x45,
+	  0xc8, 0x0e, 0x65, 0xc7, 0x7f, 0xa5, 0x99, 0x7f,
+	  0xec, 0x70, 0x02, 0x73, 0x85, 0x41, 0x50, 0x9e,
+	  0x68, 0xa9, 0x42, 0x38, 0x91, 0xe8, 0x22, 0xa3 },
+	{ 0xfb, 0xa1, 0x61, 0x69, 0xb2, 0xc3, 0xee, 0x10,
+	  0x5b, 0xe6, 0xe1, 0xe6, 0x50, 0xe5, 0xcb, 0xf4,
+	  0x07, 0x46, 0xb6, 0x75, 0x3d, 0x03, 0x6a, 0xb5,
+	  0x51, 0x79, 0x01, 0x4a, 0xd7, 0xef, 0x66, 0x51 },
+	{ 0xf5, 0xc4, 0xbe, 0xc6, 0xd6, 0x2f, 0xc6, 0x08,
+	  0xbf, 0x41, 0xcc, 0x11, 0x5f, 0x16, 0xd6, 0x1c,
+	  0x7e, 0xfd, 0x3f, 0xf6, 0xc6, 0x56, 0x92, 0xbb,
+	  0xe0, 0xaf, 0xff, 0xb1, 0xfe, 0xde, 0x74, 0x75 },
+	{ 0xa4, 0x86, 0x2e, 0x76, 0xdb, 0x84, 0x7f, 0x05,
+	  0xba, 0x17, 0xed, 0xe5, 0xda, 0x4e, 0x7f, 0x91,
+	  0xb5, 0x92, 0x5c, 0xf1, 0xad, 0x4b, 0xa1, 0x27,
+	  0x32, 0xc3, 0x99, 0x57, 0x42, 0xa5, 0xcd, 0x6e },
+	{ 0x65, 0xf4, 0xb8, 0x60, 0xcd, 0x15, 0xb3, 0x8e,
+	  0xf8, 0x14, 0xa1, 0xa8, 0x04, 0x31, 0x4a, 0x55,
+	  0xbe, 0x95, 0x3c, 0xaa, 0x65, 0xfd, 0x75, 0x8a,
+	  0xd9, 0x89, 0xff, 0x34, 0xa4, 0x1c, 0x1e, 0xea },
+	{ 0x19, 0xba, 0x23, 0x4f, 0x0a, 0x4f, 0x38, 0x63,
+	  0x7d, 0x18, 0x39, 0xf9, 0xd9, 0xf7, 0x6a, 0xd9,
+	  0x1c, 0x85, 0x22, 0x30, 0x71, 0x43, 0xc9, 0x7d,
+	  0x5f, 0x93, 0xf6, 0x92, 0x74, 0xce, 0xc9, 0xa7 },
+	{ 0x1a, 0x67, 0x18, 0x6c, 0xa4, 0xa5, 0xcb, 0x8e,
+	  0x65, 0xfc, 0xa0, 0xe2, 0xec, 0xbc, 0x5d, 0xdc,
+	  0x14, 0xae, 0x38, 0x1b, 0xb8, 0xbf, 0xfe, 0xb9,
+	  0xe0, 0xa1, 0x03, 0x44, 0x9e, 0x3e, 0xf0, 0x3c },
+	{ 0xaf, 0xbe, 0xa3, 0x17, 0xb5, 0xa2, 0xe8, 0x9c,
+	  0x0b, 0xd9, 0x0c, 0xcf, 0x5d, 0x7f, 0xd0, 0xed,
+	  0x57, 0xfe, 0x58, 0x5e, 0x4b, 0xe3, 0x27, 0x1b,
+	  0x0a, 0x6b, 0xf0, 0xf5, 0x78, 0x6b, 0x0f, 0x26 },
+	{ 0xf1, 0xb0, 0x15, 0x58, 0xce, 0x54, 0x12, 0x62,
+	  0xf5, 0xec, 0x34, 0x29, 0x9d, 0x6f, 0xb4, 0x09,
+	  0x00, 0x09, 0xe3, 0x43, 0x4b, 0xe2, 0xf4, 0x91,
+	  0x05, 0xcf, 0x46, 0xaf, 0x4d, 0x2d, 0x41, 0x24 },
+	{ 0x13, 0xa0, 0xa0, 0xc8, 0x63, 0x35, 0x63, 0x5e,
+	  0xaa, 0x74, 0xca, 0x2d, 0x5d, 0x48, 0x8c, 0x79,
+	  0x7b, 0xbb, 0x4f, 0x47, 0xdc, 0x07, 0x10, 0x50,
+	  0x15, 0xed, 0x6a, 0x1f, 0x33, 0x09, 0xef, 0xce },
+	{ 0x15, 0x80, 0xaf, 0xee, 0xbe, 0xbb, 0x34, 0x6f,
+	  0x94, 0xd5, 0x9f, 0xe6, 0x2d, 0xa0, 0xb7, 0x92,
+	  0x37, 0xea, 0xd7, 0xb1, 0x49, 0x1f, 0x56, 0x67,
+	  0xa9, 0x0e, 0x45, 0xed, 0xf6, 0xca, 0x8b, 0x03 },
+	{ 0x20, 0xbe, 0x1a, 0x87, 0x5b, 0x38, 0xc5, 0x73,
+	  0xdd, 0x7f, 0xaa, 0xa0, 0xde, 0x48, 0x9d, 0x65,
+	  0x5c, 0x11, 0xef, 0xb6, 0xa5, 0x52, 0x69, 0x8e,
+	  0x07, 0xa2, 0xd3, 0x31, 0xb5, 0xf6, 0x55, 0xc3 },
+	{ 0xbe, 0x1f, 0xe3, 0xc4, 0xc0, 0x40, 0x18, 0xc5,
+	  0x4c, 0x4a, 0x0f, 0x6b, 0x9a, 0x2e, 0xd3, 0xc5,
+	  0x3a, 0xbe, 0x3a, 0x9f, 0x76, 0xb4, 0xd2, 0x6d,
+	  0xe5, 0x6f, 0xc9, 0xae, 0x95, 0x05, 0x9a, 0x99 },
+	{ 0xe3, 0xe3, 0xac, 0xe5, 0x37, 0xeb, 0x3e, 0xdd,
+	  0x84, 0x63, 0xd9, 0xad, 0x35, 0x82, 0xe1, 0x3c,
+	  0xf8, 0x65, 0x33, 0xff, 0xde, 0x43, 0xd6, 0x68,
+	  0xdd, 0x2e, 0x93, 0xbb, 0xdb, 0xd7, 0x19, 0x5a },
+	{ 0x11, 0x0c, 0x50, 0xc0, 0xbf, 0x2c, 0x6e, 0x7a,
+	  0xeb, 0x7e, 0x43, 0x5d, 0x92, 0xd1, 0x32, 0xab,
+	  0x66, 0x55, 0x16, 0x8e, 0x78, 0xa2, 0xde, 0xcd,
+	  0xec, 0x33, 0x30, 0x77, 0x76, 0x84, 0xd9, 0xc1 },
+	{ 0xe9, 0xba, 0x8f, 0x50, 0x5c, 0x9c, 0x80, 0xc0,
+	  0x86, 0x66, 0xa7, 0x01, 0xf3, 0x36, 0x7e, 0x6c,
+	  0xc6, 0x65, 0xf3, 0x4b, 0x22, 0xe7, 0x3c, 0x3c,
+	  0x04, 0x17, 0xeb, 0x1c, 0x22, 0x06, 0x08, 0x2f },
+	{ 0x26, 0xcd, 0x66, 0xfc, 0xa0, 0x23, 0x79, 0xc7,
+	  0x6d, 0xf1, 0x23, 0x17, 0x05, 0x2b, 0xca, 0xfd,
+	  0x6c, 0xd8, 0xc3, 0xa7, 0xb8, 0x90, 0xd8, 0x05,
+	  0xf3, 0x6c, 0x49, 0x98, 0x97, 0x82, 0x43, 0x3a },
+	{ 0x21, 0x3f, 0x35, 0x96, 0xd6, 0xe3, 0xa5, 0xd0,
+	  0xe9, 0x93, 0x2c, 0xd2, 0x15, 0x91, 0x46, 0x01,
+	  0x5e, 0x2a, 0xbc, 0x94, 0x9f, 0x47, 0x29, 0xee,
+	  0x26, 0x32, 0xfe, 0x1e, 0xdb, 0x78, 0xd3, 0x37 },
+	{ 0x10, 0x15, 0xd7, 0x01, 0x08, 0xe0, 0x3b, 0xe1,
+	  0xc7, 0x02, 0xfe, 0x97, 0x25, 0x36, 0x07, 0xd1,
+	  0x4a, 0xee, 0x59, 0x1f, 0x24, 0x13, 0xea, 0x67,
+	  0x87, 0x42, 0x7b, 0x64, 0x59, 0xff, 0x21, 0x9a },
+	{ 0x3c, 0xa9, 0x89, 0xde, 0x10, 0xcf, 0xe6, 0x09,
+	  0x90, 0x94, 0x72, 0xc8, 0xd3, 0x56, 0x10, 0x80,
+	  0x5b, 0x2f, 0x97, 0x77, 0x34, 0xcf, 0x65, 0x2c,
+	  0xc6, 0x4b, 0x3b, 0xfc, 0x88, 0x2d, 0x5d, 0x89 },
+	{ 0xb6, 0x15, 0x6f, 0x72, 0xd3, 0x80, 0xee, 0x9e,
+	  0xa6, 0xac, 0xd1, 0x90, 0x46, 0x4f, 0x23, 0x07,
+	  0xa5, 0xc1, 0x79, 0xef, 0x01, 0xfd, 0x71, 0xf9,
+	  0x9f, 0x2d, 0x0f, 0x7a, 0x57, 0x36, 0x0a, 0xea },
+	{ 0xc0, 0x3b, 0xc6, 0x42, 0xb2, 0x09, 0x59, 0xcb,
+	  0xe1, 0x33, 0xa0, 0x30, 0x3e, 0x0c, 0x1a, 0xbf,
+	  0xf3, 0xe3, 0x1e, 0xc8, 0xe1, 0xa3, 0x28, 0xec,
+	  0x85, 0x65, 0xc3, 0x6d, 0xec, 0xff, 0x52, 0x65 },
+	{ 0x2c, 0x3e, 0x08, 0x17, 0x6f, 0x76, 0x0c, 0x62,
+	  0x64, 0xc3, 0xa2, 0xcd, 0x66, 0xfe, 0xc6, 0xc3,
+	  0xd7, 0x8d, 0xe4, 0x3f, 0xc1, 0x92, 0x45, 0x7b,
+	  0x2a, 0x4a, 0x66, 0x0a, 0x1e, 0x0e, 0xb2, 0x2b },
+	{ 0xf7, 0x38, 0xc0, 0x2f, 0x3c, 0x1b, 0x19, 0x0c,
+	  0x51, 0x2b, 0x1a, 0x32, 0xde, 0xab, 0xf3, 0x53,
+	  0x72, 0x8e, 0x0e, 0x9a, 0xb0, 0x34, 0x49, 0x0e,
+	  0x3c, 0x34, 0x09, 0x94, 0x6a, 0x97, 0xae, 0xec },
+	{ 0x8b, 0x18, 0x80, 0xdf, 0x30, 0x1c, 0xc9, 0x63,
+	  0x41, 0x88, 0x11, 0x08, 0x89, 0x64, 0x83, 0x92,
+	  0x87, 0xff, 0x7f, 0xe3, 0x1c, 0x49, 0xea, 0x6e,
+	  0xbd, 0x9e, 0x48, 0xbd, 0xee, 0xe4, 0x97, 0xc5 },
+	{ 0x1e, 0x75, 0xcb, 0x21, 0xc6, 0x09, 0x89, 0x02,
+	  0x03, 0x75, 0xf1, 0xa7, 0xa2, 0x42, 0x83, 0x9f,
+	  0x0b, 0x0b, 0x68, 0x97, 0x3a, 0x4c, 0x2a, 0x05,
+	  0xcf, 0x75, 0x55, 0xed, 0x5a, 0xae, 0xc4, 0xc1 },
+	{ 0x62, 0xbf, 0x8a, 0x9c, 0x32, 0xa5, 0xbc, 0xcf,
+	  0x29, 0x0b, 0x6c, 0x47, 0x4d, 0x75, 0xb2, 0xa2,
+	  0xa4, 0x09, 0x3f, 0x1a, 0x9e, 0x27, 0x13, 0x94,
+	  0x33, 0xa8, 0xf2, 0xb3, 0xbc, 0xe7, 0xb8, 0xd7 },
+	{ 0x16, 0x6c, 0x83, 0x50, 0xd3, 0x17, 0x3b, 0x5e,
+	  0x70, 0x2b, 0x78, 0x3d, 0xfd, 0x33, 0xc6, 0x6e,
+	  0xe0, 0x43, 0x27, 0x42, 0xe9, 0xb9, 0x2b, 0x99,
+	  0x7f, 0xd2, 0x3c, 0x60, 0xdc, 0x67, 0x56, 0xca },
+	{ 0x04, 0x4a, 0x14, 0xd8, 0x22, 0xa9, 0x0c, 0xac,
+	  0xf2, 0xf5, 0xa1, 0x01, 0x42, 0x8a, 0xdc, 0x8f,
+	  0x41, 0x09, 0x38, 0x6c, 0xcb, 0x15, 0x8b, 0xf9,
+	  0x05, 0xc8, 0x61, 0x8b, 0x8e, 0xe2, 0x4e, 0xc3 },
+	{ 0x38, 0x7d, 0x39, 0x7e, 0xa4, 0x3a, 0x99, 0x4b,
+	  0xe8, 0x4d, 0x2d, 0x54, 0x4a, 0xfb, 0xe4, 0x81,
+	  0xa2, 0x00, 0x0f, 0x55, 0x25, 0x26, 0x96, 0xbb,
+	  0xa2, 0xc5, 0x0c, 0x8e, 0xbd, 0x10, 0x13, 0x47 },
+	{ 0x56, 0xf8, 0xcc, 0xf1, 0xf8, 0x64, 0x09, 0xb4,
+	  0x6c, 0xe3, 0x61, 0x66, 0xae, 0x91, 0x65, 0x13,
+	  0x84, 0x41, 0x57, 0x75, 0x89, 0xdb, 0x08, 0xcb,
+	  0xc5, 0xf6, 0x6c, 0xa2, 0x97, 0x43, 0xb9, 0xfd },
+	{ 0x97, 0x06, 0xc0, 0x92, 0xb0, 0x4d, 0x91, 0xf5,
+	  0x3d, 0xff, 0x91, 0xfa, 0x37, 0xb7, 0x49, 0x3d,
+	  0x28, 0xb5, 0x76, 0xb5, 0xd7, 0x10, 0x46, 0x9d,
+	  0xf7, 0x94, 0x01, 0x66, 0x22, 0x36, 0xfc, 0x03 },
+	{ 0x87, 0x79, 0x68, 0x68, 0x6c, 0x06, 0x8c, 0xe2,
+	  0xf7, 0xe2, 0xad, 0xcf, 0xf6, 0x8b, 0xf8, 0x74,
+	  0x8e, 0xdf, 0x3c, 0xf8, 0x62, 0xcf, 0xb4, 0xd3,
+	  0x94, 0x7a, 0x31, 0x06, 0x95, 0x80, 0x54, 0xe3 },
+	{ 0x88, 0x17, 0xe5, 0x71, 0x98, 0x79, 0xac, 0xf7,
+	  0x02, 0x47, 0x87, 0xec, 0xcd, 0xb2, 0x71, 0x03,
+	  0x55, 0x66, 0xcf, 0xa3, 0x33, 0xe0, 0x49, 0x40,
+	  0x7c, 0x01, 0x78, 0xcc, 0xc5, 0x7a, 0x5b, 0x9f },
+	{ 0x89, 0x38, 0x24, 0x9e, 0x4b, 0x50, 0xca, 0xda,
+	  0xcc, 0xdf, 0x5b, 0x18, 0x62, 0x13, 0x26, 0xcb,
+	  0xb1, 0x52, 0x53, 0xe3, 0x3a, 0x20, 0xf5, 0x63,
+	  0x6e, 0x99, 0x5d, 0x72, 0x47, 0x8d, 0xe4, 0x72 },
+	{ 0xf1, 0x64, 0xab, 0xba, 0x49, 0x63, 0xa4, 0x4d,
+	  0x10, 0x72, 0x57, 0xe3, 0x23, 0x2d, 0x90, 0xac,
+	  0xa5, 0xe6, 0x6a, 0x14, 0x08, 0x24, 0x8c, 0x51,
+	  0x74, 0x1e, 0x99, 0x1d, 0xb5, 0x22, 0x77, 0x56 },
+	{ 0xd0, 0x55, 0x63, 0xe2, 0xb1, 0xcb, 0xa0, 0xc4,
+	  0xa2, 0xa1, 0xe8, 0xbd, 0xe3, 0xa1, 0xa0, 0xd9,
+	  0xf5, 0xb4, 0x0c, 0x85, 0xa0, 0x70, 0xd6, 0xf5,
+	  0xfb, 0x21, 0x06, 0x6e, 0xad, 0x5d, 0x06, 0x01 },
+	{ 0x03, 0xfb, 0xb1, 0x63, 0x84, 0xf0, 0xa3, 0x86,
+	  0x6f, 0x4c, 0x31, 0x17, 0x87, 0x76, 0x66, 0xef,
+	  0xbf, 0x12, 0x45, 0x97, 0x56, 0x4b, 0x29, 0x3d,
+	  0x4a, 0xab, 0x0d, 0x26, 0x9f, 0xab, 0xdd, 0xfa },
+	{ 0x5f, 0xa8, 0x48, 0x6a, 0xc0, 0xe5, 0x29, 0x64,
+	  0xd1, 0x88, 0x1b, 0xbe, 0x33, 0x8e, 0xb5, 0x4b,
+	  0xe2, 0xf7, 0x19, 0x54, 0x92, 0x24, 0x89, 0x20,
+	  0x57, 0xb4, 0xda, 0x04, 0xba, 0x8b, 0x34, 0x75 },
+	{ 0xcd, 0xfa, 0xbc, 0xee, 0x46, 0x91, 0x11, 0x11,
+	  0x23, 0x6a, 0x31, 0x70, 0x8b, 0x25, 0x39, 0xd7,
+	  0x1f, 0xc2, 0x11, 0xd9, 0xb0, 0x9c, 0x0d, 0x85,
+	  0x30, 0xa1, 0x1e, 0x1d, 0xbf, 0x6e, 0xed, 0x01 },
+	{ 0x4f, 0x82, 0xde, 0x03, 0xb9, 0x50, 0x47, 0x93,
+	  0xb8, 0x2a, 0x07, 0xa0, 0xbd, 0xcd, 0xff, 0x31,
+	  0x4d, 0x75, 0x9e, 0x7b, 0x62, 0xd2, 0x6b, 0x78,
+	  0x49, 0x46, 0xb0, 0xd3, 0x6f, 0x91, 0x6f, 0x52 },
+	{ 0x25, 0x9e, 0xc7, 0xf1, 0x73, 0xbc, 0xc7, 0x6a,
+	  0x09, 0x94, 0xc9, 0x67, 0xb4, 0xf5, 0xf0, 0x24,
+	  0xc5, 0x60, 0x57, 0xfb, 0x79, 0xc9, 0x65, 0xc4,
+	  0xfa, 0xe4, 0x18, 0x75, 0xf0, 0x6a, 0x0e, 0x4c },
+	{ 0x19, 0x3c, 0xc8, 0xe7, 0xc3, 0xe0, 0x8b, 0xb3,
+	  0x0f, 0x54, 0x37, 0xaa, 0x27, 0xad, 0xe1, 0xf1,
+	  0x42, 0x36, 0x9b, 0x24, 0x6a, 0x67, 0x5b, 0x23,
+	  0x83, 0xe6, 0xda, 0x9b, 0x49, 0xa9, 0x80, 0x9e },
+	{ 0x5c, 0x10, 0x89, 0x6f, 0x0e, 0x28, 0x56, 0xb2,
+	  0xa2, 0xee, 0xe0, 0xfe, 0x4a, 0x2c, 0x16, 0x33,
+	  0x56, 0x5d, 0x18, 0xf0, 0xe9, 0x3e, 0x1f, 0xab,
+	  0x26, 0xc3, 0x73, 0xe8, 0xf8, 0x29, 0x65, 0x4d },
+	{ 0xf1, 0x60, 0x12, 0xd9, 0x3f, 0x28, 0x85, 0x1a,
+	  0x1e, 0xb9, 0x89, 0xf5, 0xd0, 0xb4, 0x3f, 0x3f,
+	  0x39, 0xca, 0x73, 0xc9, 0xa6, 0x2d, 0x51, 0x81,
+	  0xbf, 0xf2, 0x37, 0x53, 0x6b, 0xd3, 0x48, 0xc3 },
+	{ 0x29, 0x66, 0xb3, 0xcf, 0xae, 0x1e, 0x44, 0xea,
+	  0x99, 0x6d, 0xc5, 0xd6, 0x86, 0xcf, 0x25, 0xfa,
+	  0x05, 0x3f, 0xb6, 0xf6, 0x72, 0x01, 0xb9, 0xe4,
+	  0x6e, 0xad, 0xe8, 0x5d, 0x0a, 0xd6, 0xb8, 0x06 },
+	{ 0xdd, 0xb8, 0x78, 0x24, 0x85, 0xe9, 0x00, 0xbc,
+	  0x60, 0xbc, 0xf4, 0xc3, 0x3a, 0x6f, 0xd5, 0x85,
+	  0x68, 0x0c, 0xc6, 0x83, 0xd5, 0x16, 0xef, 0xa0,
+	  0x3e, 0xb9, 0x98, 0x5f, 0xad, 0x87, 0x15, 0xfb },
+	{ 0x4c, 0x4d, 0x6e, 0x71, 0xae, 0xa0, 0x57, 0x86,
+	  0x41, 0x31, 0x48, 0xfc, 0x7a, 0x78, 0x6b, 0x0e,
+	  0xca, 0xf5, 0x82, 0xcf, 0xf1, 0x20, 0x9f, 0x5a,
+	  0x80, 0x9f, 0xba, 0x85, 0x04, 0xce, 0x66, 0x2c },
+	{ 0xfb, 0x4c, 0x5e, 0x86, 0xd7, 0xb2, 0x22, 0x9b,
+	  0x99, 0xb8, 0xba, 0x6d, 0x94, 0xc2, 0x47, 0xef,
+	  0x96, 0x4a, 0xa3, 0xa2, 0xba, 0xe8, 0xed, 0xc7,
+	  0x75, 0x69, 0xf2, 0x8d, 0xbb, 0xff, 0x2d, 0x4e },
+	{ 0xe9, 0x4f, 0x52, 0x6d, 0xe9, 0x01, 0x96, 0x33,
+	  0xec, 0xd5, 0x4a, 0xc6, 0x12, 0x0f, 0x23, 0x95,
+	  0x8d, 0x77, 0x18, 0xf1, 0xe7, 0x71, 0x7b, 0xf3,
+	  0x29, 0x21, 0x1a, 0x4f, 0xae, 0xed, 0x4e, 0x6d },
+	{ 0xcb, 0xd6, 0x66, 0x0a, 0x10, 0xdb, 0x3f, 0x23,
+	  0xf7, 0xa0, 0x3d, 0x4b, 0x9d, 0x40, 0x44, 0xc7,
+	  0x93, 0x2b, 0x28, 0x01, 0xac, 0x89, 0xd6, 0x0b,
+	  0xc9, 0xeb, 0x92, 0xd6, 0x5a, 0x46, 0xc2, 0xa0 },
+	{ 0x88, 0x18, 0xbb, 0xd3, 0xdb, 0x4d, 0xc1, 0x23,
+	  0xb2, 0x5c, 0xbb, 0xa5, 0xf5, 0x4c, 0x2b, 0xc4,
+	  0xb3, 0xfc, 0xf9, 0xbf, 0x7d, 0x7a, 0x77, 0x09,
+	  0xf4, 0xae, 0x58, 0x8b, 0x26, 0x7c, 0x4e, 0xce },
+	{ 0xc6, 0x53, 0x82, 0x51, 0x3f, 0x07, 0x46, 0x0d,
+	  0xa3, 0x98, 0x33, 0xcb, 0x66, 0x6c, 0x5e, 0xd8,
+	  0x2e, 0x61, 0xb9, 0xe9, 0x98, 0xf4, 0xb0, 0xc4,
+	  0x28, 0x7c, 0xee, 0x56, 0xc3, 0xcc, 0x9b, 0xcd },
+	{ 0x89, 0x75, 0xb0, 0x57, 0x7f, 0xd3, 0x55, 0x66,
+	  0xd7, 0x50, 0xb3, 0x62, 0xb0, 0x89, 0x7a, 0x26,
+	  0xc3, 0x99, 0x13, 0x6d, 0xf0, 0x7b, 0xab, 0xab,
+	  0xbd, 0xe6, 0x20, 0x3f, 0xf2, 0x95, 0x4e, 0xd4 },
+	{ 0x21, 0xfe, 0x0c, 0xeb, 0x00, 0x52, 0xbe, 0x7f,
+	  0xb0, 0xf0, 0x04, 0x18, 0x7c, 0xac, 0xd7, 0xde,
+	  0x67, 0xfa, 0x6e, 0xb0, 0x93, 0x8d, 0x92, 0x76,
+	  0x77, 0xf2, 0x39, 0x8c, 0x13, 0x23, 0x17, 0xa8 },
+	{ 0x2e, 0xf7, 0x3f, 0x3c, 0x26, 0xf1, 0x2d, 0x93,
+	  0x88, 0x9f, 0x3c, 0x78, 0xb6, 0xa6, 0x6c, 0x1d,
+	  0x52, 0xb6, 0x49, 0xdc, 0x9e, 0x85, 0x6e, 0x2c,
+	  0x17, 0x2e, 0xa7, 0xc5, 0x8a, 0xc2, 0xb5, 0xe3 },
+	{ 0x38, 0x8a, 0x3c, 0xd5, 0x6d, 0x73, 0x86, 0x7a,
+	  0xbb, 0x5f, 0x84, 0x01, 0x49, 0x2b, 0x6e, 0x26,
+	  0x81, 0xeb, 0x69, 0x85, 0x1e, 0x76, 0x7f, 0xd8,
+	  0x42, 0x10, 0xa5, 0x60, 0x76, 0xfb, 0x3d, 0xd3 },
+	{ 0xaf, 0x53, 0x3e, 0x02, 0x2f, 0xc9, 0x43, 0x9e,
+	  0x4e, 0x3c, 0xb8, 0x38, 0xec, 0xd1, 0x86, 0x92,
+	  0x23, 0x2a, 0xdf, 0x6f, 0xe9, 0x83, 0x95, 0x26,
+	  0xd3, 0xc3, 0xdd, 0x1b, 0x71, 0x91, 0x0b, 0x1a },
+	{ 0x75, 0x1c, 0x09, 0xd4, 0x1a, 0x93, 0x43, 0x88,
+	  0x2a, 0x81, 0xcd, 0x13, 0xee, 0x40, 0x81, 0x8d,
+	  0x12, 0xeb, 0x44, 0xc6, 0xc7, 0xf4, 0x0d, 0xf1,
+	  0x6e, 0x4a, 0xea, 0x8f, 0xab, 0x91, 0x97, 0x2a },
+	{ 0x5b, 0x73, 0xdd, 0xb6, 0x8d, 0x9d, 0x2b, 0x0a,
+	  0xa2, 0x65, 0xa0, 0x79, 0x88, 0xd6, 0xb8, 0x8a,
+	  0xe9, 0xaa, 0xc5, 0x82, 0xaf, 0x83, 0x03, 0x2f,
+	  0x8a, 0x9b, 0x21, 0xa2, 0xe1, 0xb7, 0xbf, 0x18 },
+	{ 0x3d, 0xa2, 0x91, 0x26, 0xc7, 0xc5, 0xd7, 0xf4,
+	  0x3e, 0x64, 0x24, 0x2a, 0x79, 0xfe, 0xaa, 0x4e,
+	  0xf3, 0x45, 0x9c, 0xde, 0xcc, 0xc8, 0x98, 0xed,
+	  0x59, 0xa9, 0x7f, 0x6e, 0xc9, 0x3b, 0x9d, 0xab },
+	{ 0x56, 0x6d, 0xc9, 0x20, 0x29, 0x3d, 0xa5, 0xcb,
+	  0x4f, 0xe0, 0xaa, 0x8a, 0xbd, 0xa8, 0xbb, 0xf5,
+	  0x6f, 0x55, 0x23, 0x13, 0xbf, 0xf1, 0x90, 0x46,
+	  0x64, 0x1e, 0x36, 0x15, 0xc1, 0xe3, 0xed, 0x3f },
+	{ 0x41, 0x15, 0xbe, 0xa0, 0x2f, 0x73, 0xf9, 0x7f,
+	  0x62, 0x9e, 0x5c, 0x55, 0x90, 0x72, 0x0c, 0x01,
+	  0xe7, 0xe4, 0x49, 0xae, 0x2a, 0x66, 0x97, 0xd4,
+	  0xd2, 0x78, 0x33, 0x21, 0x30, 0x36, 0x92, 0xf9 },
+	{ 0x4c, 0xe0, 0x8f, 0x47, 0x62, 0x46, 0x8a, 0x76,
+	  0x70, 0x01, 0x21, 0x64, 0x87, 0x8d, 0x68, 0x34,
+	  0x0c, 0x52, 0xa3, 0x5e, 0x66, 0xc1, 0x88, 0x4d,
+	  0x5c, 0x86, 0x48, 0x89, 0xab, 0xc9, 0x66, 0x77 },
+	{ 0x81, 0xea, 0x0b, 0x78, 0x04, 0x12, 0x4e, 0x0c,
+	  0x22, 0xea, 0x5f, 0xc7, 0x11, 0x04, 0xa2, 0xaf,
+	  0xcb, 0x52, 0xa1, 0xfa, 0x81, 0x6f, 0x3e, 0xcb,
+	  0x7d, 0xcb, 0x5d, 0x9d, 0xea, 0x17, 0x86, 0xd0 },
+	{ 0xfe, 0x36, 0x27, 0x33, 0xb0, 0x5f, 0x6b, 0xed,
+	  0xaf, 0x93, 0x79, 0xd7, 0xf7, 0x93, 0x6e, 0xde,
+	  0x20, 0x9b, 0x1f, 0x83, 0x23, 0xc3, 0x92, 0x25,
+	  0x49, 0xd9, 0xe7, 0x36, 0x81, 0xb5, 0xdb, 0x7b },
+	{ 0xef, 0xf3, 0x7d, 0x30, 0xdf, 0xd2, 0x03, 0x59,
+	  0xbe, 0x4e, 0x73, 0xfd, 0xf4, 0x0d, 0x27, 0x73,
+	  0x4b, 0x3d, 0xf9, 0x0a, 0x97, 0xa5, 0x5e, 0xd7,
+	  0x45, 0x29, 0x72, 0x94, 0xca, 0x85, 0xd0, 0x9f },
+	{ 0x17, 0x2f, 0xfc, 0x67, 0x15, 0x3d, 0x12, 0xe0,
+	  0xca, 0x76, 0xa8, 0xb6, 0xcd, 0x5d, 0x47, 0x31,
+	  0x88, 0x5b, 0x39, 0xce, 0x0c, 0xac, 0x93, 0xa8,
+	  0x97, 0x2a, 0x18, 0x00, 0x6c, 0x8b, 0x8b, 0xaf },
+	{ 0xc4, 0x79, 0x57, 0xf1, 0xcc, 0x88, 0xe8, 0x3e,
+	  0xf9, 0x44, 0x58, 0x39, 0x70, 0x9a, 0x48, 0x0a,
+	  0x03, 0x6b, 0xed, 0x5f, 0x88, 0xac, 0x0f, 0xcc,
+	  0x8e, 0x1e, 0x70, 0x3f, 0xfa, 0xac, 0x13, 0x2c },
+	{ 0x30, 0xf3, 0x54, 0x83, 0x70, 0xcf, 0xdc, 0xed,
+	  0xa5, 0xc3, 0x7b, 0x56, 0x9b, 0x61, 0x75, 0xe7,
+	  0x99, 0xee, 0xf1, 0xa6, 0x2a, 0xaa, 0x94, 0x32,
+	  0x45, 0xae, 0x76, 0x69, 0xc2, 0x27, 0xa7, 0xb5 },
+	{ 0xc9, 0x5d, 0xcb, 0x3c, 0xf1, 0xf2, 0x7d, 0x0e,
+	  0xef, 0x2f, 0x25, 0xd2, 0x41, 0x38, 0x70, 0x90,
+	  0x4a, 0x87, 0x7c, 0x4a, 0x56, 0xc2, 0xde, 0x1e,
+	  0x83, 0xe2, 0xbc, 0x2a, 0xe2, 0xe4, 0x68, 0x21 },
+	{ 0xd5, 0xd0, 0xb5, 0xd7, 0x05, 0x43, 0x4c, 0xd4,
+	  0x6b, 0x18, 0x57, 0x49, 0xf6, 0x6b, 0xfb, 0x58,
+	  0x36, 0xdc, 0xdf, 0x6e, 0xe5, 0x49, 0xa2, 0xb7,
+	  0xa4, 0xae, 0xe7, 0xf5, 0x80, 0x07, 0xca, 0xaf },
+	{ 0xbb, 0xc1, 0x24, 0xa7, 0x12, 0xf1, 0x5d, 0x07,
+	  0xc3, 0x00, 0xe0, 0x5b, 0x66, 0x83, 0x89, 0xa4,
+	  0x39, 0xc9, 0x17, 0x77, 0xf7, 0x21, 0xf8, 0x32,
+	  0x0c, 0x1c, 0x90, 0x78, 0x06, 0x6d, 0x2c, 0x7e },
+	{ 0xa4, 0x51, 0xb4, 0x8c, 0x35, 0xa6, 0xc7, 0x85,
+	  0x4c, 0xfa, 0xae, 0x60, 0x26, 0x2e, 0x76, 0x99,
+	  0x08, 0x16, 0x38, 0x2a, 0xc0, 0x66, 0x7e, 0x5a,
+	  0x5c, 0x9e, 0x1b, 0x46, 0xc4, 0x34, 0x2d, 0xdf },
+	{ 0xb0, 0xd1, 0x50, 0xfb, 0x55, 0xe7, 0x78, 0xd0,
+	  0x11, 0x47, 0xf0, 0xb5, 0xd8, 0x9d, 0x99, 0xec,
+	  0xb2, 0x0f, 0xf0, 0x7e, 0x5e, 0x67, 0x60, 0xd6,
+	  0xb6, 0x45, 0xeb, 0x5b, 0x65, 0x4c, 0x62, 0x2b },
+	{ 0x34, 0xf7, 0x37, 0xc0, 0xab, 0x21, 0x99, 0x51,
+	  0xee, 0xe8, 0x9a, 0x9f, 0x8d, 0xac, 0x29, 0x9c,
+	  0x9d, 0x4c, 0x38, 0xf3, 0x3f, 0xa4, 0x94, 0xc5,
+	  0xc6, 0xee, 0xfc, 0x92, 0xb6, 0xdb, 0x08, 0xbc },
+	{ 0x1a, 0x62, 0xcc, 0x3a, 0x00, 0x80, 0x0d, 0xcb,
+	  0xd9, 0x98, 0x91, 0x08, 0x0c, 0x1e, 0x09, 0x84,
+	  0x58, 0x19, 0x3a, 0x8c, 0xc9, 0xf9, 0x70, 0xea,
+	  0x99, 0xfb, 0xef, 0xf0, 0x03, 0x18, 0xc2, 0x89 },
+	{ 0xcf, 0xce, 0x55, 0xeb, 0xaf, 0xc8, 0x40, 0xd7,
+	  0xae, 0x48, 0x28, 0x1c, 0x7f, 0xd5, 0x7e, 0xc8,
+	  0xb4, 0x82, 0xd4, 0xb7, 0x04, 0x43, 0x74, 0x95,
+	  0x49, 0x5a, 0xc4, 0x14, 0xcf, 0x4a, 0x37, 0x4b },
+	{ 0x67, 0x46, 0xfa, 0xcf, 0x71, 0x14, 0x6d, 0x99,
+	  0x9d, 0xab, 0xd0, 0x5d, 0x09, 0x3a, 0xe5, 0x86,
+	  0x64, 0x8d, 0x1e, 0xe2, 0x8e, 0x72, 0x61, 0x7b,
+	  0x99, 0xd0, 0xf0, 0x08, 0x6e, 0x1e, 0x45, 0xbf },
+	{ 0x57, 0x1c, 0xed, 0x28, 0x3b, 0x3f, 0x23, 0xb4,
+	  0xe7, 0x50, 0xbf, 0x12, 0xa2, 0xca, 0xf1, 0x78,
+	  0x18, 0x47, 0xbd, 0x89, 0x0e, 0x43, 0x60, 0x3c,
+	  0xdc, 0x59, 0x76, 0x10, 0x2b, 0x7b, 0xb1, 0x1b },
+	{ 0xcf, 0xcb, 0x76, 0x5b, 0x04, 0x8e, 0x35, 0x02,
+	  0x2c, 0x5d, 0x08, 0x9d, 0x26, 0xe8, 0x5a, 0x36,
+	  0xb0, 0x05, 0xa2, 0xb8, 0x04, 0x93, 0xd0, 0x3a,
+	  0x14, 0x4e, 0x09, 0xf4, 0x09, 0xb6, 0xaf, 0xd1 },
+	{ 0x40, 0x50, 0xc7, 0xa2, 0x77, 0x05, 0xbb, 0x27,
+	  0xf4, 0x20, 0x89, 0xb2, 0x99, 0xf3, 0xcb, 0xe5,
+	  0x05, 0x4e, 0xad, 0x68, 0x72, 0x7e, 0x8e, 0xf9,
+	  0x31, 0x8c, 0xe6, 0xf2, 0x5c, 0xd6, 0xf3, 0x1d },
+	{ 0x18, 0x40, 0x70, 0xbd, 0x5d, 0x26, 0x5f, 0xbd,
+	  0xc1, 0x42, 0xcd, 0x1c, 0x5c, 0xd0, 0xd7, 0xe4,
+	  0x14, 0xe7, 0x03, 0x69, 0xa2, 0x66, 0xd6, 0x27,
+	  0xc8, 0xfb, 0xa8, 0x4f, 0xa5, 0xe8, 0x4c, 0x34 },
+	{ 0x9e, 0xdd, 0xa9, 0xa4, 0x44, 0x39, 0x02, 0xa9,
+	  0x58, 0x8c, 0x0d, 0x0c, 0xcc, 0x62, 0xb9, 0x30,
+	  0x21, 0x84, 0x79, 0xa6, 0x84, 0x1e, 0x6f, 0xe7,
+	  0xd4, 0x30, 0x03, 0xf0, 0x4b, 0x1f, 0xd6, 0x43 },
+	{ 0xe4, 0x12, 0xfe, 0xef, 0x79, 0x08, 0x32, 0x4a,
+	  0x6d, 0xa1, 0x84, 0x16, 0x29, 0xf3, 0x5d, 0x3d,
+	  0x35, 0x86, 0x42, 0x01, 0x93, 0x10, 0xec, 0x57,
+	  0xc6, 0x14, 0x83, 0x6b, 0x63, 0xd3, 0x07, 0x63 },
+	{ 0x1a, 0x2b, 0x8e, 0xdf, 0xf3, 0xf9, 0xac, 0xc1,
+	  0x55, 0x4f, 0xcb, 0xae, 0x3c, 0xf1, 0xd6, 0x29,
+	  0x8c, 0x64, 0x62, 0xe2, 0x2e, 0x5e, 0xb0, 0x25,
+	  0x96, 0x84, 0xf8, 0x35, 0x01, 0x2b, 0xd1, 0x3f },
+	{ 0x28, 0x8c, 0x4a, 0xd9, 0xb9, 0x40, 0x97, 0x62,
+	  0xea, 0x07, 0xc2, 0x4a, 0x41, 0xf0, 0x4f, 0x69,
+	  0xa7, 0xd7, 0x4b, 0xee, 0x2d, 0x95, 0x43, 0x53,
+	  0x74, 0xbd, 0xe9, 0x46, 0xd7, 0x24, 0x1c, 0x7b },
+	{ 0x80, 0x56, 0x91, 0xbb, 0x28, 0x67, 0x48, 0xcf,
+	  0xb5, 0x91, 0xd3, 0xae, 0xbe, 0x7e, 0x6f, 0x4e,
+	  0x4d, 0xc6, 0xe2, 0x80, 0x8c, 0x65, 0x14, 0x3c,
+	  0xc0, 0x04, 0xe4, 0xeb, 0x6f, 0xd0, 0x9d, 0x43 },
+	{ 0xd4, 0xac, 0x8d, 0x3a, 0x0a, 0xfc, 0x6c, 0xfa,
+	  0x7b, 0x46, 0x0a, 0xe3, 0x00, 0x1b, 0xae, 0xb3,
+	  0x6d, 0xad, 0xb3, 0x7d, 0xa0, 0x7d, 0x2e, 0x8a,
+	  0xc9, 0x18, 0x22, 0xdf, 0x34, 0x8a, 0xed, 0x3d },
+	{ 0xc3, 0x76, 0x61, 0x70, 0x14, 0xd2, 0x01, 0x58,
+	  0xbc, 0xed, 0x3d, 0x3b, 0xa5, 0x52, 0xb6, 0xec,
+	  0xcf, 0x84, 0xe6, 0x2a, 0xa3, 0xeb, 0x65, 0x0e,
+	  0x90, 0x02, 0x9c, 0x84, 0xd1, 0x3e, 0xea, 0x69 },
+	{ 0xc4, 0x1f, 0x09, 0xf4, 0x3c, 0xec, 0xae, 0x72,
+	  0x93, 0xd6, 0x00, 0x7c, 0xa0, 0xa3, 0x57, 0x08,
+	  0x7d, 0x5a, 0xe5, 0x9b, 0xe5, 0x00, 0xc1, 0xcd,
+	  0x5b, 0x28, 0x9e, 0xe8, 0x10, 0xc7, 0xb0, 0x82 },
+	{ 0x03, 0xd1, 0xce, 0xd1, 0xfb, 0xa5, 0xc3, 0x91,
+	  0x55, 0xc4, 0x4b, 0x77, 0x65, 0xcb, 0x76, 0x0c,
+	  0x78, 0x70, 0x8d, 0xcf, 0xc8, 0x0b, 0x0b, 0xd8,
+	  0xad, 0xe3, 0xa5, 0x6d, 0xa8, 0x83, 0x0b, 0x29 },
+	{ 0x09, 0xbd, 0xe6, 0xf1, 0x52, 0x21, 0x8d, 0xc9,
+	  0x2c, 0x41, 0xd7, 0xf4, 0x53, 0x87, 0xe6, 0x3e,
+	  0x58, 0x69, 0xd8, 0x07, 0xec, 0x70, 0xb8, 0x21,
+	  0x40, 0x5d, 0xbd, 0x88, 0x4b, 0x7f, 0xcf, 0x4b },
+	{ 0x71, 0xc9, 0x03, 0x6e, 0x18, 0x17, 0x9b, 0x90,
+	  0xb3, 0x7d, 0x39, 0xe9, 0xf0, 0x5e, 0xb8, 0x9c,
+	  0xc5, 0xfc, 0x34, 0x1f, 0xd7, 0xc4, 0x77, 0xd0,
+	  0xd7, 0x49, 0x32, 0x85, 0xfa, 0xca, 0x08, 0xa4 },
+	{ 0x59, 0x16, 0x83, 0x3e, 0xbb, 0x05, 0xcd, 0x91,
+	  0x9c, 0xa7, 0xfe, 0x83, 0xb6, 0x92, 0xd3, 0x20,
+	  0x5b, 0xef, 0x72, 0x39, 0x2b, 0x2c, 0xf6, 0xbb,
+	  0x0a, 0x6d, 0x43, 0xf9, 0x94, 0xf9, 0x5f, 0x11 },
+	{ 0xf6, 0x3a, 0xab, 0x3e, 0xc6, 0x41, 0xb3, 0xb0,
+	  0x24, 0x96, 0x4c, 0x2b, 0x43, 0x7c, 0x04, 0xf6,
+	  0x04, 0x3c, 0x4c, 0x7e, 0x02, 0x79, 0x23, 0x99,
+	  0x95, 0x40, 0x19, 0x58, 0xf8, 0x6b, 0xbe, 0x54 },
+	{ 0xf1, 0x72, 0xb1, 0x80, 0xbf, 0xb0, 0x97, 0x40,
+	  0x49, 0x31, 0x20, 0xb6, 0x32, 0x6c, 0xbd, 0xc5,
+	  0x61, 0xe4, 0x77, 0xde, 0xf9, 0xbb, 0xcf, 0xd2,
+	  0x8c, 0xc8, 0xc1, 0xc5, 0xe3, 0x37, 0x9a, 0x31 },
+	{ 0xcb, 0x9b, 0x89, 0xcc, 0x18, 0x38, 0x1d, 0xd9,
+	  0x14, 0x1a, 0xde, 0x58, 0x86, 0x54, 0xd4, 0xe6,
+	  0xa2, 0x31, 0xd5, 0xbf, 0x49, 0xd4, 0xd5, 0x9a,
+	  0xc2, 0x7d, 0x86, 0x9c, 0xbe, 0x10, 0x0c, 0xf3 },
+	{ 0x7b, 0xd8, 0x81, 0x50, 0x46, 0xfd, 0xd8, 0x10,
+	  0xa9, 0x23, 0xe1, 0x98, 0x4a, 0xae, 0xbd, 0xcd,
+	  0xf8, 0x4d, 0x87, 0xc8, 0x99, 0x2d, 0x68, 0xb5,
+	  0xee, 0xb4, 0x60, 0xf9, 0x3e, 0xb3, 0xc8, 0xd7 },
+	{ 0x60, 0x7b, 0xe6, 0x68, 0x62, 0xfd, 0x08, 0xee,
+	  0x5b, 0x19, 0xfa, 0xca, 0xc0, 0x9d, 0xfd, 0xbc,
+	  0xd4, 0x0c, 0x31, 0x21, 0x01, 0xd6, 0x6e, 0x6e,
+	  0xbd, 0x2b, 0x84, 0x1f, 0x1b, 0x9a, 0x93, 0x25 },
+	{ 0x9f, 0xe0, 0x3b, 0xbe, 0x69, 0xab, 0x18, 0x34,
+	  0xf5, 0x21, 0x9b, 0x0d, 0xa8, 0x8a, 0x08, 0xb3,
+	  0x0a, 0x66, 0xc5, 0x91, 0x3f, 0x01, 0x51, 0x96,
+	  0x3c, 0x36, 0x05, 0x60, 0xdb, 0x03, 0x87, 0xb3 },
+	{ 0x90, 0xa8, 0x35, 0x85, 0x71, 0x7b, 0x75, 0xf0,
+	  0xe9, 0xb7, 0x25, 0xe0, 0x55, 0xee, 0xee, 0xb9,
+	  0xe7, 0xa0, 0x28, 0xea, 0x7e, 0x6c, 0xbc, 0x07,
+	  0xb2, 0x09, 0x17, 0xec, 0x03, 0x63, 0xe3, 0x8c },
+	{ 0x33, 0x6e, 0xa0, 0x53, 0x0f, 0x4a, 0x74, 0x69,
+	  0x12, 0x6e, 0x02, 0x18, 0x58, 0x7e, 0xbb, 0xde,
+	  0x33, 0x58, 0xa0, 0xb3, 0x1c, 0x29, 0xd2, 0x00,
+	  0xf7, 0xdc, 0x7e, 0xb1, 0x5c, 0x6a, 0xad, 0xd8 },
+	{ 0xa7, 0x9e, 0x76, 0xdc, 0x0a, 0xbc, 0xa4, 0x39,
+	  0x6f, 0x07, 0x47, 0xcd, 0x7b, 0x74, 0x8d, 0xf9,
+	  0x13, 0x00, 0x76, 0x26, 0xb1, 0xd6, 0x59, 0xda,
+	  0x0c, 0x1f, 0x78, 0xb9, 0x30, 0x3d, 0x01, 0xa3 },
+	{ 0x44, 0xe7, 0x8a, 0x77, 0x37, 0x56, 0xe0, 0x95,
+	  0x15, 0x19, 0x50, 0x4d, 0x70, 0x38, 0xd2, 0x8d,
+	  0x02, 0x13, 0xa3, 0x7e, 0x0c, 0xe3, 0x75, 0x37,
+	  0x17, 0x57, 0xbc, 0x99, 0x63, 0x11, 0xe3, 0xb8 },
+	{ 0x77, 0xac, 0x01, 0x2a, 0x3f, 0x75, 0x4d, 0xcf,
+	  0xea, 0xb5, 0xeb, 0x99, 0x6b, 0xe9, 0xcd, 0x2d,
+	  0x1f, 0x96, 0x11, 0x1b, 0x6e, 0x49, 0xf3, 0x99,
+	  0x4d, 0xf1, 0x81, 0xf2, 0x85, 0x69, 0xd8, 0x25 },
+	{ 0xce, 0x5a, 0x10, 0xdb, 0x6f, 0xcc, 0xda, 0xf1,
+	  0x40, 0xaa, 0xa4, 0xde, 0xd6, 0x25, 0x0a, 0x9c,
+	  0x06, 0xe9, 0x22, 0x2b, 0xc9, 0xf9, 0xf3, 0x65,
+	  0x8a, 0x4a, 0xff, 0x93, 0x5f, 0x2b, 0x9f, 0x3a },
+	{ 0xec, 0xc2, 0x03, 0xa7, 0xfe, 0x2b, 0xe4, 0xab,
+	  0xd5, 0x5b, 0xb5, 0x3e, 0x6e, 0x67, 0x35, 0x72,
+	  0xe0, 0x07, 0x8d, 0xa8, 0xcd, 0x37, 0x5e, 0xf4,
+	  0x30, 0xcc, 0x97, 0xf9, 0xf8, 0x00, 0x83, 0xaf },
+	{ 0x14, 0xa5, 0x18, 0x6d, 0xe9, 0xd7, 0xa1, 0x8b,
+	  0x04, 0x12, 0xb8, 0x56, 0x3e, 0x51, 0xcc, 0x54,
+	  0x33, 0x84, 0x0b, 0x4a, 0x12, 0x9a, 0x8f, 0xf9,
+	  0x63, 0xb3, 0x3a, 0x3c, 0x4a, 0xfe, 0x8e, 0xbb },
+	{ 0x13, 0xf8, 0xef, 0x95, 0xcb, 0x86, 0xe6, 0xa6,
+	  0x38, 0x93, 0x1c, 0x8e, 0x10, 0x76, 0x73, 0xeb,
+	  0x76, 0xba, 0x10, 0xd7, 0xc2, 0xcd, 0x70, 0xb9,
+	  0xd9, 0x92, 0x0b, 0xbe, 0xed, 0x92, 0x94, 0x09 },
+	{ 0x0b, 0x33, 0x8f, 0x4e, 0xe1, 0x2f, 0x2d, 0xfc,
+	  0xb7, 0x87, 0x13, 0x37, 0x79, 0x41, 0xe0, 0xb0,
+	  0x63, 0x21, 0x52, 0x58, 0x1d, 0x13, 0x32, 0x51,
+	  0x6e, 0x4a, 0x2c, 0xab, 0x19, 0x42, 0xcc, 0xa4 },
+	{ 0xea, 0xab, 0x0e, 0xc3, 0x7b, 0x3b, 0x8a, 0xb7,
+	  0x96, 0xe9, 0xf5, 0x72, 0x38, 0xde, 0x14, 0xa2,
+	  0x64, 0xa0, 0x76, 0xf3, 0x88, 0x7d, 0x86, 0xe2,
+	  0x9b, 0xb5, 0x90, 0x6d, 0xb5, 0xa0, 0x0e, 0x02 },
+	{ 0x23, 0xcb, 0x68, 0xb8, 0xc0, 0xe6, 0xdc, 0x26,
+	  0xdc, 0x27, 0x76, 0x6d, 0xdc, 0x0a, 0x13, 0xa9,
+	  0x94, 0x38, 0xfd, 0x55, 0x61, 0x7a, 0xa4, 0x09,
+	  0x5d, 0x8f, 0x96, 0x97, 0x20, 0xc8, 0x72, 0xdf },
+	{ 0x09, 0x1d, 0x8e, 0xe3, 0x0d, 0x6f, 0x29, 0x68,
+	  0xd4, 0x6b, 0x68, 0x7d, 0xd6, 0x52, 0x92, 0x66,
+	  0x57, 0x42, 0xde, 0x0b, 0xb8, 0x3d, 0xcc, 0x00,
+	  0x04, 0xc7, 0x2c, 0xe1, 0x00, 0x07, 0xa5, 0x49 },
+	{ 0x7f, 0x50, 0x7a, 0xbc, 0x6d, 0x19, 0xba, 0x00,
+	  0xc0, 0x65, 0xa8, 0x76, 0xec, 0x56, 0x57, 0x86,
+	  0x88, 0x82, 0xd1, 0x8a, 0x22, 0x1b, 0xc4, 0x6c,
+	  0x7a, 0x69, 0x12, 0x54, 0x1f, 0x5b, 0xc7, 0xba },
+	{ 0xa0, 0x60, 0x7c, 0x24, 0xe1, 0x4e, 0x8c, 0x22,
+	  0x3d, 0xb0, 0xd7, 0x0b, 0x4d, 0x30, 0xee, 0x88,
+	  0x01, 0x4d, 0x60, 0x3f, 0x43, 0x7e, 0x9e, 0x02,
+	  0xaa, 0x7d, 0xaf, 0xa3, 0xcd, 0xfb, 0xad, 0x94 },
+	{ 0xdd, 0xbf, 0xea, 0x75, 0xcc, 0x46, 0x78, 0x82,
+	  0xeb, 0x34, 0x83, 0xce, 0x5e, 0x2e, 0x75, 0x6a,
+	  0x4f, 0x47, 0x01, 0xb7, 0x6b, 0x44, 0x55, 0x19,
+	  0xe8, 0x9f, 0x22, 0xd6, 0x0f, 0xa8, 0x6e, 0x06 },
+	{ 0x0c, 0x31, 0x1f, 0x38, 0xc3, 0x5a, 0x4f, 0xb9,
+	  0x0d, 0x65, 0x1c, 0x28, 0x9d, 0x48, 0x68, 0x56,
+	  0xcd, 0x14, 0x13, 0xdf, 0x9b, 0x06, 0x77, 0xf5,
+	  0x3e, 0xce, 0x2c, 0xd9, 0xe4, 0x77, 0xc6, 0x0a },
+	{ 0x46, 0xa7, 0x3a, 0x8d, 0xd3, 0xe7, 0x0f, 0x59,
+	  0xd3, 0x94, 0x2c, 0x01, 0xdf, 0x59, 0x9d, 0xef,
+	  0x78, 0x3c, 0x9d, 0xa8, 0x2f, 0xd8, 0x32, 0x22,
+	  0xcd, 0x66, 0x2b, 0x53, 0xdc, 0xe7, 0xdb, 0xdf },
+	{ 0xad, 0x03, 0x8f, 0xf9, 0xb1, 0x4d, 0xe8, 0x4a,
+	  0x80, 0x1e, 0x4e, 0x62, 0x1c, 0xe5, 0xdf, 0x02,
+	  0x9d, 0xd9, 0x35, 0x20, 0xd0, 0xc2, 0xfa, 0x38,
+	  0xbf, 0xf1, 0x76, 0xa8, 0xb1, 0xd1, 0x69, 0x8c },
+	{ 0xab, 0x70, 0xc5, 0xdf, 0xbd, 0x1e, 0xa8, 0x17,
+	  0xfe, 0xd0, 0xcd, 0x06, 0x72, 0x93, 0xab, 0xf3,
+	  0x19, 0xe5, 0xd7, 0x90, 0x1c, 0x21, 0x41, 0xd5,
+	  0xd9, 0x9b, 0x23, 0xf0, 0x3a, 0x38, 0xe7, 0x48 },
+	{ 0x1f, 0xff, 0xda, 0x67, 0x93, 0x2b, 0x73, 0xc8,
+	  0xec, 0xaf, 0x00, 0x9a, 0x34, 0x91, 0xa0, 0x26,
+	  0x95, 0x3b, 0xab, 0xfe, 0x1f, 0x66, 0x3b, 0x06,
+	  0x97, 0xc3, 0xc4, 0xae, 0x8b, 0x2e, 0x7d, 0xcb },
+	{ 0xb0, 0xd2, 0xcc, 0x19, 0x47, 0x2d, 0xd5, 0x7f,
+	  0x2b, 0x17, 0xef, 0xc0, 0x3c, 0x8d, 0x58, 0xc2,
+	  0x28, 0x3d, 0xbb, 0x19, 0xda, 0x57, 0x2f, 0x77,
+	  0x55, 0x85, 0x5a, 0xa9, 0x79, 0x43, 0x17, 0xa0 },
+	{ 0xa0, 0xd1, 0x9a, 0x6e, 0xe3, 0x39, 0x79, 0xc3,
+	  0x25, 0x51, 0x0e, 0x27, 0x66, 0x22, 0xdf, 0x41,
+	  0xf7, 0x15, 0x83, 0xd0, 0x75, 0x01, 0xb8, 0x70,
+	  0x71, 0x12, 0x9a, 0x0a, 0xd9, 0x47, 0x32, 0xa5 },
+	{ 0x72, 0x46, 0x42, 0xa7, 0x03, 0x2d, 0x10, 0x62,
+	  0xb8, 0x9e, 0x52, 0xbe, 0xa3, 0x4b, 0x75, 0xdf,
+	  0x7d, 0x8f, 0xe7, 0x72, 0xd9, 0xfe, 0x3c, 0x93,
+	  0xdd, 0xf3, 0xc4, 0x54, 0x5a, 0xb5, 0xa9, 0x9b },
+	{ 0xad, 0xe5, 0xea, 0xa7, 0xe6, 0x1f, 0x67, 0x2d,
+	  0x58, 0x7e, 0xa0, 0x3d, 0xae, 0x7d, 0x7b, 0x55,
+	  0x22, 0x9c, 0x01, 0xd0, 0x6b, 0xc0, 0xa5, 0x70,
+	  0x14, 0x36, 0xcb, 0xd1, 0x83, 0x66, 0xa6, 0x26 },
+	{ 0x01, 0x3b, 0x31, 0xeb, 0xd2, 0x28, 0xfc, 0xdd,
+	  0xa5, 0x1f, 0xab, 0xb0, 0x3b, 0xb0, 0x2d, 0x60,
+	  0xac, 0x20, 0xca, 0x21, 0x5a, 0xaf, 0xa8, 0x3b,
+	  0xdd, 0x85, 0x5e, 0x37, 0x55, 0xa3, 0x5f, 0x0b },
+	{ 0x33, 0x2e, 0xd4, 0x0b, 0xb1, 0x0d, 0xde, 0x3c,
+	  0x95, 0x4a, 0x75, 0xd7, 0xb8, 0x99, 0x9d, 0x4b,
+	  0x26, 0xa1, 0xc0, 0x63, 0xc1, 0xdc, 0x6e, 0x32,
+	  0xc1, 0xd9, 0x1b, 0xab, 0x7b, 0xbb, 0x7d, 0x16 },
+	{ 0xc7, 0xa1, 0x97, 0xb3, 0xa0, 0x5b, 0x56, 0x6b,
+	  0xcc, 0x9f, 0xac, 0xd2, 0x0e, 0x44, 0x1d, 0x6f,
+	  0x6c, 0x28, 0x60, 0xac, 0x96, 0x51, 0xcd, 0x51,
+	  0xd6, 0xb9, 0xd2, 0xcd, 0xee, 0xea, 0x03, 0x90 },
+	{ 0xbd, 0x9c, 0xf6, 0x4e, 0xa8, 0x95, 0x3c, 0x03,
+	  0x71, 0x08, 0xe6, 0xf6, 0x54, 0x91, 0x4f, 0x39,
+	  0x58, 0xb6, 0x8e, 0x29, 0xc1, 0x67, 0x00, 0xdc,
+	  0x18, 0x4d, 0x94, 0xa2, 0x17, 0x08, 0xff, 0x60 },
+	{ 0x88, 0x35, 0xb0, 0xac, 0x02, 0x11, 0x51, 0xdf,
+	  0x71, 0x64, 0x74, 0xce, 0x27, 0xce, 0x4d, 0x3c,
+	  0x15, 0xf0, 0xb2, 0xda, 0xb4, 0x80, 0x03, 0xcf,
+	  0x3f, 0x3e, 0xfd, 0x09, 0x45, 0x10, 0x6b, 0x9a },
+	{ 0x3b, 0xfe, 0xfa, 0x33, 0x01, 0xaa, 0x55, 0xc0,
+	  0x80, 0x19, 0x0c, 0xff, 0xda, 0x8e, 0xae, 0x51,
+	  0xd9, 0xaf, 0x48, 0x8b, 0x4c, 0x1f, 0x24, 0xc3,
+	  0xd9, 0xa7, 0x52, 0x42, 0xfd, 0x8e, 0xa0, 0x1d },
+	{ 0x08, 0x28, 0x4d, 0x14, 0x99, 0x3c, 0xd4, 0x7d,
+	  0x53, 0xeb, 0xae, 0xcf, 0x0d, 0xf0, 0x47, 0x8c,
+	  0xc1, 0x82, 0xc8, 0x9c, 0x00, 0xe1, 0x85, 0x9c,
+	  0x84, 0x85, 0x16, 0x86, 0xdd, 0xf2, 0xc1, 0xb7 },
+	{ 0x1e, 0xd7, 0xef, 0x9f, 0x04, 0xc2, 0xac, 0x8d,
+	  0xb6, 0xa8, 0x64, 0xdb, 0x13, 0x10, 0x87, 0xf2,
+	  0x70, 0x65, 0x09, 0x8e, 0x69, 0xc3, 0xfe, 0x78,
+	  0x71, 0x8d, 0x9b, 0x94, 0x7f, 0x4a, 0x39, 0xd0 },
+	{ 0xc1, 0x61, 0xf2, 0xdc, 0xd5, 0x7e, 0x9c, 0x14,
+	  0x39, 0xb3, 0x1a, 0x9d, 0xd4, 0x3d, 0x8f, 0x3d,
+	  0x7d, 0xd8, 0xf0, 0xeb, 0x7c, 0xfa, 0xc6, 0xfb,
+	  0x25, 0xa0, 0xf2, 0x8e, 0x30, 0x6f, 0x06, 0x61 },
+	{ 0xc0, 0x19, 0x69, 0xad, 0x34, 0xc5, 0x2c, 0xaf,
+	  0x3d, 0xc4, 0xd8, 0x0d, 0x19, 0x73, 0x5c, 0x29,
+	  0x73, 0x1a, 0xc6, 0xe7, 0xa9, 0x20, 0x85, 0xab,
+	  0x92, 0x50, 0xc4, 0x8d, 0xea, 0x48, 0xa3, 0xfc },
+	{ 0x17, 0x20, 0xb3, 0x65, 0x56, 0x19, 0xd2, 0xa5,
+	  0x2b, 0x35, 0x21, 0xae, 0x0e, 0x49, 0xe3, 0x45,
+	  0xcb, 0x33, 0x89, 0xeb, 0xd6, 0x20, 0x8a, 0xca,
+	  0xf9, 0xf1, 0x3f, 0xda, 0xcc, 0xa8, 0xbe, 0x49 },
+	{ 0x75, 0x62, 0x88, 0x36, 0x1c, 0x83, 0xe2, 0x4c,
+	  0x61, 0x7c, 0xf9, 0x5c, 0x90, 0x5b, 0x22, 0xd0,
+	  0x17, 0xcd, 0xc8, 0x6f, 0x0b, 0xf1, 0xd6, 0x58,
+	  0xf4, 0x75, 0x6c, 0x73, 0x79, 0x87, 0x3b, 0x7f },
+	{ 0xe7, 0xd0, 0xed, 0xa3, 0x45, 0x26, 0x93, 0xb7,
+	  0x52, 0xab, 0xcd, 0xa1, 0xb5, 0x5e, 0x27, 0x6f,
+	  0x82, 0x69, 0x8f, 0x5f, 0x16, 0x05, 0x40, 0x3e,
+	  0xff, 0x83, 0x0b, 0xea, 0x00, 0x71, 0xa3, 0x94 },
+	{ 0x2c, 0x82, 0xec, 0xaa, 0x6b, 0x84, 0x80, 0x3e,
+	  0x04, 0x4a, 0xf6, 0x31, 0x18, 0xaf, 0xe5, 0x44,
+	  0x68, 0x7c, 0xb6, 0xe6, 0xc7, 0xdf, 0x49, 0xed,
+	  0x76, 0x2d, 0xfd, 0x7c, 0x86, 0x93, 0xa1, 0xbc },
+	{ 0x61, 0x36, 0xcb, 0xf4, 0xb4, 0x41, 0x05, 0x6f,
+	  0xa1, 0xe2, 0x72, 0x24, 0x98, 0x12, 0x5d, 0x6d,
+	  0xed, 0x45, 0xe1, 0x7b, 0x52, 0x14, 0x39, 0x59,
+	  0xc7, 0xf4, 0xd4, 0xe3, 0x95, 0x21, 0x8a, 0xc2 },
+	{ 0x72, 0x1d, 0x32, 0x45, 0xaa, 0xfe, 0xf2, 0x7f,
+	  0x6a, 0x62, 0x4f, 0x47, 0x95, 0x4b, 0x6c, 0x25,
+	  0x50, 0x79, 0x52, 0x6f, 0xfa, 0x25, 0xe9, 0xff,
+	  0x77, 0xe5, 0xdc, 0xff, 0x47, 0x3b, 0x15, 0x97 },
+	{ 0x9d, 0xd2, 0xfb, 0xd8, 0xce, 0xf1, 0x6c, 0x35,
+	  0x3c, 0x0a, 0xc2, 0x11, 0x91, 0xd5, 0x09, 0xeb,
+	  0x28, 0xdd, 0x9e, 0x3e, 0x0d, 0x8c, 0xea, 0x5d,
+	  0x26, 0xca, 0x83, 0x93, 0x93, 0x85, 0x1c, 0x3a },
+	{ 0xb2, 0x39, 0x4c, 0xea, 0xcd, 0xeb, 0xf2, 0x1b,
+	  0xf9, 0xdf, 0x2c, 0xed, 0x98, 0xe5, 0x8f, 0x1c,
+	  0x3a, 0x4b, 0xbb, 0xff, 0x66, 0x0d, 0xd9, 0x00,
+	  0xf6, 0x22, 0x02, 0xd6, 0x78, 0x5c, 0xc4, 0x6e },
+	{ 0x57, 0x08, 0x9f, 0x22, 0x27, 0x49, 0xad, 0x78,
+	  0x71, 0x76, 0x5f, 0x06, 0x2b, 0x11, 0x4f, 0x43,
+	  0xba, 0x20, 0xec, 0x56, 0x42, 0x2a, 0x8b, 0x1e,
+	  0x3f, 0x87, 0x19, 0x2c, 0x0e, 0xa7, 0x18, 0xc6 },
+	{ 0xe4, 0x9a, 0x94, 0x59, 0x96, 0x1c, 0xd3, 0x3c,
+	  0xdf, 0x4a, 0xae, 0x1b, 0x10, 0x78, 0xa5, 0xde,
+	  0xa7, 0xc0, 0x40, 0xe0, 0xfe, 0xa3, 0x40, 0xc9,
+	  0x3a, 0x72, 0x48, 0x72, 0xfc, 0x4a, 0xf8, 0x06 },
+	{ 0xed, 0xe6, 0x7f, 0x72, 0x0e, 0xff, 0xd2, 0xca,
+	  0x9c, 0x88, 0x99, 0x41, 0x52, 0xd0, 0x20, 0x1d,
+	  0xee, 0x6b, 0x0a, 0x2d, 0x2c, 0x07, 0x7a, 0xca,
+	  0x6d, 0xae, 0x29, 0xf7, 0x3f, 0x8b, 0x63, 0x09 },
+	{ 0xe0, 0xf4, 0x34, 0xbf, 0x22, 0xe3, 0x08, 0x80,
+	  0x39, 0xc2, 0x1f, 0x71, 0x9f, 0xfc, 0x67, 0xf0,
+	  0xf2, 0xcb, 0x5e, 0x98, 0xa7, 0xa0, 0x19, 0x4c,
+	  0x76, 0xe9, 0x6b, 0xf4, 0xe8, 0xe1, 0x7e, 0x61 },
+	{ 0x27, 0x7c, 0x04, 0xe2, 0x85, 0x34, 0x84, 0xa4,
+	  0xeb, 0xa9, 0x10, 0xad, 0x33, 0x6d, 0x01, 0xb4,
+	  0x77, 0xb6, 0x7c, 0xc2, 0x00, 0xc5, 0x9f, 0x3c,
+	  0x8d, 0x77, 0xee, 0xf8, 0x49, 0x4f, 0x29, 0xcd },
+	{ 0x15, 0x6d, 0x57, 0x47, 0xd0, 0xc9, 0x9c, 0x7f,
+	  0x27, 0x09, 0x7d, 0x7b, 0x7e, 0x00, 0x2b, 0x2e,
+	  0x18, 0x5c, 0xb7, 0x2d, 0x8d, 0xd7, 0xeb, 0x42,
+	  0x4a, 0x03, 0x21, 0x52, 0x81, 0x61, 0x21, 0x9f },
+	{ 0x20, 0xdd, 0xd1, 0xed, 0x9b, 0x1c, 0xa8, 0x03,
+	  0x94, 0x6d, 0x64, 0xa8, 0x3a, 0xe4, 0x65, 0x9d,
+	  0xa6, 0x7f, 0xba, 0x7a, 0x1a, 0x3e, 0xdd, 0xb1,
+	  0xe1, 0x03, 0xc0, 0xf5, 0xe0, 0x3e, 0x3a, 0x2c },
+	{ 0xf0, 0xaf, 0x60, 0x4d, 0x3d, 0xab, 0xbf, 0x9a,
+	  0x0f, 0x2a, 0x7d, 0x3d, 0xda, 0x6b, 0xd3, 0x8b,
+	  0xba, 0x72, 0xc6, 0xd0, 0x9b, 0xe4, 0x94, 0xfc,
+	  0xef, 0x71, 0x3f, 0xf1, 0x01, 0x89, 0xb6, 0xe6 },
+	{ 0x98, 0x02, 0xbb, 0x87, 0xde, 0xf4, 0xcc, 0x10,
+	  0xc4, 0xa5, 0xfd, 0x49, 0xaa, 0x58, 0xdf, 0xe2,
+	  0xf3, 0xfd, 0xdb, 0x46, 0xb4, 0x70, 0x88, 0x14,
+	  0xea, 0xd8, 0x1d, 0x23, 0xba, 0x95, 0x13, 0x9b },
+	{ 0x4f, 0x8c, 0xe1, 0xe5, 0x1d, 0x2f, 0xe7, 0xf2,
+	  0x40, 0x43, 0xa9, 0x04, 0xd8, 0x98, 0xeb, 0xfc,
+	  0x91, 0x97, 0x54, 0x18, 0x75, 0x34, 0x13, 0xaa,
+	  0x09, 0x9b, 0x79, 0x5e, 0xcb, 0x35, 0xce, 0xdb },
+	{ 0xbd, 0xdc, 0x65, 0x14, 0xd7, 0xee, 0x6a, 0xce,
+	  0x0a, 0x4a, 0xc1, 0xd0, 0xe0, 0x68, 0x11, 0x22,
+	  0x88, 0xcb, 0xcf, 0x56, 0x04, 0x54, 0x64, 0x27,
+	  0x05, 0x63, 0x01, 0x77, 0xcb, 0xa6, 0x08, 0xbd },
+	{ 0xd6, 0x35, 0x99, 0x4f, 0x62, 0x91, 0x51, 0x7b,
+	  0x02, 0x81, 0xff, 0xdd, 0x49, 0x6a, 0xfa, 0x86,
+	  0x27, 0x12, 0xe5, 0xb3, 0xc4, 0xe5, 0x2e, 0x4c,
+	  0xd5, 0xfd, 0xae, 0x8c, 0x0e, 0x72, 0xfb, 0x08 },
+	{ 0x87, 0x8d, 0x9c, 0xa6, 0x00, 0xcf, 0x87, 0xe7,
+	  0x69, 0xcc, 0x30, 0x5c, 0x1b, 0x35, 0x25, 0x51,
+	  0x86, 0x61, 0x5a, 0x73, 0xa0, 0xda, 0x61, 0x3b,
+	  0x5f, 0x1c, 0x98, 0xdb, 0xf8, 0x12, 0x83, 0xea },
+	{ 0xa6, 0x4e, 0xbe, 0x5d, 0xc1, 0x85, 0xde, 0x9f,
+	  0xdd, 0xe7, 0x60, 0x7b, 0x69, 0x98, 0x70, 0x2e,
+	  0xb2, 0x34, 0x56, 0x18, 0x49, 0x57, 0x30, 0x7d,
+	  0x2f, 0xa7, 0x2e, 0x87, 0xa4, 0x77, 0x02, 0xd6 },
+	{ 0xce, 0x50, 0xea, 0xb7, 0xb5, 0xeb, 0x52, 0xbd,
+	  0xc9, 0xad, 0x8e, 0x5a, 0x48, 0x0a, 0xb7, 0x80,
+	  0xca, 0x93, 0x20, 0xe4, 0x43, 0x60, 0xb1, 0xfe,
+	  0x37, 0xe0, 0x3f, 0x2f, 0x7a, 0xd7, 0xde, 0x01 },
+	{ 0xee, 0xdd, 0xb7, 0xc0, 0xdb, 0x6e, 0x30, 0xab,
+	  0xe6, 0x6d, 0x79, 0xe3, 0x27, 0x51, 0x1e, 0x61,
+	  0xfc, 0xeb, 0xbc, 0x29, 0xf1, 0x59, 0xb4, 0x0a,
+	  0x86, 0xb0, 0x46, 0xec, 0xf0, 0x51, 0x38, 0x23 },
+	{ 0x78, 0x7f, 0xc9, 0x34, 0x40, 0xc1, 0xec, 0x96,
+	  0xb5, 0xad, 0x01, 0xc1, 0x6c, 0xf7, 0x79, 0x16,
+	  0xa1, 0x40, 0x5f, 0x94, 0x26, 0x35, 0x6e, 0xc9,
+	  0x21, 0xd8, 0xdf, 0xf3, 0xea, 0x63, 0xb7, 0xe0 },
+	{ 0x7f, 0x0d, 0x5e, 0xab, 0x47, 0xee, 0xfd, 0xa6,
+	  0x96, 0xc0, 0xbf, 0x0f, 0xbf, 0x86, 0xab, 0x21,
+	  0x6f, 0xce, 0x46, 0x1e, 0x93, 0x03, 0xab, 0xa6,
+	  0xac, 0x37, 0x41, 0x20, 0xe8, 0x90, 0xe8, 0xdf },
+	{ 0xb6, 0x80, 0x04, 0xb4, 0x2f, 0x14, 0xad, 0x02,
+	  0x9f, 0x4c, 0x2e, 0x03, 0xb1, 0xd5, 0xeb, 0x76,
+	  0xd5, 0x71, 0x60, 0xe2, 0x64, 0x76, 0xd2, 0x11,
+	  0x31, 0xbe, 0xf2, 0x0a, 0xda, 0x7d, 0x27, 0xf4 },
+	{ 0xb0, 0xc4, 0xeb, 0x18, 0xae, 0x25, 0x0b, 0x51,
+	  0xa4, 0x13, 0x82, 0xea, 0xd9, 0x2d, 0x0d, 0xc7,
+	  0x45, 0x5f, 0x93, 0x79, 0xfc, 0x98, 0x84, 0x42,
+	  0x8e, 0x47, 0x70, 0x60, 0x8d, 0xb0, 0xfa, 0xec },
+	{ 0xf9, 0x2b, 0x7a, 0x87, 0x0c, 0x05, 0x9f, 0x4d,
+	  0x46, 0x46, 0x4c, 0x82, 0x4e, 0xc9, 0x63, 0x55,
+	  0x14, 0x0b, 0xdc, 0xe6, 0x81, 0x32, 0x2c, 0xc3,
+	  0xa9, 0x92, 0xff, 0x10, 0x3e, 0x3f, 0xea, 0x52 },
+	{ 0x53, 0x64, 0x31, 0x26, 0x14, 0x81, 0x33, 0x98,
+	  0xcc, 0x52, 0x5d, 0x4c, 0x4e, 0x14, 0x6e, 0xde,
+	  0xb3, 0x71, 0x26, 0x5f, 0xba, 0x19, 0x13, 0x3a,
+	  0x2c, 0x3d, 0x21, 0x59, 0x29, 0x8a, 0x17, 0x42 },
+	{ 0xf6, 0x62, 0x0e, 0x68, 0xd3, 0x7f, 0xb2, 0xaf,
+	  0x50, 0x00, 0xfc, 0x28, 0xe2, 0x3b, 0x83, 0x22,
+	  0x97, 0xec, 0xd8, 0xbc, 0xe9, 0x9e, 0x8b, 0xe4,
+	  0xd0, 0x4e, 0x85, 0x30, 0x9e, 0x3d, 0x33, 0x74 },
+	{ 0x53, 0x16, 0xa2, 0x79, 0x69, 0xd7, 0xfe, 0x04,
+	  0xff, 0x27, 0xb2, 0x83, 0x96, 0x1b, 0xff, 0xc3,
+	  0xbf, 0x5d, 0xfb, 0x32, 0xfb, 0x6a, 0x89, 0xd1,
+	  0x01, 0xc6, 0xc3, 0xb1, 0x93, 0x7c, 0x28, 0x71 },
+	{ 0x81, 0xd1, 0x66, 0x4f, 0xdf, 0x3c, 0xb3, 0x3c,
+	  0x24, 0xee, 0xba, 0xc0, 0xbd, 0x64, 0x24, 0x4b,
+	  0x77, 0xc4, 0xab, 0xea, 0x90, 0xbb, 0xe8, 0xb5,
+	  0xee, 0x0b, 0x2a, 0xaf, 0xcf, 0x2d, 0x6a, 0x53 },
+	{ 0x34, 0x57, 0x82, 0xf2, 0x95, 0xb0, 0x88, 0x03,
+	  0x52, 0xe9, 0x24, 0xa0, 0x46, 0x7b, 0x5f, 0xbc,
+	  0x3e, 0x8f, 0x3b, 0xfb, 0xc3, 0xc7, 0xe4, 0x8b,
+	  0x67, 0x09, 0x1f, 0xb5, 0xe8, 0x0a, 0x94, 0x42 },
+	{ 0x79, 0x41, 0x11, 0xea, 0x6c, 0xd6, 0x5e, 0x31,
+	  0x1f, 0x74, 0xee, 0x41, 0xd4, 0x76, 0xcb, 0x63,
+	  0x2c, 0xe1, 0xe4, 0xb0, 0x51, 0xdc, 0x1d, 0x9e,
+	  0x9d, 0x06, 0x1a, 0x19, 0xe1, 0xd0, 0xbb, 0x49 },
+	{ 0x2a, 0x85, 0xda, 0xf6, 0x13, 0x88, 0x16, 0xb9,
+	  0x9b, 0xf8, 0xd0, 0x8b, 0xa2, 0x11, 0x4b, 0x7a,
+	  0xb0, 0x79, 0x75, 0xa7, 0x84, 0x20, 0xc1, 0xa3,
+	  0xb0, 0x6a, 0x77, 0x7c, 0x22, 0xdd, 0x8b, 0xcb },
+	{ 0x89, 0xb0, 0xd5, 0xf2, 0x89, 0xec, 0x16, 0x40,
+	  0x1a, 0x06, 0x9a, 0x96, 0x0d, 0x0b, 0x09, 0x3e,
+	  0x62, 0x5d, 0xa3, 0xcf, 0x41, 0xee, 0x29, 0xb5,
+	  0x9b, 0x93, 0x0c, 0x58, 0x20, 0x14, 0x54, 0x55 },
+	{ 0xd0, 0xfd, 0xcb, 0x54, 0x39, 0x43, 0xfc, 0x27,
+	  0xd2, 0x08, 0x64, 0xf5, 0x21, 0x81, 0x47, 0x1b,
+	  0x94, 0x2c, 0xc7, 0x7c, 0xa6, 0x75, 0xbc, 0xb3,
+	  0x0d, 0xf3, 0x1d, 0x35, 0x8e, 0xf7, 0xb1, 0xeb },
+	{ 0xb1, 0x7e, 0xa8, 0xd7, 0x70, 0x63, 0xc7, 0x09,
+	  0xd4, 0xdc, 0x6b, 0x87, 0x94, 0x13, 0xc3, 0x43,
+	  0xe3, 0x79, 0x0e, 0x9e, 0x62, 0xca, 0x85, 0xb7,
+	  0x90, 0x0b, 0x08, 0x6f, 0x6b, 0x75, 0xc6, 0x72 },
+	{ 0xe7, 0x1a, 0x3e, 0x2c, 0x27, 0x4d, 0xb8, 0x42,
+	  0xd9, 0x21, 0x14, 0xf2, 0x17, 0xe2, 0xc0, 0xea,
+	  0xc8, 0xb4, 0x50, 0x93, 0xfd, 0xfd, 0x9d, 0xf4,
+	  0xca, 0x71, 0x62, 0x39, 0x48, 0x62, 0xd5, 0x01 },
+	{ 0xc0, 0x47, 0x67, 0x59, 0xab, 0x7a, 0xa3, 0x33,
+	  0x23, 0x4f, 0x6b, 0x44, 0xf5, 0xfd, 0x85, 0x83,
+	  0x90, 0xec, 0x23, 0x69, 0x4c, 0x62, 0x2c, 0xb9,
+	  0x86, 0xe7, 0x69, 0xc7, 0x8e, 0xdd, 0x73, 0x3e },
+	{ 0x9a, 0xb8, 0xea, 0xbb, 0x14, 0x16, 0x43, 0x4d,
+	  0x85, 0x39, 0x13, 0x41, 0xd5, 0x69, 0x93, 0xc5,
+	  0x54, 0x58, 0x16, 0x7d, 0x44, 0x18, 0xb1, 0x9a,
+	  0x0f, 0x2a, 0xd8, 0xb7, 0x9a, 0x83, 0xa7, 0x5b },
+	{ 0x79, 0x92, 0xd0, 0xbb, 0xb1, 0x5e, 0x23, 0x82,
+	  0x6f, 0x44, 0x3e, 0x00, 0x50, 0x5d, 0x68, 0xd3,
+	  0xed, 0x73, 0x72, 0x99, 0x5a, 0x5c, 0x3e, 0x49,
+	  0x86, 0x54, 0x10, 0x2f, 0xbc, 0xd0, 0x96, 0x4e },
+	{ 0xc0, 0x21, 0xb3, 0x00, 0x85, 0x15, 0x14, 0x35,
+	  0xdf, 0x33, 0xb0, 0x07, 0xcc, 0xec, 0xc6, 0x9d,
+	  0xf1, 0x26, 0x9f, 0x39, 0xba, 0x25, 0x09, 0x2b,
+	  0xed, 0x59, 0xd9, 0x32, 0xac, 0x0f, 0xdc, 0x28 },
+	{ 0x91, 0xa2, 0x5e, 0xc0, 0xec, 0x0d, 0x9a, 0x56,
+	  0x7f, 0x89, 0xc4, 0xbf, 0xe1, 0xa6, 0x5a, 0x0e,
+	  0x43, 0x2d, 0x07, 0x06, 0x4b, 0x41, 0x90, 0xe2,
+	  0x7d, 0xfb, 0x81, 0x90, 0x1f, 0xd3, 0x13, 0x9b },
+	{ 0x59, 0x50, 0xd3, 0x9a, 0x23, 0xe1, 0x54, 0x5f,
+	  0x30, 0x12, 0x70, 0xaa, 0x1a, 0x12, 0xf2, 0xe6,
+	  0xc4, 0x53, 0x77, 0x6e, 0x4d, 0x63, 0x55, 0xde,
+	  0x42, 0x5c, 0xc1, 0x53, 0xf9, 0x81, 0x88, 0x67 },
+	{ 0xd7, 0x9f, 0x14, 0x72, 0x0c, 0x61, 0x0a, 0xf1,
+	  0x79, 0xa3, 0x76, 0x5d, 0x4b, 0x7c, 0x09, 0x68,
+	  0xf9, 0x77, 0x96, 0x2d, 0xbf, 0x65, 0x5b, 0x52,
+	  0x12, 0x72, 0xb6, 0xf1, 0xe1, 0x94, 0x48, 0x8e },
+	{ 0xe9, 0x53, 0x1b, 0xfc, 0x8b, 0x02, 0x99, 0x5a,
+	  0xea, 0xa7, 0x5b, 0xa2, 0x70, 0x31, 0xfa, 0xdb,
+	  0xcb, 0xf4, 0xa0, 0xda, 0xb8, 0x96, 0x1d, 0x92,
+	  0x96, 0xcd, 0x7e, 0x84, 0xd2, 0x5d, 0x60, 0x06 },
+	{ 0x34, 0xe9, 0xc2, 0x6a, 0x01, 0xd7, 0xf1, 0x61,
+	  0x81, 0xb4, 0x54, 0xa9, 0xd1, 0x62, 0x3c, 0x23,
+	  0x3c, 0xb9, 0x9d, 0x31, 0xc6, 0x94, 0x65, 0x6e,
+	  0x94, 0x13, 0xac, 0xa3, 0xe9, 0x18, 0x69, 0x2f },
+	{ 0xd9, 0xd7, 0x42, 0x2f, 0x43, 0x7b, 0xd4, 0x39,
+	  0xdd, 0xd4, 0xd8, 0x83, 0xda, 0xe2, 0xa0, 0x83,
+	  0x50, 0x17, 0x34, 0x14, 0xbe, 0x78, 0x15, 0x51,
+	  0x33, 0xff, 0xf1, 0x96, 0x4c, 0x3d, 0x79, 0x72 },
+	{ 0x4a, 0xee, 0x0c, 0x7a, 0xaf, 0x07, 0x54, 0x14,
+	  0xff, 0x17, 0x93, 0xea, 0xd7, 0xea, 0xca, 0x60,
+	  0x17, 0x75, 0xc6, 0x15, 0xdb, 0xd6, 0x0b, 0x64,
+	  0x0b, 0x0a, 0x9f, 0x0c, 0xe5, 0x05, 0xd4, 0x35 },
+	{ 0x6b, 0xfd, 0xd1, 0x54, 0x59, 0xc8, 0x3b, 0x99,
+	  0xf0, 0x96, 0xbf, 0xb4, 0x9e, 0xe8, 0x7b, 0x06,
+	  0x3d, 0x69, 0xc1, 0x97, 0x4c, 0x69, 0x28, 0xac,
+	  0xfc, 0xfb, 0x40, 0x99, 0xf8, 0xc4, 0xef, 0x67 },
+	{ 0x9f, 0xd1, 0xc4, 0x08, 0xfd, 0x75, 0xc3, 0x36,
+	  0x19, 0x3a, 0x2a, 0x14, 0xd9, 0x4f, 0x6a, 0xf5,
+	  0xad, 0xf0, 0x50, 0xb8, 0x03, 0x87, 0xb4, 0xb0,
+	  0x10, 0xfb, 0x29, 0xf4, 0xcc, 0x72, 0x70, 0x7c },
+	{ 0x13, 0xc8, 0x84, 0x80, 0xa5, 0xd0, 0x0d, 0x6c,
+	  0x8c, 0x7a, 0xd2, 0x11, 0x0d, 0x76, 0xa8, 0x2d,
+	  0x9b, 0x70, 0xf4, 0xfa, 0x66, 0x96, 0xd4, 0xe5,
+	  0xdd, 0x42, 0xa0, 0x66, 0xdc, 0xaf, 0x99, 0x20 },
+	{ 0x82, 0x0e, 0x72, 0x5e, 0xe2, 0x5f, 0xe8, 0xfd,
+	  0x3a, 0x8d, 0x5a, 0xbe, 0x4c, 0x46, 0xc3, 0xba,
+	  0x88, 0x9d, 0xe6, 0xfa, 0x91, 0x91, 0xaa, 0x22,
+	  0xba, 0x67, 0xd5, 0x70, 0x54, 0x21, 0x54, 0x2b },
+	{ 0x32, 0xd9, 0x3a, 0x0e, 0xb0, 0x2f, 0x42, 0xfb,
+	  0xbc, 0xaf, 0x2b, 0xad, 0x00, 0x85, 0xb2, 0x82,
+	  0xe4, 0x60, 0x46, 0xa4, 0xdf, 0x7a, 0xd1, 0x06,
+	  0x57, 0xc9, 0xd6, 0x47, 0x63, 0x75, 0xb9, 0x3e },
+	{ 0xad, 0xc5, 0x18, 0x79, 0x05, 0xb1, 0x66, 0x9c,
+	  0xd8, 0xec, 0x9c, 0x72, 0x1e, 0x19, 0x53, 0x78,
+	  0x6b, 0x9d, 0x89, 0xa9, 0xba, 0xe3, 0x07, 0x80,
+	  0xf1, 0xe1, 0xea, 0xb2, 0x4a, 0x00, 0x52, 0x3c },
+	{ 0xe9, 0x07, 0x56, 0xff, 0x7f, 0x9a, 0xd8, 0x10,
+	  0xb2, 0x39, 0xa1, 0x0c, 0xed, 0x2c, 0xf9, 0xb2,
+	  0x28, 0x43, 0x54, 0xc1, 0xf8, 0xc7, 0xe0, 0xac,
+	  0xcc, 0x24, 0x61, 0xdc, 0x79, 0x6d, 0x6e, 0x89 },
+	{ 0x12, 0x51, 0xf7, 0x6e, 0x56, 0x97, 0x84, 0x81,
+	  0x87, 0x53, 0x59, 0x80, 0x1d, 0xb5, 0x89, 0xa0,
+	  0xb2, 0x2f, 0x86, 0xd8, 0xd6, 0x34, 0xdc, 0x04,
+	  0x50, 0x6f, 0x32, 0x2e, 0xd7, 0x8f, 0x17, 0xe8 },
+	{ 0x3a, 0xfa, 0x89, 0x9f, 0xd9, 0x80, 0xe7, 0x3e,
+	  0xcb, 0x7f, 0x4d, 0x8b, 0x8f, 0x29, 0x1d, 0xc9,
+	  0xaf, 0x79, 0x6b, 0xc6, 0x5d, 0x27, 0xf9, 0x74,
+	  0xc6, 0xf1, 0x93, 0xc9, 0x19, 0x1a, 0x09, 0xfd },
+	{ 0xaa, 0x30, 0x5b, 0xe2, 0x6e, 0x5d, 0xed, 0xdc,
+	  0x3c, 0x10, 0x10, 0xcb, 0xc2, 0x13, 0xf9, 0x5f,
+	  0x05, 0x1c, 0x78, 0x5c, 0x5b, 0x43, 0x1e, 0x6a,
+	  0x7c, 0xd0, 0x48, 0xf1, 0x61, 0x78, 0x75, 0x28 },
+	{ 0x8e, 0xa1, 0x88, 0x4f, 0xf3, 0x2e, 0x9d, 0x10,
+	  0xf0, 0x39, 0xb4, 0x07, 0xd0, 0xd4, 0x4e, 0x7e,
+	  0x67, 0x0a, 0xbd, 0x88, 0x4a, 0xee, 0xe0, 0xfb,
+	  0x75, 0x7a, 0xe9, 0x4e, 0xaa, 0x97, 0x37, 0x3d },
+	{ 0xd4, 0x82, 0xb2, 0x15, 0x5d, 0x4d, 0xec, 0x6b,
+	  0x47, 0x36, 0xa1, 0xf1, 0x61, 0x7b, 0x53, 0xaa,
+	  0xa3, 0x73, 0x10, 0x27, 0x7d, 0x3f, 0xef, 0x0c,
+	  0x37, 0xad, 0x41, 0x76, 0x8f, 0xc2, 0x35, 0xb4 },
+	{ 0x4d, 0x41, 0x39, 0x71, 0x38, 0x7e, 0x7a, 0x88,
+	  0x98, 0xa8, 0xdc, 0x2a, 0x27, 0x50, 0x07, 0x78,
+	  0x53, 0x9e, 0xa2, 0x14, 0xa2, 0xdf, 0xe9, 0xb3,
+	  0xd7, 0xe8, 0xeb, 0xdc, 0xe5, 0xcf, 0x3d, 0xb3 },
+	{ 0x69, 0x6e, 0x5d, 0x46, 0xe6, 0xc5, 0x7e, 0x87,
+	  0x96, 0xe4, 0x73, 0x5d, 0x08, 0x91, 0x6e, 0x0b,
+	  0x79, 0x29, 0xb3, 0xcf, 0x29, 0x8c, 0x29, 0x6d,
+	  0x22, 0xe9, 0xd3, 0x01, 0x96, 0x53, 0x37, 0x1c },
+	{ 0x1f, 0x56, 0x47, 0xc1, 0xd3, 0xb0, 0x88, 0x22,
+	  0x88, 0x85, 0x86, 0x5c, 0x89, 0x40, 0x90, 0x8b,
+	  0xf4, 0x0d, 0x1a, 0x82, 0x72, 0x82, 0x19, 0x73,
+	  0xb1, 0x60, 0x00, 0x8e, 0x7a, 0x3c, 0xe2, 0xeb },
+	{ 0xb6, 0xe7, 0x6c, 0x33, 0x0f, 0x02, 0x1a, 0x5b,
+	  0xda, 0x65, 0x87, 0x50, 0x10, 0xb0, 0xed, 0xf0,
+	  0x91, 0x26, 0xc0, 0xf5, 0x10, 0xea, 0x84, 0x90,
+	  0x48, 0x19, 0x20, 0x03, 0xae, 0xf4, 0xc6, 0x1c },
+	{ 0x3c, 0xd9, 0x52, 0xa0, 0xbe, 0xad, 0xa4, 0x1a,
+	  0xbb, 0x42, 0x4c, 0xe4, 0x7f, 0x94, 0xb4, 0x2b,
+	  0xe6, 0x4e, 0x1f, 0xfb, 0x0f, 0xd0, 0x78, 0x22,
+	  0x76, 0x80, 0x79, 0x46, 0xd0, 0xd0, 0xbc, 0x55 },
+	{ 0x98, 0xd9, 0x26, 0x77, 0x43, 0x9b, 0x41, 0xb7,
+	  0xbb, 0x51, 0x33, 0x12, 0xaf, 0xb9, 0x2b, 0xcc,
+	  0x8e, 0xe9, 0x68, 0xb2, 0xe3, 0xb2, 0x38, 0xce,
+	  0xcb, 0x9b, 0x0f, 0x34, 0xc9, 0xbb, 0x63, 0xd0 },
+	{ 0xec, 0xbc, 0xa2, 0xcf, 0x08, 0xae, 0x57, 0xd5,
+	  0x17, 0xad, 0x16, 0x15, 0x8a, 0x32, 0xbf, 0xa7,
+	  0xdc, 0x03, 0x82, 0xea, 0xed, 0xa1, 0x28, 0xe9,
+	  0x18, 0x86, 0x73, 0x4c, 0x24, 0xa0, 0xb2, 0x9d },
+	{ 0x94, 0x2c, 0xc7, 0xc0, 0xb5, 0x2e, 0x2b, 0x16,
+	  0xa4, 0xb8, 0x9f, 0xa4, 0xfc, 0x7e, 0x0b, 0xf6,
+	  0x09, 0xe2, 0x9a, 0x08, 0xc1, 0xa8, 0x54, 0x34,
+	  0x52, 0xb7, 0x7c, 0x7b, 0xfd, 0x11, 0xbb, 0x28 },
+	{ 0x8a, 0x06, 0x5d, 0x8b, 0x61, 0xa0, 0xdf, 0xfb,
+	  0x17, 0x0d, 0x56, 0x27, 0x73, 0x5a, 0x76, 0xb0,
+	  0xe9, 0x50, 0x60, 0x37, 0x80, 0x8c, 0xba, 0x16,
+	  0xc3, 0x45, 0x00, 0x7c, 0x9f, 0x79, 0xcf, 0x8f },
+	{ 0x1b, 0x9f, 0xa1, 0x97, 0x14, 0x65, 0x9c, 0x78,
+	  0xff, 0x41, 0x38, 0x71, 0x84, 0x92, 0x15, 0x36,
+	  0x10, 0x29, 0xac, 0x80, 0x2b, 0x1c, 0xbc, 0xd5,
+	  0x4e, 0x40, 0x8b, 0xd8, 0x72, 0x87, 0xf8, 0x1f },
+	{ 0x8d, 0xab, 0x07, 0x1b, 0xcd, 0x6c, 0x72, 0x92,
+	  0xa9, 0xef, 0x72, 0x7b, 0x4a, 0xe0, 0xd8, 0x67,
+	  0x13, 0x30, 0x1d, 0xa8, 0x61, 0x8d, 0x9a, 0x48,
+	  0xad, 0xce, 0x55, 0xf3, 0x03, 0xa8, 0x69, 0xa1 },
+	{ 0x82, 0x53, 0xe3, 0xe7, 0xc7, 0xb6, 0x84, 0xb9,
+	  0xcb, 0x2b, 0xeb, 0x01, 0x4c, 0xe3, 0x30, 0xff,
+	  0x3d, 0x99, 0xd1, 0x7a, 0xbb, 0xdb, 0xab, 0xe4,
+	  0xf4, 0xd6, 0x74, 0xde, 0xd5, 0x3f, 0xfc, 0x6b },
+	{ 0xf1, 0x95, 0xf3, 0x21, 0xe9, 0xe3, 0xd6, 0xbd,
+	  0x7d, 0x07, 0x45, 0x04, 0xdd, 0x2a, 0xb0, 0xe6,
+	  0x24, 0x1f, 0x92, 0xe7, 0x84, 0xb1, 0xaa, 0x27,
+	  0x1f, 0xf6, 0x48, 0xb1, 0xca, 0xb6, 0xd7, 0xf6 },
+	{ 0x27, 0xe4, 0xcc, 0x72, 0x09, 0x0f, 0x24, 0x12,
+	  0x66, 0x47, 0x6a, 0x7c, 0x09, 0x49, 0x5f, 0x2d,
+	  0xb1, 0x53, 0xd5, 0xbc, 0xbd, 0x76, 0x19, 0x03,
+	  0xef, 0x79, 0x27, 0x5e, 0xc5, 0x6b, 0x2e, 0xd8 },
+	{ 0x89, 0x9c, 0x24, 0x05, 0x78, 0x8e, 0x25, 0xb9,
+	  0x9a, 0x18, 0x46, 0x35, 0x5e, 0x64, 0x6d, 0x77,
+	  0xcf, 0x40, 0x00, 0x83, 0x41, 0x5f, 0x7d, 0xc5,
+	  0xaf, 0xe6, 0x9d, 0x6e, 0x17, 0xc0, 0x00, 0x23 },
+	{ 0xa5, 0x9b, 0x78, 0xc4, 0x90, 0x57, 0x44, 0x07,
+	  0x6b, 0xfe, 0xe8, 0x94, 0xde, 0x70, 0x7d, 0x4f,
+	  0x12, 0x0b, 0x5c, 0x68, 0x93, 0xea, 0x04, 0x00,
+	  0x29, 0x7d, 0x0b, 0xb8, 0x34, 0x72, 0x76, 0x32 },
+	{ 0x59, 0xdc, 0x78, 0xb1, 0x05, 0x64, 0x97, 0x07,
+	  0xa2, 0xbb, 0x44, 0x19, 0xc4, 0x8f, 0x00, 0x54,
+	  0x00, 0xd3, 0x97, 0x3d, 0xe3, 0x73, 0x66, 0x10,
+	  0x23, 0x04, 0x35, 0xb1, 0x04, 0x24, 0xb2, 0x4f },
+	{ 0xc0, 0x14, 0x9d, 0x1d, 0x7e, 0x7a, 0x63, 0x53,
+	  0xa6, 0xd9, 0x06, 0xef, 0xe7, 0x28, 0xf2, 0xf3,
+	  0x29, 0xfe, 0x14, 0xa4, 0x14, 0x9a, 0x3e, 0xa7,
+	  0x76, 0x09, 0xbc, 0x42, 0xb9, 0x75, 0xdd, 0xfa },
+	{ 0xa3, 0x2f, 0x24, 0x14, 0x74, 0xa6, 0xc1, 0x69,
+	  0x32, 0xe9, 0x24, 0x3b, 0xe0, 0xcf, 0x09, 0xbc,
+	  0xdc, 0x7e, 0x0c, 0xa0, 0xe7, 0xa6, 0xa1, 0xb9,
+	  0xb1, 0xa0, 0xf0, 0x1e, 0x41, 0x50, 0x23, 0x77 },
+	{ 0xb2, 0x39, 0xb2, 0xe4, 0xf8, 0x18, 0x41, 0x36,
+	  0x1c, 0x13, 0x39, 0xf6, 0x8e, 0x2c, 0x35, 0x9f,
+	  0x92, 0x9a, 0xf9, 0xad, 0x9f, 0x34, 0xe0, 0x1a,
+	  0xab, 0x46, 0x31, 0xad, 0x6d, 0x55, 0x00, 0xb0 },
+	{ 0x85, 0xfb, 0x41, 0x9c, 0x70, 0x02, 0xa3, 0xe0,
+	  0xb4, 0xb6, 0xea, 0x09, 0x3b, 0x4c, 0x1a, 0xc6,
+	  0x93, 0x66, 0x45, 0xb6, 0x5d, 0xac, 0x5a, 0xc1,
+	  0x5a, 0x85, 0x28, 0xb7, 0xb9, 0x4c, 0x17, 0x54 },
+	{ 0x96, 0x19, 0x72, 0x06, 0x25, 0xf1, 0x90, 0xb9,
+	  0x3a, 0x3f, 0xad, 0x18, 0x6a, 0xb3, 0x14, 0x18,
+	  0x96, 0x33, 0xc0, 0xd3, 0xa0, 0x1e, 0x6f, 0x9b,
+	  0xc8, 0xc4, 0xa8, 0xf8, 0x2f, 0x38, 0x3d, 0xbf },
+	{ 0x7d, 0x62, 0x0d, 0x90, 0xfe, 0x69, 0xfa, 0x46,
+	  0x9a, 0x65, 0x38, 0x38, 0x89, 0x70, 0xa1, 0xaa,
+	  0x09, 0xbb, 0x48, 0xa2, 0xd5, 0x9b, 0x34, 0x7b,
+	  0x97, 0xe8, 0xce, 0x71, 0xf4, 0x8c, 0x7f, 0x46 },
+	{ 0x29, 0x43, 0x83, 0x56, 0x85, 0x96, 0xfb, 0x37,
+	  0xc7, 0x5b, 0xba, 0xcd, 0x97, 0x9c, 0x5f, 0xf6,
+	  0xf2, 0x0a, 0x55, 0x6b, 0xf8, 0x87, 0x9c, 0xc7,
+	  0x29, 0x24, 0x85, 0x5d, 0xf9, 0xb8, 0x24, 0x0e },
+	{ 0x16, 0xb1, 0x8a, 0xb3, 0x14, 0x35, 0x9c, 0x2b,
+	  0x83, 0x3c, 0x1c, 0x69, 0x86, 0xd4, 0x8c, 0x55,
+	  0xa9, 0xfc, 0x97, 0xcd, 0xe9, 0xa3, 0xc1, 0xf1,
+	  0x0a, 0x31, 0x77, 0x14, 0x0f, 0x73, 0xf7, 0x38 },
+	{ 0x8c, 0xbb, 0xdd, 0x14, 0xbc, 0x33, 0xf0, 0x4c,
+	  0xf4, 0x58, 0x13, 0xe4, 0xa1, 0x53, 0xa2, 0x73,
+	  0xd3, 0x6a, 0xda, 0xd5, 0xce, 0x71, 0xf4, 0x99,
+	  0xee, 0xb8, 0x7f, 0xb8, 0xac, 0x63, 0xb7, 0x29 },
+	{ 0x69, 0xc9, 0xa4, 0x98, 0xdb, 0x17, 0x4e, 0xca,
+	  0xef, 0xcc, 0x5a, 0x3a, 0xc9, 0xfd, 0xed, 0xf0,
+	  0xf8, 0x13, 0xa5, 0xbe, 0xc7, 0x27, 0xf1, 0xe7,
+	  0x75, 0xba, 0xbd, 0xec, 0x77, 0x18, 0x81, 0x6e },
+	{ 0xb4, 0x62, 0xc3, 0xbe, 0x40, 0x44, 0x8f, 0x1d,
+	  0x4f, 0x80, 0x62, 0x62, 0x54, 0xe5, 0x35, 0xb0,
+	  0x8b, 0xc9, 0xcd, 0xcf, 0xf5, 0x99, 0xa7, 0x68,
+	  0x57, 0x8d, 0x4b, 0x28, 0x81, 0xa8, 0xe3, 0xf0 },
+	{ 0x55, 0x3e, 0x9d, 0x9c, 0x5f, 0x36, 0x0a, 0xc0,
+	  0xb7, 0x4a, 0x7d, 0x44, 0xe5, 0xa3, 0x91, 0xda,
+	  0xd4, 0xce, 0xd0, 0x3e, 0x0c, 0x24, 0x18, 0x3b,
+	  0x7e, 0x8e, 0xca, 0xbd, 0xf1, 0x71, 0x5a, 0x64 },
+	{ 0x7a, 0x7c, 0x55, 0xa5, 0x6f, 0xa9, 0xae, 0x51,
+	  0xe6, 0x55, 0xe0, 0x19, 0x75, 0xd8, 0xa6, 0xff,
+	  0x4a, 0xe9, 0xe4, 0xb4, 0x86, 0xfc, 0xbe, 0x4e,
+	  0xac, 0x04, 0x45, 0x88, 0xf2, 0x45, 0xeb, 0xea },
+	{ 0x2a, 0xfd, 0xf3, 0xc8, 0x2a, 0xbc, 0x48, 0x67,
+	  0xf5, 0xde, 0x11, 0x12, 0x86, 0xc2, 0xb3, 0xbe,
+	  0x7d, 0x6e, 0x48, 0x65, 0x7b, 0xa9, 0x23, 0xcf,
+	  0xbf, 0x10, 0x1a, 0x6d, 0xfc, 0xf9, 0xdb, 0x9a },
+	{ 0x41, 0x03, 0x7d, 0x2e, 0xdc, 0xdc, 0xe0, 0xc4,
+	  0x9b, 0x7f, 0xb4, 0xa6, 0xaa, 0x09, 0x99, 0xca,
+	  0x66, 0x97, 0x6c, 0x74, 0x83, 0xaf, 0xe6, 0x31,
+	  0xd4, 0xed, 0xa2, 0x83, 0x14, 0x4f, 0x6d, 0xfc },
+	{ 0xc4, 0x46, 0x6f, 0x84, 0x97, 0xca, 0x2e, 0xeb,
+	  0x45, 0x83, 0xa0, 0xb0, 0x8e, 0x9d, 0x9a, 0xc7,
+	  0x43, 0x95, 0x70, 0x9f, 0xda, 0x10, 0x9d, 0x24,
+	  0xf2, 0xe4, 0x46, 0x21, 0x96, 0x77, 0x9c, 0x5d },
+	{ 0x75, 0xf6, 0x09, 0x33, 0x8a, 0xa6, 0x7d, 0x96,
+	  0x9a, 0x2a, 0xe2, 0xa2, 0x36, 0x2b, 0x2d, 0xa9,
+	  0xd7, 0x7c, 0x69, 0x5d, 0xfd, 0x1d, 0xf7, 0x22,
+	  0x4a, 0x69, 0x01, 0xdb, 0x93, 0x2c, 0x33, 0x64 },
+	{ 0x68, 0x60, 0x6c, 0xeb, 0x98, 0x9d, 0x54, 0x88,
+	  0xfc, 0x7c, 0xf6, 0x49, 0xf3, 0xd7, 0xc2, 0x72,
+	  0xef, 0x05, 0x5d, 0xa1, 0xa9, 0x3f, 0xae, 0xcd,
+	  0x55, 0xfe, 0x06, 0xf6, 0x96, 0x70, 0x98, 0xca },
+	{ 0x44, 0x34, 0x6b, 0xde, 0xb7, 0xe0, 0x52, 0xf6,
+	  0x25, 0x50, 0x48, 0xf0, 0xd9, 0xb4, 0x2c, 0x42,
+	  0x5b, 0xab, 0x9c, 0x3d, 0xd2, 0x41, 0x68, 0x21,
+	  0x2c, 0x3e, 0xcf, 0x1e, 0xbf, 0x34, 0xe6, 0xae },
+	{ 0x8e, 0x9c, 0xf6, 0xe1, 0xf3, 0x66, 0x47, 0x1f,
+	  0x2a, 0xc7, 0xd2, 0xee, 0x9b, 0x5e, 0x62, 0x66,
+	  0xfd, 0xa7, 0x1f, 0x8f, 0x2e, 0x41, 0x09, 0xf2,
+	  0x23, 0x7e, 0xd5, 0xf8, 0x81, 0x3f, 0xc7, 0x18 },
+	{ 0x84, 0xbb, 0xeb, 0x84, 0x06, 0xd2, 0x50, 0x95,
+	  0x1f, 0x8c, 0x1b, 0x3e, 0x86, 0xa7, 0xc0, 0x10,
+	  0x08, 0x29, 0x21, 0x83, 0x3d, 0xfd, 0x95, 0x55,
+	  0xa2, 0xf9, 0x09, 0xb1, 0x08, 0x6e, 0xb4, 0xb8 },
+	{ 0xee, 0x66, 0x6f, 0x3e, 0xef, 0x0f, 0x7e, 0x2a,
+	  0x9c, 0x22, 0x29, 0x58, 0xc9, 0x7e, 0xaf, 0x35,
+	  0xf5, 0x1c, 0xed, 0x39, 0x3d, 0x71, 0x44, 0x85,
+	  0xab, 0x09, 0xa0, 0x69, 0x34, 0x0f, 0xdf, 0x88 },
+	{ 0xc1, 0x53, 0xd3, 0x4a, 0x65, 0xc4, 0x7b, 0x4a,
+	  0x62, 0xc5, 0xca, 0xcf, 0x24, 0x01, 0x09, 0x75,
+	  0xd0, 0x35, 0x6b, 0x2f, 0x32, 0xc8, 0xf5, 0xda,
+	  0x53, 0x0d, 0x33, 0x88, 0x16, 0xad, 0x5d, 0xe6 },
+	{ 0x9f, 0xc5, 0x45, 0x01, 0x09, 0xe1, 0xb7, 0x79,
+	  0xf6, 0xc7, 0xae, 0x79, 0xd5, 0x6c, 0x27, 0x63,
+	  0x5c, 0x8d, 0xd4, 0x26, 0xc5, 0xa9, 0xd5, 0x4e,
+	  0x25, 0x78, 0xdb, 0x98, 0x9b, 0x8c, 0x3b, 0x4e },
+	{ 0xd1, 0x2b, 0xf3, 0x73, 0x2e, 0xf4, 0xaf, 0x5c,
+	  0x22, 0xfa, 0x90, 0x35, 0x6a, 0xf8, 0xfc, 0x50,
+	  0xfc, 0xb4, 0x0f, 0x8f, 0x2e, 0xa5, 0xc8, 0x59,
+	  0x47, 0x37, 0xa3, 0xb3, 0xd5, 0xab, 0xdb, 0xd7 },
+	{ 0x11, 0x03, 0x0b, 0x92, 0x89, 0xbb, 0xa5, 0xaf,
+	  0x65, 0x26, 0x06, 0x72, 0xab, 0x6f, 0xee, 0x88,
+	  0xb8, 0x74, 0x20, 0xac, 0xef, 0x4a, 0x17, 0x89,
+	  0xa2, 0x07, 0x3b, 0x7e, 0xc2, 0xf2, 0xa0, 0x9e },
+	{ 0x69, 0xcb, 0x19, 0x2b, 0x84, 0x44, 0x00, 0x5c,
+	  0x8c, 0x0c, 0xeb, 0x12, 0xc8, 0x46, 0x86, 0x07,
+	  0x68, 0x18, 0x8c, 0xda, 0x0a, 0xec, 0x27, 0xa9,
+	  0xc8, 0xa5, 0x5c, 0xde, 0xe2, 0x12, 0x36, 0x32 },
+	{ 0xdb, 0x44, 0x4c, 0x15, 0x59, 0x7b, 0x5f, 0x1a,
+	  0x03, 0xd1, 0xf9, 0xed, 0xd1, 0x6e, 0x4a, 0x9f,
+	  0x43, 0xa6, 0x67, 0xcc, 0x27, 0x51, 0x75, 0xdf,
+	  0xa2, 0xb7, 0x04, 0xe3, 0xbb, 0x1a, 0x9b, 0x83 },
+	{ 0x3f, 0xb7, 0x35, 0x06, 0x1a, 0xbc, 0x51, 0x9d,
+	  0xfe, 0x97, 0x9e, 0x54, 0xc1, 0xee, 0x5b, 0xfa,
+	  0xd0, 0xa9, 0xd8, 0x58, 0xb3, 0x31, 0x5b, 0xad,
+	  0x34, 0xbd, 0xe9, 0x99, 0xef, 0xd7, 0x24, 0xdd }
+};
+
+bool __init blake2s_selftest(void)
+{
+	u8 key[BLAKE2S_KEY_SIZE];
+	u8 buf[ARRAY_SIZE(blake2s_testvecs)];
+	u8 hash[BLAKE2S_HASH_SIZE];
+	size_t i;
+	bool success = true;
+
+	for (i = 0; i < BLAKE2S_KEY_SIZE; ++i)
+		key[i] = (u8)i;
+
+	for (i = 0; i < ARRAY_SIZE(blake2s_testvecs); ++i)
+		buf[i] = (u8)i;
+
+	for (i = 0; i < ARRAY_SIZE(blake2s_keyed_testvecs); ++i) {
+		blake2s(hash, buf, key, BLAKE2S_HASH_SIZE, i, BLAKE2S_KEY_SIZE);
+		if (memcmp(hash, blake2s_keyed_testvecs[i], BLAKE2S_HASH_SIZE)) {
+			pr_err("blake2s keyed self-test %zu: FAIL\n", i + 1);
+			success = false;
+		}
+	}
+
+	for (i = 0; i < ARRAY_SIZE(blake2s_testvecs); ++i) {
+		blake2s(hash, buf, NULL, BLAKE2S_HASH_SIZE, i, 0);
+		if (memcmp(hash, blake2s_testvecs[i], BLAKE2S_HASH_SIZE)) {
+			pr_err("blake2s unkeyed self-test %zu: FAIL\n", i + i);
+			success = false;
+		}
+	}
+	return success;
+}
diff --git a/lib/crypto/blake2s.c b/lib/crypto/blake2s.c
new file mode 100644
index 000000000000..3048ccdf0aa8
--- /dev/null
+++ b/lib/crypto/blake2s.c
@@ -0,0 +1,126 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/*
+ * Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ *
+ * This is an implementation of the BLAKE2s hash and PRF functions.
+ *
+ * Information: https://blake2.net/
+ *
+ */
+
+#include <crypto/internal/blake2s.h>
+#include <linux/types.h>
+#include <linux/string.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/bug.h>
+#include <asm/unaligned.h>
+
+bool blake2s_selftest(void);
+
+void blake2s_update(struct blake2s_state *state, const u8 *in, size_t inlen)
+{
+	const size_t fill = BLAKE2S_BLOCK_SIZE - state->buflen;
+
+	if (unlikely(!inlen))
+		return;
+	if (inlen > fill) {
+		memcpy(state->buf + state->buflen, in, fill);
+		if (IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_BLAKE2S))
+			blake2s_compress_arch(state, state->buf, 1,
+					      BLAKE2S_BLOCK_SIZE);
+		else
+			blake2s_compress_generic(state, state->buf, 1,
+						 BLAKE2S_BLOCK_SIZE);
+		state->buflen = 0;
+		in += fill;
+		inlen -= fill;
+	}
+	if (inlen > BLAKE2S_BLOCK_SIZE) {
+		const size_t nblocks = DIV_ROUND_UP(inlen, BLAKE2S_BLOCK_SIZE);
+		/* Hash one less (full) block than strictly possible */
+		if (IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_BLAKE2S))
+			blake2s_compress_arch(state, in, nblocks - 1,
+					      BLAKE2S_BLOCK_SIZE);
+		else
+			blake2s_compress_generic(state, in, nblocks - 1,
+						 BLAKE2S_BLOCK_SIZE);
+		in += BLAKE2S_BLOCK_SIZE * (nblocks - 1);
+		inlen -= BLAKE2S_BLOCK_SIZE * (nblocks - 1);
+	}
+	memcpy(state->buf + state->buflen, in, inlen);
+	state->buflen += inlen;
+}
+EXPORT_SYMBOL(blake2s_update);
+
+void blake2s_final(struct blake2s_state *state, u8 *out)
+{
+	WARN_ON(IS_ENABLED(DEBUG) && !out);
+	blake2s_set_lastblock(state);
+	memset(state->buf + state->buflen, 0,
+	       BLAKE2S_BLOCK_SIZE - state->buflen); /* Padding */
+	if (IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_BLAKE2S))
+		blake2s_compress_arch(state, state->buf, 1, state->buflen);
+	else
+		blake2s_compress_generic(state, state->buf, 1, state->buflen);
+	cpu_to_le32_array(state->h, ARRAY_SIZE(state->h));
+	memcpy(out, state->h, state->outlen);
+	memzero_explicit(state, sizeof(*state));
+}
+EXPORT_SYMBOL(blake2s_final);
+
+void blake2s_hmac(u8 *out, const u8 *in, const u8 *key, const size_t outlen,
+		  const size_t inlen, const size_t keylen)
+{
+	struct blake2s_state state;
+	u8 x_key[BLAKE2S_BLOCK_SIZE] __aligned(__alignof__(u32)) = { 0 };
+	u8 i_hash[BLAKE2S_HASH_SIZE] __aligned(__alignof__(u32));
+	int i;
+
+	if (keylen > BLAKE2S_BLOCK_SIZE) {
+		blake2s_init(&state, BLAKE2S_HASH_SIZE);
+		blake2s_update(&state, key, keylen);
+		blake2s_final(&state, x_key);
+	} else
+		memcpy(x_key, key, keylen);
+
+	for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
+		x_key[i] ^= 0x36;
+
+	blake2s_init(&state, BLAKE2S_HASH_SIZE);
+	blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
+	blake2s_update(&state, in, inlen);
+	blake2s_final(&state, i_hash);
+
+	for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
+		x_key[i] ^= 0x5c ^ 0x36;
+
+	blake2s_init(&state, BLAKE2S_HASH_SIZE);
+	blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
+	blake2s_update(&state, i_hash, BLAKE2S_HASH_SIZE);
+	blake2s_final(&state, i_hash);
+
+	memcpy(out, i_hash, outlen);
+	memzero_explicit(x_key, BLAKE2S_BLOCK_SIZE);
+	memzero_explicit(i_hash, BLAKE2S_HASH_SIZE);
+}
+EXPORT_SYMBOL(blake2s_hmac);
+
+static int __init mod_init(void)
+{
+	if (!IS_ENABLED(CONFIG_CRYPTO_MANAGER_DISABLE_TESTS) &&
+	    WARN_ON(!blake2s_selftest()))
+		return -ENODEV;
+	return 0;
+}
+
+static void __exit mod_exit(void)
+{
+}
+
+module_init(mod_init);
+module_exit(mod_exit);
+MODULE_LICENSE("GPL v2");
+MODULE_DESCRIPTION("BLAKE2s hash function");
+MODULE_AUTHOR("Jason A. Donenfeld <Jason@zx2c4.com>");
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4 23/35] crypto: testmgr - add test cases for Blake2s
  2019-10-17 19:08 [PATCH v4 00/35] crypto: crypto API library interfaces for WireGuard Ard Biesheuvel
                   ` (21 preceding siblings ...)
  2019-10-17 19:09 ` [PATCH v4 22/35] crypto: BLAKE2s - generic C library implementation and selftest Ard Biesheuvel
@ 2019-10-17 19:09 ` Ard Biesheuvel
  2019-10-17 19:09 ` [PATCH v4 24/35] crypto: blake2s - implement generic shash driver Ard Biesheuvel
                   ` (11 subsequent siblings)
  34 siblings, 0 replies; 50+ messages in thread
From: Ard Biesheuvel @ 2019-10-17 19:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: Ard Biesheuvel, Herbert Xu, David Miller, Jason A . Donenfeld,
	Samuel Neves, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Martin Willi, Rene van Dorst, David Sterba

As suggested by Eric for the Blake2b implementation contributed by
David, introduce a set of test vectors for Blake2s covering different
digest and key sizes.

          blake2s-128  blake2s-160  blake2s-224  blake2s-256
         ---------------------------------------------------
len=0   | klen=0       klen=1       klen=16      klen=32
len=1   | klen=16      klen=32      klen=0       klen=1
len=7   | klen=32      klen=0       klen=1       klen=16
len=15  | klen=1       klen=16      klen=32      klen=0
len=64  | klen=0       klen=1       klen=16      klen=32
len=247 | klen=16      klen=32      klen=0       klen=1
len=256 | klen=32      klen=0       klen=1       klen=16

Cc: David Sterba <dsterba@suse.com>
Cc: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 crypto/testmgr.c |  24 ++
 crypto/testmgr.h | 251 ++++++++++++++++++++
 2 files changed, 275 insertions(+)

diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index 82084f6d84b6..767fc5444771 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -4022,6 +4022,30 @@ static const struct alg_test_desc alg_test_descs[] = {
 		.alg = "authenc(hmac(sha512),rfc3686(ctr(aes)))",
 		.test = alg_test_null,
 		.fips_allowed = 1,
+	}, {
+		.alg = "blake2s-128",
+		.test = alg_test_hash,
+		.suite = {
+			.hash = __VECS(blakes2s_128_tv_template)
+		}
+	}, {
+		.alg = "blake2s-160",
+		.test = alg_test_hash,
+		.suite = {
+			.hash = __VECS(blakes2s_160_tv_template)
+		}
+	}, {
+		.alg = "blake2s-224",
+		.test = alg_test_hash,
+		.suite = {
+			.hash = __VECS(blakes2s_224_tv_template)
+		}
+	}, {
+		.alg = "blake2s-256",
+		.test = alg_test_hash,
+		.suite = {
+			.hash = __VECS(blakes2s_256_tv_template)
+		}
 	}, {
 		.alg = "cbc(aes)",
 		.test = alg_test_skcipher,
diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index d1d89101f1b5..b717afc0926e 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -31942,4 +31942,255 @@ static const struct aead_testvec essiv_hmac_sha256_aes_cbc_tv_temp[] = {
 	},
 };
 
+static const char blake2s_ordered_sequence[] =
+	"\x00\x01\x02\x03\x04\x05\x06\x07"
+	"\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+	"\x10\x11\x12\x13\x14\x15\x16\x17"
+	"\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+	"\x20\x21\x22\x23\x24\x25\x26\x27"
+	"\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+	"\x30\x31\x32\x33\x34\x35\x36\x37"
+	"\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+	"\x40\x41\x42\x43\x44\x45\x46\x47"
+	"\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+	"\x50\x51\x52\x53\x54\x55\x56\x57"
+	"\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+	"\x60\x61\x62\x63\x64\x65\x66\x67"
+	"\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+	"\x70\x71\x72\x73\x74\x75\x76\x77"
+	"\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+	"\x80\x81\x82\x83\x84\x85\x86\x87"
+	"\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+	"\x90\x91\x92\x93\x94\x95\x96\x97"
+	"\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+	"\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+	"\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+	"\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+	"\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf"
+	"\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+	"\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf"
+	"\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7"
+	"\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf"
+	"\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7"
+	"\xe8\xe9\xea\xeb\xec\xed\xee\xef"
+	"\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7"
+	"\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff";
+
+static const struct hash_testvec blakes2s_128_tv_template[] = {{
+	.digest = (u8[]){ 0x64, 0x55, 0x0d, 0x6f, 0xfe, 0x2c, 0x0a, 0x01,
+			  0xa1, 0x4a, 0xba, 0x1e, 0xad, 0xe0, 0x20, 0x0c, },
+}, {
+	.plaintext = blake2s_ordered_sequence,
+	.psize = 64,
+	.digest = (u8[]){ 0xdc, 0x66, 0xca, 0x8f, 0x03, 0x86, 0x58, 0x01,
+			  0xb0, 0xff, 0xe0, 0x6e, 0xd8, 0xa1, 0xa9, 0x0e, },
+}, {
+	.ksize = 16,
+	.key = blake2s_ordered_sequence,
+	.plaintext = blake2s_ordered_sequence,
+	.psize = 1,
+	.digest = (u8[]){ 0x88, 0x1e, 0x42, 0xe7, 0xbb, 0x35, 0x80, 0x82,
+			  0x63, 0x7c, 0x0a, 0x0f, 0xd7, 0xec, 0x6c, 0x2f, },
+}, {
+	.ksize = 32,
+	.key = blake2s_ordered_sequence,
+	.plaintext = blake2s_ordered_sequence,
+	.psize = 7,
+	.digest = (u8[]){ 0xcf, 0x9e, 0x07, 0x2a, 0xd5, 0x22, 0xf2, 0xcd,
+			  0xa2, 0xd8, 0x25, 0x21, 0x80, 0x86, 0x73, 0x1c, },
+}, {
+	.ksize = 1,
+	.key = "B",
+	.plaintext = blake2s_ordered_sequence,
+	.psize = 15,
+	.digest = (u8[]){ 0xf6, 0x33, 0x5a, 0x2c, 0x22, 0xa0, 0x64, 0xb2,
+			  0xb6, 0x3f, 0xeb, 0xbc, 0xd1, 0xc3, 0xe5, 0xb2, },
+}, {
+	.ksize = 16,
+	.key = blake2s_ordered_sequence,
+	.plaintext = blake2s_ordered_sequence,
+	.psize = 247,
+	.digest = (u8[]){ 0x72, 0x66, 0x49, 0x60, 0xf9, 0x4a, 0xea, 0xbe,
+			  0x1f, 0xf4, 0x60, 0xce, 0xb7, 0x81, 0xcb, 0x09, },
+}, {
+	.ksize = 32,
+	.key = blake2s_ordered_sequence,
+	.plaintext = blake2s_ordered_sequence,
+	.psize = 256,
+	.digest = (u8[]){ 0xd5, 0xa4, 0x0e, 0xc3, 0x16, 0xc7, 0x51, 0xa6,
+			  0x3c, 0xd0, 0xd9, 0x11, 0x57, 0xfa, 0x1e, 0xbb, },
+}};
+
+static const struct hash_testvec blakes2s_160_tv_template[] = {{
+	.plaintext = blake2s_ordered_sequence,
+	.psize = 7,
+	.digest = (u8[]){ 0xb4, 0xf2, 0x03, 0x49, 0x37, 0xed, 0xb1, 0x3e,
+			  0x5b, 0x2a, 0xca, 0x64, 0x82, 0x74, 0xf6, 0x62,
+			  0xe3, 0xf2, 0x84, 0xff, },
+}, {
+	.plaintext = blake2s_ordered_sequence,
+	.psize = 256,
+	.digest = (u8[]){ 0xaa, 0x56, 0x9b, 0xdc, 0x98, 0x17, 0x75, 0xf2,
+			  0xb3, 0x68, 0x83, 0xb7, 0x9b, 0x8d, 0x48, 0xb1,
+			  0x9b, 0x2d, 0x35, 0x05, },
+}, {
+	.ksize = 1,
+	.key = "B",
+	.digest = (u8[]){ 0x50, 0x16, 0xe7, 0x0c, 0x01, 0xd0, 0xd3, 0xc3,
+			  0xf4, 0x3e, 0xb1, 0x6e, 0x97, 0xa9, 0x4e, 0xd1,
+			  0x79, 0x65, 0x32, 0x93, },
+}, {
+	.ksize = 32,
+	.key = blake2s_ordered_sequence,
+	.plaintext = blake2s_ordered_sequence,
+	.psize = 1,
+	.digest = (u8[]){ 0x1c, 0x2b, 0xcd, 0x9a, 0x68, 0xca, 0x8c, 0x71,
+			  0x90, 0x29, 0x6c, 0x54, 0xfa, 0x56, 0x4a, 0xef,
+			  0xa2, 0x3a, 0x56, 0x9c, },
+}, {
+	.ksize = 16,
+	.key = blake2s_ordered_sequence,
+	.plaintext = blake2s_ordered_sequence,
+	.psize = 15,
+	.digest = (u8[]){ 0x36, 0xc3, 0x5f, 0x9a, 0xdc, 0x7e, 0xbf, 0x19,
+			  0x68, 0xaa, 0xca, 0xd8, 0x81, 0xbf, 0x09, 0x34,
+			  0x83, 0x39, 0x0f, 0x30, },
+}, {
+	.ksize = 1,
+	.key = "B",
+	.plaintext = blake2s_ordered_sequence,
+	.psize = 64,
+	.digest = (u8[]){ 0x86, 0x80, 0x78, 0xa4, 0x14, 0xec, 0x03, 0xe5,
+			  0xb6, 0x9a, 0x52, 0x0e, 0x42, 0xee, 0x39, 0x9d,
+			  0xac, 0xa6, 0x81, 0x63, },
+}, {
+	.ksize = 32,
+	.key = blake2s_ordered_sequence,
+	.plaintext = blake2s_ordered_sequence,
+	.psize = 247,
+	.digest = (u8[]){ 0x2d, 0xd8, 0xd2, 0x53, 0x66, 0xfa, 0xa9, 0x01,
+			  0x1c, 0x9c, 0xaf, 0xa3, 0xe2, 0x9d, 0x9b, 0x10,
+			  0x0a, 0xf6, 0x73, 0xe8, },
+}};
+
+static const struct hash_testvec blakes2s_224_tv_template[] = {{
+	.plaintext = blake2s_ordered_sequence,
+	.psize = 1,
+	.digest = (u8[]){ 0x61, 0xb9, 0x4e, 0xc9, 0x46, 0x22, 0xa3, 0x91,
+			  0xd2, 0xae, 0x42, 0xe6, 0x45, 0x6c, 0x90, 0x12,
+			  0xd5, 0x80, 0x07, 0x97, 0xb8, 0x86, 0x5a, 0xfc,
+			  0x48, 0x21, 0x97, 0xbb, },
+}, {
+	.plaintext = blake2s_ordered_sequence,
+	.psize = 247,
+	.digest = (u8[]){ 0x9e, 0xda, 0xc7, 0x20, 0x2c, 0xd8, 0x48, 0x2e,
+			  0x31, 0x94, 0xab, 0x46, 0x6d, 0x94, 0xd8, 0xb4,
+			  0x69, 0xcd, 0xae, 0x19, 0x6d, 0x9e, 0x41, 0xcc,
+			  0x2b, 0xa4, 0xd5, 0xf6, },
+}, {
+	.ksize = 16,
+	.key = blake2s_ordered_sequence,
+	.digest = (u8[]){ 0x32, 0xc0, 0xac, 0xf4, 0x3b, 0xd3, 0x07, 0x9f,
+			  0xbe, 0xfb, 0xfa, 0x4d, 0x6b, 0x4e, 0x56, 0xb3,
+			  0xaa, 0xd3, 0x27, 0xf6, 0x14, 0xbf, 0xb9, 0x32,
+			  0xa7, 0x19, 0xfc, 0xb8, },
+}, {
+	.ksize = 1,
+	.key = "B",
+	.plaintext = blake2s_ordered_sequence,
+	.psize = 7,
+	.digest = (u8[]){ 0x73, 0xad, 0x5e, 0x6d, 0xb9, 0x02, 0x8e, 0x76,
+			  0xf2, 0x66, 0x42, 0x4b, 0x4c, 0xfa, 0x1f, 0xe6,
+			  0x2e, 0x56, 0x40, 0xe5, 0xa2, 0xb0, 0x3c, 0xe8,
+			  0x7b, 0x45, 0xfe, 0x05, },
+}, {
+	.ksize = 32,
+	.key = blake2s_ordered_sequence,
+	.plaintext = blake2s_ordered_sequence,
+	.psize = 15,
+	.digest = (u8[]){ 0x16, 0x60, 0xfb, 0x92, 0x54, 0xb3, 0x6e, 0x36,
+			  0x81, 0xf4, 0x16, 0x41, 0xc3, 0x3d, 0xd3, 0x43,
+			  0x84, 0xed, 0x10, 0x6f, 0x65, 0x80, 0x7a, 0x3e,
+			  0x25, 0xab, 0xc5, 0x02, },
+}, {
+	.ksize = 16,
+	.key = blake2s_ordered_sequence,
+	.plaintext = blake2s_ordered_sequence,
+	.psize = 64,
+	.digest = (u8[]){ 0xca, 0xaa, 0x39, 0x67, 0x9c, 0xf7, 0x6b, 0xc7,
+			  0xb6, 0x82, 0xca, 0x0e, 0x65, 0x36, 0x5b, 0x7c,
+			  0x24, 0x00, 0xfa, 0x5f, 0xda, 0x06, 0x91, 0x93,
+			  0x6a, 0x31, 0x83, 0xb5, },
+}, {
+	.ksize = 1,
+	.key = "B",
+	.plaintext = blake2s_ordered_sequence,
+	.psize = 256,
+	.digest = (u8[]){ 0x90, 0x02, 0x26, 0xb5, 0x06, 0x9c, 0x36, 0x86,
+			  0x94, 0x91, 0x90, 0x1e, 0x7d, 0x2a, 0x71, 0xb2,
+			  0x48, 0xb5, 0xe8, 0x16, 0xfd, 0x64, 0x33, 0x45,
+			  0xb3, 0xd7, 0xec, 0xcc, },
+}};
+
+static const struct hash_testvec blakes2s_256_tv_template[] = {{
+	.plaintext = blake2s_ordered_sequence,
+	.psize = 15,
+	.digest = (u8[]){ 0xd9, 0x7c, 0x82, 0x8d, 0x81, 0x82, 0xa7, 0x21,
+			  0x80, 0xa0, 0x6a, 0x78, 0x26, 0x83, 0x30, 0x67,
+			  0x3f, 0x7c, 0x4e, 0x06, 0x35, 0x94, 0x7c, 0x04,
+			  0xc0, 0x23, 0x23, 0xfd, 0x45, 0xc0, 0xa5, 0x2d, },
+}, {
+	.ksize = 32,
+	.key = blake2s_ordered_sequence,
+	.digest = (u8[]){ 0x48, 0xa8, 0x99, 0x7d, 0xa4, 0x07, 0x87, 0x6b,
+			  0x3d, 0x79, 0xc0, 0xd9, 0x23, 0x25, 0xad, 0x3b,
+			  0x89, 0xcb, 0xb7, 0x54, 0xd8, 0x6a, 0xb7, 0x1a,
+			  0xee, 0x04, 0x7a, 0xd3, 0x45, 0xfd, 0x2c, 0x49, },
+}, {
+	.ksize = 1,
+	.key = "B",
+	.plaintext = blake2s_ordered_sequence,
+	.psize = 1,
+	.digest = (u8[]){ 0x22, 0x27, 0xae, 0xaa, 0x6e, 0x81, 0x56, 0x03,
+			  0xa7, 0xe3, 0xa1, 0x18, 0xa5, 0x9a, 0x2c, 0x18,
+			  0xf4, 0x63, 0xbc, 0x16, 0x70, 0xf1, 0xe7, 0x4b,
+			  0x00, 0x6d, 0x66, 0x16, 0xae, 0x9e, 0x74, 0x4e, },
+}, {
+	.ksize = 16,
+	.key = blake2s_ordered_sequence,
+	.plaintext = blake2s_ordered_sequence,
+	.psize = 7,
+	.digest = (u8[]){ 0x58, 0x5d, 0xa8, 0x60, 0x1c, 0xa4, 0xd8, 0x03,
+			  0x86, 0x86, 0x84, 0x64, 0xd7, 0xa0, 0x8e, 0x15,
+			  0x2f, 0x05, 0xa2, 0x1b, 0xbc, 0xef, 0x7a, 0x34,
+			  0xb3, 0xc5, 0xbc, 0x4b, 0xf0, 0x32, 0xeb, 0x12, },
+}, {
+	.ksize = 32,
+	.key = blake2s_ordered_sequence,
+	.plaintext = blake2s_ordered_sequence,
+	.psize = 64,
+	.digest = (u8[]){ 0x89, 0x75, 0xb0, 0x57, 0x7f, 0xd3, 0x55, 0x66,
+			  0xd7, 0x50, 0xb3, 0x62, 0xb0, 0x89, 0x7a, 0x26,
+			  0xc3, 0x99, 0x13, 0x6d, 0xf0, 0x7b, 0xab, 0xab,
+			  0xbd, 0xe6, 0x20, 0x3f, 0xf2, 0x95, 0x4e, 0xd4, },
+}, {
+	.ksize = 1,
+	.key = "B",
+	.plaintext = blake2s_ordered_sequence,
+	.psize = 247,
+	.digest = (u8[]){ 0x2e, 0x74, 0x1c, 0x1d, 0x03, 0xf4, 0x9d, 0x84,
+			  0x6f, 0xfc, 0x86, 0x32, 0x92, 0x49, 0x7e, 0x66,
+			  0xd7, 0xc3, 0x10, 0x88, 0xfe, 0x28, 0xb3, 0xe0,
+			  0xbf, 0x50, 0x75, 0xad, 0x8e, 0xa4, 0xe6, 0xb2, },
+}, {
+	.ksize = 16,
+	.key = blake2s_ordered_sequence,
+	.plaintext = blake2s_ordered_sequence,
+	.psize = 256,
+	.digest = (u8[]){ 0xb9, 0xd2, 0x81, 0x0e, 0x3a, 0xb1, 0x62, 0x9b,
+			  0xad, 0x44, 0x05, 0xf4, 0x92, 0x2e, 0x99, 0xc1,
+			  0x4a, 0x47, 0xbb, 0x5b, 0x6f, 0xb2, 0x96, 0xed,
+			  0xd5, 0x06, 0xb5, 0x3a, 0x7c, 0x7a, 0x65, 0x1d, },
+}};
+
 #endif	/* _CRYPTO_TESTMGR_H */
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4 24/35] crypto: blake2s - implement generic shash driver
  2019-10-17 19:08 [PATCH v4 00/35] crypto: crypto API library interfaces for WireGuard Ard Biesheuvel
                   ` (22 preceding siblings ...)
  2019-10-17 19:09 ` [PATCH v4 23/35] crypto: testmgr - add test cases for Blake2s Ard Biesheuvel
@ 2019-10-17 19:09 ` Ard Biesheuvel
  2019-10-23  3:25   ` Eric Biggers
  2019-10-17 19:09 ` [PATCH v4 25/35] crypto: BLAKE2s - x86_64 SIMD implementation Ard Biesheuvel
                   ` (10 subsequent siblings)
  34 siblings, 1 reply; 50+ messages in thread
From: Ard Biesheuvel @ 2019-10-17 19:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: Ard Biesheuvel, Herbert Xu, David Miller, Jason A . Donenfeld,
	Samuel Neves, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Martin Willi, Rene van Dorst, David Sterba

Wire up our newly added Blake2s implementation via the shash API.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 crypto/Kconfig                    |   4 +
 crypto/Makefile                   |   1 +
 crypto/blake2s-generic.c          | 171 ++++++++++++++++++++
 include/crypto/internal/blake2s.h |   5 +
 4 files changed, 181 insertions(+)

diff --git a/crypto/Kconfig b/crypto/Kconfig
index 9fc0f722f1b2..c0f299279df8 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1021,6 +1021,10 @@ config CRYPTO_GHASH_CLMUL_NI_INTEL
 	  This is the x86_64 CLMUL-NI accelerated implementation of
 	  GHASH, the hash function used in GCM (Galois/Counter mode).
 
+config CRYPTO_BLAKE2S
+	tristate "BLAKE2s hash function"
+	select CRYPTO_LIB_BLAKE2S_GENERIC
+
 comment "Ciphers"
 
 config CRYPTO_AES
diff --git a/crypto/Makefile b/crypto/Makefile
index aa740c8492b9..ecc69a726460 100644
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -54,6 +54,7 @@ obj-$(CONFIG_CRYPTO_MANAGER2) += cryptomgr.o
 obj-$(CONFIG_CRYPTO_USER) += crypto_user.o
 crypto_user-y := crypto_user_base.o
 crypto_user-$(CONFIG_CRYPTO_STATS) += crypto_user_stat.o
+obj-$(CONFIG_CRYPTO_BLAKE2S) += blake2s-generic.o
 obj-$(CONFIG_CRYPTO_CMAC) += cmac.o
 obj-$(CONFIG_CRYPTO_HMAC) += hmac.o
 obj-$(CONFIG_CRYPTO_VMAC) += vmac.o
diff --git a/crypto/blake2s-generic.c b/crypto/blake2s-generic.c
new file mode 100644
index 000000000000..ed0c74640470
--- /dev/null
+++ b/crypto/blake2s-generic.c
@@ -0,0 +1,171 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/*
+ * Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#include <crypto/internal/blake2s.h>
+#include <crypto/internal/simd.h>
+#include <crypto/internal/hash.h>
+
+#include <linux/types.h>
+#include <linux/jump_label.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+
+static int crypto_blake2s_setkey(struct crypto_shash *tfm, const u8 *key,
+				 unsigned int keylen)
+{
+	struct blake2s_tfm_ctx *tctx = crypto_shash_ctx(tfm);
+
+	if (keylen == 0 || keylen > BLAKE2S_KEY_SIZE) {
+		crypto_shash_set_flags(tfm, CRYPTO_TFM_RES_BAD_KEY_LEN);
+		return -EINVAL;
+	}
+
+	memcpy(tctx->key, key, keylen);
+	tctx->keylen = keylen;
+
+	return 0;
+}
+
+static int crypto_blake2s_init(struct shash_desc *desc)
+{
+	struct blake2s_tfm_ctx *tctx = crypto_shash_ctx(desc->tfm);
+	struct blake2s_state *state = shash_desc_ctx(desc);
+	const int outlen = crypto_shash_digestsize(desc->tfm);
+
+	if (tctx->keylen)
+		blake2s_init_key(state, outlen, tctx->key, tctx->keylen);
+	else
+		blake2s_init(state, outlen);
+
+	return 0;
+}
+
+static int crypto_blake2s_update(struct shash_desc *desc, const u8 *in,
+				 unsigned int inlen)
+{
+	struct blake2s_state *state = shash_desc_ctx(desc);
+	const size_t fill = BLAKE2S_BLOCK_SIZE - state->buflen;
+
+	if (unlikely(!inlen))
+		return 0;
+	if (inlen > fill) {
+		memcpy(state->buf + state->buflen, in, fill);
+		blake2s_compress_generic(state, state->buf, 1, BLAKE2S_BLOCK_SIZE);
+		state->buflen = 0;
+		in += fill;
+		inlen -= fill;
+	}
+	if (inlen > BLAKE2S_BLOCK_SIZE) {
+		const size_t nblocks = DIV_ROUND_UP(inlen, BLAKE2S_BLOCK_SIZE);
+		/* Hash one less (full) block than strictly possible */
+		blake2s_compress_generic(state, in, nblocks - 1, BLAKE2S_BLOCK_SIZE);
+		in += BLAKE2S_BLOCK_SIZE * (nblocks - 1);
+		inlen -= BLAKE2S_BLOCK_SIZE * (nblocks - 1);
+	}
+	memcpy(state->buf + state->buflen, in, inlen);
+	state->buflen += inlen;
+
+	return 0;
+}
+
+static int crypto_blake2s_final(struct shash_desc *desc, u8 *out)
+{
+	struct blake2s_state *state = shash_desc_ctx(desc);
+
+	blake2s_set_lastblock(state);
+	memset(state->buf + state->buflen, 0,
+	       BLAKE2S_BLOCK_SIZE - state->buflen); /* Padding */
+	blake2s_compress_generic(state, state->buf, 1, state->buflen);
+	cpu_to_le32_array(state->h, ARRAY_SIZE(state->h));
+	memcpy(out, state->h, state->outlen);
+	memzero_explicit(state, sizeof(*state));
+
+	return 0;
+}
+
+static struct shash_alg blake2s_algs[] = {{
+	.base.cra_name		= "blake2s-128",
+	.base.cra_driver_name	= "blake2s-128-generic",
+	.base.cra_flags		= CRYPTO_ALG_OPTIONAL_KEY,
+	.base.cra_ctxsize	= sizeof(struct blake2s_tfm_ctx),
+	.base.cra_priority	= 200,
+	.base.cra_blocksize     = BLAKE2S_BLOCK_SIZE,
+	.base.cra_module	= THIS_MODULE,
+
+	.digestsize		= BLAKE2S_128_HASH_SIZE,
+	.setkey			= crypto_blake2s_setkey,
+	.init			= crypto_blake2s_init,
+	.update			= crypto_blake2s_update,
+	.final			= crypto_blake2s_final,
+	.descsize		= sizeof(struct blake2s_state),
+}, {
+	.base.cra_name		= "blake2s-160",
+	.base.cra_driver_name	= "blake2s-160-generic",
+	.base.cra_flags		= CRYPTO_ALG_OPTIONAL_KEY,
+	.base.cra_ctxsize	= sizeof(struct blake2s_tfm_ctx),
+	.base.cra_priority	= 200,
+	.base.cra_blocksize     = BLAKE2S_BLOCK_SIZE,
+	.base.cra_module	= THIS_MODULE,
+
+	.digestsize		= BLAKE2S_160_HASH_SIZE,
+	.setkey			= crypto_blake2s_setkey,
+	.init			= crypto_blake2s_init,
+	.update			= crypto_blake2s_update,
+	.final			= crypto_blake2s_final,
+	.descsize		= sizeof(struct blake2s_state),
+}, {
+	.base.cra_name		= "blake2s-224",
+	.base.cra_driver_name	= "blake2s-224-generic",
+	.base.cra_flags		= CRYPTO_ALG_OPTIONAL_KEY,
+	.base.cra_ctxsize	= sizeof(struct blake2s_tfm_ctx),
+	.base.cra_priority	= 200,
+	.base.cra_blocksize     = BLAKE2S_BLOCK_SIZE,
+	.base.cra_module	= THIS_MODULE,
+
+	.digestsize		= BLAKE2S_224_HASH_SIZE,
+	.setkey			= crypto_blake2s_setkey,
+	.init			= crypto_blake2s_init,
+	.update			= crypto_blake2s_update,
+	.final			= crypto_blake2s_final,
+	.descsize		= sizeof(struct blake2s_state),
+}, {
+	.base.cra_name		= "blake2s-256",
+	.base.cra_driver_name	= "blake2s-256-generic",
+	.base.cra_flags		= CRYPTO_ALG_OPTIONAL_KEY,
+	.base.cra_ctxsize	= sizeof(struct blake2s_tfm_ctx),
+	.base.cra_priority	= 200,
+	.base.cra_blocksize     = BLAKE2S_BLOCK_SIZE,
+	.base.cra_module	= THIS_MODULE,
+
+	.digestsize		= BLAKE2S_256_HASH_SIZE,
+	.setkey			= crypto_blake2s_setkey,
+	.init			= crypto_blake2s_init,
+	.update			= crypto_blake2s_update,
+	.final			= crypto_blake2s_final,
+	.descsize		= sizeof(struct blake2s_state),
+}};
+
+static int __init blake2s_mod_init(void)
+{
+	return crypto_register_shashes(blake2s_algs, ARRAY_SIZE(blake2s_algs));
+}
+
+static void __exit blake2s_mod_exit(void)
+{
+	crypto_unregister_shashes(blake2s_algs, ARRAY_SIZE(blake2s_algs));
+}
+
+subsys_initcall(blake2s_mod_init);
+module_exit(blake2s_mod_exit);
+
+MODULE_ALIAS_CRYPTO("blake2s-128");
+MODULE_ALIAS_CRYPTO("blake2s-128-generic");
+MODULE_ALIAS_CRYPTO("blake2s-160");
+MODULE_ALIAS_CRYPTO("blake2s-160-generic");
+MODULE_ALIAS_CRYPTO("blake2s-224");
+MODULE_ALIAS_CRYPTO("blake2s-224-generic");
+MODULE_ALIAS_CRYPTO("blake2s-256");
+MODULE_ALIAS_CRYPTO("blake2s-256-generic");
+MODULE_LICENSE("GPL v2");
diff --git a/include/crypto/internal/blake2s.h b/include/crypto/internal/blake2s.h
index 941693effc7d..74ff77032e52 100644
--- a/include/crypto/internal/blake2s.h
+++ b/include/crypto/internal/blake2s.h
@@ -5,6 +5,11 @@
 
 #include <crypto/blake2s.h>
 
+struct blake2s_tfm_ctx {
+	u8 key[BLAKE2S_KEY_SIZE];
+	unsigned int keylen;
+};
+
 void blake2s_compress_generic(struct blake2s_state *state,const u8 *block,
 			      size_t nblocks, const u32 inc);
 
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4 25/35] crypto: BLAKE2s - x86_64 SIMD implementation
  2019-10-17 19:08 [PATCH v4 00/35] crypto: crypto API library interfaces for WireGuard Ard Biesheuvel
                   ` (23 preceding siblings ...)
  2019-10-17 19:09 ` [PATCH v4 24/35] crypto: blake2s - implement generic shash driver Ard Biesheuvel
@ 2019-10-17 19:09 ` Ard Biesheuvel
  2019-10-23  4:55   ` Eric Biggers
  2019-10-17 19:09 ` [PATCH v4 26/35] crypto: Curve25519 - generic C library implementations Ard Biesheuvel
                   ` (9 subsequent siblings)
  34 siblings, 1 reply; 50+ messages in thread
From: Ard Biesheuvel @ 2019-10-17 19:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: Ard Biesheuvel, Herbert Xu, David Miller, Jason A . Donenfeld,
	Samuel Neves, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Martin Willi, Rene van Dorst, David Sterba

From: "Jason A. Donenfeld" <Jason@zx2c4.com>

These implementations from Samuel Neves support AVX and AVX-512VL.
Originally this used AVX-512F, but Skylake thermal throttling made
AVX-512VL more attractive and possible to do with negligable difference.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
[ardb: move to arch/x86/crypto, wire into lib/crypto framework]
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/x86/crypto/Makefile       |   2 +
 arch/x86/crypto/blake2s-core.S | 685 ++++++++++++++++++++
 arch/x86/crypto/blake2s-glue.c | 235 +++++++
 crypto/Kconfig                 |   6 +
 4 files changed, 928 insertions(+)

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index 759b1a927826..922c8ecfa00f 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -48,6 +48,7 @@ ifeq ($(avx_supported),yes)
 	obj-$(CONFIG_CRYPTO_CAST6_AVX_X86_64) += cast6-avx-x86_64.o
 	obj-$(CONFIG_CRYPTO_TWOFISH_AVX_X86_64) += twofish-avx-x86_64.o
 	obj-$(CONFIG_CRYPTO_SERPENT_AVX_X86_64) += serpent-avx-x86_64.o
+	obj-$(CONFIG_CRYPTO_BLAKE2S_X86) += blake2s-x86_64.o
 endif
 
 # These modules require assembler to support AVX2.
@@ -70,6 +71,7 @@ serpent-sse2-x86_64-y := serpent-sse2-x86_64-asm_64.o serpent_sse2_glue.o
 aegis128-aesni-y := aegis128-aesni-asm.o aegis128-aesni-glue.o
 
 nhpoly1305-sse2-y := nh-sse2-x86_64.o nhpoly1305-sse2-glue.o
+blake2s-x86_64-y := blake2s-core.o blake2s-glue.o
 
 ifeq ($(avx_supported),yes)
 	camellia-aesni-avx-x86_64-y := camellia-aesni-avx-asm_64.o \
diff --git a/arch/x86/crypto/blake2s-core.S b/arch/x86/crypto/blake2s-core.S
new file mode 100644
index 000000000000..675288fa4cca
--- /dev/null
+++ b/arch/x86/crypto/blake2s-core.S
@@ -0,0 +1,685 @@
+/* SPDX-License-Identifier: GPL-2.0 OR MIT */
+/*
+ * Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ * Copyright (C) 2017 Samuel Neves <sneves@dei.uc.pt>. All Rights Reserved.
+ */
+
+#include <linux/linkage.h>
+
+.section .rodata.cst32.BLAKE2S_IV, "aM", @progbits, 32
+.align 32
+IV:	.octa 0xA54FF53A3C6EF372BB67AE856A09E667
+	.octa 0x5BE0CD191F83D9AB9B05688C510E527F
+.section .rodata.cst16.ROT16, "aM", @progbits, 16
+.align 16
+ROT16:	.octa 0x0D0C0F0E09080B0A0504070601000302
+.section .rodata.cst16.ROR328, "aM", @progbits, 16
+.align 16
+ROR328:	.octa 0x0C0F0E0D080B0A090407060500030201
+#ifdef CONFIG_AS_AVX512
+.section .rodata.cst64.BLAKE2S_SIGMA, "aM", @progbits, 640
+.align 64
+SIGMA:
+.long 0, 2, 4, 6, 1, 3, 5, 7, 8, 10, 12, 14, 9, 11, 13, 15
+.long 11, 2, 12, 14, 9, 8, 15, 3, 4, 0, 13, 6, 10, 1, 7, 5
+.long 10, 12, 11, 6, 5, 9, 13, 3, 4, 15, 14, 2, 0, 7, 8, 1
+.long 10, 9, 7, 0, 11, 14, 1, 12, 6, 2, 15, 3, 13, 8, 5, 4
+.long 4, 9, 8, 13, 14, 0, 10, 11, 7, 3, 12, 1, 5, 6, 15, 2
+.long 2, 10, 4, 14, 13, 3, 9, 11, 6, 5, 7, 12, 15, 1, 8, 0
+.long 4, 11, 14, 8, 13, 10, 12, 5, 2, 1, 15, 3, 9, 7, 0, 6
+.long 6, 12, 0, 13, 15, 2, 1, 10, 4, 5, 11, 14, 8, 3, 9, 7
+.long 14, 5, 4, 12, 9, 7, 3, 10, 2, 0, 6, 15, 11, 1, 13, 8
+.long 11, 7, 13, 10, 12, 14, 0, 15, 4, 5, 6, 9, 2, 1, 8, 3
+#endif /* CONFIG_AS_AVX512 */
+
+.text
+#ifdef CONFIG_AS_AVX
+ENTRY(blake2s_compress_avx)
+	movl		%ecx, %ecx
+	testq		%rdx, %rdx
+	je		.Lendofloop
+	.align 32
+.Lbeginofloop:
+	addq		%rcx, 32(%rdi)
+	vmovdqu		IV+16(%rip), %xmm1
+	vmovdqu		(%rsi), %xmm4
+	vpxor		32(%rdi), %xmm1, %xmm1
+	vmovdqu		16(%rsi), %xmm3
+	vshufps		$136, %xmm3, %xmm4, %xmm6
+	vmovdqa		ROT16(%rip), %xmm7
+	vpaddd		(%rdi), %xmm6, %xmm6
+	vpaddd		16(%rdi), %xmm6, %xmm6
+	vpxor		%xmm6, %xmm1, %xmm1
+	vmovdqu		IV(%rip), %xmm8
+	vpshufb		%xmm7, %xmm1, %xmm1
+	vmovdqu		48(%rsi), %xmm5
+	vpaddd		%xmm1, %xmm8, %xmm8
+	vpxor		16(%rdi), %xmm8, %xmm9
+	vmovdqu		32(%rsi), %xmm2
+	vpblendw	$12, %xmm3, %xmm5, %xmm13
+	vshufps		$221, %xmm5, %xmm2, %xmm12
+	vpunpckhqdq	%xmm2, %xmm4, %xmm14
+	vpslld		$20, %xmm9, %xmm0
+	vpsrld		$12, %xmm9, %xmm9
+	vpxor		%xmm0, %xmm9, %xmm0
+	vshufps		$221, %xmm3, %xmm4, %xmm9
+	vpaddd		%xmm9, %xmm6, %xmm9
+	vpaddd		%xmm0, %xmm9, %xmm9
+	vpxor		%xmm9, %xmm1, %xmm1
+	vmovdqa		ROR328(%rip), %xmm6
+	vpshufb		%xmm6, %xmm1, %xmm1
+	vpaddd		%xmm1, %xmm8, %xmm8
+	vpxor		%xmm8, %xmm0, %xmm0
+	vpshufd		$147, %xmm1, %xmm1
+	vpshufd		$78, %xmm8, %xmm8
+	vpslld		$25, %xmm0, %xmm10
+	vpsrld		$7, %xmm0, %xmm0
+	vpxor		%xmm10, %xmm0, %xmm0
+	vshufps		$136, %xmm5, %xmm2, %xmm10
+	vpshufd		$57, %xmm0, %xmm0
+	vpaddd		%xmm10, %xmm9, %xmm9
+	vpaddd		%xmm0, %xmm9, %xmm9
+	vpxor		%xmm9, %xmm1, %xmm1
+	vpaddd		%xmm12, %xmm9, %xmm9
+	vpblendw	$12, %xmm2, %xmm3, %xmm12
+	vpshufb		%xmm7, %xmm1, %xmm1
+	vpaddd		%xmm1, %xmm8, %xmm8
+	vpxor		%xmm8, %xmm0, %xmm10
+	vpslld		$20, %xmm10, %xmm0
+	vpsrld		$12, %xmm10, %xmm10
+	vpxor		%xmm0, %xmm10, %xmm0
+	vpaddd		%xmm0, %xmm9, %xmm9
+	vpxor		%xmm9, %xmm1, %xmm1
+	vpshufb		%xmm6, %xmm1, %xmm1
+	vpaddd		%xmm1, %xmm8, %xmm8
+	vpxor		%xmm8, %xmm0, %xmm0
+	vpshufd		$57, %xmm1, %xmm1
+	vpshufd		$78, %xmm8, %xmm8
+	vpslld		$25, %xmm0, %xmm10
+	vpsrld		$7, %xmm0, %xmm0
+	vpxor		%xmm10, %xmm0, %xmm0
+	vpslldq		$4, %xmm5, %xmm10
+	vpblendw	$240, %xmm10, %xmm12, %xmm12
+	vpshufd		$147, %xmm0, %xmm0
+	vpshufd		$147, %xmm12, %xmm12
+	vpaddd		%xmm9, %xmm12, %xmm12
+	vpaddd		%xmm0, %xmm12, %xmm12
+	vpxor		%xmm12, %xmm1, %xmm1
+	vpshufb		%xmm7, %xmm1, %xmm1
+	vpaddd		%xmm1, %xmm8, %xmm8
+	vpxor		%xmm8, %xmm0, %xmm11
+	vpslld		$20, %xmm11, %xmm9
+	vpsrld		$12, %xmm11, %xmm11
+	vpxor		%xmm9, %xmm11, %xmm0
+	vpshufd		$8, %xmm2, %xmm9
+	vpblendw	$192, %xmm5, %xmm3, %xmm11
+	vpblendw	$240, %xmm11, %xmm9, %xmm9
+	vpshufd		$177, %xmm9, %xmm9
+	vpaddd		%xmm12, %xmm9, %xmm9
+	vpaddd		%xmm0, %xmm9, %xmm11
+	vpxor		%xmm11, %xmm1, %xmm1
+	vpshufb		%xmm6, %xmm1, %xmm1
+	vpaddd		%xmm1, %xmm8, %xmm8
+	vpxor		%xmm8, %xmm0, %xmm9
+	vpshufd		$147, %xmm1, %xmm1
+	vpshufd		$78, %xmm8, %xmm8
+	vpslld		$25, %xmm9, %xmm0
+	vpsrld		$7, %xmm9, %xmm9
+	vpxor		%xmm0, %xmm9, %xmm0
+	vpslldq		$4, %xmm3, %xmm9
+	vpblendw	$48, %xmm9, %xmm2, %xmm9
+	vpblendw	$240, %xmm9, %xmm4, %xmm9
+	vpshufd		$57, %xmm0, %xmm0
+	vpshufd		$177, %xmm9, %xmm9
+	vpaddd		%xmm11, %xmm9, %xmm9
+	vpaddd		%xmm0, %xmm9, %xmm9
+	vpxor		%xmm9, %xmm1, %xmm1
+	vpshufb		%xmm7, %xmm1, %xmm1
+	vpaddd		%xmm1, %xmm8, %xmm11
+	vpxor		%xmm11, %xmm0, %xmm0
+	vpslld		$20, %xmm0, %xmm8
+	vpsrld		$12, %xmm0, %xmm0
+	vpxor		%xmm8, %xmm0, %xmm0
+	vpunpckhdq	%xmm3, %xmm4, %xmm8
+	vpblendw	$12, %xmm10, %xmm8, %xmm12
+	vpshufd		$177, %xmm12, %xmm12
+	vpaddd		%xmm9, %xmm12, %xmm9
+	vpaddd		%xmm0, %xmm9, %xmm9
+	vpxor		%xmm9, %xmm1, %xmm1
+	vpshufb		%xmm6, %xmm1, %xmm1
+	vpaddd		%xmm1, %xmm11, %xmm11
+	vpxor		%xmm11, %xmm0, %xmm0
+	vpshufd		$57, %xmm1, %xmm1
+	vpshufd		$78, %xmm11, %xmm11
+	vpslld		$25, %xmm0, %xmm12
+	vpsrld		$7, %xmm0, %xmm0
+	vpxor		%xmm12, %xmm0, %xmm0
+	vpunpckhdq	%xmm5, %xmm2, %xmm12
+	vpshufd		$147, %xmm0, %xmm0
+	vpblendw	$15, %xmm13, %xmm12, %xmm12
+	vpslldq		$8, %xmm5, %xmm13
+	vpshufd		$210, %xmm12, %xmm12
+	vpaddd		%xmm9, %xmm12, %xmm9
+	vpaddd		%xmm0, %xmm9, %xmm9
+	vpxor		%xmm9, %xmm1, %xmm1
+	vpshufb		%xmm7, %xmm1, %xmm1
+	vpaddd		%xmm1, %xmm11, %xmm11
+	vpxor		%xmm11, %xmm0, %xmm0
+	vpslld		$20, %xmm0, %xmm12
+	vpsrld		$12, %xmm0, %xmm0
+	vpxor		%xmm12, %xmm0, %xmm0
+	vpunpckldq	%xmm4, %xmm2, %xmm12
+	vpblendw	$240, %xmm4, %xmm12, %xmm12
+	vpblendw	$192, %xmm13, %xmm12, %xmm12
+	vpsrldq		$12, %xmm3, %xmm13
+	vpaddd		%xmm12, %xmm9, %xmm9
+	vpaddd		%xmm0, %xmm9, %xmm9
+	vpxor		%xmm9, %xmm1, %xmm1
+	vpshufb		%xmm6, %xmm1, %xmm1
+	vpaddd		%xmm1, %xmm11, %xmm11
+	vpxor		%xmm11, %xmm0, %xmm0
+	vpshufd		$147, %xmm1, %xmm1
+	vpshufd		$78, %xmm11, %xmm11
+	vpslld		$25, %xmm0, %xmm12
+	vpsrld		$7, %xmm0, %xmm0
+	vpxor		%xmm12, %xmm0, %xmm0
+	vpblendw	$60, %xmm2, %xmm4, %xmm12
+	vpblendw	$3, %xmm13, %xmm12, %xmm12
+	vpshufd		$57, %xmm0, %xmm0
+	vpshufd		$78, %xmm12, %xmm12
+	vpaddd		%xmm9, %xmm12, %xmm9
+	vpaddd		%xmm0, %xmm9, %xmm9
+	vpxor		%xmm9, %xmm1, %xmm1
+	vpshufb		%xmm7, %xmm1, %xmm1
+	vpaddd		%xmm1, %xmm11, %xmm11
+	vpxor		%xmm11, %xmm0, %xmm12
+	vpslld		$20, %xmm12, %xmm13
+	vpsrld		$12, %xmm12, %xmm0
+	vpblendw	$51, %xmm3, %xmm4, %xmm12
+	vpxor		%xmm13, %xmm0, %xmm0
+	vpblendw	$192, %xmm10, %xmm12, %xmm10
+	vpslldq		$8, %xmm2, %xmm12
+	vpshufd		$27, %xmm10, %xmm10
+	vpaddd		%xmm9, %xmm10, %xmm9
+	vpaddd		%xmm0, %xmm9, %xmm9
+	vpxor		%xmm9, %xmm1, %xmm1
+	vpshufb		%xmm6, %xmm1, %xmm1
+	vpaddd		%xmm1, %xmm11, %xmm11
+	vpxor		%xmm11, %xmm0, %xmm0
+	vpshufd		$57, %xmm1, %xmm1
+	vpshufd		$78, %xmm11, %xmm11
+	vpslld		$25, %xmm0, %xmm10
+	vpsrld		$7, %xmm0, %xmm0
+	vpxor		%xmm10, %xmm0, %xmm0
+	vpunpckhdq	%xmm2, %xmm8, %xmm10
+	vpshufd		$147, %xmm0, %xmm0
+	vpblendw	$12, %xmm5, %xmm10, %xmm10
+	vpshufd		$210, %xmm10, %xmm10
+	vpaddd		%xmm9, %xmm10, %xmm9
+	vpaddd		%xmm0, %xmm9, %xmm9
+	vpxor		%xmm9, %xmm1, %xmm1
+	vpshufb		%xmm7, %xmm1, %xmm1
+	vpaddd		%xmm1, %xmm11, %xmm11
+	vpxor		%xmm11, %xmm0, %xmm10
+	vpslld		$20, %xmm10, %xmm0
+	vpsrld		$12, %xmm10, %xmm10
+	vpxor		%xmm0, %xmm10, %xmm0
+	vpblendw	$12, %xmm4, %xmm5, %xmm10
+	vpblendw	$192, %xmm12, %xmm10, %xmm10
+	vpunpckldq	%xmm2, %xmm4, %xmm12
+	vpshufd		$135, %xmm10, %xmm10
+	vpaddd		%xmm9, %xmm10, %xmm9
+	vpaddd		%xmm0, %xmm9, %xmm9
+	vpxor		%xmm9, %xmm1, %xmm1
+	vpshufb		%xmm6, %xmm1, %xmm1
+	vpaddd		%xmm1, %xmm11, %xmm13
+	vpxor		%xmm13, %xmm0, %xmm0
+	vpshufd		$147, %xmm1, %xmm1
+	vpshufd		$78, %xmm13, %xmm13
+	vpslld		$25, %xmm0, %xmm10
+	vpsrld		$7, %xmm0, %xmm0
+	vpxor		%xmm10, %xmm0, %xmm0
+	vpblendw	$15, %xmm3, %xmm4, %xmm10
+	vpblendw	$192, %xmm5, %xmm10, %xmm10
+	vpshufd		$57, %xmm0, %xmm0
+	vpshufd		$198, %xmm10, %xmm10
+	vpaddd		%xmm9, %xmm10, %xmm10
+	vpaddd		%xmm0, %xmm10, %xmm10
+	vpxor		%xmm10, %xmm1, %xmm1
+	vpshufb		%xmm7, %xmm1, %xmm1
+	vpaddd		%xmm1, %xmm13, %xmm13
+	vpxor		%xmm13, %xmm0, %xmm9
+	vpslld		$20, %xmm9, %xmm0
+	vpsrld		$12, %xmm9, %xmm9
+	vpxor		%xmm0, %xmm9, %xmm0
+	vpunpckhdq	%xmm2, %xmm3, %xmm9
+	vpunpcklqdq	%xmm12, %xmm9, %xmm15
+	vpunpcklqdq	%xmm12, %xmm8, %xmm12
+	vpblendw	$15, %xmm5, %xmm8, %xmm8
+	vpaddd		%xmm15, %xmm10, %xmm15
+	vpaddd		%xmm0, %xmm15, %xmm15
+	vpxor		%xmm15, %xmm1, %xmm1
+	vpshufd		$141, %xmm8, %xmm8
+	vpshufb		%xmm6, %xmm1, %xmm1
+	vpaddd		%xmm1, %xmm13, %xmm13
+	vpxor		%xmm13, %xmm0, %xmm0
+	vpshufd		$57, %xmm1, %xmm1
+	vpshufd		$78, %xmm13, %xmm13
+	vpslld		$25, %xmm0, %xmm10
+	vpsrld		$7, %xmm0, %xmm0
+	vpxor		%xmm10, %xmm0, %xmm0
+	vpunpcklqdq	%xmm2, %xmm3, %xmm10
+	vpshufd		$147, %xmm0, %xmm0
+	vpblendw	$51, %xmm14, %xmm10, %xmm14
+	vpshufd		$135, %xmm14, %xmm14
+	vpaddd		%xmm15, %xmm14, %xmm14
+	vpaddd		%xmm0, %xmm14, %xmm14
+	vpxor		%xmm14, %xmm1, %xmm1
+	vpunpcklqdq	%xmm3, %xmm4, %xmm15
+	vpshufb		%xmm7, %xmm1, %xmm1
+	vpaddd		%xmm1, %xmm13, %xmm13
+	vpxor		%xmm13, %xmm0, %xmm0
+	vpslld		$20, %xmm0, %xmm11
+	vpsrld		$12, %xmm0, %xmm0
+	vpxor		%xmm11, %xmm0, %xmm0
+	vpunpckhqdq	%xmm5, %xmm3, %xmm11
+	vpblendw	$51, %xmm15, %xmm11, %xmm11
+	vpunpckhqdq	%xmm3, %xmm5, %xmm15
+	vpaddd		%xmm11, %xmm14, %xmm11
+	vpaddd		%xmm0, %xmm11, %xmm11
+	vpxor		%xmm11, %xmm1, %xmm1
+	vpshufb		%xmm6, %xmm1, %xmm1
+	vpaddd		%xmm1, %xmm13, %xmm13
+	vpxor		%xmm13, %xmm0, %xmm0
+	vpshufd		$147, %xmm1, %xmm1
+	vpshufd		$78, %xmm13, %xmm13
+	vpslld		$25, %xmm0, %xmm14
+	vpsrld		$7, %xmm0, %xmm0
+	vpxor		%xmm14, %xmm0, %xmm14
+	vpunpckhqdq	%xmm4, %xmm2, %xmm0
+	vpshufd		$57, %xmm14, %xmm14
+	vpblendw	$51, %xmm15, %xmm0, %xmm15
+	vpaddd		%xmm15, %xmm11, %xmm15
+	vpaddd		%xmm14, %xmm15, %xmm15
+	vpxor		%xmm15, %xmm1, %xmm1
+	vpshufb		%xmm7, %xmm1, %xmm1
+	vpaddd		%xmm1, %xmm13, %xmm13
+	vpxor		%xmm13, %xmm14, %xmm14
+	vpslld		$20, %xmm14, %xmm11
+	vpsrld		$12, %xmm14, %xmm14
+	vpxor		%xmm11, %xmm14, %xmm14
+	vpblendw	$3, %xmm2, %xmm4, %xmm11
+	vpslldq		$8, %xmm11, %xmm0
+	vpblendw	$15, %xmm5, %xmm0, %xmm0
+	vpshufd		$99, %xmm0, %xmm0
+	vpaddd		%xmm15, %xmm0, %xmm15
+	vpaddd		%xmm14, %xmm15, %xmm15
+	vpxor		%xmm15, %xmm1, %xmm0
+	vpaddd		%xmm12, %xmm15, %xmm15
+	vpshufb		%xmm6, %xmm0, %xmm0
+	vpaddd		%xmm0, %xmm13, %xmm13
+	vpxor		%xmm13, %xmm14, %xmm14
+	vpshufd		$57, %xmm0, %xmm0
+	vpshufd		$78, %xmm13, %xmm13
+	vpslld		$25, %xmm14, %xmm1
+	vpsrld		$7, %xmm14, %xmm14
+	vpxor		%xmm1, %xmm14, %xmm14
+	vpblendw	$3, %xmm5, %xmm4, %xmm1
+	vpshufd		$147, %xmm14, %xmm14
+	vpaddd		%xmm14, %xmm15, %xmm15
+	vpxor		%xmm15, %xmm0, %xmm0
+	vpshufb		%xmm7, %xmm0, %xmm0
+	vpaddd		%xmm0, %xmm13, %xmm13
+	vpxor		%xmm13, %xmm14, %xmm14
+	vpslld		$20, %xmm14, %xmm12
+	vpsrld		$12, %xmm14, %xmm14
+	vpxor		%xmm12, %xmm14, %xmm14
+	vpsrldq		$4, %xmm2, %xmm12
+	vpblendw	$60, %xmm12, %xmm1, %xmm1
+	vpaddd		%xmm1, %xmm15, %xmm15
+	vpaddd		%xmm14, %xmm15, %xmm15
+	vpxor		%xmm15, %xmm0, %xmm0
+	vpblendw	$12, %xmm4, %xmm3, %xmm1
+	vpshufb		%xmm6, %xmm0, %xmm0
+	vpaddd		%xmm0, %xmm13, %xmm13
+	vpxor		%xmm13, %xmm14, %xmm14
+	vpshufd		$147, %xmm0, %xmm0
+	vpshufd		$78, %xmm13, %xmm13
+	vpslld		$25, %xmm14, %xmm12
+	vpsrld		$7, %xmm14, %xmm14
+	vpxor		%xmm12, %xmm14, %xmm14
+	vpsrldq		$4, %xmm5, %xmm12
+	vpblendw	$48, %xmm12, %xmm1, %xmm1
+	vpshufd		$33, %xmm5, %xmm12
+	vpshufd		$57, %xmm14, %xmm14
+	vpshufd		$108, %xmm1, %xmm1
+	vpblendw	$51, %xmm12, %xmm10, %xmm12
+	vpaddd		%xmm15, %xmm1, %xmm15
+	vpaddd		%xmm14, %xmm15, %xmm15
+	vpxor		%xmm15, %xmm0, %xmm0
+	vpaddd		%xmm12, %xmm15, %xmm15
+	vpshufb		%xmm7, %xmm0, %xmm0
+	vpaddd		%xmm0, %xmm13, %xmm1
+	vpxor		%xmm1, %xmm14, %xmm14
+	vpslld		$20, %xmm14, %xmm13
+	vpsrld		$12, %xmm14, %xmm14
+	vpxor		%xmm13, %xmm14, %xmm14
+	vpslldq		$12, %xmm3, %xmm13
+	vpaddd		%xmm14, %xmm15, %xmm15
+	vpxor		%xmm15, %xmm0, %xmm0
+	vpshufb		%xmm6, %xmm0, %xmm0
+	vpaddd		%xmm0, %xmm1, %xmm1
+	vpxor		%xmm1, %xmm14, %xmm14
+	vpshufd		$57, %xmm0, %xmm0
+	vpshufd		$78, %xmm1, %xmm1
+	vpslld		$25, %xmm14, %xmm12
+	vpsrld		$7, %xmm14, %xmm14
+	vpxor		%xmm12, %xmm14, %xmm14
+	vpblendw	$51, %xmm5, %xmm4, %xmm12
+	vpshufd		$147, %xmm14, %xmm14
+	vpblendw	$192, %xmm13, %xmm12, %xmm12
+	vpaddd		%xmm12, %xmm15, %xmm15
+	vpaddd		%xmm14, %xmm15, %xmm15
+	vpxor		%xmm15, %xmm0, %xmm0
+	vpsrldq		$4, %xmm3, %xmm12
+	vpshufb		%xmm7, %xmm0, %xmm0
+	vpaddd		%xmm0, %xmm1, %xmm1
+	vpxor		%xmm1, %xmm14, %xmm14
+	vpslld		$20, %xmm14, %xmm13
+	vpsrld		$12, %xmm14, %xmm14
+	vpxor		%xmm13, %xmm14, %xmm14
+	vpblendw	$48, %xmm2, %xmm5, %xmm13
+	vpblendw	$3, %xmm12, %xmm13, %xmm13
+	vpshufd		$156, %xmm13, %xmm13
+	vpaddd		%xmm15, %xmm13, %xmm15
+	vpaddd		%xmm14, %xmm15, %xmm15
+	vpxor		%xmm15, %xmm0, %xmm0
+	vpshufb		%xmm6, %xmm0, %xmm0
+	vpaddd		%xmm0, %xmm1, %xmm1
+	vpxor		%xmm1, %xmm14, %xmm14
+	vpshufd		$147, %xmm0, %xmm0
+	vpshufd		$78, %xmm1, %xmm1
+	vpslld		$25, %xmm14, %xmm13
+	vpsrld		$7, %xmm14, %xmm14
+	vpxor		%xmm13, %xmm14, %xmm14
+	vpunpcklqdq	%xmm2, %xmm4, %xmm13
+	vpshufd		$57, %xmm14, %xmm14
+	vpblendw	$12, %xmm12, %xmm13, %xmm12
+	vpshufd		$180, %xmm12, %xmm12
+	vpaddd		%xmm15, %xmm12, %xmm15
+	vpaddd		%xmm14, %xmm15, %xmm15
+	vpxor		%xmm15, %xmm0, %xmm0
+	vpshufb		%xmm7, %xmm0, %xmm0
+	vpaddd		%xmm0, %xmm1, %xmm1
+	vpxor		%xmm1, %xmm14, %xmm14
+	vpslld		$20, %xmm14, %xmm12
+	vpsrld		$12, %xmm14, %xmm14
+	vpxor		%xmm12, %xmm14, %xmm14
+	vpunpckhqdq	%xmm9, %xmm4, %xmm12
+	vpshufd		$198, %xmm12, %xmm12
+	vpaddd		%xmm15, %xmm12, %xmm15
+	vpaddd		%xmm14, %xmm15, %xmm15
+	vpxor		%xmm15, %xmm0, %xmm0
+	vpaddd		%xmm15, %xmm8, %xmm15
+	vpshufb		%xmm6, %xmm0, %xmm0
+	vpaddd		%xmm0, %xmm1, %xmm1
+	vpxor		%xmm1, %xmm14, %xmm14
+	vpshufd		$57, %xmm0, %xmm0
+	vpshufd		$78, %xmm1, %xmm1
+	vpslld		$25, %xmm14, %xmm12
+	vpsrld		$7, %xmm14, %xmm14
+	vpxor		%xmm12, %xmm14, %xmm14
+	vpsrldq		$4, %xmm4, %xmm12
+	vpshufd		$147, %xmm14, %xmm14
+	vpaddd		%xmm14, %xmm15, %xmm15
+	vpxor		%xmm15, %xmm0, %xmm0
+	vpshufb		%xmm7, %xmm0, %xmm0
+	vpaddd		%xmm0, %xmm1, %xmm1
+	vpxor		%xmm1, %xmm14, %xmm14
+	vpslld		$20, %xmm14, %xmm8
+	vpsrld		$12, %xmm14, %xmm14
+	vpxor		%xmm14, %xmm8, %xmm14
+	vpblendw	$48, %xmm5, %xmm2, %xmm8
+	vpblendw	$3, %xmm12, %xmm8, %xmm8
+	vpunpckhqdq	%xmm5, %xmm4, %xmm12
+	vpshufd		$75, %xmm8, %xmm8
+	vpblendw	$60, %xmm10, %xmm12, %xmm10
+	vpaddd		%xmm15, %xmm8, %xmm15
+	vpaddd		%xmm14, %xmm15, %xmm15
+	vpxor		%xmm0, %xmm15, %xmm0
+	vpshufd		$45, %xmm10, %xmm10
+	vpshufb		%xmm6, %xmm0, %xmm0
+	vpaddd		%xmm15, %xmm10, %xmm15
+	vpaddd		%xmm0, %xmm1, %xmm1
+	vpxor		%xmm1, %xmm14, %xmm14
+	vpshufd		$147, %xmm0, %xmm0
+	vpshufd		$78, %xmm1, %xmm1
+	vpslld		$25, %xmm14, %xmm8
+	vpsrld		$7, %xmm14, %xmm14
+	vpxor		%xmm14, %xmm8, %xmm8
+	vpshufd		$57, %xmm8, %xmm8
+	vpaddd		%xmm8, %xmm15, %xmm15
+	vpxor		%xmm0, %xmm15, %xmm0
+	vpshufb		%xmm7, %xmm0, %xmm0
+	vpaddd		%xmm0, %xmm1, %xmm1
+	vpxor		%xmm8, %xmm1, %xmm8
+	vpslld		$20, %xmm8, %xmm10
+	vpsrld		$12, %xmm8, %xmm8
+	vpxor		%xmm8, %xmm10, %xmm10
+	vpunpckldq	%xmm3, %xmm4, %xmm8
+	vpunpcklqdq	%xmm9, %xmm8, %xmm9
+	vpaddd		%xmm9, %xmm15, %xmm9
+	vpaddd		%xmm10, %xmm9, %xmm9
+	vpxor		%xmm0, %xmm9, %xmm8
+	vpshufb		%xmm6, %xmm8, %xmm8
+	vpaddd		%xmm8, %xmm1, %xmm1
+	vpxor		%xmm1, %xmm10, %xmm10
+	vpshufd		$57, %xmm8, %xmm8
+	vpshufd		$78, %xmm1, %xmm1
+	vpslld		$25, %xmm10, %xmm12
+	vpsrld		$7, %xmm10, %xmm10
+	vpxor		%xmm10, %xmm12, %xmm10
+	vpblendw	$48, %xmm4, %xmm3, %xmm12
+	vpshufd		$147, %xmm10, %xmm0
+	vpunpckhdq	%xmm5, %xmm3, %xmm10
+	vpshufd		$78, %xmm12, %xmm12
+	vpunpcklqdq	%xmm4, %xmm10, %xmm10
+	vpblendw	$192, %xmm2, %xmm10, %xmm10
+	vpshufhw	$78, %xmm10, %xmm10
+	vpaddd		%xmm10, %xmm9, %xmm10
+	vpaddd		%xmm0, %xmm10, %xmm10
+	vpxor		%xmm8, %xmm10, %xmm8
+	vpshufb		%xmm7, %xmm8, %xmm8
+	vpaddd		%xmm8, %xmm1, %xmm1
+	vpxor		%xmm0, %xmm1, %xmm9
+	vpslld		$20, %xmm9, %xmm0
+	vpsrld		$12, %xmm9, %xmm9
+	vpxor		%xmm9, %xmm0, %xmm0
+	vpunpckhdq	%xmm5, %xmm4, %xmm9
+	vpblendw	$240, %xmm9, %xmm2, %xmm13
+	vpshufd		$39, %xmm13, %xmm13
+	vpaddd		%xmm10, %xmm13, %xmm10
+	vpaddd		%xmm0, %xmm10, %xmm10
+	vpxor		%xmm8, %xmm10, %xmm8
+	vpblendw	$12, %xmm4, %xmm2, %xmm13
+	vpshufb		%xmm6, %xmm8, %xmm8
+	vpslldq		$4, %xmm13, %xmm13
+	vpblendw	$15, %xmm5, %xmm13, %xmm13
+	vpaddd		%xmm8, %xmm1, %xmm1
+	vpxor		%xmm1, %xmm0, %xmm0
+	vpaddd		%xmm13, %xmm10, %xmm13
+	vpshufd		$147, %xmm8, %xmm8
+	vpshufd		$78, %xmm1, %xmm1
+	vpslld		$25, %xmm0, %xmm14
+	vpsrld		$7, %xmm0, %xmm0
+	vpxor		%xmm0, %xmm14, %xmm14
+	vpshufd		$57, %xmm14, %xmm14
+	vpaddd		%xmm14, %xmm13, %xmm13
+	vpxor		%xmm8, %xmm13, %xmm8
+	vpaddd		%xmm13, %xmm12, %xmm12
+	vpshufb		%xmm7, %xmm8, %xmm8
+	vpaddd		%xmm8, %xmm1, %xmm1
+	vpxor		%xmm14, %xmm1, %xmm14
+	vpslld		$20, %xmm14, %xmm10
+	vpsrld		$12, %xmm14, %xmm14
+	vpxor		%xmm14, %xmm10, %xmm10
+	vpaddd		%xmm10, %xmm12, %xmm12
+	vpxor		%xmm8, %xmm12, %xmm8
+	vpshufb		%xmm6, %xmm8, %xmm8
+	vpaddd		%xmm8, %xmm1, %xmm1
+	vpxor		%xmm1, %xmm10, %xmm0
+	vpshufd		$57, %xmm8, %xmm8
+	vpshufd		$78, %xmm1, %xmm1
+	vpslld		$25, %xmm0, %xmm10
+	vpsrld		$7, %xmm0, %xmm0
+	vpxor		%xmm0, %xmm10, %xmm10
+	vpblendw	$48, %xmm2, %xmm3, %xmm0
+	vpblendw	$15, %xmm11, %xmm0, %xmm0
+	vpshufd		$147, %xmm10, %xmm10
+	vpshufd		$114, %xmm0, %xmm0
+	vpaddd		%xmm12, %xmm0, %xmm0
+	vpaddd		%xmm10, %xmm0, %xmm0
+	vpxor		%xmm8, %xmm0, %xmm8
+	vpshufb		%xmm7, %xmm8, %xmm8
+	vpaddd		%xmm8, %xmm1, %xmm1
+	vpxor		%xmm10, %xmm1, %xmm10
+	vpslld		$20, %xmm10, %xmm11
+	vpsrld		$12, %xmm10, %xmm10
+	vpxor		%xmm10, %xmm11, %xmm10
+	vpslldq		$4, %xmm4, %xmm11
+	vpblendw	$192, %xmm11, %xmm3, %xmm3
+	vpunpckldq	%xmm5, %xmm4, %xmm4
+	vpshufd		$99, %xmm3, %xmm3
+	vpaddd		%xmm0, %xmm3, %xmm3
+	vpaddd		%xmm10, %xmm3, %xmm3
+	vpxor		%xmm8, %xmm3, %xmm11
+	vpunpckldq	%xmm5, %xmm2, %xmm0
+	vpblendw	$192, %xmm2, %xmm5, %xmm2
+	vpshufb		%xmm6, %xmm11, %xmm11
+	vpunpckhqdq	%xmm0, %xmm9, %xmm0
+	vpblendw	$15, %xmm4, %xmm2, %xmm4
+	vpaddd		%xmm11, %xmm1, %xmm1
+	vpxor		%xmm1, %xmm10, %xmm10
+	vpshufd		$147, %xmm11, %xmm11
+	vpshufd		$201, %xmm0, %xmm0
+	vpslld		$25, %xmm10, %xmm8
+	vpsrld		$7, %xmm10, %xmm10
+	vpxor		%xmm10, %xmm8, %xmm10
+	vpshufd		$78, %xmm1, %xmm1
+	vpaddd		%xmm3, %xmm0, %xmm0
+	vpshufd		$27, %xmm4, %xmm4
+	vpshufd		$57, %xmm10, %xmm10
+	vpaddd		%xmm10, %xmm0, %xmm0
+	vpxor		%xmm11, %xmm0, %xmm11
+	vpaddd		%xmm0, %xmm4, %xmm0
+	vpshufb		%xmm7, %xmm11, %xmm7
+	vpaddd		%xmm7, %xmm1, %xmm1
+	vpxor		%xmm10, %xmm1, %xmm10
+	vpslld		$20, %xmm10, %xmm8
+	vpsrld		$12, %xmm10, %xmm10
+	vpxor		%xmm10, %xmm8, %xmm8
+	vpaddd		%xmm8, %xmm0, %xmm0
+	vpxor		%xmm7, %xmm0, %xmm7
+	vpshufb		%xmm6, %xmm7, %xmm6
+	vpaddd		%xmm6, %xmm1, %xmm1
+	vpxor		%xmm1, %xmm8, %xmm8
+	vpshufd		$78, %xmm1, %xmm1
+	vpshufd		$57, %xmm6, %xmm6
+	vpslld		$25, %xmm8, %xmm2
+	vpsrld		$7, %xmm8, %xmm8
+	vpxor		%xmm8, %xmm2, %xmm8
+	vpxor		(%rdi), %xmm1, %xmm1
+	vpshufd		$147, %xmm8, %xmm8
+	vpxor		%xmm0, %xmm1, %xmm0
+	vmovups		%xmm0, (%rdi)
+	vpxor		16(%rdi), %xmm8, %xmm0
+	vpxor		%xmm6, %xmm0, %xmm6
+	vmovups		%xmm6, 16(%rdi)
+	addq		$64, %rsi
+	decq		%rdx
+	jnz .Lbeginofloop
+.Lendofloop:
+	ret
+ENDPROC(blake2s_compress_avx)
+#endif /* CONFIG_AS_AVX */
+
+#ifdef CONFIG_AS_AVX512
+ENTRY(blake2s_compress_avx512)
+	vmovdqu		(%rdi),%xmm0
+	vmovdqu		0x10(%rdi),%xmm1
+	vmovdqu		0x20(%rdi),%xmm4
+	vmovq		%rcx,%xmm5
+	vmovdqa		IV(%rip),%xmm14
+	vmovdqa		IV+16(%rip),%xmm15
+	jmp		.Lblake2s_compress_avx512_mainloop
+.align 32
+.Lblake2s_compress_avx512_mainloop:
+	vmovdqa		%xmm0,%xmm10
+	vmovdqa		%xmm1,%xmm11
+	vpaddq		%xmm5,%xmm4,%xmm4
+	vmovdqa		%xmm14,%xmm2
+	vpxor		%xmm15,%xmm4,%xmm3
+	vmovdqu		(%rsi),%ymm6
+	vmovdqu		0x20(%rsi),%ymm7
+	addq		$0x40,%rsi
+	leaq		SIGMA(%rip),%rax
+	movb		$0xa,%cl
+.Lblake2s_compress_avx512_roundloop:
+	addq		$0x40,%rax
+	vmovdqa		-0x40(%rax),%ymm8
+	vmovdqa		-0x20(%rax),%ymm9
+	vpermi2d	%ymm7,%ymm6,%ymm8
+	vpermi2d	%ymm7,%ymm6,%ymm9
+	vmovdqa		%ymm8,%ymm6
+	vmovdqa		%ymm9,%ymm7
+	vpaddd		%xmm8,%xmm0,%xmm0
+	vpaddd		%xmm1,%xmm0,%xmm0
+	vpxor		%xmm0,%xmm3,%xmm3
+	vprord		$0x10,%xmm3,%xmm3
+	vpaddd		%xmm3,%xmm2,%xmm2
+	vpxor		%xmm2,%xmm1,%xmm1
+	vprord		$0xc,%xmm1,%xmm1
+	vextracti128	$0x1,%ymm8,%xmm8
+	vpaddd		%xmm8,%xmm0,%xmm0
+	vpaddd		%xmm1,%xmm0,%xmm0
+	vpxor		%xmm0,%xmm3,%xmm3
+	vprord		$0x8,%xmm3,%xmm3
+	vpaddd		%xmm3,%xmm2,%xmm2
+	vpxor		%xmm2,%xmm1,%xmm1
+	vprord		$0x7,%xmm1,%xmm1
+	vpshufd		$0x39,%xmm1,%xmm1
+	vpshufd		$0x4e,%xmm2,%xmm2
+	vpshufd		$0x93,%xmm3,%xmm3
+	vpaddd		%xmm9,%xmm0,%xmm0
+	vpaddd		%xmm1,%xmm0,%xmm0
+	vpxor		%xmm0,%xmm3,%xmm3
+	vprord		$0x10,%xmm3,%xmm3
+	vpaddd		%xmm3,%xmm2,%xmm2
+	vpxor		%xmm2,%xmm1,%xmm1
+	vprord		$0xc,%xmm1,%xmm1
+	vextracti128	$0x1,%ymm9,%xmm9
+	vpaddd		%xmm9,%xmm0,%xmm0
+	vpaddd		%xmm1,%xmm0,%xmm0
+	vpxor		%xmm0,%xmm3,%xmm3
+	vprord		$0x8,%xmm3,%xmm3
+	vpaddd		%xmm3,%xmm2,%xmm2
+	vpxor		%xmm2,%xmm1,%xmm1
+	vprord		$0x7,%xmm1,%xmm1
+	vpshufd		$0x93,%xmm1,%xmm1
+	vpshufd		$0x4e,%xmm2,%xmm2
+	vpshufd		$0x39,%xmm3,%xmm3
+	decb		%cl
+	jne		.Lblake2s_compress_avx512_roundloop
+	vpxor		%xmm10,%xmm0,%xmm0
+	vpxor		%xmm11,%xmm1,%xmm1
+	vpxor		%xmm2,%xmm0,%xmm0
+	vpxor		%xmm3,%xmm1,%xmm1
+	decq		%rdx
+	jne		.Lblake2s_compress_avx512_mainloop
+	vmovdqu		%xmm0,(%rdi)
+	vmovdqu		%xmm1,0x10(%rdi)
+	vmovdqu		%xmm4,0x20(%rdi)
+	vzeroupper
+	retq
+ENDPROC(blake2s_compress_avx512)
+#endif /* CONFIG_AS_AVX512 */
diff --git a/arch/x86/crypto/blake2s-glue.c b/arch/x86/crypto/blake2s-glue.c
new file mode 100644
index 000000000000..09243094a191
--- /dev/null
+++ b/arch/x86/crypto/blake2s-glue.c
@@ -0,0 +1,235 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/*
+ * Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#include <crypto/internal/blake2s.h>
+#include <crypto/internal/simd.h>
+#include <crypto/internal/hash.h>
+
+#include <linux/types.h>
+#include <linux/jump_label.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+
+#include <asm/cpufeature.h>
+#include <asm/fpu/api.h>
+#include <asm/processor.h>
+#include <asm/simd.h>
+
+asmlinkage void blake2s_compress_avx(struct blake2s_state *state,
+				     const u8 *block, const size_t nblocks,
+				     const u32 inc);
+asmlinkage void blake2s_compress_avx512(struct blake2s_state *state,
+					const u8 *block, const size_t nblocks,
+					const u32 inc);
+
+static __ro_after_init DEFINE_STATIC_KEY_FALSE(blake2s_use_avx);
+static __ro_after_init DEFINE_STATIC_KEY_FALSE(blake2s_use_avx512);
+
+void blake2s_compress_arch(struct blake2s_state *state,
+			   const u8 *block, size_t nblocks,
+			   const u32 inc)
+{
+	/* SIMD disables preemption, so relax after processing each page. */
+	BUILD_BUG_ON(PAGE_SIZE / BLAKE2S_BLOCK_SIZE < 8);
+
+	if (!static_branch_likely(&blake2s_use_avx) || !crypto_simd_usable()) {
+		blake2s_compress_generic(state, block, nblocks, inc);
+		return;
+	}
+
+	for (;;) {
+		const size_t blocks = min_t(size_t, nblocks,
+					    PAGE_SIZE / BLAKE2S_BLOCK_SIZE);
+
+		kernel_fpu_begin();
+		if (IS_ENABLED(CONFIG_AS_AVX512) &&
+		    static_branch_likely(&blake2s_use_avx512))
+			blake2s_compress_avx512(state, block, blocks, inc);
+		else
+			blake2s_compress_avx(state, block, blocks, inc);
+		kernel_fpu_end();
+
+		nblocks -= blocks;
+		if (!nblocks)
+			break;
+		block += blocks * BLAKE2S_BLOCK_SIZE;
+	}
+}
+EXPORT_SYMBOL(blake2s_compress_arch);
+
+static int crypto_blake2s_setkey(struct crypto_shash *tfm, const u8 *key,
+				 unsigned int keylen)
+{
+	struct blake2s_tfm_ctx *tctx = crypto_shash_ctx(tfm);
+
+	if (keylen == 0 || keylen > BLAKE2S_KEY_SIZE) {
+		crypto_shash_set_flags(tfm, CRYPTO_TFM_RES_BAD_KEY_LEN);
+		return -EINVAL;
+	}
+
+	memcpy(tctx->key, key, keylen);
+	tctx->keylen = keylen;
+
+	return 0;
+}
+
+static int crypto_blake2s_init(struct shash_desc *desc)
+{
+	struct blake2s_tfm_ctx *tctx = crypto_shash_ctx(desc->tfm);
+	struct blake2s_state *state = shash_desc_ctx(desc);
+	const int outlen = crypto_shash_digestsize(desc->tfm);
+
+	if (tctx->keylen)
+		blake2s_init_key(state, outlen, tctx->key, tctx->keylen);
+	else
+		blake2s_init(state, outlen);
+
+	return 0;
+}
+
+static int crypto_blake2s_update(struct shash_desc *desc, const u8 *in,
+				 unsigned int inlen)
+{
+	struct blake2s_state *state = shash_desc_ctx(desc);
+	const size_t fill = BLAKE2S_BLOCK_SIZE - state->buflen;
+
+	if (unlikely(!inlen))
+		return 0;
+	if (inlen > fill) {
+		memcpy(state->buf + state->buflen, in, fill);
+		blake2s_compress_arch(state, state->buf, 1, BLAKE2S_BLOCK_SIZE);
+		state->buflen = 0;
+		in += fill;
+		inlen -= fill;
+	}
+	if (inlen > BLAKE2S_BLOCK_SIZE) {
+		const size_t nblocks = DIV_ROUND_UP(inlen, BLAKE2S_BLOCK_SIZE);
+		/* Hash one less (full) block than strictly possible */
+		blake2s_compress_arch(state, in, nblocks - 1, BLAKE2S_BLOCK_SIZE);
+		in += BLAKE2S_BLOCK_SIZE * (nblocks - 1);
+		inlen -= BLAKE2S_BLOCK_SIZE * (nblocks - 1);
+	}
+	memcpy(state->buf + state->buflen, in, inlen);
+	state->buflen += inlen;
+
+	return 0;
+}
+
+static int crypto_blake2s_final(struct shash_desc *desc, u8 *out)
+{
+	struct blake2s_state *state = shash_desc_ctx(desc);
+
+	blake2s_set_lastblock(state);
+	memset(state->buf + state->buflen, 0,
+	       BLAKE2S_BLOCK_SIZE - state->buflen); /* Padding */
+	blake2s_compress_arch(state, state->buf, 1, state->buflen);
+	cpu_to_le32_array(state->h, ARRAY_SIZE(state->h));
+	memcpy(out, state->h, state->outlen);
+	memzero_explicit(state, sizeof(*state));
+
+	return 0;
+}
+
+static struct shash_alg blake2s_algs[] = {{
+	.base.cra_name		= "blake2s-128",
+	.base.cra_driver_name	= "blake2s-128-x86",
+	.base.cra_flags		= CRYPTO_ALG_OPTIONAL_KEY,
+	.base.cra_ctxsize	= sizeof(struct blake2s_tfm_ctx),
+	.base.cra_priority	= 200,
+	.base.cra_blocksize     = BLAKE2S_BLOCK_SIZE,
+	.base.cra_module	= THIS_MODULE,
+
+	.digestsize		= BLAKE2S_128_HASH_SIZE,
+	.setkey			= crypto_blake2s_setkey,
+	.init			= crypto_blake2s_init,
+	.update			= crypto_blake2s_update,
+	.final			= crypto_blake2s_final,
+	.descsize		= sizeof(struct blake2s_state),
+}, {
+	.base.cra_name		= "blake2s-160",
+	.base.cra_driver_name	= "blake2s-160-x86",
+	.base.cra_flags		= CRYPTO_ALG_OPTIONAL_KEY,
+	.base.cra_ctxsize	= sizeof(struct blake2s_tfm_ctx),
+	.base.cra_priority	= 200,
+	.base.cra_blocksize     = BLAKE2S_BLOCK_SIZE,
+	.base.cra_module	= THIS_MODULE,
+
+	.digestsize		= BLAKE2S_160_HASH_SIZE,
+	.setkey			= crypto_blake2s_setkey,
+	.init			= crypto_blake2s_init,
+	.update			= crypto_blake2s_update,
+	.final			= crypto_blake2s_final,
+	.descsize		= sizeof(struct blake2s_state),
+}, {
+	.base.cra_name		= "blake2s-224",
+	.base.cra_driver_name	= "blake2s-224-x86",
+	.base.cra_flags		= CRYPTO_ALG_OPTIONAL_KEY,
+	.base.cra_ctxsize	= sizeof(struct blake2s_tfm_ctx),
+	.base.cra_priority	= 200,
+	.base.cra_blocksize     = BLAKE2S_BLOCK_SIZE,
+	.base.cra_module	= THIS_MODULE,
+
+	.digestsize		= BLAKE2S_224_HASH_SIZE,
+	.setkey			= crypto_blake2s_setkey,
+	.init			= crypto_blake2s_init,
+	.update			= crypto_blake2s_update,
+	.final			= crypto_blake2s_final,
+	.descsize		= sizeof(struct blake2s_state),
+}, {
+	.base.cra_name		= "blake2s-256",
+	.base.cra_driver_name	= "blake2s-256-x86",
+	.base.cra_flags		= CRYPTO_ALG_OPTIONAL_KEY,
+	.base.cra_ctxsize	= sizeof(struct blake2s_tfm_ctx),
+	.base.cra_priority	= 200,
+	.base.cra_blocksize     = BLAKE2S_BLOCK_SIZE,
+	.base.cra_module	= THIS_MODULE,
+
+	.digestsize		= BLAKE2S_256_HASH_SIZE,
+	.setkey			= crypto_blake2s_setkey,
+	.init			= crypto_blake2s_init,
+	.update			= crypto_blake2s_update,
+	.final			= crypto_blake2s_final,
+	.descsize		= sizeof(struct blake2s_state),
+}};
+
+static int __init blake2s_mod_init(void)
+{
+	if (!boot_cpu_has(X86_FEATURE_AVX) ||
+	    !cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, NULL))
+		return 0;
+
+	static_branch_enable(&blake2s_use_avx);
+
+	if (IS_ENABLED(CONFIG_AS_AVX512) &&
+	    boot_cpu_has(X86_FEATURE_AVX) &&
+	    boot_cpu_has(X86_FEATURE_AVX2) &&
+	    boot_cpu_has(X86_FEATURE_AVX512F) &&
+	    boot_cpu_has(X86_FEATURE_AVX512VL) &&
+	    cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM |
+			      XFEATURE_MASK_AVX512, NULL))
+		static_branch_enable(&blake2s_use_avx512);
+
+	return crypto_register_shashes(blake2s_algs, ARRAY_SIZE(blake2s_algs));
+}
+
+static void __exit blake2s_mod_exit(void)
+{
+	if (boot_cpu_has(X86_FEATURE_AVX) &&
+	    cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, NULL))
+		crypto_unregister_shashes(blake2s_algs, ARRAY_SIZE(blake2s_algs));
+}
+
+module_init(blake2s_mod_init);
+module_exit(blake2s_mod_exit);
+
+MODULE_ALIAS_CRYPTO("blake2s-128");
+MODULE_ALIAS_CRYPTO("blake2s-128-x86");
+MODULE_ALIAS_CRYPTO("blake2s-160");
+MODULE_ALIAS_CRYPTO("blake2s-160-x86");
+MODULE_ALIAS_CRYPTO("blake2s-224");
+MODULE_ALIAS_CRYPTO("blake2s-224-x86");
+MODULE_ALIAS_CRYPTO("blake2s-256");
+MODULE_ALIAS_CRYPTO("blake2s-256-x86");
+MODULE_LICENSE("GPL v2");
diff --git a/crypto/Kconfig b/crypto/Kconfig
index c0f299279df8..ff077dd3a7b7 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1025,6 +1025,12 @@ config CRYPTO_BLAKE2S
 	tristate "BLAKE2s hash function"
 	select CRYPTO_LIB_BLAKE2S_GENERIC
 
+config CRYPTO_BLAKE2S_X86
+	tristate "BLAKE2s hash function library (x86 accelerated version)"
+	depends on X86 && 64BIT
+	select CRYPTO_LIB_BLAKE2S_GENERIC
+	select CRYPTO_ARCH_HAVE_LIB_BLAKE2S
+
 comment "Ciphers"
 
 config CRYPTO_AES
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4 26/35] crypto: Curve25519 - generic C library implementations
  2019-10-17 19:08 [PATCH v4 00/35] crypto: crypto API library interfaces for WireGuard Ard Biesheuvel
                   ` (24 preceding siblings ...)
  2019-10-17 19:09 ` [PATCH v4 25/35] crypto: BLAKE2s - x86_64 SIMD implementation Ard Biesheuvel
@ 2019-10-17 19:09 ` Ard Biesheuvel
  2019-10-17 19:09 ` [PATCH v4 27/35] crypto: testmgr - implement testing for KPP failures Ard Biesheuvel
                   ` (8 subsequent siblings)
  34 siblings, 0 replies; 50+ messages in thread
From: Ard Biesheuvel @ 2019-10-17 19:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: Ard Biesheuvel, Herbert Xu, David Miller, Jason A . Donenfeld,
	Samuel Neves, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Martin Willi, Rene van Dorst, David Sterba

From: "Jason A. Donenfeld" <Jason@zx2c4.com>

This contains two formally verified C implementations of the Curve25519
scalar multiplication function, one for 32-bit systems, and one for
64-bit systems whose compiler supports efficient 128-bit integer types.
Not only are these implementations formally verified, but they are also
the fastest available C implementations. They have been modified to be
friendly to kernel space and to be generally less horrendous looking,
but still an effort has been made to retain their formally verified
characteristic, and so the C might look slightly unidiomatic.

The 64-bit version comes from HACL*: https://github.com/project-everest/hacl-star
The 32-bit version comes from Fiat: https://github.com/mit-plv/fiat-crypto

Information: https://cr.yp.to/ecdh.html

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
[ardb: - move from lib/zinc to lib/crypto
       - replace .c #includes with Kconfig based object selection
       - drop simd handling and simplify support for per-arch versions ]
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 include/crypto/curve25519.h    |  71 ++
 lib/crypto/Kconfig             |  25 +
 lib/crypto/Makefile            |   5 +
 lib/crypto/curve25519-fiat32.c | 864 ++++++++++++++++++++
 lib/crypto/curve25519-hacl64.c | 788 ++++++++++++++++++
 lib/crypto/curve25519.c        |  25 +
 6 files changed, 1778 insertions(+)

diff --git a/include/crypto/curve25519.h b/include/crypto/curve25519.h
new file mode 100644
index 000000000000..4e6dc840b159
--- /dev/null
+++ b/include/crypto/curve25519.h
@@ -0,0 +1,71 @@
+/* SPDX-License-Identifier: GPL-2.0 OR MIT */
+/*
+ * Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#ifndef CURVE25519_H
+#define CURVE25519_H
+
+#include <crypto/algapi.h> // For crypto_memneq.
+#include <linux/types.h>
+#include <linux/random.h>
+
+enum curve25519_lengths {
+	CURVE25519_KEY_SIZE = 32
+};
+
+extern const u8 curve25519_null_point[];
+extern const u8 curve25519_base_point[];
+
+void curve25519_generic(u8 out[CURVE25519_KEY_SIZE],
+			const u8 scalar[CURVE25519_KEY_SIZE],
+			const u8 point[CURVE25519_KEY_SIZE]);
+
+void curve25519_arch(u8 out[CURVE25519_KEY_SIZE],
+		     const u8 scalar[CURVE25519_KEY_SIZE],
+		     const u8 point[CURVE25519_KEY_SIZE]);
+
+void curve25519_base_arch(u8 pub[CURVE25519_KEY_SIZE],
+			  const u8 secret[CURVE25519_KEY_SIZE]);
+
+static inline
+bool __must_check curve25519(u8 mypublic[CURVE25519_KEY_SIZE],
+			     const u8 secret[CURVE25519_KEY_SIZE],
+			     const u8 basepoint[CURVE25519_KEY_SIZE])
+{
+	if (IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_CURVE25519))
+		curve25519_arch(mypublic, secret, basepoint);
+	else
+		curve25519_generic(mypublic, secret, basepoint);
+	return crypto_memneq(mypublic, curve25519_null_point,
+			     CURVE25519_KEY_SIZE);
+}
+
+static inline bool
+__must_check curve25519_generate_public(u8 pub[CURVE25519_KEY_SIZE],
+					const u8 secret[CURVE25519_KEY_SIZE])
+{
+	if (unlikely(!crypto_memneq(secret, curve25519_null_point,
+				    CURVE25519_KEY_SIZE)))
+		return false;
+
+	if (IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_CURVE25519))
+		curve25519_base_arch(pub, secret);
+	else
+		curve25519_generic(pub, secret, curve25519_base_point);
+	return crypto_memneq(pub, curve25519_null_point, CURVE25519_KEY_SIZE);
+}
+
+static inline void curve25519_clamp_secret(u8 secret[CURVE25519_KEY_SIZE])
+{
+	secret[0] &= 248;
+	secret[31] = (secret[31] & 127) | 64;
+}
+
+static inline void curve25519_generate_secret(u8 secret[CURVE25519_KEY_SIZE])
+{
+	get_random_bytes_wait(secret, CURVE25519_KEY_SIZE);
+	curve25519_clamp_secret(secret);
+}
+
+#endif /* CURVE25519_H */
diff --git a/lib/crypto/Kconfig b/lib/crypto/Kconfig
index 7ad98b624e55..b1d830dc1c9e 100644
--- a/lib/crypto/Kconfig
+++ b/lib/crypto/Kconfig
@@ -59,6 +59,31 @@ config CRYPTO_LIB_CHACHA
 	  by either the generic implementation or an arch-specific one, if one
 	  is available and enabled.
 
+config CRYPTO_ARCH_HAVE_LIB_CURVE25519
+	tristate
+	help
+	  Declares whether the architecture provides an arch-specific
+	  accelerated implementation of the Curve25519 library interface,
+	  either builtin or as a module.
+
+config CRYPTO_LIB_CURVE25519_GENERIC
+	tristate
+	help
+	  This symbol can be depended upon by arch implementations of the
+	  Curve25519 library interface that require the generic code as a
+	  fallback, e.g., for SIMD implementations. If no arch specific
+	  implementation is enabled, this implementation serves the users
+	  of CRYPTO_LIB_CURVE25519.
+
+config CRYPTO_LIB_CURVE25519
+	tristate "Curve25519 scalar multiplication library"
+	depends on CRYPTO_ARCH_HAVE_LIB_CURVE25519 || !CRYPTO_ARCH_HAVE_LIB_CURVE25519
+	select CRYPTO_LIB_CURVE25519_GENERIC if CRYPTO_ARCH_HAVE_LIB_CURVE25519=n
+	help
+	  Enable the Curve25519 library interface. This interface may be
+	  fulfilled by either the generic implementation or an arch-specific
+	  one, if one is available and enabled.
+
 config CRYPTO_LIB_DES
 	tristate
 
diff --git a/lib/crypto/Makefile b/lib/crypto/Makefile
index 8ca66b5f9807..273c55d5e147 100644
--- a/lib/crypto/Makefile
+++ b/lib/crypto/Makefile
@@ -16,6 +16,11 @@ libblake2s-generic-y				+= blake2s-generic.o
 obj-$(CONFIG_CRYPTO_LIB_BLAKE2S)		+= libblake2s.o
 libblake2s-y					+= blake2s.o
 
+obj-$(CONFIG_CRYPTO_LIB_CURVE25519_GENERIC)	+= libcurve25519.o
+libcurve25519-y					:= curve25519-fiat32.o
+libcurve25519-$(CONFIG_ARCH_SUPPORTS_INT128)	:= curve25519-hacl64.o
+libcurve25519-y					+= curve25519.o
+
 obj-$(CONFIG_CRYPTO_LIB_DES)			+= libdes.o
 libdes-y					:= des.o
 
diff --git a/lib/crypto/curve25519-fiat32.c b/lib/crypto/curve25519-fiat32.c
new file mode 100644
index 000000000000..1c455207341d
--- /dev/null
+++ b/lib/crypto/curve25519-fiat32.c
@@ -0,0 +1,864 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/*
+ * Copyright (C) 2015-2016 The fiat-crypto Authors.
+ * Copyright (C) 2018-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ *
+ * This is a machine-generated formally verified implementation of Curve25519
+ * ECDH from: <https://github.com/mit-plv/fiat-crypto>. Though originally
+ * machine generated, it has been tweaked to be suitable for use in the kernel.
+ * It is optimized for 32-bit machines and machines that cannot work efficiently
+ * with 128-bit integer types.
+ */
+
+#include <asm/unaligned.h>
+#include <crypto/curve25519.h>
+#include <linux/string.h>
+
+/* fe means field element. Here the field is \Z/(2^255-19). An element t,
+ * entries t[0]...t[9], represents the integer t[0]+2^26 t[1]+2^51 t[2]+2^77
+ * t[3]+2^102 t[4]+...+2^230 t[9].
+ * fe limbs are bounded by 1.125*2^26,1.125*2^25,1.125*2^26,1.125*2^25,etc.
+ * Multiplication and carrying produce fe from fe_loose.
+ */
+typedef struct fe { u32 v[10]; } fe;
+
+/* fe_loose limbs are bounded by 3.375*2^26,3.375*2^25,3.375*2^26,3.375*2^25,etc
+ * Addition and subtraction produce fe_loose from (fe, fe).
+ */
+typedef struct fe_loose { u32 v[10]; } fe_loose;
+
+static __always_inline void fe_frombytes_impl(u32 h[10], const u8 *s)
+{
+	/* Ignores top bit of s. */
+	u32 a0 = get_unaligned_le32(s);
+	u32 a1 = get_unaligned_le32(s+4);
+	u32 a2 = get_unaligned_le32(s+8);
+	u32 a3 = get_unaligned_le32(s+12);
+	u32 a4 = get_unaligned_le32(s+16);
+	u32 a5 = get_unaligned_le32(s+20);
+	u32 a6 = get_unaligned_le32(s+24);
+	u32 a7 = get_unaligned_le32(s+28);
+	h[0] = a0&((1<<26)-1);                    /* 26 used, 32-26 left.   26 */
+	h[1] = (a0>>26) | ((a1&((1<<19)-1))<< 6); /* (32-26) + 19 =  6+19 = 25 */
+	h[2] = (a1>>19) | ((a2&((1<<13)-1))<<13); /* (32-19) + 13 = 13+13 = 26 */
+	h[3] = (a2>>13) | ((a3&((1<< 6)-1))<<19); /* (32-13) +  6 = 19+ 6 = 25 */
+	h[4] = (a3>> 6);                          /* (32- 6)              = 26 */
+	h[5] = a4&((1<<25)-1);                    /*                        25 */
+	h[6] = (a4>>25) | ((a5&((1<<19)-1))<< 7); /* (32-25) + 19 =  7+19 = 26 */
+	h[7] = (a5>>19) | ((a6&((1<<12)-1))<<13); /* (32-19) + 12 = 13+12 = 25 */
+	h[8] = (a6>>12) | ((a7&((1<< 6)-1))<<20); /* (32-12) +  6 = 20+ 6 = 26 */
+	h[9] = (a7>> 6)&((1<<25)-1); /*                                     25 */
+}
+
+static __always_inline void fe_frombytes(fe *h, const u8 *s)
+{
+	fe_frombytes_impl(h->v, s);
+}
+
+static __always_inline u8 /*bool*/
+addcarryx_u25(u8 /*bool*/ c, u32 a, u32 b, u32 *low)
+{
+	/* This function extracts 25 bits of result and 1 bit of carry
+	 * (26 total), so a 32-bit intermediate is sufficient.
+	 */
+	u32 x = a + b + c;
+	*low = x & ((1 << 25) - 1);
+	return (x >> 25) & 1;
+}
+
+static __always_inline u8 /*bool*/
+addcarryx_u26(u8 /*bool*/ c, u32 a, u32 b, u32 *low)
+{
+	/* This function extracts 26 bits of result and 1 bit of carry
+	 * (27 total), so a 32-bit intermediate is sufficient.
+	 */
+	u32 x = a + b + c;
+	*low = x & ((1 << 26) - 1);
+	return (x >> 26) & 1;
+}
+
+static __always_inline u8 /*bool*/
+subborrow_u25(u8 /*bool*/ c, u32 a, u32 b, u32 *low)
+{
+	/* This function extracts 25 bits of result and 1 bit of borrow
+	 * (26 total), so a 32-bit intermediate is sufficient.
+	 */
+	u32 x = a - b - c;
+	*low = x & ((1 << 25) - 1);
+	return x >> 31;
+}
+
+static __always_inline u8 /*bool*/
+subborrow_u26(u8 /*bool*/ c, u32 a, u32 b, u32 *low)
+{
+	/* This function extracts 26 bits of result and 1 bit of borrow
+	 *(27 total), so a 32-bit intermediate is sufficient.
+	 */
+	u32 x = a - b - c;
+	*low = x & ((1 << 26) - 1);
+	return x >> 31;
+}
+
+static __always_inline u32 cmovznz32(u32 t, u32 z, u32 nz)
+{
+	t = -!!t; /* all set if nonzero, 0 if 0 */
+	return (t&nz) | ((~t)&z);
+}
+
+static __always_inline void fe_freeze(u32 out[10], const u32 in1[10])
+{
+	{ const u32 x17 = in1[9];
+	{ const u32 x18 = in1[8];
+	{ const u32 x16 = in1[7];
+	{ const u32 x14 = in1[6];
+	{ const u32 x12 = in1[5];
+	{ const u32 x10 = in1[4];
+	{ const u32 x8 = in1[3];
+	{ const u32 x6 = in1[2];
+	{ const u32 x4 = in1[1];
+	{ const u32 x2 = in1[0];
+	{ u32 x20; u8/*bool*/ x21 = subborrow_u26(0x0, x2, 0x3ffffed, &x20);
+	{ u32 x23; u8/*bool*/ x24 = subborrow_u25(x21, x4, 0x1ffffff, &x23);
+	{ u32 x26; u8/*bool*/ x27 = subborrow_u26(x24, x6, 0x3ffffff, &x26);
+	{ u32 x29; u8/*bool*/ x30 = subborrow_u25(x27, x8, 0x1ffffff, &x29);
+	{ u32 x32; u8/*bool*/ x33 = subborrow_u26(x30, x10, 0x3ffffff, &x32);
+	{ u32 x35; u8/*bool*/ x36 = subborrow_u25(x33, x12, 0x1ffffff, &x35);
+	{ u32 x38; u8/*bool*/ x39 = subborrow_u26(x36, x14, 0x3ffffff, &x38);
+	{ u32 x41; u8/*bool*/ x42 = subborrow_u25(x39, x16, 0x1ffffff, &x41);
+	{ u32 x44; u8/*bool*/ x45 = subborrow_u26(x42, x18, 0x3ffffff, &x44);
+	{ u32 x47; u8/*bool*/ x48 = subborrow_u25(x45, x17, 0x1ffffff, &x47);
+	{ u32 x49 = cmovznz32(x48, 0x0, 0xffffffff);
+	{ u32 x50 = (x49 & 0x3ffffed);
+	{ u32 x52; u8/*bool*/ x53 = addcarryx_u26(0x0, x20, x50, &x52);
+	{ u32 x54 = (x49 & 0x1ffffff);
+	{ u32 x56; u8/*bool*/ x57 = addcarryx_u25(x53, x23, x54, &x56);
+	{ u32 x58 = (x49 & 0x3ffffff);
+	{ u32 x60; u8/*bool*/ x61 = addcarryx_u26(x57, x26, x58, &x60);
+	{ u32 x62 = (x49 & 0x1ffffff);
+	{ u32 x64; u8/*bool*/ x65 = addcarryx_u25(x61, x29, x62, &x64);
+	{ u32 x66 = (x49 & 0x3ffffff);
+	{ u32 x68; u8/*bool*/ x69 = addcarryx_u26(x65, x32, x66, &x68);
+	{ u32 x70 = (x49 & 0x1ffffff);
+	{ u32 x72; u8/*bool*/ x73 = addcarryx_u25(x69, x35, x70, &x72);
+	{ u32 x74 = (x49 & 0x3ffffff);
+	{ u32 x76; u8/*bool*/ x77 = addcarryx_u26(x73, x38, x74, &x76);
+	{ u32 x78 = (x49 & 0x1ffffff);
+	{ u32 x80; u8/*bool*/ x81 = addcarryx_u25(x77, x41, x78, &x80);
+	{ u32 x82 = (x49 & 0x3ffffff);
+	{ u32 x84; u8/*bool*/ x85 = addcarryx_u26(x81, x44, x82, &x84);
+	{ u32 x86 = (x49 & 0x1ffffff);
+	{ u32 x88; addcarryx_u25(x85, x47, x86, &x88);
+	out[0] = x52;
+	out[1] = x56;
+	out[2] = x60;
+	out[3] = x64;
+	out[4] = x68;
+	out[5] = x72;
+	out[6] = x76;
+	out[7] = x80;
+	out[8] = x84;
+	out[9] = x88;
+	}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}
+}
+
+static __always_inline void fe_tobytes(u8 s[32], const fe *f)
+{
+	u32 h[10];
+	fe_freeze(h, f->v);
+	s[0] = h[0] >> 0;
+	s[1] = h[0] >> 8;
+	s[2] = h[0] >> 16;
+	s[3] = (h[0] >> 24) | (h[1] << 2);
+	s[4] = h[1] >> 6;
+	s[5] = h[1] >> 14;
+	s[6] = (h[1] >> 22) | (h[2] << 3);
+	s[7] = h[2] >> 5;
+	s[8] = h[2] >> 13;
+	s[9] = (h[2] >> 21) | (h[3] << 5);
+	s[10] = h[3] >> 3;
+	s[11] = h[3] >> 11;
+	s[12] = (h[3] >> 19) | (h[4] << 6);
+	s[13] = h[4] >> 2;
+	s[14] = h[4] >> 10;
+	s[15] = h[4] >> 18;
+	s[16] = h[5] >> 0;
+	s[17] = h[5] >> 8;
+	s[18] = h[5] >> 16;
+	s[19] = (h[5] >> 24) | (h[6] << 1);
+	s[20] = h[6] >> 7;
+	s[21] = h[6] >> 15;
+	s[22] = (h[6] >> 23) | (h[7] << 3);
+	s[23] = h[7] >> 5;
+	s[24] = h[7] >> 13;
+	s[25] = (h[7] >> 21) | (h[8] << 4);
+	s[26] = h[8] >> 4;
+	s[27] = h[8] >> 12;
+	s[28] = (h[8] >> 20) | (h[9] << 6);
+	s[29] = h[9] >> 2;
+	s[30] = h[9] >> 10;
+	s[31] = h[9] >> 18;
+}
+
+/* h = f */
+static __always_inline void fe_copy(fe *h, const fe *f)
+{
+	memmove(h, f, sizeof(u32) * 10);
+}
+
+static __always_inline void fe_copy_lt(fe_loose *h, const fe *f)
+{
+	memmove(h, f, sizeof(u32) * 10);
+}
+
+/* h = 0 */
+static __always_inline void fe_0(fe *h)
+{
+	memset(h, 0, sizeof(u32) * 10);
+}
+
+/* h = 1 */
+static __always_inline void fe_1(fe *h)
+{
+	memset(h, 0, sizeof(u32) * 10);
+	h->v[0] = 1;
+}
+
+static void fe_add_impl(u32 out[10], const u32 in1[10], const u32 in2[10])
+{
+	{ const u32 x20 = in1[9];
+	{ const u32 x21 = in1[8];
+	{ const u32 x19 = in1[7];
+	{ const u32 x17 = in1[6];
+	{ const u32 x15 = in1[5];
+	{ const u32 x13 = in1[4];
+	{ const u32 x11 = in1[3];
+	{ const u32 x9 = in1[2];
+	{ const u32 x7 = in1[1];
+	{ const u32 x5 = in1[0];
+	{ const u32 x38 = in2[9];
+	{ const u32 x39 = in2[8];
+	{ const u32 x37 = in2[7];
+	{ const u32 x35 = in2[6];
+	{ const u32 x33 = in2[5];
+	{ const u32 x31 = in2[4];
+	{ const u32 x29 = in2[3];
+	{ const u32 x27 = in2[2];
+	{ const u32 x25 = in2[1];
+	{ const u32 x23 = in2[0];
+	out[0] = (x5 + x23);
+	out[1] = (x7 + x25);
+	out[2] = (x9 + x27);
+	out[3] = (x11 + x29);
+	out[4] = (x13 + x31);
+	out[5] = (x15 + x33);
+	out[6] = (x17 + x35);
+	out[7] = (x19 + x37);
+	out[8] = (x21 + x39);
+	out[9] = (x20 + x38);
+	}}}}}}}}}}}}}}}}}}}}
+}
+
+/* h = f + g
+ * Can overlap h with f or g.
+ */
+static __always_inline void fe_add(fe_loose *h, const fe *f, const fe *g)
+{
+	fe_add_impl(h->v, f->v, g->v);
+}
+
+static void fe_sub_impl(u32 out[10], const u32 in1[10], const u32 in2[10])
+{
+	{ const u32 x20 = in1[9];
+	{ const u32 x21 = in1[8];
+	{ const u32 x19 = in1[7];
+	{ const u32 x17 = in1[6];
+	{ const u32 x15 = in1[5];
+	{ const u32 x13 = in1[4];
+	{ const u32 x11 = in1[3];
+	{ const u32 x9 = in1[2];
+	{ const u32 x7 = in1[1];
+	{ const u32 x5 = in1[0];
+	{ const u32 x38 = in2[9];
+	{ const u32 x39 = in2[8];
+	{ const u32 x37 = in2[7];
+	{ const u32 x35 = in2[6];
+	{ const u32 x33 = in2[5];
+	{ const u32 x31 = in2[4];
+	{ const u32 x29 = in2[3];
+	{ const u32 x27 = in2[2];
+	{ const u32 x25 = in2[1];
+	{ const u32 x23 = in2[0];
+	out[0] = ((0x7ffffda + x5) - x23);
+	out[1] = ((0x3fffffe + x7) - x25);
+	out[2] = ((0x7fffffe + x9) - x27);
+	out[3] = ((0x3fffffe + x11) - x29);
+	out[4] = ((0x7fffffe + x13) - x31);
+	out[5] = ((0x3fffffe + x15) - x33);
+	out[6] = ((0x7fffffe + x17) - x35);
+	out[7] = ((0x3fffffe + x19) - x37);
+	out[8] = ((0x7fffffe + x21) - x39);
+	out[9] = ((0x3fffffe + x20) - x38);
+	}}}}}}}}}}}}}}}}}}}}
+}
+
+/* h = f - g
+ * Can overlap h with f or g.
+ */
+static __always_inline void fe_sub(fe_loose *h, const fe *f, const fe *g)
+{
+	fe_sub_impl(h->v, f->v, g->v);
+}
+
+static void fe_mul_impl(u32 out[10], const u32 in1[10], const u32 in2[10])
+{
+	{ const u32 x20 = in1[9];
+	{ const u32 x21 = in1[8];
+	{ const u32 x19 = in1[7];
+	{ const u32 x17 = in1[6];
+	{ const u32 x15 = in1[5];
+	{ const u32 x13 = in1[4];
+	{ const u32 x11 = in1[3];
+	{ const u32 x9 = in1[2];
+	{ const u32 x7 = in1[1];
+	{ const u32 x5 = in1[0];
+	{ const u32 x38 = in2[9];
+	{ const u32 x39 = in2[8];
+	{ const u32 x37 = in2[7];
+	{ const u32 x35 = in2[6];
+	{ const u32 x33 = in2[5];
+	{ const u32 x31 = in2[4];
+	{ const u32 x29 = in2[3];
+	{ const u32 x27 = in2[2];
+	{ const u32 x25 = in2[1];
+	{ const u32 x23 = in2[0];
+	{ u64 x40 = ((u64)x23 * x5);
+	{ u64 x41 = (((u64)x23 * x7) + ((u64)x25 * x5));
+	{ u64 x42 = ((((u64)(0x2 * x25) * x7) + ((u64)x23 * x9)) + ((u64)x27 * x5));
+	{ u64 x43 = (((((u64)x25 * x9) + ((u64)x27 * x7)) + ((u64)x23 * x11)) + ((u64)x29 * x5));
+	{ u64 x44 = (((((u64)x27 * x9) + (0x2 * (((u64)x25 * x11) + ((u64)x29 * x7)))) + ((u64)x23 * x13)) + ((u64)x31 * x5));
+	{ u64 x45 = (((((((u64)x27 * x11) + ((u64)x29 * x9)) + ((u64)x25 * x13)) + ((u64)x31 * x7)) + ((u64)x23 * x15)) + ((u64)x33 * x5));
+	{ u64 x46 = (((((0x2 * ((((u64)x29 * x11) + ((u64)x25 * x15)) + ((u64)x33 * x7))) + ((u64)x27 * x13)) + ((u64)x31 * x9)) + ((u64)x23 * x17)) + ((u64)x35 * x5));
+	{ u64 x47 = (((((((((u64)x29 * x13) + ((u64)x31 * x11)) + ((u64)x27 * x15)) + ((u64)x33 * x9)) + ((u64)x25 * x17)) + ((u64)x35 * x7)) + ((u64)x23 * x19)) + ((u64)x37 * x5));
+	{ u64 x48 = (((((((u64)x31 * x13) + (0x2 * (((((u64)x29 * x15) + ((u64)x33 * x11)) + ((u64)x25 * x19)) + ((u64)x37 * x7)))) + ((u64)x27 * x17)) + ((u64)x35 * x9)) + ((u64)x23 * x21)) + ((u64)x39 * x5));
+	{ u64 x49 = (((((((((((u64)x31 * x15) + ((u64)x33 * x13)) + ((u64)x29 * x17)) + ((u64)x35 * x11)) + ((u64)x27 * x19)) + ((u64)x37 * x9)) + ((u64)x25 * x21)) + ((u64)x39 * x7)) + ((u64)x23 * x20)) + ((u64)x38 * x5));
+	{ u64 x50 = (((((0x2 * ((((((u64)x33 * x15) + ((u64)x29 * x19)) + ((u64)x37 * x11)) + ((u64)x25 * x20)) + ((u64)x38 * x7))) + ((u64)x31 * x17)) + ((u64)x35 * x13)) + ((u64)x27 * x21)) + ((u64)x39 * x9));
+	{ u64 x51 = (((((((((u64)x33 * x17) + ((u64)x35 * x15)) + ((u64)x31 * x19)) + ((u64)x37 * x13)) + ((u64)x29 * x21)) + ((u64)x39 * x11)) + ((u64)x27 * x20)) + ((u64)x38 * x9));
+	{ u64 x52 = (((((u64)x35 * x17) + (0x2 * (((((u64)x33 * x19) + ((u64)x37 * x15)) + ((u64)x29 * x20)) + ((u64)x38 * x11)))) + ((u64)x31 * x21)) + ((u64)x39 * x13));
+	{ u64 x53 = (((((((u64)x35 * x19) + ((u64)x37 * x17)) + ((u64)x33 * x21)) + ((u64)x39 * x15)) + ((u64)x31 * x20)) + ((u64)x38 * x13));
+	{ u64 x54 = (((0x2 * ((((u64)x37 * x19) + ((u64)x33 * x20)) + ((u64)x38 * x15))) + ((u64)x35 * x21)) + ((u64)x39 * x17));
+	{ u64 x55 = (((((u64)x37 * x21) + ((u64)x39 * x19)) + ((u64)x35 * x20)) + ((u64)x38 * x17));
+	{ u64 x56 = (((u64)x39 * x21) + (0x2 * (((u64)x37 * x20) + ((u64)x38 * x19))));
+	{ u64 x57 = (((u64)x39 * x20) + ((u64)x38 * x21));
+	{ u64 x58 = ((u64)(0x2 * x38) * x20);
+	{ u64 x59 = (x48 + (x58 << 0x4));
+	{ u64 x60 = (x59 + (x58 << 0x1));
+	{ u64 x61 = (x60 + x58);
+	{ u64 x62 = (x47 + (x57 << 0x4));
+	{ u64 x63 = (x62 + (x57 << 0x1));
+	{ u64 x64 = (x63 + x57);
+	{ u64 x65 = (x46 + (x56 << 0x4));
+	{ u64 x66 = (x65 + (x56 << 0x1));
+	{ u64 x67 = (x66 + x56);
+	{ u64 x68 = (x45 + (x55 << 0x4));
+	{ u64 x69 = (x68 + (x55 << 0x1));
+	{ u64 x70 = (x69 + x55);
+	{ u64 x71 = (x44 + (x54 << 0x4));
+	{ u64 x72 = (x71 + (x54 << 0x1));
+	{ u64 x73 = (x72 + x54);
+	{ u64 x74 = (x43 + (x53 << 0x4));
+	{ u64 x75 = (x74 + (x53 << 0x1));
+	{ u64 x76 = (x75 + x53);
+	{ u64 x77 = (x42 + (x52 << 0x4));
+	{ u64 x78 = (x77 + (x52 << 0x1));
+	{ u64 x79 = (x78 + x52);
+	{ u64 x80 = (x41 + (x51 << 0x4));
+	{ u64 x81 = (x80 + (x51 << 0x1));
+	{ u64 x82 = (x81 + x51);
+	{ u64 x83 = (x40 + (x50 << 0x4));
+	{ u64 x84 = (x83 + (x50 << 0x1));
+	{ u64 x85 = (x84 + x50);
+	{ u64 x86 = (x85 >> 0x1a);
+	{ u32 x87 = ((u32)x85 & 0x3ffffff);
+	{ u64 x88 = (x86 + x82);
+	{ u64 x89 = (x88 >> 0x19);
+	{ u32 x90 = ((u32)x88 & 0x1ffffff);
+	{ u64 x91 = (x89 + x79);
+	{ u64 x92 = (x91 >> 0x1a);
+	{ u32 x93 = ((u32)x91 & 0x3ffffff);
+	{ u64 x94 = (x92 + x76);
+	{ u64 x95 = (x94 >> 0x19);
+	{ u32 x96 = ((u32)x94 & 0x1ffffff);
+	{ u64 x97 = (x95 + x73);
+	{ u64 x98 = (x97 >> 0x1a);
+	{ u32 x99 = ((u32)x97 & 0x3ffffff);
+	{ u64 x100 = (x98 + x70);
+	{ u64 x101 = (x100 >> 0x19);
+	{ u32 x102 = ((u32)x100 & 0x1ffffff);
+	{ u64 x103 = (x101 + x67);
+	{ u64 x104 = (x103 >> 0x1a);
+	{ u32 x105 = ((u32)x103 & 0x3ffffff);
+	{ u64 x106 = (x104 + x64);
+	{ u64 x107 = (x106 >> 0x19);
+	{ u32 x108 = ((u32)x106 & 0x1ffffff);
+	{ u64 x109 = (x107 + x61);
+	{ u64 x110 = (x109 >> 0x1a);
+	{ u32 x111 = ((u32)x109 & 0x3ffffff);
+	{ u64 x112 = (x110 + x49);
+	{ u64 x113 = (x112 >> 0x19);
+	{ u32 x114 = ((u32)x112 & 0x1ffffff);
+	{ u64 x115 = (x87 + (0x13 * x113));
+	{ u32 x116 = (u32) (x115 >> 0x1a);
+	{ u32 x117 = ((u32)x115 & 0x3ffffff);
+	{ u32 x118 = (x116 + x90);
+	{ u32 x119 = (x118 >> 0x19);
+	{ u32 x120 = (x118 & 0x1ffffff);
+	out[0] = x117;
+	out[1] = x120;
+	out[2] = (x119 + x93);
+	out[3] = x96;
+	out[4] = x99;
+	out[5] = x102;
+	out[6] = x105;
+	out[7] = x108;
+	out[8] = x111;
+	out[9] = x114;
+	}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}
+}
+
+static __always_inline void fe_mul_ttt(fe *h, const fe *f, const fe *g)
+{
+	fe_mul_impl(h->v, f->v, g->v);
+}
+
+static __always_inline void fe_mul_tlt(fe *h, const fe_loose *f, const fe *g)
+{
+	fe_mul_impl(h->v, f->v, g->v);
+}
+
+static __always_inline void
+fe_mul_tll(fe *h, const fe_loose *f, const fe_loose *g)
+{
+	fe_mul_impl(h->v, f->v, g->v);
+}
+
+static void fe_sqr_impl(u32 out[10], const u32 in1[10])
+{
+	{ const u32 x17 = in1[9];
+	{ const u32 x18 = in1[8];
+	{ const u32 x16 = in1[7];
+	{ const u32 x14 = in1[6];
+	{ const u32 x12 = in1[5];
+	{ const u32 x10 = in1[4];
+	{ const u32 x8 = in1[3];
+	{ const u32 x6 = in1[2];
+	{ const u32 x4 = in1[1];
+	{ const u32 x2 = in1[0];
+	{ u64 x19 = ((u64)x2 * x2);
+	{ u64 x20 = ((u64)(0x2 * x2) * x4);
+	{ u64 x21 = (0x2 * (((u64)x4 * x4) + ((u64)x2 * x6)));
+	{ u64 x22 = (0x2 * (((u64)x4 * x6) + ((u64)x2 * x8)));
+	{ u64 x23 = ((((u64)x6 * x6) + ((u64)(0x4 * x4) * x8)) + ((u64)(0x2 * x2) * x10));
+	{ u64 x24 = (0x2 * ((((u64)x6 * x8) + ((u64)x4 * x10)) + ((u64)x2 * x12)));
+	{ u64 x25 = (0x2 * (((((u64)x8 * x8) + ((u64)x6 * x10)) + ((u64)x2 * x14)) + ((u64)(0x2 * x4) * x12)));
+	{ u64 x26 = (0x2 * (((((u64)x8 * x10) + ((u64)x6 * x12)) + ((u64)x4 * x14)) + ((u64)x2 * x16)));
+	{ u64 x27 = (((u64)x10 * x10) + (0x2 * ((((u64)x6 * x14) + ((u64)x2 * x18)) + (0x2 * (((u64)x4 * x16) + ((u64)x8 * x12))))));
+	{ u64 x28 = (0x2 * ((((((u64)x10 * x12) + ((u64)x8 * x14)) + ((u64)x6 * x16)) + ((u64)x4 * x18)) + ((u64)x2 * x17)));
+	{ u64 x29 = (0x2 * (((((u64)x12 * x12) + ((u64)x10 * x14)) + ((u64)x6 * x18)) + (0x2 * (((u64)x8 * x16) + ((u64)x4 * x17)))));
+	{ u64 x30 = (0x2 * (((((u64)x12 * x14) + ((u64)x10 * x16)) + ((u64)x8 * x18)) + ((u64)x6 * x17)));
+	{ u64 x31 = (((u64)x14 * x14) + (0x2 * (((u64)x10 * x18) + (0x2 * (((u64)x12 * x16) + ((u64)x8 * x17))))));
+	{ u64 x32 = (0x2 * ((((u64)x14 * x16) + ((u64)x12 * x18)) + ((u64)x10 * x17)));
+	{ u64 x33 = (0x2 * ((((u64)x16 * x16) + ((u64)x14 * x18)) + ((u64)(0x2 * x12) * x17)));
+	{ u64 x34 = (0x2 * (((u64)x16 * x18) + ((u64)x14 * x17)));
+	{ u64 x35 = (((u64)x18 * x18) + ((u64)(0x4 * x16) * x17));
+	{ u64 x36 = ((u64)(0x2 * x18) * x17);
+	{ u64 x37 = ((u64)(0x2 * x17) * x17);
+	{ u64 x38 = (x27 + (x37 << 0x4));
+	{ u64 x39 = (x38 + (x37 << 0x1));
+	{ u64 x40 = (x39 + x37);
+	{ u64 x41 = (x26 + (x36 << 0x4));
+	{ u64 x42 = (x41 + (x36 << 0x1));
+	{ u64 x43 = (x42 + x36);
+	{ u64 x44 = (x25 + (x35 << 0x4));
+	{ u64 x45 = (x44 + (x35 << 0x1));
+	{ u64 x46 = (x45 + x35);
+	{ u64 x47 = (x24 + (x34 << 0x4));
+	{ u64 x48 = (x47 + (x34 << 0x1));
+	{ u64 x49 = (x48 + x34);
+	{ u64 x50 = (x23 + (x33 << 0x4));
+	{ u64 x51 = (x50 + (x33 << 0x1));
+	{ u64 x52 = (x51 + x33);
+	{ u64 x53 = (x22 + (x32 << 0x4));
+	{ u64 x54 = (x53 + (x32 << 0x1));
+	{ u64 x55 = (x54 + x32);
+	{ u64 x56 = (x21 + (x31 << 0x4));
+	{ u64 x57 = (x56 + (x31 << 0x1));
+	{ u64 x58 = (x57 + x31);
+	{ u64 x59 = (x20 + (x30 << 0x4));
+	{ u64 x60 = (x59 + (x30 << 0x1));
+	{ u64 x61 = (x60 + x30);
+	{ u64 x62 = (x19 + (x29 << 0x4));
+	{ u64 x63 = (x62 + (x29 << 0x1));
+	{ u64 x64 = (x63 + x29);
+	{ u64 x65 = (x64 >> 0x1a);
+	{ u32 x66 = ((u32)x64 & 0x3ffffff);
+	{ u64 x67 = (x65 + x61);
+	{ u64 x68 = (x67 >> 0x19);
+	{ u32 x69 = ((u32)x67 & 0x1ffffff);
+	{ u64 x70 = (x68 + x58);
+	{ u64 x71 = (x70 >> 0x1a);
+	{ u32 x72 = ((u32)x70 & 0x3ffffff);
+	{ u64 x73 = (x71 + x55);
+	{ u64 x74 = (x73 >> 0x19);
+	{ u32 x75 = ((u32)x73 & 0x1ffffff);
+	{ u64 x76 = (x74 + x52);
+	{ u64 x77 = (x76 >> 0x1a);
+	{ u32 x78 = ((u32)x76 & 0x3ffffff);
+	{ u64 x79 = (x77 + x49);
+	{ u64 x80 = (x79 >> 0x19);
+	{ u32 x81 = ((u32)x79 & 0x1ffffff);
+	{ u64 x82 = (x80 + x46);
+	{ u64 x83 = (x82 >> 0x1a);
+	{ u32 x84 = ((u32)x82 & 0x3ffffff);
+	{ u64 x85 = (x83 + x43);
+	{ u64 x86 = (x85 >> 0x19);
+	{ u32 x87 = ((u32)x85 & 0x1ffffff);
+	{ u64 x88 = (x86 + x40);
+	{ u64 x89 = (x88 >> 0x1a);
+	{ u32 x90 = ((u32)x88 & 0x3ffffff);
+	{ u64 x91 = (x89 + x28);
+	{ u64 x92 = (x91 >> 0x19);
+	{ u32 x93 = ((u32)x91 & 0x1ffffff);
+	{ u64 x94 = (x66 + (0x13 * x92));
+	{ u32 x95 = (u32) (x94 >> 0x1a);
+	{ u32 x96 = ((u32)x94 & 0x3ffffff);
+	{ u32 x97 = (x95 + x69);
+	{ u32 x98 = (x97 >> 0x19);
+	{ u32 x99 = (x97 & 0x1ffffff);
+	out[0] = x96;
+	out[1] = x99;
+	out[2] = (x98 + x72);
+	out[3] = x75;
+	out[4] = x78;
+	out[5] = x81;
+	out[6] = x84;
+	out[7] = x87;
+	out[8] = x90;
+	out[9] = x93;
+	}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}
+}
+
+static __always_inline void fe_sq_tl(fe *h, const fe_loose *f)
+{
+	fe_sqr_impl(h->v, f->v);
+}
+
+static __always_inline void fe_sq_tt(fe *h, const fe *f)
+{
+	fe_sqr_impl(h->v, f->v);
+}
+
+static __always_inline void fe_loose_invert(fe *out, const fe_loose *z)
+{
+	fe t0;
+	fe t1;
+	fe t2;
+	fe t3;
+	int i;
+
+	fe_sq_tl(&t0, z);
+	fe_sq_tt(&t1, &t0);
+	for (i = 1; i < 2; ++i)
+		fe_sq_tt(&t1, &t1);
+	fe_mul_tlt(&t1, z, &t1);
+	fe_mul_ttt(&t0, &t0, &t1);
+	fe_sq_tt(&t2, &t0);
+	fe_mul_ttt(&t1, &t1, &t2);
+	fe_sq_tt(&t2, &t1);
+	for (i = 1; i < 5; ++i)
+		fe_sq_tt(&t2, &t2);
+	fe_mul_ttt(&t1, &t2, &t1);
+	fe_sq_tt(&t2, &t1);
+	for (i = 1; i < 10; ++i)
+		fe_sq_tt(&t2, &t2);
+	fe_mul_ttt(&t2, &t2, &t1);
+	fe_sq_tt(&t3, &t2);
+	for (i = 1; i < 20; ++i)
+		fe_sq_tt(&t3, &t3);
+	fe_mul_ttt(&t2, &t3, &t2);
+	fe_sq_tt(&t2, &t2);
+	for (i = 1; i < 10; ++i)
+		fe_sq_tt(&t2, &t2);
+	fe_mul_ttt(&t1, &t2, &t1);
+	fe_sq_tt(&t2, &t1);
+	for (i = 1; i < 50; ++i)
+		fe_sq_tt(&t2, &t2);
+	fe_mul_ttt(&t2, &t2, &t1);
+	fe_sq_tt(&t3, &t2);
+	for (i = 1; i < 100; ++i)
+		fe_sq_tt(&t3, &t3);
+	fe_mul_ttt(&t2, &t3, &t2);
+	fe_sq_tt(&t2, &t2);
+	for (i = 1; i < 50; ++i)
+		fe_sq_tt(&t2, &t2);
+	fe_mul_ttt(&t1, &t2, &t1);
+	fe_sq_tt(&t1, &t1);
+	for (i = 1; i < 5; ++i)
+		fe_sq_tt(&t1, &t1);
+	fe_mul_ttt(out, &t1, &t0);
+}
+
+static __always_inline void fe_invert(fe *out, const fe *z)
+{
+	fe_loose l;
+	fe_copy_lt(&l, z);
+	fe_loose_invert(out, &l);
+}
+
+/* Replace (f,g) with (g,f) if b == 1;
+ * replace (f,g) with (f,g) if b == 0.
+ *
+ * Preconditions: b in {0,1}
+ */
+static __always_inline void fe_cswap(fe *f, fe *g, unsigned int b)
+{
+	unsigned i;
+	b = 0 - b;
+	for (i = 0; i < 10; i++) {
+		u32 x = f->v[i] ^ g->v[i];
+		x &= b;
+		f->v[i] ^= x;
+		g->v[i] ^= x;
+	}
+}
+
+/* NOTE: based on fiat-crypto fe_mul, edited for in2=121666, 0, 0.*/
+static __always_inline void fe_mul_121666_impl(u32 out[10], const u32 in1[10])
+{
+	{ const u32 x20 = in1[9];
+	{ const u32 x21 = in1[8];
+	{ const u32 x19 = in1[7];
+	{ const u32 x17 = in1[6];
+	{ const u32 x15 = in1[5];
+	{ const u32 x13 = in1[4];
+	{ const u32 x11 = in1[3];
+	{ const u32 x9 = in1[2];
+	{ const u32 x7 = in1[1];
+	{ const u32 x5 = in1[0];
+	{ const u32 x38 = 0;
+	{ const u32 x39 = 0;
+	{ const u32 x37 = 0;
+	{ const u32 x35 = 0;
+	{ const u32 x33 = 0;
+	{ const u32 x31 = 0;
+	{ const u32 x29 = 0;
+	{ const u32 x27 = 0;
+	{ const u32 x25 = 0;
+	{ const u32 x23 = 121666;
+	{ u64 x40 = ((u64)x23 * x5);
+	{ u64 x41 = (((u64)x23 * x7) + ((u64)x25 * x5));
+	{ u64 x42 = ((((u64)(0x2 * x25) * x7) + ((u64)x23 * x9)) + ((u64)x27 * x5));
+	{ u64 x43 = (((((u64)x25 * x9) + ((u64)x27 * x7)) + ((u64)x23 * x11)) + ((u64)x29 * x5));
+	{ u64 x44 = (((((u64)x27 * x9) + (0x2 * (((u64)x25 * x11) + ((u64)x29 * x7)))) + ((u64)x23 * x13)) + ((u64)x31 * x5));
+	{ u64 x45 = (((((((u64)x27 * x11) + ((u64)x29 * x9)) + ((u64)x25 * x13)) + ((u64)x31 * x7)) + ((u64)x23 * x15)) + ((u64)x33 * x5));
+	{ u64 x46 = (((((0x2 * ((((u64)x29 * x11) + ((u64)x25 * x15)) + ((u64)x33 * x7))) + ((u64)x27 * x13)) + ((u64)x31 * x9)) + ((u64)x23 * x17)) + ((u64)x35 * x5));
+	{ u64 x47 = (((((((((u64)x29 * x13) + ((u64)x31 * x11)) + ((u64)x27 * x15)) + ((u64)x33 * x9)) + ((u64)x25 * x17)) + ((u64)x35 * x7)) + ((u64)x23 * x19)) + ((u64)x37 * x5));
+	{ u64 x48 = (((((((u64)x31 * x13) + (0x2 * (((((u64)x29 * x15) + ((u64)x33 * x11)) + ((u64)x25 * x19)) + ((u64)x37 * x7)))) + ((u64)x27 * x17)) + ((u64)x35 * x9)) + ((u64)x23 * x21)) + ((u64)x39 * x5));
+	{ u64 x49 = (((((((((((u64)x31 * x15) + ((u64)x33 * x13)) + ((u64)x29 * x17)) + ((u64)x35 * x11)) + ((u64)x27 * x19)) + ((u64)x37 * x9)) + ((u64)x25 * x21)) + ((u64)x39 * x7)) + ((u64)x23 * x20)) + ((u64)x38 * x5));
+	{ u64 x50 = (((((0x2 * ((((((u64)x33 * x15) + ((u64)x29 * x19)) + ((u64)x37 * x11)) + ((u64)x25 * x20)) + ((u64)x38 * x7))) + ((u64)x31 * x17)) + ((u64)x35 * x13)) + ((u64)x27 * x21)) + ((u64)x39 * x9));
+	{ u64 x51 = (((((((((u64)x33 * x17) + ((u64)x35 * x15)) + ((u64)x31 * x19)) + ((u64)x37 * x13)) + ((u64)x29 * x21)) + ((u64)x39 * x11)) + ((u64)x27 * x20)) + ((u64)x38 * x9));
+	{ u64 x52 = (((((u64)x35 * x17) + (0x2 * (((((u64)x33 * x19) + ((u64)x37 * x15)) + ((u64)x29 * x20)) + ((u64)x38 * x11)))) + ((u64)x31 * x21)) + ((u64)x39 * x13));
+	{ u64 x53 = (((((((u64)x35 * x19) + ((u64)x37 * x17)) + ((u64)x33 * x21)) + ((u64)x39 * x15)) + ((u64)x31 * x20)) + ((u64)x38 * x13));
+	{ u64 x54 = (((0x2 * ((((u64)x37 * x19) + ((u64)x33 * x20)) + ((u64)x38 * x15))) + ((u64)x35 * x21)) + ((u64)x39 * x17));
+	{ u64 x55 = (((((u64)x37 * x21) + ((u64)x39 * x19)) + ((u64)x35 * x20)) + ((u64)x38 * x17));
+	{ u64 x56 = (((u64)x39 * x21) + (0x2 * (((u64)x37 * x20) + ((u64)x38 * x19))));
+	{ u64 x57 = (((u64)x39 * x20) + ((u64)x38 * x21));
+	{ u64 x58 = ((u64)(0x2 * x38) * x20);
+	{ u64 x59 = (x48 + (x58 << 0x4));
+	{ u64 x60 = (x59 + (x58 << 0x1));
+	{ u64 x61 = (x60 + x58);
+	{ u64 x62 = (x47 + (x57 << 0x4));
+	{ u64 x63 = (x62 + (x57 << 0x1));
+	{ u64 x64 = (x63 + x57);
+	{ u64 x65 = (x46 + (x56 << 0x4));
+	{ u64 x66 = (x65 + (x56 << 0x1));
+	{ u64 x67 = (x66 + x56);
+	{ u64 x68 = (x45 + (x55 << 0x4));
+	{ u64 x69 = (x68 + (x55 << 0x1));
+	{ u64 x70 = (x69 + x55);
+	{ u64 x71 = (x44 + (x54 << 0x4));
+	{ u64 x72 = (x71 + (x54 << 0x1));
+	{ u64 x73 = (x72 + x54);
+	{ u64 x74 = (x43 + (x53 << 0x4));
+	{ u64 x75 = (x74 + (x53 << 0x1));
+	{ u64 x76 = (x75 + x53);
+	{ u64 x77 = (x42 + (x52 << 0x4));
+	{ u64 x78 = (x77 + (x52 << 0x1));
+	{ u64 x79 = (x78 + x52);
+	{ u64 x80 = (x41 + (x51 << 0x4));
+	{ u64 x81 = (x80 + (x51 << 0x1));
+	{ u64 x82 = (x81 + x51);
+	{ u64 x83 = (x40 + (x50 << 0x4));
+	{ u64 x84 = (x83 + (x50 << 0x1));
+	{ u64 x85 = (x84 + x50);
+	{ u64 x86 = (x85 >> 0x1a);
+	{ u32 x87 = ((u32)x85 & 0x3ffffff);
+	{ u64 x88 = (x86 + x82);
+	{ u64 x89 = (x88 >> 0x19);
+	{ u32 x90 = ((u32)x88 & 0x1ffffff);
+	{ u64 x91 = (x89 + x79);
+	{ u64 x92 = (x91 >> 0x1a);
+	{ u32 x93 = ((u32)x91 & 0x3ffffff);
+	{ u64 x94 = (x92 + x76);
+	{ u64 x95 = (x94 >> 0x19);
+	{ u32 x96 = ((u32)x94 & 0x1ffffff);
+	{ u64 x97 = (x95 + x73);
+	{ u64 x98 = (x97 >> 0x1a);
+	{ u32 x99 = ((u32)x97 & 0x3ffffff);
+	{ u64 x100 = (x98 + x70);
+	{ u64 x101 = (x100 >> 0x19);
+	{ u32 x102 = ((u32)x100 & 0x1ffffff);
+	{ u64 x103 = (x101 + x67);
+	{ u64 x104 = (x103 >> 0x1a);
+	{ u32 x105 = ((u32)x103 & 0x3ffffff);
+	{ u64 x106 = (x104 + x64);
+	{ u64 x107 = (x106 >> 0x19);
+	{ u32 x108 = ((u32)x106 & 0x1ffffff);
+	{ u64 x109 = (x107 + x61);
+	{ u64 x110 = (x109 >> 0x1a);
+	{ u32 x111 = ((u32)x109 & 0x3ffffff);
+	{ u64 x112 = (x110 + x49);
+	{ u64 x113 = (x112 >> 0x19);
+	{ u32 x114 = ((u32)x112 & 0x1ffffff);
+	{ u64 x115 = (x87 + (0x13 * x113));
+	{ u32 x116 = (u32) (x115 >> 0x1a);
+	{ u32 x117 = ((u32)x115 & 0x3ffffff);
+	{ u32 x118 = (x116 + x90);
+	{ u32 x119 = (x118 >> 0x19);
+	{ u32 x120 = (x118 & 0x1ffffff);
+	out[0] = x117;
+	out[1] = x120;
+	out[2] = (x119 + x93);
+	out[3] = x96;
+	out[4] = x99;
+	out[5] = x102;
+	out[6] = x105;
+	out[7] = x108;
+	out[8] = x111;
+	out[9] = x114;
+	}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}
+}
+
+static __always_inline void fe_mul121666(fe *h, const fe_loose *f)
+{
+	fe_mul_121666_impl(h->v, f->v);
+}
+
+void curve25519_generic(u8 out[CURVE25519_KEY_SIZE],
+			const u8 scalar[CURVE25519_KEY_SIZE],
+			const u8 point[CURVE25519_KEY_SIZE])
+{
+	fe x1, x2, z2, x3, z3;
+	fe_loose x2l, z2l, x3l;
+	unsigned swap = 0;
+	int pos;
+	u8 e[32];
+
+	memcpy(e, scalar, 32);
+	curve25519_clamp_secret(e);
+
+	/* The following implementation was transcribed to Coq and proven to
+	 * correspond to unary scalar multiplication in affine coordinates given
+	 * that x1 != 0 is the x coordinate of some point on the curve. It was
+	 * also checked in Coq that doing a ladderstep with x1 = x3 = 0 gives
+	 * z2' = z3' = 0, and z2 = z3 = 0 gives z2' = z3' = 0. The statement was
+	 * quantified over the underlying field, so it applies to Curve25519
+	 * itself and the quadratic twist of Curve25519. It was not proven in
+	 * Coq that prime-field arithmetic correctly simulates extension-field
+	 * arithmetic on prime-field values. The decoding of the byte array
+	 * representation of e was not considered.
+	 *
+	 * Specification of Montgomery curves in affine coordinates:
+	 * <https://github.com/mit-plv/fiat-crypto/blob/2456d821825521f7e03e65882cc3521795b0320f/src/Spec/MontgomeryCurve.v#L27>
+	 *
+	 * Proof that these form a group that is isomorphic to a Weierstrass
+	 * curve:
+	 * <https://github.com/mit-plv/fiat-crypto/blob/2456d821825521f7e03e65882cc3521795b0320f/src/Curves/Montgomery/AffineProofs.v#L35>
+	 *
+	 * Coq transcription and correctness proof of the loop
+	 * (where scalarbits=255):
+	 * <https://github.com/mit-plv/fiat-crypto/blob/2456d821825521f7e03e65882cc3521795b0320f/src/Curves/Montgomery/XZ.v#L118>
+	 * <https://github.com/mit-plv/fiat-crypto/blob/2456d821825521f7e03e65882cc3521795b0320f/src/Curves/Montgomery/XZProofs.v#L278>
+	 * preconditions: 0 <= e < 2^255 (not necessarily e < order),
+	 * fe_invert(0) = 0
+	 */
+	fe_frombytes(&x1, point);
+	fe_1(&x2);
+	fe_0(&z2);
+	fe_copy(&x3, &x1);
+	fe_1(&z3);
+
+	for (pos = 254; pos >= 0; --pos) {
+		fe tmp0, tmp1;
+		fe_loose tmp0l, tmp1l;
+		/* loop invariant as of right before the test, for the case
+		 * where x1 != 0:
+		 *   pos >= -1; if z2 = 0 then x2 is nonzero; if z3 = 0 then x3
+		 *   is nonzero
+		 *   let r := e >> (pos+1) in the following equalities of
+		 *   projective points:
+		 *   to_xz (r*P)     === if swap then (x3, z3) else (x2, z2)
+		 *   to_xz ((r+1)*P) === if swap then (x2, z2) else (x3, z3)
+		 *   x1 is the nonzero x coordinate of the nonzero
+		 *   point (r*P-(r+1)*P)
+		 */
+		unsigned b = 1 & (e[pos / 8] >> (pos & 7));
+		swap ^= b;
+		fe_cswap(&x2, &x3, swap);
+		fe_cswap(&z2, &z3, swap);
+		swap = b;
+		/* Coq transcription of ladderstep formula (called from
+		 * transcribed loop):
+		 * <https://github.com/mit-plv/fiat-crypto/blob/2456d821825521f7e03e65882cc3521795b0320f/src/Curves/Montgomery/XZ.v#L89>
+		 * <https://github.com/mit-plv/fiat-crypto/blob/2456d821825521f7e03e65882cc3521795b0320f/src/Curves/Montgomery/XZProofs.v#L131>
+		 * x1 != 0 <https://github.com/mit-plv/fiat-crypto/blob/2456d821825521f7e03e65882cc3521795b0320f/src/Curves/Montgomery/XZProofs.v#L217>
+		 * x1  = 0 <https://github.com/mit-plv/fiat-crypto/blob/2456d821825521f7e03e65882cc3521795b0320f/src/Curves/Montgomery/XZProofs.v#L147>
+		 */
+		fe_sub(&tmp0l, &x3, &z3);
+		fe_sub(&tmp1l, &x2, &z2);
+		fe_add(&x2l, &x2, &z2);
+		fe_add(&z2l, &x3, &z3);
+		fe_mul_tll(&z3, &tmp0l, &x2l);
+		fe_mul_tll(&z2, &z2l, &tmp1l);
+		fe_sq_tl(&tmp0, &tmp1l);
+		fe_sq_tl(&tmp1, &x2l);
+		fe_add(&x3l, &z3, &z2);
+		fe_sub(&z2l, &z3, &z2);
+		fe_mul_ttt(&x2, &tmp1, &tmp0);
+		fe_sub(&tmp1l, &tmp1, &tmp0);
+		fe_sq_tl(&z2, &z2l);
+		fe_mul121666(&z3, &tmp1l);
+		fe_sq_tl(&x3, &x3l);
+		fe_add(&tmp0l, &tmp0, &z3);
+		fe_mul_ttt(&z3, &x1, &z2);
+		fe_mul_tll(&z2, &tmp1l, &tmp0l);
+	}
+	/* here pos=-1, so r=e, so to_xz (e*P) === if swap then (x3, z3)
+	 * else (x2, z2)
+	 */
+	fe_cswap(&x2, &x3, swap);
+	fe_cswap(&z2, &z3, swap);
+
+	fe_invert(&z2, &z2);
+	fe_mul_ttt(&x2, &x2, &z2);
+	fe_tobytes(out, &x2);
+
+	memzero_explicit(&x1, sizeof(x1));
+	memzero_explicit(&x2, sizeof(x2));
+	memzero_explicit(&z2, sizeof(z2));
+	memzero_explicit(&x3, sizeof(x3));
+	memzero_explicit(&z3, sizeof(z3));
+	memzero_explicit(&x2l, sizeof(x2l));
+	memzero_explicit(&z2l, sizeof(z2l));
+	memzero_explicit(&x3l, sizeof(x3l));
+	memzero_explicit(&e, sizeof(e));
+}
diff --git a/lib/crypto/curve25519-hacl64.c b/lib/crypto/curve25519-hacl64.c
new file mode 100644
index 000000000000..771d82dc5f14
--- /dev/null
+++ b/lib/crypto/curve25519-hacl64.c
@@ -0,0 +1,788 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/*
+ * Copyright (C) 2016-2017 INRIA and Microsoft Corporation.
+ * Copyright (C) 2018-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ *
+ * This is a machine-generated formally verified implementation of Curve25519
+ * ECDH from: <https://github.com/mitls/hacl-star>. Though originally machine
+ * generated, it has been tweaked to be suitable for use in the kernel. It is
+ * optimized for 64-bit machines that can efficiently work with 128-bit
+ * integer types.
+ */
+
+#include <asm/unaligned.h>
+#include <crypto/curve25519.h>
+#include <linux/string.h>
+
+typedef __uint128_t u128;
+
+static __always_inline u64 u64_eq_mask(u64 a, u64 b)
+{
+	u64 x = a ^ b;
+	u64 minus_x = ~x + (u64)1U;
+	u64 x_or_minus_x = x | minus_x;
+	u64 xnx = x_or_minus_x >> (u32)63U;
+	u64 c = xnx - (u64)1U;
+	return c;
+}
+
+static __always_inline u64 u64_gte_mask(u64 a, u64 b)
+{
+	u64 x = a;
+	u64 y = b;
+	u64 x_xor_y = x ^ y;
+	u64 x_sub_y = x - y;
+	u64 x_sub_y_xor_y = x_sub_y ^ y;
+	u64 q = x_xor_y | x_sub_y_xor_y;
+	u64 x_xor_q = x ^ q;
+	u64 x_xor_q_ = x_xor_q >> (u32)63U;
+	u64 c = x_xor_q_ - (u64)1U;
+	return c;
+}
+
+static __always_inline void modulo_carry_top(u64 *b)
+{
+	u64 b4 = b[4];
+	u64 b0 = b[0];
+	u64 b4_ = b4 & 0x7ffffffffffffLLU;
+	u64 b0_ = b0 + 19 * (b4 >> 51);
+	b[4] = b4_;
+	b[0] = b0_;
+}
+
+static __always_inline void fproduct_copy_from_wide_(u64 *output, u128 *input)
+{
+	{
+		u128 xi = input[0];
+		output[0] = ((u64)(xi));
+	}
+	{
+		u128 xi = input[1];
+		output[1] = ((u64)(xi));
+	}
+	{
+		u128 xi = input[2];
+		output[2] = ((u64)(xi));
+	}
+	{
+		u128 xi = input[3];
+		output[3] = ((u64)(xi));
+	}
+	{
+		u128 xi = input[4];
+		output[4] = ((u64)(xi));
+	}
+}
+
+static __always_inline void
+fproduct_sum_scalar_multiplication_(u128 *output, u64 *input, u64 s)
+{
+	output[0] += (u128)input[0] * s;
+	output[1] += (u128)input[1] * s;
+	output[2] += (u128)input[2] * s;
+	output[3] += (u128)input[3] * s;
+	output[4] += (u128)input[4] * s;
+}
+
+static __always_inline void fproduct_carry_wide_(u128 *tmp)
+{
+	{
+		u32 ctr = 0;
+		u128 tctr = tmp[ctr];
+		u128 tctrp1 = tmp[ctr + 1];
+		u64 r0 = ((u64)(tctr)) & 0x7ffffffffffffLLU;
+		u128 c = ((tctr) >> (51));
+		tmp[ctr] = ((u128)(r0));
+		tmp[ctr + 1] = ((tctrp1) + (c));
+	}
+	{
+		u32 ctr = 1;
+		u128 tctr = tmp[ctr];
+		u128 tctrp1 = tmp[ctr + 1];
+		u64 r0 = ((u64)(tctr)) & 0x7ffffffffffffLLU;
+		u128 c = ((tctr) >> (51));
+		tmp[ctr] = ((u128)(r0));
+		tmp[ctr + 1] = ((tctrp1) + (c));
+	}
+
+	{
+		u32 ctr = 2;
+		u128 tctr = tmp[ctr];
+		u128 tctrp1 = tmp[ctr + 1];
+		u64 r0 = ((u64)(tctr)) & 0x7ffffffffffffLLU;
+		u128 c = ((tctr) >> (51));
+		tmp[ctr] = ((u128)(r0));
+		tmp[ctr + 1] = ((tctrp1) + (c));
+	}
+	{
+		u32 ctr = 3;
+		u128 tctr = tmp[ctr];
+		u128 tctrp1 = tmp[ctr + 1];
+		u64 r0 = ((u64)(tctr)) & 0x7ffffffffffffLLU;
+		u128 c = ((tctr) >> (51));
+		tmp[ctr] = ((u128)(r0));
+		tmp[ctr + 1] = ((tctrp1) + (c));
+	}
+}
+
+static __always_inline void fmul_shift_reduce(u64 *output)
+{
+	u64 tmp = output[4];
+	u64 b0;
+	{
+		u32 ctr = 5 - 0 - 1;
+		u64 z = output[ctr - 1];
+		output[ctr] = z;
+	}
+	{
+		u32 ctr = 5 - 1 - 1;
+		u64 z = output[ctr - 1];
+		output[ctr] = z;
+	}
+	{
+		u32 ctr = 5 - 2 - 1;
+		u64 z = output[ctr - 1];
+		output[ctr] = z;
+	}
+	{
+		u32 ctr = 5 - 3 - 1;
+		u64 z = output[ctr - 1];
+		output[ctr] = z;
+	}
+	output[0] = tmp;
+	b0 = output[0];
+	output[0] = 19 * b0;
+}
+
+static __always_inline void fmul_mul_shift_reduce_(u128 *output, u64 *input,
+						   u64 *input21)
+{
+	u32 i;
+	u64 input2i;
+	{
+		u64 input2i = input21[0];
+		fproduct_sum_scalar_multiplication_(output, input, input2i);
+		fmul_shift_reduce(input);
+	}
+	{
+		u64 input2i = input21[1];
+		fproduct_sum_scalar_multiplication_(output, input, input2i);
+		fmul_shift_reduce(input);
+	}
+	{
+		u64 input2i = input21[2];
+		fproduct_sum_scalar_multiplication_(output, input, input2i);
+		fmul_shift_reduce(input);
+	}
+	{
+		u64 input2i = input21[3];
+		fproduct_sum_scalar_multiplication_(output, input, input2i);
+		fmul_shift_reduce(input);
+	}
+	i = 4;
+	input2i = input21[i];
+	fproduct_sum_scalar_multiplication_(output, input, input2i);
+}
+
+static __always_inline void fmul_fmul(u64 *output, u64 *input, u64 *input21)
+{
+	u64 tmp[5] = { input[0], input[1], input[2], input[3], input[4] };
+	{
+		u128 b4;
+		u128 b0;
+		u128 b4_;
+		u128 b0_;
+		u64 i0;
+		u64 i1;
+		u64 i0_;
+		u64 i1_;
+		u128 t[5] = { 0 };
+		fmul_mul_shift_reduce_(t, tmp, input21);
+		fproduct_carry_wide_(t);
+		b4 = t[4];
+		b0 = t[0];
+		b4_ = ((b4) & (((u128)(0x7ffffffffffffLLU))));
+		b0_ = ((b0) + (((u128)(19) * (((u64)(((b4) >> (51))))))));
+		t[4] = b4_;
+		t[0] = b0_;
+		fproduct_copy_from_wide_(output, t);
+		i0 = output[0];
+		i1 = output[1];
+		i0_ = i0 & 0x7ffffffffffffLLU;
+		i1_ = i1 + (i0 >> 51);
+		output[0] = i0_;
+		output[1] = i1_;
+	}
+}
+
+static __always_inline void fsquare_fsquare__(u128 *tmp, u64 *output)
+{
+	u64 r0 = output[0];
+	u64 r1 = output[1];
+	u64 r2 = output[2];
+	u64 r3 = output[3];
+	u64 r4 = output[4];
+	u64 d0 = r0 * 2;
+	u64 d1 = r1 * 2;
+	u64 d2 = r2 * 2 * 19;
+	u64 d419 = r4 * 19;
+	u64 d4 = d419 * 2;
+	u128 s0 = ((((((u128)(r0) * (r0))) + (((u128)(d4) * (r1))))) +
+		   (((u128)(d2) * (r3))));
+	u128 s1 = ((((((u128)(d0) * (r1))) + (((u128)(d4) * (r2))))) +
+		   (((u128)(r3 * 19) * (r3))));
+	u128 s2 = ((((((u128)(d0) * (r2))) + (((u128)(r1) * (r1))))) +
+		   (((u128)(d4) * (r3))));
+	u128 s3 = ((((((u128)(d0) * (r3))) + (((u128)(d1) * (r2))))) +
+		   (((u128)(r4) * (d419))));
+	u128 s4 = ((((((u128)(d0) * (r4))) + (((u128)(d1) * (r3))))) +
+		   (((u128)(r2) * (r2))));
+	tmp[0] = s0;
+	tmp[1] = s1;
+	tmp[2] = s2;
+	tmp[3] = s3;
+	tmp[4] = s4;
+}
+
+static __always_inline void fsquare_fsquare_(u128 *tmp, u64 *output)
+{
+	u128 b4;
+	u128 b0;
+	u128 b4_;
+	u128 b0_;
+	u64 i0;
+	u64 i1;
+	u64 i0_;
+	u64 i1_;
+	fsquare_fsquare__(tmp, output);
+	fproduct_carry_wide_(tmp);
+	b4 = tmp[4];
+	b0 = tmp[0];
+	b4_ = ((b4) & (((u128)(0x7ffffffffffffLLU))));
+	b0_ = ((b0) + (((u128)(19) * (((u64)(((b4) >> (51))))))));
+	tmp[4] = b4_;
+	tmp[0] = b0_;
+	fproduct_copy_from_wide_(output, tmp);
+	i0 = output[0];
+	i1 = output[1];
+	i0_ = i0 & 0x7ffffffffffffLLU;
+	i1_ = i1 + (i0 >> 51);
+	output[0] = i0_;
+	output[1] = i1_;
+}
+
+static __always_inline void fsquare_fsquare_times_(u64 *output, u128 *tmp,
+						   u32 count1)
+{
+	u32 i;
+	fsquare_fsquare_(tmp, output);
+	for (i = 1; i < count1; ++i)
+		fsquare_fsquare_(tmp, output);
+}
+
+static __always_inline void fsquare_fsquare_times(u64 *output, u64 *input,
+						  u32 count1)
+{
+	u128 t[5];
+	memcpy(output, input, 5 * sizeof(*input));
+	fsquare_fsquare_times_(output, t, count1);
+}
+
+static __always_inline void fsquare_fsquare_times_inplace(u64 *output,
+							  u32 count1)
+{
+	u128 t[5];
+	fsquare_fsquare_times_(output, t, count1);
+}
+
+static __always_inline void crecip_crecip(u64 *out, u64 *z)
+{
+	u64 buf[20] = { 0 };
+	u64 *a0 = buf;
+	u64 *t00 = buf + 5;
+	u64 *b0 = buf + 10;
+	u64 *t01;
+	u64 *b1;
+	u64 *c0;
+	u64 *a;
+	u64 *t0;
+	u64 *b;
+	u64 *c;
+	fsquare_fsquare_times(a0, z, 1);
+	fsquare_fsquare_times(t00, a0, 2);
+	fmul_fmul(b0, t00, z);
+	fmul_fmul(a0, b0, a0);
+	fsquare_fsquare_times(t00, a0, 1);
+	fmul_fmul(b0, t00, b0);
+	fsquare_fsquare_times(t00, b0, 5);
+	t01 = buf + 5;
+	b1 = buf + 10;
+	c0 = buf + 15;
+	fmul_fmul(b1, t01, b1);
+	fsquare_fsquare_times(t01, b1, 10);
+	fmul_fmul(c0, t01, b1);
+	fsquare_fsquare_times(t01, c0, 20);
+	fmul_fmul(t01, t01, c0);
+	fsquare_fsquare_times_inplace(t01, 10);
+	fmul_fmul(b1, t01, b1);
+	fsquare_fsquare_times(t01, b1, 50);
+	a = buf;
+	t0 = buf + 5;
+	b = buf + 10;
+	c = buf + 15;
+	fmul_fmul(c, t0, b);
+	fsquare_fsquare_times(t0, c, 100);
+	fmul_fmul(t0, t0, c);
+	fsquare_fsquare_times_inplace(t0, 50);
+	fmul_fmul(t0, t0, b);
+	fsquare_fsquare_times_inplace(t0, 5);
+	fmul_fmul(out, t0, a);
+}
+
+static __always_inline void fsum(u64 *a, u64 *b)
+{
+	a[0] += b[0];
+	a[1] += b[1];
+	a[2] += b[2];
+	a[3] += b[3];
+	a[4] += b[4];
+}
+
+static __always_inline void fdifference(u64 *a, u64 *b)
+{
+	u64 tmp[5] = { 0 };
+	u64 b0;
+	u64 b1;
+	u64 b2;
+	u64 b3;
+	u64 b4;
+	memcpy(tmp, b, 5 * sizeof(*b));
+	b0 = tmp[0];
+	b1 = tmp[1];
+	b2 = tmp[2];
+	b3 = tmp[3];
+	b4 = tmp[4];
+	tmp[0] = b0 + 0x3fffffffffff68LLU;
+	tmp[1] = b1 + 0x3ffffffffffff8LLU;
+	tmp[2] = b2 + 0x3ffffffffffff8LLU;
+	tmp[3] = b3 + 0x3ffffffffffff8LLU;
+	tmp[4] = b4 + 0x3ffffffffffff8LLU;
+	{
+		u64 xi = a[0];
+		u64 yi = tmp[0];
+		a[0] = yi - xi;
+	}
+	{
+		u64 xi = a[1];
+		u64 yi = tmp[1];
+		a[1] = yi - xi;
+	}
+	{
+		u64 xi = a[2];
+		u64 yi = tmp[2];
+		a[2] = yi - xi;
+	}
+	{
+		u64 xi = a[3];
+		u64 yi = tmp[3];
+		a[3] = yi - xi;
+	}
+	{
+		u64 xi = a[4];
+		u64 yi = tmp[4];
+		a[4] = yi - xi;
+	}
+}
+
+static __always_inline void fscalar(u64 *output, u64 *b, u64 s)
+{
+	u128 tmp[5];
+	u128 b4;
+	u128 b0;
+	u128 b4_;
+	u128 b0_;
+	{
+		u64 xi = b[0];
+		tmp[0] = ((u128)(xi) * (s));
+	}
+	{
+		u64 xi = b[1];
+		tmp[1] = ((u128)(xi) * (s));
+	}
+	{
+		u64 xi = b[2];
+		tmp[2] = ((u128)(xi) * (s));
+	}
+	{
+		u64 xi = b[3];
+		tmp[3] = ((u128)(xi) * (s));
+	}
+	{
+		u64 xi = b[4];
+		tmp[4] = ((u128)(xi) * (s));
+	}
+	fproduct_carry_wide_(tmp);
+	b4 = tmp[4];
+	b0 = tmp[0];
+	b4_ = ((b4) & (((u128)(0x7ffffffffffffLLU))));
+	b0_ = ((b0) + (((u128)(19) * (((u64)(((b4) >> (51))))))));
+	tmp[4] = b4_;
+	tmp[0] = b0_;
+	fproduct_copy_from_wide_(output, tmp);
+}
+
+static __always_inline void fmul(u64 *output, u64 *a, u64 *b)
+{
+	fmul_fmul(output, a, b);
+}
+
+static __always_inline void crecip(u64 *output, u64 *input)
+{
+	crecip_crecip(output, input);
+}
+
+static __always_inline void point_swap_conditional_step(u64 *a, u64 *b,
+							u64 swap1, u32 ctr)
+{
+	u32 i = ctr - 1;
+	u64 ai = a[i];
+	u64 bi = b[i];
+	u64 x = swap1 & (ai ^ bi);
+	u64 ai1 = ai ^ x;
+	u64 bi1 = bi ^ x;
+	a[i] = ai1;
+	b[i] = bi1;
+}
+
+static __always_inline void point_swap_conditional5(u64 *a, u64 *b, u64 swap1)
+{
+	point_swap_conditional_step(a, b, swap1, 5);
+	point_swap_conditional_step(a, b, swap1, 4);
+	point_swap_conditional_step(a, b, swap1, 3);
+	point_swap_conditional_step(a, b, swap1, 2);
+	point_swap_conditional_step(a, b, swap1, 1);
+}
+
+static __always_inline void point_swap_conditional(u64 *a, u64 *b, u64 iswap)
+{
+	u64 swap1 = 0 - iswap;
+	point_swap_conditional5(a, b, swap1);
+	point_swap_conditional5(a + 5, b + 5, swap1);
+}
+
+static __always_inline void point_copy(u64 *output, u64 *input)
+{
+	memcpy(output, input, 5 * sizeof(*input));
+	memcpy(output + 5, input + 5, 5 * sizeof(*input));
+}
+
+static __always_inline void addanddouble_fmonty(u64 *pp, u64 *ppq, u64 *p,
+						u64 *pq, u64 *qmqp)
+{
+	u64 *qx = qmqp;
+	u64 *x2 = pp;
+	u64 *z2 = pp + 5;
+	u64 *x3 = ppq;
+	u64 *z3 = ppq + 5;
+	u64 *x = p;
+	u64 *z = p + 5;
+	u64 *xprime = pq;
+	u64 *zprime = pq + 5;
+	u64 buf[40] = { 0 };
+	u64 *origx = buf;
+	u64 *origxprime0 = buf + 5;
+	u64 *xxprime0;
+	u64 *zzprime0;
+	u64 *origxprime;
+	xxprime0 = buf + 25;
+	zzprime0 = buf + 30;
+	memcpy(origx, x, 5 * sizeof(*x));
+	fsum(x, z);
+	fdifference(z, origx);
+	memcpy(origxprime0, xprime, 5 * sizeof(*xprime));
+	fsum(xprime, zprime);
+	fdifference(zprime, origxprime0);
+	fmul(xxprime0, xprime, z);
+	fmul(zzprime0, x, zprime);
+	origxprime = buf + 5;
+	{
+		u64 *xx0;
+		u64 *zz0;
+		u64 *xxprime;
+		u64 *zzprime;
+		u64 *zzzprime;
+		xx0 = buf + 15;
+		zz0 = buf + 20;
+		xxprime = buf + 25;
+		zzprime = buf + 30;
+		zzzprime = buf + 35;
+		memcpy(origxprime, xxprime, 5 * sizeof(*xxprime));
+		fsum(xxprime, zzprime);
+		fdifference(zzprime, origxprime);
+		fsquare_fsquare_times(x3, xxprime, 1);
+		fsquare_fsquare_times(zzzprime, zzprime, 1);
+		fmul(z3, zzzprime, qx);
+		fsquare_fsquare_times(xx0, x, 1);
+		fsquare_fsquare_times(zz0, z, 1);
+		{
+			u64 *zzz;
+			u64 *xx;
+			u64 *zz;
+			u64 scalar;
+			zzz = buf + 10;
+			xx = buf + 15;
+			zz = buf + 20;
+			fmul(x2, xx, zz);
+			fdifference(zz, xx);
+			scalar = 121665;
+			fscalar(zzz, zz, scalar);
+			fsum(zzz, xx);
+			fmul(z2, zzz, zz);
+		}
+	}
+}
+
+static __always_inline void
+ladder_smallloop_cmult_small_loop_step(u64 *nq, u64 *nqpq, u64 *nq2, u64 *nqpq2,
+				       u64 *q, u8 byt)
+{
+	u64 bit0 = (u64)(byt >> 7);
+	u64 bit;
+	point_swap_conditional(nq, nqpq, bit0);
+	addanddouble_fmonty(nq2, nqpq2, nq, nqpq, q);
+	bit = (u64)(byt >> 7);
+	point_swap_conditional(nq2, nqpq2, bit);
+}
+
+static __always_inline void
+ladder_smallloop_cmult_small_loop_double_step(u64 *nq, u64 *nqpq, u64 *nq2,
+					      u64 *nqpq2, u64 *q, u8 byt)
+{
+	u8 byt1;
+	ladder_smallloop_cmult_small_loop_step(nq, nqpq, nq2, nqpq2, q, byt);
+	byt1 = byt << 1;
+	ladder_smallloop_cmult_small_loop_step(nq2, nqpq2, nq, nqpq, q, byt1);
+}
+
+static __always_inline void
+ladder_smallloop_cmult_small_loop(u64 *nq, u64 *nqpq, u64 *nq2, u64 *nqpq2,
+				  u64 *q, u8 byt, u32 i)
+{
+	while (i--) {
+		ladder_smallloop_cmult_small_loop_double_step(nq, nqpq, nq2,
+							      nqpq2, q, byt);
+		byt <<= 2;
+	}
+}
+
+static __always_inline void ladder_bigloop_cmult_big_loop(u8 *n1, u64 *nq,
+							  u64 *nqpq, u64 *nq2,
+							  u64 *nqpq2, u64 *q,
+							  u32 i)
+{
+	while (i--) {
+		u8 byte = n1[i];
+		ladder_smallloop_cmult_small_loop(nq, nqpq, nq2, nqpq2, q,
+						  byte, 4);
+	}
+}
+
+static void ladder_cmult(u64 *result, u8 *n1, u64 *q)
+{
+	u64 point_buf[40] = { 0 };
+	u64 *nq = point_buf;
+	u64 *nqpq = point_buf + 10;
+	u64 *nq2 = point_buf + 20;
+	u64 *nqpq2 = point_buf + 30;
+	point_copy(nqpq, q);
+	nq[0] = 1;
+	ladder_bigloop_cmult_big_loop(n1, nq, nqpq, nq2, nqpq2, q, 32);
+	point_copy(result, nq);
+}
+
+static __always_inline void format_fexpand(u64 *output, const u8 *input)
+{
+	const u8 *x00 = input + 6;
+	const u8 *x01 = input + 12;
+	const u8 *x02 = input + 19;
+	const u8 *x0 = input + 24;
+	u64 i0, i1, i2, i3, i4, output0, output1, output2, output3, output4;
+	i0 = get_unaligned_le64(input);
+	i1 = get_unaligned_le64(x00);
+	i2 = get_unaligned_le64(x01);
+	i3 = get_unaligned_le64(x02);
+	i4 = get_unaligned_le64(x0);
+	output0 = i0 & 0x7ffffffffffffLLU;
+	output1 = i1 >> 3 & 0x7ffffffffffffLLU;
+	output2 = i2 >> 6 & 0x7ffffffffffffLLU;
+	output3 = i3 >> 1 & 0x7ffffffffffffLLU;
+	output4 = i4 >> 12 & 0x7ffffffffffffLLU;
+	output[0] = output0;
+	output[1] = output1;
+	output[2] = output2;
+	output[3] = output3;
+	output[4] = output4;
+}
+
+static __always_inline void format_fcontract_first_carry_pass(u64 *input)
+{
+	u64 t0 = input[0];
+	u64 t1 = input[1];
+	u64 t2 = input[2];
+	u64 t3 = input[3];
+	u64 t4 = input[4];
+	u64 t1_ = t1 + (t0 >> 51);
+	u64 t0_ = t0 & 0x7ffffffffffffLLU;
+	u64 t2_ = t2 + (t1_ >> 51);
+	u64 t1__ = t1_ & 0x7ffffffffffffLLU;
+	u64 t3_ = t3 + (t2_ >> 51);
+	u64 t2__ = t2_ & 0x7ffffffffffffLLU;
+	u64 t4_ = t4 + (t3_ >> 51);
+	u64 t3__ = t3_ & 0x7ffffffffffffLLU;
+	input[0] = t0_;
+	input[1] = t1__;
+	input[2] = t2__;
+	input[3] = t3__;
+	input[4] = t4_;
+}
+
+static __always_inline void format_fcontract_first_carry_full(u64 *input)
+{
+	format_fcontract_first_carry_pass(input);
+	modulo_carry_top(input);
+}
+
+static __always_inline void format_fcontract_second_carry_pass(u64 *input)
+{
+	u64 t0 = input[0];
+	u64 t1 = input[1];
+	u64 t2 = input[2];
+	u64 t3 = input[3];
+	u64 t4 = input[4];
+	u64 t1_ = t1 + (t0 >> 51);
+	u64 t0_ = t0 & 0x7ffffffffffffLLU;
+	u64 t2_ = t2 + (t1_ >> 51);
+	u64 t1__ = t1_ & 0x7ffffffffffffLLU;
+	u64 t3_ = t3 + (t2_ >> 51);
+	u64 t2__ = t2_ & 0x7ffffffffffffLLU;
+	u64 t4_ = t4 + (t3_ >> 51);
+	u64 t3__ = t3_ & 0x7ffffffffffffLLU;
+	input[0] = t0_;
+	input[1] = t1__;
+	input[2] = t2__;
+	input[3] = t3__;
+	input[4] = t4_;
+}
+
+static __always_inline void format_fcontract_second_carry_full(u64 *input)
+{
+	u64 i0;
+	u64 i1;
+	u64 i0_;
+	u64 i1_;
+	format_fcontract_second_carry_pass(input);
+	modulo_carry_top(input);
+	i0 = input[0];
+	i1 = input[1];
+	i0_ = i0 & 0x7ffffffffffffLLU;
+	i1_ = i1 + (i0 >> 51);
+	input[0] = i0_;
+	input[1] = i1_;
+}
+
+static __always_inline void format_fcontract_trim(u64 *input)
+{
+	u64 a0 = input[0];
+	u64 a1 = input[1];
+	u64 a2 = input[2];
+	u64 a3 = input[3];
+	u64 a4 = input[4];
+	u64 mask0 = u64_gte_mask(a0, 0x7ffffffffffedLLU);
+	u64 mask1 = u64_eq_mask(a1, 0x7ffffffffffffLLU);
+	u64 mask2 = u64_eq_mask(a2, 0x7ffffffffffffLLU);
+	u64 mask3 = u64_eq_mask(a3, 0x7ffffffffffffLLU);
+	u64 mask4 = u64_eq_mask(a4, 0x7ffffffffffffLLU);
+	u64 mask = (((mask0 & mask1) & mask2) & mask3) & mask4;
+	u64 a0_ = a0 - (0x7ffffffffffedLLU & mask);
+	u64 a1_ = a1 - (0x7ffffffffffffLLU & mask);
+	u64 a2_ = a2 - (0x7ffffffffffffLLU & mask);
+	u64 a3_ = a3 - (0x7ffffffffffffLLU & mask);
+	u64 a4_ = a4 - (0x7ffffffffffffLLU & mask);
+	input[0] = a0_;
+	input[1] = a1_;
+	input[2] = a2_;
+	input[3] = a3_;
+	input[4] = a4_;
+}
+
+static __always_inline void format_fcontract_store(u8 *output, u64 *input)
+{
+	u64 t0 = input[0];
+	u64 t1 = input[1];
+	u64 t2 = input[2];
+	u64 t3 = input[3];
+	u64 t4 = input[4];
+	u64 o0 = t1 << 51 | t0;
+	u64 o1 = t2 << 38 | t1 >> 13;
+	u64 o2 = t3 << 25 | t2 >> 26;
+	u64 o3 = t4 << 12 | t3 >> 39;
+	u8 *b0 = output;
+	u8 *b1 = output + 8;
+	u8 *b2 = output + 16;
+	u8 *b3 = output + 24;
+	put_unaligned_le64(o0, b0);
+	put_unaligned_le64(o1, b1);
+	put_unaligned_le64(o2, b2);
+	put_unaligned_le64(o3, b3);
+}
+
+static __always_inline void format_fcontract(u8 *output, u64 *input)
+{
+	format_fcontract_first_carry_full(input);
+	format_fcontract_second_carry_full(input);
+	format_fcontract_trim(input);
+	format_fcontract_store(output, input);
+}
+
+static __always_inline void format_scalar_of_point(u8 *scalar, u64 *point)
+{
+	u64 *x = point;
+	u64 *z = point + 5;
+	u64 buf[10] __aligned(32) = { 0 };
+	u64 *zmone = buf;
+	u64 *sc = buf + 5;
+	crecip(zmone, z);
+	fmul(sc, x, zmone);
+	format_fcontract(scalar, sc);
+}
+
+void curve25519_generic(u8 mypublic[CURVE25519_KEY_SIZE],
+			const u8 secret[CURVE25519_KEY_SIZE],
+			const u8 basepoint[CURVE25519_KEY_SIZE])
+{
+	u64 buf0[10] __aligned(32) = { 0 };
+	u64 *x0 = buf0;
+	u64 *z = buf0 + 5;
+	u64 *q;
+	format_fexpand(x0, basepoint);
+	z[0] = 1;
+	q = buf0;
+	{
+		u8 e[32] __aligned(32) = { 0 };
+		u8 *scalar;
+		memcpy(e, secret, 32);
+		curve25519_clamp_secret(e);
+		scalar = e;
+		{
+			u64 buf[15] = { 0 };
+			u64 *nq = buf;
+			u64 *x = nq;
+			x[0] = 1;
+			ladder_cmult(nq, scalar, q);
+			format_scalar_of_point(mypublic, nq);
+			memzero_explicit(buf, sizeof(buf));
+		}
+		memzero_explicit(e, sizeof(e));
+	}
+	memzero_explicit(buf0, sizeof(buf0));
+}
diff --git a/lib/crypto/curve25519.c b/lib/crypto/curve25519.c
new file mode 100644
index 000000000000..0106bebe6900
--- /dev/null
+++ b/lib/crypto/curve25519.c
@@ -0,0 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/*
+ * Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ *
+ * This is an implementation of the Curve25519 ECDH algorithm, using either
+ * a 32-bit implementation or a 64-bit implementation with 128-bit integers,
+ * depending on what is supported by the target compiler.
+ *
+ * Information: https://cr.yp.to/ecdh.html
+ */
+
+#include <crypto/curve25519.h>
+#include <linux/module.h>
+#include <linux/init.h>
+
+const u8 curve25519_null_point[CURVE25519_KEY_SIZE] __aligned(32) = { 0 };
+const u8 curve25519_base_point[CURVE25519_KEY_SIZE] __aligned(32) = { 9 };
+
+EXPORT_SYMBOL(curve25519_null_point);
+EXPORT_SYMBOL(curve25519_base_point);
+EXPORT_SYMBOL(curve25519_generic);
+
+MODULE_LICENSE("GPL v2");
+MODULE_DESCRIPTION("Curve25519 scalar multiplication");
+MODULE_AUTHOR("Jason A. Donenfeld <Jason@zx2c4.com>");
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4 27/35] crypto: testmgr - implement testing for KPP failures
  2019-10-17 19:08 [PATCH v4 00/35] crypto: crypto API library interfaces for WireGuard Ard Biesheuvel
                   ` (25 preceding siblings ...)
  2019-10-17 19:09 ` [PATCH v4 26/35] crypto: Curve25519 - generic C library implementations Ard Biesheuvel
@ 2019-10-17 19:09 ` Ard Biesheuvel
  2019-10-17 19:09 ` [PATCH v4 28/35] crypto: curve25519 - add kpp selftest Ard Biesheuvel
                   ` (7 subsequent siblings)
  34 siblings, 0 replies; 50+ messages in thread
From: Ard Biesheuvel @ 2019-10-17 19:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: Ard Biesheuvel, Herbert Xu, David Miller, Jason A . Donenfeld,
	Samuel Neves, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Martin Willi, Rene van Dorst, David Sterba

The Curve25519 drivers we will be implementing will perform validation
of the inputs, and in order to ensure that these checks work as expected,
we need to be able to test for expected errors. So implement this in the
KPP test routines in the testmgr framework.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 crypto/testmgr.c | 14 ++++++++++----
 crypto/testmgr.h |  2 ++
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index 767fc5444771..4548559b2f0b 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -3499,8 +3499,11 @@ static int do_test_kpp(struct crypto_kpp *tfm, const struct kpp_testvec *vec,
 	/* Compute party A's public key */
 	err = crypto_wait_req(crypto_kpp_generate_public_key(req), &wait);
 	if (err) {
-		pr_err("alg: %s: Party A: generate public key test failed. err %d\n",
-		       alg, err);
+		if (err != vec->gen_pubkey_error)
+			pr_err("alg: %s: Party A: generate public key test failed. err %d (expected %d)\n",
+			       alg, err, vec->gen_pubkey_error);
+		else
+			err = 0;
 		goto free_output;
 	}
 
@@ -3537,8 +3540,11 @@ static int do_test_kpp(struct crypto_kpp *tfm, const struct kpp_testvec *vec,
 				 crypto_req_done, &wait);
 	err = crypto_wait_req(crypto_kpp_compute_shared_secret(req), &wait);
 	if (err) {
-		pr_err("alg: %s: Party A: compute shared secret test failed. err %d\n",
-		       alg, err);
+		if (err != vec->comp_ss_error)
+			pr_err("alg: %s: Party A: compute shared secret test failed. err %d (expected %d)\n",
+			       alg, err, vec->comp_ss_error);
+		else
+			err = 0;
 		goto free_all;
 	}
 
diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index b717afc0926e..c39be00a8125 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -166,6 +166,8 @@ struct kpp_testvec {
 	unsigned short expected_a_public_size;
 	unsigned short expected_ss_size;
 	bool genkey;
+	int gen_pubkey_error;
+	int comp_ss_error;
 };
 
 static const char zeroed_string[48];
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4 28/35] crypto: curve25519 - add kpp selftest
  2019-10-17 19:08 [PATCH v4 00/35] crypto: crypto API library interfaces for WireGuard Ard Biesheuvel
                   ` (26 preceding siblings ...)
  2019-10-17 19:09 ` [PATCH v4 27/35] crypto: testmgr - implement testing for KPP failures Ard Biesheuvel
@ 2019-10-17 19:09 ` Ard Biesheuvel
  2019-10-17 19:09 ` [PATCH v4 29/35] crypto: curve25519 - implement generic KPP driver Ard Biesheuvel
                   ` (6 subsequent siblings)
  34 siblings, 0 replies; 50+ messages in thread
From: Ard Biesheuvel @ 2019-10-17 19:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: Ard Biesheuvel, Herbert Xu, David Miller, Jason A . Donenfeld,
	Samuel Neves, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Martin Willi, Rene van Dorst, David Sterba

In preparation of introducing KPP implementations of Curve25519, import
the set of test cases proposed by the Zinc patch set, but converted to
the KPP format.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 crypto/testmgr.c |    6 +
 crypto/testmgr.h | 1603 ++++++++++++++++++++
 2 files changed, 1609 insertions(+)

diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index 4548559b2f0b..cc544346a8d7 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -4295,6 +4295,12 @@ static const struct alg_test_desc alg_test_descs[] = {
 		.alg = "cts(cbc(paes))",
 		.test = alg_test_null,
 		.fips_allowed = 1,
+	}, {
+		.alg = "curve25519",
+		.test = alg_test_kpp,
+		.suite = {
+			.kpp = __VECS(curve25519_tv_template)
+		}
 	}, {
 		.alg = "deflate",
 		.test = alg_test_comp,
diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index c39be00a8125..db17a29b6297 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -1032,6 +1032,1609 @@ static const struct kpp_testvec dh_tv_template[] = {
 	}
 };
 
+static const struct kpp_testvec curve25519_tv_template[] = {
+{
+	.secret = (u8[32]){
+		    0x77, 0x07, 0x6d, 0x0a, 0x73, 0x18, 0xa5, 0x7d,
+		    0x3c, 0x16, 0xc1, 0x72, 0x51, 0xb2, 0x66, 0x45,
+		    0xdf, 0x4c, 0x2f, 0x87, 0xeb, 0xc0, 0x99, 0x2a,
+		    0xb1, 0x77, 0xfb, 0xa5, 0x1d, 0xb9, 0x2c, 0x2a },
+	.b_public = (u8[32]){
+		    0xde, 0x9e, 0xdb, 0x7d, 0x7b, 0x7d, 0xc1, 0xb4,
+		    0xd3, 0x5b, 0x61, 0xc2, 0xec, 0xe4, 0x35, 0x37,
+		    0x3f, 0x83, 0x43, 0xc8, 0x5b, 0x78, 0x67, 0x4d,
+		    0xad, 0xfc, 0x7e, 0x14, 0x6f, 0x88, 0x2b, 0x4f },
+	.expected_ss = (u8[32]){
+		    0x4a, 0x5d, 0x9d, 0x5b, 0xa4, 0xce, 0x2d, 0xe1,
+		    0x72, 0x8e, 0x3b, 0xf4, 0x80, 0x35, 0x0f, 0x25,
+		    0xe0, 0x7e, 0x21, 0xc9, 0x47, 0xd1, 0x9e, 0x33,
+		    0x76, 0xf0, 0x9b, 0x3c, 0x1e, 0x16, 0x17, 0x42 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+{
+	.secret = (u8[32]){
+		    0x5d, 0xab, 0x08, 0x7e, 0x62, 0x4a, 0x8a, 0x4b,
+		    0x79, 0xe1, 0x7f, 0x8b, 0x83, 0x80, 0x0e, 0xe6,
+		    0x6f, 0x3b, 0xb1, 0x29, 0x26, 0x18, 0xb6, 0xfd,
+		    0x1c, 0x2f, 0x8b, 0x27, 0xff, 0x88, 0xe0, 0xeb },
+	.b_public = (u8[32]){
+		    0x85, 0x20, 0xf0, 0x09, 0x89, 0x30, 0xa7, 0x54,
+		    0x74, 0x8b, 0x7d, 0xdc, 0xb4, 0x3e, 0xf7, 0x5a,
+		    0x0d, 0xbf, 0x3a, 0x0d, 0x26, 0x38, 0x1a, 0xf4,
+		    0xeb, 0xa4, 0xa9, 0x8e, 0xaa, 0x9b, 0x4e, 0x6a },
+	.expected_ss = (u8[32]){
+		    0x4a, 0x5d, 0x9d, 0x5b, 0xa4, 0xce, 0x2d, 0xe1,
+		    0x72, 0x8e, 0x3b, 0xf4, 0x80, 0x35, 0x0f, 0x25,
+		    0xe0, 0x7e, 0x21, 0xc9, 0x47, 0xd1, 0x9e, 0x33,
+		    0x76, 0xf0, 0x9b, 0x3c, 0x1e, 0x16, 0x17, 0x42 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+{
+	.secret = (u8[32]){
+		    1 },
+	.b_public = (u8[32]){
+		    0x25, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+	.expected_ss = (u8[32]){
+		    0x3c, 0x77, 0x77, 0xca, 0xf9, 0x97, 0xb2, 0x64,
+		    0x41, 0x60, 0x77, 0x66, 0x5b, 0x4e, 0x22, 0x9d,
+		    0x0b, 0x95, 0x48, 0xdc, 0x0c, 0xd8, 0x19, 0x98,
+		    0xdd, 0xcd, 0xc5, 0xc8, 0x53, 0x3c, 0x79, 0x7f },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+{
+	.secret = (u8[32]){
+		    1 },
+	.b_public = (u8[32]){
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+	.expected_ss = (u8[32]){
+		    0xb3, 0x2d, 0x13, 0x62, 0xc2, 0x48, 0xd6, 0x2f,
+		    0xe6, 0x26, 0x19, 0xcf, 0xf0, 0x4d, 0xd4, 0x3d,
+		    0xb7, 0x3f, 0xfc, 0x1b, 0x63, 0x08, 0xed, 0xe3,
+		    0x0b, 0x78, 0xd8, 0x73, 0x80, 0xf1, 0xe8, 0x34 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+{
+	.secret = (u8[32]){
+		    0xa5, 0x46, 0xe3, 0x6b, 0xf0, 0x52, 0x7c, 0x9d,
+		    0x3b, 0x16, 0x15, 0x4b, 0x82, 0x46, 0x5e, 0xdd,
+		    0x62, 0x14, 0x4c, 0x0a, 0xc1, 0xfc, 0x5a, 0x18,
+		    0x50, 0x6a, 0x22, 0x44, 0xba, 0x44, 0x9a, 0xc4 },
+	.b_public = (u8[32]){
+		    0xe6, 0xdb, 0x68, 0x67, 0x58, 0x30, 0x30, 0xdb,
+		    0x35, 0x94, 0xc1, 0xa4, 0x24, 0xb1, 0x5f, 0x7c,
+		    0x72, 0x66, 0x24, 0xec, 0x26, 0xb3, 0x35, 0x3b,
+		    0x10, 0xa9, 0x03, 0xa6, 0xd0, 0xab, 0x1c, 0x4c },
+	.expected_ss = (u8[32]){
+		    0xc3, 0xda, 0x55, 0x37, 0x9d, 0xe9, 0xc6, 0x90,
+		    0x8e, 0x94, 0xea, 0x4d, 0xf2, 0x8d, 0x08, 0x4f,
+		    0x32, 0xec, 0xcf, 0x03, 0x49, 0x1c, 0x71, 0xf7,
+		    0x54, 0xb4, 0x07, 0x55, 0x77, 0xa2, 0x85, 0x52 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+{
+	.secret = (u8[32]){ 1, 2, 3, 4 },
+	.b_public = (u8[32]){ 0 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.comp_ss_error = -EKEYREJECTED,
+},
+{
+	.secret = (u8[32]){ 2, 4, 6, 8 },
+	.b_public = (u8[32]){
+		    0xe0, 0xeb, 0x7a, 0x7c, 0x3b, 0x41, 0xb8, 0xae,
+		    0x16, 0x56, 0xe3, 0xfa, 0xf1, 0x9f, 0xc4, 0x6a,
+		    0xda, 0x09, 0x8d, 0xeb, 0x9c, 0x32, 0xb1, 0xfd,
+		    0x86, 0x62, 0x05, 0x16, 0x5f, 0x49, 0xb8 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.comp_ss_error = -EKEYREJECTED,
+},
+{
+	.secret = (u8[32]){
+		    0xff, 0xff, 0xff, 0xff, 0x0a, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+	.b_public = (u8[32]){
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0x0a, 0x00, 0xfb, 0x9f },
+	.expected_ss = (u8[32]){
+		    0x77, 0x52, 0xb6, 0x18, 0xc1, 0x2d, 0x48, 0xd2,
+		    0xc6, 0x93, 0x46, 0x83, 0x81, 0x7c, 0xc6, 0x57,
+		    0xf3, 0x31, 0x03, 0x19, 0x49, 0x48, 0x20, 0x05,
+		    0x42, 0x2b, 0x4e, 0xae, 0x8d, 0x1d, 0x43, 0x23 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+{
+	.secret = (u8[32]){
+		    0x8e, 0x0a, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+	.b_public = (u8[32]){
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x8e, 0x06 },
+	.expected_ss = (u8[32]){
+		    0x5a, 0xdf, 0xaa, 0x25, 0x86, 0x8e, 0x32, 0x3d,
+		    0xae, 0x49, 0x62, 0xc1, 0x01, 0x5c, 0xb3, 0x12,
+		    0xe1, 0xc5, 0xc7, 0x9e, 0x95, 0x3f, 0x03, 0x99,
+		    0xb0, 0xba, 0x16, 0x22, 0xf3, 0xb6, 0xf7, 0x0c },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - normal case */
+{
+	.secret = (u8[32]){
+		    0x48, 0x52, 0x83, 0x4d, 0x9d, 0x6b, 0x77, 0xda,
+		    0xde, 0xab, 0xaa, 0xf2, 0xe1, 0x1d, 0xca, 0x66,
+		    0xd1, 0x9f, 0xe7, 0x49, 0x93, 0xa7, 0xbe, 0xc3,
+		    0x6c, 0x6e, 0x16, 0xa0, 0x98, 0x3f, 0xea, 0xba },
+	.b_public = (u8[32]){
+		    0x9c, 0x64, 0x7d, 0x9a, 0xe5, 0x89, 0xb9, 0xf5,
+		    0x8f, 0xdc, 0x3c, 0xa4, 0x94, 0x7e, 0xfb, 0xc9,
+		    0x15, 0xc4, 0xb2, 0xe0, 0x8e, 0x74, 0x4a, 0x0e,
+		    0xdf, 0x46, 0x9d, 0xac, 0x59, 0xc8, 0xf8, 0x5a },
+	.expected_ss = (u8[32]){
+		    0x87, 0xb7, 0xf2, 0x12, 0xb6, 0x27, 0xf7, 0xa5,
+		    0x4c, 0xa5, 0xe0, 0xbc, 0xda, 0xdd, 0xd5, 0x38,
+		    0x9d, 0x9d, 0xe6, 0x15, 0x6c, 0xdb, 0xcf, 0x8e,
+		    0xbe, 0x14, 0xff, 0xbc, 0xfb, 0x43, 0x65, 0x51 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - public key on twist */
+{
+	.secret = (u8[32]){
+		    0x58, 0x8c, 0x06, 0x1a, 0x50, 0x80, 0x4a, 0xc4,
+		    0x88, 0xad, 0x77, 0x4a, 0xc7, 0x16, 0xc3, 0xf5,
+		    0xba, 0x71, 0x4b, 0x27, 0x12, 0xe0, 0x48, 0x49,
+		    0x13, 0x79, 0xa5, 0x00, 0x21, 0x19, 0x98, 0xa8 },
+	.b_public = (u8[32]){
+		    0x63, 0xaa, 0x40, 0xc6, 0xe3, 0x83, 0x46, 0xc5,
+		    0xca, 0xf2, 0x3a, 0x6d, 0xf0, 0xa5, 0xe6, 0xc8,
+		    0x08, 0x89, 0xa0, 0x86, 0x47, 0xe5, 0x51, 0xb3,
+		    0x56, 0x34, 0x49, 0xbe, 0xfc, 0xfc, 0x97, 0x33 },
+	.expected_ss = (u8[32]){
+		    0xb1, 0xa7, 0x07, 0x51, 0x94, 0x95, 0xff, 0xff,
+		    0xb2, 0x98, 0xff, 0x94, 0x17, 0x16, 0xb0, 0x6d,
+		    0xfa, 0xb8, 0x7c, 0xf8, 0xd9, 0x11, 0x23, 0xfe,
+		    0x2b, 0xe9, 0xa2, 0x33, 0xdd, 0xa2, 0x22, 0x12 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - public key on twist */
+{
+	.secret = (u8[32]){
+		    0xb0, 0x5b, 0xfd, 0x32, 0xe5, 0x53, 0x25, 0xd9,
+		    0xfd, 0x64, 0x8c, 0xb3, 0x02, 0x84, 0x80, 0x39,
+		    0x00, 0x0b, 0x39, 0x0e, 0x44, 0xd5, 0x21, 0xe5,
+		    0x8a, 0xab, 0x3b, 0x29, 0xa6, 0x96, 0x0b, 0xa8 },
+	.b_public = (u8[32]){
+		    0x0f, 0x83, 0xc3, 0x6f, 0xde, 0xd9, 0xd3, 0x2f,
+		    0xad, 0xf4, 0xef, 0xa3, 0xae, 0x93, 0xa9, 0x0b,
+		    0xb5, 0xcf, 0xa6, 0x68, 0x93, 0xbc, 0x41, 0x2c,
+		    0x43, 0xfa, 0x72, 0x87, 0xdb, 0xb9, 0x97, 0x79 },
+	.expected_ss = (u8[32]){
+		    0x67, 0xdd, 0x4a, 0x6e, 0x16, 0x55, 0x33, 0x53,
+		    0x4c, 0x0e, 0x3f, 0x17, 0x2e, 0x4a, 0xb8, 0x57,
+		    0x6b, 0xca, 0x92, 0x3a, 0x5f, 0x07, 0xb2, 0xc0,
+		    0x69, 0xb4, 0xc3, 0x10, 0xff, 0x2e, 0x93, 0x5b },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - public key on twist */
+{
+	.secret = (u8[32]){
+		    0x70, 0xe3, 0x4b, 0xcb, 0xe1, 0xf4, 0x7f, 0xbc,
+		    0x0f, 0xdd, 0xfd, 0x7c, 0x1e, 0x1a, 0xa5, 0x3d,
+		    0x57, 0xbf, 0xe0, 0xf6, 0x6d, 0x24, 0x30, 0x67,
+		    0xb4, 0x24, 0xbb, 0x62, 0x10, 0xbe, 0xd1, 0x9c },
+	.b_public = (u8[32]){
+		    0x0b, 0x82, 0x11, 0xa2, 0xb6, 0x04, 0x90, 0x97,
+		    0xf6, 0x87, 0x1c, 0x6c, 0x05, 0x2d, 0x3c, 0x5f,
+		    0xc1, 0xba, 0x17, 0xda, 0x9e, 0x32, 0xae, 0x45,
+		    0x84, 0x03, 0xb0, 0x5b, 0xb2, 0x83, 0x09, 0x2a },
+	.expected_ss = (u8[32]){
+		    0x4a, 0x06, 0x38, 0xcf, 0xaa, 0x9e, 0xf1, 0x93,
+		    0x3b, 0x47, 0xf8, 0x93, 0x92, 0x96, 0xa6, 0xb2,
+		    0x5b, 0xe5, 0x41, 0xef, 0x7f, 0x70, 0xe8, 0x44,
+		    0xc0, 0xbc, 0xc0, 0x0b, 0x13, 0x4d, 0xe6, 0x4a },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - public key on twist */
+{
+	.secret = (u8[32]){
+		    0x68, 0xc1, 0xf3, 0xa6, 0x53, 0xa4, 0xcd, 0xb1,
+		    0xd3, 0x7b, 0xba, 0x94, 0x73, 0x8f, 0x8b, 0x95,
+		    0x7a, 0x57, 0xbe, 0xb2, 0x4d, 0x64, 0x6e, 0x99,
+		    0x4d, 0xc2, 0x9a, 0x27, 0x6a, 0xad, 0x45, 0x8d },
+	.b_public = (u8[32]){
+		    0x34, 0x3a, 0xc2, 0x0a, 0x3b, 0x9c, 0x6a, 0x27,
+		    0xb1, 0x00, 0x81, 0x76, 0x50, 0x9a, 0xd3, 0x07,
+		    0x35, 0x85, 0x6e, 0xc1, 0xc8, 0xd8, 0xfc, 0xae,
+		    0x13, 0x91, 0x2d, 0x08, 0xd1, 0x52, 0xf4, 0x6c },
+	.expected_ss = (u8[32]){
+		    0x39, 0x94, 0x91, 0xfc, 0xe8, 0xdf, 0xab, 0x73,
+		    0xb4, 0xf9, 0xf6, 0x11, 0xde, 0x8e, 0xa0, 0xb2,
+		    0x7b, 0x28, 0xf8, 0x59, 0x94, 0x25, 0x0b, 0x0f,
+		    0x47, 0x5d, 0x58, 0x5d, 0x04, 0x2a, 0xc2, 0x07 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - public key on twist */
+{
+	.secret = (u8[32]){
+		    0xd8, 0x77, 0xb2, 0x6d, 0x06, 0xdf, 0xf9, 0xd9,
+		    0xf7, 0xfd, 0x4c, 0x5b, 0x37, 0x69, 0xf8, 0xcd,
+		    0xd5, 0xb3, 0x05, 0x16, 0xa5, 0xab, 0x80, 0x6b,
+		    0xe3, 0x24, 0xff, 0x3e, 0xb6, 0x9e, 0xa0, 0xb2 },
+	.b_public = (u8[32]){
+		    0xfa, 0x69, 0x5f, 0xc7, 0xbe, 0x8d, 0x1b, 0xe5,
+		    0xbf, 0x70, 0x48, 0x98, 0xf3, 0x88, 0xc4, 0x52,
+		    0xba, 0xfd, 0xd3, 0xb8, 0xea, 0xe8, 0x05, 0xf8,
+		    0x68, 0x1a, 0x8d, 0x15, 0xc2, 0xd4, 0xe1, 0x42 },
+	.expected_ss = (u8[32]){
+		    0x2c, 0x4f, 0xe1, 0x1d, 0x49, 0x0a, 0x53, 0x86,
+		    0x17, 0x76, 0xb1, 0x3b, 0x43, 0x54, 0xab, 0xd4,
+		    0xcf, 0x5a, 0x97, 0x69, 0x9d, 0xb6, 0xe6, 0xc6,
+		    0x8c, 0x16, 0x26, 0xd0, 0x76, 0x62, 0xf7, 0x58 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - public key = 0 */
+{
+	.secret = (u8[32]){
+		    0x20, 0x74, 0x94, 0x03, 0x8f, 0x2b, 0xb8, 0x11,
+		    0xd4, 0x78, 0x05, 0xbc, 0xdf, 0x04, 0xa2, 0xac,
+		    0x58, 0x5a, 0xda, 0x7f, 0x2f, 0x23, 0x38, 0x9b,
+		    0xfd, 0x46, 0x58, 0xf9, 0xdd, 0xd4, 0xde, 0xbc },
+	.b_public = (u8[32]){
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.comp_ss_error = -EKEYREJECTED,
+},
+/* wycheproof - public key = 1 */
+{
+	.secret = (u8[32]){
+		    0x20, 0x2e, 0x89, 0x72, 0xb6, 0x1c, 0x7e, 0x61,
+		    0x93, 0x0e, 0xb9, 0x45, 0x0b, 0x50, 0x70, 0xea,
+		    0xe1, 0xc6, 0x70, 0x47, 0x56, 0x85, 0x54, 0x1f,
+		    0x04, 0x76, 0x21, 0x7e, 0x48, 0x18, 0xcf, 0xab },
+	.b_public = (u8[32]){
+		    0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+	.expected_ss = (u8[32]){
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+	.comp_ss_error = -EKEYREJECTED,
+},
+/* wycheproof - edge case on twist */
+{
+	.secret = (u8[32]){
+		    0x38, 0xdd, 0xe9, 0xf3, 0xe7, 0xb7, 0x99, 0x04,
+		    0x5f, 0x9a, 0xc3, 0x79, 0x3d, 0x4a, 0x92, 0x77,
+		    0xda, 0xde, 0xad, 0xc4, 0x1b, 0xec, 0x02, 0x90,
+		    0xf8, 0x1f, 0x74, 0x4f, 0x73, 0x77, 0x5f, 0x84 },
+	.b_public = (u8[32]){
+		    0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+	.expected_ss = (u8[32]){
+		    0x9a, 0x2c, 0xfe, 0x84, 0xff, 0x9c, 0x4a, 0x97,
+		    0x39, 0x62, 0x5c, 0xae, 0x4a, 0x3b, 0x82, 0xa9,
+		    0x06, 0x87, 0x7a, 0x44, 0x19, 0x46, 0xf8, 0xd7,
+		    0xb3, 0xd7, 0x95, 0xfe, 0x8f, 0x5d, 0x16, 0x39 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - edge case on twist */
+{
+	.secret = (u8[32]){
+		    0x98, 0x57, 0xa9, 0x14, 0xe3, 0xc2, 0x90, 0x36,
+		    0xfd, 0x9a, 0x44, 0x2b, 0xa5, 0x26, 0xb5, 0xcd,
+		    0xcd, 0xf2, 0x82, 0x16, 0x15, 0x3e, 0x63, 0x6c,
+		    0x10, 0x67, 0x7a, 0xca, 0xb6, 0xbd, 0x6a, 0xa5 },
+	.b_public = (u8[32]){
+		    0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+	.expected_ss = (u8[32]){
+		    0x4d, 0xa4, 0xe0, 0xaa, 0x07, 0x2c, 0x23, 0x2e,
+		    0xe2, 0xf0, 0xfa, 0x4e, 0x51, 0x9a, 0xe5, 0x0b,
+		    0x52, 0xc1, 0xed, 0xd0, 0x8a, 0x53, 0x4d, 0x4e,
+		    0xf3, 0x46, 0xc2, 0xe1, 0x06, 0xd2, 0x1d, 0x60 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - edge case on twist */
+{
+	.secret = (u8[32]){
+		    0x48, 0xe2, 0x13, 0x0d, 0x72, 0x33, 0x05, 0xed,
+		    0x05, 0xe6, 0xe5, 0x89, 0x4d, 0x39, 0x8a, 0x5e,
+		    0x33, 0x36, 0x7a, 0x8c, 0x6a, 0xac, 0x8f, 0xcd,
+		    0xf0, 0xa8, 0x8e, 0x4b, 0x42, 0x82, 0x0d, 0xb7 },
+	.b_public = (u8[32]){
+		    0xff, 0xff, 0xff, 0x03, 0x00, 0x00, 0xf8, 0xff,
+		    0xff, 0x1f, 0x00, 0x00, 0xc0, 0xff, 0xff, 0xff,
+		    0x00, 0x00, 0x00, 0xfe, 0xff, 0xff, 0x07, 0x00,
+		    0x00, 0xf0, 0xff, 0xff, 0x3f, 0x00, 0x00, 0x00 },
+	.expected_ss = (u8[32]){
+		    0x9e, 0xd1, 0x0c, 0x53, 0x74, 0x7f, 0x64, 0x7f,
+		    0x82, 0xf4, 0x51, 0x25, 0xd3, 0xde, 0x15, 0xa1,
+		    0xe6, 0xb8, 0x24, 0x49, 0x6a, 0xb4, 0x04, 0x10,
+		    0xff, 0xcc, 0x3c, 0xfe, 0x95, 0x76, 0x0f, 0x3b },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - edge case on twist */
+{
+	.secret = (u8[32]){
+		    0x28, 0xf4, 0x10, 0x11, 0x69, 0x18, 0x51, 0xb3,
+		    0xa6, 0x2b, 0x64, 0x15, 0x53, 0xb3, 0x0d, 0x0d,
+		    0xfd, 0xdc, 0xb8, 0xff, 0xfc, 0xf5, 0x37, 0x00,
+		    0xa7, 0xbe, 0x2f, 0x6a, 0x87, 0x2e, 0x9f, 0xb0 },
+	.b_public = (u8[32]){
+		    0x00, 0x00, 0x00, 0xfc, 0xff, 0xff, 0x07, 0x00,
+		    0x00, 0xe0, 0xff, 0xff, 0x3f, 0x00, 0x00, 0x00,
+		    0xff, 0xff, 0xff, 0x01, 0x00, 0x00, 0xf8, 0xff,
+		    0xff, 0x0f, 0x00, 0x00, 0xc0, 0xff, 0xff, 0x7f },
+	.expected_ss = (u8[32]){
+		    0xcf, 0x72, 0xb4, 0xaa, 0x6a, 0xa1, 0xc9, 0xf8,
+		    0x94, 0xf4, 0x16, 0x5b, 0x86, 0x10, 0x9a, 0xa4,
+		    0x68, 0x51, 0x76, 0x48, 0xe1, 0xf0, 0xcc, 0x70,
+		    0xe1, 0xab, 0x08, 0x46, 0x01, 0x76, 0x50, 0x6b },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - edge case on twist */
+{
+	.secret = (u8[32]){
+		    0x18, 0xa9, 0x3b, 0x64, 0x99, 0xb9, 0xf6, 0xb3,
+		    0x22, 0x5c, 0xa0, 0x2f, 0xef, 0x41, 0x0e, 0x0a,
+		    0xde, 0xc2, 0x35, 0x32, 0x32, 0x1d, 0x2d, 0x8e,
+		    0xf1, 0xa6, 0xd6, 0x02, 0xa8, 0xc6, 0x5b, 0x83 },
+	.b_public = (u8[32]){
+		    0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff,
+		    0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff,
+		    0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff,
+		    0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0x7f },
+	.expected_ss = (u8[32]){
+		    0x5d, 0x50, 0xb6, 0x28, 0x36, 0xbb, 0x69, 0x57,
+		    0x94, 0x10, 0x38, 0x6c, 0xf7, 0xbb, 0x81, 0x1c,
+		    0x14, 0xbf, 0x85, 0xb1, 0xc7, 0xb1, 0x7e, 0x59,
+		    0x24, 0xc7, 0xff, 0xea, 0x91, 0xef, 0x9e, 0x12 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - edge case on twist */
+{
+	.secret = (u8[32]){
+		    0xc0, 0x1d, 0x13, 0x05, 0xa1, 0x33, 0x8a, 0x1f,
+		    0xca, 0xc2, 0xba, 0x7e, 0x2e, 0x03, 0x2b, 0x42,
+		    0x7e, 0x0b, 0x04, 0x90, 0x31, 0x65, 0xac, 0xa9,
+		    0x57, 0xd8, 0xd0, 0x55, 0x3d, 0x87, 0x17, 0xb0 },
+	.b_public = (u8[32]){
+		    0xea, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+	.expected_ss = (u8[32]){
+		    0x19, 0x23, 0x0e, 0xb1, 0x48, 0xd5, 0xd6, 0x7c,
+		    0x3c, 0x22, 0xab, 0x1d, 0xae, 0xff, 0x80, 0xa5,
+		    0x7e, 0xae, 0x42, 0x65, 0xce, 0x28, 0x72, 0x65,
+		    0x7b, 0x2c, 0x80, 0x99, 0xfc, 0x69, 0x8e, 0x50 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - edge case for public key */
+{
+	.secret = (u8[32]){
+		    0x38, 0x6f, 0x7f, 0x16, 0xc5, 0x07, 0x31, 0xd6,
+		    0x4f, 0x82, 0xe6, 0xa1, 0x70, 0xb1, 0x42, 0xa4,
+		    0xe3, 0x4f, 0x31, 0xfd, 0x77, 0x68, 0xfc, 0xb8,
+		    0x90, 0x29, 0x25, 0xe7, 0xd1, 0xe2, 0x1a, 0xbe },
+	.b_public = (u8[32]){
+		    0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+	.expected_ss = (u8[32]){
+		    0x0f, 0xca, 0xb5, 0xd8, 0x42, 0xa0, 0x78, 0xd7,
+		    0xa7, 0x1f, 0xc5, 0x9b, 0x57, 0xbf, 0xb4, 0xca,
+		    0x0b, 0xe6, 0x87, 0x3b, 0x49, 0xdc, 0xdb, 0x9f,
+		    0x44, 0xe1, 0x4a, 0xe8, 0xfb, 0xdf, 0xa5, 0x42 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - edge case for public key */
+{
+	.secret = (u8[32]){
+		    0xe0, 0x23, 0xa2, 0x89, 0xbd, 0x5e, 0x90, 0xfa,
+		    0x28, 0x04, 0xdd, 0xc0, 0x19, 0xa0, 0x5e, 0xf3,
+		    0xe7, 0x9d, 0x43, 0x4b, 0xb6, 0xea, 0x2f, 0x52,
+		    0x2e, 0xcb, 0x64, 0x3a, 0x75, 0x29, 0x6e, 0x95 },
+	.b_public = (u8[32]){
+		    0xff, 0xff, 0xff, 0xff, 0x00, 0x00, 0x00, 0x00,
+		    0xff, 0xff, 0xff, 0xff, 0x00, 0x00, 0x00, 0x00,
+		    0xff, 0xff, 0xff, 0xff, 0x00, 0x00, 0x00, 0x00,
+		    0xff, 0xff, 0xff, 0xff, 0x00, 0x00, 0x00, 0x00 },
+	.expected_ss = (u8[32]){
+		    0x54, 0xce, 0x8f, 0x22, 0x75, 0xc0, 0x77, 0xe3,
+		    0xb1, 0x30, 0x6a, 0x39, 0x39, 0xc5, 0xe0, 0x3e,
+		    0xef, 0x6b, 0xbb, 0x88, 0x06, 0x05, 0x44, 0x75,
+		    0x8d, 0x9f, 0xef, 0x59, 0xb0, 0xbc, 0x3e, 0x4f },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - edge case for public key */
+{
+	.secret = (u8[32]){
+		    0x68, 0xf0, 0x10, 0xd6, 0x2e, 0xe8, 0xd9, 0x26,
+		    0x05, 0x3a, 0x36, 0x1c, 0x3a, 0x75, 0xc6, 0xea,
+		    0x4e, 0xbd, 0xc8, 0x60, 0x6a, 0xb2, 0x85, 0x00,
+		    0x3a, 0x6f, 0x8f, 0x40, 0x76, 0xb0, 0x1e, 0x83 },
+	.b_public = (u8[32]){
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x03 },
+	.expected_ss = (u8[32]){
+		    0xf1, 0x36, 0x77, 0x5c, 0x5b, 0xeb, 0x0a, 0xf8,
+		    0x11, 0x0a, 0xf1, 0x0b, 0x20, 0x37, 0x23, 0x32,
+		    0x04, 0x3c, 0xab, 0x75, 0x24, 0x19, 0x67, 0x87,
+		    0x75, 0xa2, 0x23, 0xdf, 0x57, 0xc9, 0xd3, 0x0d },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - edge case for public key */
+{
+	.secret = (u8[32]){
+		    0x58, 0xeb, 0xcb, 0x35, 0xb0, 0xf8, 0x84, 0x5c,
+		    0xaf, 0x1e, 0xc6, 0x30, 0xf9, 0x65, 0x76, 0xb6,
+		    0x2c, 0x4b, 0x7b, 0x6c, 0x36, 0xb2, 0x9d, 0xeb,
+		    0x2c, 0xb0, 0x08, 0x46, 0x51, 0x75, 0x5c, 0x96 },
+	.b_public = (u8[32]){
+		    0xff, 0xff, 0xff, 0xfb, 0xff, 0xff, 0xfb, 0xff,
+		    0xff, 0xdf, 0xff, 0xff, 0xdf, 0xff, 0xff, 0xff,
+		    0xfe, 0xff, 0xff, 0xfe, 0xff, 0xff, 0xf7, 0xff,
+		    0xff, 0xf7, 0xff, 0xff, 0xbf, 0xff, 0xff, 0x3f },
+	.expected_ss = (u8[32]){
+		    0xbf, 0x9a, 0xff, 0xd0, 0x6b, 0x84, 0x40, 0x85,
+		    0x58, 0x64, 0x60, 0x96, 0x2e, 0xf2, 0x14, 0x6f,
+		    0xf3, 0xd4, 0x53, 0x3d, 0x94, 0x44, 0xaa, 0xb0,
+		    0x06, 0xeb, 0x88, 0xcc, 0x30, 0x54, 0x40, 0x7d },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - edge case for public key */
+{
+	.secret = (u8[32]){
+		    0x18, 0x8c, 0x4b, 0xc5, 0xb9, 0xc4, 0x4b, 0x38,
+		    0xbb, 0x65, 0x8b, 0x9b, 0x2a, 0xe8, 0x2d, 0x5b,
+		    0x01, 0x01, 0x5e, 0x09, 0x31, 0x84, 0xb1, 0x7c,
+		    0xb7, 0x86, 0x35, 0x03, 0xa7, 0x83, 0xe1, 0xbb },
+	.b_public = (u8[32]){
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x3f },
+	.expected_ss = (u8[32]){
+		    0xd4, 0x80, 0xde, 0x04, 0xf6, 0x99, 0xcb, 0x3b,
+		    0xe0, 0x68, 0x4a, 0x9c, 0xc2, 0xe3, 0x12, 0x81,
+		    0xea, 0x0b, 0xc5, 0xa9, 0xdc, 0xc1, 0x57, 0xd3,
+		    0xd2, 0x01, 0x58, 0xd4, 0x6c, 0xa5, 0x24, 0x6d },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - edge case for public key */
+{
+	.secret = (u8[32]){
+		    0xe0, 0x6c, 0x11, 0xbb, 0x2e, 0x13, 0xce, 0x3d,
+		    0xc7, 0x67, 0x3f, 0x67, 0xf5, 0x48, 0x22, 0x42,
+		    0x90, 0x94, 0x23, 0xa9, 0xae, 0x95, 0xee, 0x98,
+		    0x6a, 0x98, 0x8d, 0x98, 0xfa, 0xee, 0x23, 0xa2 },
+	.b_public = (u8[32]){
+		    0xff, 0xff, 0xff, 0xff, 0xfe, 0xff, 0xff, 0x7f,
+		    0xff, 0xff, 0xff, 0xff, 0xfe, 0xff, 0xff, 0x7f,
+		    0xff, 0xff, 0xff, 0xff, 0xfe, 0xff, 0xff, 0x7f,
+		    0xff, 0xff, 0xff, 0xff, 0xfe, 0xff, 0xff, 0x7f },
+	.expected_ss = (u8[32]){
+		    0x4c, 0x44, 0x01, 0xcc, 0xe6, 0xb5, 0x1e, 0x4c,
+		    0xb1, 0x8f, 0x27, 0x90, 0x24, 0x6c, 0x9b, 0xf9,
+		    0x14, 0xdb, 0x66, 0x77, 0x50, 0xa1, 0xcb, 0x89,
+		    0x06, 0x90, 0x92, 0xaf, 0x07, 0x29, 0x22, 0x76 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - edge case for public key */
+{
+	.secret = (u8[32]){
+		    0xc0, 0x65, 0x8c, 0x46, 0xdd, 0xe1, 0x81, 0x29,
+		    0x29, 0x38, 0x77, 0x53, 0x5b, 0x11, 0x62, 0xb6,
+		    0xf9, 0xf5, 0x41, 0x4a, 0x23, 0xcf, 0x4d, 0x2c,
+		    0xbc, 0x14, 0x0a, 0x4d, 0x99, 0xda, 0x2b, 0x8f },
+	.b_public = (u8[32]){
+		    0xeb, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+	.expected_ss = (u8[32]){
+		    0x57, 0x8b, 0xa8, 0xcc, 0x2d, 0xbd, 0xc5, 0x75,
+		    0xaf, 0xcf, 0x9d, 0xf2, 0xb3, 0xee, 0x61, 0x89,
+		    0xf5, 0x33, 0x7d, 0x68, 0x54, 0xc7, 0x9b, 0x4c,
+		    0xe1, 0x65, 0xea, 0x12, 0x29, 0x3b, 0x3a, 0x0f },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - public key with low order */
+{
+	.secret = (u8[32]){
+		    0x10, 0x25, 0x5c, 0x92, 0x30, 0xa9, 0x7a, 0x30,
+		    0xa4, 0x58, 0xca, 0x28, 0x4a, 0x62, 0x96, 0x69,
+		    0x29, 0x3a, 0x31, 0x89, 0x0c, 0xda, 0x9d, 0x14,
+		    0x7f, 0xeb, 0xc7, 0xd1, 0xe2, 0x2d, 0x6b, 0xb1 },
+	.b_public = (u8[32]){
+		    0xe0, 0xeb, 0x7a, 0x7c, 0x3b, 0x41, 0xb8, 0xae,
+		    0x16, 0x56, 0xe3, 0xfa, 0xf1, 0x9f, 0xc4, 0x6a,
+		    0xda, 0x09, 0x8d, 0xeb, 0x9c, 0x32, 0xb1, 0xfd,
+		    0x86, 0x62, 0x05, 0x16, 0x5f, 0x49, 0xb8, 0x00 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.comp_ss_error = -EKEYREJECTED,
+},
+/* wycheproof - public key with low order */
+{
+	.secret = (u8[32]){
+		    0x78, 0xf1, 0xe8, 0xed, 0xf1, 0x44, 0x81, 0xb3,
+		    0x89, 0x44, 0x8d, 0xac, 0x8f, 0x59, 0xc7, 0x0b,
+		    0x03, 0x8e, 0x7c, 0xf9, 0x2e, 0xf2, 0xc7, 0xef,
+		    0xf5, 0x7a, 0x72, 0x46, 0x6e, 0x11, 0x52, 0x96 },
+	.b_public = (u8[32]){
+		    0x5f, 0x9c, 0x95, 0xbc, 0xa3, 0x50, 0x8c, 0x24,
+		    0xb1, 0xd0, 0xb1, 0x55, 0x9c, 0x83, 0xef, 0x5b,
+		    0x04, 0x44, 0x5c, 0xc4, 0x58, 0x1c, 0x8e, 0x86,
+		    0xd8, 0x22, 0x4e, 0xdd, 0xd0, 0x9f, 0x11, 0x57 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.comp_ss_error = -EKEYREJECTED,
+},
+/* wycheproof - public key with low order */
+{
+	.secret = (u8[32]){
+		    0xa0, 0xa0, 0x5a, 0x3e, 0x8f, 0x9f, 0x44, 0x20,
+		    0x4d, 0x5f, 0x80, 0x59, 0xa9, 0x4a, 0xc7, 0xdf,
+		    0xc3, 0x9a, 0x49, 0xac, 0x01, 0x6d, 0xd7, 0x43,
+		    0xdb, 0xfa, 0x43, 0xc5, 0xd6, 0x71, 0xfd, 0x88 },
+	.b_public = (u8[32]){
+		    0xec, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.comp_ss_error = -EKEYREJECTED,
+},
+/* wycheproof - public key with low order */
+{
+	.secret = (u8[32]){
+		    0xd0, 0xdb, 0xb3, 0xed, 0x19, 0x06, 0x66, 0x3f,
+		    0x15, 0x42, 0x0a, 0xf3, 0x1f, 0x4e, 0xaf, 0x65,
+		    0x09, 0xd9, 0xa9, 0x94, 0x97, 0x23, 0x50, 0x06,
+		    0x05, 0xad, 0x7c, 0x1c, 0x6e, 0x74, 0x50, 0xa9 },
+	.b_public = (u8[32]){
+		    0xed, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.comp_ss_error = -EKEYREJECTED,
+},
+/* wycheproof - public key with low order */
+{
+	.secret = (u8[32]){
+		    0xc0, 0xb1, 0xd0, 0xeb, 0x22, 0xb2, 0x44, 0xfe,
+		    0x32, 0x91, 0x14, 0x00, 0x72, 0xcd, 0xd9, 0xd9,
+		    0x89, 0xb5, 0xf0, 0xec, 0xd9, 0x6c, 0x10, 0x0f,
+		    0xeb, 0x5b, 0xca, 0x24, 0x1c, 0x1d, 0x9f, 0x8f },
+	.b_public = (u8[32]){
+		    0xee, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.comp_ss_error = -EKEYREJECTED,
+},
+/* wycheproof - public key with low order */
+{
+	.secret = (u8[32]){
+		    0x48, 0x0b, 0xf4, 0x5f, 0x59, 0x49, 0x42, 0xa8,
+		    0xbc, 0x0f, 0x33, 0x53, 0xc6, 0xe8, 0xb8, 0x85,
+		    0x3d, 0x77, 0xf3, 0x51, 0xf1, 0xc2, 0xca, 0x6c,
+		    0x2d, 0x1a, 0xbf, 0x8a, 0x00, 0xb4, 0x22, 0x9c },
+	.b_public = (u8[32]){
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x80 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.comp_ss_error = -EKEYREJECTED,
+},
+/* wycheproof - public key with low order */
+{
+	.secret = (u8[32]){
+		    0x30, 0xf9, 0x93, 0xfc, 0xf8, 0x51, 0x4f, 0xc8,
+		    0x9b, 0xd8, 0xdb, 0x14, 0xcd, 0x43, 0xba, 0x0d,
+		    0x4b, 0x25, 0x30, 0xe7, 0x3c, 0x42, 0x76, 0xa0,
+		    0x5e, 0x1b, 0x14, 0x5d, 0x42, 0x0c, 0xed, 0xb4 },
+	.b_public = (u8[32]){
+		    0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x80 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.comp_ss_error = -EKEYREJECTED,
+},
+/* wycheproof - public key with low order */
+{
+	.secret = (u8[32]){
+		    0xc0, 0x49, 0x74, 0xb7, 0x58, 0x38, 0x0e, 0x2a,
+		    0x5b, 0x5d, 0xf6, 0xeb, 0x09, 0xbb, 0x2f, 0x6b,
+		    0x34, 0x34, 0xf9, 0x82, 0x72, 0x2a, 0x8e, 0x67,
+		    0x6d, 0x3d, 0xa2, 0x51, 0xd1, 0xb3, 0xde, 0x83 },
+	.b_public = (u8[32]){
+		    0xe0, 0xeb, 0x7a, 0x7c, 0x3b, 0x41, 0xb8, 0xae,
+		    0x16, 0x56, 0xe3, 0xfa, 0xf1, 0x9f, 0xc4, 0x6a,
+		    0xda, 0x09, 0x8d, 0xeb, 0x9c, 0x32, 0xb1, 0xfd,
+		    0x86, 0x62, 0x05, 0x16, 0x5f, 0x49, 0xb8, 0x80 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.comp_ss_error = -EKEYREJECTED,
+},
+/* wycheproof - public key with low order */
+{
+	.secret = (u8[32]){
+		    0x50, 0x2a, 0x31, 0x37, 0x3d, 0xb3, 0x24, 0x46,
+		    0x84, 0x2f, 0xe5, 0xad, 0xd3, 0xe0, 0x24, 0x02,
+		    0x2e, 0xa5, 0x4f, 0x27, 0x41, 0x82, 0xaf, 0xc3,
+		    0xd9, 0xf1, 0xbb, 0x3d, 0x39, 0x53, 0x4e, 0xb5 },
+	.b_public = (u8[32]){
+		    0x5f, 0x9c, 0x95, 0xbc, 0xa3, 0x50, 0x8c, 0x24,
+		    0xb1, 0xd0, 0xb1, 0x55, 0x9c, 0x83, 0xef, 0x5b,
+		    0x04, 0x44, 0x5c, 0xc4, 0x58, 0x1c, 0x8e, 0x86,
+		    0xd8, 0x22, 0x4e, 0xdd, 0xd0, 0x9f, 0x11, 0xd7 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.comp_ss_error = -EKEYREJECTED,
+},
+/* wycheproof - public key with low order */
+{
+	.secret = (u8[32]){
+		    0x90, 0xfa, 0x64, 0x17, 0xb0, 0xe3, 0x70, 0x30,
+		    0xfd, 0x6e, 0x43, 0xef, 0xf2, 0xab, 0xae, 0xf1,
+		    0x4c, 0x67, 0x93, 0x11, 0x7a, 0x03, 0x9c, 0xf6,
+		    0x21, 0x31, 0x8b, 0xa9, 0x0f, 0x4e, 0x98, 0xbe },
+	.b_public = (u8[32]){
+		    0xec, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.comp_ss_error = -EKEYREJECTED,
+},
+/* wycheproof - public key with low order */
+{
+	.secret = (u8[32]){
+		    0x78, 0xad, 0x3f, 0x26, 0x02, 0x7f, 0x1c, 0x9f,
+		    0xdd, 0x97, 0x5a, 0x16, 0x13, 0xb9, 0x47, 0x77,
+		    0x9b, 0xad, 0x2c, 0xf2, 0xb7, 0x41, 0xad, 0xe0,
+		    0x18, 0x40, 0x88, 0x5a, 0x30, 0xbb, 0x97, 0x9c },
+	.b_public = (u8[32]){
+		    0xed, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.comp_ss_error = -EKEYREJECTED,
+},
+/* wycheproof - public key with low order */
+{
+	.secret = (u8[32]){
+		    0x98, 0xe2, 0x3d, 0xe7, 0xb1, 0xe0, 0x92, 0x6e,
+		    0xd9, 0xc8, 0x7e, 0x7b, 0x14, 0xba, 0xf5, 0x5f,
+		    0x49, 0x7a, 0x1d, 0x70, 0x96, 0xf9, 0x39, 0x77,
+		    0x68, 0x0e, 0x44, 0xdc, 0x1c, 0x7b, 0x7b, 0x8b },
+	.b_public = (u8[32]){
+		    0xee, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.comp_ss_error = -EKEYREJECTED,
+},
+/* wycheproof - public key >= p */
+{
+	.secret = (u8[32]){
+		    0xf0, 0x1e, 0x48, 0xda, 0xfa, 0xc9, 0xd7, 0xbc,
+		    0xf5, 0x89, 0xcb, 0xc3, 0x82, 0xc8, 0x78, 0xd1,
+		    0x8b, 0xda, 0x35, 0x50, 0x58, 0x9f, 0xfb, 0x5d,
+		    0x50, 0xb5, 0x23, 0xbe, 0xbe, 0x32, 0x9d, 0xae },
+	.b_public = (u8[32]){
+		    0xef, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+	.expected_ss = (u8[32]){
+		    0xbd, 0x36, 0xa0, 0x79, 0x0e, 0xb8, 0x83, 0x09,
+		    0x8c, 0x98, 0x8b, 0x21, 0x78, 0x67, 0x73, 0xde,
+		    0x0b, 0x3a, 0x4d, 0xf1, 0x62, 0x28, 0x2c, 0xf1,
+		    0x10, 0xde, 0x18, 0xdd, 0x48, 0x4c, 0xe7, 0x4b },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - public key >= p */
+{
+	.secret = (u8[32]){
+		    0x28, 0x87, 0x96, 0xbc, 0x5a, 0xff, 0x4b, 0x81,
+		    0xa3, 0x75, 0x01, 0x75, 0x7b, 0xc0, 0x75, 0x3a,
+		    0x3c, 0x21, 0x96, 0x47, 0x90, 0xd3, 0x86, 0x99,
+		    0x30, 0x8d, 0xeb, 0xc1, 0x7a, 0x6e, 0xaf, 0x8d },
+	.b_public = (u8[32]){
+		    0xf0, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+	.expected_ss = (u8[32]){
+		    0xb4, 0xe0, 0xdd, 0x76, 0xda, 0x7b, 0x07, 0x17,
+		    0x28, 0xb6, 0x1f, 0x85, 0x67, 0x71, 0xaa, 0x35,
+		    0x6e, 0x57, 0xed, 0xa7, 0x8a, 0x5b, 0x16, 0x55,
+		    0xcc, 0x38, 0x20, 0xfb, 0x5f, 0x85, 0x4c, 0x5c },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - public key >= p */
+{
+	.secret = (u8[32]){
+		    0x98, 0xdf, 0x84, 0x5f, 0x66, 0x51, 0xbf, 0x11,
+		    0x38, 0x22, 0x1f, 0x11, 0x90, 0x41, 0xf7, 0x2b,
+		    0x6d, 0xbc, 0x3c, 0x4a, 0xce, 0x71, 0x43, 0xd9,
+		    0x9f, 0xd5, 0x5a, 0xd8, 0x67, 0x48, 0x0d, 0xa8 },
+	.b_public = (u8[32]){
+		    0xf1, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+	.expected_ss = (u8[32]){
+		    0x6f, 0xdf, 0x6c, 0x37, 0x61, 0x1d, 0xbd, 0x53,
+		    0x04, 0xdc, 0x0f, 0x2e, 0xb7, 0xc9, 0x51, 0x7e,
+		    0xb3, 0xc5, 0x0e, 0x12, 0xfd, 0x05, 0x0a, 0xc6,
+		    0xde, 0xc2, 0x70, 0x71, 0xd4, 0xbf, 0xc0, 0x34 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - public key >= p */
+{
+	.secret = (u8[32]){
+		    0xf0, 0x94, 0x98, 0xe4, 0x6f, 0x02, 0xf8, 0x78,
+		    0x82, 0x9e, 0x78, 0xb8, 0x03, 0xd3, 0x16, 0xa2,
+		    0xed, 0x69, 0x5d, 0x04, 0x98, 0xa0, 0x8a, 0xbd,
+		    0xf8, 0x27, 0x69, 0x30, 0xe2, 0x4e, 0xdc, 0xb0 },
+	.b_public = (u8[32]){
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+	.expected_ss = (u8[32]){
+		    0x4c, 0x8f, 0xc4, 0xb1, 0xc6, 0xab, 0x88, 0xfb,
+		    0x21, 0xf1, 0x8f, 0x6d, 0x4c, 0x81, 0x02, 0x40,
+		    0xd4, 0xe9, 0x46, 0x51, 0xba, 0x44, 0xf7, 0xa2,
+		    0xc8, 0x63, 0xce, 0xc7, 0xdc, 0x56, 0x60, 0x2d },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - public key >= p */
+{
+	.secret = (u8[32]){
+		    0x18, 0x13, 0xc1, 0x0a, 0x5c, 0x7f, 0x21, 0xf9,
+		    0x6e, 0x17, 0xf2, 0x88, 0xc0, 0xcc, 0x37, 0x60,
+		    0x7c, 0x04, 0xc5, 0xf5, 0xae, 0xa2, 0xdb, 0x13,
+		    0x4f, 0x9e, 0x2f, 0xfc, 0x66, 0xbd, 0x9d, 0xb8 },
+	.b_public = (u8[32]){
+		    0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x80 },
+	.expected_ss = (u8[32]){
+		    0x1c, 0xd0, 0xb2, 0x82, 0x67, 0xdc, 0x54, 0x1c,
+		    0x64, 0x2d, 0x6d, 0x7d, 0xca, 0x44, 0xa8, 0xb3,
+		    0x8a, 0x63, 0x73, 0x6e, 0xef, 0x5c, 0x4e, 0x65,
+		    0x01, 0xff, 0xbb, 0xb1, 0x78, 0x0c, 0x03, 0x3c },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - public key >= p */
+{
+	.secret = (u8[32]){
+		    0x78, 0x57, 0xfb, 0x80, 0x86, 0x53, 0x64, 0x5a,
+		    0x0b, 0xeb, 0x13, 0x8a, 0x64, 0xf5, 0xf4, 0xd7,
+		    0x33, 0xa4, 0x5e, 0xa8, 0x4c, 0x3c, 0xda, 0x11,
+		    0xa9, 0xc0, 0x6f, 0x7e, 0x71, 0x39, 0x14, 0x9e },
+	.b_public = (u8[32]){
+		    0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x80 },
+	.expected_ss = (u8[32]){
+		    0x87, 0x55, 0xbe, 0x01, 0xc6, 0x0a, 0x7e, 0x82,
+		    0x5c, 0xff, 0x3e, 0x0e, 0x78, 0xcb, 0x3a, 0xa4,
+		    0x33, 0x38, 0x61, 0x51, 0x6a, 0xa5, 0x9b, 0x1c,
+		    0x51, 0xa8, 0xb2, 0xa5, 0x43, 0xdf, 0xa8, 0x22 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - public key >= p */
+{
+	.secret = (u8[32]){
+		    0xe0, 0x3a, 0xa8, 0x42, 0xe2, 0xab, 0xc5, 0x6e,
+		    0x81, 0xe8, 0x7b, 0x8b, 0x9f, 0x41, 0x7b, 0x2a,
+		    0x1e, 0x59, 0x13, 0xc7, 0x23, 0xee, 0xd2, 0x8d,
+		    0x75, 0x2f, 0x8d, 0x47, 0xa5, 0x9f, 0x49, 0x8f },
+	.b_public = (u8[32]){
+		    0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x80 },
+	.expected_ss = (u8[32]){
+		    0x54, 0xc9, 0xa1, 0xed, 0x95, 0xe5, 0x46, 0xd2,
+		    0x78, 0x22, 0xa3, 0x60, 0x93, 0x1d, 0xda, 0x60,
+		    0xa1, 0xdf, 0x04, 0x9d, 0xa6, 0xf9, 0x04, 0x25,
+		    0x3c, 0x06, 0x12, 0xbb, 0xdc, 0x08, 0x74, 0x76 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - public key >= p */
+{
+	.secret = (u8[32]){
+		    0xf8, 0xf7, 0x07, 0xb7, 0x99, 0x9b, 0x18, 0xcb,
+		    0x0d, 0x6b, 0x96, 0x12, 0x4f, 0x20, 0x45, 0x97,
+		    0x2c, 0xa2, 0x74, 0xbf, 0xc1, 0x54, 0xad, 0x0c,
+		    0x87, 0x03, 0x8c, 0x24, 0xc6, 0xd0, 0xd4, 0xb2 },
+	.b_public = (u8[32]){
+		    0xda, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+	.expected_ss = (u8[32]){
+		    0xcc, 0x1f, 0x40, 0xd7, 0x43, 0xcd, 0xc2, 0x23,
+		    0x0e, 0x10, 0x43, 0xda, 0xba, 0x8b, 0x75, 0xe8,
+		    0x10, 0xf1, 0xfb, 0xab, 0x7f, 0x25, 0x52, 0x69,
+		    0xbd, 0x9e, 0xbb, 0x29, 0xe6, 0xbf, 0x49, 0x4f },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - public key >= p */
+{
+	.secret = (u8[32]){
+		    0xa0, 0x34, 0xf6, 0x84, 0xfa, 0x63, 0x1e, 0x1a,
+		    0x34, 0x81, 0x18, 0xc1, 0xce, 0x4c, 0x98, 0x23,
+		    0x1f, 0x2d, 0x9e, 0xec, 0x9b, 0xa5, 0x36, 0x5b,
+		    0x4a, 0x05, 0xd6, 0x9a, 0x78, 0x5b, 0x07, 0x96 },
+	.b_public = (u8[32]){
+		    0xdb, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+	.expected_ss = (u8[32]){
+		    0x54, 0x99, 0x8e, 0xe4, 0x3a, 0x5b, 0x00, 0x7b,
+		    0xf4, 0x99, 0xf0, 0x78, 0xe7, 0x36, 0x52, 0x44,
+		    0x00, 0xa8, 0xb5, 0xc7, 0xe9, 0xb9, 0xb4, 0x37,
+		    0x71, 0x74, 0x8c, 0x7c, 0xdf, 0x88, 0x04, 0x12 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - public key >= p */
+{
+	.secret = (u8[32]){
+		    0x30, 0xb6, 0xc6, 0xa0, 0xf2, 0xff, 0xa6, 0x80,
+		    0x76, 0x8f, 0x99, 0x2b, 0xa8, 0x9e, 0x15, 0x2d,
+		    0x5b, 0xc9, 0x89, 0x3d, 0x38, 0xc9, 0x11, 0x9b,
+		    0xe4, 0xf7, 0x67, 0xbf, 0xab, 0x6e, 0x0c, 0xa5 },
+	.b_public = (u8[32]){
+		    0xdc, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+	.expected_ss = (u8[32]){
+		    0xea, 0xd9, 0xb3, 0x8e, 0xfd, 0xd7, 0x23, 0x63,
+		    0x79, 0x34, 0xe5, 0x5a, 0xb7, 0x17, 0xa7, 0xae,
+		    0x09, 0xeb, 0x86, 0xa2, 0x1d, 0xc3, 0x6a, 0x3f,
+		    0xee, 0xb8, 0x8b, 0x75, 0x9e, 0x39, 0x1e, 0x09 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - public key >= p */
+{
+	.secret = (u8[32]){
+		    0x90, 0x1b, 0x9d, 0xcf, 0x88, 0x1e, 0x01, 0xe0,
+		    0x27, 0x57, 0x50, 0x35, 0xd4, 0x0b, 0x43, 0xbd,
+		    0xc1, 0xc5, 0x24, 0x2e, 0x03, 0x08, 0x47, 0x49,
+		    0x5b, 0x0c, 0x72, 0x86, 0x46, 0x9b, 0x65, 0x91 },
+	.b_public = (u8[32]){
+		    0xea, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+	.expected_ss = (u8[32]){
+		    0x60, 0x2f, 0xf4, 0x07, 0x89, 0xb5, 0x4b, 0x41,
+		    0x80, 0x59, 0x15, 0xfe, 0x2a, 0x62, 0x21, 0xf0,
+		    0x7a, 0x50, 0xff, 0xc2, 0xc3, 0xfc, 0x94, 0xcf,
+		    0x61, 0xf1, 0x3d, 0x79, 0x04, 0xe8, 0x8e, 0x0e },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - public key >= p */
+{
+	.secret = (u8[32]){
+		    0x80, 0x46, 0x67, 0x7c, 0x28, 0xfd, 0x82, 0xc9,
+		    0xa1, 0xbd, 0xb7, 0x1a, 0x1a, 0x1a, 0x34, 0xfa,
+		    0xba, 0x12, 0x25, 0xe2, 0x50, 0x7f, 0xe3, 0xf5,
+		    0x4d, 0x10, 0xbd, 0x5b, 0x0d, 0x86, 0x5f, 0x8e },
+	.b_public = (u8[32]){
+		    0xeb, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+	.expected_ss = (u8[32]){
+		    0xe0, 0x0a, 0xe8, 0xb1, 0x43, 0x47, 0x12, 0x47,
+		    0xba, 0x24, 0xf1, 0x2c, 0x88, 0x55, 0x36, 0xc3,
+		    0xcb, 0x98, 0x1b, 0x58, 0xe1, 0xe5, 0x6b, 0x2b,
+		    0xaf, 0x35, 0xc1, 0x2a, 0xe1, 0xf7, 0x9c, 0x26 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - public key >= p */
+{
+	.secret = (u8[32]){
+		    0x60, 0x2f, 0x7e, 0x2f, 0x68, 0xa8, 0x46, 0xb8,
+		    0x2c, 0xc2, 0x69, 0xb1, 0xd4, 0x8e, 0x93, 0x98,
+		    0x86, 0xae, 0x54, 0xfd, 0x63, 0x6c, 0x1f, 0xe0,
+		    0x74, 0xd7, 0x10, 0x12, 0x7d, 0x47, 0x24, 0x91 },
+	.b_public = (u8[32]){
+		    0xef, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+	.expected_ss = (u8[32]){
+		    0x98, 0xcb, 0x9b, 0x50, 0xdd, 0x3f, 0xc2, 0xb0,
+		    0xd4, 0xf2, 0xd2, 0xbf, 0x7c, 0x5c, 0xfd, 0xd1,
+		    0x0c, 0x8f, 0xcd, 0x31, 0xfc, 0x40, 0xaf, 0x1a,
+		    0xd4, 0x4f, 0x47, 0xc1, 0x31, 0x37, 0x63, 0x62 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - public key >= p */
+{
+	.secret = (u8[32]){
+		    0x60, 0x88, 0x7b, 0x3d, 0xc7, 0x24, 0x43, 0x02,
+		    0x6e, 0xbe, 0xdb, 0xbb, 0xb7, 0x06, 0x65, 0xf4,
+		    0x2b, 0x87, 0xad, 0xd1, 0x44, 0x0e, 0x77, 0x68,
+		    0xfb, 0xd7, 0xe8, 0xe2, 0xce, 0x5f, 0x63, 0x9d },
+	.b_public = (u8[32]){
+		    0xf0, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+	.expected_ss = (u8[32]){
+		    0x38, 0xd6, 0x30, 0x4c, 0x4a, 0x7e, 0x6d, 0x9f,
+		    0x79, 0x59, 0x33, 0x4f, 0xb5, 0x24, 0x5b, 0xd2,
+		    0xc7, 0x54, 0x52, 0x5d, 0x4c, 0x91, 0xdb, 0x95,
+		    0x02, 0x06, 0x92, 0x62, 0x34, 0xc1, 0xf6, 0x33 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - public key >= p */
+{
+	.secret = (u8[32]){
+		    0x78, 0xd3, 0x1d, 0xfa, 0x85, 0x44, 0x97, 0xd7,
+		    0x2d, 0x8d, 0xef, 0x8a, 0x1b, 0x7f, 0xb0, 0x06,
+		    0xce, 0xc2, 0xd8, 0xc4, 0x92, 0x46, 0x47, 0xc9,
+		    0x38, 0x14, 0xae, 0x56, 0xfa, 0xed, 0xa4, 0x95 },
+	.b_public = (u8[32]){
+		    0xf1, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+	.expected_ss = (u8[32]){
+		    0x78, 0x6c, 0xd5, 0x49, 0x96, 0xf0, 0x14, 0xa5,
+		    0xa0, 0x31, 0xec, 0x14, 0xdb, 0x81, 0x2e, 0xd0,
+		    0x83, 0x55, 0x06, 0x1f, 0xdb, 0x5d, 0xe6, 0x80,
+		    0xa8, 0x00, 0xac, 0x52, 0x1f, 0x31, 0x8e, 0x23 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - public key >= p */
+{
+	.secret = (u8[32]){
+		    0xc0, 0x4c, 0x5b, 0xae, 0xfa, 0x83, 0x02, 0xdd,
+		    0xde, 0xd6, 0xa4, 0xbb, 0x95, 0x77, 0x61, 0xb4,
+		    0xeb, 0x97, 0xae, 0xfa, 0x4f, 0xc3, 0xb8, 0x04,
+		    0x30, 0x85, 0xf9, 0x6a, 0x56, 0x59, 0xb3, 0xa5 },
+	.b_public = (u8[32]){
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+	.expected_ss = (u8[32]){
+		    0x29, 0xae, 0x8b, 0xc7, 0x3e, 0x9b, 0x10, 0xa0,
+		    0x8b, 0x4f, 0x68, 0x1c, 0x43, 0xc3, 0xe0, 0xac,
+		    0x1a, 0x17, 0x1d, 0x31, 0xb3, 0x8f, 0x1a, 0x48,
+		    0xef, 0xba, 0x29, 0xae, 0x63, 0x9e, 0xa1, 0x34 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - RFC 7748 */
+{
+	.secret = (u8[32]){
+		    0xa0, 0x46, 0xe3, 0x6b, 0xf0, 0x52, 0x7c, 0x9d,
+		    0x3b, 0x16, 0x15, 0x4b, 0x82, 0x46, 0x5e, 0xdd,
+		    0x62, 0x14, 0x4c, 0x0a, 0xc1, 0xfc, 0x5a, 0x18,
+		    0x50, 0x6a, 0x22, 0x44, 0xba, 0x44, 0x9a, 0x44 },
+	.b_public = (u8[32]){
+		    0xe6, 0xdb, 0x68, 0x67, 0x58, 0x30, 0x30, 0xdb,
+		    0x35, 0x94, 0xc1, 0xa4, 0x24, 0xb1, 0x5f, 0x7c,
+		    0x72, 0x66, 0x24, 0xec, 0x26, 0xb3, 0x35, 0x3b,
+		    0x10, 0xa9, 0x03, 0xa6, 0xd0, 0xab, 0x1c, 0x4c },
+	.expected_ss = (u8[32]){
+		    0xc3, 0xda, 0x55, 0x37, 0x9d, 0xe9, 0xc6, 0x90,
+		    0x8e, 0x94, 0xea, 0x4d, 0xf2, 0x8d, 0x08, 0x4f,
+		    0x32, 0xec, 0xcf, 0x03, 0x49, 0x1c, 0x71, 0xf7,
+		    0x54, 0xb4, 0x07, 0x55, 0x77, 0xa2, 0x85, 0x52 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - RFC 7748 */
+{
+	.secret = (u8[32]){
+		    0x48, 0x66, 0xe9, 0xd4, 0xd1, 0xb4, 0x67, 0x3c,
+		    0x5a, 0xd2, 0x26, 0x91, 0x95, 0x7d, 0x6a, 0xf5,
+		    0xc1, 0x1b, 0x64, 0x21, 0xe0, 0xea, 0x01, 0xd4,
+		    0x2c, 0xa4, 0x16, 0x9e, 0x79, 0x18, 0xba, 0x4d },
+	.b_public = (u8[32]){
+		    0xe5, 0x21, 0x0f, 0x12, 0x78, 0x68, 0x11, 0xd3,
+		    0xf4, 0xb7, 0x95, 0x9d, 0x05, 0x38, 0xae, 0x2c,
+		    0x31, 0xdb, 0xe7, 0x10, 0x6f, 0xc0, 0x3c, 0x3e,
+		    0xfc, 0x4c, 0xd5, 0x49, 0xc7, 0x15, 0xa4, 0x13 },
+	.expected_ss = (u8[32]){
+		    0x95, 0xcb, 0xde, 0x94, 0x76, 0xe8, 0x90, 0x7d,
+		    0x7a, 0xad, 0xe4, 0x5c, 0xb4, 0xb8, 0x73, 0xf8,
+		    0x8b, 0x59, 0x5a, 0x68, 0x79, 0x9f, 0xa1, 0x52,
+		    0xe6, 0xf8, 0xf7, 0x64, 0x7a, 0xac, 0x79, 0x57 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - edge case for shared secret */
+{
+	.secret = (u8[32]){
+		    0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4,
+		    0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3,
+		    0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc,
+		    0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+	.b_public = (u8[32]){
+		    0x0a, 0xb4, 0xe7, 0x63, 0x80, 0xd8, 0x4d, 0xde,
+		    0x4f, 0x68, 0x33, 0xc5, 0x8f, 0x2a, 0x9f, 0xb8,
+		    0xf8, 0x3b, 0xb0, 0x16, 0x9b, 0x17, 0x2b, 0xe4,
+		    0xb6, 0xe0, 0x59, 0x28, 0x87, 0x74, 0x1a, 0x36 },
+	.expected_ss = (u8[32]){
+		    0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - edge case for shared secret */
+{
+	.secret = (u8[32]){
+		    0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4,
+		    0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3,
+		    0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc,
+		    0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+	.b_public = (u8[32]){
+		    0x89, 0xe1, 0x0d, 0x57, 0x01, 0xb4, 0x33, 0x7d,
+		    0x2d, 0x03, 0x21, 0x81, 0x53, 0x8b, 0x10, 0x64,
+		    0xbd, 0x40, 0x84, 0x40, 0x1c, 0xec, 0xa1, 0xfd,
+		    0x12, 0x66, 0x3a, 0x19, 0x59, 0x38, 0x80, 0x00 },
+	.expected_ss = (u8[32]){
+		    0x09, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - edge case for shared secret */
+{
+	.secret = (u8[32]){
+		    0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4,
+		    0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3,
+		    0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc,
+		    0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+	.b_public = (u8[32]){
+		    0x2b, 0x55, 0xd3, 0xaa, 0x4a, 0x8f, 0x80, 0xc8,
+		    0xc0, 0xb2, 0xae, 0x5f, 0x93, 0x3e, 0x85, 0xaf,
+		    0x49, 0xbe, 0xac, 0x36, 0xc2, 0xfa, 0x73, 0x94,
+		    0xba, 0xb7, 0x6c, 0x89, 0x33, 0xf8, 0xf8, 0x1d },
+	.expected_ss = (u8[32]){
+		    0x10, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - edge case for shared secret */
+{
+	.secret = (u8[32]){
+		    0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4,
+		    0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3,
+		    0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc,
+		    0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+	.b_public = (u8[32]){
+		    0x63, 0xe5, 0xb1, 0xfe, 0x96, 0x01, 0xfe, 0x84,
+		    0x38, 0x5d, 0x88, 0x66, 0xb0, 0x42, 0x12, 0x62,
+		    0xf7, 0x8f, 0xbf, 0xa5, 0xaf, 0xf9, 0x58, 0x5e,
+		    0x62, 0x66, 0x79, 0xb1, 0x85, 0x47, 0xd9, 0x59 },
+	.expected_ss = (u8[32]){
+		    0xfe, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x3f },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - edge case for shared secret */
+{
+	.secret = (u8[32]){
+		    0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4,
+		    0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3,
+		    0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc,
+		    0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+	.b_public = (u8[32]){
+		    0xe4, 0x28, 0xf3, 0xda, 0xc1, 0x78, 0x09, 0xf8,
+		    0x27, 0xa5, 0x22, 0xce, 0x32, 0x35, 0x50, 0x58,
+		    0xd0, 0x73, 0x69, 0x36, 0x4a, 0xa7, 0x89, 0x02,
+		    0xee, 0x10, 0x13, 0x9b, 0x9f, 0x9d, 0xd6, 0x53 },
+	.expected_ss = (u8[32]){
+		    0xfc, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x3f },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - edge case for shared secret */
+{
+	.secret = (u8[32]){
+		    0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4,
+		    0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3,
+		    0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc,
+		    0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+	.b_public = (u8[32]){
+		    0xb3, 0xb5, 0x0e, 0x3e, 0xd3, 0xa4, 0x07, 0xb9,
+		    0x5d, 0xe9, 0x42, 0xef, 0x74, 0x57, 0x5b, 0x5a,
+		    0xb8, 0xa1, 0x0c, 0x09, 0xee, 0x10, 0x35, 0x44,
+		    0xd6, 0x0b, 0xdf, 0xed, 0x81, 0x38, 0xab, 0x2b },
+	.expected_ss = (u8[32]){
+		    0xf9, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x3f },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - edge case for shared secret */
+{
+	.secret = (u8[32]){
+		    0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4,
+		    0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3,
+		    0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc,
+		    0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+	.b_public = (u8[32]){
+		    0x21, 0x3f, 0xff, 0xe9, 0x3d, 0x5e, 0xa8, 0xcd,
+		    0x24, 0x2e, 0x46, 0x28, 0x44, 0x02, 0x99, 0x22,
+		    0xc4, 0x3c, 0x77, 0xc9, 0xe3, 0xe4, 0x2f, 0x56,
+		    0x2f, 0x48, 0x5d, 0x24, 0xc5, 0x01, 0xa2, 0x0b },
+	.expected_ss = (u8[32]){
+		    0xf3, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x3f },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - edge case for shared secret */
+{
+	.secret = (u8[32]){
+		    0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4,
+		    0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3,
+		    0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc,
+		    0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+	.b_public = (u8[32]){
+		    0x91, 0xb2, 0x32, 0xa1, 0x78, 0xb3, 0xcd, 0x53,
+		    0x09, 0x32, 0x44, 0x1e, 0x61, 0x39, 0x41, 0x8f,
+		    0x72, 0x17, 0x22, 0x92, 0xf1, 0xda, 0x4c, 0x18,
+		    0x34, 0xfc, 0x5e, 0xbf, 0xef, 0xb5, 0x1e, 0x3f },
+	.expected_ss = (u8[32]){
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x03 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - edge case for shared secret */
+{
+	.secret = (u8[32]){
+		    0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4,
+		    0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3,
+		    0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc,
+		    0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+	.b_public = (u8[32]){
+		    0x04, 0x5c, 0x6e, 0x11, 0xc5, 0xd3, 0x32, 0x55,
+		    0x6c, 0x78, 0x22, 0xfe, 0x94, 0xeb, 0xf8, 0x9b,
+		    0x56, 0xa3, 0x87, 0x8d, 0xc2, 0x7c, 0xa0, 0x79,
+		    0x10, 0x30, 0x58, 0x84, 0x9f, 0xab, 0xcb, 0x4f },
+	.expected_ss = (u8[32]){
+		    0xe5, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - edge case for shared secret */
+{
+	.secret = (u8[32]){
+		    0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4,
+		    0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3,
+		    0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc,
+		    0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+	.b_public = (u8[32]){
+		    0x1c, 0xa2, 0x19, 0x0b, 0x71, 0x16, 0x35, 0x39,
+		    0x06, 0x3c, 0x35, 0x77, 0x3b, 0xda, 0x0c, 0x9c,
+		    0x92, 0x8e, 0x91, 0x36, 0xf0, 0x62, 0x0a, 0xeb,
+		    0x09, 0x3f, 0x09, 0x91, 0x97, 0xb7, 0xf7, 0x4e },
+	.expected_ss = (u8[32]){
+		    0xe3, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - edge case for shared secret */
+{
+	.secret = (u8[32]){
+		    0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4,
+		    0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3,
+		    0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc,
+		    0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+	.b_public = (u8[32]){
+		    0xf7, 0x6e, 0x90, 0x10, 0xac, 0x33, 0xc5, 0x04,
+		    0x3b, 0x2d, 0x3b, 0x76, 0xa8, 0x42, 0x17, 0x10,
+		    0x00, 0xc4, 0x91, 0x62, 0x22, 0xe9, 0xe8, 0x58,
+		    0x97, 0xa0, 0xae, 0xc7, 0xf6, 0x35, 0x0b, 0x3c },
+	.expected_ss = (u8[32]){
+		    0xdd, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - edge case for shared secret */
+{
+	.secret = (u8[32]){
+		    0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4,
+		    0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3,
+		    0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc,
+		    0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+	.b_public = (u8[32]){
+		    0xbb, 0x72, 0x68, 0x8d, 0x8f, 0x8a, 0xa7, 0xa3,
+		    0x9c, 0xd6, 0x06, 0x0c, 0xd5, 0xc8, 0x09, 0x3c,
+		    0xde, 0xc6, 0xfe, 0x34, 0x19, 0x37, 0xc3, 0x88,
+		    0x6a, 0x99, 0x34, 0x6c, 0xd0, 0x7f, 0xaa, 0x55 },
+	.expected_ss = (u8[32]){
+		    0xdb, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - edge case for shared secret */
+{
+	.secret = (u8[32]){
+		    0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4,
+		    0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3,
+		    0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc,
+		    0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+	.b_public = (u8[32]){
+		    0x88, 0xfd, 0xde, 0xa1, 0x93, 0x39, 0x1c, 0x6a,
+		    0x59, 0x33, 0xef, 0x9b, 0x71, 0x90, 0x15, 0x49,
+		    0x44, 0x72, 0x05, 0xaa, 0xe9, 0xda, 0x92, 0x8a,
+		    0x6b, 0x91, 0xa3, 0x52, 0xba, 0x10, 0xf4, 0x1f },
+	.expected_ss = (u8[32]){
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x02 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - edge case for shared secret */
+{
+	.secret = (u8[32]){
+		    0xa0, 0xa4, 0xf1, 0x30, 0xb9, 0x8a, 0x5b, 0xe4,
+		    0xb1, 0xce, 0xdb, 0x7c, 0xb8, 0x55, 0x84, 0xa3,
+		    0x52, 0x0e, 0x14, 0x2d, 0x47, 0x4d, 0xc9, 0xcc,
+		    0xb9, 0x09, 0xa0, 0x73, 0xa9, 0x76, 0xbf, 0x63 },
+	.b_public = (u8[32]){
+		    0x30, 0x3b, 0x39, 0x2f, 0x15, 0x31, 0x16, 0xca,
+		    0xd9, 0xcc, 0x68, 0x2a, 0x00, 0xcc, 0xc4, 0x4c,
+		    0x95, 0xff, 0x0d, 0x3b, 0xbe, 0x56, 0x8b, 0xeb,
+		    0x6c, 0x4e, 0x73, 0x9b, 0xaf, 0xdc, 0x2c, 0x68 },
+	.expected_ss = (u8[32]){
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x80, 0x00 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - checking for overflow */
+{
+	.secret = (u8[32]){
+		    0xc8, 0x17, 0x24, 0x70, 0x40, 0x00, 0xb2, 0x6d,
+		    0x31, 0x70, 0x3c, 0xc9, 0x7e, 0x3a, 0x37, 0x8d,
+		    0x56, 0xfa, 0xd8, 0x21, 0x93, 0x61, 0xc8, 0x8c,
+		    0xca, 0x8b, 0xd7, 0xc5, 0x71, 0x9b, 0x12, 0xb2 },
+	.b_public = (u8[32]){
+		    0xfd, 0x30, 0x0a, 0xeb, 0x40, 0xe1, 0xfa, 0x58,
+		    0x25, 0x18, 0x41, 0x2b, 0x49, 0xb2, 0x08, 0xa7,
+		    0x84, 0x2b, 0x1e, 0x1f, 0x05, 0x6a, 0x04, 0x01,
+		    0x78, 0xea, 0x41, 0x41, 0x53, 0x4f, 0x65, 0x2d },
+	.expected_ss = (u8[32]){
+		    0xb7, 0x34, 0x10, 0x5d, 0xc2, 0x57, 0x58, 0x5d,
+		    0x73, 0xb5, 0x66, 0xcc, 0xb7, 0x6f, 0x06, 0x27,
+		    0x95, 0xcc, 0xbe, 0xc8, 0x91, 0x28, 0xe5, 0x2b,
+		    0x02, 0xf3, 0xe5, 0x96, 0x39, 0xf1, 0x3c, 0x46 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - checking for overflow */
+{
+	.secret = (u8[32]){
+		    0xc8, 0x17, 0x24, 0x70, 0x40, 0x00, 0xb2, 0x6d,
+		    0x31, 0x70, 0x3c, 0xc9, 0x7e, 0x3a, 0x37, 0x8d,
+		    0x56, 0xfa, 0xd8, 0x21, 0x93, 0x61, 0xc8, 0x8c,
+		    0xca, 0x8b, 0xd7, 0xc5, 0x71, 0x9b, 0x12, 0xb2 },
+	.b_public = (u8[32]){
+		    0xc8, 0xef, 0x79, 0xb5, 0x14, 0xd7, 0x68, 0x26,
+		    0x77, 0xbc, 0x79, 0x31, 0xe0, 0x6e, 0xe5, 0xc2,
+		    0x7c, 0x9b, 0x39, 0x2b, 0x4a, 0xe9, 0x48, 0x44,
+		    0x73, 0xf5, 0x54, 0xe6, 0x67, 0x8e, 0xcc, 0x2e },
+	.expected_ss = (u8[32]){
+		    0x64, 0x7a, 0x46, 0xb6, 0xfc, 0x3f, 0x40, 0xd6,
+		    0x21, 0x41, 0xee, 0x3c, 0xee, 0x70, 0x6b, 0x4d,
+		    0x7a, 0x92, 0x71, 0x59, 0x3a, 0x7b, 0x14, 0x3e,
+		    0x8e, 0x2e, 0x22, 0x79, 0x88, 0x3e, 0x45, 0x50 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - checking for overflow */
+{
+	.secret = (u8[32]){
+		    0xc8, 0x17, 0x24, 0x70, 0x40, 0x00, 0xb2, 0x6d,
+		    0x31, 0x70, 0x3c, 0xc9, 0x7e, 0x3a, 0x37, 0x8d,
+		    0x56, 0xfa, 0xd8, 0x21, 0x93, 0x61, 0xc8, 0x8c,
+		    0xca, 0x8b, 0xd7, 0xc5, 0x71, 0x9b, 0x12, 0xb2 },
+	.b_public = (u8[32]){
+		    0x64, 0xae, 0xac, 0x25, 0x04, 0x14, 0x48, 0x61,
+		    0x53, 0x2b, 0x7b, 0xbc, 0xb6, 0xc8, 0x7d, 0x67,
+		    0xdd, 0x4c, 0x1f, 0x07, 0xeb, 0xc2, 0xe0, 0x6e,
+		    0xff, 0xb9, 0x5a, 0xec, 0xc6, 0x17, 0x0b, 0x2c },
+	.expected_ss = (u8[32]){
+		    0x4f, 0xf0, 0x3d, 0x5f, 0xb4, 0x3c, 0xd8, 0x65,
+		    0x7a, 0x3c, 0xf3, 0x7c, 0x13, 0x8c, 0xad, 0xce,
+		    0xcc, 0xe5, 0x09, 0xe4, 0xeb, 0xa0, 0x89, 0xd0,
+		    0xef, 0x40, 0xb4, 0xe4, 0xfb, 0x94, 0x61, 0x55 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - checking for overflow */
+{
+	.secret = (u8[32]){
+		    0xc8, 0x17, 0x24, 0x70, 0x40, 0x00, 0xb2, 0x6d,
+		    0x31, 0x70, 0x3c, 0xc9, 0x7e, 0x3a, 0x37, 0x8d,
+		    0x56, 0xfa, 0xd8, 0x21, 0x93, 0x61, 0xc8, 0x8c,
+		    0xca, 0x8b, 0xd7, 0xc5, 0x71, 0x9b, 0x12, 0xb2 },
+	.b_public = (u8[32]){
+		    0xbf, 0x68, 0xe3, 0x5e, 0x9b, 0xdb, 0x7e, 0xee,
+		    0x1b, 0x50, 0x57, 0x02, 0x21, 0x86, 0x0f, 0x5d,
+		    0xcd, 0xad, 0x8a, 0xcb, 0xab, 0x03, 0x1b, 0x14,
+		    0x97, 0x4c, 0xc4, 0x90, 0x13, 0xc4, 0x98, 0x31 },
+	.expected_ss = (u8[32]){
+		    0x21, 0xce, 0xe5, 0x2e, 0xfd, 0xbc, 0x81, 0x2e,
+		    0x1d, 0x02, 0x1a, 0x4a, 0xf1, 0xe1, 0xd8, 0xbc,
+		    0x4d, 0xb3, 0xc4, 0x00, 0xe4, 0xd2, 0xa2, 0xc5,
+		    0x6a, 0x39, 0x26, 0xdb, 0x4d, 0x99, 0xc6, 0x5b },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - checking for overflow */
+{
+	.secret = (u8[32]){
+		    0xc8, 0x17, 0x24, 0x70, 0x40, 0x00, 0xb2, 0x6d,
+		    0x31, 0x70, 0x3c, 0xc9, 0x7e, 0x3a, 0x37, 0x8d,
+		    0x56, 0xfa, 0xd8, 0x21, 0x93, 0x61, 0xc8, 0x8c,
+		    0xca, 0x8b, 0xd7, 0xc5, 0x71, 0x9b, 0x12, 0xb2 },
+	.b_public = (u8[32]){
+		    0x53, 0x47, 0xc4, 0x91, 0x33, 0x1a, 0x64, 0xb4,
+		    0x3d, 0xdc, 0x68, 0x30, 0x34, 0xe6, 0x77, 0xf5,
+		    0x3d, 0xc3, 0x2b, 0x52, 0xa5, 0x2a, 0x57, 0x7c,
+		    0x15, 0xa8, 0x3b, 0xf2, 0x98, 0xe9, 0x9f, 0x19 },
+	.expected_ss = (u8[32]){
+		    0x18, 0xcb, 0x89, 0xe4, 0xe2, 0x0c, 0x0c, 0x2b,
+		    0xd3, 0x24, 0x30, 0x52, 0x45, 0x26, 0x6c, 0x93,
+		    0x27, 0x69, 0x0b, 0xbe, 0x79, 0xac, 0xb8, 0x8f,
+		    0x5b, 0x8f, 0xb3, 0xf7, 0x4e, 0xca, 0x3e, 0x52 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - private key == -1 (mod order) */
+{
+	.secret = (u8[32]){
+		    0xa0, 0x23, 0xcd, 0xd0, 0x83, 0xef, 0x5b, 0xb8,
+		    0x2f, 0x10, 0xd6, 0x2e, 0x59, 0xe1, 0x5a, 0x68,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+		    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x50 },
+	.b_public = (u8[32]){
+		    0x25, 0x8e, 0x04, 0x52, 0x3b, 0x8d, 0x25, 0x3e,
+		    0xe6, 0x57, 0x19, 0xfc, 0x69, 0x06, 0xc6, 0x57,
+		    0x19, 0x2d, 0x80, 0x71, 0x7e, 0xdc, 0x82, 0x8f,
+		    0xa0, 0xaf, 0x21, 0x68, 0x6e, 0x2f, 0xaa, 0x75 },
+	.expected_ss = (u8[32]){
+		    0x25, 0x8e, 0x04, 0x52, 0x3b, 0x8d, 0x25, 0x3e,
+		    0xe6, 0x57, 0x19, 0xfc, 0x69, 0x06, 0xc6, 0x57,
+		    0x19, 0x2d, 0x80, 0x71, 0x7e, 0xdc, 0x82, 0x8f,
+		    0xa0, 0xaf, 0x21, 0x68, 0x6e, 0x2f, 0xaa, 0x75 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+},
+/* wycheproof - private key == 1 (mod order) on twist */
+{
+	.secret = (u8[32]){
+		    0x58, 0x08, 0x3d, 0xd2, 0x61, 0xad, 0x91, 0xef,
+		    0xf9, 0x52, 0x32, 0x2e, 0xc8, 0x24, 0xc6, 0x82,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x5f },
+	.b_public = (u8[32]){
+		    0x2e, 0xae, 0x5e, 0xc3, 0xdd, 0x49, 0x4e, 0x9f,
+		    0x2d, 0x37, 0xd2, 0x58, 0xf8, 0x73, 0xa8, 0xe6,
+		    0xe9, 0xd0, 0xdb, 0xd1, 0xe3, 0x83, 0xef, 0x64,
+		    0xd9, 0x8b, 0xb9, 0x1b, 0x3e, 0x0b, 0xe0, 0x35 },
+	.expected_ss = (u8[32]){
+		    0x2e, 0xae, 0x5e, 0xc3, 0xdd, 0x49, 0x4e, 0x9f,
+		    0x2d, 0x37, 0xd2, 0x58, 0xf8, 0x73, 0xa8, 0xe6,
+		    0xe9, 0xd0, 0xdb, 0xd1, 0xe3, 0x83, 0xef, 0x64,
+		    0xd9, 0x8b, 0xb9, 0x1b, 0x3e, 0x0b, 0xe0, 0x35 },
+	.secret_size = 32,
+	.b_public_size = 32,
+	.expected_ss_size = 32,
+}
+};
+
 static const struct kpp_testvec ecdh_tv_template[] = {
 	{
 #ifndef CONFIG_CRYPTO_FIPS
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4 29/35] crypto: curve25519 - implement generic KPP driver
  2019-10-17 19:08 [PATCH v4 00/35] crypto: crypto API library interfaces for WireGuard Ard Biesheuvel
                   ` (27 preceding siblings ...)
  2019-10-17 19:09 ` [PATCH v4 28/35] crypto: curve25519 - add kpp selftest Ard Biesheuvel
@ 2019-10-17 19:09 ` Ard Biesheuvel
  2019-10-17 19:09 ` [PATCH v4 30/35] crypto: lib/curve25519 - work around Clang stack spilling issue Ard Biesheuvel
                   ` (5 subsequent siblings)
  34 siblings, 0 replies; 50+ messages in thread
From: Ard Biesheuvel @ 2019-10-17 19:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: Ard Biesheuvel, Herbert Xu, David Miller, Jason A . Donenfeld,
	Samuel Neves, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Martin Willi, Rene van Dorst, David Sterba

Expose the generic Curve25519 library via the crypto API KPP interface.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 crypto/Kconfig              |  5 ++
 crypto/Makefile             |  1 +
 crypto/curve25519-generic.c | 93 ++++++++++++++++++++
 3 files changed, 99 insertions(+)

diff --git a/crypto/Kconfig b/crypto/Kconfig
index ff077dd3a7b7..b83088e6a8e6 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -264,6 +264,11 @@ config CRYPTO_ECRDSA
 	  standard algorithms (called GOST algorithms). Only signature verification
 	  is implemented.
 
+config CRYPTO_CURVE25519
+	tristate "Curve25519 algorithm"
+	select CRYPTO_KPP
+	select CRYPTO_LIB_CURVE25519_GENERIC
+
 comment "Authenticated Encryption with Associated Data"
 
 config CRYPTO_CCM
diff --git a/crypto/Makefile b/crypto/Makefile
index ecc69a726460..1e6cea469a4a 100644
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -167,6 +167,7 @@ obj-$(CONFIG_CRYPTO_ZSTD) += zstd.o
 obj-$(CONFIG_CRYPTO_OFB) += ofb.o
 obj-$(CONFIG_CRYPTO_ECC) += ecc.o
 obj-$(CONFIG_CRYPTO_ESSIV) += essiv.o
+obj-$(CONFIG_CRYPTO_CURVE25519) += curve25519-generic.o
 
 ecdh_generic-y += ecdh.o
 ecdh_generic-y += ecdh_helper.o
diff --git a/crypto/curve25519-generic.c b/crypto/curve25519-generic.c
new file mode 100644
index 000000000000..0f4cd6347e3b
--- /dev/null
+++ b/crypto/curve25519-generic.c
@@ -0,0 +1,93 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include <crypto/curve25519.h>
+#include <crypto/internal/kpp.h>
+#include <crypto/kpp.h>
+#include <linux/module.h>
+#include <linux/scatterlist.h>
+
+static int curve25519_set_secret(struct crypto_kpp *tfm, const void *buf,
+				 unsigned int len)
+{
+	u8 *secret = kpp_tfm_ctx(tfm);
+
+	if (!len)
+		curve25519_generate_secret(secret);
+	else if (len == CURVE25519_KEY_SIZE &&
+		 crypto_memneq(buf, curve25519_null_point, CURVE25519_KEY_SIZE))
+		memcpy(secret, buf, CURVE25519_KEY_SIZE);
+	else
+		return -EINVAL;
+	return 0;
+}
+
+static int curve25519_compute_value(struct kpp_request *req)
+{
+	struct crypto_kpp *tfm = crypto_kpp_reqtfm(req);
+	const u8 *secret = kpp_tfm_ctx(tfm);
+	u8 public_key[CURVE25519_KEY_SIZE];
+	u8 buf[CURVE25519_KEY_SIZE];
+	int copied, nbytes;
+	u8 const *bp;
+
+	if (req->src) {
+		copied = sg_copy_to_buffer(req->src,
+					   sg_nents_for_len(req->src,
+							    CURVE25519_KEY_SIZE),
+					   public_key, CURVE25519_KEY_SIZE);
+		if (copied != CURVE25519_KEY_SIZE)
+			return -EINVAL;
+		bp = public_key;
+	} else {
+		bp = curve25519_base_point;
+	}
+
+	curve25519_generic(buf, secret, bp);
+
+	if (!crypto_memneq(buf, curve25519_null_point, CURVE25519_KEY_SIZE))
+		return -EKEYREJECTED;
+
+	/* might want less than we've got */
+	nbytes = min_t(size_t, CURVE25519_KEY_SIZE, req->dst_len);
+	copied = sg_copy_from_buffer(req->dst, sg_nents_for_len(req->dst,
+								nbytes),
+				     buf, nbytes);
+	if (copied != nbytes)
+		return -EINVAL;
+	return 0;
+}
+
+static unsigned int curve25519_max_size(struct crypto_kpp *tfm)
+{
+	return CURVE25519_KEY_SIZE;
+}
+
+static struct kpp_alg curve25519_alg = {
+	.base.cra_name		= "curve25519",
+	.base.cra_driver_name	= "curve25519-generic",
+	.base.cra_priority	= 100,
+	.base.cra_module	= THIS_MODULE,
+	.base.cra_ctxsize	= CURVE25519_KEY_SIZE,
+
+	.set_secret		= curve25519_set_secret,
+	.generate_public_key	= curve25519_compute_value,
+	.compute_shared_secret	= curve25519_compute_value,
+	.max_size		= curve25519_max_size,
+};
+
+static int curve25519_init(void)
+{
+	return crypto_register_kpp(&curve25519_alg);
+}
+
+static void curve25519_exit(void)
+{
+	crypto_unregister_kpp(&curve25519_alg);
+}
+
+subsys_initcall(curve25519_init);
+module_exit(curve25519_exit);
+
+MODULE_ALIAS_CRYPTO("curve25519");
+MODULE_ALIAS_CRYPTO("curve25519-generic");
+MODULE_LICENSE("GPL");
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4 30/35] crypto: lib/curve25519 - work around Clang stack spilling issue
  2019-10-17 19:08 [PATCH v4 00/35] crypto: crypto API library interfaces for WireGuard Ard Biesheuvel
                   ` (28 preceding siblings ...)
  2019-10-17 19:09 ` [PATCH v4 29/35] crypto: curve25519 - implement generic KPP driver Ard Biesheuvel
@ 2019-10-17 19:09 ` Ard Biesheuvel
  2019-10-17 19:09 ` [PATCH v4 31/35] crypto: Curve25519 - x86_64 library and KPP implementations Ard Biesheuvel
                   ` (4 subsequent siblings)
  34 siblings, 0 replies; 50+ messages in thread
From: Ard Biesheuvel @ 2019-10-17 19:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: Ard Biesheuvel, Herbert Xu, David Miller, Jason A . Donenfeld,
	Samuel Neves, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Martin Willi, Rene van Dorst, David Sterba

Arnd reports that the 32-bit generic library code for Curve25119 ends
up using an excessive amount of stack space when built with Clang:

  lib/crypto/curve25519-fiat32.c:756:6: error: stack frame size
      of 1384 bytes in function 'curve25519_generic'
      [-Werror,-Wframe-larger-than=]

Let's give some hints to the compiler regarding which routines should
not be inlined, to prevent it from running out of registers and spilling
to the stack. The resulting code performs identically under both GCC
and Clang, and makes the warning go away.

Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 lib/crypto/curve25519-fiat32.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/lib/crypto/curve25519-fiat32.c b/lib/crypto/curve25519-fiat32.c
index 1c455207341d..2fde0ec33dbd 100644
--- a/lib/crypto/curve25519-fiat32.c
+++ b/lib/crypto/curve25519-fiat32.c
@@ -223,7 +223,7 @@ static __always_inline void fe_1(fe *h)
 	h->v[0] = 1;
 }
 
-static void fe_add_impl(u32 out[10], const u32 in1[10], const u32 in2[10])
+static noinline void fe_add_impl(u32 out[10], const u32 in1[10], const u32 in2[10])
 {
 	{ const u32 x20 = in1[9];
 	{ const u32 x21 = in1[8];
@@ -266,7 +266,7 @@ static __always_inline void fe_add(fe_loose *h, const fe *f, const fe *g)
 	fe_add_impl(h->v, f->v, g->v);
 }
 
-static void fe_sub_impl(u32 out[10], const u32 in1[10], const u32 in2[10])
+static noinline void fe_sub_impl(u32 out[10], const u32 in1[10], const u32 in2[10])
 {
 	{ const u32 x20 = in1[9];
 	{ const u32 x21 = in1[8];
@@ -309,7 +309,7 @@ static __always_inline void fe_sub(fe_loose *h, const fe *f, const fe *g)
 	fe_sub_impl(h->v, f->v, g->v);
 }
 
-static void fe_mul_impl(u32 out[10], const u32 in1[10], const u32 in2[10])
+static noinline void fe_mul_impl(u32 out[10], const u32 in1[10], const u32 in2[10])
 {
 	{ const u32 x20 = in1[9];
 	{ const u32 x21 = in1[8];
@@ -441,7 +441,7 @@ fe_mul_tll(fe *h, const fe_loose *f, const fe_loose *g)
 	fe_mul_impl(h->v, f->v, g->v);
 }
 
-static void fe_sqr_impl(u32 out[10], const u32 in1[10])
+static noinline void fe_sqr_impl(u32 out[10], const u32 in1[10])
 {
 	{ const u32 x17 = in1[9];
 	{ const u32 x18 = in1[8];
@@ -619,7 +619,7 @@ static __always_inline void fe_invert(fe *out, const fe *z)
  *
  * Preconditions: b in {0,1}
  */
-static __always_inline void fe_cswap(fe *f, fe *g, unsigned int b)
+static noinline void fe_cswap(fe *f, fe *g, unsigned int b)
 {
 	unsigned i;
 	b = 0 - b;
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4 31/35] crypto: Curve25519 - x86_64 library and KPP implementations
  2019-10-17 19:08 [PATCH v4 00/35] crypto: crypto API library interfaces for WireGuard Ard Biesheuvel
                   ` (29 preceding siblings ...)
  2019-10-17 19:09 ` [PATCH v4 30/35] crypto: lib/curve25519 - work around Clang stack spilling issue Ard Biesheuvel
@ 2019-10-17 19:09 ` Ard Biesheuvel
  2019-10-17 19:09 ` [PATCH v4 32/35] crypto: arm - import Bernstein and Schwabe's Curve25519 ARM implementation Ard Biesheuvel
                   ` (3 subsequent siblings)
  34 siblings, 0 replies; 50+ messages in thread
From: Ard Biesheuvel @ 2019-10-17 19:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: Ard Biesheuvel, Herbert Xu, David Miller, Jason A . Donenfeld,
	Samuel Neves, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Martin Willi, Rene van Dorst, David Sterba

From: "Jason A. Donenfeld" <Jason@zx2c4.com>

This implementation is the fastest available x86_64 implementation, and
unlike Sandy2x, it doesn't requie use of the floating point registers at
all. Instead it makes use of BMI2 and ADX, available on recent
microarchitectures. The implementation was written by Armando
Faz-Hernández with contributions (upstream) from Samuel Neves and me,
in addition to further changes in the kernel implementation from us.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
[ardb: - move to arch/x86/crypto
       - wire into lib/crypto framework
       - implement crypto API KPP hooks ]
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/x86/crypto/Makefile            |    1 +
 arch/x86/crypto/curve25519-x86_64.c | 2480 ++++++++++++++++++++
 crypto/Kconfig                      |    6 +
 crypto/curve25519-generic.c         |    3 +-
 4 files changed, 2488 insertions(+), 2 deletions(-)

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index 922c8ecfa00f..958440eae27e 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -39,6 +39,7 @@ obj-$(CONFIG_CRYPTO_AEGIS128_AESNI_SSE2) += aegis128-aesni.o
 
 obj-$(CONFIG_CRYPTO_NHPOLY1305_SSE2) += nhpoly1305-sse2.o
 obj-$(CONFIG_CRYPTO_NHPOLY1305_AVX2) += nhpoly1305-avx2.o
+obj-$(CONFIG_CRYPTO_CURVE25519_X86) += curve25519-x86_64.o
 
 # These modules require assembler to support AVX.
 ifeq ($(avx_supported),yes)
diff --git a/arch/x86/crypto/curve25519-x86_64.c b/arch/x86/crypto/curve25519-x86_64.c
new file mode 100644
index 000000000000..5f4901aecd9b
--- /dev/null
+++ b/arch/x86/crypto/curve25519-x86_64.c
@@ -0,0 +1,2480 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
+/*
+ * Copyright (c) 2017 Armando Faz <armfazh@ic.unicamp.br>. All Rights Reserved.
+ * Copyright (C) 2018-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ * Copyright (C) 2018 Samuel Neves <sneves@dei.uc.pt>. All Rights Reserved.
+ */
+
+#include <crypto/curve25519.h>
+#include <crypto/internal/kpp.h>
+
+#include <linux/types.h>
+#include <linux/jump_label.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+
+#include <asm/cpufeature.h>
+#include <asm/processor.h>
+
+static __ro_after_init DEFINE_STATIC_KEY_FALSE(curve25519_use_bmi2);
+static __ro_after_init DEFINE_STATIC_KEY_FALSE(curve25519_use_adx);
+
+enum { NUM_WORDS_ELTFP25519 = 4 };
+typedef __aligned(32) u64 eltfp25519_1w[NUM_WORDS_ELTFP25519];
+typedef __aligned(32) u64 eltfp25519_1w_buffer[2 * NUM_WORDS_ELTFP25519];
+
+#define mul_eltfp25519_1w_adx(c, a, b) do { \
+	mul_256x256_integer_adx(m.buffer, a, b); \
+	red_eltfp25519_1w_adx(c, m.buffer); \
+} while (0)
+
+#define mul_eltfp25519_1w_bmi2(c, a, b) do { \
+	mul_256x256_integer_bmi2(m.buffer, a, b); \
+	red_eltfp25519_1w_bmi2(c, m.buffer); \
+} while (0)
+
+#define sqr_eltfp25519_1w_adx(a) do { \
+	sqr_256x256_integer_adx(m.buffer, a); \
+	red_eltfp25519_1w_adx(a, m.buffer); \
+} while (0)
+
+#define sqr_eltfp25519_1w_bmi2(a) do { \
+	sqr_256x256_integer_bmi2(m.buffer, a); \
+	red_eltfp25519_1w_bmi2(a, m.buffer); \
+} while (0)
+
+#define mul_eltfp25519_2w_adx(c, a, b) do { \
+	mul2_256x256_integer_adx(m.buffer, a, b); \
+	red_eltfp25519_2w_adx(c, m.buffer); \
+} while (0)
+
+#define mul_eltfp25519_2w_bmi2(c, a, b) do { \
+	mul2_256x256_integer_bmi2(m.buffer, a, b); \
+	red_eltfp25519_2w_bmi2(c, m.buffer); \
+} while (0)
+
+#define sqr_eltfp25519_2w_adx(a) do { \
+	sqr2_256x256_integer_adx(m.buffer, a); \
+	red_eltfp25519_2w_adx(a, m.buffer); \
+} while (0)
+
+#define sqr_eltfp25519_2w_bmi2(a) do { \
+	sqr2_256x256_integer_bmi2(m.buffer, a); \
+	red_eltfp25519_2w_bmi2(a, m.buffer); \
+} while (0)
+
+#define sqrn_eltfp25519_1w_adx(a, times) do { \
+	int ____counter = (times); \
+	while (____counter-- > 0) \
+		sqr_eltfp25519_1w_adx(a); \
+} while (0)
+
+#define sqrn_eltfp25519_1w_bmi2(a, times) do { \
+	int ____counter = (times); \
+	while (____counter-- > 0) \
+		sqr_eltfp25519_1w_bmi2(a); \
+} while (0)
+
+#define copy_eltfp25519_1w(C, A) do { \
+	(C)[0] = (A)[0]; \
+	(C)[1] = (A)[1]; \
+	(C)[2] = (A)[2]; \
+	(C)[3] = (A)[3]; \
+} while (0)
+
+#define setzero_eltfp25519_1w(C) do { \
+	(C)[0] = 0; \
+	(C)[1] = 0; \
+	(C)[2] = 0; \
+	(C)[3] = 0; \
+} while (0)
+
+__aligned(32) static const u64 table_ladder_8k[252 * NUM_WORDS_ELTFP25519] = {
+	/*   1 */ 0xfffffffffffffff3UL, 0xffffffffffffffffUL,
+		  0xffffffffffffffffUL, 0x5fffffffffffffffUL,
+	/*   2 */ 0x6b8220f416aafe96UL, 0x82ebeb2b4f566a34UL,
+		  0xd5a9a5b075a5950fUL, 0x5142b2cf4b2488f4UL,
+	/*   3 */ 0x6aaebc750069680cUL, 0x89cf7820a0f99c41UL,
+		  0x2a58d9183b56d0f4UL, 0x4b5aca80e36011a4UL,
+	/*   4 */ 0x329132348c29745dUL, 0xf4a2e616e1642fd7UL,
+		  0x1e45bb03ff67bc34UL, 0x306912d0f42a9b4aUL,
+	/*   5 */ 0xff886507e6af7154UL, 0x04f50e13dfeec82fUL,
+		  0xaa512fe82abab5ceUL, 0x174e251a68d5f222UL,
+	/*   6 */ 0xcf96700d82028898UL, 0x1743e3370a2c02c5UL,
+		  0x379eec98b4e86eaaUL, 0x0c59888a51e0482eUL,
+	/*   7 */ 0xfbcbf1d699b5d189UL, 0xacaef0d58e9fdc84UL,
+		  0xc1c20d06231f7614UL, 0x2938218da274f972UL,
+	/*   8 */ 0xf6af49beff1d7f18UL, 0xcc541c22387ac9c2UL,
+		  0x96fcc9ef4015c56bUL, 0x69c1627c690913a9UL,
+	/*   9 */ 0x7a86fd2f4733db0eUL, 0xfdb8c4f29e087de9UL,
+		  0x095e4b1a8ea2a229UL, 0x1ad7a7c829b37a79UL,
+	/*  10 */ 0x342d89cad17ea0c0UL, 0x67bedda6cced2051UL,
+		  0x19ca31bf2bb42f74UL, 0x3df7b4c84980acbbUL,
+	/*  11 */ 0xa8c6444dc80ad883UL, 0xb91e440366e3ab85UL,
+		  0xc215cda00164f6d8UL, 0x3d867c6ef247e668UL,
+	/*  12 */ 0xc7dd582bcc3e658cUL, 0xfd2c4748ee0e5528UL,
+		  0xa0fd9b95cc9f4f71UL, 0x7529d871b0675ddfUL,
+	/*  13 */ 0xb8f568b42d3cbd78UL, 0x1233011b91f3da82UL,
+		  0x2dce6ccd4a7c3b62UL, 0x75e7fc8e9e498603UL,
+	/*  14 */ 0x2f4f13f1fcd0b6ecUL, 0xf1a8ca1f29ff7a45UL,
+		  0xc249c1a72981e29bUL, 0x6ebe0dbb8c83b56aUL,
+	/*  15 */ 0x7114fa8d170bb222UL, 0x65a2dcd5bf93935fUL,
+		  0xbdc41f68b59c979aUL, 0x2f0eef79a2ce9289UL,
+	/*  16 */ 0x42ecbf0c083c37ceUL, 0x2930bc09ec496322UL,
+		  0xf294b0c19cfeac0dUL, 0x3780aa4bedfabb80UL,
+	/*  17 */ 0x56c17d3e7cead929UL, 0xe7cb4beb2e5722c5UL,
+		  0x0ce931732dbfe15aUL, 0x41b883c7621052f8UL,
+	/*  18 */ 0xdbf75ca0c3d25350UL, 0x2936be086eb1e351UL,
+		  0xc936e03cb4a9b212UL, 0x1d45bf82322225aaUL,
+	/*  19 */ 0xe81ab1036a024cc5UL, 0xe212201c304c9a72UL,
+		  0xc5d73fba6832b1fcUL, 0x20ffdb5a4d839581UL,
+	/*  20 */ 0xa283d367be5d0fadUL, 0x6c2b25ca8b164475UL,
+		  0x9d4935467caaf22eUL, 0x5166408eee85ff49UL,
+	/*  21 */ 0x3c67baa2fab4e361UL, 0xb3e433c67ef35cefUL,
+		  0x5259729241159b1cUL, 0x6a621892d5b0ab33UL,
+	/*  22 */ 0x20b74a387555cdcbUL, 0x532aa10e1208923fUL,
+		  0xeaa17b7762281dd1UL, 0x61ab3443f05c44bfUL,
+	/*  23 */ 0x257a6c422324def8UL, 0x131c6c1017e3cf7fUL,
+		  0x23758739f630a257UL, 0x295a407a01a78580UL,
+	/*  24 */ 0xf8c443246d5da8d9UL, 0x19d775450c52fa5dUL,
+		  0x2afcfc92731bf83dUL, 0x7d10c8e81b2b4700UL,
+	/*  25 */ 0xc8e0271f70baa20bUL, 0x993748867ca63957UL,
+		  0x5412efb3cb7ed4bbUL, 0x3196d36173e62975UL,
+	/*  26 */ 0xde5bcad141c7dffcUL, 0x47cc8cd2b395c848UL,
+		  0xa34cd942e11af3cbUL, 0x0256dbf2d04ecec2UL,
+	/*  27 */ 0x875ab7e94b0e667fUL, 0xcad4dd83c0850d10UL,
+		  0x47f12e8f4e72c79fUL, 0x5f1a87bb8c85b19bUL,
+	/*  28 */ 0x7ae9d0b6437f51b8UL, 0x12c7ce5518879065UL,
+		  0x2ade09fe5cf77aeeUL, 0x23a05a2f7d2c5627UL,
+	/*  29 */ 0x5908e128f17c169aUL, 0xf77498dd8ad0852dUL,
+		  0x74b4c4ceab102f64UL, 0x183abadd10139845UL,
+	/*  30 */ 0xb165ba8daa92aaacUL, 0xd5c5ef9599386705UL,
+		  0xbe2f8f0cf8fc40d1UL, 0x2701e635ee204514UL,
+	/*  31 */ 0x629fa80020156514UL, 0xf223868764a8c1ceUL,
+		  0x5b894fff0b3f060eUL, 0x60d9944cf708a3faUL,
+	/*  32 */ 0xaeea001a1c7a201fUL, 0xebf16a633ee2ce63UL,
+		  0x6f7709594c7a07e1UL, 0x79b958150d0208cbUL,
+	/*  33 */ 0x24b55e5301d410e7UL, 0xe3a34edff3fdc84dUL,
+		  0xd88768e4904032d8UL, 0x131384427b3aaeecUL,
+	/*  34 */ 0x8405e51286234f14UL, 0x14dc4739adb4c529UL,
+		  0xb8a2b5b250634ffdUL, 0x2fe2a94ad8a7ff93UL,
+	/*  35 */ 0xec5c57efe843faddUL, 0x2843ce40f0bb9918UL,
+		  0xa4b561d6cf3d6305UL, 0x743629bde8fb777eUL,
+	/*  36 */ 0x343edd46bbaf738fUL, 0xed981828b101a651UL,
+		  0xa401760b882c797aUL, 0x1fc223e28dc88730UL,
+	/*  37 */ 0x48604e91fc0fba0eUL, 0xb637f78f052c6fa4UL,
+		  0x91ccac3d09e9239cUL, 0x23f7eed4437a687cUL,
+	/*  38 */ 0x5173b1118d9bd800UL, 0x29d641b63189d4a7UL,
+		  0xfdbf177988bbc586UL, 0x2959894fcad81df5UL,
+	/*  39 */ 0xaebc8ef3b4bbc899UL, 0x4148995ab26992b9UL,
+		  0x24e20b0134f92cfbUL, 0x40d158894a05dee8UL,
+	/*  40 */ 0x46b00b1185af76f6UL, 0x26bac77873187a79UL,
+		  0x3dc0bf95ab8fff5fUL, 0x2a608bd8945524d7UL,
+	/*  41 */ 0x26449588bd446302UL, 0x7c4bc21c0388439cUL,
+		  0x8e98a4f383bd11b2UL, 0x26218d7bc9d876b9UL,
+	/*  42 */ 0xe3081542997c178aUL, 0x3c2d29a86fb6606fUL,
+		  0x5c217736fa279374UL, 0x7dde05734afeb1faUL,
+	/*  43 */ 0x3bf10e3906d42babUL, 0xe4f7803e1980649cUL,
+		  0xe6053bf89595bf7aUL, 0x394faf38da245530UL,
+	/*  44 */ 0x7a8efb58896928f4UL, 0xfbc778e9cc6a113cUL,
+		  0x72670ce330af596fUL, 0x48f222a81d3d6cf7UL,
+	/*  45 */ 0xf01fce410d72caa7UL, 0x5a20ecc7213b5595UL,
+		  0x7bc21165c1fa1483UL, 0x07f89ae31da8a741UL,
+	/*  46 */ 0x05d2c2b4c6830ff9UL, 0xd43e330fc6316293UL,
+		  0xa5a5590a96d3a904UL, 0x705edb91a65333b6UL,
+	/*  47 */ 0x048ee15e0bb9a5f7UL, 0x3240cfca9e0aaf5dUL,
+		  0x8f4b71ceedc4a40bUL, 0x621c0da3de544a6dUL,
+	/*  48 */ 0x92872836a08c4091UL, 0xce8375b010c91445UL,
+		  0x8a72eb524f276394UL, 0x2667fcfa7ec83635UL,
+	/*  49 */ 0x7f4c173345e8752aUL, 0x061b47feee7079a5UL,
+		  0x25dd9afa9f86ff34UL, 0x3780cef5425dc89cUL,
+	/*  50 */ 0x1a46035a513bb4e9UL, 0x3e1ef379ac575adaUL,
+		  0xc78c5f1c5fa24b50UL, 0x321a967634fd9f22UL,
+	/*  51 */ 0x946707b8826e27faUL, 0x3dca84d64c506fd0UL,
+		  0xc189218075e91436UL, 0x6d9284169b3b8484UL,
+	/*  52 */ 0x3a67e840383f2ddfUL, 0x33eec9a30c4f9b75UL,
+		  0x3ec7c86fa783ef47UL, 0x26ec449fbac9fbc4UL,
+	/*  53 */ 0x5c0f38cba09b9e7dUL, 0x81168cc762a3478cUL,
+		  0x3e23b0d306fc121cUL, 0x5a238aa0a5efdcddUL,
+	/*  54 */ 0x1ba26121c4ea43ffUL, 0x36f8c77f7c8832b5UL,
+		  0x88fbea0b0adcf99aUL, 0x5ca9938ec25bebf9UL,
+	/*  55 */ 0xd5436a5e51fccda0UL, 0x1dbc4797c2cd893bUL,
+		  0x19346a65d3224a08UL, 0x0f5034e49b9af466UL,
+	/*  56 */ 0xf23c3967a1e0b96eUL, 0xe58b08fa867a4d88UL,
+		  0xfb2fabc6a7341679UL, 0x2a75381eb6026946UL,
+	/*  57 */ 0xc80a3be4c19420acUL, 0x66b1f6c681f2b6dcUL,
+		  0x7cf7036761e93388UL, 0x25abbbd8a660a4c4UL,
+	/*  58 */ 0x91ea12ba14fd5198UL, 0x684950fc4a3cffa9UL,
+		  0xf826842130f5ad28UL, 0x3ea988f75301a441UL,
+	/*  59 */ 0xc978109a695f8c6fUL, 0x1746eb4a0530c3f3UL,
+		  0x444d6d77b4459995UL, 0x75952b8c054e5cc7UL,
+	/*  60 */ 0xa3703f7915f4d6aaUL, 0x66c346202f2647d8UL,
+		  0xd01469df811d644bUL, 0x77fea47d81a5d71fUL,
+	/*  61 */ 0xc5e9529ef57ca381UL, 0x6eeeb4b9ce2f881aUL,
+		  0xb6e91a28e8009bd6UL, 0x4b80be3e9afc3fecUL,
+	/*  62 */ 0x7e3773c526aed2c5UL, 0x1b4afcb453c9a49dUL,
+		  0xa920bdd7baffb24dUL, 0x7c54699f122d400eUL,
+	/*  63 */ 0xef46c8e14fa94bc8UL, 0xe0b074ce2952ed5eUL,
+		  0xbea450e1dbd885d5UL, 0x61b68649320f712cUL,
+	/*  64 */ 0x8a485f7309ccbdd1UL, 0xbd06320d7d4d1a2dUL,
+		  0x25232973322dbef4UL, 0x445dc4758c17f770UL,
+	/*  65 */ 0xdb0434177cc8933cUL, 0xed6fe82175ea059fUL,
+		  0x1efebefdc053db34UL, 0x4adbe867c65daf99UL,
+	/*  66 */ 0x3acd71a2a90609dfUL, 0xe5e991856dd04050UL,
+		  0x1ec69b688157c23cUL, 0x697427f6885cfe4dUL,
+	/*  67 */ 0xd7be7b9b65e1a851UL, 0xa03d28d522c536ddUL,
+		  0x28399d658fd2b645UL, 0x49e5b7e17c2641e1UL,
+	/*  68 */ 0x6f8c3a98700457a4UL, 0x5078f0a25ebb6778UL,
+		  0xd13c3ccbc382960fUL, 0x2e003258a7df84b1UL,
+	/*  69 */ 0x8ad1f39be6296a1cUL, 0xc1eeaa652a5fbfb2UL,
+		  0x33ee0673fd26f3cbUL, 0x59256173a69d2cccUL,
+	/*  70 */ 0x41ea07aa4e18fc41UL, 0xd9fc19527c87a51eUL,
+		  0xbdaacb805831ca6fUL, 0x445b652dc916694fUL,
+	/*  71 */ 0xce92a3a7f2172315UL, 0x1edc282de11b9964UL,
+		  0xa1823aafe04c314aUL, 0x790a2d94437cf586UL,
+	/*  72 */ 0x71c447fb93f6e009UL, 0x8922a56722845276UL,
+		  0xbf70903b204f5169UL, 0x2f7a89891ba319feUL,
+	/*  73 */ 0x02a08eb577e2140cUL, 0xed9a4ed4427bdcf4UL,
+		  0x5253ec44e4323cd1UL, 0x3e88363c14e9355bUL,
+	/*  74 */ 0xaa66c14277110b8cUL, 0x1ae0391610a23390UL,
+		  0x2030bd12c93fc2a2UL, 0x3ee141579555c7abUL,
+	/*  75 */ 0x9214de3a6d6e7d41UL, 0x3ccdd88607f17efeUL,
+		  0x674f1288f8e11217UL, 0x5682250f329f93d0UL,
+	/*  76 */ 0x6cf00b136d2e396eUL, 0x6e4cf86f1014debfUL,
+		  0x5930b1b5bfcc4e83UL, 0x047069b48aba16b6UL,
+	/*  77 */ 0x0d4ce4ab69b20793UL, 0xb24db91a97d0fb9eUL,
+		  0xcdfa50f54e00d01dUL, 0x221b1085368bddb5UL,
+	/*  78 */ 0xe7e59468b1e3d8d2UL, 0x53c56563bd122f93UL,
+		  0xeee8a903e0663f09UL, 0x61efa662cbbe3d42UL,
+	/*  79 */ 0x2cf8ddddde6eab2aUL, 0x9bf80ad51435f231UL,
+		  0x5deadacec9f04973UL, 0x29275b5d41d29b27UL,
+	/*  80 */ 0xcfde0f0895ebf14fUL, 0xb9aab96b054905a7UL,
+		  0xcae80dd9a1c420fdUL, 0x0a63bf2f1673bbc7UL,
+	/*  81 */ 0x092f6e11958fbc8cUL, 0x672a81e804822fadUL,
+		  0xcac8351560d52517UL, 0x6f3f7722c8f192f8UL,
+	/*  82 */ 0xf8ba90ccc2e894b7UL, 0x2c7557a438ff9f0dUL,
+		  0x894d1d855ae52359UL, 0x68e122157b743d69UL,
+	/*  83 */ 0xd87e5570cfb919f3UL, 0x3f2cdecd95798db9UL,
+		  0x2121154710c0a2ceUL, 0x3c66a115246dc5b2UL,
+	/*  84 */ 0xcbedc562294ecb72UL, 0xba7143c36a280b16UL,
+		  0x9610c2efd4078b67UL, 0x6144735d946a4b1eUL,
+	/*  85 */ 0x536f111ed75b3350UL, 0x0211db8c2041d81bUL,
+		  0xf93cb1000e10413cUL, 0x149dfd3c039e8876UL,
+	/*  86 */ 0xd479dde46b63155bUL, 0xb66e15e93c837976UL,
+		  0xdafde43b1f13e038UL, 0x5fafda1a2e4b0b35UL,
+	/*  87 */ 0x3600bbdf17197581UL, 0x3972050bbe3cd2c2UL,
+		  0x5938906dbdd5be86UL, 0x34fce5e43f9b860fUL,
+	/*  88 */ 0x75a8a4cd42d14d02UL, 0x828dabc53441df65UL,
+		  0x33dcabedd2e131d3UL, 0x3ebad76fb814d25fUL,
+	/*  89 */ 0xd4906f566f70e10fUL, 0x5d12f7aa51690f5aUL,
+		  0x45adb16e76cefcf2UL, 0x01f768aead232999UL,
+	/*  90 */ 0x2b6cc77b6248febdUL, 0x3cd30628ec3aaffdUL,
+		  0xce1c0b80d4ef486aUL, 0x4c3bff2ea6f66c23UL,
+	/*  91 */ 0x3f2ec4094aeaeb5fUL, 0x61b19b286e372ca7UL,
+		  0x5eefa966de2a701dUL, 0x23b20565de55e3efUL,
+	/*  92 */ 0xe301ca5279d58557UL, 0x07b2d4ce27c2874fUL,
+		  0xa532cd8a9dcf1d67UL, 0x2a52fee23f2bff56UL,
+	/*  93 */ 0x8624efb37cd8663dUL, 0xbbc7ac20ffbd7594UL,
+		  0x57b85e9c82d37445UL, 0x7b3052cb86a6ec66UL,
+	/*  94 */ 0x3482f0ad2525e91eUL, 0x2cb68043d28edca0UL,
+		  0xaf4f6d052e1b003aUL, 0x185f8c2529781b0aUL,
+	/*  95 */ 0xaa41de5bd80ce0d6UL, 0x9407b2416853e9d6UL,
+		  0x563ec36e357f4c3aUL, 0x4cc4b8dd0e297bceUL,
+	/*  96 */ 0xa2fc1a52ffb8730eUL, 0x1811f16e67058e37UL,
+		  0x10f9a366cddf4ee1UL, 0x72f4a0c4a0b9f099UL,
+	/*  97 */ 0x8c16c06f663f4ea7UL, 0x693b3af74e970fbaUL,
+		  0x2102e7f1d69ec345UL, 0x0ba53cbc968a8089UL,
+	/*  98 */ 0xca3d9dc7fea15537UL, 0x4c6824bb51536493UL,
+		  0xb9886314844006b1UL, 0x40d2a72ab454cc60UL,
+	/*  99 */ 0x5936a1b712570975UL, 0x91b9d648debda657UL,
+		  0x3344094bb64330eaUL, 0x006ba10d12ee51d0UL,
+	/* 100 */ 0x19228468f5de5d58UL, 0x0eb12f4c38cc05b0UL,
+		  0xa1039f9dd5601990UL, 0x4502d4ce4fff0e0bUL,
+	/* 101 */ 0xeb2054106837c189UL, 0xd0f6544c6dd3b93cUL,
+		  0x40727064c416d74fUL, 0x6e15c6114b502ef0UL,
+	/* 102 */ 0x4df2a398cfb1a76bUL, 0x11256c7419f2f6b1UL,
+		  0x4a497962066e6043UL, 0x705b3aab41355b44UL,
+	/* 103 */ 0x365ef536d797b1d8UL, 0x00076bd622ddf0dbUL,
+		  0x3bbf33b0e0575a88UL, 0x3777aa05c8e4ca4dUL,
+	/* 104 */ 0x392745c85578db5fUL, 0x6fda4149dbae5ae2UL,
+		  0xb1f0b00b8adc9867UL, 0x09963437d36f1da3UL,
+	/* 105 */ 0x7e824e90a5dc3853UL, 0xccb5f6641f135cbdUL,
+		  0x6736d86c87ce8fccUL, 0x625f3ce26604249fUL,
+	/* 106 */ 0xaf8ac8059502f63fUL, 0x0c05e70a2e351469UL,
+		  0x35292e9c764b6305UL, 0x1a394360c7e23ac3UL,
+	/* 107 */ 0xd5c6d53251183264UL, 0x62065abd43c2b74fUL,
+		  0xb5fbf5d03b973f9bUL, 0x13a3da3661206e5eUL,
+	/* 108 */ 0xc6bd5837725d94e5UL, 0x18e30912205016c5UL,
+		  0x2088ce1570033c68UL, 0x7fba1f495c837987UL,
+	/* 109 */ 0x5a8c7423f2f9079dUL, 0x1735157b34023fc5UL,
+		  0xe4f9b49ad2fab351UL, 0x6691ff72c878e33cUL,
+	/* 110 */ 0x122c2adedc5eff3eUL, 0xf8dd4bf1d8956cf4UL,
+		  0xeb86205d9e9e5bdaUL, 0x049b92b9d975c743UL,
+	/* 111 */ 0xa5379730b0f6c05aUL, 0x72a0ffacc6f3a553UL,
+		  0xb0032c34b20dcd6dUL, 0x470e9dbc88d5164aUL,
+	/* 112 */ 0xb19cf10ca237c047UL, 0xb65466711f6c81a2UL,
+		  0xb3321bd16dd80b43UL, 0x48c14f600c5fbe8eUL,
+	/* 113 */ 0x66451c264aa6c803UL, 0xb66e3904a4fa7da6UL,
+		  0xd45f19b0b3128395UL, 0x31602627c3c9bc10UL,
+	/* 114 */ 0x3120dc4832e4e10dUL, 0xeb20c46756c717f7UL,
+		  0x00f52e3f67280294UL, 0x566d4fc14730c509UL,
+	/* 115 */ 0x7e3a5d40fd837206UL, 0xc1e926dc7159547aUL,
+		  0x216730fba68d6095UL, 0x22e8c3843f69cea7UL,
+	/* 116 */ 0x33d074e8930e4b2bUL, 0xb6e4350e84d15816UL,
+		  0x5534c26ad6ba2365UL, 0x7773c12f89f1f3f3UL,
+	/* 117 */ 0x8cba404da57962aaUL, 0x5b9897a81999ce56UL,
+		  0x508e862f121692fcUL, 0x3a81907fa093c291UL,
+	/* 118 */ 0x0dded0ff4725a510UL, 0x10d8cc10673fc503UL,
+		  0x5b9d151c9f1f4e89UL, 0x32a5c1d5cb09a44cUL,
+	/* 119 */ 0x1e0aa442b90541fbUL, 0x5f85eb7cc1b485dbUL,
+		  0xbee595ce8a9df2e5UL, 0x25e496c722422236UL,
+	/* 120 */ 0x5edf3c46cd0fe5b9UL, 0x34e75a7ed2a43388UL,
+		  0xe488de11d761e352UL, 0x0e878a01a085545cUL,
+	/* 121 */ 0xba493c77e021bb04UL, 0x2b4d1843c7df899aUL,
+		  0x9ea37a487ae80d67UL, 0x67a9958011e41794UL,
+	/* 122 */ 0x4b58051a6697b065UL, 0x47e33f7d8d6ba6d4UL,
+		  0xbb4da8d483ca46c1UL, 0x68becaa181c2db0dUL,
+	/* 123 */ 0x8d8980e90b989aa5UL, 0xf95eb14a2c93c99bUL,
+		  0x51c6c7c4796e73a2UL, 0x6e228363b5efb569UL,
+	/* 124 */ 0xc6bbc0b02dd624c8UL, 0x777eb47dec8170eeUL,
+		  0x3cde15a004cfafa9UL, 0x1dc6bc087160bf9bUL,
+	/* 125 */ 0x2e07e043eec34002UL, 0x18e9fc677a68dc7fUL,
+		  0xd8da03188bd15b9aUL, 0x48fbc3bb00568253UL,
+	/* 126 */ 0x57547d4cfb654ce1UL, 0xd3565b82a058e2adUL,
+		  0xf63eaf0bbf154478UL, 0x47531ef114dfbb18UL,
+	/* 127 */ 0xe1ec630a4278c587UL, 0x5507d546ca8e83f3UL,
+		  0x85e135c63adc0c2bUL, 0x0aa7efa85682844eUL,
+	/* 128 */ 0x72691ba8b3e1f615UL, 0x32b4e9701fbe3ffaUL,
+		  0x97b6d92e39bb7868UL, 0x2cfe53dea02e39e8UL,
+	/* 129 */ 0x687392cd85cd52b0UL, 0x27ff66c910e29831UL,
+		  0x97134556a9832d06UL, 0x269bb0360a84f8a0UL,
+	/* 130 */ 0x706e55457643f85cUL, 0x3734a48c9b597d1bUL,
+		  0x7aee91e8c6efa472UL, 0x5cd6abc198a9d9e0UL,
+	/* 131 */ 0x0e04de06cb3ce41aUL, 0xd8c6eb893402e138UL,
+		  0x904659bb686e3772UL, 0x7215c371746ba8c8UL,
+	/* 132 */ 0xfd12a97eeae4a2d9UL, 0x9514b7516394f2c5UL,
+		  0x266fd5809208f294UL, 0x5c847085619a26b9UL,
+	/* 133 */ 0x52985410fed694eaUL, 0x3c905b934a2ed254UL,
+		  0x10bb47692d3be467UL, 0x063b3d2d69e5e9e1UL,
+	/* 134 */ 0x472726eedda57debUL, 0xefb6c4ae10f41891UL,
+		  0x2b1641917b307614UL, 0x117c554fc4f45b7cUL,
+	/* 135 */ 0xc07cf3118f9d8812UL, 0x01dbd82050017939UL,
+		  0xd7e803f4171b2827UL, 0x1015e87487d225eaUL,
+	/* 136 */ 0xc58de3fed23acc4dUL, 0x50db91c294a7be2dUL,
+		  0x0b94d43d1c9cf457UL, 0x6b1640fa6e37524aUL,
+	/* 137 */ 0x692f346c5fda0d09UL, 0x200b1c59fa4d3151UL,
+		  0xb8c46f760777a296UL, 0x4b38395f3ffdfbcfUL,
+	/* 138 */ 0x18d25e00be54d671UL, 0x60d50582bec8aba6UL,
+		  0x87ad8f263b78b982UL, 0x50fdf64e9cda0432UL,
+	/* 139 */ 0x90f567aac578dcf0UL, 0xef1e9b0ef2a3133bUL,
+		  0x0eebba9242d9de71UL, 0x15473c9bf03101c7UL,
+	/* 140 */ 0x7c77e8ae56b78095UL, 0xb678e7666e6f078eUL,
+		  0x2da0b9615348ba1fUL, 0x7cf931c1ff733f0bUL,
+	/* 141 */ 0x26b357f50a0a366cUL, 0xe9708cf42b87d732UL,
+		  0xc13aeea5f91cb2c0UL, 0x35d90c991143bb4cUL,
+	/* 142 */ 0x47c1c404a9a0d9dcUL, 0x659e58451972d251UL,
+		  0x3875a8c473b38c31UL, 0x1fbd9ed379561f24UL,
+	/* 143 */ 0x11fabc6fd41ec28dUL, 0x7ef8dfe3cd2a2dcaUL,
+		  0x72e73b5d8c404595UL, 0x6135fa4954b72f27UL,
+	/* 144 */ 0xccfc32a2de24b69cUL, 0x3f55698c1f095d88UL,
+		  0xbe3350ed5ac3f929UL, 0x5e9bf806ca477eebUL,
+	/* 145 */ 0xe9ce8fb63c309f68UL, 0x5376f63565e1f9f4UL,
+		  0xd1afcfb35a6393f1UL, 0x6632a1ede5623506UL,
+	/* 146 */ 0x0b7d6c390c2ded4cUL, 0x56cb3281df04cb1fUL,
+		  0x66305a1249ecc3c7UL, 0x5d588b60a38ca72aUL,
+	/* 147 */ 0xa6ecbf78e8e5f42dUL, 0x86eeb44b3c8a3eecUL,
+		  0xec219c48fbd21604UL, 0x1aaf1af517c36731UL,
+	/* 148 */ 0xc306a2836769bde7UL, 0x208280622b1e2adbUL,
+		  0x8027f51ffbff94a6UL, 0x76cfa1ce1124f26bUL,
+	/* 149 */ 0x18eb00562422abb6UL, 0xf377c4d58f8c29c3UL,
+		  0x4dbbc207f531561aUL, 0x0253b7f082128a27UL,
+	/* 150 */ 0x3d1f091cb62c17e0UL, 0x4860e1abd64628a9UL,
+		  0x52d17436309d4253UL, 0x356f97e13efae576UL,
+	/* 151 */ 0xd351e11aa150535bUL, 0x3e6b45bb1dd878ccUL,
+		  0x0c776128bed92c98UL, 0x1d34ae93032885b8UL,
+	/* 152 */ 0x4ba0488ca85ba4c3UL, 0x985348c33c9ce6ceUL,
+		  0x66124c6f97bda770UL, 0x0f81a0290654124aUL,
+	/* 153 */ 0x9ed09ca6569b86fdUL, 0x811009fd18af9a2dUL,
+		  0xff08d03f93d8c20aUL, 0x52a148199faef26bUL,
+	/* 154 */ 0x3e03f9dc2d8d1b73UL, 0x4205801873961a70UL,
+		  0xc0d987f041a35970UL, 0x07aa1f15a1c0d549UL,
+	/* 155 */ 0xdfd46ce08cd27224UL, 0x6d0a024f934e4239UL,
+		  0x808a7a6399897b59UL, 0x0a4556e9e13d95a2UL,
+	/* 156 */ 0xd21a991fe9c13045UL, 0x9b0e8548fe7751b8UL,
+		  0x5da643cb4bf30035UL, 0x77db28d63940f721UL,
+	/* 157 */ 0xfc5eeb614adc9011UL, 0x5229419ae8c411ebUL,
+		  0x9ec3e7787d1dcf74UL, 0x340d053e216e4cb5UL,
+	/* 158 */ 0xcac7af39b48df2b4UL, 0xc0faec2871a10a94UL,
+		  0x140a69245ca575edUL, 0x0cf1c37134273a4cUL,
+	/* 159 */ 0xc8ee306ac224b8a5UL, 0x57eaee7ccb4930b0UL,
+		  0xa1e806bdaacbe74fUL, 0x7d9a62742eeb657dUL,
+	/* 160 */ 0x9eb6b6ef546c4830UL, 0x885cca1fddb36e2eUL,
+		  0xe6b9f383ef0d7105UL, 0x58654fef9d2e0412UL,
+	/* 161 */ 0xa905c4ffbe0e8e26UL, 0x942de5df9b31816eUL,
+		  0x497d723f802e88e1UL, 0x30684dea602f408dUL,
+	/* 162 */ 0x21e5a278a3e6cb34UL, 0xaefb6e6f5b151dc4UL,
+		  0xb30b8e049d77ca15UL, 0x28c3c9cf53b98981UL,
+	/* 163 */ 0x287fb721556cdd2aUL, 0x0d317ca897022274UL,
+		  0x7468c7423a543258UL, 0x4a7f11464eb5642fUL,
+	/* 164 */ 0xa237a4774d193aa6UL, 0xd865986ea92129a1UL,
+		  0x24c515ecf87c1a88UL, 0x604003575f39f5ebUL,
+	/* 165 */ 0x47b9f189570a9b27UL, 0x2b98cede465e4b78UL,
+		  0x026df551dbb85c20UL, 0x74fcd91047e21901UL,
+	/* 166 */ 0x13e2a90a23c1bfa3UL, 0x0cb0074e478519f6UL,
+		  0x5ff1cbbe3af6cf44UL, 0x67fe5438be812dbeUL,
+	/* 167 */ 0xd13cf64fa40f05b0UL, 0x054dfb2f32283787UL,
+		  0x4173915b7f0d2aeaUL, 0x482f144f1f610d4eUL,
+	/* 168 */ 0xf6210201b47f8234UL, 0x5d0ae1929e70b990UL,
+		  0xdcd7f455b049567cUL, 0x7e93d0f1f0916f01UL,
+	/* 169 */ 0xdd79cbf18a7db4faUL, 0xbe8391bf6f74c62fUL,
+		  0x027145d14b8291bdUL, 0x585a73ea2cbf1705UL,
+	/* 170 */ 0x485ca03e928a0db2UL, 0x10fc01a5742857e7UL,
+		  0x2f482edbd6d551a7UL, 0x0f0433b5048fdb8aUL,
+	/* 171 */ 0x60da2e8dd7dc6247UL, 0x88b4c9d38cd4819aUL,
+		  0x13033ac001f66697UL, 0x273b24fe3b367d75UL,
+	/* 172 */ 0xc6e8f66a31b3b9d4UL, 0x281514a494df49d5UL,
+		  0xd1726fdfc8b23da7UL, 0x4b3ae7d103dee548UL,
+	/* 173 */ 0xc6256e19ce4b9d7eUL, 0xff5c5cf186e3c61cUL,
+		  0xacc63ca34b8ec145UL, 0x74621888fee66574UL,
+	/* 174 */ 0x956f409645290a1eUL, 0xef0bf8e3263a962eUL,
+		  0xed6a50eb5ec2647bUL, 0x0694283a9dca7502UL,
+	/* 175 */ 0x769b963643a2dcd1UL, 0x42b7c8ea09fc5353UL,
+		  0x4f002aee13397eabUL, 0x63005e2c19b7d63aUL,
+	/* 176 */ 0xca6736da63023beaUL, 0x966c7f6db12a99b7UL,
+		  0xace09390c537c5e1UL, 0x0b696063a1aa89eeUL,
+	/* 177 */ 0xebb03e97288c56e5UL, 0x432a9f9f938c8be8UL,
+		  0xa6a5a93d5b717f71UL, 0x1a5fb4c3e18f9d97UL,
+	/* 178 */ 0x1c94e7ad1c60cdceUL, 0xee202a43fc02c4a0UL,
+		  0x8dafe4d867c46a20UL, 0x0a10263c8ac27b58UL,
+	/* 179 */ 0xd0dea9dfe4432a4aUL, 0x856af87bbe9277c5UL,
+		  0xce8472acc212c71aUL, 0x6f151b6d9bbb1e91UL,
+	/* 180 */ 0x26776c527ceed56aUL, 0x7d211cb7fbf8faecUL,
+		  0x37ae66a6fd4609ccUL, 0x1f81b702d2770c42UL,
+	/* 181 */ 0x2fb0b057eac58392UL, 0xe1dd89fe29744e9dUL,
+		  0xc964f8eb17beb4f8UL, 0x29571073c9a2d41eUL,
+	/* 182 */ 0xa948a18981c0e254UL, 0x2df6369b65b22830UL,
+		  0xa33eb2d75fcfd3c6UL, 0x078cd6ec4199a01fUL,
+	/* 183 */ 0x4a584a41ad900d2fUL, 0x32142b78e2c74c52UL,
+		  0x68c4e8338431c978UL, 0x7f69ea9008689fc2UL,
+	/* 184 */ 0x52f2c81e46a38265UL, 0xfd78072d04a832fdUL,
+		  0x8cd7d5fa25359e94UL, 0x4de71b7454cc29d2UL,
+	/* 185 */ 0x42eb60ad1eda6ac9UL, 0x0aad37dfdbc09c3aUL,
+		  0x81004b71e33cc191UL, 0x44e6be345122803cUL,
+	/* 186 */ 0x03fe8388ba1920dbUL, 0xf5d57c32150db008UL,
+		  0x49c8c4281af60c29UL, 0x21edb518de701aeeUL,
+	/* 187 */ 0x7fb63e418f06dc99UL, 0xa4460d99c166d7b8UL,
+		  0x24dd5248ce520a83UL, 0x5ec3ad712b928358UL,
+	/* 188 */ 0x15022a5fbd17930fUL, 0xa4f64a77d82570e3UL,
+		  0x12bc8d6915783712UL, 0x498194c0fc620abbUL,
+	/* 189 */ 0x38a2d9d255686c82UL, 0x785c6bd9193e21f0UL,
+		  0xe4d5c81ab24a5484UL, 0x56307860b2e20989UL,
+	/* 190 */ 0x429d55f78b4d74c4UL, 0x22f1834643350131UL,
+		  0x1e60c24598c71fffUL, 0x59f2f014979983efUL,
+	/* 191 */ 0x46a47d56eb494a44UL, 0x3e22a854d636a18eUL,
+		  0xb346e15274491c3bUL, 0x2ceafd4e5390cde7UL,
+	/* 192 */ 0xba8a8538be0d6675UL, 0x4b9074bb50818e23UL,
+		  0xcbdab89085d304c3UL, 0x61a24fe0e56192c4UL,
+	/* 193 */ 0xcb7615e6db525bcbUL, 0xdd7d8c35a567e4caUL,
+		  0xe6b4153acafcdd69UL, 0x2d668e097f3c9766UL,
+	/* 194 */ 0xa57e7e265ce55ef0UL, 0x5d9f4e527cd4b967UL,
+		  0xfbc83606492fd1e5UL, 0x090d52beb7c3f7aeUL,
+	/* 195 */ 0x09b9515a1e7b4d7cUL, 0x1f266a2599da44c0UL,
+		  0xa1c49548e2c55504UL, 0x7ef04287126f15ccUL,
+	/* 196 */ 0xfed1659dbd30ef15UL, 0x8b4ab9eec4e0277bUL,
+		  0x884d6236a5df3291UL, 0x1fd96ea6bf5cf788UL,
+	/* 197 */ 0x42a161981f190d9aUL, 0x61d849507e6052c1UL,
+		  0x9fe113bf285a2cd5UL, 0x7c22d676dbad85d8UL,
+	/* 198 */ 0x82e770ed2bfbd27dUL, 0x4c05b2ece996f5a5UL,
+		  0xcd40a9c2b0900150UL, 0x5895319213d9bf64UL,
+	/* 199 */ 0xe7cc5d703fea2e08UL, 0xb50c491258e2188cUL,
+		  0xcce30baa48205bf0UL, 0x537c659ccfa32d62UL,
+	/* 200 */ 0x37b6623a98cfc088UL, 0xfe9bed1fa4d6aca4UL,
+		  0x04d29b8e56a8d1b0UL, 0x725f71c40b519575UL,
+	/* 201 */ 0x28c7f89cd0339ce6UL, 0x8367b14469ddc18bUL,
+		  0x883ada83a6a1652cUL, 0x585f1974034d6c17UL,
+	/* 202 */ 0x89cfb266f1b19188UL, 0xe63b4863e7c35217UL,
+		  0xd88c9da6b4c0526aUL, 0x3e035c9df0954635UL,
+	/* 203 */ 0xdd9d5412fb45de9dUL, 0xdd684532e4cff40dUL,
+		  0x4b5c999b151d671cUL, 0x2d8c2cc811e7f690UL,
+	/* 204 */ 0x7f54be1d90055d40UL, 0xa464c5df464aaf40UL,
+		  0x33979624f0e917beUL, 0x2c018dc527356b30UL,
+	/* 205 */ 0xa5415024e330b3d4UL, 0x73ff3d96691652d3UL,
+		  0x94ec42c4ef9b59f1UL, 0x0747201618d08e5aUL,
+	/* 206 */ 0x4d6ca48aca411c53UL, 0x66415f2fcfa66119UL,
+		  0x9c4dd40051e227ffUL, 0x59810bc09a02f7ebUL,
+	/* 207 */ 0x2a7eb171b3dc101dUL, 0x441c5ab99ffef68eUL,
+		  0x32025c9b93b359eaUL, 0x5e8ce0a71e9d112fUL,
+	/* 208 */ 0xbfcccb92429503fdUL, 0xd271ba752f095d55UL,
+		  0x345ead5e972d091eUL, 0x18c8df11a83103baUL,
+	/* 209 */ 0x90cd949a9aed0f4cUL, 0xc5d1f4cb6660e37eUL,
+		  0xb8cac52d56c52e0bUL, 0x6e42e400c5808e0dUL,
+	/* 210 */ 0xa3b46966eeaefd23UL, 0x0c4f1f0be39ecdcaUL,
+		  0x189dc8c9d683a51dUL, 0x51f27f054c09351bUL,
+	/* 211 */ 0x4c487ccd2a320682UL, 0x587ea95bb3df1c96UL,
+		  0xc8ccf79e555cb8e8UL, 0x547dc829a206d73dUL,
+	/* 212 */ 0xb822a6cd80c39b06UL, 0xe96d54732000d4c6UL,
+		  0x28535b6f91463b4dUL, 0x228f4660e2486e1dUL,
+	/* 213 */ 0x98799538de8d3abfUL, 0x8cd8330045ebca6eUL,
+		  0x79952a008221e738UL, 0x4322e1a7535cd2bbUL,
+	/* 214 */ 0xb114c11819d1801cUL, 0x2016e4d84f3f5ec7UL,
+		  0xdd0e2df409260f4cUL, 0x5ec362c0ae5f7266UL,
+	/* 215 */ 0xc0462b18b8b2b4eeUL, 0x7cc8d950274d1afbUL,
+		  0xf25f7105436b02d2UL, 0x43bbf8dcbff9ccd3UL,
+	/* 216 */ 0xb6ad1767a039e9dfUL, 0xb0714da8f69d3583UL,
+		  0x5e55fa18b42931f5UL, 0x4ed5558f33c60961UL,
+	/* 217 */ 0x1fe37901c647a5ddUL, 0x593ddf1f8081d357UL,
+		  0x0249a4fd813fd7a6UL, 0x69acca274e9caf61UL,
+	/* 218 */ 0x047ba3ea330721c9UL, 0x83423fc20e7e1ea0UL,
+		  0x1df4c0af01314a60UL, 0x09a62dab89289527UL,
+	/* 219 */ 0xa5b325a49cc6cb00UL, 0xe94b5dc654b56cb6UL,
+		  0x3be28779adc994a0UL, 0x4296e8f8ba3a4aadUL,
+	/* 220 */ 0x328689761e451eabUL, 0x2e4d598bff59594aUL,
+		  0x49b96853d7a7084aUL, 0x4980a319601420a8UL,
+	/* 221 */ 0x9565b9e12f552c42UL, 0x8a5318db7100fe96UL,
+		  0x05c90b4d43add0d7UL, 0x538b4cd66a5d4edaUL,
+	/* 222 */ 0xf4e94fc3e89f039fUL, 0x592c9af26f618045UL,
+		  0x08a36eb5fd4b9550UL, 0x25fffaf6c2ed1419UL,
+	/* 223 */ 0x34434459cc79d354UL, 0xeeecbfb4b1d5476bUL,
+		  0xddeb34a061615d99UL, 0x5129cecceb64b773UL,
+	/* 224 */ 0xee43215894993520UL, 0x772f9c7cf14c0b3bUL,
+		  0xd2e2fce306bedad5UL, 0x715f42b546f06a97UL,
+	/* 225 */ 0x434ecdceda5b5f1aUL, 0x0da17115a49741a9UL,
+		  0x680bd77c73edad2eUL, 0x487c02354edd9041UL,
+	/* 226 */ 0xb8efeff3a70ed9c4UL, 0x56a32aa3e857e302UL,
+		  0xdf3a68bd48a2a5a0UL, 0x07f650b73176c444UL,
+	/* 227 */ 0xe38b9b1626e0ccb1UL, 0x79e053c18b09fb36UL,
+		  0x56d90319c9f94964UL, 0x1ca941e7ac9ff5c4UL,
+	/* 228 */ 0x49c4df29162fa0bbUL, 0x8488cf3282b33305UL,
+		  0x95dfda14cabb437dUL, 0x3391f78264d5ad86UL,
+	/* 229 */ 0x729ae06ae2b5095dUL, 0xd58a58d73259a946UL,
+		  0xe9834262d13921edUL, 0x27fedafaa54bb592UL,
+	/* 230 */ 0xa99dc5b829ad48bbUL, 0x5f025742499ee260UL,
+		  0x802c8ecd5d7513fdUL, 0x78ceb3ef3f6dd938UL,
+	/* 231 */ 0xc342f44f8a135d94UL, 0x7b9edb44828cdda3UL,
+		  0x9436d11a0537cfe7UL, 0x5064b164ec1ab4c8UL,
+	/* 232 */ 0x7020eccfd37eb2fcUL, 0x1f31ea3ed90d25fcUL,
+		  0x1b930d7bdfa1bb34UL, 0x5344467a48113044UL,
+	/* 233 */ 0x70073170f25e6dfbUL, 0xe385dc1a50114cc8UL,
+		  0x2348698ac8fc4f00UL, 0x2a77a55284dd40d8UL,
+	/* 234 */ 0xfe06afe0c98c6ce4UL, 0xc235df96dddfd6e4UL,
+		  0x1428d01e33bf1ed3UL, 0x785768ec9300bdafUL,
+	/* 235 */ 0x9702e57a91deb63bUL, 0x61bdb8bfe5ce8b80UL,
+		  0x645b426f3d1d58acUL, 0x4804a82227a557bcUL,
+	/* 236 */ 0x8e57048ab44d2601UL, 0x68d6501a4b3a6935UL,
+		  0xc39c9ec3f9e1c293UL, 0x4172f257d4de63e2UL,
+	/* 237 */ 0xd368b450330c6401UL, 0x040d3017418f2391UL,
+		  0x2c34bb6090b7d90dUL, 0x16f649228fdfd51fUL,
+	/* 238 */ 0xbea6818e2b928ef5UL, 0xe28ccf91cdc11e72UL,
+		  0x594aaa68e77a36cdUL, 0x313034806c7ffd0fUL,
+	/* 239 */ 0x8a9d27ac2249bd65UL, 0x19a3b464018e9512UL,
+		  0xc26ccff352b37ec7UL, 0x056f68341d797b21UL,
+	/* 240 */ 0x5e79d6757efd2327UL, 0xfabdbcb6553afe15UL,
+		  0xd3e7222c6eaf5a60UL, 0x7046c76d4dae743bUL,
+	/* 241 */ 0x660be872b18d4a55UL, 0x19992518574e1496UL,
+		  0xc103053a302bdcbbUL, 0x3ed8e9800b218e8eUL,
+	/* 242 */ 0x7b0b9239fa75e03eUL, 0xefe9fb684633c083UL,
+		  0x98a35fbe391a7793UL, 0x6065510fe2d0fe34UL,
+	/* 243 */ 0x55cb668548abad0cUL, 0xb4584548da87e527UL,
+		  0x2c43ecea0107c1ddUL, 0x526028809372de35UL,
+	/* 244 */ 0x3415c56af9213b1fUL, 0x5bee1a4d017e98dbUL,
+		  0x13f6b105b5cf709bUL, 0x5ff20e3482b29ab6UL,
+	/* 245 */ 0x0aa29c75cc2e6c90UL, 0xfc7d73ca3a70e206UL,
+		  0x899fc38fc4b5c515UL, 0x250386b124ffc207UL,
+	/* 246 */ 0x54ea28d5ae3d2b56UL, 0x9913149dd6de60ceUL,
+		  0x16694fc58f06d6c1UL, 0x46b23975eb018fc7UL,
+	/* 247 */ 0x470a6a0fb4b7b4e2UL, 0x5d92475a8f7253deUL,
+		  0xabeee5b52fbd3adbUL, 0x7fa20801a0806968UL,
+	/* 248 */ 0x76f3faf19f7714d2UL, 0xb3e840c12f4660c3UL,
+		  0x0fb4cd8df212744eUL, 0x4b065a251d3a2dd2UL,
+	/* 249 */ 0x5cebde383d77cd4aUL, 0x6adf39df882c9cb1UL,
+		  0xa2dd242eb09af759UL, 0x3147c0e50e5f6422UL,
+	/* 250 */ 0x164ca5101d1350dbUL, 0xf8d13479c33fc962UL,
+		  0xe640ce4d13e5da08UL, 0x4bdee0c45061f8baUL,
+	/* 251 */ 0xd7c46dc1a4edb1c9UL, 0x5514d7b6437fd98aUL,
+		  0x58942f6bb2a1c00bUL, 0x2dffb2ab1d70710eUL,
+	/* 252 */ 0xccdfcf2fc18b6d68UL, 0xa8ebcba8b7806167UL,
+		  0x980697f95e2937e3UL, 0x02fbba1cd0126e8cUL
+};
+
+/* c is two 512-bit products: c0[0:7]=a0[0:3]*b0[0:3] and c1[8:15]=a1[4:7]*b1[4:7]
+ * a is two 256-bit integers: a0[0:3] and a1[4:7]
+ * b is two 256-bit integers: b0[0:3] and b1[4:7]
+ */
+static void mul2_256x256_integer_adx(u64 *const c, const u64 *const a,
+				     const u64 *const b)
+{
+	asm volatile(
+		"xorl %%r14d, %%r14d ;"
+		"movq   (%1), %%rdx; "	/* A[0] */
+		"mulx   (%2),  %%r8, %%r15; " /* A[0]*B[0] */
+		"xorl %%r10d, %%r10d ;"
+		"movq %%r8, (%0) ;"
+		"mulx  8(%2), %%r10, %%rax; " /* A[0]*B[1] */
+		"adox %%r10, %%r15 ;"
+		"mulx 16(%2),  %%r8, %%rbx; " /* A[0]*B[2] */
+		"adox  %%r8, %%rax ;"
+		"mulx 24(%2), %%r10, %%rcx; " /* A[0]*B[3] */
+		"adox %%r10, %%rbx ;"
+		/******************************************/
+		"adox %%r14, %%rcx ;"
+
+		"movq  8(%1), %%rdx; "	/* A[1] */
+		"mulx   (%2),  %%r8,  %%r9; " /* A[1]*B[0] */
+		"adox %%r15,  %%r8 ;"
+		"movq  %%r8, 8(%0) ;"
+		"mulx  8(%2), %%r10, %%r11; " /* A[1]*B[1] */
+		"adox %%r10,  %%r9 ;"
+		"adcx  %%r9, %%rax ;"
+		"mulx 16(%2),  %%r8, %%r13; " /* A[1]*B[2] */
+		"adox  %%r8, %%r11 ;"
+		"adcx %%r11, %%rbx ;"
+		"mulx 24(%2), %%r10, %%r15; " /* A[1]*B[3] */
+		"adox %%r10, %%r13 ;"
+		"adcx %%r13, %%rcx ;"
+		/******************************************/
+		"adox %%r14, %%r15 ;"
+		"adcx %%r14, %%r15 ;"
+
+		"movq 16(%1), %%rdx; " /* A[2] */
+		"xorl %%r10d, %%r10d ;"
+		"mulx   (%2),  %%r8,  %%r9; " /* A[2]*B[0] */
+		"adox %%rax,  %%r8 ;"
+		"movq %%r8, 16(%0) ;"
+		"mulx  8(%2), %%r10, %%r11; " /* A[2]*B[1] */
+		"adox %%r10,  %%r9 ;"
+		"adcx  %%r9, %%rbx ;"
+		"mulx 16(%2),  %%r8, %%r13; " /* A[2]*B[2] */
+		"adox  %%r8, %%r11 ;"
+		"adcx %%r11, %%rcx ;"
+		"mulx 24(%2), %%r10, %%rax; " /* A[2]*B[3] */
+		"adox %%r10, %%r13 ;"
+		"adcx %%r13, %%r15 ;"
+		/******************************************/
+		"adox %%r14, %%rax ;"
+		"adcx %%r14, %%rax ;"
+
+		"movq 24(%1), %%rdx; " /* A[3] */
+		"xorl %%r10d, %%r10d ;"
+		"mulx   (%2),  %%r8,  %%r9; " /* A[3]*B[0] */
+		"adox %%rbx,  %%r8 ;"
+		"movq %%r8, 24(%0) ;"
+		"mulx  8(%2), %%r10, %%r11; " /* A[3]*B[1] */
+		"adox %%r10,  %%r9 ;"
+		"adcx  %%r9, %%rcx ;"
+		"movq %%rcx, 32(%0) ;"
+		"mulx 16(%2),  %%r8, %%r13; " /* A[3]*B[2] */
+		"adox  %%r8, %%r11 ;"
+		"adcx %%r11, %%r15 ;"
+		"movq %%r15, 40(%0) ;"
+		"mulx 24(%2), %%r10, %%rbx; " /* A[3]*B[3] */
+		"adox %%r10, %%r13 ;"
+		"adcx %%r13, %%rax ;"
+		"movq %%rax, 48(%0) ;"
+		/******************************************/
+		"adox %%r14, %%rbx ;"
+		"adcx %%r14, %%rbx ;"
+		"movq %%rbx, 56(%0) ;"
+
+		"movq 32(%1), %%rdx; "	/* C[0] */
+		"mulx 32(%2),  %%r8, %%r15; " /* C[0]*D[0] */
+		"xorl %%r10d, %%r10d ;"
+		"movq %%r8, 64(%0);"
+		"mulx 40(%2), %%r10, %%rax; " /* C[0]*D[1] */
+		"adox %%r10, %%r15 ;"
+		"mulx 48(%2),  %%r8, %%rbx; " /* C[0]*D[2] */
+		"adox  %%r8, %%rax ;"
+		"mulx 56(%2), %%r10, %%rcx; " /* C[0]*D[3] */
+		"adox %%r10, %%rbx ;"
+		/******************************************/
+		"adox %%r14, %%rcx ;"
+
+		"movq 40(%1), %%rdx; " /* C[1] */
+		"xorl %%r10d, %%r10d ;"
+		"mulx 32(%2),  %%r8,  %%r9; " /* C[1]*D[0] */
+		"adox %%r15,  %%r8 ;"
+		"movq  %%r8, 72(%0);"
+		"mulx 40(%2), %%r10, %%r11; " /* C[1]*D[1] */
+		"adox %%r10,  %%r9 ;"
+		"adcx  %%r9, %%rax ;"
+		"mulx 48(%2),  %%r8, %%r13; " /* C[1]*D[2] */
+		"adox  %%r8, %%r11 ;"
+		"adcx %%r11, %%rbx ;"
+		"mulx 56(%2), %%r10, %%r15; " /* C[1]*D[3] */
+		"adox %%r10, %%r13 ;"
+		"adcx %%r13, %%rcx ;"
+		/******************************************/
+		"adox %%r14, %%r15 ;"
+		"adcx %%r14, %%r15 ;"
+
+		"movq 48(%1), %%rdx; " /* C[2] */
+		"xorl %%r10d, %%r10d ;"
+		"mulx 32(%2),  %%r8,  %%r9; " /* C[2]*D[0] */
+		"adox %%rax,  %%r8 ;"
+		"movq  %%r8, 80(%0);"
+		"mulx 40(%2), %%r10, %%r11; " /* C[2]*D[1] */
+		"adox %%r10,  %%r9 ;"
+		"adcx  %%r9, %%rbx ;"
+		"mulx 48(%2),  %%r8, %%r13; " /* C[2]*D[2] */
+		"adox  %%r8, %%r11 ;"
+		"adcx %%r11, %%rcx ;"
+		"mulx 56(%2), %%r10, %%rax; " /* C[2]*D[3] */
+		"adox %%r10, %%r13 ;"
+		"adcx %%r13, %%r15 ;"
+		/******************************************/
+		"adox %%r14, %%rax ;"
+		"adcx %%r14, %%rax ;"
+
+		"movq 56(%1), %%rdx; " /* C[3] */
+		"xorl %%r10d, %%r10d ;"
+		"mulx 32(%2),  %%r8,  %%r9; " /* C[3]*D[0] */
+		"adox %%rbx,  %%r8 ;"
+		"movq  %%r8, 88(%0);"
+		"mulx 40(%2), %%r10, %%r11; " /* C[3]*D[1] */
+		"adox %%r10,  %%r9 ;"
+		"adcx  %%r9, %%rcx ;"
+		"movq %%rcx,  96(%0) ;"
+		"mulx 48(%2),  %%r8, %%r13; " /* C[3]*D[2] */
+		"adox  %%r8, %%r11 ;"
+		"adcx %%r11, %%r15 ;"
+		"movq %%r15, 104(%0) ;"
+		"mulx 56(%2), %%r10, %%rbx; " /* C[3]*D[3] */
+		"adox %%r10, %%r13 ;"
+		"adcx %%r13, %%rax ;"
+		"movq %%rax, 112(%0) ;"
+		/******************************************/
+		"adox %%r14, %%rbx ;"
+		"adcx %%r14, %%rbx ;"
+		"movq %%rbx, 120(%0) ;"
+		:
+		: "r"(c), "r"(a), "r"(b)
+		: "memory", "cc", "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9",
+		  "%r10", "%r11", "%r13", "%r14", "%r15");
+}
+
+static void mul2_256x256_integer_bmi2(u64 *const c, const u64 *const a,
+				      const u64 *const b)
+{
+	asm volatile(
+		"movq   (%1), %%rdx; "	/* A[0] */
+		"mulx   (%2),  %%r8, %%r15; " /* A[0]*B[0] */
+		"movq %%r8,  (%0) ;"
+		"mulx  8(%2), %%r10, %%rax; " /* A[0]*B[1] */
+		"addq %%r10, %%r15 ;"
+		"mulx 16(%2),  %%r8, %%rbx; " /* A[0]*B[2] */
+		"adcq  %%r8, %%rax ;"
+		"mulx 24(%2), %%r10, %%rcx; " /* A[0]*B[3] */
+		"adcq %%r10, %%rbx ;"
+		/******************************************/
+		"adcq    $0, %%rcx ;"
+
+		"movq  8(%1), %%rdx; "	/* A[1] */
+		"mulx   (%2),  %%r8,  %%r9; " /* A[1]*B[0] */
+		"addq %%r15,  %%r8 ;"
+		"movq %%r8, 8(%0) ;"
+		"mulx  8(%2), %%r10, %%r11; " /* A[1]*B[1] */
+		"adcq %%r10,  %%r9 ;"
+		"mulx 16(%2),  %%r8, %%r13; " /* A[1]*B[2] */
+		"adcq  %%r8, %%r11 ;"
+		"mulx 24(%2), %%r10, %%r15; " /* A[1]*B[3] */
+		"adcq %%r10, %%r13 ;"
+		/******************************************/
+		"adcq    $0, %%r15 ;"
+
+		"addq  %%r9, %%rax ;"
+		"adcq %%r11, %%rbx ;"
+		"adcq %%r13, %%rcx ;"
+		"adcq    $0, %%r15 ;"
+
+		"movq 16(%1), %%rdx; "	/* A[2] */
+		"mulx   (%2),  %%r8,  %%r9; " /* A[2]*B[0] */
+		"addq %%rax,  %%r8 ;"
+		"movq %%r8, 16(%0) ;"
+		"mulx  8(%2), %%r10, %%r11; " /* A[2]*B[1] */
+		"adcq %%r10,  %%r9 ;"
+		"mulx 16(%2),  %%r8, %%r13; " /* A[2]*B[2] */
+		"adcq  %%r8, %%r11 ;"
+		"mulx 24(%2), %%r10, %%rax; " /* A[2]*B[3] */
+		"adcq %%r10, %%r13 ;"
+		/******************************************/
+		"adcq    $0, %%rax ;"
+
+		"addq  %%r9, %%rbx ;"
+		"adcq %%r11, %%rcx ;"
+		"adcq %%r13, %%r15 ;"
+		"adcq    $0, %%rax ;"
+
+		"movq 24(%1), %%rdx; "	/* A[3] */
+		"mulx   (%2),  %%r8,  %%r9; " /* A[3]*B[0] */
+		"addq %%rbx,  %%r8 ;"
+		"movq %%r8, 24(%0) ;"
+		"mulx  8(%2), %%r10, %%r11; " /* A[3]*B[1] */
+		"adcq %%r10,  %%r9 ;"
+		"mulx 16(%2),  %%r8, %%r13; " /* A[3]*B[2] */
+		"adcq  %%r8, %%r11 ;"
+		"mulx 24(%2), %%r10, %%rbx; " /* A[3]*B[3] */
+		"adcq %%r10, %%r13 ;"
+		/******************************************/
+		"adcq    $0, %%rbx ;"
+
+		"addq  %%r9, %%rcx ;"
+		"movq %%rcx, 32(%0) ;"
+		"adcq %%r11, %%r15 ;"
+		"movq %%r15, 40(%0) ;"
+		"adcq %%r13, %%rax ;"
+		"movq %%rax, 48(%0) ;"
+		"adcq    $0, %%rbx ;"
+		"movq %%rbx, 56(%0) ;"
+
+		"movq 32(%1), %%rdx; "	/* C[0] */
+		"mulx 32(%2),  %%r8, %%r15; " /* C[0]*D[0] */
+		"movq %%r8, 64(%0) ;"
+		"mulx 40(%2), %%r10, %%rax; " /* C[0]*D[1] */
+		"addq %%r10, %%r15 ;"
+		"mulx 48(%2),  %%r8, %%rbx; " /* C[0]*D[2] */
+		"adcq  %%r8, %%rax ;"
+		"mulx 56(%2), %%r10, %%rcx; " /* C[0]*D[3] */
+		"adcq %%r10, %%rbx ;"
+		/******************************************/
+		"adcq    $0, %%rcx ;"
+
+		"movq 40(%1), %%rdx; "	/* C[1] */
+		"mulx 32(%2),  %%r8,  %%r9; " /* C[1]*D[0] */
+		"addq %%r15,  %%r8 ;"
+		"movq %%r8, 72(%0) ;"
+		"mulx 40(%2), %%r10, %%r11; " /* C[1]*D[1] */
+		"adcq %%r10,  %%r9 ;"
+		"mulx 48(%2),  %%r8, %%r13; " /* C[1]*D[2] */
+		"adcq  %%r8, %%r11 ;"
+		"mulx 56(%2), %%r10, %%r15; " /* C[1]*D[3] */
+		"adcq %%r10, %%r13 ;"
+		/******************************************/
+		"adcq    $0, %%r15 ;"
+
+		"addq  %%r9, %%rax ;"
+		"adcq %%r11, %%rbx ;"
+		"adcq %%r13, %%rcx ;"
+		"adcq    $0, %%r15 ;"
+
+		"movq 48(%1), %%rdx; "	/* C[2] */
+		"mulx 32(%2),  %%r8,  %%r9; " /* C[2]*D[0] */
+		"addq %%rax,  %%r8 ;"
+		"movq %%r8, 80(%0) ;"
+		"mulx 40(%2), %%r10, %%r11; " /* C[2]*D[1] */
+		"adcq %%r10,  %%r9 ;"
+		"mulx 48(%2),  %%r8, %%r13; " /* C[2]*D[2] */
+		"adcq  %%r8, %%r11 ;"
+		"mulx 56(%2), %%r10, %%rax; " /* C[2]*D[3] */
+		"adcq %%r10, %%r13 ;"
+		/******************************************/
+		"adcq    $0, %%rax ;"
+
+		"addq  %%r9, %%rbx ;"
+		"adcq %%r11, %%rcx ;"
+		"adcq %%r13, %%r15 ;"
+		"adcq    $0, %%rax ;"
+
+		"movq 56(%1), %%rdx; "	/* C[3] */
+		"mulx 32(%2),  %%r8,  %%r9; " /* C[3]*D[0] */
+		"addq %%rbx,  %%r8 ;"
+		"movq %%r8, 88(%0) ;"
+		"mulx 40(%2), %%r10, %%r11; " /* C[3]*D[1] */
+		"adcq %%r10,  %%r9 ;"
+		"mulx 48(%2),  %%r8, %%r13; " /* C[3]*D[2] */
+		"adcq  %%r8, %%r11 ;"
+		"mulx 56(%2), %%r10, %%rbx; " /* C[3]*D[3] */
+		"adcq %%r10, %%r13 ;"
+		/******************************************/
+		"adcq    $0, %%rbx ;"
+
+		"addq  %%r9, %%rcx ;"
+		"movq %%rcx,  96(%0) ;"
+		"adcq %%r11, %%r15 ;"
+		"movq %%r15, 104(%0) ;"
+		"adcq %%r13, %%rax ;"
+		"movq %%rax, 112(%0) ;"
+		"adcq    $0, %%rbx ;"
+		"movq %%rbx, 120(%0) ;"
+		:
+		: "r"(c), "r"(a), "r"(b)
+		: "memory", "cc", "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9",
+		  "%r10", "%r11", "%r13", "%r15");
+}
+
+static void sqr2_256x256_integer_adx(u64 *const c, const u64 *const a)
+{
+	asm volatile(
+		"movq   (%1), %%rdx        ;" /* A[0]      */
+		"mulx  8(%1),  %%r8, %%r14 ;" /* A[1]*A[0] */
+		"xorl %%r15d, %%r15d;"
+		"mulx 16(%1),  %%r9, %%r10 ;" /* A[2]*A[0] */
+		"adcx %%r14,  %%r9 ;"
+		"mulx 24(%1), %%rax, %%rcx ;" /* A[3]*A[0] */
+		"adcx %%rax, %%r10 ;"
+		"movq 24(%1), %%rdx        ;" /* A[3]      */
+		"mulx  8(%1), %%r11, %%rbx ;" /* A[1]*A[3] */
+		"adcx %%rcx, %%r11 ;"
+		"mulx 16(%1), %%rax, %%r13 ;" /* A[2]*A[3] */
+		"adcx %%rax, %%rbx ;"
+		"movq  8(%1), %%rdx        ;" /* A[1]      */
+		"adcx %%r15, %%r13 ;"
+		"mulx 16(%1), %%rax, %%rcx ;" /* A[2]*A[1] */
+		"movq    $0, %%r14 ;"
+		/******************************************/
+		"adcx %%r15, %%r14 ;"
+
+		"xorl %%r15d, %%r15d;"
+		"adox %%rax, %%r10 ;"
+		"adcx  %%r8,  %%r8 ;"
+		"adox %%rcx, %%r11 ;"
+		"adcx  %%r9,  %%r9 ;"
+		"adox %%r15, %%rbx ;"
+		"adcx %%r10, %%r10 ;"
+		"adox %%r15, %%r13 ;"
+		"adcx %%r11, %%r11 ;"
+		"adox %%r15, %%r14 ;"
+		"adcx %%rbx, %%rbx ;"
+		"adcx %%r13, %%r13 ;"
+		"adcx %%r14, %%r14 ;"
+
+		"movq   (%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ;" /* A[0]^2 */
+		/*******************/
+		"movq %%rax,  0(%0) ;"
+		"addq %%rcx,  %%r8 ;"
+		"movq  %%r8,  8(%0) ;"
+		"movq  8(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ;" /* A[1]^2 */
+		"adcq %%rax,  %%r9 ;"
+		"movq  %%r9, 16(%0) ;"
+		"adcq %%rcx, %%r10 ;"
+		"movq %%r10, 24(%0) ;"
+		"movq 16(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ;" /* A[2]^2 */
+		"adcq %%rax, %%r11 ;"
+		"movq %%r11, 32(%0) ;"
+		"adcq %%rcx, %%rbx ;"
+		"movq %%rbx, 40(%0) ;"
+		"movq 24(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ;" /* A[3]^2 */
+		"adcq %%rax, %%r13 ;"
+		"movq %%r13, 48(%0) ;"
+		"adcq %%rcx, %%r14 ;"
+		"movq %%r14, 56(%0) ;"
+
+
+		"movq 32(%1), %%rdx        ;" /* B[0]      */
+		"mulx 40(%1),  %%r8, %%r14 ;" /* B[1]*B[0] */
+		"xorl %%r15d, %%r15d;"
+		"mulx 48(%1),  %%r9, %%r10 ;" /* B[2]*B[0] */
+		"adcx %%r14,  %%r9 ;"
+		"mulx 56(%1), %%rax, %%rcx ;" /* B[3]*B[0] */
+		"adcx %%rax, %%r10 ;"
+		"movq 56(%1), %%rdx        ;" /* B[3]      */
+		"mulx 40(%1), %%r11, %%rbx ;" /* B[1]*B[3] */
+		"adcx %%rcx, %%r11 ;"
+		"mulx 48(%1), %%rax, %%r13 ;" /* B[2]*B[3] */
+		"adcx %%rax, %%rbx ;"
+		"movq 40(%1), %%rdx        ;" /* B[1]      */
+		"adcx %%r15, %%r13 ;"
+		"mulx 48(%1), %%rax, %%rcx ;" /* B[2]*B[1] */
+		"movq    $0, %%r14 ;"
+		/******************************************/
+		"adcx %%r15, %%r14 ;"
+
+		"xorl %%r15d, %%r15d;"
+		"adox %%rax, %%r10 ;"
+		"adcx  %%r8,  %%r8 ;"
+		"adox %%rcx, %%r11 ;"
+		"adcx  %%r9,  %%r9 ;"
+		"adox %%r15, %%rbx ;"
+		"adcx %%r10, %%r10 ;"
+		"adox %%r15, %%r13 ;"
+		"adcx %%r11, %%r11 ;"
+		"adox %%r15, %%r14 ;"
+		"adcx %%rbx, %%rbx ;"
+		"adcx %%r13, %%r13 ;"
+		"adcx %%r14, %%r14 ;"
+
+		"movq 32(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ;" /* B[0]^2 */
+		/*******************/
+		"movq %%rax,  64(%0) ;"
+		"addq %%rcx,  %%r8 ;"
+		"movq  %%r8,  72(%0) ;"
+		"movq 40(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ;" /* B[1]^2 */
+		"adcq %%rax,  %%r9 ;"
+		"movq  %%r9,  80(%0) ;"
+		"adcq %%rcx, %%r10 ;"
+		"movq %%r10,  88(%0) ;"
+		"movq 48(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ;" /* B[2]^2 */
+		"adcq %%rax, %%r11 ;"
+		"movq %%r11,  96(%0) ;"
+		"adcq %%rcx, %%rbx ;"
+		"movq %%rbx, 104(%0) ;"
+		"movq 56(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ;" /* B[3]^2 */
+		"adcq %%rax, %%r13 ;"
+		"movq %%r13, 112(%0) ;"
+		"adcq %%rcx, %%r14 ;"
+		"movq %%r14, 120(%0) ;"
+		:
+		: "r"(c), "r"(a)
+		: "memory", "cc", "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9",
+		  "%r10", "%r11", "%r13", "%r14", "%r15");
+}
+
+static void sqr2_256x256_integer_bmi2(u64 *const c, const u64 *const a)
+{
+	asm volatile(
+		"movq  8(%1), %%rdx        ;" /* A[1]      */
+		"mulx   (%1),  %%r8,  %%r9 ;" /* A[0]*A[1] */
+		"mulx 16(%1), %%r10, %%r11 ;" /* A[2]*A[1] */
+		"mulx 24(%1), %%rcx, %%r14 ;" /* A[3]*A[1] */
+
+		"movq 16(%1), %%rdx        ;" /* A[2]      */
+		"mulx 24(%1), %%r15, %%r13 ;" /* A[3]*A[2] */
+		"mulx   (%1), %%rax, %%rdx ;" /* A[0]*A[2] */
+
+		"addq %%rax,  %%r9 ;"
+		"adcq %%rdx, %%r10 ;"
+		"adcq %%rcx, %%r11 ;"
+		"adcq %%r14, %%r15 ;"
+		"adcq    $0, %%r13 ;"
+		"movq    $0, %%r14 ;"
+		"adcq    $0, %%r14 ;"
+
+		"movq   (%1), %%rdx        ;" /* A[0]      */
+		"mulx 24(%1), %%rax, %%rcx ;" /* A[0]*A[3] */
+
+		"addq %%rax, %%r10 ;"
+		"adcq %%rcx, %%r11 ;"
+		"adcq    $0, %%r15 ;"
+		"adcq    $0, %%r13 ;"
+		"adcq    $0, %%r14 ;"
+
+		"shldq $1, %%r13, %%r14 ;"
+		"shldq $1, %%r15, %%r13 ;"
+		"shldq $1, %%r11, %%r15 ;"
+		"shldq $1, %%r10, %%r11 ;"
+		"shldq $1,  %%r9, %%r10 ;"
+		"shldq $1,  %%r8,  %%r9 ;"
+		"shlq  $1,  %%r8        ;"
+
+		/*******************/
+		"mulx %%rdx, %%rax, %%rcx ; " /* A[0]^2 */
+		/*******************/
+		"movq %%rax,  0(%0) ;"
+		"addq %%rcx,  %%r8 ;"
+		"movq  %%r8,  8(%0) ;"
+		"movq  8(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ; " /* A[1]^2 */
+		"adcq %%rax,  %%r9 ;"
+		"movq  %%r9, 16(%0) ;"
+		"adcq %%rcx, %%r10 ;"
+		"movq %%r10, 24(%0) ;"
+		"movq 16(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ; " /* A[2]^2 */
+		"adcq %%rax, %%r11 ;"
+		"movq %%r11, 32(%0) ;"
+		"adcq %%rcx, %%r15 ;"
+		"movq %%r15, 40(%0) ;"
+		"movq 24(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ; " /* A[3]^2 */
+		"adcq %%rax, %%r13 ;"
+		"movq %%r13, 48(%0) ;"
+		"adcq %%rcx, %%r14 ;"
+		"movq %%r14, 56(%0) ;"
+
+		"movq 40(%1), %%rdx        ;" /* B[1]      */
+		"mulx 32(%1),  %%r8,  %%r9 ;" /* B[0]*B[1] */
+		"mulx 48(%1), %%r10, %%r11 ;" /* B[2]*B[1] */
+		"mulx 56(%1), %%rcx, %%r14 ;" /* B[3]*B[1] */
+
+		"movq 48(%1), %%rdx        ;" /* B[2]      */
+		"mulx 56(%1), %%r15, %%r13 ;" /* B[3]*B[2] */
+		"mulx 32(%1), %%rax, %%rdx ;" /* B[0]*B[2] */
+
+		"addq %%rax,  %%r9 ;"
+		"adcq %%rdx, %%r10 ;"
+		"adcq %%rcx, %%r11 ;"
+		"adcq %%r14, %%r15 ;"
+		"adcq    $0, %%r13 ;"
+		"movq    $0, %%r14 ;"
+		"adcq    $0, %%r14 ;"
+
+		"movq 32(%1), %%rdx        ;" /* B[0]      */
+		"mulx 56(%1), %%rax, %%rcx ;" /* B[0]*B[3] */
+
+		"addq %%rax, %%r10 ;"
+		"adcq %%rcx, %%r11 ;"
+		"adcq    $0, %%r15 ;"
+		"adcq    $0, %%r13 ;"
+		"adcq    $0, %%r14 ;"
+
+		"shldq $1, %%r13, %%r14 ;"
+		"shldq $1, %%r15, %%r13 ;"
+		"shldq $1, %%r11, %%r15 ;"
+		"shldq $1, %%r10, %%r11 ;"
+		"shldq $1,  %%r9, %%r10 ;"
+		"shldq $1,  %%r8,  %%r9 ;"
+		"shlq  $1,  %%r8        ;"
+
+		/*******************/
+		"mulx %%rdx, %%rax, %%rcx ; " /* B[0]^2 */
+		/*******************/
+		"movq %%rax,  64(%0) ;"
+		"addq %%rcx,  %%r8 ;"
+		"movq  %%r8,  72(%0) ;"
+		"movq 40(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ; " /* B[1]^2 */
+		"adcq %%rax,  %%r9 ;"
+		"movq  %%r9,  80(%0) ;"
+		"adcq %%rcx, %%r10 ;"
+		"movq %%r10,  88(%0) ;"
+		"movq 48(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ; " /* B[2]^2 */
+		"adcq %%rax, %%r11 ;"
+		"movq %%r11,  96(%0) ;"
+		"adcq %%rcx, %%r15 ;"
+		"movq %%r15, 104(%0) ;"
+		"movq 56(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ; " /* B[3]^2 */
+		"adcq %%rax, %%r13 ;"
+		"movq %%r13, 112(%0) ;"
+		"adcq %%rcx, %%r14 ;"
+		"movq %%r14, 120(%0) ;"
+		:
+		: "r"(c), "r"(a)
+		: "memory", "cc", "%rax", "%rcx", "%rdx", "%r8", "%r9", "%r10",
+		  "%r11", "%r13", "%r14", "%r15");
+}
+
+static void red_eltfp25519_2w_adx(u64 *const c, const u64 *const a)
+{
+	asm volatile(
+		"movl    $38, %%edx; "	/* 2*c = 38 = 2^256 */
+		"mulx 32(%1),  %%r8, %%r10; " /* c*C[4] */
+		"xorl %%ebx, %%ebx ;"
+		"adox   (%1),  %%r8 ;"
+		"mulx 40(%1),  %%r9, %%r11; " /* c*C[5] */
+		"adcx %%r10,  %%r9 ;"
+		"adox  8(%1),  %%r9 ;"
+		"mulx 48(%1), %%r10, %%rax; " /* c*C[6] */
+		"adcx %%r11, %%r10 ;"
+		"adox 16(%1), %%r10 ;"
+		"mulx 56(%1), %%r11, %%rcx; " /* c*C[7] */
+		"adcx %%rax, %%r11 ;"
+		"adox 24(%1), %%r11 ;"
+		/***************************************/
+		"adcx %%rbx, %%rcx ;"
+		"adox  %%rbx, %%rcx ;"
+		"imul %%rdx, %%rcx ;" /* c*C[4], cf=0, of=0 */
+		"adcx %%rcx,  %%r8 ;"
+		"adcx %%rbx,  %%r9 ;"
+		"movq  %%r9,  8(%0) ;"
+		"adcx %%rbx, %%r10 ;"
+		"movq %%r10, 16(%0) ;"
+		"adcx %%rbx, %%r11 ;"
+		"movq %%r11, 24(%0) ;"
+		"mov     $0, %%ecx ;"
+		"cmovc %%edx, %%ecx ;"
+		"addq %%rcx,  %%r8 ;"
+		"movq  %%r8,   (%0) ;"
+
+		"mulx  96(%1),  %%r8, %%r10; " /* c*C[4] */
+		"xorl %%ebx, %%ebx ;"
+		"adox 64(%1),  %%r8 ;"
+		"mulx 104(%1),  %%r9, %%r11; " /* c*C[5] */
+		"adcx %%r10,  %%r9 ;"
+		"adox 72(%1),  %%r9 ;"
+		"mulx 112(%1), %%r10, %%rax; " /* c*C[6] */
+		"adcx %%r11, %%r10 ;"
+		"adox 80(%1), %%r10 ;"
+		"mulx 120(%1), %%r11, %%rcx; " /* c*C[7] */
+		"adcx %%rax, %%r11 ;"
+		"adox 88(%1), %%r11 ;"
+		/****************************************/
+		"adcx %%rbx, %%rcx ;"
+		"adox  %%rbx, %%rcx ;"
+		"imul %%rdx, %%rcx ;" /* c*C[4], cf=0, of=0 */
+		"adcx %%rcx,  %%r8 ;"
+		"adcx %%rbx,  %%r9 ;"
+		"movq  %%r9, 40(%0) ;"
+		"adcx %%rbx, %%r10 ;"
+		"movq %%r10, 48(%0) ;"
+		"adcx %%rbx, %%r11 ;"
+		"movq %%r11, 56(%0) ;"
+		"mov     $0, %%ecx ;"
+		"cmovc %%edx, %%ecx ;"
+		"addq %%rcx,  %%r8 ;"
+		"movq  %%r8, 32(%0) ;"
+		:
+		: "r"(c), "r"(a)
+		: "memory", "cc", "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9",
+		  "%r10", "%r11");
+}
+
+static void red_eltfp25519_2w_bmi2(u64 *const c, const u64 *const a)
+{
+	asm volatile(
+		"movl    $38, %%edx ; "       /* 2*c = 38 = 2^256 */
+		"mulx 32(%1),  %%r8, %%r10 ;" /* c*C[4] */
+		"mulx 40(%1),  %%r9, %%r11 ;" /* c*C[5] */
+		"addq %%r10,  %%r9 ;"
+		"mulx 48(%1), %%r10, %%rax ;" /* c*C[6] */
+		"adcq %%r11, %%r10 ;"
+		"mulx 56(%1), %%r11, %%rcx ;" /* c*C[7] */
+		"adcq %%rax, %%r11 ;"
+		/***************************************/
+		"adcq    $0, %%rcx ;"
+		"addq   (%1),  %%r8 ;"
+		"adcq  8(%1),  %%r9 ;"
+		"adcq 16(%1), %%r10 ;"
+		"adcq 24(%1), %%r11 ;"
+		"adcq     $0, %%rcx ;"
+		"imul %%rdx, %%rcx ;" /* c*C[4], cf=0 */
+		"addq %%rcx,  %%r8 ;"
+		"adcq    $0,  %%r9 ;"
+		"movq  %%r9,  8(%0) ;"
+		"adcq    $0, %%r10 ;"
+		"movq %%r10, 16(%0) ;"
+		"adcq    $0, %%r11 ;"
+		"movq %%r11, 24(%0) ;"
+		"mov     $0, %%ecx ;"
+		"cmovc %%edx, %%ecx ;"
+		"addq %%rcx,  %%r8 ;"
+		"movq  %%r8,   (%0) ;"
+
+		"mulx  96(%1),  %%r8, %%r10 ;" /* c*C[4] */
+		"mulx 104(%1),  %%r9, %%r11 ;" /* c*C[5] */
+		"addq %%r10,  %%r9 ;"
+		"mulx 112(%1), %%r10, %%rax ;" /* c*C[6] */
+		"adcq %%r11, %%r10 ;"
+		"mulx 120(%1), %%r11, %%rcx ;" /* c*C[7] */
+		"adcq %%rax, %%r11 ;"
+		/****************************************/
+		"adcq    $0, %%rcx ;"
+		"addq 64(%1),  %%r8 ;"
+		"adcq 72(%1),  %%r9 ;"
+		"adcq 80(%1), %%r10 ;"
+		"adcq 88(%1), %%r11 ;"
+		"adcq     $0, %%rcx ;"
+		"imul %%rdx, %%rcx ;" /* c*C[4], cf=0 */
+		"addq %%rcx,  %%r8 ;"
+		"adcq    $0,  %%r9 ;"
+		"movq  %%r9, 40(%0) ;"
+		"adcq    $0, %%r10 ;"
+		"movq %%r10, 48(%0) ;"
+		"adcq    $0, %%r11 ;"
+		"movq %%r11, 56(%0) ;"
+		"mov     $0, %%ecx ;"
+		"cmovc %%edx, %%ecx ;"
+		"addq %%rcx,  %%r8 ;"
+		"movq  %%r8, 32(%0) ;"
+		:
+		: "r"(c), "r"(a)
+		: "memory", "cc", "%rax", "%rcx", "%rdx", "%r8", "%r9", "%r10",
+		  "%r11");
+}
+
+static void mul_256x256_integer_adx(u64 *const c, const u64 *const a,
+				    const u64 *const b)
+{
+	asm volatile(
+		"movq   (%1), %%rdx; "	/* A[0] */
+		"mulx   (%2),  %%r8,  %%r9; " /* A[0]*B[0] */
+		"xorl %%r10d, %%r10d ;"
+		"movq  %%r8,  (%0) ;"
+		"mulx  8(%2), %%r10, %%r11; " /* A[0]*B[1] */
+		"adox  %%r9, %%r10 ;"
+		"movq %%r10, 8(%0) ;"
+		"mulx 16(%2), %%r15, %%r13; " /* A[0]*B[2] */
+		"adox %%r11, %%r15 ;"
+		"mulx 24(%2), %%r14, %%rdx; " /* A[0]*B[3] */
+		"adox %%r13, %%r14 ;"
+		"movq $0, %%rax ;"
+		/******************************************/
+		"adox %%rdx, %%rax ;"
+
+		"movq  8(%1), %%rdx; "	/* A[1] */
+		"mulx   (%2),  %%r8,  %%r9; " /* A[1]*B[0] */
+		"xorl %%r10d, %%r10d ;"
+		"adcx 8(%0),  %%r8 ;"
+		"movq  %%r8,  8(%0) ;"
+		"mulx  8(%2), %%r10, %%r11; " /* A[1]*B[1] */
+		"adox  %%r9, %%r10 ;"
+		"adcx %%r15, %%r10 ;"
+		"movq %%r10, 16(%0) ;"
+		"mulx 16(%2), %%r15, %%r13; " /* A[1]*B[2] */
+		"adox %%r11, %%r15 ;"
+		"adcx %%r14, %%r15 ;"
+		"movq $0, %%r8  ;"
+		"mulx 24(%2), %%r14, %%rdx; " /* A[1]*B[3] */
+		"adox %%r13, %%r14 ;"
+		"adcx %%rax, %%r14 ;"
+		"movq $0, %%rax ;"
+		/******************************************/
+		"adox %%rdx, %%rax ;"
+		"adcx  %%r8, %%rax ;"
+
+		"movq 16(%1), %%rdx; "	/* A[2] */
+		"mulx   (%2),  %%r8,  %%r9; " /* A[2]*B[0] */
+		"xorl %%r10d, %%r10d ;"
+		"adcx 16(%0), %%r8 ;"
+		"movq  %%r8, 16(%0) ;"
+		"mulx  8(%2), %%r10, %%r11; " /* A[2]*B[1] */
+		"adox  %%r9, %%r10 ;"
+		"adcx %%r15, %%r10 ;"
+		"movq %%r10, 24(%0) ;"
+		"mulx 16(%2), %%r15, %%r13; " /* A[2]*B[2] */
+		"adox %%r11, %%r15 ;"
+		"adcx %%r14, %%r15 ;"
+		"movq $0, %%r8  ;"
+		"mulx 24(%2), %%r14, %%rdx; " /* A[2]*B[3] */
+		"adox %%r13, %%r14 ;"
+		"adcx %%rax, %%r14 ;"
+		"movq $0, %%rax ;"
+		/******************************************/
+		"adox %%rdx, %%rax ;"
+		"adcx  %%r8, %%rax ;"
+
+		"movq 24(%1), %%rdx; "	/* A[3] */
+		"mulx   (%2),  %%r8,  %%r9; " /* A[3]*B[0] */
+		"xorl %%r10d, %%r10d ;"
+		"adcx 24(%0), %%r8 ;"
+		"movq  %%r8, 24(%0) ;"
+		"mulx  8(%2), %%r10, %%r11; " /* A[3]*B[1] */
+		"adox  %%r9, %%r10 ;"
+		"adcx %%r15, %%r10 ;"
+		"movq %%r10, 32(%0) ;"
+		"mulx 16(%2), %%r15, %%r13; " /* A[3]*B[2] */
+		"adox %%r11, %%r15 ;"
+		"adcx %%r14, %%r15 ;"
+		"movq %%r15, 40(%0) ;"
+		"movq $0, %%r8  ;"
+		"mulx 24(%2), %%r14, %%rdx; " /* A[3]*B[3] */
+		"adox %%r13, %%r14 ;"
+		"adcx %%rax, %%r14 ;"
+		"movq %%r14, 48(%0) ;"
+		"movq $0, %%rax ;"
+		/******************************************/
+		"adox %%rdx, %%rax ;"
+		"adcx  %%r8, %%rax ;"
+		"movq %%rax, 56(%0) ;"
+		:
+		: "r"(c), "r"(a), "r"(b)
+		: "memory", "cc", "%rax", "%rdx", "%r8", "%r9", "%r10", "%r11",
+		  "%r13", "%r14", "%r15");
+}
+
+static void mul_256x256_integer_bmi2(u64 *const c, const u64 *const a,
+				     const u64 *const b)
+{
+	asm volatile(
+		"movq   (%1), %%rdx; "	/* A[0] */
+		"mulx   (%2),  %%r8, %%r15; " /* A[0]*B[0] */
+		"movq %%r8,  (%0) ;"
+		"mulx  8(%2), %%r10, %%rax; " /* A[0]*B[1] */
+		"addq %%r10, %%r15 ;"
+		"mulx 16(%2),  %%r8, %%rbx; " /* A[0]*B[2] */
+		"adcq  %%r8, %%rax ;"
+		"mulx 24(%2), %%r10, %%rcx; " /* A[0]*B[3] */
+		"adcq %%r10, %%rbx ;"
+		/******************************************/
+		"adcq    $0, %%rcx ;"
+
+		"movq  8(%1), %%rdx; "	/* A[1] */
+		"mulx   (%2),  %%r8,  %%r9; " /* A[1]*B[0] */
+		"addq %%r15,  %%r8 ;"
+		"movq %%r8, 8(%0) ;"
+		"mulx  8(%2), %%r10, %%r11; " /* A[1]*B[1] */
+		"adcq %%r10,  %%r9 ;"
+		"mulx 16(%2),  %%r8, %%r13; " /* A[1]*B[2] */
+		"adcq  %%r8, %%r11 ;"
+		"mulx 24(%2), %%r10, %%r15; " /* A[1]*B[3] */
+		"adcq %%r10, %%r13 ;"
+		/******************************************/
+		"adcq    $0, %%r15 ;"
+
+		"addq  %%r9, %%rax ;"
+		"adcq %%r11, %%rbx ;"
+		"adcq %%r13, %%rcx ;"
+		"adcq    $0, %%r15 ;"
+
+		"movq 16(%1), %%rdx; "	/* A[2] */
+		"mulx   (%2),  %%r8,  %%r9; " /* A[2]*B[0] */
+		"addq %%rax,  %%r8 ;"
+		"movq %%r8, 16(%0) ;"
+		"mulx  8(%2), %%r10, %%r11; " /* A[2]*B[1] */
+		"adcq %%r10,  %%r9 ;"
+		"mulx 16(%2),  %%r8, %%r13; " /* A[2]*B[2] */
+		"adcq  %%r8, %%r11 ;"
+		"mulx 24(%2), %%r10, %%rax; " /* A[2]*B[3] */
+		"adcq %%r10, %%r13 ;"
+		/******************************************/
+		"adcq    $0, %%rax ;"
+
+		"addq  %%r9, %%rbx ;"
+		"adcq %%r11, %%rcx ;"
+		"adcq %%r13, %%r15 ;"
+		"adcq    $0, %%rax ;"
+
+		"movq 24(%1), %%rdx; "	/* A[3] */
+		"mulx   (%2),  %%r8,  %%r9; " /* A[3]*B[0] */
+		"addq %%rbx,  %%r8 ;"
+		"movq %%r8, 24(%0) ;"
+		"mulx  8(%2), %%r10, %%r11; " /* A[3]*B[1] */
+		"adcq %%r10,  %%r9 ;"
+		"mulx 16(%2),  %%r8, %%r13; " /* A[3]*B[2] */
+		"adcq  %%r8, %%r11 ;"
+		"mulx 24(%2), %%r10, %%rbx; " /* A[3]*B[3] */
+		"adcq %%r10, %%r13 ;"
+		/******************************************/
+		"adcq    $0, %%rbx ;"
+
+		"addq  %%r9, %%rcx ;"
+		"movq %%rcx, 32(%0) ;"
+		"adcq %%r11, %%r15 ;"
+		"movq %%r15, 40(%0) ;"
+		"adcq %%r13, %%rax ;"
+		"movq %%rax, 48(%0) ;"
+		"adcq    $0, %%rbx ;"
+		"movq %%rbx, 56(%0) ;"
+		:
+		: "r"(c), "r"(a), "r"(b)
+		: "memory", "cc", "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9",
+		  "%r10", "%r11", "%r13", "%r15");
+}
+
+static void sqr_256x256_integer_adx(u64 *const c, const u64 *const a)
+{
+	asm volatile(
+		"movq   (%1), %%rdx        ;" /* A[0]      */
+		"mulx  8(%1),  %%r8, %%r14 ;" /* A[1]*A[0] */
+		"xorl %%r15d, %%r15d;"
+		"mulx 16(%1),  %%r9, %%r10 ;" /* A[2]*A[0] */
+		"adcx %%r14,  %%r9 ;"
+		"mulx 24(%1), %%rax, %%rcx ;" /* A[3]*A[0] */
+		"adcx %%rax, %%r10 ;"
+		"movq 24(%1), %%rdx        ;" /* A[3]      */
+		"mulx  8(%1), %%r11, %%rbx ;" /* A[1]*A[3] */
+		"adcx %%rcx, %%r11 ;"
+		"mulx 16(%1), %%rax, %%r13 ;" /* A[2]*A[3] */
+		"adcx %%rax, %%rbx ;"
+		"movq  8(%1), %%rdx        ;" /* A[1]      */
+		"adcx %%r15, %%r13 ;"
+		"mulx 16(%1), %%rax, %%rcx ;" /* A[2]*A[1] */
+		"movq    $0, %%r14 ;"
+		/******************************************/
+		"adcx %%r15, %%r14 ;"
+
+		"xorl %%r15d, %%r15d;"
+		"adox %%rax, %%r10 ;"
+		"adcx  %%r8,  %%r8 ;"
+		"adox %%rcx, %%r11 ;"
+		"adcx  %%r9,  %%r9 ;"
+		"adox %%r15, %%rbx ;"
+		"adcx %%r10, %%r10 ;"
+		"adox %%r15, %%r13 ;"
+		"adcx %%r11, %%r11 ;"
+		"adox %%r15, %%r14 ;"
+		"adcx %%rbx, %%rbx ;"
+		"adcx %%r13, %%r13 ;"
+		"adcx %%r14, %%r14 ;"
+
+		"movq   (%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ;" /* A[0]^2 */
+		/*******************/
+		"movq %%rax,  0(%0) ;"
+		"addq %%rcx,  %%r8 ;"
+		"movq  %%r8,  8(%0) ;"
+		"movq  8(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ;" /* A[1]^2 */
+		"adcq %%rax,  %%r9 ;"
+		"movq  %%r9, 16(%0) ;"
+		"adcq %%rcx, %%r10 ;"
+		"movq %%r10, 24(%0) ;"
+		"movq 16(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ;" /* A[2]^2 */
+		"adcq %%rax, %%r11 ;"
+		"movq %%r11, 32(%0) ;"
+		"adcq %%rcx, %%rbx ;"
+		"movq %%rbx, 40(%0) ;"
+		"movq 24(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ;" /* A[3]^2 */
+		"adcq %%rax, %%r13 ;"
+		"movq %%r13, 48(%0) ;"
+		"adcq %%rcx, %%r14 ;"
+		"movq %%r14, 56(%0) ;"
+		:
+		: "r"(c), "r"(a)
+		: "memory", "cc", "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9",
+		  "%r10", "%r11", "%r13", "%r14", "%r15");
+}
+
+static void sqr_256x256_integer_bmi2(u64 *const c, const u64 *const a)
+{
+	asm volatile(
+		"movq  8(%1), %%rdx        ;" /* A[1]      */
+		"mulx   (%1),  %%r8,  %%r9 ;" /* A[0]*A[1] */
+		"mulx 16(%1), %%r10, %%r11 ;" /* A[2]*A[1] */
+		"mulx 24(%1), %%rcx, %%r14 ;" /* A[3]*A[1] */
+
+		"movq 16(%1), %%rdx        ;" /* A[2]      */
+		"mulx 24(%1), %%r15, %%r13 ;" /* A[3]*A[2] */
+		"mulx   (%1), %%rax, %%rdx ;" /* A[0]*A[2] */
+
+		"addq %%rax,  %%r9 ;"
+		"adcq %%rdx, %%r10 ;"
+		"adcq %%rcx, %%r11 ;"
+		"adcq %%r14, %%r15 ;"
+		"adcq    $0, %%r13 ;"
+		"movq    $0, %%r14 ;"
+		"adcq    $0, %%r14 ;"
+
+		"movq   (%1), %%rdx        ;" /* A[0]      */
+		"mulx 24(%1), %%rax, %%rcx ;" /* A[0]*A[3] */
+
+		"addq %%rax, %%r10 ;"
+		"adcq %%rcx, %%r11 ;"
+		"adcq    $0, %%r15 ;"
+		"adcq    $0, %%r13 ;"
+		"adcq    $0, %%r14 ;"
+
+		"shldq $1, %%r13, %%r14 ;"
+		"shldq $1, %%r15, %%r13 ;"
+		"shldq $1, %%r11, %%r15 ;"
+		"shldq $1, %%r10, %%r11 ;"
+		"shldq $1,  %%r9, %%r10 ;"
+		"shldq $1,  %%r8,  %%r9 ;"
+		"shlq  $1,  %%r8        ;"
+
+		/*******************/
+		"mulx %%rdx, %%rax, %%rcx ;" /* A[0]^2 */
+		/*******************/
+		"movq %%rax,  0(%0) ;"
+		"addq %%rcx,  %%r8 ;"
+		"movq  %%r8,  8(%0) ;"
+		"movq  8(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ;" /* A[1]^2 */
+		"adcq %%rax,  %%r9 ;"
+		"movq  %%r9, 16(%0) ;"
+		"adcq %%rcx, %%r10 ;"
+		"movq %%r10, 24(%0) ;"
+		"movq 16(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ;" /* A[2]^2 */
+		"adcq %%rax, %%r11 ;"
+		"movq %%r11, 32(%0) ;"
+		"adcq %%rcx, %%r15 ;"
+		"movq %%r15, 40(%0) ;"
+		"movq 24(%1), %%rdx ;"
+		"mulx %%rdx, %%rax, %%rcx ;" /* A[3]^2 */
+		"adcq %%rax, %%r13 ;"
+		"movq %%r13, 48(%0) ;"
+		"adcq %%rcx, %%r14 ;"
+		"movq %%r14, 56(%0) ;"
+		:
+		: "r"(c), "r"(a)
+		: "memory", "cc", "%rax", "%rcx", "%rdx", "%r8", "%r9", "%r10",
+		  "%r11", "%r13", "%r14", "%r15");
+}
+
+static void red_eltfp25519_1w_adx(u64 *const c, const u64 *const a)
+{
+	asm volatile(
+		"movl    $38, %%edx ;"	/* 2*c = 38 = 2^256 */
+		"mulx 32(%1),  %%r8, %%r10 ;" /* c*C[4] */
+		"xorl %%ebx, %%ebx ;"
+		"adox   (%1),  %%r8 ;"
+		"mulx 40(%1),  %%r9, %%r11 ;" /* c*C[5] */
+		"adcx %%r10,  %%r9 ;"
+		"adox  8(%1),  %%r9 ;"
+		"mulx 48(%1), %%r10, %%rax ;" /* c*C[6] */
+		"adcx %%r11, %%r10 ;"
+		"adox 16(%1), %%r10 ;"
+		"mulx 56(%1), %%r11, %%rcx ;" /* c*C[7] */
+		"adcx %%rax, %%r11 ;"
+		"adox 24(%1), %%r11 ;"
+		/***************************************/
+		"adcx %%rbx, %%rcx ;"
+		"adox  %%rbx, %%rcx ;"
+		"imul %%rdx, %%rcx ;" /* c*C[4], cf=0, of=0 */
+		"adcx %%rcx,  %%r8 ;"
+		"adcx %%rbx,  %%r9 ;"
+		"movq  %%r9,  8(%0) ;"
+		"adcx %%rbx, %%r10 ;"
+		"movq %%r10, 16(%0) ;"
+		"adcx %%rbx, %%r11 ;"
+		"movq %%r11, 24(%0) ;"
+		"mov     $0, %%ecx ;"
+		"cmovc %%edx, %%ecx ;"
+		"addq %%rcx,  %%r8 ;"
+		"movq  %%r8,   (%0) ;"
+		:
+		: "r"(c), "r"(a)
+		: "memory", "cc", "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9",
+		  "%r10", "%r11");
+}
+
+static void red_eltfp25519_1w_bmi2(u64 *const c, const u64 *const a)
+{
+	asm volatile(
+		"movl    $38, %%edx ;"	/* 2*c = 38 = 2^256 */
+		"mulx 32(%1),  %%r8, %%r10 ;" /* c*C[4] */
+		"mulx 40(%1),  %%r9, %%r11 ;" /* c*C[5] */
+		"addq %%r10,  %%r9 ;"
+		"mulx 48(%1), %%r10, %%rax ;" /* c*C[6] */
+		"adcq %%r11, %%r10 ;"
+		"mulx 56(%1), %%r11, %%rcx ;" /* c*C[7] */
+		"adcq %%rax, %%r11 ;"
+		/***************************************/
+		"adcq    $0, %%rcx ;"
+		"addq   (%1),  %%r8 ;"
+		"adcq  8(%1),  %%r9 ;"
+		"adcq 16(%1), %%r10 ;"
+		"adcq 24(%1), %%r11 ;"
+		"adcq     $0, %%rcx ;"
+		"imul %%rdx, %%rcx ;" /* c*C[4], cf=0 */
+		"addq %%rcx,  %%r8 ;"
+		"adcq    $0,  %%r9 ;"
+		"movq  %%r9,  8(%0) ;"
+		"adcq    $0, %%r10 ;"
+		"movq %%r10, 16(%0) ;"
+		"adcq    $0, %%r11 ;"
+		"movq %%r11, 24(%0) ;"
+		"mov     $0, %%ecx ;"
+		"cmovc %%edx, %%ecx ;"
+		"addq %%rcx,  %%r8 ;"
+		"movq  %%r8,   (%0) ;"
+		:
+		: "r"(c), "r"(a)
+		: "memory", "cc", "%rax", "%rcx", "%rdx", "%r8", "%r9", "%r10",
+		  "%r11");
+}
+
+static __always_inline void
+add_eltfp25519_1w_adx(u64 *const c, const u64 *const a, const u64 *const b)
+{
+	asm volatile(
+		"mov     $38, %%eax ;"
+		"xorl  %%ecx, %%ecx ;"
+		"movq   (%2),  %%r8 ;"
+		"adcx   (%1),  %%r8 ;"
+		"movq  8(%2),  %%r9 ;"
+		"adcx  8(%1),  %%r9 ;"
+		"movq 16(%2), %%r10 ;"
+		"adcx 16(%1), %%r10 ;"
+		"movq 24(%2), %%r11 ;"
+		"adcx 24(%1), %%r11 ;"
+		"cmovc %%eax, %%ecx ;"
+		"xorl %%eax, %%eax  ;"
+		"adcx %%rcx,  %%r8  ;"
+		"adcx %%rax,  %%r9  ;"
+		"movq  %%r9,  8(%0) ;"
+		"adcx %%rax, %%r10  ;"
+		"movq %%r10, 16(%0) ;"
+		"adcx %%rax, %%r11  ;"
+		"movq %%r11, 24(%0) ;"
+		"mov     $38, %%ecx ;"
+		"cmovc %%ecx, %%eax ;"
+		"addq %%rax,  %%r8  ;"
+		"movq  %%r8,   (%0) ;"
+		:
+		: "r"(c), "r"(a), "r"(b)
+		: "memory", "cc", "%rax", "%rcx", "%r8", "%r9", "%r10", "%r11");
+}
+
+static __always_inline void
+add_eltfp25519_1w_bmi2(u64 *const c, const u64 *const a, const u64 *const b)
+{
+	asm volatile(
+		"mov     $38, %%eax ;"
+		"movq   (%2),  %%r8 ;"
+		"addq   (%1),  %%r8 ;"
+		"movq  8(%2),  %%r9 ;"
+		"adcq  8(%1),  %%r9 ;"
+		"movq 16(%2), %%r10 ;"
+		"adcq 16(%1), %%r10 ;"
+		"movq 24(%2), %%r11 ;"
+		"adcq 24(%1), %%r11 ;"
+		"mov      $0, %%ecx ;"
+		"cmovc %%eax, %%ecx ;"
+		"addq %%rcx,  %%r8  ;"
+		"adcq    $0,  %%r9  ;"
+		"movq  %%r9,  8(%0) ;"
+		"adcq    $0, %%r10  ;"
+		"movq %%r10, 16(%0) ;"
+		"adcq    $0, %%r11  ;"
+		"movq %%r11, 24(%0) ;"
+		"mov     $0, %%ecx  ;"
+		"cmovc %%eax, %%ecx ;"
+		"addq %%rcx,  %%r8  ;"
+		"movq  %%r8,   (%0) ;"
+		:
+		: "r"(c), "r"(a), "r"(b)
+		: "memory", "cc", "%rax", "%rcx", "%r8", "%r9", "%r10", "%r11");
+}
+
+static __always_inline void
+sub_eltfp25519_1w(u64 *const c, const u64 *const a, const u64 *const b)
+{
+	asm volatile(
+		"mov     $38, %%eax ;"
+		"movq   (%1),  %%r8 ;"
+		"subq   (%2),  %%r8 ;"
+		"movq  8(%1),  %%r9 ;"
+		"sbbq  8(%2),  %%r9 ;"
+		"movq 16(%1), %%r10 ;"
+		"sbbq 16(%2), %%r10 ;"
+		"movq 24(%1), %%r11 ;"
+		"sbbq 24(%2), %%r11 ;"
+		"mov      $0, %%ecx ;"
+		"cmovc %%eax, %%ecx ;"
+		"subq %%rcx,  %%r8  ;"
+		"sbbq    $0,  %%r9  ;"
+		"movq  %%r9,  8(%0) ;"
+		"sbbq    $0, %%r10  ;"
+		"movq %%r10, 16(%0) ;"
+		"sbbq    $0, %%r11  ;"
+		"movq %%r11, 24(%0) ;"
+		"mov     $0, %%ecx  ;"
+		"cmovc %%eax, %%ecx ;"
+		"subq %%rcx,  %%r8  ;"
+		"movq  %%r8,   (%0) ;"
+		:
+		: "r"(c), "r"(a), "r"(b)
+		: "memory", "cc", "%rax", "%rcx", "%r8", "%r9", "%r10", "%r11");
+}
+
+/* Multiplication by a24 = (A+2)/4 = (486662+2)/4 = 121666 */
+static __always_inline void
+mul_a24_eltfp25519_1w(u64 *const c, const u64 *const a)
+{
+	const u64 a24 = 121666;
+	asm volatile(
+		"movq     %2, %%rdx ;"
+		"mulx   (%1),  %%r8, %%r10 ;"
+		"mulx  8(%1),  %%r9, %%r11 ;"
+		"addq %%r10,  %%r9 ;"
+		"mulx 16(%1), %%r10, %%rax ;"
+		"adcq %%r11, %%r10 ;"
+		"mulx 24(%1), %%r11, %%rcx ;"
+		"adcq %%rax, %%r11 ;"
+		/**************************/
+		"adcq    $0, %%rcx ;"
+		"movl   $38, %%edx ;" /* 2*c = 38 = 2^256 mod 2^255-19*/
+		"imul %%rdx, %%rcx ;"
+		"addq %%rcx,  %%r8 ;"
+		"adcq    $0,  %%r9 ;"
+		"movq  %%r9,  8(%0) ;"
+		"adcq    $0, %%r10 ;"
+		"movq %%r10, 16(%0) ;"
+		"adcq    $0, %%r11 ;"
+		"movq %%r11, 24(%0) ;"
+		"mov     $0, %%ecx ;"
+		"cmovc %%edx, %%ecx ;"
+		"addq %%rcx,  %%r8 ;"
+		"movq  %%r8,   (%0) ;"
+		:
+		: "r"(c), "r"(a), "r"(a24)
+		: "memory", "cc", "%rax", "%rcx", "%rdx", "%r8", "%r9", "%r10",
+		  "%r11");
+}
+
+static void inv_eltfp25519_1w_adx(u64 *const c, const u64 *const a)
+{
+	struct {
+		eltfp25519_1w_buffer buffer;
+		eltfp25519_1w x0, x1, x2;
+	} __aligned(32) m;
+	u64 *T[4];
+
+	T[0] = m.x0;
+	T[1] = c; /* x^(-1) */
+	T[2] = m.x1;
+	T[3] = m.x2;
+
+	copy_eltfp25519_1w(T[1], a);
+	sqrn_eltfp25519_1w_adx(T[1], 1);
+	copy_eltfp25519_1w(T[2], T[1]);
+	sqrn_eltfp25519_1w_adx(T[2], 2);
+	mul_eltfp25519_1w_adx(T[0], a, T[2]);
+	mul_eltfp25519_1w_adx(T[1], T[1], T[0]);
+	copy_eltfp25519_1w(T[2], T[1]);
+	sqrn_eltfp25519_1w_adx(T[2], 1);
+	mul_eltfp25519_1w_adx(T[0], T[0], T[2]);
+	copy_eltfp25519_1w(T[2], T[0]);
+	sqrn_eltfp25519_1w_adx(T[2], 5);
+	mul_eltfp25519_1w_adx(T[0], T[0], T[2]);
+	copy_eltfp25519_1w(T[2], T[0]);
+	sqrn_eltfp25519_1w_adx(T[2], 10);
+	mul_eltfp25519_1w_adx(T[2], T[2], T[0]);
+	copy_eltfp25519_1w(T[3], T[2]);
+	sqrn_eltfp25519_1w_adx(T[3], 20);
+	mul_eltfp25519_1w_adx(T[3], T[3], T[2]);
+	sqrn_eltfp25519_1w_adx(T[3], 10);
+	mul_eltfp25519_1w_adx(T[3], T[3], T[0]);
+	copy_eltfp25519_1w(T[0], T[3]);
+	sqrn_eltfp25519_1w_adx(T[0], 50);
+	mul_eltfp25519_1w_adx(T[0], T[0], T[3]);
+	copy_eltfp25519_1w(T[2], T[0]);
+	sqrn_eltfp25519_1w_adx(T[2], 100);
+	mul_eltfp25519_1w_adx(T[2], T[2], T[0]);
+	sqrn_eltfp25519_1w_adx(T[2], 50);
+	mul_eltfp25519_1w_adx(T[2], T[2], T[3]);
+	sqrn_eltfp25519_1w_adx(T[2], 5);
+	mul_eltfp25519_1w_adx(T[1], T[1], T[2]);
+
+	memzero_explicit(&m, sizeof(m));
+}
+
+static void inv_eltfp25519_1w_bmi2(u64 *const c, const u64 *const a)
+{
+	struct {
+		eltfp25519_1w_buffer buffer;
+		eltfp25519_1w x0, x1, x2;
+	} __aligned(32) m;
+	u64 *T[5];
+
+	T[0] = m.x0;
+	T[1] = c; /* x^(-1) */
+	T[2] = m.x1;
+	T[3] = m.x2;
+
+	copy_eltfp25519_1w(T[1], a);
+	sqrn_eltfp25519_1w_bmi2(T[1], 1);
+	copy_eltfp25519_1w(T[2], T[1]);
+	sqrn_eltfp25519_1w_bmi2(T[2], 2);
+	mul_eltfp25519_1w_bmi2(T[0], a, T[2]);
+	mul_eltfp25519_1w_bmi2(T[1], T[1], T[0]);
+	copy_eltfp25519_1w(T[2], T[1]);
+	sqrn_eltfp25519_1w_bmi2(T[2], 1);
+	mul_eltfp25519_1w_bmi2(T[0], T[0], T[2]);
+	copy_eltfp25519_1w(T[2], T[0]);
+	sqrn_eltfp25519_1w_bmi2(T[2], 5);
+	mul_eltfp25519_1w_bmi2(T[0], T[0], T[2]);
+	copy_eltfp25519_1w(T[2], T[0]);
+	sqrn_eltfp25519_1w_bmi2(T[2], 10);
+	mul_eltfp25519_1w_bmi2(T[2], T[2], T[0]);
+	copy_eltfp25519_1w(T[3], T[2]);
+	sqrn_eltfp25519_1w_bmi2(T[3], 20);
+	mul_eltfp25519_1w_bmi2(T[3], T[3], T[2]);
+	sqrn_eltfp25519_1w_bmi2(T[3], 10);
+	mul_eltfp25519_1w_bmi2(T[3], T[3], T[0]);
+	copy_eltfp25519_1w(T[0], T[3]);
+	sqrn_eltfp25519_1w_bmi2(T[0], 50);
+	mul_eltfp25519_1w_bmi2(T[0], T[0], T[3]);
+	copy_eltfp25519_1w(T[2], T[0]);
+	sqrn_eltfp25519_1w_bmi2(T[2], 100);
+	mul_eltfp25519_1w_bmi2(T[2], T[2], T[0]);
+	sqrn_eltfp25519_1w_bmi2(T[2], 50);
+	mul_eltfp25519_1w_bmi2(T[2], T[2], T[3]);
+	sqrn_eltfp25519_1w_bmi2(T[2], 5);
+	mul_eltfp25519_1w_bmi2(T[1], T[1], T[2]);
+
+	memzero_explicit(&m, sizeof(m));
+}
+
+/* Given c, a 256-bit number, fred_eltfp25519_1w updates c
+ * with a number such that 0 <= C < 2**255-19.
+ */
+static __always_inline void fred_eltfp25519_1w(u64 *const c)
+{
+	u64 tmp0 = 38, tmp1 = 19;
+	asm volatile(
+		"btrq   $63,    %3 ;" /* Put bit 255 in carry flag and clear */
+		"cmovncl %k5,   %k4 ;" /* c[255] ? 38 : 19 */
+
+		/* Add either 19 or 38 to c */
+		"addq    %4,   %0 ;"
+		"adcq    $0,   %1 ;"
+		"adcq    $0,   %2 ;"
+		"adcq    $0,   %3 ;"
+
+		/* Test for bit 255 again; only triggered on overflow modulo 2^255-19 */
+		"movl    $0,  %k4 ;"
+		"cmovnsl %k5,  %k4 ;" /* c[255] ? 0 : 19 */
+		"btrq   $63,   %3 ;" /* Clear bit 255 */
+
+		/* Subtract 19 if necessary */
+		"subq    %4,   %0 ;"
+		"sbbq    $0,   %1 ;"
+		"sbbq    $0,   %2 ;"
+		"sbbq    $0,   %3 ;"
+
+		: "+r"(c[0]), "+r"(c[1]), "+r"(c[2]), "+r"(c[3]), "+r"(tmp0),
+		  "+r"(tmp1)
+		:
+		: "memory", "cc");
+}
+
+static __always_inline void cswap(u8 bit, u64 *const px, u64 *const py)
+{
+	u64 temp;
+	asm volatile(
+		"test %9, %9 ;"
+		"movq %0, %8 ;"
+		"cmovnzq %4, %0 ;"
+		"cmovnzq %8, %4 ;"
+		"movq %1, %8 ;"
+		"cmovnzq %5, %1 ;"
+		"cmovnzq %8, %5 ;"
+		"movq %2, %8 ;"
+		"cmovnzq %6, %2 ;"
+		"cmovnzq %8, %6 ;"
+		"movq %3, %8 ;"
+		"cmovnzq %7, %3 ;"
+		"cmovnzq %8, %7 ;"
+		: "+r"(px[0]), "+r"(px[1]), "+r"(px[2]), "+r"(px[3]),
+		  "+r"(py[0]), "+r"(py[1]), "+r"(py[2]), "+r"(py[3]),
+		  "=r"(temp)
+		: "r"(bit)
+		: "cc"
+	);
+}
+
+static __always_inline void cselect(u8 bit, u64 *const px, const u64 *const py)
+{
+	asm volatile(
+		"test %4, %4 ;"
+		"cmovnzq %5, %0 ;"
+		"cmovnzq %6, %1 ;"
+		"cmovnzq %7, %2 ;"
+		"cmovnzq %8, %3 ;"
+		: "+r"(px[0]), "+r"(px[1]), "+r"(px[2]), "+r"(px[3])
+		: "r"(bit), "rm"(py[0]), "rm"(py[1]), "rm"(py[2]), "rm"(py[3])
+		: "cc"
+	);
+}
+
+static void curve25519_adx(u8 shared[CURVE25519_KEY_SIZE],
+			   const u8 private_key[CURVE25519_KEY_SIZE],
+			   const u8 session_key[CURVE25519_KEY_SIZE])
+{
+	struct {
+		u64 buffer[4 * NUM_WORDS_ELTFP25519];
+		u64 coordinates[4 * NUM_WORDS_ELTFP25519];
+		u64 workspace[6 * NUM_WORDS_ELTFP25519];
+		u8 session[CURVE25519_KEY_SIZE];
+		u8 private[CURVE25519_KEY_SIZE];
+	} __aligned(32) m;
+
+	int i = 0, j = 0;
+	u64 prev = 0;
+	u64 *const X1 = (u64 *)m.session;
+	u64 *const key = (u64 *)m.private;
+	u64 *const Px = m.coordinates + 0;
+	u64 *const Pz = m.coordinates + 4;
+	u64 *const Qx = m.coordinates + 8;
+	u64 *const Qz = m.coordinates + 12;
+	u64 *const X2 = Qx;
+	u64 *const Z2 = Qz;
+	u64 *const X3 = Px;
+	u64 *const Z3 = Pz;
+	u64 *const X2Z2 = Qx;
+	u64 *const X3Z3 = Px;
+
+	u64 *const A = m.workspace + 0;
+	u64 *const B = m.workspace + 4;
+	u64 *const D = m.workspace + 8;
+	u64 *const C = m.workspace + 12;
+	u64 *const DA = m.workspace + 16;
+	u64 *const CB = m.workspace + 20;
+	u64 *const AB = A;
+	u64 *const DC = D;
+	u64 *const DACB = DA;
+
+	memcpy(m.private, private_key, sizeof(m.private));
+	memcpy(m.session, session_key, sizeof(m.session));
+
+	curve25519_clamp_secret(m.private);
+
+	/* As in the draft:
+	 * When receiving such an array, implementations of curve25519
+	 * MUST mask the most-significant bit in the final byte. This
+	 * is done to preserve compatibility with point formats which
+	 * reserve the sign bit for use in other protocols and to
+	 * increase resistance to implementation fingerprinting
+	 */
+	m.session[CURVE25519_KEY_SIZE - 1] &= (1 << (255 % 8)) - 1;
+
+	copy_eltfp25519_1w(Px, X1);
+	setzero_eltfp25519_1w(Pz);
+	setzero_eltfp25519_1w(Qx);
+	setzero_eltfp25519_1w(Qz);
+
+	Pz[0] = 1;
+	Qx[0] = 1;
+
+	/* main-loop */
+	prev = 0;
+	j = 62;
+	for (i = 3; i >= 0; --i) {
+		while (j >= 0) {
+			u64 bit = (key[i] >> j) & 0x1;
+			u64 swap = bit ^ prev;
+			prev = bit;
+
+			add_eltfp25519_1w_adx(A, X2, Z2);	/* A = (X2+Z2) */
+			sub_eltfp25519_1w(B, X2, Z2);		/* B = (X2-Z2) */
+			add_eltfp25519_1w_adx(C, X3, Z3);	/* C = (X3+Z3) */
+			sub_eltfp25519_1w(D, X3, Z3);		/* D = (X3-Z3) */
+			mul_eltfp25519_2w_adx(DACB, AB, DC);	/* [DA|CB] = [A|B]*[D|C] */
+
+			cselect(swap, A, C);
+			cselect(swap, B, D);
+
+			sqr_eltfp25519_2w_adx(AB);		/* [AA|BB] = [A^2|B^2] */
+			add_eltfp25519_1w_adx(X3, DA, CB);	/* X3 = (DA+CB) */
+			sub_eltfp25519_1w(Z3, DA, CB);		/* Z3 = (DA-CB) */
+			sqr_eltfp25519_2w_adx(X3Z3);		/* [X3|Z3] = [(DA+CB)|(DA+CB)]^2 */
+
+			copy_eltfp25519_1w(X2, B);		/* X2 = B^2 */
+			sub_eltfp25519_1w(Z2, A, B);		/* Z2 = E = AA-BB */
+
+			mul_a24_eltfp25519_1w(B, Z2);		/* B = a24*E */
+			add_eltfp25519_1w_adx(B, B, X2);	/* B = a24*E+B */
+			mul_eltfp25519_2w_adx(X2Z2, X2Z2, AB);	/* [X2|Z2] = [B|E]*[A|a24*E+B] */
+			mul_eltfp25519_1w_adx(Z3, Z3, X1);	/* Z3 = Z3*X1 */
+			--j;
+		}
+		j = 63;
+	}
+
+	inv_eltfp25519_1w_adx(A, Qz);
+	mul_eltfp25519_1w_adx((u64 *)shared, Qx, A);
+	fred_eltfp25519_1w((u64 *)shared);
+
+	memzero_explicit(&m, sizeof(m));
+}
+
+static void curve25519_adx_base(u8 session_key[CURVE25519_KEY_SIZE],
+				const u8 private_key[CURVE25519_KEY_SIZE])
+{
+	struct {
+		u64 buffer[4 * NUM_WORDS_ELTFP25519];
+		u64 coordinates[4 * NUM_WORDS_ELTFP25519];
+		u64 workspace[4 * NUM_WORDS_ELTFP25519];
+		u8 private[CURVE25519_KEY_SIZE];
+	} __aligned(32) m;
+
+	const int ite[4] = { 64, 64, 64, 63 };
+	const int q = 3;
+	u64 swap = 1;
+
+	int i = 0, j = 0, k = 0;
+	u64 *const key = (u64 *)m.private;
+	u64 *const Ur1 = m.coordinates + 0;
+	u64 *const Zr1 = m.coordinates + 4;
+	u64 *const Ur2 = m.coordinates + 8;
+	u64 *const Zr2 = m.coordinates + 12;
+
+	u64 *const UZr1 = m.coordinates + 0;
+	u64 *const ZUr2 = m.coordinates + 8;
+
+	u64 *const A = m.workspace + 0;
+	u64 *const B = m.workspace + 4;
+	u64 *const C = m.workspace + 8;
+	u64 *const D = m.workspace + 12;
+
+	u64 *const AB = m.workspace + 0;
+	u64 *const CD = m.workspace + 8;
+
+	const u64 *const P = table_ladder_8k;
+
+	memcpy(m.private, private_key, sizeof(m.private));
+
+	curve25519_clamp_secret(m.private);
+
+	setzero_eltfp25519_1w(Ur1);
+	setzero_eltfp25519_1w(Zr1);
+	setzero_eltfp25519_1w(Zr2);
+	Ur1[0] = 1;
+	Zr1[0] = 1;
+	Zr2[0] = 1;
+
+	/* G-S */
+	Ur2[3] = 0x1eaecdeee27cab34UL;
+	Ur2[2] = 0xadc7a0b9235d48e2UL;
+	Ur2[1] = 0xbbf095ae14b2edf8UL;
+	Ur2[0] = 0x7e94e1fec82faabdUL;
+
+	/* main-loop */
+	j = q;
+	for (i = 0; i < NUM_WORDS_ELTFP25519; ++i) {
+		while (j < ite[i]) {
+			u64 bit = (key[i] >> j) & 0x1;
+			k = (64 * i + j - q);
+			swap = swap ^ bit;
+			cswap(swap, Ur1, Ur2);
+			cswap(swap, Zr1, Zr2);
+			swap = bit;
+			/* Addition */
+			sub_eltfp25519_1w(B, Ur1, Zr1);		/* B = Ur1-Zr1 */
+			add_eltfp25519_1w_adx(A, Ur1, Zr1);	/* A = Ur1+Zr1 */
+			mul_eltfp25519_1w_adx(C, &P[4 * k], B);	/* C = M0-B */
+			sub_eltfp25519_1w(B, A, C);		/* B = (Ur1+Zr1) - M*(Ur1-Zr1) */
+			add_eltfp25519_1w_adx(A, A, C);		/* A = (Ur1+Zr1) + M*(Ur1-Zr1) */
+			sqr_eltfp25519_2w_adx(AB);		/* A = A^2      |  B = B^2 */
+			mul_eltfp25519_2w_adx(UZr1, ZUr2, AB);	/* Ur1 = Zr2*A  |  Zr1 = Ur2*B */
+			++j;
+		}
+		j = 0;
+	}
+
+	/* Doubling */
+	for (i = 0; i < q; ++i) {
+		add_eltfp25519_1w_adx(A, Ur1, Zr1);	/*  A = Ur1+Zr1 */
+		sub_eltfp25519_1w(B, Ur1, Zr1);		/*  B = Ur1-Zr1 */
+		sqr_eltfp25519_2w_adx(AB);		/*  A = A**2     B = B**2 */
+		copy_eltfp25519_1w(C, B);		/*  C = B */
+		sub_eltfp25519_1w(B, A, B);		/*  B = A-B */
+		mul_a24_eltfp25519_1w(D, B);		/*  D = my_a24*B */
+		add_eltfp25519_1w_adx(D, D, C);		/*  D = D+C */
+		mul_eltfp25519_2w_adx(UZr1, AB, CD);	/*  Ur1 = A*B   Zr1 = Zr1*A */
+	}
+
+	/* Convert to affine coordinates */
+	inv_eltfp25519_1w_adx(A, Zr1);
+	mul_eltfp25519_1w_adx((u64 *)session_key, Ur1, A);
+	fred_eltfp25519_1w((u64 *)session_key);
+
+	memzero_explicit(&m, sizeof(m));
+}
+
+static void curve25519_bmi2(u8 shared[CURVE25519_KEY_SIZE],
+			    const u8 private_key[CURVE25519_KEY_SIZE],
+			    const u8 session_key[CURVE25519_KEY_SIZE])
+{
+	struct {
+		u64 buffer[4 * NUM_WORDS_ELTFP25519];
+		u64 coordinates[4 * NUM_WORDS_ELTFP25519];
+		u64 workspace[6 * NUM_WORDS_ELTFP25519];
+		u8 session[CURVE25519_KEY_SIZE];
+		u8 private[CURVE25519_KEY_SIZE];
+	} __aligned(32) m;
+
+	int i = 0, j = 0;
+	u64 prev = 0;
+	u64 *const X1 = (u64 *)m.session;
+	u64 *const key = (u64 *)m.private;
+	u64 *const Px = m.coordinates + 0;
+	u64 *const Pz = m.coordinates + 4;
+	u64 *const Qx = m.coordinates + 8;
+	u64 *const Qz = m.coordinates + 12;
+	u64 *const X2 = Qx;
+	u64 *const Z2 = Qz;
+	u64 *const X3 = Px;
+	u64 *const Z3 = Pz;
+	u64 *const X2Z2 = Qx;
+	u64 *const X3Z3 = Px;
+
+	u64 *const A = m.workspace + 0;
+	u64 *const B = m.workspace + 4;
+	u64 *const D = m.workspace + 8;
+	u64 *const C = m.workspace + 12;
+	u64 *const DA = m.workspace + 16;
+	u64 *const CB = m.workspace + 20;
+	u64 *const AB = A;
+	u64 *const DC = D;
+	u64 *const DACB = DA;
+
+	memcpy(m.private, private_key, sizeof(m.private));
+	memcpy(m.session, session_key, sizeof(m.session));
+
+	curve25519_clamp_secret(m.private);
+
+	/* As in the draft:
+	 * When receiving such an array, implementations of curve25519
+	 * MUST mask the most-significant bit in the final byte. This
+	 * is done to preserve compatibility with point formats which
+	 * reserve the sign bit for use in other protocols and to
+	 * increase resistance to implementation fingerprinting
+	 */
+	m.session[CURVE25519_KEY_SIZE - 1] &= (1 << (255 % 8)) - 1;
+
+	copy_eltfp25519_1w(Px, X1);
+	setzero_eltfp25519_1w(Pz);
+	setzero_eltfp25519_1w(Qx);
+	setzero_eltfp25519_1w(Qz);
+
+	Pz[0] = 1;
+	Qx[0] = 1;
+
+	/* main-loop */
+	prev = 0;
+	j = 62;
+	for (i = 3; i >= 0; --i) {
+		while (j >= 0) {
+			u64 bit = (key[i] >> j) & 0x1;
+			u64 swap = bit ^ prev;
+			prev = bit;
+
+			add_eltfp25519_1w_bmi2(A, X2, Z2);	/* A = (X2+Z2) */
+			sub_eltfp25519_1w(B, X2, Z2);		/* B = (X2-Z2) */
+			add_eltfp25519_1w_bmi2(C, X3, Z3);	/* C = (X3+Z3) */
+			sub_eltfp25519_1w(D, X3, Z3);		/* D = (X3-Z3) */
+			mul_eltfp25519_2w_bmi2(DACB, AB, DC);	/* [DA|CB] = [A|B]*[D|C] */
+
+			cselect(swap, A, C);
+			cselect(swap, B, D);
+
+			sqr_eltfp25519_2w_bmi2(AB);		/* [AA|BB] = [A^2|B^2] */
+			add_eltfp25519_1w_bmi2(X3, DA, CB);	/* X3 = (DA+CB) */
+			sub_eltfp25519_1w(Z3, DA, CB);		/* Z3 = (DA-CB) */
+			sqr_eltfp25519_2w_bmi2(X3Z3);		/* [X3|Z3] = [(DA+CB)|(DA+CB)]^2 */
+
+			copy_eltfp25519_1w(X2, B);		/* X2 = B^2 */
+			sub_eltfp25519_1w(Z2, A, B);		/* Z2 = E = AA-BB */
+
+			mul_a24_eltfp25519_1w(B, Z2);		/* B = a24*E */
+			add_eltfp25519_1w_bmi2(B, B, X2);	/* B = a24*E+B */
+			mul_eltfp25519_2w_bmi2(X2Z2, X2Z2, AB);	/* [X2|Z2] = [B|E]*[A|a24*E+B] */
+			mul_eltfp25519_1w_bmi2(Z3, Z3, X1);	/* Z3 = Z3*X1 */
+			--j;
+		}
+		j = 63;
+	}
+
+	inv_eltfp25519_1w_bmi2(A, Qz);
+	mul_eltfp25519_1w_bmi2((u64 *)shared, Qx, A);
+	fred_eltfp25519_1w((u64 *)shared);
+
+	memzero_explicit(&m, sizeof(m));
+}
+
+static void curve25519_bmi2_base(u8 session_key[CURVE25519_KEY_SIZE],
+				 const u8 private_key[CURVE25519_KEY_SIZE])
+{
+	struct {
+		u64 buffer[4 * NUM_WORDS_ELTFP25519];
+		u64 coordinates[4 * NUM_WORDS_ELTFP25519];
+		u64 workspace[4 * NUM_WORDS_ELTFP25519];
+		u8 private[CURVE25519_KEY_SIZE];
+	} __aligned(32) m;
+
+	const int ite[4] = { 64, 64, 64, 63 };
+	const int q = 3;
+	u64 swap = 1;
+
+	int i = 0, j = 0, k = 0;
+	u64 *const key = (u64 *)m.private;
+	u64 *const Ur1 = m.coordinates + 0;
+	u64 *const Zr1 = m.coordinates + 4;
+	u64 *const Ur2 = m.coordinates + 8;
+	u64 *const Zr2 = m.coordinates + 12;
+
+	u64 *const UZr1 = m.coordinates + 0;
+	u64 *const ZUr2 = m.coordinates + 8;
+
+	u64 *const A = m.workspace + 0;
+	u64 *const B = m.workspace + 4;
+	u64 *const C = m.workspace + 8;
+	u64 *const D = m.workspace + 12;
+
+	u64 *const AB = m.workspace + 0;
+	u64 *const CD = m.workspace + 8;
+
+	const u64 *const P = table_ladder_8k;
+
+	memcpy(m.private, private_key, sizeof(m.private));
+
+	curve25519_clamp_secret(m.private);
+
+	setzero_eltfp25519_1w(Ur1);
+	setzero_eltfp25519_1w(Zr1);
+	setzero_eltfp25519_1w(Zr2);
+	Ur1[0] = 1;
+	Zr1[0] = 1;
+	Zr2[0] = 1;
+
+	/* G-S */
+	Ur2[3] = 0x1eaecdeee27cab34UL;
+	Ur2[2] = 0xadc7a0b9235d48e2UL;
+	Ur2[1] = 0xbbf095ae14b2edf8UL;
+	Ur2[0] = 0x7e94e1fec82faabdUL;
+
+	/* main-loop */
+	j = q;
+	for (i = 0; i < NUM_WORDS_ELTFP25519; ++i) {
+		while (j < ite[i]) {
+			u64 bit = (key[i] >> j) & 0x1;
+			k = (64 * i + j - q);
+			swap = swap ^ bit;
+			cswap(swap, Ur1, Ur2);
+			cswap(swap, Zr1, Zr2);
+			swap = bit;
+			/* Addition */
+			sub_eltfp25519_1w(B, Ur1, Zr1);		/* B = Ur1-Zr1 */
+			add_eltfp25519_1w_bmi2(A, Ur1, Zr1);	/* A = Ur1+Zr1 */
+			mul_eltfp25519_1w_bmi2(C, &P[4 * k], B);/* C = M0-B */
+			sub_eltfp25519_1w(B, A, C);		/* B = (Ur1+Zr1) - M*(Ur1-Zr1) */
+			add_eltfp25519_1w_bmi2(A, A, C);	/* A = (Ur1+Zr1) + M*(Ur1-Zr1) */
+			sqr_eltfp25519_2w_bmi2(AB);		/* A = A^2      |  B = B^2 */
+			mul_eltfp25519_2w_bmi2(UZr1, ZUr2, AB);	/* Ur1 = Zr2*A  |  Zr1 = Ur2*B */
+			++j;
+		}
+		j = 0;
+	}
+
+	/* Doubling */
+	for (i = 0; i < q; ++i) {
+		add_eltfp25519_1w_bmi2(A, Ur1, Zr1);	/*  A = Ur1+Zr1 */
+		sub_eltfp25519_1w(B, Ur1, Zr1);		/*  B = Ur1-Zr1 */
+		sqr_eltfp25519_2w_bmi2(AB);		/*  A = A**2     B = B**2 */
+		copy_eltfp25519_1w(C, B);		/*  C = B */
+		sub_eltfp25519_1w(B, A, B);		/*  B = A-B */
+		mul_a24_eltfp25519_1w(D, B);		/*  D = my_a24*B */
+		add_eltfp25519_1w_bmi2(D, D, C);	/*  D = D+C */
+		mul_eltfp25519_2w_bmi2(UZr1, AB, CD);	/*  Ur1 = A*B   Zr1 = Zr1*A */
+	}
+
+	/* Convert to affine coordinates */
+	inv_eltfp25519_1w_bmi2(A, Zr1);
+	mul_eltfp25519_1w_bmi2((u64 *)session_key, Ur1, A);
+	fred_eltfp25519_1w((u64 *)session_key);
+
+	memzero_explicit(&m, sizeof(m));
+}
+
+void curve25519_arch(u8 mypublic[CURVE25519_KEY_SIZE],
+		     const u8 secret[CURVE25519_KEY_SIZE],
+		     const u8 basepoint[CURVE25519_KEY_SIZE])
+{
+	if (static_branch_likely(&curve25519_use_adx))
+		curve25519_adx(mypublic, secret, basepoint);
+	else if (static_branch_likely(&curve25519_use_bmi2))
+		curve25519_bmi2(mypublic, secret, basepoint);
+	else
+		curve25519_generic(mypublic, secret, basepoint);
+}
+EXPORT_SYMBOL(curve25519_arch);
+
+void curve25519_base_arch(u8 pub[CURVE25519_KEY_SIZE],
+			  const u8 secret[CURVE25519_KEY_SIZE])
+{
+	if (static_branch_likely(&curve25519_use_adx))
+		curve25519_adx_base(pub, secret);
+	else if (static_branch_likely(&curve25519_use_bmi2))
+		curve25519_bmi2_base(pub, secret);
+	else
+		curve25519_generic(pub, secret, curve25519_base_point);
+}
+EXPORT_SYMBOL(curve25519_base_arch);
+
+static int curve25519_set_secret(struct crypto_kpp *tfm, const void *buf,
+				 unsigned int len)
+{
+	u8 *secret = kpp_tfm_ctx(tfm);
+
+	if (!len)
+		curve25519_generate_secret(secret);
+	else if (len == CURVE25519_KEY_SIZE)
+		memcpy(secret, buf, CURVE25519_KEY_SIZE);
+	else
+		return -EINVAL;
+	return 0;
+}
+
+static int curve25519_generate_public_key(struct kpp_request *req)
+{
+	struct crypto_kpp *tfm = crypto_kpp_reqtfm(req);
+	const u8 *secret = kpp_tfm_ctx(tfm);
+	u8 buf[CURVE25519_KEY_SIZE];
+	int copied, nbytes;
+
+	if (req->src)
+		return -EINVAL;
+
+	curve25519_base_arch(buf, secret);
+
+	if (!crypto_memneq(buf, curve25519_null_point, CURVE25519_KEY_SIZE))
+		return -EKEYREJECTED;
+
+	/* might want less than we've got */
+	nbytes = min_t(size_t, CURVE25519_KEY_SIZE, req->dst_len);
+	copied = sg_copy_from_buffer(req->dst, sg_nents_for_len(req->dst,
+								nbytes),
+				     buf, nbytes);
+	if (copied != nbytes)
+		return -EINVAL;
+	return 0;
+}
+
+static int curve25519_compute_shared_secret(struct kpp_request *req)
+{
+	struct crypto_kpp *tfm = crypto_kpp_reqtfm(req);
+	const u8 *secret = kpp_tfm_ctx(tfm);
+	u8 public_key[CURVE25519_KEY_SIZE];
+	u8 buf[CURVE25519_KEY_SIZE];
+	int copied, nbytes;
+
+	if (!req->src)
+		return -EINVAL;
+
+	copied = sg_copy_to_buffer(req->src,
+				   sg_nents_for_len(req->src,
+						    CURVE25519_KEY_SIZE),
+				   public_key, CURVE25519_KEY_SIZE);
+	if (copied != CURVE25519_KEY_SIZE)
+		return -EINVAL;
+
+	curve25519_arch(buf, secret, public_key);
+
+	if (!crypto_memneq(buf, curve25519_null_point, CURVE25519_KEY_SIZE))
+		return -EKEYREJECTED;
+
+	/* might want less than we've got */
+	nbytes = min_t(size_t, CURVE25519_KEY_SIZE, req->dst_len);
+	copied = sg_copy_from_buffer(req->dst, sg_nents_for_len(req->dst,
+								nbytes),
+				     buf, nbytes);
+	if (copied != nbytes)
+		return -EINVAL;
+	return 0;
+}
+
+static unsigned int curve25519_max_size(struct crypto_kpp *tfm)
+{
+	return CURVE25519_KEY_SIZE;
+}
+
+static struct kpp_alg curve25519_alg = {
+	.base.cra_name		= "curve25519",
+	.base.cra_driver_name	= "curve25519-x86",
+	.base.cra_priority	= 200,
+	.base.cra_module	= THIS_MODULE,
+	.base.cra_ctxsize	= CURVE25519_KEY_SIZE,
+
+	.set_secret		= curve25519_set_secret,
+	.generate_public_key	= curve25519_generate_public_key,
+	.compute_shared_secret	= curve25519_compute_shared_secret,
+	.max_size		= curve25519_max_size,
+};
+
+static int __init curve25519_mod_init(void)
+{
+	if (boot_cpu_has(X86_FEATURE_BMI2))
+		static_branch_enable(&curve25519_use_bmi2);
+	else if (boot_cpu_has(X86_FEATURE_ADX))
+		static_branch_enable(&curve25519_use_adx);
+	else
+		return 0;
+	return crypto_register_kpp(&curve25519_alg);
+}
+
+static void __exit curve25519_mod_exit(void)
+{
+	if (boot_cpu_has(X86_FEATURE_BMI2) ||
+	    boot_cpu_has(X86_FEATURE_ADX))
+		crypto_unregister_kpp(&curve25519_alg);
+}
+
+module_init(curve25519_mod_init);
+module_exit(curve25519_mod_exit);
+
+MODULE_ALIAS_CRYPTO("curve25519");
+MODULE_ALIAS_CRYPTO("curve25519-x86");
+MODULE_LICENSE("GPL v2");
diff --git a/crypto/Kconfig b/crypto/Kconfig
index b83088e6a8e6..c77d61df4fd8 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -269,6 +269,12 @@ config CRYPTO_CURVE25519
 	select CRYPTO_KPP
 	select CRYPTO_LIB_CURVE25519_GENERIC
 
+config CRYPTO_CURVE25519_X86
+	tristate "x86_64 accelerated Curve25519 scalar multiplication library"
+	depends on X86 && 64BIT
+	select CRYPTO_LIB_CURVE25519_GENERIC
+	select CRYPTO_ARCH_HAVE_LIB_CURVE25519
+
 comment "Authenticated Encryption with Associated Data"
 
 config CRYPTO_CCM
diff --git a/crypto/curve25519-generic.c b/crypto/curve25519-generic.c
index 0f4cd6347e3b..f1b0eb833884 100644
--- a/crypto/curve25519-generic.c
+++ b/crypto/curve25519-generic.c
@@ -13,8 +13,7 @@ static int curve25519_set_secret(struct crypto_kpp *tfm, const void *buf,
 
 	if (!len)
 		curve25519_generate_secret(secret);
-	else if (len == CURVE25519_KEY_SIZE &&
-		 crypto_memneq(buf, curve25519_null_point, CURVE25519_KEY_SIZE))
+	else if (len == CURVE25519_KEY_SIZE)
 		memcpy(secret, buf, CURVE25519_KEY_SIZE);
 	else
 		return -EINVAL;
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4 32/35] crypto: arm - import Bernstein and Schwabe's Curve25519 ARM implementation
  2019-10-17 19:08 [PATCH v4 00/35] crypto: crypto API library interfaces for WireGuard Ard Biesheuvel
                   ` (30 preceding siblings ...)
  2019-10-17 19:09 ` [PATCH v4 31/35] crypto: Curve25519 - x86_64 library and KPP implementations Ard Biesheuvel
@ 2019-10-17 19:09 ` Ard Biesheuvel
  2019-10-17 19:09 ` [PATCH v4 33/35] crypto: arm/Curve25519 - wire up NEON implementation Ard Biesheuvel
                   ` (2 subsequent siblings)
  34 siblings, 0 replies; 50+ messages in thread
From: Ard Biesheuvel @ 2019-10-17 19:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: Ard Biesheuvel, Herbert Xu, David Miller, Jason A . Donenfeld,
	Samuel Neves, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Martin Willi, Rene van Dorst, David Sterba

From: "Jason A. Donenfeld" <Jason@zx2c4.com>

This comes from Dan Bernstein and Peter Schwabe's public domain NEON
code, and is included here in raw form so that subsequent commits that
fix these up for the kernel can see how it has changed. This code does
have some entirely cosmetic formatting differences, adding indentation
and so forth, so that when we actually port it for use in the kernel in
the subsequent commit, it's obvious what's changed in the process.

This code originates from SUPERCOP 20180818, available at
<https://bench.cr.yp.to/supercop.html>.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm/crypto/curve25519-core.S | 2105 ++++++++++++++++++++
 1 file changed, 2105 insertions(+)

diff --git a/arch/arm/crypto/curve25519-core.S b/arch/arm/crypto/curve25519-core.S
new file mode 100644
index 000000000000..f33b85fef382
--- /dev/null
+++ b/arch/arm/crypto/curve25519-core.S
@@ -0,0 +1,2105 @@
+/*
+ * Public domain code from Daniel J. Bernstein and Peter Schwabe, from
+ * SUPERCOP's curve25519/neon2/scalarmult.s.
+ */
+
+.fpu neon
+.text
+.align 4
+.global _crypto_scalarmult_curve25519_neon2
+.global crypto_scalarmult_curve25519_neon2
+.type _crypto_scalarmult_curve25519_neon2 STT_FUNC
+.type crypto_scalarmult_curve25519_neon2 STT_FUNC
+	_crypto_scalarmult_curve25519_neon2:
+	crypto_scalarmult_curve25519_neon2:
+	vpush		{q4, q5, q6, q7}
+	mov		r12, sp
+	sub		sp, sp, #736
+	and		sp, sp, #0xffffffe0
+	strd		r4, [sp, #0]
+	strd		r6, [sp, #8]
+	strd		r8, [sp, #16]
+	strd		r10, [sp, #24]
+	str		r12, [sp, #480]
+	str		r14, [sp, #484]
+	mov		r0, r0
+	mov		r1, r1
+	mov		r2, r2
+	add		r3, sp, #32
+	ldr		r4, =0
+	ldr		r5, =254
+	vmov.i32	q0, #1
+	vshr.u64	q1, q0, #7
+	vshr.u64	q0, q0, #8
+	vmov.i32	d4, #19
+	vmov.i32	d5, #38
+	add		r6, sp, #512
+	vst1.8		{d2-d3}, [r6, : 128]
+	add		r6, sp, #528
+	vst1.8		{d0-d1}, [r6, : 128]
+	add		r6, sp, #544
+	vst1.8		{d4-d5}, [r6, : 128]
+	add		r6, r3, #0
+	vmov.i32	q2, #0
+	vst1.8		{d4-d5}, [r6, : 128]!
+	vst1.8		{d4-d5}, [r6, : 128]!
+	vst1.8		d4, [r6, : 64]
+	add		r6, r3, #0
+	ldr		r7, =960
+	sub		r7, r7, #2
+	neg		r7, r7
+	sub		r7, r7, r7, LSL #7
+	str		r7, [r6]
+	add		r6, sp, #704
+	vld1.8		{d4-d5}, [r1]!
+	vld1.8		{d6-d7}, [r1]
+	vst1.8		{d4-d5}, [r6, : 128]!
+	vst1.8		{d6-d7}, [r6, : 128]
+	sub		r1, r6, #16
+	ldrb		r6, [r1]
+	and		r6, r6, #248
+	strb		r6, [r1]
+	ldrb		r6, [r1, #31]
+	and		r6, r6, #127
+	orr		r6, r6, #64
+	strb		r6, [r1, #31]
+	vmov.i64	q2, #0xffffffff
+	vshr.u64	q3, q2, #7
+	vshr.u64	q2, q2, #6
+	vld1.8		{d8}, [r2]
+	vld1.8		{d10}, [r2]
+	add		r2, r2, #6
+	vld1.8		{d12}, [r2]
+	vld1.8		{d14}, [r2]
+	add		r2, r2, #6
+	vld1.8		{d16}, [r2]
+	add		r2, r2, #4
+	vld1.8		{d18}, [r2]
+	vld1.8		{d20}, [r2]
+	add		r2, r2, #6
+	vld1.8		{d22}, [r2]
+	add		r2, r2, #2
+	vld1.8		{d24}, [r2]
+	vld1.8		{d26}, [r2]
+	vshr.u64	q5, q5, #26
+	vshr.u64	q6, q6, #3
+	vshr.u64	q7, q7, #29
+	vshr.u64	q8, q8, #6
+	vshr.u64	q10, q10, #25
+	vshr.u64	q11, q11, #3
+	vshr.u64	q12, q12, #12
+	vshr.u64	q13, q13, #38
+	vand		q4, q4, q2
+	vand		q6, q6, q2
+	vand		q8, q8, q2
+	vand		q10, q10, q2
+	vand		q2, q12, q2
+	vand		q5, q5, q3
+	vand		q7, q7, q3
+	vand		q9, q9, q3
+	vand		q11, q11, q3
+	vand		q3, q13, q3
+	add		r2, r3, #48
+	vadd.i64	q12, q4, q1
+	vadd.i64	q13, q10, q1
+	vshr.s64	q12, q12, #26
+	vshr.s64	q13, q13, #26
+	vadd.i64	q5, q5, q12
+	vshl.i64	q12, q12, #26
+	vadd.i64	q14, q5, q0
+	vadd.i64	q11, q11, q13
+	vshl.i64	q13, q13, #26
+	vadd.i64	q15, q11, q0
+	vsub.i64	q4, q4, q12
+	vshr.s64	q12, q14, #25
+	vsub.i64	q10, q10, q13
+	vshr.s64	q13, q15, #25
+	vadd.i64	q6, q6, q12
+	vshl.i64	q12, q12, #25
+	vadd.i64	q14, q6, q1
+	vadd.i64	q2, q2, q13
+	vsub.i64	q5, q5, q12
+	vshr.s64	q12, q14, #26
+	vshl.i64	q13, q13, #25
+	vadd.i64	q14, q2, q1
+	vadd.i64	q7, q7, q12
+	vshl.i64	q12, q12, #26
+	vadd.i64	q15, q7, q0
+	vsub.i64	q11, q11, q13
+	vshr.s64	q13, q14, #26
+	vsub.i64	q6, q6, q12
+	vshr.s64	q12, q15, #25
+	vadd.i64	q3, q3, q13
+	vshl.i64	q13, q13, #26
+	vadd.i64	q14, q3, q0
+	vadd.i64	q8, q8, q12
+	vshl.i64	q12, q12, #25
+	vadd.i64	q15, q8, q1
+	add		r2, r2, #8
+	vsub.i64	q2, q2, q13
+	vshr.s64	q13, q14, #25
+	vsub.i64	q7, q7, q12
+	vshr.s64	q12, q15, #26
+	vadd.i64	q14, q13, q13
+	vadd.i64	q9, q9, q12
+	vtrn.32		d12, d14
+	vshl.i64	q12, q12, #26
+	vtrn.32		d13, d15
+	vadd.i64	q0, q9, q0
+	vadd.i64	q4, q4, q14
+	vst1.8		d12, [r2, : 64]!
+	vshl.i64	q6, q13, #4
+	vsub.i64	q7, q8, q12
+	vshr.s64	q0, q0, #25
+	vadd.i64	q4, q4, q6
+	vadd.i64	q6, q10, q0
+	vshl.i64	q0, q0, #25
+	vadd.i64	q8, q6, q1
+	vadd.i64	q4, q4, q13
+	vshl.i64	q10, q13, #25
+	vadd.i64	q1, q4, q1
+	vsub.i64	q0, q9, q0
+	vshr.s64	q8, q8, #26
+	vsub.i64	q3, q3, q10
+	vtrn.32		d14, d0
+	vshr.s64	q1, q1, #26
+	vtrn.32		d15, d1
+	vadd.i64	q0, q11, q8
+	vst1.8		d14, [r2, : 64]
+	vshl.i64	q7, q8, #26
+	vadd.i64	q5, q5, q1
+	vtrn.32		d4, d6
+	vshl.i64	q1, q1, #26
+	vtrn.32		d5, d7
+	vsub.i64	q3, q6, q7
+	add		r2, r2, #16
+	vsub.i64	q1, q4, q1
+	vst1.8		d4, [r2, : 64]
+	vtrn.32		d6, d0
+	vtrn.32		d7, d1
+	sub		r2, r2, #8
+	vtrn.32		d2, d10
+	vtrn.32		d3, d11
+	vst1.8		d6, [r2, : 64]
+	sub		r2, r2, #24
+	vst1.8		d2, [r2, : 64]
+	add		r2, r3, #96
+	vmov.i32	q0, #0
+	vmov.i64	d2, #0xff
+	vmov.i64	d3, #0
+	vshr.u32	q1, q1, #7
+	vst1.8		{d2-d3}, [r2, : 128]!
+	vst1.8		{d0-d1}, [r2, : 128]!
+	vst1.8		d0, [r2, : 64]
+	add		r2, r3, #144
+	vmov.i32	q0, #0
+	vst1.8		{d0-d1}, [r2, : 128]!
+	vst1.8		{d0-d1}, [r2, : 128]!
+	vst1.8		d0, [r2, : 64]
+	add		r2, r3, #240
+	vmov.i32	q0, #0
+	vmov.i64	d2, #0xff
+	vmov.i64	d3, #0
+	vshr.u32	q1, q1, #7
+	vst1.8		{d2-d3}, [r2, : 128]!
+	vst1.8		{d0-d1}, [r2, : 128]!
+	vst1.8		d0, [r2, : 64]
+	add		r2, r3, #48
+	add		r6, r3, #192
+	vld1.8		{d0-d1}, [r2, : 128]!
+	vld1.8		{d2-d3}, [r2, : 128]!
+	vld1.8		{d4}, [r2, : 64]
+	vst1.8		{d0-d1}, [r6, : 128]!
+	vst1.8		{d2-d3}, [r6, : 128]!
+	vst1.8		d4, [r6, : 64]
+._mainloop:
+	mov		r2, r5, LSR #3
+	and		r6, r5, #7
+	ldrb		r2, [r1, r2]
+	mov		r2, r2, LSR r6
+	and		r2, r2, #1
+	str		r5, [sp, #488]
+	eor		r4, r4, r2
+	str		r2, [sp, #492]
+	neg		r2, r4
+	add		r4, r3, #96
+	add		r5, r3, #192
+	add		r6, r3, #144
+	vld1.8		{d8-d9}, [r4, : 128]!
+	add		r7, r3, #240
+	vld1.8		{d10-d11}, [r5, : 128]!
+	veor		q6, q4, q5
+	vld1.8		{d14-d15}, [r6, : 128]!
+	vdup.i32	q8, r2
+	vld1.8		{d18-d19}, [r7, : 128]!
+	veor		q10, q7, q9
+	vld1.8		{d22-d23}, [r4, : 128]!
+	vand		q6, q6, q8
+	vld1.8		{d24-d25}, [r5, : 128]!
+	vand		q10, q10, q8
+	vld1.8		{d26-d27}, [r6, : 128]!
+	veor		q4, q4, q6
+	vld1.8		{d28-d29}, [r7, : 128]!
+	veor		q5, q5, q6
+	vld1.8		{d0}, [r4, : 64]
+	veor		q6, q7, q10
+	vld1.8		{d2}, [r5, : 64]
+	veor		q7, q9, q10
+	vld1.8		{d4}, [r6, : 64]
+	veor		q9, q11, q12
+	vld1.8		{d6}, [r7, : 64]
+	veor		q10, q0, q1
+	sub		r2, r4, #32
+	vand		q9, q9, q8
+	sub		r4, r5, #32
+	vand		q10, q10, q8
+	sub		r5, r6, #32
+	veor		q11, q11, q9
+	sub		r6, r7, #32
+	veor		q0, q0, q10
+	veor		q9, q12, q9
+	veor		q1, q1, q10
+	veor		q10, q13, q14
+	veor		q12, q2, q3
+	vand		q10, q10, q8
+	vand		q8, q12, q8
+	veor		q12, q13, q10
+	veor		q2, q2, q8
+	veor		q10, q14, q10
+	veor		q3, q3, q8
+	vadd.i32	q8, q4, q6
+	vsub.i32	q4, q4, q6
+	vst1.8		{d16-d17}, [r2, : 128]!
+	vadd.i32	q6, q11, q12
+	vst1.8		{d8-d9}, [r5, : 128]!
+	vsub.i32	q4, q11, q12
+	vst1.8		{d12-d13}, [r2, : 128]!
+	vadd.i32	q6, q0, q2
+	vst1.8		{d8-d9}, [r5, : 128]!
+	vsub.i32	q0, q0, q2
+	vst1.8		d12, [r2, : 64]
+	vadd.i32	q2, q5, q7
+	vst1.8		d0, [r5, : 64]
+	vsub.i32	q0, q5, q7
+	vst1.8		{d4-d5}, [r4, : 128]!
+	vadd.i32	q2, q9, q10
+	vst1.8		{d0-d1}, [r6, : 128]!
+	vsub.i32	q0, q9, q10
+	vst1.8		{d4-d5}, [r4, : 128]!
+	vadd.i32	q2, q1, q3
+	vst1.8		{d0-d1}, [r6, : 128]!
+	vsub.i32	q0, q1, q3
+	vst1.8		d4, [r4, : 64]
+	vst1.8		d0, [r6, : 64]
+	add		r2, sp, #544
+	add		r4, r3, #96
+	add		r5, r3, #144
+	vld1.8		{d0-d1}, [r2, : 128]
+	vld1.8		{d2-d3}, [r4, : 128]!
+	vld1.8		{d4-d5}, [r5, : 128]!
+	vzip.i32	q1, q2
+	vld1.8		{d6-d7}, [r4, : 128]!
+	vld1.8		{d8-d9}, [r5, : 128]!
+	vshl.i32	q5, q1, #1
+	vzip.i32	q3, q4
+	vshl.i32	q6, q2, #1
+	vld1.8		{d14}, [r4, : 64]
+	vshl.i32	q8, q3, #1
+	vld1.8		{d15}, [r5, : 64]
+	vshl.i32	q9, q4, #1
+	vmul.i32	d21, d7, d1
+	vtrn.32		d14, d15
+	vmul.i32	q11, q4, q0
+	vmul.i32	q0, q7, q0
+	vmull.s32	q12, d2, d2
+	vmlal.s32	q12, d11, d1
+	vmlal.s32	q12, d12, d0
+	vmlal.s32	q12, d13, d23
+	vmlal.s32	q12, d16, d22
+	vmlal.s32	q12, d7, d21
+	vmull.s32	q10, d2, d11
+	vmlal.s32	q10, d4, d1
+	vmlal.s32	q10, d13, d0
+	vmlal.s32	q10, d6, d23
+	vmlal.s32	q10, d17, d22
+	vmull.s32	q13, d10, d4
+	vmlal.s32	q13, d11, d3
+	vmlal.s32	q13, d13, d1
+	vmlal.s32	q13, d16, d0
+	vmlal.s32	q13, d17, d23
+	vmlal.s32	q13, d8, d22
+	vmull.s32	q1, d10, d5
+	vmlal.s32	q1, d11, d4
+	vmlal.s32	q1, d6, d1
+	vmlal.s32	q1, d17, d0
+	vmlal.s32	q1, d8, d23
+	vmull.s32	q14, d10, d6
+	vmlal.s32	q14, d11, d13
+	vmlal.s32	q14, d4, d4
+	vmlal.s32	q14, d17, d1
+	vmlal.s32	q14, d18, d0
+	vmlal.s32	q14, d9, d23
+	vmull.s32	q11, d10, d7
+	vmlal.s32	q11, d11, d6
+	vmlal.s32	q11, d12, d5
+	vmlal.s32	q11, d8, d1
+	vmlal.s32	q11, d19, d0
+	vmull.s32	q15, d10, d8
+	vmlal.s32	q15, d11, d17
+	vmlal.s32	q15, d12, d6
+	vmlal.s32	q15, d13, d5
+	vmlal.s32	q15, d19, d1
+	vmlal.s32	q15, d14, d0
+	vmull.s32	q2, d10, d9
+	vmlal.s32	q2, d11, d8
+	vmlal.s32	q2, d12, d7
+	vmlal.s32	q2, d13, d6
+	vmlal.s32	q2, d14, d1
+	vmull.s32	q0, d15, d1
+	vmlal.s32	q0, d10, d14
+	vmlal.s32	q0, d11, d19
+	vmlal.s32	q0, d12, d8
+	vmlal.s32	q0, d13, d17
+	vmlal.s32	q0, d6, d6
+	add		r2, sp, #512
+	vld1.8		{d18-d19}, [r2, : 128]
+	vmull.s32	q3, d16, d7
+	vmlal.s32	q3, d10, d15
+	vmlal.s32	q3, d11, d14
+	vmlal.s32	q3, d12, d9
+	vmlal.s32	q3, d13, d8
+	add		r2, sp, #528
+	vld1.8		{d8-d9}, [r2, : 128]
+	vadd.i64	q5, q12, q9
+	vadd.i64	q6, q15, q9
+	vshr.s64	q5, q5, #26
+	vshr.s64	q6, q6, #26
+	vadd.i64	q7, q10, q5
+	vshl.i64	q5, q5, #26
+	vadd.i64	q8, q7, q4
+	vadd.i64	q2, q2, q6
+	vshl.i64	q6, q6, #26
+	vadd.i64	q10, q2, q4
+	vsub.i64	q5, q12, q5
+	vshr.s64	q8, q8, #25
+	vsub.i64	q6, q15, q6
+	vshr.s64	q10, q10, #25
+	vadd.i64	q12, q13, q8
+	vshl.i64	q8, q8, #25
+	vadd.i64	q13, q12, q9
+	vadd.i64	q0, q0, q10
+	vsub.i64	q7, q7, q8
+	vshr.s64	q8, q13, #26
+	vshl.i64	q10, q10, #25
+	vadd.i64	q13, q0, q9
+	vadd.i64	q1, q1, q8
+	vshl.i64	q8, q8, #26
+	vadd.i64	q15, q1, q4
+	vsub.i64	q2, q2, q10
+	vshr.s64	q10, q13, #26
+	vsub.i64	q8, q12, q8
+	vshr.s64	q12, q15, #25
+	vadd.i64	q3, q3, q10
+	vshl.i64	q10, q10, #26
+	vadd.i64	q13, q3, q4
+	vadd.i64	q14, q14, q12
+	add		r2, r3, #288
+	vshl.i64	q12, q12, #25
+	add		r4, r3, #336
+	vadd.i64	q15, q14, q9
+	add		r2, r2, #8
+	vsub.i64	q0, q0, q10
+	add		r4, r4, #8
+	vshr.s64	q10, q13, #25
+	vsub.i64	q1, q1, q12
+	vshr.s64	q12, q15, #26
+	vadd.i64	q13, q10, q10
+	vadd.i64	q11, q11, q12
+	vtrn.32		d16, d2
+	vshl.i64	q12, q12, #26
+	vtrn.32		d17, d3
+	vadd.i64	q1, q11, q4
+	vadd.i64	q4, q5, q13
+	vst1.8		d16, [r2, : 64]!
+	vshl.i64	q5, q10, #4
+	vst1.8		d17, [r4, : 64]!
+	vsub.i64	q8, q14, q12
+	vshr.s64	q1, q1, #25
+	vadd.i64	q4, q4, q5
+	vadd.i64	q5, q6, q1
+	vshl.i64	q1, q1, #25
+	vadd.i64	q6, q5, q9
+	vadd.i64	q4, q4, q10
+	vshl.i64	q10, q10, #25
+	vadd.i64	q9, q4, q9
+	vsub.i64	q1, q11, q1
+	vshr.s64	q6, q6, #26
+	vsub.i64	q3, q3, q10
+	vtrn.32		d16, d2
+	vshr.s64	q9, q9, #26
+	vtrn.32		d17, d3
+	vadd.i64	q1, q2, q6
+	vst1.8		d16, [r2, : 64]
+	vshl.i64	q2, q6, #26
+	vst1.8		d17, [r4, : 64]
+	vadd.i64	q6, q7, q9
+	vtrn.32		d0, d6
+	vshl.i64	q7, q9, #26
+	vtrn.32		d1, d7
+	vsub.i64	q2, q5, q2
+	add		r2, r2, #16
+	vsub.i64	q3, q4, q7
+	vst1.8		d0, [r2, : 64]
+	add		r4, r4, #16
+	vst1.8		d1, [r4, : 64]
+	vtrn.32		d4, d2
+	vtrn.32		d5, d3
+	sub		r2, r2, #8
+	sub		r4, r4, #8
+	vtrn.32		d6, d12
+	vtrn.32		d7, d13
+	vst1.8		d4, [r2, : 64]
+	vst1.8		d5, [r4, : 64]
+	sub		r2, r2, #24
+	sub		r4, r4, #24
+	vst1.8		d6, [r2, : 64]
+	vst1.8		d7, [r4, : 64]
+	add		r2, r3, #240
+	add		r4, r3, #96
+	vld1.8		{d0-d1}, [r4, : 128]!
+	vld1.8		{d2-d3}, [r4, : 128]!
+	vld1.8		{d4}, [r4, : 64]
+	add		r4, r3, #144
+	vld1.8		{d6-d7}, [r4, : 128]!
+	vtrn.32		q0, q3
+	vld1.8		{d8-d9}, [r4, : 128]!
+	vshl.i32	q5, q0, #4
+	vtrn.32		q1, q4
+	vshl.i32	q6, q3, #4
+	vadd.i32	q5, q5, q0
+	vadd.i32	q6, q6, q3
+	vshl.i32	q7, q1, #4
+	vld1.8		{d5}, [r4, : 64]
+	vshl.i32	q8, q4, #4
+	vtrn.32		d4, d5
+	vadd.i32	q7, q7, q1
+	vadd.i32	q8, q8, q4
+	vld1.8		{d18-d19}, [r2, : 128]!
+	vshl.i32	q10, q2, #4
+	vld1.8		{d22-d23}, [r2, : 128]!
+	vadd.i32	q10, q10, q2
+	vld1.8		{d24}, [r2, : 64]
+	vadd.i32	q5, q5, q0
+	add		r2, r3, #192
+	vld1.8		{d26-d27}, [r2, : 128]!
+	vadd.i32	q6, q6, q3
+	vld1.8		{d28-d29}, [r2, : 128]!
+	vadd.i32	q8, q8, q4
+	vld1.8		{d25}, [r2, : 64]
+	vadd.i32	q10, q10, q2
+	vtrn.32		q9, q13
+	vadd.i32	q7, q7, q1
+	vadd.i32	q5, q5, q0
+	vtrn.32		q11, q14
+	vadd.i32	q6, q6, q3
+	add		r2, sp, #560
+	vadd.i32	q10, q10, q2
+	vtrn.32		d24, d25
+	vst1.8		{d12-d13}, [r2, : 128]
+	vshl.i32	q6, q13, #1
+	add		r2, sp, #576
+	vst1.8		{d20-d21}, [r2, : 128]
+	vshl.i32	q10, q14, #1
+	add		r2, sp, #592
+	vst1.8		{d12-d13}, [r2, : 128]
+	vshl.i32	q15, q12, #1
+	vadd.i32	q8, q8, q4
+	vext.32		d10, d31, d30, #0
+	vadd.i32	q7, q7, q1
+	add		r2, sp, #608
+	vst1.8		{d16-d17}, [r2, : 128]
+	vmull.s32	q8, d18, d5
+	vmlal.s32	q8, d26, d4
+	vmlal.s32	q8, d19, d9
+	vmlal.s32	q8, d27, d3
+	vmlal.s32	q8, d22, d8
+	vmlal.s32	q8, d28, d2
+	vmlal.s32	q8, d23, d7
+	vmlal.s32	q8, d29, d1
+	vmlal.s32	q8, d24, d6
+	vmlal.s32	q8, d25, d0
+	add		r2, sp, #624
+	vst1.8		{d14-d15}, [r2, : 128]
+	vmull.s32	q2, d18, d4
+	vmlal.s32	q2, d12, d9
+	vmlal.s32	q2, d13, d8
+	vmlal.s32	q2, d19, d3
+	vmlal.s32	q2, d22, d2
+	vmlal.s32	q2, d23, d1
+	vmlal.s32	q2, d24, d0
+	add		r2, sp, #640
+	vst1.8		{d20-d21}, [r2, : 128]
+	vmull.s32	q7, d18, d9
+	vmlal.s32	q7, d26, d3
+	vmlal.s32	q7, d19, d8
+	vmlal.s32	q7, d27, d2
+	vmlal.s32	q7, d22, d7
+	vmlal.s32	q7, d28, d1
+	vmlal.s32	q7, d23, d6
+	vmlal.s32	q7, d29, d0
+	add		r2, sp, #656
+	vst1.8		{d10-d11}, [r2, : 128]
+	vmull.s32	q5, d18, d3
+	vmlal.s32	q5, d19, d2
+	vmlal.s32	q5, d22, d1
+	vmlal.s32	q5, d23, d0
+	vmlal.s32	q5, d12, d8
+	add		r2, sp, #672
+	vst1.8		{d16-d17}, [r2, : 128]
+	vmull.s32	q4, d18, d8
+	vmlal.s32	q4, d26, d2
+	vmlal.s32	q4, d19, d7
+	vmlal.s32	q4, d27, d1
+	vmlal.s32	q4, d22, d6
+	vmlal.s32	q4, d28, d0
+	vmull.s32	q8, d18, d7
+	vmlal.s32	q8, d26, d1
+	vmlal.s32	q8, d19, d6
+	vmlal.s32	q8, d27, d0
+	add		r2, sp, #576
+	vld1.8		{d20-d21}, [r2, : 128]
+	vmlal.s32	q7, d24, d21
+	vmlal.s32	q7, d25, d20
+	vmlal.s32	q4, d23, d21
+	vmlal.s32	q4, d29, d20
+	vmlal.s32	q8, d22, d21
+	vmlal.s32	q8, d28, d20
+	vmlal.s32	q5, d24, d20
+	add		r2, sp, #576
+	vst1.8		{d14-d15}, [r2, : 128]
+	vmull.s32	q7, d18, d6
+	vmlal.s32	q7, d26, d0
+	add		r2, sp, #656
+	vld1.8		{d30-d31}, [r2, : 128]
+	vmlal.s32	q2, d30, d21
+	vmlal.s32	q7, d19, d21
+	vmlal.s32	q7, d27, d20
+	add		r2, sp, #624
+	vld1.8		{d26-d27}, [r2, : 128]
+	vmlal.s32	q4, d25, d27
+	vmlal.s32	q8, d29, d27
+	vmlal.s32	q8, d25, d26
+	vmlal.s32	q7, d28, d27
+	vmlal.s32	q7, d29, d26
+	add		r2, sp, #608
+	vld1.8		{d28-d29}, [r2, : 128]
+	vmlal.s32	q4, d24, d29
+	vmlal.s32	q8, d23, d29
+	vmlal.s32	q8, d24, d28
+	vmlal.s32	q7, d22, d29
+	vmlal.s32	q7, d23, d28
+	add		r2, sp, #608
+	vst1.8		{d8-d9}, [r2, : 128]
+	add		r2, sp, #560
+	vld1.8		{d8-d9}, [r2, : 128]
+	vmlal.s32	q7, d24, d9
+	vmlal.s32	q7, d25, d31
+	vmull.s32	q1, d18, d2
+	vmlal.s32	q1, d19, d1
+	vmlal.s32	q1, d22, d0
+	vmlal.s32	q1, d24, d27
+	vmlal.s32	q1, d23, d20
+	vmlal.s32	q1, d12, d7
+	vmlal.s32	q1, d13, d6
+	vmull.s32	q6, d18, d1
+	vmlal.s32	q6, d19, d0
+	vmlal.s32	q6, d23, d27
+	vmlal.s32	q6, d22, d20
+	vmlal.s32	q6, d24, d26
+	vmull.s32	q0, d18, d0
+	vmlal.s32	q0, d22, d27
+	vmlal.s32	q0, d23, d26
+	vmlal.s32	q0, d24, d31
+	vmlal.s32	q0, d19, d20
+	add		r2, sp, #640
+	vld1.8		{d18-d19}, [r2, : 128]
+	vmlal.s32	q2, d18, d7
+	vmlal.s32	q2, d19, d6
+	vmlal.s32	q5, d18, d6
+	vmlal.s32	q5, d19, d21
+	vmlal.s32	q1, d18, d21
+	vmlal.s32	q1, d19, d29
+	vmlal.s32	q0, d18, d28
+	vmlal.s32	q0, d19, d9
+	vmlal.s32	q6, d18, d29
+	vmlal.s32	q6, d19, d28
+	add		r2, sp, #592
+	vld1.8		{d18-d19}, [r2, : 128]
+	add		r2, sp, #512
+	vld1.8		{d22-d23}, [r2, : 128]
+	vmlal.s32	q5, d19, d7
+	vmlal.s32	q0, d18, d21
+	vmlal.s32	q0, d19, d29
+	vmlal.s32	q6, d18, d6
+	add		r2, sp, #528
+	vld1.8		{d6-d7}, [r2, : 128]
+	vmlal.s32	q6, d19, d21
+	add		r2, sp, #576
+	vld1.8		{d18-d19}, [r2, : 128]
+	vmlal.s32	q0, d30, d8
+	add		r2, sp, #672
+	vld1.8		{d20-d21}, [r2, : 128]
+	vmlal.s32	q5, d30, d29
+	add		r2, sp, #608
+	vld1.8		{d24-d25}, [r2, : 128]
+	vmlal.s32	q1, d30, d28
+	vadd.i64	q13, q0, q11
+	vadd.i64	q14, q5, q11
+	vmlal.s32	q6, d30, d9
+	vshr.s64	q4, q13, #26
+	vshr.s64	q13, q14, #26
+	vadd.i64	q7, q7, q4
+	vshl.i64	q4, q4, #26
+	vadd.i64	q14, q7, q3
+	vadd.i64	q9, q9, q13
+	vshl.i64	q13, q13, #26
+	vadd.i64	q15, q9, q3
+	vsub.i64	q0, q0, q4
+	vshr.s64	q4, q14, #25
+	vsub.i64	q5, q5, q13
+	vshr.s64	q13, q15, #25
+	vadd.i64	q6, q6, q4
+	vshl.i64	q4, q4, #25
+	vadd.i64	q14, q6, q11
+	vadd.i64	q2, q2, q13
+	vsub.i64	q4, q7, q4
+	vshr.s64	q7, q14, #26
+	vshl.i64	q13, q13, #25
+	vadd.i64	q14, q2, q11
+	vadd.i64	q8, q8, q7
+	vshl.i64	q7, q7, #26
+	vadd.i64	q15, q8, q3
+	vsub.i64	q9, q9, q13
+	vshr.s64	q13, q14, #26
+	vsub.i64	q6, q6, q7
+	vshr.s64	q7, q15, #25
+	vadd.i64	q10, q10, q13
+	vshl.i64	q13, q13, #26
+	vadd.i64	q14, q10, q3
+	vadd.i64	q1, q1, q7
+	add		r2, r3, #144
+	vshl.i64	q7, q7, #25
+	add		r4, r3, #96
+	vadd.i64	q15, q1, q11
+	add		r2, r2, #8
+	vsub.i64	q2, q2, q13
+	add		r4, r4, #8
+	vshr.s64	q13, q14, #25
+	vsub.i64	q7, q8, q7
+	vshr.s64	q8, q15, #26
+	vadd.i64	q14, q13, q13
+	vadd.i64	q12, q12, q8
+	vtrn.32		d12, d14
+	vshl.i64	q8, q8, #26
+	vtrn.32		d13, d15
+	vadd.i64	q3, q12, q3
+	vadd.i64	q0, q0, q14
+	vst1.8		d12, [r2, : 64]!
+	vshl.i64	q7, q13, #4
+	vst1.8		d13, [r4, : 64]!
+	vsub.i64	q1, q1, q8
+	vshr.s64	q3, q3, #25
+	vadd.i64	q0, q0, q7
+	vadd.i64	q5, q5, q3
+	vshl.i64	q3, q3, #25
+	vadd.i64	q6, q5, q11
+	vadd.i64	q0, q0, q13
+	vshl.i64	q7, q13, #25
+	vadd.i64	q8, q0, q11
+	vsub.i64	q3, q12, q3
+	vshr.s64	q6, q6, #26
+	vsub.i64	q7, q10, q7
+	vtrn.32		d2, d6
+	vshr.s64	q8, q8, #26
+	vtrn.32		d3, d7
+	vadd.i64	q3, q9, q6
+	vst1.8		d2, [r2, : 64]
+	vshl.i64	q6, q6, #26
+	vst1.8		d3, [r4, : 64]
+	vadd.i64	q1, q4, q8
+	vtrn.32		d4, d14
+	vshl.i64	q4, q8, #26
+	vtrn.32		d5, d15
+	vsub.i64	q5, q5, q6
+	add		r2, r2, #16
+	vsub.i64	q0, q0, q4
+	vst1.8		d4, [r2, : 64]
+	add		r4, r4, #16
+	vst1.8		d5, [r4, : 64]
+	vtrn.32		d10, d6
+	vtrn.32		d11, d7
+	sub		r2, r2, #8
+	sub		r4, r4, #8
+	vtrn.32		d0, d2
+	vtrn.32		d1, d3
+	vst1.8		d10, [r2, : 64]
+	vst1.8		d11, [r4, : 64]
+	sub		r2, r2, #24
+	sub		r4, r4, #24
+	vst1.8		d0, [r2, : 64]
+	vst1.8		d1, [r4, : 64]
+	add		r2, r3, #288
+	add		r4, r3, #336
+	vld1.8		{d0-d1}, [r2, : 128]!
+	vld1.8		{d2-d3}, [r4, : 128]!
+	vsub.i32	q0, q0, q1
+	vld1.8		{d2-d3}, [r2, : 128]!
+	vld1.8		{d4-d5}, [r4, : 128]!
+	vsub.i32	q1, q1, q2
+	add		r5, r3, #240
+	vld1.8		{d4}, [r2, : 64]
+	vld1.8		{d6}, [r4, : 64]
+	vsub.i32	q2, q2, q3
+	vst1.8		{d0-d1}, [r5, : 128]!
+	vst1.8		{d2-d3}, [r5, : 128]!
+	vst1.8		d4, [r5, : 64]
+	add		r2, r3, #144
+	add		r4, r3, #96
+	add		r5, r3, #144
+	add		r6, r3, #192
+	vld1.8		{d0-d1}, [r2, : 128]!
+	vld1.8		{d2-d3}, [r4, : 128]!
+	vsub.i32	q2, q0, q1
+	vadd.i32	q0, q0, q1
+	vld1.8		{d2-d3}, [r2, : 128]!
+	vld1.8		{d6-d7}, [r4, : 128]!
+	vsub.i32	q4, q1, q3
+	vadd.i32	q1, q1, q3
+	vld1.8		{d6}, [r2, : 64]
+	vld1.8		{d10}, [r4, : 64]
+	vsub.i32	q6, q3, q5
+	vadd.i32	q3, q3, q5
+	vst1.8		{d4-d5}, [r5, : 128]!
+	vst1.8		{d0-d1}, [r6, : 128]!
+	vst1.8		{d8-d9}, [r5, : 128]!
+	vst1.8		{d2-d3}, [r6, : 128]!
+	vst1.8		d12, [r5, : 64]
+	vst1.8		d6, [r6, : 64]
+	add		r2, r3, #0
+	add		r4, r3, #240
+	vld1.8		{d0-d1}, [r4, : 128]!
+	vld1.8		{d2-d3}, [r4, : 128]!
+	vld1.8		{d4}, [r4, : 64]
+	add		r4, r3, #336
+	vld1.8		{d6-d7}, [r4, : 128]!
+	vtrn.32		q0, q3
+	vld1.8		{d8-d9}, [r4, : 128]!
+	vshl.i32	q5, q0, #4
+	vtrn.32		q1, q4
+	vshl.i32	q6, q3, #4
+	vadd.i32	q5, q5, q0
+	vadd.i32	q6, q6, q3
+	vshl.i32	q7, q1, #4
+	vld1.8		{d5}, [r4, : 64]
+	vshl.i32	q8, q4, #4
+	vtrn.32		d4, d5
+	vadd.i32	q7, q7, q1
+	vadd.i32	q8, q8, q4
+	vld1.8		{d18-d19}, [r2, : 128]!
+	vshl.i32	q10, q2, #4
+	vld1.8		{d22-d23}, [r2, : 128]!
+	vadd.i32	q10, q10, q2
+	vld1.8		{d24}, [r2, : 64]
+	vadd.i32	q5, q5, q0
+	add		r2, r3, #288
+	vld1.8		{d26-d27}, [r2, : 128]!
+	vadd.i32	q6, q6, q3
+	vld1.8		{d28-d29}, [r2, : 128]!
+	vadd.i32	q8, q8, q4
+	vld1.8		{d25}, [r2, : 64]
+	vadd.i32	q10, q10, q2
+	vtrn.32		q9, q13
+	vadd.i32	q7, q7, q1
+	vadd.i32	q5, q5, q0
+	vtrn.32		q11, q14
+	vadd.i32	q6, q6, q3
+	add		r2, sp, #560
+	vadd.i32	q10, q10, q2
+	vtrn.32		d24, d25
+	vst1.8		{d12-d13}, [r2, : 128]
+	vshl.i32	q6, q13, #1
+	add		r2, sp, #576
+	vst1.8		{d20-d21}, [r2, : 128]
+	vshl.i32	q10, q14, #1
+	add		r2, sp, #592
+	vst1.8		{d12-d13}, [r2, : 128]
+	vshl.i32	q15, q12, #1
+	vadd.i32	q8, q8, q4
+	vext.32		d10, d31, d30, #0
+	vadd.i32	q7, q7, q1
+	add		r2, sp, #608
+	vst1.8		{d16-d17}, [r2, : 128]
+	vmull.s32	q8, d18, d5
+	vmlal.s32	q8, d26, d4
+	vmlal.s32	q8, d19, d9
+	vmlal.s32	q8, d27, d3
+	vmlal.s32	q8, d22, d8
+	vmlal.s32	q8, d28, d2
+	vmlal.s32	q8, d23, d7
+	vmlal.s32	q8, d29, d1
+	vmlal.s32	q8, d24, d6
+	vmlal.s32	q8, d25, d0
+	add		r2, sp, #624
+	vst1.8		{d14-d15}, [r2, : 128]
+	vmull.s32	q2, d18, d4
+	vmlal.s32	q2, d12, d9
+	vmlal.s32	q2, d13, d8
+	vmlal.s32	q2, d19, d3
+	vmlal.s32	q2, d22, d2
+	vmlal.s32	q2, d23, d1
+	vmlal.s32	q2, d24, d0
+	add		r2, sp, #640
+	vst1.8		{d20-d21}, [r2, : 128]
+	vmull.s32	q7, d18, d9
+	vmlal.s32	q7, d26, d3
+	vmlal.s32	q7, d19, d8
+	vmlal.s32	q7, d27, d2
+	vmlal.s32	q7, d22, d7
+	vmlal.s32	q7, d28, d1
+	vmlal.s32	q7, d23, d6
+	vmlal.s32	q7, d29, d0
+	add		r2, sp, #656
+	vst1.8		{d10-d11}, [r2, : 128]
+	vmull.s32	q5, d18, d3
+	vmlal.s32	q5, d19, d2
+	vmlal.s32	q5, d22, d1
+	vmlal.s32	q5, d23, d0
+	vmlal.s32	q5, d12, d8
+	add		r2, sp, #672
+	vst1.8		{d16-d17}, [r2, : 128]
+	vmull.s32	q4, d18, d8
+	vmlal.s32	q4, d26, d2
+	vmlal.s32	q4, d19, d7
+	vmlal.s32	q4, d27, d1
+	vmlal.s32	q4, d22, d6
+	vmlal.s32	q4, d28, d0
+	vmull.s32	q8, d18, d7
+	vmlal.s32	q8, d26, d1
+	vmlal.s32	q8, d19, d6
+	vmlal.s32	q8, d27, d0
+	add		r2, sp, #576
+	vld1.8		{d20-d21}, [r2, : 128]
+	vmlal.s32	q7, d24, d21
+	vmlal.s32	q7, d25, d20
+	vmlal.s32	q4, d23, d21
+	vmlal.s32	q4, d29, d20
+	vmlal.s32	q8, d22, d21
+	vmlal.s32	q8, d28, d20
+	vmlal.s32	q5, d24, d20
+	add		r2, sp, #576
+	vst1.8		{d14-d15}, [r2, : 128]
+	vmull.s32	q7, d18, d6
+	vmlal.s32	q7, d26, d0
+	add		r2, sp, #656
+	vld1.8		{d30-d31}, [r2, : 128]
+	vmlal.s32	q2, d30, d21
+	vmlal.s32	q7, d19, d21
+	vmlal.s32	q7, d27, d20
+	add		r2, sp, #624
+	vld1.8		{d26-d27}, [r2, : 128]
+	vmlal.s32	q4, d25, d27
+	vmlal.s32	q8, d29, d27
+	vmlal.s32	q8, d25, d26
+	vmlal.s32	q7, d28, d27
+	vmlal.s32	q7, d29, d26
+	add		r2, sp, #608
+	vld1.8		{d28-d29}, [r2, : 128]
+	vmlal.s32	q4, d24, d29
+	vmlal.s32	q8, d23, d29
+	vmlal.s32	q8, d24, d28
+	vmlal.s32	q7, d22, d29
+	vmlal.s32	q7, d23, d28
+	add		r2, sp, #608
+	vst1.8		{d8-d9}, [r2, : 128]
+	add		r2, sp, #560
+	vld1.8		{d8-d9}, [r2, : 128]
+	vmlal.s32	q7, d24, d9
+	vmlal.s32	q7, d25, d31
+	vmull.s32	q1, d18, d2
+	vmlal.s32	q1, d19, d1
+	vmlal.s32	q1, d22, d0
+	vmlal.s32	q1, d24, d27
+	vmlal.s32	q1, d23, d20
+	vmlal.s32	q1, d12, d7
+	vmlal.s32	q1, d13, d6
+	vmull.s32	q6, d18, d1
+	vmlal.s32	q6, d19, d0
+	vmlal.s32	q6, d23, d27
+	vmlal.s32	q6, d22, d20
+	vmlal.s32	q6, d24, d26
+	vmull.s32	q0, d18, d0
+	vmlal.s32	q0, d22, d27
+	vmlal.s32	q0, d23, d26
+	vmlal.s32	q0, d24, d31
+	vmlal.s32	q0, d19, d20
+	add		r2, sp, #640
+	vld1.8		{d18-d19}, [r2, : 128]
+	vmlal.s32	q2, d18, d7
+	vmlal.s32	q2, d19, d6
+	vmlal.s32	q5, d18, d6
+	vmlal.s32	q5, d19, d21
+	vmlal.s32	q1, d18, d21
+	vmlal.s32	q1, d19, d29
+	vmlal.s32	q0, d18, d28
+	vmlal.s32	q0, d19, d9
+	vmlal.s32	q6, d18, d29
+	vmlal.s32	q6, d19, d28
+	add		r2, sp, #592
+	vld1.8		{d18-d19}, [r2, : 128]
+	add		r2, sp, #512
+	vld1.8		{d22-d23}, [r2, : 128]
+	vmlal.s32	q5, d19, d7
+	vmlal.s32	q0, d18, d21
+	vmlal.s32	q0, d19, d29
+	vmlal.s32	q6, d18, d6
+	add		r2, sp, #528
+	vld1.8		{d6-d7}, [r2, : 128]
+	vmlal.s32	q6, d19, d21
+	add		r2, sp, #576
+	vld1.8		{d18-d19}, [r2, : 128]
+	vmlal.s32	q0, d30, d8
+	add		r2, sp, #672
+	vld1.8		{d20-d21}, [r2, : 128]
+	vmlal.s32	q5, d30, d29
+	add		r2, sp, #608
+	vld1.8		{d24-d25}, [r2, : 128]
+	vmlal.s32	q1, d30, d28
+	vadd.i64	q13, q0, q11
+	vadd.i64	q14, q5, q11
+	vmlal.s32	q6, d30, d9
+	vshr.s64	q4, q13, #26
+	vshr.s64	q13, q14, #26
+	vadd.i64	q7, q7, q4
+	vshl.i64	q4, q4, #26
+	vadd.i64	q14, q7, q3
+	vadd.i64	q9, q9, q13
+	vshl.i64	q13, q13, #26
+	vadd.i64	q15, q9, q3
+	vsub.i64	q0, q0, q4
+	vshr.s64	q4, q14, #25
+	vsub.i64	q5, q5, q13
+	vshr.s64	q13, q15, #25
+	vadd.i64	q6, q6, q4
+	vshl.i64	q4, q4, #25
+	vadd.i64	q14, q6, q11
+	vadd.i64	q2, q2, q13
+	vsub.i64	q4, q7, q4
+	vshr.s64	q7, q14, #26
+	vshl.i64	q13, q13, #25
+	vadd.i64	q14, q2, q11
+	vadd.i64	q8, q8, q7
+	vshl.i64	q7, q7, #26
+	vadd.i64	q15, q8, q3
+	vsub.i64	q9, q9, q13
+	vshr.s64	q13, q14, #26
+	vsub.i64	q6, q6, q7
+	vshr.s64	q7, q15, #25
+	vadd.i64	q10, q10, q13
+	vshl.i64	q13, q13, #26
+	vadd.i64	q14, q10, q3
+	vadd.i64	q1, q1, q7
+	add		r2, r3, #288
+	vshl.i64	q7, q7, #25
+	add		r4, r3, #96
+	vadd.i64	q15, q1, q11
+	add		r2, r2, #8
+	vsub.i64	q2, q2, q13
+	add		r4, r4, #8
+	vshr.s64	q13, q14, #25
+	vsub.i64	q7, q8, q7
+	vshr.s64	q8, q15, #26
+	vadd.i64	q14, q13, q13
+	vadd.i64	q12, q12, q8
+	vtrn.32		d12, d14
+	vshl.i64	q8, q8, #26
+	vtrn.32		d13, d15
+	vadd.i64	q3, q12, q3
+	vadd.i64	q0, q0, q14
+	vst1.8		d12, [r2, : 64]!
+	vshl.i64	q7, q13, #4
+	vst1.8		d13, [r4, : 64]!
+	vsub.i64	q1, q1, q8
+	vshr.s64	q3, q3, #25
+	vadd.i64	q0, q0, q7
+	vadd.i64	q5, q5, q3
+	vshl.i64	q3, q3, #25
+	vadd.i64	q6, q5, q11
+	vadd.i64	q0, q0, q13
+	vshl.i64	q7, q13, #25
+	vadd.i64	q8, q0, q11
+	vsub.i64	q3, q12, q3
+	vshr.s64	q6, q6, #26
+	vsub.i64	q7, q10, q7
+	vtrn.32		d2, d6
+	vshr.s64	q8, q8, #26
+	vtrn.32		d3, d7
+	vadd.i64	q3, q9, q6
+	vst1.8		d2, [r2, : 64]
+	vshl.i64	q6, q6, #26
+	vst1.8		d3, [r4, : 64]
+	vadd.i64	q1, q4, q8
+	vtrn.32		d4, d14
+	vshl.i64	q4, q8, #26
+	vtrn.32		d5, d15
+	vsub.i64	q5, q5, q6
+	add		r2, r2, #16
+	vsub.i64	q0, q0, q4
+	vst1.8		d4, [r2, : 64]
+	add		r4, r4, #16
+	vst1.8		d5, [r4, : 64]
+	vtrn.32		d10, d6
+	vtrn.32		d11, d7
+	sub		r2, r2, #8
+	sub		r4, r4, #8
+	vtrn.32		d0, d2
+	vtrn.32		d1, d3
+	vst1.8		d10, [r2, : 64]
+	vst1.8		d11, [r4, : 64]
+	sub		r2, r2, #24
+	sub		r4, r4, #24
+	vst1.8		d0, [r2, : 64]
+	vst1.8		d1, [r4, : 64]
+	add		r2, sp, #544
+	add		r4, r3, #144
+	add		r5, r3, #192
+	vld1.8		{d0-d1}, [r2, : 128]
+	vld1.8		{d2-d3}, [r4, : 128]!
+	vld1.8		{d4-d5}, [r5, : 128]!
+	vzip.i32	q1, q2
+	vld1.8		{d6-d7}, [r4, : 128]!
+	vld1.8		{d8-d9}, [r5, : 128]!
+	vshl.i32	q5, q1, #1
+	vzip.i32	q3, q4
+	vshl.i32	q6, q2, #1
+	vld1.8		{d14}, [r4, : 64]
+	vshl.i32	q8, q3, #1
+	vld1.8		{d15}, [r5, : 64]
+	vshl.i32	q9, q4, #1
+	vmul.i32	d21, d7, d1
+	vtrn.32		d14, d15
+	vmul.i32	q11, q4, q0
+	vmul.i32	q0, q7, q0
+	vmull.s32	q12, d2, d2
+	vmlal.s32	q12, d11, d1
+	vmlal.s32	q12, d12, d0
+	vmlal.s32	q12, d13, d23
+	vmlal.s32	q12, d16, d22
+	vmlal.s32	q12, d7, d21
+	vmull.s32	q10, d2, d11
+	vmlal.s32	q10, d4, d1
+	vmlal.s32	q10, d13, d0
+	vmlal.s32	q10, d6, d23
+	vmlal.s32	q10, d17, d22
+	vmull.s32	q13, d10, d4
+	vmlal.s32	q13, d11, d3
+	vmlal.s32	q13, d13, d1
+	vmlal.s32	q13, d16, d0
+	vmlal.s32	q13, d17, d23
+	vmlal.s32	q13, d8, d22
+	vmull.s32	q1, d10, d5
+	vmlal.s32	q1, d11, d4
+	vmlal.s32	q1, d6, d1
+	vmlal.s32	q1, d17, d0
+	vmlal.s32	q1, d8, d23
+	vmull.s32	q14, d10, d6
+	vmlal.s32	q14, d11, d13
+	vmlal.s32	q14, d4, d4
+	vmlal.s32	q14, d17, d1
+	vmlal.s32	q14, d18, d0
+	vmlal.s32	q14, d9, d23
+	vmull.s32	q11, d10, d7
+	vmlal.s32	q11, d11, d6
+	vmlal.s32	q11, d12, d5
+	vmlal.s32	q11, d8, d1
+	vmlal.s32	q11, d19, d0
+	vmull.s32	q15, d10, d8
+	vmlal.s32	q15, d11, d17
+	vmlal.s32	q15, d12, d6
+	vmlal.s32	q15, d13, d5
+	vmlal.s32	q15, d19, d1
+	vmlal.s32	q15, d14, d0
+	vmull.s32	q2, d10, d9
+	vmlal.s32	q2, d11, d8
+	vmlal.s32	q2, d12, d7
+	vmlal.s32	q2, d13, d6
+	vmlal.s32	q2, d14, d1
+	vmull.s32	q0, d15, d1
+	vmlal.s32	q0, d10, d14
+	vmlal.s32	q0, d11, d19
+	vmlal.s32	q0, d12, d8
+	vmlal.s32	q0, d13, d17
+	vmlal.s32	q0, d6, d6
+	add		r2, sp, #512
+	vld1.8		{d18-d19}, [r2, : 128]
+	vmull.s32	q3, d16, d7
+	vmlal.s32	q3, d10, d15
+	vmlal.s32	q3, d11, d14
+	vmlal.s32	q3, d12, d9
+	vmlal.s32	q3, d13, d8
+	add		r2, sp, #528
+	vld1.8		{d8-d9}, [r2, : 128]
+	vadd.i64	q5, q12, q9
+	vadd.i64	q6, q15, q9
+	vshr.s64	q5, q5, #26
+	vshr.s64	q6, q6, #26
+	vadd.i64	q7, q10, q5
+	vshl.i64	q5, q5, #26
+	vadd.i64	q8, q7, q4
+	vadd.i64	q2, q2, q6
+	vshl.i64	q6, q6, #26
+	vadd.i64	q10, q2, q4
+	vsub.i64	q5, q12, q5
+	vshr.s64	q8, q8, #25
+	vsub.i64	q6, q15, q6
+	vshr.s64	q10, q10, #25
+	vadd.i64	q12, q13, q8
+	vshl.i64	q8, q8, #25
+	vadd.i64	q13, q12, q9
+	vadd.i64	q0, q0, q10
+	vsub.i64	q7, q7, q8
+	vshr.s64	q8, q13, #26
+	vshl.i64	q10, q10, #25
+	vadd.i64	q13, q0, q9
+	vadd.i64	q1, q1, q8
+	vshl.i64	q8, q8, #26
+	vadd.i64	q15, q1, q4
+	vsub.i64	q2, q2, q10
+	vshr.s64	q10, q13, #26
+	vsub.i64	q8, q12, q8
+	vshr.s64	q12, q15, #25
+	vadd.i64	q3, q3, q10
+	vshl.i64	q10, q10, #26
+	vadd.i64	q13, q3, q4
+	vadd.i64	q14, q14, q12
+	add		r2, r3, #144
+	vshl.i64	q12, q12, #25
+	add		r4, r3, #192
+	vadd.i64	q15, q14, q9
+	add		r2, r2, #8
+	vsub.i64	q0, q0, q10
+	add		r4, r4, #8
+	vshr.s64	q10, q13, #25
+	vsub.i64	q1, q1, q12
+	vshr.s64	q12, q15, #26
+	vadd.i64	q13, q10, q10
+	vadd.i64	q11, q11, q12
+	vtrn.32		d16, d2
+	vshl.i64	q12, q12, #26
+	vtrn.32		d17, d3
+	vadd.i64	q1, q11, q4
+	vadd.i64	q4, q5, q13
+	vst1.8		d16, [r2, : 64]!
+	vshl.i64	q5, q10, #4
+	vst1.8		d17, [r4, : 64]!
+	vsub.i64	q8, q14, q12
+	vshr.s64	q1, q1, #25
+	vadd.i64	q4, q4, q5
+	vadd.i64	q5, q6, q1
+	vshl.i64	q1, q1, #25
+	vadd.i64	q6, q5, q9
+	vadd.i64	q4, q4, q10
+	vshl.i64	q10, q10, #25
+	vadd.i64	q9, q4, q9
+	vsub.i64	q1, q11, q1
+	vshr.s64	q6, q6, #26
+	vsub.i64	q3, q3, q10
+	vtrn.32		d16, d2
+	vshr.s64	q9, q9, #26
+	vtrn.32		d17, d3
+	vadd.i64	q1, q2, q6
+	vst1.8		d16, [r2, : 64]
+	vshl.i64	q2, q6, #26
+	vst1.8		d17, [r4, : 64]
+	vadd.i64	q6, q7, q9
+	vtrn.32		d0, d6
+	vshl.i64	q7, q9, #26
+	vtrn.32		d1, d7
+	vsub.i64	q2, q5, q2
+	add		r2, r2, #16
+	vsub.i64	q3, q4, q7
+	vst1.8		d0, [r2, : 64]
+	add		r4, r4, #16
+	vst1.8		d1, [r4, : 64]
+	vtrn.32		d4, d2
+	vtrn.32		d5, d3
+	sub		r2, r2, #8
+	sub		r4, r4, #8
+	vtrn.32		d6, d12
+	vtrn.32		d7, d13
+	vst1.8		d4, [r2, : 64]
+	vst1.8		d5, [r4, : 64]
+	sub		r2, r2, #24
+	sub		r4, r4, #24
+	vst1.8		d6, [r2, : 64]
+	vst1.8		d7, [r4, : 64]
+	add		r2, r3, #336
+	add		r4, r3, #288
+	vld1.8		{d0-d1}, [r2, : 128]!
+	vld1.8		{d2-d3}, [r4, : 128]!
+	vadd.i32	q0, q0, q1
+	vld1.8		{d2-d3}, [r2, : 128]!
+	vld1.8		{d4-d5}, [r4, : 128]!
+	vadd.i32	q1, q1, q2
+	add		r5, r3, #288
+	vld1.8		{d4}, [r2, : 64]
+	vld1.8		{d6}, [r4, : 64]
+	vadd.i32	q2, q2, q3
+	vst1.8		{d0-d1}, [r5, : 128]!
+	vst1.8		{d2-d3}, [r5, : 128]!
+	vst1.8		d4, [r5, : 64]
+	add		r2, r3, #48
+	add		r4, r3, #144
+	vld1.8		{d0-d1}, [r4, : 128]!
+	vld1.8		{d2-d3}, [r4, : 128]!
+	vld1.8		{d4}, [r4, : 64]
+	add		r4, r3, #288
+	vld1.8		{d6-d7}, [r4, : 128]!
+	vtrn.32		q0, q3
+	vld1.8		{d8-d9}, [r4, : 128]!
+	vshl.i32	q5, q0, #4
+	vtrn.32		q1, q4
+	vshl.i32	q6, q3, #4
+	vadd.i32	q5, q5, q0
+	vadd.i32	q6, q6, q3
+	vshl.i32	q7, q1, #4
+	vld1.8		{d5}, [r4, : 64]
+	vshl.i32	q8, q4, #4
+	vtrn.32		d4, d5
+	vadd.i32	q7, q7, q1
+	vadd.i32	q8, q8, q4
+	vld1.8		{d18-d19}, [r2, : 128]!
+	vshl.i32	q10, q2, #4
+	vld1.8		{d22-d23}, [r2, : 128]!
+	vadd.i32	q10, q10, q2
+	vld1.8		{d24}, [r2, : 64]
+	vadd.i32	q5, q5, q0
+	add		r2, r3, #240
+	vld1.8		{d26-d27}, [r2, : 128]!
+	vadd.i32	q6, q6, q3
+	vld1.8		{d28-d29}, [r2, : 128]!
+	vadd.i32	q8, q8, q4
+	vld1.8		{d25}, [r2, : 64]
+	vadd.i32	q10, q10, q2
+	vtrn.32		q9, q13
+	vadd.i32	q7, q7, q1
+	vadd.i32	q5, q5, q0
+	vtrn.32		q11, q14
+	vadd.i32	q6, q6, q3
+	add		r2, sp, #560
+	vadd.i32	q10, q10, q2
+	vtrn.32		d24, d25
+	vst1.8		{d12-d13}, [r2, : 128]
+	vshl.i32	q6, q13, #1
+	add		r2, sp, #576
+	vst1.8		{d20-d21}, [r2, : 128]
+	vshl.i32	q10, q14, #1
+	add		r2, sp, #592
+	vst1.8		{d12-d13}, [r2, : 128]
+	vshl.i32	q15, q12, #1
+	vadd.i32	q8, q8, q4
+	vext.32		d10, d31, d30, #0
+	vadd.i32	q7, q7, q1
+	add		r2, sp, #608
+	vst1.8		{d16-d17}, [r2, : 128]
+	vmull.s32	q8, d18, d5
+	vmlal.s32	q8, d26, d4
+	vmlal.s32	q8, d19, d9
+	vmlal.s32	q8, d27, d3
+	vmlal.s32	q8, d22, d8
+	vmlal.s32	q8, d28, d2
+	vmlal.s32	q8, d23, d7
+	vmlal.s32	q8, d29, d1
+	vmlal.s32	q8, d24, d6
+	vmlal.s32	q8, d25, d0
+	add		r2, sp, #624
+	vst1.8		{d14-d15}, [r2, : 128]
+	vmull.s32	q2, d18, d4
+	vmlal.s32	q2, d12, d9
+	vmlal.s32	q2, d13, d8
+	vmlal.s32	q2, d19, d3
+	vmlal.s32	q2, d22, d2
+	vmlal.s32	q2, d23, d1
+	vmlal.s32	q2, d24, d0
+	add		r2, sp, #640
+	vst1.8		{d20-d21}, [r2, : 128]
+	vmull.s32	q7, d18, d9
+	vmlal.s32	q7, d26, d3
+	vmlal.s32	q7, d19, d8
+	vmlal.s32	q7, d27, d2
+	vmlal.s32	q7, d22, d7
+	vmlal.s32	q7, d28, d1
+	vmlal.s32	q7, d23, d6
+	vmlal.s32	q7, d29, d0
+	add		r2, sp, #656
+	vst1.8		{d10-d11}, [r2, : 128]
+	vmull.s32	q5, d18, d3
+	vmlal.s32	q5, d19, d2
+	vmlal.s32	q5, d22, d1
+	vmlal.s32	q5, d23, d0
+	vmlal.s32	q5, d12, d8
+	add		r2, sp, #672
+	vst1.8		{d16-d17}, [r2, : 128]
+	vmull.s32	q4, d18, d8
+	vmlal.s32	q4, d26, d2
+	vmlal.s32	q4, d19, d7
+	vmlal.s32	q4, d27, d1
+	vmlal.s32	q4, d22, d6
+	vmlal.s32	q4, d28, d0
+	vmull.s32	q8, d18, d7
+	vmlal.s32	q8, d26, d1
+	vmlal.s32	q8, d19, d6
+	vmlal.s32	q8, d27, d0
+	add		r2, sp, #576
+	vld1.8		{d20-d21}, [r2, : 128]
+	vmlal.s32	q7, d24, d21
+	vmlal.s32	q7, d25, d20
+	vmlal.s32	q4, d23, d21
+	vmlal.s32	q4, d29, d20
+	vmlal.s32	q8, d22, d21
+	vmlal.s32	q8, d28, d20
+	vmlal.s32	q5, d24, d20
+	add		r2, sp, #576
+	vst1.8		{d14-d15}, [r2, : 128]
+	vmull.s32	q7, d18, d6
+	vmlal.s32	q7, d26, d0
+	add		r2, sp, #656
+	vld1.8		{d30-d31}, [r2, : 128]
+	vmlal.s32	q2, d30, d21
+	vmlal.s32	q7, d19, d21
+	vmlal.s32	q7, d27, d20
+	add		r2, sp, #624
+	vld1.8		{d26-d27}, [r2, : 128]
+	vmlal.s32	q4, d25, d27
+	vmlal.s32	q8, d29, d27
+	vmlal.s32	q8, d25, d26
+	vmlal.s32	q7, d28, d27
+	vmlal.s32	q7, d29, d26
+	add		r2, sp, #608
+	vld1.8		{d28-d29}, [r2, : 128]
+	vmlal.s32	q4, d24, d29
+	vmlal.s32	q8, d23, d29
+	vmlal.s32	q8, d24, d28
+	vmlal.s32	q7, d22, d29
+	vmlal.s32	q7, d23, d28
+	add		r2, sp, #608
+	vst1.8		{d8-d9}, [r2, : 128]
+	add		r2, sp, #560
+	vld1.8		{d8-d9}, [r2, : 128]
+	vmlal.s32	q7, d24, d9
+	vmlal.s32	q7, d25, d31
+	vmull.s32	q1, d18, d2
+	vmlal.s32	q1, d19, d1
+	vmlal.s32	q1, d22, d0
+	vmlal.s32	q1, d24, d27
+	vmlal.s32	q1, d23, d20
+	vmlal.s32	q1, d12, d7
+	vmlal.s32	q1, d13, d6
+	vmull.s32	q6, d18, d1
+	vmlal.s32	q6, d19, d0
+	vmlal.s32	q6, d23, d27
+	vmlal.s32	q6, d22, d20
+	vmlal.s32	q6, d24, d26
+	vmull.s32	q0, d18, d0
+	vmlal.s32	q0, d22, d27
+	vmlal.s32	q0, d23, d26
+	vmlal.s32	q0, d24, d31
+	vmlal.s32	q0, d19, d20
+	add		r2, sp, #640
+	vld1.8		{d18-d19}, [r2, : 128]
+	vmlal.s32	q2, d18, d7
+	vmlal.s32	q2, d19, d6
+	vmlal.s32	q5, d18, d6
+	vmlal.s32	q5, d19, d21
+	vmlal.s32	q1, d18, d21
+	vmlal.s32	q1, d19, d29
+	vmlal.s32	q0, d18, d28
+	vmlal.s32	q0, d19, d9
+	vmlal.s32	q6, d18, d29
+	vmlal.s32	q6, d19, d28
+	add		r2, sp, #592
+	vld1.8		{d18-d19}, [r2, : 128]
+	add		r2, sp, #512
+	vld1.8		{d22-d23}, [r2, : 128]
+	vmlal.s32	q5, d19, d7
+	vmlal.s32	q0, d18, d21
+	vmlal.s32	q0, d19, d29
+	vmlal.s32	q6, d18, d6
+	add		r2, sp, #528
+	vld1.8		{d6-d7}, [r2, : 128]
+	vmlal.s32	q6, d19, d21
+	add		r2, sp, #576
+	vld1.8		{d18-d19}, [r2, : 128]
+	vmlal.s32	q0, d30, d8
+	add		r2, sp, #672
+	vld1.8		{d20-d21}, [r2, : 128]
+	vmlal.s32	q5, d30, d29
+	add		r2, sp, #608
+	vld1.8		{d24-d25}, [r2, : 128]
+	vmlal.s32	q1, d30, d28
+	vadd.i64	q13, q0, q11
+	vadd.i64	q14, q5, q11
+	vmlal.s32	q6, d30, d9
+	vshr.s64	q4, q13, #26
+	vshr.s64	q13, q14, #26
+	vadd.i64	q7, q7, q4
+	vshl.i64	q4, q4, #26
+	vadd.i64	q14, q7, q3
+	vadd.i64	q9, q9, q13
+	vshl.i64	q13, q13, #26
+	vadd.i64	q15, q9, q3
+	vsub.i64	q0, q0, q4
+	vshr.s64	q4, q14, #25
+	vsub.i64	q5, q5, q13
+	vshr.s64	q13, q15, #25
+	vadd.i64	q6, q6, q4
+	vshl.i64	q4, q4, #25
+	vadd.i64	q14, q6, q11
+	vadd.i64	q2, q2, q13
+	vsub.i64	q4, q7, q4
+	vshr.s64	q7, q14, #26
+	vshl.i64	q13, q13, #25
+	vadd.i64	q14, q2, q11
+	vadd.i64	q8, q8, q7
+	vshl.i64	q7, q7, #26
+	vadd.i64	q15, q8, q3
+	vsub.i64	q9, q9, q13
+	vshr.s64	q13, q14, #26
+	vsub.i64	q6, q6, q7
+	vshr.s64	q7, q15, #25
+	vadd.i64	q10, q10, q13
+	vshl.i64	q13, q13, #26
+	vadd.i64	q14, q10, q3
+	vadd.i64	q1, q1, q7
+	add		r2, r3, #240
+	vshl.i64	q7, q7, #25
+	add		r4, r3, #144
+	vadd.i64	q15, q1, q11
+	add		r2, r2, #8
+	vsub.i64	q2, q2, q13
+	add		r4, r4, #8
+	vshr.s64	q13, q14, #25
+	vsub.i64	q7, q8, q7
+	vshr.s64	q8, q15, #26
+	vadd.i64	q14, q13, q13
+	vadd.i64	q12, q12, q8
+	vtrn.32		d12, d14
+	vshl.i64	q8, q8, #26
+	vtrn.32		d13, d15
+	vadd.i64	q3, q12, q3
+	vadd.i64	q0, q0, q14
+	vst1.8		d12, [r2, : 64]!
+	vshl.i64	q7, q13, #4
+	vst1.8		d13, [r4, : 64]!
+	vsub.i64	q1, q1, q8
+	vshr.s64	q3, q3, #25
+	vadd.i64	q0, q0, q7
+	vadd.i64	q5, q5, q3
+	vshl.i64	q3, q3, #25
+	vadd.i64	q6, q5, q11
+	vadd.i64	q0, q0, q13
+	vshl.i64	q7, q13, #25
+	vadd.i64	q8, q0, q11
+	vsub.i64	q3, q12, q3
+	vshr.s64	q6, q6, #26
+	vsub.i64	q7, q10, q7
+	vtrn.32		d2, d6
+	vshr.s64	q8, q8, #26
+	vtrn.32		d3, d7
+	vadd.i64	q3, q9, q6
+	vst1.8		d2, [r2, : 64]
+	vshl.i64	q6, q6, #26
+	vst1.8		d3, [r4, : 64]
+	vadd.i64	q1, q4, q8
+	vtrn.32		d4, d14
+	vshl.i64	q4, q8, #26
+	vtrn.32		d5, d15
+	vsub.i64	q5, q5, q6
+	add		r2, r2, #16
+	vsub.i64	q0, q0, q4
+	vst1.8		d4, [r2, : 64]
+	add		r4, r4, #16
+	vst1.8		d5, [r4, : 64]
+	vtrn.32		d10, d6
+	vtrn.32		d11, d7
+	sub		r2, r2, #8
+	sub		r4, r4, #8
+	vtrn.32		d0, d2
+	vtrn.32		d1, d3
+	vst1.8		d10, [r2, : 64]
+	vst1.8		d11, [r4, : 64]
+	sub		r2, r2, #24
+	sub		r4, r4, #24
+	vst1.8		d0, [r2, : 64]
+	vst1.8		d1, [r4, : 64]
+	ldr		r2, [sp, #488]
+	ldr		r4, [sp, #492]
+	subs		r5, r2, #1
+	bge		._mainloop
+	add		r1, r3, #144
+	add		r2, r3, #336
+	vld1.8		{d0-d1}, [r1, : 128]!
+	vld1.8		{d2-d3}, [r1, : 128]!
+	vld1.8		{d4}, [r1, : 64]
+	vst1.8		{d0-d1}, [r2, : 128]!
+	vst1.8		{d2-d3}, [r2, : 128]!
+	vst1.8		d4, [r2, : 64]
+	ldr		r1, =0
+._invertloop:
+	add		r2, r3, #144
+	ldr		r4, =0
+	ldr		r5, =2
+	cmp		r1, #1
+	ldreq		r5, =1
+	addeq		r2, r3, #336
+	addeq		r4, r3, #48
+	cmp		r1, #2
+	ldreq		r5, =1
+	addeq		r2, r3, #48
+	cmp		r1, #3
+	ldreq		r5, =5
+	addeq		r4, r3, #336
+	cmp		r1, #4
+	ldreq		r5, =10
+	cmp		r1, #5
+	ldreq		r5, =20
+	cmp		r1, #6
+	ldreq		r5, =10
+	addeq		r2, r3, #336
+	addeq		r4, r3, #336
+	cmp		r1, #7
+	ldreq		r5, =50
+	cmp		r1, #8
+	ldreq		r5, =100
+	cmp		r1, #9
+	ldreq		r5, =50
+	addeq		r2, r3, #336
+	cmp		r1, #10
+	ldreq		r5, =5
+	addeq		r2, r3, #48
+	cmp		r1, #11
+	ldreq		r5, =0
+	addeq		r2, r3, #96
+	add		r6, r3, #144
+	add		r7, r3, #288
+	vld1.8		{d0-d1}, [r6, : 128]!
+	vld1.8		{d2-d3}, [r6, : 128]!
+	vld1.8		{d4}, [r6, : 64]
+	vst1.8		{d0-d1}, [r7, : 128]!
+	vst1.8		{d2-d3}, [r7, : 128]!
+	vst1.8		d4, [r7, : 64]
+	cmp		r5, #0
+	beq		._skipsquaringloop
+._squaringloop:
+	add		r6, r3, #288
+	add		r7, r3, #288
+	add		r8, r3, #288
+	vmov.i32	q0, #19
+	vmov.i32	q1, #0
+	vmov.i32	q2, #1
+	vzip.i32	q1, q2
+	vld1.8		{d4-d5}, [r7, : 128]!
+	vld1.8		{d6-d7}, [r7, : 128]!
+	vld1.8		{d9}, [r7, : 64]
+	vld1.8		{d10-d11}, [r6, : 128]!
+	add		r7, sp, #416
+	vld1.8		{d12-d13}, [r6, : 128]!
+	vmul.i32	q7, q2, q0
+	vld1.8		{d8}, [r6, : 64]
+	vext.32		d17, d11, d10, #1
+	vmul.i32	q9, q3, q0
+	vext.32		d16, d10, d8, #1
+	vshl.u32	q10, q5, q1
+	vext.32		d22, d14, d4, #1
+	vext.32		d24, d18, d6, #1
+	vshl.u32	q13, q6, q1
+	vshl.u32	d28, d8, d2
+	vrev64.i32	d22, d22
+	vmul.i32	d1, d9, d1
+	vrev64.i32	d24, d24
+	vext.32		d29, d8, d13, #1
+	vext.32		d0, d1, d9, #1
+	vrev64.i32	d0, d0
+	vext.32		d2, d9, d1, #1
+	vext.32		d23, d15, d5, #1
+	vmull.s32	q4, d20, d4
+	vrev64.i32	d23, d23
+	vmlal.s32	q4, d21, d1
+	vrev64.i32	d2, d2
+	vmlal.s32	q4, d26, d19
+	vext.32		d3, d5, d15, #1
+	vmlal.s32	q4, d27, d18
+	vrev64.i32	d3, d3
+	vmlal.s32	q4, d28, d15
+	vext.32		d14, d12, d11, #1
+	vmull.s32	q5, d16, d23
+	vext.32		d15, d13, d12, #1
+	vmlal.s32	q5, d17, d4
+	vst1.8		d8, [r7, : 64]!
+	vmlal.s32	q5, d14, d1
+	vext.32		d12, d9, d8, #0
+	vmlal.s32	q5, d15, d19
+	vmov.i64	d13, #0
+	vmlal.s32	q5, d29, d18
+	vext.32		d25, d19, d7, #1
+	vmlal.s32	q6, d20, d5
+	vrev64.i32	d25, d25
+	vmlal.s32	q6, d21, d4
+	vst1.8		d11, [r7, : 64]!
+	vmlal.s32	q6, d26, d1
+	vext.32		d9, d10, d10, #0
+	vmlal.s32	q6, d27, d19
+	vmov.i64	d8, #0
+	vmlal.s32	q6, d28, d18
+	vmlal.s32	q4, d16, d24
+	vmlal.s32	q4, d17, d5
+	vmlal.s32	q4, d14, d4
+	vst1.8		d12, [r7, : 64]!
+	vmlal.s32	q4, d15, d1
+	vext.32		d10, d13, d12, #0
+	vmlal.s32	q4, d29, d19
+	vmov.i64	d11, #0
+	vmlal.s32	q5, d20, d6
+	vmlal.s32	q5, d21, d5
+	vmlal.s32	q5, d26, d4
+	vext.32		d13, d8, d8, #0
+	vmlal.s32	q5, d27, d1
+	vmov.i64	d12, #0
+	vmlal.s32	q5, d28, d19
+	vst1.8		d9, [r7, : 64]!
+	vmlal.s32	q6, d16, d25
+	vmlal.s32	q6, d17, d6
+	vst1.8		d10, [r7, : 64]
+	vmlal.s32	q6, d14, d5
+	vext.32		d8, d11, d10, #0
+	vmlal.s32	q6, d15, d4
+	vmov.i64	d9, #0
+	vmlal.s32	q6, d29, d1
+	vmlal.s32	q4, d20, d7
+	vmlal.s32	q4, d21, d6
+	vmlal.s32	q4, d26, d5
+	vext.32		d11, d12, d12, #0
+	vmlal.s32	q4, d27, d4
+	vmov.i64	d10, #0
+	vmlal.s32	q4, d28, d1
+	vmlal.s32	q5, d16, d0
+	sub		r6, r7, #32
+	vmlal.s32	q5, d17, d7
+	vmlal.s32	q5, d14, d6
+	vext.32		d30, d9, d8, #0
+	vmlal.s32	q5, d15, d5
+	vld1.8		{d31}, [r6, : 64]!
+	vmlal.s32	q5, d29, d4
+	vmlal.s32	q15, d20, d0
+	vext.32		d0, d6, d18, #1
+	vmlal.s32	q15, d21, d25
+	vrev64.i32	d0, d0
+	vmlal.s32	q15, d26, d24
+	vext.32		d1, d7, d19, #1
+	vext.32		d7, d10, d10, #0
+	vmlal.s32	q15, d27, d23
+	vrev64.i32	d1, d1
+	vld1.8		{d6}, [r6, : 64]
+	vmlal.s32	q15, d28, d22
+	vmlal.s32	q3, d16, d4
+	add		r6, r6, #24
+	vmlal.s32	q3, d17, d2
+	vext.32		d4, d31, d30, #0
+	vmov		d17, d11
+	vmlal.s32	q3, d14, d1
+	vext.32		d11, d13, d13, #0
+	vext.32		d13, d30, d30, #0
+	vmlal.s32	q3, d15, d0
+	vext.32		d1, d8, d8, #0
+	vmlal.s32	q3, d29, d3
+	vld1.8		{d5}, [r6, : 64]
+	sub		r6, r6, #16
+	vext.32		d10, d6, d6, #0
+	vmov.i32	q1, #0xffffffff
+	vshl.i64	q4, q1, #25
+	add		r7, sp, #512
+	vld1.8		{d14-d15}, [r7, : 128]
+	vadd.i64	q9, q2, q7
+	vshl.i64	q1, q1, #26
+	vshr.s64	q10, q9, #26
+	vld1.8		{d0}, [r6, : 64]!
+	vadd.i64	q5, q5, q10
+	vand		q9, q9, q1
+	vld1.8		{d16}, [r6, : 64]!
+	add		r6, sp, #528
+	vld1.8		{d20-d21}, [r6, : 128]
+	vadd.i64	q11, q5, q10
+	vsub.i64	q2, q2, q9
+	vshr.s64	q9, q11, #25
+	vext.32		d12, d5, d4, #0
+	vand		q11, q11, q4
+	vadd.i64	q0, q0, q9
+	vmov		d19, d7
+	vadd.i64	q3, q0, q7
+	vsub.i64	q5, q5, q11
+	vshr.s64	q11, q3, #26
+	vext.32		d18, d11, d10, #0
+	vand		q3, q3, q1
+	vadd.i64	q8, q8, q11
+	vadd.i64	q11, q8, q10
+	vsub.i64	q0, q0, q3
+	vshr.s64	q3, q11, #25
+	vand		q11, q11, q4
+	vadd.i64	q3, q6, q3
+	vadd.i64	q6, q3, q7
+	vsub.i64	q8, q8, q11
+	vshr.s64	q11, q6, #26
+	vand		q6, q6, q1
+	vadd.i64	q9, q9, q11
+	vadd.i64	d25, d19, d21
+	vsub.i64	q3, q3, q6
+	vshr.s64	d23, d25, #25
+	vand		q4, q12, q4
+	vadd.i64	d21, d23, d23
+	vshl.i64	d25, d23, #4
+	vadd.i64	d21, d21, d23
+	vadd.i64	d25, d25, d21
+	vadd.i64	d4, d4, d25
+	vzip.i32	q0, q8
+	vadd.i64	d12, d4, d14
+	add		r6, r8, #8
+	vst1.8		d0, [r6, : 64]
+	vsub.i64	d19, d19, d9
+	add		r6, r6, #16
+	vst1.8		d16, [r6, : 64]
+	vshr.s64	d22, d12, #26
+	vand		q0, q6, q1
+	vadd.i64	d10, d10, d22
+	vzip.i32	q3, q9
+	vsub.i64	d4, d4, d0
+	sub		r6, r6, #8
+	vst1.8		d6, [r6, : 64]
+	add		r6, r6, #16
+	vst1.8		d18, [r6, : 64]
+	vzip.i32	q2, q5
+	sub		r6, r6, #32
+	vst1.8		d4, [r6, : 64]
+	subs		r5, r5, #1
+	bhi		._squaringloop
+._skipsquaringloop:
+	mov		r2, r2
+	add		r5, r3, #288
+	add		r6, r3, #144
+	vmov.i32	q0, #19
+	vmov.i32	q1, #0
+	vmov.i32	q2, #1
+	vzip.i32	q1, q2
+	vld1.8		{d4-d5}, [r5, : 128]!
+	vld1.8		{d6-d7}, [r5, : 128]!
+	vld1.8		{d9}, [r5, : 64]
+	vld1.8		{d10-d11}, [r2, : 128]!
+	add		r5, sp, #416
+	vld1.8		{d12-d13}, [r2, : 128]!
+	vmul.i32	q7, q2, q0
+	vld1.8		{d8}, [r2, : 64]
+	vext.32		d17, d11, d10, #1
+	vmul.i32	q9, q3, q0
+	vext.32		d16, d10, d8, #1
+	vshl.u32	q10, q5, q1
+	vext.32		d22, d14, d4, #1
+	vext.32		d24, d18, d6, #1
+	vshl.u32	q13, q6, q1
+	vshl.u32	d28, d8, d2
+	vrev64.i32	d22, d22
+	vmul.i32	d1, d9, d1
+	vrev64.i32	d24, d24
+	vext.32		d29, d8, d13, #1
+	vext.32		d0, d1, d9, #1
+	vrev64.i32	d0, d0
+	vext.32		d2, d9, d1, #1
+	vext.32		d23, d15, d5, #1
+	vmull.s32	q4, d20, d4
+	vrev64.i32	d23, d23
+	vmlal.s32	q4, d21, d1
+	vrev64.i32	d2, d2
+	vmlal.s32	q4, d26, d19
+	vext.32		d3, d5, d15, #1
+	vmlal.s32	q4, d27, d18
+	vrev64.i32	d3, d3
+	vmlal.s32	q4, d28, d15
+	vext.32		d14, d12, d11, #1
+	vmull.s32	q5, d16, d23
+	vext.32		d15, d13, d12, #1
+	vmlal.s32	q5, d17, d4
+	vst1.8		d8, [r5, : 64]!
+	vmlal.s32	q5, d14, d1
+	vext.32		d12, d9, d8, #0
+	vmlal.s32	q5, d15, d19
+	vmov.i64	d13, #0
+	vmlal.s32	q5, d29, d18
+	vext.32		d25, d19, d7, #1
+	vmlal.s32	q6, d20, d5
+	vrev64.i32	d25, d25
+	vmlal.s32	q6, d21, d4
+	vst1.8		d11, [r5, : 64]!
+	vmlal.s32	q6, d26, d1
+	vext.32		d9, d10, d10, #0
+	vmlal.s32	q6, d27, d19
+	vmov.i64	d8, #0
+	vmlal.s32	q6, d28, d18
+	vmlal.s32	q4, d16, d24
+	vmlal.s32	q4, d17, d5
+	vmlal.s32	q4, d14, d4
+	vst1.8		d12, [r5, : 64]!
+	vmlal.s32	q4, d15, d1
+	vext.32		d10, d13, d12, #0
+	vmlal.s32	q4, d29, d19
+	vmov.i64	d11, #0
+	vmlal.s32	q5, d20, d6
+	vmlal.s32	q5, d21, d5
+	vmlal.s32	q5, d26, d4
+	vext.32		d13, d8, d8, #0
+	vmlal.s32	q5, d27, d1
+	vmov.i64	d12, #0
+	vmlal.s32	q5, d28, d19
+	vst1.8		d9, [r5, : 64]!
+	vmlal.s32	q6, d16, d25
+	vmlal.s32	q6, d17, d6
+	vst1.8		d10, [r5, : 64]
+	vmlal.s32	q6, d14, d5
+	vext.32		d8, d11, d10, #0
+	vmlal.s32	q6, d15, d4
+	vmov.i64	d9, #0
+	vmlal.s32	q6, d29, d1
+	vmlal.s32	q4, d20, d7
+	vmlal.s32	q4, d21, d6
+	vmlal.s32	q4, d26, d5
+	vext.32		d11, d12, d12, #0
+	vmlal.s32	q4, d27, d4
+	vmov.i64	d10, #0
+	vmlal.s32	q4, d28, d1
+	vmlal.s32	q5, d16, d0
+	sub		r2, r5, #32
+	vmlal.s32	q5, d17, d7
+	vmlal.s32	q5, d14, d6
+	vext.32		d30, d9, d8, #0
+	vmlal.s32	q5, d15, d5
+	vld1.8		{d31}, [r2, : 64]!
+	vmlal.s32	q5, d29, d4
+	vmlal.s32	q15, d20, d0
+	vext.32		d0, d6, d18, #1
+	vmlal.s32	q15, d21, d25
+	vrev64.i32	d0, d0
+	vmlal.s32	q15, d26, d24
+	vext.32		d1, d7, d19, #1
+	vext.32		d7, d10, d10, #0
+	vmlal.s32	q15, d27, d23
+	vrev64.i32	d1, d1
+	vld1.8		{d6}, [r2, : 64]
+	vmlal.s32	q15, d28, d22
+	vmlal.s32	q3, d16, d4
+	add		r2, r2, #24
+	vmlal.s32	q3, d17, d2
+	vext.32		d4, d31, d30, #0
+	vmov		d17, d11
+	vmlal.s32	q3, d14, d1
+	vext.32		d11, d13, d13, #0
+	vext.32		d13, d30, d30, #0
+	vmlal.s32	q3, d15, d0
+	vext.32		d1, d8, d8, #0
+	vmlal.s32	q3, d29, d3
+	vld1.8		{d5}, [r2, : 64]
+	sub		r2, r2, #16
+	vext.32		d10, d6, d6, #0
+	vmov.i32	q1, #0xffffffff
+	vshl.i64	q4, q1, #25
+	add		r5, sp, #512
+	vld1.8		{d14-d15}, [r5, : 128]
+	vadd.i64	q9, q2, q7
+	vshl.i64	q1, q1, #26
+	vshr.s64	q10, q9, #26
+	vld1.8		{d0}, [r2, : 64]!
+	vadd.i64	q5, q5, q10
+	vand		q9, q9, q1
+	vld1.8		{d16}, [r2, : 64]!
+	add		r2, sp, #528
+	vld1.8		{d20-d21}, [r2, : 128]
+	vadd.i64	q11, q5, q10
+	vsub.i64	q2, q2, q9
+	vshr.s64	q9, q11, #25
+	vext.32		d12, d5, d4, #0
+	vand		q11, q11, q4
+	vadd.i64	q0, q0, q9
+	vmov		d19, d7
+	vadd.i64	q3, q0, q7
+	vsub.i64	q5, q5, q11
+	vshr.s64	q11, q3, #26
+	vext.32		d18, d11, d10, #0
+	vand		q3, q3, q1
+	vadd.i64	q8, q8, q11
+	vadd.i64	q11, q8, q10
+	vsub.i64	q0, q0, q3
+	vshr.s64	q3, q11, #25
+	vand		q11, q11, q4
+	vadd.i64	q3, q6, q3
+	vadd.i64	q6, q3, q7
+	vsub.i64	q8, q8, q11
+	vshr.s64	q11, q6, #26
+	vand		q6, q6, q1
+	vadd.i64	q9, q9, q11
+	vadd.i64	d25, d19, d21
+	vsub.i64	q3, q3, q6
+	vshr.s64	d23, d25, #25
+	vand		q4, q12, q4
+	vadd.i64	d21, d23, d23
+	vshl.i64	d25, d23, #4
+	vadd.i64	d21, d21, d23
+	vadd.i64	d25, d25, d21
+	vadd.i64	d4, d4, d25
+	vzip.i32	q0, q8
+	vadd.i64	d12, d4, d14
+	add		r2, r6, #8
+	vst1.8		d0, [r2, : 64]
+	vsub.i64	d19, d19, d9
+	add		r2, r2, #16
+	vst1.8		d16, [r2, : 64]
+	vshr.s64	d22, d12, #26
+	vand		q0, q6, q1
+	vadd.i64	d10, d10, d22
+	vzip.i32	q3, q9
+	vsub.i64	d4, d4, d0
+	sub		r2, r2, #8
+	vst1.8		d6, [r2, : 64]
+	add		r2, r2, #16
+	vst1.8		d18, [r2, : 64]
+	vzip.i32	q2, q5
+	sub		r2, r2, #32
+	vst1.8		d4, [r2, : 64]
+	cmp		r4, #0
+	beq		._skippostcopy
+	add		r2, r3, #144
+	mov		r4, r4
+	vld1.8		{d0-d1}, [r2, : 128]!
+	vld1.8		{d2-d3}, [r2, : 128]!
+	vld1.8		{d4}, [r2, : 64]
+	vst1.8		{d0-d1}, [r4, : 128]!
+	vst1.8		{d2-d3}, [r4, : 128]!
+	vst1.8		d4, [r4, : 64]
+._skippostcopy:
+	cmp		r1, #1
+	bne		._skipfinalcopy
+	add		r2, r3, #288
+	add		r4, r3, #144
+	vld1.8		{d0-d1}, [r2, : 128]!
+	vld1.8		{d2-d3}, [r2, : 128]!
+	vld1.8		{d4}, [r2, : 64]
+	vst1.8		{d0-d1}, [r4, : 128]!
+	vst1.8		{d2-d3}, [r4, : 128]!
+	vst1.8		d4, [r4, : 64]
+._skipfinalcopy:
+	add		r1, r1, #1
+	cmp		r1, #12
+	blo		._invertloop
+	add		r1, r3, #144
+	ldr		r2, [r1], #4
+	ldr		r3, [r1], #4
+	ldr		r4, [r1], #4
+	ldr		r5, [r1], #4
+	ldr		r6, [r1], #4
+	ldr		r7, [r1], #4
+	ldr		r8, [r1], #4
+	ldr		r9, [r1], #4
+	ldr		r10, [r1], #4
+	ldr		r1, [r1]
+	add		r11, r1, r1, LSL #4
+	add		r11, r11, r1, LSL #1
+	add		r11, r11, #16777216
+	mov		r11, r11, ASR #25
+	add		r11, r11, r2
+	mov		r11, r11, ASR #26
+	add		r11, r11, r3
+	mov		r11, r11, ASR #25
+	add		r11, r11, r4
+	mov		r11, r11, ASR #26
+	add		r11, r11, r5
+	mov		r11, r11, ASR #25
+	add		r11, r11, r6
+	mov		r11, r11, ASR #26
+	add		r11, r11, r7
+	mov		r11, r11, ASR #25
+	add		r11, r11, r8
+	mov		r11, r11, ASR #26
+	add		r11, r11, r9
+	mov		r11, r11, ASR #25
+	add		r11, r11, r10
+	mov		r11, r11, ASR #26
+	add		r11, r11, r1
+	mov		r11, r11, ASR #25
+	add		r2, r2, r11
+	add		r2, r2, r11, LSL #1
+	add		r2, r2, r11, LSL #4
+	mov		r11, r2, ASR #26
+	add		r3, r3, r11
+	sub		r2, r2, r11, LSL #26
+	mov		r11, r3, ASR #25
+	add		r4, r4, r11
+	sub		r3, r3, r11, LSL #25
+	mov		r11, r4, ASR #26
+	add		r5, r5, r11
+	sub		r4, r4, r11, LSL #26
+	mov		r11, r5, ASR #25
+	add		r6, r6, r11
+	sub		r5, r5, r11, LSL #25
+	mov		r11, r6, ASR #26
+	add		r7, r7, r11
+	sub		r6, r6, r11, LSL #26
+	mov		r11, r7, ASR #25
+	add		r8, r8, r11
+	sub		r7, r7, r11, LSL #25
+	mov		r11, r8, ASR #26
+	add		r9, r9, r11
+	sub		r8, r8, r11, LSL #26
+	mov		r11, r9, ASR #25
+	add		r10, r10, r11
+	sub		r9, r9, r11, LSL #25
+	mov		r11, r10, ASR #26
+	add		r1, r1, r11
+	sub		r10, r10, r11, LSL #26
+	mov		r11, r1, ASR #25
+	sub		r1, r1, r11, LSL #25
+	add		r2, r2, r3, LSL #26
+	mov		r3, r3, LSR #6
+	add		r3, r3, r4, LSL #19
+	mov		r4, r4, LSR #13
+	add		r4, r4, r5, LSL #13
+	mov		r5, r5, LSR #19
+	add		r5, r5, r6, LSL #6
+	add		r6, r7, r8, LSL #25
+	mov		r7, r8, LSR #7
+	add		r7, r7, r9, LSL #19
+	mov		r8, r9, LSR #13
+	add		r8, r8, r10, LSL #12
+	mov		r9, r10, LSR #20
+	add		r1, r9, r1, LSL #6
+	str		r2, [r0], #4
+	str		r3, [r0], #4
+	str		r4, [r0], #4
+	str		r5, [r0], #4
+	str		r6, [r0], #4
+	str		r7, [r0], #4
+	str		r8, [r0], #4
+	str		r1, [r0]
+	ldrd		r4, [sp, #0]
+	ldrd		r6, [sp, #8]
+	ldrd		r8, [sp, #16]
+	ldrd		r10, [sp, #24]
+	ldr		r12, [sp, #480]
+	ldr		r14, [sp, #484]
+	ldr		r0, =0
+	mov		sp, r12
+	vpop		{q4, q5, q6, q7}
+	bx		lr
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4 33/35] crypto: arm/Curve25519 - wire up NEON implementation
  2019-10-17 19:08 [PATCH v4 00/35] crypto: crypto API library interfaces for WireGuard Ard Biesheuvel
                   ` (31 preceding siblings ...)
  2019-10-17 19:09 ` [PATCH v4 32/35] crypto: arm - import Bernstein and Schwabe's Curve25519 ARM implementation Ard Biesheuvel
@ 2019-10-17 19:09 ` Ard Biesheuvel
  2019-10-17 19:09 ` [PATCH v4 34/35] crypto: chacha20poly1305 - import construction and selftest from Zinc Ard Biesheuvel
  2019-10-17 19:09 ` [PATCH v4 35/35] crypto: lib/chacha20poly1305 - reimplement crypt_from_sg() routine Ard Biesheuvel
  34 siblings, 0 replies; 50+ messages in thread
From: Ard Biesheuvel @ 2019-10-17 19:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: Ard Biesheuvel, Herbert Xu, David Miller, Jason A . Donenfeld,
	Samuel Neves, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Martin Willi, Rene van Dorst, David Sterba

From: "Jason A. Donenfeld" <Jason@zx2c4.com>

This ports the SUPERCOP implementation for usage in kernel space. In
addition to the usual header, macro, and style changes required for
kernel space, it makes a few small changes to the code:

  - The stack alignment is relaxed to 16 bytes.
  - Superfluous mov statements have been removed.
  - ldr for constants has been replaced with movw.
  - ldreq has been replaced with moveq.
  - The str epilogue has been made more idiomatic.
  - SIMD registers are not pushed and popped at the beginning and end.
  - The prologue and epilogue have been made idiomatic.
  - A hole has been removed from the stack, saving 32 bytes.
  - We write-back the base register whenever possible for vld1.8.
  - Some multiplications have been reordered for better A7 performance.

There are more opportunities for cleanup, since this code is from qhasm,
which doesn't always do the most opportune thing. But even prior to
extensive hand optimizations, this code delivers significant performance
improvements (given in get_cycles() per call):

		      ----------- -------------
	             | generic C | this commit |
	 ------------ ----------- -------------
	| Cortex-A7  |     49136 |       22395 |
	 ------------ ----------- -------------
	| Cortex-A17 |     17326 |        4983 |
	 ------------ ----------- -------------

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
[ardb: - move to arch/arm/crypto
       - wire into lib/crypto framework
       - implement crypto API KPP hooks ]
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm/crypto/Kconfig           |   6 +
 arch/arm/crypto/Makefile          |   2 +
 arch/arm/crypto/curve25519-core.S | 347 +++++++++-----------
 arch/arm/crypto/curve25519-glue.c | 136 ++++++++
 4 files changed, 296 insertions(+), 195 deletions(-)

diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
index 19411f82468e..1d8efae37d24 100644
--- a/arch/arm/crypto/Kconfig
+++ b/arch/arm/crypto/Kconfig
@@ -140,4 +140,10 @@ config CRYPTO_NHPOLY1305_NEON
 	depends on KERNEL_MODE_NEON
 	select CRYPTO_NHPOLY1305
 
+config CRYPTO_CURVE25519_NEON
+	tristate "NEON accelerated Curve25519 scalar multiplication library"
+	depends on KERNEL_MODE_NEON
+	select CRYPTO_LIB_CURVE25519_GENERIC
+	select CRYPTO_ARCH_HAVE_LIB_CURVE25519
+
 endif
diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
index 4f6a8a81dabc..7700385cec9f 100644
--- a/arch/arm/crypto/Makefile
+++ b/arch/arm/crypto/Makefile
@@ -12,6 +12,7 @@ obj-$(CONFIG_CRYPTO_SHA512_ARM) += sha512-arm.o
 obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha-neon.o
 obj-$(CONFIG_CRYPTO_POLY1305_ARM) += poly1305-arm.o
 obj-$(CONFIG_CRYPTO_NHPOLY1305_NEON) += nhpoly1305-neon.o
+obj-$(CONFIG_CRYPTO_CURVE25519_NEON) += curve25519-neon.o
 
 ce-obj-$(CONFIG_CRYPTO_AES_ARM_CE) += aes-arm-ce.o
 ce-obj-$(CONFIG_CRYPTO_SHA1_ARM_CE) += sha1-arm-ce.o
@@ -58,6 +59,7 @@ chacha-neon-y := chacha-scalar-core.o chacha-glue.o
 chacha-neon-$(CONFIG_KERNEL_MODE_NEON) += chacha-neon-core.o
 poly1305-arm-y := poly1305-core.o poly1305-glue.o
 nhpoly1305-neon-y := nh-neon-core.o nhpoly1305-neon-glue.o
+curve25519-neon-y := curve25519-core.o curve25519-glue.o
 
 ifdef REGENERATE_ARM_CRYPTO
 quiet_cmd_perl = PERL    $@
diff --git a/arch/arm/crypto/curve25519-core.S b/arch/arm/crypto/curve25519-core.S
index f33b85fef382..be18af52e7dc 100644
--- a/arch/arm/crypto/curve25519-core.S
+++ b/arch/arm/crypto/curve25519-core.S
@@ -1,43 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 OR MIT */
 /*
- * Public domain code from Daniel J. Bernstein and Peter Schwabe, from
- * SUPERCOP's curve25519/neon2/scalarmult.s.
+ * Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ *
+ * Based on public domain code from Daniel J. Bernstein and Peter Schwabe. This
+ * began from SUPERCOP's curve25519/neon2/scalarmult.s, but has subsequently been
+ * manually reworked for use in kernel space.
  */
 
-.fpu neon
+#include <linux/linkage.h>
+
 .text
+.fpu neon
+.arch armv7-a
 .align 4
-.global _crypto_scalarmult_curve25519_neon2
-.global crypto_scalarmult_curve25519_neon2
-.type _crypto_scalarmult_curve25519_neon2 STT_FUNC
-.type crypto_scalarmult_curve25519_neon2 STT_FUNC
-	_crypto_scalarmult_curve25519_neon2:
-	crypto_scalarmult_curve25519_neon2:
-	vpush		{q4, q5, q6, q7}
-	mov		r12, sp
-	sub		sp, sp, #736
-	and		sp, sp, #0xffffffe0
-	strd		r4, [sp, #0]
-	strd		r6, [sp, #8]
-	strd		r8, [sp, #16]
-	strd		r10, [sp, #24]
-	str		r12, [sp, #480]
-	str		r14, [sp, #484]
-	mov		r0, r0
-	mov		r1, r1
-	mov		r2, r2
-	add		r3, sp, #32
-	ldr		r4, =0
-	ldr		r5, =254
+
+ENTRY(curve25519_neon)
+	push		{r4-r11, lr}
+	mov		ip, sp
+	sub		r3, sp, #704
+	and		r3, r3, #0xfffffff0
+	mov		sp, r3
+	movw		r4, #0
+	movw		r5, #254
 	vmov.i32	q0, #1
 	vshr.u64	q1, q0, #7
 	vshr.u64	q0, q0, #8
 	vmov.i32	d4, #19
 	vmov.i32	d5, #38
-	add		r6, sp, #512
-	vst1.8		{d2-d3}, [r6, : 128]
-	add		r6, sp, #528
-	vst1.8		{d0-d1}, [r6, : 128]
-	add		r6, sp, #544
+	add		r6, sp, #480
+	vst1.8		{d2-d3}, [r6, : 128]!
+	vst1.8		{d0-d1}, [r6, : 128]!
 	vst1.8		{d4-d5}, [r6, : 128]
 	add		r6, r3, #0
 	vmov.i32	q2, #0
@@ -45,12 +37,12 @@
 	vst1.8		{d4-d5}, [r6, : 128]!
 	vst1.8		d4, [r6, : 64]
 	add		r6, r3, #0
-	ldr		r7, =960
+	movw		r7, #960
 	sub		r7, r7, #2
 	neg		r7, r7
 	sub		r7, r7, r7, LSL #7
 	str		r7, [r6]
-	add		r6, sp, #704
+	add		r6, sp, #672
 	vld1.8		{d4-d5}, [r1]!
 	vld1.8		{d6-d7}, [r1]
 	vst1.8		{d4-d5}, [r6, : 128]!
@@ -212,15 +204,15 @@
 	vst1.8		{d0-d1}, [r6, : 128]!
 	vst1.8		{d2-d3}, [r6, : 128]!
 	vst1.8		d4, [r6, : 64]
-._mainloop:
+.Lmainloop:
 	mov		r2, r5, LSR #3
 	and		r6, r5, #7
 	ldrb		r2, [r1, r2]
 	mov		r2, r2, LSR r6
 	and		r2, r2, #1
-	str		r5, [sp, #488]
+	str		r5, [sp, #456]
 	eor		r4, r4, r2
-	str		r2, [sp, #492]
+	str		r2, [sp, #460]
 	neg		r2, r4
 	add		r4, r3, #96
 	add		r5, r3, #192
@@ -291,7 +283,7 @@
 	vsub.i32	q0, q1, q3
 	vst1.8		d4, [r4, : 64]
 	vst1.8		d0, [r6, : 64]
-	add		r2, sp, #544
+	add		r2, sp, #512
 	add		r4, r3, #96
 	add		r5, r3, #144
 	vld1.8		{d0-d1}, [r2, : 128]
@@ -361,14 +353,13 @@
 	vmlal.s32	q0, d12, d8
 	vmlal.s32	q0, d13, d17
 	vmlal.s32	q0, d6, d6
-	add		r2, sp, #512
-	vld1.8		{d18-d19}, [r2, : 128]
+	add		r2, sp, #480
+	vld1.8		{d18-d19}, [r2, : 128]!
 	vmull.s32	q3, d16, d7
 	vmlal.s32	q3, d10, d15
 	vmlal.s32	q3, d11, d14
 	vmlal.s32	q3, d12, d9
 	vmlal.s32	q3, d13, d8
-	add		r2, sp, #528
 	vld1.8		{d8-d9}, [r2, : 128]
 	vadd.i64	q5, q12, q9
 	vadd.i64	q6, q15, q9
@@ -502,22 +493,19 @@
 	vadd.i32	q5, q5, q0
 	vtrn.32		q11, q14
 	vadd.i32	q6, q6, q3
-	add		r2, sp, #560
+	add		r2, sp, #528
 	vadd.i32	q10, q10, q2
 	vtrn.32		d24, d25
-	vst1.8		{d12-d13}, [r2, : 128]
+	vst1.8		{d12-d13}, [r2, : 128]!
 	vshl.i32	q6, q13, #1
-	add		r2, sp, #576
-	vst1.8		{d20-d21}, [r2, : 128]
+	vst1.8		{d20-d21}, [r2, : 128]!
 	vshl.i32	q10, q14, #1
-	add		r2, sp, #592
-	vst1.8		{d12-d13}, [r2, : 128]
+	vst1.8		{d12-d13}, [r2, : 128]!
 	vshl.i32	q15, q12, #1
 	vadd.i32	q8, q8, q4
 	vext.32		d10, d31, d30, #0
 	vadd.i32	q7, q7, q1
-	add		r2, sp, #608
-	vst1.8		{d16-d17}, [r2, : 128]
+	vst1.8		{d16-d17}, [r2, : 128]!
 	vmull.s32	q8, d18, d5
 	vmlal.s32	q8, d26, d4
 	vmlal.s32	q8, d19, d9
@@ -528,8 +516,7 @@
 	vmlal.s32	q8, d29, d1
 	vmlal.s32	q8, d24, d6
 	vmlal.s32	q8, d25, d0
-	add		r2, sp, #624
-	vst1.8		{d14-d15}, [r2, : 128]
+	vst1.8		{d14-d15}, [r2, : 128]!
 	vmull.s32	q2, d18, d4
 	vmlal.s32	q2, d12, d9
 	vmlal.s32	q2, d13, d8
@@ -537,8 +524,7 @@
 	vmlal.s32	q2, d22, d2
 	vmlal.s32	q2, d23, d1
 	vmlal.s32	q2, d24, d0
-	add		r2, sp, #640
-	vst1.8		{d20-d21}, [r2, : 128]
+	vst1.8		{d20-d21}, [r2, : 128]!
 	vmull.s32	q7, d18, d9
 	vmlal.s32	q7, d26, d3
 	vmlal.s32	q7, d19, d8
@@ -547,14 +533,12 @@
 	vmlal.s32	q7, d28, d1
 	vmlal.s32	q7, d23, d6
 	vmlal.s32	q7, d29, d0
-	add		r2, sp, #656
-	vst1.8		{d10-d11}, [r2, : 128]
+	vst1.8		{d10-d11}, [r2, : 128]!
 	vmull.s32	q5, d18, d3
 	vmlal.s32	q5, d19, d2
 	vmlal.s32	q5, d22, d1
 	vmlal.s32	q5, d23, d0
 	vmlal.s32	q5, d12, d8
-	add		r2, sp, #672
 	vst1.8		{d16-d17}, [r2, : 128]
 	vmull.s32	q4, d18, d8
 	vmlal.s32	q4, d26, d2
@@ -566,7 +550,7 @@
 	vmlal.s32	q8, d26, d1
 	vmlal.s32	q8, d19, d6
 	vmlal.s32	q8, d27, d0
-	add		r2, sp, #576
+	add		r2, sp, #544
 	vld1.8		{d20-d21}, [r2, : 128]
 	vmlal.s32	q7, d24, d21
 	vmlal.s32	q7, d25, d20
@@ -575,32 +559,30 @@
 	vmlal.s32	q8, d22, d21
 	vmlal.s32	q8, d28, d20
 	vmlal.s32	q5, d24, d20
-	add		r2, sp, #576
 	vst1.8		{d14-d15}, [r2, : 128]
 	vmull.s32	q7, d18, d6
 	vmlal.s32	q7, d26, d0
-	add		r2, sp, #656
+	add		r2, sp, #624
 	vld1.8		{d30-d31}, [r2, : 128]
 	vmlal.s32	q2, d30, d21
 	vmlal.s32	q7, d19, d21
 	vmlal.s32	q7, d27, d20
-	add		r2, sp, #624
+	add		r2, sp, #592
 	vld1.8		{d26-d27}, [r2, : 128]
 	vmlal.s32	q4, d25, d27
 	vmlal.s32	q8, d29, d27
 	vmlal.s32	q8, d25, d26
 	vmlal.s32	q7, d28, d27
 	vmlal.s32	q7, d29, d26
-	add		r2, sp, #608
+	add		r2, sp, #576
 	vld1.8		{d28-d29}, [r2, : 128]
 	vmlal.s32	q4, d24, d29
 	vmlal.s32	q8, d23, d29
 	vmlal.s32	q8, d24, d28
 	vmlal.s32	q7, d22, d29
 	vmlal.s32	q7, d23, d28
-	add		r2, sp, #608
 	vst1.8		{d8-d9}, [r2, : 128]
-	add		r2, sp, #560
+	add		r2, sp, #528
 	vld1.8		{d8-d9}, [r2, : 128]
 	vmlal.s32	q7, d24, d9
 	vmlal.s32	q7, d25, d31
@@ -621,36 +603,36 @@
 	vmlal.s32	q0, d23, d26
 	vmlal.s32	q0, d24, d31
 	vmlal.s32	q0, d19, d20
-	add		r2, sp, #640
+	add		r2, sp, #608
 	vld1.8		{d18-d19}, [r2, : 128]
 	vmlal.s32	q2, d18, d7
-	vmlal.s32	q2, d19, d6
 	vmlal.s32	q5, d18, d6
-	vmlal.s32	q5, d19, d21
 	vmlal.s32	q1, d18, d21
-	vmlal.s32	q1, d19, d29
 	vmlal.s32	q0, d18, d28
-	vmlal.s32	q0, d19, d9
 	vmlal.s32	q6, d18, d29
+	vmlal.s32	q2, d19, d6
+	vmlal.s32	q5, d19, d21
+	vmlal.s32	q1, d19, d29
+	vmlal.s32	q0, d19, d9
 	vmlal.s32	q6, d19, d28
-	add		r2, sp, #592
+	add		r2, sp, #560
 	vld1.8		{d18-d19}, [r2, : 128]
-	add		r2, sp, #512
+	add		r2, sp, #480
 	vld1.8		{d22-d23}, [r2, : 128]
 	vmlal.s32	q5, d19, d7
 	vmlal.s32	q0, d18, d21
 	vmlal.s32	q0, d19, d29
 	vmlal.s32	q6, d18, d6
-	add		r2, sp, #528
+	add		r2, sp, #496
 	vld1.8		{d6-d7}, [r2, : 128]
 	vmlal.s32	q6, d19, d21
-	add		r2, sp, #576
+	add		r2, sp, #544
 	vld1.8		{d18-d19}, [r2, : 128]
 	vmlal.s32	q0, d30, d8
-	add		r2, sp, #672
+	add		r2, sp, #640
 	vld1.8		{d20-d21}, [r2, : 128]
 	vmlal.s32	q5, d30, d29
-	add		r2, sp, #608
+	add		r2, sp, #576
 	vld1.8		{d24-d25}, [r2, : 128]
 	vmlal.s32	q1, d30, d28
 	vadd.i64	q13, q0, q11
@@ -823,22 +805,19 @@
 	vadd.i32	q5, q5, q0
 	vtrn.32		q11, q14
 	vadd.i32	q6, q6, q3
-	add		r2, sp, #560
+	add		r2, sp, #528
 	vadd.i32	q10, q10, q2
 	vtrn.32		d24, d25
-	vst1.8		{d12-d13}, [r2, : 128]
+	vst1.8		{d12-d13}, [r2, : 128]!
 	vshl.i32	q6, q13, #1
-	add		r2, sp, #576
-	vst1.8		{d20-d21}, [r2, : 128]
+	vst1.8		{d20-d21}, [r2, : 128]!
 	vshl.i32	q10, q14, #1
-	add		r2, sp, #592
-	vst1.8		{d12-d13}, [r2, : 128]
+	vst1.8		{d12-d13}, [r2, : 128]!
 	vshl.i32	q15, q12, #1
 	vadd.i32	q8, q8, q4
 	vext.32		d10, d31, d30, #0
 	vadd.i32	q7, q7, q1
-	add		r2, sp, #608
-	vst1.8		{d16-d17}, [r2, : 128]
+	vst1.8		{d16-d17}, [r2, : 128]!
 	vmull.s32	q8, d18, d5
 	vmlal.s32	q8, d26, d4
 	vmlal.s32	q8, d19, d9
@@ -849,8 +828,7 @@
 	vmlal.s32	q8, d29, d1
 	vmlal.s32	q8, d24, d6
 	vmlal.s32	q8, d25, d0
-	add		r2, sp, #624
-	vst1.8		{d14-d15}, [r2, : 128]
+	vst1.8		{d14-d15}, [r2, : 128]!
 	vmull.s32	q2, d18, d4
 	vmlal.s32	q2, d12, d9
 	vmlal.s32	q2, d13, d8
@@ -858,8 +836,7 @@
 	vmlal.s32	q2, d22, d2
 	vmlal.s32	q2, d23, d1
 	vmlal.s32	q2, d24, d0
-	add		r2, sp, #640
-	vst1.8		{d20-d21}, [r2, : 128]
+	vst1.8		{d20-d21}, [r2, : 128]!
 	vmull.s32	q7, d18, d9
 	vmlal.s32	q7, d26, d3
 	vmlal.s32	q7, d19, d8
@@ -868,15 +845,13 @@
 	vmlal.s32	q7, d28, d1
 	vmlal.s32	q7, d23, d6
 	vmlal.s32	q7, d29, d0
-	add		r2, sp, #656
-	vst1.8		{d10-d11}, [r2, : 128]
+	vst1.8		{d10-d11}, [r2, : 128]!
 	vmull.s32	q5, d18, d3
 	vmlal.s32	q5, d19, d2
 	vmlal.s32	q5, d22, d1
 	vmlal.s32	q5, d23, d0
 	vmlal.s32	q5, d12, d8
-	add		r2, sp, #672
-	vst1.8		{d16-d17}, [r2, : 128]
+	vst1.8		{d16-d17}, [r2, : 128]!
 	vmull.s32	q4, d18, d8
 	vmlal.s32	q4, d26, d2
 	vmlal.s32	q4, d19, d7
@@ -887,7 +862,7 @@
 	vmlal.s32	q8, d26, d1
 	vmlal.s32	q8, d19, d6
 	vmlal.s32	q8, d27, d0
-	add		r2, sp, #576
+	add		r2, sp, #544
 	vld1.8		{d20-d21}, [r2, : 128]
 	vmlal.s32	q7, d24, d21
 	vmlal.s32	q7, d25, d20
@@ -896,32 +871,30 @@
 	vmlal.s32	q8, d22, d21
 	vmlal.s32	q8, d28, d20
 	vmlal.s32	q5, d24, d20
-	add		r2, sp, #576
 	vst1.8		{d14-d15}, [r2, : 128]
 	vmull.s32	q7, d18, d6
 	vmlal.s32	q7, d26, d0
-	add		r2, sp, #656
+	add		r2, sp, #624
 	vld1.8		{d30-d31}, [r2, : 128]
 	vmlal.s32	q2, d30, d21
 	vmlal.s32	q7, d19, d21
 	vmlal.s32	q7, d27, d20
-	add		r2, sp, #624
+	add		r2, sp, #592
 	vld1.8		{d26-d27}, [r2, : 128]
 	vmlal.s32	q4, d25, d27
 	vmlal.s32	q8, d29, d27
 	vmlal.s32	q8, d25, d26
 	vmlal.s32	q7, d28, d27
 	vmlal.s32	q7, d29, d26
-	add		r2, sp, #608
+	add		r2, sp, #576
 	vld1.8		{d28-d29}, [r2, : 128]
 	vmlal.s32	q4, d24, d29
 	vmlal.s32	q8, d23, d29
 	vmlal.s32	q8, d24, d28
 	vmlal.s32	q7, d22, d29
 	vmlal.s32	q7, d23, d28
-	add		r2, sp, #608
 	vst1.8		{d8-d9}, [r2, : 128]
-	add		r2, sp, #560
+	add		r2, sp, #528
 	vld1.8		{d8-d9}, [r2, : 128]
 	vmlal.s32	q7, d24, d9
 	vmlal.s32	q7, d25, d31
@@ -942,36 +915,36 @@
 	vmlal.s32	q0, d23, d26
 	vmlal.s32	q0, d24, d31
 	vmlal.s32	q0, d19, d20
-	add		r2, sp, #640
+	add		r2, sp, #608
 	vld1.8		{d18-d19}, [r2, : 128]
 	vmlal.s32	q2, d18, d7
-	vmlal.s32	q2, d19, d6
 	vmlal.s32	q5, d18, d6
-	vmlal.s32	q5, d19, d21
 	vmlal.s32	q1, d18, d21
-	vmlal.s32	q1, d19, d29
 	vmlal.s32	q0, d18, d28
-	vmlal.s32	q0, d19, d9
 	vmlal.s32	q6, d18, d29
+	vmlal.s32	q2, d19, d6
+	vmlal.s32	q5, d19, d21
+	vmlal.s32	q1, d19, d29
+	vmlal.s32	q0, d19, d9
 	vmlal.s32	q6, d19, d28
-	add		r2, sp, #592
+	add		r2, sp, #560
 	vld1.8		{d18-d19}, [r2, : 128]
-	add		r2, sp, #512
+	add		r2, sp, #480
 	vld1.8		{d22-d23}, [r2, : 128]
 	vmlal.s32	q5, d19, d7
 	vmlal.s32	q0, d18, d21
 	vmlal.s32	q0, d19, d29
 	vmlal.s32	q6, d18, d6
-	add		r2, sp, #528
+	add		r2, sp, #496
 	vld1.8		{d6-d7}, [r2, : 128]
 	vmlal.s32	q6, d19, d21
-	add		r2, sp, #576
+	add		r2, sp, #544
 	vld1.8		{d18-d19}, [r2, : 128]
 	vmlal.s32	q0, d30, d8
-	add		r2, sp, #672
+	add		r2, sp, #640
 	vld1.8		{d20-d21}, [r2, : 128]
 	vmlal.s32	q5, d30, d29
-	add		r2, sp, #608
+	add		r2, sp, #576
 	vld1.8		{d24-d25}, [r2, : 128]
 	vmlal.s32	q1, d30, d28
 	vadd.i64	q13, q0, q11
@@ -1069,7 +1042,7 @@
 	sub		r4, r4, #24
 	vst1.8		d0, [r2, : 64]
 	vst1.8		d1, [r4, : 64]
-	add		r2, sp, #544
+	add		r2, sp, #512
 	add		r4, r3, #144
 	add		r5, r3, #192
 	vld1.8		{d0-d1}, [r2, : 128]
@@ -1139,14 +1112,13 @@
 	vmlal.s32	q0, d12, d8
 	vmlal.s32	q0, d13, d17
 	vmlal.s32	q0, d6, d6
-	add		r2, sp, #512
-	vld1.8		{d18-d19}, [r2, : 128]
+	add		r2, sp, #480
+	vld1.8		{d18-d19}, [r2, : 128]!
 	vmull.s32	q3, d16, d7
 	vmlal.s32	q3, d10, d15
 	vmlal.s32	q3, d11, d14
 	vmlal.s32	q3, d12, d9
 	vmlal.s32	q3, d13, d8
-	add		r2, sp, #528
 	vld1.8		{d8-d9}, [r2, : 128]
 	vadd.i64	q5, q12, q9
 	vadd.i64	q6, q15, q9
@@ -1295,22 +1267,19 @@
 	vadd.i32	q5, q5, q0
 	vtrn.32		q11, q14
 	vadd.i32	q6, q6, q3
-	add		r2, sp, #560
+	add		r2, sp, #528
 	vadd.i32	q10, q10, q2
 	vtrn.32		d24, d25
-	vst1.8		{d12-d13}, [r2, : 128]
+	vst1.8		{d12-d13}, [r2, : 128]!
 	vshl.i32	q6, q13, #1
-	add		r2, sp, #576
-	vst1.8		{d20-d21}, [r2, : 128]
+	vst1.8		{d20-d21}, [r2, : 128]!
 	vshl.i32	q10, q14, #1
-	add		r2, sp, #592
-	vst1.8		{d12-d13}, [r2, : 128]
+	vst1.8		{d12-d13}, [r2, : 128]!
 	vshl.i32	q15, q12, #1
 	vadd.i32	q8, q8, q4
 	vext.32		d10, d31, d30, #0
 	vadd.i32	q7, q7, q1
-	add		r2, sp, #608
-	vst1.8		{d16-d17}, [r2, : 128]
+	vst1.8		{d16-d17}, [r2, : 128]!
 	vmull.s32	q8, d18, d5
 	vmlal.s32	q8, d26, d4
 	vmlal.s32	q8, d19, d9
@@ -1321,8 +1290,7 @@
 	vmlal.s32	q8, d29, d1
 	vmlal.s32	q8, d24, d6
 	vmlal.s32	q8, d25, d0
-	add		r2, sp, #624
-	vst1.8		{d14-d15}, [r2, : 128]
+	vst1.8		{d14-d15}, [r2, : 128]!
 	vmull.s32	q2, d18, d4
 	vmlal.s32	q2, d12, d9
 	vmlal.s32	q2, d13, d8
@@ -1330,8 +1298,7 @@
 	vmlal.s32	q2, d22, d2
 	vmlal.s32	q2, d23, d1
 	vmlal.s32	q2, d24, d0
-	add		r2, sp, #640
-	vst1.8		{d20-d21}, [r2, : 128]
+	vst1.8		{d20-d21}, [r2, : 128]!
 	vmull.s32	q7, d18, d9
 	vmlal.s32	q7, d26, d3
 	vmlal.s32	q7, d19, d8
@@ -1340,15 +1307,13 @@
 	vmlal.s32	q7, d28, d1
 	vmlal.s32	q7, d23, d6
 	vmlal.s32	q7, d29, d0
-	add		r2, sp, #656
-	vst1.8		{d10-d11}, [r2, : 128]
+	vst1.8		{d10-d11}, [r2, : 128]!
 	vmull.s32	q5, d18, d3
 	vmlal.s32	q5, d19, d2
 	vmlal.s32	q5, d22, d1
 	vmlal.s32	q5, d23, d0
 	vmlal.s32	q5, d12, d8
-	add		r2, sp, #672
-	vst1.8		{d16-d17}, [r2, : 128]
+	vst1.8		{d16-d17}, [r2, : 128]!
 	vmull.s32	q4, d18, d8
 	vmlal.s32	q4, d26, d2
 	vmlal.s32	q4, d19, d7
@@ -1359,7 +1324,7 @@
 	vmlal.s32	q8, d26, d1
 	vmlal.s32	q8, d19, d6
 	vmlal.s32	q8, d27, d0
-	add		r2, sp, #576
+	add		r2, sp, #544
 	vld1.8		{d20-d21}, [r2, : 128]
 	vmlal.s32	q7, d24, d21
 	vmlal.s32	q7, d25, d20
@@ -1368,32 +1333,30 @@
 	vmlal.s32	q8, d22, d21
 	vmlal.s32	q8, d28, d20
 	vmlal.s32	q5, d24, d20
-	add		r2, sp, #576
 	vst1.8		{d14-d15}, [r2, : 128]
 	vmull.s32	q7, d18, d6
 	vmlal.s32	q7, d26, d0
-	add		r2, sp, #656
+	add		r2, sp, #624
 	vld1.8		{d30-d31}, [r2, : 128]
 	vmlal.s32	q2, d30, d21
 	vmlal.s32	q7, d19, d21
 	vmlal.s32	q7, d27, d20
-	add		r2, sp, #624
+	add		r2, sp, #592
 	vld1.8		{d26-d27}, [r2, : 128]
 	vmlal.s32	q4, d25, d27
 	vmlal.s32	q8, d29, d27
 	vmlal.s32	q8, d25, d26
 	vmlal.s32	q7, d28, d27
 	vmlal.s32	q7, d29, d26
-	add		r2, sp, #608
+	add		r2, sp, #576
 	vld1.8		{d28-d29}, [r2, : 128]
 	vmlal.s32	q4, d24, d29
 	vmlal.s32	q8, d23, d29
 	vmlal.s32	q8, d24, d28
 	vmlal.s32	q7, d22, d29
 	vmlal.s32	q7, d23, d28
-	add		r2, sp, #608
 	vst1.8		{d8-d9}, [r2, : 128]
-	add		r2, sp, #560
+	add		r2, sp, #528
 	vld1.8		{d8-d9}, [r2, : 128]
 	vmlal.s32	q7, d24, d9
 	vmlal.s32	q7, d25, d31
@@ -1414,36 +1377,36 @@
 	vmlal.s32	q0, d23, d26
 	vmlal.s32	q0, d24, d31
 	vmlal.s32	q0, d19, d20
-	add		r2, sp, #640
+	add		r2, sp, #608
 	vld1.8		{d18-d19}, [r2, : 128]
 	vmlal.s32	q2, d18, d7
-	vmlal.s32	q2, d19, d6
 	vmlal.s32	q5, d18, d6
-	vmlal.s32	q5, d19, d21
 	vmlal.s32	q1, d18, d21
-	vmlal.s32	q1, d19, d29
 	vmlal.s32	q0, d18, d28
-	vmlal.s32	q0, d19, d9
 	vmlal.s32	q6, d18, d29
+	vmlal.s32	q2, d19, d6
+	vmlal.s32	q5, d19, d21
+	vmlal.s32	q1, d19, d29
+	vmlal.s32	q0, d19, d9
 	vmlal.s32	q6, d19, d28
-	add		r2, sp, #592
+	add		r2, sp, #560
 	vld1.8		{d18-d19}, [r2, : 128]
-	add		r2, sp, #512
+	add		r2, sp, #480
 	vld1.8		{d22-d23}, [r2, : 128]
 	vmlal.s32	q5, d19, d7
 	vmlal.s32	q0, d18, d21
 	vmlal.s32	q0, d19, d29
 	vmlal.s32	q6, d18, d6
-	add		r2, sp, #528
+	add		r2, sp, #496
 	vld1.8		{d6-d7}, [r2, : 128]
 	vmlal.s32	q6, d19, d21
-	add		r2, sp, #576
+	add		r2, sp, #544
 	vld1.8		{d18-d19}, [r2, : 128]
 	vmlal.s32	q0, d30, d8
-	add		r2, sp, #672
+	add		r2, sp, #640
 	vld1.8		{d20-d21}, [r2, : 128]
 	vmlal.s32	q5, d30, d29
-	add		r2, sp, #608
+	add		r2, sp, #576
 	vld1.8		{d24-d25}, [r2, : 128]
 	vmlal.s32	q1, d30, d28
 	vadd.i64	q13, q0, q11
@@ -1541,10 +1504,10 @@
 	sub		r4, r4, #24
 	vst1.8		d0, [r2, : 64]
 	vst1.8		d1, [r4, : 64]
-	ldr		r2, [sp, #488]
-	ldr		r4, [sp, #492]
+	ldr		r2, [sp, #456]
+	ldr		r4, [sp, #460]
 	subs		r5, r2, #1
-	bge		._mainloop
+	bge		.Lmainloop
 	add		r1, r3, #144
 	add		r2, r3, #336
 	vld1.8		{d0-d1}, [r1, : 128]!
@@ -1553,41 +1516,41 @@
 	vst1.8		{d0-d1}, [r2, : 128]!
 	vst1.8		{d2-d3}, [r2, : 128]!
 	vst1.8		d4, [r2, : 64]
-	ldr		r1, =0
-._invertloop:
+	movw		r1, #0
+.Linvertloop:
 	add		r2, r3, #144
-	ldr		r4, =0
-	ldr		r5, =2
+	movw		r4, #0
+	movw		r5, #2
 	cmp		r1, #1
-	ldreq		r5, =1
+	moveq		r5, #1
 	addeq		r2, r3, #336
 	addeq		r4, r3, #48
 	cmp		r1, #2
-	ldreq		r5, =1
+	moveq		r5, #1
 	addeq		r2, r3, #48
 	cmp		r1, #3
-	ldreq		r5, =5
+	moveq		r5, #5
 	addeq		r4, r3, #336
 	cmp		r1, #4
-	ldreq		r5, =10
+	moveq		r5, #10
 	cmp		r1, #5
-	ldreq		r5, =20
+	moveq		r5, #20
 	cmp		r1, #6
-	ldreq		r5, =10
+	moveq		r5, #10
 	addeq		r2, r3, #336
 	addeq		r4, r3, #336
 	cmp		r1, #7
-	ldreq		r5, =50
+	moveq		r5, #50
 	cmp		r1, #8
-	ldreq		r5, =100
+	moveq		r5, #100
 	cmp		r1, #9
-	ldreq		r5, =50
+	moveq		r5, #50
 	addeq		r2, r3, #336
 	cmp		r1, #10
-	ldreq		r5, =5
+	moveq		r5, #5
 	addeq		r2, r3, #48
 	cmp		r1, #11
-	ldreq		r5, =0
+	moveq		r5, #0
 	addeq		r2, r3, #96
 	add		r6, r3, #144
 	add		r7, r3, #288
@@ -1598,8 +1561,8 @@
 	vst1.8		{d2-d3}, [r7, : 128]!
 	vst1.8		d4, [r7, : 64]
 	cmp		r5, #0
-	beq		._skipsquaringloop
-._squaringloop:
+	beq		.Lskipsquaringloop
+.Lsquaringloop:
 	add		r6, r3, #288
 	add		r7, r3, #288
 	add		r8, r3, #288
@@ -1611,7 +1574,7 @@
 	vld1.8		{d6-d7}, [r7, : 128]!
 	vld1.8		{d9}, [r7, : 64]
 	vld1.8		{d10-d11}, [r6, : 128]!
-	add		r7, sp, #416
+	add		r7, sp, #384
 	vld1.8		{d12-d13}, [r6, : 128]!
 	vmul.i32	q7, q2, q0
 	vld1.8		{d8}, [r6, : 64]
@@ -1726,7 +1689,7 @@
 	vext.32		d10, d6, d6, #0
 	vmov.i32	q1, #0xffffffff
 	vshl.i64	q4, q1, #25
-	add		r7, sp, #512
+	add		r7, sp, #480
 	vld1.8		{d14-d15}, [r7, : 128]
 	vadd.i64	q9, q2, q7
 	vshl.i64	q1, q1, #26
@@ -1735,7 +1698,7 @@
 	vadd.i64	q5, q5, q10
 	vand		q9, q9, q1
 	vld1.8		{d16}, [r6, : 64]!
-	add		r6, sp, #528
+	add		r6, sp, #496
 	vld1.8		{d20-d21}, [r6, : 128]
 	vadd.i64	q11, q5, q10
 	vsub.i64	q2, q2, q9
@@ -1789,8 +1752,8 @@
 	sub		r6, r6, #32
 	vst1.8		d4, [r6, : 64]
 	subs		r5, r5, #1
-	bhi		._squaringloop
-._skipsquaringloop:
+	bhi		.Lsquaringloop
+.Lskipsquaringloop:
 	mov		r2, r2
 	add		r5, r3, #288
 	add		r6, r3, #144
@@ -1802,7 +1765,7 @@
 	vld1.8		{d6-d7}, [r5, : 128]!
 	vld1.8		{d9}, [r5, : 64]
 	vld1.8		{d10-d11}, [r2, : 128]!
-	add		r5, sp, #416
+	add		r5, sp, #384
 	vld1.8		{d12-d13}, [r2, : 128]!
 	vmul.i32	q7, q2, q0
 	vld1.8		{d8}, [r2, : 64]
@@ -1917,7 +1880,7 @@
 	vext.32		d10, d6, d6, #0
 	vmov.i32	q1, #0xffffffff
 	vshl.i64	q4, q1, #25
-	add		r5, sp, #512
+	add		r5, sp, #480
 	vld1.8		{d14-d15}, [r5, : 128]
 	vadd.i64	q9, q2, q7
 	vshl.i64	q1, q1, #26
@@ -1926,7 +1889,7 @@
 	vadd.i64	q5, q5, q10
 	vand		q9, q9, q1
 	vld1.8		{d16}, [r2, : 64]!
-	add		r2, sp, #528
+	add		r2, sp, #496
 	vld1.8		{d20-d21}, [r2, : 128]
 	vadd.i64	q11, q5, q10
 	vsub.i64	q2, q2, q9
@@ -1980,7 +1943,7 @@
 	sub		r2, r2, #32
 	vst1.8		d4, [r2, : 64]
 	cmp		r4, #0
-	beq		._skippostcopy
+	beq		.Lskippostcopy
 	add		r2, r3, #144
 	mov		r4, r4
 	vld1.8		{d0-d1}, [r2, : 128]!
@@ -1989,9 +1952,9 @@
 	vst1.8		{d0-d1}, [r4, : 128]!
 	vst1.8		{d2-d3}, [r4, : 128]!
 	vst1.8		d4, [r4, : 64]
-._skippostcopy:
+.Lskippostcopy:
 	cmp		r1, #1
-	bne		._skipfinalcopy
+	bne		.Lskipfinalcopy
 	add		r2, r3, #288
 	add		r4, r3, #144
 	vld1.8		{d0-d1}, [r2, : 128]!
@@ -2000,10 +1963,10 @@
 	vst1.8		{d0-d1}, [r4, : 128]!
 	vst1.8		{d2-d3}, [r4, : 128]!
 	vst1.8		d4, [r4, : 64]
-._skipfinalcopy:
+.Lskipfinalcopy:
 	add		r1, r1, #1
 	cmp		r1, #12
-	blo		._invertloop
+	blo		.Linvertloop
 	add		r1, r3, #144
 	ldr		r2, [r1], #4
 	ldr		r3, [r1], #4
@@ -2085,21 +2048,15 @@
 	add		r8, r8, r10, LSL #12
 	mov		r9, r10, LSR #20
 	add		r1, r9, r1, LSL #6
-	str		r2, [r0], #4
-	str		r3, [r0], #4
-	str		r4, [r0], #4
-	str		r5, [r0], #4
-	str		r6, [r0], #4
-	str		r7, [r0], #4
-	str		r8, [r0], #4
-	str		r1, [r0]
-	ldrd		r4, [sp, #0]
-	ldrd		r6, [sp, #8]
-	ldrd		r8, [sp, #16]
-	ldrd		r10, [sp, #24]
-	ldr		r12, [sp, #480]
-	ldr		r14, [sp, #484]
-	ldr		r0, =0
-	mov		sp, r12
-	vpop		{q4, q5, q6, q7}
-	bx		lr
+	str		r2, [r0]
+	str		r3, [r0, #4]
+	str		r4, [r0, #8]
+	str		r5, [r0, #12]
+	str		r6, [r0, #16]
+	str		r7, [r0, #20]
+	str		r8, [r0, #24]
+	str		r1, [r0, #28]
+	movw		r0, #0
+	mov		sp, ip
+	pop		{r4-r11, pc}
+ENDPROC(curve25519_neon)
diff --git a/arch/arm/crypto/curve25519-glue.c b/arch/arm/crypto/curve25519-glue.c
new file mode 100644
index 000000000000..6eb0c6322948
--- /dev/null
+++ b/arch/arm/crypto/curve25519-glue.c
@@ -0,0 +1,136 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/*
+ * Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ *
+ * Based on public domain code from Daniel J. Bernstein and Peter Schwabe. This
+ * began from SUPERCOP's curve25519/neon2/scalarmult.s, but has subsequently been
+ * manually reworked for use in kernel space.
+ */
+
+#include <asm/hwcap.h>
+#include <asm/neon.h>
+#include <asm/simd.h>
+#include <crypto/internal/kpp.h>
+#include <crypto/internal/simd.h>
+#include <linux/types.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/jump_label.h>
+#include <crypto/curve25519.h>
+
+asmlinkage void curve25519_neon(u8 mypublic[CURVE25519_KEY_SIZE],
+				const u8 secret[CURVE25519_KEY_SIZE],
+				const u8 basepoint[CURVE25519_KEY_SIZE]);
+
+static __ro_after_init DEFINE_STATIC_KEY_FALSE(have_neon);
+
+void curve25519_arch(u8 out[CURVE25519_KEY_SIZE],
+		     const u8 scalar[CURVE25519_KEY_SIZE],
+		     const u8 point[CURVE25519_KEY_SIZE])
+{
+	if (static_branch_likely(&have_neon) && crypto_simd_usable()) {
+		kernel_neon_begin();
+		curve25519_neon(out, scalar, point);
+		kernel_neon_end();
+	} else {
+		curve25519_generic(out, scalar, point);
+	}
+}
+EXPORT_SYMBOL(curve25519_arch);
+
+void curve25519_base_arch(u8 pub[CURVE25519_KEY_SIZE],
+			  const u8 secret[CURVE25519_KEY_SIZE])
+{
+	curve25519_arch(pub, secret, curve25519_base_point);
+}
+EXPORT_SYMBOL(curve25519_base_arch);
+
+static int curve25519_set_secret(struct crypto_kpp *tfm, const void *buf,
+				 unsigned int len)
+{
+	u8 *secret = kpp_tfm_ctx(tfm);
+
+	if (!len)
+		curve25519_generate_secret(secret);
+	else if (len == CURVE25519_KEY_SIZE)
+		memcpy(secret, buf, CURVE25519_KEY_SIZE);
+	else
+		return -EINVAL;
+	return 0;
+}
+
+static int curve25519_compute_value(struct kpp_request *req)
+{
+	struct crypto_kpp *tfm = crypto_kpp_reqtfm(req);
+	const u8 *secret = kpp_tfm_ctx(tfm);
+	u8 public_key[CURVE25519_KEY_SIZE];
+	u8 buf[CURVE25519_KEY_SIZE];
+	int copied, nbytes;
+	u8 const *bp;
+
+	if (req->src) {
+		copied = sg_copy_to_buffer(req->src,
+					   sg_nents_for_len(req->src,
+							    CURVE25519_KEY_SIZE),
+					   public_key, CURVE25519_KEY_SIZE);
+		if (copied != CURVE25519_KEY_SIZE)
+			return -EINVAL;
+		bp = public_key;
+	} else {
+		bp = curve25519_base_point;
+	}
+
+	curve25519_arch(buf, secret, bp);
+
+	if (!crypto_memneq(buf, curve25519_null_point, CURVE25519_KEY_SIZE))
+		return -EKEYREJECTED;
+
+	/* might want less than we've got */
+	nbytes = min_t(size_t, CURVE25519_KEY_SIZE, req->dst_len);
+	copied = sg_copy_from_buffer(req->dst, sg_nents_for_len(req->dst,
+								nbytes),
+				     buf, nbytes);
+	if (copied != nbytes)
+		return -EINVAL;
+	return 0;
+}
+
+static unsigned int curve25519_max_size(struct crypto_kpp *tfm)
+{
+	return CURVE25519_KEY_SIZE;
+}
+
+static struct kpp_alg curve25519_alg = {
+	.base.cra_name		= "curve25519",
+	.base.cra_driver_name	= "curve25519-neon",
+	.base.cra_priority	= 200,
+	.base.cra_module	= THIS_MODULE,
+	.base.cra_ctxsize	= CURVE25519_KEY_SIZE,
+
+	.set_secret		= curve25519_set_secret,
+	.generate_public_key	= curve25519_compute_value,
+	.compute_shared_secret	= curve25519_compute_value,
+	.max_size		= curve25519_max_size,
+};
+
+static int __init mod_init(void)
+{
+	if (elf_hwcap & HWCAP_NEON) {
+		static_branch_enable(&have_neon);
+		return crypto_register_kpp(&curve25519_alg);
+	}
+	return 0;
+}
+
+static void __exit mod_exit(void)
+{
+	if (elf_hwcap & HWCAP_NEON)
+		crypto_unregister_kpp(&curve25519_alg);
+}
+
+module_init(mod_init);
+module_exit(mod_exit);
+
+MODULE_ALIAS_CRYPTO("curve25519");
+MODULE_ALIAS_CRYPTO("curve25519-neon");
+MODULE_LICENSE("GPL v2");
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4 34/35] crypto: chacha20poly1305 - import construction and selftest from Zinc
  2019-10-17 19:08 [PATCH v4 00/35] crypto: crypto API library interfaces for WireGuard Ard Biesheuvel
                   ` (32 preceding siblings ...)
  2019-10-17 19:09 ` [PATCH v4 33/35] crypto: arm/Curve25519 - wire up NEON implementation Ard Biesheuvel
@ 2019-10-17 19:09 ` Ard Biesheuvel
  2019-10-17 19:09 ` [PATCH v4 35/35] crypto: lib/chacha20poly1305 - reimplement crypt_from_sg() routine Ard Biesheuvel
  34 siblings, 0 replies; 50+ messages in thread
From: Ard Biesheuvel @ 2019-10-17 19:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: Ard Biesheuvel, Herbert Xu, David Miller, Jason A . Donenfeld,
	Samuel Neves, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Martin Willi, Rene van Dorst, David Sterba

This incorporates the chacha20poly1305 from the Zinc library, retaining
the library interface, but replacing the implementation with calls into
the code that already existed in the kernel's crypto API.

Note that this library API does not implement RFC7539 fully, given that
it is limited to 64-bit nonces. (The 96-bit nonce version that was part
of the selftest only has been removed, along with the 96-bit nonce test
vectors that only tested the selftest but not the actual library itself)

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 include/crypto/chacha20poly1305.h      |   37 +
 lib/crypto/Kconfig                     |    7 +
 lib/crypto/Makefile                    |    4 +
 lib/crypto/chacha20poly1305-selftest.c | 7348 ++++++++++++++++++++
 lib/crypto/chacha20poly1305.c          |  219 +
 5 files changed, 7615 insertions(+)

diff --git a/include/crypto/chacha20poly1305.h b/include/crypto/chacha20poly1305.h
new file mode 100644
index 000000000000..ad3b1de58df8
--- /dev/null
+++ b/include/crypto/chacha20poly1305.h
@@ -0,0 +1,37 @@
+/* SPDX-License-Identifier: GPL-2.0 OR MIT */
+/*
+ * Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#ifndef __CHACHA20POLY1305_H
+#define __CHACHA20POLY1305_H
+
+#include <linux/types.h>
+
+enum chacha20poly1305_lengths {
+	XCHACHA20POLY1305_NONCE_SIZE = 24,
+	CHACHA20POLY1305_KEY_SIZE = 32,
+	CHACHA20POLY1305_AUTHTAG_SIZE = 16
+};
+
+void chacha20poly1305_encrypt(u8 *dst, const u8 *src, const size_t src_len,
+			      const u8 *ad, const size_t ad_len,
+			      const u64 nonce,
+			      const u8 key[CHACHA20POLY1305_KEY_SIZE]);
+
+bool __must_check
+chacha20poly1305_decrypt(u8 *dst, const u8 *src, const size_t src_len,
+			 const u8 *ad, const size_t ad_len, const u64 nonce,
+			 const u8 key[CHACHA20POLY1305_KEY_SIZE]);
+
+void xchacha20poly1305_encrypt(u8 *dst, const u8 *src, const size_t src_len,
+			       const u8 *ad, const size_t ad_len,
+			       const u8 nonce[XCHACHA20POLY1305_NONCE_SIZE],
+			       const u8 key[CHACHA20POLY1305_KEY_SIZE]);
+
+bool __must_check xchacha20poly1305_decrypt(
+	u8 *dst, const u8 *src, const size_t src_len, const u8 *ad,
+	const size_t ad_len, const u8 nonce[XCHACHA20POLY1305_NONCE_SIZE],
+	const u8 key[CHACHA20POLY1305_KEY_SIZE]);
+
+#endif /* __CHACHA20POLY1305_H */
diff --git a/lib/crypto/Kconfig b/lib/crypto/Kconfig
index b1d830dc1c9e..0b2c4fce26d9 100644
--- a/lib/crypto/Kconfig
+++ b/lib/crypto/Kconfig
@@ -119,5 +119,12 @@ config CRYPTO_LIB_POLY1305
 	  by either the generic implementation or an arch-specific one, if one
 	  is available and enabled.
 
+config CRYPTO_LIB_CHACHA20POLY1305
+	tristate "ChaCha20-Poly1305 AEAD support (8-byte nonce library version)"
+	depends on CRYPTO_ARCH_HAVE_LIB_CHACHA || !CRYPTO_ARCH_HAVE_LIB_CHACHA
+	depends on CRYPTO_ARCH_HAVE_LIB_POLY1305 || !CRYPTO_ARCH_HAVE_LIB_POLY1305
+	select CRYPTO_LIB_CHACHA
+	select CRYPTO_LIB_POLY1305
+
 config CRYPTO_LIB_SHA256
 	tristate
diff --git a/lib/crypto/Makefile b/lib/crypto/Makefile
index 273c55d5e147..34a701ab8b92 100644
--- a/lib/crypto/Makefile
+++ b/lib/crypto/Makefile
@@ -16,6 +16,9 @@ libblake2s-generic-y				+= blake2s-generic.o
 obj-$(CONFIG_CRYPTO_LIB_BLAKE2S)		+= libblake2s.o
 libblake2s-y					+= blake2s.o
 
+obj-$(CONFIG_CRYPTO_LIB_CHACHA20POLY1305)	+= libchacha20poly1305.o
+libchacha20poly1305-y				+= chacha20poly1305.o
+
 obj-$(CONFIG_CRYPTO_LIB_CURVE25519_GENERIC)	+= libcurve25519.o
 libcurve25519-y					:= curve25519-fiat32.o
 libcurve25519-$(CONFIG_ARCH_SUPPORTS_INT128)	:= curve25519-hacl64.o
@@ -32,4 +35,5 @@ libsha256-y					:= sha256.o
 
 ifneq ($(CONFIG_CRYPTO_MANAGER_DISABLE_TESTS),y)
 libblake2s-y					+= blake2s-selftest.o
+libchacha20poly1305-y				+= chacha20poly1305-selftest.o
 endif
diff --git a/lib/crypto/chacha20poly1305-selftest.c b/lib/crypto/chacha20poly1305-selftest.c
new file mode 100644
index 000000000000..d1ed0f27cfdb
--- /dev/null
+++ b/lib/crypto/chacha20poly1305-selftest.c
@@ -0,0 +1,7348 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/*
+ * Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#include <crypto/chacha20poly1305.h>
+#include <crypto/poly1305.h>
+
+#include <asm/unaligned.h>
+#include <linux/bug.h>
+#include <linux/init.h>
+#include <linux/mm.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+
+struct chacha20poly1305_testvec {
+	const u8 *input, *output, *assoc, *nonce, *key;
+	size_t ilen, alen, nlen;
+	bool failure;
+};
+
+/* The first of these are the ChaCha20-Poly1305 AEAD test vectors from RFC7539
+ * 2.8.2. After they are generated by reference implementations. And the final
+ * marked ones are taken from wycheproof, but we only do these for the encrypt
+ * side, because mostly we're stressing the primitives rather than the actual
+ * chapoly construction.
+ */
+
+static const u8 enc_input001[] __initconst = {
+	0x49, 0x6e, 0x74, 0x65, 0x72, 0x6e, 0x65, 0x74,
+	0x2d, 0x44, 0x72, 0x61, 0x66, 0x74, 0x73, 0x20,
+	0x61, 0x72, 0x65, 0x20, 0x64, 0x72, 0x61, 0x66,
+	0x74, 0x20, 0x64, 0x6f, 0x63, 0x75, 0x6d, 0x65,
+	0x6e, 0x74, 0x73, 0x20, 0x76, 0x61, 0x6c, 0x69,
+	0x64, 0x20, 0x66, 0x6f, 0x72, 0x20, 0x61, 0x20,
+	0x6d, 0x61, 0x78, 0x69, 0x6d, 0x75, 0x6d, 0x20,
+	0x6f, 0x66, 0x20, 0x73, 0x69, 0x78, 0x20, 0x6d,
+	0x6f, 0x6e, 0x74, 0x68, 0x73, 0x20, 0x61, 0x6e,
+	0x64, 0x20, 0x6d, 0x61, 0x79, 0x20, 0x62, 0x65,
+	0x20, 0x75, 0x70, 0x64, 0x61, 0x74, 0x65, 0x64,
+	0x2c, 0x20, 0x72, 0x65, 0x70, 0x6c, 0x61, 0x63,
+	0x65, 0x64, 0x2c, 0x20, 0x6f, 0x72, 0x20, 0x6f,
+	0x62, 0x73, 0x6f, 0x6c, 0x65, 0x74, 0x65, 0x64,
+	0x20, 0x62, 0x79, 0x20, 0x6f, 0x74, 0x68, 0x65,
+	0x72, 0x20, 0x64, 0x6f, 0x63, 0x75, 0x6d, 0x65,
+	0x6e, 0x74, 0x73, 0x20, 0x61, 0x74, 0x20, 0x61,
+	0x6e, 0x79, 0x20, 0x74, 0x69, 0x6d, 0x65, 0x2e,
+	0x20, 0x49, 0x74, 0x20, 0x69, 0x73, 0x20, 0x69,
+	0x6e, 0x61, 0x70, 0x70, 0x72, 0x6f, 0x70, 0x72,
+	0x69, 0x61, 0x74, 0x65, 0x20, 0x74, 0x6f, 0x20,
+	0x75, 0x73, 0x65, 0x20, 0x49, 0x6e, 0x74, 0x65,
+	0x72, 0x6e, 0x65, 0x74, 0x2d, 0x44, 0x72, 0x61,
+	0x66, 0x74, 0x73, 0x20, 0x61, 0x73, 0x20, 0x72,
+	0x65, 0x66, 0x65, 0x72, 0x65, 0x6e, 0x63, 0x65,
+	0x20, 0x6d, 0x61, 0x74, 0x65, 0x72, 0x69, 0x61,
+	0x6c, 0x20, 0x6f, 0x72, 0x20, 0x74, 0x6f, 0x20,
+	0x63, 0x69, 0x74, 0x65, 0x20, 0x74, 0x68, 0x65,
+	0x6d, 0x20, 0x6f, 0x74, 0x68, 0x65, 0x72, 0x20,
+	0x74, 0x68, 0x61, 0x6e, 0x20, 0x61, 0x73, 0x20,
+	0x2f, 0xe2, 0x80, 0x9c, 0x77, 0x6f, 0x72, 0x6b,
+	0x20, 0x69, 0x6e, 0x20, 0x70, 0x72, 0x6f, 0x67,
+	0x72, 0x65, 0x73, 0x73, 0x2e, 0x2f, 0xe2, 0x80,
+	0x9d
+};
+static const u8 enc_output001[] __initconst = {
+	0x64, 0xa0, 0x86, 0x15, 0x75, 0x86, 0x1a, 0xf4,
+	0x60, 0xf0, 0x62, 0xc7, 0x9b, 0xe6, 0x43, 0xbd,
+	0x5e, 0x80, 0x5c, 0xfd, 0x34, 0x5c, 0xf3, 0x89,
+	0xf1, 0x08, 0x67, 0x0a, 0xc7, 0x6c, 0x8c, 0xb2,
+	0x4c, 0x6c, 0xfc, 0x18, 0x75, 0x5d, 0x43, 0xee,
+	0xa0, 0x9e, 0xe9, 0x4e, 0x38, 0x2d, 0x26, 0xb0,
+	0xbd, 0xb7, 0xb7, 0x3c, 0x32, 0x1b, 0x01, 0x00,
+	0xd4, 0xf0, 0x3b, 0x7f, 0x35, 0x58, 0x94, 0xcf,
+	0x33, 0x2f, 0x83, 0x0e, 0x71, 0x0b, 0x97, 0xce,
+	0x98, 0xc8, 0xa8, 0x4a, 0xbd, 0x0b, 0x94, 0x81,
+	0x14, 0xad, 0x17, 0x6e, 0x00, 0x8d, 0x33, 0xbd,
+	0x60, 0xf9, 0x82, 0xb1, 0xff, 0x37, 0xc8, 0x55,
+	0x97, 0x97, 0xa0, 0x6e, 0xf4, 0xf0, 0xef, 0x61,
+	0xc1, 0x86, 0x32, 0x4e, 0x2b, 0x35, 0x06, 0x38,
+	0x36, 0x06, 0x90, 0x7b, 0x6a, 0x7c, 0x02, 0xb0,
+	0xf9, 0xf6, 0x15, 0x7b, 0x53, 0xc8, 0x67, 0xe4,
+	0xb9, 0x16, 0x6c, 0x76, 0x7b, 0x80, 0x4d, 0x46,
+	0xa5, 0x9b, 0x52, 0x16, 0xcd, 0xe7, 0xa4, 0xe9,
+	0x90, 0x40, 0xc5, 0xa4, 0x04, 0x33, 0x22, 0x5e,
+	0xe2, 0x82, 0xa1, 0xb0, 0xa0, 0x6c, 0x52, 0x3e,
+	0xaf, 0x45, 0x34, 0xd7, 0xf8, 0x3f, 0xa1, 0x15,
+	0x5b, 0x00, 0x47, 0x71, 0x8c, 0xbc, 0x54, 0x6a,
+	0x0d, 0x07, 0x2b, 0x04, 0xb3, 0x56, 0x4e, 0xea,
+	0x1b, 0x42, 0x22, 0x73, 0xf5, 0x48, 0x27, 0x1a,
+	0x0b, 0xb2, 0x31, 0x60, 0x53, 0xfa, 0x76, 0x99,
+	0x19, 0x55, 0xeb, 0xd6, 0x31, 0x59, 0x43, 0x4e,
+	0xce, 0xbb, 0x4e, 0x46, 0x6d, 0xae, 0x5a, 0x10,
+	0x73, 0xa6, 0x72, 0x76, 0x27, 0x09, 0x7a, 0x10,
+	0x49, 0xe6, 0x17, 0xd9, 0x1d, 0x36, 0x10, 0x94,
+	0xfa, 0x68, 0xf0, 0xff, 0x77, 0x98, 0x71, 0x30,
+	0x30, 0x5b, 0xea, 0xba, 0x2e, 0xda, 0x04, 0xdf,
+	0x99, 0x7b, 0x71, 0x4d, 0x6c, 0x6f, 0x2c, 0x29,
+	0xa6, 0xad, 0x5c, 0xb4, 0x02, 0x2b, 0x02, 0x70,
+	0x9b, 0xee, 0xad, 0x9d, 0x67, 0x89, 0x0c, 0xbb,
+	0x22, 0x39, 0x23, 0x36, 0xfe, 0xa1, 0x85, 0x1f,
+	0x38
+};
+static const u8 enc_assoc001[] __initconst = {
+	0xf3, 0x33, 0x88, 0x86, 0x00, 0x00, 0x00, 0x00,
+	0x00, 0x00, 0x4e, 0x91
+};
+static const u8 enc_nonce001[] __initconst = {
+	0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08
+};
+static const u8 enc_key001[] __initconst = {
+	0x1c, 0x92, 0x40, 0xa5, 0xeb, 0x55, 0xd3, 0x8a,
+	0xf3, 0x33, 0x88, 0x86, 0x04, 0xf6, 0xb5, 0xf0,
+	0x47, 0x39, 0x17, 0xc1, 0x40, 0x2b, 0x80, 0x09,
+	0x9d, 0xca, 0x5c, 0xbc, 0x20, 0x70, 0x75, 0xc0
+};
+
+static const u8 enc_input002[] __initconst = { };
+static const u8 enc_output002[] __initconst = {
+	0xea, 0xe0, 0x1e, 0x9e, 0x2c, 0x91, 0xaa, 0xe1,
+	0xdb, 0x5d, 0x99, 0x3f, 0x8a, 0xf7, 0x69, 0x92
+};
+static const u8 enc_assoc002[] __initconst = { };
+static const u8 enc_nonce002[] __initconst = {
+	0xca, 0xbf, 0x33, 0x71, 0x32, 0x45, 0x77, 0x8e
+};
+static const u8 enc_key002[] __initconst = {
+	0x4c, 0xf5, 0x96, 0x83, 0x38, 0xe6, 0xae, 0x7f,
+	0x2d, 0x29, 0x25, 0x76, 0xd5, 0x75, 0x27, 0x86,
+	0x91, 0x9a, 0x27, 0x7a, 0xfb, 0x46, 0xc5, 0xef,
+	0x94, 0x81, 0x79, 0x57, 0x14, 0x59, 0x40, 0x68
+};
+
+static const u8 enc_input003[] __initconst = { };
+static const u8 enc_output003[] __initconst = {
+	0xdd, 0x6b, 0x3b, 0x82, 0xce, 0x5a, 0xbd, 0xd6,
+	0xa9, 0x35, 0x83, 0xd8, 0x8c, 0x3d, 0x85, 0x77
+};
+static const u8 enc_assoc003[] __initconst = {
+	0x33, 0x10, 0x41, 0x12, 0x1f, 0xf3, 0xd2, 0x6b
+};
+static const u8 enc_nonce003[] __initconst = {
+	0x3d, 0x86, 0xb5, 0x6b, 0xc8, 0xa3, 0x1f, 0x1d
+};
+static const u8 enc_key003[] __initconst = {
+	0x2d, 0xb0, 0x5d, 0x40, 0xc8, 0xed, 0x44, 0x88,
+	0x34, 0xd1, 0x13, 0xaf, 0x57, 0xa1, 0xeb, 0x3a,
+	0x2a, 0x80, 0x51, 0x36, 0xec, 0x5b, 0xbc, 0x08,
+	0x93, 0x84, 0x21, 0xb5, 0x13, 0x88, 0x3c, 0x0d
+};
+
+static const u8 enc_input004[] __initconst = {
+	0xa4
+};
+static const u8 enc_output004[] __initconst = {
+	0xb7, 0x1b, 0xb0, 0x73, 0x59, 0xb0, 0x84, 0xb2,
+	0x6d, 0x8e, 0xab, 0x94, 0x31, 0xa1, 0xae, 0xac,
+	0x89
+};
+static const u8 enc_assoc004[] __initconst = {
+	0x6a, 0xe2, 0xad, 0x3f, 0x88, 0x39, 0x5a, 0x40
+};
+static const u8 enc_nonce004[] __initconst = {
+	0xd2, 0x32, 0x1f, 0x29, 0x28, 0xc6, 0xc4, 0xc4
+};
+static const u8 enc_key004[] __initconst = {
+	0x4b, 0x28, 0x4b, 0xa3, 0x7b, 0xbe, 0xe9, 0xf8,
+	0x31, 0x80, 0x82, 0xd7, 0xd8, 0xe8, 0xb5, 0xa1,
+	0xe2, 0x18, 0x18, 0x8a, 0x9c, 0xfa, 0xa3, 0x3d,
+	0x25, 0x71, 0x3e, 0x40, 0xbc, 0x54, 0x7a, 0x3e
+};
+
+static const u8 enc_input005[] __initconst = {
+	0x2d
+};
+static const u8 enc_output005[] __initconst = {
+	0xbf, 0xe1, 0x5b, 0x0b, 0xdb, 0x6b, 0xf5, 0x5e,
+	0x6c, 0x5d, 0x84, 0x44, 0x39, 0x81, 0xc1, 0x9c,
+	0xac
+};
+static const u8 enc_assoc005[] __initconst = { };
+static const u8 enc_nonce005[] __initconst = {
+	0x20, 0x1c, 0xaa, 0x5f, 0x9c, 0xbf, 0x92, 0x30
+};
+static const u8 enc_key005[] __initconst = {
+	0x66, 0xca, 0x9c, 0x23, 0x2a, 0x4b, 0x4b, 0x31,
+	0x0e, 0x92, 0x89, 0x8b, 0xf4, 0x93, 0xc7, 0x87,
+	0x98, 0xa3, 0xd8, 0x39, 0xf8, 0xf4, 0xa7, 0x01,
+	0xc0, 0x2e, 0x0a, 0xa6, 0x7e, 0x5a, 0x78, 0x87
+};
+
+static const u8 enc_input006[] __initconst = {
+	0x33, 0x2f, 0x94, 0xc1, 0xa4, 0xef, 0xcc, 0x2a,
+	0x5b, 0xa6, 0xe5, 0x8f, 0x1d, 0x40, 0xf0, 0x92,
+	0x3c, 0xd9, 0x24, 0x11, 0xa9, 0x71, 0xf9, 0x37,
+	0x14, 0x99, 0xfa, 0xbe, 0xe6, 0x80, 0xde, 0x50,
+	0xc9, 0x96, 0xd4, 0xb0, 0xec, 0x9e, 0x17, 0xec,
+	0xd2, 0x5e, 0x72, 0x99, 0xfc, 0x0a, 0xe1, 0xcb,
+	0x48, 0xd2, 0x85, 0xdd, 0x2f, 0x90, 0xe0, 0x66,
+	0x3b, 0xe6, 0x20, 0x74, 0xbe, 0x23, 0x8f, 0xcb,
+	0xb4, 0xe4, 0xda, 0x48, 0x40, 0xa6, 0xd1, 0x1b,
+	0xc7, 0x42, 0xce, 0x2f, 0x0c, 0xa6, 0x85, 0x6e,
+	0x87, 0x37, 0x03, 0xb1, 0x7c, 0x25, 0x96, 0xa3,
+	0x05, 0xd8, 0xb0, 0xf4, 0xed, 0xea, 0xc2, 0xf0,
+	0x31, 0x98, 0x6c, 0xd1, 0x14, 0x25, 0xc0, 0xcb,
+	0x01, 0x74, 0xd0, 0x82, 0xf4, 0x36, 0xf5, 0x41,
+	0xd5, 0xdc, 0xca, 0xc5, 0xbb, 0x98, 0xfe, 0xfc,
+	0x69, 0x21, 0x70, 0xd8, 0xa4, 0x4b, 0xc8, 0xde,
+	0x8f
+};
+static const u8 enc_output006[] __initconst = {
+	0x8b, 0x06, 0xd3, 0x31, 0xb0, 0x93, 0x45, 0xb1,
+	0x75, 0x6e, 0x26, 0xf9, 0x67, 0xbc, 0x90, 0x15,
+	0x81, 0x2c, 0xb5, 0xf0, 0xc6, 0x2b, 0xc7, 0x8c,
+	0x56, 0xd1, 0xbf, 0x69, 0x6c, 0x07, 0xa0, 0xda,
+	0x65, 0x27, 0xc9, 0x90, 0x3d, 0xef, 0x4b, 0x11,
+	0x0f, 0x19, 0x07, 0xfd, 0x29, 0x92, 0xd9, 0xc8,
+	0xf7, 0x99, 0x2e, 0x4a, 0xd0, 0xb8, 0x2c, 0xdc,
+	0x93, 0xf5, 0x9e, 0x33, 0x78, 0xd1, 0x37, 0xc3,
+	0x66, 0xd7, 0x5e, 0xbc, 0x44, 0xbf, 0x53, 0xa5,
+	0xbc, 0xc4, 0xcb, 0x7b, 0x3a, 0x8e, 0x7f, 0x02,
+	0xbd, 0xbb, 0xe7, 0xca, 0xa6, 0x6c, 0x6b, 0x93,
+	0x21, 0x93, 0x10, 0x61, 0xe7, 0x69, 0xd0, 0x78,
+	0xf3, 0x07, 0x5a, 0x1a, 0x8f, 0x73, 0xaa, 0xb1,
+	0x4e, 0xd3, 0xda, 0x4f, 0xf3, 0x32, 0xe1, 0x66,
+	0x3e, 0x6c, 0xc6, 0x13, 0xba, 0x06, 0x5b, 0xfc,
+	0x6a, 0xe5, 0x6f, 0x60, 0xfb, 0x07, 0x40, 0xb0,
+	0x8c, 0x9d, 0x84, 0x43, 0x6b, 0xc1, 0xf7, 0x8d,
+	0x8d, 0x31, 0xf7, 0x7a, 0x39, 0x4d, 0x8f, 0x9a,
+	0xeb
+};
+static const u8 enc_assoc006[] __initconst = {
+	0x70, 0xd3, 0x33, 0xf3, 0x8b, 0x18, 0x0b
+};
+static const u8 enc_nonce006[] __initconst = {
+	0xdf, 0x51, 0x84, 0x82, 0x42, 0x0c, 0x75, 0x9c
+};
+static const u8 enc_key006[] __initconst = {
+	0x68, 0x7b, 0x8d, 0x8e, 0xe3, 0xc4, 0xdd, 0xae,
+	0xdf, 0x72, 0x7f, 0x53, 0x72, 0x25, 0x1e, 0x78,
+	0x91, 0xcb, 0x69, 0x76, 0x1f, 0x49, 0x93, 0xf9,
+	0x6f, 0x21, 0xcc, 0x39, 0x9c, 0xad, 0xb1, 0x01
+};
+
+static const u8 enc_input007[] __initconst = {
+	0x9b, 0x18, 0xdb, 0xdd, 0x9a, 0x0f, 0x3e, 0xa5,
+	0x15, 0x17, 0xde, 0xdf, 0x08, 0x9d, 0x65, 0x0a,
+	0x67, 0x30, 0x12, 0xe2, 0x34, 0x77, 0x4b, 0xc1,
+	0xd9, 0xc6, 0x1f, 0xab, 0xc6, 0x18, 0x50, 0x17,
+	0xa7, 0x9d, 0x3c, 0xa6, 0xc5, 0x35, 0x8c, 0x1c,
+	0xc0, 0xa1, 0x7c, 0x9f, 0x03, 0x89, 0xca, 0xe1,
+	0xe6, 0xe9, 0xd4, 0xd3, 0x88, 0xdb, 0xb4, 0x51,
+	0x9d, 0xec, 0xb4, 0xfc, 0x52, 0xee, 0x6d, 0xf1,
+	0x75, 0x42, 0xc6, 0xfd, 0xbd, 0x7a, 0x8e, 0x86,
+	0xfc, 0x44, 0xb3, 0x4f, 0xf3, 0xea, 0x67, 0x5a,
+	0x41, 0x13, 0xba, 0xb0, 0xdc, 0xe1, 0xd3, 0x2a,
+	0x7c, 0x22, 0xb3, 0xca, 0xac, 0x6a, 0x37, 0x98,
+	0x3e, 0x1d, 0x40, 0x97, 0xf7, 0x9b, 0x1d, 0x36,
+	0x6b, 0xb3, 0x28, 0xbd, 0x60, 0x82, 0x47, 0x34,
+	0xaa, 0x2f, 0x7d, 0xe9, 0xa8, 0x70, 0x81, 0x57,
+	0xd4, 0xb9, 0x77, 0x0a, 0x9d, 0x29, 0xa7, 0x84,
+	0x52, 0x4f, 0xc2, 0x4a, 0x40, 0x3b, 0x3c, 0xd4,
+	0xc9, 0x2a, 0xdb, 0x4a, 0x53, 0xc4, 0xbe, 0x80,
+	0xe9, 0x51, 0x7f, 0x8f, 0xc7, 0xa2, 0xce, 0x82,
+	0x5c, 0x91, 0x1e, 0x74, 0xd9, 0xd0, 0xbd, 0xd5,
+	0xf3, 0xfd, 0xda, 0x4d, 0x25, 0xb4, 0xbb, 0x2d,
+	0xac, 0x2f, 0x3d, 0x71, 0x85, 0x7b, 0xcf, 0x3c,
+	0x7b, 0x3e, 0x0e, 0x22, 0x78, 0x0c, 0x29, 0xbf,
+	0xe4, 0xf4, 0x57, 0xb3, 0xcb, 0x49, 0xa0, 0xfc,
+	0x1e, 0x05, 0x4e, 0x16, 0xbc, 0xd5, 0xa8, 0xa3,
+	0xee, 0x05, 0x35, 0xc6, 0x7c, 0xab, 0x60, 0x14,
+	0x55, 0x1a, 0x8e, 0xc5, 0x88, 0x5d, 0xd5, 0x81,
+	0xc2, 0x81, 0xa5, 0xc4, 0x60, 0xdb, 0xaf, 0x77,
+	0x91, 0xe1, 0xce, 0xa2, 0x7e, 0x7f, 0x42, 0xe3,
+	0xb0, 0x13, 0x1c, 0x1f, 0x25, 0x60, 0x21, 0xe2,
+	0x40, 0x5f, 0x99, 0xb7, 0x73, 0xec, 0x9b, 0x2b,
+	0xf0, 0x65, 0x11, 0xc8, 0xd0, 0x0a, 0x9f, 0xd3
+};
+static const u8 enc_output007[] __initconst = {
+	0x85, 0x04, 0xc2, 0xed, 0x8d, 0xfd, 0x97, 0x5c,
+	0xd2, 0xb7, 0xe2, 0xc1, 0x6b, 0xa3, 0xba, 0xf8,
+	0xc9, 0x50, 0xc3, 0xc6, 0xa5, 0xe3, 0xa4, 0x7c,
+	0xc3, 0x23, 0x49, 0x5e, 0xa9, 0xb9, 0x32, 0xeb,
+	0x8a, 0x7c, 0xca, 0xe5, 0xec, 0xfb, 0x7c, 0xc0,
+	0xcb, 0x7d, 0xdc, 0x2c, 0x9d, 0x92, 0x55, 0x21,
+	0x0a, 0xc8, 0x43, 0x63, 0x59, 0x0a, 0x31, 0x70,
+	0x82, 0x67, 0x41, 0x03, 0xf8, 0xdf, 0xf2, 0xac,
+	0xa7, 0x02, 0xd4, 0xd5, 0x8a, 0x2d, 0xc8, 0x99,
+	0x19, 0x66, 0xd0, 0xf6, 0x88, 0x2c, 0x77, 0xd9,
+	0xd4, 0x0d, 0x6c, 0xbd, 0x98, 0xde, 0xe7, 0x7f,
+	0xad, 0x7e, 0x8a, 0xfb, 0xe9, 0x4b, 0xe5, 0xf7,
+	0xe5, 0x50, 0xa0, 0x90, 0x3f, 0xd6, 0x22, 0x53,
+	0xe3, 0xfe, 0x1b, 0xcc, 0x79, 0x3b, 0xec, 0x12,
+	0x47, 0x52, 0xa7, 0xd6, 0x04, 0xe3, 0x52, 0xe6,
+	0x93, 0x90, 0x91, 0x32, 0x73, 0x79, 0xb8, 0xd0,
+	0x31, 0xde, 0x1f, 0x9f, 0x2f, 0x05, 0x38, 0x54,
+	0x2f, 0x35, 0x04, 0x39, 0xe0, 0xa7, 0xba, 0xc6,
+	0x52, 0xf6, 0x37, 0x65, 0x4c, 0x07, 0xa9, 0x7e,
+	0xb3, 0x21, 0x6f, 0x74, 0x8c, 0xc9, 0xde, 0xdb,
+	0x65, 0x1b, 0x9b, 0xaa, 0x60, 0xb1, 0x03, 0x30,
+	0x6b, 0xb2, 0x03, 0xc4, 0x1c, 0x04, 0xf8, 0x0f,
+	0x64, 0xaf, 0x46, 0xe4, 0x65, 0x99, 0x49, 0xe2,
+	0xea, 0xce, 0x78, 0x00, 0xd8, 0x8b, 0xd5, 0x2e,
+	0xcf, 0xfc, 0x40, 0x49, 0xe8, 0x58, 0xdc, 0x34,
+	0x9c, 0x8c, 0x61, 0xbf, 0x0a, 0x8e, 0xec, 0x39,
+	0xa9, 0x30, 0x05, 0x5a, 0xd2, 0x56, 0x01, 0xc7,
+	0xda, 0x8f, 0x4e, 0xbb, 0x43, 0xa3, 0x3a, 0xf9,
+	0x15, 0x2a, 0xd0, 0xa0, 0x7a, 0x87, 0x34, 0x82,
+	0xfe, 0x8a, 0xd1, 0x2d, 0x5e, 0xc7, 0xbf, 0x04,
+	0x53, 0x5f, 0x3b, 0x36, 0xd4, 0x25, 0x5c, 0x34,
+	0x7a, 0x8d, 0xd5, 0x05, 0xce, 0x72, 0xca, 0xef,
+	0x7a, 0x4b, 0xbc, 0xb0, 0x10, 0x5c, 0x96, 0x42,
+	0x3a, 0x00, 0x98, 0xcd, 0x15, 0xe8, 0xb7, 0x53
+};
+static const u8 enc_assoc007[] __initconst = { };
+static const u8 enc_nonce007[] __initconst = {
+	0xde, 0x7b, 0xef, 0xc3, 0x65, 0x1b, 0x68, 0xb0
+};
+static const u8 enc_key007[] __initconst = {
+	0x8d, 0xb8, 0x91, 0x48, 0xf0, 0xe7, 0x0a, 0xbd,
+	0xf9, 0x3f, 0xcd, 0xd9, 0xa0, 0x1e, 0x42, 0x4c,
+	0xe7, 0xde, 0x25, 0x3d, 0xa3, 0xd7, 0x05, 0x80,
+	0x8d, 0xf2, 0x82, 0xac, 0x44, 0x16, 0x51, 0x01
+};
+
+static const u8 enc_input008[] __initconst = {
+	0xc3, 0x09, 0x94, 0x62, 0xe6, 0x46, 0x2e, 0x10,
+	0xbe, 0x00, 0xe4, 0xfc, 0xf3, 0x40, 0xa3, 0xe2,
+	0x0f, 0xc2, 0x8b, 0x28, 0xdc, 0xba, 0xb4, 0x3c,
+	0xe4, 0x21, 0x58, 0x61, 0xcd, 0x8b, 0xcd, 0xfb,
+	0xac, 0x94, 0xa1, 0x45, 0xf5, 0x1c, 0xe1, 0x12,
+	0xe0, 0x3b, 0x67, 0x21, 0x54, 0x5e, 0x8c, 0xaa,
+	0xcf, 0xdb, 0xb4, 0x51, 0xd4, 0x13, 0xda, 0xe6,
+	0x83, 0x89, 0xb6, 0x92, 0xe9, 0x21, 0x76, 0xa4,
+	0x93, 0x7d, 0x0e, 0xfd, 0x96, 0x36, 0x03, 0x91,
+	0x43, 0x5c, 0x92, 0x49, 0x62, 0x61, 0x7b, 0xeb,
+	0x43, 0x89, 0xb8, 0x12, 0x20, 0x43, 0xd4, 0x47,
+	0x06, 0x84, 0xee, 0x47, 0xe9, 0x8a, 0x73, 0x15,
+	0x0f, 0x72, 0xcf, 0xed, 0xce, 0x96, 0xb2, 0x7f,
+	0x21, 0x45, 0x76, 0xeb, 0x26, 0x28, 0x83, 0x6a,
+	0xad, 0xaa, 0xa6, 0x81, 0xd8, 0x55, 0xb1, 0xa3,
+	0x85, 0xb3, 0x0c, 0xdf, 0xf1, 0x69, 0x2d, 0x97,
+	0x05, 0x2a, 0xbc, 0x7c, 0x7b, 0x25, 0xf8, 0x80,
+	0x9d, 0x39, 0x25, 0xf3, 0x62, 0xf0, 0x66, 0x5e,
+	0xf4, 0xa0, 0xcf, 0xd8, 0xfd, 0x4f, 0xb1, 0x1f,
+	0x60, 0x3a, 0x08, 0x47, 0xaf, 0xe1, 0xf6, 0x10,
+	0x77, 0x09, 0xa7, 0x27, 0x8f, 0x9a, 0x97, 0x5a,
+	0x26, 0xfa, 0xfe, 0x41, 0x32, 0x83, 0x10, 0xe0,
+	0x1d, 0xbf, 0x64, 0x0d, 0xf4, 0x1c, 0x32, 0x35,
+	0xe5, 0x1b, 0x36, 0xef, 0xd4, 0x4a, 0x93, 0x4d,
+	0x00, 0x7c, 0xec, 0x02, 0x07, 0x8b, 0x5d, 0x7d,
+	0x1b, 0x0e, 0xd1, 0xa6, 0xa5, 0x5d, 0x7d, 0x57,
+	0x88, 0xa8, 0xcc, 0x81, 0xb4, 0x86, 0x4e, 0xb4,
+	0x40, 0xe9, 0x1d, 0xc3, 0xb1, 0x24, 0x3e, 0x7f,
+	0xcc, 0x8a, 0x24, 0x9b, 0xdf, 0x6d, 0xf0, 0x39,
+	0x69, 0x3e, 0x4c, 0xc0, 0x96, 0xe4, 0x13, 0xda,
+	0x90, 0xda, 0xf4, 0x95, 0x66, 0x8b, 0x17, 0x17,
+	0xfe, 0x39, 0x43, 0x25, 0xaa, 0xda, 0xa0, 0x43,
+	0x3c, 0xb1, 0x41, 0x02, 0xa3, 0xf0, 0xa7, 0x19,
+	0x59, 0xbc, 0x1d, 0x7d, 0x6c, 0x6d, 0x91, 0x09,
+	0x5c, 0xb7, 0x5b, 0x01, 0xd1, 0x6f, 0x17, 0x21,
+	0x97, 0xbf, 0x89, 0x71, 0xa5, 0xb0, 0x6e, 0x07,
+	0x45, 0xfd, 0x9d, 0xea, 0x07, 0xf6, 0x7a, 0x9f,
+	0x10, 0x18, 0x22, 0x30, 0x73, 0xac, 0xd4, 0x6b,
+	0x72, 0x44, 0xed, 0xd9, 0x19, 0x9b, 0x2d, 0x4a,
+	0x41, 0xdd, 0xd1, 0x85, 0x5e, 0x37, 0x19, 0xed,
+	0xd2, 0x15, 0x8f, 0x5e, 0x91, 0xdb, 0x33, 0xf2,
+	0xe4, 0xdb, 0xff, 0x98, 0xfb, 0xa3, 0xb5, 0xca,
+	0x21, 0x69, 0x08, 0xe7, 0x8a, 0xdf, 0x90, 0xff,
+	0x3e, 0xe9, 0x20, 0x86, 0x3c, 0xe9, 0xfc, 0x0b,
+	0xfe, 0x5c, 0x61, 0xaa, 0x13, 0x92, 0x7f, 0x7b,
+	0xec, 0xe0, 0x6d, 0xa8, 0x23, 0x22, 0xf6, 0x6b,
+	0x77, 0xc4, 0xfe, 0x40, 0x07, 0x3b, 0xb6, 0xf6,
+	0x8e, 0x5f, 0xd4, 0xb9, 0xb7, 0x0f, 0x21, 0x04,
+	0xef, 0x83, 0x63, 0x91, 0x69, 0x40, 0xa3, 0x48,
+	0x5c, 0xd2, 0x60, 0xf9, 0x4f, 0x6c, 0x47, 0x8b,
+	0x3b, 0xb1, 0x9f, 0x8e, 0xee, 0x16, 0x8a, 0x13,
+	0xfc, 0x46, 0x17, 0xc3, 0xc3, 0x32, 0x56, 0xf8,
+	0x3c, 0x85, 0x3a, 0xb6, 0x3e, 0xaa, 0x89, 0x4f,
+	0xb3, 0xdf, 0x38, 0xfd, 0xf1, 0xe4, 0x3a, 0xc0,
+	0xe6, 0x58, 0xb5, 0x8f, 0xc5, 0x29, 0xa2, 0x92,
+	0x4a, 0xb6, 0xa0, 0x34, 0x7f, 0xab, 0xb5, 0x8a,
+	0x90, 0xa1, 0xdb, 0x4d, 0xca, 0xb6, 0x2c, 0x41,
+	0x3c, 0xf7, 0x2b, 0x21, 0xc3, 0xfd, 0xf4, 0x17,
+	0x5c, 0xb5, 0x33, 0x17, 0x68, 0x2b, 0x08, 0x30,
+	0xf3, 0xf7, 0x30, 0x3c, 0x96, 0xe6, 0x6a, 0x20,
+	0x97, 0xe7, 0x4d, 0x10, 0x5f, 0x47, 0x5f, 0x49,
+	0x96, 0x09, 0xf0, 0x27, 0x91, 0xc8, 0xf8, 0x5a,
+	0x2e, 0x79, 0xb5, 0xe2, 0xb8, 0xe8, 0xb9, 0x7b,
+	0xd5, 0x10, 0xcb, 0xff, 0x5d, 0x14, 0x73, 0xf3
+};
+static const u8 enc_output008[] __initconst = {
+	0x14, 0xf6, 0x41, 0x37, 0xa6, 0xd4, 0x27, 0xcd,
+	0xdb, 0x06, 0x3e, 0x9a, 0x4e, 0xab, 0xd5, 0xb1,
+	0x1e, 0x6b, 0xd2, 0xbc, 0x11, 0xf4, 0x28, 0x93,
+	0x63, 0x54, 0xef, 0xbb, 0x5e, 0x1d, 0x3a, 0x1d,
+	0x37, 0x3c, 0x0a, 0x6c, 0x1e, 0xc2, 0xd1, 0x2c,
+	0xb5, 0xa3, 0xb5, 0x7b, 0xb8, 0x8f, 0x25, 0xa6,
+	0x1b, 0x61, 0x1c, 0xec, 0x28, 0x58, 0x26, 0xa4,
+	0xa8, 0x33, 0x28, 0x25, 0x5c, 0x45, 0x05, 0xe5,
+	0x6c, 0x99, 0xe5, 0x45, 0xc4, 0xa2, 0x03, 0x84,
+	0x03, 0x73, 0x1e, 0x8c, 0x49, 0xac, 0x20, 0xdd,
+	0x8d, 0xb3, 0xc4, 0xf5, 0xe7, 0x4f, 0xf1, 0xed,
+	0xa1, 0x98, 0xde, 0xa4, 0x96, 0xdd, 0x2f, 0xab,
+	0xab, 0x97, 0xcf, 0x3e, 0xd2, 0x9e, 0xb8, 0x13,
+	0x07, 0x28, 0x29, 0x19, 0xaf, 0xfd, 0xf2, 0x49,
+	0x43, 0xea, 0x49, 0x26, 0x91, 0xc1, 0x07, 0xd6,
+	0xbb, 0x81, 0x75, 0x35, 0x0d, 0x24, 0x7f, 0xc8,
+	0xda, 0xd4, 0xb7, 0xeb, 0xe8, 0x5c, 0x09, 0xa2,
+	0x2f, 0xdc, 0x28, 0x7d, 0x3a, 0x03, 0xfa, 0x94,
+	0xb5, 0x1d, 0x17, 0x99, 0x36, 0xc3, 0x1c, 0x18,
+	0x34, 0xe3, 0x9f, 0xf5, 0x55, 0x7c, 0xb0, 0x60,
+	0x9d, 0xff, 0xac, 0xd4, 0x61, 0xf2, 0xad, 0xf8,
+	0xce, 0xc7, 0xbe, 0x5c, 0xd2, 0x95, 0xa8, 0x4b,
+	0x77, 0x13, 0x19, 0x59, 0x26, 0xc9, 0xb7, 0x8f,
+	0x6a, 0xcb, 0x2d, 0x37, 0x91, 0xea, 0x92, 0x9c,
+	0x94, 0x5b, 0xda, 0x0b, 0xce, 0xfe, 0x30, 0x20,
+	0xf8, 0x51, 0xad, 0xf2, 0xbe, 0xe7, 0xc7, 0xff,
+	0xb3, 0x33, 0x91, 0x6a, 0xc9, 0x1a, 0x41, 0xc9,
+	0x0f, 0xf3, 0x10, 0x0e, 0xfd, 0x53, 0xff, 0x6c,
+	0x16, 0x52, 0xd9, 0xf3, 0xf7, 0x98, 0x2e, 0xc9,
+	0x07, 0x31, 0x2c, 0x0c, 0x72, 0xd7, 0xc5, 0xc6,
+	0x08, 0x2a, 0x7b, 0xda, 0xbd, 0x7e, 0x02, 0xea,
+	0x1a, 0xbb, 0xf2, 0x04, 0x27, 0x61, 0x28, 0x8e,
+	0xf5, 0x04, 0x03, 0x1f, 0x4c, 0x07, 0x55, 0x82,
+	0xec, 0x1e, 0xd7, 0x8b, 0x2f, 0x65, 0x56, 0xd1,
+	0xd9, 0x1e, 0x3c, 0xe9, 0x1f, 0x5e, 0x98, 0x70,
+	0x38, 0x4a, 0x8c, 0x49, 0xc5, 0x43, 0xa0, 0xa1,
+	0x8b, 0x74, 0x9d, 0x4c, 0x62, 0x0d, 0x10, 0x0c,
+	0xf4, 0x6c, 0x8f, 0xe0, 0xaa, 0x9a, 0x8d, 0xb7,
+	0xe0, 0xbe, 0x4c, 0x87, 0xf1, 0x98, 0x2f, 0xcc,
+	0xed, 0xc0, 0x52, 0x29, 0xdc, 0x83, 0xf8, 0xfc,
+	0x2c, 0x0e, 0xa8, 0x51, 0x4d, 0x80, 0x0d, 0xa3,
+	0xfe, 0xd8, 0x37, 0xe7, 0x41, 0x24, 0xfc, 0xfb,
+	0x75, 0xe3, 0x71, 0x7b, 0x57, 0x45, 0xf5, 0x97,
+	0x73, 0x65, 0x63, 0x14, 0x74, 0xb8, 0x82, 0x9f,
+	0xf8, 0x60, 0x2f, 0x8a, 0xf2, 0x4e, 0xf1, 0x39,
+	0xda, 0x33, 0x91, 0xf8, 0x36, 0xe0, 0x8d, 0x3f,
+	0x1f, 0x3b, 0x56, 0xdc, 0xa0, 0x8f, 0x3c, 0x9d,
+	0x71, 0x52, 0xa7, 0xb8, 0xc0, 0xa5, 0xc6, 0xa2,
+	0x73, 0xda, 0xf4, 0x4b, 0x74, 0x5b, 0x00, 0x3d,
+	0x99, 0xd7, 0x96, 0xba, 0xe6, 0xe1, 0xa6, 0x96,
+	0x38, 0xad, 0xb3, 0xc0, 0xd2, 0xba, 0x91, 0x6b,
+	0xf9, 0x19, 0xdd, 0x3b, 0xbe, 0xbe, 0x9c, 0x20,
+	0x50, 0xba, 0xa1, 0xd0, 0xce, 0x11, 0xbd, 0x95,
+	0xd8, 0xd1, 0xdd, 0x33, 0x85, 0x74, 0xdc, 0xdb,
+	0x66, 0x76, 0x44, 0xdc, 0x03, 0x74, 0x48, 0x35,
+	0x98, 0xb1, 0x18, 0x47, 0x94, 0x7d, 0xff, 0x62,
+	0xe4, 0x58, 0x78, 0xab, 0xed, 0x95, 0x36, 0xd9,
+	0x84, 0x91, 0x82, 0x64, 0x41, 0xbb, 0x58, 0xe6,
+	0x1c, 0x20, 0x6d, 0x15, 0x6b, 0x13, 0x96, 0xe8,
+	0x35, 0x7f, 0xdc, 0x40, 0x2c, 0xe9, 0xbc, 0x8a,
+	0x4f, 0x92, 0xec, 0x06, 0x2d, 0x50, 0xdf, 0x93,
+	0x5d, 0x65, 0x5a, 0xa8, 0xfc, 0x20, 0x50, 0x14,
+	0xa9, 0x8a, 0x7e, 0x1d, 0x08, 0x1f, 0xe2, 0x99,
+	0xd0, 0xbe, 0xfb, 0x3a, 0x21, 0x9d, 0xad, 0x86,
+	0x54, 0xfd, 0x0d, 0x98, 0x1c, 0x5a, 0x6f, 0x1f,
+	0x9a, 0x40, 0xcd, 0xa2, 0xff, 0x6a, 0xf1, 0x54
+};
+static const u8 enc_assoc008[] __initconst = { };
+static const u8 enc_nonce008[] __initconst = {
+	0x0e, 0x0d, 0x57, 0xbb, 0x7b, 0x40, 0x54, 0x02
+};
+static const u8 enc_key008[] __initconst = {
+	0xf2, 0xaa, 0x4f, 0x99, 0xfd, 0x3e, 0xa8, 0x53,
+	0xc1, 0x44, 0xe9, 0x81, 0x18, 0xdc, 0xf5, 0xf0,
+	0x3e, 0x44, 0x15, 0x59, 0xe0, 0xc5, 0x44, 0x86,
+	0xc3, 0x91, 0xa8, 0x75, 0xc0, 0x12, 0x46, 0xba
+};
+
+static const u8 enc_input009[] __initconst = {
+	0xe6, 0xc3, 0xdb, 0x63, 0x55, 0x15, 0xe3, 0x5b,
+	0xb7, 0x4b, 0x27, 0x8b, 0x5a, 0xdd, 0xc2, 0xe8,
+	0x3a, 0x6b, 0xd7, 0x81, 0x96, 0x35, 0x97, 0xca,
+	0xd7, 0x68, 0xe8, 0xef, 0xce, 0xab, 0xda, 0x09,
+	0x6e, 0xd6, 0x8e, 0xcb, 0x55, 0xb5, 0xe1, 0xe5,
+	0x57, 0xfd, 0xc4, 0xe3, 0xe0, 0x18, 0x4f, 0x85,
+	0xf5, 0x3f, 0x7e, 0x4b, 0x88, 0xc9, 0x52, 0x44,
+	0x0f, 0xea, 0xaf, 0x1f, 0x71, 0x48, 0x9f, 0x97,
+	0x6d, 0xb9, 0x6f, 0x00, 0xa6, 0xde, 0x2b, 0x77,
+	0x8b, 0x15, 0xad, 0x10, 0xa0, 0x2b, 0x7b, 0x41,
+	0x90, 0x03, 0x2d, 0x69, 0xae, 0xcc, 0x77, 0x7c,
+	0xa5, 0x9d, 0x29, 0x22, 0xc2, 0xea, 0xb4, 0x00,
+	0x1a, 0xd2, 0x7a, 0x98, 0x8a, 0xf9, 0xf7, 0x82,
+	0xb0, 0xab, 0xd8, 0xa6, 0x94, 0x8d, 0x58, 0x2f,
+	0x01, 0x9e, 0x00, 0x20, 0xfc, 0x49, 0xdc, 0x0e,
+	0x03, 0xe8, 0x45, 0x10, 0xd6, 0xa8, 0xda, 0x55,
+	0x10, 0x9a, 0xdf, 0x67, 0x22, 0x8b, 0x43, 0xab,
+	0x00, 0xbb, 0x02, 0xc8, 0xdd, 0x7b, 0x97, 0x17,
+	0xd7, 0x1d, 0x9e, 0x02, 0x5e, 0x48, 0xde, 0x8e,
+	0xcf, 0x99, 0x07, 0x95, 0x92, 0x3c, 0x5f, 0x9f,
+	0xc5, 0x8a, 0xc0, 0x23, 0xaa, 0xd5, 0x8c, 0x82,
+	0x6e, 0x16, 0x92, 0xb1, 0x12, 0x17, 0x07, 0xc3,
+	0xfb, 0x36, 0xf5, 0x6c, 0x35, 0xd6, 0x06, 0x1f,
+	0x9f, 0xa7, 0x94, 0xa2, 0x38, 0x63, 0x9c, 0xb0,
+	0x71, 0xb3, 0xa5, 0xd2, 0xd8, 0xba, 0x9f, 0x08,
+	0x01, 0xb3, 0xff, 0x04, 0x97, 0x73, 0x45, 0x1b,
+	0xd5, 0xa9, 0x9c, 0x80, 0xaf, 0x04, 0x9a, 0x85,
+	0xdb, 0x32, 0x5b, 0x5d, 0x1a, 0xc1, 0x36, 0x28,
+	0x10, 0x79, 0xf1, 0x3c, 0xbf, 0x1a, 0x41, 0x5c,
+	0x4e, 0xdf, 0xb2, 0x7c, 0x79, 0x3b, 0x7a, 0x62,
+	0x3d, 0x4b, 0xc9, 0x9b, 0x2a, 0x2e, 0x7c, 0xa2,
+	0xb1, 0x11, 0x98, 0xa7, 0x34, 0x1a, 0x00, 0xf3,
+	0xd1, 0xbc, 0x18, 0x22, 0xba, 0x02, 0x56, 0x62,
+	0x31, 0x10, 0x11, 0x6d, 0xe0, 0x54, 0x9d, 0x40,
+	0x1f, 0x26, 0x80, 0x41, 0xca, 0x3f, 0x68, 0x0f,
+	0x32, 0x1d, 0x0a, 0x8e, 0x79, 0xd8, 0xa4, 0x1b,
+	0x29, 0x1c, 0x90, 0x8e, 0xc5, 0xe3, 0xb4, 0x91,
+	0x37, 0x9a, 0x97, 0x86, 0x99, 0xd5, 0x09, 0xc5,
+	0xbb, 0xa3, 0x3f, 0x21, 0x29, 0x82, 0x14, 0x5c,
+	0xab, 0x25, 0xfb, 0xf2, 0x4f, 0x58, 0x26, 0xd4,
+	0x83, 0xaa, 0x66, 0x89, 0x67, 0x7e, 0xc0, 0x49,
+	0xe1, 0x11, 0x10, 0x7f, 0x7a, 0xda, 0x29, 0x04,
+	0xff, 0xf0, 0xcb, 0x09, 0x7c, 0x9d, 0xfa, 0x03,
+	0x6f, 0x81, 0x09, 0x31, 0x60, 0xfb, 0x08, 0xfa,
+	0x74, 0xd3, 0x64, 0x44, 0x7c, 0x55, 0x85, 0xec,
+	0x9c, 0x6e, 0x25, 0xb7, 0x6c, 0xc5, 0x37, 0xb6,
+	0x83, 0x87, 0x72, 0x95, 0x8b, 0x9d, 0xe1, 0x69,
+	0x5c, 0x31, 0x95, 0x42, 0xa6, 0x2c, 0xd1, 0x36,
+	0x47, 0x1f, 0xec, 0x54, 0xab, 0xa2, 0x1c, 0xd8,
+	0x00, 0xcc, 0xbc, 0x0d, 0x65, 0xe2, 0x67, 0xbf,
+	0xbc, 0xea, 0xee, 0x9e, 0xe4, 0x36, 0x95, 0xbe,
+	0x73, 0xd9, 0xa6, 0xd9, 0x0f, 0xa0, 0xcc, 0x82,
+	0x76, 0x26, 0xad, 0x5b, 0x58, 0x6c, 0x4e, 0xab,
+	0x29, 0x64, 0xd3, 0xd9, 0xa9, 0x08, 0x8c, 0x1d,
+	0xa1, 0x4f, 0x80, 0xd8, 0x3f, 0x94, 0xfb, 0xd3,
+	0x7b, 0xfc, 0xd1, 0x2b, 0xc3, 0x21, 0xeb, 0xe5,
+	0x1c, 0x84, 0x23, 0x7f, 0x4b, 0xfa, 0xdb, 0x34,
+	0x18, 0xa2, 0xc2, 0xe5, 0x13, 0xfe, 0x6c, 0x49,
+	0x81, 0xd2, 0x73, 0xe7, 0xe2, 0xd7, 0xe4, 0x4f,
+	0x4b, 0x08, 0x6e, 0xb1, 0x12, 0x22, 0x10, 0x9d,
+	0xac, 0x51, 0x1e, 0x17, 0xd9, 0x8a, 0x0b, 0x42,
+	0x88, 0x16, 0x81, 0x37, 0x7c, 0x6a, 0xf7, 0xef,
+	0x2d, 0xe3, 0xd9, 0xf8, 0x5f, 0xe0, 0x53, 0x27,
+	0x74, 0xb9, 0xe2, 0xd6, 0x1c, 0x80, 0x2c, 0x52,
+	0x65
+};
+static const u8 enc_output009[] __initconst = {
+	0xfd, 0x81, 0x8d, 0xd0, 0x3d, 0xb4, 0xd5, 0xdf,
+	0xd3, 0x42, 0x47, 0x5a, 0x6d, 0x19, 0x27, 0x66,
+	0x4b, 0x2e, 0x0c, 0x27, 0x9c, 0x96, 0x4c, 0x72,
+	0x02, 0xa3, 0x65, 0xc3, 0xb3, 0x6f, 0x2e, 0xbd,
+	0x63, 0x8a, 0x4a, 0x5d, 0x29, 0xa2, 0xd0, 0x28,
+	0x48, 0xc5, 0x3d, 0x98, 0xa3, 0xbc, 0xe0, 0xbe,
+	0x3b, 0x3f, 0xe6, 0x8a, 0xa4, 0x7f, 0x53, 0x06,
+	0xfa, 0x7f, 0x27, 0x76, 0x72, 0x31, 0xa1, 0xf5,
+	0xd6, 0x0c, 0x52, 0x47, 0xba, 0xcd, 0x4f, 0xd7,
+	0xeb, 0x05, 0x48, 0x0d, 0x7c, 0x35, 0x4a, 0x09,
+	0xc9, 0x76, 0x71, 0x02, 0xa3, 0xfb, 0xb7, 0x1a,
+	0x65, 0xb7, 0xed, 0x98, 0xc6, 0x30, 0x8a, 0x00,
+	0xae, 0xa1, 0x31, 0xe5, 0xb5, 0x9e, 0x6d, 0x62,
+	0xda, 0xda, 0x07, 0x0f, 0x38, 0x38, 0xd3, 0xcb,
+	0xc1, 0xb0, 0xad, 0xec, 0x72, 0xec, 0xb1, 0xa2,
+	0x7b, 0x59, 0xf3, 0x3d, 0x2b, 0xef, 0xcd, 0x28,
+	0x5b, 0x83, 0xcc, 0x18, 0x91, 0x88, 0xb0, 0x2e,
+	0xf9, 0x29, 0x31, 0x18, 0xf9, 0x4e, 0xe9, 0x0a,
+	0x91, 0x92, 0x9f, 0xae, 0x2d, 0xad, 0xf4, 0xe6,
+	0x1a, 0xe2, 0xa4, 0xee, 0x47, 0x15, 0xbf, 0x83,
+	0x6e, 0xd7, 0x72, 0x12, 0x3b, 0x2d, 0x24, 0xe9,
+	0xb2, 0x55, 0xcb, 0x3c, 0x10, 0xf0, 0x24, 0x8a,
+	0x4a, 0x02, 0xea, 0x90, 0x25, 0xf0, 0xb4, 0x79,
+	0x3a, 0xef, 0x6e, 0xf5, 0x52, 0xdf, 0xb0, 0x0a,
+	0xcd, 0x24, 0x1c, 0xd3, 0x2e, 0x22, 0x74, 0xea,
+	0x21, 0x6f, 0xe9, 0xbd, 0xc8, 0x3e, 0x36, 0x5b,
+	0x19, 0xf1, 0xca, 0x99, 0x0a, 0xb4, 0xa7, 0x52,
+	0x1a, 0x4e, 0xf2, 0xad, 0x8d, 0x56, 0x85, 0xbb,
+	0x64, 0x89, 0xba, 0x26, 0xf9, 0xc7, 0xe1, 0x89,
+	0x19, 0x22, 0x77, 0xc3, 0xa8, 0xfc, 0xff, 0xad,
+	0xfe, 0xb9, 0x48, 0xae, 0x12, 0x30, 0x9f, 0x19,
+	0xfb, 0x1b, 0xef, 0x14, 0x87, 0x8a, 0x78, 0x71,
+	0xf3, 0xf4, 0xb7, 0x00, 0x9c, 0x1d, 0xb5, 0x3d,
+	0x49, 0x00, 0x0c, 0x06, 0xd4, 0x50, 0xf9, 0x54,
+	0x45, 0xb2, 0x5b, 0x43, 0xdb, 0x6d, 0xcf, 0x1a,
+	0xe9, 0x7a, 0x7a, 0xcf, 0xfc, 0x8a, 0x4e, 0x4d,
+	0x0b, 0x07, 0x63, 0x28, 0xd8, 0xe7, 0x08, 0x95,
+	0xdf, 0xa6, 0x72, 0x93, 0x2e, 0xbb, 0xa0, 0x42,
+	0x89, 0x16, 0xf1, 0xd9, 0x0c, 0xf9, 0xa1, 0x16,
+	0xfd, 0xd9, 0x03, 0xb4, 0x3b, 0x8a, 0xf5, 0xf6,
+	0xe7, 0x6b, 0x2e, 0x8e, 0x4c, 0x3d, 0xe2, 0xaf,
+	0x08, 0x45, 0x03, 0xff, 0x09, 0xb6, 0xeb, 0x2d,
+	0xc6, 0x1b, 0x88, 0x94, 0xac, 0x3e, 0xf1, 0x9f,
+	0x0e, 0x0e, 0x2b, 0xd5, 0x00, 0x4d, 0x3f, 0x3b,
+	0x53, 0xae, 0xaf, 0x1c, 0x33, 0x5f, 0x55, 0x6e,
+	0x8d, 0xaf, 0x05, 0x7a, 0x10, 0x34, 0xc9, 0xf4,
+	0x66, 0xcb, 0x62, 0x12, 0xa6, 0xee, 0xe8, 0x1c,
+	0x5d, 0x12, 0x86, 0xdb, 0x6f, 0x1c, 0x33, 0xc4,
+	0x1c, 0xda, 0x82, 0x2d, 0x3b, 0x59, 0xfe, 0xb1,
+	0xa4, 0x59, 0x41, 0x86, 0xd0, 0xef, 0xae, 0xfb,
+	0xda, 0x6d, 0x11, 0xb8, 0xca, 0xe9, 0x6e, 0xff,
+	0xf7, 0xa9, 0xd9, 0x70, 0x30, 0xfc, 0x53, 0xe2,
+	0xd7, 0xa2, 0x4e, 0xc7, 0x91, 0xd9, 0x07, 0x06,
+	0xaa, 0xdd, 0xb0, 0x59, 0x28, 0x1d, 0x00, 0x66,
+	0xc5, 0x54, 0xc2, 0xfc, 0x06, 0xda, 0x05, 0x90,
+	0x52, 0x1d, 0x37, 0x66, 0xee, 0xf0, 0xb2, 0x55,
+	0x8a, 0x5d, 0xd2, 0x38, 0x86, 0x94, 0x9b, 0xfc,
+	0x10, 0x4c, 0xa1, 0xb9, 0x64, 0x3e, 0x44, 0xb8,
+	0x5f, 0xb0, 0x0c, 0xec, 0xe0, 0xc9, 0xe5, 0x62,
+	0x75, 0x3f, 0x09, 0xd5, 0xf5, 0xd9, 0x26, 0xba,
+	0x9e, 0xd2, 0xf4, 0xb9, 0x48, 0x0a, 0xbc, 0xa2,
+	0xd6, 0x7c, 0x36, 0x11, 0x7d, 0x26, 0x81, 0x89,
+	0xcf, 0xa4, 0xad, 0x73, 0x0e, 0xee, 0xcc, 0x06,
+	0xa9, 0xdb, 0xb1, 0xfd, 0xfb, 0x09, 0x7f, 0x90,
+	0x42, 0x37, 0x2f, 0xe1, 0x9c, 0x0f, 0x6f, 0xcf,
+	0x43, 0xb5, 0xd9, 0x90, 0xe1, 0x85, 0xf5, 0xa8,
+	0xae
+};
+static const u8 enc_assoc009[] __initconst = {
+	0x5a, 0x27, 0xff, 0xeb, 0xdf, 0x84, 0xb2, 0x9e,
+	0xef
+};
+static const u8 enc_nonce009[] __initconst = {
+	0xef, 0x2d, 0x63, 0xee, 0x6b, 0x80, 0x8b, 0x78
+};
+static const u8 enc_key009[] __initconst = {
+	0xea, 0xbc, 0x56, 0x99, 0xe3, 0x50, 0xff, 0xc5,
+	0xcc, 0x1a, 0xd7, 0xc1, 0x57, 0x72, 0xea, 0x86,
+	0x5b, 0x89, 0x88, 0x61, 0x3d, 0x2f, 0x9b, 0xb2,
+	0xe7, 0x9c, 0xec, 0x74, 0x6e, 0x3e, 0xf4, 0x3b
+};
+
+static const u8 enc_input010[] __initconst = {
+	0x42, 0x93, 0xe4, 0xeb, 0x97, 0xb0, 0x57, 0xbf,
+	0x1a, 0x8b, 0x1f, 0xe4, 0x5f, 0x36, 0x20, 0x3c,
+	0xef, 0x0a, 0xa9, 0x48, 0x5f, 0x5f, 0x37, 0x22,
+	0x3a, 0xde, 0xe3, 0xae, 0xbe, 0xad, 0x07, 0xcc,
+	0xb1, 0xf6, 0xf5, 0xf9, 0x56, 0xdd, 0xe7, 0x16,
+	0x1e, 0x7f, 0xdf, 0x7a, 0x9e, 0x75, 0xb7, 0xc7,
+	0xbe, 0xbe, 0x8a, 0x36, 0x04, 0xc0, 0x10, 0xf4,
+	0x95, 0x20, 0x03, 0xec, 0xdc, 0x05, 0xa1, 0x7d,
+	0xc4, 0xa9, 0x2c, 0x82, 0xd0, 0xbc, 0x8b, 0xc5,
+	0xc7, 0x45, 0x50, 0xf6, 0xa2, 0x1a, 0xb5, 0x46,
+	0x3b, 0x73, 0x02, 0xa6, 0x83, 0x4b, 0x73, 0x82,
+	0x58, 0x5e, 0x3b, 0x65, 0x2f, 0x0e, 0xfd, 0x2b,
+	0x59, 0x16, 0xce, 0xa1, 0x60, 0x9c, 0xe8, 0x3a,
+	0x99, 0xed, 0x8d, 0x5a, 0xcf, 0xf6, 0x83, 0xaf,
+	0xba, 0xd7, 0x73, 0x73, 0x40, 0x97, 0x3d, 0xca,
+	0xef, 0x07, 0x57, 0xe6, 0xd9, 0x70, 0x0e, 0x95,
+	0xae, 0xa6, 0x8d, 0x04, 0xcc, 0xee, 0xf7, 0x09,
+	0x31, 0x77, 0x12, 0xa3, 0x23, 0x97, 0x62, 0xb3,
+	0x7b, 0x32, 0xfb, 0x80, 0x14, 0x48, 0x81, 0xc3,
+	0xe5, 0xea, 0x91, 0x39, 0x52, 0x81, 0xa2, 0x4f,
+	0xe4, 0xb3, 0x09, 0xff, 0xde, 0x5e, 0xe9, 0x58,
+	0x84, 0x6e, 0xf9, 0x3d, 0xdf, 0x25, 0xea, 0xad,
+	0xae, 0xe6, 0x9a, 0xd1, 0x89, 0x55, 0xd3, 0xde,
+	0x6c, 0x52, 0xdb, 0x70, 0xfe, 0x37, 0xce, 0x44,
+	0x0a, 0xa8, 0x25, 0x5f, 0x92, 0xc1, 0x33, 0x4a,
+	0x4f, 0x9b, 0x62, 0x35, 0xff, 0xce, 0xc0, 0xa9,
+	0x60, 0xce, 0x52, 0x00, 0x97, 0x51, 0x35, 0x26,
+	0x2e, 0xb9, 0x36, 0xa9, 0x87, 0x6e, 0x1e, 0xcc,
+	0x91, 0x78, 0x53, 0x98, 0x86, 0x5b, 0x9c, 0x74,
+	0x7d, 0x88, 0x33, 0xe1, 0xdf, 0x37, 0x69, 0x2b,
+	0xbb, 0xf1, 0x4d, 0xf4, 0xd1, 0xf1, 0x39, 0x93,
+	0x17, 0x51, 0x19, 0xe3, 0x19, 0x1e, 0x76, 0x37,
+	0x25, 0xfb, 0x09, 0x27, 0x6a, 0xab, 0x67, 0x6f,
+	0x14, 0x12, 0x64, 0xe7, 0xc4, 0x07, 0xdf, 0x4d,
+	0x17, 0xbb, 0x6d, 0xe0, 0xe9, 0xb9, 0xab, 0xca,
+	0x10, 0x68, 0xaf, 0x7e, 0xb7, 0x33, 0x54, 0x73,
+	0x07, 0x6e, 0xf7, 0x81, 0x97, 0x9c, 0x05, 0x6f,
+	0x84, 0x5f, 0xd2, 0x42, 0xfb, 0x38, 0xcf, 0xd1,
+	0x2f, 0x14, 0x30, 0x88, 0x98, 0x4d, 0x5a, 0xa9,
+	0x76, 0xd5, 0x4f, 0x3e, 0x70, 0x6c, 0x85, 0x76,
+	0xd7, 0x01, 0xa0, 0x1a, 0xc8, 0x4e, 0xaa, 0xac,
+	0x78, 0xfe, 0x46, 0xde, 0x6a, 0x05, 0x46, 0xa7,
+	0x43, 0x0c, 0xb9, 0xde, 0xb9, 0x68, 0xfb, 0xce,
+	0x42, 0x99, 0x07, 0x4d, 0x0b, 0x3b, 0x5a, 0x30,
+	0x35, 0xa8, 0xf9, 0x3a, 0x73, 0xef, 0x0f, 0xdb,
+	0x1e, 0x16, 0x42, 0xc4, 0xba, 0xae, 0x58, 0xaa,
+	0xf8, 0xe5, 0x75, 0x2f, 0x1b, 0x15, 0x5c, 0xfd,
+	0x0a, 0x97, 0xd0, 0xe4, 0x37, 0x83, 0x61, 0x5f,
+	0x43, 0xa6, 0xc7, 0x3f, 0x38, 0x59, 0xe6, 0xeb,
+	0xa3, 0x90, 0xc3, 0xaa, 0xaa, 0x5a, 0xd3, 0x34,
+	0xd4, 0x17, 0xc8, 0x65, 0x3e, 0x57, 0xbc, 0x5e,
+	0xdd, 0x9e, 0xb7, 0xf0, 0x2e, 0x5b, 0xb2, 0x1f,
+	0x8a, 0x08, 0x0d, 0x45, 0x91, 0x0b, 0x29, 0x53,
+	0x4f, 0x4c, 0x5a, 0x73, 0x56, 0xfe, 0xaf, 0x41,
+	0x01, 0x39, 0x0a, 0x24, 0x3c, 0x7e, 0xbe, 0x4e,
+	0x53, 0xf3, 0xeb, 0x06, 0x66, 0x51, 0x28, 0x1d,
+	0xbd, 0x41, 0x0a, 0x01, 0xab, 0x16, 0x47, 0x27,
+	0x47, 0x47, 0xf7, 0xcb, 0x46, 0x0a, 0x70, 0x9e,
+	0x01, 0x9c, 0x09, 0xe1, 0x2a, 0x00, 0x1a, 0xd8,
+	0xd4, 0x79, 0x9d, 0x80, 0x15, 0x8e, 0x53, 0x2a,
+	0x65, 0x83, 0x78, 0x3e, 0x03, 0x00, 0x07, 0x12,
+	0x1f, 0x33, 0x3e, 0x7b, 0x13, 0x37, 0xf1, 0xc3,
+	0xef, 0xb7, 0xc1, 0x20, 0x3c, 0x3e, 0x67, 0x66,
+	0x5d, 0x88, 0xa7, 0x7d, 0x33, 0x50, 0x77, 0xb0,
+	0x28, 0x8e, 0xe7, 0x2c, 0x2e, 0x7a, 0xf4, 0x3c,
+	0x8d, 0x74, 0x83, 0xaf, 0x8e, 0x87, 0x0f, 0xe4,
+	0x50, 0xff, 0x84, 0x5c, 0x47, 0x0c, 0x6a, 0x49,
+	0xbf, 0x42, 0x86, 0x77, 0x15, 0x48, 0xa5, 0x90,
+	0x5d, 0x93, 0xd6, 0x2a, 0x11, 0xd5, 0xd5, 0x11,
+	0xaa, 0xce, 0xe7, 0x6f, 0xa5, 0xb0, 0x09, 0x2c,
+	0x8d, 0xd3, 0x92, 0xf0, 0x5a, 0x2a, 0xda, 0x5b,
+	0x1e, 0xd5, 0x9a, 0xc4, 0xc4, 0xf3, 0x49, 0x74,
+	0x41, 0xca, 0xe8, 0xc1, 0xf8, 0x44, 0xd6, 0x3c,
+	0xae, 0x6c, 0x1d, 0x9a, 0x30, 0x04, 0x4d, 0x27,
+	0x0e, 0xb1, 0x5f, 0x59, 0xa2, 0x24, 0xe8, 0xe1,
+	0x98, 0xc5, 0x6a, 0x4c, 0xfe, 0x41, 0xd2, 0x27,
+	0x42, 0x52, 0xe1, 0xe9, 0x7d, 0x62, 0xe4, 0x88,
+	0x0f, 0xad, 0xb2, 0x70, 0xcb, 0x9d, 0x4c, 0x27,
+	0x2e, 0x76, 0x1e, 0x1a, 0x63, 0x65, 0xf5, 0x3b,
+	0xf8, 0x57, 0x69, 0xeb, 0x5b, 0x38, 0x26, 0x39,
+	0x33, 0x25, 0x45, 0x3e, 0x91, 0xb8, 0xd8, 0xc7,
+	0xd5, 0x42, 0xc0, 0x22, 0x31, 0x74, 0xf4, 0xbc,
+	0x0c, 0x23, 0xf1, 0xca, 0xc1, 0x8d, 0xd7, 0xbe,
+	0xc9, 0x62, 0xe4, 0x08, 0x1a, 0xcf, 0x36, 0xd5,
+	0xfe, 0x55, 0x21, 0x59, 0x91, 0x87, 0x87, 0xdf,
+	0x06, 0xdb, 0xdf, 0x96, 0x45, 0x58, 0xda, 0x05,
+	0xcd, 0x50, 0x4d, 0xd2, 0x7d, 0x05, 0x18, 0x73,
+	0x6a, 0x8d, 0x11, 0x85, 0xa6, 0x88, 0xe8, 0xda,
+	0xe6, 0x30, 0x33, 0xa4, 0x89, 0x31, 0x75, 0xbe,
+	0x69, 0x43, 0x84, 0x43, 0x50, 0x87, 0xdd, 0x71,
+	0x36, 0x83, 0xc3, 0x78, 0x74, 0x24, 0x0a, 0xed,
+	0x7b, 0xdb, 0xa4, 0x24, 0x0b, 0xb9, 0x7e, 0x5d,
+	0xff, 0xde, 0xb1, 0xef, 0x61, 0x5a, 0x45, 0x33,
+	0xf6, 0x17, 0x07, 0x08, 0x98, 0x83, 0x92, 0x0f,
+	0x23, 0x6d, 0xe6, 0xaa, 0x17, 0x54, 0xad, 0x6a,
+	0xc8, 0xdb, 0x26, 0xbe, 0xb8, 0xb6, 0x08, 0xfa,
+	0x68, 0xf1, 0xd7, 0x79, 0x6f, 0x18, 0xb4, 0x9e,
+	0x2d, 0x3f, 0x1b, 0x64, 0xaf, 0x8d, 0x06, 0x0e,
+	0x49, 0x28, 0xe0, 0x5d, 0x45, 0x68, 0x13, 0x87,
+	0xfa, 0xde, 0x40, 0x7b, 0xd2, 0xc3, 0x94, 0xd5,
+	0xe1, 0xd9, 0xc2, 0xaf, 0x55, 0x89, 0xeb, 0xb4,
+	0x12, 0x59, 0xa8, 0xd4, 0xc5, 0x29, 0x66, 0x38,
+	0xe6, 0xac, 0x22, 0x22, 0xd9, 0x64, 0x9b, 0x34,
+	0x0a, 0x32, 0x9f, 0xc2, 0xbf, 0x17, 0x6c, 0x3f,
+	0x71, 0x7a, 0x38, 0x6b, 0x98, 0xfb, 0x49, 0x36,
+	0x89, 0xc9, 0xe2, 0xd6, 0xc7, 0x5d, 0xd0, 0x69,
+	0x5f, 0x23, 0x35, 0xc9, 0x30, 0xe2, 0xfd, 0x44,
+	0x58, 0x39, 0xd7, 0x97, 0xfb, 0x5c, 0x00, 0xd5,
+	0x4f, 0x7a, 0x1a, 0x95, 0x8b, 0x62, 0x4b, 0xce,
+	0xe5, 0x91, 0x21, 0x7b, 0x30, 0x00, 0xd6, 0xdd,
+	0x6d, 0x02, 0x86, 0x49, 0x0f, 0x3c, 0x1a, 0x27,
+	0x3c, 0xd3, 0x0e, 0x71, 0xf2, 0xff, 0xf5, 0x2f,
+	0x87, 0xac, 0x67, 0x59, 0x81, 0xa3, 0xf7, 0xf8,
+	0xd6, 0x11, 0x0c, 0x84, 0xa9, 0x03, 0xee, 0x2a,
+	0xc4, 0xf3, 0x22, 0xab, 0x7c, 0xe2, 0x25, 0xf5,
+	0x67, 0xa3, 0xe4, 0x11, 0xe0, 0x59, 0xb3, 0xca,
+	0x87, 0xa0, 0xae, 0xc9, 0xa6, 0x62, 0x1b, 0x6e,
+	0x4d, 0x02, 0x6b, 0x07, 0x9d, 0xfd, 0xd0, 0x92,
+	0x06, 0xe1, 0xb2, 0x9a, 0x4a, 0x1f, 0x1f, 0x13,
+	0x49, 0x99, 0x97, 0x08, 0xde, 0x7f, 0x98, 0xaf,
+	0x51, 0x98, 0xee, 0x2c, 0xcb, 0xf0, 0x0b, 0xc6,
+	0xb6, 0xb7, 0x2d, 0x9a, 0xb1, 0xac, 0xa6, 0xe3,
+	0x15, 0x77, 0x9d, 0x6b, 0x1a, 0xe4, 0xfc, 0x8b,
+	0xf2, 0x17, 0x59, 0x08, 0x04, 0x58, 0x81, 0x9d,
+	0x1b, 0x1b, 0x69, 0x55, 0xc2, 0xb4, 0x3c, 0x1f,
+	0x50, 0xf1, 0x7f, 0x77, 0x90, 0x4c, 0x66, 0x40,
+	0x5a, 0xc0, 0x33, 0x1f, 0xcb, 0x05, 0x6d, 0x5c,
+	0x06, 0x87, 0x52, 0xa2, 0x8f, 0x26, 0xd5, 0x4f
+};
+static const u8 enc_output010[] __initconst = {
+	0xe5, 0x26, 0xa4, 0x3d, 0xbd, 0x33, 0xd0, 0x4b,
+	0x6f, 0x05, 0xa7, 0x6e, 0x12, 0x7a, 0xd2, 0x74,
+	0xa6, 0xdd, 0xbd, 0x95, 0xeb, 0xf9, 0xa4, 0xf1,
+	0x59, 0x93, 0x91, 0x70, 0xd9, 0xfe, 0x9a, 0xcd,
+	0x53, 0x1f, 0x3a, 0xab, 0xa6, 0x7c, 0x9f, 0xa6,
+	0x9e, 0xbd, 0x99, 0xd9, 0xb5, 0x97, 0x44, 0xd5,
+	0x14, 0x48, 0x4d, 0x9d, 0xc0, 0xd0, 0x05, 0x96,
+	0xeb, 0x4c, 0x78, 0x55, 0x09, 0x08, 0x01, 0x02,
+	0x30, 0x90, 0x7b, 0x96, 0x7a, 0x7b, 0x5f, 0x30,
+	0x41, 0x24, 0xce, 0x68, 0x61, 0x49, 0x86, 0x57,
+	0x82, 0xdd, 0x53, 0x1c, 0x51, 0x28, 0x2b, 0x53,
+	0x6e, 0x2d, 0xc2, 0x20, 0x4c, 0xdd, 0x8f, 0x65,
+	0x10, 0x20, 0x50, 0xdd, 0x9d, 0x50, 0xe5, 0x71,
+	0x40, 0x53, 0x69, 0xfc, 0x77, 0x48, 0x11, 0xb9,
+	0xde, 0xa4, 0x8d, 0x58, 0xe4, 0xa6, 0x1a, 0x18,
+	0x47, 0x81, 0x7e, 0xfc, 0xdd, 0xf6, 0xef, 0xce,
+	0x2f, 0x43, 0x68, 0xd6, 0x06, 0xe2, 0x74, 0x6a,
+	0xad, 0x90, 0xf5, 0x37, 0xf3, 0x3d, 0x82, 0x69,
+	0x40, 0xe9, 0x6b, 0xa7, 0x3d, 0xa8, 0x1e, 0xd2,
+	0x02, 0x7c, 0xb7, 0x9b, 0xe4, 0xda, 0x8f, 0x95,
+	0x06, 0xc5, 0xdf, 0x73, 0xa3, 0x20, 0x9a, 0x49,
+	0xde, 0x9c, 0xbc, 0xee, 0x14, 0x3f, 0x81, 0x5e,
+	0xf8, 0x3b, 0x59, 0x3c, 0xe1, 0x68, 0x12, 0x5a,
+	0x3a, 0x76, 0x3a, 0x3f, 0xf7, 0x87, 0x33, 0x0a,
+	0x01, 0xb8, 0xd4, 0xed, 0xb6, 0xbe, 0x94, 0x5e,
+	0x70, 0x40, 0x56, 0x67, 0x1f, 0x50, 0x44, 0x19,
+	0xce, 0x82, 0x70, 0x10, 0x87, 0x13, 0x20, 0x0b,
+	0x4c, 0x5a, 0xb6, 0xf6, 0xa7, 0xae, 0x81, 0x75,
+	0x01, 0x81, 0xe6, 0x4b, 0x57, 0x7c, 0xdd, 0x6d,
+	0xf8, 0x1c, 0x29, 0x32, 0xf7, 0xda, 0x3c, 0x2d,
+	0xf8, 0x9b, 0x25, 0x6e, 0x00, 0xb4, 0xf7, 0x2f,
+	0xf7, 0x04, 0xf7, 0xa1, 0x56, 0xac, 0x4f, 0x1a,
+	0x64, 0xb8, 0x47, 0x55, 0x18, 0x7b, 0x07, 0x4d,
+	0xbd, 0x47, 0x24, 0x80, 0x5d, 0xa2, 0x70, 0xc5,
+	0xdd, 0x8e, 0x82, 0xd4, 0xeb, 0xec, 0xb2, 0x0c,
+	0x39, 0xd2, 0x97, 0xc1, 0xcb, 0xeb, 0xf4, 0x77,
+	0x59, 0xb4, 0x87, 0xef, 0xcb, 0x43, 0x2d, 0x46,
+	0x54, 0xd1, 0xa7, 0xd7, 0x15, 0x99, 0x0a, 0x43,
+	0xa1, 0xe0, 0x99, 0x33, 0x71, 0xc1, 0xed, 0xfe,
+	0x72, 0x46, 0x33, 0x8e, 0x91, 0x08, 0x9f, 0xc8,
+	0x2e, 0xca, 0xfa, 0xdc, 0x59, 0xd5, 0xc3, 0x76,
+	0x84, 0x9f, 0xa3, 0x37, 0x68, 0xc3, 0xf0, 0x47,
+	0x2c, 0x68, 0xdb, 0x5e, 0xc3, 0x49, 0x4c, 0xe8,
+	0x92, 0x85, 0xe2, 0x23, 0xd3, 0x3f, 0xad, 0x32,
+	0xe5, 0x2b, 0x82, 0xd7, 0x8f, 0x99, 0x0a, 0x59,
+	0x5c, 0x45, 0xd9, 0xb4, 0x51, 0x52, 0xc2, 0xae,
+	0xbf, 0x80, 0xcf, 0xc9, 0xc9, 0x51, 0x24, 0x2a,
+	0x3b, 0x3a, 0x4d, 0xae, 0xeb, 0xbd, 0x22, 0xc3,
+	0x0e, 0x0f, 0x59, 0x25, 0x92, 0x17, 0xe9, 0x74,
+	0xc7, 0x8b, 0x70, 0x70, 0x36, 0x55, 0x95, 0x75,
+	0x4b, 0xad, 0x61, 0x2b, 0x09, 0xbc, 0x82, 0xf2,
+	0x6e, 0x94, 0x43, 0xae, 0xc3, 0xd5, 0xcd, 0x8e,
+	0xfe, 0x5b, 0x9a, 0x88, 0x43, 0x01, 0x75, 0xb2,
+	0x23, 0x09, 0xf7, 0x89, 0x83, 0xe7, 0xfa, 0xf9,
+	0xb4, 0x9b, 0xf8, 0xef, 0xbd, 0x1c, 0x92, 0xc1,
+	0xda, 0x7e, 0xfe, 0x05, 0xba, 0x5a, 0xcd, 0x07,
+	0x6a, 0x78, 0x9e, 0x5d, 0xfb, 0x11, 0x2f, 0x79,
+	0x38, 0xb6, 0xc2, 0x5b, 0x6b, 0x51, 0xb4, 0x71,
+	0xdd, 0xf7, 0x2a, 0xe4, 0xf4, 0x72, 0x76, 0xad,
+	0xc2, 0xdd, 0x64, 0x5d, 0x79, 0xb6, 0xf5, 0x7a,
+	0x77, 0x20, 0x05, 0x3d, 0x30, 0x06, 0xd4, 0x4c,
+	0x0a, 0x2c, 0x98, 0x5a, 0xb9, 0xd4, 0x98, 0xa9,
+	0x3f, 0xc6, 0x12, 0xea, 0x3b, 0x4b, 0xc5, 0x79,
+	0x64, 0x63, 0x6b, 0x09, 0x54, 0x3b, 0x14, 0x27,
+	0xba, 0x99, 0x80, 0xc8, 0x72, 0xa8, 0x12, 0x90,
+	0x29, 0xba, 0x40, 0x54, 0x97, 0x2b, 0x7b, 0xfe,
+	0xeb, 0xcd, 0x01, 0x05, 0x44, 0x72, 0xdb, 0x99,
+	0xe4, 0x61, 0xc9, 0x69, 0xd6, 0xb9, 0x28, 0xd1,
+	0x05, 0x3e, 0xf9, 0x0b, 0x49, 0x0a, 0x49, 0xe9,
+	0x8d, 0x0e, 0xa7, 0x4a, 0x0f, 0xaf, 0x32, 0xd0,
+	0xe0, 0xb2, 0x3a, 0x55, 0x58, 0xfe, 0x5c, 0x28,
+	0x70, 0x51, 0x23, 0xb0, 0x7b, 0x6a, 0x5f, 0x1e,
+	0xb8, 0x17, 0xd7, 0x94, 0x15, 0x8f, 0xee, 0x20,
+	0xc7, 0x42, 0x25, 0x3e, 0x9a, 0x14, 0xd7, 0x60,
+	0x72, 0x39, 0x47, 0x48, 0xa9, 0xfe, 0xdd, 0x47,
+	0x0a, 0xb1, 0xe6, 0x60, 0x28, 0x8c, 0x11, 0x68,
+	0xe1, 0xff, 0xd7, 0xce, 0xc8, 0xbe, 0xb3, 0xfe,
+	0x27, 0x30, 0x09, 0x70, 0xd7, 0xfa, 0x02, 0x33,
+	0x3a, 0x61, 0x2e, 0xc7, 0xff, 0xa4, 0x2a, 0xa8,
+	0x6e, 0xb4, 0x79, 0x35, 0x6d, 0x4c, 0x1e, 0x38,
+	0xf8, 0xee, 0xd4, 0x84, 0x4e, 0x6e, 0x28, 0xa7,
+	0xce, 0xc8, 0xc1, 0xcf, 0x80, 0x05, 0xf3, 0x04,
+	0xef, 0xc8, 0x18, 0x28, 0x2e, 0x8d, 0x5e, 0x0c,
+	0xdf, 0xb8, 0x5f, 0x96, 0xe8, 0xc6, 0x9c, 0x2f,
+	0xe5, 0xa6, 0x44, 0xd7, 0xe7, 0x99, 0x44, 0x0c,
+	0xec, 0xd7, 0x05, 0x60, 0x97, 0xbb, 0x74, 0x77,
+	0x58, 0xd5, 0xbb, 0x48, 0xde, 0x5a, 0xb2, 0x54,
+	0x7f, 0x0e, 0x46, 0x70, 0x6a, 0x6f, 0x78, 0xa5,
+	0x08, 0x89, 0x05, 0x4e, 0x7e, 0xa0, 0x69, 0xb4,
+	0x40, 0x60, 0x55, 0x77, 0x75, 0x9b, 0x19, 0xf2,
+	0xd5, 0x13, 0x80, 0x77, 0xf9, 0x4b, 0x3f, 0x1e,
+	0xee, 0xe6, 0x76, 0x84, 0x7b, 0x8c, 0xe5, 0x27,
+	0xa8, 0x0a, 0x91, 0x01, 0x68, 0x71, 0x8a, 0x3f,
+	0x06, 0xab, 0xf6, 0xa9, 0xa5, 0xe6, 0x72, 0x92,
+	0xe4, 0x67, 0xe2, 0xa2, 0x46, 0x35, 0x84, 0x55,
+	0x7d, 0xca, 0xa8, 0x85, 0xd0, 0xf1, 0x3f, 0xbe,
+	0xd7, 0x34, 0x64, 0xfc, 0xae, 0xe3, 0xe4, 0x04,
+	0x9f, 0x66, 0x02, 0xb9, 0x88, 0x10, 0xd9, 0xc4,
+	0x4c, 0x31, 0x43, 0x7a, 0x93, 0xe2, 0x9b, 0x56,
+	0x43, 0x84, 0xdc, 0xdc, 0xde, 0x1d, 0xa4, 0x02,
+	0x0e, 0xc2, 0xef, 0xc3, 0xf8, 0x78, 0xd1, 0xb2,
+	0x6b, 0x63, 0x18, 0xc9, 0xa9, 0xe5, 0x72, 0xd8,
+	0xf3, 0xb9, 0xd1, 0x8a, 0xc7, 0x1a, 0x02, 0x27,
+	0x20, 0x77, 0x10, 0xe5, 0xc8, 0xd4, 0x4a, 0x47,
+	0xe5, 0xdf, 0x5f, 0x01, 0xaa, 0xb0, 0xd4, 0x10,
+	0xbb, 0x69, 0xe3, 0x36, 0xc8, 0xe1, 0x3d, 0x43,
+	0xfb, 0x86, 0xcd, 0xcc, 0xbf, 0xf4, 0x88, 0xe0,
+	0x20, 0xca, 0xb7, 0x1b, 0xf1, 0x2f, 0x5c, 0xee,
+	0xd4, 0xd3, 0xa3, 0xcc, 0xa4, 0x1e, 0x1c, 0x47,
+	0xfb, 0xbf, 0xfc, 0xa2, 0x41, 0x55, 0x9d, 0xf6,
+	0x5a, 0x5e, 0x65, 0x32, 0x34, 0x7b, 0x52, 0x8d,
+	0xd5, 0xd0, 0x20, 0x60, 0x03, 0xab, 0x3f, 0x8c,
+	0xd4, 0x21, 0xea, 0x2a, 0xd9, 0xc4, 0xd0, 0xd3,
+	0x65, 0xd8, 0x7a, 0x13, 0x28, 0x62, 0x32, 0x4b,
+	0x2c, 0x87, 0x93, 0xa8, 0xb4, 0x52, 0x45, 0x09,
+	0x44, 0xec, 0xec, 0xc3, 0x17, 0xdb, 0x9a, 0x4d,
+	0x5c, 0xa9, 0x11, 0xd4, 0x7d, 0xaf, 0x9e, 0xf1,
+	0x2d, 0xb2, 0x66, 0xc5, 0x1d, 0xed, 0xb7, 0xcd,
+	0x0b, 0x25, 0x5e, 0x30, 0x47, 0x3f, 0x40, 0xf4,
+	0xa1, 0xa0, 0x00, 0x94, 0x10, 0xc5, 0x6a, 0x63,
+	0x1a, 0xd5, 0x88, 0x92, 0x8e, 0x82, 0x39, 0x87,
+	0x3c, 0x78, 0x65, 0x58, 0x42, 0x75, 0x5b, 0xdd,
+	0x77, 0x3e, 0x09, 0x4e, 0x76, 0x5b, 0xe6, 0x0e,
+	0x4d, 0x38, 0xb2, 0xc0, 0xb8, 0x95, 0x01, 0x7a,
+	0x10, 0xe0, 0xfb, 0x07, 0xf2, 0xab, 0x2d, 0x8c,
+	0x32, 0xed, 0x2b, 0xc0, 0x46, 0xc2, 0xf5, 0x38,
+	0x83, 0xf0, 0x17, 0xec, 0xc1, 0x20, 0x6a, 0x9a,
+	0x0b, 0x00, 0xa0, 0x98, 0x22, 0x50, 0x23, 0xd5,
+	0x80, 0x6b, 0xf6, 0x1f, 0xc3, 0xcc, 0x97, 0xc9,
+	0x24, 0x9f, 0xf3, 0xaf, 0x43, 0x14, 0xd5, 0xa0
+};
+static const u8 enc_assoc010[] __initconst = {
+	0xd2, 0xa1, 0x70, 0xdb, 0x7a, 0xf8, 0xfa, 0x27,
+	0xba, 0x73, 0x0f, 0xbf, 0x3d, 0x1e, 0x82, 0xb2
+};
+static const u8 enc_nonce010[] __initconst = {
+	0xdb, 0x92, 0x0f, 0x7f, 0x17, 0x54, 0x0c, 0x30
+};
+static const u8 enc_key010[] __initconst = {
+	0x47, 0x11, 0xeb, 0x86, 0x2b, 0x2c, 0xab, 0x44,
+	0x34, 0xda, 0x7f, 0x57, 0x03, 0x39, 0x0c, 0xaf,
+	0x2c, 0x14, 0xfd, 0x65, 0x23, 0xe9, 0x8e, 0x74,
+	0xd5, 0x08, 0x68, 0x08, 0xe7, 0xb4, 0x72, 0xd7
+};
+
+static const u8 enc_input011[] __initconst = {
+	0x7a, 0x57, 0xf2, 0xc7, 0x06, 0x3f, 0x50, 0x7b,
+	0x36, 0x1a, 0x66, 0x5c, 0xb9, 0x0e, 0x5e, 0x3b,
+	0x45, 0x60, 0xbe, 0x9a, 0x31, 0x9f, 0xff, 0x5d,
+	0x66, 0x34, 0xb4, 0xdc, 0xfb, 0x9d, 0x8e, 0xee,
+	0x6a, 0x33, 0xa4, 0x07, 0x3c, 0xf9, 0x4c, 0x30,
+	0xa1, 0x24, 0x52, 0xf9, 0x50, 0x46, 0x88, 0x20,
+	0x02, 0x32, 0x3a, 0x0e, 0x99, 0x63, 0xaf, 0x1f,
+	0x15, 0x28, 0x2a, 0x05, 0xff, 0x57, 0x59, 0x5e,
+	0x18, 0xa1, 0x1f, 0xd0, 0x92, 0x5c, 0x88, 0x66,
+	0x1b, 0x00, 0x64, 0xa5, 0x93, 0x8d, 0x06, 0x46,
+	0xb0, 0x64, 0x8b, 0x8b, 0xef, 0x99, 0x05, 0x35,
+	0x85, 0xb3, 0xf3, 0x33, 0xbb, 0xec, 0x66, 0xb6,
+	0x3d, 0x57, 0x42, 0xe3, 0xb4, 0xc6, 0xaa, 0xb0,
+	0x41, 0x2a, 0xb9, 0x59, 0xa9, 0xf6, 0x3e, 0x15,
+	0x26, 0x12, 0x03, 0x21, 0x4c, 0x74, 0x43, 0x13,
+	0x2a, 0x03, 0x27, 0x09, 0xb4, 0xfb, 0xe7, 0xb7,
+	0x40, 0xff, 0x5e, 0xce, 0x48, 0x9a, 0x60, 0xe3,
+	0x8b, 0x80, 0x8c, 0x38, 0x2d, 0xcb, 0x93, 0x37,
+	0x74, 0x05, 0x52, 0x6f, 0x73, 0x3e, 0xc3, 0xbc,
+	0xca, 0x72, 0x0a, 0xeb, 0xf1, 0x3b, 0xa0, 0x95,
+	0xdc, 0x8a, 0xc4, 0xa9, 0xdc, 0xca, 0x44, 0xd8,
+	0x08, 0x63, 0x6a, 0x36, 0xd3, 0x3c, 0xb8, 0xac,
+	0x46, 0x7d, 0xfd, 0xaa, 0xeb, 0x3e, 0x0f, 0x45,
+	0x8f, 0x49, 0xda, 0x2b, 0xf2, 0x12, 0xbd, 0xaf,
+	0x67, 0x8a, 0x63, 0x48, 0x4b, 0x55, 0x5f, 0x6d,
+	0x8c, 0xb9, 0x76, 0x34, 0x84, 0xae, 0xc2, 0xfc,
+	0x52, 0x64, 0x82, 0xf7, 0xb0, 0x06, 0xf0, 0x45,
+	0x73, 0x12, 0x50, 0x30, 0x72, 0xea, 0x78, 0x9a,
+	0xa8, 0xaf, 0xb5, 0xe3, 0xbb, 0x77, 0x52, 0xec,
+	0x59, 0x84, 0xbf, 0x6b, 0x8f, 0xce, 0x86, 0x5e,
+	0x1f, 0x23, 0xe9, 0xfb, 0x08, 0x86, 0xf7, 0x10,
+	0xb9, 0xf2, 0x44, 0x96, 0x44, 0x63, 0xa9, 0xa8,
+	0x78, 0x00, 0x23, 0xd6, 0xc7, 0xe7, 0x6e, 0x66,
+	0x4f, 0xcc, 0xee, 0x15, 0xb3, 0xbd, 0x1d, 0xa0,
+	0xe5, 0x9c, 0x1b, 0x24, 0x2c, 0x4d, 0x3c, 0x62,
+	0x35, 0x9c, 0x88, 0x59, 0x09, 0xdd, 0x82, 0x1b,
+	0xcf, 0x0a, 0x83, 0x6b, 0x3f, 0xae, 0x03, 0xc4,
+	0xb4, 0xdd, 0x7e, 0x5b, 0x28, 0x76, 0x25, 0x96,
+	0xd9, 0xc9, 0x9d, 0x5f, 0x86, 0xfa, 0xf6, 0xd7,
+	0xd2, 0xe6, 0x76, 0x1d, 0x0f, 0xa1, 0xdc, 0x74,
+	0x05, 0x1b, 0x1d, 0xe0, 0xcd, 0x16, 0xb0, 0xa8,
+	0x8a, 0x34, 0x7b, 0x15, 0x11, 0x77, 0xe5, 0x7b,
+	0x7e, 0x20, 0xf7, 0xda, 0x38, 0xda, 0xce, 0x70,
+	0xe9, 0xf5, 0x6c, 0xd9, 0xbe, 0x0c, 0x4c, 0x95,
+	0x4c, 0xc2, 0x9b, 0x34, 0x55, 0x55, 0xe1, 0xf3,
+	0x46, 0x8e, 0x48, 0x74, 0x14, 0x4f, 0x9d, 0xc9,
+	0xf5, 0xe8, 0x1a, 0xf0, 0x11, 0x4a, 0xc1, 0x8d,
+	0xe0, 0x93, 0xa0, 0xbe, 0x09, 0x1c, 0x2b, 0x4e,
+	0x0f, 0xb2, 0x87, 0x8b, 0x84, 0xfe, 0x92, 0x32,
+	0x14, 0xd7, 0x93, 0xdf, 0xe7, 0x44, 0xbc, 0xc5,
+	0xae, 0x53, 0x69, 0xd8, 0xb3, 0x79, 0x37, 0x80,
+	0xe3, 0x17, 0x5c, 0xec, 0x53, 0x00, 0x9a, 0xe3,
+	0x8e, 0xdc, 0x38, 0xb8, 0x66, 0xf0, 0xd3, 0xad,
+	0x1d, 0x02, 0x96, 0x86, 0x3e, 0x9d, 0x3b, 0x5d,
+	0xa5, 0x7f, 0x21, 0x10, 0xf1, 0x1f, 0x13, 0x20,
+	0xf9, 0x57, 0x87, 0x20, 0xf5, 0x5f, 0xf1, 0x17,
+	0x48, 0x0a, 0x51, 0x5a, 0xcd, 0x19, 0x03, 0xa6,
+	0x5a, 0xd1, 0x12, 0x97, 0xe9, 0x48, 0xe2, 0x1d,
+	0x83, 0x75, 0x50, 0xd9, 0x75, 0x7d, 0x6a, 0x82,
+	0xa1, 0xf9, 0x4e, 0x54, 0x87, 0x89, 0xc9, 0x0c,
+	0xb7, 0x5b, 0x6a, 0x91, 0xc1, 0x9c, 0xb2, 0xa9,
+	0xdc, 0x9a, 0xa4, 0x49, 0x0a, 0x6d, 0x0d, 0xbb,
+	0xde, 0x86, 0x44, 0xdd, 0x5d, 0x89, 0x2b, 0x96,
+	0x0f, 0x23, 0x95, 0xad, 0xcc, 0xa2, 0xb3, 0xb9,
+	0x7e, 0x74, 0x38, 0xba, 0x9f, 0x73, 0xae, 0x5f,
+	0xf8, 0x68, 0xa2, 0xe0, 0xa9, 0xce, 0xbd, 0x40,
+	0xd4, 0x4c, 0x6b, 0xd2, 0x56, 0x62, 0xb0, 0xcc,
+	0x63, 0x7e, 0x5b, 0xd3, 0xae, 0xd1, 0x75, 0xce,
+	0xbb, 0xb4, 0x5b, 0xa8, 0xf8, 0xb4, 0xac, 0x71,
+	0x75, 0xaa, 0xc9, 0x9f, 0xbb, 0x6c, 0xad, 0x0f,
+	0x55, 0x5d, 0xe8, 0x85, 0x7d, 0xf9, 0x21, 0x35,
+	0xea, 0x92, 0x85, 0x2b, 0x00, 0xec, 0x84, 0x90,
+	0x0a, 0x63, 0x96, 0xe4, 0x6b, 0xa9, 0x77, 0xb8,
+	0x91, 0xf8, 0x46, 0x15, 0x72, 0x63, 0x70, 0x01,
+	0x40, 0xa3, 0xa5, 0x76, 0x62, 0x2b, 0xbf, 0xf1,
+	0xe5, 0x8d, 0x9f, 0xa3, 0xfa, 0x9b, 0x03, 0xbe,
+	0xfe, 0x65, 0x6f, 0xa2, 0x29, 0x0d, 0x54, 0xb4,
+	0x71, 0xce, 0xa9, 0xd6, 0x3d, 0x88, 0xf9, 0xaf,
+	0x6b, 0xa8, 0x9e, 0xf4, 0x16, 0x96, 0x36, 0xb9,
+	0x00, 0xdc, 0x10, 0xab, 0xb5, 0x08, 0x31, 0x1f,
+	0x00, 0xb1, 0x3c, 0xd9, 0x38, 0x3e, 0xc6, 0x04,
+	0xa7, 0x4e, 0xe8, 0xae, 0xed, 0x98, 0xc2, 0xf7,
+	0xb9, 0x00, 0x5f, 0x8c, 0x60, 0xd1, 0xe5, 0x15,
+	0xf7, 0xae, 0x1e, 0x84, 0x88, 0xd1, 0xf6, 0xbc,
+	0x3a, 0x89, 0x35, 0x22, 0x83, 0x7c, 0xca, 0xf0,
+	0x33, 0x82, 0x4c, 0x79, 0x3c, 0xfd, 0xb1, 0xae,
+	0x52, 0x62, 0x55, 0xd2, 0x41, 0x60, 0xc6, 0xbb,
+	0xfa, 0x0e, 0x59, 0xd6, 0xa8, 0xfe, 0x5d, 0xed,
+	0x47, 0x3d, 0xe0, 0xea, 0x1f, 0x6e, 0x43, 0x51,
+	0xec, 0x10, 0x52, 0x56, 0x77, 0x42, 0x6b, 0x52,
+	0x87, 0xd8, 0xec, 0xe0, 0xaa, 0x76, 0xa5, 0x84,
+	0x2a, 0x22, 0x24, 0xfd, 0x92, 0x40, 0x88, 0xd5,
+	0x85, 0x1c, 0x1f, 0x6b, 0x47, 0xa0, 0xc4, 0xe4,
+	0xef, 0xf4, 0xea, 0xd7, 0x59, 0xac, 0x2a, 0x9e,
+	0x8c, 0xfa, 0x1f, 0x42, 0x08, 0xfe, 0x4f, 0x74,
+	0xa0, 0x26, 0xf5, 0xb3, 0x84, 0xf6, 0x58, 0x5f,
+	0x26, 0x66, 0x3e, 0xd7, 0xe4, 0x22, 0x91, 0x13,
+	0xc8, 0xac, 0x25, 0x96, 0x23, 0xd8, 0x09, 0xea,
+	0x45, 0x75, 0x23, 0xb8, 0x5f, 0xc2, 0x90, 0x8b,
+	0x09, 0xc4, 0xfc, 0x47, 0x6c, 0x6d, 0x0a, 0xef,
+	0x69, 0xa4, 0x38, 0x19, 0xcf, 0x7d, 0xf9, 0x09,
+	0x73, 0x9b, 0x60, 0x5a, 0xf7, 0x37, 0xb5, 0xfe,
+	0x9f, 0xe3, 0x2b, 0x4c, 0x0d, 0x6e, 0x19, 0xf1,
+	0xd6, 0xc0, 0x70, 0xf3, 0x9d, 0x22, 0x3c, 0xf9,
+	0x49, 0xce, 0x30, 0x8e, 0x44, 0xb5, 0x76, 0x15,
+	0x8f, 0x52, 0xfd, 0xa5, 0x04, 0xb8, 0x55, 0x6a,
+	0x36, 0x59, 0x7c, 0xc4, 0x48, 0xb8, 0xd7, 0xab,
+	0x05, 0x66, 0xe9, 0x5e, 0x21, 0x6f, 0x6b, 0x36,
+	0x29, 0xbb, 0xe9, 0xe3, 0xa2, 0x9a, 0xa8, 0xcd,
+	0x55, 0x25, 0x11, 0xba, 0x5a, 0x58, 0xa0, 0xde,
+	0xae, 0x19, 0x2a, 0x48, 0x5a, 0xff, 0x36, 0xcd,
+	0x6d, 0x16, 0x7a, 0x73, 0x38, 0x46, 0xe5, 0x47,
+	0x59, 0xc8, 0xa2, 0xf6, 0xe2, 0x6c, 0x83, 0xc5,
+	0x36, 0x2c, 0x83, 0x7d, 0xb4, 0x01, 0x05, 0x69,
+	0xe7, 0xaf, 0x5c, 0xc4, 0x64, 0x82, 0x12, 0x21,
+	0xef, 0xf7, 0xd1, 0x7d, 0xb8, 0x8d, 0x8c, 0x98,
+	0x7c, 0x5f, 0x7d, 0x92, 0x88, 0xb9, 0x94, 0x07,
+	0x9c, 0xd8, 0xe9, 0x9c, 0x17, 0x38, 0xe3, 0x57,
+	0x6c, 0xe0, 0xdc, 0xa5, 0x92, 0x42, 0xb3, 0xbd,
+	0x50, 0xa2, 0x7e, 0xb5, 0xb1, 0x52, 0x72, 0x03,
+	0x97, 0xd8, 0xaa, 0x9a, 0x1e, 0x75, 0x41, 0x11,
+	0xa3, 0x4f, 0xcc, 0xd4, 0xe3, 0x73, 0xad, 0x96,
+	0xdc, 0x47, 0x41, 0x9f, 0xb0, 0xbe, 0x79, 0x91,
+	0xf5, 0xb6, 0x18, 0xfe, 0xc2, 0x83, 0x18, 0x7d,
+	0x73, 0xd9, 0x4f, 0x83, 0x84, 0x03, 0xb3, 0xf0,
+	0x77, 0x66, 0x3d, 0x83, 0x63, 0x2e, 0x2c, 0xf9,
+	0xdd, 0xa6, 0x1f, 0x89, 0x82, 0xb8, 0x23, 0x42,
+	0xeb, 0xe2, 0xca, 0x70, 0x82, 0x61, 0x41, 0x0a,
+	0x6d, 0x5f, 0x75, 0xc5, 0xe2, 0xc4, 0x91, 0x18,
+	0x44, 0x22, 0xfa, 0x34, 0x10, 0xf5, 0x20, 0xdc,
+	0xb7, 0xdd, 0x2a, 0x20, 0x77, 0xf5, 0xf9, 0xce,
+	0xdb, 0xa0, 0x0a, 0x52, 0x2a, 0x4e, 0xdd, 0xcc,
+	0x97, 0xdf, 0x05, 0xe4, 0x5e, 0xb7, 0xaa, 0xf0,
+	0xe2, 0x80, 0xff, 0xba, 0x1a, 0x0f, 0xac, 0xdf,
+	0x02, 0x32, 0xe6, 0xf7, 0xc7, 0x17, 0x13, 0xb7,
+	0xfc, 0x98, 0x48, 0x8c, 0x0d, 0x82, 0xc9, 0x80,
+	0x7a, 0xe2, 0x0a, 0xc5, 0xb4, 0xde, 0x7c, 0x3c,
+	0x79, 0x81, 0x0e, 0x28, 0x65, 0x79, 0x67, 0x82,
+	0x69, 0x44, 0x66, 0x09, 0xf7, 0x16, 0x1a, 0xf9,
+	0x7d, 0x80, 0xa1, 0x79, 0x14, 0xa9, 0xc8, 0x20,
+	0xfb, 0xa2, 0x46, 0xbe, 0x08, 0x35, 0x17, 0x58,
+	0xc1, 0x1a, 0xda, 0x2a, 0x6b, 0x2e, 0x1e, 0xe6,
+	0x27, 0x55, 0x7b, 0x19, 0xe2, 0xfb, 0x64, 0xfc,
+	0x5e, 0x15, 0x54, 0x3c, 0xe7, 0xc2, 0x11, 0x50,
+	0x30, 0xb8, 0x72, 0x03, 0x0b, 0x1a, 0x9f, 0x86,
+	0x27, 0x11, 0x5c, 0x06, 0x2b, 0xbd, 0x75, 0x1a,
+	0x0a, 0xda, 0x01, 0xfa, 0x5c, 0x4a, 0xc1, 0x80,
+	0x3a, 0x6e, 0x30, 0xc8, 0x2c, 0xeb, 0x56, 0xec,
+	0x89, 0xfa, 0x35, 0x7b, 0xb2, 0xf0, 0x97, 0x08,
+	0x86, 0x53, 0xbe, 0xbd, 0x40, 0x41, 0x38, 0x1c,
+	0xb4, 0x8b, 0x79, 0x2e, 0x18, 0x96, 0x94, 0xde,
+	0xe8, 0xca, 0xe5, 0x9f, 0x92, 0x9f, 0x15, 0x5d,
+	0x56, 0x60, 0x5c, 0x09, 0xf9, 0x16, 0xf4, 0x17,
+	0x0f, 0xf6, 0x4c, 0xda, 0xe6, 0x67, 0x89, 0x9f,
+	0xca, 0x6c, 0xe7, 0x9b, 0x04, 0x62, 0x0e, 0x26,
+	0xa6, 0x52, 0xbd, 0x29, 0xff, 0xc7, 0xa4, 0x96,
+	0xe6, 0x6a, 0x02, 0xa5, 0x2e, 0x7b, 0xfe, 0x97,
+	0x68, 0x3e, 0x2e, 0x5f, 0x3b, 0x0f, 0x36, 0xd6,
+	0x98, 0x19, 0x59, 0x48, 0xd2, 0xc6, 0xe1, 0x55,
+	0x1a, 0x6e, 0xd6, 0xed, 0x2c, 0xba, 0xc3, 0x9e,
+	0x64, 0xc9, 0x95, 0x86, 0x35, 0x5e, 0x3e, 0x88,
+	0x69, 0x99, 0x4b, 0xee, 0xbe, 0x9a, 0x99, 0xb5,
+	0x6e, 0x58, 0xae, 0xdd, 0x22, 0xdb, 0xdd, 0x6b,
+	0xfc, 0xaf, 0x90, 0xa3, 0x3d, 0xa4, 0xc1, 0x15,
+	0x92, 0x18, 0x8d, 0xd2, 0x4b, 0x7b, 0x06, 0xd1,
+	0x37, 0xb5, 0xe2, 0x7c, 0x2c, 0xf0, 0x25, 0xe4,
+	0x94, 0x2a, 0xbd, 0xe3, 0x82, 0x70, 0x78, 0xa3,
+	0x82, 0x10, 0x5a, 0x90, 0xd7, 0xa4, 0xfa, 0xaf,
+	0x1a, 0x88, 0x59, 0xdc, 0x74, 0x12, 0xb4, 0x8e,
+	0xd7, 0x19, 0x46, 0xf4, 0x84, 0x69, 0x9f, 0xbb,
+	0x70, 0xa8, 0x4c, 0x52, 0x81, 0xa9, 0xff, 0x76,
+	0x1c, 0xae, 0xd8, 0x11, 0x3d, 0x7f, 0x7d, 0xc5,
+	0x12, 0x59, 0x28, 0x18, 0xc2, 0xa2, 0xb7, 0x1c,
+	0x88, 0xf8, 0xd6, 0x1b, 0xa6, 0x7d, 0x9e, 0xde,
+	0x29, 0xf8, 0xed, 0xff, 0xeb, 0x92, 0x24, 0x4f,
+	0x05, 0xaa, 0xd9, 0x49, 0xba, 0x87, 0x59, 0x51,
+	0xc9, 0x20, 0x5c, 0x9b, 0x74, 0xcf, 0x03, 0xd9,
+	0x2d, 0x34, 0xc7, 0x5b, 0xa5, 0x40, 0xb2, 0x99,
+	0xf5, 0xcb, 0xb4, 0xf6, 0xb7, 0x72, 0x4a, 0xd6,
+	0xbd, 0xb0, 0xf3, 0x93, 0xe0, 0x1b, 0xa8, 0x04,
+	0x1e, 0x35, 0xd4, 0x80, 0x20, 0xf4, 0x9c, 0x31,
+	0x6b, 0x45, 0xb9, 0x15, 0xb0, 0x5e, 0xdd, 0x0a,
+	0x33, 0x9c, 0x83, 0xcd, 0x58, 0x89, 0x50, 0x56,
+	0xbb, 0x81, 0x00, 0x91, 0x32, 0xf3, 0x1b, 0x3e,
+	0xcf, 0x45, 0xe1, 0xf9, 0xe1, 0x2c, 0x26, 0x78,
+	0x93, 0x9a, 0x60, 0x46, 0xc9, 0xb5, 0x5e, 0x6a,
+	0x28, 0x92, 0x87, 0x3f, 0x63, 0x7b, 0xdb, 0xf7,
+	0xd0, 0x13, 0x9d, 0x32, 0x40, 0x5e, 0xcf, 0xfb,
+	0x79, 0x68, 0x47, 0x4c, 0xfd, 0x01, 0x17, 0xe6,
+	0x97, 0x93, 0x78, 0xbb, 0xa6, 0x27, 0xa3, 0xe8,
+	0x1a, 0xe8, 0x94, 0x55, 0x7d, 0x08, 0xe5, 0xdc,
+	0x66, 0xa3, 0x69, 0xc8, 0xca, 0xc5, 0xa1, 0x84,
+	0x55, 0xde, 0x08, 0x91, 0x16, 0x3a, 0x0c, 0x86,
+	0xab, 0x27, 0x2b, 0x64, 0x34, 0x02, 0x6c, 0x76,
+	0x8b, 0xc6, 0xaf, 0xcc, 0xe1, 0xd6, 0x8c, 0x2a,
+	0x18, 0x3d, 0xa6, 0x1b, 0x37, 0x75, 0x45, 0x73,
+	0xc2, 0x75, 0xd7, 0x53, 0x78, 0x3a, 0xd6, 0xe8,
+	0x29, 0xd2, 0x4a, 0xa8, 0x1e, 0x82, 0xf6, 0xb6,
+	0x81, 0xde, 0x21, 0xed, 0x2b, 0x56, 0xbb, 0xf2,
+	0xd0, 0x57, 0xc1, 0x7c, 0xd2, 0x6a, 0xd2, 0x56,
+	0xf5, 0x13, 0x5f, 0x1c, 0x6a, 0x0b, 0x74, 0xfb,
+	0xe9, 0xfe, 0x9e, 0xea, 0x95, 0xb2, 0x46, 0xab,
+	0x0a, 0xfc, 0xfd, 0xf3, 0xbb, 0x04, 0x2b, 0x76,
+	0x1b, 0xa4, 0x74, 0xb0, 0xc1, 0x78, 0xc3, 0x69,
+	0xe2, 0xb0, 0x01, 0xe1, 0xde, 0x32, 0x4c, 0x8d,
+	0x1a, 0xb3, 0x38, 0x08, 0xd5, 0xfc, 0x1f, 0xdc,
+	0x0e, 0x2c, 0x9c, 0xb1, 0xa1, 0x63, 0x17, 0x22,
+	0xf5, 0x6c, 0x93, 0x70, 0x74, 0x00, 0xf8, 0x39,
+	0x01, 0x94, 0xd1, 0x32, 0x23, 0x56, 0x5d, 0xa6,
+	0x02, 0x76, 0x76, 0x93, 0xce, 0x2f, 0x19, 0xe9,
+	0x17, 0x52, 0xae, 0x6e, 0x2c, 0x6d, 0x61, 0x7f,
+	0x3b, 0xaa, 0xe0, 0x52, 0x85, 0xc5, 0x65, 0xc1,
+	0xbb, 0x8e, 0x5b, 0x21, 0xd5, 0xc9, 0x78, 0x83,
+	0x07, 0x97, 0x4c, 0x62, 0x61, 0x41, 0xd4, 0xfc,
+	0xc9, 0x39, 0xe3, 0x9b, 0xd0, 0xcc, 0x75, 0xc4,
+	0x97, 0xe6, 0xdd, 0x2a, 0x5f, 0xa6, 0xe8, 0x59,
+	0x6c, 0x98, 0xb9, 0x02, 0xe2, 0xa2, 0xd6, 0x68,
+	0xee, 0x3b, 0x1d, 0xe3, 0x4d, 0x5b, 0x30, 0xef,
+	0x03, 0xf2, 0xeb, 0x18, 0x57, 0x36, 0xe8, 0xa1,
+	0xf4, 0x47, 0xfb, 0xcb, 0x8f, 0xcb, 0xc8, 0xf3,
+	0x4f, 0x74, 0x9d, 0x9d, 0xb1, 0x8d, 0x14, 0x44,
+	0xd9, 0x19, 0xb4, 0x54, 0x4f, 0x75, 0x19, 0x09,
+	0xa0, 0x75, 0xbc, 0x3b, 0x82, 0xc6, 0x3f, 0xb8,
+	0x83, 0x19, 0x6e, 0xd6, 0x37, 0xfe, 0x6e, 0x8a,
+	0x4e, 0xe0, 0x4a, 0xab, 0x7b, 0xc8, 0xb4, 0x1d,
+	0xf4, 0xed, 0x27, 0x03, 0x65, 0xa2, 0xa1, 0xae,
+	0x11, 0xe7, 0x98, 0x78, 0x48, 0x91, 0xd2, 0xd2,
+	0xd4, 0x23, 0x78, 0x50, 0xb1, 0x5b, 0x85, 0x10,
+	0x8d, 0xca, 0x5f, 0x0f, 0x71, 0xae, 0x72, 0x9a,
+	0xf6, 0x25, 0x19, 0x60, 0x06, 0xf7, 0x10, 0x34,
+	0x18, 0x0d, 0xc9, 0x9f, 0x7b, 0x0c, 0x9b, 0x8f,
+	0x91, 0x1b, 0x9f, 0xcd, 0x10, 0xee, 0x75, 0xf9,
+	0x97, 0x66, 0xfc, 0x4d, 0x33, 0x6e, 0x28, 0x2b,
+	0x92, 0x85, 0x4f, 0xab, 0x43, 0x8d, 0x8f, 0x7d,
+	0x86, 0xa7, 0xc7, 0xd8, 0xd3, 0x0b, 0x8b, 0x57,
+	0xb6, 0x1d, 0x95, 0x0d, 0xe9, 0xbc, 0xd9, 0x03,
+	0xd9, 0x10, 0x19, 0xc3, 0x46, 0x63, 0x55, 0x87,
+	0x61, 0x79, 0x6c, 0x95, 0x0e, 0x9c, 0xdd, 0xca,
+	0xc3, 0xf3, 0x64, 0xf0, 0x7d, 0x76, 0xb7, 0x53,
+	0x67, 0x2b, 0x1e, 0x44, 0x56, 0x81, 0xea, 0x8f,
+	0x5c, 0x42, 0x16, 0xb8, 0x28, 0xeb, 0x1b, 0x61,
+	0x10, 0x1e, 0xbf, 0xec, 0xa8
+};
+static const u8 enc_output011[] __initconst = {
+	0x6a, 0xfc, 0x4b, 0x25, 0xdf, 0xc0, 0xe4, 0xe8,
+	0x17, 0x4d, 0x4c, 0xc9, 0x7e, 0xde, 0x3a, 0xcc,
+	0x3c, 0xba, 0x6a, 0x77, 0x47, 0xdb, 0xe3, 0x74,
+	0x7a, 0x4d, 0x5f, 0x8d, 0x37, 0x55, 0x80, 0x73,
+	0x90, 0x66, 0x5d, 0x3a, 0x7d, 0x5d, 0x86, 0x5e,
+	0x8d, 0xfd, 0x83, 0xff, 0x4e, 0x74, 0x6f, 0xf9,
+	0xe6, 0x70, 0x17, 0x70, 0x3e, 0x96, 0xa7, 0x7e,
+	0xcb, 0xab, 0x8f, 0x58, 0x24, 0x9b, 0x01, 0xfd,
+	0xcb, 0xe6, 0x4d, 0x9b, 0xf0, 0x88, 0x94, 0x57,
+	0x66, 0xef, 0x72, 0x4c, 0x42, 0x6e, 0x16, 0x19,
+	0x15, 0xea, 0x70, 0x5b, 0xac, 0x13, 0xdb, 0x9f,
+	0x18, 0xe2, 0x3c, 0x26, 0x97, 0xbc, 0xdc, 0x45,
+	0x8c, 0x6c, 0x24, 0x69, 0x9c, 0xf7, 0x65, 0x1e,
+	0x18, 0x59, 0x31, 0x7c, 0xe4, 0x73, 0xbc, 0x39,
+	0x62, 0xc6, 0x5c, 0x9f, 0xbf, 0xfa, 0x90, 0x03,
+	0xc9, 0x72, 0x26, 0xb6, 0x1b, 0xc2, 0xb7, 0x3f,
+	0xf2, 0x13, 0x77, 0xf2, 0x8d, 0xb9, 0x47, 0xd0,
+	0x53, 0xdd, 0xc8, 0x91, 0x83, 0x8b, 0xb1, 0xce,
+	0xa3, 0xfe, 0xcd, 0xd9, 0xdd, 0x92, 0x7b, 0xdb,
+	0xb8, 0xfb, 0xc9, 0x2d, 0x01, 0x59, 0x39, 0x52,
+	0xad, 0x1b, 0xec, 0xcf, 0xd7, 0x70, 0x13, 0x21,
+	0xf5, 0x47, 0xaa, 0x18, 0x21, 0x5c, 0xc9, 0x9a,
+	0xd2, 0x6b, 0x05, 0x9c, 0x01, 0xa1, 0xda, 0x35,
+	0x5d, 0xb3, 0x70, 0xe6, 0xa9, 0x80, 0x8b, 0x91,
+	0xb7, 0xb3, 0x5f, 0x24, 0x9a, 0xb7, 0xd1, 0x6b,
+	0xa1, 0x1c, 0x50, 0xba, 0x49, 0xe0, 0xee, 0x2e,
+	0x75, 0xac, 0x69, 0xc0, 0xeb, 0x03, 0xdd, 0x19,
+	0xe5, 0xf6, 0x06, 0xdd, 0xc3, 0xd7, 0x2b, 0x07,
+	0x07, 0x30, 0xa7, 0x19, 0x0c, 0xbf, 0xe6, 0x18,
+	0xcc, 0xb1, 0x01, 0x11, 0x85, 0x77, 0x1d, 0x96,
+	0xa7, 0xa3, 0x00, 0x84, 0x02, 0xa2, 0x83, 0x68,
+	0xda, 0x17, 0x27, 0xc8, 0x7f, 0x23, 0xb7, 0xf4,
+	0x13, 0x85, 0xcf, 0xdd, 0x7a, 0x7d, 0x24, 0x57,
+	0xfe, 0x05, 0x93, 0xf5, 0x74, 0xce, 0xed, 0x0c,
+	0x20, 0x98, 0x8d, 0x92, 0x30, 0xa1, 0x29, 0x23,
+	0x1a, 0xa0, 0x4f, 0x69, 0x56, 0x4c, 0xe1, 0xc8,
+	0xce, 0xf6, 0x9a, 0x0c, 0xa4, 0xfa, 0x04, 0xf6,
+	0x62, 0x95, 0xf2, 0xfa, 0xc7, 0x40, 0x68, 0x40,
+	0x8f, 0x41, 0xda, 0xb4, 0x26, 0x6f, 0x70, 0xab,
+	0x40, 0x61, 0xa4, 0x0e, 0x75, 0xfb, 0x86, 0xeb,
+	0x9d, 0x9a, 0x1f, 0xec, 0x76, 0x99, 0xe7, 0xea,
+	0xaa, 0x1e, 0x2d, 0xb5, 0xd4, 0xa6, 0x1a, 0xb8,
+	0x61, 0x0a, 0x1d, 0x16, 0x5b, 0x98, 0xc2, 0x31,
+	0x40, 0xe7, 0x23, 0x1d, 0x66, 0x99, 0xc8, 0xc0,
+	0xd7, 0xce, 0xf3, 0x57, 0x40, 0x04, 0x3f, 0xfc,
+	0xea, 0xb3, 0xfc, 0xd2, 0xd3, 0x99, 0xa4, 0x94,
+	0x69, 0xa0, 0xef, 0xd1, 0x85, 0xb3, 0xa6, 0xb1,
+	0x28, 0xbf, 0x94, 0x67, 0x22, 0xc3, 0x36, 0x46,
+	0xf8, 0xd2, 0x0f, 0x5f, 0xf4, 0x59, 0x80, 0xe6,
+	0x2d, 0x43, 0x08, 0x7d, 0x19, 0x09, 0x97, 0xa7,
+	0x4c, 0x3d, 0x8d, 0xba, 0x65, 0x62, 0xa3, 0x71,
+	0x33, 0x29, 0x62, 0xdb, 0xc1, 0x33, 0x34, 0x1a,
+	0x63, 0x33, 0x16, 0xb6, 0x64, 0x7e, 0xab, 0x33,
+	0xf0, 0xe6, 0x26, 0x68, 0xba, 0x1d, 0x2e, 0x38,
+	0x08, 0xe6, 0x02, 0xd3, 0x25, 0x2c, 0x47, 0x23,
+	0x58, 0x34, 0x0f, 0x9d, 0x63, 0x4f, 0x63, 0xbb,
+	0x7f, 0x3b, 0x34, 0x38, 0xa7, 0xb5, 0x8d, 0x65,
+	0xd9, 0x9f, 0x79, 0x55, 0x3e, 0x4d, 0xe7, 0x73,
+	0xd8, 0xf6, 0x98, 0x97, 0x84, 0x60, 0x9c, 0xc8,
+	0xa9, 0x3c, 0xf6, 0xdc, 0x12, 0x5c, 0xe1, 0xbb,
+	0x0b, 0x8b, 0x98, 0x9c, 0x9d, 0x26, 0x7c, 0x4a,
+	0xe6, 0x46, 0x36, 0x58, 0x21, 0x4a, 0xee, 0xca,
+	0xd7, 0x3b, 0xc2, 0x6c, 0x49, 0x2f, 0xe5, 0xd5,
+	0x03, 0x59, 0x84, 0x53, 0xcb, 0xfe, 0x92, 0x71,
+	0x2e, 0x7c, 0x21, 0xcc, 0x99, 0x85, 0x7f, 0xb8,
+	0x74, 0x90, 0x13, 0x42, 0x3f, 0xe0, 0x6b, 0x1d,
+	0xf2, 0x4d, 0x54, 0xd4, 0xfc, 0x3a, 0x05, 0xe6,
+	0x74, 0xaf, 0xa6, 0xa0, 0x2a, 0x20, 0x23, 0x5d,
+	0x34, 0x5c, 0xd9, 0x3e, 0x4e, 0xfa, 0x93, 0xe7,
+	0xaa, 0xe9, 0x6f, 0x08, 0x43, 0x67, 0x41, 0xc5,
+	0xad, 0xfb, 0x31, 0x95, 0x82, 0x73, 0x32, 0xd8,
+	0xa6, 0xa3, 0xed, 0x0e, 0x2d, 0xf6, 0x5f, 0xfd,
+	0x80, 0xa6, 0x7a, 0xe0, 0xdf, 0x78, 0x15, 0x29,
+	0x74, 0x33, 0xd0, 0x9e, 0x83, 0x86, 0x72, 0x22,
+	0x57, 0x29, 0xb9, 0x9e, 0x5d, 0xd3, 0x1a, 0xb5,
+	0x96, 0x72, 0x41, 0x3d, 0xf1, 0x64, 0x43, 0x67,
+	0xee, 0xaa, 0x5c, 0xd3, 0x9a, 0x96, 0x13, 0x11,
+	0x5d, 0xf3, 0x0c, 0x87, 0x82, 0x1e, 0x41, 0x9e,
+	0xd0, 0x27, 0xd7, 0x54, 0x3b, 0x67, 0x73, 0x09,
+	0x91, 0xe9, 0xd5, 0x36, 0xa7, 0xb5, 0x55, 0xe4,
+	0xf3, 0x21, 0x51, 0x49, 0x22, 0x07, 0x55, 0x4f,
+	0x44, 0x4b, 0xd2, 0x15, 0x93, 0x17, 0x2a, 0xfa,
+	0x4d, 0x4a, 0x57, 0xdb, 0x4c, 0xa6, 0xeb, 0xec,
+	0x53, 0x25, 0x6c, 0x21, 0xed, 0x00, 0x4c, 0x3b,
+	0xca, 0x14, 0x57, 0xa9, 0xd6, 0x6a, 0xcd, 0x8d,
+	0x5e, 0x74, 0xac, 0x72, 0xc1, 0x97, 0xe5, 0x1b,
+	0x45, 0x4e, 0xda, 0xfc, 0xcc, 0x40, 0xe8, 0x48,
+	0x88, 0x0b, 0xa3, 0xe3, 0x8d, 0x83, 0x42, 0xc3,
+	0x23, 0xfd, 0x68, 0xb5, 0x8e, 0xf1, 0x9d, 0x63,
+	0x77, 0xe9, 0xa3, 0x8e, 0x8c, 0x26, 0x6b, 0xbd,
+	0x72, 0x73, 0x35, 0x0c, 0x03, 0xf8, 0x43, 0x78,
+	0x52, 0x71, 0x15, 0x1f, 0x71, 0x5d, 0x6e, 0xed,
+	0xb9, 0xcc, 0x86, 0x30, 0xdb, 0x2b, 0xd3, 0x82,
+	0x88, 0x23, 0x71, 0x90, 0x53, 0x5c, 0xa9, 0x2f,
+	0x76, 0x01, 0xb7, 0x9a, 0xfe, 0x43, 0x55, 0xa3,
+	0x04, 0x9b, 0x0e, 0xe4, 0x59, 0xdf, 0xc9, 0xe9,
+	0xb1, 0xea, 0x29, 0x28, 0x3c, 0x5c, 0xae, 0x72,
+	0x84, 0xb6, 0xc6, 0xeb, 0x0c, 0x27, 0x07, 0x74,
+	0x90, 0x0d, 0x31, 0xb0, 0x00, 0x77, 0xe9, 0x40,
+	0x70, 0x6f, 0x68, 0xa7, 0xfd, 0x06, 0xec, 0x4b,
+	0xc0, 0xb7, 0xac, 0xbc, 0x33, 0xb7, 0x6d, 0x0a,
+	0xbd, 0x12, 0x1b, 0x59, 0xcb, 0xdd, 0x32, 0xf5,
+	0x1d, 0x94, 0x57, 0x76, 0x9e, 0x0c, 0x18, 0x98,
+	0x71, 0xd7, 0x2a, 0xdb, 0x0b, 0x7b, 0xa7, 0x71,
+	0xb7, 0x67, 0x81, 0x23, 0x96, 0xae, 0xb9, 0x7e,
+	0x32, 0x43, 0x92, 0x8a, 0x19, 0xa0, 0xc4, 0xd4,
+	0x3b, 0x57, 0xf9, 0x4a, 0x2c, 0xfb, 0x51, 0x46,
+	0xbb, 0xcb, 0x5d, 0xb3, 0xef, 0x13, 0x93, 0x6e,
+	0x68, 0x42, 0x54, 0x57, 0xd3, 0x6a, 0x3a, 0x8f,
+	0x9d, 0x66, 0xbf, 0xbd, 0x36, 0x23, 0xf5, 0x93,
+	0x83, 0x7b, 0x9c, 0xc0, 0xdd, 0xc5, 0x49, 0xc0,
+	0x64, 0xed, 0x07, 0x12, 0xb3, 0xe6, 0xe4, 0xe5,
+	0x38, 0x95, 0x23, 0xb1, 0xa0, 0x3b, 0x1a, 0x61,
+	0xda, 0x17, 0xac, 0xc3, 0x58, 0xdd, 0x74, 0x64,
+	0x22, 0x11, 0xe8, 0x32, 0x1d, 0x16, 0x93, 0x85,
+	0x99, 0xa5, 0x9c, 0x34, 0x55, 0xb1, 0xe9, 0x20,
+	0x72, 0xc9, 0x28, 0x7b, 0x79, 0x00, 0xa1, 0xa6,
+	0xa3, 0x27, 0x40, 0x18, 0x8a, 0x54, 0xe0, 0xcc,
+	0xe8, 0x4e, 0x8e, 0x43, 0x96, 0xe7, 0x3f, 0xc8,
+	0xe9, 0xb2, 0xf9, 0xc9, 0xda, 0x04, 0x71, 0x50,
+	0x47, 0xe4, 0xaa, 0xce, 0xa2, 0x30, 0xc8, 0xe4,
+	0xac, 0xc7, 0x0d, 0x06, 0x2e, 0xe6, 0xe8, 0x80,
+	0x36, 0x29, 0x9e, 0x01, 0xb8, 0xc3, 0xf0, 0xa0,
+	0x5d, 0x7a, 0xca, 0x4d, 0xa0, 0x57, 0xbd, 0x2a,
+	0x45, 0xa7, 0x7f, 0x9c, 0x93, 0x07, 0x8f, 0x35,
+	0x67, 0x92, 0xe3, 0xe9, 0x7f, 0xa8, 0x61, 0x43,
+	0x9e, 0x25, 0x4f, 0x33, 0x76, 0x13, 0x6e, 0x12,
+	0xb9, 0xdd, 0xa4, 0x7c, 0x08, 0x9f, 0x7c, 0xe7,
+	0x0a, 0x8d, 0x84, 0x06, 0xa4, 0x33, 0x17, 0x34,
+	0x5e, 0x10, 0x7c, 0xc0, 0xa8, 0x3d, 0x1f, 0x42,
+	0x20, 0x51, 0x65, 0x5d, 0x09, 0xc3, 0xaa, 0xc0,
+	0xc8, 0x0d, 0xf0, 0x79, 0xbc, 0x20, 0x1b, 0x95,
+	0xe7, 0x06, 0x7d, 0x47, 0x20, 0x03, 0x1a, 0x74,
+	0xdd, 0xe2, 0xd4, 0xae, 0x38, 0x71, 0x9b, 0xf5,
+	0x80, 0xec, 0x08, 0x4e, 0x56, 0xba, 0x76, 0x12,
+	0x1a, 0xdf, 0x48, 0xf3, 0xae, 0xb3, 0xe6, 0xe6,
+	0xbe, 0xc0, 0x91, 0x2e, 0x01, 0xb3, 0x01, 0x86,
+	0xa2, 0xb9, 0x52, 0xd1, 0x21, 0xae, 0xd4, 0x97,
+	0x1d, 0xef, 0x41, 0x12, 0x95, 0x3d, 0x48, 0x45,
+	0x1c, 0x56, 0x32, 0x8f, 0xb8, 0x43, 0xbb, 0x19,
+	0xf3, 0xca, 0xe9, 0xeb, 0x6d, 0x84, 0xbe, 0x86,
+	0x06, 0xe2, 0x36, 0xb2, 0x62, 0x9d, 0xd3, 0x4c,
+	0x48, 0x18, 0x54, 0x13, 0x4e, 0xcf, 0xfd, 0xba,
+	0x84, 0xb9, 0x30, 0x53, 0xcf, 0xfb, 0xb9, 0x29,
+	0x8f, 0xdc, 0x9f, 0xef, 0x60, 0x0b, 0x64, 0xf6,
+	0x8b, 0xee, 0xa6, 0x91, 0xc2, 0x41, 0x6c, 0xf6,
+	0xfa, 0x79, 0x67, 0x4b, 0xc1, 0x3f, 0xaf, 0x09,
+	0x81, 0xd4, 0x5d, 0xcb, 0x09, 0xdf, 0x36, 0x31,
+	0xc0, 0x14, 0x3c, 0x7c, 0x0e, 0x65, 0x95, 0x99,
+	0x6d, 0xa3, 0xf4, 0xd7, 0x38, 0xee, 0x1a, 0x2b,
+	0x37, 0xe2, 0xa4, 0x3b, 0x4b, 0xd0, 0x65, 0xca,
+	0xf8, 0xc3, 0xe8, 0x15, 0x20, 0xef, 0xf2, 0x00,
+	0xfd, 0x01, 0x09, 0xc5, 0xc8, 0x17, 0x04, 0x93,
+	0xd0, 0x93, 0x03, 0x55, 0xc5, 0xfe, 0x32, 0xa3,
+	0x3e, 0x28, 0x2d, 0x3b, 0x93, 0x8a, 0xcc, 0x07,
+	0x72, 0x80, 0x8b, 0x74, 0x16, 0x24, 0xbb, 0xda,
+	0x94, 0x39, 0x30, 0x8f, 0xb1, 0xcd, 0x4a, 0x90,
+	0x92, 0x7c, 0x14, 0x8f, 0x95, 0x4e, 0xac, 0x9b,
+	0xd8, 0x8f, 0x1a, 0x87, 0xa4, 0x32, 0x27, 0x8a,
+	0xba, 0xf7, 0x41, 0xcf, 0x84, 0x37, 0x19, 0xe6,
+	0x06, 0xf5, 0x0e, 0xcf, 0x36, 0xf5, 0x9e, 0x6c,
+	0xde, 0xbc, 0xff, 0x64, 0x7e, 0x4e, 0x59, 0x57,
+	0x48, 0xfe, 0x14, 0xf7, 0x9c, 0x93, 0x5d, 0x15,
+	0xad, 0xcc, 0x11, 0xb1, 0x17, 0x18, 0xb2, 0x7e,
+	0xcc, 0xab, 0xe9, 0xce, 0x7d, 0x77, 0x5b, 0x51,
+	0x1b, 0x1e, 0x20, 0xa8, 0x32, 0x06, 0x0e, 0x75,
+	0x93, 0xac, 0xdb, 0x35, 0x37, 0x1f, 0xe9, 0x19,
+	0x1d, 0xb4, 0x71, 0x97, 0xd6, 0x4e, 0x2c, 0x08,
+	0xa5, 0x13, 0xf9, 0x0e, 0x7e, 0x78, 0x6e, 0x14,
+	0xe0, 0xa9, 0xb9, 0x96, 0x4c, 0x80, 0x82, 0xba,
+	0x17, 0xb3, 0x9d, 0x69, 0xb0, 0x84, 0x46, 0xff,
+	0xf9, 0x52, 0x79, 0x94, 0x58, 0x3a, 0x62, 0x90,
+	0x15, 0x35, 0x71, 0x10, 0x37, 0xed, 0xa1, 0x8e,
+	0x53, 0x6e, 0xf4, 0x26, 0x57, 0x93, 0x15, 0x93,
+	0xf6, 0x81, 0x2c, 0x5a, 0x10, 0xda, 0x92, 0xad,
+	0x2f, 0xdb, 0x28, 0x31, 0x2d, 0x55, 0x04, 0xd2,
+	0x06, 0x28, 0x8c, 0x1e, 0xdc, 0xea, 0x54, 0xac,
+	0xff, 0xb7, 0x6c, 0x30, 0x15, 0xd4, 0xb4, 0x0d,
+	0x00, 0x93, 0x57, 0xdd, 0xd2, 0x07, 0x07, 0x06,
+	0xd9, 0x43, 0x9b, 0xcd, 0x3a, 0xf4, 0x7d, 0x4c,
+	0x36, 0x5d, 0x23, 0xa2, 0xcc, 0x57, 0x40, 0x91,
+	0xe9, 0x2c, 0x2f, 0x2c, 0xd5, 0x30, 0x9b, 0x17,
+	0xb0, 0xc9, 0xf7, 0xa7, 0x2f, 0xd1, 0x93, 0x20,
+	0x6b, 0xc6, 0xc1, 0xe4, 0x6f, 0xcb, 0xd1, 0xe7,
+	0x09, 0x0f, 0x9e, 0xdc, 0xaa, 0x9f, 0x2f, 0xdf,
+	0x56, 0x9f, 0xd4, 0x33, 0x04, 0xaf, 0xd3, 0x6c,
+	0x58, 0x61, 0xf0, 0x30, 0xec, 0xf2, 0x7f, 0xf2,
+	0x9c, 0xdf, 0x39, 0xbb, 0x6f, 0xa2, 0x8c, 0x7e,
+	0xc4, 0x22, 0x51, 0x71, 0xc0, 0x4d, 0x14, 0x1a,
+	0xc4, 0xcd, 0x04, 0xd9, 0x87, 0x08, 0x50, 0x05,
+	0xcc, 0xaf, 0xf6, 0xf0, 0x8f, 0x92, 0x54, 0x58,
+	0xc2, 0xc7, 0x09, 0x7a, 0x59, 0x02, 0x05, 0xe8,
+	0xb0, 0x86, 0xd9, 0xbf, 0x7b, 0x35, 0x51, 0x4d,
+	0xaf, 0x08, 0x97, 0x2c, 0x65, 0xda, 0x2a, 0x71,
+	0x3a, 0xa8, 0x51, 0xcc, 0xf2, 0x73, 0x27, 0xc3,
+	0xfd, 0x62, 0xcf, 0xe3, 0xb2, 0xca, 0xcb, 0xbe,
+	0x1a, 0x0a, 0xa1, 0x34, 0x7b, 0x77, 0xc4, 0x62,
+	0x68, 0x78, 0x5f, 0x94, 0x07, 0x04, 0x65, 0x16,
+	0x4b, 0x61, 0xcb, 0xff, 0x75, 0x26, 0x50, 0x66,
+	0x1f, 0x6e, 0x93, 0xf8, 0xc5, 0x51, 0xeb, 0xa4,
+	0x4a, 0x48, 0x68, 0x6b, 0xe2, 0x5e, 0x44, 0xb2,
+	0x50, 0x2c, 0x6c, 0xae, 0x79, 0x4e, 0x66, 0x35,
+	0x81, 0x50, 0xac, 0xbc, 0x3f, 0xb1, 0x0c, 0xf3,
+	0x05, 0x3c, 0x4a, 0xa3, 0x6c, 0x2a, 0x79, 0xb4,
+	0xb7, 0xab, 0xca, 0xc7, 0x9b, 0x8e, 0xcd, 0x5f,
+	0x11, 0x03, 0xcb, 0x30, 0xa3, 0xab, 0xda, 0xfe,
+	0x64, 0xb9, 0xbb, 0xd8, 0x5e, 0x3a, 0x1a, 0x56,
+	0xe5, 0x05, 0x48, 0x90, 0x1e, 0x61, 0x69, 0x1b,
+	0x22, 0xe6, 0x1a, 0x3c, 0x75, 0xad, 0x1f, 0x37,
+	0x28, 0xdc, 0xe4, 0x6d, 0xbd, 0x42, 0xdc, 0xd3,
+	0xc8, 0xb6, 0x1c, 0x48, 0xfe, 0x94, 0x77, 0x7f,
+	0xbd, 0x62, 0xac, 0xa3, 0x47, 0x27, 0xcf, 0x5f,
+	0xd9, 0xdb, 0xaf, 0xec, 0xf7, 0x5e, 0xc1, 0xb0,
+	0x9d, 0x01, 0x26, 0x99, 0x7e, 0x8f, 0x03, 0x70,
+	0xb5, 0x42, 0xbe, 0x67, 0x28, 0x1b, 0x7c, 0xbd,
+	0x61, 0x21, 0x97, 0xcc, 0x5c, 0xe1, 0x97, 0x8f,
+	0x8d, 0xde, 0x2b, 0xaa, 0xa7, 0x71, 0x1d, 0x1e,
+	0x02, 0x73, 0x70, 0x58, 0x32, 0x5b, 0x1d, 0x67,
+	0x3d, 0xe0, 0x74, 0x4f, 0x03, 0xf2, 0x70, 0x51,
+	0x79, 0xf1, 0x61, 0x70, 0x15, 0x74, 0x9d, 0x23,
+	0x89, 0xde, 0xac, 0xfd, 0xde, 0xd0, 0x1f, 0xc3,
+	0x87, 0x44, 0x35, 0x4b, 0xe5, 0xb0, 0x60, 0xc5,
+	0x22, 0xe4, 0x9e, 0xca, 0xeb, 0xd5, 0x3a, 0x09,
+	0x45, 0xa4, 0xdb, 0xfa, 0x3f, 0xeb, 0x1b, 0xc7,
+	0xc8, 0x14, 0x99, 0x51, 0x92, 0x10, 0xed, 0xed,
+	0x28, 0xe0, 0xa1, 0xf8, 0x26, 0xcf, 0xcd, 0xcb,
+	0x63, 0xa1, 0x3b, 0xe3, 0xdf, 0x7e, 0xfe, 0xa6,
+	0xf0, 0x81, 0x9a, 0xbf, 0x55, 0xde, 0x54, 0xd5,
+	0x56, 0x60, 0x98, 0x10, 0x68, 0xf4, 0x38, 0x96,
+	0x8e, 0x6f, 0x1d, 0x44, 0x7f, 0xd6, 0x2f, 0xfe,
+	0x55, 0xfb, 0x0c, 0x7e, 0x67, 0xe2, 0x61, 0x44,
+	0xed, 0xf2, 0x35, 0x30, 0x5d, 0xe9, 0xc7, 0xd6,
+	0x6d, 0xe0, 0xa0, 0xed, 0xf3, 0xfc, 0xd8, 0x3e,
+	0x0a, 0x7b, 0xcd, 0xaf, 0x65, 0x68, 0x18, 0xc0,
+	0xec, 0x04, 0x1c, 0x74, 0x6d, 0xe2, 0x6e, 0x79,
+	0xd4, 0x11, 0x2b, 0x62, 0xd5, 0x27, 0xad, 0x4f,
+	0x01, 0x59, 0x73, 0xcc, 0x6a, 0x53, 0xfb, 0x2d,
+	0xd5, 0x4e, 0x99, 0x21, 0x65, 0x4d, 0xf5, 0x82,
+	0xf7, 0xd8, 0x42, 0xce, 0x6f, 0x3d, 0x36, 0x47,
+	0xf1, 0x05, 0x16, 0xe8, 0x1b, 0x6a, 0x8f, 0x93,
+	0xf2, 0x8f, 0x37, 0x40, 0x12, 0x28, 0xa3, 0xe6,
+	0xb9, 0x17, 0x4a, 0x1f, 0xb1, 0xd1, 0x66, 0x69,
+	0x86, 0xc4, 0xfc, 0x97, 0xae, 0x3f, 0x8f, 0x1e,
+	0x2b, 0xdf, 0xcd, 0xf9, 0x3c
+};
+static const u8 enc_assoc011[] __initconst = {
+	0xd6, 0x31, 0xda, 0x5d, 0x42, 0x5e, 0xd7
+};
+static const u8 enc_nonce011[] __initconst = {
+	0xfd, 0x87, 0xd4, 0xd8, 0x62, 0xfd, 0xec, 0xaa
+};
+static const u8 enc_key011[] __initconst = {
+	0x35, 0x4e, 0xb5, 0x70, 0x50, 0x42, 0x8a, 0x85,
+	0xf2, 0xfb, 0xed, 0x7b, 0xd0, 0x9e, 0x97, 0xca,
+	0xfa, 0x98, 0x66, 0x63, 0xee, 0x37, 0xcc, 0x52,
+	0xfe, 0xd1, 0xdf, 0x95, 0x15, 0x34, 0x29, 0x38
+};
+
+static const u8 enc_input012[] __initconst = {
+	0x74, 0xa6, 0x3e, 0xe4, 0xb1, 0xcb, 0xaf, 0xb0,
+	0x40, 0xe5, 0x0f, 0x9e, 0xf1, 0xf2, 0x89, 0xb5,
+	0x42, 0x34, 0x8a, 0xa1, 0x03, 0xb7, 0xe9, 0x57,
+	0x46, 0xbe, 0x20, 0xe4, 0x6e, 0xb0, 0xeb, 0xff,
+	0xea, 0x07, 0x7e, 0xef, 0xe2, 0x55, 0x9f, 0xe5,
+	0x78, 0x3a, 0xb7, 0x83, 0xc2, 0x18, 0x40, 0x7b,
+	0xeb, 0xcd, 0x81, 0xfb, 0x90, 0x12, 0x9e, 0x46,
+	0xa9, 0xd6, 0x4a, 0xba, 0xb0, 0x62, 0xdb, 0x6b,
+	0x99, 0xc4, 0xdb, 0x54, 0x4b, 0xb8, 0xa5, 0x71,
+	0xcb, 0xcd, 0x63, 0x32, 0x55, 0xfb, 0x31, 0xf0,
+	0x38, 0xf5, 0xbe, 0x78, 0xe4, 0x45, 0xce, 0x1b,
+	0x6a, 0x5b, 0x0e, 0xf4, 0x16, 0xe4, 0xb1, 0x3d,
+	0xf6, 0x63, 0x7b, 0xa7, 0x0c, 0xde, 0x6f, 0x8f,
+	0x74, 0xdf, 0xe0, 0x1e, 0x9d, 0xce, 0x8f, 0x24,
+	0xef, 0x23, 0x35, 0x33, 0x7b, 0x83, 0x34, 0x23,
+	0x58, 0x74, 0x14, 0x77, 0x1f, 0xc2, 0x4f, 0x4e,
+	0xc6, 0x89, 0xf9, 0x52, 0x09, 0x37, 0x64, 0x14,
+	0xc4, 0x01, 0x6b, 0x9d, 0x77, 0xe8, 0x90, 0x5d,
+	0xa8, 0x4a, 0x2a, 0xef, 0x5c, 0x7f, 0xeb, 0xbb,
+	0xb2, 0xc6, 0x93, 0x99, 0x66, 0xdc, 0x7f, 0xd4,
+	0x9e, 0x2a, 0xca, 0x8d, 0xdb, 0xe7, 0x20, 0xcf,
+	0xe4, 0x73, 0xae, 0x49, 0x7d, 0x64, 0x0f, 0x0e,
+	0x28, 0x46, 0xa9, 0xa8, 0x32, 0xe4, 0x0e, 0xf6,
+	0x51, 0x53, 0xb8, 0x3c, 0xb1, 0xff, 0xa3, 0x33,
+	0x41, 0x75, 0xff, 0xf1, 0x6f, 0xf1, 0xfb, 0xbb,
+	0x83, 0x7f, 0x06, 0x9b, 0xe7, 0x1b, 0x0a, 0xe0,
+	0x5c, 0x33, 0x60, 0x5b, 0xdb, 0x5b, 0xed, 0xfe,
+	0xa5, 0x16, 0x19, 0x72, 0xa3, 0x64, 0x23, 0x00,
+	0x02, 0xc7, 0xf3, 0x6a, 0x81, 0x3e, 0x44, 0x1d,
+	0x79, 0x15, 0x5f, 0x9a, 0xde, 0xe2, 0xfd, 0x1b,
+	0x73, 0xc1, 0xbc, 0x23, 0xba, 0x31, 0xd2, 0x50,
+	0xd5, 0xad, 0x7f, 0x74, 0xa7, 0xc9, 0xf8, 0x3e,
+	0x2b, 0x26, 0x10, 0xf6, 0x03, 0x36, 0x74, 0xe4,
+	0x0e, 0x6a, 0x72, 0xb7, 0x73, 0x0a, 0x42, 0x28,
+	0xc2, 0xad, 0x5e, 0x03, 0xbe, 0xb8, 0x0b, 0xa8,
+	0x5b, 0xd4, 0xb8, 0xba, 0x52, 0x89, 0xb1, 0x9b,
+	0xc1, 0xc3, 0x65, 0x87, 0xed, 0xa5, 0xf4, 0x86,
+	0xfd, 0x41, 0x80, 0x91, 0x27, 0x59, 0x53, 0x67,
+	0x15, 0x78, 0x54, 0x8b, 0x2d, 0x3d, 0xc7, 0xff,
+	0x02, 0x92, 0x07, 0x5f, 0x7a, 0x4b, 0x60, 0x59,
+	0x3c, 0x6f, 0x5c, 0xd8, 0xec, 0x95, 0xd2, 0xfe,
+	0xa0, 0x3b, 0xd8, 0x3f, 0xd1, 0x69, 0xa6, 0xd6,
+	0x41, 0xb2, 0xf4, 0x4d, 0x12, 0xf4, 0x58, 0x3e,
+	0x66, 0x64, 0x80, 0x31, 0x9b, 0xa8, 0x4c, 0x8b,
+	0x07, 0xb2, 0xec, 0x66, 0x94, 0x66, 0x47, 0x50,
+	0x50, 0x5f, 0x18, 0x0b, 0x0e, 0xd6, 0xc0, 0x39,
+	0x21, 0x13, 0x9e, 0x33, 0xbc, 0x79, 0x36, 0x02,
+	0x96, 0x70, 0xf0, 0x48, 0x67, 0x2f, 0x26, 0xe9,
+	0x6d, 0x10, 0xbb, 0xd6, 0x3f, 0xd1, 0x64, 0x7a,
+	0x2e, 0xbe, 0x0c, 0x61, 0xf0, 0x75, 0x42, 0x38,
+	0x23, 0xb1, 0x9e, 0x9f, 0x7c, 0x67, 0x66, 0xd9,
+	0x58, 0x9a, 0xf1, 0xbb, 0x41, 0x2a, 0x8d, 0x65,
+	0x84, 0x94, 0xfc, 0xdc, 0x6a, 0x50, 0x64, 0xdb,
+	0x56, 0x33, 0x76, 0x00, 0x10, 0xed, 0xbe, 0xd2,
+	0x12, 0xf6, 0xf6, 0x1b, 0xa2, 0x16, 0xde, 0xae,
+	0x31, 0x95, 0xdd, 0xb1, 0x08, 0x7e, 0x4e, 0xee,
+	0xe7, 0xf9, 0xa5, 0xfb, 0x5b, 0x61, 0x43, 0x00,
+	0x40, 0xf6, 0x7e, 0x02, 0x04, 0x32, 0x4e, 0x0c,
+	0xe2, 0x66, 0x0d, 0xd7, 0x07, 0x98, 0x0e, 0xf8,
+	0x72, 0x34, 0x6d, 0x95, 0x86, 0xd7, 0xcb, 0x31,
+	0x54, 0x47, 0xd0, 0x38, 0x29, 0x9c, 0x5a, 0x68,
+	0xd4, 0x87, 0x76, 0xc9, 0xe7, 0x7e, 0xe3, 0xf4,
+	0x81, 0x6d, 0x18, 0xcb, 0xc9, 0x05, 0xaf, 0xa0,
+	0xfb, 0x66, 0xf7, 0xf1, 0x1c, 0xc6, 0x14, 0x11,
+	0x4f, 0x2b, 0x79, 0x42, 0x8b, 0xbc, 0xac, 0xe7,
+	0x6c, 0xfe, 0x0f, 0x58, 0xe7, 0x7c, 0x78, 0x39,
+	0x30, 0xb0, 0x66, 0x2c, 0x9b, 0x6d, 0x3a, 0xe1,
+	0xcf, 0xc9, 0xa4, 0x0e, 0x6d, 0x6d, 0x8a, 0xa1,
+	0x3a, 0xe7, 0x28, 0xd4, 0x78, 0x4c, 0xa6, 0xa2,
+	0x2a, 0xa6, 0x03, 0x30, 0xd7, 0xa8, 0x25, 0x66,
+	0x87, 0x2f, 0x69, 0x5c, 0x4e, 0xdd, 0xa5, 0x49,
+	0x5d, 0x37, 0x4a, 0x59, 0xc4, 0xaf, 0x1f, 0xa2,
+	0xe4, 0xf8, 0xa6, 0x12, 0x97, 0xd5, 0x79, 0xf5,
+	0xe2, 0x4a, 0x2b, 0x5f, 0x61, 0xe4, 0x9e, 0xe3,
+	0xee, 0xb8, 0xa7, 0x5b, 0x2f, 0xf4, 0x9e, 0x6c,
+	0xfb, 0xd1, 0xc6, 0x56, 0x77, 0xba, 0x75, 0xaa,
+	0x3d, 0x1a, 0xa8, 0x0b, 0xb3, 0x68, 0x24, 0x00,
+	0x10, 0x7f, 0xfd, 0xd7, 0xa1, 0x8d, 0x83, 0x54,
+	0x4f, 0x1f, 0xd8, 0x2a, 0xbe, 0x8a, 0x0c, 0x87,
+	0xab, 0xa2, 0xde, 0xc3, 0x39, 0xbf, 0x09, 0x03,
+	0xa5, 0xf3, 0x05, 0x28, 0xe1, 0xe1, 0xee, 0x39,
+	0x70, 0x9c, 0xd8, 0x81, 0x12, 0x1e, 0x02, 0x40,
+	0xd2, 0x6e, 0xf0, 0xeb, 0x1b, 0x3d, 0x22, 0xc6,
+	0xe5, 0xe3, 0xb4, 0x5a, 0x98, 0xbb, 0xf0, 0x22,
+	0x28, 0x8d, 0xe5, 0xd3, 0x16, 0x48, 0x24, 0xa5,
+	0xe6, 0x66, 0x0c, 0xf9, 0x08, 0xf9, 0x7e, 0x1e,
+	0xe1, 0x28, 0x26, 0x22, 0xc7, 0xc7, 0x0a, 0x32,
+	0x47, 0xfa, 0xa3, 0xbe, 0x3c, 0xc4, 0xc5, 0x53,
+	0x0a, 0xd5, 0x94, 0x4a, 0xd7, 0x93, 0xd8, 0x42,
+	0x99, 0xb9, 0x0a, 0xdb, 0x56, 0xf7, 0xb9, 0x1c,
+	0x53, 0x4f, 0xfa, 0xd3, 0x74, 0xad, 0xd9, 0x68,
+	0xf1, 0x1b, 0xdf, 0x61, 0xc6, 0x5e, 0xa8, 0x48,
+	0xfc, 0xd4, 0x4a, 0x4c, 0x3c, 0x32, 0xf7, 0x1c,
+	0x96, 0x21, 0x9b, 0xf9, 0xa3, 0xcc, 0x5a, 0xce,
+	0xd5, 0xd7, 0x08, 0x24, 0xf6, 0x1c, 0xfd, 0xdd,
+	0x38, 0xc2, 0x32, 0xe9, 0xb8, 0xe7, 0xb6, 0xfa,
+	0x9d, 0x45, 0x13, 0x2c, 0x83, 0xfd, 0x4a, 0x69,
+	0x82, 0xcd, 0xdc, 0xb3, 0x76, 0x0c, 0x9e, 0xd8,
+	0xf4, 0x1b, 0x45, 0x15, 0xb4, 0x97, 0xe7, 0x58,
+	0x34, 0xe2, 0x03, 0x29, 0x5a, 0xbf, 0xb6, 0xe0,
+	0x5d, 0x13, 0xd9, 0x2b, 0xb4, 0x80, 0xb2, 0x45,
+	0x81, 0x6a, 0x2e, 0x6c, 0x89, 0x7d, 0xee, 0xbb,
+	0x52, 0xdd, 0x1f, 0x18, 0xe7, 0x13, 0x6b, 0x33,
+	0x0e, 0xea, 0x36, 0x92, 0x77, 0x7b, 0x6d, 0x9c,
+	0x5a, 0x5f, 0x45, 0x7b, 0x7b, 0x35, 0x62, 0x23,
+	0xd1, 0xbf, 0x0f, 0xd0, 0x08, 0x1b, 0x2b, 0x80,
+	0x6b, 0x7e, 0xf1, 0x21, 0x47, 0xb0, 0x57, 0xd1,
+	0x98, 0x72, 0x90, 0x34, 0x1c, 0x20, 0x04, 0xff,
+	0x3d, 0x5c, 0xee, 0x0e, 0x57, 0x5f, 0x6f, 0x24,
+	0x4e, 0x3c, 0xea, 0xfc, 0xa5, 0xa9, 0x83, 0xc9,
+	0x61, 0xb4, 0x51, 0x24, 0xf8, 0x27, 0x5e, 0x46,
+	0x8c, 0xb1, 0x53, 0x02, 0x96, 0x35, 0xba, 0xb8,
+	0x4c, 0x71, 0xd3, 0x15, 0x59, 0x35, 0x22, 0x20,
+	0xad, 0x03, 0x9f, 0x66, 0x44, 0x3b, 0x9c, 0x35,
+	0x37, 0x1f, 0x9b, 0xbb, 0xf3, 0xdb, 0x35, 0x63,
+	0x30, 0x64, 0xaa, 0xa2, 0x06, 0xa8, 0x5d, 0xbb,
+	0xe1, 0x9f, 0x70, 0xec, 0x82, 0x11, 0x06, 0x36,
+	0xec, 0x8b, 0x69, 0x66, 0x24, 0x44, 0xc9, 0x4a,
+	0x57, 0xbb, 0x9b, 0x78, 0x13, 0xce, 0x9c, 0x0c,
+	0xba, 0x92, 0x93, 0x63, 0xb8, 0xe2, 0x95, 0x0f,
+	0x0f, 0x16, 0x39, 0x52, 0xfd, 0x3a, 0x6d, 0x02,
+	0x4b, 0xdf, 0x13, 0xd3, 0x2a, 0x22, 0xb4, 0x03,
+	0x7c, 0x54, 0x49, 0x96, 0x68, 0x54, 0x10, 0xfa,
+	0xef, 0xaa, 0x6c, 0xe8, 0x22, 0xdc, 0x71, 0x16,
+	0x13, 0x1a, 0xf6, 0x28, 0xe5, 0x6d, 0x77, 0x3d,
+	0xcd, 0x30, 0x63, 0xb1, 0x70, 0x52, 0xa1, 0xc5,
+	0x94, 0x5f, 0xcf, 0xe8, 0xb8, 0x26, 0x98, 0xf7,
+	0x06, 0xa0, 0x0a, 0x70, 0xfa, 0x03, 0x80, 0xac,
+	0xc1, 0xec, 0xd6, 0x4c, 0x54, 0xd7, 0xfe, 0x47,
+	0xb6, 0x88, 0x4a, 0xf7, 0x71, 0x24, 0xee, 0xf3,
+	0xd2, 0xc2, 0x4a, 0x7f, 0xfe, 0x61, 0xc7, 0x35,
+	0xc9, 0x37, 0x67, 0xcb, 0x24, 0x35, 0xda, 0x7e,
+	0xca, 0x5f, 0xf3, 0x8d, 0xd4, 0x13, 0x8e, 0xd6,
+	0xcb, 0x4d, 0x53, 0x8f, 0x53, 0x1f, 0xc0, 0x74,
+	0xf7, 0x53, 0xb9, 0x5e, 0x23, 0x37, 0xba, 0x6e,
+	0xe3, 0x9d, 0x07, 0x55, 0x25, 0x7b, 0xe6, 0x2a,
+	0x64, 0xd1, 0x32, 0xdd, 0x54, 0x1b, 0x4b, 0xc0,
+	0xe1, 0xd7, 0x69, 0x58, 0xf8, 0x93, 0x29, 0xc4,
+	0xdd, 0x23, 0x2f, 0xa5, 0xfc, 0x9d, 0x7e, 0xf8,
+	0xd4, 0x90, 0xcd, 0x82, 0x55, 0xdc, 0x16, 0x16,
+	0x9f, 0x07, 0x52, 0x9b, 0x9d, 0x25, 0xed, 0x32,
+	0xc5, 0x7b, 0xdf, 0xf6, 0x83, 0x46, 0x3d, 0x65,
+	0xb7, 0xef, 0x87, 0x7a, 0x12, 0x69, 0x8f, 0x06,
+	0x7c, 0x51, 0x15, 0x4a, 0x08, 0xe8, 0xac, 0x9a,
+	0x0c, 0x24, 0xa7, 0x27, 0xd8, 0x46, 0x2f, 0xe7,
+	0x01, 0x0e, 0x1c, 0xc6, 0x91, 0xb0, 0x6e, 0x85,
+	0x65, 0xf0, 0x29, 0x0d, 0x2e, 0x6b, 0x3b, 0xfb,
+	0x4b, 0xdf, 0xe4, 0x80, 0x93, 0x03, 0x66, 0x46,
+	0x3e, 0x8a, 0x6e, 0xf3, 0x5e, 0x4d, 0x62, 0x0e,
+	0x49, 0x05, 0xaf, 0xd4, 0xf8, 0x21, 0x20, 0x61,
+	0x1d, 0x39, 0x17, 0xf4, 0x61, 0x47, 0x95, 0xfb,
+	0x15, 0x2e, 0xb3, 0x4f, 0xd0, 0x5d, 0xf5, 0x7d,
+	0x40, 0xda, 0x90, 0x3c, 0x6b, 0xcb, 0x17, 0x00,
+	0x13, 0x3b, 0x64, 0x34, 0x1b, 0xf0, 0xf2, 0xe5,
+	0x3b, 0xb2, 0xc7, 0xd3, 0x5f, 0x3a, 0x44, 0xa6,
+	0x9b, 0xb7, 0x78, 0x0e, 0x42, 0x5d, 0x4c, 0xc1,
+	0xe9, 0xd2, 0xcb, 0xb7, 0x78, 0xd1, 0xfe, 0x9a,
+	0xb5, 0x07, 0xe9, 0xe0, 0xbe, 0xe2, 0x8a, 0xa7,
+	0x01, 0x83, 0x00, 0x8c, 0x5c, 0x08, 0xe6, 0x63,
+	0x12, 0x92, 0xb7, 0xb7, 0xa6, 0x19, 0x7d, 0x38,
+	0x13, 0x38, 0x92, 0x87, 0x24, 0xf9, 0x48, 0xb3,
+	0x5e, 0x87, 0x6a, 0x40, 0x39, 0x5c, 0x3f, 0xed,
+	0x8f, 0xee, 0xdb, 0x15, 0x82, 0x06, 0xda, 0x49,
+	0x21, 0x2b, 0xb5, 0xbf, 0x32, 0x7c, 0x9f, 0x42,
+	0x28, 0x63, 0xcf, 0xaf, 0x1e, 0xf8, 0xc6, 0xa0,
+	0xd1, 0x02, 0x43, 0x57, 0x62, 0xec, 0x9b, 0x0f,
+	0x01, 0x9e, 0x71, 0xd8, 0x87, 0x9d, 0x01, 0xc1,
+	0x58, 0x77, 0xd9, 0xaf, 0xb1, 0x10, 0x7e, 0xdd,
+	0xa6, 0x50, 0x96, 0xe5, 0xf0, 0x72, 0x00, 0x6d,
+	0x4b, 0xf8, 0x2a, 0x8f, 0x19, 0xf3, 0x22, 0x88,
+	0x11, 0x4a, 0x8b, 0x7c, 0xfd, 0xb7, 0xed, 0xe1,
+	0xf6, 0x40, 0x39, 0xe0, 0xe9, 0xf6, 0x3d, 0x25,
+	0xe6, 0x74, 0x3c, 0x58, 0x57, 0x7f, 0xe1, 0x22,
+	0x96, 0x47, 0x31, 0x91, 0xba, 0x70, 0x85, 0x28,
+	0x6b, 0x9f, 0x6e, 0x25, 0xac, 0x23, 0x66, 0x2f,
+	0x29, 0x88, 0x28, 0xce, 0x8c, 0x5c, 0x88, 0x53,
+	0xd1, 0x3b, 0xcc, 0x6a, 0x51, 0xb2, 0xe1, 0x28,
+	0x3f, 0x91, 0xb4, 0x0d, 0x00, 0x3a, 0xe3, 0xf8,
+	0xc3, 0x8f, 0xd7, 0x96, 0x62, 0x0e, 0x2e, 0xfc,
+	0xc8, 0x6c, 0x77, 0xa6, 0x1d, 0x22, 0xc1, 0xb8,
+	0xe6, 0x61, 0xd7, 0x67, 0x36, 0x13, 0x7b, 0xbb,
+	0x9b, 0x59, 0x09, 0xa6, 0xdf, 0xf7, 0x6b, 0xa3,
+	0x40, 0x1a, 0xf5, 0x4f, 0xb4, 0xda, 0xd3, 0xf3,
+	0x81, 0x93, 0xc6, 0x18, 0xd9, 0x26, 0xee, 0xac,
+	0xf0, 0xaa, 0xdf, 0xc5, 0x9c, 0xca, 0xc2, 0xa2,
+	0xcc, 0x7b, 0x5c, 0x24, 0xb0, 0xbc, 0xd0, 0x6a,
+	0x4d, 0x89, 0x09, 0xb8, 0x07, 0xfe, 0x87, 0xad,
+	0x0a, 0xea, 0xb8, 0x42, 0xf9, 0x5e, 0xb3, 0x3e,
+	0x36, 0x4c, 0xaf, 0x75, 0x9e, 0x1c, 0xeb, 0xbd,
+	0xbc, 0xbb, 0x80, 0x40, 0xa7, 0x3a, 0x30, 0xbf,
+	0xa8, 0x44, 0xf4, 0xeb, 0x38, 0xad, 0x29, 0xba,
+	0x23, 0xed, 0x41, 0x0c, 0xea, 0xd2, 0xbb, 0x41,
+	0x18, 0xd6, 0xb9, 0xba, 0x65, 0x2b, 0xa3, 0x91,
+	0x6d, 0x1f, 0xa9, 0xf4, 0xd1, 0x25, 0x8d, 0x4d,
+	0x38, 0xff, 0x64, 0xa0, 0xec, 0xde, 0xa6, 0xb6,
+	0x79, 0xab, 0x8e, 0x33, 0x6c, 0x47, 0xde, 0xaf,
+	0x94, 0xa4, 0xa5, 0x86, 0x77, 0x55, 0x09, 0x92,
+	0x81, 0x31, 0x76, 0xc7, 0x34, 0x22, 0x89, 0x8e,
+	0x3d, 0x26, 0x26, 0xd7, 0xfc, 0x1e, 0x16, 0x72,
+	0x13, 0x33, 0x63, 0xd5, 0x22, 0xbe, 0xb8, 0x04,
+	0x34, 0x84, 0x41, 0xbb, 0x80, 0xd0, 0x9f, 0x46,
+	0x48, 0x07, 0xa7, 0xfc, 0x2b, 0x3a, 0x75, 0x55,
+	0x8c, 0xc7, 0x6a, 0xbd, 0x7e, 0x46, 0x08, 0x84,
+	0x0f, 0xd5, 0x74, 0xc0, 0x82, 0x8e, 0xaa, 0x61,
+	0x05, 0x01, 0xb2, 0x47, 0x6e, 0x20, 0x6a, 0x2d,
+	0x58, 0x70, 0x48, 0x32, 0xa7, 0x37, 0xd2, 0xb8,
+	0x82, 0x1a, 0x51, 0xb9, 0x61, 0xdd, 0xfd, 0x9d,
+	0x6b, 0x0e, 0x18, 0x97, 0xf8, 0x45, 0x5f, 0x87,
+	0x10, 0xcf, 0x34, 0x72, 0x45, 0x26, 0x49, 0x70,
+	0xe7, 0xa3, 0x78, 0xe0, 0x52, 0x89, 0x84, 0x94,
+	0x83, 0x82, 0xc2, 0x69, 0x8f, 0xe3, 0xe1, 0x3f,
+	0x60, 0x74, 0x88, 0xc4, 0xf7, 0x75, 0x2c, 0xfb,
+	0xbd, 0xb6, 0xc4, 0x7e, 0x10, 0x0a,