linux-arm-kernel.lists.infradead.org archive mirror
* [RFC PATCH 00/18] crypto: wireguard using the existing crypto API
@ 2019-09-25 16:12 Ard Biesheuvel
  2019-09-25 16:12 ` [RFC PATCH 01/18] crypto: shash - add plumbing for operating on scatterlists Ard Biesheuvel
                   ` (14 more replies)
  0 siblings, 15 replies; 61+ messages in thread
From: Ard Biesheuvel @ 2019-09-25 16:12 UTC (permalink / raw)
  To: linux-crypto
  Cc: Jason A . Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Ard Biesheuvel, Greg KH, Eric Biggers, Samuel Neves, Will Deacon,
	Dan Carpenter, Andy Lutomirski, Marc Zyngier, Linus Torvalds,
	David Miller, linux-arm-kernel

This series proposes a way to incorporate WireGuard into the kernel
without relying on a wholesale replacement of the existing crypto
stack. It addresses two issues with the existing crypto API: the need
to perform a kmalloc() allocation for each request, and the fact that
it can only operate on scatterlists, which limits users to data that is
already accessible via an address in the linear map.
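
To illustrate the second point: wrapping a buffer in a scatterlist requires
resolving it to a struct page, which is only possible for addresses covered
by the linear map, and with CONFIG_VMAP_STACK the stack no longer qualifies.
A minimal sketch (not part of this series) of the problematic pattern:

#include <linux/scatterlist.h>

static void wrap_stack_buffer_in_sg(void)
{
	u8 msg[32];			/* small, fixed-size protocol message */
	struct scatterlist sg;

	/*
	 * sg_init_one() ends up going through virt_to_page(), which is only
	 * valid for linear-map addresses, so this breaks when 'msg' lives on
	 * a vmapped stack.
	 */
	sg_init_one(&sg, msg, sizeof(msg));
}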

In the implementation of WireGuard, there are a number of dependencies
on cryptographic transformations:
- curve25519, blake2s, and [x]chacha20poly1305 are all used in the
  protocol handling, handshakes etc., mostly on inputs of a fixed, short
  length that are usually allocated on the stack
- chacha20poly1305 is used for en/decrypting the actual packet data, using
  scatterlists to describe where the packet data is stored in memory
  (see the sketch below).
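
To make the second bullet concrete, here is a rough sketch (using the
generic skb_to_sgvec() helper, not code taken from WireGuard) of how packet
data sitting in an skb ends up being described by a scatterlist for a
scatterlist based crypto interface:

#include <linux/skbuff.h>
#include <linux/scatterlist.h>

/*
 * Sketch: describe the payload of a (non-chained) skb as a scatterlist so
 * it can be handed to a scatterlist based AEAD implementation. The caller
 * provides an sg array with enough entries (e.g. MAX_SKB_FRAGS + 1).
 */
static int skb_payload_to_sg(struct sk_buff *skb, struct scatterlist *sg,
			     int num_entries)
{
	sg_init_table(sg, num_entries);

	/* maps the linear data and page fragments; returns the number of
	 * scatterlist entries used, or a negative error */
	return skb_to_sgvec(skb, sg, 0, skb->len);
}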

The latter transformation is 99% compatible with the existing RFC7539
IPsec template in the crypto API, which means we already have the
plumbing to instantiate the correct transforms based on implementations
of ChaCha20 and Poly1305 that are provided per-architecture. Patch #18
shows the changes that need to be made to WireGuard to switch to the
crypto API for handling the packets. 
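
For reference, instantiating and keying that template through the AEAD API
looks roughly like the sketch below (simplified, with error handling
shortened; this is not code lifted from patch #18):

#include <linux/err.h>
#include <crypto/aead.h>

/*
 * Sketch: instantiate the rfc7539(chacha20,poly1305) template, which binds
 * to whatever ChaCha20/Poly1305 implementations the architecture provides,
 * and program the 256-bit session key and the 16-byte Poly1305 tag size.
 */
static struct crypto_aead *wg_alloc_aead(const u8 key[32])
{
	struct crypto_aead *tfm;

	tfm = crypto_alloc_aead("rfc7539(chacha20,poly1305)", 0, 0);
	if (IS_ERR(tfm))
		return tfm;

	if (crypto_aead_setkey(tfm, key, 32) ||
	    crypto_aead_setauthsize(tfm, 16)) {
		crypto_free_aead(tfm);
		return ERR_PTR(-EINVAL);
	}
	return tfm;
}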

The remaining uses of [x]chacha20poly1305 operate on stack buffers, and
so switching to the crypto AEAD API is not as straightforward. However,
for these cases, as well as the uses of blake2s and curve25519, the fact
that they operate on small, fixed size buffers means that there is
really no point in providing alternative, SIMD based implementations of
these, and we can limit ourselves to the generic C library versions.

Patches #1 .. #8 make some changes to the existing RFC7539 template and
the underlying ChaCha and Poly1305 drivers to reduce the number of times
that the template calls into the drivers, and to permit users of the
template to allocate the request structure on the stack instead of on
the heap, which removes the need for doing per-packet heap allocations
on the hot path.
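
To illustrate what a zero reqsize buys us: once the sync template instance
reports crypto_aead_reqsize() == 0, the request structure has no
driver-private tail and can simply live on the caller's stack. A sketch
under that assumption (this is not a new API added by the series):

#include <crypto/aead.h>

/*
 * Sketch: a stack-allocated AEAD request. This is only valid because the
 * synchronous rfc7539(chacha20,poly1305) instantiation advertises a zero
 * reqsize after patches #1..#8, i.e. nothing trails struct aead_request.
 */
static void wg_init_onstack_request(struct crypto_aead *tfm)
{
	/* room for the request itself only: reqsize is known to be zero */
	char buf[sizeof(struct aead_request)] CRYPTO_MINALIGN_ATTR;
	struct aead_request *req = (void *)buf;

	aead_request_set_tfm(req, tfm);
	aead_request_set_callback(req, 0, NULL, NULL);
	/* followed by aead_request_set_ad(), aead_request_set_crypt() and
	 * crypto_aead_encrypt()/decrypt() on the packet scatterlist */
}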

Patches #9 and #10 refactor the existing Poly1305 code so we can easily
layer the ChaCha20Poly1305 construction library on top in patch #14.

Patches #12 and #13 import the C implementations of blake2s and Curve25519
from the Zinc patch set, but move them into lib/crypto, which is where
we keep generic crypto library C code. (Patch #11 is a preparatory patch for
patch #13.) The selftests are included as well.

Patch #14 incorporates the [x]chacha20poly1305 library interface from Zinc,
but instead of providing numerous new implementations of ChaCha20 and Poly1305,
it is built on top of the existing ChaCha and Poly1305 library code that we
already have in the kernel. The original selftests that operate on 64-bit
nonces are included as well. (The ones using 96-bit nonces were dropped,
since the library interface [as it was defined originally] only supports
64-bit nonces in the first place.)
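
For reference, the resulting library interface amounts to a plain function
call on flat buffers with a 64-bit nonce, roughly as sketched below (based
on the imported header; treat the exact prototype as indicative rather than
authoritative):

#include <crypto/chacha20poly1305.h>

/*
 * Sketch: seal a small, stack-allocated handshake message. 'dst' must have
 * room for the 16-byte Poly1305 tag in addition to 'len' bytes of
 * ciphertext; no associated data is passed in this example.
 */
static void seal_handshake_msg(u8 *dst, const u8 *src, size_t len, u64 counter,
			       const u8 key[CHACHA20POLY1305_KEY_SIZE])
{
	chacha20poly1305_encrypt(dst, src, len, NULL, 0, counter, key);
}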

Patch #15 is the original patch that adds WireGuard itself, and was taken
from the last series that Jason sent to the list ~6 months ago. It is
included verbatim to better illustrate the nature of the changes being
applied in the move to the crypto API.

Patch #16 is a followup fix for WireGuard that was taken from Jason's
repository, and is required to run WireGuard on recent kernels.

Patch #17 moves wireguard over to the crypto library headers in crypto/
rather than in zinc/.

Patch #18 switches wireguard from the chacha20poly1305 library API to
the crypto API. Note that RFC7539 defines a 96-bit nonce whereas WireGuard
only uses 64 bits, so some of the changes in this patch were needed just to
account for that.
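
The nonce handling itself is mechanical; a sketch of what the conversion
amounts to (the helper name is made up for illustration, and the layout
matches how the library interface expands its 64-bit nonce):

#include <linux/types.h>
#include <linux/string.h>
#include <asm/unaligned.h>

/*
 * Sketch: expand WireGuard's 64-bit packet counter into the 96-bit IV that
 * the RFC7539 template expects: 32 zero bits followed by the counter in
 * little-endian byte order.
 */
static void wg_counter_to_rfc7539_iv(u64 counter, u8 iv[12])
{
	memset(iv, 0, 4);
	put_unaligned_le64(counter, iv + 4);
}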

Note that support for the rfc7539(chacha20,poly1305) algorithm has already
been implemented by at least two drivers for asynchronous accelerators, and
it seems relatively straightforward to modify WireGuard further to support
asynchronous completions, and offload all the per-packet crypto to a separate
IP block. (People have argued in the past that accelerators are irrelevant
since CPUs perform better, but 'speed' is not the only performance metric
that people care about - 'battery life' is another one that comes to mind.)

Cc: Herbert Xu <herbert@gondor.apana.org.au> 
Cc: David Miller <davem@davemloft.net>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Samuel Neves <sneves@dei.uc.pt>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>

Ard Biesheuvel (15):
  crypto: shash - add plumbing for operating on scatterlists
  crypto: x86/poly1305 - implement .update_from_sg method
  crypto: arm/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON
    implementation
  crypto: arm64/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON
    implementation
  crypto: chacha - move existing library code into lib/crypto
  crypto: rfc7539 - switch to shash for Poly1305
  crypto: rfc7539 - use zero reqsize for sync instantiations without
    alignmask
  crypto: testmgr - add a chacha20poly1305 test case
  crypto: poly1305 - move core algorithm into lib/crypto
  crypto: poly1305 - add init/update/final library routines
  int128: move __uint128_t compiler test to Kconfig
  crypto: chacha20poly1305 - import construction and selftest from Zinc
  netlink: use new strict length types in policy for 5.2
  wg switch to lib/crypto algos
  net: wireguard - switch to crypto API for packet encryption

Jason A. Donenfeld (3):
  crypto: BLAKE2s - generic C library implementation and selftest
  crypto: Curve25519 - generic C library implementations and selftest
  net: WireGuard secure network tunnel

 MAINTAINERS                                  |    8 +
 arch/arm/crypto/Kconfig                      |    3 +
 arch/arm/crypto/Makefile                     |    7 +-
 arch/arm/crypto/chacha-neon-glue.c           |    2 +-
 arch/arm/crypto/poly1305-armv4.pl            | 1236 ++++
 arch/arm/crypto/poly1305-core.S_shipped      | 1158 +++
 arch/arm/crypto/poly1305-glue.c              |  253 +
 arch/arm64/crypto/Kconfig                    |    4 +
 arch/arm64/crypto/Makefile                   |    9 +-
 arch/arm64/crypto/chacha-neon-glue.c         |    2 +-
 arch/arm64/crypto/poly1305-armv8.pl          |  913 +++
 arch/arm64/crypto/poly1305-core.S_shipped    |  835 +++
 arch/arm64/crypto/poly1305-glue.c            |  215 +
 arch/x86/crypto/chacha_glue.c                |    2 +-
 arch/x86/crypto/poly1305_glue.c              |   56 +-
 crypto/Kconfig                               |   14 +
 crypto/adiantum.c                            |    5 +-
 crypto/ahash.c                               |   18 +
 crypto/chacha20poly1305.c                    |  540 +-
 crypto/chacha_generic.c                      |   42 +-
 crypto/ecc.c                                 |    2 +-
 crypto/nhpoly1305.c                          |    3 +-
 crypto/poly1305_generic.c                    |  218 +-
 crypto/shash.c                               |   24 +
 crypto/testmgr.h                             |   45 +
 drivers/net/Kconfig                          |   30 +
 drivers/net/Makefile                         |    1 +
 drivers/net/wireguard/Makefile               |   18 +
 drivers/net/wireguard/allowedips.c           |  377 +
 drivers/net/wireguard/allowedips.h           |   59 +
 drivers/net/wireguard/cookie.c               |  236 +
 drivers/net/wireguard/cookie.h               |   59 +
 drivers/net/wireguard/device.c               |  460 ++
 drivers/net/wireguard/device.h               |   65 +
 drivers/net/wireguard/main.c                 |   64 +
 drivers/net/wireguard/messages.h             |  128 +
 drivers/net/wireguard/netlink.c              |  621 ++
 drivers/net/wireguard/netlink.h              |   12 +
 drivers/net/wireguard/noise.c                |  837 +++
 drivers/net/wireguard/noise.h                |  132 +
 drivers/net/wireguard/peer.c                 |  239 +
 drivers/net/wireguard/peer.h                 |   83 +
 drivers/net/wireguard/peerlookup.c           |  221 +
 drivers/net/wireguard/peerlookup.h           |   64 +
 drivers/net/wireguard/queueing.c             |   53 +
 drivers/net/wireguard/queueing.h             |  199 +
 drivers/net/wireguard/ratelimiter.c          |  223 +
 drivers/net/wireguard/ratelimiter.h          |   19 +
 drivers/net/wireguard/receive.c              |  617 ++
 drivers/net/wireguard/selftest/allowedips.c  |  682 ++
 drivers/net/wireguard/selftest/counter.c     |  104 +
 drivers/net/wireguard/selftest/ratelimiter.c |  226 +
 drivers/net/wireguard/send.c                 |  442 ++
 drivers/net/wireguard/socket.c               |  433 ++
 drivers/net/wireguard/socket.h               |   44 +
 drivers/net/wireguard/timers.c               |  241 +
 drivers/net/wireguard/timers.h               |   31 +
 drivers/net/wireguard/version.h              |    1 +
 include/crypto/blake2s.h                     |   56 +
 include/crypto/chacha.h                      |   37 +-
 include/crypto/chacha20poly1305.h            |   37 +
 include/crypto/curve25519.h                  |   28 +
 include/crypto/hash.h                        |    3 +
 include/crypto/internal/chacha.h             |   25 +
 include/crypto/internal/hash.h               |   19 +
 include/crypto/internal/poly1305.h           |   33 +
 include/crypto/poly1305.h                    |   34 +-
 include/uapi/linux/wireguard.h               |  190 +
 init/Kconfig                                 |    1 +
 lib/Makefile                                 |    3 +-
 lib/crypto/Makefile                          |   39 +-
 lib/crypto/blake2s-selftest.c                | 2093 ++++++
 lib/crypto/blake2s.c                         |  274 +
 lib/{ => crypto}/chacha.c                    |   23 +
 lib/crypto/chacha20poly1305-selftest.c       | 7349 ++++++++++++++++++++
 lib/crypto/chacha20poly1305.c                |  216 +
 lib/crypto/curve25519-fiat32.c               |  864 +++
 lib/crypto/curve25519-hacl64.c               |  788 +++
 lib/crypto/curve25519-selftest.c             | 1321 ++++
 lib/crypto/curve25519.c                      |   73 +
 lib/crypto/poly1305.c                        |  216 +
 lib/ubsan.c                                  |    2 +-
 lib/ubsan.h                                  |    2 +-
 tools/testing/selftests/wireguard/netns.sh   |  503 ++
 84 files changed, 26192 insertions(+), 672 deletions(-)
 create mode 100644 arch/arm/crypto/poly1305-armv4.pl
 create mode 100644 arch/arm/crypto/poly1305-core.S_shipped
 create mode 100644 arch/arm/crypto/poly1305-glue.c
 create mode 100644 arch/arm64/crypto/poly1305-armv8.pl
 create mode 100644 arch/arm64/crypto/poly1305-core.S_shipped
 create mode 100644 arch/arm64/crypto/poly1305-glue.c
 create mode 100644 drivers/net/wireguard/Makefile
 create mode 100644 drivers/net/wireguard/allowedips.c
 create mode 100644 drivers/net/wireguard/allowedips.h
 create mode 100644 drivers/net/wireguard/cookie.c
 create mode 100644 drivers/net/wireguard/cookie.h
 create mode 100644 drivers/net/wireguard/device.c
 create mode 100644 drivers/net/wireguard/device.h
 create mode 100644 drivers/net/wireguard/main.c
 create mode 100644 drivers/net/wireguard/messages.h
 create mode 100644 drivers/net/wireguard/netlink.c
 create mode 100644 drivers/net/wireguard/netlink.h
 create mode 100644 drivers/net/wireguard/noise.c
 create mode 100644 drivers/net/wireguard/noise.h
 create mode 100644 drivers/net/wireguard/peer.c
 create mode 100644 drivers/net/wireguard/peer.h
 create mode 100644 drivers/net/wireguard/peerlookup.c
 create mode 100644 drivers/net/wireguard/peerlookup.h
 create mode 100644 drivers/net/wireguard/queueing.c
 create mode 100644 drivers/net/wireguard/queueing.h
 create mode 100644 drivers/net/wireguard/ratelimiter.c
 create mode 100644 drivers/net/wireguard/ratelimiter.h
 create mode 100644 drivers/net/wireguard/receive.c
 create mode 100644 drivers/net/wireguard/selftest/allowedips.c
 create mode 100644 drivers/net/wireguard/selftest/counter.c
 create mode 100644 drivers/net/wireguard/selftest/ratelimiter.c
 create mode 100644 drivers/net/wireguard/send.c
 create mode 100644 drivers/net/wireguard/socket.c
 create mode 100644 drivers/net/wireguard/socket.h
 create mode 100644 drivers/net/wireguard/timers.c
 create mode 100644 drivers/net/wireguard/timers.h
 create mode 100644 drivers/net/wireguard/version.h
 create mode 100644 include/crypto/blake2s.h
 create mode 100644 include/crypto/chacha20poly1305.h
 create mode 100644 include/crypto/curve25519.h
 create mode 100644 include/crypto/internal/chacha.h
 create mode 100644 include/crypto/internal/poly1305.h
 create mode 100644 include/uapi/linux/wireguard.h
 create mode 100644 lib/crypto/blake2s-selftest.c
 create mode 100644 lib/crypto/blake2s.c
 rename lib/{ => crypto}/chacha.c (85%)
 create mode 100644 lib/crypto/chacha20poly1305-selftest.c
 create mode 100644 lib/crypto/chacha20poly1305.c
 create mode 100644 lib/crypto/curve25519-fiat32.c
 create mode 100644 lib/crypto/curve25519-hacl64.c
 create mode 100644 lib/crypto/curve25519-selftest.c
 create mode 100644 lib/crypto/curve25519.c
 create mode 100644 lib/crypto/poly1305.c
 create mode 100755 tools/testing/selftests/wireguard/netns.sh

-- 
2.20.1


* [RFC PATCH 01/18] crypto: shash - add plumbing for operating on scatterlists
  2019-09-25 16:12 [RFC PATCH 00/18] crypto: wireguard using the existing crypto API Ard Biesheuvel
@ 2019-09-25 16:12 ` Ard Biesheuvel
  2019-09-25 16:12 ` [RFC PATCH 02/18] crypto: x86/poly1305 - implement .update_from_sg method Ard Biesheuvel
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 61+ messages in thread
From: Ard Biesheuvel @ 2019-09-25 16:12 UTC (permalink / raw)
  To: linux-crypto
  Cc: Jason A . Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Ard Biesheuvel, Greg KH, Eric Biggers, Samuel Neves, Will Deacon,
	Dan Carpenter, Andy Lutomirski, Marc Zyngier, Linus Torvalds,
	David Miller, linux-arm-kernel

Add an internal method to the shash interface that permits templates
to invoke it with a scatterlist. Drivers implementing the shash
interface can opt into using this method, making it more straightforward
for templates to pass down data provided via scatterlists without forcing
the underlying shash to process each scatterlist entry with a discrete
update() call. This will be used later in the SIMD accelerated Poly1305
to amortize SIMD begin()/end() calls over the entire input.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 crypto/ahash.c                 | 18 +++++++++++++++
 crypto/shash.c                 | 24 ++++++++++++++++++++
 include/crypto/hash.h          |  3 +++
 include/crypto/internal/hash.h | 19 ++++++++++++++++
 4 files changed, 64 insertions(+)

diff --git a/crypto/ahash.c b/crypto/ahash.c
index 3815b363a693..aecb48f0f50c 100644
--- a/crypto/ahash.c
+++ b/crypto/ahash.c
@@ -144,6 +144,24 @@ int crypto_hash_walk_first(struct ahash_request *req,
 }
 EXPORT_SYMBOL_GPL(crypto_hash_walk_first);
 
+int crypto_shash_walk_sg(struct shash_desc *desc, struct scatterlist *sg,
+			 int nbytes, struct crypto_hash_walk *walk, int flags)
+{
+	walk->total = nbytes;
+
+	if (!walk->total) {
+		walk->entrylen = 0;
+		return 0;
+	}
+
+	walk->alignmask = crypto_shash_alignmask(desc->tfm);
+	walk->sg = sg;
+	walk->flags = flags;
+
+	return hash_walk_new_entry(walk);
+}
+EXPORT_SYMBOL_GPL(crypto_shash_walk_sg);
+
 int crypto_ahash_walk_first(struct ahash_request *req,
 			    struct crypto_hash_walk *walk)
 {
diff --git a/crypto/shash.c b/crypto/shash.c
index e83c5124f6eb..b16ab5590dc4 100644
--- a/crypto/shash.c
+++ b/crypto/shash.c
@@ -121,6 +121,30 @@ int crypto_shash_update(struct shash_desc *desc, const u8 *data,
 }
 EXPORT_SYMBOL_GPL(crypto_shash_update);
 
+int crypto_shash_update_from_sg(struct shash_desc *desc, struct scatterlist *sg,
+				unsigned int len, bool atomic)
+{
+	struct crypto_shash *tfm = desc->tfm;
+	struct shash_alg *shash = crypto_shash_alg(tfm);
+	struct crypto_hash_walk walk;
+	int flags = 0;
+	int nbytes;
+
+	if (!atomic)
+		flags = CRYPTO_TFM_REQ_MAY_SLEEP;
+
+	if (shash->update_from_sg)
+		return shash->update_from_sg(desc, sg, len, flags);
+
+	for (nbytes = crypto_shash_walk_sg(desc, sg, len, &walk, flags);
+	     nbytes > 0;
+	     nbytes = crypto_hash_walk_done(&walk, nbytes))
+		nbytes = crypto_shash_update(desc, walk.data, nbytes);
+
+	return nbytes;
+}
+EXPORT_SYMBOL_GPL(crypto_shash_update_from_sg);
+
 static int shash_final_unaligned(struct shash_desc *desc, u8 *out)
 {
 	struct crypto_shash *tfm = desc->tfm;
diff --git a/include/crypto/hash.h b/include/crypto/hash.h
index ef10c370605a..0b83d85a3828 100644
--- a/include/crypto/hash.h
+++ b/include/crypto/hash.h
@@ -158,6 +158,7 @@ struct shash_desc {
  * struct shash_alg - synchronous message digest definition
  * @init: see struct ahash_alg
  * @update: see struct ahash_alg
+ * @update_from_sg: variant of update() taking a scatterlist as input [optional]
  * @final: see struct ahash_alg
  * @finup: see struct ahash_alg
  * @digest: see struct ahash_alg
@@ -175,6 +176,8 @@ struct shash_alg {
 	int (*init)(struct shash_desc *desc);
 	int (*update)(struct shash_desc *desc, const u8 *data,
 		      unsigned int len);
+	int (*update_from_sg)(struct shash_desc *desc, struct scatterlist *sg,
+			      unsigned int len, int flags);
 	int (*final)(struct shash_desc *desc, u8 *out);
 	int (*finup)(struct shash_desc *desc, const u8 *data,
 		     unsigned int len, u8 *out);
diff --git a/include/crypto/internal/hash.h b/include/crypto/internal/hash.h
index bfc9db7b100d..6f4bfa057bea 100644
--- a/include/crypto/internal/hash.h
+++ b/include/crypto/internal/hash.h
@@ -50,6 +50,8 @@ extern const struct crypto_type crypto_ahash_type;
 int crypto_hash_walk_done(struct crypto_hash_walk *walk, int err);
 int crypto_hash_walk_first(struct ahash_request *req,
 			   struct crypto_hash_walk *walk);
+int crypto_shash_walk_sg(struct shash_desc *desc, struct scatterlist *sg,
+			 int nbytes, struct crypto_hash_walk *walk, int flags);
 int crypto_ahash_walk_first(struct ahash_request *req,
 			   struct crypto_hash_walk *walk);
 
@@ -242,5 +244,22 @@ static inline struct crypto_shash *__crypto_shash_cast(struct crypto_tfm *tfm)
 	return container_of(tfm, struct crypto_shash, base);
 }
 
+/**
+ * crypto_shash_update_from_sg() - add data from a scatterlist to message digest
+ * 				   for processing
+ * @desc: operational state handle that is already initialized
+ * @sg: scatterlist with input data to be added to the message digest
+ * @len: length of the input data
+ * @atomic: true if the call must not sleep, false if it may sleep
+ *
+ * Updates the message digest state of the operational state handle.
+ *
+ * Context: Any context.
+ * Return: 0 if the message digest update was successful; < 0 if an error
+ *	   occurred
+ */
+int crypto_shash_update_from_sg(struct shash_desc *desc, struct scatterlist *sg,
+				unsigned int len, bool atomic);
+
 #endif	/* _CRYPTO_INTERNAL_HASH_H */
 
-- 
2.20.1


* [RFC PATCH 02/18] crypto: x86/poly1305 - implement .update_from_sg method
  2019-09-25 16:12 [RFC PATCH 00/18] crypto: wireguard using the existing crypto API Ard Biesheuvel
  2019-09-25 16:12 ` [RFC PATCH 01/18] crypto: shash - add plumbing for operating on scatterlists Ard Biesheuvel
@ 2019-09-25 16:12 ` Ard Biesheuvel
  2019-09-25 16:12 ` [RFC PATCH 03/18] crypto: arm/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation Ard Biesheuvel
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 61+ messages in thread
From: Ard Biesheuvel @ 2019-09-25 16:12 UTC (permalink / raw)
  To: linux-crypto
  Cc: Jason A . Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Ard Biesheuvel, Greg KH, Eric Biggers, Samuel Neves, Will Deacon,
	Dan Carpenter, Andy Lutomirski, Marc Zyngier, Linus Torvalds,
	David Miller, linux-arm-kernel

In order to reduce the number of calls made by the RFC7539 template
into the Poly1305 driver, implement the new internal .update_from_sg
method, which allows the driver to amortize the cost of FPU preserve/
restore sequences over a larger chunk of input.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/x86/crypto/poly1305_glue.c | 54 ++++++++++++++++----
 1 file changed, 43 insertions(+), 11 deletions(-)

diff --git a/arch/x86/crypto/poly1305_glue.c b/arch/x86/crypto/poly1305_glue.c
index 4a1c05dce950..f2afaa8e23c2 100644
--- a/arch/x86/crypto/poly1305_glue.c
+++ b/arch/x86/crypto/poly1305_glue.c
@@ -115,18 +115,11 @@ static unsigned int poly1305_simd_blocks(struct poly1305_desc_ctx *dctx,
 	return srclen;
 }
 
-static int poly1305_simd_update(struct shash_desc *desc,
-				const u8 *src, unsigned int srclen)
+static void poly1305_simd_do_update(struct shash_desc *desc,
+				    const u8 *src, unsigned int srclen)
 {
-	struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
 	unsigned int bytes;
 
-	/* kernel_fpu_begin/end is costly, use fallback for small updates */
-	if (srclen <= 288 || !crypto_simd_usable())
-		return crypto_poly1305_update(desc, src, srclen);
-
-	kernel_fpu_begin();
-
 	if (unlikely(dctx->buflen)) {
 		bytes = min(srclen, POLY1305_BLOCK_SIZE - dctx->buflen);
 		memcpy(dctx->buf + dctx->buflen, src, bytes);
@@ -147,12 +140,50 @@ static int poly1305_simd_update(struct shash_desc *desc,
 		srclen = bytes;
 	}
 
-	kernel_fpu_end();
-
 	if (unlikely(srclen)) {
 		dctx->buflen = srclen;
 		memcpy(dctx->buf, src, srclen);
 	}
+}
+
+static int poly1305_simd_update(struct shash_desc *desc,
+				const u8 *src, unsigned int srclen)
+{
+	/* kernel_fpu_begin/end is costly, use fallback for small updates */
+	if (srclen <= 288 || !crypto_simd_usable())
+		return crypto_poly1305_update(desc, src, srclen);
+
+	kernel_fpu_begin();
+	poly1305_simd_do_update(desc, src, srclen);
+	kernel_fpu_end();
+
+	return 0;
+}
+
+static int poly1305_simd_update_from_sg(struct shash_desc *desc,
+					struct scatterlist *sg,
+					unsigned int srclen,
+					int flags)
+{
+	bool do_simd = crypto_simd_usable() && srclen > 288;
+	struct crypto_hash_walk walk;
+	int nbytes;
+
+	if (do_simd) {
+		kernel_fpu_begin();
+		flags &= ~CRYPTO_TFM_REQ_MAY_SLEEP;
+	}
+
+	for (nbytes = crypto_shash_walk_sg(desc, sg, srclen, &walk, flags);
+	     nbytes > 0;
+	     nbytes = crypto_hash_walk_done(&walk, 0)) {
+		if (do_simd)
+			poly1305_simd_do_update(desc, walk.data, nbytes);
+		else
+			crypto_poly1305_update(desc, walk.data, nbytes);
+	}
+	if (do_simd)
+		kernel_fpu_end();
 
 	return 0;
 }
@@ -161,6 +192,7 @@ static struct shash_alg alg = {
 	.digestsize	= POLY1305_DIGEST_SIZE,
 	.init		= poly1305_simd_init,
 	.update		= poly1305_simd_update,
+	.update_from_sg	= poly1305_simd_update_from_sg,
 	.final		= crypto_poly1305_final,
 	.descsize	= sizeof(struct poly1305_simd_desc_ctx),
 	.base		= {
-- 
2.20.1


* [RFC PATCH 03/18] crypto: arm/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation
  2019-09-25 16:12 [RFC PATCH 00/18] crypto: wireguard using the existing crypto API Ard Biesheuvel
  2019-09-25 16:12 ` [RFC PATCH 01/18] crypto: shash - add plumbing for operating on scatterlists Ard Biesheuvel
  2019-09-25 16:12 ` [RFC PATCH 02/18] crypto: x86/poly1305 - implement .update_from_sg method Ard Biesheuvel
@ 2019-09-25 16:12 ` Ard Biesheuvel
  2019-09-25 16:12 ` [RFC PATCH 04/18] crypto: arm64/poly1305 " Ard Biesheuvel
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 61+ messages in thread
From: Ard Biesheuvel @ 2019-09-25 16:12 UTC (permalink / raw)
  To: linux-crypto
  Cc: Jason A . Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Ard Biesheuvel, Greg KH, Eric Biggers, Andy Polyakov,
	Samuel Neves, Will Deacon, Dan Carpenter, Andy Lutomirski,
	Marc Zyngier, Linus Torvalds, David Miller, linux-arm-kernel

This is a straight import of the OpenSSL/CRYPTOGAMS Poly1305 implementation
for NEON authored by Andy Polyakov, and contributed by him to the OpenSSL
project. The file 'poly1305-armv4.pl' is taken straight from this upstream
GitHub repository [0] at commit ec55a08dc0244ce570c4fc7cade330c60798952f,
and already contains all the changes required to build it as part of a
Linux kernel module.

[0] https://github.com/dot-asm/cryptogams

Co-developed-by: Andy Polyakov <appro@cryptogams.org>
Signed-off-by: Andy Polyakov <appro@cryptogams.org>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm/crypto/Kconfig                 |    3 +
 arch/arm/crypto/Makefile                |    7 +-
 arch/arm/crypto/poly1305-armv4.pl       | 1236 ++++++++++++++++++++
 arch/arm/crypto/poly1305-core.S_shipped | 1158 ++++++++++++++++++
 arch/arm/crypto/poly1305-glue.c         |  253 ++++
 5 files changed, 2656 insertions(+), 1 deletion(-)

diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
index b24df84a1d7a..ae7fc69585f9 100644
--- a/arch/arm/crypto/Kconfig
+++ b/arch/arm/crypto/Kconfig
@@ -131,6 +131,9 @@ config CRYPTO_CHACHA20_NEON
 	select CRYPTO_BLKCIPHER
 	select CRYPTO_CHACHA20
 
+config CRYPTO_POLY1305_ARM
+	tristate "Accelerated scalar and SIMD Poly1305 hash implementations"
+
 config CRYPTO_NHPOLY1305_NEON
 	tristate "NEON accelerated NHPoly1305 hash function (for Adiantum)"
 	depends on KERNEL_MODE_NEON
diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
index 4180f3a13512..c9d5fab8ad45 100644
--- a/arch/arm/crypto/Makefile
+++ b/arch/arm/crypto/Makefile
@@ -10,6 +10,7 @@ obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
 obj-$(CONFIG_CRYPTO_SHA256_ARM) += sha256-arm.o
 obj-$(CONFIG_CRYPTO_SHA512_ARM) += sha512-arm.o
 obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha-neon.o
+obj-$(CONFIG_CRYPTO_POLY1305_ARM) += poly1305-arm.o
 obj-$(CONFIG_CRYPTO_NHPOLY1305_NEON) += nhpoly1305-neon.o
 
 ce-obj-$(CONFIG_CRYPTO_AES_ARM_CE) += aes-arm-ce.o
@@ -54,12 +55,16 @@ ghash-arm-ce-y	:= ghash-ce-core.o ghash-ce-glue.o
 crct10dif-arm-ce-y	:= crct10dif-ce-core.o crct10dif-ce-glue.o
 crc32-arm-ce-y:= crc32-ce-core.o crc32-ce-glue.o
 chacha-neon-y := chacha-neon-core.o chacha-neon-glue.o
+poly1305-arm-y := poly1305-core.o poly1305-glue.o
 nhpoly1305-neon-y := nh-neon-core.o nhpoly1305-neon-glue.o
 
 ifdef REGENERATE_ARM_CRYPTO
 quiet_cmd_perl = PERL    $@
       cmd_perl = $(PERL) $(<) > $(@)
 
+$(src)/poly1305-core.S_shipped: $(src)/poly1305-armv4.pl
+	$(call cmd,perl)
+
 $(src)/sha256-core.S_shipped: $(src)/sha256-armv4.pl
 	$(call cmd,perl)
 
@@ -67,4 +72,4 @@ $(src)/sha512-core.S_shipped: $(src)/sha512-armv4.pl
 	$(call cmd,perl)
 endif
 
-clean-files += sha256-core.S sha512-core.S
+clean-files += poly1305-core.S sha256-core.S sha512-core.S
diff --git a/arch/arm/crypto/poly1305-armv4.pl b/arch/arm/crypto/poly1305-armv4.pl
new file mode 100644
index 000000000000..6d79498d3115
--- /dev/null
+++ b/arch/arm/crypto/poly1305-armv4.pl
@@ -0,0 +1,1236 @@
+#!/usr/bin/env perl
+# SPDX-License-Identifier: GPL-1.0+ OR BSD-3-Clause
+#
+# ====================================================================
+# Written by Andy Polyakov, @dot-asm, initially for the OpenSSL
+# project.
+# ====================================================================
+#
+#			IALU(*)/gcc-4.4		NEON
+#
+# ARM11xx(ARMv6)	7.78/+100%		-
+# Cortex-A5		6.35/+130%		3.00
+# Cortex-A8		6.25/+115%		2.36
+# Cortex-A9		5.10/+95%		2.55
+# Cortex-A15		3.85/+85%		1.25(**)
+# Snapdragon S4		5.70/+100%		1.48(**)
+#
+# (*)	this is for -march=armv6, i.e. with bunch of ldrb loading data;
+# (**)	these are trade-off results, they can be improved by ~8% but at
+#	the cost of 15/12% regression on Cortex-A5/A7, it's even possible
+#	to improve Cortex-A9 result, but then A5/A7 loose more than 20%;
+
+$flavour = shift;
+if ($flavour=~/\w[\w\-]*\.\w+$/) { $output=$flavour; undef $flavour; }
+else { while (($output=shift) && ($output!~/\w[\w\-]*\.\w+$/)) {} }
+
+if ($flavour && $flavour ne "void") {
+    $0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1;
+    ( $xlate="${dir}arm-xlate.pl" and -f $xlate ) or
+    ( $xlate="${dir}../../perlasm/arm-xlate.pl" and -f $xlate) or
+    die "can't locate arm-xlate.pl";
+
+    open STDOUT,"| \"$^X\" $xlate $flavour $output";
+} else {
+    open STDOUT,">$output";
+}
+
+($ctx,$inp,$len,$padbit)=map("r$_",(0..3));
+
+$code.=<<___;
+#ifndef	__KERNEL__
+# include "arm_arch.h"
+#else
+# define __ARM_ARCH__ __LINUX_ARM_ARCH__
+# define __ARM_MAX_ARCH__ __LINUX_ARM_ARCH__
+# define poly1305_init   poly1305_init_arm
+# define poly1305_blocks poly1305_blocks_arm
+# define poly1305_emit   poly1305_emit_arm
+.globl	poly1305_blocks_neon
+#endif
+
+#if defined(__thumb2__)
+.syntax	unified
+.thumb
+#else
+.code	32
+#endif
+
+.text
+
+.globl	poly1305_emit
+.globl	poly1305_blocks
+.globl	poly1305_init
+.type	poly1305_init,%function
+.align	5
+poly1305_init:
+.Lpoly1305_init:
+	stmdb	sp!,{r4-r11}
+
+	eor	r3,r3,r3
+	cmp	$inp,#0
+	str	r3,[$ctx,#0]		@ zero hash value
+	str	r3,[$ctx,#4]
+	str	r3,[$ctx,#8]
+	str	r3,[$ctx,#12]
+	str	r3,[$ctx,#16]
+	str	r3,[$ctx,#36]		@ clear is_base2_26
+	add	$ctx,$ctx,#20
+
+#ifdef	__thumb2__
+	it	eq
+#endif
+	moveq	r0,#0
+	beq	.Lno_key
+
+#if	__ARM_MAX_ARCH__>=7
+	mov	r3,#-1
+	str	r3,[$ctx,#28]		@ impossible key power value
+# ifndef __KERNEL__
+	adr	r11,.Lpoly1305_init
+	ldr	r12,.LOPENSSL_armcap
+# endif
+#endif
+	ldrb	r4,[$inp,#0]
+	mov	r10,#0x0fffffff
+	ldrb	r5,[$inp,#1]
+	and	r3,r10,#-4		@ 0x0ffffffc
+	ldrb	r6,[$inp,#2]
+	ldrb	r7,[$inp,#3]
+	orr	r4,r4,r5,lsl#8
+	ldrb	r5,[$inp,#4]
+	orr	r4,r4,r6,lsl#16
+	ldrb	r6,[$inp,#5]
+	orr	r4,r4,r7,lsl#24
+	ldrb	r7,[$inp,#6]
+	and	r4,r4,r10
+
+#if	__ARM_MAX_ARCH__>=7 && !defined(__KERNEL__)
+# if !defined(_WIN32)
+	ldr	r12,[r11,r12]		@ OPENSSL_armcap_P
+# endif
+# if defined(__APPLE__) || defined(_WIN32)
+	ldr	r12,[r12]
+# endif
+#endif
+	ldrb	r8,[$inp,#7]
+	orr	r5,r5,r6,lsl#8
+	ldrb	r6,[$inp,#8]
+	orr	r5,r5,r7,lsl#16
+	ldrb	r7,[$inp,#9]
+	orr	r5,r5,r8,lsl#24
+	ldrb	r8,[$inp,#10]
+	and	r5,r5,r3
+
+#if	__ARM_MAX_ARCH__>=7 && !defined(__KERNEL__)
+	tst	r12,#ARMV7_NEON		@ check for NEON
+# ifdef	__thumb2__
+	adr	r9,.Lpoly1305_blocks_neon
+	adr	r11,.Lpoly1305_blocks
+	it	ne
+	movne	r11,r9
+	adr	r12,.Lpoly1305_emit
+	orr	r11,r11,#1		@ thumb-ify addresses
+	orr	r12,r12,#1
+# else
+	add	r12,r11,#(.Lpoly1305_emit-.Lpoly1305_init)
+	ite	eq
+	addeq	r11,r11,#(.Lpoly1305_blocks-.Lpoly1305_init)
+	addne	r11,r11,#(.Lpoly1305_blocks_neon-.Lpoly1305_init)
+# endif
+#endif
+	ldrb	r9,[$inp,#11]
+	orr	r6,r6,r7,lsl#8
+	ldrb	r7,[$inp,#12]
+	orr	r6,r6,r8,lsl#16
+	ldrb	r8,[$inp,#13]
+	orr	r6,r6,r9,lsl#24
+	ldrb	r9,[$inp,#14]
+	and	r6,r6,r3
+
+	ldrb	r10,[$inp,#15]
+	orr	r7,r7,r8,lsl#8
+	str	r4,[$ctx,#0]
+	orr	r7,r7,r9,lsl#16
+	str	r5,[$ctx,#4]
+	orr	r7,r7,r10,lsl#24
+	str	r6,[$ctx,#8]
+	and	r7,r7,r3
+	str	r7,[$ctx,#12]
+#if	__ARM_MAX_ARCH__>=7 && !defined(__KERNEL__)
+	stmia	r2,{r11,r12}		@ fill functions table
+	mov	r0,#1
+#else
+	mov	r0,#0
+#endif
+.Lno_key:
+	ldmia	sp!,{r4-r11}
+#if	__ARM_ARCH__>=5
+	ret				@ bx	lr
+#else
+	tst	lr,#1
+	moveq	pc,lr			@ be binary compatible with V4, yet
+	bx	lr			@ interoperable with Thumb ISA:-)
+#endif
+.size	poly1305_init,.-poly1305_init
+___
+{
+my ($h0,$h1,$h2,$h3,$h4,$r0,$r1,$r2,$r3)=map("r$_",(4..12));
+my ($s1,$s2,$s3)=($r1,$r2,$r3);
+
+$code.=<<___;
+.type	poly1305_blocks,%function
+.align	5
+poly1305_blocks:
+.Lpoly1305_blocks:
+	stmdb	sp!,{r3-r11,lr}
+
+	ands	$len,$len,#-16
+	beq	.Lno_data
+
+	add	$len,$len,$inp		@ end pointer
+	sub	sp,sp,#32
+
+#if __ARM_ARCH__<7
+	ldmia	$ctx,{$h0-$r3}		@ load context
+	add	$ctx,$ctx,#20
+	str	$len,[sp,#16]		@ offload stuff
+	str	$ctx,[sp,#12]
+#else
+	ldr	lr,[$ctx,#36]		@ is_base2_26
+	ldmia	$ctx!,{$h0-$h4}		@ load hash value
+	str	$len,[sp,#16]		@ offload stuff
+	str	$ctx,[sp,#12]
+
+	adds	$r0,$h0,$h1,lsl#26	@ base 2^26 -> base 2^32
+	mov	$r1,$h1,lsr#6
+	adcs	$r1,$r1,$h2,lsl#20
+	mov	$r2,$h2,lsr#12
+	adcs	$r2,$r2,$h3,lsl#14
+	mov	$r3,$h3,lsr#18
+	adcs	$r3,$r3,$h4,lsl#8
+	mov	$len,#0
+	teq	lr,#0
+	str	$len,[$ctx,#16]		@ clear is_base2_26
+	adc	$len,$len,$h4,lsr#24
+
+	itttt	ne
+	movne	$h0,$r0			@ choose between radixes
+	movne	$h1,$r1
+	movne	$h2,$r2
+	movne	$h3,$r3
+	ldmia	$ctx,{$r0-$r3}		@ load key
+	it	ne
+	movne	$h4,$len
+#endif
+
+	mov	lr,$inp
+	cmp	$padbit,#0
+	str	$r1,[sp,#20]
+	str	$r2,[sp,#24]
+	str	$r3,[sp,#28]
+	b	.Loop
+
+.align	4
+.Loop:
+#if __ARM_ARCH__<7
+	ldrb	r0,[lr],#16		@ load input
+# ifdef	__thumb2__
+	it	hi
+# endif
+	addhi	$h4,$h4,#1		@ 1<<128
+	ldrb	r1,[lr,#-15]
+	ldrb	r2,[lr,#-14]
+	ldrb	r3,[lr,#-13]
+	orr	r1,r0,r1,lsl#8
+	ldrb	r0,[lr,#-12]
+	orr	r2,r1,r2,lsl#16
+	ldrb	r1,[lr,#-11]
+	orr	r3,r2,r3,lsl#24
+	ldrb	r2,[lr,#-10]
+	adds	$h0,$h0,r3		@ accumulate input
+
+	ldrb	r3,[lr,#-9]
+	orr	r1,r0,r1,lsl#8
+	ldrb	r0,[lr,#-8]
+	orr	r2,r1,r2,lsl#16
+	ldrb	r1,[lr,#-7]
+	orr	r3,r2,r3,lsl#24
+	ldrb	r2,[lr,#-6]
+	adcs	$h1,$h1,r3
+
+	ldrb	r3,[lr,#-5]
+	orr	r1,r0,r1,lsl#8
+	ldrb	r0,[lr,#-4]
+	orr	r2,r1,r2,lsl#16
+	ldrb	r1,[lr,#-3]
+	orr	r3,r2,r3,lsl#24
+	ldrb	r2,[lr,#-2]
+	adcs	$h2,$h2,r3
+
+	ldrb	r3,[lr,#-1]
+	orr	r1,r0,r1,lsl#8
+	str	lr,[sp,#8]		@ offload input pointer
+	orr	r2,r1,r2,lsl#16
+	add	$s1,$r1,$r1,lsr#2
+	orr	r3,r2,r3,lsl#24
+#else
+	ldr	r0,[lr],#16		@ load input
+	it	hi
+	addhi	$h4,$h4,#1		@ padbit
+	ldr	r1,[lr,#-12]
+	ldr	r2,[lr,#-8]
+	ldr	r3,[lr,#-4]
+# ifdef	__ARMEB__
+	rev	r0,r0
+	rev	r1,r1
+	rev	r2,r2
+	rev	r3,r3
+# endif
+	adds	$h0,$h0,r0		@ accumulate input
+	str	lr,[sp,#8]		@ offload input pointer
+	adcs	$h1,$h1,r1
+	add	$s1,$r1,$r1,lsr#2
+	adcs	$h2,$h2,r2
+#endif
+	add	$s2,$r2,$r2,lsr#2
+	adcs	$h3,$h3,r3
+	add	$s3,$r3,$r3,lsr#2
+
+	umull	r2,r3,$h1,$r0
+	 adc	$h4,$h4,#0
+	umull	r0,r1,$h0,$r0
+	umlal	r2,r3,$h4,$s1
+	umlal	r0,r1,$h3,$s1
+	ldr	$r1,[sp,#20]		@ reload $r1
+	umlal	r2,r3,$h2,$s3
+	umlal	r0,r1,$h1,$s3
+	umlal	r2,r3,$h3,$s2
+	umlal	r0,r1,$h2,$s2
+	umlal	r2,r3,$h0,$r1
+	str	r0,[sp,#0]		@ future $h0
+	 mul	r0,$s2,$h4
+	ldr	$r2,[sp,#24]		@ reload $r2
+	adds	r2,r2,r1		@ d1+=d0>>32
+	 eor	r1,r1,r1
+	adc	lr,r3,#0		@ future $h2
+	str	r2,[sp,#4]		@ future $h1
+
+	mul	r2,$s3,$h4
+	eor	r3,r3,r3
+	umlal	r0,r1,$h3,$s3
+	ldr	$r3,[sp,#28]		@ reload $r3
+	umlal	r2,r3,$h3,$r0
+	umlal	r0,r1,$h2,$r0
+	umlal	r2,r3,$h2,$r1
+	umlal	r0,r1,$h1,$r1
+	umlal	r2,r3,$h1,$r2
+	umlal	r0,r1,$h0,$r2
+	umlal	r2,r3,$h0,$r3
+	ldr	$h0,[sp,#0]
+	mul	$h4,$r0,$h4
+	ldr	$h1,[sp,#4]
+
+	adds	$h2,lr,r0		@ d2+=d1>>32
+	ldr	lr,[sp,#8]		@ reload input pointer
+	adc	r1,r1,#0
+	adds	$h3,r2,r1		@ d3+=d2>>32
+	ldr	r0,[sp,#16]		@ reload end pointer
+	adc	r3,r3,#0
+	add	$h4,$h4,r3		@ h4+=d3>>32
+
+	and	r1,$h4,#-4
+	and	$h4,$h4,#3
+	add	r1,r1,r1,lsr#2		@ *=5
+	adds	$h0,$h0,r1
+	adcs	$h1,$h1,#0
+	adcs	$h2,$h2,#0
+	adcs	$h3,$h3,#0
+	adc	$h4,$h4,#0
+
+	cmp	r0,lr			@ done yet?
+	bhi	.Loop
+
+	ldr	$ctx,[sp,#12]
+	add	sp,sp,#32
+	stmdb	$ctx,{$h0-$h4}		@ store the result
+
+.Lno_data:
+#if	__ARM_ARCH__>=5
+	ldmia	sp!,{r3-r11,pc}
+#else
+	ldmia	sp!,{r3-r11,lr}
+	tst	lr,#1
+	moveq	pc,lr			@ be binary compatible with V4, yet
+	bx	lr			@ interoperable with Thumb ISA:-)
+#endif
+.size	poly1305_blocks,.-poly1305_blocks
+___
+}
+{
+my ($ctx,$mac,$nonce)=map("r$_",(0..2));
+my ($h0,$h1,$h2,$h3,$h4,$g0,$g1,$g2,$g3)=map("r$_",(3..11));
+my $g4=$ctx;
+
+$code.=<<___;
+.type	poly1305_emit,%function
+.align	5
+poly1305_emit:
+.Lpoly1305_emit:
+	stmdb	sp!,{r4-r11}
+
+	ldmia	$ctx,{$h0-$h4}
+
+#if __ARM_ARCH__>=7
+	ldr	ip,[$ctx,#36]		@ is_base2_26
+
+	adds	$g0,$h0,$h1,lsl#26	@ base 2^26 -> base 2^32
+	mov	$g1,$h1,lsr#6
+	adcs	$g1,$g1,$h2,lsl#20
+	mov	$g2,$h2,lsr#12
+	adcs	$g2,$g2,$h3,lsl#14
+	mov	$g3,$h3,lsr#18
+	adcs	$g3,$g3,$h4,lsl#8
+	mov	$g4,#0
+	adc	$g4,$g4,$h4,lsr#24
+
+	tst	ip,ip
+	itttt	ne
+	movne	$h0,$g0
+	movne	$h1,$g1
+	movne	$h2,$g2
+	movne	$h3,$g3
+	it	ne
+	movne	$h4,$g4
+#endif
+
+	adds	$g0,$h0,#5		@ compare to modulus
+	adcs	$g1,$h1,#0
+	adcs	$g2,$h2,#0
+	adcs	$g3,$h3,#0
+	adc	$g4,$h4,#0
+	tst	$g4,#4			@ did it carry/borrow?
+
+#ifdef	__thumb2__
+	it	ne
+#endif
+	movne	$h0,$g0
+	ldr	$g0,[$nonce,#0]
+#ifdef	__thumb2__
+	it	ne
+#endif
+	movne	$h1,$g1
+	ldr	$g1,[$nonce,#4]
+#ifdef	__thumb2__
+	it	ne
+#endif
+	movne	$h2,$g2
+	ldr	$g2,[$nonce,#8]
+#ifdef	__thumb2__
+	it	ne
+#endif
+	movne	$h3,$g3
+	ldr	$g3,[$nonce,#12]
+
+	adds	$h0,$h0,$g0
+	adcs	$h1,$h1,$g1
+	adcs	$h2,$h2,$g2
+	adc	$h3,$h3,$g3
+
+#if __ARM_ARCH__>=7
+# ifdef __ARMEB__
+	rev	$h0,$h0
+	rev	$h1,$h1
+	rev	$h2,$h2
+	rev	$h3,$h3
+# endif
+	str	$h0,[$mac,#0]
+	str	$h1,[$mac,#4]
+	str	$h2,[$mac,#8]
+	str	$h3,[$mac,#12]
+#else
+	strb	$h0,[$mac,#0]
+	mov	$h0,$h0,lsr#8
+	strb	$h1,[$mac,#4]
+	mov	$h1,$h1,lsr#8
+	strb	$h2,[$mac,#8]
+	mov	$h2,$h2,lsr#8
+	strb	$h3,[$mac,#12]
+	mov	$h3,$h3,lsr#8
+
+	strb	$h0,[$mac,#1]
+	mov	$h0,$h0,lsr#8
+	strb	$h1,[$mac,#5]
+	mov	$h1,$h1,lsr#8
+	strb	$h2,[$mac,#9]
+	mov	$h2,$h2,lsr#8
+	strb	$h3,[$mac,#13]
+	mov	$h3,$h3,lsr#8
+
+	strb	$h0,[$mac,#2]
+	mov	$h0,$h0,lsr#8
+	strb	$h1,[$mac,#6]
+	mov	$h1,$h1,lsr#8
+	strb	$h2,[$mac,#10]
+	mov	$h2,$h2,lsr#8
+	strb	$h3,[$mac,#14]
+	mov	$h3,$h3,lsr#8
+
+	strb	$h0,[$mac,#3]
+	strb	$h1,[$mac,#7]
+	strb	$h2,[$mac,#11]
+	strb	$h3,[$mac,#15]
+#endif
+	ldmia	sp!,{r4-r11}
+#if	__ARM_ARCH__>=5
+	ret				@ bx	lr
+#else
+	tst	lr,#1
+	moveq	pc,lr			@ be binary compatible with V4, yet
+	bx	lr			@ interoperable with Thumb ISA:-)
+#endif
+.size	poly1305_emit,.-poly1305_emit
+___
+{
+my ($R0,$R1,$S1,$R2,$S2,$R3,$S3,$R4,$S4) = map("d$_",(0..9));
+my ($D0,$D1,$D2,$D3,$D4, $H0,$H1,$H2,$H3,$H4) = map("q$_",(5..14));
+my ($T0,$T1,$MASK) = map("q$_",(15,4,0));
+
+my ($in2,$zeros,$tbl0,$tbl1) = map("r$_",(4..7));
+
+$code.=<<___;
+#if	__ARM_MAX_ARCH__>=7
+.fpu	neon
+
+.type	poly1305_init_neon,%function
+.align	5
+poly1305_init_neon:
+.Lpoly1305_init_neon:
+	ldr	r3,[$ctx,#48]		@ first table element
+	cmp	r3,#-1			@ is value impossible?
+	bne	.Lno_init_neon
+
+	ldr	r4,[$ctx,#20]		@ load key base 2^32
+	ldr	r5,[$ctx,#24]
+	ldr	r6,[$ctx,#28]
+	ldr	r7,[$ctx,#32]
+
+	and	r2,r4,#0x03ffffff	@ base 2^32 -> base 2^26
+	mov	r3,r4,lsr#26
+	mov	r4,r5,lsr#20
+	orr	r3,r3,r5,lsl#6
+	mov	r5,r6,lsr#14
+	orr	r4,r4,r6,lsl#12
+	mov	r6,r7,lsr#8
+	orr	r5,r5,r7,lsl#18
+	and	r3,r3,#0x03ffffff
+	and	r4,r4,#0x03ffffff
+	and	r5,r5,#0x03ffffff
+
+	vdup.32	$R0,r2			@ r^1 in both lanes
+	add	r2,r3,r3,lsl#2		@ *5
+	vdup.32	$R1,r3
+	add	r3,r4,r4,lsl#2
+	vdup.32	$S1,r2
+	vdup.32	$R2,r4
+	add	r4,r5,r5,lsl#2
+	vdup.32	$S2,r3
+	vdup.32	$R3,r5
+	add	r5,r6,r6,lsl#2
+	vdup.32	$S3,r4
+	vdup.32	$R4,r6
+	vdup.32	$S4,r5
+
+	mov	$zeros,#2		@ counter
+
+.Lsquare_neon:
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4
+	@ d1 = h1*r0 + h0*r1   + h4*5*r2 + h3*5*r3 + h2*5*r4
+	@ d2 = h2*r0 + h1*r1   + h0*r2   + h4*5*r3 + h3*5*r4
+	@ d3 = h3*r0 + h2*r1   + h1*r2   + h0*r3   + h4*5*r4
+	@ d4 = h4*r0 + h3*r1   + h2*r2   + h1*r3   + h0*r4
+
+	vmull.u32	$D0,$R0,${R0}[1]
+	vmull.u32	$D1,$R1,${R0}[1]
+	vmull.u32	$D2,$R2,${R0}[1]
+	vmull.u32	$D3,$R3,${R0}[1]
+	vmull.u32	$D4,$R4,${R0}[1]
+
+	vmlal.u32	$D0,$R4,${S1}[1]
+	vmlal.u32	$D1,$R0,${R1}[1]
+	vmlal.u32	$D2,$R1,${R1}[1]
+	vmlal.u32	$D3,$R2,${R1}[1]
+	vmlal.u32	$D4,$R3,${R1}[1]
+
+	vmlal.u32	$D0,$R3,${S2}[1]
+	vmlal.u32	$D1,$R4,${S2}[1]
+	vmlal.u32	$D3,$R1,${R2}[1]
+	vmlal.u32	$D2,$R0,${R2}[1]
+	vmlal.u32	$D4,$R2,${R2}[1]
+
+	vmlal.u32	$D0,$R2,${S3}[1]
+	vmlal.u32	$D3,$R0,${R3}[1]
+	vmlal.u32	$D1,$R3,${S3}[1]
+	vmlal.u32	$D2,$R4,${S3}[1]
+	vmlal.u32	$D4,$R1,${R3}[1]
+
+	vmlal.u32	$D3,$R4,${S4}[1]
+	vmlal.u32	$D0,$R1,${S4}[1]
+	vmlal.u32	$D1,$R2,${S4}[1]
+	vmlal.u32	$D2,$R3,${S4}[1]
+	vmlal.u32	$D4,$R0,${R4}[1]
+
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ lazy reduction as discussed in "NEON crypto" by D.J. Bernstein
+	@ and P. Schwabe
+	@
+	@ H0>>+H1>>+H2>>+H3>>+H4
+	@ H3>>+H4>>*5+H0>>+H1
+	@
+	@ Trivia.
+	@
+	@ Result of multiplication of n-bit number by m-bit number is
+	@ n+m bits wide. However! Even though 2^n is a n+1-bit number,
+	@ m-bit number multiplied by 2^n is still n+m bits wide.
+	@
+	@ Sum of two n-bit numbers is n+1 bits wide, sum of three - n+2,
+	@ and so is sum of four. Sum of 2^m n-m-bit numbers and n-bit
+	@ one is n+1 bits wide.
+	@
+	@ >>+ denotes Hnext += Hn>>26, Hn &= 0x3ffffff. This means that
+	@ H0, H2, H3 are guaranteed to be 26 bits wide, while H1 and H4
+	@ can be 27. However! In cases when their width exceeds 26 bits
+	@ they are limited by 2^26+2^6. This in turn means that *sum*
+	@ of the products with these values can still be viewed as sum
+	@ of 52-bit numbers as long as the amount of addends is not a
+	@ power of 2. For example,
+	@
+	@ H4 = H4*R0 + H3*R1 + H2*R2 + H1*R3 + H0 * R4,
+	@
+	@ which can't be larger than 5 * (2^26 + 2^6) * (2^26 + 2^6), or
+	@ 5 * (2^52 + 2*2^32 + 2^12), which in turn is smaller than
+	@ 8 * (2^52) or 2^55. However, the value is then multiplied by
+	@ by 5, so we should be looking at 5 * 5 * (2^52 + 2^33 + 2^12),
+	@ which is less than 32 * (2^52) or 2^57. And when processing
+	@ data we are looking at triple as many addends...
+	@
+	@ In key setup procedure pre-reduced H0 is limited by 5*4+1 and
+	@ 5*H4 - by 5*5 52-bit addends, or 57 bits. But when hashing the
+	@ input H0 is limited by (5*4+1)*3 addends, or 58 bits, while
+	@ 5*H4 by 5*5*3, or 59[!] bits. How is this relevant? vmlal.u32
+	@ instruction accepts 2x32-bit input and writes 2x64-bit result.
+	@ This means that result of reduction have to be compressed upon
+	@ loop wrap-around. This can be done in the process of reduction
+	@ to minimize amount of instructions [as well as amount of
+	@ 128-bit instructions, which benefits low-end processors], but
+	@ one has to watch for H2 (which is narrower than H0) and 5*H4
+	@ not being wider than 58 bits, so that result of right shift
+	@ by 26 bits fits in 32 bits. This is also useful on x86,
+	@ because it allows to use paddd in place for paddq, which
+	@ benefits Atom, where paddq is ridiculously slow.
+
+	vshr.u64	$T0,$D3,#26
+	vmovn.i64	$D3#lo,$D3
+	 vshr.u64	$T1,$D0,#26
+	 vmovn.i64	$D0#lo,$D0
+	vadd.i64	$D4,$D4,$T0		@ h3 -> h4
+	vbic.i32	$D3#lo,#0xfc000000	@ &=0x03ffffff
+	 vadd.i64	$D1,$D1,$T1		@ h0 -> h1
+	 vbic.i32	$D0#lo,#0xfc000000
+
+	vshrn.u64	$T0#lo,$D4,#26
+	vmovn.i64	$D4#lo,$D4
+	 vshr.u64	$T1,$D1,#26
+	 vmovn.i64	$D1#lo,$D1
+	 vadd.i64	$D2,$D2,$T1		@ h1 -> h2
+	vbic.i32	$D4#lo,#0xfc000000
+	 vbic.i32	$D1#lo,#0xfc000000
+
+	vadd.i32	$D0#lo,$D0#lo,$T0#lo
+	vshl.u32	$T0#lo,$T0#lo,#2
+	 vshrn.u64	$T1#lo,$D2,#26
+	 vmovn.i64	$D2#lo,$D2
+	vadd.i32	$D0#lo,$D0#lo,$T0#lo	@ h4 -> h0
+	 vadd.i32	$D3#lo,$D3#lo,$T1#lo	@ h2 -> h3
+	 vbic.i32	$D2#lo,#0xfc000000
+
+	vshr.u32	$T0#lo,$D0#lo,#26
+	vbic.i32	$D0#lo,#0xfc000000
+	 vshr.u32	$T1#lo,$D3#lo,#26
+	 vbic.i32	$D3#lo,#0xfc000000
+	vadd.i32	$D1#lo,$D1#lo,$T0#lo	@ h0 -> h1
+	 vadd.i32	$D4#lo,$D4#lo,$T1#lo	@ h3 -> h4
+
+	subs		$zeros,$zeros,#1
+	beq		.Lsquare_break_neon
+
+	add		$tbl0,$ctx,#(48+0*9*4)
+	add		$tbl1,$ctx,#(48+1*9*4)
+
+	vtrn.32		$R0,$D0#lo		@ r^2:r^1
+	vtrn.32		$R2,$D2#lo
+	vtrn.32		$R3,$D3#lo
+	vtrn.32		$R1,$D1#lo
+	vtrn.32		$R4,$D4#lo
+
+	vshl.u32	$S2,$R2,#2		@ *5
+	vshl.u32	$S3,$R3,#2
+	vshl.u32	$S1,$R1,#2
+	vshl.u32	$S4,$R4,#2
+	vadd.i32	$S2,$S2,$R2
+	vadd.i32	$S1,$S1,$R1
+	vadd.i32	$S3,$S3,$R3
+	vadd.i32	$S4,$S4,$R4
+
+	vst4.32		{${R0}[0],${R1}[0],${S1}[0],${R2}[0]},[$tbl0]!
+	vst4.32		{${R0}[1],${R1}[1],${S1}[1],${R2}[1]},[$tbl1]!
+	vst4.32		{${S2}[0],${R3}[0],${S3}[0],${R4}[0]},[$tbl0]!
+	vst4.32		{${S2}[1],${R3}[1],${S3}[1],${R4}[1]},[$tbl1]!
+	vst1.32		{${S4}[0]},[$tbl0,:32]
+	vst1.32		{${S4}[1]},[$tbl1,:32]
+
+	b		.Lsquare_neon
+
+.align	4
+.Lsquare_break_neon:
+	add		$tbl0,$ctx,#(48+2*4*9)
+	add		$tbl1,$ctx,#(48+3*4*9)
+
+	vmov		$R0,$D0#lo		@ r^4:r^3
+	vshl.u32	$S1,$D1#lo,#2		@ *5
+	vmov		$R1,$D1#lo
+	vshl.u32	$S2,$D2#lo,#2
+	vmov		$R2,$D2#lo
+	vshl.u32	$S3,$D3#lo,#2
+	vmov		$R3,$D3#lo
+	vshl.u32	$S4,$D4#lo,#2
+	vmov		$R4,$D4#lo
+	vadd.i32	$S1,$S1,$D1#lo
+	vadd.i32	$S2,$S2,$D2#lo
+	vadd.i32	$S3,$S3,$D3#lo
+	vadd.i32	$S4,$S4,$D4#lo
+
+	vst4.32		{${R0}[0],${R1}[0],${S1}[0],${R2}[0]},[$tbl0]!
+	vst4.32		{${R0}[1],${R1}[1],${S1}[1],${R2}[1]},[$tbl1]!
+	vst4.32		{${S2}[0],${R3}[0],${S3}[0],${R4}[0]},[$tbl0]!
+	vst4.32		{${S2}[1],${R3}[1],${S3}[1],${R4}[1]},[$tbl1]!
+	vst1.32		{${S4}[0]},[$tbl0]
+	vst1.32		{${S4}[1]},[$tbl1]
+
+.Lno_init_neon:
+	ret				@ bx	lr
+.size	poly1305_init_neon,.-poly1305_init_neon
+
+.type	poly1305_blocks_neon,%function
+.align	5
+poly1305_blocks_neon:
+.Lpoly1305_blocks_neon:
+	ldr	ip,[$ctx,#36]		@ is_base2_26
+
+	cmp	$len,#64
+	blo	.Lpoly1305_blocks
+
+	stmdb	sp!,{r4-r7}
+	vstmdb	sp!,{d8-d15}		@ ABI specification says so
+
+	tst	ip,ip			@ is_base2_26?
+	bne	.Lbase2_26_neon
+
+	stmdb	sp!,{r1-r3,lr}
+	bl	.Lpoly1305_init_neon
+
+	ldr	r4,[$ctx,#0]		@ load hash value base 2^32
+	ldr	r5,[$ctx,#4]
+	ldr	r6,[$ctx,#8]
+	ldr	r7,[$ctx,#12]
+	ldr	ip,[$ctx,#16]
+
+	and	r2,r4,#0x03ffffff	@ base 2^32 -> base 2^26
+	mov	r3,r4,lsr#26
+	 veor	$D0#lo,$D0#lo,$D0#lo
+	mov	r4,r5,lsr#20
+	orr	r3,r3,r5,lsl#6
+	 veor	$D1#lo,$D1#lo,$D1#lo
+	mov	r5,r6,lsr#14
+	orr	r4,r4,r6,lsl#12
+	 veor	$D2#lo,$D2#lo,$D2#lo
+	mov	r6,r7,lsr#8
+	orr	r5,r5,r7,lsl#18
+	 veor	$D3#lo,$D3#lo,$D3#lo
+	and	r3,r3,#0x03ffffff
+	orr	r6,r6,ip,lsl#24
+	 veor	$D4#lo,$D4#lo,$D4#lo
+	and	r4,r4,#0x03ffffff
+	mov	r1,#1
+	and	r5,r5,#0x03ffffff
+	str	r1,[$ctx,#36]		@ set is_base2_26
+
+	vmov.32	$D0#lo[0],r2
+	vmov.32	$D1#lo[0],r3
+	vmov.32	$D2#lo[0],r4
+	vmov.32	$D3#lo[0],r5
+	vmov.32	$D4#lo[0],r6
+	adr	$zeros,.Lzeros
+
+	ldmia	sp!,{r1-r3,lr}
+	b	.Lhash_loaded
+
+.align	4
+.Lbase2_26_neon:
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ load hash value
+
+	veor		$D0#lo,$D0#lo,$D0#lo
+	veor		$D1#lo,$D1#lo,$D1#lo
+	veor		$D2#lo,$D2#lo,$D2#lo
+	veor		$D3#lo,$D3#lo,$D3#lo
+	veor		$D4#lo,$D4#lo,$D4#lo
+	vld4.32		{$D0#lo[0],$D1#lo[0],$D2#lo[0],$D3#lo[0]},[$ctx]!
+	adr		$zeros,.Lzeros
+	vld1.32		{$D4#lo[0]},[$ctx]
+	sub		$ctx,$ctx,#16		@ rewind
+
+.Lhash_loaded:
+	add		$in2,$inp,#32
+	mov		$padbit,$padbit,lsl#24
+	tst		$len,#31
+	beq		.Leven
+
+	vld4.32		{$H0#lo[0],$H1#lo[0],$H2#lo[0],$H3#lo[0]},[$inp]!
+	vmov.32		$H4#lo[0],$padbit
+	sub		$len,$len,#16
+	add		$in2,$inp,#32
+
+# ifdef	__ARMEB__
+	vrev32.8	$H0,$H0
+	vrev32.8	$H3,$H3
+	vrev32.8	$H1,$H1
+	vrev32.8	$H2,$H2
+# endif
+	vsri.u32	$H4#lo,$H3#lo,#8	@ base 2^32 -> base 2^26
+	vshl.u32	$H3#lo,$H3#lo,#18
+
+	vsri.u32	$H3#lo,$H2#lo,#14
+	vshl.u32	$H2#lo,$H2#lo,#12
+	vadd.i32	$H4#hi,$H4#lo,$D4#lo	@ add hash value and move to #hi
+
+	vbic.i32	$H3#lo,#0xfc000000
+	vsri.u32	$H2#lo,$H1#lo,#20
+	vshl.u32	$H1#lo,$H1#lo,#6
+
+	vbic.i32	$H2#lo,#0xfc000000
+	vsri.u32	$H1#lo,$H0#lo,#26
+	vadd.i32	$H3#hi,$H3#lo,$D3#lo
+
+	vbic.i32	$H0#lo,#0xfc000000
+	vbic.i32	$H1#lo,#0xfc000000
+	vadd.i32	$H2#hi,$H2#lo,$D2#lo
+
+	vadd.i32	$H0#hi,$H0#lo,$D0#lo
+	vadd.i32	$H1#hi,$H1#lo,$D1#lo
+
+	mov		$tbl1,$zeros
+	add		$tbl0,$ctx,#48
+
+	cmp		$len,$len
+	b		.Long_tail
+
+.align	4
+.Leven:
+	subs		$len,$len,#64
+	it		lo
+	movlo		$in2,$zeros
+
+	vmov.i32	$H4,#1<<24		@ padbit, yes, always
+	vld4.32		{$H0#lo,$H1#lo,$H2#lo,$H3#lo},[$inp]	@ inp[0:1]
+	add		$inp,$inp,#64
+	vld4.32		{$H0#hi,$H1#hi,$H2#hi,$H3#hi},[$in2]	@ inp[2:3] (or 0)
+	add		$in2,$in2,#64
+	itt		hi
+	addhi		$tbl1,$ctx,#(48+1*9*4)
+	addhi		$tbl0,$ctx,#(48+3*9*4)
+
+# ifdef	__ARMEB__
+	vrev32.8	$H0,$H0
+	vrev32.8	$H3,$H3
+	vrev32.8	$H1,$H1
+	vrev32.8	$H2,$H2
+# endif
+	vsri.u32	$H4,$H3,#8		@ base 2^32 -> base 2^26
+	vshl.u32	$H3,$H3,#18
+
+	vsri.u32	$H3,$H2,#14
+	vshl.u32	$H2,$H2,#12
+
+	vbic.i32	$H3,#0xfc000000
+	vsri.u32	$H2,$H1,#20
+	vshl.u32	$H1,$H1,#6
+
+	vbic.i32	$H2,#0xfc000000
+	vsri.u32	$H1,$H0,#26
+
+	vbic.i32	$H0,#0xfc000000
+	vbic.i32	$H1,#0xfc000000
+
+	bls		.Lskip_loop
+
+	vld4.32		{${R0}[1],${R1}[1],${S1}[1],${R2}[1]},[$tbl1]!	@ load r^2
+	vld4.32		{${R0}[0],${R1}[0],${S1}[0],${R2}[0]},[$tbl0]!	@ load r^4
+	vld4.32		{${S2}[1],${R3}[1],${S3}[1],${R4}[1]},[$tbl1]!
+	vld4.32		{${S2}[0],${R3}[0],${S3}[0],${R4}[0]},[$tbl0]!
+	b		.Loop_neon
+
+.align	5
+.Loop_neon:
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2
+	@ ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^3+inp[7]*r
+	@   \___________________/
+	@ ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2+inp[8])*r^2
+	@ ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^4+inp[7]*r^2+inp[9])*r
+	@   \___________________/ \____________________/
+	@
+	@ Note that we start with inp[2:3]*r^2. This is because it
+	@ doesn't depend on reduction in previous iteration.
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ d4 = h4*r0 + h3*r1   + h2*r2   + h1*r3   + h0*r4
+	@ d3 = h3*r0 + h2*r1   + h1*r2   + h0*r3   + h4*5*r4
+	@ d2 = h2*r0 + h1*r1   + h0*r2   + h4*5*r3 + h3*5*r4
+	@ d1 = h1*r0 + h0*r1   + h4*5*r2 + h3*5*r3 + h2*5*r4
+	@ d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4
+
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ inp[2:3]*r^2
+
+	vadd.i32	$H2#lo,$H2#lo,$D2#lo	@ accumulate inp[0:1]
+	vmull.u32	$D2,$H2#hi,${R0}[1]
+	vadd.i32	$H0#lo,$H0#lo,$D0#lo
+	vmull.u32	$D0,$H0#hi,${R0}[1]
+	vadd.i32	$H3#lo,$H3#lo,$D3#lo
+	vmull.u32	$D3,$H3#hi,${R0}[1]
+	vmlal.u32	$D2,$H1#hi,${R1}[1]
+	vadd.i32	$H1#lo,$H1#lo,$D1#lo
+	vmull.u32	$D1,$H1#hi,${R0}[1]
+
+	vadd.i32	$H4#lo,$H4#lo,$D4#lo
+	vmull.u32	$D4,$H4#hi,${R0}[1]
+	subs		$len,$len,#64
+	vmlal.u32	$D0,$H4#hi,${S1}[1]
+	it		lo
+	movlo		$in2,$zeros
+	vmlal.u32	$D3,$H2#hi,${R1}[1]
+	vld1.32		${S4}[1],[$tbl1,:32]
+	vmlal.u32	$D1,$H0#hi,${R1}[1]
+	vmlal.u32	$D4,$H3#hi,${R1}[1]
+
+	vmlal.u32	$D0,$H3#hi,${S2}[1]
+	vmlal.u32	$D3,$H1#hi,${R2}[1]
+	vmlal.u32	$D4,$H2#hi,${R2}[1]
+	vmlal.u32	$D1,$H4#hi,${S2}[1]
+	vmlal.u32	$D2,$H0#hi,${R2}[1]
+
+	vmlal.u32	$D3,$H0#hi,${R3}[1]
+	vmlal.u32	$D0,$H2#hi,${S3}[1]
+	vmlal.u32	$D4,$H1#hi,${R3}[1]
+	vmlal.u32	$D1,$H3#hi,${S3}[1]
+	vmlal.u32	$D2,$H4#hi,${S3}[1]
+
+	vmlal.u32	$D3,$H4#hi,${S4}[1]
+	vmlal.u32	$D0,$H1#hi,${S4}[1]
+	vmlal.u32	$D4,$H0#hi,${R4}[1]
+	vmlal.u32	$D1,$H2#hi,${S4}[1]
+	vmlal.u32	$D2,$H3#hi,${S4}[1]
+
+	vld4.32		{$H0#hi,$H1#hi,$H2#hi,$H3#hi},[$in2]	@ inp[2:3] (or 0)
+	add		$in2,$in2,#64
+
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ (hash+inp[0:1])*r^4 and accumulate
+
+	vmlal.u32	$D3,$H3#lo,${R0}[0]
+	vmlal.u32	$D0,$H0#lo,${R0}[0]
+	vmlal.u32	$D4,$H4#lo,${R0}[0]
+	vmlal.u32	$D1,$H1#lo,${R0}[0]
+	vmlal.u32	$D2,$H2#lo,${R0}[0]
+	vld1.32		${S4}[0],[$tbl0,:32]
+
+	vmlal.u32	$D3,$H2#lo,${R1}[0]
+	vmlal.u32	$D0,$H4#lo,${S1}[0]
+	vmlal.u32	$D4,$H3#lo,${R1}[0]
+	vmlal.u32	$D1,$H0#lo,${R1}[0]
+	vmlal.u32	$D2,$H1#lo,${R1}[0]
+
+	vmlal.u32	$D3,$H1#lo,${R2}[0]
+	vmlal.u32	$D0,$H3#lo,${S2}[0]
+	vmlal.u32	$D4,$H2#lo,${R2}[0]
+	vmlal.u32	$D1,$H4#lo,${S2}[0]
+	vmlal.u32	$D2,$H0#lo,${R2}[0]
+
+	vmlal.u32	$D3,$H0#lo,${R3}[0]
+	vmlal.u32	$D0,$H2#lo,${S3}[0]
+	vmlal.u32	$D4,$H1#lo,${R3}[0]
+	vmlal.u32	$D1,$H3#lo,${S3}[0]
+	vmlal.u32	$D3,$H4#lo,${S4}[0]
+
+	vmlal.u32	$D2,$H4#lo,${S3}[0]
+	vmlal.u32	$D0,$H1#lo,${S4}[0]
+	vmlal.u32	$D4,$H0#lo,${R4}[0]
+	vmov.i32	$H4,#1<<24		@ padbit, yes, always
+	vmlal.u32	$D1,$H2#lo,${S4}[0]
+	vmlal.u32	$D2,$H3#lo,${S4}[0]
+
+	vld4.32		{$H0#lo,$H1#lo,$H2#lo,$H3#lo},[$inp]	@ inp[0:1]
+	add		$inp,$inp,#64
+# ifdef	__ARMEB__
+	vrev32.8	$H0,$H0
+	vrev32.8	$H1,$H1
+	vrev32.8	$H2,$H2
+	vrev32.8	$H3,$H3
+# endif
+
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ lazy reduction interleaved with base 2^32 -> base 2^26 of
+	@ inp[0:3] previously loaded to $H0-$H3 and smashed to $H0-$H4.
+
+	vshr.u64	$T0,$D3,#26
+	vmovn.i64	$D3#lo,$D3
+	 vshr.u64	$T1,$D0,#26
+	 vmovn.i64	$D0#lo,$D0
+	vadd.i64	$D4,$D4,$T0		@ h3 -> h4
+	vbic.i32	$D3#lo,#0xfc000000
+	  vsri.u32	$H4,$H3,#8		@ base 2^32 -> base 2^26
+	 vadd.i64	$D1,$D1,$T1		@ h0 -> h1
+	  vshl.u32	$H3,$H3,#18
+	 vbic.i32	$D0#lo,#0xfc000000
+
+	vshrn.u64	$T0#lo,$D4,#26
+	vmovn.i64	$D4#lo,$D4
+	 vshr.u64	$T1,$D1,#26
+	 vmovn.i64	$D1#lo,$D1
+	 vadd.i64	$D2,$D2,$T1		@ h1 -> h2
+	  vsri.u32	$H3,$H2,#14
+	vbic.i32	$D4#lo,#0xfc000000
+	  vshl.u32	$H2,$H2,#12
+	 vbic.i32	$D1#lo,#0xfc000000
+
+	vadd.i32	$D0#lo,$D0#lo,$T0#lo
+	vshl.u32	$T0#lo,$T0#lo,#2
+	  vbic.i32	$H3,#0xfc000000
+	 vshrn.u64	$T1#lo,$D2,#26
+	 vmovn.i64	$D2#lo,$D2
+	vaddl.u32	$D0,$D0#lo,$T0#lo	@ h4 -> h0 [widen for a sec]
+	  vsri.u32	$H2,$H1,#20
+	 vadd.i32	$D3#lo,$D3#lo,$T1#lo	@ h2 -> h3
+	  vshl.u32	$H1,$H1,#6
+	 vbic.i32	$D2#lo,#0xfc000000
+	  vbic.i32	$H2,#0xfc000000
+
+	vshrn.u64	$T0#lo,$D0,#26		@ re-narrow
+	vmovn.i64	$D0#lo,$D0
+	  vsri.u32	$H1,$H0,#26
+	  vbic.i32	$H0,#0xfc000000
+	 vshr.u32	$T1#lo,$D3#lo,#26
+	 vbic.i32	$D3#lo,#0xfc000000
+	vbic.i32	$D0#lo,#0xfc000000
+	vadd.i32	$D1#lo,$D1#lo,$T0#lo	@ h0 -> h1
+	 vadd.i32	$D4#lo,$D4#lo,$T1#lo	@ h3 -> h4
+	  vbic.i32	$H1,#0xfc000000
+
+	bhi		.Loop_neon
+
+.Lskip_loop:
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ multiply (inp[0:1]+hash) or inp[2:3] by r^2:r^1
+
+	add		$tbl1,$ctx,#(48+0*9*4)
+	add		$tbl0,$ctx,#(48+1*9*4)
+	adds		$len,$len,#32
+	it		ne
+	movne		$len,#0
+	bne		.Long_tail
+
+	vadd.i32	$H2#hi,$H2#lo,$D2#lo	@ add hash value and move to #hi
+	vadd.i32	$H0#hi,$H0#lo,$D0#lo
+	vadd.i32	$H3#hi,$H3#lo,$D3#lo
+	vadd.i32	$H1#hi,$H1#lo,$D1#lo
+	vadd.i32	$H4#hi,$H4#lo,$D4#lo
+
+.Long_tail:
+	vld4.32		{${R0}[1],${R1}[1],${S1}[1],${R2}[1]},[$tbl1]!	@ load r^1
+	vld4.32		{${R0}[0],${R1}[0],${S1}[0],${R2}[0]},[$tbl0]!	@ load r^2
+
+	vadd.i32	$H2#lo,$H2#lo,$D2#lo	@ can be redundant
+	vmull.u32	$D2,$H2#hi,$R0
+	vadd.i32	$H0#lo,$H0#lo,$D0#lo
+	vmull.u32	$D0,$H0#hi,$R0
+	vadd.i32	$H3#lo,$H3#lo,$D3#lo
+	vmull.u32	$D3,$H3#hi,$R0
+	vadd.i32	$H1#lo,$H1#lo,$D1#lo
+	vmull.u32	$D1,$H1#hi,$R0
+	vadd.i32	$H4#lo,$H4#lo,$D4#lo
+	vmull.u32	$D4,$H4#hi,$R0
+
+	vmlal.u32	$D0,$H4#hi,$S1
+	vld4.32		{${S2}[1],${R3}[1],${S3}[1],${R4}[1]},[$tbl1]!
+	vmlal.u32	$D3,$H2#hi,$R1
+	vld4.32		{${S2}[0],${R3}[0],${S3}[0],${R4}[0]},[$tbl0]!
+	vmlal.u32	$D1,$H0#hi,$R1
+	vmlal.u32	$D4,$H3#hi,$R1
+	vmlal.u32	$D2,$H1#hi,$R1
+
+	vmlal.u32	$D3,$H1#hi,$R2
+	vld1.32		${S4}[1],[$tbl1,:32]
+	vmlal.u32	$D0,$H3#hi,$S2
+	vld1.32		${S4}[0],[$tbl0,:32]
+	vmlal.u32	$D4,$H2#hi,$R2
+	vmlal.u32	$D1,$H4#hi,$S2
+	vmlal.u32	$D2,$H0#hi,$R2
+
+	vmlal.u32	$D3,$H0#hi,$R3
+	 it		ne
+	 addne		$tbl1,$ctx,#(48+2*9*4)
+	vmlal.u32	$D0,$H2#hi,$S3
+	 it		ne
+	 addne		$tbl0,$ctx,#(48+3*9*4)
+	vmlal.u32	$D4,$H1#hi,$R3
+	vmlal.u32	$D1,$H3#hi,$S3
+	vmlal.u32	$D2,$H4#hi,$S3
+
+	vmlal.u32	$D3,$H4#hi,$S4
+	 vorn		$MASK,$MASK,$MASK	@ all-ones, can be redundant
+	vmlal.u32	$D0,$H1#hi,$S4
+	 vshr.u64	$MASK,$MASK,#38
+	vmlal.u32	$D4,$H0#hi,$R4
+	vmlal.u32	$D1,$H2#hi,$S4
+	vmlal.u32	$D2,$H3#hi,$S4
+
+	beq		.Lshort_tail
+
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ (hash+inp[0:1])*r^4:r^3 and accumulate
+
+	vld4.32		{${R0}[1],${R1}[1],${S1}[1],${R2}[1]},[$tbl1]!	@ load r^3
+	vld4.32		{${R0}[0],${R1}[0],${S1}[0],${R2}[0]},[$tbl0]!	@ load r^4
+
+	vmlal.u32	$D2,$H2#lo,$R0
+	vmlal.u32	$D0,$H0#lo,$R0
+	vmlal.u32	$D3,$H3#lo,$R0
+	vmlal.u32	$D1,$H1#lo,$R0
+	vmlal.u32	$D4,$H4#lo,$R0
+
+	vmlal.u32	$D0,$H4#lo,$S1
+	vld4.32		{${S2}[1],${R3}[1],${S3}[1],${R4}[1]},[$tbl1]!
+	vmlal.u32	$D3,$H2#lo,$R1
+	vld4.32		{${S2}[0],${R3}[0],${S3}[0],${R4}[0]},[$tbl0]!
+	vmlal.u32	$D1,$H0#lo,$R1
+	vmlal.u32	$D4,$H3#lo,$R1
+	vmlal.u32	$D2,$H1#lo,$R1
+
+	vmlal.u32	$D3,$H1#lo,$R2
+	vld1.32		${S4}[1],[$tbl1,:32]
+	vmlal.u32	$D0,$H3#lo,$S2
+	vld1.32		${S4}[0],[$tbl0,:32]
+	vmlal.u32	$D4,$H2#lo,$R2
+	vmlal.u32	$D1,$H4#lo,$S2
+	vmlal.u32	$D2,$H0#lo,$R2
+
+	vmlal.u32	$D3,$H0#lo,$R3
+	vmlal.u32	$D0,$H2#lo,$S3
+	vmlal.u32	$D4,$H1#lo,$R3
+	vmlal.u32	$D1,$H3#lo,$S3
+	vmlal.u32	$D2,$H4#lo,$S3
+
+	vmlal.u32	$D3,$H4#lo,$S4
+	 vorn		$MASK,$MASK,$MASK	@ all-ones
+	vmlal.u32	$D0,$H1#lo,$S4
+	 vshr.u64	$MASK,$MASK,#38
+	vmlal.u32	$D4,$H0#lo,$R4
+	vmlal.u32	$D1,$H2#lo,$S4
+	vmlal.u32	$D2,$H3#lo,$S4
+
+.Lshort_tail:
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ horizontal addition
+
+	vadd.i64	$D3#lo,$D3#lo,$D3#hi
+	vadd.i64	$D0#lo,$D0#lo,$D0#hi
+	vadd.i64	$D4#lo,$D4#lo,$D4#hi
+	vadd.i64	$D1#lo,$D1#lo,$D1#hi
+	vadd.i64	$D2#lo,$D2#lo,$D2#hi
+
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ lazy reduction, but without narrowing
+
+	vshr.u64	$T0,$D3,#26
+	vand.i64	$D3,$D3,$MASK
+	 vshr.u64	$T1,$D0,#26
+	 vand.i64	$D0,$D0,$MASK
+	vadd.i64	$D4,$D4,$T0		@ h3 -> h4
+	 vadd.i64	$D1,$D1,$T1		@ h0 -> h1
+
+	vshr.u64	$T0,$D4,#26
+	vand.i64	$D4,$D4,$MASK
+	 vshr.u64	$T1,$D1,#26
+	 vand.i64	$D1,$D1,$MASK
+	 vadd.i64	$D2,$D2,$T1		@ h1 -> h2
+
+	vadd.i64	$D0,$D0,$T0
+	vshl.u64	$T0,$T0,#2
+	 vshr.u64	$T1,$D2,#26
+	 vand.i64	$D2,$D2,$MASK
+	vadd.i64	$D0,$D0,$T0		@ h4 -> h0
+	 vadd.i64	$D3,$D3,$T1		@ h2 -> h3
+
+	vshr.u64	$T0,$D0,#26
+	vand.i64	$D0,$D0,$MASK
+	 vshr.u64	$T1,$D3,#26
+	 vand.i64	$D3,$D3,$MASK
+	vadd.i64	$D1,$D1,$T0		@ h0 -> h1
+	 vadd.i64	$D4,$D4,$T1		@ h3 -> h4
+
+	cmp		$len,#0
+	bne		.Leven
+
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ store hash value
+
+	vst4.32		{$D0#lo[0],$D1#lo[0],$D2#lo[0],$D3#lo[0]},[$ctx]!
+	vst1.32		{$D4#lo[0]},[$ctx]
+
+	vldmia	sp!,{d8-d15}			@ epilogue
+	ldmia	sp!,{r4-r7}
+	ret					@ bx	lr
+.size	poly1305_blocks_neon,.-poly1305_blocks_neon
+
+.align	5
+.Lzeros:
+.long	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
+#ifndef	__KERNEL__
+.LOPENSSL_armcap:
+# ifdef	_WIN32
+.word	OPENSSL_armcap_P
+# else
+.word	OPENSSL_armcap_P-.Lpoly1305_init
+# endif
+.comm	OPENSSL_armcap_P,4,4
+.hidden	OPENSSL_armcap_P
+#endif
+#endif
+___
+}	}
+$code.=<<___;
+.asciz	"Poly1305 for ARMv4/NEON, CRYPTOGAMS by \@dot-asm"
+.align	2
+___
+
+foreach (split("\n",$code)) {
+	s/\`([^\`]*)\`/eval $1/geo;
+
+	s/\bq([0-9]+)#(lo|hi)/sprintf "d%d",2*$1+($2 eq "hi")/geo	or
+	s/\bret\b/bx	lr/go						or
+	s/\bbx\s+lr\b/.word\t0xe12fff1e/go;	# make it possible to compile with -march=armv4
+
+	print $_,"\n";
+}
+close STDOUT; # enforce flush
diff --git a/arch/arm/crypto/poly1305-core.S_shipped b/arch/arm/crypto/poly1305-core.S_shipped
new file mode 100644
index 000000000000..37b71d990293
--- /dev/null
+++ b/arch/arm/crypto/poly1305-core.S_shipped
@@ -0,0 +1,1158 @@
+#ifndef	__KERNEL__
+# include "arm_arch.h"
+#else
+# define __ARM_ARCH__ __LINUX_ARM_ARCH__
+# define __ARM_MAX_ARCH__ __LINUX_ARM_ARCH__
+# define poly1305_init   poly1305_init_arm
+# define poly1305_blocks poly1305_blocks_arm
+# define poly1305_emit   poly1305_emit_arm
+.globl	poly1305_blocks_neon
+#endif
+
+#if defined(__thumb2__)
+.syntax	unified
+.thumb
+#else
+.code	32
+#endif
+
+.text
+
+.globl	poly1305_emit
+.globl	poly1305_blocks
+.globl	poly1305_init
+.type	poly1305_init,%function
+.align	5
+poly1305_init:
+.Lpoly1305_init:
+	stmdb	sp!,{r4-r11}
+
+	eor	r3,r3,r3
+	cmp	r1,#0
+	str	r3,[r0,#0]		@ zero hash value
+	str	r3,[r0,#4]
+	str	r3,[r0,#8]
+	str	r3,[r0,#12]
+	str	r3,[r0,#16]
+	str	r3,[r0,#36]		@ clear is_base2_26
+	add	r0,r0,#20
+
+#ifdef	__thumb2__
+	it	eq
+#endif
+	moveq	r0,#0
+	beq	.Lno_key
+
+#if	__ARM_MAX_ARCH__>=7
+	mov	r3,#-1
+	str	r3,[r0,#28]		@ impossible key power value
+# ifndef __KERNEL__
+	adr	r11,.Lpoly1305_init
+	ldr	r12,.LOPENSSL_armcap
+# endif
+#endif
+	ldrb	r4,[r1,#0]
+	mov	r10,#0x0fffffff
+	ldrb	r5,[r1,#1]
+	and	r3,r10,#-4		@ 0x0ffffffc
+	ldrb	r6,[r1,#2]
+	ldrb	r7,[r1,#3]
+	orr	r4,r4,r5,lsl#8
+	ldrb	r5,[r1,#4]
+	orr	r4,r4,r6,lsl#16
+	ldrb	r6,[r1,#5]
+	orr	r4,r4,r7,lsl#24
+	ldrb	r7,[r1,#6]
+	and	r4,r4,r10
+
+#if	__ARM_MAX_ARCH__>=7 && !defined(__KERNEL__)
+# if !defined(_WIN32)
+	ldr	r12,[r11,r12]		@ OPENSSL_armcap_P
+# endif
+# if defined(__APPLE__) || defined(_WIN32)
+	ldr	r12,[r12]
+# endif
+#endif
+	ldrb	r8,[r1,#7]
+	orr	r5,r5,r6,lsl#8
+	ldrb	r6,[r1,#8]
+	orr	r5,r5,r7,lsl#16
+	ldrb	r7,[r1,#9]
+	orr	r5,r5,r8,lsl#24
+	ldrb	r8,[r1,#10]
+	and	r5,r5,r3
+
+#if	__ARM_MAX_ARCH__>=7 && !defined(__KERNEL__)
+	tst	r12,#ARMV7_NEON		@ check for NEON
+# ifdef	__thumb2__
+	adr	r9,.Lpoly1305_blocks_neon
+	adr	r11,.Lpoly1305_blocks
+	it	ne
+	movne	r11,r9
+	adr	r12,.Lpoly1305_emit
+	orr	r11,r11,#1		@ thumb-ify addresses
+	orr	r12,r12,#1
+# else
+	add	r12,r11,#(.Lpoly1305_emit-.Lpoly1305_init)
+	ite	eq
+	addeq	r11,r11,#(.Lpoly1305_blocks-.Lpoly1305_init)
+	addne	r11,r11,#(.Lpoly1305_blocks_neon-.Lpoly1305_init)
+# endif
+#endif
+	ldrb	r9,[r1,#11]
+	orr	r6,r6,r7,lsl#8
+	ldrb	r7,[r1,#12]
+	orr	r6,r6,r8,lsl#16
+	ldrb	r8,[r1,#13]
+	orr	r6,r6,r9,lsl#24
+	ldrb	r9,[r1,#14]
+	and	r6,r6,r3
+
+	ldrb	r10,[r1,#15]
+	orr	r7,r7,r8,lsl#8
+	str	r4,[r0,#0]
+	orr	r7,r7,r9,lsl#16
+	str	r5,[r0,#4]
+	orr	r7,r7,r10,lsl#24
+	str	r6,[r0,#8]
+	and	r7,r7,r3
+	str	r7,[r0,#12]
+#if	__ARM_MAX_ARCH__>=7 && !defined(__KERNEL__)
+	stmia	r2,{r11,r12}		@ fill functions table
+	mov	r0,#1
+#else
+	mov	r0,#0
+#endif
+.Lno_key:
+	ldmia	sp!,{r4-r11}
+#if	__ARM_ARCH__>=5
+	bx	lr				@ bx	lr
+#else
+	tst	lr,#1
+	moveq	pc,lr			@ be binary compatible with V4, yet
+	.word	0xe12fff1e			@ interoperable with Thumb ISA:-)
+#endif
+.size	poly1305_init,.-poly1305_init
+.type	poly1305_blocks,%function
+.align	5
+poly1305_blocks:
+.Lpoly1305_blocks:
+	stmdb	sp!,{r3-r11,lr}
+
+	ands	r2,r2,#-16
+	beq	.Lno_data
+
+	add	r2,r2,r1		@ end pointer
+	sub	sp,sp,#32
+
+#if __ARM_ARCH__<7
+	ldmia	r0,{r4-r12}		@ load context
+	add	r0,r0,#20
+	str	r2,[sp,#16]		@ offload stuff
+	str	r0,[sp,#12]
+#else
+	ldr	lr,[r0,#36]		@ is_base2_26
+	ldmia	r0!,{r4-r8}		@ load hash value
+	str	r2,[sp,#16]		@ offload stuff
+	str	r0,[sp,#12]
+
+	adds	r9,r4,r5,lsl#26	@ base 2^26 -> base 2^32
+	mov	r10,r5,lsr#6
+	adcs	r10,r10,r6,lsl#20
+	mov	r11,r6,lsr#12
+	adcs	r11,r11,r7,lsl#14
+	mov	r12,r7,lsr#18
+	adcs	r12,r12,r8,lsl#8
+	mov	r2,#0
+	teq	lr,#0
+	str	r2,[r0,#16]		@ clear is_base2_26
+	adc	r2,r2,r8,lsr#24
+
+	itttt	ne
+	movne	r4,r9			@ choose between radixes
+	movne	r5,r10
+	movne	r6,r11
+	movne	r7,r12
+	ldmia	r0,{r9-r12}		@ load key
+	it	ne
+	movne	r8,r2
+#endif
+
+	mov	lr,r1
+	cmp	r3,#0
+	str	r10,[sp,#20]
+	str	r11,[sp,#24]
+	str	r12,[sp,#28]
+	b	.Loop
+
+.align	4
+.Loop:
+#if __ARM_ARCH__<7
+	ldrb	r0,[lr],#16		@ load input
+# ifdef	__thumb2__
+	it	hi
+# endif
+	addhi	r8,r8,#1		@ 1<<128
+	ldrb	r1,[lr,#-15]
+	ldrb	r2,[lr,#-14]
+	ldrb	r3,[lr,#-13]
+	orr	r1,r0,r1,lsl#8
+	ldrb	r0,[lr,#-12]
+	orr	r2,r1,r2,lsl#16
+	ldrb	r1,[lr,#-11]
+	orr	r3,r2,r3,lsl#24
+	ldrb	r2,[lr,#-10]
+	adds	r4,r4,r3		@ accumulate input
+
+	ldrb	r3,[lr,#-9]
+	orr	r1,r0,r1,lsl#8
+	ldrb	r0,[lr,#-8]
+	orr	r2,r1,r2,lsl#16
+	ldrb	r1,[lr,#-7]
+	orr	r3,r2,r3,lsl#24
+	ldrb	r2,[lr,#-6]
+	adcs	r5,r5,r3
+
+	ldrb	r3,[lr,#-5]
+	orr	r1,r0,r1,lsl#8
+	ldrb	r0,[lr,#-4]
+	orr	r2,r1,r2,lsl#16
+	ldrb	r1,[lr,#-3]
+	orr	r3,r2,r3,lsl#24
+	ldrb	r2,[lr,#-2]
+	adcs	r6,r6,r3
+
+	ldrb	r3,[lr,#-1]
+	orr	r1,r0,r1,lsl#8
+	str	lr,[sp,#8]		@ offload input pointer
+	orr	r2,r1,r2,lsl#16
+	add	r10,r10,r10,lsr#2
+	orr	r3,r2,r3,lsl#24
+#else
+	ldr	r0,[lr],#16		@ load input
+	it	hi
+	addhi	r8,r8,#1		@ padbit
+	ldr	r1,[lr,#-12]
+	ldr	r2,[lr,#-8]
+	ldr	r3,[lr,#-4]
+# ifdef	__ARMEB__
+	rev	r0,r0
+	rev	r1,r1
+	rev	r2,r2
+	rev	r3,r3
+# endif
+	adds	r4,r4,r0		@ accumulate input
+	str	lr,[sp,#8]		@ offload input pointer
+	adcs	r5,r5,r1
+	add	r10,r10,r10,lsr#2
+	adcs	r6,r6,r2
+#endif
+	add	r11,r11,r11,lsr#2
+	adcs	r7,r7,r3
+	add	r12,r12,r12,lsr#2
+
+	umull	r2,r3,r5,r9
+	 adc	r8,r8,#0
+	umull	r0,r1,r4,r9
+	umlal	r2,r3,r8,r10
+	umlal	r0,r1,r7,r10
+	ldr	r10,[sp,#20]		@ reload r10
+	umlal	r2,r3,r6,r12
+	umlal	r0,r1,r5,r12
+	umlal	r2,r3,r7,r11
+	umlal	r0,r1,r6,r11
+	umlal	r2,r3,r4,r10
+	str	r0,[sp,#0]		@ future r4
+	 mul	r0,r11,r8
+	ldr	r11,[sp,#24]		@ reload r11
+	adds	r2,r2,r1		@ d1+=d0>>32
+	 eor	r1,r1,r1
+	adc	lr,r3,#0		@ future r6
+	str	r2,[sp,#4]		@ future r5
+
+	mul	r2,r12,r8
+	eor	r3,r3,r3
+	umlal	r0,r1,r7,r12
+	ldr	r12,[sp,#28]		@ reload r12
+	umlal	r2,r3,r7,r9
+	umlal	r0,r1,r6,r9
+	umlal	r2,r3,r6,r10
+	umlal	r0,r1,r5,r10
+	umlal	r2,r3,r5,r11
+	umlal	r0,r1,r4,r11
+	umlal	r2,r3,r4,r12
+	ldr	r4,[sp,#0]
+	mul	r8,r9,r8
+	ldr	r5,[sp,#4]
+
+	adds	r6,lr,r0		@ d2+=d1>>32
+	ldr	lr,[sp,#8]		@ reload input pointer
+	adc	r1,r1,#0
+	adds	r7,r2,r1		@ d3+=d2>>32
+	ldr	r0,[sp,#16]		@ reload end pointer
+	adc	r3,r3,#0
+	add	r8,r8,r3		@ h4+=d3>>32
+
+	and	r1,r8,#-4
+	and	r8,r8,#3
+	add	r1,r1,r1,lsr#2		@ *=5
+	adds	r4,r4,r1
+	adcs	r5,r5,#0
+	adcs	r6,r6,#0
+	adcs	r7,r7,#0
+	adc	r8,r8,#0
+
+	cmp	r0,lr			@ done yet?
+	bhi	.Loop
+
+	ldr	r0,[sp,#12]
+	add	sp,sp,#32
+	stmdb	r0,{r4-r8}		@ store the result
+
+.Lno_data:
+#if	__ARM_ARCH__>=5
+	ldmia	sp!,{r3-r11,pc}
+#else
+	ldmia	sp!,{r3-r11,lr}
+	tst	lr,#1
+	moveq	pc,lr			@ be binary compatible with V4, yet
+	.word	0xe12fff1e			@ interoperable with Thumb ISA:-)
+#endif
+.size	poly1305_blocks,.-poly1305_blocks
+.type	poly1305_emit,%function
+.align	5
+poly1305_emit:
+.Lpoly1305_emit:
+	stmdb	sp!,{r4-r11}
+
+	ldmia	r0,{r3-r7}
+
+#if __ARM_ARCH__>=7
+	ldr	ip,[r0,#36]		@ is_base2_26
+
+	adds	r8,r3,r4,lsl#26	@ base 2^26 -> base 2^32
+	mov	r9,r4,lsr#6
+	adcs	r9,r9,r5,lsl#20
+	mov	r10,r5,lsr#12
+	adcs	r10,r10,r6,lsl#14
+	mov	r11,r6,lsr#18
+	adcs	r11,r11,r7,lsl#8
+	mov	r0,#0
+	adc	r0,r0,r7,lsr#24
+
+	tst	ip,ip
+	itttt	ne
+	movne	r3,r8
+	movne	r4,r9
+	movne	r5,r10
+	movne	r6,r11
+	it	ne
+	movne	r7,r0
+#endif
+
+	adds	r8,r3,#5		@ compare to modulus
+	adcs	r9,r4,#0
+	adcs	r10,r5,#0
+	adcs	r11,r6,#0
+	adc	r0,r7,#0
+	tst	r0,#4			@ did it carry/borrow?
+
+#ifdef	__thumb2__
+	it	ne
+#endif
+	movne	r3,r8
+	ldr	r8,[r2,#0]
+#ifdef	__thumb2__
+	it	ne
+#endif
+	movne	r4,r9
+	ldr	r9,[r2,#4]
+#ifdef	__thumb2__
+	it	ne
+#endif
+	movne	r5,r10
+	ldr	r10,[r2,#8]
+#ifdef	__thumb2__
+	it	ne
+#endif
+	movne	r6,r11
+	ldr	r11,[r2,#12]
+
+	adds	r3,r3,r8
+	adcs	r4,r4,r9
+	adcs	r5,r5,r10
+	adc	r6,r6,r11
+
+#if __ARM_ARCH__>=7
+# ifdef __ARMEB__
+	rev	r3,r3
+	rev	r4,r4
+	rev	r5,r5
+	rev	r6,r6
+# endif
+	str	r3,[r1,#0]
+	str	r4,[r1,#4]
+	str	r5,[r1,#8]
+	str	r6,[r1,#12]
+#else
+	strb	r3,[r1,#0]
+	mov	r3,r3,lsr#8
+	strb	r4,[r1,#4]
+	mov	r4,r4,lsr#8
+	strb	r5,[r1,#8]
+	mov	r5,r5,lsr#8
+	strb	r6,[r1,#12]
+	mov	r6,r6,lsr#8
+
+	strb	r3,[r1,#1]
+	mov	r3,r3,lsr#8
+	strb	r4,[r1,#5]
+	mov	r4,r4,lsr#8
+	strb	r5,[r1,#9]
+	mov	r5,r5,lsr#8
+	strb	r6,[r1,#13]
+	mov	r6,r6,lsr#8
+
+	strb	r3,[r1,#2]
+	mov	r3,r3,lsr#8
+	strb	r4,[r1,#6]
+	mov	r4,r4,lsr#8
+	strb	r5,[r1,#10]
+	mov	r5,r5,lsr#8
+	strb	r6,[r1,#14]
+	mov	r6,r6,lsr#8
+
+	strb	r3,[r1,#3]
+	strb	r4,[r1,#7]
+	strb	r5,[r1,#11]
+	strb	r6,[r1,#15]
+#endif
+	ldmia	sp!,{r4-r11}
+#if	__ARM_ARCH__>=5
+	bx	lr				@ bx	lr
+#else
+	tst	lr,#1
+	moveq	pc,lr			@ be binary compatible with V4, yet
+	.word	0xe12fff1e			@ interoperable with Thumb ISA:-)
+#endif
+.size	poly1305_emit,.-poly1305_emit
+#if	__ARM_MAX_ARCH__>=7
+.fpu	neon
+
+.type	poly1305_init_neon,%function
+.align	5
+poly1305_init_neon:
+.Lpoly1305_init_neon:
+	ldr	r3,[r0,#48]		@ first table element
+	cmp	r3,#-1			@ is value impossible?
+	bne	.Lno_init_neon
+
+	ldr	r4,[r0,#20]		@ load key base 2^32
+	ldr	r5,[r0,#24]
+	ldr	r6,[r0,#28]
+	ldr	r7,[r0,#32]
+
+	and	r2,r4,#0x03ffffff	@ base 2^32 -> base 2^26
+	mov	r3,r4,lsr#26
+	mov	r4,r5,lsr#20
+	orr	r3,r3,r5,lsl#6
+	mov	r5,r6,lsr#14
+	orr	r4,r4,r6,lsl#12
+	mov	r6,r7,lsr#8
+	orr	r5,r5,r7,lsl#18
+	and	r3,r3,#0x03ffffff
+	and	r4,r4,#0x03ffffff
+	and	r5,r5,#0x03ffffff
+
+	vdup.32	d0,r2			@ r^1 in both lanes
+	add	r2,r3,r3,lsl#2		@ *5
+	vdup.32	d1,r3
+	add	r3,r4,r4,lsl#2
+	vdup.32	d2,r2
+	vdup.32	d3,r4
+	add	r4,r5,r5,lsl#2
+	vdup.32	d4,r3
+	vdup.32	d5,r5
+	add	r5,r6,r6,lsl#2
+	vdup.32	d6,r4
+	vdup.32	d7,r6
+	vdup.32	d8,r5
+
+	mov	r5,#2		@ counter
+
+.Lsquare_neon:
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4
+	@ d1 = h1*r0 + h0*r1   + h4*5*r2 + h3*5*r3 + h2*5*r4
+	@ d2 = h2*r0 + h1*r1   + h0*r2   + h4*5*r3 + h3*5*r4
+	@ d3 = h3*r0 + h2*r1   + h1*r2   + h0*r3   + h4*5*r4
+	@ d4 = h4*r0 + h3*r1   + h2*r2   + h1*r3   + h0*r4
+
+	vmull.u32	q5,d0,d0[1]
+	vmull.u32	q6,d1,d0[1]
+	vmull.u32	q7,d3,d0[1]
+	vmull.u32	q8,d5,d0[1]
+	vmull.u32	q9,d7,d0[1]
+
+	vmlal.u32	q5,d7,d2[1]
+	vmlal.u32	q6,d0,d1[1]
+	vmlal.u32	q7,d1,d1[1]
+	vmlal.u32	q8,d3,d1[1]
+	vmlal.u32	q9,d5,d1[1]
+
+	vmlal.u32	q5,d5,d4[1]
+	vmlal.u32	q6,d7,d4[1]
+	vmlal.u32	q8,d1,d3[1]
+	vmlal.u32	q7,d0,d3[1]
+	vmlal.u32	q9,d3,d3[1]
+
+	vmlal.u32	q5,d3,d6[1]
+	vmlal.u32	q8,d0,d5[1]
+	vmlal.u32	q6,d5,d6[1]
+	vmlal.u32	q7,d7,d6[1]
+	vmlal.u32	q9,d1,d5[1]
+
+	vmlal.u32	q8,d7,d8[1]
+	vmlal.u32	q5,d1,d8[1]
+	vmlal.u32	q6,d3,d8[1]
+	vmlal.u32	q7,d5,d8[1]
+	vmlal.u32	q9,d0,d7[1]
+
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ lazy reduction as discussed in "NEON crypto" by D.J. Bernstein
+	@ and P. Schwabe
+	@
+	@ H0>>+H1>>+H2>>+H3>>+H4
+	@ H3>>+H4>>*5+H0>>+H1
+	@
+	@ Trivia.
+	@
+	@ Result of multiplication of n-bit number by m-bit number is
+	@ n+m bits wide. However! Even though 2^n is a n+1-bit number,
+	@ m-bit number multiplied by 2^n is still n+m bits wide.
+	@
+	@ Sum of two n-bit numbers is n+1 bits wide, sum of three - n+2,
+	@ and so is sum of four. Sum of 2^m n-m-bit numbers and n-bit
+	@ one is n+1 bits wide.
+	@
+	@ >>+ denotes Hnext += Hn>>26, Hn &= 0x3ffffff. This means that
+	@ H0, H2, H3 are guaranteed to be 26 bits wide, while H1 and H4
+	@ can be 27. However! In cases when their width exceeds 26 bits
+	@ they are limited by 2^26+2^6. This in turn means that *sum*
+	@ of the products with these values can still be viewed as sum
+	@ of 52-bit numbers as long as the amount of addends is not a
+	@ power of 2. For example,
+	@
+	@ H4 = H4*R0 + H3*R1 + H2*R2 + H1*R3 + H0 * R4,
+	@
+	@ which can't be larger than 5 * (2^26 + 2^6) * (2^26 + 2^6), or
+	@ 5 * (2^52 + 2*2^32 + 2^12), which in turn is smaller than
+	@ 8 * (2^52) or 2^55. However, the value is then multiplied by
+	@ 5, so we should be looking at 5 * 5 * (2^52 + 2^33 + 2^12),
+	@ which is less than 32 * (2^52) or 2^57. And when processing
+	@ data we are looking at triple as many addends...
+	@
+	@ In key setup procedure pre-reduced H0 is limited by 5*4+1 and
+	@ 5*H4 - by 5*5 52-bit addends, or 57 bits. But when hashing the
+	@ input H0 is limited by (5*4+1)*3 addends, or 58 bits, while
+	@ 5*H4 by 5*5*3, or 59[!] bits. How is this relevant? vmlal.u32
+	@ instruction accepts 2x32-bit input and writes 2x64-bit result.
+	@ This means that result of reduction have to be compressed upon
+	@ loop wrap-around. This can be done in the process of reduction
+	@ to minimize amount of instructions [as well as amount of
+	@ 128-bit instructions, which benefits low-end processors], but
+	@ one has to watch for H2 (which is narrower than H0) and 5*H4
+	@ not being wider than 58 bits, so that result of right shift
+	@ by 26 bits fits in 32 bits. This is also useful on x86,
+	@ because it allows to use paddd in place for paddq, which
+	@ benefits Atom, where paddq is ridiculously slow.
+
+	vshr.u64	q15,q8,#26
+	vmovn.i64	d16,q8
+	 vshr.u64	q4,q5,#26
+	 vmovn.i64	d10,q5
+	vadd.i64	q9,q9,q15		@ h3 -> h4
+	vbic.i32	d16,#0xfc000000	@ &=0x03ffffff
+	 vadd.i64	q6,q6,q4		@ h0 -> h1
+	 vbic.i32	d10,#0xfc000000
+
+	vshrn.u64	d30,q9,#26
+	vmovn.i64	d18,q9
+	 vshr.u64	q4,q6,#26
+	 vmovn.i64	d12,q6
+	 vadd.i64	q7,q7,q4		@ h1 -> h2
+	vbic.i32	d18,#0xfc000000
+	 vbic.i32	d12,#0xfc000000
+
+	vadd.i32	d10,d10,d30
+	vshl.u32	d30,d30,#2
+	 vshrn.u64	d8,q7,#26
+	 vmovn.i64	d14,q7
+	vadd.i32	d10,d10,d30	@ h4 -> h0
+	 vadd.i32	d16,d16,d8	@ h2 -> h3
+	 vbic.i32	d14,#0xfc000000
+
+	vshr.u32	d30,d10,#26
+	vbic.i32	d10,#0xfc000000
+	 vshr.u32	d8,d16,#26
+	 vbic.i32	d16,#0xfc000000
+	vadd.i32	d12,d12,d30	@ h0 -> h1
+	 vadd.i32	d18,d18,d8	@ h3 -> h4
+
+	subs		r5,r5,#1
+	beq		.Lsquare_break_neon
+
+	add		r6,r0,#(48+0*9*4)
+	add		r7,r0,#(48+1*9*4)
+
+	vtrn.32		d0,d10		@ r^2:r^1
+	vtrn.32		d3,d14
+	vtrn.32		d5,d16
+	vtrn.32		d1,d12
+	vtrn.32		d7,d18
+
+	vshl.u32	d4,d3,#2		@ *5
+	vshl.u32	d6,d5,#2
+	vshl.u32	d2,d1,#2
+	vshl.u32	d8,d7,#2
+	vadd.i32	d4,d4,d3
+	vadd.i32	d2,d2,d1
+	vadd.i32	d6,d6,d5
+	vadd.i32	d8,d8,d7
+
+	vst4.32		{d0[0],d1[0],d2[0],d3[0]},[r6]!
+	vst4.32		{d0[1],d1[1],d2[1],d3[1]},[r7]!
+	vst4.32		{d4[0],d5[0],d6[0],d7[0]},[r6]!
+	vst4.32		{d4[1],d5[1],d6[1],d7[1]},[r7]!
+	vst1.32		{d8[0]},[r6,:32]
+	vst1.32		{d8[1]},[r7,:32]
+
+	b		.Lsquare_neon
+
+.align	4
+.Lsquare_break_neon:
+	add		r6,r0,#(48+2*4*9)
+	add		r7,r0,#(48+3*4*9)
+
+	vmov		d0,d10		@ r^4:r^3
+	vshl.u32	d2,d12,#2		@ *5
+	vmov		d1,d12
+	vshl.u32	d4,d14,#2
+	vmov		d3,d14
+	vshl.u32	d6,d16,#2
+	vmov		d5,d16
+	vshl.u32	d8,d18,#2
+	vmov		d7,d18
+	vadd.i32	d2,d2,d12
+	vadd.i32	d4,d4,d14
+	vadd.i32	d6,d6,d16
+	vadd.i32	d8,d8,d18
+
+	vst4.32		{d0[0],d1[0],d2[0],d3[0]},[r6]!
+	vst4.32		{d0[1],d1[1],d2[1],d3[1]},[r7]!
+	vst4.32		{d4[0],d5[0],d6[0],d7[0]},[r6]!
+	vst4.32		{d4[1],d5[1],d6[1],d7[1]},[r7]!
+	vst1.32		{d8[0]},[r6]
+	vst1.32		{d8[1]},[r7]
+
+.Lno_init_neon:
+	bx	lr				@ bx	lr
+.size	poly1305_init_neon,.-poly1305_init_neon
+
+.type	poly1305_blocks_neon,%function
+.align	5
+poly1305_blocks_neon:
+.Lpoly1305_blocks_neon:
+	ldr	ip,[r0,#36]		@ is_base2_26
+
+	cmp	r2,#64
+	blo	.Lpoly1305_blocks
+
+	stmdb	sp!,{r4-r7}
+	vstmdb	sp!,{d8-d15}		@ ABI specification says so
+
+	tst	ip,ip			@ is_base2_26?
+	bne	.Lbase2_26_neon
+
+	stmdb	sp!,{r1-r3,lr}
+	bl	.Lpoly1305_init_neon
+
+	ldr	r4,[r0,#0]		@ load hash value base 2^32
+	ldr	r5,[r0,#4]
+	ldr	r6,[r0,#8]
+	ldr	r7,[r0,#12]
+	ldr	ip,[r0,#16]
+
+	and	r2,r4,#0x03ffffff	@ base 2^32 -> base 2^26
+	mov	r3,r4,lsr#26
+	 veor	d10,d10,d10
+	mov	r4,r5,lsr#20
+	orr	r3,r3,r5,lsl#6
+	 veor	d12,d12,d12
+	mov	r5,r6,lsr#14
+	orr	r4,r4,r6,lsl#12
+	 veor	d14,d14,d14
+	mov	r6,r7,lsr#8
+	orr	r5,r5,r7,lsl#18
+	 veor	d16,d16,d16
+	and	r3,r3,#0x03ffffff
+	orr	r6,r6,ip,lsl#24
+	 veor	d18,d18,d18
+	and	r4,r4,#0x03ffffff
+	mov	r1,#1
+	and	r5,r5,#0x03ffffff
+	str	r1,[r0,#36]		@ set is_base2_26
+
+	vmov.32	d10[0],r2
+	vmov.32	d12[0],r3
+	vmov.32	d14[0],r4
+	vmov.32	d16[0],r5
+	vmov.32	d18[0],r6
+	adr	r5,.Lzeros
+
+	ldmia	sp!,{r1-r3,lr}
+	b	.Lhash_loaded
+
+.align	4
+.Lbase2_26_neon:
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ load hash value
+
+	veor		d10,d10,d10
+	veor		d12,d12,d12
+	veor		d14,d14,d14
+	veor		d16,d16,d16
+	veor		d18,d18,d18
+	vld4.32		{d10[0],d12[0],d14[0],d16[0]},[r0]!
+	adr		r5,.Lzeros
+	vld1.32		{d18[0]},[r0]
+	sub		r0,r0,#16		@ rewind
+
+.Lhash_loaded:
+	add		r4,r1,#32
+	mov		r3,r3,lsl#24
+	tst		r2,#31
+	beq		.Leven
+
+	vld4.32		{d20[0],d22[0],d24[0],d26[0]},[r1]!
+	vmov.32		d28[0],r3
+	sub		r2,r2,#16
+	add		r4,r1,#32
+
+# ifdef	__ARMEB__
+	vrev32.8	q10,q10
+	vrev32.8	q13,q13
+	vrev32.8	q11,q11
+	vrev32.8	q12,q12
+# endif
+	vsri.u32	d28,d26,#8	@ base 2^32 -> base 2^26
+	vshl.u32	d26,d26,#18
+
+	vsri.u32	d26,d24,#14
+	vshl.u32	d24,d24,#12
+	vadd.i32	d29,d28,d18	@ add hash value and move to #hi
+
+	vbic.i32	d26,#0xfc000000
+	vsri.u32	d24,d22,#20
+	vshl.u32	d22,d22,#6
+
+	vbic.i32	d24,#0xfc000000
+	vsri.u32	d22,d20,#26
+	vadd.i32	d27,d26,d16
+
+	vbic.i32	d20,#0xfc000000
+	vbic.i32	d22,#0xfc000000
+	vadd.i32	d25,d24,d14
+
+	vadd.i32	d21,d20,d10
+	vadd.i32	d23,d22,d12
+
+	mov		r7,r5
+	add		r6,r0,#48
+
+	cmp		r2,r2
+	b		.Long_tail
+
+.align	4
+.Leven:
+	subs		r2,r2,#64
+	it		lo
+	movlo		r4,r5
+
+	vmov.i32	q14,#1<<24		@ padbit, yes, always
+	vld4.32		{d20,d22,d24,d26},[r1]	@ inp[0:1]
+	add		r1,r1,#64
+	vld4.32		{d21,d23,d25,d27},[r4]	@ inp[2:3] (or 0)
+	add		r4,r4,#64
+	itt		hi
+	addhi		r7,r0,#(48+1*9*4)
+	addhi		r6,r0,#(48+3*9*4)
+
+# ifdef	__ARMEB__
+	vrev32.8	q10,q10
+	vrev32.8	q13,q13
+	vrev32.8	q11,q11
+	vrev32.8	q12,q12
+# endif
+	vsri.u32	q14,q13,#8		@ base 2^32 -> base 2^26
+	vshl.u32	q13,q13,#18
+
+	vsri.u32	q13,q12,#14
+	vshl.u32	q12,q12,#12
+
+	vbic.i32	q13,#0xfc000000
+	vsri.u32	q12,q11,#20
+	vshl.u32	q11,q11,#6
+
+	vbic.i32	q12,#0xfc000000
+	vsri.u32	q11,q10,#26
+
+	vbic.i32	q10,#0xfc000000
+	vbic.i32	q11,#0xfc000000
+
+	bls		.Lskip_loop
+
+	vld4.32		{d0[1],d1[1],d2[1],d3[1]},[r7]!	@ load r^2
+	vld4.32		{d0[0],d1[0],d2[0],d3[0]},[r6]!	@ load r^4
+	vld4.32		{d4[1],d5[1],d6[1],d7[1]},[r7]!
+	vld4.32		{d4[0],d5[0],d6[0],d7[0]},[r6]!
+	b		.Loop_neon
+
+.align	5
+.Loop_neon:
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2
+	@ ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^3+inp[7]*r
+	@   ___________________/
+	@ ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2+inp[8])*r^2
+	@ ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^4+inp[7]*r^2+inp[9])*r
+	@   ___________________/ ____________________/
+	@
+	@ Note that we start with inp[2:3]*r^2. This is because it
+	@ doesn't depend on reduction in previous iteration.
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ d4 = h4*r0 + h3*r1   + h2*r2   + h1*r3   + h0*r4
+	@ d3 = h3*r0 + h2*r1   + h1*r2   + h0*r3   + h4*5*r4
+	@ d2 = h2*r0 + h1*r1   + h0*r2   + h4*5*r3 + h3*5*r4
+	@ d1 = h1*r0 + h0*r1   + h4*5*r2 + h3*5*r3 + h2*5*r4
+	@ d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4
+
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ inp[2:3]*r^2
+
+	vadd.i32	d24,d24,d14	@ accumulate inp[0:1]
+	vmull.u32	q7,d25,d0[1]
+	vadd.i32	d20,d20,d10
+	vmull.u32	q5,d21,d0[1]
+	vadd.i32	d26,d26,d16
+	vmull.u32	q8,d27,d0[1]
+	vmlal.u32	q7,d23,d1[1]
+	vadd.i32	d22,d22,d12
+	vmull.u32	q6,d23,d0[1]
+
+	vadd.i32	d28,d28,d18
+	vmull.u32	q9,d29,d0[1]
+	subs		r2,r2,#64
+	vmlal.u32	q5,d29,d2[1]
+	it		lo
+	movlo		r4,r5
+	vmlal.u32	q8,d25,d1[1]
+	vld1.32		d8[1],[r7,:32]
+	vmlal.u32	q6,d21,d1[1]
+	vmlal.u32	q9,d27,d1[1]
+
+	vmlal.u32	q5,d27,d4[1]
+	vmlal.u32	q8,d23,d3[1]
+	vmlal.u32	q9,d25,d3[1]
+	vmlal.u32	q6,d29,d4[1]
+	vmlal.u32	q7,d21,d3[1]
+
+	vmlal.u32	q8,d21,d5[1]
+	vmlal.u32	q5,d25,d6[1]
+	vmlal.u32	q9,d23,d5[1]
+	vmlal.u32	q6,d27,d6[1]
+	vmlal.u32	q7,d29,d6[1]
+
+	vmlal.u32	q8,d29,d8[1]
+	vmlal.u32	q5,d23,d8[1]
+	vmlal.u32	q9,d21,d7[1]
+	vmlal.u32	q6,d25,d8[1]
+	vmlal.u32	q7,d27,d8[1]
+
+	vld4.32		{d21,d23,d25,d27},[r4]	@ inp[2:3] (or 0)
+	add		r4,r4,#64
+
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ (hash+inp[0:1])*r^4 and accumulate
+
+	vmlal.u32	q8,d26,d0[0]
+	vmlal.u32	q5,d20,d0[0]
+	vmlal.u32	q9,d28,d0[0]
+	vmlal.u32	q6,d22,d0[0]
+	vmlal.u32	q7,d24,d0[0]
+	vld1.32		d8[0],[r6,:32]
+
+	vmlal.u32	q8,d24,d1[0]
+	vmlal.u32	q5,d28,d2[0]
+	vmlal.u32	q9,d26,d1[0]
+	vmlal.u32	q6,d20,d1[0]
+	vmlal.u32	q7,d22,d1[0]
+
+	vmlal.u32	q8,d22,d3[0]
+	vmlal.u32	q5,d26,d4[0]
+	vmlal.u32	q9,d24,d3[0]
+	vmlal.u32	q6,d28,d4[0]
+	vmlal.u32	q7,d20,d3[0]
+
+	vmlal.u32	q8,d20,d5[0]
+	vmlal.u32	q5,d24,d6[0]
+	vmlal.u32	q9,d22,d5[0]
+	vmlal.u32	q6,d26,d6[0]
+	vmlal.u32	q8,d28,d8[0]
+
+	vmlal.u32	q7,d28,d6[0]
+	vmlal.u32	q5,d22,d8[0]
+	vmlal.u32	q9,d20,d7[0]
+	vmov.i32	q14,#1<<24		@ padbit, yes, always
+	vmlal.u32	q6,d24,d8[0]
+	vmlal.u32	q7,d26,d8[0]
+
+	vld4.32		{d20,d22,d24,d26},[r1]	@ inp[0:1]
+	add		r1,r1,#64
+# ifdef	__ARMEB__
+	vrev32.8	q10,q10
+	vrev32.8	q11,q11
+	vrev32.8	q12,q12
+	vrev32.8	q13,q13
+# endif
+
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ lazy reduction interleaved with base 2^32 -> base 2^26 of
+	@ inp[0:3] previously loaded to q10-q13 and smashed to q10-q14.
+
+	vshr.u64	q15,q8,#26
+	vmovn.i64	d16,q8
+	 vshr.u64	q4,q5,#26
+	 vmovn.i64	d10,q5
+	vadd.i64	q9,q9,q15		@ h3 -> h4
+	vbic.i32	d16,#0xfc000000
+	  vsri.u32	q14,q13,#8		@ base 2^32 -> base 2^26
+	 vadd.i64	q6,q6,q4		@ h0 -> h1
+	  vshl.u32	q13,q13,#18
+	 vbic.i32	d10,#0xfc000000
+
+	vshrn.u64	d30,q9,#26
+	vmovn.i64	d18,q9
+	 vshr.u64	q4,q6,#26
+	 vmovn.i64	d12,q6
+	 vadd.i64	q7,q7,q4		@ h1 -> h2
+	  vsri.u32	q13,q12,#14
+	vbic.i32	d18,#0xfc000000
+	  vshl.u32	q12,q12,#12
+	 vbic.i32	d12,#0xfc000000
+
+	vadd.i32	d10,d10,d30
+	vshl.u32	d30,d30,#2
+	  vbic.i32	q13,#0xfc000000
+	 vshrn.u64	d8,q7,#26
+	 vmovn.i64	d14,q7
+	vaddl.u32	q5,d10,d30	@ h4 -> h0 [widen for a sec]
+	  vsri.u32	q12,q11,#20
+	 vadd.i32	d16,d16,d8	@ h2 -> h3
+	  vshl.u32	q11,q11,#6
+	 vbic.i32	d14,#0xfc000000
+	  vbic.i32	q12,#0xfc000000
+
+	vshrn.u64	d30,q5,#26		@ re-narrow
+	vmovn.i64	d10,q5
+	  vsri.u32	q11,q10,#26
+	  vbic.i32	q10,#0xfc000000
+	 vshr.u32	d8,d16,#26
+	 vbic.i32	d16,#0xfc000000
+	vbic.i32	d10,#0xfc000000
+	vadd.i32	d12,d12,d30	@ h0 -> h1
+	 vadd.i32	d18,d18,d8	@ h3 -> h4
+	  vbic.i32	q11,#0xfc000000
+
+	bhi		.Loop_neon
+
+.Lskip_loop:
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ multiply (inp[0:1]+hash) or inp[2:3] by r^2:r^1
+
+	add		r7,r0,#(48+0*9*4)
+	add		r6,r0,#(48+1*9*4)
+	adds		r2,r2,#32
+	it		ne
+	movne		r2,#0
+	bne		.Long_tail
+
+	vadd.i32	d25,d24,d14	@ add hash value and move to #hi
+	vadd.i32	d21,d20,d10
+	vadd.i32	d27,d26,d16
+	vadd.i32	d23,d22,d12
+	vadd.i32	d29,d28,d18
+
+.Long_tail:
+	vld4.32		{d0[1],d1[1],d2[1],d3[1]},[r7]!	@ load r^1
+	vld4.32		{d0[0],d1[0],d2[0],d3[0]},[r6]!	@ load r^2
+
+	vadd.i32	d24,d24,d14	@ can be redundant
+	vmull.u32	q7,d25,d0
+	vadd.i32	d20,d20,d10
+	vmull.u32	q5,d21,d0
+	vadd.i32	d26,d26,d16
+	vmull.u32	q8,d27,d0
+	vadd.i32	d22,d22,d12
+	vmull.u32	q6,d23,d0
+	vadd.i32	d28,d28,d18
+	vmull.u32	q9,d29,d0
+
+	vmlal.u32	q5,d29,d2
+	vld4.32		{d4[1],d5[1],d6[1],d7[1]},[r7]!
+	vmlal.u32	q8,d25,d1
+	vld4.32		{d4[0],d5[0],d6[0],d7[0]},[r6]!
+	vmlal.u32	q6,d21,d1
+	vmlal.u32	q9,d27,d1
+	vmlal.u32	q7,d23,d1
+
+	vmlal.u32	q8,d23,d3
+	vld1.32		d8[1],[r7,:32]
+	vmlal.u32	q5,d27,d4
+	vld1.32		d8[0],[r6,:32]
+	vmlal.u32	q9,d25,d3
+	vmlal.u32	q6,d29,d4
+	vmlal.u32	q7,d21,d3
+
+	vmlal.u32	q8,d21,d5
+	 it		ne
+	 addne		r7,r0,#(48+2*9*4)
+	vmlal.u32	q5,d25,d6
+	 it		ne
+	 addne		r6,r0,#(48+3*9*4)
+	vmlal.u32	q9,d23,d5
+	vmlal.u32	q6,d27,d6
+	vmlal.u32	q7,d29,d6
+
+	vmlal.u32	q8,d29,d8
+	 vorn		q0,q0,q0	@ all-ones, can be redundant
+	vmlal.u32	q5,d23,d8
+	 vshr.u64	q0,q0,#38
+	vmlal.u32	q9,d21,d7
+	vmlal.u32	q6,d25,d8
+	vmlal.u32	q7,d27,d8
+
+	beq		.Lshort_tail
+
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ (hash+inp[0:1])*r^4:r^3 and accumulate
+
+	vld4.32		{d0[1],d1[1],d2[1],d3[1]},[r7]!	@ load r^3
+	vld4.32		{d0[0],d1[0],d2[0],d3[0]},[r6]!	@ load r^4
+
+	vmlal.u32	q7,d24,d0
+	vmlal.u32	q5,d20,d0
+	vmlal.u32	q8,d26,d0
+	vmlal.u32	q6,d22,d0
+	vmlal.u32	q9,d28,d0
+
+	vmlal.u32	q5,d28,d2
+	vld4.32		{d4[1],d5[1],d6[1],d7[1]},[r7]!
+	vmlal.u32	q8,d24,d1
+	vld4.32		{d4[0],d5[0],d6[0],d7[0]},[r6]!
+	vmlal.u32	q6,d20,d1
+	vmlal.u32	q9,d26,d1
+	vmlal.u32	q7,d22,d1
+
+	vmlal.u32	q8,d22,d3
+	vld1.32		d8[1],[r7,:32]
+	vmlal.u32	q5,d26,d4
+	vld1.32		d8[0],[r6,:32]
+	vmlal.u32	q9,d24,d3
+	vmlal.u32	q6,d28,d4
+	vmlal.u32	q7,d20,d3
+
+	vmlal.u32	q8,d20,d5
+	vmlal.u32	q5,d24,d6
+	vmlal.u32	q9,d22,d5
+	vmlal.u32	q6,d26,d6
+	vmlal.u32	q7,d28,d6
+
+	vmlal.u32	q8,d28,d8
+	 vorn		q0,q0,q0	@ all-ones
+	vmlal.u32	q5,d22,d8
+	 vshr.u64	q0,q0,#38
+	vmlal.u32	q9,d20,d7
+	vmlal.u32	q6,d24,d8
+	vmlal.u32	q7,d26,d8
+
+.Lshort_tail:
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ horizontal addition
+
+	vadd.i64	d16,d16,d17
+	vadd.i64	d10,d10,d11
+	vadd.i64	d18,d18,d19
+	vadd.i64	d12,d12,d13
+	vadd.i64	d14,d14,d15
+
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ lazy reduction, but without narrowing
+
+	vshr.u64	q15,q8,#26
+	vand.i64	q8,q8,q0
+	 vshr.u64	q4,q5,#26
+	 vand.i64	q5,q5,q0
+	vadd.i64	q9,q9,q15		@ h3 -> h4
+	 vadd.i64	q6,q6,q4		@ h0 -> h1
+
+	vshr.u64	q15,q9,#26
+	vand.i64	q9,q9,q0
+	 vshr.u64	q4,q6,#26
+	 vand.i64	q6,q6,q0
+	 vadd.i64	q7,q7,q4		@ h1 -> h2
+
+	vadd.i64	q5,q5,q15
+	vshl.u64	q15,q15,#2
+	 vshr.u64	q4,q7,#26
+	 vand.i64	q7,q7,q0
+	vadd.i64	q5,q5,q15		@ h4 -> h0
+	 vadd.i64	q8,q8,q4		@ h2 -> h3
+
+	vshr.u64	q15,q5,#26
+	vand.i64	q5,q5,q0
+	 vshr.u64	q4,q8,#26
+	 vand.i64	q8,q8,q0
+	vadd.i64	q6,q6,q15		@ h0 -> h1
+	 vadd.i64	q9,q9,q4		@ h3 -> h4
+
+	cmp		r2,#0
+	bne		.Leven
+
+	@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+	@ store hash value
+
+	vst4.32		{d10[0],d12[0],d14[0],d16[0]},[r0]!
+	vst1.32		{d18[0]},[r0]
+
+	vldmia	sp!,{d8-d15}			@ epilogue
+	ldmia	sp!,{r4-r7}
+	bx	lr					@ bx	lr
+.size	poly1305_blocks_neon,.-poly1305_blocks_neon
+
+.align	5
+.Lzeros:
+.long	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
+#ifndef	__KERNEL__
+.LOPENSSL_armcap:
+# ifdef	_WIN32
+.word	OPENSSL_armcap_P
+# else
+.word	OPENSSL_armcap_P-.Lpoly1305_init
+# endif
+.comm	OPENSSL_armcap_P,4,4
+.hidden	OPENSSL_armcap_P
+#endif
+#endif
+.asciz	"Poly1305 for ARMv4/NEON, CRYPTOGAMS by @dot-asm"
+.align	2
diff --git a/arch/arm/crypto/poly1305-glue.c b/arch/arm/crypto/poly1305-glue.c
new file mode 100644
index 000000000000..adff7d7865bc
--- /dev/null
+++ b/arch/arm/crypto/poly1305-glue.c
@@ -0,0 +1,253 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * OpenSSL/Cryptogams accelerated Poly1305 transform for ARM
+ *
+ * Copyright (C) 2019 Linaro Ltd. <ard.biesheuvel@linaro.org>
+ */
+
+#include <asm/hwcap.h>
+#include <asm/neon.h>
+#include <asm/simd.h>
+#include <asm/unaligned.h>
+#include <crypto/algapi.h>
+#include <crypto/internal/hash.h>
+#include <crypto/internal/simd.h>
+#include <linux/cpufeature.h>
+#include <linux/crypto.h>
+#include <linux/module.h>
+
+#define POLY1305_BLOCK_SIZE	16
+#define POLY1305_DIGEST_SIZE	16
+
+struct arm_poly1305_ctx {
+	/* the state owned by the accelerated code */
+	u64 state[24];
+	/* finalize key */
+	u32 s[4];
+	/* partial buffer */
+	u8 buf[POLY1305_BLOCK_SIZE];
+	/* bytes used in partial buffer */
+	unsigned int buflen;
+	/* r key has been set */
+	bool rset;
+	/* s key has been set */
+	bool sset;
+};
+
+asmlinkage void poly1305_init_arm(u64 *state, const u8 *key);
+asmlinkage void poly1305_blocks_arm(u64 *state, const u8 *src, u32 len, u32 hibit);
+asmlinkage void poly1305_blocks_neon(u64 *state, const u8 *src, u32 len, u32 hibit);
+asmlinkage void poly1305_emit_arm(u64 state[], __le32 *digest, const u32 *nonce);
+
+static int arm_poly1305_init(struct shash_desc *desc)
+{
+	struct arm_poly1305_ctx *dctx = shash_desc_ctx(desc);
+
+	dctx->buflen = 0;
+	dctx->rset = false;
+	dctx->sset = false;
+
+	return 0;
+}
+
+static void arm_poly1305_blocks(struct arm_poly1305_ctx *dctx, const u8 *src,
+				 u32 len, u32 hibit, bool do_neon)
+{
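+	/*
+	 * There is no setkey(): the first two 16-byte blocks of input
+	 * carry the key, r first and then s, before any message data
+	 * is processed.
+	 */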
+	if (unlikely(!dctx->sset)) {
+		if (!dctx->rset) {
+			poly1305_init_arm(dctx->state, src);
+			src += POLY1305_BLOCK_SIZE;
+			len -= POLY1305_BLOCK_SIZE;
+			dctx->rset = true;
+		}
+		if (len >= POLY1305_BLOCK_SIZE) {
+			dctx->s[0] = get_unaligned_le32(src +  0);
+			dctx->s[1] = get_unaligned_le32(src +  4);
+			dctx->s[2] = get_unaligned_le32(src +  8);
+			dctx->s[3] = get_unaligned_le32(src + 12);
+			src += POLY1305_BLOCK_SIZE;
+			len -= POLY1305_BLOCK_SIZE;
+			dctx->sset = true;
+		}
+		if (len < POLY1305_BLOCK_SIZE)
+			return;
+	}
+
+	len &= ~(POLY1305_BLOCK_SIZE - 1);
+
+	if (likely(do_neon))
+		poly1305_blocks_neon(dctx->state, src, len, hibit);
+	else
+		poly1305_blocks_arm(dctx->state, src, len, hibit);
+}
+
+static void arm_poly1305_do_update(struct arm_poly1305_ctx *dctx,
+				    const u8 *src, u32 len, bool do_neon)
+{
+	if (unlikely(dctx->buflen)) {
+		u32 bytes = min(len, POLY1305_BLOCK_SIZE - dctx->buflen);
+
+		memcpy(dctx->buf + dctx->buflen, src, bytes);
+		src += bytes;
+		len -= bytes;
+		dctx->buflen += bytes;
+
+		if (dctx->buflen == POLY1305_BLOCK_SIZE) {
+			arm_poly1305_blocks(dctx, dctx->buf,
+					     POLY1305_BLOCK_SIZE, 1, false);
+			dctx->buflen = 0;
+		}
+	}
+
+	if (likely(len >= POLY1305_BLOCK_SIZE)) {
+		arm_poly1305_blocks(dctx, src, len, 1, do_neon);
+		src += round_down(len, POLY1305_BLOCK_SIZE);
+		len %= POLY1305_BLOCK_SIZE;
+	}
+
+	if (unlikely(len)) {
+		dctx->buflen = len;
+		memcpy(dctx->buf, src, len);
+	}
+}
+
+static int arm_poly1305_update(struct shash_desc *desc,
+			       const u8 *src, unsigned int srclen)
+{
+	struct arm_poly1305_ctx *dctx = shash_desc_ctx(desc);
+
+	arm_poly1305_do_update(dctx, src, srclen, false);
+	return 0;
+}
+
+static int __maybe_unused arm_poly1305_update_neon(struct shash_desc *desc,
+						   const u8 *src,
+						   unsigned int srclen)
+{
+	struct arm_poly1305_ctx *dctx = shash_desc_ctx(desc);
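+	/*
+	 * Only use NEON when SIMD is usable in this context and the
+	 * input is long enough for it to pay off.
+	 */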
+	bool do_neon = crypto_simd_usable() && srclen > 128;
+
+	if (do_neon)
+		kernel_neon_begin();
+	arm_poly1305_do_update(dctx, src, srclen, do_neon);
+	if (do_neon)
+		kernel_neon_end();
+	return 0;
+}
+
+static
+int __maybe_unused arm_poly1305_update_from_sg_neon(struct shash_desc *desc,
+						    struct scatterlist *sg,
+						    unsigned int srclen,
+						    int flags)
+{
+	struct arm_poly1305_ctx *dctx = shash_desc_ctx(desc);
+	bool do_neon = crypto_simd_usable() && srclen > 128;
+	struct crypto_hash_walk walk;
+	int nbytes;
+
+	if (do_neon) {
+		kernel_neon_begin();
+		flags &= ~CRYPTO_TFM_REQ_MAY_SLEEP;
+	}
+
+	for (nbytes = crypto_shash_walk_sg(desc, sg, srclen, &walk, flags);
+	     nbytes > 0;
+	     nbytes = crypto_hash_walk_done(&walk, 0))
+		arm_poly1305_do_update(dctx, walk.data, nbytes, do_neon);
+
+	if (do_neon)
+		kernel_neon_end();
+
+	return 0;
+}
+
+static int arm_poly1305_final(struct shash_desc *desc, u8 *dst)
+{
+	struct arm_poly1305_ctx *dctx = shash_desc_ctx(desc);
+	__le32 digest[4];
+	u64 f = 0;
+
+	if (unlikely(!dctx->sset))
+		return -ENOKEY;
+
+	if (unlikely(dctx->buflen)) {
+		dctx->buf[dctx->buflen++] = 1;
+		memset(dctx->buf + dctx->buflen, 0,
+		       POLY1305_BLOCK_SIZE - dctx->buflen);
+		poly1305_blocks_arm(dctx->state, dctx->buf, POLY1305_BLOCK_SIZE, 0);
+	}
+
+	poly1305_emit_arm(dctx->state, digest, dctx->s);
+
+	/* mac = (h + s) % (2^128) */
+	f = (f >> 32) + le32_to_cpu(digest[0]);
+	put_unaligned_le32(f, dst);
+	f = (f >> 32) + le32_to_cpu(digest[1]);
+	put_unaligned_le32(f, dst + 4);
+	f = (f >> 32) + le32_to_cpu(digest[2]);
+	put_unaligned_le32(f, dst + 8);
+	f = (f >> 32) + le32_to_cpu(digest[3]);
+	put_unaligned_le32(f, dst + 12);
+
+	return 0;
+}
+
+static struct shash_alg arm_poly1305_algs[] = {{
+	.init			= arm_poly1305_init,
+	.update			= arm_poly1305_update,
+	.final			= arm_poly1305_final,
+	.digestsize		= POLY1305_DIGEST_SIZE,
+	.descsize		= sizeof(struct arm_poly1305_ctx),
+
+	.base.cra_name		= "poly1305",
+	.base.cra_driver_name	= "poly1305-arm",
+	.base.cra_priority	= 150,
+	.base.cra_blocksize	= POLY1305_BLOCK_SIZE,
+	.base.cra_module	= THIS_MODULE,
+#ifdef CONFIG_KERNEL_MODE_NEON
+}, {
+	.init			= arm_poly1305_init,
+	.update			= arm_poly1305_update_neon,
+	.update_from_sg		= arm_poly1305_update_from_sg_neon,
+	.final			= arm_poly1305_final,
+	.digestsize		= POLY1305_DIGEST_SIZE,
+	.descsize		= sizeof(struct arm_poly1305_ctx),
+
+	.base.cra_name		= "poly1305",
+	.base.cra_driver_name	= "poly1305-neon",
+	.base.cra_priority	= 200,
+	.base.cra_blocksize	= POLY1305_BLOCK_SIZE,
+	.base.cra_module	= THIS_MODULE,
+#endif
+}};
+
+static int __init arm_poly1305_mod_init(void)
+{
+	if (IS_ENABLED(CONFIG_KERNEL_MODE_NEON) && !(elf_hwcap & HWCAP_NEON))
+		/* register only the first entry */
+		return crypto_register_shash(&arm_poly1305_algs[0]);
+
+	return crypto_register_shashes(arm_poly1305_algs,
+				       ARRAY_SIZE(arm_poly1305_algs));
+}
+
+static void __exit arm_poly1305_mod_exit(void)
+{
+	if (IS_ENABLED(CONFIG_KERNEL_MODE_NEON) && !(elf_hwcap & HWCAP_NEON)) {
+		crypto_unregister_shash(&arm_poly1305_algs[0]);
+		return;
+	}
+	crypto_unregister_shashes(arm_poly1305_algs,
+				  ARRAY_SIZE(arm_poly1305_algs));
+}
+
+module_init(arm_poly1305_mod_init);
+module_exit(arm_poly1305_mod_exit);
+
+MODULE_LICENSE("GPL v2");
+MODULE_ALIAS_CRYPTO("poly1305");
+MODULE_ALIAS_CRYPTO("poly1305-arm");
+MODULE_ALIAS_CRYPTO("poly1305-neon");
-- 
2.20.1



* [RFC PATCH 04/18] crypto: arm64/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation
  2019-09-25 16:12 [RFC PATCH 00/18] crypto: wireguard using the existing crypto API Ard Biesheuvel
                   ` (2 preceding siblings ...)
  2019-09-25 16:12 ` [RFC PATCH 03/18] crypto: arm/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation Ard Biesheuvel
@ 2019-09-25 16:12 ` Ard Biesheuvel
  2019-09-25 16:12 ` [RFC PATCH 05/18] crypto: chacha - move existing library code into lib/crypto Ard Biesheuvel
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 61+ messages in thread
From: Ard Biesheuvel @ 2019-09-25 16:12 UTC (permalink / raw)
  To: linux-crypto
  Cc: Jason A . Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Ard Biesheuvel, Greg KH, Eric Biggers, Andy Polyakov,
	Samuel Neves, Will Deacon, Dan Carpenter, Andy Lutomirski,
	Marc Zyngier, Linus Torvalds, David Miller, linux-arm-kernel

This is a straight import of the OpenSSL/CRYPTOGAMS Poly1305 implementation
for NEON authored by Andy Polyakov, and contributed by him to the OpenSSL
project. The file 'poly1305-armv8.pl' is taken from the upstream
GitHub repository [0] at commit ec55a08dc0244ce570c4fc7cade330c60798952f,
and already contains all the changes required to build it as part of a
Linux kernel module.

[0] https://github.com/dot-asm/cryptogams

Co-developed-by: Andy Polyakov <appro@cryptogams.org>
Signed-off-by: Andy Polyakov <appro@cryptogams.org>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
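Note: with REGENERATE_ARM64_CRYPTO set, the Makefile rule added below
regenerates the shipped file from the perlasm source; run by hand from
arch/arm64/crypto, that rule amounts roughly to

    perl poly1305-armv8.pl void poly1305-core.S_shipped
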
 arch/arm64/crypto/Kconfig                 |   4 +
 arch/arm64/crypto/Makefile                |   9 +-
 arch/arm64/crypto/poly1305-armv8.pl       | 913 ++++++++++++++++++++
 arch/arm64/crypto/poly1305-core.S_shipped | 835 ++++++++++++++++++
 arch/arm64/crypto/poly1305-glue.c         | 215 +++++
 5 files changed, 1975 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index 4922c4451e7c..6ee2fdfd84aa 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -105,6 +105,10 @@ config CRYPTO_CHACHA20_NEON
 	select CRYPTO_BLKCIPHER
 	select CRYPTO_CHACHA20
 
+config CRYPTO_POLY1305_NEON
+	tristate "Poly1305 hash function using NEON instructions"
+	depends on KERNEL_MODE_NEON
+
 config CRYPTO_NHPOLY1305_NEON
 	tristate "NHPoly1305 hash function using NEON instructions (for Adiantum)"
 	depends on KERNEL_MODE_NEON
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index 0435f2a0610e..164d554422fe 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -50,6 +50,9 @@ sha512-arm64-y := sha512-glue.o sha512-core.o
 obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha-neon.o
 chacha-neon-y := chacha-neon-core.o chacha-neon-glue.o
 
+obj-$(CONFIG_CRYPTO_POLY1305_NEON) += poly1305-neon.o
+poly1305-neon-y := poly1305-core.o poly1305-glue.o
+
 obj-$(CONFIG_CRYPTO_NHPOLY1305_NEON) += nhpoly1305-neon.o
 nhpoly1305-neon-y := nh-neon-core.o nhpoly1305-neon-glue.o
 
@@ -68,11 +71,15 @@ ifdef REGENERATE_ARM64_CRYPTO
 quiet_cmd_perlasm = PERLASM $@
       cmd_perlasm = $(PERL) $(<) void $(@)
 
+$(src)/poly1305-core.S_shipped: $(src)/poly1305-armv8.pl
+	$(call cmd,perlasm)
+
 $(src)/sha256-core.S_shipped: $(src)/sha512-armv8.pl
 	$(call cmd,perlasm)
 
 $(src)/sha512-core.S_shipped: $(src)/sha512-armv8.pl
 	$(call cmd,perlasm)
+
 endif
 
-clean-files += sha256-core.S sha512-core.S
+clean-files += poly1305-core.S sha256-core.S sha512-core.S
diff --git a/arch/arm64/crypto/poly1305-armv8.pl b/arch/arm64/crypto/poly1305-armv8.pl
new file mode 100644
index 000000000000..6e5576d19af8
--- /dev/null
+++ b/arch/arm64/crypto/poly1305-armv8.pl
@@ -0,0 +1,913 @@
+#!/usr/bin/env perl
+# SPDX-License-Identifier: GPL-1.0+ OR BSD-3-Clause
+#
+# ====================================================================
+# Written by Andy Polyakov, @dot-asm, initially for the OpenSSL
+# project.
+# ====================================================================
+#
+# This module implements Poly1305 hash for ARMv8.
+#
+# June 2015
+#
+# Numbers are cycles per processed byte with poly1305_blocks alone.
+#
+#		IALU/gcc-4.9	NEON
+#
+# Apple A7	1.86/+5%	0.72
+# Cortex-A53	2.69/+58%	1.47
+# Cortex-A57	2.70/+7%	1.14
+# Denver	1.64/+50%	1.18(*)
+# X-Gene	2.13/+68%	2.27
+# Mongoose	1.77/+75%	1.12
+# Kryo		2.70/+55%	1.13
+# ThunderX2	1.17/+95%	1.36
+#
+# (*)	estimate based on resources availability is less than 1.0,
+#	i.e. measured result is worse than expected, presumably binary
+#	translator is not almighty;
+
+$flavour=shift;
+$output=shift;
+
+if ($flavour && $flavour ne "void") {
+    $0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1;
+    ( $xlate="${dir}arm-xlate.pl" and -f $xlate ) or
+    ( $xlate="${dir}../../perlasm/arm-xlate.pl" and -f $xlate) or
+    die "can't locate arm-xlate.pl";
+
+    open STDOUT,"| \"$^X\" $xlate $flavour $output";
+} else {
+    open STDOUT,">$output";
+}
+
+my ($ctx,$inp,$len,$padbit) = map("x$_",(0..3));
+my ($mac,$nonce)=($inp,$len);
+
+my ($h0,$h1,$h2,$r0,$r1,$s1,$t0,$t1,$d0,$d1,$d2) = map("x$_",(4..14));
+
+$code.=<<___;
+#ifndef __KERNEL__
+# include "arm_arch.h"
+.extern	OPENSSL_armcap_P
+#endif
+
+.text
+
+// forward "declarations" are required for Apple
+.globl	poly1305_blocks
+.globl	poly1305_emit
+
+.globl	poly1305_init
+.type	poly1305_init,%function
+.align	5
+poly1305_init:
+	cmp	$inp,xzr
+	stp	xzr,xzr,[$ctx]		// zero hash value
+	stp	xzr,xzr,[$ctx,#16]	// [along with is_base2_26]
+
+	csel	x0,xzr,x0,eq
+	b.eq	.Lno_key
+
+#ifndef	__KERNEL__
+	adrp	x17,OPENSSL_armcap_P
+	ldr	w17,[x17,#:lo12:OPENSSL_armcap_P]
+#endif
+
+	ldp	$r0,$r1,[$inp]		// load key
+	mov	$s1,#0xfffffffc0fffffff
+	movk	$s1,#0x0fff,lsl#48
+#ifdef	__AARCH64EB__
+	rev	$r0,$r0			// flip bytes
+	rev	$r1,$r1
+#endif
+	and	$r0,$r0,$s1		// &=0ffffffc0fffffff
+	and	$s1,$s1,#-4
+	and	$r1,$r1,$s1		// &=0ffffffc0ffffffc
+	mov	w#$s1,#-1
+	stp	$r0,$r1,[$ctx,#32]	// save key value
+	str	w#$s1,[$ctx,#48]	// impossible key power value
+
+#ifndef	__KERNEL__
+	tst	w17,#ARMV7_NEON
+
+	adr	$d0,.Lpoly1305_blocks
+	adr	$r0,.Lpoly1305_blocks_neon
+	adr	$d1,.Lpoly1305_emit
+
+	csel	$d0,$d0,$r0,eq
+
+# ifdef	__ILP32__
+	stp	w#$d0,w#$d1,[$len]
+# else
+	stp	$d0,$d1,[$len]
+# endif
+#endif
+	mov	x0,#1
+.Lno_key:
+	ret
+.size	poly1305_init,.-poly1305_init
+
+.type	poly1305_blocks,%function
+.align	5
+poly1305_blocks:
+.Lpoly1305_blocks:
+	ands	$len,$len,#-16
+	b.eq	.Lno_data
+
+	ldp	$h0,$h1,[$ctx]		// load hash value
+	ldp	$h2,x17,[$ctx,#16]	// [along with is_base2_26]
+	ldp	$r0,$r1,[$ctx,#32]	// load key value
+
+#ifdef	__AARCH64EB__
+	lsr	$d0,$h0,#32
+	mov	w#$d1,w#$h0
+	lsr	$d2,$h1,#32
+	mov	w15,w#$h1
+	lsr	x16,$h2,#32
+#else
+	mov	w#$d0,w#$h0
+	lsr	$d1,$h0,#32
+	mov	w#$d2,w#$h1
+	lsr	x15,$h1,#32
+	mov	w16,w#$h2
+#endif
+
+	add	$d0,$d0,$d1,lsl#26	// base 2^26 -> base 2^64
+	lsr	$d1,$d2,#12
+	adds	$d0,$d0,$d2,lsl#52
+	add	$d1,$d1,x15,lsl#14
+	adc	$d1,$d1,xzr
+	lsr	$d2,x16,#24
+	adds	$d1,$d1,x16,lsl#40
+	adc	$d2,$d2,xzr
+
+	cmp	x17,#0			// is_base2_26?
+	add	$s1,$r1,$r1,lsr#2	// s1 = r1 + (r1 >> 2)
+	csel	$h0,$h0,$d0,eq		// choose between radixes
+	csel	$h1,$h1,$d1,eq
+	csel	$h2,$h2,$d2,eq
+
+.Loop:
+	ldp	$t0,$t1,[$inp],#16	// load input
+	sub	$len,$len,#16
+#ifdef	__AARCH64EB__
+	rev	$t0,$t0
+	rev	$t1,$t1
+#endif
+	adds	$h0,$h0,$t0		// accumulate input
+	adcs	$h1,$h1,$t1
+
+	mul	$d0,$h0,$r0		// h0*r0
+	adc	$h2,$h2,$padbit
+	umulh	$d1,$h0,$r0
+
+	mul	$t0,$h1,$s1		// h1*5*r1
+	umulh	$t1,$h1,$s1
+
+	adds	$d0,$d0,$t0
+	mul	$t0,$h0,$r1		// h0*r1
+	adc	$d1,$d1,$t1
+	umulh	$d2,$h0,$r1
+
+	adds	$d1,$d1,$t0
+	mul	$t0,$h1,$r0		// h1*r0
+	adc	$d2,$d2,xzr
+	umulh	$t1,$h1,$r0
+
+	adds	$d1,$d1,$t0
+	mul	$t0,$h2,$s1		// h2*5*r1
+	adc	$d2,$d2,$t1
+	mul	$t1,$h2,$r0		// h2*r0
+
+	adds	$d1,$d1,$t0
+	adc	$d2,$d2,$t1
+
+	and	$t0,$d2,#-4		// final reduction
+	and	$h2,$d2,#3
+	add	$t0,$t0,$d2,lsr#2
+	adds	$h0,$d0,$t0
+	adcs	$h1,$d1,xzr
+	adc	$h2,$h2,xzr
+
+	cbnz	$len,.Loop
+
+	stp	$h0,$h1,[$ctx]		// store hash value
+	stp	$h2,xzr,[$ctx,#16]	// [and clear is_base2_26]
+
+.Lno_data:
+	ret
+.size	poly1305_blocks,.-poly1305_blocks
+
+.type	poly1305_emit,%function
+.align	5
+poly1305_emit:
+.Lpoly1305_emit:
+	ldp	$h0,$h1,[$ctx]		// load hash base 2^64
+	ldp	$h2,$r0,[$ctx,#16]	// [along with is_base2_26]
+	ldp	$t0,$t1,[$nonce]	// load nonce
+
+#ifdef	__AARCH64EB__
+	lsr	$d0,$h0,#32
+	mov	w#$d1,w#$h0
+	lsr	$d2,$h1,#32
+	mov	w15,w#$h1
+	lsr	x16,$h2,#32
+#else
+	mov	w#$d0,w#$h0
+	lsr	$d1,$h0,#32
+	mov	w#$d2,w#$h1
+	lsr	x15,$h1,#32
+	mov	w16,w#$h2
+#endif
+
+	add	$d0,$d0,$d1,lsl#26	// base 2^26 -> base 2^64
+	lsr	$d1,$d2,#12
+	adds	$d0,$d0,$d2,lsl#52
+	add	$d1,$d1,x15,lsl#14
+	adc	$d1,$d1,xzr
+	lsr	$d2,x16,#24
+	adds	$d1,$d1,x16,lsl#40
+	adc	$d2,$d2,xzr
+
+	cmp	$r0,#0			// is_base2_26?
+	csel	$h0,$h0,$d0,eq		// choose between radixes
+	csel	$h1,$h1,$d1,eq
+	csel	$h2,$h2,$d2,eq
+
+	adds	$d0,$h0,#5		// compare to modulus
+	adcs	$d1,$h1,xzr
+	adc	$d2,$h2,xzr
+
+	tst	$d2,#-4			// see if it's carried/borrowed
+
+	csel	$h0,$h0,$d0,eq
+	csel	$h1,$h1,$d1,eq
+
+#ifdef	__AARCH64EB__
+	ror	$t0,$t0,#32		// flip nonce words
+	ror	$t1,$t1,#32
+#endif
+	adds	$h0,$h0,$t0		// accumulate nonce
+	adc	$h1,$h1,$t1
+#ifdef	__AARCH64EB__
+	rev	$h0,$h0			// flip output bytes
+	rev	$h1,$h1
+#endif
+	stp	$h0,$h1,[$mac]		// write result
+
+	ret
+.size	poly1305_emit,.-poly1305_emit
+___
+my ($R0,$R1,$S1,$R2,$S2,$R3,$S3,$R4,$S4) = map("v$_.4s",(0..8));
+my ($IN01_0,$IN01_1,$IN01_2,$IN01_3,$IN01_4) = map("v$_.2s",(9..13));
+my ($IN23_0,$IN23_1,$IN23_2,$IN23_3,$IN23_4) = map("v$_.2s",(14..18));
+my ($ACC0,$ACC1,$ACC2,$ACC3,$ACC4) = map("v$_.2d",(19..23));
+my ($H0,$H1,$H2,$H3,$H4) = map("v$_.2s",(24..28));
+my ($T0,$T1,$MASK) = map("v$_",(29..31));
+
+my ($in2,$zeros)=("x16","x17");
+my $is_base2_26 = $zeros;		# borrow
+
+$code.=<<___;
+.type	poly1305_mult,%function
+.align	5
+poly1305_mult:
+	mul	$d0,$h0,$r0		// h0*r0
+	umulh	$d1,$h0,$r0
+
+	mul	$t0,$h1,$s1		// h1*5*r1
+	umulh	$t1,$h1,$s1
+
+	adds	$d0,$d0,$t0
+	mul	$t0,$h0,$r1		// h0*r1
+	adc	$d1,$d1,$t1
+	umulh	$d2,$h0,$r1
+
+	adds	$d1,$d1,$t0
+	mul	$t0,$h1,$r0		// h1*r0
+	adc	$d2,$d2,xzr
+	umulh	$t1,$h1,$r0
+
+	adds	$d1,$d1,$t0
+	mul	$t0,$h2,$s1		// h2*5*r1
+	adc	$d2,$d2,$t1
+	mul	$t1,$h2,$r0		// h2*r0
+
+	adds	$d1,$d1,$t0
+	adc	$d2,$d2,$t1
+
+	and	$t0,$d2,#-4		// final reduction
+	and	$h2,$d2,#3
+	add	$t0,$t0,$d2,lsr#2
+	adds	$h0,$d0,$t0
+	adcs	$h1,$d1,xzr
+	adc	$h2,$h2,xzr
+
+	ret
+.size	poly1305_mult,.-poly1305_mult
+
+.type	poly1305_splat,%function
+.align	4
+poly1305_splat:
+	and	x12,$h0,#0x03ffffff	// base 2^64 -> base 2^26
+	ubfx	x13,$h0,#26,#26
+	extr	x14,$h1,$h0,#52
+	and	x14,x14,#0x03ffffff
+	ubfx	x15,$h1,#14,#26
+	extr	x16,$h2,$h1,#40
+
+	str	w12,[$ctx,#16*0]	// r0
+	add	w12,w13,w13,lsl#2	// r1*5
+	str	w13,[$ctx,#16*1]	// r1
+	add	w13,w14,w14,lsl#2	// r2*5
+	str	w12,[$ctx,#16*2]	// s1
+	str	w14,[$ctx,#16*3]	// r2
+	add	w14,w15,w15,lsl#2	// r3*5
+	str	w13,[$ctx,#16*4]	// s2
+	str	w15,[$ctx,#16*5]	// r3
+	add	w15,w16,w16,lsl#2	// r4*5
+	str	w14,[$ctx,#16*6]	// s3
+	str	w16,[$ctx,#16*7]	// r4
+	str	w15,[$ctx,#16*8]	// s4
+
+	ret
+.size	poly1305_splat,.-poly1305_splat
+
+#ifdef	__KERNEL__
+.globl	poly1305_blocks_neon
+#endif
+.type	poly1305_blocks_neon,%function
+.align	5
+poly1305_blocks_neon:
+.Lpoly1305_blocks_neon:
+	ldr	$is_base2_26,[$ctx,#24]
+	cmp	$len,#128
+	b.lo	.Lpoly1305_blocks
+
+	.inst	0xd503233f		// paciasp
+	stp	x29,x30,[sp,#-80]!
+	add	x29,sp,#0
+
+	stp	d8,d9,[sp,#16]		// meet ABI requirements
+	stp	d10,d11,[sp,#32]
+	stp	d12,d13,[sp,#48]
+	stp	d14,d15,[sp,#64]
+
+	cbz	$is_base2_26,.Lbase2_64_neon
+
+	ldp	w10,w11,[$ctx]		// load hash value base 2^26
+	ldp	w12,w13,[$ctx,#8]
+	ldr	w14,[$ctx,#16]
+
+	tst	$len,#31
+	b.eq	.Leven_neon
+
+	ldp	$r0,$r1,[$ctx,#32]	// load key value
+
+	add	$h0,x10,x11,lsl#26	// base 2^26 -> base 2^64
+	lsr	$h1,x12,#12
+	adds	$h0,$h0,x12,lsl#52
+	add	$h1,$h1,x13,lsl#14
+	adc	$h1,$h1,xzr
+	lsr	$h2,x14,#24
+	adds	$h1,$h1,x14,lsl#40
+	adc	$d2,$h2,xzr		// can be partially reduced...
+
+	ldp	$d0,$d1,[$inp],#16	// load input
+	sub	$len,$len,#16
+	add	$s1,$r1,$r1,lsr#2	// s1 = r1 + (r1 >> 2)
+
+#ifdef	__AARCH64EB__
+	rev	$d0,$d0
+	rev	$d1,$d1
+#endif
+	adds	$h0,$h0,$d0		// accumulate input
+	adcs	$h1,$h1,$d1
+	adc	$h2,$h2,$padbit
+
+	bl	poly1305_mult
+
+	and	x10,$h0,#0x03ffffff	// base 2^64 -> base 2^26
+	ubfx	x11,$h0,#26,#26
+	extr	x12,$h1,$h0,#52
+	and	x12,x12,#0x03ffffff
+	ubfx	x13,$h1,#14,#26
+	extr	x14,$h2,$h1,#40
+
+	b	.Leven_neon
+
+.align	4
+.Lbase2_64_neon:
+	ldp	$r0,$r1,[$ctx,#32]	// load key value
+
+	ldp	$h0,$h1,[$ctx]		// load hash value base 2^64
+	ldr	$h2,[$ctx,#16]
+
+	tst	$len,#31
+	b.eq	.Linit_neon
+
+	ldp	$d0,$d1,[$inp],#16	// load input
+	sub	$len,$len,#16
+	add	$s1,$r1,$r1,lsr#2	// s1 = r1 + (r1 >> 2)
+#ifdef	__AARCH64EB__
+	rev	$d0,$d0
+	rev	$d1,$d1
+#endif
+	adds	$h0,$h0,$d0		// accumulate input
+	adcs	$h1,$h1,$d1
+	adc	$h2,$h2,$padbit
+
+	bl	poly1305_mult
+
+.Linit_neon:
+	ldr	w17,[$ctx,#48]		// first table element
+	and	x10,$h0,#0x03ffffff	// base 2^64 -> base 2^26
+	ubfx	x11,$h0,#26,#26
+	extr	x12,$h1,$h0,#52
+	and	x12,x12,#0x03ffffff
+	ubfx	x13,$h1,#14,#26
+	extr	x14,$h2,$h1,#40
+
+	cmp	w17,#-1			// is value impossible?
+	b.ne	.Leven_neon
+
+	fmov	${H0},x10
+	fmov	${H1},x11
+	fmov	${H2},x12
+	fmov	${H3},x13
+	fmov	${H4},x14
+
+	////////////////////////////////// initialize r^n table
+	mov	$h0,$r0			// r^1
+	add	$s1,$r1,$r1,lsr#2	// s1 = r1 + (r1 >> 2)
+	mov	$h1,$r1
+	mov	$h2,xzr
+	add	$ctx,$ctx,#48+12
+	bl	poly1305_splat
+
+	bl	poly1305_mult		// r^2
+	sub	$ctx,$ctx,#4
+	bl	poly1305_splat
+
+	bl	poly1305_mult		// r^3
+	sub	$ctx,$ctx,#4
+	bl	poly1305_splat
+
+	bl	poly1305_mult		// r^4
+	sub	$ctx,$ctx,#4
+	bl	poly1305_splat
+	sub	$ctx,$ctx,#48		// restore original $ctx
+	b	.Ldo_neon
+
+.align	4
+.Leven_neon:
+	fmov	${H0},x10
+	fmov	${H1},x11
+	fmov	${H2},x12
+	fmov	${H3},x13
+	fmov	${H4},x14
+
+.Ldo_neon:
+	ldp	x8,x12,[$inp,#32]	// inp[2:3]
+	subs	$len,$len,#64
+	ldp	x9,x13,[$inp,#48]
+	add	$in2,$inp,#96
+	adr	$zeros,.Lzeros
+
+	lsl	$padbit,$padbit,#24
+	add	x15,$ctx,#48
+
+#ifdef	__AARCH64EB__
+	rev	x8,x8
+	rev	x12,x12
+	rev	x9,x9
+	rev	x13,x13
+#endif
+	and	x4,x8,#0x03ffffff	// base 2^64 -> base 2^26
+	and	x5,x9,#0x03ffffff
+	ubfx	x6,x8,#26,#26
+	ubfx	x7,x9,#26,#26
+	add	x4,x4,x5,lsl#32		// bfi	x4,x5,#32,#32
+	extr	x8,x12,x8,#52
+	extr	x9,x13,x9,#52
+	add	x6,x6,x7,lsl#32		// bfi	x6,x7,#32,#32
+	fmov	$IN23_0,x4
+	and	x8,x8,#0x03ffffff
+	and	x9,x9,#0x03ffffff
+	ubfx	x10,x12,#14,#26
+	ubfx	x11,x13,#14,#26
+	add	x12,$padbit,x12,lsr#40
+	add	x13,$padbit,x13,lsr#40
+	add	x8,x8,x9,lsl#32		// bfi	x8,x9,#32,#32
+	fmov	$IN23_1,x6
+	add	x10,x10,x11,lsl#32	// bfi	x10,x11,#32,#32
+	add	x12,x12,x13,lsl#32	// bfi	x12,x13,#32,#32
+	fmov	$IN23_2,x8
+	fmov	$IN23_3,x10
+	fmov	$IN23_4,x12
+
+	ldp	x8,x12,[$inp],#16	// inp[0:1]
+	ldp	x9,x13,[$inp],#48
+
+	ld1	{$R0,$R1,$S1,$R2},[x15],#64
+	ld1	{$S2,$R3,$S3,$R4},[x15],#64
+	ld1	{$S4},[x15]
+
+#ifdef	__AARCH64EB__
+	rev	x8,x8
+	rev	x12,x12
+	rev	x9,x9
+	rev	x13,x13
+#endif
+	and	x4,x8,#0x03ffffff	// base 2^64 -> base 2^26
+	and	x5,x9,#0x03ffffff
+	ubfx	x6,x8,#26,#26
+	ubfx	x7,x9,#26,#26
+	add	x4,x4,x5,lsl#32		// bfi	x4,x5,#32,#32
+	extr	x8,x12,x8,#52
+	extr	x9,x13,x9,#52
+	add	x6,x6,x7,lsl#32		// bfi	x6,x7,#32,#32
+	fmov	$IN01_0,x4
+	and	x8,x8,#0x03ffffff
+	and	x9,x9,#0x03ffffff
+	ubfx	x10,x12,#14,#26
+	ubfx	x11,x13,#14,#26
+	add	x12,$padbit,x12,lsr#40
+	add	x13,$padbit,x13,lsr#40
+	add	x8,x8,x9,lsl#32		// bfi	x8,x9,#32,#32
+	fmov	$IN01_1,x6
+	add	x10,x10,x11,lsl#32	// bfi	x10,x11,#32,#32
+	add	x12,x12,x13,lsl#32	// bfi	x12,x13,#32,#32
+	movi	$MASK.2d,#-1
+	fmov	$IN01_2,x8
+	fmov	$IN01_3,x10
+	fmov	$IN01_4,x12
+	ushr	$MASK.2d,$MASK.2d,#38
+
+	b.ls	.Lskip_loop
+
+.align	4
+.Loop_neon:
+	////////////////////////////////////////////////////////////////
+	// ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2
+	// ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^3+inp[7]*r
+	//   \___________________/
+	// ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2+inp[8])*r^2
+	// ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^4+inp[7]*r^2+inp[9])*r
+	//   \___________________/ \____________________/
+	//
+	// Note that we start with inp[2:3]*r^2. This is because it
+	// doesn't depend on reduction in previous iteration.
+	////////////////////////////////////////////////////////////////
+	// d4 = h0*r4 + h1*r3   + h2*r2   + h3*r1   + h4*r0
+	// d3 = h0*r3 + h1*r2   + h2*r1   + h3*r0   + h4*5*r4
+	// d2 = h0*r2 + h1*r1   + h2*r0   + h3*5*r4 + h4*5*r3
+	// d1 = h0*r1 + h1*r0   + h2*5*r4 + h3*5*r3 + h4*5*r2
+	// d0 = h0*r0 + h1*5*r4 + h2*5*r3 + h3*5*r2 + h4*5*r1
+
+	subs	$len,$len,#64
+	umull	$ACC4,$IN23_0,${R4}[2]
+	csel	$in2,$zeros,$in2,lo
+	umull	$ACC3,$IN23_0,${R3}[2]
+	umull	$ACC2,$IN23_0,${R2}[2]
+	 ldp	x8,x12,[$in2],#16	// inp[2:3] (or zero)
+	umull	$ACC1,$IN23_0,${R1}[2]
+	 ldp	x9,x13,[$in2],#48
+	umull	$ACC0,$IN23_0,${R0}[2]
+#ifdef	__AARCH64EB__
+	 rev	x8,x8
+	 rev	x12,x12
+	 rev	x9,x9
+	 rev	x13,x13
+#endif
+
+	umlal	$ACC4,$IN23_1,${R3}[2]
+	 and	x4,x8,#0x03ffffff	// base 2^64 -> base 2^26
+	umlal	$ACC3,$IN23_1,${R2}[2]
+	 and	x5,x9,#0x03ffffff
+	umlal	$ACC2,$IN23_1,${R1}[2]
+	 ubfx	x6,x8,#26,#26
+	umlal	$ACC1,$IN23_1,${R0}[2]
+	 ubfx	x7,x9,#26,#26
+	umlal	$ACC0,$IN23_1,${S4}[2]
+	 add	x4,x4,x5,lsl#32		// bfi	x4,x5,#32,#32
+
+	umlal	$ACC4,$IN23_2,${R2}[2]
+	 extr	x8,x12,x8,#52
+	umlal	$ACC3,$IN23_2,${R1}[2]
+	 extr	x9,x13,x9,#52
+	umlal	$ACC2,$IN23_2,${R0}[2]
+	 add	x6,x6,x7,lsl#32		// bfi	x6,x7,#32,#32
+	umlal	$ACC1,$IN23_2,${S4}[2]
+	 fmov	$IN23_0,x4
+	umlal	$ACC0,$IN23_2,${S3}[2]
+	 and	x8,x8,#0x03ffffff
+
+	umlal	$ACC4,$IN23_3,${R1}[2]
+	 and	x9,x9,#0x03ffffff
+	umlal	$ACC3,$IN23_3,${R0}[2]
+	 ubfx	x10,x12,#14,#26
+	umlal	$ACC2,$IN23_3,${S4}[2]
+	 ubfx	x11,x13,#14,#26
+	umlal	$ACC1,$IN23_3,${S3}[2]
+	 add	x8,x8,x9,lsl#32		// bfi	x8,x9,#32,#32
+	umlal	$ACC0,$IN23_3,${S2}[2]
+	 fmov	$IN23_1,x6
+
+	add	$IN01_2,$IN01_2,$H2
+	 add	x12,$padbit,x12,lsr#40
+	umlal	$ACC4,$IN23_4,${R0}[2]
+	 add	x13,$padbit,x13,lsr#40
+	umlal	$ACC3,$IN23_4,${S4}[2]
+	 add	x10,x10,x11,lsl#32	// bfi	x10,x11,#32,#32
+	umlal	$ACC2,$IN23_4,${S3}[2]
+	 add	x12,x12,x13,lsl#32	// bfi	x12,x13,#32,#32
+	umlal	$ACC1,$IN23_4,${S2}[2]
+	 fmov	$IN23_2,x8
+	umlal	$ACC0,$IN23_4,${S1}[2]
+	 fmov	$IN23_3,x10
+
+	////////////////////////////////////////////////////////////////
+	// (hash+inp[0:1])*r^4 and accumulate
+
+	add	$IN01_0,$IN01_0,$H0
+	 fmov	$IN23_4,x12
+	umlal	$ACC3,$IN01_2,${R1}[0]
+	 ldp	x8,x12,[$inp],#16	// inp[0:1]
+	umlal	$ACC0,$IN01_2,${S3}[0]
+	 ldp	x9,x13,[$inp],#48
+	umlal	$ACC4,$IN01_2,${R2}[0]
+	umlal	$ACC1,$IN01_2,${S4}[0]
+	umlal	$ACC2,$IN01_2,${R0}[0]
+#ifdef	__AARCH64EB__
+	 rev	x8,x8
+	 rev	x12,x12
+	 rev	x9,x9
+	 rev	x13,x13
+#endif
+
+	add	$IN01_1,$IN01_1,$H1
+	umlal	$ACC3,$IN01_0,${R3}[0]
+	umlal	$ACC4,$IN01_0,${R4}[0]
+	 and	x4,x8,#0x03ffffff	// base 2^64 -> base 2^26
+	umlal	$ACC2,$IN01_0,${R2}[0]
+	 and	x5,x9,#0x03ffffff
+	umlal	$ACC0,$IN01_0,${R0}[0]
+	 ubfx	x6,x8,#26,#26
+	umlal	$ACC1,$IN01_0,${R1}[0]
+	 ubfx	x7,x9,#26,#26
+
+	add	$IN01_3,$IN01_3,$H3
+	 add	x4,x4,x5,lsl#32		// bfi	x4,x5,#32,#32
+	umlal	$ACC3,$IN01_1,${R2}[0]
+	 extr	x8,x12,x8,#52
+	umlal	$ACC4,$IN01_1,${R3}[0]
+	 extr	x9,x13,x9,#52
+	umlal	$ACC0,$IN01_1,${S4}[0]
+	 add	x6,x6,x7,lsl#32		// bfi	x6,x7,#32,#32
+	umlal	$ACC2,$IN01_1,${R1}[0]
+	 fmov	$IN01_0,x4
+	umlal	$ACC1,$IN01_1,${R0}[0]
+	 and	x8,x8,#0x03ffffff
+
+	add	$IN01_4,$IN01_4,$H4
+	 and	x9,x9,#0x03ffffff
+	umlal	$ACC3,$IN01_3,${R0}[0]
+	 ubfx	x10,x12,#14,#26
+	umlal	$ACC0,$IN01_3,${S2}[0]
+	 ubfx	x11,x13,#14,#26
+	umlal	$ACC4,$IN01_3,${R1}[0]
+	 add	x8,x8,x9,lsl#32		// bfi	x8,x9,#32,#32
+	umlal	$ACC1,$IN01_3,${S3}[0]
+	 fmov	$IN01_1,x6
+	umlal	$ACC2,$IN01_3,${S4}[0]
+	 add	x12,$padbit,x12,lsr#40
+
+	umlal	$ACC3,$IN01_4,${S4}[0]
+	 add	x13,$padbit,x13,lsr#40
+	umlal	$ACC0,$IN01_4,${S1}[0]
+	 add	x10,x10,x11,lsl#32	// bfi	x10,x11,#32,#32
+	umlal	$ACC4,$IN01_4,${R0}[0]
+	 add	x12,x12,x13,lsl#32	// bfi	x12,x13,#32,#32
+	umlal	$ACC1,$IN01_4,${S2}[0]
+	 fmov	$IN01_2,x8
+	umlal	$ACC2,$IN01_4,${S3}[0]
+	 fmov	$IN01_3,x10
+	 fmov	$IN01_4,x12
+
+	/////////////////////////////////////////////////////////////////
+	// lazy reduction as discussed in "NEON crypto" by D.J. Bernstein
+	// and P. Schwabe
+	//
+	// [see discussion in poly1305-armv4 module]
+
+	ushr	$T0.2d,$ACC3,#26
+	xtn	$H3,$ACC3
+	 ushr	$T1.2d,$ACC0,#26
+	 and	$ACC0,$ACC0,$MASK.2d
+	add	$ACC4,$ACC4,$T0.2d	// h3 -> h4
+	bic	$H3,#0xfc,lsl#24	// &=0x03ffffff
+	 add	$ACC1,$ACC1,$T1.2d	// h0 -> h1
+
+	ushr	$T0.2d,$ACC4,#26
+	xtn	$H4,$ACC4
+	 ushr	$T1.2d,$ACC1,#26
+	 xtn	$H1,$ACC1
+	bic	$H4,#0xfc,lsl#24
+	 add	$ACC2,$ACC2,$T1.2d	// h1 -> h2
+
+	add	$ACC0,$ACC0,$T0.2d
+	shl	$T0.2d,$T0.2d,#2
+	 shrn	$T1.2s,$ACC2,#26
+	 xtn	$H2,$ACC2
+	add	$ACC0,$ACC0,$T0.2d	// h4 -> h0
+	 bic	$H1,#0xfc,lsl#24
+	 add	$H3,$H3,$T1.2s		// h2 -> h3
+	 bic	$H2,#0xfc,lsl#24
+
+	shrn	$T0.2s,$ACC0,#26
+	xtn	$H0,$ACC0
+	 ushr	$T1.2s,$H3,#26
+	 bic	$H3,#0xfc,lsl#24
+	 bic	$H0,#0xfc,lsl#24
+	add	$H1,$H1,$T0.2s		// h0 -> h1
+	 add	$H4,$H4,$T1.2s		// h3 -> h4
+
+	b.hi	.Loop_neon
+
+.Lskip_loop:
+	dup	$IN23_2,${IN23_2}[0]
+	add	$IN01_2,$IN01_2,$H2
+
+	////////////////////////////////////////////////////////////////
+	// multiply (inp[0:1]+hash) or inp[2:3] by r^2:r^1
+
+	adds	$len,$len,#32
+	b.ne	.Long_tail
+
+	dup	$IN23_2,${IN01_2}[0]
+	add	$IN23_0,$IN01_0,$H0
+	add	$IN23_3,$IN01_3,$H3
+	add	$IN23_1,$IN01_1,$H1
+	add	$IN23_4,$IN01_4,$H4
+
+.Long_tail:
+	dup	$IN23_0,${IN23_0}[0]
+	umull2	$ACC0,$IN23_2,${S3}
+	umull2	$ACC3,$IN23_2,${R1}
+	umull2	$ACC4,$IN23_2,${R2}
+	umull2	$ACC2,$IN23_2,${R0}
+	umull2	$ACC1,$IN23_2,${S4}
+
+	dup	$IN23_1,${IN23_1}[0]
+	umlal2	$ACC0,$IN23_0,${R0}
+	umlal2	$ACC2,$IN23_0,${R2}
+	umlal2	$ACC3,$IN23_0,${R3}
+	umlal2	$ACC4,$IN23_0,${R4}
+	umlal2	$ACC1,$IN23_0,${R1}
+
+	dup	$IN23_3,${IN23_3}[0]
+	umlal2	$ACC0,$IN23_1,${S4}
+	umlal2	$ACC3,$IN23_1,${R2}
+	umlal2	$ACC2,$IN23_1,${R1}
+	umlal2	$ACC4,$IN23_1,${R3}
+	umlal2	$ACC1,$IN23_1,${R0}
+
+	dup	$IN23_4,${IN23_4}[0]
+	umlal2	$ACC3,$IN23_3,${R0}
+	umlal2	$ACC4,$IN23_3,${R1}
+	umlal2	$ACC0,$IN23_3,${S2}
+	umlal2	$ACC1,$IN23_3,${S3}
+	umlal2	$ACC2,$IN23_3,${S4}
+
+	umlal2	$ACC3,$IN23_4,${S4}
+	umlal2	$ACC0,$IN23_4,${S1}
+	umlal2	$ACC4,$IN23_4,${R0}
+	umlal2	$ACC1,$IN23_4,${S2}
+	umlal2	$ACC2,$IN23_4,${S3}
+
+	b.eq	.Lshort_tail
+
+	////////////////////////////////////////////////////////////////
+	// (hash+inp[0:1])*r^4:r^3 and accumulate
+
+	add	$IN01_0,$IN01_0,$H0
+	umlal	$ACC3,$IN01_2,${R1}
+	umlal	$ACC0,$IN01_2,${S3}
+	umlal	$ACC4,$IN01_2,${R2}
+	umlal	$ACC1,$IN01_2,${S4}
+	umlal	$ACC2,$IN01_2,${R0}
+
+	add	$IN01_1,$IN01_1,$H1
+	umlal	$ACC3,$IN01_0,${R3}
+	umlal	$ACC0,$IN01_0,${R0}
+	umlal	$ACC4,$IN01_0,${R4}
+	umlal	$ACC1,$IN01_0,${R1}
+	umlal	$ACC2,$IN01_0,${R2}
+
+	add	$IN01_3,$IN01_3,$H3
+	umlal	$ACC3,$IN01_1,${R2}
+	umlal	$ACC0,$IN01_1,${S4}
+	umlal	$ACC4,$IN01_1,${R3}
+	umlal	$ACC1,$IN01_1,${R0}
+	umlal	$ACC2,$IN01_1,${R1}
+
+	add	$IN01_4,$IN01_4,$H4
+	umlal	$ACC3,$IN01_3,${R0}
+	umlal	$ACC0,$IN01_3,${S2}
+	umlal	$ACC4,$IN01_3,${R1}
+	umlal	$ACC1,$IN01_3,${S3}
+	umlal	$ACC2,$IN01_3,${S4}
+
+	umlal	$ACC3,$IN01_4,${S4}
+	umlal	$ACC0,$IN01_4,${S1}
+	umlal	$ACC4,$IN01_4,${R0}
+	umlal	$ACC1,$IN01_4,${S2}
+	umlal	$ACC2,$IN01_4,${S3}
+
+.Lshort_tail:
+	////////////////////////////////////////////////////////////////
+	// horizontal add
+
+	addp	$ACC3,$ACC3,$ACC3
+	 ldp	d8,d9,[sp,#16]		// meet ABI requirements
+	addp	$ACC0,$ACC0,$ACC0
+	 ldp	d10,d11,[sp,#32]
+	addp	$ACC4,$ACC4,$ACC4
+	 ldp	d12,d13,[sp,#48]
+	addp	$ACC1,$ACC1,$ACC1
+	 ldp	d14,d15,[sp,#64]
+	addp	$ACC2,$ACC2,$ACC2
+	 ldr	x30,[sp,#8]
+	 .inst	0xd50323bf		// autiasp
+
+	////////////////////////////////////////////////////////////////
+	// lazy reduction, but without narrowing
+
+	ushr	$T0.2d,$ACC3,#26
+	and	$ACC3,$ACC3,$MASK.2d
+	 ushr	$T1.2d,$ACC0,#26
+	 and	$ACC0,$ACC0,$MASK.2d
+
+	add	$ACC4,$ACC4,$T0.2d	// h3 -> h4
+	 add	$ACC1,$ACC1,$T1.2d	// h0 -> h1
+
+	ushr	$T0.2d,$ACC4,#26
+	and	$ACC4,$ACC4,$MASK.2d
+	 ushr	$T1.2d,$ACC1,#26
+	 and	$ACC1,$ACC1,$MASK.2d
+	 add	$ACC2,$ACC2,$T1.2d	// h1 -> h2
+
+	add	$ACC0,$ACC0,$T0.2d
+	shl	$T0.2d,$T0.2d,#2
+	 ushr	$T1.2d,$ACC2,#26
+	 and	$ACC2,$ACC2,$MASK.2d
+	add	$ACC0,$ACC0,$T0.2d	// h4 -> h0
+	 add	$ACC3,$ACC3,$T1.2d	// h2 -> h3
+
+	ushr	$T0.2d,$ACC0,#26
+	and	$ACC0,$ACC0,$MASK.2d
+	 ushr	$T1.2d,$ACC3,#26
+	 and	$ACC3,$ACC3,$MASK.2d
+	add	$ACC1,$ACC1,$T0.2d	// h0 -> h1
+	 add	$ACC4,$ACC4,$T1.2d	// h3 -> h4
+
+	////////////////////////////////////////////////////////////////
+	// write the result, can be partially reduced
+
+	st4	{$ACC0,$ACC1,$ACC2,$ACC3}[0],[$ctx],#16
+	mov	x4,#1
+	st1	{$ACC4}[0],[$ctx]
+	str	x4,[$ctx,#8]		// set is_base2_26
+
+	ldr	x29,[sp],#80
+	ret
+.size	poly1305_blocks_neon,.-poly1305_blocks_neon
+
+.align	5
+.Lzeros:
+.long	0,0,0,0,0,0,0,0
+.asciz	"Poly1305 for ARMv8, CRYPTOGAMS by \@dot-asm"
+.align	2
+#if !defined(__KERNEL__) && !defined(_WIN64)
+.comm	OPENSSL_armcap_P,4,4
+.hidden	OPENSSL_armcap_P
+#endif
+___
+
+foreach (split("\n",$code)) {
+	s/\b(shrn\s+v[0-9]+)\.[24]d/$1.2s/			or
+	s/\b(fmov\s+)v([0-9]+)[^,]*,\s*x([0-9]+)/$1d$2,x$3/	or
+	(m/\bdup\b/ and (s/\.[24]s/.2d/g or 1))			or
+	(m/\b(eor|and)/ and (s/\.[248][sdh]/.16b/g or 1))	or
+	(m/\bum(ul|la)l\b/ and (s/\.4s/.2s/g or 1))		or
+	(m/\bum(ul|la)l2\b/ and (s/\.2s/.4s/g or 1))		or
+	(m/\bst[1-4]\s+{[^}]+}\[/ and (s/\.[24]d/.s/g or 1));
+
+	s/\.[124]([sd])\[/.$1\[/;
+	s/w#x([0-9]+)/w$1/g;
+
+	print $_,"\n";
+}
+close STDOUT;
diff --git a/arch/arm64/crypto/poly1305-core.S_shipped b/arch/arm64/crypto/poly1305-core.S_shipped
new file mode 100644
index 000000000000..8d1c4e420ccd
--- /dev/null
+++ b/arch/arm64/crypto/poly1305-core.S_shipped
@@ -0,0 +1,835 @@
+#ifndef __KERNEL__
+# include "arm_arch.h"
+.extern	OPENSSL_armcap_P
+#endif
+
+.text
+
+// forward "declarations" are required for Apple
+.globl	poly1305_blocks
+.globl	poly1305_emit
+
+.globl	poly1305_init
+.type	poly1305_init,%function
+.align	5
+poly1305_init:
+	cmp	x1,xzr
+	stp	xzr,xzr,[x0]		// zero hash value
+	stp	xzr,xzr,[x0,#16]	// [along with is_base2_26]
+
+	csel	x0,xzr,x0,eq
+	b.eq	.Lno_key
+
+#ifndef	__KERNEL__
+	adrp	x17,OPENSSL_armcap_P
+	ldr	w17,[x17,#:lo12:OPENSSL_armcap_P]
+#endif
+
+	ldp	x7,x8,[x1]		// load key
+	mov	x9,#0xfffffffc0fffffff
+	movk	x9,#0x0fff,lsl#48
+#ifdef	__AARCH64EB__
+	rev	x7,x7			// flip bytes
+	rev	x8,x8
+#endif
+	and	x7,x7,x9		// &=0ffffffc0fffffff
+	and	x9,x9,#-4
+	and	x8,x8,x9		// &=0ffffffc0ffffffc
+	mov	w9,#-1
+	stp	x7,x8,[x0,#32]	// save key value
+	str	w9,[x0,#48]	// impossible key power value
+
+#ifndef	__KERNEL__
+	tst	w17,#ARMV7_NEON
+
+	adr	x12,.Lpoly1305_blocks
+	adr	x7,.Lpoly1305_blocks_neon
+	adr	x13,.Lpoly1305_emit
+
+	csel	x12,x12,x7,eq
+
+# ifdef	__ILP32__
+	stp	w12,w13,[x2]
+# else
+	stp	x12,x13,[x2]
+# endif
+#endif
+	mov	x0,#1
+.Lno_key:
+	ret
+.size	poly1305_init,.-poly1305_init
+
+.type	poly1305_blocks,%function
+.align	5
+poly1305_blocks:
+.Lpoly1305_blocks:
+	ands	x2,x2,#-16
+	b.eq	.Lno_data
+
+	ldp	x4,x5,[x0]		// load hash value
+	ldp	x6,x17,[x0,#16]	// [along with is_base2_26]
+	ldp	x7,x8,[x0,#32]	// load key value
+
+#ifdef	__AARCH64EB__
+	lsr	x12,x4,#32
+	mov	w13,w4
+	lsr	x14,x5,#32
+	mov	w15,w5
+	lsr	x16,x6,#32
+#else
+	mov	w12,w4
+	lsr	x13,x4,#32
+	mov	w14,w5
+	lsr	x15,x5,#32
+	mov	w16,w6
+#endif
+
+	add	x12,x12,x13,lsl#26	// base 2^26 -> base 2^64
+	lsr	x13,x14,#12
+	adds	x12,x12,x14,lsl#52
+	add	x13,x13,x15,lsl#14
+	adc	x13,x13,xzr
+	lsr	x14,x16,#24
+	adds	x13,x13,x16,lsl#40
+	adc	x14,x14,xzr
+
+	cmp	x17,#0			// is_base2_26?
+	add	x9,x8,x8,lsr#2	// s1 = r1 + (r1 >> 2)
+	csel	x4,x4,x12,eq		// choose between radixes
+	csel	x5,x5,x13,eq
+	csel	x6,x6,x14,eq
+
+.Loop:
+	ldp	x10,x11,[x1],#16	// load input
+	sub	x2,x2,#16
+#ifdef	__AARCH64EB__
+	rev	x10,x10
+	rev	x11,x11
+#endif
+	adds	x4,x4,x10		// accumulate input
+	adcs	x5,x5,x11
+
+	mul	x12,x4,x7		// h0*r0
+	adc	x6,x6,x3
+	umulh	x13,x4,x7
+
+	mul	x10,x5,x9		// h1*5*r1
+	umulh	x11,x5,x9
+
+	adds	x12,x12,x10
+	mul	x10,x4,x8		// h0*r1
+	adc	x13,x13,x11
+	umulh	x14,x4,x8
+
+	adds	x13,x13,x10
+	mul	x10,x5,x7		// h1*r0
+	adc	x14,x14,xzr
+	umulh	x11,x5,x7
+
+	adds	x13,x13,x10
+	mul	x10,x6,x9		// h2*5*r1
+	adc	x14,x14,x11
+	mul	x11,x6,x7		// h2*r0
+
+	adds	x13,x13,x10
+	adc	x14,x14,x11
+
+	and	x10,x14,#-4		// final reduction
+	and	x6,x14,#3
+	add	x10,x10,x14,lsr#2
+	adds	x4,x12,x10
+	adcs	x5,x13,xzr
+	adc	x6,x6,xzr
+
+	cbnz	x2,.Loop
+
+	stp	x4,x5,[x0]		// store hash value
+	stp	x6,xzr,[x0,#16]	// [and clear is_base2_26]
+
+.Lno_data:
+	ret
+.size	poly1305_blocks,.-poly1305_blocks
+
+.type	poly1305_emit,%function
+.align	5
+poly1305_emit:
+.Lpoly1305_emit:
+	ldp	x4,x5,[x0]		// load hash base 2^64
+	ldp	x6,x7,[x0,#16]	// [along with is_base2_26]
+	ldp	x10,x11,[x2]	// load nonce
+
+#ifdef	__AARCH64EB__
+	lsr	x12,x4,#32
+	mov	w13,w4
+	lsr	x14,x5,#32
+	mov	w15,w5
+	lsr	x16,x6,#32
+#else
+	mov	w12,w4
+	lsr	x13,x4,#32
+	mov	w14,w5
+	lsr	x15,x5,#32
+	mov	w16,w6
+#endif
+
+	add	x12,x12,x13,lsl#26	// base 2^26 -> base 2^64
+	lsr	x13,x14,#12
+	adds	x12,x12,x14,lsl#52
+	add	x13,x13,x15,lsl#14
+	adc	x13,x13,xzr
+	lsr	x14,x16,#24
+	adds	x13,x13,x16,lsl#40
+	adc	x14,x14,xzr
+
+	cmp	x7,#0			// is_base2_26?
+	csel	x4,x4,x12,eq		// choose between radixes
+	csel	x5,x5,x13,eq
+	csel	x6,x6,x14,eq
+
+	adds	x12,x4,#5		// compare to modulus
+	adcs	x13,x5,xzr
+	adc	x14,x6,xzr
+
+	tst	x14,#-4			// see if it's carried/borrowed
+
+	csel	x4,x4,x12,eq
+	csel	x5,x5,x13,eq
+
+#ifdef	__AARCH64EB__
+	ror	x10,x10,#32		// flip nonce words
+	ror	x11,x11,#32
+#endif
+	adds	x4,x4,x10		// accumulate nonce
+	adc	x5,x5,x11
+#ifdef	__AARCH64EB__
+	rev	x4,x4			// flip output bytes
+	rev	x5,x5
+#endif
+	stp	x4,x5,[x1]		// write result
+
+	ret
+.size	poly1305_emit,.-poly1305_emit
+.type	poly1305_mult,%function
+.align	5
+poly1305_mult:
+	mul	x12,x4,x7		// h0*r0
+	umulh	x13,x4,x7
+
+	mul	x10,x5,x9		// h1*5*r1
+	umulh	x11,x5,x9
+
+	adds	x12,x12,x10
+	mul	x10,x4,x8		// h0*r1
+	adc	x13,x13,x11
+	umulh	x14,x4,x8
+
+	adds	x13,x13,x10
+	mul	x10,x5,x7		// h1*r0
+	adc	x14,x14,xzr
+	umulh	x11,x5,x7
+
+	adds	x13,x13,x10
+	mul	x10,x6,x9		// h2*5*r1
+	adc	x14,x14,x11
+	mul	x11,x6,x7		// h2*r0
+
+	adds	x13,x13,x10
+	adc	x14,x14,x11
+
+	and	x10,x14,#-4		// final reduction
+	and	x6,x14,#3
+	add	x10,x10,x14,lsr#2
+	adds	x4,x12,x10
+	adcs	x5,x13,xzr
+	adc	x6,x6,xzr
+
+	ret
+.size	poly1305_mult,.-poly1305_mult
+
+.type	poly1305_splat,%function
+.align	4
+poly1305_splat:
+	and	x12,x4,#0x03ffffff	// base 2^64 -> base 2^26
+	ubfx	x13,x4,#26,#26
+	extr	x14,x5,x4,#52
+	and	x14,x14,#0x03ffffff
+	ubfx	x15,x5,#14,#26
+	extr	x16,x6,x5,#40
+
+	str	w12,[x0,#16*0]	// r0
+	add	w12,w13,w13,lsl#2	// r1*5
+	str	w13,[x0,#16*1]	// r1
+	add	w13,w14,w14,lsl#2	// r2*5
+	str	w12,[x0,#16*2]	// s1
+	str	w14,[x0,#16*3]	// r2
+	add	w14,w15,w15,lsl#2	// r3*5
+	str	w13,[x0,#16*4]	// s2
+	str	w15,[x0,#16*5]	// r3
+	add	w15,w16,w16,lsl#2	// r4*5
+	str	w14,[x0,#16*6]	// s3
+	str	w16,[x0,#16*7]	// r4
+	str	w15,[x0,#16*8]	// s4
+
+	ret
+.size	poly1305_splat,.-poly1305_splat
+
+#ifdef	__KERNEL__
+.globl	poly1305_blocks_neon
+#endif
+.type	poly1305_blocks_neon,%function
+.align	5
+poly1305_blocks_neon:
+.Lpoly1305_blocks_neon:
+	ldr	x17,[x0,#24]
+	cmp	x2,#128
+	b.lo	.Lpoly1305_blocks
+
+	.inst	0xd503233f		// paciasp
+	stp	x29,x30,[sp,#-80]!
+	add	x29,sp,#0
+
+	stp	d8,d9,[sp,#16]		// meet ABI requirements
+	stp	d10,d11,[sp,#32]
+	stp	d12,d13,[sp,#48]
+	stp	d14,d15,[sp,#64]
+
+	cbz	x17,.Lbase2_64_neon
+
+	ldp	w10,w11,[x0]		// load hash value base 2^26
+	ldp	w12,w13,[x0,#8]
+	ldr	w14,[x0,#16]
+
+	tst	x2,#31
+	b.eq	.Leven_neon
+
+	ldp	x7,x8,[x0,#32]	// load key value
+
+	add	x4,x10,x11,lsl#26	// base 2^26 -> base 2^64
+	lsr	x5,x12,#12
+	adds	x4,x4,x12,lsl#52
+	add	x5,x5,x13,lsl#14
+	adc	x5,x5,xzr
+	lsr	x6,x14,#24
+	adds	x5,x5,x14,lsl#40
+	adc	x14,x6,xzr		// can be partially reduced...
+
+	ldp	x12,x13,[x1],#16	// load input
+	sub	x2,x2,#16
+	add	x9,x8,x8,lsr#2	// s1 = r1 + (r1 >> 2)
+
+#ifdef	__AARCH64EB__
+	rev	x12,x12
+	rev	x13,x13
+#endif
+	adds	x4,x4,x12		// accumulate input
+	adcs	x5,x5,x13
+	adc	x6,x6,x3
+
+	bl	poly1305_mult
+
+	and	x10,x4,#0x03ffffff	// base 2^64 -> base 2^26
+	ubfx	x11,x4,#26,#26
+	extr	x12,x5,x4,#52
+	and	x12,x12,#0x03ffffff
+	ubfx	x13,x5,#14,#26
+	extr	x14,x6,x5,#40
+
+	b	.Leven_neon
+
+.align	4
+.Lbase2_64_neon:
+	ldp	x7,x8,[x0,#32]	// load key value
+
+	ldp	x4,x5,[x0]		// load hash value base 2^64
+	ldr	x6,[x0,#16]
+
+	tst	x2,#31
+	b.eq	.Linit_neon
+
+	ldp	x12,x13,[x1],#16	// load input
+	sub	x2,x2,#16
+	add	x9,x8,x8,lsr#2	// s1 = r1 + (r1 >> 2)
+#ifdef	__AARCH64EB__
+	rev	x12,x12
+	rev	x13,x13
+#endif
+	adds	x4,x4,x12		// accumulate input
+	adcs	x5,x5,x13
+	adc	x6,x6,x3
+
+	bl	poly1305_mult
+
+.Linit_neon:
+	ldr	w17,[x0,#48]		// first table element
+	and	x10,x4,#0x03ffffff	// base 2^64 -> base 2^26
+	ubfx	x11,x4,#26,#26
+	extr	x12,x5,x4,#52
+	and	x12,x12,#0x03ffffff
+	ubfx	x13,x5,#14,#26
+	extr	x14,x6,x5,#40
+
+	cmp	w17,#-1			// is value impossible?
+	b.ne	.Leven_neon
+
+	fmov	d24,x10
+	fmov	d25,x11
+	fmov	d26,x12
+	fmov	d27,x13
+	fmov	d28,x14
+
+	////////////////////////////////// initialize r^n table
+	mov	x4,x7			// r^1
+	add	x9,x8,x8,lsr#2	// s1 = r1 + (r1 >> 2)
+	mov	x5,x8
+	mov	x6,xzr
+	add	x0,x0,#48+12
+	bl	poly1305_splat
+
+	bl	poly1305_mult		// r^2
+	sub	x0,x0,#4
+	bl	poly1305_splat
+
+	bl	poly1305_mult		// r^3
+	sub	x0,x0,#4
+	bl	poly1305_splat
+
+	bl	poly1305_mult		// r^4
+	sub	x0,x0,#4
+	bl	poly1305_splat
+	sub	x0,x0,#48		// restore original x0
+	b	.Ldo_neon
+
+.align	4
+.Leven_neon:
+	fmov	d24,x10
+	fmov	d25,x11
+	fmov	d26,x12
+	fmov	d27,x13
+	fmov	d28,x14
+
+.Ldo_neon:
+	ldp	x8,x12,[x1,#32]	// inp[2:3]
+	subs	x2,x2,#64
+	ldp	x9,x13,[x1,#48]
+	add	x16,x1,#96
+	adr	x17,.Lzeros
+
+	lsl	x3,x3,#24
+	add	x15,x0,#48
+
+#ifdef	__AARCH64EB__
+	rev	x8,x8
+	rev	x12,x12
+	rev	x9,x9
+	rev	x13,x13
+#endif
+	and	x4,x8,#0x03ffffff	// base 2^64 -> base 2^26
+	and	x5,x9,#0x03ffffff
+	ubfx	x6,x8,#26,#26
+	ubfx	x7,x9,#26,#26
+	add	x4,x4,x5,lsl#32		// bfi	x4,x5,#32,#32
+	extr	x8,x12,x8,#52
+	extr	x9,x13,x9,#52
+	add	x6,x6,x7,lsl#32		// bfi	x6,x7,#32,#32
+	fmov	d14,x4
+	and	x8,x8,#0x03ffffff
+	and	x9,x9,#0x03ffffff
+	ubfx	x10,x12,#14,#26
+	ubfx	x11,x13,#14,#26
+	add	x12,x3,x12,lsr#40
+	add	x13,x3,x13,lsr#40
+	add	x8,x8,x9,lsl#32		// bfi	x8,x9,#32,#32
+	fmov	d15,x6
+	add	x10,x10,x11,lsl#32	// bfi	x10,x11,#32,#32
+	add	x12,x12,x13,lsl#32	// bfi	x12,x13,#32,#32
+	fmov	d16,x8
+	fmov	d17,x10
+	fmov	d18,x12
+
+	ldp	x8,x12,[x1],#16	// inp[0:1]
+	ldp	x9,x13,[x1],#48
+
+	ld1	{v0.4s,v1.4s,v2.4s,v3.4s},[x15],#64
+	ld1	{v4.4s,v5.4s,v6.4s,v7.4s},[x15],#64
+	ld1	{v8.4s},[x15]
+
+#ifdef	__AARCH64EB__
+	rev	x8,x8
+	rev	x12,x12
+	rev	x9,x9
+	rev	x13,x13
+#endif
+	and	x4,x8,#0x03ffffff	// base 2^64 -> base 2^26
+	and	x5,x9,#0x03ffffff
+	ubfx	x6,x8,#26,#26
+	ubfx	x7,x9,#26,#26
+	add	x4,x4,x5,lsl#32		// bfi	x4,x5,#32,#32
+	extr	x8,x12,x8,#52
+	extr	x9,x13,x9,#52
+	add	x6,x6,x7,lsl#32		// bfi	x6,x7,#32,#32
+	fmov	d9,x4
+	and	x8,x8,#0x03ffffff
+	and	x9,x9,#0x03ffffff
+	ubfx	x10,x12,#14,#26
+	ubfx	x11,x13,#14,#26
+	add	x12,x3,x12,lsr#40
+	add	x13,x3,x13,lsr#40
+	add	x8,x8,x9,lsl#32		// bfi	x8,x9,#32,#32
+	fmov	d10,x6
+	add	x10,x10,x11,lsl#32	// bfi	x10,x11,#32,#32
+	add	x12,x12,x13,lsl#32	// bfi	x12,x13,#32,#32
+	movi	v31.2d,#-1
+	fmov	d11,x8
+	fmov	d12,x10
+	fmov	d13,x12
+	ushr	v31.2d,v31.2d,#38
+
+	b.ls	.Lskip_loop
+
+.align	4
+.Loop_neon:
+	////////////////////////////////////////////////////////////////
+	// ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2
+	// ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^3+inp[7]*r
+	//   ___________________/
+	// ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2+inp[8])*r^2
+	// ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^4+inp[7]*r^2+inp[9])*r
+	//   ___________________/ ____________________/
+	//
+	// Note that we start with inp[2:3]*r^2. This is because it
+	// doesn't depend on reduction in previous iteration.
+	////////////////////////////////////////////////////////////////
+	// d4 = h0*r4 + h1*r3   + h2*r2   + h3*r1   + h4*r0
+	// d3 = h0*r3 + h1*r2   + h2*r1   + h3*r0   + h4*5*r4
+	// d2 = h0*r2 + h1*r1   + h2*r0   + h3*5*r4 + h4*5*r3
+	// d1 = h0*r1 + h1*r0   + h2*5*r4 + h3*5*r3 + h4*5*r2
+	// d0 = h0*r0 + h1*5*r4 + h2*5*r3 + h3*5*r2 + h4*5*r1
+
+	subs	x2,x2,#64
+	umull	v23.2d,v14.2s,v7.s[2]
+	csel	x16,x17,x16,lo
+	umull	v22.2d,v14.2s,v5.s[2]
+	umull	v21.2d,v14.2s,v3.s[2]
+	 ldp	x8,x12,[x16],#16	// inp[2:3] (or zero)
+	umull	v20.2d,v14.2s,v1.s[2]
+	 ldp	x9,x13,[x16],#48
+	umull	v19.2d,v14.2s,v0.s[2]
+#ifdef	__AARCH64EB__
+	 rev	x8,x8
+	 rev	x12,x12
+	 rev	x9,x9
+	 rev	x13,x13
+#endif
+
+	umlal	v23.2d,v15.2s,v5.s[2]
+	 and	x4,x8,#0x03ffffff	// base 2^64 -> base 2^26
+	umlal	v22.2d,v15.2s,v3.s[2]
+	 and	x5,x9,#0x03ffffff
+	umlal	v21.2d,v15.2s,v1.s[2]
+	 ubfx	x6,x8,#26,#26
+	umlal	v20.2d,v15.2s,v0.s[2]
+	 ubfx	x7,x9,#26,#26
+	umlal	v19.2d,v15.2s,v8.s[2]
+	 add	x4,x4,x5,lsl#32		// bfi	x4,x5,#32,#32
+
+	umlal	v23.2d,v16.2s,v3.s[2]
+	 extr	x8,x12,x8,#52
+	umlal	v22.2d,v16.2s,v1.s[2]
+	 extr	x9,x13,x9,#52
+	umlal	v21.2d,v16.2s,v0.s[2]
+	 add	x6,x6,x7,lsl#32		// bfi	x6,x7,#32,#32
+	umlal	v20.2d,v16.2s,v8.s[2]
+	 fmov	d14,x4
+	umlal	v19.2d,v16.2s,v6.s[2]
+	 and	x8,x8,#0x03ffffff
+
+	umlal	v23.2d,v17.2s,v1.s[2]
+	 and	x9,x9,#0x03ffffff
+	umlal	v22.2d,v17.2s,v0.s[2]
+	 ubfx	x10,x12,#14,#26
+	umlal	v21.2d,v17.2s,v8.s[2]
+	 ubfx	x11,x13,#14,#26
+	umlal	v20.2d,v17.2s,v6.s[2]
+	 add	x8,x8,x9,lsl#32		// bfi	x8,x9,#32,#32
+	umlal	v19.2d,v17.2s,v4.s[2]
+	 fmov	d15,x6
+
+	add	v11.2s,v11.2s,v26.2s
+	 add	x12,x3,x12,lsr#40
+	umlal	v23.2d,v18.2s,v0.s[2]
+	 add	x13,x3,x13,lsr#40
+	umlal	v22.2d,v18.2s,v8.s[2]
+	 add	x10,x10,x11,lsl#32	// bfi	x10,x11,#32,#32
+	umlal	v21.2d,v18.2s,v6.s[2]
+	 add	x12,x12,x13,lsl#32	// bfi	x12,x13,#32,#32
+	umlal	v20.2d,v18.2s,v4.s[2]
+	 fmov	d16,x8
+	umlal	v19.2d,v18.2s,v2.s[2]
+	 fmov	d17,x10
+
+	////////////////////////////////////////////////////////////////
+	// (hash+inp[0:1])*r^4 and accumulate
+
+	add	v9.2s,v9.2s,v24.2s
+	 fmov	d18,x12
+	umlal	v22.2d,v11.2s,v1.s[0]
+	 ldp	x8,x12,[x1],#16	// inp[0:1]
+	umlal	v19.2d,v11.2s,v6.s[0]
+	 ldp	x9,x13,[x1],#48
+	umlal	v23.2d,v11.2s,v3.s[0]
+	umlal	v20.2d,v11.2s,v8.s[0]
+	umlal	v21.2d,v11.2s,v0.s[0]
+#ifdef	__AARCH64EB__
+	 rev	x8,x8
+	 rev	x12,x12
+	 rev	x9,x9
+	 rev	x13,x13
+#endif
+
+	add	v10.2s,v10.2s,v25.2s
+	umlal	v22.2d,v9.2s,v5.s[0]
+	umlal	v23.2d,v9.2s,v7.s[0]
+	 and	x4,x8,#0x03ffffff	// base 2^64 -> base 2^26
+	umlal	v21.2d,v9.2s,v3.s[0]
+	 and	x5,x9,#0x03ffffff
+	umlal	v19.2d,v9.2s,v0.s[0]
+	 ubfx	x6,x8,#26,#26
+	umlal	v20.2d,v9.2s,v1.s[0]
+	 ubfx	x7,x9,#26,#26
+
+	add	v12.2s,v12.2s,v27.2s
+	 add	x4,x4,x5,lsl#32		// bfi	x4,x5,#32,#32
+	umlal	v22.2d,v10.2s,v3.s[0]
+	 extr	x8,x12,x8,#52
+	umlal	v23.2d,v10.2s,v5.s[0]
+	 extr	x9,x13,x9,#52
+	umlal	v19.2d,v10.2s,v8.s[0]
+	 add	x6,x6,x7,lsl#32		// bfi	x6,x7,#32,#32
+	umlal	v21.2d,v10.2s,v1.s[0]
+	 fmov	d9,x4
+	umlal	v20.2d,v10.2s,v0.s[0]
+	 and	x8,x8,#0x03ffffff
+
+	add	v13.2s,v13.2s,v28.2s
+	 and	x9,x9,#0x03ffffff
+	umlal	v22.2d,v12.2s,v0.s[0]
+	 ubfx	x10,x12,#14,#26
+	umlal	v19.2d,v12.2s,v4.s[0]
+	 ubfx	x11,x13,#14,#26
+	umlal	v23.2d,v12.2s,v1.s[0]
+	 add	x8,x8,x9,lsl#32		// bfi	x8,x9,#32,#32
+	umlal	v20.2d,v12.2s,v6.s[0]
+	 fmov	d10,x6
+	umlal	v21.2d,v12.2s,v8.s[0]
+	 add	x12,x3,x12,lsr#40
+
+	umlal	v22.2d,v13.2s,v8.s[0]
+	 add	x13,x3,x13,lsr#40
+	umlal	v19.2d,v13.2s,v2.s[0]
+	 add	x10,x10,x11,lsl#32	// bfi	x10,x11,#32,#32
+	umlal	v23.2d,v13.2s,v0.s[0]
+	 add	x12,x12,x13,lsl#32	// bfi	x12,x13,#32,#32
+	umlal	v20.2d,v13.2s,v4.s[0]
+	 fmov	d11,x8
+	umlal	v21.2d,v13.2s,v6.s[0]
+	 fmov	d12,x10
+	 fmov	d13,x12
+
+	/////////////////////////////////////////////////////////////////
+	// lazy reduction as discussed in "NEON crypto" by D.J. Bernstein
+	// and P. Schwabe
+	//
+	// [see discussion in poly1305-armv4 module]
+
+	ushr	v29.2d,v22.2d,#26
+	xtn	v27.2s,v22.2d
+	 ushr	v30.2d,v19.2d,#26
+	 and	v19.16b,v19.16b,v31.16b
+	add	v23.2d,v23.2d,v29.2d	// h3 -> h4
+	bic	v27.2s,#0xfc,lsl#24	// &=0x03ffffff
+	 add	v20.2d,v20.2d,v30.2d	// h0 -> h1
+
+	ushr	v29.2d,v23.2d,#26
+	xtn	v28.2s,v23.2d
+	 ushr	v30.2d,v20.2d,#26
+	 xtn	v25.2s,v20.2d
+	bic	v28.2s,#0xfc,lsl#24
+	 add	v21.2d,v21.2d,v30.2d	// h1 -> h2
+
+	add	v19.2d,v19.2d,v29.2d
+	shl	v29.2d,v29.2d,#2
+	 shrn	v30.2s,v21.2d,#26
+	 xtn	v26.2s,v21.2d
+	add	v19.2d,v19.2d,v29.2d	// h4 -> h0
+	 bic	v25.2s,#0xfc,lsl#24
+	 add	v27.2s,v27.2s,v30.2s		// h2 -> h3
+	 bic	v26.2s,#0xfc,lsl#24
+
+	shrn	v29.2s,v19.2d,#26
+	xtn	v24.2s,v19.2d
+	 ushr	v30.2s,v27.2s,#26
+	 bic	v27.2s,#0xfc,lsl#24
+	 bic	v24.2s,#0xfc,lsl#24
+	add	v25.2s,v25.2s,v29.2s		// h0 -> h1
+	 add	v28.2s,v28.2s,v30.2s		// h3 -> h4
+
+	b.hi	.Loop_neon
+
+.Lskip_loop:
+	dup	v16.2d,v16.d[0]
+	add	v11.2s,v11.2s,v26.2s
+
+	////////////////////////////////////////////////////////////////
+	// multiply (inp[0:1]+hash) or inp[2:3] by r^2:r^1
+
+	adds	x2,x2,#32
+	b.ne	.Long_tail
+
+	dup	v16.2d,v11.d[0]
+	add	v14.2s,v9.2s,v24.2s
+	add	v17.2s,v12.2s,v27.2s
+	add	v15.2s,v10.2s,v25.2s
+	add	v18.2s,v13.2s,v28.2s
+
+.Long_tail:
+	dup	v14.2d,v14.d[0]
+	umull2	v19.2d,v16.4s,v6.4s
+	umull2	v22.2d,v16.4s,v1.4s
+	umull2	v23.2d,v16.4s,v3.4s
+	umull2	v21.2d,v16.4s,v0.4s
+	umull2	v20.2d,v16.4s,v8.4s
+
+	dup	v15.2d,v15.d[0]
+	umlal2	v19.2d,v14.4s,v0.4s
+	umlal2	v21.2d,v14.4s,v3.4s
+	umlal2	v22.2d,v14.4s,v5.4s
+	umlal2	v23.2d,v14.4s,v7.4s
+	umlal2	v20.2d,v14.4s,v1.4s
+
+	dup	v17.2d,v17.d[0]
+	umlal2	v19.2d,v15.4s,v8.4s
+	umlal2	v22.2d,v15.4s,v3.4s
+	umlal2	v21.2d,v15.4s,v1.4s
+	umlal2	v23.2d,v15.4s,v5.4s
+	umlal2	v20.2d,v15.4s,v0.4s
+
+	dup	v18.2d,v18.d[0]
+	umlal2	v22.2d,v17.4s,v0.4s
+	umlal2	v23.2d,v17.4s,v1.4s
+	umlal2	v19.2d,v17.4s,v4.4s
+	umlal2	v20.2d,v17.4s,v6.4s
+	umlal2	v21.2d,v17.4s,v8.4s
+
+	umlal2	v22.2d,v18.4s,v8.4s
+	umlal2	v19.2d,v18.4s,v2.4s
+	umlal2	v23.2d,v18.4s,v0.4s
+	umlal2	v20.2d,v18.4s,v4.4s
+	umlal2	v21.2d,v18.4s,v6.4s
+
+	b.eq	.Lshort_tail
+
+	////////////////////////////////////////////////////////////////
+	// (hash+inp[0:1])*r^4:r^3 and accumulate
+
+	add	v9.2s,v9.2s,v24.2s
+	umlal	v22.2d,v11.2s,v1.2s
+	umlal	v19.2d,v11.2s,v6.2s
+	umlal	v23.2d,v11.2s,v3.2s
+	umlal	v20.2d,v11.2s,v8.2s
+	umlal	v21.2d,v11.2s,v0.2s
+
+	add	v10.2s,v10.2s,v25.2s
+	umlal	v22.2d,v9.2s,v5.2s
+	umlal	v19.2d,v9.2s,v0.2s
+	umlal	v23.2d,v9.2s,v7.2s
+	umlal	v20.2d,v9.2s,v1.2s
+	umlal	v21.2d,v9.2s,v3.2s
+
+	add	v12.2s,v12.2s,v27.2s
+	umlal	v22.2d,v10.2s,v3.2s
+	umlal	v19.2d,v10.2s,v8.2s
+	umlal	v23.2d,v10.2s,v5.2s
+	umlal	v20.2d,v10.2s,v0.2s
+	umlal	v21.2d,v10.2s,v1.2s
+
+	add	v13.2s,v13.2s,v28.2s
+	umlal	v22.2d,v12.2s,v0.2s
+	umlal	v19.2d,v12.2s,v4.2s
+	umlal	v23.2d,v12.2s,v1.2s
+	umlal	v20.2d,v12.2s,v6.2s
+	umlal	v21.2d,v12.2s,v8.2s
+
+	umlal	v22.2d,v13.2s,v8.2s
+	umlal	v19.2d,v13.2s,v2.2s
+	umlal	v23.2d,v13.2s,v0.2s
+	umlal	v20.2d,v13.2s,v4.2s
+	umlal	v21.2d,v13.2s,v6.2s
+
+.Lshort_tail:
+	////////////////////////////////////////////////////////////////
+	// horizontal add
+
+	addp	v22.2d,v22.2d,v22.2d
+	 ldp	d8,d9,[sp,#16]		// meet ABI requirements
+	addp	v19.2d,v19.2d,v19.2d
+	 ldp	d10,d11,[sp,#32]
+	addp	v23.2d,v23.2d,v23.2d
+	 ldp	d12,d13,[sp,#48]
+	addp	v20.2d,v20.2d,v20.2d
+	 ldp	d14,d15,[sp,#64]
+	addp	v21.2d,v21.2d,v21.2d
+	 ldr	x30,[sp,#8]
+	 .inst	0xd50323bf		// autiasp
+
+	////////////////////////////////////////////////////////////////
+	// lazy reduction, but without narrowing
+
+	ushr	v29.2d,v22.2d,#26
+	and	v22.16b,v22.16b,v31.16b
+	 ushr	v30.2d,v19.2d,#26
+	 and	v19.16b,v19.16b,v31.16b
+
+	add	v23.2d,v23.2d,v29.2d	// h3 -> h4
+	 add	v20.2d,v20.2d,v30.2d	// h0 -> h1
+
+	ushr	v29.2d,v23.2d,#26
+	and	v23.16b,v23.16b,v31.16b
+	 ushr	v30.2d,v20.2d,#26
+	 and	v20.16b,v20.16b,v31.16b
+	 add	v21.2d,v21.2d,v30.2d	// h1 -> h2
+
+	add	v19.2d,v19.2d,v29.2d
+	shl	v29.2d,v29.2d,#2
+	 ushr	v30.2d,v21.2d,#26
+	 and	v21.16b,v21.16b,v31.16b
+	add	v19.2d,v19.2d,v29.2d	// h4 -> h0
+	 add	v22.2d,v22.2d,v30.2d	// h2 -> h3
+
+	ushr	v29.2d,v19.2d,#26
+	and	v19.16b,v19.16b,v31.16b
+	 ushr	v30.2d,v22.2d,#26
+	 and	v22.16b,v22.16b,v31.16b
+	add	v20.2d,v20.2d,v29.2d	// h0 -> h1
+	 add	v23.2d,v23.2d,v30.2d	// h3 -> h4
+
+	////////////////////////////////////////////////////////////////
+	// write the result, can be partially reduced
+
+	st4	{v19.s,v20.s,v21.s,v22.s}[0],[x0],#16
+	mov	x4,#1
+	st1	{v23.s}[0],[x0]
+	str	x4,[x0,#8]		// set is_base2_26
+
+	ldr	x29,[sp],#80
+	ret
+.size	poly1305_blocks_neon,.-poly1305_blocks_neon
+
+.align	5
+.Lzeros:
+.long	0,0,0,0,0,0,0,0
+.asciz	"Poly1305 for ARMv8, CRYPTOGAMS by @dot-asm"
+.align	2
+#if !defined(__KERNEL__) && !defined(_WIN64)
+.comm	OPENSSL_armcap_P,4,4
+.hidden	OPENSSL_armcap_P
+#endif
diff --git a/arch/arm64/crypto/poly1305-glue.c b/arch/arm64/crypto/poly1305-glue.c
new file mode 100644
index 000000000000..cbf3eaca487e
--- /dev/null
+++ b/arch/arm64/crypto/poly1305-glue.c
@@ -0,0 +1,215 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * OpenSSL/Cryptogams accelerated Poly1305 transform for arm64
+ *
+ * Copyright (C) 2019 Linaro Ltd. <ard.biesheuvel@linaro.org>
+ */
+
+#include <asm/neon.h>
+#include <asm/simd.h>
+#include <asm/unaligned.h>
+#include <crypto/algapi.h>
+#include <crypto/internal/hash.h>
+#include <crypto/internal/simd.h>
+#include <linux/cpufeature.h>
+#include <linux/crypto.h>
+#include <linux/module.h>
+
+#define POLY1305_BLOCK_SIZE	16
+#define POLY1305_DIGEST_SIZE	16
+
+struct neon_poly1305_ctx {
+	/* the state owned by the accelerated code */
+	u64 state[24];
+	/* finalize key */
+	u32 s[4];
+	/* partial buffer */
+	u8 buf[POLY1305_BLOCK_SIZE];
+	/* bytes used in partial buffer */
+	unsigned int buflen;
+	/* r key has been set */
+	bool rset;
+	/* s key has been set */
+	bool sset;
+};
+
+asmlinkage void poly1305_init(u64 *state, const u8 *key);
+asmlinkage void poly1305_blocks(u64 *state, const u8 *src, u32 len, u32 hibit);
+asmlinkage void poly1305_blocks_neon(u64 *state, const u8 *src, u32 len, u32 hibit);
+asmlinkage void poly1305_emit(u64 state[], __le32 *digest, const u32 *nonce);
+
+static int neon_poly1305_init(struct shash_desc *desc)
+{
+	struct neon_poly1305_ctx *dctx = shash_desc_ctx(desc);
+
+	dctx->buflen = 0;
+	dctx->rset = false;
+	dctx->sset = false;
+
+	return 0;
+}
+
+static void neon_poly1305_blocks(struct neon_poly1305_ctx *dctx, const u8 *src,
+				 u32 len, u32 hibit, bool do_neon)
+{
+	if (unlikely(!dctx->sset)) {
+		if (!dctx->rset) {
+			poly1305_init(dctx->state, src);
+			src += POLY1305_BLOCK_SIZE;
+			len -= POLY1305_BLOCK_SIZE;
+			dctx->rset = true;
+		}
+		if (len >= POLY1305_BLOCK_SIZE) {
+			dctx->s[0] = get_unaligned_le32(src +  0);
+			dctx->s[1] = get_unaligned_le32(src +  4);
+			dctx->s[2] = get_unaligned_le32(src +  8);
+			dctx->s[3] = get_unaligned_le32(src + 12);
+			src += POLY1305_BLOCK_SIZE;
+			len -= POLY1305_BLOCK_SIZE;
+			dctx->sset = true;
+		}
+		if (len < POLY1305_BLOCK_SIZE)
+			return;
+	}
+
+	len &= ~(POLY1305_BLOCK_SIZE - 1);
+
+	if (likely(do_neon))
+		poly1305_blocks_neon(dctx->state, src, len, hibit);
+	else
+		poly1305_blocks(dctx->state, src, len, hibit);
+}
+
+static void neon_poly1305_do_update(struct neon_poly1305_ctx *dctx,
+				    const u8 *src, u32 len, bool do_neon)
+{
+	if (unlikely(dctx->buflen)) {
+		u32 bytes = min(len, POLY1305_BLOCK_SIZE - dctx->buflen);
+
+		memcpy(dctx->buf + dctx->buflen, src, bytes);
+		src += bytes;
+		len -= bytes;
+		dctx->buflen += bytes;
+
+		if (dctx->buflen == POLY1305_BLOCK_SIZE) {
+			neon_poly1305_blocks(dctx, dctx->buf,
+					     POLY1305_BLOCK_SIZE, 1, false);
+			dctx->buflen = 0;
+		}
+	}
+
+	if (likely(len >= POLY1305_BLOCK_SIZE)) {
+		neon_poly1305_blocks(dctx, src, len, 1, do_neon);
+		src += round_down(len, POLY1305_BLOCK_SIZE);
+		len %= POLY1305_BLOCK_SIZE;
+	}
+
+	if (unlikely(len)) {
+		dctx->buflen = len;
+		memcpy(dctx->buf, src, len);
+	}
+}
+
+static int neon_poly1305_update(struct shash_desc *desc,
+				const u8 *src, unsigned int srclen)
+{
+	bool do_neon = crypto_simd_usable() && srclen > 128;
+	struct neon_poly1305_ctx *dctx = shash_desc_ctx(desc);
+
+	if (do_neon)
+		kernel_neon_begin();
+	neon_poly1305_do_update(dctx, src, srclen, do_neon);
+	if (do_neon)
+		kernel_neon_end();
+	return 0;
+}
+
+static int neon_poly1305_update_from_sg(struct shash_desc *desc,
+					struct scatterlist *sg,
+					unsigned int srclen, int flags)
+{
+	bool do_neon = crypto_simd_usable() && srclen > 128;
+	struct neon_poly1305_ctx *dctx = shash_desc_ctx(desc);
+	struct crypto_hash_walk walk;
+	int nbytes;
+
+	if (do_neon) {
+		kernel_neon_begin();
+		flags &= ~CRYPTO_TFM_REQ_MAY_SLEEP;
+	}
+
+	for (nbytes = crypto_shash_walk_sg(desc, sg, srclen, &walk, flags);
+	     nbytes > 0;
+	     nbytes = crypto_hash_walk_done(&walk, 0))
+		neon_poly1305_do_update(dctx, walk.data, nbytes, do_neon);
+
+	if (do_neon)
+		kernel_neon_end();
+
+	return 0;
+}
+
+static int neon_poly1305_final(struct shash_desc *desc, u8 *dst)
+{
+	struct neon_poly1305_ctx *dctx = shash_desc_ctx(desc);
+	__le32 digest[4];
+	u64 f = 0;
+
+	if (unlikely(!dctx->sset))
+		return -ENOKEY;
+
+	if (unlikely(dctx->buflen)) {
+		dctx->buf[dctx->buflen++] = 1;
+		memset(dctx->buf + dctx->buflen, 0,
+		       POLY1305_BLOCK_SIZE - dctx->buflen);
+		poly1305_blocks(dctx->state, dctx->buf, POLY1305_BLOCK_SIZE, 0);
+	}
+
+	poly1305_emit(dctx->state, digest, dctx->s);
+
+	/* mac = (h + s) % (2^128) */
+	f = (f >> 32) + le32_to_cpu(digest[0]);
+	put_unaligned_le32(f, dst);
+	f = (f >> 32) + le32_to_cpu(digest[1]);
+	put_unaligned_le32(f, dst + 4);
+	f = (f >> 32) + le32_to_cpu(digest[2]);
+	put_unaligned_le32(f, dst + 8);
+	f = (f >> 32) + le32_to_cpu(digest[3]);
+	put_unaligned_le32(f, dst + 12);
+
+	return 0;
+}
+
+static struct shash_alg neon_poly1305_alg = {
+	.init			= neon_poly1305_init,
+	.update			= neon_poly1305_update,
+	.update_from_sg		= neon_poly1305_update_from_sg,
+	.final			= neon_poly1305_final,
+	.digestsize		= POLY1305_DIGEST_SIZE,
+	.descsize		= sizeof(struct neon_poly1305_ctx),
+
+	.base.cra_name		= "poly1305",
+	.base.cra_driver_name	= "poly1305-neon",
+	.base.cra_priority	= 200,
+	.base.cra_blocksize	= POLY1305_BLOCK_SIZE,
+	.base.cra_module	= THIS_MODULE,
+};
+
+static int __init neon_poly1305_mod_init(void)
+{
+	if (!cpu_have_named_feature(ASIMD))
+		return -ENODEV;
+	return crypto_register_shash(&neon_poly1305_alg);
+}
+
+static void __exit neon_poly1305_mod_exit(void)
+{
+	crypto_unregister_shash(&neon_poly1305_alg);
+}
+
+module_init(neon_poly1305_mod_init);
+module_exit(neon_poly1305_mod_exit);
+
+MODULE_LICENSE("GPL v2");
+MODULE_ALIAS_CRYPTO("poly1305");
+MODULE_ALIAS_CRYPTO("poly1305-neon");
-- 
2.20.1



* [RFC PATCH 05/18] crypto: chacha - move existing library code into lib/crypto
  2019-09-25 16:12 [RFC PATCH 00/18] crypto: wireguard using the existing crypto API Ard Biesheuvel
                   ` (3 preceding siblings ...)
  2019-09-25 16:12 ` [RFC PATCH 04/18] crypto: arm64/poly1305 " Ard Biesheuvel
@ 2019-09-25 16:12 ` Ard Biesheuvel
  2019-09-25 16:12 ` [RFC PATCH 06/18] crypto: rfc7539 - switch to shash for Poly1305 Ard Biesheuvel
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 61+ messages in thread
From: Ard Biesheuvel @ 2019-09-25 16:12 UTC (permalink / raw)
  To: linux-crypto
  Cc: Jason A . Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Ard Biesheuvel, Greg KH, Eric Biggers, Samuel Neves, Will Deacon,
	Dan Carpenter, Andy Lutomirski, Marc Zyngier, Linus Torvalds,
	David Miller, linux-arm-kernel

Move the existing shared ChaCha code into lib/crypto, and at the
same time, split the support header into a public version and an
internal version that is intended only for consumption by crypto
implementations.

While at it, tidy up lib/crypto/Makefile a bit so we are ready for
some new arrivals.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
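
For illustration, a minimal sketch of how a built-in user might call the
relocated code through the new public <crypto/chacha.h> interface rather
than the skcipher API; the helper name, the zeroed IV and the 20-round
count are example values only, not taken from this patch:

#include <crypto/chacha.h>

static void chacha20_xor_example(const u32 key[8], u8 *buf, unsigned int len)
{
	u32 state[16];
	u8 iv[CHACHA_IV_SIZE] = { 0 };	/* 32-bit counter + 96-bit nonce */

	/* load the "expand 32-byte k" constants, the key and the IV */
	chacha_init(state, key, iv);
	/* XOR the ChaCha20 keystream over buf in place */
	chacha_crypt(state, buf, buf, len, 20);
}
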
 arch/arm/crypto/chacha-neon-glue.c   |  2 +-
 arch/arm64/crypto/chacha-neon-glue.c |  2 +-
 arch/x86/crypto/chacha_glue.c        |  2 +-
 crypto/chacha_generic.c              | 42 ++------------------
 include/crypto/chacha.h              | 37 ++++++++++-------
 include/crypto/internal/chacha.h     | 25 ++++++++++++
 lib/Makefile                         |  3 +-
 lib/crypto/Makefile                  | 19 +++++----
 lib/{ => crypto}/chacha.c            | 23 +++++++++++
 9 files changed, 89 insertions(+), 66 deletions(-)

diff --git a/arch/arm/crypto/chacha-neon-glue.c b/arch/arm/crypto/chacha-neon-glue.c
index a8e9b534c8da..26576772f18b 100644
--- a/arch/arm/crypto/chacha-neon-glue.c
+++ b/arch/arm/crypto/chacha-neon-glue.c
@@ -20,7 +20,7 @@
  */
 
 #include <crypto/algapi.h>
-#include <crypto/chacha.h>
+#include <crypto/internal/chacha.h>
 #include <crypto/internal/simd.h>
 #include <crypto/internal/skcipher.h>
 #include <linux/kernel.h>
diff --git a/arch/arm64/crypto/chacha-neon-glue.c b/arch/arm64/crypto/chacha-neon-glue.c
index 1495d2b18518..d4cc61bfe79d 100644
--- a/arch/arm64/crypto/chacha-neon-glue.c
+++ b/arch/arm64/crypto/chacha-neon-glue.c
@@ -20,7 +20,7 @@
  */
 
 #include <crypto/algapi.h>
-#include <crypto/chacha.h>
+#include <crypto/internal/chacha.h>
 #include <crypto/internal/simd.h>
 #include <crypto/internal/skcipher.h>
 #include <linux/kernel.h>
diff --git a/arch/x86/crypto/chacha_glue.c b/arch/x86/crypto/chacha_glue.c
index 388f95a4ec24..bc62daa8dafd 100644
--- a/arch/x86/crypto/chacha_glue.c
+++ b/arch/x86/crypto/chacha_glue.c
@@ -7,7 +7,7 @@
  */
 
 #include <crypto/algapi.h>
-#include <crypto/chacha.h>
+#include <crypto/internal/chacha.h>
 #include <crypto/internal/simd.h>
 #include <crypto/internal/skcipher.h>
 #include <linux/kernel.h>
diff --git a/crypto/chacha_generic.c b/crypto/chacha_generic.c
index 085d8d219987..0a0847eacfc8 100644
--- a/crypto/chacha_generic.c
+++ b/crypto/chacha_generic.c
@@ -8,29 +8,10 @@
 
 #include <asm/unaligned.h>
 #include <crypto/algapi.h>
-#include <crypto/chacha.h>
+#include <crypto/internal/chacha.h>
 #include <crypto/internal/skcipher.h>
 #include <linux/module.h>
 
-static void chacha_docrypt(u32 *state, u8 *dst, const u8 *src,
-			   unsigned int bytes, int nrounds)
-{
-	/* aligned to potentially speed up crypto_xor() */
-	u8 stream[CHACHA_BLOCK_SIZE] __aligned(sizeof(long));
-
-	while (bytes >= CHACHA_BLOCK_SIZE) {
-		chacha_block(state, stream, nrounds);
-		crypto_xor_cpy(dst, src, stream, CHACHA_BLOCK_SIZE);
-		bytes -= CHACHA_BLOCK_SIZE;
-		dst += CHACHA_BLOCK_SIZE;
-		src += CHACHA_BLOCK_SIZE;
-	}
-	if (bytes) {
-		chacha_block(state, stream, nrounds);
-		crypto_xor_cpy(dst, src, stream, bytes);
-	}
-}
-
 static int chacha_stream_xor(struct skcipher_request *req,
 			     const struct chacha_ctx *ctx, const u8 *iv)
 {
@@ -48,8 +29,8 @@ static int chacha_stream_xor(struct skcipher_request *req,
 		if (nbytes < walk.total)
 			nbytes = round_down(nbytes, CHACHA_BLOCK_SIZE);
 
-		chacha_docrypt(state, walk.dst.virt.addr, walk.src.virt.addr,
-			       nbytes, ctx->nrounds);
+		chacha_crypt(state, walk.dst.virt.addr, walk.src.virt.addr,
+			     nbytes, ctx->nrounds);
 		err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
 	}
 
@@ -58,22 +39,7 @@ static int chacha_stream_xor(struct skcipher_request *req,
 
 void crypto_chacha_init(u32 *state, const struct chacha_ctx *ctx, const u8 *iv)
 {
-	state[0]  = 0x61707865; /* "expa" */
-	state[1]  = 0x3320646e; /* "nd 3" */
-	state[2]  = 0x79622d32; /* "2-by" */
-	state[3]  = 0x6b206574; /* "te k" */
-	state[4]  = ctx->key[0];
-	state[5]  = ctx->key[1];
-	state[6]  = ctx->key[2];
-	state[7]  = ctx->key[3];
-	state[8]  = ctx->key[4];
-	state[9]  = ctx->key[5];
-	state[10] = ctx->key[6];
-	state[11] = ctx->key[7];
-	state[12] = get_unaligned_le32(iv +  0);
-	state[13] = get_unaligned_le32(iv +  4);
-	state[14] = get_unaligned_le32(iv +  8);
-	state[15] = get_unaligned_le32(iv + 12);
+	chacha_init(state, ctx->key, iv);
 }
 EXPORT_SYMBOL_GPL(crypto_chacha_init);
 
diff --git a/include/crypto/chacha.h b/include/crypto/chacha.h
index d1e723c6a37d..b2aaf711f2e7 100644
--- a/include/crypto/chacha.h
+++ b/include/crypto/chacha.h
@@ -15,9 +15,8 @@
 #ifndef _CRYPTO_CHACHA_H
 #define _CRYPTO_CHACHA_H
 
-#include <crypto/skcipher.h>
+#include <asm/unaligned.h>
 #include <linux/types.h>
-#include <linux/crypto.h>
 
 /* 32-bit stream position, then 96-bit nonce (RFC7539 convention) */
 #define CHACHA_IV_SIZE		16
@@ -29,11 +28,6 @@
 /* 192-bit nonce, then 64-bit stream position */
 #define XCHACHA_IV_SIZE		32
 
-struct chacha_ctx {
-	u32 key[8];
-	int nrounds;
-};
-
 void chacha_block(u32 *state, u8 *stream, int nrounds);
 static inline void chacha20_block(u32 *state, u8 *stream)
 {
@@ -41,14 +35,27 @@ static inline void chacha20_block(u32 *state, u8 *stream)
 }
 void hchacha_block(const u32 *in, u32 *out, int nrounds);
 
-void crypto_chacha_init(u32 *state, const struct chacha_ctx *ctx, const u8 *iv);
-
-int crypto_chacha20_setkey(struct crypto_skcipher *tfm, const u8 *key,
-			   unsigned int keysize);
-int crypto_chacha12_setkey(struct crypto_skcipher *tfm, const u8 *key,
-			   unsigned int keysize);
+static inline void chacha_init(u32 *state, const u32 *key, const u8 *iv)
+{
+	state[0]  = 0x61707865; /* "expa" */
+	state[1]  = 0x3320646e; /* "nd 3" */
+	state[2]  = 0x79622d32; /* "2-by" */
+	state[3]  = 0x6b206574; /* "te k" */
+	state[4]  = key[0];
+	state[5]  = key[1];
+	state[6]  = key[2];
+	state[7]  = key[3];
+	state[8]  = key[4];
+	state[9]  = key[5];
+	state[10] = key[6];
+	state[11] = key[7];
+	state[12] = get_unaligned_le32(iv +  0);
+	state[13] = get_unaligned_le32(iv +  4);
+	state[14] = get_unaligned_le32(iv +  8);
+	state[15] = get_unaligned_le32(iv + 12);
+}
 
-int crypto_chacha_crypt(struct skcipher_request *req);
-int crypto_xchacha_crypt(struct skcipher_request *req);
+void chacha_crypt(u32 *state, u8 *dst, const u8 *src, unsigned int bytes,
+		  int nrounds);
 
 #endif /* _CRYPTO_CHACHA_H */
diff --git a/include/crypto/internal/chacha.h b/include/crypto/internal/chacha.h
new file mode 100644
index 000000000000..f7ffe0f3fa47
--- /dev/null
+++ b/include/crypto/internal/chacha.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _CRYPTO_INTERNAL_CHACHA_H
+#define _CRYPTO_INTERNAL_CHACHA_H
+
+#include <crypto/chacha.h>
+#include <crypto/skcipher.h>
+#include <linux/crypto.h>
+
+struct chacha_ctx {
+	u32 key[8];
+	int nrounds;
+};
+
+void crypto_chacha_init(u32 *state, const struct chacha_ctx *ctx, const u8 *iv);
+
+int crypto_chacha20_setkey(struct crypto_skcipher *tfm, const u8 *key,
+			   unsigned int keysize);
+int crypto_chacha12_setkey(struct crypto_skcipher *tfm, const u8 *key,
+			   unsigned int keysize);
+
+int crypto_chacha_crypt(struct skcipher_request *req);
+int crypto_xchacha_crypt(struct skcipher_request *req);
+
+#endif /* _CRYPTO_INTERNAL_CHACHA_H */
diff --git a/lib/Makefile b/lib/Makefile
index 29c02a924973..1436c9608fdb 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -30,8 +30,7 @@ endif
 
 lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 rbtree.o radix-tree.o timerqueue.o xarray.o \
-	 idr.o extable.o \
-	 sha1.o chacha.o irq_regs.o argv_split.o \
+	 idr.o extable.o sha1.o irq_regs.o argv_split.o \
 	 flex_proportions.o ratelimit.o show_mem.o \
 	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
 	 earlycpio.o seq_buf.o siphash.o dec_and_lock.o \
diff --git a/lib/crypto/Makefile b/lib/crypto/Makefile
index cbe0b6a6450d..24dad058f2ae 100644
--- a/lib/crypto/Makefile
+++ b/lib/crypto/Makefile
@@ -1,13 +1,16 @@
 # SPDX-License-Identifier: GPL-2.0
 
-obj-$(CONFIG_CRYPTO_LIB_AES) += libaes.o
-libaes-y := aes.o
+# chacha is used by the /dev/random driver which is always builtin
+obj-y						+= chacha.o
 
-obj-$(CONFIG_CRYPTO_LIB_ARC4) += libarc4.o
-libarc4-y := arc4.o
+obj-$(CONFIG_CRYPTO_LIB_AES)			+= libaes.o
+libaes-y					:= aes.o
 
-obj-$(CONFIG_CRYPTO_LIB_DES) += libdes.o
-libdes-y := des.o
+obj-$(CONFIG_CRYPTO_LIB_ARC4)			+= libarc4.o
+libarc4-y					:= arc4.o
 
-obj-$(CONFIG_CRYPTO_LIB_SHA256) += libsha256.o
-libsha256-y := sha256.o
+obj-$(CONFIG_CRYPTO_LIB_DES)			+= libdes.o
+libdes-y					:= des.o
+
+obj-$(CONFIG_CRYPTO_LIB_SHA256)			+= libsha256.o
+libsha256-y					:= sha256.o
diff --git a/lib/chacha.c b/lib/crypto/chacha.c
similarity index 85%
rename from lib/chacha.c
rename to lib/crypto/chacha.c
index c7c9826564d3..429adac51f66 100644
--- a/lib/chacha.c
+++ b/lib/crypto/chacha.c
@@ -5,11 +5,14 @@
  * Copyright (C) 2015 Martin Willi
  */
 
+#include <linux/bug.h>
 #include <linux/kernel.h>
 #include <linux/export.h>
 #include <linux/bitops.h>
+#include <linux/string.h>
 #include <linux/cryptohash.h>
 #include <asm/unaligned.h>
+#include <crypto/algapi.h> // for crypto_xor_cpy
 #include <crypto/chacha.h>
 
 static void chacha_permute(u32 *x, int nrounds)
@@ -111,3 +114,23 @@ void hchacha_block(const u32 *in, u32 *out, int nrounds)
 	memcpy(&out[4], &x[12], 16);
 }
 EXPORT_SYMBOL(hchacha_block);
+
+void chacha_crypt(u32 *state, u8 *dst, const u8 *src, unsigned int bytes,
+		  int nrounds)
+{
+	/* aligned to potentially speed up crypto_xor() */
+	u8 stream[CHACHA_BLOCK_SIZE] __aligned(sizeof(long));
+
+	while (bytes >= CHACHA_BLOCK_SIZE) {
+		chacha_block(state, stream, nrounds);
+		crypto_xor_cpy(dst, src, stream, CHACHA_BLOCK_SIZE);
+		bytes -= CHACHA_BLOCK_SIZE;
+		dst += CHACHA_BLOCK_SIZE;
+		src += CHACHA_BLOCK_SIZE;
+	}
+	if (bytes) {
+		chacha_block(state, stream, nrounds);
+		crypto_xor_cpy(dst, src, stream, bytes);
+	}
+}
+EXPORT_SYMBOL(chacha_crypt);
-- 
2.20.1



* [RFC PATCH 06/18] crypto: rfc7539 - switch to shash for Poly1305
  2019-09-25 16:12 [RFC PATCH 00/18] crypto: wireguard using the existing crypto API Ard Biesheuvel
                   ` (4 preceding siblings ...)
  2019-09-25 16:12 ` [RFC PATCH 05/18] crypto: chacha - move existing library code into lib/crypto Ard Biesheuvel
@ 2019-09-25 16:12 ` Ard Biesheuvel
  2019-09-25 16:12 ` [RFC PATCH 07/18] crypto: rfc7539 - use zero reqsize for sync instantiations without alignmask Ard Biesheuvel
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 61+ messages in thread
From: Ard Biesheuvel @ 2019-09-25 16:12 UTC (permalink / raw)
  To: linux-crypto
  Cc: Jason A . Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Ard Biesheuvel, Greg KH, Eric Biggers, Samuel Neves, Will Deacon,
	Dan Carpenter, Andy Lutomirski, Marc Zyngier, Linus Torvalds,
	David Miller, linux-arm-kernel

The RFC7539 template uses an ahash Poly1305 transformation to implement
the authentication part of the algorithm. Since ahashes operate on
scatterlists only, this forces the RFC7539 driver to allocate scratch
buffers in the request context, to ensure that they are allocated from
the heap.

However, in practice, all Poly1305 implementations available today are
shashes wrapped by the generic ahash->shash wrapper, so we are jumping
through hoops unnecessarily. Moreover, the way this driver invokes the
ahash (six consecutive requests for the key, the associated data,
padding, the ciphertext, more padding and the tail) makes it very
unlikely that a truly asynchronous implementation that can operate
efficiently in this context will ever materialize.

So now that shashes can operate on scatterlists as well as virtually
mapped buffers, switch the RFC7539 template from ahash to shash for
the Poly1305 transformation. At the same time, switch to using the
ChaCha library to generate the Poly1305 key so that we don't have to
call into the [potentially asynchronous] skcipher twice, with one call
only operating on 32 bytes of data.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
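
For illustration, a minimal sketch of the key derivation this switch
enables: keystream block 0 is taken from the ChaCha library directly,
instead of running a 32-byte crypto_skcipher_decrypt() over a zeroed
buffer. The helper name and parameter layout below are hypothetical;
the template itself does this inline in poly_generate_tag():

#include <crypto/chacha.h>
#include <crypto/poly1305.h>

static void derive_poly1305_key(u8 poly_key[POLY1305_KEY_SIZE],
				const u32 chacha_key[8],
				const u8 iv[CHACHA_IV_SIZE])
{
	u32 state[16];
	static const u8 zeroes[POLY1305_KEY_SIZE];	/* 32 zero bytes */

	/* the IV carries the salt/nonce with the block counter set to 0 */
	chacha_init(state, chacha_key, iv);
	/* keystream block 0 becomes the one-time Poly1305 key (r, s) */
	chacha_crypt(state, poly_key, zeroes, POLY1305_KEY_SIZE, 20);
}
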
 crypto/chacha20poly1305.c | 513 ++++++--------------
 1 file changed, 145 insertions(+), 368 deletions(-)

diff --git a/crypto/chacha20poly1305.c b/crypto/chacha20poly1305.c
index 74e824e537e6..71496a8107f5 100644
--- a/crypto/chacha20poly1305.c
+++ b/crypto/chacha20poly1305.c
@@ -20,53 +20,32 @@
 
 struct chachapoly_instance_ctx {
 	struct crypto_skcipher_spawn chacha;
-	struct crypto_ahash_spawn poly;
+	struct crypto_shash_spawn poly;
 	unsigned int saltlen;
 };
 
 struct chachapoly_ctx {
 	struct crypto_skcipher *chacha;
-	struct crypto_ahash *poly;
+	struct crypto_shash *poly;
+	u32 chacha_key[CHACHA_KEY_SIZE / sizeof(u32)];
 	/* key bytes we use for the ChaCha20 IV */
 	unsigned int saltlen;
 	u8 salt[];
 };
 
-struct poly_req {
-	/* zero byte padding for AD/ciphertext, as needed */
-	u8 pad[POLY1305_BLOCK_SIZE];
-	/* tail data with AD/ciphertext lengths */
-	struct {
-		__le64 assoclen;
-		__le64 cryptlen;
-	} tail;
-	struct scatterlist src[1];
-	struct ahash_request req; /* must be last member */
-};
-
 struct chacha_req {
 	u8 iv[CHACHA_IV_SIZE];
-	struct scatterlist src[1];
 	struct skcipher_request req; /* must be last member */
 };
 
 struct chachapoly_req_ctx {
 	struct scatterlist src[2];
 	struct scatterlist dst[2];
-	/* the key we generate for Poly1305 using Chacha20 */
-	u8 key[POLY1305_KEY_SIZE];
-	/* calculated Poly1305 tag */
-	u8 tag[POLY1305_DIGEST_SIZE];
 	/* length of data to en/decrypt, without ICV */
 	unsigned int cryptlen;
-	/* Actual AD, excluding IV */
-	unsigned int assoclen;
 	/* request flags, with MAY_SLEEP cleared if needed */
 	u32 flags;
-	union {
-		struct poly_req poly;
-		struct chacha_req chacha;
-	} u;
+	struct chacha_req chacha;
 };
 
 static inline void async_done_continue(struct aead_request *req, int err,
@@ -94,43 +73,114 @@ static void chacha_iv(u8 *iv, struct aead_request *req, u32 icb)
 	       CHACHA_IV_SIZE - sizeof(leicb) - ctx->saltlen);
 }
 
-static int poly_verify_tag(struct aead_request *req)
+static int poly_generate_tag(struct aead_request *req, u8 *poly_tag,
+			     struct scatterlist *crypt)
 {
+	struct crypto_aead *tfm = crypto_aead_reqtfm(req);
+	struct chachapoly_ctx *ctx = crypto_aead_ctx(tfm);
 	struct chachapoly_req_ctx *rctx = aead_request_ctx(req);
-	u8 tag[sizeof(rctx->tag)];
+	u32 chacha_state[CHACHA_BLOCK_SIZE / sizeof(u32)];
+	SHASH_DESC_ON_STACK(desc, ctx->poly);
+	u8 poly_key[POLY1305_KEY_SIZE];
+	bool skip_ad_pad, atomic;
+	u8 iv[CHACHA_IV_SIZE];
+	unsigned int assoclen;
+	unsigned int padlen;
+	__le64 tail[4];
+
+	/*
+	 * Take the Poly1305 hash of the entire AD plus ciphertext in one go if
+	 * a) we are not running in ESP mode (which puts data between the AD
+	 *    and the ciphertext in the input scatterlist), and
+	 * b) no padding is required between the AD and the ciphertext, and
+	 * c) the source buffer points to the ciphertext, either because we
+	 *    are decrypting, or because we are encrypting in place.
+	 */
+	if (crypto_aead_ivsize(tfm) == 8) {
+		if (req->assoclen < 8)
+			return -EINVAL;
+		assoclen = req->assoclen - 8;
+		skip_ad_pad = false;
+	} else {
+		assoclen = req->assoclen;
+		skip_ad_pad = !(assoclen % POLY1305_BLOCK_SIZE) &&
+			      (crypt == req->src);
+	}
 
-	scatterwalk_map_and_copy(tag, req->src,
-				 req->assoclen + rctx->cryptlen,
-				 sizeof(tag), 0);
-	if (crypto_memneq(tag, rctx->tag, sizeof(tag)))
-		return -EBADMSG;
+	/* derive the Poly1305 key */
+	chacha_iv(iv, req, 0);
+	chacha_init(chacha_state, ctx->chacha_key, iv);
+	chacha_crypt(chacha_state, poly_key, page_address(ZERO_PAGE(0)),
+		     sizeof(poly_key), 20);
+
+	desc->tfm = ctx->poly;
+	crypto_shash_init(desc);
+	crypto_shash_update(desc, poly_key, sizeof(poly_key));
+
+	atomic = !(rctx->flags & CRYPTO_TFM_REQ_MAY_SLEEP);
+	if (skip_ad_pad) {
+		crypto_shash_update_from_sg(desc, crypt,
+					    assoclen + rctx->cryptlen,
+					    atomic);
+	} else {
+		struct scatterlist sg[2];
+
+		crypto_shash_update_from_sg(desc, req->src, assoclen, atomic);
+
+		padlen = -assoclen % POLY1305_BLOCK_SIZE;
+		if (padlen)
+			crypto_shash_update(desc, page_address(ZERO_PAGE(0)),
+					    padlen);
+
+		crypt = scatterwalk_ffwd(sg, crypt, req->assoclen);
+		crypto_shash_update_from_sg(desc, crypt, rctx->cryptlen,
+					    atomic);
+	}
+
+	tail[0] = 0;
+	tail[1] = 0;
+	tail[2] = cpu_to_le64(assoclen);
+	tail[3] = cpu_to_le64(rctx->cryptlen);
+
+	padlen = -rctx->cryptlen % POLY1305_BLOCK_SIZE;
+	crypto_shash_finup(desc, (u8 *)tail + (2 * sizeof(__le64) - padlen),
+			   padlen + 2 * sizeof(__le64), poly_tag);
 	return 0;
 }
 
-static int poly_copy_tag(struct aead_request *req)
+static int poly_append_tag(struct aead_request *req)
 {
 	struct chachapoly_req_ctx *rctx = aead_request_ctx(req);
+	u8 poly_tag[POLY1305_DIGEST_SIZE];
+	int err;
 
-	scatterwalk_map_and_copy(rctx->tag, req->dst,
+	err = poly_generate_tag(req, poly_tag, req->dst);
+	if (err)
+		return err;
+
+	scatterwalk_map_and_copy(poly_tag, req->dst,
 				 req->assoclen + rctx->cryptlen,
-				 sizeof(rctx->tag), 1);
+				 sizeof(poly_tag), 1);
 	return 0;
 }
 
-static void chacha_decrypt_done(struct crypto_async_request *areq, int err)
+static void chacha_encrypt_done(struct crypto_async_request *areq, int err)
 {
-	async_done_continue(areq->data, err, poly_verify_tag);
+	async_done_continue(areq->data, err, poly_append_tag);
 }
 
-static int chacha_decrypt(struct aead_request *req)
+static int chachapoly_encrypt(struct aead_request *req)
 {
 	struct chachapoly_ctx *ctx = crypto_aead_ctx(crypto_aead_reqtfm(req));
 	struct chachapoly_req_ctx *rctx = aead_request_ctx(req);
-	struct chacha_req *creq = &rctx->u.chacha;
+	struct chacha_req *creq = &rctx->chacha;
 	struct scatterlist *src, *dst;
 	int err;
 
-	if (rctx->cryptlen == 0)
+	rctx->cryptlen = req->cryptlen;
+	rctx->flags = aead_request_flags(req);
+
+	if (req->cryptlen == 0)
 		goto skip;
 
 	chacha_iv(creq->iv, req, 1);
@@ -141,273 +191,48 @@ static int chacha_decrypt(struct aead_request *req)
 		dst = scatterwalk_ffwd(rctx->dst, req->dst, req->assoclen);
 
 	skcipher_request_set_callback(&creq->req, rctx->flags,
-				      chacha_decrypt_done, req);
+				      chacha_encrypt_done, req);
 	skcipher_request_set_tfm(&creq->req, ctx->chacha);
 	skcipher_request_set_crypt(&creq->req, src, dst,
-				   rctx->cryptlen, creq->iv);
-	err = crypto_skcipher_decrypt(&creq->req);
+				   req->cryptlen, creq->iv);
+	err = crypto_skcipher_encrypt(&creq->req);
 	if (err)
 		return err;
 
 skip:
-	return poly_verify_tag(req);
-}
-
-static int poly_tail_continue(struct aead_request *req)
-{
-	struct chachapoly_req_ctx *rctx = aead_request_ctx(req);
-
-	if (rctx->cryptlen == req->cryptlen) /* encrypting */
-		return poly_copy_tag(req);
-
-	return chacha_decrypt(req);
-}
-
-static void poly_tail_done(struct crypto_async_request *areq, int err)
-{
-	async_done_continue(areq->data, err, poly_tail_continue);
-}
-
-static int poly_tail(struct aead_request *req)
-{
-	struct crypto_aead *tfm = crypto_aead_reqtfm(req);
-	struct chachapoly_ctx *ctx = crypto_aead_ctx(tfm);
-	struct chachapoly_req_ctx *rctx = aead_request_ctx(req);
-	struct poly_req *preq = &rctx->u.poly;
-	int err;
-
-	preq->tail.assoclen = cpu_to_le64(rctx->assoclen);
-	preq->tail.cryptlen = cpu_to_le64(rctx->cryptlen);
-	sg_init_one(preq->src, &preq->tail, sizeof(preq->tail));
-
-	ahash_request_set_callback(&preq->req, rctx->flags,
-				   poly_tail_done, req);
-	ahash_request_set_tfm(&preq->req, ctx->poly);
-	ahash_request_set_crypt(&preq->req, preq->src,
-				rctx->tag, sizeof(preq->tail));
-
-	err = crypto_ahash_finup(&preq->req);
-	if (err)
-		return err;
-
-	return poly_tail_continue(req);
-}
-
-static void poly_cipherpad_done(struct crypto_async_request *areq, int err)
-{
-	async_done_continue(areq->data, err, poly_tail);
-}
-
-static int poly_cipherpad(struct aead_request *req)
-{
-	struct chachapoly_ctx *ctx = crypto_aead_ctx(crypto_aead_reqtfm(req));
-	struct chachapoly_req_ctx *rctx = aead_request_ctx(req);
-	struct poly_req *preq = &rctx->u.poly;
-	unsigned int padlen;
-	int err;
-
-	padlen = -rctx->cryptlen % POLY1305_BLOCK_SIZE;
-	memset(preq->pad, 0, sizeof(preq->pad));
-	sg_init_one(preq->src, preq->pad, padlen);
-
-	ahash_request_set_callback(&preq->req, rctx->flags,
-				   poly_cipherpad_done, req);
-	ahash_request_set_tfm(&preq->req, ctx->poly);
-	ahash_request_set_crypt(&preq->req, preq->src, NULL, padlen);
-
-	err = crypto_ahash_update(&preq->req);
-	if (err)
-		return err;
-
-	return poly_tail(req);
-}
-
-static void poly_cipher_done(struct crypto_async_request *areq, int err)
-{
-	async_done_continue(areq->data, err, poly_cipherpad);
-}
-
-static int poly_cipher(struct aead_request *req)
-{
-	struct chachapoly_ctx *ctx = crypto_aead_ctx(crypto_aead_reqtfm(req));
-	struct chachapoly_req_ctx *rctx = aead_request_ctx(req);
-	struct poly_req *preq = &rctx->u.poly;
-	struct scatterlist *crypt = req->src;
-	int err;
-
-	if (rctx->cryptlen == req->cryptlen) /* encrypting */
-		crypt = req->dst;
-
-	crypt = scatterwalk_ffwd(rctx->src, crypt, req->assoclen);
-
-	ahash_request_set_callback(&preq->req, rctx->flags,
-				   poly_cipher_done, req);
-	ahash_request_set_tfm(&preq->req, ctx->poly);
-	ahash_request_set_crypt(&preq->req, crypt, NULL, rctx->cryptlen);
-
-	err = crypto_ahash_update(&preq->req);
-	if (err)
-		return err;
-
-	return poly_cipherpad(req);
-}
-
-static void poly_adpad_done(struct crypto_async_request *areq, int err)
-{
-	async_done_continue(areq->data, err, poly_cipher);
-}
-
-static int poly_adpad(struct aead_request *req)
-{
-	struct chachapoly_ctx *ctx = crypto_aead_ctx(crypto_aead_reqtfm(req));
-	struct chachapoly_req_ctx *rctx = aead_request_ctx(req);
-	struct poly_req *preq = &rctx->u.poly;
-	unsigned int padlen;
-	int err;
-
-	padlen = -rctx->assoclen % POLY1305_BLOCK_SIZE;
-	memset(preq->pad, 0, sizeof(preq->pad));
-	sg_init_one(preq->src, preq->pad, padlen);
-
-	ahash_request_set_callback(&preq->req, rctx->flags,
-				   poly_adpad_done, req);
-	ahash_request_set_tfm(&preq->req, ctx->poly);
-	ahash_request_set_crypt(&preq->req, preq->src, NULL, padlen);
-
-	err = crypto_ahash_update(&preq->req);
-	if (err)
-		return err;
-
-	return poly_cipher(req);
-}
-
-static void poly_ad_done(struct crypto_async_request *areq, int err)
-{
-	async_done_continue(areq->data, err, poly_adpad);
-}
-
-static int poly_ad(struct aead_request *req)
-{
-	struct chachapoly_ctx *ctx = crypto_aead_ctx(crypto_aead_reqtfm(req));
-	struct chachapoly_req_ctx *rctx = aead_request_ctx(req);
-	struct poly_req *preq = &rctx->u.poly;
-	int err;
-
-	ahash_request_set_callback(&preq->req, rctx->flags,
-				   poly_ad_done, req);
-	ahash_request_set_tfm(&preq->req, ctx->poly);
-	ahash_request_set_crypt(&preq->req, req->src, NULL, rctx->assoclen);
-
-	err = crypto_ahash_update(&preq->req);
-	if (err)
-		return err;
-
-	return poly_adpad(req);
-}
-
-static void poly_setkey_done(struct crypto_async_request *areq, int err)
-{
-	async_done_continue(areq->data, err, poly_ad);
-}
-
-static int poly_setkey(struct aead_request *req)
-{
-	struct chachapoly_ctx *ctx = crypto_aead_ctx(crypto_aead_reqtfm(req));
-	struct chachapoly_req_ctx *rctx = aead_request_ctx(req);
-	struct poly_req *preq = &rctx->u.poly;
-	int err;
-
-	sg_init_one(preq->src, rctx->key, sizeof(rctx->key));
-
-	ahash_request_set_callback(&preq->req, rctx->flags,
-				   poly_setkey_done, req);
-	ahash_request_set_tfm(&preq->req, ctx->poly);
-	ahash_request_set_crypt(&preq->req, preq->src, NULL, sizeof(rctx->key));
-
-	err = crypto_ahash_update(&preq->req);
-	if (err)
-		return err;
-
-	return poly_ad(req);
+	return poly_append_tag(req);
 }
 
-static void poly_init_done(struct crypto_async_request *areq, int err)
+static void chacha_decrypt_done(struct crypto_async_request *areq, int err)
 {
-	async_done_continue(areq->data, err, poly_setkey);
+	aead_request_complete(areq->data, err);
 }
 
-static int poly_init(struct aead_request *req)
+static int chachapoly_decrypt(struct aead_request *req)
 {
 	struct chachapoly_ctx *ctx = crypto_aead_ctx(crypto_aead_reqtfm(req));
 	struct chachapoly_req_ctx *rctx = aead_request_ctx(req);
-	struct poly_req *preq = &rctx->u.poly;
-	int err;
-
-	ahash_request_set_callback(&preq->req, rctx->flags,
-				   poly_init_done, req);
-	ahash_request_set_tfm(&preq->req, ctx->poly);
-
-	err = crypto_ahash_init(&preq->req);
-	if (err)
-		return err;
-
-	return poly_setkey(req);
-}
-
-static void poly_genkey_done(struct crypto_async_request *areq, int err)
-{
-	async_done_continue(areq->data, err, poly_init);
-}
-
-static int poly_genkey(struct aead_request *req)
-{
-	struct crypto_aead *tfm = crypto_aead_reqtfm(req);
-	struct chachapoly_ctx *ctx = crypto_aead_ctx(tfm);
-	struct chachapoly_req_ctx *rctx = aead_request_ctx(req);
-	struct chacha_req *creq = &rctx->u.chacha;
+	struct chacha_req *creq = &rctx->chacha;
+	u8 calculated_tag[POLY1305_DIGEST_SIZE];
+	u8 provided_tag[POLY1305_DIGEST_SIZE];
+	struct scatterlist *src, *dst;
 	int err;
 
-	rctx->assoclen = req->assoclen;
-
-	if (crypto_aead_ivsize(tfm) == 8) {
-		if (rctx->assoclen < 8)
-			return -EINVAL;
-		rctx->assoclen -= 8;
-	}
-
-	memset(rctx->key, 0, sizeof(rctx->key));
-	sg_init_one(creq->src, rctx->key, sizeof(rctx->key));
-
-	chacha_iv(creq->iv, req, 0);
-
-	skcipher_request_set_callback(&creq->req, rctx->flags,
-				      poly_genkey_done, req);
-	skcipher_request_set_tfm(&creq->req, ctx->chacha);
-	skcipher_request_set_crypt(&creq->req, creq->src, creq->src,
-				   POLY1305_KEY_SIZE, creq->iv);
+	rctx->cryptlen = req->cryptlen - POLY1305_DIGEST_SIZE;
+	rctx->flags = aead_request_flags(req);
 
-	err = crypto_skcipher_decrypt(&creq->req);
+	err = poly_generate_tag(req, calculated_tag, req->src);
 	if (err)
 		return err;
+	scatterwalk_map_and_copy(provided_tag, req->src,
+				 req->assoclen + rctx->cryptlen,
+				 sizeof(provided_tag), 0);
 
-	return poly_init(req);
-}
-
-static void chacha_encrypt_done(struct crypto_async_request *areq, int err)
-{
-	async_done_continue(areq->data, err, poly_genkey);
-}
-
-static int chacha_encrypt(struct aead_request *req)
-{
-	struct chachapoly_ctx *ctx = crypto_aead_ctx(crypto_aead_reqtfm(req));
-	struct chachapoly_req_ctx *rctx = aead_request_ctx(req);
-	struct chacha_req *creq = &rctx->u.chacha;
-	struct scatterlist *src, *dst;
-	int err;
+	if (crypto_memneq(calculated_tag, provided_tag, sizeof(provided_tag)))
+		return -EBADMSG;
 
-	if (req->cryptlen == 0)
-		goto skip;
+	if (rctx->cryptlen == 0)
+		return 0;
 
 	chacha_iv(creq->iv, req, 1);
 
@@ -417,60 +242,11 @@ static int chacha_encrypt(struct aead_request *req)
 		dst = scatterwalk_ffwd(rctx->dst, req->dst, req->assoclen);
 
 	skcipher_request_set_callback(&creq->req, rctx->flags,
-				      chacha_encrypt_done, req);
+				      chacha_decrypt_done, req);
 	skcipher_request_set_tfm(&creq->req, ctx->chacha);
 	skcipher_request_set_crypt(&creq->req, src, dst,
-				   req->cryptlen, creq->iv);
-	err = crypto_skcipher_encrypt(&creq->req);
-	if (err)
-		return err;
-
-skip:
-	return poly_genkey(req);
-}
-
-static int chachapoly_encrypt(struct aead_request *req)
-{
-	struct chachapoly_req_ctx *rctx = aead_request_ctx(req);
-
-	rctx->cryptlen = req->cryptlen;
-	rctx->flags = aead_request_flags(req);
-
-	/* encrypt call chain:
-	 * - chacha_encrypt/done()
-	 * - poly_genkey/done()
-	 * - poly_init/done()
-	 * - poly_setkey/done()
-	 * - poly_ad/done()
-	 * - poly_adpad/done()
-	 * - poly_cipher/done()
-	 * - poly_cipherpad/done()
-	 * - poly_tail/done/continue()
-	 * - poly_copy_tag()
-	 */
-	return chacha_encrypt(req);
-}
-
-static int chachapoly_decrypt(struct aead_request *req)
-{
-	struct chachapoly_req_ctx *rctx = aead_request_ctx(req);
-
-	rctx->cryptlen = req->cryptlen - POLY1305_DIGEST_SIZE;
-	rctx->flags = aead_request_flags(req);
-
-	/* decrypt call chain:
-	 * - poly_genkey/done()
-	 * - poly_init/done()
-	 * - poly_setkey/done()
-	 * - poly_ad/done()
-	 * - poly_adpad/done()
-	 * - poly_cipher/done()
-	 * - poly_cipherpad/done()
-	 * - poly_tail/done/continue()
-	 * - chacha_decrypt/done()
-	 * - poly_verify_tag()
-	 */
-	return poly_genkey(req);
+				   rctx->cryptlen, creq->iv);
+	return crypto_skcipher_decrypt(&creq->req);
 }
 
 static int chachapoly_setkey(struct crypto_aead *aead, const u8 *key,
@@ -482,6 +258,15 @@ static int chachapoly_setkey(struct crypto_aead *aead, const u8 *key,
 	if (keylen != ctx->saltlen + CHACHA_KEY_SIZE)
 		return -EINVAL;
 
+	ctx->chacha_key[0] = get_unaligned_le32(key);
+	ctx->chacha_key[1] = get_unaligned_le32(key + 4);
+	ctx->chacha_key[2] = get_unaligned_le32(key + 8);
+	ctx->chacha_key[3] = get_unaligned_le32(key + 12);
+	ctx->chacha_key[4] = get_unaligned_le32(key + 16);
+	ctx->chacha_key[5] = get_unaligned_le32(key + 20);
+	ctx->chacha_key[6] = get_unaligned_le32(key + 24);
+	ctx->chacha_key[7] = get_unaligned_le32(key + 28);
+
 	keylen -= ctx->saltlen;
 	memcpy(ctx->salt, key + keylen, ctx->saltlen);
 
@@ -510,16 +295,16 @@ static int chachapoly_init(struct crypto_aead *tfm)
 	struct chachapoly_instance_ctx *ictx = aead_instance_ctx(inst);
 	struct chachapoly_ctx *ctx = crypto_aead_ctx(tfm);
 	struct crypto_skcipher *chacha;
-	struct crypto_ahash *poly;
+	struct crypto_shash *poly;
 	unsigned long align;
 
-	poly = crypto_spawn_ahash(&ictx->poly);
+	poly = crypto_spawn_shash(&ictx->poly);
 	if (IS_ERR(poly))
 		return PTR_ERR(poly);
 
 	chacha = crypto_spawn_skcipher(&ictx->chacha);
 	if (IS_ERR(chacha)) {
-		crypto_free_ahash(poly);
+		crypto_free_shash(poly);
 		return PTR_ERR(chacha);
 	}
 
@@ -531,13 +316,10 @@ static int chachapoly_init(struct crypto_aead *tfm)
 	align &= ~(crypto_tfm_ctx_alignment() - 1);
 	crypto_aead_set_reqsize(
 		tfm,
-		align + offsetof(struct chachapoly_req_ctx, u) +
-		max(offsetof(struct chacha_req, req) +
-		    sizeof(struct skcipher_request) +
-		    crypto_skcipher_reqsize(chacha),
-		    offsetof(struct poly_req, req) +
-		    sizeof(struct ahash_request) +
-		    crypto_ahash_reqsize(poly)));
+		align +
+		offsetof(struct chachapoly_req_ctx, chacha.req) +
+		sizeof(struct skcipher_request) +
+		crypto_skcipher_reqsize(chacha));
 
 	return 0;
 }
@@ -546,7 +328,7 @@ static void chachapoly_exit(struct crypto_aead *tfm)
 {
 	struct chachapoly_ctx *ctx = crypto_aead_ctx(tfm);
 
-	crypto_free_ahash(ctx->poly);
+	crypto_free_shash(ctx->poly);
 	crypto_free_skcipher(ctx->chacha);
 }
 
@@ -555,7 +337,7 @@ static void chachapoly_free(struct aead_instance *inst)
 	struct chachapoly_instance_ctx *ctx = aead_instance_ctx(inst);
 
 	crypto_drop_skcipher(&ctx->chacha);
-	crypto_drop_ahash(&ctx->poly);
+	crypto_drop_shash(&ctx->poly);
 	kfree(inst);
 }
 
@@ -566,9 +348,9 @@ static int chachapoly_create(struct crypto_template *tmpl, struct rtattr **tb,
 	struct aead_instance *inst;
 	struct skcipher_alg *chacha;
 	struct crypto_alg *poly;
-	struct hash_alg_common *poly_hash;
+	struct shash_alg *poly_hash;
 	struct chachapoly_instance_ctx *ctx;
-	const char *chacha_name, *poly_name;
+	const char *chacha_name;
 	int err;
 
 	if (ivsize > CHACHAPOLY_IV_SIZE)
@@ -584,18 +366,10 @@ static int chachapoly_create(struct crypto_template *tmpl, struct rtattr **tb,
 	chacha_name = crypto_attr_alg_name(tb[1]);
 	if (IS_ERR(chacha_name))
 		return PTR_ERR(chacha_name);
-	poly_name = crypto_attr_alg_name(tb[2]);
-	if (IS_ERR(poly_name))
-		return PTR_ERR(poly_name);
-
-	poly = crypto_find_alg(poly_name, &crypto_ahash_type,
-			       CRYPTO_ALG_TYPE_HASH,
-			       CRYPTO_ALG_TYPE_AHASH_MASK |
-			       crypto_requires_sync(algt->type,
-						    algt->mask));
-	if (IS_ERR(poly))
-		return PTR_ERR(poly);
-	poly_hash = __crypto_hash_alg_common(poly);
+	poly_hash = shash_attr_alg(tb[2], 0, 0);
+	if (IS_ERR(poly_hash))
+		return PTR_ERR(poly_hash);
+	poly = &poly_hash->base;
 
 	err = -EINVAL;
 	if (poly_hash->digestsize != POLY1305_DIGEST_SIZE)
@@ -608,7 +382,7 @@ static int chachapoly_create(struct crypto_template *tmpl, struct rtattr **tb,
 
 	ctx = aead_instance_ctx(inst);
 	ctx->saltlen = CHACHAPOLY_IV_SIZE - ivsize;
-	err = crypto_init_ahash_spawn(&ctx->poly, poly_hash,
+	err = crypto_init_shash_spawn(&ctx->poly, poly_hash,
 				      aead_crypto_instance(inst));
 	if (err)
 		goto err_free_inst;
@@ -630,10 +404,13 @@ static int chachapoly_create(struct crypto_template *tmpl, struct rtattr **tb,
 	if (chacha->base.cra_blocksize != 1)
 		goto out_drop_chacha;
 
+	if (strcmp(chacha->base.cra_name, "chacha20") ||
+	    strcmp(poly->cra_name, "poly1305"))
+		goto out_drop_chacha;
+
 	err = -ENAMETOOLONG;
 	if (snprintf(inst->alg.base.cra_name, CRYPTO_MAX_ALG_NAME,
-		     "%s(%s,%s)", name, chacha->base.cra_name,
-		     poly->cra_name) >= CRYPTO_MAX_ALG_NAME)
+		     "%s(chacha20,poly1305)", name) >= CRYPTO_MAX_ALG_NAME)
 		goto out_drop_chacha;
 	if (snprintf(inst->alg.base.cra_driver_name, CRYPTO_MAX_ALG_NAME,
 		     "%s(%s,%s)", name, chacha->base.cra_driver_name,
@@ -672,7 +449,7 @@ static int chachapoly_create(struct crypto_template *tmpl, struct rtattr **tb,
 out_drop_chacha:
 	crypto_drop_skcipher(&ctx->chacha);
 err_drop_poly:
-	crypto_drop_ahash(&ctx->poly);
+	crypto_drop_shash(&ctx->poly);
 err_free_inst:
 	kfree(inst);
 	goto out_put_poly;
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [RFC PATCH 07/18] crypto: rfc7539 - use zero reqsize for sync instantiations without alignmask
  2019-09-25 16:12 [RFC PATCH 00/18] crypto: wireguard using the existing crypto API Ard Biesheuvel
                   ` (5 preceding siblings ...)
  2019-09-25 16:12 ` [RFC PATCH 06/18] crypto: rfc7539 - switch to shash for Poly1305 Ard Biesheuvel
@ 2019-09-25 16:12 ` Ard Biesheuvel
  2019-09-25 16:12 ` [RFC PATCH 08/18] crypto: testmgr - add a chacha20poly1305 test case Ard Biesheuvel
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 61+ messages in thread
From: Ard Biesheuvel @ 2019-09-25 16:12 UTC (permalink / raw)
  To: linux-crypto
  Cc: Jason A . Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Ard Biesheuvel, Greg KH, Eric Biggers, Samuel Neves, Will Deacon,
	Dan Carpenter, Andy Lutomirski, Marc Zyngier, Linus Torvalds,
	David Miller, linux-arm-kernel

Now that we have moved all the scratch buffers that must be allocated
on the heap out of the request context, we can move the request context
itself to the stack if we are instantiating a synchronous version of
the chacha20poly1305 transformation. This allows users of the AEAD to
allocate the request structure on the stack, removing the need for
per-packet heap allocations on the en/decryption hot path.
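
As a rough sketch of what this enables (the helper and its parameters are
illustrative, not part of this patch), a user holding a synchronous tfm
with a zero reqsize can keep the whole request on its own stack:

#include <crypto/aead.h>
#include <linux/scatterlist.h>

/* Only valid when crypto_aead_reqsize(tfm) == 0, i.e., for a sync tfm
 * without an alignmask; otherwise reqsize bytes of context must still
 * follow the request structure.
 */
static int encrypt_on_stack(struct crypto_aead *tfm, struct scatterlist *sg,
			    unsigned int cryptlen, unsigned int assoclen,
			    u8 *iv)
{
	u8 reqmem[sizeof(struct aead_request)] CRYPTO_MINALIGN_ATTR;
	struct aead_request *req = (void *)reqmem;

	aead_request_set_tfm(req, tfm);
	aead_request_set_callback(req, 0, NULL, NULL);
	aead_request_set_ad(req, assoclen);
	aead_request_set_crypt(req, sg, sg, cryptlen, iv);

	return crypto_aead_encrypt(req);
}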

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 crypto/chacha20poly1305.c | 51 ++++++++++++--------
 1 file changed, 32 insertions(+), 19 deletions(-)

diff --git a/crypto/chacha20poly1305.c b/crypto/chacha20poly1305.c
index 71496a8107f5..d171a0c9e837 100644
--- a/crypto/chacha20poly1305.c
+++ b/crypto/chacha20poly1305.c
@@ -49,13 +49,14 @@ struct chachapoly_req_ctx {
 };
 
 static inline void async_done_continue(struct aead_request *req, int err,
-				       int (*cont)(struct aead_request *))
+				       int (*cont)(struct aead_request *,
+						   struct chachapoly_req_ctx *))
 {
 	if (!err) {
 		struct chachapoly_req_ctx *rctx = aead_request_ctx(req);
 
 		rctx->flags &= ~CRYPTO_TFM_REQ_MAY_SLEEP;
-		err = cont(req);
+		err = cont(req, rctx);
 	}
 
 	if (err != -EINPROGRESS && err != -EBUSY)
@@ -74,11 +75,11 @@ static void chacha_iv(u8 *iv, struct aead_request *req, u32 icb)
 }
 
 static int poly_generate_tag(struct aead_request *req, u8 *poly_tag,
-			     struct scatterlist *crypt)
+			     struct scatterlist *crypt,
+			     struct chachapoly_req_ctx *rctx)
 {
 	struct crypto_aead *tfm = crypto_aead_reqtfm(req);
 	struct chachapoly_ctx *ctx = crypto_aead_ctx(tfm);
-	struct chachapoly_req_ctx *rctx = aead_request_ctx(req);
 	u32 chacha_state[CHACHA_BLOCK_SIZE / sizeof(u32)];
 	SHASH_DESC_ON_STACK(desc, ctx->poly);
 	u8 poly_key[POLY1305_KEY_SIZE];
@@ -148,13 +149,13 @@ static int poly_generate_tag(struct aead_request *req, u8 *poly_tag,
 	return 0;
 }
 
-static int poly_append_tag(struct aead_request *req)
+static int poly_append_tag(struct aead_request *req,
+			   struct chachapoly_req_ctx *rctx)
 {
-	struct chachapoly_req_ctx *rctx = aead_request_ctx(req);
 	u8 poly_tag[POLY1305_DIGEST_SIZE];
 	int err;
 
-	err = poly_generate_tag(req, poly_tag, req->dst);
+	err = poly_generate_tag(req, poly_tag, req->dst, rctx);
 	if (err)
 		return err;
 
@@ -171,12 +172,17 @@ static void chacha_encrypt_done(struct crypto_async_request *areq, int err)
 
 static int chachapoly_encrypt(struct aead_request *req)
 {
-	struct chachapoly_ctx *ctx = crypto_aead_ctx(crypto_aead_reqtfm(req));
-	struct chachapoly_req_ctx *rctx = aead_request_ctx(req);
+	struct chachapoly_req_ctx stack_rctx CRYPTO_MINALIGN_ATTR;
+	struct crypto_aead *tfm = crypto_aead_reqtfm(req);
+	struct chachapoly_ctx *ctx = crypto_aead_ctx(tfm);
+	struct chachapoly_req_ctx *rctx = &stack_rctx;
 	struct chacha_req *creq = &rctx->chacha;
 	struct scatterlist *src, *dst;
 	int err;
 
+	if (unlikely(crypto_aead_reqsize(tfm) > 0))
+		rctx = aead_request_ctx(req);
+
 	rctx->cryptlen = req->cryptlen;
 	rctx->flags = aead_request_flags(req);
 
@@ -200,7 +206,7 @@ static int chachapoly_encrypt(struct aead_request *req)
 		return err;
 
 skip:
-	return poly_append_tag(req);
+	return poly_append_tag(req, rctx);
 }
 
 static void chacha_decrypt_done(struct crypto_async_request *areq, int err)
@@ -210,18 +216,23 @@ static void chacha_decrypt_done(struct crypto_async_request *areq, int err)
 
 static int chachapoly_decrypt(struct aead_request *req)
 {
-	struct chachapoly_ctx *ctx = crypto_aead_ctx(crypto_aead_reqtfm(req));
-	struct chachapoly_req_ctx *rctx = aead_request_ctx(req);
+	struct chachapoly_req_ctx stack_rctx CRYPTO_MINALIGN_ATTR;
+	struct crypto_aead *tfm = crypto_aead_reqtfm(req);
+	struct chachapoly_ctx *ctx = crypto_aead_ctx(tfm);
+	struct chachapoly_req_ctx *rctx = &stack_rctx;
 	struct chacha_req *creq = &rctx->chacha;
 	u8 calculated_tag[POLY1305_DIGEST_SIZE];
 	u8 provided_tag[POLY1305_DIGEST_SIZE];
 	struct scatterlist *src, *dst;
 	int err;
 
+	if (unlikely(crypto_aead_reqsize(tfm) > 0))
+		rctx = aead_request_ctx(req);
+
 	rctx->cryptlen = req->cryptlen - POLY1305_DIGEST_SIZE;
 	rctx->flags = aead_request_flags(req);
 
-	err = poly_generate_tag(req, calculated_tag, req->src);
+	err = poly_generate_tag(req, calculated_tag, req->src, rctx);
 	if (err)
 		return err;
 	scatterwalk_map_and_copy(provided_tag, req->src,
@@ -314,12 +325,14 @@ static int chachapoly_init(struct crypto_aead *tfm)
 
 	align = crypto_aead_alignmask(tfm);
 	align &= ~(crypto_tfm_ctx_alignment() - 1);
-	crypto_aead_set_reqsize(
-		tfm,
-		align +
-		offsetof(struct chachapoly_req_ctx, chacha.req) +
-		sizeof(struct skcipher_request) +
-		crypto_skcipher_reqsize(chacha));
+	if (crypto_aead_alignmask(tfm) > 0 ||
+	    (crypto_aead_get_flags(tfm) & CRYPTO_ALG_ASYNC))
+		crypto_aead_set_reqsize(
+			tfm,
+			align +
+			offsetof(struct chachapoly_req_ctx, chacha.req) +
+			sizeof(struct skcipher_request) +
+			crypto_skcipher_reqsize(chacha));
 
 	return 0;
 }
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [RFC PATCH 08/18] crypto: testmgr - add a chacha20poly1305 test case
  2019-09-25 16:12 [RFC PATCH 00/18] crypto: wireguard using the existing crypto API Ard Biesheuvel
                   ` (6 preceding siblings ...)
  2019-09-25 16:12 ` [RFC PATCH 07/18] crypto: rfc7539 - use zero reqsize for sync instantiations without alignmask Ard Biesheuvel
@ 2019-09-25 16:12 ` Ard Biesheuvel
  2019-09-25 16:12 ` [RFC PATCH 09/18] crypto: poly1305 - move core algorithm into lib/crypto Ard Biesheuvel
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 61+ messages in thread
From: Ard Biesheuvel @ 2019-09-25 16:12 UTC (permalink / raw)
  To: linux-crypto
  Cc: Jason A . Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Ard Biesheuvel, Greg KH, Eric Biggers, Samuel Neves, Will Deacon,
	Dan Carpenter, Andy Lutomirski, Marc Zyngier, Linus Torvalds,
	David Miller, linux-arm-kernel

Add a test case to the RFC7539 (non-ESP) test vector array that
exercises the newly added code path that may optimize away one
invocation of the shash when the assoclen is a multiple of the
Poly1305 block size.
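
For reference, the zero padding that RFC 7539 applies to the associated
data is computed as below (illustrative helper, not from the patch); it is
empty, and one shash invocation can be skipped, exactly when assoclen is a
multiple of the 16-byte Poly1305 block size, as with alen = 16 here:

#include <crypto/poly1305.h>

/* Length of the zero pad hashed after the AAD; 0 when assoclen is
 * already block aligned (e.g. 16), 16 - (assoclen % 16) otherwise.
 */
static unsigned int rfc7539_ad_padlen(unsigned int assoclen)
{
	return -assoclen & (POLY1305_BLOCK_SIZE - 1);
}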

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 crypto/testmgr.h | 45 ++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index ef7d21f39d4a..5439b37f2b9f 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -18950,6 +18950,51 @@ static const struct aead_testvec rfc7539_tv_template[] = {
 			  "\x22\x39\x23\x36\xfe\xa1\x85\x1f"
 			  "\x38",
 		.clen	= 281,
+	}, {
+		.key	= "\x80\x81\x82\x83\x84\x85\x86\x87"
+			  "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+			  "\x90\x91\x92\x93\x94\x95\x96\x97"
+			  "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f",
+		.klen	= 32,
+		.iv	= "\x07\x00\x00\x00\x40\x41\x42\x43"
+			  "\x44\x45\x46\x47",
+		.assoc	= "\x50\x51\x52\x53\xc0\xc1\xc2\xc3"
+			  "\xc4\xc5\xc6\xc7\x44\x45\x46\x47",
+		.alen	= 16,
+		.ptext	= "\x4c\x61\x64\x69\x65\x73\x20\x61"
+			  "\x6e\x64\x20\x47\x65\x6e\x74\x6c"
+			  "\x65\x6d\x65\x6e\x20\x6f\x66\x20"
+			  "\x74\x68\x65\x20\x63\x6c\x61\x73"
+			  "\x73\x20\x6f\x66\x20\x27\x39\x39"
+			  "\x3a\x20\x49\x66\x20\x49\x20\x63"
+			  "\x6f\x75\x6c\x64\x20\x6f\x66\x66"
+			  "\x65\x72\x20\x79\x6f\x75\x20\x6f"
+			  "\x6e\x6c\x79\x20\x6f\x6e\x65\x20"
+			  "\x74\x69\x70\x20\x66\x6f\x72\x20"
+			  "\x74\x68\x65\x20\x66\x75\x74\x75"
+			  "\x72\x65\x2c\x20\x73\x75\x6e\x73"
+			  "\x63\x72\x65\x65\x6e\x20\x77\x6f"
+			  "\x75\x6c\x64\x20\x62\x65\x20\x69"
+			  "\x74\x2e",
+		.plen	= 114,
+		.ctext	= "\xd3\x1a\x8d\x34\x64\x8e\x60\xdb"
+			  "\x7b\x86\xaf\xbc\x53\xef\x7e\xc2"
+			  "\xa4\xad\xed\x51\x29\x6e\x08\xfe"
+			  "\xa9\xe2\xb5\xa7\x36\xee\x62\xd6"
+			  "\x3d\xbe\xa4\x5e\x8c\xa9\x67\x12"
+			  "\x82\xfa\xfb\x69\xda\x92\x72\x8b"
+			  "\x1a\x71\xde\x0a\x9e\x06\x0b\x29"
+			  "\x05\xd6\xa5\xb6\x7e\xcd\x3b\x36"
+			  "\x92\xdd\xbd\x7f\x2d\x77\x8b\x8c"
+			  "\x98\x03\xae\xe3\x28\x09\x1b\x58"
+			  "\xfa\xb3\x24\xe4\xfa\xd6\x75\x94"
+			  "\x55\x85\x80\x8b\x48\x31\xd7\xbc"
+			  "\x3f\xf4\xde\xf0\x8e\x4b\x7a\x9d"
+			  "\xe5\x76\xd2\x65\x86\xce\xc6\x4b"
+			  "\x61\x16\xb3\xb8\x82\x76\x1f\x39"
+			  "\x35\x6f\x26\x8d\x28\x0f\xac\x45"
+			  "\x02\x5d",
+		.clen	= 130,
 	},
 };
 
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [RFC PATCH 09/18] crypto: poly1305 - move core algorithm into lib/crypto
  2019-09-25 16:12 [RFC PATCH 00/18] crypto: wireguard using the existing crypto API Ard Biesheuvel
                   ` (7 preceding siblings ...)
  2019-09-25 16:12 ` [RFC PATCH 08/18] crypto: testmgr - add a chacha20poly1305 test case Ard Biesheuvel
@ 2019-09-25 16:12 ` Ard Biesheuvel
  2019-09-25 16:12 ` [RFC PATCH 10/18] crypto: poly1305 - add init/update/final library routines Ard Biesheuvel
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 61+ messages in thread
From: Ard Biesheuvel @ 2019-09-25 16:12 UTC (permalink / raw)
  To: linux-crypto
  Cc: Jason A . Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Ard Biesheuvel, Greg KH, Eric Biggers, Samuel Neves, Will Deacon,
	Dan Carpenter, Andy Lutomirski, Marc Zyngier, Linus Torvalds,
	David Miller, linux-arm-kernel

Move the core Poly1305 transformation into a separate library in
lib/crypto so it can be used by other subsystems without going
through the entire crypto API.
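
A minimal sketch of such a direct user (illustrative only: it MACs a whole
number of blocks and omits the final addition of the encrypted nonce),
built on nothing but the routines exported by this patch:

#include <crypto/poly1305.h>

static void poly1305_core_mac(const u8 *raw_r, const void *src,
			      unsigned int nblocks, u8 digest[16])
{
	struct poly1305_key r;
	struct poly1305_state h;

	poly1305_core_setkey(&r, raw_r);	/* first 16 bytes of the key */
	poly1305_core_init(&h);
	poly1305_core_blocks(&h, &r, src, nblocks, 1);
	poly1305_core_emit(&h, digest);
}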

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/x86/crypto/poly1305_glue.c    |   2 +-
 crypto/Kconfig                     |   4 +
 crypto/adiantum.c                  |   4 +-
 crypto/nhpoly1305.c                |   2 +-
 crypto/poly1305_generic.c          | 167 +-------------------
 include/crypto/internal/poly1305.h |  37 +++++
 include/crypto/poly1305.h          |  32 +---
 lib/crypto/Makefile                |   3 +
 lib/crypto/poly1305.c              | 158 ++++++++++++++++++
 9 files changed, 217 insertions(+), 192 deletions(-)

diff --git a/arch/x86/crypto/poly1305_glue.c b/arch/x86/crypto/poly1305_glue.c
index f2afaa8e23c2..9291cfea799d 100644
--- a/arch/x86/crypto/poly1305_glue.c
+++ b/arch/x86/crypto/poly1305_glue.c
@@ -7,8 +7,8 @@
 
 #include <crypto/algapi.h>
 #include <crypto/internal/hash.h>
+#include <crypto/internal/poly1305.h>
 #include <crypto/internal/simd.h>
-#include <crypto/poly1305.h>
 #include <linux/crypto.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
diff --git a/crypto/Kconfig b/crypto/Kconfig
index 25a06cf49a5d..45589110bf80 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -654,9 +654,13 @@ config CRYPTO_GHASH
 	  GHASH is the hash function used in GCM (Galois/Counter Mode).
 	  It is not a general-purpose cryptographic hash function.
 
+config CRYPTO_LIB_POLY1305
+	tristate "Poly1305 authenticator library"
+
 config CRYPTO_POLY1305
 	tristate "Poly1305 authenticator algorithm"
 	select CRYPTO_HASH
+	select CRYPTO_LIB_POLY1305
 	help
 	  Poly1305 authenticator algorithm, RFC7539.
 
diff --git a/crypto/adiantum.c b/crypto/adiantum.c
index 395a3ddd3707..775ed1418e2b 100644
--- a/crypto/adiantum.c
+++ b/crypto/adiantum.c
@@ -242,11 +242,11 @@ static void adiantum_hash_header(struct skcipher_request *req)
 
 	BUILD_BUG_ON(sizeof(header) % POLY1305_BLOCK_SIZE != 0);
 	poly1305_core_blocks(&state, &tctx->header_hash_key,
-			     &header, sizeof(header) / POLY1305_BLOCK_SIZE);
+			     &header, sizeof(header) / POLY1305_BLOCK_SIZE, 1);
 
 	BUILD_BUG_ON(TWEAK_SIZE % POLY1305_BLOCK_SIZE != 0);
 	poly1305_core_blocks(&state, &tctx->header_hash_key, req->iv,
-			     TWEAK_SIZE / POLY1305_BLOCK_SIZE);
+			     TWEAK_SIZE / POLY1305_BLOCK_SIZE, 1);
 
 	poly1305_core_emit(&state, &rctx->header_hash);
 }
diff --git a/crypto/nhpoly1305.c b/crypto/nhpoly1305.c
index 9ab4e07cde4d..b88a6a71e3e2 100644
--- a/crypto/nhpoly1305.c
+++ b/crypto/nhpoly1305.c
@@ -78,7 +78,7 @@ static void process_nh_hash_value(struct nhpoly1305_state *state,
 	BUILD_BUG_ON(NH_HASH_BYTES % POLY1305_BLOCK_SIZE != 0);
 
 	poly1305_core_blocks(&state->poly_state, &key->poly_key, state->nh_hash,
-			     NH_HASH_BYTES / POLY1305_BLOCK_SIZE);
+			     NH_HASH_BYTES / POLY1305_BLOCK_SIZE, 1);
 }
 
 /*
diff --git a/crypto/poly1305_generic.c b/crypto/poly1305_generic.c
index adc40298c749..46a2da1abac7 100644
--- a/crypto/poly1305_generic.c
+++ b/crypto/poly1305_generic.c
@@ -13,27 +13,12 @@
 
 #include <crypto/algapi.h>
 #include <crypto/internal/hash.h>
-#include <crypto/poly1305.h>
+#include <crypto/internal/poly1305.h>
 #include <linux/crypto.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <asm/unaligned.h>
 
-static inline u64 mlt(u64 a, u64 b)
-{
-	return a * b;
-}
-
-static inline u32 sr(u64 v, u_char n)
-{
-	return v >> n;
-}
-
-static inline u32 and(u32 v, u32 mask)
-{
-	return v & mask;
-}
-
 int crypto_poly1305_init(struct shash_desc *desc)
 {
 	struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
@@ -47,17 +32,6 @@ int crypto_poly1305_init(struct shash_desc *desc)
 }
 EXPORT_SYMBOL_GPL(crypto_poly1305_init);
 
-void poly1305_core_setkey(struct poly1305_key *key, const u8 *raw_key)
-{
-	/* r &= 0xffffffc0ffffffc0ffffffc0fffffff */
-	key->r[0] = (get_unaligned_le32(raw_key +  0) >> 0) & 0x3ffffff;
-	key->r[1] = (get_unaligned_le32(raw_key +  3) >> 2) & 0x3ffff03;
-	key->r[2] = (get_unaligned_le32(raw_key +  6) >> 4) & 0x3ffc0ff;
-	key->r[3] = (get_unaligned_le32(raw_key +  9) >> 6) & 0x3f03fff;
-	key->r[4] = (get_unaligned_le32(raw_key + 12) >> 8) & 0x00fffff;
-}
-EXPORT_SYMBOL_GPL(poly1305_core_setkey);
-
 /*
  * Poly1305 requires a unique key for each tag, which implies that we can't set
  * it on the tfm that gets accessed by multiple users simultaneously. Instead we
@@ -87,84 +61,8 @@ unsigned int crypto_poly1305_setdesckey(struct poly1305_desc_ctx *dctx,
 }
 EXPORT_SYMBOL_GPL(crypto_poly1305_setdesckey);
 
-static void poly1305_blocks_internal(struct poly1305_state *state,
-				     const struct poly1305_key *key,
-				     const void *src, unsigned int nblocks,
-				     u32 hibit)
-{
-	u32 r0, r1, r2, r3, r4;
-	u32 s1, s2, s3, s4;
-	u32 h0, h1, h2, h3, h4;
-	u64 d0, d1, d2, d3, d4;
-
-	if (!nblocks)
-		return;
-
-	r0 = key->r[0];
-	r1 = key->r[1];
-	r2 = key->r[2];
-	r3 = key->r[3];
-	r4 = key->r[4];
-
-	s1 = r1 * 5;
-	s2 = r2 * 5;
-	s3 = r3 * 5;
-	s4 = r4 * 5;
-
-	h0 = state->h[0];
-	h1 = state->h[1];
-	h2 = state->h[2];
-	h3 = state->h[3];
-	h4 = state->h[4];
-
-	do {
-		/* h += m[i] */
-		h0 += (get_unaligned_le32(src +  0) >> 0) & 0x3ffffff;
-		h1 += (get_unaligned_le32(src +  3) >> 2) & 0x3ffffff;
-		h2 += (get_unaligned_le32(src +  6) >> 4) & 0x3ffffff;
-		h3 += (get_unaligned_le32(src +  9) >> 6) & 0x3ffffff;
-		h4 += (get_unaligned_le32(src + 12) >> 8) | hibit;
-
-		/* h *= r */
-		d0 = mlt(h0, r0) + mlt(h1, s4) + mlt(h2, s3) +
-		     mlt(h3, s2) + mlt(h4, s1);
-		d1 = mlt(h0, r1) + mlt(h1, r0) + mlt(h2, s4) +
-		     mlt(h3, s3) + mlt(h4, s2);
-		d2 = mlt(h0, r2) + mlt(h1, r1) + mlt(h2, r0) +
-		     mlt(h3, s4) + mlt(h4, s3);
-		d3 = mlt(h0, r3) + mlt(h1, r2) + mlt(h2, r1) +
-		     mlt(h3, r0) + mlt(h4, s4);
-		d4 = mlt(h0, r4) + mlt(h1, r3) + mlt(h2, r2) +
-		     mlt(h3, r1) + mlt(h4, r0);
-
-		/* (partial) h %= p */
-		d1 += sr(d0, 26);     h0 = and(d0, 0x3ffffff);
-		d2 += sr(d1, 26);     h1 = and(d1, 0x3ffffff);
-		d3 += sr(d2, 26);     h2 = and(d2, 0x3ffffff);
-		d4 += sr(d3, 26);     h3 = and(d3, 0x3ffffff);
-		h0 += sr(d4, 26) * 5; h4 = and(d4, 0x3ffffff);
-		h1 += h0 >> 26;       h0 = h0 & 0x3ffffff;
-
-		src += POLY1305_BLOCK_SIZE;
-	} while (--nblocks);
-
-	state->h[0] = h0;
-	state->h[1] = h1;
-	state->h[2] = h2;
-	state->h[3] = h3;
-	state->h[4] = h4;
-}
-
-void poly1305_core_blocks(struct poly1305_state *state,
-			  const struct poly1305_key *key,
-			  const void *src, unsigned int nblocks)
-{
-	poly1305_blocks_internal(state, key, src, nblocks, 1 << 24);
-}
-EXPORT_SYMBOL_GPL(poly1305_core_blocks);
-
-static void poly1305_blocks(struct poly1305_desc_ctx *dctx,
-			    const u8 *src, unsigned int srclen, u32 hibit)
+static void poly1305_blocks(struct poly1305_desc_ctx *dctx, const u8 *src,
+			    unsigned int srclen)
 {
 	unsigned int datalen;
 
@@ -174,8 +72,8 @@ static void poly1305_blocks(struct poly1305_desc_ctx *dctx,
 		srclen = datalen;
 	}
 
-	poly1305_blocks_internal(&dctx->h, &dctx->r,
-				 src, srclen / POLY1305_BLOCK_SIZE, hibit);
+	poly1305_core_blocks(&dctx->h, &dctx->r, src,
+			     srclen / POLY1305_BLOCK_SIZE, 1);
 }
 
 int crypto_poly1305_update(struct shash_desc *desc,
@@ -192,14 +90,13 @@ int crypto_poly1305_update(struct shash_desc *desc,
 		dctx->buflen += bytes;
 
 		if (dctx->buflen == POLY1305_BLOCK_SIZE) {
-			poly1305_blocks(dctx, dctx->buf,
-					POLY1305_BLOCK_SIZE, 1 << 24);
+			poly1305_blocks(dctx, dctx->buf, POLY1305_BLOCK_SIZE);
 			dctx->buflen = 0;
 		}
 	}
 
 	if (likely(srclen >= POLY1305_BLOCK_SIZE)) {
-		poly1305_blocks(dctx, src, srclen, 1 << 24);
+		poly1305_blocks(dctx, src, srclen);
 		src += srclen - (srclen % POLY1305_BLOCK_SIZE);
 		srclen %= POLY1305_BLOCK_SIZE;
 	}
@@ -213,54 +110,6 @@ int crypto_poly1305_update(struct shash_desc *desc,
 }
 EXPORT_SYMBOL_GPL(crypto_poly1305_update);
 
-void poly1305_core_emit(const struct poly1305_state *state, void *dst)
-{
-	u32 h0, h1, h2, h3, h4;
-	u32 g0, g1, g2, g3, g4;
-	u32 mask;
-
-	/* fully carry h */
-	h0 = state->h[0];
-	h1 = state->h[1];
-	h2 = state->h[2];
-	h3 = state->h[3];
-	h4 = state->h[4];
-
-	h2 += (h1 >> 26);     h1 = h1 & 0x3ffffff;
-	h3 += (h2 >> 26);     h2 = h2 & 0x3ffffff;
-	h4 += (h3 >> 26);     h3 = h3 & 0x3ffffff;
-	h0 += (h4 >> 26) * 5; h4 = h4 & 0x3ffffff;
-	h1 += (h0 >> 26);     h0 = h0 & 0x3ffffff;
-
-	/* compute h + -p */
-	g0 = h0 + 5;
-	g1 = h1 + (g0 >> 26);             g0 &= 0x3ffffff;
-	g2 = h2 + (g1 >> 26);             g1 &= 0x3ffffff;
-	g3 = h3 + (g2 >> 26);             g2 &= 0x3ffffff;
-	g4 = h4 + (g3 >> 26) - (1 << 26); g3 &= 0x3ffffff;
-
-	/* select h if h < p, or h + -p if h >= p */
-	mask = (g4 >> ((sizeof(u32) * 8) - 1)) - 1;
-	g0 &= mask;
-	g1 &= mask;
-	g2 &= mask;
-	g3 &= mask;
-	g4 &= mask;
-	mask = ~mask;
-	h0 = (h0 & mask) | g0;
-	h1 = (h1 & mask) | g1;
-	h2 = (h2 & mask) | g2;
-	h3 = (h3 & mask) | g3;
-	h4 = (h4 & mask) | g4;
-
-	/* h = h % (2^128) */
-	put_unaligned_le32((h0 >>  0) | (h1 << 26), dst +  0);
-	put_unaligned_le32((h1 >>  6) | (h2 << 20), dst +  4);
-	put_unaligned_le32((h2 >> 12) | (h3 << 14), dst +  8);
-	put_unaligned_le32((h3 >> 18) | (h4 <<  8), dst + 12);
-}
-EXPORT_SYMBOL_GPL(poly1305_core_emit);
-
 int crypto_poly1305_final(struct shash_desc *desc, u8 *dst)
 {
 	struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
@@ -274,7 +123,7 @@ int crypto_poly1305_final(struct shash_desc *desc, u8 *dst)
 		dctx->buf[dctx->buflen++] = 1;
 		memset(dctx->buf + dctx->buflen, 0,
 		       POLY1305_BLOCK_SIZE - dctx->buflen);
-		poly1305_blocks(dctx, dctx->buf, POLY1305_BLOCK_SIZE, 0);
+		poly1305_core_blocks(&dctx->h, &dctx->r, dctx->buf, 1, 0);
 	}
 
 	poly1305_core_emit(&dctx->h, digest);
diff --git a/include/crypto/internal/poly1305.h b/include/crypto/internal/poly1305.h
new file mode 100644
index 000000000000..ae0466c4ce59
--- /dev/null
+++ b/include/crypto/internal/poly1305.h
@@ -0,0 +1,37 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Common values for the Poly1305 algorithm
+ */
+
+#ifndef _CRYPTO_INTERNAL_POLY1305_H
+#define _CRYPTO_INTERNAL_POLY1305_H
+
+#include <linux/types.h>
+#include <crypto/poly1305.h>
+
+struct poly1305_desc_ctx {
+	/* key */
+	struct poly1305_key r;
+	/* finalize key */
+	u32 s[4];
+	/* accumulator */
+	struct poly1305_state h;
+	/* partial buffer */
+	u8 buf[POLY1305_BLOCK_SIZE];
+	/* bytes used in partial buffer */
+	unsigned int buflen;
+	/* r key has been set */
+	bool rset;
+	/* s key has been set */
+	bool sset;
+};
+
+/* Crypto API helper functions for the Poly1305 MAC */
+int crypto_poly1305_init(struct shash_desc *desc);
+unsigned int crypto_poly1305_setdesckey(struct poly1305_desc_ctx *dctx,
+					const u8 *src, unsigned int srclen);
+int crypto_poly1305_update(struct shash_desc *desc,
+			   const u8 *src, unsigned int srclen);
+int crypto_poly1305_final(struct shash_desc *desc, u8 *dst);
+
+#endif
diff --git a/include/crypto/poly1305.h b/include/crypto/poly1305.h
index 34317ed2071e..83e4b4c69a5a 100644
--- a/include/crypto/poly1305.h
+++ b/include/crypto/poly1305.h
@@ -21,23 +21,6 @@ struct poly1305_state {
 	u32 h[5];	/* accumulator, base 2^26 */
 };
 
-struct poly1305_desc_ctx {
-	/* key */
-	struct poly1305_key r;
-	/* finalize key */
-	u32 s[4];
-	/* accumulator */
-	struct poly1305_state h;
-	/* partial buffer */
-	u8 buf[POLY1305_BLOCK_SIZE];
-	/* bytes used in partial buffer */
-	unsigned int buflen;
-	/* r key has been set */
-	bool rset;
-	/* s key has been set */
-	bool sset;
-};
-
 /*
  * Poly1305 core functions.  These implement the ε-almost-∆-universal hash
  * function underlying the Poly1305 MAC, i.e. they don't add an encrypted nonce
@@ -46,19 +29,10 @@ struct poly1305_desc_ctx {
 void poly1305_core_setkey(struct poly1305_key *key, const u8 *raw_key);
 static inline void poly1305_core_init(struct poly1305_state *state)
 {
-	memset(state->h, 0, sizeof(state->h));
+	*state = (struct poly1305_state){};
 }
 void poly1305_core_blocks(struct poly1305_state *state,
-			  const struct poly1305_key *key,
-			  const void *src, unsigned int nblocks);
+			  const struct poly1305_key *key, const void *src,
+			  unsigned int nblocks, u32 hibit);
 void poly1305_core_emit(const struct poly1305_state *state, void *dst);
-
-/* Crypto API helper functions for the Poly1305 MAC */
-int crypto_poly1305_init(struct shash_desc *desc);
-unsigned int crypto_poly1305_setdesckey(struct poly1305_desc_ctx *dctx,
-					const u8 *src, unsigned int srclen);
-int crypto_poly1305_update(struct shash_desc *desc,
-			   const u8 *src, unsigned int srclen);
-int crypto_poly1305_final(struct shash_desc *desc, u8 *dst);
-
 #endif
diff --git a/lib/crypto/Makefile b/lib/crypto/Makefile
index 24dad058f2ae..6bf8a0a4ee0e 100644
--- a/lib/crypto/Makefile
+++ b/lib/crypto/Makefile
@@ -12,5 +12,8 @@ libarc4-y					:= arc4.o
 obj-$(CONFIG_CRYPTO_LIB_DES)			+= libdes.o
 libdes-y					:= des.o
 
+obj-$(CONFIG_CRYPTO_LIB_POLY1305)		+= libpoly1305.o
+libpoly1305-y					:= poly1305.o
+
 obj-$(CONFIG_CRYPTO_LIB_SHA256)			+= libsha256.o
 libsha256-y					:= sha256.o
diff --git a/lib/crypto/poly1305.c b/lib/crypto/poly1305.c
new file mode 100644
index 000000000000..abe6fccf7b9c
--- /dev/null
+++ b/lib/crypto/poly1305.c
@@ -0,0 +1,158 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Poly1305 authenticator algorithm, RFC7539
+ *
+ * Copyright (C) 2015 Martin Willi
+ *
+ * Based on public domain code by Andrew Moon and Daniel J. Bernstein.
+ */
+
+#include <crypto/poly1305.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <asm/unaligned.h>
+
+static inline u64 mlt(u64 a, u64 b)
+{
+	return a * b;
+}
+
+static inline u32 sr(u64 v, u_char n)
+{
+	return v >> n;
+}
+
+static inline u32 and(u32 v, u32 mask)
+{
+	return v & mask;
+}
+
+void poly1305_core_setkey(struct poly1305_key *key, const u8 *raw_key)
+{
+	/* r &= 0xffffffc0ffffffc0ffffffc0fffffff */
+	key->r[0] = (get_unaligned_le32(raw_key +  0) >> 0) & 0x3ffffff;
+	key->r[1] = (get_unaligned_le32(raw_key +  3) >> 2) & 0x3ffff03;
+	key->r[2] = (get_unaligned_le32(raw_key +  6) >> 4) & 0x3ffc0ff;
+	key->r[3] = (get_unaligned_le32(raw_key +  9) >> 6) & 0x3f03fff;
+	key->r[4] = (get_unaligned_le32(raw_key + 12) >> 8) & 0x00fffff;
+}
+EXPORT_SYMBOL_GPL(poly1305_core_setkey);
+
+void poly1305_core_blocks(struct poly1305_state *state,
+			  const struct poly1305_key *key, const void *src,
+			  unsigned int nblocks, u32 hibit)
+{
+	u32 r0, r1, r2, r3, r4;
+	u32 s1, s2, s3, s4;
+	u32 h0, h1, h2, h3, h4;
+	u64 d0, d1, d2, d3, d4;
+
+	if (!nblocks)
+		return;
+
+	r0 = key->r[0];
+	r1 = key->r[1];
+	r2 = key->r[2];
+	r3 = key->r[3];
+	r4 = key->r[4];
+
+	s1 = r1 * 5;
+	s2 = r2 * 5;
+	s3 = r3 * 5;
+	s4 = r4 * 5;
+
+	h0 = state->h[0];
+	h1 = state->h[1];
+	h2 = state->h[2];
+	h3 = state->h[3];
+	h4 = state->h[4];
+
+	do {
+		/* h += m[i] */
+		h0 += (get_unaligned_le32(src +  0) >> 0) & 0x3ffffff;
+		h1 += (get_unaligned_le32(src +  3) >> 2) & 0x3ffffff;
+		h2 += (get_unaligned_le32(src +  6) >> 4) & 0x3ffffff;
+		h3 += (get_unaligned_le32(src +  9) >> 6) & 0x3ffffff;
+		h4 += (get_unaligned_le32(src + 12) >> 8) | (hibit << 24);
+
+		/* h *= r */
+		d0 = mlt(h0, r0) + mlt(h1, s4) + mlt(h2, s3) +
+		     mlt(h3, s2) + mlt(h4, s1);
+		d1 = mlt(h0, r1) + mlt(h1, r0) + mlt(h2, s4) +
+		     mlt(h3, s3) + mlt(h4, s2);
+		d2 = mlt(h0, r2) + mlt(h1, r1) + mlt(h2, r0) +
+		     mlt(h3, s4) + mlt(h4, s3);
+		d3 = mlt(h0, r3) + mlt(h1, r2) + mlt(h2, r1) +
+		     mlt(h3, r0) + mlt(h4, s4);
+		d4 = mlt(h0, r4) + mlt(h1, r3) + mlt(h2, r2) +
+		     mlt(h3, r1) + mlt(h4, r0);
+
+		/* (partial) h %= p */
+		d1 += sr(d0, 26);     h0 = and(d0, 0x3ffffff);
+		d2 += sr(d1, 26);     h1 = and(d1, 0x3ffffff);
+		d3 += sr(d2, 26);     h2 = and(d2, 0x3ffffff);
+		d4 += sr(d3, 26);     h3 = and(d3, 0x3ffffff);
+		h0 += sr(d4, 26) * 5; h4 = and(d4, 0x3ffffff);
+		h1 += h0 >> 26;       h0 = h0 & 0x3ffffff;
+
+		src += POLY1305_BLOCK_SIZE;
+	} while (--nblocks);
+
+	state->h[0] = h0;
+	state->h[1] = h1;
+	state->h[2] = h2;
+	state->h[3] = h3;
+	state->h[4] = h4;
+}
+EXPORT_SYMBOL_GPL(poly1305_core_blocks);
+
+void poly1305_core_emit(const struct poly1305_state *state, void *dst)
+{
+	u32 h0, h1, h2, h3, h4;
+	u32 g0, g1, g2, g3, g4;
+	u32 mask;
+
+	/* fully carry h */
+	h0 = state->h[0];
+	h1 = state->h[1];
+	h2 = state->h[2];
+	h3 = state->h[3];
+	h4 = state->h[4];
+
+	h2 += (h1 >> 26);     h1 = h1 & 0x3ffffff;
+	h3 += (h2 >> 26);     h2 = h2 & 0x3ffffff;
+	h4 += (h3 >> 26);     h3 = h3 & 0x3ffffff;
+	h0 += (h4 >> 26) * 5; h4 = h4 & 0x3ffffff;
+	h1 += (h0 >> 26);     h0 = h0 & 0x3ffffff;
+
+	/* compute h + -p */
+	g0 = h0 + 5;
+	g1 = h1 + (g0 >> 26);             g0 &= 0x3ffffff;
+	g2 = h2 + (g1 >> 26);             g1 &= 0x3ffffff;
+	g3 = h3 + (g2 >> 26);             g2 &= 0x3ffffff;
+	g4 = h4 + (g3 >> 26) - (1 << 26); g3 &= 0x3ffffff;
+
+	/* select h if h < p, or h + -p if h >= p */
+	mask = (g4 >> ((sizeof(u32) * 8) - 1)) - 1;
+	g0 &= mask;
+	g1 &= mask;
+	g2 &= mask;
+	g3 &= mask;
+	g4 &= mask;
+	mask = ~mask;
+	h0 = (h0 & mask) | g0;
+	h1 = (h1 & mask) | g1;
+	h2 = (h2 & mask) | g2;
+	h3 = (h3 & mask) | g3;
+	h4 = (h4 & mask) | g4;
+
+	/* h = h % (2^128) */
+	put_unaligned_le32((h0 >>  0) | (h1 << 26), dst +  0);
+	put_unaligned_le32((h1 >>  6) | (h2 << 20), dst +  4);
+	put_unaligned_le32((h2 >> 12) | (h3 << 14), dst +  8);
+	put_unaligned_le32((h3 >> 18) | (h4 <<  8), dst + 12);
+}
+EXPORT_SYMBOL_GPL(poly1305_core_emit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Martin Willi <martin@strongswan.org>");
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [RFC PATCH 10/18] crypto: poly1305 - add init/update/final library routines
  2019-09-25 16:12 [RFC PATCH 00/18] crypto: wireguard using the existing crypto API Ard Biesheuvel
                   ` (8 preceding siblings ...)
  2019-09-25 16:12 ` [RFC PATCH 09/18] crypto: poly1305 - move core algorithm into lib/crypto Ard Biesheuvel
@ 2019-09-25 16:12 ` Ard Biesheuvel
  2019-09-25 16:12 ` [RFC PATCH 11/18] int128: move __uint128_t compiler test to Kconfig Ard Biesheuvel
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 61+ messages in thread
From: Ard Biesheuvel @ 2019-09-25 16:12 UTC (permalink / raw)
  To: linux-crypto
  Cc: Jason A . Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Ard Biesheuvel, Greg KH, Eric Biggers, Samuel Neves, Will Deacon,
	Dan Carpenter, Andy Lutomirski, Marc Zyngier, Linus Torvalds,
	David Miller, linux-arm-kernel

Add the usual init/update/final library routines for the Poly1305
keyed hash library. Since this will be the external interface of
the library, move the poly1305_core_* routines to the internal
header (and update the users to refer to it where needed).
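
A short usage sketch of the new interface (the wrapper itself is
illustrative):

#include <crypto/poly1305.h>

static void poly1305_mac_oneshot(const u8 key[POLY1305_KEY_SIZE],
				 const u8 *msg, unsigned int len,
				 u8 mac[POLY1305_DIGEST_SIZE])
{
	struct poly1305_desc desc;

	poly1305_init(&desc, key);		/* full 32-byte one-time key */
	poly1305_update(&desc, msg, len);
	poly1305_final(&desc, mac);
}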

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 crypto/adiantum.c                  |  1 +
 crypto/nhpoly1305.c                |  1 +
 crypto/poly1305_generic.c          | 57 +++++++------------
 include/crypto/internal/poly1305.h | 16 ++----
 include/crypto/poly1305.h          | 34 ++++++++++--
 lib/crypto/poly1305.c              | 58 ++++++++++++++++++++
 6 files changed, 115 insertions(+), 52 deletions(-)

diff --git a/crypto/adiantum.c b/crypto/adiantum.c
index 775ed1418e2b..aded26092268 100644
--- a/crypto/adiantum.c
+++ b/crypto/adiantum.c
@@ -33,6 +33,7 @@
 #include <crypto/b128ops.h>
 #include <crypto/chacha.h>
 #include <crypto/internal/hash.h>
+#include <crypto/internal/poly1305.h>
 #include <crypto/internal/skcipher.h>
 #include <crypto/nhpoly1305.h>
 #include <crypto/scatterwalk.h>
diff --git a/crypto/nhpoly1305.c b/crypto/nhpoly1305.c
index b88a6a71e3e2..f6b6a52092b4 100644
--- a/crypto/nhpoly1305.c
+++ b/crypto/nhpoly1305.c
@@ -33,6 +33,7 @@
 #include <asm/unaligned.h>
 #include <crypto/algapi.h>
 #include <crypto/internal/hash.h>
+#include <crypto/internal/poly1305.h>
 #include <crypto/nhpoly1305.h>
 #include <linux/crypto.h>
 #include <linux/kernel.h>
diff --git a/crypto/poly1305_generic.c b/crypto/poly1305_generic.c
index 46a2da1abac7..69241e7e85be 100644
--- a/crypto/poly1305_generic.c
+++ b/crypto/poly1305_generic.c
@@ -23,8 +23,8 @@ int crypto_poly1305_init(struct shash_desc *desc)
 {
 	struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
 
-	poly1305_core_init(&dctx->h);
-	dctx->buflen = 0;
+	poly1305_core_init(&dctx->desc.h);
+	dctx->desc.buflen = 0;
 	dctx->rset = false;
 	dctx->sset = false;
 
@@ -42,16 +42,16 @@ unsigned int crypto_poly1305_setdesckey(struct poly1305_desc_ctx *dctx,
 {
 	if (!dctx->sset) {
 		if (!dctx->rset && srclen >= POLY1305_BLOCK_SIZE) {
-			poly1305_core_setkey(&dctx->r, src);
+			poly1305_core_setkey(&dctx->desc.r, src);
 			src += POLY1305_BLOCK_SIZE;
 			srclen -= POLY1305_BLOCK_SIZE;
 			dctx->rset = true;
 		}
 		if (srclen >= POLY1305_BLOCK_SIZE) {
-			dctx->s[0] = get_unaligned_le32(src +  0);
-			dctx->s[1] = get_unaligned_le32(src +  4);
-			dctx->s[2] = get_unaligned_le32(src +  8);
-			dctx->s[3] = get_unaligned_le32(src + 12);
+			dctx->desc.s[0] = get_unaligned_le32(src +  0);
+			dctx->desc.s[1] = get_unaligned_le32(src +  4);
+			dctx->desc.s[2] = get_unaligned_le32(src +  8);
+			dctx->desc.s[3] = get_unaligned_le32(src + 12);
 			src += POLY1305_BLOCK_SIZE;
 			srclen -= POLY1305_BLOCK_SIZE;
 			dctx->sset = true;
@@ -72,7 +72,7 @@ static void poly1305_blocks(struct poly1305_desc_ctx *dctx, const u8 *src,
 		srclen = datalen;
 	}
 
-	poly1305_core_blocks(&dctx->h, &dctx->r, src,
+	poly1305_core_blocks(&dctx->desc.h, &dctx->desc.r, src,
 			     srclen / POLY1305_BLOCK_SIZE, 1);
 }
 
@@ -82,16 +82,17 @@ int crypto_poly1305_update(struct shash_desc *desc,
 	struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
 	unsigned int bytes;
 
-	if (unlikely(dctx->buflen)) {
-		bytes = min(srclen, POLY1305_BLOCK_SIZE - dctx->buflen);
-		memcpy(dctx->buf + dctx->buflen, src, bytes);
+	if (unlikely(dctx->desc.buflen)) {
+		bytes = min(srclen, POLY1305_BLOCK_SIZE - dctx->desc.buflen);
+		memcpy(dctx->desc.buf + dctx->desc.buflen, src, bytes);
 		src += bytes;
 		srclen -= bytes;
-		dctx->buflen += bytes;
+		dctx->desc.buflen += bytes;
 
-		if (dctx->buflen == POLY1305_BLOCK_SIZE) {
-			poly1305_blocks(dctx, dctx->buf, POLY1305_BLOCK_SIZE);
-			dctx->buflen = 0;
+		if (dctx->desc.buflen == POLY1305_BLOCK_SIZE) {
+			poly1305_blocks(dctx, dctx->desc.buf,
+					POLY1305_BLOCK_SIZE);
+			dctx->desc.buflen = 0;
 		}
 	}
 
@@ -102,8 +103,8 @@ int crypto_poly1305_update(struct shash_desc *desc,
 	}
 
 	if (unlikely(srclen)) {
-		dctx->buflen = srclen;
-		memcpy(dctx->buf, src, srclen);
+		dctx->desc.buflen = srclen;
+		memcpy(dctx->desc.buf, src, srclen);
 	}
 
 	return 0;
@@ -113,31 +114,11 @@ EXPORT_SYMBOL_GPL(crypto_poly1305_update);
 int crypto_poly1305_final(struct shash_desc *desc, u8 *dst)
 {
 	struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
-	__le32 digest[4];
-	u64 f = 0;
 
 	if (unlikely(!dctx->sset))
 		return -ENOKEY;
 
-	if (unlikely(dctx->buflen)) {
-		dctx->buf[dctx->buflen++] = 1;
-		memset(dctx->buf + dctx->buflen, 0,
-		       POLY1305_BLOCK_SIZE - dctx->buflen);
-		poly1305_core_blocks(&dctx->h, &dctx->r, dctx->buf, 1, 0);
-	}
-
-	poly1305_core_emit(&dctx->h, digest);
-
-	/* mac = (h + s) % (2^128) */
-	f = (f >> 32) + le32_to_cpu(digest[0]) + dctx->s[0];
-	put_unaligned_le32(f, dst + 0);
-	f = (f >> 32) + le32_to_cpu(digest[1]) + dctx->s[1];
-	put_unaligned_le32(f, dst + 4);
-	f = (f >> 32) + le32_to_cpu(digest[2]) + dctx->s[2];
-	put_unaligned_le32(f, dst + 8);
-	f = (f >> 32) + le32_to_cpu(digest[3]) + dctx->s[3];
-	put_unaligned_le32(f, dst + 12);
-
+	poly1305_final(&dctx->desc, dst);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(crypto_poly1305_final);
diff --git a/include/crypto/internal/poly1305.h b/include/crypto/internal/poly1305.h
index ae0466c4ce59..619705200e70 100644
--- a/include/crypto/internal/poly1305.h
+++ b/include/crypto/internal/poly1305.h
@@ -10,22 +10,18 @@
 #include <crypto/poly1305.h>
 
 struct poly1305_desc_ctx {
-	/* key */
-	struct poly1305_key r;
-	/* finalize key */
-	u32 s[4];
-	/* accumulator */
-	struct poly1305_state h;
-	/* partial buffer */
-	u8 buf[POLY1305_BLOCK_SIZE];
-	/* bytes used in partial buffer */
-	unsigned int buflen;
+	struct poly1305_desc desc;
 	/* r key has been set */
 	bool rset;
 	/* s key has been set */
 	bool sset;
 };
 
+void poly1305_core_blocks(struct poly1305_state *state,
+			  const struct poly1305_key *key, const void *src,
+			  unsigned int nblocks, u32 hibit);
+void poly1305_core_emit(const struct poly1305_state *state, void *dst);
+
 /* Crypto API helper functions for the Poly1305 MAC */
 int crypto_poly1305_init(struct shash_desc *desc);
 unsigned int crypto_poly1305_setdesckey(struct poly1305_desc_ctx *dctx,
diff --git a/include/crypto/poly1305.h b/include/crypto/poly1305.h
index 83e4b4c69a5a..148d9049b906 100644
--- a/include/crypto/poly1305.h
+++ b/include/crypto/poly1305.h
@@ -6,6 +6,7 @@
 #ifndef _CRYPTO_POLY1305_H
 #define _CRYPTO_POLY1305_H
 
+#include <asm/unaligned.h>
 #include <linux/types.h>
 #include <linux/crypto.h>
 
@@ -21,6 +22,19 @@ struct poly1305_state {
 	u32 h[5];	/* accumulator, base 2^26 */
 };
 
+struct poly1305_desc {
+	/* key */
+	struct poly1305_key r;
+	/* finalize key */
+	u32 s[4];
+	/* accumulator */
+	struct poly1305_state h;
+	/* partial buffer */
+	u8 buf[POLY1305_BLOCK_SIZE];
+	/* bytes used in partial buffer */
+	unsigned int buflen;
+};
+
 /*
  * Poly1305 core functions.  These implement the ε-almost-∆-universal hash
  * function underlying the Poly1305 MAC, i.e. they don't add an encrypted nonce
@@ -31,8 +45,20 @@ static inline void poly1305_core_init(struct poly1305_state *state)
 {
 	*state = (struct poly1305_state){};
 }
-void poly1305_core_blocks(struct poly1305_state *state,
-			  const struct poly1305_key *key, const void *src,
-			  unsigned int nblocks, u32 hibit);
-void poly1305_core_emit(const struct poly1305_state *state, void *dst);
+
+static inline void poly1305_init(struct poly1305_desc *desc, const u8 *key)
+{
+	poly1305_core_setkey(&desc->r, key);
+	desc->s[0] = get_unaligned_le32(key + 16);
+	desc->s[1] = get_unaligned_le32(key + 20);
+	desc->s[2] = get_unaligned_le32(key + 24);
+	desc->s[3] = get_unaligned_le32(key + 28);
+	poly1305_core_init(&desc->h);
+	desc->buflen = 0;
+}
+
+void poly1305_update(struct poly1305_desc *desc, const u8 *src,
+		     unsigned int nbytes);
+void poly1305_final(struct poly1305_desc *desc, u8 *digest);
+
 #endif
diff --git a/lib/crypto/poly1305.c b/lib/crypto/poly1305.c
index abe6fccf7b9c..9af7cb5364af 100644
--- a/lib/crypto/poly1305.c
+++ b/lib/crypto/poly1305.c
@@ -154,5 +154,63 @@ void poly1305_core_emit(const struct poly1305_state *state, void *dst)
 }
 EXPORT_SYMBOL_GPL(poly1305_core_emit);
 
+void poly1305_update(struct poly1305_desc *desc, const u8 *src,
+		     unsigned int nbytes)
+{
+	unsigned int bytes;
+
+	if (unlikely(desc->buflen)) {
+		bytes = min(nbytes, POLY1305_BLOCK_SIZE - desc->buflen);
+		memcpy(desc->buf + desc->buflen, src, bytes);
+		src += bytes;
+		nbytes -= bytes;
+		desc->buflen += bytes;
+
+		if (desc->buflen == POLY1305_BLOCK_SIZE) {
+			poly1305_core_blocks(&desc->h, &desc->r, desc->buf, 1, 1);
+			desc->buflen = 0;
+		}
+	}
+
+	if (likely(nbytes >= POLY1305_BLOCK_SIZE)) {
+		poly1305_core_blocks(&desc->h, &desc->r, src,
+				     nbytes / POLY1305_BLOCK_SIZE, 1);
+		src += nbytes - (nbytes % POLY1305_BLOCK_SIZE);
+		nbytes %= POLY1305_BLOCK_SIZE;
+	}
+
+	if (unlikely(nbytes)) {
+		desc->buflen = nbytes;
+		memcpy(desc->buf, src, nbytes);
+	}
+}
+EXPORT_SYMBOL_GPL(poly1305_update);
+
+void poly1305_final(struct poly1305_desc *desc, u8 *dst)
+{
+	__le32 digest[4];
+	u64 f = 0;
+
+	if (unlikely(desc->buflen)) {
+		desc->buf[desc->buflen++] = 1;
+		memset(desc->buf + desc->buflen, 0,
+		       POLY1305_BLOCK_SIZE - desc->buflen);
+		poly1305_core_blocks(&desc->h, &desc->r, desc->buf, 1, 0);
+	}
+
+	poly1305_core_emit(&desc->h, digest);
+
+	/* mac = (h + s) % (2^128) */
+	f = (f >> 32) + le32_to_cpu(digest[0]) + desc->s[0];
+	put_unaligned_le32(f, dst + 0);
+	f = (f >> 32) + le32_to_cpu(digest[1]) + desc->s[1];
+	put_unaligned_le32(f, dst + 4);
+	f = (f >> 32) + le32_to_cpu(digest[2]) + desc->s[2];
+	put_unaligned_le32(f, dst + 8);
+	f = (f >> 32) + le32_to_cpu(digest[3]) + desc->s[3];
+	put_unaligned_le32(f, dst + 12);
+}
+EXPORT_SYMBOL_GPL(poly1305_final);
+
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Martin Willi <martin@strongswan.org>");
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [RFC PATCH 11/18] int128: move __uint128_t compiler test to Kconfig
  2019-09-25 16:12 [RFC PATCH 00/18] crypto: wireguard using the existing crypto API Ard Biesheuvel
                   ` (9 preceding siblings ...)
  2019-09-25 16:12 ` [RFC PATCH 10/18] crypto: poly1305 - add init/update/final library routines Ard Biesheuvel
@ 2019-09-25 16:12 ` Ard Biesheuvel
  2019-09-25 21:01   ` Linus Torvalds
  2019-09-25 16:12 ` [RFC PATCH 16/18] netlink: use new strict length types in policy for 5.2 Ard Biesheuvel
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 61+ messages in thread
From: Ard Biesheuvel @ 2019-09-25 16:12 UTC (permalink / raw)
  To: linux-crypto
  Cc: Jason A . Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Ard Biesheuvel, Greg KH, Eric Biggers, Samuel Neves, Will Deacon,
	Dan Carpenter, Andy Lutomirski, Marc Zyngier, Linus Torvalds,
	David Miller, linux-arm-kernel

In order to use 128-bit integer arithmetic in C code, the architecture
needs to have declared support for it by setting ARCH_SUPPORTS_INT128,
and it requires a version of the toolchain that supports this at build
time. This is why all existing tests for ARCH_SUPPORTS_INT128 also test
whether __SIZEOF_INT128__ is defined, since this is only the case for
compilers that can support 128-bit integers.

Let's fold this additional test into the Kconfig declaration of
ARCH_SUPPORTS_INT128 so that we can also use the symbol in Makefiles,
e.g., to decide whether a certain object needs to be included in the
first place.
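
For illustration (the helper is made up), C code then only needs the
single Kconfig test, since the toolchain capability is folded into the
symbol itself:

#include <linux/types.h>

#if defined(CONFIG_ARCH_SUPPORTS_INT128)
/* No separate __SIZEOF_INT128__ check is needed anymore. */
static inline u64 mul_hi64(u64 a, u64 b)
{
	return (u64)(((unsigned __int128)a * b) >> 64);
}
#endif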

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 crypto/ecc.c | 2 +-
 init/Kconfig | 1 +
 lib/ubsan.c  | 2 +-
 lib/ubsan.h  | 2 +-
 4 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/crypto/ecc.c b/crypto/ecc.c
index dfe114bc0c4a..6e6aab6c987c 100644
--- a/crypto/ecc.c
+++ b/crypto/ecc.c
@@ -336,7 +336,7 @@ static u64 vli_usub(u64 *result, const u64 *left, u64 right,
 static uint128_t mul_64_64(u64 left, u64 right)
 {
 	uint128_t result;
-#if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
+#if defined(CONFIG_ARCH_SUPPORTS_INT128)
 	unsigned __int128 m = (unsigned __int128)left * right;
 
 	result.m_low  = m;
diff --git a/init/Kconfig b/init/Kconfig
index bd7d650d4a99..e66f64a26d7d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -785,6 +785,7 @@ config ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 #
 config ARCH_SUPPORTS_INT128
 	bool
+	depends on !$(cc-option,-D__SIZEOF_INT128__=0)
 
 # For architectures that (ab)use NUMA to represent different memory regions
 # all cpu-local but of different latencies, such as SuperH.
diff --git a/lib/ubsan.c b/lib/ubsan.c
index e7d31735950d..b652cc14dd60 100644
--- a/lib/ubsan.c
+++ b/lib/ubsan.c
@@ -119,7 +119,7 @@ static void val_to_string(char *str, size_t size, struct type_descriptor *type,
 {
 	if (type_is_int(type)) {
 		if (type_bit_width(type) == 128) {
-#if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
+#if defined(CONFIG_ARCH_SUPPORTS_INT128)
 			u_max val = get_unsigned_val(type, value);
 
 			scnprintf(str, size, "0x%08x%08x%08x%08x",
diff --git a/lib/ubsan.h b/lib/ubsan.h
index b8fa83864467..7b56c09473a9 100644
--- a/lib/ubsan.h
+++ b/lib/ubsan.h
@@ -78,7 +78,7 @@ struct invalid_value_data {
 	struct type_descriptor *type;
 };
 
-#if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
+#if defined(CONFIG_ARCH_SUPPORTS_INT128)
 typedef __int128 s_max;
 typedef unsigned __int128 u_max;
 #else
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [RFC PATCH 16/18] netlink: use new strict length types in policy for 5.2
  2019-09-25 16:12 [RFC PATCH 00/18] crypto: wireguard using the existing crypto API Ard Biesheuvel
                   ` (10 preceding siblings ...)
  2019-09-25 16:12 ` [RFC PATCH 11/18] int128: move __uint128_t compiler test to Kconfig Ard Biesheuvel
@ 2019-09-25 16:12 ` Ard Biesheuvel
  2019-09-25 16:12 ` [RFC PATCH 17/18] wg switch to lib/crypto algos Ard Biesheuvel
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 61+ messages in thread
From: Ard Biesheuvel @ 2019-09-25 16:12 UTC (permalink / raw)
  To: linux-crypto
  Cc: Bruno Wolff III, Jason A . Donenfeld, Catalin Marinas,
	Herbert Xu, Arnd Bergmann, Ard Biesheuvel, Greg KH, Eric Biggers,
	Samuel Neves, Will Deacon, Dan Carpenter, Andy Lutomirski,
	Marc Zyngier, Linus Torvalds, David Miller, linux-arm-kernel

Taken from
https://git.zx2c4.com/WireGuard/commit/src?id=3120425f69003be287cb2d308f89c7a6a0335ff0

Reported-by: Bruno Wolff III <bruno@wolff.to>
---
 drivers/net/wireguard/netlink.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/drivers/net/wireguard/netlink.c b/drivers/net/wireguard/netlink.c
index 3763e8c14ea5..676d36725120 100644
--- a/drivers/net/wireguard/netlink.c
+++ b/drivers/net/wireguard/netlink.c
@@ -21,8 +21,8 @@ static struct genl_family genl_family;
 static const struct nla_policy device_policy[WGDEVICE_A_MAX + 1] = {
 	[WGDEVICE_A_IFINDEX]		= { .type = NLA_U32 },
 	[WGDEVICE_A_IFNAME]		= { .type = NLA_NUL_STRING, .len = IFNAMSIZ - 1 },
-	[WGDEVICE_A_PRIVATE_KEY]	= { .len = NOISE_PUBLIC_KEY_LEN },
-	[WGDEVICE_A_PUBLIC_KEY]		= { .len = NOISE_PUBLIC_KEY_LEN },
+	[WGDEVICE_A_PRIVATE_KEY]	= { .type = NLA_EXACT_LEN, .len = NOISE_PUBLIC_KEY_LEN },
+	[WGDEVICE_A_PUBLIC_KEY]		= { .type = NLA_EXACT_LEN, .len = NOISE_PUBLIC_KEY_LEN },
 	[WGDEVICE_A_FLAGS]		= { .type = NLA_U32 },
 	[WGDEVICE_A_LISTEN_PORT]	= { .type = NLA_U16 },
 	[WGDEVICE_A_FWMARK]		= { .type = NLA_U32 },
@@ -30,12 +30,12 @@ static const struct nla_policy device_policy[WGDEVICE_A_MAX + 1] = {
 };
 
 static const struct nla_policy peer_policy[WGPEER_A_MAX + 1] = {
-	[WGPEER_A_PUBLIC_KEY]				= { .len = NOISE_PUBLIC_KEY_LEN },
-	[WGPEER_A_PRESHARED_KEY]			= { .len = NOISE_SYMMETRIC_KEY_LEN },
+	[WGPEER_A_PUBLIC_KEY]				= { .type = NLA_EXACT_LEN, .len = NOISE_PUBLIC_KEY_LEN },
+	[WGPEER_A_PRESHARED_KEY]			= { .type = NLA_EXACT_LEN, .len = NOISE_SYMMETRIC_KEY_LEN },
 	[WGPEER_A_FLAGS]				= { .type = NLA_U32 },
-	[WGPEER_A_ENDPOINT]				= { .len = sizeof(struct sockaddr) },
+	[WGPEER_A_ENDPOINT]				= { .type = NLA_MIN_LEN, .len = sizeof(struct sockaddr) },
 	[WGPEER_A_PERSISTENT_KEEPALIVE_INTERVAL]	= { .type = NLA_U16 },
-	[WGPEER_A_LAST_HANDSHAKE_TIME]			= { .len = sizeof(struct __kernel_timespec) },
+	[WGPEER_A_LAST_HANDSHAKE_TIME]			= { .type = NLA_EXACT_LEN, .len = sizeof(struct __kernel_timespec) },
 	[WGPEER_A_RX_BYTES]				= { .type = NLA_U64 },
 	[WGPEER_A_TX_BYTES]				= { .type = NLA_U64 },
 	[WGPEER_A_ALLOWEDIPS]				= { .type = NLA_NESTED },
@@ -44,7 +44,7 @@ static const struct nla_policy peer_policy[WGPEER_A_MAX + 1] = {
 
 static const struct nla_policy allowedip_policy[WGALLOWEDIP_A_MAX + 1] = {
 	[WGALLOWEDIP_A_FAMILY]		= { .type = NLA_U16 },
-	[WGALLOWEDIP_A_IPADDR]		= { .len = sizeof(struct in_addr) },
+	[WGALLOWEDIP_A_IPADDR]		= { .type = NLA_MIN_LEN, .len = sizeof(struct in_addr) },
 	[WGALLOWEDIP_A_CIDR_MASK]	= { .type = NLA_U8 }
 };
 
@@ -591,12 +591,10 @@ static const struct genl_ops genl_ops[] = {
 		.start = wg_get_device_start,
 		.dumpit = wg_get_device_dump,
 		.done = wg_get_device_done,
-		.policy = device_policy,
 		.flags = GENL_UNS_ADMIN_PERM
 	}, {
 		.cmd = WG_CMD_SET_DEVICE,
 		.doit = wg_set_device,
-		.policy = device_policy,
 		.flags = GENL_UNS_ADMIN_PERM
 	}
 };
@@ -608,6 +606,7 @@ static struct genl_family genl_family __ro_after_init = {
 	.version = WG_GENL_VERSION,
 	.maxattr = WGDEVICE_A_MAX,
 	.module = THIS_MODULE,
+	.policy = device_policy,
 	.netnsok = true
 };
 
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [RFC PATCH 17/18] wg switch to lib/crypto algos
  2019-09-25 16:12 [RFC PATCH 00/18] crypto: wireguard using the existing crypto API Ard Biesheuvel
                   ` (11 preceding siblings ...)
  2019-09-25 16:12 ` [RFC PATCH 16/18] netlink: use new strict length types in policy for 5.2 Ard Biesheuvel
@ 2019-09-25 16:12 ` Ard Biesheuvel
  2019-09-25 16:12 ` [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption Ard Biesheuvel
  2019-09-26  8:59 ` [RFC PATCH 00/18] crypto: wireguard using the existing crypto API Jason A. Donenfeld
  14 siblings, 0 replies; 61+ messages in thread
From: Ard Biesheuvel @ 2019-09-25 16:12 UTC (permalink / raw)
  To: linux-crypto
  Cc: Jason A . Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Ard Biesheuvel, Greg KH, Eric Biggers, Samuel Neves, Will Deacon,
	Dan Carpenter, Andy Lutomirski, Marc Zyngier, Linus Torvalds,
	David Miller, linux-arm-kernel

---
 drivers/net/Kconfig              | 6 +++---
 drivers/net/wireguard/cookie.c   | 4 ++--
 drivers/net/wireguard/messages.h | 6 +++---
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index c26aef673538..3bd4dc662392 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -77,9 +77,9 @@ config WIREGUARD
 	depends on IPV6 || !IPV6
 	select NET_UDP_TUNNEL
 	select DST_CACHE
-	select ZINC_CHACHA20POLY1305
-	select ZINC_BLAKE2S
-	select ZINC_CURVE25519
+	select CRYPTO_LIB_CHACHA20POLY1305
+	select CRYPTO_LIB_BLAKE2S
+	select CRYPTO_LIB_CURVE25519
 	help
 	  WireGuard is a secure, fast, and easy to use replacement for IPSec
 	  that uses modern cryptography and clever networking tricks. It's
diff --git a/drivers/net/wireguard/cookie.c b/drivers/net/wireguard/cookie.c
index bd23a14ff87f..104b739c327f 100644
--- a/drivers/net/wireguard/cookie.c
+++ b/drivers/net/wireguard/cookie.c
@@ -10,8 +10,8 @@
 #include "ratelimiter.h"
 #include "timers.h"
 
-#include <zinc/blake2s.h>
-#include <zinc/chacha20poly1305.h>
+#include <crypto/blake2s.h>
+#include <crypto/chacha20poly1305.h>
 
 #include <net/ipv6.h>
 #include <crypto/algapi.h>
diff --git a/drivers/net/wireguard/messages.h b/drivers/net/wireguard/messages.h
index 3cfd1c5e9b02..4bbb1f97af04 100644
--- a/drivers/net/wireguard/messages.h
+++ b/drivers/net/wireguard/messages.h
@@ -6,9 +6,9 @@
 #ifndef _WG_MESSAGES_H
 #define _WG_MESSAGES_H
 
-#include <zinc/curve25519.h>
-#include <zinc/chacha20poly1305.h>
-#include <zinc/blake2s.h>
+#include <crypto/blake2s.h>
+#include <crypto/chacha20poly1305.h>
+#include <crypto/curve25519.h>
 
 #include <linux/kernel.h>
 #include <linux/param.h>
-- 
2.20.1



* [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption
  2019-09-25 16:12 [RFC PATCH 00/18] crypto: wireguard using the existing crypto API Ard Biesheuvel
                   ` (12 preceding siblings ...)
  2019-09-25 16:12 ` [RFC PATCH 17/18] wg switch to lib/crypto algos Ard Biesheuvel
@ 2019-09-25 16:12 ` Ard Biesheuvel
  2019-09-25 22:15   ` Linus Torvalds
  2019-09-26  8:59 ` [RFC PATCH 00/18] crypto: wireguard using the existing crypto API Jason A. Donenfeld
  14 siblings, 1 reply; 61+ messages in thread
From: Ard Biesheuvel @ 2019-09-25 16:12 UTC (permalink / raw)
  To: linux-crypto
  Cc: Jason A . Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Ard Biesheuvel, Greg KH, Eric Biggers, Samuel Neves, Will Deacon,
	Dan Carpenter, Andy Lutomirski, Marc Zyngier, Linus Torvalds,
	David Miller, linux-arm-kernel

Replace the chacha20poly1305() library calls with invocations of the
RFC7539 AEAD, as implemented by the generic chacha20poly1305 template.

For now, only synchronous AEADs are supported, but looking at the code,
it does not look terribly complicated to add support for async versions
of rfc7539(chacha20,poly1305) as well, some of which already exist in
the drivers/crypto tree.

The nonce related changes are there to address the mismatch between the
96-bit nonce (aka IV) that the rfc7539() template expects, and the 64-bit
nonce that WireGuard uses.

Note that these changes take advantage of the fact that synchronous
instantiations of the generic rfc7539() template will use a zero reqsize
if possible, removing the need for heap allocations for the request
structures.

This code was tested using the included netns.sh script, and by connecting
to the WireGuard demo server
(https://www.wireguard.com/quickstart/#demo-server)

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 drivers/net/wireguard/noise.c    | 34 ++++++++++++-
 drivers/net/wireguard/noise.h    |  3 +-
 drivers/net/wireguard/queueing.h |  5 +-
 drivers/net/wireguard/receive.c  | 51 ++++++++++++--------
 drivers/net/wireguard/send.c     | 45 ++++++++++-------
 5 files changed, 97 insertions(+), 41 deletions(-)

diff --git a/drivers/net/wireguard/noise.c b/drivers/net/wireguard/noise.c
index bf0b8c5ab298..0e7aab9f645d 100644
--- a/drivers/net/wireguard/noise.c
+++ b/drivers/net/wireguard/noise.c
@@ -109,16 +109,37 @@ static struct noise_keypair *keypair_create(struct wg_peer *peer)
 
 	if (unlikely(!keypair))
 		return NULL;
+
+	keypair->sending.tfm = crypto_alloc_aead("rfc7539(chacha20,poly1305)",
+						 0, CRYPTO_ALG_ASYNC);
+	if (unlikely(IS_ERR(keypair->sending.tfm)))
+		goto free_keypair;
+	keypair->receiving.tfm = crypto_alloc_aead("rfc7539(chacha20,poly1305)",
+						   0, CRYPTO_ALG_ASYNC);
+	if (unlikely(IS_ERR(keypair->receiving.tfm)))
+		goto free_sending_tfm;
+
 	keypair->internal_id = atomic64_inc_return(&keypair_counter);
 	keypair->entry.type = INDEX_HASHTABLE_KEYPAIR;
 	keypair->entry.peer = peer;
 	kref_init(&keypair->refcount);
 	return keypair;
+
+free_sending_tfm:
+	crypto_free_aead(keypair->sending.tfm);
+free_keypair:
+	kzfree(keypair);
+	return NULL;
 }
 
 static void keypair_free_rcu(struct rcu_head *rcu)
 {
-	kzfree(container_of(rcu, struct noise_keypair, rcu));
+	struct noise_keypair *keypair =
+		container_of(rcu, struct noise_keypair, rcu);
+
+	crypto_free_aead(keypair->sending.tfm);
+	crypto_free_aead(keypair->receiving.tfm);
+	kzfree(keypair);
 }
 
 static void keypair_free_kref(struct kref *kref)
@@ -360,11 +381,20 @@ static void derive_keys(struct noise_symmetric_key *first_dst,
 			struct noise_symmetric_key *second_dst,
 			const u8 chaining_key[NOISE_HASH_LEN])
 {
-	kdf(first_dst->key, second_dst->key, NULL, NULL,
+	u8 key[2][NOISE_SYMMETRIC_KEY_LEN];
+	int err;
+
+	kdf(key[0], key[1], NULL, NULL,
 	    NOISE_SYMMETRIC_KEY_LEN, NOISE_SYMMETRIC_KEY_LEN, 0, 0,
 	    chaining_key);
 	symmetric_key_init(first_dst);
 	symmetric_key_init(second_dst);
+
+	err = crypto_aead_setkey(first_dst->tfm, key[0], sizeof(key[0])) ?:
+	      crypto_aead_setkey(second_dst->tfm, key[1], sizeof(key[1]));
+	memzero_explicit(key, sizeof(key));
+	if (unlikely(err))
+		pr_warn_once("crypto_aead_setkey() failed (%d)\n", err);
 }
 
 static bool __must_check mix_dh(u8 chaining_key[NOISE_HASH_LEN],
diff --git a/drivers/net/wireguard/noise.h b/drivers/net/wireguard/noise.h
index 9c2cc62dc11e..6f033d2ea52c 100644
--- a/drivers/net/wireguard/noise.h
+++ b/drivers/net/wireguard/noise.h
@@ -8,6 +8,7 @@
 #include "messages.h"
 #include "peerlookup.h"
 
+#include <crypto/aead.h>
 #include <linux/types.h>
 #include <linux/spinlock.h>
 #include <linux/atomic.h>
@@ -26,7 +27,7 @@ union noise_counter {
 };
 
 struct noise_symmetric_key {
-	u8 key[NOISE_SYMMETRIC_KEY_LEN];
+	struct crypto_aead *tfm;
 	union noise_counter counter;
 	u64 birthdate;
 	bool is_valid;
diff --git a/drivers/net/wireguard/queueing.h b/drivers/net/wireguard/queueing.h
index f8de703dff97..593971edf8a3 100644
--- a/drivers/net/wireguard/queueing.h
+++ b/drivers/net/wireguard/queueing.h
@@ -55,9 +55,10 @@ enum packet_state {
 };
 
 struct packet_cb {
-	u64 nonce;
-	struct noise_keypair *keypair;
 	atomic_t state;
+	__le32 ivpad;			/* pad 64-bit nonce to 96 bits */
+	__le64 nonce;
+	struct noise_keypair *keypair;
 	u32 mtu;
 	u8 ds;
 };
diff --git a/drivers/net/wireguard/receive.c b/drivers/net/wireguard/receive.c
index 900c76edb9d6..395089e7e3a6 100644
--- a/drivers/net/wireguard/receive.c
+++ b/drivers/net/wireguard/receive.c
@@ -11,7 +11,7 @@
 #include "cookie.h"
 #include "socket.h"
 
-#include <linux/simd.h>
+#include <crypto/aead.h>
 #include <linux/ip.h>
 #include <linux/ipv6.h>
 #include <linux/udp.h>
@@ -244,13 +244,14 @@ static void keep_key_fresh(struct wg_peer *peer)
 	}
 }
 
-static bool decrypt_packet(struct sk_buff *skb, struct noise_symmetric_key *key,
-			   simd_context_t *simd_context)
+static bool decrypt_packet(struct sk_buff *skb, struct noise_symmetric_key *key)
 {
 	struct scatterlist sg[MAX_SKB_FRAGS + 8];
+	struct aead_request *req, stackreq;
 	struct sk_buff *trailer;
 	unsigned int offset;
 	int num_frags;
+	int err;
 
 	if (unlikely(!key))
 		return false;
@@ -262,8 +263,8 @@ static bool decrypt_packet(struct sk_buff *skb, struct noise_symmetric_key *key,
 		return false;
 	}
 
-	PACKET_CB(skb)->nonce =
-		le64_to_cpu(((struct message_data *)skb->data)->counter);
+	PACKET_CB(skb)->ivpad = 0;
+	PACKET_CB(skb)->nonce = ((struct message_data *)skb->data)->counter;
 
 	/* We ensure that the network header is part of the packet before we
 	 * call skb_cow_data, so that there's no chance that data is removed
@@ -281,9 +282,23 @@ static bool decrypt_packet(struct sk_buff *skb, struct noise_symmetric_key *key,
 	if (skb_to_sgvec(skb, sg, 0, skb->len) <= 0)
 		return false;
 
-	if (!chacha20poly1305_decrypt_sg(sg, sg, skb->len, NULL, 0,
-					 PACKET_CB(skb)->nonce, key->key,
-					 simd_context))
+	if (unlikely(crypto_aead_reqsize(key->tfm) > 0)) {
+		req = aead_request_alloc(key->tfm, GFP_ATOMIC);
+		if (!req)
+			return false;
+	} else {
+		req = &stackreq;
+		aead_request_set_tfm(req, key->tfm);
+	}
+
+	aead_request_set_ad(req, 0);
+	aead_request_set_callback(req, 0, NULL, NULL);
+	aead_request_set_crypt(req, sg, sg, skb->len,
+			       (u8 *)&PACKET_CB(skb)->ivpad);
+	err = crypto_aead_decrypt(req);
+	if (unlikely(req != &stackreq))
+		aead_request_free(req);
+	if (err)
 		return false;
 
 	/* Another ugly situation of pushing and pulling the header so as to
@@ -475,10 +490,10 @@ int wg_packet_rx_poll(struct napi_struct *napi, int budget)
 			goto next;
 
 		if (unlikely(!counter_validate(&keypair->receiving.counter,
-					       PACKET_CB(skb)->nonce))) {
+					       le64_to_cpu(PACKET_CB(skb)->nonce)))) {
 			net_dbg_ratelimited("%s: Packet has invalid nonce %llu (max %llu)\n",
 					    peer->device->dev->name,
-					    PACKET_CB(skb)->nonce,
+					    le64_to_cpu(PACKET_CB(skb)->nonce),
 					    keypair->receiving.counter.receive.counter);
 			goto next;
 		}
@@ -510,21 +525,19 @@ void wg_packet_decrypt_worker(struct work_struct *work)
 {
 	struct crypt_queue *queue = container_of(work, struct multicore_worker,
 						 work)->ptr;
-	simd_context_t simd_context;
 	struct sk_buff *skb;
 
-	simd_get(&simd_context);
 	while ((skb = ptr_ring_consume_bh(&queue->ring)) != NULL) {
-		enum packet_state state = likely(decrypt_packet(skb,
-					   &PACKET_CB(skb)->keypair->receiving,
-					   &simd_context)) ?
-				PACKET_STATE_CRYPTED : PACKET_STATE_DEAD;
+		enum packet_state state;
+
+		if (likely(decrypt_packet(skb,
+					  &PACKET_CB(skb)->keypair->receiving)))
+			state = PACKET_STATE_CRYPTED;
+		else
+			state = PACKET_STATE_DEAD;
 		wg_queue_enqueue_per_peer_napi(&PACKET_PEER(skb)->rx_queue, skb,
 					       state);
-		simd_relax(&simd_context);
 	}
-
-	simd_put(&simd_context);
 }
 
 static void wg_packet_consume_data(struct wg_device *wg, struct sk_buff *skb)
diff --git a/drivers/net/wireguard/send.c b/drivers/net/wireguard/send.c
index b0df5c717502..48d1fb02f575 100644
--- a/drivers/net/wireguard/send.c
+++ b/drivers/net/wireguard/send.c
@@ -11,7 +11,7 @@
 #include "messages.h"
 #include "cookie.h"
 
-#include <linux/simd.h>
+#include <crypto/aead.h>
 #include <linux/uio.h>
 #include <linux/inetdevice.h>
 #include <linux/socket.h>
@@ -157,11 +157,11 @@ static unsigned int calculate_skb_padding(struct sk_buff *skb)
 	return padded_size - last_unit;
 }
 
-static bool encrypt_packet(struct sk_buff *skb, struct noise_keypair *keypair,
-			   simd_context_t *simd_context)
+static bool encrypt_packet(struct sk_buff *skb, struct noise_keypair *keypair)
 {
 	unsigned int padding_len, plaintext_len, trailer_len;
 	struct scatterlist sg[MAX_SKB_FRAGS + 8];
+	struct aead_request *req, stackreq;
 	struct message_data *header;
 	struct sk_buff *trailer;
 	int num_frags;
@@ -199,7 +199,7 @@ static bool encrypt_packet(struct sk_buff *skb, struct noise_keypair *keypair,
 	header = (struct message_data *)skb_push(skb, sizeof(*header));
 	header->header.type = cpu_to_le32(MESSAGE_DATA);
 	header->key_idx = keypair->remote_index;
-	header->counter = cpu_to_le64(PACKET_CB(skb)->nonce);
+	header->counter = PACKET_CB(skb)->nonce;
 	pskb_put(skb, trailer, trailer_len);
 
 	/* Now we can encrypt the scattergather segments */
@@ -207,9 +207,24 @@ static bool encrypt_packet(struct sk_buff *skb, struct noise_keypair *keypair,
 	if (skb_to_sgvec(skb, sg, sizeof(struct message_data),
 			 noise_encrypted_len(plaintext_len)) <= 0)
 		return false;
-	return chacha20poly1305_encrypt_sg(sg, sg, plaintext_len, NULL, 0,
-					   PACKET_CB(skb)->nonce,
-					   keypair->sending.key, simd_context);
+
+	if (unlikely(crypto_aead_reqsize(keypair->sending.tfm) > 0)) {
+		req = aead_request_alloc(keypair->sending.tfm, GFP_ATOMIC);
+		if (!req)
+			return false;
+	} else {
+		req = &stackreq;
+		aead_request_set_tfm(req, keypair->sending.tfm);
+	}
+
+	aead_request_set_ad(req, 0);
+	aead_request_set_callback(req, 0, NULL, NULL);
+	aead_request_set_crypt(req, sg, sg, plaintext_len,
+			       (u8 *)&PACKET_CB(skb)->ivpad);
+	crypto_aead_encrypt(req);
+	if (unlikely(req != &stackreq))
+		aead_request_free(req);
+	return true;
 }
 
 void wg_packet_send_keepalive(struct wg_peer *peer)
@@ -296,16 +311,13 @@ void wg_packet_encrypt_worker(struct work_struct *work)
 	struct crypt_queue *queue = container_of(work, struct multicore_worker,
 						 work)->ptr;
 	struct sk_buff *first, *skb, *next;
-	simd_context_t simd_context;
 
-	simd_get(&simd_context);
 	while ((first = ptr_ring_consume_bh(&queue->ring)) != NULL) {
 		enum packet_state state = PACKET_STATE_CRYPTED;
 
 		skb_walk_null_queue_safe(first, skb, next) {
 			if (likely(encrypt_packet(skb,
-						  PACKET_CB(first)->keypair,
-						  &simd_context))) {
+						  PACKET_CB(first)->keypair))) {
 				wg_reset_packet(skb);
 			} else {
 				state = PACKET_STATE_DEAD;
@@ -314,10 +326,7 @@ void wg_packet_encrypt_worker(struct work_struct *work)
 		}
 		wg_queue_enqueue_per_peer(&PACKET_PEER(first)->tx_queue, first,
 					  state);
-
-		simd_relax(&simd_context);
 	}
-	simd_put(&simd_context);
 }
 
 static void wg_packet_create_data(struct sk_buff *first)
@@ -389,13 +398,15 @@ void wg_packet_send_staged_packets(struct wg_peer *peer)
 	 * handshake.
 	 */
 	skb_queue_walk(&packets, skb) {
+		u64 counter = atomic64_inc_return(&key->counter.counter) - 1;
+
 		/* 0 for no outer TOS: no leak. TODO: at some later point, we
 		 * might consider using flowi->tos as outer instead.
 		 */
 		PACKET_CB(skb)->ds = ip_tunnel_ecn_encap(0, ip_hdr(skb), skb);
-		PACKET_CB(skb)->nonce =
-				atomic64_inc_return(&key->counter.counter) - 1;
-		if (unlikely(PACKET_CB(skb)->nonce >= REJECT_AFTER_MESSAGES))
+		PACKET_CB(skb)->ivpad = 0;
+		PACKET_CB(skb)->nonce = cpu_to_le64(counter);
+		if (unlikely(counter >= REJECT_AFTER_MESSAGES))
 			goto out_invalid;
 	}
 
-- 
2.20.1



* Re: [RFC PATCH 11/18] int128: move __uint128_t compiler test to Kconfig
  2019-09-25 16:12 ` [RFC PATCH 11/18] int128: move __uint128_t compiler test to Kconfig Ard Biesheuvel
@ 2019-09-25 21:01   ` Linus Torvalds
  2019-09-25 21:19     ` Ard Biesheuvel
  0 siblings, 1 reply; 61+ messages in thread
From: Linus Torvalds @ 2019-09-25 21:01 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Jason A . Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Eric Biggers, Greg KH, Samuel Neves, Linux Crypto Mailing List,
	Andy Lutomirski, Marc Zyngier, Dan Carpenter, Will Deacon,
	David Miller, Linux ARM

On Wed, Sep 25, 2019 at 9:14 AM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
>
>  config ARCH_SUPPORTS_INT128
>         bool
> +       depends on !$(cc-option,-D__SIZEOF_INT128__=0)

Hmm. Does this actually work?

If that "depends on" now ends up being 'n', afaik the people who
_enable_ it just do a

       select ARCH_SUPPORTS_INT128

and now you'll end up with the Kconfig erroring out with

   WARNING: unmet direct dependencies detected for ARCH_SUPPORTS_INT128

and then you end up with CONFIG_ARCH_SUPPORTS_INT128 anyway, instead
of the behavior you _want_ to get, which is to not get that CONFIG
defined at all.

So I heartily agree with your intent, but I don't think that model
works. I think you need to change the cases that currently do

       select ARCH_SUPPORTS_INT128

to instead have that cc-option test.

And take all the above with a pinch of salt. Maybe what you are doing
works, and I am just missing some piece of the puzzle. But I _think_
it's broken, and you didn't test with a compiler that doesn't support
that thing properly.

             Linus


* Re: [RFC PATCH 11/18] int128: move __uint128_t compiler test to Kconfig
  2019-09-25 21:01   ` Linus Torvalds
@ 2019-09-25 21:19     ` Ard Biesheuvel
  0 siblings, 0 replies; 61+ messages in thread
From: Ard Biesheuvel @ 2019-09-25 21:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jason A . Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Eric Biggers, Greg KH, Samuel Neves, Linux Crypto Mailing List,
	Andy Lutomirski, Marc Zyngier, Dan Carpenter, Will Deacon,
	David Miller, Linux ARM

On Wed, 25 Sep 2019 at 23:01, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, Sep 25, 2019 at 9:14 AM Ard Biesheuvel
> <ard.biesheuvel@linaro.org> wrote:
> >
> >  config ARCH_SUPPORTS_INT128
> >         bool
> > +       depends on !$(cc-option,-D__SIZEOF_INT128__=0)
>
> Hmm. Does this actually work?
>
> If that "depends on" now ends up being 'n', afaik the people who
> _enable_ it just do a
>
>        select ARCH_SUPPORTS_INT128
>
> and now you'll end up with the Kconfig erroring out with
>
>    WARNING: unmet direct dependencies detected for ARCH_SUPPORTS_INT128
>
> and then you end up with CONFIG_ARCH_SUPPORTS_INT128 anyway, instead
> of the behavior you _want_ to get, which is to not get that CONFIG
> defined at all.
>
> So I heartily agree with your intent, but I don't think that model
> works. I think you need to change the cases that currently do
>
>        select ARCH_SUPPORTS_INT128
>
> to instead have that cc-option test.
>
> And take all the above with a pinch of salt. Maybe what you are doing
> works, and I am just missing some piece of the puzzle. But I _think_
> it's broken, and you didn't test with a compiler that doesn't support
> that thing properly.
>

I think you may be right.

Instead, I'll add a separate CC_HAS_INT128 symbol with the
$(cc-option) test, and replace occurrences of

select ARCH_SUPPORTS_INT128

with

select ARCH_SUPPORTS_INT128 if CC_HAS_INT128

which is a slightly cleaner approach in any case.
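
Concretely, something along these lines (sketch only, not the final
patch), with CC_HAS_INT128 living in init/Kconfig or wherever it fits
best:

  config CC_HAS_INT128
          def_bool !$(cc-option,-D__SIZEOF_INT128__=0)

and then, in the architectures that currently do the select:

  select ARCH_SUPPORTS_INT128 if CC_HAS_INT128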


* Re: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption
  2019-09-25 16:12 ` [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption Ard Biesheuvel
@ 2019-09-25 22:15   ` Linus Torvalds
  2019-09-25 22:22     ` Linus Torvalds
                       ` (2 more replies)
  0 siblings, 3 replies; 61+ messages in thread
From: Linus Torvalds @ 2019-09-25 22:15 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Jason A . Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Eric Biggers, Greg KH, Samuel Neves, Linux Crypto Mailing List,
	Andy Lutomirski, Marc Zyngier, Dan Carpenter, Will Deacon,
	David Miller, Linux ARM

On Wed, Sep 25, 2019 at 9:14 AM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
>
> Replace the chacha20poly1305() library calls with invocations of the
> RFC7539 AEAD, as implemented by the generic chacha20poly1305 template.

Honestly, the other patches look fine to me from what I've seen (with
the small note I had in a separate email for 11/18), but this one I
consider just nasty, and a prime example of why people hate those
crypto lookup routines.

Some of it is just the fundamental and pointless silly indirection,
that just makes things harder to read, less efficient, and less
straightforward.

That's exemplified by this part of the patch:

>  struct noise_symmetric_key {
> -       u8 key[NOISE_SYMMETRIC_KEY_LEN];
> +       struct crypto_aead *tfm;

which is just one of those "we know what we want and we just want to
use it directly" things, and then the crypto indirection comes along
and makes that simple inline allocation of a small constant size
(afaik it is CHACHA20POLY1305_KEY_SIZE, which is 32) be another
allocation entirely.

And it's some random odd non-typed thing too, so then you have that
silly and stupid dynamic allocation using a name lookup:

   crypto_alloc_aead("rfc7539(chacha20,poly1305)", 0, CRYPTO_ALG_ASYNC);

to create what used to be (and should be) a simple allocation that
has a static type and was just part of the code.

It also ends up doing other bad things, ie that packet-time

+       if (unlikely(crypto_aead_reqsize(key->tfm) > 0)) {
+               req = aead_request_alloc(key->tfm, GFP_ATOMIC);
+               if (!req)
+                       return false;

thing that hopefully _is_ unlikely, but that's just more potential
breakage from that whole async crypto interface.

This is what people do *not* want to do, and why people often don't
like the crypto interfaces.

It's exactly why we then have those "bare" routines as a library where
people just want to access the low-level hashing or whatever directly.

So please don't do things like this for an initial import.

Leave the thing alone, and just use the direct and synchronous static
crypto library interface that you imported in patch 14/18 (but see
below about the incomplete import).

Now, later on, if you can *show* that some async implementation really
really helps, and you have numbers for it, and you can convince people
that the above kind of indirection is _worth_ it, then that's a second
phase. But don't make code uglier without those actual numbers.

Because it's not just uglier and has that silly extra indirection and
potential allocation problems, this part just looks very fragile
indeed:

> The nonce related changes are there to address the mismatch between the
> 96-bit nonce (aka IV) that the rfc7539() template expects, and the 64-bit
> nonce that WireGuard uses.
...
>  struct packet_cb {
> -       u64 nonce;
> -       struct noise_keypair *keypair;
>         atomic_t state;
> +       __le32 ivpad;                   /* pad 64-bit nonce to 96 bits */
> +       __le64 nonce;
> +       struct noise_keypair *keypair;
>         u32 mtu;
>         u8 ds;
>  };

The above is subtle and silently depends on the struct layout.

It really really shouldn't.

Can it be acceptable doing something like that? Yeah, but you really
should be making it very explicit, perhaps using

  struct {
        __le32 ivpad;
        __le64 nonce;
   } __packed;

or something like that.

Because right now you're depending on particular layout of those fields:

> +       aead_request_set_crypt(req, sg, sg, skb->len,
> +                              (u8 *)&PACKET_CB(skb)->ivpad);

but honestly, that's not ok at all.

Somebody makes a slight change to that struct, and it might continue
to work fine on x86-32 (where 64-bit values are only 32-bit aligned)
but subtly break on other architectures.

Also, you changed how the nonce works from being in CPU byte order to
be explicitly LE. That may be ok, and looks like it might be a
cleanup, but honestly I think it should have been done as a separate
patch.

So could you please update that patch 14/18 to also have that
synchronous chacha20poly1305_decrypt_sg() interface, and then just
drop this 18/18 for now?

That would mean that

 (a) you wouldn't need this patch, and you can then do that as a
separate second phase once you have numbers and it can stand on its
own.

 (b) you'd actually have something that *builds* when  you import the
main wireguard patch in 15/18

because right now it looks like you're not only forcing this async
interface with the unnecessary indirection, you're also basically
having a tree that doesn't even build or work for a couple of commits.

And I'm still not convinced (a) ever makes sense - the overhead of any
accelerator is just high enough that I doubt you'll have numbers -
performance _or_ power.

But even if you're right that it might be a power advantage on some
platform, that wouldn't make it an advantage on other platforms. Maybe
it could be done as a config option where you can opt in to the async
interface when that makes sense - but not force the indirection and
extra allocations when it doesn't. As a separate patch, something like
that doesn't sound horrendous (and I think that's also an argument for
doing that CPU->LE change as an independent change).

Yes, yes, there's also that 17/18 that switches over to a different
header file location and Kconfig names but that could easily be folded
into 15/18 and then it would all be bisectable.

Alternatively, maybe 15/18 could be done with wireguard disabled in
the Kconfig (just to make the patch identical), and then 17/18 enables
it when it compiles with a big note about how you wanted to keep 15/18
pristine to make the changes obvious.

Hmm?

I don't really have a dog in this fight, but on the whole I really
liked the series. But this 18/18 raised my hackles, and I think I
understand why it might raise the hackles of the wireguard people.

Please?

     Linus


* Re: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption
  2019-09-25 22:15   ` Linus Torvalds
@ 2019-09-25 22:22     ` Linus Torvalds
  2019-09-26  9:40     ` Pascal Van Leeuwen
  2019-09-26 11:06     ` Ard Biesheuvel
  2 siblings, 0 replies; 61+ messages in thread
From: Linus Torvalds @ 2019-09-25 22:22 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Jason A . Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Eric Biggers, Greg KH, Samuel Neves, Linux Crypto Mailing List,
	Andy Lutomirski, Marc Zyngier, Dan Carpenter, Will Deacon,
	David Miller, Linux ARM

On Wed, Sep 25, 2019 at 3:15 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I don't really have a dog in this fight, but on the whole I really
> liked the series. But this 18/18 raised my hackles, and I think I
> understand why it might raise the hackles of the wireguard people.

To be honest, I guess I _do_ have a dog in the fight, namely the thing
that I'd love to see wireguard merged.

And this series otherwise looked non-offensive to me. Maybe not
everybody is hugely happy with all the details, but it looks like a
good sane "let's use as much of the existing models as possible" that
nobody should absolutely hate.

                   Linus


* Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API
  2019-09-25 16:12 [RFC PATCH 00/18] crypto: wireguard using the existing crypto API Ard Biesheuvel
                   ` (13 preceding siblings ...)
  2019-09-25 16:12 ` [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption Ard Biesheuvel
@ 2019-09-26  8:59 ` Jason A. Donenfeld
  2019-09-26 10:19   ` Pascal Van Leeuwen
  2019-09-26 12:07   ` [RFC PATCH 00/18] crypto: wireguard using the existing crypto API Ard Biesheuvel
  14 siblings, 2 replies; 61+ messages in thread
From: Jason A. Donenfeld @ 2019-09-26  8:59 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Catalin Marinas, Herbert Xu, Arnd Bergmann, Eric Biggers,
	Greg KH, Samuel Neves, Will Deacon, Linux Crypto Mailing List,
	Andy Lutomirski, Marc Zyngier, Dan Carpenter, Linus Torvalds,
	David Miller, linux-arm-kernel

 Hi Ard,

Thanks for taking the initiative on this. When I first discussed with
DaveM porting WireGuard to the crypto API and doing Zinc later
yesterday, I thought to myself, “I wonder if Ard might want to work on
this with me…” and sent you a message on IRC. It didn’t occur to me
that you were the one who had pushed this endeavor!

I must admit, though, I’m a bit surprised to see how it’s appearing.
When I wrote [1], I had really imagined postponing the goals of Zinc
entirely, and instead writing a small shim that calls into the
existing crypto API machinery. I imagined the series to look like
this:

1. Add blake2s generic as a crypto API shash object.
2. Add blake2s x86_64 as a crypto API shash object.
3. Add curve25519 generic as a crypto API dh object.
4. Add curve25519 x86_64 as a crypto API dh object.
5. Add curve25519 arm as a crypto API dh object.
6. The unmodified WireGuard commit.
7. A “cryptoapi.c” file for WireGuard that provides definitions of the
“just functions” approach of Zinc, but does so in terms of the crypto
API’s infrastructure, with global per-cpu lists and a few locks to
handle quick buffer and tfm reuse.

I wouldn’t expect (7) to be pretty, for the various reasons that most
people dislike the crypto API, but at least it would somewhat “work”,
not affect the general integrity of WireGuard, and provide a clear
path forward in an evolutionary manner for gradually, piecemeal,
swapping out pieces of that for a Zinc-like thing, however that winds
up appearing.

Instead what we’ve wound up with in this series is a Frankenstein’s
monster of Zinc, which appears to have basically the same goal as
Zinc, and even much of the same implementation just moved to a
different directory, but then skimps on making it actually work well
and introduces problems. (I’ll elucidate some specific issues later
in this email so that we can get on the same page with regards to
security requirements for WireGuard.) What I surmise from this Zinc-but-not
series is that what is actually going on here is mostly some kind of
power or leadership situation, which is what you’ve described to me
also at various other points and in person. I also recognize that I am
at least partly to blame for whatever dynamic has stagnated
this process; let me try to rectify that:

A principle objection you’ve had is that Zinc moves to its own
directory, with its own name, and tries to segment itself off from the
rest of the crypto API’s infrastructure. You’ve always felt this
should be mixed in with the rest of the crypto API’s infrastructure
and directory structures in one way or another. Let’s do both of those
things – put this in a directory structure you find appropriate and
hook this into the rest of the crypto API’s infrastructure in a way
you find appropriate. I might disagree, which is why Zinc does things
the way it does, but I’m open to compromise and doing things more your
way.

Another objection you’ve had is that Zinc replaces many existing
implementations with its own. Martin wasn’t happy about that either.
So let’s not do that, and we’ll have some wholesale replacement of
implementations in future patchsets at future dates discussed and
benched and bikeshedded independently from this.

Finally, perhaps most importantly, Zinc’s been my design rather than
our design. Let’s do this together instead of me git-send-email(1)-ing
a v37.

If the process of doing that together will be fraught with difficulty,
I’m still open to the “7 patch series” with the ugly cryptoapi.c
approach, as described at the top. But I think if we start with Zinc
and whittle it down in accordance with the above, we’ll get something
mutually acceptable, and somewhat similar to this series, with a few
important exceptions, which illustrate some of the issues I see in
this RFC:

Issue 1) No fast implementations for the “it’s just functions” interface.

This is a deal breaker. I know you disagree here and perhaps think all
dynamic dispatch should be by loadable modules configured with
userspace policy and lots of function pointers and dynamically
composable DSL strings, as the current crypto API does it. But I think
a lot of other people agree with me here (and they’ve chimed in
before) that the branch predictor does things better, doesn’t have
Spectre issues, and is very simple to read and understand. For
reference, here’s what that kind of thing looks like: [2].
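
(Roughly, and with the names simplified for illustration, the shape of
that glue code is:

  static bool chacha20_use_ssse3 __ro_after_init;

  static void __init chacha20_fpu_init(void)
  {
          chacha20_use_ssse3 = boot_cpu_has(X86_FEATURE_SSSE3);
  }

  static inline bool chacha20_arch(struct chacha20_ctx *ctx, u8 *dst,
                                   const u8 *src, size_t len,
                                   simd_context_t *simd_context)
  {
          /* plain, predictable branch on a boot-time flag */
          if (!chacha20_use_ssse3 || !simd_use(simd_context))
                  return false; /* caller falls back to the generic C */
          chacha20_ssse3(dst, src, len, ctx->key, ctx->counter);
          return true;
  }

No function pointers, no indirect calls, just branches that the CPU
predicts essentially for free.)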

In this case, the relevance is that the handshake in WireGuard is
extremely performance sensitive, in order to fend off DoS. One of the
big design gambits in WireGuard is – can we make it 1-RTT to reduce
the complexity of the state machine, but keep the crypto efficient
enough that this is still safe to do from a DoS perspective. The
protocol succeeds at this goal, but in many ways, just by a hair when
at scale, and so I’m really quite loathe to decrease handshake
performance. Here’s where that matters specifically:

- Curve25519 does indeed always appear to be taking tiny 32 byte stack
inputs in WireGuard. However, your statement, “the fact that they
operate on small, fixed size buffers means that there is really no
point in providing alternative, SIMD based implementations of these,
and we can limit ourselves to generic C library version,” is just
plain wrong in this case. Curve25519 only ever operates on 32 byte
inputs, because these represent curve scalars and points. It’s not
like a block cipher where parallelism helps with larger inputs or
something. In this case, there are some pretty massive speed
improvements between the generic C implementations and the optimized
ones. Like huge. On both ARM and on Intel. And Curve25519 is the most
expensive operation in WireGuard, and each handshake message invokes a
few of them. (Aside - Something to look forward to: I’m in the process
of getting a formally verified x86_64 ADX implementation ready for
kernel usage, to replace our existing heavily-fuzzed one, which will
be cool.)

- Blake2s actually does benefit from the optimized code even for
relatively short inputs. While you might have been focused on the
super-super small inputs in noise.c, there are slightly larger ones in
cookie.c, and these are the most sensitive computations to make in
terms of DoS resistance; they’re on the “front lines” of the battle,
if you will. (Aside - Arguably WireGuard may have benefited from using
siphash with 128-bit outputs here, or calculated some security metrics
for DoS resistance in the face of forged 64-bit outputs or something,
or a different custom MAC, but hindsight is 20/20.)

- While 25519 and Blake2s are already in use, the optimized versions
of ChaPoly wind up being faster as well, even if it’s just hitting the
boring SSE code.

- On MIPS, the optimized versions of ChaPoly are a necessity. They’re
boring integer/scalar code, but they do things that the compiler
simply cannot do on the platform and we benefit immensely from it.

Taken together, we simply can’t skimp on the implementations available
on the handshake layer, so we’ll need to add some form of
implementation selection, whether it’s the method Zinc uses ([2]), or
something else we cook up together.

Issue 2) Linus’ objection to the async API invasion is more correct
than he realizes.

I could re-enumerate my objections to the API there, but I think we
all get it. It’s horrendous looking. Even the introduction of the
ivpad member (what on earth?) in the skb cb made me shudder. But
there’s actually another issue at play:

wg_noise_handshake_begin_session→derive_keys→symmetric_key_init is all
part of the handshake. We cannot afford to allocate a brand new crypto
object, parse the DSL string, connect all those function pointers,
etc. The allocations involved here aren’t really okay at all in that
path. That’s why the cryptoapi.c idea above involves just using a pool
of pre-allocated objects if we’re going to be using that API at all.
Also keep in mind that WireGuard instances sometimes have hundreds of
thousands of peers.

I’d recommend leaving this synchronous as it exists now, as Linus
suggested, and we can revisit that later down the road. There are a
number of improvements to the async API we could make down the line
that could make this viable in WireGuard. For example, I could imagine
decoupling the creation of the cipher object from its keys and
intermediate buffers, so that we could in fact allocate the cipher
objects with their DSLs globally in a safe way, while allowing the
keys and working buffers to come from elsewhere. This is deep plumbing
into the async API, but I think we could get there in time.

There’s also a degree of practicality: right now there is zero ChaPoly
async acceleration hardware anywhere that would fit into the crypto
API. At some point, it might come to exist and have incredible
performance, and then we’ll both feel very motivated to make this work
for WireGuard. But it might also not come to be (AES seems to have won
over most of the industry), in which case, why hassle?

Issue 3) WireGuard patch is out of date.

This is my fault, because I haven’t posted in a long time. There are
some important changes in the main WireGuard repo. I’ll roll another
patch soon for this so we have something recent to work off of. Sorry
about that.

Issue 4) FPU register batching?

When I introduced the simd_get/simd_put/simd_relax thing, people
seemed to think it was a good idea. My benchmarks of it showed
significant throughput improvements. Your patchset doesn’t have
anything similar to this. But on the other hand, last I spoke with the
x86 FPU guys, I thought they might actually be in the process of
making simd_get/put obsolete with some internal plumbing to make
restoration lazier. I’ll see tglx later today and will poke him about
this, as this might already be a non-issue.
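
(For reference, the batching pattern that patch 18 removes from the
workers looks roughly like this:

  simd_get(&simd_context);
  while ((skb = ptr_ring_consume_bh(&queue->ring)) != NULL) {
          state = decrypt_packet(skb, &PACKET_CB(skb)->keypair->receiving,
                                 &simd_context) ?
                          PACKET_STATE_CRYPTED : PACKET_STATE_DEAD;
          /* ... enqueue skb with state ... */
          simd_relax(&simd_context); /* briefly drop the FPU if needed */
  }
  simd_put(&simd_context);

so the expensive FPU save/restore is paid once per batch of packets
rather than once per packet.)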


So given the above, how would you like to proceed? My personal
preference would be to see you start with the Zinc patchset and rename
things and change the infrastructure to something that fits your
preferences, and we can see what that looks like. Less appealing would
be to do several iterations of you reworking Zinc from scratch and
going through the exercises all over again, but if you prefer that I
guess I could cope. Alternatively, maybe this is a lot to chew on, and
we should just throw caution to the wind, implement cryptoapi.c for
WireGuard (as described at the top), and add C functions to the crypto
API sometime later? This is what I had envisioned in [1].

And for the avoidance of doubt, or in case any of the above message
belied something different, I really am happy and relieved to have an
opportunity to work on this _with you_, and I am much more open than
before to compromise and finding practical solutions to the past
political issues. Also, if you’re into chat, we can always spec some
of the nitty-gritty aspects out over IRC or even the old-fashioned
telephone. Thanks again for pushing this forward.

Regards,
Jason

[1] https://lore.kernel.org/wireguard/CAHmME9pmfZAp5zd9BDLFc2fWUhtzZcjYZc2atTPTyNFFmEdHLg@mail.gmail.com/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/tree/lib/zinc/chacha20/chacha20-x86_64-glue.c?h=jd/wireguard#n54


* RE: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption
  2019-09-25 22:15   ` Linus Torvalds
  2019-09-25 22:22     ` Linus Torvalds
@ 2019-09-26  9:40     ` Pascal Van Leeuwen
  2019-09-26 16:35       ` Linus Torvalds
  2019-09-26 11:06     ` Ard Biesheuvel
  2 siblings, 1 reply; 61+ messages in thread
From: Pascal Van Leeuwen @ 2019-09-26  9:40 UTC (permalink / raw)
  To: Linus Torvalds, Ard Biesheuvel
  Cc: Jason A . Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Eric Biggers, Greg KH, Samuel Neves, Linux Crypto Mailing List,
	Andy Lutomirski, Marc Zyngier, Dan Carpenter, Will Deacon,
	David Miller, Linux ARM

> >
> > Replace the chacha20poly1305() library calls with invocations of the
> > RFC7539 AEAD, as implemented by the generic chacha20poly1305 template.
> 
> Honestly, the other patches look fine to me from what I've seen (with
> the small note I had in a separate email for 11/18), but this one I
> consider just nasty, and a prime example of why people hate those
> crypto lookup routines.
> 
> Some of it is just the fundamental and pointless silly indirection,
> that just makes things harder to read, less efficient, and less
> straightforward.
> 
> That's exemplified by this part of the patch:
> 
> >  struct noise_symmetric_key {
> > -       u8 key[NOISE_SYMMETRIC_KEY_LEN];
> > +       struct crypto_aead *tfm;
> 
> which is just one of those "we know what we want and we just want to
> use it directly" things, and then the crypto indirection comes along
> and makes that simple inline allocation of a small constant size
> (afaik it is CHACHA20POLY1305_KEY_SIZE, which is 32) be another
> allocation entirely.
> 
> And it's some random odd non-typed thing too, so then you have that
> silly and stupid dynamic allocation using a name lookup:
> 
>    crypto_alloc_aead("rfc7539(chacha20,poly1305)", 0, CRYPTO_ALG_ASYNC);
> 
> to create what used to be (and should be) a simple allocation that
> has a static type and was just part of the code.
> 
While I agree with the principle of first merging Wireguard without 
hooking it up to the Crypto API and doing the latter in a later,
separate patch, I DON'T agree with your bashing of the Crypto API
or HW crypto acceleration in general.

Yes, I do agree  that if you need to do the occasional single crypto 
op for a fixed algorithm on a small amount of data then you should
just use a simple direct  library call. I'm all for a Zinc type 
library for that.
(and I believe Ard is actually actively making such changes already)

However, if you're doing bulk crypto like network packet processing
(as Wireguard does!) or disk/filesystem encryption, then that cipher
allocation only happens once every blue moon and the overhead for
that is totally *irrelevant* as it is amortized over many hours or 
days of runtime.

While I generally dislike this whole hype of storing stuff in
textual formats like XML and JSON and then wasting lots of CPU
cycles on parsing that, I've learned to appreciate the power of
these textual Crypto API templates, as they allow a hardware 
accelerator to advertise complex combined operations as single
atomic calls, amortizing the communication overhead between SW
and HW. It's actually very flexible and powerful!
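
To illustrate (rough sketch, the myhw_* callbacks are made up): a
driver simply registers the combined transform under the very same
template name, and the lookup then prefers it over the generic
template based on priority:

  static struct aead_alg myhw_chachapoly = {
          .setkey      = myhw_chachapoly_setkey,
          .encrypt     = myhw_chachapoly_encrypt,
          .decrypt     = myhw_chachapoly_decrypt,
          .ivsize      = 12,
          .maxauthsize = 16,
          .base = {
                  .cra_name        = "rfc7539(chacha20,poly1305)",
                  .cra_driver_name = "rfc7539-chacha20-poly1305-myhw",
                  .cra_priority    = 3000, /* beats the SW template */
                  .cra_blocksize   = 1,
                  .cra_flags       = CRYPTO_ALG_ASYNC,
                  .cra_module      = THIS_MODULE,
                  /* ctxsize, init/exit etc. omitted for brevity */
          },
  };

  /* in the driver's probe routine: */
  return crypto_register_aead(&myhw_chachapoly);

That way the whole AEAD is a single atomic call into the HW, instead
of bouncing between SW and HW for the ChaCha20 and Poly1305 halves.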

> It also ends up doing other bad things, ie that packet-time
> 
> +       if (unlikely(crypto_aead_reqsize(key->tfm) > 0)) {
> +               req = aead_request_alloc(key->tfm, GFP_ATOMIC);
> +               if (!req)
> +                       return false;
> 
> thing that hopefully _is_ unlikely, but that's just more potential
> breakage from that whole async crypto interface.
> 
> This is what people do *not* want to do, and why people often don't
> like the crypto interfaces.
> 
Life is all about needing to do things you don't like to do ...
If you want the performance, you need to put in the effort. It's that simple.
HW acceleration surely won't work from a naive synchronous interface.
(Same applies to running crypto in a separate thread on the CPU BTW!)

In any case, Wireguard bulk crypto *should* eventually run on top
of Crypto API such that it can leverage *existing* HW acceleration.
It would be incredibly silly not to do so, given the HW exists!

> And I'm still not convinced (a) ever makes sense - the overhead of any
> accelerator is just high enough that I doubt you'll have numbers -
> performance _or_ power.
> 
You shouldn't make such assertions if you obviously don't know what
you're talking about. Yes, there is significant overhead on the CPU
for doing lookaside crypto, but it's (usually) nothing compared to
doing the actual crypto itself on the CPU barring a few exceptions. 
(Notably AES-GCM or AES-CTR on ARM64 or x64 CPU's and *maybe* 
Chacha-Poly on recent Intel CPU's - but there's a *lot* more crypto 
being used out there than just AES-GCM and Chacha-Poly, not to 
mention a lot more less capable (embedded) CPU's running Linux)

For anything but those exceptions, we blow even the fastest Intel
server CPU's out of the water with our crypto accelerators.
(I can bore you with some figures actually measured with the
Crypto API on our HW, once I'm done optimizing the driver and I 
have some time to collect the results)

And in any case, for somewhat larger blocks/packets, the CPU overhead
of driving the accelerator is at least lower than what it would cost
the CPU to do the crypto itself - even if the CPU happens to be faster -
so there is room left to do *other*, presumably more useful, work.

Then there's indeed the power consumption issue, which is complex
because crypto power != total system power so it depends too much on
the actual use case to make generic statements on it. So I'll leave
that with the remark that Intel server CPU's have to seriously
throttle down their clock if you start using AVX512 for crypto, just to
stay within their power budget, while we can do the same performance
(~200 Gbps) in just a few (~2) Watts on a similar technology node.
(excluding the CPU management overhead, but that surely won't consume
excessive power like AVX512)

> But even if you're right that it might be a power advantage on some
> platform, that wouldn't make it an advantage on other platforms. Maybe
> it could be done as a config option where you can opt in to the async
> interface when that makes sense - but not force the indirection and
> extra allocations when it doesn't. As a separate patch, something like
> that doesn't sound horrendous (and I think that's also an argument for
> doing that CPU->LE change as an independent change).
> 
Making it a switch sounds good to me though.

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
www.insidesecure.com

* RE: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API
  2019-09-26  8:59 ` [RFC PATCH 00/18] crypto: wireguard using the existing crypto API Jason A. Donenfeld
@ 2019-09-26 10:19   ` Pascal Van Leeuwen
  2019-09-26 10:59     ` Jason A. Donenfeld
  2019-09-26 11:06     ` chapoly acceleration hardware [Was: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API] Jason A. Donenfeld
  2019-09-26 12:07   ` [RFC PATCH 00/18] crypto: wireguard using the existing crypto API Ard Biesheuvel
  1 sibling, 2 replies; 61+ messages in thread
From: Pascal Van Leeuwen @ 2019-09-26 10:19 UTC (permalink / raw)
  To: Jason A. Donenfeld, Ard Biesheuvel
  Cc: Catalin Marinas, Herbert Xu, Arnd Bergmann, Eric Biggers,
	Greg KH, Samuel Neves, Will Deacon, Linux Crypto Mailing List,
	Andy Lutomirski, Marc Zyngier, Dan Carpenter, Linus Torvalds,
	David Miller, linux-arm-kernel

> There’s also a degree of practicality: right now there is zero ChaPoly
> async acceleration hardware anywhere that would fit into the crypto
> API.
>
Actually, that assumption is factually wrong. I don't know if anything
is *publicly* available, but I can assure you the silicon is running in
labs already. And something will be publicly available early next year
at the latest. Which could nicely coincide with having Wireguard support
in the kernel (which I would also like to see happen BTW) ...

> At some point, it might come to exist and have incredible
> performance, and then we’ll both feel very motivated to make this work
> for WireGuard. But it might also not come to be (AES seems to have won
> over most of the industry), in which case, why hassle?
>
Not "at some point". It will. Very soon. Maybe not in consumer or server
CPUs, but definitely in the embedded (networking) space.
And it *will* be much faster than the embedded CPU next to it, so it will 
be worth using it for something like bulk packet encryption.

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
www.insidesecure.com

* Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API
  2019-09-26 10:19   ` Pascal Van Leeuwen
@ 2019-09-26 10:59     ` Jason A. Donenfeld
  2019-09-26 11:06     ` chapoly acceleration hardware [Was: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API] Jason A. Donenfeld
  1 sibling, 0 replies; 61+ messages in thread
From: Jason A. Donenfeld @ 2019-09-26 10:59 UTC (permalink / raw)
  To: Pascal Van Leeuwen
  Cc: Catalin Marinas, Herbert Xu, Arnd Bergmann, Ard Biesheuvel,
	Greg KH, Eric Biggers, Samuel Neves, Will Deacon,
	Linux Crypto Mailing List, Andy Lutomirski, Marc Zyngier,
	Dan Carpenter, Linus Torvalds, David Miller, linux-arm-kernel

On Thu, Sep 26, 2019 at 12:19 PM Pascal Van Leeuwen
<pvanleeuwen@verimatrix.com> wrote:
> Actually, that assumption is factually wrong. I don't know if anything
> is *publicly* available, but I can assure you the silicon is running in
> labs already.

Great to hear, and thanks for the information. I'll follow up with
some questions on this in another thread.


* Re: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption
  2019-09-25 22:15   ` Linus Torvalds
  2019-09-25 22:22     ` Linus Torvalds
  2019-09-26  9:40     ` Pascal Van Leeuwen
@ 2019-09-26 11:06     ` Ard Biesheuvel
  2019-09-26 12:34       ` Ard Biesheuvel
  2 siblings, 1 reply; 61+ messages in thread
From: Ard Biesheuvel @ 2019-09-26 11:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jason A . Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Eric Biggers, Greg KH, Samuel Neves, Linux Crypto Mailing List,
	Andy Lutomirski, Marc Zyngier, Dan Carpenter, Will Deacon,
	David Miller, Linux ARM

On Thu, 26 Sep 2019 at 00:15, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, Sep 25, 2019 at 9:14 AM Ard Biesheuvel
> <ard.biesheuvel@linaro.org> wrote:
> >
> > Replace the chacha20poly1305() library calls with invocations of the
> > RFC7539 AEAD, as implemented by the generic chacha20poly1305 template.
>
> Honestly, the other patches look fine to me from what I've seen (with
> the small note I had in a separate email for 11/18), but this one I
> consider just nasty, and a prime example of why people hate those
> crypto lookup routines.
>
> Some of it is just the fundamental and pointless silly indirection,
> that just makes things harder to read, less efficient, and less
> straightforward.
>
> That's exemplified by this part of the patch:
>
> >  struct noise_symmetric_key {
> > -       u8 key[NOISE_SYMMETRIC_KEY_LEN];
> > +       struct crypto_aead *tfm;
>
> which is just one of those "we know what we want and we just want to
> use it directly" things, and then the crypto indirection comes along
> and makes that simple inline allocation of a small constant size
> (afaik it is CHACHA20POLY1305_KEY_SIZE, which is 32) be another
> allocation entirely.
>
> And it's some random odd non-typed thing too, so then you have that
> silly and stupid dynamic allocation using a name lookup:
>
>    crypto_alloc_aead("rfc7539(chacha20,poly1305)", 0, CRYPTO_ALG_ASYNC);
>
> to create what used to be (and should be) a simple allocation that
> has a static type and was just part of the code.
>

That crypto_alloc_aead() call does a lot of things under the hood:
- use an existing instantiation of rfc7539(chacha20,poly1305) if available,
- look for modules that implement the whole transformation directly,
- if none are found, instantiate the rfc7539 template, which will
essentially do the above for chacha20 and poly1305, potentially using
per-arch accelerated implementations if available (for either), or
otherwise, fall back to the generic versions.

What *I* see as the issue here is not that we need to do this at all,
but that we have to do it for each value of the key. IMO, it would be
much better to instantiate this thing only once, and have a way of
passing a per-request key into it, permitting us to hide the whole
thing behind the existing library interface.


> It also ends up doing other bad things, ie that packet-time
>
> +       if (unlikely(crypto_aead_reqsize(key->tfm) > 0)) {
> +               req = aead_request_alloc(key->tfm, GFP_ATOMIC);
> +               if (!req)
> +                       return false;
>
> thing that hopefully _is_ unlikely, but that's just more potential
> breakage from that whole async crypto interface.
>
> This is what people do *not* want to do, and why people often don't
> like the crypto interfaces.
>
> It's exactly why we then have those "bare" routines as a library where
> people just want to access the low-level hashing or whatever directly.
>
> So please don't do things like this for an initial import.
>

This is tied to the zero reqsize patch earlier in the series. If your
crypto_alloc_aead() call gets fulfilled by the template we modified
earlier (and which is the only synchronous implementation we currently
have), this is guaranteed to be unlikely/false. For async
implementations, however, we need this allocation to be on the heap
since the stack will go away before the call completes.

> Leave the thing alone, and just use the direct and synchronous static
> crypto library interface that you imported in patch 14/18 (but see
> below about the incomplete import).
>

Patch #14 only imports the C library version, not any of the
accelerated versions that Jason proposed. So we will need a way to
hook all the existing accelerated drivers into that library interface,
add handling for SIMD etc.

> Now, later on, if you can *show* that some async implementation really
> really helps, and you have numbers for it, and you can convince people
> that the above kind of indirection is _worth_ it, then that's a second
> phase. But don't make code uglier without those actual numbers.
>
> Because it's not just uglier and has that silly extra indirection and
> potential allocation problems, this part just looks very fragile
> indeed:
>
> > The nonce related changes are there to address the mismatch between the
> > 96-bit nonce (aka IV) that the rfc7539() template expects, and the 64-bit
> > nonce that WireGuard uses.
> ...
> >  struct packet_cb {
> > -       u64 nonce;
> > -       struct noise_keypair *keypair;
> >         atomic_t state;
> > +       __le32 ivpad;                   /* pad 64-bit nonce to 96 bits */
> > +       __le64 nonce;
> > +       struct noise_keypair *keypair;
> >         u32 mtu;
> >         u8 ds;
> >  };
>
> The above is subtle and silently depends on the struct layout.
>
> It really really shouldn't.
>
> Can it be acceptable doing something like that? Yeah, but you really
> should be making it very explicit, perhaps using
>
>   struct {
>         __le32 ivpad;
>         __le64 nonce;
>    } __packed;
>
> or something like that.
>

This is what I started out with, but the packed struct causes GCC on
architectures that care about alignment to switch to the unaligned
accessors, which I tried to avoid. I'll add a build_bug to ensure that
offset(ivpad) + 4 == offset(nonce).
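
Something along these lines (sketch), anywhere the layout assumption is
actually relied upon, e.g. in the send/receive paths:

  BUILD_BUG_ON(offsetof(struct packet_cb, nonce) !=
               offsetof(struct packet_cb, ivpad) + sizeof(__le32));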

> Because right now you're depending on particular layout of those fields:
>
> > +       aead_request_set_crypt(req, sg, sg, skb->len,
> > +                              (u8 *)&PACKET_CB(skb)->ivpad);
>
> but honestly, that's not ok at all.
>
> Somebody makes a slight change to that struct, and it might continue
> to work fine on x86-32 (where 64-bit values are only 32-bit aligned)
> but subtly break on other architectures.
>
> Also, you changed how the nonce works from being in CPU byte order to
> be explicitly LE. That may be ok, and looks like it might be a
> cleanup, but honestly I think it should have been done as a separate
> patch.
>

Fair enough.

> So could you please update that patch 14/18 to also have that
> synchronous chacha20poly1305_decrypt_sg() interface, and then just
> drop this 18/18 for now?
>

Hmm, not really, because then performance is going to suck. The way we
organise the code in the crypto API today is to have generic C
libraries in lib/crypto, and use the full API for per-arch accelerated
code (or async accelerators). Patch #14 uses the former.

We'll need a way to refactor the existing accelerated code so it is
exposed via the library interface, and there are a couple of options:
- modify our AEAD code so it can take per-request keys - that way, we
could instantiate a single TFM in the implementation of the library,
and keep the chacha20poly1305_[en|de]crypt_sg() interfaces intact,
- create arch/*/lib versions of all the accelerated ChaCha20 and
Poly1305 routines we have, so that the ordinary library precedence
rules give you the fastest implementation available - we may need some
tweaks to the module loader for weak symbols etc to make this
seamless, but it should be doable
- add an entirely new accelerated crypto library stack on the side (aka Zinc)


> That would mean that
>
>  (a) you wouldn't need this patch, and you can then do that as a
> separate second phase once you have numbers and it can stand on its
> own.
>
>  (b) you'd actually have something that *builds* when  you import the
> main wireguard patch in 15/18
>
> because right now it looks like you're not only forcing this async
> interface with the unnecessary indirection, you're also basically
> having a tree that doesn't even build or work for a couple of commits.
>

True, but this was intentional, and not intended for merge as-is.

> And I'm still not convinced (a) ever makes sense - the overhead of any
> accelerator is just high enough that I doubt you'll have numbers -
> performance _or_ power.
>
> But even if you're right that it might be a power advantage on some
> platform, that wouldn't make it an advantage on other platforms. Maybe
> it could be done as a config option where you can opt in to the async
> interface when that makes sense - but not force the indirection and
> extra allocations when it doesn't.

I know the code isn't pretty, but it looks worse than it is. I'll look
into using the new static calls framework to instantiate the library
interface based on whatever the platform provides as synchronous
implementations of ChaCha20 and Poly1305.
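
I.e., something along these lines, with all names invented and assuming
static_call()/static_call_update() land in roughly their currently
proposed form:

  #include <linux/static_call.h>

  /* chacha20_generic()/chacha20_neon() declared elsewhere */
  DEFINE_STATIC_CALL(wg_chacha20, chacha20_generic);

  void chacha20(u32 state[16], u8 *dst, const u8 *src, unsigned int bytes)
  {
          static_call(wg_chacha20)(state, dst, src, bytes);
  }

  static int __init chacha20_select_impl(void)
  {
          /* arm64 shown as an example; pick the NEON code if we can */
          if (cpu_have_named_feature(ASIMD))
                  static_call_update(wg_chacha20, chacha20_neon);
          return 0;
  }
  arch_initcall(chacha20_select_impl);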

> As a separate patch, something like
> that doesn't sound horrendous (and I think that's also an argument for
> doing that CPU->LE change as an independent change).
>
> Yes, yes, there's also that 17/18 that switches over to a different
> header file location and Kconfig names but that could easily be folded
> into 15/18 and then it would all be bisectable.
>
> Alternatively, maybe 15/18 could be done with wireguard disabled in
> the Kconfig (just to make the patch identical), and then 17/18 enables
> it when it compiles with a big note about how you wanted to keep 15/18
> pristine to make the changes obvious.
>
> Hmm?
>
> I don't really have a dog in this fight, but on the whole I really
> liked the series. But this 18/18 raised my hackles, and I think I
> understand why it might raise the hackles of the wireguard people.
>

We're likely to spend some time discussing all this before I get
around to respinning this (if ever). But I am also a fan of WireGuard,
and I am eager to finish this discussion once and for all.


* chapoly acceleration hardware [Was: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API]
  2019-09-26 10:19   ` Pascal Van Leeuwen
  2019-09-26 10:59     ` Jason A. Donenfeld
@ 2019-09-26 11:06     ` Jason A. Donenfeld
  2019-09-26 11:38       ` Toke Høiland-Jørgensen
                         ` (2 more replies)
  1 sibling, 3 replies; 61+ messages in thread
From: Jason A. Donenfeld @ 2019-09-26 11:06 UTC (permalink / raw)
  To: Pascal Van Leeuwen
  Cc: Toke Høiland-Jørgensen, Catalin Marinas, Herbert Xu,
	Arnd Bergmann, Ard Biesheuvel, Greg KH, Eric Biggers, Dave Taht,
	Willy Tarreau, Samuel Neves, Will Deacon, Netdev,
	Linux Crypto Mailing List, Andy Lutomirski, Marc Zyngier,
	Dan Carpenter, Linus Torvalds, David Miller, linux-arm-kernel

[CC +willy, toke, dave, netdev]

Hi Pascal

On Thu, Sep 26, 2019 at 12:19 PM Pascal Van Leeuwen
<pvanleeuwen@verimatrix.com> wrote:
> Actually, that assumption is factually wrong. I don't know if anything
> is *publicly* available, but I can assure you the silicon is running in
> labs already. And something will be publicly available early next year
> at the latest. Which could nicely coincide with having Wireguard support
> in the kernel (which I would also like to see happen BTW) ...
>
> Not "at some point". It will. Very soon. Maybe not in consumer or server
> CPUs, but definitely in the embedded (networking) space.
> And it *will* be much faster than the embedded CPU next to it, so it will
> be worth using it for something like bulk packet encryption.

Super! I was wondering if you could speak a bit more about the
interface. My biggest questions surround latency. Will it be
synchronous or asynchronous? If the latter, why? What will its
latencies be? How deep will its buffers be? The reason I ask is that a
lot of crypto acceleration hardware of the past has been fast but has
had very deep buffers, at great expense of latency. In the
networking context, keeping latency low is pretty important. Already
WireGuard is multi-threaded which isn't super great all the time for
latency (improvements are a work in progress). If you're involved with
the design of the hardware, perhaps this is something you can help
ensure winds up working well? For example, AES-NI is straightforward
and good, but Intel can do that because they are the CPU. It sounds
like your silicon will be adjacent. How do you envision this working
in a low latency environment?

Jason


* Re: chapoly acceleration hardware [Was: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API]
  2019-09-26 11:06     ` chapoly acceleration hardware [Was: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API] Jason A. Donenfeld
@ 2019-09-26 11:38       ` Toke Høiland-Jørgensen
  2019-09-26 13:52       ` Pascal Van Leeuwen
  2019-09-26 22:47       ` Jakub Kicinski
  2 siblings, 0 replies; 61+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-09-26 11:38 UTC (permalink / raw)
  To: Jason A. Donenfeld, Pascal Van Leeuwen
  Cc: Catalin Marinas, Herbert Xu, Arnd Bergmann, Ard Biesheuvel,
	Greg KH, Eric Biggers, Dave Taht, Willy Tarreau, Samuel Neves,
	Will Deacon, Netdev, Linux Crypto Mailing List, Andy Lutomirski,
	Marc Zyngier, Dan Carpenter, Linus Torvalds, David Miller,
	linux-arm-kernel

"Jason A. Donenfeld" <Jason@zx2c4.com> writes:

> [CC +willy, toke, dave, netdev]
>
> Hi Pascal
>
> On Thu, Sep 26, 2019 at 12:19 PM Pascal Van Leeuwen
> <pvanleeuwen@verimatrix.com> wrote:
>> Actually, that assumption is factually wrong. I don't know if anything
>> is *publicly* available, but I can assure you the silicon is running in
>> labs already. And something will be publicly available early next year
>> at the latest. Which could nicely coincide with having Wireguard support
>> in the kernel (which I would also like to see happen BTW) ...
>>
>> Not "at some point". It will. Very soon. Maybe not in consumer or server
>> CPUs, but definitely in the embedded (networking) space.
>> And it *will* be much faster than the embedded CPU next to it, so it will
>> be worth using it for something like bulk packet encryption.
>
> Super! I was wondering if you could speak a bit more about the
> interface. My biggest questions surround latency. Will it be
> synchronous or asynchronous? If the latter, why? What will its
> latencies be? How deep will its buffers be? The reason I ask is that a
> lot of crypto acceleration hardware of the past has been fast but has
> had very deep buffers, at great expense of latency. In the
> networking context, keeping latency low is pretty important. Already
> WireGuard is multi-threaded which isn't super great all the time for
> latency (improvements are a work in progress). If you're involved with
> the design of the hardware, perhaps this is something you can help
> ensure winds up working well? For example, AES-NI is straightforward
> and good, but Intel can do that because they are the CPU. It sounds
> like your silicon will be adjacent. How do you envision this working
> in a low latency environment?

Being asynchronous doesn't *necessarily* have to hurt latency; you just
need the right queue back-pressure.


We already have multiple queues in the stack. With an async crypto
engine we would go from something like:

stack -> [qdisc] -> wg if -> [wireguard buffer] -> netdev driver ->
device -> [device buffer] -> wire

to

stack -> [qdisc] -> wg if -> [wireguard buffer] -> crypto stack ->
crypto device -> [crypto device buffer] -> wg post-crypto -> netdev
driver -> device -> [device buffer] -> wire

(where everything in [] is a packet queue).

The wireguard buffer is the source of the latency you're alluding to
above (the comment about multi-threaded behaviour), so we probably need
to fix that anyway. For the device buffer we have BQL to keep it at a
minimum. So that leaves the buffering in the crypto offload device. If
we add something like BQL to the crypto offload drivers, we could
conceivably avoid having that add a significant amount of latency. In
fact, doing so may benefit other users of crypto offloads as well, no?
Presumably ipsec has this same issue?
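
Conceptually something like this, reusing the dynamic queue limits code
that BQL is built on; completely untested, and the offload_queue driver
structure is of course made up:

  #include <crypto/aead.h>
  #include <linux/dynamic_queue_limits.h>

  struct offload_queue {
          struct dql dql;                 /* dql_init()'ed at setup time */
          /* plus the usual ring/doorbell machinery */
  };

  static int offload_enqueue(struct offload_queue *q, struct aead_request *req)
  {
          if (dql_avail(&q->dql) < 0)
                  return -EBUSY;          /* push back; caller requeues later */

          dql_queued(&q->dql, req->cryptlen);
          /* ... hand the request to the hardware ... */
          return -EINPROGRESS;
  }

  static void offload_complete(struct offload_queue *q, struct aead_request *req)
  {
          dql_completed(&q->dql, req->cryptlen);
          /* ... invoke req->base.complete() ... */
  }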


Caveat: I am fairly ignorant about the inner workings of the crypto
subsystem, so please excuse any inaccuracies in the above; the diagrams
are solely for illustrative purposes... :)

-Toke


* Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API
  2019-09-26  8:59 ` [RFC PATCH 00/18] crypto: wireguard using the existing crypto API Jason A. Donenfeld
  2019-09-26 10:19   ` Pascal Van Leeuwen
@ 2019-09-26 12:07   ` Ard Biesheuvel
  2019-09-26 13:06     ` Pascal Van Leeuwen
  2019-09-26 20:47     ` Jason A. Donenfeld
  1 sibling, 2 replies; 61+ messages in thread
From: Ard Biesheuvel @ 2019-09-26 12:07 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Catalin Marinas, Herbert Xu, Arnd Bergmann, Eric Biggers,
	Greg KH, Samuel Neves, Will Deacon, Linux Crypto Mailing List,
	Andy Lutomirski, Marc Zyngier, Dan Carpenter, Linus Torvalds,
	David Miller, linux-arm-kernel

On Thu, 26 Sep 2019 at 10:59, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
...
>
> Instead what we’ve wound up with in this series is a Frankenstein’s
> monster of Zinc, which appears to have basically the same goal as
> Zinc, and even much of the same implementation just moved to a
> different directory, but then skimps on making it actually work well
> and introduces problems. (I’ll elucidate on some specific issues later
> in this email so that we can get on the same page with regards to
> security requirements for WireGuard.) What I surmise from this
> Zinc-but-not-Zinc series is that what is actually going on here is
> mostly some kind of
> power or leadership situation, which is what you’ve described to me
> also at various other points and in person.

I'm not sure what you are alluding to here. I have always been very
clear about what I like about Zinc and what I don't like about Zinc.

I agree that it makes absolutely no sense for casual, in-kernel crypto
to jump through all the hoops that the crypto API requires. But for
operating on big chunks of data on the kernel heap, we have an
existing API that we should leverage if we can, and fix if we need to
so that all its users can benefit.

> I also recognize that I am
> at least part way to blame for whatever dynamic there has stagnated
> this process; let me try to rectify that:
>
> A principle objection you’ve had is that Zinc moves to its own
> directory, with its own name, and tries to segment itself off from the
> rest of the crypto API’s infrastructure. You’ve always felt this
> should be mixed in with the rest of the crypto API’s infrastructure
> and directory structures in one way or another. Let’s do both of those
> things – put this in a directory structure you find appropriate and
> hook this into the rest of the crypto API’s infrastructure in a way
> you find appropriate. I might disagree, which is why Zinc does things
> the way it does, but I’m open to compromise and doing things more your
> way.
>

It doesn't have to be your way or my way. The whole point of being
part of this community is that we find solutions that work for
everyone, through discussion and iterative prototyping. Turning up out
of the blue with a 50,000 line patch set and a take-it-or-leave-it
attitude goes counter to that, and this is why we have made so little
progress over the past year.

But I am happy with your willingness to collaborate and find common
ground, which was also my motivation for spending a considerable
amount of time to prepare this patch set.

> Another objection you’ve had is that Zinc replaces many existing
> implementations with its own. Martin wasn’t happy about that either.
> So let’s not do that, and we’ll have some wholesale replacement of
> implementations in future patchsets at future dates discussed and
> benched and bikeshedded independently from this.
>
> Finally, perhaps most importantly, Zinc’s been my design rather than
> our design. Let’s do this together instead of me git-send-email(1)-ing
> a v37.
>
> If the process of doing that together will be fraught with difficulty,
> I’m still open to the “7 patch series” with the ugly cryptoapi.c
> approach, as described at the top.

If your aim is to write ugly code and use that as a munition

> But I think if we start with Zinc
> and whittle it down in accordance with the above, we’ll get something
> mutually acceptable, and somewhat similar to this series, with a few
> important exceptions, which illustrate some of the issues I see in
> this RFC:
>
> Issue 1) No fast implementations for the “it’s just functions” interface.
>
> This is a deal breaker. I know you disagree here and perhaps think all
> dynamic dispatch should be by loadable modules configured with
> userspace policy and lots of function pointers and dynamically
> composable DSL strings, as the current crypto API does it. But I think
> a lot of other people agree with me here (and they’ve chimed in
> before) that the branch predictor does things better, doesn’t have
> Spectre issues, and is very simple to read and understand. For
> reference, here’s what that kind of thing looks like: [2].
>

This is one of the issues in the 'fix it for everyone else as well'
category. If we can improve the crypto API to be less susceptible to
these issues (e.g., using static calls), everybody benefits. I'd be
happy to collaborate on that.

> In this case, the relevance is that the handshake in WireGuard is
> extremely performance sensitive, in order to fend off DoS. One of the
> big design gambits in WireGuard is – can we make it 1-RTT to reduce
> the complexity of the state machine, but keep the crypto efficient
> enough that this is still safe to do from a DoS perspective. The
> protocol succeeds at this goal, but in many ways, just by a hair when
> at scale, and so I’m really quite loathe to decrease handshake
> performance.
...
> Taken together, we simply can’t skimp on the implementations available
> on the handshake layer, so we’ll need to add some form of
> implementation selection, whether it’s the method Zinc uses ([2]), or
> something else we cook up together.
>

So are you saying that the handshake timing constraints in the
WireGuard protocol are so stringent that we can't run it securely on,
e.g., an ARM CPU that lacks a NEON unit? Or given that you are not
providing accelerated implementations of blake2s or Curve25519 for
arm64, we can't run it securely on arm64 at all?

Typically, I would prefer to only introduce different versions of the
same algorithm if there is a clear performance benefit for an actual
use case.

Framing this as a security issue rather than a performance issue is
slightly disingenuous, since people are less likely to challenge it.
But the security of any VPN protocol worth its salt should not hinge
on the performance delta between the reference C code and a version
that was optimized for a particular CPU.

> Issue 2) Linus’ objection to the async API invasion is more correct
> than he realizes.
>
> I could re-enumerate my objections to the API there, but I think we
> all get it. It’s horrendous looking. Even the introduction of the
> ivpad member (what on earth?) in the skb cb made me shudder.

Your implementation of RFC7539 truncates the nonce to 64-bits, while
RFC7539 defines a clear purpose for the bits you omit. Since the Zinc
library is intended to be standalone (and you are proposing its use in
other places, like big_keys.c), you might want to document your
justification for doing so in the general case, instead of ridiculing
the code I needed to write to work around this limitation.

> But
> there’s actually another issue at play:
>
> wg_noise_handshake_begin_session→derive_keys→symmetric_key_init is all
> part of the handshake. We cannot afford to allocate a brand new crypto
> object, parse the DSL string, connect all those function pointers,
> etc.

Parsing the string and connecting the function pointers happens only
once, and only when the transform needs to be instantiated from its
constituent parts. Subsequent invocations will just grab the existing
object.

> The allocations involved here aren’t really okay at all in that
> path. That’s why the cryptoapi.c idea above involves just using a pool
> of pre-allocated objects if we’re going to be using that API at all.
> Also keep in mind that WireGuard instances sometimes have hundreds of
> thousands of peers.
>

My preference would be to address this by permitting per-request keys
in the AEAD layer. That way, we can instantiate the transform only
once, and just invoke it with the appropriate key on the hot path (and
avoid any per-keypair allocations)
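
Something along these lines, purely hypothetical since the current AEAD
API has no per-request keys (the req->key/req->keylen fields,
aead_request_set_key() and the shared_chapoly_tfm global are all made
up):

  /* hypothetical extension - these fields do not exist in struct aead_request */
  static inline void aead_request_set_key(struct aead_request *req,
                                          const u8 *key, unsigned int keylen)
  {
          req->key = key;
          req->keylen = keylen;
  }

  /* hot path: one shared TFM for everyone, key supplied per request */
  static int encrypt_with_keypair(struct aead_request *req, struct sk_buff *skb,
                                  struct scatterlist *sg, u8 *iv,
                                  struct noise_keypair *keypair)
  {
          aead_request_set_tfm(req, shared_chapoly_tfm);
          aead_request_set_key(req, keypair->sending.key,
                               CHACHA20POLY1305_KEY_SIZE);
          aead_request_set_ad(req, 0);
          aead_request_set_crypt(req, sg, sg, skb->len, iv);

          return crypto_aead_encrypt(req);
  }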

> I’d recommend leaving this synchronous as it exists now, as Linus
> suggested, and we can revisit that later down the road. There are a
> number of improvements to the async API we could make down the line
> that could make this viable in WireGuard. For example, I could imagine
> decoupling the creation of the cipher object from its keys and
> intermediate buffers, so that we could in fact allocate the cipher
> objects with their DSLs globally in a safe way, while allowing the
> keys and working buffers to come from elsewhere. This is deep plumbing
> into the async API, but I think we could get there in time.
>

My changes actually move all the rfc7539() intermediate buffers to the
stack, so the only remaining allocation is the per-keypair one.

> There’s also a degree of practicality: right now there is zero ChaPoly
> async acceleration hardware anywhere that would fit into the crypto
> API. At some point, it might come to exist and have incredible
> performance, and then we’ll both feel very motivated to make this work
> for WireGuard. But it might also not come to be (AES seems to have won
> over most of the industry), in which case, why hassle?
>

As I already pointed out, we have supported hardware already: CAAM is
in mainline, and Inside-Secure patches are on the list.

> Issue 3) WireGuard patch is out of date.
>
> This is my fault, because I haven’t posted in a long time. There are
> some important changes in the main WireGuard repo. I’ll roll another
> patch soon for this so we have something recent to work off of. Sorry
> about that.
>

This is the reason I included your WG patch verbatim, to make it
easier to rebase to newer versions. In fact, I never intended or
expected anything but discussion from this submission, let alone
anyone actually merging it :-)

> Issue 4) FPU register batching?
>
> When I introduced the simd_get/simd_put/simd_relax thing, people
> seemed to think it was a good idea. My benchmarks of it showed
> significant throughput improvements. Your patchset doesn’t have
> anything similar to this.

It uses the existing SIMD batching, and enhances it slightly for the
Poly1305/shash case.

> But on the other hand, last I spoke with the
> x86 FPU guys, I thought they might actually be in the process of
> making simd_get/put obsolete with some internal plumbing to make
> restoration lazier. I’ll see tglx later today and will poke him about
> this, as this might already be a non-issue.
>

We've already made improvements here for arm64 as well (and ARM
already used lazy restore). But I think it still makes sense to
amortize these calls over a reasonable chunk of data, i.e., a packet.
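
I.e., roughly this shape on arm64; the chacha20_*/poly1305_* names are
illustrative, crypto_simd_usable() and kernel_neon_begin()/end() are
the real helpers:

  static void chapoly_crypt_packet(u32 chacha_state[16],
                                   struct poly1305_desc_ctx *poly,
                                   u8 *buf, unsigned int len)
  {
          if (!crypto_simd_usable()) {
                  chacha20_generic(chacha_state, buf, buf, len);
                  poly1305_update_generic(poly, buf, len);
                  return;
          }

          /* one kernel_neon_begin()/end() pair for the whole packet */
          kernel_neon_begin();
          chacha20_neon(chacha_state, buf, buf, len);
          poly1305_update_neon(poly, buf, len);
          kernel_neon_end();
  }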

>
> So given the above, how would you like to proceed? My personal
> preference would be to see you start with the Zinc patchset and rename
> things and change the infrastructure to something that fits your
> preferences, and we can see what that looks like. Less appealing would
> be to do several iterations of you reworking Zinc from scratch and
> going through the exercises all over again, but if you prefer that I
> guess I could cope. Alternatively, maybe this is a lot to chew on, and
> we should just throw caution into the wind, implement cryptoapi.c for
> WireGuard (as described at the top), and add C functions to the crypto
> API sometime later? This is what I had envisioned in [1].
>

It all depends on whether we are interested in supporting async
accelerators or not, and it is clear what my position is on this
point.

I am not convinced that we need accelerated implementations of blake2s
and curve25519, but if we do, I'd like those to be implemented as
individual modules under arch/*/crypto, with some moduleloader fu for
weak symbols or static calls thrown in if we have to. Exposing them as
shashes seems unnecessary to me at this point.

My only objection to your simd get/put interface is that it uses a
typedef rather than a struct definition (although I also wonder how we
can avoid two instances living on the same call stack, unless we
forbid functions that take a struct simd* from calling functions that
don't take one, but these are details we should be able to work out.)

What I *don't* want is to merge WireGuard with its own library based
crypto now, and extend that later for async accelerators once people
realize that we really do need that as well.

> And for the avoidance of doubt, or in case any of the above message
> belied something different, I really am happy and relieved to have an
> opportunity to work on this _with you_, and I am much more open than
> before to compromise and finding practical solutions to the past
> political issues. Also, if you’re into chat, we can always spec some
> of the nitty-gritty aspects out over IRC or even the old-fashioned
> telephone. Thanks again for pushing this forward.
>

My pleasure :-)


* Re: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption
  2019-09-26 11:06     ` Ard Biesheuvel
@ 2019-09-26 12:34       ` Ard Biesheuvel
  0 siblings, 0 replies; 61+ messages in thread
From: Ard Biesheuvel @ 2019-09-26 12:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jason A . Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Eric Biggers, Greg KH, Samuel Neves, Linux Crypto Mailing List,
	Andy Lutomirski, Marc Zyngier, Dan Carpenter, Will Deacon,
	David Miller, Linux ARM

On Thu, 26 Sep 2019 at 13:06, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>
> On Thu, 26 Sep 2019 at 00:15, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Wed, Sep 25, 2019 at 9:14 AM Ard Biesheuvel
> > <ard.biesheuvel@linaro.org> wrote:
> > >
> > > Replace the chacha20poly1305() library calls with invocations of the
> > > RFC7539 AEAD, as implemented by the generic chacha20poly1305 template.
> >
> > Honestly, the other patches look fine to me from what I've seen (with
> > the small note I had in a separate email for 11/18), but this one I
> > consider just nasty, and a prime example of why people hate those
> > crypto lookup routines.
> >
> > Some of it is just the fundamental and pointless silly indirection,
> > that just makes things harder to read, less efficient, and less
> > straightforward.
> >
> > That's exemplified by this part of the patch:
> >
> > >  struct noise_symmetric_key {
> > > -       u8 key[NOISE_SYMMETRIC_KEY_LEN];
> > > +       struct crypto_aead *tfm;
> >
> > which is just one of those "we know what we want and we just want to
> > use it directly" things, and then the crypto indirection comes along
> > and makes that simple inline allocation of a small constant size
> > (afaik it is CHACHA20POLY1305_KEY_SIZE, which is 32) be another
> > allocation entirely.
> >
> > And it's some random odd non-typed thing too, so then you have that
> > silly and stupid dynamic allocation using a name lookup:
> >
> >    crypto_alloc_aead("rfc7539(chacha20,poly1305)", 0, CRYPTO_ALG_ASYNC);
> >
> > to create what used to be (and should be) a simple allocation that was
> > has a static type and was just part of the code.
> >
>
> That crypto_alloc_aead() call does a lot of things under the hood:
> - use an existing instantiation of rfc7539(chacha20,poly1305) if available,
> - look for modules that implement the whole transformation directly,
> - if none are found, instantiate the rfc7539 template, which will
> essentially do the above for chacha20 and poly1305, potentially using
> per-arch accelerated implementations if available (for either), or
> otherwise, fall back to the generic versions.
>
> What *I* see as the issue here is not that we need to do this at all,
> but that we have to do it for each value of the key. IMO, it would be
> much better to instantiate this thing only once, and have a way of
> passing a per-request key into it, permitting us to hide the whole
> thing behind the existing library interface.
>

Note that we don't have to do the whole dance for each new value of
the key: subsequent invocations will all succeed at step #1, and grab
the existing instantiation, but allocate a new TFM structure that
refers to it. It is this step that we should be able to omit as well
if the API is changed to allow per-request keys to be passed in via
the request structure.


* RE: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API
  2019-09-26 12:07   ` [RFC PATCH 00/18] crypto: wireguard using the existing crypto API Ard Biesheuvel
@ 2019-09-26 13:06     ` Pascal Van Leeuwen
  2019-09-26 13:15       ` Ard Biesheuvel
  2019-09-26 20:47     ` Jason A. Donenfeld
  1 sibling, 1 reply; 61+ messages in thread
From: Pascal Van Leeuwen @ 2019-09-26 13:06 UTC (permalink / raw)
  To: Ard Biesheuvel, Jason A. Donenfeld
  Cc: Catalin Marinas, Herbert Xu, Arnd Bergmann, Eric Biggers,
	Greg KH, Samuel Neves, Will Deacon, Linux Crypto Mailing List,
	Andy Lutomirski, Marc Zyngier, Dan Carpenter, Linus Torvalds,
	David Miller, linux-arm-kernel

> > In this case, the relevance is that the handshake in WireGuard is
> > extremely performance sensitive, in order to fend off DoS. One of the
> > big design gambits in WireGuard is – can we make it 1-RTT to reduce
> > the complexity of the state machine, but keep the crypto efficient
> > enough that this is still safe to do from a DoS perspective. The
> > protocol succeeds at this goal, but in many ways, just by a hair when
> > at scale, and so I’m really quite loathe to decrease handshake
> > performance.
> ...
> > Taken together, we simply can’t skimp on the implementations available
> > on the handshake layer, so we’ll need to add some form of
> > implementation selection, whether it’s the method Zinc uses ([2]), or
> > something else we cook up together.
> >
> 
> So are you saying that the handshake timing constraints in the
> WireGuard protocol are so stringent that we can't run it securely on,
> e.g., an ARM CPU that lacks a NEON unit? Or given that you are not
> providing accelerated implementations of blake2s or Curve25519 for
> arm64, we can't run it securely on arm64 at all?
> 
> Typically, I would prefer to only introduce different versions of the
> same algorithm if there is a clear performance benefit for an actual
> use case.
> 
> Framing this as a security issue rather than a performance issue is
> slightly disingenuous, since people are less likely to challenge it.
> But the security of any VPN protocol worth its salt should not hinge
> on the performance delta between the reference C code and a version
> that was optimized for a particular CPU.
> 
Fully agree with that last statement. Security of a protocol should
*never* depend on the performance of a particular implementation.

I may want to run this on a very constrained embedded system that
would necessarily be very slow, and I would still want that to be
secure. If the security really did hinge on implementation performance,
that would pretty much be a deal-breaker to me ...

Which would be a shame, because I really do like some of the other
things Wireguard does, and the effort to improve VPNs in general.

> > Issue 2) Linus’ objection to the async API invasion is more correct
> > than he realizes.
> >
> > I could re-enumerate my objections to the API there, but I think we
> > all get it. It’s horrendous looking. Even the introduction of the
> > ivpad member (what on earth?) in the skb cb made me shudder.
> 
> Your implementation of RFC7539 truncates the nonce to 64-bits, while
> RFC7539 defines a clear purpose for the bits you omit. Since the Zinc
> library is intended to be standalone (and you are proposing its use in
> other places, like big_keys.c), you might want to document your
> justification for doing so in the general case, instead of ridiculing
> the code I needed to write to work around this limitation.
> 
From RFC7539:
"Some protocols may have unique per-invocation inputs that are not 96
 bits in length.  For example, IPsec may specify a 64-bit nonce.  In
 such a case, it is up to the protocol document to define how to
 transform the protocol nonce into a 96-bit nonce, <<for example, by
 concatenating a constant value.>>"

So concatenating zeroes within the protocol is fine (if you can live
with the security consequences) but a generic library function should
of course take all 96 bits as input(!) Actually, the rfc7539esp variant
already takes that part of the nonce from the key, not the IV. This
may be more convenient for use with Wireguard as well? Just force the
trailing nonce portion of the key to zeroes when calling setkey().
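
Untested sketch of that idea, using only existing AEAD calls (the
helper name is made up; CHACHA20POLY1305_KEY_SIZE is 32, the extra 4
bytes are the rfc7539esp salt):

  #include <crypto/aead.h>

  static struct crypto_aead *alloc_chapoly_esp(const u8 *key)
  {
          u8 keymat[CHACHA20POLY1305_KEY_SIZE + 4];
          struct crypto_aead *tfm;
          int err;

          tfm = crypto_alloc_aead("rfc7539esp(chacha20,poly1305)", 0,
                                  CRYPTO_ALG_ASYNC);
          if (IS_ERR(tfm))
                  return tfm;

          memcpy(keymat, key, CHACHA20POLY1305_KEY_SIZE);
          memset(keymat + CHACHA20POLY1305_KEY_SIZE, 0, 4);   /* zero salt */
          err = crypto_aead_setkey(tfm, keymat, sizeof(keymat));
          memzero_explicit(keymat, sizeof(keymat));
          if (err) {
                  crypto_free_aead(tfm);
                  return ERR_PTR(err);
          }

          /* the per-packet IV is then just the 8-byte little-endian counter */
          return tfm;
  }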

> 
> My preference would be to address this by permitting per-request keys
> in the AEAD layer. That way, we can instantiate the transform only
> once, and just invoke it with the appropriate key on the hot path (and
> avoid any per-keypair allocations)
> 
This part I do not really understand. Why would you need to allocate a
new transform if you change the key? Why can't you just call setkey()
on the already allocated transform?

> 
> It all depends on whether we are interested in supporting async
> accelerators or not, and it is clear what my position is on this
> point.
> 
Maybe not for an initial upstream, but it should be a longer-term goal.

> 
> What I *don't* want is to merge WireGuard with its own library based
> crypto now, and extend that later for async accelerators once people
> realize that we really do need that as well.
> 
What's wrong with a step-by-step approach though? i.e. merge it with
library calls now and then gradually work towards the goal of integrating
(a tweaked version of) the Crypto API where that actually makes sense?
Rome wasn't built in one day either ...

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
www.insidesecure.com

* Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API
  2019-09-26 13:06     ` Pascal Van Leeuwen
@ 2019-09-26 13:15       ` Ard Biesheuvel
  2019-09-26 14:03         ` Pascal Van Leeuwen
  0 siblings, 1 reply; 61+ messages in thread
From: Ard Biesheuvel @ 2019-09-26 13:15 UTC (permalink / raw)
  To: Pascal Van Leeuwen
  Cc: Jason A. Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Eric Biggers, Greg KH, Samuel Neves, Will Deacon,
	Linux Crypto Mailing List, Andy Lutomirski, Marc Zyngier,
	Dan Carpenter, Linus Torvalds, David Miller, linux-arm-kernel

On Thu, 26 Sep 2019 at 15:06, Pascal Van Leeuwen
<pvanleeuwen@verimatrix.com> wrote:
...
> >
> > My preference would be to address this by permitting per-request keys
> > in the AEAD layer. That way, we can instantiate the transform only
> > once, and just invoke it with the appropriate key on the hot path (and
> > avoid any per-keypair allocations)
> >
> This part I do not really understand. Why would you need to allocate a
> new transform if you change the key? Why can't you just call setkey()
> on the already allocated transform?
>

Because the single transform will be shared between all users running
on different CPUs etc, and so the key should not be part of the TFM
state but of the request state.

> >
> > It all depends on whether we are interested in supporting async
> > accelerators or not, and it is clear what my position is on this
> > point.
> >
> Maybe not for an initial upstream, but it should be a longer-term goal.
>
> >
> > What I *don't* want is to merge WireGuard with its own library based
> > crypto now, and extend that later for async accelerators once people
> > realize that we really do need that as well.
> >
> What's wrong with a step-by-step approach though? i.e. merge it with
> library calls now and then gradually work towards the goal of integrating
> (a tweaked version of) the Crypto API where that actually makes sense?
> Rome wasn't built in one day either ...
>

I should clarify: what I don't want is two frameworks in the kernel
for doing async crypto, the existing one plus a new library-based one.


* RE: chapoly acceleration hardware [Was: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API]
  2019-09-26 11:06     ` chapoly acceleration hardware [Was: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API] Jason A. Donenfeld
  2019-09-26 11:38       ` Toke Høiland-Jørgensen
@ 2019-09-26 13:52       ` Pascal Van Leeuwen
  2019-09-26 23:13         ` Dave Taht
  2019-09-26 22:47       ` Jakub Kicinski
  2 siblings, 1 reply; 61+ messages in thread
From: Pascal Van Leeuwen @ 2019-09-26 13:52 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Toke Høiland-Jørgensen, Catalin Marinas, Herbert Xu,
	Arnd Bergmann, Ard Biesheuvel, Greg KH, Eric Biggers, Dave Taht,
	Willy Tarreau, Samuel Neves, Will Deacon, Netdev,
	Linux Crypto Mailing List, Andy Lutomirski, Marc Zyngier,
	Dan Carpenter, Linus Torvalds, David Miller, linux-arm-kernel

> -----Original Message-----
> From: Jason A. Donenfeld <Jason@zx2c4.com>
> Sent: Thursday, September 26, 2019 1:07 PM
> To: Pascal Van Leeuwen <pvanleeuwen@verimatrix.com>
> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>; Linux Crypto Mailing List <linux-
> crypto@vger.kernel.org>; linux-arm-kernel <linux-arm-kernel@lists.infradead.org>;
> Herbert Xu <herbert@gondor.apana.org.au>; David Miller <davem@davemloft.net>; Greg KH
> <gregkh@linuxfoundation.org>; Linus Torvalds <torvalds@linux-foundation.org>; Samuel
> Neves <sneves@dei.uc.pt>; Dan Carpenter <dan.carpenter@oracle.com>; Arnd Bergmann
> <arnd@arndb.de>; Eric Biggers <ebiggers@google.com>; Andy Lutomirski <luto@kernel.org>;
> Will Deacon <will@kernel.org>; Marc Zyngier <maz@kernel.org>; Catalin Marinas
> <catalin.marinas@arm.com>; Willy Tarreau <w@1wt.eu>; Netdev <netdev@vger.kernel.org>;
> Toke Høiland-Jørgensen <toke@toke.dk>; Dave Taht <dave.taht@gmail.com>
> Subject: chapoly acceleration hardware [Was: Re: [RFC PATCH 00/18] crypto: wireguard
> using the existing crypto API]
> 
> [CC +willy, toke, dave, netdev]
> 
> Hi Pascal
> 
> On Thu, Sep 26, 2019 at 12:19 PM Pascal Van Leeuwen
> <pvanleeuwen@verimatrix.com> wrote:
> > Actually, that assumption is factually wrong. I don't know if anything
> > is *publicly* available, but I can assure you the silicon is running in
> > labs already. And something will be publicly available early next year
> > at the latest. Which could nicely coincide with having Wireguard support
> > in the kernel (which I would also like to see happen BTW) ...
> >
> > Not "at some point". It will. Very soon. Maybe not in consumer or server
> > CPUs, but definitely in the embedded (networking) space.
> > And it *will* be much faster than the embedded CPU next to it, so it will
> > be worth using it for something like bulk packet encryption.
> 
> Super! I was wondering if you could speak a bit more about the
> interface. My biggest questions surround latency. Will it be
> synchronous or asynchronous?
>
The hardware being external to the CPU and running in parallel with it,
obviously asynchronous.

> If the latter, why? 
>
Because, as you probably already guessed, the round-trip latency is way
longer than the actual processing time, at least for small packets.

Partly because the only way to communicate between the CPU and the HW 
accelerator (whether that is crypto, a GPU, a NIC, etc.) that doesn't
keep the CPU busy moving data is through memory, with the HW doing DMA.
And, as any programmer should know, round trip times to memory are huge
relative to the processing speed.

And partly because these accelerators are very similar to CPU's in
terms of architecture, doing pipelined processing and having multiple
of such pipelines in parallel. Except that these pipelines are not
working on low-level instructions but on full packets/blocks. So they
need to have many packets in flight to keep those pipelines fully
occupied. And packets need to move through the various pipeline stages,
so they incur the time needed to process them multiple times. (just 
like e.g. a multiply instruction with a throughput of 1 per cycle
actually may need 4 or more cycles to actually provide its result)

Could you do that from a synchronous interface? In theory, probably, 
if you would spawn a new thread for every new packet arriving and
rely on the scheduler to preempt the waiting threads. But you'd need
as many threads as the HW accelerator can have packets in flight,
while an async interface would need only 2 threads: one to handle the input to
the accelerator and one to handle the output (or at most one thread
per CPU, if you want to divide the workload)

Such a many-thread approach seems very inefficient to me.

> What will its latencies be?
>
Depends very much on the specific integration scenario (i.e. bus 
speed, bus hierarchy, cache hierarchy, memory speed, etc.) but on
the order of a few thousand CPU clocks is not unheard of.
Which is an eternity for the CPU, but still only a few uSec in
human time. Not a problem unless you're a high-frequency trader and
every ns counts ...
It's not like the CPU would process those packets in zero time.

> How deep will its buffers be? 
>
That of course depends on the specific accelerator implementation,
but possibly dozens of small packets in our case, as you'd need 
at least width x depth packets in there to keep the pipes busy.
Just like a modern CPU needs hundreds of instructions in flight
to keep all its resources busy.

> The reason I ask is that a
> lot of crypto acceleration hardware of the past has been fast but has
> had very deep buffers, at great expense of latency.
>
Define "great expense". Everything is relative. The latency is very
high compared to per-packet processing time but at the same time it's
only on the order of a few uSec. Which may not even be significant on
the total time it takes for the packet to travel from input MAC to
output MAC, considering the CPU will still need to parse and classify
it and do pre- and postprocessing on it.

> In the networking context, keeping latency low is pretty important.
>
I've been doing this for IPsec for nearly 20 years now and I've never
heard anyone complain about our latency, so it must be OK.

We're also doing (fully inline, no CPU involved) MACsec cores, which
operate at layer 2 and I know it's a concern there for very specific
use cases (high frequency trading, precision time protocol, ...).
For "normal" VPN's though, a few uSec more or less should be a non-issue.

> Already
> WireGuard is multi-threaded which isn't super great all the time for
> latency (improvements are a work in progress). If you're involved with
> the design of the hardware, perhaps this is something you can help
> ensure winds up working well? For example, AES-NI is straightforward
> and good, but Intel can do that because they are the CPU. It sounds
> like your silicon will be adjacent. How do you envision this working
> in a low latency environment?
> 
Depends on how low low-latency is. If you really need minimal latency,
you need an inline implementation. Which we can also provide, BTW :-)

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
www.insidesecure.com

* RE: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API
  2019-09-26 13:15       ` Ard Biesheuvel
@ 2019-09-26 14:03         ` Pascal Van Leeuwen
  2019-09-26 14:52           ` Ard Biesheuvel
  0 siblings, 1 reply; 61+ messages in thread
From: Pascal Van Leeuwen @ 2019-09-26 14:03 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Jason A. Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Eric Biggers, Greg KH, Samuel Neves, Will Deacon,
	Linux Crypto Mailing List, Andy Lutomirski, Marc Zyngier,
	Dan Carpenter, Linus Torvalds, David Miller, linux-arm-kernel

> -----Original Message-----
> From: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> Sent: Thursday, September 26, 2019 3:16 PM
> To: Pascal Van Leeuwen <pvanleeuwen@verimatrix.com>
> Cc: Jason A. Donenfeld <Jason@zx2c4.com>; Linux Crypto Mailing List <linux-
> crypto@vger.kernel.org>; linux-arm-kernel <linux-arm-kernel@lists.infradead.org>;
> Herbert Xu <herbert@gondor.apana.org.au>; David Miller <davem@davemloft.net>; Greg KH
> <gregkh@linuxfoundation.org>; Linus Torvalds <torvalds@linux-foundation.org>; Samuel
> Neves <sneves@dei.uc.pt>; Dan Carpenter <dan.carpenter@oracle.com>; Arnd Bergmann
> <arnd@arndb.de>; Eric Biggers <ebiggers@google.com>; Andy Lutomirski <luto@kernel.org>;
> Will Deacon <will@kernel.org>; Marc Zyngier <maz@kernel.org>; Catalin Marinas
> <catalin.marinas@arm.com>
> Subject: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API
> 
> On Thu, 26 Sep 2019 at 15:06, Pascal Van Leeuwen
> <pvanleeuwen@verimatrix.com> wrote:
> ...
> > >
> > > My preference would be to address this by permitting per-request keys
> > > in the AEAD layer. That way, we can instantiate the transform only
> > > once, and just invoke it with the appropriate key on the hot path (and
> > > avoid any per-keypair allocations)
> > >
> > This part I do not really understand. Why would you need to allocate a
> > new transform if you change the key? Why can't you just call setkey()
> > on the already allocated transform?
> >
> 
> Because the single transform will be shared between all users running
> on different CPUs etc, and so the key should not be part of the TFM
> state but of the request state.
> 
So you need a transform per user, such that each user can have his own
key. But you shouldn't need to reallocate it when the user changes his
key. I also don't see how the "different CPUs" is relevant here? I can
share a single key across multiple CPUs here just fine ...

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
www.insidesecure.com

* Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API
  2019-09-26 14:03         ` Pascal Van Leeuwen
@ 2019-09-26 14:52           ` Ard Biesheuvel
  2019-09-26 15:04             ` Pascal Van Leeuwen
  0 siblings, 1 reply; 61+ messages in thread
From: Ard Biesheuvel @ 2019-09-26 14:52 UTC (permalink / raw)
  To: Pascal Van Leeuwen
  Cc: Jason A. Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Eric Biggers, Greg KH, Samuel Neves, Will Deacon,
	Linux Crypto Mailing List, Andy Lutomirski, Marc Zyngier,
	Dan Carpenter, Linus Torvalds, David Miller, linux-arm-kernel

On Thu, 26 Sep 2019 at 16:03, Pascal Van Leeuwen
<pvanleeuwen@verimatrix.com> wrote:
>
> > -----Original Message-----
> > From: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> > Sent: Thursday, September 26, 2019 3:16 PM
> > To: Pascal Van Leeuwen <pvanleeuwen@verimatrix.com>
> > Cc: Jason A. Donenfeld <Jason@zx2c4.com>; Linux Crypto Mailing List <linux-
> > crypto@vger.kernel.org>; linux-arm-kernel <linux-arm-kernel@lists.infradead.org>;
> > Herbert Xu <herbert@gondor.apana.org.au>; David Miller <davem@davemloft.net>; Greg KH
> > <gregkh@linuxfoundation.org>; Linus Torvalds <torvalds@linux-foundation.org>; Samuel
> > Neves <sneves@dei.uc.pt>; Dan Carpenter <dan.carpenter@oracle.com>; Arnd Bergmann
> > <arnd@arndb.de>; Eric Biggers <ebiggers@google.com>; Andy Lutomirski <luto@kernel.org>;
> > Will Deacon <will@kernel.org>; Marc Zyngier <maz@kernel.org>; Catalin Marinas
> > <catalin.marinas@arm.com>
> > Subject: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API
> >
> > On Thu, 26 Sep 2019 at 15:06, Pascal Van Leeuwen
> > <pvanleeuwen@verimatrix.com> wrote:
> > ...
> > > >
> > > > My preference would be to address this by permitting per-request keys
> > > > in the AEAD layer. That way, we can instantiate the transform only
> > > > once, and just invoke it with the appropriate key on the hot path (and
> > > > avoid any per-keypair allocations)
> > > >
> > > This part I do not really understand. Why would you need to allocate a
> > > new transform if you change the key? Why can't you just call setkey()
> > > on the already allocated transform?
> > >
> >
> > Because the single transform will be shared between all users running
> > on different CPUs etc, and so the key should not be part of the TFM
> > state but of the request state.
> >
> So you need a transform per user, such that each user can have his own
> key. But you shouldn't need to reallocate it when the user changes his
> key. I also don't see how the "different CPUs" is relevant here? I can
> share a single key across multiple CPUs here just fine ...
>

We need two transforms per connection, one for each direction. That is
how I currently implemented it. If allocating/freeing those on the
same path as where the keypair object itself is allocated is too
costly, I wonder why allocating the keypair object itself is fine.

But what I am suggesting is to use a single TFM which gets shared by
all the connections, where the key for each operation is provided
per-request. That TFM cannot have a key set, because each user may use
a different key.


* RE: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API
  2019-09-26 14:52           ` Ard Biesheuvel
@ 2019-09-26 15:04             ` Pascal Van Leeuwen
  0 siblings, 0 replies; 61+ messages in thread
From: Pascal Van Leeuwen @ 2019-09-26 15:04 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Jason A. Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Eric Biggers, Greg KH, Samuel Neves, Will Deacon,
	Linux Crypto Mailing List, Andy Lutomirski, Marc Zyngier,
	Dan Carpenter, Linus Torvalds, David Miller, linux-arm-kernel

> -----Original Message-----
> From: linux-crypto-owner@vger.kernel.org <linux-crypto-owner@vger.kernel.org> On Behalf
> Of Ard Biesheuvel
> Sent: Thursday, September 26, 2019 4:53 PM
> To: Pascal Van Leeuwen <pvanleeuwen@verimatrix.com>
> Cc: Jason A. Donenfeld <Jason@zx2c4.com>; Linux Crypto Mailing List <linux-
> crypto@vger.kernel.org>; linux-arm-kernel <linux-arm-kernel@lists.infradead.org>;
> Herbert Xu <herbert@gondor.apana.org.au>; David Miller <davem@davemloft.net>; Greg KH
> <gregkh@linuxfoundation.org>; Linus Torvalds <torvalds@linux-foundation.org>; Samuel
> Neves <sneves@dei.uc.pt>; Dan Carpenter <dan.carpenter@oracle.com>; Arnd Bergmann
> <arnd@arndb.de>; Eric Biggers <ebiggers@google.com>; Andy Lutomirski <luto@kernel.org>;
> Will Deacon <will@kernel.org>; Marc Zyngier <maz@kernel.org>; Catalin Marinas
> <catalin.marinas@arm.com>
> Subject: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API
> 
> On Thu, 26 Sep 2019 at 16:03, Pascal Van Leeuwen
> <pvanleeuwen@verimatrix.com> wrote:
> >
> > > -----Original Message-----
> > > From: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> > > Sent: Thursday, September 26, 2019 3:16 PM
> > > To: Pascal Van Leeuwen <pvanleeuwen@verimatrix.com>
> > > Cc: Jason A. Donenfeld <Jason@zx2c4.com>; Linux Crypto Mailing List <linux-
> > > crypto@vger.kernel.org>; linux-arm-kernel <linux-arm-kernel@lists.infradead.org>;
> > > Herbert Xu <herbert@gondor.apana.org.au>; David Miller <davem@davemloft.net>; Greg
> KH
> > > <gregkh@linuxfoundation.org>; Linus Torvalds <torvalds@linux-foundation.org>; Samuel
> > > Neves <sneves@dei.uc.pt>; Dan Carpenter <dan.carpenter@oracle.com>; Arnd Bergmann
> > > <arnd@arndb.de>; Eric Biggers <ebiggers@google.com>; Andy Lutomirski
> <luto@kernel.org>;
> > > Will Deacon <will@kernel.org>; Marc Zyngier <maz@kernel.org>; Catalin Marinas
> > > <catalin.marinas@arm.com>
> > > Subject: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API
> > >
> > > On Thu, 26 Sep 2019 at 15:06, Pascal Van Leeuwen
> > > <pvanleeuwen@verimatrix.com> wrote:
> > > ...
> > > > >
> > > > > My preference would be to address this by permitting per-request keys
> > > > > in the AEAD layer. That way, we can instantiate the transform only
> > > > > once, and just invoke it with the appropriate key on the hot path (and
> > > > > avoid any per-keypair allocations)
> > > > >
> > > > This part I do not really understand. Why would you need to allocate a
> > > > new transform if you change the key? Why can't you just call setkey()
> > > > on the already allocated transform?
> > > >
> > >
> > > Because the single transform will be shared between all users running
> > > on different CPUs etc, and so the key should not be part of the TFM
> > > state but of the request state.
> > >
> > So you need a transform per user, such that each user can have his own
> > key. But you shouldn't need to reallocate it when the user changes his
> > key. I also don't see how the "different CPUs" is relevant here? I can
> > share a single key across multiple CPUs here just fine ...
> >
> 
> We need two transforms per connection, one for each direction. That is
> how I currently implemented it, and it seems to me that, if
> allocating/freeing those on the same path as where the keypair object
> itself is allocated is too costly, I wonder why allocating the keypair
> object itself is fine.
> 

I guess that keypair object is a Wireguard specific thing?
In that case it may not make a difference performance-wise.
I just would not expect a new (pair of) transforms to be allocated
just for a rekey, only when a new connection is made. 

Thinking about this some more:
Allocating a transform is about more than just allocating the 
object, there may be all kinds of side-effects like fallback
ciphers being allocated, specific HW initialization being done, etc. 
I just feel that if you only need to change the key, you should
only change the key. As that's what the driver would be optimized
for.

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
www.insidesecure.com



* Re: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption
  2019-09-26  9:40     ` Pascal Van Leeuwen
@ 2019-09-26 16:35       ` Linus Torvalds
  2019-09-27  0:15         ` Pascal Van Leeuwen
  0 siblings, 1 reply; 61+ messages in thread
From: Linus Torvalds @ 2019-09-26 16:35 UTC (permalink / raw)
  To: Pascal Van Leeuwen
  Cc: Jason A . Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Ard Biesheuvel, Greg KH, Eric Biggers, Samuel Neves,
	Linux Crypto Mailing List, Andy Lutomirski, Marc Zyngier,
	Dan Carpenter, Will Deacon, David Miller, Linux ARM

On Thu, Sep 26, 2019 at 2:40 AM Pascal Van Leeuwen
<pvanleeuwen@verimatrix.com> wrote:
>
> While I agree with the principle of first merging Wireguard without
> hooking it up to the Crypto API and doing the latter in a later,
> separate patch, I DONT'T agree with your bashing of the Crypto API
> or HW crypto acceleration in general.

I'm not bashing hardware crypto acceleration.

But I *am* bashing bad interfaces to it.

Honestly, you need to face a few facts, Pascal.

Really.

First off: availability.

 (a) hardware doesn't exist right now outside your lab

This is a fact.

 (b) hardware simply WILL NOT exist in any huge number for years,
possibly decades. If ever.

This is just reality, Pascal. Even if you release your hardware
tomorrow, where do you think it will exist? Laptops? PCs? Phones?
No. No. And no.

Phones are likely the strongest argument for power use, but phones
won't really start using it until it's on the SoC, because while they
care deeply about power, they care even more deeply about a lot of
other things a whole lot more. Form factor, price, and customers that
care.

So phones aren't happening. Not for years, and not until it's a big
deal and standard IP that everybody wants.

Laptops and PCs? No. Look at all the crypto acceleration they have today.

That was sarcasm, btw, just to make it clear. It's simply not a market.

End result: even with hardware, the hardware will be very rare. Maybe
routers that want to sell particular features in the near future.

Again, this isn't theory. This is that pesky little thing called
"reality". It sucks, I know.

But even if you *IGNORE* the fact that hardware doesn't exist right
now, and won't be widely available for years (or longer), there's
another little fact that you are ignoring:

The async crypto interfaces are badly designed. Full stop.

Seriously. This isn't rocket science. This is very very basic Computer
Science 101.

Tell me, what's _the_ most important part about optimizing something?

Yeah, it's "you optimize for the common case". But let's ignore that
part, since I already said that hardware isn't the common case but I
promised that I'd ignore that part.

The _other_ most important part of optimization is that you optimize
for the case that _matters_.

And the async crypto case got that completely wrong, and the wireguard
patch shows that issue very clearly.

The fact is, even you admit that a few CPU cycles don't matter for the
async case where you have hardware acceleration, because the real cost
is going to be in the IO - whether it's DMA, cache coherent
interconnects, or some day some SoC special bus. The CPU cycles just
don't matter, they are entirely in the noise.

What does that mean?  Think about it.

[ Time passes ]

Ok, I'll tell you: it means that the interface shouldn't be designed
for async hw-assisted crypto. The interface should be designed for the
case that _does_ care about CPU cycles, and then the async hw-assisted
crypto should be hidden by a conditional, and its (extra) data
structures should be the ones that are behind some indirect pointers
etc.  Because, as we agreed, the async external hw case really doesn't
care. If it has to traverse a pointer or two, and if it has to have a
*SEPARATE* keystore that has longer lifetimes, then the async code can
set that up on its own, but that should not affect the case that
cares.

Really, this is fundamental, and really, the crypto interfaces got this wrong.

This is in fact _so_ fundamental that the only possible reason you can
disagree is because you don't care about reality or fundamentals, and
you only care about the small particular hardware niche you work in
and nothing else.

You really should think about this a bit.

> However, if you're doing bulk crypto like network packet processing
> (as Wireguard does!) or disk/filesystem encryption, then that cipher
> allocation only happens once every blue moon and the overhead for
> that is totally *irrelevant* as it is amortized over many hours or
> days of runtime.

This is not true. It's not true because the overhead happens ALL THE TIME.

And in 99.9% of all cases there are no upsides from the hw crypto,
because the hardware DOES NOT EXIST.

You think the "common case" is that hardware encryption case, but see
above. It's really not.

And when you _do_ have HW encryption, you could do the indirection.

But that's not an argument for always having the indirection.

What indirection am I talking about?

There's multiple levels of indirection in the existing bad crypto interfaces:

 (a) the data structures themselves. This is things like keys and
state storage, both of which are (often) small enough that they would
be better off on a stack, or embedded in the data structures of the
callers.

 (b) the calling conventions. This is things like indirection -
usually several levels - just to find the function pointer to call to,
and then an indirect call to that function pointer.

and both of those are actively bad things when you don't have those
hardware accelerators.

When you *do* have them, you don't care. Everybody agrees about that.
But you're ignoring the big white elephant in the room, which is that
the hw really is very rare in the end, even if it were to exist at
all.

> While I generally dislike this whole hype of storing stuff in
> textual formats like XML and JSON and then wasting lots of CPU
> cycles on parsing that, I've learned to appreciate the power of
> these textual Crypto API templates, as they allow a hardware
> accelerator to advertise complex combined operations as single
> atomic calls, amortizing the communication overhead between SW
> and HW. It's actually very flexible and powerful!

BUT IT IS FUNDAMENTALLY BADLY DESIGNED!

Really.

You can get the advantages of hw-accelerated crypto with good designs.
The current crypto layer is not that.

The current crypto layer simply has indirection at the wrong level.

Here's how it should have been done:

 - make the interfaces be direct calls to the crypto you know you want.

 - make the basic key and state buffer be explicit and let people
allocate them on their own stacks or whatever

"But what about hw acceleration?"

 - add a single indirect private pointer that the direct calls can use
for their own state IF THEY HAVE REASON TO

 - make the direct crypto calls just have a couple of conditionals
inside of them

Why? Direct calls with a couple of conditionals are really cheap for
the non-accelerated case. MUCH cheaper than the indirection overhead
(both on a state level and on a "crypto description" level) that the
current code has.

Seriously. The hw accelerated crypto won't care about the "call a
static routine" part. The hw accelerated crypto won't care about the
"I need to allocate a copy of the key because I can't have it on
stack, and need to have it in a DMA'able region". The hw accelerated
crypto won't care about the two extra instructions that do "do I have
any extra state at all, or should I just do the synchronous CPU
version" before it gets called through some indirect pointer.

So there is absolutely NO DOWNSIDE for hw accelerated crypto to just
do it right, and use an interface like this:

       if (!chacha20poly1305_decrypt_sg(sg, sg, skb->len, NULL, 0,
                                        PACKET_CB(skb)->nonce, key->key,
                                        simd_context))
               return false;

because for the hw accelerated case the costs are all elsewhere.

But for synchronous encryption code on the CPU? Avoiding the
indirection can be a huge win. Avoiding allocations, extra cachelines,
all that overhead. Avoiding indirect calls entirely, because doing a
few conditional branches (that will predict perfectly) on the state
will be a lot more efficient, both in direct CPU cycles and in things
like I$ etc.

In contrast, forcing people to use this model:

       if (unlikely(crypto_aead_reqsize(key->tfm) > 0)) {
               req = aead_request_alloc(key->tfm, GFP_ATOMIC);
               if (!req)
                       return false;
       } else {
               req = &stackreq;
               aead_request_set_tfm(req, key->tfm);
       }

       aead_request_set_ad(req, 0);
       aead_request_set_callback(req, 0, NULL, NULL);
       aead_request_set_crypt(req, sg, sg, skb->len,
                              (u8 *)&PACKET_CB(skb)->ivpad);
       err = crypto_aead_decrypt(req);
       if (unlikely(req != &stackreq))
               aead_request_free(req);
       if (err)
               return false;

isn't going to help anybody. It sure as hell doesn't help the
CPU-synchronous case, and see above: it doesn't even help the hw
accelerated case. It could have had _all_ that "tfm" work behind a
private pointer that the CPU case never touches except to see "ok,
it's NULL, I don't have any".

See?

The interface *should* be that chacha20poly1305_decrypt_sg() library
interface, just give it a private pointer it can use and update. Then,
*internally* it can do something like

     bool chacha20poly1305_decrypt_sg(...)
     {
             struct cc20p1305_ptr *state = *state_p;
             if (state) {
                     .. do basically the above complex thing ..
                     return ret; .. or fall back to sw if the hw
queues are full..
             }
             .. do the CPU only thing..
     }

and now you'd have no extra overhead for the no-hw-accel case, you'd
have a more pleasant interface to use, and you could still use hw
accel if you wanted to.
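
[To make the shape of that concrete, here is a rough sketch of the
pattern being described. It is not code from any posted patch; the
chapoly_hw_* names and the *_cpu fallback are hypothetical placeholders.]

#include <linux/scatterlist.h>
#include <linux/types.h>

/* Opaque, owned by an offload driver; stays NULL when no HW is present. */
struct chapoly_hw_state;

bool chacha20poly1305_decrypt_sg(struct scatterlist *dst,
				 struct scatterlist *src, size_t len,
				 const u8 *ad, size_t ad_len, u64 nonce,
				 const u8 key[32],
				 struct chapoly_hw_state **state_p)
{
	/*
	 * Hot path for the CPU-synchronous case: no allocation, no
	 * indirect call, just one perfectly predicted branch.
	 */
	if (likely(!*state_p))
		return chacha20poly1305_decrypt_sg_cpu(dst, src, len,
						       ad, ad_len, nonce, key);

	/*
	 * Offload path: the driver keeps its longer-lived key copy and
	 * request bookkeeping behind the private pointer, and may fall
	 * back to the CPU path when its queues are full.
	 */
	return chapoly_hw_decrypt_sg(*state_p, dst, src, len,
				     ad, ad_len, nonce, key);
}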

THIS is why I say that the crypto interface is bad. It was designed
for the wrong objectives. It was designed purely for a SSL-like model
where you do a complex key and algorithm exchange dance, and you
simply don't know ahead of time what crypto you are even using.

And yes, that "I'm doing the SSL thing" used to be a major use of
encryption. I understand why it happened. It was what people did in
the 90's. People thought it was a good idea back then, and it was also
most of the hw acceleration world.

And yes, in that model of "I don't have a clue of what crypto I'm even
using" the model works fine. But honestly, if you can't admit to
yourself that it's wrong for the case where you _do_ know the
algorithm, you have some serious blinders on.

Just from a user standpoint, do you seriously think users _like_
having to do the above 15+ lines of code, vs the single function call?

The crypto interface really isn't pleasant, and you're wrong to
believe that it really helps. The hw acceleration capability could
have been abstracted away, instead of making that indirection be front
and center.

And again - I do realize the historical reasons for it. But
understanding that doesn't magically make it wonderful.

                 Linus

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API
  2019-09-26 12:07   ` [RFC PATCH 00/18] crypto: wireguard using the existing crypto API Ard Biesheuvel
  2019-09-26 13:06     ` Pascal Van Leeuwen
@ 2019-09-26 20:47     ` Jason A. Donenfeld
  2019-09-26 21:36       ` Andy Lutomirski
  1 sibling, 1 reply; 61+ messages in thread
From: Jason A. Donenfeld @ 2019-09-26 20:47 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Catalin Marinas, Herbert Xu, Arnd Bergmann, Eric Biggers,
	Greg KH, Samuel Neves, Will Deacon, Linux Crypto Mailing List,
	Andy Lutomirski, Marc Zyngier, Dan Carpenter, Linus Torvalds,
	David Miller, linux-arm-kernel

Hi Ard,

On Thu, Sep 26, 2019 at 2:07 PM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> attitude goes counter to that, and this is why we have made so little
> progress over the past year.

I also just haven't submitted much in the past year, taking a bit of a
break to see how things would settle. Seemed like rushing things
wasn't prudent, so I slowed down.

> But I am happy with your willingness to collaborate and find common
> ground, which was also my motivation for spending a considerable
> amount of time to prepare this patch set.

Super.

> > If the process of doing that together will be fraught with difficulty,
> > I’m still open to the “7 patch series” with the ugly cryptoapi.c
> > approach, as described at the top.
>
> If your aim is to write ugly code and use that as a munition

No, this is not a matter of munition at all. Please take my words
seriously; I am entirely genuine here. Three people I greatly respect
made a very compelling argument to me, prompting the decision in [1].
The argument was that trying to fix the crypto layer AND trying to get
WireGuard merged at the same time was ambitious and crazy. Maybe
instead, they argued, I should just use the old crypto API, get
WireGuard in, and then begin the Zinc process after. I think
procedurally, that's probably good advice, and the people I was
talking to seemed to have a really firm grasp on what works and what
doesn't in the mainlining process. Now it's possible their judgement
is wrong, but I really am open, in earnest, to following it. And the
way that would look would be not trying to fix the crypto API now, and
getting WireGuard in based on whatever we can cobble together on the
current foundations, with some intermediate file (cryptoapi.c in my
previous email) to prevent it from infecting WireGuard. This isn't
"munition"; it's a serious proposal.

The funny thing, though, is that all the while I was under the
impression somebody had figured out a great way to do this, it turns
out you were busy with basically Zinc-but-not. So we're back to square
one: you and I both want the crypto API to change, and now we have to
figure out a way forward together on how to do this, prompting my last
email to you, indicating that I was open to all sorts of compromises.
However, I still remain fully open to following the prior suggestion,
of not doing that at all right now, and instead basing this on the
existing crypto API as-is.

[1] https://lore.kernel.org/wireguard/CAHmME9pmfZAp5zd9BDLFc2fWUhtzZcjYZc2atTPTyNFFmEdHLg@mail.gmail.com/

> > reference, here’s what that kind of thing looks like: [2].
>
> This is one of the issues in the 'fix it for everyone else as well'
> category. If we can improve the crypto API to be less susceptible to
> these issues (e.g., using static calls), everybody benefits. I'd be
> happy to collaborate on that.

Good. I'm happy to learn that you're all for having fast
implementations that underlie the simple function calls.

> > Taken together, we simply can’t skimp on the implementations available
> > on the handshake layer, so we’ll need to add some form of
> > implementation selection, whether it’s the method Zinc uses ([2]), or
> > something else we cook up together.
>
> So are you saying that the handshake timing constraints in the
> WireGuard protocol are so stringent that we can't run it securely on,
> e.g., an ARM CPU that lacks a NEON unit? Or given that you are not
> providing accelerated implementations of blake2s or Curve25519 for
> arm64, we can't run it securely on arm64 at all?

Deployed at scale, the handshake must have a certain performance to
not be DoS'd. I've spent a long time benching these and attacking my
own code.  I won't be comfortable with this going in without the fast
implementations for the handshake. And down the line, too, we can look
into how to even improve the DoS resistance. I think there's room for
improvement, and I hope at some point you'll join us in discussions on
WireGuard internals. But the bottom line is that we need those fast
primitives.

> Typically, I would prefer to only introduce different versions of the
> same algorithm if there is a clear performance benefit for an actual
> use case.

As I was saying, this is indeed the case.

> Framing this as a security issue rather than a performance issue is
> slightly disingenuous, since people are less likely to challenge it.

I'm not being disingenuous. DoS resistance is a real issue with
WireGuard. You might argue that FourQ and Siphash would have made
better choices, and that's an interesting discussion, but it is what
it is. The thing needs fast implementations. And we're going to have
to implement that code anyway for other things, so might as well get
it working well now.

> But the security of any VPN protocol worth its salt

You're not required to use WireGuard.

> Parsing the string and connecting the function pointers happens only
> once, and only when the transform needs to be instantiated from its
> constituent parts. Subsequent invocations will just grab the existing
> object.

That's good to know. It doesn't fully address the issue, though.

> My preference would be to address this by permitting per-request keys
> in the AEAD layer. That way, we can instantiate the transform only
> once, and just invoke it with the appropriate key on the hot path (and
> avoid any per-keypair allocations)

That'd be a major improvement to the async interface, yes.
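
[For illustration only: there is no such setter in the AEAD API today,
but the idea could look roughly like the hypothetical sketch below,
where the tfm is allocated once and the per-peer key travels with each
request. The wg_* helper and aead_request_set_key() are invented names.]

#include <crypto/aead.h>

/* Hypothetical: aead_request_set_key() does not exist in the tree. */
static int wg_decrypt_with_key(struct crypto_aead *tfm,
			       struct aead_request *req,
			       struct scatterlist *sg, unsigned int len,
			       u8 *iv, const u8 *key, unsigned int keylen)
{
	aead_request_set_tfm(req, tfm);		/* one shared tfm */
	aead_request_set_callback(req, 0, NULL, NULL);
	aead_request_set_ad(req, 0);
	aead_request_set_crypt(req, sg, sg, len, iv);
	aead_request_set_key(req, key, keylen);	/* hypothetical call */
	return crypto_aead_decrypt(req);
}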

> > So given the above, how would you like to proceed? My personal
> > preference would be to see you start with the Zinc patchset and rename
> > things and change the infrastructure to something that fits your
> > preferences, and we can see what that looks like. Less appealing would
> > be to do several iterations of you reworking Zinc from scratch and
> > going through the exercises all over again, but if you prefer that I
> > guess I could cope. Alternatively, maybe this is a lot to chew on, and
> > we should just throw caution into the wind, implement cryptoapi.c for
> > WireGuard (as described at the top), and add C functions to the crypto
> > API sometime later? This is what I had envisioned in [1].

> It all depends on whether we are interested in supporting async
> accelerators or not, and it is clear what my position is on this
> point.

For a first version of WireGuard, no, I'm really not interested in
that. Adding it in there is more ambitious than it looks to get it
right. Async means more buffers, which means the queuing system for
WireGuard needs to be changed. There's already ongoing research into
this, and I'm happy to consider that research with a light toward
maybe having async stuff in the future. But sticking it into the code
now as-is simply does not work from a buffering/queueing perspective. So
again, let's take an iterative approach here: first we do stuff with
the simple synchronous API. After the dust has settled, hardware is
available for testing, Van Jacobson has been taken off the bookshelf
for a fresh reading, and we've all sat down for a few interesting
conversations at netdev on queueing and bufferbloat, then let's start
working this in. In other words, just because technically you can glue
those APIs together, sort of, doesn't mean that approach makes sense
for the system as a whole.

> I am not convinced that we need accelerated implementations of blake2s
> and curve25519, but if we do, I'd like those to be implemented as
> individual modules under arch/*/crypto, with some moduleloader fu for
> weak symbols or static calls thrown in if we have to. Exposing them as
> shashes seems unnecessary to me at this point.

We need the accelerated implementations. And we'll need it for chapoly
too, obviously. So let's work out a good way to hook that all into the
Zinc-style interface. [2] does it in a very effective way that's
overall quite good for performance and easy to follow. The
chacha20-x86_64-glue.c code itself gets called via the static symbol
chacha20_arch. This is implemented for each platform with a fall back
to one that returns false, so that the generic code is called. The
Zinc stuff here is obvious, simple, and I'm pretty sure you know
what's up with it.

I prefer each of these glue implementations to live in
lib/zinc/chacha20/chacha20-${ARCH}-glue.c. You don't like that and
want things in arch/${ARCH}/crypto/chacha20-glue.c. Okay, sure, fine,
let's do all the naming and organization and political stuff how you
like, and I'll leave aside my arguments about why I disagree. Let's
take stock of where that leaves us, in terms of files:

- lib/crypto/chacha20.c: this has a generic implementation, but at the
top of the generic implementation, it has some code like "if
(chacha20_arch(..., ..., ...)) return;"
- arch/x86_64/crypto/chacha20-glue.c: this has the chacha20_arch()
implementation, which branches out to the various SIMD implementations
depending on some booleans calculated at module load time.
- arch/arm/crypto/chacha20-glue.c: this has the chacha20_arch()
implementation, which branches out to the various SIMD implementations
depending on some booleans calculated at module load time.
- arch/mips/crypto/chacha20-glue.c: this has the chacha20_arch()
implementation, which contains an assembly version that's always run
unconditionally.

Our goal is that chacha20_arch() from each of these arch glues gets
included in the lib/crypto/chacha20.c compilation unit. The reason why
we want it in its own unit is so that the inliner can get rid of
unreached code and more tightly integrate the branches. For the MIPS
case, the advantage is clear. Here's how Zinc handles it: [3]. Some
simple ifdefs and includes. Easy and straightforward. Some people
might object, though, and it sounds like you might. So let's talk
about some alternative mechanisms with their pros and cons:

- The Zinc method: straightforward, but not everybody likes ifdefs.
- Stick the filename to be included into a Kconfig variable and let
the configuration system do the logic: also straightforward. Not sure
it's kosher, but it would work.
- Weak symbols: we don't get inlining or the dead code elimination.
- Function pointers: ditto, plus spectre.
- Other ideas you might have? I'm open to suggestions here.

[2] https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/tree/lib/zinc/chacha20/chacha20-x86_64-glue.c?h=jd/wireguard#n54
[3] https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/tree/lib/zinc/chacha20/chacha20.c?h=jd/wireguard#n19
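
[To make that mechanism concrete, the ifdef-and-include shape under
discussion looks roughly like the sketch below. Config symbol names and
the chacha20_arch() signature are simplified, not copied from [3].]

/* lib/crypto/chacha20.c -- simplified sketch, not the actual Zinc file */

#if defined(CONFIG_CRYPTO_LIB_CHACHA20_X86_64)
#include "chacha20-x86_64-glue.c"	/* provides chacha20_arch() */
#elif defined(CONFIG_CRYPTO_LIB_CHACHA20_ARM)
#include "chacha20-arm-glue.c"
#elif defined(CONFIG_CRYPTO_LIB_CHACHA20_MIPS)
#include "chacha20-mips-glue.c"
#else
/* No arch glue: this compiles away entirely. */
static inline bool chacha20_arch(struct chacha20_ctx *ctx, u8 *dst,
				 const u8 *src, size_t len)
{
	return false;
}
#endif

void chacha20(struct chacha20_ctx *ctx, u8 *dst, const u8 *src, size_t len)
{
	/*
	 * Same compilation unit, so the compiler can inline
	 * chacha20_arch() and drop the generic path on architectures
	 * where the arch version always succeeds (the MIPS case).
	 */
	if (chacha20_arch(ctx, dst, src, len))
		return;

	chacha20_generic(ctx, dst, src, len);
}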

> What I *don't* want is to merge WireGuard with its own library based
> crypto now, and extend that later for async accelerators once people
> realize that we really do need that as well.

I wouldn't worry so much about that. Zinc/library-based-crypto is just
trying to fulfill the boring synchronous pure-code part of crypto
implementations. For the async stuff, we can work together on
improving the existing crypto API to be more appealing, in tandem with
some interesting research into packet queuing systems. From the other
thread, you might have seen that already Toke has cool ideas that I
hope we can all sit down and talk about. I'm certainly not interested
in "bolting" anything on to Zinc/library-based-crypto, and I'm happy
to work to improve the asynchronous API for doing asynchronous crypto.

Jason

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API
  2019-09-26 20:47     ` Jason A. Donenfeld
@ 2019-09-26 21:36       ` Andy Lutomirski
  2019-09-27  7:20         ` Jason A. Donenfeld
  0 siblings, 1 reply; 61+ messages in thread
From: Andy Lutomirski @ 2019-09-26 21:36 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Catalin Marinas, Herbert Xu, Arnd Bergmann, Ard Biesheuvel,
	Greg KH, Eric Biggers, Samuel Neves, Will Deacon,
	Linux Crypto Mailing List, Andy Lutomirski, Marc Zyngier,
	Dan Carpenter, Linus Torvalds, David Miller, linux-arm-kernel

On Thu, Sep 26, 2019 at 1:52 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> Hi Ard,
>
>
> Our goal is that chacha20_arch() from each of these arch glues gets
> included in the lib/crypto/chacha20.c compilation unit. The reason why
> we want it in its own unit is so that the inliner can get rid of
> unreached code and more tightly integrate the branches. For the MIPS
> case, the advantage is clear.

IMO this needs numbers.  My suggestion from way back, which is at
least a good deal of the way toward being doable, is to do static
calls.  This means that the common code will call out to the arch code
via a regular CALL instruction and will *not* inline the arch code.
This means that the arch code could live in its own module, it can be
selected at boot time, etc.  For x86, inlining seems a bit nuts to
avoid a whole mess of:

if (use avx2)
  do_avx2_thing();
else if (use avx1)
  do_avx1_thing();
else
  etc;

On x86, direct calls are pretty cheap.  Certainly for operations like
curve25519, I doubt you will ever see a real-world effect from
inlining.  I'd be surprised for chacha20.  If you really want inlining
to dictate the overall design, I think you need some real numbers for
why it's necessary.  There also needs to be a clear story for how
exactly making everything inline plays with the actual decision of
which implementation to use.  I think it's also worth noting that LTO
is coming.
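
[For comparison, a minimal sketch of the static-call shape being
described, assuming the static_call() infrastructure from the (at that
time still out-of-tree) static call patch set; the chacha20_* function
names and the ctx type are illustrative only.]

#include <linux/static_call.h>
#include <linux/init.h>
#include <asm/cpufeature.h>

/* Generic C implementation is the default call target. */
void chacha20_generic(struct chacha20_ctx *ctx, u8 *dst,
		      const u8 *src, unsigned int len);
void chacha20_ssse3(struct chacha20_ctx *ctx, u8 *dst,
		    const u8 *src, unsigned int len);
void chacha20_avx2(struct chacha20_ctx *ctx, u8 *dst,
		   const u8 *src, unsigned int len);

DEFINE_STATIC_CALL(chacha20_impl, chacha20_generic);

/* Arch code (possibly a module) retargets the call once at init time. */
static int __init chacha20_x86_init(void)
{
	if (boot_cpu_has(X86_FEATURE_AVX2))
		static_call_update(chacha20_impl, chacha20_avx2);
	else if (boot_cpu_has(X86_FEATURE_SSSE3))
		static_call_update(chacha20_impl, chacha20_ssse3);
	return 0;
}

/*
 * Callers get a plain, patched CALL instruction: no feature branches,
 * no indirect call, but also no cross-unit inlining.
 */
void chacha20(struct chacha20_ctx *ctx, u8 *dst, const u8 *src,
	      unsigned int len)
{
	static_call(chacha20_impl)(ctx, dst, src, len);
}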

--Andy

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: chapoly acceleration hardware [Was: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API]
  2019-09-26 11:06     ` chapoly acceleration hardware [Was: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API] Jason A. Donenfeld
  2019-09-26 11:38       ` Toke Høiland-Jørgensen
  2019-09-26 13:52       ` Pascal Van Leeuwen
@ 2019-09-26 22:47       ` Jakub Kicinski
  2 siblings, 0 replies; 61+ messages in thread
From: Jakub Kicinski @ 2019-09-26 22:47 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Toke Høiland-Jørgensen, Catalin Marinas, Herbert Xu,
	Arnd Bergmann, Ard Biesheuvel, Greg KH, Eric Biggers,
	Will Deacon, Dave Taht, Willy Tarreau, Samuel Neves,
	Pascal Van Leeuwen, Netdev, Linux Crypto Mailing List,
	Andy Lutomirski, Marc Zyngier, Dan Carpenter, Linus Torvalds,
	David Miller, linux-arm-kernel

On Thu, 26 Sep 2019 13:06:51 +0200, Jason A. Donenfeld wrote:
> On Thu, Sep 26, 2019 at 12:19 PM Pascal Van Leeuwen wrote:
> > Actually, that assumption is factually wrong. I don't know if anything
> > is *publicly* available, but I can assure you the silicon is running in
> > labs already. And something will be publicly available early next year
> > at the latest. Which could nicely coincide with having Wireguard support
> > in the kernel (which I would also like to see happen BTW) ...
> >
> > Not "at some point". It will. Very soon. Maybe not in consumer or server
> > CPUs, but definitely in the embedded (networking) space.
> > And it *will* be much faster than the embedded CPU next to it, so it will
> > be worth using it for something like bulk packet encryption.  
> 
> Super! I was wondering if you could speak a bit more about the
> interface. My biggest questions surround latency. Will it be
> synchronous or asynchronous? If the latter, why? What will its
> latencies be? How deep will its buffers be? The reason I ask is that a
> lot of crypto acceleration hardware of the past has been fast and
> having very deep buffers, but at great expense of latency. In the
> networking context, keeping latency low is pretty important.

FWIW, are you familiar with the existing kTLS and IPsec offloads in the
networking stack? They offload the crypto into the NIC, inline, which
helps with both latency and processing overhead.

There is also NIC silicon which can do some ChaCha/Poly, although
I'm not familiar enough with WireGuard to know if offload to existing
silicon will be possible.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: chapoly acceleration hardware [Was: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API]
  2019-09-26 13:52       ` Pascal Van Leeuwen
@ 2019-09-26 23:13         ` Dave Taht
  2019-09-27 12:18           ` Pascal Van Leeuwen
  0 siblings, 1 reply; 61+ messages in thread
From: Dave Taht @ 2019-09-26 23:13 UTC (permalink / raw)
  To: Pascal Van Leeuwen
  Cc: Jason A. Donenfeld, Toke Høiland-Jørgensen,
	Catalin Marinas, Herbert Xu, Arnd Bergmann, Ard Biesheuvel,
	Greg KH, Eric Biggers, Willy Tarreau, Samuel Neves, Will Deacon,
	Netdev, Linux Crypto Mailing List, Andy Lutomirski, Marc Zyngier,
	Dan Carpenter, Linus Torvalds, David Miller, linux-arm-kernel

On Thu, Sep 26, 2019 at 6:52 AM Pascal Van Leeuwen
<pvanleeuwen@verimatrix.com> wrote:
>
> > [CC +willy, toke, dave, netdev]
> >
> > Hi Pascal
> >
> > On Thu, Sep 26, 2019 at 12:19 PM Pascal Van Leeuwen
> > <pvanleeuwen@verimatrix.com> wrote:
> > > Actually, that assumption is factually wrong. I don't know if anything
> > > is *publicly* available, but I can assure you the silicon is running in
> > > labs already. And something will be publicly available early next year
> > > at the latest. Which could nicely coincide with having Wireguard support
> > > in the kernel (which I would also like to see happen BTW) ...
> > >
> > > Not "at some point". It will. Very soon. Maybe not in consumer or server
> > > CPUs, but definitely in the embedded (networking) space.
> > > And it *will* be much faster than the embedded CPU next to it, so it will
> > > be worth using it for something like bulk packet encryption.
> >
> > Super! I was wondering if you could speak a bit more about the
> > interface. My biggest questions surround latency. Will it be
> > synchronous or asynchronous?
> >
> The hardware being external to the CPU and running in parallel with it,
> obviously asynchronous.
>
> > If the latter, why?
> >
> Because, as you probably already guessed, the round-trip latency is way
> longer than the actual processing time, at least for small packets.
>
> Partly because the only way to communicate between the CPU and the HW
> accelerator (whether that is crypto, a GPU, a NIC, etc.) that doesn't
> keep the CPU busy moving data is through memory, with the HW doing DMA.
> And, as any programmer should know, round trip times to memory are huge
> relative to the processing speed.
>
> And partly because these accelerators are very similar to CPU's in
> terms of architecture, doing pipelined processing and having multiple
> of such pipelines in parallel. Except that these pipelines are not
> working on low-level instructions but on full packets/blocks. So they
> need to have many packets in flight to keep those pipelines fully
> occupied. And packets need to move through the various pipeline stages,
> so they incur the time needed to process them multiple times. (just
> like e.g. a multiply instruction with a throughput of 1 per cycle
> actually may need 4 or more cycles to actually provide its result)
>
> Could you do that from a synchronous interface? In theory, probably,
> if you would spawn a new thread for every new packet arriving and
> rely on the scheduler to preempt the waiting threads. But you'd need
> as many threads as the HW  accelerator can have packets in flight,
> while an async would need only 2 threads: one to handle the input to
> the accelerator and one to handle the output (or at most one thread
> per CPU, if you want to divide the workload)
>
> Such a many-thread approach seems very inefficient to me.
>
> > What will its latencies be?
> >
> Depends very much on the specific integration scenario (i.e. bus
> speed, bus hierarchy, cache hierarchy, memory speed, etc.) but on
> the order of a few thousand CPU clocks is not unheard of.
> Which is an eternity for the CPU, but still only a few uSec in
> human time. Not a problem unless you're a high-frequency trader and
> every ns counts ...
> It's not like the CPU would process those packets in zero time.
>
> > How deep will its buffers be?
> >
> That of course depends on the specific accelerator implementation,
> but possibly dozens of small packets in our case, as you'd need
> at least width x depth packets in there to keep the pipes busy.
> Just like a modern CPU needs hundreds of instructions in flight
> to keep all its resources busy.
>
> > The reason I ask is that a
> > lot of crypto acceleration hardware of the past has been fast and
> > having very deep buffers, but at great expense of latency.
> >
> Define "great expense". Everything is relative. The latency is very
> high compared to per-packet processing time but at the same time it's
> only on the order of a few uSec. Which may not even be significant on
> the total time it takes for the packet to travel from input MAC to
> output MAC, considering the CPU will still need to parse and classify
> it and do pre- and postprocessing on it.
>
> > In the networking context, keeping latency low is pretty important.
> >
> I've been doing this for IPsec for nearly 20 years now and I've never
> heard anyone complain about our latency, so it must be OK.

Well, it depends on where your bottlenecks are. On low-end hardware
you can and do tend to bottleneck on the crypto step, and with
uncontrolled, non-fq'd non-aqm'd buffering you get results like this:

http://blog.cerowrt.org/post/wireguard/

so in terms of "threads" I would prefer to think of flows entering
the tunnel and attempting to multiplex them as best as possible
across the crypto hard/software so that minimal in-hw latencies are experienced
for most packets and that the coupled queue length does not grow out of control,

Adding fq_codel's hashing algo and queuing to ipsec as was done in
commit: 264b87fa617e758966108db48db220571ff3d60e to leverage
the inner hash...

Had some nice results:

before: http://www.taht.net/~d/ipsec_fq_codel/oldqos.png (100ms spikes)
After: http://www.taht.net/~d/ipsec_fq_codel/newqos.png (2ms spikes)

I'd love to see more vpn vendors using the rrul test or something even
nastier to evaluate their results, rather than dragstrip bulk throughput tests,
steering multiple flows over multiple cores.

> We're also doing (fully inline, no CPU involved) MACsec cores, which
> operate at layer 2 and I know it's a concern there for very specific
> use cases (high frequency trading, precision time protocol, ...).
> For "normal" VPN's though, a few uSec more or less should be a non-issue.

Measured buffering is typically 1000 packets in userspace vpns. If you
can put data in, faster than you can get it out, well....

> > Already
> > WireGuard is multi-threaded which isn't super great all the time for
> > latency (improvements are a work in progress). If you're involved with
> > the design of the hardware, perhaps this is something you can help
> > ensure winds up working well? For example, AES-NI is straightforward
> > and good, but Intel can do that because they are the CPU. It sounds
> > like your silicon will be adjacent. How do you envision this working
> > in a low latency environment?
> >
> Depends on how low low-latency is. If you really need minimal latency,
> you need an inline implementation. Which we can also provide, BTW :-)
>
> Regards,
> Pascal van Leeuwen
> Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
> www.insidesecure.com



-- 

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption
  2019-09-26 16:35       ` Linus Torvalds
@ 2019-09-27  0:15         ` Pascal Van Leeuwen
  2019-09-27  1:30           ` Linus Torvalds
  2019-09-27  2:06           ` Linus Torvalds
  0 siblings, 2 replies; 61+ messages in thread
From: Pascal Van Leeuwen @ 2019-09-27  0:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jason A . Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Ard Biesheuvel, Greg KH, Eric Biggers, Samuel Neves,
	Linux Crypto Mailing List, Andy Lutomirski, Marc Zyngier,
	Dan Carpenter, Will Deacon, David Miller, Linux ARM

> On Thu, Sep 26, 2019 at 2:40 AM Pascal Van Leeuwen
> <pvanleeuwen@verimatrix.com> wrote:
> >
> > While I agree with the principle of first merging Wireguard without
> > hooking it up to the Crypto API and doing the latter in a later,
> > separate patch, I DON'T agree with your bashing of the Crypto API
> > or HW crypto acceleration in general.
> 
> I'm not bashing hardware crypto acceleration.
> 
> But I *am* bashing bad interfaces to it.
> 
And I'm arguing the interface is not that bad; it is the way it is
for good reasons. I think we all agree it's not suitable for the
occasional crypto operation using a fixed algorithm. For that, by
all means use direct library calls. No one is arguing against that.

However, I dare you to come up with something better that would
provide the same flexibility for doing configurable crypto and
offloading these (combined) operations to crypto acceleration 
hardware, depending on its actual capabilities.

i.e. something that would allow offloading rfc7539esp to
accelerators capable of doing that whole transform, while also being
able to offload separate chacha and/or poly operations to less
capable hardware. While actually being able to keep those deep HW
pipelines filled.

> Honestly, you need to face a few facts, Pascal.
> 
> Really.
> 
> First off: availability.
> 
>  (a) hardware doesn't exist right now outside your lab
> 
> This is a fact.
> 
Actually, that's _not_ a fact at all. For three reasons:

a) I don't even have real hardware (for this). We're an IP provider, 
   we don't make actual hardware. True, I have an FPGA dev board
   for prototyping but an ASIC guy like me considers that "just SW".
b) The actual hardware is in our customers labs, so definitely 
   outside of "my" lab. I don't even _have_ a lab. Just a full desk :-)
c) NXP Layerscape chips supporting Poly-Chacha acceleration can be bought
   right now (i.e. NXP LX2160A, look it up). CAAM support for Poly-Chacha
   has been in mainline since kernel 5.0. So there's your hardware.

And does it matter that it doesn't exist today if it is a known fact
it *will* be there in just a few months? The general idea is that you
make sure the SW support is ready *before* you start selling the HW.

>  (b) hardware simply WILL NOT exist in any huge number for years,
> possibly decades. If ever,
> 
That remark is just very stupid. The hardware ALREADY exists, and
more hardware is in the pipeline. Once this stuff is designed in, it
usually stays in for many years to come. And these are chips sold in
_serious_ quantities, to be used in things like wireless routers and
DSL, cable and FTTH modems, 5G base stations, etc. etc.

> This is just reality, Pascal. 
>
I guess you must live in a different reality from mine? Because I see
some clear mismatch with _known facts_ in *my* reality. But then again,
I'm in this business so I know exactly what's out there. Unlike you.

> Even if you release your hardware tomorrow, 
>
We actually released this hardware a looooong time ago, it just takes
very long for silicon to reach the market. And that time is soon.

> where do you think it will exists? Laptops? PC's? Phones?
>
I already answered this above. Many embedded networking use cases.

> No. No. And no.
> 
Shouting "no" many times won't make it go away ;-)

> Phones are likely the strongest argument for power use, but phones
> won't really start using it until it's on the SoC, because while they
> care deeply about power, they care even more deeply about a lot of
> other things a whole lot more. Form factor, price, and customers that
> care.
> So phones aren't happening. Not for years, and not until it's a big
> deal and standard IP that everybody wants.
>
It will likely not be OUR HW, as the Qualcomms, Samsungs and Mediateks
of this world are very tough cookies to crack for an IP provider, but 
I would expect them to do their own at some point. I would also NOT
expect them to upstream any driver code for that. It may already exist!

> Laptops and PC's? No. Look at all the crypto acceleration they have today.
> 
No argument there. If an Intel or AMD adds crypto acceleration, they will
add it to the CPU core itself, for obvious reasons. If you don't actually
design the CPU, you don't have that choice though. (and from a power
consumption perspective, it's actually not that great of a solution)

> That was sarcasm, btw, just to make it clear. It's simply not a market.
> 
Eh ... we've been selling into that market for more than 20 years and
we still exist today? How is that possible if it doesn't exist?

> End result: even with hardware, the hardware will be very rare. Maybe
> routers that want to sell particular features in the near future.
> 
No, these are just the routers going into *everyone's* home. And 5G
basestations arriving at every other street corner. I wouldn't call 
that rare, exactly.

> Again, this isn't theory. This is that pesky little thing called
> "reality". It sucks, I know.
> 
You very obviously have absolutely NO idea what you're talking about.
Either that or you're living in some alternate reality.

> But even if you *IGNORE* the fact that hardware doesn't exist right
> now
>
Which I've proven to be FALSE

>, and won't be widely available for years (or longer),
>
Which again doesn't match the FACTS.

> there's another little fact that you are ignoring:
> 
> The async crypto interfaces are badly designed. Full stop.
> 
They may not be perfect. I think you are free to come up with solutions
to improve on that? But if such a solution would make it impossible to
offload to crypto hardware then *that* would be truly bad interface 
design. Do you have similar arguments about the interfacing to e.g.
graphics processing on the GPU? I'm sure those could be simplified to
be  easier to use and make a full SW implementation run that much more
efficiently ... (/sarcasm)

> Seriously. This isn't rocket science. This is very very basic Computer
> Science 101.
> 
I know. Rocket science is _easy_ ;-)

> Tell me, what's _the_ most important part about optimizing something?
> 	
> Yeah, it's "you optimize for the common case". But let's ignore that
> part, since I already said that hardware isn't the common case but I
> promised that I'd ignore that part.
> 
> The _other_ most important part of optimization is that you optimize
> for the case that _matters_.
> 
Matters to _whom_. What matters to someone running a fat desktop or
server CPU is _not_ what matters to someone running on a relatively
weak embedded CPU that _does_ have powerful crypto acceleration on the
side.
And when it comes to the _common_ case: there's actually a heck of a 
lot more embedded SoCs out there than x86 server/desktop CPU's. Fact!
You probably just don't know much about most of them.

But if you're arguing that the API should be lightweight and not add
significant overhead to a pure SW implementation, then I can agree.
However, I have not seen any PROOF (i.e. actual measurements, not 
theory) that it actually DOES add a lot of overhead. Nor any suggestions 
(beyond the hopelessly naive) for improving it.

> And the async crypto case got that completely wrong, and the wireguard
> patch shows that issue very clearly.
> 
> The fact is, even you admit that a few CPU cycles don't matter for the
> async case where you have hardware acceleration, because the real cost
> is going to be in the IO - whether it's DMA, cache coherent
> interconnects, or some day some SoC special bus.
>
I was talking _latency_ not _throughput_. I am _very_ interested in
reducing (API/driver) CPU overhead, if only because it doesn't allow
our HW to reach it's full potential. I'm working hard on optimizing our
driver for that right now.
And if something can be improved in the Crypto API itself there, without 
compromising it's functionality and flexibility, then I'm all for that.

> The CPU cycles just don't matter, they are entirely in the noise.
> 
> What does that mean?  Think about it.
> 
> [ Time passes ]
> 
> Ok, I'll tell you: it means that the interface shouldn't be designed
> for async hw-assisted crypto. 
>
If you don't design them with that in mind, you simply won't be able
to effectively use the HW-assisted crypto at all. Just like you don't
design an API to a GPU for running a software implementation on the 
CPU, but for minimizing state changes and batch-queuing large strips
of triangles as that's what the _HW_ needs to be efficiently used.
Just sayin'.

> The interface should be designed for the
> case that _does_ care about CPU cycles, and then the async hw-assisted
> crypto should be hidden by a conditional, and its (extra) data
> structures should be the ones that are behind some indirect pointers
> etc.  Because, as we agreed, the async external hw case really doesn't
> care. It it has to traverse a pointer or two, and if it has to have a
> *SEPARATE* keystore that has longer lifetimes, then the async code can
> set that up on its own, but that should not affect the case that
> cares.
> 
What the hardware cares about is that you can batch queue your requests
and not busy-wait for any results. What the hardware cares about is 
that you don't send separate requests for encryption, authentication,
IV generation, etc, but combine this in a single request, hence the
templates in the Crypto API. What the hardware may care about, is that
you keep your key changes limited to just those actually required.
That is _fundamental_ to getting performance. 
Note that many SW implementations, which require multiple independent
operations in flight to achieve maximum efficiency due to deep (dual)
pipelines and/or spend significant cycles on running the key scheduling,
will ALSO benefit from these properties.
An async interface also makes it possible to run the actual crypto ops 
in multiple independent threads, on multiple CPUs, although I'm not
sure if the current Crypto API actually leverages that right now.

> Really, this is fundamental, and really, the crypto interfaces got this wrong.
> 
> This is in fact _so_ fundamental that the only possible reason you can
> disagree is because you don't care about reality or fundamentals, and
> you only care about the small particular hardware niche you work in
> and nothing else.
> 
Well, to be completely honest, I for sure don't care about making the SW
implementations run faster at the expense of HW offload capabilities.
Which is obvious, as I make a _living_ creating HW offload solutions.
Why would I actively work towards obsoleting my work?

FACT is that dedicated HW, in many cases, is still MUCH faster than the
CPU. At much lower consumption to boot. So why would you NOT want to 
leverage that, if it's available? That would just be dumb.

Again, I don't see you making the same argument about moving graphics
functionality from the GPU to the CPU. So why does crypto *have* to be
on the CPU? I just don't understand _why_ you care about that so much.

> You really should think about this a bit.
> 
I've been thinking about this daily for about 19 years now. It's my job.

> > However, if you're doing bulk crypto like network packet processing
> > (as Wireguard does!) or disk/filesystem encryption, then that cipher
> > allocation only happens once every blue moon and the overhead for
> > that is totally *irrelevant* as it is amortized over many hours or
> > days of runtime.
> 
> This is not true. It's not true because the overhead happens ALL THE TIME.
> 
The overhead for the _cipher allocation_ (because that's what _I_ was
talking about specifically) happens all the time? You understand you
really only need to do that twice per connection? (once per direction)

But there will be some overhead on the cipher requests themselves,
sure. A small compromise to make for the _possibility_ to use HW 
offload when it IS available. I would agree that if that overhead
turns out to be very significant, then something needs to be done
about that. But then PROVE that that's the case and provide solutions
that do not immediately make HW offload downright impossible.
As our HW is _for sure_ much faster than the CPU cluster (yes, all
CPU's combined at full utilization) it is _usually_ paired with.

> And in 99.9% of all cases there are no upsides from the hw crypto,
> because the hardware DOES NOT EXIST.
> 
Which is a _false_ assumption, we covered that several times before.

> You think the "common case" is that hardware encryption case, but see
> above. It's really not.
> 
See my argument above about there being many more embedded SoC's out
there than x86 CPU's. Which usually have some form of crypto accel
on the side. Which will, eventually, have Chacha-Poly support 
because that's what the industry is currently gravitating towards.
So define _common_ case.

> And when you _do_ have HW encryption, you could do the indirection.
> 
Again, not if the API was not architected to do so from the get-go.

> But that's not an argument for always having the indirection.
> 
> What indirection am I talking about?
> 
> There's multiple levels of indirection in the existing bad crypto interfaces:
> 
>  (a) the data structures themselves. This is things like keys and
> state storage, both of which are (often) small enough that they would
> be better off on a stack, or embedded in the data structures of the
> callers.
> 
>  (b) the calling conventions. This is things like indirection -
> usually several levels - just to find the function pointer to call to,
> and then an indirect call to that function pointer.
> 
> and both of those are actively bad things when you don't have those
> hardware accelerators.
> 
I would say those things are not required just for hardware acceleration,
so perhaps something can be improved there in the existing code.
Ever tried suggesting these to the Crypto API maintainers before?

> When you *do* have them, you don't care. Everybody agrees about that.
> But you're ignoring the big white elephant in the room, which is that
> the hw really is very rare in the end, even if it were to exist at
> all.
> 
Crypto acceleration in general is _not_ rare, almost any embedded SoC
has it. The big white elephant in the room is _actually_ that there 
never were decent, standard, ubiquitous API's available to use them
so most of them could only be used from dedicated in-house applications.
Which seriously hampered general acceptance and actual _use_.

> > While I generally dislike this whole hype of storing stuff in
> > textual formats like XML and JSON and then wasting lots of CPU
> > cycles on parsing that, I've learned to appreciate the power of
> > these textual Crypto API templates, as they allow a hardware
> > accelerator to advertise complex combined operations as single
> > atomic calls, amortizing the communication overhead between SW
> > and HW. It's actually very flexible and powerful!
> 
> BUT IT IS FUNDAMENTALLY BADLY DESIGNED!
> 
> Really.
> 
> You can get the advantages of hw-accelerated crypto with good designs.
> The current crypto layer is not that.
> 
> The current crypto layer simply has indirection at the wrong level.
> 
> Here's how it should have been done:
> 
>  - make the interfaces be direct calls to the crypto you know you want.
> 
Which wouldn't work for stuff like IPsec and dmcrypt, where you want
to be able to configure the crypto to be used, i.e. it's _not_ fixed.
And you don't want to have to modify those applications _every time_ a
new algorithm is added. As the application shouldn't care about that,
it should just be able to leverage it for what it is.

Also, for combined operations, there needs to be some central place
where they are decomposed into optimal sub-operations, if necessary, 
without bothering the actual applications with that.

Having a simple direct crypto call is just a very naive solution
that would require changing _every_ application (for which this is
relevant) anytime you add a ciphersuite. It does not scale.

Yes, that will - by necessity - involve some indirection but as long as
you don't process anything crazy short, a single indirect call (or 2)
should really not be that significant on the total operation.
(and modern CPU's can predict indirect branches pretty well, too)

Note that all these arguments have actually _nothing_ to do with
supporting HW acceleration, they apply just as well to SW only.

>  - make the basic key and state buffer be explicit and let people
> allocate them on their own stacks or whatever
> 
Hey, no argument there. I don't see any good reason why the key can't
be on the stack. I doubt any hardware would be able to DMA that as-is
directly, and in any case, key changes should be infrequent, so copying
it to some DMA buffer should not be a performance problem.
So maybe that's an area for improvement: allow that to be on the stack.

> "But what about hw acceleration?"
> 
>  - add a single indirect private pointer that the direct calls can use
> for their own state IF THEY HAVE REASON TO
> 
>  - make the direct crypto calls just have a couple of conditionals
> inside of them
> 
> Why? Direct calls with a couple of conditionals are really cheap for
> the non-accelerated case. MUCH cheaper than the indirection overhead
> (both on a state level and on a "crypto description" level) that the
> current code has.
> 
I already explained the reasons for _not_ doing direct calls above.

> Seriously. The hw accelerated crypto won't care about the "call a
> static routine" part.
>
Correct! It's totally unrelated.

> The hw accelerated crypto won't care about the
> "I need to allocate a copy of the key because I can't have it on
> stack, and need to have it in a DMA'able region". 
>
Our HW surely won't; some HW might care, but copying it should be OK.

> The hw accelerated
> crypto won't care about the two extra instructions that do "do I have
> any extra state at all, or should I just do the synchronous CPU
> version" before it gets called through some indirect pointer.
> 
Actually, here I _do_ care. I want minimal CPU overhead just as much
as you do, probably even more desperately. But OK, I would be able to
live with that, if that were the _only_ downside.

> So there is absolutely NO DOWNSIDE for hw accelerated crypto to just
> do it right, and use an interface like this:
> 
>        if (!chacha20poly1305_decrypt_sg(sg, sg, skb->len, NULL, 0,
>                                         PACKET_CB(skb)->nonce, key->key,
>                                         simd_context))
>                return false;
> 
Well, for one thing, a HW API should not expect the result to be
available when the function call returns (if that's what you
mean here). That would just be WRONG.
HW offload doesn't work like that. Results come much later, and 
you need to keep dispatching more requests until the HW starts
asserting backpressure. You need to keep that HW pipeline filled.

> because for the hw accelerated case the costs are all elsewhere.
> 
> But for synchronous encryption code on the CPU? Avoiding the
> indirection can be a huge win. Avoiding allocations, extra cachelines,
> all that overhead. Avoiding indirect calls entirely, because doing a
> few conditional branches (that will predict perfectly) on the state
> will be a lot more efficient, both in direct CPU cycles and in things
> like I$ etc.
> 
Again, HW acceleration does not depend on the indirection _at all_,
that's there for entirely different purposes I explained above.
HW acceleration _does_ depend greatly on a truly async interface though.
So queue requests on one side, handle results from the other side
in some callback func off of an interrupt handler. (with proper
interrupt coalescing, of course, and perhaps some NAPI-like 
functionality to further reduce interrupt rates when busy)
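
[That is, roughly, the standard async pattern the existing crypto API
already supports; a minimal sketch follows, with the callback signature
as in the API of that time and the wg_* names purely hypothetical.]

#include <crypto/aead.h>
#include <linux/skbuff.h>

static void wg_decrypt_done(struct crypto_async_request *base, int err)
{
	struct sk_buff *skb = base->data;

	/* Runs later, typically off the driver's completion interrupt. */
	wg_queue_decrypted_skb(skb, err);	/* hypothetical helper */
}

static int wg_decrypt_async(struct aead_request *req,
			    struct scatterlist *sg, unsigned int len,
			    u8 *iv, struct sk_buff *skb)
{
	int err;

	aead_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
				  wg_decrypt_done, skb);
	aead_request_set_ad(req, 0);
	aead_request_set_crypt(req, sg, sg, len, iv);

	err = crypto_aead_decrypt(req);
	if (err == -EINPROGRESS || err == -EBUSY)
		return 0;	/* queued; result arrives via the callback */
	return err;		/* completed synchronously (or failed) */
}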

> In contrast, forcing people to use this model:
> 
>        if (unlikely(crypto_aead_reqsize(key->tfm) > 0)) {
>                req = aead_request_alloc(key->tfm, GFP_ATOMIC);
>                if (!req)
>                        return false;
>        } else {
>                req = &stackreq;
>                aead_request_set_tfm(req, key->tfm);
>        }
> 
Agree that is fishy, but it is something that could be fixed.

>        aead_request_set_ad(req, 0);
>
I'd rather see this being part of the set_crypt call as well.
I never said I liked _all_ decisions made in the API.
Likely this is because AEAD was added as an afterthought.

>        aead_request_set_callback(req, 0, NULL, NULL);
>
This is just inevitable for HW acceleration ...

>        aead_request_set_crypt(req, sg, sg, skb->len,
>                               (u8 *)&PACKET_CB(skb)->ivpad);
>        err = crypto_aead_decrypt(req);
>
It would probably be more efficient if set_crypt and _decrypt 
could be combined in a single call (together with _set_ad). 
No argument there and these decisions have _nothing_ to do
with being able to do HW acceleration or not.
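
[Purely to illustrate that point, a combined helper would just fold
those calls together. No such function exists in the crypto API; the
name crypto_aead_decrypt_one() is hypothetical.]

#include <crypto/aead.h>

/* Hypothetical convenience wrapper, not a real API function. */
static inline int crypto_aead_decrypt_one(struct aead_request *req,
					  struct crypto_aead *tfm,
					  struct scatterlist *sg,
					  unsigned int len, u8 *iv,
					  unsigned int assoclen,
					  crypto_completion_t done,
					  void *data)
{
	aead_request_set_tfm(req, tfm);
	aead_request_set_callback(req, 0, done, data);
	aead_request_set_ad(req, assoclen);
	aead_request_set_crypt(req, sg, sg, len, iv);
	return crypto_aead_decrypt(req);
}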

Trust me, I have a whole list of things I don't like about the
API myself; it's not exactly ideal for HW acceleration either.
(Note that SW overhead actually matters _more_ when you do HW 
acceleration, as the HW is often so fast that the SW is the 
actual bottleneck!).

But I have faith that, over time, I may be able to get some
improvements in (which should improve both HW _and_ SW use
cases by the way). By working _with_ the Crypto API people
and being _patient_. Not by telling them they suck.

>        if (unlikely(req != &stackreq))
>                aead_request_free(req);
>        if (err)
>                return false;
> 
> isn't going to help anybody. It sure as hell doesn't help the
> CPU-synchronous case, and see above: it doesn't even help the hw
> accelerated case. It could have had _all_ that "tfm" work behind a
> private pointer that the CPU case never touches except to see "ok,
> it's NULL, I don't have any".
> 
> See?
> 
Yes, I agree with the point you have here. So let's fix that.

> The interface *should* be that chacha20poly1305_decrypt_sg() library
> interface, just give it a private pointer it can use and update. Then,
> *internally* if can do something like
> 
>      bool chacha20poly1305_decrypt_sg(...)
>      {
>              struct cc20p1305_ptr *state = *state_p;
>              if (state) {
>                      .. do basically the above complex thing ..
>                      return ret; .. or fall back to sw if the hw
> queues are full..
>              }
>              .. do the CPU only thing..
>      }
> 
But even the CPU-only thing may have several implementations, of which
you want to select the fastest one supported by the _detected_ CPU
features (i.e. SSE, AES-NI, AVX, AVX512, NEON, etc. etc.).
Do you think this would still be efficient if that were some
large if-else tree? Also, such a fixed implementation wouldn't scale.
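
To make the shapes concrete: a minimal sketch of the dispatch Linus
describes above, with the CPU-implementation selection left as a separate
problem (more on that further down the thread). Every name in it is
hypothetical; the real library signature proposed elsewhere in this series
differs:

	#include <linux/scatterlist.h>
	#include <linux/types.h>

	/* Hypothetical library entry point with optional hw offload. */
	bool chacha20poly1305_decrypt_sg(struct scatterlist *src,
					 struct scatterlist *dst, size_t len,
					 const u8 *ad, size_t ad_len, u64 nonce,
					 const u8 key[32], void **hw_state_p)
	{
		struct cc20p1305_hw_state *hw = *hw_state_p; /* NULL: no offload bound */

		if (hw) {
			/* Try to queue on the accelerator; fall back to the
			 * CPU path if its queues are full. */
			if (cc20p1305_hw_decrypt(hw, src, dst, len, ad, ad_len,
						 nonce, key))
				return true;
		}

		/* Synchronous CPU path: no allocation, no indirection. */
		return chacha20poly1305_decrypt_sg_cpu(src, dst, len, ad, ad_len,
						       nonce, key);
	}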

> and now you'd have no extra overhead for the no-hw-accel case, you'd
> have a more pleasant interface to use, and you could still use hw
> accel if you wanted to.
> 
I still get the impression you're mostly interested in the "pleasant
interface" while I don't see why that should be more important than
being able to use HW crypto efficiently. That reminds me of way back
when I was a junior designer and some - more senior - software
engineer forced me to implement a full hardware divider(!) for some
parameter that needed to be set only once at initialization time, just
because he was too lazy to do a few simple precomputes in the driver.
He considered that to be "unpleasant" as well. 

> THIS is why I say that the crypto interface is bad. It was designed
> for the wrong objectives. It was designed purely for a SSL-like model
> where you do a complex key and algorithm exchange dance, and you
> simply don't know ahead of time what crypto you are even using.
> 
I guess it was designed for that, sure. And that's how the IPsec stack
and dmcrypt (to name a few examples) need it. It's also how Wireguard
_will_ need it when we start adding more ciphersuites to Wireguard.
Which is a MUST anyway, if Wireguard wants to be taken seriously:
there MUST be a fallback ciphersuite. At least one. Just in case 
Chacha-Poly gets broken overnight somehow, in which case you need to
switch over instantly and can't wait for some new implementation.

If you really _don't_ need that, but just need a bit of fixed-algorithm
crypto, then by all means, don't go through the Crypto API. I've
already argued that on many occasions. I think people like Ard are
_already_ working on doing such crypto calls directly.

> And yes, that "I'm doing the SSL thing" used to be a major use of
> encryption. I understand why it happened. It was what people did in
> the 90's. People thought it was a good idea back then, and it was also
> most of the hw acceleration world.
> 
> And yes, in that model of "I don't have a clue of what crypto I'm even
> using" the model works fine. But honestly, if you can't admit to
> yourself that it's wrong for the case where you _do_ know the
> algorithm, you have some serious blinders on.
> 
But the point is - there are those cases where you _don't_ know and
_that_ is what the Crypto API is for. And just generally, crypto
really _should_ be switchable. So you don't need to wait for a
fix to ripple through a kernel release cycle when an algorithm gets
broken. I don't know many use cases for just one fixed algorithm.

> Just from a user standpoint, do you seriously think users _like_
> having to do the above 15+ lines of code, vs the single function call?
> 
I know I wouldn't. I also know I would do it anyway as I would 
understand _why_ I would be doing it. 

> The crypto interface really isn't pleasant, and you're wrong to
> believe that it really helps. The hw acceleration capability could
> have been abstracted away, instead of making that indirection be front
> and center.
> 
Again, the Crypto API aims to do more than just allow for HW
acceleration and your main gripes actually seem to be with the
"other" stuff.

> And again - I do realize the historical reasons for it. But
> understanding that doesn't magically make it wonderful.
> 
No one said it was wonderful. Or pleasant. Or perfect.
You definitely raised some points that I think _could_ be 
improved without compromising any functionality.
But some stuff you don't like just has good reasons to exist.
Reasons you may not agree with, but that doesn't make them invalid.

>                  Linus

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
www.insidesecure.com

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption
  2019-09-27  0:15         ` Pascal Van Leeuwen
@ 2019-09-27  1:30           ` Linus Torvalds
  2019-09-27  2:54             ` Linus Torvalds
                               ` (2 more replies)
  2019-09-27  2:06           ` Linus Torvalds
  1 sibling, 3 replies; 61+ messages in thread
From: Linus Torvalds @ 2019-09-27  1:30 UTC (permalink / raw)
  To: Pascal Van Leeuwen
  Cc: Jason A . Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Ard Biesheuvel, Greg KH, Eric Biggers, Samuel Neves,
	Linux Crypto Mailing List, Andy Lutomirski, Marc Zyngier,
	Dan Carpenter, Will Deacon, David Miller, Linux ARM

On Thu, Sep 26, 2019 at 5:15 PM Pascal Van Leeuwen
<pvanleeuwen@verimatrix.com> wrote:
>
> That remark is just very stupid. The hardware ALREADY exists, and
> more hardware is in the pipeline. Once this stuff is designed in, it
> usually stays in for many years to come. And these are chips sold in
> _serious_ quantities, to be used in things like wireless routers and
> DSL, cable and FTTH modems, 5G base stations, etc. etc.

Yes, I very much mentioned routers. I believe those can happen much
more quickly.

But I would very much hope that that is not the only situation where
you'd see wireguard used.

I'd want to see wireguard in an end-to-end situation from the very
client hardware. So laptops, phones, desktops. Not the untrusted (to
me) hw in between.

> No, these are just the routers going into *everyone's* home. And 5G
> basestations arriving at every other street corner. I wouldn't call
> that rare, exactly.

That's fine for a corporate tunnel between devices. Which is certainly
one use case for wireguard.

But if you want VPN for your own needs for security, you want it at
the _client_. Not at the router box. So that case really does matter.

And I really don't see the hardware happening in that space. So the
bad crypto interfaces only make the client _worse_.

See?

But on to the arguments that we actually agree on:

> Hey, no argument there. I don't see any good reason why the key can't
> be on the stack. I doubt any hardware would be able to DMA that as-is
> directly, and in any case, key changes should be infrequent, so copying
> it to some DMA buffer should not be a performance problem.
> So maybe that's an area for improvement: allow that to be on the stack.

It's not even just the stack. It's really that the crypto interfaces
are *designed* so that you have to allocate things separately, and
can't embed these things in your own data structures.

And they are that way, because the crypto interfaces aren't actually
about (just) hiding the hardware interface: they are about hiding
_all_ the encryption details.

There's no way to say "hey, I know the crypto I use, I know the key
size I have, I know the state size it needs, I can preallocate those
AS PART of my own data structures".

Because the interface is designed to be so "generic" that you simply
can't do those things, they are all external allocations, which is
inevitably slower when you don't have hardware.

And you've shown that you don't care about that "don't have hardware"
situation, and seem to think it's the only case that matters. That's
your job, after all.

But however much you try to claim otherwise, there's all these
situations where the hardware just isn't there, and the crypto
interface just forces nasty overhead for absolutely no good reason.

> I already explained the reasons for _not_ doing direct calls above.

And I've tried to explain how direct calls that do the synchronous
thing efficiently would be possible, but then _if_ there is hardware,
they can then fall back to an async interface.

> > So there is absolutely NO DOWNSIDE for hw accelerated crypto to just
> > do it right, and use an interface like this:
> >
> >        if (!chacha20poly1305_decrypt_sg(sg, sg, skb->len, NULL, 0,
> >                                         PACKET_CB(skb)->nonce, key->key,
> >                                         simd_context))
> >                return false;
> >
> Well, for one thing, a HW API should not expect the result to be
> available when the function call returns. (if that's what you
> mean here). That would just be WRONG.

Right. But that also shouldn't mean that when you have synchronous
hardware (ie CPU) you have to set everything up even though it will
never be used.

Put another way: even with hardware acceleration, the queuing
interface should be a simple "do this" interface.

The current crypto interface is basically something that requires all
the setup up-front, whether it's needed or not. And it forces those
very inconvenient and slow external allocations.

And I'm saying that causes problems, because it fundamentally means
that you can't do a good job for the common CPU  case, because you're
paying all those costs even when you need absolutely none of them.
Both at setup time, but also at run-time due to the extra indirection
and cache misses etc.

> Again, HW acceleration does not depend on the indirection _at all_,
> that's there for entirely different purposes I explained above.
> HW acceleration _does_ depend greatly on a truly async ifc though.

Can you realize that the world isn't just all hw acceleration?

Can you admit that the current crypto interface is just horrid for the
non-accelerated case?

Can you perhaps then also think that "maybe there are better models".

> So queue requests on one side, handle results from the other side
> in some callback func off of an interrupt handler.

Actually, what you can do - and what people *have* done - is to admit
that the synchronous case is real and important, and then design
interfaces that work for that one too.

You don't need to allocate resources ahead of time, and you don't have
to disallow just having the state buffer allocated by the caller.

So here's the *wrong* way to do it (and the way that crypto does it):

 - dynamically allocate buffers at "init time"

 - fill in "callback fields" etc before starting the crypto, whether
they are needed or not

 - call a "decrypt" function that then uses the indirect functions you
set up at init time, and possibly waits for it (or calls the callbacks
you set up)

note how it's all this "state machine" model where you add data to the
state machine, and at some point you say "execute" and then either you
wait for things or you get callbacks.

That makes sense for a hw crypto engine. It's how a lot of them work, after all.

But it makes _zero_ sense for the synchronous case. You did a lot of
extra work for that case, and because it was all a state machine, you
did it particularly inefficiently: not only do you have those separate
allocations with pointer following, the "decrypt()" call ends up doing
an indirect call to the CPU implementation, which is just quite slow
to begin with, particularly in this day and age with retpoline etc.

So what's the alternative?

I claim that a good interface would accept that "Oh, a lot of cases
will be synchronous, and a lot of cases use one fixed
encryption/decryption model".

And it's quite doable. Instead of having those callback fields and
indirection etc, you could have something more akin to this:

 - let the caller know what the state size is and allocate the
synchronous state in its own data structures

 - let the caller just call a static "decrypt_xyz()" function for xyz
decryption.

 - if you end up doing it synchronously, that function just returns
"done". No overhead. No extra allocations. No unnecessary stuff. Just
do it, using the buffers provided. End of story. Efficient and simple.

 - BUT.

 - any hardware could have registered itself for "I can do xyz", and
the decrypt_xyz() function would know about those, and *if* it has a
list of accelerators (hopefully sorted by preference etc), it would
try to use them. And if they take the job (they might not - maybe
their queues are full, maybe they don't have room for new keys at the
moment, which might be a separate setup from the queues), the
"decrypt_xyz()" function returns a _cookie_ for that job. It's
probably a pre-allocated one (the hw accelerator might preallocate a
fixed number of in-progress data structures).

And once you have that cookie, and you see "ok, I didn't get the
answer immediately" only THEN do you start filling in things like
callback stuff, or maybe you set up a wait-queue and start waiting for
it, or whatever".

See the difference in models? One forces that asynchronous model, and
actively penalizes the synchronous one.

The other _allows_ an asynchronous model, but is fine with a synchronous one.
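
A minimal sketch of what such a cookie-returning call could look like;
all of these names are made up for illustration, none of them exist in
the kernel:

	#include <linux/errno.h>
	#include <linux/scatterlist.h>
	#include <linux/skbuff.h>

	struct xyz_cookie;	/* preallocated by the accelerator driver */

	/* Returns NULL when the work was completed synchronously (CPU or
	 * immediate hw), with the result code in *err. */
	struct xyz_cookie *decrypt_xyz(struct xyz_state *state,
				       struct scatterlist *src,
				       struct scatterlist *dst,
				       unsigned int len, int *err);

	static int handle_packet(struct xyz_state *state, struct sk_buff *skb,
				 struct scatterlist *sg)
	{
		int err;
		struct xyz_cookie *cookie = decrypt_xyz(state, sg, sg,
							skb->len, &err);

		if (!cookie)
			return err;	/* sync path: nothing else was set up */

		/* Only the async path pays for the completion plumbing. */
		xyz_cookie_set_callback(cookie, my_decrypt_done, skb);
		return -EINPROGRESS;
	}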

> >        aead_request_set_callback(req, 0, NULL, NULL);
> >
> This is just inevitable for HW acceration ...

See above. It really isn't. You could do it *after* the fact, when
you've gotten that ticket from the hardware. Then you say "ok, if the
ticket is done, use these callbacks". Or "I'll now wait for this
ticket to be done" (which is what the above does by setting the
callbacks to zero).

Wouldn't that be lovely for a user?

I suspect it would be a nice model for a hw accelerator too. If you
have full queues or have problems allocating new memory or whatever,
you just let the code fall back to the synchronous interface.

> Trust me, I have whole list of things I don't like about the
> API myself, it's not exacty ideal for HW acceleration  either.

That's the thing. It's actively detrimental for "I have no HW acceleration".

And apparently it's not optimal for you guys either.

> But the point is - there are those case where you _don't_ know and
> _that_ is what the Crypto API is for. And just generally, crypto
> really _should_ be switchable.

It's very much not what wireguard does.

And honestly, most of the switchable ones have caused way more
security problems than they have "fixed" by being switchable.

                 Linus

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption
  2019-09-27  0:15         ` Pascal Van Leeuwen
  2019-09-27  1:30           ` Linus Torvalds
@ 2019-09-27  2:06           ` Linus Torvalds
  2019-09-27 10:11             ` Pascal Van Leeuwen
  1 sibling, 1 reply; 61+ messages in thread
From: Linus Torvalds @ 2019-09-27  2:06 UTC (permalink / raw)
  To: Pascal Van Leeuwen
  Cc: Jason A . Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Ard Biesheuvel, Greg KH, Eric Biggers, Samuel Neves,
	Linux Crypto Mailing List, Andy Lutomirski, Marc Zyngier,
	Dan Carpenter, Will Deacon, David Miller, Linux ARM

On Thu, Sep 26, 2019 at 5:15 PM Pascal Van Leeuwen
<pvanleeuwen@verimatrix.com> wrote:
>
> But even the CPU only thing may have several implementations, of which
> you want to select the fastest one supported by the _detected_ CPU
> features (i.e. SSE, AES-NI, AVX, AVX512, NEON, etc. etc.)
> Do you think this would still be efficient if that would be some
> large if-else tree? Also, such a fixed implementation wouldn't scale.

Just a note on this part.

Yes, with retpoline a large if-else tree is actually *way* better for
performance these days than even just one single indirect call. I
think the cross-over point is somewhere around 20 if-statements.

But those kinds of things also are things that we already handle well
with instruction rewriting, so they can actually have even less of an
overhead than a conditional branch. Using code like

  if (static_cpu_has(X86_FEATURE_AVX2))

actually ends up patching the code at run-time, so you end up having
just an unconditional branch. Exactly because CPU feature choices
often end up being in critical code-paths where you have
one-or-the-other kind of setup.

And yes, one of the big users of this is very much the crypto library code.

The code to do the above is disgusting, and when you look at the
generated code you see odd unreachable jumps and what looks like a
slow "bts" instruction that does the testing dynamically.

And then the kernel instruction stream gets rewritten fairly early
during the boot depending on the actual CPU capabilities, and the
dynamic tests get overwritten by a direct jump.
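
As a concrete (hedged) illustration of that pattern, x86 crypto glue code
could dispatch like this; static_cpu_has() is the real mechanism, the
chacha_do_*() names here are placeholders:

	#include <asm/cpufeatures.h>
	#include <asm/cpufeature.h>
	#include <linux/types.h>

	/* After boot-time patching this is a single direct branch to one
	 * side or the other; no indirect call remains. */
	static void chacha_crypt(u32 state[16], u8 *dst, const u8 *src,
				 unsigned int len)
	{
		if (static_cpu_has(X86_FEATURE_AVX2)) {
			chacha_do_avx2(state, dst, src, len); /* placeholder SIMD path */
			return;
		}
		chacha_do_generic(state, dst, src, len);      /* placeholder C path */
	}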

Admittedly I don't think the arm64 people go to quite those lengths,
but it certainly wouldn't be impossible there either.  It just takes a
bit of architecture knowledge and a strong stomach ;)

                 Linus

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption
  2019-09-27  1:30           ` Linus Torvalds
@ 2019-09-27  2:54             ` Linus Torvalds
  2019-09-27  3:53               ` Herbert Xu
                                 ` (2 more replies)
  2019-09-27  4:36             ` Andy Lutomirski
  2019-09-27  9:58             ` Pascal Van Leeuwen
  2 siblings, 3 replies; 61+ messages in thread
From: Linus Torvalds @ 2019-09-27  2:54 UTC (permalink / raw)
  To: Pascal Van Leeuwen
  Cc: Jason A . Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Ard Biesheuvel, Greg KH, Eric Biggers, Samuel Neves,
	Linux Crypto Mailing List, Andy Lutomirski, Marc Zyngier,
	Dan Carpenter, Will Deacon, David Miller, Linux ARM

On Thu, Sep 26, 2019 at 6:30 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And once you have that cookie, and you see "ok, I didn't get the
> answer immediately" only THEN do you start filling in things like
> callback stuff, or maybe you set up a wait-queue and start waiting for
> it, or whatever".

Side note: almost nobody does this.

Almost every single async interface I've ever seen ends up being "only
designed for async".

And I think the reason is that everybody first does the simple
synchronous interfaces, and people start using those, and a lot of
people are perfectly happy with them. They are simple, and they work
fine for the huge majority of users.

And then somebody comes along and says "no, _we_ need to do this
asynchronously", and by definition that person does *not* care for the
synchronous case, since that interface already existed and was simpler
and already was mostly sufficient for the people who used it, and so
the async interface ends up being _only_ designed for the new async
workflow. Because that whole new world was written with just that case
in mind, and the synchronous case clearly didn't matter.

So then you end up with that kind of dichotomous situation, where you
have a strict black-and-white either-synchronous-or-async model.

And then some people - quite reasonably - just want the simplicity of
the synchronous code and it performs better for them because the
interfaces are simpler and better suited to their lack of extra work.

And other people feel they need the async code, because they can take
advantage of it.

And never the twain shall meet, because the async interface is
actively _bad_ for the people who have sync workloads and the sync
interface doesn't work for the async people.

Non-crypto example: [p]read() vs aio_read(). They do the same thing
(on a high level) apart from that sync/async issue. And there's no way
to get the best of both worlds.

Doing aio_read() on something that is already cached is actively much
worse than just doing a synchronous read() of cached data.

But aio_read() _can_ be much better if you know your workload doesn't
cache well and read() blocks too much for you.

There's no "read_potentially_async()" interface that just does the
synchronous read for any cached portion of the data, and then delays
just the IO parts and returns a "here, I gave you X bytes right now,
use this cookie to wait for the rest".

Maybe nobody would use it. But it really should be possible to have
interfaces where a good synchronous implementation is _possible_
without the extra overhead, while also allowing async implementations.

                Linus

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption
  2019-09-27  2:54             ` Linus Torvalds
@ 2019-09-27  3:53               ` Herbert Xu
  2019-09-27  4:37                 ` Andy Lutomirski
  2019-09-27  4:01               ` Herbert Xu
  2019-09-27 10:44               ` Pascal Van Leeuwen
  2 siblings, 1 reply; 61+ messages in thread
From: Herbert Xu @ 2019-09-27  3:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jason A . Donenfeld, Catalin Marinas, Arnd Bergmann,
	Ard Biesheuvel, Greg KH, Eric Biggers, Samuel Neves,
	Pascal Van Leeuwen, Linux Crypto Mailing List, Andy Lutomirski,
	Marc Zyngier, Dan Carpenter, Will Deacon, David Miller,
	Linux ARM

On Thu, Sep 26, 2019 at 07:54:03PM -0700, Linus Torvalds wrote:
>
> Side note: almost nobody does this.
> 
> Almost every single async interface I've ever seen ends up being "only
> designed for async".
> 
> And I think the reason is that everybody first does the simple
> synchronous interfaces, and people start using those, and a lot of
> people are perfectly happy with them. They are simple, and they work
> fine for the huge majority of users.

The crypto API is not the way it is because of async.  In fact, the
crypto API started out as sync only and async was essentially
bolted on top with minimal changes.

The main reason why the crypto API contains indirections is because
of the algorithmic flexibility which WireGuard does not need.

Now whether algorithmic flexibility is a good thing or not is a
different discussion.  But the fact of the matter is that the
majority of heavy crypto users in our kernel do require this
flexibility (e.g., IPsec, dmcrypt, fscrypt).

I don't have a beef with the fact that WireGuard is tied to a
single algorithm.  However, that simply does not work for the
other users that we will have to continue to support.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption
  2019-09-27  2:54             ` Linus Torvalds
  2019-09-27  3:53               ` Herbert Xu
@ 2019-09-27  4:01               ` Herbert Xu
  2019-09-27  4:13                 ` Linus Torvalds
  2019-09-27 10:44               ` Pascal Van Leeuwen
  2 siblings, 1 reply; 61+ messages in thread
From: Herbert Xu @ 2019-09-27  4:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jason A . Donenfeld, Catalin Marinas, Arnd Bergmann,
	Ard Biesheuvel, Greg KH, Eric Biggers, Samuel Neves,
	Pascal Van Leeuwen, Linux Crypto Mailing List, Andy Lutomirski,
	Marc Zyngier, Dan Carpenter, Will Deacon, David Miller,
	Linux ARM

On Thu, Sep 26, 2019 at 07:54:03PM -0700, Linus Torvalds wrote:
>
> There's no "read_potentially_async()" interface that just does the
> synchronous read for any cached portion of the data, and then delays
> just the IO parts and returns a "here, I gave you X bytes right now,
> use this cookie to wait for the rest".

Incidentally this is exactly how the crypto async interface works.
For example, the same call works whether you're sync or async:

	aead_request_set_callback(req, ...);
	aead_request_set_crypt(req, ...);
	err = crypto_aead_encrypt(req);
	if (err == -EINPROGRESS)
		/*
		 * Request is processed asynchronously.
		 * This cannot occur if the algorithm is sync,
		 * e.g., when you specifically allocated sync
		 * algorithms.
		 */
	else
		/* Request was processed synchronously */

We even allow the request to be on the stack in the sync case, e.g.,
with SYNC_SKCIPHER_REQUEST_ON_STACK.
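
From memory, that on-stack pattern looks roughly like this (treat the
details as approximate; include/crypto/skcipher.h has the exact API, and
do_one_encrypt() is a made-up name):

	#include <crypto/skcipher.h>
	#include <linux/scatterlist.h>

	/* Sketch only: tfm was allocated once with
	 * crypto_alloc_sync_skcipher(), error handling trimmed. */
	static int do_one_encrypt(struct crypto_sync_skcipher *tfm,
				  struct scatterlist *sg, unsigned int len,
				  u8 *iv)
	{
		SYNC_SKCIPHER_REQUEST_ON_STACK(req, tfm); /* no heap allocation */
		int err;

		skcipher_request_set_sync_tfm(req, tfm);
		skcipher_request_set_callback(req, 0, NULL, NULL);
		skcipher_request_set_crypt(req, sg, sg, len, iv);
		err = crypto_skcipher_encrypt(req); /* sync tfm: never -EINPROGRESS */
		skcipher_request_zero(req);	    /* wipe on-stack request state */
		return err;
	}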

> Maybe nobody would use it. But it really should be possibly to have
> interfaces where a good synchronous implementation is _possible_
> without the extra overhead, while also allowing async implementations.

So there is really no async overhead in the crypto API AFAICS if
you're always doing sync.  What you see as overheads are probably
the result of having to support multiple underlying algorithms
(not just accelerations which can indeed be handled without
indirection at least for CPU-based ones).

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption
  2019-09-27  4:01               ` Herbert Xu
@ 2019-09-27  4:13                 ` Linus Torvalds
  0 siblings, 0 replies; 61+ messages in thread
From: Linus Torvalds @ 2019-09-27  4:13 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Jason A . Donenfeld, Catalin Marinas, Arnd Bergmann,
	Ard Biesheuvel, Greg KH, Eric Biggers, Samuel Neves,
	Pascal Van Leeuwen, Linux Crypto Mailing List, Andy Lutomirski,
	Marc Zyngier, Dan Carpenter, Will Deacon, David Miller,
	Linux ARM

On Thu, Sep 26, 2019 at 9:01 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> So there is really no async overhead in the crypto API AFAICS if
> you're always doing sync.  What you see as overheads are probably
> the result of having to support multiple underlying algorithms
> (not just accelerations which can indeed be handled without
> indirection at least for CPU-based ones).

Fair enough, and sounds good. The biggest overhead is that indirection
for the state data, and the fact that the code indirectly calls the
actual function.

If that could be avoided by just statically saying

     crypto_xyz_encrypt()

(with the xyz being the crypto algorithm you want) and having the
state be explicit, then yes, that would remove most of the overhead.

It would still leave setting the callback fields etc that are
unnecessary for the synchronous case and that I think could be done
differently, but that's probably just a couple of stores, so not
particularly noticeable.

              Linus

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption
  2019-09-27  1:30           ` Linus Torvalds
  2019-09-27  2:54             ` Linus Torvalds
@ 2019-09-27  4:36             ` Andy Lutomirski
  2019-09-27  9:58             ` Pascal Van Leeuwen
  2 siblings, 0 replies; 61+ messages in thread
From: Andy Lutomirski @ 2019-09-27  4:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jason A . Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Ard Biesheuvel, Greg KH, Eric Biggers, Samuel Neves,
	Pascal Van Leeuwen, Linux Crypto Mailing List, Andy Lutomirski,
	Marc Zyngier, Dan Carpenter, Will Deacon, David Miller,
	Linux ARM

> On Sep 26, 2019, at 6:38 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> - let the caller know what the state size is and allocate the
> synchronous state in its own data structures
>
> - let the caller just call a static "decrypt_xyz()" function for xyz
> decryption.
>
> - if you end up doing it synchronously, that function just returns
> "done". No overhead. No extra allocations. No unnecessary stuff. Just
> do it, using the buffers provided. End of story. Efficient and simple.
>
> - BUT.
>
> - any hardware could have registered itself for "I can do xyz", and
> the decrypt_xyz() function would know about those, and *if* it has a
> list of accelerators (hopefully sorted by preference etc), it would
> try to use them. And if they take the job (they might not - maybe
> their queues are full, maybe they don't have room for new keys at the
> moment, which might be a separate setup from the queues), the
> "decrypt_xyz()" function returns a _cookie_ for that job. It's
> probably a pre-allocated one (the hw accelerator might preallocate a
> fixed number of in-progress data structures).

To really do this right, I think this doesn't go far enough.  Suppose
I'm trying to implement send() over a VPN very efficiently.  I want to
do, roughly, this:

void __user *buf, etc;

if (crypto api thinks async is good) {
  copy buf to some kernel memory;
  set up a scatterlist;
  do it async with this callback;
} else {
  do the crypto synchronously, from *user* memory, straight to kernel memory;
  (or, if that's too complicated, maybe copy in little chunks to a
   little stack buffer; setting up a scatterlist is a waste of time.)
}

I don't know if the network code is structured in a way to make this
work easily, and the API would be more complex, but it could be nice
and fast.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption
  2019-09-27  3:53               ` Herbert Xu
@ 2019-09-27  4:37                 ` Andy Lutomirski
  2019-09-27  4:59                   ` Herbert Xu
  0 siblings, 1 reply; 61+ messages in thread
From: Andy Lutomirski @ 2019-09-27  4:37 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Jason A . Donenfeld, Catalin Marinas, Arnd Bergmann,
	Ard Biesheuvel, Greg KH, Eric Biggers, Will Deacon, Samuel Neves,
	Pascal Van Leeuwen, Linux Crypto Mailing List, Andy Lutomirski,
	Marc Zyngier, Dan Carpenter, Linus Torvalds, David Miller,
	Linux ARM

On Thu, Sep 26, 2019 at 8:54 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Thu, Sep 26, 2019 at 07:54:03PM -0700, Linus Torvalds wrote:
> >
> > Side note: almost nobody does this.
> >
> > Almost every single async interface I've ever seen ends up being "only
> > designed for async".
> >
> > And I think the reason is that everybody first does the simple
> > synchronous interfaces, and people start using those, and a lot of
> > people are perfectly happy with them. They are simple, and they work
> > fine for the huge majority of users.
>
> The crypto API is not the way it is because of async.  In fact, the
> crypto API started out as sync only and async was essentially
> bolted on top with minimal changes.

Then what's up with the insistence on using physical addresses for so
many of the buffers?

--Andy

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption
  2019-09-27  4:37                 ` Andy Lutomirski
@ 2019-09-27  4:59                   ` Herbert Xu
  0 siblings, 0 replies; 61+ messages in thread
From: Herbert Xu @ 2019-09-27  4:59 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jason A . Donenfeld, Catalin Marinas, Arnd Bergmann,
	Ard Biesheuvel, Greg KH, Eric Biggers, Will Deacon, Samuel Neves,
	Pascal Van Leeuwen, Linux Crypto Mailing List, Marc Zyngier,
	Dan Carpenter, Linus Torvalds, David Miller, Linux ARM

On Thu, Sep 26, 2019 at 09:37:16PM -0700, Andy Lutomirski wrote:
>
> Then what's up with the insistence on using physical addresses for so
> many of the buffers?

This happens to be what async hardware wants, but the main reason
why the crypto API has them is because that's what the network
stack feeds us.  The crypto API was first created purely for IPsec
so the SG lists are intimately tied with how skbs were constructed.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API
  2019-09-26 21:36       ` Andy Lutomirski
@ 2019-09-27  7:20         ` Jason A. Donenfeld
  2019-10-01  8:56           ` Ard Biesheuvel
  0 siblings, 1 reply; 61+ messages in thread
From: Jason A. Donenfeld @ 2019-09-27  7:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Catalin Marinas, Herbert Xu, Arnd Bergmann, Ard Biesheuvel,
	Greg KH, Eric Biggers, Samuel Neves, Will Deacon,
	Linux Crypto Mailing List, Marc Zyngier, Dan Carpenter,
	Linus Torvalds, David Miller, linux-arm-kernel

Hey Andy,

Thanks for weighing in.

> inlining.  I'd be surprised for chacha20.  If you really want inlining
> to dictate the overall design, I think you need some real numbers for
> why it's necessary.  There also needs to be a clear story for how
> exactly making everything inline plays with the actual decision of
> which implementation to use.

Take a look at my description for the MIPS case: when on MIPS, the
arch code is *always* used since it's just straight up scalar
assembly. In this case, the chacha20_arch function *never* returns
false [1], which means it's always included [2], so the generic
implementation gets optimized out, saving disk and memory, which I
assume MIPS people care about.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/tree/lib/zinc/chacha20/chacha20-mips-glue.c?h=jd/wireguard#n13
[2] https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/tree/lib/zinc/chacha20/chacha20.c?h=jd/wireguard#n118

I'm fine with considering this a form of "premature optimization",
though, and ditching the motivation there.

On Thu, Sep 26, 2019 at 11:37 PM Andy Lutomirski <luto@kernel.org> wrote:
> My suggestion from way back, which is at
> least a good deal of the way toward being doable, is to do static
> calls.  This means that the common code will call out to the arch code
> via a regular CALL instruction and will *not* inline the arch code.
> This means that the arch code could live in its own module, it can be
> selected at boot time, etc.

Alright, let's do static calls, then, to deal with the case of going
from the entry point implementation in lib/zinc (or lib/crypto, if you
want, Ard) to the arch-specific implementation in arch/${ARCH}/crypto.
And then within each arch, we can keep it simple, since everything is
already in the same directory.

Sound good?

Jason

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption
  2019-09-27  1:30           ` Linus Torvalds
  2019-09-27  2:54             ` Linus Torvalds
  2019-09-27  4:36             ` Andy Lutomirski
@ 2019-09-27  9:58             ` Pascal Van Leeuwen
  2019-09-27 10:11               ` Herbert Xu
  2019-09-27 16:23               ` Linus Torvalds
  2 siblings, 2 replies; 61+ messages in thread
From: Pascal Van Leeuwen @ 2019-09-27  9:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jason A . Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Ard Biesheuvel, Greg KH, Eric Biggers, Samuel Neves,
	Linux Crypto Mailing List, Andy Lutomirski, Marc Zyngier,
	Dan Carpenter, Will Deacon, David Miller, Linux ARM

> > That remark is just very stupid. The hardware ALREADY exists, and
> > more hardware is in the pipeline. Once this stuff is designed in, it
> > usually stays in for many years to come. And these are chips sold in
> > _serious_ quantities, to be used in things like wireless routers and
> > DSL, cable and FTTH modems, 5G base stations, etc. etc.
> 
> Yes, I very much mentioned routers. I believe those can happen much
> more quickly.
> 
> But I would very much hope that that is not the only situation where
> you'd see wireguard used.
> 
Same here

> I'd want to see wireguard in an end-to-end situation from the very
> client hardware. So laptops, phones, desktops. Not the untrusted (to
> me) hw in between.
> 
I don't see why the crypto HW would deserve any less trust than, say,
the CPU itself. I would say CPUs don't deserve that trust at the moment.

> > No, these are just the routers going into *everyone's* home. And 5G
> > basestations arriving at every other street corner. I wouldn't call
> > that rare, exactly.
> 
> That's fine for a corporate tunnel between devices. Which is certainly
> one use case for wireguard.
> 
> But if you want VPN for your own needs for security, you want it at
> the _client_. Not at the router box. So that case really does matter.
> 
Personally, I would really like it in my router box so my CPU is free
to do useful work instead of boring crypto. And I know there's nothing
untrusted in between my client and the router box, so I don't need to
worry about security there. But hey, that's just me.

> And I really don't see the hardware happening in that space. So the
> bad crypto interfaces only make the client _worse_.
> 
Fully agree. We don't focus on the client side with our HW anyway.
(but then there may be that router box in between that can help out)

> See?
> 
> But on to the arguments that we actually agree on:
> 
> > Hey, no argument there. I don't see any good reason why the key can't
> > be on the stack. I doubt any hardware would be able to DMA that as-is
> > directly, and in any case, key changes should be infrequent, so copying
> > it to some DMA buffer should not be a performance problem.
> > So maybe that's an area for improvement: allow that to be on the stack.
> 
> It's not even just the stack. It's really that the crypto interfaces
> are *designed* so that you have to allocate things separately, and
> can't embed these things in your own data structures.
> 
> And they are that way, because the crypto interfaces aren't actually
> about (just) hiding the hardware interface: they are about hiding
> _all_ the encryption details.
> 
Well, that's the general idea of abstraction. It also allows for 
swapping in any other cipher with minimal effort just _because_ the 
details were hidden from the application. So it may cost you some 
effort initially, but it may save you effort later.

> There's no way to say "hey, I know the crypto I use, I know the key
> size I have, I know the state size it needs, I can preallocate those
> AS PART of my own data structures".
> 
> Because the interface is designed to be so "generic" that you simply
> can't do those things, they are all external allocations, which is
> inevitably slower when you don't have hardware.
> 
Hmm, Ok, I see your point here. But most of those data structures 
(like the key) should be allocated infrequently anyway, so you can
amortize that cost over _many_ crypto operations.

You _do_ realize that performing the key schedule for e.g. AES with
AES-NI also takes quite a lot of time? So you should keep your keys
alive and not reload them all the time anyway.

But I already agreed with you that there may be cases where you just
want to call the library function directly. Wireguard just isn't one
of those cases, IMHO.

> And you've shown that you don't care about that "don't have hardware"
> situation, and seem to think it's the only case that matters. That's
> your job, after all.
> 
I don't recall putting it that strongly ... and I certainly never said
the HW acceleration thing is the _only_ case that matters. But it does
matter _significantly_ to me, for blatantly obvious reasons.

> But however much you try to claim otherwise, there's all these
> situations where the hardware just isn't there, and the crypto
> interface just forces nasty overhead for absolutely no good reason.
> 
> > I already explained the reasons for _not_ doing direct calls above.
> 
> And I've tried to explain how direct calls that do the synchronous
> thing efficiently would be possible, but then _if_ there is hardware,
> they can then fall back to an async interface.
> 
OK, I did not fully get that latter part. I would be fine with such an
approach for use cases (i.e. fixed, known crypto) where that makes sense.
It would actually be better than calling the SW-only library directly
(which was my suggestion) as it would still allow HW acceleration as
an option ... 

> > > So there is absolutely NO DOWNSIDE for hw accelerated crypto to just
> > > do it right, and use an interface like this:
> > >
> > >        if (!chacha20poly1305_decrypt_sg(sg, sg, skb->len, NULL, 0,
> > >                                         PACKET_CB(skb)->nonce, key->key,
> > >                                         simd_context))
> > >                return false;
> > >
> > Well, for one thing, a HW API should not expect the result to be
> > available when the function call returns. (if that's what you
> > mean here). That would just be WRONG.
> 
> Right. But that also shouldn't mean that when you have synchronous
> hardware (ie CPU) you have to set everything up even though it will
> never be used.
> 
> Put another way: even with hardware acceleration, the queuing
> interface should be a simple "do this" interface.
> 
OK, I don't think we disagree there. I _like_ simple. As long as it
doesn't sacrifice functionality I care about.

> The current crypto interface is basically something that requires all
> the setup up-front, whether it's needed or not. And it forces those
> very inconvenient and slow external allocations.
> 
But you should do the setup work (if by "setup" you mean things like
cipher allocation, key setup and request allocation) only _once_ in a 
_long_ while. You can just keep using it for the lifetime of the 
application (or key, for the key setup part).

If I look at my cipher fallback paths in the driver (the only places
where I actually get to _use_ the API from the "other" side), per
actual individual request they _only_ do the following - the rest is
all preallocated earlier:

_set_callback()
_set_crypt()
_set_ad()
_encrypt() or _decrypt()

And now that I look at that, I think the _set_callback()  could
move to the setup phase as it's always the same callback function.
Probably, in case of Wireguard, you could even move the _set_ad()
there as it's always zero and  the crypto driver is not allowed 
to overwrite it in the request struct anyway.

Also, I already agreed with you that _set_crypt(), _set_ad()
and _encrypt()/_decrypt() _could_ be conveniently wrapped into
one API call instead of 3 separate ones if we think that's worth it.

BUT ... actually ... I just looked at the actual _implementation_
and it turns out these are _inlineable_ functions defined in the
header file that _just_ write to some struct fields. So they 
should not end up being function calls at all(!!).
_Only_ the _encrypt()/_decrypt() invocation will end up with a
true (indirect) function call.
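
Illustrative only: with the tfm and request preallocated at setup time
(and assuming one request in flight per context, otherwise use a small
pool of requests), the per-packet path reduces to the inline setters plus
that one indirect call. The wg_* names and ctx fields are made up:

	#include <crypto/aead.h>
	#include <linux/scatterlist.h>
	#include <linux/skbuff.h>

	static int wg_decrypt_one(struct my_peer_ctx *ctx, struct sk_buff *skb,
				  struct scatterlist *sg, u8 *iv)
	{
		int err;

		/* Inline helpers that just write fields of the request: */
		aead_request_set_callback(ctx->req, 0, wg_decrypt_done, skb);
		aead_request_set_ad(ctx->req, 0);
		aead_request_set_crypt(ctx->req, sg, sg, skb->len, iv);

		err = crypto_aead_decrypt(ctx->req); /* the one real (indirect) call */
		if (err == -EINPROGRESS)
			return 0;	/* async hw path: wg_decrypt_done() finishes it */
		return err;		/* sync path: result is already in place */
	}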

So where are all those allocations you mention? Have you ever
actually _used_ the Crypto API for anything?

Yes, if you actually want to _queue_ requests you need to use one
request struct for every queued operation, but you could just
preallocate an array of them that you cycle through. No need to do
those allocations in the hot path.

So is your problem really with the API _itself_ or with incorrect/
inefficient _use_ of the API in some places?

> And I'm saying that causes problems, because it fundamentally means
> that you can't do a good job for the common CPU  case, because you're
> paying all those costs even when you need absolutely none of them.
> Both at setup time, but also at run-time due to the extra indirection
> and cache misses etc.
> 
There is some cost, sure, but is it _significant_ for any use case that
_matters_? You started bringing up optimization rules, so how about
Amdahl's law?

> > Again, HW acceleration does not depend on the indirection _at all_,
> > that's there for entirely different purposes I explained above.
> > HW acceleration _does_ depend greatly on a truly async ifc though.
> 
> Can you realize that the world isn't just all hw acceleration?
> 
Sure. But there's also a lot of HW acceleration _already out there_
that _could_ have been used if only the proper SW APIs had existed.
Welcome to _my_ world.

> Can you admit that the current crypto interface is just horrid for the
> non-accelerated case?
> 
Is agreeing that it is not perfect sufficient for you? :-)

> Can you perhaps then also think that "maybe there are better models".
> 
Sure. There's always better. There's also good enough though ...

> > So queue requests on one side, handle results from the other side
> > in some callback func off of an interrupt handler.
> 
> Actually, what you can do - and what people *have* done - is to admit
> that the synchronous case is real and important, and then design
> interfaces that work for that one too.
> 
But they _do_ work for that case as well. I still haven't seen any
solid evidence that they are as horribly inefficient as you are 
implying for _real life_ use cases. And even if they are, then there's
the question whether that is the fault of the API or incorrect use 
thereof.

> You don't need to allocate resources ahead of time, and you don't have
> to disallow just having the state buffer allocated by the caller.
> 
> So here's the *wrong* way to do it (and the way that crypto does it):
> 
>  - dynamically allocate buffers at "init time"
> 
Why is that so "wrong"? It sure beats doing allocations on the hot path.
But yes, some stuff should be allowed to live on the stack. Some other
stuff can't be on the stack though, as that's gone when the calling
function exits while the background crypto processing still needs it.

And you don't want to have it on the stack initially and then have
to _copy_ it to some DMA-able location that you allocate on the fly
on the hot path if you _do_ want HW acceleration.

>  - fill in "callback fields" etc before starting the crypto, whether
> they are needed or not
> 
I think this can be done _once_ at request allocation time.
But it's just one function pointer write anyway. Is that significant? 
Or: _if_ that is significant, you  shouldn't be using the Crypto API for 
that use case in the first place.

>  - call a "decrypt" function that then uses the indirect functions you
> set up at init time, and possibly waits for it (or calls the callbacks
> you set up)
> 
> note how it's all this "state machine" model where you add data to the
> state machine, and at some point you say "execute" and then either you
> wait for things or you get callbacks.
> 
Not sure how splitting data setup over a few separate "function" calls
suddenly makes it a "state machine model" ...

But yes, I can understand why the completion handling through this
callback function seems like unnecessary complication for the SW case.

> That makes sense for a hw crypto engine. It's how a lot of them work, after all.
> 
Oh really?

Can't speak for other people's stuff, but for our hardware you post a
request to it and then go off and do other stuff while the HW does its thing
after which it will inform you it's done by means of an interrupt.
I don't see how this relates to the "state machine model" above, there
is no persistent state involved, it's all included in the request.
The _only_ thing that matters is that you realize it's a pipeline that
needs to be kept filled and has latency >> throughput, just like your 
CPU pipeline.

> But it makes _zero_ sense for the synchronous case. You did a lot of
> extra work for that case, and because it was all a state machine, you
> did it particularly inefficiently: not only do you have those separate
> allocations with pointer following, the "decrypt()" call ends up doing
> an indirect call to the CPU implementation, which is just quite slow
> to begin with, particularly in this day and age with retpoline etc.
> 
> So what's the alternative?
> 
> I claim that a good interface would accept that "Oh, a lot of cases
> will be synchronous, and a lot of cases use one fixed
> encryption/decryption model".
> 
> And it's quite doable. Instead of having those callback fields and
> indirection etc, you could have something more akin to this:
> 
>  - let the caller know what the state size is and allocate the
> synchronous state in its own data structures
> 
>  - let the caller just call a static "decrypt_xyz()" function for xyz
> decryption.
> 
Fine for those few cases where the algorithm is known and fixed.
(You do realize that the primary use cases are IPsec, dmcrypt and
fscrypt where that is most definitely _not_ the case?)

Also, you're still ignoring the fact that there is not one, single,
optimal, CPU implementation either. You have to select that as well,
based on CPU features. So it's either an indirect function call that
would be well predictable - as it's always the same at that point in
the program - or it's a deep if-else tree (which might actually be
implemented by the compiler as an indirect (table) jump ...) 
selecting the fastest implementation, either SW _or_ HW.

>  - if you end up doing it synchronously, that function just returns
> "done". No overhead. No extra allocations. No unnecessary stuff. Just
> do it, using the buffers provided. End of story. Efficient and simple.
> 
I don't see which "extra allocations" you would be saving here.
Those shouldn't happen in the hot path either way.

>  - BUT.
> 
>  - any hardware could have registered itself for "I can do xyz", and
> the decrypt_xyz() function would know about those, and *if* it has a
> list of accelerators (hopefully sorted by preference etc), it would
> try to use them. And if they take the job (they might not - maybe
> their queues are full, maybe they don't have room for new keys at the
> moment, which might be a separate setup from the queues), the
> "decrypt_xyz()" function returns a _cookie_ for that job. It's
> probably a pre-allocated one (the hw accelerator might preallocate a
> fixed number of in-progress data structures).
> 
> And once you have that cookie, and you see "ok, I didn't get the
> answer immediately" only THEN do you start filling in things like
> callback stuff, or maybe you set up a wait-queue and start waiting for
> it, or whatever".
> 
I don't see the point of saving that single callback pointer write.
I mean, it's just _one_ CPU word memory write. Likely to the L1 cache.

But I can see the appeal of getting a "done" response on the _encrypt()/
_decrypt() call and then being able to immediately continue processing
the result data and having the async response handling separated off. 

I think it should actually be possible to change the API to work like
that without breaking backward compatibility, i.e. define some flag
specifying you actually _want_ this behavior and then define some
return code that says "I'm done processing, carry on please".

> See the difference in models? One forces that asynchronous model, and
> actively penalizes the synchronous one.
> 
> The other _allows_ an asynchronous model, but is fine with a synchronous one.
> 
> > >        aead_request_set_callback(req, 0, NULL, NULL);
> > >
> > This is just inevitable for HW acceration ...
> 
> See above. It really isn't. You could do it *after* the fact,
>
Before ... after ... the point was you need it. And it's a totally
insignificant saving anyway.

> when
> you've gotten that ticket from the hardware. Then you say "ok, if the
> ticket is done, use these callbacks". Or "I'll now wait for this
> ticket to be done" (which is what the above does by setting the
> callbacks to zero).
> 
> Wouldn't that be lovely for a user?
> 
Yes and no.
Because the user would _still_ need to handle the case of callbacks,
in case the request _does_ go to the HW accelerator.

So you keep the main processing path clean, I suppose, saving some
cycles there, but you still have this case of callbacks and of having
multiple requests queued that you need to handle as well. Which now
becomes a separate _exception_ case. You now have two distinct
processing paths you have to manage from your application.
How is that an _improvement_ for the user? (notwithstanding that
it may be an improvement to SW-only performance)

> I suspect it would be a nice model for a hw accelerator too. If you
> have full queues or have problems allocating new memory or whatever,
> you just let the code fall back to the synchronous interface.
> 
HW drivers typically _do_ use SW fallback for cases they cannot
handle. Actually, that works very nicely with the current API,
with the fallback cipher just being attached to the original
request's callback function ... i.e. just do a tail call to
the fallback cipher request.
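
For illustration, that fallback path typically has this shape in a driver;
the my_* names are made up, the aead_request_* calls are the real API:

	#include <crypto/aead.h>
	#include <crypto/internal/aead.h>

	struct my_tfm_ctx {
		struct crypto_aead *fallback;	/* allocated at tfm init time */
	};

	/* The sub-request is carved out of the driver's request context
	 * (sized via crypto_aead_set_reqsize() at init), so nothing is
	 * allocated here; the caller's completion is reused as-is. */
	static int my_aead_decrypt_fallback(struct aead_request *req)
	{
		struct my_tfm_ctx *ctx = crypto_aead_ctx(crypto_aead_reqtfm(req));
		struct aead_request *subreq = aead_request_ctx(req);

		aead_request_set_tfm(subreq, ctx->fallback);
		aead_request_set_callback(subreq, req->base.flags,
					  req->base.complete, req->base.data);
		aead_request_set_crypt(subreq, req->src, req->dst,
				       req->cryptlen, req->iv);
		aead_request_set_ad(subreq, req->assoclen);

		return crypto_aead_decrypt(subreq);
	}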

> > Trust me, I have whole list of things I don't like about the
> > API myself, it's not exacty ideal for HW acceleration  either.
> 
> That's the thing. It's actively detrimental for "I have no HW acceleration".
> 
You keep asserting that with no evidence whatsoever.

> And apparently it's not optimal for you guys either.
> 
True, but I accept the fact that it needs to be that way because some
_other_ HW may drive that requirement. I accept the fact that I'm not
alone in the world.

> > But the point is - there are those case where you _don't_ know and
> > _that_ is what the Crypto API is for. And just generally, crypto
> > really _should_ be switchable.
> 
> It's very much not what wireguard does.
> 
And that's very much a part of Wireguard that is _broken_. I like
Wireguard for a lot of things, but its single-cipher focus is not
one of them. Especially since all crypto it uses comes from a single
source (DJB), which is frowned upon in the industry.

Crypto agility is a very important _security_ feature and the whole
argument Jason makes that it is actually a weakness is _bullshit_.
(Just because SSL _implemented_ this horribly wrong doesn't mean 
it's a bad thing to do - it's not, it's actually _necessary_. As 
the alternative would be to either continue using broken crypto
or wait _months_ for a new implementation to reach your devices
when the crypto gets broken somehow. Not good.)

> And honestly, most of the switchable ones have caused way more
> security problems than they have "fixed" by being switchable.
> 
"most of the switchable ones"
You mean _just_ SSL/TLS. SSL/TLS before 1.3 just sucked, security-wise,
on so many levels. That has _nothing_ to do with the very
desirable feature of crypto agility. It _can_ be done properly and 
securely. (for one thing, it does not _need_ to be negotiable)

>                  Linus

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
www.insidesecure.com

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption
  2019-09-27  9:58             ` Pascal Van Leeuwen
@ 2019-09-27 10:11               ` Herbert Xu
  2019-09-27 16:23               ` Linus Torvalds
  1 sibling, 0 replies; 61+ messages in thread
From: Herbert Xu @ 2019-09-27 10:11 UTC (permalink / raw)
  To: Pascal Van Leeuwen
  Cc: Jason A . Donenfeld, Catalin Marinas, Arnd Bergmann,
	Ard Biesheuvel, Greg KH, Eric Biggers, Samuel Neves, Will Deacon,
	Linux Crypto Mailing List, Andy Lutomirski, Marc Zyngier,
	Dan Carpenter, Linus Torvalds, David Miller, Linux ARM

On Fri, Sep 27, 2019 at 09:58:14AM +0000, Pascal Van Leeuwen wrote:
>
> But I can see the appeal of getting a "done" response on the _encrypt()/
> _decrypt() call and then being able to immediately continue processing
> the result data and having the async response handling separated off. 

This is how it works today.  If your request can be fulfilled
right away, you will get a return value other than EINPROGRESS
and you just carry on; the completion callback never happens in
this case.

aesni-intel makes heavy use of this.  In most cases it is sync.
It only goes async when the FPU is not available.
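
In caller terms the contract looks roughly like this (an illustrative
sketch, not taken from any particular driver):

#include <linux/errno.h>
#include <crypto/aead.h>

/* Illustrative caller: 'req' is a fully set up AEAD request with a
 * completion callback registered.
 */
static int do_one_request(struct aead_request *req)
{
	int err = crypto_aead_encrypt(req);

	/* Went async: the completion callback will report the final status.
	 * (-EBUSY here assumes CRYPTO_TFM_REQ_MAY_BACKLOG was set.)
	 */
	if (err == -EINPROGRESS || err == -EBUSY)
		return err;

	/* 0 or a real error: completed synchronously, the callback never runs */
	return err;
}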

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* RE: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption
  2019-09-27  2:06           ` Linus Torvalds
@ 2019-09-27 10:11             ` Pascal Van Leeuwen
  0 siblings, 0 replies; 61+ messages in thread
From: Pascal Van Leeuwen @ 2019-09-27 10:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jason A . Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Ard Biesheuvel, Greg KH, Eric Biggers, Samuel Neves,
	Linux Crypto Mailing List, Andy Lutomirski, Marc Zyngier,
	Dan Carpenter, Will Deacon, David Miller, Linux ARM

> -----Original Message-----
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Sent: Friday, September 27, 2019 4:06 AM
> To: Pascal Van Leeuwen <pvanleeuwen@verimatrix.com>
> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>; Linux Crypto Mailing List <linux-
> crypto@vger.kernel.org>; Linux ARM <linux-arm-kernel@lists.infradead.org>; Herbert Xu
> <herbert@gondor.apana.org.au>; David Miller <davem@davemloft.net>; Greg KH
> <gregkh@linuxfoundation.org>; Jason A . Donenfeld <Jason@zx2c4.com>; Samuel Neves
> <sneves@dei.uc.pt>; Dan Carpenter <dan.carpenter@oracle.com>; Arnd Bergmann
> <arnd@arndb.de>; Eric Biggers <ebiggers@google.com>; Andy Lutomirski <luto@kernel.org>;
> Will Deacon <will@kernel.org>; Marc Zyngier <maz@kernel.org>; Catalin Marinas
> <catalin.marinas@arm.com>
> Subject: Re: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet
> encryption
> 
> On Thu, Sep 26, 2019 at 5:15 PM Pascal Van Leeuwen
> <pvanleeuwen@verimatrix.com> wrote:
> >
> > But even the CPU only thing may have several implementations, of which
> > you want to select the fastest one supported by the _detected_ CPU
> > features (i.e. SSE, AES-NI, AVX, AVX512, NEON, etc. etc.)
> > Do you think this would still be efficient if that would be some
> > large if-else tree? Also, such a fixed implementation wouldn't scale.
> 
> Just a note on this part.
> 
> Yes, with retpoline a large if-else tree is actually *way* better for
> performance these days than even just one single indirect call. I
> think the cross-over point is somewhere around 20 if-statements.
> 
Yikes, that is just _horrible_ :-(

_However_ there are many CPU architectures out there that _don't_ need
the retpoline mitigation and would be unfairly penalized by the deep
if-else tree (as opposed to the indirect branch) for a problem they
did not cause in the first place.

Wouldn't it be fairer to impose the penalty on the CPUs actually
_causing_ this problem? Especially since those are generally the more
powerful CPUs anyway, which would suffer the least from the overhead?

> But those kinds of things also are things that we already handle well
> with instruction rewriting, so they can actually have even less of an
> overhead than a conditional branch. Using code like
> 
>   if (static_cpu_has(X86_FEATURE_AVX2))
> 
> actually ends up patching the code at run-time, so you end up having
> just an unconditional branch. Exactly because CPU feature choices
> often end up being in critical code-paths where you have
> one-or-the-other kind of setup.
> 
> And yes, one of the big users of this is very much the crypto library code.
> 
Ok, I didn't know about that. So I suppose we could have something
like if (static_soc_has(HW_CRYPTO_ACCELERATOR_XYZ)) ... Hmmm ...
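
For the CPU-feature case, a minimal sketch of the pattern Linus
describes would be something like this (the chacha_block_* names are
invented for illustration):

#include <linux/types.h>
#include <asm/cpufeature.h>

/* Invented names -- stand-ins for the AVX2 and generic C implementations */
void chacha_block_avx2(u32 *state, u8 *dst, const u8 *src, int nbytes);
void chacha_block_generic(u32 *state, u8 *dst, const u8 *src, int nbytes);

static void chacha_block_dispatch(u32 *state, u8 *dst, const u8 *src, int nbytes)
{
	/* The feature test is patched into a plain jump once CPU features
	 * are known at boot, so no runtime check is left on the hot path.
	 */
	if (static_cpu_has(X86_FEATURE_AVX2))
		chacha_block_avx2(state, dst, src, nbytes);
	else
		chacha_block_generic(state, dst, src, nbytes);
}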

> The code to do the above is disgusting, and when you look at the
> generated code you see odd unreachable jumps and what looks like a
> slow "bts" instruction that does the testing dynamically.
> 
> And then the kernel instruction stream gets rewritten fairly early
> during the boot depending on the actual CPU capabilities, and the
> dynamic tests get overwritten by a direct jump.
> 
> Admittedly I don't think the arm64 people go to quite those lengths,
> but it certainly wouldn't be impossible there either.  It just takes a
> bit of architecture knowledge and a strong stomach ;)
> 
>                  Linus

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
www.insidesecure.com

* RE: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption
  2019-09-27  2:54             ` Linus Torvalds
  2019-09-27  3:53               ` Herbert Xu
  2019-09-27  4:01               ` Herbert Xu
@ 2019-09-27 10:44               ` Pascal Van Leeuwen
  2019-09-27 11:08                 ` Pascal Van Leeuwen
  2 siblings, 1 reply; 61+ messages in thread
From: Pascal Van Leeuwen @ 2019-09-27 10:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jason A . Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Ard Biesheuvel, Greg KH, Eric Biggers, Samuel Neves,
	Linux Crypto Mailing List, Andy Lutomirski, Marc Zyngier,
	Dan Carpenter, Will Deacon, David Miller, Linux ARM

> -----Original Message-----
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Sent: Friday, September 27, 2019 4:54 AM
> To: Pascal Van Leeuwen <pvanleeuwen@verimatrix.com>
> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>; Linux Crypto Mailing List <linux-
> crypto@vger.kernel.org>; Linux ARM <linux-arm-kernel@lists.infradead.org>; Herbert Xu
> <herbert@gondor.apana.org.au>; David Miller <davem@davemloft.net>; Greg KH
> <gregkh@linuxfoundation.org>; Jason A . Donenfeld <Jason@zx2c4.com>; Samuel Neves
> <sneves@dei.uc.pt>; Dan Carpenter <dan.carpenter@oracle.com>; Arnd Bergmann
> <arnd@arndb.de>; Eric Biggers <ebiggers@google.com>; Andy Lutomirski <luto@kernel.org>;
> Will Deacon <will@kernel.org>; Marc Zyngier <maz@kernel.org>; Catalin Marinas
> <catalin.marinas@arm.com>
> Subject: Re: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet
> encryption
> 
> On Thu, Sep 26, 2019 at 6:30 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > And once you have that cookie, and you see "ok, I didn't get the
> > answer immediately" only THEN do you start filling in things like
> > callback stuff, or maybe you set up a wait-queue and start waiting for
> > it, or whatever".
> 
> Side note: almost nobody does this.
> 
> Almost every single async interface I've ever seen ends up being "only
> designed for async".
> 
> And I think the reason is that everybody first does the simple
> synchronous interfaces, and people start using those, and a lot of
> people are perfectly happy with them. They are simple, and they work
> fine for the huge majority of users.
> 
> And then somebody comes along and says "no, _we_ need to do this
> asynchronously", and by definition that person does *not* care for the
> synchronous case, since that interface already existed and was simpler
> and already was mostly sufficient for the people who used it, and so
> the async interface ends up being _only_ designed for the new async
> workflow. Because that whole new world was written with just that case
> in mind, and the synchronous case clearly didn't matter.
> 
> So then you end up with that kind of dichotomous situation, where you
> have a strict black-and-white either-synchronous-or-async model.
> 
> And then some people - quite reasonably - just want the simplicity of
> the synchronous code and it performs better for them because the
> interfaces are simpler and better suited to their lack of extra work.
> 
> And other people feel they need the async code, because they can take
> advantage of it.
> 
> And never the twain shall meet, because the async interface is
> actively _bad_ for the people who have sync workloads and the sync
> interface doesn't work for the async people.
> 
> Non-crypto example: [p]read() vs aio_read(). They do the same thing
> (on a high level) apart from that sync/async issue. And there's no way
> to get the best of both worlds.
> 
> Doing aio_read() on something that is already cached is actively much
> worse than just doing a synchronous read() of cached data.
> 
> But aio_read() _can_ be much better if you know your workload doesn't
> cache well and read() blocks too much for you.
> 
> There's no "read_potentially_async()" interface that just does the
> synchronous read for any cached portion of the data, and then delays
> just the IO parts and returns a "here, I gave you X bytes right now,
> use this cookie to wait for the rest".
> 
> Maybe nobody would use it. But it really should be possible to have
> interfaces where a good synchronous implementation is _possible_
> without the extra overhead, while also allowing async implementations.
> 
That's the question. I've never seen such an API yet ...
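
Just to make the idea concrete, such an interface could perhaps look
something like this (purely hypothetical, all names invented):

#include <sys/types.h>

/* Purely hypothetical -- nothing like this exists today.
 * The idea: hand back whatever could be served synchronously right away,
 * plus a cookie for any part that had to go asynchronous.
 */
struct read_cookie;

/* Returns the number of bytes copied synchronously (the cached part).
 * If part of the range needs real I/O, *cookie is set and the caller
 * waits on it later; if everything was cached, *cookie is NULL.
 */
ssize_t read_potentially_async(int fd, void *buf, size_t len,
			       struct read_cookie **cookie);

/* Blocks until the remainder arrives; returns the additional byte count. */
ssize_t read_cookie_wait(struct read_cookie *cookie);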

You could also just accept that those are two wildly different use
cases with wildly different requirements and allow them to coexist,
while sharing as much of the low-level SW implementation code as
possible underneath, with the async API only used for those cases
where HW acceleration can make the difference.

I believe for hashes, the Crypto API still maintains an shash and
an ahash API. It works the other way around from how you would
like to see it, though, with ahash wrapping the shash in the case of SW
implementations. Still, if you're sure you can't benefit from HW
acceleration, you have the option of using the shash directly.

I don't know why the synchronous blkcipher API was deprecated,
that happened before I joined. IMHO it would make sense to have one,
so users not interested in HW crypto are not burdened by it.
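
For reference, a minimal sketch of using shash directly, with the
descriptor on the stack (illustrative only):

#include <linux/err.h>
#include <crypto/hash.h>

/* Illustrative: the synchronous shash interface, descriptor on the stack */
static int sha256_buffer(const u8 *data, unsigned int len, u8 *out)
{
	struct crypto_shash *tfm = crypto_alloc_shash("sha256", 0, 0);
	int err;

	if (IS_ERR(tfm))
		return PTR_ERR(tfm);

	{
		SHASH_DESC_ON_STACK(desc, tfm);

		desc->tfm = tfm;
		err = crypto_shash_digest(desc, data, len, out);
	}

	crypto_free_shash(tfm);
	return err;
}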


>                 Linus

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
www.insidesecure.com


* RE: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption
  2019-09-27 10:44               ` Pascal Van Leeuwen
@ 2019-09-27 11:08                 ` Pascal Van Leeuwen
  0 siblings, 0 replies; 61+ messages in thread
From: Pascal Van Leeuwen @ 2019-09-27 11:08 UTC (permalink / raw)
  To: Pascal Van Leeuwen, Linus Torvalds
  Cc: Jason A . Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Ard Biesheuvel, Greg KH, Eric Biggers, Samuel Neves,
	Linux Crypto Mailing List, Andy Lutomirski, Marc Zyngier,
	Dan Carpenter, Will Deacon, David Miller, Linux ARM

> -----Original Message-----
> From: linux-crypto-owner@vger.kernel.org <linux-crypto-owner@vger.kernel.org> On Behalf
> Of Pascal Van Leeuwen
> Sent: Friday, September 27, 2019 12:44 PM
> To: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>; Linux Crypto Mailing List <linux-
> crypto@vger.kernel.org>; Linux ARM <linux-arm-kernel@lists.infradead.org>; Herbert Xu
> <herbert@gondor.apana.org.au>; David Miller <davem@davemloft.net>; Greg KH
> <gregkh@linuxfoundation.org>; Jason A . Donenfeld <Jason@zx2c4.com>; Samuel Neves
> <sneves@dei.uc.pt>; Dan Carpenter <dan.carpenter@oracle.com>; Arnd Bergmann
> <arnd@arndb.de>; Eric Biggers <ebiggers@google.com>; Andy Lutomirski <luto@kernel.org>;
> Will Deacon <will@kernel.org>; Marc Zyngier <maz@kernel.org>; Catalin Marinas
> <catalin.marinas@arm.com>
> Subject: RE: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet
> encryption
> 
> > -----Original Message-----
> > From: Linus Torvalds <torvalds@linux-foundation.org>
> > Sent: Friday, September 27, 2019 4:54 AM
> > To: Pascal Van Leeuwen <pvanleeuwen@verimatrix.com>
> > Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>; Linux Crypto Mailing List <linux-
> > crypto@vger.kernel.org>; Linux ARM <linux-arm-kernel@lists.infradead.org>; Herbert Xu
> > <herbert@gondor.apana.org.au>; David Miller <davem@davemloft.net>; Greg KH
> > <gregkh@linuxfoundation.org>; Jason A . Donenfeld <Jason@zx2c4.com>; Samuel Neves
> > <sneves@dei.uc.pt>; Dan Carpenter <dan.carpenter@oracle.com>; Arnd Bergmann
> > <arnd@arndb.de>; Eric Biggers <ebiggers@google.com>; Andy Lutomirski
> <luto@kernel.org>;
> > Will Deacon <will@kernel.org>; Marc Zyngier <maz@kernel.org>; Catalin Marinas
> > <catalin.marinas@arm.com>
> > Subject: Re: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet
> > encryption
> >
> > On Thu, Sep 26, 2019 at 6:30 PM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > And once you have that cookie, and you see "ok, I didn't get the
> > > answer immediately" only THEN do you start filling in things like
> > > callback stuff, or maybe you set up a wait-queue and start waiting for
> > > it, or whatever".
> >
> > Side note: almost nobody does this.
> >
> > Almost every single async interface I've ever seen ends up being "only
> > designed for async".
> >
> > And I think the reason is that everybody first does the simple
> > synchronous interfaces, and people start using those, and a lot of
> > people are perfectly happy with them. They are simple, and they work
> > fine for the huge majority of users.
> >
> > And then somebody comes along and says "no, _we_ need to do this
> > asynchronously", and by definition that person does *not* care for the
> > synchronous case, since that interface already existed and was simpler
> > and already was mostly sufficient for the people who used it, and so
> > the async interface ends up being _only_ designed for the new async
> > workflow. Because that whole new world was written with just that case
> > in mind, and the synchronous case clearly didn't matter.
> >
> > So then you end up with that kind of dichotomous situation, where you
> > have a strict black-and-white either-synchronous-or-async model.
> >
> > And then some people - quite reasonably - just want the simplicity of
> > the synchronous code and it performs better for them because the
> > interfaces are simpler and better suited to their lack of extra work.
> >
> > And other people feel they need the async code, because they can take
> > advantage of it.
> >
> > And never the twain shall meet, because the async interface is
> > actively _bad_ for the people who have sync workloads and the sync
> > interface doesn't work for the async people.
> >
> > Non-crypto example: [p]read() vs aio_read(). They do the same thing
> > (on a high level) apart from that sync/async issue. And there's no way
> > to get the best of both worlds.
> >
> > Doing aio_read() on something that is already cached is actively much
> > worse than just doing a synchronous read() of cached data.
> >
> > But aio_read() _can_ be much better if you know your workload doesn't
> > cache well and read() blocks too much for you.
> >
> > There's no "read_potentially_async()" interface that just does the
> > synchronous read for any cached portion of the data, and then delays
> > just the IO parts and returns a "here, I gave you X bytes right now,
> > use this cookie to wait for the rest".
> >
> > Maybe nobody would use it. But it really should be possible to have
> > interfaces where a good synchronous implementation is _possible_
> > without the extra overhead, while also allowing async implementations.
> >
> That's the question. I've never seen such an API yet ...
> 
> You could also just accept that those are two wildly different use
> cases with wildly different requirements and allow them to coexist,
> while sharing as much of the low-level SW implementation code as
> possible underneath, with the async API only used for those cases
> where HW acceleration can make the difference.
> 
> I believe for hashes, the Crypto API still maintains an shash and
> an ahash API. It works the other way around from how you would
> like to see it, though, with ahash wrapping the shash in the case of SW
> implementations. Still, if you're sure you can't benefit from HW
> acceleration, you have the option of using the shash directly.
> 
> I don't know why the synchronous blkcipher API was deprecated,
> that happened before I joined. IMHO it would make sense to have one,
> so users not interested in HW crypto are not burdened by it.
> 
> 
Never mind. From what I just learned, you can achieve the same
thing with the skcipher API by just requesting a sync implementation,
which would allow you to put your structs on the stack and would
not return from the encrypt()/decrypt() call until actually done.
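
A minimal sketch of that usage, assuming a plain "chacha20" skcipher
(illustrative only; key setup via crypto_sync_skcipher_setkey() is
omitted for brevity):

#include <linux/err.h>
#include <linux/scatterlist.h>
#include <crypto/skcipher.h>

/* Illustrative: a synchronous skcipher request living on the stack;
 * crypto_skcipher_encrypt() does not return until the data is processed.
 */
static int encrypt_sync(struct scatterlist *src, struct scatterlist *dst,
			unsigned int len, u8 *iv)
{
	struct crypto_sync_skcipher *tfm =
		crypto_alloc_sync_skcipher("chacha20", 0, 0);
	int err;

	if (IS_ERR(tfm))
		return PTR_ERR(tfm);

	{
		SYNC_SKCIPHER_REQUEST_ON_STACK(req, tfm);

		skcipher_request_set_sync_tfm(req, tfm);
		skcipher_request_set_callback(req, 0, NULL, NULL);
		skcipher_request_set_crypt(req, src, dst, len, iv);
		err = crypto_skcipher_encrypt(req);
	}

	crypto_free_sync_skcipher(tfm);
	return err;
}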

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
www.insidesecure.com


* RE: chapoly acceleration hardware [Was: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API]
  2019-09-26 23:13         ` Dave Taht
@ 2019-09-27 12:18           ` Pascal Van Leeuwen
  0 siblings, 0 replies; 61+ messages in thread
From: Pascal Van Leeuwen @ 2019-09-27 12:18 UTC (permalink / raw)
  To: Dave Taht
  Cc: Jason A. Donenfeld, Toke Høiland-Jørgensen,
	Catalin Marinas, Herbert Xu, Arnd Bergmann, Ard Biesheuvel,
	Greg KH, Eric Biggers, Willy Tarreau, Samuel Neves, Will Deacon,
	Netdev, Linux Crypto Mailing List, Andy Lutomirski, Marc Zyngier,
	Dan Carpenter, Linus Torvalds, David Miller, linux-arm-kernel


> -----Original Message-----
> From: Dave Taht <dave.taht@gmail.com>
> Sent: Friday, September 27, 2019 1:14 AM
> To: Pascal Van Leeuwen <pvanleeuwen@verimatrix.com>
> Cc: Jason A. Donenfeld <Jason@zx2c4.com>; Ard Biesheuvel <ard.biesheuvel@linaro.org>;
> Linux Crypto Mailing List <linux-crypto@vger.kernel.org>; linux-arm-kernel <linux-arm-
> kernel@lists.infradead.org>; Herbert Xu <herbert@gondor.apana.org.au>; David Miller
> <davem@davemloft.net>; Greg KH <gregkh@linuxfoundation.org>; Linus Torvalds
> <torvalds@linux-foundation.org>; Samuel Neves <sneves@dei.uc.pt>; Dan Carpenter
> <dan.carpenter@oracle.com>; Arnd Bergmann <arnd@arndb.de>; Eric Biggers
> <ebiggers@google.com>; Andy Lutomirski <luto@kernel.org>; Will Deacon <will@kernel.org>;
> Marc Zyngier <maz@kernel.org>; Catalin Marinas <catalin.marinas@arm.com>; Willy Tarreau
> <w@1wt.eu>; Netdev <netdev@vger.kernel.org>; Toke Høiland-Jørgensen <toke@toke.dk>
> Subject: Re: chapoly acceleration hardware [Was: Re: [RFC PATCH 00/18] crypto: wireguard
> using the existing crypto API]
> 
> On Thu, Sep 26, 2019 at 6:52 AM Pascal Van Leeuwen
> <pvanleeuwen@verimatrix.com> wrote:
> >
> > > -----Original Message-----
> > > From: Jason A. Donenfeld <Jason@zx2c4.com>
> > > Sent: Thursday, September 26, 2019 1:07 PM
> > > To: Pascal Van Leeuwen <pvanleeuwen@verimatrix.com>
> > > Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>; Linux Crypto Mailing List <linux-
> > > crypto@vger.kernel.org>; linux-arm-kernel <linux-arm-kernel@lists.infradead.org>;
> > > Herbert Xu <herbert@gondor.apana.org.au>; David Miller <davem@davemloft.net>; Greg
> KH
> > > <gregkh@linuxfoundation.org>; Linus Torvalds <torvalds@linux-foundation.org>; Samuel
> > > Neves <sneves@dei.uc.pt>; Dan Carpenter <dan.carpenter@oracle.com>; Arnd Bergmann
> > > <arnd@arndb.de>; Eric Biggers <ebiggers@google.com>; Andy Lutomirski
> <luto@kernel.org>;
> > > Will Deacon <will@kernel.org>; Marc Zyngier <maz@kernel.org>; Catalin Marinas
> > > <catalin.marinas@arm.com>; Willy Tarreau <w@1wt.eu>; Netdev
> <netdev@vger.kernel.org>;
> > > Toke Høiland-Jørgensen <toke@toke.dk>; Dave Taht <dave.taht@gmail.com>
> > > Subject: chapoly acceleration hardware [Was: Re: [RFC PATCH 00/18] crypto: wireguard
> > > using the existing crypto API]
> > >
> > > [CC +willy, toke, dave, netdev]
> > >
> > > Hi Pascal
> > >
> > > On Thu, Sep 26, 2019 at 12:19 PM Pascal Van Leeuwen
> > > <pvanleeuwen@verimatrix.com> wrote:
> > > > Actually, that assumption is factually wrong. I don't know if anything
> > > > is *publicly* available, but I can assure you the silicon is running in
> > > > labs already. And something will be publicly available early next year
> > > > at the latest. Which could nicely coincide with having Wireguard support
> > > > in the kernel (which I would also like to see happen BTW) ...
> > > >
> > > > Not "at some point". It will. Very soon. Maybe not in consumer or server
> > > > CPUs, but definitely in the embedded (networking) space.
> > > > And it *will* be much faster than the embedded CPU next to it, so it will
> > > > be worth using it for something like bulk packet encryption.
> > >
> > > Super! I was wondering if you could speak a bit more about the
> > > interface. My biggest questions surround latency. Will it be
> > > synchronous or asynchronous?
> > >
> > The hardware being external to the CPU and running in parallel with it,
> > obviously asynchronous.
> >
> > > If the latter, why?
> > >
> > Because, as you probably already guessed, the round-trip latency is way
> > longer than the actual processing time, at least for small packets.
> >
> > Partly because the only way to communicate between the CPU and the HW
> > accelerator (whether that is crypto, a GPU, a NIC, etc.) that doesn't
> > keep the CPU busy moving data is through memory, with the HW doing DMA.
> > And, as any programmer should know, round trip times to memory are huge
> > relative to the processing speed.
> >
> > And partly because these accelerators are very similar to CPU's in
> > terms of architecture, doing pipelined processing and having multiple
> > of such pipelines in parallel. Except that these pipelines are not
> > working on low-level instructions but on full packets/blocks. So they
> > need to have many packets in flight to keep those pipelines fully
> > occupied. And packets need to move through the various pipeline stages,
> > so they incur the time needed to process them multiple times. (just
> > like e.g. a multiply instruction with a throughput of 1 per cycle
> > actually may need 4 or more cycles to actually provide its result)
> >
> > Could you do that from a synchronous interface? In theory, probably,
> > if you would spawn a new thread for every new packet arriving and
> > rely on the scheduler to preempt the waiting threads. But you'd need
> > as many threads as the HW  accelerator can have packets in flight,
> > while an async would need only 2 threads: one to handle the input to
> > the accelerator and one to handle the output (or at most one thread
> > per CPU, if you want to divide the workload)
> >
> > Such a many-thread approach seems very inefficient to me.
> >
> > > What will its latencies be?
> > >
> > Depends very much on the specific integration scenario (i.e. bus
> > speed, bus hierarchy, cache hierarchy, memory speed, etc.) but on
> > the order of a few thousand CPU clocks is not unheard of.
> > Which is an eternity for the CPU, but still only a few uSec in
> > human time. Not a problem unless you're a high-frequency trader and
> > every ns counts ...
> > It's not like the CPU would process those packets in zero time.
> >
> > > How deep will its buffers be?
> > >
> > That of course depends on the specific accelerator implementation,
> > but possibly dozens of small packets in our case, as you'd need
> > at least width x depth packets in there to keep the pipes busy.
> > Just like a modern CPU needs hundreds of instructions in flight
> > to keep all its resources busy.
> >
> > > The reason I ask is that a
> > > lot of crypto acceleration hardware of the past has been fast and
> > > having very deep buffers, but at great expense of latency.
> > >
> > Define "great expense". Everything is relative. The latency is very
> > high compared to per-packet processing time but at the same time it's
> > only on the order of a few uSec. Which may not even be significant on
> > the total time it takes for the packet to travel from input MAC to
> > output MAC, considering the CPU will still need to parse and classify
> > it and do pre- and postprocessing on it.
> >
> > > In the networking context, keeping latency low is pretty important.
> > >
> > I've been doing this for IPsec for nearly 20 years now and I've never
> > heard anyone complain about our latency, so it must be OK.
> 
> Well, it depends on where your bottlenecks are. On low-end hardware
> you can and do tend to bottleneck on the crypto step, and with
> uncontrolled, non-fq'd non-aqm'd buffering you get results like this:
> 
> http://blog.cerowrt.org/post/wireguard/
> 
> so in terms of "threads" I would prefer to think of flows entering
> the tunnel and attempting to multiplex them as best as possible
> across the crypto hard/software so that minimal in-hw latencies are experienced
> for most packets and that the coupled queue length does not grow out of control.
> 
> Adding fq_codel's hashing algo and queuing to ipsec as was done in
> commit: 264b87fa617e758966108db48db220571ff3d60e to leverage
> the inner hash...
> 
> Had some nice results:
> 
> before: http://www.taht.net/~d/ipsec_fq_codel/oldqos.png (100ms spikes)
> After: http://www.taht.net/~d/ipsec_fq_codel/newqos.png (2ms spikes)
> 
> I'd love to see more vpn vendors using the rrul test or something even
> nastier to evaluate their results, rather than dragstrip bulk throughput tests,
> steering multiple flows over multiple cores.
> 
> > We're also doing (fully inline, no CPU involved) MACsec cores, which
> > operate at layer 2 and I know it's a concern there for very specific
> > use cases (high frequency trading, precision time protocol, ...).
> > For "normal" VPN's though, a few uSec more or less should be a non-issue.
> 
> Measured buffering is typically 1000 packets in userspace vpns. If you
> can put data in, faster than you can get it out, well....
> 
We don't buffer anywhere near 1000 packets in the hardware itself.
In fact, our buffers are designed to be carefully tunable to accept
the minimum number of packets required by the system as a whole.

But we do need to potentially keep a deep & wide pipeline busy, so for
the big, high-speed engines some double-digit buffering is inevitable. 
It won't get anywhere near even 100 packets though, let alone 1000.

Also, the whole point of crypto HW acceleration is to ensure the crypto
is *not* the bottleneck, not even for those pesky small TCP ACK packets
when they come back-to-back (although I doubt the crypto itself is the
bottleneck there, as there is actually very little crypto to do then).
We work very hard to ensure decent *small* packet performance, and
generally you should scale your crypto HW to be able to keep up with
the worst case there, with margin to spare ...

> > > Already
> > > WireGuard is multi-threaded which isn't super great all the time for
> > > latency (improvements are a work in progress). If you're involved with
> > > the design of the hardware, perhaps this is something you can help
> > > ensure winds up working well? For example, AES-NI is straightforward
> > > and good, but Intel can do that because they are the CPU. It sounds
> > > like your silicon will be adjacent. How do you envision this working
> > > in a low latency environment?
> > >
> > Depends on how low low-latency is. If you really need minimal latency,
> > you need an inline implementation. Which we can also provide, BTW :-)
> >
> > Regards,
> > Pascal van Leeuwen
> > Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
> > www.insidesecure.com
> 
> 
> 
> --
> 
> Dave Täht
> CTO, TekLibre, LLC
> http://www.teklibre.com
> Tel: 1-831-205-9740

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
www.insidesecure.com

* Re: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption
  2019-09-27  9:58             ` Pascal Van Leeuwen
  2019-09-27 10:11               ` Herbert Xu
@ 2019-09-27 16:23               ` Linus Torvalds
  2019-09-30 11:14                 ` France didn't want GSM encryption Marc Gonzalez
  2019-09-30 20:44                 ` [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption Pascal Van Leeuwen
  1 sibling, 2 replies; 61+ messages in thread
From: Linus Torvalds @ 2019-09-27 16:23 UTC (permalink / raw)
  To: Pascal Van Leeuwen
  Cc: Jason A . Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Ard Biesheuvel, Greg KH, Eric Biggers, Samuel Neves,
	Linux Crypto Mailing List, Andy Lutomirski, Marc Zyngier,
	Dan Carpenter, Will Deacon, David Miller, Linux ARM

On Fri, Sep 27, 2019 at 2:58 AM Pascal Van Leeuwen
<pvanleeuwen@verimatrix.com> wrote:
>
> > I'd want to see wireguard in an end-to-end situation from the very
> > client hardware. So laptops, phones, desktops. Not the untrusted (to
> > me) hw in between.
> >
> I don't see why the crypto HW would deserve any less trust than, say,
> the CPU itself. I would say CPU's don't deserve that trust at the moment.

It's not the crypto engine that is part of the untrusted hardware.
It's the box itself, and the manufacturer, and you having to trust
that the manufacturer didn't set up some magic knocking sequence to
disable the encryption.

Maybe the company that makes them is trying to do a good job. But
maybe they are based in a country that has laws that require
backdoors.

Say, France. There's a long long history of that kind of thing.

It's all to "fight terrorism", but hey, a little industrial espionage
is good too, isn't it? So let's just disable GSM encryption based on
geographic locale and local regulation, shall we.

Yeah, yeah, GSM encryption wasn't all that strong to begin with, but
it was apparently strong enough that France didn't want it.

So tell me again why I should trust that box that I have no control over?

> Well, that's the general idea of abstraction. It also allows for
> swapping in any other cipher with minimal effort just _because_ the
> details were hidden from the application. So it may cost you some
> effort initially, but it may save you effort later.

We clearly disagree on the utility of crypto agility. You point to
things like ipsec as an argument for it.

And I point to ipsec as an argument *against* that horror. It's a
bloated, inefficient, horribly complex mess. And all the "agility" is
very much part of it.

I also point to GSM as a reason against "agility". It has caused way
more security problems than it has ever solved. The "agility" is
often a way to turn off (or tune down) the encryption, not as a way to
say "ok, we can improve it later".

That "we can improve it later" is a bedtime story. It's not how it
gets used. Particularly as the weaknesses are often not primarily in
the crypto algorithm itself, but in how it gets used or other session
details.

When you actually want to *improve* security, you throw the old code
away, and start a new protocol entirely. Eg SSL -> TLS.

So cryptographic agility is way oversold, and often people are
actively lying about why they want it. And the people who aren't lying
are ignoring the costs.

One of the reasons _I_ like wireguard is that it just went for simple
and secure. No BS.

And you say

> Especially since all crypto it uses comes from a single
> source (DJB), which is frowned upon in the industry.

I'm perhaps not a fan of DJB in all respects, but there's no question
that he's at least competent.

The "industry practice" of having committees influenced by who knows
what isn't all that much better. Do you want to talk about NSA
elliptic curve constant choices?

Anyway, on the costs:

> >  - dynamically allocate buffers at "init time"
>
> Why is that so "wrong"? It sure beats doing allocations on the hot path.

It's wrong not because the allocation is costly (you do that only
once), but because the dynamic allocation means that you can't embed
stuff in your own native data structures as a user.

So now accessing those things is no longer dense in the cache.

And it's the cache that matters for a synchronous CPU algorithm. You
don't want the keys and state to be in some other location when you
already have your data structures for the stream that could just have
them right there with the other data.

> And you don't want to have it on the stack initially and then have
> to _copy_ it to some DMA-able location that you allocate on the fly
> on the hot path if you _do_ want HW acceleration.

Actually, that's *exactly* what you want. You want keys etc to be in
regular memory in a location that is convenient to the user, and then
only if the hardware has issues do you say "ok, copy the key to the
hardware". Because quite often the hardware will have very special key
caches that aren't even available to the CPU, because they are on some
hw-private buffers.

Yes, you want to have a "key identity" model so that the hardware
doesn't have to reload it all the time, but that's an invalidation
protocol, not a "put the keys or nonces in special places".

               Linus


* Re: France didn't want GSM encryption
  2019-09-27 16:23               ` Linus Torvalds
@ 2019-09-30 11:14                 ` Marc Gonzalez
  2019-09-30 21:37                   ` Linus Torvalds
  2019-09-30 20:44                 ` [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption Pascal Van Leeuwen
  1 sibling, 1 reply; 61+ messages in thread
From: Marc Gonzalez @ 2019-09-30 11:14 UTC (permalink / raw)
  To: Linus Torvalds, Pascal Van Leeuwen; +Cc: Linux Crypto Mailing List, Linux ARM

[ Trimming recipients list ]

On 27/09/2019 18:23, Linus Torvalds wrote:

> It's not the crypto engine that is part of the untrusted hardware.
> It's the box itself, and the manufacturer, and you having to trust
> that the manufacturer didn't set up some magic knocking sequence to
> disable the encryption.
> 
> Maybe the company that makes them is trying to do a good job. But
> maybe they are based in a country that has laws that require
> backdoors.
> 
> Say, France. There's a long long history of that kind of thing.
> 
> It's all to "fight terrorism", but hey, a little industrial espionage
> is good too, isn't it? So let's just disable GSM encryption based on
> geographic locale and local regulation, shall we.
> 
> Yeah, yeah, GSM encryption wasn't all that strong to begin with, but
> it was apparently strong enough that France didn't want it.

Two statements above have raised at least one of my eyebrows.

1) France has laws that require backdoors.

2) France did not want GSM encryption.


The following article claims that it was the British who demanded that
A5/1 be weakened (not the algorithm, just the key size; which is what
the USgov did in the 90s).

https://www.aftenposten.no/verden/i/Olkl/Sources-We-were-pressured-to-weaken-the-mobile-security-in-the-80s


Additional references for myself

https://lwn.net/Articles/368861/
https://en.wikipedia.org/wiki/Export_of_cryptography_from_the_United_States
https://gsmmap.org/assets/pdfs/gsmmap.org-country_report-France-2017-06.pdf
https://gsmmap.org/assets/pdfs/gsmmap.org-country_report-France-2018-06.pdf
https://gsmmap.org/assets/pdfs/gsmmap.org-country_report-France-2019-08.pdf


As for your first claim, can you provide more information, so that I could
locate the law(s) in question? (Year the law was discussed, for example.)

I've seen a few propositions ("projet de loi") but none(?) have made it into
actual law, as far as I'm aware.

https://www.nextinpact.com/news/98039-loi-numerique-nkm-veut-backdoor-dans-chaque-materiel.htm
https://www.nextinpact.com/news/107546-lamendement-anti-huawei-porte-pour-backdoors-renseignement-francais.htm

Regards.


* RE: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption
  2019-09-27 16:23               ` Linus Torvalds
  2019-09-30 11:14                 ` France didn't want GSM encryption Marc Gonzalez
@ 2019-09-30 20:44                 ` Pascal Van Leeuwen
  1 sibling, 0 replies; 61+ messages in thread
From: Pascal Van Leeuwen @ 2019-09-30 20:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jason A . Donenfeld, Catalin Marinas, Herbert Xu, Arnd Bergmann,
	Ard Biesheuvel, Greg KH, Eric Biggers, Samuel Neves,
	Linux Crypto Mailing List, Andy Lutomirski, Marc Zyngier,
	Dan Carpenter, Will Deacon, David Miller, Linux ARM

> -----Original Message----
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Sent: Friday, September 27, 2019 6:24 PM
> To: Pascal Van Leeuwen <pvanleeuwen@verimatrix.com>
> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>; Linux Crypto Mailing List <linux-
> crypto@vger.kernel.org>; Linux ARM <linux-arm-kernel@lists.infradead.org>; Herbert Xu
> <herbert@gondor.apana.org.au>; David Miller <davem@davemloft.net>; Greg KH
> <gregkh@linuxfoundation.org>; Jason A . Donenfeld <Jason@zx2c4.com>; Samuel Neves
> <sneves@dei.uc.pt>; Dan Carpenter <dan.carpenter@oracle.com>; Arnd Bergmann <arnd@arndb.de>;
> Eric Biggers <ebiggers@google.com>; Andy Lutomirski <luto@kernel.org>; Will Deacon
> <will@kernel.org>; Marc Zyngier <maz@kernel.org>; Catalin Marinas <catalin.marinas@arm.com>
> Subject: Re: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption
> 
> On Fri, Sep 27, 2019 at 2:58 AM Pascal Van Leeuwen
> <pvanleeuwen@verimatrix.com> wrote:
> >
> > > I'd want to see wireguard in an end-to-end situation from the very
> > > client hardware. So laptops, phones, desktops. Not the untrusted (to
> > > me) hw in between.
> > >
> > I don't see why the crypto HW would deserve any less trust than, say,
> > the CPU itself. I would say CPU's don't deserve that trust at the moment.
> 
> It's not the crypto engine that is part of the untrusted hardware.
> It's the box itself, and the manufacturer, and you having to trust
> that the manufacturer didn't set up some magic knocking sequence to
> disable the encryption.
> 
> Maybe the company that makes them is trying to do a good job. But
> maybe they are based in a country that has laws that require
> backdoors.
> 
> Say, France. There's a long long history of that kind of thing.
> 
> It's all to "fight terrorism", but hey, a little industrial espionage
> is good too, isn't it? So let's just disable GSM encryption based on
> geographic locale and local regulation, shall we.
> 
> Yeah, yeah, GSM encryption wasn't all that strong to begin with, but
> it was apparently strong enough that France didn't want it.
> 
> So tell me again why I should trust that box that I have no control over?
> 
Same reason you trust your PC hardware you have no control over?
(That CPU is assembled in Malaysia, your motherboard likely in China.
And not being a US citizen, *I* wouldn't trust anything out of the US
anyway, _knowing_ they've been actively spying on us for decades ...)

In case you worry about the software part: of course you'd be running
something open-source and Linux based like DD-WRT on that router ...

Personally I'm not that paranoid and I really like to offload all the
silly  crypto heavy-lifting to my router box, where it belongs.

> > Well, that's the general idea of abstraction. It also allows for
> > swapping in any other cipher with minimal effort just _because_ the
> > details were hidden from the application. So it may cost you some
> > effort initially, but it may save you effort later.
> 
> We clearly disagree on the utility of crypto agility. You point to
> things like ipsec as an argument for it.
> 
I don't recall doing specifically that, but anyway.

> And I point to ipsec as an argument *against* that horror. It's a
> bloated, inefficient, horribly complex mess. And all the "agility" is
> very much part of it.
> 
Oh really? I've been working on implementations thereof for nearly 2
decades, but I don't recognise this at all, at least not for the datapath.
IPsec actually made a significant effort to keep the packet format the
same across all extensions done over its 20+ year history. The cipher
agility is mostly abstracted away from the base protocol, allowing us to
add new ciphersuites - to hardware, no less! - with very minimal effort.

In any case, while I believe in the KISS principle, I also believe that
things should be as simple as possible, but _no simpler than that_ (A.E.).
Oversimplification is the evil twin of overcomplication.

> I also point to GSM as a reason against "agility". It has caused way
> more security problems than it has ever solved. The "agility" is
> often a way to turn off (or tune down) the encryption, not as a way to
> say "ok, we can improve it later".
> 
> That "we can improve it later" is a bedtime story. It's not how it
> gets used. Particularly as the weaknesses are often not primarily in
> the crypto algorithm itself, but in how it gets used or other session
> details.
> 
I don't see what this has to do with cipher agility. Cipher agility has
nothing to do with "improving things later" and everything to do with the
realisation that, someday, some clever person _will_ find some weakness.

> When you actually want to *improve* security, you throw the old code
> away, and start a new protocol entirely. Eg SSL -> TLS.
> 
Uhm. Now you're starting to show some ignorance ...

TLS was NOT a new protocol. It was a simple rename of a very minor evolution
of SSL 3.0, and it stayed that way for all versions up to and including TLS 1.2.
And YES, THAT was a mistake, because SSL was just a very poor starting point.
For TLS 1.3 they finally did a (reasonably) proper redesign.
(Fun fact: SSL was _not_ designed by a committee, but TLS 1.3 _was_ ...)

> So cryptographic agility is way oversold, and often people are
> actively lying about why they want it. And the people who aren't lying
> are ignoring the costs.
> 
I wouldn't know what they could be lying about; crypto agility is
just common-sense risk spreading.

> One of the reasons _I_ like wireguard is that it just went for simple
> and secure. No BS.
> 
You and me both, BTW. I just don't want it to be _too_ simple.

> And you say
> 
> > Especially since all crypto it uses comes from a single
> > source (DJB), which is frowned upon in the industry.
> 
> I'm perhaps not a fan of DJB in all respects, but there's no question
> that he's at least competent.
> 
I have nothing against DJB, I've enjoyed many of his presentations.
I might even be a fan. I certainly don't doubt his competence.

But being as paranoid as you are: can you really TRUST the guy? ;-)
And as good as he is: there may be some weakness in the algorithm(s)
discovered _tomorrow_ and in that case _I_ would want to be able to
switch to an alternative instantly.
(and I believe for some big international organisation critically 
depending on such a VPN to connect all their branch offices around
the world while protecting their trade secrets, this is likely to
be even more important - they probably wouldn't want to wait until
Jason pulls Wireguard 2.0 out of his hat and certainly not for that
to pass certification and finally hit their devices months later ...)

I'm not talking about some convoluted and fragile negotiation scheme,
a static parameter in some config file is just fine for that. The 
textual crypto templates of the Crypto API just fit that use case
perfectly.
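
For example, something as simple as this would do, with the algorithm
name coming straight out of configuration (the tunnel_alloc_aead helper
is invented for illustration):

#include <linux/err.h>
#include <crypto/aead.h>

/* Illustrative: the algorithm is chosen by a plain string, so it could
 * come straight from a config file -- e.g. "rfc7539(chacha20,poly1305)"
 * or "gcm(aes)" -- without the caller caring which implementation (SW or
 * HW) ends up backing it.
 */
static struct crypto_aead *tunnel_alloc_aead(const char *algname,
					     const u8 *key, unsigned int keylen)
{
	struct crypto_aead *tfm = crypto_alloc_aead(algname, 0, 0);
	int err;

	if (IS_ERR(tfm))
		return tfm;

	err = crypto_aead_setkey(tfm, key, keylen);
	if (!err)
		err = crypto_aead_setauthsize(tfm, 16);
	if (err) {
		crypto_free_aead(tfm);
		return ERR_PTR(err);
	}
	return tfm;
}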

And I have other reasons not to want to use Chacha-Poly, while I would
like to use the Wireguard _protocol_ itself:

1) Contrary to popular belief, Chacha-Poly is NOT the best choice of
   algorithms in terms of performance on many modern systems. On the
   quad core Cortex A72 system I'm working on here, AES-GCM is over 2
   times faster, even including Ard's Poly1305-Neon patches of last
   week (current mainline code for PC is even slower than that).
   Also, on modern Intel systems with AES-NI or VAES, AES-GCM 
   outperforms Chacha-Poly by a considerable margin. And, to make
   matters worse, running Chacha-Poly at high throughput is known to
   result in excessive thermal throttling on some recent Intel CPU's.
   Even if you don't need that throughput, it's nice to have more CPU
   power left to do useful work.
2) Chacha-Poly is inefficient in terms of power. For our hardware,
   it uses about 2x the power of AES-GCM and I have indications (e.g.
   the thermal throttling mentioned above) that this is no better for
   software implementations.

> The "industry practice" of having committees influenced by who knows
> what isn't all that much better. Do you want to talk about NSA
> elliptic curve constant choices?
> 
Which is actually an argument _in favor_ of crypto agility - you don't
want to be stuck with just one choice you may not trust ...
Options are _good_. (but do add some implementation complexity, sure)

> Anyway, on the costs:
> 
> > >  - dynamically allocate buffers at "init time"
> >
> > Why is that so "wrong"? It sure beats doing allocations on the hot path.
> 
> It's wrong not because the allocation is costly (you do that only
> once), but because the dynamic allocation means that you can't embed
> stuff in your own native data structures as a user.
> 
> So now accessing those things is no longer dense in the cache.
> 
I don't see how data allocated at _init time_ would be local in the 
cache at the time it is _finally_ used in some remote location, far
away in both space and time.

If you init and then immediately use, you may have a point, but
that should be the exception and not the rule.

> And it's the cache that matters for a synchronous CPU algorithm. You
> don't want the keys and state to be in some other location when you
> already have your data structures for the stream that could just have
> them right there with the other data.
> 
Yeah yeah, we all know that. But that only works for stuff that stays
in scope in the cache, not for stuff that has long since been pushed
out by other local variables.

And "other" memory that's used frequently (i.e. when it matters!) CAN
be cached too, you know :-) Modern prefetchers tend to be quite good,
too, so it shouldn't even matter if it gets flushed out temporarily.

> > And you don't want to have it on the stack initially and then have
> > to _copy_ it to some DMA-able location that you allocate on the fly
> > on the hot path if you _do_ want HW acceleration.
> 
> Actually, that's *exactly* what you want. You want keys etc to be in
> regular memory in a location that is convenient to the user, and then
> only if the hardware has issues do you say "ok, copy the key to the
> hardware". Because quite often the hardware will have very special key
> caches that aren't even available to the CPU, because they are on some
> hw-private buffers.
> 
Unfortunately, the only way to get that _into_ the HW is usually DMA
and that relies on DMA-capable memory. And copying significant data
around on the CPU tends to totally kill performance if you're in the
business of HW acceleration, so it's nice if it's already in a DMA-capable
buffer. Assuming the cost of having it there is not excessive.

I don't care so much about the keys BTW, that should not be performance
critical as you set it only once in a long while.
But things like IV's etc. _may_ be another matter for _some_ hardware.
(Actually, for _my_ hardware I _only_ care about not having to copy the
actual _data_, so for all _I_ care everything else can be on the stack.
But alas, I'm not alone in the world ...)

> Yes, you want to have a "key identity" model so that the hardware
> doesn't have to reload it all the time, but that's an invalidation
> protocol, not a "put the keys or nonces in special places".
> 
Actually, that _is_ exactly how (most of) _our_ hardware works :-)

But I _think_ keys and nonces and whatnot are actually not the main
reason those structs can't be on the stack. Drivers tend to add their
own local data to those structs, and this may contain buffers that
are used for DMA. I know for a fact the Inside Secure driver does
this (_not_ my design, BTW). I would personally have opted for 
embedding pointers to dynamically allocated blobs elsewhere, such
that the main struct _can_ be on the stack. Food for discussion :-)


>                Linus


Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
www.insidesecure.com

* Re: France didn't want GSM encryption
  2019-09-30 11:14                 ` France didn't want GSM encryption Marc Gonzalez
@ 2019-09-30 21:37                   ` Linus Torvalds
  0 siblings, 0 replies; 61+ messages in thread
From: Linus Torvalds @ 2019-09-30 21:37 UTC (permalink / raw)
  To: Marc Gonzalez; +Cc: Pascal Van Leeuwen, Linux Crypto Mailing List, Linux ARM

On Mon, Sep 30, 2019 at 4:14 AM Marc Gonzalez <marc.w.gonzalez@free.fr> wrote:
>
> Two statements above have raised at least one of my eyebrows.
>
> 1) France has laws that require backdoors.

No. But France has a long history of being bad on encryption policies.
They've gotten better, thankfully.

France was one of the countries that had laws against strong
encryption back in the 90s. It got better in the early 2000s, but
there's a long history - and still a push - for some very questionable
practices.

It was just a couple of years ago that they had discussions about
mandatory backdoors for encryption in France. Look it up.

Are there other countries that have worse track records? Yes. And in
the west, the US (and Australia) have had similar issues.

But when it comes to Western Europe, France has been a particular
problem spot. And I wanted to point out that it's not always the
obvious problem countries (ie Middle East, China) that everybody
points to.

> 2) France did not want GSM encryption.

I'm pretty sure that France had the encryption bit off at least during the 90's.

GSM A5/1 isn't great, but as part of the spec there is also A5/0. No,
it's not used in the West any more.

France was also at least at one time considered a hotbed of industrial
espionage by other European countries. And the US.

You can try to google for it, but you won't find all that much from
the bad old days. You can find _some_ stuff still..

  https://apnews.com/4206823c63d58fd956f26fd5efc9a777

but basically French intelligence agencies have been accused of
extensive industrial espionage for French companies over the years.

Anyway, I'm not trying to point to France as some kind of "worst of
the worst". I literally picked it as an example because people
generally _don't_ think of Western European countries as having
encryption issues, and don't generally associate them with industrial
espionage. But there really is a history even there.

            Linus


* Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API
  2019-09-27  7:20         ` Jason A. Donenfeld
@ 2019-10-01  8:56           ` Ard Biesheuvel
  0 siblings, 0 replies; 61+ messages in thread
From: Ard Biesheuvel @ 2019-10-01  8:56 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Catalin Marinas, Herbert Xu, Arnd Bergmann, Eric Biggers,
	Greg KH, Samuel Neves, Will Deacon, Linux Crypto Mailing List,
	Andy Lutomirski, Marc Zyngier, Dan Carpenter, Linus Torvalds,
	David Miller, linux-arm-kernel

On Fri, 27 Sep 2019 at 09:21, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> Hey Andy,
>
> Thanks for weighing in.
>
> > inlining.  I'd be surprised for chacha20.  If you really want inlining
> > to dictate the overall design, I think you need some real numbers for
> > why it's necessary.  There also needs to be a clear story for how
> > exactly making everything inline plays with the actual decision of
> > which implementation to use.
>
> Take a look at my description for the MIPS case: when on MIPS, the
> arch code is *always* used since it's just straight up scalar
> assembly. In this case, the chacha20_arch function *never* returns
> false [1], which means it's always included [2], so the generic
> implementation gets optimized out, saving disk and memory, which I
> assume MIPS people care about.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/tree/lib/zinc/chacha20/chacha20-mips-glue.c?h=jd/wireguard#n13
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/tree/lib/zinc/chacha20/chacha20.c?h=jd/wireguard#n118
>
> I'm fine with considering this a form of "premature optimization",
> though, and ditching the motivation there.
>
> On Thu, Sep 26, 2019 at 11:37 PM Andy Lutomirski <luto@kernel.org> wrote:
> > My suggestion from way back, which is at
> > least a good deal of the way toward being doable, is to do static
> > calls.  This means that the common code will call out to the arch code
> > via a regular CALL instruction and will *not* inline the arch code.
> > This means that the arch code could live in its own module, it can be
> > selected at boot time, etc.
>
> Alright, let's do static calls, then, to deal with the case of going
> from the entry point implementation in lib/zinc (or lib/crypto, if you
> want, Ard) to the arch-specific implementation in arch/${ARCH}/crypto.
> And then within each arch, we can keep it simple, since everything is
> already in the same directory.
>
> Sound good?
>

Yup.

I posted something to this effect - I am ironing out some wrinkles
doing randconfig builds (with Arnd's help) but the general picture
shouldn't change.
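
For reference, a rough sketch of how such a static-call based dispatch
could look for the library entry point (based on the static_call API as
proposed upstream; all function names are invented):

#include <linux/types.h>
#include <linux/static_call.h>

/* Invented names; the generic C implementation is the boot-time default. */
static void chacha20_generic(u8 *out, const u8 *in, size_t len,
			     const u32 key[8], const u32 counter[4])
{
	/* portable C implementation would live here */
}

DEFINE_STATIC_CALL(chacha20_impl, chacha20_generic);

void chacha20(u8 *out, const u8 *in, size_t len,
	      const u32 key[8], const u32 counter[4])
{
	/* a patched direct call: no indirect branch, hence no retpoline */
	static_call(chacha20_impl)(out, in, len, key, counter);
}

/* Arch glue switches the implementation once, at boot, e.g.:
 *	static_call_update(chacha20_impl, chacha20_neon);
 */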


end of thread

Thread overview: 61+ messages
2019-09-25 16:12 [RFC PATCH 00/18] crypto: wireguard using the existing crypto API Ard Biesheuvel
2019-09-25 16:12 ` [RFC PATCH 01/18] crypto: shash - add plumbing for operating on scatterlists Ard Biesheuvel
2019-09-25 16:12 ` [RFC PATCH 02/18] crypto: x86/poly1305 - implement .update_from_sg method Ard Biesheuvel
2019-09-25 16:12 ` [RFC PATCH 03/18] crypto: arm/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation Ard Biesheuvel
2019-09-25 16:12 ` [RFC PATCH 04/18] crypto: arm64/poly1305 " Ard Biesheuvel
2019-09-25 16:12 ` [RFC PATCH 05/18] crypto: chacha - move existing library code into lib/crypto Ard Biesheuvel
2019-09-25 16:12 ` [RFC PATCH 06/18] crypto: rfc7539 - switch to shash for Poly1305 Ard Biesheuvel
2019-09-25 16:12 ` [RFC PATCH 07/18] crypto: rfc7539 - use zero reqsize for sync instantiations without alignmask Ard Biesheuvel
2019-09-25 16:12 ` [RFC PATCH 08/18] crypto: testmgr - add a chacha20poly1305 test case Ard Biesheuvel
2019-09-25 16:12 ` [RFC PATCH 09/18] crypto: poly1305 - move core algorithm into lib/crypto Ard Biesheuvel
2019-09-25 16:12 ` [RFC PATCH 10/18] crypto: poly1305 - add init/update/final library routines Ard Biesheuvel
2019-09-25 16:12 ` [RFC PATCH 11/18] int128: move __uint128_t compiler test to Kconfig Ard Biesheuvel
2019-09-25 21:01   ` Linus Torvalds
2019-09-25 21:19     ` Ard Biesheuvel
2019-09-25 16:12 ` [RFC PATCH 16/18] netlink: use new strict length types in policy for 5.2 Ard Biesheuvel
2019-09-25 16:12 ` [RFC PATCH 17/18] wg switch to lib/crypto algos Ard Biesheuvel
2019-09-25 16:12 ` [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption Ard Biesheuvel
2019-09-25 22:15   ` Linus Torvalds
2019-09-25 22:22     ` Linus Torvalds
2019-09-26  9:40     ` Pascal Van Leeuwen
2019-09-26 16:35       ` Linus Torvalds
2019-09-27  0:15         ` Pascal Van Leeuwen
2019-09-27  1:30           ` Linus Torvalds
2019-09-27  2:54             ` Linus Torvalds
2019-09-27  3:53               ` Herbert Xu
2019-09-27  4:37                 ` Andy Lutomirski
2019-09-27  4:59                   ` Herbert Xu
2019-09-27  4:01               ` Herbert Xu
2019-09-27  4:13                 ` Linus Torvalds
2019-09-27 10:44               ` Pascal Van Leeuwen
2019-09-27 11:08                 ` Pascal Van Leeuwen
2019-09-27  4:36             ` Andy Lutomirski
2019-09-27  9:58             ` Pascal Van Leeuwen
2019-09-27 10:11               ` Herbert Xu
2019-09-27 16:23               ` Linus Torvalds
2019-09-30 11:14                 ` France didn't want GSM encryption Marc Gonzalez
2019-09-30 21:37                   ` Linus Torvalds
2019-09-30 20:44                 ` [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption Pascal Van Leeuwen
2019-09-27  2:06           ` Linus Torvalds
2019-09-27 10:11             ` Pascal Van Leeuwen
2019-09-26 11:06     ` Ard Biesheuvel
2019-09-26 12:34       ` Ard Biesheuvel
2019-09-26  8:59 ` [RFC PATCH 00/18] crypto: wireguard using the existing crypto API Jason A. Donenfeld
2019-09-26 10:19   ` Pascal Van Leeuwen
2019-09-26 10:59     ` Jason A. Donenfeld
2019-09-26 11:06     ` chapoly acceleration hardware [Was: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API] Jason A. Donenfeld
2019-09-26 11:38       ` Toke Høiland-Jørgensen
2019-09-26 13:52       ` Pascal Van Leeuwen
2019-09-26 23:13         ` Dave Taht
2019-09-27 12:18           ` Pascal Van Leeuwen
2019-09-26 22:47       ` Jakub Kicinski
2019-09-26 12:07   ` [RFC PATCH 00/18] crypto: wireguard using the existing crypto API Ard Biesheuvel
2019-09-26 13:06     ` Pascal Van Leeuwen
2019-09-26 13:15       ` Ard Biesheuvel
2019-09-26 14:03         ` Pascal Van Leeuwen
2019-09-26 14:52           ` Ard Biesheuvel
2019-09-26 15:04             ` Pascal Van Leeuwen
2019-09-26 20:47     ` Jason A. Donenfeld
2019-09-26 21:36       ` Andy Lutomirski
2019-09-27  7:20         ` Jason A. Donenfeld
2019-10-01  8:56           ` Ard Biesheuvel
