linux-crypto.vger.kernel.org archive mirror
* [PATCH v3 00/14] crypto: arm32-optimized BLAKE2b and BLAKE2s
@ 2020-12-23  8:09 Eric Biggers
  2020-12-23  8:09 ` [PATCH v3 01/14] crypto: blake2s - define shash_alg structs using macros Eric Biggers
                   ` (14 more replies)
  0 siblings, 15 replies; 25+ messages in thread
From: Eric Biggers @ 2020-12-23  8:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: linux-arm-kernel, Ard Biesheuvel, Herbert Xu, David Sterba,
	Jason A . Donenfeld, Paul Crowley

This patchset adds 32-bit ARM assembly language implementations of
BLAKE2b and BLAKE2s.

As a prerequisite to adding these without copy-and-pasting lots of code,
this patchset also reworks the existing BLAKE2b and BLAKE2s code to
provide helper functions that make implementing "shash" providers for
these algorithms much easier.  These changes also eliminate unnecessary
differences between the BLAKE2b and BLAKE2s code.

The new BLAKE2b implementation is NEON-accelerated, while the new
BLAKE2s implementation uses scalar instructions since NEON doesn't work
very well for it.  The BLAKE2b implementation is faster and is expected
to be useful as a replacement for SHA-1 in dm-verity, while the BLAKE2s
implementation would be useful for WireGuard, which uses BLAKE2s.

Both new implementations are wired up to the shash API, while the new
BLAKE2s implementation is also wired up to the library API.

See the individual commits for full details, including benchmarks.

This patchset was tested on a Raspberry Pi 2 (which uses a Cortex-A7
processor) with CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y, plus other tests.

This patchset applies to mainline commit 614cb5894306.

Changed since v2:
   - Reworked the shash helpers again.  Now they are inline functions,
     and for BLAKE2s they now share more code with the library API.
   - Made the BLAKE2b code be more consistent with the BLAKE2s code.
   - Moved the BLAKE2s changes first in the patchset so that the BLAKE2b
     changes can be made just by syncing the code with BLAKE2s.
   - Added a few BLAKE2s cleanups (which get included in BLAKE2b too).
   - Improved some comments in the new asm files.

Changed since v1:
   - Added ARM scalar implementation of BLAKE2s.
   - Adjusted the BLAKE2b helper functions to be consistent with what I
     decided to do for BLAKE2s.
   - Fixed build error in blake2b-neon-core.S in some configurations.

Eric Biggers (14):
  crypto: blake2s - define shash_alg structs using macros
  crypto: x86/blake2s - define shash_alg structs using macros
  crypto: blake2s - remove unneeded includes
  crypto: blake2s - move update and final logic to internal/blake2s.h
  crypto: blake2s - share the "shash" API boilerplate code
  crypto: blake2s - optimize blake2s initialization
  crypto: blake2s - add comment for blake2s_state fields
  crypto: blake2s - adjust include guard naming
  crypto: blake2s - include <linux/bug.h> instead of <asm/bug.h>
  crypto: arm/blake2s - add ARM scalar optimized BLAKE2s
  wireguard: Kconfig: select CRYPTO_BLAKE2S_ARM
  crypto: blake2b - sync with blake2s implementation
  crypto: blake2b - update file comment
  crypto: arm/blake2b - add NEON-accelerated BLAKE2b

 arch/arm/crypto/Kconfig             |  19 ++
 arch/arm/crypto/Makefile            |   4 +
 arch/arm/crypto/blake2b-neon-core.S | 347 ++++++++++++++++++++++++++++
 arch/arm/crypto/blake2b-neon-glue.c | 105 +++++++++
 arch/arm/crypto/blake2s-core.S      | 285 +++++++++++++++++++++++
 arch/arm/crypto/blake2s-glue.c      |  78 +++++++
 arch/x86/crypto/blake2s-glue.c      | 150 +++---------
 crypto/blake2b_generic.c            | 249 +++++---------------
 crypto/blake2s_generic.c            | 158 +++----------
 drivers/net/Kconfig                 |   1 +
 include/crypto/blake2b.h            |  67 ++++++
 include/crypto/blake2s.h            |  63 ++---
 include/crypto/internal/blake2b.h   | 115 +++++++++
 include/crypto/internal/blake2s.h   | 109 ++++++++-
 lib/crypto/blake2s.c                |  48 +---
 15 files changed, 1278 insertions(+), 520 deletions(-)
 create mode 100644 arch/arm/crypto/blake2b-neon-core.S
 create mode 100644 arch/arm/crypto/blake2b-neon-glue.c
 create mode 100644 arch/arm/crypto/blake2s-core.S
 create mode 100644 arch/arm/crypto/blake2s-glue.c
 create mode 100644 include/crypto/blake2b.h
 create mode 100644 include/crypto/internal/blake2b.h


base-commit: 614cb5894306cfa2c7d9b6168182876ff5948735
-- 
2.29.2


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v3 01/14] crypto: blake2s - define shash_alg structs using macros
  2020-12-23  8:09 [PATCH v3 00/14] crypto: arm32-optimized BLAKE2b and BLAKE2s Eric Biggers
@ 2020-12-23  8:09 ` Eric Biggers
  2020-12-23  8:09 ` [PATCH v3 02/14] crypto: x86/blake2s " Eric Biggers
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Eric Biggers @ 2020-12-23  8:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: linux-arm-kernel, Ard Biesheuvel, Herbert Xu, David Sterba,
	Jason A . Donenfeld, Paul Crowley

From: Eric Biggers <ebiggers@google.com>

The shash_alg structs for the four variants of BLAKE2s are identical
except for the algorithm name, driver name, and digest size.  So, avoid
code duplication by using a macro to define these structs.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 crypto/blake2s_generic.c | 88 ++++++++++++----------------------------
 1 file changed, 27 insertions(+), 61 deletions(-)

diff --git a/crypto/blake2s_generic.c b/crypto/blake2s_generic.c
index 005783ff45ad0..e3aa6e7ff3d83 100644
--- a/crypto/blake2s_generic.c
+++ b/crypto/blake2s_generic.c
@@ -83,67 +83,33 @@ static int crypto_blake2s_final(struct shash_desc *desc, u8 *out)
 	return 0;
 }
 
-static struct shash_alg blake2s_algs[] = {{
-	.base.cra_name		= "blake2s-128",
-	.base.cra_driver_name	= "blake2s-128-generic",
-	.base.cra_flags		= CRYPTO_ALG_OPTIONAL_KEY,
-	.base.cra_ctxsize	= sizeof(struct blake2s_tfm_ctx),
-	.base.cra_priority	= 200,
-	.base.cra_blocksize     = BLAKE2S_BLOCK_SIZE,
-	.base.cra_module	= THIS_MODULE,
-
-	.digestsize		= BLAKE2S_128_HASH_SIZE,
-	.setkey			= crypto_blake2s_setkey,
-	.init			= crypto_blake2s_init,
-	.update			= crypto_blake2s_update,
-	.final			= crypto_blake2s_final,
-	.descsize		= sizeof(struct blake2s_state),
-}, {
-	.base.cra_name		= "blake2s-160",
-	.base.cra_driver_name	= "blake2s-160-generic",
-	.base.cra_flags		= CRYPTO_ALG_OPTIONAL_KEY,
-	.base.cra_ctxsize	= sizeof(struct blake2s_tfm_ctx),
-	.base.cra_priority	= 200,
-	.base.cra_blocksize     = BLAKE2S_BLOCK_SIZE,
-	.base.cra_module	= THIS_MODULE,
-
-	.digestsize		= BLAKE2S_160_HASH_SIZE,
-	.setkey			= crypto_blake2s_setkey,
-	.init			= crypto_blake2s_init,
-	.update			= crypto_blake2s_update,
-	.final			= crypto_blake2s_final,
-	.descsize		= sizeof(struct blake2s_state),
-}, {
-	.base.cra_name		= "blake2s-224",
-	.base.cra_driver_name	= "blake2s-224-generic",
-	.base.cra_flags		= CRYPTO_ALG_OPTIONAL_KEY,
-	.base.cra_ctxsize	= sizeof(struct blake2s_tfm_ctx),
-	.base.cra_priority	= 200,
-	.base.cra_blocksize     = BLAKE2S_BLOCK_SIZE,
-	.base.cra_module	= THIS_MODULE,
-
-	.digestsize		= BLAKE2S_224_HASH_SIZE,
-	.setkey			= crypto_blake2s_setkey,
-	.init			= crypto_blake2s_init,
-	.update			= crypto_blake2s_update,
-	.final			= crypto_blake2s_final,
-	.descsize		= sizeof(struct blake2s_state),
-}, {
-	.base.cra_name		= "blake2s-256",
-	.base.cra_driver_name	= "blake2s-256-generic",
-	.base.cra_flags		= CRYPTO_ALG_OPTIONAL_KEY,
-	.base.cra_ctxsize	= sizeof(struct blake2s_tfm_ctx),
-	.base.cra_priority	= 200,
-	.base.cra_blocksize     = BLAKE2S_BLOCK_SIZE,
-	.base.cra_module	= THIS_MODULE,
-
-	.digestsize		= BLAKE2S_256_HASH_SIZE,
-	.setkey			= crypto_blake2s_setkey,
-	.init			= crypto_blake2s_init,
-	.update			= crypto_blake2s_update,
-	.final			= crypto_blake2s_final,
-	.descsize		= sizeof(struct blake2s_state),
-}};
+#define BLAKE2S_ALG(name, driver_name, digest_size)			\
+	{								\
+		.base.cra_name		= name,				\
+		.base.cra_driver_name	= driver_name,			\
+		.base.cra_priority	= 200,				\
+		.base.cra_flags		= CRYPTO_ALG_OPTIONAL_KEY,	\
+		.base.cra_blocksize	= BLAKE2S_BLOCK_SIZE,		\
+		.base.cra_ctxsize	= sizeof(struct blake2s_tfm_ctx), \
+		.base.cra_module	= THIS_MODULE,			\
+		.digestsize		= digest_size,			\
+		.setkey			= crypto_blake2s_setkey,	\
+		.init			= crypto_blake2s_init,		\
+		.update			= crypto_blake2s_update,	\
+		.final			= crypto_blake2s_final,		\
+		.descsize		= sizeof(struct blake2s_state),	\
+	}
+
+static struct shash_alg blake2s_algs[] = {
+	BLAKE2S_ALG("blake2s-128", "blake2s-128-generic",
+		    BLAKE2S_128_HASH_SIZE),
+	BLAKE2S_ALG("blake2s-160", "blake2s-160-generic",
+		    BLAKE2S_160_HASH_SIZE),
+	BLAKE2S_ALG("blake2s-224", "blake2s-224-generic",
+		    BLAKE2S_224_HASH_SIZE),
+	BLAKE2S_ALG("blake2s-256", "blake2s-256-generic",
+		    BLAKE2S_256_HASH_SIZE),
+};
 
 static int __init blake2s_mod_init(void)
 {
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v3 02/14] crypto: x86/blake2s - define shash_alg structs using macros
  2020-12-23  8:09 [PATCH v3 00/14] crypto: arm32-optimized BLAKE2b and BLAKE2s Eric Biggers
  2020-12-23  8:09 ` [PATCH v3 01/14] crypto: blake2s - define shash_alg structs using macros Eric Biggers
@ 2020-12-23  8:09 ` Eric Biggers
  2020-12-23  8:09 ` [PATCH v3 03/14] crypto: blake2s - remove unneeded includes Eric Biggers
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Eric Biggers @ 2020-12-23  8:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: linux-arm-kernel, Ard Biesheuvel, Herbert Xu, David Sterba,
	Jason A . Donenfeld, Paul Crowley

From: Eric Biggers <ebiggers@google.com>

The shash_alg structs for the four variants of BLAKE2s are identical
except for the algorithm name, driver name, and digest size.  So, avoid
code duplication by using a macro to define these structs.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 arch/x86/crypto/blake2s-glue.c | 84 ++++++++++------------------------
 1 file changed, 23 insertions(+), 61 deletions(-)

diff --git a/arch/x86/crypto/blake2s-glue.c b/arch/x86/crypto/blake2s-glue.c
index c025a01cf7084..4dcb2ee89efc9 100644
--- a/arch/x86/crypto/blake2s-glue.c
+++ b/arch/x86/crypto/blake2s-glue.c
@@ -129,67 +129,29 @@ static int crypto_blake2s_final(struct shash_desc *desc, u8 *out)
 	return 0;
 }
 
-static struct shash_alg blake2s_algs[] = {{
-	.base.cra_name		= "blake2s-128",
-	.base.cra_driver_name	= "blake2s-128-x86",
-	.base.cra_flags		= CRYPTO_ALG_OPTIONAL_KEY,
-	.base.cra_ctxsize	= sizeof(struct blake2s_tfm_ctx),
-	.base.cra_priority	= 200,
-	.base.cra_blocksize     = BLAKE2S_BLOCK_SIZE,
-	.base.cra_module	= THIS_MODULE,
-
-	.digestsize		= BLAKE2S_128_HASH_SIZE,
-	.setkey			= crypto_blake2s_setkey,
-	.init			= crypto_blake2s_init,
-	.update			= crypto_blake2s_update,
-	.final			= crypto_blake2s_final,
-	.descsize		= sizeof(struct blake2s_state),
-}, {
-	.base.cra_name		= "blake2s-160",
-	.base.cra_driver_name	= "blake2s-160-x86",
-	.base.cra_flags		= CRYPTO_ALG_OPTIONAL_KEY,
-	.base.cra_ctxsize	= sizeof(struct blake2s_tfm_ctx),
-	.base.cra_priority	= 200,
-	.base.cra_blocksize     = BLAKE2S_BLOCK_SIZE,
-	.base.cra_module	= THIS_MODULE,
-
-	.digestsize		= BLAKE2S_160_HASH_SIZE,
-	.setkey			= crypto_blake2s_setkey,
-	.init			= crypto_blake2s_init,
-	.update			= crypto_blake2s_update,
-	.final			= crypto_blake2s_final,
-	.descsize		= sizeof(struct blake2s_state),
-}, {
-	.base.cra_name		= "blake2s-224",
-	.base.cra_driver_name	= "blake2s-224-x86",
-	.base.cra_flags		= CRYPTO_ALG_OPTIONAL_KEY,
-	.base.cra_ctxsize	= sizeof(struct blake2s_tfm_ctx),
-	.base.cra_priority	= 200,
-	.base.cra_blocksize     = BLAKE2S_BLOCK_SIZE,
-	.base.cra_module	= THIS_MODULE,
-
-	.digestsize		= BLAKE2S_224_HASH_SIZE,
-	.setkey			= crypto_blake2s_setkey,
-	.init			= crypto_blake2s_init,
-	.update			= crypto_blake2s_update,
-	.final			= crypto_blake2s_final,
-	.descsize		= sizeof(struct blake2s_state),
-}, {
-	.base.cra_name		= "blake2s-256",
-	.base.cra_driver_name	= "blake2s-256-x86",
-	.base.cra_flags		= CRYPTO_ALG_OPTIONAL_KEY,
-	.base.cra_ctxsize	= sizeof(struct blake2s_tfm_ctx),
-	.base.cra_priority	= 200,
-	.base.cra_blocksize     = BLAKE2S_BLOCK_SIZE,
-	.base.cra_module	= THIS_MODULE,
-
-	.digestsize		= BLAKE2S_256_HASH_SIZE,
-	.setkey			= crypto_blake2s_setkey,
-	.init			= crypto_blake2s_init,
-	.update			= crypto_blake2s_update,
-	.final			= crypto_blake2s_final,
-	.descsize		= sizeof(struct blake2s_state),
-}};
+#define BLAKE2S_ALG(name, driver_name, digest_size)			\
+	{								\
+		.base.cra_name		= name,				\
+		.base.cra_driver_name	= driver_name,			\
+		.base.cra_priority	= 200,				\
+		.base.cra_flags		= CRYPTO_ALG_OPTIONAL_KEY,	\
+		.base.cra_blocksize	= BLAKE2S_BLOCK_SIZE,		\
+		.base.cra_ctxsize	= sizeof(struct blake2s_tfm_ctx), \
+		.base.cra_module	= THIS_MODULE,			\
+		.digestsize		= digest_size,			\
+		.setkey			= crypto_blake2s_setkey,	\
+		.init			= crypto_blake2s_init,		\
+		.update			= crypto_blake2s_update,	\
+		.final			= crypto_blake2s_final,		\
+		.descsize		= sizeof(struct blake2s_state),	\
+	}
+
+static struct shash_alg blake2s_algs[] = {
+	BLAKE2S_ALG("blake2s-128", "blake2s-128-x86", BLAKE2S_128_HASH_SIZE),
+	BLAKE2S_ALG("blake2s-160", "blake2s-160-x86", BLAKE2S_160_HASH_SIZE),
+	BLAKE2S_ALG("blake2s-224", "blake2s-224-x86", BLAKE2S_224_HASH_SIZE),
+	BLAKE2S_ALG("blake2s-256", "blake2s-256-x86", BLAKE2S_256_HASH_SIZE),
+};
 
 static int __init blake2s_mod_init(void)
 {
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v3 03/14] crypto: blake2s - remove unneeded includes
  2020-12-23  8:09 [PATCH v3 00/14] crypto: arm32-optimized BLAKE2b and BLAKE2s Eric Biggers
  2020-12-23  8:09 ` [PATCH v3 01/14] crypto: blake2s - define shash_alg structs using macros Eric Biggers
  2020-12-23  8:09 ` [PATCH v3 02/14] crypto: x86/blake2s " Eric Biggers
@ 2020-12-23  8:09 ` Eric Biggers
  2020-12-23  8:09 ` [PATCH v3 04/14] crypto: blake2s - move update and final logic to internal/blake2s.h Eric Biggers
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Eric Biggers @ 2020-12-23  8:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: linux-arm-kernel, Ard Biesheuvel, Herbert Xu, David Sterba,
	Jason A . Donenfeld, Paul Crowley

From: Eric Biggers <ebiggers@google.com>

It doesn't make sense for the generic implementation of BLAKE2s to
include <crypto/internal/simd.h> and <linux/jump_label.h>, as these are
things that would only be useful in an architecture-specific
implementation.  Remove these unnecessary includes.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 crypto/blake2s_generic.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/crypto/blake2s_generic.c b/crypto/blake2s_generic.c
index e3aa6e7ff3d83..b89536c3671cf 100644
--- a/crypto/blake2s_generic.c
+++ b/crypto/blake2s_generic.c
@@ -4,11 +4,9 @@
  */
 
 #include <crypto/internal/blake2s.h>
-#include <crypto/internal/simd.h>
 #include <crypto/internal/hash.h>
 
 #include <linux/types.h>
-#include <linux/jump_label.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v3 04/14] crypto: blake2s - move update and final logic to internal/blake2s.h
  2020-12-23  8:09 [PATCH v3 00/14] crypto: arm32-optimized BLAKE2b and BLAKE2s Eric Biggers
                   ` (2 preceding siblings ...)
  2020-12-23  8:09 ` [PATCH v3 03/14] crypto: blake2s - remove unneeded includes Eric Biggers
@ 2020-12-23  8:09 ` Eric Biggers
  2020-12-23  9:05   ` Ard Biesheuvel
  2020-12-23  8:09 ` [PATCH v3 05/14] crypto: blake2s - share the "shash" API boilerplate code Eric Biggers
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 25+ messages in thread
From: Eric Biggers @ 2020-12-23  8:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: linux-arm-kernel, Ard Biesheuvel, Herbert Xu, David Sterba,
	Jason A . Donenfeld, Paul Crowley

From: Eric Biggers <ebiggers@google.com>

Move most of blake2s_update() and blake2s_final() into new inline
functions __blake2s_update() and __blake2s_final() in
include/crypto/internal/blake2s.h so that this logic can be shared by
the shash helper functions.  This will avoid duplicating this logic
between the library and shash implementations.

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 include/crypto/internal/blake2s.h | 41 ++++++++++++++++++++++++++
 lib/crypto/blake2s.c              | 48 ++++++-------------------------
 2 files changed, 49 insertions(+), 40 deletions(-)

diff --git a/include/crypto/internal/blake2s.h b/include/crypto/internal/blake2s.h
index 6e376ae6b6b58..42deba4b8ceef 100644
--- a/include/crypto/internal/blake2s.h
+++ b/include/crypto/internal/blake2s.h
@@ -4,6 +4,7 @@
 #define BLAKE2S_INTERNAL_H
 
 #include <crypto/blake2s.h>
+#include <linux/string.h>
 
 struct blake2s_tfm_ctx {
 	u8 key[BLAKE2S_KEY_SIZE];
@@ -23,4 +24,44 @@ static inline void blake2s_set_lastblock(struct blake2s_state *state)
 	state->f[0] = -1;
 }
 
+typedef void (*blake2s_compress_t)(struct blake2s_state *state,
+				   const u8 *block, size_t nblocks, u32 inc);
+
+static inline void __blake2s_update(struct blake2s_state *state,
+				    const u8 *in, size_t inlen,
+				    blake2s_compress_t compress)
+{
+	const size_t fill = BLAKE2S_BLOCK_SIZE - state->buflen;
+
+	if (unlikely(!inlen))
+		return;
+	if (inlen > fill) {
+		memcpy(state->buf + state->buflen, in, fill);
+		(*compress)(state, state->buf, 1, BLAKE2S_BLOCK_SIZE);
+		state->buflen = 0;
+		in += fill;
+		inlen -= fill;
+	}
+	if (inlen > BLAKE2S_BLOCK_SIZE) {
+		const size_t nblocks = DIV_ROUND_UP(inlen, BLAKE2S_BLOCK_SIZE);
+		/* Hash one less (full) block than strictly possible */
+		(*compress)(state, in, nblocks - 1, BLAKE2S_BLOCK_SIZE);
+		in += BLAKE2S_BLOCK_SIZE * (nblocks - 1);
+		inlen -= BLAKE2S_BLOCK_SIZE * (nblocks - 1);
+	}
+	memcpy(state->buf + state->buflen, in, inlen);
+	state->buflen += inlen;
+}
+
+static inline void __blake2s_final(struct blake2s_state *state, u8 *out,
+				   blake2s_compress_t compress)
+{
+	blake2s_set_lastblock(state);
+	memset(state->buf + state->buflen, 0,
+	       BLAKE2S_BLOCK_SIZE - state->buflen); /* Padding */
+	(*compress)(state, state->buf, 1, state->buflen);
+	cpu_to_le32_array(state->h, ARRAY_SIZE(state->h));
+	memcpy(out, state->h, state->outlen);
+}
+
 #endif /* BLAKE2S_INTERNAL_H */
diff --git a/lib/crypto/blake2s.c b/lib/crypto/blake2s.c
index 6a4b6b78d630f..c64ac8bfb6a97 100644
--- a/lib/crypto/blake2s.c
+++ b/lib/crypto/blake2s.c
@@ -15,55 +15,23 @@
 #include <linux/module.h>
 #include <linux/init.h>
 #include <linux/bug.h>
-#include <asm/unaligned.h>
+
+#if IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_BLAKE2S)
+#  define blake2s_compress blake2s_compress_arch
+#else
+#  define blake2s_compress blake2s_compress_generic
+#endif
 
 void blake2s_update(struct blake2s_state *state, const u8 *in, size_t inlen)
 {
-	const size_t fill = BLAKE2S_BLOCK_SIZE - state->buflen;
-
-	if (unlikely(!inlen))
-		return;
-	if (inlen > fill) {
-		memcpy(state->buf + state->buflen, in, fill);
-		if (IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_BLAKE2S))
-			blake2s_compress_arch(state, state->buf, 1,
-					      BLAKE2S_BLOCK_SIZE);
-		else
-			blake2s_compress_generic(state, state->buf, 1,
-						 BLAKE2S_BLOCK_SIZE);
-		state->buflen = 0;
-		in += fill;
-		inlen -= fill;
-	}
-	if (inlen > BLAKE2S_BLOCK_SIZE) {
-		const size_t nblocks = DIV_ROUND_UP(inlen, BLAKE2S_BLOCK_SIZE);
-		/* Hash one less (full) block than strictly possible */
-		if (IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_BLAKE2S))
-			blake2s_compress_arch(state, in, nblocks - 1,
-					      BLAKE2S_BLOCK_SIZE);
-		else
-			blake2s_compress_generic(state, in, nblocks - 1,
-						 BLAKE2S_BLOCK_SIZE);
-		in += BLAKE2S_BLOCK_SIZE * (nblocks - 1);
-		inlen -= BLAKE2S_BLOCK_SIZE * (nblocks - 1);
-	}
-	memcpy(state->buf + state->buflen, in, inlen);
-	state->buflen += inlen;
+	__blake2s_update(state, in, inlen, blake2s_compress);
 }
 EXPORT_SYMBOL(blake2s_update);
 
 void blake2s_final(struct blake2s_state *state, u8 *out)
 {
 	WARN_ON(IS_ENABLED(DEBUG) && !out);
-	blake2s_set_lastblock(state);
-	memset(state->buf + state->buflen, 0,
-	       BLAKE2S_BLOCK_SIZE - state->buflen); /* Padding */
-	if (IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_BLAKE2S))
-		blake2s_compress_arch(state, state->buf, 1, state->buflen);
-	else
-		blake2s_compress_generic(state, state->buf, 1, state->buflen);
-	cpu_to_le32_array(state->h, ARRAY_SIZE(state->h));
-	memcpy(out, state->h, state->outlen);
+	__blake2s_final(state, out, blake2s_compress);
 	memzero_explicit(state, sizeof(*state));
 }
 EXPORT_SYMBOL(blake2s_final);
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v3 05/14] crypto: blake2s - share the "shash" API boilerplate code
  2020-12-23  8:09 [PATCH v3 00/14] crypto: arm32-optimized BLAKE2b and BLAKE2s Eric Biggers
                   ` (3 preceding siblings ...)
  2020-12-23  8:09 ` [PATCH v3 04/14] crypto: blake2s - move update and final logic to internal/blake2s.h Eric Biggers
@ 2020-12-23  8:09 ` Eric Biggers
  2020-12-23  9:06   ` Ard Biesheuvel
  2020-12-23  8:09 ` [PATCH v3 06/14] crypto: blake2s - optimize blake2s initialization Eric Biggers
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 25+ messages in thread
From: Eric Biggers @ 2020-12-23  8:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: linux-arm-kernel, Ard Biesheuvel, Herbert Xu, David Sterba,
	Jason A . Donenfeld, Paul Crowley

From: Eric Biggers <ebiggers@google.com>

Add helper functions for shash implementations of BLAKE2s to
include/crypto/internal/blake2s.h, taking advantage of
__blake2s_update() and __blake2s_final() that were added by the previous
patch to share more code between the library and shash implementations.

crypto_blake2s_setkey() and crypto_blake2s_init() are usable as
shash_alg::setkey and shash_alg::init directly, while
crypto_blake2s_update() and crypto_blake2s_final() take an extra
'blake2s_compress_t' function pointer parameter.  This allows the
implementation of the compression function to be overridden, which is
the only part that optimized implementations really care about.

The new functions are inline functions (similar to those in sha1_base.h,
sha256_base.h, and sm3_base.h) because this avoids needing to add a new
module blake2s_helpers.ko, they aren't *too* long, and this avoids
indirect calls which are expensive these days.  Note that they can't go
in blake2s_generic.ko, as that would require selecting CRYPTO_BLAKE2S
from CRYPTO_BLAKE2S_X86, which would cause a recursive dependency.

Finally, use these new helper functions in the x86 implementation of
BLAKE2s.  (This part should be a separate patch, but unfortunately the
x86 implementation used the exact same function names like
"crypto_blake2s_update()", so it had to be updated at the same time.)

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 arch/x86/crypto/blake2s-glue.c    | 74 +++---------------------------
 crypto/blake2s_generic.c          | 76 ++++---------------------------
 include/crypto/internal/blake2s.h | 65 ++++++++++++++++++++++++--
 3 files changed, 76 insertions(+), 139 deletions(-)

diff --git a/arch/x86/crypto/blake2s-glue.c b/arch/x86/crypto/blake2s-glue.c
index 4dcb2ee89efc9..a40365ab301ee 100644
--- a/arch/x86/crypto/blake2s-glue.c
+++ b/arch/x86/crypto/blake2s-glue.c
@@ -58,75 +58,15 @@ void blake2s_compress_arch(struct blake2s_state *state,
 }
 EXPORT_SYMBOL(blake2s_compress_arch);
 
-static int crypto_blake2s_setkey(struct crypto_shash *tfm, const u8 *key,
-				 unsigned int keylen)
+static int crypto_blake2s_update_x86(struct shash_desc *desc,
+				     const u8 *in, unsigned int inlen)
 {
-	struct blake2s_tfm_ctx *tctx = crypto_shash_ctx(tfm);
-
-	if (keylen == 0 || keylen > BLAKE2S_KEY_SIZE)
-		return -EINVAL;
-
-	memcpy(tctx->key, key, keylen);
-	tctx->keylen = keylen;
-
-	return 0;
-}
-
-static int crypto_blake2s_init(struct shash_desc *desc)
-{
-	struct blake2s_tfm_ctx *tctx = crypto_shash_ctx(desc->tfm);
-	struct blake2s_state *state = shash_desc_ctx(desc);
-	const int outlen = crypto_shash_digestsize(desc->tfm);
-
-	if (tctx->keylen)
-		blake2s_init_key(state, outlen, tctx->key, tctx->keylen);
-	else
-		blake2s_init(state, outlen);
-
-	return 0;
-}
-
-static int crypto_blake2s_update(struct shash_desc *desc, const u8 *in,
-				 unsigned int inlen)
-{
-	struct blake2s_state *state = shash_desc_ctx(desc);
-	const size_t fill = BLAKE2S_BLOCK_SIZE - state->buflen;
-
-	if (unlikely(!inlen))
-		return 0;
-	if (inlen > fill) {
-		memcpy(state->buf + state->buflen, in, fill);
-		blake2s_compress_arch(state, state->buf, 1, BLAKE2S_BLOCK_SIZE);
-		state->buflen = 0;
-		in += fill;
-		inlen -= fill;
-	}
-	if (inlen > BLAKE2S_BLOCK_SIZE) {
-		const size_t nblocks = DIV_ROUND_UP(inlen, BLAKE2S_BLOCK_SIZE);
-		/* Hash one less (full) block than strictly possible */
-		blake2s_compress_arch(state, in, nblocks - 1, BLAKE2S_BLOCK_SIZE);
-		in += BLAKE2S_BLOCK_SIZE * (nblocks - 1);
-		inlen -= BLAKE2S_BLOCK_SIZE * (nblocks - 1);
-	}
-	memcpy(state->buf + state->buflen, in, inlen);
-	state->buflen += inlen;
-
-	return 0;
+	return crypto_blake2s_update(desc, in, inlen, blake2s_compress_arch);
 }
 
-static int crypto_blake2s_final(struct shash_desc *desc, u8 *out)
+static int crypto_blake2s_final_x86(struct shash_desc *desc, u8 *out)
 {
-	struct blake2s_state *state = shash_desc_ctx(desc);
-
-	blake2s_set_lastblock(state);
-	memset(state->buf + state->buflen, 0,
-	       BLAKE2S_BLOCK_SIZE - state->buflen); /* Padding */
-	blake2s_compress_arch(state, state->buf, 1, state->buflen);
-	cpu_to_le32_array(state->h, ARRAY_SIZE(state->h));
-	memcpy(out, state->h, state->outlen);
-	memzero_explicit(state, sizeof(*state));
-
-	return 0;
+	return crypto_blake2s_final(desc, out, blake2s_compress_arch);
 }
 
 #define BLAKE2S_ALG(name, driver_name, digest_size)			\
@@ -141,8 +81,8 @@ static int crypto_blake2s_final(struct shash_desc *desc, u8 *out)
 		.digestsize		= digest_size,			\
 		.setkey			= crypto_blake2s_setkey,	\
 		.init			= crypto_blake2s_init,		\
-		.update			= crypto_blake2s_update,	\
-		.final			= crypto_blake2s_final,		\
+		.update			= crypto_blake2s_update_x86,	\
+		.final			= crypto_blake2s_final_x86,	\
 		.descsize		= sizeof(struct blake2s_state),	\
 	}
 
diff --git a/crypto/blake2s_generic.c b/crypto/blake2s_generic.c
index b89536c3671cf..72fe480f9bd67 100644
--- a/crypto/blake2s_generic.c
+++ b/crypto/blake2s_generic.c
@@ -1,5 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0 OR MIT
 /*
+ * shash interface to the generic implementation of BLAKE2s
+ *
  * Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
  */
 
@@ -10,75 +12,15 @@
 #include <linux/kernel.h>
 #include <linux/module.h>
 
-static int crypto_blake2s_setkey(struct crypto_shash *tfm, const u8 *key,
-				 unsigned int keylen)
+static int crypto_blake2s_update_generic(struct shash_desc *desc,
+					 const u8 *in, unsigned int inlen)
 {
-	struct blake2s_tfm_ctx *tctx = crypto_shash_ctx(tfm);
-
-	if (keylen == 0 || keylen > BLAKE2S_KEY_SIZE)
-		return -EINVAL;
-
-	memcpy(tctx->key, key, keylen);
-	tctx->keylen = keylen;
-
-	return 0;
+	return crypto_blake2s_update(desc, in, inlen, blake2s_compress_generic);
 }
 
-static int crypto_blake2s_init(struct shash_desc *desc)
+static int crypto_blake2s_final_generic(struct shash_desc *desc, u8 *out)
 {
-	struct blake2s_tfm_ctx *tctx = crypto_shash_ctx(desc->tfm);
-	struct blake2s_state *state = shash_desc_ctx(desc);
-	const int outlen = crypto_shash_digestsize(desc->tfm);
-
-	if (tctx->keylen)
-		blake2s_init_key(state, outlen, tctx->key, tctx->keylen);
-	else
-		blake2s_init(state, outlen);
-
-	return 0;
-}
-
-static int crypto_blake2s_update(struct shash_desc *desc, const u8 *in,
-				 unsigned int inlen)
-{
-	struct blake2s_state *state = shash_desc_ctx(desc);
-	const size_t fill = BLAKE2S_BLOCK_SIZE - state->buflen;
-
-	if (unlikely(!inlen))
-		return 0;
-	if (inlen > fill) {
-		memcpy(state->buf + state->buflen, in, fill);
-		blake2s_compress_generic(state, state->buf, 1, BLAKE2S_BLOCK_SIZE);
-		state->buflen = 0;
-		in += fill;
-		inlen -= fill;
-	}
-	if (inlen > BLAKE2S_BLOCK_SIZE) {
-		const size_t nblocks = DIV_ROUND_UP(inlen, BLAKE2S_BLOCK_SIZE);
-		/* Hash one less (full) block than strictly possible */
-		blake2s_compress_generic(state, in, nblocks - 1, BLAKE2S_BLOCK_SIZE);
-		in += BLAKE2S_BLOCK_SIZE * (nblocks - 1);
-		inlen -= BLAKE2S_BLOCK_SIZE * (nblocks - 1);
-	}
-	memcpy(state->buf + state->buflen, in, inlen);
-	state->buflen += inlen;
-
-	return 0;
-}
-
-static int crypto_blake2s_final(struct shash_desc *desc, u8 *out)
-{
-	struct blake2s_state *state = shash_desc_ctx(desc);
-
-	blake2s_set_lastblock(state);
-	memset(state->buf + state->buflen, 0,
-	       BLAKE2S_BLOCK_SIZE - state->buflen); /* Padding */
-	blake2s_compress_generic(state, state->buf, 1, state->buflen);
-	cpu_to_le32_array(state->h, ARRAY_SIZE(state->h));
-	memcpy(out, state->h, state->outlen);
-	memzero_explicit(state, sizeof(*state));
-
-	return 0;
+	return crypto_blake2s_final(desc, out, blake2s_compress_generic);
 }
 
 #define BLAKE2S_ALG(name, driver_name, digest_size)			\
@@ -93,8 +35,8 @@ static int crypto_blake2s_final(struct shash_desc *desc, u8 *out)
 		.digestsize		= digest_size,			\
 		.setkey			= crypto_blake2s_setkey,	\
 		.init			= crypto_blake2s_init,		\
-		.update			= crypto_blake2s_update,	\
-		.final			= crypto_blake2s_final,		\
+		.update			= crypto_blake2s_update_generic, \
+		.final			= crypto_blake2s_final_generic,	\
 		.descsize		= sizeof(struct blake2s_state),	\
 	}
 
diff --git a/include/crypto/internal/blake2s.h b/include/crypto/internal/blake2s.h
index 42deba4b8ceef..2ea0a8f5e7f41 100644
--- a/include/crypto/internal/blake2s.h
+++ b/include/crypto/internal/blake2s.h
@@ -1,16 +1,16 @@
 /* SPDX-License-Identifier: GPL-2.0 OR MIT */
+/*
+ * Helper functions for BLAKE2s implementations.
+ * Keep this in sync with the corresponding BLAKE2b header.
+ */
 
 #ifndef BLAKE2S_INTERNAL_H
 #define BLAKE2S_INTERNAL_H
 
 #include <crypto/blake2s.h>
+#include <crypto/internal/hash.h>
 #include <linux/string.h>
 
-struct blake2s_tfm_ctx {
-	u8 key[BLAKE2S_KEY_SIZE];
-	unsigned int keylen;
-};
-
 void blake2s_compress_generic(struct blake2s_state *state,const u8 *block,
 			      size_t nblocks, const u32 inc);
 
@@ -27,6 +27,8 @@ static inline void blake2s_set_lastblock(struct blake2s_state *state)
 typedef void (*blake2s_compress_t)(struct blake2s_state *state,
 				   const u8 *block, size_t nblocks, u32 inc);
 
+/* Helper functions for BLAKE2s shared by the library and shash APIs */
+
 static inline void __blake2s_update(struct blake2s_state *state,
 				    const u8 *in, size_t inlen,
 				    blake2s_compress_t compress)
@@ -64,4 +66,57 @@ static inline void __blake2s_final(struct blake2s_state *state, u8 *out,
 	memcpy(out, state->h, state->outlen);
 }
 
+/* Helper functions for shash implementations of BLAKE2s */
+
+struct blake2s_tfm_ctx {
+	u8 key[BLAKE2S_KEY_SIZE];
+	unsigned int keylen;
+};
+
+static inline int crypto_blake2s_setkey(struct crypto_shash *tfm,
+					const u8 *key, unsigned int keylen)
+{
+	struct blake2s_tfm_ctx *tctx = crypto_shash_ctx(tfm);
+
+	if (keylen == 0 || keylen > BLAKE2S_KEY_SIZE)
+		return -EINVAL;
+
+	memcpy(tctx->key, key, keylen);
+	tctx->keylen = keylen;
+
+	return 0;
+}
+
+static inline int crypto_blake2s_init(struct shash_desc *desc)
+{
+	const struct blake2s_tfm_ctx *tctx = crypto_shash_ctx(desc->tfm);
+	struct blake2s_state *state = shash_desc_ctx(desc);
+	unsigned int outlen = crypto_shash_digestsize(desc->tfm);
+
+	if (tctx->keylen)
+		blake2s_init_key(state, outlen, tctx->key, tctx->keylen);
+	else
+		blake2s_init(state, outlen);
+	return 0;
+}
+
+static inline int crypto_blake2s_update(struct shash_desc *desc,
+					const u8 *in, unsigned int inlen,
+					blake2s_compress_t compress)
+{
+	struct blake2s_state *state = shash_desc_ctx(desc);
+
+	__blake2s_update(state, in, inlen, compress);
+	return 0;
+}
+
+static inline int crypto_blake2s_final(struct shash_desc *desc, u8 *out,
+				       blake2s_compress_t compress)
+{
+	struct blake2s_state *state = shash_desc_ctx(desc);
+
+	__blake2s_final(state, out, compress);
+	return 0;
+}
+
 #endif /* BLAKE2S_INTERNAL_H */
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v3 06/14] crypto: blake2s - optimize blake2s initialization
  2020-12-23  8:09 [PATCH v3 00/14] crypto: arm32-optimized BLAKE2b and BLAKE2s Eric Biggers
                   ` (4 preceding siblings ...)
  2020-12-23  8:09 ` [PATCH v3 05/14] crypto: blake2s - share the "shash" API boilerplate code Eric Biggers
@ 2020-12-23  8:09 ` Eric Biggers
  2020-12-23  9:06   ` Ard Biesheuvel
  2020-12-23  8:09 ` [PATCH v3 07/14] crypto: blake2s - add comment for blake2s_state fields Eric Biggers
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 25+ messages in thread
From: Eric Biggers @ 2020-12-23  8:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: linux-arm-kernel, Ard Biesheuvel, Herbert Xu, David Sterba,
	Jason A . Donenfeld, Paul Crowley

From: Eric Biggers <ebiggers@google.com>

If no key was provided, then don't waste time initializing the block
buffer, as its initial contents won't be used.

Also, make crypto_blake2s_init() and blake2s() call a single internal
function __blake2s_init() which treats the key as optional, rather than
conditionally calling blake2s_init() or blake2s_init_key().  This
reduces the compiled code size, as previously both blake2s_init() and
blake2s_init_key() were being inlined into these two callers, except
when the key size passed to blake2s() was a compile-time constant.

These optimizations aren't that significant for BLAKE2s.  However, the
equivalent optimizations will be more significant for BLAKE2b, as
everything is twice as big in BLAKE2b.  And it's good to keep things
consistent rather than making optimizations for BLAKE2b but not BLAKE2s.

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 include/crypto/blake2s.h          | 53 ++++++++++++++++---------------
 include/crypto/internal/blake2s.h |  5 +--
 2 files changed, 28 insertions(+), 30 deletions(-)

diff --git a/include/crypto/blake2s.h b/include/crypto/blake2s.h
index b471deac28ff8..734ed22b7a6aa 100644
--- a/include/crypto/blake2s.h
+++ b/include/crypto/blake2s.h
@@ -43,29 +43,34 @@ enum blake2s_iv {
 	BLAKE2S_IV7 = 0x5BE0CD19UL,
 };
 
-void blake2s_update(struct blake2s_state *state, const u8 *in, size_t inlen);
-void blake2s_final(struct blake2s_state *state, u8 *out);
-
-static inline void blake2s_init_param(struct blake2s_state *state,
-				      const u32 param)
+static inline void __blake2s_init(struct blake2s_state *state, size_t outlen,
+				  const void *key, size_t keylen)
 {
-	*state = (struct blake2s_state){{
-		BLAKE2S_IV0 ^ param,
-		BLAKE2S_IV1,
-		BLAKE2S_IV2,
-		BLAKE2S_IV3,
-		BLAKE2S_IV4,
-		BLAKE2S_IV5,
-		BLAKE2S_IV6,
-		BLAKE2S_IV7,
-	}};
+	state->h[0] = BLAKE2S_IV0 ^ (0x01010000 | keylen << 8 | outlen);
+	state->h[1] = BLAKE2S_IV1;
+	state->h[2] = BLAKE2S_IV2;
+	state->h[3] = BLAKE2S_IV3;
+	state->h[4] = BLAKE2S_IV4;
+	state->h[5] = BLAKE2S_IV5;
+	state->h[6] = BLAKE2S_IV6;
+	state->h[7] = BLAKE2S_IV7;
+	state->t[0] = 0;
+	state->t[1] = 0;
+	state->f[0] = 0;
+	state->f[1] = 0;
+	state->buflen = 0;
+	state->outlen = outlen;
+	if (keylen) {
+		memcpy(state->buf, key, keylen);
+		memset(&state->buf[keylen], 0, BLAKE2S_BLOCK_SIZE - keylen);
+		state->buflen = BLAKE2S_BLOCK_SIZE;
+	}
 }
 
 static inline void blake2s_init(struct blake2s_state *state,
 				const size_t outlen)
 {
-	blake2s_init_param(state, 0x01010000 | outlen);
-	state->outlen = outlen;
+	__blake2s_init(state, outlen, NULL, 0);
 }
 
 static inline void blake2s_init_key(struct blake2s_state *state,
@@ -75,12 +80,12 @@ static inline void blake2s_init_key(struct blake2s_state *state,
 	WARN_ON(IS_ENABLED(DEBUG) && (!outlen || outlen > BLAKE2S_HASH_SIZE ||
 		!key || !keylen || keylen > BLAKE2S_KEY_SIZE));
 
-	blake2s_init_param(state, 0x01010000 | keylen << 8 | outlen);
-	memcpy(state->buf, key, keylen);
-	state->buflen = BLAKE2S_BLOCK_SIZE;
-	state->outlen = outlen;
+	__blake2s_init(state, outlen, key, keylen);
 }
 
+void blake2s_update(struct blake2s_state *state, const u8 *in, size_t inlen);
+void blake2s_final(struct blake2s_state *state, u8 *out);
+
 static inline void blake2s(u8 *out, const u8 *in, const u8 *key,
 			   const size_t outlen, const size_t inlen,
 			   const size_t keylen)
@@ -91,11 +96,7 @@ static inline void blake2s(u8 *out, const u8 *in, const u8 *key,
 		outlen > BLAKE2S_HASH_SIZE || keylen > BLAKE2S_KEY_SIZE ||
 		(!key && keylen)));
 
-	if (keylen)
-		blake2s_init_key(&state, outlen, key, keylen);
-	else
-		blake2s_init(&state, outlen);
-
+	__blake2s_init(&state, outlen, key, keylen);
 	blake2s_update(&state, in, inlen);
 	blake2s_final(&state, out);
 }
diff --git a/include/crypto/internal/blake2s.h b/include/crypto/internal/blake2s.h
index 2ea0a8f5e7f41..867ef3753f5c1 100644
--- a/include/crypto/internal/blake2s.h
+++ b/include/crypto/internal/blake2s.h
@@ -93,10 +93,7 @@ static inline int crypto_blake2s_init(struct shash_desc *desc)
 	struct blake2s_state *state = shash_desc_ctx(desc);
 	unsigned int outlen = crypto_shash_digestsize(desc->tfm);
 
-	if (tctx->keylen)
-		blake2s_init_key(state, outlen, tctx->key, tctx->keylen);
-	else
-		blake2s_init(state, outlen);
+	__blake2s_init(state, outlen, tctx->key, tctx->keylen);
 	return 0;
 }
 
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v3 07/14] crypto: blake2s - add comment for blake2s_state fields
  2020-12-23  8:09 [PATCH v3 00/14] crypto: arm32-optimized BLAKE2b and BLAKE2s Eric Biggers
                   ` (5 preceding siblings ...)
  2020-12-23  8:09 ` [PATCH v3 06/14] crypto: blake2s - optimize blake2s initialization Eric Biggers
@ 2020-12-23  8:09 ` Eric Biggers
  2020-12-23  9:07   ` Ard Biesheuvel
  2020-12-23  8:09 ` [PATCH v3 08/14] crypto: blake2s - adjust include guard naming Eric Biggers
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 25+ messages in thread
From: Eric Biggers @ 2020-12-23  8:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: linux-arm-kernel, Ard Biesheuvel, Herbert Xu, David Sterba,
	Jason A . Donenfeld, Paul Crowley

From: Eric Biggers <ebiggers@google.com>

The first three fields of 'struct blake2s_state' are used in assembly
code, which isn't immediately obvious, so add a comment to this effect.

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 include/crypto/blake2s.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/crypto/blake2s.h b/include/crypto/blake2s.h
index 734ed22b7a6aa..f1c8330a61a91 100644
--- a/include/crypto/blake2s.h
+++ b/include/crypto/blake2s.h
@@ -24,6 +24,7 @@ enum blake2s_lengths {
 };
 
 struct blake2s_state {
+	/* 'h', 't', and 'f' are used in assembly code, so keep them as-is. */
 	u32 h[8];
 	u32 t[2];
 	u32 f[2];
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v3 08/14] crypto: blake2s - adjust include guard naming
  2020-12-23  8:09 [PATCH v3 00/14] crypto: arm32-optimized BLAKE2b and BLAKE2s Eric Biggers
                   ` (6 preceding siblings ...)
  2020-12-23  8:09 ` [PATCH v3 07/14] crypto: blake2s - add comment for blake2s_state fields Eric Biggers
@ 2020-12-23  8:09 ` Eric Biggers
  2020-12-23  9:07   ` Ard Biesheuvel
  2020-12-23  8:09 ` [PATCH v3 09/14] crypto: blake2s - include <linux/bug.h> instead of <asm/bug.h> Eric Biggers
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 25+ messages in thread
From: Eric Biggers @ 2020-12-23  8:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: linux-arm-kernel, Ard Biesheuvel, Herbert Xu, David Sterba,
	Jason A . Donenfeld, Paul Crowley

From: Eric Biggers <ebiggers@google.com>

Use the full path in the include guards for the BLAKE2s headers to avoid
ambiguity and to match the convention for most files in include/crypto/.

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 include/crypto/blake2s.h          | 6 +++---
 include/crypto/internal/blake2s.h | 6 +++---
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/crypto/blake2s.h b/include/crypto/blake2s.h
index f1c8330a61a91..3f06183c2d804 100644
--- a/include/crypto/blake2s.h
+++ b/include/crypto/blake2s.h
@@ -3,8 +3,8 @@
  * Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
  */
 
-#ifndef BLAKE2S_H
-#define BLAKE2S_H
+#ifndef _CRYPTO_BLAKE2S_H
+#define _CRYPTO_BLAKE2S_H
 
 #include <linux/types.h>
 #include <linux/kernel.h>
@@ -105,4 +105,4 @@ static inline void blake2s(u8 *out, const u8 *in, const u8 *key,
 void blake2s256_hmac(u8 *out, const u8 *in, const u8 *key, const size_t inlen,
 		     const size_t keylen);
 
-#endif /* BLAKE2S_H */
+#endif /* _CRYPTO_BLAKE2S_H */
diff --git a/include/crypto/internal/blake2s.h b/include/crypto/internal/blake2s.h
index 867ef3753f5c1..8e50d487500f2 100644
--- a/include/crypto/internal/blake2s.h
+++ b/include/crypto/internal/blake2s.h
@@ -4,8 +4,8 @@
  * Keep this in sync with the corresponding BLAKE2b header.
  */
 
-#ifndef BLAKE2S_INTERNAL_H
-#define BLAKE2S_INTERNAL_H
+#ifndef _CRYPTO_INTERNAL_BLAKE2S_H
+#define _CRYPTO_INTERNAL_BLAKE2S_H
 
 #include <crypto/blake2s.h>
 #include <crypto/internal/hash.h>
@@ -116,4 +116,4 @@ static inline int crypto_blake2s_final(struct shash_desc *desc, u8 *out,
 	return 0;
 }
 
-#endif /* BLAKE2S_INTERNAL_H */
+#endif /* _CRYPTO_INTERNAL_BLAKE2S_H */
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v3 09/14] crypto: blake2s - include <linux/bug.h> instead of <asm/bug.h>
  2020-12-23  8:09 [PATCH v3 00/14] crypto: arm32-optimized BLAKE2b and BLAKE2s Eric Biggers
                   ` (7 preceding siblings ...)
  2020-12-23  8:09 ` [PATCH v3 08/14] crypto: blake2s - adjust include guard naming Eric Biggers
@ 2020-12-23  8:09 ` Eric Biggers
  2020-12-23  9:07   ` Ard Biesheuvel
  2020-12-23  8:09 ` [PATCH v3 10/14] crypto: arm/blake2s - add ARM scalar optimized BLAKE2s Eric Biggers
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 25+ messages in thread
From: Eric Biggers @ 2020-12-23  8:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: linux-arm-kernel, Ard Biesheuvel, Herbert Xu, David Sterba,
	Jason A . Donenfeld, Paul Crowley

From: Eric Biggers <ebiggers@google.com>

Address the following checkpatch warning:

	WARNING: Use #include <linux/bug.h> instead of <asm/bug.h>

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 include/crypto/blake2s.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/crypto/blake2s.h b/include/crypto/blake2s.h
index 3f06183c2d804..bc3fb59442ce5 100644
--- a/include/crypto/blake2s.h
+++ b/include/crypto/blake2s.h
@@ -6,12 +6,11 @@
 #ifndef _CRYPTO_BLAKE2S_H
 #define _CRYPTO_BLAKE2S_H
 
+#include <linux/bug.h>
 #include <linux/types.h>
 #include <linux/kernel.h>
 #include <linux/string.h>
 
-#include <asm/bug.h>
-
 enum blake2s_lengths {
 	BLAKE2S_BLOCK_SIZE = 64,
 	BLAKE2S_HASH_SIZE = 32,
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v3 10/14] crypto: arm/blake2s - add ARM scalar optimized BLAKE2s
  2020-12-23  8:09 [PATCH v3 00/14] crypto: arm32-optimized BLAKE2b and BLAKE2s Eric Biggers
                   ` (8 preceding siblings ...)
  2020-12-23  8:09 ` [PATCH v3 09/14] crypto: blake2s - include <linux/bug.h> instead of <asm/bug.h> Eric Biggers
@ 2020-12-23  8:09 ` Eric Biggers
  2020-12-23  9:08   ` Ard Biesheuvel
  2020-12-23  8:10 ` [PATCH v3 11/14] wireguard: Kconfig: select CRYPTO_BLAKE2S_ARM Eric Biggers
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 25+ messages in thread
From: Eric Biggers @ 2020-12-23  8:09 UTC (permalink / raw)
  To: linux-crypto
  Cc: linux-arm-kernel, Ard Biesheuvel, Herbert Xu, David Sterba,
	Jason A . Donenfeld, Paul Crowley

From: Eric Biggers <ebiggers@google.com>

Add an ARM scalar optimized implementation of BLAKE2s.

NEON isn't very useful for BLAKE2s because the BLAKE2s block size is too
small for NEON to help.  Each NEON instruction would depend on the
previous one, resulting in poor performance.

With scalar instructions, on the other hand, we can take advantage of
ARM's "free" rotations (like I did in chacha-scalar-core.S) to get an
implementation that runs much faster than the C implementation.
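
The trick, in short: instead of rotating a value as soon as the BLAKE2s
pseudocode says to, the rotation is left pending and applied by the barrel
shifter when the value is next consumed.  A standalone C sketch (illustrative
only, not part of the patch) of why the two orderings give the same result:

	#include <stdint.h>
	#include <stdio.h>

	static uint32_t ror32(uint32_t v, unsigned int n)
	{
		return (v >> n) | (v << (32 - n));
	}

	int main(void)
	{
		uint32_t a = 0x01234567, b = 0x89abcdef, c = 0xfeedbeef;

		/* Rotate immediately, as the reference code does. */
		uint32_t a_now = a + ror32(b ^ c, 12);

		/*
		 * Defer the rotation until 'b' is next used; on ARM the
		 * deferred rotation becomes a free 'ror #12' operand on
		 * the add instruction.
		 */
		uint32_t b_pending = b ^ c;
		uint32_t a_later = a + ror32(b_pending, 12);

		printf("%08x %08x\n", a_now, a_later);	/* identical */
		return 0;
	}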

Performance results on Cortex-A7 in cycles per byte using the shash API:

	4096-byte messages:
		blake2s-256-arm:     18.8
		blake2s-256-generic: 26.0

	500-byte messages:
		blake2s-256-arm:     20.3
		blake2s-256-generic: 27.9

	100-byte messages:
		blake2s-256-arm:     29.7
		blake2s-256-generic: 39.2

	32-byte messages:
		blake2s-256-arm:     50.6
		blake2s-256-generic: 66.2

Except on very short messages, this is still slower than the NEON
implementation of BLAKE2b which I've written; that is 14.0, 16.4, 25.8,
and 76.1 cpb on 4096, 500, 100, and 32-byte messages, respectively.
However, optimized BLAKE2s is useful for cases where BLAKE2s is used
instead of BLAKE2b, such as WireGuard.

This new implementation is added in the form of a new module
blake2s-arm.ko, which is analogous to blake2s-x86_64.ko in that it
provides blake2s_compress_arch() for use by the library API as well as
optionally registering the algorithms with the shash API.
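
For reference, the library API that blake2s_compress_arch() accelerates is
used roughly like this (an illustrative sketch, not code from this patch;
the function and buffer names are made up):

	#include <crypto/blake2s.h>

	void example_hash(const u8 *msg, size_t msglen)
	{
		u8 digest[BLAKE2S_HASH_SIZE];

		/* Unkeyed, full-length (32-byte) digest */
		blake2s(digest, msg, NULL, sizeof(digest), msglen, 0);
	}

When this module is enabled, calls like this reach blake2s_compress_arch()
through the library's compile-time selection of the arch compression
function.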

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 arch/arm/crypto/Kconfig        |   9 ++
 arch/arm/crypto/Makefile       |   2 +
 arch/arm/crypto/blake2s-core.S | 285 +++++++++++++++++++++++++++++++++
 arch/arm/crypto/blake2s-glue.c |  78 +++++++++
 4 files changed, 374 insertions(+)
 create mode 100644 arch/arm/crypto/blake2s-core.S
 create mode 100644 arch/arm/crypto/blake2s-glue.c

diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
index c9bf2df85cb90..281c829c12d0b 100644
--- a/arch/arm/crypto/Kconfig
+++ b/arch/arm/crypto/Kconfig
@@ -62,6 +62,15 @@ config CRYPTO_SHA512_ARM
 	  SHA-512 secure hash standard (DFIPS 180-2) implemented
 	  using optimized ARM assembler and NEON, when available.
 
+config CRYPTO_BLAKE2S_ARM
+	tristate "BLAKE2s digest algorithm (ARM)"
+	select CRYPTO_ARCH_HAVE_LIB_BLAKE2S
+	help
+	  BLAKE2s digest algorithm optimized with ARM scalar instructions.  This
+	  is faster than the generic implementations of BLAKE2s and BLAKE2b, but
+	  slower than the NEON implementation of BLAKE2b.  (There is no NEON
+	  implementation of BLAKE2s, since NEON doesn't really help with it.)
+
 config CRYPTO_AES_ARM
 	tristate "Scalar AES cipher for ARM"
 	select CRYPTO_ALGAPI
diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
index b745c17d356fe..5ad1e985a718b 100644
--- a/arch/arm/crypto/Makefile
+++ b/arch/arm/crypto/Makefile
@@ -9,6 +9,7 @@ obj-$(CONFIG_CRYPTO_SHA1_ARM) += sha1-arm.o
 obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
 obj-$(CONFIG_CRYPTO_SHA256_ARM) += sha256-arm.o
 obj-$(CONFIG_CRYPTO_SHA512_ARM) += sha512-arm.o
+obj-$(CONFIG_CRYPTO_BLAKE2S_ARM) += blake2s-arm.o
 obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha-neon.o
 obj-$(CONFIG_CRYPTO_POLY1305_ARM) += poly1305-arm.o
 obj-$(CONFIG_CRYPTO_NHPOLY1305_NEON) += nhpoly1305-neon.o
@@ -29,6 +30,7 @@ sha256-arm-neon-$(CONFIG_KERNEL_MODE_NEON) := sha256_neon_glue.o
 sha256-arm-y	:= sha256-core.o sha256_glue.o $(sha256-arm-neon-y)
 sha512-arm-neon-$(CONFIG_KERNEL_MODE_NEON) := sha512-neon-glue.o
 sha512-arm-y	:= sha512-core.o sha512-glue.o $(sha512-arm-neon-y)
+blake2s-arm-y   := blake2s-core.o blake2s-glue.o
 sha1-arm-ce-y	:= sha1-ce-core.o sha1-ce-glue.o
 sha2-arm-ce-y	:= sha2-ce-core.o sha2-ce-glue.o
 aes-arm-ce-y	:= aes-ce-core.o aes-ce-glue.o
diff --git a/arch/arm/crypto/blake2s-core.S b/arch/arm/crypto/blake2s-core.S
new file mode 100644
index 0000000000000..bed897e9a181a
--- /dev/null
+++ b/arch/arm/crypto/blake2s-core.S
@@ -0,0 +1,285 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * BLAKE2s digest algorithm, ARM scalar implementation
+ *
+ * Copyright 2020 Google LLC
+ *
+ * Author: Eric Biggers <ebiggers@google.com>
+ */
+
+#include <linux/linkage.h>
+
+	// Registers used to hold message words temporarily.  There aren't
+	// enough ARM registers to hold the whole message block, so we have to
+	// load the words on-demand.
+	M_0		.req	r12
+	M_1		.req	r14
+
+// The BLAKE2s initialization vector
+.Lblake2s_IV:
+	.word	0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A
+	.word	0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19
+
+.macro __ldrd		a, b, src, offset
+#if __LINUX_ARM_ARCH__ >= 6
+	ldrd		\a, \b, [\src, #\offset]
+#else
+	ldr		\a, [\src, #\offset]
+	ldr		\b, [\src, #\offset + 4]
+#endif
+.endm
+
+.macro __strd		a, b, dst, offset
+#if __LINUX_ARM_ARCH__ >= 6
+	strd		\a, \b, [\dst, #\offset]
+#else
+	str		\a, [\dst, #\offset]
+	str		\b, [\dst, #\offset + 4]
+#endif
+.endm
+
+// Execute a quarter-round of BLAKE2s by mixing two columns or two diagonals.
+// (a0, b0, c0, d0) and (a1, b1, c1, d1) give the registers containing the two
+// columns/diagonals.  s0-s1 are the word offsets to the message words the first
+// column/diagonal needs, and likewise s2-s3 for the second column/diagonal.
+// M_0 and M_1 are free to use, and the message block can be found at sp + 32.
+//
+// Note that to save instructions, the rotations don't happen when the
+// pseudocode says they should, but rather they are delayed until the values are
+// used.  See the comment above _blake2s_round().
+.macro _blake2s_quarterround  a0, b0, c0, d0,  a1, b1, c1, d1,  s0, s1, s2, s3
+
+	ldr		M_0, [sp, #32 + 4 * \s0]
+	ldr		M_1, [sp, #32 + 4 * \s2]
+
+	// a += b + m[blake2s_sigma[r][2*i + 0]];
+	add		\a0, \a0, \b0, ror #brot
+	add		\a1, \a1, \b1, ror #brot
+	add		\a0, \a0, M_0
+	add		\a1, \a1, M_1
+
+	// d = ror32(d ^ a, 16);
+	eor		\d0, \a0, \d0, ror #drot
+	eor		\d1, \a1, \d1, ror #drot
+
+	// c += d;
+	add		\c0, \c0, \d0, ror #16
+	add		\c1, \c1, \d1, ror #16
+
+	// b = ror32(b ^ c, 12);
+	eor		\b0, \c0, \b0, ror #brot
+	eor		\b1, \c1, \b1, ror #brot
+
+	ldr		M_0, [sp, #32 + 4 * \s1]
+	ldr		M_1, [sp, #32 + 4 * \s3]
+
+	// a += b + m[blake2s_sigma[r][2*i + 1]];
+	add		\a0, \a0, \b0, ror #12
+	add		\a1, \a1, \b1, ror #12
+	add		\a0, \a0, M_0
+	add		\a1, \a1, M_1
+
+	// d = ror32(d ^ a, 8);
+	eor		\d0, \a0, \d0, ror#16
+	eor		\d1, \a1, \d1, ror#16
+
+	// c += d;
+	add		\c0, \c0, \d0, ror#8
+	add		\c1, \c1, \d1, ror#8
+
+	// b = ror32(b ^ c, 7);
+	eor		\b0, \c0, \b0, ror#12
+	eor		\b1, \c1, \b1, ror#12
+.endm
+
+// Execute one round of BLAKE2s by updating the state matrix v[0..15].  v[0..9]
+// are in r0..r9.  The stack pointer points to 8 bytes of scratch space for
+// spilling v[8..9], then to v[9..15], then to the message block.  r10-r12 and
+// r14 are free to use.  The macro arguments s0-s15 give the order in which the
+// message words are used in this round.
+//
+// All rotates are performed using the implicit rotate operand accepted by the
+// 'add' and 'eor' instructions.  This is faster than using explicit rotate
+// instructions.  To make this work, we allow the values in the second and last
+// rows of the BLAKE2s state matrix (rows 'b' and 'd') to temporarily have the
+// wrong rotation amount.  The rotation amount is then fixed up just in time
+// when the values are used.  'brot' is the number of bits the values in row 'b'
+// need to be rotated right to arrive at the correct values, and 'drot'
+// similarly for row 'd'.  (brot, drot) start out as (0, 0) but we make it such
+// that they end up as (7, 8) after every round.
+.macro	_blake2s_round	s0, s1, s2, s3, s4, s5, s6, s7, \
+			s8, s9, s10, s11, s12, s13, s14, s15
+
+	// Mix first two columns:
+	// (v[0], v[4], v[8], v[12]) and (v[1], v[5], v[9], v[13]).
+	__ldrd		r10, r11, sp, 16	// load v[12] and v[13]
+	_blake2s_quarterround	r0, r4, r8, r10,  r1, r5, r9, r11, \
+				\s0, \s1, \s2, \s3
+	__strd		r8, r9, sp, 0
+	__strd		r10, r11, sp, 16
+
+	// Mix second two columns:
+	// (v[2], v[6], v[10], v[14]) and (v[3], v[7], v[11], v[15]).
+	__ldrd		r8, r9, sp, 8		// load v[10] and v[11]
+	__ldrd		r10, r11, sp, 24	// load v[14] and v[15]
+	_blake2s_quarterround	r2, r6, r8, r10,  r3, r7, r9, r11, \
+				\s4, \s5, \s6, \s7
+	str		r10, [sp, #24]		// store v[14]
+	// v[10], v[11], and v[15] are used below, so no need to store them yet.
+
+	.set brot, 7
+	.set drot, 8
+
+	// Mix first two diagonals:
+	// (v[0], v[5], v[10], v[15]) and (v[1], v[6], v[11], v[12]).
+	ldr		r10, [sp, #16]		// load v[12]
+	_blake2s_quarterround	r0, r5, r8, r11,  r1, r6, r9, r10, \
+				\s8, \s9, \s10, \s11
+	__strd		r8, r9, sp, 8
+	str		r11, [sp, #28]
+	str		r10, [sp, #16]
+
+	// Mix second two diagonals:
+	// (v[2], v[7], v[8], v[13]) and (v[3], v[4], v[9], v[14]).
+	__ldrd		r8, r9, sp, 0		// load v[8] and v[9]
+	__ldrd		r10, r11, sp, 20	// load v[13] and v[14]
+	_blake2s_quarterround	r2, r7, r8, r10,  r3, r4, r9, r11, \
+				\s12, \s13, \s14, \s15
+	__strd		r10, r11, sp, 20
+.endm
+
+//
+// void blake2s_compress_arch(struct blake2s_state *state,
+//			      const u8 *block, size_t nblocks, u32 inc);
+//
+// Only the first three fields of struct blake2s_state are used:
+//	u32 h[8];	(inout)
+//	u32 t[2];	(inout)
+//	u32 f[2];	(in)
+//
+	.align		5
+ENTRY(blake2s_compress_arch)
+	push		{r0-r2,r4-r11,lr}	// keep this an even number
+
+.Lnext_block:
+	// r0 is 'state'
+	// r1 is 'block'
+	// r3 is 'inc'
+
+	// Load and increment the counter t[0..1].
+	__ldrd		r10, r11, r0, 32
+	adds		r10, r10, r3
+	adc		r11, r11, #0
+	__strd		r10, r11, r0, 32
+
+	// _blake2s_round is very short on registers, so copy the message block
+	// to the stack to save a register during the rounds.  This also has the
+	// advantage that misalignment only needs to be dealt with in one place.
+	sub		sp, sp, #64
+	mov		r12, sp
+	tst		r1, #3
+	bne		.Lcopy_block_misaligned
+	ldmia		r1!, {r2-r9}
+	stmia		r12!, {r2-r9}
+	ldmia		r1!, {r2-r9}
+	stmia		r12, {r2-r9}
+.Lcopy_block_done:
+	str		r1, [sp, #68]		// Update message pointer
+
+	// Calculate v[8..15].  Push v[9..15] onto the stack, and leave space
+	// for spilling v[8..9].  Leave v[8..9] in r8-r9.
+	mov		r14, r0			// r14 = state
+	adr		r12, .Lblake2s_IV
+	ldmia		r12!, {r8-r9}		// load IV[0..1]
+	__ldrd		r0, r1, r14, 40		// load f[0..1]
+	ldm		r12, {r2-r7}		// load IV[2..7]
+	eor		r4, r4, r10		// v[12] = IV[4] ^ t[0]
+	eor		r5, r5, r11		// v[13] = IV[5] ^ t[1]
+	eor		r6, r6, r0		// v[14] = IV[6] ^ f[0]
+	eor		r7, r7, r1		// v[15] = IV[7] ^ f[1]
+	push		{r2-r7}			// push v[9..15]
+	sub		sp, sp, #8		// leave space for v[8..9]
+
+	// Load h[0..7] == v[0..7].
+	ldm		r14, {r0-r7}
+
+	// Execute the rounds.  Each round is provided the order in which it
+	// needs to use the message words.
+	.set brot, 0
+	.set drot, 0
+	_blake2s_round	0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
+	_blake2s_round	14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3
+	_blake2s_round	11, 8, 12, 0, 5, 2, 15, 13, 10, 14, 3, 6, 7, 1, 9, 4
+	_blake2s_round	7, 9, 3, 1, 13, 12, 11, 14, 2, 6, 5, 10, 4, 0, 15, 8
+	_blake2s_round	9, 0, 5, 7, 2, 4, 10, 15, 14, 1, 11, 12, 6, 8, 3, 13
+	_blake2s_round	2, 12, 6, 10, 0, 11, 8, 3, 4, 13, 7, 5, 15, 14, 1, 9
+	_blake2s_round	12, 5, 1, 15, 14, 13, 4, 10, 0, 7, 6, 3, 9, 2, 8, 11
+	_blake2s_round	13, 11, 7, 14, 12, 1, 3, 9, 5, 0, 15, 4, 8, 6, 2, 10
+	_blake2s_round	6, 15, 14, 9, 11, 3, 0, 8, 12, 2, 13, 7, 1, 4, 10, 5
+	_blake2s_round	10, 2, 8, 4, 7, 6, 1, 5, 15, 11, 9, 14, 3, 12, 13, 0
+
+	// Fold the final state matrix into the hash chaining value:
+	//
+	//	for (i = 0; i < 8; i++)
+	//		h[i] ^= v[i] ^ v[i + 8];
+	//
+	ldr		r14, [sp, #96]		// r14 = &h[0]
+	add		sp, sp, #8		// v[8..9] are already loaded.
+	pop		{r10-r11}		// load v[10..11]
+	eor		r0, r0, r8
+	eor		r1, r1, r9
+	eor		r2, r2, r10
+	eor		r3, r3, r11
+	ldm		r14, {r8-r11}		// load h[0..3]
+	eor		r0, r0, r8
+	eor		r1, r1, r9
+	eor		r2, r2, r10
+	eor		r3, r3, r11
+	stmia		r14!, {r0-r3}		// store new h[0..3]
+	ldm		r14, {r0-r3}		// load old h[4..7]
+	pop		{r8-r11}		// load v[12..15]
+	eor		r0, r0, r4, ror #brot
+	eor		r1, r1, r5, ror #brot
+	eor		r2, r2, r6, ror #brot
+	eor		r3, r3, r7, ror #brot
+	eor		r0, r0, r8, ror #drot
+	eor		r1, r1, r9, ror #drot
+	eor		r2, r2, r10, ror #drot
+	eor		r3, r3, r11, ror #drot
+	  add		sp, sp, #64		// skip copy of message block
+	stm		r14, {r0-r3}		// store new h[4..7]
+
+	// Advance to the next block, if there is one.  Note that if there are
+	// multiple blocks, then 'inc' (the counter increment amount) must be
+	// 64.  So we can simply set it to 64 without re-loading it.
+	ldm		sp, {r0, r1, r2}	// load (state, block, nblocks)
+	mov		r3, #64			// set 'inc'
+	subs		r2, r2, #1		// nblocks--
+	str		r2, [sp, #8]
+	bne		.Lnext_block		// nblocks != 0?
+
+	pop		{r0-r2,r4-r11,pc}
+
+	// The next message block (pointed to by r1) isn't 4-byte aligned, so it
+	// can't be loaded using ldmia.  Copy it to the stack buffer (pointed to
+	// by r12) using an alternative method.  r2-r9 are free to use.
+.Lcopy_block_misaligned:
+	mov		r2, #64
+1:
+#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+	ldr		r3, [r1], #4
+#else
+	ldrb		r3, [r1, #0]
+	ldrb		r4, [r1, #1]
+	ldrb		r5, [r1, #2]
+	ldrb		r6, [r1, #3]
+	add		r1, r1, #4
+	orr		r3, r3, r4, lsl #8
+	orr		r3, r3, r5, lsl #16
+	orr		r3, r3, r6, lsl #24
+#endif
+	subs		r2, r2, #4
+	str		r3, [r12], #4
+	bne		1b
+	b		.Lcopy_block_done
+ENDPROC(blake2s_compress_arch)
diff --git a/arch/arm/crypto/blake2s-glue.c b/arch/arm/crypto/blake2s-glue.c
new file mode 100644
index 0000000000000..f2cc1e5fc9ec1
--- /dev/null
+++ b/arch/arm/crypto/blake2s-glue.c
@@ -0,0 +1,78 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * BLAKE2s digest algorithm, ARM scalar implementation
+ *
+ * Copyright 2020 Google LLC
+ */
+
+#include <crypto/internal/blake2s.h>
+#include <crypto/internal/hash.h>
+
+#include <linux/module.h>
+
+/* defined in blake2s-core.S */
+EXPORT_SYMBOL(blake2s_compress_arch);
+
+static int crypto_blake2s_update_arm(struct shash_desc *desc,
+				     const u8 *in, unsigned int inlen)
+{
+	return crypto_blake2s_update(desc, in, inlen, blake2s_compress_arch);
+}
+
+static int crypto_blake2s_final_arm(struct shash_desc *desc, u8 *out)
+{
+	return crypto_blake2s_final(desc, out, blake2s_compress_arch);
+}
+
+#define BLAKE2S_ALG(name, driver_name, digest_size)			\
+	{								\
+		.base.cra_name		= name,				\
+		.base.cra_driver_name	= driver_name,			\
+		.base.cra_priority	= 200,				\
+		.base.cra_flags		= CRYPTO_ALG_OPTIONAL_KEY,	\
+		.base.cra_blocksize	= BLAKE2S_BLOCK_SIZE,		\
+		.base.cra_ctxsize	= sizeof(struct blake2s_tfm_ctx), \
+		.base.cra_module	= THIS_MODULE,			\
+		.digestsize		= digest_size,			\
+		.setkey			= crypto_blake2s_setkey,	\
+		.init			= crypto_blake2s_init,		\
+		.update			= crypto_blake2s_update_arm,	\
+		.final			= crypto_blake2s_final_arm,	\
+		.descsize		= sizeof(struct blake2s_state),	\
+	}
+
+static struct shash_alg blake2s_arm_algs[] = {
+	BLAKE2S_ALG("blake2s-128", "blake2s-128-arm", BLAKE2S_128_HASH_SIZE),
+	BLAKE2S_ALG("blake2s-160", "blake2s-160-arm", BLAKE2S_160_HASH_SIZE),
+	BLAKE2S_ALG("blake2s-224", "blake2s-224-arm", BLAKE2S_224_HASH_SIZE),
+	BLAKE2S_ALG("blake2s-256", "blake2s-256-arm", BLAKE2S_256_HASH_SIZE),
+};
+
+static int __init blake2s_arm_mod_init(void)
+{
+	return IS_REACHABLE(CONFIG_CRYPTO_HASH) ?
+		crypto_register_shashes(blake2s_arm_algs,
+					ARRAY_SIZE(blake2s_arm_algs)) : 0;
+}
+
+static void __exit blake2s_arm_mod_exit(void)
+{
+	if (IS_REACHABLE(CONFIG_CRYPTO_HASH))
+		crypto_unregister_shashes(blake2s_arm_algs,
+					  ARRAY_SIZE(blake2s_arm_algs));
+}
+
+module_init(blake2s_arm_mod_init);
+module_exit(blake2s_arm_mod_exit);
+
+MODULE_DESCRIPTION("BLAKE2s digest algorithm, ARM scalar implementation");
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Eric Biggers <ebiggers@google.com>");
+MODULE_ALIAS_CRYPTO("blake2s-128");
+MODULE_ALIAS_CRYPTO("blake2s-128-arm");
+MODULE_ALIAS_CRYPTO("blake2s-160");
+MODULE_ALIAS_CRYPTO("blake2s-160-arm");
+MODULE_ALIAS_CRYPTO("blake2s-224");
+MODULE_ALIAS_CRYPTO("blake2s-224-arm");
+MODULE_ALIAS_CRYPTO("blake2s-256");
+MODULE_ALIAS_CRYPTO("blake2s-256-arm");
-- 
2.29.2



* [PATCH v3 11/14] wireguard: Kconfig: select CRYPTO_BLAKE2S_ARM
  2020-12-23  8:09 [PATCH v3 00/14] crypto: arm32-optimized BLAKE2b and BLAKE2s Eric Biggers
                   ` (9 preceding siblings ...)
  2020-12-23  8:09 ` [PATCH v3 10/14] crypto: arm/blake2s - add ARM scalar optimized BLAKE2s Eric Biggers
@ 2020-12-23  8:10 ` Eric Biggers
  2020-12-23  8:10 ` [PATCH v3 12/14] crypto: blake2b - sync with blake2s implementation Eric Biggers
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Eric Biggers @ 2020-12-23  8:10 UTC (permalink / raw)
  To: linux-crypto
  Cc: linux-arm-kernel, Ard Biesheuvel, Herbert Xu, David Sterba,
	Jason A . Donenfeld, Paul Crowley

From: Eric Biggers <ebiggers@google.com>

When available, select the new implementation of BLAKE2s for 32-bit ARM.
This is faster than the generic C implementation.

Reviewed-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 drivers/net/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 260f9f46668b8..672fcdd9aecbb 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -90,6 +90,7 @@ config WIREGUARD
 	select CRYPTO_CHACHA20_NEON if (ARM || ARM64) && KERNEL_MODE_NEON
 	select CRYPTO_POLY1305_NEON if ARM64 && KERNEL_MODE_NEON
 	select CRYPTO_POLY1305_ARM if ARM
+	select CRYPTO_BLAKE2S_ARM if ARM
 	select CRYPTO_CURVE25519_NEON if ARM && KERNEL_MODE_NEON
 	select CRYPTO_CHACHA_MIPS if CPU_MIPS32_R2
 	select CRYPTO_POLY1305_MIPS if CPU_MIPS32 || (CPU_MIPS64 && 64BIT)
-- 
2.29.2



* [PATCH v3 12/14] crypto: blake2b - sync with blake2s implementation
  2020-12-23  8:09 [PATCH v3 00/14] crypto: arm32-optimized BLAKE2b and BLAKE2s Eric Biggers
                   ` (10 preceding siblings ...)
  2020-12-23  8:10 ` [PATCH v3 11/14] wireguard: Kconfig: select CRYPTO_BLAKE2S_ARM Eric Biggers
@ 2020-12-23  8:10 ` Eric Biggers
  2020-12-23  9:09   ` Ard Biesheuvel
  2020-12-23  8:10 ` [PATCH v3 13/14] crypto: blake2b - update file comment Eric Biggers
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 25+ messages in thread
From: Eric Biggers @ 2020-12-23  8:10 UTC (permalink / raw)
  To: linux-crypto
  Cc: linux-arm-kernel, Ard Biesheuvel, Herbert Xu, David Sterba,
	Jason A . Donenfeld, Paul Crowley

From: Eric Biggers <ebiggers@google.com>

Sync the BLAKE2b code with the BLAKE2s code as much as possible:

- Move a lot of code into new headers <crypto/blake2b.h> and
  <crypto/internal/blake2b.h>, and adjust it to be like the
  corresponding BLAKE2s code, i.e. like <crypto/blake2s.h> and
  <crypto/internal/blake2s.h>.

- Rename constants, e.g. BLAKE2B_*_DIGEST_SIZE => BLAKE2B_*_HASH_SIZE.

- Use a macro BLAKE2B_ALG() to define the shash_alg structs.

- Export blake2b_compress_generic() for use as a fallback.

This makes it much easier to add optimized implementations of BLAKE2b,
as optimized implementations can use the helper functions
crypto_blake2b_{setkey,init,update,final}() and
blake2b_compress_generic().  The ARM implementation will use these.
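
As a rough sketch (not part of this patch; the "myarch" names below are
placeholders for illustration, not real kernel symbols), an optimized
driver only has to supply its own compression function plus two thin
wrappers, while crypto_blake2b_setkey() and crypto_blake2b_init() can be
used directly as the .setkey and .init methods of its shash_alg structs:

	/* Hypothetical glue code for an optimized BLAKE2b driver. */
	#include <crypto/internal/blake2b.h>

	asmlinkage void blake2b_compress_myarch(struct blake2b_state *state,
						const u8 *block, size_t nblocks,
						u32 inc);

	static int crypto_blake2b_update_myarch(struct shash_desc *desc,
						const u8 *in, unsigned int inlen)
	{
		return crypto_blake2b_update(desc, in, inlen,
					     blake2b_compress_myarch);
	}

	static int crypto_blake2b_final_myarch(struct shash_desc *desc, u8 *out)
	{
		return crypto_blake2b_final(desc, out, blake2b_compress_myarch);
	}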

But this change is also helpful because it eliminates unnecessary
differences between the BLAKE2b and BLAKE2s code, so that the same
improvements can easily be made to both.  (The two algorithms are
basically identical, except for the word size and constants.)  It also
makes it straightforward to add a library API for BLAKE2b in the future
if/when it's needed.

This change does make the BLAKE2b code slightly more complicated than it
needs to be, as it doesn't actually provide a library API yet.  For
example, __blake2b_update() doesn't really need to exist yet; it could
just be inlined into crypto_blake2b_update().  But I believe this is
outweighed by the benefits of keeping the code in sync.

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 crypto/blake2b_generic.c          | 226 +++++++-----------------------
 include/crypto/blake2b.h          |  67 +++++++++
 include/crypto/internal/blake2b.h | 115 +++++++++++++++
 3 files changed, 230 insertions(+), 178 deletions(-)
 create mode 100644 include/crypto/blake2b.h
 create mode 100644 include/crypto/internal/blake2b.h

diff --git a/crypto/blake2b_generic.c b/crypto/blake2b_generic.c
index a2ffe60e06d34..963f7fe0e4ea8 100644
--- a/crypto/blake2b_generic.c
+++ b/crypto/blake2b_generic.c
@@ -20,36 +20,11 @@
 
 #include <asm/unaligned.h>
 #include <linux/module.h>
-#include <linux/string.h>
 #include <linux/kernel.h>
 #include <linux/bitops.h>
+#include <crypto/internal/blake2b.h>
 #include <crypto/internal/hash.h>
 
-#define BLAKE2B_160_DIGEST_SIZE		(160 / 8)
-#define BLAKE2B_256_DIGEST_SIZE		(256 / 8)
-#define BLAKE2B_384_DIGEST_SIZE		(384 / 8)
-#define BLAKE2B_512_DIGEST_SIZE		(512 / 8)
-
-enum blake2b_constant {
-	BLAKE2B_BLOCKBYTES    = 128,
-	BLAKE2B_KEYBYTES      = 64,
-};
-
-struct blake2b_state {
-	u64      h[8];
-	u64      t[2];
-	u64      f[2];
-	u8       buf[BLAKE2B_BLOCKBYTES];
-	size_t   buflen;
-};
-
-static const u64 blake2b_IV[8] = {
-	0x6a09e667f3bcc908ULL, 0xbb67ae8584caa73bULL,
-	0x3c6ef372fe94f82bULL, 0xa54ff53a5f1d36f1ULL,
-	0x510e527fade682d1ULL, 0x9b05688c2b3e6c1fULL,
-	0x1f83d9abfb41bd6bULL, 0x5be0cd19137e2179ULL
-};
-
 static const u8 blake2b_sigma[12][16] = {
 	{  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15 },
 	{ 14, 10,  4,  8,  9, 15, 13,  6,  1, 12,  0,  2, 11,  7,  5,  3 },
@@ -95,8 +70,8 @@ static void blake2b_increment_counter(struct blake2b_state *S, const u64 inc)
 		G(r,7,v[ 3],v[ 4],v[ 9],v[14]); \
 	} while (0)
 
-static void blake2b_compress(struct blake2b_state *S,
-			     const u8 block[BLAKE2B_BLOCKBYTES])
+static void blake2b_compress_one_generic(struct blake2b_state *S,
+					 const u8 block[BLAKE2B_BLOCK_SIZE])
 {
 	u64 m[16];
 	u64 v[16];
@@ -108,14 +83,14 @@ static void blake2b_compress(struct blake2b_state *S,
 	for (i = 0; i < 8; ++i)
 		v[i] = S->h[i];
 
-	v[ 8] = blake2b_IV[0];
-	v[ 9] = blake2b_IV[1];
-	v[10] = blake2b_IV[2];
-	v[11] = blake2b_IV[3];
-	v[12] = blake2b_IV[4] ^ S->t[0];
-	v[13] = blake2b_IV[5] ^ S->t[1];
-	v[14] = blake2b_IV[6] ^ S->f[0];
-	v[15] = blake2b_IV[7] ^ S->f[1];
+	v[ 8] = BLAKE2B_IV0;
+	v[ 9] = BLAKE2B_IV1;
+	v[10] = BLAKE2B_IV2;
+	v[11] = BLAKE2B_IV3;
+	v[12] = BLAKE2B_IV4 ^ S->t[0];
+	v[13] = BLAKE2B_IV5 ^ S->t[1];
+	v[14] = BLAKE2B_IV6 ^ S->f[0];
+	v[15] = BLAKE2B_IV7 ^ S->f[1];
 
 	ROUND(0);
 	ROUND(1);
@@ -139,159 +114,54 @@ static void blake2b_compress(struct blake2b_state *S,
 #undef G
 #undef ROUND
 
-struct blake2b_tfm_ctx {
-	u8 key[BLAKE2B_KEYBYTES];
-	unsigned int keylen;
-};
-
-static int blake2b_setkey(struct crypto_shash *tfm, const u8 *key,
-			  unsigned int keylen)
+void blake2b_compress_generic(struct blake2b_state *state,
+			      const u8 *block, size_t nblocks, u32 inc)
 {
-	struct blake2b_tfm_ctx *tctx = crypto_shash_ctx(tfm);
-
-	if (keylen == 0 || keylen > BLAKE2B_KEYBYTES)
-		return -EINVAL;
-
-	memcpy(tctx->key, key, keylen);
-	tctx->keylen = keylen;
-
-	return 0;
+	do {
+		blake2b_increment_counter(state, inc);
+		blake2b_compress_one_generic(state, block);
+		block += BLAKE2B_BLOCK_SIZE;
+	} while (--nblocks);
 }
+EXPORT_SYMBOL(blake2b_compress_generic);
 
-static int blake2b_init(struct shash_desc *desc)
+static int crypto_blake2b_update_generic(struct shash_desc *desc,
+					 const u8 *in, unsigned int inlen)
 {
-	struct blake2b_tfm_ctx *tctx = crypto_shash_ctx(desc->tfm);
-	struct blake2b_state *state = shash_desc_ctx(desc);
-	const int digestsize = crypto_shash_digestsize(desc->tfm);
-
-	memset(state, 0, sizeof(*state));
-	memcpy(state->h, blake2b_IV, sizeof(state->h));
-
-	/* Parameter block is all zeros except index 0, no xor for 1..7 */
-	state->h[0] ^= 0x01010000 | tctx->keylen << 8 | digestsize;
-
-	if (tctx->keylen) {
-		/*
-		 * Prefill the buffer with the key, next call to _update or
-		 * _final will process it
-		 */
-		memcpy(state->buf, tctx->key, tctx->keylen);
-		state->buflen = BLAKE2B_BLOCKBYTES;
-	}
-	return 0;
+	return crypto_blake2b_update(desc, in, inlen, blake2b_compress_generic);
 }
 
-static int blake2b_update(struct shash_desc *desc, const u8 *in,
-			  unsigned int inlen)
+static int crypto_blake2b_final_generic(struct shash_desc *desc, u8 *out)
 {
-	struct blake2b_state *state = shash_desc_ctx(desc);
-	const size_t left = state->buflen;
-	const size_t fill = BLAKE2B_BLOCKBYTES - left;
-
-	if (!inlen)
-		return 0;
-
-	if (inlen > fill) {
-		state->buflen = 0;
-		/* Fill buffer */
-		memcpy(state->buf + left, in, fill);
-		blake2b_increment_counter(state, BLAKE2B_BLOCKBYTES);
-		/* Compress */
-		blake2b_compress(state, state->buf);
-		in += fill;
-		inlen -= fill;
-		while (inlen > BLAKE2B_BLOCKBYTES) {
-			blake2b_increment_counter(state, BLAKE2B_BLOCKBYTES);
-			blake2b_compress(state, in);
-			in += BLAKE2B_BLOCKBYTES;
-			inlen -= BLAKE2B_BLOCKBYTES;
-		}
-	}
-	memcpy(state->buf + state->buflen, in, inlen);
-	state->buflen += inlen;
-
-	return 0;
+	return crypto_blake2b_final(desc, out, blake2b_compress_generic);
 }
 
-static int blake2b_final(struct shash_desc *desc, u8 *out)
-{
-	struct blake2b_state *state = shash_desc_ctx(desc);
-	const int digestsize = crypto_shash_digestsize(desc->tfm);
-	size_t i;
-
-	blake2b_increment_counter(state, state->buflen);
-	/* Set last block */
-	state->f[0] = (u64)-1;
-	/* Padding */
-	memset(state->buf + state->buflen, 0, BLAKE2B_BLOCKBYTES - state->buflen);
-	blake2b_compress(state, state->buf);
-
-	/* Avoid temporary buffer and switch the internal output to LE order */
-	for (i = 0; i < ARRAY_SIZE(state->h); i++)
-		__cpu_to_le64s(&state->h[i]);
-
-	memcpy(out, state->h, digestsize);
-	return 0;
-}
+#define BLAKE2B_ALG(name, driver_name, digest_size)			\
+	{								\
+		.base.cra_name		= name,				\
+		.base.cra_driver_name	= driver_name,			\
+		.base.cra_priority	= 100,				\
+		.base.cra_flags		= CRYPTO_ALG_OPTIONAL_KEY,	\
+		.base.cra_blocksize	= BLAKE2B_BLOCK_SIZE,		\
+		.base.cra_ctxsize	= sizeof(struct blake2b_tfm_ctx), \
+		.base.cra_module	= THIS_MODULE,			\
+		.digestsize		= digest_size,			\
+		.setkey			= crypto_blake2b_setkey,	\
+		.init			= crypto_blake2b_init,		\
+		.update			= crypto_blake2b_update_generic, \
+		.final			= crypto_blake2b_final_generic,	\
+		.descsize		= sizeof(struct blake2b_state),	\
+	}
 
 static struct shash_alg blake2b_algs[] = {
-	{
-		.base.cra_name		= "blake2b-160",
-		.base.cra_driver_name	= "blake2b-160-generic",
-		.base.cra_priority	= 100,
-		.base.cra_flags		= CRYPTO_ALG_OPTIONAL_KEY,
-		.base.cra_blocksize	= BLAKE2B_BLOCKBYTES,
-		.base.cra_ctxsize	= sizeof(struct blake2b_tfm_ctx),
-		.base.cra_module	= THIS_MODULE,
-		.digestsize		= BLAKE2B_160_DIGEST_SIZE,
-		.setkey			= blake2b_setkey,
-		.init			= blake2b_init,
-		.update			= blake2b_update,
-		.final			= blake2b_final,
-		.descsize		= sizeof(struct blake2b_state),
-	}, {
-		.base.cra_name		= "blake2b-256",
-		.base.cra_driver_name	= "blake2b-256-generic",
-		.base.cra_priority	= 100,
-		.base.cra_flags		= CRYPTO_ALG_OPTIONAL_KEY,
-		.base.cra_blocksize	= BLAKE2B_BLOCKBYTES,
-		.base.cra_ctxsize	= sizeof(struct blake2b_tfm_ctx),
-		.base.cra_module	= THIS_MODULE,
-		.digestsize		= BLAKE2B_256_DIGEST_SIZE,
-		.setkey			= blake2b_setkey,
-		.init			= blake2b_init,
-		.update			= blake2b_update,
-		.final			= blake2b_final,
-		.descsize		= sizeof(struct blake2b_state),
-	}, {
-		.base.cra_name		= "blake2b-384",
-		.base.cra_driver_name	= "blake2b-384-generic",
-		.base.cra_priority	= 100,
-		.base.cra_flags		= CRYPTO_ALG_OPTIONAL_KEY,
-		.base.cra_blocksize	= BLAKE2B_BLOCKBYTES,
-		.base.cra_ctxsize	= sizeof(struct blake2b_tfm_ctx),
-		.base.cra_module	= THIS_MODULE,
-		.digestsize		= BLAKE2B_384_DIGEST_SIZE,
-		.setkey			= blake2b_setkey,
-		.init			= blake2b_init,
-		.update			= blake2b_update,
-		.final			= blake2b_final,
-		.descsize		= sizeof(struct blake2b_state),
-	}, {
-		.base.cra_name		= "blake2b-512",
-		.base.cra_driver_name	= "blake2b-512-generic",
-		.base.cra_priority	= 100,
-		.base.cra_flags		= CRYPTO_ALG_OPTIONAL_KEY,
-		.base.cra_blocksize	= BLAKE2B_BLOCKBYTES,
-		.base.cra_ctxsize	= sizeof(struct blake2b_tfm_ctx),
-		.base.cra_module	= THIS_MODULE,
-		.digestsize		= BLAKE2B_512_DIGEST_SIZE,
-		.setkey			= blake2b_setkey,
-		.init			= blake2b_init,
-		.update			= blake2b_update,
-		.final			= blake2b_final,
-		.descsize		= sizeof(struct blake2b_state),
-	}
+	BLAKE2B_ALG("blake2b-160", "blake2b-160-generic",
+		    BLAKE2B_160_HASH_SIZE),
+	BLAKE2B_ALG("blake2b-256", "blake2b-256-generic",
+		    BLAKE2B_256_HASH_SIZE),
+	BLAKE2B_ALG("blake2b-384", "blake2b-384-generic",
+		    BLAKE2B_384_HASH_SIZE),
+	BLAKE2B_ALG("blake2b-512", "blake2b-512-generic",
+		    BLAKE2B_512_HASH_SIZE),
 };
 
 static int __init blake2b_mod_init(void)
diff --git a/include/crypto/blake2b.h b/include/crypto/blake2b.h
new file mode 100644
index 0000000000000..18875f16f8cad
--- /dev/null
+++ b/include/crypto/blake2b.h
@@ -0,0 +1,67 @@
+/* SPDX-License-Identifier: GPL-2.0 OR MIT */
+
+#ifndef _CRYPTO_BLAKE2B_H
+#define _CRYPTO_BLAKE2B_H
+
+#include <linux/bug.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+
+enum blake2b_lengths {
+	BLAKE2B_BLOCK_SIZE = 128,
+	BLAKE2B_HASH_SIZE = 64,
+	BLAKE2B_KEY_SIZE = 64,
+
+	BLAKE2B_160_HASH_SIZE = 20,
+	BLAKE2B_256_HASH_SIZE = 32,
+	BLAKE2B_384_HASH_SIZE = 48,
+	BLAKE2B_512_HASH_SIZE = 64,
+};
+
+struct blake2b_state {
+	/* 'h', 't', and 'f' are used in assembly code, so keep them as-is. */
+	u64 h[8];
+	u64 t[2];
+	u64 f[2];
+	u8 buf[BLAKE2B_BLOCK_SIZE];
+	unsigned int buflen;
+	unsigned int outlen;
+};
+
+enum blake2b_iv {
+	BLAKE2B_IV0 = 0x6A09E667F3BCC908ULL,
+	BLAKE2B_IV1 = 0xBB67AE8584CAA73BULL,
+	BLAKE2B_IV2 = 0x3C6EF372FE94F82BULL,
+	BLAKE2B_IV3 = 0xA54FF53A5F1D36F1ULL,
+	BLAKE2B_IV4 = 0x510E527FADE682D1ULL,
+	BLAKE2B_IV5 = 0x9B05688C2B3E6C1FULL,
+	BLAKE2B_IV6 = 0x1F83D9ABFB41BD6BULL,
+	BLAKE2B_IV7 = 0x5BE0CD19137E2179ULL,
+};
+
+static inline void __blake2b_init(struct blake2b_state *state, size_t outlen,
+				  const void *key, size_t keylen)
+{
+	state->h[0] = BLAKE2B_IV0 ^ (0x01010000 | keylen << 8 | outlen);
+	state->h[1] = BLAKE2B_IV1;
+	state->h[2] = BLAKE2B_IV2;
+	state->h[3] = BLAKE2B_IV3;
+	state->h[4] = BLAKE2B_IV4;
+	state->h[5] = BLAKE2B_IV5;
+	state->h[6] = BLAKE2B_IV6;
+	state->h[7] = BLAKE2B_IV7;
+	state->t[0] = 0;
+	state->t[1] = 0;
+	state->f[0] = 0;
+	state->f[1] = 0;
+	state->buflen = 0;
+	state->outlen = outlen;
+	if (keylen) {
+		memcpy(state->buf, key, keylen);
+		memset(&state->buf[keylen], 0, BLAKE2B_BLOCK_SIZE - keylen);
+		state->buflen = BLAKE2B_BLOCK_SIZE;
+	}
+}
+
+#endif /* _CRYPTO_BLAKE2B_H */
diff --git a/include/crypto/internal/blake2b.h b/include/crypto/internal/blake2b.h
new file mode 100644
index 0000000000000..982fe5e8471cd
--- /dev/null
+++ b/include/crypto/internal/blake2b.h
@@ -0,0 +1,115 @@
+/* SPDX-License-Identifier: GPL-2.0 OR MIT */
+/*
+ * Helper functions for BLAKE2b implementations.
+ * Keep this in sync with the corresponding BLAKE2s header.
+ */
+
+#ifndef _CRYPTO_INTERNAL_BLAKE2B_H
+#define _CRYPTO_INTERNAL_BLAKE2B_H
+
+#include <crypto/blake2b.h>
+#include <crypto/internal/hash.h>
+#include <linux/string.h>
+
+void blake2b_compress_generic(struct blake2b_state *state,
+			      const u8 *block, size_t nblocks, u32 inc);
+
+static inline void blake2b_set_lastblock(struct blake2b_state *state)
+{
+	state->f[0] = -1;
+}
+
+typedef void (*blake2b_compress_t)(struct blake2b_state *state,
+				   const u8 *block, size_t nblocks, u32 inc);
+
+static inline void __blake2b_update(struct blake2b_state *state,
+				    const u8 *in, size_t inlen,
+				    blake2b_compress_t compress)
+{
+	const size_t fill = BLAKE2B_BLOCK_SIZE - state->buflen;
+
+	if (unlikely(!inlen))
+		return;
+	if (inlen > fill) {
+		memcpy(state->buf + state->buflen, in, fill);
+		(*compress)(state, state->buf, 1, BLAKE2B_BLOCK_SIZE);
+		state->buflen = 0;
+		in += fill;
+		inlen -= fill;
+	}
+	if (inlen > BLAKE2B_BLOCK_SIZE) {
+		const size_t nblocks = DIV_ROUND_UP(inlen, BLAKE2B_BLOCK_SIZE);
+		/* Hash one less (full) block than strictly possible */
+		(*compress)(state, in, nblocks - 1, BLAKE2B_BLOCK_SIZE);
+		in += BLAKE2B_BLOCK_SIZE * (nblocks - 1);
+		inlen -= BLAKE2B_BLOCK_SIZE * (nblocks - 1);
+	}
+	memcpy(state->buf + state->buflen, in, inlen);
+	state->buflen += inlen;
+}
+
+static inline void __blake2b_final(struct blake2b_state *state, u8 *out,
+				   blake2b_compress_t compress)
+{
+	int i;
+
+	blake2b_set_lastblock(state);
+	memset(state->buf + state->buflen, 0,
+	       BLAKE2B_BLOCK_SIZE - state->buflen); /* Padding */
+	(*compress)(state, state->buf, 1, state->buflen);
+	for (i = 0; i < ARRAY_SIZE(state->h); i++)
+		__cpu_to_le64s(&state->h[i]);
+	memcpy(out, state->h, state->outlen);
+}
+
+/* Helper functions for shash implementations of BLAKE2b */
+
+struct blake2b_tfm_ctx {
+	u8 key[BLAKE2B_KEY_SIZE];
+	unsigned int keylen;
+};
+
+static inline int crypto_blake2b_setkey(struct crypto_shash *tfm,
+					const u8 *key, unsigned int keylen)
+{
+	struct blake2b_tfm_ctx *tctx = crypto_shash_ctx(tfm);
+
+	if (keylen == 0 || keylen > BLAKE2B_KEY_SIZE)
+		return -EINVAL;
+
+	memcpy(tctx->key, key, keylen);
+	tctx->keylen = keylen;
+
+	return 0;
+}
+
+static inline int crypto_blake2b_init(struct shash_desc *desc)
+{
+	const struct blake2b_tfm_ctx *tctx = crypto_shash_ctx(desc->tfm);
+	struct blake2b_state *state = shash_desc_ctx(desc);
+	unsigned int outlen = crypto_shash_digestsize(desc->tfm);
+
+	__blake2b_init(state, outlen, tctx->key, tctx->keylen);
+	return 0;
+}
+
+static inline int crypto_blake2b_update(struct shash_desc *desc,
+					const u8 *in, unsigned int inlen,
+					blake2b_compress_t compress)
+{
+	struct blake2b_state *state = shash_desc_ctx(desc);
+
+	__blake2b_update(state, in, inlen, compress);
+	return 0;
+}
+
+static inline int crypto_blake2b_final(struct shash_desc *desc, u8 *out,
+				       blake2b_compress_t compress)
+{
+	struct blake2b_state *state = shash_desc_ctx(desc);
+
+	__blake2b_final(state, out, compress);
+	return 0;
+}
+
+#endif /* _CRYPTO_INTERNAL_BLAKE2B_H */
-- 
2.29.2



* [PATCH v3 13/14] crypto: blake2b - update file comment
  2020-12-23  8:09 [PATCH v3 00/14] crypto: arm32-optimized BLAKE2b and BLAKE2s Eric Biggers
                   ` (11 preceding siblings ...)
  2020-12-23  8:10 ` [PATCH v3 12/14] crypto: blake2b - sync with blake2s implementation Eric Biggers
@ 2020-12-23  8:10 ` Eric Biggers
  2020-12-23  8:10 ` [PATCH v3 14/14] crypto: arm/blake2b - add NEON-accelerated BLAKE2b Eric Biggers
  2021-01-02 22:09 ` [PATCH v3 00/14] crypto: arm32-optimized BLAKE2b and BLAKE2s Herbert Xu
  14 siblings, 0 replies; 25+ messages in thread
From: Eric Biggers @ 2020-12-23  8:10 UTC (permalink / raw)
  To: linux-crypto
  Cc: linux-arm-kernel, Ard Biesheuvel, Herbert Xu, David Sterba,
	Jason A . Donenfeld, Paul Crowley

From: Eric Biggers <ebiggers@google.com>

The file comment for blake2b_generic.c makes it sound like it's the
reference implementation of BLAKE2b with only minor changes.  But it's
actually been changed a lot.  Update the comment to make this clearer.

Reviewed-by: David Sterba <dsterba@suse.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 crypto/blake2b_generic.c | 23 ++++++++++-------------
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/crypto/blake2b_generic.c b/crypto/blake2b_generic.c
index 963f7fe0e4ea8..6704c03558896 100644
--- a/crypto/blake2b_generic.c
+++ b/crypto/blake2b_generic.c
@@ -1,21 +1,18 @@
 // SPDX-License-Identifier: (GPL-2.0-only OR Apache-2.0)
 /*
- * BLAKE2b reference source code package - reference C implementations
+ * Generic implementation of the BLAKE2b digest algorithm.  Based on the BLAKE2b
+ * reference implementation, but it has been heavily modified for use in the
+ * kernel.  The reference implementation was:
  *
- * Copyright 2012, Samuel Neves <sneves@dei.uc.pt>.  You may use this under the
- * terms of the CC0, the OpenSSL Licence, or the Apache Public License 2.0, at
- * your option.  The terms of these licenses can be found at:
+ *	Copyright 2012, Samuel Neves <sneves@dei.uc.pt>.  You may use this under
+ *	the terms of the CC0, the OpenSSL Licence, or the Apache Public License
+ *	2.0, at your option.  The terms of these licenses can be found at:
  *
- * - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
- * - OpenSSL license   : https://www.openssl.org/source/license.html
- * - Apache 2.0        : https://www.apache.org/licenses/LICENSE-2.0
+ *	- CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
+ *	- OpenSSL license   : https://www.openssl.org/source/license.html
+ *	- Apache 2.0        : https://www.apache.org/licenses/LICENSE-2.0
  *
- * More information about the BLAKE2 hash function can be found at
- * https://blake2.net.
- *
- * Note: the original sources have been modified for inclusion in linux kernel
- * in terms of coding style, using generic helpers and simplifications of error
- * handling.
+ * More information about BLAKE2 can be found at https://blake2.net.
  */
 
 #include <asm/unaligned.h>
-- 
2.29.2



* [PATCH v3 14/14] crypto: arm/blake2b - add NEON-accelerated BLAKE2b
  2020-12-23  8:09 [PATCH v3 00/14] crypto: arm32-optimized BLAKE2b and BLAKE2s Eric Biggers
                   ` (12 preceding siblings ...)
  2020-12-23  8:10 ` [PATCH v3 13/14] crypto: blake2b - update file comment Eric Biggers
@ 2020-12-23  8:10 ` Eric Biggers
  2020-12-23  9:10   ` Ard Biesheuvel
  2021-01-02 22:09 ` [PATCH v3 00/14] crypto: arm32-optimized BLAKE2b and BLAKE2s Herbert Xu
  14 siblings, 1 reply; 25+ messages in thread
From: Eric Biggers @ 2020-12-23  8:10 UTC (permalink / raw)
  To: linux-crypto
  Cc: linux-arm-kernel, Ard Biesheuvel, Herbert Xu, David Sterba,
	Jason A . Donenfeld, Paul Crowley

From: Eric Biggers <ebiggers@google.com>

Add a NEON-accelerated implementation of BLAKE2b.

On Cortex-A7 (which these days is the most common ARM processor that
doesn't have the ARMv8 Crypto Extensions), this is over twice as fast as
SHA-256, and slightly faster than SHA-1.  It is also almost three times
as fast as the generic implementation of BLAKE2b:

	Algorithm            Cycles per byte (on 4096-byte messages)
	===================  =======================================
	blake2b-256-neon     14.0
	sha1-neon            16.3
	blake2s-256-arm      18.8
	sha1-asm             20.8
	blake2s-256-generic  26.0
	sha256-neon	     28.9
	sha256-asm	     32.0
	blake2b-256-generic  38.9

This implementation isn't directly based on any other implementation,
but it borrows some ideas from previous NEON code I've written as well
as from chacha-neon-core.S.  At least on Cortex-A7, it is faster than
the other NEON implementations of BLAKE2b I'm aware of (the
implementation in the BLAKE2 official repository using intrinsics, and
Andrew Moon's implementation which can be found in SUPERCOP).  It does
only one block at a time, so it performs well on short messages too.

NEON-accelerated BLAKE2b is useful because there is interest in using
BLAKE2b-256 for dm-verity on low-end Android devices (specifically,
devices that lack the ARMv8 Crypto Extensions) to replace SHA-1.  On
these devices, the performance cost of upgrading to SHA-256 may be
unacceptable, whereas BLAKE2b-256 would actually improve performance.

Although BLAKE2b is intended for 64-bit platforms (unlike BLAKE2s which
is intended for 32-bit platforms), on 32-bit ARM processors with NEON,
BLAKE2b is actually faster than BLAKE2s.  This is because NEON supports
64-bit operations, and because BLAKE2s's block size is too small for
NEON to be helpful for it.  The best I've been able to do with BLAKE2s
on Cortex-A7 is 18.8 cpb with an optimized scalar implementation.

(I didn't try BLAKE2sp and BLAKE3, which in theory would be faster, but
they're more complex as they require running multiple hashes at once.
Note that BLAKE2b already uses all the NEON bandwidth on the Cortex-A7,
so I expect that any speedup from BLAKE2sp or BLAKE3 would come only
from the smaller number of rounds, not from the extra parallelism.)

For now this BLAKE2b implementation is only wired up to the shash API,
since there is no library API for BLAKE2b yet.  However, I've tried to
keep things consistent with BLAKE2s, e.g. by defining
blake2b_compress_arch() which is analogous to blake2s_compress_arch()
and could be exported for use by the library API later if needed.
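
To illustrate (purely a sketch, not part of this patch: the Kconfig
symbol and the exported blake2b_update() are hypothetical, since no
BLAKE2b library API exists yet and blake2b_compress_arch() is not yet
exported), such a library wrapper could dispatch the same way the
BLAKE2s library code does:

	/* Hypothetical future lib/crypto/blake2b.c; includes omitted. */
	#include <crypto/internal/blake2b.h>

	#if IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_BLAKE2B)
	#  define blake2b_compress blake2b_compress_arch
	#else
	#  define blake2b_compress blake2b_compress_generic
	#endif

	void blake2b_update(struct blake2b_state *state, const u8 *in, size_t inlen)
	{
		__blake2b_update(state, in, inlen, blake2b_compress);
	}
	EXPORT_SYMBOL(blake2b_update);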

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 arch/arm/crypto/Kconfig             |  10 +
 arch/arm/crypto/Makefile            |   2 +
 arch/arm/crypto/blake2b-neon-core.S | 347 ++++++++++++++++++++++++++++
 arch/arm/crypto/blake2b-neon-glue.c | 105 +++++++++
 4 files changed, 464 insertions(+)
 create mode 100644 arch/arm/crypto/blake2b-neon-core.S
 create mode 100644 arch/arm/crypto/blake2b-neon-glue.c

diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
index 281c829c12d0b..2b575792363e5 100644
--- a/arch/arm/crypto/Kconfig
+++ b/arch/arm/crypto/Kconfig
@@ -71,6 +71,16 @@ config CRYPTO_BLAKE2S_ARM
 	  slower than the NEON implementation of BLAKE2b.  (There is no NEON
 	  implementation of BLAKE2s, since NEON doesn't really help with it.)
 
+config CRYPTO_BLAKE2B_NEON
+	tristate "BLAKE2b digest algorithm (ARM NEON)"
+	depends on KERNEL_MODE_NEON
+	select CRYPTO_BLAKE2B
+	help
+	  BLAKE2b digest algorithm optimized with ARM NEON instructions.
+	  On ARM processors that have NEON support but not the ARMv8
+	  Crypto Extensions, typically this BLAKE2b implementation is
+	  much faster than SHA-2 and slightly faster than SHA-1.
+
 config CRYPTO_AES_ARM
 	tristate "Scalar AES cipher for ARM"
 	select CRYPTO_ALGAPI
diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
index 5ad1e985a718b..8f26c454ea12e 100644
--- a/arch/arm/crypto/Makefile
+++ b/arch/arm/crypto/Makefile
@@ -10,6 +10,7 @@ obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
 obj-$(CONFIG_CRYPTO_SHA256_ARM) += sha256-arm.o
 obj-$(CONFIG_CRYPTO_SHA512_ARM) += sha512-arm.o
 obj-$(CONFIG_CRYPTO_BLAKE2S_ARM) += blake2s-arm.o
+obj-$(CONFIG_CRYPTO_BLAKE2B_NEON) += blake2b-neon.o
 obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha-neon.o
 obj-$(CONFIG_CRYPTO_POLY1305_ARM) += poly1305-arm.o
 obj-$(CONFIG_CRYPTO_NHPOLY1305_NEON) += nhpoly1305-neon.o
@@ -31,6 +32,7 @@ sha256-arm-y	:= sha256-core.o sha256_glue.o $(sha256-arm-neon-y)
 sha512-arm-neon-$(CONFIG_KERNEL_MODE_NEON) := sha512-neon-glue.o
 sha512-arm-y	:= sha512-core.o sha512-glue.o $(sha512-arm-neon-y)
 blake2s-arm-y   := blake2s-core.o blake2s-glue.o
+blake2b-neon-y  := blake2b-neon-core.o blake2b-neon-glue.o
 sha1-arm-ce-y	:= sha1-ce-core.o sha1-ce-glue.o
 sha2-arm-ce-y	:= sha2-ce-core.o sha2-ce-glue.o
 aes-arm-ce-y	:= aes-ce-core.o aes-ce-glue.o
diff --git a/arch/arm/crypto/blake2b-neon-core.S b/arch/arm/crypto/blake2b-neon-core.S
new file mode 100644
index 0000000000000..0406a186377fb
--- /dev/null
+++ b/arch/arm/crypto/blake2b-neon-core.S
@@ -0,0 +1,347 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * BLAKE2b digest algorithm, NEON accelerated
+ *
+ * Copyright 2020 Google LLC
+ *
+ * Author: Eric Biggers <ebiggers@google.com>
+ */
+
+#include <linux/linkage.h>
+
+	.text
+	.fpu		neon
+
+	// The arguments to blake2b_compress_neon()
+	STATE		.req	r0
+	BLOCK		.req	r1
+	NBLOCKS		.req	r2
+	INC		.req	r3
+
+	// Pointers to the rotation tables
+	ROR24_TABLE	.req	r4
+	ROR16_TABLE	.req	r5
+
+	// The original stack pointer
+	ORIG_SP		.req	r6
+
+	// NEON registers which contain the message words of the current block.
+	// M_0-M_3 are occasionally used for other purposes too.
+	M_0		.req	d16
+	M_1		.req	d17
+	M_2		.req	d18
+	M_3		.req	d19
+	M_4		.req	d20
+	M_5		.req	d21
+	M_6		.req	d22
+	M_7		.req	d23
+	M_8		.req	d24
+	M_9		.req	d25
+	M_10		.req	d26
+	M_11		.req	d27
+	M_12		.req	d28
+	M_13		.req	d29
+	M_14		.req	d30
+	M_15		.req	d31
+
+	.align		4
+	// Tables for computing ror64(x, 24) and ror64(x, 16) using the vtbl.8
+	// instruction.  This is the most efficient way to implement these
+	// rotation amounts with NEON.  (On Cortex-A53 it's the same speed as
+	// vshr.u64 + vsli.u64, while on Cortex-A7 it's faster.)
+.Lror24_table:
+	.byte		3, 4, 5, 6, 7, 0, 1, 2
+.Lror16_table:
+	.byte		2, 3, 4, 5, 6, 7, 0, 1
+	// The BLAKE2b initialization vector
+.Lblake2b_IV:
+	.quad		0x6a09e667f3bcc908, 0xbb67ae8584caa73b
+	.quad		0x3c6ef372fe94f82b, 0xa54ff53a5f1d36f1
+	.quad		0x510e527fade682d1, 0x9b05688c2b3e6c1f
+	.quad		0x1f83d9abfb41bd6b, 0x5be0cd19137e2179
+
+// Execute one round of BLAKE2b by updating the state matrix v[0..15] in the
+// NEON registers q0-q7.  The message block is in q8..q15 (M_0-M_15).  The stack
+// pointer points to a 32-byte aligned buffer containing a copy of q8 and q9
+// (M_0-M_3), so that they can be reloaded if they are used as temporary
+// registers.  The macro arguments s0-s15 give the order in which the message
+// words are used in this round.  'final' is 1 if this is the final round.
+.macro	_blake2b_round	s0, s1, s2, s3, s4, s5, s6, s7, \
+			s8, s9, s10, s11, s12, s13, s14, s15, final=0
+
+	// Mix the columns:
+	// (v[0], v[4], v[8], v[12]), (v[1], v[5], v[9], v[13]),
+	// (v[2], v[6], v[10], v[14]), and (v[3], v[7], v[11], v[15]).
+
+	// a += b + m[blake2b_sigma[r][2*i + 0]];
+	vadd.u64	q0, q0, q2
+	vadd.u64	q1, q1, q3
+	vadd.u64	d0, d0, M_\s0
+	vadd.u64	d1, d1, M_\s2
+	vadd.u64	d2, d2, M_\s4
+	vadd.u64	d3, d3, M_\s6
+
+	// d = ror64(d ^ a, 32);
+	veor		q6, q6, q0
+	veor		q7, q7, q1
+	vrev64.32	q6, q6
+	vrev64.32	q7, q7
+
+	// c += d;
+	vadd.u64	q4, q4, q6
+	vadd.u64	q5, q5, q7
+
+	// b = ror64(b ^ c, 24);
+	vld1.8		{M_0}, [ROR24_TABLE, :64]
+	veor		q2, q2, q4
+	veor		q3, q3, q5
+	vtbl.8		d4, {d4}, M_0
+	vtbl.8		d5, {d5}, M_0
+	vtbl.8		d6, {d6}, M_0
+	vtbl.8		d7, {d7}, M_0
+
+	// a += b + m[blake2b_sigma[r][2*i + 1]];
+	//
+	// M_0 got clobbered above, so we have to reload it if any of the four
+	// message words this step needs happens to be M_0.  Otherwise we don't
+	// need to reload it here, as it will just get clobbered again below.
+.if \s1 == 0 || \s3 == 0 || \s5 == 0 || \s7 == 0
+	vld1.8		{M_0}, [sp, :64]
+.endif
+	vadd.u64	q0, q0, q2
+	vadd.u64	q1, q1, q3
+	vadd.u64	d0, d0, M_\s1
+	vadd.u64	d1, d1, M_\s3
+	vadd.u64	d2, d2, M_\s5
+	vadd.u64	d3, d3, M_\s7
+
+	// d = ror64(d ^ a, 16);
+	vld1.8		{M_0}, [ROR16_TABLE, :64]
+	veor		q6, q6, q0
+	veor		q7, q7, q1
+	vtbl.8		d12, {d12}, M_0
+	vtbl.8		d13, {d13}, M_0
+	vtbl.8		d14, {d14}, M_0
+	vtbl.8		d15, {d15}, M_0
+
+	// c += d;
+	vadd.u64	q4, q4, q6
+	vadd.u64	q5, q5, q7
+
+	// b = ror64(b ^ c, 63);
+	//
+	// This rotation amount isn't a multiple of 8, so it has to be
+	// implemented using a pair of shifts, which requires temporary
+	// registers.  Use q8-q9 (M_0-M_3) for this, and reload them afterwards.
+	veor		q8, q2, q4
+	veor		q9, q3, q5
+	vshr.u64	q2, q8, #63
+	vshr.u64	q3, q9, #63
+	vsli.u64	q2, q8, #1
+	vsli.u64	q3, q9, #1
+	vld1.8		{q8-q9}, [sp, :256]
+
+	// Mix the diagonals:
+	// (v[0], v[5], v[10], v[15]), (v[1], v[6], v[11], v[12]),
+	// (v[2], v[7], v[8], v[13]), and (v[3], v[4], v[9], v[14]).
+	//
+	// There are two possible ways to do this: use 'vext' instructions to
+	// shift the rows of the matrix so that the diagonals become columns,
+	// and undo it afterwards; or just use 64-bit operations on 'd'
+	// registers instead of 128-bit operations on 'q' registers.  We use the
+	// latter approach, as it performs much better on Cortex-A7.
+
+	// a += b + m[blake2b_sigma[r][2*i + 0]];
+	vadd.u64	d0, d0, d5
+	vadd.u64	d1, d1, d6
+	vadd.u64	d2, d2, d7
+	vadd.u64	d3, d3, d4
+	vadd.u64	d0, d0, M_\s8
+	vadd.u64	d1, d1, M_\s10
+	vadd.u64	d2, d2, M_\s12
+	vadd.u64	d3, d3, M_\s14
+
+	// d = ror64(d ^ a, 32);
+	veor		d15, d15, d0
+	veor		d12, d12, d1
+	veor		d13, d13, d2
+	veor		d14, d14, d3
+	vrev64.32	d15, d15
+	vrev64.32	d12, d12
+	vrev64.32	d13, d13
+	vrev64.32	d14, d14
+
+	// c += d;
+	vadd.u64	d10, d10, d15
+	vadd.u64	d11, d11, d12
+	vadd.u64	d8, d8, d13
+	vadd.u64	d9, d9, d14
+
+	// b = ror64(b ^ c, 24);
+	vld1.8		{M_0}, [ROR24_TABLE, :64]
+	veor		d5, d5, d10
+	veor		d6, d6, d11
+	veor		d7, d7, d8
+	veor		d4, d4, d9
+	vtbl.8		d5, {d5}, M_0
+	vtbl.8		d6, {d6}, M_0
+	vtbl.8		d7, {d7}, M_0
+	vtbl.8		d4, {d4}, M_0
+
+	// a += b + m[blake2b_sigma[r][2*i + 1]];
+.if \s9 == 0 || \s11 == 0 || \s13 == 0 || \s15 == 0
+	vld1.8		{M_0}, [sp, :64]
+.endif
+	vadd.u64	d0, d0, d5
+	vadd.u64	d1, d1, d6
+	vadd.u64	d2, d2, d7
+	vadd.u64	d3, d3, d4
+	vadd.u64	d0, d0, M_\s9
+	vadd.u64	d1, d1, M_\s11
+	vadd.u64	d2, d2, M_\s13
+	vadd.u64	d3, d3, M_\s15
+
+	// d = ror64(d ^ a, 16);
+	vld1.8		{M_0}, [ROR16_TABLE, :64]
+	veor		d15, d15, d0
+	veor		d12, d12, d1
+	veor		d13, d13, d2
+	veor		d14, d14, d3
+	vtbl.8		d12, {d12}, M_0
+	vtbl.8		d13, {d13}, M_0
+	vtbl.8		d14, {d14}, M_0
+	vtbl.8		d15, {d15}, M_0
+
+	// c += d;
+	vadd.u64	d10, d10, d15
+	vadd.u64	d11, d11, d12
+	vadd.u64	d8, d8, d13
+	vadd.u64	d9, d9, d14
+
+	// b = ror64(b ^ c, 63);
+	veor		d16, d4, d9
+	veor		d17, d5, d10
+	veor		d18, d6, d11
+	veor		d19, d7, d8
+	vshr.u64	q2, q8, #63
+	vshr.u64	q3, q9, #63
+	vsli.u64	q2, q8, #1
+	vsli.u64	q3, q9, #1
+	// Reloading q8-q9 can be skipped on the final round.
+.if ! \final
+	vld1.8		{q8-q9}, [sp, :256]
+.endif
+.endm
+
+//
+// void blake2b_compress_neon(struct blake2b_state *state,
+//			      const u8 *block, size_t nblocks, u32 inc);
+//
+// Only the first three fields of struct blake2b_state are used:
+//	u64 h[8];	(inout)
+//	u64 t[2];	(inout)
+//	u64 f[2];	(in)
+//
+	.align		5
+ENTRY(blake2b_compress_neon)
+	push		{r4-r10}
+
+	// Allocate a 32-byte stack buffer that is 32-byte aligned.
+	mov		ORIG_SP, sp
+	sub		ip, sp, #32
+	bic		ip, ip, #31
+	mov		sp, ip
+
+	adr		ROR24_TABLE, .Lror24_table
+	adr		ROR16_TABLE, .Lror16_table
+
+	mov		ip, STATE
+	vld1.64		{q0-q1}, [ip]!		// Load h[0..3]
+	vld1.64		{q2-q3}, [ip]!		// Load h[4..7]
+.Lnext_block:
+	  adr		r10, .Lblake2b_IV
+	vld1.64		{q14-q15}, [ip]		// Load t[0..1] and f[0..1]
+	vld1.64		{q4-q5}, [r10]!		// Load IV[0..3]
+	  vmov		r7, r8, d28		// Copy t[0] to (r7, r8)
+	vld1.64		{q6-q7}, [r10]		// Load IV[4..7]
+	  adds		r7, r7, INC		// Increment counter
+	bcs		.Lslow_inc_ctr
+	vmov.i32	d28[0], r7
+	vst1.64		{d28}, [ip]		// Update t[0]
+.Linc_ctr_done:
+
+	// Load the next message block and finish initializing the state matrix
+	// 'v'.  Fortunately, there are exactly enough NEON registers to fit the
+	// entire state matrix in q0-q7 and the entire message block in q8-15.
+	//
+	// However, _blake2b_round also needs some extra registers for rotates,
+	// so we have to spill some registers.  It's better to spill the message
+	// registers than the state registers, as the message doesn't change.
+	// Therefore we store a copy of the first 32 bytes of the message block
+	// (q8-q9) in an aligned buffer on the stack so that they can be
+	// reloaded when needed.  (We could just reload directly from the
+	// message buffer, but it's faster to use aligned loads.)
+	vld1.8		{q8-q9}, [BLOCK]!
+	  veor		q6, q6, q14	// v[12..13] = IV[4..5] ^ t[0..1]
+	vld1.8		{q10-q11}, [BLOCK]!
+	  veor		q7, q7, q15	// v[14..15] = IV[6..7] ^ f[0..1]
+	vld1.8		{q12-q13}, [BLOCK]!
+	vst1.8		{q8-q9}, [sp, :256]
+	  mov		ip, STATE
+	vld1.8		{q14-q15}, [BLOCK]!
+
+	// Execute the rounds.  Each round is provided the order in which it
+	// needs to use the message words.
+	_blake2b_round	0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
+	_blake2b_round	14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3
+	_blake2b_round	11, 8, 12, 0, 5, 2, 15, 13, 10, 14, 3, 6, 7, 1, 9, 4
+	_blake2b_round	7, 9, 3, 1, 13, 12, 11, 14, 2, 6, 5, 10, 4, 0, 15, 8
+	_blake2b_round	9, 0, 5, 7, 2, 4, 10, 15, 14, 1, 11, 12, 6, 8, 3, 13
+	_blake2b_round	2, 12, 6, 10, 0, 11, 8, 3, 4, 13, 7, 5, 15, 14, 1, 9
+	_blake2b_round	12, 5, 1, 15, 14, 13, 4, 10, 0, 7, 6, 3, 9, 2, 8, 11
+	_blake2b_round	13, 11, 7, 14, 12, 1, 3, 9, 5, 0, 15, 4, 8, 6, 2, 10
+	_blake2b_round	6, 15, 14, 9, 11, 3, 0, 8, 12, 2, 13, 7, 1, 4, 10, 5
+	_blake2b_round	10, 2, 8, 4, 7, 6, 1, 5, 15, 11, 9, 14, 3, 12, 13, 0
+	_blake2b_round	0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
+	_blake2b_round	14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3 \
+			final=1
+
+	// Fold the final state matrix into the hash chaining value:
+	//
+	//	for (i = 0; i < 8; i++)
+	//		h[i] ^= v[i] ^ v[i + 8];
+	//
+	  vld1.64	{q8-q9}, [ip]!		// Load old h[0..3]
+	veor		q0, q0, q4		// v[0..1] ^= v[8..9]
+	veor		q1, q1, q5		// v[2..3] ^= v[10..11]
+	  vld1.64	{q10-q11}, [ip]		// Load old h[4..7]
+	veor		q2, q2, q6		// v[4..5] ^= v[12..13]
+	veor		q3, q3, q7		// v[6..7] ^= v[14..15]
+	veor		q0, q0, q8		// v[0..1] ^= h[0..1]
+	veor		q1, q1, q9		// v[2..3] ^= h[2..3]
+	  mov		ip, STATE
+	  subs		NBLOCKS, NBLOCKS, #1	// nblocks--
+	  vst1.64	{q0-q1}, [ip]!		// Store new h[0..3]
+	veor		q2, q2, q10		// v[4..5] ^= h[4..5]
+	veor		q3, q3, q11		// v[6..7] ^= h[6..7]
+	  vst1.64	{q2-q3}, [ip]!		// Store new h[4..7]
+
+	// Advance to the next block, if there is one.
+	bne		.Lnext_block		// nblocks != 0?
+
+	mov		sp, ORIG_SP
+	pop		{r4-r10}
+	mov		pc, lr
+
+.Lslow_inc_ctr:
+	// Handle the case where the counter overflowed its low 32 bits, by
+	// carrying the overflow bit into the full 128-bit counter.
+	vmov		r9, r10, d29
+	adcs		r8, r8, #0
+	adcs		r9, r9, #0
+	adc		r10, r10, #0
+	vmov		d28, r7, r8
+	vmov		d29, r9, r10
+	vst1.64		{q14}, [ip]		// Update t[0] and t[1]
+	b		.Linc_ctr_done
+ENDPROC(blake2b_compress_neon)
diff --git a/arch/arm/crypto/blake2b-neon-glue.c b/arch/arm/crypto/blake2b-neon-glue.c
new file mode 100644
index 0000000000000..34d73200e7fa6
--- /dev/null
+++ b/arch/arm/crypto/blake2b-neon-glue.c
@@ -0,0 +1,105 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * BLAKE2b digest algorithm, NEON accelerated
+ *
+ * Copyright 2020 Google LLC
+ */
+
+#include <crypto/internal/blake2b.h>
+#include <crypto/internal/hash.h>
+#include <crypto/internal/simd.h>
+
+#include <linux/module.h>
+#include <linux/sizes.h>
+
+#include <asm/neon.h>
+#include <asm/simd.h>
+
+asmlinkage void blake2b_compress_neon(struct blake2b_state *state,
+				      const u8 *block, size_t nblocks, u32 inc);
+
+static void blake2b_compress_arch(struct blake2b_state *state,
+				  const u8 *block, size_t nblocks, u32 inc)
+{
+	if (!crypto_simd_usable()) {
+		blake2b_compress_generic(state, block, nblocks, inc);
+		return;
+	}
+
+	do {
+		const size_t blocks = min_t(size_t, nblocks,
+					    SZ_4K / BLAKE2B_BLOCK_SIZE);
+
+		kernel_neon_begin();
+		blake2b_compress_neon(state, block, blocks, inc);
+		kernel_neon_end();
+
+		nblocks -= blocks;
+		block += blocks * BLAKE2B_BLOCK_SIZE;
+	} while (nblocks);
+}
+
+static int crypto_blake2b_update_neon(struct shash_desc *desc,
+				      const u8 *in, unsigned int inlen)
+{
+	return crypto_blake2b_update(desc, in, inlen, blake2b_compress_arch);
+}
+
+static int crypto_blake2b_final_neon(struct shash_desc *desc, u8 *out)
+{
+	return crypto_blake2b_final(desc, out, blake2b_compress_arch);
+}
+
+#define BLAKE2B_ALG(name, driver_name, digest_size)			\
+	{								\
+		.base.cra_name		= name,				\
+		.base.cra_driver_name	= driver_name,			\
+		.base.cra_priority	= 200,				\
+		.base.cra_flags		= CRYPTO_ALG_OPTIONAL_KEY,	\
+		.base.cra_blocksize	= BLAKE2B_BLOCK_SIZE,		\
+		.base.cra_ctxsize	= sizeof(struct blake2b_tfm_ctx), \
+		.base.cra_module	= THIS_MODULE,			\
+		.digestsize		= digest_size,			\
+		.setkey			= crypto_blake2b_setkey,	\
+		.init			= crypto_blake2b_init,		\
+		.update			= crypto_blake2b_update_neon,	\
+		.final			= crypto_blake2b_final_neon,	\
+		.descsize		= sizeof(struct blake2b_state),	\
+	}
+
+static struct shash_alg blake2b_neon_algs[] = {
+	BLAKE2B_ALG("blake2b-160", "blake2b-160-neon", BLAKE2B_160_HASH_SIZE),
+	BLAKE2B_ALG("blake2b-256", "blake2b-256-neon", BLAKE2B_256_HASH_SIZE),
+	BLAKE2B_ALG("blake2b-384", "blake2b-384-neon", BLAKE2B_384_HASH_SIZE),
+	BLAKE2B_ALG("blake2b-512", "blake2b-512-neon", BLAKE2B_512_HASH_SIZE),
+};
+
+static int __init blake2b_neon_mod_init(void)
+{
+	if (!(elf_hwcap & HWCAP_NEON))
+		return -ENODEV;
+
+	return crypto_register_shashes(blake2b_neon_algs,
+				       ARRAY_SIZE(blake2b_neon_algs));
+}
+
+static void __exit blake2b_neon_mod_exit(void)
+{
+	return crypto_unregister_shashes(blake2b_neon_algs,
+					 ARRAY_SIZE(blake2b_neon_algs));
+}
+
+module_init(blake2b_neon_mod_init);
+module_exit(blake2b_neon_mod_exit);
+
+MODULE_DESCRIPTION("BLAKE2b digest algorithm, NEON accelerated");
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Eric Biggers <ebiggers@google.com>");
+MODULE_ALIAS_CRYPTO("blake2b-160");
+MODULE_ALIAS_CRYPTO("blake2b-160-neon");
+MODULE_ALIAS_CRYPTO("blake2b-256");
+MODULE_ALIAS_CRYPTO("blake2b-256-neon");
+MODULE_ALIAS_CRYPTO("blake2b-384");
+MODULE_ALIAS_CRYPTO("blake2b-384-neon");
+MODULE_ALIAS_CRYPTO("blake2b-512");
+MODULE_ALIAS_CRYPTO("blake2b-512-neon");
-- 
2.29.2



* Re: [PATCH v3 04/14] crypto: blake2s - move update and final logic to internal/blake2s.h
  2020-12-23  8:09 ` [PATCH v3 04/14] crypto: blake2s - move update and final logic to internal/blake2s.h Eric Biggers
@ 2020-12-23  9:05   ` Ard Biesheuvel
  0 siblings, 0 replies; 25+ messages in thread
From: Ard Biesheuvel @ 2020-12-23  9:05 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Linux Crypto Mailing List, Linux ARM, Herbert Xu, David Sterba,
	Jason A . Donenfeld, Paul Crowley

On Wed, 23 Dec 2020 at 09:12, Eric Biggers <ebiggers@kernel.org> wrote:
>
> From: Eric Biggers <ebiggers@google.com>
>
> Move most of blake2s_update() and blake2s_final() into new inline
> functions __blake2s_update() and __blake2s_final() in
> include/crypto/internal/blake2s.h so that this logic can be shared by
> the shash helper functions.  This will avoid duplicating this logic
> between the library and shash implementations.
>
> Signed-off-by: Eric Biggers <ebiggers@google.com>

Acked-by: Ard Biesheuvel <ardb@kernel.org>

> ---
>  include/crypto/internal/blake2s.h | 41 ++++++++++++++++++++++++++
>  lib/crypto/blake2s.c              | 48 ++++++-------------------------
>  2 files changed, 49 insertions(+), 40 deletions(-)
>
> diff --git a/include/crypto/internal/blake2s.h b/include/crypto/internal/blake2s.h
> index 6e376ae6b6b58..42deba4b8ceef 100644
> --- a/include/crypto/internal/blake2s.h
> +++ b/include/crypto/internal/blake2s.h
> @@ -4,6 +4,7 @@
>  #define BLAKE2S_INTERNAL_H
>
>  #include <crypto/blake2s.h>
> +#include <linux/string.h>
>
>  struct blake2s_tfm_ctx {
>         u8 key[BLAKE2S_KEY_SIZE];
> @@ -23,4 +24,44 @@ static inline void blake2s_set_lastblock(struct blake2s_state *state)
>         state->f[0] = -1;
>  }
>
> +typedef void (*blake2s_compress_t)(struct blake2s_state *state,
> +                                  const u8 *block, size_t nblocks, u32 inc);
> +
> +static inline void __blake2s_update(struct blake2s_state *state,
> +                                   const u8 *in, size_t inlen,
> +                                   blake2s_compress_t compress)
> +{
> +       const size_t fill = BLAKE2S_BLOCK_SIZE - state->buflen;
> +
> +       if (unlikely(!inlen))
> +               return;
> +       if (inlen > fill) {
> +               memcpy(state->buf + state->buflen, in, fill);
> +               (*compress)(state, state->buf, 1, BLAKE2S_BLOCK_SIZE);
> +               state->buflen = 0;
> +               in += fill;
> +               inlen -= fill;
> +       }
> +       if (inlen > BLAKE2S_BLOCK_SIZE) {
> +               const size_t nblocks = DIV_ROUND_UP(inlen, BLAKE2S_BLOCK_SIZE);
> +               /* Hash one less (full) block than strictly possible */
> +               (*compress)(state, in, nblocks - 1, BLAKE2S_BLOCK_SIZE);
> +               in += BLAKE2S_BLOCK_SIZE * (nblocks - 1);
> +               inlen -= BLAKE2S_BLOCK_SIZE * (nblocks - 1);
> +       }
> +       memcpy(state->buf + state->buflen, in, inlen);
> +       state->buflen += inlen;
> +}
> +
> +static inline void __blake2s_final(struct blake2s_state *state, u8 *out,
> +                                  blake2s_compress_t compress)
> +{
> +       blake2s_set_lastblock(state);
> +       memset(state->buf + state->buflen, 0,
> +              BLAKE2S_BLOCK_SIZE - state->buflen); /* Padding */
> +       (*compress)(state, state->buf, 1, state->buflen);
> +       cpu_to_le32_array(state->h, ARRAY_SIZE(state->h));
> +       memcpy(out, state->h, state->outlen);
> +}
> +
>  #endif /* BLAKE2S_INTERNAL_H */
> diff --git a/lib/crypto/blake2s.c b/lib/crypto/blake2s.c
> index 6a4b6b78d630f..c64ac8bfb6a97 100644
> --- a/lib/crypto/blake2s.c
> +++ b/lib/crypto/blake2s.c
> @@ -15,55 +15,23 @@
>  #include <linux/module.h>
>  #include <linux/init.h>
>  #include <linux/bug.h>
> -#include <asm/unaligned.h>
> +
> +#if IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_BLAKE2S)
> +#  define blake2s_compress blake2s_compress_arch
> +#else
> +#  define blake2s_compress blake2s_compress_generic
> +#endif
>
>  void blake2s_update(struct blake2s_state *state, const u8 *in, size_t inlen)
>  {
> -       const size_t fill = BLAKE2S_BLOCK_SIZE - state->buflen;
> -
> -       if (unlikely(!inlen))
> -               return;
> -       if (inlen > fill) {
> -               memcpy(state->buf + state->buflen, in, fill);
> -               if (IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_BLAKE2S))
> -                       blake2s_compress_arch(state, state->buf, 1,
> -                                             BLAKE2S_BLOCK_SIZE);
> -               else
> -                       blake2s_compress_generic(state, state->buf, 1,
> -                                                BLAKE2S_BLOCK_SIZE);
> -               state->buflen = 0;
> -               in += fill;
> -               inlen -= fill;
> -       }
> -       if (inlen > BLAKE2S_BLOCK_SIZE) {
> -               const size_t nblocks = DIV_ROUND_UP(inlen, BLAKE2S_BLOCK_SIZE);
> -               /* Hash one less (full) block than strictly possible */
> -               if (IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_BLAKE2S))
> -                       blake2s_compress_arch(state, in, nblocks - 1,
> -                                             BLAKE2S_BLOCK_SIZE);
> -               else
> -                       blake2s_compress_generic(state, in, nblocks - 1,
> -                                                BLAKE2S_BLOCK_SIZE);
> -               in += BLAKE2S_BLOCK_SIZE * (nblocks - 1);
> -               inlen -= BLAKE2S_BLOCK_SIZE * (nblocks - 1);
> -       }
> -       memcpy(state->buf + state->buflen, in, inlen);
> -       state->buflen += inlen;
> +       __blake2s_update(state, in, inlen, blake2s_compress);
>  }
>  EXPORT_SYMBOL(blake2s_update);
>
>  void blake2s_final(struct blake2s_state *state, u8 *out)
>  {
>         WARN_ON(IS_ENABLED(DEBUG) && !out);
> -       blake2s_set_lastblock(state);
> -       memset(state->buf + state->buflen, 0,
> -              BLAKE2S_BLOCK_SIZE - state->buflen); /* Padding */
> -       if (IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_BLAKE2S))
> -               blake2s_compress_arch(state, state->buf, 1, state->buflen);
> -       else
> -               blake2s_compress_generic(state, state->buf, 1, state->buflen);
> -       cpu_to_le32_array(state->h, ARRAY_SIZE(state->h));
> -       memcpy(out, state->h, state->outlen);
> +       __blake2s_final(state, out, blake2s_compress);
>         memzero_explicit(state, sizeof(*state));
>  }
>  EXPORT_SYMBOL(blake2s_final);
> --
> 2.29.2
>


* Re: [PATCH v3 05/14] crypto: blake2s - share the "shash" API boilerplate code
  2020-12-23  8:09 ` [PATCH v3 05/14] crypto: blake2s - share the "shash" API boilerplate code Eric Biggers
@ 2020-12-23  9:06   ` Ard Biesheuvel
  0 siblings, 0 replies; 25+ messages in thread
From: Ard Biesheuvel @ 2020-12-23  9:06 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Linux Crypto Mailing List, Linux ARM, Herbert Xu, David Sterba,
	Jason A . Donenfeld, Paul Crowley

On Wed, 23 Dec 2020 at 09:12, Eric Biggers <ebiggers@kernel.org> wrote:
>
> From: Eric Biggers <ebiggers@google.com>
>
> Add helper functions for shash implementations of BLAKE2s to
> include/crypto/internal/blake2s.h, taking advantage of
> __blake2s_update() and __blake2s_final() that were added by the previous
> patch to share more code between the library and shash implementations.
>
> crypto_blake2s_setkey() and crypto_blake2s_init() are usable as
> shash_alg::setkey and shash_alg::init directly, while
> crypto_blake2s_update() and crypto_blake2s_final() take an extra
> 'blake2s_compress_t' function pointer parameter.  This allows the
> implementation of the compression function to be overridden, which is
> the only part that optimized implementations really care about.
>
> The new functions are inline functions (similar to those in sha1_base.h,
> sha256_base.h, and sm3_base.h) because this avoids needing to add a new
> module blake2s_helpers.ko, they aren't *too* long, and this avoids
> indirect calls which are expensive these days.  Note that they can't go
> in blake2s_generic.ko, as that would require selecting CRYPTO_BLAKE2S
> from CRYPTO_BLAKE2S_X86, which would cause a recursive dependency.
>
> Finally, use these new helper functions in the x86 implementation of
> BLAKE2s.  (This part should be a separate patch, but unfortunately the
> x86 implementation used the exact same function names like
> "crypto_blake2s_update()", so it had to be updated at the same time.)
>
> Signed-off-by: Eric Biggers <ebiggers@google.com>

Acked-by: Ard Biesheuvel <ardb@kernel.org>

> ---
>  arch/x86/crypto/blake2s-glue.c    | 74 +++---------------------------
>  crypto/blake2s_generic.c          | 76 ++++---------------------------
>  include/crypto/internal/blake2s.h | 65 ++++++++++++++++++++++++--
>  3 files changed, 76 insertions(+), 139 deletions(-)
>
> diff --git a/arch/x86/crypto/blake2s-glue.c b/arch/x86/crypto/blake2s-glue.c
> index 4dcb2ee89efc9..a40365ab301ee 100644
> --- a/arch/x86/crypto/blake2s-glue.c
> +++ b/arch/x86/crypto/blake2s-glue.c
> @@ -58,75 +58,15 @@ void blake2s_compress_arch(struct blake2s_state *state,
>  }
>  EXPORT_SYMBOL(blake2s_compress_arch);
>
> -static int crypto_blake2s_setkey(struct crypto_shash *tfm, const u8 *key,
> -                                unsigned int keylen)
> +static int crypto_blake2s_update_x86(struct shash_desc *desc,
> +                                    const u8 *in, unsigned int inlen)
>  {
> -       struct blake2s_tfm_ctx *tctx = crypto_shash_ctx(tfm);
> -
> -       if (keylen == 0 || keylen > BLAKE2S_KEY_SIZE)
> -               return -EINVAL;
> -
> -       memcpy(tctx->key, key, keylen);
> -       tctx->keylen = keylen;
> -
> -       return 0;
> -}
> -
> -static int crypto_blake2s_init(struct shash_desc *desc)
> -{
> -       struct blake2s_tfm_ctx *tctx = crypto_shash_ctx(desc->tfm);
> -       struct blake2s_state *state = shash_desc_ctx(desc);
> -       const int outlen = crypto_shash_digestsize(desc->tfm);
> -
> -       if (tctx->keylen)
> -               blake2s_init_key(state, outlen, tctx->key, tctx->keylen);
> -       else
> -               blake2s_init(state, outlen);
> -
> -       return 0;
> -}
> -
> -static int crypto_blake2s_update(struct shash_desc *desc, const u8 *in,
> -                                unsigned int inlen)
> -{
> -       struct blake2s_state *state = shash_desc_ctx(desc);
> -       const size_t fill = BLAKE2S_BLOCK_SIZE - state->buflen;
> -
> -       if (unlikely(!inlen))
> -               return 0;
> -       if (inlen > fill) {
> -               memcpy(state->buf + state->buflen, in, fill);
> -               blake2s_compress_arch(state, state->buf, 1, BLAKE2S_BLOCK_SIZE);
> -               state->buflen = 0;
> -               in += fill;
> -               inlen -= fill;
> -       }
> -       if (inlen > BLAKE2S_BLOCK_SIZE) {
> -               const size_t nblocks = DIV_ROUND_UP(inlen, BLAKE2S_BLOCK_SIZE);
> -               /* Hash one less (full) block than strictly possible */
> -               blake2s_compress_arch(state, in, nblocks - 1, BLAKE2S_BLOCK_SIZE);
> -               in += BLAKE2S_BLOCK_SIZE * (nblocks - 1);
> -               inlen -= BLAKE2S_BLOCK_SIZE * (nblocks - 1);
> -       }
> -       memcpy(state->buf + state->buflen, in, inlen);
> -       state->buflen += inlen;
> -
> -       return 0;
> +       return crypto_blake2s_update(desc, in, inlen, blake2s_compress_arch);
>  }
>
> -static int crypto_blake2s_final(struct shash_desc *desc, u8 *out)
> +static int crypto_blake2s_final_x86(struct shash_desc *desc, u8 *out)
>  {
> -       struct blake2s_state *state = shash_desc_ctx(desc);
> -
> -       blake2s_set_lastblock(state);
> -       memset(state->buf + state->buflen, 0,
> -              BLAKE2S_BLOCK_SIZE - state->buflen); /* Padding */
> -       blake2s_compress_arch(state, state->buf, 1, state->buflen);
> -       cpu_to_le32_array(state->h, ARRAY_SIZE(state->h));
> -       memcpy(out, state->h, state->outlen);
> -       memzero_explicit(state, sizeof(*state));
> -
> -       return 0;
> +       return crypto_blake2s_final(desc, out, blake2s_compress_arch);
>  }
>
>  #define BLAKE2S_ALG(name, driver_name, digest_size)                    \
> @@ -141,8 +81,8 @@ static int crypto_blake2s_final(struct shash_desc *desc, u8 *out)
>                 .digestsize             = digest_size,                  \
>                 .setkey                 = crypto_blake2s_setkey,        \
>                 .init                   = crypto_blake2s_init,          \
> -               .update                 = crypto_blake2s_update,        \
> -               .final                  = crypto_blake2s_final,         \
> +               .update                 = crypto_blake2s_update_x86,    \
> +               .final                  = crypto_blake2s_final_x86,     \
>                 .descsize               = sizeof(struct blake2s_state), \
>         }
>
> diff --git a/crypto/blake2s_generic.c b/crypto/blake2s_generic.c
> index b89536c3671cf..72fe480f9bd67 100644
> --- a/crypto/blake2s_generic.c
> +++ b/crypto/blake2s_generic.c
> @@ -1,5 +1,7 @@
>  // SPDX-License-Identifier: GPL-2.0 OR MIT
>  /*
> + * shash interface to the generic implementation of BLAKE2s
> + *
>   * Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
>   */
>
> @@ -10,75 +12,15 @@
>  #include <linux/kernel.h>
>  #include <linux/module.h>
>
> -static int crypto_blake2s_setkey(struct crypto_shash *tfm, const u8 *key,
> -                                unsigned int keylen)
> +static int crypto_blake2s_update_generic(struct shash_desc *desc,
> +                                        const u8 *in, unsigned int inlen)
>  {
> -       struct blake2s_tfm_ctx *tctx = crypto_shash_ctx(tfm);
> -
> -       if (keylen == 0 || keylen > BLAKE2S_KEY_SIZE)
> -               return -EINVAL;
> -
> -       memcpy(tctx->key, key, keylen);
> -       tctx->keylen = keylen;
> -
> -       return 0;
> +       return crypto_blake2s_update(desc, in, inlen, blake2s_compress_generic);
>  }
>
> -static int crypto_blake2s_init(struct shash_desc *desc)
> +static int crypto_blake2s_final_generic(struct shash_desc *desc, u8 *out)
>  {
> -       struct blake2s_tfm_ctx *tctx = crypto_shash_ctx(desc->tfm);
> -       struct blake2s_state *state = shash_desc_ctx(desc);
> -       const int outlen = crypto_shash_digestsize(desc->tfm);
> -
> -       if (tctx->keylen)
> -               blake2s_init_key(state, outlen, tctx->key, tctx->keylen);
> -       else
> -               blake2s_init(state, outlen);
> -
> -       return 0;
> -}
> -
> -static int crypto_blake2s_update(struct shash_desc *desc, const u8 *in,
> -                                unsigned int inlen)
> -{
> -       struct blake2s_state *state = shash_desc_ctx(desc);
> -       const size_t fill = BLAKE2S_BLOCK_SIZE - state->buflen;
> -
> -       if (unlikely(!inlen))
> -               return 0;
> -       if (inlen > fill) {
> -               memcpy(state->buf + state->buflen, in, fill);
> -               blake2s_compress_generic(state, state->buf, 1, BLAKE2S_BLOCK_SIZE);
> -               state->buflen = 0;
> -               in += fill;
> -               inlen -= fill;
> -       }
> -       if (inlen > BLAKE2S_BLOCK_SIZE) {
> -               const size_t nblocks = DIV_ROUND_UP(inlen, BLAKE2S_BLOCK_SIZE);
> -               /* Hash one less (full) block than strictly possible */
> -               blake2s_compress_generic(state, in, nblocks - 1, BLAKE2S_BLOCK_SIZE);
> -               in += BLAKE2S_BLOCK_SIZE * (nblocks - 1);
> -               inlen -= BLAKE2S_BLOCK_SIZE * (nblocks - 1);
> -       }
> -       memcpy(state->buf + state->buflen, in, inlen);
> -       state->buflen += inlen;
> -
> -       return 0;
> -}
> -
> -static int crypto_blake2s_final(struct shash_desc *desc, u8 *out)
> -{
> -       struct blake2s_state *state = shash_desc_ctx(desc);
> -
> -       blake2s_set_lastblock(state);
> -       memset(state->buf + state->buflen, 0,
> -              BLAKE2S_BLOCK_SIZE - state->buflen); /* Padding */
> -       blake2s_compress_generic(state, state->buf, 1, state->buflen);
> -       cpu_to_le32_array(state->h, ARRAY_SIZE(state->h));
> -       memcpy(out, state->h, state->outlen);
> -       memzero_explicit(state, sizeof(*state));
> -
> -       return 0;
> +       return crypto_blake2s_final(desc, out, blake2s_compress_generic);
>  }
>
>  #define BLAKE2S_ALG(name, driver_name, digest_size)                    \
> @@ -93,8 +35,8 @@ static int crypto_blake2s_final(struct shash_desc *desc, u8 *out)
>                 .digestsize             = digest_size,                  \
>                 .setkey                 = crypto_blake2s_setkey,        \
>                 .init                   = crypto_blake2s_init,          \
> -               .update                 = crypto_blake2s_update,        \
> -               .final                  = crypto_blake2s_final,         \
> +               .update                 = crypto_blake2s_update_generic, \
> +               .final                  = crypto_blake2s_final_generic, \
>                 .descsize               = sizeof(struct blake2s_state), \
>         }
>
> diff --git a/include/crypto/internal/blake2s.h b/include/crypto/internal/blake2s.h
> index 42deba4b8ceef..2ea0a8f5e7f41 100644
> --- a/include/crypto/internal/blake2s.h
> +++ b/include/crypto/internal/blake2s.h
> @@ -1,16 +1,16 @@
>  /* SPDX-License-Identifier: GPL-2.0 OR MIT */
> +/*
> + * Helper functions for BLAKE2s implementations.
> + * Keep this in sync with the corresponding BLAKE2b header.
> + */
>
>  #ifndef BLAKE2S_INTERNAL_H
>  #define BLAKE2S_INTERNAL_H
>
>  #include <crypto/blake2s.h>
> +#include <crypto/internal/hash.h>
>  #include <linux/string.h>
>
> -struct blake2s_tfm_ctx {
> -       u8 key[BLAKE2S_KEY_SIZE];
> -       unsigned int keylen;
> -};
> -
>  void blake2s_compress_generic(struct blake2s_state *state,const u8 *block,
>                               size_t nblocks, const u32 inc);
>
> @@ -27,6 +27,8 @@ static inline void blake2s_set_lastblock(struct blake2s_state *state)
>  typedef void (*blake2s_compress_t)(struct blake2s_state *state,
>                                    const u8 *block, size_t nblocks, u32 inc);
>
> +/* Helper functions for BLAKE2s shared by the library and shash APIs */
> +
>  static inline void __blake2s_update(struct blake2s_state *state,
>                                     const u8 *in, size_t inlen,
>                                     blake2s_compress_t compress)
> @@ -64,4 +66,57 @@ static inline void __blake2s_final(struct blake2s_state *state, u8 *out,
>         memcpy(out, state->h, state->outlen);
>  }
>
> +/* Helper functions for shash implementations of BLAKE2s */
> +
> +struct blake2s_tfm_ctx {
> +       u8 key[BLAKE2S_KEY_SIZE];
> +       unsigned int keylen;
> +};
> +
> +static inline int crypto_blake2s_setkey(struct crypto_shash *tfm,
> +                                       const u8 *key, unsigned int keylen)
> +{
> +       struct blake2s_tfm_ctx *tctx = crypto_shash_ctx(tfm);
> +
> +       if (keylen == 0 || keylen > BLAKE2S_KEY_SIZE)
> +               return -EINVAL;
> +
> +       memcpy(tctx->key, key, keylen);
> +       tctx->keylen = keylen;
> +
> +       return 0;
> +}
> +
> +static inline int crypto_blake2s_init(struct shash_desc *desc)
> +{
> +       const struct blake2s_tfm_ctx *tctx = crypto_shash_ctx(desc->tfm);
> +       struct blake2s_state *state = shash_desc_ctx(desc);
> +       unsigned int outlen = crypto_shash_digestsize(desc->tfm);
> +
> +       if (tctx->keylen)
> +               blake2s_init_key(state, outlen, tctx->key, tctx->keylen);
> +       else
> +               blake2s_init(state, outlen);
> +       return 0;
> +}
> +
> +static inline int crypto_blake2s_update(struct shash_desc *desc,
> +                                       const u8 *in, unsigned int inlen,
> +                                       blake2s_compress_t compress)
> +{
> +       struct blake2s_state *state = shash_desc_ctx(desc);
> +
> +       __blake2s_update(state, in, inlen, compress);
> +       return 0;
> +}
> +
> +static inline int crypto_blake2s_final(struct shash_desc *desc, u8 *out,
> +                                      blake2s_compress_t compress)
> +{
> +       struct blake2s_state *state = shash_desc_ctx(desc);
> +
> +       __blake2s_final(state, out, compress);
> +       return 0;
> +}
> +
>  #endif /* BLAKE2S_INTERNAL_H */
> --
> 2.29.2
>

^ permalink raw reply	[flat|nested] 25+ messages in thread
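
The compress callback taken by the new shared helpers has the same signature as
blake2s_compress_generic(), so any implementation can reuse the common update/final
logic.  A minimal sketch of how a hypothetical driver would wire in its own compress
routine (the "_mine" names are made up for illustration; this is not part of the patch):

static void blake2s_compress_mine(struct blake2s_state *state,
                                  const u8 *block, size_t nblocks, u32 inc)
{
        /* Stand-in for an arch-specific routine; fall back to the generic one. */
        blake2s_compress_generic(state, block, nblocks, inc);
}

static int crypto_blake2s_update_mine(struct shash_desc *desc,
                                      const u8 *in, unsigned int inlen)
{
        return crypto_blake2s_update(desc, in, inlen, blake2s_compress_mine);
}

static int crypto_blake2s_final_mine(struct shash_desc *desc, u8 *out)
{
        return crypto_blake2s_final(desc, out, blake2s_compress_mine);
}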

* Re: [PATCH v3 06/14] crypto: blake2s - optimize blake2s initialization
  2020-12-23  8:09 ` [PATCH v3 06/14] crypto: blake2s - optimize blake2s initialization Eric Biggers
@ 2020-12-23  9:06   ` Ard Biesheuvel
  0 siblings, 0 replies; 25+ messages in thread
From: Ard Biesheuvel @ 2020-12-23  9:06 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Linux Crypto Mailing List, Linux ARM, Herbert Xu, David Sterba,
	Jason A . Donenfeld, Paul Crowley

On Wed, 23 Dec 2020 at 09:12, Eric Biggers <ebiggers@kernel.org> wrote:
>
> From: Eric Biggers <ebiggers@google.com>
>
> If no key was provided, then don't waste time initializing the block
> buffer, as its initial contents won't be used.
>
> Also, make crypto_blake2s_init() and blake2s() call a single internal
> function __blake2s_init() which treats the key as optional, rather than
> conditionally calling blake2s_init() or blake2s_init_key().  This
> reduces the compiled code size, as previously both blake2s_init() and
> blake2s_init_key() were being inlined into these two callers, except
> when the key size passed to blake2s() was a compile-time constant.
>
> These optimizations aren't that significant for BLAKE2s.  However, the
> equivalent optimizations will be more significant for BLAKE2b, as
> everything is twice as big in BLAKE2b.  And it's good to keep things
> consistent rather than making optimizations for BLAKE2b but not BLAKE2s.
>
> Signed-off-by: Eric Biggers <ebiggers@google.com>

Acked-by: Ard Biesheuvel <ardb@kernel.org>

> ---
>  include/crypto/blake2s.h          | 53 ++++++++++++++++---------------
>  include/crypto/internal/blake2s.h |  5 +--
>  2 files changed, 28 insertions(+), 30 deletions(-)
>
> diff --git a/include/crypto/blake2s.h b/include/crypto/blake2s.h
> index b471deac28ff8..734ed22b7a6aa 100644
> --- a/include/crypto/blake2s.h
> +++ b/include/crypto/blake2s.h
> @@ -43,29 +43,34 @@ enum blake2s_iv {
>         BLAKE2S_IV7 = 0x5BE0CD19UL,
>  };
>
> -void blake2s_update(struct blake2s_state *state, const u8 *in, size_t inlen);
> -void blake2s_final(struct blake2s_state *state, u8 *out);
> -
> -static inline void blake2s_init_param(struct blake2s_state *state,
> -                                     const u32 param)
> +static inline void __blake2s_init(struct blake2s_state *state, size_t outlen,
> +                                 const void *key, size_t keylen)
>  {
> -       *state = (struct blake2s_state){{
> -               BLAKE2S_IV0 ^ param,
> -               BLAKE2S_IV1,
> -               BLAKE2S_IV2,
> -               BLAKE2S_IV3,
> -               BLAKE2S_IV4,
> -               BLAKE2S_IV5,
> -               BLAKE2S_IV6,
> -               BLAKE2S_IV7,
> -       }};
> +       state->h[0] = BLAKE2S_IV0 ^ (0x01010000 | keylen << 8 | outlen);
> +       state->h[1] = BLAKE2S_IV1;
> +       state->h[2] = BLAKE2S_IV2;
> +       state->h[3] = BLAKE2S_IV3;
> +       state->h[4] = BLAKE2S_IV4;
> +       state->h[5] = BLAKE2S_IV5;
> +       state->h[6] = BLAKE2S_IV6;
> +       state->h[7] = BLAKE2S_IV7;
> +       state->t[0] = 0;
> +       state->t[1] = 0;
> +       state->f[0] = 0;
> +       state->f[1] = 0;
> +       state->buflen = 0;
> +       state->outlen = outlen;
> +       if (keylen) {
> +               memcpy(state->buf, key, keylen);
> +               memset(&state->buf[keylen], 0, BLAKE2S_BLOCK_SIZE - keylen);
> +               state->buflen = BLAKE2S_BLOCK_SIZE;
> +       }
>  }
>
>  static inline void blake2s_init(struct blake2s_state *state,
>                                 const size_t outlen)
>  {
> -       blake2s_init_param(state, 0x01010000 | outlen);
> -       state->outlen = outlen;
> +       __blake2s_init(state, outlen, NULL, 0);
>  }
>
>  static inline void blake2s_init_key(struct blake2s_state *state,
> @@ -75,12 +80,12 @@ static inline void blake2s_init_key(struct blake2s_state *state,
>         WARN_ON(IS_ENABLED(DEBUG) && (!outlen || outlen > BLAKE2S_HASH_SIZE ||
>                 !key || !keylen || keylen > BLAKE2S_KEY_SIZE));
>
> -       blake2s_init_param(state, 0x01010000 | keylen << 8 | outlen);
> -       memcpy(state->buf, key, keylen);
> -       state->buflen = BLAKE2S_BLOCK_SIZE;
> -       state->outlen = outlen;
> +       __blake2s_init(state, outlen, key, keylen);
>  }
>
> +void blake2s_update(struct blake2s_state *state, const u8 *in, size_t inlen);
> +void blake2s_final(struct blake2s_state *state, u8 *out);
> +
>  static inline void blake2s(u8 *out, const u8 *in, const u8 *key,
>                            const size_t outlen, const size_t inlen,
>                            const size_t keylen)
> @@ -91,11 +96,7 @@ static inline void blake2s(u8 *out, const u8 *in, const u8 *key,
>                 outlen > BLAKE2S_HASH_SIZE || keylen > BLAKE2S_KEY_SIZE ||
>                 (!key && keylen)));
>
> -       if (keylen)
> -               blake2s_init_key(&state, outlen, key, keylen);
> -       else
> -               blake2s_init(&state, outlen);
> -
> +       __blake2s_init(&state, outlen, key, keylen);
>         blake2s_update(&state, in, inlen);
>         blake2s_final(&state, out);
>  }
> diff --git a/include/crypto/internal/blake2s.h b/include/crypto/internal/blake2s.h
> index 2ea0a8f5e7f41..867ef3753f5c1 100644
> --- a/include/crypto/internal/blake2s.h
> +++ b/include/crypto/internal/blake2s.h
> @@ -93,10 +93,7 @@ static inline int crypto_blake2s_init(struct shash_desc *desc)
>         struct blake2s_state *state = shash_desc_ctx(desc);
>         unsigned int outlen = crypto_shash_digestsize(desc->tfm);
>
> -       if (tctx->keylen)
> -               blake2s_init_key(state, outlen, tctx->key, tctx->keylen);
> -       else
> -               blake2s_init(state, outlen);
> +       __blake2s_init(state, outlen, tctx->key, tctx->keylen);
>         return 0;
>  }
>
> --
> 2.29.2
>

^ permalink raw reply	[flat|nested] 25+ messages in thread
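
As a worked example of the parameter word folded into h[0] above (illustrative only,
not part of the patch): for a plain, unkeyed BLAKE2s-256 hash, outlen = 32 and
keylen = 0, so the userspace sketch below prints param = 0x01010020 and
h[0] = 0x6B08E647.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint32_t iv0    = 0x6A09E667;              /* BLAKE2S_IV0 */
        uint32_t outlen = 32, keylen = 0;          /* blake2s-256, no key */
        uint32_t param  = 0x01010000 | keylen << 8 | outlen;

        printf("param = 0x%08X, h[0] = 0x%08X\n", param, iv0 ^ param);
        return 0;
}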

* Re: [PATCH v3 07/14] crypto: blake2s - add comment for blake2s_state fields
  2020-12-23  8:09 ` [PATCH v3 07/14] crypto: blake2s - add comment for blake2s_state fields Eric Biggers
@ 2020-12-23  9:07   ` Ard Biesheuvel
  0 siblings, 0 replies; 25+ messages in thread
From: Ard Biesheuvel @ 2020-12-23  9:07 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Linux Crypto Mailing List, Linux ARM, Herbert Xu, David Sterba,
	Jason A . Donenfeld, Paul Crowley

On Wed, 23 Dec 2020 at 09:12, Eric Biggers <ebiggers@kernel.org> wrote:
>
> From: Eric Biggers <ebiggers@google.com>
>
> The first three fields of 'struct blake2s_state' are used in assembly
> code, which isn't immediately obvious, so add a comment to this effect.
>
> Signed-off-by: Eric Biggers <ebiggers@google.com>

Acked-by: Ard Biesheuvel <ardb@kernel.org>

> ---
>  include/crypto/blake2s.h | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/include/crypto/blake2s.h b/include/crypto/blake2s.h
> index 734ed22b7a6aa..f1c8330a61a91 100644
> --- a/include/crypto/blake2s.h
> +++ b/include/crypto/blake2s.h
> @@ -24,6 +24,7 @@ enum blake2s_lengths {
>  };
>
>  struct blake2s_state {
> +       /* 'h', 't', and 'f' are used in assembly code, so keep them as-is. */
>         u32 h[8];
>         u32 t[2];
>         u32 f[2];
> --
> 2.29.2
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 08/14] crypto: blake2s - adjust include guard naming
  2020-12-23  8:09 ` [PATCH v3 08/14] crypto: blake2s - adjust include guard naming Eric Biggers
@ 2020-12-23  9:07   ` Ard Biesheuvel
  0 siblings, 0 replies; 25+ messages in thread
From: Ard Biesheuvel @ 2020-12-23  9:07 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Linux Crypto Mailing List, Linux ARM, Herbert Xu, David Sterba,
	Jason A . Donenfeld, Paul Crowley

On Wed, 23 Dec 2020 at 09:12, Eric Biggers <ebiggers@kernel.org> wrote:
>
> From: Eric Biggers <ebiggers@google.com>
>
> Use the full path in the include guards for the BLAKE2s headers to avoid
> ambiguity and to match the convention for most files in include/crypto/.
>
> Signed-off-by: Eric Biggers <ebiggers@google.com>

Acked-by: Ard Biesheuvel <ardb@kernel.org>

> ---
>  include/crypto/blake2s.h          | 6 +++---
>  include/crypto/internal/blake2s.h | 6 +++---
>  2 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/include/crypto/blake2s.h b/include/crypto/blake2s.h
> index f1c8330a61a91..3f06183c2d804 100644
> --- a/include/crypto/blake2s.h
> +++ b/include/crypto/blake2s.h
> @@ -3,8 +3,8 @@
>   * Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
>   */
>
> -#ifndef BLAKE2S_H
> -#define BLAKE2S_H
> +#ifndef _CRYPTO_BLAKE2S_H
> +#define _CRYPTO_BLAKE2S_H
>
>  #include <linux/types.h>
>  #include <linux/kernel.h>
> @@ -105,4 +105,4 @@ static inline void blake2s(u8 *out, const u8 *in, const u8 *key,
>  void blake2s256_hmac(u8 *out, const u8 *in, const u8 *key, const size_t inlen,
>                      const size_t keylen);
>
> -#endif /* BLAKE2S_H */
> +#endif /* _CRYPTO_BLAKE2S_H */
> diff --git a/include/crypto/internal/blake2s.h b/include/crypto/internal/blake2s.h
> index 867ef3753f5c1..8e50d487500f2 100644
> --- a/include/crypto/internal/blake2s.h
> +++ b/include/crypto/internal/blake2s.h
> @@ -4,8 +4,8 @@
>   * Keep this in sync with the corresponding BLAKE2b header.
>   */
>
> -#ifndef BLAKE2S_INTERNAL_H
> -#define BLAKE2S_INTERNAL_H
> +#ifndef _CRYPTO_INTERNAL_BLAKE2S_H
> +#define _CRYPTO_INTERNAL_BLAKE2S_H
>
>  #include <crypto/blake2s.h>
>  #include <crypto/internal/hash.h>
> @@ -116,4 +116,4 @@ static inline int crypto_blake2s_final(struct shash_desc *desc, u8 *out,
>         return 0;
>  }
>
> -#endif /* BLAKE2S_INTERNAL_H */
> +#endif /* _CRYPTO_INTERNAL_BLAKE2S_H */
> --
> 2.29.2
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 09/14] crypto: blake2s - include <linux/bug.h> instead of <asm/bug.h>
  2020-12-23  8:09 ` [PATCH v3 09/14] crypto: blake2s - include <linux/bug.h> instead of <asm/bug.h> Eric Biggers
@ 2020-12-23  9:07   ` Ard Biesheuvel
  0 siblings, 0 replies; 25+ messages in thread
From: Ard Biesheuvel @ 2020-12-23  9:07 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Linux Crypto Mailing List, Linux ARM, Herbert Xu, David Sterba,
	Jason A . Donenfeld, Paul Crowley

On Wed, 23 Dec 2020 at 09:12, Eric Biggers <ebiggers@kernel.org> wrote:
>
> From: Eric Biggers <ebiggers@google.com>
>
> Address the following checkpatch warning:
>
>         WARNING: Use #include <linux/bug.h> instead of <asm/bug.h>
>
> Signed-off-by: Eric Biggers <ebiggers@google.com>

Acked-by: Ard Biesheuvel <ardb@kernel.org>

> ---
>  include/crypto/blake2s.h | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/include/crypto/blake2s.h b/include/crypto/blake2s.h
> index 3f06183c2d804..bc3fb59442ce5 100644
> --- a/include/crypto/blake2s.h
> +++ b/include/crypto/blake2s.h
> @@ -6,12 +6,11 @@
>  #ifndef _CRYPTO_BLAKE2S_H
>  #define _CRYPTO_BLAKE2S_H
>
> +#include <linux/bug.h>
>  #include <linux/types.h>
>  #include <linux/kernel.h>
>  #include <linux/string.h>
>
> -#include <asm/bug.h>
> -
>  enum blake2s_lengths {
>         BLAKE2S_BLOCK_SIZE = 64,
>         BLAKE2S_HASH_SIZE = 32,
> --
> 2.29.2
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 10/14] crypto: arm/blake2s - add ARM scalar optimized BLAKE2s
  2020-12-23  8:09 ` [PATCH v3 10/14] crypto: arm/blake2s - add ARM scalar optimized BLAKE2s Eric Biggers
@ 2020-12-23  9:08   ` Ard Biesheuvel
  0 siblings, 0 replies; 25+ messages in thread
From: Ard Biesheuvel @ 2020-12-23  9:08 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Linux Crypto Mailing List, Linux ARM, Herbert Xu, David Sterba,
	Jason A . Donenfeld, Paul Crowley

On Wed, 23 Dec 2020 at 09:12, Eric Biggers <ebiggers@kernel.org> wrote:
>
> From: Eric Biggers <ebiggers@google.com>
>
> Add an ARM scalar optimized implementation of BLAKE2s.
>
> NEON isn't very useful for BLAKE2s because the BLAKE2s block size is too
> small for NEON to help.  Each NEON instruction would depend on the
> previous one, resulting in poor performance.
>
> With scalar instructions, on the other hand, we can take advantage of
> ARM's "free" rotations (like I did in chacha-scalar-core.S) to get an
> implementation that runs much faster than the C implementation.
>
> Performance results on Cortex-A7 in cycles per byte using the shash API:
>
>         4096-byte messages:
>                 blake2s-256-arm:     18.8
>                 blake2s-256-generic: 26.0
>
>         500-byte messages:
>                 blake2s-256-arm:     20.3
>                 blake2s-256-generic: 27.9
>
>         100-byte messages:
>                 blake2s-256-arm:     29.7
>                 blake2s-256-generic: 39.2
>
>         32-byte messages:
>                 blake2s-256-arm:     50.6
>                 blake2s-256-generic: 66.2
>
> Except on very short messages, this is still slower than the NEON
> implementation of BLAKE2b which I've written; that is 14.0, 16.4, 25.8,
> and 76.1 cpb on 4096, 500, 100, and 32-byte messages, respectively.
> However, optimized BLAKE2s is useful for cases where BLAKE2s is used
> instead of BLAKE2b, such as WireGuard.
>
> This new implementation is added in the form of a new module
> blake2s-arm.ko, which is analogous to blake2s-x86_64.ko in that it
> provides blake2s_compress_arch() for use by the library API as well as
> optionally registering the algorithms with the shash API.
>
> Acked-by: Ard Biesheuvel <ardb@kernel.org>
> Signed-off-by: Eric Biggers <ebiggers@google.com>

Tested-by: Ard Biesheuvel <ardb@kernel.org>

> ---
>  arch/arm/crypto/Kconfig        |   9 ++
>  arch/arm/crypto/Makefile       |   2 +
>  arch/arm/crypto/blake2s-core.S | 285 +++++++++++++++++++++++++++++++++
>  arch/arm/crypto/blake2s-glue.c |  78 +++++++++
>  4 files changed, 374 insertions(+)
>  create mode 100644 arch/arm/crypto/blake2s-core.S
>  create mode 100644 arch/arm/crypto/blake2s-glue.c
>
> diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
> index c9bf2df85cb90..281c829c12d0b 100644
> --- a/arch/arm/crypto/Kconfig
> +++ b/arch/arm/crypto/Kconfig
> @@ -62,6 +62,15 @@ config CRYPTO_SHA512_ARM
>           SHA-512 secure hash standard (DFIPS 180-2) implemented
>           using optimized ARM assembler and NEON, when available.
>
> +config CRYPTO_BLAKE2S_ARM
> +       tristate "BLAKE2s digest algorithm (ARM)"
> +       select CRYPTO_ARCH_HAVE_LIB_BLAKE2S
> +       help
> +         BLAKE2s digest algorithm optimized with ARM scalar instructions.  This
> +         is faster than the generic implementations of BLAKE2s and BLAKE2b, but
> +         slower than the NEON implementation of BLAKE2b.  (There is no NEON
> +         implementation of BLAKE2s, since NEON doesn't really help with it.)
> +
>  config CRYPTO_AES_ARM
>         tristate "Scalar AES cipher for ARM"
>         select CRYPTO_ALGAPI
> diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
> index b745c17d356fe..5ad1e985a718b 100644
> --- a/arch/arm/crypto/Makefile
> +++ b/arch/arm/crypto/Makefile
> @@ -9,6 +9,7 @@ obj-$(CONFIG_CRYPTO_SHA1_ARM) += sha1-arm.o
>  obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
>  obj-$(CONFIG_CRYPTO_SHA256_ARM) += sha256-arm.o
>  obj-$(CONFIG_CRYPTO_SHA512_ARM) += sha512-arm.o
> +obj-$(CONFIG_CRYPTO_BLAKE2S_ARM) += blake2s-arm.o
>  obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha-neon.o
>  obj-$(CONFIG_CRYPTO_POLY1305_ARM) += poly1305-arm.o
>  obj-$(CONFIG_CRYPTO_NHPOLY1305_NEON) += nhpoly1305-neon.o
> @@ -29,6 +30,7 @@ sha256-arm-neon-$(CONFIG_KERNEL_MODE_NEON) := sha256_neon_glue.o
>  sha256-arm-y   := sha256-core.o sha256_glue.o $(sha256-arm-neon-y)
>  sha512-arm-neon-$(CONFIG_KERNEL_MODE_NEON) := sha512-neon-glue.o
>  sha512-arm-y   := sha512-core.o sha512-glue.o $(sha512-arm-neon-y)
> +blake2s-arm-y   := blake2s-core.o blake2s-glue.o
>  sha1-arm-ce-y  := sha1-ce-core.o sha1-ce-glue.o
>  sha2-arm-ce-y  := sha2-ce-core.o sha2-ce-glue.o
>  aes-arm-ce-y   := aes-ce-core.o aes-ce-glue.o
> diff --git a/arch/arm/crypto/blake2s-core.S b/arch/arm/crypto/blake2s-core.S
> new file mode 100644
> index 0000000000000..bed897e9a181a
> --- /dev/null
> +++ b/arch/arm/crypto/blake2s-core.S
> @@ -0,0 +1,285 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * BLAKE2s digest algorithm, ARM scalar implementation
> + *
> + * Copyright 2020 Google LLC
> + *
> + * Author: Eric Biggers <ebiggers@google.com>
> + */
> +
> +#include <linux/linkage.h>
> +
> +       // Registers used to hold message words temporarily.  There aren't
> +       // enough ARM registers to hold the whole message block, so we have to
> +       // load the words on-demand.
> +       M_0             .req    r12
> +       M_1             .req    r14
> +
> +// The BLAKE2s initialization vector
> +.Lblake2s_IV:
> +       .word   0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A
> +       .word   0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19
> +
> +.macro __ldrd          a, b, src, offset
> +#if __LINUX_ARM_ARCH__ >= 6
> +       ldrd            \a, \b, [\src, #\offset]
> +#else
> +       ldr             \a, [\src, #\offset]
> +       ldr             \b, [\src, #\offset + 4]
> +#endif
> +.endm
> +
> +.macro __strd          a, b, dst, offset
> +#if __LINUX_ARM_ARCH__ >= 6
> +       strd            \a, \b, [\dst, #\offset]
> +#else
> +       str             \a, [\dst, #\offset]
> +       str             \b, [\dst, #\offset + 4]
> +#endif
> +.endm
> +
> +// Execute a quarter-round of BLAKE2s by mixing two columns or two diagonals.
> +// (a0, b0, c0, d0) and (a1, b1, c1, d1) give the registers containing the two
> +// columns/diagonals.  s0-s1 are the word offsets to the message words the first
> +// column/diagonal needs, and likewise s2-s3 for the second column/diagonal.
> +// M_0 and M_1 are free to use, and the message block can be found at sp + 32.
> +//
> +// Note that to save instructions, the rotations don't happen when the
> +// pseudocode says they should, but rather they are delayed until the values are
> +// used.  See the comment above _blake2s_round().
> +.macro _blake2s_quarterround  a0, b0, c0, d0,  a1, b1, c1, d1,  s0, s1, s2, s3
> +
> +       ldr             M_0, [sp, #32 + 4 * \s0]
> +       ldr             M_1, [sp, #32 + 4 * \s2]
> +
> +       // a += b + m[blake2s_sigma[r][2*i + 0]];
> +       add             \a0, \a0, \b0, ror #brot
> +       add             \a1, \a1, \b1, ror #brot
> +       add             \a0, \a0, M_0
> +       add             \a1, \a1, M_1
> +
> +       // d = ror32(d ^ a, 16);
> +       eor             \d0, \a0, \d0, ror #drot
> +       eor             \d1, \a1, \d1, ror #drot
> +
> +       // c += d;
> +       add             \c0, \c0, \d0, ror #16
> +       add             \c1, \c1, \d1, ror #16
> +
> +       // b = ror32(b ^ c, 12);
> +       eor             \b0, \c0, \b0, ror #brot
> +       eor             \b1, \c1, \b1, ror #brot
> +
> +       ldr             M_0, [sp, #32 + 4 * \s1]
> +       ldr             M_1, [sp, #32 + 4 * \s3]
> +
> +       // a += b + m[blake2s_sigma[r][2*i + 1]];
> +       add             \a0, \a0, \b0, ror #12
> +       add             \a1, \a1, \b1, ror #12
> +       add             \a0, \a0, M_0
> +       add             \a1, \a1, M_1
> +
> +       // d = ror32(d ^ a, 8);
> +       eor             \d0, \a0, \d0, ror#16
> +       eor             \d1, \a1, \d1, ror#16
> +
> +       // c += d;
> +       add             \c0, \c0, \d0, ror#8
> +       add             \c1, \c1, \d1, ror#8
> +
> +       // b = ror32(b ^ c, 7);
> +       eor             \b0, \c0, \b0, ror#12
> +       eor             \b1, \c1, \b1, ror#12
> +.endm
> +
> +// Execute one round of BLAKE2s by updating the state matrix v[0..15].  v[0..9]
> +// are in r0..r9.  The stack pointer points to 8 bytes of scratch space for
> +// spilling v[8..9], then to v[9..15], then to the message block.  r10-r12 and
> +// r14 are free to use.  The macro arguments s0-s15 give the order in which the
> +// message words are used in this round.
> +//
> +// All rotates are performed using the implicit rotate operand accepted by the
> +// 'add' and 'eor' instructions.  This is faster than using explicit rotate
> +// instructions.  To make this work, we allow the values in the second and last
> +// rows of the BLAKE2s state matrix (rows 'b' and 'd') to temporarily have the
> +// wrong rotation amount.  The rotation amount is then fixed up just in time
> +// when the values are used.  'brot' is the number of bits the values in row 'b'
> +// need to be rotated right to arrive at the correct values, and 'drot'
> +// similarly for row 'd'.  (brot, drot) start out as (0, 0) but we make it such
> +// that they end up as (7, 8) after every round.
> +.macro _blake2s_round  s0, s1, s2, s3, s4, s5, s6, s7, \
> +                       s8, s9, s10, s11, s12, s13, s14, s15
> +
> +       // Mix first two columns:
> +       // (v[0], v[4], v[8], v[12]) and (v[1], v[5], v[9], v[13]).
> +       __ldrd          r10, r11, sp, 16        // load v[12] and v[13]
> +       _blake2s_quarterround   r0, r4, r8, r10,  r1, r5, r9, r11, \
> +                               \s0, \s1, \s2, \s3
> +       __strd          r8, r9, sp, 0
> +       __strd          r10, r11, sp, 16
> +
> +       // Mix second two columns:
> +       // (v[2], v[6], v[10], v[14]) and (v[3], v[7], v[11], v[15]).
> +       __ldrd          r8, r9, sp, 8           // load v[10] and v[11]
> +       __ldrd          r10, r11, sp, 24        // load v[14] and v[15]
> +       _blake2s_quarterround   r2, r6, r8, r10,  r3, r7, r9, r11, \
> +                               \s4, \s5, \s6, \s7
> +       str             r10, [sp, #24]          // store v[14]
> +       // v[10], v[11], and v[15] are used below, so no need to store them yet.
> +
> +       .set brot, 7
> +       .set drot, 8
> +
> +       // Mix first two diagonals:
> +       // (v[0], v[5], v[10], v[15]) and (v[1], v[6], v[11], v[12]).
> +       ldr             r10, [sp, #16]          // load v[12]
> +       _blake2s_quarterround   r0, r5, r8, r11,  r1, r6, r9, r10, \
> +                               \s8, \s9, \s10, \s11
> +       __strd          r8, r9, sp, 8
> +       str             r11, [sp, #28]
> +       str             r10, [sp, #16]
> +
> +       // Mix second two diagonals:
> +       // (v[2], v[7], v[8], v[13]) and (v[3], v[4], v[9], v[14]).
> +       __ldrd          r8, r9, sp, 0           // load v[8] and v[9]
> +       __ldrd          r10, r11, sp, 20        // load v[13] and v[14]
> +       _blake2s_quarterround   r2, r7, r8, r10,  r3, r4, r9, r11, \
> +                               \s12, \s13, \s14, \s15
> +       __strd          r10, r11, sp, 20
> +.endm
> +
> +//
> +// void blake2s_compress_arch(struct blake2s_state *state,
> +//                           const u8 *block, size_t nblocks, u32 inc);
> +//
> +// Only the first three fields of struct blake2s_state are used:
> +//     u32 h[8];       (inout)
> +//     u32 t[2];       (inout)
> +//     u32 f[2];       (in)
> +//
> +       .align          5
> +ENTRY(blake2s_compress_arch)
> +       push            {r0-r2,r4-r11,lr}       // keep this an even number
> +
> +.Lnext_block:
> +       // r0 is 'state'
> +       // r1 is 'block'
> +       // r3 is 'inc'
> +
> +       // Load and increment the counter t[0..1].
> +       __ldrd          r10, r11, r0, 32
> +       adds            r10, r10, r3
> +       adc             r11, r11, #0
> +       __strd          r10, r11, r0, 32
> +
> +       // _blake2s_round is very short on registers, so copy the message block
> +       // to the stack to save a register during the rounds.  This also has the
> +       // advantage that misalignment only needs to be dealt with in one place.
> +       sub             sp, sp, #64
> +       mov             r12, sp
> +       tst             r1, #3
> +       bne             .Lcopy_block_misaligned
> +       ldmia           r1!, {r2-r9}
> +       stmia           r12!, {r2-r9}
> +       ldmia           r1!, {r2-r9}
> +       stmia           r12, {r2-r9}
> +.Lcopy_block_done:
> +       str             r1, [sp, #68]           // Update message pointer
> +
> +       // Calculate v[8..15].  Push v[9..15] onto the stack, and leave space
> +       // for spilling v[8..9].  Leave v[8..9] in r8-r9.
> +       mov             r14, r0                 // r14 = state
> +       adr             r12, .Lblake2s_IV
> +       ldmia           r12!, {r8-r9}           // load IV[0..1]
> +       __ldrd          r0, r1, r14, 40         // load f[0..1]
> +       ldm             r12, {r2-r7}            // load IV[3..7]
> +       eor             r4, r4, r10             // v[12] = IV[4] ^ t[0]
> +       eor             r5, r5, r11             // v[13] = IV[5] ^ t[1]
> +       eor             r6, r6, r0              // v[14] = IV[6] ^ f[0]
> +       eor             r7, r7, r1              // v[15] = IV[7] ^ f[1]
> +       push            {r2-r7}                 // push v[9..15]
> +       sub             sp, sp, #8              // leave space for v[8..9]
> +
> +       // Load h[0..7] == v[0..7].
> +       ldm             r14, {r0-r7}
> +
> +       // Execute the rounds.  Each round is provided the order in which it
> +       // needs to use the message words.
> +       .set brot, 0
> +       .set drot, 0
> +       _blake2s_round  0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
> +       _blake2s_round  14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3
> +       _blake2s_round  11, 8, 12, 0, 5, 2, 15, 13, 10, 14, 3, 6, 7, 1, 9, 4
> +       _blake2s_round  7, 9, 3, 1, 13, 12, 11, 14, 2, 6, 5, 10, 4, 0, 15, 8
> +       _blake2s_round  9, 0, 5, 7, 2, 4, 10, 15, 14, 1, 11, 12, 6, 8, 3, 13
> +       _blake2s_round  2, 12, 6, 10, 0, 11, 8, 3, 4, 13, 7, 5, 15, 14, 1, 9
> +       _blake2s_round  12, 5, 1, 15, 14, 13, 4, 10, 0, 7, 6, 3, 9, 2, 8, 11
> +       _blake2s_round  13, 11, 7, 14, 12, 1, 3, 9, 5, 0, 15, 4, 8, 6, 2, 10
> +       _blake2s_round  6, 15, 14, 9, 11, 3, 0, 8, 12, 2, 13, 7, 1, 4, 10, 5
> +       _blake2s_round  10, 2, 8, 4, 7, 6, 1, 5, 15, 11, 9, 14, 3, 12, 13, 0
> +
> +       // Fold the final state matrix into the hash chaining value:
> +       //
> +       //      for (i = 0; i < 8; i++)
> +       //              h[i] ^= v[i] ^ v[i + 8];
> +       //
> +       ldr             r14, [sp, #96]          // r14 = &h[0]
> +       add             sp, sp, #8              // v[8..9] are already loaded.
> +       pop             {r10-r11}               // load v[10..11]
> +       eor             r0, r0, r8
> +       eor             r1, r1, r9
> +       eor             r2, r2, r10
> +       eor             r3, r3, r11
> +       ldm             r14, {r8-r11}           // load h[0..3]
> +       eor             r0, r0, r8
> +       eor             r1, r1, r9
> +       eor             r2, r2, r10
> +       eor             r3, r3, r11
> +       stmia           r14!, {r0-r3}           // store new h[0..3]
> +       ldm             r14, {r0-r3}            // load old h[4..7]
> +       pop             {r8-r11}                // load v[12..15]
> +       eor             r0, r0, r4, ror #brot
> +       eor             r1, r1, r5, ror #brot
> +       eor             r2, r2, r6, ror #brot
> +       eor             r3, r3, r7, ror #brot
> +       eor             r0, r0, r8, ror #drot
> +       eor             r1, r1, r9, ror #drot
> +       eor             r2, r2, r10, ror #drot
> +       eor             r3, r3, r11, ror #drot
> +         add           sp, sp, #64             // skip copy of message block
> +       stm             r14, {r0-r3}            // store new h[4..7]
> +
> +       // Advance to the next block, if there is one.  Note that if there are
> +       // multiple blocks, then 'inc' (the counter increment amount) must be
> +       // 64.  So we can simply set it to 64 without re-loading it.
> +       ldm             sp, {r0, r1, r2}        // load (state, block, nblocks)
> +       mov             r3, #64                 // set 'inc'
> +       subs            r2, r2, #1              // nblocks--
> +       str             r2, [sp, #8]
> +       bne             .Lnext_block            // nblocks != 0?
> +
> +       pop             {r0-r2,r4-r11,pc}
> +
> +       // The next message block (pointed to by r1) isn't 4-byte aligned, so it
> +       // can't be loaded using ldmia.  Copy it to the stack buffer (pointed to
> +       // by r12) using an alternative method.  r2-r9 are free to use.
> +.Lcopy_block_misaligned:
> +       mov             r2, #64
> +1:
> +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> +       ldr             r3, [r1], #4
> +#else
> +       ldrb            r3, [r1, #0]
> +       ldrb            r4, [r1, #1]
> +       ldrb            r5, [r1, #2]
> +       ldrb            r6, [r1, #3]
> +       add             r1, r1, #4
> +       orr             r3, r3, r4, lsl #8
> +       orr             r3, r3, r5, lsl #16
> +       orr             r3, r3, r6, lsl #24
> +#endif
> +       subs            r2, r2, #4
> +       str             r3, [r12], #4
> +       bne             1b
> +       b               .Lcopy_block_done
> +ENDPROC(blake2s_compress_arch)
> diff --git a/arch/arm/crypto/blake2s-glue.c b/arch/arm/crypto/blake2s-glue.c
> new file mode 100644
> index 0000000000000..f2cc1e5fc9ec1
> --- /dev/null
> +++ b/arch/arm/crypto/blake2s-glue.c
> @@ -0,0 +1,78 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * BLAKE2s digest algorithm, ARM scalar implementation
> + *
> + * Copyright 2020 Google LLC
> + */
> +
> +#include <crypto/internal/blake2s.h>
> +#include <crypto/internal/hash.h>
> +
> +#include <linux/module.h>
> +
> +/* defined in blake2s-core.S */
> +EXPORT_SYMBOL(blake2s_compress_arch);
> +
> +static int crypto_blake2s_update_arm(struct shash_desc *desc,
> +                                    const u8 *in, unsigned int inlen)
> +{
> +       return crypto_blake2s_update(desc, in, inlen, blake2s_compress_arch);
> +}
> +
> +static int crypto_blake2s_final_arm(struct shash_desc *desc, u8 *out)
> +{
> +       return crypto_blake2s_final(desc, out, blake2s_compress_arch);
> +}
> +
> +#define BLAKE2S_ALG(name, driver_name, digest_size)                    \
> +       {                                                               \
> +               .base.cra_name          = name,                         \
> +               .base.cra_driver_name   = driver_name,                  \
> +               .base.cra_priority      = 200,                          \
> +               .base.cra_flags         = CRYPTO_ALG_OPTIONAL_KEY,      \
> +               .base.cra_blocksize     = BLAKE2S_BLOCK_SIZE,           \
> +               .base.cra_ctxsize       = sizeof(struct blake2s_tfm_ctx), \
> +               .base.cra_module        = THIS_MODULE,                  \
> +               .digestsize             = digest_size,                  \
> +               .setkey                 = crypto_blake2s_setkey,        \
> +               .init                   = crypto_blake2s_init,          \
> +               .update                 = crypto_blake2s_update_arm,    \
> +               .final                  = crypto_blake2s_final_arm,     \
> +               .descsize               = sizeof(struct blake2s_state), \
> +       }
> +
> +static struct shash_alg blake2s_arm_algs[] = {
> +       BLAKE2S_ALG("blake2s-128", "blake2s-128-arm", BLAKE2S_128_HASH_SIZE),
> +       BLAKE2S_ALG("blake2s-160", "blake2s-160-arm", BLAKE2S_160_HASH_SIZE),
> +       BLAKE2S_ALG("blake2s-224", "blake2s-224-arm", BLAKE2S_224_HASH_SIZE),
> +       BLAKE2S_ALG("blake2s-256", "blake2s-256-arm", BLAKE2S_256_HASH_SIZE),
> +};
> +
> +static int __init blake2s_arm_mod_init(void)
> +{
> +       return IS_REACHABLE(CONFIG_CRYPTO_HASH) ?
> +               crypto_register_shashes(blake2s_arm_algs,
> +                                       ARRAY_SIZE(blake2s_arm_algs)) : 0;
> +}
> +
> +static void __exit blake2s_arm_mod_exit(void)
> +{
> +       if (IS_REACHABLE(CONFIG_CRYPTO_HASH))
> +               crypto_unregister_shashes(blake2s_arm_algs,
> +                                         ARRAY_SIZE(blake2s_arm_algs));
> +}
> +
> +module_init(blake2s_arm_mod_init);
> +module_exit(blake2s_arm_mod_exit);
> +
> +MODULE_DESCRIPTION("BLAKE2s digest algorithm, ARM scalar implementation");
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Eric Biggers <ebiggers@google.com>");
> +MODULE_ALIAS_CRYPTO("blake2s-128");
> +MODULE_ALIAS_CRYPTO("blake2s-128-arm");
> +MODULE_ALIAS_CRYPTO("blake2s-160");
> +MODULE_ALIAS_CRYPTO("blake2s-160-arm");
> +MODULE_ALIAS_CRYPTO("blake2s-224");
> +MODULE_ALIAS_CRYPTO("blake2s-224-arm");
> +MODULE_ALIAS_CRYPTO("blake2s-256");
> +MODULE_ALIAS_CRYPTO("blake2s-256-arm");
> --
> 2.29.2
>

^ permalink raw reply	[flat|nested] 25+ messages in thread
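
The "free rotation" trick described above can be sketched in C as follows (illustrative
only, not from the patch): rather than storing row 'b' fully rotated after each step,
the assembly leaves it rotated by 'brot' bits and folds the correction into the rotated
second operand of the next add/eor, which costs nothing on ARM.

#include <stdint.h>
#include <stdio.h>

static inline uint32_t ror32(uint32_t v, unsigned int n)
{
        return (v >> n) | (v << (32 - n));
}

int main(void)
{
        uint32_t a = 0x12345678, b = 0x9ABCDEF0, c = 0x0F1E2D3C, m = 7;

        /* Straightforward form: rotate b into place, then use it. */
        uint32_t b1 = ror32(b ^ c, 12);
        uint32_t a1 = a + b1 + m;

        /*
         * Deferred form used by the assembly: keep b ^ c unrotated
         * ("brot" is now 12) and apply the rotation only when the value
         * is consumed, via the add's rotated second operand.
         */
        uint32_t b2 = b ^ c;
        uint32_t a2 = a + ror32(b2, 12) + m;

        printf("%s\n", a1 == a2 ? "equal" : "different");  /* prints "equal" */
        return 0;
}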

* Re: [PATCH v3 12/14] crypto: blake2b - sync with blake2s implementation
  2020-12-23  8:10 ` [PATCH v3 12/14] crypto: blake2b - sync with blake2s implementation Eric Biggers
@ 2020-12-23  9:09   ` Ard Biesheuvel
  0 siblings, 0 replies; 25+ messages in thread
From: Ard Biesheuvel @ 2020-12-23  9:09 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Linux Crypto Mailing List, Linux ARM, Herbert Xu, David Sterba,
	Jason A . Donenfeld, Paul Crowley

On Wed, 23 Dec 2020 at 09:12, Eric Biggers <ebiggers@kernel.org> wrote:
>
> From: Eric Biggers <ebiggers@google.com>
>
> Sync the BLAKE2b code with the BLAKE2s code as much as possible:
>
> - Move a lot of code into new headers <crypto/blake2b.h> and
>   <crypto/internal/blake2b.h>, and adjust it to be like the
>   corresponding BLAKE2s code, i.e. like <crypto/blake2s.h> and
>   <crypto/internal/blake2s.h>.
>
> - Rename constants, e.g. BLAKE2B_*_DIGEST_SIZE => BLAKE2B_*_HASH_SIZE.
>
> - Use a macro BLAKE2B_ALG() to define the shash_alg structs.
>
> - Export blake2b_compress_generic() for use as a fallback.
>
> This makes it much easier to add optimized implementations of BLAKE2b,
> as optimized implementations can use the helper functions
> crypto_blake2b_{setkey,init,update,final}() and
> blake2b_compress_generic().  The ARM implementation will use these.
>
> But this change is also helpful because it eliminates unnecessary
> differences between the BLAKE2b and BLAKE2s code, so that the same
> improvements can easily be made to both.  (The two algorithms are
> basically identical, except for the word size and constants.)  It also
> makes it straightforward to add a library API for BLAKE2b in the future
> if/when it's needed.
>
> This change does make the BLAKE2b code slightly more complicated than it
> needs to be, as it doesn't actually provide a library API yet.  For
> example, __blake2b_update() doesn't really need to exist yet; it could
> just be inlined into crypto_blake2b_update().  But I believe this is
> outweighed by the benefits of keeping the code in sync.
>
> Signed-off-by: Eric Biggers <ebiggers@google.com>

Acked-by: Ard Biesheuvel <ardb@kernel.org>

> ---
>  crypto/blake2b_generic.c          | 226 +++++++-----------------------
>  include/crypto/blake2b.h          |  67 +++++++++
>  include/crypto/internal/blake2b.h | 115 +++++++++++++++
>  3 files changed, 230 insertions(+), 178 deletions(-)
>  create mode 100644 include/crypto/blake2b.h
>  create mode 100644 include/crypto/internal/blake2b.h
>
> diff --git a/crypto/blake2b_generic.c b/crypto/blake2b_generic.c
> index a2ffe60e06d34..963f7fe0e4ea8 100644
> --- a/crypto/blake2b_generic.c
> +++ b/crypto/blake2b_generic.c
> @@ -20,36 +20,11 @@
>
>  #include <asm/unaligned.h>
>  #include <linux/module.h>
> -#include <linux/string.h>
>  #include <linux/kernel.h>
>  #include <linux/bitops.h>
> +#include <crypto/internal/blake2b.h>
>  #include <crypto/internal/hash.h>
>
> -#define BLAKE2B_160_DIGEST_SIZE                (160 / 8)
> -#define BLAKE2B_256_DIGEST_SIZE                (256 / 8)
> -#define BLAKE2B_384_DIGEST_SIZE                (384 / 8)
> -#define BLAKE2B_512_DIGEST_SIZE                (512 / 8)
> -
> -enum blake2b_constant {
> -       BLAKE2B_BLOCKBYTES    = 128,
> -       BLAKE2B_KEYBYTES      = 64,
> -};
> -
> -struct blake2b_state {
> -       u64      h[8];
> -       u64      t[2];
> -       u64      f[2];
> -       u8       buf[BLAKE2B_BLOCKBYTES];
> -       size_t   buflen;
> -};
> -
> -static const u64 blake2b_IV[8] = {
> -       0x6a09e667f3bcc908ULL, 0xbb67ae8584caa73bULL,
> -       0x3c6ef372fe94f82bULL, 0xa54ff53a5f1d36f1ULL,
> -       0x510e527fade682d1ULL, 0x9b05688c2b3e6c1fULL,
> -       0x1f83d9abfb41bd6bULL, 0x5be0cd19137e2179ULL
> -};
> -
>  static const u8 blake2b_sigma[12][16] = {
>         {  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15 },
>         { 14, 10,  4,  8,  9, 15, 13,  6,  1, 12,  0,  2, 11,  7,  5,  3 },
> @@ -95,8 +70,8 @@ static void blake2b_increment_counter(struct blake2b_state *S, const u64 inc)
>                 G(r,7,v[ 3],v[ 4],v[ 9],v[14]); \
>         } while (0)
>
> -static void blake2b_compress(struct blake2b_state *S,
> -                            const u8 block[BLAKE2B_BLOCKBYTES])
> +static void blake2b_compress_one_generic(struct blake2b_state *S,
> +                                        const u8 block[BLAKE2B_BLOCK_SIZE])
>  {
>         u64 m[16];
>         u64 v[16];
> @@ -108,14 +83,14 @@ static void blake2b_compress(struct blake2b_state *S,
>         for (i = 0; i < 8; ++i)
>                 v[i] = S->h[i];
>
> -       v[ 8] = blake2b_IV[0];
> -       v[ 9] = blake2b_IV[1];
> -       v[10] = blake2b_IV[2];
> -       v[11] = blake2b_IV[3];
> -       v[12] = blake2b_IV[4] ^ S->t[0];
> -       v[13] = blake2b_IV[5] ^ S->t[1];
> -       v[14] = blake2b_IV[6] ^ S->f[0];
> -       v[15] = blake2b_IV[7] ^ S->f[1];
> +       v[ 8] = BLAKE2B_IV0;
> +       v[ 9] = BLAKE2B_IV1;
> +       v[10] = BLAKE2B_IV2;
> +       v[11] = BLAKE2B_IV3;
> +       v[12] = BLAKE2B_IV4 ^ S->t[0];
> +       v[13] = BLAKE2B_IV5 ^ S->t[1];
> +       v[14] = BLAKE2B_IV6 ^ S->f[0];
> +       v[15] = BLAKE2B_IV7 ^ S->f[1];
>
>         ROUND(0);
>         ROUND(1);
> @@ -139,159 +114,54 @@ static void blake2b_compress(struct blake2b_state *S,
>  #undef G
>  #undef ROUND
>
> -struct blake2b_tfm_ctx {
> -       u8 key[BLAKE2B_KEYBYTES];
> -       unsigned int keylen;
> -};
> -
> -static int blake2b_setkey(struct crypto_shash *tfm, const u8 *key,
> -                         unsigned int keylen)
> +void blake2b_compress_generic(struct blake2b_state *state,
> +                             const u8 *block, size_t nblocks, u32 inc)
>  {
> -       struct blake2b_tfm_ctx *tctx = crypto_shash_ctx(tfm);
> -
> -       if (keylen == 0 || keylen > BLAKE2B_KEYBYTES)
> -               return -EINVAL;
> -
> -       memcpy(tctx->key, key, keylen);
> -       tctx->keylen = keylen;
> -
> -       return 0;
> +       do {
> +               blake2b_increment_counter(state, inc);
> +               blake2b_compress_one_generic(state, block);
> +               block += BLAKE2B_BLOCK_SIZE;
> +       } while (--nblocks);
>  }
> +EXPORT_SYMBOL(blake2b_compress_generic);
>
> -static int blake2b_init(struct shash_desc *desc)
> +static int crypto_blake2b_update_generic(struct shash_desc *desc,
> +                                        const u8 *in, unsigned int inlen)
>  {
> -       struct blake2b_tfm_ctx *tctx = crypto_shash_ctx(desc->tfm);
> -       struct blake2b_state *state = shash_desc_ctx(desc);
> -       const int digestsize = crypto_shash_digestsize(desc->tfm);
> -
> -       memset(state, 0, sizeof(*state));
> -       memcpy(state->h, blake2b_IV, sizeof(state->h));
> -
> -       /* Parameter block is all zeros except index 0, no xor for 1..7 */
> -       state->h[0] ^= 0x01010000 | tctx->keylen << 8 | digestsize;
> -
> -       if (tctx->keylen) {
> -               /*
> -                * Prefill the buffer with the key, next call to _update or
> -                * _final will process it
> -                */
> -               memcpy(state->buf, tctx->key, tctx->keylen);
> -               state->buflen = BLAKE2B_BLOCKBYTES;
> -       }
> -       return 0;
> +       return crypto_blake2b_update(desc, in, inlen, blake2b_compress_generic);
>  }
>
> -static int blake2b_update(struct shash_desc *desc, const u8 *in,
> -                         unsigned int inlen)
> +static int crypto_blake2b_final_generic(struct shash_desc *desc, u8 *out)
>  {
> -       struct blake2b_state *state = shash_desc_ctx(desc);
> -       const size_t left = state->buflen;
> -       const size_t fill = BLAKE2B_BLOCKBYTES - left;
> -
> -       if (!inlen)
> -               return 0;
> -
> -       if (inlen > fill) {
> -               state->buflen = 0;
> -               /* Fill buffer */
> -               memcpy(state->buf + left, in, fill);
> -               blake2b_increment_counter(state, BLAKE2B_BLOCKBYTES);
> -               /* Compress */
> -               blake2b_compress(state, state->buf);
> -               in += fill;
> -               inlen -= fill;
> -               while (inlen > BLAKE2B_BLOCKBYTES) {
> -                       blake2b_increment_counter(state, BLAKE2B_BLOCKBYTES);
> -                       blake2b_compress(state, in);
> -                       in += BLAKE2B_BLOCKBYTES;
> -                       inlen -= BLAKE2B_BLOCKBYTES;
> -               }
> -       }
> -       memcpy(state->buf + state->buflen, in, inlen);
> -       state->buflen += inlen;
> -
> -       return 0;
> +       return crypto_blake2b_final(desc, out, blake2b_compress_generic);
>  }
>
> -static int blake2b_final(struct shash_desc *desc, u8 *out)
> -{
> -       struct blake2b_state *state = shash_desc_ctx(desc);
> -       const int digestsize = crypto_shash_digestsize(desc->tfm);
> -       size_t i;
> -
> -       blake2b_increment_counter(state, state->buflen);
> -       /* Set last block */
> -       state->f[0] = (u64)-1;
> -       /* Padding */
> -       memset(state->buf + state->buflen, 0, BLAKE2B_BLOCKBYTES - state->buflen);
> -       blake2b_compress(state, state->buf);
> -
> -       /* Avoid temporary buffer and switch the internal output to LE order */
> -       for (i = 0; i < ARRAY_SIZE(state->h); i++)
> -               __cpu_to_le64s(&state->h[i]);
> -
> -       memcpy(out, state->h, digestsize);
> -       return 0;
> -}
> +#define BLAKE2B_ALG(name, driver_name, digest_size)                    \
> +       {                                                               \
> +               .base.cra_name          = name,                         \
> +               .base.cra_driver_name   = driver_name,                  \
> +               .base.cra_priority      = 100,                          \
> +               .base.cra_flags         = CRYPTO_ALG_OPTIONAL_KEY,      \
> +               .base.cra_blocksize     = BLAKE2B_BLOCK_SIZE,           \
> +               .base.cra_ctxsize       = sizeof(struct blake2b_tfm_ctx), \
> +               .base.cra_module        = THIS_MODULE,                  \
> +               .digestsize             = digest_size,                  \
> +               .setkey                 = crypto_blake2b_setkey,        \
> +               .init                   = crypto_blake2b_init,          \
> +               .update                 = crypto_blake2b_update_generic, \
> +               .final                  = crypto_blake2b_final_generic, \
> +               .descsize               = sizeof(struct blake2b_state), \
> +       }
>
>  static struct shash_alg blake2b_algs[] = {
> -       {
> -               .base.cra_name          = "blake2b-160",
> -               .base.cra_driver_name   = "blake2b-160-generic",
> -               .base.cra_priority      = 100,
> -               .base.cra_flags         = CRYPTO_ALG_OPTIONAL_KEY,
> -               .base.cra_blocksize     = BLAKE2B_BLOCKBYTES,
> -               .base.cra_ctxsize       = sizeof(struct blake2b_tfm_ctx),
> -               .base.cra_module        = THIS_MODULE,
> -               .digestsize             = BLAKE2B_160_DIGEST_SIZE,
> -               .setkey                 = blake2b_setkey,
> -               .init                   = blake2b_init,
> -               .update                 = blake2b_update,
> -               .final                  = blake2b_final,
> -               .descsize               = sizeof(struct blake2b_state),
> -       }, {
> -               .base.cra_name          = "blake2b-256",
> -               .base.cra_driver_name   = "blake2b-256-generic",
> -               .base.cra_priority      = 100,
> -               .base.cra_flags         = CRYPTO_ALG_OPTIONAL_KEY,
> -               .base.cra_blocksize     = BLAKE2B_BLOCKBYTES,
> -               .base.cra_ctxsize       = sizeof(struct blake2b_tfm_ctx),
> -               .base.cra_module        = THIS_MODULE,
> -               .digestsize             = BLAKE2B_256_DIGEST_SIZE,
> -               .setkey                 = blake2b_setkey,
> -               .init                   = blake2b_init,
> -               .update                 = blake2b_update,
> -               .final                  = blake2b_final,
> -               .descsize               = sizeof(struct blake2b_state),
> -       }, {
> -               .base.cra_name          = "blake2b-384",
> -               .base.cra_driver_name   = "blake2b-384-generic",
> -               .base.cra_priority      = 100,
> -               .base.cra_flags         = CRYPTO_ALG_OPTIONAL_KEY,
> -               .base.cra_blocksize     = BLAKE2B_BLOCKBYTES,
> -               .base.cra_ctxsize       = sizeof(struct blake2b_tfm_ctx),
> -               .base.cra_module        = THIS_MODULE,
> -               .digestsize             = BLAKE2B_384_DIGEST_SIZE,
> -               .setkey                 = blake2b_setkey,
> -               .init                   = blake2b_init,
> -               .update                 = blake2b_update,
> -               .final                  = blake2b_final,
> -               .descsize               = sizeof(struct blake2b_state),
> -       }, {
> -               .base.cra_name          = "blake2b-512",
> -               .base.cra_driver_name   = "blake2b-512-generic",
> -               .base.cra_priority      = 100,
> -               .base.cra_flags         = CRYPTO_ALG_OPTIONAL_KEY,
> -               .base.cra_blocksize     = BLAKE2B_BLOCKBYTES,
> -               .base.cra_ctxsize       = sizeof(struct blake2b_tfm_ctx),
> -               .base.cra_module        = THIS_MODULE,
> -               .digestsize             = BLAKE2B_512_DIGEST_SIZE,
> -               .setkey                 = blake2b_setkey,
> -               .init                   = blake2b_init,
> -               .update                 = blake2b_update,
> -               .final                  = blake2b_final,
> -               .descsize               = sizeof(struct blake2b_state),
> -       }
> +       BLAKE2B_ALG("blake2b-160", "blake2b-160-generic",
> +                   BLAKE2B_160_HASH_SIZE),
> +       BLAKE2B_ALG("blake2b-256", "blake2b-256-generic",
> +                   BLAKE2B_256_HASH_SIZE),
> +       BLAKE2B_ALG("blake2b-384", "blake2b-384-generic",
> +                   BLAKE2B_384_HASH_SIZE),
> +       BLAKE2B_ALG("blake2b-512", "blake2b-512-generic",
> +                   BLAKE2B_512_HASH_SIZE),
>  };
>
>  static int __init blake2b_mod_init(void)
> diff --git a/include/crypto/blake2b.h b/include/crypto/blake2b.h
> new file mode 100644
> index 0000000000000..18875f16f8cad
> --- /dev/null
> +++ b/include/crypto/blake2b.h
> @@ -0,0 +1,67 @@
> +/* SPDX-License-Identifier: GPL-2.0 OR MIT */
> +
> +#ifndef _CRYPTO_BLAKE2B_H
> +#define _CRYPTO_BLAKE2B_H
> +
> +#include <linux/bug.h>
> +#include <linux/types.h>
> +#include <linux/kernel.h>
> +#include <linux/string.h>
> +
> +enum blake2b_lengths {
> +       BLAKE2B_BLOCK_SIZE = 128,
> +       BLAKE2B_HASH_SIZE = 64,
> +       BLAKE2B_KEY_SIZE = 64,
> +
> +       BLAKE2B_160_HASH_SIZE = 20,
> +       BLAKE2B_256_HASH_SIZE = 32,
> +       BLAKE2B_384_HASH_SIZE = 48,
> +       BLAKE2B_512_HASH_SIZE = 64,
> +};
> +
> +struct blake2b_state {
> +       /* 'h', 't', and 'f' are used in assembly code, so keep them as-is. */
> +       u64 h[8];
> +       u64 t[2];
> +       u64 f[2];
> +       u8 buf[BLAKE2B_BLOCK_SIZE];
> +       unsigned int buflen;
> +       unsigned int outlen;
> +};
> +
> +enum blake2b_iv {
> +       BLAKE2B_IV0 = 0x6A09E667F3BCC908ULL,
> +       BLAKE2B_IV1 = 0xBB67AE8584CAA73BULL,
> +       BLAKE2B_IV2 = 0x3C6EF372FE94F82BULL,
> +       BLAKE2B_IV3 = 0xA54FF53A5F1D36F1ULL,
> +       BLAKE2B_IV4 = 0x510E527FADE682D1ULL,
> +       BLAKE2B_IV5 = 0x9B05688C2B3E6C1FULL,
> +       BLAKE2B_IV6 = 0x1F83D9ABFB41BD6BULL,
> +       BLAKE2B_IV7 = 0x5BE0CD19137E2179ULL,
> +};
> +
> +static inline void __blake2b_init(struct blake2b_state *state, size_t outlen,
> +                                 const void *key, size_t keylen)
> +{
> +       state->h[0] = BLAKE2B_IV0 ^ (0x01010000 | keylen << 8 | outlen);
> +       state->h[1] = BLAKE2B_IV1;
> +       state->h[2] = BLAKE2B_IV2;
> +       state->h[3] = BLAKE2B_IV3;
> +       state->h[4] = BLAKE2B_IV4;
> +       state->h[5] = BLAKE2B_IV5;
> +       state->h[6] = BLAKE2B_IV6;
> +       state->h[7] = BLAKE2B_IV7;
> +       state->t[0] = 0;
> +       state->t[1] = 0;
> +       state->f[0] = 0;
> +       state->f[1] = 0;
> +       state->buflen = 0;
> +       state->outlen = outlen;
> +       if (keylen) {
> +               memcpy(state->buf, key, keylen);
> +               memset(&state->buf[keylen], 0, BLAKE2B_BLOCK_SIZE - keylen);
> +               state->buflen = BLAKE2B_BLOCK_SIZE;
> +       }
> +}
> +
> +#endif /* _CRYPTO_BLAKE2B_H */
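A quick aside on the parameter-block encoding in __blake2b_init() above: h[0] folds the digest length, key length, fanout and depth into a single XOR against IV0. A minimal sketch of just that word (illustrative only, not part of the patch; the helper name is made up):

  /* Parameter word XORed into BLAKE2B_IV0 by __blake2b_init(): depth = 1, fanout = 1. */
  static inline u64 blake2b_demo_param_word(size_t outlen, size_t keylen)
  {
          return 0x01010000 | keylen << 8 | outlen;
  }

  /* blake2b-256, unkeyed:      BLAKE2B_IV0 ^ blake2b_demo_param_word(32, 0)  == BLAKE2B_IV0 ^ 0x01010020 */
  /* blake2b-512, 64-byte key:  BLAKE2B_IV0 ^ blake2b_demo_param_word(64, 64) == BLAKE2B_IV0 ^ 0x01014040 */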
> diff --git a/include/crypto/internal/blake2b.h b/include/crypto/internal/blake2b.h
> new file mode 100644
> index 0000000000000..982fe5e8471cd
> --- /dev/null
> +++ b/include/crypto/internal/blake2b.h
> @@ -0,0 +1,115 @@
> +/* SPDX-License-Identifier: GPL-2.0 OR MIT */
> +/*
> + * Helper functions for BLAKE2b implementations.
> + * Keep this in sync with the corresponding BLAKE2s header.
> + */
> +
> +#ifndef _CRYPTO_INTERNAL_BLAKE2B_H
> +#define _CRYPTO_INTERNAL_BLAKE2B_H
> +
> +#include <crypto/blake2b.h>
> +#include <crypto/internal/hash.h>
> +#include <linux/string.h>
> +
> +void blake2b_compress_generic(struct blake2b_state *state,
> +                             const u8 *block, size_t nblocks, u32 inc);
> +
> +static inline void blake2b_set_lastblock(struct blake2b_state *state)
> +{
> +       state->f[0] = -1;
> +}
> +
> +typedef void (*blake2b_compress_t)(struct blake2b_state *state,
> +                                  const u8 *block, size_t nblocks, u32 inc);
> +
> +static inline void __blake2b_update(struct blake2b_state *state,
> +                                   const u8 *in, size_t inlen,
> +                                   blake2b_compress_t compress)
> +{
> +       const size_t fill = BLAKE2B_BLOCK_SIZE - state->buflen;
> +
> +       if (unlikely(!inlen))
> +               return;
> +       if (inlen > fill) {
> +               memcpy(state->buf + state->buflen, in, fill);
> +               (*compress)(state, state->buf, 1, BLAKE2B_BLOCK_SIZE);
> +               state->buflen = 0;
> +               in += fill;
> +               inlen -= fill;
> +       }
> +       if (inlen > BLAKE2B_BLOCK_SIZE) {
> +               const size_t nblocks = DIV_ROUND_UP(inlen, BLAKE2B_BLOCK_SIZE);
> +               /* Hash one less (full) block than strictly possible */
> +               (*compress)(state, in, nblocks - 1, BLAKE2B_BLOCK_SIZE);
> +               in += BLAKE2B_BLOCK_SIZE * (nblocks - 1);
> +               inlen -= BLAKE2B_BLOCK_SIZE * (nblocks - 1);
> +       }
> +       memcpy(state->buf + state->buflen, in, inlen);
> +       state->buflen += inlen;
> +}
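  /*
   * Worked trace of __blake2b_update() above (illustrative, not part of the
   * patch), assuming state->buflen == 0 and a single 300-byte update:
   *
   *   fill = 128 - 0 = 128
   *   inlen (300) > fill:  copy 128 bytes into buf, compress it as one block,
   *                        then in += 128, inlen = 172
   *   inlen (172) > 128:   nblocks = DIV_ROUND_UP(172, 128) = 2, compress
   *                        nblocks - 1 = 1 full block straight from 'in',
   *                        then in += 128, inlen = 44
   *   tail:                memcpy 44 bytes into buf, buflen = 44
   *
   * Note that the last block of the message (even when it is full) always
   * stays buffered rather than compressed here, so that __blake2b_final()
   * can set the last-block flag before compressing it.
   */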
> +
> +static inline void __blake2b_final(struct blake2b_state *state, u8 *out,
> +                                  blake2b_compress_t compress)
> +{
> +       int i;
> +
> +       blake2b_set_lastblock(state);
> +       memset(state->buf + state->buflen, 0,
> +              BLAKE2B_BLOCK_SIZE - state->buflen); /* Padding */
> +       (*compress)(state, state->buf, 1, state->buflen);
> +       for (i = 0; i < ARRAY_SIZE(state->h); i++)
> +               __cpu_to_le64s(&state->h[i]);
> +       memcpy(out, state->h, state->outlen);
> +}
> +
> +/* Helper functions for shash implementations of BLAKE2b */
> +
> +struct blake2b_tfm_ctx {
> +       u8 key[BLAKE2B_KEY_SIZE];
> +       unsigned int keylen;
> +};
> +
> +static inline int crypto_blake2b_setkey(struct crypto_shash *tfm,
> +                                       const u8 *key, unsigned int keylen)
> +{
> +       struct blake2b_tfm_ctx *tctx = crypto_shash_ctx(tfm);
> +
> +       if (keylen == 0 || keylen > BLAKE2B_KEY_SIZE)
> +               return -EINVAL;
> +
> +       memcpy(tctx->key, key, keylen);
> +       tctx->keylen = keylen;
> +
> +       return 0;
> +}
> +
> +static inline int crypto_blake2b_init(struct shash_desc *desc)
> +{
> +       const struct blake2b_tfm_ctx *tctx = crypto_shash_ctx(desc->tfm);
> +       struct blake2b_state *state = shash_desc_ctx(desc);
> +       unsigned int outlen = crypto_shash_digestsize(desc->tfm);
> +
> +       __blake2b_init(state, outlen, tctx->key, tctx->keylen);
> +       return 0;
> +}
> +
> +static inline int crypto_blake2b_update(struct shash_desc *desc,
> +                                       const u8 *in, unsigned int inlen,
> +                                       blake2b_compress_t compress)
> +{
> +       struct blake2b_state *state = shash_desc_ctx(desc);
> +
> +       __blake2b_update(state, in, inlen, compress);
> +       return 0;
> +}
> +
> +static inline int crypto_blake2b_final(struct shash_desc *desc, u8 *out,
> +                                      blake2b_compress_t compress)
> +{
> +       struct blake2b_state *state = shash_desc_ctx(desc);
> +
> +       __blake2b_final(state, out, compress);
> +       return 0;
> +}
> +
> +#endif /* _CRYPTO_INTERNAL_BLAKE2B_H */
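To show how the helpers above compose outside the shash wrappers, here is a hedged sketch of a one-shot digest built directly on them, in the style of the BLAKE2s library API (nothing like this is added by the series; the function name is made up and argument validation is left to the caller):

  static void blake2b_demo_oneshot(u8 *out, size_t outlen,
                                   const void *key, size_t keylen,
                                   const u8 *in, size_t inlen)
  {
          /* Caller must ensure 0 < outlen <= BLAKE2B_HASH_SIZE and
           * keylen <= BLAKE2B_KEY_SIZE. */
          struct blake2b_state state;

          __blake2b_init(&state, outlen, key, keylen);
          __blake2b_update(&state, in, inlen, blake2b_compress_generic);
          __blake2b_final(&state, out, blake2b_compress_generic);
  }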
> --
> 2.29.2
>


* Re: [PATCH v3 14/14] crypto: arm/blake2b - add NEON-accelerated BLAKE2b
  2020-12-23  8:10 ` [PATCH v3 14/14] crypto: arm/blake2b - add NEON-accelerated BLAKE2b Eric Biggers
@ 2020-12-23  9:10   ` Ard Biesheuvel
  0 siblings, 0 replies; 25+ messages in thread
From: Ard Biesheuvel @ 2020-12-23  9:10 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Linux Crypto Mailing List, Linux ARM, Herbert Xu, David Sterba,
	Jason A . Donenfeld, Paul Crowley

On Wed, 23 Dec 2020 at 09:13, Eric Biggers <ebiggers@kernel.org> wrote:
>
> From: Eric Biggers <ebiggers@google.com>
>
> Add a NEON-accelerated implementation of BLAKE2b.
>
> On Cortex-A7 (which these days is the most common ARM processor that
> doesn't have the ARMv8 Crypto Extensions), this is over twice as fast as
> SHA-256, and slightly faster than SHA-1.  It is also almost three times
> as fast as the generic implementation of BLAKE2b:
>
>         Algorithm            Cycles per byte (on 4096-byte messages)
>         ===================  =======================================
>         blake2b-256-neon     14.0
>         sha1-neon            16.3
>         blake2s-256-arm      18.8
>         sha1-asm             20.8
>         blake2s-256-generic  26.0
>         sha256-neon          28.9
>         sha256-asm           32.0
>         blake2b-256-generic  38.9
>
> This implementation isn't directly based on any other implementation,
> but it borrows some ideas from previous NEON code I've written as well
> as from chacha-neon-core.S.  At least on Cortex-A7, it is faster than
> the other NEON implementations of BLAKE2b I'm aware of (the
> implementation in the BLAKE2 official repository using intrinsics, and
> Andrew Moon's implementation which can be found in SUPERCOP).  It does
> only one block at a time, so it performs well on short messages too.
>
> NEON-accelerated BLAKE2b is useful because there is interest in using
> BLAKE2b-256 for dm-verity on low-end Android devices (specifically,
> devices that lack the ARMv8 Crypto Extensions) to replace SHA-1.  On
> these devices, the performance cost of upgrading to SHA-256 may be
> unacceptable, whereas BLAKE2b-256 would actually improve performance.
>
> Although BLAKE2b is intended for 64-bit platforms (unlike BLAKE2s which
> is intended for 32-bit platforms), on 32-bit ARM processors with NEON,
> BLAKE2b is actually faster than BLAKE2s.  This is because NEON supports
> 64-bit operations, and because BLAKE2s's block size is too small for
> NEON to be helpful for it.  The best I've been able to do with BLAKE2s
> on Cortex-A7 is 18.8 cpb with an optimized scalar implementation.
>
> (I didn't try BLAKE2sp and BLAKE3, which in theory would be faster, but
> they're more complex as they require running multiple hashes at once.
> Note that BLAKE2b already uses all the NEON bandwidth on the Cortex-A7,
> so I expect that any speedup from BLAKE2sp or BLAKE3 would come only
> from the smaller number of rounds, not from the extra parallelism.)
>
> For now this BLAKE2b implementation is only wired up to the shash API,
> since there is no library API for BLAKE2b yet.  However, I've tried to
> keep things consistent with BLAKE2s, e.g. by defining
> blake2b_compress_arch() which is analogous to blake2s_compress_arch()
> and could be exported for use by the library API later if needed.
>
> Acked-by: Ard Biesheuvel <ardb@kernel.org>
> Signed-off-by: Eric Biggers <ebiggers@google.com>

Tested-by: Ard Biesheuvel <ardb@kernel.org>

> ---
>  arch/arm/crypto/Kconfig             |  10 +
>  arch/arm/crypto/Makefile            |   2 +
>  arch/arm/crypto/blake2b-neon-core.S | 347 ++++++++++++++++++++++++++++
>  arch/arm/crypto/blake2b-neon-glue.c | 105 +++++++++
>  4 files changed, 464 insertions(+)
>  create mode 100644 arch/arm/crypto/blake2b-neon-core.S
>  create mode 100644 arch/arm/crypto/blake2b-neon-glue.c
>
> diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
> index 281c829c12d0b..2b575792363e5 100644
> --- a/arch/arm/crypto/Kconfig
> +++ b/arch/arm/crypto/Kconfig
> @@ -71,6 +71,16 @@ config CRYPTO_BLAKE2S_ARM
>           slower than the NEON implementation of BLAKE2b.  (There is no NEON
>           implementation of BLAKE2s, since NEON doesn't really help with it.)
>
> +config CRYPTO_BLAKE2B_NEON
> +       tristate "BLAKE2b digest algorithm (ARM NEON)"
> +       depends on KERNEL_MODE_NEON
> +       select CRYPTO_BLAKE2B
> +       help
> +         BLAKE2b digest algorithm optimized with ARM NEON instructions.
> +         On ARM processors that have NEON support but not the ARMv8
> +         Crypto Extensions, typically this BLAKE2b implementation is
> +         much faster than SHA-2 and slightly faster than SHA-1.
> +
>  config CRYPTO_AES_ARM
>         tristate "Scalar AES cipher for ARM"
>         select CRYPTO_ALGAPI
> diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
> index 5ad1e985a718b..8f26c454ea12e 100644
> --- a/arch/arm/crypto/Makefile
> +++ b/arch/arm/crypto/Makefile
> @@ -10,6 +10,7 @@ obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
>  obj-$(CONFIG_CRYPTO_SHA256_ARM) += sha256-arm.o
>  obj-$(CONFIG_CRYPTO_SHA512_ARM) += sha512-arm.o
>  obj-$(CONFIG_CRYPTO_BLAKE2S_ARM) += blake2s-arm.o
> +obj-$(CONFIG_CRYPTO_BLAKE2B_NEON) += blake2b-neon.o
>  obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha-neon.o
>  obj-$(CONFIG_CRYPTO_POLY1305_ARM) += poly1305-arm.o
>  obj-$(CONFIG_CRYPTO_NHPOLY1305_NEON) += nhpoly1305-neon.o
> @@ -31,6 +32,7 @@ sha256-arm-y  := sha256-core.o sha256_glue.o $(sha256-arm-neon-y)
>  sha512-arm-neon-$(CONFIG_KERNEL_MODE_NEON) := sha512-neon-glue.o
>  sha512-arm-y   := sha512-core.o sha512-glue.o $(sha512-arm-neon-y)
>  blake2s-arm-y   := blake2s-core.o blake2s-glue.o
> +blake2b-neon-y  := blake2b-neon-core.o blake2b-neon-glue.o
>  sha1-arm-ce-y  := sha1-ce-core.o sha1-ce-glue.o
>  sha2-arm-ce-y  := sha2-ce-core.o sha2-ce-glue.o
>  aes-arm-ce-y   := aes-ce-core.o aes-ce-glue.o
> diff --git a/arch/arm/crypto/blake2b-neon-core.S b/arch/arm/crypto/blake2b-neon-core.S
> new file mode 100644
> index 0000000000000..0406a186377fb
> --- /dev/null
> +++ b/arch/arm/crypto/blake2b-neon-core.S
> @@ -0,0 +1,347 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * BLAKE2b digest algorithm, NEON accelerated
> + *
> + * Copyright 2020 Google LLC
> + *
> + * Author: Eric Biggers <ebiggers@google.com>
> + */
> +
> +#include <linux/linkage.h>
> +
> +       .text
> +       .fpu            neon
> +
> +       // The arguments to blake2b_compress_neon()
> +       STATE           .req    r0
> +       BLOCK           .req    r1
> +       NBLOCKS         .req    r2
> +       INC             .req    r3
> +
> +       // Pointers to the rotation tables
> +       ROR24_TABLE     .req    r4
> +       ROR16_TABLE     .req    r5
> +
> +       // The original stack pointer
> +       ORIG_SP         .req    r6
> +
> +       // NEON registers which contain the message words of the current block.
> +       // M_0-M_3 are occasionally used for other purposes too.
> +       M_0             .req    d16
> +       M_1             .req    d17
> +       M_2             .req    d18
> +       M_3             .req    d19
> +       M_4             .req    d20
> +       M_5             .req    d21
> +       M_6             .req    d22
> +       M_7             .req    d23
> +       M_8             .req    d24
> +       M_9             .req    d25
> +       M_10            .req    d26
> +       M_11            .req    d27
> +       M_12            .req    d28
> +       M_13            .req    d29
> +       M_14            .req    d30
> +       M_15            .req    d31
> +
> +       .align          4
> +       // Tables for computing ror64(x, 24) and ror64(x, 16) using the vtbl.8
> +       // instruction.  This is the most efficient way to implement these
> +       // rotation amounts with NEON.  (On Cortex-A53 it's the same speed as
> +       // vshr.u64 + vsli.u64, while on Cortex-A7 it's faster.)
> +.Lror24_table:
> +       .byte           3, 4, 5, 6, 7, 0, 1, 2
> +.Lror16_table:
> +       .byte           2, 3, 4, 5, 6, 7, 0, 1
> +       // The BLAKE2b initialization vector
> +.Lblake2b_IV:
> +       .quad           0x6a09e667f3bcc908, 0xbb67ae8584caa73b
> +       .quad           0x3c6ef372fe94f82b, 0xa54ff53a5f1d36f1
> +       .quad           0x510e527fade682d1, 0x9b05688c2b3e6c1f
> +       .quad           0x1f83d9abfb41bd6b, 0x5be0cd19137e2179
> +
> +// Execute one round of BLAKE2b by updating the state matrix v[0..15] in the
> +// NEON registers q0-q7.  The message block is in q8..q15 (M_0-M_15).  The stack
> +// pointer points to a 32-byte aligned buffer containing a copy of q8 and q9
> +// (M_0-M_3), so that they can be reloaded if they are used as temporary
> +// registers.  The macro arguments s0-s15 give the order in which the message
> +// words are used in this round.  'final' is 1 if this is the final round.
> +.macro _blake2b_round  s0, s1, s2, s3, s4, s5, s6, s7, \
> +                       s8, s9, s10, s11, s12, s13, s14, s15, final=0
> +
> +       // Mix the columns:
> +       // (v[0], v[4], v[8], v[12]), (v[1], v[5], v[9], v[13]),
> +       // (v[2], v[6], v[10], v[14]), and (v[3], v[7], v[11], v[15]).
> +
> +       // a += b + m[blake2b_sigma[r][2*i + 0]];
> +       vadd.u64        q0, q0, q2
> +       vadd.u64        q1, q1, q3
> +       vadd.u64        d0, d0, M_\s0
> +       vadd.u64        d1, d1, M_\s2
> +       vadd.u64        d2, d2, M_\s4
> +       vadd.u64        d3, d3, M_\s6
> +
> +       // d = ror64(d ^ a, 32);
> +       veor            q6, q6, q0
> +       veor            q7, q7, q1
> +       vrev64.32       q6, q6
> +       vrev64.32       q7, q7
> +
> +       // c += d;
> +       vadd.u64        q4, q4, q6
> +       vadd.u64        q5, q5, q7
> +
> +       // b = ror64(b ^ c, 24);
> +       vld1.8          {M_0}, [ROR24_TABLE, :64]
> +       veor            q2, q2, q4
> +       veor            q3, q3, q5
> +       vtbl.8          d4, {d4}, M_0
> +       vtbl.8          d5, {d5}, M_0
> +       vtbl.8          d6, {d6}, M_0
> +       vtbl.8          d7, {d7}, M_0
> +
> +       // a += b + m[blake2b_sigma[r][2*i + 1]];
> +       //
> +       // M_0 got clobbered above, so we have to reload it if any of the four
> +       // message words this step needs happens to be M_0.  Otherwise we don't
> +       // need to reload it here, as it will just get clobbered again below.
> +.if \s1 == 0 || \s3 == 0 || \s5 == 0 || \s7 == 0
> +       vld1.8          {M_0}, [sp, :64]
> +.endif
> +       vadd.u64        q0, q0, q2
> +       vadd.u64        q1, q1, q3
> +       vadd.u64        d0, d0, M_\s1
> +       vadd.u64        d1, d1, M_\s3
> +       vadd.u64        d2, d2, M_\s5
> +       vadd.u64        d3, d3, M_\s7
> +
> +       // d = ror64(d ^ a, 16);
> +       vld1.8          {M_0}, [ROR16_TABLE, :64]
> +       veor            q6, q6, q0
> +       veor            q7, q7, q1
> +       vtbl.8          d12, {d12}, M_0
> +       vtbl.8          d13, {d13}, M_0
> +       vtbl.8          d14, {d14}, M_0
> +       vtbl.8          d15, {d15}, M_0
> +
> +       // c += d;
> +       vadd.u64        q4, q4, q6
> +       vadd.u64        q5, q5, q7
> +
> +       // b = ror64(b ^ c, 63);
> +       //
> +       // This rotation amount isn't a multiple of 8, so it has to be
> +       // implemented using a pair of shifts, which requires temporary
> +       // registers.  Use q8-q9 (M_0-M_3) for this, and reload them afterwards.
> +       veor            q8, q2, q4
> +       veor            q9, q3, q5
> +       vshr.u64        q2, q8, #63
> +       vshr.u64        q3, q9, #63
> +       vsli.u64        q2, q8, #1
> +       vsli.u64        q3, q9, #1
> +       vld1.8          {q8-q9}, [sp, :256]
> +
> +       // Mix the diagonals:
> +       // (v[0], v[5], v[10], v[15]), (v[1], v[6], v[11], v[12]),
> +       // (v[2], v[7], v[8], v[13]), and (v[3], v[4], v[9], v[14]).
> +       //
> +       // There are two possible ways to do this: use 'vext' instructions to
> +       // shift the rows of the matrix so that the diagonals become columns,
> +       // and undo it afterwards; or just use 64-bit operations on 'd'
> +       // registers instead of 128-bit operations on 'q' registers.  We use the
> +       // latter approach, as it performs much better on Cortex-A7.
> +
> +       // a += b + m[blake2b_sigma[r][2*i + 0]];
> +       vadd.u64        d0, d0, d5
> +       vadd.u64        d1, d1, d6
> +       vadd.u64        d2, d2, d7
> +       vadd.u64        d3, d3, d4
> +       vadd.u64        d0, d0, M_\s8
> +       vadd.u64        d1, d1, M_\s10
> +       vadd.u64        d2, d2, M_\s12
> +       vadd.u64        d3, d3, M_\s14
> +
> +       // d = ror64(d ^ a, 32);
> +       veor            d15, d15, d0
> +       veor            d12, d12, d1
> +       veor            d13, d13, d2
> +       veor            d14, d14, d3
> +       vrev64.32       d15, d15
> +       vrev64.32       d12, d12
> +       vrev64.32       d13, d13
> +       vrev64.32       d14, d14
> +
> +       // c += d;
> +       vadd.u64        d10, d10, d15
> +       vadd.u64        d11, d11, d12
> +       vadd.u64        d8, d8, d13
> +       vadd.u64        d9, d9, d14
> +
> +       // b = ror64(b ^ c, 24);
> +       vld1.8          {M_0}, [ROR24_TABLE, :64]
> +       veor            d5, d5, d10
> +       veor            d6, d6, d11
> +       veor            d7, d7, d8
> +       veor            d4, d4, d9
> +       vtbl.8          d5, {d5}, M_0
> +       vtbl.8          d6, {d6}, M_0
> +       vtbl.8          d7, {d7}, M_0
> +       vtbl.8          d4, {d4}, M_0
> +
> +       // a += b + m[blake2b_sigma[r][2*i + 1]];
> +.if \s9 == 0 || \s11 == 0 || \s13 == 0 || \s15 == 0
> +       vld1.8          {M_0}, [sp, :64]
> +.endif
> +       vadd.u64        d0, d0, d5
> +       vadd.u64        d1, d1, d6
> +       vadd.u64        d2, d2, d7
> +       vadd.u64        d3, d3, d4
> +       vadd.u64        d0, d0, M_\s9
> +       vadd.u64        d1, d1, M_\s11
> +       vadd.u64        d2, d2, M_\s13
> +       vadd.u64        d3, d3, M_\s15
> +
> +       // d = ror64(d ^ a, 16);
> +       vld1.8          {M_0}, [ROR16_TABLE, :64]
> +       veor            d15, d15, d0
> +       veor            d12, d12, d1
> +       veor            d13, d13, d2
> +       veor            d14, d14, d3
> +       vtbl.8          d12, {d12}, M_0
> +       vtbl.8          d13, {d13}, M_0
> +       vtbl.8          d14, {d14}, M_0
> +       vtbl.8          d15, {d15}, M_0
> +
> +       // c += d;
> +       vadd.u64        d10, d10, d15
> +       vadd.u64        d11, d11, d12
> +       vadd.u64        d8, d8, d13
> +       vadd.u64        d9, d9, d14
> +
> +       // b = ror64(b ^ c, 63);
> +       veor            d16, d4, d9
> +       veor            d17, d5, d10
> +       veor            d18, d6, d11
> +       veor            d19, d7, d8
> +       vshr.u64        q2, q8, #63
> +       vshr.u64        q3, q9, #63
> +       vsli.u64        q2, q8, #1
> +       vsli.u64        q3, q9, #1
> +       // Reloading q8-q9 can be skipped on the final round.
> +.if ! \final
> +       vld1.8          {q8-q9}, [sp, :256]
> +.endif
> +.endm
> +
> +//
> +// void blake2b_compress_neon(struct blake2b_state *state,
> +//                           const u8 *block, size_t nblocks, u32 inc);
> +//
> +// Only the first three fields of struct blake2b_state are used:
> +//     u64 h[8];       (inout)
> +//     u64 t[2];       (inout)
> +//     u64 f[2];       (in)
> +//
> +       .align          5
> +ENTRY(blake2b_compress_neon)
> +       push            {r4-r10}
> +
> +       // Allocate a 32-byte stack buffer that is 32-byte aligned.
> +       mov             ORIG_SP, sp
> +       sub             ip, sp, #32
> +       bic             ip, ip, #31
> +       mov             sp, ip
> +
> +       adr             ROR24_TABLE, .Lror24_table
> +       adr             ROR16_TABLE, .Lror16_table
> +
> +       mov             ip, STATE
> +       vld1.64         {q0-q1}, [ip]!          // Load h[0..3]
> +       vld1.64         {q2-q3}, [ip]!          // Load h[4..7]
> +.Lnext_block:
> +         adr           r10, .Lblake2b_IV
> +       vld1.64         {q14-q15}, [ip]         // Load t[0..1] and f[0..1]
> +       vld1.64         {q4-q5}, [r10]!         // Load IV[0..3]
> +         vmov          r7, r8, d28             // Copy t[0] to (r7, r8)
> +       vld1.64         {q6-q7}, [r10]          // Load IV[4..7]
> +         adds          r7, r7, INC             // Increment counter
> +       bcs             .Lslow_inc_ctr
> +       vmov.i32        d28[0], r7
> +       vst1.64         {d28}, [ip]             // Update t[0]
> +.Linc_ctr_done:
> +
> +       // Load the next message block and finish initializing the state matrix
> +       // 'v'.  Fortunately, there are exactly enough NEON registers to fit the
> +       // entire state matrix in q0-q7 and the entire message block in q8-q15.
> +       //
> +       // However, _blake2b_round also needs some extra registers for rotates,
> +       // so we have to spill some registers.  It's better to spill the message
> +       // registers than the state registers, as the message doesn't change.
> +       // Therefore we store a copy of the first 32 bytes of the message block
> +       // (q8-q9) in an aligned buffer on the stack so that they can be
> +       // reloaded when needed.  (We could just reload directly from the
> +       // message buffer, but it's faster to use aligned loads.)
> +       vld1.8          {q8-q9}, [BLOCK]!
> +         veor          q6, q6, q14     // v[12..13] = IV[4..5] ^ t[0..1]
> +       vld1.8          {q10-q11}, [BLOCK]!
> +         veor          q7, q7, q15     // v[14..15] = IV[6..7] ^ f[0..1]
> +       vld1.8          {q12-q13}, [BLOCK]!
> +       vst1.8          {q8-q9}, [sp, :256]
> +         mov           ip, STATE
> +       vld1.8          {q14-q15}, [BLOCK]!
> +
> +       // Execute the rounds.  Each round is provided the order in which it
> +       // needs to use the message words.
> +       _blake2b_round  0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
> +       _blake2b_round  14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3
> +       _blake2b_round  11, 8, 12, 0, 5, 2, 15, 13, 10, 14, 3, 6, 7, 1, 9, 4
> +       _blake2b_round  7, 9, 3, 1, 13, 12, 11, 14, 2, 6, 5, 10, 4, 0, 15, 8
> +       _blake2b_round  9, 0, 5, 7, 2, 4, 10, 15, 14, 1, 11, 12, 6, 8, 3, 13
> +       _blake2b_round  2, 12, 6, 10, 0, 11, 8, 3, 4, 13, 7, 5, 15, 14, 1, 9
> +       _blake2b_round  12, 5, 1, 15, 14, 13, 4, 10, 0, 7, 6, 3, 9, 2, 8, 11
> +       _blake2b_round  13, 11, 7, 14, 12, 1, 3, 9, 5, 0, 15, 4, 8, 6, 2, 10
> +       _blake2b_round  6, 15, 14, 9, 11, 3, 0, 8, 12, 2, 13, 7, 1, 4, 10, 5
> +       _blake2b_round  10, 2, 8, 4, 7, 6, 1, 5, 15, 11, 9, 14, 3, 12, 13, 0
> +       _blake2b_round  0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
> +       _blake2b_round  14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3 \
> +                       final=1
> +
> +       // Fold the final state matrix into the hash chaining value:
> +       //
> +       //      for (i = 0; i < 8; i++)
> +       //              h[i] ^= v[i] ^ v[i + 8];
> +       //
> +         vld1.64       {q8-q9}, [ip]!          // Load old h[0..3]
> +       veor            q0, q0, q4              // v[0..1] ^= v[8..9]
> +       veor            q1, q1, q5              // v[2..3] ^= v[10..11]
> +         vld1.64       {q10-q11}, [ip]         // Load old h[4..7]
> +       veor            q2, q2, q6              // v[4..5] ^= v[12..13]
> +       veor            q3, q3, q7              // v[6..7] ^= v[14..15]
> +       veor            q0, q0, q8              // v[0..1] ^= h[0..1]
> +       veor            q1, q1, q9              // v[2..3] ^= h[2..3]
> +         mov           ip, STATE
> +         subs          NBLOCKS, NBLOCKS, #1    // nblocks--
> +         vst1.64       {q0-q1}, [ip]!          // Store new h[0..3]
> +       veor            q2, q2, q10             // v[4..5] ^= h[4..5]
> +       veor            q3, q3, q11             // v[6..7] ^= h[6..7]
> +         vst1.64       {q2-q3}, [ip]!          // Store new h[4..7]
> +
> +       // Advance to the next block, if there is one.
> +       bne             .Lnext_block            // nblocks != 0?
> +
> +       mov             sp, ORIG_SP
> +       pop             {r4-r10}
> +       mov             pc, lr
> +
> +.Lslow_inc_ctr:
> +       // Handle the case where the counter overflowed its low 32 bits, by
> +       // carrying the overflow bit into the full 128-bit counter.
> +       vmov            r9, r10, d29
> +       adcs            r8, r8, #0
> +       adcs            r9, r9, #0
> +       adc             r10, r10, #0
> +       vmov            d28, r7, r8
> +       vmov            d29, r9, r10
> +       vst1.64         {q14}, [ip]             // Update t[0] and t[1]
> +       b               .Linc_ctr_done
> +ENDPROC(blake2b_compress_neon)
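The .Lror24_table / .Lror16_table lookups above work because rotating a 64-bit word by a multiple of 8 bits is just a byte permutation, which vtbl.8 can do in one instruction per d-register. A rough C equivalent of one such lookup (illustrative only, not part of the patch):

  #include <linux/string.h>
  #include <linux/types.h>

  /*
   * What a vtbl.8 through .Lror24_table = {3, 4, 5, 6, 7, 0, 1, 2} computes:
   * output byte i comes from input byte (i + 3) % 8, which on a little-endian
   * CPU is ror64(x, 24).
   */
  static u64 demo_ror64_24(u64 x)
  {
          u8 in[8], out[8];
          int i;

          memcpy(in, &x, 8);              /* little-endian byte order, as on ARM */
          for (i = 0; i < 8; i++)
                  out[i] = in[(i + 3) % 8];
          memcpy(&x, out, 8);
          return x;                       /* == (x >> 24) | (x << 40) */
  }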
> diff --git a/arch/arm/crypto/blake2b-neon-glue.c b/arch/arm/crypto/blake2b-neon-glue.c
> new file mode 100644
> index 0000000000000..34d73200e7fa6
> --- /dev/null
> +++ b/arch/arm/crypto/blake2b-neon-glue.c
> @@ -0,0 +1,105 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * BLAKE2b digest algorithm, NEON accelerated
> + *
> + * Copyright 2020 Google LLC
> + */
> +
> +#include <crypto/internal/blake2b.h>
> +#include <crypto/internal/hash.h>
> +#include <crypto/internal/simd.h>
> +
> +#include <linux/module.h>
> +#include <linux/sizes.h>
> +
> +#include <asm/neon.h>
> +#include <asm/simd.h>
> +
> +asmlinkage void blake2b_compress_neon(struct blake2b_state *state,
> +                                     const u8 *block, size_t nblocks, u32 inc);
> +
> +static void blake2b_compress_arch(struct blake2b_state *state,
> +                                 const u8 *block, size_t nblocks, u32 inc)
> +{
> +       if (!crypto_simd_usable()) {
> +               blake2b_compress_generic(state, block, nblocks, inc);
> +               return;
> +       }
> +
> +       do {
> +               const size_t blocks = min_t(size_t, nblocks,
> +                                           SZ_4K / BLAKE2B_BLOCK_SIZE);
> +
> +               kernel_neon_begin();
> +               blake2b_compress_neon(state, block, blocks, inc);
> +               kernel_neon_end();
> +
> +               nblocks -= blocks;
> +               block += blocks * BLAKE2B_BLOCK_SIZE;
> +       } while (nblocks);
> +}
> +
> +static int crypto_blake2b_update_neon(struct shash_desc *desc,
> +                                     const u8 *in, unsigned int inlen)
> +{
> +       return crypto_blake2b_update(desc, in, inlen, blake2b_compress_arch);
> +}
> +
> +static int crypto_blake2b_final_neon(struct shash_desc *desc, u8 *out)
> +{
> +       return crypto_blake2b_final(desc, out, blake2b_compress_arch);
> +}
> +
> +#define BLAKE2B_ALG(name, driver_name, digest_size)                    \
> +       {                                                               \
> +               .base.cra_name          = name,                         \
> +               .base.cra_driver_name   = driver_name,                  \
> +               .base.cra_priority      = 200,                          \
> +               .base.cra_flags         = CRYPTO_ALG_OPTIONAL_KEY,      \
> +               .base.cra_blocksize     = BLAKE2B_BLOCK_SIZE,           \
> +               .base.cra_ctxsize       = sizeof(struct blake2b_tfm_ctx), \
> +               .base.cra_module        = THIS_MODULE,                  \
> +               .digestsize             = digest_size,                  \
> +               .setkey                 = crypto_blake2b_setkey,        \
> +               .init                   = crypto_blake2b_init,          \
> +               .update                 = crypto_blake2b_update_neon,   \
> +               .final                  = crypto_blake2b_final_neon,    \
> +               .descsize               = sizeof(struct blake2b_state), \
> +       }
> +
> +static struct shash_alg blake2b_neon_algs[] = {
> +       BLAKE2B_ALG("blake2b-160", "blake2b-160-neon", BLAKE2B_160_HASH_SIZE),
> +       BLAKE2B_ALG("blake2b-256", "blake2b-256-neon", BLAKE2B_256_HASH_SIZE),
> +       BLAKE2B_ALG("blake2b-384", "blake2b-384-neon", BLAKE2B_384_HASH_SIZE),
> +       BLAKE2B_ALG("blake2b-512", "blake2b-512-neon", BLAKE2B_512_HASH_SIZE),
> +};
> +
> +static int __init blake2b_neon_mod_init(void)
> +{
> +       if (!(elf_hwcap & HWCAP_NEON))
> +               return -ENODEV;
> +
> +       return crypto_register_shashes(blake2b_neon_algs,
> +                                      ARRAY_SIZE(blake2b_neon_algs));
> +}
> +
> +static void __exit blake2b_neon_mod_exit(void)
> +{
> +       return crypto_unregister_shashes(blake2b_neon_algs,
> +                                        ARRAY_SIZE(blake2b_neon_algs));
> +}
> +
> +module_init(blake2b_neon_mod_init);
> +module_exit(blake2b_neon_mod_exit);
> +
> +MODULE_DESCRIPTION("BLAKE2b digest algorithm, NEON accelerated");
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Eric Biggers <ebiggers@google.com>");
> +MODULE_ALIAS_CRYPTO("blake2b-160");
> +MODULE_ALIAS_CRYPTO("blake2b-160-neon");
> +MODULE_ALIAS_CRYPTO("blake2b-256");
> +MODULE_ALIAS_CRYPTO("blake2b-256-neon");
> +MODULE_ALIAS_CRYPTO("blake2b-384");
> +MODULE_ALIAS_CRYPTO("blake2b-384-neon");
> +MODULE_ALIAS_CRYPTO("blake2b-512");
> +MODULE_ALIAS_CRYPTO("blake2b-512-neon");
> --
> 2.29.2
>
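Two notes on the glue code quoted above. First, blake2b_compress_arch() caps each kernel_neon_begin()/kernel_neon_end() section at SZ_4K / BLAKE2B_BLOCK_SIZE = 32 blocks, the usual pattern for bounding how long preemption stays disabled while the kernel is using NEON. Second, since these algorithms register at cra_priority 200 while the generic driver uses 100, a caller that simply requests "blake2b-256" gets the NEON implementation whenever this module is loaded. A hedged sketch of such a caller (not part of the patchset; the function name is made up and error handling is kept minimal):

  #include <crypto/hash.h>
  #include <linux/err.h>

  static int demo_blake2b_256_digest(const u8 *data, unsigned int len, u8 *digest)
  {
          struct crypto_shash *tfm;
          int err;

          /* Resolves to blake2b-256-neon when registered, else blake2b-256-generic. */
          tfm = crypto_alloc_shash("blake2b-256", 0, 0);
          if (IS_ERR(tfm))
                  return PTR_ERR(tfm);

          {
                  SHASH_DESC_ON_STACK(desc, tfm);

                  desc->tfm = tfm;
                  err = crypto_shash_digest(desc, data, len, digest);
          }

          crypto_free_shash(tfm);
          return err;
  }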


* Re: [PATCH v3 00/14] crypto: arm32-optimized BLAKE2b and BLAKE2s
  2020-12-23  8:09 [PATCH v3 00/14] crypto: arm32-optimized BLAKE2b and BLAKE2s Eric Biggers
                   ` (13 preceding siblings ...)
  2020-12-23  8:10 ` [PATCH v3 14/14] crypto: arm/blake2b - add NEON-accelerated BLAKE2b Eric Biggers
@ 2021-01-02 22:09 ` Herbert Xu
  14 siblings, 0 replies; 25+ messages in thread
From: Herbert Xu @ 2021-01-02 22:09 UTC (permalink / raw)
  To: Eric Biggers
  Cc: linux-crypto, linux-arm-kernel, Ard Biesheuvel, David Sterba,
	Jason A . Donenfeld, Paul Crowley

On Wed, Dec 23, 2020 at 12:09:49AM -0800, Eric Biggers wrote:
> This patchset adds 32-bit ARM assembly language implementations of
> BLAKE2b and BLAKE2s.

All applied.  Thanks.
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


