* [PATCH 0/5] crypto: add NEON-optimized BLAKE2b
@ 2020-12-15 23:47 ` Eric Biggers
  0 siblings, 0 replies; 34+ messages in thread
From: Eric Biggers @ 2020-12-15 23:47 UTC
  To: linux-crypto
  Cc: linux-arm-kernel, Ard Biesheuvel, Herbert Xu, David Sterba,
	Jason A . Donenfeld, Paul Crowley

This patchset adds a NEON implementation of BLAKE2b for 32-bit ARM.
Patches 1-4 prepare for it by making some updates to the generic
implementation, while patch 5 adds the actual NEON implementation.
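
For reference, kernel users reach the result through the regular crypto
API by allocating "blake2b-256" (or one of the other digest sizes) by
name.  A minimal sketch of such a caller follows; blake2b_256_digest()
is just a hypothetical example helper, and it assumes the NEON driver
registers with a higher cra_priority than the generic one so that it is
selected automatically:

	#include <crypto/hash.h>

	static int blake2b_256_digest(const u8 *data, unsigned int len,
				      u8 out[32])
	{
		struct crypto_shash *tfm;
		int err;

		/* Allocates the highest-priority "blake2b-256" driver. */
		tfm = crypto_alloc_shash("blake2b-256", 0, 0);
		if (IS_ERR(tfm))
			return PTR_ERR(tfm);

		err = crypto_shash_tfm_digest(tfm, data, len, out);

		crypto_free_shash(tfm);
		return err;
	}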

On Cortex-A7 (which these days is the most common ARM processor that
doesn't have the ARMv8 Crypto Extensions), this is over twice as fast as
SHA-256, and slightly faster than SHA-1.  It is also almost three times
as fast as the generic implementation of BLAKE2b:

	Algorithm            Cycles per byte (on 4096-byte messages)
	===================  =======================================
	blake2b-256-neon     14.1
	sha1-neon            16.4
	sha1-asm             20.8
	blake2s-256-generic  26.1
	sha256-neon          28.9
	sha256-asm           32.1
	blake2b-256-generic  39.9

This implementation isn't directly based on any other implementation,
but it borrows some ideas from previous NEON code I've written as well
as from chacha-neon-core.S.  At least on Cortex-A7, it is faster than
the other NEON implementations of BLAKE2b I'm aware of (the
implementation in the BLAKE2 official repository using intrinsics, and
Andrew Moon's implementation which can be found in SUPERCOP).

NEON-optimized BLAKE2b is useful because there is interest in using
BLAKE2b-256 for dm-verity on low-end Android devices (specifically,
devices that lack the ARMv8 Crypto Extensions) to replace SHA-1.  On
these devices, the performance cost of upgrading to SHA-256 may be
unacceptable, whereas BLAKE2b-256 would actually improve performance.

Although BLAKE2b is intended for 64-bit platforms (unlike BLAKE2s which
is intended for 32-bit platforms), on 32-bit ARM processors with NEON,
BLAKE2b is actually faster than BLAKE2s.  This is because NEON supports
64-bit operations, and because BLAKE2s's block size is too small for
NEON to be helpful for it.  The best I've been able to do with BLAKE2s
on Cortex-A7 is 19.0 cpb with an optimized scalar implementation.
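
To make the 64-bit point concrete, here is the G function from
crypto/blake2b_generic.c.  Every operation in it is a 64-bit add, XOR,
or rotate, and each 128-bit NEON operation performs two of them at
once, which is how the NEON code mixes two columns per q register:

	#define G(r, i, a, b, c, d)				\
	do {							\
		a = a + b + m[blake2b_sigma[r][2*i + 0]];	\
		d = ror64(d ^ a, 32);				\
		c = c + d;					\
		b = ror64(b ^ c, 24);				\
		a = a + b + m[blake2b_sigma[r][2*i + 1]];	\
		d = ror64(d ^ a, 16);				\
		c = c + d;					\
		b = ror64(b ^ c, 63);				\
	} while (0)

(For comparison, BLAKE2s's G has the same structure but operates on
32-bit words, with rotation amounts 16, 12, 8, and 7.)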

(I didn't try BLAKE2sp and BLAKE3, which in theory would be faster, but
they're more complex as they require running multiple hashes at once.
Note that BLAKE2b already uses all the NEON bandwidth on the Cortex-A7,
so I expect that any speedup from BLAKE2sp or BLAKE3 would come only
from the smaller number of rounds, not from the extra parallelism.)

This patchset was tested on a Raspberry Pi 2, including with
CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y.

Eric Biggers (5):
  crypto: blake2b - rename constants for consistency with blake2s
  crypto: blake2b - define shash_alg structs using macros
  crypto: blake2b - export helpers for optimized implementations
  crypto: blake2b - update file comment
  crypto: arm/blake2b - add NEON-optimized BLAKE2b implementation

 arch/arm/crypto/Kconfig             |  10 +
 arch/arm/crypto/Makefile            |   2 +
 arch/arm/crypto/blake2b-neon-core.S | 357 ++++++++++++++++++++++++++++
 arch/arm/crypto/blake2b-neon-glue.c | 105 ++++++++
 crypto/blake2b_generic.c            | 205 +++++++---------
 include/crypto/blake2b.h            |  54 +++++
 6 files changed, 619 insertions(+), 114 deletions(-)
 create mode 100644 arch/arm/crypto/blake2b-neon-core.S
 create mode 100644 arch/arm/crypto/blake2b-neon-glue.c
 create mode 100644 include/crypto/blake2b.h


base-commit: 3db1a3fa98808aa90f95ec3e0fa2fc7abf28f5c9
-- 
2.29.2


* [PATCH 1/5] crypto: blake2b - rename constants for consistency with blake2s
  2020-12-15 23:47 ` Eric Biggers
@ 2020-12-15 23:47   ` Eric Biggers
  -1 siblings, 0 replies; 34+ messages in thread
From: Eric Biggers @ 2020-12-15 23:47 UTC
  To: linux-crypto
  Cc: linux-arm-kernel, Ard Biesheuvel, Herbert Xu, David Sterba,
	Jason A . Donenfeld, Paul Crowley

From: Eric Biggers <ebiggers@google.com>

Rename some BLAKE2b-related constants to be consistent with the names
used in the BLAKE2s implementation (see include/crypto/blake2s.h):

	BLAKE2B_*_DIGEST_SIZE  => BLAKE2B_*_HASH_SIZE
	BLAKE2B_BLOCKBYTES     => BLAKE2B_BLOCK_SIZE
	BLAKE2B_KEYBYTES       => BLAKE2B_KEY_SIZE

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 crypto/blake2b_generic.c | 58 +++++++++++++++++++++-------------------
 1 file changed, 30 insertions(+), 28 deletions(-)

diff --git a/crypto/blake2b_generic.c b/crypto/blake2b_generic.c
index a2ffe60e06d34..83942f511075e 100644
--- a/crypto/blake2b_generic.c
+++ b/crypto/blake2b_generic.c
@@ -25,21 +25,22 @@
 #include <linux/bitops.h>
 #include <crypto/internal/hash.h>
 
-#define BLAKE2B_160_DIGEST_SIZE		(160 / 8)
-#define BLAKE2B_256_DIGEST_SIZE		(256 / 8)
-#define BLAKE2B_384_DIGEST_SIZE		(384 / 8)
-#define BLAKE2B_512_DIGEST_SIZE		(512 / 8)
-
-enum blake2b_constant {
-	BLAKE2B_BLOCKBYTES    = 128,
-	BLAKE2B_KEYBYTES      = 64,
+
+enum blake2b_lengths {
+	BLAKE2B_BLOCK_SIZE = 128,
+	BLAKE2B_KEY_SIZE = 64,
+
+	BLAKE2B_160_HASH_SIZE = 20,
+	BLAKE2B_256_HASH_SIZE =	32,
+	BLAKE2B_384_HASH_SIZE = 48,
+	BLAKE2B_512_HASH_SIZE = 64,
 };
 
 struct blake2b_state {
 	u64      h[8];
 	u64      t[2];
 	u64      f[2];
-	u8       buf[BLAKE2B_BLOCKBYTES];
+	u8       buf[BLAKE2B_BLOCK_SIZE];
 	size_t   buflen;
 };
 
@@ -96,7 +97,7 @@ static void blake2b_increment_counter(struct blake2b_state *S, const u64 inc)
 	} while (0)
 
 static void blake2b_compress(struct blake2b_state *S,
-			     const u8 block[BLAKE2B_BLOCKBYTES])
+			     const u8 block[BLAKE2B_BLOCK_SIZE])
 {
 	u64 m[16];
 	u64 v[16];
@@ -140,7 +141,7 @@ static void blake2b_compress(struct blake2b_state *S,
 #undef ROUND
 
 struct blake2b_tfm_ctx {
-	u8 key[BLAKE2B_KEYBYTES];
+	u8 key[BLAKE2B_KEY_SIZE];
 	unsigned int keylen;
 };
 
@@ -149,7 +150,7 @@ static int blake2b_setkey(struct crypto_shash *tfm, const u8 *key,
 {
 	struct blake2b_tfm_ctx *tctx = crypto_shash_ctx(tfm);
 
-	if (keylen == 0 || keylen > BLAKE2B_KEYBYTES)
+	if (keylen == 0 || keylen > BLAKE2B_KEY_SIZE)
 		return -EINVAL;
 
 	memcpy(tctx->key, key, keylen);
@@ -176,7 +177,7 @@ static int blake2b_init(struct shash_desc *desc)
 		 * _final will process it
 		 */
 		memcpy(state->buf, tctx->key, tctx->keylen);
-		state->buflen = BLAKE2B_BLOCKBYTES;
+		state->buflen = BLAKE2B_BLOCK_SIZE;
 	}
 	return 0;
 }
@@ -186,7 +187,7 @@ static int blake2b_update(struct shash_desc *desc, const u8 *in,
 {
 	struct blake2b_state *state = shash_desc_ctx(desc);
 	const size_t left = state->buflen;
-	const size_t fill = BLAKE2B_BLOCKBYTES - left;
+	const size_t fill = BLAKE2B_BLOCK_SIZE - left;
 
 	if (!inlen)
 		return 0;
@@ -195,16 +196,16 @@ static int blake2b_update(struct shash_desc *desc, const u8 *in,
 		state->buflen = 0;
 		/* Fill buffer */
 		memcpy(state->buf + left, in, fill);
-		blake2b_increment_counter(state, BLAKE2B_BLOCKBYTES);
+		blake2b_increment_counter(state, BLAKE2B_BLOCK_SIZE);
 		/* Compress */
 		blake2b_compress(state, state->buf);
 		in += fill;
 		inlen -= fill;
-		while (inlen > BLAKE2B_BLOCKBYTES) {
-			blake2b_increment_counter(state, BLAKE2B_BLOCKBYTES);
+		while (inlen > BLAKE2B_BLOCK_SIZE) {
+			blake2b_increment_counter(state, BLAKE2B_BLOCK_SIZE);
 			blake2b_compress(state, in);
-			in += BLAKE2B_BLOCKBYTES;
-			inlen -= BLAKE2B_BLOCKBYTES;
+			in += BLAKE2B_BLOCK_SIZE;
+			inlen -= BLAKE2B_BLOCK_SIZE;
 		}
 	}
 	memcpy(state->buf + state->buflen, in, inlen);
@@ -223,7 +224,8 @@ static int blake2b_final(struct shash_desc *desc, u8 *out)
 	/* Set last block */
 	state->f[0] = (u64)-1;
 	/* Padding */
-	memset(state->buf + state->buflen, 0, BLAKE2B_BLOCKBYTES - state->buflen);
+	memset(state->buf + state->buflen, 0,
+	       BLAKE2B_BLOCK_SIZE - state->buflen);
 	blake2b_compress(state, state->buf);
 
 	/* Avoid temporary buffer and switch the internal output to LE order */
@@ -240,10 +242,10 @@ static struct shash_alg blake2b_algs[] = {
 		.base.cra_driver_name	= "blake2b-160-generic",
 		.base.cra_priority	= 100,
 		.base.cra_flags		= CRYPTO_ALG_OPTIONAL_KEY,
-		.base.cra_blocksize	= BLAKE2B_BLOCKBYTES,
+		.base.cra_blocksize	= BLAKE2B_BLOCK_SIZE,
 		.base.cra_ctxsize	= sizeof(struct blake2b_tfm_ctx),
 		.base.cra_module	= THIS_MODULE,
-		.digestsize		= BLAKE2B_160_DIGEST_SIZE,
+		.digestsize		= BLAKE2B_160_HASH_SIZE,
 		.setkey			= blake2b_setkey,
 		.init			= blake2b_init,
 		.update			= blake2b_update,
@@ -254,10 +256,10 @@ static struct shash_alg blake2b_algs[] = {
 		.base.cra_driver_name	= "blake2b-256-generic",
 		.base.cra_priority	= 100,
 		.base.cra_flags		= CRYPTO_ALG_OPTIONAL_KEY,
-		.base.cra_blocksize	= BLAKE2B_BLOCKBYTES,
+		.base.cra_blocksize	= BLAKE2B_BLOCK_SIZE,
 		.base.cra_ctxsize	= sizeof(struct blake2b_tfm_ctx),
 		.base.cra_module	= THIS_MODULE,
-		.digestsize		= BLAKE2B_256_DIGEST_SIZE,
+		.digestsize		= BLAKE2B_256_HASH_SIZE,
 		.setkey			= blake2b_setkey,
 		.init			= blake2b_init,
 		.update			= blake2b_update,
@@ -268,10 +270,10 @@ static struct shash_alg blake2b_algs[] = {
 		.base.cra_driver_name	= "blake2b-384-generic",
 		.base.cra_priority	= 100,
 		.base.cra_flags		= CRYPTO_ALG_OPTIONAL_KEY,
-		.base.cra_blocksize	= BLAKE2B_BLOCKBYTES,
+		.base.cra_blocksize	= BLAKE2B_BLOCK_SIZE,
 		.base.cra_ctxsize	= sizeof(struct blake2b_tfm_ctx),
 		.base.cra_module	= THIS_MODULE,
-		.digestsize		= BLAKE2B_384_DIGEST_SIZE,
+		.digestsize		= BLAKE2B_384_HASH_SIZE,
 		.setkey			= blake2b_setkey,
 		.init			= blake2b_init,
 		.update			= blake2b_update,
@@ -282,10 +284,10 @@ static struct shash_alg blake2b_algs[] = {
 		.base.cra_driver_name	= "blake2b-512-generic",
 		.base.cra_priority	= 100,
 		.base.cra_flags		= CRYPTO_ALG_OPTIONAL_KEY,
-		.base.cra_blocksize	= BLAKE2B_BLOCKBYTES,
+		.base.cra_blocksize	= BLAKE2B_BLOCK_SIZE,
 		.base.cra_ctxsize	= sizeof(struct blake2b_tfm_ctx),
 		.base.cra_module	= THIS_MODULE,
-		.digestsize		= BLAKE2B_512_DIGEST_SIZE,
+		.digestsize		= BLAKE2B_512_HASH_SIZE,
 		.setkey			= blake2b_setkey,
 		.init			= blake2b_init,
 		.update			= blake2b_update,
-- 
2.29.2


* [PATCH 2/5] crypto: blake2b - define shash_alg structs using macros
  2020-12-15 23:47 ` Eric Biggers
@ 2020-12-15 23:47   ` Eric Biggers
  -1 siblings, 0 replies; 34+ messages in thread
From: Eric Biggers @ 2020-12-15 23:47 UTC
  To: linux-crypto
  Cc: linux-arm-kernel, Ard Biesheuvel, Herbert Xu, David Sterba,
	Jason A . Donenfeld, Paul Crowley

From: Eric Biggers <ebiggers@google.com>

The shash_alg structs for the four variants of BLAKE2b are identical
except for the algorithm name, driver name, and digest size.  So, avoid
code duplication by using a macro to define these structs.

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 crypto/blake2b_generic.c | 82 ++++++++++++----------------------------
 1 file changed, 25 insertions(+), 57 deletions(-)

diff --git a/crypto/blake2b_generic.c b/crypto/blake2b_generic.c
index 83942f511075e..0e38e3e48297c 100644
--- a/crypto/blake2b_generic.c
+++ b/crypto/blake2b_generic.c
@@ -236,64 +236,32 @@ static int blake2b_final(struct shash_desc *desc, u8 *out)
 	return 0;
 }
 
-static struct shash_alg blake2b_algs[] = {
-	{
-		.base.cra_name		= "blake2b-160",
-		.base.cra_driver_name	= "blake2b-160-generic",
-		.base.cra_priority	= 100,
-		.base.cra_flags		= CRYPTO_ALG_OPTIONAL_KEY,
-		.base.cra_blocksize	= BLAKE2B_BLOCK_SIZE,
-		.base.cra_ctxsize	= sizeof(struct blake2b_tfm_ctx),
-		.base.cra_module	= THIS_MODULE,
-		.digestsize		= BLAKE2B_160_HASH_SIZE,
-		.setkey			= blake2b_setkey,
-		.init			= blake2b_init,
-		.update			= blake2b_update,
-		.final			= blake2b_final,
-		.descsize		= sizeof(struct blake2b_state),
-	}, {
-		.base.cra_name		= "blake2b-256",
-		.base.cra_driver_name	= "blake2b-256-generic",
-		.base.cra_priority	= 100,
-		.base.cra_flags		= CRYPTO_ALG_OPTIONAL_KEY,
-		.base.cra_blocksize	= BLAKE2B_BLOCK_SIZE,
-		.base.cra_ctxsize	= sizeof(struct blake2b_tfm_ctx),
-		.base.cra_module	= THIS_MODULE,
-		.digestsize		= BLAKE2B_256_HASH_SIZE,
-		.setkey			= blake2b_setkey,
-		.init			= blake2b_init,
-		.update			= blake2b_update,
-		.final			= blake2b_final,
-		.descsize		= sizeof(struct blake2b_state),
-	}, {
-		.base.cra_name		= "blake2b-384",
-		.base.cra_driver_name	= "blake2b-384-generic",
-		.base.cra_priority	= 100,
-		.base.cra_flags		= CRYPTO_ALG_OPTIONAL_KEY,
-		.base.cra_blocksize	= BLAKE2B_BLOCK_SIZE,
-		.base.cra_ctxsize	= sizeof(struct blake2b_tfm_ctx),
-		.base.cra_module	= THIS_MODULE,
-		.digestsize		= BLAKE2B_384_HASH_SIZE,
-		.setkey			= blake2b_setkey,
-		.init			= blake2b_init,
-		.update			= blake2b_update,
-		.final			= blake2b_final,
-		.descsize		= sizeof(struct blake2b_state),
-	}, {
-		.base.cra_name		= "blake2b-512",
-		.base.cra_driver_name	= "blake2b-512-generic",
-		.base.cra_priority	= 100,
-		.base.cra_flags		= CRYPTO_ALG_OPTIONAL_KEY,
-		.base.cra_blocksize	= BLAKE2B_BLOCK_SIZE,
-		.base.cra_ctxsize	= sizeof(struct blake2b_tfm_ctx),
-		.base.cra_module	= THIS_MODULE,
-		.digestsize		= BLAKE2B_512_HASH_SIZE,
-		.setkey			= blake2b_setkey,
-		.init			= blake2b_init,
-		.update			= blake2b_update,
-		.final			= blake2b_final,
-		.descsize		= sizeof(struct blake2b_state),
+#define BLAKE2B_ALG(name, driver_name, digest_size)			\
+	{								\
+		.base.cra_name		= name,				\
+		.base.cra_driver_name	= driver_name,			\
+		.base.cra_priority	= 100,				\
+		.base.cra_flags		= CRYPTO_ALG_OPTIONAL_KEY,	\
+		.base.cra_blocksize	= BLAKE2B_BLOCK_SIZE,		\
+		.base.cra_ctxsize	= sizeof(struct blake2b_tfm_ctx), \
+		.base.cra_module	= THIS_MODULE,			\
+		.digestsize		= digest_size,			\
+		.setkey			= blake2b_setkey,		\
+		.init			= blake2b_init,			\
+		.update			= blake2b_update,		\
+		.final			= blake2b_final,		\
+		.descsize		= sizeof(struct blake2b_state),	\
 	}
+
+static struct shash_alg blake2b_algs[] = {
+	BLAKE2B_ALG("blake2b-160", "blake2b-160-generic",
+		    BLAKE2B_160_HASH_SIZE),
+	BLAKE2B_ALG("blake2b-256", "blake2b-256-generic",
+		    BLAKE2B_256_HASH_SIZE),
+	BLAKE2B_ALG("blake2b-384", "blake2b-384-generic",
+		    BLAKE2B_384_HASH_SIZE),
+	BLAKE2B_ALG("blake2b-512", "blake2b-512-generic",
+		    BLAKE2B_512_HASH_SIZE),
 };
 
 static int __init blake2b_mod_init(void)
-- 
2.29.2


* [PATCH 3/5] crypto: blake2b - export helpers for optimized implementations
  2020-12-15 23:47 ` Eric Biggers
@ 2020-12-15 23:47   ` Eric Biggers
  -1 siblings, 0 replies; 34+ messages in thread
From: Eric Biggers @ 2020-12-15 23:47 UTC
  To: linux-crypto
  Cc: linux-arm-kernel, Ard Biesheuvel, Herbert Xu, David Sterba,
	Jason A . Donenfeld, Paul Crowley

From: Eric Biggers <ebiggers@google.com>

In preparation for adding architecture-specific implementations of
BLAKE2b, create a header <crypto/blake2b.h> that contains common
constants, structs, and helper functions for BLAKE2b.

Furthermore, export the BLAKE2b generic setkey(), init(), update(), and
final() functions, and add functions __crypto_blake2b_update() and
__crypto_blake2b_final() which take a pointer to a
blake2b_compress_blocks_t function.

This way, optimized implementations of BLAKE2b only have to provide an
implementation of blake2b_compress_blocks_t.  (This is modeled on how
the nhpoly1305 implementations work.  Also, the prototype of
blake2b_compress_blocks_t is meant to be similar to that of
blake2s_compress_arch().)
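
Concretely, the glue code for an optimized implementation reduces to
something like the sketch below.  This is a simplified illustration
rather than the exact code from patch 5 (blake2b_update_neon() is a
placeholder name, and a production version would also want to bound how
long it stays inside a kernel_neon_begin()/kernel_neon_end() section):

	#include <asm/neon.h>
	#include <crypto/blake2b.h>
	#include <crypto/internal/hash.h>
	#include <crypto/internal/simd.h>

	/* Implemented in blake2b-neon-core.S */
	asmlinkage void blake2b_compress_blocks_neon(struct blake2b_state *S,
						     const u8 *in,
						     size_t nblocks,
						     unsigned int inc);

	static int blake2b_update_neon(struct shash_desc *desc,
				       const u8 *in, unsigned int inlen)
	{
		int err;

		/* Fall back to the generic code when NEON is unusable,
		 * e.g. in hardirq context. */
		if (!crypto_simd_usable())
			return crypto_blake2b_update(desc, in, inlen);

		kernel_neon_begin();
		err = __crypto_blake2b_update(desc, in, inlen,
					      blake2b_compress_blocks_neon);
		kernel_neon_end();
		return err;
	}

Note also that __crypto_blake2b_update() deliberately keeps at least
one byte buffered ("one less block than strictly possible") so that
->final() always has a last block in which to set the finalization
flag f[0].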

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 crypto/blake2b_generic.c | 100 +++++++++++++++++++++------------------
 include/crypto/blake2b.h |  54 +++++++++++++++++++++
 2 files changed, 109 insertions(+), 45 deletions(-)
 create mode 100644 include/crypto/blake2b.h

diff --git a/crypto/blake2b_generic.c b/crypto/blake2b_generic.c
index 0e38e3e48297c..fd5159b7aa9f2 100644
--- a/crypto/blake2b_generic.c
+++ b/crypto/blake2b_generic.c
@@ -23,27 +23,9 @@
 #include <linux/string.h>
 #include <linux/kernel.h>
 #include <linux/bitops.h>
+#include <crypto/blake2b.h>
 #include <crypto/internal/hash.h>
 
-
-enum blake2b_lengths {
-	BLAKE2B_BLOCK_SIZE = 128,
-	BLAKE2B_KEY_SIZE = 64,
-
-	BLAKE2B_160_HASH_SIZE = 20,
-	BLAKE2B_256_HASH_SIZE =	32,
-	BLAKE2B_384_HASH_SIZE = 48,
-	BLAKE2B_512_HASH_SIZE = 64,
-};
-
-struct blake2b_state {
-	u64      h[8];
-	u64      t[2];
-	u64      f[2];
-	u8       buf[BLAKE2B_BLOCK_SIZE];
-	size_t   buflen;
-};
-
 static const u64 blake2b_IV[8] = {
 	0x6a09e667f3bcc908ULL, 0xbb67ae8584caa73bULL,
 	0x3c6ef372fe94f82bULL, 0xa54ff53a5f1d36f1ULL,
@@ -96,8 +78,8 @@ static void blake2b_increment_counter(struct blake2b_state *S, const u64 inc)
 		G(r,7,v[ 3],v[ 4],v[ 9],v[14]); \
 	} while (0)
 
-static void blake2b_compress(struct blake2b_state *S,
-			     const u8 block[BLAKE2B_BLOCK_SIZE])
+static void blake2b_compress_generic(struct blake2b_state *S,
+				     const u8 block[BLAKE2B_BLOCK_SIZE])
 {
 	u64 m[16];
 	u64 v[16];
@@ -140,12 +122,18 @@ static void blake2b_compress(struct blake2b_state *S,
 #undef G
 #undef ROUND
 
-struct blake2b_tfm_ctx {
-	u8 key[BLAKE2B_KEY_SIZE];
-	unsigned int keylen;
-};
+static void blake2b_compress_blocks_generic(struct blake2b_state *S,
+					    const u8 *in, size_t nblocks,
+					    unsigned int inc)
+{
+	do {
+		blake2b_increment_counter(S, inc);
+		blake2b_compress_generic(S, in);
+		in += BLAKE2B_BLOCK_SIZE;
+	} while (--nblocks);
+}
 
-static int blake2b_setkey(struct crypto_shash *tfm, const u8 *key,
+int crypto_blake2b_setkey(struct crypto_shash *tfm, const u8 *key,
 			  unsigned int keylen)
 {
 	struct blake2b_tfm_ctx *tctx = crypto_shash_ctx(tfm);
@@ -158,8 +146,9 @@ static int blake2b_setkey(struct crypto_shash *tfm, const u8 *key,
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(crypto_blake2b_setkey);
 
-static int blake2b_init(struct shash_desc *desc)
+int crypto_blake2b_init(struct shash_desc *desc)
 {
 	struct blake2b_tfm_ctx *tctx = crypto_shash_ctx(desc->tfm);
 	struct blake2b_state *state = shash_desc_ctx(desc);
@@ -181,9 +170,19 @@ static int blake2b_init(struct shash_desc *desc)
 	}
 	return 0;
 }
+EXPORT_SYMBOL_GPL(crypto_blake2b_init);
+
+int crypto_blake2b_update(struct shash_desc *desc,
+			  const u8 *in, unsigned int inlen)
+{
+	return __crypto_blake2b_update(desc, in, inlen,
+				       blake2b_compress_blocks_generic);
+}
+EXPORT_SYMBOL_GPL(crypto_blake2b_update);
 
-static int blake2b_update(struct shash_desc *desc, const u8 *in,
-			  unsigned int inlen)
+int __crypto_blake2b_update(struct shash_desc *desc,
+			    const u8 *in, unsigned int inlen,
+			    blake2b_compress_blocks_t compress)
 {
 	struct blake2b_state *state = shash_desc_ctx(desc);
 	const size_t left = state->buflen;
@@ -194,18 +193,19 @@ static int blake2b_update(struct shash_desc *desc, const u8 *in,
 
 	if (inlen > fill) {
 		state->buflen = 0;
-		/* Fill buffer */
 		memcpy(state->buf + left, in, fill);
-		blake2b_increment_counter(state, BLAKE2B_BLOCK_SIZE);
-		/* Compress */
-		blake2b_compress(state, state->buf);
+		(*compress)(state, state->buf, 1, BLAKE2B_BLOCK_SIZE);
 		in += fill;
 		inlen -= fill;
-		while (inlen > BLAKE2B_BLOCK_SIZE) {
-			blake2b_increment_counter(state, BLAKE2B_BLOCK_SIZE);
-			blake2b_compress(state, in);
-			in += BLAKE2B_BLOCK_SIZE;
-			inlen -= BLAKE2B_BLOCK_SIZE;
+		if (inlen > BLAKE2B_BLOCK_SIZE) {
+			/* Hash one less (full) block than strictly possible */
+			size_t nbytes = round_up(inlen - BLAKE2B_BLOCK_SIZE,
+						 BLAKE2B_BLOCK_SIZE);
+
+			(*compress)(state, in, nbytes / BLAKE2B_BLOCK_SIZE,
+				    BLAKE2B_BLOCK_SIZE);
+			in += nbytes;
+			inlen -= nbytes;
 		}
 	}
 	memcpy(state->buf + state->buflen, in, inlen);
@@ -213,20 +213,29 @@ static int blake2b_update(struct shash_desc *desc, const u8 *in,
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(__crypto_blake2b_update);
 
-static int blake2b_final(struct shash_desc *desc, u8 *out)
+int crypto_blake2b_final(struct shash_desc *desc, u8 *out)
+{
+	return __crypto_blake2b_final(desc, out,
+				      blake2b_compress_blocks_generic);
+}
+EXPORT_SYMBOL_GPL(crypto_blake2b_final);
+
+int __crypto_blake2b_final(struct shash_desc *desc, u8 *out,
+			   blake2b_compress_blocks_t compress)
 {
 	struct blake2b_state *state = shash_desc_ctx(desc);
 	const int digestsize = crypto_shash_digestsize(desc->tfm);
 	size_t i;
 
-	blake2b_increment_counter(state, state->buflen);
 	/* Set last block */
 	state->f[0] = (u64)-1;
 	/* Padding */
 	memset(state->buf + state->buflen, 0,
 	       BLAKE2B_BLOCK_SIZE - state->buflen);
-	blake2b_compress(state, state->buf);
+
+	(*compress)(state, state->buf, 1, state->buflen);
 
 	/* Avoid temporary buffer and switch the internal output to LE order */
 	for (i = 0; i < ARRAY_SIZE(state->h); i++)
@@ -235,6 +244,7 @@ static int blake2b_final(struct shash_desc *desc, u8 *out)
 	memcpy(out, state->h, digestsize);
 	return 0;
 }
+EXPORT_SYMBOL_GPL(__crypto_blake2b_final);
 
 #define BLAKE2B_ALG(name, driver_name, digest_size)			\
 	{								\
@@ -246,10 +256,10 @@ static int blake2b_final(struct shash_desc *desc, u8 *out)
 		.base.cra_ctxsize	= sizeof(struct blake2b_tfm_ctx), \
 		.base.cra_module	= THIS_MODULE,			\
 		.digestsize		= digest_size,			\
-		.setkey			= blake2b_setkey,		\
-		.init			= blake2b_init,			\
-		.update			= blake2b_update,		\
-		.final			= blake2b_final,		\
+		.setkey			= crypto_blake2b_setkey,	\
+		.init			= crypto_blake2b_init,		\
+		.update			= crypto_blake2b_update,	\
+		.final			= crypto_blake2b_final,		\
 		.descsize		= sizeof(struct blake2b_state),	\
 	}
 
diff --git a/include/crypto/blake2b.h b/include/crypto/blake2b.h
new file mode 100644
index 0000000000000..ced3ba5b45c8d
--- /dev/null
+++ b/include/crypto/blake2b.h
@@ -0,0 +1,54 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#ifndef _CRYPTO_BLAKE2B_H
+#define _CRYPTO_BLAKE2B_H
+
+#include <linux/types.h>
+
+enum blake2b_lengths {
+	BLAKE2B_BLOCK_SIZE = 128,
+	BLAKE2B_KEY_SIZE = 64,
+
+	BLAKE2B_160_HASH_SIZE = 20,
+	BLAKE2B_256_HASH_SIZE =	32,
+	BLAKE2B_384_HASH_SIZE = 48,
+	BLAKE2B_512_HASH_SIZE = 64,
+};
+
+struct blake2b_state {
+	/* 'h', 't', and 'f' are used in assembly code, so keep them as-is. */
+	u64      h[8];
+	u64      t[2];
+	u64      f[2];
+	u8       buf[BLAKE2B_BLOCK_SIZE];
+	size_t   buflen;
+};
+
+struct blake2b_tfm_ctx {
+	u8 key[BLAKE2B_KEY_SIZE];
+	unsigned int keylen;
+};
+
+typedef void (*blake2b_compress_blocks_t)(struct blake2b_state *S,
+					  const u8 *in, size_t nblocks,
+					  unsigned int inc);
+
+struct crypto_shash;
+struct shash_desc;
+
+int crypto_blake2b_setkey(struct crypto_shash *tfm, const u8 *key,
+			  unsigned int keylen);
+
+int crypto_blake2b_init(struct shash_desc *desc);
+
+int crypto_blake2b_update(struct shash_desc *desc,
+			  const u8 *in, unsigned int inlen);
+int __crypto_blake2b_update(struct shash_desc *desc,
+			    const u8 *in, unsigned int inlen,
+			    blake2b_compress_blocks_t compress);
+
+int crypto_blake2b_final(struct shash_desc *desc, u8 *out);
+int __crypto_blake2b_final(struct shash_desc *desc, u8 *out,
+			   blake2b_compress_blocks_t compress);
+
+#endif /* _CRYPTO_BLAKE2B_H */
-- 
2.29.2


* [PATCH 4/5] crypto: blake2b - update file comment
  2020-12-15 23:47 ` Eric Biggers
@ 2020-12-15 23:47   ` Eric Biggers
  -1 siblings, 0 replies; 34+ messages in thread
From: Eric Biggers @ 2020-12-15 23:47 UTC
  To: linux-crypto
  Cc: linux-arm-kernel, Ard Biesheuvel, Herbert Xu, David Sterba,
	Jason A . Donenfeld, Paul Crowley

From: Eric Biggers <ebiggers@google.com>

The file comment for blake2b_generic.c makes it sound like it's the
reference implementation of BLAKE2b with only minor changes.  But it's
actually been changed a lot.  Update the comment to make this clearer.

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 crypto/blake2b_generic.c | 23 ++++++++++-------------
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/crypto/blake2b_generic.c b/crypto/blake2b_generic.c
index fd5159b7aa9f2..c82a12dd47f17 100644
--- a/crypto/blake2b_generic.c
+++ b/crypto/blake2b_generic.c
@@ -1,21 +1,18 @@
 // SPDX-License-Identifier: (GPL-2.0-only OR Apache-2.0)
 /*
- * BLAKE2b reference source code package - reference C implementations
+ * Generic implementation of the BLAKE2b digest algorithm.  Based on the BLAKE2b
+ * reference implementation, but it has been heavily modified for use in the
+ * kernel.  The reference implementation was:
  *
- * Copyright 2012, Samuel Neves <sneves@dei.uc.pt>.  You may use this under the
- * terms of the CC0, the OpenSSL Licence, or the Apache Public License 2.0, at
- * your option.  The terms of these licenses can be found at:
+ *	Copyright 2012, Samuel Neves <sneves@dei.uc.pt>.  You may use this under
+ *	the terms of the CC0, the OpenSSL Licence, or the Apache Public License
+ *	2.0, at your option.  The terms of these licenses can be found at:
  *
- * - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
- * - OpenSSL license   : https://www.openssl.org/source/license.html
- * - Apache 2.0        : https://www.apache.org/licenses/LICENSE-2.0
+ *	- CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
+ *	- OpenSSL license   : https://www.openssl.org/source/license.html
+ *	- Apache 2.0        : https://www.apache.org/licenses/LICENSE-2.0
  *
- * More information about the BLAKE2 hash function can be found at
- * https://blake2.net.
- *
- * Note: the original sources have been modified for inclusion in linux kernel
- * in terms of coding style, using generic helpers and simplifications of error
- * handling.
+ * More information about BLAKE2 can be found at https://blake2.net.
  */
 
 #include <asm/unaligned.h>
-- 
2.29.2


* [PATCH 5/5] crypto: arm/blake2b - add NEON-optimized BLAKE2b implementation
  2020-12-15 23:47 ` Eric Biggers
@ 2020-12-15 23:47   ` Eric Biggers
  -1 siblings, 0 replies; 34+ messages in thread
From: Eric Biggers @ 2020-12-15 23:47 UTC
  To: linux-crypto
  Cc: linux-arm-kernel, Ard Biesheuvel, Herbert Xu, David Sterba,
	Jason A . Donenfeld, Paul Crowley

From: Eric Biggers <ebiggers@google.com>

Add a NEON-optimized implementation of BLAKE2b.

On Cortex-A7 (which these days is the most common ARM processor that
doesn't have the ARMv8 Crypto Extensions), this is over twice as fast as
SHA-256, and slightly faster than SHA-1.  It is also almost three times
as fast as the generic implementation of BLAKE2b:

	Algorithm            Cycles per byte (on 4096-byte messages)
	===================  =======================================
	blake2b-256-neon     14.1
	sha1-neon            16.4
	sha1-asm             20.8
	blake2s-256-generic  26.1
	sha256-neon          28.9
	sha256-asm           32.1
	blake2b-256-generic  39.9

This implementation isn't directly based on any other implementation,
but it borrows some ideas from previous NEON code I've written as well
as from chacha-neon-core.S.  At least on Cortex-A7, it is faster than
the other NEON implementations of BLAKE2b I'm aware of (the
implementation in the BLAKE2 official repository using intrinsics, and
Andrew Moon's implementation which can be found in SUPERCOP).

NEON-optimized BLAKE2b is useful because there is interest in using
BLAKE2b-256 for dm-verity on low-end Android devices (specifically,
devices that lack the ARMv8 Crypto Extensions) to replace SHA-1.  On
these devices, the performance cost of upgrading to SHA-256 may be
unacceptable, whereas BLAKE2b-256 would actually improve performance.

Although BLAKE2b is intended for 64-bit platforms (unlike BLAKE2s which
is intended for 32-bit platforms), on 32-bit ARM processors with NEON,
BLAKE2b is actually faster than BLAKE2s.  This is because NEON supports
64-bit operations, and because BLAKE2s's block size is too small for
NEON to be helpful for it.  The best I've been able to do with BLAKE2s
on Cortex-A7 is 19.0 cpb with an optimized scalar implementation.

(I didn't try BLAKE2sp and BLAKE3, which in theory would be faster, but
they're more complex as they require running multiple hashes at once.
Note that BLAKE2b already uses all the NEON bandwidth on the Cortex-A7,
so I expect that any speedup from BLAKE2sp or BLAKE3 would come only
from the smaller number of rounds, not from the extra parallelism.)

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 arch/arm/crypto/Kconfig             |  10 +
 arch/arm/crypto/Makefile            |   2 +
 arch/arm/crypto/blake2b-neon-core.S | 357 ++++++++++++++++++++++++++++
 arch/arm/crypto/blake2b-neon-glue.c | 105 ++++++++
 4 files changed, 474 insertions(+)
 create mode 100644 arch/arm/crypto/blake2b-neon-core.S
 create mode 100644 arch/arm/crypto/blake2b-neon-glue.c

diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
index c9bf2df85cb90..f6a14c186b4ec 100644
--- a/arch/arm/crypto/Kconfig
+++ b/arch/arm/crypto/Kconfig
@@ -62,6 +62,16 @@ config CRYPTO_SHA512_ARM
 	  SHA-512 secure hash standard (DFIPS 180-2) implemented
 	  using optimized ARM assembler and NEON, when available.
 
+config CRYPTO_BLAKE2B_NEON
+	tristate "BLAKE2b digest algorithm (ARM NEON)"
+	depends on KERNEL_MODE_NEON
+	select CRYPTO_BLAKE2B
+	help
+	  BLAKE2b digest algorithm optimized with ARM NEON instructions.
+	  On ARM processors that have NEON support but not the ARMv8
+	  Crypto Extensions, typically this BLAKE2b implementation is
+	  much faster than SHA-2 and slightly faster than SHA-1.
+
 config CRYPTO_AES_ARM
 	tristate "Scalar AES cipher for ARM"
 	select CRYPTO_ALGAPI
diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
index b745c17d356fe..ab835ceeb4f2e 100644
--- a/arch/arm/crypto/Makefile
+++ b/arch/arm/crypto/Makefile
@@ -9,6 +9,7 @@ obj-$(CONFIG_CRYPTO_SHA1_ARM) += sha1-arm.o
 obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
 obj-$(CONFIG_CRYPTO_SHA256_ARM) += sha256-arm.o
 obj-$(CONFIG_CRYPTO_SHA512_ARM) += sha512-arm.o
+obj-$(CONFIG_CRYPTO_BLAKE2B_NEON) += blake2b-neon.o
 obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha-neon.o
 obj-$(CONFIG_CRYPTO_POLY1305_ARM) += poly1305-arm.o
 obj-$(CONFIG_CRYPTO_NHPOLY1305_NEON) += nhpoly1305-neon.o
@@ -29,6 +30,7 @@ sha256-arm-neon-$(CONFIG_KERNEL_MODE_NEON) := sha256_neon_glue.o
 sha256-arm-y	:= sha256-core.o sha256_glue.o $(sha256-arm-neon-y)
 sha512-arm-neon-$(CONFIG_KERNEL_MODE_NEON) := sha512-neon-glue.o
 sha512-arm-y	:= sha512-core.o sha512-glue.o $(sha512-arm-neon-y)
+blake2b-neon-y  := blake2b-neon-core.o blake2b-neon-glue.o
 sha1-arm-ce-y	:= sha1-ce-core.o sha1-ce-glue.o
 sha2-arm-ce-y	:= sha2-ce-core.o sha2-ce-glue.o
 aes-arm-ce-y	:= aes-ce-core.o aes-ce-glue.o
diff --git a/arch/arm/crypto/blake2b-neon-core.S b/arch/arm/crypto/blake2b-neon-core.S
new file mode 100644
index 0000000000000..734d9d3a161f7
--- /dev/null
+++ b/arch/arm/crypto/blake2b-neon-core.S
@@ -0,0 +1,357 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * BLAKE2b digest algorithm, NEON accelerated
+ *
+ * Copyright 2020 Google LLC
+ *
+ * Author: Eric Biggers <ebiggers@google.com>
+ */
+
+#include <linux/linkage.h>
+
+	.text
+	.fpu		neon
+
+	// The arguments to blake2b_compress_blocks_neon()
+	STATE		.req	r0
+	IN		.req	r1
+	NBLOCKS		.req	r2
+	INC		.req	r3
+
+	// Pointers to the rotation tables
+	ROR24_TABLE	.req	r4
+	ROR16_TABLE	.req	r5
+
+	// The original stack pointer
+	ORIG_SP		.req	r6
+
+	// NEON registers which contain the message words of the current block.
+	// M_0-M_3 are occasionally used for other purposes too.
+	M_0		.req	d16
+	M_1		.req	d17
+	M_2		.req	d18
+	M_3		.req	d19
+	M_4		.req	d20
+	M_5		.req	d21
+	M_6		.req	d22
+	M_7		.req	d23
+	M_8		.req	d24
+	M_9		.req	d25
+	M_10		.req	d26
+	M_11		.req	d27
+	M_12		.req	d28
+	M_13		.req	d29
+	M_14		.req	d30
+	M_15		.req	d31
+
+	.align		4
+	// Tables for computing ror64(x, 24) and ror64(x, 16) using the vtbl.8
+	// instruction.  This is the most efficient way to implement these
+	// rotation amounts with NEON.  (On Cortex-A53 it's the same speed as
+	// vshr.u64 + vsli.u64, while on Cortex-A7 it's faster.)
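+	// For example, ror64(x, 24) must take output byte i from input byte
+	// (i + 3) % 8, which gives the index list 3, 4, 5, 6, 7, 0, 1, 2.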
+.Lror24_table:
+	.byte		3, 4, 5, 6, 7, 0, 1, 2
+.Lror16_table:
+	.byte		2, 3, 4, 5, 6, 7, 0, 1
+	// The BLAKE2b initialization vector
+.Lblake2b_IV:
+	.quad		0x6a09e667f3bcc908, 0xbb67ae8584caa73b
+	.quad		0x3c6ef372fe94f82b, 0xa54ff53a5f1d36f1
+	.quad		0x510e527fade682d1, 0x9b05688c2b3e6c1f
+	.quad		0x1f83d9abfb41bd6b, 0x5be0cd19137e2179
+
+// Execute one round of BLAKE2b by updating the state matrix v[0..15] in the
+// NEON registers q0-q7.  The message block is in q8..q15.  The stack pointer
+// points to a 32-byte aligned buffer containing a copy of q8 and q9, so that
+// they can be reloaded if q8 and q9 are used as temporary registers.  The macro
+// arguments s0-s15 give the order in which the message words are used in this
+// round.  'final' is "true" if this is the final round, i.e. round 12 of 12.
+.macro	_blake2b_round	s0, s1, s2, s3, s4, s5, s6, s7, \
+			s8, s9, s10, s11, s12, s13, s14, s15, final="false"
+
+	// Mix the columns:
+	// (v[0], v[4], v[8], v[12]), (v[1], v[5], v[9], v[13]),
+	// (v[2], v[6], v[10], v[14]), and (v[3], v[7], v[11], v[15]).
+
+	// a += b + m[blake2b_sigma[r][2*i + 0]];
+	vadd.u64	q0, q0, q2
+	vadd.u64	q1, q1, q3
+	vadd.u64	d0, d0, M_\s0
+	vadd.u64	d1, d1, M_\s2
+	vadd.u64	d2, d2, M_\s4
+	vadd.u64	d3, d3, M_\s6
+
+	// d = ror64(d ^ a, 32);
+	veor		q6, q6, q0
+	veor		q7, q7, q1
+	vrev64.32	q6, q6
+	vrev64.32	q7, q7
+
+	// c += d;
+	vadd.u64	q4, q4, q6
+	vadd.u64	q5, q5, q7
+
+	// b = ror64(b ^ c, 24);
+	vld1.8		{M_0}, [ROR24_TABLE, :64]
+	veor		q2, q2, q4
+	veor		q3, q3, q5
+	vtbl.8		d4, {d4}, M_0
+	vtbl.8		d5, {d5}, M_0
+	vtbl.8		d6, {d6}, M_0
+	vtbl.8		d7, {d7}, M_0
+
+	// a += b + m[blake2b_sigma[r][2*i + 1]];
+	//
+	// M_0 got clobbered above, so we have to reload it if any of the four
+	// message words this step needs happens to be M_0.  Otherwise we don't
+	// need to reload it here, as it will just get clobbered again below.
+.if \s1 == 0 || \s3 == 0 || \s5 == 0 || \s7 == 0
+	vld1.8		{M_0}, [sp, :64]
+.endif
+	vadd.u64	q0, q0, q2
+	vadd.u64	q1, q1, q3
+	vadd.u64	d0, d0, M_\s1
+	vadd.u64	d1, d1, M_\s3
+	vadd.u64	d2, d2, M_\s5
+	vadd.u64	d3, d3, M_\s7
+
+	// d = ror64(d ^ a, 16);
+	vld1.8		{M_0}, [ROR16_TABLE, :64]
+	veor		q6, q6, q0
+	veor		q7, q7, q1
+	vtbl.8		d12, {d12}, M_0
+	vtbl.8		d13, {d13}, M_0
+	vtbl.8		d14, {d14}, M_0
+	vtbl.8		d15, {d15}, M_0
+
+	// c += d;
+	vadd.u64	q4, q4, q6
+	vadd.u64	q5, q5, q7
+
+	// b = ror64(b ^ c, 63);
+	//
+	// This rotation amount isn't a multiple of 8, so it has to be
+	// implemented using a pair of shifts, which requires temporary
+	// registers.  Use q8-q9 (M_0-M_3) for this, and reload them afterwards.
+	veor		q8, q2, q4
+	veor		q9, q3, q5
+	vshr.u64	q2, q8, #63
+	vshr.u64	q3, q9, #63
+	vsli.u64	q2, q8, #1
+	vsli.u64	q3, q9, #1
+	vld1.8		{q8-q9}, [sp, :256]
+
+	// Mix the diagonals:
+	// (v[0], v[5], v[10], v[15]), (v[1], v[6], v[11], v[12]),
+	// (v[2], v[7], v[8], v[13]), and (v[3], v[4], v[9], v[14]).
+	//
+	// There are two possible ways to do this: use 'vext' instructions to
+	// shift the rows of the matrix so that the diagonals become columns,
+	// and undo it afterwards; or just use 64-bit operations on 'd'
+	// registers instead of 128-bit operations on 'q' registers.  We use the
+	// latter approach, as it performs much better on Cortex-A7.
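+	//
+	// (In the d-register view, q0 = {d0, d1} = v[0..1] and so on, so the
+	// first diagonal (v[0], v[5], v[10], v[15]) is (d0, d5, d10, d15).)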
+
+	// a += b + m[blake2b_sigma[r][2*i + 0]];
+	vadd.u64	d0, d0, d5
+	vadd.u64	d1, d1, d6
+	vadd.u64	d2, d2, d7
+	vadd.u64	d3, d3, d4
+	vadd.u64	d0, d0, M_\s8
+	vadd.u64	d1, d1, M_\s10
+	vadd.u64	d2, d2, M_\s12
+	vadd.u64	d3, d3, M_\s14
+
+	// d = ror64(d ^ a, 32);
+	veor		d15, d15, d0
+	veor		d12, d12, d1
+	veor		d13, d13, d2
+	veor		d14, d14, d3
+	vrev64.32	d15, d15
+	vrev64.32	d12, d12
+	vrev64.32	d13, d13
+	vrev64.32	d14, d14
+
+	// c += d;
+	vadd.u64	d10, d10, d15
+	vadd.u64	d11, d11, d12
+	vadd.u64	d8, d8, d13
+	vadd.u64	d9, d9, d14
+
+	// b = ror64(b ^ c, 24);
+	vld1.8		{M_0}, [ROR24_TABLE, :64]
+	veor		d5, d5, d10
+	veor		d6, d6, d11
+	veor		d7, d7, d8
+	veor		d4, d4, d9
+	vtbl.8		d5, {d5}, M_0
+	vtbl.8		d6, {d6}, M_0
+	vtbl.8		d7, {d7}, M_0
+	vtbl.8		d4, {d4}, M_0
+
+	// a += b + m[blake2b_sigma[r][2*i + 1]];
+.if \s9 == 0 || \s11 == 0 || \s13 == 0 || \s15 == 0
+	vld1.8		{M_0}, [sp, :64]
+.endif
+	vadd.u64	d0, d0, d5
+	vadd.u64	d1, d1, d6
+	vadd.u64	d2, d2, d7
+	vadd.u64	d3, d3, d4
+	vadd.u64	d0, d0, M_\s9
+	vadd.u64	d1, d1, M_\s11
+	vadd.u64	d2, d2, M_\s13
+	vadd.u64	d3, d3, M_\s15
+
+	// d = ror64(d ^ a, 16);
+	vld1.8		{M_0}, [ROR16_TABLE, :64]
+	veor		d15, d15, d0
+	veor		d12, d12, d1
+	veor		d13, d13, d2
+	veor		d14, d14, d3
+	vtbl.8		d12, {d12}, M_0
+	vtbl.8		d13, {d13}, M_0
+	vtbl.8		d14, {d14}, M_0
+	vtbl.8		d15, {d15}, M_0
+
+	// c += d;
+	vadd.u64	d10, d10, d15
+	vadd.u64	d11, d11, d12
+	vadd.u64	d8, d8, d13
+	vadd.u64	d9, d9, d14
+
+	// b = ror64(b ^ c, 63);
+	veor		d16, d4, d9
+	veor		d17, d5, d10
+	veor		d18, d6, d11
+	veor		d19, d7, d8
+	vshr.u64	q2, q8, #63
+	vshr.u64	q3, q9, #63
+	vsli.u64	q2, q8, #1
+	vsli.u64	q3, q9, #1
+	// Reloading q8-q9 can be skipped on the final round.
+.if \final != "true"
+	vld1.8		{q8-q9}, [sp, :256]
+.endif
+.endm
+
+//
+// void blake2b_compress_blocks_neon(struct blake2b_state *S,
+//				     const u8 *in, size_t nblocks,
+//				     unsigned int inc);
+//
+// Only the first three fields of struct blake2b_state are used:
+//	u64 h[8];	(inout)
+//	u64 t[2];	(in)
+//	u64 f[2];	(in)
+//
+	.align		5
+ENTRY(blake2b_compress_blocks_neon)
+	push		{r4-r10}
+
+	// Allocate a 32-byte stack buffer that is 32-byte aligned.
+	mov		ORIG_SP, sp
+	sub		ip, sp, #32
+	bic		ip, ip, #31
+	mov		sp, ip
+
+	adr		ROR24_TABLE, .Lror24_table
+	adr		ROR16_TABLE, .Lror16_table
+
+	mov		ip, STATE
+	vld1.64		{q0-q1}, [ip]!		// Load h[0..3]
+	vld1.64		{q2-q3}, [ip]!		// Load h[4..7]
+.Lnext_block:
+	  adr		r10, .Lblake2b_IV
+	vld1.64		{q14-q15}, [ip]		// Load t[0..1] and f[0..1]
+	vld1.64		{q4-q5}, [r10]!		// Load IV[0..3]
+	  vmov		r7, r8, d28		// Copy t[0] to (r7, r8)
+	vld1.64		{q6-q7}, [r10]		// Load IV[4..7]
+	  adds		r7, r7, INC		// Increment counter
+	bcs		.Lslow_inc_ctr
+	vmov.i32	d28[0], r7
+	vst1.64		{d28}, [ip]		// Update t[0]
+.Linc_ctr_done:
+
+	// Load the next message block and finish initializing the state matrix
+	// 'v'.  Fortunately, there are exactly enough NEON registers to fit the
+	// entire state matrix in q0-q7 and the entire message block in q8-q15.
+	//
+	// However, _blake2b_round also needs some extra registers for rotates,
+	// so we have to spill some registers.  It's better to spill the message
+	// registers than the state registers, as the message doesn't change.
+	// Therefore we store a copy of the first 32 bytes of the message block
+	// (q8-q9) in an aligned buffer on the stack so that they can be
+	// reloaded when needed.  (We could just reload directly from the
+	// message buffer, but it's faster to use aligned loads.)
+	vld1.8		{q8-q9}, [IN]!
+	  veor		q6, q6, q14	// v[12..13] = IV[4..5] ^ t[0..1]
+	vld1.8		{q10-q11}, [IN]!
+	  veor		q7, q7, q15	// v[14..15] = IV[6..7] ^ f[0..1]
+	vld1.8		{q12-q13}, [IN]!
+	vst1.8		{q8-q9}, [sp, :256]
+	  mov		ip, STATE
+	vld1.8		{q14-q15}, [IN]!
+
+	// Execute the rounds.  Each round is provided the order in which it
+	// needs to use the message words.
+	_blake2b_round	0,  1,  2,  3,  4,  5,  6,  7, \
+			8,  9, 10, 11, 12, 13, 14, 15
+	_blake2b_round 14, 10,  4,  8,  9, 15, 13,  6, \
+			1, 12,  0,  2, 11,  7,  5,  3
+	_blake2b_round 11,  8, 12,  0,  5,  2, 15, 13, \
+		       10, 14,  3,  6,  7,  1,  9,  4
+	_blake2b_round	7,  9,  3,  1, 13, 12, 11, 14, \
+			2,  6,  5, 10,  4,  0, 15,  8
+	_blake2b_round	9,  0,  5,  7,  2,  4, 10, 15, \
+			14,  1, 11, 12,  6,  8,  3, 13
+	_blake2b_round	2, 12,  6, 10,  0, 11,  8,  3, \
+			4, 13,  7,  5, 15, 14,  1,  9
+	_blake2b_round 12,  5,  1, 15, 14, 13,  4, 10, \
+			0,  7,  6,  3,  9,  2,  8, 11
+	_blake2b_round 13, 11,  7, 14, 12,  1,  3,  9, \
+			5,  0, 15,  4,  8,  6,  2, 10
+	_blake2b_round	6, 15, 14,  9, 11,  3,  0,  8, \
+			12,  2, 13,  7,  1,  4, 10,  5
+	_blake2b_round 10,  2,  8,  4,  7,  6,  1,  5, \
+		       15, 11,  9, 14,  3, 12, 13,  0
+	_blake2b_round	0,  1,  2,  3,  4,  5,  6,  7, \
+			8,  9, 10, 11, 12, 13, 14, 15
+	_blake2b_round 14, 10,  4,  8,  9, 15, 13,  6, \
+			1, 12,  0,  2, 11,  7,  5,  3,  "true"
+
+	// Fold the final state matrix into the hash chaining value:
+	//
+	//	for (i = 0; i < 8; i++)
+	//		h[i] ^= v[i] ^ v[i + 8];
+	//
+	  vld1.64	{q8-q9}, [ip]!		// Load old h[0..3]
+	veor		q0, q0, q4		// v[0..1] ^= v[8..9]
+	veor		q1, q1, q5		// v[2..3] ^= v[10..11]
+	  vld1.64	{q10-q11}, [ip]		// Load old h[4..7]
+	veor		q2, q2, q6		// v[4..5] ^= v[12..13]
+	veor		q3, q3, q7		// v[6..7] ^= v[14..15]
+	veor		q0, q0, q8		// v[0..1] ^= h[0..1]
+	veor		q1, q1, q9		// v[2..3] ^= h[2..3]
+	  mov		ip, STATE
+	  subs		NBLOCKS, NBLOCKS, #1	// nblocks--
+	  vst1.64	{q0-q1}, [ip]!		// Store new h[0..3]
+	veor		q2, q2, q10		// v[4..5] ^= h[4..5]
+	veor		q3, q3, q11		// v[6..7] ^= h[6..7]
+	  vst1.64	{q2-q3}, [ip]!		// Store new h[4..7]
+	bne		.Lnext_block		// nblocks != 0?
+
+	mov		sp, ORIG_SP
+	pop		{r4-r10}
+	mov		pc, lr
+
+.Lslow_inc_ctr:
+	// Handle the case where the counter overflowed its low 32 bits, by
+	// carrying the overflow bit into the full 128-bit counter.
+	vmov		r9, r10, d29
+	adcs		r8, r8, #0
+	adcs		r9, r9, #0
+	adc		r10, r10, #0
+	vmov		d28, r7, r8
+	vmov		d29, r9, r10
+	vst1.64		{q14}, [ip]		// Update t[0] and t[1]
+	b		.Linc_ctr_done
+ENDPROC(blake2b_compress_blocks_neon)
diff --git a/arch/arm/crypto/blake2b-neon-glue.c b/arch/arm/crypto/blake2b-neon-glue.c
new file mode 100644
index 0000000000000..27620d7a8bcb7
--- /dev/null
+++ b/arch/arm/crypto/blake2b-neon-glue.c
@@ -0,0 +1,105 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * BLAKE2b digest algorithm, NEON accelerated
+ *
+ * Copyright 2020 Google LLC
+ */
+
+#include <asm/neon.h>
+#include <asm/simd.h>
+#include <crypto/internal/hash.h>
+#include <crypto/internal/simd.h>
+#include <crypto/blake2b.h>
+#include <linux/module.h>
+
+asmlinkage void blake2b_compress_blocks_neon(struct blake2b_state *S,
+					     const u8 *in, size_t nblocks,
+					     unsigned int inc);
+
+static int blake2b_neon_update(struct shash_desc *desc,
+			       const u8 *in, unsigned int inlen)
+{
+	struct blake2b_state *S = shash_desc_ctx(desc);
+
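+	/* Fall back to the generic implementation when the total buffered
+	 * data is still less than one block (nothing to compress yet) or
+	 * when NEON is currently unusable, e.g. in interrupt context. */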
+	if (S->buflen + inlen < BLAKE2B_BLOCK_SIZE || !crypto_simd_usable())
+		return crypto_blake2b_update(desc, in, inlen);
+
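+	/* Process at most 4 KiB per kernel_neon_begin()/kernel_neon_end()
+	 * section, so that preemption is never disabled for too long. */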
+	do {
+		unsigned int n = min_t(unsigned int, inlen, SZ_4K);
+
+		kernel_neon_begin();
+		__crypto_blake2b_update(desc, in, n,
+					blake2b_compress_blocks_neon);
+		kernel_neon_end();
+		in += n;
+		inlen -= n;
+	} while (inlen);
+	return 0;
+}
+
+static int blake2b_neon_final(struct shash_desc *desc, u8 *out)
+{
+	int err;
+
+	if (!crypto_simd_usable())
+		return crypto_blake2b_final(desc, out);
+
+	kernel_neon_begin();
+	err = __crypto_blake2b_final(desc, out, blake2b_compress_blocks_neon);
+	kernel_neon_end();
+	return err;
+}
+
+#define BLAKE2B_ALG(name, driver_name, digest_size)			\
+	{								\
+		.base.cra_name		= name,				\
+		.base.cra_driver_name	= driver_name,			\
+		.base.cra_priority	= 200,				\
+		.base.cra_flags		= CRYPTO_ALG_OPTIONAL_KEY,	\
+		.base.cra_blocksize	= BLAKE2B_BLOCK_SIZE,		\
+		.base.cra_ctxsize	= sizeof(struct blake2b_tfm_ctx), \
+		.base.cra_module	= THIS_MODULE,			\
+		.digestsize		= digest_size,			\
+		.setkey			= crypto_blake2b_setkey,	\
+		.init			= crypto_blake2b_init,		\
+		.update			= blake2b_neon_update,		\
+		.final			= blake2b_neon_final,		\
+		.descsize		= sizeof(struct blake2b_state),	\
+	}
+
+static struct shash_alg blake2b_neon_algs[] = {
+	BLAKE2B_ALG("blake2b-160", "blake2b-160-neon", BLAKE2B_160_HASH_SIZE),
+	BLAKE2B_ALG("blake2b-256", "blake2b-256-neon", BLAKE2B_256_HASH_SIZE),
+	BLAKE2B_ALG("blake2b-384", "blake2b-384-neon", BLAKE2B_384_HASH_SIZE),
+	BLAKE2B_ALG("blake2b-512", "blake2b-512-neon", BLAKE2B_512_HASH_SIZE),
+};
+
+static int __init blake2b_neon_mod_init(void)
+{
+	if (!(elf_hwcap & HWCAP_NEON))
+		return -ENODEV;
+
+	return crypto_register_shashes(blake2b_neon_algs,
+				       ARRAY_SIZE(blake2b_neon_algs));
+}
+
+static void __exit blake2b_neon_mod_exit(void)
+{
+	crypto_unregister_shashes(blake2b_neon_algs,
+				  ARRAY_SIZE(blake2b_neon_algs));
+}
+
+module_init(blake2b_neon_mod_init);
+module_exit(blake2b_neon_mod_exit);
+
+MODULE_DESCRIPTION("BLAKE2b digest algorithm, NEON accelerated");
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Eric Biggers <ebiggers@google.com>");
+MODULE_ALIAS_CRYPTO("blake2b-160");
+MODULE_ALIAS_CRYPTO("blake2b-160-neon");
+MODULE_ALIAS_CRYPTO("blake2b-256");
+MODULE_ALIAS_CRYPTO("blake2b-256-neon");
+MODULE_ALIAS_CRYPTO("blake2b-384");
+MODULE_ALIAS_CRYPTO("blake2b-384-neon");
+MODULE_ALIAS_CRYPTO("blake2b-512");
+MODULE_ALIAS_CRYPTO("blake2b-512-neon");
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/5] crypto: add NEON-optimized BLAKE2b
  2020-12-15 23:47 ` Eric Biggers
@ 2020-12-16 20:47   ` Eric Biggers
  -1 siblings, 0 replies; 34+ messages in thread
From: Eric Biggers @ 2020-12-16 20:47 UTC (permalink / raw)
  To: linux-crypto
  Cc: linux-arm-kernel, Ard Biesheuvel, Herbert Xu, David Sterba,
	Jason A . Donenfeld, Paul Crowley

On Tue, Dec 15, 2020 at 03:47:03PM -0800, Eric Biggers wrote:
> This patchset adds a NEON implementation of BLAKE2b for 32-bit ARM.
> Patches 1-4 prepare for it by making some updates to the generic
> implementation, while patch 5 adds the actual NEON implementation.
> 
> On Cortex-A7 (which these days is the most common ARM processor that
> doesn't have the ARMv8 Crypto Extensions), this is over twice as fast as
> SHA-256, and slightly faster than SHA-1.  It is also almost three times
> as fast as the generic implementation of BLAKE2b:
> 
> 	Algorithm            Cycles per byte (on 4096-byte messages)
> 	===================  =======================================
> 	blake2b-256-neon     14.1
> 	sha1-neon            16.4
> 	sha1-asm             20.8
> 	blake2s-256-generic  26.1
> 	sha256-neon          28.9
> 	sha256-asm           32.1
> 	blake2b-256-generic  39.9
> 
> This implementation isn't directly based on any other implementation,
> but it borrows some ideas from previous NEON code I've written as well
> as from chacha-neon-core.S.  At least on Cortex-A7, it is faster than
> the other NEON implementations of BLAKE2b I'm aware of (the
> implementation in the BLAKE2 official repository using intrinsics, and
> Andrew Moon's implementation which can be found in SUPERCOP).
> 
> NEON-optimized BLAKE2b is useful because there is interest in using
> BLAKE2b-256 for dm-verity on low-end Android devices (specifically,
> devices that lack the ARMv8 Crypto Extensions) to replace SHA-1.  On
> these devices, the performance cost of upgrading to SHA-256 may be
> unacceptable, whereas BLAKE2b-256 would actually improve performance.
> 
> Although BLAKE2b is intended for 64-bit platforms (unlike BLAKE2s which
> is intended for 32-bit platforms), on 32-bit ARM processors with NEON,
> BLAKE2b is actually faster than BLAKE2s.  This is because NEON supports
> 64-bit operations, and because BLAKE2s's block size is too small for
> NEON to be helpful for it.  The best I've been able to do with BLAKE2s
> on Cortex-A7 is 19.0 cpb with an optimized scalar implementation.

By the way, if people are interested in having my ARM scalar implementation of
BLAKE2s in the kernel too, I can send a patchset for that as well.  It just
ended up being slower than BLAKE2b and SHA-1, so it wasn't as good for the use
case mentioned above.  If it were to be added as "blake2s-256-arm", we'd have:

	Algorithm            Cycles per byte (on 4096-byte messages)
	===================  =======================================
	blake2b-256-neon     14.1
	sha1-neon            16.4
	blake2s-256-arm      19.0
	sha1-asm             20.8
	blake2s-256-generic  26.1
	sha256-neon          28.9
	sha256-asm           32.1
	blake2b-256-generic  39.9

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] crypto: arm/blake2b - add NEON-optimized BLAKE2b implementation
  2020-12-15 23:47   ` Eric Biggers
@ 2020-12-16 20:57     ` Eric Biggers
  -1 siblings, 0 replies; 34+ messages in thread
From: Eric Biggers @ 2020-12-16 20:57 UTC (permalink / raw)
  To: linux-crypto
  Cc: linux-arm-kernel, Ard Biesheuvel, Herbert Xu, David Sterba,
	Jason A . Donenfeld, Paul Crowley

On Tue, Dec 15, 2020 at 03:47:08PM -0800, Eric Biggers wrote:
> +// Execute one round of BLAKE2b by updating the state matrix v[0..15] in the
> +// NEON registers q0-q7.  The message block is in q8..q15.  The stack pointer
> +// points to a 32-byte aligned buffer containing a copy of q8 and q9, so that
> +// they can be reloaded if q8 and q9 are used as temporary registers.  The macro
> +// arguments s0-s15 give the order in which the message words are used in this
> +// round.  'final' is "true" if this is the final round, i.e. round 12 of 12.
> +.macro	_blake2b_round	s0, s1, s2, s3, s4, s5, s6, s7, \
> +			s8, s9, s10, s11, s12, s13, s14, s15, final="false"
[...]
> +	// Reloading q8-q9 can be skipped on the final round.
> +.if \final != "true"
> +	vld1.8		{q8-q9}, [sp, :256]
> +.endif
> +.endm
[...]
> +	_blake2b_round 14, 10,  4,  8,  9, 15, 13,  6, \
> +			1, 12,  0,  2, 11,  7, 5,   3,  "true"

Apparently using the strings "true" and "false" here sometimes causes a build
error where they get treated as symbols
(https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org/thread/2JPD4H3VFBSKWPUCPEPRAXBVMSR2UCQI/),
though somehow it worked fine for me.  I'll change this to use 1 and 0 instead.
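
For reference, the planned change would look something like this (just a
sketch of the v2 direction, not the final patch):

	.macro	_blake2b_round	s0, s1, s2, s3, s4, s5, s6, s7, \
			s8, s9, s10, s11, s12, s13, s14, s15, final=0
	[...]
	// Reloading q8-q9 can be skipped on the final round.
	.if \final == 0
		vld1.8		{q8-q9}, [sp, :256]
	.endif
	.endm

with the last round invocation passing 1 instead of "true".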

- Eric

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/5] crypto: add NEON-optimized BLAKE2b
  2020-12-16 20:47   ` Eric Biggers
@ 2020-12-16 22:32     ` Jason A. Donenfeld
  -1 siblings, 0 replies; 34+ messages in thread
From: Jason A. Donenfeld @ 2020-12-16 22:32 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Linux Crypto Mailing List, linux-arm-kernel, Ard Biesheuvel,
	Herbert Xu, David Sterba, Paul Crowley

Hi Eric,

On Wed, Dec 16, 2020 at 9:48 PM Eric Biggers <ebiggers@kernel.org> wrote:
> By the way, if people are interested in having my ARM scalar implementation of
> BLAKE2s in the kernel too, I can send a patchset for that too.  It just ended up
> being slower than BLAKE2b and SHA-1, so it wasn't as good for the use case
> mentioned above.  If it were to be added as "blake2s-256-arm", we'd have:

I'd certainly be interested in this. Any rough idea how it performs
for pretty small messages compared to the generic implementation, say
in the 100-140 byte range? Is the speedup about the same as for longer
messages, since this doesn't parallelize across multiple blocks?

Jason

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/5] crypto: add NEON-optimized BLAKE2b
  2020-12-16 22:32     ` Jason A. Donenfeld
@ 2020-12-17  3:54       ` Eric Biggers
  -1 siblings, 0 replies; 34+ messages in thread
From: Eric Biggers @ 2020-12-17  3:54 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Linux Crypto Mailing List, linux-arm-kernel, Ard Biesheuvel,
	Herbert Xu, David Sterba, Paul Crowley

On Wed, Dec 16, 2020 at 11:32:44PM +0100, Jason A. Donenfeld wrote:
> Hi Eric,
> 
> On Wed, Dec 16, 2020 at 9:48 PM Eric Biggers <ebiggers@kernel.org> wrote:
> > By the way, if people are interested in having my ARM scalar implementation of
> > BLAKE2s in the kernel too, I can send a patchset for that too.  It just ended up
> > being slower than BLAKE2b and SHA-1, so it wasn't as good for the use case
> > mentioned above.  If it were to be added as "blake2s-256-arm", we'd have:
> 
> I'd certainly be interested in this. Any rough idea how it performs
> for pretty small messages compared to the generic implementation?
> 100-140 byte ranges? Is the speedup about the same as for longer
> messages because this doesn't parallelize across multiple blocks?
> 

It does one block at a time, and there isn't much overhead, so yes the speedup
on short messages should be about the same as on long messages.

I did a couple quick userspace benchmarks and got (still on Cortex-A7):

	100-byte messages:
		BLAKE2s ARM:     28.9 cpb
		BLAKE2s generic: 42.4 cpb

	140-byte messages:
		BLAKE2s ARM:     29.5 cpb
		BLAKE2s generic: 44.0 cpb

The results in the kernel may differ a bit, but probably not by much.
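
(In absolute terms, assuming those numbers hold: a 100-byte message costs
about 100 * 28.9 = 2890 cycles with the ARM code versus 100 * 42.4 = 4240
cycles with the generic code, i.e. roughly a 1.5x speedup.)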

- Eric

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/5] crypto: add NEON-optimized BLAKE2b
  2020-12-17  3:54       ` Eric Biggers
@ 2020-12-17 14:01         ` Jason A. Donenfeld
  -1 siblings, 0 replies; 34+ messages in thread
From: Jason A. Donenfeld @ 2020-12-17 14:01 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Linux Crypto Mailing List, linux-arm-kernel, Ard Biesheuvel,
	Herbert Xu, David Sterba, Paul Crowley

On Thu, Dec 17, 2020 at 4:54 AM Eric Biggers <ebiggers@kernel.org> wrote:
>
> On Wed, Dec 16, 2020 at 11:32:44PM +0100, Jason A. Donenfeld wrote:
> > Hi Eric,
> >
> > On Wed, Dec 16, 2020 at 9:48 PM Eric Biggers <ebiggers@kernel.org> wrote:
> > > By the way, if people are interested in having my ARM scalar implementation of
> > > BLAKE2s in the kernel too, I can send a patchset for that too.  It just ended up
> > > being slower than BLAKE2b and SHA-1, so it wasn't as good for the use case
> > > mentioned above.  If it were to be added as "blake2s-256-arm", we'd have:
> >
> > I'd certainly be interested in this. Any rough idea how it performs
> > for pretty small messages compared to the generic implementation?
> > 100-140 byte ranges? Is the speedup about the same as for longer
> > messages because this doesn't parallelize across multiple blocks?
> >
>
> It does one block at a time, and there isn't much overhead, so yes the speedup
> on short messages should be about the same as on long messages.
>
> I did a couple quick userspace benchmarks and got (still on Cortex-A7):
>
>         100-byte messages:
>                 BLAKE2s ARM:     28.9 cpb
>                 BLAKE2s generic: 42.4 cpb
>
>         140-byte messages:
>                 BLAKE2s ARM:     29.5 cpb
>                 BLAKE2s generic: 44.0 cpb
>
> The results in the kernel may differ a bit, but probably not by much.

That's certainly a nice improvement though, and I'd very much welcome
the faster implementation.

Jason

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] crypto: blake2b - rename constants for consistency with blake2s
  2020-12-15 23:47   ` Eric Biggers
@ 2020-12-17 17:13     ` David Sterba
  -1 siblings, 0 replies; 34+ messages in thread
From: David Sterba @ 2020-12-17 17:13 UTC (permalink / raw)
  To: Eric Biggers
  Cc: linux-crypto, linux-arm-kernel, Ard Biesheuvel, Herbert Xu,
	David Sterba, Jason A . Donenfeld, Paul Crowley

On Tue, Dec 15, 2020 at 03:47:04PM -0800, Eric Biggers wrote:
> From: Eric Biggers <ebiggers@google.com>
> 
> Rename some BLAKE2b-related constants to be consistent with the names
> used in the BLAKE2s implementation (see include/crypto/blake2s.h):
> 
> 	BLAKE2B_*_DIGEST_SIZE  => BLAKE2B_*_HASH_SIZE
> 	BLAKE2B_BLOCKBYTES     => BLAKE2B_BLOCK_SIZE
> 	BLAKE2B_KEYBYTES       => BLAKE2B_KEY_SIZE
> 
> Signed-off-by: Eric Biggers <ebiggers@google.com>

Reviewed-by: David Sterba <dsterba@suse.com>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/5] crypto: blake2b - define shash_alg structs using macros
  2020-12-15 23:47   ` Eric Biggers
@ 2020-12-17 17:15     ` David Sterba
  -1 siblings, 0 replies; 34+ messages in thread
From: David Sterba @ 2020-12-17 17:15 UTC (permalink / raw)
  To: Eric Biggers
  Cc: linux-crypto, linux-arm-kernel, Ard Biesheuvel, Herbert Xu,
	David Sterba, Jason A . Donenfeld, Paul Crowley

On Tue, Dec 15, 2020 at 03:47:05PM -0800, Eric Biggers wrote:
> From: Eric Biggers <ebiggers@google.com>
> 
> The shash_alg structs for the four variants of BLAKE2b are identical
> except for the algorithm name, driver name, and digest size.  So, avoid
> code duplication by using a macro to define these structs.
> 
> Signed-off-by: Eric Biggers <ebiggers@google.com>

Reviewed-by: David Sterba <dsterba@suse.com>

> +static struct shash_alg blake2b_algs[] = {
> +	BLAKE2B_ALG("blake2b-160", "blake2b-160-generic",
> +		    BLAKE2B_160_HASH_SIZE),

Spelling out the algo names as strings is better: they are greppable
and match the module names, unlike names generated with the #
(stringizing) macro operator.

> +	BLAKE2B_ALG("blake2b-256", "blake2b-256-generic",
> +		    BLAKE2B_256_HASH_SIZE),
> +	BLAKE2B_ALG("blake2b-384", "blake2b-384-generic",
> +		    BLAKE2B_384_HASH_SIZE),
> +	BLAKE2B_ALG("blake2b-512", "blake2b-512-generic",
> +		    BLAKE2B_512_HASH_SIZE),
>  };
>  
>  static int __init blake2b_mod_init(void)
> -- 
> 2.29.2
> 
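
For readers following along, BLAKE2B_ALG() expands to a complete
shash_alg initializer parameterized only by the two names and the
digest size.  A minimal sketch, assuming the generic driver's callback
and struct names (the priority and flags values here are illustrative
assumptions, not taken from the patch):

	#define BLAKE2B_ALG(name, driver_name, digest_size)		\
		{							\
			.base.cra_name		= name,			\
			.base.cra_driver_name	= driver_name,		\
			.base.cra_priority	= 100,			\
			.base.cra_flags		= CRYPTO_ALG_OPTIONAL_KEY, \
			.base.cra_blocksize	= BLAKE2B_BLOCK_SIZE,	\
			.base.cra_ctxsize	= sizeof(struct blake2b_tfm_ctx), \
			.base.cra_module	= THIS_MODULE,		\
			.digestsize		= digest_size,		\
			.setkey			= blake2b_setkey,	\
			.init			= blake2b_init,		\
			.update			= blake2b_update,	\
			.final			= blake2b_final,	\
			.descsize		= sizeof(struct blake2b_state), \
		}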

* Re: [PATCH 3/5] crypto: blake2b - export helpers for optimized implementations
  2020-12-15 23:47   ` Eric Biggers
@ 2020-12-17 17:15     ` David Sterba
  -1 siblings, 0 replies; 34+ messages in thread
From: David Sterba @ 2020-12-17 17:15 UTC (permalink / raw)
  To: Eric Biggers
  Cc: linux-crypto, linux-arm-kernel, Ard Biesheuvel, Herbert Xu,
	David Sterba, Jason A . Donenfeld, Paul Crowley

On Tue, Dec 15, 2020 at 03:47:06PM -0800, Eric Biggers wrote:
> From: Eric Biggers <ebiggers@google.com>
> 
> In preparation for adding architecture-specific implementations of
> BLAKE2b, create a header <crypto/blake2b.h> that contains common
> constants, structs, and helper functions for BLAKE2b.
> 
> Furthermore, export the BLAKE2b generic setkey(), init(), update(), and
> final() functions, and add functions __crypto_blake2b_update() and
> __crypto_blake2b_final() which take a pointer to a
> blake2b_compress_blocks_t function.
> 
> This way, optimized implementations of BLAKE2b only have to provide an
> implementation of blake2b_compress_blocks_t.  (This is modeled on how
> the nhpoly1305 implementations work.  Also, the prototype of
> blake2b_compress_blocks_t is meant to be similar to that of
> blake2s_compress_arch().)
> 
> Signed-off-by: Eric Biggers <ebiggers@google.com>

Reviewed-by: David Sterba <dsterba@suse.com>
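
Based on the description above, the exported interface has roughly
this shape (a sketch; the exact parameter names and types are
assumptions):

	typedef void (*blake2b_compress_blocks_t)(struct blake2b_state *state,
						  const u8 *blocks,
						  size_t nblocks, u32 inc);

	int __crypto_blake2b_update(struct shash_desc *desc, const u8 *in,
				    unsigned int inlen,
				    blake2b_compress_blocks_t compress);

	int __crypto_blake2b_final(struct shash_desc *desc, u8 *out,
				   blake2b_compress_blocks_t compress);

An optimized driver then supplies only its own compress function plus
thin update/final wrappers that pass it to these helpers.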

* Re: [PATCH 4/5] crypto: blake2b - update file comment
  2020-12-15 23:47   ` Eric Biggers
@ 2020-12-17 17:17     ` David Sterba
  -1 siblings, 0 replies; 34+ messages in thread
From: David Sterba @ 2020-12-17 17:17 UTC (permalink / raw)
  To: Eric Biggers
  Cc: linux-crypto, linux-arm-kernel, Ard Biesheuvel, Herbert Xu,
	David Sterba, Jason A . Donenfeld, Paul Crowley

On Tue, Dec 15, 2020 at 03:47:07PM -0800, Eric Biggers wrote:
> From: Eric Biggers <ebiggers@google.com>
> 
> The file comment for blake2b_generic.c makes it sound like it's the
> reference implementation of BLAKE2b with only minor changes.  But it's
> actually been changed a lot.  Update the comment to make this clearer.

After your changes I agree it's appropriate to reword it.

> Signed-off-by: Eric Biggers <ebiggers@google.com>

Reviewed-by: David Sterba <dsterba@suse.com>

* Re: [PATCH 2/5] crypto: blake2b - define shash_alg structs using macros
  2020-12-17 17:15     ` David Sterba
@ 2020-12-17 18:35       ` Eric Biggers
  -1 siblings, 0 replies; 34+ messages in thread
From: Eric Biggers @ 2020-12-17 18:35 UTC (permalink / raw)
  To: dsterba, linux-crypto, linux-arm-kernel, Ard Biesheuvel,
	Herbert Xu, David Sterba, Jason A . Donenfeld, Paul Crowley

On Thu, Dec 17, 2020 at 06:15:17PM +0100, David Sterba wrote:
> On Tue, Dec 15, 2020 at 03:47:05PM -0800, Eric Biggers wrote:
> > From: Eric Biggers <ebiggers@google.com>
> > 
> > The shash_alg structs for the four variants of BLAKE2b are identical
> > except for the algorithm name, driver name, and digest size.  So, avoid
> > code duplication by using a macro to define these structs.
> > 
> > Signed-off-by: Eric Biggers <ebiggers@google.com>
> 
> Reviewed-by: David Sterba <dsterba@suse.com>
> 
> > +static struct shash_alg blake2b_algs[] = {
> > +	BLAKE2B_ALG("blake2b-160", "blake2b-160-generic",
> > +		    BLAKE2B_160_HASH_SIZE),
> 
> Spelling out the algo names as string is better as it is greppable and
> matches the module name, compared to using the #stringify macro
> operator.
> 

Yes, I considered trying to make it so that BLAKE2B_ALG(n) would generate the
names and constant for blake2b-$n, but it seemed better to keep it greppable.

- Eric
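
For contrast, the rejected stringizing approach would have looked
something like the following hypothetical sketch (never part of the
patch), which shows why it loses greppability:

	/* Hypothetical: BLAKE2B_ALG_N(256) builds the same initializer as
	 * BLAKE2B_ALG("blake2b-256", "blake2b-256-generic",
	 * BLAKE2B_256_HASH_SIZE), but the string "blake2b-256" never
	 * appears verbatim in the source, so grepping for it finds nothing.
	 */
	#define BLAKE2B_ALG_N(n)					\
		BLAKE2B_ALG("blake2b-" #n, "blake2b-" #n "-generic",	\
			    BLAKE2B_ ## n ## _HASH_SIZE)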

* Re: [PATCH 3/5] crypto: blake2b - export helpers for optimized implementations
  2020-12-17 17:15     ` David Sterba
@ 2020-12-17 22:33       ` Eric Biggers
  -1 siblings, 0 replies; 34+ messages in thread
From: Eric Biggers @ 2020-12-17 22:33 UTC (permalink / raw)
  To: dsterba, linux-crypto, linux-arm-kernel, Ard Biesheuvel,
	Herbert Xu, David Sterba, Jason A . Donenfeld, Paul Crowley

On Thu, Dec 17, 2020 at 06:15:40PM +0100, David Sterba wrote:
> On Tue, Dec 15, 2020 at 03:47:06PM -0800, Eric Biggers wrote:
> > From: Eric Biggers <ebiggers@google.com>
> > 
> > In preparation for adding architecture-specific implementations of
> > BLAKE2b, create a header <crypto/blake2b.h> that contains common
> > constants, structs, and helper functions for BLAKE2b.
> > 
> > Furthermore, export the BLAKE2b generic setkey(), init(), update(), and
> > final() functions, and add functions __crypto_blake2b_update() and
> > __crypto_blake2b_final() which take a pointer to a
> > blake2b_compress_blocks_t function.
> > 
> > This way, optimized implementations of BLAKE2b only have to provide an
> > implementation of blake2b_compress_blocks_t.  (This is modeled on how
> > the nhpoly1305 implementations work.  Also, the prototype of
> > blake2b_compress_blocks_t is meant to be similar to that of
> > blake2s_compress_arch().)
> > 
> > Signed-off-by: Eric Biggers <ebiggers@google.com>
> 
> Reviewed-by: David Sterba <dsterba@suse.com>

I ended up making some changes to this patch in v2, to keep things consistent
with what I decided to do for BLAKE2s, so I didn't add your Reviewed-by to this
patch.  If you could review the new version too that would be great -- thanks!

- Eric

end of thread, other threads:[~2020-12-17 22:35 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-15 23:47 [PATCH 0/5] crypto: add NEON-optimized BLAKE2b Eric Biggers
2020-12-15 23:47 ` [PATCH 1/5] crypto: blake2b - rename constants for consistency with blake2s Eric Biggers
2020-12-17 17:13   ` David Sterba
2020-12-15 23:47 ` [PATCH 2/5] crypto: blake2b - define shash_alg structs using macros Eric Biggers
2020-12-17 17:15   ` David Sterba
2020-12-17 18:35     ` Eric Biggers
2020-12-15 23:47 ` [PATCH 3/5] crypto: blake2b - export helpers for optimized implementations Eric Biggers
2020-12-17 17:15   ` David Sterba
2020-12-17 22:33     ` Eric Biggers
2020-12-15 23:47 ` [PATCH 4/5] crypto: blake2b - update file comment Eric Biggers
2020-12-17 17:17   ` David Sterba
2020-12-15 23:47 ` [PATCH 5/5] crypto: arm/blake2b - add NEON-optimized BLAKE2b implementation Eric Biggers
2020-12-16 20:57   ` Eric Biggers
2020-12-16 20:47 ` [PATCH 0/5] crypto: add NEON-optimized BLAKE2b Eric Biggers
2020-12-16 22:32   ` Jason A. Donenfeld
2020-12-17  3:54     ` Eric Biggers
2020-12-17 14:01       ` Jason A. Donenfeld
