* [PATCH 0/4] crypto: ARM64 NEON optimized XChaCha and NHPoly1305 (for Adiantum)
@ 2018-11-29  6:34 Eric Biggers
  2018-11-29  6:34 ` [PATCH 1/4] crypto: arm64/nhpoly1305 - add NEON-accelerated NHPoly1305 Eric Biggers
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Eric Biggers @ 2018-11-29  6:34 UTC (permalink / raw)
  To: linux-crypto
  Cc: Paul Crowley, Ard Biesheuvel, Jason A. Donenfeld,
	linux-arm-kernel, linux-kernel

Hello,

This series optimizes the Adiantum encryption mode for ARM64 by adding
a NEON-accelerated implementation of NHPoly1305 (specifically, its NH
part), and by modifying the existing ARM64 NEON ChaCha20 implementation
to also support XChaCha20 and XChaCha12.

This greatly improves Adiantum performance on ARM64.  For example,
encrypting 4096-byte messages (single-threaded) on a Raspberry Pi 3
Model B v1.2, which has a Cortex-A53 processor:

                           Before            After
                           ---------         ---------
adiantum(xchacha12,aes)    44.1 MB/s         82.7 MB/s
adiantum(xchacha20,aes)    35.5 MB/s         65.7 MB/s

Decryption is the same speed as encryption.

The biggest benefit comes from accelerating XChaCha.  Accelerating NH
gives a somewhat smaller, but still significant benefit.

Performance on 512-byte inputs is also improved, though such short
inputs are much slower in the first place.  When Adiantum is used with
dm-crypt (or cryptsetup), we recommend using a 4096-byte sector size.
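
For example, with a cryptsetup release that supports Adiantum (the
exact flags below follow cryptsetup's convention and are illustrative
only; they are not part of this series):

	cryptsetup luksFormat --type luks2 \
		--cipher xchacha12,aes-adiantum-plain64 \
		--key-size 256 --sector-size 4096 /dev/<device>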

For comparison, on the same hardware AES-256-XTS encryption runs at
only 24.5 MB/s and decryption at 21.6 MB/s, both using the
NEON-bitsliced implementation ("xts-aes-neonbs").  That is the fastest
AES-256-XTS implementation on this processor, since it lacks the ARMv8
Cryptography Extensions.  This is despite Adiantum also being a
super-pseudorandom permutation (SPRP) over the entire sector, unlike
XTS, which is only a pseudorandom permutation over each 16-byte block.

Note that XChaCha20 and XChaCha12 are general-purpose stream ciphers,
so they can be used for other purposes besides Adiantum too.
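
For in-kernel users they are reachable through the standard crypto API.
A minimal sketch, given a key and a virtually-contiguous buffer buf of
buflen bytes (error handling omitted; these are the existing skcipher
API calls, nothing added by this series):

	struct crypto_skcipher *tfm;
	struct skcipher_request *req;
	struct scatterlist sg;
	u8 iv[32] = {};		/* 24-byte nonce || 64-bit LE block counter */

	tfm = crypto_alloc_skcipher("xchacha12", 0, 0);
	crypto_skcipher_setkey(tfm, key, 32);
	req = skcipher_request_alloc(tfm, GFP_KERNEL);
	sg_init_one(&sg, buf, buflen);
	skcipher_request_set_callback(req, 0, NULL, NULL);
	skcipher_request_set_crypt(req, &sg, &sg, buflen, iv);
	crypto_skcipher_encrypt(req);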

Eric Biggers (4):
  crypto: arm64/nhpoly1305 - add NEON-accelerated NHPoly1305
  crypto: arm64/chacha20 - add XChaCha20 support
  crypto: arm64/chacha20 - refactor to allow varying number of rounds
  crypto: arm64/chacha - add XChaCha12 support

 arch/arm64/crypto/Kconfig                     |   7 +-
 arch/arm64/crypto/Makefile                    |   7 +-
 ...hacha20-neon-core.S => chacha-neon-core.S} |  89 +++++---
 arch/arm64/crypto/chacha-neon-glue.c          | 207 ++++++++++++++++++
 arch/arm64/crypto/chacha20-neon-glue.c        | 133 -----------
 arch/arm64/crypto/nh-neon-core.S              | 103 +++++++++
 arch/arm64/crypto/nhpoly1305-neon-glue.c      |  77 +++++++
 7 files changed, 457 insertions(+), 166 deletions(-)
 rename arch/arm64/crypto/{chacha20-neon-core.S => chacha-neon-core.S} (90%)
 create mode 100644 arch/arm64/crypto/chacha-neon-glue.c
 delete mode 100644 arch/arm64/crypto/chacha20-neon-glue.c
 create mode 100644 arch/arm64/crypto/nh-neon-core.S
 create mode 100644 arch/arm64/crypto/nhpoly1305-neon-glue.c

-- 
2.19.2



* [PATCH 1/4] crypto: arm64/nhpoly1305 - add NEON-accelerated NHPoly1305
  2018-11-29  6:34 [PATCH 0/4] crypto: ARM64 NEON optimized XChaCha and NHPoly1305 (for Adiantum) Eric Biggers
@ 2018-11-29  6:34 ` Eric Biggers
  2018-11-30 12:58   ` Ard Biesheuvel
  2018-11-29  6:34 ` [PATCH 2/4] crypto: arm64/chacha20 - add XChaCha20 support Eric Biggers
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: Eric Biggers @ 2018-11-29  6:34 UTC (permalink / raw)
  To: linux-crypto
  Cc: Paul Crowley, Ard Biesheuvel, Jason A. Donenfeld,
	linux-arm-kernel, linux-kernel

From: Eric Biggers <ebiggers@google.com>

Add an ARM64 NEON implementation of NHPoly1305, an ε-almost-∆-universal
hash function used in the Adiantum encryption mode.  For now, only the
NH portion is actually NEON-accelerated; the Poly1305 part is less
performance-critical so is just implemented in C.
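
For reference, each 16-byte message unit contributes to the four NH
passes as follows (simplified C sketch of the underlying math; the
names are illustrative, this is not code from the patch):

	u32 m0, m1, m2, m3;	/* the unit's four 32-bit LE words */
	u64 sums[4];		/* one accumulator per pass */
	int i;

	for (i = 0; i < 4; i++) {	/* pass i uses key words 4*i.. */
		sums[i] += (u64)(u32)(m0 + key[4*i]) * (u32)(m2 + key[4*i + 2]);
		sums[i] += (u64)(u32)(m1 + key[4*i + 1]) * (u32)(m3 + key[4*i + 3]);
	}

The additions are mod 2^32, and each product is a full 32x32 => 64 bit
multiply; the umlal instructions in nh-neon-core.S compute two such
products per instruction.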

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 arch/arm64/crypto/Kconfig                |   5 ++
 arch/arm64/crypto/Makefile               |   3 +
 arch/arm64/crypto/nh-neon-core.S         | 103 +++++++++++++++++++++++
 arch/arm64/crypto/nhpoly1305-neon-glue.c |  77 +++++++++++++++++
 4 files changed, 188 insertions(+)
 create mode 100644 arch/arm64/crypto/nh-neon-core.S
 create mode 100644 arch/arm64/crypto/nhpoly1305-neon-glue.c

diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index a5606823ed4da..3f5aeb786192b 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -106,6 +106,11 @@ config CRYPTO_CHACHA20_NEON
 	select CRYPTO_BLKCIPHER
 	select CRYPTO_CHACHA20
 
+config CRYPTO_NHPOLY1305_NEON
+	tristate "NHPoly1305 hash function using NEON instructions (for Adiantum)"
+	depends on KERNEL_MODE_NEON
+	select CRYPTO_NHPOLY1305
+
 config CRYPTO_AES_ARM64_BS
 	tristate "AES in ECB/CBC/CTR/XTS modes using bit-sliced NEON algorithm"
 	depends on KERNEL_MODE_NEON
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index f476fede09ba4..125dbb10a93ed 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -53,6 +53,9 @@ sha512-arm64-y := sha512-glue.o sha512-core.o
 obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha20-neon.o
 chacha20-neon-y := chacha20-neon-core.o chacha20-neon-glue.o
 
+obj-$(CONFIG_CRYPTO_NHPOLY1305_NEON) += nhpoly1305-neon.o
+nhpoly1305-neon-y := nh-neon-core.o nhpoly1305-neon-glue.o
+
 obj-$(CONFIG_CRYPTO_AES_ARM64) += aes-arm64.o
 aes-arm64-y := aes-cipher-core.o aes-cipher-glue.o
 
diff --git a/arch/arm64/crypto/nh-neon-core.S b/arch/arm64/crypto/nh-neon-core.S
new file mode 100644
index 0000000000000..e08d57676127a
--- /dev/null
+++ b/arch/arm64/crypto/nh-neon-core.S
@@ -0,0 +1,103 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * NH - ε-almost-universal hash function, ARM64 NEON accelerated version
+ *
+ * Copyright 2018 Google LLC
+ *
+ * Author: Eric Biggers <ebiggers@google.com>
+ */
+
+#include <linux/linkage.h>
+
+	KEY		.req	x0
+	MESSAGE		.req	x1
+	MESSAGE_LEN	.req	x2
+	HASH		.req	x3
+
+	PASS0_SUMS	.req	v0
+	PASS1_SUMS	.req	v1
+	PASS2_SUMS	.req	v2
+	PASS3_SUMS	.req	v3
+	K0		.req	v4
+	K1		.req	v5
+	K2		.req	v6
+	K3		.req	v7
+	T0		.req	v8
+	T1		.req	v9
+	T2		.req	v10
+	T3		.req	v11
+	T4		.req	v12
+	T5		.req	v13
+	T6		.req	v14
+	T7		.req	v15
+
+.macro _nh_stride	k0, k1, k2, k3
+
+	// Load next message stride
+	ld1		{T3.16b}, [MESSAGE], #16
+
+	// Load next key stride
+	ld1		{\k3\().4s}, [KEY], #16
+
+	// Add message words to key words
+	add		T0.4s, T3.4s, \k0\().4s
+	add		T1.4s, T3.4s, \k1\().4s
+	add		T2.4s, T3.4s, \k2\().4s
+	add		T3.4s, T3.4s, \k3\().4s
+
+	// Multiply 32x32 => 64 and accumulate
+	mov             T4.d[0], T0.d[1]
+	mov             T5.d[0], T1.d[1]
+	mov             T6.d[0], T2.d[1]
+	mov             T7.d[0], T3.d[1]
+	umlal		PASS0_SUMS.2d, T0.2s, T4.2s
+	umlal		PASS1_SUMS.2d, T1.2s, T5.2s
+	umlal		PASS2_SUMS.2d, T2.2s, T6.2s
+	umlal		PASS3_SUMS.2d, T3.2s, T7.2s
+.endm
+
+/*
+ * void nh_neon(const u32 *key, const u8 *message, size_t message_len,
+ *		u8 hash[NH_HASH_BYTES])
+ *
+ * It's guaranteed that message_len % 16 == 0.
+ */
+ENTRY(nh_neon)
+
+	ld1		{K0.4s,K1.4s}, [KEY], #32
+	  movi		PASS0_SUMS.2d, #0
+	  movi		PASS1_SUMS.2d, #0
+	ld1		{K2.4s}, [KEY], #16
+	  movi		PASS2_SUMS.2d, #0
+	  movi		PASS3_SUMS.2d, #0
+
+	subs		MESSAGE_LEN, MESSAGE_LEN, #64
+	blt		.Lloop4_done
+.Lloop4:
+	_nh_stride	K0, K1, K2, K3
+	_nh_stride	K1, K2, K3, K0
+	_nh_stride	K2, K3, K0, K1
+	_nh_stride	K3, K0, K1, K2
+	subs		MESSAGE_LEN, MESSAGE_LEN, #64
+	bge		.Lloop4
+
+.Lloop4_done:
+	ands		MESSAGE_LEN, MESSAGE_LEN, #63
+	beq		.Ldone
+	_nh_stride	K0, K1, K2, K3
+
+	subs		MESSAGE_LEN, MESSAGE_LEN, #16
+	beq		.Ldone
+	_nh_stride	K1, K2, K3, K0
+
+	subs		MESSAGE_LEN, MESSAGE_LEN, #16
+	beq		.Ldone
+	_nh_stride	K2, K3, K0, K1
+
+.Ldone:
+	// Sum the accumulators for each pass, then store the sums to 'hash'
+	addp		T0.2d, PASS0_SUMS.2d, PASS1_SUMS.2d
+	addp		T1.2d, PASS2_SUMS.2d, PASS3_SUMS.2d
+	st1		{T0.16b,T1.16b}, [HASH]
+	ret
+ENDPROC(nh_neon)
diff --git a/arch/arm64/crypto/nhpoly1305-neon-glue.c b/arch/arm64/crypto/nhpoly1305-neon-glue.c
new file mode 100644
index 0000000000000..22cc32ac9448d
--- /dev/null
+++ b/arch/arm64/crypto/nhpoly1305-neon-glue.c
@@ -0,0 +1,77 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * NHPoly1305 - ε-almost-∆-universal hash function for Adiantum
+ * (ARM64 NEON accelerated version)
+ *
+ * Copyright 2018 Google LLC
+ */
+
+#include <asm/neon.h>
+#include <asm/simd.h>
+#include <crypto/internal/hash.h>
+#include <crypto/nhpoly1305.h>
+#include <linux/module.h>
+
+asmlinkage void nh_neon(const u32 *key, const u8 *message, size_t message_len,
+			u8 hash[NH_HASH_BYTES]);
+
+/* wrapper to avoid indirect call to assembly, which doesn't work with CFI */
+static void _nh_neon(const u32 *key, const u8 *message, size_t message_len,
+		     __le64 hash[NH_NUM_PASSES])
+{
+	nh_neon(key, message, message_len, (u8 *)hash);
+}
+
+static int nhpoly1305_neon_update(struct shash_desc *desc,
+				  const u8 *src, unsigned int srclen)
+{
+	if (srclen < 64 || !may_use_simd())
+		return crypto_nhpoly1305_update(desc, src, srclen);
+
+	do {
+		unsigned int n = min_t(unsigned int, srclen, PAGE_SIZE);
+
+		kernel_neon_begin();
+		crypto_nhpoly1305_update_helper(desc, src, n, _nh_neon);
+		kernel_neon_end();
+		src += n;
+		srclen -= n;
+	} while (srclen);
+	return 0;
+}
+
+static struct shash_alg nhpoly1305_alg = {
+	.base.cra_name		= "nhpoly1305",
+	.base.cra_driver_name	= "nhpoly1305-neon",
+	.base.cra_priority	= 200,
+	.base.cra_ctxsize	= sizeof(struct nhpoly1305_key),
+	.base.cra_module	= THIS_MODULE,
+	.digestsize		= POLY1305_DIGEST_SIZE,
+	.init			= crypto_nhpoly1305_init,
+	.update			= nhpoly1305_neon_update,
+	.final			= crypto_nhpoly1305_final,
+	.setkey			= crypto_nhpoly1305_setkey,
+	.descsize		= sizeof(struct nhpoly1305_state),
+};
+
+static int __init nhpoly1305_mod_init(void)
+{
+	if (!(elf_hwcap & HWCAP_ASIMD))
+		return -ENODEV;
+
+	return crypto_register_shash(&nhpoly1305_alg);
+}
+
+static void __exit nhpoly1305_mod_exit(void)
+{
+	crypto_unregister_shash(&nhpoly1305_alg);
+}
+
+module_init(nhpoly1305_mod_init);
+module_exit(nhpoly1305_mod_exit);
+
+MODULE_DESCRIPTION("NHPoly1305 ε-almost-∆-universal hash function (NEON-accelerated)");
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Eric Biggers <ebiggers@google.com>");
+MODULE_ALIAS_CRYPTO("nhpoly1305");
+MODULE_ALIAS_CRYPTO("nhpoly1305-neon");
-- 
2.19.2



* [PATCH 2/4] crypto: arm64/chacha20 - add XChaCha20 support
  2018-11-29  6:34 [PATCH 0/4] crypto: ARM64 NEON optimized XChaCha and NHPoly1305 (for Adiantum) Eric Biggers
  2018-11-29  6:34 ` [PATCH 1/4] crypto: arm64/nhpoly1305 - add NEON-accelerated NHPoly1305 Eric Biggers
@ 2018-11-29  6:34 ` Eric Biggers
  2018-11-30 13:02   ` Ard Biesheuvel
  2018-11-29  6:34 ` [PATCH 3/4] crypto: arm64/chacha20 - refactor to allow varying number of rounds Eric Biggers
  2018-11-29  6:34 ` [PATCH 4/4] crypto: arm64/chacha - add XChaCha12 support Eric Biggers
  3 siblings, 1 reply; 9+ messages in thread
From: Eric Biggers @ 2018-11-29  6:34 UTC (permalink / raw)
  To: linux-crypto
  Cc: Paul Crowley, Ard Biesheuvel, Jason A. Donenfeld,
	linux-arm-kernel, linux-kernel

From: Eric Biggers <ebiggers@google.com>

Add an XChaCha20 implementation that is hooked up to the ARM64 NEON
implementation of ChaCha20.  This can be used by Adiantum.

A NEON implementation of single-block HChaCha20 is also added so that
XChaCha20 can use it rather than the generic implementation.  This
required refactoring the ChaCha permutation into its own function.
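
For reference, the construction being wired up is, in pseudocode (not
code from this patch):

	subkey = HChaCha20(key, nonce[0..15]);	/* permutation only,
						   no final state add */
	stream = ChaCha20(subkey, counter64 || nonce[16..23]);

where nonce is the 24-byte XChaCha nonce.  The two memcpy() calls in
xchacha20_neon() below assemble exactly that 16-byte ChaCha IV from the
32-byte XChaCha IV (24-byte nonce followed by the block counter).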

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 arch/arm64/crypto/Kconfig              |   2 +-
 arch/arm64/crypto/chacha20-neon-core.S |  62 ++++++++++-----
 arch/arm64/crypto/chacha20-neon-glue.c | 101 +++++++++++++++++++------
 3 files changed, 121 insertions(+), 44 deletions(-)

diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index 3f5aeb786192b..d54ddb8468ef5 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -101,7 +101,7 @@ config CRYPTO_AES_ARM64_NEON_BLK
 	select CRYPTO_SIMD
 
 config CRYPTO_CHACHA20_NEON
-	tristate "NEON accelerated ChaCha20 symmetric cipher"
+	tristate "ChaCha20 and XChaCha20 stream ciphers using NEON instructions"
 	depends on KERNEL_MODE_NEON
 	select CRYPTO_BLKCIPHER
 	select CRYPTO_CHACHA20
diff --git a/arch/arm64/crypto/chacha20-neon-core.S b/arch/arm64/crypto/chacha20-neon-core.S
index 13c85e272c2a1..378850505c0ae 100644
--- a/arch/arm64/crypto/chacha20-neon-core.S
+++ b/arch/arm64/crypto/chacha20-neon-core.S
@@ -23,25 +23,20 @@
 	.text
 	.align		6
 
-ENTRY(chacha20_block_xor_neon)
-	// x0: Input state matrix, s
-	// x1: 1 data block output, o
-	// x2: 1 data block input, i
-
-	//
-	// This function encrypts one ChaCha20 block by loading the state matrix
-	// in four NEON registers. It performs matrix operation on four words in
-	// parallel, but requires shuffling to rearrange the words after each
-	// round.
-	//
-
-	// x0..3 = s0..3
-	adr		x3, ROT8
-	ld1		{v0.4s-v3.4s}, [x0]
-	ld1		{v8.4s-v11.4s}, [x0]
-	ld1		{v12.4s}, [x3]
+/*
+ * chacha20_permute - permute one block
+ *
+ * Permute one 64-byte block where the state matrix is stored in the four NEON
+ * registers v0-v3.  It performs matrix operations on four words in parallel,
+ * but requires shuffling to rearrange the words after each round.
+ *
+ * Clobbers: x3, x10, v4, v12
+ */
+chacha20_permute:
 
 	mov		x3, #10
+	adr		x10, ROT8
+	ld1		{v12.4s}, [x10]
 
 .Ldoubleround:
 	// x0 += x1, x3 = rotl32(x3 ^ x0, 16)
@@ -105,6 +100,22 @@ ENTRY(chacha20_block_xor_neon)
 	subs		x3, x3, #1
 	b.ne		.Ldoubleround
 
+	ret
+ENDPROC(chacha20_permute)
+
+ENTRY(chacha20_block_xor_neon)
+	// x0: Input state matrix, s
+	// x1: 1 data block output, o
+	// x2: 1 data block input, i
+
+	mov		x9, lr
+
+	// x0..3 = s0..3
+	ld1		{v0.4s-v3.4s}, [x0]
+	ld1		{v8.4s-v11.4s}, [x0]
+
+	bl		chacha20_permute
+
 	ld1		{v4.16b-v7.16b}, [x2]
 
 	// o0 = i0 ^ (x0 + s0)
@@ -125,9 +136,24 @@ ENTRY(chacha20_block_xor_neon)
 
 	st1		{v0.16b-v3.16b}, [x1]
 
-	ret
+	ret		x9
 ENDPROC(chacha20_block_xor_neon)
 
+ENTRY(hchacha20_block_neon)
+	// x0: Input state matrix, s
+	// x1: output (8 32-bit words)
+	mov		x9, lr
+
+	ld1		{v0.4s-v3.4s}, [x0]
+
+	bl		chacha20_permute
+
+	st1		{v0.16b}, [x1], #16
+	st1		{v3.16b}, [x1]
+
+	ret		x9
+ENDPROC(hchacha20_block_neon)
+
 	.align		6
 ENTRY(chacha20_4block_xor_neon)
 	// x0: Input state matrix, s
diff --git a/arch/arm64/crypto/chacha20-neon-glue.c b/arch/arm64/crypto/chacha20-neon-glue.c
index 96e0cfb8c3f5b..a5b9cbc0c4de4 100644
--- a/arch/arm64/crypto/chacha20-neon-glue.c
+++ b/arch/arm64/crypto/chacha20-neon-glue.c
@@ -30,6 +30,7 @@
 
 asmlinkage void chacha20_block_xor_neon(u32 *state, u8 *dst, const u8 *src);
 asmlinkage void chacha20_4block_xor_neon(u32 *state, u8 *dst, const u8 *src);
+asmlinkage void hchacha20_block_neon(const u32 *state, u32 *out);
 
 static void chacha20_doneon(u32 *state, u8 *dst, const u8 *src,
 			    unsigned int bytes)
@@ -65,20 +66,16 @@ static void chacha20_doneon(u32 *state, u8 *dst, const u8 *src,
 	kernel_neon_end();
 }
 
-static int chacha20_neon(struct skcipher_request *req)
+static int chacha20_neon_stream_xor(struct skcipher_request *req,
+				    struct chacha_ctx *ctx, u8 *iv)
 {
-	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
-	struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
 	struct skcipher_walk walk;
 	u32 state[16];
 	int err;
 
-	if (!may_use_simd() || req->cryptlen <= CHACHA_BLOCK_SIZE)
-		return crypto_chacha_crypt(req);
-
 	err = skcipher_walk_virt(&walk, req, false);
 
-	crypto_chacha_init(state, ctx, walk.iv);
+	crypto_chacha_init(state, ctx, iv);
 
 	while (walk.nbytes > 0) {
 		unsigned int nbytes = walk.nbytes;
@@ -94,22 +91,73 @@ static int chacha20_neon(struct skcipher_request *req)
 	return err;
 }
 
-static struct skcipher_alg alg = {
-	.base.cra_name		= "chacha20",
-	.base.cra_driver_name	= "chacha20-neon",
-	.base.cra_priority	= 300,
-	.base.cra_blocksize	= 1,
-	.base.cra_ctxsize	= sizeof(struct chacha_ctx),
-	.base.cra_module	= THIS_MODULE,
-
-	.min_keysize		= CHACHA_KEY_SIZE,
-	.max_keysize		= CHACHA_KEY_SIZE,
-	.ivsize			= CHACHA_IV_SIZE,
-	.chunksize		= CHACHA_BLOCK_SIZE,
-	.walksize		= 4 * CHACHA_BLOCK_SIZE,
-	.setkey			= crypto_chacha20_setkey,
-	.encrypt		= chacha20_neon,
-	.decrypt		= chacha20_neon,
+static int chacha20_neon(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+	if (req->cryptlen <= CHACHA_BLOCK_SIZE || !may_use_simd())
+		return crypto_chacha_crypt(req);
+
+	return chacha20_neon_stream_xor(req, ctx, req->iv);
+}
+
+static int xchacha20_neon(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct chacha_ctx subctx;
+	u32 state[16];
+	u8 real_iv[16];
+
+	if (req->cryptlen <= CHACHA_BLOCK_SIZE || !may_use_simd())
+		return crypto_xchacha_crypt(req);
+
+	crypto_chacha_init(state, ctx, req->iv);
+
+	kernel_neon_begin();
+	hchacha20_block_neon(state, subctx.key);
+	kernel_neon_end();
+
+	memcpy(&real_iv[0], req->iv + 24, 8);
+	memcpy(&real_iv[8], req->iv + 16, 8);
+	return chacha20_neon_stream_xor(req, &subctx, real_iv);
+}
+
+static struct skcipher_alg algs[] = {
+	{
+		.base.cra_name		= "chacha20",
+		.base.cra_driver_name	= "chacha20-neon",
+		.base.cra_priority	= 300,
+		.base.cra_blocksize	= 1,
+		.base.cra_ctxsize	= sizeof(struct chacha_ctx),
+		.base.cra_module	= THIS_MODULE,
+
+		.min_keysize		= CHACHA_KEY_SIZE,
+		.max_keysize		= CHACHA_KEY_SIZE,
+		.ivsize			= CHACHA_IV_SIZE,
+		.chunksize		= CHACHA_BLOCK_SIZE,
+		.walksize		= 4 * CHACHA_BLOCK_SIZE,
+		.setkey			= crypto_chacha20_setkey,
+		.encrypt		= chacha20_neon,
+		.decrypt		= chacha20_neon,
+	}, {
+		.base.cra_name		= "xchacha20",
+		.base.cra_driver_name	= "xchacha20-neon",
+		.base.cra_priority	= 300,
+		.base.cra_blocksize	= 1,
+		.base.cra_ctxsize	= sizeof(struct chacha_ctx),
+		.base.cra_module	= THIS_MODULE,
+
+		.min_keysize		= CHACHA_KEY_SIZE,
+		.max_keysize		= CHACHA_KEY_SIZE,
+		.ivsize			= XCHACHA_IV_SIZE,
+		.chunksize		= CHACHA_BLOCK_SIZE,
+		.walksize		= 4 * CHACHA_BLOCK_SIZE,
+		.setkey			= crypto_chacha20_setkey,
+		.encrypt		= xchacha20_neon,
+		.decrypt		= xchacha20_neon,
+	}
 };
 
 static int __init chacha20_simd_mod_init(void)
@@ -117,12 +165,12 @@ static int __init chacha20_simd_mod_init(void)
 	if (!(elf_hwcap & HWCAP_ASIMD))
 		return -ENODEV;
 
-	return crypto_register_skcipher(&alg);
+	return crypto_register_skciphers(algs, ARRAY_SIZE(algs));
 }
 
 static void __exit chacha20_simd_mod_fini(void)
 {
-	crypto_unregister_skcipher(&alg);
+	crypto_unregister_skciphers(algs, ARRAY_SIZE(algs));
 }
 
 module_init(chacha20_simd_mod_init);
@@ -131,3 +179,6 @@ module_exit(chacha20_simd_mod_fini);
 MODULE_AUTHOR("Ard Biesheuvel <ard.biesheuvel@linaro.org>");
 MODULE_LICENSE("GPL v2");
 MODULE_ALIAS_CRYPTO("chacha20");
+MODULE_ALIAS_CRYPTO("chacha20-neon");
+MODULE_ALIAS_CRYPTO("xchacha20");
+MODULE_ALIAS_CRYPTO("xchacha20-neon");
-- 
2.19.2



* [PATCH 3/4] crypto: arm64/chacha20 - refactor to allow varying number of rounds
  2018-11-29  6:34 [PATCH 0/4] crypto: ARM64 NEON optimized XChaCha and NHPoly1305 (for Adiantum) Eric Biggers
  2018-11-29  6:34 ` [PATCH 1/4] crypto: arm64/nhpoly1305 - add NEON-accelerated NHPoly1305 Eric Biggers
  2018-11-29  6:34 ` [PATCH 2/4] crypto: arm64/chacha20 - add XChaCha20 support Eric Biggers
@ 2018-11-29  6:34 ` Eric Biggers
  2018-11-30 13:03   ` Ard Biesheuvel
  2018-11-29  6:34 ` [PATCH 4/4] crypto: arm64/chacha - add XChaCha12 support Eric Biggers
  3 siblings, 1 reply; 9+ messages in thread
From: Eric Biggers @ 2018-11-29  6:34 UTC (permalink / raw)
  To: linux-crypto
  Cc: Paul Crowley, Ard Biesheuvel, Jason A. Donenfeld,
	linux-arm-kernel, linux-kernel

From: Eric Biggers <ebiggers@google.com>

In preparation for adding XChaCha12 support, rename/refactor the ARM64
NEON implementation of ChaCha20 to support different numbers of rounds.
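
The essential change is that the double-round loop now counts the
requested rounds down by two per iteration instead of running a
hard-coded ten iterations; in C terms, roughly:

	for (i = nrounds; i > 0; i -= 2)	/* 20 => 10 iterations */
		doubleround(state);		/* (12 => 6 for ChaCha12) */

(doubleround is an illustrative name; in the assembly this corresponds
to the 'subs w3, w3, #2' change in the .Ldoubleround loops.)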

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 arch/arm64/crypto/Makefile                    |  4 +-
 ...hacha20-neon-core.S => chacha-neon-core.S} | 45 ++++++++-------
 ...hacha20-neon-glue.c => chacha-neon-glue.c} | 57 ++++++++++---------
 3 files changed, 57 insertions(+), 49 deletions(-)
 rename arch/arm64/crypto/{chacha20-neon-core.S => chacha-neon-core.S} (94%)
 rename arch/arm64/crypto/{chacha20-neon-glue.c => chacha-neon-glue.c} (71%)

diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index 125dbb10a93ed..a4ffd9fe32650 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -50,8 +50,8 @@ sha256-arm64-y := sha256-glue.o sha256-core.o
 obj-$(CONFIG_CRYPTO_SHA512_ARM64) += sha512-arm64.o
 sha512-arm64-y := sha512-glue.o sha512-core.o
 
-obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha20-neon.o
-chacha20-neon-y := chacha20-neon-core.o chacha20-neon-glue.o
+obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha-neon.o
+chacha-neon-y := chacha-neon-core.o chacha-neon-glue.o
 
 obj-$(CONFIG_CRYPTO_NHPOLY1305_NEON) += nhpoly1305-neon.o
 nhpoly1305-neon-y := nh-neon-core.o nhpoly1305-neon-glue.o
diff --git a/arch/arm64/crypto/chacha20-neon-core.S b/arch/arm64/crypto/chacha-neon-core.S
similarity index 94%
rename from arch/arm64/crypto/chacha20-neon-core.S
rename to arch/arm64/crypto/chacha-neon-core.S
index 378850505c0ae..75b4e06cee79f 100644
--- a/arch/arm64/crypto/chacha20-neon-core.S
+++ b/arch/arm64/crypto/chacha-neon-core.S
@@ -1,5 +1,5 @@
 /*
- * ChaCha20 256-bit cipher algorithm, RFC7539, arm64 NEON functions
+ * ChaCha/XChaCha NEON helper functions
  *
  * Copyright (C) 2016 Linaro, Ltd. <ard.biesheuvel@linaro.org>
  *
@@ -24,17 +24,18 @@
 	.align		6
 
 /*
- * chacha20_permute - permute one block
+ * chacha_permute - permute one block
  *
  * Permute one 64-byte block where the state matrix is stored in the four NEON
  * registers v0-v3.  It performs matrix operations on four words in parallel,
  * but requires shuffling to rearrange the words after each round.
  *
- * Clobbers: x3, x10, v4, v12
+ * The round count is given in w3.
+ *
+ * Clobbers: w3, x10, v4, v12
  */
-chacha20_permute:
+chacha_permute:
 
-	mov		x3, #10
 	adr		x10, ROT8
 	ld1		{v12.4s}, [x10]
 
@@ -97,16 +98,17 @@ chacha20_permute:
 	// x3 = shuffle32(x3, MASK(0, 3, 2, 1))
 	ext		v3.16b, v3.16b, v3.16b, #4
 
-	subs		x3, x3, #1
+	subs		w3, w3, #2
 	b.ne		.Ldoubleround
 
 	ret
-ENDPROC(chacha20_permute)
+ENDPROC(chacha_permute)
 
-ENTRY(chacha20_block_xor_neon)
+ENTRY(chacha_block_xor_neon)
 	// x0: Input state matrix, s
 	// x1: 1 data block output, o
 	// x2: 1 data block input, i
+	// w3: nrounds
 
 	mov		x9, lr
 
@@ -114,7 +116,7 @@ ENTRY(chacha20_block_xor_neon)
 	ld1		{v0.4s-v3.4s}, [x0]
 	ld1		{v8.4s-v11.4s}, [x0]
 
-	bl		chacha20_permute
+	bl		chacha_permute
 
 	ld1		{v4.16b-v7.16b}, [x2]
 
@@ -137,39 +139,42 @@ ENTRY(chacha20_block_xor_neon)
 	st1		{v0.16b-v3.16b}, [x1]
 
 	ret		x9
-ENDPROC(chacha20_block_xor_neon)
+ENDPROC(chacha_block_xor_neon)
 
-ENTRY(hchacha20_block_neon)
+ENTRY(hchacha_block_neon)
 	// x0: Input state matrix, s
 	// x1: output (8 32-bit words)
+	// w2: nrounds
 	mov		x9, lr
 
 	ld1		{v0.4s-v3.4s}, [x0]
 
-	bl		chacha20_permute
+	mov		w3, w2
+	bl		chacha_permute
 
 	st1		{v0.16b}, [x1], #16
 	st1		{v3.16b}, [x1]
 
 	ret		x9
-ENDPROC(hchacha20_block_neon)
+ENDPROC(hchacha_block_neon)
 
 	.align		6
-ENTRY(chacha20_4block_xor_neon)
+ENTRY(chacha_4block_xor_neon)
 	// x0: Input state matrix, s
 	// x1: 4 data blocks output, o
 	// x2: 4 data blocks input, i
+	// w3: nrounds
 
 	//
-	// This function encrypts four consecutive ChaCha20 blocks by loading
+	// This function encrypts four consecutive ChaCha blocks by loading
 	// the state matrix in NEON registers four times. The algorithm performs
 	// each operation on the corresponding word of each state matrix, hence
 	// requires no word shuffling. For final XORing step we transpose the
 	// matrix by interleaving 32- and then 64-bit words, which allows us to
 	// do XOR in NEON registers.
 	//
-	adr		x3, CTRINC		// ... and ROT8
-	ld1		{v30.4s-v31.4s}, [x3]
+	adr		x9, CTRINC		// ... and ROT8
+	ld1		{v30.4s-v31.4s}, [x9]
 
 	// x0..15[0-3] = s0..3[0..3]
 	mov		x4, x0
@@ -181,8 +186,6 @@ ENTRY(chacha20_4block_xor_neon)
 	// x12 += counter values 0-3
 	add		v12.4s, v12.4s, v30.4s
 
-	mov		x3, #10
-
 .Ldoubleround4:
 	// x0 += x4, x12 = rotl32(x12 ^ x0, 16)
 	// x1 += x5, x13 = rotl32(x13 ^ x1, 16)
@@ -356,7 +359,7 @@ ENTRY(chacha20_4block_xor_neon)
 	sri		v7.4s, v18.4s, #25
 	sri		v4.4s, v19.4s, #25
 
-	subs		x3, x3, #1
+	subs		w3, w3, #2
 	b.ne		.Ldoubleround4
 
 	ld4r		{v16.4s-v19.4s}, [x0], #16
@@ -470,7 +473,7 @@ ENTRY(chacha20_4block_xor_neon)
 	st1		{v28.16b-v31.16b}, [x1]
 
 	ret
-ENDPROC(chacha20_4block_xor_neon)
+ENDPROC(chacha_4block_xor_neon)
 
 CTRINC:	.word		0, 1, 2, 3
 ROT8:	.word		0x02010003, 0x06050407, 0x0a09080b, 0x0e0d0c0f
diff --git a/arch/arm64/crypto/chacha20-neon-glue.c b/arch/arm64/crypto/chacha-neon-glue.c
similarity index 71%
rename from arch/arm64/crypto/chacha20-neon-glue.c
rename to arch/arm64/crypto/chacha-neon-glue.c
index a5b9cbc0c4de4..4d992029b9121 100644
--- a/arch/arm64/crypto/chacha20-neon-glue.c
+++ b/arch/arm64/crypto/chacha-neon-glue.c
@@ -1,5 +1,6 @@
 /*
- * ChaCha20 256-bit cipher algorithm, RFC7539, arm64 NEON functions
+ * ARM NEON accelerated ChaCha and XChaCha stream ciphers,
+ * including ChaCha20 (RFC7539)
  *
  * Copyright (C) 2016 - 2017 Linaro, Ltd. <ard.biesheuvel@linaro.org>
  *
@@ -28,18 +29,20 @@
 #include <asm/neon.h>
 #include <asm/simd.h>
 
-asmlinkage void chacha20_block_xor_neon(u32 *state, u8 *dst, const u8 *src);
-asmlinkage void chacha20_4block_xor_neon(u32 *state, u8 *dst, const u8 *src);
-asmlinkage void hchacha20_block_neon(const u32 *state, u32 *out);
+asmlinkage void chacha_block_xor_neon(u32 *state, u8 *dst, const u8 *src,
+				      int nrounds);
+asmlinkage void chacha_4block_xor_neon(u32 *state, u8 *dst, const u8 *src,
+				       int nrounds);
+asmlinkage void hchacha_block_neon(const u32 *state, u32 *out, int nrounds);
 
-static void chacha20_doneon(u32 *state, u8 *dst, const u8 *src,
-			    unsigned int bytes)
+static void chacha_doneon(u32 *state, u8 *dst, const u8 *src,
+			  unsigned int bytes, int nrounds)
 {
 	u8 buf[CHACHA_BLOCK_SIZE];
 
 	while (bytes >= CHACHA_BLOCK_SIZE * 4) {
 		kernel_neon_begin();
-		chacha20_4block_xor_neon(state, dst, src);
+		chacha_4block_xor_neon(state, dst, src, nrounds);
 		kernel_neon_end();
 		bytes -= CHACHA_BLOCK_SIZE * 4;
 		src += CHACHA_BLOCK_SIZE * 4;
@@ -52,7 +55,7 @@ static void chacha20_doneon(u32 *state, u8 *dst, const u8 *src,
 
 	kernel_neon_begin();
 	while (bytes >= CHACHA_BLOCK_SIZE) {
-		chacha20_block_xor_neon(state, dst, src);
+		chacha_block_xor_neon(state, dst, src, nrounds);
 		bytes -= CHACHA_BLOCK_SIZE;
 		src += CHACHA_BLOCK_SIZE;
 		dst += CHACHA_BLOCK_SIZE;
@@ -60,14 +63,14 @@ static void chacha20_doneon(u32 *state, u8 *dst, const u8 *src,
 	}
 	if (bytes) {
 		memcpy(buf, src, bytes);
-		chacha20_block_xor_neon(state, buf, buf);
+		chacha_block_xor_neon(state, buf, buf, nrounds);
 		memcpy(dst, buf, bytes);
 	}
 	kernel_neon_end();
 }
 
-static int chacha20_neon_stream_xor(struct skcipher_request *req,
-				    struct chacha_ctx *ctx, u8 *iv)
+static int chacha_neon_stream_xor(struct skcipher_request *req,
+				  struct chacha_ctx *ctx, u8 *iv)
 {
 	struct skcipher_walk walk;
 	u32 state[16];
@@ -83,15 +86,15 @@ static int chacha20_neon_stream_xor(struct skcipher_request *req,
 		if (nbytes < walk.total)
 			nbytes = round_down(nbytes, walk.stride);
 
-		chacha20_doneon(state, walk.dst.virt.addr, walk.src.virt.addr,
-				nbytes);
+		chacha_doneon(state, walk.dst.virt.addr, walk.src.virt.addr,
+			      nbytes, ctx->nrounds);
 		err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
 	}
 
 	return err;
 }
 
-static int chacha20_neon(struct skcipher_request *req)
+static int chacha_neon(struct skcipher_request *req)
 {
 	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
 	struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
@@ -99,10 +102,10 @@ static int chacha20_neon(struct skcipher_request *req)
 	if (req->cryptlen <= CHACHA_BLOCK_SIZE || !may_use_simd())
 		return crypto_chacha_crypt(req);
 
-	return chacha20_neon_stream_xor(req, ctx, req->iv);
+	return chacha_neon_stream_xor(req, ctx, req->iv);
 }
 
-static int xchacha20_neon(struct skcipher_request *req)
+static int xchacha_neon(struct skcipher_request *req)
 {
 	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
 	struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
@@ -116,12 +119,13 @@ static int xchacha20_neon(struct skcipher_request *req)
 	crypto_chacha_init(state, ctx, req->iv);
 
 	kernel_neon_begin();
-	hchacha20_block_neon(state, subctx.key);
+	hchacha_block_neon(state, subctx.key, ctx->nrounds);
 	kernel_neon_end();
+	subctx.nrounds = ctx->nrounds;
 
 	memcpy(&real_iv[0], req->iv + 24, 8);
 	memcpy(&real_iv[8], req->iv + 16, 8);
-	return chacha20_neon_stream_xor(req, &subctx, real_iv);
+	return chacha_neon_stream_xor(req, &subctx, real_iv);
 }
 
 static struct skcipher_alg algs[] = {
@@ -139,8 +143,8 @@ static struct skcipher_alg algs[] = {
 		.chunksize		= CHACHA_BLOCK_SIZE,
 		.walksize		= 4 * CHACHA_BLOCK_SIZE,
 		.setkey			= crypto_chacha20_setkey,
-		.encrypt		= chacha20_neon,
-		.decrypt		= chacha20_neon,
+		.encrypt		= chacha_neon,
+		.decrypt		= chacha_neon,
 	}, {
 		.base.cra_name		= "xchacha20",
 		.base.cra_driver_name	= "xchacha20-neon",
@@ -155,12 +159,12 @@ static struct skcipher_alg algs[] = {
 		.chunksize		= CHACHA_BLOCK_SIZE,
 		.walksize		= 4 * CHACHA_BLOCK_SIZE,
 		.setkey			= crypto_chacha20_setkey,
-		.encrypt		= xchacha20_neon,
-		.decrypt		= xchacha20_neon,
+		.encrypt		= xchacha_neon,
+		.decrypt		= xchacha_neon,
 	}
 };
 
-static int __init chacha20_simd_mod_init(void)
+static int __init chacha_simd_mod_init(void)
 {
 	if (!(elf_hwcap & HWCAP_ASIMD))
 		return -ENODEV;
@@ -168,14 +172,15 @@ static int __init chacha20_simd_mod_init(void)
 	return crypto_register_skciphers(algs, ARRAY_SIZE(algs));
 }
 
-static void __exit chacha20_simd_mod_fini(void)
+static void __exit chacha_simd_mod_fini(void)
 {
 	crypto_unregister_skciphers(algs, ARRAY_SIZE(algs));
 }
 
-module_init(chacha20_simd_mod_init);
-module_exit(chacha20_simd_mod_fini);
+module_init(chacha_simd_mod_init);
+module_exit(chacha_simd_mod_fini);
 
+MODULE_DESCRIPTION("ChaCha and XChaCha stream ciphers (NEON accelerated)");
 MODULE_AUTHOR("Ard Biesheuvel <ard.biesheuvel@linaro.org>");
 MODULE_LICENSE("GPL v2");
 MODULE_ALIAS_CRYPTO("chacha20");
-- 
2.19.2



* [PATCH 4/4] crypto: arm64/chacha - add XChaCha12 support
  2018-11-29  6:34 [PATCH 0/4] crypto: ARM64 NEON optimized XChaCha and NHPoly1305 (for Adiantum) Eric Biggers
                   ` (2 preceding siblings ...)
  2018-11-29  6:34 ` [PATCH 3/4] crypto: arm64/chacha20 - refactor to allow varying number of rounds Eric Biggers
@ 2018-11-29  6:34 ` Eric Biggers
  2018-11-30 13:04   ` Ard Biesheuvel
  3 siblings, 1 reply; 9+ messages in thread
From: Eric Biggers @ 2018-11-29  6:34 UTC (permalink / raw)
  To: linux-crypto
  Cc: Paul Crowley, Ard Biesheuvel, Jason A. Donenfeld,
	linux-arm-kernel, linux-kernel

From: Eric Biggers <ebiggers@google.com>

Now that the ARM64 NEON implementation of ChaCha20 and XChaCha20 has
been refactored to support varying the number of rounds, add support for
XChaCha12.  This is identical to XChaCha20 except for the number of
rounds, which is 12 instead of 20.  This can be used by Adiantum.
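
The NEON functions themselves need no further changes; the new
algorithm entry just uses crypto_chacha12_setkey(), which (sketch of
the generic helper in crypto/chacha_generic.c, not part of this patch)
records the reduced round count in the context:

	int crypto_chacha12_setkey(struct crypto_skcipher *tfm,
				   const u8 *key, unsigned int keysize)
	{
		return chacha_setkey(tfm, key, keysize, 12);
	}

xchacha_neon() then passes ctx->nrounds down to the assembly unchanged.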

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 arch/arm64/crypto/Kconfig            |  2 +-
 arch/arm64/crypto/chacha-neon-glue.c | 18 ++++++++++++++++++
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index d54ddb8468ef5..d9a523ecdd836 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -101,7 +101,7 @@ config CRYPTO_AES_ARM64_NEON_BLK
 	select CRYPTO_SIMD
 
 config CRYPTO_CHACHA20_NEON
-	tristate "ChaCha20 and XChaCha20 stream ciphers using NEON instructions"
+	tristate "ChaCha20, XChaCha20, and XChaCha12 stream ciphers using NEON instructions"
 	depends on KERNEL_MODE_NEON
 	select CRYPTO_BLKCIPHER
 	select CRYPTO_CHACHA20
diff --git a/arch/arm64/crypto/chacha-neon-glue.c b/arch/arm64/crypto/chacha-neon-glue.c
index 4d992029b9121..346eb85498a1e 100644
--- a/arch/arm64/crypto/chacha-neon-glue.c
+++ b/arch/arm64/crypto/chacha-neon-glue.c
@@ -161,6 +161,22 @@ static struct skcipher_alg algs[] = {
 		.setkey			= crypto_chacha20_setkey,
 		.encrypt		= xchacha_neon,
 		.decrypt		= xchacha_neon,
+	}, {
+		.base.cra_name		= "xchacha12",
+		.base.cra_driver_name	= "xchacha12-neon",
+		.base.cra_priority	= 300,
+		.base.cra_blocksize	= 1,
+		.base.cra_ctxsize	= sizeof(struct chacha_ctx),
+		.base.cra_module	= THIS_MODULE,
+
+		.min_keysize		= CHACHA_KEY_SIZE,
+		.max_keysize		= CHACHA_KEY_SIZE,
+		.ivsize			= XCHACHA_IV_SIZE,
+		.chunksize		= CHACHA_BLOCK_SIZE,
+		.walksize		= 4 * CHACHA_BLOCK_SIZE,
+		.setkey			= crypto_chacha12_setkey,
+		.encrypt		= xchacha_neon,
+		.decrypt		= xchacha_neon,
 	}
 };
 
@@ -187,3 +203,5 @@ MODULE_ALIAS_CRYPTO("chacha20");
 MODULE_ALIAS_CRYPTO("chacha20-neon");
 MODULE_ALIAS_CRYPTO("xchacha20");
 MODULE_ALIAS_CRYPTO("xchacha20-neon");
+MODULE_ALIAS_CRYPTO("xchacha12");
+MODULE_ALIAS_CRYPTO("xchacha12-neon");
-- 
2.19.2



* Re: [PATCH 1/4] crypto: arm64/nhpoly1305 - add NEON-accelerated NHPoly1305
  2018-11-29  6:34 ` [PATCH 1/4] crypto: arm64/nhpoly1305 - add NEON-accelerated NHPoly1305 Eric Biggers
@ 2018-11-30 12:58   ` Ard Biesheuvel
  0 siblings, 0 replies; 9+ messages in thread
From: Ard Biesheuvel @ 2018-11-30 12:58 UTC (permalink / raw)
  To: Eric Biggers
  Cc: open list:HARDWARE RANDOM NUMBER GENERATOR CORE, Paul Crowley,
	Jason A. Donenfeld, linux-arm-kernel, Linux Kernel Mailing List

On Thu, 29 Nov 2018 at 07:35, Eric Biggers <ebiggers@kernel.org> wrote:
>
> From: Eric Biggers <ebiggers@google.com>
>
> Add an ARM64 NEON implementation of NHPoly1305, an ε-almost-∆-universal
> hash function used in the Adiantum encryption mode.  For now, only the
> NH portion is actually NEON-accelerated; the Poly1305 part is less
> performance-critical so is just implemented in C.
>
> Signed-off-by: Eric Biggers <ebiggers@google.com>
> ---
>  arch/arm64/crypto/Kconfig                |   5 ++
>  arch/arm64/crypto/Makefile               |   3 +
>  arch/arm64/crypto/nh-neon-core.S         | 103 +++++++++++++++++++++++
>  arch/arm64/crypto/nhpoly1305-neon-glue.c |  77 +++++++++++++++++
>  4 files changed, 188 insertions(+)
>  create mode 100644 arch/arm64/crypto/nh-neon-core.S
>  create mode 100644 arch/arm64/crypto/nhpoly1305-neon-glue.c
>
> diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
> index a5606823ed4da..3f5aeb786192b 100644
> --- a/arch/arm64/crypto/Kconfig
> +++ b/arch/arm64/crypto/Kconfig
> @@ -106,6 +106,11 @@ config CRYPTO_CHACHA20_NEON
>         select CRYPTO_BLKCIPHER
>         select CRYPTO_CHACHA20
>
> +config CRYPTO_NHPOLY1305_NEON
> +       tristate "NHPoly1305 hash function using NEON instructions (for Adiantum)"
> +       depends on KERNEL_MODE_NEON
> +       select CRYPTO_NHPOLY1305
> +
>  config CRYPTO_AES_ARM64_BS
>         tristate "AES in ECB/CBC/CTR/XTS modes using bit-sliced NEON algorithm"
>         depends on KERNEL_MODE_NEON
> diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
> index f476fede09ba4..125dbb10a93ed 100644
> --- a/arch/arm64/crypto/Makefile
> +++ b/arch/arm64/crypto/Makefile
> @@ -53,6 +53,9 @@ sha512-arm64-y := sha512-glue.o sha512-core.o
>  obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha20-neon.o
>  chacha20-neon-y := chacha20-neon-core.o chacha20-neon-glue.o
>
> +obj-$(CONFIG_CRYPTO_NHPOLY1305_NEON) += nhpoly1305-neon.o
> +nhpoly1305-neon-y := nh-neon-core.o nhpoly1305-neon-glue.o
> +
>  obj-$(CONFIG_CRYPTO_AES_ARM64) += aes-arm64.o
>  aes-arm64-y := aes-cipher-core.o aes-cipher-glue.o
>
> diff --git a/arch/arm64/crypto/nh-neon-core.S b/arch/arm64/crypto/nh-neon-core.S
> new file mode 100644
> index 0000000000000..e08d57676127a
> --- /dev/null
> +++ b/arch/arm64/crypto/nh-neon-core.S
> @@ -0,0 +1,103 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * NH - ε-almost-universal hash function, ARM64 NEON accelerated version
> + *
> + * Copyright 2018 Google LLC
> + *
> + * Author: Eric Biggers <ebiggers@google.com>
> + */
> +
> +#include <linux/linkage.h>
> +
> +       KEY             .req    x0
> +       MESSAGE         .req    x1
> +       MESSAGE_LEN     .req    x2
> +       HASH            .req    x3
> +
> +       PASS0_SUMS      .req    v0
> +       PASS1_SUMS      .req    v1
> +       PASS2_SUMS      .req    v2
> +       PASS3_SUMS      .req    v3
> +       K0              .req    v4
> +       K1              .req    v5
> +       K2              .req    v6
> +       K3              .req    v7
> +       T0              .req    v8
> +       T1              .req    v9
> +       T2              .req    v10
> +       T3              .req    v11
> +       T4              .req    v12
> +       T5              .req    v13
> +       T6              .req    v14
> +       T7              .req    v15
> +
> +.macro _nh_stride      k0, k1, k2, k3
> +
> +       // Load next message stride
> +       ld1             {T3.16b}, [MESSAGE], #16
> +
> +       // Load next key stride
> +       ld1             {\k3\().4s}, [KEY], #16
> +
> +       // Add message words to key words
> +       add             T0.4s, T3.4s, \k0\().4s
> +       add             T1.4s, T3.4s, \k1\().4s
> +       add             T2.4s, T3.4s, \k2\().4s
> +       add             T3.4s, T3.4s, \k3\().4s
> +
> +       // Multiply 32x32 => 64 and accumulate
> +       mov             T4.d[0], T0.d[1]
> +       mov             T5.d[0], T1.d[1]
> +       mov             T6.d[0], T2.d[1]
> +       mov             T7.d[0], T3.d[1]

Nit: Gmail mangles the whitespace anyway, but in the patch there is
some whitespace damage here (the mov lines above use spaces rather
than tabs).

With that fixed,

Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> # big-endian


> +       umlal           PASS0_SUMS.2d, T0.2s, T4.2s
> +       umlal           PASS1_SUMS.2d, T1.2s, T5.2s
> +       umlal           PASS2_SUMS.2d, T2.2s, T6.2s
> +       umlal           PASS3_SUMS.2d, T3.2s, T7.2s
> +.endm
> +
> +/*
> + * void nh_neon(const u32 *key, const u8 *message, size_t message_len,
> + *             u8 hash[NH_HASH_BYTES])
> + *
> + * It's guaranteed that message_len % 16 == 0.
> + */
> +ENTRY(nh_neon)
> +
> +       ld1             {K0.4s,K1.4s}, [KEY], #32
> +         movi          PASS0_SUMS.2d, #0
> +         movi          PASS1_SUMS.2d, #0
> +       ld1             {K2.4s}, [KEY], #16
> +         movi          PASS2_SUMS.2d, #0
> +         movi          PASS3_SUMS.2d, #0
> +
> +       subs            MESSAGE_LEN, MESSAGE_LEN, #64
> +       blt             .Lloop4_done
> +.Lloop4:
> +       _nh_stride      K0, K1, K2, K3
> +       _nh_stride      K1, K2, K3, K0
> +       _nh_stride      K2, K3, K0, K1
> +       _nh_stride      K3, K0, K1, K2
> +       subs            MESSAGE_LEN, MESSAGE_LEN, #64
> +       bge             .Lloop4
> +
> +.Lloop4_done:
> +       ands            MESSAGE_LEN, MESSAGE_LEN, #63
> +       beq             .Ldone
> +       _nh_stride      K0, K1, K2, K3
> +
> +       subs            MESSAGE_LEN, MESSAGE_LEN, #16
> +       beq             .Ldone
> +       _nh_stride      K1, K2, K3, K0
> +
> +       subs            MESSAGE_LEN, MESSAGE_LEN, #16
> +       beq             .Ldone
> +       _nh_stride      K2, K3, K0, K1
> +
> +.Ldone:
> +       // Sum the accumulators for each pass, then store the sums to 'hash'
> +       addp            T0.2d, PASS0_SUMS.2d, PASS1_SUMS.2d
> +       addp            T1.2d, PASS2_SUMS.2d, PASS3_SUMS.2d
> +       st1             {T0.16b,T1.16b}, [HASH]
> +       ret
> +ENDPROC(nh_neon)
> diff --git a/arch/arm64/crypto/nhpoly1305-neon-glue.c b/arch/arm64/crypto/nhpoly1305-neon-glue.c
> new file mode 100644
> index 0000000000000..22cc32ac9448d
> --- /dev/null
> +++ b/arch/arm64/crypto/nhpoly1305-neon-glue.c
> @@ -0,0 +1,77 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * NHPoly1305 - ε-almost-∆-universal hash function for Adiantum
> + * (ARM64 NEON accelerated version)
> + *
> + * Copyright 2018 Google LLC
> + */
> +
> +#include <asm/neon.h>
> +#include <asm/simd.h>
> +#include <crypto/internal/hash.h>
> +#include <crypto/nhpoly1305.h>
> +#include <linux/module.h>
> +
> +asmlinkage void nh_neon(const u32 *key, const u8 *message, size_t message_len,
> +                       u8 hash[NH_HASH_BYTES]);
> +
> +/* wrapper to avoid indirect call to assembly, which doesn't work with CFI */
> +static void _nh_neon(const u32 *key, const u8 *message, size_t message_len,
> +                    __le64 hash[NH_NUM_PASSES])
> +{
> +       nh_neon(key, message, message_len, (u8 *)hash);
> +}
> +
> +static int nhpoly1305_neon_update(struct shash_desc *desc,
> +                                 const u8 *src, unsigned int srclen)
> +{
> +       if (srclen < 64 || !may_use_simd())
> +               return crypto_nhpoly1305_update(desc, src, srclen);
> +
> +       do {
> +               unsigned int n = min_t(unsigned int, srclen, PAGE_SIZE);
> +
> +               kernel_neon_begin();
> +               crypto_nhpoly1305_update_helper(desc, src, n, _nh_neon);
> +               kernel_neon_end();
> +               src += n;
> +               srclen -= n;
> +       } while (srclen);
> +       return 0;
> +}
> +
> +static struct shash_alg nhpoly1305_alg = {
> +       .base.cra_name          = "nhpoly1305",
> +       .base.cra_driver_name   = "nhpoly1305-neon",
> +       .base.cra_priority      = 200,
> +       .base.cra_ctxsize       = sizeof(struct nhpoly1305_key),
> +       .base.cra_module        = THIS_MODULE,
> +       .digestsize             = POLY1305_DIGEST_SIZE,
> +       .init                   = crypto_nhpoly1305_init,
> +       .update                 = nhpoly1305_neon_update,
> +       .final                  = crypto_nhpoly1305_final,
> +       .setkey                 = crypto_nhpoly1305_setkey,
> +       .descsize               = sizeof(struct nhpoly1305_state),
> +};
> +
> +static int __init nhpoly1305_mod_init(void)
> +{
> +       if (!(elf_hwcap & HWCAP_ASIMD))
> +               return -ENODEV;
> +
> +       return crypto_register_shash(&nhpoly1305_alg);
> +}
> +
> +static void __exit nhpoly1305_mod_exit(void)
> +{
> +       crypto_unregister_shash(&nhpoly1305_alg);
> +}
> +
> +module_init(nhpoly1305_mod_init);
> +module_exit(nhpoly1305_mod_exit);
> +
> +MODULE_DESCRIPTION("NHPoly1305 ε-almost-∆-universal hash function (NEON-accelerated)");
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR("Eric Biggers <ebiggers@google.com>");
> +MODULE_ALIAS_CRYPTO("nhpoly1305");
> +MODULE_ALIAS_CRYPTO("nhpoly1305-neon");
> --
> 2.19.2
>


* Re: [PATCH 2/4] crypto: arm64/chacha20 - add XChaCha20 support
  2018-11-29  6:34 ` [PATCH 2/4] crypto: arm64/chacha20 - add XChaCha20 support Eric Biggers
@ 2018-11-30 13:02   ` Ard Biesheuvel
  0 siblings, 0 replies; 9+ messages in thread
From: Ard Biesheuvel @ 2018-11-30 13:02 UTC (permalink / raw)
  To: Eric Biggers
  Cc: open list:HARDWARE RANDOM NUMBER GENERATOR CORE, Paul Crowley,
	Jason A. Donenfeld, linux-arm-kernel, Linux Kernel Mailing List

On Thu, 29 Nov 2018 at 07:35, Eric Biggers <ebiggers@kernel.org> wrote:
>
> From: Eric Biggers <ebiggers@google.com>
>
> Add an XChaCha20 implementation that is hooked up to the ARM64 NEON
> implementation of ChaCha20.  This can be used by Adiantum.
>
> A NEON implementation of single-block HChaCha20 is also added so that
> XChaCha20 can use it rather than the generic implementation.  This
> required refactoring the ChaCha permutation into its own function.
>
> Signed-off-by: Eric Biggers <ebiggers@google.com>
> ---
>  arch/arm64/crypto/Kconfig              |   2 +-
>  arch/arm64/crypto/chacha20-neon-core.S |  62 ++++++++++-----
>  arch/arm64/crypto/chacha20-neon-glue.c | 101 +++++++++++++++++++------
>  3 files changed, 121 insertions(+), 44 deletions(-)
>
> diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
> index 3f5aeb786192b..d54ddb8468ef5 100644
> --- a/arch/arm64/crypto/Kconfig
> +++ b/arch/arm64/crypto/Kconfig
> @@ -101,7 +101,7 @@ config CRYPTO_AES_ARM64_NEON_BLK
>         select CRYPTO_SIMD
>
>  config CRYPTO_CHACHA20_NEON
> -       tristate "NEON accelerated ChaCha20 symmetric cipher"
> +       tristate "ChaCha20 and XChaCha20 stream ciphers using NEON instructions"
>         depends on KERNEL_MODE_NEON
>         select CRYPTO_BLKCIPHER
>         select CRYPTO_CHACHA20
> diff --git a/arch/arm64/crypto/chacha20-neon-core.S b/arch/arm64/crypto/chacha20-neon-core.S
> index 13c85e272c2a1..378850505c0ae 100644
> --- a/arch/arm64/crypto/chacha20-neon-core.S
> +++ b/arch/arm64/crypto/chacha20-neon-core.S
> @@ -23,25 +23,20 @@
>         .text
>         .align          6
>
> -ENTRY(chacha20_block_xor_neon)
> -       // x0: Input state matrix, s
> -       // x1: 1 data block output, o
> -       // x2: 1 data block input, i
> -
> -       //
> -       // This function encrypts one ChaCha20 block by loading the state matrix
> -       // in four NEON registers. It performs matrix operation on four words in
> -       // parallel, but requires shuffling to rearrange the words after each
> -       // round.
> -       //
> -
> -       // x0..3 = s0..3
> -       adr             x3, ROT8
> -       ld1             {v0.4s-v3.4s}, [x0]
> -       ld1             {v8.4s-v11.4s}, [x0]
> -       ld1             {v12.4s}, [x3]
> +/*
> + * chacha20_permute - permute one block
> + *
> + * Permute one 64-byte block where the state matrix is stored in the four NEON
> + * registers v0-v3.  It performs matrix operations on four words in parallel,
> + * but requires shuffling to rearrange the words after each round.
> + *
> + * Clobbers: x3, x10, v4, v12
> + */
> +chacha20_permute:
>
>         mov             x3, #10
> +       adr             x10, ROT8
> +       ld1             {v12.4s}, [x10]
>
>  .Ldoubleround:
>         // x0 += x1, x3 = rotl32(x3 ^ x0, 16)
> @@ -105,6 +100,22 @@ ENTRY(chacha20_block_xor_neon)
>         subs            x3, x3, #1
>         b.ne            .Ldoubleround
>
> +       ret
> +ENDPROC(chacha20_permute)
> +
> +ENTRY(chacha20_block_xor_neon)
> +       // x0: Input state matrix, s
> +       // x1: 1 data block output, o
> +       // x2: 1 data block input, i
> +
> +       mov             x9, lr

Please use x30, not lr; the latter is not supported by all assemblers.

Also, I'd prefer it if we could create a full stack frame here, i.e.,

stp x29, x30, [sp, #-16]!
mov x29, sp

(same for the other caller)
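
and, correspondingly, return through a matching epilogue (sketch, not
from the posted patch) instead of stashing the return address in x9:

ldp x29, x30, [sp], #16
ret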

Otherwise, this will look like a tail call to the unwinder,
potentially creating problems with features that need reliable
backtraces, such as livepatch.

> +
> +       // x0..3 = s0..3
> +       ld1             {v0.4s-v3.4s}, [x0]
> +       ld1             {v8.4s-v11.4s}, [x0]
> +
> +       bl              chacha20_permute
> +
>         ld1             {v4.16b-v7.16b}, [x2]
>
>         // o0 = i0 ^ (x0 + s0)
> @@ -125,9 +136,24 @@ ENTRY(chacha20_block_xor_neon)
>
>         st1             {v0.16b-v3.16b}, [x1]
>
> -       ret
> +       ret             x9
>  ENDPROC(chacha20_block_xor_neon)
>
> +ENTRY(hchacha20_block_neon)
> +       // x0: Input state matrix, s
> +       // x1: output (8 32-bit words)
> +       mov             x9, lr
> +
> +       ld1             {v0.4s-v3.4s}, [x0]
> +
> +       bl              chacha20_permute
> +
> +       st1             {v0.16b}, [x1], #16
> +       st1             {v3.16b}, [x1]
> +
> +       ret             x9
> +ENDPROC(hchacha20_block_neon)
> +
>         .align          6
>  ENTRY(chacha20_4block_xor_neon)
>         // x0: Input state matrix, s
> diff --git a/arch/arm64/crypto/chacha20-neon-glue.c b/arch/arm64/crypto/chacha20-neon-glue.c
> index 96e0cfb8c3f5b..a5b9cbc0c4de4 100644
> --- a/arch/arm64/crypto/chacha20-neon-glue.c
> +++ b/arch/arm64/crypto/chacha20-neon-glue.c
> @@ -30,6 +30,7 @@
>
>  asmlinkage void chacha20_block_xor_neon(u32 *state, u8 *dst, const u8 *src);
>  asmlinkage void chacha20_4block_xor_neon(u32 *state, u8 *dst, const u8 *src);
> +asmlinkage void hchacha20_block_neon(const u32 *state, u32 *out);
>
>  static void chacha20_doneon(u32 *state, u8 *dst, const u8 *src,
>                             unsigned int bytes)
> @@ -65,20 +66,16 @@ static void chacha20_doneon(u32 *state, u8 *dst, const u8 *src,
>         kernel_neon_end();
>  }
>
> -static int chacha20_neon(struct skcipher_request *req)
> +static int chacha20_neon_stream_xor(struct skcipher_request *req,
> +                                   struct chacha_ctx *ctx, u8 *iv)
>  {
> -       struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> -       struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
>         struct skcipher_walk walk;
>         u32 state[16];
>         int err;
>
> -       if (!may_use_simd() || req->cryptlen <= CHACHA_BLOCK_SIZE)
> -               return crypto_chacha_crypt(req);
> -
>         err = skcipher_walk_virt(&walk, req, false);
>
> -       crypto_chacha_init(state, ctx, walk.iv);
> +       crypto_chacha_init(state, ctx, iv);
>
>         while (walk.nbytes > 0) {
>                 unsigned int nbytes = walk.nbytes;
> @@ -94,22 +91,73 @@ static int chacha20_neon(struct skcipher_request *req)
>         return err;
>  }
>
> -static struct skcipher_alg alg = {
> -       .base.cra_name          = "chacha20",
> -       .base.cra_driver_name   = "chacha20-neon",
> -       .base.cra_priority      = 300,
> -       .base.cra_blocksize     = 1,
> -       .base.cra_ctxsize       = sizeof(struct chacha_ctx),
> -       .base.cra_module        = THIS_MODULE,
> -
> -       .min_keysize            = CHACHA_KEY_SIZE,
> -       .max_keysize            = CHACHA_KEY_SIZE,
> -       .ivsize                 = CHACHA_IV_SIZE,
> -       .chunksize              = CHACHA_BLOCK_SIZE,
> -       .walksize               = 4 * CHACHA_BLOCK_SIZE,
> -       .setkey                 = crypto_chacha20_setkey,
> -       .encrypt                = chacha20_neon,
> -       .decrypt                = chacha20_neon,
> +static int chacha20_neon(struct skcipher_request *req)
> +{
> +       struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> +       struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
> +
> +       if (req->cryptlen <= CHACHA_BLOCK_SIZE || !may_use_simd())
> +               return crypto_chacha_crypt(req);
> +
> +       return chacha20_neon_stream_xor(req, ctx, req->iv);
> +}
> +
> +static int xchacha20_neon(struct skcipher_request *req)
> +{
> +       struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> +       struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
> +       struct chacha_ctx subctx;
> +       u32 state[16];
> +       u8 real_iv[16];
> +
> +       if (req->cryptlen <= CHACHA_BLOCK_SIZE || !may_use_simd())
> +               return crypto_xchacha_crypt(req);
> +
> +       crypto_chacha_init(state, ctx, req->iv);
> +
> +       kernel_neon_begin();
> +       hchacha20_block_neon(state, subctx.key);
> +       kernel_neon_end();
> +
> +       memcpy(&real_iv[0], req->iv + 24, 8);
> +       memcpy(&real_iv[8], req->iv + 16, 8);
> +       return chacha20_neon_stream_xor(req, &subctx, real_iv);
> +}
> +
> +static struct skcipher_alg algs[] = {
> +       {
> +               .base.cra_name          = "chacha20",
> +               .base.cra_driver_name   = "chacha20-neon",
> +               .base.cra_priority      = 300,
> +               .base.cra_blocksize     = 1,
> +               .base.cra_ctxsize       = sizeof(struct chacha_ctx),
> +               .base.cra_module        = THIS_MODULE,
> +
> +               .min_keysize            = CHACHA_KEY_SIZE,
> +               .max_keysize            = CHACHA_KEY_SIZE,
> +               .ivsize                 = CHACHA_IV_SIZE,
> +               .chunksize              = CHACHA_BLOCK_SIZE,
> +               .walksize               = 4 * CHACHA_BLOCK_SIZE,
> +               .setkey                 = crypto_chacha20_setkey,
> +               .encrypt                = chacha20_neon,
> +               .decrypt                = chacha20_neon,
> +       }, {
> +               .base.cra_name          = "xchacha20",
> +               .base.cra_driver_name   = "xchacha20-neon",
> +               .base.cra_priority      = 300,
> +               .base.cra_blocksize     = 1,
> +               .base.cra_ctxsize       = sizeof(struct chacha_ctx),
> +               .base.cra_module        = THIS_MODULE,
> +
> +               .min_keysize            = CHACHA_KEY_SIZE,
> +               .max_keysize            = CHACHA_KEY_SIZE,
> +               .ivsize                 = XCHACHA_IV_SIZE,
> +               .chunksize              = CHACHA_BLOCK_SIZE,
> +               .walksize               = 4 * CHACHA_BLOCK_SIZE,
> +               .setkey                 = crypto_chacha20_setkey,
> +               .encrypt                = xchacha20_neon,
> +               .decrypt                = xchacha20_neon,
> +       }
>  };
>
>  static int __init chacha20_simd_mod_init(void)
> @@ -117,12 +165,12 @@ static int __init chacha20_simd_mod_init(void)
>         if (!(elf_hwcap & HWCAP_ASIMD))
>                 return -ENODEV;
>
> -       return crypto_register_skcipher(&alg);
> +       return crypto_register_skciphers(algs, ARRAY_SIZE(algs));
>  }
>
>  static void __exit chacha20_simd_mod_fini(void)
>  {
> -       crypto_unregister_skcipher(&alg);
> +       crypto_unregister_skciphers(algs, ARRAY_SIZE(algs));
>  }
>
>  module_init(chacha20_simd_mod_init);
> @@ -131,3 +179,6 @@ module_exit(chacha20_simd_mod_fini);
>  MODULE_AUTHOR("Ard Biesheuvel <ard.biesheuvel@linaro.org>");
>  MODULE_LICENSE("GPL v2");
>  MODULE_ALIAS_CRYPTO("chacha20");
> +MODULE_ALIAS_CRYPTO("chacha20-neon");
> +MODULE_ALIAS_CRYPTO("xchacha20");
> +MODULE_ALIAS_CRYPTO("xchacha20-neon");
> --
> 2.19.2
>


* Re: [PATCH 3/4] crypto: arm64/chacha20 - refactor to allow varying number of rounds
  2018-11-29  6:34 ` [PATCH 3/4] crypto: arm64/chacha20 - refactor to allow varying number of rounds Eric Biggers
@ 2018-11-30 13:03   ` Ard Biesheuvel
  0 siblings, 0 replies; 9+ messages in thread
From: Ard Biesheuvel @ 2018-11-30 13:03 UTC (permalink / raw)
  To: Eric Biggers
  Cc: open list:HARDWARE RANDOM NUMBER GENERATOR CORE, Paul Crowley,
	Jason A. Donenfeld, linux-arm-kernel, Linux Kernel Mailing List

On Thu, 29 Nov 2018 at 07:35, Eric Biggers <ebiggers@kernel.org> wrote:
>
> From: Eric Biggers <ebiggers@google.com>
>
> In preparation for adding XChaCha12 support, rename/refactor the ARM64
> NEON implementation of ChaCha20 to support different numbers of rounds.
>
> Signed-off-by: Eric Biggers <ebiggers@google.com>

Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

> ---
>  arch/arm64/crypto/Makefile                    |  4 +-
>  ...hacha20-neon-core.S => chacha-neon-core.S} | 45 ++++++++-------
>  ...hacha20-neon-glue.c => chacha-neon-glue.c} | 57 ++++++++++---------
>  3 files changed, 57 insertions(+), 49 deletions(-)
>  rename arch/arm64/crypto/{chacha20-neon-core.S => chacha-neon-core.S} (94%)
>  rename arch/arm64/crypto/{chacha20-neon-glue.c => chacha-neon-glue.c} (71%)
>
> diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
> index 125dbb10a93ed..a4ffd9fe32650 100644
> --- a/arch/arm64/crypto/Makefile
> +++ b/arch/arm64/crypto/Makefile
> @@ -50,8 +50,8 @@ sha256-arm64-y := sha256-glue.o sha256-core.o
>  obj-$(CONFIG_CRYPTO_SHA512_ARM64) += sha512-arm64.o
>  sha512-arm64-y := sha512-glue.o sha512-core.o
>
> -obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha20-neon.o
> -chacha20-neon-y := chacha20-neon-core.o chacha20-neon-glue.o
> +obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha-neon.o
> +chacha-neon-y := chacha-neon-core.o chacha-neon-glue.o
>
>  obj-$(CONFIG_CRYPTO_NHPOLY1305_NEON) += nhpoly1305-neon.o
>  nhpoly1305-neon-y := nh-neon-core.o nhpoly1305-neon-glue.o
> diff --git a/arch/arm64/crypto/chacha20-neon-core.S b/arch/arm64/crypto/chacha-neon-core.S
> similarity index 94%
> rename from arch/arm64/crypto/chacha20-neon-core.S
> rename to arch/arm64/crypto/chacha-neon-core.S
> index 378850505c0ae..75b4e06cee79f 100644
> --- a/arch/arm64/crypto/chacha20-neon-core.S
> +++ b/arch/arm64/crypto/chacha-neon-core.S
> @@ -1,5 +1,5 @@
>  /*
> - * ChaCha20 256-bit cipher algorithm, RFC7539, arm64 NEON functions
> + * ChaCha/XChaCha NEON helper functions
>   *
>   * Copyright (C) 2016 Linaro, Ltd. <ard.biesheuvel@linaro.org>
>   *
> @@ -24,17 +24,18 @@
>         .align          6
>
>  /*
> - * chacha20_permute - permute one block
> + * chacha_permute - permute one block
>   *
>   * Permute one 64-byte block where the state matrix is stored in the four NEON
>   * registers v0-v3.  It performs matrix operations on four words in parallel,
>   * but requires shuffling to rearrange the words after each round.
>   *
> - * Clobbers: x3, x10, v4, v12
> + * The round count is given in w3.
> + *
> + * Clobbers: w3, x10, v4, v12
>   */
> -chacha20_permute:
> +chacha_permute:
>
> -       mov             x3, #10
>         adr             x10, ROT8
>         ld1             {v12.4s}, [x10]
>
> @@ -97,16 +98,17 @@ chacha20_permute:
>         // x3 = shuffle32(x3, MASK(0, 3, 2, 1))
>         ext             v3.16b, v3.16b, v3.16b, #4
>
> -       subs            x3, x3, #1
> +       subs            w3, w3, #2
>         b.ne            .Ldoubleround
>
>         ret
> -ENDPROC(chacha20_permute)
> +ENDPROC(chacha_permute)
>
> -ENTRY(chacha20_block_xor_neon)
> +ENTRY(chacha_block_xor_neon)
>         // x0: Input state matrix, s
>         // x1: 1 data block output, o
>         // x2: 1 data block input, i
> +       // w3: nrounds
>
>         mov             x9, lr
>
> @@ -114,7 +116,7 @@ ENTRY(chacha20_block_xor_neon)
>         ld1             {v0.4s-v3.4s}, [x0]
>         ld1             {v8.4s-v11.4s}, [x0]
>
> -       bl              chacha20_permute
> +       bl              chacha_permute
>
>         ld1             {v4.16b-v7.16b}, [x2]
>
> @@ -137,39 +139,42 @@ ENTRY(chacha20_block_xor_neon)
>         st1             {v0.16b-v3.16b}, [x1]
>
>         ret             x9
> -ENDPROC(chacha20_block_xor_neon)
> +ENDPROC(chacha_block_xor_neon)
>
> -ENTRY(hchacha20_block_neon)
> +ENTRY(hchacha_block_neon)
>         // x0: Input state matrix, s
>         // x1: output (8 32-bit words)
> +       // w2: nrounds
>         mov             x9, lr
>
>         ld1             {v0.4s-v3.4s}, [x0]
>
> -       bl              chacha20_permute
> +       mov             w3, w2
> +       bl              chacha_permute
>
>         st1             {v0.16b}, [x1], #16
>         st1             {v3.16b}, [x1]
>
>         ret             x9
> -ENDPROC(hchacha20_block_neon)
> +ENDPROC(hchacha_block_neon)
>
>         .align          6
> -ENTRY(chacha20_4block_xor_neon)
> +ENTRY(chacha_4block_xor_neon)
>         // x0: Input state matrix, s
>         // x1: 4 data blocks output, o
>         // x2: 4 data blocks input, i
> +       // w3: nrounds
>
>         //
> -       // This function encrypts four consecutive ChaCha20 blocks by loading
> +       // This function encrypts four consecutive ChaCha blocks by loading
>         // the state matrix in NEON registers four times. The algorithm performs
>         // each operation on the corresponding word of each state matrix, hence
>         // requires no word shuffling. For final XORing step we transpose the
>         // matrix by interleaving 32- and then 64-bit words, which allows us to
>         // do XOR in NEON registers.
>         //
> -       adr             x3, CTRINC              // ... and ROT8
> -       ld1             {v30.4s-v31.4s}, [x3]
> +       adr             x9, CTRINC              // ... and ROT8
> +       ld1             {v30.4s-v31.4s}, [x9]
>
>         // x0..15[0-3] = s0..3[0..3]
>         mov             x4, x0
> @@ -181,8 +186,6 @@ ENTRY(chacha20_4block_xor_neon)
>         // x12 += counter values 0-3
>         add             v12.4s, v12.4s, v30.4s
>
> -       mov             x3, #10
> -
>  .Ldoubleround4:
>         // x0 += x4, x12 = rotl32(x12 ^ x0, 16)
>         // x1 += x5, x13 = rotl32(x13 ^ x1, 16)
> @@ -356,7 +359,7 @@ ENTRY(chacha20_4block_xor_neon)
>         sri             v7.4s, v18.4s, #25
>         sri             v4.4s, v19.4s, #25
>
> -       subs            x3, x3, #1
> +       subs            w3, w3, #2
>         b.ne            .Ldoubleround4
>
>         ld4r            {v16.4s-v19.4s}, [x0], #16
> @@ -470,7 +473,7 @@ ENTRY(chacha20_4block_xor_neon)
>         st1             {v28.16b-v31.16b}, [x1]
>
>         ret
> -ENDPROC(chacha20_4block_xor_neon)
> +ENDPROC(chacha_4block_xor_neon)
>
>  CTRINC:        .word           0, 1, 2, 3
>  ROT8:  .word           0x02010003, 0x06050407, 0x0a09080b, 0x0e0d0c0f
> diff --git a/arch/arm64/crypto/chacha20-neon-glue.c b/arch/arm64/crypto/chacha-neon-glue.c
> similarity index 71%
> rename from arch/arm64/crypto/chacha20-neon-glue.c
> rename to arch/arm64/crypto/chacha-neon-glue.c
> index a5b9cbc0c4de4..4d992029b9121 100644
> --- a/arch/arm64/crypto/chacha20-neon-glue.c
> +++ b/arch/arm64/crypto/chacha-neon-glue.c
> @@ -1,5 +1,6 @@
>  /*
> - * ChaCha20 256-bit cipher algorithm, RFC7539, arm64 NEON functions
> + * ARM NEON accelerated ChaCha and XChaCha stream ciphers,
> + * including ChaCha20 (RFC7539)
>   *
>   * Copyright (C) 2016 - 2017 Linaro, Ltd. <ard.biesheuvel@linaro.org>
>   *
> @@ -28,18 +29,20 @@
>  #include <asm/neon.h>
>  #include <asm/simd.h>
>
> -asmlinkage void chacha20_block_xor_neon(u32 *state, u8 *dst, const u8 *src);
> -asmlinkage void chacha20_4block_xor_neon(u32 *state, u8 *dst, const u8 *src);
> -asmlinkage void hchacha20_block_neon(const u32 *state, u32 *out);
> +asmlinkage void chacha_block_xor_neon(u32 *state, u8 *dst, const u8 *src,
> +                                     int nrounds);
> +asmlinkage void chacha_4block_xor_neon(u32 *state, u8 *dst, const u8 *src,
> +                                      int nrounds);
> +asmlinkage void hchacha_block_neon(const u32 *state, u32 *out, int nrounds);
>
> -static void chacha20_doneon(u32 *state, u8 *dst, const u8 *src,
> -                           unsigned int bytes)
> +static void chacha_doneon(u32 *state, u8 *dst, const u8 *src,
> +                         unsigned int bytes, int nrounds)
>  {
>         u8 buf[CHACHA_BLOCK_SIZE];
>
>         while (bytes >= CHACHA_BLOCK_SIZE * 4) {
>                 kernel_neon_begin();
> -               chacha20_4block_xor_neon(state, dst, src);
> +               chacha_4block_xor_neon(state, dst, src, nrounds);
>                 kernel_neon_end();
>                 bytes -= CHACHA_BLOCK_SIZE * 4;
>                 src += CHACHA_BLOCK_SIZE * 4;
> @@ -52,7 +55,7 @@ static void chacha20_doneon(u32 *state, u8 *dst, const u8 *src,
>
>         kernel_neon_begin();
>         while (bytes >= CHACHA_BLOCK_SIZE) {
> -               chacha20_block_xor_neon(state, dst, src);
> +               chacha_block_xor_neon(state, dst, src, nrounds);
>                 bytes -= CHACHA_BLOCK_SIZE;
>                 src += CHACHA_BLOCK_SIZE;
>                 dst += CHACHA_BLOCK_SIZE;
> @@ -60,14 +63,14 @@ static void chacha20_doneon(u32 *state, u8 *dst, const u8 *src,
>         }
>         if (bytes) {
>                 memcpy(buf, src, bytes);
> -               chacha20_block_xor_neon(state, buf, buf);
> +               chacha_block_xor_neon(state, buf, buf, nrounds);
>                 memcpy(dst, buf, bytes);
>         }
>         kernel_neon_end();
>  }
>
> -static int chacha20_neon_stream_xor(struct skcipher_request *req,
> -                                   struct chacha_ctx *ctx, u8 *iv)
> +static int chacha_neon_stream_xor(struct skcipher_request *req,
> +                                 struct chacha_ctx *ctx, u8 *iv)
>  {
>         struct skcipher_walk walk;
>         u32 state[16];
> @@ -83,15 +86,15 @@ static int chacha20_neon_stream_xor(struct skcipher_request *req,
>                 if (nbytes < walk.total)
>                         nbytes = round_down(nbytes, walk.stride);
>
> -               chacha20_doneon(state, walk.dst.virt.addr, walk.src.virt.addr,
> -                               nbytes);
> +               chacha_doneon(state, walk.dst.virt.addr, walk.src.virt.addr,
> +                             nbytes, ctx->nrounds);
>                 err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
>         }
>
>         return err;
>  }
>
> -static int chacha20_neon(struct skcipher_request *req)
> +static int chacha_neon(struct skcipher_request *req)
>  {
>         struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
>         struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
> @@ -99,10 +102,10 @@ static int chacha20_neon(struct skcipher_request *req)
>         if (req->cryptlen <= CHACHA_BLOCK_SIZE || !may_use_simd())
>                 return crypto_chacha_crypt(req);
>
> -       return chacha20_neon_stream_xor(req, ctx, req->iv);
> +       return chacha_neon_stream_xor(req, ctx, req->iv);
>  }
>
> -static int xchacha20_neon(struct skcipher_request *req)
> +static int xchacha_neon(struct skcipher_request *req)
>  {
>         struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
>         struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
> @@ -116,12 +119,13 @@ static int xchacha20_neon(struct skcipher_request *req)
>         crypto_chacha_init(state, ctx, req->iv);
>
>         kernel_neon_begin();
> -       hchacha20_block_neon(state, subctx.key);
> +       hchacha_block_neon(state, subctx.key, ctx->nrounds);
>         kernel_neon_end();
> +       subctx.nrounds = ctx->nrounds;
>
>         memcpy(&real_iv[0], req->iv + 24, 8);
>         memcpy(&real_iv[8], req->iv + 16, 8);
> -       return chacha20_neon_stream_xor(req, &subctx, real_iv);
> +       return chacha_neon_stream_xor(req, &subctx, real_iv);
>  }
>
>  static struct skcipher_alg algs[] = {
> @@ -139,8 +143,8 @@ static struct skcipher_alg algs[] = {
>                 .chunksize              = CHACHA_BLOCK_SIZE,
>                 .walksize               = 4 * CHACHA_BLOCK_SIZE,
>                 .setkey                 = crypto_chacha20_setkey,
> -               .encrypt                = chacha20_neon,
> -               .decrypt                = chacha20_neon,
> +               .encrypt                = chacha_neon,
> +               .decrypt                = chacha_neon,
>         }, {
>                 .base.cra_name          = "xchacha20",
>                 .base.cra_driver_name   = "xchacha20-neon",
> @@ -155,12 +159,12 @@ static struct skcipher_alg algs[] = {
>                 .chunksize              = CHACHA_BLOCK_SIZE,
>                 .walksize               = 4 * CHACHA_BLOCK_SIZE,
>                 .setkey                 = crypto_chacha20_setkey,
> -               .encrypt                = xchacha20_neon,
> -               .decrypt                = xchacha20_neon,
> +               .encrypt                = xchacha_neon,
> +               .decrypt                = xchacha_neon,
>         }
>  };
>
> -static int __init chacha20_simd_mod_init(void)
> +static int __init chacha_simd_mod_init(void)
>  {
>         if (!(elf_hwcap & HWCAP_ASIMD))
>                 return -ENODEV;
> @@ -168,14 +172,15 @@ static int __init chacha20_simd_mod_init(void)
>         return crypto_register_skciphers(algs, ARRAY_SIZE(algs));
>  }
>
> -static void __exit chacha20_simd_mod_fini(void)
> +static void __exit chacha_simd_mod_fini(void)
>  {
>         crypto_unregister_skciphers(algs, ARRAY_SIZE(algs));
>  }
>
> -module_init(chacha20_simd_mod_init);
> -module_exit(chacha20_simd_mod_fini);
> +module_init(chacha_simd_mod_init);
> +module_exit(chacha_simd_mod_fini);
>
> +MODULE_DESCRIPTION("ChaCha and XChaCha stream ciphers (NEON accelerated)");
>  MODULE_AUTHOR("Ard Biesheuvel <ard.biesheuvel@linaro.org>");
>  MODULE_LICENSE("GPL v2");
>  MODULE_ALIAS_CRYPTO("chacha20");
> --
> 2.19.2
>
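To make the round-count parameterization concrete, here is a minimal C
sketch of the same idea (function and macro names are illustrative, not
the kernel's): the permutation loop counts the remaining rounds down by
two per double-round, exactly as the "subs w3, w3, #2" in the assembly
above does, so passing 20 or 12 selects ChaCha20 or ChaCha12 from a
single routine.

#include <stdint.h>

#define ROTL32(v, n) (((v) << (n)) | ((v) >> (32 - (n))))

#define QUARTERROUND(a, b, c, d) do {               \
        a += b; d ^= a; d = ROTL32(d, 16);          \
        c += d; b ^= c; b = ROTL32(b, 12);          \
        a += b; d ^= a; d = ROTL32(d, 8);           \
        c += d; b ^= c; b = ROTL32(b, 7);           \
} while (0)

static void chacha_permute(uint32_t x[16], int nrounds)
{
        int i;

        /* nrounds must be even: each iteration is one double-round */
        for (i = nrounds; i > 0; i -= 2) {
                /* column round */
                QUARTERROUND(x[0], x[4], x[8],  x[12]);
                QUARTERROUND(x[1], x[5], x[9],  x[13]);
                QUARTERROUND(x[2], x[6], x[10], x[14]);
                QUARTERROUND(x[3], x[7], x[11], x[15]);
                /* diagonal round */
                QUARTERROUND(x[0], x[5], x[10], x[15]);
                QUARTERROUND(x[1], x[6], x[11], x[12]);
                QUARTERROUND(x[2], x[7], x[8],  x[13]);
                QUARTERROUND(x[3], x[4], x[9],  x[14]);
        }
}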

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 4/4] crypto: arm64/chacha - add XChaCha12 support
  2018-11-29  6:34 ` [PATCH 4/4] crypto: arm64/chacha - add XChaCha12 support Eric Biggers
@ 2018-11-30 13:04   ` Ard Biesheuvel
  0 siblings, 0 replies; 9+ messages in thread
From: Ard Biesheuvel @ 2018-11-30 13:04 UTC (permalink / raw)
  To: Eric Biggers
  Cc: open list:HARDWARE RANDOM NUMBER GENERATOR CORE, Paul Crowley,
	Jason A. Donenfeld, linux-arm-kernel, Linux Kernel Mailing List

On Thu, 29 Nov 2018 at 07:35, Eric Biggers <ebiggers@kernel.org> wrote:
>
> From: Eric Biggers <ebiggers@google.com>
>
> Now that the ARM64 NEON implementation of ChaCha20 and XChaCha20 has
> been refactored to support varying the number of rounds, add support for
> XChaCha12.  This is identical to XChaCha20 except for the number of
> rounds, which is 12 instead of 20.  This can be used by Adiantum.
>
> Signed-off-by: Eric Biggers <ebiggers@google.com>

Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

> ---
>  arch/arm64/crypto/Kconfig            |  2 +-
>  arch/arm64/crypto/chacha-neon-glue.c | 18 ++++++++++++++++++
>  2 files changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
> index d54ddb8468ef5..d9a523ecdd836 100644
> --- a/arch/arm64/crypto/Kconfig
> +++ b/arch/arm64/crypto/Kconfig
> @@ -101,7 +101,7 @@ config CRYPTO_AES_ARM64_NEON_BLK
>         select CRYPTO_SIMD
>
>  config CRYPTO_CHACHA20_NEON
> -       tristate "ChaCha20 and XChaCha20 stream ciphers using NEON instructions"
> +       tristate "ChaCha20, XChaCha20, and XChaCha12 stream ciphers using NEON instructions"
>         depends on KERNEL_MODE_NEON
>         select CRYPTO_BLKCIPHER
>         select CRYPTO_CHACHA20
> diff --git a/arch/arm64/crypto/chacha-neon-glue.c b/arch/arm64/crypto/chacha-neon-glue.c
> index 4d992029b9121..346eb85498a1e 100644
> --- a/arch/arm64/crypto/chacha-neon-glue.c
> +++ b/arch/arm64/crypto/chacha-neon-glue.c
> @@ -161,6 +161,22 @@ static struct skcipher_alg algs[] = {
>                 .setkey                 = crypto_chacha20_setkey,
>                 .encrypt                = xchacha_neon,
>                 .decrypt                = xchacha_neon,
> +       }, {
> +               .base.cra_name          = "xchacha12",
> +               .base.cra_driver_name   = "xchacha12-neon",
> +               .base.cra_priority      = 300,
> +               .base.cra_blocksize     = 1,
> +               .base.cra_ctxsize       = sizeof(struct chacha_ctx),
> +               .base.cra_module        = THIS_MODULE,
> +
> +               .min_keysize            = CHACHA_KEY_SIZE,
> +               .max_keysize            = CHACHA_KEY_SIZE,
> +               .ivsize                 = XCHACHA_IV_SIZE,
> +               .chunksize              = CHACHA_BLOCK_SIZE,
> +               .walksize               = 4 * CHACHA_BLOCK_SIZE,
> +               .setkey                 = crypto_chacha12_setkey,
> +               .encrypt                = xchacha_neon,
> +               .decrypt                = xchacha_neon,
>         }
>  };
>
> @@ -187,3 +203,5 @@ MODULE_ALIAS_CRYPTO("chacha20");
>  MODULE_ALIAS_CRYPTO("chacha20-neon");
>  MODULE_ALIAS_CRYPTO("xchacha20");
>  MODULE_ALIAS_CRYPTO("xchacha20-neon");
> +MODULE_ALIAS_CRYPTO("xchacha12");
> +MODULE_ALIAS_CRYPTO("xchacha12-neon");
> --
> 2.19.2
>
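For reference, the XChaCha construction that makes this a one-screen
patch: an HChaCha pass (the ChaCha permutation with no final
feed-forward addition) over the key and the first 16 bytes of the
24-byte nonce derives a fresh subkey, and ordinary ChaCha then runs
under that subkey with the remaining 8 nonce bytes; the xchacha_neon()
glue quoted earlier in the thread does this via hchacha_block_neon().
A minimal C sketch of the HChaCha step, reusing the chacha_permute()
sketch from the previous message (illustrative, not the kernel's code):

#include <stdint.h>
#include <string.h>

/*
 * HChaCha: permute the initial state (constants, key, and the first
 * 16 nonce bytes in the usual ChaCha positions) and return words 0-3
 * and 12-15 as the derived subkey.  Unlike a full ChaCha block, the
 * input state is not added back at the end.
 */
static void hchacha_block(const uint32_t state[16], uint32_t out[8],
                          int nrounds)
{
        uint32_t x[16];

        memcpy(x, state, sizeof(x));
        chacha_permute(x, nrounds);     /* sketch from previous message */

        memcpy(&out[0], &x[0],  4 * sizeof(uint32_t));
        memcpy(&out[4], &x[12], 4 * sizeof(uint32_t));
}

XChaCha12 and XChaCha20 then differ only in the nrounds value threaded
through; the setkey difference visible above (crypto_chacha12_setkey
versus crypto_chacha20_setkey) exists just to record that round count
in the context.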

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2018-11-30 13:04 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-29  6:34 [PATCH 0/4] crypto: ARM64 NEON optimized XChaCha and NHPoly1305 (for Adiantum) Eric Biggers
2018-11-29  6:34 ` [PATCH 1/4] crypto: arm64/nhpoly1305 - add NEON-accelerated NHPoly1305 Eric Biggers
2018-11-30 12:58   ` Ard Biesheuvel
2018-11-29  6:34 ` [PATCH 2/4] crypto: arm64/chacha20 - add XChaCha20 support Eric Biggers
2018-11-30 13:02   ` Ard Biesheuvel
2018-11-29  6:34 ` [PATCH 3/4] crypto: arm64/chacha20 - refactor to allow varying number of rounds Eric Biggers
2018-11-30 13:03   ` Ard Biesheuvel
2018-11-29  6:34 ` [PATCH 4/4] crypto: arm64/chacha - add XChaCha12 support Eric Biggers
2018-11-30 13:04   ` Ard Biesheuvel
