* [PATCH v2 0/4] crypto: ARM64 NEON optimized XChaCha and NHPoly1305 (for Adiantum)
@ 2018-12-04 3:52 Eric Biggers
2018-12-04 3:52 ` [PATCH v2 1/4] crypto: arm64/nhpoly1305 - add NEON-accelerated NHPoly1305 Eric Biggers
` (4 more replies)
0 siblings, 5 replies; 7+ messages in thread
From: Eric Biggers @ 2018-12-04 3:52 UTC
To: linux-crypto
Cc: Paul Crowley, Ard Biesheuvel, Jason A . Donenfeld,
linux-arm-kernel, linux-kernel
Hello,
This series optimizes the Adiantum encryption mode for ARM64 by adding
an ARM64 NEON accelerated implementation of NHPoly1305, specifically the
NH part; and by modifying the existing ARM64 NEON implementation of
ChaCha20 to support XChaCha20 and XChaCha12.
This nearly doubles Adiantum throughput on ARM64 in the test below:
encrypting 4096-byte messages (single-threaded) on a Raspberry Pi 3
Model B v1.2, which has a Cortex-A53 processor:
                            Before       After
                            ---------    ---------
adiantum(xchacha12,aes)     44.1 MB/s    82.7 MB/s
adiantum(xchacha20,aes)     35.5 MB/s    65.7 MB/s
Decryption is almost exactly the same speed as encryption.
The biggest benefit comes from accelerating XChaCha. Accelerating NH
gives a somewhat smaller, but still significant benefit.
Performance on 512-byte inputs is also improved, though that is much
slower in the first place. When Adiantum is used with dm-crypt (or
cryptsetup), we recommend using a 4096-byte sector size.
For comparison, on the same hardware AES-256-XTS encryption is only
24.5 MB/s and decryption 21.6 MB/s, both using the NEON-bitsliced
implementation ("xts-aes-neonbs"). That is the fastest AES-256-XTS
implementation on this processor, since it doesn't have the ARMv8
Cryptography Extensions. This is despite Adiantum also being a super-
pseudorandom permutation (SPRP) over the entire sector, unlike XTS.
Note that XChaCha20 and XChaCha12 can be used for other purposes too.
Changed since v1:
- Create full stack frame in hchacha_block_neon() and
chacha_block_xor_neon().
- Use x30 instead of lr.
- Fix whitespace in nh-neon-core.S.
Eric Biggers (4):
crypto: arm64/nhpoly1305 - add NEON-accelerated NHPoly1305
crypto: arm64/chacha20 - add XChaCha20 support
crypto: arm64/chacha20 - refactor to allow varying number of rounds
crypto: arm64/chacha - add XChaCha12 support
arch/arm64/crypto/Kconfig | 7 +-
arch/arm64/crypto/Makefile | 7 +-
...hacha20-neon-core.S => chacha-neon-core.S} | 92 +++++---
arch/arm64/crypto/chacha-neon-glue.c | 207 ++++++++++++++++++
arch/arm64/crypto/chacha20-neon-glue.c | 133 -----------
arch/arm64/crypto/nh-neon-core.S | 103 +++++++++
arch/arm64/crypto/nhpoly1305-neon-glue.c | 77 +++++++
7 files changed, 461 insertions(+), 165 deletions(-)
rename arch/arm64/crypto/{chacha20-neon-core.S => chacha-neon-core.S} (90%)
create mode 100644 arch/arm64/crypto/chacha-neon-glue.c
delete mode 100644 arch/arm64/crypto/chacha20-neon-glue.c
create mode 100644 arch/arm64/crypto/nh-neon-core.S
create mode 100644 arch/arm64/crypto/nhpoly1305-neon-glue.c
--
2.19.2
* [PATCH v2 1/4] crypto: arm64/nhpoly1305 - add NEON-accelerated NHPoly1305
2018-12-04 3:52 [PATCH v2 0/4] crypto: ARM64 NEON optimized XChaCha and NHPoly1305 (for Adiantum) Eric Biggers
@ 2018-12-04 3:52 ` Eric Biggers
2018-12-04 3:52 ` [PATCH v2 2/4] crypto: arm64/chacha20 - add XChaCha20 support Eric Biggers
` (3 subsequent siblings)
4 siblings, 0 replies; 7+ messages in thread
From: Eric Biggers @ 2018-12-04 3:52 UTC
To: linux-crypto
Cc: Paul Crowley, Ard Biesheuvel, Jason A . Donenfeld,
linux-arm-kernel, linux-kernel
From: Eric Biggers <ebiggers@google.com>
Add an ARM64 NEON implementation of NHPoly1305, an ε-almost-∆-universal
hash function used in the Adiantum encryption mode. For now, only the
NH portion is actually NEON-accelerated; the Poly1305 part is less
performance-critical so is just implemented in C.
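As a rough illustration of what the NEON code computes, the NH step can
be sketched in portable C. This is a hypothetical reference, not part of
the patch: the name nh_ref, the helper load_le32, and the two constants
are made up here. Each 16-byte message stride is added word-wise to four
overlapping 16-byte windows of the key (one window per pass), and each
pass accumulates two 32x32 => 64-bit products.

```c
#include <stddef.h>
#include <stdint.h>

#define NH_REF_NUM_PASSES	4	/* one 64-bit accumulator per pass */
#define NH_REF_STRIDE_BYTES	16	/* four 32-bit words per stride */

/* Load a little-endian 32-bit word from possibly unaligned memory. */
static uint32_t load_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Hypothetical portable sketch of NH.
 *
 * Assumptions: message_len is a multiple of 16, and 'key' holds at
 * least message_len / 4 + 12 words, because pass 3 reads 12 words
 * ahead of pass 0.
 */
static void nh_ref(const uint32_t *key, const uint8_t *message,
		   size_t message_len, uint64_t sums[NH_REF_NUM_PASSES])
{
	int pass;

	for (pass = 0; pass < NH_REF_NUM_PASSES; pass++)
		sums[pass] = 0;

	while (message_len >= NH_REF_STRIDE_BYTES) {
		uint32_t m0 = load_le32(message + 0);
		uint32_t m1 = load_le32(message + 4);
		uint32_t m2 = load_le32(message + 8);
		uint32_t m3 = load_le32(message + 12);

		for (pass = 0; pass < NH_REF_NUM_PASSES; pass++) {
			/* Each pass uses a key window shifted by 4 words. */
			const uint32_t *k = key + 4 * pass;

			/* Additions wrap mod 2^32, sums wrap mod 2^64. */
			sums[pass] += (uint64_t)(uint32_t)(m0 + k[0]) *
				      (uint32_t)(m2 + k[2]);
			sums[pass] += (uint64_t)(uint32_t)(m1 + k[1]) *
				      (uint32_t)(m3 + k[3]);
		}
		key += 4;	/* slide the key by one stride */
		message += NH_REF_STRIDE_BYTES;
		message_len -= NH_REF_STRIDE_BYTES;
	}
}
```

The sketch returns the sums as native integers and leaves serialization
out; the glue code in this patch treats the stored hash as __le64 values.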
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> # big-endian
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
arch/arm64/crypto/Kconfig | 5 ++
arch/arm64/crypto/Makefile | 3 +
arch/arm64/crypto/nh-neon-core.S | 103 +++++++++++++++++++++++
arch/arm64/crypto/nhpoly1305-neon-glue.c | 77 +++++++++++++++++
4 files changed, 188 insertions(+)
create mode 100644 arch/arm64/crypto/nh-neon-core.S
create mode 100644 arch/arm64/crypto/nhpoly1305-neon-glue.c
diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index a5606823ed4d..3f5aeb786192 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -106,6 +106,11 @@ config CRYPTO_CHACHA20_NEON
select CRYPTO_BLKCIPHER
select CRYPTO_CHACHA20
+config CRYPTO_NHPOLY1305_NEON
+ tristate "NHPoly1305 hash function using NEON instructions (for Adiantum)"
+ depends on KERNEL_MODE_NEON
+ select CRYPTO_NHPOLY1305
+
config CRYPTO_AES_ARM64_BS
tristate "AES in ECB/CBC/CTR/XTS modes using bit-sliced NEON algorithm"
depends on KERNEL_MODE_NEON
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index f476fede09ba..125dbb10a93e 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -53,6 +53,9 @@ sha512-arm64-y := sha512-glue.o sha512-core.o
obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha20-neon.o
chacha20-neon-y := chacha20-neon-core.o chacha20-neon-glue.o
+obj-$(CONFIG_CRYPTO_NHPOLY1305_NEON) += nhpoly1305-neon.o
+nhpoly1305-neon-y := nh-neon-core.o nhpoly1305-neon-glue.o
+
obj-$(CONFIG_CRYPTO_AES_ARM64) += aes-arm64.o
aes-arm64-y := aes-cipher-core.o aes-cipher-glue.o
diff --git a/arch/arm64/crypto/nh-neon-core.S b/arch/arm64/crypto/nh-neon-core.S
new file mode 100644
index 000000000000..e05570c38de7
--- /dev/null
+++ b/arch/arm64/crypto/nh-neon-core.S
@@ -0,0 +1,103 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * NH - ε-almost-universal hash function, ARM64 NEON accelerated version
+ *
+ * Copyright 2018 Google LLC
+ *
+ * Author: Eric Biggers <ebiggers@google.com>
+ */
+
+#include <linux/linkage.h>
+
+ KEY .req x0
+ MESSAGE .req x1
+ MESSAGE_LEN .req x2
+ HASH .req x3
+
+ PASS0_SUMS .req v0
+ PASS1_SUMS .req v1
+ PASS2_SUMS .req v2
+ PASS3_SUMS .req v3
+ K0 .req v4
+ K1 .req v5
+ K2 .req v6
+ K3 .req v7
+ T0 .req v8
+ T1 .req v9
+ T2 .req v10
+ T3 .req v11
+ T4 .req v12
+ T5 .req v13
+ T6 .req v14
+ T7 .req v15
+
+.macro _nh_stride k0, k1, k2, k3
+
+ // Load next message stride
+ ld1 {T3.16b}, [MESSAGE], #16
+
+ // Load next key stride
+ ld1 {\k3\().4s}, [KEY], #16
+
+ // Add message words to key words
+ add T0.4s, T3.4s, \k0\().4s
+ add T1.4s, T3.4s, \k1\().4s
+ add T2.4s, T3.4s, \k2\().4s
+ add T3.4s, T3.4s, \k3\().4s
+
+ // Multiply 32x32 => 64 and accumulate
+ mov T4.d[0], T0.d[1]
+ mov T5.d[0], T1.d[1]
+ mov T6.d[0], T2.d[1]
+ mov T7.d[0], T3.d[1]
+ umlal PASS0_SUMS.2d, T0.2s, T4.2s
+ umlal PASS1_SUMS.2d, T1.2s, T5.2s
+ umlal PASS2_SUMS.2d, T2.2s, T6.2s
+ umlal PASS3_SUMS.2d, T3.2s, T7.2s
+.endm
+
+/*
+ * void nh_neon(const u32 *key, const u8 *message, size_t message_len,
+ * u8 hash[NH_HASH_BYTES])
+ *
+ * It's guaranteed that message_len % 16 == 0.
+ */
+ENTRY(nh_neon)
+
+ ld1 {K0.4s,K1.4s}, [KEY], #32
+ movi PASS0_SUMS.2d, #0
+ movi PASS1_SUMS.2d, #0
+ ld1 {K2.4s}, [KEY], #16
+ movi PASS2_SUMS.2d, #0
+ movi PASS3_SUMS.2d, #0
+
+ subs MESSAGE_LEN, MESSAGE_LEN, #64
+ blt .Lloop4_done
+.Lloop4:
+ _nh_stride K0, K1, K2, K3
+ _nh_stride K1, K2, K3, K0
+ _nh_stride K2, K3, K0, K1
+ _nh_stride K3, K0, K1, K2
+ subs MESSAGE_LEN, MESSAGE_LEN, #64
+ bge .Lloop4
+
+.Lloop4_done:
+ ands MESSAGE_LEN, MESSAGE_LEN, #63
+ beq .Ldone
+ _nh_stride K0, K1, K2, K3
+
+ subs MESSAGE_LEN, MESSAGE_LEN, #16
+ beq .Ldone
+ _nh_stride K1, K2, K3, K0
+
+ subs MESSAGE_LEN, MESSAGE_LEN, #16
+ beq .Ldone
+ _nh_stride K2, K3, K0, K1
+
+.Ldone:
+ // Sum the accumulators for each pass, then store the sums to 'hash'
+ addp T0.2d, PASS0_SUMS.2d, PASS1_SUMS.2d
+ addp T1.2d, PASS2_SUMS.2d, PASS3_SUMS.2d
+ st1 {T0.16b,T1.16b}, [HASH]
+ ret
+ENDPROC(nh_neon)
diff --git a/arch/arm64/crypto/nhpoly1305-neon-glue.c b/arch/arm64/crypto/nhpoly1305-neon-glue.c
new file mode 100644
index 000000000000..22cc32ac9448
--- /dev/null
+++ b/arch/arm64/crypto/nhpoly1305-neon-glue.c
@@ -0,0 +1,77 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * NHPoly1305 - ε-almost-∆-universal hash function for Adiantum
+ * (ARM64 NEON accelerated version)
+ *
+ * Copyright 2018 Google LLC
+ */
+
+#include <asm/neon.h>
+#include <asm/simd.h>
+#include <crypto/internal/hash.h>
+#include <crypto/nhpoly1305.h>
+#include <linux/module.h>
+
+asmlinkage void nh_neon(const u32 *key, const u8 *message, size_t message_len,
+ u8 hash[NH_HASH_BYTES]);
+
+/* wrapper to avoid indirect call to assembly, which doesn't work with CFI */
+static void _nh_neon(const u32 *key, const u8 *message, size_t message_len,
+ __le64 hash[NH_NUM_PASSES])
+{
+ nh_neon(key, message, message_len, (u8 *)hash);
+}
+
+static int nhpoly1305_neon_update(struct shash_desc *desc,
+ const u8 *src, unsigned int srclen)
+{
+ if (srclen < 64 || !may_use_simd())
+ return crypto_nhpoly1305_update(desc, src, srclen);
+
+ do {
+ unsigned int n = min_t(unsigned int, srclen, PAGE_SIZE);
+
+ kernel_neon_begin();
+ crypto_nhpoly1305_update_helper(desc, src, n, _nh_neon);
+ kernel_neon_end();
+ src += n;
+ srclen -= n;
+ } while (srclen);
+ return 0;
+}
+
+static struct shash_alg nhpoly1305_alg = {
+ .base.cra_name = "nhpoly1305",
+ .base.cra_driver_name = "nhpoly1305-neon",
+ .base.cra_priority = 200,
+ .base.cra_ctxsize = sizeof(struct nhpoly1305_key),
+ .base.cra_module = THIS_MODULE,
+ .digestsize = POLY1305_DIGEST_SIZE,
+ .init = crypto_nhpoly1305_init,
+ .update = nhpoly1305_neon_update,
+ .final = crypto_nhpoly1305_final,
+ .setkey = crypto_nhpoly1305_setkey,
+ .descsize = sizeof(struct nhpoly1305_state),
+};
+
+static int __init nhpoly1305_mod_init(void)
+{
+ if (!(elf_hwcap & HWCAP_ASIMD))
+ return -ENODEV;
+
+ return crypto_register_shash(&nhpoly1305_alg);
+}
+
+static void __exit nhpoly1305_mod_exit(void)
+{
+ crypto_unregister_shash(&nhpoly1305_alg);
+}
+
+module_init(nhpoly1305_mod_init);
+module_exit(nhpoly1305_mod_exit);
+
+MODULE_DESCRIPTION("NHPoly1305 ε-almost-∆-universal hash function (NEON-accelerated)");
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Eric Biggers <ebiggers@google.com>");
+MODULE_ALIAS_CRYPTO("nhpoly1305");
+MODULE_ALIAS_CRYPTO("nhpoly1305-neon");
--
2.19.2
* [PATCH v2 2/4] crypto: arm64/chacha20 - add XChaCha20 support
2018-12-04 3:52 [PATCH v2 0/4] crypto: ARM64 NEON optimized XChaCha and NHPoly1305 (for Adiantum) Eric Biggers
2018-12-04 3:52 ` [PATCH v2 1/4] crypto: arm64/nhpoly1305 - add NEON-accelerated NHPoly1305 Eric Biggers
@ 2018-12-04 3:52 ` Eric Biggers
2018-12-04 14:51 ` Ard Biesheuvel
2018-12-04 3:52 ` [PATCH v2 3/4] crypto: arm64/chacha20 - refactor to allow varying number of rounds Eric Biggers
` (2 subsequent siblings)
4 siblings, 1 reply; 7+ messages in thread
From: Eric Biggers @ 2018-12-04 3:52 UTC
To: linux-crypto
Cc: Paul Crowley, Ard Biesheuvel, Jason A . Donenfeld,
linux-arm-kernel, linux-kernel
From: Eric Biggers <ebiggers@google.com>
Add an XChaCha20 implementation that is hooked up to the ARM64 NEON
implementation of ChaCha20. This can be used by Adiantum.
A NEON implementation of single-block HChaCha20 is also added so that
XChaCha20 can use it rather than the generic implementation. This
required refactoring the ChaCha20 permutation into its own function.
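For orientation, the way the glue code in this patch builds the inner
ChaCha20 IV from the 32-byte XChaCha20 IV can be sketched as a small
standalone helper. The name xchacha_inner_iv and the two size macros are
hypothetical; the byte offsets are taken from the memcpy() calls in
xchacha20_neon(). The description of what each range holds reflects my
reading of the layout and should be treated as an assumption.

```c
#include <stdint.h>
#include <string.h>

#define XCHACHA_REF_IV_SIZE	32
#define CHACHA_REF_IV_SIZE	16

/*
 * Hypothetical helper mirroring the IV handling in xchacha20_neon().
 *
 * Assumed layout of the 32-byte XChaCha IV:
 *   iv[0..15]  - fed to HChaCha20 together with the key (not used here)
 *   iv[16..23] - remaining 8 nonce bytes
 *   iv[24..31] - 8-byte stream position
 *
 * The inner ChaCha20 IV places the stream position first, then the
 * remaining nonce bytes.
 */
static void xchacha_inner_iv(const uint8_t iv[XCHACHA_REF_IV_SIZE],
			     uint8_t real_iv[CHACHA_REF_IV_SIZE])
{
	memcpy(&real_iv[0], iv + 24, 8);
	memcpy(&real_iv[8], iv + 16, 8);
}
```

The subkey itself comes from running HChaCha20 over the state built from
the original key and the first 16 IV bytes, which is why the inner
stream cipher only needs the last 16 bytes.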
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
arch/arm64/crypto/Kconfig | 2 +-
arch/arm64/crypto/chacha20-neon-core.S | 65 +++++++++++-----
arch/arm64/crypto/chacha20-neon-glue.c | 101 +++++++++++++++++++------
3 files changed, 125 insertions(+), 43 deletions(-)
diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index 3f5aeb786192..d54ddb8468ef 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -101,7 +101,7 @@ config CRYPTO_AES_ARM64_NEON_BLK
select CRYPTO_SIMD
config CRYPTO_CHACHA20_NEON
- tristate "NEON accelerated ChaCha20 symmetric cipher"
+ tristate "ChaCha20 and XChaCha20 stream ciphers using NEON instructions"
depends on KERNEL_MODE_NEON
select CRYPTO_BLKCIPHER
select CRYPTO_CHACHA20
diff --git a/arch/arm64/crypto/chacha20-neon-core.S b/arch/arm64/crypto/chacha20-neon-core.S
index 13c85e272c2a..0571e45a1a0a 100644
--- a/arch/arm64/crypto/chacha20-neon-core.S
+++ b/arch/arm64/crypto/chacha20-neon-core.S
@@ -23,25 +23,20 @@
.text
.align 6
-ENTRY(chacha20_block_xor_neon)
- // x0: Input state matrix, s
- // x1: 1 data block output, o
- // x2: 1 data block input, i
-
- //
- // This function encrypts one ChaCha20 block by loading the state matrix
- // in four NEON registers. It performs matrix operation on four words in
- // parallel, but requires shuffling to rearrange the words after each
- // round.
- //
-
- // x0..3 = s0..3
- adr x3, ROT8
- ld1 {v0.4s-v3.4s}, [x0]
- ld1 {v8.4s-v11.4s}, [x0]
- ld1 {v12.4s}, [x3]
+/*
+ * chacha20_permute - permute one block
+ *
+ * Permute one 64-byte block where the state matrix is stored in the four NEON
+ * registers v0-v3. It performs matrix operations on four words in parallel,
+ * but requires shuffling to rearrange the words after each round.
+ *
+ * Clobbers: x3, x10, v4, v12
+ */
+chacha20_permute:
mov x3, #10
+ adr x10, ROT8
+ ld1 {v12.4s}, [x10]
.Ldoubleround:
// x0 += x1, x3 = rotl32(x3 ^ x0, 16)
@@ -105,6 +100,23 @@ ENTRY(chacha20_block_xor_neon)
subs x3, x3, #1
b.ne .Ldoubleround
+ ret
+ENDPROC(chacha20_permute)
+
+ENTRY(chacha20_block_xor_neon)
+ // x0: Input state matrix, s
+ // x1: 1 data block output, o
+ // x2: 1 data block input, i
+
+ stp x29, x30, [sp, #-16]!
+ mov x29, sp
+
+ // x0..3 = s0..3
+ ld1 {v0.4s-v3.4s}, [x0]
+ ld1 {v8.4s-v11.4s}, [x0]
+
+ bl chacha20_permute
+
ld1 {v4.16b-v7.16b}, [x2]
// o0 = i0 ^ (x0 + s0)
@@ -125,9 +137,28 @@ ENTRY(chacha20_block_xor_neon)
st1 {v0.16b-v3.16b}, [x1]
+ ldp x29, x30, [sp], #16
ret
ENDPROC(chacha20_block_xor_neon)
+ENTRY(hchacha20_block_neon)
+ // x0: Input state matrix, s
+ // x1: output (8 32-bit words)
+
+ stp x29, x30, [sp, #-16]!
+ mov x29, sp
+
+ ld1 {v0.4s-v3.4s}, [x0]
+
+ bl chacha20_permute
+
+ st1 {v0.16b}, [x1], #16
+ st1 {v3.16b}, [x1]
+
+ ldp x29, x30, [sp], #16
+ ret
+ENDPROC(hchacha20_block_neon)
+
.align 6
ENTRY(chacha20_4block_xor_neon)
// x0: Input state matrix, s
diff --git a/arch/arm64/crypto/chacha20-neon-glue.c b/arch/arm64/crypto/chacha20-neon-glue.c
index 96e0cfb8c3f5..a5b9cbc0c4de 100644
--- a/arch/arm64/crypto/chacha20-neon-glue.c
+++ b/arch/arm64/crypto/chacha20-neon-glue.c
@@ -30,6 +30,7 @@
asmlinkage void chacha20_block_xor_neon(u32 *state, u8 *dst, const u8 *src);
asmlinkage void chacha20_4block_xor_neon(u32 *state, u8 *dst, const u8 *src);
+asmlinkage void hchacha20_block_neon(const u32 *state, u32 *out);
static void chacha20_doneon(u32 *state, u8 *dst, const u8 *src,
unsigned int bytes)
@@ -65,20 +66,16 @@ static void chacha20_doneon(u32 *state, u8 *dst, const u8 *src,
kernel_neon_end();
}
-static int chacha20_neon(struct skcipher_request *req)
+static int chacha20_neon_stream_xor(struct skcipher_request *req,
+ struct chacha_ctx *ctx, u8 *iv)
{
- struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
- struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
struct skcipher_walk walk;
u32 state[16];
int err;
- if (!may_use_simd() || req->cryptlen <= CHACHA_BLOCK_SIZE)
- return crypto_chacha_crypt(req);
-
err = skcipher_walk_virt(&walk, req, false);
- crypto_chacha_init(state, ctx, walk.iv);
+ crypto_chacha_init(state, ctx, iv);
while (walk.nbytes > 0) {
unsigned int nbytes = walk.nbytes;
@@ -94,22 +91,73 @@ static int chacha20_neon(struct skcipher_request *req)
return err;
}
-static struct skcipher_alg alg = {
- .base.cra_name = "chacha20",
- .base.cra_driver_name = "chacha20-neon",
- .base.cra_priority = 300,
- .base.cra_blocksize = 1,
- .base.cra_ctxsize = sizeof(struct chacha_ctx),
- .base.cra_module = THIS_MODULE,
-
- .min_keysize = CHACHA_KEY_SIZE,
- .max_keysize = CHACHA_KEY_SIZE,
- .ivsize = CHACHA_IV_SIZE,
- .chunksize = CHACHA_BLOCK_SIZE,
- .walksize = 4 * CHACHA_BLOCK_SIZE,
- .setkey = crypto_chacha20_setkey,
- .encrypt = chacha20_neon,
- .decrypt = chacha20_neon,
+static int chacha20_neon(struct skcipher_request *req)
+{
+ struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+ struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+ if (req->cryptlen <= CHACHA_BLOCK_SIZE || !may_use_simd())
+ return crypto_chacha_crypt(req);
+
+ return chacha20_neon_stream_xor(req, ctx, req->iv);
+}
+
+static int xchacha20_neon(struct skcipher_request *req)
+{
+ struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+ struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
+ struct chacha_ctx subctx;
+ u32 state[16];
+ u8 real_iv[16];
+
+ if (req->cryptlen <= CHACHA_BLOCK_SIZE || !may_use_simd())
+ return crypto_xchacha_crypt(req);
+
+ crypto_chacha_init(state, ctx, req->iv);
+
+ kernel_neon_begin();
+ hchacha20_block_neon(state, subctx.key);
+ kernel_neon_end();
+
+ memcpy(&real_iv[0], req->iv + 24, 8);
+ memcpy(&real_iv[8], req->iv + 16, 8);
+ return chacha20_neon_stream_xor(req, &subctx, real_iv);
+}
+
+static struct skcipher_alg algs[] = {
+ {
+ .base.cra_name = "chacha20",
+ .base.cra_driver_name = "chacha20-neon",
+ .base.cra_priority = 300,
+ .base.cra_blocksize = 1,
+ .base.cra_ctxsize = sizeof(struct chacha_ctx),
+ .base.cra_module = THIS_MODULE,
+
+ .min_keysize = CHACHA_KEY_SIZE,
+ .max_keysize = CHACHA_KEY_SIZE,
+ .ivsize = CHACHA_IV_SIZE,
+ .chunksize = CHACHA_BLOCK_SIZE,
+ .walksize = 4 * CHACHA_BLOCK_SIZE,
+ .setkey = crypto_chacha20_setkey,
+ .encrypt = chacha20_neon,
+ .decrypt = chacha20_neon,
+ }, {
+ .base.cra_name = "xchacha20",
+ .base.cra_driver_name = "xchacha20-neon",
+ .base.cra_priority = 300,
+ .base.cra_blocksize = 1,
+ .base.cra_ctxsize = sizeof(struct chacha_ctx),
+ .base.cra_module = THIS_MODULE,
+
+ .min_keysize = CHACHA_KEY_SIZE,
+ .max_keysize = CHACHA_KEY_SIZE,
+ .ivsize = XCHACHA_IV_SIZE,
+ .chunksize = CHACHA_BLOCK_SIZE,
+ .walksize = 4 * CHACHA_BLOCK_SIZE,
+ .setkey = crypto_chacha20_setkey,
+ .encrypt = xchacha20_neon,
+ .decrypt = xchacha20_neon,
+ }
};
static int __init chacha20_simd_mod_init(void)
@@ -117,12 +165,12 @@ static int __init chacha20_simd_mod_init(void)
if (!(elf_hwcap & HWCAP_ASIMD))
return -ENODEV;
- return crypto_register_skcipher(&alg);
+ return crypto_register_skciphers(algs, ARRAY_SIZE(algs));
}
static void __exit chacha20_simd_mod_fini(void)
{
- crypto_unregister_skcipher(&alg);
+ crypto_unregister_skciphers(algs, ARRAY_SIZE(algs));
}
module_init(chacha20_simd_mod_init);
@@ -131,3 +179,6 @@ module_exit(chacha20_simd_mod_fini);
MODULE_AUTHOR("Ard Biesheuvel <ard.biesheuvel@linaro.org>");
MODULE_LICENSE("GPL v2");
MODULE_ALIAS_CRYPTO("chacha20");
+MODULE_ALIAS_CRYPTO("chacha20-neon");
+MODULE_ALIAS_CRYPTO("xchacha20");
+MODULE_ALIAS_CRYPTO("xchacha20-neon");
--
2.19.2
* [PATCH v2 3/4] crypto: arm64/chacha20 - refactor to allow varying number of rounds
2018-12-04 3:52 [PATCH v2 0/4] crypto: ARM64 NEON optimized XChaCha and NHPoly1305 (for Adiantum) Eric Biggers
2018-12-04 3:52 ` [PATCH v2 1/4] crypto: arm64/nhpoly1305 - add NEON-accelerated NHPoly1305 Eric Biggers
2018-12-04 3:52 ` [PATCH v2 2/4] crypto: arm64/chacha20 - add XChaCha20 support Eric Biggers
@ 2018-12-04 3:52 ` Eric Biggers
2018-12-04 3:52 ` [PATCH v2 4/4] crypto: arm64/chacha - add XChaCha12 support Eric Biggers
2018-12-13 10:31 ` [PATCH v2 0/4] crypto: ARM64 NEON optimized XChaCha and NHPoly1305 (for Adiantum) Herbert Xu
4 siblings, 0 replies; 7+ messages in thread
From: Eric Biggers @ 2018-12-04 3:52 UTC
To: linux-crypto
Cc: Paul Crowley, Ard Biesheuvel, Jason A . Donenfeld,
linux-arm-kernel, linux-kernel
From: Eric Biggers <ebiggers@google.com>
In preparation for adding XChaCha12 support, rename/refactor the ARM64
NEON implementation of ChaCha20 to support different numbers of rounds.
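A minimal C sketch of the round-count convention follows, assuming the
standard ChaCha quarter round. The names chacha_permute_ref,
hchacha_block_ref and the QR macro are hypothetical and only model the
scalar behaviour, not the NEON register layout. Each loop iteration is
one double round (a column round plus a diagonal round), so the counter
steps down by 2, and nrounds is expected to be even and positive (20 or
12 in this series).

```c
#include <stdint.h>
#include <string.h>

static uint32_t rotl32(uint32_t v, int n)
{
	return (v << n) | (v >> (32 - n));
}

/* Standard ChaCha quarter round on four words of the state. */
#define QR(x, a, b, c, d) do {						\
	x[a] += x[b]; x[d] = rotl32(x[d] ^ x[a], 16);			\
	x[c] += x[d]; x[b] = rotl32(x[b] ^ x[c], 12);			\
	x[a] += x[b]; x[d] = rotl32(x[d] ^ x[a], 8);			\
	x[c] += x[d]; x[b] = rotl32(x[b] ^ x[c], 7);			\
} while (0)

/*
 * Hypothetical scalar model of the permutation: one double round per
 * iteration, counting nrounds down by 2. The assembly loops with
 * subs/b.ne, so it relies on nrounds being even and nonzero.
 */
static void chacha_permute_ref(uint32_t x[16], int nrounds)
{
	for (; nrounds > 0; nrounds -= 2) {
		/* column round */
		QR(x, 0, 4, 8, 12);
		QR(x, 1, 5, 9, 13);
		QR(x, 2, 6, 10, 14);
		QR(x, 3, 7, 11, 15);
		/* diagonal round */
		QR(x, 0, 5, 10, 15);
		QR(x, 1, 6, 11, 12);
		QR(x, 2, 7, 8, 13);
		QR(x, 3, 4, 9, 14);
	}
}

/*
 * Hypothetical scalar model of HChaCha: permute a copy of the state
 * and output words 0-3 and 12-15, with no feed-forward addition.
 */
static void hchacha_block_ref(const uint32_t state[16], uint32_t out[8],
			      int nrounds)
{
	uint32_t x[16];

	memcpy(x, state, sizeof(x));
	chacha_permute_ref(x, nrounds);
	memcpy(&out[0], &x[0], 4 * sizeof(uint32_t));
	memcpy(&out[4], &x[12], 4 * sizeof(uint32_t));
}
```

Because every double round is identical, running 12 rounds and then 8
more gives the same state as running 20 at once, which is what makes a
single parameterized loop sufficient for both variants.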
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
arch/arm64/crypto/Makefile | 4 +-
...hacha20-neon-core.S => chacha-neon-core.S} | 45 ++++++++-------
...hacha20-neon-glue.c => chacha-neon-glue.c} | 57 ++++++++++---------
3 files changed, 57 insertions(+), 49 deletions(-)
rename arch/arm64/crypto/{chacha20-neon-core.S => chacha-neon-core.S} (94%)
rename arch/arm64/crypto/{chacha20-neon-glue.c => chacha-neon-glue.c} (71%)
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index 125dbb10a93e..a4ffd9fe3265 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -50,8 +50,8 @@ sha256-arm64-y := sha256-glue.o sha256-core.o
obj-$(CONFIG_CRYPTO_SHA512_ARM64) += sha512-arm64.o
sha512-arm64-y := sha512-glue.o sha512-core.o
-obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha20-neon.o
-chacha20-neon-y := chacha20-neon-core.o chacha20-neon-glue.o
+obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha-neon.o
+chacha-neon-y := chacha-neon-core.o chacha-neon-glue.o
obj-$(CONFIG_CRYPTO_NHPOLY1305_NEON) += nhpoly1305-neon.o
nhpoly1305-neon-y := nh-neon-core.o nhpoly1305-neon-glue.o
diff --git a/arch/arm64/crypto/chacha20-neon-core.S b/arch/arm64/crypto/chacha-neon-core.S
similarity index 94%
rename from arch/arm64/crypto/chacha20-neon-core.S
rename to arch/arm64/crypto/chacha-neon-core.S
index 0571e45a1a0a..3d3a12db5204 100644
--- a/arch/arm64/crypto/chacha20-neon-core.S
+++ b/arch/arm64/crypto/chacha-neon-core.S
@@ -1,5 +1,5 @@
/*
- * ChaCha20 256-bit cipher algorithm, RFC7539, arm64 NEON functions
+ * ChaCha/XChaCha NEON helper functions
*
* Copyright (C) 2016 Linaro, Ltd. <ard.biesheuvel@linaro.org>
*
@@ -24,17 +24,18 @@
.align 6
/*
- * chacha20_permute - permute one block
+ * chacha_permute - permute one block
*
* Permute one 64-byte block where the state matrix is stored in the four NEON
* registers v0-v3. It performs matrix operations on four words in parallel,
* but requires shuffling to rearrange the words after each round.
*
- * Clobbers: x3, x10, v4, v12
+ * The round count is given in w3.
+ *
+ * Clobbers: w3, x10, v4, v12
*/
-chacha20_permute:
+chacha_permute:
- mov x3, #10
adr x10, ROT8
ld1 {v12.4s}, [x10]
@@ -97,16 +98,17 @@ chacha20_permute:
// x3 = shuffle32(x3, MASK(0, 3, 2, 1))
ext v3.16b, v3.16b, v3.16b, #4
- subs x3, x3, #1
+ subs w3, w3, #2
b.ne .Ldoubleround
ret
-ENDPROC(chacha20_permute)
+ENDPROC(chacha_permute)
-ENTRY(chacha20_block_xor_neon)
+ENTRY(chacha_block_xor_neon)
// x0: Input state matrix, s
// x1: 1 data block output, o
// x2: 1 data block input, i
+ // w3: nrounds
stp x29, x30, [sp, #-16]!
mov x29, sp
@@ -115,7 +117,7 @@ ENTRY(chacha20_block_xor_neon)
ld1 {v0.4s-v3.4s}, [x0]
ld1 {v8.4s-v11.4s}, [x0]
- bl chacha20_permute
+ bl chacha_permute
ld1 {v4.16b-v7.16b}, [x2]
@@ -139,42 +141,45 @@ ENTRY(chacha20_block_xor_neon)
ldp x29, x30, [sp], #16
ret
-ENDPROC(chacha20_block_xor_neon)
+ENDPROC(chacha_block_xor_neon)
-ENTRY(hchacha20_block_neon)
+ENTRY(hchacha_block_neon)
// x0: Input state matrix, s
// x1: output (8 32-bit words)
+ // w2: nrounds
stp x29, x30, [sp, #-16]!
mov x29, sp
ld1 {v0.4s-v3.4s}, [x0]
- bl chacha20_permute
+ mov w3, w2
+ bl chacha_permute
st1 {v0.16b}, [x1], #16
st1 {v3.16b}, [x1]
ldp x29, x30, [sp], #16
ret
-ENDPROC(hchacha20_block_neon)
+ENDPROC(hchacha_block_neon)
.align 6
-ENTRY(chacha20_4block_xor_neon)
+ENTRY(chacha_4block_xor_neon)
// x0: Input state matrix, s
// x1: 4 data blocks output, o
// x2: 4 data blocks input, i
+ // w3: nrounds
//
- // This function encrypts four consecutive ChaCha20 blocks by loading
+ // This function encrypts four consecutive ChaCha blocks by loading
// the state matrix in NEON registers four times. The algorithm performs
// each operation on the corresponding word of each state matrix, hence
// requires no word shuffling. For final XORing step we transpose the
// matrix by interleaving 32- and then 64-bit words, which allows us to
// do XOR in NEON registers.
//
- adr x3, CTRINC // ... and ROT8
- ld1 {v30.4s-v31.4s}, [x3]
+ adr x9, CTRINC // ... and ROT8
+ ld1 {v30.4s-v31.4s}, [x9]
// x0..15[0-3] = s0..3[0..3]
mov x4, x0
@@ -186,8 +191,6 @@ ENTRY(chacha20_4block_xor_neon)
// x12 += counter values 0-3
add v12.4s, v12.4s, v30.4s
- mov x3, #10
-
.Ldoubleround4:
// x0 += x4, x12 = rotl32(x12 ^ x0, 16)
// x1 += x5, x13 = rotl32(x13 ^ x1, 16)
@@ -361,7 +364,7 @@ ENTRY(chacha20_4block_xor_neon)
sri v7.4s, v18.4s, #25
sri v4.4s, v19.4s, #25
- subs x3, x3, #1
+ subs w3, w3, #2
b.ne .Ldoubleround4
ld4r {v16.4s-v19.4s}, [x0], #16
@@ -475,7 +478,7 @@ ENTRY(chacha20_4block_xor_neon)
st1 {v28.16b-v31.16b}, [x1]
ret
-ENDPROC(chacha20_4block_xor_neon)
+ENDPROC(chacha_4block_xor_neon)
CTRINC: .word 0, 1, 2, 3
ROT8: .word 0x02010003, 0x06050407, 0x0a09080b, 0x0e0d0c0f
diff --git a/arch/arm64/crypto/chacha20-neon-glue.c b/arch/arm64/crypto/chacha-neon-glue.c
similarity index 71%
rename from arch/arm64/crypto/chacha20-neon-glue.c
rename to arch/arm64/crypto/chacha-neon-glue.c
index a5b9cbc0c4de..4d992029b912 100644
--- a/arch/arm64/crypto/chacha20-neon-glue.c
+++ b/arch/arm64/crypto/chacha-neon-glue.c
@@ -1,5 +1,6 @@
/*
- * ChaCha20 256-bit cipher algorithm, RFC7539, arm64 NEON functions
+ * ARM NEON accelerated ChaCha and XChaCha stream ciphers,
+ * including ChaCha20 (RFC7539)
*
* Copyright (C) 2016 - 2017 Linaro, Ltd. <ard.biesheuvel@linaro.org>
*
@@ -28,18 +29,20 @@
#include <asm/neon.h>
#include <asm/simd.h>
-asmlinkage void chacha20_block_xor_neon(u32 *state, u8 *dst, const u8 *src);
-asmlinkage void chacha20_4block_xor_neon(u32 *state, u8 *dst, const u8 *src);
-asmlinkage void hchacha20_block_neon(const u32 *state, u32 *out);
+asmlinkage void chacha_block_xor_neon(u32 *state, u8 *dst, const u8 *src,
+ int nrounds);
+asmlinkage void chacha_4block_xor_neon(u32 *state, u8 *dst, const u8 *src,
+ int nrounds);
+asmlinkage void hchacha_block_neon(const u32 *state, u32 *out, int nrounds);
-static void chacha20_doneon(u32 *state, u8 *dst, const u8 *src,
- unsigned int bytes)
+static void chacha_doneon(u32 *state, u8 *dst, const u8 *src,
+ unsigned int bytes, int nrounds)
{
u8 buf[CHACHA_BLOCK_SIZE];
while (bytes >= CHACHA_BLOCK_SIZE * 4) {
kernel_neon_begin();
- chacha20_4block_xor_neon(state, dst, src);
+ chacha_4block_xor_neon(state, dst, src, nrounds);
kernel_neon_end();
bytes -= CHACHA_BLOCK_SIZE * 4;
src += CHACHA_BLOCK_SIZE * 4;
@@ -52,7 +55,7 @@ static void chacha20_doneon(u32 *state, u8 *dst, const u8 *src,
kernel_neon_begin();
while (bytes >= CHACHA_BLOCK_SIZE) {
- chacha20_block_xor_neon(state, dst, src);
+ chacha_block_xor_neon(state, dst, src, nrounds);
bytes -= CHACHA_BLOCK_SIZE;
src += CHACHA_BLOCK_SIZE;
dst += CHACHA_BLOCK_SIZE;
@@ -60,14 +63,14 @@ static void chacha20_doneon(u32 *state, u8 *dst, const u8 *src,
}
if (bytes) {
memcpy(buf, src, bytes);
- chacha20_block_xor_neon(state, buf, buf);
+ chacha_block_xor_neon(state, buf, buf, nrounds);
memcpy(dst, buf, bytes);
}
kernel_neon_end();
}
-static int chacha20_neon_stream_xor(struct skcipher_request *req,
- struct chacha_ctx *ctx, u8 *iv)
+static int chacha_neon_stream_xor(struct skcipher_request *req,
+ struct chacha_ctx *ctx, u8 *iv)
{
struct skcipher_walk walk;
u32 state[16];
@@ -83,15 +86,15 @@ static int chacha20_neon_stream_xor(struct skcipher_request *req,
if (nbytes < walk.total)
nbytes = round_down(nbytes, walk.stride);
- chacha20_doneon(state, walk.dst.virt.addr, walk.src.virt.addr,
- nbytes);
+ chacha_doneon(state, walk.dst.virt.addr, walk.src.virt.addr,
+ nbytes, ctx->nrounds);
err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
}
return err;
}
-static int chacha20_neon(struct skcipher_request *req)
+static int chacha_neon(struct skcipher_request *req)
{
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
@@ -99,10 +102,10 @@ static int chacha20_neon(struct skcipher_request *req)
if (req->cryptlen <= CHACHA_BLOCK_SIZE || !may_use_simd())
return crypto_chacha_crypt(req);
- return chacha20_neon_stream_xor(req, ctx, req->iv);
+ return chacha_neon_stream_xor(req, ctx, req->iv);
}
-static int xchacha20_neon(struct skcipher_request *req)
+static int xchacha_neon(struct skcipher_request *req)
{
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
@@ -116,12 +119,13 @@ static int xchacha20_neon(struct skcipher_request *req)
crypto_chacha_init(state, ctx, req->iv);
kernel_neon_begin();
- hchacha20_block_neon(state, subctx.key);
+ hchacha_block_neon(state, subctx.key, ctx->nrounds);
kernel_neon_end();
+ subctx.nrounds = ctx->nrounds;
memcpy(&real_iv[0], req->iv + 24, 8);
memcpy(&real_iv[8], req->iv + 16, 8);
- return chacha20_neon_stream_xor(req, &subctx, real_iv);
+ return chacha_neon_stream_xor(req, &subctx, real_iv);
}
static struct skcipher_alg algs[] = {
@@ -139,8 +143,8 @@ static struct skcipher_alg algs[] = {
.chunksize = CHACHA_BLOCK_SIZE,
.walksize = 4 * CHACHA_BLOCK_SIZE,
.setkey = crypto_chacha20_setkey,
- .encrypt = chacha20_neon,
- .decrypt = chacha20_neon,
+ .encrypt = chacha_neon,
+ .decrypt = chacha_neon,
}, {
.base.cra_name = "xchacha20",
.base.cra_driver_name = "xchacha20-neon",
@@ -155,12 +159,12 @@ static struct skcipher_alg algs[] = {
.chunksize = CHACHA_BLOCK_SIZE,
.walksize = 4 * CHACHA_BLOCK_SIZE,
.setkey = crypto_chacha20_setkey,
- .encrypt = xchacha20_neon,
- .decrypt = xchacha20_neon,
+ .encrypt = xchacha_neon,
+ .decrypt = xchacha_neon,
}
};
-static int __init chacha20_simd_mod_init(void)
+static int __init chacha_simd_mod_init(void)
{
if (!(elf_hwcap & HWCAP_ASIMD))
return -ENODEV;
@@ -168,14 +172,15 @@ static int __init chacha20_simd_mod_init(void)
return crypto_register_skciphers(algs, ARRAY_SIZE(algs));
}
-static void __exit chacha20_simd_mod_fini(void)
+static void __exit chacha_simd_mod_fini(void)
{
crypto_unregister_skciphers(algs, ARRAY_SIZE(algs));
}
-module_init(chacha20_simd_mod_init);
-module_exit(chacha20_simd_mod_fini);
+module_init(chacha_simd_mod_init);
+module_exit(chacha_simd_mod_fini);
+MODULE_DESCRIPTION("ChaCha and XChaCha stream ciphers (NEON accelerated)");
MODULE_AUTHOR("Ard Biesheuvel <ard.biesheuvel@linaro.org>");
MODULE_LICENSE("GPL v2");
MODULE_ALIAS_CRYPTO("chacha20");
--
2.19.2
* [PATCH v2 4/4] crypto: arm64/chacha - add XChaCha12 support
2018-12-04 3:52 [PATCH v2 0/4] crypto: ARM64 NEON optimized XChaCha and NHPoly1305 (for Adiantum) Eric Biggers
` (2 preceding siblings ...)
2018-12-04 3:52 ` [PATCH v2 3/4] crypto: arm64/chacha20 - refactor to allow varying number of rounds Eric Biggers
@ 2018-12-04 3:52 ` Eric Biggers
2018-12-13 10:31 ` [PATCH v2 0/4] crypto: ARM64 NEON optimized XChaCha and NHPoly1305 (for Adiantum) Herbert Xu
4 siblings, 0 replies; 7+ messages in thread
From: Eric Biggers @ 2018-12-04 3:52 UTC
To: linux-crypto
Cc: Paul Crowley, Ard Biesheuvel, Jason A . Donenfeld,
linux-arm-kernel, linux-kernel
From: Eric Biggers <ebiggers@google.com>
Now that the ARM64 NEON implementation of ChaCha20 and XChaCha20 has
been refactored to support varying the number of rounds, add support for
XChaCha12. This is identical to XChaCha20 except for the number of
rounds, which is 12 instead of 20. This can be used by Adiantum.
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
arch/arm64/crypto/Kconfig | 2 +-
arch/arm64/crypto/chacha-neon-glue.c | 18 ++++++++++++++++++
2 files changed, 19 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index d54ddb8468ef..d9a523ecdd83 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -101,7 +101,7 @@ config CRYPTO_AES_ARM64_NEON_BLK
select CRYPTO_SIMD
config CRYPTO_CHACHA20_NEON
- tristate "ChaCha20 and XChaCha20 stream ciphers using NEON instructions"
+ tristate "ChaCha20, XChaCha20, and XChaCha12 stream ciphers using NEON instructions"
depends on KERNEL_MODE_NEON
select CRYPTO_BLKCIPHER
select CRYPTO_CHACHA20
diff --git a/arch/arm64/crypto/chacha-neon-glue.c b/arch/arm64/crypto/chacha-neon-glue.c
index 4d992029b912..346eb85498a1 100644
--- a/arch/arm64/crypto/chacha-neon-glue.c
+++ b/arch/arm64/crypto/chacha-neon-glue.c
@@ -161,6 +161,22 @@ static struct skcipher_alg algs[] = {
.setkey = crypto_chacha20_setkey,
.encrypt = xchacha_neon,
.decrypt = xchacha_neon,
+ }, {
+ .base.cra_name = "xchacha12",
+ .base.cra_driver_name = "xchacha12-neon",
+ .base.cra_priority = 300,
+ .base.cra_blocksize = 1,
+ .base.cra_ctxsize = sizeof(struct chacha_ctx),
+ .base.cra_module = THIS_MODULE,
+
+ .min_keysize = CHACHA_KEY_SIZE,
+ .max_keysize = CHACHA_KEY_SIZE,
+ .ivsize = XCHACHA_IV_SIZE,
+ .chunksize = CHACHA_BLOCK_SIZE,
+ .walksize = 4 * CHACHA_BLOCK_SIZE,
+ .setkey = crypto_chacha12_setkey,
+ .encrypt = xchacha_neon,
+ .decrypt = xchacha_neon,
}
};
@@ -187,3 +203,5 @@ MODULE_ALIAS_CRYPTO("chacha20");
MODULE_ALIAS_CRYPTO("chacha20-neon");
MODULE_ALIAS_CRYPTO("xchacha20");
MODULE_ALIAS_CRYPTO("xchacha20-neon");
+MODULE_ALIAS_CRYPTO("xchacha12");
+MODULE_ALIAS_CRYPTO("xchacha12-neon");
--
2.19.2
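
Editorial note on the round count: this patch's commit message says
XChaCha12 is identical to XChaCha20 except that it runs 12 rounds instead
of 20. The portable C sketch below illustrates what a round-parameterized
ChaCha permutation looks like. It is not the kernel's NEON code; the names
rotl32(), chacha_quarterround() and chacha_permute_sketch() are
hypothetical, and nrounds is assumed to be even.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32). */
static uint32_t rotl32(uint32_t v, int n)
{
	return (v << n) | (v >> (32 - n));
}

/* One ChaCha quarter-round on four words of the 16-word state. */
static void chacha_quarterround(uint32_t x[16], int a, int b, int c, int d)
{
	x[a] += x[b]; x[d] = rotl32(x[d] ^ x[a], 16);
	x[c] += x[d]; x[b] = rotl32(x[b] ^ x[c], 12);
	x[a] += x[b]; x[d] = rotl32(x[d] ^ x[a], 8);
	x[c] += x[d]; x[b] = rotl32(x[b] ^ x[c], 7);
}

/*
 * Permute the state in place. Assumption: nrounds is even, e.g. 20 for
 * (X)ChaCha20 and 12 for (X)ChaCha12. Each iteration is one double round
 * (a column round followed by a diagonal round).
 */
static void chacha_permute_sketch(uint32_t x[16], int nrounds)
{
	int i;

	for (i = 0; i < nrounds; i += 2) {
		/* column round */
		chacha_quarterround(x, 0, 4, 8, 12);
		chacha_quarterround(x, 1, 5, 9, 13);
		chacha_quarterround(x, 2, 6, 10, 14);
		chacha_quarterround(x, 3, 7, 11, 15);
		/* diagonal round */
		chacha_quarterround(x, 0, 5, 10, 15);
		chacha_quarterround(x, 1, 6, 11, 12);
		chacha_quarterround(x, 2, 7, 8, 13);
		chacha_quarterround(x, 3, 4, 9, 14);
	}
}
```

With a helper of this shape, the 12-round and 20-round variants share all
code and differ only in the nrounds argument, which is the property the
new "xchacha12" algorithm entry relies on.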
* Re: [PATCH v2 2/4] crypto: arm64/chacha20 - add XChaCha20 support
2018-12-04 3:52 ` [PATCH v2 2/4] crypto: arm64/chacha20 - add XChaCha20 support Eric Biggers
@ 2018-12-04 14:51 ` Ard Biesheuvel
0 siblings, 0 replies; 7+ messages in thread
From: Ard Biesheuvel @ 2018-12-04 14:51 UTC (permalink / raw)
To: Eric Biggers
Cc: open list:HARDWARE RANDOM NUMBER GENERATOR CORE, Paul Crowley,
Jason A. Donenfeld, linux-arm-kernel, Linux Kernel Mailing List
On Tue, 4 Dec 2018 at 04:56, Eric Biggers <ebiggers@kernel.org> wrote:
>
> From: Eric Biggers <ebiggers@google.com>
>
> Add an XChaCha20 implementation that is hooked up to the ARM64 NEON
> implementation of ChaCha20. This can be used by Adiantum.
>
> A NEON implementation of single-block HChaCha20 is also added so that
> XChaCha20 can use it rather than the generic implementation. This
> required refactoring the ChaCha20 permutation into its own function.
>
> Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> ---
> arch/arm64/crypto/Kconfig | 2 +-
> arch/arm64/crypto/chacha20-neon-core.S | 65 +++++++++++-----
> arch/arm64/crypto/chacha20-neon-glue.c | 101 +++++++++++++++++++------
> 3 files changed, 125 insertions(+), 43 deletions(-)
>
> diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
> index 3f5aeb786192..d54ddb8468ef 100644
> --- a/arch/arm64/crypto/Kconfig
> +++ b/arch/arm64/crypto/Kconfig
> @@ -101,7 +101,7 @@ config CRYPTO_AES_ARM64_NEON_BLK
> select CRYPTO_SIMD
>
> config CRYPTO_CHACHA20_NEON
> - tristate "NEON accelerated ChaCha20 symmetric cipher"
> + tristate "ChaCha20 and XChaCha20 stream ciphers using NEON instructions"
> depends on KERNEL_MODE_NEON
> select CRYPTO_BLKCIPHER
> select CRYPTO_CHACHA20
> diff --git a/arch/arm64/crypto/chacha20-neon-core.S b/arch/arm64/crypto/chacha20-neon-core.S
> index 13c85e272c2a..0571e45a1a0a 100644
> --- a/arch/arm64/crypto/chacha20-neon-core.S
> +++ b/arch/arm64/crypto/chacha20-neon-core.S
> @@ -23,25 +23,20 @@
> .text
> .align 6
>
> -ENTRY(chacha20_block_xor_neon)
> - // x0: Input state matrix, s
> - // x1: 1 data block output, o
> - // x2: 1 data block input, i
> -
> - //
> - // This function encrypts one ChaCha20 block by loading the state matrix
> - // in four NEON registers. It performs matrix operation on four words in
> - // parallel, but requires shuffling to rearrange the words after each
> - // round.
> - //
> -
> - // x0..3 = s0..3
> - adr x3, ROT8
> - ld1 {v0.4s-v3.4s}, [x0]
> - ld1 {v8.4s-v11.4s}, [x0]
> - ld1 {v12.4s}, [x3]
> +/*
> + * chacha20_permute - permute one block
> + *
> + * Permute one 64-byte block where the state matrix is stored in the four NEON
> + * registers v0-v3. It performs matrix operations on four words in parallel,
> + * but requires shuffling to rearrange the words after each round.
> + *
> + * Clobbers: x3, x10, v4, v12
> + */
> +chacha20_permute:
>
> mov x3, #10
> + adr x10, ROT8
> + ld1 {v12.4s}, [x10]
>
> .Ldoubleround:
> // x0 += x1, x3 = rotl32(x3 ^ x0, 16)
> @@ -105,6 +100,23 @@ ENTRY(chacha20_block_xor_neon)
> subs x3, x3, #1
> b.ne .Ldoubleround
>
> + ret
> +ENDPROC(chacha20_permute)
> +
> +ENTRY(chacha20_block_xor_neon)
> + // x0: Input state matrix, s
> + // x1: 1 data block output, o
> + // x2: 1 data block input, i
> +
> + stp x29, x30, [sp, #-16]!
> + mov x29, sp
> +
> + // x0..3 = s0..3
> + ld1 {v0.4s-v3.4s}, [x0]
> + ld1 {v8.4s-v11.4s}, [x0]
> +
> + bl chacha20_permute
> +
> ld1 {v4.16b-v7.16b}, [x2]
>
> // o0 = i0 ^ (x0 + s0)
> @@ -125,9 +137,28 @@ ENTRY(chacha20_block_xor_neon)
>
> st1 {v0.16b-v3.16b}, [x1]
>
> + ldp x29, x30, [sp], #16
> ret
> ENDPROC(chacha20_block_xor_neon)
>
> +ENTRY(hchacha20_block_neon)
> + // x0: Input state matrix, s
> + // x1: output (8 32-bit words)
> +
> + stp x29, x30, [sp, #-16]!
> + mov x29, sp
> +
> + ld1 {v0.4s-v3.4s}, [x0]
> +
> + bl chacha20_permute
> +
> + st1 {v0.16b}, [x1], #16
> + st1 {v3.16b}, [x1]
> +
> + ldp x29, x30, [sp], #16
> + ret
> +ENDPROC(hchacha20_block_neon)
> +
> .align 6
> ENTRY(chacha20_4block_xor_neon)
> // x0: Input state matrix, s
> diff --git a/arch/arm64/crypto/chacha20-neon-glue.c b/arch/arm64/crypto/chacha20-neon-glue.c
> index 96e0cfb8c3f5..a5b9cbc0c4de 100644
> --- a/arch/arm64/crypto/chacha20-neon-glue.c
> +++ b/arch/arm64/crypto/chacha20-neon-glue.c
> @@ -30,6 +30,7 @@
>
> asmlinkage void chacha20_block_xor_neon(u32 *state, u8 *dst, const u8 *src);
> asmlinkage void chacha20_4block_xor_neon(u32 *state, u8 *dst, const u8 *src);
> +asmlinkage void hchacha20_block_neon(const u32 *state, u32 *out);
>
> static void chacha20_doneon(u32 *state, u8 *dst, const u8 *src,
> unsigned int bytes)
> @@ -65,20 +66,16 @@ static void chacha20_doneon(u32 *state, u8 *dst, const u8 *src,
> kernel_neon_end();
> }
>
> -static int chacha20_neon(struct skcipher_request *req)
> +static int chacha20_neon_stream_xor(struct skcipher_request *req,
> + struct chacha_ctx *ctx, u8 *iv)
> {
> - struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> - struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
> struct skcipher_walk walk;
> u32 state[16];
> int err;
>
> - if (!may_use_simd() || req->cryptlen <= CHACHA_BLOCK_SIZE)
> - return crypto_chacha_crypt(req);
> -
> err = skcipher_walk_virt(&walk, req, false);
>
> - crypto_chacha_init(state, ctx, walk.iv);
> + crypto_chacha_init(state, ctx, iv);
>
> while (walk.nbytes > 0) {
> unsigned int nbytes = walk.nbytes;
> @@ -94,22 +91,73 @@ static int chacha20_neon(struct skcipher_request *req)
> return err;
> }
>
> -static struct skcipher_alg alg = {
> - .base.cra_name = "chacha20",
> - .base.cra_driver_name = "chacha20-neon",
> - .base.cra_priority = 300,
> - .base.cra_blocksize = 1,
> - .base.cra_ctxsize = sizeof(struct chacha_ctx),
> - .base.cra_module = THIS_MODULE,
> -
> - .min_keysize = CHACHA_KEY_SIZE,
> - .max_keysize = CHACHA_KEY_SIZE,
> - .ivsize = CHACHA_IV_SIZE,
> - .chunksize = CHACHA_BLOCK_SIZE,
> - .walksize = 4 * CHACHA_BLOCK_SIZE,
> - .setkey = crypto_chacha20_setkey,
> - .encrypt = chacha20_neon,
> - .decrypt = chacha20_neon,
> +static int chacha20_neon(struct skcipher_request *req)
> +{
> + struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> + struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
> +
> + if (req->cryptlen <= CHACHA_BLOCK_SIZE || !may_use_simd())
> + return crypto_chacha_crypt(req);
> +
> + return chacha20_neon_stream_xor(req, ctx, req->iv);
> +}
> +
> +static int xchacha20_neon(struct skcipher_request *req)
> +{
> + struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> + struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
> + struct chacha_ctx subctx;
> + u32 state[16];
> + u8 real_iv[16];
> +
> + if (req->cryptlen <= CHACHA_BLOCK_SIZE || !may_use_simd())
> + return crypto_xchacha_crypt(req);
> +
> + crypto_chacha_init(state, ctx, req->iv);
> +
> + kernel_neon_begin();
> + hchacha20_block_neon(state, subctx.key);
> + kernel_neon_end();
> +
> + memcpy(&real_iv[0], req->iv + 24, 8);
> + memcpy(&real_iv[8], req->iv + 16, 8);
> + return chacha20_neon_stream_xor(req, &subctx, real_iv);
> +}
> +
> +static struct skcipher_alg algs[] = {
> + {
> + .base.cra_name = "chacha20",
> + .base.cra_driver_name = "chacha20-neon",
> + .base.cra_priority = 300,
> + .base.cra_blocksize = 1,
> + .base.cra_ctxsize = sizeof(struct chacha_ctx),
> + .base.cra_module = THIS_MODULE,
> +
> + .min_keysize = CHACHA_KEY_SIZE,
> + .max_keysize = CHACHA_KEY_SIZE,
> + .ivsize = CHACHA_IV_SIZE,
> + .chunksize = CHACHA_BLOCK_SIZE,
> + .walksize = 4 * CHACHA_BLOCK_SIZE,
> + .setkey = crypto_chacha20_setkey,
> + .encrypt = chacha20_neon,
> + .decrypt = chacha20_neon,
> + }, {
> + .base.cra_name = "xchacha20",
> + .base.cra_driver_name = "xchacha20-neon",
> + .base.cra_priority = 300,
> + .base.cra_blocksize = 1,
> + .base.cra_ctxsize = sizeof(struct chacha_ctx),
> + .base.cra_module = THIS_MODULE,
> +
> + .min_keysize = CHACHA_KEY_SIZE,
> + .max_keysize = CHACHA_KEY_SIZE,
> + .ivsize = XCHACHA_IV_SIZE,
> + .chunksize = CHACHA_BLOCK_SIZE,
> + .walksize = 4 * CHACHA_BLOCK_SIZE,
> + .setkey = crypto_chacha20_setkey,
> + .encrypt = xchacha20_neon,
> + .decrypt = xchacha20_neon,
> + }
> };
>
> static int __init chacha20_simd_mod_init(void)
> @@ -117,12 +165,12 @@ static int __init chacha20_simd_mod_init(void)
> if (!(elf_hwcap & HWCAP_ASIMD))
> return -ENODEV;
>
> - return crypto_register_skcipher(&alg);
> + return crypto_register_skciphers(algs, ARRAY_SIZE(algs));
> }
>
> static void __exit chacha20_simd_mod_fini(void)
> {
> - crypto_unregister_skcipher(&alg);
> + crypto_unregister_skciphers(algs, ARRAY_SIZE(algs));
> }
>
> module_init(chacha20_simd_mod_init);
> @@ -131,3 +179,6 @@ module_exit(chacha20_simd_mod_fini);
> MODULE_AUTHOR("Ard Biesheuvel <ard.biesheuvel@linaro.org>");
> MODULE_LICENSE("GPL v2");
> MODULE_ALIAS_CRYPTO("chacha20");
> +MODULE_ALIAS_CRYPTO("chacha20-neon");
> +MODULE_ALIAS_CRYPTO("xchacha20");
> +MODULE_ALIAS_CRYPTO("xchacha20-neon");
> --
> 2.19.2
>
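
Editorial note on the XChaCha20 setup in the quoted patch: xchacha20_neon()
first derives a subkey with single-block HChaCha20, then builds a 16-byte
"real" IV from the tail of the 32-byte XChaCha IV and runs the ordinary
ChaCha stream with it. The portable C sketch below mirrors those two steps.
It is only an illustration, not the kernel's code: rol32_sketch(), QR(),
hchacha_sketch() and xchacha_real_iv_sketch() are hypothetical names, the
input state is assumed to be already initialized the way
crypto_chacha_init() does it, and nrounds is assumed to be even.

```c
#include <stdint.h>
#include <string.h>

static uint32_t rol32_sketch(uint32_t v, int n)
{
	return (v << n) | (v >> (32 - n));
}

/* ChaCha quarter-round on words a, b, c, d of the local state x[]. */
#define QR(a, b, c, d) do {						\
	x[a] += x[b]; x[d] = rol32_sketch(x[d] ^ x[a], 16);		\
	x[c] += x[d]; x[b] = rol32_sketch(x[b] ^ x[c], 12);		\
	x[a] += x[b]; x[d] = rol32_sketch(x[d] ^ x[a], 8);		\
	x[c] += x[d]; x[b] = rol32_sketch(x[b] ^ x[c], 7);		\
} while (0)

/*
 * HChaCha: permute a copy of the state and output words 0-3 and 12-15.
 * Unlike the block function, the original state is not added back.
 * This matches the stores of v0 and v3 in hchacha20_block_neon.
 */
static void hchacha_sketch(const uint32_t state[16], uint32_t out[8],
			   int nrounds)
{
	uint32_t x[16];
	int i;

	memcpy(x, state, sizeof(x));
	for (i = 0; i < nrounds; i += 2) {
		QR(0, 4, 8, 12); QR(1, 5, 9, 13);	/* column round */
		QR(2, 6, 10, 14); QR(3, 7, 11, 15);
		QR(0, 5, 10, 15); QR(1, 6, 11, 12);	/* diagonal round */
		QR(2, 7, 8, 13); QR(3, 4, 9, 14);
	}
	memcpy(&out[0], &x[0], 4 * sizeof(uint32_t));
	memcpy(&out[4], &x[12], 4 * sizeof(uint32_t));
}

/*
 * Build the 16-byte IV for the inner ChaCha stream from a 32-byte XChaCha
 * IV, using the same offsets as the two memcpy() calls in the patch:
 * bytes 24..31 go first, then bytes 16..23. Bytes 0..15 are consumed
 * by the HChaCha step instead.
 */
static void xchacha_real_iv_sketch(const uint8_t iv[32], uint8_t real_iv[16])
{
	memcpy(&real_iv[0], iv + 24, 8);
	memcpy(&real_iv[8], iv + 16, 8);
}
```

The eight output words of hchacha_sketch() play the role of subctx.key in
the patch, and the output of xchacha_real_iv_sketch() plays the role of
real_iv passed to chacha20_neon_stream_xor().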
* Re: [PATCH v2 0/4] crypto: ARM64 NEON optimized XChaCha and NHPoly1305 (for Adiantum)
2018-12-04 3:52 [PATCH v2 0/4] crypto: ARM64 NEON optimized XChaCha and NHPoly1305 (for Adiantum) Eric Biggers
` (3 preceding siblings ...)
2018-12-04 3:52 ` [PATCH v2 4/4] crypto: arm64/chacha - add XChaCha12 support Eric Biggers
@ 2018-12-13 10:31 ` Herbert Xu
4 siblings, 0 replies; 7+ messages in thread
From: Herbert Xu @ 2018-12-13 10:31 UTC (permalink / raw)
To: Eric Biggers
Cc: linux-crypto, paulcrowley, ard.biesheuvel, Jason,
linux-arm-kernel, linux-kernel
Eric Biggers <ebiggers@kernel.org> wrote:
> Hello,
>
> This series optimizes the Adiantum encryption mode for ARM64 by adding
> an ARM64 NEON accelerated implementation of NHPoly1305, specifically the
> NH part; and by modifying the existing ARM64 NEON implementation of
> ChaCha20 to support XChaCha20 and XChaCha12.
>
> This greatly improves Adiantum performance on ARM64. For example,
> encrypting 4096-byte messages (single-threaded) on a Raspberry Pi 3
> Model B v1.2, which has a Cortex-A53 processor:
>
> Before After
> --------- ---------
> adiantum(xchacha12,aes) 44.1 MB/s 82.7 MB/s
> adiantum(xchacha20,aes) 35.5 MB/s 65.7 MB/s
>
> Decryption is almost exactly the same speed as encryption.
>
> The biggest benefit comes from accelerating XChaCha. Accelerating NH
> gives a somewhat smaller, but still significant benefit.
>
> Performance on 512-byte inputs is also improved, though that is much
> slower in the first place. When Adiantum is used with dm-crypt (or
> cryptsetup), we recommend using a 4096-byte sector size.
>
> For comparison, on the same hardware AES-256-XTS encryption is only
> 24.5 MB/s and decryption 21.6 MB/s, both using the NEON-bitsliced
> implementation ("xts-aes-neonbs"). That is the fastest AES-256-XTS
> implementation on this processor, since it doesn't have the ARMv8
> Cryptography Extensions. This is despite Adiantum also being a super-
> pseudorandom permutation (SPRP) over the entire sector, unlike XTS.
>
> Note that XChaCha20 and XChaCha12 can be used for other purposes too.
>
> Changed since v1:
> - Create full stack frame in hchacha_block_neon() and
> chacha_block_xor_neon().
> - Use x30 instead of lr.
> - Fix whitespace in nh-neon-core.S.
>
> Eric Biggers (4):
> crypto: arm64/nhpoly1305 - add NEON-accelerated NHPoly1305
> crypto: arm64/chacha20 - add XChaCha20 support
> crypto: arm64/chacha20 - refactor to allow varying number of rounds
> crypto: arm64/chacha - add XChaCha12 support
>
> arch/arm64/crypto/Kconfig | 7 +-
> arch/arm64/crypto/Makefile | 7 +-
> ...hacha20-neon-core.S => chacha-neon-core.S} | 92 +++++---
> arch/arm64/crypto/chacha-neon-glue.c | 207 ++++++++++++++++++
> arch/arm64/crypto/chacha20-neon-glue.c | 133 -----------
> arch/arm64/crypto/nh-neon-core.S | 103 +++++++++
> arch/arm64/crypto/nhpoly1305-neon-glue.c | 77 +++++++
> 7 files changed, 461 insertions(+), 165 deletions(-)
> rename arch/arm64/crypto/{chacha20-neon-core.S => chacha-neon-core.S} (90%)
> create mode 100644 arch/arm64/crypto/chacha-neon-glue.c
> delete mode 100644 arch/arm64/crypto/chacha20-neon-glue.c
> create mode 100644 arch/arm64/crypto/nh-neon-core.S
> create mode 100644 arch/arm64/crypto/nhpoly1305-neon-glue.c
All applied. Thanks.
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt