* [PATCH v3 0/5] crypto: Speck support
From: Eric Biggers @ 2018-02-14 18:42 UTC (permalink / raw)
  To: linux-crypto, Herbert Xu
  Cc: linux-fscrypt, linux-arm-kernel, Ard Biesheuvel, Jeffrey Walton,
	Paul Crowley, Patrik Torstensson, Greg Kaiser, Paul Lawrence,
	Michael Halcrow, Alex Cope, Greg Kroah-Hartman, Eric Biggers

Hello,

This series adds Speck support to the crypto API, including the Speck128
and Speck64 variants.  Speck is a lightweight block cipher that can be
much faster than AES on processors that don't have AES instructions.

We are planning to offer Speck-XTS (probably Speck128/256-XTS) as an
option for dm-crypt and fscrypt on Android, for low-end mobile devices
with older CPUs such as ARMv7 which don't have the Cryptography
Extensions.  Currently, such devices are unencrypted because AES is not
fast enough, even when the NEON bit-sliced implementation of AES is
used.  Other AES alternatives such as Twofish, Threefish, Camellia,
CAST6, and Serpent aren't fast enough either; it seems that only a
modern ARX cipher can provide sufficient performance on these devices.

This is a replacement for our original proposal
(https://patchwork.kernel.org/patch/10101451/) which was to offer
ChaCha20 for these devices.  However, the use of a stream cipher for
disk/file encryption with no space to store nonces would have been much
more insecure than we thought initially, given that it would be used on
top of flash storage as well as potentially on top of F2FS, neither of
which is guaranteed to overwrite data in-place.

Speck has been somewhat controversial due to its origin.  Nevertheless,
it has a straightforward design (it's an ARX cipher), and it appears to
be the leading software-optimized lightweight block cipher currently,
with the most cryptanalysis.  It's also easy to implement without side
channels, unlike AES.  Moreover, we only intend Speck to be used when
the status quo is no encryption, due to AES not being fast enough.

We've also considered a novel length-preserving encryption mode based on
ChaCha20 and Poly1305.  While theoretically attractive, such a mode
would be a brand new crypto construction and would be more complicated
and difficult to implement efficiently in comparison to Speck-XTS.

Thus, patch 1 adds a generic implementation of Speck, and the following
patches add a 32-bit ARM NEON implementation of Speck-XTS.  The
NEON-accelerated implementation is much faster than the generic
implementation and therefore is the implementation that would primarily
be used in practice on the devices we are targeting.

No AArch64 implementation is included, since most AArch64 CPUs have the
Cryptography Extensions and can therefore use AES.  An AArch64
implementation can be added later if there is interest.

Changed since v2:

  - Fix __speck64_xts_crypt() to work on big endian CPUs.

Changed since v1:

  - Use the word order recommended by the Speck authors.  All test
    vectors were updated.

Eric Biggers (5):
  crypto: add support for the Speck block cipher
  crypto: speck - export common helpers
  crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS
  crypto: speck - add test vectors for Speck128-XTS
  crypto: speck - add test vectors for Speck64-XTS

 arch/arm/crypto/Kconfig           |    6 +
 arch/arm/crypto/Makefile          |    2 +
 arch/arm/crypto/speck-neon-core.S |  432 +++++++++
 arch/arm/crypto/speck-neon-glue.c |  288 ++++++
 crypto/Kconfig                    |   14 +
 crypto/Makefile                   |    1 +
 crypto/speck.c                    |  307 ++++++
 crypto/testmgr.c                  |   36 +
 crypto/testmgr.h                  | 1486 +++++++++++++++++++++++++++++
 include/crypto/speck.h            |   62 ++
 10 files changed, 2634 insertions(+)
 create mode 100644 arch/arm/crypto/speck-neon-core.S
 create mode 100644 arch/arm/crypto/speck-neon-glue.c
 create mode 100644 crypto/speck.c
 create mode 100644 include/crypto/speck.h

-- 
2.16.1.291.g4437f3f132-goog

* [PATCH v3 1/5] crypto: add support for the Speck block cipher
From: Eric Biggers @ 2018-02-14 18:42 UTC (permalink / raw)
  To: linux-crypto, Herbert Xu
  Cc: linux-fscrypt, linux-arm-kernel, Ard Biesheuvel, Jeffrey Walton,
	Paul Crowley, Patrik Torstensson, Greg Kaiser, Paul Lawrence,
	Michael Halcrow, Alex Cope, Greg Kroah-Hartman, Eric Biggers

Add a generic implementation of Speck, including the Speck128 and
Speck64 variants.  Speck is a lightweight block cipher that can be much
faster than AES on processors that don't have AES instructions.

We are planning to offer Speck-XTS (probably Speck128/256-XTS) as an
option for dm-crypt and fscrypt on Android, for low-end mobile devices
with older CPUs such as ARMv7 which don't have the Cryptography
Extensions.  Currently, such devices are unencrypted because AES is not
fast enough, even when the NEON bit-sliced implementation of AES is
used.  Other AES alternatives such as Twofish, Threefish, Camellia,
CAST6, and Serpent aren't fast enough either; it seems that only a
modern ARX cipher can provide sufficient performance on these devices.

This is a replacement for our original proposal
(https://patchwork.kernel.org/patch/10101451/) which was to offer
ChaCha20 for these devices.  However, the use of a stream cipher for
disk/file encryption with no space to store nonces would have been much
more insecure than we thought initially, given that it would be used on
top of flash storage as well as potentially on top of F2FS, neither of
which is guaranteed to overwrite data in-place.

Speck has been somewhat controversial due to its origin.  Nevertheless,
it has a straightforward design (it's an ARX cipher), and it appears to
be the leading software-optimized lightweight block cipher currently,
with the most cryptanalysis.  It's also easy to implement without side
channels, unlike AES.  Moreover, we only intend Speck to be used when
the status quo is no encryption, due to AES not being fast enough.

We've also considered a novel length-preserving encryption mode based on
ChaCha20 and Poly1305.  While theoretically attractive, such a mode
would be a brand new crypto construction and would be more complicated
and difficult to implement efficiently in comparison to Speck-XTS.

There is confusion about the byte and word orders of Speck, since the
original paper doesn't specify them.  But we have implemented it using
the orders the authors recommended in a correspondence with them.  The
test vectors are taken from the original paper but were mapped to byte
arrays using the recommended byte and word orders.
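
For illustration (this mapping is implied by the test vectors added below,
not stated explicitly in the paper), the Speck128/128 vector from the paper
maps to memory as follows:

    paper key:                 0f0e0d0c0b0a0908 0706050403020100
    key bytes in memory:       00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f

    paper plaintext (x, y):    6c61766975716520 7469206564616d20
    plaintext bytes in memory: 20 6d 61 64 65 20 69 74 20 65 71 75 69 76 61 6c

That is, the second printed word (y) is stored first, each 64-bit word is
stored little endian, and the right-most printed key word becomes the first
round key.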

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 crypto/Kconfig   |  14 +++
 crypto/Makefile  |   1 +
 crypto/speck.c   | 299 +++++++++++++++++++++++++++++++++++++++++++++++
 crypto/testmgr.c |  18 +++
 crypto/testmgr.h | 128 ++++++++++++++++++++
 5 files changed, 460 insertions(+)
 create mode 100644 crypto/speck.c

diff --git a/crypto/Kconfig b/crypto/Kconfig
index b75264b09a46..558eff07b799 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1508,6 +1508,20 @@ config CRYPTO_SERPENT_AVX2_X86_64
 	  See also:
 	  <http://www.cl.cam.ac.uk/~rja14/serpent.html>
 
+config CRYPTO_SPECK
+	tristate "Speck cipher algorithm"
+	select CRYPTO_ALGAPI
+	help
+	  Speck is a lightweight block cipher that is tuned for optimal
+	  performance in software (rather than hardware).
+
+	  Speck may not be as secure as AES, and should only be used on systems
+	  where AES is not fast enough.
+
+	  See also: <https://eprint.iacr.org/2013/404.pdf>
+
+	  If unsure, say N.
+
 config CRYPTO_TEA
 	tristate "TEA, XTEA and XETA cipher algorithms"
 	select CRYPTO_ALGAPI
diff --git a/crypto/Makefile b/crypto/Makefile
index cdbc03b35510..ba6019471447 100644
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -110,6 +110,7 @@ obj-$(CONFIG_CRYPTO_TEA) += tea.o
 obj-$(CONFIG_CRYPTO_KHAZAD) += khazad.o
 obj-$(CONFIG_CRYPTO_ANUBIS) += anubis.o
 obj-$(CONFIG_CRYPTO_SEED) += seed.o
+obj-$(CONFIG_CRYPTO_SPECK) += speck.o
 obj-$(CONFIG_CRYPTO_SALSA20) += salsa20_generic.o
 obj-$(CONFIG_CRYPTO_CHACHA20) += chacha20_generic.o
 obj-$(CONFIG_CRYPTO_POLY1305) += poly1305_generic.o
diff --git a/crypto/speck.c b/crypto/speck.c
new file mode 100644
index 000000000000..4e80ad76bcd7
--- /dev/null
+++ b/crypto/speck.c
@@ -0,0 +1,299 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Speck: a lightweight block cipher
+ *
+ * Copyright (c) 2018 Google, Inc
+ *
+ * Speck has 10 variants, including 5 block sizes.  For now we only implement
+ * the variants Speck128/128, Speck128/192, Speck128/256, Speck64/96, and
+ * Speck64/128.   Speck${B}/${K} denotes the variant with a block size of B bits
+ * and a key size of K bits.  The Speck128 variants are believed to be the most
+ * secure variants, and they use the same block size and key sizes as AES.  The
+ * Speck64 variants are less secure, but on 32-bit processors are usually
+ * faster.  The remaining variants (Speck32, Speck48, and Speck96) are even less
+ * secure and/or not as well suited for implementation on either 32-bit or
+ * 64-bit processors, so are omitted.
+ *
+ * Reference: "The Simon and Speck Families of Lightweight Block Ciphers"
+ * https://eprint.iacr.org/2013/404.pdf
+ *
+ * In a correspondence, the Speck designers have also clarified that the words
+ * should be interpreted in little-endian format, and the words should be
+ * ordered such that the first word of each block is 'y' rather than 'x', and
+ * the first key word (rather than the last) becomes the first round key.
+ */
+
+#include <asm/unaligned.h>
+#include <linux/bitops.h>
+#include <linux/crypto.h>
+#include <linux/init.h>
+#include <linux/module.h>
+
+/* Speck128 */
+
+#define SPECK128_BLOCK_SIZE	16
+
+#define SPECK128_128_KEY_SIZE	16
+#define SPECK128_128_NROUNDS	32
+
+#define SPECK128_192_KEY_SIZE	24
+#define SPECK128_192_NROUNDS	33
+
+#define SPECK128_256_KEY_SIZE	32
+#define SPECK128_256_NROUNDS	34
+
+struct speck128_tfm_ctx {
+	u64 round_keys[SPECK128_256_NROUNDS];
+	int nrounds;
+};
+
+static __always_inline void speck128_round(u64 *x, u64 *y, u64 k)
+{
+	*x = ror64(*x, 8);
+	*x += *y;
+	*x ^= k;
+	*y = rol64(*y, 3);
+	*y ^= *x;
+}
+
+static __always_inline void speck128_unround(u64 *x, u64 *y, u64 k)
+{
+	*y ^= *x;
+	*y = ror64(*y, 3);
+	*x ^= k;
+	*x -= *y;
+	*x = rol64(*x, 8);
+}
+
+static void speck128_encrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
+{
+	const struct speck128_tfm_ctx *ctx = crypto_tfm_ctx(tfm);
+	u64 y = get_unaligned_le64(in);
+	u64 x = get_unaligned_le64(in + 8);
+	int i;
+
+	for (i = 0; i < ctx->nrounds; i++)
+		speck128_round(&x, &y, ctx->round_keys[i]);
+
+	put_unaligned_le64(y, out);
+	put_unaligned_le64(x, out + 8);
+}
+
+static void speck128_decrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
+{
+	const struct speck128_tfm_ctx *ctx = crypto_tfm_ctx(tfm);
+	u64 y = get_unaligned_le64(in);
+	u64 x = get_unaligned_le64(in + 8);
+	int i;
+
+	for (i = ctx->nrounds - 1; i >= 0; i--)
+		speck128_unround(&x, &y, ctx->round_keys[i]);
+
+	put_unaligned_le64(y, out);
+	put_unaligned_le64(x, out + 8);
+}
+
+static int speck128_setkey(struct crypto_tfm *tfm, const u8 *key,
+			   unsigned int keylen)
+{
+	struct speck128_tfm_ctx *ctx = crypto_tfm_ctx(tfm);
+	u64 l[3];
+	u64 k;
+	int i;
+
+	switch (keylen) {
+	case SPECK128_128_KEY_SIZE:
+		k = get_unaligned_le64(key);
+		l[0] = get_unaligned_le64(key + 8);
+		ctx->nrounds = SPECK128_128_NROUNDS;
+		for (i = 0; i < ctx->nrounds; i++) {
+			ctx->round_keys[i] = k;
+			speck128_round(&l[0], &k, i);
+		}
+		break;
+	case SPECK128_192_KEY_SIZE:
+		k = get_unaligned_le64(key);
+		l[0] = get_unaligned_le64(key + 8);
+		l[1] = get_unaligned_le64(key + 16);
+		ctx->nrounds = SPECK128_192_NROUNDS;
+		for (i = 0; i < ctx->nrounds; i++) {
+			ctx->round_keys[i] = k;
+			speck128_round(&l[i % 2], &k, i);
+		}
+		break;
+	case SPECK128_256_KEY_SIZE:
+		k = get_unaligned_le64(key);
+		l[0] = get_unaligned_le64(key + 8);
+		l[1] = get_unaligned_le64(key + 16);
+		l[2] = get_unaligned_le64(key + 24);
+		ctx->nrounds = SPECK128_256_NROUNDS;
+		for (i = 0; i < ctx->nrounds; i++) {
+			ctx->round_keys[i] = k;
+			speck128_round(&l[i % 3], &k, i);
+		}
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+/* Speck64 */
+
+#define SPECK64_BLOCK_SIZE	8
+
+#define SPECK64_96_KEY_SIZE	12
+#define SPECK64_96_NROUNDS	26
+
+#define SPECK64_128_KEY_SIZE	16
+#define SPECK64_128_NROUNDS	27
+
+struct speck64_tfm_ctx {
+	u32 round_keys[SPECK64_128_NROUNDS];
+	int nrounds;
+};
+
+static __always_inline void speck64_round(u32 *x, u32 *y, u32 k)
+{
+	*x = ror32(*x, 8);
+	*x += *y;
+	*x ^= k;
+	*y = rol32(*y, 3);
+	*y ^= *x;
+}
+
+static __always_inline void speck64_unround(u32 *x, u32 *y, u32 k)
+{
+	*y ^= *x;
+	*y = ror32(*y, 3);
+	*x ^= k;
+	*x -= *y;
+	*x = rol32(*x, 8);
+}
+
+static void speck64_encrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
+{
+	const struct speck64_tfm_ctx *ctx = crypto_tfm_ctx(tfm);
+	u32 y = get_unaligned_le32(in);
+	u32 x = get_unaligned_le32(in + 4);
+	int i;
+
+	for (i = 0; i < ctx->nrounds; i++)
+		speck64_round(&x, &y, ctx->round_keys[i]);
+
+	put_unaligned_le32(y, out);
+	put_unaligned_le32(x, out + 4);
+}
+
+static void speck64_decrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
+{
+	const struct speck64_tfm_ctx *ctx = crypto_tfm_ctx(tfm);
+	u32 y = get_unaligned_le32(in);
+	u32 x = get_unaligned_le32(in + 4);
+	int i;
+
+	for (i = ctx->nrounds - 1; i >= 0; i--)
+		speck64_unround(&x, &y, ctx->round_keys[i]);
+
+	put_unaligned_le32(y, out);
+	put_unaligned_le32(x, out + 4);
+}
+
+static int speck64_setkey(struct crypto_tfm *tfm, const u8 *key,
+			  unsigned int keylen)
+{
+	struct speck64_tfm_ctx *ctx = crypto_tfm_ctx(tfm);
+	u32 l[3];
+	u32 k;
+	int i;
+
+	switch (keylen) {
+	case SPECK64_96_KEY_SIZE:
+		k = get_unaligned_le32(key);
+		l[0] = get_unaligned_le32(key + 4);
+		l[1] = get_unaligned_le32(key + 8);
+		ctx->nrounds = SPECK64_96_NROUNDS;
+		for (i = 0; i < ctx->nrounds; i++) {
+			ctx->round_keys[i] = k;
+			speck64_round(&l[i % 2], &k, i);
+		}
+		break;
+	case SPECK64_128_KEY_SIZE:
+		k = get_unaligned_le32(key);
+		l[0] = get_unaligned_le32(key + 4);
+		l[1] = get_unaligned_le32(key + 8);
+		l[2] = get_unaligned_le32(key + 12);
+		ctx->nrounds = SPECK64_128_NROUNDS;
+		for (i = 0; i < ctx->nrounds; i++) {
+			ctx->round_keys[i] = k;
+			speck64_round(&l[i % 3], &k, i);
+		}
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+/* Algorithm definitions */
+
+static struct crypto_alg speck_algs[] = {
+	{
+		.cra_name		= "speck128",
+		.cra_driver_name	= "speck128-generic",
+		.cra_priority		= 100,
+		.cra_flags		= CRYPTO_ALG_TYPE_CIPHER,
+		.cra_blocksize		= SPECK128_BLOCK_SIZE,
+		.cra_ctxsize		= sizeof(struct speck128_tfm_ctx),
+		.cra_module		= THIS_MODULE,
+		.cra_u			= {
+			.cipher = {
+				.cia_min_keysize	= SPECK128_128_KEY_SIZE,
+				.cia_max_keysize	= SPECK128_256_KEY_SIZE,
+				.cia_setkey		= speck128_setkey,
+				.cia_encrypt		= speck128_encrypt,
+				.cia_decrypt		= speck128_decrypt
+			}
+		}
+	}, {
+		.cra_name		= "speck64",
+		.cra_driver_name	= "speck64-generic",
+		.cra_priority		= 100,
+		.cra_flags		= CRYPTO_ALG_TYPE_CIPHER,
+		.cra_blocksize		= SPECK64_BLOCK_SIZE,
+		.cra_ctxsize		= sizeof(struct speck64_tfm_ctx),
+		.cra_module		= THIS_MODULE,
+		.cra_u			= {
+			.cipher = {
+				.cia_min_keysize	= SPECK64_96_KEY_SIZE,
+				.cia_max_keysize	= SPECK64_128_KEY_SIZE,
+				.cia_setkey		= speck64_setkey,
+				.cia_encrypt		= speck64_encrypt,
+				.cia_decrypt		= speck64_decrypt
+			}
+		}
+	}
+};
+
+static int __init speck_module_init(void)
+{
+	return crypto_register_algs(speck_algs, ARRAY_SIZE(speck_algs));
+}
+
+static void __exit speck_module_exit(void)
+{
+	crypto_unregister_algs(speck_algs, ARRAY_SIZE(speck_algs));
+}
+
+module_init(speck_module_init);
+module_exit(speck_module_exit);
+
+MODULE_DESCRIPTION("Speck block cipher (generic)");
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Eric Biggers <ebiggers@google.com>");
+MODULE_ALIAS_CRYPTO("speck128");
+MODULE_ALIAS_CRYPTO("speck128-generic");
+MODULE_ALIAS_CRYPTO("speck64");
+MODULE_ALIAS_CRYPTO("speck64-generic");
diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index d5e23a142a04..058ed5eb6620 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -3000,6 +3000,24 @@ static const struct alg_test_desc alg_test_descs[] = {
 				.dec = __VECS(serpent_dec_tv_template)
 			}
 		}
+	}, {
+		.alg = "ecb(speck128)",
+		.test = alg_test_skcipher,
+		.suite = {
+			.cipher = {
+				.enc = __VECS(speck128_enc_tv_template),
+				.dec = __VECS(speck128_dec_tv_template)
+			}
+		}
+	}, {
+		.alg = "ecb(speck64)",
+		.test = alg_test_skcipher,
+		.suite = {
+			.cipher = {
+				.enc = __VECS(speck64_enc_tv_template),
+				.dec = __VECS(speck64_dec_tv_template)
+			}
+		}
 	}, {
 		.alg = "ecb(tea)",
 		.test = alg_test_skcipher,
diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index 6044f6906bd6..3818210f77cf 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -14323,6 +14323,134 @@ static const struct cipher_testvec serpent_xts_dec_tv_template[] = {
 	},
 };
 
+/*
+ * Speck test vectors taken from the original paper:
+ * "The Simon and Speck Families of Lightweight Block Ciphers"
+ * https://eprint.iacr.org/2013/404.pdf
+ *
+ * Note that the paper does not make byte and word order clear.  But it was
+ * confirmed with the authors that the intended orders are little endian byte
+ * order and (y, x) word order.  Equivalently, the printed test vectors, when
+ * looking at only the bytes (ignoring the whitespace that divides them into
+ * words), are backwards: the left-most byte is actually the one with the
+ * highest memory address, while the right-most byte is actually the one with
+ * the lowest memory address.
+ */
+
+static const struct cipher_testvec speck128_enc_tv_template[] = {
+	{ /* Speck128/128 */
+		.key	= "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+		.klen	= 16,
+		.input	= "\x20\x6d\x61\x64\x65\x20\x69\x74"
+			  "\x20\x65\x71\x75\x69\x76\x61\x6c",
+		.ilen	= 16,
+		.result	= "\x18\x0d\x57\x5c\xdf\xfe\x60\x78"
+			  "\x65\x32\x78\x79\x51\x98\x5d\xa6",
+		.rlen	= 16,
+	}, { /* Speck128/192 */
+		.key	= "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17",
+		.klen	= 24,
+		.input	= "\x65\x6e\x74\x20\x74\x6f\x20\x43"
+			  "\x68\x69\x65\x66\x20\x48\x61\x72",
+		.ilen	= 16,
+		.result	= "\x86\x18\x3c\xe0\x5d\x18\xbc\xf9"
+			  "\x66\x55\x13\x13\x3a\xcf\xe4\x1b",
+		.rlen	= 16,
+	}, { /* Speck128/256 */
+		.key	= "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17"
+			  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f",
+		.klen	= 32,
+		.input	= "\x70\x6f\x6f\x6e\x65\x72\x2e\x20"
+			  "\x49\x6e\x20\x74\x68\x6f\x73\x65",
+		.ilen	= 16,
+		.result	= "\x43\x8f\x18\x9c\x8d\xb4\xee\x4e"
+			  "\x3e\xf5\xc0\x05\x04\x01\x09\x41",
+		.rlen	= 16,
+	},
+};
+
+static const struct cipher_testvec speck128_dec_tv_template[] = {
+	{ /* Speck128/128 */
+		.key	= "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+		.klen	= 16,
+		.input	= "\x18\x0d\x57\x5c\xdf\xfe\x60\x78"
+			  "\x65\x32\x78\x79\x51\x98\x5d\xa6",
+		.ilen	= 16,
+		.result	= "\x20\x6d\x61\x64\x65\x20\x69\x74"
+			  "\x20\x65\x71\x75\x69\x76\x61\x6c",
+		.rlen	= 16,
+	}, { /* Speck128/192 */
+		.key	= "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17",
+		.klen	= 24,
+		.input	= "\x86\x18\x3c\xe0\x5d\x18\xbc\xf9"
+			  "\x66\x55\x13\x13\x3a\xcf\xe4\x1b",
+		.ilen	= 16,
+		.result	= "\x65\x6e\x74\x20\x74\x6f\x20\x43"
+			  "\x68\x69\x65\x66\x20\x48\x61\x72",
+		.rlen	= 16,
+	}, { /* Speck128/256 */
+		.key	= "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17"
+			  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f",
+		.klen	= 32,
+		.input	= "\x43\x8f\x18\x9c\x8d\xb4\xee\x4e"
+			  "\x3e\xf5\xc0\x05\x04\x01\x09\x41",
+		.ilen	= 16,
+		.result	= "\x70\x6f\x6f\x6e\x65\x72\x2e\x20"
+			  "\x49\x6e\x20\x74\x68\x6f\x73\x65",
+		.rlen	= 16,
+	},
+};
+
+static const struct cipher_testvec speck64_enc_tv_template[] = {
+	{ /* Speck64/96 */
+		.key	= "\x00\x01\x02\x03\x08\x09\x0a\x0b"
+			  "\x10\x11\x12\x13",
+		.klen	= 12,
+		.input	= "\x65\x61\x6e\x73\x20\x46\x61\x74",
+		.ilen	= 8,
+		.result	= "\x6c\x94\x75\x41\xec\x52\x79\x9f",
+		.rlen	= 8,
+	}, { /* Speck64/128 */
+		.key	= "\x00\x01\x02\x03\x08\x09\x0a\x0b"
+			  "\x10\x11\x12\x13\x18\x19\x1a\x1b",
+		.klen	= 16,
+		.input	= "\x2d\x43\x75\x74\x74\x65\x72\x3b",
+		.ilen	= 8,
+		.result	= "\x8b\x02\x4e\x45\x48\xa5\x6f\x8c",
+		.rlen	= 8,
+	},
+};
+
+static const struct cipher_testvec speck64_dec_tv_template[] = {
+	{ /* Speck64/96 */
+		.key	= "\x00\x01\x02\x03\x08\x09\x0a\x0b"
+			  "\x10\x11\x12\x13",
+		.klen	= 12,
+		.input	= "\x6c\x94\x75\x41\xec\x52\x79\x9f",
+		.ilen	= 8,
+		.result	= "\x65\x61\x6e\x73\x20\x46\x61\x74",
+		.rlen	= 8,
+	}, { /* Speck64/128 */
+		.key	= "\x00\x01\x02\x03\x08\x09\x0a\x0b"
+			  "\x10\x11\x12\x13\x18\x19\x1a\x1b",
+		.klen	= 16,
+		.input	= "\x8b\x02\x4e\x45\x48\xa5\x6f\x8c",
+		.ilen	= 8,
+		.result	= "\x2d\x43\x75\x74\x74\x65\x72\x3b",
+		.rlen	= 8,
+	},
+};
+
 /* Cast6 test vectors from RFC 2612 */
 static const struct cipher_testvec cast6_enc_tv_template[] = {
 	{
-- 
2.16.1.291.g4437f3f132-goog

* [PATCH v3 2/5] crypto: speck - export common helpers
From: Eric Biggers @ 2018-02-14 18:42 UTC (permalink / raw)
  To: linux-crypto, Herbert Xu
  Cc: linux-fscrypt, linux-arm-kernel, Ard Biesheuvel, Jeffrey Walton,
	Paul Crowley, Patrik Torstensson, Greg Kaiser, Paul Lawrence,
	Michael Halcrow, Alex Cope, Greg Kroah-Hartman, Eric Biggers

Export the Speck constants and transform context and the ->setkey(),
->encrypt(), and ->decrypt() functions so that they can be reused by the
ARM NEON implementation of Speck-XTS.  The generic key expansion code
will be reused because it is not performance-critical and is not
vectorizable, while the generic encryption and decryption functions are
needed as fallbacks and for the XTS tweak encryption.
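
As a rough sketch of the intended reuse (the actual glue code comes in the
next patch; the XTS context structure and function names below are only
illustrative placeholders, assuming <crypto/speck.h> and
<crypto/internal/skcipher.h> are included), the NEON glue's ->setkey() can
simply split the XTS key in half and run both halves through the exported
generic key schedule:

  struct speck128_xts_ctx {
          struct speck128_tfm_ctx main_key;
          struct speck128_tfm_ctx tweak_key;
  };

  static int speck128_xts_setkey(struct crypto_skcipher *tfm, const u8 *key,
                                 unsigned int keylen)
  {
          struct speck128_xts_ctx *ctx = crypto_skcipher_ctx(tfm);
          int err;

          /* XTS uses two independent keys of equal length */
          if (keylen % 2)
                  return -EINVAL;
          keylen /= 2;

          err = crypto_speck128_setkey(&ctx->main_key, key, keylen);
          if (err)
                  return err;
          return crypto_speck128_setkey(&ctx->tweak_key, key + keylen, keylen);
  }

The per-sector XTS tweak would likewise be encrypted with
crypto_speck128_encrypt() using the tweak key, and the generic single-block
functions serve as the fallback when NEON is unavailable.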

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 crypto/speck.c         | 90 +++++++++++++++++++++++-------------------
 include/crypto/speck.h | 62 +++++++++++++++++++++++++++++
 2 files changed, 111 insertions(+), 41 deletions(-)
 create mode 100644 include/crypto/speck.h

diff --git a/crypto/speck.c b/crypto/speck.c
index 4e80ad76bcd7..58aa9f7f91f7 100644
--- a/crypto/speck.c
+++ b/crypto/speck.c
@@ -24,6 +24,7 @@
  */
 
 #include <asm/unaligned.h>
+#include <crypto/speck.h>
 #include <linux/bitops.h>
 #include <linux/crypto.h>
 #include <linux/init.h>
@@ -31,22 +32,6 @@
 
 /* Speck128 */
 
-#define SPECK128_BLOCK_SIZE	16
-
-#define SPECK128_128_KEY_SIZE	16
-#define SPECK128_128_NROUNDS	32
-
-#define SPECK128_192_KEY_SIZE	24
-#define SPECK128_192_NROUNDS	33
-
-#define SPECK128_256_KEY_SIZE	32
-#define SPECK128_256_NROUNDS	34
-
-struct speck128_tfm_ctx {
-	u64 round_keys[SPECK128_256_NROUNDS];
-	int nrounds;
-};
-
 static __always_inline void speck128_round(u64 *x, u64 *y, u64 k)
 {
 	*x = ror64(*x, 8);
@@ -65,9 +50,9 @@ static __always_inline void speck128_unround(u64 *x, u64 *y, u64 k)
 	*x = rol64(*x, 8);
 }
 
-static void speck128_encrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
+void crypto_speck128_encrypt(const struct speck128_tfm_ctx *ctx,
+			     u8 *out, const u8 *in)
 {
-	const struct speck128_tfm_ctx *ctx = crypto_tfm_ctx(tfm);
 	u64 y = get_unaligned_le64(in);
 	u64 x = get_unaligned_le64(in + 8);
 	int i;
@@ -78,10 +63,16 @@ static void speck128_encrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
 	put_unaligned_le64(y, out);
 	put_unaligned_le64(x, out + 8);
 }
+EXPORT_SYMBOL_GPL(crypto_speck128_encrypt);
 
-static void speck128_decrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
+static void speck128_encrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
+{
+	crypto_speck128_encrypt(crypto_tfm_ctx(tfm), out, in);
+}
+
+void crypto_speck128_decrypt(const struct speck128_tfm_ctx *ctx,
+			     u8 *out, const u8 *in)
 {
-	const struct speck128_tfm_ctx *ctx = crypto_tfm_ctx(tfm);
 	u64 y = get_unaligned_le64(in);
 	u64 x = get_unaligned_le64(in + 8);
 	int i;
@@ -92,11 +83,16 @@ static void speck128_decrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
 	put_unaligned_le64(y, out);
 	put_unaligned_le64(x, out + 8);
 }
+EXPORT_SYMBOL_GPL(crypto_speck128_decrypt);
 
-static int speck128_setkey(struct crypto_tfm *tfm, const u8 *key,
+static void speck128_decrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
+{
+	crypto_speck128_decrypt(crypto_tfm_ctx(tfm), out, in);
+}
+
+int crypto_speck128_setkey(struct speck128_tfm_ctx *ctx, const u8 *key,
 			   unsigned int keylen)
 {
-	struct speck128_tfm_ctx *ctx = crypto_tfm_ctx(tfm);
 	u64 l[3];
 	u64 k;
 	int i;
@@ -138,21 +134,15 @@ static int speck128_setkey(struct crypto_tfm *tfm, const u8 *key,
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(crypto_speck128_setkey);
 
-/* Speck64 */
-
-#define SPECK64_BLOCK_SIZE	8
-
-#define SPECK64_96_KEY_SIZE	12
-#define SPECK64_96_NROUNDS	26
-
-#define SPECK64_128_KEY_SIZE	16
-#define SPECK64_128_NROUNDS	27
+static int speck128_setkey(struct crypto_tfm *tfm, const u8 *key,
+			   unsigned int keylen)
+{
+	return crypto_speck128_setkey(crypto_tfm_ctx(tfm), key, keylen);
+}
 
-struct speck64_tfm_ctx {
-	u32 round_keys[SPECK64_128_NROUNDS];
-	int nrounds;
-};
+/* Speck64 */
 
 static __always_inline void speck64_round(u32 *x, u32 *y, u32 k)
 {
@@ -172,9 +162,9 @@ static __always_inline void speck64_unround(u32 *x, u32 *y, u32 k)
 	*x = rol32(*x, 8);
 }
 
-static void speck64_encrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
+void crypto_speck64_encrypt(const struct speck64_tfm_ctx *ctx,
+			    u8 *out, const u8 *in)
 {
-	const struct speck64_tfm_ctx *ctx = crypto_tfm_ctx(tfm);
 	u32 y = get_unaligned_le32(in);
 	u32 x = get_unaligned_le32(in + 4);
 	int i;
@@ -185,10 +175,16 @@ static void speck64_encrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
 	put_unaligned_le32(y, out);
 	put_unaligned_le32(x, out + 4);
 }
+EXPORT_SYMBOL_GPL(crypto_speck64_encrypt);
 
-static void speck64_decrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
+static void speck64_encrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
+{
+	crypto_speck64_encrypt(crypto_tfm_ctx(tfm), out, in);
+}
+
+void crypto_speck64_decrypt(const struct speck64_tfm_ctx *ctx,
+			    u8 *out, const u8 *in)
 {
-	const struct speck64_tfm_ctx *ctx = crypto_tfm_ctx(tfm);
 	u32 y = get_unaligned_le32(in);
 	u32 x = get_unaligned_le32(in + 4);
 	int i;
@@ -199,11 +195,16 @@ static void speck64_decrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
 	put_unaligned_le32(y, out);
 	put_unaligned_le32(x, out + 4);
 }
+EXPORT_SYMBOL_GPL(crypto_speck64_decrypt);
 
-static int speck64_setkey(struct crypto_tfm *tfm, const u8 *key,
+static void speck64_decrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
+{
+	crypto_speck64_decrypt(crypto_tfm_ctx(tfm), out, in);
+}
+
+int crypto_speck64_setkey(struct speck64_tfm_ctx *ctx, const u8 *key,
 			  unsigned int keylen)
 {
-	struct speck64_tfm_ctx *ctx = crypto_tfm_ctx(tfm);
 	u32 l[3];
 	u32 k;
 	int i;
@@ -236,6 +237,13 @@ static int speck64_setkey(struct crypto_tfm *tfm, const u8 *key,
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(crypto_speck64_setkey);
+
+static int speck64_setkey(struct crypto_tfm *tfm, const u8 *key,
+			  unsigned int keylen)
+{
+	return crypto_speck64_setkey(crypto_tfm_ctx(tfm), key, keylen);
+}
 
 /* Algorithm definitions */
 
diff --git a/include/crypto/speck.h b/include/crypto/speck.h
new file mode 100644
index 000000000000..73cfc952d405
--- /dev/null
+++ b/include/crypto/speck.h
@@ -0,0 +1,62 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Common values for the Speck algorithm
+ */
+
+#ifndef _CRYPTO_SPECK_H
+#define _CRYPTO_SPECK_H
+
+#include <linux/types.h>
+
+/* Speck128 */
+
+#define SPECK128_BLOCK_SIZE	16
+
+#define SPECK128_128_KEY_SIZE	16
+#define SPECK128_128_NROUNDS	32
+
+#define SPECK128_192_KEY_SIZE	24
+#define SPECK128_192_NROUNDS	33
+
+#define SPECK128_256_KEY_SIZE	32
+#define SPECK128_256_NROUNDS	34
+
+struct speck128_tfm_ctx {
+	u64 round_keys[SPECK128_256_NROUNDS];
+	int nrounds;
+};
+
+void crypto_speck128_encrypt(const struct speck128_tfm_ctx *ctx,
+			     u8 *out, const u8 *in);
+
+void crypto_speck128_decrypt(const struct speck128_tfm_ctx *ctx,
+			     u8 *out, const u8 *in);
+
+int crypto_speck128_setkey(struct speck128_tfm_ctx *ctx, const u8 *key,
+			   unsigned int keysize);
+
+/* Speck64 */
+
+#define SPECK64_BLOCK_SIZE	8
+
+#define SPECK64_96_KEY_SIZE	12
+#define SPECK64_96_NROUNDS	26
+
+#define SPECK64_128_KEY_SIZE	16
+#define SPECK64_128_NROUNDS	27
+
+struct speck64_tfm_ctx {
+	u32 round_keys[SPECK64_128_NROUNDS];
+	int nrounds;
+};
+
+void crypto_speck64_encrypt(const struct speck64_tfm_ctx *ctx,
+			    u8 *out, const u8 *in);
+
+void crypto_speck64_decrypt(const struct speck64_tfm_ctx *ctx,
+			    u8 *out, const u8 *in);
+
+int crypto_speck64_setkey(struct speck64_tfm_ctx *ctx, const u8 *key,
+			  unsigned int keysize);
+
+#endif /* _CRYPTO_SPECK_H */
-- 
2.16.1.291.g4437f3f132-goog

* [PATCH v3 3/5] crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS
  2018-02-14 18:42 ` Eric Biggers
@ 2018-02-14 18:42   ` Eric Biggers
  -1 siblings, 0 replies; 36+ messages in thread
From: Eric Biggers @ 2018-02-14 18:42 UTC (permalink / raw)
  To: linux-crypto, Herbert Xu
  Cc: linux-fscrypt, linux-arm-kernel, Ard Biesheuvel, Jeffrey Walton,
	Paul Crowley, Patrik Torstensson, Greg Kaiser, Paul Lawrence,
	Michael Halcrow, Alex Cope, Greg Kroah-Hartman, Eric Biggers

Add an ARM NEON-accelerated implementation of Speck-XTS.  It operates on
128-byte chunks at a time, i.e. 8 blocks for Speck128 or 16 blocks for
Speck64.  Each 128-byte chunk goes through XTS preprocessing, then is
encrypted/decrypted (doing one cipher round for all the blocks, then the
next round, etc.), then goes through XTS postprocessing.
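
Expressed in plain C, the per-chunk flow looks roughly like the sketch below
(hypothetical helper names, Speck64 shown; this is not the kernel API -- the
real code keeps everything in NEON registers and handles byte order and the
tweak computation explicitly):

    /*
     * Structural sketch only: process one 128-byte chunk as 16 Speck64
     * blocks, assuming host byte order and a 4-byte-aligned buffer, with
     * the per-block XTS tweaks already computed into 'tweaks'.
     */
    #include <stdint.h>
    #include <stddef.h>

    #define CHUNK_BYTES 128

    static void xor_tweaks(uint8_t *chunk, const uint8_t *tweaks)
    {
        for (size_t i = 0; i < CHUNK_BYTES; i++)
            chunk[i] ^= tweaks[i];              /* XTS pre/post-whitening */
    }

    static void speck64_round_all_blocks(uint32_t *words, uint32_t k)
    {
        /* words[] holds the (y, x) word pairs of all 16 blocks */
        for (int b = 0; b < CHUNK_BYTES / 8; b++) {
            uint32_t y = words[2 * b], x = words[2 * b + 1];

            x = (x >> 8) | (x << 24);           /* x = ror32(x, 8) */
            x += y;
            x ^= k;
            y = (y << 3) | (y >> 29);           /* y = rol32(y, 3) */
            y ^= x;
            words[2 * b] = y;
            words[2 * b + 1] = x;
        }
    }

    static void speck64_xts_encrypt_chunk(uint8_t *chunk, const uint8_t *tweaks,
                                          const uint32_t *round_keys, int nrounds)
    {
        xor_tweaks(chunk, tweaks);              /* XTS preprocessing */
        for (int i = 0; i < nrounds; i++)       /* round i, across all blocks */
            speck64_round_all_blocks((uint32_t *)chunk, round_keys[i]);
        xor_tweaks(chunk, tweaks);              /* XTS postprocessing */
    }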

Performance depends on the processor, but this implementation can be about
3 times faster than the generic code.  For example, on an ARMv7 processor we
observe the following performance with Speck128/256-XTS:

    xts-speck128-neon:     Encryption 107.9 MB/s, Decryption 108.1 MB/s
    xts(speck128-generic): Encryption  32.1 MB/s, Decryption  36.6 MB/s

In comparison to AES-256-XTS without the Cryptography Extensions:

    xts-aes-neonbs:        Encryption  41.2 MB/s, Decryption  36.7 MB/s
    xts(aes-asm):          Encryption  31.7 MB/s, Decryption  30.8 MB/s
    xts(aes-generic):      Encryption  21.2 MB/s, Decryption  20.9 MB/s

Speck64/128-XTS is even faster:

    xts-speck64-neon:      Encryption 138.6 MB/s, Decryption 139.1 MB/s

Note that as with the generic code, only the Speck128 and Speck64
variants are supported.  Also, for now only the XTS mode of operation is
supported, to target the disk and file encryption use cases.  The NEON
code also only handles the portion of the data that is evenly divisible
into 128-byte chunks, with any remainder handled by a C fallback.  Of
course, other modes of operation could be added later if needed, and/or
the NEON code could be updated to handle other buffer sizes.

The XTS specification is only defined for AES which has a 128-bit block
size, so for the GF(2^64) math needed for Speck64-XTS we use the
reducing polynomial 'x^64 + x^4 + x^3 + x + 1' given by the original XEX
paper.  Of course, when possible users should use Speck128-XTS, but even
that may be too slow on some processors; Speck64-XTS can be faster.
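
For reference, the multiply-by-x step that this polynomial implies is the one
used by the C fallback in the glue code; as a standalone sketch operating on a
plain integer (the in-kernel code works on a __le64 value):

    #include <stdint.h>

    /* Multiply an XTS tweak by x modulo x^64 + x^4 + x^3 + x + 1 (0x1B). */
    static uint64_t speck64_xts_mul_x(uint64_t tweak)
    {
        return (tweak << 1) ^ ((tweak & (1ULL << 63)) ? 0x1B : 0);
    }

The C fallback advances the tweak by one such step per 8-byte block, while the
NEON code advances two blocks at a time by multiplying by x^2.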

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 arch/arm/crypto/Kconfig           |   6 +
 arch/arm/crypto/Makefile          |   2 +
 arch/arm/crypto/speck-neon-core.S | 432 ++++++++++++++++++++++++++++++
 arch/arm/crypto/speck-neon-glue.c | 288 ++++++++++++++++++++
 4 files changed, 728 insertions(+)
 create mode 100644 arch/arm/crypto/speck-neon-core.S
 create mode 100644 arch/arm/crypto/speck-neon-glue.c

diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
index b8e69fe282b8..925d1364727a 100644
--- a/arch/arm/crypto/Kconfig
+++ b/arch/arm/crypto/Kconfig
@@ -121,4 +121,10 @@ config CRYPTO_CHACHA20_NEON
 	select CRYPTO_BLKCIPHER
 	select CRYPTO_CHACHA20
 
+config CRYPTO_SPECK_NEON
+	tristate "NEON accelerated Speck cipher algorithms"
+	depends on KERNEL_MODE_NEON
+	select CRYPTO_BLKCIPHER
+	select CRYPTO_SPECK
+
 endif
diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
index 30ef8e291271..a758107c5525 100644
--- a/arch/arm/crypto/Makefile
+++ b/arch/arm/crypto/Makefile
@@ -10,6 +10,7 @@ obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
 obj-$(CONFIG_CRYPTO_SHA256_ARM) += sha256-arm.o
 obj-$(CONFIG_CRYPTO_SHA512_ARM) += sha512-arm.o
 obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha20-neon.o
+obj-$(CONFIG_CRYPTO_SPECK_NEON) += speck-neon.o
 
 ce-obj-$(CONFIG_CRYPTO_AES_ARM_CE) += aes-arm-ce.o
 ce-obj-$(CONFIG_CRYPTO_SHA1_ARM_CE) += sha1-arm-ce.o
@@ -53,6 +54,7 @@ ghash-arm-ce-y	:= ghash-ce-core.o ghash-ce-glue.o
 crct10dif-arm-ce-y	:= crct10dif-ce-core.o crct10dif-ce-glue.o
 crc32-arm-ce-y:= crc32-ce-core.o crc32-ce-glue.o
 chacha20-neon-y := chacha20-neon-core.o chacha20-neon-glue.o
+speck-neon-y := speck-neon-core.o speck-neon-glue.o
 
 quiet_cmd_perl = PERL    $@
       cmd_perl = $(PERL) $(<) > $(@)
diff --git a/arch/arm/crypto/speck-neon-core.S b/arch/arm/crypto/speck-neon-core.S
new file mode 100644
index 000000000000..3c1e203e53b9
--- /dev/null
+++ b/arch/arm/crypto/speck-neon-core.S
@@ -0,0 +1,432 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * NEON-accelerated implementation of Speck128-XTS and Speck64-XTS
+ *
+ * Copyright (c) 2018 Google, Inc
+ *
+ * Author: Eric Biggers <ebiggers@google.com>
+ */
+
+#include <linux/linkage.h>
+
+	.text
+	.fpu		neon
+
+	// arguments
+	ROUND_KEYS	.req	r0	// const {u64,u32} *round_keys
+	NROUNDS		.req	r1	// int nrounds
+	DST		.req	r2	// void *dst
+	SRC		.req	r3	// const void *src
+	NBYTES		.req	r4	// unsigned int nbytes
+	TWEAK		.req	r5	// void *tweak
+
+	// registers which hold the data being encrypted/decrypted
+	X0		.req	q0
+	X0_L		.req	d0
+	X0_H		.req	d1
+	Y0		.req	q1
+	Y0_H		.req	d3
+	X1		.req	q2
+	X1_L		.req	d4
+	X1_H		.req	d5
+	Y1		.req	q3
+	Y1_H		.req	d7
+	X2		.req	q4
+	X2_L		.req	d8
+	X2_H		.req	d9
+	Y2		.req	q5
+	Y2_H		.req	d11
+	X3		.req	q6
+	X3_L		.req	d12
+	X3_H		.req	d13
+	Y3		.req	q7
+	Y3_H		.req	d15
+
+	// the round key, duplicated in all lanes
+	ROUND_KEY	.req	q8
+	ROUND_KEY_L	.req	d16
+	ROUND_KEY_H	.req	d17
+
+	// index vector for vtbl-based 8-bit rotates
+	ROTATE_TABLE	.req	d18
+
+	// multiplication table for updating XTS tweaks
+	GF128MUL_TABLE	.req	d19
+	GF64MUL_TABLE	.req	d19
+
+	// current XTS tweak value(s)
+	TWEAKV		.req	q10
+	TWEAKV_L	.req	d20
+	TWEAKV_H	.req	d21
+
+	TMP0		.req	q12
+	TMP0_L		.req	d24
+	TMP0_H		.req	d25
+	TMP1		.req	q13
+	TMP2		.req	q14
+	TMP3		.req	q15
+
+	.align		4
+.Lror64_8_table:
+	.byte		1, 2, 3, 4, 5, 6, 7, 0
+.Lror32_8_table:
+	.byte		1, 2, 3, 0, 5, 6, 7, 4
+.Lrol64_8_table:
+	.byte		7, 0, 1, 2, 3, 4, 5, 6
+.Lrol32_8_table:
+	.byte		3, 0, 1, 2, 7, 4, 5, 6
+.Lgf128mul_table:
+	.byte		0, 0x87
+	.fill		14
+.Lgf64mul_table:
+	.byte		0, 0x1b, (0x1b << 1), (0x1b << 1) ^ 0x1b
+	.fill		12
+
+/*
+ * _speck_round_128bytes() - Speck encryption round on 128 bytes at a time
+ *
+ * Do one Speck encryption round on the 128 bytes (8 blocks for Speck128, 16 for
+ * Speck64) stored in X0-X3 and Y0-Y3, using the round key stored in all lanes
+ * of ROUND_KEY.  'n' is the lane size: 64 for Speck128, or 32 for Speck64.
+ *
+ * The 8-bit rotates are implemented using vtbl instead of vshr + vsli because
+ * the vtbl approach is faster on some processors and the same speed on others.
+ */
+.macro _speck_round_128bytes	n
+
+	// x = ror(x, 8)
+	vtbl.8		X0_L, {X0_L}, ROTATE_TABLE
+	vtbl.8		X0_H, {X0_H}, ROTATE_TABLE
+	vtbl.8		X1_L, {X1_L}, ROTATE_TABLE
+	vtbl.8		X1_H, {X1_H}, ROTATE_TABLE
+	vtbl.8		X2_L, {X2_L}, ROTATE_TABLE
+	vtbl.8		X2_H, {X2_H}, ROTATE_TABLE
+	vtbl.8		X3_L, {X3_L}, ROTATE_TABLE
+	vtbl.8		X3_H, {X3_H}, ROTATE_TABLE
+
+	// x += y
+	vadd.u\n	X0, Y0
+	vadd.u\n	X1, Y1
+	vadd.u\n	X2, Y2
+	vadd.u\n	X3, Y3
+
+	// x ^= k
+	veor		X0, ROUND_KEY
+	veor		X1, ROUND_KEY
+	veor		X2, ROUND_KEY
+	veor		X3, ROUND_KEY
+
+	// y = rol(y, 3)
+	vshl.u\n	TMP0, Y0, #3
+	vshl.u\n	TMP1, Y1, #3
+	vshl.u\n	TMP2, Y2, #3
+	vshl.u\n	TMP3, Y3, #3
+	vsri.u\n	TMP0, Y0, #(\n - 3)
+	vsri.u\n	TMP1, Y1, #(\n - 3)
+	vsri.u\n	TMP2, Y2, #(\n - 3)
+	vsri.u\n	TMP3, Y3, #(\n - 3)
+
+	// y ^= x
+	veor		Y0, TMP0, X0
+	veor		Y1, TMP1, X1
+	veor		Y2, TMP2, X2
+	veor		Y3, TMP3, X3
+.endm
+
+/*
+ * _speck_unround_128bytes() - Speck decryption round on 128 bytes at a time
+ *
+ * This is the inverse of _speck_round_128bytes().
+ */
+.macro _speck_unround_128bytes	n
+
+	// y ^= x
+	veor		TMP0, Y0, X0
+	veor		TMP1, Y1, X1
+	veor		TMP2, Y2, X2
+	veor		TMP3, Y3, X3
+
+	// y = ror(y, 3)
+	vshr.u\n	Y0, TMP0, #3
+	vshr.u\n	Y1, TMP1, #3
+	vshr.u\n	Y2, TMP2, #3
+	vshr.u\n	Y3, TMP3, #3
+	vsli.u\n	Y0, TMP0, #(\n - 3)
+	vsli.u\n	Y1, TMP1, #(\n - 3)
+	vsli.u\n	Y2, TMP2, #(\n - 3)
+	vsli.u\n	Y3, TMP3, #(\n - 3)
+
+	// x ^= k
+	veor		X0, ROUND_KEY
+	veor		X1, ROUND_KEY
+	veor		X2, ROUND_KEY
+	veor		X3, ROUND_KEY
+
+	// x -= y
+	vsub.u\n	X0, Y0
+	vsub.u\n	X1, Y1
+	vsub.u\n	X2, Y2
+	vsub.u\n	X3, Y3
+
+	// x = rol(x, 8);
+	vtbl.8		X0_L, {X0_L}, ROTATE_TABLE
+	vtbl.8		X0_H, {X0_H}, ROTATE_TABLE
+	vtbl.8		X1_L, {X1_L}, ROTATE_TABLE
+	vtbl.8		X1_H, {X1_H}, ROTATE_TABLE
+	vtbl.8		X2_L, {X2_L}, ROTATE_TABLE
+	vtbl.8		X2_H, {X2_H}, ROTATE_TABLE
+	vtbl.8		X3_L, {X3_L}, ROTATE_TABLE
+	vtbl.8		X3_H, {X3_H}, ROTATE_TABLE
+.endm
+
+.macro _xts128_precrypt_one	dst_reg, tweak_buf, tmp
+
+	// Load the next source block
+	vld1.8		{\dst_reg}, [SRC]!
+
+	// Save the current tweak in the tweak buffer
+	vst1.8		{TWEAKV}, [\tweak_buf:128]!
+
+	// XOR the next source block with the current tweak
+	veor		\dst_reg, TWEAKV
+
+	/*
+	 * Calculate the next tweak by multiplying the current one by x,
+	 * modulo p(x) = x^128 + x^7 + x^2 + x + 1.
+	 */
+	vshr.u64	\tmp, TWEAKV, #63
+	vshl.u64	TWEAKV, #1
+	veor		TWEAKV_H, \tmp\()_L
+	vtbl.8		\tmp\()_H, {GF128MUL_TABLE}, \tmp\()_H
+	veor		TWEAKV_L, \tmp\()_H
+.endm
+
+.macro _xts64_precrypt_two	dst_reg, tweak_buf, tmp
+
+	// Load the next two source blocks
+	vld1.8		{\dst_reg}, [SRC]!
+
+	// Save the current two tweaks in the tweak buffer
+	vst1.8		{TWEAKV}, [\tweak_buf:128]!
+
+	// XOR the next two source blocks with the current two tweaks
+	veor		\dst_reg, TWEAKV
+
+	/*
+	 * Calculate the next two tweaks by multiplying the current ones by x^2,
+	 * modulo p(x) = x^64 + x^4 + x^3 + x + 1.
+	 */
+	vshr.u64	\tmp, TWEAKV, #62
+	vshl.u64	TWEAKV, #2
+	vtbl.8		\tmp\()_L, {GF64MUL_TABLE}, \tmp\()_L
+	vtbl.8		\tmp\()_H, {GF64MUL_TABLE}, \tmp\()_H
+	veor		TWEAKV, \tmp
+.endm
+
+/*
+ * _speck_xts_crypt() - Speck-XTS encryption/decryption
+ *
+ * Encrypt or decrypt NBYTES bytes of data from the SRC buffer to the DST buffer
+ * using Speck-XTS, specifically the variant with a block size of '2n' and round
+ * count given by NROUNDS.  The expanded round keys are given in ROUND_KEYS, and
+ * the current XTS tweak value is given in TWEAK.  It's assumed that NBYTES is a
+ * nonzero multiple of 128.
+ */
+.macro _speck_xts_crypt	n, decrypting
+	push		{r4-r7}
+	mov		r7, sp
+
+	/*
+	 * The first four parameters were passed in registers r0-r3.  Load the
+	 * additional parameters, which were passed on the stack.
+	 */
+	ldr		NBYTES, [sp, #16]
+	ldr		TWEAK, [sp, #20]
+
+	/*
+	 * If decrypting, modify the ROUND_KEYS parameter to point to the last
+	 * round key rather than the first, since for decryption the round keys
+	 * are used in reverse order.
+	 */
+.if \decrypting
+.if \n == 64
+	add		ROUND_KEYS, ROUND_KEYS, NROUNDS, lsl #3
+	sub		ROUND_KEYS, #8
+.else
+	add		ROUND_KEYS, ROUND_KEYS, NROUNDS, lsl #2
+	sub		ROUND_KEYS, #4
+.endif
+.endif
+
+	// Load the index vector for vtbl-based 8-bit rotates
+.if \decrypting
+	ldr		r12, =.Lrol\n\()_8_table
+.else
+	ldr		r12, =.Lror\n\()_8_table
+.endif
+	vld1.8		{ROTATE_TABLE}, [r12:64]
+
+	// One-time XTS preparation
+
+	/*
+	 * Allocate stack space to store 128 bytes worth of tweaks.  For
+	 * performance, this space is aligned to a 16-byte boundary so that we
+	 * can use the load/store instructions that declare 16-byte alignment.
+	 */
+	sub		sp, #128
+	bic		sp, #0xf
+
+.if \n == 64
+	// Load first tweak
+	vld1.8		{TWEAKV}, [TWEAK]
+
+	// Load GF(2^128) multiplication table
+	ldr		r12, =.Lgf128mul_table
+	vld1.8		{GF128MUL_TABLE}, [r12:64]
+.else
+	// Load first tweak
+	vld1.8		{TWEAKV_L}, [TWEAK]
+
+	// Load GF(2^64) multiplication table
+	ldr		r12, =.Lgf64mul_table
+	vld1.8		{GF64MUL_TABLE}, [r12:64]
+
+	// Calculate second tweak, packing it together with the first
+	vshr.u64	TMP0_L, TWEAKV_L, #63
+	vtbl.u8		TMP0_L, {GF64MUL_TABLE}, TMP0_L
+	vshl.u64	TWEAKV_H, TWEAKV_L, #1
+	veor		TWEAKV_H, TMP0_L
+.endif
+
+.Lnext_128bytes_\@:
+
+	/*
+	 * Load the source blocks into {X,Y}[0-3], XOR them with their XTS tweak
+	 * values, and save the tweaks on the stack for later.  Then
+	 * de-interleave the 'x' and 'y' elements of each block, i.e. make it so
+	 * that the X[0-3] registers contain only the second halves of blocks,
+	 * and the Y[0-3] registers contain only the first halves of blocks.
+	 * (Speck uses the order (y, x) rather than the more intuitive (x, y).)
+	 */
+	mov		r12, sp
+.if \n == 64
+	_xts128_precrypt_one	X0, r12, TMP0
+	_xts128_precrypt_one	Y0, r12, TMP0
+	_xts128_precrypt_one	X1, r12, TMP0
+	_xts128_precrypt_one	Y1, r12, TMP0
+	_xts128_precrypt_one	X2, r12, TMP0
+	_xts128_precrypt_one	Y2, r12, TMP0
+	_xts128_precrypt_one	X3, r12, TMP0
+	_xts128_precrypt_one	Y3, r12, TMP0
+	vswp		X0_L, Y0_H
+	vswp		X1_L, Y1_H
+	vswp		X2_L, Y2_H
+	vswp		X3_L, Y3_H
+.else
+	_xts64_precrypt_two	X0, r12, TMP0
+	_xts64_precrypt_two	Y0, r12, TMP0
+	_xts64_precrypt_two	X1, r12, TMP0
+	_xts64_precrypt_two	Y1, r12, TMP0
+	_xts64_precrypt_two	X2, r12, TMP0
+	_xts64_precrypt_two	Y2, r12, TMP0
+	_xts64_precrypt_two	X3, r12, TMP0
+	_xts64_precrypt_two	Y3, r12, TMP0
+	vuzp.32		Y0, X0
+	vuzp.32		Y1, X1
+	vuzp.32		Y2, X2
+	vuzp.32		Y3, X3
+.endif
+
+	// Do the cipher rounds
+
+	mov		r12, ROUND_KEYS
+	mov		r6, NROUNDS
+
+.Lnext_round_\@:
+.if \decrypting
+.if \n == 64
+	vld1.64		ROUND_KEY_L, [r12]
+	sub		r12, #8
+	vmov		ROUND_KEY_H, ROUND_KEY_L
+.else
+	vld1.32		{ROUND_KEY_L[],ROUND_KEY_H[]}, [r12]
+	sub		r12, #4
+.endif
+	_speck_unround_128bytes	\n
+.else
+.if \n == 64
+	vld1.64		ROUND_KEY_L, [r12]!
+	vmov		ROUND_KEY_H, ROUND_KEY_L
+.else
+	vld1.32		{ROUND_KEY_L[],ROUND_KEY_H[]}, [r12]!
+.endif
+	_speck_round_128bytes	\n
+.endif
+	subs		r6, r6, #1
+	bne		.Lnext_round_\@
+
+	// Re-interleave the 'x' and 'y' elements of each block
+.if \n == 64
+	vswp		X0_L, Y0_H
+	vswp		X1_L, Y1_H
+	vswp		X2_L, Y2_H
+	vswp		X3_L, Y3_H
+.else
+	vzip.32		Y0, X0
+	vzip.32		Y1, X1
+	vzip.32		Y2, X2
+	vzip.32		Y3, X3
+.endif
+
+	// XOR the encrypted/decrypted blocks with the tweaks we saved earlier
+	mov		r12, sp
+	vld1.8		{TMP0, TMP1}, [r12:128]!
+	vld1.8		{TMP2, TMP3}, [r12:128]!
+	veor		X0, TMP0
+	veor		Y0, TMP1
+	veor		X1, TMP2
+	veor		Y1, TMP3
+	vld1.8		{TMP0, TMP1}, [r12:128]!
+	vld1.8		{TMP2, TMP3}, [r12:128]!
+	veor		X2, TMP0
+	veor		Y2, TMP1
+	veor		X3, TMP2
+	veor		Y3, TMP3
+
+	// Store the ciphertext in the destination buffer
+	vst1.8		{X0, Y0}, [DST]!
+	vst1.8		{X1, Y1}, [DST]!
+	vst1.8		{X2, Y2}, [DST]!
+	vst1.8		{X3, Y3}, [DST]!
+
+	// Continue if there are more 128-byte chunks remaining, else return
+	subs		NBYTES, #128
+	bne		.Lnext_128bytes_\@
+
+	// Store the next tweak
+.if \n == 64
+	vst1.8		{TWEAKV}, [TWEAK]
+.else
+	vst1.8		{TWEAKV_L}, [TWEAK]
+.endif
+
+	mov		sp, r7
+	pop		{r4-r7}
+	bx		lr
+.endm
+
+ENTRY(speck128_xts_encrypt_neon)
+	_speck_xts_crypt	n=64, decrypting=0
+ENDPROC(speck128_xts_encrypt_neon)
+
+ENTRY(speck128_xts_decrypt_neon)
+	_speck_xts_crypt	n=64, decrypting=1
+ENDPROC(speck128_xts_decrypt_neon)
+
+ENTRY(speck64_xts_encrypt_neon)
+	_speck_xts_crypt	n=32, decrypting=0
+ENDPROC(speck64_xts_encrypt_neon)
+
+ENTRY(speck64_xts_decrypt_neon)
+	_speck_xts_crypt	n=32, decrypting=1
+ENDPROC(speck64_xts_decrypt_neon)
diff --git a/arch/arm/crypto/speck-neon-glue.c b/arch/arm/crypto/speck-neon-glue.c
new file mode 100644
index 000000000000..f012c3ea998f
--- /dev/null
+++ b/arch/arm/crypto/speck-neon-glue.c
@@ -0,0 +1,288 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * NEON-accelerated implementation of Speck128-XTS and Speck64-XTS
+ *
+ * Copyright (c) 2018 Google, Inc
+ *
+ * Note: the NIST recommendation for XTS only specifies a 128-bit block size,
+ * but a 64-bit version (needed for Speck64) is fairly straightforward; the math
+ * is just done in GF(2^64) instead of GF(2^128), with the reducing polynomial
+ * x^64 + x^4 + x^3 + x + 1 from the original XEX paper (Rogaway, 2004:
+ * "Efficient Instantiations of Tweakable Blockciphers and Refinements to Modes
+ * OCB and PMAC"), represented as 0x1B.
+ */
+
+#include <asm/hwcap.h>
+#include <asm/neon.h>
+#include <asm/simd.h>
+#include <crypto/algapi.h>
+#include <crypto/gf128mul.h>
+#include <crypto/internal/skcipher.h>
+#include <crypto/speck.h>
+#include <crypto/xts.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+
+/* The assembly functions only handle multiples of 128 bytes */
+#define SPECK_NEON_CHUNK_SIZE	128
+
+/* Speck128 */
+
+struct speck128_xts_tfm_ctx {
+	struct speck128_tfm_ctx main_key;
+	struct speck128_tfm_ctx tweak_key;
+};
+
+asmlinkage void speck128_xts_encrypt_neon(const u64 *round_keys, int nrounds,
+					  void *dst, const void *src,
+					  unsigned int nbytes, void *tweak);
+
+asmlinkage void speck128_xts_decrypt_neon(const u64 *round_keys, int nrounds,
+					  void *dst, const void *src,
+					  unsigned int nbytes, void *tweak);
+
+typedef void (*speck128_crypt_one_t)(const struct speck128_tfm_ctx *,
+				     u8 *, const u8 *);
+typedef void (*speck128_xts_crypt_many_t)(const u64 *, int, void *,
+					  const void *, unsigned int, void *);
+
+static __always_inline int
+__speck128_xts_crypt(struct skcipher_request *req,
+		     speck128_crypt_one_t crypt_one,
+		     speck128_xts_crypt_many_t crypt_many)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	const struct speck128_xts_tfm_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct skcipher_walk walk;
+	le128 tweak;
+	int err;
+
+	err = skcipher_walk_virt(&walk, req, true);
+
+	crypto_speck128_encrypt(&ctx->tweak_key, (u8 *)&tweak, walk.iv);
+
+	while (walk.nbytes > 0) {
+		unsigned int nbytes = walk.nbytes;
+		u8 *dst = walk.dst.virt.addr;
+		const u8 *src = walk.src.virt.addr;
+
+		if (nbytes >= SPECK_NEON_CHUNK_SIZE && may_use_simd()) {
+			unsigned int count;
+
+			count = round_down(nbytes, SPECK_NEON_CHUNK_SIZE);
+			kernel_neon_begin();
+			(*crypt_many)(ctx->main_key.round_keys,
+				      ctx->main_key.nrounds,
+				      dst, src, count, &tweak);
+			kernel_neon_end();
+			dst += count;
+			src += count;
+			nbytes -= count;
+		}
+
+		/* Handle any remainder with generic code */
+		while (nbytes >= sizeof(tweak)) {
+			le128_xor((le128 *)dst, (const le128 *)src, &tweak);
+			(*crypt_one)(&ctx->main_key, dst, dst);
+			le128_xor((le128 *)dst, (const le128 *)dst, &tweak);
+			gf128mul_x_ble(&tweak, &tweak);
+
+			dst += sizeof(tweak);
+			src += sizeof(tweak);
+			nbytes -= sizeof(tweak);
+		}
+		err = skcipher_walk_done(&walk, nbytes);
+	}
+
+	return err;
+}
+
+static int speck128_xts_encrypt(struct skcipher_request *req)
+{
+	return __speck128_xts_crypt(req, crypto_speck128_encrypt,
+				    speck128_xts_encrypt_neon);
+}
+
+static int speck128_xts_decrypt(struct skcipher_request *req)
+{
+	return __speck128_xts_crypt(req, crypto_speck128_decrypt,
+				    speck128_xts_decrypt_neon);
+}
+
+static int speck128_xts_setkey(struct crypto_skcipher *tfm, const u8 *key,
+			       unsigned int keylen)
+{
+	struct speck128_xts_tfm_ctx *ctx = crypto_skcipher_ctx(tfm);
+	int err;
+
+	err = xts_verify_key(tfm, key, keylen);
+	if (err)
+		return err;
+
+	keylen /= 2;
+
+	err = crypto_speck128_setkey(&ctx->main_key, key, keylen);
+	if (err)
+		return err;
+
+	return crypto_speck128_setkey(&ctx->tweak_key, key + keylen, keylen);
+}
+
+/* Speck64 */
+
+struct speck64_xts_tfm_ctx {
+	struct speck64_tfm_ctx main_key;
+	struct speck64_tfm_ctx tweak_key;
+};
+
+asmlinkage void speck64_xts_encrypt_neon(const u32 *round_keys, int nrounds,
+					 void *dst, const void *src,
+					 unsigned int nbytes, void *tweak);
+
+asmlinkage void speck64_xts_decrypt_neon(const u32 *round_keys, int nrounds,
+					 void *dst, const void *src,
+					 unsigned int nbytes, void *tweak);
+
+typedef void (*speck64_crypt_one_t)(const struct speck64_tfm_ctx *,
+				    u8 *, const u8 *);
+typedef void (*speck64_xts_crypt_many_t)(const u32 *, int, void *,
+					 const void *, unsigned int, void *);
+
+static __always_inline int
+__speck64_xts_crypt(struct skcipher_request *req, speck64_crypt_one_t crypt_one,
+		    speck64_xts_crypt_many_t crypt_many)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	const struct speck64_xts_tfm_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct skcipher_walk walk;
+	__le64 tweak;
+	int err;
+
+	err = skcipher_walk_virt(&walk, req, true);
+
+	crypto_speck64_encrypt(&ctx->tweak_key, (u8 *)&tweak, walk.iv);
+
+	while (walk.nbytes > 0) {
+		unsigned int nbytes = walk.nbytes;
+		u8 *dst = walk.dst.virt.addr;
+		const u8 *src = walk.src.virt.addr;
+
+		if (nbytes >= SPECK_NEON_CHUNK_SIZE && may_use_simd()) {
+			unsigned int count;
+
+			count = round_down(nbytes, SPECK_NEON_CHUNK_SIZE);
+			kernel_neon_begin();
+			(*crypt_many)(ctx->main_key.round_keys,
+				      ctx->main_key.nrounds,
+				      dst, src, count, &tweak);
+			kernel_neon_end();
+			dst += count;
+			src += count;
+			nbytes -= count;
+		}
+
+		/* Handle any remainder with generic code */
+		while (nbytes >= sizeof(tweak)) {
+			*(__le64 *)dst = *(__le64 *)src ^ tweak;
+			(*crypt_one)(&ctx->main_key, dst, dst);
+			*(__le64 *)dst ^= tweak;
+			tweak = cpu_to_le64((le64_to_cpu(tweak) << 1) ^
+					    ((tweak & cpu_to_le64(1ULL << 63)) ?
+					     0x1B : 0));
+			dst += sizeof(tweak);
+			src += sizeof(tweak);
+			nbytes -= sizeof(tweak);
+		}
+		err = skcipher_walk_done(&walk, nbytes);
+	}
+
+	return err;
+}
+
+static int speck64_xts_encrypt(struct skcipher_request *req)
+{
+	return __speck64_xts_crypt(req, crypto_speck64_encrypt,
+				   speck64_xts_encrypt_neon);
+}
+
+static int speck64_xts_decrypt(struct skcipher_request *req)
+{
+	return __speck64_xts_crypt(req, crypto_speck64_decrypt,
+				   speck64_xts_decrypt_neon);
+}
+
+static int speck64_xts_setkey(struct crypto_skcipher *tfm, const u8 *key,
+			      unsigned int keylen)
+{
+	struct speck64_xts_tfm_ctx *ctx = crypto_skcipher_ctx(tfm);
+	int err;
+
+	err = xts_verify_key(tfm, key, keylen);
+	if (err)
+		return err;
+
+	keylen /= 2;
+
+	err = crypto_speck64_setkey(&ctx->main_key, key, keylen);
+	if (err)
+		return err;
+
+	return crypto_speck64_setkey(&ctx->tweak_key, key + keylen, keylen);
+}
+
+static struct skcipher_alg speck_algs[] = {
+	{
+		.base.cra_name		= "xts(speck128)",
+		.base.cra_driver_name	= "xts-speck128-neon",
+		.base.cra_priority	= 300,
+		.base.cra_blocksize	= SPECK128_BLOCK_SIZE,
+		.base.cra_ctxsize	= sizeof(struct speck128_xts_tfm_ctx),
+		.base.cra_alignmask	= 7,
+		.base.cra_module	= THIS_MODULE,
+		.min_keysize		= 2 * SPECK128_128_KEY_SIZE,
+		.max_keysize		= 2 * SPECK128_256_KEY_SIZE,
+		.ivsize			= SPECK128_BLOCK_SIZE,
+		.walksize		= SPECK_NEON_CHUNK_SIZE,
+		.setkey			= speck128_xts_setkey,
+		.encrypt		= speck128_xts_encrypt,
+		.decrypt		= speck128_xts_decrypt,
+	}, {
+		.base.cra_name		= "xts(speck64)",
+		.base.cra_driver_name	= "xts-speck64-neon",
+		.base.cra_priority	= 300,
+		.base.cra_blocksize	= SPECK64_BLOCK_SIZE,
+		.base.cra_ctxsize	= sizeof(struct speck64_xts_tfm_ctx),
+		.base.cra_alignmask	= 7,
+		.base.cra_module	= THIS_MODULE,
+		.min_keysize		= 2 * SPECK64_96_KEY_SIZE,
+		.max_keysize		= 2 * SPECK64_128_KEY_SIZE,
+		.ivsize			= SPECK64_BLOCK_SIZE,
+		.walksize		= SPECK_NEON_CHUNK_SIZE,
+		.setkey			= speck64_xts_setkey,
+		.encrypt		= speck64_xts_encrypt,
+		.decrypt		= speck64_xts_decrypt,
+	}
+};
+
+static int __init speck_neon_module_init(void)
+{
+	if (!(elf_hwcap & HWCAP_NEON))
+		return -ENODEV;
+	return crypto_register_skciphers(speck_algs, ARRAY_SIZE(speck_algs));
+}
+
+static void __exit speck_neon_module_exit(void)
+{
+	crypto_unregister_skciphers(speck_algs, ARRAY_SIZE(speck_algs));
+}
+
+module_init(speck_neon_module_init);
+module_exit(speck_neon_module_exit);
+
+MODULE_DESCRIPTION("Speck block cipher (NEON-accelerated)");
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Eric Biggers <ebiggers@google.com>");
+MODULE_ALIAS_CRYPTO("xts(speck128)");
+MODULE_ALIAS_CRYPTO("xts-speck128-neon");
+MODULE_ALIAS_CRYPTO("xts(speck64)");
+MODULE_ALIAS_CRYPTO("xts-speck64-neon");
-- 
2.16.1.291.g4437f3f132-goog

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 4/5] crypto: speck - add test vectors for Speck128-XTS
  2018-02-14 18:42 ` Eric Biggers
@ 2018-02-14 18:42   ` Eric Biggers
  -1 siblings, 0 replies; 36+ messages in thread
From: Eric Biggers @ 2018-02-14 18:42 UTC (permalink / raw)
  To: linux-crypto, Herbert Xu
  Cc: linux-fscrypt, linux-arm-kernel, Ard Biesheuvel, Jeffrey Walton,
	Paul Crowley, Patrik Torstensson, Greg Kaiser, Paul Lawrence,
	Michael Halcrow, Alex Cope, Greg Kroah-Hartman, Eric Biggers

Add test vectors for Speck128-XTS, generated in userspace using C code.
The inputs were borrowed from the AES-XTS test vectors.

Both xts(speck128-generic) and xts-speck128-neon pass these tests.

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 crypto/testmgr.c |   9 +
 crypto/testmgr.h | 687 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 696 insertions(+)

diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index 058ed5eb6620..e011a347d51b 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -3575,6 +3575,15 @@ static const struct alg_test_desc alg_test_descs[] = {
 				.dec = __VECS(serpent_xts_dec_tv_template)
 			}
 		}
+	}, {
+		.alg = "xts(speck128)",
+		.test = alg_test_skcipher,
+		.suite = {
+			.cipher = {
+				.enc = __VECS(speck128_xts_enc_tv_template),
+				.dec = __VECS(speck128_xts_dec_tv_template)
+			}
+		}
 	}, {
 		.alg = "xts(twofish)",
 		.test = alg_test_skcipher,
diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index 3818210f77cf..0212e0ebcd0c 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -14411,6 +14411,693 @@ static const struct cipher_testvec speck128_dec_tv_template[] = {
 	},
 };
 
+/*
+ * Speck128-XTS test vectors, taken from the AES-XTS test vectors with the
+ * result recomputed with Speck128 as the cipher
+ */
+
+static const struct cipher_testvec speck128_xts_enc_tv_template[] = {
+	{
+		.key	= "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.klen	= 32,
+		.iv	= "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.input	= "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.ilen	= 32,
+		.result	= "\xbe\xa0\xe7\x03\xd7\xfe\xab\x62"
+			  "\x3b\x99\x4a\x64\x74\x77\xac\xed"
+			  "\xd8\xf4\xa6\xcf\xae\xb9\x07\x42"
+			  "\x51\xd9\xb6\x1d\xe0\x5e\xbc\x54",
+		.rlen	= 32,
+	}, {
+		.key	= "\x11\x11\x11\x11\x11\x11\x11\x11"
+			  "\x11\x11\x11\x11\x11\x11\x11\x11"
+			  "\x22\x22\x22\x22\x22\x22\x22\x22"
+			  "\x22\x22\x22\x22\x22\x22\x22\x22",
+		.klen	= 32,
+		.iv	= "\x33\x33\x33\x33\x33\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.input	= "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44",
+		.ilen	= 32,
+		.result	= "\xfb\x53\x81\x75\x6f\x9f\x34\xad"
+			  "\x7e\x01\xed\x7b\xcc\xda\x4e\x4a"
+			  "\xd4\x84\xa4\x53\xd5\x88\x73\x1b"
+			  "\xfd\xcb\xae\x0d\xf3\x04\xee\xe6",
+		.rlen	= 32,
+	}, {
+		.key	= "\xff\xfe\xfd\xfc\xfb\xfa\xf9\xf8"
+			  "\xf7\xf6\xf5\xf4\xf3\xf2\xf1\xf0"
+			  "\x22\x22\x22\x22\x22\x22\x22\x22"
+			  "\x22\x22\x22\x22\x22\x22\x22\x22",
+		.klen	= 32,
+		.iv	= "\x33\x33\x33\x33\x33\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.input	= "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44",
+		.ilen	= 32,
+		.result	= "\x21\x52\x84\x15\xd1\xf7\x21\x55"
+			  "\xd9\x75\x4a\xd3\xc5\xdb\x9f\x7d"
+			  "\xda\x63\xb2\xf1\x82\xb0\x89\x59"
+			  "\x86\xd4\xaa\xaa\xdd\xff\x4f\x92",
+		.rlen	= 32,
+	}, {
+		.key	= "\x27\x18\x28\x18\x28\x45\x90\x45"
+			  "\x23\x53\x60\x28\x74\x71\x35\x26"
+			  "\x31\x41\x59\x26\x53\x58\x97\x93"
+			  "\x23\x84\x62\x64\x33\x83\x27\x95",
+		.klen	= 32,
+		.iv	= "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.input	= "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17"
+			  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+			  "\x20\x21\x22\x23\x24\x25\x26\x27"
+			  "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+			  "\x30\x31\x32\x33\x34\x35\x36\x37"
+			  "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+			  "\x40\x41\x42\x43\x44\x45\x46\x47"
+			  "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+			  "\x50\x51\x52\x53\x54\x55\x56\x57"
+			  "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+			  "\x60\x61\x62\x63\x64\x65\x66\x67"
+			  "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+			  "\x70\x71\x72\x73\x74\x75\x76\x77"
+			  "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+			  "\x80\x81\x82\x83\x84\x85\x86\x87"
+			  "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+			  "\x90\x91\x92\x93\x94\x95\x96\x97"
+			  "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+			  "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+			  "\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+			  "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+			  "\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf"
+			  "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+			  "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf"
+			  "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7"
+			  "\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf"
+			  "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7"
+			  "\xe8\xe9\xea\xeb\xec\xed\xee\xef"
+			  "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7"
+			  "\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff"
+			  "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17"
+			  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+			  "\x20\x21\x22\x23\x24\x25\x26\x27"
+			  "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+			  "\x30\x31\x32\x33\x34\x35\x36\x37"
+			  "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+			  "\x40\x41\x42\x43\x44\x45\x46\x47"
+			  "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+			  "\x50\x51\x52\x53\x54\x55\x56\x57"
+			  "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+			  "\x60\x61\x62\x63\x64\x65\x66\x67"
+			  "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+			  "\x70\x71\x72\x73\x74\x75\x76\x77"
+			  "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+			  "\x80\x81\x82\x83\x84\x85\x86\x87"
+			  "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+			  "\x90\x91\x92\x93\x94\x95\x96\x97"
+			  "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+			  "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+			  "\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+			  "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+			  "\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf"
+			  "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+			  "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf"
+			  "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7"
+			  "\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf"
+			  "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7"
+			  "\xe8\xe9\xea\xeb\xec\xed\xee\xef"
+			  "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7"
+			  "\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff",
+		.ilen	= 512,
+		.result	= "\x57\xb5\xf8\x71\x6e\x6d\xdd\x82"
+			  "\x53\xd0\xed\x2d\x30\xc1\x20\xef"
+			  "\x70\x67\x5e\xff\x09\x70\xbb\xc1"
+			  "\x3a\x7b\x48\x26\xd9\x0b\xf4\x48"
+			  "\xbe\xce\xb1\xc7\xb2\x67\xc4\xa7"
+			  "\x76\xf8\x36\x30\xb7\xb4\x9a\xd9"
+			  "\xf5\x9d\xd0\x7b\xc1\x06\x96\x44"
+			  "\x19\xc5\x58\x84\x63\xb9\x12\x68"
+			  "\x68\xc7\xaa\x18\x98\xf2\x1f\x5c"
+			  "\x39\xa6\xd8\x32\x2b\xc3\x51\xfd"
+			  "\x74\x79\x2e\xb4\x44\xd7\x69\xc4"
+			  "\xfc\x29\xe6\xed\x26\x1e\xa6\x9d"
+			  "\x1c\xbe\x00\x0e\x7f\x3a\xca\xfb"
+			  "\x6d\x13\x65\xa0\xf9\x31\x12\xe2"
+			  "\x26\xd1\xec\x2b\x0a\x8b\x59\x99"
+			  "\xa7\x49\xa0\x0e\x09\x33\x85\x50"
+			  "\xc3\x23\xca\x7a\xdd\x13\x45\x5f"
+			  "\xde\x4c\xa7\xcb\x00\x8a\x66\x6f"
+			  "\xa2\xb6\xb1\x2e\xe1\xa0\x18\xf6"
+			  "\xad\xf3\xbd\xeb\xc7\xef\x55\x4f"
+			  "\x79\x91\x8d\x36\x13\x7b\xd0\x4a"
+			  "\x6c\x39\xfb\x53\xb8\x6f\x02\x51"
+			  "\xa5\x20\xac\x24\x1c\x73\x59\x73"
+			  "\x58\x61\x3a\x87\x58\xb3\x20\x56"
+			  "\x39\x06\x2b\x4d\xd3\x20\x2b\x89"
+			  "\x3f\xa2\xf0\x96\xeb\x7f\xa4\xcd"
+			  "\x11\xae\xbd\xcb\x3a\xb4\xd9\x91"
+			  "\x09\x35\x71\x50\x65\xac\x92\xe3"
+			  "\x7b\x32\xc0\x7a\xdd\xd4\xc3\x92"
+			  "\x6f\xeb\x79\xde\x6f\xd3\x25\xc9"
+			  "\xcd\x63\xf5\x1e\x7a\x3b\x26\x9d"
+			  "\x77\x04\x80\xa9\xbf\x38\xb5\xbd"
+			  "\xb8\x05\x07\xbd\xfd\xab\x7b\xf8"
+			  "\x2a\x26\xcc\x49\x14\x6d\x55\x01"
+			  "\x06\x94\xd8\xb2\x2d\x53\x83\x1b"
+			  "\x8f\xd4\xdd\x57\x12\x7e\x18\xba"
+			  "\x8e\xe2\x4d\x80\xef\x7e\x6b\x9d"
+			  "\x24\xa9\x60\xa4\x97\x85\x86\x2a"
+			  "\x01\x00\x09\xf1\xcb\x4a\x24\x1c"
+			  "\xd8\xf6\xe6\x5b\xe7\x5d\xf2\xc4"
+			  "\x97\x1c\x10\xc6\x4d\x66\x4f\x98"
+			  "\x87\x30\xac\xd5\xea\x73\x49\x10"
+			  "\x80\xea\xe5\x5f\x4d\x5f\x03\x33"
+			  "\x66\x02\x35\x3d\x60\x06\x36\x4f"
+			  "\x14\x1c\xd8\x07\x1f\x78\xd0\xf8"
+			  "\x4f\x6c\x62\x7c\x15\xa5\x7c\x28"
+			  "\x7c\xcc\xeb\x1f\xd1\x07\x90\x93"
+			  "\x7e\xc2\xa8\x3a\x80\xc0\xf5\x30"
+			  "\xcc\x75\xcf\x16\x26\xa9\x26\x3b"
+			  "\xe7\x68\x2f\x15\x21\x5b\xe4\x00"
+			  "\xbd\x48\x50\xcd\x75\x70\xc4\x62"
+			  "\xbb\x41\xfb\x89\x4a\x88\x3b\x3b"
+			  "\x51\x66\x02\x69\x04\x97\x36\xd4"
+			  "\x75\xae\x0b\xa3\x42\xf8\xca\x79"
+			  "\x8f\x93\xe9\xcc\x38\xbd\xd6\xd2"
+			  "\xf9\x70\x4e\xc3\x6a\x8e\x25\xbd"
+			  "\xea\x15\x5a\xa0\x85\x7e\x81\x0d"
+			  "\x03\xe7\x05\x39\xf5\x05\x26\xee"
+			  "\xec\xaa\x1f\x3d\xc9\x98\x76\x01"
+			  "\x2c\xf4\xfc\xa3\x88\x77\x38\xc4"
+			  "\x50\x65\x50\x6d\x04\x1f\xdf\x5a"
+			  "\xaa\xf2\x01\xa9\xc1\x8d\xee\xca"
+			  "\x47\x26\xef\x39\xb8\xb4\xf2\xd1"
+			  "\xd6\xbb\x1b\x2a\xc1\x34\x14\xcf",
+		.rlen	= 512,
+	}, {
+		.key	= "\x27\x18\x28\x18\x28\x45\x90\x45"
+			  "\x23\x53\x60\x28\x74\x71\x35\x26"
+			  "\x62\x49\x77\x57\x24\x70\x93\x69"
+			  "\x99\x59\x57\x49\x66\x96\x76\x27"
+			  "\x31\x41\x59\x26\x53\x58\x97\x93"
+			  "\x23\x84\x62\x64\x33\x83\x27\x95"
+			  "\x02\x88\x41\x97\x16\x93\x99\x37"
+			  "\x51\x05\x82\x09\x74\x94\x45\x92",
+		.klen	= 64,
+		.iv	= "\xff\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.input	= "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17"
+			  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+			  "\x20\x21\x22\x23\x24\x25\x26\x27"
+			  "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+			  "\x30\x31\x32\x33\x34\x35\x36\x37"
+			  "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+			  "\x40\x41\x42\x43\x44\x45\x46\x47"
+			  "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+			  "\x50\x51\x52\x53\x54\x55\x56\x57"
+			  "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+			  "\x60\x61\x62\x63\x64\x65\x66\x67"
+			  "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+			  "\x70\x71\x72\x73\x74\x75\x76\x77"
+			  "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+			  "\x80\x81\x82\x83\x84\x85\x86\x87"
+			  "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+			  "\x90\x91\x92\x93\x94\x95\x96\x97"
+			  "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+			  "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+			  "\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+			  "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+			  "\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf"
+			  "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+			  "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf"
+			  "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7"
+			  "\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf"
+			  "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7"
+			  "\xe8\xe9\xea\xeb\xec\xed\xee\xef"
+			  "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7"
+			  "\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff"
+			  "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17"
+			  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+			  "\x20\x21\x22\x23\x24\x25\x26\x27"
+			  "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+			  "\x30\x31\x32\x33\x34\x35\x36\x37"
+			  "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+			  "\x40\x41\x42\x43\x44\x45\x46\x47"
+			  "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+			  "\x50\x51\x52\x53\x54\x55\x56\x57"
+			  "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+			  "\x60\x61\x62\x63\x64\x65\x66\x67"
+			  "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+			  "\x70\x71\x72\x73\x74\x75\x76\x77"
+			  "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+			  "\x80\x81\x82\x83\x84\x85\x86\x87"
+			  "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+			  "\x90\x91\x92\x93\x94\x95\x96\x97"
+			  "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+			  "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+			  "\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+			  "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+			  "\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf"
+			  "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+			  "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf"
+			  "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7"
+			  "\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf"
+			  "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7"
+			  "\xe8\xe9\xea\xeb\xec\xed\xee\xef"
+			  "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7"
+			  "\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff",
+		.ilen	= 512,
+		.result	= "\xc5\x85\x2a\x4b\x73\xe4\xf6\xf1"
+			  "\x7e\xf9\xf6\xe9\xa3\x73\x36\xcb"
+			  "\xaa\xb6\x22\xb0\x24\x6e\x3d\x73"
+			  "\x92\x99\xde\xd3\x76\xed\xcd\x63"
+			  "\x64\x3a\x22\x57\xc1\x43\x49\xd4"
+			  "\x79\x36\x31\x19\x62\xae\x10\x7e"
+			  "\x7d\xcf\x7a\xe2\x6b\xce\x27\xfa"
+			  "\xdc\x3d\xd9\x83\xd3\x42\x4c\xe0"
+			  "\x1b\xd6\x1d\x1a\x6f\xd2\x03\x00"
+			  "\xfc\x81\x99\x8a\x14\x62\xf5\x7e"
+			  "\x0d\xe7\x12\xe8\x17\x9d\x0b\xec"
+			  "\xe2\xf7\xc9\xa7\x63\xd1\x79\xb6"
+			  "\x62\x62\x37\xfe\x0a\x4c\x4a\x37"
+			  "\x70\xc7\x5e\x96\x5f\xbc\x8e\x9e"
+			  "\x85\x3c\x4f\x26\x64\x85\xbc\x68"
+			  "\xb0\xe0\x86\x5e\x26\x41\xce\x11"
+			  "\x50\xda\x97\x14\xe9\x9e\xc7\x6d"
+			  "\x3b\xdc\x43\xde\x2b\x27\x69\x7d"
+			  "\xfc\xb0\x28\xbd\x8f\xb1\xc6\x31"
+			  "\x14\x4d\xf0\x74\x37\xfd\x07\x25"
+			  "\x96\x55\xe5\xfc\x9e\x27\x2a\x74"
+			  "\x1b\x83\x4d\x15\x83\xac\x57\xa0"
+			  "\xac\xa5\xd0\x38\xef\x19\x56\x53"
+			  "\x25\x4b\xfc\xce\x04\x23\xe5\x6b"
+			  "\xf6\xc6\x6c\x32\x0b\xb3\x12\xc5"
+			  "\xed\x22\x34\x1c\x5d\xed\x17\x06"
+			  "\x36\xa3\xe6\x77\xb9\x97\x46\xb8"
+			  "\xe9\x3f\x7e\xc7\xbc\x13\x5c\xdc"
+			  "\x6e\x3f\x04\x5e\xd1\x59\xa5\x82"
+			  "\x35\x91\x3d\x1b\xe4\x97\x9f\x92"
+			  "\x1c\x5e\x5f\x6f\x41\xd4\x62\xa1"
+			  "\x8d\x39\xfc\x42\xfb\x38\x80\xb9"
+			  "\x0a\xe3\xcc\x6a\x93\xd9\x7a\xb1"
+			  "\xe9\x69\xaf\x0a\x6b\x75\x38\xa7"
+			  "\xa1\xbf\xf7\xda\x95\x93\x4b\x78"
+			  "\x19\xf5\x94\xf9\xd2\x00\x33\x37"
+			  "\xcf\xf5\x9e\x9c\xf3\xcc\xa6\xee"
+			  "\x42\xb2\x9e\x2c\x5f\x48\x23\x26"
+			  "\x15\x25\x17\x03\x3d\xfe\x2c\xfc"
+			  "\xeb\xba\xda\xe0\x00\x05\xb6\xa6"
+			  "\x07\xb3\xe8\x36\x5b\xec\x5b\xbf"
+			  "\xd6\x5b\x00\x74\xc6\x97\xf1\x6a"
+			  "\x49\xa1\xc3\xfa\x10\x52\xb9\x14"
+			  "\xad\xb7\x73\xf8\x78\x12\xc8\x59"
+			  "\x17\x80\x4c\x57\x39\xf1\x6d\x80"
+			  "\x25\x77\x0f\x5e\x7d\xf0\xaf\x21"
+			  "\xec\xce\xb7\xc8\x02\x8a\xed\x53"
+			  "\x2c\x25\x68\x2e\x1f\x85\x5e\x67"
+			  "\xd1\x07\x7a\x3a\x89\x08\xe0\x34"
+			  "\xdc\xdb\x26\xb4\x6b\x77\xfc\x40"
+			  "\x31\x15\x72\xa0\xf0\x73\xd9\x3b"
+			  "\xd5\xdb\xfe\xfc\x8f\xa9\x44\xa2"
+			  "\x09\x9f\xc6\x33\xe5\xe2\x88\xe8"
+			  "\xf3\xf0\x1a\xf4\xce\x12\x0f\xd6"
+			  "\xf7\x36\xe6\xa4\xf4\x7a\x10\x58"
+			  "\xcc\x1f\x48\x49\x65\x47\x75\xe9"
+			  "\x28\xe1\x65\x7b\xf2\xc4\xb5\x07"
+			  "\xf2\xec\x76\xd8\x8f\x09\xf3\x16"
+			  "\xa1\x51\x89\x3b\xeb\x96\x42\xac"
+			  "\x65\xe0\x67\x63\x29\xdc\xb4\x7d"
+			  "\xf2\x41\x51\x6a\xcb\xde\x3c\xfb"
+			  "\x66\x8d\x13\xca\xe0\x59\x2a\x00"
+			  "\xc9\x53\x4c\xe6\x9e\xe2\x73\xd5"
+			  "\x67\x19\xb2\xbd\x9a\x63\xd7\x5c",
+		.rlen	= 512,
+		.also_non_np = 1,
+		.np	= 3,
+		.tap	= { 512 - 20, 4, 16 },
+	}
+};
+
+static const struct cipher_testvec speck128_xts_dec_tv_template[] = {
+	{
+		.key	= "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.klen	= 32,
+		.iv	= "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.input	= "\xbe\xa0\xe7\x03\xd7\xfe\xab\x62"
+			  "\x3b\x99\x4a\x64\x74\x77\xac\xed"
+			  "\xd8\xf4\xa6\xcf\xae\xb9\x07\x42"
+			  "\x51\xd9\xb6\x1d\xe0\x5e\xbc\x54",
+		.ilen	= 32,
+		.result	= "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.rlen	= 32,
+	}, {
+		.key	= "\x11\x11\x11\x11\x11\x11\x11\x11"
+			  "\x11\x11\x11\x11\x11\x11\x11\x11"
+			  "\x22\x22\x22\x22\x22\x22\x22\x22"
+			  "\x22\x22\x22\x22\x22\x22\x22\x22",
+		.klen	= 32,
+		.iv	= "\x33\x33\x33\x33\x33\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.input	= "\xfb\x53\x81\x75\x6f\x9f\x34\xad"
+			  "\x7e\x01\xed\x7b\xcc\xda\x4e\x4a"
+			  "\xd4\x84\xa4\x53\xd5\x88\x73\x1b"
+			  "\xfd\xcb\xae\x0d\xf3\x04\xee\xe6",
+		.ilen	= 32,
+		.result	= "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44",
+		.rlen	= 32,
+	}, {
+		.key	= "\xff\xfe\xfd\xfc\xfb\xfa\xf9\xf8"
+			  "\xf7\xf6\xf5\xf4\xf3\xf2\xf1\xf0"
+			  "\x22\x22\x22\x22\x22\x22\x22\x22"
+			  "\x22\x22\x22\x22\x22\x22\x22\x22",
+		.klen	= 32,
+		.iv	= "\x33\x33\x33\x33\x33\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.input	= "\x21\x52\x84\x15\xd1\xf7\x21\x55"
+			  "\xd9\x75\x4a\xd3\xc5\xdb\x9f\x7d"
+			  "\xda\x63\xb2\xf1\x82\xb0\x89\x59"
+			  "\x86\xd4\xaa\xaa\xdd\xff\x4f\x92",
+		.ilen	= 32,
+		.result	= "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44",
+		.rlen	= 32,
+	}, {
+		.key	= "\x27\x18\x28\x18\x28\x45\x90\x45"
+			  "\x23\x53\x60\x28\x74\x71\x35\x26"
+			  "\x31\x41\x59\x26\x53\x58\x97\x93"
+			  "\x23\x84\x62\x64\x33\x83\x27\x95",
+		.klen	= 32,
+		.iv	= "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.input	= "\x57\xb5\xf8\x71\x6e\x6d\xdd\x82"
+			  "\x53\xd0\xed\x2d\x30\xc1\x20\xef"
+			  "\x70\x67\x5e\xff\x09\x70\xbb\xc1"
+			  "\x3a\x7b\x48\x26\xd9\x0b\xf4\x48"
+			  "\xbe\xce\xb1\xc7\xb2\x67\xc4\xa7"
+			  "\x76\xf8\x36\x30\xb7\xb4\x9a\xd9"
+			  "\xf5\x9d\xd0\x7b\xc1\x06\x96\x44"
+			  "\x19\xc5\x58\x84\x63\xb9\x12\x68"
+			  "\x68\xc7\xaa\x18\x98\xf2\x1f\x5c"
+			  "\x39\xa6\xd8\x32\x2b\xc3\x51\xfd"
+			  "\x74\x79\x2e\xb4\x44\xd7\x69\xc4"
+			  "\xfc\x29\xe6\xed\x26\x1e\xa6\x9d"
+			  "\x1c\xbe\x00\x0e\x7f\x3a\xca\xfb"
+			  "\x6d\x13\x65\xa0\xf9\x31\x12\xe2"
+			  "\x26\xd1\xec\x2b\x0a\x8b\x59\x99"
+			  "\xa7\x49\xa0\x0e\x09\x33\x85\x50"
+			  "\xc3\x23\xca\x7a\xdd\x13\x45\x5f"
+			  "\xde\x4c\xa7\xcb\x00\x8a\x66\x6f"
+			  "\xa2\xb6\xb1\x2e\xe1\xa0\x18\xf6"
+			  "\xad\xf3\xbd\xeb\xc7\xef\x55\x4f"
+			  "\x79\x91\x8d\x36\x13\x7b\xd0\x4a"
+			  "\x6c\x39\xfb\x53\xb8\x6f\x02\x51"
+			  "\xa5\x20\xac\x24\x1c\x73\x59\x73"
+			  "\x58\x61\x3a\x87\x58\xb3\x20\x56"
+			  "\x39\x06\x2b\x4d\xd3\x20\x2b\x89"
+			  "\x3f\xa2\xf0\x96\xeb\x7f\xa4\xcd"
+			  "\x11\xae\xbd\xcb\x3a\xb4\xd9\x91"
+			  "\x09\x35\x71\x50\x65\xac\x92\xe3"
+			  "\x7b\x32\xc0\x7a\xdd\xd4\xc3\x92"
+			  "\x6f\xeb\x79\xde\x6f\xd3\x25\xc9"
+			  "\xcd\x63\xf5\x1e\x7a\x3b\x26\x9d"
+			  "\x77\x04\x80\xa9\xbf\x38\xb5\xbd"
+			  "\xb8\x05\x07\xbd\xfd\xab\x7b\xf8"
+			  "\x2a\x26\xcc\x49\x14\x6d\x55\x01"
+			  "\x06\x94\xd8\xb2\x2d\x53\x83\x1b"
+			  "\x8f\xd4\xdd\x57\x12\x7e\x18\xba"
+			  "\x8e\xe2\x4d\x80\xef\x7e\x6b\x9d"
+			  "\x24\xa9\x60\xa4\x97\x85\x86\x2a"
+			  "\x01\x00\x09\xf1\xcb\x4a\x24\x1c"
+			  "\xd8\xf6\xe6\x5b\xe7\x5d\xf2\xc4"
+			  "\x97\x1c\x10\xc6\x4d\x66\x4f\x98"
+			  "\x87\x30\xac\xd5\xea\x73\x49\x10"
+			  "\x80\xea\xe5\x5f\x4d\x5f\x03\x33"
+			  "\x66\x02\x35\x3d\x60\x06\x36\x4f"
+			  "\x14\x1c\xd8\x07\x1f\x78\xd0\xf8"
+			  "\x4f\x6c\x62\x7c\x15\xa5\x7c\x28"
+			  "\x7c\xcc\xeb\x1f\xd1\x07\x90\x93"
+			  "\x7e\xc2\xa8\x3a\x80\xc0\xf5\x30"
+			  "\xcc\x75\xcf\x16\x26\xa9\x26\x3b"
+			  "\xe7\x68\x2f\x15\x21\x5b\xe4\x00"
+			  "\xbd\x48\x50\xcd\x75\x70\xc4\x62"
+			  "\xbb\x41\xfb\x89\x4a\x88\x3b\x3b"
+			  "\x51\x66\x02\x69\x04\x97\x36\xd4"
+			  "\x75\xae\x0b\xa3\x42\xf8\xca\x79"
+			  "\x8f\x93\xe9\xcc\x38\xbd\xd6\xd2"
+			  "\xf9\x70\x4e\xc3\x6a\x8e\x25\xbd"
+			  "\xea\x15\x5a\xa0\x85\x7e\x81\x0d"
+			  "\x03\xe7\x05\x39\xf5\x05\x26\xee"
+			  "\xec\xaa\x1f\x3d\xc9\x98\x76\x01"
+			  "\x2c\xf4\xfc\xa3\x88\x77\x38\xc4"
+			  "\x50\x65\x50\x6d\x04\x1f\xdf\x5a"
+			  "\xaa\xf2\x01\xa9\xc1\x8d\xee\xca"
+			  "\x47\x26\xef\x39\xb8\xb4\xf2\xd1"
+			  "\xd6\xbb\x1b\x2a\xc1\x34\x14\xcf",
+		.ilen	= 512,
+		.result	= "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17"
+			  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+			  "\x20\x21\x22\x23\x24\x25\x26\x27"
+			  "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+			  "\x30\x31\x32\x33\x34\x35\x36\x37"
+			  "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+			  "\x40\x41\x42\x43\x44\x45\x46\x47"
+			  "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+			  "\x50\x51\x52\x53\x54\x55\x56\x57"
+			  "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+			  "\x60\x61\x62\x63\x64\x65\x66\x67"
+			  "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+			  "\x70\x71\x72\x73\x74\x75\x76\x77"
+			  "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+			  "\x80\x81\x82\x83\x84\x85\x86\x87"
+			  "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+			  "\x90\x91\x92\x93\x94\x95\x96\x97"
+			  "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+			  "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+			  "\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+			  "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+			  "\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf"
+			  "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+			  "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf"
+			  "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7"
+			  "\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf"
+			  "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7"
+			  "\xe8\xe9\xea\xeb\xec\xed\xee\xef"
+			  "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7"
+			  "\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff"
+			  "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17"
+			  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+			  "\x20\x21\x22\x23\x24\x25\x26\x27"
+			  "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+			  "\x30\x31\x32\x33\x34\x35\x36\x37"
+			  "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+			  "\x40\x41\x42\x43\x44\x45\x46\x47"
+			  "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+			  "\x50\x51\x52\x53\x54\x55\x56\x57"
+			  "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+			  "\x60\x61\x62\x63\x64\x65\x66\x67"
+			  "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+			  "\x70\x71\x72\x73\x74\x75\x76\x77"
+			  "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+			  "\x80\x81\x82\x83\x84\x85\x86\x87"
+			  "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+			  "\x90\x91\x92\x93\x94\x95\x96\x97"
+			  "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+			  "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+			  "\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+			  "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+			  "\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf"
+			  "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+			  "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf"
+			  "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7"
+			  "\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf"
+			  "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7"
+			  "\xe8\xe9\xea\xeb\xec\xed\xee\xef"
+			  "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7"
+			  "\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff",
+		.rlen	= 512,
+	}, {
+		.key	= "\x27\x18\x28\x18\x28\x45\x90\x45"
+			  "\x23\x53\x60\x28\x74\x71\x35\x26"
+			  "\x62\x49\x77\x57\x24\x70\x93\x69"
+			  "\x99\x59\x57\x49\x66\x96\x76\x27"
+			  "\x31\x41\x59\x26\x53\x58\x97\x93"
+			  "\x23\x84\x62\x64\x33\x83\x27\x95"
+			  "\x02\x88\x41\x97\x16\x93\x99\x37"
+			  "\x51\x05\x82\x09\x74\x94\x45\x92",
+		.klen	= 64,
+		.iv	= "\xff\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.input	= "\xc5\x85\x2a\x4b\x73\xe4\xf6\xf1"
+			  "\x7e\xf9\xf6\xe9\xa3\x73\x36\xcb"
+			  "\xaa\xb6\x22\xb0\x24\x6e\x3d\x73"
+			  "\x92\x99\xde\xd3\x76\xed\xcd\x63"
+			  "\x64\x3a\x22\x57\xc1\x43\x49\xd4"
+			  "\x79\x36\x31\x19\x62\xae\x10\x7e"
+			  "\x7d\xcf\x7a\xe2\x6b\xce\x27\xfa"
+			  "\xdc\x3d\xd9\x83\xd3\x42\x4c\xe0"
+			  "\x1b\xd6\x1d\x1a\x6f\xd2\x03\x00"
+			  "\xfc\x81\x99\x8a\x14\x62\xf5\x7e"
+			  "\x0d\xe7\x12\xe8\x17\x9d\x0b\xec"
+			  "\xe2\xf7\xc9\xa7\x63\xd1\x79\xb6"
+			  "\x62\x62\x37\xfe\x0a\x4c\x4a\x37"
+			  "\x70\xc7\x5e\x96\x5f\xbc\x8e\x9e"
+			  "\x85\x3c\x4f\x26\x64\x85\xbc\x68"
+			  "\xb0\xe0\x86\x5e\x26\x41\xce\x11"
+			  "\x50\xda\x97\x14\xe9\x9e\xc7\x6d"
+			  "\x3b\xdc\x43\xde\x2b\x27\x69\x7d"
+			  "\xfc\xb0\x28\xbd\x8f\xb1\xc6\x31"
+			  "\x14\x4d\xf0\x74\x37\xfd\x07\x25"
+			  "\x96\x55\xe5\xfc\x9e\x27\x2a\x74"
+			  "\x1b\x83\x4d\x15\x83\xac\x57\xa0"
+			  "\xac\xa5\xd0\x38\xef\x19\x56\x53"
+			  "\x25\x4b\xfc\xce\x04\x23\xe5\x6b"
+			  "\xf6\xc6\x6c\x32\x0b\xb3\x12\xc5"
+			  "\xed\x22\x34\x1c\x5d\xed\x17\x06"
+			  "\x36\xa3\xe6\x77\xb9\x97\x46\xb8"
+			  "\xe9\x3f\x7e\xc7\xbc\x13\x5c\xdc"
+			  "\x6e\x3f\x04\x5e\xd1\x59\xa5\x82"
+			  "\x35\x91\x3d\x1b\xe4\x97\x9f\x92"
+			  "\x1c\x5e\x5f\x6f\x41\xd4\x62\xa1"
+			  "\x8d\x39\xfc\x42\xfb\x38\x80\xb9"
+			  "\x0a\xe3\xcc\x6a\x93\xd9\x7a\xb1"
+			  "\xe9\x69\xaf\x0a\x6b\x75\x38\xa7"
+			  "\xa1\xbf\xf7\xda\x95\x93\x4b\x78"
+			  "\x19\xf5\x94\xf9\xd2\x00\x33\x37"
+			  "\xcf\xf5\x9e\x9c\xf3\xcc\xa6\xee"
+			  "\x42\xb2\x9e\x2c\x5f\x48\x23\x26"
+			  "\x15\x25\x17\x03\x3d\xfe\x2c\xfc"
+			  "\xeb\xba\xda\xe0\x00\x05\xb6\xa6"
+			  "\x07\xb3\xe8\x36\x5b\xec\x5b\xbf"
+			  "\xd6\x5b\x00\x74\xc6\x97\xf1\x6a"
+			  "\x49\xa1\xc3\xfa\x10\x52\xb9\x14"
+			  "\xad\xb7\x73\xf8\x78\x12\xc8\x59"
+			  "\x17\x80\x4c\x57\x39\xf1\x6d\x80"
+			  "\x25\x77\x0f\x5e\x7d\xf0\xaf\x21"
+			  "\xec\xce\xb7\xc8\x02\x8a\xed\x53"
+			  "\x2c\x25\x68\x2e\x1f\x85\x5e\x67"
+			  "\xd1\x07\x7a\x3a\x89\x08\xe0\x34"
+			  "\xdc\xdb\x26\xb4\x6b\x77\xfc\x40"
+			  "\x31\x15\x72\xa0\xf0\x73\xd9\x3b"
+			  "\xd5\xdb\xfe\xfc\x8f\xa9\x44\xa2"
+			  "\x09\x9f\xc6\x33\xe5\xe2\x88\xe8"
+			  "\xf3\xf0\x1a\xf4\xce\x12\x0f\xd6"
+			  "\xf7\x36\xe6\xa4\xf4\x7a\x10\x58"
+			  "\xcc\x1f\x48\x49\x65\x47\x75\xe9"
+			  "\x28\xe1\x65\x7b\xf2\xc4\xb5\x07"
+			  "\xf2\xec\x76\xd8\x8f\x09\xf3\x16"
+			  "\xa1\x51\x89\x3b\xeb\x96\x42\xac"
+			  "\x65\xe0\x67\x63\x29\xdc\xb4\x7d"
+			  "\xf2\x41\x51\x6a\xcb\xde\x3c\xfb"
+			  "\x66\x8d\x13\xca\xe0\x59\x2a\x00"
+			  "\xc9\x53\x4c\xe6\x9e\xe2\x73\xd5"
+			  "\x67\x19\xb2\xbd\x9a\x63\xd7\x5c",
+		.ilen	= 512,
+		.result	= "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17"
+			  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+			  "\x20\x21\x22\x23\x24\x25\x26\x27"
+			  "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+			  "\x30\x31\x32\x33\x34\x35\x36\x37"
+			  "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+			  "\x40\x41\x42\x43\x44\x45\x46\x47"
+			  "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+			  "\x50\x51\x52\x53\x54\x55\x56\x57"
+			  "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+			  "\x60\x61\x62\x63\x64\x65\x66\x67"
+			  "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+			  "\x70\x71\x72\x73\x74\x75\x76\x77"
+			  "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+			  "\x80\x81\x82\x83\x84\x85\x86\x87"
+			  "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+			  "\x90\x91\x92\x93\x94\x95\x96\x97"
+			  "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+			  "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+			  "\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+			  "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+			  "\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf"
+			  "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+			  "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf"
+			  "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7"
+			  "\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf"
+			  "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7"
+			  "\xe8\xe9\xea\xeb\xec\xed\xee\xef"
+			  "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7"
+			  "\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff"
+			  "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17"
+			  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+			  "\x20\x21\x22\x23\x24\x25\x26\x27"
+			  "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+			  "\x30\x31\x32\x33\x34\x35\x36\x37"
+			  "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+			  "\x40\x41\x42\x43\x44\x45\x46\x47"
+			  "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+			  "\x50\x51\x52\x53\x54\x55\x56\x57"
+			  "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+			  "\x60\x61\x62\x63\x64\x65\x66\x67"
+			  "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+			  "\x70\x71\x72\x73\x74\x75\x76\x77"
+			  "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+			  "\x80\x81\x82\x83\x84\x85\x86\x87"
+			  "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+			  "\x90\x91\x92\x93\x94\x95\x96\x97"
+			  "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+			  "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+			  "\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+			  "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+			  "\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf"
+			  "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+			  "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf"
+			  "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7"
+			  "\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf"
+			  "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7"
+			  "\xe8\xe9\xea\xeb\xec\xed\xee\xef"
+			  "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7"
+			  "\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff",
+		.rlen	= 512,
+		.also_non_np = 1,
+		.np	= 3,
+		.tap	= { 512 - 20, 4, 16 },
+	}
+};
+
 static const struct cipher_testvec speck64_enc_tv_template[] = {
 	{ /* Speck64/96 */
 		.key	= "\x00\x01\x02\x03\x08\x09\x0a\x0b"
-- 
2.16.1.291.g4437f3f132-goog


* [PATCH v3 5/5] crypto: speck - add test vectors for Speck64-XTS
  2018-02-14 18:42 ` Eric Biggers
@ 2018-02-14 18:42   ` Eric Biggers
  -1 siblings, 0 replies; 36+ messages in thread
From: Eric Biggers @ 2018-02-14 18:42 UTC (permalink / raw)
  To: linux-crypto, Herbert Xu
  Cc: linux-fscrypt, linux-arm-kernel, Ard Biesheuvel, Jeffrey Walton,
	Paul Crowley, Patrik Torstensson, Greg Kaiser, Paul Lawrence,
	Michael Halcrow, Alex Cope, Greg Kroah-Hartman, Eric Biggers

Add test vectors for Speck64-XTS, generated in userspace using C code.
The inputs were borrowed from the AES-XTS test vectors, with key lengths
adjusted.

xts-speck64-neon passes these tests.  However, they aren't currently
applicable to the generic XTS template, since it only supports a
128-bit block size.

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
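For illustration, a rough userspace sketch of the Speck64 round structure
that a vector generator could build on is below.  It is untested and it is
not crypto/speck.c: the function names, the mapping of the input key words
onto k[0] and l[], and the byte order of the two 32-bit block halves are
assumptions here, and a real generator has to match the kernel's word and
byte ordering exactly or the vectors will not reproduce.

#include <stdint.h>

#define ROR32(x, r) (((x) >> (r)) | ((x) << (32 - (r))))
#define ROL32(x, r) (((x) << (r)) | ((x) >> (32 - (r))))

/* One Speck64 round; the rotation amounts are 8 and 3 for 32-bit words. */
static void speck64_round(uint32_t *x, uint32_t *y, uint32_t k)
{
	*x = (ROR32(*x, 8) + *y) ^ k;
	*y = ROL32(*y, 3) ^ *x;
}

/* Speck64/128: four 32-bit key words, 27 rounds (Speck64/96: three, 26). */
static void speck64_128_encrypt(const uint32_t key[4], uint32_t *x, uint32_t *y)
{
	uint32_t k = key[0];	/* assumed to map to k[0]; see note above */
	uint32_t l[3 + 26] = { key[1], key[2], key[3] };
	int i;

	for (i = 0; i < 27; i++) {
		speck64_round(x, y, k);
		if (i == 26)
			break;
		/* The key schedule is the round function run on (l[i], k)
		 * with the round index as the "key". */
		l[i + 3] = (ROR32(l[i], 8) + k) ^ (uint32_t)i;
		k = ROL32(k, 3) ^ l[i + 3];
	}
}

For XTS, each 8-byte block would additionally be XORed before and after
this call with a per-block tweak, starting from the encrypted IV and
doubled in GF(2^64) from block to block; the reducing polynomial and the
tweak byte order again have to follow the kernel implementation.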
 crypto/testmgr.c |   9 +
 crypto/testmgr.h | 671 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 680 insertions(+)

diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index e011a347d51b..9f82e7bc9c56 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -3584,6 +3584,15 @@ static const struct alg_test_desc alg_test_descs[] = {
 				.dec = __VECS(speck128_xts_dec_tv_template)
 			}
 		}
+	}, {
+		.alg = "xts(speck64)",
+		.test = alg_test_skcipher,
+		.suite = {
+			.cipher = {
+				.enc = __VECS(speck64_xts_enc_tv_template),
+				.dec = __VECS(speck64_xts_dec_tv_template)
+			}
+		}
 	}, {
 		.alg = "xts(twofish)",
 		.test = alg_test_skcipher,
diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index 0212e0ebcd0c..da72fd394f35 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -15138,6 +15138,677 @@ static const struct cipher_testvec speck64_dec_tv_template[] = {
 	},
 };
 
+/*
+ * Speck64-XTS test vectors, taken from the AES-XTS test vectors with the result
+ * recomputed with Speck64 as the cipher, and key lengths adjusted
+ */
+
+static const struct cipher_testvec speck64_xts_enc_tv_template[] = {
+	{
+		.key	= "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.klen	= 24,
+		.iv	= "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.input	= "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.ilen	= 32,
+		.result	= "\x84\xaf\x54\x07\x19\xd4\x7c\xa6"
+			  "\xe4\xfe\xdf\xc4\x1f\x34\xc3\xc2"
+			  "\x80\xf5\x72\xe7\xcd\xf0\x99\x22"
+			  "\x35\xa7\x2f\x06\xef\xdc\x51\xaa",
+		.rlen	= 32,
+	}, {
+		.key	= "\x11\x11\x11\x11\x11\x11\x11\x11"
+			  "\x11\x11\x11\x11\x11\x11\x11\x11"
+			  "\x22\x22\x22\x22\x22\x22\x22\x22",
+		.klen	= 24,
+		.iv	= "\x33\x33\x33\x33\x33\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.input	= "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44",
+		.ilen	= 32,
+		.result	= "\x12\x56\x73\xcd\x15\x87\xa8\x59"
+			  "\xcf\x84\xae\xd9\x1c\x66\xd6\x9f"
+			  "\xb3\x12\x69\x7e\x36\xeb\x52\xff"
+			  "\x62\xdd\xba\x90\xb3\xe1\xee\x99",
+		.rlen	= 32,
+	}, {
+		.key	= "\xff\xfe\xfd\xfc\xfb\xfa\xf9\xf8"
+			  "\xf7\xf6\xf5\xf4\xf3\xf2\xf1\xf0"
+			  "\x22\x22\x22\x22\x22\x22\x22\x22",
+		.klen	= 24,
+		.iv	= "\x33\x33\x33\x33\x33\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.input	= "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44",
+		.ilen	= 32,
+		.result	= "\x15\x1b\xe4\x2c\xa2\x5a\x2d\x2c"
+			  "\x27\x36\xc0\xbf\x5d\xea\x36\x37"
+			  "\x2d\x1a\x88\xbc\x66\xb5\xd0\x0b"
+			  "\xa1\xbc\x19\xb2\x0f\x3b\x75\x34",
+		.rlen	= 32,
+	}, {
+		.key	= "\x27\x18\x28\x18\x28\x45\x90\x45"
+			  "\x23\x53\x60\x28\x74\x71\x35\x26"
+			  "\x31\x41\x59\x26\x53\x58\x97\x93",
+		.klen	= 24,
+		.iv	= "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.input	= "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17"
+			  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+			  "\x20\x21\x22\x23\x24\x25\x26\x27"
+			  "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+			  "\x30\x31\x32\x33\x34\x35\x36\x37"
+			  "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+			  "\x40\x41\x42\x43\x44\x45\x46\x47"
+			  "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+			  "\x50\x51\x52\x53\x54\x55\x56\x57"
+			  "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+			  "\x60\x61\x62\x63\x64\x65\x66\x67"
+			  "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+			  "\x70\x71\x72\x73\x74\x75\x76\x77"
+			  "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+			  "\x80\x81\x82\x83\x84\x85\x86\x87"
+			  "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+			  "\x90\x91\x92\x93\x94\x95\x96\x97"
+			  "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+			  "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+			  "\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+			  "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+			  "\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf"
+			  "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+			  "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf"
+			  "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7"
+			  "\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf"
+			  "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7"
+			  "\xe8\xe9\xea\xeb\xec\xed\xee\xef"
+			  "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7"
+			  "\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff"
+			  "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17"
+			  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+			  "\x20\x21\x22\x23\x24\x25\x26\x27"
+			  "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+			  "\x30\x31\x32\x33\x34\x35\x36\x37"
+			  "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+			  "\x40\x41\x42\x43\x44\x45\x46\x47"
+			  "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+			  "\x50\x51\x52\x53\x54\x55\x56\x57"
+			  "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+			  "\x60\x61\x62\x63\x64\x65\x66\x67"
+			  "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+			  "\x70\x71\x72\x73\x74\x75\x76\x77"
+			  "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+			  "\x80\x81\x82\x83\x84\x85\x86\x87"
+			  "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+			  "\x90\x91\x92\x93\x94\x95\x96\x97"
+			  "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+			  "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+			  "\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+			  "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+			  "\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf"
+			  "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+			  "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf"
+			  "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7"
+			  "\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf"
+			  "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7"
+			  "\xe8\xe9\xea\xeb\xec\xed\xee\xef"
+			  "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7"
+			  "\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff",
+		.ilen	= 512,
+		.result	= "\xaf\xa1\x81\xa6\x32\xbb\x15\x8e"
+			  "\xf8\x95\x2e\xd3\xe6\xee\x7e\x09"
+			  "\x0c\x1a\xf5\x02\x97\x8b\xe3\xb3"
+			  "\x11\xc7\x39\x96\xd0\x95\xf4\x56"
+			  "\xf4\xdd\x03\x38\x01\x44\x2c\xcf"
+			  "\x88\xae\x8e\x3c\xcd\xe7\xaa\x66"
+			  "\xfe\x3d\xc6\xfb\x01\x23\x51\x43"
+			  "\xd5\xd2\x13\x86\x94\x34\xe9\x62"
+			  "\xf9\x89\xe3\xd1\x7b\xbe\xf8\xef"
+			  "\x76\x35\x04\x3f\xdb\x23\x9d\x0b"
+			  "\x85\x42\xb9\x02\xd6\xcc\xdb\x96"
+			  "\xa7\x6b\x27\xb6\xd4\x45\x8f\x7d"
+			  "\xae\xd2\x04\xd5\xda\xc1\x7e\x24"
+			  "\x8c\x73\xbe\x48\x7e\xcf\x65\x28"
+			  "\x29\xe5\xbe\x54\x30\xcb\x46\x95"
+			  "\x4f\x2e\x8a\x36\xc8\x27\xc5\xbe"
+			  "\xd0\x1a\xaf\xab\x26\xcd\x9e\x69"
+			  "\xa1\x09\x95\x71\x26\xe9\xc4\xdf"
+			  "\xe6\x31\xc3\x46\xda\xaf\x0b\x41"
+			  "\x1f\xab\xb1\x8e\xd6\xfc\x0b\xb3"
+			  "\x82\xc0\x37\x27\xfc\x91\xa7\x05"
+			  "\xfb\xc5\xdc\x2b\x74\x96\x48\x43"
+			  "\x5d\x9c\x19\x0f\x60\x63\x3a\x1f"
+			  "\x6f\xf0\x03\xbe\x4d\xfd\xc8\x4a"
+			  "\xc6\xa4\x81\x6d\xc3\x12\x2a\x5c"
+			  "\x07\xff\xf3\x72\x74\x48\xb5\x40"
+			  "\x50\xb5\xdd\x90\x43\x31\x18\x15"
+			  "\x7b\xf2\xa6\xdb\x83\xc8\x4b\x4a"
+			  "\x29\x93\x90\x8b\xda\x07\xf0\x35"
+			  "\x6d\x90\x88\x09\x4e\x83\xf5\x5b"
+			  "\x94\x12\xbb\x33\x27\x1d\x3f\x23"
+			  "\x51\xa8\x7c\x07\xa2\xae\x77\xa6"
+			  "\x50\xfd\xcc\xc0\x4f\x80\x7a\x9f"
+			  "\x66\xdd\xcd\x75\x24\x8b\x33\xf7"
+			  "\x20\xdb\x83\x9b\x4f\x11\x63\x6e"
+			  "\xcf\x37\xef\xc9\x11\x01\x5c\x45"
+			  "\x32\x99\x7c\x3c\x9e\x42\x89\xe3"
+			  "\x70\x6d\x15\x9f\xb1\xe6\xb6\x05"
+			  "\xfe\x0c\xb9\x49\x2d\x90\x6d\xcc"
+			  "\x5d\x3f\xc1\xfe\x89\x0a\x2e\x2d"
+			  "\xa0\xa8\x89\x3b\x73\x39\xa5\x94"
+			  "\x4c\xa4\xa6\xbb\xa7\x14\x46\x89"
+			  "\x10\xff\xaf\xef\xca\xdd\x4f\x80"
+			  "\xb3\xdf\x3b\xab\xd4\xe5\x5a\xc7"
+			  "\x33\xca\x00\x8b\x8b\x3f\xea\xec"
+			  "\x68\x8a\xc2\x6d\xfd\xd4\x67\x0f"
+			  "\x22\x31\xe1\x0e\xfe\x5a\x04\xd5"
+			  "\x64\xa3\xf1\x1a\x76\x28\xcc\x35"
+			  "\x36\xa7\x0a\x74\xf7\x1c\x44\x9b"
+			  "\xc7\x1b\x53\x17\x02\xea\xd1\xad"
+			  "\x13\x51\x73\xc0\xa0\xb2\x05\x32"
+			  "\xa8\xa2\x37\x2e\xe1\x7a\x3a\x19"
+			  "\x26\xb4\x6c\x62\x5d\xb3\x1a\x1d"
+			  "\x59\xda\xee\x1a\x22\x18\xda\x0d"
+			  "\x88\x0f\x55\x8b\x72\x62\xfd\xc1"
+			  "\x69\x13\xcd\x0d\x5f\xc1\x09\x52"
+			  "\xee\xd6\xe3\x84\x4d\xee\xf6\x88"
+			  "\xaf\x83\xdc\x76\xf4\xc0\x93\x3f"
+			  "\x4a\x75\x2f\xb0\x0b\x3e\xc4\x54"
+			  "\x7d\x69\x8d\x00\x62\x77\x0d\x14"
+			  "\xbe\x7c\xa6\x7d\xc5\x24\x4f\xf3"
+			  "\x50\xf7\x5f\xf4\xc2\xca\x41\x97"
+			  "\x37\xbe\x75\x74\xcd\xf0\x75\x6e"
+			  "\x25\x23\x94\xbd\xda\x8d\xb0\xd4",
+		.rlen	= 512,
+	}, {
+		.key	= "\x27\x18\x28\x18\x28\x45\x90\x45"
+			  "\x23\x53\x60\x28\x74\x71\x35\x26"
+			  "\x62\x49\x77\x57\x24\x70\x93\x69"
+			  "\x99\x59\x57\x49\x66\x96\x76\x27",
+		.klen	= 32,
+		.iv	= "\xff\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.input	= "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17"
+			  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+			  "\x20\x21\x22\x23\x24\x25\x26\x27"
+			  "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+			  "\x30\x31\x32\x33\x34\x35\x36\x37"
+			  "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+			  "\x40\x41\x42\x43\x44\x45\x46\x47"
+			  "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+			  "\x50\x51\x52\x53\x54\x55\x56\x57"
+			  "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+			  "\x60\x61\x62\x63\x64\x65\x66\x67"
+			  "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+			  "\x70\x71\x72\x73\x74\x75\x76\x77"
+			  "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+			  "\x80\x81\x82\x83\x84\x85\x86\x87"
+			  "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+			  "\x90\x91\x92\x93\x94\x95\x96\x97"
+			  "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+			  "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+			  "\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+			  "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+			  "\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf"
+			  "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+			  "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf"
+			  "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7"
+			  "\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf"
+			  "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7"
+			  "\xe8\xe9\xea\xeb\xec\xed\xee\xef"
+			  "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7"
+			  "\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff"
+			  "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17"
+			  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+			  "\x20\x21\x22\x23\x24\x25\x26\x27"
+			  "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+			  "\x30\x31\x32\x33\x34\x35\x36\x37"
+			  "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+			  "\x40\x41\x42\x43\x44\x45\x46\x47"
+			  "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+			  "\x50\x51\x52\x53\x54\x55\x56\x57"
+			  "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+			  "\x60\x61\x62\x63\x64\x65\x66\x67"
+			  "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+			  "\x70\x71\x72\x73\x74\x75\x76\x77"
+			  "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+			  "\x80\x81\x82\x83\x84\x85\x86\x87"
+			  "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+			  "\x90\x91\x92\x93\x94\x95\x96\x97"
+			  "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+			  "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+			  "\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+			  "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+			  "\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf"
+			  "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+			  "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf"
+			  "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7"
+			  "\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf"
+			  "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7"
+			  "\xe8\xe9\xea\xeb\xec\xed\xee\xef"
+			  "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7"
+			  "\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff",
+		.ilen	= 512,
+		.result	= "\x55\xed\x71\xd3\x02\x8e\x15\x3b"
+			  "\xc6\x71\x29\x2d\x3e\x89\x9f\x59"
+			  "\x68\x6a\xcc\x8a\x56\x97\xf3\x95"
+			  "\x4e\x51\x08\xda\x2a\xf8\x6f\x3c"
+			  "\x78\x16\xea\x80\xdb\x33\x75\x94"
+			  "\xf9\x29\xc4\x2b\x76\x75\x97\xc7"
+			  "\xf2\x98\x2c\xf9\xff\xc8\xd5\x2b"
+			  "\x18\xf1\xaf\xcf\x7c\xc5\x0b\xee"
+			  "\xad\x3c\x76\x7c\xe6\x27\xa2\x2a"
+			  "\xe4\x66\xe1\xab\xa2\x39\xfc\x7c"
+			  "\xf5\xec\x32\x74\xa3\xb8\x03\x88"
+			  "\x52\xfc\x2e\x56\x3f\xa1\xf0\x9f"
+			  "\x84\x5e\x46\xed\x20\x89\xb6\x44"
+			  "\x8d\xd0\xed\x54\x47\x16\xbe\x95"
+			  "\x8a\xb3\x6b\x72\xc4\x32\x52\x13"
+			  "\x1b\xb0\x82\xbe\xac\xf9\x70\xa6"
+			  "\x44\x18\xdd\x8c\x6e\xca\x6e\x45"
+			  "\x8f\x1e\x10\x07\x57\x25\x98\x7b"
+			  "\x17\x8c\x78\xdd\x80\xa7\xd9\xd8"
+			  "\x63\xaf\xb9\x67\x57\xfd\xbc\xdb"
+			  "\x44\xe9\xc5\x65\xd1\xc7\x3b\xff"
+			  "\x20\xa0\x80\x1a\xc3\x9a\xad\x5e"
+			  "\x5d\x3b\xd3\x07\xd9\xf5\xfd\x3d"
+			  "\x4a\x8b\xa8\xd2\x6e\x7a\x51\x65"
+			  "\x6c\x8e\x95\xe0\x45\xc9\x5f\x4a"
+			  "\x09\x3c\x3d\x71\x7f\x0c\x84\x2a"
+			  "\xc8\x48\x52\x1a\xc2\xd5\xd6\x78"
+			  "\x92\x1e\xa0\x90\x2e\xea\xf0\xf3"
+			  "\xdc\x0f\xb1\xaf\x0d\x9b\x06\x2e"
+			  "\x35\x10\x30\x82\x0d\xe7\xc5\x9b"
+			  "\xde\x44\x18\xbd\x9f\xd1\x45\xa9"
+			  "\x7b\x7a\x4a\xad\x35\x65\x27\xca"
+			  "\xb2\xc3\xd4\x9b\x71\x86\x70\xee"
+			  "\xf1\x89\x3b\x85\x4b\x5b\xaa\xaf"
+			  "\xfc\x42\xc8\x31\x59\xbe\x16\x60"
+			  "\x4f\xf9\xfa\x12\xea\xd0\xa7\x14"
+			  "\xf0\x7a\xf3\xd5\x8d\xbd\x81\xef"
+			  "\x52\x7f\x29\x51\x94\x20\x67\x3c"
+			  "\xd1\xaf\x77\x9f\x22\x5a\x4e\x63"
+			  "\xe7\xff\x73\x25\xd1\xdd\x96\x8a"
+			  "\x98\x52\x6d\xf3\xac\x3e\xf2\x18"
+			  "\x6d\xf6\x0a\x29\xa6\x34\x3d\xed"
+			  "\xe3\x27\x0d\x9d\x0a\x02\x44\x7e"
+			  "\x5a\x7e\x67\x0f\x0a\x9e\xd6\xad"
+			  "\x91\xe6\x4d\x81\x8c\x5c\x59\xaa"
+			  "\xfb\xeb\x56\x53\xd2\x7d\x4c\x81"
+			  "\x65\x53\x0f\x41\x11\xbd\x98\x99"
+			  "\xf9\xc6\xfa\x51\x2e\xa3\xdd\x8d"
+			  "\x84\x98\xf9\x34\xed\x33\x2a\x1f"
+			  "\x82\xed\xc1\x73\x98\xd3\x02\xdc"
+			  "\xe6\xc2\x33\x1d\xa2\xb4\xca\x76"
+			  "\x63\x51\x34\x9d\x96\x12\xae\xce"
+			  "\x83\xc9\x76\x5e\xa4\x1b\x53\x37"
+			  "\x17\xd5\xc0\x80\x1d\x62\xf8\x3d"
+			  "\x54\x27\x74\xbb\x10\x86\x57\x46"
+			  "\x68\xe1\xed\x14\xe7\x9d\xfc\x84"
+			  "\x47\xbc\xc2\xf8\x19\x4b\x99\xcf"
+			  "\x7a\xe9\xc4\xb8\x8c\x82\x72\x4d"
+			  "\x7b\x4f\x38\x55\x36\x71\x64\xc1"
+			  "\xfc\x5c\x75\x52\x33\x02\x18\xf8"
+			  "\x17\xe1\x2b\xc2\x43\x39\xbd\x76"
+			  "\x9b\x63\x76\x32\x2f\x19\x72\x10"
+			  "\x9f\x21\x0c\xf1\x66\x50\x7f\xa5"
+			  "\x0d\x1f\x46\xe0\xba\xd3\x2f\x3c",
+		.rlen	= 512,
+		.also_non_np = 1,
+		.np	= 3,
+		.tap	= { 512 - 20, 4, 16 },
+	}
+};
+
+static const struct cipher_testvec speck64_xts_dec_tv_template[] = {
+	{
+		.key	= "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.klen	= 24,
+		.iv	= "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.input	= "\x84\xaf\x54\x07\x19\xd4\x7c\xa6"
+			  "\xe4\xfe\xdf\xc4\x1f\x34\xc3\xc2"
+			  "\x80\xf5\x72\xe7\xcd\xf0\x99\x22"
+			  "\x35\xa7\x2f\x06\xef\xdc\x51\xaa",
+		.ilen	= 32,
+		.result	= "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.rlen	= 32,
+	}, {
+		.key	= "\x11\x11\x11\x11\x11\x11\x11\x11"
+			  "\x11\x11\x11\x11\x11\x11\x11\x11"
+			  "\x22\x22\x22\x22\x22\x22\x22\x22",
+		.klen	= 24,
+		.iv	= "\x33\x33\x33\x33\x33\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.input	= "\x12\x56\x73\xcd\x15\x87\xa8\x59"
+			  "\xcf\x84\xae\xd9\x1c\x66\xd6\x9f"
+			  "\xb3\x12\x69\x7e\x36\xeb\x52\xff"
+			  "\x62\xdd\xba\x90\xb3\xe1\xee\x99",
+		.ilen	= 32,
+		.result	= "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44",
+		.rlen	= 32,
+	}, {
+		.key	= "\xff\xfe\xfd\xfc\xfb\xfa\xf9\xf8"
+			  "\xf7\xf6\xf5\xf4\xf3\xf2\xf1\xf0"
+			  "\x22\x22\x22\x22\x22\x22\x22\x22",
+		.klen	= 24,
+		.iv	= "\x33\x33\x33\x33\x33\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.input	= "\x15\x1b\xe4\x2c\xa2\x5a\x2d\x2c"
+			  "\x27\x36\xc0\xbf\x5d\xea\x36\x37"
+			  "\x2d\x1a\x88\xbc\x66\xb5\xd0\x0b"
+			  "\xa1\xbc\x19\xb2\x0f\x3b\x75\x34",
+		.ilen	= 32,
+		.result	= "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44",
+		.rlen	= 32,
+	}, {
+		.key	= "\x27\x18\x28\x18\x28\x45\x90\x45"
+			  "\x23\x53\x60\x28\x74\x71\x35\x26"
+			  "\x31\x41\x59\x26\x53\x58\x97\x93",
+		.klen	= 24,
+		.iv	= "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.input	= "\xaf\xa1\x81\xa6\x32\xbb\x15\x8e"
+			  "\xf8\x95\x2e\xd3\xe6\xee\x7e\x09"
+			  "\x0c\x1a\xf5\x02\x97\x8b\xe3\xb3"
+			  "\x11\xc7\x39\x96\xd0\x95\xf4\x56"
+			  "\xf4\xdd\x03\x38\x01\x44\x2c\xcf"
+			  "\x88\xae\x8e\x3c\xcd\xe7\xaa\x66"
+			  "\xfe\x3d\xc6\xfb\x01\x23\x51\x43"
+			  "\xd5\xd2\x13\x86\x94\x34\xe9\x62"
+			  "\xf9\x89\xe3\xd1\x7b\xbe\xf8\xef"
+			  "\x76\x35\x04\x3f\xdb\x23\x9d\x0b"
+			  "\x85\x42\xb9\x02\xd6\xcc\xdb\x96"
+			  "\xa7\x6b\x27\xb6\xd4\x45\x8f\x7d"
+			  "\xae\xd2\x04\xd5\xda\xc1\x7e\x24"
+			  "\x8c\x73\xbe\x48\x7e\xcf\x65\x28"
+			  "\x29\xe5\xbe\x54\x30\xcb\x46\x95"
+			  "\x4f\x2e\x8a\x36\xc8\x27\xc5\xbe"
+			  "\xd0\x1a\xaf\xab\x26\xcd\x9e\x69"
+			  "\xa1\x09\x95\x71\x26\xe9\xc4\xdf"
+			  "\xe6\x31\xc3\x46\xda\xaf\x0b\x41"
+			  "\x1f\xab\xb1\x8e\xd6\xfc\x0b\xb3"
+			  "\x82\xc0\x37\x27\xfc\x91\xa7\x05"
+			  "\xfb\xc5\xdc\x2b\x74\x96\x48\x43"
+			  "\x5d\x9c\x19\x0f\x60\x63\x3a\x1f"
+			  "\x6f\xf0\x03\xbe\x4d\xfd\xc8\x4a"
+			  "\xc6\xa4\x81\x6d\xc3\x12\x2a\x5c"
+			  "\x07\xff\xf3\x72\x74\x48\xb5\x40"
+			  "\x50\xb5\xdd\x90\x43\x31\x18\x15"
+			  "\x7b\xf2\xa6\xdb\x83\xc8\x4b\x4a"
+			  "\x29\x93\x90\x8b\xda\x07\xf0\x35"
+			  "\x6d\x90\x88\x09\x4e\x83\xf5\x5b"
+			  "\x94\x12\xbb\x33\x27\x1d\x3f\x23"
+			  "\x51\xa8\x7c\x07\xa2\xae\x77\xa6"
+			  "\x50\xfd\xcc\xc0\x4f\x80\x7a\x9f"
+			  "\x66\xdd\xcd\x75\x24\x8b\x33\xf7"
+			  "\x20\xdb\x83\x9b\x4f\x11\x63\x6e"
+			  "\xcf\x37\xef\xc9\x11\x01\x5c\x45"
+			  "\x32\x99\x7c\x3c\x9e\x42\x89\xe3"
+			  "\x70\x6d\x15\x9f\xb1\xe6\xb6\x05"
+			  "\xfe\x0c\xb9\x49\x2d\x90\x6d\xcc"
+			  "\x5d\x3f\xc1\xfe\x89\x0a\x2e\x2d"
+			  "\xa0\xa8\x89\x3b\x73\x39\xa5\x94"
+			  "\x4c\xa4\xa6\xbb\xa7\x14\x46\x89"
+			  "\x10\xff\xaf\xef\xca\xdd\x4f\x80"
+			  "\xb3\xdf\x3b\xab\xd4\xe5\x5a\xc7"
+			  "\x33\xca\x00\x8b\x8b\x3f\xea\xec"
+			  "\x68\x8a\xc2\x6d\xfd\xd4\x67\x0f"
+			  "\x22\x31\xe1\x0e\xfe\x5a\x04\xd5"
+			  "\x64\xa3\xf1\x1a\x76\x28\xcc\x35"
+			  "\x36\xa7\x0a\x74\xf7\x1c\x44\x9b"
+			  "\xc7\x1b\x53\x17\x02\xea\xd1\xad"
+			  "\x13\x51\x73\xc0\xa0\xb2\x05\x32"
+			  "\xa8\xa2\x37\x2e\xe1\x7a\x3a\x19"
+			  "\x26\xb4\x6c\x62\x5d\xb3\x1a\x1d"
+			  "\x59\xda\xee\x1a\x22\x18\xda\x0d"
+			  "\x88\x0f\x55\x8b\x72\x62\xfd\xc1"
+			  "\x69\x13\xcd\x0d\x5f\xc1\x09\x52"
+			  "\xee\xd6\xe3\x84\x4d\xee\xf6\x88"
+			  "\xaf\x83\xdc\x76\xf4\xc0\x93\x3f"
+			  "\x4a\x75\x2f\xb0\x0b\x3e\xc4\x54"
+			  "\x7d\x69\x8d\x00\x62\x77\x0d\x14"
+			  "\xbe\x7c\xa6\x7d\xc5\x24\x4f\xf3"
+			  "\x50\xf7\x5f\xf4\xc2\xca\x41\x97"
+			  "\x37\xbe\x75\x74\xcd\xf0\x75\x6e"
+			  "\x25\x23\x94\xbd\xda\x8d\xb0\xd4",
+		.ilen	= 512,
+		.result	= "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17"
+			  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+			  "\x20\x21\x22\x23\x24\x25\x26\x27"
+			  "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+			  "\x30\x31\x32\x33\x34\x35\x36\x37"
+			  "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+			  "\x40\x41\x42\x43\x44\x45\x46\x47"
+			  "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+			  "\x50\x51\x52\x53\x54\x55\x56\x57"
+			  "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+			  "\x60\x61\x62\x63\x64\x65\x66\x67"
+			  "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+			  "\x70\x71\x72\x73\x74\x75\x76\x77"
+			  "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+			  "\x80\x81\x82\x83\x84\x85\x86\x87"
+			  "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+			  "\x90\x91\x92\x93\x94\x95\x96\x97"
+			  "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+			  "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+			  "\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+			  "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+			  "\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf"
+			  "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+			  "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf"
+			  "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7"
+			  "\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf"
+			  "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7"
+			  "\xe8\xe9\xea\xeb\xec\xed\xee\xef"
+			  "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7"
+			  "\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff"
+			  "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17"
+			  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+			  "\x20\x21\x22\x23\x24\x25\x26\x27"
+			  "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+			  "\x30\x31\x32\x33\x34\x35\x36\x37"
+			  "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+			  "\x40\x41\x42\x43\x44\x45\x46\x47"
+			  "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+			  "\x50\x51\x52\x53\x54\x55\x56\x57"
+			  "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+			  "\x60\x61\x62\x63\x64\x65\x66\x67"
+			  "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+			  "\x70\x71\x72\x73\x74\x75\x76\x77"
+			  "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+			  "\x80\x81\x82\x83\x84\x85\x86\x87"
+			  "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+			  "\x90\x91\x92\x93\x94\x95\x96\x97"
+			  "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+			  "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+			  "\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+			  "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+			  "\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf"
+			  "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+			  "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf"
+			  "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7"
+			  "\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf"
+			  "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7"
+			  "\xe8\xe9\xea\xeb\xec\xed\xee\xef"
+			  "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7"
+			  "\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff",
+		.rlen	= 512,
+	}, {
+		.key	= "\x27\x18\x28\x18\x28\x45\x90\x45"
+			  "\x23\x53\x60\x28\x74\x71\x35\x26"
+			  "\x62\x49\x77\x57\x24\x70\x93\x69"
+			  "\x99\x59\x57\x49\x66\x96\x76\x27",
+		.klen	= 32,
+		.iv	= "\xff\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.input	= "\x55\xed\x71\xd3\x02\x8e\x15\x3b"
+			  "\xc6\x71\x29\x2d\x3e\x89\x9f\x59"
+			  "\x68\x6a\xcc\x8a\x56\x97\xf3\x95"
+			  "\x4e\x51\x08\xda\x2a\xf8\x6f\x3c"
+			  "\x78\x16\xea\x80\xdb\x33\x75\x94"
+			  "\xf9\x29\xc4\x2b\x76\x75\x97\xc7"
+			  "\xf2\x98\x2c\xf9\xff\xc8\xd5\x2b"
+			  "\x18\xf1\xaf\xcf\x7c\xc5\x0b\xee"
+			  "\xad\x3c\x76\x7c\xe6\x27\xa2\x2a"
+			  "\xe4\x66\xe1\xab\xa2\x39\xfc\x7c"
+			  "\xf5\xec\x32\x74\xa3\xb8\x03\x88"
+			  "\x52\xfc\x2e\x56\x3f\xa1\xf0\x9f"
+			  "\x84\x5e\x46\xed\x20\x89\xb6\x44"
+			  "\x8d\xd0\xed\x54\x47\x16\xbe\x95"
+			  "\x8a\xb3\x6b\x72\xc4\x32\x52\x13"
+			  "\x1b\xb0\x82\xbe\xac\xf9\x70\xa6"
+			  "\x44\x18\xdd\x8c\x6e\xca\x6e\x45"
+			  "\x8f\x1e\x10\x07\x57\x25\x98\x7b"
+			  "\x17\x8c\x78\xdd\x80\xa7\xd9\xd8"
+			  "\x63\xaf\xb9\x67\x57\xfd\xbc\xdb"
+			  "\x44\xe9\xc5\x65\xd1\xc7\x3b\xff"
+			  "\x20\xa0\x80\x1a\xc3\x9a\xad\x5e"
+			  "\x5d\x3b\xd3\x07\xd9\xf5\xfd\x3d"
+			  "\x4a\x8b\xa8\xd2\x6e\x7a\x51\x65"
+			  "\x6c\x8e\x95\xe0\x45\xc9\x5f\x4a"
+			  "\x09\x3c\x3d\x71\x7f\x0c\x84\x2a"
+			  "\xc8\x48\x52\x1a\xc2\xd5\xd6\x78"
+			  "\x92\x1e\xa0\x90\x2e\xea\xf0\xf3"
+			  "\xdc\x0f\xb1\xaf\x0d\x9b\x06\x2e"
+			  "\x35\x10\x30\x82\x0d\xe7\xc5\x9b"
+			  "\xde\x44\x18\xbd\x9f\xd1\x45\xa9"
+			  "\x7b\x7a\x4a\xad\x35\x65\x27\xca"
+			  "\xb2\xc3\xd4\x9b\x71\x86\x70\xee"
+			  "\xf1\x89\x3b\x85\x4b\x5b\xaa\xaf"
+			  "\xfc\x42\xc8\x31\x59\xbe\x16\x60"
+			  "\x4f\xf9\xfa\x12\xea\xd0\xa7\x14"
+			  "\xf0\x7a\xf3\xd5\x8d\xbd\x81\xef"
+			  "\x52\x7f\x29\x51\x94\x20\x67\x3c"
+			  "\xd1\xaf\x77\x9f\x22\x5a\x4e\x63"
+			  "\xe7\xff\x73\x25\xd1\xdd\x96\x8a"
+			  "\x98\x52\x6d\xf3\xac\x3e\xf2\x18"
+			  "\x6d\xf6\x0a\x29\xa6\x34\x3d\xed"
+			  "\xe3\x27\x0d\x9d\x0a\x02\x44\x7e"
+			  "\x5a\x7e\x67\x0f\x0a\x9e\xd6\xad"
+			  "\x91\xe6\x4d\x81\x8c\x5c\x59\xaa"
+			  "\xfb\xeb\x56\x53\xd2\x7d\x4c\x81"
+			  "\x65\x53\x0f\x41\x11\xbd\x98\x99"
+			  "\xf9\xc6\xfa\x51\x2e\xa3\xdd\x8d"
+			  "\x84\x98\xf9\x34\xed\x33\x2a\x1f"
+			  "\x82\xed\xc1\x73\x98\xd3\x02\xdc"
+			  "\xe6\xc2\x33\x1d\xa2\xb4\xca\x76"
+			  "\x63\x51\x34\x9d\x96\x12\xae\xce"
+			  "\x83\xc9\x76\x5e\xa4\x1b\x53\x37"
+			  "\x17\xd5\xc0\x80\x1d\x62\xf8\x3d"
+			  "\x54\x27\x74\xbb\x10\x86\x57\x46"
+			  "\x68\xe1\xed\x14\xe7\x9d\xfc\x84"
+			  "\x47\xbc\xc2\xf8\x19\x4b\x99\xcf"
+			  "\x7a\xe9\xc4\xb8\x8c\x82\x72\x4d"
+			  "\x7b\x4f\x38\x55\x36\x71\x64\xc1"
+			  "\xfc\x5c\x75\x52\x33\x02\x18\xf8"
+			  "\x17\xe1\x2b\xc2\x43\x39\xbd\x76"
+			  "\x9b\x63\x76\x32\x2f\x19\x72\x10"
+			  "\x9f\x21\x0c\xf1\x66\x50\x7f\xa5"
+			  "\x0d\x1f\x46\xe0\xba\xd3\x2f\x3c",
+		.ilen	= 512,
+		.result	= "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17"
+			  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+			  "\x20\x21\x22\x23\x24\x25\x26\x27"
+			  "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+			  "\x30\x31\x32\x33\x34\x35\x36\x37"
+			  "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+			  "\x40\x41\x42\x43\x44\x45\x46\x47"
+			  "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+			  "\x50\x51\x52\x53\x54\x55\x56\x57"
+			  "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+			  "\x60\x61\x62\x63\x64\x65\x66\x67"
+			  "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+			  "\x70\x71\x72\x73\x74\x75\x76\x77"
+			  "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+			  "\x80\x81\x82\x83\x84\x85\x86\x87"
+			  "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+			  "\x90\x91\x92\x93\x94\x95\x96\x97"
+			  "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+			  "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+			  "\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+			  "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+			  "\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf"
+			  "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+			  "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf"
+			  "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7"
+			  "\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf"
+			  "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7"
+			  "\xe8\xe9\xea\xeb\xec\xed\xee\xef"
+			  "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7"
+			  "\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff"
+			  "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17"
+			  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+			  "\x20\x21\x22\x23\x24\x25\x26\x27"
+			  "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+			  "\x30\x31\x32\x33\x34\x35\x36\x37"
+			  "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+			  "\x40\x41\x42\x43\x44\x45\x46\x47"
+			  "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+			  "\x50\x51\x52\x53\x54\x55\x56\x57"
+			  "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+			  "\x60\x61\x62\x63\x64\x65\x66\x67"
+			  "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+			  "\x70\x71\x72\x73\x74\x75\x76\x77"
+			  "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+			  "\x80\x81\x82\x83\x84\x85\x86\x87"
+			  "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+			  "\x90\x91\x92\x93\x94\x95\x96\x97"
+			  "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+			  "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+			  "\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+			  "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+			  "\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf"
+			  "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+			  "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf"
+			  "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7"
+			  "\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf"
+			  "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7"
+			  "\xe8\xe9\xea\xeb\xec\xed\xee\xef"
+			  "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7"
+			  "\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff",
+		.rlen	= 512,
+		.also_non_np = 1,
+		.np	= 3,
+		.tap	= { 512 - 20, 4, 16 },
+	}
+};
+
 /* Cast6 test vectors from RFC 2612 */
 static const struct cipher_testvec cast6_enc_tv_template[] = {
 	{
-- 
2.16.1.291.g4437f3f132-goog

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 5/5] crypto: speck - add test vectors for Speck64-XTS
@ 2018-02-14 18:42   ` Eric Biggers
  0 siblings, 0 replies; 36+ messages in thread
From: Eric Biggers @ 2018-02-14 18:42 UTC (permalink / raw)
  To: linux-arm-kernel

Add test vectors for Speck64-XTS, generated in userspace using C code.
The inputs were borrowed from the AES-XTS test vectors, with key lengths
adjusted.
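
The userspace generator itself is not included in the patch.  Purely as
an illustration of the kind of code involved, a minimal sketch of XTS
over an 8-byte-block cipher might look like the following.  Note that
speck64_encrypt() is an assumed standalone helper rather than a real
kernel or library API, the GF(2^64) reduction constant and the use of
an 8-byte tweak are assumptions, a little-endian host is assumed, and
only whole blocks are handled, so this is not guaranteed to reproduce
the vectors below byte for byte.

#include <stddef.h>
#include <stdint.h>

/* Assumed helper, not a real API: encrypt one 8-byte Speck64 block. */
void speck64_encrypt(const uint8_t *key, size_t keylen,
		     const uint8_t in[8], uint8_t out[8]);

/* Multiply the tweak by x in GF(2^64); the constant 0x1B is an assumption. */
static uint64_t gf64_mul_x(uint64_t t)
{
	return (t << 1) ^ ((t >> 63) ? 0x1B : 0);
}

/* XTS-style encryption with an 8-byte block; little-endian host assumed. */
static void speck64_xts_encrypt(const uint8_t *key, size_t keylen,
				const uint8_t iv[8],
				const uint8_t *src, uint8_t *dst, size_t len)
{
	const uint8_t *k1 = key;                /* data key  */
	const uint8_t *k2 = key + keylen / 2;   /* tweak key */
	uint64_t t;
	size_t i, j;

	/* T = E_K2(IV); each block is E_K1(P ^ T) ^ T, T advancing in GF(2^64) */
	speck64_encrypt(k2, keylen / 2, iv, (uint8_t *)&t);

	for (i = 0; i < len; i += 8) {
		uint8_t blk[8];

		for (j = 0; j < 8; j++)
			blk[j] = src[i + j] ^ ((uint8_t *)&t)[j];
		speck64_encrypt(k1, keylen / 2, blk, blk);
		for (j = 0; j < 8; j++)
			dst[i + j] = blk[j] ^ ((uint8_t *)&t)[j];

		t = gf64_mul_x(t);              /* tweak for the next block */
	}
}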

xts-speck64-neon passes these tests.  However, they aren't currently
applicable for the generic XTS template, as that only supports a 128-bit
block size.
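
Since the NEON implementation registers the generic "xts(speck64)"
algorithm name, it can also be exercised from userspace through AF_ALG
once the module is loaded.  The program below is only a rough sketch
with error handling omitted: it assumes the running kernel exposes
xts(speck64), that the transform's IV size equals the 8-byte block
size, and that the first all-zero vector above is what should come
back; none of those details come from the patch itself.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/if_alg.h>

#ifndef SOL_ALG
#define SOL_ALG 279
#endif

int main(void)
{
	struct sockaddr_alg sa = {
		.salg_family = AF_ALG,
		.salg_type   = "skcipher",
		.salg_name   = "xts(speck64)",
	};
	/* first Speck64-XTS vector above: all-zero 24-byte key, IV, plaintext */
	unsigned char key[24] = { 0 }, iv[8] = { 0 }, buf[32] = { 0 };
	char cbuf[CMSG_SPACE(sizeof(__u32)) +
		  CMSG_SPACE(sizeof(struct af_alg_iv) + sizeof(iv))] = { 0 };
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	struct msghdr msg = {
		.msg_control = cbuf, .msg_controllen = sizeof(cbuf),
		.msg_iov = &iov, .msg_iovlen = 1,
	};
	struct cmsghdr *cmsg;
	struct af_alg_iv *aiv;
	int tfmfd, opfd, i;

	tfmfd = socket(AF_ALG, SOCK_SEQPACKET, 0);	/* error checks omitted */
	bind(tfmfd, (struct sockaddr *)&sa, sizeof(sa));
	setsockopt(tfmfd, SOL_ALG, ALG_SET_KEY, key, sizeof(key));
	opfd = accept(tfmfd, NULL, 0);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_ALG;
	cmsg->cmsg_type = ALG_SET_OP;
	cmsg->cmsg_len = CMSG_LEN(sizeof(__u32));
	*(__u32 *)CMSG_DATA(cmsg) = ALG_OP_ENCRYPT;

	cmsg = CMSG_NXTHDR(&msg, cmsg);
	cmsg->cmsg_level = SOL_ALG;
	cmsg->cmsg_type = ALG_SET_IV;
	cmsg->cmsg_len = CMSG_LEN(sizeof(struct af_alg_iv) + sizeof(iv));
	aiv = (struct af_alg_iv *)CMSG_DATA(cmsg);
	aiv->ivlen = sizeof(iv);	/* assumed to match the transform's ivsize */
	memcpy(aiv->iv, iv, sizeof(iv));

	sendmsg(opfd, &msg, 0);			/* plaintext in  */
	read(opfd, buf, sizeof(buf));		/* ciphertext out */

	for (i = 0; i < 32; i++)		/* expect the .result bytes above */
		printf("%02x%c", buf[i], (i % 8 == 7) ? '\n' : ' ');

	close(opfd);
	close(tfmfd);
	return 0;
}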

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 crypto/testmgr.c |   9 +
 crypto/testmgr.h | 671 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 680 insertions(+)

diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index e011a347d51b..9f82e7bc9c56 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -3584,6 +3584,15 @@ static const struct alg_test_desc alg_test_descs[] = {
 				.dec = __VECS(speck128_xts_dec_tv_template)
 			}
 		}
+	}, {
+		.alg = "xts(speck64)",
+		.test = alg_test_skcipher,
+		.suite = {
+			.cipher = {
+				.enc = __VECS(speck64_xts_enc_tv_template),
+				.dec = __VECS(speck64_xts_dec_tv_template)
+			}
+		}
 	}, {
 		.alg = "xts(twofish)",
 		.test = alg_test_skcipher,
diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index 0212e0ebcd0c..da72fd394f35 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -15138,6 +15138,677 @@ static const struct cipher_testvec speck64_dec_tv_template[] = {
 	},
 };
 
+/*
+ * Speck64-XTS test vectors, taken from the AES-XTS test vectors with the result
+ * recomputed with Speck64 as the cipher, and key lengths adjusted
+ */
+
+static const struct cipher_testvec speck64_xts_enc_tv_template[] = {
+	{
+		.key	= "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.klen	= 24,
+		.iv	= "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.input	= "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.ilen	= 32,
+		.result	= "\x84\xaf\x54\x07\x19\xd4\x7c\xa6"
+			  "\xe4\xfe\xdf\xc4\x1f\x34\xc3\xc2"
+			  "\x80\xf5\x72\xe7\xcd\xf0\x99\x22"
+			  "\x35\xa7\x2f\x06\xef\xdc\x51\xaa",
+		.rlen	= 32,
+	}, {
+		.key	= "\x11\x11\x11\x11\x11\x11\x11\x11"
+			  "\x11\x11\x11\x11\x11\x11\x11\x11"
+			  "\x22\x22\x22\x22\x22\x22\x22\x22",
+		.klen	= 24,
+		.iv	= "\x33\x33\x33\x33\x33\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.input	= "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44",
+		.ilen	= 32,
+		.result	= "\x12\x56\x73\xcd\x15\x87\xa8\x59"
+			  "\xcf\x84\xae\xd9\x1c\x66\xd6\x9f"
+			  "\xb3\x12\x69\x7e\x36\xeb\x52\xff"
+			  "\x62\xdd\xba\x90\xb3\xe1\xee\x99",
+		.rlen	= 32,
+	}, {
+		.key	= "\xff\xfe\xfd\xfc\xfb\xfa\xf9\xf8"
+			  "\xf7\xf6\xf5\xf4\xf3\xf2\xf1\xf0"
+			  "\x22\x22\x22\x22\x22\x22\x22\x22",
+		.klen	= 24,
+		.iv	= "\x33\x33\x33\x33\x33\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.input	= "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44",
+		.ilen	= 32,
+		.result	= "\x15\x1b\xe4\x2c\xa2\x5a\x2d\x2c"
+			  "\x27\x36\xc0\xbf\x5d\xea\x36\x37"
+			  "\x2d\x1a\x88\xbc\x66\xb5\xd0\x0b"
+			  "\xa1\xbc\x19\xb2\x0f\x3b\x75\x34",
+		.rlen	= 32,
+	}, {
+		.key	= "\x27\x18\x28\x18\x28\x45\x90\x45"
+			  "\x23\x53\x60\x28\x74\x71\x35\x26"
+			  "\x31\x41\x59\x26\x53\x58\x97\x93",
+		.klen	= 24,
+		.iv	= "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.input	= "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17"
+			  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+			  "\x20\x21\x22\x23\x24\x25\x26\x27"
+			  "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+			  "\x30\x31\x32\x33\x34\x35\x36\x37"
+			  "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+			  "\x40\x41\x42\x43\x44\x45\x46\x47"
+			  "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+			  "\x50\x51\x52\x53\x54\x55\x56\x57"
+			  "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+			  "\x60\x61\x62\x63\x64\x65\x66\x67"
+			  "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+			  "\x70\x71\x72\x73\x74\x75\x76\x77"
+			  "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+			  "\x80\x81\x82\x83\x84\x85\x86\x87"
+			  "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+			  "\x90\x91\x92\x93\x94\x95\x96\x97"
+			  "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+			  "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+			  "\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+			  "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+			  "\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf"
+			  "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+			  "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf"
+			  "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7"
+			  "\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf"
+			  "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7"
+			  "\xe8\xe9\xea\xeb\xec\xed\xee\xef"
+			  "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7"
+			  "\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff"
+			  "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17"
+			  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+			  "\x20\x21\x22\x23\x24\x25\x26\x27"
+			  "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+			  "\x30\x31\x32\x33\x34\x35\x36\x37"
+			  "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+			  "\x40\x41\x42\x43\x44\x45\x46\x47"
+			  "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+			  "\x50\x51\x52\x53\x54\x55\x56\x57"
+			  "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+			  "\x60\x61\x62\x63\x64\x65\x66\x67"
+			  "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+			  "\x70\x71\x72\x73\x74\x75\x76\x77"
+			  "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+			  "\x80\x81\x82\x83\x84\x85\x86\x87"
+			  "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+			  "\x90\x91\x92\x93\x94\x95\x96\x97"
+			  "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+			  "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+			  "\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+			  "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+			  "\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf"
+			  "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+			  "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf"
+			  "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7"
+			  "\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf"
+			  "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7"
+			  "\xe8\xe9\xea\xeb\xec\xed\xee\xef"
+			  "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7"
+			  "\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff",
+		.ilen	= 512,
+		.result	= "\xaf\xa1\x81\xa6\x32\xbb\x15\x8e"
+			  "\xf8\x95\x2e\xd3\xe6\xee\x7e\x09"
+			  "\x0c\x1a\xf5\x02\x97\x8b\xe3\xb3"
+			  "\x11\xc7\x39\x96\xd0\x95\xf4\x56"
+			  "\xf4\xdd\x03\x38\x01\x44\x2c\xcf"
+			  "\x88\xae\x8e\x3c\xcd\xe7\xaa\x66"
+			  "\xfe\x3d\xc6\xfb\x01\x23\x51\x43"
+			  "\xd5\xd2\x13\x86\x94\x34\xe9\x62"
+			  "\xf9\x89\xe3\xd1\x7b\xbe\xf8\xef"
+			  "\x76\x35\x04\x3f\xdb\x23\x9d\x0b"
+			  "\x85\x42\xb9\x02\xd6\xcc\xdb\x96"
+			  "\xa7\x6b\x27\xb6\xd4\x45\x8f\x7d"
+			  "\xae\xd2\x04\xd5\xda\xc1\x7e\x24"
+			  "\x8c\x73\xbe\x48\x7e\xcf\x65\x28"
+			  "\x29\xe5\xbe\x54\x30\xcb\x46\x95"
+			  "\x4f\x2e\x8a\x36\xc8\x27\xc5\xbe"
+			  "\xd0\x1a\xaf\xab\x26\xcd\x9e\x69"
+			  "\xa1\x09\x95\x71\x26\xe9\xc4\xdf"
+			  "\xe6\x31\xc3\x46\xda\xaf\x0b\x41"
+			  "\x1f\xab\xb1\x8e\xd6\xfc\x0b\xb3"
+			  "\x82\xc0\x37\x27\xfc\x91\xa7\x05"
+			  "\xfb\xc5\xdc\x2b\x74\x96\x48\x43"
+			  "\x5d\x9c\x19\x0f\x60\x63\x3a\x1f"
+			  "\x6f\xf0\x03\xbe\x4d\xfd\xc8\x4a"
+			  "\xc6\xa4\x81\x6d\xc3\x12\x2a\x5c"
+			  "\x07\xff\xf3\x72\x74\x48\xb5\x40"
+			  "\x50\xb5\xdd\x90\x43\x31\x18\x15"
+			  "\x7b\xf2\xa6\xdb\x83\xc8\x4b\x4a"
+			  "\x29\x93\x90\x8b\xda\x07\xf0\x35"
+			  "\x6d\x90\x88\x09\x4e\x83\xf5\x5b"
+			  "\x94\x12\xbb\x33\x27\x1d\x3f\x23"
+			  "\x51\xa8\x7c\x07\xa2\xae\x77\xa6"
+			  "\x50\xfd\xcc\xc0\x4f\x80\x7a\x9f"
+			  "\x66\xdd\xcd\x75\x24\x8b\x33\xf7"
+			  "\x20\xdb\x83\x9b\x4f\x11\x63\x6e"
+			  "\xcf\x37\xef\xc9\x11\x01\x5c\x45"
+			  "\x32\x99\x7c\x3c\x9e\x42\x89\xe3"
+			  "\x70\x6d\x15\x9f\xb1\xe6\xb6\x05"
+			  "\xfe\x0c\xb9\x49\x2d\x90\x6d\xcc"
+			  "\x5d\x3f\xc1\xfe\x89\x0a\x2e\x2d"
+			  "\xa0\xa8\x89\x3b\x73\x39\xa5\x94"
+			  "\x4c\xa4\xa6\xbb\xa7\x14\x46\x89"
+			  "\x10\xff\xaf\xef\xca\xdd\x4f\x80"
+			  "\xb3\xdf\x3b\xab\xd4\xe5\x5a\xc7"
+			  "\x33\xca\x00\x8b\x8b\x3f\xea\xec"
+			  "\x68\x8a\xc2\x6d\xfd\xd4\x67\x0f"
+			  "\x22\x31\xe1\x0e\xfe\x5a\x04\xd5"
+			  "\x64\xa3\xf1\x1a\x76\x28\xcc\x35"
+			  "\x36\xa7\x0a\x74\xf7\x1c\x44\x9b"
+			  "\xc7\x1b\x53\x17\x02\xea\xd1\xad"
+			  "\x13\x51\x73\xc0\xa0\xb2\x05\x32"
+			  "\xa8\xa2\x37\x2e\xe1\x7a\x3a\x19"
+			  "\x26\xb4\x6c\x62\x5d\xb3\x1a\x1d"
+			  "\x59\xda\xee\x1a\x22\x18\xda\x0d"
+			  "\x88\x0f\x55\x8b\x72\x62\xfd\xc1"
+			  "\x69\x13\xcd\x0d\x5f\xc1\x09\x52"
+			  "\xee\xd6\xe3\x84\x4d\xee\xf6\x88"
+			  "\xaf\x83\xdc\x76\xf4\xc0\x93\x3f"
+			  "\x4a\x75\x2f\xb0\x0b\x3e\xc4\x54"
+			  "\x7d\x69\x8d\x00\x62\x77\x0d\x14"
+			  "\xbe\x7c\xa6\x7d\xc5\x24\x4f\xf3"
+			  "\x50\xf7\x5f\xf4\xc2\xca\x41\x97"
+			  "\x37\xbe\x75\x74\xcd\xf0\x75\x6e"
+			  "\x25\x23\x94\xbd\xda\x8d\xb0\xd4",
+		.rlen	= 512,
+	}, {
+		.key	= "\x27\x18\x28\x18\x28\x45\x90\x45"
+			  "\x23\x53\x60\x28\x74\x71\x35\x26"
+			  "\x62\x49\x77\x57\x24\x70\x93\x69"
+			  "\x99\x59\x57\x49\x66\x96\x76\x27",
+		.klen	= 32,
+		.iv	= "\xff\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.input	= "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17"
+			  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+			  "\x20\x21\x22\x23\x24\x25\x26\x27"
+			  "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+			  "\x30\x31\x32\x33\x34\x35\x36\x37"
+			  "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+			  "\x40\x41\x42\x43\x44\x45\x46\x47"
+			  "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+			  "\x50\x51\x52\x53\x54\x55\x56\x57"
+			  "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+			  "\x60\x61\x62\x63\x64\x65\x66\x67"
+			  "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+			  "\x70\x71\x72\x73\x74\x75\x76\x77"
+			  "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+			  "\x80\x81\x82\x83\x84\x85\x86\x87"
+			  "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+			  "\x90\x91\x92\x93\x94\x95\x96\x97"
+			  "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+			  "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+			  "\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+			  "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+			  "\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf"
+			  "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+			  "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf"
+			  "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7"
+			  "\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf"
+			  "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7"
+			  "\xe8\xe9\xea\xeb\xec\xed\xee\xef"
+			  "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7"
+			  "\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff"
+			  "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17"
+			  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+			  "\x20\x21\x22\x23\x24\x25\x26\x27"
+			  "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+			  "\x30\x31\x32\x33\x34\x35\x36\x37"
+			  "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+			  "\x40\x41\x42\x43\x44\x45\x46\x47"
+			  "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+			  "\x50\x51\x52\x53\x54\x55\x56\x57"
+			  "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+			  "\x60\x61\x62\x63\x64\x65\x66\x67"
+			  "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+			  "\x70\x71\x72\x73\x74\x75\x76\x77"
+			  "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+			  "\x80\x81\x82\x83\x84\x85\x86\x87"
+			  "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+			  "\x90\x91\x92\x93\x94\x95\x96\x97"
+			  "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+			  "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+			  "\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+			  "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+			  "\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf"
+			  "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+			  "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf"
+			  "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7"
+			  "\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf"
+			  "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7"
+			  "\xe8\xe9\xea\xeb\xec\xed\xee\xef"
+			  "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7"
+			  "\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff",
+		.ilen	= 512,
+		.result	= "\x55\xed\x71\xd3\x02\x8e\x15\x3b"
+			  "\xc6\x71\x29\x2d\x3e\x89\x9f\x59"
+			  "\x68\x6a\xcc\x8a\x56\x97\xf3\x95"
+			  "\x4e\x51\x08\xda\x2a\xf8\x6f\x3c"
+			  "\x78\x16\xea\x80\xdb\x33\x75\x94"
+			  "\xf9\x29\xc4\x2b\x76\x75\x97\xc7"
+			  "\xf2\x98\x2c\xf9\xff\xc8\xd5\x2b"
+			  "\x18\xf1\xaf\xcf\x7c\xc5\x0b\xee"
+			  "\xad\x3c\x76\x7c\xe6\x27\xa2\x2a"
+			  "\xe4\x66\xe1\xab\xa2\x39\xfc\x7c"
+			  "\xf5\xec\x32\x74\xa3\xb8\x03\x88"
+			  "\x52\xfc\x2e\x56\x3f\xa1\xf0\x9f"
+			  "\x84\x5e\x46\xed\x20\x89\xb6\x44"
+			  "\x8d\xd0\xed\x54\x47\x16\xbe\x95"
+			  "\x8a\xb3\x6b\x72\xc4\x32\x52\x13"
+			  "\x1b\xb0\x82\xbe\xac\xf9\x70\xa6"
+			  "\x44\x18\xdd\x8c\x6e\xca\x6e\x45"
+			  "\x8f\x1e\x10\x07\x57\x25\x98\x7b"
+			  "\x17\x8c\x78\xdd\x80\xa7\xd9\xd8"
+			  "\x63\xaf\xb9\x67\x57\xfd\xbc\xdb"
+			  "\x44\xe9\xc5\x65\xd1\xc7\x3b\xff"
+			  "\x20\xa0\x80\x1a\xc3\x9a\xad\x5e"
+			  "\x5d\x3b\xd3\x07\xd9\xf5\xfd\x3d"
+			  "\x4a\x8b\xa8\xd2\x6e\x7a\x51\x65"
+			  "\x6c\x8e\x95\xe0\x45\xc9\x5f\x4a"
+			  "\x09\x3c\x3d\x71\x7f\x0c\x84\x2a"
+			  "\xc8\x48\x52\x1a\xc2\xd5\xd6\x78"
+			  "\x92\x1e\xa0\x90\x2e\xea\xf0\xf3"
+			  "\xdc\x0f\xb1\xaf\x0d\x9b\x06\x2e"
+			  "\x35\x10\x30\x82\x0d\xe7\xc5\x9b"
+			  "\xde\x44\x18\xbd\x9f\xd1\x45\xa9"
+			  "\x7b\x7a\x4a\xad\x35\x65\x27\xca"
+			  "\xb2\xc3\xd4\x9b\x71\x86\x70\xee"
+			  "\xf1\x89\x3b\x85\x4b\x5b\xaa\xaf"
+			  "\xfc\x42\xc8\x31\x59\xbe\x16\x60"
+			  "\x4f\xf9\xfa\x12\xea\xd0\xa7\x14"
+			  "\xf0\x7a\xf3\xd5\x8d\xbd\x81\xef"
+			  "\x52\x7f\x29\x51\x94\x20\x67\x3c"
+			  "\xd1\xaf\x77\x9f\x22\x5a\x4e\x63"
+			  "\xe7\xff\x73\x25\xd1\xdd\x96\x8a"
+			  "\x98\x52\x6d\xf3\xac\x3e\xf2\x18"
+			  "\x6d\xf6\x0a\x29\xa6\x34\x3d\xed"
+			  "\xe3\x27\x0d\x9d\x0a\x02\x44\x7e"
+			  "\x5a\x7e\x67\x0f\x0a\x9e\xd6\xad"
+			  "\x91\xe6\x4d\x81\x8c\x5c\x59\xaa"
+			  "\xfb\xeb\x56\x53\xd2\x7d\x4c\x81"
+			  "\x65\x53\x0f\x41\x11\xbd\x98\x99"
+			  "\xf9\xc6\xfa\x51\x2e\xa3\xdd\x8d"
+			  "\x84\x98\xf9\x34\xed\x33\x2a\x1f"
+			  "\x82\xed\xc1\x73\x98\xd3\x02\xdc"
+			  "\xe6\xc2\x33\x1d\xa2\xb4\xca\x76"
+			  "\x63\x51\x34\x9d\x96\x12\xae\xce"
+			  "\x83\xc9\x76\x5e\xa4\x1b\x53\x37"
+			  "\x17\xd5\xc0\x80\x1d\x62\xf8\x3d"
+			  "\x54\x27\x74\xbb\x10\x86\x57\x46"
+			  "\x68\xe1\xed\x14\xe7\x9d\xfc\x84"
+			  "\x47\xbc\xc2\xf8\x19\x4b\x99\xcf"
+			  "\x7a\xe9\xc4\xb8\x8c\x82\x72\x4d"
+			  "\x7b\x4f\x38\x55\x36\x71\x64\xc1"
+			  "\xfc\x5c\x75\x52\x33\x02\x18\xf8"
+			  "\x17\xe1\x2b\xc2\x43\x39\xbd\x76"
+			  "\x9b\x63\x76\x32\x2f\x19\x72\x10"
+			  "\x9f\x21\x0c\xf1\x66\x50\x7f\xa5"
+			  "\x0d\x1f\x46\xe0\xba\xd3\x2f\x3c",
+		.rlen	= 512,
+		.also_non_np = 1,
+		.np	= 3,
+		.tap	= { 512 - 20, 4, 16 },
+	}
+};
+
+static const struct cipher_testvec speck64_xts_dec_tv_template[] = {
+	{
+		.key	= "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.klen	= 24,
+		.iv	= "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.input	= "\x84\xaf\x54\x07\x19\xd4\x7c\xa6"
+			  "\xe4\xfe\xdf\xc4\x1f\x34\xc3\xc2"
+			  "\x80\xf5\x72\xe7\xcd\xf0\x99\x22"
+			  "\x35\xa7\x2f\x06\xef\xdc\x51\xaa",
+		.ilen	= 32,
+		.result	= "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.rlen	= 32,
+	}, {
+		.key	= "\x11\x11\x11\x11\x11\x11\x11\x11"
+			  "\x11\x11\x11\x11\x11\x11\x11\x11"
+			  "\x22\x22\x22\x22\x22\x22\x22\x22",
+		.klen	= 24,
+		.iv	= "\x33\x33\x33\x33\x33\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.input	= "\x12\x56\x73\xcd\x15\x87\xa8\x59"
+			  "\xcf\x84\xae\xd9\x1c\x66\xd6\x9f"
+			  "\xb3\x12\x69\x7e\x36\xeb\x52\xff"
+			  "\x62\xdd\xba\x90\xb3\xe1\xee\x99",
+		.ilen	= 32,
+		.result	= "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44",
+		.rlen	= 32,
+	}, {
+		.key	= "\xff\xfe\xfd\xfc\xfb\xfa\xf9\xf8"
+			  "\xf7\xf6\xf5\xf4\xf3\xf2\xf1\xf0"
+			  "\x22\x22\x22\x22\x22\x22\x22\x22",
+		.klen	= 24,
+		.iv	= "\x33\x33\x33\x33\x33\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.input	= "\x15\x1b\xe4\x2c\xa2\x5a\x2d\x2c"
+			  "\x27\x36\xc0\xbf\x5d\xea\x36\x37"
+			  "\x2d\x1a\x88\xbc\x66\xb5\xd0\x0b"
+			  "\xa1\xbc\x19\xb2\x0f\x3b\x75\x34",
+		.ilen	= 32,
+		.result	= "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44",
+		.rlen	= 32,
+	}, {
+		.key	= "\x27\x18\x28\x18\x28\x45\x90\x45"
+			  "\x23\x53\x60\x28\x74\x71\x35\x26"
+			  "\x31\x41\x59\x26\x53\x58\x97\x93",
+		.klen	= 24,
+		.iv	= "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.input	= "\xaf\xa1\x81\xa6\x32\xbb\x15\x8e"
+			  "\xf8\x95\x2e\xd3\xe6\xee\x7e\x09"
+			  "\x0c\x1a\xf5\x02\x97\x8b\xe3\xb3"
+			  "\x11\xc7\x39\x96\xd0\x95\xf4\x56"
+			  "\xf4\xdd\x03\x38\x01\x44\x2c\xcf"
+			  "\x88\xae\x8e\x3c\xcd\xe7\xaa\x66"
+			  "\xfe\x3d\xc6\xfb\x01\x23\x51\x43"
+			  "\xd5\xd2\x13\x86\x94\x34\xe9\x62"
+			  "\xf9\x89\xe3\xd1\x7b\xbe\xf8\xef"
+			  "\x76\x35\x04\x3f\xdb\x23\x9d\x0b"
+			  "\x85\x42\xb9\x02\xd6\xcc\xdb\x96"
+			  "\xa7\x6b\x27\xb6\xd4\x45\x8f\x7d"
+			  "\xae\xd2\x04\xd5\xda\xc1\x7e\x24"
+			  "\x8c\x73\xbe\x48\x7e\xcf\x65\x28"
+			  "\x29\xe5\xbe\x54\x30\xcb\x46\x95"
+			  "\x4f\x2e\x8a\x36\xc8\x27\xc5\xbe"
+			  "\xd0\x1a\xaf\xab\x26\xcd\x9e\x69"
+			  "\xa1\x09\x95\x71\x26\xe9\xc4\xdf"
+			  "\xe6\x31\xc3\x46\xda\xaf\x0b\x41"
+			  "\x1f\xab\xb1\x8e\xd6\xfc\x0b\xb3"
+			  "\x82\xc0\x37\x27\xfc\x91\xa7\x05"
+			  "\xfb\xc5\xdc\x2b\x74\x96\x48\x43"
+			  "\x5d\x9c\x19\x0f\x60\x63\x3a\x1f"
+			  "\x6f\xf0\x03\xbe\x4d\xfd\xc8\x4a"
+			  "\xc6\xa4\x81\x6d\xc3\x12\x2a\x5c"
+			  "\x07\xff\xf3\x72\x74\x48\xb5\x40"
+			  "\x50\xb5\xdd\x90\x43\x31\x18\x15"
+			  "\x7b\xf2\xa6\xdb\x83\xc8\x4b\x4a"
+			  "\x29\x93\x90\x8b\xda\x07\xf0\x35"
+			  "\x6d\x90\x88\x09\x4e\x83\xf5\x5b"
+			  "\x94\x12\xbb\x33\x27\x1d\x3f\x23"
+			  "\x51\xa8\x7c\x07\xa2\xae\x77\xa6"
+			  "\x50\xfd\xcc\xc0\x4f\x80\x7a\x9f"
+			  "\x66\xdd\xcd\x75\x24\x8b\x33\xf7"
+			  "\x20\xdb\x83\x9b\x4f\x11\x63\x6e"
+			  "\xcf\x37\xef\xc9\x11\x01\x5c\x45"
+			  "\x32\x99\x7c\x3c\x9e\x42\x89\xe3"
+			  "\x70\x6d\x15\x9f\xb1\xe6\xb6\x05"
+			  "\xfe\x0c\xb9\x49\x2d\x90\x6d\xcc"
+			  "\x5d\x3f\xc1\xfe\x89\x0a\x2e\x2d"
+			  "\xa0\xa8\x89\x3b\x73\x39\xa5\x94"
+			  "\x4c\xa4\xa6\xbb\xa7\x14\x46\x89"
+			  "\x10\xff\xaf\xef\xca\xdd\x4f\x80"
+			  "\xb3\xdf\x3b\xab\xd4\xe5\x5a\xc7"
+			  "\x33\xca\x00\x8b\x8b\x3f\xea\xec"
+			  "\x68\x8a\xc2\x6d\xfd\xd4\x67\x0f"
+			  "\x22\x31\xe1\x0e\xfe\x5a\x04\xd5"
+			  "\x64\xa3\xf1\x1a\x76\x28\xcc\x35"
+			  "\x36\xa7\x0a\x74\xf7\x1c\x44\x9b"
+			  "\xc7\x1b\x53\x17\x02\xea\xd1\xad"
+			  "\x13\x51\x73\xc0\xa0\xb2\x05\x32"
+			  "\xa8\xa2\x37\x2e\xe1\x7a\x3a\x19"
+			  "\x26\xb4\x6c\x62\x5d\xb3\x1a\x1d"
+			  "\x59\xda\xee\x1a\x22\x18\xda\x0d"
+			  "\x88\x0f\x55\x8b\x72\x62\xfd\xc1"
+			  "\x69\x13\xcd\x0d\x5f\xc1\x09\x52"
+			  "\xee\xd6\xe3\x84\x4d\xee\xf6\x88"
+			  "\xaf\x83\xdc\x76\xf4\xc0\x93\x3f"
+			  "\x4a\x75\x2f\xb0\x0b\x3e\xc4\x54"
+			  "\x7d\x69\x8d\x00\x62\x77\x0d\x14"
+			  "\xbe\x7c\xa6\x7d\xc5\x24\x4f\xf3"
+			  "\x50\xf7\x5f\xf4\xc2\xca\x41\x97"
+			  "\x37\xbe\x75\x74\xcd\xf0\x75\x6e"
+			  "\x25\x23\x94\xbd\xda\x8d\xb0\xd4",
+		.ilen	= 512,
+		.result	= "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17"
+			  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+			  "\x20\x21\x22\x23\x24\x25\x26\x27"
+			  "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+			  "\x30\x31\x32\x33\x34\x35\x36\x37"
+			  "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+			  "\x40\x41\x42\x43\x44\x45\x46\x47"
+			  "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+			  "\x50\x51\x52\x53\x54\x55\x56\x57"
+			  "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+			  "\x60\x61\x62\x63\x64\x65\x66\x67"
+			  "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+			  "\x70\x71\x72\x73\x74\x75\x76\x77"
+			  "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+			  "\x80\x81\x82\x83\x84\x85\x86\x87"
+			  "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+			  "\x90\x91\x92\x93\x94\x95\x96\x97"
+			  "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+			  "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+			  "\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+			  "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+			  "\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf"
+			  "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+			  "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf"
+			  "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7"
+			  "\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf"
+			  "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7"
+			  "\xe8\xe9\xea\xeb\xec\xed\xee\xef"
+			  "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7"
+			  "\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff"
+			  "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17"
+			  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+			  "\x20\x21\x22\x23\x24\x25\x26\x27"
+			  "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+			  "\x30\x31\x32\x33\x34\x35\x36\x37"
+			  "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+			  "\x40\x41\x42\x43\x44\x45\x46\x47"
+			  "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+			  "\x50\x51\x52\x53\x54\x55\x56\x57"
+			  "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+			  "\x60\x61\x62\x63\x64\x65\x66\x67"
+			  "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+			  "\x70\x71\x72\x73\x74\x75\x76\x77"
+			  "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+			  "\x80\x81\x82\x83\x84\x85\x86\x87"
+			  "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+			  "\x90\x91\x92\x93\x94\x95\x96\x97"
+			  "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+			  "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+			  "\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+			  "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+			  "\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf"
+			  "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+			  "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf"
+			  "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7"
+			  "\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf"
+			  "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7"
+			  "\xe8\xe9\xea\xeb\xec\xed\xee\xef"
+			  "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7"
+			  "\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff",
+		.rlen	= 512,
+	}, {
+		.key	= "\x27\x18\x28\x18\x28\x45\x90\x45"
+			  "\x23\x53\x60\x28\x74\x71\x35\x26"
+			  "\x62\x49\x77\x57\x24\x70\x93\x69"
+			  "\x99\x59\x57\x49\x66\x96\x76\x27",
+		.klen	= 32,
+		.iv	= "\xff\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.input	= "\x55\xed\x71\xd3\x02\x8e\x15\x3b"
+			  "\xc6\x71\x29\x2d\x3e\x89\x9f\x59"
+			  "\x68\x6a\xcc\x8a\x56\x97\xf3\x95"
+			  "\x4e\x51\x08\xda\x2a\xf8\x6f\x3c"
+			  "\x78\x16\xea\x80\xdb\x33\x75\x94"
+			  "\xf9\x29\xc4\x2b\x76\x75\x97\xc7"
+			  "\xf2\x98\x2c\xf9\xff\xc8\xd5\x2b"
+			  "\x18\xf1\xaf\xcf\x7c\xc5\x0b\xee"
+			  "\xad\x3c\x76\x7c\xe6\x27\xa2\x2a"
+			  "\xe4\x66\xe1\xab\xa2\x39\xfc\x7c"
+			  "\xf5\xec\x32\x74\xa3\xb8\x03\x88"
+			  "\x52\xfc\x2e\x56\x3f\xa1\xf0\x9f"
+			  "\x84\x5e\x46\xed\x20\x89\xb6\x44"
+			  "\x8d\xd0\xed\x54\x47\x16\xbe\x95"
+			  "\x8a\xb3\x6b\x72\xc4\x32\x52\x13"
+			  "\x1b\xb0\x82\xbe\xac\xf9\x70\xa6"
+			  "\x44\x18\xdd\x8c\x6e\xca\x6e\x45"
+			  "\x8f\x1e\x10\x07\x57\x25\x98\x7b"
+			  "\x17\x8c\x78\xdd\x80\xa7\xd9\xd8"
+			  "\x63\xaf\xb9\x67\x57\xfd\xbc\xdb"
+			  "\x44\xe9\xc5\x65\xd1\xc7\x3b\xff"
+			  "\x20\xa0\x80\x1a\xc3\x9a\xad\x5e"
+			  "\x5d\x3b\xd3\x07\xd9\xf5\xfd\x3d"
+			  "\x4a\x8b\xa8\xd2\x6e\x7a\x51\x65"
+			  "\x6c\x8e\x95\xe0\x45\xc9\x5f\x4a"
+			  "\x09\x3c\x3d\x71\x7f\x0c\x84\x2a"
+			  "\xc8\x48\x52\x1a\xc2\xd5\xd6\x78"
+			  "\x92\x1e\xa0\x90\x2e\xea\xf0\xf3"
+			  "\xdc\x0f\xb1\xaf\x0d\x9b\x06\x2e"
+			  "\x35\x10\x30\x82\x0d\xe7\xc5\x9b"
+			  "\xde\x44\x18\xbd\x9f\xd1\x45\xa9"
+			  "\x7b\x7a\x4a\xad\x35\x65\x27\xca"
+			  "\xb2\xc3\xd4\x9b\x71\x86\x70\xee"
+			  "\xf1\x89\x3b\x85\x4b\x5b\xaa\xaf"
+			  "\xfc\x42\xc8\x31\x59\xbe\x16\x60"
+			  "\x4f\xf9\xfa\x12\xea\xd0\xa7\x14"
+			  "\xf0\x7a\xf3\xd5\x8d\xbd\x81\xef"
+			  "\x52\x7f\x29\x51\x94\x20\x67\x3c"
+			  "\xd1\xaf\x77\x9f\x22\x5a\x4e\x63"
+			  "\xe7\xff\x73\x25\xd1\xdd\x96\x8a"
+			  "\x98\x52\x6d\xf3\xac\x3e\xf2\x18"
+			  "\x6d\xf6\x0a\x29\xa6\x34\x3d\xed"
+			  "\xe3\x27\x0d\x9d\x0a\x02\x44\x7e"
+			  "\x5a\x7e\x67\x0f\x0a\x9e\xd6\xad"
+			  "\x91\xe6\x4d\x81\x8c\x5c\x59\xaa"
+			  "\xfb\xeb\x56\x53\xd2\x7d\x4c\x81"
+			  "\x65\x53\x0f\x41\x11\xbd\x98\x99"
+			  "\xf9\xc6\xfa\x51\x2e\xa3\xdd\x8d"
+			  "\x84\x98\xf9\x34\xed\x33\x2a\x1f"
+			  "\x82\xed\xc1\x73\x98\xd3\x02\xdc"
+			  "\xe6\xc2\x33\x1d\xa2\xb4\xca\x76"
+			  "\x63\x51\x34\x9d\x96\x12\xae\xce"
+			  "\x83\xc9\x76\x5e\xa4\x1b\x53\x37"
+			  "\x17\xd5\xc0\x80\x1d\x62\xf8\x3d"
+			  "\x54\x27\x74\xbb\x10\x86\x57\x46"
+			  "\x68\xe1\xed\x14\xe7\x9d\xfc\x84"
+			  "\x47\xbc\xc2\xf8\x19\x4b\x99\xcf"
+			  "\x7a\xe9\xc4\xb8\x8c\x82\x72\x4d"
+			  "\x7b\x4f\x38\x55\x36\x71\x64\xc1"
+			  "\xfc\x5c\x75\x52\x33\x02\x18\xf8"
+			  "\x17\xe1\x2b\xc2\x43\x39\xbd\x76"
+			  "\x9b\x63\x76\x32\x2f\x19\x72\x10"
+			  "\x9f\x21\x0c\xf1\x66\x50\x7f\xa5"
+			  "\x0d\x1f\x46\xe0\xba\xd3\x2f\x3c",
+		.ilen	= 512,
+		.result	= "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17"
+			  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+			  "\x20\x21\x22\x23\x24\x25\x26\x27"
+			  "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+			  "\x30\x31\x32\x33\x34\x35\x36\x37"
+			  "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+			  "\x40\x41\x42\x43\x44\x45\x46\x47"
+			  "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+			  "\x50\x51\x52\x53\x54\x55\x56\x57"
+			  "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+			  "\x60\x61\x62\x63\x64\x65\x66\x67"
+			  "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+			  "\x70\x71\x72\x73\x74\x75\x76\x77"
+			  "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+			  "\x80\x81\x82\x83\x84\x85\x86\x87"
+			  "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+			  "\x90\x91\x92\x93\x94\x95\x96\x97"
+			  "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+			  "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+			  "\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+			  "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+			  "\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf"
+			  "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+			  "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf"
+			  "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7"
+			  "\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf"
+			  "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7"
+			  "\xe8\xe9\xea\xeb\xec\xed\xee\xef"
+			  "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7"
+			  "\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff"
+			  "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17"
+			  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+			  "\x20\x21\x22\x23\x24\x25\x26\x27"
+			  "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+			  "\x30\x31\x32\x33\x34\x35\x36\x37"
+			  "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+			  "\x40\x41\x42\x43\x44\x45\x46\x47"
+			  "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+			  "\x50\x51\x52\x53\x54\x55\x56\x57"
+			  "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+			  "\x60\x61\x62\x63\x64\x65\x66\x67"
+			  "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+			  "\x70\x71\x72\x73\x74\x75\x76\x77"
+			  "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+			  "\x80\x81\x82\x83\x84\x85\x86\x87"
+			  "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+			  "\x90\x91\x92\x93\x94\x95\x96\x97"
+			  "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+			  "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+			  "\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+			  "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+			  "\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf"
+			  "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+			  "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf"
+			  "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7"
+			  "\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf"
+			  "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7"
+			  "\xe8\xe9\xea\xeb\xec\xed\xee\xef"
+			  "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7"
+			  "\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff",
+		.rlen	= 512,
+		.also_non_np = 1,
+		.np	= 3,
+		.tap	= { 512 - 20, 4, 16 },
+	}
+};
+
 /* Cast6 test vectors from RFC 2612 */
 static const struct cipher_testvec cast6_enc_tv_template[] = {
 	{
-- 
2.16.1.291.g4437f3f132-goog

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 0/5] crypto: Speck support
  2018-02-14 18:42 ` Eric Biggers
  (?)
@ 2018-02-22 15:13   ` Herbert Xu
  -1 siblings, 0 replies; 36+ messages in thread
From: Herbert Xu @ 2018-02-22 15:13 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Jeffrey Walton, Greg Kaiser, Ard Biesheuvel, Michael Halcrow,
	Patrik Torstensson, Alex Cope, Paul Lawrence, linux-fscrypt,
	linux-crypto, Greg Kroah-Hartman, linux-arm-kernel, Paul Crowley

On Wed, Feb 14, 2018 at 10:42:18AM -0800, Eric Biggers wrote:
> Hello,
> 
> This series adds Speck support to the crypto API, including the Speck128
> and Speck64 variants.  Speck is a lightweight block cipher that can be
> much faster than AES on processors that don't have AES instructions.
> 
> We are planning to offer Speck-XTS (probably Speck128/256-XTS) as an
> option for dm-crypt and fscrypt on Android, for low-end mobile devices
> with older CPUs such as ARMv7 which don't have the Cryptography
> Extensions.  Currently, such devices are unencrypted because AES is not
> fast enough, even when the NEON bit-sliced implementation of AES is
> used.  Other AES alternatives such as Twofish, Threefish, Camellia,
> CAST6, and Serpent aren't fast enough either; it seems that only a
> modern ARX cipher can provide sufficient performance on these devices.
> 
> This is a replacement for our original proposal
> (https://patchwork.kernel.org/patch/10101451/) which was to offer
> ChaCha20 for these devices.  However, the use of a stream cipher for
> disk/file encryption with no space to store nonces would have been much
> more insecure than we thought initially, given that it would be used on
> top of flash storage as well as potentially on top of F2FS, neither of
> which is guaranteed to overwrite data in-place.
> 
> Speck has been somewhat controversial due to its origin.  Nevertheless,
> it has a straightforward design (it's an ARX cipher), and it appears to
> be the leading software-optimized lightweight block cipher currently,
> with the most cryptanalysis.  It's also easy to implement without side
> channels, unlike AES.  Moreover, we only intend Speck to be used when
> the status quo is no encryption, due to AES not being fast enough.
> 
> We've also considered a novel length-preserving encryption mode based on
> ChaCha20 and Poly1305.  While theoretically attractive, such a mode
> would be a brand new crypto construction and would be more complicated
> and difficult to implement efficiently in comparison to Speck-XTS.
> 
> Thus, patch 1 adds a generic implementation of Speck, and the following
> patches add a 32-bit ARM NEON implementation of Speck-XTS.  The
> NEON-accelerated implementation is much faster than the generic
> implementation and therefore is the implementation that would primarily
> be used in practice on the devices we are targeting.
> 
> There is no AArch64 implementation included, since most such CPUs have
> the Cryptography Extensions, allowing the use of AES.  An AArch64
> implementation can be added later if there is interest though.
> 
> Changed since v2:
> 
>   - Fix __speck64_xts_crypt() to work on big endian CPUs.
> 
> Changed since v1:
> 
>   - Use the word order recommended by the Speck authors.  All test
>     vectors were updated.
> 
> Eric Biggers (5):
>   crypto: add support for the Speck block cipher
>   crypto: speck - export common helpers
>   crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS
>   crypto: speck - add test vectors for Speck128-XTS
>   crypto: speck - add test vectors for Speck64-XTS
> 
>  arch/arm/crypto/Kconfig           |    6 +
>  arch/arm/crypto/Makefile          |    2 +
>  arch/arm/crypto/speck-neon-core.S |  432 +++++++++
>  arch/arm/crypto/speck-neon-glue.c |  288 ++++++
>  crypto/Kconfig                    |   14 +
>  crypto/Makefile                   |    1 +
>  crypto/speck.c                    |  307 ++++++
>  crypto/testmgr.c                  |   36 +
>  crypto/testmgr.h                  | 1486 +++++++++++++++++++++++++++++
>  include/crypto/speck.h            |   62 ++
>  10 files changed, 2634 insertions(+)
>  create mode 100644 arch/arm/crypto/speck-neon-core.S
>  create mode 100644 arch/arm/crypto/speck-neon-glue.c
>  create mode 100644 crypto/speck.c
>  create mode 100644 include/crypto/speck.h

All applied.  Thanks.
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH v3 0/5] crypto: Speck support
@ 2018-02-22 15:13   ` Herbert Xu
  0 siblings, 0 replies; 36+ messages in thread
From: Herbert Xu @ 2018-02-22 15:13 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Feb 14, 2018 at 10:42:18AM -0800, Eric Biggers wrote:
> Hello,
> 
> This series adds Speck support to the crypto API, including the Speck128
> and Speck64 variants.  Speck is a lightweight block cipher that can be
> much faster than AES on processors that don't have AES instructions.
> 
> We are planning to offer Speck-XTS (probably Speck128/256-XTS) as an
> option for dm-crypt and fscrypt on Android, for low-end mobile devices
> with older CPUs such as ARMv7 which don't have the Cryptography
> Extensions.  Currently, such devices are unencrypted because AES is not
> fast enough, even when the NEON bit-sliced implementation of AES is
> used.  Other AES alternatives such as Twofish, Threefish, Camellia,
> CAST6, and Serpent aren't fast enough either; it seems that only a
> modern ARX cipher can provide sufficient performance on these devices.
> 
> This is a replacement for our original proposal
> (https://patchwork.kernel.org/patch/10101451/) which was to offer
> ChaCha20 for these devices.  However, the use of a stream cipher for
> disk/file encryption with no space to store nonces would have been much
> more insecure than we thought initially, given that it would be used on
> top of flash storage as well as potentially on top of F2FS, neither of
> which is guaranteed to overwrite data in-place.
> 
> Speck has been somewhat controversial due to its origin.  Nevertheless,
> it has a straightforward design (it's an ARX cipher), and it appears to
> be the leading software-optimized lightweight block cipher currently,
> with the most cryptanalysis.  It's also easy to implement without side
> channels, unlike AES.  Moreover, we only intend Speck to be used when
> the status quo is no encryption, due to AES not being fast enough.
> 
> We've also considered a novel length-preserving encryption mode based on
> ChaCha20 and Poly1305.  While theoretically attractive, such a mode
> would be a brand new crypto construction and would be more complicated
> and difficult to implement efficiently in comparison to Speck-XTS.
> 
> Thus, patch 1 adds a generic implementation of Speck, and the following
> patches add a 32-bit ARM NEON implementation of Speck-XTS.  The
> NEON-accelerated implementation is much faster than the generic
> implementation and therefore is the implementation that would primarily
> be used in practice on the devices we are targeting.
> 
> There is no AArch64 implementation included, since most such CPUs have
> the Cryptography Extensions, allowing the use of AES.  An AArch64
> implementation can be added later if there is interest though.
> 
> Changed since v2:
> 
>   - Fix __speck64_xts_crypt() to work on big endian CPUs.
> 
> Changed since v1:
> 
>   - Use the word order recommended by the Speck authors.  All test
>     vectors were updated.
> 
> Eric Biggers (5):
>   crypto: add support for the Speck block cipher
>   crypto: speck - export common helpers
>   crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS
>   crypto: speck - add test vectors for Speck128-XTS
>   crypto: speck - add test vectors for Speck64-XTS
> 
>  arch/arm/crypto/Kconfig           |    6 +
>  arch/arm/crypto/Makefile          |    2 +
>  arch/arm/crypto/speck-neon-core.S |  432 +++++++++
>  arch/arm/crypto/speck-neon-glue.c |  288 ++++++
>  crypto/Kconfig                    |   14 +
>  crypto/Makefile                   |    1 +
>  crypto/speck.c                    |  307 ++++++
>  crypto/testmgr.c                  |   36 +
>  crypto/testmgr.h                  | 1486 +++++++++++++++++++++++++++++
>  include/crypto/speck.h            |   62 ++
>  10 files changed, 2634 insertions(+)
>  create mode 100644 arch/arm/crypto/speck-neon-core.S
>  create mode 100644 arch/arm/crypto/speck-neon-glue.c
>  create mode 100644 crypto/speck.c
>  create mode 100644 include/crypto/speck.h

All applied.  Thanks.
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 3/5] crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS
  2018-02-14 18:42   ` Eric Biggers
  (?)
@ 2018-06-16 22:40     ` Stefan Agner
  -1 siblings, 0 replies; 36+ messages in thread
From: Stefan Agner @ 2018-06-16 22:40 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Jeffrey Walton, Greg Kaiser, Herbert Xu, Ard Biesheuvel,
	Michael Halcrow, Patrik Torstensson, Alex Cope, Paul Lawrence,
	linux-fscrypt, linux-crypto, Greg Kroah-Hartman,
	linux-crypto-owner, linux-arm-kernel, Paul Crowley

Hi Eric,

On 14.02.2018 19:42, Eric Biggers wrote:
> Add an ARM NEON-accelerated implementation of Speck-XTS.  It operates on
> 128-byte chunks at a time, i.e. 8 blocks for Speck128 or 16 blocks for
> Speck64.  Each 128-byte chunk goes through XTS preprocessing, then is
> encrypted/decrypted (doing one cipher round for all the blocks, then the
> next round, etc.), then goes through XTS postprocessing.
> 
> The performance depends on the processor but can be about 3 times faster
> than the generic code.  For example, on an ARMv7 processor we observe
> the following performance with Speck128/256-XTS:
> 
>     xts-speck128-neon:     Encryption 107.9 MB/s, Decryption 108.1 MB/s
>     xts(speck128-generic): Encryption  32.1 MB/s, Decryption  36.6 MB/s
> 
> In comparison to AES-256-XTS without the Cryptography Extensions:
> 
>     xts-aes-neonbs:        Encryption  41.2 MB/s, Decryption  36.7 MB/s
>     xts(aes-asm):          Encryption  31.7 MB/s, Decryption  30.8 MB/s
>     xts(aes-generic):      Encryption  21.2 MB/s, Decryption  20.9 MB/s
> 
> Speck64/128-XTS is even faster:
> 
>     xts-speck64-neon:      Encryption 138.6 MB/s, Decryption 139.1 MB/s
> 
> Note that as with the generic code, only the Speck128 and Speck64
> variants are supported.  Also, for now only the XTS mode of operation is
> supported, to target the disk and file encryption use cases.  The NEON
> code also only handles the portion of the data that is evenly divisible
> into 128-byte chunks, with any remainder handled by a C fallback.  Of
> course, other modes of operation could be added later if needed, and/or
> the NEON code could be updated to handle other buffer sizes.
> 
> The XTS specification is only defined for AES which has a 128-bit block
> size, so for the GF(2^64) math needed for Speck64-XTS we use the
> reducing polynomial 'x^64 + x^4 + x^3 + x + 1' given by the original XEX
> paper.  Of course, when possible users should use Speck128-XTS, but even
> that may be too slow on some processors; Speck64-XTS can be faster.
> 
> Signed-off-by: Eric Biggers <ebiggers@google.com>
> ---
>  arch/arm/crypto/Kconfig           |   6 +
>  arch/arm/crypto/Makefile          |   2 +
>  arch/arm/crypto/speck-neon-core.S | 432 ++++++++++++++++++++++++++++++
>  arch/arm/crypto/speck-neon-glue.c | 288 ++++++++++++++++++++
>  4 files changed, 728 insertions(+)
>  create mode 100644 arch/arm/crypto/speck-neon-core.S
>  create mode 100644 arch/arm/crypto/speck-neon-glue.c
> 
> diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
> index b8e69fe282b8..925d1364727a 100644
> --- a/arch/arm/crypto/Kconfig
> +++ b/arch/arm/crypto/Kconfig
> @@ -121,4 +121,10 @@ config CRYPTO_CHACHA20_NEON
>  	select CRYPTO_BLKCIPHER
>  	select CRYPTO_CHACHA20
>  
> +config CRYPTO_SPECK_NEON
> +	tristate "NEON accelerated Speck cipher algorithms"
> +	depends on KERNEL_MODE_NEON
> +	select CRYPTO_BLKCIPHER
> +	select CRYPTO_SPECK
> +
>  endif
> diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
> index 30ef8e291271..a758107c5525 100644
> --- a/arch/arm/crypto/Makefile
> +++ b/arch/arm/crypto/Makefile
> @@ -10,6 +10,7 @@ obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
>  obj-$(CONFIG_CRYPTO_SHA256_ARM) += sha256-arm.o
>  obj-$(CONFIG_CRYPTO_SHA512_ARM) += sha512-arm.o
>  obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha20-neon.o
> +obj-$(CONFIG_CRYPTO_SPECK_NEON) += speck-neon.o
>  
>  ce-obj-$(CONFIG_CRYPTO_AES_ARM_CE) += aes-arm-ce.o
>  ce-obj-$(CONFIG_CRYPTO_SHA1_ARM_CE) += sha1-arm-ce.o
> @@ -53,6 +54,7 @@ ghash-arm-ce-y	:= ghash-ce-core.o ghash-ce-glue.o
>  crct10dif-arm-ce-y	:= crct10dif-ce-core.o crct10dif-ce-glue.o
>  crc32-arm-ce-y:= crc32-ce-core.o crc32-ce-glue.o
>  chacha20-neon-y := chacha20-neon-core.o chacha20-neon-glue.o
> +speck-neon-y := speck-neon-core.o speck-neon-glue.o
>  
>  quiet_cmd_perl = PERL    $@
>        cmd_perl = $(PERL) $(<) > $(@)
> diff --git a/arch/arm/crypto/speck-neon-core.S b/arch/arm/crypto/speck-neon-core.S
> new file mode 100644
> index 000000000000..3c1e203e53b9
> --- /dev/null
> +++ b/arch/arm/crypto/speck-neon-core.S
> @@ -0,0 +1,432 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * NEON-accelerated implementation of Speck128-XTS and Speck64-XTS
> + *
> + * Copyright (c) 2018 Google, Inc
> + *
> + * Author: Eric Biggers <ebiggers@google.com>
> + */
> +
> +#include <linux/linkage.h>
> +
> +	.text
> +	.fpu		neon
> +
> +	// arguments
> +	ROUND_KEYS	.req	r0	// const {u64,u32} *round_keys
> +	NROUNDS		.req	r1	// int nrounds
> +	DST		.req	r2	// void *dst
> +	SRC		.req	r3	// const void *src
> +	NBYTES		.req	r4	// unsigned int nbytes
> +	TWEAK		.req	r5	// void *tweak
> +
> +	// registers which hold the data being encrypted/decrypted
> +	X0		.req	q0
> +	X0_L		.req	d0
> +	X0_H		.req	d1
> +	Y0		.req	q1
> +	Y0_H		.req	d3
> +	X1		.req	q2
> +	X1_L		.req	d4
> +	X1_H		.req	d5
> +	Y1		.req	q3
> +	Y1_H		.req	d7
> +	X2		.req	q4
> +	X2_L		.req	d8
> +	X2_H		.req	d9
> +	Y2		.req	q5
> +	Y2_H		.req	d11
> +	X3		.req	q6
> +	X3_L		.req	d12
> +	X3_H		.req	d13
> +	Y3		.req	q7
> +	Y3_H		.req	d15
> +
> +	// the round key, duplicated in all lanes
> +	ROUND_KEY	.req	q8
> +	ROUND_KEY_L	.req	d16
> +	ROUND_KEY_H	.req	d17
> +
> +	// index vector for vtbl-based 8-bit rotates
> +	ROTATE_TABLE	.req	d18
> +
> +	// multiplication table for updating XTS tweaks
> +	GF128MUL_TABLE	.req	d19
> +	GF64MUL_TABLE	.req	d19
> +
> +	// current XTS tweak value(s)
> +	TWEAKV		.req	q10
> +	TWEAKV_L	.req	d20
> +	TWEAKV_H	.req	d21
> +
> +	TMP0		.req	q12
> +	TMP0_L		.req	d24
> +	TMP0_H		.req	d25
> +	TMP1		.req	q13
> +	TMP2		.req	q14
> +	TMP3		.req	q15
> +
> +	.align		4
> +.Lror64_8_table:
> +	.byte		1, 2, 3, 4, 5, 6, 7, 0
> +.Lror32_8_table:
> +	.byte		1, 2, 3, 0, 5, 6, 7, 4
> +.Lrol64_8_table:
> +	.byte		7, 0, 1, 2, 3, 4, 5, 6
> +.Lrol32_8_table:
> +	.byte		3, 0, 1, 2, 7, 4, 5, 6
> +.Lgf128mul_table:
> +	.byte		0, 0x87
> +	.fill		14
> +.Lgf64mul_table:
> +	.byte		0, 0x1b, (0x1b << 1), (0x1b << 1) ^ 0x1b
> +	.fill		12
> +
> +/*
> + * _speck_round_128bytes() - Speck encryption round on 128 bytes at a time
> + *
> + * Do one Speck encryption round on the 128 bytes (8 blocks for Speck128, 16 for
> + * Speck64) stored in X0-X3 and Y0-Y3, using the round key stored in all lanes
> + * of ROUND_KEY.  'n' is the lane size: 64 for Speck128, or 32 for Speck64.
> + *
> + * The 8-bit rotates are implemented using vtbl instead of vshr + vsli because
> + * the vtbl approach is faster on some processors and the same speed on others.
> + */
> +.macro _speck_round_128bytes	n
> +
> +	// x = ror(x, 8)
> +	vtbl.8		X0_L, {X0_L}, ROTATE_TABLE
> +	vtbl.8		X0_H, {X0_H}, ROTATE_TABLE
> +	vtbl.8		X1_L, {X1_L}, ROTATE_TABLE
> +	vtbl.8		X1_H, {X1_H}, ROTATE_TABLE
> +	vtbl.8		X2_L, {X2_L}, ROTATE_TABLE
> +	vtbl.8		X2_H, {X2_H}, ROTATE_TABLE
> +	vtbl.8		X3_L, {X3_L}, ROTATE_TABLE
> +	vtbl.8		X3_H, {X3_H}, ROTATE_TABLE
> +
> +	// x += y
> +	vadd.u\n	X0, Y0
> +	vadd.u\n	X1, Y1
> +	vadd.u\n	X2, Y2
> +	vadd.u\n	X3, Y3
> +
> +	// x ^= k
> +	veor		X0, ROUND_KEY
> +	veor		X1, ROUND_KEY
> +	veor		X2, ROUND_KEY
> +	veor		X3, ROUND_KEY
> +
> +	// y = rol(y, 3)
> +	vshl.u\n	TMP0, Y0, #3
> +	vshl.u\n	TMP1, Y1, #3
> +	vshl.u\n	TMP2, Y2, #3
> +	vshl.u\n	TMP3, Y3, #3
> +	vsri.u\n	TMP0, Y0, #(\n - 3)
> +	vsri.u\n	TMP1, Y1, #(\n - 3)
> +	vsri.u\n	TMP2, Y2, #(\n - 3)
> +	vsri.u\n	TMP3, Y3, #(\n - 3)
> +
> +	// y ^= x
> +	veor		Y0, TMP0, X0
> +	veor		Y1, TMP1, X1
> +	veor		Y2, TMP2, X2
> +	veor		Y3, TMP3, X3
> +.endm
> +
> +/*
> + * _speck_unround_128bytes() - Speck decryption round on 128 bytes at a time
> + *
> + * This is the inverse of _speck_round_128bytes().
> + */
> +.macro _speck_unround_128bytes	n
> +
> +	// y ^= x
> +	veor		TMP0, Y0, X0
> +	veor		TMP1, Y1, X1
> +	veor		TMP2, Y2, X2
> +	veor		TMP3, Y3, X3
> +
> +	// y = ror(y, 3)
> +	vshr.u\n	Y0, TMP0, #3
> +	vshr.u\n	Y1, TMP1, #3
> +	vshr.u\n	Y2, TMP2, #3
> +	vshr.u\n	Y3, TMP3, #3
> +	vsli.u\n	Y0, TMP0, #(\n - 3)
> +	vsli.u\n	Y1, TMP1, #(\n - 3)
> +	vsli.u\n	Y2, TMP2, #(\n - 3)
> +	vsli.u\n	Y3, TMP3, #(\n - 3)
> +
> +	// x ^= k
> +	veor		X0, ROUND_KEY
> +	veor		X1, ROUND_KEY
> +	veor		X2, ROUND_KEY
> +	veor		X3, ROUND_KEY
> +
> +	// x -= y
> +	vsub.u\n	X0, Y0
> +	vsub.u\n	X1, Y1
> +	vsub.u\n	X2, Y2
> +	vsub.u\n	X3, Y3
> +
> +	// x = rol(x, 8);
> +	vtbl.8		X0_L, {X0_L}, ROTATE_TABLE
> +	vtbl.8		X0_H, {X0_H}, ROTATE_TABLE
> +	vtbl.8		X1_L, {X1_L}, ROTATE_TABLE
> +	vtbl.8		X1_H, {X1_H}, ROTATE_TABLE
> +	vtbl.8		X2_L, {X2_L}, ROTATE_TABLE
> +	vtbl.8		X2_H, {X2_H}, ROTATE_TABLE
> +	vtbl.8		X3_L, {X3_L}, ROTATE_TABLE
> +	vtbl.8		X3_H, {X3_H}, ROTATE_TABLE
> +.endm
> +
> +.macro _xts128_precrypt_one	dst_reg, tweak_buf, tmp
> +
> +	// Load the next source block
> +	vld1.8		{\dst_reg}, [SRC]!
> +
> +	// Save the current tweak in the tweak buffer
> +	vst1.8		{TWEAKV}, [\tweak_buf:128]!
> +
> +	// XOR the next source block with the current tweak
> +	veor		\dst_reg, TWEAKV
> +
> +	/*
> +	 * Calculate the next tweak by multiplying the current one by x,
> +	 * modulo p(x) = x^128 + x^7 + x^2 + x + 1.
> +	 */
> +	vshr.u64	\tmp, TWEAKV, #63
> +	vshl.u64	TWEAKV, #1
> +	veor		TWEAKV_H, \tmp\()_L
> +	vtbl.8		\tmp\()_H, {GF128MUL_TABLE}, \tmp\()_H
> +	veor		TWEAKV_L, \tmp\()_H
> +.endm
> +
> +.macro _xts64_precrypt_two	dst_reg, tweak_buf, tmp
> +
> +	// Load the next two source blocks
> +	vld1.8		{\dst_reg}, [SRC]!
> +
> +	// Save the current two tweaks in the tweak buffer
> +	vst1.8		{TWEAKV}, [\tweak_buf:128]!
> +
> +	// XOR the next two source blocks with the current two tweaks
> +	veor		\dst_reg, TWEAKV
> +
> +	/*
> +	 * Calculate the next two tweaks by multiplying the current ones by x^2,
> +	 * modulo p(x) = x^64 + x^4 + x^3 + x + 1.
> +	 */
> +	vshr.u64	\tmp, TWEAKV, #62
> +	vshl.u64	TWEAKV, #2
> +	vtbl.8		\tmp\()_L, {GF64MUL_TABLE}, \tmp\()_L
> +	vtbl.8		\tmp\()_H, {GF64MUL_TABLE}, \tmp\()_H
> +	veor		TWEAKV, \tmp
> +.endm
> +
> +/*
> + * _speck_xts_crypt() - Speck-XTS encryption/decryption
> + *
> + * Encrypt or decrypt NBYTES bytes of data from the SRC buffer to the DST buffer
> + * using Speck-XTS, specifically the variant with a block size of '2n' and round
> + * count given by NROUNDS.  The expanded round keys are given in ROUND_KEYS, and
> + * the current XTS tweak value is given in TWEAK.  It's assumed that NBYTES is a
> + * nonzero multiple of 128.
> + */
> +.macro _speck_xts_crypt	n, decrypting
> +	push		{r4-r7}
> +	mov		r7, sp
> +
> +	/*
> +	 * The first four parameters were passed in registers r0-r3.  Load the
> +	 * additional parameters, which were passed on the stack.
> +	 */
> +	ldr		NBYTES, [sp, #16]
> +	ldr		TWEAK, [sp, #20]
> +
> +	/*
> +	 * If decrypting, modify the ROUND_KEYS parameter to point to the last
> +	 * round key rather than the first, since for decryption the round keys
> +	 * are used in reverse order.
> +	 */
> +.if \decrypting
> +.if \n == 64
> +	add		ROUND_KEYS, ROUND_KEYS, NROUNDS, lsl #3
> +	sub		ROUND_KEYS, #8
> +.else
> +	add		ROUND_KEYS, ROUND_KEYS, NROUNDS, lsl #2
> +	sub		ROUND_KEYS, #4
> +.endif
> +.endif
> +
> +	// Load the index vector for vtbl-based 8-bit rotates
> +.if \decrypting
> +	ldr		r12, =.Lrol\n\()_8_table
> +.else
> +	ldr		r12, =.Lror\n\()_8_table
> +.endif
> +	vld1.8		{ROTATE_TABLE}, [r12:64]
> +
> +	// One-time XTS preparation
> +
> +	/*
> +	 * Allocate stack space to store 128 bytes worth of tweaks.  For
> +	 * performance, this space is aligned to a 16-byte boundary so that we
> +	 * can use the load/store instructions that declare 16-byte alignment.
> +	 */
> +	sub		sp, #128
> +	bic		sp, #0xf


This fails here when building with CONFIG_THUMB2_KERNEL=y

  AS      arch/arm/crypto/speck-neon-core.o

arch/arm/crypto/speck-neon-core.S: Assembler messages:

arch/arm/crypto/speck-neon-core.S:419: Error: r13 not allowed here -- `bic sp,#0xf'
arch/arm/crypto/speck-neon-core.S:423: Error: r13 not allowed here -- `bic sp,#0xf'
arch/arm/crypto/speck-neon-core.S:427: Error: r13 not allowed here -- `bic sp,#0xf'
arch/arm/crypto/speck-neon-core.S:431: Error: r13 not allowed here -- `bic sp,#0xf'

As a quick hack, the following change seems to address it:


-       sub             sp, #128
-       bic             sp, #0xf
+       mov             r6, sp
+       sub             r6, #128
+       bic             r6, #0xf
+       mov             sp, r6

But there is probably a better solution to address this.
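
For instance, since most Thumb-2 data-processing encodings cannot use sp
as the destination register, one option might be to do the arithmetic in
r12 (which this macro only reuses later, for the literal-pool loads and
as a scratch pointer) and then move the result into sp.  A minimal,
untested sketch:

	// Allocate 128 bytes of 16-byte-aligned stack space without
	// making sp the destination of 'bic', which Thumb-2 does not allow
	sub		r12, sp, #128
	bic		r12, #0xf
	mov		sp, r12

Either way the idea is the same: keep the read-modify-write of the stack
pointer out of the instruction forms that refuse sp.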

--
Stefan


> +
> +.if \n == 64
> +	// Load first tweak
> +	vld1.8		{TWEAKV}, [TWEAK]
> +
> +	// Load GF(2^128) multiplication table
> +	ldr		r12, =.Lgf128mul_table
> +	vld1.8		{GF128MUL_TABLE}, [r12:64]
> +.else
> +	// Load first tweak
> +	vld1.8		{TWEAKV_L}, [TWEAK]
> +
> +	// Load GF(2^64) multiplication table
> +	ldr		r12, =.Lgf64mul_table
> +	vld1.8		{GF64MUL_TABLE}, [r12:64]
> +
> +	// Calculate second tweak, packing it together with the first
> +	vshr.u64	TMP0_L, TWEAKV_L, #63
> +	vtbl.u8		TMP0_L, {GF64MUL_TABLE}, TMP0_L
> +	vshl.u64	TWEAKV_H, TWEAKV_L, #1
> +	veor		TWEAKV_H, TMP0_L
> +.endif
> +
> +.Lnext_128bytes_\@:
> +
> +	/*
> +	 * Load the source blocks into {X,Y}[0-3], XOR them with their XTS tweak
> +	 * values, and save the tweaks on the stack for later.  Then
> +	 * de-interleave the 'x' and 'y' elements of each block, i.e. make it so
> +	 * that the X[0-3] registers contain only the second halves of blocks,
> +	 * and the Y[0-3] registers contain only the first halves of blocks.
> +	 * (Speck uses the order (y, x) rather than the more intuitive (x, y).)
> +	 */
> +	mov		r12, sp
> +.if \n == 64
> +	_xts128_precrypt_one	X0, r12, TMP0
> +	_xts128_precrypt_one	Y0, r12, TMP0
> +	_xts128_precrypt_one	X1, r12, TMP0
> +	_xts128_precrypt_one	Y1, r12, TMP0
> +	_xts128_precrypt_one	X2, r12, TMP0
> +	_xts128_precrypt_one	Y2, r12, TMP0
> +	_xts128_precrypt_one	X3, r12, TMP0
> +	_xts128_precrypt_one	Y3, r12, TMP0
> +	vswp		X0_L, Y0_H
> +	vswp		X1_L, Y1_H
> +	vswp		X2_L, Y2_H
> +	vswp		X3_L, Y3_H
> +.else
> +	_xts64_precrypt_two	X0, r12, TMP0
> +	_xts64_precrypt_two	Y0, r12, TMP0
> +	_xts64_precrypt_two	X1, r12, TMP0
> +	_xts64_precrypt_two	Y1, r12, TMP0
> +	_xts64_precrypt_two	X2, r12, TMP0
> +	_xts64_precrypt_two	Y2, r12, TMP0
> +	_xts64_precrypt_two	X3, r12, TMP0
> +	_xts64_precrypt_two	Y3, r12, TMP0
> +	vuzp.32		Y0, X0
> +	vuzp.32		Y1, X1
> +	vuzp.32		Y2, X2
> +	vuzp.32		Y3, X3
> +.endif
> +
> +	// Do the cipher rounds
> +
> +	mov		r12, ROUND_KEYS
> +	mov		r6, NROUNDS
> +
> +.Lnext_round_\@:
> +.if \decrypting
> +.if \n == 64
> +	vld1.64		ROUND_KEY_L, [r12]
> +	sub		r12, #8
> +	vmov		ROUND_KEY_H, ROUND_KEY_L
> +.else
> +	vld1.32		{ROUND_KEY_L[],ROUND_KEY_H[]}, [r12]
> +	sub		r12, #4
> +.endif
> +	_speck_unround_128bytes	\n
> +.else
> +.if \n == 64
> +	vld1.64		ROUND_KEY_L, [r12]!
> +	vmov		ROUND_KEY_H, ROUND_KEY_L
> +.else
> +	vld1.32		{ROUND_KEY_L[],ROUND_KEY_H[]}, [r12]!
> +.endif
> +	_speck_round_128bytes	\n
> +.endif
> +	subs		r6, r6, #1
> +	bne		.Lnext_round_\@
> +
> +	// Re-interleave the 'x' and 'y' elements of each block
> +.if \n == 64
> +	vswp		X0_L, Y0_H
> +	vswp		X1_L, Y1_H
> +	vswp		X2_L, Y2_H
> +	vswp		X3_L, Y3_H
> +.else
> +	vzip.32		Y0, X0
> +	vzip.32		Y1, X1
> +	vzip.32		Y2, X2
> +	vzip.32		Y3, X3
> +.endif
> +
> +	// XOR the encrypted/decrypted blocks with the tweaks we saved earlier
> +	mov		r12, sp
> +	vld1.8		{TMP0, TMP1}, [r12:128]!
> +	vld1.8		{TMP2, TMP3}, [r12:128]!
> +	veor		X0, TMP0
> +	veor		Y0, TMP1
> +	veor		X1, TMP2
> +	veor		Y1, TMP3
> +	vld1.8		{TMP0, TMP1}, [r12:128]!
> +	vld1.8		{TMP2, TMP3}, [r12:128]!
> +	veor		X2, TMP0
> +	veor		Y2, TMP1
> +	veor		X3, TMP2
> +	veor		Y3, TMP3
> +
> +	// Store the ciphertext in the destination buffer
> +	vst1.8		{X0, Y0}, [DST]!
> +	vst1.8		{X1, Y1}, [DST]!
> +	vst1.8		{X2, Y2}, [DST]!
> +	vst1.8		{X3, Y3}, [DST]!
> +
> +	// Continue if there are more 128-byte chunks remaining, else return
> +	subs		NBYTES, #128
> +	bne		.Lnext_128bytes_\@
> +
> +	// Store the next tweak
> +.if \n == 64
> +	vst1.8		{TWEAKV}, [TWEAK]
> +.else
> +	vst1.8		{TWEAKV_L}, [TWEAK]
> +.endif
> +
> +	mov		sp, r7
> +	pop		{r4-r7}
> +	bx		lr
> +.endm
> +
> +ENTRY(speck128_xts_encrypt_neon)
> +	_speck_xts_crypt	n=64, decrypting=0
> +ENDPROC(speck128_xts_encrypt_neon)
> +
> +ENTRY(speck128_xts_decrypt_neon)
> +	_speck_xts_crypt	n=64, decrypting=1
> +ENDPROC(speck128_xts_decrypt_neon)
> +
> +ENTRY(speck64_xts_encrypt_neon)
> +	_speck_xts_crypt	n=32, decrypting=0
> +ENDPROC(speck64_xts_encrypt_neon)
> +
> +ENTRY(speck64_xts_decrypt_neon)
> +	_speck_xts_crypt	n=32, decrypting=1
> +ENDPROC(speck64_xts_decrypt_neon)
> diff --git a/arch/arm/crypto/speck-neon-glue.c b/arch/arm/crypto/speck-neon-glue.c
> new file mode 100644
> index 000000000000..f012c3ea998f
> --- /dev/null
> +++ b/arch/arm/crypto/speck-neon-glue.c
> @@ -0,0 +1,288 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * NEON-accelerated implementation of Speck128-XTS and Speck64-XTS
> + *
> + * Copyright (c) 2018 Google, Inc
> + *
> + * Note: the NIST recommendation for XTS only specifies a 128-bit block size,
> + * but a 64-bit version (needed for Speck64) is fairly straightforward; the math
> + * is just done in GF(2^64) instead of GF(2^128), with the reducing polynomial
> + * x^64 + x^4 + x^3 + x + 1 from the original XEX paper (Rogaway, 2004:
> + * "Efficient Instantiations of Tweakable Blockciphers and Refinements to Modes
> + * OCB and PMAC"), represented as 0x1B.
> + */
> +
> +#include <asm/hwcap.h>
> +#include <asm/neon.h>
> +#include <asm/simd.h>
> +#include <crypto/algapi.h>
> +#include <crypto/gf128mul.h>
> +#include <crypto/internal/skcipher.h>
> +#include <crypto/speck.h>
> +#include <crypto/xts.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +
> +/* The assembly functions only handle multiples of 128 bytes */
> +#define SPECK_NEON_CHUNK_SIZE	128
> +
> +/* Speck128 */
> +
> +struct speck128_xts_tfm_ctx {
> +	struct speck128_tfm_ctx main_key;
> +	struct speck128_tfm_ctx tweak_key;
> +};
> +
> +asmlinkage void speck128_xts_encrypt_neon(const u64 *round_keys, int nrounds,
> +					  void *dst, const void *src,
> +					  unsigned int nbytes, void *tweak);
> +
> +asmlinkage void speck128_xts_decrypt_neon(const u64 *round_keys, int nrounds,
> +					  void *dst, const void *src,
> +					  unsigned int nbytes, void *tweak);
> +
> +typedef void (*speck128_crypt_one_t)(const struct speck128_tfm_ctx *,
> +				     u8 *, const u8 *);
> +typedef void (*speck128_xts_crypt_many_t)(const u64 *, int, void *,
> +					  const void *, unsigned int, void *);
> +
> +static __always_inline int
> +__speck128_xts_crypt(struct skcipher_request *req,
> +		     speck128_crypt_one_t crypt_one,
> +		     speck128_xts_crypt_many_t crypt_many)
> +{
> +	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> +	const struct speck128_xts_tfm_ctx *ctx = crypto_skcipher_ctx(tfm);
> +	struct skcipher_walk walk;
> +	le128 tweak;
> +	int err;
> +
> +	err = skcipher_walk_virt(&walk, req, true);
> +
> +	crypto_speck128_encrypt(&ctx->tweak_key, (u8 *)&tweak, walk.iv);
> +
> +	while (walk.nbytes > 0) {
> +		unsigned int nbytes = walk.nbytes;
> +		u8 *dst = walk.dst.virt.addr;
> +		const u8 *src = walk.src.virt.addr;
> +
> +		if (nbytes >= SPECK_NEON_CHUNK_SIZE && may_use_simd()) {
> +			unsigned int count;
> +
> +			count = round_down(nbytes, SPECK_NEON_CHUNK_SIZE);
> +			kernel_neon_begin();
> +			(*crypt_many)(ctx->main_key.round_keys,
> +				      ctx->main_key.nrounds,
> +				      dst, src, count, &tweak);
> +			kernel_neon_end();
> +			dst += count;
> +			src += count;
> +			nbytes -= count;
> +		}
> +
> +		/* Handle any remainder with generic code */
> +		while (nbytes >= sizeof(tweak)) {
> +			le128_xor((le128 *)dst, (const le128 *)src, &tweak);
> +			(*crypt_one)(&ctx->main_key, dst, dst);
> +			le128_xor((le128 *)dst, (const le128 *)dst, &tweak);
> +			gf128mul_x_ble(&tweak, &tweak);
> +
> +			dst += sizeof(tweak);
> +			src += sizeof(tweak);
> +			nbytes -= sizeof(tweak);
> +		}
> +		err = skcipher_walk_done(&walk, nbytes);
> +	}
> +
> +	return err;
> +}
> +
> +static int speck128_xts_encrypt(struct skcipher_request *req)
> +{
> +	return __speck128_xts_crypt(req, crypto_speck128_encrypt,
> +				    speck128_xts_encrypt_neon);
> +}
> +
> +static int speck128_xts_decrypt(struct skcipher_request *req)
> +{
> +	return __speck128_xts_crypt(req, crypto_speck128_decrypt,
> +				    speck128_xts_decrypt_neon);
> +}
> +
> +static int speck128_xts_setkey(struct crypto_skcipher *tfm, const u8 *key,
> +			       unsigned int keylen)
> +{
> +	struct speck128_xts_tfm_ctx *ctx = crypto_skcipher_ctx(tfm);
> +	int err;
> +
> +	err = xts_verify_key(tfm, key, keylen);
> +	if (err)
> +		return err;
> +
> +	keylen /= 2;
> +
> +	err = crypto_speck128_setkey(&ctx->main_key, key, keylen);
> +	if (err)
> +		return err;
> +
> +	return crypto_speck128_setkey(&ctx->tweak_key, key + keylen, keylen);
> +}
> +
> +/* Speck64 */
> +
> +struct speck64_xts_tfm_ctx {
> +	struct speck64_tfm_ctx main_key;
> +	struct speck64_tfm_ctx tweak_key;
> +};
> +
> +asmlinkage void speck64_xts_encrypt_neon(const u32 *round_keys, int nrounds,
> +					 void *dst, const void *src,
> +					 unsigned int nbytes, void *tweak);
> +
> +asmlinkage void speck64_xts_decrypt_neon(const u32 *round_keys, int nrounds,
> +					 void *dst, const void *src,
> +					 unsigned int nbytes, void *tweak);
> +
> +typedef void (*speck64_crypt_one_t)(const struct speck64_tfm_ctx *,
> +				    u8 *, const u8 *);
> +typedef void (*speck64_xts_crypt_many_t)(const u32 *, int, void *,
> +					 const void *, unsigned int, void *);
> +
> +static __always_inline int
> +__speck64_xts_crypt(struct skcipher_request *req, speck64_crypt_one_t crypt_one,
> +		    speck64_xts_crypt_many_t crypt_many)
> +{
> +	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> +	const struct speck64_xts_tfm_ctx *ctx = crypto_skcipher_ctx(tfm);
> +	struct skcipher_walk walk;
> +	__le64 tweak;
> +	int err;
> +
> +	err = skcipher_walk_virt(&walk, req, true);
> +
> +	crypto_speck64_encrypt(&ctx->tweak_key, (u8 *)&tweak, walk.iv);
> +
> +	while (walk.nbytes > 0) {
> +		unsigned int nbytes = walk.nbytes;
> +		u8 *dst = walk.dst.virt.addr;
> +		const u8 *src = walk.src.virt.addr;
> +
> +		if (nbytes >= SPECK_NEON_CHUNK_SIZE && may_use_simd()) {
> +			unsigned int count;
> +
> +			count = round_down(nbytes, SPECK_NEON_CHUNK_SIZE);
> +			kernel_neon_begin();
> +			(*crypt_many)(ctx->main_key.round_keys,
> +				      ctx->main_key.nrounds,
> +				      dst, src, count, &tweak);
> +			kernel_neon_end();
> +			dst += count;
> +			src += count;
> +			nbytes -= count;
> +		}
> +
> +		/* Handle any remainder with generic code */
> +		while (nbytes >= sizeof(tweak)) {
> +			*(__le64 *)dst = *(__le64 *)src ^ tweak;
> +			(*crypt_one)(&ctx->main_key, dst, dst);
> +			*(__le64 *)dst ^= tweak;
> +			tweak = cpu_to_le64((le64_to_cpu(tweak) << 1) ^
> +					    ((tweak & cpu_to_le64(1ULL << 63)) ?
> +					     0x1B : 0));
> +			dst += sizeof(tweak);
> +			src += sizeof(tweak);
> +			nbytes -= sizeof(tweak);
> +		}
> +		err = skcipher_walk_done(&walk, nbytes);
> +	}
> +
> +	return err;
> +}
> +
> +static int speck64_xts_encrypt(struct skcipher_request *req)
> +{
> +	return __speck64_xts_crypt(req, crypto_speck64_encrypt,
> +				   speck64_xts_encrypt_neon);
> +}
> +
> +static int speck64_xts_decrypt(struct skcipher_request *req)
> +{
> +	return __speck64_xts_crypt(req, crypto_speck64_decrypt,
> +				   speck64_xts_decrypt_neon);
> +}
> +
> +static int speck64_xts_setkey(struct crypto_skcipher *tfm, const u8 *key,
> +			      unsigned int keylen)
> +{
> +	struct speck64_xts_tfm_ctx *ctx = crypto_skcipher_ctx(tfm);
> +	int err;
> +
> +	err = xts_verify_key(tfm, key, keylen);
> +	if (err)
> +		return err;
> +
> +	keylen /= 2;
> +
> +	err = crypto_speck64_setkey(&ctx->main_key, key, keylen);
> +	if (err)
> +		return err;
> +
> +	return crypto_speck64_setkey(&ctx->tweak_key, key + keylen, keylen);
> +}
> +
> +static struct skcipher_alg speck_algs[] = {
> +	{
> +		.base.cra_name		= "xts(speck128)",
> +		.base.cra_driver_name	= "xts-speck128-neon",
> +		.base.cra_priority	= 300,
> +		.base.cra_blocksize	= SPECK128_BLOCK_SIZE,
> +		.base.cra_ctxsize	= sizeof(struct speck128_xts_tfm_ctx),
> +		.base.cra_alignmask	= 7,
> +		.base.cra_module	= THIS_MODULE,
> +		.min_keysize		= 2 * SPECK128_128_KEY_SIZE,
> +		.max_keysize		= 2 * SPECK128_256_KEY_SIZE,
> +		.ivsize			= SPECK128_BLOCK_SIZE,
> +		.walksize		= SPECK_NEON_CHUNK_SIZE,
> +		.setkey			= speck128_xts_setkey,
> +		.encrypt		= speck128_xts_encrypt,
> +		.decrypt		= speck128_xts_decrypt,
> +	}, {
> +		.base.cra_name		= "xts(speck64)",
> +		.base.cra_driver_name	= "xts-speck64-neon",
> +		.base.cra_priority	= 300,
> +		.base.cra_blocksize	= SPECK64_BLOCK_SIZE,
> +		.base.cra_ctxsize	= sizeof(struct speck64_xts_tfm_ctx),
> +		.base.cra_alignmask	= 7,
> +		.base.cra_module	= THIS_MODULE,
> +		.min_keysize		= 2 * SPECK64_96_KEY_SIZE,
> +		.max_keysize		= 2 * SPECK64_128_KEY_SIZE,
> +		.ivsize			= SPECK64_BLOCK_SIZE,
> +		.walksize		= SPECK_NEON_CHUNK_SIZE,
> +		.setkey			= speck64_xts_setkey,
> +		.encrypt		= speck64_xts_encrypt,
> +		.decrypt		= speck64_xts_decrypt,
> +	}
> +};
> +
> +static int __init speck_neon_module_init(void)
> +{
> +	if (!(elf_hwcap & HWCAP_NEON))
> +		return -ENODEV;
> +	return crypto_register_skciphers(speck_algs, ARRAY_SIZE(speck_algs));
> +}
> +
> +static void __exit speck_neon_module_exit(void)
> +{
> +	crypto_unregister_skciphers(speck_algs, ARRAY_SIZE(speck_algs));
> +}
> +
> +module_init(speck_neon_module_init);
> +module_exit(speck_neon_module_exit);
> +
> +MODULE_DESCRIPTION("Speck block cipher (NEON-accelerated)");
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Eric Biggers <ebiggers@google.com>");
> +MODULE_ALIAS_CRYPTO("xts(speck128)");
> +MODULE_ALIAS_CRYPTO("xts-speck128-neon");
> +MODULE_ALIAS_CRYPTO("xts(speck64)");
> +MODULE_ALIAS_CRYPTO("xts-speck64-neon");

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH v3 3/5] crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS
@ 2018-06-16 22:40     ` Stefan Agner
  0 siblings, 0 replies; 36+ messages in thread
From: Stefan Agner @ 2018-06-16 22:40 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Eric,

On 14.02.2018 19:42, Eric Biggers wrote:
> Add an ARM NEON-accelerated implementation of Speck-XTS.  It operates on
> 128-byte chunks at a time, i.e. 8 blocks for Speck128 or 16 blocks for
> Speck64.  Each 128-byte chunk goes through XTS preprocessing, then is
> encrypted/decrypted (doing one cipher round for all the blocks, then the
> next round, etc.), then goes through XTS postprocessing.
> 
> The performance depends on the processor but can be about 3 times faster
> than the generic code.  For example, on an ARMv7 processor we observe
> the following performance with Speck128/256-XTS:
> 
>     xts-speck128-neon:     Encryption 107.9 MB/s, Decryption 108.1 MB/s
>     xts(speck128-generic): Encryption  32.1 MB/s, Decryption  36.6 MB/s
> 
> In comparison to AES-256-XTS without the Cryptography Extensions:
> 
>     xts-aes-neonbs:        Encryption  41.2 MB/s, Decryption  36.7 MB/s
>     xts(aes-asm):          Encryption  31.7 MB/s, Decryption  30.8 MB/s
>     xts(aes-generic):      Encryption  21.2 MB/s, Decryption  20.9 MB/s
> 
> Speck64/128-XTS is even faster:
> 
>     xts-speck64-neon:      Encryption 138.6 MB/s, Decryption 139.1 MB/s
> 
> Note that as with the generic code, only the Speck128 and Speck64
> variants are supported.  Also, for now only the XTS mode of operation is
> supported, to target the disk and file encryption use cases.  The NEON
> code also only handles the portion of the data that is evenly divisible
> into 128-byte chunks, with any remainder handled by a C fallback.  Of
> course, other modes of operation could be added later if needed, and/or
> the NEON code could be updated to handle other buffer sizes.
> 
> The XTS specification is only defined for AES which has a 128-bit block
> size, so for the GF(2^64) math needed for Speck64-XTS we use the
> reducing polynomial 'x^64 + x^4 + x^3 + x + 1' given by the original XEX
> paper.  Of course, when possible users should use Speck128-XTS, but even
> that may be too slow on some processors; Speck64-XTS can be faster.
> 
> Signed-off-by: Eric Biggers <ebiggers@google.com>
> ---
>  arch/arm/crypto/Kconfig           |   6 +
>  arch/arm/crypto/Makefile          |   2 +
>  arch/arm/crypto/speck-neon-core.S | 432 ++++++++++++++++++++++++++++++
>  arch/arm/crypto/speck-neon-glue.c | 288 ++++++++++++++++++++
>  4 files changed, 728 insertions(+)
>  create mode 100644 arch/arm/crypto/speck-neon-core.S
>  create mode 100644 arch/arm/crypto/speck-neon-glue.c
> 
> diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
> index b8e69fe282b8..925d1364727a 100644
> --- a/arch/arm/crypto/Kconfig
> +++ b/arch/arm/crypto/Kconfig
> @@ -121,4 +121,10 @@ config CRYPTO_CHACHA20_NEON
>  	select CRYPTO_BLKCIPHER
>  	select CRYPTO_CHACHA20
>  
> +config CRYPTO_SPECK_NEON
> +	tristate "NEON accelerated Speck cipher algorithms"
> +	depends on KERNEL_MODE_NEON
> +	select CRYPTO_BLKCIPHER
> +	select CRYPTO_SPECK
> +
>  endif
> diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
> index 30ef8e291271..a758107c5525 100644
> --- a/arch/arm/crypto/Makefile
> +++ b/arch/arm/crypto/Makefile
> @@ -10,6 +10,7 @@ obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
>  obj-$(CONFIG_CRYPTO_SHA256_ARM) += sha256-arm.o
>  obj-$(CONFIG_CRYPTO_SHA512_ARM) += sha512-arm.o
>  obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha20-neon.o
> +obj-$(CONFIG_CRYPTO_SPECK_NEON) += speck-neon.o
>  
>  ce-obj-$(CONFIG_CRYPTO_AES_ARM_CE) += aes-arm-ce.o
>  ce-obj-$(CONFIG_CRYPTO_SHA1_ARM_CE) += sha1-arm-ce.o
> @@ -53,6 +54,7 @@ ghash-arm-ce-y	:= ghash-ce-core.o ghash-ce-glue.o
>  crct10dif-arm-ce-y	:= crct10dif-ce-core.o crct10dif-ce-glue.o
>  crc32-arm-ce-y:= crc32-ce-core.o crc32-ce-glue.o
>  chacha20-neon-y := chacha20-neon-core.o chacha20-neon-glue.o
> +speck-neon-y := speck-neon-core.o speck-neon-glue.o
>  
>  quiet_cmd_perl = PERL    $@
>        cmd_perl = $(PERL) $(<) > $(@)
> diff --git a/arch/arm/crypto/speck-neon-core.S b/arch/arm/crypto/speck-neon-core.S
> new file mode 100644
> index 000000000000..3c1e203e53b9
> --- /dev/null
> +++ b/arch/arm/crypto/speck-neon-core.S
> @@ -0,0 +1,432 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * NEON-accelerated implementation of Speck128-XTS and Speck64-XTS
> + *
> + * Copyright (c) 2018 Google, Inc
> + *
> + * Author: Eric Biggers <ebiggers@google.com>
> + */
> +
> +#include <linux/linkage.h>
> +
> +	.text
> +	.fpu		neon
> +
> +	// arguments
> +	ROUND_KEYS	.req	r0	// const {u64,u32} *round_keys
> +	NROUNDS		.req	r1	// int nrounds
> +	DST		.req	r2	// void *dst
> +	SRC		.req	r3	// const void *src
> +	NBYTES		.req	r4	// unsigned int nbytes
> +	TWEAK		.req	r5	// void *tweak
> +
> +	// registers which hold the data being encrypted/decrypted
> +	X0		.req	q0
> +	X0_L		.req	d0
> +	X0_H		.req	d1
> +	Y0		.req	q1
> +	Y0_H		.req	d3
> +	X1		.req	q2
> +	X1_L		.req	d4
> +	X1_H		.req	d5
> +	Y1		.req	q3
> +	Y1_H		.req	d7
> +	X2		.req	q4
> +	X2_L		.req	d8
> +	X2_H		.req	d9
> +	Y2		.req	q5
> +	Y2_H		.req	d11
> +	X3		.req	q6
> +	X3_L		.req	d12
> +	X3_H		.req	d13
> +	Y3		.req	q7
> +	Y3_H		.req	d15
> +
> +	// the round key, duplicated in all lanes
> +	ROUND_KEY	.req	q8
> +	ROUND_KEY_L	.req	d16
> +	ROUND_KEY_H	.req	d17
> +
> +	// index vector for vtbl-based 8-bit rotates
> +	ROTATE_TABLE	.req	d18
> +
> +	// multiplication table for updating XTS tweaks
> +	GF128MUL_TABLE	.req	d19
> +	GF64MUL_TABLE	.req	d19
> +
> +	// current XTS tweak value(s)
> +	TWEAKV		.req	q10
> +	TWEAKV_L	.req	d20
> +	TWEAKV_H	.req	d21
> +
> +	TMP0		.req	q12
> +	TMP0_L		.req	d24
> +	TMP0_H		.req	d25
> +	TMP1		.req	q13
> +	TMP2		.req	q14
> +	TMP3		.req	q15
> +
> +	.align		4
> +.Lror64_8_table:
> +	.byte		1, 2, 3, 4, 5, 6, 7, 0
> +.Lror32_8_table:
> +	.byte		1, 2, 3, 0, 5, 6, 7, 4
> +.Lrol64_8_table:
> +	.byte		7, 0, 1, 2, 3, 4, 5, 6
> +.Lrol32_8_table:
> +	.byte		3, 0, 1, 2, 7, 4, 5, 6
> +.Lgf128mul_table:
> +	.byte		0, 0x87
> +	.fill		14
> +.Lgf64mul_table:
> +	.byte		0, 0x1b, (0x1b << 1), (0x1b << 1) ^ 0x1b
> +	.fill		12
> +
> +/*
> + * _speck_round_128bytes() - Speck encryption round on 128 bytes at a time
> + *
> + * Do one Speck encryption round on the 128 bytes (8 blocks for Speck128, 16 for
> + * Speck64) stored in X0-X3 and Y0-Y3, using the round key stored in all lanes
> + * of ROUND_KEY.  'n' is the lane size: 64 for Speck128, or 32 for Speck64.
> + *
> + * The 8-bit rotates are implemented using vtbl instead of vshr + vsli because
> + * the vtbl approach is faster on some processors and the same speed on others.
> + */
> +.macro _speck_round_128bytes	n
> +
> +	// x = ror(x, 8)
> +	vtbl.8		X0_L, {X0_L}, ROTATE_TABLE
> +	vtbl.8		X0_H, {X0_H}, ROTATE_TABLE
> +	vtbl.8		X1_L, {X1_L}, ROTATE_TABLE
> +	vtbl.8		X1_H, {X1_H}, ROTATE_TABLE
> +	vtbl.8		X2_L, {X2_L}, ROTATE_TABLE
> +	vtbl.8		X2_H, {X2_H}, ROTATE_TABLE
> +	vtbl.8		X3_L, {X3_L}, ROTATE_TABLE
> +	vtbl.8		X3_H, {X3_H}, ROTATE_TABLE
> +
> +	// x += y
> +	vadd.u\n	X0, Y0
> +	vadd.u\n	X1, Y1
> +	vadd.u\n	X2, Y2
> +	vadd.u\n	X3, Y3
> +
> +	// x ^= k
> +	veor		X0, ROUND_KEY
> +	veor		X1, ROUND_KEY
> +	veor		X2, ROUND_KEY
> +	veor		X3, ROUND_KEY
> +
> +	// y = rol(y, 3)
> +	vshl.u\n	TMP0, Y0, #3
> +	vshl.u\n	TMP1, Y1, #3
> +	vshl.u\n	TMP2, Y2, #3
> +	vshl.u\n	TMP3, Y3, #3
> +	vsri.u\n	TMP0, Y0, #(\n - 3)
> +	vsri.u\n	TMP1, Y1, #(\n - 3)
> +	vsri.u\n	TMP2, Y2, #(\n - 3)
> +	vsri.u\n	TMP3, Y3, #(\n - 3)
> +
> +	// y ^= x
> +	veor		Y0, TMP0, X0
> +	veor		Y1, TMP1, X1
> +	veor		Y2, TMP2, X2
> +	veor		Y3, TMP3, X3
> +.endm
> +
> +/*
> + * _speck_unround_128bytes() - Speck decryption round on 128 bytes at a time
> + *
> + * This is the inverse of _speck_round_128bytes().
> + */
> +.macro _speck_unround_128bytes	n
> +
> +	// y ^= x
> +	veor		TMP0, Y0, X0
> +	veor		TMP1, Y1, X1
> +	veor		TMP2, Y2, X2
> +	veor		TMP3, Y3, X3
> +
> +	// y = ror(y, 3)
> +	vshr.u\n	Y0, TMP0, #3
> +	vshr.u\n	Y1, TMP1, #3
> +	vshr.u\n	Y2, TMP2, #3
> +	vshr.u\n	Y3, TMP3, #3
> +	vsli.u\n	Y0, TMP0, #(\n - 3)
> +	vsli.u\n	Y1, TMP1, #(\n - 3)
> +	vsli.u\n	Y2, TMP2, #(\n - 3)
> +	vsli.u\n	Y3, TMP3, #(\n - 3)
> +
> +	// x ^= k
> +	veor		X0, ROUND_KEY
> +	veor		X1, ROUND_KEY
> +	veor		X2, ROUND_KEY
> +	veor		X3, ROUND_KEY
> +
> +	// x -= y
> +	vsub.u\n	X0, Y0
> +	vsub.u\n	X1, Y1
> +	vsub.u\n	X2, Y2
> +	vsub.u\n	X3, Y3
> +
> +	// x = rol(x, 8);
> +	vtbl.8		X0_L, {X0_L}, ROTATE_TABLE
> +	vtbl.8		X0_H, {X0_H}, ROTATE_TABLE
> +	vtbl.8		X1_L, {X1_L}, ROTATE_TABLE
> +	vtbl.8		X1_H, {X1_H}, ROTATE_TABLE
> +	vtbl.8		X2_L, {X2_L}, ROTATE_TABLE
> +	vtbl.8		X2_H, {X2_H}, ROTATE_TABLE
> +	vtbl.8		X3_L, {X3_L}, ROTATE_TABLE
> +	vtbl.8		X3_H, {X3_H}, ROTATE_TABLE
> +.endm
> +
> +.macro _xts128_precrypt_one	dst_reg, tweak_buf, tmp
> +
> +	// Load the next source block
> +	vld1.8		{\dst_reg}, [SRC]!
> +
> +	// Save the current tweak in the tweak buffer
> +	vst1.8		{TWEAKV}, [\tweak_buf:128]!
> +
> +	// XOR the next source block with the current tweak
> +	veor		\dst_reg, TWEAKV
> +
> +	/*
> +	 * Calculate the next tweak by multiplying the current one by x,
> +	 * modulo p(x) = x^128 + x^7 + x^2 + x + 1.
> +	 */
> +	vshr.u64	\tmp, TWEAKV, #63
> +	vshl.u64	TWEAKV, #1
> +	veor		TWEAKV_H, \tmp\()_L
> +	vtbl.8		\tmp\()_H, {GF128MUL_TABLE}, \tmp\()_H
> +	veor		TWEAKV_L, \tmp\()_H
> +.endm
> +
> +.macro _xts64_precrypt_two	dst_reg, tweak_buf, tmp
> +
> +	// Load the next two source blocks
> +	vld1.8		{\dst_reg}, [SRC]!
> +
> +	// Save the current two tweaks in the tweak buffer
> +	vst1.8		{TWEAKV}, [\tweak_buf:128]!
> +
> +	// XOR the next two source blocks with the current two tweaks
> +	veor		\dst_reg, TWEAKV
> +
> +	/*
> +	 * Calculate the next two tweaks by multiplying the current ones by x^2,
> +	 * modulo p(x) = x^64 + x^4 + x^3 + x + 1.
> +	 */
> +	vshr.u64	\tmp, TWEAKV, #62
> +	vshl.u64	TWEAKV, #2
> +	vtbl.8		\tmp\()_L, {GF64MUL_TABLE}, \tmp\()_L
> +	vtbl.8		\tmp\()_H, {GF64MUL_TABLE}, \tmp\()_H
> +	veor		TWEAKV, \tmp
> +.endm
> +
> +/*
> + * _speck_xts_crypt() - Speck-XTS encryption/decryption
> + *
> + * Encrypt or decrypt NBYTES bytes of data from the SRC buffer to the DST buffer
> + * using Speck-XTS, specifically the variant with a block size of '2n' and round
> + * count given by NROUNDS.  The expanded round keys are given in ROUND_KEYS, and
> + * the current XTS tweak value is given in TWEAK.  It's assumed that NBYTES is a
> + * nonzero multiple of 128.
> + */
> +.macro _speck_xts_crypt	n, decrypting
> +	push		{r4-r7}
> +	mov		r7, sp
> +
> +	/*
> +	 * The first four parameters were passed in registers r0-r3.  Load the
> +	 * additional parameters, which were passed on the stack.
> +	 */
> +	ldr		NBYTES, [sp, #16]
> +	ldr		TWEAK, [sp, #20]
> +
> +	/*
> +	 * If decrypting, modify the ROUND_KEYS parameter to point to the last
> +	 * round key rather than the first, since for decryption the round keys
> +	 * are used in reverse order.
> +	 */
> +.if \decrypting
> +.if \n == 64
> +	add		ROUND_KEYS, ROUND_KEYS, NROUNDS, lsl #3
> +	sub		ROUND_KEYS, #8
> +.else
> +	add		ROUND_KEYS, ROUND_KEYS, NROUNDS, lsl #2
> +	sub		ROUND_KEYS, #4
> +.endif
> +.endif
> +
> +	// Load the index vector for vtbl-based 8-bit rotates
> +.if \decrypting
> +	ldr		r12, =.Lrol\n\()_8_table
> +.else
> +	ldr		r12, =.Lror\n\()_8_table
> +.endif
> +	vld1.8		{ROTATE_TABLE}, [r12:64]
> +
> +	// One-time XTS preparation
> +
> +	/*
> +	 * Allocate stack space to store 128 bytes worth of tweaks.  For
> +	 * performance, this space is aligned to a 16-byte boundary so that we
> +	 * can use the load/store instructions that declare 16-byte alignment.
> +	 */
> +	sub		sp, #128
> +	bic		sp, #0xf


This fails here when building with CONFIG_THUMB2_KERNEL=y

  AS      arch/arm/crypto/speck-neon-core.o

arch/arm/crypto/speck-neon-core.S: Assembler messages:

arch/arm/crypto/speck-neon-core.S:419: Error: r13 not allowed here -- `bic sp,#0xf'
arch/arm/crypto/speck-neon-core.S:423: Error: r13 not allowed here -- `bic sp,#0xf'
arch/arm/crypto/speck-neon-core.S:427: Error: r13 not allowed here -- `bic sp,#0xf'
arch/arm/crypto/speck-neon-core.S:431: Error: r13 not allowed here -- `bic sp,#0xf'

As a quick hack, the following change seems to address it:


-       sub             sp, #128
-       bic             sp, #0xf
+       mov             r6, sp
+       sub             r6, #128
+       bic             r6, #0xf
+       mov             sp, r6

But there is probably a better solution to address this.
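
One tidier variant might keep the BIC off sp entirely, for example (just a sketch, untested; the choice of r12 as scratch is an assumption -- it happens to be dead at this point in the macro, but any free register would work):

	sub		r12, sp, #128	// reserve 128 bytes for the tweak buffer
	bic		r12, #0xf	// 16-byte align; Rd is not sp, so this form is also valid Thumb-2
	mov		sp, r12		// only a plain MOV ever writes sp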

--
Stefan


> +
> +.if \n == 64
> +	// Load first tweak
> +	vld1.8		{TWEAKV}, [TWEAK]
> +
> +	// Load GF(2^128) multiplication table
> +	ldr		r12, =.Lgf128mul_table
> +	vld1.8		{GF128MUL_TABLE}, [r12:64]
> +.else
> +	// Load first tweak
> +	vld1.8		{TWEAKV_L}, [TWEAK]
> +
> +	// Load GF(2^64) multiplication table
> +	ldr		r12, =.Lgf64mul_table
> +	vld1.8		{GF64MUL_TABLE}, [r12:64]
> +
> +	// Calculate second tweak, packing it together with the first
> +	vshr.u64	TMP0_L, TWEAKV_L, #63
> +	vtbl.u8		TMP0_L, {GF64MUL_TABLE}, TMP0_L
> +	vshl.u64	TWEAKV_H, TWEAKV_L, #1
> +	veor		TWEAKV_H, TMP0_L
> +.endif
> +
> +.Lnext_128bytes_\@:
> +
> +	/*
> +	 * Load the source blocks into {X,Y}[0-3], XOR them with their XTS tweak
> +	 * values, and save the tweaks on the stack for later.  Then
> +	 * de-interleave the 'x' and 'y' elements of each block, i.e. make it so
> +	 * that the X[0-3] registers contain only the second halves of blocks,
> +	 * and the Y[0-3] registers contain only the first halves of blocks.
> +	 * (Speck uses the order (y, x) rather than the more intuitive (x, y).)
> +	 */
> +	mov		r12, sp
> +.if \n == 64
> +	_xts128_precrypt_one	X0, r12, TMP0
> +	_xts128_precrypt_one	Y0, r12, TMP0
> +	_xts128_precrypt_one	X1, r12, TMP0
> +	_xts128_precrypt_one	Y1, r12, TMP0
> +	_xts128_precrypt_one	X2, r12, TMP0
> +	_xts128_precrypt_one	Y2, r12, TMP0
> +	_xts128_precrypt_one	X3, r12, TMP0
> +	_xts128_precrypt_one	Y3, r12, TMP0
> +	vswp		X0_L, Y0_H
> +	vswp		X1_L, Y1_H
> +	vswp		X2_L, Y2_H
> +	vswp		X3_L, Y3_H
> +.else
> +	_xts64_precrypt_two	X0, r12, TMP0
> +	_xts64_precrypt_two	Y0, r12, TMP0
> +	_xts64_precrypt_two	X1, r12, TMP0
> +	_xts64_precrypt_two	Y1, r12, TMP0
> +	_xts64_precrypt_two	X2, r12, TMP0
> +	_xts64_precrypt_two	Y2, r12, TMP0
> +	_xts64_precrypt_two	X3, r12, TMP0
> +	_xts64_precrypt_two	Y3, r12, TMP0
> +	vuzp.32		Y0, X0
> +	vuzp.32		Y1, X1
> +	vuzp.32		Y2, X2
> +	vuzp.32		Y3, X3
> +.endif
> +
> +	// Do the cipher rounds
> +
> +	mov		r12, ROUND_KEYS
> +	mov		r6, NROUNDS
> +
> +.Lnext_round_\@:
> +.if \decrypting
> +.if \n == 64
> +	vld1.64		ROUND_KEY_L, [r12]
> +	sub		r12, #8
> +	vmov		ROUND_KEY_H, ROUND_KEY_L
> +.else
> +	vld1.32		{ROUND_KEY_L[],ROUND_KEY_H[]}, [r12]
> +	sub		r12, #4
> +.endif
> +	_speck_unround_128bytes	\n
> +.else
> +.if \n == 64
> +	vld1.64		ROUND_KEY_L, [r12]!
> +	vmov		ROUND_KEY_H, ROUND_KEY_L
> +.else
> +	vld1.32		{ROUND_KEY_L[],ROUND_KEY_H[]}, [r12]!
> +.endif
> +	_speck_round_128bytes	\n
> +.endif
> +	subs		r6, r6, #1
> +	bne		.Lnext_round_\@
> +
> +	// Re-interleave the 'x' and 'y' elements of each block
> +.if \n == 64
> +	vswp		X0_L, Y0_H
> +	vswp		X1_L, Y1_H
> +	vswp		X2_L, Y2_H
> +	vswp		X3_L, Y3_H
> +.else
> +	vzip.32		Y0, X0
> +	vzip.32		Y1, X1
> +	vzip.32		Y2, X2
> +	vzip.32		Y3, X3
> +.endif
> +
> +	// XOR the encrypted/decrypted blocks with the tweaks we saved earlier
> +	mov		r12, sp
> +	vld1.8		{TMP0, TMP1}, [r12:128]!
> +	vld1.8		{TMP2, TMP3}, [r12:128]!
> +	veor		X0, TMP0
> +	veor		Y0, TMP1
> +	veor		X1, TMP2
> +	veor		Y1, TMP3
> +	vld1.8		{TMP0, TMP1}, [r12:128]!
> +	vld1.8		{TMP2, TMP3}, [r12:128]!
> +	veor		X2, TMP0
> +	veor		Y2, TMP1
> +	veor		X3, TMP2
> +	veor		Y3, TMP3
> +
> +	// Store the ciphertext in the destination buffer
> +	vst1.8		{X0, Y0}, [DST]!
> +	vst1.8		{X1, Y1}, [DST]!
> +	vst1.8		{X2, Y2}, [DST]!
> +	vst1.8		{X3, Y3}, [DST]!
> +
> +	// Continue if there are more 128-byte chunks remaining, else return
> +	subs		NBYTES, #128
> +	bne		.Lnext_128bytes_\@
> +
> +	// Store the next tweak
> +.if \n == 64
> +	vst1.8		{TWEAKV}, [TWEAK]
> +.else
> +	vst1.8		{TWEAKV_L}, [TWEAK]
> +.endif
> +
> +	mov		sp, r7
> +	pop		{r4-r7}
> +	bx		lr
> +.endm
> +
> +ENTRY(speck128_xts_encrypt_neon)
> +	_speck_xts_crypt	n=64, decrypting=0
> +ENDPROC(speck128_xts_encrypt_neon)
> +
> +ENTRY(speck128_xts_decrypt_neon)
> +	_speck_xts_crypt	n=64, decrypting=1
> +ENDPROC(speck128_xts_decrypt_neon)
> +
> +ENTRY(speck64_xts_encrypt_neon)
> +	_speck_xts_crypt	n=32, decrypting=0
> +ENDPROC(speck64_xts_encrypt_neon)
> +
> +ENTRY(speck64_xts_decrypt_neon)
> +	_speck_xts_crypt	n=32, decrypting=1
> +ENDPROC(speck64_xts_decrypt_neon)
> diff --git a/arch/arm/crypto/speck-neon-glue.c b/arch/arm/crypto/speck-neon-glue.c
> new file mode 100644
> index 000000000000..f012c3ea998f
> --- /dev/null
> +++ b/arch/arm/crypto/speck-neon-glue.c
> @@ -0,0 +1,288 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * NEON-accelerated implementation of Speck128-XTS and Speck64-XTS
> + *
> + * Copyright (c) 2018 Google, Inc
> + *
> + * Note: the NIST recommendation for XTS only specifies a 128-bit block size,
> + * but a 64-bit version (needed for Speck64) is fairly straightforward; the math
> + * is just done in GF(2^64) instead of GF(2^128), with the reducing polynomial
> + * x^64 + x^4 + x^3 + x + 1 from the original XEX paper (Rogaway, 2004:
> + * "Efficient Instantiations of Tweakable Blockciphers and Refinements to Modes
> + * OCB and PMAC"), represented as 0x1B.
> + */
> +
> +#include <asm/hwcap.h>
> +#include <asm/neon.h>
> +#include <asm/simd.h>
> +#include <crypto/algapi.h>
> +#include <crypto/gf128mul.h>
> +#include <crypto/internal/skcipher.h>
> +#include <crypto/speck.h>
> +#include <crypto/xts.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +
> +/* The assembly functions only handle multiples of 128 bytes */
> +#define SPECK_NEON_CHUNK_SIZE	128
> +
> +/* Speck128 */
> +
> +struct speck128_xts_tfm_ctx {
> +	struct speck128_tfm_ctx main_key;
> +	struct speck128_tfm_ctx tweak_key;
> +};
> +
> +asmlinkage void speck128_xts_encrypt_neon(const u64 *round_keys, int nrounds,
> +					  void *dst, const void *src,
> +					  unsigned int nbytes, void *tweak);
> +
> +asmlinkage void speck128_xts_decrypt_neon(const u64 *round_keys, int nrounds,
> +					  void *dst, const void *src,
> +					  unsigned int nbytes, void *tweak);
> +
> +typedef void (*speck128_crypt_one_t)(const struct speck128_tfm_ctx *,
> +				     u8 *, const u8 *);
> +typedef void (*speck128_xts_crypt_many_t)(const u64 *, int, void *,
> +					  const void *, unsigned int, void *);
> +
> +static __always_inline int
> +__speck128_xts_crypt(struct skcipher_request *req,
> +		     speck128_crypt_one_t crypt_one,
> +		     speck128_xts_crypt_many_t crypt_many)
> +{
> +	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> +	const struct speck128_xts_tfm_ctx *ctx = crypto_skcipher_ctx(tfm);
> +	struct skcipher_walk walk;
> +	le128 tweak;
> +	int err;
> +
> +	err = skcipher_walk_virt(&walk, req, true);
> +
> +	crypto_speck128_encrypt(&ctx->tweak_key, (u8 *)&tweak, walk.iv);
> +
> +	while (walk.nbytes > 0) {
> +		unsigned int nbytes = walk.nbytes;
> +		u8 *dst = walk.dst.virt.addr;
> +		const u8 *src = walk.src.virt.addr;
> +
> +		if (nbytes >= SPECK_NEON_CHUNK_SIZE && may_use_simd()) {
> +			unsigned int count;
> +
> +			count = round_down(nbytes, SPECK_NEON_CHUNK_SIZE);
> +			kernel_neon_begin();
> +			(*crypt_many)(ctx->main_key.round_keys,
> +				      ctx->main_key.nrounds,
> +				      dst, src, count, &tweak);
> +			kernel_neon_end();
> +			dst += count;
> +			src += count;
> +			nbytes -= count;
> +		}
> +
> +		/* Handle any remainder with generic code */
> +		while (nbytes >= sizeof(tweak)) {
> +			le128_xor((le128 *)dst, (const le128 *)src, &tweak);
> +			(*crypt_one)(&ctx->main_key, dst, dst);
> +			le128_xor((le128 *)dst, (const le128 *)dst, &tweak);
> +			gf128mul_x_ble(&tweak, &tweak);
> +
> +			dst += sizeof(tweak);
> +			src += sizeof(tweak);
> +			nbytes -= sizeof(tweak);
> +		}
> +		err = skcipher_walk_done(&walk, nbytes);
> +	}
> +
> +	return err;
> +}
> +
> +static int speck128_xts_encrypt(struct skcipher_request *req)
> +{
> +	return __speck128_xts_crypt(req, crypto_speck128_encrypt,
> +				    speck128_xts_encrypt_neon);
> +}
> +
> +static int speck128_xts_decrypt(struct skcipher_request *req)
> +{
> +	return __speck128_xts_crypt(req, crypto_speck128_decrypt,
> +				    speck128_xts_decrypt_neon);
> +}
> +
> +static int speck128_xts_setkey(struct crypto_skcipher *tfm, const u8 *key,
> +			       unsigned int keylen)
> +{
> +	struct speck128_xts_tfm_ctx *ctx = crypto_skcipher_ctx(tfm);
> +	int err;
> +
> +	err = xts_verify_key(tfm, key, keylen);
> +	if (err)
> +		return err;
> +
> +	keylen /= 2;
> +
> +	err = crypto_speck128_setkey(&ctx->main_key, key, keylen);
> +	if (err)
> +		return err;
> +
> +	return crypto_speck128_setkey(&ctx->tweak_key, key + keylen, keylen);
> +}
> +
> +/* Speck64 */
> +
> +struct speck64_xts_tfm_ctx {
> +	struct speck64_tfm_ctx main_key;
> +	struct speck64_tfm_ctx tweak_key;
> +};
> +
> +asmlinkage void speck64_xts_encrypt_neon(const u32 *round_keys, int nrounds,
> +					 void *dst, const void *src,
> +					 unsigned int nbytes, void *tweak);
> +
> +asmlinkage void speck64_xts_decrypt_neon(const u32 *round_keys, int nrounds,
> +					 void *dst, const void *src,
> +					 unsigned int nbytes, void *tweak);
> +
> +typedef void (*speck64_crypt_one_t)(const struct speck64_tfm_ctx *,
> +				    u8 *, const u8 *);
> +typedef void (*speck64_xts_crypt_many_t)(const u32 *, int, void *,
> +					 const void *, unsigned int, void *);
> +
> +static __always_inline int
> +__speck64_xts_crypt(struct skcipher_request *req, speck64_crypt_one_t crypt_one,
> +		    speck64_xts_crypt_many_t crypt_many)
> +{
> +	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> +	const struct speck64_xts_tfm_ctx *ctx = crypto_skcipher_ctx(tfm);
> +	struct skcipher_walk walk;
> +	__le64 tweak;
> +	int err;
> +
> +	err = skcipher_walk_virt(&walk, req, true);
> +
> +	crypto_speck64_encrypt(&ctx->tweak_key, (u8 *)&tweak, walk.iv);
> +
> +	while (walk.nbytes > 0) {
> +		unsigned int nbytes = walk.nbytes;
> +		u8 *dst = walk.dst.virt.addr;
> +		const u8 *src = walk.src.virt.addr;
> +
> +		if (nbytes >= SPECK_NEON_CHUNK_SIZE && may_use_simd()) {
> +			unsigned int count;
> +
> +			count = round_down(nbytes, SPECK_NEON_CHUNK_SIZE);
> +			kernel_neon_begin();
> +			(*crypt_many)(ctx->main_key.round_keys,
> +				      ctx->main_key.nrounds,
> +				      dst, src, count, &tweak);
> +			kernel_neon_end();
> +			dst += count;
> +			src += count;
> +			nbytes -= count;
> +		}
> +
> +		/* Handle any remainder with generic code */
> +		while (nbytes >= sizeof(tweak)) {
> +			*(__le64 *)dst = *(__le64 *)src ^ tweak;
> +			(*crypt_one)(&ctx->main_key, dst, dst);
> +			*(__le64 *)dst ^= tweak;
> +			tweak = cpu_to_le64((le64_to_cpu(tweak) << 1) ^
> +					    ((tweak & cpu_to_le64(1ULL << 63)) ?
> +					     0x1B : 0));
> +			dst += sizeof(tweak);
> +			src += sizeof(tweak);
> +			nbytes -= sizeof(tweak);
> +		}
> +		err = skcipher_walk_done(&walk, nbytes);
> +	}
> +
> +	return err;
> +}
> +
> +static int speck64_xts_encrypt(struct skcipher_request *req)
> +{
> +	return __speck64_xts_crypt(req, crypto_speck64_encrypt,
> +				   speck64_xts_encrypt_neon);
> +}
> +
> +static int speck64_xts_decrypt(struct skcipher_request *req)
> +{
> +	return __speck64_xts_crypt(req, crypto_speck64_decrypt,
> +				   speck64_xts_decrypt_neon);
> +}
> +
> +static int speck64_xts_setkey(struct crypto_skcipher *tfm, const u8 *key,
> +			      unsigned int keylen)
> +{
> +	struct speck64_xts_tfm_ctx *ctx = crypto_skcipher_ctx(tfm);
> +	int err;
> +
> +	err = xts_verify_key(tfm, key, keylen);
> +	if (err)
> +		return err;
> +
> +	keylen /= 2;
> +
> +	err = crypto_speck64_setkey(&ctx->main_key, key, keylen);
> +	if (err)
> +		return err;
> +
> +	return crypto_speck64_setkey(&ctx->tweak_key, key + keylen, keylen);
> +}
> +
> +static struct skcipher_alg speck_algs[] = {
> +	{
> +		.base.cra_name		= "xts(speck128)",
> +		.base.cra_driver_name	= "xts-speck128-neon",
> +		.base.cra_priority	= 300,
> +		.base.cra_blocksize	= SPECK128_BLOCK_SIZE,
> +		.base.cra_ctxsize	= sizeof(struct speck128_xts_tfm_ctx),
> +		.base.cra_alignmask	= 7,
> +		.base.cra_module	= THIS_MODULE,
> +		.min_keysize		= 2 * SPECK128_128_KEY_SIZE,
> +		.max_keysize		= 2 * SPECK128_256_KEY_SIZE,
> +		.ivsize			= SPECK128_BLOCK_SIZE,
> +		.walksize		= SPECK_NEON_CHUNK_SIZE,
> +		.setkey			= speck128_xts_setkey,
> +		.encrypt		= speck128_xts_encrypt,
> +		.decrypt		= speck128_xts_decrypt,
> +	}, {
> +		.base.cra_name		= "xts(speck64)",
> +		.base.cra_driver_name	= "xts-speck64-neon",
> +		.base.cra_priority	= 300,
> +		.base.cra_blocksize	= SPECK64_BLOCK_SIZE,
> +		.base.cra_ctxsize	= sizeof(struct speck64_xts_tfm_ctx),
> +		.base.cra_alignmask	= 7,
> +		.base.cra_module	= THIS_MODULE,
> +		.min_keysize		= 2 * SPECK64_96_KEY_SIZE,
> +		.max_keysize		= 2 * SPECK64_128_KEY_SIZE,
> +		.ivsize			= SPECK64_BLOCK_SIZE,
> +		.walksize		= SPECK_NEON_CHUNK_SIZE,
> +		.setkey			= speck64_xts_setkey,
> +		.encrypt		= speck64_xts_encrypt,
> +		.decrypt		= speck64_xts_decrypt,
> +	}
> +};
> +
> +static int __init speck_neon_module_init(void)
> +{
> +	if (!(elf_hwcap & HWCAP_NEON))
> +		return -ENODEV;
> +	return crypto_register_skciphers(speck_algs, ARRAY_SIZE(speck_algs));
> +}
> +
> +static void __exit speck_neon_module_exit(void)
> +{
> +	crypto_unregister_skciphers(speck_algs, ARRAY_SIZE(speck_algs));
> +}
> +
> +module_init(speck_neon_module_init);
> +module_exit(speck_neon_module_exit);
> +
> +MODULE_DESCRIPTION("Speck block cipher (NEON-accelerated)");
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Eric Biggers <ebiggers@google.com>");
> +MODULE_ALIAS_CRYPTO("xts(speck128)");
> +MODULE_ALIAS_CRYPTO("xts-speck128-neon");
> +MODULE_ALIAS_CRYPTO("xts(speck64)");
> +MODULE_ALIAS_CRYPTO("xts-speck64-neon");

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 3/5] crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS
  2018-06-16 22:40     ` Stefan Agner
  (?)
@ 2018-06-17  9:30       ` Ard Biesheuvel
  -1 siblings, 0 replies; 36+ messages in thread
From: Ard Biesheuvel @ 2018-06-17  9:30 UTC (permalink / raw)
  To: Stefan Agner
  Cc: Jeffrey Walton, Greg Kaiser, Herbert Xu, Eric Biggers,
	Michael Halcrow, Patrik Torstensson, Alex Cope, Paul Lawrence,
	linux-fscrypt, open list:HARDWARE RANDOM NUMBER GENERATOR CORE,
	Greg Kroah-Hartman, linux-crypto-owner, linux-arm-kernel,
	Paul Crowley

On 17 June 2018 at 00:40, Stefan Agner <stefan@agner.ch> wrote:
> Hi Eric,
>
> On 14.02.2018 19:42, Eric Biggers wrote:
>> Add an ARM NEON-accelerated implementation of Speck-XTS.  It operates on
>> 128-byte chunks at a time, i.e. 8 blocks for Speck128 or 16 blocks for
>> Speck64.  Each 128-byte chunk goes through XTS preprocessing, then is
>> encrypted/decrypted (doing one cipher round for all the blocks, then the
>> next round, etc.), then goes through XTS postprocessing.
>>
>> The performance depends on the processor but can be about 3 times faster
>> than the generic code.  For example, on an ARMv7 processor we observe
>> the following performance with Speck128/256-XTS:
>>
>>     xts-speck128-neon:     Encryption 107.9 MB/s, Decryption 108.1 MB/s
>>     xts(speck128-generic): Encryption  32.1 MB/s, Decryption  36.6 MB/s
>>
>> In comparison to AES-256-XTS without the Cryptography Extensions:
>>
>>     xts-aes-neonbs:        Encryption  41.2 MB/s, Decryption  36.7 MB/s
>>     xts(aes-asm):          Encryption  31.7 MB/s, Decryption  30.8 MB/s
>>     xts(aes-generic):      Encryption  21.2 MB/s, Decryption  20.9 MB/s
>>
>> Speck64/128-XTS is even faster:
>>
>>     xts-speck64-neon:      Encryption 138.6 MB/s, Decryption 139.1 MB/s
>>
>> Note that as with the generic code, only the Speck128 and Speck64
>> variants are supported.  Also, for now only the XTS mode of operation is
>> supported, to target the disk and file encryption use cases.  The NEON
>> code also only handles the portion of the data that is evenly divisible
>> into 128-byte chunks, with any remainder handled by a C fallback.  Of
>> course, other modes of operation could be added later if needed, and/or
>> the NEON code could be updated to handle other buffer sizes.
>>
>> The XTS specification is only defined for AES which has a 128-bit block
>> size, so for the GF(2^64) math needed for Speck64-XTS we use the
>> reducing polynomial 'x^64 + x^4 + x^3 + x + 1' given by the original XEX
>> paper.  Of course, when possible users should use Speck128-XTS, but even
>> that may be too slow on some processors; Speck64-XTS can be faster.
>>
>> Signed-off-by: Eric Biggers <ebiggers@google.com>
>> ---
>>  arch/arm/crypto/Kconfig           |   6 +
>>  arch/arm/crypto/Makefile          |   2 +
>>  arch/arm/crypto/speck-neon-core.S | 432 ++++++++++++++++++++++++++++++
>>  arch/arm/crypto/speck-neon-glue.c | 288 ++++++++++++++++++++
>>  4 files changed, 728 insertions(+)
>>  create mode 100644 arch/arm/crypto/speck-neon-core.S
>>  create mode 100644 arch/arm/crypto/speck-neon-glue.c
>>
>> diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
>> index b8e69fe282b8..925d1364727a 100644
>> --- a/arch/arm/crypto/Kconfig
>> +++ b/arch/arm/crypto/Kconfig
>> @@ -121,4 +121,10 @@ config CRYPTO_CHACHA20_NEON
>>       select CRYPTO_BLKCIPHER
>>       select CRYPTO_CHACHA20
>>
>> +config CRYPTO_SPECK_NEON
>> +     tristate "NEON accelerated Speck cipher algorithms"
>> +     depends on KERNEL_MODE_NEON
>> +     select CRYPTO_BLKCIPHER
>> +     select CRYPTO_SPECK
>> +
>>  endif
>> diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
>> index 30ef8e291271..a758107c5525 100644
>> --- a/arch/arm/crypto/Makefile
>> +++ b/arch/arm/crypto/Makefile
>> @@ -10,6 +10,7 @@ obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
>>  obj-$(CONFIG_CRYPTO_SHA256_ARM) += sha256-arm.o
>>  obj-$(CONFIG_CRYPTO_SHA512_ARM) += sha512-arm.o
>>  obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha20-neon.o
>> +obj-$(CONFIG_CRYPTO_SPECK_NEON) += speck-neon.o
>>
>>  ce-obj-$(CONFIG_CRYPTO_AES_ARM_CE) += aes-arm-ce.o
>>  ce-obj-$(CONFIG_CRYPTO_SHA1_ARM_CE) += sha1-arm-ce.o
>> @@ -53,6 +54,7 @@ ghash-arm-ce-y      := ghash-ce-core.o ghash-ce-glue.o
>>  crct10dif-arm-ce-y   := crct10dif-ce-core.o crct10dif-ce-glue.o
>>  crc32-arm-ce-y:= crc32-ce-core.o crc32-ce-glue.o
>>  chacha20-neon-y := chacha20-neon-core.o chacha20-neon-glue.o
>> +speck-neon-y := speck-neon-core.o speck-neon-glue.o
>>
>>  quiet_cmd_perl = PERL    $@
>>        cmd_perl = $(PERL) $(<) > $(@)
>> diff --git a/arch/arm/crypto/speck-neon-core.S b/arch/arm/crypto/speck-neon-core.S
>> new file mode 100644
>> index 000000000000..3c1e203e53b9
>> --- /dev/null
>> +++ b/arch/arm/crypto/speck-neon-core.S
>> @@ -0,0 +1,432 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * NEON-accelerated implementation of Speck128-XTS and Speck64-XTS
>> + *
>> + * Copyright (c) 2018 Google, Inc
>> + *
>> + * Author: Eric Biggers <ebiggers@google.com>
>> + */
>> +
>> +#include <linux/linkage.h>
>> +
>> +     .text
>> +     .fpu            neon
>> +
>> +     // arguments
>> +     ROUND_KEYS      .req    r0      // const {u64,u32} *round_keys
>> +     NROUNDS         .req    r1      // int nrounds
>> +     DST             .req    r2      // void *dst
>> +     SRC             .req    r3      // const void *src
>> +     NBYTES          .req    r4      // unsigned int nbytes
>> +     TWEAK           .req    r5      // void *tweak
>> +
>> +     // registers which hold the data being encrypted/decrypted
>> +     X0              .req    q0
>> +     X0_L            .req    d0
>> +     X0_H            .req    d1
>> +     Y0              .req    q1
>> +     Y0_H            .req    d3
>> +     X1              .req    q2
>> +     X1_L            .req    d4
>> +     X1_H            .req    d5
>> +     Y1              .req    q3
>> +     Y1_H            .req    d7
>> +     X2              .req    q4
>> +     X2_L            .req    d8
>> +     X2_H            .req    d9
>> +     Y2              .req    q5
>> +     Y2_H            .req    d11
>> +     X3              .req    q6
>> +     X3_L            .req    d12
>> +     X3_H            .req    d13
>> +     Y3              .req    q7
>> +     Y3_H            .req    d15
>> +
>> +     // the round key, duplicated in all lanes
>> +     ROUND_KEY       .req    q8
>> +     ROUND_KEY_L     .req    d16
>> +     ROUND_KEY_H     .req    d17
>> +
>> +     // index vector for vtbl-based 8-bit rotates
>> +     ROTATE_TABLE    .req    d18
>> +
>> +     // multiplication table for updating XTS tweaks
>> +     GF128MUL_TABLE  .req    d19
>> +     GF64MUL_TABLE   .req    d19
>> +
>> +     // current XTS tweak value(s)
>> +     TWEAKV          .req    q10
>> +     TWEAKV_L        .req    d20
>> +     TWEAKV_H        .req    d21
>> +
>> +     TMP0            .req    q12
>> +     TMP0_L          .req    d24
>> +     TMP0_H          .req    d25
>> +     TMP1            .req    q13
>> +     TMP2            .req    q14
>> +     TMP3            .req    q15
>> +
>> +     .align          4
>> +.Lror64_8_table:
>> +     .byte           1, 2, 3, 4, 5, 6, 7, 0
>> +.Lror32_8_table:
>> +     .byte           1, 2, 3, 0, 5, 6, 7, 4
>> +.Lrol64_8_table:
>> +     .byte           7, 0, 1, 2, 3, 4, 5, 6
>> +.Lrol32_8_table:
>> +     .byte           3, 0, 1, 2, 7, 4, 5, 6
>> +.Lgf128mul_table:
>> +     .byte           0, 0x87
>> +     .fill           14
>> +.Lgf64mul_table:
>> +     .byte           0, 0x1b, (0x1b << 1), (0x1b << 1) ^ 0x1b
>> +     .fill           12
>> +
>> +/*
>> + * _speck_round_128bytes() - Speck encryption round on 128 bytes at a time
>> + *
>> + * Do one Speck encryption round on the 128 bytes (8 blocks for Speck128, 16 for
>> + * Speck64) stored in X0-X3 and Y0-Y3, using the round key stored in all lanes
>> + * of ROUND_KEY.  'n' is the lane size: 64 for Speck128, or 32 for Speck64.
>> + *
>> + * The 8-bit rotates are implemented using vtbl instead of vshr + vsli because
>> + * the vtbl approach is faster on some processors and the same speed on others.
>> + */
>> +.macro _speck_round_128bytes n
>> +
>> +     // x = ror(x, 8)
>> +     vtbl.8          X0_L, {X0_L}, ROTATE_TABLE
>> +     vtbl.8          X0_H, {X0_H}, ROTATE_TABLE
>> +     vtbl.8          X1_L, {X1_L}, ROTATE_TABLE
>> +     vtbl.8          X1_H, {X1_H}, ROTATE_TABLE
>> +     vtbl.8          X2_L, {X2_L}, ROTATE_TABLE
>> +     vtbl.8          X2_H, {X2_H}, ROTATE_TABLE
>> +     vtbl.8          X3_L, {X3_L}, ROTATE_TABLE
>> +     vtbl.8          X3_H, {X3_H}, ROTATE_TABLE
>> +
>> +     // x += y
>> +     vadd.u\n        X0, Y0
>> +     vadd.u\n        X1, Y1
>> +     vadd.u\n        X2, Y2
>> +     vadd.u\n        X3, Y3
>> +
>> +     // x ^= k
>> +     veor            X0, ROUND_KEY
>> +     veor            X1, ROUND_KEY
>> +     veor            X2, ROUND_KEY
>> +     veor            X3, ROUND_KEY
>> +
>> +     // y = rol(y, 3)
>> +     vshl.u\n        TMP0, Y0, #3
>> +     vshl.u\n        TMP1, Y1, #3
>> +     vshl.u\n        TMP2, Y2, #3
>> +     vshl.u\n        TMP3, Y3, #3
>> +     vsri.u\n        TMP0, Y0, #(\n - 3)
>> +     vsri.u\n        TMP1, Y1, #(\n - 3)
>> +     vsri.u\n        TMP2, Y2, #(\n - 3)
>> +     vsri.u\n        TMP3, Y3, #(\n - 3)
>> +
>> +     // y ^= x
>> +     veor            Y0, TMP0, X0
>> +     veor            Y1, TMP1, X1
>> +     veor            Y2, TMP2, X2
>> +     veor            Y3, TMP3, X3
>> +.endm
>> +
>> +/*
>> + * _speck_unround_128bytes() - Speck decryption round on 128 bytes at a time
>> + *
>> + * This is the inverse of _speck_round_128bytes().
>> + */
>> +.macro _speck_unround_128bytes       n
>> +
>> +     // y ^= x
>> +     veor            TMP0, Y0, X0
>> +     veor            TMP1, Y1, X1
>> +     veor            TMP2, Y2, X2
>> +     veor            TMP3, Y3, X3
>> +
>> +     // y = ror(y, 3)
>> +     vshr.u\n        Y0, TMP0, #3
>> +     vshr.u\n        Y1, TMP1, #3
>> +     vshr.u\n        Y2, TMP2, #3
>> +     vshr.u\n        Y3, TMP3, #3
>> +     vsli.u\n        Y0, TMP0, #(\n - 3)
>> +     vsli.u\n        Y1, TMP1, #(\n - 3)
>> +     vsli.u\n        Y2, TMP2, #(\n - 3)
>> +     vsli.u\n        Y3, TMP3, #(\n - 3)
>> +
>> +     // x ^= k
>> +     veor            X0, ROUND_KEY
>> +     veor            X1, ROUND_KEY
>> +     veor            X2, ROUND_KEY
>> +     veor            X3, ROUND_KEY
>> +
>> +     // x -= y
>> +     vsub.u\n        X0, Y0
>> +     vsub.u\n        X1, Y1
>> +     vsub.u\n        X2, Y2
>> +     vsub.u\n        X3, Y3
>> +
>> +     // x = rol(x, 8);
>> +     vtbl.8          X0_L, {X0_L}, ROTATE_TABLE
>> +     vtbl.8          X0_H, {X0_H}, ROTATE_TABLE
>> +     vtbl.8          X1_L, {X1_L}, ROTATE_TABLE
>> +     vtbl.8          X1_H, {X1_H}, ROTATE_TABLE
>> +     vtbl.8          X2_L, {X2_L}, ROTATE_TABLE
>> +     vtbl.8          X2_H, {X2_H}, ROTATE_TABLE
>> +     vtbl.8          X3_L, {X3_L}, ROTATE_TABLE
>> +     vtbl.8          X3_H, {X3_H}, ROTATE_TABLE
>> +.endm
>> +
>> +.macro _xts128_precrypt_one  dst_reg, tweak_buf, tmp
>> +
>> +     // Load the next source block
>> +     vld1.8          {\dst_reg}, [SRC]!
>> +
>> +     // Save the current tweak in the tweak buffer
>> +     vst1.8          {TWEAKV}, [\tweak_buf:128]!
>> +
>> +     // XOR the next source block with the current tweak
>> +     veor            \dst_reg, TWEAKV
>> +
>> +     /*
>> +      * Calculate the next tweak by multiplying the current one by x,
>> +      * modulo p(x) = x^128 + x^7 + x^2 + x + 1.
>> +      */
>> +     vshr.u64        \tmp, TWEAKV, #63
>> +     vshl.u64        TWEAKV, #1
>> +     veor            TWEAKV_H, \tmp\()_L
>> +     vtbl.8          \tmp\()_H, {GF128MUL_TABLE}, \tmp\()_H
>> +     veor            TWEAKV_L, \tmp\()_H
>> +.endm
>> +
>> +.macro _xts64_precrypt_two   dst_reg, tweak_buf, tmp
>> +
>> +     // Load the next two source blocks
>> +     vld1.8          {\dst_reg}, [SRC]!
>> +
>> +     // Save the current two tweaks in the tweak buffer
>> +     vst1.8          {TWEAKV}, [\tweak_buf:128]!
>> +
>> +     // XOR the next two source blocks with the current two tweaks
>> +     veor            \dst_reg, TWEAKV
>> +
>> +     /*
>> +      * Calculate the next two tweaks by multiplying the current ones by x^2,
>> +      * modulo p(x) = x^64 + x^4 + x^3 + x + 1.
>> +      */
>> +     vshr.u64        \tmp, TWEAKV, #62
>> +     vshl.u64        TWEAKV, #2
>> +     vtbl.8          \tmp\()_L, {GF64MUL_TABLE}, \tmp\()_L
>> +     vtbl.8          \tmp\()_H, {GF64MUL_TABLE}, \tmp\()_H
>> +     veor            TWEAKV, \tmp
>> +.endm
>> +
>> +/*
>> + * _speck_xts_crypt() - Speck-XTS encryption/decryption
>> + *
>> + * Encrypt or decrypt NBYTES bytes of data from the SRC buffer to the DST buffer
>> + * using Speck-XTS, specifically the variant with a block size of '2n' and round
>> + * count given by NROUNDS.  The expanded round keys are given in ROUND_KEYS, and
>> + * the current XTS tweak value is given in TWEAK.  It's assumed that NBYTES is a
>> + * nonzero multiple of 128.
>> + */
>> +.macro _speck_xts_crypt      n, decrypting
>> +     push            {r4-r7}
>> +     mov             r7, sp
>> +
>> +     /*
>> +      * The first four parameters were passed in registers r0-r3.  Load the
>> +      * additional parameters, which were passed on the stack.
>> +      */
>> +     ldr             NBYTES, [sp, #16]
>> +     ldr             TWEAK, [sp, #20]
>> +
>> +     /*
>> +      * If decrypting, modify the ROUND_KEYS parameter to point to the last
>> +      * round key rather than the first, since for decryption the round keys
>> +      * are used in reverse order.
>> +      */
>> +.if \decrypting
>> +.if \n == 64
>> +     add             ROUND_KEYS, ROUND_KEYS, NROUNDS, lsl #3
>> +     sub             ROUND_KEYS, #8
>> +.else
>> +     add             ROUND_KEYS, ROUND_KEYS, NROUNDS, lsl #2
>> +     sub             ROUND_KEYS, #4
>> +.endif
>> +.endif
>> +
>> +     // Load the index vector for vtbl-based 8-bit rotates
>> +.if \decrypting
>> +     ldr             r12, =.Lrol\n\()_8_table
>> +.else
>> +     ldr             r12, =.Lror\n\()_8_table
>> +.endif
>> +     vld1.8          {ROTATE_TABLE}, [r12:64]
>> +
>> +     // One-time XTS preparation
>> +
>> +     /*
>> +      * Allocate stack space to store 128 bytes worth of tweaks.  For
>> +      * performance, this space is aligned to a 16-byte boundary so that we
>> +      * can use the load/store instructions that declare 16-byte alignment.
>> +      */
>> +     sub             sp, #128
>> +     bic             sp, #0xf
>
>
> This fails here when building with CONFIG_THUMB2_KERNEL=y
>
>   AS      arch/arm/crypto/speck-neon-core.o
>
> arch/arm/crypto/speck-neon-core.S: Assembler messages:
>
> arch/arm/crypto/speck-neon-core.S:419: Error: r13 not allowed here --
> `bic sp,#0xf'
> arch/arm/crypto/speck-neon-core.S:423: Error: r13 not allowed here --
> `bic sp,#0xf'
> arch/arm/crypto/speck-neon-core.S:427: Error: r13 not allowed here --
> `bic sp,#0xf'
> arch/arm/crypto/speck-neon-core.S:431: Error: r13 not allowed here --
> `bic sp,#0xf'
>
> In a quick hack this change seems to address it:
>
>
> -       sub             sp, #128
> -       bic             sp, #0xf
> +       mov             r6, sp
> +       sub             r6, #128
> +       bic             r6, #0xf
> +       mov             sp, r6
>
> But there is probably a better solution to address this.
>

Given that there is no NEON on M-class cores, I recommend we put something like

THUMB(bx pc)
THUMB(nop.w)
THUMB(.arm)

at the beginning and be done with it.
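
For illustration only, that could look something like this at the top of the _speck_xts_crypt macro (the placement, the padding, and the required include of <asm/assembler.h> for the THUMB() helper are assumptions here, not a tested patch; THUMB() expands to nothing on non-Thumb-2 builds, so ARM-mode kernels are unaffected):

.macro _speck_xts_crypt	n, decrypting
	// In a CONFIG_THUMB2_KERNEL build, drop into ARM mode so that
	// sp-modifying instructions like 'bic sp, #0xf' stay legal.
THUMB(	bx		pc			)
THUMB(	nop.w					)
THUMB(	.arm					)
	push		{r4-r7}
	mov		r7, sp
	...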

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 3/5] crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS
@ 2018-06-17  9:30       ` Ard Biesheuvel
  0 siblings, 0 replies; 36+ messages in thread
From: Ard Biesheuvel @ 2018-06-17  9:30 UTC (permalink / raw)
  To: Stefan Agner
  Cc: Eric Biggers, open list:HARDWARE RANDOM NUMBER GENERATOR CORE,
	Herbert Xu, linux-fscrypt, linux-arm-kernel, Jeffrey Walton,
	Paul Crowley, Patrik Torstensson, Greg Kaiser, Paul Lawrence,
	Michael Halcrow, Alex Cope, Greg Kroah-Hartman,
	linux-crypto-owner

On 17 June 2018 at 00:40, Stefan Agner <stefan@agner.ch> wrote:
> Hi Eric,
>
> On 14.02.2018 19:42, Eric Biggers wrote:
>> Add an ARM NEON-accelerated implementation of Speck-XTS.  It operates on
>> 128-byte chunks at a time, i.e. 8 blocks for Speck128 or 16 blocks for
>> Speck64.  Each 128-byte chunk goes through XTS preprocessing, then is
>> encrypted/decrypted (doing one cipher round for all the blocks, then the
>> next round, etc.), then goes through XTS postprocessing.
>>
>> The performance depends on the processor but can be about 3 times faster
>> than the generic code.  For example, on an ARMv7 processor we observe
>> the following performance with Speck128/256-XTS:
>>
>>     xts-speck128-neon:     Encryption 107.9 MB/s, Decryption 108.1 MB/s
>>     xts(speck128-generic): Encryption  32.1 MB/s, Decryption  36.6 MB/s
>>
>> In comparison to AES-256-XTS without the Cryptography Extensions:
>>
>>     xts-aes-neonbs:        Encryption  41.2 MB/s, Decryption  36.7 MB/s
>>     xts(aes-asm):          Encryption  31.7 MB/s, Decryption  30.8 MB/s
>>     xts(aes-generic):      Encryption  21.2 MB/s, Decryption  20.9 MB/s
>>
>> Speck64/128-XTS is even faster:
>>
>>     xts-speck64-neon:      Encryption 138.6 MB/s, Decryption 139.1 MB/s
>>
>> Note that as with the generic code, only the Speck128 and Speck64
>> variants are supported.  Also, for now only the XTS mode of operation is
>> supported, to target the disk and file encryption use cases.  The NEON
>> code also only handles the portion of the data that is evenly divisible
>> into 128-byte chunks, with any remainder handled by a C fallback.  Of
>> course, other modes of operation could be added later if needed, and/or
>> the NEON code could be updated to handle other buffer sizes.
>>
>> The XTS specification is only defined for AES which has a 128-bit block
>> size, so for the GF(2^64) math needed for Speck64-XTS we use the
>> reducing polynomial 'x^64 + x^4 + x^3 + x + 1' given by the original XEX
>> paper.  Of course, when possible users should use Speck128-XTS, but even
>> that may be too slow on some processors; Speck64-XTS can be faster.
>>
>> Signed-off-by: Eric Biggers <ebiggers@google.com>
>> ---
>>  arch/arm/crypto/Kconfig           |   6 +
>>  arch/arm/crypto/Makefile          |   2 +
>>  arch/arm/crypto/speck-neon-core.S | 432 ++++++++++++++++++++++++++++++
>>  arch/arm/crypto/speck-neon-glue.c | 288 ++++++++++++++++++++
>>  4 files changed, 728 insertions(+)
>>  create mode 100644 arch/arm/crypto/speck-neon-core.S
>>  create mode 100644 arch/arm/crypto/speck-neon-glue.c
>>
>> diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
>> index b8e69fe282b8..925d1364727a 100644
>> --- a/arch/arm/crypto/Kconfig
>> +++ b/arch/arm/crypto/Kconfig
>> @@ -121,4 +121,10 @@ config CRYPTO_CHACHA20_NEON
>>       select CRYPTO_BLKCIPHER
>>       select CRYPTO_CHACHA20
>>
>> +config CRYPTO_SPECK_NEON
>> +     tristate "NEON accelerated Speck cipher algorithms"
>> +     depends on KERNEL_MODE_NEON
>> +     select CRYPTO_BLKCIPHER
>> +     select CRYPTO_SPECK
>> +
>>  endif
>> diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
>> index 30ef8e291271..a758107c5525 100644
>> --- a/arch/arm/crypto/Makefile
>> +++ b/arch/arm/crypto/Makefile
>> @@ -10,6 +10,7 @@ obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
>>  obj-$(CONFIG_CRYPTO_SHA256_ARM) += sha256-arm.o
>>  obj-$(CONFIG_CRYPTO_SHA512_ARM) += sha512-arm.o
>>  obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha20-neon.o
>> +obj-$(CONFIG_CRYPTO_SPECK_NEON) += speck-neon.o
>>
>>  ce-obj-$(CONFIG_CRYPTO_AES_ARM_CE) += aes-arm-ce.o
>>  ce-obj-$(CONFIG_CRYPTO_SHA1_ARM_CE) += sha1-arm-ce.o
>> @@ -53,6 +54,7 @@ ghash-arm-ce-y      := ghash-ce-core.o ghash-ce-glue.o
>>  crct10dif-arm-ce-y   := crct10dif-ce-core.o crct10dif-ce-glue.o
>>  crc32-arm-ce-y:= crc32-ce-core.o crc32-ce-glue.o
>>  chacha20-neon-y := chacha20-neon-core.o chacha20-neon-glue.o
>> +speck-neon-y := speck-neon-core.o speck-neon-glue.o
>>
>>  quiet_cmd_perl = PERL    $@
>>        cmd_perl = $(PERL) $(<) > $(@)
>> diff --git a/arch/arm/crypto/speck-neon-core.S
>> b/arch/arm/crypto/speck-neon-core.S
>> new file mode 100644
>> index 000000000000..3c1e203e53b9
>> --- /dev/null
>> +++ b/arch/arm/crypto/speck-neon-core.S
>> @@ -0,0 +1,432 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * NEON-accelerated implementation of Speck128-XTS and Speck64-XTS
>> + *
>> + * Copyright (c) 2018 Google, Inc
>> + *
>> + * Author: Eric Biggers <ebiggers@google.com>
>> + */
>> +
>> +#include <linux/linkage.h>
>> +
>> +     .text
>> +     .fpu            neon
>> +
>> +     // arguments
>> +     ROUND_KEYS      .req    r0      // const {u64,u32} *round_keys
>> +     NROUNDS         .req    r1      // int nrounds
>> +     DST             .req    r2      // void *dst
>> +     SRC             .req    r3      // const void *src
>> +     NBYTES          .req    r4      // unsigned int nbytes
>> +     TWEAK           .req    r5      // void *tweak
>> +
>> +     // registers which hold the data being encrypted/decrypted
>> +     X0              .req    q0
>> +     X0_L            .req    d0
>> +     X0_H            .req    d1
>> +     Y0              .req    q1
>> +     Y0_H            .req    d3
>> +     X1              .req    q2
>> +     X1_L            .req    d4
>> +     X1_H            .req    d5
>> +     Y1              .req    q3
>> +     Y1_H            .req    d7
>> +     X2              .req    q4
>> +     X2_L            .req    d8
>> +     X2_H            .req    d9
>> +     Y2              .req    q5
>> +     Y2_H            .req    d11
>> +     X3              .req    q6
>> +     X3_L            .req    d12
>> +     X3_H            .req    d13
>> +     Y3              .req    q7
>> +     Y3_H            .req    d15
>> +
>> +     // the round key, duplicated in all lanes
>> +     ROUND_KEY       .req    q8
>> +     ROUND_KEY_L     .req    d16
>> +     ROUND_KEY_H     .req    d17
>> +
>> +     // index vector for vtbl-based 8-bit rotates
>> +     ROTATE_TABLE    .req    d18
>> +
>> +     // multiplication table for updating XTS tweaks
>> +     GF128MUL_TABLE  .req    d19
>> +     GF64MUL_TABLE   .req    d19
>> +
>> +     // current XTS tweak value(s)
>> +     TWEAKV          .req    q10
>> +     TWEAKV_L        .req    d20
>> +     TWEAKV_H        .req    d21
>> +
>> +     TMP0            .req    q12
>> +     TMP0_L          .req    d24
>> +     TMP0_H          .req    d25
>> +     TMP1            .req    q13
>> +     TMP2            .req    q14
>> +     TMP3            .req    q15
>> +
>> +     .align          4
>> +.Lror64_8_table:
>> +     .byte           1, 2, 3, 4, 5, 6, 7, 0
>> +.Lror32_8_table:
>> +     .byte           1, 2, 3, 0, 5, 6, 7, 4
>> +.Lrol64_8_table:
>> +     .byte           7, 0, 1, 2, 3, 4, 5, 6
>> +.Lrol32_8_table:
>> +     .byte           3, 0, 1, 2, 7, 4, 5, 6
>> +.Lgf128mul_table:
>> +     .byte           0, 0x87
>> +     .fill           14
>> +.Lgf64mul_table:
>> +     .byte           0, 0x1b, (0x1b << 1), (0x1b << 1) ^ 0x1b
>> +     .fill           12
>> +
>> +/*
>> + * _speck_round_128bytes() - Speck encryption round on 128 bytes at a time
>> + *
>> + * Do one Speck encryption round on the 128 bytes (8 blocks for
>> Speck128, 16 for
>> + * Speck64) stored in X0-X3 and Y0-Y3, using the round key stored in all lanes
>> + * of ROUND_KEY.  'n' is the lane size: 64 for Speck128, or 32 for Speck64.
>> + *
>> + * The 8-bit rotates are implemented using vtbl instead of vshr + vsli because
>> + * the vtbl approach is faster on some processors and the same speed on others.
>> + */
>> +.macro _speck_round_128bytes n
>> +
>> +     // x = ror(x, 8)
>> +     vtbl.8          X0_L, {X0_L}, ROTATE_TABLE
>> +     vtbl.8          X0_H, {X0_H}, ROTATE_TABLE
>> +     vtbl.8          X1_L, {X1_L}, ROTATE_TABLE
>> +     vtbl.8          X1_H, {X1_H}, ROTATE_TABLE
>> +     vtbl.8          X2_L, {X2_L}, ROTATE_TABLE
>> +     vtbl.8          X2_H, {X2_H}, ROTATE_TABLE
>> +     vtbl.8          X3_L, {X3_L}, ROTATE_TABLE
>> +     vtbl.8          X3_H, {X3_H}, ROTATE_TABLE
>> +
>> +     // x += y
>> +     vadd.u\n        X0, Y0
>> +     vadd.u\n        X1, Y1
>> +     vadd.u\n        X2, Y2
>> +     vadd.u\n        X3, Y3
>> +
>> +     // x ^= k
>> +     veor            X0, ROUND_KEY
>> +     veor            X1, ROUND_KEY
>> +     veor            X2, ROUND_KEY
>> +     veor            X3, ROUND_KEY
>> +
>> +     // y = rol(y, 3)
>> +     vshl.u\n        TMP0, Y0, #3
>> +     vshl.u\n        TMP1, Y1, #3
>> +     vshl.u\n        TMP2, Y2, #3
>> +     vshl.u\n        TMP3, Y3, #3
>> +     vsri.u\n        TMP0, Y0, #(\n - 3)
>> +     vsri.u\n        TMP1, Y1, #(\n - 3)
>> +     vsri.u\n        TMP2, Y2, #(\n - 3)
>> +     vsri.u\n        TMP3, Y3, #(\n - 3)
>> +
>> +     // y ^= x
>> +     veor            Y0, TMP0, X0
>> +     veor            Y1, TMP1, X1
>> +     veor            Y2, TMP2, X2
>> +     veor            Y3, TMP3, X3
>> +.endm
>> +
>> +/*
>> + * _speck_unround_128bytes() - Speck decryption round on 128 bytes at a time
>> + *
>> + * This is the inverse of _speck_round_128bytes().
>> + */
>> +.macro _speck_unround_128bytes       n
>> +
>> +     // y ^= x
>> +     veor            TMP0, Y0, X0
>> +     veor            TMP1, Y1, X1
>> +     veor            TMP2, Y2, X2
>> +     veor            TMP3, Y3, X3
>> +
>> +     // y = ror(y, 3)
>> +     vshr.u\n        Y0, TMP0, #3
>> +     vshr.u\n        Y1, TMP1, #3
>> +     vshr.u\n        Y2, TMP2, #3
>> +     vshr.u\n        Y3, TMP3, #3
>> +     vsli.u\n        Y0, TMP0, #(\n - 3)
>> +     vsli.u\n        Y1, TMP1, #(\n - 3)
>> +     vsli.u\n        Y2, TMP2, #(\n - 3)
>> +     vsli.u\n        Y3, TMP3, #(\n - 3)
>> +
>> +     // x ^= k
>> +     veor            X0, ROUND_KEY
>> +     veor            X1, ROUND_KEY
>> +     veor            X2, ROUND_KEY
>> +     veor            X3, ROUND_KEY
>> +
>> +     // x -= y
>> +     vsub.u\n        X0, Y0
>> +     vsub.u\n        X1, Y1
>> +     vsub.u\n        X2, Y2
>> +     vsub.u\n        X3, Y3
>> +
>> +     // x = rol(x, 8);
>> +     vtbl.8          X0_L, {X0_L}, ROTATE_TABLE
>> +     vtbl.8          X0_H, {X0_H}, ROTATE_TABLE
>> +     vtbl.8          X1_L, {X1_L}, ROTATE_TABLE
>> +     vtbl.8          X1_H, {X1_H}, ROTATE_TABLE
>> +     vtbl.8          X2_L, {X2_L}, ROTATE_TABLE
>> +     vtbl.8          X2_H, {X2_H}, ROTATE_TABLE
>> +     vtbl.8          X3_L, {X3_L}, ROTATE_TABLE
>> +     vtbl.8          X3_H, {X3_H}, ROTATE_TABLE
>> +.endm
>> +
>> +.macro _xts128_precrypt_one  dst_reg, tweak_buf, tmp
>> +
>> +     // Load the next source block
>> +     vld1.8          {\dst_reg}, [SRC]!
>> +
>> +     // Save the current tweak in the tweak buffer
>> +     vst1.8          {TWEAKV}, [\tweak_buf:128]!
>> +
>> +     // XOR the next source block with the current tweak
>> +     veor            \dst_reg, TWEAKV
>> +
>> +     /*
>> +      * Calculate the next tweak by multiplying the current one by x,
>> +      * modulo p(x) = x^128 + x^7 + x^2 + x + 1.
>> +      */
>> +     vshr.u64        \tmp, TWEAKV, #63
>> +     vshl.u64        TWEAKV, #1
>> +     veor            TWEAKV_H, \tmp\()_L
>> +     vtbl.8          \tmp\()_H, {GF128MUL_TABLE}, \tmp\()_H
>> +     veor            TWEAKV_L, \tmp\()_H
>> +.endm
>> +
>> +.macro _xts64_precrypt_two   dst_reg, tweak_buf, tmp
>> +
>> +     // Load the next two source blocks
>> +     vld1.8          {\dst_reg}, [SRC]!
>> +
>> +     // Save the current two tweaks in the tweak buffer
>> +     vst1.8          {TWEAKV}, [\tweak_buf:128]!
>> +
>> +     // XOR the next two source blocks with the current two tweaks
>> +     veor            \dst_reg, TWEAKV
>> +
>> +     /*
>> +      * Calculate the next two tweaks by multiplying the current ones by x^2,
>> +      * modulo p(x) = x^64 + x^4 + x^3 + x + 1.
>> +      */
>> +     vshr.u64        \tmp, TWEAKV, #62
>> +     vshl.u64        TWEAKV, #2
>> +     vtbl.8          \tmp\()_L, {GF64MUL_TABLE}, \tmp\()_L
>> +     vtbl.8          \tmp\()_H, {GF64MUL_TABLE}, \tmp\()_H
>> +     veor            TWEAKV, \tmp
>> +.endm
>> +
>> +/*
>> + * _speck_xts_crypt() - Speck-XTS encryption/decryption
>> + *
>> + * Encrypt or decrypt NBYTES bytes of data from the SRC buffer to the DST buffer
>> + * using Speck-XTS, specifically the variant with a block size of '2n' and round
>> + * count given by NROUNDS.  The expanded round keys are given in ROUND_KEYS, and
>> + * the current XTS tweak value is given in TWEAK.  It's assumed that NBYTES is a
>> + * nonzero multiple of 128.
>> + */
>> +.macro _speck_xts_crypt      n, decrypting
>> +     push            {r4-r7}
>> +     mov             r7, sp
>> +
>> +     /*
>> +      * The first four parameters were passed in registers r0-r3.  Load the
>> +      * additional parameters, which were passed on the stack.
>> +      */
>> +     ldr             NBYTES, [sp, #16]
>> +     ldr             TWEAK, [sp, #20]
>> +
>> +     /*
>> +      * If decrypting, modify the ROUND_KEYS parameter to point to the last
>> +      * round key rather than the first, since for decryption the round keys
>> +      * are used in reverse order.
>> +      */
>> +.if \decrypting
>> +.if \n == 64
>> +     add             ROUND_KEYS, ROUND_KEYS, NROUNDS, lsl #3
>> +     sub             ROUND_KEYS, #8
>> +.else
>> +     add             ROUND_KEYS, ROUND_KEYS, NROUNDS, lsl #2
>> +     sub             ROUND_KEYS, #4
>> +.endif
>> +.endif
>> +
>> +     // Load the index vector for vtbl-based 8-bit rotates
>> +.if \decrypting
>> +     ldr             r12, =.Lrol\n\()_8_table
>> +.else
>> +     ldr             r12, =.Lror\n\()_8_table
>> +.endif
>> +     vld1.8          {ROTATE_TABLE}, [r12:64]
>> +
>> +     // One-time XTS preparation
>> +
>> +     /*
>> +      * Allocate stack space to store 128 bytes worth of tweaks.  For
>> +      * performance, this space is aligned to a 16-byte boundary so that we
>> +      * can use the load/store instructions that declare 16-byte alignment.
>> +      */
>> +     sub             sp, #128
>> +     bic             sp, #0xf
>
>
> This fails here when building with CONFIG_THUMB2_KERNEL=y
>
>   AS      arch/arm/crypto/speck-neon-core.o
>
> arch/arm/crypto/speck-neon-core.S: Assembler messages:
>
> arch/arm/crypto/speck-neon-core.S:419: Error: r13 not allowed here --
> `bic sp,#0xf'
> arch/arm/crypto/speck-neon-core.S:423: Error: r13 not allowed here --
> `bic sp,#0xf'
> arch/arm/crypto/speck-neon-core.S:427: Error: r13 not allowed here --
> `bic sp,#0xf'
> arch/arm/crypto/speck-neon-core.S:431: Error: r13 not allowed here --
> `bic sp,#0xf'
>
> As a quick hack, the following change seems to address it:
>
>
> -       sub             sp, #128
> -       bic             sp, #0xf
> +       mov             r6, sp
> +       sub             r6, #128
> +       bic             r6, #0xf
> +       mov             sp, r6
>
> But there is probably a better solution to address this.
>

Given that there is no NEON on M-class cores, I recommend we put something like

THUMB(bx pc)
THUMB(nop.w)
THUMB(.arm)

at the beginning and be done with it.
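
For illustration, a rough sketch of how that could look at the start of each
ENTRY() in the file, assuming the THUMB() macro from asm/unified.h (which emits
its argument only when CONFIG_THUMB2_KERNEL is set) and a 16-bit nop so that
the ARM code begins at the 4-byte-aligned address that 'bx pc' targets.  This
is only a sketch of the suggestion above, not necessarily the final fix:

	.align		2
THUMB(	bx		pc	)	@ in Thumb state, pc reads as '.' + 4 with bit 0
					@ clear, so this switches the CPU to ARM mode
THUMB(	nop			)	@ 16-bit nop, padding up to that 4-byte boundary
THUMB(	.arm			)	@ assemble the rest of the function as ARM code

In a non-Thumb2 build THUMB() expands to nothing, so the existing ARM-mode
build is unaffected.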

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 3/5] crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS
  2018-06-17  9:30       ` Ard Biesheuvel
  (?)
@ 2018-06-17  9:40         ` Ard Biesheuvel
  -1 siblings, 0 replies; 36+ messages in thread
From: Ard Biesheuvel @ 2018-06-17  9:40 UTC (permalink / raw)
  To: Stefan Agner
  Cc: Jeffrey Walton, Greg Kaiser, Herbert Xu, Eric Biggers,
	Michael Halcrow, Patrik Torstensson, Alex Cope, Paul Lawrence,
	linux-fscrypt, open list:HARDWARE RANDOM NUMBER GENERATOR CORE,
	Greg Kroah-Hartman, linux-crypto-owner, linux-arm-kernel,
	Paul Crowley

On 17 June 2018 at 11:30, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> On 17 June 2018 at 00:40, Stefan Agner <stefan@agner.ch> wrote:
>> Hi Eric,
>>
>> On 14.02.2018 19:42, Eric Biggers wrote:
>>> Add an ARM NEON-accelerated implementation of Speck-XTS.  It operates on
>>> 128-byte chunks at a time, i.e. 8 blocks for Speck128 or 16 blocks for
>>> Speck64.  Each 128-byte chunk goes through XTS preprocessing, then is
>>> encrypted/decrypted (doing one cipher round for all the blocks, then the
>>> next round, etc.), then goes through XTS postprocessing.
>>>
>>> The performance depends on the processor but can be about 3 times faster
>>> than the generic code.  For example, on an ARMv7 processor we observe
>>> the following performance with Speck128/256-XTS:
>>>
>>>     xts-speck128-neon:     Encryption 107.9 MB/s, Decryption 108.1 MB/s
>>>     xts(speck128-generic): Encryption  32.1 MB/s, Decryption  36.6 MB/s
>>>
>>> In comparison to AES-256-XTS without the Cryptography Extensions:
>>>
>>>     xts-aes-neonbs:        Encryption  41.2 MB/s, Decryption  36.7 MB/s
>>>     xts(aes-asm):          Encryption  31.7 MB/s, Decryption  30.8 MB/s
>>>     xts(aes-generic):      Encryption  21.2 MB/s, Decryption  20.9 MB/s
>>>
>>> Speck64/128-XTS is even faster:
>>>
>>>     xts-speck64-neon:      Encryption 138.6 MB/s, Decryption 139.1 MB/s
>>>
>>> Note that as with the generic code, only the Speck128 and Speck64
>>> variants are supported.  Also, for now only the XTS mode of operation is
>>> supported, to target the disk and file encryption use cases.  The NEON
>>> code also only handles the portion of the data that is evenly divisible
>>> into 128-byte chunks, with any remainder handled by a C fallback.  Of
>>> course, other modes of operation could be added later if needed, and/or
>>> the NEON code could be updated to handle other buffer sizes.
>>>
>>> The XTS specification is only defined for AES which has a 128-bit block
>>> size, so for the GF(2^64) math needed for Speck64-XTS we use the
>>> reducing polynomial 'x^64 + x^4 + x^3 + x + 1' given by the original XEX
>>> paper.  Of course, when possible users should use Speck128-XTS, but even
>>> that may be too slow on some processors; Speck64-XTS can be faster.
>>>
>>> Signed-off-by: Eric Biggers <ebiggers@google.com>
>>> ---
>>>  arch/arm/crypto/Kconfig           |   6 +
>>>  arch/arm/crypto/Makefile          |   2 +
>>>  arch/arm/crypto/speck-neon-core.S | 432 ++++++++++++++++++++++++++++++
>>>  arch/arm/crypto/speck-neon-glue.c | 288 ++++++++++++++++++++
>>>  4 files changed, 728 insertions(+)
>>>  create mode 100644 arch/arm/crypto/speck-neon-core.S
>>>  create mode 100644 arch/arm/crypto/speck-neon-glue.c
>>>
>>> diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
>>> index b8e69fe282b8..925d1364727a 100644
>>> --- a/arch/arm/crypto/Kconfig
>>> +++ b/arch/arm/crypto/Kconfig
>>> @@ -121,4 +121,10 @@ config CRYPTO_CHACHA20_NEON
>>>       select CRYPTO_BLKCIPHER
>>>       select CRYPTO_CHACHA20
>>>
>>> +config CRYPTO_SPECK_NEON
>>> +     tristate "NEON accelerated Speck cipher algorithms"
>>> +     depends on KERNEL_MODE_NEON
>>> +     select CRYPTO_BLKCIPHER
>>> +     select CRYPTO_SPECK
>>> +
>>>  endif
>>> diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
>>> index 30ef8e291271..a758107c5525 100644
>>> --- a/arch/arm/crypto/Makefile
>>> +++ b/arch/arm/crypto/Makefile
>>> @@ -10,6 +10,7 @@ obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
>>>  obj-$(CONFIG_CRYPTO_SHA256_ARM) += sha256-arm.o
>>>  obj-$(CONFIG_CRYPTO_SHA512_ARM) += sha512-arm.o
>>>  obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha20-neon.o
>>> +obj-$(CONFIG_CRYPTO_SPECK_NEON) += speck-neon.o
>>>
>>>  ce-obj-$(CONFIG_CRYPTO_AES_ARM_CE) += aes-arm-ce.o
>>>  ce-obj-$(CONFIG_CRYPTO_SHA1_ARM_CE) += sha1-arm-ce.o
>>> @@ -53,6 +54,7 @@ ghash-arm-ce-y      := ghash-ce-core.o ghash-ce-glue.o
>>>  crct10dif-arm-ce-y   := crct10dif-ce-core.o crct10dif-ce-glue.o
>>>  crc32-arm-ce-y:= crc32-ce-core.o crc32-ce-glue.o
>>>  chacha20-neon-y := chacha20-neon-core.o chacha20-neon-glue.o
>>> +speck-neon-y := speck-neon-core.o speck-neon-glue.o
>>>
>>>  quiet_cmd_perl = PERL    $@
>>>        cmd_perl = $(PERL) $(<) > $(@)
>>> diff --git a/arch/arm/crypto/speck-neon-core.S b/arch/arm/crypto/speck-neon-core.S
>>> new file mode 100644
>>> index 000000000000..3c1e203e53b9
>>> --- /dev/null
>>> +++ b/arch/arm/crypto/speck-neon-core.S
>>> @@ -0,0 +1,432 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>> +/*
>>> + * NEON-accelerated implementation of Speck128-XTS and Speck64-XTS
>>> + *
>>> + * Copyright (c) 2018 Google, Inc
>>> + *
>>> + * Author: Eric Biggers <ebiggers@google.com>
>>> + */
>>> +
>>> +#include <linux/linkage.h>
>>> +
>>> +     .text
>>> +     .fpu            neon
>>> +
>>> +     // arguments
>>> +     ROUND_KEYS      .req    r0      // const {u64,u32} *round_keys
>>> +     NROUNDS         .req    r1      // int nrounds
>>> +     DST             .req    r2      // void *dst
>>> +     SRC             .req    r3      // const void *src
>>> +     NBYTES          .req    r4      // unsigned int nbytes
>>> +     TWEAK           .req    r5      // void *tweak
>>> +
>>> +     // registers which hold the data being encrypted/decrypted
>>> +     X0              .req    q0
>>> +     X0_L            .req    d0
>>> +     X0_H            .req    d1
>>> +     Y0              .req    q1
>>> +     Y0_H            .req    d3
>>> +     X1              .req    q2
>>> +     X1_L            .req    d4
>>> +     X1_H            .req    d5
>>> +     Y1              .req    q3
>>> +     Y1_H            .req    d7
>>> +     X2              .req    q4
>>> +     X2_L            .req    d8
>>> +     X2_H            .req    d9
>>> +     Y2              .req    q5
>>> +     Y2_H            .req    d11
>>> +     X3              .req    q6
>>> +     X3_L            .req    d12
>>> +     X3_H            .req    d13
>>> +     Y3              .req    q7
>>> +     Y3_H            .req    d15
>>> +
>>> +     // the round key, duplicated in all lanes
>>> +     ROUND_KEY       .req    q8
>>> +     ROUND_KEY_L     .req    d16
>>> +     ROUND_KEY_H     .req    d17
>>> +
>>> +     // index vector for vtbl-based 8-bit rotates
>>> +     ROTATE_TABLE    .req    d18
>>> +
>>> +     // multiplication table for updating XTS tweaks
>>> +     GF128MUL_TABLE  .req    d19
>>> +     GF64MUL_TABLE   .req    d19
>>> +
>>> +     // current XTS tweak value(s)
>>> +     TWEAKV          .req    q10
>>> +     TWEAKV_L        .req    d20
>>> +     TWEAKV_H        .req    d21
>>> +
>>> +     TMP0            .req    q12
>>> +     TMP0_L          .req    d24
>>> +     TMP0_H          .req    d25
>>> +     TMP1            .req    q13
>>> +     TMP2            .req    q14
>>> +     TMP3            .req    q15
>>> +
>>> +     .align          4
>>> +.Lror64_8_table:
>>> +     .byte           1, 2, 3, 4, 5, 6, 7, 0
>>> +.Lror32_8_table:
>>> +     .byte           1, 2, 3, 0, 5, 6, 7, 4
>>> +.Lrol64_8_table:
>>> +     .byte           7, 0, 1, 2, 3, 4, 5, 6
>>> +.Lrol32_8_table:
>>> +     .byte           3, 0, 1, 2, 7, 4, 5, 6
>>> +.Lgf128mul_table:
>>> +     .byte           0, 0x87
>>> +     .fill           14
>>> +.Lgf64mul_table:
>>> +     .byte           0, 0x1b, (0x1b << 1), (0x1b << 1) ^ 0x1b
>>> +     .fill           12
>>> +
>>> +/*
>>> + * _speck_round_128bytes() - Speck encryption round on 128 bytes at a time
>>> + *
>>> + * Do one Speck encryption round on the 128 bytes (8 blocks for Speck128, 16 for
>>> + * Speck64) stored in X0-X3 and Y0-Y3, using the round key stored in all lanes
>>> + * of ROUND_KEY.  'n' is the lane size: 64 for Speck128, or 32 for Speck64.
>>> + *
>>> + * The 8-bit rotates are implemented using vtbl instead of vshr + vsli because
>>> + * the vtbl approach is faster on some processors and the same speed on others.
>>> + */
>>> +.macro _speck_round_128bytes n
>>> +
>>> +     // x = ror(x, 8)
>>> +     vtbl.8          X0_L, {X0_L}, ROTATE_TABLE
>>> +     vtbl.8          X0_H, {X0_H}, ROTATE_TABLE
>>> +     vtbl.8          X1_L, {X1_L}, ROTATE_TABLE
>>> +     vtbl.8          X1_H, {X1_H}, ROTATE_TABLE
>>> +     vtbl.8          X2_L, {X2_L}, ROTATE_TABLE
>>> +     vtbl.8          X2_H, {X2_H}, ROTATE_TABLE
>>> +     vtbl.8          X3_L, {X3_L}, ROTATE_TABLE
>>> +     vtbl.8          X3_H, {X3_H}, ROTATE_TABLE
>>> +
>>> +     // x += y
>>> +     vadd.u\n        X0, Y0
>>> +     vadd.u\n        X1, Y1
>>> +     vadd.u\n        X2, Y2
>>> +     vadd.u\n        X3, Y3
>>> +
>>> +     // x ^= k
>>> +     veor            X0, ROUND_KEY
>>> +     veor            X1, ROUND_KEY
>>> +     veor            X2, ROUND_KEY
>>> +     veor            X3, ROUND_KEY
>>> +
>>> +     // y = rol(y, 3)
>>> +     vshl.u\n        TMP0, Y0, #3
>>> +     vshl.u\n        TMP1, Y1, #3
>>> +     vshl.u\n        TMP2, Y2, #3
>>> +     vshl.u\n        TMP3, Y3, #3
>>> +     vsri.u\n        TMP0, Y0, #(\n - 3)
>>> +     vsri.u\n        TMP1, Y1, #(\n - 3)
>>> +     vsri.u\n        TMP2, Y2, #(\n - 3)
>>> +     vsri.u\n        TMP3, Y3, #(\n - 3)
>>> +
>>> +     // y ^= x
>>> +     veor            Y0, TMP0, X0
>>> +     veor            Y1, TMP1, X1
>>> +     veor            Y2, TMP2, X2
>>> +     veor            Y3, TMP3, X3
>>> +.endm
>>> +
>>> +/*
>>> + * _speck_unround_128bytes() - Speck decryption round on 128 bytes at a time
>>> + *
>>> + * This is the inverse of _speck_round_128bytes().
>>> + */
>>> +.macro _speck_unround_128bytes       n
>>> +
>>> +     // y ^= x
>>> +     veor            TMP0, Y0, X0
>>> +     veor            TMP1, Y1, X1
>>> +     veor            TMP2, Y2, X2
>>> +     veor            TMP3, Y3, X3
>>> +
>>> +     // y = ror(y, 3)
>>> +     vshr.u\n        Y0, TMP0, #3
>>> +     vshr.u\n        Y1, TMP1, #3
>>> +     vshr.u\n        Y2, TMP2, #3
>>> +     vshr.u\n        Y3, TMP3, #3
>>> +     vsli.u\n        Y0, TMP0, #(\n - 3)
>>> +     vsli.u\n        Y1, TMP1, #(\n - 3)
>>> +     vsli.u\n        Y2, TMP2, #(\n - 3)
>>> +     vsli.u\n        Y3, TMP3, #(\n - 3)
>>> +
>>> +     // x ^= k
>>> +     veor            X0, ROUND_KEY
>>> +     veor            X1, ROUND_KEY
>>> +     veor            X2, ROUND_KEY
>>> +     veor            X3, ROUND_KEY
>>> +
>>> +     // x -= y
>>> +     vsub.u\n        X0, Y0
>>> +     vsub.u\n        X1, Y1
>>> +     vsub.u\n        X2, Y2
>>> +     vsub.u\n        X3, Y3
>>> +
>>> +     // x = rol(x, 8);
>>> +     vtbl.8          X0_L, {X0_L}, ROTATE_TABLE
>>> +     vtbl.8          X0_H, {X0_H}, ROTATE_TABLE
>>> +     vtbl.8          X1_L, {X1_L}, ROTATE_TABLE
>>> +     vtbl.8          X1_H, {X1_H}, ROTATE_TABLE
>>> +     vtbl.8          X2_L, {X2_L}, ROTATE_TABLE
>>> +     vtbl.8          X2_H, {X2_H}, ROTATE_TABLE
>>> +     vtbl.8          X3_L, {X3_L}, ROTATE_TABLE
>>> +     vtbl.8          X3_H, {X3_H}, ROTATE_TABLE
>>> +.endm
>>> +
>>> +.macro _xts128_precrypt_one  dst_reg, tweak_buf, tmp
>>> +
>>> +     // Load the next source block
>>> +     vld1.8          {\dst_reg}, [SRC]!
>>> +
>>> +     // Save the current tweak in the tweak buffer
>>> +     vst1.8          {TWEAKV}, [\tweak_buf:128]!
>>> +
>>> +     // XOR the next source block with the current tweak
>>> +     veor            \dst_reg, TWEAKV
>>> +
>>> +     /*
>>> +      * Calculate the next tweak by multiplying the current one by x,
>>> +      * modulo p(x) = x^128 + x^7 + x^2 + x + 1.
>>> +      */
>>> +     vshr.u64        \tmp, TWEAKV, #63
>>> +     vshl.u64        TWEAKV, #1
>>> +     veor            TWEAKV_H, \tmp\()_L
>>> +     vtbl.8          \tmp\()_H, {GF128MUL_TABLE}, \tmp\()_H
>>> +     veor            TWEAKV_L, \tmp\()_H
>>> +.endm
>>> +
>>> +.macro _xts64_precrypt_two   dst_reg, tweak_buf, tmp
>>> +
>>> +     // Load the next two source blocks
>>> +     vld1.8          {\dst_reg}, [SRC]!
>>> +
>>> +     // Save the current two tweaks in the tweak buffer
>>> +     vst1.8          {TWEAKV}, [\tweak_buf:128]!
>>> +
>>> +     // XOR the next two source blocks with the current two tweaks
>>> +     veor            \dst_reg, TWEAKV
>>> +
>>> +     /*
>>> +      * Calculate the next two tweaks by multiplying the current ones by x^2,
>>> +      * modulo p(x) = x^64 + x^4 + x^3 + x + 1.
>>> +      */
>>> +     vshr.u64        \tmp, TWEAKV, #62
>>> +     vshl.u64        TWEAKV, #2
>>> +     vtbl.8          \tmp\()_L, {GF64MUL_TABLE}, \tmp\()_L
>>> +     vtbl.8          \tmp\()_H, {GF64MUL_TABLE}, \tmp\()_H
>>> +     veor            TWEAKV, \tmp
>>> +.endm
>>> +
>>> +/*
>>> + * _speck_xts_crypt() - Speck-XTS encryption/decryption
>>> + *
>>> + * Encrypt or decrypt NBYTES bytes of data from the SRC buffer to the DST buffer
>>> + * using Speck-XTS, specifically the variant with a block size of '2n' and round
>>> + * count given by NROUNDS.  The expanded round keys are given in ROUND_KEYS, and
>>> + * the current XTS tweak value is given in TWEAK.  It's assumed that NBYTES is a
>>> + * nonzero multiple of 128.
>>> + */
>>> +.macro _speck_xts_crypt      n, decrypting
>>> +     push            {r4-r7}
>>> +     mov             r7, sp
>>> +
>>> +     /*
>>> +      * The first four parameters were passed in registers r0-r3.  Load the
>>> +      * additional parameters, which were passed on the stack.
>>> +      */
>>> +     ldr             NBYTES, [sp, #16]
>>> +     ldr             TWEAK, [sp, #20]
>>> +
>>> +     /*
>>> +      * If decrypting, modify the ROUND_KEYS parameter to point to the last
>>> +      * round key rather than the first, since for decryption the round keys
>>> +      * are used in reverse order.
>>> +      */
>>> +.if \decrypting
>>> +.if \n == 64
>>> +     add             ROUND_KEYS, ROUND_KEYS, NROUNDS, lsl #3
>>> +     sub             ROUND_KEYS, #8
>>> +.else
>>> +     add             ROUND_KEYS, ROUND_KEYS, NROUNDS, lsl #2
>>> +     sub             ROUND_KEYS, #4
>>> +.endif
>>> +.endif
>>> +
>>> +     // Load the index vector for vtbl-based 8-bit rotates
>>> +.if \decrypting
>>> +     ldr             r12, =.Lrol\n\()_8_table
>>> +.else
>>> +     ldr             r12, =.Lror\n\()_8_table
>>> +.endif
>>> +     vld1.8          {ROTATE_TABLE}, [r12:64]
>>> +
>>> +     // One-time XTS preparation
>>> +
>>> +     /*
>>> +      * Allocate stack space to store 128 bytes worth of tweaks.  For
>>> +      * performance, this space is aligned to a 16-byte boundary so that we
>>> +      * can use the load/store instructions that declare 16-byte alignment.
>>> +      */
>>> +     sub             sp, #128
>>> +     bic             sp, #0xf
>>
>>
>> This fails here when building with CONFIG_THUMB2_KERNEL=y
>>
>>   AS      arch/arm/crypto/speck-neon-core.o
>>
>> arch/arm/crypto/speck-neon-core.S: Assembler messages:
>>
>> arch/arm/crypto/speck-neon-core.S:419: Error: r13 not allowed here --
>> `bic sp,#0xf'
>> arch/arm/crypto/speck-neon-core.S:423: Error: r13 not allowed here --
>> `bic sp,#0xf'
>> arch/arm/crypto/speck-neon-core.S:427: Error: r13 not allowed here --
>> `bic sp,#0xf'
>> arch/arm/crypto/speck-neon-core.S:431: Error: r13 not allowed here --
>> `bic sp,#0xf'
>>
>> As a quick hack, the following change seems to address it:
>>
>>
>> -       sub             sp, #128
>> -       bic             sp, #0xf
>> +       mov             r6, sp
>> +       sub             r6, #128
>> +       bic             r6, #0xf
>> +       mov             sp, r6
>>
>> But there is probably a better solution to address this.
>>
>
> Given that there is no NEON on M-class cores, I recommend we put something like
>
> THUMB(bx pc)
> THUMB(nop.w)
> THUMB(.arm)
>
> at the beginning and be done with it.

I mean nop.n or just nop, of course, and we may need a '.align 2' at
the beginning as well.
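
As a sketch of the other option discussed above, Stefan's workaround could also
use the ip (r12) scratch register, which needs no saving or restoring; this
assumes r12 is in fact still free at that point in _speck_xts_crypt (it was
last used to load the rotate-table address), which would need to be checked:

-	sub		sp, #128
-	bic		sp, #0xf
+	sub		r12, sp, #128	@ compute the new sp in a scratch register,
+	bic		r12, #0xf	@ since Thumb2 'bic' cannot write to sp (r13)
+	mov		sp, r12

Either way, the 16-byte alignment and the 128-byte tweak buffer are unchanged;
only how sp is updated differs.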

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH v3 3/5] crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS
@ 2018-06-17  9:40         ` Ard Biesheuvel
  0 siblings, 0 replies; 36+ messages in thread
From: Ard Biesheuvel @ 2018-06-17  9:40 UTC (permalink / raw)
  To: linux-arm-kernel

On 17 June 2018 at 11:30, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> On 17 June 2018 at 00:40, Stefan Agner <stefan@agner.ch> wrote:
>> Hi Eric,
>>
>> On 14.02.2018 19:42, Eric Biggers wrote:
>>> Add an ARM NEON-accelerated implementation of Speck-XTS.  It operates on
>>> 128-byte chunks at a time, i.e. 8 blocks for Speck128 or 16 blocks for
>>> Speck64.  Each 128-byte chunk goes through XTS preprocessing, then is
>>> encrypted/decrypted (doing one cipher round for all the blocks, then the
>>> next round, etc.), then goes through XTS postprocessing.
>>>
>>> The performance depends on the processor but can be about 3 times faster
>>> than the generic code.  For example, on an ARMv7 processor we observe
>>> the following performance with Speck128/256-XTS:
>>>
>>>     xts-speck128-neon:     Encryption 107.9 MB/s, Decryption 108.1 MB/s
>>>     xts(speck128-generic): Encryption  32.1 MB/s, Decryption  36.6 MB/s
>>>
>>> In comparison to AES-256-XTS without the Cryptography Extensions:
>>>
>>>     xts-aes-neonbs:        Encryption  41.2 MB/s, Decryption  36.7 MB/s
>>>     xts(aes-asm):          Encryption  31.7 MB/s, Decryption  30.8 MB/s
>>>     xts(aes-generic):      Encryption  21.2 MB/s, Decryption  20.9 MB/s
>>>
>>> Speck64/128-XTS is even faster:
>>>
>>>     xts-speck64-neon:      Encryption 138.6 MB/s, Decryption 139.1 MB/s
>>>
>>> Note that as with the generic code, only the Speck128 and Speck64
>>> variants are supported.  Also, for now only the XTS mode of operation is
>>> supported, to target the disk and file encryption use cases.  The NEON
>>> code also only handles the portion of the data that is evenly divisible
>>> into 128-byte chunks, with any remainder handled by a C fallback.  Of
>>> course, other modes of operation could be added later if needed, and/or
>>> the NEON code could be updated to handle other buffer sizes.
>>>
>>> The XTS specification is only defined for AES which has a 128-bit block
>>> size, so for the GF(2^64) math needed for Speck64-XTS we use the
>>> reducing polynomial 'x^64 + x^4 + x^3 + x + 1' given by the original XEX
>>> paper.  Of course, when possible users should use Speck128-XTS, but even
>>> that may be too slow on some processors; Speck64-XTS can be faster.
>>>
>>> Signed-off-by: Eric Biggers <ebiggers@google.com>
>>> ---
>>>  arch/arm/crypto/Kconfig           |   6 +
>>>  arch/arm/crypto/Makefile          |   2 +
>>>  arch/arm/crypto/speck-neon-core.S | 432 ++++++++++++++++++++++++++++++
>>>  arch/arm/crypto/speck-neon-glue.c | 288 ++++++++++++++++++++
>>>  4 files changed, 728 insertions(+)
>>>  create mode 100644 arch/arm/crypto/speck-neon-core.S
>>>  create mode 100644 arch/arm/crypto/speck-neon-glue.c
>>>
>>> diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
>>> index b8e69fe282b8..925d1364727a 100644
>>> --- a/arch/arm/crypto/Kconfig
>>> +++ b/arch/arm/crypto/Kconfig
>>> @@ -121,4 +121,10 @@ config CRYPTO_CHACHA20_NEON
>>>       select CRYPTO_BLKCIPHER
>>>       select CRYPTO_CHACHA20
>>>
>>> +config CRYPTO_SPECK_NEON
>>> +     tristate "NEON accelerated Speck cipher algorithms"
>>> +     depends on KERNEL_MODE_NEON
>>> +     select CRYPTO_BLKCIPHER
>>> +     select CRYPTO_SPECK
>>> +
>>>  endif
>>> diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
>>> index 30ef8e291271..a758107c5525 100644
>>> --- a/arch/arm/crypto/Makefile
>>> +++ b/arch/arm/crypto/Makefile
>>> @@ -10,6 +10,7 @@ obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
>>>  obj-$(CONFIG_CRYPTO_SHA256_ARM) += sha256-arm.o
>>>  obj-$(CONFIG_CRYPTO_SHA512_ARM) += sha512-arm.o
>>>  obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha20-neon.o
>>> +obj-$(CONFIG_CRYPTO_SPECK_NEON) += speck-neon.o
>>>
>>>  ce-obj-$(CONFIG_CRYPTO_AES_ARM_CE) += aes-arm-ce.o
>>>  ce-obj-$(CONFIG_CRYPTO_SHA1_ARM_CE) += sha1-arm-ce.o
>>> @@ -53,6 +54,7 @@ ghash-arm-ce-y      := ghash-ce-core.o ghash-ce-glue.o
>>>  crct10dif-arm-ce-y   := crct10dif-ce-core.o crct10dif-ce-glue.o
>>>  crc32-arm-ce-y:= crc32-ce-core.o crc32-ce-glue.o
>>>  chacha20-neon-y := chacha20-neon-core.o chacha20-neon-glue.o
>>> +speck-neon-y := speck-neon-core.o speck-neon-glue.o
>>>
>>>  quiet_cmd_perl = PERL    $@
>>>        cmd_perl = $(PERL) $(<) > $(@)
>>> diff --git a/arch/arm/crypto/speck-neon-core.S b/arch/arm/crypto/speck-neon-core.S
>>> new file mode 100644
>>> index 000000000000..3c1e203e53b9
>>> --- /dev/null
>>> +++ b/arch/arm/crypto/speck-neon-core.S
>>> @@ -0,0 +1,432 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>> +/*
>>> + * NEON-accelerated implementation of Speck128-XTS and Speck64-XTS
>>> + *
>>> + * Copyright (c) 2018 Google, Inc
>>> + *
>>> + * Author: Eric Biggers <ebiggers@google.com>
>>> + */
>>> +
>>> +#include <linux/linkage.h>
>>> +
>>> +     .text
>>> +     .fpu            neon
>>> +
>>> +     // arguments
>>> +     ROUND_KEYS      .req    r0      // const {u64,u32} *round_keys
>>> +     NROUNDS         .req    r1      // int nrounds
>>> +     DST             .req    r2      // void *dst
>>> +     SRC             .req    r3      // const void *src
>>> +     NBYTES          .req    r4      // unsigned int nbytes
>>> +     TWEAK           .req    r5      // void *tweak
>>> +
>>> +     // registers which hold the data being encrypted/decrypted
>>> +     X0              .req    q0
>>> +     X0_L            .req    d0
>>> +     X0_H            .req    d1
>>> +     Y0              .req    q1
>>> +     Y0_H            .req    d3
>>> +     X1              .req    q2
>>> +     X1_L            .req    d4
>>> +     X1_H            .req    d5
>>> +     Y1              .req    q3
>>> +     Y1_H            .req    d7
>>> +     X2              .req    q4
>>> +     X2_L            .req    d8
>>> +     X2_H            .req    d9
>>> +     Y2              .req    q5
>>> +     Y2_H            .req    d11
>>> +     X3              .req    q6
>>> +     X3_L            .req    d12
>>> +     X3_H            .req    d13
>>> +     Y3              .req    q7
>>> +     Y3_H            .req    d15
>>> +
>>> +     // the round key, duplicated in all lanes
>>> +     ROUND_KEY       .req    q8
>>> +     ROUND_KEY_L     .req    d16
>>> +     ROUND_KEY_H     .req    d17
>>> +
>>> +     // index vector for vtbl-based 8-bit rotates
>>> +     ROTATE_TABLE    .req    d18
>>> +
>>> +     // multiplication table for updating XTS tweaks
>>> +     GF128MUL_TABLE  .req    d19
>>> +     GF64MUL_TABLE   .req    d19
>>> +
>>> +     // current XTS tweak value(s)
>>> +     TWEAKV          .req    q10
>>> +     TWEAKV_L        .req    d20
>>> +     TWEAKV_H        .req    d21
>>> +
>>> +     TMP0            .req    q12
>>> +     TMP0_L          .req    d24
>>> +     TMP0_H          .req    d25
>>> +     TMP1            .req    q13
>>> +     TMP2            .req    q14
>>> +     TMP3            .req    q15
>>> +
>>> +     .align          4
>>> +.Lror64_8_table:
>>> +     .byte           1, 2, 3, 4, 5, 6, 7, 0
>>> +.Lror32_8_table:
>>> +     .byte           1, 2, 3, 0, 5, 6, 7, 4
>>> +.Lrol64_8_table:
>>> +     .byte           7, 0, 1, 2, 3, 4, 5, 6
>>> +.Lrol32_8_table:
>>> +     .byte           3, 0, 1, 2, 7, 4, 5, 6
>>> +.Lgf128mul_table:
>>> +     .byte           0, 0x87
>>> +     .fill           14
>>> +.Lgf64mul_table:
>>> +     .byte           0, 0x1b, (0x1b << 1), (0x1b << 1) ^ 0x1b
>>> +     .fill           12
>>> +
>>> +/*
>>> + * _speck_round_128bytes() - Speck encryption round on 128 bytes at a time
>>> + *
>>> + * Do one Speck encryption round on the 128 bytes (8 blocks for Speck128, 16 for
>>> + * Speck64) stored in X0-X3 and Y0-Y3, using the round key stored in all lanes
>>> + * of ROUND_KEY.  'n' is the lane size: 64 for Speck128, or 32 for Speck64.
>>> + *
>>> + * The 8-bit rotates are implemented using vtbl instead of vshr + vsli because
>>> + * the vtbl approach is faster on some processors and the same speed on others.
>>> + */
>>> +.macro _speck_round_128bytes n
>>> +
>>> +     // x = ror(x, 8)
>>> +     vtbl.8          X0_L, {X0_L}, ROTATE_TABLE
>>> +     vtbl.8          X0_H, {X0_H}, ROTATE_TABLE
>>> +     vtbl.8          X1_L, {X1_L}, ROTATE_TABLE
>>> +     vtbl.8          X1_H, {X1_H}, ROTATE_TABLE
>>> +     vtbl.8          X2_L, {X2_L}, ROTATE_TABLE
>>> +     vtbl.8          X2_H, {X2_H}, ROTATE_TABLE
>>> +     vtbl.8          X3_L, {X3_L}, ROTATE_TABLE
>>> +     vtbl.8          X3_H, {X3_H}, ROTATE_TABLE
>>> +
>>> +     // x += y
>>> +     vadd.u\n        X0, Y0
>>> +     vadd.u\n        X1, Y1
>>> +     vadd.u\n        X2, Y2
>>> +     vadd.u\n        X3, Y3
>>> +
>>> +     // x ^= k
>>> +     veor            X0, ROUND_KEY
>>> +     veor            X1, ROUND_KEY
>>> +     veor            X2, ROUND_KEY
>>> +     veor            X3, ROUND_KEY
>>> +
>>> +     // y = rol(y, 3)
>>> +     vshl.u\n        TMP0, Y0, #3
>>> +     vshl.u\n        TMP1, Y1, #3
>>> +     vshl.u\n        TMP2, Y2, #3
>>> +     vshl.u\n        TMP3, Y3, #3
>>> +     vsri.u\n        TMP0, Y0, #(\n - 3)
>>> +     vsri.u\n        TMP1, Y1, #(\n - 3)
>>> +     vsri.u\n        TMP2, Y2, #(\n - 3)
>>> +     vsri.u\n        TMP3, Y3, #(\n - 3)
>>> +
>>> +     // y ^= x
>>> +     veor            Y0, TMP0, X0
>>> +     veor            Y1, TMP1, X1
>>> +     veor            Y2, TMP2, X2
>>> +     veor            Y3, TMP3, X3
>>> +.endm
>>> +
>>> +/*
>>> + * _speck_unround_128bytes() - Speck decryption round on 128 bytes at a time
>>> + *
>>> + * This is the inverse of _speck_round_128bytes().
>>> + */
>>> +.macro _speck_unround_128bytes       n
>>> +
>>> +     // y ^= x
>>> +     veor            TMP0, Y0, X0
>>> +     veor            TMP1, Y1, X1
>>> +     veor            TMP2, Y2, X2
>>> +     veor            TMP3, Y3, X3
>>> +
>>> +     // y = ror(y, 3)
>>> +     vshr.u\n        Y0, TMP0, #3
>>> +     vshr.u\n        Y1, TMP1, #3
>>> +     vshr.u\n        Y2, TMP2, #3
>>> +     vshr.u\n        Y3, TMP3, #3
>>> +     vsli.u\n        Y0, TMP0, #(\n - 3)
>>> +     vsli.u\n        Y1, TMP1, #(\n - 3)
>>> +     vsli.u\n        Y2, TMP2, #(\n - 3)
>>> +     vsli.u\n        Y3, TMP3, #(\n - 3)
>>> +
>>> +     // x ^= k
>>> +     veor            X0, ROUND_KEY
>>> +     veor            X1, ROUND_KEY
>>> +     veor            X2, ROUND_KEY
>>> +     veor            X3, ROUND_KEY
>>> +
>>> +     // x -= y
>>> +     vsub.u\n        X0, Y0
>>> +     vsub.u\n        X1, Y1
>>> +     vsub.u\n        X2, Y2
>>> +     vsub.u\n        X3, Y3
>>> +
>>> +     // x = rol(x, 8);
>>> +     vtbl.8          X0_L, {X0_L}, ROTATE_TABLE
>>> +     vtbl.8          X0_H, {X0_H}, ROTATE_TABLE
>>> +     vtbl.8          X1_L, {X1_L}, ROTATE_TABLE
>>> +     vtbl.8          X1_H, {X1_H}, ROTATE_TABLE
>>> +     vtbl.8          X2_L, {X2_L}, ROTATE_TABLE
>>> +     vtbl.8          X2_H, {X2_H}, ROTATE_TABLE
>>> +     vtbl.8          X3_L, {X3_L}, ROTATE_TABLE
>>> +     vtbl.8          X3_H, {X3_H}, ROTATE_TABLE
>>> +.endm
>>> +
>>> +.macro _xts128_precrypt_one  dst_reg, tweak_buf, tmp
>>> +
>>> +     // Load the next source block
>>> +     vld1.8          {\dst_reg}, [SRC]!
>>> +
>>> +     // Save the current tweak in the tweak buffer
>>> +     vst1.8          {TWEAKV}, [\tweak_buf:128]!
>>> +
>>> +     // XOR the next source block with the current tweak
>>> +     veor            \dst_reg, TWEAKV
>>> +
>>> +     /*
>>> +      * Calculate the next tweak by multiplying the current one by x,
>>> +      * modulo p(x) = x^128 + x^7 + x^2 + x + 1.
>>> +      */
>>> +     vshr.u64        \tmp, TWEAKV, #63
>>> +     vshl.u64        TWEAKV, #1
>>> +     veor            TWEAKV_H, \tmp\()_L
>>> +     vtbl.8          \tmp\()_H, {GF128MUL_TABLE}, \tmp\()_H
>>> +     veor            TWEAKV_L, \tmp\()_H
>>> +.endm
>>> +
>>> +.macro _xts64_precrypt_two   dst_reg, tweak_buf, tmp
>>> +
>>> +     // Load the next two source blocks
>>> +     vld1.8          {\dst_reg}, [SRC]!
>>> +
>>> +     // Save the current two tweaks in the tweak buffer
>>> +     vst1.8          {TWEAKV}, [\tweak_buf:128]!
>>> +
>>> +     // XOR the next two source blocks with the current two tweaks
>>> +     veor            \dst_reg, TWEAKV
>>> +
>>> +     /*
>>> +      * Calculate the next two tweaks by multiplying the current ones by x^2,
>>> +      * modulo p(x) = x^64 + x^4 + x^3 + x + 1.
>>> +      */
>>> +     vshr.u64        \tmp, TWEAKV, #62
>>> +     vshl.u64        TWEAKV, #2
>>> +     vtbl.8          \tmp\()_L, {GF64MUL_TABLE}, \tmp\()_L
>>> +     vtbl.8          \tmp\()_H, {GF64MUL_TABLE}, \tmp\()_H
>>> +     veor            TWEAKV, \tmp
>>> +.endm
>>> +
>>> +/*
>>> + * _speck_xts_crypt() - Speck-XTS encryption/decryption
>>> + *
>>> + * Encrypt or decrypt NBYTES bytes of data from the SRC buffer to the DST buffer
>>> + * using Speck-XTS, specifically the variant with a block size of '2n' and round
>>> + * count given by NROUNDS.  The expanded round keys are given in ROUND_KEYS, and
>>> + * the current XTS tweak value is given in TWEAK.  It's assumed that NBYTES is a
>>> + * nonzero multiple of 128.
>>> + */
>>> +.macro _speck_xts_crypt      n, decrypting
>>> +     push            {r4-r7}
>>> +     mov             r7, sp
>>> +
>>> +     /*
>>> +      * The first four parameters were passed in registers r0-r3.  Load the
>>> +      * additional parameters, which were passed on the stack.
>>> +      */
>>> +     ldr             NBYTES, [sp, #16]
>>> +     ldr             TWEAK, [sp, #20]
>>> +
>>> +     /*
>>> +      * If decrypting, modify the ROUND_KEYS parameter to point to the last
>>> +      * round key rather than the first, since for decryption the round keys
>>> +      * are used in reverse order.
>>> +      */
>>> +.if \decrypting
>>> +.if \n == 64
>>> +     add             ROUND_KEYS, ROUND_KEYS, NROUNDS, lsl #3
>>> +     sub             ROUND_KEYS, #8
>>> +.else
>>> +     add             ROUND_KEYS, ROUND_KEYS, NROUNDS, lsl #2
>>> +     sub             ROUND_KEYS, #4
>>> +.endif
>>> +.endif
>>> +
>>> +     // Load the index vector for vtbl-based 8-bit rotates
>>> +.if \decrypting
>>> +     ldr             r12, =.Lrol\n\()_8_table
>>> +.else
>>> +     ldr             r12, =.Lror\n\()_8_table
>>> +.endif
>>> +     vld1.8          {ROTATE_TABLE}, [r12:64]
>>> +
>>> +     // One-time XTS preparation
>>> +
>>> +     /*
>>> +      * Allocate stack space to store 128 bytes worth of tweaks.  For
>>> +      * performance, this space is aligned to a 16-byte boundary so that we
>>> +      * can use the load/store instructions that declare 16-byte alignment.
>>> +      */
>>> +     sub             sp, #128
>>> +     bic             sp, #0xf
>>
>>
>> This fails here when building with CONFIG_THUMB2_KERNEL=y
>>
>>   AS      arch/arm/crypto/speck-neon-core.o
>>
>> arch/arm/crypto/speck-neon-core.S: Assembler messages:
>>
>> arch/arm/crypto/speck-neon-core.S:419: Error: r13 not allowed here --
>> `bic sp,#0xf'
>> arch/arm/crypto/speck-neon-core.S:423: Error: r13 not allowed here --
>> `bic sp,#0xf'
>> arch/arm/crypto/speck-neon-core.S:427: Error: r13 not allowed here --
>> `bic sp,#0xf'
>> arch/arm/crypto/speck-neon-core.S:431: Error: r13 not allowed here --
>> `bic sp,#0xf'
>>
>> In a quick hack this change seems to address it:
>>
>>
>> -       sub             sp, #128
>> -       bic             sp, #0xf
>> +       mov             r6, sp
>> +       sub             r6, #128
>> +       bic             r6, #0xf
>> +       mov             sp, r6
>>
>> But there is probably a better solution to address this.
>>
>
> Given that there is no NEON on M class cores, I recommend we put something like
>
> THUMB(bx pc)
> THUMB(nop.w)
> THUMB(.arm)
>
> at the beginning and be done with it.

I mean nop.n or just nop, of course, and we may need a '.align 2' at
the beginning as well.
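
For reference, a minimal sketch of what that prologue could look like, assuming
the usual THUMB()/ARM() macros from <asm/unified.h>; the placement and exact
spelling are illustrative only, not a final patch:

	.align		2			// ARM code below must be 4-byte aligned
THUMB(	bx		pc		)	// pc reads as '.'+4 (bit 0 clear): branch past the nop and switch to ARM state
THUMB(	nop				)	// narrow nop, so the ARM code starts exactly at '.'+4
THUMB(	.arm				)	// assemble everything that follows as ARM

The rest of the file, including the bic on sp, could then stay ARM-only, which
is harmless here since no Thumb2-only (M-class) core has NEON in the first place.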

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 3/5] crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS
  2018-06-17  9:40         ` Ard Biesheuvel
  (?)
@ 2018-06-17 10:41           ` Stefan Agner
  -1 siblings, 0 replies; 36+ messages in thread
From: Stefan Agner @ 2018-06-17 10:41 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Jeffrey Walton, Greg Kaiser, Herbert Xu, Eric Biggers,
	Michael Halcrow, Patrik Torstensson, Alex Cope, Paul Lawrence,
	linux-fscrypt, open list:HARDWARE RANDOM NUMBER GENERATOR CORE,
	Greg Kroah-Hartman, linux-crypto-owner, linux-arm-kernel,
	Paul Crowley

On 17.06.2018 11:40, Ard Biesheuvel wrote:
> On 17 June 2018 at 11:30, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>> On 17 June 2018 at 00:40, Stefan Agner <stefan@agner.ch> wrote:
>>> Hi Eric,
>>>
>>> On 14.02.2018 19:42, Eric Biggers wrote:
>>>> Add an ARM NEON-accelerated implementation of Speck-XTS.  It operates on
>>>> 128-byte chunks at a time, i.e. 8 blocks for Speck128 or 16 blocks for
>>>> Speck64.  Each 128-byte chunk goes through XTS preprocessing, then is
>>>> encrypted/decrypted (doing one cipher round for all the blocks, then the
>>>> next round, etc.), then goes through XTS postprocessing.
>>>>
>>>> The performance depends on the processor but can be about 3 times faster
>>>> than the generic code.  For example, on an ARMv7 processor we observe
>>>> the following performance with Speck128/256-XTS:
>>>>
>>>>     xts-speck128-neon:     Encryption 107.9 MB/s, Decryption 108.1 MB/s
>>>>     xts(speck128-generic): Encryption  32.1 MB/s, Decryption  36.6 MB/s
>>>>
>>>> In comparison to AES-256-XTS without the Cryptography Extensions:
>>>>
>>>>     xts-aes-neonbs:        Encryption  41.2 MB/s, Decryption  36.7 MB/s
>>>>     xts(aes-asm):          Encryption  31.7 MB/s, Decryption  30.8 MB/s
>>>>     xts(aes-generic):      Encryption  21.2 MB/s, Decryption  20.9 MB/s
>>>>
>>>> Speck64/128-XTS is even faster:
>>>>
>>>>     xts-speck64-neon:      Encryption 138.6 MB/s, Decryption 139.1 MB/s
>>>>
>>>> Note that as with the generic code, only the Speck128 and Speck64
>>>> variants are supported.  Also, for now only the XTS mode of operation is
>>>> supported, to target the disk and file encryption use cases.  The NEON
>>>> code also only handles the portion of the data that is evenly divisible
>>>> into 128-byte chunks, with any remainder handled by a C fallback.  Of
>>>> course, other modes of operation could be added later if needed, and/or
>>>> the NEON code could be updated to handle other buffer sizes.
>>>>
>>>> The XTS specification is only defined for AES which has a 128-bit block
>>>> size, so for the GF(2^64) math needed for Speck64-XTS we use the
>>>> reducing polynomial 'x^64 + x^4 + x^3 + x + 1' given by the original XEX
>>>> paper.  Of course, when possible users should use Speck128-XTS, but even
>>>> that may be too slow on some processors; Speck64-XTS can be faster.
>>>>
>>>> Signed-off-by: Eric Biggers <ebiggers@google.com>
>>>> ---
>>>>  arch/arm/crypto/Kconfig           |   6 +
>>>>  arch/arm/crypto/Makefile          |   2 +
>>>>  arch/arm/crypto/speck-neon-core.S | 432 ++++++++++++++++++++++++++++++
>>>>  arch/arm/crypto/speck-neon-glue.c | 288 ++++++++++++++++++++
>>>>  4 files changed, 728 insertions(+)
>>>>  create mode 100644 arch/arm/crypto/speck-neon-core.S
>>>>  create mode 100644 arch/arm/crypto/speck-neon-glue.c
>>>>
>>>> diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
>>>> index b8e69fe282b8..925d1364727a 100644
>>>> --- a/arch/arm/crypto/Kconfig
>>>> +++ b/arch/arm/crypto/Kconfig
>>>> @@ -121,4 +121,10 @@ config CRYPTO_CHACHA20_NEON
>>>>       select CRYPTO_BLKCIPHER
>>>>       select CRYPTO_CHACHA20
>>>>
>>>> +config CRYPTO_SPECK_NEON
>>>> +     tristate "NEON accelerated Speck cipher algorithms"
>>>> +     depends on KERNEL_MODE_NEON
>>>> +     select CRYPTO_BLKCIPHER
>>>> +     select CRYPTO_SPECK
>>>> +
>>>>  endif
>>>> diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
>>>> index 30ef8e291271..a758107c5525 100644
>>>> --- a/arch/arm/crypto/Makefile
>>>> +++ b/arch/arm/crypto/Makefile
>>>> @@ -10,6 +10,7 @@ obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
>>>>  obj-$(CONFIG_CRYPTO_SHA256_ARM) += sha256-arm.o
>>>>  obj-$(CONFIG_CRYPTO_SHA512_ARM) += sha512-arm.o
>>>>  obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha20-neon.o
>>>> +obj-$(CONFIG_CRYPTO_SPECK_NEON) += speck-neon.o
>>>>
>>>>  ce-obj-$(CONFIG_CRYPTO_AES_ARM_CE) += aes-arm-ce.o
>>>>  ce-obj-$(CONFIG_CRYPTO_SHA1_ARM_CE) += sha1-arm-ce.o
>>>> @@ -53,6 +54,7 @@ ghash-arm-ce-y      := ghash-ce-core.o ghash-ce-glue.o
>>>>  crct10dif-arm-ce-y   := crct10dif-ce-core.o crct10dif-ce-glue.o
>>>>  crc32-arm-ce-y:= crc32-ce-core.o crc32-ce-glue.o
>>>>  chacha20-neon-y := chacha20-neon-core.o chacha20-neon-glue.o
>>>> +speck-neon-y := speck-neon-core.o speck-neon-glue.o
>>>>
>>>>  quiet_cmd_perl = PERL    $@
>>>>        cmd_perl = $(PERL) $(<) > $(@)
>>>> diff --git a/arch/arm/crypto/speck-neon-core.S b/arch/arm/crypto/speck-neon-core.S
>>>> new file mode 100644
>>>> index 000000000000..3c1e203e53b9
>>>> --- /dev/null
>>>> +++ b/arch/arm/crypto/speck-neon-core.S
>>>> @@ -0,0 +1,432 @@
>>>> +// SPDX-License-Identifier: GPL-2.0
>>>> +/*
>>>> + * NEON-accelerated implementation of Speck128-XTS and Speck64-XTS
>>>> + *
>>>> + * Copyright (c) 2018 Google, Inc
>>>> + *
>>>> + * Author: Eric Biggers <ebiggers@google.com>
>>>> + */
>>>> +
>>>> +#include <linux/linkage.h>
>>>> +
>>>> +     .text
>>>> +     .fpu            neon
>>>> +
>>>> +     // arguments
>>>> +     ROUND_KEYS      .req    r0      // const {u64,u32} *round_keys
>>>> +     NROUNDS         .req    r1      // int nrounds
>>>> +     DST             .req    r2      // void *dst
>>>> +     SRC             .req    r3      // const void *src
>>>> +     NBYTES          .req    r4      // unsigned int nbytes
>>>> +     TWEAK           .req    r5      // void *tweak
>>>> +
>>>> +     // registers which hold the data being encrypted/decrypted
>>>> +     X0              .req    q0
>>>> +     X0_L            .req    d0
>>>> +     X0_H            .req    d1
>>>> +     Y0              .req    q1
>>>> +     Y0_H            .req    d3
>>>> +     X1              .req    q2
>>>> +     X1_L            .req    d4
>>>> +     X1_H            .req    d5
>>>> +     Y1              .req    q3
>>>> +     Y1_H            .req    d7
>>>> +     X2              .req    q4
>>>> +     X2_L            .req    d8
>>>> +     X2_H            .req    d9
>>>> +     Y2              .req    q5
>>>> +     Y2_H            .req    d11
>>>> +     X3              .req    q6
>>>> +     X3_L            .req    d12
>>>> +     X3_H            .req    d13
>>>> +     Y3              .req    q7
>>>> +     Y3_H            .req    d15
>>>> +
>>>> +     // the round key, duplicated in all lanes
>>>> +     ROUND_KEY       .req    q8
>>>> +     ROUND_KEY_L     .req    d16
>>>> +     ROUND_KEY_H     .req    d17
>>>> +
>>>> +     // index vector for vtbl-based 8-bit rotates
>>>> +     ROTATE_TABLE    .req    d18
>>>> +
>>>> +     // multiplication table for updating XTS tweaks
>>>> +     GF128MUL_TABLE  .req    d19
>>>> +     GF64MUL_TABLE   .req    d19
>>>> +
>>>> +     // current XTS tweak value(s)
>>>> +     TWEAKV          .req    q10
>>>> +     TWEAKV_L        .req    d20
>>>> +     TWEAKV_H        .req    d21
>>>> +
>>>> +     TMP0            .req    q12
>>>> +     TMP0_L          .req    d24
>>>> +     TMP0_H          .req    d25
>>>> +     TMP1            .req    q13
>>>> +     TMP2            .req    q14
>>>> +     TMP3            .req    q15
>>>> +
>>>> +     .align          4
>>>> +.Lror64_8_table:
>>>> +     .byte           1, 2, 3, 4, 5, 6, 7, 0
>>>> +.Lror32_8_table:
>>>> +     .byte           1, 2, 3, 0, 5, 6, 7, 4
>>>> +.Lrol64_8_table:
>>>> +     .byte           7, 0, 1, 2, 3, 4, 5, 6
>>>> +.Lrol32_8_table:
>>>> +     .byte           3, 0, 1, 2, 7, 4, 5, 6
>>>> +.Lgf128mul_table:
>>>> +     .byte           0, 0x87
>>>> +     .fill           14
>>>> +.Lgf64mul_table:
>>>> +     .byte           0, 0x1b, (0x1b << 1), (0x1b << 1) ^ 0x1b
>>>> +     .fill           12
>>>> +
>>>> +/*
>>>> + * _speck_round_128bytes() - Speck encryption round on 128 bytes at a time
>>>> + *
>>>> + * Do one Speck encryption round on the 128 bytes (8 blocks for Speck128, 16 for
>>>> + * Speck64) stored in X0-X3 and Y0-Y3, using the round key stored in all lanes
>>>> + * of ROUND_KEY.  'n' is the lane size: 64 for Speck128, or 32 for Speck64.
>>>> + *
>>>> + * The 8-bit rotates are implemented using vtbl instead of vshr + vsli because
>>>> + * the vtbl approach is faster on some processors and the same speed on others.
>>>> + */
>>>> +.macro _speck_round_128bytes n
>>>> +
>>>> +     // x = ror(x, 8)
>>>> +     vtbl.8          X0_L, {X0_L}, ROTATE_TABLE
>>>> +     vtbl.8          X0_H, {X0_H}, ROTATE_TABLE
>>>> +     vtbl.8          X1_L, {X1_L}, ROTATE_TABLE
>>>> +     vtbl.8          X1_H, {X1_H}, ROTATE_TABLE
>>>> +     vtbl.8          X2_L, {X2_L}, ROTATE_TABLE
>>>> +     vtbl.8          X2_H, {X2_H}, ROTATE_TABLE
>>>> +     vtbl.8          X3_L, {X3_L}, ROTATE_TABLE
>>>> +     vtbl.8          X3_H, {X3_H}, ROTATE_TABLE
>>>> +
>>>> +     // x += y
>>>> +     vadd.u\n        X0, Y0
>>>> +     vadd.u\n        X1, Y1
>>>> +     vadd.u\n        X2, Y2
>>>> +     vadd.u\n        X3, Y3
>>>> +
>>>> +     // x ^= k
>>>> +     veor            X0, ROUND_KEY
>>>> +     veor            X1, ROUND_KEY
>>>> +     veor            X2, ROUND_KEY
>>>> +     veor            X3, ROUND_KEY
>>>> +
>>>> +     // y = rol(y, 3)
>>>> +     vshl.u\n        TMP0, Y0, #3
>>>> +     vshl.u\n        TMP1, Y1, #3
>>>> +     vshl.u\n        TMP2, Y2, #3
>>>> +     vshl.u\n        TMP3, Y3, #3
>>>> +     vsri.u\n        TMP0, Y0, #(\n - 3)
>>>> +     vsri.u\n        TMP1, Y1, #(\n - 3)
>>>> +     vsri.u\n        TMP2, Y2, #(\n - 3)
>>>> +     vsri.u\n        TMP3, Y3, #(\n - 3)
>>>> +
>>>> +     // y ^= x
>>>> +     veor            Y0, TMP0, X0
>>>> +     veor            Y1, TMP1, X1
>>>> +     veor            Y2, TMP2, X2
>>>> +     veor            Y3, TMP3, X3
>>>> +.endm
>>>> +
>>>> +/*
>>>> + * _speck_unround_128bytes() - Speck decryption round on 128 bytes at a time
>>>> + *
>>>> + * This is the inverse of _speck_round_128bytes().
>>>> + */
>>>> +.macro _speck_unround_128bytes       n
>>>> +
>>>> +     // y ^= x
>>>> +     veor            TMP0, Y0, X0
>>>> +     veor            TMP1, Y1, X1
>>>> +     veor            TMP2, Y2, X2
>>>> +     veor            TMP3, Y3, X3
>>>> +
>>>> +     // y = ror(y, 3)
>>>> +     vshr.u\n        Y0, TMP0, #3
>>>> +     vshr.u\n        Y1, TMP1, #3
>>>> +     vshr.u\n        Y2, TMP2, #3
>>>> +     vshr.u\n        Y3, TMP3, #3
>>>> +     vsli.u\n        Y0, TMP0, #(\n - 3)
>>>> +     vsli.u\n        Y1, TMP1, #(\n - 3)
>>>> +     vsli.u\n        Y2, TMP2, #(\n - 3)
>>>> +     vsli.u\n        Y3, TMP3, #(\n - 3)
>>>> +
>>>> +     // x ^= k
>>>> +     veor            X0, ROUND_KEY
>>>> +     veor            X1, ROUND_KEY
>>>> +     veor            X2, ROUND_KEY
>>>> +     veor            X3, ROUND_KEY
>>>> +
>>>> +     // x -= y
>>>> +     vsub.u\n        X0, Y0
>>>> +     vsub.u\n        X1, Y1
>>>> +     vsub.u\n        X2, Y2
>>>> +     vsub.u\n        X3, Y3
>>>> +
>>>> +     // x = rol(x, 8);
>>>> +     vtbl.8          X0_L, {X0_L}, ROTATE_TABLE
>>>> +     vtbl.8          X0_H, {X0_H}, ROTATE_TABLE
>>>> +     vtbl.8          X1_L, {X1_L}, ROTATE_TABLE
>>>> +     vtbl.8          X1_H, {X1_H}, ROTATE_TABLE
>>>> +     vtbl.8          X2_L, {X2_L}, ROTATE_TABLE
>>>> +     vtbl.8          X2_H, {X2_H}, ROTATE_TABLE
>>>> +     vtbl.8          X3_L, {X3_L}, ROTATE_TABLE
>>>> +     vtbl.8          X3_H, {X3_H}, ROTATE_TABLE
>>>> +.endm
>>>> +
>>>> +.macro _xts128_precrypt_one  dst_reg, tweak_buf, tmp
>>>> +
>>>> +     // Load the next source block
>>>> +     vld1.8          {\dst_reg}, [SRC]!
>>>> +
>>>> +     // Save the current tweak in the tweak buffer
>>>> +     vst1.8          {TWEAKV}, [\tweak_buf:128]!
>>>> +
>>>> +     // XOR the next source block with the current tweak
>>>> +     veor            \dst_reg, TWEAKV
>>>> +
>>>> +     /*
>>>> +      * Calculate the next tweak by multiplying the current one by x,
>>>> +      * modulo p(x) = x^128 + x^7 + x^2 + x + 1.
>>>> +      */
>>>> +     vshr.u64        \tmp, TWEAKV, #63
>>>> +     vshl.u64        TWEAKV, #1
>>>> +     veor            TWEAKV_H, \tmp\()_L
>>>> +     vtbl.8          \tmp\()_H, {GF128MUL_TABLE}, \tmp\()_H
>>>> +     veor            TWEAKV_L, \tmp\()_H
>>>> +.endm
>>>> +
>>>> +.macro _xts64_precrypt_two   dst_reg, tweak_buf, tmp
>>>> +
>>>> +     // Load the next two source blocks
>>>> +     vld1.8          {\dst_reg}, [SRC]!
>>>> +
>>>> +     // Save the current two tweaks in the tweak buffer
>>>> +     vst1.8          {TWEAKV}, [\tweak_buf:128]!
>>>> +
>>>> +     // XOR the next two source blocks with the current two tweaks
>>>> +     veor            \dst_reg, TWEAKV
>>>> +
>>>> +     /*
>>>> +      * Calculate the next two tweaks by multiplying the current ones by x^2,
>>>> +      * modulo p(x) = x^64 + x^4 + x^3 + x + 1.
>>>> +      */
>>>> +     vshr.u64        \tmp, TWEAKV, #62
>>>> +     vshl.u64        TWEAKV, #2
>>>> +     vtbl.8          \tmp\()_L, {GF64MUL_TABLE}, \tmp\()_L
>>>> +     vtbl.8          \tmp\()_H, {GF64MUL_TABLE}, \tmp\()_H
>>>> +     veor            TWEAKV, \tmp
>>>> +.endm
>>>> +
>>>> +/*
>>>> + * _speck_xts_crypt() - Speck-XTS encryption/decryption
>>>> + *
>>>> + * Encrypt or decrypt NBYTES bytes of data from the SRC buffer to the DST buffer
>>>> + * using Speck-XTS, specifically the variant with a block size of '2n' and round
>>>> + * count given by NROUNDS.  The expanded round keys are given in ROUND_KEYS, and
>>>> + * the current XTS tweak value is given in TWEAK.  It's assumed that NBYTES is a
>>>> + * nonzero multiple of 128.
>>>> + */
>>>> +.macro _speck_xts_crypt      n, decrypting
>>>> +     push            {r4-r7}
>>>> +     mov             r7, sp
>>>> +
>>>> +     /*
>>>> +      * The first four parameters were passed in registers r0-r3.  Load the
>>>> +      * additional parameters, which were passed on the stack.
>>>> +      */
>>>> +     ldr             NBYTES, [sp, #16]
>>>> +     ldr             TWEAK, [sp, #20]
>>>> +
>>>> +     /*
>>>> +      * If decrypting, modify the ROUND_KEYS parameter to point to the last
>>>> +      * round key rather than the first, since for decryption the round keys
>>>> +      * are used in reverse order.
>>>> +      */
>>>> +.if \decrypting
>>>> +.if \n == 64
>>>> +     add             ROUND_KEYS, ROUND_KEYS, NROUNDS, lsl #3
>>>> +     sub             ROUND_KEYS, #8
>>>> +.else
>>>> +     add             ROUND_KEYS, ROUND_KEYS, NROUNDS, lsl #2
>>>> +     sub             ROUND_KEYS, #4
>>>> +.endif
>>>> +.endif
>>>> +
>>>> +     // Load the index vector for vtbl-based 8-bit rotates
>>>> +.if \decrypting
>>>> +     ldr             r12, =.Lrol\n\()_8_table
>>>> +.else
>>>> +     ldr             r12, =.Lror\n\()_8_table
>>>> +.endif
>>>> +     vld1.8          {ROTATE_TABLE}, [r12:64]
>>>> +
>>>> +     // One-time XTS preparation
>>>> +
>>>> +     /*
>>>> +      * Allocate stack space to store 128 bytes worth of tweaks.  For
>>>> +      * performance, this space is aligned to a 16-byte boundary so that we
>>>> +      * can use the load/store instructions that declare 16-byte alignment.
>>>> +      */
>>>> +     sub             sp, #128
>>>> +     bic             sp, #0xf
>>>
>>>
>>> This fails here when building with CONFIG_THUMB2_KERNEL=y
>>>
>>>   AS      arch/arm/crypto/speck-neon-core.o
>>>
>>> arch/arm/crypto/speck-neon-core.S: Assembler messages:
>>>
>>> arch/arm/crypto/speck-neon-core.S:419: Error: r13 not allowed here --
>>> `bic sp,#0xf'
>>> arch/arm/crypto/speck-neon-core.S:423: Error: r13 not allowed here --
>>> `bic sp,#0xf'
>>> arch/arm/crypto/speck-neon-core.S:427: Error: r13 not allowed here --
>>> `bic sp,#0xf'
>>> arch/arm/crypto/speck-neon-core.S:431: Error: r13 not allowed here --
>>> `bic sp,#0xf'
>>>
>>> In a quick hack this change seems to address it:
>>>
>>>
>>> -       sub             sp, #128
>>> -       bic             sp, #0xf
>>> +       mov             r6, sp
>>> +       sub             r6, #128
>>> +       bic             r6, #0xf
>>> +       mov             sp, r6
>>>
>>> But there is probably a better solution to address this.
>>>
>>
>> Given that there is no NEON on M class cores, I recommend we put something like
>>
>> THUMB(bx pc)
>> THUMB(nop.w)
>> THUMB(.arm)
>>
>> at the beginning and be done with it.
> 
> I mean nop.n or just nop, of course, and we may need a '.align 2' at
> the beginning as well.

Wouldn't it be preferable to have it assemble in Thumb2 too? It seems
that bic sp,#0xf is the only issue...
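
For illustration, a sketch of what a Thumb2-clean replacement for just that
sequence could look like, assuming a scratch register is acceptable there
(r6 is at least preserved by the push {r4-r7} at the top of the macro, though
whether it is otherwise unused would need checking against the full file):

	// Thumb2 does not allow sp as the destination of bic, so compute
	// the new, 16-byte-aligned stack pointer in a scratch register.
	sub		r6, sp, #128
	bic		r6, #0xf
	mov		sp, r6

That would keep the file building for both ARM and Thumb2 kernels, at the cost
of one extra instruction and a scratch register.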

--
Stefan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 3/5] crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS
@ 2018-06-17 10:41           ` Stefan Agner
  0 siblings, 0 replies; 36+ messages in thread
From: Stefan Agner @ 2018-06-17 10:41 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Eric Biggers, open list:HARDWARE RANDOM NUMBER GENERATOR CORE,
	Herbert Xu, linux-fscrypt, linux-arm-kernel, Jeffrey Walton,
	Paul Crowley, Patrik Torstensson, Greg Kaiser, Paul Lawrence,
	Michael Halcrow, Alex Cope, Greg Kroah-Hartman,
	linux-crypto-owner

On 17.06.2018 11:40, Ard Biesheuvel wrote:
> On 17 June 2018 at 11:30, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>> On 17 June 2018 at 00:40, Stefan Agner <stefan@agner.ch> wrote:
>>> Hi Eric,
>>>
>>> On 14.02.2018 19:42, Eric Biggers wrote:
>>>> Add an ARM NEON-accelerated implementation of Speck-XTS.  It operates on
>>>> 128-byte chunks at a time, i.e. 8 blocks for Speck128 or 16 blocks for
>>>> Speck64.  Each 128-byte chunk goes through XTS preprocessing, then is
>>>> encrypted/decrypted (doing one cipher round for all the blocks, then the
>>>> next round, etc.), then goes through XTS postprocessing.
>>>>
>>>> The performance depends on the processor but can be about 3 times faster
>>>> than the generic code.  For example, on an ARMv7 processor we observe
>>>> the following performance with Speck128/256-XTS:
>>>>
>>>>     xts-speck128-neon:     Encryption 107.9 MB/s, Decryption 108.1 MB/s
>>>>     xts(speck128-generic): Encryption  32.1 MB/s, Decryption  36.6 MB/s
>>>>
>>>> In comparison to AES-256-XTS without the Cryptography Extensions:
>>>>
>>>>     xts-aes-neonbs:        Encryption  41.2 MB/s, Decryption  36.7 MB/s
>>>>     xts(aes-asm):          Encryption  31.7 MB/s, Decryption  30.8 MB/s
>>>>     xts(aes-generic):      Encryption  21.2 MB/s, Decryption  20.9 MB/s
>>>>
>>>> Speck64/128-XTS is even faster:
>>>>
>>>>     xts-speck64-neon:      Encryption 138.6 MB/s, Decryption 139.1 MB/s
>>>>
>>>> Note that as with the generic code, only the Speck128 and Speck64
>>>> variants are supported.  Also, for now only the XTS mode of operation is
>>>> supported, to target the disk and file encryption use cases.  The NEON
>>>> code also only handles the portion of the data that is evenly divisible
>>>> into 128-byte chunks, with any remainder handled by a C fallback.  Of
>>>> course, other modes of operation could be added later if needed, and/or
>>>> the NEON code could be updated to handle other buffer sizes.
>>>>
>>>> The XTS specification is only defined for AES which has a 128-bit block
>>>> size, so for the GF(2^64) math needed for Speck64-XTS we use the
>>>> reducing polynomial 'x^64 + x^4 + x^3 + x + 1' given by the original XEX
>>>> paper.  Of course, when possible users should use Speck128-XTS, but even
>>>> that may be too slow on some processors; Speck64-XTS can be faster.
>>>>
>>>> Signed-off-by: Eric Biggers <ebiggers@google.com>
>>>> ---
>>>>  arch/arm/crypto/Kconfig           |   6 +
>>>>  arch/arm/crypto/Makefile          |   2 +
>>>>  arch/arm/crypto/speck-neon-core.S | 432 ++++++++++++++++++++++++++++++
>>>>  arch/arm/crypto/speck-neon-glue.c | 288 ++++++++++++++++++++
>>>>  4 files changed, 728 insertions(+)
>>>>  create mode 100644 arch/arm/crypto/speck-neon-core.S
>>>>  create mode 100644 arch/arm/crypto/speck-neon-glue.c
>>>>
>>>> diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
>>>> index b8e69fe282b8..925d1364727a 100644
>>>> --- a/arch/arm/crypto/Kconfig
>>>> +++ b/arch/arm/crypto/Kconfig
>>>> @@ -121,4 +121,10 @@ config CRYPTO_CHACHA20_NEON
>>>>       select CRYPTO_BLKCIPHER
>>>>       select CRYPTO_CHACHA20
>>>>
>>>> +config CRYPTO_SPECK_NEON
>>>> +     tristate "NEON accelerated Speck cipher algorithms"
>>>> +     depends on KERNEL_MODE_NEON
>>>> +     select CRYPTO_BLKCIPHER
>>>> +     select CRYPTO_SPECK
>>>> +
>>>>  endif
>>>> diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
>>>> index 30ef8e291271..a758107c5525 100644
>>>> --- a/arch/arm/crypto/Makefile
>>>> +++ b/arch/arm/crypto/Makefile
>>>> @@ -10,6 +10,7 @@ obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
>>>>  obj-$(CONFIG_CRYPTO_SHA256_ARM) += sha256-arm.o
>>>>  obj-$(CONFIG_CRYPTO_SHA512_ARM) += sha512-arm.o
>>>>  obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha20-neon.o
>>>> +obj-$(CONFIG_CRYPTO_SPECK_NEON) += speck-neon.o
>>>>
>>>>  ce-obj-$(CONFIG_CRYPTO_AES_ARM_CE) += aes-arm-ce.o
>>>>  ce-obj-$(CONFIG_CRYPTO_SHA1_ARM_CE) += sha1-arm-ce.o
>>>> @@ -53,6 +54,7 @@ ghash-arm-ce-y      := ghash-ce-core.o ghash-ce-glue.o
>>>>  crct10dif-arm-ce-y   := crct10dif-ce-core.o crct10dif-ce-glue.o
>>>>  crc32-arm-ce-y:= crc32-ce-core.o crc32-ce-glue.o
>>>>  chacha20-neon-y := chacha20-neon-core.o chacha20-neon-glue.o
>>>> +speck-neon-y := speck-neon-core.o speck-neon-glue.o
>>>>
>>>>  quiet_cmd_perl = PERL    $@
>>>>        cmd_perl = $(PERL) $(<) > $(@)
>>>> diff --git a/arch/arm/crypto/speck-neon-core.S
>>>> b/arch/arm/crypto/speck-neon-core.S
>>>> new file mode 100644
>>>> index 000000000000..3c1e203e53b9
>>>> --- /dev/null
>>>> +++ b/arch/arm/crypto/speck-neon-core.S
>>>> @@ -0,0 +1,432 @@
>>>> +// SPDX-License-Identifier: GPL-2.0
>>>> +/*
>>>> + * NEON-accelerated implementation of Speck128-XTS and Speck64-XTS
>>>> + *
>>>> + * Copyright (c) 2018 Google, Inc
>>>> + *
>>>> + * Author: Eric Biggers <ebiggers@google.com>
>>>> + */
>>>> +
>>>> +#include <linux/linkage.h>
>>>> +
>>>> +     .text
>>>> +     .fpu            neon
>>>> +
>>>> +     // arguments
>>>> +     ROUND_KEYS      .req    r0      // const {u64,u32} *round_keys
>>>> +     NROUNDS         .req    r1      // int nrounds
>>>> +     DST             .req    r2      // void *dst
>>>> +     SRC             .req    r3      // const void *src
>>>> +     NBYTES          .req    r4      // unsigned int nbytes
>>>> +     TWEAK           .req    r5      // void *tweak
>>>> +
>>>> +     // registers which hold the data being encrypted/decrypted
>>>> +     X0              .req    q0
>>>> +     X0_L            .req    d0
>>>> +     X0_H            .req    d1
>>>> +     Y0              .req    q1
>>>> +     Y0_H            .req    d3
>>>> +     X1              .req    q2
>>>> +     X1_L            .req    d4
>>>> +     X1_H            .req    d5
>>>> +     Y1              .req    q3
>>>> +     Y1_H            .req    d7
>>>> +     X2              .req    q4
>>>> +     X2_L            .req    d8
>>>> +     X2_H            .req    d9
>>>> +     Y2              .req    q5
>>>> +     Y2_H            .req    d11
>>>> +     X3              .req    q6
>>>> +     X3_L            .req    d12
>>>> +     X3_H            .req    d13
>>>> +     Y3              .req    q7
>>>> +     Y3_H            .req    d15
>>>> +
>>>> +     // the round key, duplicated in all lanes
>>>> +     ROUND_KEY       .req    q8
>>>> +     ROUND_KEY_L     .req    d16
>>>> +     ROUND_KEY_H     .req    d17
>>>> +
>>>> +     // index vector for vtbl-based 8-bit rotates
>>>> +     ROTATE_TABLE    .req    d18
>>>> +
>>>> +     // multiplication table for updating XTS tweaks
>>>> +     GF128MUL_TABLE  .req    d19
>>>> +     GF64MUL_TABLE   .req    d19
>>>> +
>>>> +     // current XTS tweak value(s)
>>>> +     TWEAKV          .req    q10
>>>> +     TWEAKV_L        .req    d20
>>>> +     TWEAKV_H        .req    d21
>>>> +
>>>> +     TMP0            .req    q12
>>>> +     TMP0_L          .req    d24
>>>> +     TMP0_H          .req    d25
>>>> +     TMP1            .req    q13
>>>> +     TMP2            .req    q14
>>>> +     TMP3            .req    q15
>>>> +
>>>> +     .align          4
>>>> +.Lror64_8_table:
>>>> +     .byte           1, 2, 3, 4, 5, 6, 7, 0
>>>> +.Lror32_8_table:
>>>> +     .byte           1, 2, 3, 0, 5, 6, 7, 4
>>>> +.Lrol64_8_table:
>>>> +     .byte           7, 0, 1, 2, 3, 4, 5, 6
>>>> +.Lrol32_8_table:
>>>> +     .byte           3, 0, 1, 2, 7, 4, 5, 6
>>>> +.Lgf128mul_table:
>>>> +     .byte           0, 0x87
>>>> +     .fill           14
>>>> +.Lgf64mul_table:
>>>> +     .byte           0, 0x1b, (0x1b << 1), (0x1b << 1) ^ 0x1b
>>>> +     .fill           12
>>>> +
>>>> +/*
>>>> + * _speck_round_128bytes() - Speck encryption round on 128 bytes at a time
>>>> + *
>>>> + * Do one Speck encryption round on the 128 bytes (8 blocks for
>>>> Speck128, 16 for
>>>> + * Speck64) stored in X0-X3 and Y0-Y3, using the round key stored in all lanes
>>>> + * of ROUND_KEY.  'n' is the lane size: 64 for Speck128, or 32 for Speck64.
>>>> + *
>>>> + * The 8-bit rotates are implemented using vtbl instead of vshr + vsli because
>>>> + * the vtbl approach is faster on some processors and the same speed on others.
>>>> + */
>>>> +.macro _speck_round_128bytes n
>>>> +
>>>> +     // x = ror(x, 8)
>>>> +     vtbl.8          X0_L, {X0_L}, ROTATE_TABLE
>>>> +     vtbl.8          X0_H, {X0_H}, ROTATE_TABLE
>>>> +     vtbl.8          X1_L, {X1_L}, ROTATE_TABLE
>>>> +     vtbl.8          X1_H, {X1_H}, ROTATE_TABLE
>>>> +     vtbl.8          X2_L, {X2_L}, ROTATE_TABLE
>>>> +     vtbl.8          X2_H, {X2_H}, ROTATE_TABLE
>>>> +     vtbl.8          X3_L, {X3_L}, ROTATE_TABLE
>>>> +     vtbl.8          X3_H, {X3_H}, ROTATE_TABLE
>>>> +
>>>> +     // x += y
>>>> +     vadd.u\n        X0, Y0
>>>> +     vadd.u\n        X1, Y1
>>>> +     vadd.u\n        X2, Y2
>>>> +     vadd.u\n        X3, Y3
>>>> +
>>>> +     // x ^= k
>>>> +     veor            X0, ROUND_KEY
>>>> +     veor            X1, ROUND_KEY
>>>> +     veor            X2, ROUND_KEY
>>>> +     veor            X3, ROUND_KEY
>>>> +
>>>> +     // y = rol(y, 3)
>>>> +     vshl.u\n        TMP0, Y0, #3
>>>> +     vshl.u\n        TMP1, Y1, #3
>>>> +     vshl.u\n        TMP2, Y2, #3
>>>> +     vshl.u\n        TMP3, Y3, #3
>>>> +     vsri.u\n        TMP0, Y0, #(\n - 3)
>>>> +     vsri.u\n        TMP1, Y1, #(\n - 3)
>>>> +     vsri.u\n        TMP2, Y2, #(\n - 3)
>>>> +     vsri.u\n        TMP3, Y3, #(\n - 3)
>>>> +
>>>> +     // y ^= x
>>>> +     veor            Y0, TMP0, X0
>>>> +     veor            Y1, TMP1, X1
>>>> +     veor            Y2, TMP2, X2
>>>> +     veor            Y3, TMP3, X3
>>>> +.endm
>>>> +
>>>> +/*
>>>> + * _speck_unround_128bytes() - Speck decryption round on 128 bytes at a time
>>>> + *
>>>> + * This is the inverse of _speck_round_128bytes().
>>>> + */
>>>> +.macro _speck_unround_128bytes       n
>>>> +
>>>> +     // y ^= x
>>>> +     veor            TMP0, Y0, X0
>>>> +     veor            TMP1, Y1, X1
>>>> +     veor            TMP2, Y2, X2
>>>> +     veor            TMP3, Y3, X3
>>>> +
>>>> +     // y = ror(y, 3)
>>>> +     vshr.u\n        Y0, TMP0, #3
>>>> +     vshr.u\n        Y1, TMP1, #3
>>>> +     vshr.u\n        Y2, TMP2, #3
>>>> +     vshr.u\n        Y3, TMP3, #3
>>>> +     vsli.u\n        Y0, TMP0, #(\n - 3)
>>>> +     vsli.u\n        Y1, TMP1, #(\n - 3)
>>>> +     vsli.u\n        Y2, TMP2, #(\n - 3)
>>>> +     vsli.u\n        Y3, TMP3, #(\n - 3)
>>>> +
>>>> +     // x ^= k
>>>> +     veor            X0, ROUND_KEY
>>>> +     veor            X1, ROUND_KEY
>>>> +     veor            X2, ROUND_KEY
>>>> +     veor            X3, ROUND_KEY
>>>> +
>>>> +     // x -= y
>>>> +     vsub.u\n        X0, Y0
>>>> +     vsub.u\n        X1, Y1
>>>> +     vsub.u\n        X2, Y2
>>>> +     vsub.u\n        X3, Y3
>>>> +
>>>> +     // x = rol(x, 8);
>>>> +     vtbl.8          X0_L, {X0_L}, ROTATE_TABLE
>>>> +     vtbl.8          X0_H, {X0_H}, ROTATE_TABLE
>>>> +     vtbl.8          X1_L, {X1_L}, ROTATE_TABLE
>>>> +     vtbl.8          X1_H, {X1_H}, ROTATE_TABLE
>>>> +     vtbl.8          X2_L, {X2_L}, ROTATE_TABLE
>>>> +     vtbl.8          X2_H, {X2_H}, ROTATE_TABLE
>>>> +     vtbl.8          X3_L, {X3_L}, ROTATE_TABLE
>>>> +     vtbl.8          X3_H, {X3_H}, ROTATE_TABLE
>>>> +.endm
>>>> +
>>>> +.macro _xts128_precrypt_one  dst_reg, tweak_buf, tmp
>>>> +
>>>> +     // Load the next source block
>>>> +     vld1.8          {\dst_reg}, [SRC]!
>>>> +
>>>> +     // Save the current tweak in the tweak buffer
>>>> +     vst1.8          {TWEAKV}, [\tweak_buf:128]!
>>>> +
>>>> +     // XOR the next source block with the current tweak
>>>> +     veor            \dst_reg, TWEAKV
>>>> +
>>>> +     /*
>>>> +      * Calculate the next tweak by multiplying the current one by x,
>>>> +      * modulo p(x) = x^128 + x^7 + x^2 + x + 1.
>>>> +      */
>>>> +     vshr.u64        \tmp, TWEAKV, #63
>>>> +     vshl.u64        TWEAKV, #1
>>>> +     veor            TWEAKV_H, \tmp\()_L
>>>> +     vtbl.8          \tmp\()_H, {GF128MUL_TABLE}, \tmp\()_H
>>>> +     veor            TWEAKV_L, \tmp\()_H
>>>> +.endm
>>>> +
>>>> +.macro _xts64_precrypt_two   dst_reg, tweak_buf, tmp
>>>> +
>>>> +     // Load the next two source blocks
>>>> +     vld1.8          {\dst_reg}, [SRC]!
>>>> +
>>>> +     // Save the current two tweaks in the tweak buffer
>>>> +     vst1.8          {TWEAKV}, [\tweak_buf:128]!
>>>> +
>>>> +     // XOR the next two source blocks with the current two tweaks
>>>> +     veor            \dst_reg, TWEAKV
>>>> +
>>>> +     /*
>>>> +      * Calculate the next two tweaks by multiplying the current ones by x^2,
>>>> +      * modulo p(x) = x^64 + x^4 + x^3 + x + 1.
>>>> +      */
>>>> +     vshr.u64        \tmp, TWEAKV, #62
>>>> +     vshl.u64        TWEAKV, #2
>>>> +     vtbl.8          \tmp\()_L, {GF64MUL_TABLE}, \tmp\()_L
>>>> +     vtbl.8          \tmp\()_H, {GF64MUL_TABLE}, \tmp\()_H
>>>> +     veor            TWEAKV, \tmp
>>>> +.endm
>>>> +
>>>> +/*
>>>> + * _speck_xts_crypt() - Speck-XTS encryption/decryption
>>>> + *
>>>> + * Encrypt or decrypt NBYTES bytes of data from the SRC buffer to the
>>>> DST buffer
>>>> + * using Speck-XTS, specifically the variant with a block size of
>>>> '2n' and round
>>>> + * count given by NROUNDS.  The expanded round keys are given in
>>>> ROUND_KEYS, and
>>>> + * the current XTS tweak value is given in TWEAK.  It's assumed that
>>>> NBYTES is a
>>>> + * nonzero multiple of 128.
>>>> + */
>>>> +.macro _speck_xts_crypt      n, decrypting
>>>> +     push            {r4-r7}
>>>> +     mov             r7, sp
>>>> +
>>>> +     /*
>>>> +      * The first four parameters were passed in registers r0-r3.  Load the
>>>> +      * additional parameters, which were passed on the stack.
>>>> +      */
>>>> +     ldr             NBYTES, [sp, #16]
>>>> +     ldr             TWEAK, [sp, #20]
>>>> +
>>>> +     /*
>>>> +      * If decrypting, modify the ROUND_KEYS parameter to point to the last
>>>> +      * round key rather than the first, since for decryption the round keys
>>>> +      * are used in reverse order.
>>>> +      */
>>>> +.if \decrypting
>>>> +.if \n == 64
>>>> +     add             ROUND_KEYS, ROUND_KEYS, NROUNDS, lsl #3
>>>> +     sub             ROUND_KEYS, #8
>>>> +.else
>>>> +     add             ROUND_KEYS, ROUND_KEYS, NROUNDS, lsl #2
>>>> +     sub             ROUND_KEYS, #4
>>>> +.endif
>>>> +.endif
>>>> +
>>>> +     // Load the index vector for vtbl-based 8-bit rotates
>>>> +.if \decrypting
>>>> +     ldr             r12, =.Lrol\n\()_8_table
>>>> +.else
>>>> +     ldr             r12, =.Lror\n\()_8_table
>>>> +.endif
>>>> +     vld1.8          {ROTATE_TABLE}, [r12:64]
>>>> +
>>>> +     // One-time XTS preparation
>>>> +
>>>> +     /*
>>>> +      * Allocate stack space to store 128 bytes worth of tweaks.  For
>>>> +      * performance, this space is aligned to a 16-byte boundary so that we
>>>> +      * can use the load/store instructions that declare 16-byte alignment.
>>>> +      */
>>>> +     sub             sp, #128
>>>> +     bic             sp, #0xf
>>>
>>>
>>> This fails here when building with CONFIG_THUMB2_KERNEL=y
>>>
>>>   AS      arch/arm/crypto/speck-neon-core.o
>>>
>>> arch/arm/crypto/speck-neon-core.S: Assembler messages:
>>>
>>> arch/arm/crypto/speck-neon-core.S:419: Error: r13 not allowed here --
>>> `bic sp,#0xf'
>>> arch/arm/crypto/speck-neon-core.S:423: Error: r13 not allowed here --
>>> `bic sp,#0xf'
>>> arch/arm/crypto/speck-neon-core.S:427: Error: r13 not allowed here --
>>> `bic sp,#0xf'
>>> arch/arm/crypto/speck-neon-core.S:431: Error: r13 not allowed here --
>>> `bic sp,#0xf'
>>>
>>> In a quick hack this change seems to address it:
>>>
>>>
>>> -       sub             sp, #128
>>> -       bic             sp, #0xf
>>> +       mov             r6, sp
>>> +       sub             r6, #128
>>> +       bic             r6, #0xf
>>> +       mov             sp, r6
>>>
>>> But there is probably a better solution to address this.
>>>
>>
>> Given that there is no NEON on M class cores, I recommend we put something like
>>
>> THUMB(bx pc)
>> THUMB(nop.w)
>> THUMB(.arm)
>>
>> at the beginning and be done with it.
> 
> I mean nop.n or just nop, of course, and we may need a '.align 2' at
> the beginning as well.

Wouldn't it be preferable to have it assemble it in Thumb2 too? It seems
that bic sp,#0xf is the only issue...

--
Stefan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH v3 3/5] crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS
@ 2018-06-17 10:41           ` Stefan Agner
  0 siblings, 0 replies; 36+ messages in thread
From: Stefan Agner @ 2018-06-17 10:41 UTC (permalink / raw)
  To: linux-arm-kernel

On 17.06.2018 11:40, Ard Biesheuvel wrote:
> On 17 June 2018 at 11:30, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>> On 17 June 2018 at 00:40, Stefan Agner <stefan@agner.ch> wrote:
>>> Hi Eric,
>>>
>>> On 14.02.2018 19:42, Eric Biggers wrote:
>>>> Add an ARM NEON-accelerated implementation of Speck-XTS.  It operates on
>>>> 128-byte chunks at a time, i.e. 8 blocks for Speck128 or 16 blocks for
>>>> Speck64.  Each 128-byte chunk goes through XTS preprocessing, then is
>>>> encrypted/decrypted (doing one cipher round for all the blocks, then the
>>>> next round, etc.), then goes through XTS postprocessing.
>>>>
>>>> The performance depends on the processor but can be about 3 times faster
>>>> than the generic code.  For example, on an ARMv7 processor we observe
>>>> the following performance with Speck128/256-XTS:
>>>>
>>>>     xts-speck128-neon:     Encryption 107.9 MB/s, Decryption 108.1 MB/s
>>>>     xts(speck128-generic): Encryption  32.1 MB/s, Decryption  36.6 MB/s
>>>>
>>>> In comparison to AES-256-XTS without the Cryptography Extensions:
>>>>
>>>>     xts-aes-neonbs:        Encryption  41.2 MB/s, Decryption  36.7 MB/s
>>>>     xts(aes-asm):          Encryption  31.7 MB/s, Decryption  30.8 MB/s
>>>>     xts(aes-generic):      Encryption  21.2 MB/s, Decryption  20.9 MB/s
>>>>
>>>> Speck64/128-XTS is even faster:
>>>>
>>>>     xts-speck64-neon:      Encryption 138.6 MB/s, Decryption 139.1 MB/s
>>>>
>>>> Note that as with the generic code, only the Speck128 and Speck64
>>>> variants are supported.  Also, for now only the XTS mode of operation is
>>>> supported, to target the disk and file encryption use cases.  The NEON
>>>> code also only handles the portion of the data that is evenly divisible
>>>> into 128-byte chunks, with any remainder handled by a C fallback.  Of
>>>> course, other modes of operation could be added later if needed, and/or
>>>> the NEON code could be updated to handle other buffer sizes.
>>>>
>>>> The XTS specification is only defined for AES which has a 128-bit block
>>>> size, so for the GF(2^64) math needed for Speck64-XTS we use the
>>>> reducing polynomial 'x^64 + x^4 + x^3 + x + 1' given by the original XEX
>>>> paper.  Of course, when possible users should use Speck128-XTS, but even
>>>> that may be too slow on some processors; Speck64-XTS can be faster.
>>>>
>>>> Signed-off-by: Eric Biggers <ebiggers@google.com>
>>>> ---
>>>>  arch/arm/crypto/Kconfig           |   6 +
>>>>  arch/arm/crypto/Makefile          |   2 +
>>>>  arch/arm/crypto/speck-neon-core.S | 432 ++++++++++++++++++++++++++++++
>>>>  arch/arm/crypto/speck-neon-glue.c | 288 ++++++++++++++++++++
>>>>  4 files changed, 728 insertions(+)
>>>>  create mode 100644 arch/arm/crypto/speck-neon-core.S
>>>>  create mode 100644 arch/arm/crypto/speck-neon-glue.c
>>>>
>>>> diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
>>>> index b8e69fe282b8..925d1364727a 100644
>>>> --- a/arch/arm/crypto/Kconfig
>>>> +++ b/arch/arm/crypto/Kconfig
>>>> @@ -121,4 +121,10 @@ config CRYPTO_CHACHA20_NEON
>>>>       select CRYPTO_BLKCIPHER
>>>>       select CRYPTO_CHACHA20
>>>>
>>>> +config CRYPTO_SPECK_NEON
>>>> +     tristate "NEON accelerated Speck cipher algorithms"
>>>> +     depends on KERNEL_MODE_NEON
>>>> +     select CRYPTO_BLKCIPHER
>>>> +     select CRYPTO_SPECK
>>>> +
>>>>  endif
>>>> diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
>>>> index 30ef8e291271..a758107c5525 100644
>>>> --- a/arch/arm/crypto/Makefile
>>>> +++ b/arch/arm/crypto/Makefile
>>>> @@ -10,6 +10,7 @@ obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
>>>>  obj-$(CONFIG_CRYPTO_SHA256_ARM) += sha256-arm.o
>>>>  obj-$(CONFIG_CRYPTO_SHA512_ARM) += sha512-arm.o
>>>>  obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha20-neon.o
>>>> +obj-$(CONFIG_CRYPTO_SPECK_NEON) += speck-neon.o
>>>>
>>>>  ce-obj-$(CONFIG_CRYPTO_AES_ARM_CE) += aes-arm-ce.o
>>>>  ce-obj-$(CONFIG_CRYPTO_SHA1_ARM_CE) += sha1-arm-ce.o
>>>> @@ -53,6 +54,7 @@ ghash-arm-ce-y      := ghash-ce-core.o ghash-ce-glue.o
>>>>  crct10dif-arm-ce-y   := crct10dif-ce-core.o crct10dif-ce-glue.o
>>>>  crc32-arm-ce-y:= crc32-ce-core.o crc32-ce-glue.o
>>>>  chacha20-neon-y := chacha20-neon-core.o chacha20-neon-glue.o
>>>> +speck-neon-y := speck-neon-core.o speck-neon-glue.o
>>>>
>>>>  quiet_cmd_perl = PERL    $@
>>>>        cmd_perl = $(PERL) $(<) > $(@)
>>>> diff --git a/arch/arm/crypto/speck-neon-core.S
>>>> b/arch/arm/crypto/speck-neon-core.S
>>>> new file mode 100644
>>>> index 000000000000..3c1e203e53b9
>>>> --- /dev/null
>>>> +++ b/arch/arm/crypto/speck-neon-core.S
>>>> @@ -0,0 +1,432 @@
>>>> +// SPDX-License-Identifier: GPL-2.0
>>>> +/*
>>>> + * NEON-accelerated implementation of Speck128-XTS and Speck64-XTS
>>>> + *
>>>> + * Copyright (c) 2018 Google, Inc
>>>> + *
>>>> + * Author: Eric Biggers <ebiggers@google.com>
>>>> + */
>>>> +
>>>> +#include <linux/linkage.h>
>>>> +
>>>> +     .text
>>>> +     .fpu            neon
>>>> +
>>>> +     // arguments
>>>> +     ROUND_KEYS      .req    r0      // const {u64,u32} *round_keys
>>>> +     NROUNDS         .req    r1      // int nrounds
>>>> +     DST             .req    r2      // void *dst
>>>> +     SRC             .req    r3      // const void *src
>>>> +     NBYTES          .req    r4      // unsigned int nbytes
>>>> +     TWEAK           .req    r5      // void *tweak
>>>> +
>>>> +     // registers which hold the data being encrypted/decrypted
>>>> +     X0              .req    q0
>>>> +     X0_L            .req    d0
>>>> +     X0_H            .req    d1
>>>> +     Y0              .req    q1
>>>> +     Y0_H            .req    d3
>>>> +     X1              .req    q2
>>>> +     X1_L            .req    d4
>>>> +     X1_H            .req    d5
>>>> +     Y1              .req    q3
>>>> +     Y1_H            .req    d7
>>>> +     X2              .req    q4
>>>> +     X2_L            .req    d8
>>>> +     X2_H            .req    d9
>>>> +     Y2              .req    q5
>>>> +     Y2_H            .req    d11
>>>> +     X3              .req    q6
>>>> +     X3_L            .req    d12
>>>> +     X3_H            .req    d13
>>>> +     Y3              .req    q7
>>>> +     Y3_H            .req    d15
>>>> +
>>>> +     // the round key, duplicated in all lanes
>>>> +     ROUND_KEY       .req    q8
>>>> +     ROUND_KEY_L     .req    d16
>>>> +     ROUND_KEY_H     .req    d17
>>>> +
>>>> +     // index vector for vtbl-based 8-bit rotates
>>>> +     ROTATE_TABLE    .req    d18
>>>> +
>>>> +     // multiplication table for updating XTS tweaks
>>>> +     GF128MUL_TABLE  .req    d19
>>>> +     GF64MUL_TABLE   .req    d19
>>>> +
>>>> +     // current XTS tweak value(s)
>>>> +     TWEAKV          .req    q10
>>>> +     TWEAKV_L        .req    d20
>>>> +     TWEAKV_H        .req    d21
>>>> +
>>>> +     TMP0            .req    q12
>>>> +     TMP0_L          .req    d24
>>>> +     TMP0_H          .req    d25
>>>> +     TMP1            .req    q13
>>>> +     TMP2            .req    q14
>>>> +     TMP3            .req    q15
>>>> +
>>>> +     .align          4
>>>> +.Lror64_8_table:
>>>> +     .byte           1, 2, 3, 4, 5, 6, 7, 0
>>>> +.Lror32_8_table:
>>>> +     .byte           1, 2, 3, 0, 5, 6, 7, 4
>>>> +.Lrol64_8_table:
>>>> +     .byte           7, 0, 1, 2, 3, 4, 5, 6
>>>> +.Lrol32_8_table:
>>>> +     .byte           3, 0, 1, 2, 7, 4, 5, 6
>>>> +.Lgf128mul_table:
>>>> +     .byte           0, 0x87
>>>> +     .fill           14
>>>> +.Lgf64mul_table:
>>>> +     .byte           0, 0x1b, (0x1b << 1), (0x1b << 1) ^ 0x1b
>>>> +     .fill           12
>>>> +
>>>> +/*
>>>> + * _speck_round_128bytes() - Speck encryption round on 128 bytes at a time
>>>> + *
>>>> + * Do one Speck encryption round on the 128 bytes (8 blocks for
>>>> Speck128, 16 for
>>>> + * Speck64) stored in X0-X3 and Y0-Y3, using the round key stored in all lanes
>>>> + * of ROUND_KEY.  'n' is the lane size: 64 for Speck128, or 32 for Speck64.
>>>> + *
>>>> + * The 8-bit rotates are implemented using vtbl instead of vshr + vsli because
>>>> + * the vtbl approach is faster on some processors and the same speed on others.
>>>> + */
>>>> +.macro _speck_round_128bytes n
>>>> +
>>>> +     // x = ror(x, 8)
>>>> +     vtbl.8          X0_L, {X0_L}, ROTATE_TABLE
>>>> +     vtbl.8          X0_H, {X0_H}, ROTATE_TABLE
>>>> +     vtbl.8          X1_L, {X1_L}, ROTATE_TABLE
>>>> +     vtbl.8          X1_H, {X1_H}, ROTATE_TABLE
>>>> +     vtbl.8          X2_L, {X2_L}, ROTATE_TABLE
>>>> +     vtbl.8          X2_H, {X2_H}, ROTATE_TABLE
>>>> +     vtbl.8          X3_L, {X3_L}, ROTATE_TABLE
>>>> +     vtbl.8          X3_H, {X3_H}, ROTATE_TABLE
>>>> +
>>>> +     // x += y
>>>> +     vadd.u\n        X0, Y0
>>>> +     vadd.u\n        X1, Y1
>>>> +     vadd.u\n        X2, Y2
>>>> +     vadd.u\n        X3, Y3
>>>> +
>>>> +     // x ^= k
>>>> +     veor            X0, ROUND_KEY
>>>> +     veor            X1, ROUND_KEY
>>>> +     veor            X2, ROUND_KEY
>>>> +     veor            X3, ROUND_KEY
>>>> +
>>>> +     // y = rol(y, 3)
>>>> +     vshl.u\n        TMP0, Y0, #3
>>>> +     vshl.u\n        TMP1, Y1, #3
>>>> +     vshl.u\n        TMP2, Y2, #3
>>>> +     vshl.u\n        TMP3, Y3, #3
>>>> +     vsri.u\n        TMP0, Y0, #(\n - 3)
>>>> +     vsri.u\n        TMP1, Y1, #(\n - 3)
>>>> +     vsri.u\n        TMP2, Y2, #(\n - 3)
>>>> +     vsri.u\n        TMP3, Y3, #(\n - 3)
>>>> +
>>>> +     // y ^= x
>>>> +     veor            Y0, TMP0, X0
>>>> +     veor            Y1, TMP1, X1
>>>> +     veor            Y2, TMP2, X2
>>>> +     veor            Y3, TMP3, X3
>>>> +.endm
>>>> +
>>>> +/*
>>>> + * _speck_unround_128bytes() - Speck decryption round on 128 bytes at a time
>>>> + *
>>>> + * This is the inverse of _speck_round_128bytes().
>>>> + */
>>>> +.macro _speck_unround_128bytes       n
>>>> +
>>>> +     // y ^= x
>>>> +     veor            TMP0, Y0, X0
>>>> +     veor            TMP1, Y1, X1
>>>> +     veor            TMP2, Y2, X2
>>>> +     veor            TMP3, Y3, X3
>>>> +
>>>> +     // y = ror(y, 3)
>>>> +     vshr.u\n        Y0, TMP0, #3
>>>> +     vshr.u\n        Y1, TMP1, #3
>>>> +     vshr.u\n        Y2, TMP2, #3
>>>> +     vshr.u\n        Y3, TMP3, #3
>>>> +     vsli.u\n        Y0, TMP0, #(\n - 3)
>>>> +     vsli.u\n        Y1, TMP1, #(\n - 3)
>>>> +     vsli.u\n        Y2, TMP2, #(\n - 3)
>>>> +     vsli.u\n        Y3, TMP3, #(\n - 3)
>>>> +
>>>> +     // x ^= k
>>>> +     veor            X0, ROUND_KEY
>>>> +     veor            X1, ROUND_KEY
>>>> +     veor            X2, ROUND_KEY
>>>> +     veor            X3, ROUND_KEY
>>>> +
>>>> +     // x -= y
>>>> +     vsub.u\n        X0, Y0
>>>> +     vsub.u\n        X1, Y1
>>>> +     vsub.u\n        X2, Y2
>>>> +     vsub.u\n        X3, Y3
>>>> +
>>>> +     // x = rol(x, 8);
>>>> +     vtbl.8          X0_L, {X0_L}, ROTATE_TABLE
>>>> +     vtbl.8          X0_H, {X0_H}, ROTATE_TABLE
>>>> +     vtbl.8          X1_L, {X1_L}, ROTATE_TABLE
>>>> +     vtbl.8          X1_H, {X1_H}, ROTATE_TABLE
>>>> +     vtbl.8          X2_L, {X2_L}, ROTATE_TABLE
>>>> +     vtbl.8          X2_H, {X2_H}, ROTATE_TABLE
>>>> +     vtbl.8          X3_L, {X3_L}, ROTATE_TABLE
>>>> +     vtbl.8          X3_H, {X3_H}, ROTATE_TABLE
>>>> +.endm
>>>> +
>>>> +.macro _xts128_precrypt_one  dst_reg, tweak_buf, tmp
>>>> +
>>>> +     // Load the next source block
>>>> +     vld1.8          {\dst_reg}, [SRC]!
>>>> +
>>>> +     // Save the current tweak in the tweak buffer
>>>> +     vst1.8          {TWEAKV}, [\tweak_buf:128]!
>>>> +
>>>> +     // XOR the next source block with the current tweak
>>>> +     veor            \dst_reg, TWEAKV
>>>> +
>>>> +     /*
>>>> +      * Calculate the next tweak by multiplying the current one by x,
>>>> +      * modulo p(x) = x^128 + x^7 + x^2 + x + 1.
>>>> +      */
>>>> +     vshr.u64        \tmp, TWEAKV, #63
>>>> +     vshl.u64        TWEAKV, #1
>>>> +     veor            TWEAKV_H, \tmp\()_L
>>>> +     vtbl.8          \tmp\()_H, {GF128MUL_TABLE}, \tmp\()_H
>>>> +     veor            TWEAKV_L, \tmp\()_H
>>>> +.endm
>>>> +
>>>> +.macro _xts64_precrypt_two   dst_reg, tweak_buf, tmp
>>>> +
>>>> +     // Load the next two source blocks
>>>> +     vld1.8          {\dst_reg}, [SRC]!
>>>> +
>>>> +     // Save the current two tweaks in the tweak buffer
>>>> +     vst1.8          {TWEAKV}, [\tweak_buf:128]!
>>>> +
>>>> +     // XOR the next two source blocks with the current two tweaks
>>>> +     veor            \dst_reg, TWEAKV
>>>> +
>>>> +     /*
>>>> +      * Calculate the next two tweaks by multiplying the current ones by x^2,
>>>> +      * modulo p(x) = x^64 + x^4 + x^3 + x + 1.
>>>> +      */
>>>> +     vshr.u64        \tmp, TWEAKV, #62
>>>> +     vshl.u64        TWEAKV, #2
>>>> +     vtbl.8          \tmp\()_L, {GF64MUL_TABLE}, \tmp\()_L
>>>> +     vtbl.8          \tmp\()_H, {GF64MUL_TABLE}, \tmp\()_H
>>>> +     veor            TWEAKV, \tmp
>>>> +.endm
>>>> +
>>>> +/*
>>>> + * _speck_xts_crypt() - Speck-XTS encryption/decryption
>>>> + *
>>>> + * Encrypt or decrypt NBYTES bytes of data from the SRC buffer to the DST buffer
>>>> + * using Speck-XTS, specifically the variant with a block size of '2n' and round
>>>> + * count given by NROUNDS.  The expanded round keys are given in ROUND_KEYS, and
>>>> + * the current XTS tweak value is given in TWEAK.  It's assumed that NBYTES is a
>>>> + * nonzero multiple of 128.
>>>> + */
>>>> +.macro _speck_xts_crypt      n, decrypting
>>>> +     push            {r4-r7}
>>>> +     mov             r7, sp
>>>> +
>>>> +     /*
>>>> +      * The first four parameters were passed in registers r0-r3.  Load the
>>>> +      * additional parameters, which were passed on the stack.
>>>> +      */
>>>> +     ldr             NBYTES, [sp, #16]
>>>> +     ldr             TWEAK, [sp, #20]
>>>> +
>>>> +     /*
>>>> +      * If decrypting, modify the ROUND_KEYS parameter to point to the last
>>>> +      * round key rather than the first, since for decryption the round keys
>>>> +      * are used in reverse order.
>>>> +      */
>>>> +.if \decrypting
>>>> +.if \n == 64
>>>> +     add             ROUND_KEYS, ROUND_KEYS, NROUNDS, lsl #3
>>>> +     sub             ROUND_KEYS, #8
>>>> +.else
>>>> +     add             ROUND_KEYS, ROUND_KEYS, NROUNDS, lsl #2
>>>> +     sub             ROUND_KEYS, #4
>>>> +.endif
>>>> +.endif
>>>> +
>>>> +     // Load the index vector for vtbl-based 8-bit rotates
>>>> +.if \decrypting
>>>> +     ldr             r12, =.Lrol\n\()_8_table
>>>> +.else
>>>> +     ldr             r12, =.Lror\n\()_8_table
>>>> +.endif
>>>> +     vld1.8          {ROTATE_TABLE}, [r12:64]
>>>> +
>>>> +     // One-time XTS preparation
>>>> +
>>>> +     /*
>>>> +      * Allocate stack space to store 128 bytes worth of tweaks.  For
>>>> +      * performance, this space is aligned to a 16-byte boundary so that we
>>>> +      * can use the load/store instructions that declare 16-byte alignment.
>>>> +      */
>>>> +     sub             sp, #128
>>>> +     bic             sp, #0xf
>>>
>>>
>>> This fails here when building with CONFIG_THUMB2_KERNEL=y
>>>
>>>   AS      arch/arm/crypto/speck-neon-core.o
>>>
>>> arch/arm/crypto/speck-neon-core.S: Assembler messages:
>>>
>>> arch/arm/crypto/speck-neon-core.S:419: Error: r13 not allowed here --
>>> `bic sp,#0xf'
>>> arch/arm/crypto/speck-neon-core.S:423: Error: r13 not allowed here --
>>> `bic sp,#0xf'
>>> arch/arm/crypto/speck-neon-core.S:427: Error: r13 not allowed here --
>>> `bic sp,#0xf'
>>> arch/arm/crypto/speck-neon-core.S:431: Error: r13 not allowed here --
>>> `bic sp,#0xf'
>>>
>>> In a quick hack this change seems to address it:
>>>
>>>
>>> -       sub             sp, #128
>>> -       bic             sp, #0xf
>>> +       mov             r6, sp
>>> +       sub             r6, #128
>>> +       bic             r6, #0xf
>>> +       mov             sp, r6
>>>
>>> But there is probably a better solution to address this.
>>>
>>
>> Given that there is no NEON on M class cores, I recommend we put something like
>>
>> THUMB(bx pc)
>> THUMB(nop.w)
>> THUMB(.arm)
>>
>> at the beginning and be done with it.
> 
> I mean nop.n or just nop, of course, and we may need a '.align 2' at
> the beginning as well.

Wouldn't it be preferable to have it assemble in Thumb2 too? It seems
that bic sp,#0xf is the only issue...
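
For instance, a Thumb2-legal sequence that goes through a scratch register,
along the lines of the hack quoted above, might look like this (just a
sketch; the choice of r12 is only illustrative):

        sub             r12, sp, #128
        bic             r12, #0xf
        mov             sp, r12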

--
Stefan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 3/5] crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS
  2018-06-17 10:41           ` Stefan Agner
  (?)
@ 2018-06-17 11:10             ` Ard Biesheuvel
  -1 siblings, 0 replies; 36+ messages in thread
From: Ard Biesheuvel @ 2018-06-17 11:10 UTC (permalink / raw)
  To: Stefan Agner
  Cc: Jeffrey Walton, Greg Kaiser, Herbert Xu, Eric Biggers,
	Michael Halcrow, Patrik Torstensson, Alex Cope, Paul Lawrence,
	linux-fscrypt, open list:HARDWARE RANDOM NUMBER GENERATOR CORE,
	Greg Kroah-Hartman, linux-crypto-owner, linux-arm-kernel,
	Paul Crowley

On 17 June 2018 at 12:41, Stefan Agner <stefan@agner.ch> wrote:
> On 17.06.2018 11:40, Ard Biesheuvel wrote:
>> On 17 June 2018 at 11:30, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>>> On 17 June 2018 at 00:40, Stefan Agner <stefan@agner.ch> wrote:
>>>> Hi Eric,
>>>>
>>>> On 14.02.2018 19:42, Eric Biggers wrote:
>>>>> Add an ARM NEON-accelerated implementation of Speck-XTS.  It operates on
>>>>> 128-byte chunks at a time, i.e. 8 blocks for Speck128 or 16 blocks for
>>>>> Speck64.  Each 128-byte chunk goes through XTS preprocessing, then is
>>>>> encrypted/decrypted (doing one cipher round for all the blocks, then the
>>>>> next round, etc.), then goes through XTS postprocessing.
>>>>>
>>>>> The performance depends on the processor but can be about 3 times faster
>>>>> than the generic code.  For example, on an ARMv7 processor we observe
>>>>> the following performance with Speck128/256-XTS:
>>>>>
>>>>>     xts-speck128-neon:     Encryption 107.9 MB/s, Decryption 108.1 MB/s
>>>>>     xts(speck128-generic): Encryption  32.1 MB/s, Decryption  36.6 MB/s
>>>>>
>>>>> In comparison to AES-256-XTS without the Cryptography Extensions:
>>>>>
>>>>>     xts-aes-neonbs:        Encryption  41.2 MB/s, Decryption  36.7 MB/s
>>>>>     xts(aes-asm):          Encryption  31.7 MB/s, Decryption  30.8 MB/s
>>>>>     xts(aes-generic):      Encryption  21.2 MB/s, Decryption  20.9 MB/s
>>>>>
>>>>> Speck64/128-XTS is even faster:
>>>>>
>>>>>     xts-speck64-neon:      Encryption 138.6 MB/s, Decryption 139.1 MB/s
>>>>>
>>>>> Note that as with the generic code, only the Speck128 and Speck64
>>>>> variants are supported.  Also, for now only the XTS mode of operation is
>>>>> supported, to target the disk and file encryption use cases.  The NEON
>>>>> code also only handles the portion of the data that is evenly divisible
>>>>> into 128-byte chunks, with any remainder handled by a C fallback.  Of
>>>>> course, other modes of operation could be added later if needed, and/or
>>>>> the NEON code could be updated to handle other buffer sizes.
>>>>>
>>>>> The XTS specification is only defined for AES which has a 128-bit block
>>>>> size, so for the GF(2^64) math needed for Speck64-XTS we use the
>>>>> reducing polynomial 'x^64 + x^4 + x^3 + x + 1' given by the original XEX
>>>>> paper.  Of course, when possible users should use Speck128-XTS, but even
>>>>> that may be too slow on some processors; Speck64-XTS can be faster.
>>>>>
>>>>> Signed-off-by: Eric Biggers <ebiggers@google.com>
>>>>> ---
>>>>>  arch/arm/crypto/Kconfig           |   6 +
>>>>>  arch/arm/crypto/Makefile          |   2 +
>>>>>  arch/arm/crypto/speck-neon-core.S | 432 ++++++++++++++++++++++++++++++
>>>>>  arch/arm/crypto/speck-neon-glue.c | 288 ++++++++++++++++++++
>>>>>  4 files changed, 728 insertions(+)
>>>>>  create mode 100644 arch/arm/crypto/speck-neon-core.S
>>>>>  create mode 100644 arch/arm/crypto/speck-neon-glue.c
>>>>>
>>>>> diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
>>>>> index b8e69fe282b8..925d1364727a 100644
>>>>> --- a/arch/arm/crypto/Kconfig
>>>>> +++ b/arch/arm/crypto/Kconfig
>>>>> @@ -121,4 +121,10 @@ config CRYPTO_CHACHA20_NEON
>>>>>       select CRYPTO_BLKCIPHER
>>>>>       select CRYPTO_CHACHA20
>>>>>
>>>>> +config CRYPTO_SPECK_NEON
>>>>> +     tristate "NEON accelerated Speck cipher algorithms"
>>>>> +     depends on KERNEL_MODE_NEON
>>>>> +     select CRYPTO_BLKCIPHER
>>>>> +     select CRYPTO_SPECK
>>>>> +
>>>>>  endif
>>>>> diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
>>>>> index 30ef8e291271..a758107c5525 100644
>>>>> --- a/arch/arm/crypto/Makefile
>>>>> +++ b/arch/arm/crypto/Makefile
>>>>> @@ -10,6 +10,7 @@ obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
>>>>>  obj-$(CONFIG_CRYPTO_SHA256_ARM) += sha256-arm.o
>>>>>  obj-$(CONFIG_CRYPTO_SHA512_ARM) += sha512-arm.o
>>>>>  obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha20-neon.o
>>>>> +obj-$(CONFIG_CRYPTO_SPECK_NEON) += speck-neon.o
>>>>>
>>>>>  ce-obj-$(CONFIG_CRYPTO_AES_ARM_CE) += aes-arm-ce.o
>>>>>  ce-obj-$(CONFIG_CRYPTO_SHA1_ARM_CE) += sha1-arm-ce.o
>>>>> @@ -53,6 +54,7 @@ ghash-arm-ce-y      := ghash-ce-core.o ghash-ce-glue.o
>>>>>  crct10dif-arm-ce-y   := crct10dif-ce-core.o crct10dif-ce-glue.o
>>>>>  crc32-arm-ce-y:= crc32-ce-core.o crc32-ce-glue.o
>>>>>  chacha20-neon-y := chacha20-neon-core.o chacha20-neon-glue.o
>>>>> +speck-neon-y := speck-neon-core.o speck-neon-glue.o
>>>>>
>>>>>  quiet_cmd_perl = PERL    $@
>>>>>        cmd_perl = $(PERL) $(<) > $(@)
>>>>> diff --git a/arch/arm/crypto/speck-neon-core.S b/arch/arm/crypto/speck-neon-core.S
>>>>> new file mode 100644
>>>>> index 000000000000..3c1e203e53b9
>>>>> --- /dev/null
>>>>> +++ b/arch/arm/crypto/speck-neon-core.S
>>>>> @@ -0,0 +1,432 @@
>>>>> +// SPDX-License-Identifier: GPL-2.0
>>>>> +/*
>>>>> + * NEON-accelerated implementation of Speck128-XTS and Speck64-XTS
>>>>> + *
>>>>> + * Copyright (c) 2018 Google, Inc
>>>>> + *
>>>>> + * Author: Eric Biggers <ebiggers@google.com>
>>>>> + */
>>>>> +
>>>>> +#include <linux/linkage.h>
>>>>> +
>>>>> +     .text
>>>>> +     .fpu            neon
>>>>> +
>>>>> +     // arguments
>>>>> +     ROUND_KEYS      .req    r0      // const {u64,u32} *round_keys
>>>>> +     NROUNDS         .req    r1      // int nrounds
>>>>> +     DST             .req    r2      // void *dst
>>>>> +     SRC             .req    r3      // const void *src
>>>>> +     NBYTES          .req    r4      // unsigned int nbytes
>>>>> +     TWEAK           .req    r5      // void *tweak
>>>>> +
>>>>> +     // registers which hold the data being encrypted/decrypted
>>>>> +     X0              .req    q0
>>>>> +     X0_L            .req    d0
>>>>> +     X0_H            .req    d1
>>>>> +     Y0              .req    q1
>>>>> +     Y0_H            .req    d3
>>>>> +     X1              .req    q2
>>>>> +     X1_L            .req    d4
>>>>> +     X1_H            .req    d5
>>>>> +     Y1              .req    q3
>>>>> +     Y1_H            .req    d7
>>>>> +     X2              .req    q4
>>>>> +     X2_L            .req    d8
>>>>> +     X2_H            .req    d9
>>>>> +     Y2              .req    q5
>>>>> +     Y2_H            .req    d11
>>>>> +     X3              .req    q6
>>>>> +     X3_L            .req    d12
>>>>> +     X3_H            .req    d13
>>>>> +     Y3              .req    q7
>>>>> +     Y3_H            .req    d15
>>>>> +
>>>>> +     // the round key, duplicated in all lanes
>>>>> +     ROUND_KEY       .req    q8
>>>>> +     ROUND_KEY_L     .req    d16
>>>>> +     ROUND_KEY_H     .req    d17
>>>>> +
>>>>> +     // index vector for vtbl-based 8-bit rotates
>>>>> +     ROTATE_TABLE    .req    d18
>>>>> +
>>>>> +     // multiplication table for updating XTS tweaks
>>>>> +     GF128MUL_TABLE  .req    d19
>>>>> +     GF64MUL_TABLE   .req    d19
>>>>> +
>>>>> +     // current XTS tweak value(s)
>>>>> +     TWEAKV          .req    q10
>>>>> +     TWEAKV_L        .req    d20
>>>>> +     TWEAKV_H        .req    d21
>>>>> +
>>>>> +     TMP0            .req    q12
>>>>> +     TMP0_L          .req    d24
>>>>> +     TMP0_H          .req    d25
>>>>> +     TMP1            .req    q13
>>>>> +     TMP2            .req    q14
>>>>> +     TMP3            .req    q15
>>>>> +
>>>>> +     .align          4
>>>>> +.Lror64_8_table:
>>>>> +     .byte           1, 2, 3, 4, 5, 6, 7, 0
>>>>> +.Lror32_8_table:
>>>>> +     .byte           1, 2, 3, 0, 5, 6, 7, 4
>>>>> +.Lrol64_8_table:
>>>>> +     .byte           7, 0, 1, 2, 3, 4, 5, 6
>>>>> +.Lrol32_8_table:
>>>>> +     .byte           3, 0, 1, 2, 7, 4, 5, 6
>>>>> +.Lgf128mul_table:
>>>>> +     .byte           0, 0x87
>>>>> +     .fill           14
>>>>> +.Lgf64mul_table:
>>>>> +     .byte           0, 0x1b, (0x1b << 1), (0x1b << 1) ^ 0x1b
>>>>> +     .fill           12
>>>>> +
>>>>> +/*
>>>>> + * _speck_round_128bytes() - Speck encryption round on 128 bytes at a time
>>>>> + *
>>>>> + * Do one Speck encryption round on the 128 bytes (8 blocks for Speck128, 16 for
>>>>> + * Speck64) stored in X0-X3 and Y0-Y3, using the round key stored in all lanes
>>>>> + * of ROUND_KEY.  'n' is the lane size: 64 for Speck128, or 32 for Speck64.
>>>>> + *
>>>>> + * The 8-bit rotates are implemented using vtbl instead of vshr + vsli because
>>>>> + * the vtbl approach is faster on some processors and the same speed on others.
>>>>> + */
>>>>> +.macro _speck_round_128bytes n
>>>>> +
>>>>> +     // x = ror(x, 8)
>>>>> +     vtbl.8          X0_L, {X0_L}, ROTATE_TABLE
>>>>> +     vtbl.8          X0_H, {X0_H}, ROTATE_TABLE
>>>>> +     vtbl.8          X1_L, {X1_L}, ROTATE_TABLE
>>>>> +     vtbl.8          X1_H, {X1_H}, ROTATE_TABLE
>>>>> +     vtbl.8          X2_L, {X2_L}, ROTATE_TABLE
>>>>> +     vtbl.8          X2_H, {X2_H}, ROTATE_TABLE
>>>>> +     vtbl.8          X3_L, {X3_L}, ROTATE_TABLE
>>>>> +     vtbl.8          X3_H, {X3_H}, ROTATE_TABLE
>>>>> +
>>>>> +     // x += y
>>>>> +     vadd.u\n        X0, Y0
>>>>> +     vadd.u\n        X1, Y1
>>>>> +     vadd.u\n        X2, Y2
>>>>> +     vadd.u\n        X3, Y3
>>>>> +
>>>>> +     // x ^= k
>>>>> +     veor            X0, ROUND_KEY
>>>>> +     veor            X1, ROUND_KEY
>>>>> +     veor            X2, ROUND_KEY
>>>>> +     veor            X3, ROUND_KEY
>>>>> +
>>>>> +     // y = rol(y, 3)
>>>>> +     vshl.u\n        TMP0, Y0, #3
>>>>> +     vshl.u\n        TMP1, Y1, #3
>>>>> +     vshl.u\n        TMP2, Y2, #3
>>>>> +     vshl.u\n        TMP3, Y3, #3
>>>>> +     vsri.u\n        TMP0, Y0, #(\n - 3)
>>>>> +     vsri.u\n        TMP1, Y1, #(\n - 3)
>>>>> +     vsri.u\n        TMP2, Y2, #(\n - 3)
>>>>> +     vsri.u\n        TMP3, Y3, #(\n - 3)
>>>>> +
>>>>> +     // y ^= x
>>>>> +     veor            Y0, TMP0, X0
>>>>> +     veor            Y1, TMP1, X1
>>>>> +     veor            Y2, TMP2, X2
>>>>> +     veor            Y3, TMP3, X3
>>>>> +.endm
>>>>> +
>>>>> +/*
>>>>> + * _speck_unround_128bytes() - Speck decryption round on 128 bytes at a time
>>>>> + *
>>>>> + * This is the inverse of _speck_round_128bytes().
>>>>> + */
>>>>> +.macro _speck_unround_128bytes       n
>>>>> +
>>>>> +     // y ^= x
>>>>> +     veor            TMP0, Y0, X0
>>>>> +     veor            TMP1, Y1, X1
>>>>> +     veor            TMP2, Y2, X2
>>>>> +     veor            TMP3, Y3, X3
>>>>> +
>>>>> +     // y = ror(y, 3)
>>>>> +     vshr.u\n        Y0, TMP0, #3
>>>>> +     vshr.u\n        Y1, TMP1, #3
>>>>> +     vshr.u\n        Y2, TMP2, #3
>>>>> +     vshr.u\n        Y3, TMP3, #3
>>>>> +     vsli.u\n        Y0, TMP0, #(\n - 3)
>>>>> +     vsli.u\n        Y1, TMP1, #(\n - 3)
>>>>> +     vsli.u\n        Y2, TMP2, #(\n - 3)
>>>>> +     vsli.u\n        Y3, TMP3, #(\n - 3)
>>>>> +
>>>>> +     // x ^= k
>>>>> +     veor            X0, ROUND_KEY
>>>>> +     veor            X1, ROUND_KEY
>>>>> +     veor            X2, ROUND_KEY
>>>>> +     veor            X3, ROUND_KEY
>>>>> +
>>>>> +     // x -= y
>>>>> +     vsub.u\n        X0, Y0
>>>>> +     vsub.u\n        X1, Y1
>>>>> +     vsub.u\n        X2, Y2
>>>>> +     vsub.u\n        X3, Y3
>>>>> +
>>>>> +     // x = rol(x, 8);
>>>>> +     vtbl.8          X0_L, {X0_L}, ROTATE_TABLE
>>>>> +     vtbl.8          X0_H, {X0_H}, ROTATE_TABLE
>>>>> +     vtbl.8          X1_L, {X1_L}, ROTATE_TABLE
>>>>> +     vtbl.8          X1_H, {X1_H}, ROTATE_TABLE
>>>>> +     vtbl.8          X2_L, {X2_L}, ROTATE_TABLE
>>>>> +     vtbl.8          X2_H, {X2_H}, ROTATE_TABLE
>>>>> +     vtbl.8          X3_L, {X3_L}, ROTATE_TABLE
>>>>> +     vtbl.8          X3_H, {X3_H}, ROTATE_TABLE
>>>>> +.endm
>>>>> +
>>>>> +.macro _xts128_precrypt_one  dst_reg, tweak_buf, tmp
>>>>> +
>>>>> +     // Load the next source block
>>>>> +     vld1.8          {\dst_reg}, [SRC]!
>>>>> +
>>>>> +     // Save the current tweak in the tweak buffer
>>>>> +     vst1.8          {TWEAKV}, [\tweak_buf:128]!
>>>>> +
>>>>> +     // XOR the next source block with the current tweak
>>>>> +     veor            \dst_reg, TWEAKV
>>>>> +
>>>>> +     /*
>>>>> +      * Calculate the next tweak by multiplying the current one by x,
>>>>> +      * modulo p(x) = x^128 + x^7 + x^2 + x + 1.
>>>>> +      */
>>>>> +     vshr.u64        \tmp, TWEAKV, #63
>>>>> +     vshl.u64        TWEAKV, #1
>>>>> +     veor            TWEAKV_H, \tmp\()_L
>>>>> +     vtbl.8          \tmp\()_H, {GF128MUL_TABLE}, \tmp\()_H
>>>>> +     veor            TWEAKV_L, \tmp\()_H
>>>>> +.endm
>>>>> +
>>>>> +.macro _xts64_precrypt_two   dst_reg, tweak_buf, tmp
>>>>> +
>>>>> +     // Load the next two source blocks
>>>>> +     vld1.8          {\dst_reg}, [SRC]!
>>>>> +
>>>>> +     // Save the current two tweaks in the tweak buffer
>>>>> +     vst1.8          {TWEAKV}, [\tweak_buf:128]!
>>>>> +
>>>>> +     // XOR the next two source blocks with the current two tweaks
>>>>> +     veor            \dst_reg, TWEAKV
>>>>> +
>>>>> +     /*
>>>>> +      * Calculate the next two tweaks by multiplying the current ones by x^2,
>>>>> +      * modulo p(x) = x^64 + x^4 + x^3 + x + 1.
>>>>> +      */
>>>>> +     vshr.u64        \tmp, TWEAKV, #62
>>>>> +     vshl.u64        TWEAKV, #2
>>>>> +     vtbl.8          \tmp\()_L, {GF64MUL_TABLE}, \tmp\()_L
>>>>> +     vtbl.8          \tmp\()_H, {GF64MUL_TABLE}, \tmp\()_H
>>>>> +     veor            TWEAKV, \tmp
>>>>> +.endm
>>>>> +
>>>>> +/*
>>>>> + * _speck_xts_crypt() - Speck-XTS encryption/decryption
>>>>> + *
>>>>> + * Encrypt or decrypt NBYTES bytes of data from the SRC buffer to the DST buffer
>>>>> + * using Speck-XTS, specifically the variant with a block size of '2n' and round
>>>>> + * count given by NROUNDS.  The expanded round keys are given in ROUND_KEYS, and
>>>>> + * the current XTS tweak value is given in TWEAK.  It's assumed that NBYTES is a
>>>>> + * nonzero multiple of 128.
>>>>> + */
>>>>> +.macro _speck_xts_crypt      n, decrypting
>>>>> +     push            {r4-r7}
>>>>> +     mov             r7, sp
>>>>> +
>>>>> +     /*
>>>>> +      * The first four parameters were passed in registers r0-r3.  Load the
>>>>> +      * additional parameters, which were passed on the stack.
>>>>> +      */
>>>>> +     ldr             NBYTES, [sp, #16]
>>>>> +     ldr             TWEAK, [sp, #20]
>>>>> +
>>>>> +     /*
>>>>> +      * If decrypting, modify the ROUND_KEYS parameter to point to the last
>>>>> +      * round key rather than the first, since for decryption the round keys
>>>>> +      * are used in reverse order.
>>>>> +      */
>>>>> +.if \decrypting
>>>>> +.if \n == 64
>>>>> +     add             ROUND_KEYS, ROUND_KEYS, NROUNDS, lsl #3
>>>>> +     sub             ROUND_KEYS, #8
>>>>> +.else
>>>>> +     add             ROUND_KEYS, ROUND_KEYS, NROUNDS, lsl #2
>>>>> +     sub             ROUND_KEYS, #4
>>>>> +.endif
>>>>> +.endif
>>>>> +
>>>>> +     // Load the index vector for vtbl-based 8-bit rotates
>>>>> +.if \decrypting
>>>>> +     ldr             r12, =.Lrol\n\()_8_table
>>>>> +.else
>>>>> +     ldr             r12, =.Lror\n\()_8_table
>>>>> +.endif
>>>>> +     vld1.8          {ROTATE_TABLE}, [r12:64]
>>>>> +
>>>>> +     // One-time XTS preparation
>>>>> +
>>>>> +     /*
>>>>> +      * Allocate stack space to store 128 bytes worth of tweaks.  For
>>>>> +      * performance, this space is aligned to a 16-byte boundary so that we
>>>>> +      * can use the load/store instructions that declare 16-byte alignment.
>>>>> +      */
>>>>> +     sub             sp, #128
>>>>> +     bic             sp, #0xf
>>>>
>>>>
>>>> This fails here when building with CONFIG_THUMB2_KERNEL=y
>>>>
>>>>   AS      arch/arm/crypto/speck-neon-core.o
>>>>
>>>> arch/arm/crypto/speck-neon-core.S: Assembler messages:
>>>>
>>>> arch/arm/crypto/speck-neon-core.S:419: Error: r13 not allowed here --
>>>> `bic sp,#0xf'
>>>> arch/arm/crypto/speck-neon-core.S:423: Error: r13 not allowed here --
>>>> `bic sp,#0xf'
>>>> arch/arm/crypto/speck-neon-core.S:427: Error: r13 not allowed here --
>>>> `bic sp,#0xf'
>>>> arch/arm/crypto/speck-neon-core.S:431: Error: r13 not allowed here --
>>>> `bic sp,#0xf'
>>>>
>>>> In a quick hack this change seems to address it:
>>>>
>>>>
>>>> -       sub             sp, #128
>>>> -       bic             sp, #0xf
>>>> +       mov             r6, sp
>>>> +       sub             r6, #128
>>>> +       bic             r6, #0xf
>>>> +       mov             sp, r6
>>>>
>>>> But there is probably a better solution to address this.
>>>>
>>>
>>> Given that there is no NEON on M class cores, I recommend we put something like
>>>
>>> THUMB(bx pc)
>>> THUMB(nop.w)
>>> THUMB(.arm)
>>>
>>> at the beginning and be done with it.
>>
>> I mean nop.n or just nop, of course, and we may need a '.align 2' at
>> the beginning as well.
>
> Wouldn't it be preferable to have it assemble in Thumb2 too? It seems
> that bic sp,#0xf is the only issue...
>

Well, in general, yes. In the case of NEON code, not really, since the
resulting code will not be smaller anyway, because the Thumb2 NEON
opcodes are all 4 bytes. Also, Thumb2-only cores don't have NEON
units, so all cores that this code can run on will be able to run in
ARM mode.

So from a maintainability pov, having code that only assembles in one
way is better than having code that must assemble to both ARM and
Thumb2 opcodes.

Just my 2 cents, anyway.
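
Concretely, the preamble I mean would be something along these lines
(untested sketch; it assumes the THUMB() macro from asm/unified.h is
available, e.g. by also including asm/assembler.h):

        .align          2
 THUMB( bx              pc              )
 THUMB( nop                             )
 THUMB( .arm                            )

When assembled for Thumb2 and reached in Thumb state, the bx pc drops the
CPU into ARM state (the nop is never executed, it only pads so that the
ARM code starts at the next word boundary), and the .arm directive makes
the assembler emit everything that follows as ARM instructions; in an ARM
build the THUMB() lines expand to nothing.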

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 3/5] crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS
  2018-06-17 11:10             ` Ard Biesheuvel
@ 2018-06-18 21:56               ` Eric Biggers
  0 siblings, 0 replies; 36+ messages in thread
From: Eric Biggers @ 2018-06-18 21:56 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Jeffrey Walton, Greg Kaiser, Herbert Xu, Michael Halcrow,
	Patrik Torstensson, Stefan Agner, Paul Lawrence, linux-fscrypt,
	open list:HARDWARE RANDOM NUMBER GENERATOR CORE,
	Greg Kroah-Hartman, Alex Cope, linux-crypto-owner,
	linux-arm-kernel, Paul Crowley

On Sun, Jun 17, 2018 at 01:10:41PM +0200, Ard Biesheuvel wrote:
> >>>>> +
> >>>>> +     // One-time XTS preparation
> >>>>> +
> >>>>> +     /*
> >>>>> +      * Allocate stack space to store 128 bytes worth of tweaks.  For
> >>>>> +      * performance, this space is aligned to a 16-byte boundary so that we
> >>>>> +      * can use the load/store instructions that declare 16-byte alignment.
> >>>>> +      */
> >>>>> +     sub             sp, #128
> >>>>> +     bic             sp, #0xf
> >>>>
> >>>>
> >>>> This fails here when building with CONFIG_THUMB2_KERNEL=y
> >>>>
> >>>>   AS      arch/arm/crypto/speck-neon-core.o
> >>>>
> >>>> arch/arm/crypto/speck-neon-core.S: Assembler messages:
> >>>>
> >>>> arch/arm/crypto/speck-neon-core.S:419: Error: r13 not allowed here --
> >>>> `bic sp,#0xf'
> >>>> arch/arm/crypto/speck-neon-core.S:423: Error: r13 not allowed here --
> >>>> `bic sp,#0xf'
> >>>> arch/arm/crypto/speck-neon-core.S:427: Error: r13 not allowed here --
> >>>> `bic sp,#0xf'
> >>>> arch/arm/crypto/speck-neon-core.S:431: Error: r13 not allowed here --
> >>>> `bic sp,#0xf'
> >>>>
> >>>> As a quick hack, this change seems to address it:
> >>>>
> >>>>
> >>>> -       sub             sp, #128
> >>>> -       bic             sp, #0xf
> >>>> +       mov             r6, sp
> >>>> +       sub             r6, #128
> >>>> +       bic             r6, #0xf
> >>>> +       mov             sp, r6
> >>>>
> >>>> But there is probably a better solution to address this.
> >>>>
> >>>
> >>> Given that there is no NEON on M class cores, I recommend we put something like
> >>>
> >>> THUMB(bx pc)
> >>> THUMB(nop.w)
> >>> THUMB(.arm)
> >>>
> >>> at the beginning and be done with it.
> >>
> >> I mean nop.n or just nop, of course, and we may need a '.align 2' at
> >> the beginning as well.
> >
> > Wouldn't it be preferable to have it assemble in Thumb2 too? It seems
> > that bic sp,#0xf is the only issue...
> >
> 
> Well, in general, yes. In the case of NEON code, not really, since the
> resulting code will not be smaller anyway, because the Thumb2 NEON
> opcodes are all 4 bytes. Also, Thumb2-only cores don't have NEON
> units, so all cores that this code can run on will be able to run in
> ARM mode.
> 
> So from a maintainability pov, having code that only assembles in one
> way is better than having code that must compile both to ARM and to
> Thumb2 opcodes.
> 
> Just my 2 cents, anyway.

I don't have too much of a preference, though Stefan's suggested 4 instructions
can be reduced to 3, which also matches what aes-neonbs-core.S does:

        sub             r12, sp, #128
        bic             r12, #0xf
        mov             sp, r12

Ard, is the following what you're suggesting instead?

diff --git a/arch/arm/crypto/speck-neon-core.S b/arch/arm/crypto/speck-neon-core.S
index 3c1e203e53b9..c989ce3dc057 100644
--- a/arch/arm/crypto/speck-neon-core.S
+++ b/arch/arm/crypto/speck-neon-core.S
@@ -8,6 +8,7 @@
  */
 
 #include <linux/linkage.h>
+#include <asm/assembler.h>
 
 	.text
 	.fpu		neon
@@ -233,6 +234,12 @@
  * nonzero multiple of 128.
  */
 .macro _speck_xts_crypt	n, decrypting
+
+	.align		2
+	THUMB(bx pc)
+	THUMB(nop)
+	THUMB(.arm)
+
 	push		{r4-r7}
 	mov		r7, sp
 
@@ -413,6 +420,8 @@
 	mov		sp, r7
 	pop		{r4-r7}
 	bx		lr
+
+	THUMB(.thumb)
 .endm
 
 ENTRY(speck128_xts_encrypt_neon)
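
For reference, the ARM()/THUMB() helpers used in the hunk above are pulled in
by the <asm/assembler.h> include: in a CONFIG_THUMB2_KERNEL build THUMB(x)
emits x and ARM(x) emits nothing, and the other way around otherwise.  A
simplified sketch of that convention (not the exact kernel header):

        #ifdef CONFIG_THUMB2_KERNEL
        #define ARM(x...)
        #define THUMB(x...)     x
        #else
        #define ARM(x...)       x
        #define THUMB(x...)
        #endif

So in a Thumb-2 build the bx pc / nop / .arm sequence drops the CPU into ARM
state at the word-aligned address after the bx (hence the .align 2 and the
padding nop), and THUMB(.thumb) at the end switches the assembler back to
Thumb for whatever follows the macro; in an ARM build all of it disappears.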

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 3/5] crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS
  2018-06-18 21:56               ` Eric Biggers
@ 2018-06-18 22:04                 ` Ard Biesheuvel
  0 siblings, 0 replies; 36+ messages in thread
From: Ard Biesheuvel @ 2018-06-18 22:04 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Jeffrey Walton, Greg Kaiser, Herbert Xu, Michael Halcrow,
	Patrik Torstensson, Stefan Agner, Paul Lawrence, linux-fscrypt,
	open list:HARDWARE RANDOM NUMBER GENERATOR CORE,
	Greg Kroah-Hartman, Alex Cope, linux-crypto-owner,
	linux-arm-kernel, Paul Crowley

On 18 June 2018 at 23:56, Eric Biggers <ebiggers@google.com> wrote:
> On Sun, Jun 17, 2018 at 01:10:41PM +0200, Ard Biesheuvel wrote:
>> >>>>> +
>> >>>>> +     // One-time XTS preparation
>> >>>>> +
>> >>>>> +     /*
>> >>>>> +      * Allocate stack space to store 128 bytes worth of tweaks.  For
>> >>>>> +      * performance, this space is aligned to a 16-byte boundary so that we
>> >>>>> +      * can use the load/store instructions that declare 16-byte alignment.
>> >>>>> +      */
>> >>>>> +     sub             sp, #128
>> >>>>> +     bic             sp, #0xf
>> >>>>
>> >>>>
>> >>>> This fails here when building with CONFIG_THUMB2_KERNEL=y
>> >>>>
>> >>>>   AS      arch/arm/crypto/speck-neon-core.o
>> >>>>
>> >>>> arch/arm/crypto/speck-neon-core.S: Assembler messages:
>> >>>>
>> >>>> arch/arm/crypto/speck-neon-core.S:419: Error: r13 not allowed here --
>> >>>> `bic sp,#0xf'
>> >>>> arch/arm/crypto/speck-neon-core.S:423: Error: r13 not allowed here --
>> >>>> `bic sp,#0xf'
>> >>>> arch/arm/crypto/speck-neon-core.S:427: Error: r13 not allowed here --
>> >>>> `bic sp,#0xf'
>> >>>> arch/arm/crypto/speck-neon-core.S:431: Error: r13 not allowed here --
>> >>>> `bic sp,#0xf'
>> >>>>
>> >>>> As a quick hack, this change seems to address it:
>> >>>>
>> >>>>
>> >>>> -       sub             sp, #128
>> >>>> -       bic             sp, #0xf
>> >>>> +       mov             r6, sp
>> >>>> +       sub             r6, #128
>> >>>> +       bic             r6, #0xf
>> >>>> +       mov             sp, r6
>> >>>>
>> >>>> But there is probably a better solution to address this.
>> >>>>
>> >>>
>> >>> Given that there is no NEON on M class cores, I recommend we put something like
>> >>>
>> >>> THUMB(bx pc)
>> >>> THUMB(nop.w)
>> >>> THUMB(.arm)
>> >>>
>> >>> at the beginning and be done with it.
>> >>
>> >> I mean nop.n or just nop, of course, and we may need a '.align 2' at
>> >> the beginning as well.
>> >
>> > Wouldn't it be preferable to have it assemble in Thumb2 too? It seems
>> > that bic sp,#0xf is the only issue...
>> >
>>
>> Well, in general, yes. In the case of NEON code, not really, since the
>> resulting code will not be smaller anyway, because the Thumb2 NEON
>> opcodes are all 4 bytes. Also, Thumb2-only cores don't have NEON
>> units, so all cores that this code can run on will be able to run in
>> ARM mode.
>>
>> So from a maintainability pov, having code that only assembles in one
>> way is better than having code that must compile both to ARM and to
>> Thumb2 opcodes.
>>
>> Just my 2 cents, anyway.
>
> I don't have too much of a preference, though Stefan's suggested 4 instructions
> can be reduced to 3, which also matches what aes-neonbs-core.S does:
>
>         sub             r12, sp, #128
>         bic             r12, #0xf
>         mov             sp, r12
>
> Ard, is the following what you're suggesting instead?
>

Yes, but after looking at the actual code, I prefer the change above.
The access occurs only once, not in the loop, so the additional
instructions should not affect performance.

Apologies for the noise.

> diff --git a/arch/arm/crypto/speck-neon-core.S b/arch/arm/crypto/speck-neon-core.S
> index 3c1e203e53b9..c989ce3dc057 100644
> --- a/arch/arm/crypto/speck-neon-core.S
> +++ b/arch/arm/crypto/speck-neon-core.S
> @@ -8,6 +8,7 @@
>   */
>
>  #include <linux/linkage.h>
> +#include <asm/assembler.h>
>
>         .text
>         .fpu            neon
> @@ -233,6 +234,12 @@
>   * nonzero multiple of 128.
>   */
>  .macro _speck_xts_crypt        n, decrypting
> +
> +       .align          2
> +       THUMB(bx pc)
> +       THUMB(nop)
> +       THUMB(.arm)
> +
>         push            {r4-r7}
>         mov             r7, sp
>
> @@ -413,6 +420,8 @@
>         mov             sp, r7
>         pop             {r4-r7}
>         bx              lr
> +
> +       THUMB(.thumb)
>  .endm
>
>  ENTRY(speck128_xts_encrypt_neon)
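
Sketching where this leaves the macro (pieced together from the snippets in
this thread, not necessarily the patch that ends up being applied): only the
allocation side of _speck_xts_crypt needs to go through a scratch register,
since the epilogue already restores the original stack pointer from r7:

        // prologue: carve out the 16-byte-aligned 128-byte tweak buffer
        // without using SP as a bic operand
        sub             r12, sp, #128
        bic             r12, #0xf
        mov             sp, r12

        // ... body of the macro unchanged ...

        // epilogue: r7 still holds the sp saved on entry
        mov             sp, r7
        pop             {r4-r7}
        bx              lr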

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2018-06-18 22:04 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-14 18:42 [PATCH v3 0/5] crypto: Speck support Eric Biggers
2018-02-14 18:42 ` [PATCH v3 1/5] crypto: add support for the Speck block cipher Eric Biggers
2018-02-14 18:42 ` [PATCH v3 2/5] crypto: speck - export common helpers Eric Biggers
2018-02-14 18:42 ` [PATCH v3 3/5] crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS Eric Biggers
2018-06-16 22:40   ` Stefan Agner
2018-06-17  9:30     ` Ard Biesheuvel
2018-06-17  9:40       ` Ard Biesheuvel
2018-06-17 10:41         ` Stefan Agner
2018-06-17 11:10           ` Ard Biesheuvel
2018-06-18 21:56             ` Eric Biggers
2018-06-18 22:04               ` Ard Biesheuvel
2018-02-14 18:42 ` [PATCH v3 4/5] crypto: speck - add test vectors for Speck128-XTS Eric Biggers
2018-02-14 18:42 ` [PATCH v3 5/5] crypto: speck - add test vectors for Speck64-XTS Eric Biggers
2018-02-22 15:13 ` [PATCH v3 0/5] crypto: Speck support Herbert Xu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.