* [PATCH 00/16] Optimizing SM3 and SM4 algorithms using NEON/CE/SVE instructions
@ 2022-09-26 9:36 ` Tianjia Zhang
0 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26 9:36 UTC (permalink / raw)
To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
linux-stm32
This patch series uses several arm64 instruction set extensions (NEON,
CE and SVE) to optimize the SM3 and SM4 algorithms, and adds optimized
implementations of several SM4 modes.
patch 1-2: NEON instruction set optimization for SM3
patch 3: Refactored and streamlined SM4 NEON instruction implementation
patch 4-5: add tests for the new SM4 modes
patch 6-8: Refactored and streamlined SM4 CE instruction implementation
patch 9-12: CE accelerated implementation of SM4 CTS/XTS/ESSIV
patch 13: CE accelerated implementation of SM4 CMAC/XCBC/CBCMAC
patch 14-15: CE accelerated implementation of SM4 CCM/GCM
patch 16: SM4 ARMv9 SVE cryptography acceleration implementation
Tianjia Zhang (16):
crypto: arm64/sm3 - raise the priority of the CE implementation
crypto: arm64/sm3 - add NEON assembly implementation
crypto: arm64/sm4 - refactor and simplify NEON implementation
crypto: testmgr - add SM4 cts-cbc/essiv/xts/xcbc test vectors
crypto: tcrypt - add SM4 cts-cbc/essiv/xts/xcbc test
crypto: arm64/sm4 - refactor and simplify CE implementation
crypto: arm64/sm4 - simplify sm4_ce_expand_key() of CE implementation
crypto: arm64/sm4 - export reusable CE acceleration functions
crypto: arm64/sm4 - add CE implementation for CTS-CBC mode
crypto: arm64/sm4 - add CE implementation for XTS mode
crypto: essiv - allow digestsize to be greater than keysize
crypto: arm64/sm4 - add CE implementation for ESSIV mode
crypto: arm64/sm4 - add CE implementation for cmac/xcbc/cbcmac
crypto: arm64/sm4 - add CE implementation for CCM mode
crypto: arm64/sm4 - add CE implementation for GCM mode
crypto: arm64/sm4 - add ARMv9 SVE cryptography acceleration
implementation
arch/arm64/crypto/Kconfig | 66 +-
arch/arm64/crypto/Makefile | 12 +
arch/arm64/crypto/sm3-ce-glue.c | 2 +-
arch/arm64/crypto/sm3-neon-core.S | 600 +++++++++++++
arch/arm64/crypto/sm3-neon-glue.c | 103 +++
arch/arm64/crypto/sm4-ce-asm.h | 209 +++++
arch/arm64/crypto/sm4-ce-ccm-core.S | 328 +++++++
arch/arm64/crypto/sm4-ce-ccm-glue.c | 303 +++++++
arch/arm64/crypto/sm4-ce-core.S | 1247 ++++++++++++++++++---------
arch/arm64/crypto/sm4-ce-gcm-core.S | 741 ++++++++++++++++
arch/arm64/crypto/sm4-ce-gcm-glue.c | 286 ++++++
arch/arm64/crypto/sm4-ce-glue.c | 703 ++++++++++++++-
arch/arm64/crypto/sm4-ce.h | 16 +
arch/arm64/crypto/sm4-neon-core.S | 630 +++++++++-----
arch/arm64/crypto/sm4-neon-glue.c | 172 +---
arch/arm64/crypto/sm4-sve-ce-core.S | 1028 ++++++++++++++++++++++
arch/arm64/crypto/sm4-sve-ce-glue.c | 332 +++++++
crypto/essiv.c | 11 +-
crypto/tcrypt.c | 28 +
crypto/testmgr.c | 25 +
crypto/testmgr.h | 1161 +++++++++++++++++++++++++
21 files changed, 7234 insertions(+), 769 deletions(-)
create mode 100644 arch/arm64/crypto/sm3-neon-core.S
create mode 100644 arch/arm64/crypto/sm3-neon-glue.c
create mode 100644 arch/arm64/crypto/sm4-ce-asm.h
create mode 100644 arch/arm64/crypto/sm4-ce-ccm-core.S
create mode 100644 arch/arm64/crypto/sm4-ce-ccm-glue.c
create mode 100644 arch/arm64/crypto/sm4-ce-gcm-core.S
create mode 100644 arch/arm64/crypto/sm4-ce-gcm-glue.c
create mode 100644 arch/arm64/crypto/sm4-ce.h
create mode 100644 arch/arm64/crypto/sm4-sve-ce-core.S
create mode 100644 arch/arm64/crypto/sm4-sve-ce-glue.c
--
2.24.3 (Apple Git-128)
^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH 01/16] crypto: arm64/sm3 - raise the priority of the CE implementation
2022-09-26 9:36 ` Tianjia Zhang
@ 2022-09-26 9:36 ` Tianjia Zhang
-1 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26 9:36 UTC (permalink / raw)
To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
linux-stm32
Raise the priority of the sm3-ce algorithm from 200 to 400 to make
room for the sm3-neon implementation.
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
arch/arm64/crypto/sm3-ce-glue.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/crypto/sm3-ce-glue.c b/arch/arm64/crypto/sm3-ce-glue.c
index ee98954ae8ca..54bf6ebcfffb 100644
--- a/arch/arm64/crypto/sm3-ce-glue.c
+++ b/arch/arm64/crypto/sm3-ce-glue.c
@@ -84,7 +84,7 @@ static struct shash_alg sm3_alg = {
.base.cra_driver_name = "sm3-ce",
.base.cra_blocksize = SM3_BLOCK_SIZE,
.base.cra_module = THIS_MODULE,
- .base.cra_priority = 200,
+ .base.cra_priority = 400,
};
static int __init sm3_ce_mod_init(void)
--
2.24.3 (Apple Git-128)
^ permalink raw reply related [flat|nested] 42+ messages in thread
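To illustrate why the priority value in the patch above matters: when
several drivers register the same algorithm name, the crypto API
generally prefers the one with the highest cra_priority. What follows is
a minimal userspace sketch of that selection rule, not the kernel's
actual lookup code. The names struct alg_entry and pick_best() are
hypothetical, and the priorities shown for sm3-generic and sm3-neon are
assumed for illustration; only the sm3-ce value of 400 comes from the
patch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a registered algorithm; not a kernel type. */
struct alg_entry {
	const char *name;        /* algorithm name, e.g. "sm3" */
	const char *driver_name; /* implementation name, e.g. "sm3-ce" */
	int priority;            /* higher value is preferred */
};

/* Return the matching entry with the highest priority, or NULL if none. */
static const struct alg_entry *pick_best(const struct alg_entry *algs,
					 size_t n, const char *name)
{
	const struct alg_entry *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (strcmp(algs[i].name, name) != 0)
			continue;
		if (!best || algs[i].priority > best->priority)
			best = &algs[i];
	}
	return best;
}

/* Priorities other than sm3-ce's 400 are assumptions for illustration. */
static const struct alg_entry sm3_algs[] = {
	{ "sm3", "sm3-generic", 100 },
	{ "sm3", "sm3-neon",    200 },
	{ "sm3", "sm3-ce",      400 },
};
```

Under these assumed values, pick_best(sm3_algs, 3, "sm3") selects the
sm3-ce entry, and a NEON driver registered at a priority between the
generic and CE values would only be chosen when sm3-ce is unavailable.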
* [PATCH 02/16] crypto: arm64/sm3 - add NEON assembly implementation
2022-09-26 9:36 ` Tianjia Zhang
@ 2022-09-26 9:36 ` Tianjia Zhang
-1 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26 9:36 UTC (permalink / raw)
To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
linux-stm32
This patch adds a NEON-accelerated implementation of the SM3 hash
algorithm. The core algorithm is based on the NEON-accelerated SM3
implementation from the libgcrypt project.
Benchmarked on a T-Head Yitian-710 at 2.75 GHz using mode 326 of
tcrypt, comparing sm3-neon with sm3-generic and sm3-ce. The columns
are the different update sizes, and the throughput is tabulated in
Mb/s:
update-size | 16 64 256 1024 2048 4096 8192
---------------+--------------------------------------------------------
sm3-generic | 185.24 221.28 301.26 307.43 300.83 308.82 308.91
sm3-neon | 171.81 220.20 322.94 339.28 334.09 343.61 343.87
sm3-ce | 227.48 333.48 502.62 527.87 520.45 534.91 535.40
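The relative gains implied by the table can be worked out with a few
lines of C. This is only a sketch for reading the numbers above; the
array names and the speedup() helper are hypothetical and are not part
of the patch.

```c
#include <stddef.h>

/* Throughput figures copied from the table above, in Mb/s. */
static const int update_size[7] = { 16, 64, 256, 1024, 2048, 4096, 8192 };
static const double sm3_generic[7] = {
	185.24, 221.28, 301.26, 307.43, 300.83, 308.82, 308.91
};
static const double sm3_neon[7] = {
	171.81, 220.20, 322.94, 339.28, 334.09, 343.61, 343.87
};
static const double sm3_ce[7] = {
	227.48, 333.48, 502.62, 527.87, 520.45, 534.91, 535.40
};

/* Throughput ratio of impl over base for one column; > 1.0 means faster. */
static double speedup(const double *impl, const double *base, size_t col)
{
	return impl[col] / base[col];
}
```

For the last column (update size 8192), speedup(sm3_neon, sm3_generic, 6)
is roughly 1.11 and speedup(sm3_ce, sm3_generic, 6) is roughly 1.73. For
the first column the NEON ratio is below 1.0, so in this data sm3-neon
is slower than sm3-generic at the smallest update size.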
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
arch/arm64/crypto/Kconfig | 11 +
arch/arm64/crypto/Makefile | 3 +
arch/arm64/crypto/sm3-neon-core.S | 600 ++++++++++++++++++++++++++++++
arch/arm64/crypto/sm3-neon-glue.c | 103 +++++
4 files changed, 717 insertions(+)
create mode 100644 arch/arm64/crypto/sm3-neon-core.S
create mode 100644 arch/arm64/crypto/sm3-neon-glue.c
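A note on the .LKtable constants at the end of sm3-neon-core.S: the 64
entries are consistent with the SM3 round constants T_j rotated left by
j mod 32, where T_j is 0x79cc4519 for rounds 0-15 and 0x7a879d8a for
rounds 16-63. The sketch below regenerates individual entries so the
table can be spot-checked. The helper names rol32_sketch() and sm3_k()
are hypothetical and do not appear in the patch; the rotate is written
as a right shift by (32 - n) combined with a left shift, mirroring how
the rolw() macro expresses a left rotate through ror.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by n bits (n is reduced modulo 32). */
static uint32_t rol32_sketch(uint32_t x, unsigned int n)
{
	n &= 31;
	if (n == 0)
		return x; /* avoid an undefined shift by 32 */
	return (x << n) | (x >> (32 - n));
}

/* Round constant for round j (0..63): T_j rotated left by j mod 32. */
static uint32_t sm3_k(unsigned int j)
{
	uint32_t t = (j < 16) ? 0x79cc4519u : 0x7a879d8au;

	return rol32_sketch(t, j % 32);
}
```

For example, sm3_k(0) and sm3_k(1) reproduce the first two words of the
table, and sm3_k(63) reproduces the last one.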
diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index 8bd80508a710..4b121dc0cfba 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -96,6 +96,17 @@ config CRYPTO_SHA3_ARM64
Architecture: arm64 using:
- ARMv8.2 Crypto Extensions
+config CRYPTO_SM3_NEON
+ tristate "Hash functions: SM3 (NEON)"
+ depends on KERNEL_MODE_NEON
+ select CRYPTO_HASH
+ select CRYPTO_SM3
+ help
+ SM3 (ShangMi 3) secure hash function (OSCCA GM/T 0004-2012)
+
+ Architecture: arm64 using:
+ - NEON (Advanced SIMD) extensions
+
config CRYPTO_SM3_ARM64_CE
tristate "Hash functions: SM3 (ARMv8.2 Crypto Extensions)"
depends on KERNEL_MODE_NEON
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index 24bb0c4610de..087f1625e775 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -17,6 +17,9 @@ sha512-ce-y := sha512-ce-glue.o sha512-ce-core.o
obj-$(CONFIG_CRYPTO_SHA3_ARM64) += sha3-ce.o
sha3-ce-y := sha3-ce-glue.o sha3-ce-core.o
+obj-$(CONFIG_CRYPTO_SM3_NEON) += sm3-neon.o
+sm3-neon-y := sm3-neon-glue.o sm3-neon-core.o
+
obj-$(CONFIG_CRYPTO_SM3_ARM64_CE) += sm3-ce.o
sm3-ce-y := sm3-ce-glue.o sm3-ce-core.o
diff --git a/arch/arm64/crypto/sm3-neon-core.S b/arch/arm64/crypto/sm3-neon-core.S
new file mode 100644
index 000000000000..3e3b4e5c736f
--- /dev/null
+++ b/arch/arm64/crypto/sm3-neon-core.S
@@ -0,0 +1,600 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * sm3-neon-core.S - SM3 secure hash using NEON instructions
+ *
+ * Linux/arm64 port of the libgcrypt SM3 implementation for AArch64
+ *
+ * Copyright (C) 2021 Jussi Kivilinna <jussi.kivilinna@iki.fi>
+ * Copyright (c) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+
+/* Context structure */
+
+#define state_h0 0
+#define state_h1 4
+#define state_h2 8
+#define state_h3 12
+#define state_h4 16
+#define state_h5 20
+#define state_h6 24
+#define state_h7 28
+
+/* Stack structure */
+
+#define STACK_W_SIZE (32 * 2 * 3)
+
+#define STACK_W (0)
+#define STACK_SIZE (STACK_W + STACK_W_SIZE)
+
+/* Register macros */
+
+#define RSTATE x0
+#define RDATA x1
+#define RNBLKS x2
+#define RKPTR x28
+#define RFRAME x29
+
+#define ra w3
+#define rb w4
+#define rc w5
+#define rd w6
+#define re w7
+#define rf w8
+#define rg w9
+#define rh w10
+
+#define t0 w11
+#define t1 w12
+#define t2 w13
+#define t3 w14
+#define t4 w15
+#define t5 w16
+#define t6 w17
+
+#define k_even w19
+#define k_odd w20
+
+#define addr0 x21
+#define addr1 x22
+
+#define s0 w23
+#define s1 w24
+#define s2 w25
+#define s3 w26
+
+#define W0 v0
+#define W1 v1
+#define W2 v2
+#define W3 v3
+#define W4 v4
+#define W5 v5
+
+#define XTMP0 v6
+#define XTMP1 v7
+#define XTMP2 v16
+#define XTMP3 v17
+#define XTMP4 v18
+#define XTMP5 v19
+#define XTMP6 v20
+
+/* Helper macros. */
+
+#define _(...) /*_*/
+
+#define clear_vec(x) \
+ movi x.8h, #0;
+
+#define rolw(o, a, n) \
+ ror o, a, #(32 - n);
+
+/* Round function macros. */
+
+#define GG1_1(x, y, z, o, t) \
+ eor o, x, y;
+#define GG1_2(x, y, z, o, t) \
+ eor o, o, z;
+#define GG1_3(x, y, z, o, t)
+
+#define FF1_1(x, y, z, o, t) GG1_1(x, y, z, o, t)
+#define FF1_2(x, y, z, o, t)
+#define FF1_3(x, y, z, o, t) GG1_2(x, y, z, o, t)
+
+#define GG2_1(x, y, z, o, t) \
+ bic o, z, x;
+#define GG2_2(x, y, z, o, t) \
+ and t, y, x;
+#define GG2_3(x, y, z, o, t) \
+ eor o, o, t;
+
+#define FF2_1(x, y, z, o, t) \
+ eor o, x, y;
+#define FF2_2(x, y, z, o, t) \
+ and t, x, y; \
+ and o, o, z;
+#define FF2_3(x, y, z, o, t) \
+ eor o, o, t;
+
+#define R(i, a, b, c, d, e, f, g, h, k, K_LOAD, round, widx, wtype, IOP, iop_param) \
+ K_LOAD(round); \
+ ldr t5, [sp, #(wtype##_W1_ADDR(round, widx))]; \
+ rolw(t0, a, 12); /* rol(a, 12) => t0 */ \
+ IOP(1, iop_param); \
+ FF##i##_1(a, b, c, t1, t2); \
+ ldr t6, [sp, #(wtype##_W1W2_ADDR(round, widx))]; \
+ add k, k, e; \
+ IOP(2, iop_param); \
+ GG##i##_1(e, f, g, t3, t4); \
+ FF##i##_2(a, b, c, t1, t2); \
+ IOP(3, iop_param); \
+ add k, k, t0; \
+ add h, h, t5; \
+ add d, d, t6; /* w1w2 + d => d */ \
+ IOP(4, iop_param); \
+ rolw(k, k, 7); /* rol((t0 + e + t), 7) => k */ \
+ GG##i##_2(e, f, g, t3, t4); \
+ add h, h, k; /* h + w1 + k => h */ \
+ IOP(5, iop_param); \
+ FF##i##_3(a, b, c, t1, t2); \
+ eor t0, t0, k; /* k ^ t0 => t0 */ \
+ GG##i##_3(e, f, g, t3, t4); \
+ add d, d, t1; /* FF(a,b,c) + d => d */ \
+ IOP(6, iop_param); \
+ add t3, t3, h; /* GG(e,f,g) + h => t3 */ \
+ rolw(b, b, 9); /* rol(b, 9) => b */ \
+ eor h, t3, t3, ror #(32-9); \
+ IOP(7, iop_param); \
+ add d, d, t0; /* t0 + d => d */ \
+ rolw(f, f, 19); /* rol(f, 19) => f */ \
+ IOP(8, iop_param); \
+ eor h, h, t3, ror #(32-17); /* P0(t3) => h */
+
+#define R1(a, b, c, d, e, f, g, h, k, K_LOAD, round, widx, wtype, IOP, iop_param) \
+ R(1, ##a, ##b, ##c, ##d, ##e, ##f, ##g, ##h, ##k, K_LOAD, round, widx, wtype, IOP, iop_param)
+
+#define R2(a, b, c, d, e, f, g, h, k, K_LOAD, round, widx, wtype, IOP, iop_param) \
+ R(2, ##a, ##b, ##c, ##d, ##e, ##f, ##g, ##h, ##k, K_LOAD, round, widx, wtype, IOP, iop_param)
+
+#define KL(round) \
+ ldp k_even, k_odd, [RKPTR, #(4*(round))];
+
+/* Input expansion macros. */
+
+/* Byte-swapped input address. */
+#define IW_W_ADDR(round, widx, offs) \
+ (STACK_W + ((round) / 4) * 64 + (offs) + ((widx) * 4))
+
+/* Expanded input address. */
+#define XW_W_ADDR(round, widx, offs) \
+ (STACK_W + ((((round) / 3) - 4) % 2) * 64 + (offs) + ((widx) * 4))
+
+/* Rounds 1-12, byte-swapped input block addresses. */
+#define IW_W1_ADDR(round, widx) IW_W_ADDR(round, widx, 32)
+#define IW_W1W2_ADDR(round, widx) IW_W_ADDR(round, widx, 48)
+
+/* Rounds 1-12, expanded input block addresses. */
+#define XW_W1_ADDR(round, widx) XW_W_ADDR(round, widx, 0)
+#define XW_W1W2_ADDR(round, widx) XW_W_ADDR(round, widx, 16)
+
+/* Input block loading.
+ * Interleaving within round function needed for in-order CPUs. */
+#define LOAD_W_VEC_1_1() \
+ add addr0, sp, #IW_W1_ADDR(0, 0);
+#define LOAD_W_VEC_1_2() \
+ add addr1, sp, #IW_W1_ADDR(4, 0);
+#define LOAD_W_VEC_1_3() \
+ ld1 {W0.16b}, [RDATA], #16;
+#define LOAD_W_VEC_1_4() \
+ ld1 {W1.16b}, [RDATA], #16;
+#define LOAD_W_VEC_1_5() \
+ ld1 {W2.16b}, [RDATA], #16;
+#define LOAD_W_VEC_1_6() \
+ ld1 {W3.16b}, [RDATA], #16;
+#define LOAD_W_VEC_1_7() \
+ rev32 XTMP0.16b, W0.16b;
+#define LOAD_W_VEC_1_8() \
+ rev32 XTMP1.16b, W1.16b;
+#define LOAD_W_VEC_2_1() \
+ rev32 XTMP2.16b, W2.16b;
+#define LOAD_W_VEC_2_2() \
+ rev32 XTMP3.16b, W3.16b;
+#define LOAD_W_VEC_2_3() \
+ eor XTMP4.16b, XTMP1.16b, XTMP0.16b;
+#define LOAD_W_VEC_2_4() \
+ eor XTMP5.16b, XTMP2.16b, XTMP1.16b;
+#define LOAD_W_VEC_2_5() \
+ st1 {XTMP0.16b}, [addr0], #16;
+#define LOAD_W_VEC_2_6() \
+ st1 {XTMP4.16b}, [addr0]; \
+ add addr0, sp, #IW_W1_ADDR(8, 0);
+#define LOAD_W_VEC_2_7() \
+ eor XTMP6.16b, XTMP3.16b, XTMP2.16b;
+#define LOAD_W_VEC_2_8() \
+ ext W0.16b, XTMP0.16b, XTMP0.16b, #8; /* W0: xx, w0, xx, xx */
+#define LOAD_W_VEC_3_1() \
+ mov W2.16b, XTMP1.16b; /* W2: xx, w6, w5, w4 */
+#define LOAD_W_VEC_3_2() \
+ st1 {XTMP1.16b}, [addr1], #16;
+#define LOAD_W_VEC_3_3() \
+ st1 {XTMP5.16b}, [addr1]; \
+ ext W1.16b, XTMP0.16b, XTMP0.16b, #4; /* W1: xx, w3, w2, w1 */
+#define LOAD_W_VEC_3_4() \
+ ext W3.16b, XTMP1.16b, XTMP2.16b, #12; /* W3: xx, w9, w8, w7 */
+#define LOAD_W_VEC_3_5() \
+ ext W4.16b, XTMP2.16b, XTMP3.16b, #8; /* W4: xx, w12, w11, w10 */
+#define LOAD_W_VEC_3_6() \
+ st1 {XTMP2.16b}, [addr0], #16;
+#define LOAD_W_VEC_3_7() \
+ st1 {XTMP6.16b}, [addr0];
+#define LOAD_W_VEC_3_8() \
+ ext W5.16b, XTMP3.16b, XTMP3.16b, #4; /* W5: xx, w15, w14, w13 */
+
+#define LOAD_W_VEC_1(iop_num, ...) \
+ LOAD_W_VEC_1_##iop_num()
+#define LOAD_W_VEC_2(iop_num, ...) \
+ LOAD_W_VEC_2_##iop_num()
+#define LOAD_W_VEC_3(iop_num, ...) \
+ LOAD_W_VEC_3_##iop_num()
+
+/* Message scheduling. Note: 3 words per vector register.
+ * Interleaving within round function needed for in-order CPUs. */
+#define SCHED_W_1_1(round, w0, w1, w2, w3, w4, w5) \
+ /* Load (w[i - 16]) => XTMP0 */ \
+ /* Load (w[i - 13]) => XTMP5 */ \
+ ext XTMP0.16b, w0.16b, w0.16b, #12; /* XTMP0: w0, xx, xx, xx */
+#define SCHED_W_1_2(round, w0, w1, w2, w3, w4, w5) \
+ ext XTMP5.16b, w1.16b, w1.16b, #12;
+#define SCHED_W_1_3(round, w0, w1, w2, w3, w4, w5) \
+ ext XTMP0.16b, XTMP0.16b, w1.16b, #12; /* XTMP0: xx, w2, w1, w0 */
+#define SCHED_W_1_4(round, w0, w1, w2, w3, w4, w5) \
+ ext XTMP5.16b, XTMP5.16b, w2.16b, #12;
+#define SCHED_W_1_5(round, w0, w1, w2, w3, w4, w5) \
+ /* w[i - 9] == w3 */ \
+ /* W3 ^ XTMP0 => XTMP0 */ \
+ eor XTMP0.16b, XTMP0.16b, w3.16b;
+#define SCHED_W_1_6(round, w0, w1, w2, w3, w4, w5) \
+ /* w[i - 3] == w5 */ \
+ /* rol(w5, 15) ^ XTMP0 => XTMP0 */ \
+ /* rol(XTMP5, 7) => XTMP1 */ \
+ add addr0, sp, #XW_W1_ADDR((round), 0); \
+ shl XTMP2.4s, w5.4s, #15;
+#define SCHED_W_1_7(round, w0, w1, w2, w3, w4, w5) \
+ shl XTMP1.4s, XTMP5.4s, #7;
+#define SCHED_W_1_8(round, w0, w1, w2, w3, w4, w5) \
+ sri XTMP2.4s, w5.4s, #(32-15);
+#define SCHED_W_2_1(round, w0, w1, w2, w3, w4, w5) \
+ sri XTMP1.4s, XTMP5.4s, #(32-7);
+#define SCHED_W_2_2(round, w0, w1, w2, w3, w4, w5) \
+ eor XTMP0.16b, XTMP0.16b, XTMP2.16b;
+#define SCHED_W_2_3(round, w0, w1, w2, w3, w4, w5) \
+ /* w[i - 6] == W4 */ \
+ /* W4 ^ XTMP1 => XTMP1 */ \
+ eor XTMP1.16b, XTMP1.16b, w4.16b;
+#define SCHED_W_2_4(round, w0, w1, w2, w3, w4, w5) \
+ /* P1(XTMP0) ^ XTMP1 => W0 */ \
+ shl XTMP3.4s, XTMP0.4s, #15;
+#define SCHED_W_2_5(round, w0, w1, w2, w3, w4, w5) \
+ shl XTMP4.4s, XTMP0.4s, #23;
+#define SCHED_W_2_6(round, w0, w1, w2, w3, w4, w5) \
+ eor w0.16b, XTMP1.16b, XTMP0.16b;
+#define SCHED_W_2_7(round, w0, w1, w2, w3, w4, w5) \
+ sri XTMP3.4s, XTMP0.4s, #(32-15);
+#define SCHED_W_2_8(round, w0, w1, w2, w3, w4, w5) \
+ sri XTMP4.4s, XTMP0.4s, #(32-23);
+#define SCHED_W_3_1(round, w0, w1, w2, w3, w4, w5) \
+ eor w0.16b, w0.16b, XTMP3.16b;
+#define SCHED_W_3_2(round, w0, w1, w2, w3, w4, w5) \
+ /* Load (w[i - 3]) => XTMP2 */ \
+ ext XTMP2.16b, w4.16b, w4.16b, #12;
+#define SCHED_W_3_3(round, w0, w1, w2, w3, w4, w5) \
+ eor w0.16b, w0.16b, XTMP4.16b;
+#define SCHED_W_3_4(round, w0, w1, w2, w3, w4, w5) \
+ ext XTMP2.16b, XTMP2.16b, w5.16b, #12;
+#define SCHED_W_3_5(round, w0, w1, w2, w3, w4, w5) \
+ /* W1 ^ W2 => XTMP3 */ \
+ eor XTMP3.16b, XTMP2.16b, w0.16b;
+#define SCHED_W_3_6(round, w0, w1, w2, w3, w4, w5)
+#define SCHED_W_3_7(round, w0, w1, w2, w3, w4, w5) \
+ st1 {XTMP2.16b-XTMP3.16b}, [addr0];
+#define SCHED_W_3_8(round, w0, w1, w2, w3, w4, w5)
+
+#define SCHED_W_W0W1W2W3W4W5_1(iop_num, round) \
+ SCHED_W_1_##iop_num(round, W0, W1, W2, W3, W4, W5)
+#define SCHED_W_W0W1W2W3W4W5_2(iop_num, round) \
+ SCHED_W_2_##iop_num(round, W0, W1, W2, W3, W4, W5)
+#define SCHED_W_W0W1W2W3W4W5_3(iop_num, round) \
+ SCHED_W_3_##iop_num(round, W0, W1, W2, W3, W4, W5)
+
+#define SCHED_W_W1W2W3W4W5W0_1(iop_num, round) \
+ SCHED_W_1_##iop_num(round, W1, W2, W3, W4, W5, W0)
+#define SCHED_W_W1W2W3W4W5W0_2(iop_num, round) \
+ SCHED_W_2_##iop_num(round, W1, W2, W3, W4, W5, W0)
+#define SCHED_W_W1W2W3W4W5W0_3(iop_num, round) \
+ SCHED_W_3_##iop_num(round, W1, W2, W3, W4, W5, W0)
+
+#define SCHED_W_W2W3W4W5W0W1_1(iop_num, round) \
+ SCHED_W_1_##iop_num(round, W2, W3, W4, W5, W0, W1)
+#define SCHED_W_W2W3W4W5W0W1_2(iop_num, round) \
+ SCHED_W_2_##iop_num(round, W2, W3, W4, W5, W0, W1)
+#define SCHED_W_W2W3W4W5W0W1_3(iop_num, round) \
+ SCHED_W_3_##iop_num(round, W2, W3, W4, W5, W0, W1)
+
+#define SCHED_W_W3W4W5W0W1W2_1(iop_num, round) \
+ SCHED_W_1_##iop_num(round, W3, W4, W5, W0, W1, W2)
+#define SCHED_W_W3W4W5W0W1W2_2(iop_num, round) \
+ SCHED_W_2_##iop_num(round, W3, W4, W5, W0, W1, W2)
+#define SCHED_W_W3W4W5W0W1W2_3(iop_num, round) \
+ SCHED_W_3_##iop_num(round, W3, W4, W5, W0, W1, W2)
+
+#define SCHED_W_W4W5W0W1W2W3_1(iop_num, round) \
+ SCHED_W_1_##iop_num(round, W4, W5, W0, W1, W2, W3)
+#define SCHED_W_W4W5W0W1W2W3_2(iop_num, round) \
+ SCHED_W_2_##iop_num(round, W4, W5, W0, W1, W2, W3)
+#define SCHED_W_W4W5W0W1W2W3_3(iop_num, round) \
+ SCHED_W_3_##iop_num(round, W4, W5, W0, W1, W2, W3)
+
+#define SCHED_W_W5W0W1W2W3W4_1(iop_num, round) \
+ SCHED_W_1_##iop_num(round, W5, W0, W1, W2, W3, W4)
+#define SCHED_W_W5W0W1W2W3W4_2(iop_num, round) \
+ SCHED_W_2_##iop_num(round, W5, W0, W1, W2, W3, W4)
+#define SCHED_W_W5W0W1W2W3W4_3(iop_num, round) \
+ SCHED_W_3_##iop_num(round, W5, W0, W1, W2, W3, W4)
+
+
+ /*
+ * Transform blocks*64 bytes (blocks*16 32-bit words) at 'src'.
+ *
+ * void sm3_neon_transform(struct sm3_state *sst, u8 const *src,
+ * int blocks)
+ */
+ .text
+.align 3
+SYM_FUNC_START(sm3_neon_transform)
+ ldp ra, rb, [RSTATE, #0]
+ ldp rc, rd, [RSTATE, #8]
+ ldp re, rf, [RSTATE, #16]
+ ldp rg, rh, [RSTATE, #24]
+
+ stp x28, x29, [sp, #-16]!
+ stp x19, x20, [sp, #-16]!
+ stp x21, x22, [sp, #-16]!
+ stp x23, x24, [sp, #-16]!
+ stp x25, x26, [sp, #-16]!
+ mov RFRAME, sp
+
+ sub addr0, sp, #STACK_SIZE
+ adr_l RKPTR, .LKtable
+ and sp, addr0, #(~63)
+
+ /* Preload first block. */
+ LOAD_W_VEC_1(1, 0)
+ LOAD_W_VEC_1(2, 0)
+ LOAD_W_VEC_1(3, 0)
+ LOAD_W_VEC_1(4, 0)
+ LOAD_W_VEC_1(5, 0)
+ LOAD_W_VEC_1(6, 0)
+ LOAD_W_VEC_1(7, 0)
+ LOAD_W_VEC_1(8, 0)
+ LOAD_W_VEC_2(1, 0)
+ LOAD_W_VEC_2(2, 0)
+ LOAD_W_VEC_2(3, 0)
+ LOAD_W_VEC_2(4, 0)
+ LOAD_W_VEC_2(5, 0)
+ LOAD_W_VEC_2(6, 0)
+ LOAD_W_VEC_2(7, 0)
+ LOAD_W_VEC_2(8, 0)
+ LOAD_W_VEC_3(1, 0)
+ LOAD_W_VEC_3(2, 0)
+ LOAD_W_VEC_3(3, 0)
+ LOAD_W_VEC_3(4, 0)
+ LOAD_W_VEC_3(5, 0)
+ LOAD_W_VEC_3(6, 0)
+ LOAD_W_VEC_3(7, 0)
+ LOAD_W_VEC_3(8, 0)
+
+.balign 16
+.Loop:
+ /* Transform 0-3 */
+ R1(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 0, 0, IW, _, 0)
+ R1(rd, ra, rb, rc, rh, re, rf, rg, k_odd, _, 1, 1, IW, _, 0)
+ R1(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 2, 2, IW, _, 0)
+ R1(rb, rc, rd, ra, rf, rg, rh, re, k_odd, _, 3, 3, IW, _, 0)
+
+ /* Transform 4-7 + Precalc 12-14 */
+ R1(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 4, 0, IW, _, 0)
+ R1(rd, ra, rb, rc, rh, re, rf, rg, k_odd, _, 5, 1, IW, _, 0)
+ R1(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 6, 2, IW, SCHED_W_W0W1W2W3W4W5_1, 12)
+ R1(rb, rc, rd, ra, rf, rg, rh, re, k_odd, _, 7, 3, IW, SCHED_W_W0W1W2W3W4W5_2, 12)
+
+ /* Transform 8-11 + Precalc 12-17 */
+ R1(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 8, 0, IW, SCHED_W_W0W1W2W3W4W5_3, 12)
+ R1(rd, ra, rb, rc, rh, re, rf, rg, k_odd, _, 9, 1, IW, SCHED_W_W1W2W3W4W5W0_1, 15)
+ R1(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 10, 2, IW, SCHED_W_W1W2W3W4W5W0_2, 15)
+ R1(rb, rc, rd, ra, rf, rg, rh, re, k_odd, _, 11, 3, IW, SCHED_W_W1W2W3W4W5W0_3, 15)
+
+ /* Transform 12-14 + Precalc 18-20 */
+ R1(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 12, 0, XW, SCHED_W_W2W3W4W5W0W1_1, 18)
+ R1(rd, ra, rb, rc, rh, re, rf, rg, k_odd, _, 13, 1, XW, SCHED_W_W2W3W4W5W0W1_2, 18)
+ R1(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 14, 2, XW, SCHED_W_W2W3W4W5W0W1_3, 18)
+
+ /* Transform 15-17 + Precalc 21-23 */
+ R1(rb, rc, rd, ra, rf, rg, rh, re, k_odd, _, 15, 0, XW, SCHED_W_W3W4W5W0W1W2_1, 21)
+ R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 16, 1, XW, SCHED_W_W3W4W5W0W1W2_2, 21)
+ R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd, _, 17, 2, XW, SCHED_W_W3W4W5W0W1W2_3, 21)
+
+ /* Transform 18-20 + Precalc 24-26 */
+ R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 18, 0, XW, SCHED_W_W4W5W0W1W2W3_1, 24)
+ R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd, _, 19, 1, XW, SCHED_W_W4W5W0W1W2W3_2, 24)
+ R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 20, 2, XW, SCHED_W_W4W5W0W1W2W3_3, 24)
+
+ /* Transform 21-23 + Precalc 27-29 */
+ R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd, _, 21, 0, XW, SCHED_W_W5W0W1W2W3W4_1, 27)
+ R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 22, 1, XW, SCHED_W_W5W0W1W2W3W4_2, 27)
+ R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd, _, 23, 2, XW, SCHED_W_W5W0W1W2W3W4_3, 27)
+
+ /* Transform 24-26 + Precalc 30-32 */
+ R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 24, 0, XW, SCHED_W_W0W1W2W3W4W5_1, 30)
+ R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd, _, 25, 1, XW, SCHED_W_W0W1W2W3W4W5_2, 30)
+ R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 26, 2, XW, SCHED_W_W0W1W2W3W4W5_3, 30)
+
+ /* Transform 27-29 + Precalc 33-35 */
+ R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd, _, 27, 0, XW, SCHED_W_W1W2W3W4W5W0_1, 33)
+ R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 28, 1, XW, SCHED_W_W1W2W3W4W5W0_2, 33)
+ R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd, _, 29, 2, XW, SCHED_W_W1W2W3W4W5W0_3, 33)
+
+ /* Transform 30-32 + Precalc 36-38 */
+ R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 30, 0, XW, SCHED_W_W2W3W4W5W0W1_1, 36)
+ R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd, _, 31, 1, XW, SCHED_W_W2W3W4W5W0W1_2, 36)
+ R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 32, 2, XW, SCHED_W_W2W3W4W5W0W1_3, 36)
+
+ /* Transform 33-35 + Precalc 39-41 */
+ R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd, _, 33, 0, XW, SCHED_W_W3W4W5W0W1W2_1, 39)
+ R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 34, 1, XW, SCHED_W_W3W4W5W0W1W2_2, 39)
+ R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd, _, 35, 2, XW, SCHED_W_W3W4W5W0W1W2_3, 39)
+
+ /* Transform 36-38 + Precalc 42-44 */
+ R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 36, 0, XW, SCHED_W_W4W5W0W1W2W3_1, 42)
+ R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd, _, 37, 1, XW, SCHED_W_W4W5W0W1W2W3_2, 42)
+ R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 38, 2, XW, SCHED_W_W4W5W0W1W2W3_3, 42)
+
+ /* Transform 39-41 + Precalc 45-47 */
+ R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd, _, 39, 0, XW, SCHED_W_W5W0W1W2W3W4_1, 45)
+ R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 40, 1, XW, SCHED_W_W5W0W1W2W3W4_2, 45)
+ R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd, _, 41, 2, XW, SCHED_W_W5W0W1W2W3W4_3, 45)
+
+ /* Transform 42-44 + Precalc 48-50 */
+ R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 42, 0, XW, SCHED_W_W0W1W2W3W4W5_1, 48)
+ R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd, _, 43, 1, XW, SCHED_W_W0W1W2W3W4W5_2, 48)
+ R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 44, 2, XW, SCHED_W_W0W1W2W3W4W5_3, 48)
+
+ /* Transform 45-47 + Precalc 51-53 */
+ R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd, _, 45, 0, XW, SCHED_W_W1W2W3W4W5W0_1, 51)
+ R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 46, 1, XW, SCHED_W_W1W2W3W4W5W0_2, 51)
+ R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd, _, 47, 2, XW, SCHED_W_W1W2W3W4W5W0_3, 51)
+
+ /* Transform 48-50 + Precalc 54-56 */
+ R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 48, 0, XW, SCHED_W_W2W3W4W5W0W1_1, 54)
+ R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd, _, 49, 1, XW, SCHED_W_W2W3W4W5W0W1_2, 54)
+ R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 50, 2, XW, SCHED_W_W2W3W4W5W0W1_3, 54)
+
+ /* Transform 51-53 + Precalc 57-59 */
+ R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd, _, 51, 0, XW, SCHED_W_W3W4W5W0W1W2_1, 57)
+ R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 52, 1, XW, SCHED_W_W3W4W5W0W1W2_2, 57)
+ R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd, _, 53, 2, XW, SCHED_W_W3W4W5W0W1W2_3, 57)
+
+ /* Transform 54-56 + Precalc 60-62 */
+ R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 54, 0, XW, SCHED_W_W4W5W0W1W2W3_1, 60)
+ R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd, _, 55, 1, XW, SCHED_W_W4W5W0W1W2W3_2, 60)
+ R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 56, 2, XW, SCHED_W_W4W5W0W1W2W3_3, 60)
+
+ /* Transform 57-59 + Precalc 63 */
+ R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd, _, 57, 0, XW, SCHED_W_W5W0W1W2W3W4_1, 63)
+ R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 58, 1, XW, SCHED_W_W5W0W1W2W3W4_2, 63)
+ R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd, _, 59, 2, XW, SCHED_W_W5W0W1W2W3W4_3, 63)
+
+ /* Transform 60 */
+ R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 60, 0, XW, _, _)
+ subs RNBLKS, RNBLKS, #1
+ b.eq .Lend
+
+ /* Transform 61-63 + Preload next block */
+ R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd, _, 61, 1, XW, LOAD_W_VEC_1, _)
+ ldp s0, s1, [RSTATE, #0]
+ R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 62, 2, XW, LOAD_W_VEC_2, _)
+ ldp s2, s3, [RSTATE, #8]
+ R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd, _, 63, 0, XW, LOAD_W_VEC_3, _)
+
+ /* Update the chaining variables. */
+ eor ra, ra, s0
+ eor rb, rb, s1
+ ldp s0, s1, [RSTATE, #16]
+ eor rc, rc, s2
+ ldp k_even, k_odd, [RSTATE, #24]
+ eor rd, rd, s3
+ eor re, re, s0
+ stp ra, rb, [RSTATE, #0]
+ eor rf, rf, s1
+ stp rc, rd, [RSTATE, #8]
+ eor rg, rg, k_even
+ stp re, rf, [RSTATE, #16]
+ eor rh, rh, k_odd
+ stp rg, rh, [RSTATE, #24]
+ b .Loop
+
+.Lend:
+ /* Transform 61-63 */
+ R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd, _, 61, 1, XW, _, _)
+ ldp s0, s1, [RSTATE, #0]
+ R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 62, 2, XW, _, _)
+ ldp s2, s3, [RSTATE, #8]
+ R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd, _, 63, 0, XW, _, _)
+
+ /* Update the chaining variables. */
+ eor ra, ra, s0
+ clear_vec(W0)
+ eor rb, rb, s1
+ clear_vec(W1)
+ ldp s0, s1, [RSTATE, #16]
+ clear_vec(W2)
+ eor rc, rc, s2
+ clear_vec(W3)
+ ldp k_even, k_odd, [RSTATE, #24]
+ clear_vec(W4)
+ eor rd, rd, s3
+ clear_vec(W5)
+ eor re, re, s0
+ clear_vec(XTMP0)
+ stp ra, rb, [RSTATE, #0]
+ clear_vec(XTMP1)
+ eor rf, rf, s1
+ clear_vec(XTMP2)
+ stp rc, rd, [RSTATE, #8]
+ clear_vec(XTMP3)
+ eor rg, rg, k_even
+ clear_vec(XTMP4)
+ stp re, rf, [RSTATE, #16]
+ clear_vec(XTMP5)
+ eor rh, rh, k_odd
+ clear_vec(XTMP6)
+ stp rg, rh, [RSTATE, #24]
+
+ /* Clear message expansion area */
+ add addr0, sp, #STACK_W
+ st1 {W0.16b-W3.16b}, [addr0], #64
+ st1 {W0.16b-W3.16b}, [addr0], #64
+ st1 {W0.16b-W3.16b}, [addr0]
+
+ mov sp, RFRAME
+
+ ldp x25, x26, [sp], #16
+ ldp x23, x24, [sp], #16
+ ldp x21, x22, [sp], #16
+ ldp x19, x20, [sp], #16
+ ldp x28, x29, [sp], #16
+
+ ret
+SYM_FUNC_END(sm3_neon_transform)
+
+
+ .section ".rodata", "a"
+
+ .align 4
+.LKtable:
+ .long 0x79cc4519, 0xf3988a32, 0xe7311465, 0xce6228cb
+ .long 0x9cc45197, 0x3988a32f, 0x7311465e, 0xe6228cbc
+ .long 0xcc451979, 0x988a32f3, 0x311465e7, 0x6228cbce
+ .long 0xc451979c, 0x88a32f39, 0x11465e73, 0x228cbce6
+ .long 0x9d8a7a87, 0x3b14f50f, 0x7629ea1e, 0xec53d43c
+ .long 0xd8a7a879, 0xb14f50f3, 0x629ea1e7, 0xc53d43ce
+ .long 0x8a7a879d, 0x14f50f3b, 0x29ea1e76, 0x53d43cec
+ .long 0xa7a879d8, 0x4f50f3b1, 0x9ea1e762, 0x3d43cec5
+ .long 0x7a879d8a, 0xf50f3b14, 0xea1e7629, 0xd43cec53
+ .long 0xa879d8a7, 0x50f3b14f, 0xa1e7629e, 0x43cec53d
+ .long 0x879d8a7a, 0x0f3b14f5, 0x1e7629ea, 0x3cec53d4
+ .long 0x79d8a7a8, 0xf3b14f50, 0xe7629ea1, 0xcec53d43
+ .long 0x9d8a7a87, 0x3b14f50f, 0x7629ea1e, 0xec53d43c
+ .long 0xd8a7a879, 0xb14f50f3, 0x629ea1e7, 0xc53d43ce
+ .long 0x8a7a879d, 0x14f50f3b, 0x29ea1e76, 0x53d43cec
+ .long 0xa7a879d8, 0x4f50f3b1, 0x9ea1e762, 0x3d43cec5
diff --git a/arch/arm64/crypto/sm3-neon-glue.c b/arch/arm64/crypto/sm3-neon-glue.c
new file mode 100644
index 000000000000..7182ee683f14
--- /dev/null
+++ b/arch/arm64/crypto/sm3-neon-glue.c
@@ -0,0 +1,103 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * sm3-neon-glue.c - SM3 secure hash using NEON instructions
+ *
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#include <asm/neon.h>
+#include <asm/simd.h>
+#include <asm/unaligned.h>
+#include <crypto/internal/hash.h>
+#include <crypto/internal/simd.h>
+#include <crypto/sm3.h>
+#include <crypto/sm3_base.h>
+#include <linux/cpufeature.h>
+#include <linux/crypto.h>
+#include <linux/module.h>
+
+
+asmlinkage void sm3_neon_transform(struct sm3_state *sst, u8 const *src,
+ int blocks);
+
+static int sm3_neon_update(struct shash_desc *desc, const u8 *data,
+ unsigned int len)
+{
+ if (!crypto_simd_usable()) {
+ sm3_update(shash_desc_ctx(desc), data, len);
+ return 0;
+ }
+
+ kernel_neon_begin();
+ sm3_base_do_update(desc, data, len, sm3_neon_transform);
+ kernel_neon_end();
+
+ return 0;
+}
+
+static int sm3_neon_final(struct shash_desc *desc, u8 *out)
+{
+ if (!crypto_simd_usable()) {
+ sm3_final(shash_desc_ctx(desc), out);
+ return 0;
+ }
+
+ kernel_neon_begin();
+ sm3_base_do_finalize(desc, sm3_neon_transform);
+ kernel_neon_end();
+
+ return sm3_base_finish(desc, out);
+}
+
+static int sm3_neon_finup(struct shash_desc *desc, const u8 *data,
+ unsigned int len, u8 *out)
+{
+ if (!crypto_simd_usable()) {
+ struct sm3_state *sctx = shash_desc_ctx(desc);
+
+ if (len)
+ sm3_update(sctx, data, len);
+ sm3_final(sctx, out);
+ return 0;
+ }
+
+ kernel_neon_begin();
+ if (len)
+ sm3_base_do_update(desc, data, len, sm3_neon_transform);
+ sm3_base_do_finalize(desc, sm3_neon_transform);
+ kernel_neon_end();
+
+ return sm3_base_finish(desc, out);
+}
+
+static struct shash_alg sm3_alg = {
+ .digestsize = SM3_DIGEST_SIZE,
+ .init = sm3_base_init,
+ .update = sm3_neon_update,
+ .final = sm3_neon_final,
+ .finup = sm3_neon_finup,
+ .descsize = sizeof(struct sm3_state),
+ .base.cra_name = "sm3",
+ .base.cra_driver_name = "sm3-neon",
+ .base.cra_blocksize = SM3_BLOCK_SIZE,
+ .base.cra_module = THIS_MODULE,
+ .base.cra_priority = 200,
+};
+
+static int __init sm3_neon_init(void)
+{
+ return crypto_register_shash(&sm3_alg);
+}
+
+static void __exit sm3_neon_fini(void)
+{
+ crypto_unregister_shash(&sm3_alg);
+}
+
+module_init(sm3_neon_init);
+module_exit(sm3_neon_fini);
+
+MODULE_DESCRIPTION("SM3 secure hash using NEON instructions");
+MODULE_AUTHOR("Jussi Kivilinna <jussi.kivilinna@iki.fi>");
+MODULE_AUTHOR("Tianjia Zhang <tianjia.zhang@linux.alibaba.com>");
+MODULE_LICENSE("GPL v2");
--
2.24.3 (Apple Git-128)
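
Note on the .LKtable constants in sm3-neon-core.S above: the 64 entries
are consistent with the SM3 round-constant rule, K[j] = rol32(T[j], j mod 32),
with T[j] = 0x79cc4519 for rounds 0-15 and 0x7a879d8a for rounds 16-63.
The userspace C sketch below is not part of the patch. The names rol32()
and sm3_k() are illustrative, and the code only shows how the table values
could be regenerated when reviewing the assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit word left by n bits (n is reduced modulo 32). */
static uint32_t rol32(uint32_t x, unsigned int n)
{
	n &= 31;
	return n ? (x << n) | (x >> (32 - n)) : x;
}

/*
 * Round constant for round j (0..63): the SM3 base constant T_j,
 * pre-rotated left by (j mod 32). Hypothetical helper for review only.
 */
uint32_t sm3_k(unsigned int j)
{
	assert(j < 64);
	return rol32(j < 16 ? 0x79cc4519u : 0x7a879d8au, j);
}
```

For example, sm3_k(16) yields 0x9d8a7a87, the first word of the fifth
table row, and sm3_k(63) yields 0x3d43cec5, the last word of the table.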
* [PATCH 03/16] crypto: arm64/sm4 - refactor and simplify NEON implementation
@ 2022-09-26 9:36 ` Tianjia Zhang
From: Tianjia Zhang @ 2022-09-26 9:36 UTC (permalink / raw)
To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
linux-stm32
This patch adds no new features. It refactors and simplifies the SM4
NEON implementation in the following ways:

The accelerated implementation now supports an arbitrary number of
blocks, not just multiples of 8. This simplifies the code and speeds up
inputs whose length is not a multiple of 8 blocks.

When loading input data, the ld4 instruction replaces the original ld1
instruction wherever possible, which saves the cost of transposing the
input matrix.

Matrix transposition and rotation use 8-block parallelism whenever
possible, instead of at most 4-block parallelism.
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
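Note (illustration only, not part of the patch): the reworked entry
points appear to consume nblocks in three stages: an 8-block loop, at
most one 4-block pass, and a tail of 1 to 3 blocks. A minimal C sketch
of that control flow follows, assuming nblocks >= 1 (the glue code only
calls the assembly when nblocks is non-zero). struct sm4_split and
sm4_split_blocks() are hypothetical names used only for this sketch.

```c
/*
 * Hypothetical model of how sm4_neon_crypt() and friends walk nblocks.
 * Not kernel code; it only mirrors the branch structure of the assembly.
 */
struct sm4_split {
	unsigned int groups8;	/* iterations of the 8-block loop */
	unsigned int groups4;	/* 0 or 1 pass through the 4-block path */
	unsigned int tail;	/* 0..3 blocks left for the tail path */
};

static struct sm4_split sm4_split_blocks(unsigned int nblocks)
{
	struct sm4_split s;

	/* .L*_loop_8x: keep taking 8 blocks while at least 8 remain */
	s.groups8 = nblocks / 8;
	nblocks %= 8;

	/* .L*_4x: a single 4-block pass if at least 4 remain */
	s.groups4 = nblocks / 4;

	/* .L*_tail: whatever is left, between 0 and 3 blocks */
	s.tail = nblocks % 4;

	return s;
}
```

For example, 13 blocks are handled as one 8-block iteration, one
4-block pass and a 1-block tail.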
arch/arm64/crypto/sm4-neon-core.S | 630 +++++++++++++++++++-----------
arch/arm64/crypto/sm4-neon-glue.c | 172 +++-----
2 files changed, 456 insertions(+), 346 deletions(-)
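Note (illustration only, not part of the patch): the saving from ld4
comes from its de-interleaving behaviour. Loading four blocks with
ld4 {v0.4s-v3.4s} leaves the registers in the same layout as ld1
followed by a 4x4 transpose of 32-bit words. The C sketch below models
registers as arrays of four 32-bit lanes and ignores byte order; all
function names in it are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Model of ld1 {v0.16b-v3.16b}: register i receives block i unchanged. */
static void model_ld1(uint32_t v[4][4], const uint32_t w[16])
{
	for (int i = 0; i < 4; i++)
		for (int j = 0; j < 4; j++)
			v[i][j] = w[4 * i + j];
}

/* Model of the transpose_4x4 macro: swap lane j of reg i with lane i of reg j. */
static void model_transpose_4x4(uint32_t v[4][4])
{
	for (int i = 0; i < 4; i++) {
		for (int j = i + 1; j < 4; j++) {
			uint32_t t = v[i][j];

			v[i][j] = v[j][i];
			v[j][i] = t;
		}
	}
}

/* Model of ld4 {v0.4s-v3.4s}: word k goes to register k % 4, lane k / 4. */
static void model_ld4(uint32_t v[4][4], const uint32_t w[16])
{
	for (int k = 0; k < 16; k++)
		v[k % 4][k / 4] = w[k];
}

/* Returns 1 when both load strategies produce identical register contents. */
static int ld4_matches_ld1_transpose(const uint32_t w[16])
{
	uint32_t a[4][4], b[4][4];

	model_ld1(a, w);
	model_transpose_4x4(a);
	model_ld4(b, w);

	return memcmp(a, b, sizeof(a)) == 0;
}
```

In this model, each register holds the same word position of four
different blocks once the load completes, which is the layout the round
macros operate on.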
diff --git a/arch/arm64/crypto/sm4-neon-core.S b/arch/arm64/crypto/sm4-neon-core.S
index 3d5256b354d2..f295b4b7d70a 100644
--- a/arch/arm64/crypto/sm4-neon-core.S
+++ b/arch/arm64/crypto/sm4-neon-core.S
@@ -18,6 +18,11 @@
#define RTMP2 v10
#define RTMP3 v11
+#define RTMP4 v12
+#define RTMP5 v13
+#define RTMP6 v14
+#define RTMP7 v15
+
#define RX0 v12
#define RX1 v13
#define RKEY v14
@@ -25,7 +30,7 @@
/* Helper macros. */
-#define PREPARE \
+#define SM4_PREPARE() \
adr_l x5, crypto_sm4_sbox; \
ld1 {v16.16b-v19.16b}, [x5], #64; \
ld1 {v20.16b-v23.16b}, [x5], #64; \
@@ -42,7 +47,25 @@
zip1 s2.2d, RTMP2.2d, RTMP3.2d; \
zip2 s3.2d, RTMP2.2d, RTMP3.2d;
-#define rotate_clockwise_90(s0, s1, s2, s3) \
+#define transpose_4x4_2x(s0, s1, s2, s3, s4, s5, s6, s7) \
+ zip1 RTMP0.4s, s0.4s, s1.4s; \
+ zip1 RTMP1.4s, s2.4s, s3.4s; \
+ zip2 RTMP2.4s, s0.4s, s1.4s; \
+ zip2 RTMP3.4s, s2.4s, s3.4s; \
+ zip1 RTMP4.4s, s4.4s, s5.4s; \
+ zip1 RTMP5.4s, s6.4s, s7.4s; \
+ zip2 RTMP6.4s, s4.4s, s5.4s; \
+ zip2 RTMP7.4s, s6.4s, s7.4s; \
+ zip1 s0.2d, RTMP0.2d, RTMP1.2d; \
+ zip2 s1.2d, RTMP0.2d, RTMP1.2d; \
+ zip1 s2.2d, RTMP2.2d, RTMP3.2d; \
+ zip2 s3.2d, RTMP2.2d, RTMP3.2d; \
+ zip1 s4.2d, RTMP4.2d, RTMP5.2d; \
+ zip2 s5.2d, RTMP4.2d, RTMP5.2d; \
+ zip1 s6.2d, RTMP6.2d, RTMP7.2d; \
+ zip2 s7.2d, RTMP6.2d, RTMP7.2d;
+
+#define rotate_clockwise_4x4(s0, s1, s2, s3) \
zip1 RTMP0.4s, s1.4s, s0.4s; \
zip2 RTMP1.4s, s1.4s, s0.4s; \
zip1 RTMP2.4s, s3.4s, s2.4s; \
@@ -52,6 +75,24 @@
zip1 s2.2d, RTMP3.2d, RTMP1.2d; \
zip2 s3.2d, RTMP3.2d, RTMP1.2d;
+#define rotate_clockwise_4x4_2x(s0, s1, s2, s3, s4, s5, s6, s7) \
+ zip1 RTMP0.4s, s1.4s, s0.4s; \
+ zip1 RTMP2.4s, s3.4s, s2.4s; \
+ zip2 RTMP1.4s, s1.4s, s0.4s; \
+ zip2 RTMP3.4s, s3.4s, s2.4s; \
+ zip1 RTMP4.4s, s5.4s, s4.4s; \
+ zip1 RTMP6.4s, s7.4s, s6.4s; \
+ zip2 RTMP5.4s, s5.4s, s4.4s; \
+ zip2 RTMP7.4s, s7.4s, s6.4s; \
+ zip1 s0.2d, RTMP2.2d, RTMP0.2d; \
+ zip2 s1.2d, RTMP2.2d, RTMP0.2d; \
+ zip1 s2.2d, RTMP3.2d, RTMP1.2d; \
+ zip2 s3.2d, RTMP3.2d, RTMP1.2d; \
+ zip1 s4.2d, RTMP6.2d, RTMP4.2d; \
+ zip2 s5.2d, RTMP6.2d, RTMP4.2d; \
+ zip1 s6.2d, RTMP7.2d, RTMP5.2d; \
+ zip2 s7.2d, RTMP7.2d, RTMP5.2d;
+
#define ROUND4(round, s0, s1, s2, s3) \
dup RX0.4s, RKEY.s[round]; \
/* rk ^ s1 ^ s2 ^ s3 */ \
@@ -87,14 +128,7 @@
/* s0 ^= RTMP3 */ \
eor s0.16b, s0.16b, RTMP3.16b;
-#define SM4_CRYPT_BLK4(b0, b1, b2, b3) \
- rev32 b0.16b, b0.16b; \
- rev32 b1.16b, b1.16b; \
- rev32 b2.16b, b2.16b; \
- rev32 b3.16b, b3.16b; \
- \
- transpose_4x4(b0, b1, b2, b3); \
- \
+#define SM4_CRYPT_BLK4_BE(b0, b1, b2, b3) \
mov x6, 8; \
4: \
ld1 {RKEY.4s}, [x0], #16; \
@@ -107,15 +141,23 @@
\
bne 4b; \
\
- rotate_clockwise_90(b0, b1, b2, b3); \
rev32 b0.16b, b0.16b; \
rev32 b1.16b, b1.16b; \
rev32 b2.16b, b2.16b; \
rev32 b3.16b, b3.16b; \
\
+ rotate_clockwise_4x4(b0, b1, b2, b3); \
+ \
/* repoint to rkey */ \
sub x0, x0, #128;
+#define SM4_CRYPT_BLK4(b0, b1, b2, b3) \
+ rev32 b0.16b, b0.16b; \
+ rev32 b1.16b, b1.16b; \
+ rev32 b2.16b, b2.16b; \
+ rev32 b3.16b, b3.16b; \
+ SM4_CRYPT_BLK4_BE(b0, b1, b2, b3);
+
#define ROUND8(round, s0, s1, s2, s3, t0, t1, t2, t3) \
/* rk ^ s1 ^ s2 ^ s3 */ \
dup RX0.4s, RKEY.s[round]; \
@@ -175,7 +217,7 @@
eor s0.16b, s0.16b, RTMP0.16b; \
eor t0.16b, t0.16b, RTMP1.16b;
-#define SM4_CRYPT_BLK8(b0, b1, b2, b3, b4, b5, b6, b7) \
+#define SM4_CRYPT_BLK8_norotate(b0, b1, b2, b3, b4, b5, b6, b7) \
rev32 b0.16b, b0.16b; \
rev32 b1.16b, b1.16b; \
rev32 b2.16b, b2.16b; \
@@ -185,9 +227,6 @@
rev32 b6.16b, b6.16b; \
rev32 b7.16b, b7.16b; \
\
- transpose_4x4(b0, b1, b2, b3); \
- transpose_4x4(b4, b5, b6, b7); \
- \
mov x6, 8; \
8: \
ld1 {RKEY.4s}, [x0], #16; \
@@ -200,8 +239,6 @@
\
bne 8b; \
\
- rotate_clockwise_90(b0, b1, b2, b3); \
- rotate_clockwise_90(b4, b5, b6, b7); \
rev32 b0.16b, b0.16b; \
rev32 b1.16b, b1.16b; \
rev32 b2.16b, b2.16b; \
@@ -214,274 +251,429 @@
/* repoint to rkey */ \
sub x0, x0, #128;
+#define SM4_CRYPT_BLK8(b0, b1, b2, b3, b4, b5, b6, b7) \
+ SM4_CRYPT_BLK8_norotate(b0, b1, b2, b3, b4, b5, b6, b7); \
+ rotate_clockwise_4x4_2x(b0, b1, b2, b3, b4, b5, b6, b7); \
-.align 3
-SYM_FUNC_START_LOCAL(__sm4_neon_crypt_blk1_4)
- /* input:
- * x0: round key array, CTX
- * x1: dst
- * x2: src
- * w3: num blocks (1..4)
- */
- PREPARE;
-
- ld1 {v0.16b}, [x2], #16;
- mov v1.16b, v0.16b;
- mov v2.16b, v0.16b;
- mov v3.16b, v0.16b;
- cmp w3, #2;
- blt .Lblk4_load_input_done;
- ld1 {v1.16b}, [x2], #16;
- beq .Lblk4_load_input_done;
- ld1 {v2.16b}, [x2], #16;
- cmp w3, #3;
- beq .Lblk4_load_input_done;
- ld1 {v3.16b}, [x2];
-
-.Lblk4_load_input_done:
- SM4_CRYPT_BLK4(v0, v1, v2, v3);
-
- st1 {v0.16b}, [x1], #16;
- cmp w3, #2;
- blt .Lblk4_store_output_done;
- st1 {v1.16b}, [x1], #16;
- beq .Lblk4_store_output_done;
- st1 {v2.16b}, [x1], #16;
- cmp w3, #3;
- beq .Lblk4_store_output_done;
- st1 {v3.16b}, [x1];
-
-.Lblk4_store_output_done:
- ret;
-SYM_FUNC_END(__sm4_neon_crypt_blk1_4)
.align 3
-SYM_FUNC_START(sm4_neon_crypt_blk1_8)
+SYM_FUNC_START(sm4_neon_crypt)
/* input:
* x0: round key array, CTX
* x1: dst
* x2: src
- * w3: num blocks (1..8)
+ * w3: nblocks
*/
- cmp w3, #5;
- blt __sm4_neon_crypt_blk1_4;
-
- PREPARE;
-
- ld1 {v0.16b-v3.16b}, [x2], #64;
- ld1 {v4.16b}, [x2], #16;
- mov v5.16b, v4.16b;
- mov v6.16b, v4.16b;
- mov v7.16b, v4.16b;
- beq .Lblk8_load_input_done;
- ld1 {v5.16b}, [x2], #16;
- cmp w3, #7;
- blt .Lblk8_load_input_done;
- ld1 {v6.16b}, [x2], #16;
- beq .Lblk8_load_input_done;
- ld1 {v7.16b}, [x2];
-
-.Lblk8_load_input_done:
- SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7);
-
- cmp w3, #6;
- st1 {v0.16b-v3.16b}, [x1], #64;
- st1 {v4.16b}, [x1], #16;
- blt .Lblk8_store_output_done;
- st1 {v5.16b}, [x1], #16;
- beq .Lblk8_store_output_done;
- st1 {v6.16b}, [x1], #16;
- cmp w3, #7;
- beq .Lblk8_store_output_done;
- st1 {v7.16b}, [x1];
-
-.Lblk8_store_output_done:
- ret;
-SYM_FUNC_END(sm4_neon_crypt_blk1_8)
+ SM4_PREPARE()
-.align 3
-SYM_FUNC_START(sm4_neon_crypt_blk8)
- /* input:
- * x0: round key array, CTX
- * x1: dst
- * x2: src
- * w3: nblocks (multiples of 8)
- */
- PREPARE;
+.Lcrypt_loop_8x:
+ sub w3, w3, #8
+ tbnz w3, #31, .Lcrypt_4x
+
+ ld4 {v0.4s-v3.4s}, [x2], #64
+ ld4 {v4.4s-v7.4s}, [x2], #64
-.Lcrypt_loop_blk:
- subs w3, w3, #8;
- bmi .Lcrypt_end;
+ SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7)
- ld1 {v0.16b-v3.16b}, [x2], #64;
- ld1 {v4.16b-v7.16b}, [x2], #64;
+ st1 {v0.16b-v3.16b}, [x1], #64
+ st1 {v4.16b-v7.16b}, [x1], #64
- SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7);
+ cbz w3, .Lcrypt_end
+ b .Lcrypt_loop_8x
- st1 {v0.16b-v3.16b}, [x1], #64;
- st1 {v4.16b-v7.16b}, [x1], #64;
+.Lcrypt_4x:
+ add w3, w3, #8
+ cmp w3, #4
+ blt .Lcrypt_tail
- b .Lcrypt_loop_blk;
+ sub w3, w3, #4
+
+ ld4 {v0.4s-v3.4s}, [x2], #64
+
+ SM4_CRYPT_BLK4(v0, v1, v2, v3)
+
+ st1 {v0.16b-v3.16b}, [x1], #64
+
+ cbz w3, .Lcrypt_end
+
+.Lcrypt_tail:
+ cmp w3, #2
+ ld1 {v0.16b}, [x2], #16
+ blt .Lcrypt_tail_load_done
+ ld1 {v1.16b}, [x2], #16
+ beq .Lcrypt_tail_load_done
+ ld1 {v2.16b}, [x2], #16
+
+.Lcrypt_tail_load_done:
+ transpose_4x4(v0, v1, v2, v3)
+
+ SM4_CRYPT_BLK4(v0, v1, v2, v3)
+
+ cmp w3, #2
+ st1 {v0.16b}, [x1], #16
+ blt .Lcrypt_end
+ st1 {v1.16b}, [x1], #16
+ beq .Lcrypt_end
+ st1 {v2.16b}, [x1], #16
.Lcrypt_end:
- ret;
-SYM_FUNC_END(sm4_neon_crypt_blk8)
+ ret
+SYM_FUNC_END(sm4_neon_crypt)
.align 3
-SYM_FUNC_START(sm4_neon_cbc_dec_blk8)
+SYM_FUNC_START(sm4_neon_cbc_dec)
/* input:
* x0: round key array, CTX
* x1: dst
* x2: src
* x3: iv (big endian, 128 bit)
- * w4: nblocks (multiples of 8)
+ * w4: nblocks
*/
- PREPARE;
+ SM4_PREPARE()
+
+ ld1 {RIV.16b}, [x3]
+
+.Lcbc_dec_loop_8x:
+ sub w4, w4, #8
+ tbnz w4, #31, .Lcbc_dec_4x
+
+ ld4 {v0.4s-v3.4s}, [x2], #64
+ ld4 {v4.4s-v7.4s}, [x2]
+
+ SM4_CRYPT_BLK8_norotate(v0, v1, v2, v3, v4, v5, v6, v7)
+
+ /* Avoid overwriting the RIV register */
+ rotate_clockwise_4x4(v0, v1, v2, v3)
+ rotate_clockwise_4x4(v4, v5, v6, v7)
+
+ sub x2, x2, #64
+
+ eor v0.16b, v0.16b, RIV.16b
- ld1 {RIV.16b}, [x3];
+ ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64
+ ld1 {RTMP4.16b-RTMP7.16b}, [x2], #64
-.Lcbc_loop_blk:
- subs w4, w4, #8;
- bmi .Lcbc_end;
+ eor v1.16b, v1.16b, RTMP0.16b
+ eor v2.16b, v2.16b, RTMP1.16b
+ eor v3.16b, v3.16b, RTMP2.16b
+ eor v4.16b, v4.16b, RTMP3.16b
+ eor v5.16b, v5.16b, RTMP4.16b
+ eor v6.16b, v6.16b, RTMP5.16b
+ eor v7.16b, v7.16b, RTMP6.16b
- ld1 {v0.16b-v3.16b}, [x2], #64;
- ld1 {v4.16b-v7.16b}, [x2];
+ mov RIV.16b, RTMP7.16b
- SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7);
+ st1 {v0.16b-v3.16b}, [x1], #64
+ st1 {v4.16b-v7.16b}, [x1], #64
- sub x2, x2, #64;
- eor v0.16b, v0.16b, RIV.16b;
- ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
- eor v1.16b, v1.16b, RTMP0.16b;
- eor v2.16b, v2.16b, RTMP1.16b;
- eor v3.16b, v3.16b, RTMP2.16b;
- st1 {v0.16b-v3.16b}, [x1], #64;
+ cbz w4, .Lcbc_dec_end
+ b .Lcbc_dec_loop_8x
- eor v4.16b, v4.16b, RTMP3.16b;
- ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
- eor v5.16b, v5.16b, RTMP0.16b;
- eor v6.16b, v6.16b, RTMP1.16b;
- eor v7.16b, v7.16b, RTMP2.16b;
+.Lcbc_dec_4x:
+ add w4, w4, #8
+ cmp w4, #4
+ blt .Lcbc_dec_tail
- mov RIV.16b, RTMP3.16b;
- st1 {v4.16b-v7.16b}, [x1], #64;
+ sub w4, w4, #4
- b .Lcbc_loop_blk;
+ ld1 {v0.16b-v3.16b}, [x2], #64
-.Lcbc_end:
+ rev32 v4.16b, v0.16b
+ rev32 v5.16b, v1.16b
+ rev32 v6.16b, v2.16b
+ rev32 v7.16b, v3.16b
+
+ transpose_4x4(v4, v5, v6, v7)
+
+ SM4_CRYPT_BLK4_BE(v4, v5, v6, v7)
+
+ eor v4.16b, v4.16b, RIV.16b
+ eor v5.16b, v5.16b, v0.16b
+ eor v6.16b, v6.16b, v1.16b
+ eor v7.16b, v7.16b, v2.16b
+
+ mov RIV.16b, v3.16b
+
+ st1 {v4.16b-v7.16b}, [x1], #64
+
+ cbz w4, .Lcbc_dec_end
+
+.Lcbc_dec_tail:
+ cmp w4, #2
+ ld1 {v0.16b}, [x2], #16
+ blt .Lcbc_dec_tail_load_done
+ ld1 {v1.16b}, [x2], #16
+ beq .Lcbc_dec_tail_load_done
+ ld1 {v2.16b}, [x2], #16
+
+.Lcbc_dec_tail_load_done:
+ rev32 v4.16b, v0.16b
+ rev32 v5.16b, v1.16b
+ rev32 v6.16b, v2.16b
+
+ transpose_4x4(v4, v5, v6, v7)
+
+ SM4_CRYPT_BLK4_BE(v4, v5, v6, v7)
+
+ cmp w4, #2
+ eor v4.16b, v4.16b, RIV.16b
+ mov RIV.16b, v0.16b
+ st1 {v4.16b}, [x1], #16
+ blt .Lcbc_dec_end
+
+ eor v5.16b, v5.16b, v0.16b
+ mov RIV.16b, v1.16b
+ st1 {v5.16b}, [x1], #16
+ beq .Lcbc_dec_end
+
+ eor v6.16b, v6.16b, v1.16b
+ mov RIV.16b, v2.16b
+ st1 {v6.16b}, [x1], #16
+
+.Lcbc_dec_end:
/* store new IV */
- st1 {RIV.16b}, [x3];
+ st1 {RIV.16b}, [x3]
- ret;
-SYM_FUNC_END(sm4_neon_cbc_dec_blk8)
+ ret
+SYM_FUNC_END(sm4_neon_cbc_dec)
.align 3
-SYM_FUNC_START(sm4_neon_cfb_dec_blk8)
+SYM_FUNC_START(sm4_neon_cfb_dec)
/* input:
* x0: round key array, CTX
* x1: dst
* x2: src
* x3: iv (big endian, 128 bit)
- * w4: nblocks (multiples of 8)
+ * w4: nblocks
*/
- PREPARE;
+ SM4_PREPARE()
+
+ ld1 {v0.16b}, [x3]
+
+.Lcfb_dec_loop_8x:
+ sub w4, w4, #8
+ tbnz w4, #31, .Lcfb_dec_4x
+
+ ld1 {v1.16b-v3.16b}, [x2], #48
+ ld4 {v4.4s-v7.4s}, [x2]
+
+ transpose_4x4(v0, v1, v2, v3)
+
+ SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7)
+
+ sub x2, x2, #48
+ ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64
+ ld1 {RTMP4.16b-RTMP7.16b}, [x2], #64
+
+ eor v0.16b, v0.16b, RTMP0.16b
+ eor v1.16b, v1.16b, RTMP1.16b
+ eor v2.16b, v2.16b, RTMP2.16b
+ eor v3.16b, v3.16b, RTMP3.16b
+ eor v4.16b, v4.16b, RTMP4.16b
+ eor v5.16b, v5.16b, RTMP5.16b
+ eor v6.16b, v6.16b, RTMP6.16b
+ eor v7.16b, v7.16b, RTMP7.16b
+
+ st1 {v0.16b-v3.16b}, [x1], #64
+ st1 {v4.16b-v7.16b}, [x1], #64
+
+ mov v0.16b, RTMP7.16b
+
+ cbz w4, .Lcfb_dec_end
+ b .Lcfb_dec_loop_8x
+
+.Lcfb_dec_4x:
+ add w4, w4, #8
+ cmp w4, #4
+ blt .Lcfb_dec_tail
+
+ sub w4, w4, #4
+
+ ld1 {v4.16b-v7.16b}, [x2], #64
+
+ rev32 v0.16b, v0.16b /* v0 is IV register */
+ rev32 v1.16b, v4.16b
+ rev32 v2.16b, v5.16b
+ rev32 v3.16b, v6.16b
+
+ transpose_4x4(v0, v1, v2, v3)
+
+ SM4_CRYPT_BLK4_BE(v0, v1, v2, v3)
- ld1 {v0.16b}, [x3];
+ eor v0.16b, v0.16b, v4.16b
+ eor v1.16b, v1.16b, v5.16b
+ eor v2.16b, v2.16b, v6.16b
+ eor v3.16b, v3.16b, v7.16b
-.Lcfb_loop_blk:
- subs w4, w4, #8;
- bmi .Lcfb_end;
+ st1 {v0.16b-v3.16b}, [x1], #64
- ld1 {v1.16b, v2.16b, v3.16b}, [x2], #48;
- ld1 {v4.16b-v7.16b}, [x2];
+ mov v0.16b, v7.16b
- SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7);
+ cbz w4, .Lcfb_dec_end
- sub x2, x2, #48;
- ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
- eor v0.16b, v0.16b, RTMP0.16b;
- eor v1.16b, v1.16b, RTMP1.16b;
- eor v2.16b, v2.16b, RTMP2.16b;
- eor v3.16b, v3.16b, RTMP3.16b;
- st1 {v0.16b-v3.16b}, [x1], #64;
+.Lcfb_dec_tail:
+ cmp w4, #2
+ ld1 {v4.16b}, [x2], #16
+ blt .Lcfb_dec_tail_load_done
+ ld1 {v5.16b}, [x2], #16
+ beq .Lcfb_dec_tail_load_done
+ ld1 {v6.16b}, [x2], #16
- ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
- eor v4.16b, v4.16b, RTMP0.16b;
- eor v5.16b, v5.16b, RTMP1.16b;
- eor v6.16b, v6.16b, RTMP2.16b;
- eor v7.16b, v7.16b, RTMP3.16b;
- st1 {v4.16b-v7.16b}, [x1], #64;
+.Lcfb_dec_tail_load_done:
+ rev32 v0.16b, v0.16b /* v0 is IV register */
+ rev32 v1.16b, v4.16b
+ rev32 v2.16b, v5.16b
- mov v0.16b, RTMP3.16b;
+ transpose_4x4(v0, v1, v2, v3)
- b .Lcfb_loop_blk;
+ SM4_CRYPT_BLK4_BE(v0, v1, v2, v3)
-.Lcfb_end:
+ cmp w4, #2
+ eor v0.16b, v0.16b, v4.16b
+ st1 {v0.16b}, [x1], #16
+ mov v0.16b, v4.16b
+ blt .Lcfb_dec_end
+
+ eor v1.16b, v1.16b, v5.16b
+ st1 {v1.16b}, [x1], #16
+ mov v0.16b, v5.16b
+ beq .Lcfb_dec_end
+
+ eor v2.16b, v2.16b, v6.16b
+ st1 {v2.16b}, [x1], #16
+ mov v0.16b, v6.16b
+
+.Lcfb_dec_end:
/* store new IV */
- st1 {v0.16b}, [x3];
+ st1 {v0.16b}, [x3]
- ret;
-SYM_FUNC_END(sm4_neon_cfb_dec_blk8)
+ ret
+SYM_FUNC_END(sm4_neon_cfb_dec)
.align 3
-SYM_FUNC_START(sm4_neon_ctr_enc_blk8)
+SYM_FUNC_START(sm4_neon_ctr_crypt)
/* input:
* x0: round key array, CTX
* x1: dst
* x2: src
* x3: ctr (big endian, 128 bit)
- * w4: nblocks (multiples of 8)
+ * w4: nblocks
*/
- PREPARE;
+ SM4_PREPARE()
- ldp x7, x8, [x3];
- rev x7, x7;
- rev x8, x8;
+ ldp x7, x8, [x3]
+ rev x7, x7
+ rev x8, x8
-.Lctr_loop_blk:
- subs w4, w4, #8;
- bmi .Lctr_end;
+.Lctr_crypt_loop_8x:
+ sub w4, w4, #8
+ tbnz w4, #31, .Lctr_crypt_4x
-#define inc_le128(vctr) \
- mov vctr.d[1], x8; \
- mov vctr.d[0], x7; \
- adds x8, x8, #1; \
- adc x7, x7, xzr; \
- rev64 vctr.16b, vctr.16b;
+#define inc_le128(vctr) \
+ mov vctr.d[1], x8; \
+ mov vctr.d[0], x7; \
+ adds x8, x8, #1; \
+ rev64 vctr.16b, vctr.16b; \
+ adc x7, x7, xzr;
/* construct CTRs */
- inc_le128(v0); /* +0 */
- inc_le128(v1); /* +1 */
- inc_le128(v2); /* +2 */
- inc_le128(v3); /* +3 */
- inc_le128(v4); /* +4 */
- inc_le128(v5); /* +5 */
- inc_le128(v6); /* +6 */
- inc_le128(v7); /* +7 */
-
- SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7);
-
- ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
- eor v0.16b, v0.16b, RTMP0.16b;
- eor v1.16b, v1.16b, RTMP1.16b;
- eor v2.16b, v2.16b, RTMP2.16b;
- eor v3.16b, v3.16b, RTMP3.16b;
- st1 {v0.16b-v3.16b}, [x1], #64;
-
- ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
- eor v4.16b, v4.16b, RTMP0.16b;
- eor v5.16b, v5.16b, RTMP1.16b;
- eor v6.16b, v6.16b, RTMP2.16b;
- eor v7.16b, v7.16b, RTMP3.16b;
- st1 {v4.16b-v7.16b}, [x1], #64;
-
- b .Lctr_loop_blk;
-
-.Lctr_end:
+ inc_le128(v0) /* +0 */
+ inc_le128(v1) /* +1 */
+ inc_le128(v2) /* +2 */
+ inc_le128(v3) /* +3 */
+ inc_le128(v4) /* +4 */
+ inc_le128(v5) /* +5 */
+ inc_le128(v6) /* +6 */
+ inc_le128(v7) /* +7 */
+
+ transpose_4x4_2x(v0, v1, v2, v3, v4, v5, v6, v7)
+
+ SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7)
+
+ ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64
+ ld1 {RTMP4.16b-RTMP7.16b}, [x2], #64
+
+ eor v0.16b, v0.16b, RTMP0.16b
+ eor v1.16b, v1.16b, RTMP1.16b
+ eor v2.16b, v2.16b, RTMP2.16b
+ eor v3.16b, v3.16b, RTMP3.16b
+ eor v4.16b, v4.16b, RTMP4.16b
+ eor v5.16b, v5.16b, RTMP5.16b
+ eor v6.16b, v6.16b, RTMP6.16b
+ eor v7.16b, v7.16b, RTMP7.16b
+
+ st1 {v0.16b-v3.16b}, [x1], #64
+ st1 {v4.16b-v7.16b}, [x1], #64
+
+ cbz w4, .Lctr_crypt_end
+ b .Lctr_crypt_loop_8x
+
+.Lctr_crypt_4x:
+ add w4, w4, #8
+ cmp w4, #4
+ blt .Lctr_crypt_tail
+
+ sub w4, w4, #4
+
+ /* construct CTRs */
+ inc_le128(v0) /* +0 */
+ inc_le128(v1) /* +1 */
+ inc_le128(v2) /* +2 */
+ inc_le128(v3) /* +3 */
+
+ ld1 {v4.16b-v7.16b}, [x2], #64
+
+ transpose_4x4(v0, v1, v2, v3)
+
+ SM4_CRYPT_BLK4(v0, v1, v2, v3)
+
+ eor v0.16b, v0.16b, v4.16b
+ eor v1.16b, v1.16b, v5.16b
+ eor v2.16b, v2.16b, v6.16b
+ eor v3.16b, v3.16b, v7.16b
+
+ st1 {v0.16b-v3.16b}, [x1], #64
+
+ cbz w4, .Lctr_crypt_end
+
+.Lctr_crypt_tail:
+ /* inc_le128 will change the sign bit */
+ ld1 {v4.16b}, [x2], #16
+ inc_le128(v0)
+ cmp w4, #2
+ blt .Lctr_crypt_tail_load_done
+
+ ld1 {v5.16b}, [x2], #16
+ inc_le128(v1)
+ cmp w4, #2
+ beq .Lctr_crypt_tail_load_done
+
+ ld1 {v6.16b}, [x2], #16
+ inc_le128(v2)
+
+.Lctr_crypt_tail_load_done:
+ transpose_4x4(v0, v1, v2, v3)
+
+ SM4_CRYPT_BLK4(v0, v1, v2, v3)
+
+ cmp w4, #2
+
+ eor v0.16b, v0.16b, v4.16b
+ st1 {v0.16b}, [x1], #16
+ blt .Lctr_crypt_end
+
+ eor v1.16b, v1.16b, v5.16b
+ st1 {v1.16b}, [x1], #16
+ beq .Lctr_crypt_end
+
+ eor v2.16b, v2.16b, v6.16b
+ st1 {v2.16b}, [x1], #16
+
+.Lctr_crypt_end:
/* store new CTR */
- rev x7, x7;
- rev x8, x8;
- stp x7, x8, [x3];
+ rev x7, x7
+ rev x8, x8
+ stp x7, x8, [x3]
- ret;
-SYM_FUNC_END(sm4_neon_ctr_enc_blk8)
+ ret
+SYM_FUNC_END(sm4_neon_ctr_crypt)
diff --git a/arch/arm64/crypto/sm4-neon-glue.c b/arch/arm64/crypto/sm4-neon-glue.c
index 03a6a6866a31..7b19accf5c03 100644
--- a/arch/arm64/crypto/sm4-neon-glue.c
+++ b/arch/arm64/crypto/sm4-neon-glue.c
@@ -18,19 +18,14 @@
#include <crypto/internal/skcipher.h>
#include <crypto/sm4.h>
-#define BYTES2BLKS(nbytes) ((nbytes) >> 4)
-#define BYTES2BLK8(nbytes) (((nbytes) >> 4) & ~(8 - 1))
-
-asmlinkage void sm4_neon_crypt_blk1_8(const u32 *rkey, u8 *dst, const u8 *src,
- unsigned int nblks);
-asmlinkage void sm4_neon_crypt_blk8(const u32 *rkey, u8 *dst, const u8 *src,
- unsigned int nblks);
-asmlinkage void sm4_neon_cbc_dec_blk8(const u32 *rkey, u8 *dst, const u8 *src,
- u8 *iv, unsigned int nblks);
-asmlinkage void sm4_neon_cfb_dec_blk8(const u32 *rkey, u8 *dst, const u8 *src,
- u8 *iv, unsigned int nblks);
-asmlinkage void sm4_neon_ctr_enc_blk8(const u32 *rkey, u8 *dst, const u8 *src,
- u8 *iv, unsigned int nblks);
+asmlinkage void sm4_neon_crypt(const u32 *rkey, u8 *dst, const u8 *src,
+ unsigned int nblocks);
+asmlinkage void sm4_neon_cbc_dec(const u32 *rkey_dec, u8 *dst, const u8 *src,
+ u8 *iv, unsigned int nblocks);
+asmlinkage void sm4_neon_cfb_dec(const u32 *rkey_enc, u8 *dst, const u8 *src,
+ u8 *iv, unsigned int nblocks);
+asmlinkage void sm4_neon_ctr_crypt(const u32 *rkey_enc, u8 *dst, const u8 *src,
+ u8 *iv, unsigned int nblocks);
static int sm4_setkey(struct crypto_skcipher *tfm, const u8 *key,
unsigned int key_len)
@@ -51,27 +46,18 @@ static int sm4_ecb_do_crypt(struct skcipher_request *req, const u32 *rkey)
while ((nbytes = walk.nbytes) > 0) {
const u8 *src = walk.src.virt.addr;
u8 *dst = walk.dst.virt.addr;
- unsigned int nblks;
+ unsigned int nblocks;
- kernel_neon_begin();
+ nblocks = nbytes / SM4_BLOCK_SIZE;
+ if (nblocks) {
+ kernel_neon_begin();
- nblks = BYTES2BLK8(nbytes);
- if (nblks) {
- sm4_neon_crypt_blk8(rkey, dst, src, nblks);
- dst += nblks * SM4_BLOCK_SIZE;
- src += nblks * SM4_BLOCK_SIZE;
- nbytes -= nblks * SM4_BLOCK_SIZE;
- }
+ sm4_neon_crypt(rkey, dst, src, nblocks);
- nblks = BYTES2BLKS(nbytes);
- if (nblks) {
- sm4_neon_crypt_blk1_8(rkey, dst, src, nblks);
- nbytes -= nblks * SM4_BLOCK_SIZE;
+ kernel_neon_end();
}
- kernel_neon_end();
-
- err = skcipher_walk_done(&walk, nbytes);
+ err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE);
}
return err;
@@ -138,48 +124,19 @@ static int sm4_cbc_decrypt(struct skcipher_request *req)
while ((nbytes = walk.nbytes) > 0) {
const u8 *src = walk.src.virt.addr;
u8 *dst = walk.dst.virt.addr;
- unsigned int nblks;
+ unsigned int nblocks;
- kernel_neon_begin();
+ nblocks = nbytes / SM4_BLOCK_SIZE;
+ if (nblocks) {
+ kernel_neon_begin();
- nblks = BYTES2BLK8(nbytes);
- if (nblks) {
- sm4_neon_cbc_dec_blk8(ctx->rkey_dec, dst, src,
- walk.iv, nblks);
- dst += nblks * SM4_BLOCK_SIZE;
- src += nblks * SM4_BLOCK_SIZE;
- nbytes -= nblks * SM4_BLOCK_SIZE;
- }
+ sm4_neon_cbc_dec(ctx->rkey_dec, dst, src,
+ walk.iv, nblocks);
- nblks = BYTES2BLKS(nbytes);
- if (nblks) {
- u8 keystream[SM4_BLOCK_SIZE * 8];
- u8 iv[SM4_BLOCK_SIZE];
- int i;
-
- sm4_neon_crypt_blk1_8(ctx->rkey_dec, keystream,
- src, nblks);
-
- src += ((int)nblks - 2) * SM4_BLOCK_SIZE;
- dst += (nblks - 1) * SM4_BLOCK_SIZE;
- memcpy(iv, src + SM4_BLOCK_SIZE, SM4_BLOCK_SIZE);
-
- for (i = nblks - 1; i > 0; i--) {
- crypto_xor_cpy(dst, src,
- &keystream[i * SM4_BLOCK_SIZE],
- SM4_BLOCK_SIZE);
- src -= SM4_BLOCK_SIZE;
- dst -= SM4_BLOCK_SIZE;
- }
- crypto_xor_cpy(dst, walk.iv,
- keystream, SM4_BLOCK_SIZE);
- memcpy(walk.iv, iv, SM4_BLOCK_SIZE);
- nbytes -= nblks * SM4_BLOCK_SIZE;
+ kernel_neon_end();
}
- kernel_neon_end();
-
- err = skcipher_walk_done(&walk, nbytes);
+ err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE);
}
return err;
@@ -238,41 +195,21 @@ static int sm4_cfb_decrypt(struct skcipher_request *req)
while ((nbytes = walk.nbytes) > 0) {
const u8 *src = walk.src.virt.addr;
u8 *dst = walk.dst.virt.addr;
- unsigned int nblks;
+ unsigned int nblocks;
- kernel_neon_begin();
+ nblocks = nbytes / SM4_BLOCK_SIZE;
+ if (nblocks) {
+ kernel_neon_begin();
- nblks = BYTES2BLK8(nbytes);
- if (nblks) {
- sm4_neon_cfb_dec_blk8(ctx->rkey_enc, dst, src,
- walk.iv, nblks);
- dst += nblks * SM4_BLOCK_SIZE;
- src += nblks * SM4_BLOCK_SIZE;
- nbytes -= nblks * SM4_BLOCK_SIZE;
- }
+ sm4_neon_cfb_dec(ctx->rkey_enc, dst, src,
+ walk.iv, nblocks);
- nblks = BYTES2BLKS(nbytes);
- if (nblks) {
- u8 keystream[SM4_BLOCK_SIZE * 8];
-
- memcpy(keystream, walk.iv, SM4_BLOCK_SIZE);
- if (nblks > 1)
- memcpy(&keystream[SM4_BLOCK_SIZE], src,
- (nblks - 1) * SM4_BLOCK_SIZE);
- memcpy(walk.iv, src + (nblks - 1) * SM4_BLOCK_SIZE,
- SM4_BLOCK_SIZE);
-
- sm4_neon_crypt_blk1_8(ctx->rkey_enc, keystream,
- keystream, nblks);
-
- crypto_xor_cpy(dst, src, keystream,
- nblks * SM4_BLOCK_SIZE);
- dst += nblks * SM4_BLOCK_SIZE;
- src += nblks * SM4_BLOCK_SIZE;
- nbytes -= nblks * SM4_BLOCK_SIZE;
- }
+ kernel_neon_end();
- kernel_neon_end();
+ dst += nblocks * SM4_BLOCK_SIZE;
+ src += nblocks * SM4_BLOCK_SIZE;
+ nbytes -= nblocks * SM4_BLOCK_SIZE;
+ }
/* tail */
if (walk.nbytes == walk.total && nbytes > 0) {
@@ -302,40 +239,21 @@ static int sm4_ctr_crypt(struct skcipher_request *req)
while ((nbytes = walk.nbytes) > 0) {
const u8 *src = walk.src.virt.addr;
u8 *dst = walk.dst.virt.addr;
- unsigned int nblks;
+ unsigned int nblocks;
- kernel_neon_begin();
+ nblocks = nbytes / SM4_BLOCK_SIZE;
+ if (nblocks) {
+ kernel_neon_begin();
- nblks = BYTES2BLK8(nbytes);
- if (nblks) {
- sm4_neon_ctr_enc_blk8(ctx->rkey_enc, dst, src,
- walk.iv, nblks);
- dst += nblks * SM4_BLOCK_SIZE;
- src += nblks * SM4_BLOCK_SIZE;
- nbytes -= nblks * SM4_BLOCK_SIZE;
- }
+ sm4_neon_ctr_crypt(ctx->rkey_enc, dst, src,
+ walk.iv, nblocks);
- nblks = BYTES2BLKS(nbytes);
- if (nblks) {
- u8 keystream[SM4_BLOCK_SIZE * 8];
- int i;
-
- for (i = 0; i < nblks; i++) {
- memcpy(&keystream[i * SM4_BLOCK_SIZE],
- walk.iv, SM4_BLOCK_SIZE);
- crypto_inc(walk.iv, SM4_BLOCK_SIZE);
- }
- sm4_neon_crypt_blk1_8(ctx->rkey_enc, keystream,
- keystream, nblks);
-
- crypto_xor_cpy(dst, src, keystream,
- nblks * SM4_BLOCK_SIZE);
- dst += nblks * SM4_BLOCK_SIZE;
- src += nblks * SM4_BLOCK_SIZE;
- nbytes -= nblks * SM4_BLOCK_SIZE;
- }
+ kernel_neon_end();
- kernel_neon_end();
+ dst += nblocks * SM4_BLOCK_SIZE;
+ src += nblocks * SM4_BLOCK_SIZE;
+ nbytes -= nblocks * SM4_BLOCK_SIZE;
+ }
/* tail */
if (walk.nbytes == walk.total && nbytes > 0) {
--
2.24.3 (Apple Git-128)
- sub x2, x2, #64;
- eor v0.16b, v0.16b, RIV.16b;
- ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
- eor v1.16b, v1.16b, RTMP0.16b;
- eor v2.16b, v2.16b, RTMP1.16b;
- eor v3.16b, v3.16b, RTMP2.16b;
- st1 {v0.16b-v3.16b}, [x1], #64;
+ cbz w4, .Lcbc_dec_end
+ b .Lcbc_dec_loop_8x
- eor v4.16b, v4.16b, RTMP3.16b;
- ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
- eor v5.16b, v5.16b, RTMP0.16b;
- eor v6.16b, v6.16b, RTMP1.16b;
- eor v7.16b, v7.16b, RTMP2.16b;
+.Lcbc_dec_4x:
+ add w4, w4, #8
+ cmp w4, #4
+ blt .Lcbc_dec_tail
- mov RIV.16b, RTMP3.16b;
- st1 {v4.16b-v7.16b}, [x1], #64;
+ sub w4, w4, #4
- b .Lcbc_loop_blk;
+ ld1 {v0.16b-v3.16b}, [x2], #64
-.Lcbc_end:
+ rev32 v4.16b, v0.16b
+ rev32 v5.16b, v1.16b
+ rev32 v6.16b, v2.16b
+ rev32 v7.16b, v3.16b
+
+ transpose_4x4(v4, v5, v6, v7)
+
+ SM4_CRYPT_BLK4_BE(v4, v5, v6, v7)
+
+ eor v4.16b, v4.16b, RIV.16b
+ eor v5.16b, v5.16b, v0.16b
+ eor v6.16b, v6.16b, v1.16b
+ eor v7.16b, v7.16b, v2.16b
+
+ mov RIV.16b, v3.16b
+
+ st1 {v4.16b-v7.16b}, [x1], #64
+
+ cbz w4, .Lcbc_dec_end
+
+.Lcbc_dec_tail:
+ cmp w4, #2
+ ld1 {v0.16b}, [x2], #16
+ blt .Lcbc_dec_tail_load_done
+ ld1 {v1.16b}, [x2], #16
+ beq .Lcbc_dec_tail_load_done
+ ld1 {v2.16b}, [x2], #16
+
+.Lcbc_dec_tail_load_done:
+ rev32 v4.16b, v0.16b
+ rev32 v5.16b, v1.16b
+ rev32 v6.16b, v2.16b
+
+ transpose_4x4(v4, v5, v6, v7)
+
+ SM4_CRYPT_BLK4_BE(v4, v5, v6, v7)
+
+ cmp w4, #2
+ eor v4.16b, v4.16b, RIV.16b
+ mov RIV.16b, v0.16b
+ st1 {v4.16b}, [x1], #16
+ blt .Lcbc_dec_end
+
+ eor v5.16b, v5.16b, v0.16b
+ mov RIV.16b, v1.16b
+ st1 {v5.16b}, [x1], #16
+ beq .Lcbc_dec_end
+
+ eor v6.16b, v6.16b, v1.16b
+ mov RIV.16b, v2.16b
+ st1 {v6.16b}, [x1], #16
+
+.Lcbc_dec_end:
/* store new IV */
- st1 {RIV.16b}, [x3];
+ st1 {RIV.16b}, [x3]
- ret;
-SYM_FUNC_END(sm4_neon_cbc_dec_blk8)
+ ret
+SYM_FUNC_END(sm4_neon_cbc_dec)
.align 3
-SYM_FUNC_START(sm4_neon_cfb_dec_blk8)
+SYM_FUNC_START(sm4_neon_cfb_dec)
/* input:
* x0: round key array, CTX
* x1: dst
* x2: src
* x3: iv (big endian, 128 bit)
- * w4: nblocks (multiples of 8)
+ * w4: nblocks
*/
- PREPARE;
+ SM4_PREPARE()
+
+ ld1 {v0.16b}, [x3]
+
+.Lcfb_dec_loop_8x:
+ sub w4, w4, #8
+ tbnz w4, #31, .Lcfb_dec_4x
+
+ ld1 {v1.16b-v3.16b}, [x2], #48
+ ld4 {v4.4s-v7.4s}, [x2]
+
+ transpose_4x4(v0, v1, v2, v3)
+
+ SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7)
+
+ sub x2, x2, #48
+ ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64
+ ld1 {RTMP4.16b-RTMP7.16b}, [x2], #64
+
+ eor v0.16b, v0.16b, RTMP0.16b
+ eor v1.16b, v1.16b, RTMP1.16b
+ eor v2.16b, v2.16b, RTMP2.16b
+ eor v3.16b, v3.16b, RTMP3.16b
+ eor v4.16b, v4.16b, RTMP4.16b
+ eor v5.16b, v5.16b, RTMP5.16b
+ eor v6.16b, v6.16b, RTMP6.16b
+ eor v7.16b, v7.16b, RTMP7.16b
+
+ st1 {v0.16b-v3.16b}, [x1], #64
+ st1 {v4.16b-v7.16b}, [x1], #64
+
+ mov v0.16b, RTMP7.16b
+
+ cbz w4, .Lcfb_dec_end
+ b .Lcfb_dec_loop_8x
+
+.Lcfb_dec_4x:
+ add w4, w4, #8
+ cmp w4, #4
+ blt .Lcfb_dec_tail
+
+ sub w4, w4, #4
+
+ ld1 {v4.16b-v7.16b}, [x2], #64
+
+ rev32 v0.16b, v0.16b /* v0 is IV register */
+ rev32 v1.16b, v4.16b
+ rev32 v2.16b, v5.16b
+ rev32 v3.16b, v6.16b
+
+ transpose_4x4(v0, v1, v2, v3)
+
+ SM4_CRYPT_BLK4_BE(v0, v1, v2, v3)
- ld1 {v0.16b}, [x3];
+ eor v0.16b, v0.16b, v4.16b
+ eor v1.16b, v1.16b, v5.16b
+ eor v2.16b, v2.16b, v6.16b
+ eor v3.16b, v3.16b, v7.16b
-.Lcfb_loop_blk:
- subs w4, w4, #8;
- bmi .Lcfb_end;
+ st1 {v0.16b-v3.16b}, [x1], #64
- ld1 {v1.16b, v2.16b, v3.16b}, [x2], #48;
- ld1 {v4.16b-v7.16b}, [x2];
+ mov v0.16b, v7.16b
- SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7);
+ cbz w4, .Lcfb_dec_end
- sub x2, x2, #48;
- ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
- eor v0.16b, v0.16b, RTMP0.16b;
- eor v1.16b, v1.16b, RTMP1.16b;
- eor v2.16b, v2.16b, RTMP2.16b;
- eor v3.16b, v3.16b, RTMP3.16b;
- st1 {v0.16b-v3.16b}, [x1], #64;
+.Lcfb_dec_tail:
+ cmp w4, #2
+ ld1 {v4.16b}, [x2], #16
+ blt .Lcfb_dec_tail_load_done
+ ld1 {v5.16b}, [x2], #16
+ beq .Lcfb_dec_tail_load_done
+ ld1 {v6.16b}, [x2], #16
- ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
- eor v4.16b, v4.16b, RTMP0.16b;
- eor v5.16b, v5.16b, RTMP1.16b;
- eor v6.16b, v6.16b, RTMP2.16b;
- eor v7.16b, v7.16b, RTMP3.16b;
- st1 {v4.16b-v7.16b}, [x1], #64;
+.Lcfb_dec_tail_load_done:
+ rev32 v0.16b, v0.16b /* v0 is IV register */
+ rev32 v1.16b, v4.16b
+ rev32 v2.16b, v5.16b
- mov v0.16b, RTMP3.16b;
+ transpose_4x4(v0, v1, v2, v3)
- b .Lcfb_loop_blk;
+ SM4_CRYPT_BLK4_BE(v0, v1, v2, v3)
-.Lcfb_end:
+ cmp w4, #2
+ eor v0.16b, v0.16b, v4.16b
+ st1 {v0.16b}, [x1], #16
+ mov v0.16b, v4.16b
+ blt .Lcfb_dec_end
+
+ eor v1.16b, v1.16b, v5.16b
+ st1 {v1.16b}, [x1], #16
+ mov v0.16b, v5.16b
+ beq .Lcfb_dec_end
+
+ eor v2.16b, v2.16b, v6.16b
+ st1 {v2.16b}, [x1], #16
+ mov v0.16b, v6.16b
+
+.Lcfb_dec_end:
/* store new IV */
- st1 {v0.16b}, [x3];
+ st1 {v0.16b}, [x3]
- ret;
-SYM_FUNC_END(sm4_neon_cfb_dec_blk8)
+ ret
+SYM_FUNC_END(sm4_neon_cfb_dec)
.align 3
-SYM_FUNC_START(sm4_neon_ctr_enc_blk8)
+SYM_FUNC_START(sm4_neon_ctr_crypt)
/* input:
* x0: round key array, CTX
* x1: dst
* x2: src
* x3: ctr (big endian, 128 bit)
- * w4: nblocks (multiples of 8)
+ * w4: nblocks
*/
- PREPARE;
+ SM4_PREPARE()
- ldp x7, x8, [x3];
- rev x7, x7;
- rev x8, x8;
+ ldp x7, x8, [x3]
+ rev x7, x7
+ rev x8, x8
-.Lctr_loop_blk:
- subs w4, w4, #8;
- bmi .Lctr_end;
+.Lctr_crypt_loop_8x:
+ sub w4, w4, #8
+ tbnz w4, #31, .Lctr_crypt_4x
-#define inc_le128(vctr) \
- mov vctr.d[1], x8; \
- mov vctr.d[0], x7; \
- adds x8, x8, #1; \
- adc x7, x7, xzr; \
- rev64 vctr.16b, vctr.16b;
+#define inc_le128(vctr) \
+ mov vctr.d[1], x8; \
+ mov vctr.d[0], x7; \
+ adds x8, x8, #1; \
+ rev64 vctr.16b, vctr.16b; \
+ adc x7, x7, xzr;
/* construct CTRs */
- inc_le128(v0); /* +0 */
- inc_le128(v1); /* +1 */
- inc_le128(v2); /* +2 */
- inc_le128(v3); /* +3 */
- inc_le128(v4); /* +4 */
- inc_le128(v5); /* +5 */
- inc_le128(v6); /* +6 */
- inc_le128(v7); /* +7 */
-
- SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7);
-
- ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
- eor v0.16b, v0.16b, RTMP0.16b;
- eor v1.16b, v1.16b, RTMP1.16b;
- eor v2.16b, v2.16b, RTMP2.16b;
- eor v3.16b, v3.16b, RTMP3.16b;
- st1 {v0.16b-v3.16b}, [x1], #64;
-
- ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
- eor v4.16b, v4.16b, RTMP0.16b;
- eor v5.16b, v5.16b, RTMP1.16b;
- eor v6.16b, v6.16b, RTMP2.16b;
- eor v7.16b, v7.16b, RTMP3.16b;
- st1 {v4.16b-v7.16b}, [x1], #64;
-
- b .Lctr_loop_blk;
-
-.Lctr_end:
+ inc_le128(v0) /* +0 */
+ inc_le128(v1) /* +1 */
+ inc_le128(v2) /* +2 */
+ inc_le128(v3) /* +3 */
+ inc_le128(v4) /* +4 */
+ inc_le128(v5) /* +5 */
+ inc_le128(v6) /* +6 */
+ inc_le128(v7) /* +7 */
+
+ transpose_4x4_2x(v0, v1, v2, v3, v4, v5, v6, v7)
+
+ SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7)
+
+ ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64
+ ld1 {RTMP4.16b-RTMP7.16b}, [x2], #64
+
+ eor v0.16b, v0.16b, RTMP0.16b
+ eor v1.16b, v1.16b, RTMP1.16b
+ eor v2.16b, v2.16b, RTMP2.16b
+ eor v3.16b, v3.16b, RTMP3.16b
+ eor v4.16b, v4.16b, RTMP4.16b
+ eor v5.16b, v5.16b, RTMP5.16b
+ eor v6.16b, v6.16b, RTMP6.16b
+ eor v7.16b, v7.16b, RTMP7.16b
+
+ st1 {v0.16b-v3.16b}, [x1], #64
+ st1 {v4.16b-v7.16b}, [x1], #64
+
+ cbz w4, .Lctr_crypt_end
+ b .Lctr_crypt_loop_8x
+
+.Lctr_crypt_4x:
+ add w4, w4, #8
+ cmp w4, #4
+ blt .Lctr_crypt_tail
+
+ sub w4, w4, #4
+
+ /* construct CTRs */
+ inc_le128(v0) /* +0 */
+ inc_le128(v1) /* +1 */
+ inc_le128(v2) /* +2 */
+ inc_le128(v3) /* +3 */
+
+ ld1 {v4.16b-v7.16b}, [x2], #64
+
+ transpose_4x4(v0, v1, v2, v3)
+
+ SM4_CRYPT_BLK4(v0, v1, v2, v3)
+
+ eor v0.16b, v0.16b, v4.16b
+ eor v1.16b, v1.16b, v5.16b
+ eor v2.16b, v2.16b, v6.16b
+ eor v3.16b, v3.16b, v7.16b
+
+ st1 {v0.16b-v3.16b}, [x1], #64
+
+ cbz w4, .Lctr_crypt_end
+
+.Lctr_crypt_tail:
+ /* inc_le128 will change the sign bit */
+ ld1 {v4.16b}, [x2], #16
+ inc_le128(v0)
+ cmp w4, #2
+ blt .Lctr_crypt_tail_load_done
+
+ ld1 {v5.16b}, [x2], #16
+ inc_le128(v1)
+ cmp w4, #2
+ beq .Lctr_crypt_tail_load_done
+
+ ld1 {v6.16b}, [x2], #16
+ inc_le128(v2)
+
+.Lctr_crypt_tail_load_done:
+ transpose_4x4(v0, v1, v2, v3)
+
+ SM4_CRYPT_BLK4(v0, v1, v2, v3)
+
+ cmp w4, #2
+
+ eor v0.16b, v0.16b, v4.16b
+ st1 {v0.16b}, [x1], #16
+ blt .Lctr_crypt_end
+
+ eor v1.16b, v1.16b, v5.16b
+ st1 {v1.16b}, [x1], #16
+ beq .Lctr_crypt_end
+
+ eor v2.16b, v2.16b, v6.16b
+ st1 {v2.16b}, [x1], #16
+
+.Lctr_crypt_end:
/* store new CTR */
- rev x7, x7;
- rev x8, x8;
- stp x7, x8, [x3];
+ rev x7, x7
+ rev x8, x8
+ stp x7, x8, [x3]
- ret;
-SYM_FUNC_END(sm4_neon_ctr_enc_blk8)
+ ret
+SYM_FUNC_END(sm4_neon_ctr_crypt)
diff --git a/arch/arm64/crypto/sm4-neon-glue.c b/arch/arm64/crypto/sm4-neon-glue.c
index 03a6a6866a31..7b19accf5c03 100644
--- a/arch/arm64/crypto/sm4-neon-glue.c
+++ b/arch/arm64/crypto/sm4-neon-glue.c
@@ -18,19 +18,14 @@
#include <crypto/internal/skcipher.h>
#include <crypto/sm4.h>
-#define BYTES2BLKS(nbytes) ((nbytes) >> 4)
-#define BYTES2BLK8(nbytes) (((nbytes) >> 4) & ~(8 - 1))
-
-asmlinkage void sm4_neon_crypt_blk1_8(const u32 *rkey, u8 *dst, const u8 *src,
- unsigned int nblks);
-asmlinkage void sm4_neon_crypt_blk8(const u32 *rkey, u8 *dst, const u8 *src,
- unsigned int nblks);
-asmlinkage void sm4_neon_cbc_dec_blk8(const u32 *rkey, u8 *dst, const u8 *src,
- u8 *iv, unsigned int nblks);
-asmlinkage void sm4_neon_cfb_dec_blk8(const u32 *rkey, u8 *dst, const u8 *src,
- u8 *iv, unsigned int nblks);
-asmlinkage void sm4_neon_ctr_enc_blk8(const u32 *rkey, u8 *dst, const u8 *src,
- u8 *iv, unsigned int nblks);
+asmlinkage void sm4_neon_crypt(const u32 *rkey, u8 *dst, const u8 *src,
+ unsigned int nblocks);
+asmlinkage void sm4_neon_cbc_dec(const u32 *rkey_dec, u8 *dst, const u8 *src,
+ u8 *iv, unsigned int nblocks);
+asmlinkage void sm4_neon_cfb_dec(const u32 *rkey_enc, u8 *dst, const u8 *src,
+ u8 *iv, unsigned int nblocks);
+asmlinkage void sm4_neon_ctr_crypt(const u32 *rkey_enc, u8 *dst, const u8 *src,
+ u8 *iv, unsigned int nblocks);
static int sm4_setkey(struct crypto_skcipher *tfm, const u8 *key,
unsigned int key_len)
@@ -51,27 +46,18 @@ static int sm4_ecb_do_crypt(struct skcipher_request *req, const u32 *rkey)
while ((nbytes = walk.nbytes) > 0) {
const u8 *src = walk.src.virt.addr;
u8 *dst = walk.dst.virt.addr;
- unsigned int nblks;
+ unsigned int nblocks;
- kernel_neon_begin();
+ nblocks = nbytes / SM4_BLOCK_SIZE;
+ if (nblocks) {
+ kernel_neon_begin();
- nblks = BYTES2BLK8(nbytes);
- if (nblks) {
- sm4_neon_crypt_blk8(rkey, dst, src, nblks);
- dst += nblks * SM4_BLOCK_SIZE;
- src += nblks * SM4_BLOCK_SIZE;
- nbytes -= nblks * SM4_BLOCK_SIZE;
- }
+ sm4_neon_crypt(rkey, dst, src, nblocks);
- nblks = BYTES2BLKS(nbytes);
- if (nblks) {
- sm4_neon_crypt_blk1_8(rkey, dst, src, nblks);
- nbytes -= nblks * SM4_BLOCK_SIZE;
+ kernel_neon_end();
}
- kernel_neon_end();
-
- err = skcipher_walk_done(&walk, nbytes);
+ err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE);
}
return err;
@@ -138,48 +124,19 @@ static int sm4_cbc_decrypt(struct skcipher_request *req)
while ((nbytes = walk.nbytes) > 0) {
const u8 *src = walk.src.virt.addr;
u8 *dst = walk.dst.virt.addr;
- unsigned int nblks;
+ unsigned int nblocks;
- kernel_neon_begin();
+ nblocks = nbytes / SM4_BLOCK_SIZE;
+ if (nblocks) {
+ kernel_neon_begin();
- nblks = BYTES2BLK8(nbytes);
- if (nblks) {
- sm4_neon_cbc_dec_blk8(ctx->rkey_dec, dst, src,
- walk.iv, nblks);
- dst += nblks * SM4_BLOCK_SIZE;
- src += nblks * SM4_BLOCK_SIZE;
- nbytes -= nblks * SM4_BLOCK_SIZE;
- }
+ sm4_neon_cbc_dec(ctx->rkey_dec, dst, src,
+ walk.iv, nblocks);
- nblks = BYTES2BLKS(nbytes);
- if (nblks) {
- u8 keystream[SM4_BLOCK_SIZE * 8];
- u8 iv[SM4_BLOCK_SIZE];
- int i;
-
- sm4_neon_crypt_blk1_8(ctx->rkey_dec, keystream,
- src, nblks);
-
- src += ((int)nblks - 2) * SM4_BLOCK_SIZE;
- dst += (nblks - 1) * SM4_BLOCK_SIZE;
- memcpy(iv, src + SM4_BLOCK_SIZE, SM4_BLOCK_SIZE);
-
- for (i = nblks - 1; i > 0; i--) {
- crypto_xor_cpy(dst, src,
- &keystream[i * SM4_BLOCK_SIZE],
- SM4_BLOCK_SIZE);
- src -= SM4_BLOCK_SIZE;
- dst -= SM4_BLOCK_SIZE;
- }
- crypto_xor_cpy(dst, walk.iv,
- keystream, SM4_BLOCK_SIZE);
- memcpy(walk.iv, iv, SM4_BLOCK_SIZE);
- nbytes -= nblks * SM4_BLOCK_SIZE;
+ kernel_neon_end();
}
- kernel_neon_end();
-
- err = skcipher_walk_done(&walk, nbytes);
+ err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE);
}
return err;
@@ -238,41 +195,21 @@ static int sm4_cfb_decrypt(struct skcipher_request *req)
while ((nbytes = walk.nbytes) > 0) {
const u8 *src = walk.src.virt.addr;
u8 *dst = walk.dst.virt.addr;
- unsigned int nblks;
+ unsigned int nblocks;
- kernel_neon_begin();
+ nblocks = nbytes / SM4_BLOCK_SIZE;
+ if (nblocks) {
+ kernel_neon_begin();
- nblks = BYTES2BLK8(nbytes);
- if (nblks) {
- sm4_neon_cfb_dec_blk8(ctx->rkey_enc, dst, src,
- walk.iv, nblks);
- dst += nblks * SM4_BLOCK_SIZE;
- src += nblks * SM4_BLOCK_SIZE;
- nbytes -= nblks * SM4_BLOCK_SIZE;
- }
+ sm4_neon_cfb_dec(ctx->rkey_enc, dst, src,
+ walk.iv, nblocks);
- nblks = BYTES2BLKS(nbytes);
- if (nblks) {
- u8 keystream[SM4_BLOCK_SIZE * 8];
-
- memcpy(keystream, walk.iv, SM4_BLOCK_SIZE);
- if (nblks > 1)
- memcpy(&keystream[SM4_BLOCK_SIZE], src,
- (nblks - 1) * SM4_BLOCK_SIZE);
- memcpy(walk.iv, src + (nblks - 1) * SM4_BLOCK_SIZE,
- SM4_BLOCK_SIZE);
-
- sm4_neon_crypt_blk1_8(ctx->rkey_enc, keystream,
- keystream, nblks);
-
- crypto_xor_cpy(dst, src, keystream,
- nblks * SM4_BLOCK_SIZE);
- dst += nblks * SM4_BLOCK_SIZE;
- src += nblks * SM4_BLOCK_SIZE;
- nbytes -= nblks * SM4_BLOCK_SIZE;
- }
+ kernel_neon_end();
- kernel_neon_end();
+ dst += nblocks * SM4_BLOCK_SIZE;
+ src += nblocks * SM4_BLOCK_SIZE;
+ nbytes -= nblocks * SM4_BLOCK_SIZE;
+ }
/* tail */
if (walk.nbytes == walk.total && nbytes > 0) {
@@ -302,40 +239,21 @@ static int sm4_ctr_crypt(struct skcipher_request *req)
while ((nbytes = walk.nbytes) > 0) {
const u8 *src = walk.src.virt.addr;
u8 *dst = walk.dst.virt.addr;
- unsigned int nblks;
+ unsigned int nblocks;
- kernel_neon_begin();
+ nblocks = nbytes / SM4_BLOCK_SIZE;
+ if (nblocks) {
+ kernel_neon_begin();
- nblks = BYTES2BLK8(nbytes);
- if (nblks) {
- sm4_neon_ctr_enc_blk8(ctx->rkey_enc, dst, src,
- walk.iv, nblks);
- dst += nblks * SM4_BLOCK_SIZE;
- src += nblks * SM4_BLOCK_SIZE;
- nbytes -= nblks * SM4_BLOCK_SIZE;
- }
+ sm4_neon_ctr_crypt(ctx->rkey_enc, dst, src,
+ walk.iv, nblocks);
- nblks = BYTES2BLKS(nbytes);
- if (nblks) {
- u8 keystream[SM4_BLOCK_SIZE * 8];
- int i;
-
- for (i = 0; i < nblks; i++) {
- memcpy(&keystream[i * SM4_BLOCK_SIZE],
- walk.iv, SM4_BLOCK_SIZE);
- crypto_inc(walk.iv, SM4_BLOCK_SIZE);
- }
- sm4_neon_crypt_blk1_8(ctx->rkey_enc, keystream,
- keystream, nblks);
-
- crypto_xor_cpy(dst, src, keystream,
- nblks * SM4_BLOCK_SIZE);
- dst += nblks * SM4_BLOCK_SIZE;
- src += nblks * SM4_BLOCK_SIZE;
- nbytes -= nblks * SM4_BLOCK_SIZE;
- }
+ kernel_neon_end();
- kernel_neon_end();
+ dst += nblocks * SM4_BLOCK_SIZE;
+ src += nblocks * SM4_BLOCK_SIZE;
+ nbytes -= nblocks * SM4_BLOCK_SIZE;
+ }
/* tail */
if (walk.nbytes == walk.total && nbytes > 0) {
--
2.24.3 (Apple Git-128)
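
Note on the block-count handling above: the reworked assembly entry points
(sm4_neon_crypt, sm4_neon_cbc_dec, sm4_neon_cfb_dec, sm4_neon_ctr_crypt) now
accept any nblocks and split it themselves into an 8-block loop, at most one
4-block step, and a 1..3 block tail, while the glue code only computes
nbytes / SM4_BLOCK_SIZE and hands the remainder back to the walk. The
following C model is a minimal sketch of that arithmetic only. The struct and
function names (sm4_neon_split, sm4_neon_plan) are hypothetical and exist
neither in the patch nor in the kernel.

```c
#include <assert.h>

#define SM4_BLOCK_SIZE 16

/* Hypothetical bookkeeping type, used only for this illustration. */
struct sm4_neon_split {
	unsigned int loops_8x;       /* iterations of the .L*_loop_8x path   */
	unsigned int step_4x;        /* 1 if the .L*_4x path runs, else 0    */
	unsigned int tail;           /* blocks left for the .L*_tail path    */
	unsigned int leftover_bytes; /* partial block returned to the walker */
};

/*
 * Model of how one walk step of nbytes is divided up: whole blocks go to
 * the assembly routine, and the sub-block remainder stays with the caller
 * (nbytes % SM4_BLOCK_SIZE in the glue code).
 */
static struct sm4_neon_split sm4_neon_plan(unsigned int nbytes)
{
	struct sm4_neon_split s = { 0, 0, 0, 0 };
	unsigned int nblocks = nbytes / SM4_BLOCK_SIZE;

	s.leftover_bytes = nbytes % SM4_BLOCK_SIZE;

	/* 8 blocks per iteration while at least 8 remain */
	s.loops_8x = nblocks / 8;
	nblocks %= 8;

	/* at most one 4-block step */
	if (nblocks >= 4) {
		s.step_4x = 1;
		nblocks -= 4;
	}

	/* whatever is left (0..3 blocks) is handled by the tail path */
	s.tail = nblocks;
	assert(s.tail < 4);

	return s;
}
```

For example, a walk step of 13 blocks plus 5 stray bytes would be modelled as
one 8-block iteration, one 4-block step, a 1-block tail, and 5 leftover bytes.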
* [PATCH 04/16] crypto: testmgr - add SM4 cts-cbc/essiv/xts/xcbc test vectors
From: Tianjia Zhang @ 2022-09-26 9:36 UTC (permalink / raw)
To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
linux-stm32
This patch adds test vectors for the CTS-CBC/ESSIV/XTS/XCBC modes of the
SM4 algorithm, and also adds some test vectors for SM4 GCM/CCM.
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
crypto/testmgr.c | 25 +
crypto/testmgr.h | 1161 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 1186 insertions(+)
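
Note for readers of the CTS vectors: the block-aligned entries in
sm4_cts_tv_template (len 32, 48 and 64) share the same 16-byte ciphertext
blocks, with the final two blocks in exchanged order relative to what plain
CBC would emit. That is consistent with the CBC-CS3 style of ciphertext
stealing, though this is inferred from the vectors themselves rather than
stated in the patch. The sketch below shows only that final swap for
block-aligned input. The helper name cts_cs3_swap_last_two is hypothetical
and is not a kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SM4_BLOCK_SIZE 16

/*
 * Hypothetical helper: given CBC ciphertext whose length is a multiple of
 * the block size and at least two blocks, exchange the last two blocks in
 * place. Returns 0 on success, -1 if the length does not qualify.
 */
static int cts_cs3_swap_last_two(unsigned char *buf, size_t len)
{
	unsigned char tmp[SM4_BLOCK_SIZE];
	unsigned char *last, *prev;

	assert(buf != NULL);

	if (len < 2 * SM4_BLOCK_SIZE || len % SM4_BLOCK_SIZE)
		return -1;

	last = buf + len - SM4_BLOCK_SIZE;
	prev = last - SM4_BLOCK_SIZE;

	memcpy(tmp, prev, SM4_BLOCK_SIZE);
	memcpy(prev, last, SM4_BLOCK_SIZE);
	memcpy(last, tmp, SM4_BLOCK_SIZE);

	return 0;
}
```

Applied to the first two ciphertext blocks of the 64-byte vector, which are
left in place there, the swap yields the ciphertext of the 32-byte vector.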
diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index e4bb03b8b924..cce101c7e8f9 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -4712,6 +4712,12 @@ static const struct alg_test_desc alg_test_descs[] = {
.alg = "cts(cbc(paes))",
.test = alg_test_null,
.fips_allowed = 1,
+ }, {
+ .alg = "cts(cbc(sm4))",
+ .test = alg_test_skcipher,
+ .suite = {
+ .cipher = __VECS(sm4_cts_tv_template)
+ }
}, {
.alg = "curve25519",
.test = alg_test_kpp,
@@ -5059,6 +5065,12 @@ static const struct alg_test_desc alg_test_descs[] = {
.cipher = __VECS(essiv_aes_cbc_tv_template)
}
}, {
+ .alg = "essiv(cbc(sm4),sm3)",
+ .test = alg_test_skcipher,
+ .suite = {
+ .cipher = __VECS(essiv_sm4_cbc_tv_template)
+ }
+ }, {
#if IS_ENABLED(CONFIG_CRYPTO_DH_RFC7919_GROUPS)
.alg = "ffdhe2048(dh)",
.test = alg_test_kpp,
@@ -5586,6 +5598,12 @@ static const struct alg_test_desc alg_test_descs[] = {
.suite = {
.hash = __VECS(aes_xcbc128_tv_template)
}
+ }, {
+ .alg = "xcbc(sm4)",
+ .test = alg_test_hash,
+ .suite = {
+ .hash = __VECS(sm4_xcbc128_tv_template)
+ }
}, {
.alg = "xchacha12",
.test = alg_test_skcipher,
@@ -5640,6 +5658,13 @@ static const struct alg_test_desc alg_test_descs[] = {
.suite = {
.cipher = __VECS(serpent_xts_tv_template)
}
+ }, {
+ .alg = "xts(sm4)",
+ .generic_driver = "xts(ecb(sm4-generic))",
+ .test = alg_test_skcipher,
+ .suite = {
+ .cipher = __VECS(sm4_xts_tv_template)
+ }
}, {
.alg = "xts(twofish)",
.generic_driver = "xts(ecb(twofish-generic))",
diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index d6088e26f326..ced48e4dad0c 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -14882,6 +14882,537 @@ static const struct cipher_testvec sm4_cfb_tv_template[] = {
}
};
+static const struct cipher_testvec sm4_cts_tv_template[] = {
+ /* Generated from AES-CTS test vectors */
+ {
+ .klen = 16,
+ .key = "\x63\x68\x69\x63\x6b\x65\x6e\x20"
+ "\x74\x65\x72\x69\x79\x61\x6b\x69",
+ .ptext = "\x49\x20\x77\x6f\x75\x6c\x64\x20"
+ "\x6c\x69\x6b\x65\x20\x74\x68\x65"
+ "\x20",
+ .len = 17,
+ .ctext = "\x05\xfe\x23\xee\x17\xa2\x89\x98"
+ "\xbc\x97\x0a\x0b\x54\x67\xca\xd7"
+ "\xd6",
+ }, {
+ .klen = 16,
+ .key = "\x63\x68\x69\x63\x6b\x65\x6e\x20"
+ "\x74\x65\x72\x69\x79\x61\x6b\x69",
+ .ptext = "\x49\x20\x77\x6f\x75\x6c\x64\x20"
+ "\x6c\x69\x6b\x65\x20\x74\x68\x65"
+ "\x20\x47\x65\x6e\x65\x72\x61\x6c"
+ "\x20\x47\x61\x75\x27\x73\x20",
+ .len = 31,
+ .ctext = "\x15\x46\xe4\x95\xa4\xec\xf0\xb8"
+ "\x49\xd6\x6a\x9d\x89\xc7\xfd\x70"
+ "\xd6\x71\xc8\xc0\x4d\x52\x7c\x66"
+ "\x93\xf7\x70\xbb\xa8\x3f\xa3",
+ }, {
+ .klen = 16,
+ .key = "\x63\x68\x69\x63\x6b\x65\x6e\x20"
+ "\x74\x65\x72\x69\x79\x61\x6b\x69",
+ .ptext = "\x49\x20\x77\x6f\x75\x6c\x64\x20"
+ "\x6c\x69\x6b\x65\x20\x74\x68\x65"
+ "\x20\x47\x65\x6e\x65\x72\x61\x6c"
+ "\x20\x47\x61\x75\x27\x73\x20\x43",
+ .len = 32,
+ .ctext = "\x89\xc7\x99\x3f\x87\x69\x5c\xd3"
+ "\x01\x6a\xbf\xd4\x3f\x79\x02\xa3"
+ "\xd6\x71\xc8\xc0\x4d\x52\x7c\x66"
+ "\x93\xf7\x70\xbb\xa8\x3f\xa3\xcf",
+ }, {
+ .klen = 16,
+ .key = "\x63\x68\x69\x63\x6b\x65\x6e\x20"
+ "\x74\x65\x72\x69\x79\x61\x6b\x69",
+ .ptext = "\x49\x20\x77\x6f\x75\x6c\x64\x20"
+ "\x6c\x69\x6b\x65\x20\x74\x68\x65"
+ "\x20\x47\x65\x6e\x65\x72\x61\x6c"
+ "\x20\x47\x61\x75\x27\x73\x20\x43"
+ "\x68\x69\x63\x6b\x65\x6e\x2c\x20"
+ "\x70\x6c\x65\x61\x73\x65\x2c",
+ .len = 47,
+ .ctext = "\xd6\x71\xc8\xc0\x4d\x52\x7c\x66"
+ "\x93\xf7\x70\xbb\xa8\x3f\xa3\xcf"
+ "\xd3\xe1\xdc\xeb\xfa\x04\x11\x99"
+ "\xde\xcf\x6f\x4d\x7b\x09\x92\x7f"
+ "\x89\xc7\x99\x3f\x87\x69\x5c\xd3"
+ "\x01\x6a\xbf\xd4\x3f\x79\x02",
+ }, {
+ .klen = 16,
+ .key = "\x63\x68\x69\x63\x6b\x65\x6e\x20"
+ "\x74\x65\x72\x69\x79\x61\x6b\x69",
+ .ptext = "\x49\x20\x77\x6f\x75\x6c\x64\x20"
+ "\x6c\x69\x6b\x65\x20\x74\x68\x65"
+ "\x20\x47\x65\x6e\x65\x72\x61\x6c"
+ "\x20\x47\x61\x75\x27\x73\x20\x43"
+ "\x68\x69\x63\x6b\x65\x6e\x2c\x20"
+ "\x70\x6c\x65\x61\x73\x65\x2c\x20",
+ .len = 48,
+ .ctext = "\xd6\x71\xc8\xc0\x4d\x52\x7c\x66"
+ "\x93\xf7\x70\xbb\xa8\x3f\xa3\xcf"
+ "\x9a\xbd\x7b\xfe\x82\xab\xcc\x7f"
+ "\xbd\x99\x21\x0c\x5e\x4d\xed\x20"
+ "\x89\xc7\x99\x3f\x87\x69\x5c\xd3"
+ "\x01\x6a\xbf\xd4\x3f\x79\x02\xa3",
+ }, {
+ .klen = 16,
+ .key = "\x63\x68\x69\x63\x6b\x65\x6e\x20"
+ "\x74\x65\x72\x69\x79\x61\x6b\x69",
+ .ptext = "\x49\x20\x77\x6f\x75\x6c\x64\x20"
+ "\x6c\x69\x6b\x65\x20\x74\x68\x65"
+ "\x20\x47\x65\x6e\x65\x72\x61\x6c"
+ "\x20\x47\x61\x75\x27\x73\x20\x43"
+ "\x68\x69\x63\x6b\x65\x6e\x2c\x20"
+ "\x70\x6c\x65\x61\x73\x65\x2c\x20"
+ "\x61\x6e\x64\x20\x77\x6f\x6e\x74"
+ "\x6f\x6e\x20\x73\x6f\x75\x70\x2e",
+ .len = 64,
+ .ctext = "\xd6\x71\xc8\xc0\x4d\x52\x7c\x66"
+ "\x93\xf7\x70\xbb\xa8\x3f\xa3\xcf"
+ "\x89\xc7\x99\x3f\x87\x69\x5c\xd3"
+ "\x01\x6a\xbf\xd4\x3f\x79\x02\xa3"
+ "\x58\x19\xa4\x8f\xa9\x68\x5e\x6b"
+ "\x2c\x0f\x81\x60\x15\x98\x27\x4f"
+ "\x9a\xbd\x7b\xfe\x82\xab\xcc\x7f"
+ "\xbd\x99\x21\x0c\x5e\x4d\xed\x20",
+ }
+};
+
+static const struct cipher_testvec essiv_sm4_cbc_tv_template[] = {
+ /* Generated from AES-ESSIV-CBC test vectors */
+ {
+ .key = "\x06\xa9\x21\x40\x36\xb8\xa1\x5b"
+ "\x51\x2e\x03\xd5\x34\x12\x00\x06",
+ .klen = 16,
+ .iv = "\x3d\xaf\xba\x42\x9d\x9e\xb4\x30"
+ "\x00\x00\x00\x00\x00\x00\x00\x00",
+ .ptext = "Single block msg",
+ .ctext = "\x83\xa0\x79\x71\x18\xed\xb2\x0f"
+ "\xa8\x71\x94\x22\x8e\x1f\xc1\xbb",
+ .len = 16,
+ }, {
+ .key = "\xc2\x86\x69\x6d\x88\x7c\x9a\xa0"
+ "\x61\x1b\xbb\x3e\x20\x25\xa4\x5a",
+ .klen = 16,
+ .iv = "\x56\x2e\x17\x99\x6d\x09\x3d\x28"
+ "\x00\x00\x00\x00\x00\x00\x00\x00",
+ .ptext = "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+ "\x10\x11\x12\x13\x14\x15\x16\x17"
+ "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f",
+ .ctext = "\x48\x38\xba\xa0\x09\xa2\xe1\x61"
+ "\x94\xe5\xd2\x63\xe5\x04\x6c\x62"
+ "\x93\x21\x95\xfb\x8c\xf4\x25\x19"
+ "\xe0\x0f\x9c\xfa\x51\xfe\xe7\x32",
+ .len = 32,
+ }, {
+ .key = "\x1f\x35\x2c\x07\x3b\x61\x08\xd7"
+ "\x2d\x98\x10\xa3\x09\x14\xdf\xf4",
+ .klen = 16,
+ .iv = "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x00\x00\x00\x00\x00\x00\x00\x00",
+ .ptext = "\x6b\xc1\xbe\xe2\x2e\x40\x9f\x96"
+ "\xe9\x3d\x7e\x11\x73\x93\x17\x2a"
+ "\xae\x2d\x8a\x57\x1e\x03\xac\x9c"
+ "\x9e\xb7\x6f\xac\x45\xaf\x8e\x51"
+ "\x30\xc8\x1c\x46\xa3\x5c\xe4\x11"
+ "\xe5\xfb\xc1\x19\x1a\x0a\x52\xef"
+ "\xf6\x9f\x24\x45\xdf\x4f\x9b\x17"
+ "\xad\x2b\x41\x7b\xe6\x6c\x37\x10",
+ .ctext = "\xa5\x1d\x64\x91\x28\x1f\xbe\x9e"
+ "\x15\x39\x5f\xe4\xe1\x5a\x8c\x38"
+ "\x80\x7f\xc7\x7d\x00\x4c\x4b\xff"
+ "\x75\x3a\x03\xfe\x41\x75\x26\x9e"
+ "\x3f\xf1\x36\xaf\x7b\x37\x73\x1a"
+ "\xaf\x9b\x91\xec\x1e\xf0\x05\x9d"
+ "\x87\xda\x7b\xa3\xaa\xe6\x5b\x98"
+ "\x41\x73\xd5\x3c\x8c\x8b\xb5\x88",
+ .len = 64,
+ }, {
+ .key = "\xBE\xE1\x04\x27\xE1\x04\x27\x4A"
+ "\x6D\x90\x4A\x6D\x90\xB3\xD6\xF9",
+ .klen = 16,
+ .iv = "\xE7\x82\x1D\xB8\x53\x11\xAC\x47"
+ "\x00\x00\x00\x00\x00\x00\x00\x00",
+ .ptext = "\x50\xB9\x22\xAE\x17\x80\x0C\x75"
+ "\xDE\x47\xD3\x3C\xA5\x0E\x9A\x03"
+ "\x6C\xF8\x61\xCA\x33\xBF\x28\x91"
+ "\x1D\x86\xEF\x58\xE4\x4D\xB6\x1F"
+ "\xAB\x14\x7D\x09\x72\xDB\x44\xD0"
+ "\x39\xA2\x0B\x97\x00\x69\xF5\x5E"
+ "\xC7\x30\xBC\x25\x8E\x1A\x83\xEC"
+ "\x55\xE1\x4A\xB3\x1C\xA8\x11\x7A"
+ "\x06\x6F\xD8\x41\xCD\x36\x9F\x08"
+ "\x94\xFD\x66\xF2\x5B\xC4\x2D\xB9"
+ "\x22\x8B\x17\x80\xE9\x52\xDE\x47"
+ "\xB0\x19\xA5\x0E\x77\x03\x6C\xD5"
+ "\x3E\xCA\x33\x9C\x05\x91\xFA\x63"
+ "\xEF\x58\xC1\x2A\xB6\x1F\x88\x14"
+ "\x7D\xE6\x4F\xDB\x44\xAD\x16\xA2"
+ "\x0B\x74\x00\x69\xD2\x3B\xC7\x30"
+ "\x99\x02\x8E\xF7\x60\xEC\x55\xBE"
+ "\x27\xB3\x1C\x85\x11\x7A\xE3\x4C"
+ "\xD8\x41\xAA\x13\x9F\x08\x71\xFD"
+ "\x66\xCF\x38\xC4\x2D\x96\x22\x8B"
+ "\xF4\x5D\xE9\x52\xBB\x24\xB0\x19"
+ "\x82\x0E\x77\xE0\x49\xD5\x3E\xA7"
+ "\x10\x9C\x05\x6E\xFA\x63\xCC\x35"
+ "\xC1\x2A\x93\x1F\x88\xF1\x5A\xE6"
+ "\x4F\xB8\x21\xAD\x16\x7F\x0B\x74"
+ "\xDD\x46\xD2\x3B\xA4\x0D\x99\x02"
+ "\x6B\xF7\x60\xC9\x32\xBE\x27\x90"
+ "\x1C\x85\xEE\x57\xE3\x4C\xB5\x1E"
+ "\xAA\x13\x7C\x08\x71\xDA\x43\xCF"
+ "\x38\xA1\x0A\x96\xFF\x68\xF4\x5D"
+ "\xC6\x2F\xBB\x24\x8D\x19\x82\xEB"
+ "\x54\xE0\x49\xB2\x1B\xA7\x10\x79"
+ "\x05\x6E\xD7\x40\xCC\x35\x9E\x07"
+ "\x93\xFC\x65\xF1\x5A\xC3\x2C\xB8"
+ "\x21\x8A\x16\x7F\xE8\x51\xDD\x46"
+ "\xAF\x18\xA4\x0D\x76\x02\x6B\xD4"
+ "\x3D\xC9\x32\x9B\x04\x90\xF9\x62"
+ "\xEE\x57\xC0\x29\xB5\x1E\x87\x13"
+ "\x7C\xE5\x4E\xDA\x43\xAC\x15\xA1"
+ "\x0A\x73\xFF\x68\xD1\x3A\xC6\x2F"
+ "\x98\x01\x8D\xF6\x5F\xEB\x54\xBD"
+ "\x26\xB2\x1B\x84\x10\x79\xE2\x4B"
+ "\xD7\x40\xA9\x12\x9E\x07\x70\xFC"
+ "\x65\xCE\x37\xC3\x2C\x95\x21\x8A"
+ "\xF3\x5C\xE8\x51\xBA\x23\xAF\x18"
+ "\x81\x0D\x76\xDF\x48\xD4\x3D\xA6"
+ "\x0F\x9B\x04\x6D\xF9\x62\xCB\x34"
+ "\xC0\x29\x92\x1E\x87\xF0\x59\xE5"
+ "\x4E\xB7\x20\xAC\x15\x7E\x0A\x73"
+ "\xDC\x45\xD1\x3A\xA3\x0C\x98\x01"
+ "\x6A\xF6\x5F\xC8\x31\xBD\x26\x8F"
+ "\x1B\x84\xED\x56\xE2\x4B\xB4\x1D"
+ "\xA9\x12\x7B\x07\x70\xD9\x42\xCE"
+ "\x37\xA0\x09\x95\xFE\x67\xF3\x5C"
+ "\xC5\x2E\xBA\x23\x8C\x18\x81\xEA"
+ "\x53\xDF\x48\xB1\x1A\xA6\x0F\x78"
+ "\x04\x6D\xD6\x3F\xCB\x34\x9D\x06"
+ "\x92\xFB\x64\xF0\x59\xC2\x2B\xB7"
+ "\x20\x89\x15\x7E\xE7\x50\xDC\x45"
+ "\xAE\x17\xA3\x0C\x75\x01\x6A\xD3"
+ "\x3C\xC8\x31\x9A\x03\x8F\xF8\x61"
+ "\xED\x56\xBF\x28\xB4\x1D\x86\x12",
+ .ctext = "\xad\x68\x40\x68\xb2\xf9\x77\x55"
+ "\xd5\x1c\x17\x46\xc1\xfa\x05\xdd"
+ "\x94\x5c\xb7\x99\x82\xba\x05\x48"
+ "\xac\x5d\x14\x30\x2e\xc8\x0e\x2f"
+ "\x5a\xd7\x39\x43\x95\x4d\x93\xff"
+ "\x6b\xe3\xb7\x71\xc1\x39\x43\x8d"
+ "\x10\xd7\xd9\xa8\xe7\x65\xb7\x0a"
+ "\x27\x98\x5b\x90\xc3\x80\x1f\xd9"
+ "\x65\x82\x88\x0a\xc3\x16\x3f\xae"
+ "\x1f\xad\x88\xe9\xfb\x9e\xd4\xc8"
+ "\x81\x36\x50\x37\x1f\x11\x83\xe2"
+ "\xc5\x1a\x48\xdb\xc3\x18\x07\x5d"
+ "\xee\x4b\xea\x40\xd3\xd9\x8c\x59"
+ "\x29\xe1\x0b\x79\x3b\x28\xac\x75"
+ "\xda\x82\x99\x86\xd4\xbe\xd8\x81"
+ "\xe0\xc4\x58\x78\xe4\x33\xc1\xf1"
+ "\xbe\x96\xd3\x4c\x42\x6b\xaf\x24"
+ "\x69\xb4\x25\x88\x37\x9e\xb2\xfb"
+ "\x5c\x93\x22\x89\x2f\x81\x85\x06"
+ "\x12\x74\x3b\x6c\x99\x81\xfb\xbe"
+ "\x0f\xc4\xa5\xb6\xf8\x79\x5f\x72"
+ "\xf8\x46\x94\x3f\x1f\x9f\x15\xa2"
+ "\xc8\xc0\xbf\xeb\xa3\x9e\x59\xe1"
+ "\xbd\x1a\xe1\xe3\x6b\x33\x96\x54"
+ "\x1b\xc4\x25\x74\x06\xcf\x8a\x75"
+ "\x6c\xfc\x76\x7f\x9e\x7b\x00\xce"
+ "\xa8\x1e\x6a\x0f\x5a\xa6\xcb\x77"
+ "\x5f\x90\x39\xcb\xfe\x0e\x16\x53"
+ "\x8e\x21\x0f\x7e\x51\xcc\x92\xb8"
+ "\x4f\x65\x76\x20\x3d\x56\xb4\xcc"
+ "\x8b\x8e\x8e\x68\xc3\x82\x53\x5c"
+ "\x1c\x82\x13\x32\x3b\x97\xff\x48"
+ "\x98\xda\x4a\x7c\xc8\x21\x83\xfd"
+ "\xe2\xf1\x30\xe1\x11\xe9\xe8\x97"
+ "\x97\x24\x06\x73\xf2\x52\xbb\xab"
+ "\x9d\x5f\x0b\xa8\x2f\xab\x0b\x7d"
+ "\xe8\x20\x7b\x67\x2e\x93\xb5\x11"
+ "\x6c\x16\xea\xdd\x1a\x9d\xf2\xdc"
+ "\x79\x57\xc4\x04\xcb\x7f\x36\xa0"
+ "\x2e\xa7\x89\xab\xaa\x56\x59\x9e"
+ "\xec\x38\xea\x1a\xe9\xa7\x58\x58"
+ "\xb5\xb7\x8f\x8c\x5c\xd6\x86\x67"
+ "\x65\x0f\x93\x47\xf7\x3e\x19\x19"
+ "\x9b\x22\xd1\xc6\xc2\xba\x32\x5c"
+ "\x2c\x7a\xa2\xbb\xa5\x22\xde\xe5"
+ "\x1e\x78\x2c\xd3\x40\x6d\xfa\x79"
+ "\x4c\x9e\x1c\x36\x34\xaf\x95\x2e"
+ "\x68\x2e\x69\x7d\xe4\x7d\x0c\x74"
+ "\xaf\x73\x5b\x48\x62\x90\x5e\x19"
+ "\x0f\x12\xb3\xdb\x77\xbb\xe2\xac"
+ "\xaf\xfe\xd9\xa1\x80\x09\xc6\xd4"
+ "\xf4\x21\x3f\xa4\x0f\x16\x7b\x36"
+ "\x29\x6d\x10\xa2\xba\xaf\xf5\xa3"
+ "\x51\xca\x0a\x25\x74\x9a\xb7\x02"
+ "\xb8\xf8\x6b\xda\xb8\x1c\x9f\x62"
+ "\xf5\x61\x62\x9f\x4b\x71\x24\x45"
+ "\xfb\x0f\xdf\xa8\x47\x6f\x2f\x05"
+ "\x2f\xf4\xfd\xb8\xd1\x8c\x29\x9d"
+ "\x9d\xe8\x6f\x10\x89\xef\x08\x59"
+ "\xa0\x24\x1f\xdb\xea\xbc\x97\x44"
+ "\x23\x74\xbf\xaa\x87\x10\x5c\x58"
+ "\x2a\xe6\xe2\x19\xc5\x7e\x21\xe2",
+ .len = 496,
+ },
+};
+
+static const struct cipher_testvec sm4_xts_tv_template[] = {
+ /* Generated from AES-XTS test vectors */
+ {
+ .key = "\x00\x00\x00\x00\x00\x00\x00\x00"
+ "\x00\x00\x00\x00\x00\x00\x00\x00"
+ "\x00\x00\x00\x00\x00\x00\x00\x00"
+ "\x00\x00\x00\x00\x00\x00\x00\x00",
+ .klen = 32,
+ .iv = "\x00\x00\x00\x00\x00\x00\x00\x00"
+ "\x00\x00\x00\x00\x00\x00\x00\x00",
+ .ptext = "\x00\x00\x00\x00\x00\x00\x00\x00"
+ "\x00\x00\x00\x00\x00\x00\x00\x00"
+ "\x00\x00\x00\x00\x00\x00\x00\x00"
+ "\x00\x00\x00\x00\x00\x00\x00\x00",
+ .ctext = "\xd9\xb4\x21\xf7\x31\xc8\x94\xfd"
+ "\xc3\x5b\x77\x29\x1f\xe4\xe3\xb0"
+ "\x2a\x1f\xb7\x66\x98\xd5\x9f\x0e"
+ "\x51\x37\x6c\x4a\xda\x5b\xc7\x5d",
+ .len = 32,
+ }, {
+ .key = "\x11\x11\x11\x11\x11\x11\x11\x11"
+ "\x11\x11\x11\x11\x11\x11\x11\x11"
+ "\x22\x22\x22\x22\x22\x22\x22\x22"
+ "\x22\x22\x22\x22\x22\x22\x22\x22",
+ .klen = 32,
+ .iv = "\x33\x33\x33\x33\x33\x00\x00\x00"
+ "\x00\x00\x00\x00\x00\x00\x00\x00",
+ .ptext = "\x44\x44\x44\x44\x44\x44\x44\x44"
+ "\x44\x44\x44\x44\x44\x44\x44\x44"
+ "\x44\x44\x44\x44\x44\x44\x44\x44"
+ "\x44\x44\x44\x44\x44\x44\x44\x44",
+ .ctext = "\xa7\x4d\x72\x6c\x11\x19\x6a\x32"
+ "\xbe\x04\xe0\x01\xff\x29\xd0\xc7"
+ "\x93\x2f\x9f\x3e\xc2\x9b\xfc\xb6"
+ "\x4d\xd1\x7f\x63\xcb\xd3\xea\x31",
+ .len = 32,
+ }, {
+ .key = "\xff\xfe\xfd\xfc\xfb\xfa\xf9\xf8"
+ "\xf7\xf6\xf5\xf4\xf3\xf2\xf1\xf0"
+ "\x22\x22\x22\x22\x22\x22\x22\x22"
+ "\x22\x22\x22\x22\x22\x22\x22\x22",
+ .klen = 32,
+ .iv = "\x33\x33\x33\x33\x33\x00\x00\x00"
+ "\x00\x00\x00\x00\x00\x00\x00\x00",
+ .ptext = "\x44\x44\x44\x44\x44\x44\x44\x44"
+ "\x44\x44\x44\x44\x44\x44\x44\x44"
+ "\x44\x44\x44\x44\x44\x44\x44\x44"
+ "\x44\x44\x44\x44\x44\x44\x44\x44",
+ .ctext = "\x7f\x76\x08\x8e\xff\xad\xf7\x0c"
+ "\x02\xea\x9f\x95\xda\x06\x28\xd3"
+ "\x51\xbf\xcb\x9e\xac\x05\x63\xbc"
+ "\xf1\x7b\x71\x0d\xab\x0a\x98\x26",
+ .len = 32,
+ }, {
+ .key = "\x27\x18\x28\x18\x28\x45\x90\x45"
+ "\x23\x53\x60\x28\x74\x71\x35\x26"
+ "\x31\x41\x59\x26\x53\x58\x97\x93"
+ "\x23\x84\x62\x64\x33\x83\x27\x95",
+ .klen = 32,
+ .iv = "\x00\x00\x00\x00\x00\x00\x00\x00"
+ "\x00\x00\x00\x00\x00\x00\x00\x00",
+ .ptext = "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+ "\x10\x11\x12\x13\x14\x15\x16\x17"
+ "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+ "\x20\x21\x22\x23\x24\x25\x26\x27"
+ "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+ "\x30\x31\x32\x33\x34\x35\x36\x37"
+ "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+ "\x40\x41\x42\x43\x44\x45\x46\x47"
+ "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+ "\x50\x51\x52\x53\x54\x55\x56\x57"
+ "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+ "\x60\x61\x62\x63\x64\x65\x66\x67"
+ "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+ "\x70\x71\x72\x73\x74\x75\x76\x77"
+ "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+ "\x80\x81\x82\x83\x84\x85\x86\x87"
+ "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+ "\x90\x91\x92\x93\x94\x95\x96\x97"
+ "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+ "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+ "\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+ "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+ "\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf"
+ "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+ "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf"
+ "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7"
+ "\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf"
+ "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7"
+ "\xe8\xe9\xea\xeb\xec\xed\xee\xef"
+ "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7"
+ "\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff"
+ "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+ "\x10\x11\x12\x13\x14\x15\x16\x17"
+ "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+ "\x20\x21\x22\x23\x24\x25\x26\x27"
+ "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+ "\x30\x31\x32\x33\x34\x35\x36\x37"
+ "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+ "\x40\x41\x42\x43\x44\x45\x46\x47"
+ "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+ "\x50\x51\x52\x53\x54\x55\x56\x57"
+ "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+ "\x60\x61\x62\x63\x64\x65\x66\x67"
+ "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+ "\x70\x71\x72\x73\x74\x75\x76\x77"
+ "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+ "\x80\x81\x82\x83\x84\x85\x86\x87"
+ "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+ "\x90\x91\x92\x93\x94\x95\x96\x97"
+ "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+ "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+ "\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+ "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+ "\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf"
+ "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+ "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf"
+ "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7"
+ "\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf"
+ "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7"
+ "\xe8\xe9\xea\xeb\xec\xed\xee\xef"
+ "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7"
+ "\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff",
+ .ctext = "\x54\xdd\x65\xb6\x32\x6f\xae\xa8"
+ "\xfa\xd1\xa8\x3c\x63\x61\x4a\xf3"
+ "\x9f\x72\x1d\x8d\xfe\x17\x7a\x30"
+ "\xb6\x6a\xbf\x6a\x44\x99\x80\xe1"
+ "\xcd\xbe\x06\xaf\xb7\x33\x36\xf3"
+ "\x7a\x4d\x39\xde\x96\x4a\x30\xd7"
+ "\xd0\x4a\x37\x99\x16\x9c\x60\x25"
+ "\x8f\x6b\x74\x8a\x61\x86\x1a\xa5"
+ "\xec\x92\xa2\xc1\x5b\x2b\x7c\x61"
+ "\x5a\x42\xab\xa4\x99\xbb\xd6\xb7"
+ "\x1d\xb9\xc7\x89\xb2\x18\x20\x89"
+ "\xa2\x5d\xd3\xdf\x80\x0e\xd1\x86"
+ "\x4d\x19\xf7\xed\x45\xfd\x17\xa9"
+ "\x48\x0b\x0f\xb8\x2d\x9b\x7f\xc3"
+ "\xed\x57\xe9\xa1\x14\x0e\xaa\x77"
+ "\x8d\xd2\xdd\x67\x9e\x3e\xdc\x3d"
+ "\xc4\xd5\x5c\x95\x0e\xbc\x53\x1d"
+ "\x95\x92\xf7\xc4\x63\x82\x56\xd5"
+ "\x65\x18\x29\x2a\x20\xaf\x98\xfd"
+ "\xd3\xa6\x36\x00\x35\x0a\x70\xab"
+ "\x5a\x40\xf4\xc2\x85\x03\x7c\xa0"
+ "\x1f\x25\x1f\x19\xec\xae\x03\x29"
+ "\xff\x77\xad\x88\xcd\x5a\x4c\xde"
+ "\xa2\xae\xab\xc2\x21\x48\xff\xbd"
+ "\x23\x9b\xd1\x05\x15\xbd\xe1\x13"
+ "\x1d\xec\x84\x04\xe4\x43\xdc\x76"
+ "\x31\x40\xd5\xf2\x2b\xf3\x3e\x0c"
+ "\x68\x72\xd6\xb8\x1d\x63\x0f\x6f"
+ "\x00\xcd\xd0\x58\xfe\x80\xf9\xcb"
+ "\xfb\x77\x70\x7f\x93\xce\xe2\xca"
+ "\x92\xb9\x15\xb8\x30\x40\x27\xc1"
+ "\x90\xa8\x4e\x2d\x65\xe0\x18\xcc"
+ "\x6a\x38\x7d\x37\x66\xac\xdb\x28"
+ "\x25\x32\x84\xe8\xdb\x9a\xcf\x8f"
+ "\x52\x28\x0d\xdc\x6d\x00\x33\xd2"
+ "\xcc\xaa\xa4\xf9\xae\xff\x12\x36"
+ "\x69\xbc\x02\x4f\xd6\x76\x8e\xdf"
+ "\x8b\xc1\xf8\xd6\x22\xc1\x9c\x60"
+ "\x9e\xf9\x7f\x60\x91\x90\xcd\x11"
+ "\x02\x41\xe7\xfb\x08\x4e\xd8\x94"
+ "\x2d\xa1\xf9\xb9\xcf\x1b\x51\x4b"
+ "\x61\xa3\x88\xb3\x0e\xa6\x1a\x4a"
+ "\x74\x5b\x38\x1e\xe7\xad\x6c\x4d"
+ "\xb1\x27\x54\x53\xb8\x41\x3f\x98"
+ "\xdf\x6e\x4a\x40\x98\x6e\xe4\xb5"
+ "\x9a\xf5\xdf\xae\xcd\x30\x12\x65"
+ "\x17\x90\x67\xa0\x0d\x7c\xa3\x5a"
+ "\xb9\x5a\xbd\x61\x7a\xde\xa2\x8e"
+ "\xc1\xc2\x6a\x97\xde\x28\xb8\xbf"
+ "\xe3\x01\x20\xd6\xae\xfb\xd2\x58"
+ "\xc5\x9e\x42\xd1\x61\xe8\x06\x5a"
+ "\x78\x10\x6b\xdc\xa5\xcd\x90\xfb"
+ "\x3a\xac\x4e\x93\x86\x6c\x8a\x7f"
+ "\x96\x76\x86\x0a\x79\x14\x5b\xd9"
+ "\x2e\x02\xe8\x19\xa9\x0b\xe0\xb9"
+ "\x7c\xc5\x22\xb3\x21\x06\x85\x6f"
+ "\xdf\x0e\x54\xd8\x8e\x46\x24\x15"
+ "\x5a\x2f\x1c\x14\xea\xea\xa1\x63"
+ "\xf8\x58\xe9\x9a\x80\x6e\x79\x1a"
+ "\xcd\x82\xf1\xb0\xe2\x9f\x00\x28"
+ "\xa4\xc3\x8e\x97\x6f\x57\x1a\x93"
+ "\xf4\xfd\x57\xd7\x87\xc2\x4d\xb0"
+ "\xe0\x1c\xa3\x04\xe5\xa5\xc4\xdd"
+ "\x50\xcf\x8b\xdb\xf4\x91\xe5\x7c",
+ .len = 512,
+ }, {
+ .key = "\x62\x49\x77\x57\x24\x70\x93\x69"
+ "\x99\x59\x57\x49\x66\x96\x76\x27"
+ "\x02\x88\x41\x97\x16\x93\x99\x37"
+ "\x51\x05\x82\x09\x74\x94\x45\x92",
+ .klen = 32,
+ .iv = "\xff\x00\x00\x00\x00\x00\x00\x00"
+ "\x00\x00\x00\x00\x00\x00\x00\x00",
+ .ptext = "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+ "\x10\x11\x12\x13\x14\x15\x16\x17"
+ "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+ "\x20\x21\x22\x23\x24\x25\x26\x27"
+ "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+ "\x30\x31\x32\x33\x34\x35\x36\x37"
+ "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+ "\x40\x41\x42\x43\x44\x45\x46\x47"
+ "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+ "\x50\x51\x52\x53\x54\x55\x56\x57"
+ "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+ "\x60\x61\x62\x63\x64\x65\x66\x67"
+ "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+ "\x70\x71\x72\x73\x74\x75\x76\x77"
+ "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+ "\x80\x81\x82\x83\x84\x85\x86\x87"
+ "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+ "\x90\x91\x92\x93\x94\x95\x96\x97"
+ "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+ "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+ "\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+ "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+ "\xf8\xf9\xfa\xfb\xfc",
+ .ctext = "\xa2\x9f\x9e\x4e\x71\xdb\x28\x3c"
+ "\x80\x0e\xf6\xb7\x8e\x57\x1c\xba"
+ "\x90\xda\x3b\x6c\x22\x00\x68\x30"
+ "\x1d\x63\x0d\x9e\x6a\xad\x37\x55"
+ "\xbc\x77\x1e\xc9\xad\x83\x30\xd5"
+ "\x27\xb2\x66\x77\x18\x3c\xa6\x39"
+ "\x9c\x0a\xaa\x1f\x02\xe1\xd5\x65"
+ "\x9b\x8d\xc5\x97\x3d\xc5\x04\x53"
+ "\x78\x00\xe3\xb0\x1a\x43\x4e\xb7"
+ "\xc4\x9f\x38\xc5\x7b\xa4\x70\x64"
+ "\x78\xe6\x32\xd9\x65\x44\xc5\x64"
+ "\xb8\x42\x35\x99\xff\x66\x75\xb0"
+ "\x22\xd3\x9b\x6e\x8d\xcf\x6a\x24"
+ "\xfd\x92\xb7\x1b\x04\x28\x2a\x61"
+ "\xdc\x96\x2a\x20\x7a\x2c\xf1\xf9"
+ "\x12\x15\xf0\x4d\xcf\x2b\xde\x33"
+ "\x41\xbc\xe7\x85\x87\x22\xb7\x16"
+ "\x02\x1c\xd8\xa2\x0f\x1f\xa3\xe9"
+ "\xd8\x45\x48\xe7\xbe\x08\x4e\x4e"
+ "\x23\x79\x84\xdb\x40\x76\xf5\x13"
+ "\x78\x92\x4a\x2f\xf9\x1b\xf2\x80"
+ "\x25\x74\x51\x45\x9a\x77\x78\x97"
+ "\xd3\xe0\xc7\xc4\x35\x67\x2a\xe6"
+ "\xb3\x0d\x62\x9f\x8b",
+ .len = 189,
+ },
+};
+
static const struct aead_testvec sm4_gcm_tv_template[] = {
{ /* From https://datatracker.ietf.org/doc/html/rfc8998#appendix-A.1 */
.key = "\x01\x23\x45\x67\x89\xAB\xCD\xEF"
@@ -14913,6 +15444,298 @@ static const struct aead_testvec sm4_gcm_tv_template[] = {
"\x83\xDE\x35\x41\xE4\xC2\xB5\x81"
"\x77\xE0\x65\xA9\xBF\x7B\x62\xEC",
.clen = 80,
+ }, { /* Generated from AES-GCM test vectors */
+ .key = zeroed_string,
+ .klen = 16,
+ .ctext = "\x23\x2f\x0c\xfe\x30\x8b\x49\xea"
+ "\x6f\xc8\x82\x29\xb5\xdc\x85\x8d",
+ .clen = 16,
+ }, {
+ .key = zeroed_string,
+ .klen = 16,
+ .ptext = zeroed_string,
+ .plen = 16,
+ .ctext = "\x7d\xe2\xaa\x7f\x11\x10\x18\x82"
+ "\x18\x06\x3b\xe1\xbf\xeb\x6d\x89"
+ "\xb8\x51\xb5\xf3\x94\x93\x75\x2b"
+ "\xe5\x08\xf1\xbb\x44\x82\xc5\x57",
+ .clen = 32,
+ }, {
+ .key = "\xfe\xff\xe9\x92\x86\x65\x73\x1c"
+ "\x6d\x6a\x8f\x94\x67\x30\x83\x08",
+ .klen = 16,
+ .iv = "\xca\xfe\xba\xbe\xfa\xce\xdb\xad"
+ "\xde\xca\xf8\x88",
+ .ptext = "\xd9\x31\x32\x25\xf8\x84\x06\xe5"
+ "\xa5\x59\x09\xc5\xaf\xf5\x26\x9a"
+ "\x86\xa7\xa9\x53\x15\x34\xf7\xda"
+ "\x2e\x4c\x30\x3d\x8a\x31\x8a\x72"
+ "\x1c\x3c\x0c\x95\x95\x68\x09\x53"
+ "\x2f\xcf\x0e\x24\x49\xa6\xb5\x25"
+ "\xb1\x6a\xed\xf5\xaa\x0d\xe6\x57"
+ "\xba\x63\x7b\x39\x1a\xaf\xd2\x55",
+ .plen = 64,
+ .ctext = "\xe4\x11\x0f\xf1\xc1\x41\x97\xe6"
+ "\x76\x21\x6a\x33\x83\x10\x41\xeb"
+ "\x09\x58\x00\x11\x7b\xdc\x3f\x75"
+ "\x1a\x49\x6e\xfc\xf2\xbb\xdf\xdb"
+ "\x3a\x2e\x13\xfd\xc5\xc1\x9d\x07"
+ "\x1a\xe5\x48\x3f\xed\xde\x98\x5d"
+ "\x3f\x2d\x5b\x4e\xee\x0b\xb6\xdf"
+ "\xe3\x63\x36\x83\x23\xf7\x5b\x80"
+ "\x7d\xfe\x77\xef\x71\xb1\x5e\xc9"
+ "\x52\x6b\x09\xab\x84\x28\x4b\x8a",
+ .clen = 80,
+ }, {
+ .key = "\xfe\xff\xe9\x92\x86\x65\x73\x1c"
+ "\x6d\x6a\x8f\x94\x67\x30\x83\x08",
+ .klen = 16,
+ .iv = "\xca\xfe\xba\xbe\xfa\xce\xdb\xad"
+ "\xde\xca\xf8\x88",
+ .ptext = "\xd9\x31\x32\x25\xf8\x84\x06\xe5"
+ "\xa5\x59\x09\xc5\xaf\xf5\x26\x9a"
+ "\x86\xa7\xa9\x53\x15\x34\xf7\xda"
+ "\x2e\x4c\x30\x3d\x8a\x31\x8a\x72"
+ "\x1c\x3c\x0c\x95\x95\x68\x09\x53"
+ "\x2f\xcf\x0e\x24\x49\xa6\xb5\x25"
+ "\xb1\x6a\xed\xf5\xaa\x0d\xe6\x57"
+ "\xba\x63\x7b\x39",
+ .plen = 60,
+ .assoc = "\xfe\xed\xfa\xce\xde\xad\xbe\xef"
+ "\xfe\xed\xfa\xce\xde\xad\xbe\xef"
+ "\xab\xad\xda\xd2",
+ .alen = 20,
+ .ctext = "\xe4\x11\x0f\xf1\xc1\x41\x97\xe6"
+ "\x76\x21\x6a\x33\x83\x10\x41\xeb"
+ "\x09\x58\x00\x11\x7b\xdc\x3f\x75"
+ "\x1a\x49\x6e\xfc\xf2\xbb\xdf\xdb"
+ "\x3a\x2e\x13\xfd\xc5\xc1\x9d\x07"
+ "\x1a\xe5\x48\x3f\xed\xde\x98\x5d"
+ "\x3f\x2d\x5b\x4e\xee\x0b\xb6\xdf"
+ "\xe3\x63\x36\x83"
+ "\x89\xf6\xba\x35\xb8\x18\xd3\xcc"
+ "\x38\x6c\x05\xb3\x8a\xcb\xc9\xde",
+ .clen = 76,
+ }, {
+ .key = "\xfe\xff\xe9\x92\x86\x65\x73\x1c"
+ "\xfe\xff\xe9\x92\x86\x65\x73\x1c",
+ .klen = 16,
+ .iv = "\xca\xfe\xba\xbe\xfa\xce\xdb\xad"
+ "\xde\xca\xf8\x88",
+ .ptext = "\xd9\x31\x32\x25\xf8\x84\x06\xe5"
+ "\xa5\x59\x09\xc5\xaf\xf5\x26\x9a"
+ "\x86\xa7\xa9\x53\x15\x34\xf7\xda"
+ "\x2e\x4c\x30\x3d\x8a\x31\x8a\x72"
+ "\x1c\x3c\x0c\x95\x95\x68\x09\x53"
+ "\x2f\xcf\x0e\x24\x49\xa6\xb5\x25"
+ "\xb1\x6a\xed\xf5\xaa\x0d\xe6\x57"
+ "\xba\x63\x7b\x39",
+ .plen = 60,
+ .assoc = "\xfe\xed\xfa\xce\xde\xad\xbe\xef"
+ "\xfe\xed\xfa\xce\xde\xad\xbe\xef"
+ "\xab\xad\xda\xd2",
+ .alen = 20,
+ .ctext = "\xc1\x11\x44\x51\xd9\x25\x87\x5b"
+ "\x0f\xd9\x06\xf3\x33\x44\xbb\x87"
+ "\x8b\xa3\x77\xd2\x0c\x60\xfa\xcc"
+ "\x85\x50\x6f\x96\x0c\x54\x54\xc1"
+ "\x58\x04\x88\x6e\xf4\x26\x35\x7e"
+ "\x94\x80\x48\x6c\xf2\xf4\x88\x1f"
+ "\x19\x63\xea\xae\xba\x81\x1a\x5d"
+ "\x0e\x6f\x59\x08"
+ "\x33\xac\x5b\xa8\x19\x60\xdb\x1d"
+ "\xdd\x2e\x22\x2e\xe0\x87\x51\x5d",
+ .clen = 76,
+ }, {
+ .key = "\x8b\x32\xcf\xe7\x44\xed\x13\x59"
+ "\x04\x38\x77\xb0\xb9\xad\xb4\x38",
+ .klen = 16,
+ .iv = "\x00\xff\xff\xff\xff\x00\x00\xff"
+ "\xff\xff\x00\xff",
+ .ptext = "\x42\xc1\xcc\x08\x48\x6f\x41\x3f"
+ "\x2f\x11\x66\x8b\x2a\x16\xf0\xe0"
+ "\x58\x83\xf0\xc3\x70\x14\xc0\x5b"
+ "\x3f\xec\x1d\x25\x3c\x51\xd2\x03"
+ "\xcf\x59\x74\x1f\xb2\x85\xb4\x07"
+ "\xc6\x6a\x63\x39\x8a\x5b\xde\xcb"
+ "\xaf\x08\x44\xbd\x6f\x91\x15\xe1"
+ "\xf5\x7a\x6e\x18\xbd\xdd\x61\x50"
+ "\x59\xa9\x97\xab\xbb\x0e\x74\x5c"
+ "\x00\xa4\x43\x54\x04\x54\x9b\x3b"
+ "\x77\xec\xfd\x5c\xa6\xe8\x7b\x08"
+ "\xae\xe6\x10\x3f\x32\x65\xd1\xfc"
+ "\xa4\x1d\x2c\x31\xfb\x33\x7a\xb3"
+ "\x35\x23\xf4\x20\x41\xd4\xad\x82"
+ "\x8b\xa4\xad\x96\x1c\x20\x53\xbe"
+ "\x0e\xa6\xf4\xdc\x78\x49\x3e\x72"
+ "\xb1\xa9\xb5\x83\xcb\x08\x54\xb7"
+ "\xad\x49\x3a\xae\x98\xce\xa6\x66"
+ "\x10\x30\x90\x8c\x55\x83\xd7\x7c"
+ "\x8b\xe6\x53\xde\xd2\x6e\x18\x21"
+ "\x01\x52\xd1\x9f\x9d\xbb\x9c\x73"
+ "\x57\xcc\x89\x09\x75\x9b\x78\x70"
+ "\xed\x26\x97\x4d\xb4\xe4\x0c\xa5"
+ "\xfa\x70\x04\x70\xc6\x96\x1c\x7d"
+ "\x54\x41\x77\xa8\xe3\xb0\x7e\x96"
+ "\x82\xd9\xec\xa2\x87\x68\x55\xf9"
+ "\x8f\x9e\x73\x43\x47\x6a\x08\x36"
+ "\x93\x67\xa8\x2d\xde\xac\x41\xa9"
+ "\x5c\x4d\x73\x97\x0f\x70\x68\xfa"
+ "\x56\x4d\x00\xc2\x3b\x1f\xc8\xb9"
+ "\x78\x1f\x51\x07\xe3\x9a\x13\x4e"
+ "\xed\x2b\x2e\xa3\xf7\x44\xb2\xe7"
+ "\xab\x19\x37\xd9\xba\x76\x5e\xd2"
+ "\xf2\x53\x15\x17\x4c\x6b\x16\x9f"
+ "\x02\x66\x49\xca\x7c\x91\x05\xf2"
+ "\x45\x36\x1e\xf5\x77\xad\x1f\x46"
+ "\xa8\x13\xfb\x63\xb6\x08\x99\x63"
+ "\x82\xa2\xed\xb3\xac\xdf\x43\x19"
+ "\x45\xea\x78\x73\xd9\xb7\x39\x11"
+ "\xa3\x13\x7c\xf8\x3f\xf7\xad\x81"
+ "\x48\x2f\xa9\x5c\x5f\xa0\xf0\x79"
+ "\xa4\x47\x7d\x80\x20\x26\xfd\x63"
+ "\x0a\xc7\x7e\x6d\x75\x47\xff\x76"
+ "\x66\x2e\x8a\x6c\x81\x35\xaf\x0b"
+ "\x2e\x6a\x49\x60\xc1\x10\xe1\xe1"
+ "\x54\x03\xa4\x09\x0c\x37\x7a\x15"
+ "\x23\x27\x5b\x8b\x4b\xa5\x64\x97"
+ "\xae\x4a\x50\x73\x1f\x66\x1c\x5c"
+ "\x03\x25\x3c\x8d\x48\x58\x71\x34"
+ "\x0e\xec\x4e\x55\x1a\x03\x6a\xe5"
+ "\xb6\x19\x2b\x84\x2a\x20\xd1\xea"
+ "\x80\x6f\x96\x0e\x05\x62\xc7\x78"
+ "\x87\x79\x60\x38\x46\xb4\x25\x57"
+ "\x6e\x16\x63\xf8\xad\x6e\xd7\x42"
+ "\x69\xe1\x88\xef\x6e\xd5\xb4\x9a"
+ "\x3c\x78\x6c\x3b\xe5\xa0\x1d\x22"
+ "\x86\x5c\x74\x3a\xeb\x24\x26\xc7"
+ "\x09\xfc\x91\x96\x47\x87\x4f\x1a"
+ "\xd6\x6b\x2c\x18\x47\xc0\xb8\x24"
+ "\xa8\x5a\x4a\x9e\xcb\x03\xe7\x2a"
+ "\x09\xe6\x4d\x9c\x6d\x86\x60\xf5"
+ "\x2f\x48\x69\x37\x9f\xf2\xd2\xcb"
+ "\x0e\x5a\xdd\x6e\x8a\xfb\x6a\xfe"
+ "\x0b\x63\xde\x87\x42\x79\x8a\x68"
+ "\x51\x28\x9b\x7a\xeb\xaf\xb8\x2f"
+ "\x9d\xd1\xc7\x45\x90\x08\xc9\x83"
+ "\xe9\x83\x84\xcb\x28\x69\x09\x69"
+ "\xce\x99\x46\x00\x54\xcb\xd8\x38"
+ "\xf9\x53\x4a\xbf\x31\xce\x57\x15"
+ "\x33\xfa\x96\x04\x33\x42\xe3\xc0"
+ "\xb7\x54\x4a\x65\x7a\x7c\x02\xe6"
+ "\x19\x95\xd0\x0e\x82\x07\x63\xf9"
+ "\xe1\x2b\x2a\xfc\x55\x92\x52\xc9"
+ "\xb5\x9f\x23\x28\x60\xe7\x20\x51"
+ "\x10\xd3\xed\x6d\x9b\xab\xb8\xe2"
+ "\x5d\x9a\x34\xb3\xbe\x9c\x64\xcb"
+ "\x78\xc6\x91\x22\x40\x91\x80\xbe"
+ "\xd7\x78\x5c\x0e\x0a\xdc\x08\xe9"
+ "\x67\x10\xa4\x83\x98\x79\x23\xe7"
+ "\x92\xda\xa9\x22\x16\xb1\xe7\x78"
+ "\xa3\x1c\x6c\x8f\x35\x7c\x4d\x37"
+ "\x2f\x6e\x0b\x50\x5c\x34\xb9\xf9"
+ "\xe6\x3d\x91\x0d\x32\x95\xaa\x3d"
+ "\x48\x11\x06\xbb\x2d\xf2\x63\x88"
+ "\x3f\x73\x09\xe2\x45\x56\x31\x51"
+ "\xfa\x5e\x4e\x62\xf7\x90\xf9\xa9"
+ "\x7d\x7b\x1b\xb1\xc8\x26\x6e\x66"
+ "\xf6\x90\x9a\x7f\xf2\x57\xcc\x23"
+ "\x59\xfa\xfa\xaa\x44\x04\x01\xa7"
+ "\xa4\x78\xdb\x74\x3d\x8b\xb5",
+ .plen = 719,
+ .ctext = "\xdc\xb1\x0f\x2a\xe8\x2d\x1c\x57"
+ "\xc4\x82\xfa\xd6\x87\xe6\x2f\x50"
+ "\xbd\x9e\x0a\x42\x31\xf2\xc7\xbb"
+ "\x21\x63\xa7\x05\x43\x33\xef\x33"
+ "\x5c\xd3\x47\x55\xce\x5c\xe4\xd4"
+ "\xe5\x07\x62\x22\xac\x01\xa8\x35"
+ "\x9c\x59\x34\x30\x8e\xff\x9f\xb4"
+ "\xd2\x4e\x74\x90\x64\xf2\x78\x5e"
+ "\x63\xb7\xc5\x08\x1b\x37\xa5\x9e"
+ "\xc0\xde\xff\xa9\x7f\x0b\xd3\x02"
+ "\x83\x6e\x33\xfa\x43\x11\xd3\xda"
+ "\x02\xcf\xcd\x4a\xc0\x78\x1f\x39"
+ "\x62\xcb\xa3\x95\x7e\x13\x92\x28"
+ "\xb2\xc4\x7a\xba\xd1\xc6\xf6\x1f"
+ "\xda\x0b\xf1\xd1\x99\x54\xd8\x3b"
+ "\x16\xf8\xe6\x97\x1e\xa7\xcf\x49"
+ "\x69\x84\x01\x4c\xdc\x7a\x34\xff"
+ "\x01\x08\xa3\x0b\x39\xac\x21\x37"
+ "\xd8\xb4\x04\x19\x8b\x7a\x7d\x17"
+ "\x44\xd1\x18\xaf\x1f\xa9\x29\xfe"
+ "\xfa\x77\xe0\x40\x42\x0c\x79\xb7"
+ "\xc3\x15\x1b\xd9\x0c\x82\xfc\x16"
+ "\x70\xd6\x2a\xe9\x94\x72\xc5\xa5"
+ "\x8a\x58\xbc\xfa\xe0\x88\x39\x4a"
+ "\x80\xe8\xec\xaf\x60\xac\xe7\xf8"
+ "\x9c\xf0\xfc\x61\x39\x07\x98\x6b"
+ "\x88\xe3\x98\x22\x28\x18\x4a\x2d"
+ "\x25\xef\x10\xe3\x83\x66\x3f\xfd"
+ "\xc7\x0b\xa3\xfd\x97\xa9\xf4\xbd"
+ "\xd8\x2a\xee\x4a\x50\xad\xcc\xb5"
+ "\xc7\xab\xb8\x79\x9c\xd1\xf1\x27"
+ "\x08\xf5\xf5\xe8\x1b\x66\xce\x41"
+ "\x56\x60\x94\x86\xf0\x78\xc2\xfa"
+ "\x5b\x63\x40\xb1\xd1\x1a\x38\x69"
+ "\x0b\x8c\xb2\xf5\xa2\xbe\x90\x9d"
+ "\x46\x23\x79\x8b\x3b\x4a\xf4\xbb"
+ "\x55\xf7\x58\x9d\xaf\x59\xff\x74"
+ "\xf3\xb9\xc4\x26\xb1\xf8\xe1\x28"
+ "\x8b\x5e\x8f\x6d\x64\xe7\xe8\x63"
+ "\xd2\x9e\xcb\xee\xae\x19\x04\x1d"
+ "\x05\xf0\x9d\x99\x7b\x33\x33\xae"
+ "\x6e\xe5\x09\xdd\x67\x51\xc4\xc8"
+ "\x6a\xc7\x36\x35\xc9\x93\x76\xa1"
+ "\xa8\x1c\xfa\x75\x92\x34\x0e\x7d"
+ "\x3d\x1d\xef\x00\xfd\xa5\x25\x12"
+ "\x7c\x91\x21\x41\xcc\x50\x47\xa9"
+ "\x22\x50\x24\x96\x34\x79\x3d\xe8"
+ "\x3f\xa0\x56\xaf\x98\x53\x55\xc3"
+ "\x46\x1b\x17\x54\xb8\xb0\xb7\xe0"
+ "\xe0\xab\x47\x6f\x06\xda\xcc\x75"
+ "\xa7\x96\xb7\x92\xf3\xa0\x5f\xe6"
+ "\xba\x97\xe3\x2f\x97\x05\xb2\x99"
+ "\xa0\x09\x10\x98\x9c\xd3\x2e\xd1"
+ "\x7e\x2a\x30\x54\x3c\xb9\x33\xe3"
+ "\xf2\xaf\xd3\xa5\xee\xd0\x0b\x8a"
+ "\x19\x54\x0f\x02\x51\x1f\x91\xdf"
+ "\x71\x9c\xad\x77\x35\x28\x55\x6d"
+ "\xcd\x7a\xd9\xa3\x41\x98\x6b\x37"
+ "\x19\x0f\xbe\xae\x69\xb2\x25\x01"
+ "\xee\x0e\x51\x4b\x53\xea\x0f\x5f"
+ "\x85\x74\x79\x36\x32\x0a\x2a\x40"
+ "\xad\x6b\x78\x41\x54\x99\xe9\xc1"
+ "\x2b\x6c\x9b\x42\x21\xef\xe2\x50"
+ "\x56\x8d\x78\xdf\x58\xbe\x0a\x0f"
+ "\xfc\xfc\x0d\x2e\xd0\xcb\xa6\x0a"
+ "\xa8\xd9\x1e\xa9\xd4\x7c\x99\x88"
+ "\xcf\x11\xad\x1c\xd3\x04\x63\x55"
+ "\xef\x85\x0b\x69\xa1\x40\xf1\x75"
+ "\x24\xf4\xe5\x2c\xd4\x7a\x24\x50"
+ "\x8f\xa2\x71\xc9\x92\x20\xcd\xcf"
+ "\xda\x40\xbe\xf6\xfe\x1a\xca\xc7"
+ "\x4a\x80\x45\x55\xcb\xdd\xb7\x01"
+ "\xb0\x8d\xcb\xd2\xae\xbd\xa4\xd0"
+ "\x5c\x10\x05\x66\x7b\xd4\xff\xd9"
+ "\xc4\x23\x9d\x8d\x6b\x24\xf8\x3f"
+ "\x73\x4d\x5c\x2b\x33\x4c\x5e\x63"
+ "\x74\x6d\x03\xa1\x7a\x35\x65\x17"
+ "\x38\x7f\x3b\xc1\x69\xcf\x61\x34"
+ "\x30\x21\xaf\x97\x47\x12\x3f\xa1"
+ "\xa7\x50\xc5\x87\xfb\x3f\x70\x32"
+ "\x86\x17\x5f\x25\xe4\x74\xc6\xd0"
+ "\x9b\x39\xe6\xe1\x5a\xec\x8f\x40"
+ "\xce\xcc\x37\x3b\xd8\x72\x1c\x31"
+ "\x75\xa4\xa6\x89\x8c\xdd\xd6\xd2"
+ "\x32\x3d\xe8\xc3\x54\xab\x1f\x35"
+ "\x52\xb4\x94\x81\xb0\x37\x3a\x03"
+ "\xbb\xb1\x99\x30\xa5\xf8\x21\xcd"
+ "\x93\x5d\xa7\x13\xed\xc7\x49\x09"
+ "\x70\xda\x08\x39\xaa\x15\x9e\x45"
+ "\x35\x2b\x0f\x5c\x8c\x8b\xc9"
+ "\xa8\xb8\x9f\xfd\x37\x36\x31\x7e"
+ "\x34\x4f\xc1\xc0\xca\x8a\x22\xfd",
+ .clen = 735,
}
};
@@ -14947,6 +15770,282 @@ static const struct aead_testvec sm4_ccm_tv_template[] = {
"\x16\x84\x2D\x4F\xA1\x86\xF5\x6A"
"\xB3\x32\x56\x97\x1F\xA1\x10\xF4",
.clen = 80,
+ }, { /* Generated from AES-CCM test vectors */
+ .key = "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+ "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf",
+ .klen = 16,
+ .iv = "\x01\x00\x00\x00\x03\x02\x01\x00"
+ "\xa0\xa1\xa2\xa3\xa4\xa5\x00\x00",
+ .assoc = "\x00\x01\x02\x03\x04\x05\x06\x07",
+ .alen = 8,
+ .ptext = "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+ "\x10\x11\x12\x13\x14\x15\x16\x17"
+ "\x18\x19\x1a\x1b\x1c\x1d\x1e",
+ .plen = 23,
+ .ctext = "\x7b\xff\x4a\x15\xf5\x73\xce\x82"
+ "\x6e\xc2\x31\x1d\xe2\x53\x02\xac"
+ "\xa4\x48\xf9\xe4\xf5\x1f\x81\x70"
+ "\x18\xbc\xb6\x84\x01\xb8\xae",
+ .clen = 31,
+ }, {
+ .key = "\xf4\x6b\xc2\x75\x62\xfe\xb4\xe1"
+ "\x53\x14\x73\x66\x8d\x88\xf6\x80",
+ .klen = 16,
+ .iv = "\x03\xa0\x20\x35\x26\xf2\x21\x8d"
+ "\x50\x20\xda\xe2\x00\x00\x00\x00",
+ .assoc = "\x5b\x9e\x13\x67\x02\x5e\xef\xc1"
+ "\x6c\xf9\xd7\x1e\x52\x8f\x7a\x47"
+ "\xe9\xd4\xcf\x20\x14\x6e\xf0\x2d"
+ "\xd8\x9e\x2b\x56\x10\x23\x56\xe7",
+ .alen = 32,
+ .ctext = "\x23\x58\xce\xdc\x40\xb1\xcd\x92"
+ "\x47\x96\x59\xfc\x8a\x26\x4f\xcf",
+ .clen = 16,
+ }, {
+ .key = "\xab\x2f\x8a\x74\xb7\x1c\xd2\xb1"
+ "\xff\x80\x2e\x48\x7d\x82\xf8\xb9",
+ .klen = 16,
+ .iv = "\x03\xaf\x94\x87\x78\x35\x82\x81"
+ "\x7f\x88\x94\x68\x00\x00\x00\x00",
+ .alen = 0,
+ .ptext = "\x00",
+ .plen = 0,
+ .ctext = "\x72\x7e\xf5\xd6\x39\x7a\x2b\x43",
+ .clen = 8,
+ }, {
+ .key = "\x39\xbb\xa7\xbe\x59\x97\x9e\x73"
+ "\xa4\x48\x93\x39\x26\x71\x4a\xc6",
+ .klen = 16,
+ .iv = "\x03\xee\x49\x83\xe9\xa9\xff\xe9"
+ "\x57\xba\xfd\x9e\x00\x00\x00\x00",
+ .assoc = "\x44\xa6\x2c\x05\xe9\xe1\x43\xb1"
+ "\x58\x7c\xf2\x5c\x6d\x39\x0a\x64"
+ "\xa4\xf0\x13\x05\xd1\x77\x99\x67"
+ "\x11\xc4\xc6\xdb\x00\x56\x36\x61",
+ .alen = 32,
+ .ptext = "\x00",
+ .plen = 0,
+ .ctext = "\xb0\x9d\xc6\xfb\x7d\xb5\xa1\x0e",
+ .clen = 8,
+ }, {
+ .key = "\x58\x5d\xa0\x96\x65\x1a\x04\xd7"
+ "\x0d\x1a\x53\x3b\xb5\xe3\xf8\x8b",
+ .klen = 16,
+ .iv = "\x03\xcf\x76\x3f\xd9\x95\x75\x8f"
+ "\x44\x89\x40\x7b\x00\x00\x00\x00",
+ .assoc = "\x8f\x86\x6c\x4d\x1d\xc5\x39\x88"
+ "\xc8\xf3\x5c\x52\x10\x63\x6f\x2b"
+ "\x8a\x2a\xc5\x6f\x30\x23\x58\x7b"
+ "\xfb\x36\x03\x11\xb4\xd9\xf2\xfe",
+ .alen = 32,
+ .ptext = "\xc2\x54\xc8\xde\x78\x87\x77\x40"
+ "\x49\x71\xe4\xb7\xe7\xcb\x76\x61"
+ "\x0a\x41\xb9\xe9\xc0\x76\x54\xab"
+ "\x04\x49\x3b\x19\x93\x57\x25\x5d",
+ .plen = 32,
+ .ctext = "\xc9\xae\xef\x1d\xf3\x2c\xd3\x38"
+ "\xc9\x7f\x7e\x28\xe8\xaa\xb3\x60"
+ "\x49\xdc\x66\xca\x7b\x3d\xe0\x3c"
+ "\xcb\x45\x9c\x1b\xb2\xbe\x07\x90"
+ "\x87\xa6\x6b\x89\x0d\x0f\x90\xaa"
+ "\x7d\xf6\x5a\x9a\x68\x2b\x81\x92",
+ .clen = 48,
+ }, {
+ .key = "\x8b\x32\xcf\xe7\x44\xed\x13\x59"
+ "\x04\x38\x77\xb0\xb9\xad\xb4\x38",
+ .klen = 16,
+ .iv = "\x02\xff\xff\xff\xff\x00\x00\xff"
+ "\xff\xff\x00\xff\xff\x00\x00\x00",
+ .assoc = "\x8f\x86\x6c\x4d\x1d\xc5\x39\x88"
+ "\xc8\xf3\x5c\x52\x10\x63\x6f\x2b"
+ "\x8a\x2a\xc5\x6f\x30\x23\x58\x7b"
+ "\xfb\x36\x03\x11\xb4\xd9\xf2\xfe"
+ "\xc8\xf3\x5c\x52\x10\x63",
+ .alen = 38,
+ .ptext = "\x42\xc1\xcc\x08\x48\x6f\x41\x3f"
+ "\x2f\x11\x66\x8b\x2a\x16\xf0\xe0"
+ "\x58\x83\xf0\xc3\x70\x14\xc0\x5b"
+ "\x3f\xec\x1d\x25\x3c\x51\xd2\x03"
+ "\xcf\x59\x74\x1f\xb2\x85\xb4\x07"
+ "\xc6\x6a\x63\x39\x8a\x5b\xde\xcb"
+ "\xaf\x08\x44\xbd\x6f\x91\x15\xe1"
+ "\xf5\x7a\x6e\x18\xbd\xdd\x61\x50"
+ "\x59\xa9\x97\xab\xbb\x0e\x74\x5c"
+ "\x00\xa4\x43\x54\x04\x54\x9b\x3b"
+ "\x77\xec\xfd\x5c\xa6\xe8\x7b\x08"
+ "\xae\xe6\x10\x3f\x32\x65\xd1\xfc"
+ "\xa4\x1d\x2c\x31\xfb\x33\x7a\xb3"
+ "\x35\x23\xf4\x20\x41\xd4\xad\x82"
+ "\x8b\xa4\xad\x96\x1c\x20\x53\xbe"
+ "\x0e\xa6\xf4\xdc\x78\x49\x3e\x72"
+ "\xb1\xa9\xb5\x83\xcb\x08\x54\xb7"
+ "\xad\x49\x3a\xae\x98\xce\xa6\x66"
+ "\x10\x30\x90\x8c\x55\x83\xd7\x7c"
+ "\x8b\xe6\x53\xde\xd2\x6e\x18\x21"
+ "\x01\x52\xd1\x9f\x9d\xbb\x9c\x73"
+ "\x57\xcc\x89\x09\x75\x9b\x78\x70"
+ "\xed\x26\x97\x4d\xb4\xe4\x0c\xa5"
+ "\xfa\x70\x04\x70\xc6\x96\x1c\x7d"
+ "\x54\x41\x77\xa8\xe3\xb0\x7e\x96"
+ "\x82\xd9\xec\xa2\x87\x68\x55\xf9"
+ "\x8f\x9e\x73\x43\x47\x6a\x08\x36"
+ "\x93\x67\xa8\x2d\xde\xac\x41\xa9"
+ "\x5c\x4d\x73\x97\x0f\x70\x68\xfa"
+ "\x56\x4d\x00\xc2\x3b\x1f\xc8\xb9"
+ "\x78\x1f\x51\x07\xe3\x9a\x13\x4e"
+ "\xed\x2b\x2e\xa3\xf7\x44\xb2\xe7"
+ "\xab\x19\x37\xd9\xba\x76\x5e\xd2"
+ "\xf2\x53\x15\x17\x4c\x6b\x16\x9f"
+ "\x02\x66\x49\xca\x7c\x91\x05\xf2"
+ "\x45\x36\x1e\xf5\x77\xad\x1f\x46"
+ "\xa8\x13\xfb\x63\xb6\x08\x99\x63"
+ "\x82\xa2\xed\xb3\xac\xdf\x43\x19"
+ "\x45\xea\x78\x73\xd9\xb7\x39\x11"
+ "\xa3\x13\x7c\xf8\x3f\xf7\xad\x81"
+ "\x48\x2f\xa9\x5c\x5f\xa0\xf0\x79"
+ "\xa4\x47\x7d\x80\x20\x26\xfd\x63"
+ "\x0a\xc7\x7e\x6d\x75\x47\xff\x76"
+ "\x66\x2e\x8a\x6c\x81\x35\xaf\x0b"
+ "\x2e\x6a\x49\x60\xc1\x10\xe1\xe1"
+ "\x54\x03\xa4\x09\x0c\x37\x7a\x15"
+ "\x23\x27\x5b\x8b\x4b\xa5\x64\x97"
+ "\xae\x4a\x50\x73\x1f\x66\x1c\x5c"
+ "\x03\x25\x3c\x8d\x48\x58\x71\x34"
+ "\x0e\xec\x4e\x55\x1a\x03\x6a\xe5"
+ "\xb6\x19\x2b\x84\x2a\x20\xd1\xea"
+ "\x80\x6f\x96\x0e\x05\x62\xc7\x78"
+ "\x87\x79\x60\x38\x46\xb4\x25\x57"
+ "\x6e\x16\x63\xf8\xad\x6e\xd7\x42"
+ "\x69\xe1\x88\xef\x6e\xd5\xb4\x9a"
+ "\x3c\x78\x6c\x3b\xe5\xa0\x1d\x22"
+ "\x86\x5c\x74\x3a\xeb\x24\x26\xc7"
+ "\x09\xfc\x91\x96\x47\x87\x4f\x1a"
+ "\xd6\x6b\x2c\x18\x47\xc0\xb8\x24"
+ "\xa8\x5a\x4a\x9e\xcb\x03\xe7\x2a"
+ "\x09\xe6\x4d\x9c\x6d\x86\x60\xf5"
+ "\x2f\x48\x69\x37\x9f\xf2\xd2\xcb"
+ "\x0e\x5a\xdd\x6e\x8a\xfb\x6a\xfe"
+ "\x0b\x63\xde\x87\x42\x79\x8a\x68"
+ "\x51\x28\x9b\x7a\xeb\xaf\xb8\x2f"
+ "\x9d\xd1\xc7\x45\x90\x08\xc9\x83"
+ "\xe9\x83\x84\xcb\x28\x69\x09\x69"
+ "\xce\x99\x46\x00\x54\xcb\xd8\x38"
+ "\xf9\x53\x4a\xbf\x31\xce\x57\x15"
+ "\x33\xfa\x96\x04\x33\x42\xe3\xc0"
+ "\xb7\x54\x4a\x65\x7a\x7c\x02\xe6"
+ "\x19\x95\xd0\x0e\x82\x07\x63\xf9"
+ "\xe1\x2b\x2a\xfc\x55\x92\x52\xc9"
+ "\xb5\x9f\x23\x28\x60\xe7\x20\x51"
+ "\x10\xd3\xed\x6d\x9b\xab\xb8\xe2"
+ "\x5d\x9a\x34\xb3\xbe\x9c\x64\xcb"
+ "\x78\xc6\x91\x22\x40\x91\x80\xbe"
+ "\xd7\x78\x5c\x0e\x0a\xdc\x08\xe9"
+ "\x67\x10\xa4\x83\x98\x79\x23\xe7"
+ "\x92\xda\xa9\x22\x16\xb1\xe7\x78"
+ "\xa3\x1c\x6c\x8f\x35\x7c\x4d\x37"
+ "\x2f\x6e\x0b\x50\x5c\x34\xb9\xf9"
+ "\xe6\x3d\x91\x0d\x32\x95\xaa\x3d"
+ "\x48\x11\x06\xbb\x2d\xf2\x63\x88"
+ "\x3f\x73\x09\xe2\x45\x56\x31\x51"
+ "\xfa\x5e\x4e\x62\xf7\x90\xf9\xa9"
+ "\x7d\x7b\x1b\xb1\xc8\x26\x6e\x66"
+ "\xf6\x90\x9a\x7f\xf2\x57\xcc\x23"
+ "\x59\xfa\xfa\xaa\x44\x04\x01\xa7"
+ "\xa4\x78\xdb\x74\x3d\x8b\xb5",
+ .plen = 719,
+ .ctext = "\xc5\x50\x85\x02\x72\xa8\xb3\x62"
+ "\xf9\xcd\x77\x7b\x43\xa5\x04\x70"
+ "\x68\x40\x57\x21\x1c\xfe\xef\x05"
+ "\x4d\xb8\x44\xba\x59\xea\x62\x32"
+ "\xcb\x6b\x6a\x39\x9b\xf3\xe5\xa4"
+ "\x36\x38\xde\x7d\xcf\xb6\xcd\xe3"
+ "\x89\xbf\x37\xc9\x96\x3c\x70\x10"
+ "\x92\x47\xcc\xac\x6f\xf8\x55\x9a"
+ "\x26\x43\x34\xb4\x92\x7d\x68\xfc"
+ "\x60\x37\x74\x2a\x55\xba\xc7\xd7"
+ "\x98\x69\xb7\xcf\x42\xfd\xb2\x10"
+ "\xa0\x59\xe1\x2c\x73\x66\x12\x97"
+ "\x85\x8b\x28\xcc\x29\x02\x15\x89"
+ "\x23\xd3\x32\x92\x87\x57\x09\x13"
+ "\x04\x7e\x8b\x6c\x3a\xc1\x4e\x6c"
+ "\xe1\x9f\xc8\xcc\x47\x9c\xd8\x10"
+ "\xf4\xb7\x5c\x30\x7a\x8b\x0f\x01"
+ "\x52\x38\x02\x92\x99\xac\x03\x90"
+ "\x18\x32\x2d\x21\x6a\x0a\x2a\xe7"
+ "\xc2\xcc\x15\x84\x4e\x2b\x0b\x3a"
+ "\x4c\xdc\xb0\x6b\x10\xd1\x27\x10"
+ "\xf0\x4a\x5c\x43\xa0\x34\x34\x59"
+ "\x47\x43\x48\xcb\x69\xa7\xff\x52"
+ "\xb8\xca\x23\x09\x07\xd7\xc5\xe4"
+ "\x2a\x4f\x99\xd5\x83\x36\x2a\x2d"
+ "\x59\xd0\xca\xb0\xfa\x40\x8c\xab"
+ "\xdf\x69\x08\xd9\x79\x1d\xde\xa8"
+ "\x0b\x34\x74\x4d\xf5\xa0\x4c\x81"
+ "\x7f\x93\x06\x40\x24\xfe\x7d\xcd"
+ "\xe4\xfe\xf8\xf8\x30\xce\xd0\x5d"
+ "\x70\xfd\x0d\x5a\x78\x85\x74\x2d"
+ "\xe4\xb5\x40\x18\x99\x11\xe4\x6a"
+ "\xdf\xfa\x4f\x25\x2c\xde\x15\xb7"
+ "\x12\xd8\xc6\x90\x0d\x0f\xc9\xfb"
+ "\x21\xf1\xed\xfe\x98\xe1\x03\xe2"
+ "\x5c\xef\xb6\xc7\x87\x77\x0e\xcd"
+ "\xff\x78\x94\xc9\xbe\xd3\x47\xf7"
+ "\x8d\x37\x48\x01\x42\xe2\x17\x96"
+ "\xfc\xc0\xcb\x7b\x7b\x57\xaf\x3b"
+ "\xc9\xd0\x94\xce\x5e\x1b\xa9\x47"
+ "\x02\x4d\x74\xcc\x45\x1d\xd3\x2d"
+ "\x5f\x4f\x7f\xf2\x4b\xf9\x59\xee"
+ "\x9e\x9e\xb9\x95\x29\x19\xd1\x5f"
+ "\x72\xab\x8d\xf1\x28\xd1\x1c\xae"
+ "\xc2\xba\xf7\x22\x84\x2c\x83\x51"
+ "\x03\xad\xa3\xef\x81\xa7\xdc\xf1"
+ "\x44\x51\x50\x96\x70\xd1\xe5\x47"
+ "\x57\xf9\x30\x90\xe4\xbf\xfc\x75"
+ "\x14\xaa\x4d\xb7\xb1\xe7\x79\x33"
+ "\x43\xc2\x5c\xc1\xbc\x09\x92\x0f"
+ "\xa7\xaf\x68\x51\x51\xec\x0b\xc3"
+ "\x3d\x2b\x94\x30\x45\x29\x1b\x9e"
+ "\x70\x56\xf8\xd6\x67\x2d\x39\x3b"
+ "\x3c\xd2\xd0\xd3\xdc\x7d\x84\xe9"
+ "\x06\x31\x98\xa6\x5c\xbf\x10\x58"
+ "\xce\xbb\xa7\xe1\x65\x7e\x51\x87"
+ "\x70\x46\xb4\x7f\xf9\xec\x92\x1c"
+ "\x9b\x24\x49\xc1\x04\xbe\x1c\x5f"
+ "\xcc\xb3\x33\x8c\xad\xe7\xdc\x32"
+ "\x54\xa2\x0d\x83\x0f\x3c\x12\x5d"
+ "\x71\xe3\x9c\xae\x71\xa3\x2a\x10"
+ "\xc5\x91\xb4\x73\x96\x60\xdb\x5d"
+ "\x1f\xd5\x9a\xd2\x69\xc3\xd7\x4b"
+ "\xa2\x66\x81\x96\x4a\xaa\x02\xd6"
+ "\xd5\x44\x9b\x42\x3a\x15\x5f\xe7"
+ "\x4d\x7c\xf6\x71\x4a\xea\xe8\x43"
+ "\xd7\x68\xe4\xbc\x05\x87\x49\x05"
+ "\x3b\x47\xb2\x6d\x5f\xd1\x11\xa6"
+ "\x58\xd4\xa2\x45\xec\xb5\x54\x55"
+ "\xd3\xd6\xd2\x6a\x8b\x21\x9e\x2c"
+ "\xf1\x27\x4b\x5b\xe3\xff\xe0\xfd"
+ "\x4b\xf1\xe7\xe2\x84\xf2\x17\x37"
+ "\x11\x68\xc4\x92\x4b\x6b\xef\x8e"
+ "\x75\xf5\xc2\x7d\x5c\xe9\x7c\xfc"
+ "\x2b\x00\x33\x0e\x7d\x69\xd8\xd4"
+ "\x9b\xa8\x38\x54\x7e\x6d\x23\x51"
+ "\x2c\xd6\xc4\x58\x23\x1c\x22\x2a"
+ "\x59\xc5\x9b\xec\x9d\xbf\x03\x0f"
+ "\xb3\xdd\xba\x02\x22\xa0\x34\x37"
+ "\x19\x56\xc2\x5b\x32\x1d\x1e\x66"
+ "\x68\xf4\x47\x05\x04\x18\xa7\x28"
+ "\x80\xf2\xc7\x99\xed\x1e\x72\x48"
+ "\x8f\x97\x5d\xb3\x74\x42\xfd\x0c"
+ "\x0f\x5f\x29\x0c\xf1\x35\x22\x90"
+ "\xd6\x7c\xb8\xa3\x2a\x89\x38\x71"
+ "\xe9\x7a\x55\x3c\x3b\xf2\x6e\x1a"
+ "\x22\x8f\x07\x81\xc1\xe1\xf1\x76"
+ "\x2a\x75\xab\x86\xc4\xcc\x52\x59"
+ "\x83\x19\x5e\xb3\x53\xe2\x81\xdf"
+ "\xe6\x15\xb3\xba\x0c\x0e\xba"
+ "\xa9\x2c\xed\x51\xd5\x06\xc8\xc6"
+ "\x4b\x9f\x5d\x1b\x61\x31\xad\xf4",
+ .clen = 735,
}
};
@@ -15030,6 +16129,68 @@ static const struct hash_testvec sm4_cmac128_tv_template[] = {
}
};
+static const struct hash_testvec sm4_xcbc128_tv_template[] = {
+ { /* Generated from AES-XCBC128 test vectors */
+ .key = "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+ .plaintext = zeroed_string,
+ .digest = "\xa9\x9a\x5c\x44\xe2\x34\xee\x2c"
+ "\x9b\xe4\x9d\xca\x64\xb0\xa5\xc4",
+ .psize = 0,
+ .ksize = 16,
+ }, {
+ .key = "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+ .plaintext = "\x00\x01\x02",
+ .digest = "\x17\x27\x62\xf3\x8b\x88\x1d\xc0"
+ "\x97\x35\x9c\x3e\x9f\x27\xb7\x83",
+ .psize = 3,
+ .ksize = 16,
+ }, {
+ .key = "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+ .plaintext = "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+ .digest = "\xda\x45\xd1\xac\xec\x4d\xab\x46"
+ "\xdd\x59\xe0\x44\xff\x59\xd5\xfc",
+ .psize = 16,
+ .ksize = 16,
+ }, {
+ .key = "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+ .plaintext = "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+ "\x10\x11\x12\x13",
+ .digest = "\xbe\x24\x5d\x81\x8c\x8a\x10\xa4"
+ "\x8e\xc2\x16\xfa\xa4\x83\xc9\x2a",
+ .psize = 20,
+ .ksize = 16,
+ }, {
+ .key = "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+ .plaintext = "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+ "\x10\x11\x12\x13\x14\x15\x16\x17"
+ "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f",
+ .digest = "\x91\x82\x31\x56\xd5\x77\xa4\xc5"
+ "\x88\x2d\xce\x3a\x87\x5e\xbd\xba",
+ .psize = 32,
+ .ksize = 16,
+ }, {
+ .key = "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+ .plaintext = "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+ "\x10\x11\x12\x13\x14\x15\x16\x17"
+ "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+ "\x20\x21",
+ .digest = "\x2a\xae\xa5\x24\x0c\x12\x9f\x5f"
+ "\x55\xfb\xae\x35\x13\x0d\x22\x2d",
+ .psize = 34,
+ .ksize = 16,
+ }
+};
+
/* Cast6 test vectors from RFC 2612 */
static const struct cipher_testvec cast6_tv_template[] = {
{
--
2.24.3 (Apple Git-128)
* [PATCH 04/16] crypto: testmgr - add SM4 cts-cbc/essiv/xts/xcbc test vectors
@ 2022-09-26 9:36 ` Tianjia Zhang
From: Tianjia Zhang @ 2022-09-26 9:36 UTC (permalink / raw)
To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
linux-stm32
This patch adds test vectors for the CTS-CBC/ESSIV/XTS/XCBC modes
of the SM4 algorithm, and also adds some test vectors for SM4 GCM/CCM.
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
crypto/testmgr.c | 25 +
crypto/testmgr.h | 1161 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 1186 insertions(+)
diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index e4bb03b8b924..cce101c7e8f9 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -4712,6 +4712,12 @@ static const struct alg_test_desc alg_test_descs[] = {
.alg = "cts(cbc(paes))",
.test = alg_test_null,
.fips_allowed = 1,
+ }, {
+ .alg = "cts(cbc(sm4))",
+ .test = alg_test_skcipher,
+ .suite = {
+ .cipher = __VECS(sm4_cts_tv_template)
+ }
}, {
.alg = "curve25519",
.test = alg_test_kpp,
@@ -5059,6 +5065,12 @@ static const struct alg_test_desc alg_test_descs[] = {
.cipher = __VECS(essiv_aes_cbc_tv_template)
}
}, {
+ .alg = "essiv(cbc(sm4),sm3)",
+ .test = alg_test_skcipher,
+ .suite = {
+ .cipher = __VECS(essiv_sm4_cbc_tv_template)
+ }
+ }, {
#if IS_ENABLED(CONFIG_CRYPTO_DH_RFC7919_GROUPS)
.alg = "ffdhe2048(dh)",
.test = alg_test_kpp,
@@ -5586,6 +5598,12 @@ static const struct alg_test_desc alg_test_descs[] = {
.suite = {
.hash = __VECS(aes_xcbc128_tv_template)
}
+ }, {
+ .alg = "xcbc(sm4)",
+ .test = alg_test_hash,
+ .suite = {
+ .hash = __VECS(sm4_xcbc128_tv_template)
+ }
}, {
.alg = "xchacha12",
.test = alg_test_skcipher,
@@ -5640,6 +5658,13 @@ static const struct alg_test_desc alg_test_descs[] = {
.suite = {
.cipher = __VECS(serpent_xts_tv_template)
}
+ }, {
+ .alg = "xts(sm4)",
+ .generic_driver = "xts(ecb(sm4-generic))",
+ .test = alg_test_skcipher,
+ .suite = {
+ .cipher = __VECS(sm4_xts_tv_template)
+ }
}, {
.alg = "xts(twofish)",
.generic_driver = "xts(ecb(twofish-generic))",
diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index d6088e26f326..ced48e4dad0c 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -14882,6 +14882,537 @@ static const struct cipher_testvec sm4_cfb_tv_template[] = {
}
};
+static const struct cipher_testvec sm4_cts_tv_template[] = {
+ /* Generated from AES-CTS test vectors */
+ {
+ .klen = 16,
+ .key = "\x63\x68\x69\x63\x6b\x65\x6e\x20"
+ "\x74\x65\x72\x69\x79\x61\x6b\x69",
+ .ptext = "\x49\x20\x77\x6f\x75\x6c\x64\x20"
+ "\x6c\x69\x6b\x65\x20\x74\x68\x65"
+ "\x20",
+ .len = 17,
+ .ctext = "\x05\xfe\x23\xee\x17\xa2\x89\x98"
+ "\xbc\x97\x0a\x0b\x54\x67\xca\xd7"
+ "\xd6",
+ }, {
+ .klen = 16,
+ .key = "\x63\x68\x69\x63\x6b\x65\x6e\x20"
+ "\x74\x65\x72\x69\x79\x61\x6b\x69",
+ .ptext = "\x49\x20\x77\x6f\x75\x6c\x64\x20"
+ "\x6c\x69\x6b\x65\x20\x74\x68\x65"
+ "\x20\x47\x65\x6e\x65\x72\x61\x6c"
+ "\x20\x47\x61\x75\x27\x73\x20",
+ .len = 31,
+ .ctext = "\x15\x46\xe4\x95\xa4\xec\xf0\xb8"
+ "\x49\xd6\x6a\x9d\x89\xc7\xfd\x70"
+ "\xd6\x71\xc8\xc0\x4d\x52\x7c\x66"
+ "\x93\xf7\x70\xbb\xa8\x3f\xa3",
+ }, {
+ .klen = 16,
+ .key = "\x63\x68\x69\x63\x6b\x65\x6e\x20"
+ "\x74\x65\x72\x69\x79\x61\x6b\x69",
+ .ptext = "\x49\x20\x77\x6f\x75\x6c\x64\x20"
+ "\x6c\x69\x6b\x65\x20\x74\x68\x65"
+ "\x20\x47\x65\x6e\x65\x72\x61\x6c"
+ "\x20\x47\x61\x75\x27\x73\x20\x43",
+ .len = 32,
+ .ctext = "\x89\xc7\x99\x3f\x87\x69\x5c\xd3"
+ "\x01\x6a\xbf\xd4\x3f\x79\x02\xa3"
+ "\xd6\x71\xc8\xc0\x4d\x52\x7c\x66"
+ "\x93\xf7\x70\xbb\xa8\x3f\xa3\xcf",
+ }, {
+ .klen = 16,
+ .key = "\x63\x68\x69\x63\x6b\x65\x6e\x20"
+ "\x74\x65\x72\x69\x79\x61\x6b\x69",
+ .ptext = "\x49\x20\x77\x6f\x75\x6c\x64\x20"
+ "\x6c\x69\x6b\x65\x20\x74\x68\x65"
+ "\x20\x47\x65\x6e\x65\x72\x61\x6c"
+ "\x20\x47\x61\x75\x27\x73\x20\x43"
+ "\x68\x69\x63\x6b\x65\x6e\x2c\x20"
+ "\x70\x6c\x65\x61\x73\x65\x2c",
+ .len = 47,
+ .ctext = "\xd6\x71\xc8\xc0\x4d\x52\x7c\x66"
+ "\x93\xf7\x70\xbb\xa8\x3f\xa3\xcf"
+ "\xd3\xe1\xdc\xeb\xfa\x04\x11\x99"
+ "\xde\xcf\x6f\x4d\x7b\x09\x92\x7f"
+ "\x89\xc7\x99\x3f\x87\x69\x5c\xd3"
+ "\x01\x6a\xbf\xd4\x3f\x79\x02",
+ }, {
+ .klen = 16,
+ .key = "\x63\x68\x69\x63\x6b\x65\x6e\x20"
+ "\x74\x65\x72\x69\x79\x61\x6b\x69",
+ .ptext = "\x49\x20\x77\x6f\x75\x6c\x64\x20"
+ "\x6c\x69\x6b\x65\x20\x74\x68\x65"
+ "\x20\x47\x65\x6e\x65\x72\x61\x6c"
+ "\x20\x47\x61\x75\x27\x73\x20\x43"
+ "\x68\x69\x63\x6b\x65\x6e\x2c\x20"
+ "\x70\x6c\x65\x61\x73\x65\x2c\x20",
+ .len = 48,
+ .ctext = "\xd6\x71\xc8\xc0\x4d\x52\x7c\x66"
+ "\x93\xf7\x70\xbb\xa8\x3f\xa3\xcf"
+ "\x9a\xbd\x7b\xfe\x82\xab\xcc\x7f"
+ "\xbd\x99\x21\x0c\x5e\x4d\xed\x20"
+ "\x89\xc7\x99\x3f\x87\x69\x5c\xd3"
+ "\x01\x6a\xbf\xd4\x3f\x79\x02\xa3",
+ }, {
+ .klen = 16,
+ .key = "\x63\x68\x69\x63\x6b\x65\x6e\x20"
+ "\x74\x65\x72\x69\x79\x61\x6b\x69",
+ .ptext = "\x49\x20\x77\x6f\x75\x6c\x64\x20"
+ "\x6c\x69\x6b\x65\x20\x74\x68\x65"
+ "\x20\x47\x65\x6e\x65\x72\x61\x6c"
+ "\x20\x47\x61\x75\x27\x73\x20\x43"
+ "\x68\x69\x63\x6b\x65\x6e\x2c\x20"
+ "\x70\x6c\x65\x61\x73\x65\x2c\x20"
+ "\x61\x6e\x64\x20\x77\x6f\x6e\x74"
+ "\x6f\x6e\x20\x73\x6f\x75\x70\x2e",
+ .len = 64,
+ .ctext = "\xd6\x71\xc8\xc0\x4d\x52\x7c\x66"
+ "\x93\xf7\x70\xbb\xa8\x3f\xa3\xcf"
+ "\x89\xc7\x99\x3f\x87\x69\x5c\xd3"
+ "\x01\x6a\xbf\xd4\x3f\x79\x02\xa3"
+ "\x58\x19\xa4\x8f\xa9\x68\x5e\x6b"
+ "\x2c\x0f\x81\x60\x15\x98\x27\x4f"
+ "\x9a\xbd\x7b\xfe\x82\xab\xcc\x7f"
+ "\xbd\x99\x21\x0c\x5e\x4d\xed\x20",
+ }
+};
+
+static const struct cipher_testvec essiv_sm4_cbc_tv_template[] = {
+ /* Generated from AES-ESSIV-CBC test vectors */
+ {
+ .key = "\x06\xa9\x21\x40\x36\xb8\xa1\x5b"
+ "\x51\x2e\x03\xd5\x34\x12\x00\x06",
+ .klen = 16,
+ .iv = "\x3d\xaf\xba\x42\x9d\x9e\xb4\x30"
+ "\x00\x00\x00\x00\x00\x00\x00\x00",
+ .ptext = "Single block msg",
+ .ctext = "\x83\xa0\x79\x71\x18\xed\xb2\x0f"
+ "\xa8\x71\x94\x22\x8e\x1f\xc1\xbb",
+ .len = 16,
+ }, {
+ .key = "\xc2\x86\x69\x6d\x88\x7c\x9a\xa0"
+ "\x61\x1b\xbb\x3e\x20\x25\xa4\x5a",
+ .klen = 16,
+ .iv = "\x56\x2e\x17\x99\x6d\x09\x3d\x28"
+ "\x00\x00\x00\x00\x00\x00\x00\x00",
+ .ptext = "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+ "\x10\x11\x12\x13\x14\x15\x16\x17"
+ "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f",
+ .ctext = "\x48\x38\xba\xa0\x09\xa2\xe1\x61"
+ "\x94\xe5\xd2\x63\xe5\x04\x6c\x62"
+ "\x93\x21\x95\xfb\x8c\xf4\x25\x19"
+ "\xe0\x0f\x9c\xfa\x51\xfe\xe7\x32",
+ .len = 32,
+ }, {
+ .key = "\x1f\x35\x2c\x07\x3b\x61\x08\xd7"
+ "\x2d\x98\x10\xa3\x09\x14\xdf\xf4",
+ .klen = 16,
+ .iv = "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x00\x00\x00\x00\x00\x00\x00\x00",
+ .ptext = "\x6b\xc1\xbe\xe2\x2e\x40\x9f\x96"
+ "\xe9\x3d\x7e\x11\x73\x93\x17\x2a"
+ "\xae\x2d\x8a\x57\x1e\x03\xac\x9c"
+ "\x9e\xb7\x6f\xac\x45\xaf\x8e\x51"
+ "\x30\xc8\x1c\x46\xa3\x5c\xe4\x11"
+ "\xe5\xfb\xc1\x19\x1a\x0a\x52\xef"
+ "\xf6\x9f\x24\x45\xdf\x4f\x9b\x17"
+ "\xad\x2b\x41\x7b\xe6\x6c\x37\x10",
+ .ctext = "\xa5\x1d\x64\x91\x28\x1f\xbe\x9e"
+ "\x15\x39\x5f\xe4\xe1\x5a\x8c\x38"
+ "\x80\x7f\xc7\x7d\x00\x4c\x4b\xff"
+ "\x75\x3a\x03\xfe\x41\x75\x26\x9e"
+ "\x3f\xf1\x36\xaf\x7b\x37\x73\x1a"
+ "\xaf\x9b\x91\xec\x1e\xf0\x05\x9d"
+ "\x87\xda\x7b\xa3\xaa\xe6\x5b\x98"
+ "\x41\x73\xd5\x3c\x8c\x8b\xb5\x88",
+ .len = 64,
+ }, {
+ .key = "\xBE\xE1\x04\x27\xE1\x04\x27\x4A"
+ "\x6D\x90\x4A\x6D\x90\xB3\xD6\xF9",
+ .klen = 16,
+ .iv = "\xE7\x82\x1D\xB8\x53\x11\xAC\x47"
+ "\x00\x00\x00\x00\x00\x00\x00\x00",
+ .ptext = "\x50\xB9\x22\xAE\x17\x80\x0C\x75"
+ "\xDE\x47\xD3\x3C\xA5\x0E\x9A\x03"
+ "\x6C\xF8\x61\xCA\x33\xBF\x28\x91"
+ "\x1D\x86\xEF\x58\xE4\x4D\xB6\x1F"
+ "\xAB\x14\x7D\x09\x72\xDB\x44\xD0"
+ "\x39\xA2\x0B\x97\x00\x69\xF5\x5E"
+ "\xC7\x30\xBC\x25\x8E\x1A\x83\xEC"
+ "\x55\xE1\x4A\xB3\x1C\xA8\x11\x7A"
+ "\x06\x6F\xD8\x41\xCD\x36\x9F\x08"
+ "\x94\xFD\x66\xF2\x5B\xC4\x2D\xB9"
+ "\x22\x8B\x17\x80\xE9\x52\xDE\x47"
+ "\xB0\x19\xA5\x0E\x77\x03\x6C\xD5"
+ "\x3E\xCA\x33\x9C\x05\x91\xFA\x63"
+ "\xEF\x58\xC1\x2A\xB6\x1F\x88\x14"
+ "\x7D\xE6\x4F\xDB\x44\xAD\x16\xA2"
+ "\x0B\x74\x00\x69\xD2\x3B\xC7\x30"
+ "\x99\x02\x8E\xF7\x60\xEC\x55\xBE"
+ "\x27\xB3\x1C\x85\x11\x7A\xE3\x4C"
+ "\xD8\x41\xAA\x13\x9F\x08\x71\xFD"
+ "\x66\xCF\x38\xC4\x2D\x96\x22\x8B"
+ "\xF4\x5D\xE9\x52\xBB\x24\xB0\x19"
+ "\x82\x0E\x77\xE0\x49\xD5\x3E\xA7"
+ "\x10\x9C\x05\x6E\xFA\x63\xCC\x35"
+ "\xC1\x2A\x93\x1F\x88\xF1\x5A\xE6"
+ "\x4F\xB8\x21\xAD\x16\x7F\x0B\x74"
+ "\xDD\x46\xD2\x3B\xA4\x0D\x99\x02"
+ "\x6B\xF7\x60\xC9\x32\xBE\x27\x90"
+ "\x1C\x85\xEE\x57\xE3\x4C\xB5\x1E"
+ "\xAA\x13\x7C\x08\x71\xDA\x43\xCF"
+ "\x38\xA1\x0A\x96\xFF\x68\xF4\x5D"
+ "\xC6\x2F\xBB\x24\x8D\x19\x82\xEB"
+ "\x54\xE0\x49\xB2\x1B\xA7\x10\x79"
+ "\x05\x6E\xD7\x40\xCC\x35\x9E\x07"
+ "\x93\xFC\x65\xF1\x5A\xC3\x2C\xB8"
+ "\x21\x8A\x16\x7F\xE8\x51\xDD\x46"
+ "\xAF\x18\xA4\x0D\x76\x02\x6B\xD4"
+ "\x3D\xC9\x32\x9B\x04\x90\xF9\x62"
+ "\xEE\x57\xC0\x29\xB5\x1E\x87\x13"
+ "\x7C\xE5\x4E\xDA\x43\xAC\x15\xA1"
+ "\x0A\x73\xFF\x68\xD1\x3A\xC6\x2F"
+ "\x98\x01\x8D\xF6\x5F\xEB\x54\xBD"
+ "\x26\xB2\x1B\x84\x10\x79\xE2\x4B"
+ "\xD7\x40\xA9\x12\x9E\x07\x70\xFC"
+ "\x65\xCE\x37\xC3\x2C\x95\x21\x8A"
+ "\xF3\x5C\xE8\x51\xBA\x23\xAF\x18"
+ "\x81\x0D\x76\xDF\x48\xD4\x3D\xA6"
+ "\x0F\x9B\x04\x6D\xF9\x62\xCB\x34"
+ "\xC0\x29\x92\x1E\x87\xF0\x59\xE5"
+ "\x4E\xB7\x20\xAC\x15\x7E\x0A\x73"
+ "\xDC\x45\xD1\x3A\xA3\x0C\x98\x01"
+ "\x6A\xF6\x5F\xC8\x31\xBD\x26\x8F"
+ "\x1B\x84\xED\x56\xE2\x4B\xB4\x1D"
+ "\xA9\x12\x7B\x07\x70\xD9\x42\xCE"
+ "\x37\xA0\x09\x95\xFE\x67\xF3\x5C"
+ "\xC5\x2E\xBA\x23\x8C\x18\x81\xEA"
+ "\x53\xDF\x48\xB1\x1A\xA6\x0F\x78"
+ "\x04\x6D\xD6\x3F\xCB\x34\x9D\x06"
+ "\x92\xFB\x64\xF0\x59\xC2\x2B\xB7"
+ "\x20\x89\x15\x7E\xE7\x50\xDC\x45"
+ "\xAE\x17\xA3\x0C\x75\x01\x6A\xD3"
+ "\x3C\xC8\x31\x9A\x03\x8F\xF8\x61"
+ "\xED\x56\xBF\x28\xB4\x1D\x86\x12",
+ .ctext = "\xad\x68\x40\x68\xb2\xf9\x77\x55"
+ "\xd5\x1c\x17\x46\xc1\xfa\x05\xdd"
+ "\x94\x5c\xb7\x99\x82\xba\x05\x48"
+ "\xac\x5d\x14\x30\x2e\xc8\x0e\x2f"
+ "\x5a\xd7\x39\x43\x95\x4d\x93\xff"
+ "\x6b\xe3\xb7\x71\xc1\x39\x43\x8d"
+ "\x10\xd7\xd9\xa8\xe7\x65\xb7\x0a"
+ "\x27\x98\x5b\x90\xc3\x80\x1f\xd9"
+ "\x65\x82\x88\x0a\xc3\x16\x3f\xae"
+ "\x1f\xad\x88\xe9\xfb\x9e\xd4\xc8"
+ "\x81\x36\x50\x37\x1f\x11\x83\xe2"
+ "\xc5\x1a\x48\xdb\xc3\x18\x07\x5d"
+ "\xee\x4b\xea\x40\xd3\xd9\x8c\x59"
+ "\x29\xe1\x0b\x79\x3b\x28\xac\x75"
+ "\xda\x82\x99\x86\xd4\xbe\xd8\x81"
+ "\xe0\xc4\x58\x78\xe4\x33\xc1\xf1"
+ "\xbe\x96\xd3\x4c\x42\x6b\xaf\x24"
+ "\x69\xb4\x25\x88\x37\x9e\xb2\xfb"
+ "\x5c\x93\x22\x89\x2f\x81\x85\x06"
+ "\x12\x74\x3b\x6c\x99\x81\xfb\xbe"
+ "\x0f\xc4\xa5\xb6\xf8\x79\x5f\x72"
+ "\xf8\x46\x94\x3f\x1f\x9f\x15\xa2"
+ "\xc8\xc0\xbf\xeb\xa3\x9e\x59\xe1"
+ "\xbd\x1a\xe1\xe3\x6b\x33\x96\x54"
+ "\x1b\xc4\x25\x74\x06\xcf\x8a\x75"
+ "\x6c\xfc\x76\x7f\x9e\x7b\x00\xce"
+ "\xa8\x1e\x6a\x0f\x5a\xa6\xcb\x77"
+ "\x5f\x90\x39\xcb\xfe\x0e\x16\x53"
+ "\x8e\x21\x0f\x7e\x51\xcc\x92\xb8"
+ "\x4f\x65\x76\x20\x3d\x56\xb4\xcc"
+ "\x8b\x8e\x8e\x68\xc3\x82\x53\x5c"
+ "\x1c\x82\x13\x32\x3b\x97\xff\x48"
+ "\x98\xda\x4a\x7c\xc8\x21\x83\xfd"
+ "\xe2\xf1\x30\xe1\x11\xe9\xe8\x97"
+ "\x97\x24\x06\x73\xf2\x52\xbb\xab"
+ "\x9d\x5f\x0b\xa8\x2f\xab\x0b\x7d"
+ "\xe8\x20\x7b\x67\x2e\x93\xb5\x11"
+ "\x6c\x16\xea\xdd\x1a\x9d\xf2\xdc"
+ "\x79\x57\xc4\x04\xcb\x7f\x36\xa0"
+ "\x2e\xa7\x89\xab\xaa\x56\x59\x9e"
+ "\xec\x38\xea\x1a\xe9\xa7\x58\x58"
+ "\xb5\xb7\x8f\x8c\x5c\xd6\x86\x67"
+ "\x65\x0f\x93\x47\xf7\x3e\x19\x19"
+ "\x9b\x22\xd1\xc6\xc2\xba\x32\x5c"
+ "\x2c\x7a\xa2\xbb\xa5\x22\xde\xe5"
+ "\x1e\x78\x2c\xd3\x40\x6d\xfa\x79"
+ "\x4c\x9e\x1c\x36\x34\xaf\x95\x2e"
+ "\x68\x2e\x69\x7d\xe4\x7d\x0c\x74"
+ "\xaf\x73\x5b\x48\x62\x90\x5e\x19"
+ "\x0f\x12\xb3\xdb\x77\xbb\xe2\xac"
+ "\xaf\xfe\xd9\xa1\x80\x09\xc6\xd4"
+ "\xf4\x21\x3f\xa4\x0f\x16\x7b\x36"
+ "\x29\x6d\x10\xa2\xba\xaf\xf5\xa3"
+ "\x51\xca\x0a\x25\x74\x9a\xb7\x02"
+ "\xb8\xf8\x6b\xda\xb8\x1c\x9f\x62"
+ "\xf5\x61\x62\x9f\x4b\x71\x24\x45"
+ "\xfb\x0f\xdf\xa8\x47\x6f\x2f\x05"
+ "\x2f\xf4\xfd\xb8\xd1\x8c\x29\x9d"
+ "\x9d\xe8\x6f\x10\x89\xef\x08\x59"
+ "\xa0\x24\x1f\xdb\xea\xbc\x97\x44"
+ "\x23\x74\xbf\xaa\x87\x10\x5c\x58"
+ "\x2a\xe6\xe2\x19\xc5\x7e\x21\xe2",
+ .len = 496,
+ },
+};
+
+static const struct cipher_testvec sm4_xts_tv_template[] = {
+ /* Generated from AES-XTS test vectors */
+ {
+ .key = "\x00\x00\x00\x00\x00\x00\x00\x00"
+ "\x00\x00\x00\x00\x00\x00\x00\x00"
+ "\x00\x00\x00\x00\x00\x00\x00\x00"
+ "\x00\x00\x00\x00\x00\x00\x00\x00",
+ .klen = 32,
+ .iv = "\x00\x00\x00\x00\x00\x00\x00\x00"
+ "\x00\x00\x00\x00\x00\x00\x00\x00",
+ .ptext = "\x00\x00\x00\x00\x00\x00\x00\x00"
+ "\x00\x00\x00\x00\x00\x00\x00\x00"
+ "\x00\x00\x00\x00\x00\x00\x00\x00"
+ "\x00\x00\x00\x00\x00\x00\x00\x00",
+ .ctext = "\xd9\xb4\x21\xf7\x31\xc8\x94\xfd"
+ "\xc3\x5b\x77\x29\x1f\xe4\xe3\xb0"
+ "\x2a\x1f\xb7\x66\x98\xd5\x9f\x0e"
+ "\x51\x37\x6c\x4a\xda\x5b\xc7\x5d",
+ .len = 32,
+ }, {
+ .key = "\x11\x11\x11\x11\x11\x11\x11\x11"
+ "\x11\x11\x11\x11\x11\x11\x11\x11"
+ "\x22\x22\x22\x22\x22\x22\x22\x22"
+ "\x22\x22\x22\x22\x22\x22\x22\x22",
+ .klen = 32,
+ .iv = "\x33\x33\x33\x33\x33\x00\x00\x00"
+ "\x00\x00\x00\x00\x00\x00\x00\x00",
+ .ptext = "\x44\x44\x44\x44\x44\x44\x44\x44"
+ "\x44\x44\x44\x44\x44\x44\x44\x44"
+ "\x44\x44\x44\x44\x44\x44\x44\x44"
+ "\x44\x44\x44\x44\x44\x44\x44\x44",
+ .ctext = "\xa7\x4d\x72\x6c\x11\x19\x6a\x32"
+ "\xbe\x04\xe0\x01\xff\x29\xd0\xc7"
+ "\x93\x2f\x9f\x3e\xc2\x9b\xfc\xb6"
+ "\x4d\xd1\x7f\x63\xcb\xd3\xea\x31",
+ .len = 32,
+ }, {
+ .key = "\xff\xfe\xfd\xfc\xfb\xfa\xf9\xf8"
+ "\xf7\xf6\xf5\xf4\xf3\xf2\xf1\xf0"
+ "\x22\x22\x22\x22\x22\x22\x22\x22"
+ "\x22\x22\x22\x22\x22\x22\x22\x22",
+ .klen = 32,
+ .iv = "\x33\x33\x33\x33\x33\x00\x00\x00"
+ "\x00\x00\x00\x00\x00\x00\x00\x00",
+ .ptext = "\x44\x44\x44\x44\x44\x44\x44\x44"
+ "\x44\x44\x44\x44\x44\x44\x44\x44"
+ "\x44\x44\x44\x44\x44\x44\x44\x44"
+ "\x44\x44\x44\x44\x44\x44\x44\x44",
+ .ctext = "\x7f\x76\x08\x8e\xff\xad\xf7\x0c"
+ "\x02\xea\x9f\x95\xda\x06\x28\xd3"
+ "\x51\xbf\xcb\x9e\xac\x05\x63\xbc"
+ "\xf1\x7b\x71\x0d\xab\x0a\x98\x26",
+ .len = 32,
+ }, {
+ .key = "\x27\x18\x28\x18\x28\x45\x90\x45"
+ "\x23\x53\x60\x28\x74\x71\x35\x26"
+ "\x31\x41\x59\x26\x53\x58\x97\x93"
+ "\x23\x84\x62\x64\x33\x83\x27\x95",
+ .klen = 32,
+ .iv = "\x00\x00\x00\x00\x00\x00\x00\x00"
+ "\x00\x00\x00\x00\x00\x00\x00\x00",
+ .ptext = "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+ "\x10\x11\x12\x13\x14\x15\x16\x17"
+ "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+ "\x20\x21\x22\x23\x24\x25\x26\x27"
+ "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+ "\x30\x31\x32\x33\x34\x35\x36\x37"
+ "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+ "\x40\x41\x42\x43\x44\x45\x46\x47"
+ "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+ "\x50\x51\x52\x53\x54\x55\x56\x57"
+ "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+ "\x60\x61\x62\x63\x64\x65\x66\x67"
+ "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+ "\x70\x71\x72\x73\x74\x75\x76\x77"
+ "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+ "\x80\x81\x82\x83\x84\x85\x86\x87"
+ "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+ "\x90\x91\x92\x93\x94\x95\x96\x97"
+ "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+ "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+ "\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+ "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+ "\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf"
+ "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+ "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf"
+ "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7"
+ "\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf"
+ "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7"
+ "\xe8\xe9\xea\xeb\xec\xed\xee\xef"
+ "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7"
+ "\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff"
+ "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+ "\x10\x11\x12\x13\x14\x15\x16\x17"
+ "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+ "\x20\x21\x22\x23\x24\x25\x26\x27"
+ "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+ "\x30\x31\x32\x33\x34\x35\x36\x37"
+ "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+ "\x40\x41\x42\x43\x44\x45\x46\x47"
+ "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+ "\x50\x51\x52\x53\x54\x55\x56\x57"
+ "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+ "\x60\x61\x62\x63\x64\x65\x66\x67"
+ "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+ "\x70\x71\x72\x73\x74\x75\x76\x77"
+ "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+ "\x80\x81\x82\x83\x84\x85\x86\x87"
+ "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+ "\x90\x91\x92\x93\x94\x95\x96\x97"
+ "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+ "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+ "\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+ "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+ "\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf"
+ "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+ "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf"
+ "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7"
+ "\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf"
+ "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7"
+ "\xe8\xe9\xea\xeb\xec\xed\xee\xef"
+ "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7"
+ "\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff",
+ .ctext = "\x54\xdd\x65\xb6\x32\x6f\xae\xa8"
+ "\xfa\xd1\xa8\x3c\x63\x61\x4a\xf3"
+ "\x9f\x72\x1d\x8d\xfe\x17\x7a\x30"
+ "\xb6\x6a\xbf\x6a\x44\x99\x80\xe1"
+ "\xcd\xbe\x06\xaf\xb7\x33\x36\xf3"
+ "\x7a\x4d\x39\xde\x96\x4a\x30\xd7"
+ "\xd0\x4a\x37\x99\x16\x9c\x60\x25"
+ "\x8f\x6b\x74\x8a\x61\x86\x1a\xa5"
+ "\xec\x92\xa2\xc1\x5b\x2b\x7c\x61"
+ "\x5a\x42\xab\xa4\x99\xbb\xd6\xb7"
+ "\x1d\xb9\xc7\x89\xb2\x18\x20\x89"
+ "\xa2\x5d\xd3\xdf\x80\x0e\xd1\x86"
+ "\x4d\x19\xf7\xed\x45\xfd\x17\xa9"
+ "\x48\x0b\x0f\xb8\x2d\x9b\x7f\xc3"
+ "\xed\x57\xe9\xa1\x14\x0e\xaa\x77"
+ "\x8d\xd2\xdd\x67\x9e\x3e\xdc\x3d"
+ "\xc4\xd5\x5c\x95\x0e\xbc\x53\x1d"
+ "\x95\x92\xf7\xc4\x63\x82\x56\xd5"
+ "\x65\x18\x29\x2a\x20\xaf\x98\xfd"
+ "\xd3\xa6\x36\x00\x35\x0a\x70\xab"
+ "\x5a\x40\xf4\xc2\x85\x03\x7c\xa0"
+ "\x1f\x25\x1f\x19\xec\xae\x03\x29"
+ "\xff\x77\xad\x88\xcd\x5a\x4c\xde"
+ "\xa2\xae\xab\xc2\x21\x48\xff\xbd"
+ "\x23\x9b\xd1\x05\x15\xbd\xe1\x13"
+ "\x1d\xec\x84\x04\xe4\x43\xdc\x76"
+ "\x31\x40\xd5\xf2\x2b\xf3\x3e\x0c"
+ "\x68\x72\xd6\xb8\x1d\x63\x0f\x6f"
+ "\x00\xcd\xd0\x58\xfe\x80\xf9\xcb"
+ "\xfb\x77\x70\x7f\x93\xce\xe2\xca"
+ "\x92\xb9\x15\xb8\x30\x40\x27\xc1"
+ "\x90\xa8\x4e\x2d\x65\xe0\x18\xcc"
+ "\x6a\x38\x7d\x37\x66\xac\xdb\x28"
+ "\x25\x32\x84\xe8\xdb\x9a\xcf\x8f"
+ "\x52\x28\x0d\xdc\x6d\x00\x33\xd2"
+ "\xcc\xaa\xa4\xf9\xae\xff\x12\x36"
+ "\x69\xbc\x02\x4f\xd6\x76\x8e\xdf"
+ "\x8b\xc1\xf8\xd6\x22\xc1\x9c\x60"
+ "\x9e\xf9\x7f\x60\x91\x90\xcd\x11"
+ "\x02\x41\xe7\xfb\x08\x4e\xd8\x94"
+ "\x2d\xa1\xf9\xb9\xcf\x1b\x51\x4b"
+ "\x61\xa3\x88\xb3\x0e\xa6\x1a\x4a"
+ "\x74\x5b\x38\x1e\xe7\xad\x6c\x4d"
+ "\xb1\x27\x54\x53\xb8\x41\x3f\x98"
+ "\xdf\x6e\x4a\x40\x98\x6e\xe4\xb5"
+ "\x9a\xf5\xdf\xae\xcd\x30\x12\x65"
+ "\x17\x90\x67\xa0\x0d\x7c\xa3\x5a"
+ "\xb9\x5a\xbd\x61\x7a\xde\xa2\x8e"
+ "\xc1\xc2\x6a\x97\xde\x28\xb8\xbf"
+ "\xe3\x01\x20\xd6\xae\xfb\xd2\x58"
+ "\xc5\x9e\x42\xd1\x61\xe8\x06\x5a"
+ "\x78\x10\x6b\xdc\xa5\xcd\x90\xfb"
+ "\x3a\xac\x4e\x93\x86\x6c\x8a\x7f"
+ "\x96\x76\x86\x0a\x79\x14\x5b\xd9"
+ "\x2e\x02\xe8\x19\xa9\x0b\xe0\xb9"
+ "\x7c\xc5\x22\xb3\x21\x06\x85\x6f"
+ "\xdf\x0e\x54\xd8\x8e\x46\x24\x15"
+ "\x5a\x2f\x1c\x14\xea\xea\xa1\x63"
+ "\xf8\x58\xe9\x9a\x80\x6e\x79\x1a"
+ "\xcd\x82\xf1\xb0\xe2\x9f\x00\x28"
+ "\xa4\xc3\x8e\x97\x6f\x57\x1a\x93"
+ "\xf4\xfd\x57\xd7\x87\xc2\x4d\xb0"
+ "\xe0\x1c\xa3\x04\xe5\xa5\xc4\xdd"
+ "\x50\xcf\x8b\xdb\xf4\x91\xe5\x7c",
+ .len = 512,
+ }, {
+ .key = "\x62\x49\x77\x57\x24\x70\x93\x69"
+ "\x99\x59\x57\x49\x66\x96\x76\x27"
+ "\x02\x88\x41\x97\x16\x93\x99\x37"
+ "\x51\x05\x82\x09\x74\x94\x45\x92",
+ .klen = 32,
+ .iv = "\xff\x00\x00\x00\x00\x00\x00\x00"
+ "\x00\x00\x00\x00\x00\x00\x00\x00",
+ .ptext = "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+ "\x10\x11\x12\x13\x14\x15\x16\x17"
+ "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+ "\x20\x21\x22\x23\x24\x25\x26\x27"
+ "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+ "\x30\x31\x32\x33\x34\x35\x36\x37"
+ "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+ "\x40\x41\x42\x43\x44\x45\x46\x47"
+ "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+ "\x50\x51\x52\x53\x54\x55\x56\x57"
+ "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+ "\x60\x61\x62\x63\x64\x65\x66\x67"
+ "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+ "\x70\x71\x72\x73\x74\x75\x76\x77"
+ "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+ "\x80\x81\x82\x83\x84\x85\x86\x87"
+ "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+ "\x90\x91\x92\x93\x94\x95\x96\x97"
+ "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+ "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+ "\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+ "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+ "\xf8\xf9\xfa\xfb\xfc",
+ .ctext = "\xa2\x9f\x9e\x4e\x71\xdb\x28\x3c"
+ "\x80\x0e\xf6\xb7\x8e\x57\x1c\xba"
+ "\x90\xda\x3b\x6c\x22\x00\x68\x30"
+ "\x1d\x63\x0d\x9e\x6a\xad\x37\x55"
+ "\xbc\x77\x1e\xc9\xad\x83\x30\xd5"
+ "\x27\xb2\x66\x77\x18\x3c\xa6\x39"
+ "\x9c\x0a\xaa\x1f\x02\xe1\xd5\x65"
+ "\x9b\x8d\xc5\x97\x3d\xc5\x04\x53"
+ "\x78\x00\xe3\xb0\x1a\x43\x4e\xb7"
+ "\xc4\x9f\x38\xc5\x7b\xa4\x70\x64"
+ "\x78\xe6\x32\xd9\x65\x44\xc5\x64"
+ "\xb8\x42\x35\x99\xff\x66\x75\xb0"
+ "\x22\xd3\x9b\x6e\x8d\xcf\x6a\x24"
+ "\xfd\x92\xb7\x1b\x04\x28\x2a\x61"
+ "\xdc\x96\x2a\x20\x7a\x2c\xf1\xf9"
+ "\x12\x15\xf0\x4d\xcf\x2b\xde\x33"
+ "\x41\xbc\xe7\x85\x87\x22\xb7\x16"
+ "\x02\x1c\xd8\xa2\x0f\x1f\xa3\xe9"
+ "\xd8\x45\x48\xe7\xbe\x08\x4e\x4e"
+ "\x23\x79\x84\xdb\x40\x76\xf5\x13"
+ "\x78\x92\x4a\x2f\xf9\x1b\xf2\x80"
+ "\x25\x74\x51\x45\x9a\x77\x78\x97"
+ "\xd3\xe0\xc7\xc4\x35\x67\x2a\xe6"
+ "\xb3\x0d\x62\x9f\x8b",
+ .len = 189,
+ },
+};
+
static const struct aead_testvec sm4_gcm_tv_template[] = {
{ /* From https://datatracker.ietf.org/doc/html/rfc8998#appendix-A.1 */
.key = "\x01\x23\x45\x67\x89\xAB\xCD\xEF"
@@ -14913,6 +15444,298 @@ static const struct aead_testvec sm4_gcm_tv_template[] = {
"\x83\xDE\x35\x41\xE4\xC2\xB5\x81"
"\x77\xE0\x65\xA9\xBF\x7B\x62\xEC",
.clen = 80,
+ }, { /* Generated from AES-GCM test vectors */
+ .key = zeroed_string,
+ .klen = 16,
+ .ctext = "\x23\x2f\x0c\xfe\x30\x8b\x49\xea"
+ "\x6f\xc8\x82\x29\xb5\xdc\x85\x8d",
+ .clen = 16,
+ }, {
+ .key = zeroed_string,
+ .klen = 16,
+ .ptext = zeroed_string,
+ .plen = 16,
+ .ctext = "\x7d\xe2\xaa\x7f\x11\x10\x18\x82"
+ "\x18\x06\x3b\xe1\xbf\xeb\x6d\x89"
+ "\xb8\x51\xb5\xf3\x94\x93\x75\x2b"
+ "\xe5\x08\xf1\xbb\x44\x82\xc5\x57",
+ .clen = 32,
+ }, {
+ .key = "\xfe\xff\xe9\x92\x86\x65\x73\x1c"
+ "\x6d\x6a\x8f\x94\x67\x30\x83\x08",
+ .klen = 16,
+ .iv = "\xca\xfe\xba\xbe\xfa\xce\xdb\xad"
+ "\xde\xca\xf8\x88",
+ .ptext = "\xd9\x31\x32\x25\xf8\x84\x06\xe5"
+ "\xa5\x59\x09\xc5\xaf\xf5\x26\x9a"
+ "\x86\xa7\xa9\x53\x15\x34\xf7\xda"
+ "\x2e\x4c\x30\x3d\x8a\x31\x8a\x72"
+ "\x1c\x3c\x0c\x95\x95\x68\x09\x53"
+ "\x2f\xcf\x0e\x24\x49\xa6\xb5\x25"
+ "\xb1\x6a\xed\xf5\xaa\x0d\xe6\x57"
+ "\xba\x63\x7b\x39\x1a\xaf\xd2\x55",
+ .plen = 64,
+ .ctext = "\xe4\x11\x0f\xf1\xc1\x41\x97\xe6"
+ "\x76\x21\x6a\x33\x83\x10\x41\xeb"
+ "\x09\x58\x00\x11\x7b\xdc\x3f\x75"
+ "\x1a\x49\x6e\xfc\xf2\xbb\xdf\xdb"
+ "\x3a\x2e\x13\xfd\xc5\xc1\x9d\x07"
+ "\x1a\xe5\x48\x3f\xed\xde\x98\x5d"
+ "\x3f\x2d\x5b\x4e\xee\x0b\xb6\xdf"
+ "\xe3\x63\x36\x83\x23\xf7\x5b\x80"
+ "\x7d\xfe\x77\xef\x71\xb1\x5e\xc9"
+ "\x52\x6b\x09\xab\x84\x28\x4b\x8a",
+ .clen = 80,
+ }, {
+ .key = "\xfe\xff\xe9\x92\x86\x65\x73\x1c"
+ "\x6d\x6a\x8f\x94\x67\x30\x83\x08",
+ .klen = 16,
+ .iv = "\xca\xfe\xba\xbe\xfa\xce\xdb\xad"
+ "\xde\xca\xf8\x88",
+ .ptext = "\xd9\x31\x32\x25\xf8\x84\x06\xe5"
+ "\xa5\x59\x09\xc5\xaf\xf5\x26\x9a"
+ "\x86\xa7\xa9\x53\x15\x34\xf7\xda"
+ "\x2e\x4c\x30\x3d\x8a\x31\x8a\x72"
+ "\x1c\x3c\x0c\x95\x95\x68\x09\x53"
+ "\x2f\xcf\x0e\x24\x49\xa6\xb5\x25"
+ "\xb1\x6a\xed\xf5\xaa\x0d\xe6\x57"
+ "\xba\x63\x7b\x39",
+ .plen = 60,
+ .assoc = "\xfe\xed\xfa\xce\xde\xad\xbe\xef"
+ "\xfe\xed\xfa\xce\xde\xad\xbe\xef"
+ "\xab\xad\xda\xd2",
+ .alen = 20,
+ .ctext = "\xe4\x11\x0f\xf1\xc1\x41\x97\xe6"
+ "\x76\x21\x6a\x33\x83\x10\x41\xeb"
+ "\x09\x58\x00\x11\x7b\xdc\x3f\x75"
+ "\x1a\x49\x6e\xfc\xf2\xbb\xdf\xdb"
+ "\x3a\x2e\x13\xfd\xc5\xc1\x9d\x07"
+ "\x1a\xe5\x48\x3f\xed\xde\x98\x5d"
+ "\x3f\x2d\x5b\x4e\xee\x0b\xb6\xdf"
+ "\xe3\x63\x36\x83"
+ "\x89\xf6\xba\x35\xb8\x18\xd3\xcc"
+ "\x38\x6c\x05\xb3\x8a\xcb\xc9\xde",
+ .clen = 76,
+ }, {
+ .key = "\xfe\xff\xe9\x92\x86\x65\x73\x1c"
+ "\xfe\xff\xe9\x92\x86\x65\x73\x1c",
+ .klen = 16,
+ .iv = "\xca\xfe\xba\xbe\xfa\xce\xdb\xad"
+ "\xde\xca\xf8\x88",
+ .ptext = "\xd9\x31\x32\x25\xf8\x84\x06\xe5"
+ "\xa5\x59\x09\xc5\xaf\xf5\x26\x9a"
+ "\x86\xa7\xa9\x53\x15\x34\xf7\xda"
+ "\x2e\x4c\x30\x3d\x8a\x31\x8a\x72"
+ "\x1c\x3c\x0c\x95\x95\x68\x09\x53"
+ "\x2f\xcf\x0e\x24\x49\xa6\xb5\x25"
+ "\xb1\x6a\xed\xf5\xaa\x0d\xe6\x57"
+ "\xba\x63\x7b\x39",
+ .plen = 60,
+ .assoc = "\xfe\xed\xfa\xce\xde\xad\xbe\xef"
+ "\xfe\xed\xfa\xce\xde\xad\xbe\xef"
+ "\xab\xad\xda\xd2",
+ .alen = 20,
+ .ctext = "\xc1\x11\x44\x51\xd9\x25\x87\x5b"
+ "\x0f\xd9\x06\xf3\x33\x44\xbb\x87"
+ "\x8b\xa3\x77\xd2\x0c\x60\xfa\xcc"
+ "\x85\x50\x6f\x96\x0c\x54\x54\xc1"
+ "\x58\x04\x88\x6e\xf4\x26\x35\x7e"
+ "\x94\x80\x48\x6c\xf2\xf4\x88\x1f"
+ "\x19\x63\xea\xae\xba\x81\x1a\x5d"
+ "\x0e\x6f\x59\x08"
+ "\x33\xac\x5b\xa8\x19\x60\xdb\x1d"
+ "\xdd\x2e\x22\x2e\xe0\x87\x51\x5d",
+ .clen = 76,
+ }, {
+ .key = "\x8b\x32\xcf\xe7\x44\xed\x13\x59"
+ "\x04\x38\x77\xb0\xb9\xad\xb4\x38",
+ .klen = 16,
+ .iv = "\x00\xff\xff\xff\xff\x00\x00\xff"
+ "\xff\xff\x00\xff",
+ .ptext = "\x42\xc1\xcc\x08\x48\x6f\x41\x3f"
+ "\x2f\x11\x66\x8b\x2a\x16\xf0\xe0"
+ "\x58\x83\xf0\xc3\x70\x14\xc0\x5b"
+ "\x3f\xec\x1d\x25\x3c\x51\xd2\x03"
+ "\xcf\x59\x74\x1f\xb2\x85\xb4\x07"
+ "\xc6\x6a\x63\x39\x8a\x5b\xde\xcb"
+ "\xaf\x08\x44\xbd\x6f\x91\x15\xe1"
+ "\xf5\x7a\x6e\x18\xbd\xdd\x61\x50"
+ "\x59\xa9\x97\xab\xbb\x0e\x74\x5c"
+ "\x00\xa4\x43\x54\x04\x54\x9b\x3b"
+ "\x77\xec\xfd\x5c\xa6\xe8\x7b\x08"
+ "\xae\xe6\x10\x3f\x32\x65\xd1\xfc"
+ "\xa4\x1d\x2c\x31\xfb\x33\x7a\xb3"
+ "\x35\x23\xf4\x20\x41\xd4\xad\x82"
+ "\x8b\xa4\xad\x96\x1c\x20\x53\xbe"
+ "\x0e\xa6\xf4\xdc\x78\x49\x3e\x72"
+ "\xb1\xa9\xb5\x83\xcb\x08\x54\xb7"
+ "\xad\x49\x3a\xae\x98\xce\xa6\x66"
+ "\x10\x30\x90\x8c\x55\x83\xd7\x7c"
+ "\x8b\xe6\x53\xde\xd2\x6e\x18\x21"
+ "\x01\x52\xd1\x9f\x9d\xbb\x9c\x73"
+ "\x57\xcc\x89\x09\x75\x9b\x78\x70"
+ "\xed\x26\x97\x4d\xb4\xe4\x0c\xa5"
+ "\xfa\x70\x04\x70\xc6\x96\x1c\x7d"
+ "\x54\x41\x77\xa8\xe3\xb0\x7e\x96"
+ "\x82\xd9\xec\xa2\x87\x68\x55\xf9"
+ "\x8f\x9e\x73\x43\x47\x6a\x08\x36"
+ "\x93\x67\xa8\x2d\xde\xac\x41\xa9"
+ "\x5c\x4d\x73\x97\x0f\x70\x68\xfa"
+ "\x56\x4d\x00\xc2\x3b\x1f\xc8\xb9"
+ "\x78\x1f\x51\x07\xe3\x9a\x13\x4e"
+ "\xed\x2b\x2e\xa3\xf7\x44\xb2\xe7"
+ "\xab\x19\x37\xd9\xba\x76\x5e\xd2"
+ "\xf2\x53\x15\x17\x4c\x6b\x16\x9f"
+ "\x02\x66\x49\xca\x7c\x91\x05\xf2"
+ "\x45\x36\x1e\xf5\x77\xad\x1f\x46"
+ "\xa8\x13\xfb\x63\xb6\x08\x99\x63"
+ "\x82\xa2\xed\xb3\xac\xdf\x43\x19"
+ "\x45\xea\x78\x73\xd9\xb7\x39\x11"
+ "\xa3\x13\x7c\xf8\x3f\xf7\xad\x81"
+ "\x48\x2f\xa9\x5c\x5f\xa0\xf0\x79"
+ "\xa4\x47\x7d\x80\x20\x26\xfd\x63"
+ "\x0a\xc7\x7e\x6d\x75\x47\xff\x76"
+ "\x66\x2e\x8a\x6c\x81\x35\xaf\x0b"
+ "\x2e\x6a\x49\x60\xc1\x10\xe1\xe1"
+ "\x54\x03\xa4\x09\x0c\x37\x7a\x15"
+ "\x23\x27\x5b\x8b\x4b\xa5\x64\x97"
+ "\xae\x4a\x50\x73\x1f\x66\x1c\x5c"
+ "\x03\x25\x3c\x8d\x48\x58\x71\x34"
+ "\x0e\xec\x4e\x55\x1a\x03\x6a\xe5"
+ "\xb6\x19\x2b\x84\x2a\x20\xd1\xea"
+ "\x80\x6f\x96\x0e\x05\x62\xc7\x78"
+ "\x87\x79\x60\x38\x46\xb4\x25\x57"
+ "\x6e\x16\x63\xf8\xad\x6e\xd7\x42"
+ "\x69\xe1\x88\xef\x6e\xd5\xb4\x9a"
+ "\x3c\x78\x6c\x3b\xe5\xa0\x1d\x22"
+ "\x86\x5c\x74\x3a\xeb\x24\x26\xc7"
+ "\x09\xfc\x91\x96\x47\x87\x4f\x1a"
+ "\xd6\x6b\x2c\x18\x47\xc0\xb8\x24"
+ "\xa8\x5a\x4a\x9e\xcb\x03\xe7\x2a"
+ "\x09\xe6\x4d\x9c\x6d\x86\x60\xf5"
+ "\x2f\x48\x69\x37\x9f\xf2\xd2\xcb"
+ "\x0e\x5a\xdd\x6e\x8a\xfb\x6a\xfe"
+ "\x0b\x63\xde\x87\x42\x79\x8a\x68"
+ "\x51\x28\x9b\x7a\xeb\xaf\xb8\x2f"
+ "\x9d\xd1\xc7\x45\x90\x08\xc9\x83"
+ "\xe9\x83\x84\xcb\x28\x69\x09\x69"
+ "\xce\x99\x46\x00\x54\xcb\xd8\x38"
+ "\xf9\x53\x4a\xbf\x31\xce\x57\x15"
+ "\x33\xfa\x96\x04\x33\x42\xe3\xc0"
+ "\xb7\x54\x4a\x65\x7a\x7c\x02\xe6"
+ "\x19\x95\xd0\x0e\x82\x07\x63\xf9"
+ "\xe1\x2b\x2a\xfc\x55\x92\x52\xc9"
+ "\xb5\x9f\x23\x28\x60\xe7\x20\x51"
+ "\x10\xd3\xed\x6d\x9b\xab\xb8\xe2"
+ "\x5d\x9a\x34\xb3\xbe\x9c\x64\xcb"
+ "\x78\xc6\x91\x22\x40\x91\x80\xbe"
+ "\xd7\x78\x5c\x0e\x0a\xdc\x08\xe9"
+ "\x67\x10\xa4\x83\x98\x79\x23\xe7"
+ "\x92\xda\xa9\x22\x16\xb1\xe7\x78"
+ "\xa3\x1c\x6c\x8f\x35\x7c\x4d\x37"
+ "\x2f\x6e\x0b\x50\x5c\x34\xb9\xf9"
+ "\xe6\x3d\x91\x0d\x32\x95\xaa\x3d"
+ "\x48\x11\x06\xbb\x2d\xf2\x63\x88"
+ "\x3f\x73\x09\xe2\x45\x56\x31\x51"
+ "\xfa\x5e\x4e\x62\xf7\x90\xf9\xa9"
+ "\x7d\x7b\x1b\xb1\xc8\x26\x6e\x66"
+ "\xf6\x90\x9a\x7f\xf2\x57\xcc\x23"
+ "\x59\xfa\xfa\xaa\x44\x04\x01\xa7"
+ "\xa4\x78\xdb\x74\x3d\x8b\xb5",
+ .plen = 719,
+ .ctext = "\xdc\xb1\x0f\x2a\xe8\x2d\x1c\x57"
+ "\xc4\x82\xfa\xd6\x87\xe6\x2f\x50"
+ "\xbd\x9e\x0a\x42\x31\xf2\xc7\xbb"
+ "\x21\x63\xa7\x05\x43\x33\xef\x33"
+ "\x5c\xd3\x47\x55\xce\x5c\xe4\xd4"
+ "\xe5\x07\x62\x22\xac\x01\xa8\x35"
+ "\x9c\x59\x34\x30\x8e\xff\x9f\xb4"
+ "\xd2\x4e\x74\x90\x64\xf2\x78\x5e"
+ "\x63\xb7\xc5\x08\x1b\x37\xa5\x9e"
+ "\xc0\xde\xff\xa9\x7f\x0b\xd3\x02"
+ "\x83\x6e\x33\xfa\x43\x11\xd3\xda"
+ "\x02\xcf\xcd\x4a\xc0\x78\x1f\x39"
+ "\x62\xcb\xa3\x95\x7e\x13\x92\x28"
+ "\xb2\xc4\x7a\xba\xd1\xc6\xf6\x1f"
+ "\xda\x0b\xf1\xd1\x99\x54\xd8\x3b"
+ "\x16\xf8\xe6\x97\x1e\xa7\xcf\x49"
+ "\x69\x84\x01\x4c\xdc\x7a\x34\xff"
+ "\x01\x08\xa3\x0b\x39\xac\x21\x37"
+ "\xd8\xb4\x04\x19\x8b\x7a\x7d\x17"
+ "\x44\xd1\x18\xaf\x1f\xa9\x29\xfe"
+ "\xfa\x77\xe0\x40\x42\x0c\x79\xb7"
+ "\xc3\x15\x1b\xd9\x0c\x82\xfc\x16"
+ "\x70\xd6\x2a\xe9\x94\x72\xc5\xa5"
+ "\x8a\x58\xbc\xfa\xe0\x88\x39\x4a"
+ "\x80\xe8\xec\xaf\x60\xac\xe7\xf8"
+ "\x9c\xf0\xfc\x61\x39\x07\x98\x6b"
+ "\x88\xe3\x98\x22\x28\x18\x4a\x2d"
+ "\x25\xef\x10\xe3\x83\x66\x3f\xfd"
+ "\xc7\x0b\xa3\xfd\x97\xa9\xf4\xbd"
+ "\xd8\x2a\xee\x4a\x50\xad\xcc\xb5"
+ "\xc7\xab\xb8\x79\x9c\xd1\xf1\x27"
+ "\x08\xf5\xf5\xe8\x1b\x66\xce\x41"
+ "\x56\x60\x94\x86\xf0\x78\xc2\xfa"
+ "\x5b\x63\x40\xb1\xd1\x1a\x38\x69"
+ "\x0b\x8c\xb2\xf5\xa2\xbe\x90\x9d"
+ "\x46\x23\x79\x8b\x3b\x4a\xf4\xbb"
+ "\x55\xf7\x58\x9d\xaf\x59\xff\x74"
+ "\xf3\xb9\xc4\x26\xb1\xf8\xe1\x28"
+ "\x8b\x5e\x8f\x6d\x64\xe7\xe8\x63"
+ "\xd2\x9e\xcb\xee\xae\x19\x04\x1d"
+ "\x05\xf0\x9d\x99\x7b\x33\x33\xae"
+ "\x6e\xe5\x09\xdd\x67\x51\xc4\xc8"
+ "\x6a\xc7\x36\x35\xc9\x93\x76\xa1"
+ "\xa8\x1c\xfa\x75\x92\x34\x0e\x7d"
+ "\x3d\x1d\xef\x00\xfd\xa5\x25\x12"
+ "\x7c\x91\x21\x41\xcc\x50\x47\xa9"
+ "\x22\x50\x24\x96\x34\x79\x3d\xe8"
+ "\x3f\xa0\x56\xaf\x98\x53\x55\xc3"
+ "\x46\x1b\x17\x54\xb8\xb0\xb7\xe0"
+ "\xe0\xab\x47\x6f\x06\xda\xcc\x75"
+ "\xa7\x96\xb7\x92\xf3\xa0\x5f\xe6"
+ "\xba\x97\xe3\x2f\x97\x05\xb2\x99"
+ "\xa0\x09\x10\x98\x9c\xd3\x2e\xd1"
+ "\x7e\x2a\x30\x54\x3c\xb9\x33\xe3"
+ "\xf2\xaf\xd3\xa5\xee\xd0\x0b\x8a"
+ "\x19\x54\x0f\x02\x51\x1f\x91\xdf"
+ "\x71\x9c\xad\x77\x35\x28\x55\x6d"
+ "\xcd\x7a\xd9\xa3\x41\x98\x6b\x37"
+ "\x19\x0f\xbe\xae\x69\xb2\x25\x01"
+ "\xee\x0e\x51\x4b\x53\xea\x0f\x5f"
+ "\x85\x74\x79\x36\x32\x0a\x2a\x40"
+ "\xad\x6b\x78\x41\x54\x99\xe9\xc1"
+ "\x2b\x6c\x9b\x42\x21\xef\xe2\x50"
+ "\x56\x8d\x78\xdf\x58\xbe\x0a\x0f"
+ "\xfc\xfc\x0d\x2e\xd0\xcb\xa6\x0a"
+ "\xa8\xd9\x1e\xa9\xd4\x7c\x99\x88"
+ "\xcf\x11\xad\x1c\xd3\x04\x63\x55"
+ "\xef\x85\x0b\x69\xa1\x40\xf1\x75"
+ "\x24\xf4\xe5\x2c\xd4\x7a\x24\x50"
+ "\x8f\xa2\x71\xc9\x92\x20\xcd\xcf"
+ "\xda\x40\xbe\xf6\xfe\x1a\xca\xc7"
+ "\x4a\x80\x45\x55\xcb\xdd\xb7\x01"
+ "\xb0\x8d\xcb\xd2\xae\xbd\xa4\xd0"
+ "\x5c\x10\x05\x66\x7b\xd4\xff\xd9"
+ "\xc4\x23\x9d\x8d\x6b\x24\xf8\x3f"
+ "\x73\x4d\x5c\x2b\x33\x4c\x5e\x63"
+ "\x74\x6d\x03\xa1\x7a\x35\x65\x17"
+ "\x38\x7f\x3b\xc1\x69\xcf\x61\x34"
+ "\x30\x21\xaf\x97\x47\x12\x3f\xa1"
+ "\xa7\x50\xc5\x87\xfb\x3f\x70\x32"
+ "\x86\x17\x5f\x25\xe4\x74\xc6\xd0"
+ "\x9b\x39\xe6\xe1\x5a\xec\x8f\x40"
+ "\xce\xcc\x37\x3b\xd8\x72\x1c\x31"
+ "\x75\xa4\xa6\x89\x8c\xdd\xd6\xd2"
+ "\x32\x3d\xe8\xc3\x54\xab\x1f\x35"
+ "\x52\xb4\x94\x81\xb0\x37\x3a\x03"
+ "\xbb\xb1\x99\x30\xa5\xf8\x21\xcd"
+ "\x93\x5d\xa7\x13\xed\xc7\x49\x09"
+ "\x70\xda\x08\x39\xaa\x15\x9e\x45"
+ "\x35\x2b\x0f\x5c\x8c\x8b\xc9"
+ "\xa8\xb8\x9f\xfd\x37\x36\x31\x7e"
+ "\x34\x4f\xc1\xc0\xca\x8a\x22\xfd",
+ .clen = 735,
}
};
@@ -14947,6 +15770,282 @@ static const struct aead_testvec sm4_ccm_tv_template[] = {
"\x16\x84\x2D\x4F\xA1\x86\xF5\x6A"
"\xB3\x32\x56\x97\x1F\xA1\x10\xF4",
.clen = 80,
+ }, { /* Generated from AES-CCM test vectors */
+ .key = "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+ "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf",
+ .klen = 16,
+ .iv = "\x01\x00\x00\x00\x03\x02\x01\x00"
+ "\xa0\xa1\xa2\xa3\xa4\xa5\x00\x00",
+ .assoc = "\x00\x01\x02\x03\x04\x05\x06\x07",
+ .alen = 8,
+ .ptext = "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+ "\x10\x11\x12\x13\x14\x15\x16\x17"
+ "\x18\x19\x1a\x1b\x1c\x1d\x1e",
+ .plen = 23,
+ .ctext = "\x7b\xff\x4a\x15\xf5\x73\xce\x82"
+ "\x6e\xc2\x31\x1d\xe2\x53\x02\xac"
+ "\xa4\x48\xf9\xe4\xf5\x1f\x81\x70"
+ "\x18\xbc\xb6\x84\x01\xb8\xae",
+ .clen = 31,
+ }, {
+ .key = "\xf4\x6b\xc2\x75\x62\xfe\xb4\xe1"
+ "\x53\x14\x73\x66\x8d\x88\xf6\x80",
+ .klen = 16,
+ .iv = "\x03\xa0\x20\x35\x26\xf2\x21\x8d"
+ "\x50\x20\xda\xe2\x00\x00\x00\x00",
+ .assoc = "\x5b\x9e\x13\x67\x02\x5e\xef\xc1"
+ "\x6c\xf9\xd7\x1e\x52\x8f\x7a\x47"
+ "\xe9\xd4\xcf\x20\x14\x6e\xf0\x2d"
+ "\xd8\x9e\x2b\x56\x10\x23\x56\xe7",
+ .alen = 32,
+ .ctext = "\x23\x58\xce\xdc\x40\xb1\xcd\x92"
+ "\x47\x96\x59\xfc\x8a\x26\x4f\xcf",
+ .clen = 16,
+ }, {
+ .key = "\xab\x2f\x8a\x74\xb7\x1c\xd2\xb1"
+ "\xff\x80\x2e\x48\x7d\x82\xf8\xb9",
+ .klen = 16,
+ .iv = "\x03\xaf\x94\x87\x78\x35\x82\x81"
+ "\x7f\x88\x94\x68\x00\x00\x00\x00",
+ .alen = 0,
+ .ptext = "\x00",
+ .plen = 0,
+ .ctext = "\x72\x7e\xf5\xd6\x39\x7a\x2b\x43",
+ .clen = 8,
+ }, {
+ .key = "\x39\xbb\xa7\xbe\x59\x97\x9e\x73"
+ "\xa4\x48\x93\x39\x26\x71\x4a\xc6",
+ .klen = 16,
+ .iv = "\x03\xee\x49\x83\xe9\xa9\xff\xe9"
+ "\x57\xba\xfd\x9e\x00\x00\x00\x00",
+ .assoc = "\x44\xa6\x2c\x05\xe9\xe1\x43\xb1"
+ "\x58\x7c\xf2\x5c\x6d\x39\x0a\x64"
+ "\xa4\xf0\x13\x05\xd1\x77\x99\x67"
+ "\x11\xc4\xc6\xdb\x00\x56\x36\x61",
+ .alen = 32,
+ .ptext = "\x00",
+ .plen = 0,
+ .ctext = "\xb0\x9d\xc6\xfb\x7d\xb5\xa1\x0e",
+ .clen = 8,
+ }, {
+ .key = "\x58\x5d\xa0\x96\x65\x1a\x04\xd7"
+ "\x0d\x1a\x53\x3b\xb5\xe3\xf8\x8b",
+ .klen = 16,
+ .iv = "\x03\xcf\x76\x3f\xd9\x95\x75\x8f"
+ "\x44\x89\x40\x7b\x00\x00\x00\x00",
+ .assoc = "\x8f\x86\x6c\x4d\x1d\xc5\x39\x88"
+ "\xc8\xf3\x5c\x52\x10\x63\x6f\x2b"
+ "\x8a\x2a\xc5\x6f\x30\x23\x58\x7b"
+ "\xfb\x36\x03\x11\xb4\xd9\xf2\xfe",
+ .alen = 32,
+ .ptext = "\xc2\x54\xc8\xde\x78\x87\x77\x40"
+ "\x49\x71\xe4\xb7\xe7\xcb\x76\x61"
+ "\x0a\x41\xb9\xe9\xc0\x76\x54\xab"
+ "\x04\x49\x3b\x19\x93\x57\x25\x5d",
+ .plen = 32,
+ .ctext = "\xc9\xae\xef\x1d\xf3\x2c\xd3\x38"
+ "\xc9\x7f\x7e\x28\xe8\xaa\xb3\x60"
+ "\x49\xdc\x66\xca\x7b\x3d\xe0\x3c"
+ "\xcb\x45\x9c\x1b\xb2\xbe\x07\x90"
+ "\x87\xa6\x6b\x89\x0d\x0f\x90\xaa"
+ "\x7d\xf6\x5a\x9a\x68\x2b\x81\x92",
+ .clen = 48,
+ }, {
+ .key = "\x8b\x32\xcf\xe7\x44\xed\x13\x59"
+ "\x04\x38\x77\xb0\xb9\xad\xb4\x38",
+ .klen = 16,
+ .iv = "\x02\xff\xff\xff\xff\x00\x00\xff"
+ "\xff\xff\x00\xff\xff\x00\x00\x00",
+ .assoc = "\x8f\x86\x6c\x4d\x1d\xc5\x39\x88"
+ "\xc8\xf3\x5c\x52\x10\x63\x6f\x2b"
+ "\x8a\x2a\xc5\x6f\x30\x23\x58\x7b"
+ "\xfb\x36\x03\x11\xb4\xd9\xf2\xfe"
+ "\xc8\xf3\x5c\x52\x10\x63",
+ .alen = 38,
+ .ptext = "\x42\xc1\xcc\x08\x48\x6f\x41\x3f"
+ "\x2f\x11\x66\x8b\x2a\x16\xf0\xe0"
+ "\x58\x83\xf0\xc3\x70\x14\xc0\x5b"
+ "\x3f\xec\x1d\x25\x3c\x51\xd2\x03"
+ "\xcf\x59\x74\x1f\xb2\x85\xb4\x07"
+ "\xc6\x6a\x63\x39\x8a\x5b\xde\xcb"
+ "\xaf\x08\x44\xbd\x6f\x91\x15\xe1"
+ "\xf5\x7a\x6e\x18\xbd\xdd\x61\x50"
+ "\x59\xa9\x97\xab\xbb\x0e\x74\x5c"
+ "\x00\xa4\x43\x54\x04\x54\x9b\x3b"
+ "\x77\xec\xfd\x5c\xa6\xe8\x7b\x08"
+ "\xae\xe6\x10\x3f\x32\x65\xd1\xfc"
+ "\xa4\x1d\x2c\x31\xfb\x33\x7a\xb3"
+ "\x35\x23\xf4\x20\x41\xd4\xad\x82"
+ "\x8b\xa4\xad\x96\x1c\x20\x53\xbe"
+ "\x0e\xa6\xf4\xdc\x78\x49\x3e\x72"
+ "\xb1\xa9\xb5\x83\xcb\x08\x54\xb7"
+ "\xad\x49\x3a\xae\x98\xce\xa6\x66"
+ "\x10\x30\x90\x8c\x55\x83\xd7\x7c"
+ "\x8b\xe6\x53\xde\xd2\x6e\x18\x21"
+ "\x01\x52\xd1\x9f\x9d\xbb\x9c\x73"
+ "\x57\xcc\x89\x09\x75\x9b\x78\x70"
+ "\xed\x26\x97\x4d\xb4\xe4\x0c\xa5"
+ "\xfa\x70\x04\x70\xc6\x96\x1c\x7d"
+ "\x54\x41\x77\xa8\xe3\xb0\x7e\x96"
+ "\x82\xd9\xec\xa2\x87\x68\x55\xf9"
+ "\x8f\x9e\x73\x43\x47\x6a\x08\x36"
+ "\x93\x67\xa8\x2d\xde\xac\x41\xa9"
+ "\x5c\x4d\x73\x97\x0f\x70\x68\xfa"
+ "\x56\x4d\x00\xc2\x3b\x1f\xc8\xb9"
+ "\x78\x1f\x51\x07\xe3\x9a\x13\x4e"
+ "\xed\x2b\x2e\xa3\xf7\x44\xb2\xe7"
+ "\xab\x19\x37\xd9\xba\x76\x5e\xd2"
+ "\xf2\x53\x15\x17\x4c\x6b\x16\x9f"
+ "\x02\x66\x49\xca\x7c\x91\x05\xf2"
+ "\x45\x36\x1e\xf5\x77\xad\x1f\x46"
+ "\xa8\x13\xfb\x63\xb6\x08\x99\x63"
+ "\x82\xa2\xed\xb3\xac\xdf\x43\x19"
+ "\x45\xea\x78\x73\xd9\xb7\x39\x11"
+ "\xa3\x13\x7c\xf8\x3f\xf7\xad\x81"
+ "\x48\x2f\xa9\x5c\x5f\xa0\xf0\x79"
+ "\xa4\x47\x7d\x80\x20\x26\xfd\x63"
+ "\x0a\xc7\x7e\x6d\x75\x47\xff\x76"
+ "\x66\x2e\x8a\x6c\x81\x35\xaf\x0b"
+ "\x2e\x6a\x49\x60\xc1\x10\xe1\xe1"
+ "\x54\x03\xa4\x09\x0c\x37\x7a\x15"
+ "\x23\x27\x5b\x8b\x4b\xa5\x64\x97"
+ "\xae\x4a\x50\x73\x1f\x66\x1c\x5c"
+ "\x03\x25\x3c\x8d\x48\x58\x71\x34"
+ "\x0e\xec\x4e\x55\x1a\x03\x6a\xe5"
+ "\xb6\x19\x2b\x84\x2a\x20\xd1\xea"
+ "\x80\x6f\x96\x0e\x05\x62\xc7\x78"
+ "\x87\x79\x60\x38\x46\xb4\x25\x57"
+ "\x6e\x16\x63\xf8\xad\x6e\xd7\x42"
+ "\x69\xe1\x88\xef\x6e\xd5\xb4\x9a"
+ "\x3c\x78\x6c\x3b\xe5\xa0\x1d\x22"
+ "\x86\x5c\x74\x3a\xeb\x24\x26\xc7"
+ "\x09\xfc\x91\x96\x47\x87\x4f\x1a"
+ "\xd6\x6b\x2c\x18\x47\xc0\xb8\x24"
+ "\xa8\x5a\x4a\x9e\xcb\x03\xe7\x2a"
+ "\x09\xe6\x4d\x9c\x6d\x86\x60\xf5"
+ "\x2f\x48\x69\x37\x9f\xf2\xd2\xcb"
+ "\x0e\x5a\xdd\x6e\x8a\xfb\x6a\xfe"
+ "\x0b\x63\xde\x87\x42\x79\x8a\x68"
+ "\x51\x28\x9b\x7a\xeb\xaf\xb8\x2f"
+ "\x9d\xd1\xc7\x45\x90\x08\xc9\x83"
+ "\xe9\x83\x84\xcb\x28\x69\x09\x69"
+ "\xce\x99\x46\x00\x54\xcb\xd8\x38"
+ "\xf9\x53\x4a\xbf\x31\xce\x57\x15"
+ "\x33\xfa\x96\x04\x33\x42\xe3\xc0"
+ "\xb7\x54\x4a\x65\x7a\x7c\x02\xe6"
+ "\x19\x95\xd0\x0e\x82\x07\x63\xf9"
+ "\xe1\x2b\x2a\xfc\x55\x92\x52\xc9"
+ "\xb5\x9f\x23\x28\x60\xe7\x20\x51"
+ "\x10\xd3\xed\x6d\x9b\xab\xb8\xe2"
+ "\x5d\x9a\x34\xb3\xbe\x9c\x64\xcb"
+ "\x78\xc6\x91\x22\x40\x91\x80\xbe"
+ "\xd7\x78\x5c\x0e\x0a\xdc\x08\xe9"
+ "\x67\x10\xa4\x83\x98\x79\x23\xe7"
+ "\x92\xda\xa9\x22\x16\xb1\xe7\x78"
+ "\xa3\x1c\x6c\x8f\x35\x7c\x4d\x37"
+ "\x2f\x6e\x0b\x50\x5c\x34\xb9\xf9"
+ "\xe6\x3d\x91\x0d\x32\x95\xaa\x3d"
+ "\x48\x11\x06\xbb\x2d\xf2\x63\x88"
+ "\x3f\x73\x09\xe2\x45\x56\x31\x51"
+ "\xfa\x5e\x4e\x62\xf7\x90\xf9\xa9"
+ "\x7d\x7b\x1b\xb1\xc8\x26\x6e\x66"
+ "\xf6\x90\x9a\x7f\xf2\x57\xcc\x23"
+ "\x59\xfa\xfa\xaa\x44\x04\x01\xa7"
+ "\xa4\x78\xdb\x74\x3d\x8b\xb5",
+ .plen = 719,
+ .ctext = "\xc5\x50\x85\x02\x72\xa8\xb3\x62"
+ "\xf9\xcd\x77\x7b\x43\xa5\x04\x70"
+ "\x68\x40\x57\x21\x1c\xfe\xef\x05"
+ "\x4d\xb8\x44\xba\x59\xea\x62\x32"
+ "\xcb\x6b\x6a\x39\x9b\xf3\xe5\xa4"
+ "\x36\x38\xde\x7d\xcf\xb6\xcd\xe3"
+ "\x89\xbf\x37\xc9\x96\x3c\x70\x10"
+ "\x92\x47\xcc\xac\x6f\xf8\x55\x9a"
+ "\x26\x43\x34\xb4\x92\x7d\x68\xfc"
+ "\x60\x37\x74\x2a\x55\xba\xc7\xd7"
+ "\x98\x69\xb7\xcf\x42\xfd\xb2\x10"
+ "\xa0\x59\xe1\x2c\x73\x66\x12\x97"
+ "\x85\x8b\x28\xcc\x29\x02\x15\x89"
+ "\x23\xd3\x32\x92\x87\x57\x09\x13"
+ "\x04\x7e\x8b\x6c\x3a\xc1\x4e\x6c"
+ "\xe1\x9f\xc8\xcc\x47\x9c\xd8\x10"
+ "\xf4\xb7\x5c\x30\x7a\x8b\x0f\x01"
+ "\x52\x38\x02\x92\x99\xac\x03\x90"
+ "\x18\x32\x2d\x21\x6a\x0a\x2a\xe7"
+ "\xc2\xcc\x15\x84\x4e\x2b\x0b\x3a"
+ "\x4c\xdc\xb0\x6b\x10\xd1\x27\x10"
+ "\xf0\x4a\x5c\x43\xa0\x34\x34\x59"
+ "\x47\x43\x48\xcb\x69\xa7\xff\x52"
+ "\xb8\xca\x23\x09\x07\xd7\xc5\xe4"
+ "\x2a\x4f\x99\xd5\x83\x36\x2a\x2d"
+ "\x59\xd0\xca\xb0\xfa\x40\x8c\xab"
+ "\xdf\x69\x08\xd9\x79\x1d\xde\xa8"
+ "\x0b\x34\x74\x4d\xf5\xa0\x4c\x81"
+ "\x7f\x93\x06\x40\x24\xfe\x7d\xcd"
+ "\xe4\xfe\xf8\xf8\x30\xce\xd0\x5d"
+ "\x70\xfd\x0d\x5a\x78\x85\x74\x2d"
+ "\xe4\xb5\x40\x18\x99\x11\xe4\x6a"
+ "\xdf\xfa\x4f\x25\x2c\xde\x15\xb7"
+ "\x12\xd8\xc6\x90\x0d\x0f\xc9\xfb"
+ "\x21\xf1\xed\xfe\x98\xe1\x03\xe2"
+ "\x5c\xef\xb6\xc7\x87\x77\x0e\xcd"
+ "\xff\x78\x94\xc9\xbe\xd3\x47\xf7"
+ "\x8d\x37\x48\x01\x42\xe2\x17\x96"
+ "\xfc\xc0\xcb\x7b\x7b\x57\xaf\x3b"
+ "\xc9\xd0\x94\xce\x5e\x1b\xa9\x47"
+ "\x02\x4d\x74\xcc\x45\x1d\xd3\x2d"
+ "\x5f\x4f\x7f\xf2\x4b\xf9\x59\xee"
+ "\x9e\x9e\xb9\x95\x29\x19\xd1\x5f"
+ "\x72\xab\x8d\xf1\x28\xd1\x1c\xae"
+ "\xc2\xba\xf7\x22\x84\x2c\x83\x51"
+ "\x03\xad\xa3\xef\x81\xa7\xdc\xf1"
+ "\x44\x51\x50\x96\x70\xd1\xe5\x47"
+ "\x57\xf9\x30\x90\xe4\xbf\xfc\x75"
+ "\x14\xaa\x4d\xb7\xb1\xe7\x79\x33"
+ "\x43\xc2\x5c\xc1\xbc\x09\x92\x0f"
+ "\xa7\xaf\x68\x51\x51\xec\x0b\xc3"
+ "\x3d\x2b\x94\x30\x45\x29\x1b\x9e"
+ "\x70\x56\xf8\xd6\x67\x2d\x39\x3b"
+ "\x3c\xd2\xd0\xd3\xdc\x7d\x84\xe9"
+ "\x06\x31\x98\xa6\x5c\xbf\x10\x58"
+ "\xce\xbb\xa7\xe1\x65\x7e\x51\x87"
+ "\x70\x46\xb4\x7f\xf9\xec\x92\x1c"
+ "\x9b\x24\x49\xc1\x04\xbe\x1c\x5f"
+ "\xcc\xb3\x33\x8c\xad\xe7\xdc\x32"
+ "\x54\xa2\x0d\x83\x0f\x3c\x12\x5d"
+ "\x71\xe3\x9c\xae\x71\xa3\x2a\x10"
+ "\xc5\x91\xb4\x73\x96\x60\xdb\x5d"
+ "\x1f\xd5\x9a\xd2\x69\xc3\xd7\x4b"
+ "\xa2\x66\x81\x96\x4a\xaa\x02\xd6"
+ "\xd5\x44\x9b\x42\x3a\x15\x5f\xe7"
+ "\x4d\x7c\xf6\x71\x4a\xea\xe8\x43"
+ "\xd7\x68\xe4\xbc\x05\x87\x49\x05"
+ "\x3b\x47\xb2\x6d\x5f\xd1\x11\xa6"
+ "\x58\xd4\xa2\x45\xec\xb5\x54\x55"
+ "\xd3\xd6\xd2\x6a\x8b\x21\x9e\x2c"
+ "\xf1\x27\x4b\x5b\xe3\xff\xe0\xfd"
+ "\x4b\xf1\xe7\xe2\x84\xf2\x17\x37"
+ "\x11\x68\xc4\x92\x4b\x6b\xef\x8e"
+ "\x75\xf5\xc2\x7d\x5c\xe9\x7c\xfc"
+ "\x2b\x00\x33\x0e\x7d\x69\xd8\xd4"
+ "\x9b\xa8\x38\x54\x7e\x6d\x23\x51"
+ "\x2c\xd6\xc4\x58\x23\x1c\x22\x2a"
+ "\x59\xc5\x9b\xec\x9d\xbf\x03\x0f"
+ "\xb3\xdd\xba\x02\x22\xa0\x34\x37"
+ "\x19\x56\xc2\x5b\x32\x1d\x1e\x66"
+ "\x68\xf4\x47\x05\x04\x18\xa7\x28"
+ "\x80\xf2\xc7\x99\xed\x1e\x72\x48"
+ "\x8f\x97\x5d\xb3\x74\x42\xfd\x0c"
+ "\x0f\x5f\x29\x0c\xf1\x35\x22\x90"
+ "\xd6\x7c\xb8\xa3\x2a\x89\x38\x71"
+ "\xe9\x7a\x55\x3c\x3b\xf2\x6e\x1a"
+ "\x22\x8f\x07\x81\xc1\xe1\xf1\x76"
+ "\x2a\x75\xab\x86\xc4\xcc\x52\x59"
+ "\x83\x19\x5e\xb3\x53\xe2\x81\xdf"
+ "\xe6\x15\xb3\xba\x0c\x0e\xba"
+ "\xa9\x2c\xed\x51\xd5\x06\xc8\xc6"
+ "\x4b\x9f\x5d\x1b\x61\x31\xad\xf4",
+ .clen = 735,
}
};
@@ -15030,6 +16129,68 @@ static const struct hash_testvec sm4_cmac128_tv_template[] = {
}
};
+static const struct hash_testvec sm4_xcbc128_tv_template[] = {
+ { /* Generated from AES-XCBC128 test vectors */
+ .key = "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+ .plaintext = zeroed_string,
+ .digest = "\xa9\x9a\x5c\x44\xe2\x34\xee\x2c"
+ "\x9b\xe4\x9d\xca\x64\xb0\xa5\xc4",
+ .psize = 0,
+ .ksize = 16,
+ }, {
+ .key = "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+ .plaintext = "\x00\x01\x02",
+ .digest = "\x17\x27\x62\xf3\x8b\x88\x1d\xc0"
+ "\x97\x35\x9c\x3e\x9f\x27\xb7\x83",
+ .psize = 3,
+ .ksize = 16,
+ }, {
+ .key = "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+ .plaintext = "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+ .digest = "\xda\x45\xd1\xac\xec\x4d\xab\x46"
+ "\xdd\x59\xe0\x44\xff\x59\xd5\xfc",
+ .psize = 16,
+ .ksize = 16,
+ }, {
+ .key = "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+ .plaintext = "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+ "\x10\x11\x12\x13",
+ .digest = "\xbe\x24\x5d\x81\x8c\x8a\x10\xa4"
+ "\x8e\xc2\x16\xfa\xa4\x83\xc9\x2a",
+ .psize = 20,
+ .ksize = 16,
+ }, {
+ .key = "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+ .plaintext = "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+ "\x10\x11\x12\x13\x14\x15\x16\x17"
+ "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f",
+ .digest = "\x91\x82\x31\x56\xd5\x77\xa4\xc5"
+ "\x88\x2d\xce\x3a\x87\x5e\xbd\xba",
+ .psize = 32,
+ .ksize = 16,
+ }, {
+ .key = "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+ .plaintext = "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+ "\x10\x11\x12\x13\x14\x15\x16\x17"
+ "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+ "\x20\x21",
+ .digest = "\x2a\xae\xa5\x24\x0c\x12\x9f\x5f"
+ "\x55\xfb\xae\x35\x13\x0d\x22\x2d",
+ .psize = 34,
+ .ksize = 16,
+ }
+};
+
/* Cast6 test vectors from RFC 2612 */
static const struct cipher_testvec cast6_tv_template[] = {
{
--
2.24.3 (Apple Git-128)
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
* [PATCH 05/16] crypto: tcrypt - add SM4 cts-cbc/essiv/xts/xcbc test
2022-09-26 9:36 ` Tianjia Zhang
@ 2022-09-26 9:36 ` Tianjia Zhang
-1 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26 9:36 UTC (permalink / raw)
To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
linux-stm32
Add CTS-CBC/ESSIV/XTS/XCBC tests for the SM4 algorithm, together with the
corresponding speed tests, in order to exercise the performance-optimized
implementations of these modes.
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
crypto/tcrypt.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
index a82679b576bb..b870b2fe716d 100644
--- a/crypto/tcrypt.c
+++ b/crypto/tcrypt.c
@@ -1711,6 +1711,10 @@ static int do_test(const char *alg, u32 type, u32 mask, int m, u32 num_mb)
ret += tcrypt_test("gcm(aria)");
break;
+ case 59:
+ ret += tcrypt_test("cts(cbc(sm4))");
+ break;
+
case 100:
ret += tcrypt_test("hmac(md5)");
break;
@@ -1811,6 +1815,10 @@ static int do_test(const char *alg, u32 type, u32 mask, int m, u32 num_mb)
ret += tcrypt_test("cmac(sm4)");
break;
+ case 160:
+ ret += tcrypt_test("xcbc(sm4)");
+ break;
+
case 181:
ret += tcrypt_test("authenc(hmac(sha1),cbc(des))");
break;
@@ -1846,6 +1854,7 @@ static int do_test(const char *alg, u32 type, u32 mask, int m, u32 num_mb)
ret += tcrypt_test("cbc(sm4)");
ret += tcrypt_test("cfb(sm4)");
ret += tcrypt_test("ctr(sm4)");
+ ret += tcrypt_test("xts(sm4)");
break;
case 192:
ret += tcrypt_test("ecb(aria)");
@@ -2109,6 +2118,10 @@ static int do_test(const char *alg, u32 type, u32 mask, int m, u32 num_mb)
speed_template_16);
test_cipher_speed("cbc(sm4)", DECRYPT, sec, NULL, 0,
speed_template_16);
+ test_cipher_speed("cts(cbc(sm4))", ENCRYPT, sec, NULL, 0,
+ speed_template_16);
+ test_cipher_speed("cts(cbc(sm4))", DECRYPT, sec, NULL, 0,
+ speed_template_16);
test_cipher_speed("cfb(sm4)", ENCRYPT, sec, NULL, 0,
speed_template_16);
test_cipher_speed("cfb(sm4)", DECRYPT, sec, NULL, 0,
@@ -2117,6 +2130,10 @@ static int do_test(const char *alg, u32 type, u32 mask, int m, u32 num_mb)
speed_template_16);
test_cipher_speed("ctr(sm4)", DECRYPT, sec, NULL, 0,
speed_template_16);
+ test_cipher_speed("xts(sm4)", ENCRYPT, sec, NULL, 0,
+ speed_template_32);
+ test_cipher_speed("xts(sm4)", DECRYPT, sec, NULL, 0,
+ speed_template_32);
break;
case 219:
@@ -2212,6 +2229,13 @@ static int do_test(const char *alg, u32 type, u32 mask, int m, u32 num_mb)
speed_template_16, num_mb);
break;
+ case 230:
+ test_acipher_speed("essiv(cbc(sm4),sm3)", ENCRYPT, sec,
+ NULL, 0, speed_template_16);
+ test_acipher_speed("essiv(cbc(sm4),sm3)", DECRYPT, sec,
+ NULL, 0, speed_template_16);
+ break;
+
case 300:
if (alg) {
test_hash_speed(alg, sec, generic_hash_speed_template);
@@ -2630,6 +2654,10 @@ static int do_test(const char *alg, u32 type, u32 mask, int m, u32 num_mb)
speed_template_16);
test_acipher_speed("ctr(sm4)", DECRYPT, sec, NULL, 0,
speed_template_16);
+ test_acipher_speed("xts(sm4)", ENCRYPT, sec, NULL, 0,
+ speed_template_32);
+ test_acipher_speed("xts(sm4)", DECRYPT, sec, NULL, 0,
+ speed_template_32);
break;
case 519:
--
2.24.3 (Apple Git-128)
* [PATCH 06/16] crypto: arm64/sm4 - refactor and simplify CE implementation
2022-09-26 9:36 ` Tianjia Zhang
@ 2022-09-26 9:36 ` Tianjia Zhang
-1 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26 9:36 UTC (permalink / raw)
To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
linux-stm32
This patch adds no new features; it only refactors and simplifies the
Crypto Extension accelerated implementation of the SM4 algorithm:
Extract the SM4 Crypto Extension helper macros into a separate header so
that they can be reused by the subsequent CCM/GCM optimizations.
Encryption in CBC and CFB modes now processes four blocks at a time instead
of one, allowing a single ld1 instruction to load 64 bytes of data, which
reduces unnecessary memory accesses.
CBC/CFB/CTR make full use of free registers to reduce redundant memory
accesses, and some instructions are rearranged to make better use of
out-of-order execution.
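For readers who want the data flow without reading the assembly, the CBC
changes described above can be sketched in portable C. This is only an
illustration under stated assumptions: toy_encrypt()/toy_decrypt() are a
hypothetical invertible byte permutation standing in for the SM4 block
function (they are NOT SM4), and every function name below is invented for
this sketch. It shows that batching four blocks in CBC encryption keeps the
serial chaining but turns four 16-byte loads and stores into one 64-byte
load and store, and that CBC decryption can keep the ciphertext blocks in
spare "registers" instead of re-reading them from memory.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLK 16

/* Hypothetical toy permutation standing in for SM4; it is NOT SM4. */
static void toy_encrypt(const uint8_t *key, uint8_t *blk)
{
	for (int i = 0; i < BLK; i++) {
		uint8_t x = blk[i] ^ key[i];
		blk[i] = (uint8_t)((x << 3 | x >> 5) + 0x5a);
	}
}

static void toy_decrypt(const uint8_t *key, uint8_t *blk)
{
	for (int i = 0; i < BLK; i++) {
		uint8_t x = (uint8_t)(blk[i] - 0x5a);
		blk[i] = (uint8_t)(x >> 3 | x << 5) ^ key[i];
	}
}

static void xor_blk(uint8_t *dst, const uint8_t *src)
{
	for (int i = 0; i < BLK; i++)
		dst[i] ^= src[i];
}

/* Reference path: one block per iteration, iv carries the chaining value. */
static void cbc_enc_1x(const uint8_t *key, uint8_t *dst, const uint8_t *src,
		       uint8_t *iv, size_t nblocks)
{
	for (; nblocks; nblocks--, src += BLK, dst += BLK) {
		xor_blk(iv, src);
		toy_encrypt(key, iv);
		memcpy(dst, iv, BLK);
	}
}

/* Batched path: one 64-byte load/store per four blocks, same chaining. */
static void cbc_enc_4x(const uint8_t *key, uint8_t *dst, const uint8_t *src,
		       uint8_t *iv, size_t nblocks)
{
	while (nblocks >= 4) {
		uint8_t b[4][BLK];

		memcpy(b, src, sizeof(b));		/* like ld1 {v0-v3} */
		xor_blk(b[0], iv);
		toy_encrypt(key, b[0]);
		for (int k = 1; k < 4; k++) {
			xor_blk(b[k], b[k - 1]);	/* chain on previous output */
			toy_encrypt(key, b[k]);
		}
		memcpy(dst, b, sizeof(b));		/* like st1 {v0-v3} */
		memcpy(iv, b[3], BLK);
		src += 4 * BLK; dst += 4 * BLK; nblocks -= 4;
	}
	cbc_enc_1x(key, dst, src, iv, nblocks);		/* tail blocks */
}

/* Decrypt: ciphertext stays in c[], work happens in t[], no re-read of src. */
static void cbc_dec(const uint8_t *key, uint8_t *dst, const uint8_t *src,
		    uint8_t *iv, size_t nblocks)
{
	while (nblocks) {
		size_t n = nblocks >= 4 ? 4 : 1;
		uint8_t c[4][BLK], t[4][BLK];

		memcpy(c, src, n * BLK);
		memcpy(t, c, n * BLK);
		for (size_t k = 0; k < n; k++)
			toy_decrypt(key, t[k]);		/* independent blocks */
		xor_blk(t[0], iv);
		for (size_t k = 1; k < n; k++)
			xor_blk(t[k], c[k - 1]);
		memcpy(dst, t, n * BLK);
		memcpy(iv, c[n - 1], BLK);		/* next IV = last ciphertext */
		src += n * BLK; dst += n * BLK; nblocks -= n;
	}
}

/* Returns 0 if both encrypt paths agree and decryption round-trips. */
static int cbc_selftest(void)
{
	uint8_t key[BLK], iv0[BLK], iv[BLK];
	uint8_t pt[7 * BLK], c1[7 * BLK], c4[7 * BLK], out[7 * BLK];

	for (int i = 0; i < BLK; i++) {
		key[i] = (uint8_t)(i * 17 + 1);
		iv0[i] = (uint8_t)(0xa0 + i);
	}
	for (size_t i = 0; i < sizeof(pt); i++)
		pt[i] = (uint8_t)(i * 7);

	memcpy(iv, iv0, BLK); cbc_enc_1x(key, c1, pt, iv, 7);
	memcpy(iv, iv0, BLK); cbc_enc_4x(key, c4, pt, iv, 7);
	if (memcmp(c1, c4, sizeof(c1)))
		return -1;
	memcpy(iv, iv0, BLK); cbc_dec(key, out, c4, iv, 7);
	return memcmp(out, pt, sizeof(pt)) ? -1 : 0;
}
```

Calling cbc_selftest() from a small main() is enough to exercise the sketch.
The separate c[] and t[] arrays play the role that v0-v7 and v8-v15 play in
the reworked decryption loops: the ciphertext copy survives the block
operation, so it does not have to be loaded a second time for the XOR.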
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
arch/arm64/crypto/sm4-ce-asm.h | 209 +++++++++++
arch/arm64/crypto/sm4-ce-core.S | 646 ++++++++++++++------------------
arch/arm64/crypto/sm4-ce-glue.c | 64 ++--
3 files changed, 519 insertions(+), 400 deletions(-)
create mode 100644 arch/arm64/crypto/sm4-ce-asm.h
diff --git a/arch/arm64/crypto/sm4-ce-asm.h b/arch/arm64/crypto/sm4-ce-asm.h
new file mode 100644
index 000000000000..7ea98e42e779
--- /dev/null
+++ b/arch/arm64/crypto/sm4-ce-asm.h
@@ -0,0 +1,209 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * SM4 helper macros for Crypto Extensions
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#define SM4_PREPARE(ptr) \
+ ld1 {v24.16b-v27.16b}, [ptr], #64; \
+ ld1 {v28.16b-v31.16b}, [ptr];
+
+#define SM4_CRYPT_BLK_BE(b0) \
+ sm4e b0.4s, v24.4s; \
+ sm4e b0.4s, v25.4s; \
+ sm4e b0.4s, v26.4s; \
+ sm4e b0.4s, v27.4s; \
+ sm4e b0.4s, v28.4s; \
+ sm4e b0.4s, v29.4s; \
+ sm4e b0.4s, v30.4s; \
+ sm4e b0.4s, v31.4s; \
+ rev64 b0.4s, b0.4s; \
+ ext b0.16b, b0.16b, b0.16b, #8; \
+ rev32 b0.16b, b0.16b;
+
+#define SM4_CRYPT_BLK(b0) \
+ rev32 b0.16b, b0.16b; \
+ SM4_CRYPT_BLK_BE(b0);
+
+#define SM4_CRYPT_BLK2_BE(b0, b1) \
+ sm4e b0.4s, v24.4s; \
+ sm4e b1.4s, v24.4s; \
+ sm4e b0.4s, v25.4s; \
+ sm4e b1.4s, v25.4s; \
+ sm4e b0.4s, v26.4s; \
+ sm4e b1.4s, v26.4s; \
+ sm4e b0.4s, v27.4s; \
+ sm4e b1.4s, v27.4s; \
+ sm4e b0.4s, v28.4s; \
+ sm4e b1.4s, v28.4s; \
+ sm4e b0.4s, v29.4s; \
+ sm4e b1.4s, v29.4s; \
+ sm4e b0.4s, v30.4s; \
+ sm4e b1.4s, v30.4s; \
+ sm4e b0.4s, v31.4s; \
+ sm4e b1.4s, v31.4s; \
+ rev64 b0.4s, b0.4s; \
+ rev64 b1.4s, b1.4s; \
+ ext b0.16b, b0.16b, b0.16b, #8; \
+ ext b1.16b, b1.16b, b1.16b, #8; \
+ rev32 b0.16b, b0.16b; \
+ rev32 b1.16b, b1.16b; \
+
+#define SM4_CRYPT_BLK2(b0, b1) \
+ rev32 b0.16b, b0.16b; \
+ rev32 b1.16b, b1.16b; \
+ SM4_CRYPT_BLK2_BE(b0, b1);
+
+#define SM4_CRYPT_BLK4_BE(b0, b1, b2, b3) \
+ sm4e b0.4s, v24.4s; \
+ sm4e b1.4s, v24.4s; \
+ sm4e b2.4s, v24.4s; \
+ sm4e b3.4s, v24.4s; \
+ sm4e b0.4s, v25.4s; \
+ sm4e b1.4s, v25.4s; \
+ sm4e b2.4s, v25.4s; \
+ sm4e b3.4s, v25.4s; \
+ sm4e b0.4s, v26.4s; \
+ sm4e b1.4s, v26.4s; \
+ sm4e b2.4s, v26.4s; \
+ sm4e b3.4s, v26.4s; \
+ sm4e b0.4s, v27.4s; \
+ sm4e b1.4s, v27.4s; \
+ sm4e b2.4s, v27.4s; \
+ sm4e b3.4s, v27.4s; \
+ sm4e b0.4s, v28.4s; \
+ sm4e b1.4s, v28.4s; \
+ sm4e b2.4s, v28.4s; \
+ sm4e b3.4s, v28.4s; \
+ sm4e b0.4s, v29.4s; \
+ sm4e b1.4s, v29.4s; \
+ sm4e b2.4s, v29.4s; \
+ sm4e b3.4s, v29.4s; \
+ sm4e b0.4s, v30.4s; \
+ sm4e b1.4s, v30.4s; \
+ sm4e b2.4s, v30.4s; \
+ sm4e b3.4s, v30.4s; \
+ sm4e b0.4s, v31.4s; \
+ sm4e b1.4s, v31.4s; \
+ sm4e b2.4s, v31.4s; \
+ sm4e b3.4s, v31.4s; \
+ rev64 b0.4s, b0.4s; \
+ rev64 b1.4s, b1.4s; \
+ rev64 b2.4s, b2.4s; \
+ rev64 b3.4s, b3.4s; \
+ ext b0.16b, b0.16b, b0.16b, #8; \
+ ext b1.16b, b1.16b, b1.16b, #8; \
+ ext b2.16b, b2.16b, b2.16b, #8; \
+ ext b3.16b, b3.16b, b3.16b, #8; \
+ rev32 b0.16b, b0.16b; \
+ rev32 b1.16b, b1.16b; \
+ rev32 b2.16b, b2.16b; \
+ rev32 b3.16b, b3.16b;
+
+#define SM4_CRYPT_BLK4(b0, b1, b2, b3) \
+ rev32 b0.16b, b0.16b; \
+ rev32 b1.16b, b1.16b; \
+ rev32 b2.16b, b2.16b; \
+ rev32 b3.16b, b3.16b; \
+ SM4_CRYPT_BLK4_BE(b0, b1, b2, b3);
+
+#define SM4_CRYPT_BLK8_BE(b0, b1, b2, b3, b4, b5, b6, b7) \
+ sm4e b0.4s, v24.4s; \
+ sm4e b1.4s, v24.4s; \
+ sm4e b2.4s, v24.4s; \
+ sm4e b3.4s, v24.4s; \
+ sm4e b4.4s, v24.4s; \
+ sm4e b5.4s, v24.4s; \
+ sm4e b6.4s, v24.4s; \
+ sm4e b7.4s, v24.4s; \
+ sm4e b0.4s, v25.4s; \
+ sm4e b1.4s, v25.4s; \
+ sm4e b2.4s, v25.4s; \
+ sm4e b3.4s, v25.4s; \
+ sm4e b4.4s, v25.4s; \
+ sm4e b5.4s, v25.4s; \
+ sm4e b6.4s, v25.4s; \
+ sm4e b7.4s, v25.4s; \
+ sm4e b0.4s, v26.4s; \
+ sm4e b1.4s, v26.4s; \
+ sm4e b2.4s, v26.4s; \
+ sm4e b3.4s, v26.4s; \
+ sm4e b4.4s, v26.4s; \
+ sm4e b5.4s, v26.4s; \
+ sm4e b6.4s, v26.4s; \
+ sm4e b7.4s, v26.4s; \
+ sm4e b0.4s, v27.4s; \
+ sm4e b1.4s, v27.4s; \
+ sm4e b2.4s, v27.4s; \
+ sm4e b3.4s, v27.4s; \
+ sm4e b4.4s, v27.4s; \
+ sm4e b5.4s, v27.4s; \
+ sm4e b6.4s, v27.4s; \
+ sm4e b7.4s, v27.4s; \
+ sm4e b0.4s, v28.4s; \
+ sm4e b1.4s, v28.4s; \
+ sm4e b2.4s, v28.4s; \
+ sm4e b3.4s, v28.4s; \
+ sm4e b4.4s, v28.4s; \
+ sm4e b5.4s, v28.4s; \
+ sm4e b6.4s, v28.4s; \
+ sm4e b7.4s, v28.4s; \
+ sm4e b0.4s, v29.4s; \
+ sm4e b1.4s, v29.4s; \
+ sm4e b2.4s, v29.4s; \
+ sm4e b3.4s, v29.4s; \
+ sm4e b4.4s, v29.4s; \
+ sm4e b5.4s, v29.4s; \
+ sm4e b6.4s, v29.4s; \
+ sm4e b7.4s, v29.4s; \
+ sm4e b0.4s, v30.4s; \
+ sm4e b1.4s, v30.4s; \
+ sm4e b2.4s, v30.4s; \
+ sm4e b3.4s, v30.4s; \
+ sm4e b4.4s, v30.4s; \
+ sm4e b5.4s, v30.4s; \
+ sm4e b6.4s, v30.4s; \
+ sm4e b7.4s, v30.4s; \
+ sm4e b0.4s, v31.4s; \
+ sm4e b1.4s, v31.4s; \
+ sm4e b2.4s, v31.4s; \
+ sm4e b3.4s, v31.4s; \
+ sm4e b4.4s, v31.4s; \
+ sm4e b5.4s, v31.4s; \
+ sm4e b6.4s, v31.4s; \
+ sm4e b7.4s, v31.4s; \
+ rev64 b0.4s, b0.4s; \
+ rev64 b1.4s, b1.4s; \
+ rev64 b2.4s, b2.4s; \
+ rev64 b3.4s, b3.4s; \
+ rev64 b4.4s, b4.4s; \
+ rev64 b5.4s, b5.4s; \
+ rev64 b6.4s, b6.4s; \
+ rev64 b7.4s, b7.4s; \
+ ext b0.16b, b0.16b, b0.16b, #8; \
+ ext b1.16b, b1.16b, b1.16b, #8; \
+ ext b2.16b, b2.16b, b2.16b, #8; \
+ ext b3.16b, b3.16b, b3.16b, #8; \
+ ext b4.16b, b4.16b, b4.16b, #8; \
+ ext b5.16b, b5.16b, b5.16b, #8; \
+ ext b6.16b, b6.16b, b6.16b, #8; \
+ ext b7.16b, b7.16b, b7.16b, #8; \
+ rev32 b0.16b, b0.16b; \
+ rev32 b1.16b, b1.16b; \
+ rev32 b2.16b, b2.16b; \
+ rev32 b3.16b, b3.16b; \
+ rev32 b4.16b, b4.16b; \
+ rev32 b5.16b, b5.16b; \
+ rev32 b6.16b, b6.16b; \
+ rev32 b7.16b, b7.16b;
+
+#define SM4_CRYPT_BLK8(b0, b1, b2, b3, b4, b5, b6, b7) \
+ rev32 b0.16b, b0.16b; \
+ rev32 b1.16b, b1.16b; \
+ rev32 b2.16b, b2.16b; \
+ rev32 b3.16b, b3.16b; \
+ rev32 b4.16b, b4.16b; \
+ rev32 b5.16b, b5.16b; \
+ rev32 b6.16b, b6.16b; \
+ rev32 b7.16b, b7.16b; \
+ SM4_CRYPT_BLK8_BE(b0, b1, b2, b3, b4, b5, b6, b7);
diff --git a/arch/arm64/crypto/sm4-ce-core.S b/arch/arm64/crypto/sm4-ce-core.S
index 934e0f093279..41fc745a8528 100644
--- a/arch/arm64/crypto/sm4-ce-core.S
+++ b/arch/arm64/crypto/sm4-ce-core.S
@@ -10,10 +10,12 @@
#include <linux/linkage.h>
#include <asm/assembler.h>
+#include "sm4-ce-asm.h"
.arch armv8-a+crypto
-.irp b, 0, 1, 2, 3, 4, 5, 6, 7, 16, 20, 24, 25, 26, 27, 28, 29, 30, 31
+.irp b, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, \
+ 20, 24, 25, 26, 27, 28, 29, 30, 31
.set .Lv\b\().4s, \b
.endr
@@ -34,174 +36,6 @@
#define RIV v20
-/* Helper macros. */
-
-#define PREPARE \
- ld1 {v24.16b-v27.16b}, [x0], #64; \
- ld1 {v28.16b-v31.16b}, [x0];
-
-#define SM4_CRYPT_BLK(b0) \
- rev32 b0.16b, b0.16b; \
- sm4e b0.4s, v24.4s; \
- sm4e b0.4s, v25.4s; \
- sm4e b0.4s, v26.4s; \
- sm4e b0.4s, v27.4s; \
- sm4e b0.4s, v28.4s; \
- sm4e b0.4s, v29.4s; \
- sm4e b0.4s, v30.4s; \
- sm4e b0.4s, v31.4s; \
- rev64 b0.4s, b0.4s; \
- ext b0.16b, b0.16b, b0.16b, #8; \
- rev32 b0.16b, b0.16b;
-
-#define SM4_CRYPT_BLK4(b0, b1, b2, b3) \
- rev32 b0.16b, b0.16b; \
- rev32 b1.16b, b1.16b; \
- rev32 b2.16b, b2.16b; \
- rev32 b3.16b, b3.16b; \
- sm4e b0.4s, v24.4s; \
- sm4e b1.4s, v24.4s; \
- sm4e b2.4s, v24.4s; \
- sm4e b3.4s, v24.4s; \
- sm4e b0.4s, v25.4s; \
- sm4e b1.4s, v25.4s; \
- sm4e b2.4s, v25.4s; \
- sm4e b3.4s, v25.4s; \
- sm4e b0.4s, v26.4s; \
- sm4e b1.4s, v26.4s; \
- sm4e b2.4s, v26.4s; \
- sm4e b3.4s, v26.4s; \
- sm4e b0.4s, v27.4s; \
- sm4e b1.4s, v27.4s; \
- sm4e b2.4s, v27.4s; \
- sm4e b3.4s, v27.4s; \
- sm4e b0.4s, v28.4s; \
- sm4e b1.4s, v28.4s; \
- sm4e b2.4s, v28.4s; \
- sm4e b3.4s, v28.4s; \
- sm4e b0.4s, v29.4s; \
- sm4e b1.4s, v29.4s; \
- sm4e b2.4s, v29.4s; \
- sm4e b3.4s, v29.4s; \
- sm4e b0.4s, v30.4s; \
- sm4e b1.4s, v30.4s; \
- sm4e b2.4s, v30.4s; \
- sm4e b3.4s, v30.4s; \
- sm4e b0.4s, v31.4s; \
- sm4e b1.4s, v31.4s; \
- sm4e b2.4s, v31.4s; \
- sm4e b3.4s, v31.4s; \
- rev64 b0.4s, b0.4s; \
- rev64 b1.4s, b1.4s; \
- rev64 b2.4s, b2.4s; \
- rev64 b3.4s, b3.4s; \
- ext b0.16b, b0.16b, b0.16b, #8; \
- ext b1.16b, b1.16b, b1.16b, #8; \
- ext b2.16b, b2.16b, b2.16b, #8; \
- ext b3.16b, b3.16b, b3.16b, #8; \
- rev32 b0.16b, b0.16b; \
- rev32 b1.16b, b1.16b; \
- rev32 b2.16b, b2.16b; \
- rev32 b3.16b, b3.16b;
-
-#define SM4_CRYPT_BLK8(b0, b1, b2, b3, b4, b5, b6, b7) \
- rev32 b0.16b, b0.16b; \
- rev32 b1.16b, b1.16b; \
- rev32 b2.16b, b2.16b; \
- rev32 b3.16b, b3.16b; \
- rev32 b4.16b, b4.16b; \
- rev32 b5.16b, b5.16b; \
- rev32 b6.16b, b6.16b; \
- rev32 b7.16b, b7.16b; \
- sm4e b0.4s, v24.4s; \
- sm4e b1.4s, v24.4s; \
- sm4e b2.4s, v24.4s; \
- sm4e b3.4s, v24.4s; \
- sm4e b4.4s, v24.4s; \
- sm4e b5.4s, v24.4s; \
- sm4e b6.4s, v24.4s; \
- sm4e b7.4s, v24.4s; \
- sm4e b0.4s, v25.4s; \
- sm4e b1.4s, v25.4s; \
- sm4e b2.4s, v25.4s; \
- sm4e b3.4s, v25.4s; \
- sm4e b4.4s, v25.4s; \
- sm4e b5.4s, v25.4s; \
- sm4e b6.4s, v25.4s; \
- sm4e b7.4s, v25.4s; \
- sm4e b0.4s, v26.4s; \
- sm4e b1.4s, v26.4s; \
- sm4e b2.4s, v26.4s; \
- sm4e b3.4s, v26.4s; \
- sm4e b4.4s, v26.4s; \
- sm4e b5.4s, v26.4s; \
- sm4e b6.4s, v26.4s; \
- sm4e b7.4s, v26.4s; \
- sm4e b0.4s, v27.4s; \
- sm4e b1.4s, v27.4s; \
- sm4e b2.4s, v27.4s; \
- sm4e b3.4s, v27.4s; \
- sm4e b4.4s, v27.4s; \
- sm4e b5.4s, v27.4s; \
- sm4e b6.4s, v27.4s; \
- sm4e b7.4s, v27.4s; \
- sm4e b0.4s, v28.4s; \
- sm4e b1.4s, v28.4s; \
- sm4e b2.4s, v28.4s; \
- sm4e b3.4s, v28.4s; \
- sm4e b4.4s, v28.4s; \
- sm4e b5.4s, v28.4s; \
- sm4e b6.4s, v28.4s; \
- sm4e b7.4s, v28.4s; \
- sm4e b0.4s, v29.4s; \
- sm4e b1.4s, v29.4s; \
- sm4e b2.4s, v29.4s; \
- sm4e b3.4s, v29.4s; \
- sm4e b4.4s, v29.4s; \
- sm4e b5.4s, v29.4s; \
- sm4e b6.4s, v29.4s; \
- sm4e b7.4s, v29.4s; \
- sm4e b0.4s, v30.4s; \
- sm4e b1.4s, v30.4s; \
- sm4e b2.4s, v30.4s; \
- sm4e b3.4s, v30.4s; \
- sm4e b4.4s, v30.4s; \
- sm4e b5.4s, v30.4s; \
- sm4e b6.4s, v30.4s; \
- sm4e b7.4s, v30.4s; \
- sm4e b0.4s, v31.4s; \
- sm4e b1.4s, v31.4s; \
- sm4e b2.4s, v31.4s; \
- sm4e b3.4s, v31.4s; \
- sm4e b4.4s, v31.4s; \
- sm4e b5.4s, v31.4s; \
- sm4e b6.4s, v31.4s; \
- sm4e b7.4s, v31.4s; \
- rev64 b0.4s, b0.4s; \
- rev64 b1.4s, b1.4s; \
- rev64 b2.4s, b2.4s; \
- rev64 b3.4s, b3.4s; \
- rev64 b4.4s, b4.4s; \
- rev64 b5.4s, b5.4s; \
- rev64 b6.4s, b6.4s; \
- rev64 b7.4s, b7.4s; \
- ext b0.16b, b0.16b, b0.16b, #8; \
- ext b1.16b, b1.16b, b1.16b, #8; \
- ext b2.16b, b2.16b, b2.16b, #8; \
- ext b3.16b, b3.16b, b3.16b, #8; \
- ext b4.16b, b4.16b, b4.16b, #8; \
- ext b5.16b, b5.16b, b5.16b, #8; \
- ext b6.16b, b6.16b, b6.16b, #8; \
- ext b7.16b, b7.16b, b7.16b, #8; \
- rev32 b0.16b, b0.16b; \
- rev32 b1.16b, b1.16b; \
- rev32 b2.16b, b2.16b; \
- rev32 b3.16b, b3.16b; \
- rev32 b4.16b, b4.16b; \
- rev32 b5.16b, b5.16b; \
- rev32 b6.16b, b6.16b; \
- rev32 b7.16b, b7.16b;
-
.align 3
SYM_FUNC_START(sm4_ce_expand_key)
@@ -268,7 +102,7 @@ SYM_FUNC_START(sm4_ce_crypt_block)
* x1: dst
* x2: src
*/
- PREPARE;
+ SM4_PREPARE(x0)
ld1 {v0.16b}, [x2];
SM4_CRYPT_BLK(v0);
@@ -285,7 +119,7 @@ SYM_FUNC_START(sm4_ce_crypt)
* x2: src
* w3: nblocks
*/
- PREPARE;
+ SM4_PREPARE(x0)
.Lcrypt_loop_blk:
sub w3, w3, #8;
@@ -337,26 +171,50 @@ SYM_FUNC_START(sm4_ce_cbc_enc)
* x3: iv (big endian, 128 bit)
* w4: nblocks
*/
- PREPARE;
+ SM4_PREPARE(x0)
+
+ ld1 {RIV.16b}, [x3]
+
+.Lcbc_enc_loop_4x:
+ cmp w4, #4
+ blt .Lcbc_enc_loop_1x
+
+ sub w4, w4, #4
- ld1 {RIV.16b}, [x3];
+ ld1 {v0.16b-v3.16b}, [x2], #64
-.Lcbc_enc_loop:
- sub w4, w4, #1;
+ eor v0.16b, v0.16b, RIV.16b
+ SM4_CRYPT_BLK(v0)
+ eor v1.16b, v1.16b, v0.16b
+ SM4_CRYPT_BLK(v1)
+ eor v2.16b, v2.16b, v1.16b
+ SM4_CRYPT_BLK(v2)
+ eor v3.16b, v3.16b, v2.16b
+ SM4_CRYPT_BLK(v3)
- ld1 {RTMP0.16b}, [x2], #16;
- eor RIV.16b, RIV.16b, RTMP0.16b;
+ st1 {v0.16b-v3.16b}, [x1], #64
+ mov RIV.16b, v3.16b
- SM4_CRYPT_BLK(RIV);
+ cbz w4, .Lcbc_enc_end
+ b .Lcbc_enc_loop_4x
- st1 {RIV.16b}, [x1], #16;
+.Lcbc_enc_loop_1x:
+ sub w4, w4, #1
- cbnz w4, .Lcbc_enc_loop;
+ ld1 {v0.16b}, [x2], #16
+ eor RIV.16b, RIV.16b, v0.16b
+ SM4_CRYPT_BLK(RIV)
+
+ st1 {RIV.16b}, [x1], #16
+
+ cbnz w4, .Lcbc_enc_loop_1x
+
+.Lcbc_enc_end:
/* store new IV */
- st1 {RIV.16b}, [x3];
+ st1 {RIV.16b}, [x3]
- ret;
+ ret
SYM_FUNC_END(sm4_ce_cbc_enc)
.align 3
@@ -368,79 +226,93 @@ SYM_FUNC_START(sm4_ce_cbc_dec)
* x3: iv (big endian, 128 bit)
* w4: nblocks
*/
- PREPARE;
+ SM4_PREPARE(x0)
- ld1 {RIV.16b}, [x3];
+ ld1 {RIV.16b}, [x3]
-.Lcbc_loop_blk:
- sub w4, w4, #8;
- tbnz w4, #31, .Lcbc_tail8;
+.Lcbc_dec_loop_8x:
+ sub w4, w4, #8
+ tbnz w4, #31, .Lcbc_dec_4x
- ld1 {v0.16b-v3.16b}, [x2], #64;
- ld1 {v4.16b-v7.16b}, [x2];
+ ld1 {v0.16b-v3.16b}, [x2], #64
+ ld1 {v4.16b-v7.16b}, [x2], #64
- SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7);
+ rev32 v8.16b, v0.16b
+ rev32 v9.16b, v1.16b
+ rev32 v10.16b, v2.16b
+ rev32 v11.16b, v3.16b
+ rev32 v12.16b, v4.16b
+ rev32 v13.16b, v5.16b
+ rev32 v14.16b, v6.16b
+ rev32 v15.16b, v7.16b
- sub x2, x2, #64;
- eor v0.16b, v0.16b, RIV.16b;
- ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
- eor v1.16b, v1.16b, RTMP0.16b;
- eor v2.16b, v2.16b, RTMP1.16b;
- eor v3.16b, v3.16b, RTMP2.16b;
- st1 {v0.16b-v3.16b}, [x1], #64;
+ SM4_CRYPT_BLK8_BE(v8, v9, v10, v11, v12, v13, v14, v15)
- eor v4.16b, v4.16b, RTMP3.16b;
- ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
- eor v5.16b, v5.16b, RTMP0.16b;
- eor v6.16b, v6.16b, RTMP1.16b;
- eor v7.16b, v7.16b, RTMP2.16b;
+ eor v8.16b, v8.16b, RIV.16b
+ eor v9.16b, v9.16b, v0.16b
+ eor v10.16b, v10.16b, v1.16b
+ eor v11.16b, v11.16b, v2.16b
+ eor v12.16b, v12.16b, v3.16b
+ eor v13.16b, v13.16b, v4.16b
+ eor v14.16b, v14.16b, v5.16b
+ eor v15.16b, v15.16b, v6.16b
- mov RIV.16b, RTMP3.16b;
- st1 {v4.16b-v7.16b}, [x1], #64;
+ st1 {v8.16b-v11.16b}, [x1], #64
+ st1 {v12.16b-v15.16b}, [x1], #64
- cbz w4, .Lcbc_end;
- b .Lcbc_loop_blk;
+ mov RIV.16b, v7.16b
-.Lcbc_tail8:
- add w4, w4, #8;
- cmp w4, #4;
- blt .Lcbc_tail4;
+ cbz w4, .Lcbc_dec_end
+ b .Lcbc_dec_loop_8x
- sub w4, w4, #4;
+.Lcbc_dec_4x:
+ add w4, w4, #8
+ cmp w4, #4
+ blt .Lcbc_dec_loop_1x
- ld1 {v0.16b-v3.16b}, [x2];
+ sub w4, w4, #4
- SM4_CRYPT_BLK4(v0, v1, v2, v3);
+ ld1 {v0.16b-v3.16b}, [x2], #64
- eor v0.16b, v0.16b, RIV.16b;
- ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
- eor v1.16b, v1.16b, RTMP0.16b;
- eor v2.16b, v2.16b, RTMP1.16b;
- eor v3.16b, v3.16b, RTMP2.16b;
+ rev32 v8.16b, v0.16b
+ rev32 v9.16b, v1.16b
+ rev32 v10.16b, v2.16b
+ rev32 v11.16b, v3.16b
- mov RIV.16b, RTMP3.16b;
- st1 {v0.16b-v3.16b}, [x1], #64;
+ SM4_CRYPT_BLK4_BE(v8, v9, v10, v11)
- cbz w4, .Lcbc_end;
+ eor v8.16b, v8.16b, RIV.16b
+ eor v9.16b, v9.16b, v0.16b
+ eor v10.16b, v10.16b, v1.16b
+ eor v11.16b, v11.16b, v2.16b
-.Lcbc_tail4:
- sub w4, w4, #1;
+ st1 {v8.16b-v11.16b}, [x1], #64
- ld1 {v0.16b}, [x2];
+ mov RIV.16b, v3.16b
- SM4_CRYPT_BLK(v0);
+ cbz w4, .Lcbc_dec_end
- eor v0.16b, v0.16b, RIV.16b;
- ld1 {RIV.16b}, [x2], #16;
- st1 {v0.16b}, [x1], #16;
+.Lcbc_dec_loop_1x:
+ sub w4, w4, #1
+
+ ld1 {v0.16b}, [x2], #16
+
+ rev32 v8.16b, v0.16b
+
+ SM4_CRYPT_BLK_BE(v8)
- cbnz w4, .Lcbc_tail4;
+ eor v8.16b, v8.16b, RIV.16b
+ st1 {v8.16b}, [x1], #16
-.Lcbc_end:
+ mov RIV.16b, v0.16b
+
+ cbnz w4, .Lcbc_dec_loop_1x
+
+.Lcbc_dec_end:
/* store new IV */
- st1 {RIV.16b}, [x3];
+ st1 {RIV.16b}, [x3]
- ret;
+ ret
SYM_FUNC_END(sm4_ce_cbc_dec)
.align 3
@@ -452,25 +324,57 @@ SYM_FUNC_START(sm4_ce_cfb_enc)
* x3: iv (big endian, 128 bit)
* w4: nblocks
*/
- PREPARE;
+ SM4_PREPARE(x0)
+
+ ld1 {RIV.16b}, [x3]
+
+.Lcfb_enc_loop_4x:
+ cmp w4, #4
+ blt .Lcfb_enc_loop_1x
+
+ sub w4, w4, #4
+
+ ld1 {v0.16b-v3.16b}, [x2], #64
+
+ rev32 v8.16b, RIV.16b
+ SM4_CRYPT_BLK_BE(v8)
+ eor v0.16b, v0.16b, v8.16b
+
+ rev32 v8.16b, v0.16b
+ SM4_CRYPT_BLK_BE(v8)
+ eor v1.16b, v1.16b, v8.16b
+
+ rev32 v8.16b, v1.16b
+ SM4_CRYPT_BLK_BE(v8)
+ eor v2.16b, v2.16b, v8.16b
+
+ rev32 v8.16b, v2.16b
+ SM4_CRYPT_BLK_BE(v8)
+ eor v3.16b, v3.16b, v8.16b
- ld1 {RIV.16b}, [x3];
+ st1 {v0.16b-v3.16b}, [x1], #64
+ mov RIV.16b, v3.16b
-.Lcfb_enc_loop:
- sub w4, w4, #1;
+ cbz w4, .Lcfb_enc_end
+ b .Lcfb_enc_loop_4x
- SM4_CRYPT_BLK(RIV);
+.Lcfb_enc_loop_1x:
+ sub w4, w4, #1
- ld1 {RTMP0.16b}, [x2], #16;
- eor RIV.16b, RIV.16b, RTMP0.16b;
- st1 {RIV.16b}, [x1], #16;
+ ld1 {v0.16b}, [x2], #16
- cbnz w4, .Lcfb_enc_loop;
+ SM4_CRYPT_BLK(RIV)
+ eor RIV.16b, RIV.16b, v0.16b
+ st1 {RIV.16b}, [x1], #16
+
+ cbnz w4, .Lcfb_enc_loop_1x
+
+.Lcfb_enc_end:
/* store new IV */
- st1 {RIV.16b}, [x3];
+ st1 {RIV.16b}, [x3]
- ret;
+ ret
SYM_FUNC_END(sm4_ce_cfb_enc)
.align 3
@@ -482,79 +386,91 @@ SYM_FUNC_START(sm4_ce_cfb_dec)
* x3: iv (big endian, 128 bit)
* w4: nblocks
*/
- PREPARE;
+ SM4_PREPARE(x0)
- ld1 {v0.16b}, [x3];
+ ld1 {RIV.16b}, [x3]
-.Lcfb_loop_blk:
- sub w4, w4, #8;
- tbnz w4, #31, .Lcfb_tail8;
+.Lcfb_dec_loop_8x:
+ sub w4, w4, #8
+ tbnz w4, #31, .Lcfb_dec_4x
- ld1 {v1.16b, v2.16b, v3.16b}, [x2], #48;
- ld1 {v4.16b-v7.16b}, [x2];
+ ld1 {v0.16b-v3.16b}, [x2], #64
+ ld1 {v4.16b-v7.16b}, [x2], #64
- SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7);
+ rev32 v8.16b, RIV.16b
+ rev32 v9.16b, v0.16b
+ rev32 v10.16b, v1.16b
+ rev32 v11.16b, v2.16b
+ rev32 v12.16b, v3.16b
+ rev32 v13.16b, v4.16b
+ rev32 v14.16b, v5.16b
+ rev32 v15.16b, v6.16b
- sub x2, x2, #48;
- ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
- eor v0.16b, v0.16b, RTMP0.16b;
- eor v1.16b, v1.16b, RTMP1.16b;
- eor v2.16b, v2.16b, RTMP2.16b;
- eor v3.16b, v3.16b, RTMP3.16b;
- st1 {v0.16b-v3.16b}, [x1], #64;
+ SM4_CRYPT_BLK8_BE(v8, v9, v10, v11, v12, v13, v14, v15)
- ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
- eor v4.16b, v4.16b, RTMP0.16b;
- eor v5.16b, v5.16b, RTMP1.16b;
- eor v6.16b, v6.16b, RTMP2.16b;
- eor v7.16b, v7.16b, RTMP3.16b;
- st1 {v4.16b-v7.16b}, [x1], #64;
+ mov RIV.16b, v7.16b
- mov v0.16b, RTMP3.16b;
+ eor v0.16b, v0.16b, v8.16b
+ eor v1.16b, v1.16b, v9.16b
+ eor v2.16b, v2.16b, v10.16b
+ eor v3.16b, v3.16b, v11.16b
+ eor v4.16b, v4.16b, v12.16b
+ eor v5.16b, v5.16b, v13.16b
+ eor v6.16b, v6.16b, v14.16b
+ eor v7.16b, v7.16b, v15.16b
- cbz w4, .Lcfb_end;
- b .Lcfb_loop_blk;
+ st1 {v0.16b-v3.16b}, [x1], #64
+ st1 {v4.16b-v7.16b}, [x1], #64
-.Lcfb_tail8:
- add w4, w4, #8;
- cmp w4, #4;
- blt .Lcfb_tail4;
+ cbz w4, .Lcfb_dec_end
+ b .Lcfb_dec_loop_8x
- sub w4, w4, #4;
+.Lcfb_dec_4x:
+ add w4, w4, #8
+ cmp w4, #4
+ blt .Lcfb_dec_loop_1x
- ld1 {v1.16b, v2.16b, v3.16b}, [x2];
+ sub w4, w4, #4
- SM4_CRYPT_BLK4(v0, v1, v2, v3);
+ ld1 {v0.16b-v3.16b}, [x2], #64
- ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
- eor v0.16b, v0.16b, RTMP0.16b;
- eor v1.16b, v1.16b, RTMP1.16b;
- eor v2.16b, v2.16b, RTMP2.16b;
- eor v3.16b, v3.16b, RTMP3.16b;
- st1 {v0.16b-v3.16b}, [x1], #64;
+ rev32 v8.16b, RIV.16b
+ rev32 v9.16b, v0.16b
+ rev32 v10.16b, v1.16b
+ rev32 v11.16b, v2.16b
- mov v0.16b, RTMP3.16b;
+ SM4_CRYPT_BLK4_BE(v8, v9, v10, v11)
- cbz w4, .Lcfb_end;
+ mov RIV.16b, v3.16b
-.Lcfb_tail4:
- sub w4, w4, #1;
+ eor v0.16b, v0.16b, v8.16b
+ eor v1.16b, v1.16b, v9.16b
+ eor v2.16b, v2.16b, v10.16b
+ eor v3.16b, v3.16b, v11.16b
- SM4_CRYPT_BLK(v0);
+ st1 {v0.16b-v3.16b}, [x1], #64
- ld1 {RTMP0.16b}, [x2], #16;
- eor v0.16b, v0.16b, RTMP0.16b;
- st1 {v0.16b}, [x1], #16;
+ cbz w4, .Lcfb_dec_end
+
+.Lcfb_dec_loop_1x:
+ sub w4, w4, #1
+
+ ld1 {v0.16b}, [x2], #16
- mov v0.16b, RTMP0.16b;
+ SM4_CRYPT_BLK(RIV)
- cbnz w4, .Lcfb_tail4;
+ eor RIV.16b, RIV.16b, v0.16b
+ st1 {RIV.16b}, [x1], #16
-.Lcfb_end:
+ mov RIV.16b, v0.16b
+
+ cbnz w4, .Lcfb_dec_loop_1x
+
+.Lcfb_dec_end:
/* store new IV */
- st1 {v0.16b}, [x3];
+ st1 {RIV.16b}, [x3]
- ret;
+ ret
SYM_FUNC_END(sm4_ce_cfb_dec)
.align 3
@@ -566,95 +482,99 @@ SYM_FUNC_START(sm4_ce_ctr_enc)
* x3: ctr (big endian, 128 bit)
* w4: nblocks
*/
- PREPARE;
+ SM4_PREPARE(x0)
- ldp x7, x8, [x3];
- rev x7, x7;
- rev x8, x8;
+ ldp x7, x8, [x3]
+ rev x7, x7
+ rev x8, x8
-.Lctr_loop_blk:
- sub w4, w4, #8;
- tbnz w4, #31, .Lctr_tail8;
+.Lctr_loop_8x:
+ sub w4, w4, #8
+ tbnz w4, #31, .Lctr_4x
-#define inc_le128(vctr) \
- mov vctr.d[1], x8; \
- mov vctr.d[0], x7; \
- adds x8, x8, #1; \
- adc x7, x7, xzr; \
- rev64 vctr.16b, vctr.16b;
+#define inc_le128(vctr) \
+ mov vctr.d[1], x8; \
+ mov vctr.d[0], x7; \
+ adds x8, x8, #1; \
+ rev64 vctr.16b, vctr.16b; \
+ adc x7, x7, xzr;
/* construct CTRs */
- inc_le128(v0); /* +0 */
- inc_le128(v1); /* +1 */
- inc_le128(v2); /* +2 */
- inc_le128(v3); /* +3 */
- inc_le128(v4); /* +4 */
- inc_le128(v5); /* +5 */
- inc_le128(v6); /* +6 */
- inc_le128(v7); /* +7 */
+ inc_le128(v0) /* +0 */
+ inc_le128(v1) /* +1 */
+ inc_le128(v2) /* +2 */
+ inc_le128(v3) /* +3 */
+ inc_le128(v4) /* +4 */
+ inc_le128(v5) /* +5 */
+ inc_le128(v6) /* +6 */
+ inc_le128(v7) /* +7 */
+
+ ld1 {v8.16b-v11.16b}, [x2], #64
+ ld1 {v12.16b-v15.16b}, [x2], #64
+
+ SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7)
+
+ eor v0.16b, v0.16b, v8.16b
+ eor v1.16b, v1.16b, v9.16b
+ eor v2.16b, v2.16b, v10.16b
+ eor v3.16b, v3.16b, v11.16b
+ eor v4.16b, v4.16b, v12.16b
+ eor v5.16b, v5.16b, v13.16b
+ eor v6.16b, v6.16b, v14.16b
+ eor v7.16b, v7.16b, v15.16b
+
+ st1 {v0.16b-v3.16b}, [x1], #64
+ st1 {v4.16b-v7.16b}, [x1], #64
+
+ cbz w4, .Lctr_end
+ b .Lctr_loop_8x
+
+.Lctr_4x:
+ add w4, w4, #8
+ cmp w4, #4
+ blt .Lctr_loop_1x
+
+ sub w4, w4, #4
- SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7);
-
- ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
- eor v0.16b, v0.16b, RTMP0.16b;
- eor v1.16b, v1.16b, RTMP1.16b;
- eor v2.16b, v2.16b, RTMP2.16b;
- eor v3.16b, v3.16b, RTMP3.16b;
- st1 {v0.16b-v3.16b}, [x1], #64;
+ /* construct CTRs */
+ inc_le128(v0) /* +0 */
+ inc_le128(v1) /* +1 */
+ inc_le128(v2) /* +2 */
+ inc_le128(v3) /* +3 */
- ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
- eor v4.16b, v4.16b, RTMP0.16b;
- eor v5.16b, v5.16b, RTMP1.16b;
- eor v6.16b, v6.16b, RTMP2.16b;
- eor v7.16b, v7.16b, RTMP3.16b;
- st1 {v4.16b-v7.16b}, [x1], #64;
+ ld1 {v8.16b-v11.16b}, [x2], #64
- cbz w4, .Lctr_end;
- b .Lctr_loop_blk;
+ SM4_CRYPT_BLK4(v0, v1, v2, v3)
-.Lctr_tail8:
- add w4, w4, #8;
- cmp w4, #4;
- blt .Lctr_tail4;
+ eor v0.16b, v0.16b, v8.16b
+ eor v1.16b, v1.16b, v9.16b
+ eor v2.16b, v2.16b, v10.16b
+ eor v3.16b, v3.16b, v11.16b
- sub w4, w4, #4;
+ st1 {v0.16b-v3.16b}, [x1], #64
- /* construct CTRs */
- inc_le128(v0); /* +0 */
- inc_le128(v1); /* +1 */
- inc_le128(v2); /* +2 */
- inc_le128(v3); /* +3 */
+ cbz w4, .Lctr_end
- SM4_CRYPT_BLK4(v0, v1, v2, v3);
-
- ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
- eor v0.16b, v0.16b, RTMP0.16b;
- eor v1.16b, v1.16b, RTMP1.16b;
- eor v2.16b, v2.16b, RTMP2.16b;
- eor v3.16b, v3.16b, RTMP3.16b;
- st1 {v0.16b-v3.16b}, [x1], #64;
-
- cbz w4, .Lctr_end;
-
-.Lctr_tail4:
- sub w4, w4, #1;
+.Lctr_loop_1x:
+ sub w4, w4, #1
/* construct CTRs */
- inc_le128(v0);
+ inc_le128(v0)
- SM4_CRYPT_BLK(v0);
+ ld1 {v8.16b}, [x2], #16
- ld1 {RTMP0.16b}, [x2], #16;
- eor v0.16b, v0.16b, RTMP0.16b;
- st1 {v0.16b}, [x1], #16;
+ SM4_CRYPT_BLK(v0)
+
+ eor v0.16b, v0.16b, v8.16b
+ st1 {v0.16b}, [x1], #16
- cbnz w4, .Lctr_tail4;
+ cbnz w4, .Lctr_loop_1x
.Lctr_end:
/* store new CTR */
- rev x7, x7;
- rev x8, x8;
- stp x7, x8, [x3];
+ rev x7, x7
+ rev x8, x8
+ stp x7, x8, [x3]
- ret;
+ ret
SYM_FUNC_END(sm4_ce_ctr_enc)
diff --git a/arch/arm64/crypto/sm4-ce-glue.c b/arch/arm64/crypto/sm4-ce-glue.c
index 496d55c0d01a..e56e81b1f35f 100644
--- a/arch/arm64/crypto/sm4-ce-glue.c
+++ b/arch/arm64/crypto/sm4-ce-glue.c
@@ -26,9 +26,9 @@ asmlinkage void sm4_ce_crypt_block(const u32 *rkey, u8 *dst, const u8 *src);
asmlinkage void sm4_ce_crypt(const u32 *rkey, u8 *dst, const u8 *src,
unsigned int nblks);
asmlinkage void sm4_ce_cbc_enc(const u32 *rkey, u8 *dst, const u8 *src,
- u8 *iv, unsigned int nblks);
+ u8 *iv, unsigned int nblocks);
asmlinkage void sm4_ce_cbc_dec(const u32 *rkey, u8 *dst, const u8 *src,
- u8 *iv, unsigned int nblks);
+ u8 *iv, unsigned int nblocks);
asmlinkage void sm4_ce_cfb_enc(const u32 *rkey, u8 *dst, const u8 *src,
u8 *iv, unsigned int nblks);
asmlinkage void sm4_ce_cfb_dec(const u32 *rkey, u8 *dst, const u8 *src,
@@ -94,66 +94,56 @@ static int sm4_ecb_decrypt(struct skcipher_request *req)
return sm4_ecb_do_crypt(req, ctx->rkey_dec);
}
-static int sm4_cbc_encrypt(struct skcipher_request *req)
+static int sm4_cbc_crypt(struct skcipher_request *req,
+ struct sm4_ctx *ctx, bool encrypt)
{
- struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
- struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
struct skcipher_walk walk;
unsigned int nbytes;
int err;
err = skcipher_walk_virt(&walk, req, false);
+ if (err)
+ return err;
while ((nbytes = walk.nbytes) > 0) {
const u8 *src = walk.src.virt.addr;
u8 *dst = walk.dst.virt.addr;
- unsigned int nblks;
+ unsigned int nblocks;
- kernel_neon_begin();
+ nblocks = nbytes / SM4_BLOCK_SIZE;
+ if (nblocks) {
+ kernel_neon_begin();
- nblks = BYTES2BLKS(nbytes);
- if (nblks) {
- sm4_ce_cbc_enc(ctx->rkey_enc, dst, src, walk.iv, nblks);
- nbytes -= nblks * SM4_BLOCK_SIZE;
- }
+ if (encrypt)
+ sm4_ce_cbc_enc(ctx->rkey_enc, dst, src,
+ walk.iv, nblocks);
+ else
+ sm4_ce_cbc_dec(ctx->rkey_dec, dst, src,
+ walk.iv, nblocks);
- kernel_neon_end();
+ kernel_neon_end();
+ }
- err = skcipher_walk_done(&walk, nbytes);
+ err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE);
}
return err;
}
-static int sm4_cbc_decrypt(struct skcipher_request *req)
+static int sm4_cbc_encrypt(struct skcipher_request *req)
{
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
- struct skcipher_walk walk;
- unsigned int nbytes;
- int err;
-
- err = skcipher_walk_virt(&walk, req, false);
- while ((nbytes = walk.nbytes) > 0) {
- const u8 *src = walk.src.virt.addr;
- u8 *dst = walk.dst.virt.addr;
- unsigned int nblks;
-
- kernel_neon_begin();
-
- nblks = BYTES2BLKS(nbytes);
- if (nblks) {
- sm4_ce_cbc_dec(ctx->rkey_dec, dst, src, walk.iv, nblks);
- nbytes -= nblks * SM4_BLOCK_SIZE;
- }
-
- kernel_neon_end();
+ return sm4_cbc_crypt(req, ctx, true);
+}
- err = skcipher_walk_done(&walk, nbytes);
- }
+static int sm4_cbc_decrypt(struct skcipher_request *req)
+{
+ struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+ struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
- return err;
+ return sm4_cbc_crypt(req, ctx, false);
}
static int sm4_cfb_encrypt(struct skcipher_request *req)
--
2.24.3 (Apple Git-128)
* [PATCH 06/16] crypto: arm64/sm4 - refactor and simplify CE implementation
@ 2022-09-26 9:36 ` Tianjia Zhang
From: Tianjia Zhang @ 2022-09-26 9:36 UTC (permalink / raw)
To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
linux-stm32
This patch does not add new features, but only refactors and simplifies the
implementation of the Crypto Extension acceleration of the SM4 algorithm:
Extract the SM4 Crypto Extension helper macros into a new header,
sm4-ce-asm.h, for reuse in the subsequent optimization of CCM/GCM modes.
Encryption in CBC and CFB modes processes four blocks at a time instead of
one, allowing a single ld1 instruction to load 64 bytes of data, which
reduces unnecessary memory accesses.
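
Note that CBC encryption stays serially dependent: each block is XORed with
the previous ciphertext block before it is encrypted, so the four-block path
(.Lcbc_enc_loop_4x) only batches the loads and stores. The following is a
minimal C sketch of that equivalence, not the kernel code. toy_encrypt_block()
is a hypothetical stand-in for SM4, and every function name here is
illustrative only.

```c
#include <stdint.h>
#include <string.h>

#define BLK 16

/* Hypothetical stand-in for SM4: NOT a real cipher, just a keyed byte mix. */
static void toy_encrypt_block(const uint8_t key[BLK], uint8_t b[BLK])
{
	for (int i = 0; i < BLK; i++) {
		uint8_t x = b[i] ^ key[i];

		b[i] = (uint8_t)((x << 3) | (x >> 5));
	}
}

/* Reference: one block per iteration, one load and one store each time. */
static void cbc_enc_1x(const uint8_t key[BLK], uint8_t *dst,
		       const uint8_t *src, uint8_t iv[BLK],
		       unsigned int nblocks)
{
	for (; nblocks; nblocks--) {
		for (int i = 0; i < BLK; i++)
			iv[i] ^= src[i];
		toy_encrypt_block(key, iv);
		memcpy(dst, iv, BLK);
		src += BLK;
		dst += BLK;
	}
}

/* Batched: one 64-byte load, four chained encryptions, one 64-byte store. */
static void cbc_enc_4x(const uint8_t key[BLK], uint8_t *dst,
		       const uint8_t *src, uint8_t iv[BLK],
		       unsigned int nblocks)
{
	uint8_t buf[4 * BLK];

	while (nblocks >= 4) {
		const uint8_t *prev = iv;

		memcpy(buf, src, sizeof(buf));
		for (int b = 0; b < 4; b++) {
			uint8_t *cur = buf + b * BLK;

			/* The chain is still serial: block b needs block b-1. */
			for (int i = 0; i < BLK; i++)
				cur[i] ^= prev[i];
			toy_encrypt_block(key, cur);
			prev = cur;
		}
		memcpy(dst, buf, sizeof(buf));
		memcpy(iv, buf + 3 * BLK, BLK);	/* new IV = last ciphertext */
		src += 4 * BLK;
		dst += 4 * BLK;
		nblocks -= 4;
	}
	cbc_enc_1x(key, dst, src, iv, nblocks);	/* 0..3 leftover blocks */
}

/* Returns 1 when both variants produce the same ciphertext and final IV. */
int cbc_batched_matches_serial(void)
{
	uint8_t key[BLK], iv_a[BLK], iv_b[BLK];
	uint8_t in[7 * BLK], out_a[7 * BLK], out_b[7 * BLK];

	for (int i = 0; i < BLK; i++) {
		key[i] = (uint8_t)(i * 7 + 1);
		iv_a[i] = iv_b[i] = (uint8_t)(0xA0 + i);
	}
	for (int i = 0; i < (int)sizeof(in); i++)
		in[i] = (uint8_t)(i * 13);

	cbc_enc_1x(key, out_a, in, iv_a, 7);
	cbc_enc_4x(key, out_b, in, iv_b, 7);

	return memcmp(out_a, out_b, sizeof(out_a)) == 0 &&
	       memcmp(iv_a, iv_b, BLK) == 0;
}
```

Seven blocks are used in the check so that both the batched path (four
blocks) and the single-block tail (three blocks) are exercised.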
CBC/CFB/CTR make full use of free registers to reduce redundant memory
accesses, and some instructions are rearranged to better suit out-of-order
execution.
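
As an illustration of the register usage in CBC decryption: the new code
keeps the loaded ciphertext in v0-v7 and decrypts byte-swapped copies in
v8-v15, so the previous ciphertext blocks needed for the final XOR no longer
have to be reloaded from memory. A minimal C sketch of the same idea follows,
with a batch of four blocks for brevity. toy_encrypt_block() and
toy_decrypt_block() are hypothetical stand-ins for SM4, and all names are
assumptions made for this example.

```c
#include <stdint.h>
#include <string.h>

#define BLK 16
#define BATCH 4

static uint8_t rotl8(uint8_t x, unsigned int r)
{
	return (uint8_t)((x << r) | (x >> (8 - r)));
}

/* Hypothetical stand-ins for SM4: NOT real ciphers, only an invertible toy. */
static void toy_encrypt_block(const uint8_t key[BLK], uint8_t b[BLK])
{
	for (int i = 0; i < BLK; i++)
		b[i] = rotl8(b[i] ^ key[i], 3);
}

static void toy_decrypt_block(const uint8_t key[BLK], uint8_t b[BLK])
{
	for (int i = 0; i < BLK; i++)
		b[i] = rotl8(b[i], 5) ^ key[i];	/* rotl 5 undoes rotl 3 */
}

/* CBC decryption that reads each ciphertext block from memory only once. */
void cbc_dec(const uint8_t key[BLK], uint8_t *dst, const uint8_t *src,
	     uint8_t iv[BLK], unsigned int nblocks)
{
	uint8_t ct[BATCH * BLK];	/* original ciphertext (like v0-v3) */
	uint8_t pt[BATCH * BLK];	/* working copies (like v8-v11) */

	while (nblocks) {
		unsigned int n = nblocks < BATCH ? nblocks : BATCH;

		memcpy(ct, src, n * BLK);	/* the only load of this data */
		memcpy(pt, ct, n * BLK);

		for (unsigned int b = 0; b < n; b++) {
			const uint8_t *prev = b ? ct + (b - 1) * BLK : iv;
			uint8_t *cur = pt + b * BLK;

			toy_decrypt_block(key, cur);
			for (int i = 0; i < BLK; i++)
				cur[i] ^= prev[i];
		}

		memcpy(iv, ct + (n - 1) * BLK, BLK);	/* next IV */
		memcpy(dst, pt, n * BLK);	/* fine even if dst == src */

		src += n * BLK;
		dst += n * BLK;
		nblocks -= n;
	}
}

/* Returns 1 when in-place decryption restores the plaintext and the IV. */
int cbc_dec_roundtrip_ok(void)
{
	uint8_t key[BLK], iv0[BLK], iv[BLK], last_ct[BLK];
	uint8_t plain[6 * BLK], buf[6 * BLK];

	for (int i = 0; i < BLK; i++) {
		key[i] = (uint8_t)(i * 11 + 3);
		iv0[i] = (uint8_t)(0x5A ^ i);
	}
	for (int i = 0; i < (int)sizeof(plain); i++)
		plain[i] = (uint8_t)(i * 29 + 7);

	/* Reference CBC encryption, one block at a time. */
	memcpy(iv, iv0, BLK);
	for (int b = 0; b < 6; b++) {
		for (int i = 0; i < BLK; i++)
			iv[i] ^= plain[b * BLK + i];
		toy_encrypt_block(key, iv);
		memcpy(buf + b * BLK, iv, BLK);
	}
	memcpy(last_ct, buf + 5 * BLK, BLK);

	/* Decrypt in place, starting again from the original IV. */
	memcpy(iv, iv0, BLK);
	cbc_dec(key, buf, buf, iv, 6);

	return memcmp(buf, plain, sizeof(plain)) == 0 &&
	       memcmp(iv, last_ct, BLK) == 0;
}
```

Holding the ciphertext in a separate buffer is what makes in-place operation
(dst == src) safe without a second read of the source.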
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
arch/arm64/crypto/sm4-ce-asm.h | 209 +++++++++++
arch/arm64/crypto/sm4-ce-core.S | 646 ++++++++++++++------------------
arch/arm64/crypto/sm4-ce-glue.c | 64 ++--
3 files changed, 519 insertions(+), 400 deletions(-)
create mode 100644 arch/arm64/crypto/sm4-ce-asm.h
diff --git a/arch/arm64/crypto/sm4-ce-asm.h b/arch/arm64/crypto/sm4-ce-asm.h
new file mode 100644
index 000000000000..7ea98e42e779
--- /dev/null
+++ b/arch/arm64/crypto/sm4-ce-asm.h
@@ -0,0 +1,209 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * SM4 helper macros for Crypto Extensions
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#define SM4_PREPARE(ptr) \
+ ld1 {v24.16b-v27.16b}, [ptr], #64; \
+ ld1 {v28.16b-v31.16b}, [ptr];
+
+#define SM4_CRYPT_BLK_BE(b0) \
+ sm4e b0.4s, v24.4s; \
+ sm4e b0.4s, v25.4s; \
+ sm4e b0.4s, v26.4s; \
+ sm4e b0.4s, v27.4s; \
+ sm4e b0.4s, v28.4s; \
+ sm4e b0.4s, v29.4s; \
+ sm4e b0.4s, v30.4s; \
+ sm4e b0.4s, v31.4s; \
+ rev64 b0.4s, b0.4s; \
+ ext b0.16b, b0.16b, b0.16b, #8; \
+ rev32 b0.16b, b0.16b;
+
+#define SM4_CRYPT_BLK(b0) \
+ rev32 b0.16b, b0.16b; \
+ SM4_CRYPT_BLK_BE(b0);
+
+#define SM4_CRYPT_BLK2_BE(b0, b1) \
+ sm4e b0.4s, v24.4s; \
+ sm4e b1.4s, v24.4s; \
+ sm4e b0.4s, v25.4s; \
+ sm4e b1.4s, v25.4s; \
+ sm4e b0.4s, v26.4s; \
+ sm4e b1.4s, v26.4s; \
+ sm4e b0.4s, v27.4s; \
+ sm4e b1.4s, v27.4s; \
+ sm4e b0.4s, v28.4s; \
+ sm4e b1.4s, v28.4s; \
+ sm4e b0.4s, v29.4s; \
+ sm4e b1.4s, v29.4s; \
+ sm4e b0.4s, v30.4s; \
+ sm4e b1.4s, v30.4s; \
+ sm4e b0.4s, v31.4s; \
+ sm4e b1.4s, v31.4s; \
+ rev64 b0.4s, b0.4s; \
+ rev64 b1.4s, b1.4s; \
+ ext b0.16b, b0.16b, b0.16b, #8; \
+ ext b1.16b, b1.16b, b1.16b, #8; \
+ rev32 b0.16b, b0.16b; \
+ rev32 b1.16b, b1.16b;
+
+#define SM4_CRYPT_BLK2(b0, b1) \
+ rev32 b0.16b, b0.16b; \
+ rev32 b1.16b, b1.16b; \
+ SM4_CRYPT_BLK2_BE(b0, b1);
+
+#define SM4_CRYPT_BLK4_BE(b0, b1, b2, b3) \
+ sm4e b0.4s, v24.4s; \
+ sm4e b1.4s, v24.4s; \
+ sm4e b2.4s, v24.4s; \
+ sm4e b3.4s, v24.4s; \
+ sm4e b0.4s, v25.4s; \
+ sm4e b1.4s, v25.4s; \
+ sm4e b2.4s, v25.4s; \
+ sm4e b3.4s, v25.4s; \
+ sm4e b0.4s, v26.4s; \
+ sm4e b1.4s, v26.4s; \
+ sm4e b2.4s, v26.4s; \
+ sm4e b3.4s, v26.4s; \
+ sm4e b0.4s, v27.4s; \
+ sm4e b1.4s, v27.4s; \
+ sm4e b2.4s, v27.4s; \
+ sm4e b3.4s, v27.4s; \
+ sm4e b0.4s, v28.4s; \
+ sm4e b1.4s, v28.4s; \
+ sm4e b2.4s, v28.4s; \
+ sm4e b3.4s, v28.4s; \
+ sm4e b0.4s, v29.4s; \
+ sm4e b1.4s, v29.4s; \
+ sm4e b2.4s, v29.4s; \
+ sm4e b3.4s, v29.4s; \
+ sm4e b0.4s, v30.4s; \
+ sm4e b1.4s, v30.4s; \
+ sm4e b2.4s, v30.4s; \
+ sm4e b3.4s, v30.4s; \
+ sm4e b0.4s, v31.4s; \
+ sm4e b1.4s, v31.4s; \
+ sm4e b2.4s, v31.4s; \
+ sm4e b3.4s, v31.4s; \
+ rev64 b0.4s, b0.4s; \
+ rev64 b1.4s, b1.4s; \
+ rev64 b2.4s, b2.4s; \
+ rev64 b3.4s, b3.4s; \
+ ext b0.16b, b0.16b, b0.16b, #8; \
+ ext b1.16b, b1.16b, b1.16b, #8; \
+ ext b2.16b, b2.16b, b2.16b, #8; \
+ ext b3.16b, b3.16b, b3.16b, #8; \
+ rev32 b0.16b, b0.16b; \
+ rev32 b1.16b, b1.16b; \
+ rev32 b2.16b, b2.16b; \
+ rev32 b3.16b, b3.16b;
+
+#define SM4_CRYPT_BLK4(b0, b1, b2, b3) \
+ rev32 b0.16b, b0.16b; \
+ rev32 b1.16b, b1.16b; \
+ rev32 b2.16b, b2.16b; \
+ rev32 b3.16b, b3.16b; \
+ SM4_CRYPT_BLK4_BE(b0, b1, b2, b3);
+
+#define SM4_CRYPT_BLK8_BE(b0, b1, b2, b3, b4, b5, b6, b7) \
+ sm4e b0.4s, v24.4s; \
+ sm4e b1.4s, v24.4s; \
+ sm4e b2.4s, v24.4s; \
+ sm4e b3.4s, v24.4s; \
+ sm4e b4.4s, v24.4s; \
+ sm4e b5.4s, v24.4s; \
+ sm4e b6.4s, v24.4s; \
+ sm4e b7.4s, v24.4s; \
+ sm4e b0.4s, v25.4s; \
+ sm4e b1.4s, v25.4s; \
+ sm4e b2.4s, v25.4s; \
+ sm4e b3.4s, v25.4s; \
+ sm4e b4.4s, v25.4s; \
+ sm4e b5.4s, v25.4s; \
+ sm4e b6.4s, v25.4s; \
+ sm4e b7.4s, v25.4s; \
+ sm4e b0.4s, v26.4s; \
+ sm4e b1.4s, v26.4s; \
+ sm4e b2.4s, v26.4s; \
+ sm4e b3.4s, v26.4s; \
+ sm4e b4.4s, v26.4s; \
+ sm4e b5.4s, v26.4s; \
+ sm4e b6.4s, v26.4s; \
+ sm4e b7.4s, v26.4s; \
+ sm4e b0.4s, v27.4s; \
+ sm4e b1.4s, v27.4s; \
+ sm4e b2.4s, v27.4s; \
+ sm4e b3.4s, v27.4s; \
+ sm4e b4.4s, v27.4s; \
+ sm4e b5.4s, v27.4s; \
+ sm4e b6.4s, v27.4s; \
+ sm4e b7.4s, v27.4s; \
+ sm4e b0.4s, v28.4s; \
+ sm4e b1.4s, v28.4s; \
+ sm4e b2.4s, v28.4s; \
+ sm4e b3.4s, v28.4s; \
+ sm4e b4.4s, v28.4s; \
+ sm4e b5.4s, v28.4s; \
+ sm4e b6.4s, v28.4s; \
+ sm4e b7.4s, v28.4s; \
+ sm4e b0.4s, v29.4s; \
+ sm4e b1.4s, v29.4s; \
+ sm4e b2.4s, v29.4s; \
+ sm4e b3.4s, v29.4s; \
+ sm4e b4.4s, v29.4s; \
+ sm4e b5.4s, v29.4s; \
+ sm4e b6.4s, v29.4s; \
+ sm4e b7.4s, v29.4s; \
+ sm4e b0.4s, v30.4s; \
+ sm4e b1.4s, v30.4s; \
+ sm4e b2.4s, v30.4s; \
+ sm4e b3.4s, v30.4s; \
+ sm4e b4.4s, v30.4s; \
+ sm4e b5.4s, v30.4s; \
+ sm4e b6.4s, v30.4s; \
+ sm4e b7.4s, v30.4s; \
+ sm4e b0.4s, v31.4s; \
+ sm4e b1.4s, v31.4s; \
+ sm4e b2.4s, v31.4s; \
+ sm4e b3.4s, v31.4s; \
+ sm4e b4.4s, v31.4s; \
+ sm4e b5.4s, v31.4s; \
+ sm4e b6.4s, v31.4s; \
+ sm4e b7.4s, v31.4s; \
+ rev64 b0.4s, b0.4s; \
+ rev64 b1.4s, b1.4s; \
+ rev64 b2.4s, b2.4s; \
+ rev64 b3.4s, b3.4s; \
+ rev64 b4.4s, b4.4s; \
+ rev64 b5.4s, b5.4s; \
+ rev64 b6.4s, b6.4s; \
+ rev64 b7.4s, b7.4s; \
+ ext b0.16b, b0.16b, b0.16b, #8; \
+ ext b1.16b, b1.16b, b1.16b, #8; \
+ ext b2.16b, b2.16b, b2.16b, #8; \
+ ext b3.16b, b3.16b, b3.16b, #8; \
+ ext b4.16b, b4.16b, b4.16b, #8; \
+ ext b5.16b, b5.16b, b5.16b, #8; \
+ ext b6.16b, b6.16b, b6.16b, #8; \
+ ext b7.16b, b7.16b, b7.16b, #8; \
+ rev32 b0.16b, b0.16b; \
+ rev32 b1.16b, b1.16b; \
+ rev32 b2.16b, b2.16b; \
+ rev32 b3.16b, b3.16b; \
+ rev32 b4.16b, b4.16b; \
+ rev32 b5.16b, b5.16b; \
+ rev32 b6.16b, b6.16b; \
+ rev32 b7.16b, b7.16b;
+
+#define SM4_CRYPT_BLK8(b0, b1, b2, b3, b4, b5, b6, b7) \
+ rev32 b0.16b, b0.16b; \
+ rev32 b1.16b, b1.16b; \
+ rev32 b2.16b, b2.16b; \
+ rev32 b3.16b, b3.16b; \
+ rev32 b4.16b, b4.16b; \
+ rev32 b5.16b, b5.16b; \
+ rev32 b6.16b, b6.16b; \
+ rev32 b7.16b, b7.16b; \
+ SM4_CRYPT_BLK8_BE(b0, b1, b2, b3, b4, b5, b6, b7);
diff --git a/arch/arm64/crypto/sm4-ce-core.S b/arch/arm64/crypto/sm4-ce-core.S
index 934e0f093279..41fc745a8528 100644
--- a/arch/arm64/crypto/sm4-ce-core.S
+++ b/arch/arm64/crypto/sm4-ce-core.S
@@ -10,10 +10,12 @@
#include <linux/linkage.h>
#include <asm/assembler.h>
+#include "sm4-ce-asm.h"
.arch armv8-a+crypto
-.irp b, 0, 1, 2, 3, 4, 5, 6, 7, 16, 20, 24, 25, 26, 27, 28, 29, 30, 31
+.irp b, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, \
+ 20, 24, 25, 26, 27, 28, 29, 30, 31
.set .Lv\b\().4s, \b
.endr
@@ -34,174 +36,6 @@
#define RIV v20
-/* Helper macros. */
-
-#define PREPARE \
- ld1 {v24.16b-v27.16b}, [x0], #64; \
- ld1 {v28.16b-v31.16b}, [x0];
-
-#define SM4_CRYPT_BLK(b0) \
- rev32 b0.16b, b0.16b; \
- sm4e b0.4s, v24.4s; \
- sm4e b0.4s, v25.4s; \
- sm4e b0.4s, v26.4s; \
- sm4e b0.4s, v27.4s; \
- sm4e b0.4s, v28.4s; \
- sm4e b0.4s, v29.4s; \
- sm4e b0.4s, v30.4s; \
- sm4e b0.4s, v31.4s; \
- rev64 b0.4s, b0.4s; \
- ext b0.16b, b0.16b, b0.16b, #8; \
- rev32 b0.16b, b0.16b;
-
-#define SM4_CRYPT_BLK4(b0, b1, b2, b3) \
- rev32 b0.16b, b0.16b; \
- rev32 b1.16b, b1.16b; \
- rev32 b2.16b, b2.16b; \
- rev32 b3.16b, b3.16b; \
- sm4e b0.4s, v24.4s; \
- sm4e b1.4s, v24.4s; \
- sm4e b2.4s, v24.4s; \
- sm4e b3.4s, v24.4s; \
- sm4e b0.4s, v25.4s; \
- sm4e b1.4s, v25.4s; \
- sm4e b2.4s, v25.4s; \
- sm4e b3.4s, v25.4s; \
- sm4e b0.4s, v26.4s; \
- sm4e b1.4s, v26.4s; \
- sm4e b2.4s, v26.4s; \
- sm4e b3.4s, v26.4s; \
- sm4e b0.4s, v27.4s; \
- sm4e b1.4s, v27.4s; \
- sm4e b2.4s, v27.4s; \
- sm4e b3.4s, v27.4s; \
- sm4e b0.4s, v28.4s; \
- sm4e b1.4s, v28.4s; \
- sm4e b2.4s, v28.4s; \
- sm4e b3.4s, v28.4s; \
- sm4e b0.4s, v29.4s; \
- sm4e b1.4s, v29.4s; \
- sm4e b2.4s, v29.4s; \
- sm4e b3.4s, v29.4s; \
- sm4e b0.4s, v30.4s; \
- sm4e b1.4s, v30.4s; \
- sm4e b2.4s, v30.4s; \
- sm4e b3.4s, v30.4s; \
- sm4e b0.4s, v31.4s; \
- sm4e b1.4s, v31.4s; \
- sm4e b2.4s, v31.4s; \
- sm4e b3.4s, v31.4s; \
- rev64 b0.4s, b0.4s; \
- rev64 b1.4s, b1.4s; \
- rev64 b2.4s, b2.4s; \
- rev64 b3.4s, b3.4s; \
- ext b0.16b, b0.16b, b0.16b, #8; \
- ext b1.16b, b1.16b, b1.16b, #8; \
- ext b2.16b, b2.16b, b2.16b, #8; \
- ext b3.16b, b3.16b, b3.16b, #8; \
- rev32 b0.16b, b0.16b; \
- rev32 b1.16b, b1.16b; \
- rev32 b2.16b, b2.16b; \
- rev32 b3.16b, b3.16b;
-
-#define SM4_CRYPT_BLK8(b0, b1, b2, b3, b4, b5, b6, b7) \
- rev32 b0.16b, b0.16b; \
- rev32 b1.16b, b1.16b; \
- rev32 b2.16b, b2.16b; \
- rev32 b3.16b, b3.16b; \
- rev32 b4.16b, b4.16b; \
- rev32 b5.16b, b5.16b; \
- rev32 b6.16b, b6.16b; \
- rev32 b7.16b, b7.16b; \
- sm4e b0.4s, v24.4s; \
- sm4e b1.4s, v24.4s; \
- sm4e b2.4s, v24.4s; \
- sm4e b3.4s, v24.4s; \
- sm4e b4.4s, v24.4s; \
- sm4e b5.4s, v24.4s; \
- sm4e b6.4s, v24.4s; \
- sm4e b7.4s, v24.4s; \
- sm4e b0.4s, v25.4s; \
- sm4e b1.4s, v25.4s; \
- sm4e b2.4s, v25.4s; \
- sm4e b3.4s, v25.4s; \
- sm4e b4.4s, v25.4s; \
- sm4e b5.4s, v25.4s; \
- sm4e b6.4s, v25.4s; \
- sm4e b7.4s, v25.4s; \
- sm4e b0.4s, v26.4s; \
- sm4e b1.4s, v26.4s; \
- sm4e b2.4s, v26.4s; \
- sm4e b3.4s, v26.4s; \
- sm4e b4.4s, v26.4s; \
- sm4e b5.4s, v26.4s; \
- sm4e b6.4s, v26.4s; \
- sm4e b7.4s, v26.4s; \
- sm4e b0.4s, v27.4s; \
- sm4e b1.4s, v27.4s; \
- sm4e b2.4s, v27.4s; \
- sm4e b3.4s, v27.4s; \
- sm4e b4.4s, v27.4s; \
- sm4e b5.4s, v27.4s; \
- sm4e b6.4s, v27.4s; \
- sm4e b7.4s, v27.4s; \
- sm4e b0.4s, v28.4s; \
- sm4e b1.4s, v28.4s; \
- sm4e b2.4s, v28.4s; \
- sm4e b3.4s, v28.4s; \
- sm4e b4.4s, v28.4s; \
- sm4e b5.4s, v28.4s; \
- sm4e b6.4s, v28.4s; \
- sm4e b7.4s, v28.4s; \
- sm4e b0.4s, v29.4s; \
- sm4e b1.4s, v29.4s; \
- sm4e b2.4s, v29.4s; \
- sm4e b3.4s, v29.4s; \
- sm4e b4.4s, v29.4s; \
- sm4e b5.4s, v29.4s; \
- sm4e b6.4s, v29.4s; \
- sm4e b7.4s, v29.4s; \
- sm4e b0.4s, v30.4s; \
- sm4e b1.4s, v30.4s; \
- sm4e b2.4s, v30.4s; \
- sm4e b3.4s, v30.4s; \
- sm4e b4.4s, v30.4s; \
- sm4e b5.4s, v30.4s; \
- sm4e b6.4s, v30.4s; \
- sm4e b7.4s, v30.4s; \
- sm4e b0.4s, v31.4s; \
- sm4e b1.4s, v31.4s; \
- sm4e b2.4s, v31.4s; \
- sm4e b3.4s, v31.4s; \
- sm4e b4.4s, v31.4s; \
- sm4e b5.4s, v31.4s; \
- sm4e b6.4s, v31.4s; \
- sm4e b7.4s, v31.4s; \
- rev64 b0.4s, b0.4s; \
- rev64 b1.4s, b1.4s; \
- rev64 b2.4s, b2.4s; \
- rev64 b3.4s, b3.4s; \
- rev64 b4.4s, b4.4s; \
- rev64 b5.4s, b5.4s; \
- rev64 b6.4s, b6.4s; \
- rev64 b7.4s, b7.4s; \
- ext b0.16b, b0.16b, b0.16b, #8; \
- ext b1.16b, b1.16b, b1.16b, #8; \
- ext b2.16b, b2.16b, b2.16b, #8; \
- ext b3.16b, b3.16b, b3.16b, #8; \
- ext b4.16b, b4.16b, b4.16b, #8; \
- ext b5.16b, b5.16b, b5.16b, #8; \
- ext b6.16b, b6.16b, b6.16b, #8; \
- ext b7.16b, b7.16b, b7.16b, #8; \
- rev32 b0.16b, b0.16b; \
- rev32 b1.16b, b1.16b; \
- rev32 b2.16b, b2.16b; \
- rev32 b3.16b, b3.16b; \
- rev32 b4.16b, b4.16b; \
- rev32 b5.16b, b5.16b; \
- rev32 b6.16b, b6.16b; \
- rev32 b7.16b, b7.16b;
-
.align 3
SYM_FUNC_START(sm4_ce_expand_key)
@@ -268,7 +102,7 @@ SYM_FUNC_START(sm4_ce_crypt_block)
* x1: dst
* x2: src
*/
- PREPARE;
+ SM4_PREPARE(x0)
ld1 {v0.16b}, [x2];
SM4_CRYPT_BLK(v0);
@@ -285,7 +119,7 @@ SYM_FUNC_START(sm4_ce_crypt)
* x2: src
* w3: nblocks
*/
- PREPARE;
+ SM4_PREPARE(x0)
.Lcrypt_loop_blk:
sub w3, w3, #8;
@@ -337,26 +171,50 @@ SYM_FUNC_START(sm4_ce_cbc_enc)
* x3: iv (big endian, 128 bit)
* w4: nblocks
*/
- PREPARE;
+ SM4_PREPARE(x0)
+
+ ld1 {RIV.16b}, [x3]
+
+.Lcbc_enc_loop_4x:
+ cmp w4, #4
+ blt .Lcbc_enc_loop_1x
+
+ sub w4, w4, #4
- ld1 {RIV.16b}, [x3];
+ ld1 {v0.16b-v3.16b}, [x2], #64
-.Lcbc_enc_loop:
- sub w4, w4, #1;
+ eor v0.16b, v0.16b, RIV.16b
+ SM4_CRYPT_BLK(v0)
+ eor v1.16b, v1.16b, v0.16b
+ SM4_CRYPT_BLK(v1)
+ eor v2.16b, v2.16b, v1.16b
+ SM4_CRYPT_BLK(v2)
+ eor v3.16b, v3.16b, v2.16b
+ SM4_CRYPT_BLK(v3)
- ld1 {RTMP0.16b}, [x2], #16;
- eor RIV.16b, RIV.16b, RTMP0.16b;
+ st1 {v0.16b-v3.16b}, [x1], #64
+ mov RIV.16b, v3.16b
- SM4_CRYPT_BLK(RIV);
+ cbz w4, .Lcbc_enc_end
+ b .Lcbc_enc_loop_4x
- st1 {RIV.16b}, [x1], #16;
+.Lcbc_enc_loop_1x:
+ sub w4, w4, #1
- cbnz w4, .Lcbc_enc_loop;
+ ld1 {v0.16b}, [x2], #16
+ eor RIV.16b, RIV.16b, v0.16b
+ SM4_CRYPT_BLK(RIV)
+
+ st1 {RIV.16b}, [x1], #16
+
+ cbnz w4, .Lcbc_enc_loop_1x
+
+.Lcbc_enc_end:
/* store new IV */
- st1 {RIV.16b}, [x3];
+ st1 {RIV.16b}, [x3]
- ret;
+ ret
SYM_FUNC_END(sm4_ce_cbc_enc)
.align 3
@@ -368,79 +226,93 @@ SYM_FUNC_START(sm4_ce_cbc_dec)
* x3: iv (big endian, 128 bit)
* w4: nblocks
*/
- PREPARE;
+ SM4_PREPARE(x0)
- ld1 {RIV.16b}, [x3];
+ ld1 {RIV.16b}, [x3]
-.Lcbc_loop_blk:
- sub w4, w4, #8;
- tbnz w4, #31, .Lcbc_tail8;
+.Lcbc_dec_loop_8x:
+ sub w4, w4, #8
+ tbnz w4, #31, .Lcbc_dec_4x
- ld1 {v0.16b-v3.16b}, [x2], #64;
- ld1 {v4.16b-v7.16b}, [x2];
+ ld1 {v0.16b-v3.16b}, [x2], #64
+ ld1 {v4.16b-v7.16b}, [x2], #64
- SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7);
+ rev32 v8.16b, v0.16b
+ rev32 v9.16b, v1.16b
+ rev32 v10.16b, v2.16b
+ rev32 v11.16b, v3.16b
+ rev32 v12.16b, v4.16b
+ rev32 v13.16b, v5.16b
+ rev32 v14.16b, v6.16b
+ rev32 v15.16b, v7.16b
- sub x2, x2, #64;
- eor v0.16b, v0.16b, RIV.16b;
- ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
- eor v1.16b, v1.16b, RTMP0.16b;
- eor v2.16b, v2.16b, RTMP1.16b;
- eor v3.16b, v3.16b, RTMP2.16b;
- st1 {v0.16b-v3.16b}, [x1], #64;
+ SM4_CRYPT_BLK8_BE(v8, v9, v10, v11, v12, v13, v14, v15)
- eor v4.16b, v4.16b, RTMP3.16b;
- ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
- eor v5.16b, v5.16b, RTMP0.16b;
- eor v6.16b, v6.16b, RTMP1.16b;
- eor v7.16b, v7.16b, RTMP2.16b;
+ eor v8.16b, v8.16b, RIV.16b
+ eor v9.16b, v9.16b, v0.16b
+ eor v10.16b, v10.16b, v1.16b
+ eor v11.16b, v11.16b, v2.16b
+ eor v12.16b, v12.16b, v3.16b
+ eor v13.16b, v13.16b, v4.16b
+ eor v14.16b, v14.16b, v5.16b
+ eor v15.16b, v15.16b, v6.16b
- mov RIV.16b, RTMP3.16b;
- st1 {v4.16b-v7.16b}, [x1], #64;
+ st1 {v8.16b-v11.16b}, [x1], #64
+ st1 {v12.16b-v15.16b}, [x1], #64
- cbz w4, .Lcbc_end;
- b .Lcbc_loop_blk;
+ mov RIV.16b, v7.16b
-.Lcbc_tail8:
- add w4, w4, #8;
- cmp w4, #4;
- blt .Lcbc_tail4;
+ cbz w4, .Lcbc_dec_end
+ b .Lcbc_dec_loop_8x
- sub w4, w4, #4;
+.Lcbc_dec_4x:
+ add w4, w4, #8
+ cmp w4, #4
+ blt .Lcbc_dec_loop_1x
- ld1 {v0.16b-v3.16b}, [x2];
+ sub w4, w4, #4
- SM4_CRYPT_BLK4(v0, v1, v2, v3);
+ ld1 {v0.16b-v3.16b}, [x2], #64
- eor v0.16b, v0.16b, RIV.16b;
- ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
- eor v1.16b, v1.16b, RTMP0.16b;
- eor v2.16b, v2.16b, RTMP1.16b;
- eor v3.16b, v3.16b, RTMP2.16b;
+ rev32 v8.16b, v0.16b
+ rev32 v9.16b, v1.16b
+ rev32 v10.16b, v2.16b
+ rev32 v11.16b, v3.16b
- mov RIV.16b, RTMP3.16b;
- st1 {v0.16b-v3.16b}, [x1], #64;
+ SM4_CRYPT_BLK4_BE(v8, v9, v10, v11)
- cbz w4, .Lcbc_end;
+ eor v8.16b, v8.16b, RIV.16b
+ eor v9.16b, v9.16b, v0.16b
+ eor v10.16b, v10.16b, v1.16b
+ eor v11.16b, v11.16b, v2.16b
-.Lcbc_tail4:
- sub w4, w4, #1;
+ st1 {v8.16b-v11.16b}, [x1], #64
- ld1 {v0.16b}, [x2];
+ mov RIV.16b, v3.16b
- SM4_CRYPT_BLK(v0);
+ cbz w4, .Lcbc_dec_end
- eor v0.16b, v0.16b, RIV.16b;
- ld1 {RIV.16b}, [x2], #16;
- st1 {v0.16b}, [x1], #16;
+.Lcbc_dec_loop_1x:
+ sub w4, w4, #1
+
+ ld1 {v0.16b}, [x2], #16
+
+ rev32 v8.16b, v0.16b
+
+ SM4_CRYPT_BLK_BE(v8)
- cbnz w4, .Lcbc_tail4;
+ eor v8.16b, v8.16b, RIV.16b
+ st1 {v8.16b}, [x1], #16
-.Lcbc_end:
+ mov RIV.16b, v0.16b
+
+ cbnz w4, .Lcbc_dec_loop_1x
+
+.Lcbc_dec_end:
/* store new IV */
- st1 {RIV.16b}, [x3];
+ st1 {RIV.16b}, [x3]
- ret;
+ ret
SYM_FUNC_END(sm4_ce_cbc_dec)
.align 3
@@ -452,25 +324,57 @@ SYM_FUNC_START(sm4_ce_cfb_enc)
* x3: iv (big endian, 128 bit)
* w4: nblocks
*/
- PREPARE;
+ SM4_PREPARE(x0)
+
+ ld1 {RIV.16b}, [x3]
+
+.Lcfb_enc_loop_4x:
+ cmp w4, #4
+ blt .Lcfb_enc_loop_1x
+
+ sub w4, w4, #4
+
+ ld1 {v0.16b-v3.16b}, [x2], #64
+
+ rev32 v8.16b, RIV.16b
+ SM4_CRYPT_BLK_BE(v8)
+ eor v0.16b, v0.16b, v8.16b
+
+ rev32 v8.16b, v0.16b
+ SM4_CRYPT_BLK_BE(v8)
+ eor v1.16b, v1.16b, v8.16b
+
+ rev32 v8.16b, v1.16b
+ SM4_CRYPT_BLK_BE(v8)
+ eor v2.16b, v2.16b, v8.16b
+
+ rev32 v8.16b, v2.16b
+ SM4_CRYPT_BLK_BE(v8)
+ eor v3.16b, v3.16b, v8.16b
- ld1 {RIV.16b}, [x3];
+ st1 {v0.16b-v3.16b}, [x1], #64
+ mov RIV.16b, v3.16b
-.Lcfb_enc_loop:
- sub w4, w4, #1;
+ cbz w4, .Lcfb_enc_end
+ b .Lcfb_enc_loop_4x
- SM4_CRYPT_BLK(RIV);
+.Lcfb_enc_loop_1x:
+ sub w4, w4, #1
- ld1 {RTMP0.16b}, [x2], #16;
- eor RIV.16b, RIV.16b, RTMP0.16b;
- st1 {RIV.16b}, [x1], #16;
+ ld1 {v0.16b}, [x2], #16
- cbnz w4, .Lcfb_enc_loop;
+ SM4_CRYPT_BLK(RIV)
+ eor RIV.16b, RIV.16b, v0.16b
+ st1 {RIV.16b}, [x1], #16
+
+ cbnz w4, .Lcfb_enc_loop_1x
+
+.Lcfb_enc_end:
/* store new IV */
- st1 {RIV.16b}, [x3];
+ st1 {RIV.16b}, [x3]
- ret;
+ ret
SYM_FUNC_END(sm4_ce_cfb_enc)
.align 3
@@ -482,79 +386,91 @@ SYM_FUNC_START(sm4_ce_cfb_dec)
* x3: iv (big endian, 128 bit)
* w4: nblocks
*/
- PREPARE;
+ SM4_PREPARE(x0)
- ld1 {v0.16b}, [x3];
+ ld1 {RIV.16b}, [x3]
-.Lcfb_loop_blk:
- sub w4, w4, #8;
- tbnz w4, #31, .Lcfb_tail8;
+.Lcfb_dec_loop_8x:
+ sub w4, w4, #8
+ tbnz w4, #31, .Lcfb_dec_4x
- ld1 {v1.16b, v2.16b, v3.16b}, [x2], #48;
- ld1 {v4.16b-v7.16b}, [x2];
+ ld1 {v0.16b-v3.16b}, [x2], #64
+ ld1 {v4.16b-v7.16b}, [x2], #64
- SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7);
+ rev32 v8.16b, RIV.16b
+ rev32 v9.16b, v0.16b
+ rev32 v10.16b, v1.16b
+ rev32 v11.16b, v2.16b
+ rev32 v12.16b, v3.16b
+ rev32 v13.16b, v4.16b
+ rev32 v14.16b, v5.16b
+ rev32 v15.16b, v6.16b
- sub x2, x2, #48;
- ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
- eor v0.16b, v0.16b, RTMP0.16b;
- eor v1.16b, v1.16b, RTMP1.16b;
- eor v2.16b, v2.16b, RTMP2.16b;
- eor v3.16b, v3.16b, RTMP3.16b;
- st1 {v0.16b-v3.16b}, [x1], #64;
+ SM4_CRYPT_BLK8_BE(v8, v9, v10, v11, v12, v13, v14, v15)
- ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
- eor v4.16b, v4.16b, RTMP0.16b;
- eor v5.16b, v5.16b, RTMP1.16b;
- eor v6.16b, v6.16b, RTMP2.16b;
- eor v7.16b, v7.16b, RTMP3.16b;
- st1 {v4.16b-v7.16b}, [x1], #64;
+ mov RIV.16b, v7.16b
- mov v0.16b, RTMP3.16b;
+ eor v0.16b, v0.16b, v8.16b
+ eor v1.16b, v1.16b, v9.16b
+ eor v2.16b, v2.16b, v10.16b
+ eor v3.16b, v3.16b, v11.16b
+ eor v4.16b, v4.16b, v12.16b
+ eor v5.16b, v5.16b, v13.16b
+ eor v6.16b, v6.16b, v14.16b
+ eor v7.16b, v7.16b, v15.16b
- cbz w4, .Lcfb_end;
- b .Lcfb_loop_blk;
+ st1 {v0.16b-v3.16b}, [x1], #64
+ st1 {v4.16b-v7.16b}, [x1], #64
-.Lcfb_tail8:
- add w4, w4, #8;
- cmp w4, #4;
- blt .Lcfb_tail4;
+ cbz w4, .Lcfb_dec_end
+ b .Lcfb_dec_loop_8x
- sub w4, w4, #4;
+.Lcfb_dec_4x:
+ add w4, w4, #8
+ cmp w4, #4
+ blt .Lcfb_dec_loop_1x
- ld1 {v1.16b, v2.16b, v3.16b}, [x2];
+ sub w4, w4, #4
- SM4_CRYPT_BLK4(v0, v1, v2, v3);
+ ld1 {v0.16b-v3.16b}, [x2], #64
- ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
- eor v0.16b, v0.16b, RTMP0.16b;
- eor v1.16b, v1.16b, RTMP1.16b;
- eor v2.16b, v2.16b, RTMP2.16b;
- eor v3.16b, v3.16b, RTMP3.16b;
- st1 {v0.16b-v3.16b}, [x1], #64;
+ rev32 v8.16b, RIV.16b
+ rev32 v9.16b, v0.16b
+ rev32 v10.16b, v1.16b
+ rev32 v11.16b, v2.16b
- mov v0.16b, RTMP3.16b;
+ SM4_CRYPT_BLK4_BE(v8, v9, v10, v11)
- cbz w4, .Lcfb_end;
+ mov RIV.16b, v3.16b
-.Lcfb_tail4:
- sub w4, w4, #1;
+ eor v0.16b, v0.16b, v8.16b
+ eor v1.16b, v1.16b, v9.16b
+ eor v2.16b, v2.16b, v10.16b
+ eor v3.16b, v3.16b, v11.16b
- SM4_CRYPT_BLK(v0);
+ st1 {v0.16b-v3.16b}, [x1], #64
- ld1 {RTMP0.16b}, [x2], #16;
- eor v0.16b, v0.16b, RTMP0.16b;
- st1 {v0.16b}, [x1], #16;
+ cbz w4, .Lcfb_dec_end
+
+.Lcfb_dec_loop_1x:
+ sub w4, w4, #1
+
+ ld1 {v0.16b}, [x2], #16
- mov v0.16b, RTMP0.16b;
+ SM4_CRYPT_BLK(RIV)
- cbnz w4, .Lcfb_tail4;
+ eor RIV.16b, RIV.16b, v0.16b
+ st1 {RIV.16b}, [x1], #16
-.Lcfb_end:
+ mov RIV.16b, v0.16b
+
+ cbnz w4, .Lcfb_dec_loop_1x
+
+.Lcfb_dec_end:
/* store new IV */
- st1 {v0.16b}, [x3];
+ st1 {RIV.16b}, [x3]
- ret;
+ ret
SYM_FUNC_END(sm4_ce_cfb_dec)
.align 3
@@ -566,95 +482,99 @@ SYM_FUNC_START(sm4_ce_ctr_enc)
* x3: ctr (big endian, 128 bit)
* w4: nblocks
*/
- PREPARE;
+ SM4_PREPARE(x0)
- ldp x7, x8, [x3];
- rev x7, x7;
- rev x8, x8;
+ ldp x7, x8, [x3]
+ rev x7, x7
+ rev x8, x8
-.Lctr_loop_blk:
- sub w4, w4, #8;
- tbnz w4, #31, .Lctr_tail8;
+.Lctr_loop_8x:
+ sub w4, w4, #8
+ tbnz w4, #31, .Lctr_4x
-#define inc_le128(vctr) \
- mov vctr.d[1], x8; \
- mov vctr.d[0], x7; \
- adds x8, x8, #1; \
- adc x7, x7, xzr; \
- rev64 vctr.16b, vctr.16b;
+#define inc_le128(vctr) \
+ mov vctr.d[1], x8; \
+ mov vctr.d[0], x7; \
+ adds x8, x8, #1; \
+ rev64 vctr.16b, vctr.16b; \
+ adc x7, x7, xzr;
/* construct CTRs */
- inc_le128(v0); /* +0 */
- inc_le128(v1); /* +1 */
- inc_le128(v2); /* +2 */
- inc_le128(v3); /* +3 */
- inc_le128(v4); /* +4 */
- inc_le128(v5); /* +5 */
- inc_le128(v6); /* +6 */
- inc_le128(v7); /* +7 */
+ inc_le128(v0) /* +0 */
+ inc_le128(v1) /* +1 */
+ inc_le128(v2) /* +2 */
+ inc_le128(v3) /* +3 */
+ inc_le128(v4) /* +4 */
+ inc_le128(v5) /* +5 */
+ inc_le128(v6) /* +6 */
+ inc_le128(v7) /* +7 */
+
+ ld1 {v8.16b-v11.16b}, [x2], #64
+ ld1 {v12.16b-v15.16b}, [x2], #64
+
+ SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7)
+
+ eor v0.16b, v0.16b, v8.16b
+ eor v1.16b, v1.16b, v9.16b
+ eor v2.16b, v2.16b, v10.16b
+ eor v3.16b, v3.16b, v11.16b
+ eor v4.16b, v4.16b, v12.16b
+ eor v5.16b, v5.16b, v13.16b
+ eor v6.16b, v6.16b, v14.16b
+ eor v7.16b, v7.16b, v15.16b
+
+ st1 {v0.16b-v3.16b}, [x1], #64
+ st1 {v4.16b-v7.16b}, [x1], #64
+
+ cbz w4, .Lctr_end
+ b .Lctr_loop_8x
+
+.Lctr_4x:
+ add w4, w4, #8
+ cmp w4, #4
+ blt .Lctr_loop_1x
+
+ sub w4, w4, #4
- SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7);
-
- ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
- eor v0.16b, v0.16b, RTMP0.16b;
- eor v1.16b, v1.16b, RTMP1.16b;
- eor v2.16b, v2.16b, RTMP2.16b;
- eor v3.16b, v3.16b, RTMP3.16b;
- st1 {v0.16b-v3.16b}, [x1], #64;
+ /* construct CTRs */
+ inc_le128(v0) /* +0 */
+ inc_le128(v1) /* +1 */
+ inc_le128(v2) /* +2 */
+ inc_le128(v3) /* +3 */
- ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
- eor v4.16b, v4.16b, RTMP0.16b;
- eor v5.16b, v5.16b, RTMP1.16b;
- eor v6.16b, v6.16b, RTMP2.16b;
- eor v7.16b, v7.16b, RTMP3.16b;
- st1 {v4.16b-v7.16b}, [x1], #64;
+ ld1 {v8.16b-v11.16b}, [x2], #64
- cbz w4, .Lctr_end;
- b .Lctr_loop_blk;
+ SM4_CRYPT_BLK4(v0, v1, v2, v3)
-.Lctr_tail8:
- add w4, w4, #8;
- cmp w4, #4;
- blt .Lctr_tail4;
+ eor v0.16b, v0.16b, v8.16b
+ eor v1.16b, v1.16b, v9.16b
+ eor v2.16b, v2.16b, v10.16b
+ eor v3.16b, v3.16b, v11.16b
- sub w4, w4, #4;
+ st1 {v0.16b-v3.16b}, [x1], #64
- /* construct CTRs */
- inc_le128(v0); /* +0 */
- inc_le128(v1); /* +1 */
- inc_le128(v2); /* +2 */
- inc_le128(v3); /* +3 */
+ cbz w4, .Lctr_end
- SM4_CRYPT_BLK4(v0, v1, v2, v3);
-
- ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
- eor v0.16b, v0.16b, RTMP0.16b;
- eor v1.16b, v1.16b, RTMP1.16b;
- eor v2.16b, v2.16b, RTMP2.16b;
- eor v3.16b, v3.16b, RTMP3.16b;
- st1 {v0.16b-v3.16b}, [x1], #64;
-
- cbz w4, .Lctr_end;
-
-.Lctr_tail4:
- sub w4, w4, #1;
+.Lctr_loop_1x:
+ sub w4, w4, #1
/* construct CTRs */
- inc_le128(v0);
+ inc_le128(v0)
- SM4_CRYPT_BLK(v0);
+ ld1 {v8.16b}, [x2], #16
- ld1 {RTMP0.16b}, [x2], #16;
- eor v0.16b, v0.16b, RTMP0.16b;
- st1 {v0.16b}, [x1], #16;
+ SM4_CRYPT_BLK(v0)
+
+ eor v0.16b, v0.16b, v8.16b
+ st1 {v0.16b}, [x1], #16
- cbnz w4, .Lctr_tail4;
+ cbnz w4, .Lctr_loop_1x
.Lctr_end:
/* store new CTR */
- rev x7, x7;
- rev x8, x8;
- stp x7, x8, [x3];
+ rev x7, x7
+ rev x8, x8
+ stp x7, x8, [x3]
- ret;
+ ret
SYM_FUNC_END(sm4_ce_ctr_enc)
diff --git a/arch/arm64/crypto/sm4-ce-glue.c b/arch/arm64/crypto/sm4-ce-glue.c
index 496d55c0d01a..e56e81b1f35f 100644
--- a/arch/arm64/crypto/sm4-ce-glue.c
+++ b/arch/arm64/crypto/sm4-ce-glue.c
@@ -26,9 +26,9 @@ asmlinkage void sm4_ce_crypt_block(const u32 *rkey, u8 *dst, const u8 *src);
asmlinkage void sm4_ce_crypt(const u32 *rkey, u8 *dst, const u8 *src,
unsigned int nblks);
asmlinkage void sm4_ce_cbc_enc(const u32 *rkey, u8 *dst, const u8 *src,
- u8 *iv, unsigned int nblks);
+ u8 *iv, unsigned int nblocks);
asmlinkage void sm4_ce_cbc_dec(const u32 *rkey, u8 *dst, const u8 *src,
- u8 *iv, unsigned int nblks);
+ u8 *iv, unsigned int nblocks);
asmlinkage void sm4_ce_cfb_enc(const u32 *rkey, u8 *dst, const u8 *src,
u8 *iv, unsigned int nblks);
asmlinkage void sm4_ce_cfb_dec(const u32 *rkey, u8 *dst, const u8 *src,
@@ -94,66 +94,56 @@ static int sm4_ecb_decrypt(struct skcipher_request *req)
return sm4_ecb_do_crypt(req, ctx->rkey_dec);
}
-static int sm4_cbc_encrypt(struct skcipher_request *req)
+static int sm4_cbc_crypt(struct skcipher_request *req,
+ struct sm4_ctx *ctx, bool encrypt)
{
- struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
- struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
struct skcipher_walk walk;
unsigned int nbytes;
int err;
err = skcipher_walk_virt(&walk, req, false);
+ if (err)
+ return err;
while ((nbytes = walk.nbytes) > 0) {
const u8 *src = walk.src.virt.addr;
u8 *dst = walk.dst.virt.addr;
- unsigned int nblks;
+ unsigned int nblocks;
- kernel_neon_begin();
+ nblocks = nbytes / SM4_BLOCK_SIZE;
+ if (nblocks) {
+ kernel_neon_begin();
- nblks = BYTES2BLKS(nbytes);
- if (nblks) {
- sm4_ce_cbc_enc(ctx->rkey_enc, dst, src, walk.iv, nblks);
- nbytes -= nblks * SM4_BLOCK_SIZE;
- }
+ if (encrypt)
+ sm4_ce_cbc_enc(ctx->rkey_enc, dst, src,
+ walk.iv, nblocks);
+ else
+ sm4_ce_cbc_dec(ctx->rkey_dec, dst, src,
+ walk.iv, nblocks);
- kernel_neon_end();
+ kernel_neon_end();
+ }
- err = skcipher_walk_done(&walk, nbytes);
+ err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE);
}
return err;
}
-static int sm4_cbc_decrypt(struct skcipher_request *req)
+static int sm4_cbc_encrypt(struct skcipher_request *req)
{
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
- struct skcipher_walk walk;
- unsigned int nbytes;
- int err;
-
- err = skcipher_walk_virt(&walk, req, false);
- while ((nbytes = walk.nbytes) > 0) {
- const u8 *src = walk.src.virt.addr;
- u8 *dst = walk.dst.virt.addr;
- unsigned int nblks;
-
- kernel_neon_begin();
-
- nblks = BYTES2BLKS(nbytes);
- if (nblks) {
- sm4_ce_cbc_dec(ctx->rkey_dec, dst, src, walk.iv, nblks);
- nbytes -= nblks * SM4_BLOCK_SIZE;
- }
-
- kernel_neon_end();
+ return sm4_cbc_crypt(req, ctx, true);
+}
- err = skcipher_walk_done(&walk, nbytes);
- }
+static int sm4_cbc_decrypt(struct skcipher_request *req)
+{
+ struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+ struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
- return err;
+ return sm4_cbc_crypt(req, ctx, false);
}
static int sm4_cfb_encrypt(struct skcipher_request *req)
--
2.24.3 (Apple Git-128)
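
Editorial note on the sm4_ce_ctr_enc hunk above: inc_le128 copies the current 128-bit counter into a vector register and then post-increments it, carrying from the low into the high 64-bit half (the adds/adc pair). A minimal user-space sketch of that arithmetic follows. The names ctr128, ctr128_load, ctr128_next and ctr128_store are hypothetical and do not appear in the patch; only the counter handling is modelled, not the SM4 encryption.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the x7 (high half) / x8 (low half) register pair. */
struct ctr128 {
	uint64_t hi;
	uint64_t lo;
};

static uint64_t load_be64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

static void store_be64(uint8_t *p, uint64_t v)
{
	for (int i = 7; i >= 0; i--) {
		p[i] = (uint8_t)v;
		v >>= 8;
	}
}

/* Mirrors "ldp x7, x8, [x3]; rev x7, x7; rev x8, x8". */
static struct ctr128 ctr128_load(const uint8_t iv[16])
{
	struct ctr128 c = { load_be64(iv), load_be64(iv + 8) };

	return c;
}

/* Emit the current counter as a big-endian block, then post-increment. */
static void ctr128_next(struct ctr128 *c, uint8_t block[16])
{
	store_be64(block, c->hi);
	store_be64(block + 8, c->lo);

	c->lo += 1;		/* adds x8, x8, #1 */
	if (c->lo == 0)		/* adc  x7, x7, xzr (carry out of low half) */
		c->hi += 1;
}

/* Mirrors the "store new CTR" epilogue. */
static void ctr128_store(const struct ctr128 *c, uint8_t iv[16])
{
	store_be64(iv, c->hi);
	store_be64(iv + 8, c->lo);
}
```

Each call to ctr128_next() yields the block that would be encrypted and XORed with one 16-byte chunk of input; the 8x, 4x and 1x paths in the assembly differ only in how many counters are built before the cipher runs.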
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH 07/16] crypto: arm64/sm4 - simplify sm4_ce_expand_key() of CE implementation
2022-09-26 9:36 ` Tianjia Zhang
@ 2022-09-26 9:36 ` Tianjia Zhang
-1 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26 9:36 UTC (permalink / raw)
To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
linux-stm32
Use a 128-bit swap mask and the tbl instruction to simplify the generation
of the SM4 decryption round keys (rkey_dec).
Also wrap the call to sm4_ce_expand_key() in kernel_neon_begin() and
kernel_neon_end(), which was previously missing.
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
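
Editorial note: the mask reverses the order of the four 32-bit words in a 128-bit vector while keeping the byte order inside each word, which is what the removed rev64 + ext #8 pair did. Applied to the eight round-key vectors in reverse order, it yields rkey_dec[i] == rkey_enc[31 - i]. The following user-space sketch models this; tbl16 and make_rkey_dec are hypothetical names, and tbl16 is a software stand-in for the single-register TBL instruction, not kernel code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same 16 index bytes as .Lbswap128_mask in the patch. */
static const uint8_t bswap128_mask[16] = {
	0x0c, 0x0d, 0x0e, 0x0f, 0x08, 0x09, 0x0a, 0x0b,
	0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03,
};

/*
 * Software model of single-register TBL:
 * out[i] = table[idx[i]], or 0 when idx[i] is out of range (>= 16).
 */
static void tbl16(uint8_t out[16], const uint8_t table[16],
		  const uint8_t idx[16])
{
	uint8_t tmp[16];

	for (int i = 0; i < 16; i++)
		tmp[i] = idx[i] < 16 ? table[idx[i]] : 0;
	memcpy(out, tmp, 16);
}

/*
 * Treat the 32 round keys as 8 vectors of 4 words (v0..v7).
 * Decryption vector j is the word-reversed encryption vector 7 - j,
 * matching "tbl v16, {v7}, v24" ... "tbl v23, {v0}, v24".
 */
static void make_rkey_dec(uint32_t rkey_dec[32], const uint32_t rkey_enc[32])
{
	const uint8_t *enc = (const uint8_t *)rkey_enc;
	uint8_t *dec = (uint8_t *)rkey_dec;

	for (int v = 0; v < 8; v++)
		tbl16(dec + 16 * v, enc + 16 * (7 - v), bswap128_mask);
}
```

Because whole 4-byte groups are moved, the result does not depend on host endianness.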
arch/arm64/crypto/sm4-ce-core.S | 46 ++++++++++++++++-----------------
arch/arm64/crypto/sm4-ce-glue.c | 2 ++
2 files changed, 24 insertions(+), 24 deletions(-)
diff --git a/arch/arm64/crypto/sm4-ce-core.S b/arch/arm64/crypto/sm4-ce-core.S
index 41fc745a8528..9e4b4f01cdf3 100644
--- a/arch/arm64/crypto/sm4-ce-core.S
+++ b/arch/arm64/crypto/sm4-ce-core.S
@@ -65,32 +65,23 @@ SYM_FUNC_START(sm4_ce_expand_key)
sm4ekey v6.4s, v5.4s, v30.4s;
sm4ekey v7.4s, v6.4s, v31.4s;
+ adr_l x5, .Lbswap128_mask
+ ld1 {v24.16b}, [x5]
+
st1 {v0.16b-v3.16b}, [x1], #64;
st1 {v4.16b-v7.16b}, [x1];
- rev64 v7.4s, v7.4s;
- rev64 v6.4s, v6.4s;
- rev64 v5.4s, v5.4s;
- rev64 v4.4s, v4.4s;
- rev64 v3.4s, v3.4s;
- rev64 v2.4s, v2.4s;
- rev64 v1.4s, v1.4s;
- rev64 v0.4s, v0.4s;
- ext v7.16b, v7.16b, v7.16b, #8;
- ext v6.16b, v6.16b, v6.16b, #8;
- ext v5.16b, v5.16b, v5.16b, #8;
- ext v4.16b, v4.16b, v4.16b, #8;
- ext v3.16b, v3.16b, v3.16b, #8;
- ext v2.16b, v2.16b, v2.16b, #8;
- ext v1.16b, v1.16b, v1.16b, #8;
- ext v0.16b, v0.16b, v0.16b, #8;
- st1 {v7.16b}, [x2], #16;
- st1 {v6.16b}, [x2], #16;
- st1 {v5.16b}, [x2], #16;
- st1 {v4.16b}, [x2], #16;
- st1 {v3.16b}, [x2], #16;
- st1 {v2.16b}, [x2], #16;
- st1 {v1.16b}, [x2], #16;
- st1 {v0.16b}, [x2];
+
+ tbl v16.16b, {v7.16b}, v24.16b
+ tbl v17.16b, {v6.16b}, v24.16b
+ tbl v18.16b, {v5.16b}, v24.16b
+ tbl v19.16b, {v4.16b}, v24.16b
+ tbl v20.16b, {v3.16b}, v24.16b
+ tbl v21.16b, {v2.16b}, v24.16b
+ tbl v22.16b, {v1.16b}, v24.16b
+ tbl v23.16b, {v0.16b}, v24.16b
+
+ st1 {v16.16b-v19.16b}, [x2], #64
+ st1 {v20.16b-v23.16b}, [x2]
ret;
SYM_FUNC_END(sm4_ce_expand_key)
@@ -578,3 +569,10 @@ SYM_FUNC_START(sm4_ce_ctr_enc)
ret
SYM_FUNC_END(sm4_ce_ctr_enc)
+
+
+ .section ".rodata", "a"
+ .align 4
+.Lbswap128_mask:
+ .byte 0x0c, 0x0d, 0x0e, 0x0f, 0x08, 0x09, 0x0a, 0x0b
+ .byte 0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03
diff --git a/arch/arm64/crypto/sm4-ce-glue.c b/arch/arm64/crypto/sm4-ce-glue.c
index e56e81b1f35f..ff2d8442d473 100644
--- a/arch/arm64/crypto/sm4-ce-glue.c
+++ b/arch/arm64/crypto/sm4-ce-glue.c
@@ -44,8 +44,10 @@ static int sm4_setkey(struct crypto_skcipher *tfm, const u8 *key,
if (key_len != SM4_KEY_SIZE)
return -EINVAL;
+ kernel_neon_begin();
sm4_ce_expand_key(key, ctx->rkey_enc, ctx->rkey_dec,
crypto_sm4_fk, crypto_sm4_ck);
+ kernel_neon_end();
return 0;
}
--
2.24.3 (Apple Git-128)
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH 08/16] crypto: arm64/sm4 - export reusable CE acceleration functions
2022-09-26 9:36 ` Tianjia Zhang
@ 2022-09-26 9:36 ` Tianjia Zhang
-1 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26 9:36 UTC (permalink / raw)
To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
linux-stm32
Some functions in the Crypto Extensions accelerated implementation of SM4
can be reused by the upcoming accelerated implementations of the GCM/CCM
modes, and the CBC/CFB encryption functions are reused by the optimized
SVESM4 implementation. Export them and declare them in a shared header.
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
arch/arm64/crypto/sm4-ce-glue.c | 5 +++++
arch/arm64/crypto/sm4-ce.h | 16 ++++++++++++++++
2 files changed, 21 insertions(+)
create mode 100644 arch/arm64/crypto/sm4-ce.h
diff --git a/arch/arm64/crypto/sm4-ce-glue.c b/arch/arm64/crypto/sm4-ce-glue.c
index ff2d8442d473..63abcadc684b 100644
--- a/arch/arm64/crypto/sm4-ce-glue.c
+++ b/arch/arm64/crypto/sm4-ce-glue.c
@@ -36,6 +36,11 @@ asmlinkage void sm4_ce_cfb_dec(const u32 *rkey, u8 *dst, const u8 *src,
asmlinkage void sm4_ce_ctr_enc(const u32 *rkey, u8 *dst, const u8 *src,
u8 *iv, unsigned int nblks);
+EXPORT_SYMBOL(sm4_ce_expand_key);
+EXPORT_SYMBOL(sm4_ce_crypt_block);
+EXPORT_SYMBOL(sm4_ce_cbc_enc);
+EXPORT_SYMBOL(sm4_ce_cfb_enc);
+
static int sm4_setkey(struct crypto_skcipher *tfm, const u8 *key,
unsigned int key_len)
{
diff --git a/arch/arm64/crypto/sm4-ce.h b/arch/arm64/crypto/sm4-ce.h
new file mode 100644
index 000000000000..109c21b37590
--- /dev/null
+++ b/arch/arm64/crypto/sm4-ce.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * SM4 common functions for Crypto Extensions
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+void sm4_ce_expand_key(const u8 *key, u32 *rkey_enc, u32 *rkey_dec,
+ const u32 *fk, const u32 *ck);
+
+void sm4_ce_crypt_block(const u32 *rkey, u8 *dst, const u8 *src);
+
+void sm4_ce_cbc_enc(const u32 *rkey_enc, u8 *dst, const u8 *src,
+ u8 *iv, unsigned int nblocks);
+
+void sm4_ce_cfb_enc(const u32 *rkey_enc, u8 *dst, const u8 *src,
+ u8 *iv, unsigned int nblocks);
--
2.24.3 (Apple Git-128)
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH 09/16] crypto: arm64/sm4 - add CE implementation for CTS-CBC mode
2022-09-26 9:36 ` Tianjia Zhang
@ 2022-09-26 9:36 ` Tianjia Zhang
-1 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26 9:36 UTC (permalink / raw)
To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
linux-stm32
This patch adds a CE-optimized assembly implementation of CTS-CBC mode.
Benchmarked on a T-Head Yitian-710 at 2.75 GHz using tcrypt mode 218,
comparing performance before and after this patch (the driver used before
this patch is cts(cbc-sm4-ce)). The columns are block sizes in bytes, and
throughput is in Mb/s:
Before:
cts(cbc-sm4-ce) | 16 64 128 256 1024 1420 4096
----------------+--------------------------------------------------------------
CTS-CBC enc | 286.09 297.17 457.97 627.75 868.58 900.80 957.69
CTS-CBC dec | 286.67 285.63 538.35 947.08 2241.03 2577.32 3391.14
After:
cts-cbc-sm4-ce | 16 64 128 256 1024 1420 4096
----------------+--------------------------------------------------------------
CTS-CBC enc | 288.19 428.80 593.57 741.04 911.73 931.80 950.00
CTS-CBC dec | 292.22 468.99 838.23 1380.76 2741.17 3036.42 3409.62
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
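
Editorial note: the assembly handles the final 17..32 bytes with two 16-byte windows into .Lcts_permute_table and overlapping loads/stores. With Ln = nbytes - 16, the window at offset Ln moves the first Ln source bytes to the end of a vector, and the window at offset 32 - Ln moves the last Ln source bytes to the front. The sketch below models that data flow in user space. toy_enc/toy_dec are a hypothetical invertible stand-in for the block cipher (not SM4), and tbl16, cts_enc and cts_dec are illustrative names, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same 48 bytes as .Lcts_permute_table. */
static const uint8_t cts_table[48] = {
	0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
	0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
	0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7,
	0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf,
	0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
	0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
};

/* Model of TBL (keep = 0: out-of-range index gives 0) and TBX (keep = 1). */
static void tbl16(uint8_t dst[16], const uint8_t src[16],
		  const uint8_t idx[16], int keep)
{
	uint8_t out[16];

	for (int i = 0; i < 16; i++)
		out[i] = idx[i] < 16 ? src[idx[i]] : (keep ? dst[i] : 0);
	memcpy(dst, out, 16);
}

static void xor16(uint8_t a[16], const uint8_t b[16])
{
	for (int i = 0; i < 16; i++)
		a[i] ^= b[i];
}

/* Toy invertible "cipher" used only so the sketch runs; NOT SM4. */
static void toy_enc(uint8_t b[16], const uint8_t k[16])
{
	uint8_t t[16];

	for (int i = 0; i < 16; i++)
		t[i] = (uint8_t)((b[(i + 1) % 16] ^ k[i]) + i);
	memcpy(b, t, 16);
}

static void toy_dec(uint8_t b[16], const uint8_t k[16])
{
	uint8_t t[16];

	for (int i = 0; i < 16; i++)
		t[(i + 1) % 16] = (uint8_t)((uint8_t)(b[i] - i) ^ k[i]);
	memcpy(b, t, 16);
}

/* nbytes must be in 17..32, as guaranteed by the glue code. */
static void cts_enc(uint8_t *dst, const uint8_t *src, const uint8_t iv[16],
		    unsigned int nbytes, const uint8_t key[16])
{
	unsigned int ln = nbytes - 16;
	const uint8_t *v3 = cts_table + ln, *v4 = cts_table + 32 - ln;
	uint8_t riv[16], v0[16], v1[16];

	memcpy(riv, src, 16);
	xor16(riv, iv);
	toy_enc(riv, key);		/* En-1 */
	memcpy(v1, src + ln, 16);	/* overlapping load, ends with Pn */
	tbl16(v0, riv, v3, 0);		/* Cn in the last ln bytes */
	tbl16(v1, v1, v4, 0);		/* Pn padded with zeros */
	xor16(v1, riv);
	toy_enc(v1, key);
	memcpy(dst + ln, v0, 16);	/* overlapping stores */
	memcpy(dst, v1, 16);
}

static void cts_dec(uint8_t *dst, const uint8_t *src, const uint8_t iv[16],
		    unsigned int nbytes, const uint8_t key[16])
{
	unsigned int ln = nbytes - 16;
	const uint8_t *v3 = cts_table + ln, *v4 = cts_table + 32 - ln;
	uint8_t v0[16], v1[16], v2[16];

	memcpy(v0, src, 16);
	memcpy(v1, src + ln, 16);	/* ends with Cn */
	toy_dec(v0, key);		/* Xn */
	tbl16(v2, v0, v3, 0);
	xor16(v2, v1);			/* Pn in the last ln bytes */
	tbl16(v0, v1, v4, 1);		/* TBX: rebuild En-1 */
	toy_dec(v0, key);
	xor16(v0, iv);
	memcpy(dst + ln, v2, 16);
	memcpy(dst, v0, 16);
}
```

All inputs are read into locals before anything is stored, so the sketch also works when dst and src point to the same buffer.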
arch/arm64/crypto/sm4-ce-core.S | 102 ++++++++++++++++++++++++++++++++
arch/arm64/crypto/sm4-ce-glue.c | 94 +++++++++++++++++++++++++++++
2 files changed, 196 insertions(+)
diff --git a/arch/arm64/crypto/sm4-ce-core.S b/arch/arm64/crypto/sm4-ce-core.S
index 9e4b4f01cdf3..414d29f8110b 100644
--- a/arch/arm64/crypto/sm4-ce-core.S
+++ b/arch/arm64/crypto/sm4-ce-core.S
@@ -306,6 +306,100 @@ SYM_FUNC_START(sm4_ce_cbc_dec)
ret
SYM_FUNC_END(sm4_ce_cbc_dec)
+.align 3
+SYM_FUNC_START(sm4_ce_cbc_cts_enc)
+ /* input:
+ * x0: round key array, CTX
+ * x1: dst
+ * x2: src
+ * x3: iv (big endian, 128 bit)
+ * w4: nbytes
+ */
+ SM4_PREPARE(x0)
+
+ sub w5, w4, #16
+ uxtw x5, w5
+
+ ld1 {RIV.16b}, [x3]
+
+ ld1 {v0.16b}, [x2]
+ eor RIV.16b, RIV.16b, v0.16b
+ SM4_CRYPT_BLK(RIV)
+
+ /* load permute table */
+ adr_l x6, .Lcts_permute_table
+ add x7, x6, #32
+ add x6, x6, x5
+ sub x7, x7, x5
+ ld1 {v3.16b}, [x6]
+ ld1 {v4.16b}, [x7]
+
+ /* overlapping loads */
+ add x2, x2, x5
+ ld1 {v1.16b}, [x2]
+
+ /* create Cn from En-1 */
+ tbl v0.16b, {RIV.16b}, v3.16b
+ /* padding Pn with zeros */
+ tbl v1.16b, {v1.16b}, v4.16b
+
+ eor v1.16b, v1.16b, RIV.16b
+ SM4_CRYPT_BLK(v1)
+
+ /* overlapping stores */
+ add x5, x1, x5
+ st1 {v0.16b}, [x5]
+ st1 {v1.16b}, [x1]
+
+ ret
+SYM_FUNC_END(sm4_ce_cbc_cts_enc)
+
+.align 3
+SYM_FUNC_START(sm4_ce_cbc_cts_dec)
+ /* input:
+ * x0: round key array, CTX
+ * x1: dst
+ * x2: src
+ * x3: iv (big endian, 128 bit)
+ * w4: nbytes
+ */
+ SM4_PREPARE(x0)
+
+ sub w5, w4, #16
+ uxtw x5, w5
+
+ ld1 {RIV.16b}, [x3]
+
+ /* load permute table */
+ adr_l x6, .Lcts_permute_table
+ add x7, x6, #32
+ add x6, x6, x5
+ sub x7, x7, x5
+ ld1 {v3.16b}, [x6]
+ ld1 {v4.16b}, [x7]
+
+ /* overlapping loads */
+ ld1 {v0.16b}, [x2], x5
+ ld1 {v1.16b}, [x2]
+
+ SM4_CRYPT_BLK(v0)
+ /* select the first Ln bytes of Xn to create Pn */
+ tbl v2.16b, {v0.16b}, v3.16b
+ eor v2.16b, v2.16b, v1.16b
+
+ /* overwrite the first Ln bytes with Cn to create En-1 */
+ tbx v0.16b, {v1.16b}, v4.16b
+ SM4_CRYPT_BLK(v0)
+ eor v0.16b, v0.16b, RIV.16b
+
+ /* overlapping stores */
+ add x5, x1, x5
+ st1 {v2.16b}, [x5]
+ st1 {v0.16b}, [x1]
+
+ ret
+SYM_FUNC_END(sm4_ce_cbc_cts_dec)
+
.align 3
SYM_FUNC_START(sm4_ce_cfb_enc)
/* input:
@@ -576,3 +670,11 @@ SYM_FUNC_END(sm4_ce_ctr_enc)
.Lbswap128_mask:
.byte 0x0c, 0x0d, 0x0e, 0x0f, 0x08, 0x09, 0x0a, 0x0b
.byte 0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03
+
+.Lcts_permute_table:
+ .byte 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+ .byte 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+ .byte 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7
+ .byte 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf
+ .byte 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+ .byte 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
diff --git a/arch/arm64/crypto/sm4-ce-glue.c b/arch/arm64/crypto/sm4-ce-glue.c
index 63abcadc684b..4d4072c7bfa2 100644
--- a/arch/arm64/crypto/sm4-ce-glue.c
+++ b/arch/arm64/crypto/sm4-ce-glue.c
@@ -16,6 +16,7 @@
#include <asm/simd.h>
#include <crypto/internal/simd.h>
#include <crypto/internal/skcipher.h>
+#include <crypto/scatterwalk.h>
#include <crypto/sm4.h>
#define BYTES2BLKS(nbytes) ((nbytes) >> 4)
@@ -29,6 +30,10 @@ asmlinkage void sm4_ce_cbc_enc(const u32 *rkey, u8 *dst, const u8 *src,
u8 *iv, unsigned int nblocks);
asmlinkage void sm4_ce_cbc_dec(const u32 *rkey, u8 *dst, const u8 *src,
u8 *iv, unsigned int nblocks);
+asmlinkage void sm4_ce_cbc_cts_enc(const u32 *rkey, u8 *dst, const u8 *src,
+ u8 *iv, unsigned int nbytes);
+asmlinkage void sm4_ce_cbc_cts_dec(const u32 *rkey, u8 *dst, const u8 *src,
+ u8 *iv, unsigned int nbytes);
asmlinkage void sm4_ce_cfb_enc(const u32 *rkey, u8 *dst, const u8 *src,
u8 *iv, unsigned int nblks);
asmlinkage void sm4_ce_cfb_dec(const u32 *rkey, u8 *dst, const u8 *src,
@@ -153,6 +158,78 @@ static int sm4_cbc_decrypt(struct skcipher_request *req)
return sm4_cbc_crypt(req, ctx, false);
}
+static int sm4_cbc_cts_crypt(struct skcipher_request *req, bool encrypt)
+{
+ struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+ struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+ struct scatterlist *src = req->src;
+ struct scatterlist *dst = req->dst;
+ struct scatterlist sg_src[2], sg_dst[2];
+ struct skcipher_request subreq;
+ struct skcipher_walk walk;
+ int cbc_blocks;
+ int err;
+
+ if (req->cryptlen < SM4_BLOCK_SIZE)
+ return -EINVAL;
+
+ if (req->cryptlen == SM4_BLOCK_SIZE)
+ return sm4_cbc_crypt(req, ctx, encrypt);
+
+ skcipher_request_set_tfm(&subreq, tfm);
+ skcipher_request_set_callback(&subreq, skcipher_request_flags(req),
+ NULL, NULL);
+
+ /* handle the CBC cryption part */
+ cbc_blocks = DIV_ROUND_UP(req->cryptlen, SM4_BLOCK_SIZE) - 2;
+ if (cbc_blocks) {
+ skcipher_request_set_crypt(&subreq, src, dst,
+ cbc_blocks * SM4_BLOCK_SIZE,
+ req->iv);
+
+ err = sm4_cbc_crypt(&subreq, ctx, encrypt);
+ if (err)
+ return err;
+
+ dst = src = scatterwalk_ffwd(sg_src, src, subreq.cryptlen);
+ if (req->dst != req->src)
+ dst = scatterwalk_ffwd(sg_dst, req->dst,
+ subreq.cryptlen);
+ }
+
+ /* handle ciphertext stealing */
+ skcipher_request_set_crypt(&subreq, src, dst,
+ req->cryptlen - cbc_blocks * SM4_BLOCK_SIZE,
+ req->iv);
+
+ err = skcipher_walk_virt(&walk, &subreq, false);
+ if (err)
+ return err;
+
+ kernel_neon_begin();
+
+ if (encrypt)
+ sm4_ce_cbc_cts_enc(ctx->rkey_enc, walk.dst.virt.addr,
+ walk.src.virt.addr, walk.iv, walk.nbytes);
+ else
+ sm4_ce_cbc_cts_dec(ctx->rkey_dec, walk.dst.virt.addr,
+ walk.src.virt.addr, walk.iv, walk.nbytes);
+
+ kernel_neon_end();
+
+ return skcipher_walk_done(&walk, 0);
+}
+
+static int sm4_cbc_cts_encrypt(struct skcipher_request *req)
+{
+ return sm4_cbc_cts_crypt(req, true);
+}
+
+static int sm4_cbc_cts_decrypt(struct skcipher_request *req)
+{
+ return sm4_cbc_cts_crypt(req, false);
+}
+
static int sm4_cfb_encrypt(struct skcipher_request *req)
{
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
@@ -342,6 +419,22 @@ static struct skcipher_alg sm4_algs[] = {
.setkey = sm4_setkey,
.encrypt = sm4_ctr_crypt,
.decrypt = sm4_ctr_crypt,
+ }, {
+ .base = {
+ .cra_name = "cts(cbc(sm4))",
+ .cra_driver_name = "cts-cbc-sm4-ce",
+ .cra_priority = 400,
+ .cra_blocksize = SM4_BLOCK_SIZE,
+ .cra_ctxsize = sizeof(struct sm4_ctx),
+ .cra_module = THIS_MODULE,
+ },
+ .min_keysize = SM4_KEY_SIZE,
+ .max_keysize = SM4_KEY_SIZE,
+ .ivsize = SM4_BLOCK_SIZE,
+ .walksize = SM4_BLOCK_SIZE * 2,
+ .setkey = sm4_setkey,
+ .encrypt = sm4_cbc_cts_encrypt,
+ .decrypt = sm4_cbc_cts_decrypt,
}
};
@@ -365,5 +458,6 @@ MODULE_ALIAS_CRYPTO("ecb(sm4)");
MODULE_ALIAS_CRYPTO("cbc(sm4)");
MODULE_ALIAS_CRYPTO("cfb(sm4)");
MODULE_ALIAS_CRYPTO("ctr(sm4)");
+MODULE_ALIAS_CRYPTO("cts(cbc(sm4))");
MODULE_AUTHOR("Tianjia Zhang <tianjia.zhang@linux.alibaba.com>");
MODULE_LICENSE("GPL v2");
--
2.24.3 (Apple Git-128)
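
Editorial note on sm4_cbc_cts_crypt() above: the request is split into a plain CBC part of cbc_blocks whole blocks and a ciphertext-stealing tail. With cbc_blocks = DIV_ROUND_UP(cryptlen, 16) - 2, the tail is always between 17 and 32 bytes, which fits the walksize of 2 * SM4_BLOCK_SIZE and the nbytes range the assembly expects. A small user-space sketch of that arithmetic follows; cts_split and cts_split_len are hypothetical names, and the macros are redefined locally rather than taken from kernel headers.

```c
#include <assert.h>

#define BLOCK_SIZE 16u				/* SM4_BLOCK_SIZE in the patch */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

struct cts_split {
	unsigned int cbc_bytes;		/* handled by the plain CBC path */
	unsigned int tail_bytes;	/* handled by the CTS assembly */
};

/*
 * cryptlen must be greater than BLOCK_SIZE; the patch returns -EINVAL
 * for shorter requests and uses plain CBC for exactly one block.
 */
static struct cts_split cts_split_len(unsigned int cryptlen)
{
	unsigned int cbc_blocks = DIV_ROUND_UP(cryptlen, BLOCK_SIZE) - 2;
	struct cts_split s = {
		cbc_blocks * BLOCK_SIZE,
		cryptlen - cbc_blocks * BLOCK_SIZE,
	};

	return s;
}
```

For example, a 1420-byte request splits into 1392 bytes of plain CBC and a 28-byte tail, and a 4096-byte request into 4064 bytes and a 32-byte tail.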
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH 09/16] crypto: arm64/sm4 - add CE implementation for CTS-CBC mode
@ 2022-09-26 9:36 ` Tianjia Zhang
0 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26 9:36 UTC (permalink / raw)
To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
linux-stm32
This patch is a CE-optimized assembly implementation for CTS-CBC mode.
Benchmark on T-Head Yitian-710 2.75 GHz, the data comes from the 218 mode of
tcrypt, and compared the performance before and after this patch (the driver
used before this patch is cts(cbc-sm4-ce)). The abscissas are blocks of
different lengths. The data is tabulated and the unit is Mb/s:
Before:
cts(cbc-sm4-ce) | 16 64 128 256 1024 1420 4096
----------------+--------------------------------------------------------------
CTS-CBC enc | 286.09 297.17 457.97 627.75 868.58 900.80 957.69
CTS-CBC dec | 286.67 285.63 538.35 947.08 2241.03 2577.32 3391.14
After:
cts-cbc-sm4-ce | 16 64 128 256 1024 1420 4096
----------------+--------------------------------------------------------------
CTS-CBC enc | 288.19 428.80 593.57 741.04 911.73 931.80 950.00
CTS-CBC dec | 292.22 468.99 838.23 1380.76 2741.17 3036.42 3409.62
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
arch/arm64/crypto/sm4-ce-core.S | 102 ++++++++++++++++++++++++++++++++
arch/arm64/crypto/sm4-ce-glue.c | 94 +++++++++++++++++++++++++++++
2 files changed, 196 insertions(+)
diff --git a/arch/arm64/crypto/sm4-ce-core.S b/arch/arm64/crypto/sm4-ce-core.S
index 9e4b4f01cdf3..414d29f8110b 100644
--- a/arch/arm64/crypto/sm4-ce-core.S
+++ b/arch/arm64/crypto/sm4-ce-core.S
@@ -306,6 +306,100 @@ SYM_FUNC_START(sm4_ce_cbc_dec)
ret
SYM_FUNC_END(sm4_ce_cbc_dec)
+.align 3
+SYM_FUNC_START(sm4_ce_cbc_cts_enc)
+ /* input:
+ * x0: round key array, CTX
+ * x1: dst
+ * x2: src
+ * x3: iv (big endian, 128 bit)
+ * w4: nbytes
+ */
+ SM4_PREPARE(x0)
+
+ sub w5, w4, #16
+ uxtw x5, w5
+
+ ld1 {RIV.16b}, [x3]
+
+ ld1 {v0.16b}, [x2]
+ eor RIV.16b, RIV.16b, v0.16b
+ SM4_CRYPT_BLK(RIV)
+
+ /* load permute table */
+ adr_l x6, .Lcts_permute_table
+ add x7, x6, #32
+ add x6, x6, x5
+ sub x7, x7, x5
+ ld1 {v3.16b}, [x6]
+ ld1 {v4.16b}, [x7]
+
+ /* overlapping loads */
+ add x2, x2, x5
+ ld1 {v1.16b}, [x2]
+
+ /* create Cn from En-1 */
+ tbl v0.16b, {RIV.16b}, v3.16b
+ /* padding Pn with zeros */
+ tbl v1.16b, {v1.16b}, v4.16b
+
+ eor v1.16b, v1.16b, RIV.16b
+ SM4_CRYPT_BLK(v1)
+
+ /* overlapping stores */
+ add x5, x1, x5
+ st1 {v0.16b}, [x5]
+ st1 {v1.16b}, [x1]
+
+ ret
+SYM_FUNC_END(sm4_ce_cbc_cts_enc)
+
+.align 3
+SYM_FUNC_START(sm4_ce_cbc_cts_dec)
+ /* input:
+ * x0: round key array, CTX
+ * x1: dst
+ * x2: src
+ * x3: iv (big endian, 128 bit)
+ * w4: nbytes
+ */
+ SM4_PREPARE(x0)
+
+ sub w5, w4, #16
+ uxtw x5, w5
+
+ ld1 {RIV.16b}, [x3]
+
+ /* load permute table */
+ adr_l x6, .Lcts_permute_table
+ add x7, x6, #32
+ add x6, x6, x5
+ sub x7, x7, x5
+ ld1 {v3.16b}, [x6]
+ ld1 {v4.16b}, [x7]
+
+ /* overlapping loads */
+ ld1 {v0.16b}, [x2], x5
+ ld1 {v1.16b}, [x2]
+
+ SM4_CRYPT_BLK(v0)
+ /* select the first Ln bytes of Xn to create Pn */
+ tbl v2.16b, {v0.16b}, v3.16b
+ eor v2.16b, v2.16b, v1.16b
+
+ /* overwrite the first Ln bytes with Cn to create En-1 */
+ tbx v0.16b, {v1.16b}, v4.16b
+ SM4_CRYPT_BLK(v0)
+ eor v0.16b, v0.16b, RIV.16b
+
+ /* overlapping stores */
+ add x5, x1, x5
+ st1 {v2.16b}, [x5]
+ st1 {v0.16b}, [x1]
+
+ ret
+SYM_FUNC_END(sm4_ce_cbc_cts_dec)
+
.align 3
SYM_FUNC_START(sm4_ce_cfb_enc)
/* input:
@@ -576,3 +670,11 @@ SYM_FUNC_END(sm4_ce_ctr_enc)
.Lbswap128_mask:
.byte 0x0c, 0x0d, 0x0e, 0x0f, 0x08, 0x09, 0x0a, 0x0b
.byte 0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03
+
+.Lcts_permute_table:
+ .byte 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+ .byte 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+ .byte 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7
+ .byte 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf
+ .byte 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+ .byte 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
diff --git a/arch/arm64/crypto/sm4-ce-glue.c b/arch/arm64/crypto/sm4-ce-glue.c
index 63abcadc684b..4d4072c7bfa2 100644
--- a/arch/arm64/crypto/sm4-ce-glue.c
+++ b/arch/arm64/crypto/sm4-ce-glue.c
@@ -16,6 +16,7 @@
#include <asm/simd.h>
#include <crypto/internal/simd.h>
#include <crypto/internal/skcipher.h>
+#include <crypto/scatterwalk.h>
#include <crypto/sm4.h>
#define BYTES2BLKS(nbytes) ((nbytes) >> 4)
@@ -29,6 +30,10 @@ asmlinkage void sm4_ce_cbc_enc(const u32 *rkey, u8 *dst, const u8 *src,
u8 *iv, unsigned int nblocks);
asmlinkage void sm4_ce_cbc_dec(const u32 *rkey, u8 *dst, const u8 *src,
u8 *iv, unsigned int nblocks);
+asmlinkage void sm4_ce_cbc_cts_enc(const u32 *rkey, u8 *dst, const u8 *src,
+ u8 *iv, unsigned int nbytes);
+asmlinkage void sm4_ce_cbc_cts_dec(const u32 *rkey, u8 *dst, const u8 *src,
+ u8 *iv, unsigned int nbytes);
asmlinkage void sm4_ce_cfb_enc(const u32 *rkey, u8 *dst, const u8 *src,
u8 *iv, unsigned int nblks);
asmlinkage void sm4_ce_cfb_dec(const u32 *rkey, u8 *dst, const u8 *src,
@@ -153,6 +158,78 @@ static int sm4_cbc_decrypt(struct skcipher_request *req)
return sm4_cbc_crypt(req, ctx, false);
}
+static int sm4_cbc_cts_crypt(struct skcipher_request *req, bool encrypt)
+{
+ struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+ struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+ struct scatterlist *src = req->src;
+ struct scatterlist *dst = req->dst;
+ struct scatterlist sg_src[2], sg_dst[2];
+ struct skcipher_request subreq;
+ struct skcipher_walk walk;
+ int cbc_blocks;
+ int err;
+
+ if (req->cryptlen < SM4_BLOCK_SIZE)
+ return -EINVAL;
+
+ if (req->cryptlen == SM4_BLOCK_SIZE)
+ return sm4_cbc_crypt(req, ctx, encrypt);
+
+ skcipher_request_set_tfm(&subreq, tfm);
+ skcipher_request_set_callback(&subreq, skcipher_request_flags(req),
+ NULL, NULL);
+
+ /* handle the CBC cryption part */
+ cbc_blocks = DIV_ROUND_UP(req->cryptlen, SM4_BLOCK_SIZE) - 2;
+ if (cbc_blocks) {
+ skcipher_request_set_crypt(&subreq, src, dst,
+ cbc_blocks * SM4_BLOCK_SIZE,
+ req->iv);
+
+ err = sm4_cbc_crypt(&subreq, ctx, encrypt);
+ if (err)
+ return err;
+
+ dst = src = scatterwalk_ffwd(sg_src, src, subreq.cryptlen);
+ if (req->dst != req->src)
+ dst = scatterwalk_ffwd(sg_dst, req->dst,
+ subreq.cryptlen);
+ }
+
+ /* handle ciphertext stealing */
+ skcipher_request_set_crypt(&subreq, src, dst,
+ req->cryptlen - cbc_blocks * SM4_BLOCK_SIZE,
+ req->iv);
+
+ err = skcipher_walk_virt(&walk, &subreq, false);
+ if (err)
+ return err;
+
+ kernel_neon_begin();
+
+ if (encrypt)
+ sm4_ce_cbc_cts_enc(ctx->rkey_enc, walk.dst.virt.addr,
+ walk.src.virt.addr, walk.iv, walk.nbytes);
+ else
+ sm4_ce_cbc_cts_dec(ctx->rkey_dec, walk.dst.virt.addr,
+ walk.src.virt.addr, walk.iv, walk.nbytes);
+
+ kernel_neon_end();
+
+ return skcipher_walk_done(&walk, 0);
+}
+
+static int sm4_cbc_cts_encrypt(struct skcipher_request *req)
+{
+ return sm4_cbc_cts_crypt(req, true);
+}
+
+static int sm4_cbc_cts_decrypt(struct skcipher_request *req)
+{
+ return sm4_cbc_cts_crypt(req, false);
+}
+
static int sm4_cfb_encrypt(struct skcipher_request *req)
{
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
@@ -342,6 +419,22 @@ static struct skcipher_alg sm4_algs[] = {
.setkey = sm4_setkey,
.encrypt = sm4_ctr_crypt,
.decrypt = sm4_ctr_crypt,
+ }, {
+ .base = {
+ .cra_name = "cts(cbc(sm4))",
+ .cra_driver_name = "cts-cbc-sm4-ce",
+ .cra_priority = 400,
+ .cra_blocksize = SM4_BLOCK_SIZE,
+ .cra_ctxsize = sizeof(struct sm4_ctx),
+ .cra_module = THIS_MODULE,
+ },
+ .min_keysize = SM4_KEY_SIZE,
+ .max_keysize = SM4_KEY_SIZE,
+ .ivsize = SM4_BLOCK_SIZE,
+ .walksize = SM4_BLOCK_SIZE * 2,
+ .setkey = sm4_setkey,
+ .encrypt = sm4_cbc_cts_encrypt,
+ .decrypt = sm4_cbc_cts_decrypt,
}
};
@@ -365,5 +458,6 @@ MODULE_ALIAS_CRYPTO("ecb(sm4)");
MODULE_ALIAS_CRYPTO("cbc(sm4)");
MODULE_ALIAS_CRYPTO("cfb(sm4)");
MODULE_ALIAS_CRYPTO("ctr(sm4)");
+MODULE_ALIAS_CRYPTO("cts(cbc(sm4))");
MODULE_AUTHOR("Tianjia Zhang <tianjia.zhang@linux.alibaba.com>");
MODULE_LICENSE("GPL v2");
--
2.24.3 (Apple Git-128)
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH 10/16] crypto: arm64/sm4 - add CE implementation for XTS mode
2022-09-26 9:36 ` Tianjia Zhang
@ 2022-09-26 9:36 ` Tianjia Zhang
-1 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26 9:36 UTC (permalink / raw)
To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
linux-stm32
This patch adds a CE-optimized assembly implementation of XTS mode.
Benchmarked on a T-Head Yitian-710 at 2.75 GHz using mode 218 of tcrypt,
comparing performance before and after this patch (the driver used before
this patch is xts(ecb-sm4-ce)). The columns are block lengths, and the
throughput is given in Mb/s:
Before:
xts(ecb-sm4-ce) |      16       64      128      256     1024     1420     4096
----------------+--------------------------------------------------------------
XTS enc         |  117.17   430.56   732.92  1134.98  2007.03  2136.23  2347.20
XTS dec         |  116.89   429.02   733.40  1132.96  2006.13  2130.50  2347.92
After:
xts-sm4-ce      |      16       64      128      256     1024     1420     4096
----------------+--------------------------------------------------------------
XTS enc         |  224.68   798.91  1248.08  1714.60  2413.73  2467.84  2612.62
XTS dec         |  229.85   791.34  1237.79  1720.00  2413.30  2473.84  2611.95
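For readers following the assembly, here is a rough C model of the per-block tweak update that the tweak_next macro in this patch appears to perform: XTS advances the tweak by multiplying it by x in GF(2^128), folding the overflow bit back in with the constant 0x87. The names le128_sketch and xts_tweak_next_sketch are hypothetical and not part of the patch, and the sketch assumes the tweak is viewed as two 64-bit halves in little-endian lane order.

```c
#include <stdint.h>

/* Hypothetical illustration type: a 128-bit tweak split into two
 * 64-bit halves, low half first (assumed little-endian lane order). */
struct le128_sketch {
	uint64_t lo;
	uint64_t hi;
};

/* Multiply the tweak by x in GF(2^128): shift the 128-bit value left
 * by one bit and, if a bit falls off the top, reduce with 0x87. */
static struct le128_sketch xts_tweak_next_sketch(struct le128_sketch t)
{
	uint64_t carry_lo = t.lo >> 63;	/* bit moving from lo into hi */
	uint64_t carry_hi = t.hi >> 63;	/* bit shifted out of the tweak */
	struct le128_sketch r;

	r.lo = (t.lo << 1) ^ (carry_hi ? 0x87 : 0);
	r.hi = (t.hi << 1) ^ carry_lo;
	return r;
}
```

In this model, a tweak with only the top bit of hi set wraps around to lo = 0x87, hi = 0, which corresponds to the 0x87 lane loaded into RMASK in the assembly.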
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
arch/arm64/crypto/Kconfig | 4 +-
arch/arm64/crypto/sm4-ce-core.S | 343 ++++++++++++++++++++++++++++++++
arch/arm64/crypto/sm4-ce-glue.c | 159 ++++++++++++++-
3 files changed, 504 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index 4b121dc0cfba..8939f5ae9214 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -231,7 +231,7 @@ config CRYPTO_SM4_ARM64_CE
- NEON (Advanced SIMD) extensions
config CRYPTO_SM4_ARM64_CE_BLK
- tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR (ARMv8 Crypto Extensions)"
+ tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR/XTS (ARMv8 Crypto Extensions)"
depends on KERNEL_MODE_NEON
select CRYPTO_SKCIPHER
select CRYPTO_SM4
@@ -242,6 +242,8 @@ config CRYPTO_SM4_ARM64_CE_BLK
- CBC (Cipher Block Chaining) mode (NIST SP800-38A)
- CFB (Cipher Feedback) mode (NIST SP800-38A)
- CTR (Counter) mode (NIST SP800-38A)
+ - XTS (XOR Encrypt XOR with ciphertext stealing) mode (NIST SP800-38E
+ and IEEE 1619)
Architecture: arm64 using:
- ARMv8 Crypto Extensions
diff --git a/arch/arm64/crypto/sm4-ce-core.S b/arch/arm64/crypto/sm4-ce-core.S
index 414d29f8110b..ddd15ec09d38 100644
--- a/arch/arm64/crypto/sm4-ce-core.S
+++ b/arch/arm64/crypto/sm4-ce-core.S
@@ -35,6 +35,7 @@
#define RTMP3 v19
#define RIV v20
+#define RMASK v21
.align 3
@@ -665,6 +666,348 @@ SYM_FUNC_START(sm4_ce_ctr_enc)
SYM_FUNC_END(sm4_ce_ctr_enc)
+#define tweak_next(vt, vin, RTMP) \
+ sshr RTMP.2d, vin.2d, #63; \
+ and RTMP.16b, RTMP.16b, RMASK.16b; \
+ add vt.2d, vin.2d, vin.2d; \
+ ext RTMP.16b, RTMP.16b, RTMP.16b, #8; \
+ eor vt.16b, vt.16b, RTMP.16b;
+
+.align 3
+SYM_FUNC_START(sm4_ce_xts_enc)
+ /* input:
+ * x0: round key array, CTX
+ * x1: dst
+ * x2: src
+ * x3: tweak (big endian, 128 bit)
+ * w4: nbytes
+ * x5: round key array for IV
+ */
+ ld1 {v8.16b}, [x3]
+
+ cbz x5, .Lxts_enc_nofirst
+
+ SM4_PREPARE(x5)
+
+ /* Generate first tweak */
+ SM4_CRYPT_BLK(v8)
+
+.Lxts_enc_nofirst:
+ SM4_PREPARE(x0)
+
+ ands w5, w4, #15
+ lsr w4, w4, #4
+ sub w6, w4, #1
+ csel w4, w4, w6, eq
+ uxtw x5, w5
+
+ movi RMASK.2s, #0x1
+ movi RTMP0.2s, #0x87
+ uzp1 RMASK.4s, RMASK.4s, RTMP0.4s
+
+ cbz w4, .Lxts_enc_cts
+
+.Lxts_enc_loop_8x:
+ sub w4, w4, #8
+ tbnz w4, #31, .Lxts_enc_4x
+
+ tweak_next( v9, v8, RTMP0)
+ tweak_next(v10, v9, RTMP1)
+ tweak_next(v11, v10, RTMP2)
+ tweak_next(v12, v11, RTMP3)
+ tweak_next(v13, v12, RTMP0)
+ tweak_next(v14, v13, RTMP1)
+ tweak_next(v15, v14, RTMP2)
+
+ ld1 {v0.16b-v3.16b}, [x2], #64
+ ld1 {v4.16b-v7.16b}, [x2], #64
+ eor v0.16b, v0.16b, v8.16b
+ eor v1.16b, v1.16b, v9.16b
+ eor v2.16b, v2.16b, v10.16b
+ eor v3.16b, v3.16b, v11.16b
+ eor v4.16b, v4.16b, v12.16b
+ eor v5.16b, v5.16b, v13.16b
+ eor v6.16b, v6.16b, v14.16b
+ eor v7.16b, v7.16b, v15.16b
+
+ SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7)
+
+ eor v0.16b, v0.16b, v8.16b
+ eor v1.16b, v1.16b, v9.16b
+ eor v2.16b, v2.16b, v10.16b
+ eor v3.16b, v3.16b, v11.16b
+ eor v4.16b, v4.16b, v12.16b
+ eor v5.16b, v5.16b, v13.16b
+ eor v6.16b, v6.16b, v14.16b
+ eor v7.16b, v7.16b, v15.16b
+ st1 {v0.16b-v3.16b}, [x1], #64
+ st1 {v4.16b-v7.16b}, [x1], #64
+
+ tweak_next(v8, v15, RTMP3)
+
+ cbz w4, .Lxts_enc_cts
+ b .Lxts_enc_loop_8x
+
+.Lxts_enc_4x:
+ add w4, w4, #8
+ cmp w4, #4
+ blt .Lxts_enc_loop_1x
+
+ sub w4, w4, #4
+
+ tweak_next( v9, v8, RTMP0)
+ tweak_next(v10, v9, RTMP1)
+ tweak_next(v11, v10, RTMP2)
+
+ ld1 {v0.16b-v3.16b}, [x2], #64
+ eor v0.16b, v0.16b, v8.16b
+ eor v1.16b, v1.16b, v9.16b
+ eor v2.16b, v2.16b, v10.16b
+ eor v3.16b, v3.16b, v11.16b
+
+ SM4_CRYPT_BLK4(v0, v1, v2, v3)
+
+ eor v0.16b, v0.16b, v8.16b
+ eor v1.16b, v1.16b, v9.16b
+ eor v2.16b, v2.16b, v10.16b
+ eor v3.16b, v3.16b, v11.16b
+ st1 {v0.16b-v3.16b}, [x1], #64
+
+ tweak_next(v8, v11, RTMP3)
+
+ cbz w4, .Lxts_enc_cts
+
+.Lxts_enc_loop_1x:
+ sub w4, w4, #1
+
+ ld1 {v0.16b}, [x2], #16
+ eor v0.16b, v0.16b, v8.16b
+
+ SM4_CRYPT_BLK(v0)
+
+ eor v0.16b, v0.16b, v8.16b
+ st1 {v0.16b}, [x1], #16
+
+ tweak_next(v8, v8, RTMP0)
+
+ cbnz w4, .Lxts_enc_loop_1x
+
+.Lxts_enc_cts:
+ cbz x5, .Lxts_enc_end
+
+ /* cipher text stealing */
+
+ tweak_next(v9, v8, RTMP0)
+ ld1 {v0.16b}, [x2]
+ eor v0.16b, v0.16b, v8.16b
+ SM4_CRYPT_BLK(v0)
+ eor v0.16b, v0.16b, v8.16b
+
+ /* load permute table */
+ adr_l x6, .Lcts_permute_table
+ add x7, x6, #32
+ add x6, x6, x5
+ sub x7, x7, x5
+ ld1 {v3.16b}, [x6]
+ ld1 {v4.16b}, [x7]
+
+ /* overlapping loads */
+ add x2, x2, x5
+ ld1 {v1.16b}, [x2]
+
+ /* create Cn from En-1 */
+ tbl v2.16b, {v0.16b}, v3.16b
+ /* padding Pn with En-1 at the end */
+ tbx v0.16b, {v1.16b}, v4.16b
+
+ eor v0.16b, v0.16b, v9.16b
+ SM4_CRYPT_BLK(v0)
+ eor v0.16b, v0.16b, v9.16b
+
+
+ /* overlapping stores */
+ add x5, x1, x5
+ st1 {v2.16b}, [x5]
+ st1 {v0.16b}, [x1]
+
+ b .Lxts_enc_ret
+
+.Lxts_enc_end:
+ /* store new tweak */
+ st1 {v8.16b}, [x3]
+
+.Lxts_enc_ret:
+ ret
+SYM_FUNC_END(sm4_ce_xts_enc)
+
+.align 3
+SYM_FUNC_START(sm4_ce_xts_dec)
+ /* input:
+ * x0: round key array, CTX
+ * x1: dst
+ * x2: src
+ * x3: tweak (big endian, 128 bit)
+ * w4: nbytes
+ * x5: round key array for IV
+ */
+ ld1 {v8.16b}, [x3]
+
+ cbz x5, .Lxts_dec_nofirst
+
+ SM4_PREPARE(x5)
+
+ /* Generate first tweak */
+ SM4_CRYPT_BLK(v8)
+
+.Lxts_dec_nofirst:
+ SM4_PREPARE(x0)
+
+ ands w5, w4, #15
+ lsr w4, w4, #4
+ sub w6, w4, #1
+ csel w4, w4, w6, eq
+ uxtw x5, w5
+
+ movi RMASK.2s, #0x1
+ movi RTMP0.2s, #0x87
+ uzp1 RMASK.4s, RMASK.4s, RTMP0.4s
+
+ cbz w4, .Lxts_dec_cts
+
+.Lxts_dec_loop_8x:
+ sub w4, w4, #8
+ tbnz w4, #31, .Lxts_dec_4x
+
+ tweak_next( v9, v8, RTMP0)
+ tweak_next(v10, v9, RTMP1)
+ tweak_next(v11, v10, RTMP2)
+ tweak_next(v12, v11, RTMP3)
+ tweak_next(v13, v12, RTMP0)
+ tweak_next(v14, v13, RTMP1)
+ tweak_next(v15, v14, RTMP2)
+
+ ld1 {v0.16b-v3.16b}, [x2], #64
+ ld1 {v4.16b-v7.16b}, [x2], #64
+ eor v0.16b, v0.16b, v8.16b
+ eor v1.16b, v1.16b, v9.16b
+ eor v2.16b, v2.16b, v10.16b
+ eor v3.16b, v3.16b, v11.16b
+ eor v4.16b, v4.16b, v12.16b
+ eor v5.16b, v5.16b, v13.16b
+ eor v6.16b, v6.16b, v14.16b
+ eor v7.16b, v7.16b, v15.16b
+
+ SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7)
+
+ eor v0.16b, v0.16b, v8.16b
+ eor v1.16b, v1.16b, v9.16b
+ eor v2.16b, v2.16b, v10.16b
+ eor v3.16b, v3.16b, v11.16b
+ eor v4.16b, v4.16b, v12.16b
+ eor v5.16b, v5.16b, v13.16b
+ eor v6.16b, v6.16b, v14.16b
+ eor v7.16b, v7.16b, v15.16b
+ st1 {v0.16b-v3.16b}, [x1], #64
+ st1 {v4.16b-v7.16b}, [x1], #64
+
+ tweak_next(v8, v15, RTMP3)
+
+ cbz w4, .Lxts_dec_cts
+ b .Lxts_dec_loop_8x
+
+.Lxts_dec_4x:
+ add w4, w4, #8
+ cmp w4, #4
+ blt .Lxts_dec_loop_1x
+
+ sub w4, w4, #4
+
+ tweak_next( v9, v8, RTMP0)
+ tweak_next(v10, v9, RTMP1)
+ tweak_next(v11, v10, RTMP2)
+
+ ld1 {v0.16b-v3.16b}, [x2], #64
+ eor v0.16b, v0.16b, v8.16b
+ eor v1.16b, v1.16b, v9.16b
+ eor v2.16b, v2.16b, v10.16b
+ eor v3.16b, v3.16b, v11.16b
+
+ SM4_CRYPT_BLK4(v0, v1, v2, v3)
+
+ eor v0.16b, v0.16b, v8.16b
+ eor v1.16b, v1.16b, v9.16b
+ eor v2.16b, v2.16b, v10.16b
+ eor v3.16b, v3.16b, v11.16b
+ st1 {v0.16b-v3.16b}, [x1], #64
+
+ tweak_next(v8, v11, RTMP3)
+
+ cbz w4, .Lxts_dec_cts
+
+.Lxts_dec_loop_1x:
+ sub w4, w4, #1
+
+ ld1 {v0.16b}, [x2], #16
+ eor v0.16b, v0.16b, v8.16b
+
+ SM4_CRYPT_BLK(v0)
+
+ eor v0.16b, v0.16b, v8.16b
+ st1 {v0.16b}, [x1], #16
+
+ tweak_next(v8, v8, RTMP0)
+
+ cbnz w4, .Lxts_dec_loop_1x
+
+.Lxts_dec_cts:
+ cbz x5, .Lxts_dec_end
+
+ /* cipher text stealing */
+
+ tweak_next(v9, v8, RTMP0)
+ ld1 {v0.16b}, [x2]
+ eor v0.16b, v0.16b, v9.16b
+ SM4_CRYPT_BLK(v0)
+ eor v0.16b, v0.16b, v9.16b
+
+ /* load permute table */
+ adr_l x6, .Lcts_permute_table
+ add x7, x6, #32
+ add x6, x6, x5
+ sub x7, x7, x5
+ ld1 {v3.16b}, [x6]
+ ld1 {v4.16b}, [x7]
+
+ /* overlapping loads */
+ add x2, x2, x5
+ ld1 {v1.16b}, [x2]
+
+ /* create Cn from En-1 */
+ tbl v2.16b, {v0.16b}, v3.16b
+ /* padding Pn with En-1 at the end */
+ tbx v0.16b, {v1.16b}, v4.16b
+
+ eor v0.16b, v0.16b, v8.16b
+ SM4_CRYPT_BLK(v0)
+ eor v0.16b, v0.16b, v8.16b
+
+
+ /* overlapping stores */
+ add x5, x1, x5
+ st1 {v2.16b}, [x5]
+ st1 {v0.16b}, [x1]
+
+ b .Lxts_dec_ret
+
+.Lxts_dec_end:
+ /* store new tweak */
+ st1 {v8.16b}, [x3]
+
+.Lxts_dec_ret:
+ ret
+SYM_FUNC_END(sm4_ce_xts_dec)
+
+
.section ".rodata", "a"
.align 4
.Lbswap128_mask:
diff --git a/arch/arm64/crypto/sm4-ce-glue.c b/arch/arm64/crypto/sm4-ce-glue.c
index 4d4072c7bfa2..8222766f712a 100644
--- a/arch/arm64/crypto/sm4-ce-glue.c
+++ b/arch/arm64/crypto/sm4-ce-glue.c
@@ -17,6 +17,7 @@
#include <crypto/internal/simd.h>
#include <crypto/internal/skcipher.h>
#include <crypto/scatterwalk.h>
+#include <crypto/xts.h>
#include <crypto/sm4.h>
#define BYTES2BLKS(nbytes) ((nbytes) >> 4)
@@ -40,12 +41,23 @@ asmlinkage void sm4_ce_cfb_dec(const u32 *rkey, u8 *dst, const u8 *src,
u8 *iv, unsigned int nblks);
asmlinkage void sm4_ce_ctr_enc(const u32 *rkey, u8 *dst, const u8 *src,
u8 *iv, unsigned int nblks);
+asmlinkage void sm4_ce_xts_enc(const u32 *rkey1, u8 *dst, const u8 *src,
+ u8 *tweak, unsigned int nbytes,
+ const u32 *rkey2_enc);
+asmlinkage void sm4_ce_xts_dec(const u32 *rkey1, u8 *dst, const u8 *src,
+ u8 *tweak, unsigned int nbytes,
+ const u32 *rkey2_enc);
EXPORT_SYMBOL(sm4_ce_expand_key);
EXPORT_SYMBOL(sm4_ce_crypt_block);
EXPORT_SYMBOL(sm4_ce_cbc_enc);
EXPORT_SYMBOL(sm4_ce_cfb_enc);
+struct sm4_xts_ctx {
+ struct sm4_ctx key1;
+ struct sm4_ctx key2;
+};
+
static int sm4_setkey(struct crypto_skcipher *tfm, const u8 *key,
unsigned int key_len)
{
@@ -61,6 +73,29 @@ static int sm4_setkey(struct crypto_skcipher *tfm, const u8 *key,
return 0;
}
+static int sm4_xts_setkey(struct crypto_skcipher *tfm, const u8 *key,
+ unsigned int key_len)
+{
+ struct sm4_xts_ctx *ctx = crypto_skcipher_ctx(tfm);
+ int ret;
+
+ if (key_len != SM4_KEY_SIZE * 2)
+ return -EINVAL;
+
+ ret = xts_verify_key(tfm, key, key_len);
+ if (ret)
+ return ret;
+
+ kernel_neon_begin();
+ sm4_ce_expand_key(key, ctx->key1.rkey_enc,
+ ctx->key1.rkey_dec, crypto_sm4_fk, crypto_sm4_ck);
+ sm4_ce_expand_key(&key[SM4_KEY_SIZE], ctx->key2.rkey_enc,
+ ctx->key2.rkey_dec, crypto_sm4_fk, crypto_sm4_ck);
+ kernel_neon_end();
+
+ return 0;
+}
+
static int sm4_ecb_do_crypt(struct skcipher_request *req, const u32 *rkey)
{
struct skcipher_walk walk;
@@ -357,6 +392,111 @@ static int sm4_ctr_crypt(struct skcipher_request *req)
return err;
}
+static int sm4_xts_crypt(struct skcipher_request *req, bool encrypt)
+{
+ struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+ struct sm4_xts_ctx *ctx = crypto_skcipher_ctx(tfm);
+ int tail = req->cryptlen % SM4_BLOCK_SIZE;
+ const u32 *rkey2_enc = ctx->key2.rkey_enc;
+ struct scatterlist sg_src[2], sg_dst[2];
+ struct skcipher_request subreq;
+ struct scatterlist *src, *dst;
+ struct skcipher_walk walk;
+ unsigned int nbytes;
+ int err;
+
+ if (req->cryptlen < SM4_BLOCK_SIZE)
+ return -EINVAL;
+
+ err = skcipher_walk_virt(&walk, req, false);
+ if (err)
+ return err;
+
+ if (unlikely(tail > 0 && walk.nbytes < walk.total)) {
+ int nblocks = DIV_ROUND_UP(req->cryptlen, SM4_BLOCK_SIZE) - 2;
+
+ skcipher_walk_abort(&walk);
+
+ skcipher_request_set_tfm(&subreq, tfm);
+ skcipher_request_set_callback(&subreq,
+ skcipher_request_flags(req),
+ NULL, NULL);
+ skcipher_request_set_crypt(&subreq, req->src, req->dst,
+ nblocks * SM4_BLOCK_SIZE, req->iv);
+
+ err = skcipher_walk_virt(&walk, &subreq, false);
+ if (err)
+ return err;
+ } else {
+ tail = 0;
+ }
+
+ while ((nbytes = walk.nbytes) >= SM4_BLOCK_SIZE) {
+ if (nbytes < walk.total)
+ nbytes &= ~(SM4_BLOCK_SIZE - 1);
+
+ kernel_neon_begin();
+
+ if (encrypt)
+ sm4_ce_xts_enc(ctx->key1.rkey_enc, walk.dst.virt.addr,
+ walk.src.virt.addr, walk.iv, nbytes,
+ rkey2_enc);
+ else
+ sm4_ce_xts_dec(ctx->key1.rkey_dec, walk.dst.virt.addr,
+ walk.src.virt.addr, walk.iv, nbytes,
+ rkey2_enc);
+
+ kernel_neon_end();
+
+ rkey2_enc = NULL;
+
+ err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
+ if (err)
+ return err;
+ }
+
+ if (likely(tail == 0))
+ return 0;
+
+ /* handle ciphertext stealing */
+
+ dst = src = scatterwalk_ffwd(sg_src, req->src, subreq.cryptlen);
+ if (req->dst != req->src)
+ dst = scatterwalk_ffwd(sg_dst, req->dst, subreq.cryptlen);
+
+ skcipher_request_set_crypt(&subreq, src, dst, SM4_BLOCK_SIZE + tail,
+ req->iv);
+
+ err = skcipher_walk_virt(&walk, &subreq, false);
+ if (err)
+ return err;
+
+ kernel_neon_begin();
+
+ if (encrypt)
+ sm4_ce_xts_enc(ctx->key1.rkey_enc, walk.dst.virt.addr,
+ walk.src.virt.addr, walk.iv, walk.nbytes,
+ rkey2_enc);
+ else
+ sm4_ce_xts_dec(ctx->key1.rkey_dec, walk.dst.virt.addr,
+ walk.src.virt.addr, walk.iv, walk.nbytes,
+ rkey2_enc);
+
+ kernel_neon_end();
+
+ return skcipher_walk_done(&walk, 0);
+}
+
+static int sm4_xts_encrypt(struct skcipher_request *req)
+{
+ return sm4_xts_crypt(req, true);
+}
+
+static int sm4_xts_decrypt(struct skcipher_request *req)
+{
+ return sm4_xts_crypt(req, false);
+}
+
static struct skcipher_alg sm4_algs[] = {
{
.base = {
@@ -435,6 +575,22 @@ static struct skcipher_alg sm4_algs[] = {
.setkey = sm4_setkey,
.encrypt = sm4_cbc_cts_encrypt,
.decrypt = sm4_cbc_cts_decrypt,
+ }, {
+ .base = {
+ .cra_name = "xts(sm4)",
+ .cra_driver_name = "xts-sm4-ce",
+ .cra_priority = 400,
+ .cra_blocksize = SM4_BLOCK_SIZE,
+ .cra_ctxsize = sizeof(struct sm4_xts_ctx),
+ .cra_module = THIS_MODULE,
+ },
+ .min_keysize = SM4_KEY_SIZE * 2,
+ .max_keysize = SM4_KEY_SIZE * 2,
+ .ivsize = SM4_BLOCK_SIZE,
+ .walksize = SM4_BLOCK_SIZE * 2,
+ .setkey = sm4_xts_setkey,
+ .encrypt = sm4_xts_encrypt,
+ .decrypt = sm4_xts_decrypt,
}
};
@@ -451,7 +607,7 @@ static void __exit sm4_exit(void)
module_cpu_feature_match(SM4, sm4_init);
module_exit(sm4_exit);
-MODULE_DESCRIPTION("SM4 ECB/CBC/CFB/CTR using ARMv8 Crypto Extensions");
+MODULE_DESCRIPTION("SM4 ECB/CBC/CFB/CTR/XTS using ARMv8 Crypto Extensions");
MODULE_ALIAS_CRYPTO("sm4-ce");
MODULE_ALIAS_CRYPTO("sm4");
MODULE_ALIAS_CRYPTO("ecb(sm4)");
@@ -459,5 +615,6 @@ MODULE_ALIAS_CRYPTO("cbc(sm4)");
MODULE_ALIAS_CRYPTO("cfb(sm4)");
MODULE_ALIAS_CRYPTO("ctr(sm4)");
MODULE_ALIAS_CRYPTO("cts(cbc(sm4))");
+MODULE_ALIAS_CRYPTO("xts(sm4)");
MODULE_AUTHOR("Tianjia Zhang <tianjia.zhang@linux.alibaba.com>");
MODULE_LICENSE("GPL v2");
--
2.24.3 (Apple Git-128)
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH 10/16] crypto: arm64/sm4 - add CE implementation for XTS mode
@ 2022-09-26 9:36 ` Tianjia Zhang
0 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26 9:36 UTC (permalink / raw)
To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
linux-stm32
This patch is a CE-optimized assembly implementation for XTS mode.
Benchmark on T-Head Yitian-710 2.75 GHz, the data comes from the 218 mode of
tcrypt, and compared the performance before and after this patch (the driver
used before this patch is xts(ecb-sm4-ce)). The abscissas are blocks of
different lengths. The data is tabulated and the unit is Mb/s:
Before:
xts(ecb-sm4-ce) | 16 64 128 256 1024 1420 4096
----------------+--------------------------------------------------------------
XTS enc | 117.17 430.56 732.92 1134.98 2007.03 2136.23 2347.20
XTS dec | 116.89 429.02 733.40 1132.96 2006.13 2130.50 2347.92
After:
xts-sm4-ce | 16 64 128 256 1024 1420 4096
----------------+--------------------------------------------------------------
XTS enc | 224.68 798.91 1248.08 1714.60 2413.73 2467.84 2612.62
XTS dec | 229.85 791.34 1237.79 1720.00 2413.30 2473.84 2611.95
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
arch/arm64/crypto/Kconfig | 4 +-
arch/arm64/crypto/sm4-ce-core.S | 343 ++++++++++++++++++++++++++++++++
arch/arm64/crypto/sm4-ce-glue.c | 159 ++++++++++++++-
3 files changed, 504 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index 4b121dc0cfba..8939f5ae9214 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -231,7 +231,7 @@ config CRYPTO_SM4_ARM64_CE
- NEON (Advanced SIMD) extensions
config CRYPTO_SM4_ARM64_CE_BLK
- tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR (ARMv8 Crypto Extensions)"
+ tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR/XTS (ARMv8 Crypto Extensions)"
depends on KERNEL_MODE_NEON
select CRYPTO_SKCIPHER
select CRYPTO_SM4
@@ -242,6 +242,8 @@ config CRYPTO_SM4_ARM64_CE_BLK
- CBC (Cipher Block Chaining) mode (NIST SP800-38A)
- CFB (Cipher Feedback) mode (NIST SP800-38A)
- CTR (Counter) mode (NIST SP800-38A)
+ - XTS (XOR Encrypt XOR with ciphertext stealing) mode (NIST SP800-38E
+ and IEEE 1619)
Architecture: arm64 using:
- ARMv8 Crypto Extensions
diff --git a/arch/arm64/crypto/sm4-ce-core.S b/arch/arm64/crypto/sm4-ce-core.S
index 414d29f8110b..ddd15ec09d38 100644
--- a/arch/arm64/crypto/sm4-ce-core.S
+++ b/arch/arm64/crypto/sm4-ce-core.S
@@ -35,6 +35,7 @@
#define RTMP3 v19
#define RIV v20
+#define RMASK v21
.align 3
@@ -665,6 +666,348 @@ SYM_FUNC_START(sm4_ce_ctr_enc)
SYM_FUNC_END(sm4_ce_ctr_enc)
+#define tweak_next(vt, vin, RTMP) \
+ sshr RTMP.2d, vin.2d, #63; \
+ and RTMP.16b, RTMP.16b, RMASK.16b; \
+ add vt.2d, vin.2d, vin.2d; \
+ ext RTMP.16b, RTMP.16b, RTMP.16b, #8; \
+ eor vt.16b, vt.16b, RTMP.16b;
+
+.align 3
+SYM_FUNC_START(sm4_ce_xts_enc)
+ /* input:
+ * x0: round key array, CTX
+ * x1: dst
+ * x2: src
+ * x3: tweak (big endian, 128 bit)
+ * w4: nbytes
+ * x5: round key array for IV
+ */
+ ld1 {v8.16b}, [x3]
+
+ cbz x5, .Lxts_enc_nofirst
+
+ SM4_PREPARE(x5)
+
+ /* Generate first tweak */
+ SM4_CRYPT_BLK(v8)
+
+.Lxts_enc_nofirst:
+ SM4_PREPARE(x0)
+
+ ands w5, w4, #15
+ lsr w4, w4, #4
+ sub w6, w4, #1
+ csel w4, w4, w6, eq
+ uxtw x5, w5
+
+ movi RMASK.2s, #0x1
+ movi RTMP0.2s, #0x87
+ uzp1 RMASK.4s, RMASK.4s, RTMP0.4s
+
+ cbz w4, .Lxts_enc_cts
+
+.Lxts_enc_loop_8x:
+ sub w4, w4, #8
+ tbnz w4, #31, .Lxts_enc_4x
+
+ tweak_next( v9, v8, RTMP0)
+ tweak_next(v10, v9, RTMP1)
+ tweak_next(v11, v10, RTMP2)
+ tweak_next(v12, v11, RTMP3)
+ tweak_next(v13, v12, RTMP0)
+ tweak_next(v14, v13, RTMP1)
+ tweak_next(v15, v14, RTMP2)
+
+ ld1 {v0.16b-v3.16b}, [x2], #64
+ ld1 {v4.16b-v7.16b}, [x2], #64
+ eor v0.16b, v0.16b, v8.16b
+ eor v1.16b, v1.16b, v9.16b
+ eor v2.16b, v2.16b, v10.16b
+ eor v3.16b, v3.16b, v11.16b
+ eor v4.16b, v4.16b, v12.16b
+ eor v5.16b, v5.16b, v13.16b
+ eor v6.16b, v6.16b, v14.16b
+ eor v7.16b, v7.16b, v15.16b
+
+ SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7)
+
+ eor v0.16b, v0.16b, v8.16b
+ eor v1.16b, v1.16b, v9.16b
+ eor v2.16b, v2.16b, v10.16b
+ eor v3.16b, v3.16b, v11.16b
+ eor v4.16b, v4.16b, v12.16b
+ eor v5.16b, v5.16b, v13.16b
+ eor v6.16b, v6.16b, v14.16b
+ eor v7.16b, v7.16b, v15.16b
+ st1 {v0.16b-v3.16b}, [x1], #64
+ st1 {v4.16b-v7.16b}, [x1], #64
+
+ tweak_next(v8, v15, RTMP3)
+
+ cbz w4, .Lxts_enc_cts
+ b .Lxts_enc_loop_8x
+
+.Lxts_enc_4x:
+ add w4, w4, #8
+ cmp w4, #4
+ blt .Lxts_enc_loop_1x
+
+ sub w4, w4, #4
+
+ tweak_next( v9, v8, RTMP0)
+ tweak_next(v10, v9, RTMP1)
+ tweak_next(v11, v10, RTMP2)
+
+ ld1 {v0.16b-v3.16b}, [x2], #64
+ eor v0.16b, v0.16b, v8.16b
+ eor v1.16b, v1.16b, v9.16b
+ eor v2.16b, v2.16b, v10.16b
+ eor v3.16b, v3.16b, v11.16b
+
+ SM4_CRYPT_BLK4(v0, v1, v2, v3)
+
+ eor v0.16b, v0.16b, v8.16b
+ eor v1.16b, v1.16b, v9.16b
+ eor v2.16b, v2.16b, v10.16b
+ eor v3.16b, v3.16b, v11.16b
+ st1 {v0.16b-v3.16b}, [x1], #64
+
+ tweak_next(v8, v11, RTMP3)
+
+ cbz w4, .Lxts_enc_cts
+
+.Lxts_enc_loop_1x:
+ sub w4, w4, #1
+
+ ld1 {v0.16b}, [x2], #16
+ eor v0.16b, v0.16b, v8.16b
+
+ SM4_CRYPT_BLK(v0)
+
+ eor v0.16b, v0.16b, v8.16b
+ st1 {v0.16b}, [x1], #16
+
+ tweak_next(v8, v8, RTMP0)
+
+ cbnz w4, .Lxts_enc_loop_1x
+
+.Lxts_enc_cts:
+ cbz x5, .Lxts_enc_end
+
+ /* cipher text stealing */
+
+ tweak_next(v9, v8, RTMP0)
+ ld1 {v0.16b}, [x2]
+ eor v0.16b, v0.16b, v8.16b
+ SM4_CRYPT_BLK(v0)
+ eor v0.16b, v0.16b, v8.16b
+
+ /* load permute table */
+ adr_l x6, .Lcts_permute_table
+ add x7, x6, #32
+ add x6, x6, x5
+ sub x7, x7, x5
+ ld1 {v3.16b}, [x6]
+ ld1 {v4.16b}, [x7]
+
+ /* overlapping loads */
+ add x2, x2, x5
+ ld1 {v1.16b}, [x2]
+
+ /* create Cn from En-1 */
+ tbl v2.16b, {v0.16b}, v3.16b
+ /* padding Pn with En-1 at the end */
+ tbx v0.16b, {v1.16b}, v4.16b
+
+ eor v0.16b, v0.16b, v9.16b
+ SM4_CRYPT_BLK(v0)
+ eor v0.16b, v0.16b, v9.16b
+
+
+ /* overlapping stores */
+ add x5, x1, x5
+ st1 {v2.16b}, [x5]
+ st1 {v0.16b}, [x1]
+
+ b .Lxts_enc_ret
+
+.Lxts_enc_end:
+ /* store new tweak */
+ st1 {v8.16b}, [x3]
+
+.Lxts_enc_ret:
+ ret
+SYM_FUNC_END(sm4_ce_xts_enc)
+
+.align 3
+SYM_FUNC_START(sm4_ce_xts_dec)
+ /* input:
+ * x0: round key array, CTX
+ * x1: dst
+ * x2: src
+ * x3: tweak (big endian, 128 bit)
+ * w4: nbytes
+ * x5: round key array for IV
+ */
+ ld1 {v8.16b}, [x3]
+
+ cbz x5, .Lxts_dec_nofirst
+
+ SM4_PREPARE(x5)
+
+ /* Generate first tweak */
+ SM4_CRYPT_BLK(v8)
+
+.Lxts_dec_nofirst:
+ SM4_PREPARE(x0)
+
+ ands w5, w4, #15
+ lsr w4, w4, #4
+ sub w6, w4, #1
+ csel w4, w4, w6, eq
+ uxtw x5, w5
+
+ movi RMASK.2s, #0x1
+ movi RTMP0.2s, #0x87
+ uzp1 RMASK.4s, RMASK.4s, RTMP0.4s
+
+ cbz w4, .Lxts_dec_cts
+
+.Lxts_dec_loop_8x:
+ sub w4, w4, #8
+ tbnz w4, #31, .Lxts_dec_4x
+
+ tweak_next( v9, v8, RTMP0)
+ tweak_next(v10, v9, RTMP1)
+ tweak_next(v11, v10, RTMP2)
+ tweak_next(v12, v11, RTMP3)
+ tweak_next(v13, v12, RTMP0)
+ tweak_next(v14, v13, RTMP1)
+ tweak_next(v15, v14, RTMP2)
+
+ ld1 {v0.16b-v3.16b}, [x2], #64
+ ld1 {v4.16b-v7.16b}, [x2], #64
+ eor v0.16b, v0.16b, v8.16b
+ eor v1.16b, v1.16b, v9.16b
+ eor v2.16b, v2.16b, v10.16b
+ eor v3.16b, v3.16b, v11.16b
+ eor v4.16b, v4.16b, v12.16b
+ eor v5.16b, v5.16b, v13.16b
+ eor v6.16b, v6.16b, v14.16b
+ eor v7.16b, v7.16b, v15.16b
+
+ SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7)
+
+ eor v0.16b, v0.16b, v8.16b
+ eor v1.16b, v1.16b, v9.16b
+ eor v2.16b, v2.16b, v10.16b
+ eor v3.16b, v3.16b, v11.16b
+ eor v4.16b, v4.16b, v12.16b
+ eor v5.16b, v5.16b, v13.16b
+ eor v6.16b, v6.16b, v14.16b
+ eor v7.16b, v7.16b, v15.16b
+ st1 {v0.16b-v3.16b}, [x1], #64
+ st1 {v4.16b-v7.16b}, [x1], #64
+
+ tweak_next(v8, v15, RTMP3)
+
+ cbz w4, .Lxts_dec_cts
+ b .Lxts_dec_loop_8x
+
+.Lxts_dec_4x:
+ add w4, w4, #8
+ cmp w4, #4
+ blt .Lxts_dec_loop_1x
+
+ sub w4, w4, #4
+
+ tweak_next( v9, v8, RTMP0)
+ tweak_next(v10, v9, RTMP1)
+ tweak_next(v11, v10, RTMP2)
+
+ ld1 {v0.16b-v3.16b}, [x2], #64
+ eor v0.16b, v0.16b, v8.16b
+ eor v1.16b, v1.16b, v9.16b
+ eor v2.16b, v2.16b, v10.16b
+ eor v3.16b, v3.16b, v11.16b
+
+ SM4_CRYPT_BLK4(v0, v1, v2, v3)
+
+ eor v0.16b, v0.16b, v8.16b
+ eor v1.16b, v1.16b, v9.16b
+ eor v2.16b, v2.16b, v10.16b
+ eor v3.16b, v3.16b, v11.16b
+ st1 {v0.16b-v3.16b}, [x1], #64
+
+ tweak_next(v8, v11, RTMP3)
+
+ cbz w4, .Lxts_dec_cts
+
+.Lxts_dec_loop_1x:
+ sub w4, w4, #1
+
+ ld1 {v0.16b}, [x2], #16
+ eor v0.16b, v0.16b, v8.16b
+
+ SM4_CRYPT_BLK(v0)
+
+ eor v0.16b, v0.16b, v8.16b
+ st1 {v0.16b}, [x1], #16
+
+ tweak_next(v8, v8, RTMP0)
+
+ cbnz w4, .Lxts_dec_loop_1x
+
+.Lxts_dec_cts:
+ cbz x5, .Lxts_dec_end
+
+ /* cipher text stealing */
+
+ tweak_next(v9, v8, RTMP0)
+ ld1 {v0.16b}, [x2]
+ eor v0.16b, v0.16b, v9.16b
+ SM4_CRYPT_BLK(v0)
+ eor v0.16b, v0.16b, v9.16b
+
+ /* load permute table */
+ adr_l x6, .Lcts_permute_table
+ add x7, x6, #32
+ add x6, x6, x5
+ sub x7, x7, x5
+ ld1 {v3.16b}, [x6]
+ ld1 {v4.16b}, [x7]
+
+ /* overlapping loads */
+ add x2, x2, x5
+ ld1 {v1.16b}, [x2]
+
+ /* create Cn from En-1 */
+ tbl v2.16b, {v0.16b}, v3.16b
+ /* padding Pn with En-1 at the end */
+ tbx v0.16b, {v1.16b}, v4.16b
+
+ eor v0.16b, v0.16b, v8.16b
+ SM4_CRYPT_BLK(v0)
+ eor v0.16b, v0.16b, v8.16b
+
+
+ /* overlapping stores */
+ add x5, x1, x5
+ st1 {v2.16b}, [x5]
+ st1 {v0.16b}, [x1]
+
+ b .Lxts_dec_ret
+
+.Lxts_dec_end:
+ /* store new tweak */
+ st1 {v8.16b}, [x3]
+
+.Lxts_dec_ret:
+ ret
+SYM_FUNC_END(sm4_ce_xts_dec)
+
+
.section ".rodata", "a"
.align 4
.Lbswap128_mask:
diff --git a/arch/arm64/crypto/sm4-ce-glue.c b/arch/arm64/crypto/sm4-ce-glue.c
index 4d4072c7bfa2..8222766f712a 100644
--- a/arch/arm64/crypto/sm4-ce-glue.c
+++ b/arch/arm64/crypto/sm4-ce-glue.c
@@ -17,6 +17,7 @@
#include <crypto/internal/simd.h>
#include <crypto/internal/skcipher.h>
#include <crypto/scatterwalk.h>
+#include <crypto/xts.h>
#include <crypto/sm4.h>
#define BYTES2BLKS(nbytes) ((nbytes) >> 4)
@@ -40,12 +41,23 @@ asmlinkage void sm4_ce_cfb_dec(const u32 *rkey, u8 *dst, const u8 *src,
u8 *iv, unsigned int nblks);
asmlinkage void sm4_ce_ctr_enc(const u32 *rkey, u8 *dst, const u8 *src,
u8 *iv, unsigned int nblks);
+asmlinkage void sm4_ce_xts_enc(const u32 *rkey1, u8 *dst, const u8 *src,
+ u8 *tweak, unsigned int nbytes,
+ const u32 *rkey2_enc);
+asmlinkage void sm4_ce_xts_dec(const u32 *rkey1, u8 *dst, const u8 *src,
+ u8 *tweak, unsigned int nbytes,
+ const u32 *rkey2_enc);
EXPORT_SYMBOL(sm4_ce_expand_key);
EXPORT_SYMBOL(sm4_ce_crypt_block);
EXPORT_SYMBOL(sm4_ce_cbc_enc);
EXPORT_SYMBOL(sm4_ce_cfb_enc);
+struct sm4_xts_ctx {
+ struct sm4_ctx key1;
+ struct sm4_ctx key2;
+};
+
static int sm4_setkey(struct crypto_skcipher *tfm, const u8 *key,
unsigned int key_len)
{
@@ -61,6 +73,29 @@ static int sm4_setkey(struct crypto_skcipher *tfm, const u8 *key,
return 0;
}
+static int sm4_xts_setkey(struct crypto_skcipher *tfm, const u8 *key,
+ unsigned int key_len)
+{
+ struct sm4_xts_ctx *ctx = crypto_skcipher_ctx(tfm);
+ int ret;
+
+ if (key_len != SM4_KEY_SIZE * 2)
+ return -EINVAL;
+
+ ret = xts_verify_key(tfm, key, key_len);
+ if (ret)
+ return ret;
+
+ kernel_neon_begin();
+ sm4_ce_expand_key(key, ctx->key1.rkey_enc,
+ ctx->key1.rkey_dec, crypto_sm4_fk, crypto_sm4_ck);
+ sm4_ce_expand_key(&key[SM4_KEY_SIZE], ctx->key2.rkey_enc,
+ ctx->key2.rkey_dec, crypto_sm4_fk, crypto_sm4_ck);
+ kernel_neon_end();
+
+ return 0;
+}
+
static int sm4_ecb_do_crypt(struct skcipher_request *req, const u32 *rkey)
{
struct skcipher_walk walk;
@@ -357,6 +392,111 @@ static int sm4_ctr_crypt(struct skcipher_request *req)
return err;
}
+static int sm4_xts_crypt(struct skcipher_request *req, bool encrypt)
+{
+ struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+ struct sm4_xts_ctx *ctx = crypto_skcipher_ctx(tfm);
+ int tail = req->cryptlen % SM4_BLOCK_SIZE;
+ const u32 *rkey2_enc = ctx->key2.rkey_enc;
+ struct scatterlist sg_src[2], sg_dst[2];
+ struct skcipher_request subreq;
+ struct scatterlist *src, *dst;
+ struct skcipher_walk walk;
+ unsigned int nbytes;
+ int err;
+
+ if (req->cryptlen < SM4_BLOCK_SIZE)
+ return -EINVAL;
+
+ err = skcipher_walk_virt(&walk, req, false);
+ if (err)
+ return err;
+
+ if (unlikely(tail > 0 && walk.nbytes < walk.total)) {
+ int nblocks = DIV_ROUND_UP(req->cryptlen, SM4_BLOCK_SIZE) - 2;
+
+ skcipher_walk_abort(&walk);
+
+ skcipher_request_set_tfm(&subreq, tfm);
+ skcipher_request_set_callback(&subreq,
+ skcipher_request_flags(req),
+ NULL, NULL);
+ skcipher_request_set_crypt(&subreq, req->src, req->dst,
+ nblocks * SM4_BLOCK_SIZE, req->iv);
+
+ err = skcipher_walk_virt(&walk, &subreq, false);
+ if (err)
+ return err;
+ } else {
+ tail = 0;
+ }
+
+ while ((nbytes = walk.nbytes) >= SM4_BLOCK_SIZE) {
+ if (nbytes < walk.total)
+ nbytes &= ~(SM4_BLOCK_SIZE - 1);
+
+ kernel_neon_begin();
+
+ if (encrypt)
+ sm4_ce_xts_enc(ctx->key1.rkey_enc, walk.dst.virt.addr,
+ walk.src.virt.addr, walk.iv, nbytes,
+ rkey2_enc);
+ else
+ sm4_ce_xts_dec(ctx->key1.rkey_dec, walk.dst.virt.addr,
+ walk.src.virt.addr, walk.iv, nbytes,
+ rkey2_enc);
+
+ kernel_neon_end();
+
+ rkey2_enc = NULL;
+
+ err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
+ if (err)
+ return err;
+ }
+
+ if (likely(tail == 0))
+ return 0;
+
+ /* handle ciphertext stealing */
+
+ dst = src = scatterwalk_ffwd(sg_src, req->src, subreq.cryptlen);
+ if (req->dst != req->src)
+ dst = scatterwalk_ffwd(sg_dst, req->dst, subreq.cryptlen);
+
+ skcipher_request_set_crypt(&subreq, src, dst, SM4_BLOCK_SIZE + tail,
+ req->iv);
+
+ err = skcipher_walk_virt(&walk, &subreq, false);
+ if (err)
+ return err;
+
+ kernel_neon_begin();
+
+ if (encrypt)
+ sm4_ce_xts_enc(ctx->key1.rkey_enc, walk.dst.virt.addr,
+ walk.src.virt.addr, walk.iv, walk.nbytes,
+ rkey2_enc);
+ else
+ sm4_ce_xts_dec(ctx->key1.rkey_dec, walk.dst.virt.addr,
+ walk.src.virt.addr, walk.iv, walk.nbytes,
+ rkey2_enc);
+
+ kernel_neon_end();
+
+ return skcipher_walk_done(&walk, 0);
+}
+
+static int sm4_xts_encrypt(struct skcipher_request *req)
+{
+ return sm4_xts_crypt(req, true);
+}
+
+static int sm4_xts_decrypt(struct skcipher_request *req)
+{
+ return sm4_xts_crypt(req, false);
+}
+
static struct skcipher_alg sm4_algs[] = {
{
.base = {
@@ -435,6 +575,22 @@ static struct skcipher_alg sm4_algs[] = {
.setkey = sm4_setkey,
.encrypt = sm4_cbc_cts_encrypt,
.decrypt = sm4_cbc_cts_decrypt,
+ }, {
+ .base = {
+ .cra_name = "xts(sm4)",
+ .cra_driver_name = "xts-sm4-ce",
+ .cra_priority = 400,
+ .cra_blocksize = SM4_BLOCK_SIZE,
+ .cra_ctxsize = sizeof(struct sm4_xts_ctx),
+ .cra_module = THIS_MODULE,
+ },
+ .min_keysize = SM4_KEY_SIZE * 2,
+ .max_keysize = SM4_KEY_SIZE * 2,
+ .ivsize = SM4_BLOCK_SIZE,
+ .walksize = SM4_BLOCK_SIZE * 2,
+ .setkey = sm4_xts_setkey,
+ .encrypt = sm4_xts_encrypt,
+ .decrypt = sm4_xts_decrypt,
}
};
@@ -451,7 +607,7 @@ static void __exit sm4_exit(void)
module_cpu_feature_match(SM4, sm4_init);
module_exit(sm4_exit);
-MODULE_DESCRIPTION("SM4 ECB/CBC/CFB/CTR using ARMv8 Crypto Extensions");
+MODULE_DESCRIPTION("SM4 ECB/CBC/CFB/CTR/XTS using ARMv8 Crypto Extensions");
MODULE_ALIAS_CRYPTO("sm4-ce");
MODULE_ALIAS_CRYPTO("sm4");
MODULE_ALIAS_CRYPTO("ecb(sm4)");
@@ -459,5 +615,6 @@ MODULE_ALIAS_CRYPTO("cbc(sm4)");
MODULE_ALIAS_CRYPTO("cfb(sm4)");
MODULE_ALIAS_CRYPTO("ctr(sm4)");
MODULE_ALIAS_CRYPTO("cts(cbc(sm4))");
+MODULE_ALIAS_CRYPTO("xts(sm4)");
MODULE_AUTHOR("Tianjia Zhang <tianjia.zhang@linux.alibaba.com>");
MODULE_LICENSE("GPL v2");
--
2.24.3 (Apple Git-128)
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH 11/16] crypto: essiv - allow digestsize to be greater than keysize
2022-09-26 9:36 ` Tianjia Zhang
@ 2022-09-26 9:36 ` Tianjia Zhang
-1 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26 9:36 UTC (permalink / raw)
To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
linux-stm32
In ESSIV mode, the digest produced by the hash algorithm is used as the
key to encrypt the IV. The current implementation requires the digest
size to fall within the key size range accepted by the cipher, which
excludes combinations such as essiv(cbc(sm4),sm3): the SM3 digest is
fixed at 256 bits, while the SM4 key size is fixed at 128 bits, so ESSIV
mode cannot be used with this pair.
This patch allows algorithms whose digest size is greater than the key
size to use ESSIV mode by truncating the digest.
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
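A reading aid, not part of the patch: the truncation described above
amounts to rejecting only digests that are too short and then using the
smaller of the digest size and the cipher's maximum key size. A minimal
userspace sketch, assuming two hypothetical helpers
(essiv_digest_usable() and essiv_salt_len(), neither of which exists in
the kernel) and the SM3/SM4 sizes quoted in the commit message:

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative constants matching the sizes quoted in the commit message. */
#define SM3_DIGEST_BYTES 32	/* 256-bit SM3 digest */
#define SM4_KEY_BYTES    16	/* 128-bit SM4 key */

/*
 * Hypothetical helper mirroring the relaxed check: only a digest that is
 * shorter than the cipher's minimum key size is rejected.
 */
static bool essiv_digest_usable(size_t digest_size, size_t min_key_size)
{
	return digest_size >= min_key_size;
}

/*
 * Hypothetical helper: number of leading digest bytes that end up being
 * used as the IV-encryption key (the rest of the digest is dropped).
 */
static size_t essiv_salt_len(size_t digest_size, size_t max_key_size)
{
	return digest_size < max_key_size ? digest_size : max_key_size;
}
```

With SM3 and SM4 this yields a 16-byte key taken from the front of the
32-byte digest.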
crypto/essiv.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/crypto/essiv.c b/crypto/essiv.c
index e33369df9034..6ee5a61bcae4 100644
--- a/crypto/essiv.c
+++ b/crypto/essiv.c
@@ -68,6 +68,7 @@ static int essiv_skcipher_setkey(struct crypto_skcipher *tfm,
{
struct essiv_tfm_ctx *tctx = crypto_skcipher_ctx(tfm);
u8 salt[HASH_MAX_DIGESTSIZE];
+ unsigned int saltlen;
int err;
crypto_skcipher_clear_flags(tctx->u.skcipher, CRYPTO_TFM_REQ_MASK);
@@ -86,8 +87,11 @@ static int essiv_skcipher_setkey(struct crypto_skcipher *tfm,
crypto_cipher_set_flags(tctx->essiv_cipher,
crypto_skcipher_get_flags(tfm) &
CRYPTO_TFM_REQ_MASK);
- return crypto_cipher_setkey(tctx->essiv_cipher, salt,
- crypto_shash_digestsize(tctx->hash));
+
+ saltlen = min(crypto_shash_digestsize(tctx->hash),
+ crypto_skcipher_max_keysize(tctx->u.skcipher));
+
+ return crypto_cipher_setkey(tctx->essiv_cipher, salt, saltlen);
}
static int essiv_aead_setkey(struct crypto_aead *tfm, const u8 *key,
@@ -418,8 +422,7 @@ static bool essiv_supported_algorithms(const char *essiv_cipher_name,
if (IS_ERR(alg))
return false;
- if (hash_alg->digestsize < alg->cra_cipher.cia_min_keysize ||
- hash_alg->digestsize > alg->cra_cipher.cia_max_keysize)
+ if (hash_alg->digestsize < alg->cra_cipher.cia_min_keysize)
goto out;
if (ivsize != alg->cra_blocksize)
--
2.24.3 (Apple Git-128)
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH 12/16] crypto: arm64/sm4 - add CE implementation for ESSIV mode
2022-09-26 9:36 ` Tianjia Zhang
@ 2022-09-26 9:36 ` Tianjia Zhang
-1 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26 9:36 UTC (permalink / raw)
To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
linux-stm32
This patch adds a CE-optimized assembly implementation of ESSIV mode.
The assembly encrypts the IV with the second key and then reuses the
existing CBC code paths.
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
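A reading aid, not part of the patch: the data flow is "encrypt the IV
under a second key, then run ordinary CBC under the first key". The
sketch below shows only that structure in userspace C. It assumes a toy
byte-wise transform (toy_encrypt()/toy_decrypt(), hypothetical and not
secure) in place of SM4, and takes both keys as given rather than
deriving the second one from a hash of the first.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLK 16

/* Toy stand-in for a block cipher: NOT SM4 and not secure, merely invertible. */
static void toy_encrypt(const uint8_t key[BLK], uint8_t blk[BLK])
{
	for (size_t i = 0; i < BLK; i++)
		blk[i] = (uint8_t)((blk[i] ^ key[i]) + 0x5a);
}

static void toy_decrypt(const uint8_t key[BLK], uint8_t blk[BLK])
{
	for (size_t i = 0; i < BLK; i++) {
		uint8_t t = (uint8_t)(blk[i] - 0x5a);

		blk[i] = t ^ key[i];
	}
}

/* Encrypt in place: the IV is encrypted under key2, then CBC runs under key1. */
static void essiv_cbc_encrypt(const uint8_t key1[BLK], const uint8_t key2[BLK],
			      const uint8_t iv[BLK], uint8_t *data, size_t nblocks)
{
	uint8_t chain[BLK];

	memcpy(chain, iv, BLK);
	toy_encrypt(key2, chain);		/* effective IV */

	for (size_t n = 0; n < nblocks; n++, data += BLK) {
		for (size_t i = 0; i < BLK; i++)
			data[i] ^= chain[i];
		toy_encrypt(key1, data);
		memcpy(chain, data, BLK);	/* ciphertext chains into the next block */
	}
}

/* Decrypt in place: the IV is still *encrypted* under key2; only the data path is inverted. */
static void essiv_cbc_decrypt(const uint8_t key1[BLK], const uint8_t key2[BLK],
			      const uint8_t iv[BLK], uint8_t *data, size_t nblocks)
{
	uint8_t chain[BLK], next[BLK];

	memcpy(chain, iv, BLK);
	toy_encrypt(key2, chain);

	for (size_t n = 0; n < nblocks; n++, data += BLK) {
		memcpy(next, data, BLK);	/* keep the ciphertext before overwriting it */
		toy_decrypt(key1, data);
		for (size_t i = 0; i < BLK; i++)
			data[i] ^= chain[i];
		memcpy(chain, next, BLK);
	}
}
```

This matches the glue code's choice of passing the second key's
encryption round keys in both directions.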
arch/arm64/crypto/sm4-ce-core.S | 42 +++++++++++
arch/arm64/crypto/sm4-ce-glue.c | 128 ++++++++++++++++++++++++++++++++
2 files changed, 170 insertions(+)
diff --git a/arch/arm64/crypto/sm4-ce-core.S b/arch/arm64/crypto/sm4-ce-core.S
index ddd15ec09d38..6b923c3209a0 100644
--- a/arch/arm64/crypto/sm4-ce-core.S
+++ b/arch/arm64/crypto/sm4-ce-core.S
@@ -154,6 +154,26 @@ SYM_FUNC_START(sm4_ce_crypt)
ret;
SYM_FUNC_END(sm4_ce_crypt)
+.align 3
+SYM_FUNC_START(sm4_ce_essiv_cbc_enc)
+ /* input:
+ * x0: round key array, CTX
+ * x1: dst
+ * x2: src
+ * x3: iv (big endian, 128 bit)
+ * w4: nblocks
+ * x5: round key array for IV
+ */
+ ld1 {RIV.16b}, [x3]
+
+ SM4_PREPARE(x5)
+
+ SM4_CRYPT_BLK(RIV)
+
+ SM4_PREPARE(x0)
+
+ b .Lcbc_enc_loop_4x
+
.align 3
SYM_FUNC_START(sm4_ce_cbc_enc)
/* input:
@@ -208,6 +228,27 @@ SYM_FUNC_START(sm4_ce_cbc_enc)
ret
SYM_FUNC_END(sm4_ce_cbc_enc)
+SYM_FUNC_END(sm4_ce_essiv_cbc_enc)
+
+.align 3
+SYM_FUNC_START(sm4_ce_essiv_cbc_dec)
+ /* input:
+ * x0: round key array, CTX
+ * x1: dst
+ * x2: src
+ * x3: iv (big endian, 128 bit)
+ * w4: nblocks
+ * x5: round key array for IV
+ */
+ ld1 {RIV.16b}, [x3]
+
+ SM4_PREPARE(x5)
+
+ SM4_CRYPT_BLK(RIV)
+
+ SM4_PREPARE(x0)
+
+ b .Lcbc_dec_loop_8x
.align 3
SYM_FUNC_START(sm4_ce_cbc_dec)
@@ -306,6 +347,7 @@ SYM_FUNC_START(sm4_ce_cbc_dec)
ret
SYM_FUNC_END(sm4_ce_cbc_dec)
+SYM_FUNC_END(sm4_ce_essiv_cbc_dec)
.align 3
SYM_FUNC_START(sm4_ce_cbc_cts_enc)
diff --git a/arch/arm64/crypto/sm4-ce-glue.c b/arch/arm64/crypto/sm4-ce-glue.c
index 8222766f712a..6267ec1cfac0 100644
--- a/arch/arm64/crypto/sm4-ce-glue.c
+++ b/arch/arm64/crypto/sm4-ce-glue.c
@@ -19,6 +19,8 @@
#include <crypto/scatterwalk.h>
#include <crypto/xts.h>
#include <crypto/sm4.h>
+#include <crypto/sm3.h>
+#include <crypto/hash.h>
#define BYTES2BLKS(nbytes) ((nbytes) >> 4)
@@ -35,6 +37,12 @@ asmlinkage void sm4_ce_cbc_cts_enc(const u32 *rkey, u8 *dst, const u8 *src,
u8 *iv, unsigned int nbytes);
asmlinkage void sm4_ce_cbc_cts_dec(const u32 *rkey, u8 *dst, const u8 *src,
u8 *iv, unsigned int nbytes);
+asmlinkage void sm4_ce_essiv_cbc_enc(const u32 *rkey1, u8 *dst, const u8 *src,
+ u8 *iv, unsigned int nblocks,
+ const u32 *rkey2_enc);
+asmlinkage void sm4_ce_essiv_cbc_dec(const u32 *rkey1, u8 *dst, const u8 *src,
+ u8 *iv, unsigned int nblocks,
+ const u32 *rkey2_enc);
asmlinkage void sm4_ce_cfb_enc(const u32 *rkey, u8 *dst, const u8 *src,
u8 *iv, unsigned int nblks);
asmlinkage void sm4_ce_cfb_dec(const u32 *rkey, u8 *dst, const u8 *src,
@@ -58,6 +66,12 @@ struct sm4_xts_ctx {
struct sm4_ctx key2;
};
+struct sm4_essiv_cbc_ctx {
+ struct sm4_ctx key1;
+ struct sm4_ctx key2;
+ struct crypto_shash *hash;
+};
+
static int sm4_setkey(struct crypto_skcipher *tfm, const u8 *key,
unsigned int key_len)
{
@@ -96,6 +110,27 @@ static int sm4_xts_setkey(struct crypto_skcipher *tfm, const u8 *key,
return 0;
}
+static int sm4_essiv_cbc_setkey(struct crypto_skcipher *tfm, const u8 *key,
+ unsigned int key_len)
+{
+ struct sm4_essiv_cbc_ctx *ctx = crypto_skcipher_ctx(tfm);
+ u8 __aligned(8) digest[SM3_DIGEST_SIZE];
+
+ if (key_len != SM4_KEY_SIZE)
+ return -EINVAL;
+
+ crypto_shash_tfm_digest(ctx->hash, key, key_len, digest);
+
+ kernel_neon_begin();
+ sm4_ce_expand_key(key, ctx->key1.rkey_enc,
+ ctx->key1.rkey_dec, crypto_sm4_fk, crypto_sm4_ck);
+ sm4_ce_expand_key(digest, ctx->key2.rkey_enc,
+ ctx->key2.rkey_dec, crypto_sm4_fk, crypto_sm4_ck);
+ kernel_neon_end();
+
+ return 0;
+}
+
static int sm4_ecb_do_crypt(struct skcipher_request *req, const u32 *rkey)
{
struct skcipher_walk walk;
@@ -497,6 +532,81 @@ static int sm4_xts_decrypt(struct skcipher_request *req)
return sm4_xts_crypt(req, false);
}
+static int sm4_essiv_cbc_init_tfm(struct crypto_skcipher *tfm)
+{
+ struct sm4_essiv_cbc_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+ ctx->hash = crypto_alloc_shash("sm3", 0, 0);
+
+ return PTR_ERR_OR_ZERO(ctx->hash);
+}
+
+static void sm4_essiv_cbc_exit_tfm(struct crypto_skcipher *tfm)
+{
+ struct sm4_essiv_cbc_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+ crypto_free_shash(ctx->hash);
+}
+
+static int sm4_essiv_cbc_crypt(struct skcipher_request *req, bool encrypt)
+{
+ struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+ struct sm4_essiv_cbc_ctx *ctx = crypto_skcipher_ctx(tfm);
+ struct skcipher_walk walk;
+ unsigned int nblocks;
+ int err;
+
+ err = skcipher_walk_virt(&walk, req, false);
+
+ if ((nblocks = walk.nbytes / SM4_BLOCK_SIZE) > 0) {
+ kernel_neon_begin();
+
+ if (encrypt)
+ sm4_ce_essiv_cbc_enc(ctx->key1.rkey_enc,
+ walk.dst.virt.addr,
+ walk.src.virt.addr, walk.iv,
+ nblocks, ctx->key2.rkey_enc);
+ else
+ sm4_ce_essiv_cbc_dec(ctx->key1.rkey_dec,
+ walk.dst.virt.addr,
+ walk.src.virt.addr, walk.iv,
+ nblocks, ctx->key2.rkey_enc);
+
+ kernel_neon_end();
+
+ err = skcipher_walk_done(&walk, walk.nbytes % SM4_BLOCK_SIZE);
+ if (err)
+ return err;
+ }
+
+ while ((nblocks = walk.nbytes / SM4_BLOCK_SIZE) > 0) {
+ kernel_neon_begin();
+
+ if (encrypt)
+ sm4_ce_cbc_enc(ctx->key1.rkey_enc, walk.dst.virt.addr,
+ walk.src.virt.addr, walk.iv, nblocks);
+ else
+ sm4_ce_cbc_dec(ctx->key1.rkey_dec, walk.dst.virt.addr,
+ walk.src.virt.addr, walk.iv, nblocks);
+
+ kernel_neon_end();
+
+ err = skcipher_walk_done(&walk, walk.nbytes % SM4_BLOCK_SIZE);
+ }
+
+ return err;
+}
+
+static int sm4_essiv_cbc_encrypt(struct skcipher_request *req)
+{
+ return sm4_essiv_cbc_crypt(req, true);
+}
+
+static int sm4_essiv_cbc_decrypt(struct skcipher_request *req)
+{
+ return sm4_essiv_cbc_crypt(req, false);
+}
+
static struct skcipher_alg sm4_algs[] = {
{
.base = {
@@ -591,6 +701,23 @@ static struct skcipher_alg sm4_algs[] = {
.setkey = sm4_xts_setkey,
.encrypt = sm4_xts_encrypt,
.decrypt = sm4_xts_decrypt,
+ }, {
+ .base = {
+ .cra_name = "essiv(cbc(sm4),sm3)",
+ .cra_driver_name = "essiv-cbc-sm4-sm3-ce",
+ .cra_priority = 400 + 1,
+ .cra_blocksize = SM4_BLOCK_SIZE,
+ .cra_ctxsize = sizeof(struct sm4_essiv_cbc_ctx),
+ .cra_module = THIS_MODULE,
+ },
+ .min_keysize = SM4_KEY_SIZE,
+ .max_keysize = SM4_KEY_SIZE,
+ .ivsize = SM4_BLOCK_SIZE,
+ .setkey = sm4_essiv_cbc_setkey,
+ .encrypt = sm4_essiv_cbc_encrypt,
+ .decrypt = sm4_essiv_cbc_decrypt,
+ .init = sm4_essiv_cbc_init_tfm,
+ .exit = sm4_essiv_cbc_exit_tfm,
}
};
@@ -616,5 +743,6 @@ MODULE_ALIAS_CRYPTO("cfb(sm4)");
MODULE_ALIAS_CRYPTO("ctr(sm4)");
MODULE_ALIAS_CRYPTO("cts(cbc(sm4))");
MODULE_ALIAS_CRYPTO("xts(sm4)");
+MODULE_ALIAS_CRYPTO("essiv(cbc(sm4),sm3)");
MODULE_AUTHOR("Tianjia Zhang <tianjia.zhang@linux.alibaba.com>");
MODULE_LICENSE("GPL v2");
--
2.24.3 (Apple Git-128)
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH 13/16] crypto: arm64/sm4 - add CE implementation for cmac/xcbc/cbcmac
2022-09-26 9:36 ` Tianjia Zhang
@ 2022-09-26 9:36 ` Tianjia Zhang
-1 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26 9:36 UTC (permalink / raw)
To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
linux-stm32
This patch adds a CE-optimized assembly implementation of cmac/xcbc/cbcmac.
Benchmarked on a T-Head Yitian-710 at 2.75 GHz using mode 300 of tcrypt,
comparing performance before and after this patch (the driver used
before this patch is XXXmac(sm4-ce)). The columns are update blocks of
different lengths, and the unit is Mb/s:
Before:
update-size    |      16      64     256    1024    2048    4096    8192
---------------+--------------------------------------------------------
cmac(sm4-ce)   |  293.33  403.69  503.76  527.78  531.10  535.46  535.81
xcbc(sm4-ce)   |  292.83  402.50  504.02  529.08  529.87  536.55  538.24
cbcmac(sm4-ce) |  318.42  415.79  497.12  515.05  523.15  521.19  523.01
After:
update-size    |      16      64     256    1024    2048    4096    8192
---------------+--------------------------------------------------------
cmac-sm4-ce    |  371.99  675.28  903.56  971.65  980.57  990.40  991.04
xcbc-sm4-ce    |  372.11  674.55  903.47  971.61  980.96  990.42  991.10
cbcmac-sm4-ce  |  371.63  675.33  903.23  972.07  981.42  990.93  991.45
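A reading aid, not part of the patch: all three MACs share one update
path that XORs input into the running digest and defers encrypting a
completed block until more input arrives, so the finalization step still
gets a chance to treat the last block specially. The userspace sketch
below illustrates that deferral only. It assumes a toy transform
(toy_encrypt(), hypothetical and not secure) in place of SM4, and its
function names are illustrative rather than the driver's.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLK 16

/* Toy stand-in for the block cipher: NOT SM4 and not secure. */
static void toy_encrypt(const uint8_t key[BLK], uint8_t blk[BLK])
{
	for (size_t i = 0; i < BLK; i++)
		blk[i] = (uint8_t)((blk[i] ^ key[i]) + 0x5a);
}

struct mac_state {
	uint8_t digest[BLK];	/* running value with pending input XORed in */
	size_t len;		/* bytes XORed in since the last encryption, 0..BLK */
};

static void mac_init(struct mac_state *st)
{
	memset(st, 0, sizeof(*st));
}

static void mac_update(const uint8_t key[BLK], struct mac_state *st,
		       const uint8_t *p, size_t n)
{
	while (n) {
		size_t l;

		/* A full block is pending and more input follows: encrypt it now. */
		if (st->len == BLK) {
			toy_encrypt(key, st->digest);
			st->len = 0;
		}

		l = BLK - st->len;
		if (l > n)
			l = n;
		for (size_t i = 0; i < l; i++)
			st->digest[st->len + i] ^= p[i];
		st->len += l;
		p += l;
		n -= l;
	}
}

/* CBC-MAC style finish: encrypt whatever is pending (implicitly zero-padded). */
static void mac_final(const uint8_t key[BLK], struct mac_state *st,
		      uint8_t out[BLK])
{
	if (st->len)
		toy_encrypt(key, st->digest);
	memcpy(out, st->digest, BLK);
}
```

Because the last full block stays unencrypted until finalization, a
CMAC-style finish could XOR a subkey into the digest before the final
encryption instead of calling mac_final() as written here.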
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
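A reading aid, not part of the patch: the CMAC setkey path derives two
constants by encrypting the zero block and doubling the result twice in
GF(2^128). The sketch below isolates the doubling step in userspace C.
It assumes a hypothetical struct u128 with host-order halves (the driver
itself works on be128 with explicit byte-order conversion) and a
hypothetical helper name gf128_double().

```c
#include <stdint.h>

/* A 128-bit value held as two host-order 64-bit halves (a = high, b = low). */
struct u128 {
	uint64_t a;
	uint64_t b;
};

/*
 * Multiply by x in GF(2^128): shift left by one bit and, if a bit fell
 * off the top, fold it back in with the reduction constant 0x87.
 */
static struct u128 gf128_double(struct u128 v)
{
	struct u128 r;

	r.a = (v.a << 1) | (v.b >> 63);
	r.b = (v.b << 1) ^ ((v.a >> 63) ? 0x87 : 0);
	return r;
}
```

Calling gf128_double() once on the encrypted zero block gives the first
constant, and calling it again on that result gives the second. The
doubling is independent of the cipher, so it can be checked against the
published AES-CMAC subkey example in RFC 4493 even though the driver
uses SM4.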
arch/arm64/crypto/sm4-ce-core.S | 70 +++++++++
arch/arm64/crypto/sm4-ce-glue.c | 267 +++++++++++++++++++++++++++++++-
2 files changed, 336 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/crypto/sm4-ce-core.S b/arch/arm64/crypto/sm4-ce-core.S
index 6b923c3209a0..69fe3b90b7ad 100644
--- a/arch/arm64/crypto/sm4-ce-core.S
+++ b/arch/arm64/crypto/sm4-ce-core.S
@@ -35,6 +35,7 @@
#define RTMP3 v19
#define RIV v20
+#define RMAC v20
#define RMASK v21
@@ -1049,6 +1050,75 @@ SYM_FUNC_START(sm4_ce_xts_dec)
ret
SYM_FUNC_END(sm4_ce_xts_dec)
+.align 3
+SYM_FUNC_START(sm4_ce_mac_update)
+ /* input:
+ * x0: round key array, CTX
+ * x1: digest
+ * x2: src
+ * w3: nblocks
+ * w4: enc_before
+ * w5: enc_after
+ */
+ SM4_PREPARE(x0)
+
+ ld1 {RMAC.16b}, [x1]
+
+ cbz w4, .Lmac_update
+
+ SM4_CRYPT_BLK(RMAC)
+
+.Lmac_update:
+ cbz w3, .Lmac_ret
+
+ sub w6, w3, #1
+ cmp w5, wzr
+ csel w3, w3, w6, ne
+
+ cbz w3, .Lmac_end
+
+.Lmac_loop_4x:
+ cmp w3, #4
+ blt .Lmac_loop_1x
+
+ sub w3, w3, #4
+
+ ld1 {v0.16b-v3.16b}, [x2], #64
+
+ eor RMAC.16b, RMAC.16b, v0.16b
+ SM4_CRYPT_BLK(RMAC)
+ eor RMAC.16b, RMAC.16b, v1.16b
+ SM4_CRYPT_BLK(RMAC)
+ eor RMAC.16b, RMAC.16b, v2.16b
+ SM4_CRYPT_BLK(RMAC)
+ eor RMAC.16b, RMAC.16b, v3.16b
+ SM4_CRYPT_BLK(RMAC)
+
+ cbz w3, .Lmac_end
+ b .Lmac_loop_4x
+
+.Lmac_loop_1x:
+ sub w3, w3, #1
+
+ ld1 {v0.16b}, [x2], #16
+
+ eor RMAC.16b, RMAC.16b, v0.16b
+ SM4_CRYPT_BLK(RMAC)
+
+ cbnz w3, .Lmac_loop_1x
+
+
+.Lmac_end:
+ cbnz w5, .Lmac_ret
+
+ ld1 {v0.16b}, [x2], #16
+ eor RMAC.16b, RMAC.16b, v0.16b
+
+.Lmac_ret:
+ st1 {RMAC.16b}, [x1]
+ ret
+SYM_FUNC_END(sm4_ce_mac_update)
+
.section ".rodata", "a"
.align 4
diff --git a/arch/arm64/crypto/sm4-ce-glue.c b/arch/arm64/crypto/sm4-ce-glue.c
index 6267ec1cfac0..c2d10b8e92b2 100644
--- a/arch/arm64/crypto/sm4-ce-glue.c
+++ b/arch/arm64/crypto/sm4-ce-glue.c
@@ -14,8 +14,10 @@
#include <linux/cpufeature.h>
#include <asm/neon.h>
#include <asm/simd.h>
+#include <crypto/b128ops.h>
#include <crypto/internal/simd.h>
#include <crypto/internal/skcipher.h>
+#include <crypto/internal/hash.h>
#include <crypto/scatterwalk.h>
#include <crypto/xts.h>
#include <crypto/sm4.h>
@@ -55,6 +57,9 @@ asmlinkage void sm4_ce_xts_enc(const u32 *rkey1, u8 *dst, const u8 *src,
asmlinkage void sm4_ce_xts_dec(const u32 *rkey1, u8 *dst, const u8 *src,
u8 *tweak, unsigned int nbytes,
const u32 *rkey2_enc);
+asmlinkage void sm4_ce_mac_update(const u32 *rkey_enc, u8 *digest,
+ const u8 *src, unsigned int nblocks,
+ bool enc_before, bool enc_after);
EXPORT_SYMBOL(sm4_ce_expand_key);
EXPORT_SYMBOL(sm4_ce_crypt_block);
@@ -72,6 +77,16 @@ struct sm4_essiv_cbc_ctx {
struct crypto_shash *hash;
};
+struct sm4_mac_tfm_ctx {
+ struct sm4_ctx key;
+ u8 __aligned(8) consts[];
+};
+
+struct sm4_mac_desc_ctx {
+ unsigned int len;
+ u8 digest[SM4_BLOCK_SIZE];
+};
+
static int sm4_setkey(struct crypto_skcipher *tfm, const u8 *key,
unsigned int key_len)
{
@@ -721,13 +736,260 @@ static struct skcipher_alg sm4_algs[] = {
}
};
+static int sm4_cbcmac_setkey(struct crypto_shash *tfm, const u8 *key,
+ unsigned int key_len)
+{
+ struct sm4_mac_tfm_ctx *ctx = crypto_shash_ctx(tfm);
+
+ if (key_len != SM4_KEY_SIZE)
+ return -EINVAL;
+
+ kernel_neon_begin();
+ sm4_ce_expand_key(key, ctx->key.rkey_enc, ctx->key.rkey_dec,
+ crypto_sm4_fk, crypto_sm4_ck);
+ kernel_neon_end();
+
+ return 0;
+}
+
+static int sm4_cmac_setkey(struct crypto_shash *tfm, const u8 *key,
+ unsigned int key_len)
+{
+ struct sm4_mac_tfm_ctx *ctx = crypto_shash_ctx(tfm);
+ be128 *consts = (be128 *)ctx->consts;
+ u64 a, b;
+
+ if (key_len != SM4_KEY_SIZE)
+ return -EINVAL;
+
+ memset(consts, 0, SM4_BLOCK_SIZE);
+
+ kernel_neon_begin();
+
+ sm4_ce_expand_key(key, ctx->key.rkey_enc, ctx->key.rkey_dec,
+ crypto_sm4_fk, crypto_sm4_ck);
+
+ /* encrypt the zero block */
+ sm4_ce_crypt_block(ctx->key.rkey_enc, (u8 *)consts, (const u8 *)consts);
+
+ kernel_neon_end();
+
+ /* gf(2^128) multiply zero-ciphertext with u and u^2 */
+ a = be64_to_cpu(consts[0].a);
+ b = be64_to_cpu(consts[0].b);
+ consts[0].a = cpu_to_be64((a << 1) | (b >> 63));
+ consts[0].b = cpu_to_be64((b << 1) ^ ((a >> 63) ? 0x87 : 0));
+
+ a = be64_to_cpu(consts[0].a);
+ b = be64_to_cpu(consts[0].b);
+ consts[1].a = cpu_to_be64((a << 1) | (b >> 63));
+ consts[1].b = cpu_to_be64((b << 1) ^ ((a >> 63) ? 0x87 : 0));
+
+ return 0;
+}
+
+static int sm4_xcbc_setkey(struct crypto_shash *tfm, const u8 *key,
+ unsigned int key_len)
+{
+ struct sm4_mac_tfm_ctx *ctx = crypto_shash_ctx(tfm);
+ u8 __aligned(8) key2[SM4_BLOCK_SIZE];
+ static u8 const ks[3][SM4_BLOCK_SIZE] = {
+ { [0 ... SM4_BLOCK_SIZE - 1] = 0x1},
+ { [0 ... SM4_BLOCK_SIZE - 1] = 0x2},
+ { [0 ... SM4_BLOCK_SIZE - 1] = 0x3},
+ };
+
+ if (key_len != SM4_KEY_SIZE)
+ return -EINVAL;
+
+ kernel_neon_begin();
+
+ sm4_ce_expand_key(key, ctx->key.rkey_enc, ctx->key.rkey_dec,
+ crypto_sm4_fk, crypto_sm4_ck);
+
+ sm4_ce_crypt_block(ctx->key.rkey_enc, key2, ks[0]);
+ sm4_ce_crypt(ctx->key.rkey_enc, ctx->consts, ks[1], 2);
+
+ sm4_ce_expand_key(key2, ctx->key.rkey_enc, ctx->key.rkey_dec,
+ crypto_sm4_fk, crypto_sm4_ck);
+
+ kernel_neon_end();
+
+ return 0;
+}
+
+static int sm4_mac_init(struct shash_desc *desc)
+{
+ struct sm4_mac_desc_ctx *ctx = shash_desc_ctx(desc);
+
+ memset(ctx->digest, 0, SM4_BLOCK_SIZE);
+ ctx->len = 0;
+
+ return 0;
+}
+
+static int sm4_mac_update(struct shash_desc *desc, const u8 *p,
+ unsigned int len)
+{
+ struct sm4_mac_tfm_ctx *tctx = crypto_shash_ctx(desc->tfm);
+ struct sm4_mac_desc_ctx *ctx = shash_desc_ctx(desc);
+ unsigned int l, nblocks;
+
+ if (len == 0)
+ return 0;
+
+ if (ctx->len || ctx->len + len < SM4_BLOCK_SIZE) {
+ l = min(len, SM4_BLOCK_SIZE - ctx->len);
+
+ crypto_xor(ctx->digest + ctx->len, p, l);
+ ctx->len += l;
+ len -= l;
+ p += l;
+ }
+
+ if (len && (ctx->len % SM4_BLOCK_SIZE) == 0) {
+ kernel_neon_begin();
+
+ if (len < SM4_BLOCK_SIZE && ctx->len == SM4_BLOCK_SIZE) {
+ sm4_ce_crypt_block(tctx->key.rkey_enc,
+ ctx->digest, ctx->digest);
+ ctx->len = 0;
+ } else {
+ nblocks = len / SM4_BLOCK_SIZE;
+ len %= SM4_BLOCK_SIZE;
+
+ sm4_ce_mac_update(tctx->key.rkey_enc, ctx->digest, p,
+ nblocks, (ctx->len == SM4_BLOCK_SIZE),
+ (len != 0));
+
+ p += nblocks * SM4_BLOCK_SIZE;
+
+ if (len == 0)
+ ctx->len = SM4_BLOCK_SIZE;
+ }
+
+ kernel_neon_end();
+
+ if (len) {
+ crypto_xor(ctx->digest, p, len);
+ ctx->len = len;
+ }
+ }
+
+ return 0;
+}
+
+static int sm4_cmac_final(struct shash_desc *desc, u8 *out)
+{
+ struct sm4_mac_tfm_ctx *tctx = crypto_shash_ctx(desc->tfm);
+ struct sm4_mac_desc_ctx *ctx = shash_desc_ctx(desc);
+ const u8 *consts = tctx->consts;
+
+ if (ctx->len != SM4_BLOCK_SIZE) {
+ ctx->digest[ctx->len] ^= 0x80;
+ consts += SM4_BLOCK_SIZE;
+ }
+
+ kernel_neon_begin();
+ sm4_ce_mac_update(tctx->key.rkey_enc, ctx->digest, consts, 1,
+ false, true);
+ kernel_neon_end();
+
+ memcpy(out, ctx->digest, SM4_BLOCK_SIZE);
+
+ return 0;
+}
+
+static int sm4_cbcmac_final(struct shash_desc *desc, u8 *out)
+{
+ struct sm4_mac_tfm_ctx *tctx = crypto_shash_ctx(desc->tfm);
+ struct sm4_mac_desc_ctx *ctx = shash_desc_ctx(desc);
+
+ if (ctx->len) {
+ kernel_neon_begin();
+ sm4_ce_crypt_block(tctx->key.rkey_enc, ctx->digest,
+ ctx->digest);
+ kernel_neon_end();
+ }
+
+ memcpy(out, ctx->digest, SM4_BLOCK_SIZE);
+
+ return 0;
+}
+
+static struct shash_alg sm4_mac_algs[] = {
+ {
+ .base = {
+ .cra_name = "cmac(sm4)",
+ .cra_driver_name = "cmac-sm4-ce",
+ .cra_priority = 400,
+ .cra_blocksize = SM4_BLOCK_SIZE,
+ .cra_ctxsize = sizeof(struct sm4_mac_tfm_ctx)
+ + SM4_BLOCK_SIZE * 2,
+ .cra_module = THIS_MODULE,
+ },
+ .digestsize = SM4_BLOCK_SIZE,
+ .init = sm4_mac_init,
+ .update = sm4_mac_update,
+ .final = sm4_cmac_final,
+ .setkey = sm4_cmac_setkey,
+ .descsize = sizeof(struct sm4_mac_desc_ctx),
+ }, {
+ .base = {
+ .cra_name = "xcbc(sm4)",
+ .cra_driver_name = "xcbc-sm4-ce",
+ .cra_priority = 400,
+ .cra_blocksize = SM4_BLOCK_SIZE,
+ .cra_ctxsize = sizeof(struct sm4_mac_tfm_ctx)
+ + SM4_BLOCK_SIZE * 2,
+ .cra_module = THIS_MODULE,
+ },
+ .digestsize = SM4_BLOCK_SIZE,
+ .init = sm4_mac_init,
+ .update = sm4_mac_update,
+ .final = sm4_cmac_final,
+ .setkey = sm4_xcbc_setkey,
+ .descsize = sizeof(struct sm4_mac_desc_ctx),
+ }, {
+ .base = {
+ .cra_name = "cbcmac(sm4)",
+ .cra_driver_name = "cbcmac-sm4-ce",
+ .cra_priority = 400,
+ .cra_blocksize = 1,
+ .cra_ctxsize = sizeof(struct sm4_mac_tfm_ctx),
+ .cra_module = THIS_MODULE,
+ },
+ .digestsize = SM4_BLOCK_SIZE,
+ .init = sm4_mac_init,
+ .update = sm4_mac_update,
+ .final = sm4_cbcmac_final,
+ .setkey = sm4_cbcmac_setkey,
+ .descsize = sizeof(struct sm4_mac_desc_ctx),
+ }
+};
+
static int __init sm4_init(void)
{
- return crypto_register_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs));
+ int err;
+
+ err = crypto_register_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs));
+ if (err)
+ return err;
+
+ err = crypto_register_shashes(sm4_mac_algs, ARRAY_SIZE(sm4_mac_algs));
+ if (err)
+ goto out_err;
+
+ return 0;
+
+out_err:
+ crypto_unregister_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs));
+ return err;
}
static void __exit sm4_exit(void)
{
+ crypto_unregister_shashes(sm4_mac_algs, ARRAY_SIZE(sm4_mac_algs));
crypto_unregister_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs));
}
@@ -744,5 +1006,8 @@ MODULE_ALIAS_CRYPTO("ctr(sm4)");
MODULE_ALIAS_CRYPTO("cts(cbc(sm4))");
MODULE_ALIAS_CRYPTO("xts(sm4)");
MODULE_ALIAS_CRYPTO("essiv(cbc(sm4),sm3)");
+MODULE_ALIAS_CRYPTO("cmac(sm4)");
+MODULE_ALIAS_CRYPTO("xcbc(sm4)");
+MODULE_ALIAS_CRYPTO("cbcmac(sm4)");
MODULE_AUTHOR("Tianjia Zhang <tianjia.zhang@linux.alibaba.com>");
MODULE_LICENSE("GPL v2");
--
2.24.3 (Apple Git-128)
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH 13/16] crypto: arm64/sm4 - add CE implementation for cmac/xcbc/cbcmac
@ 2022-09-26 9:36 ` Tianjia Zhang
0 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26 9:36 UTC (permalink / raw)
To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
linux-stm32
This patch adds a CE-optimized assembly implementation of cmac, xcbc and
cbcmac for SM4.
Benchmarked on a T-Head Yitian-710 at 2.75 GHz with tcrypt mode 300,
comparing performance before and after this patch (before this patch the
drivers in use were cmac(sm4-ce), xcbc(sm4-ce) and cbcmac(sm4-ce)). The
columns are update sizes in bytes and the values are throughput in Mb/s:
Before:
update-size | 16 64 256 1024 2048 4096 8192
---------------+--------------------------------------------------------
cmac(sm4-ce) | 293.33 403.69 503.76 527.78 531.10 535.46 535.81
xcbc(sm4-ce) | 292.83 402.50 504.02 529.08 529.87 536.55 538.24
cbcmac(sm4-ce) | 318.42 415.79 497.12 515.05 523.15 521.19 523.01
After:
update-size | 16 64 256 1024 2048 4096 8192
---------------+--------------------------------------------------------
cmac-sm4-ce | 371.99 675.28 903.56 971.65 980.57 990.40 991.04
xcbc-sm4-ce | 372.11 674.55 903.47 971.61 980.96 990.42 991.10
cbcmac-sm4-ce | 371.63 675.33 903.23 972.07 981.42 990.93 991.45
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
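The subkey derivation in sm4_cmac_setkey() doubles the encrypted zero
block twice in GF(2^128). A minimal user-space sketch of that step
follows. The names gf128_double() and cmac_derive_subkeys() are
hypothetical and not part of this patch, the code works byte-wise rather
than on two u64 halves as the patch does, and the caller is assumed to
supply L = E_K(0^128) itself.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16

/*
 * Multiply a 128-bit big-endian value by x in GF(2^128): shift left by
 * one bit and, if a bit fell off the top, fold it back in with 0x87.
 * (Hypothetical helper, not taken from the patch.)
 */
static void gf128_double(uint8_t out[BLOCK_SIZE], const uint8_t in[BLOCK_SIZE])
{
	uint8_t carry = in[0] >> 7;	/* bit shifted out of the top */
	int i;

	for (i = 0; i < BLOCK_SIZE - 1; i++)
		out[i] = (uint8_t)((in[i] << 1) | (in[i + 1] >> 7));
	out[BLOCK_SIZE - 1] = (uint8_t)((in[BLOCK_SIZE - 1] << 1) ^
					(carry ? 0x87 : 0));
}

/* Derive both CMAC subkeys from L, the encryption of the zero block. */
static void cmac_derive_subkeys(const uint8_t l[BLOCK_SIZE],
				uint8_t k1[BLOCK_SIZE], uint8_t k2[BLOCK_SIZE])
{
	gf128_double(k1, l);	/* K1 = L * x   */
	gf128_double(k2, k1);	/* K2 = L * x^2 */
}
```

The doubling does not depend on the block cipher, so a sketch like this
can be checked against the subkey generation vectors published for
AES-CMAC in RFC 4493.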
arch/arm64/crypto/sm4-ce-core.S | 70 +++++++++
arch/arm64/crypto/sm4-ce-glue.c | 267 +++++++++++++++++++++++++++++++-
2 files changed, 336 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/crypto/sm4-ce-core.S b/arch/arm64/crypto/sm4-ce-core.S
index 6b923c3209a0..69fe3b90b7ad 100644
--- a/arch/arm64/crypto/sm4-ce-core.S
+++ b/arch/arm64/crypto/sm4-ce-core.S
@@ -35,6 +35,7 @@
#define RTMP3 v19
#define RIV v20
+#define RMAC v20
#define RMASK v21
@@ -1049,6 +1050,75 @@ SYM_FUNC_START(sm4_ce_xts_dec)
ret
SYM_FUNC_END(sm4_ce_xts_dec)
+.align 3
+SYM_FUNC_START(sm4_ce_mac_update)
+ /* input:
+ * x0: round key array, CTX
+ * x1: digest
+ * x2: src
+ * w3: nblocks
+ * w4: enc_before
+ * w5: enc_after
+ */
+ SM4_PREPARE(x0)
+
+ ld1 {RMAC.16b}, [x1]
+
+ cbz w4, .Lmac_update
+
+ SM4_CRYPT_BLK(RMAC)
+
+.Lmac_update:
+ cbz w3, .Lmac_ret
+
+ sub w6, w3, #1
+ cmp w5, wzr
+ csel w3, w3, w6, ne
+
+ cbz w3, .Lmac_end
+
+.Lmac_loop_4x:
+ cmp w3, #4
+ blt .Lmac_loop_1x
+
+ sub w3, w3, #4
+
+ ld1 {v0.16b-v3.16b}, [x2], #64
+
+ eor RMAC.16b, RMAC.16b, v0.16b
+ SM4_CRYPT_BLK(RMAC)
+ eor RMAC.16b, RMAC.16b, v1.16b
+ SM4_CRYPT_BLK(RMAC)
+ eor RMAC.16b, RMAC.16b, v2.16b
+ SM4_CRYPT_BLK(RMAC)
+ eor RMAC.16b, RMAC.16b, v3.16b
+ SM4_CRYPT_BLK(RMAC)
+
+ cbz w3, .Lmac_end
+ b .Lmac_loop_4x
+
+.Lmac_loop_1x:
+ sub w3, w3, #1
+
+ ld1 {v0.16b}, [x2], #16
+
+ eor RMAC.16b, RMAC.16b, v0.16b
+ SM4_CRYPT_BLK(RMAC)
+
+ cbnz w3, .Lmac_loop_1x
+
+
+.Lmac_end:
+ cbnz w5, .Lmac_ret
+
+ ld1 {v0.16b}, [x2], #16
+ eor RMAC.16b, RMAC.16b, v0.16b
+
+.Lmac_ret:
+ st1 {RMAC.16b}, [x1]
+ ret
+SYM_FUNC_END(sm4_ce_mac_update)
+
.section ".rodata", "a"
.align 4
diff --git a/arch/arm64/crypto/sm4-ce-glue.c b/arch/arm64/crypto/sm4-ce-glue.c
index 6267ec1cfac0..c2d10b8e92b2 100644
--- a/arch/arm64/crypto/sm4-ce-glue.c
+++ b/arch/arm64/crypto/sm4-ce-glue.c
@@ -14,8 +14,10 @@
#include <linux/cpufeature.h>
#include <asm/neon.h>
#include <asm/simd.h>
+#include <crypto/b128ops.h>
#include <crypto/internal/simd.h>
#include <crypto/internal/skcipher.h>
+#include <crypto/internal/hash.h>
#include <crypto/scatterwalk.h>
#include <crypto/xts.h>
#include <crypto/sm4.h>
@@ -55,6 +57,9 @@ asmlinkage void sm4_ce_xts_enc(const u32 *rkey1, u8 *dst, const u8 *src,
asmlinkage void sm4_ce_xts_dec(const u32 *rkey1, u8 *dst, const u8 *src,
u8 *tweak, unsigned int nbytes,
const u32 *rkey2_enc);
+asmlinkage void sm4_ce_mac_update(const u32 *rkey_enc, u8 *digest,
+ const u8 *src, unsigned int nblocks,
+ bool enc_before, bool enc_after);
EXPORT_SYMBOL(sm4_ce_expand_key);
EXPORT_SYMBOL(sm4_ce_crypt_block);
@@ -72,6 +77,16 @@ struct sm4_essiv_cbc_ctx {
struct crypto_shash *hash;
};
+struct sm4_mac_tfm_ctx {
+ struct sm4_ctx key;
+ u8 __aligned(8) consts[];
+};
+
+struct sm4_mac_desc_ctx {
+ unsigned int len;
+ u8 digest[SM4_BLOCK_SIZE];
+};
+
static int sm4_setkey(struct crypto_skcipher *tfm, const u8 *key,
unsigned int key_len)
{
@@ -721,13 +736,260 @@ static struct skcipher_alg sm4_algs[] = {
}
};
+static int sm4_cbcmac_setkey(struct crypto_shash *tfm, const u8 *key,
+ unsigned int key_len)
+{
+ struct sm4_mac_tfm_ctx *ctx = crypto_shash_ctx(tfm);
+
+ if (key_len != SM4_KEY_SIZE)
+ return -EINVAL;
+
+ kernel_neon_begin();
+ sm4_ce_expand_key(key, ctx->key.rkey_enc, ctx->key.rkey_dec,
+ crypto_sm4_fk, crypto_sm4_ck);
+ kernel_neon_end();
+
+ return 0;
+}
+
+static int sm4_cmac_setkey(struct crypto_shash *tfm, const u8 *key,
+ unsigned int key_len)
+{
+ struct sm4_mac_tfm_ctx *ctx = crypto_shash_ctx(tfm);
+ be128 *consts = (be128 *)ctx->consts;
+ u64 a, b;
+
+ if (key_len != SM4_KEY_SIZE)
+ return -EINVAL;
+
+ memset(consts, 0, SM4_BLOCK_SIZE);
+
+ kernel_neon_begin();
+
+ sm4_ce_expand_key(key, ctx->key.rkey_enc, ctx->key.rkey_dec,
+ crypto_sm4_fk, crypto_sm4_ck);
+
+ /* encrypt the zero block */
+ sm4_ce_crypt_block(ctx->key.rkey_enc, (u8 *)consts, (const u8 *)consts);
+
+ kernel_neon_end();
+
+ /* gf(2^128) multiply zero-ciphertext with u and u^2 */
+ a = be64_to_cpu(consts[0].a);
+ b = be64_to_cpu(consts[0].b);
+ consts[0].a = cpu_to_be64((a << 1) | (b >> 63));
+ consts[0].b = cpu_to_be64((b << 1) ^ ((a >> 63) ? 0x87 : 0));
+
+ a = be64_to_cpu(consts[0].a);
+ b = be64_to_cpu(consts[0].b);
+ consts[1].a = cpu_to_be64((a << 1) | (b >> 63));
+ consts[1].b = cpu_to_be64((b << 1) ^ ((a >> 63) ? 0x87 : 0));
+
+ return 0;
+}
+
+static int sm4_xcbc_setkey(struct crypto_shash *tfm, const u8 *key,
+ unsigned int key_len)
+{
+ struct sm4_mac_tfm_ctx *ctx = crypto_shash_ctx(tfm);
+ u8 __aligned(8) key2[SM4_BLOCK_SIZE];
+ static u8 const ks[3][SM4_BLOCK_SIZE] = {
+ { [0 ... SM4_BLOCK_SIZE - 1] = 0x1},
+ { [0 ... SM4_BLOCK_SIZE - 1] = 0x2},
+ { [0 ... SM4_BLOCK_SIZE - 1] = 0x3},
+ };
+
+ if (key_len != SM4_KEY_SIZE)
+ return -EINVAL;
+
+ kernel_neon_begin();
+
+ sm4_ce_expand_key(key, ctx->key.rkey_enc, ctx->key.rkey_dec,
+ crypto_sm4_fk, crypto_sm4_ck);
+
+ sm4_ce_crypt_block(ctx->key.rkey_enc, key2, ks[0]);
+ sm4_ce_crypt(ctx->key.rkey_enc, ctx->consts, ks[1], 2);
+
+ sm4_ce_expand_key(key2, ctx->key.rkey_enc, ctx->key.rkey_dec,
+ crypto_sm4_fk, crypto_sm4_ck);
+
+ kernel_neon_end();
+
+ return 0;
+}
+
+static int sm4_mac_init(struct shash_desc *desc)
+{
+ struct sm4_mac_desc_ctx *ctx = shash_desc_ctx(desc);
+
+ memset(ctx->digest, 0, SM4_BLOCK_SIZE);
+ ctx->len = 0;
+
+ return 0;
+}
+
+static int sm4_mac_update(struct shash_desc *desc, const u8 *p,
+ unsigned int len)
+{
+ struct sm4_mac_tfm_ctx *tctx = crypto_shash_ctx(desc->tfm);
+ struct sm4_mac_desc_ctx *ctx = shash_desc_ctx(desc);
+ unsigned int l, nblocks;
+
+ if (len == 0)
+ return 0;
+
+ if (ctx->len || ctx->len + len < SM4_BLOCK_SIZE) {
+ l = min(len, SM4_BLOCK_SIZE - ctx->len);
+
+ crypto_xor(ctx->digest + ctx->len, p, l);
+ ctx->len += l;
+ len -= l;
+ p += l;
+ }
+
+ if (len && (ctx->len % SM4_BLOCK_SIZE) == 0) {
+ kernel_neon_begin();
+
+ if (len < SM4_BLOCK_SIZE && ctx->len == SM4_BLOCK_SIZE) {
+ sm4_ce_crypt_block(tctx->key.rkey_enc,
+ ctx->digest, ctx->digest);
+ ctx->len = 0;
+ } else {
+ nblocks = len / SM4_BLOCK_SIZE;
+ len %= SM4_BLOCK_SIZE;
+
+ sm4_ce_mac_update(tctx->key.rkey_enc, ctx->digest, p,
+ nblocks, (ctx->len == SM4_BLOCK_SIZE),
+ (len != 0));
+
+ p += nblocks * SM4_BLOCK_SIZE;
+
+ if (len == 0)
+ ctx->len = SM4_BLOCK_SIZE;
+ }
+
+ kernel_neon_end();
+
+ if (len) {
+ crypto_xor(ctx->digest, p, len);
+ ctx->len = len;
+ }
+ }
+
+ return 0;
+}
+
+static int sm4_cmac_final(struct shash_desc *desc, u8 *out)
+{
+ struct sm4_mac_tfm_ctx *tctx = crypto_shash_ctx(desc->tfm);
+ struct sm4_mac_desc_ctx *ctx = shash_desc_ctx(desc);
+ const u8 *consts = tctx->consts;
+
+ if (ctx->len != SM4_BLOCK_SIZE) {
+ ctx->digest[ctx->len] ^= 0x80;
+ consts += SM4_BLOCK_SIZE;
+ }
+
+ kernel_neon_begin();
+ sm4_ce_mac_update(tctx->key.rkey_enc, ctx->digest, consts, 1,
+ false, true);
+ kernel_neon_end();
+
+ memcpy(out, ctx->digest, SM4_BLOCK_SIZE);
+
+ return 0;
+}
+
+static int sm4_cbcmac_final(struct shash_desc *desc, u8 *out)
+{
+ struct sm4_mac_tfm_ctx *tctx = crypto_shash_ctx(desc->tfm);
+ struct sm4_mac_desc_ctx *ctx = shash_desc_ctx(desc);
+
+ if (ctx->len) {
+ kernel_neon_begin();
+ sm4_ce_crypt_block(tctx->key.rkey_enc, ctx->digest,
+ ctx->digest);
+ kernel_neon_end();
+ }
+
+ memcpy(out, ctx->digest, SM4_BLOCK_SIZE);
+
+ return 0;
+}
+
+static struct shash_alg sm4_mac_algs[] = {
+ {
+ .base = {
+ .cra_name = "cmac(sm4)",
+ .cra_driver_name = "cmac-sm4-ce",
+ .cra_priority = 400,
+ .cra_blocksize = SM4_BLOCK_SIZE,
+ .cra_ctxsize = sizeof(struct sm4_mac_tfm_ctx)
+ + SM4_BLOCK_SIZE * 2,
+ .cra_module = THIS_MODULE,
+ },
+ .digestsize = SM4_BLOCK_SIZE,
+ .init = sm4_mac_init,
+ .update = sm4_mac_update,
+ .final = sm4_cmac_final,
+ .setkey = sm4_cmac_setkey,
+ .descsize = sizeof(struct sm4_mac_desc_ctx),
+ }, {
+ .base = {
+ .cra_name = "xcbc(sm4)",
+ .cra_driver_name = "xcbc-sm4-ce",
+ .cra_priority = 400,
+ .cra_blocksize = SM4_BLOCK_SIZE,
+ .cra_ctxsize = sizeof(struct sm4_mac_tfm_ctx)
+ + SM4_BLOCK_SIZE * 2,
+ .cra_module = THIS_MODULE,
+ },
+ .digestsize = SM4_BLOCK_SIZE,
+ .init = sm4_mac_init,
+ .update = sm4_mac_update,
+ .final = sm4_cmac_final,
+ .setkey = sm4_xcbc_setkey,
+ .descsize = sizeof(struct sm4_mac_desc_ctx),
+ }, {
+ .base = {
+ .cra_name = "cbcmac(sm4)",
+ .cra_driver_name = "cbcmac-sm4-ce",
+ .cra_priority = 400,
+ .cra_blocksize = 1,
+ .cra_ctxsize = sizeof(struct sm4_mac_tfm_ctx),
+ .cra_module = THIS_MODULE,
+ },
+ .digestsize = SM4_BLOCK_SIZE,
+ .init = sm4_mac_init,
+ .update = sm4_mac_update,
+ .final = sm4_cbcmac_final,
+ .setkey = sm4_cbcmac_setkey,
+ .descsize = sizeof(struct sm4_mac_desc_ctx),
+ }
+};
+
static int __init sm4_init(void)
{
- return crypto_register_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs));
+ int err;
+
+ err = crypto_register_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs));
+ if (err)
+ return err;
+
+ err = crypto_register_shashes(sm4_mac_algs, ARRAY_SIZE(sm4_mac_algs));
+ if (err)
+ goto out_err;
+
+ return 0;
+
+out_err:
+ crypto_unregister_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs));
+ return err;
}
static void __exit sm4_exit(void)
{
+ crypto_unregister_shashes(sm4_mac_algs, ARRAY_SIZE(sm4_mac_algs));
crypto_unregister_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs));
}
@@ -744,5 +1006,8 @@ MODULE_ALIAS_CRYPTO("ctr(sm4)");
MODULE_ALIAS_CRYPTO("cts(cbc(sm4))");
MODULE_ALIAS_CRYPTO("xts(sm4)");
MODULE_ALIAS_CRYPTO("essiv(cbc(sm4),sm3)");
+MODULE_ALIAS_CRYPTO("cmac(sm4)");
+MODULE_ALIAS_CRYPTO("xcbc(sm4)");
+MODULE_ALIAS_CRYPTO("cbcmac(sm4)");
MODULE_AUTHOR("Tianjia Zhang <tianjia.zhang@linux.alibaba.com>");
MODULE_LICENSE("GPL v2");
--
2.24.3 (Apple Git-128)
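The enc_before/enc_after contract of sm4_ce_mac_update() in the patch
above can be modelled in plain C. This is only a sketch for reasoning
about the control flow: toy_encrypt() is a made-up placeholder that
stands in for one SM4 block encryption (it is not SM4 and has no
cryptographic value), and mac_update_model() and xor_block() are
hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16

/* Placeholder permutation standing in for SM4; NOT a real cipher. */
static void toy_encrypt(uint8_t blk[BLOCK_SIZE])
{
	uint8_t first = blk[0];
	int i;

	for (i = 0; i < BLOCK_SIZE - 1; i++)
		blk[i] = (uint8_t)(blk[i + 1] + 0x3d);
	blk[BLOCK_SIZE - 1] = (uint8_t)(first ^ 0xa5);
}

static void xor_block(uint8_t *dst, const uint8_t *src)
{
	int i;

	for (i = 0; i < BLOCK_SIZE; i++)
		dst[i] ^= src[i];
}

/*
 * Model of the assembly routine: optionally encrypt the pending digest
 * first, then absorb nblocks blocks. The last absorbed block is left
 * unencrypted unless enc_after is set.
 */
static void mac_update_model(uint8_t digest[BLOCK_SIZE], const uint8_t *src,
			     unsigned int nblocks, int enc_before,
			     int enc_after)
{
	unsigned int i;

	if (enc_before)
		toy_encrypt(digest);

	for (i = 0; i < nblocks; i++) {
		xor_block(digest, src + i * BLOCK_SIZE);
		if (i + 1 < nblocks || enc_after)
			toy_encrypt(digest);
	}
}
```

Leaving the last block unencrypted is what lets sm4_cmac_final() XOR a
subkey into it before the final encryption, which it does by calling the
same routine with one block, enc_before = false and enc_after = true.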
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH 14/16] crypto: arm64/sm4 - add CE implementation for CCM mode
2022-09-26 9:36 ` Tianjia Zhang
@ 2022-09-26 9:36 ` Tianjia Zhang
-1 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26 9:36 UTC (permalink / raw)
To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
linux-stm32
This patch adds a CE-optimized assembly implementation of SM4 in CCM mode.
Benchmarked on a T-Head Yitian-710 at 2.75 GHz with tcrypt modes 223 and
225, comparing performance before and after this patch (before this patch
the driver in use was ccm_base(ctr-sm4-ce,cbcmac-sm4-ce)). The columns are
block sizes in bytes and the values are throughput in Mb/s:
Before (rfc4309(ccm_base(ctr-sm4-ce,cbcmac-sm4-ce))):
ccm(sm4) | 16 64 256 512 1024 1420 4096 8192
-------------+---------------------------------------------------------------
CCM enc | 35.07 125.40 336.47 468.17 581.97 619.18 712.56 736.01
CCM dec | 34.87 124.40 335.08 466.75 581.04 618.81 712.25 735.89
CCM mb enc | 34.71 123.96 333.92 465.39 579.91 617.49 711.45 734.92
CCM mb dec | 34.42 122.80 331.02 462.81 578.28 616.42 709.88 734.19
After (rfc4309(ccm-sm4-ce)):
ccm-sm4-ce | 16 64 256 512 1024 1420 4096 8192
-------------+---------------------------------------------------------------
CCM enc | 77.12 249.82 569.94 725.17 839.27 867.71 952.87 969.89
CCM dec | 75.90 247.26 566.29 722.12 836.90 865.95 951.74 968.57
CCM mb enc | 75.98 245.25 562.91 718.99 834.76 864.70 950.17 967.90
CCM mb dec | 75.06 243.78 560.58 717.13 833.68 862.70 949.35 967.11
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
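The B0 block that ccm_format_input() builds (flags byte, nonce, message
length) can be illustrated with a small user-space sketch. The function
name ccm_format_b0() and its signature are hypothetical. Unlike the
patch, it writes to a separate output buffer and leaves the IV untouched,
and it assumes the tag size was already validated as in
ccm_setauthsize().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16

/*
 * Build B0 from a 16-byte IV whose first byte holds L - 1 and whose
 * following 15 - L bytes hold the nonce. authsize is assumed to be an
 * even value between 4 and 16. Returns 0 on success, -1 on bad input.
 */
static int ccm_format_b0(uint8_t b0[BLOCK_SIZE], const uint8_t iv[BLOCK_SIZE],
			 unsigned int authsize, int have_aad, uint32_t msglen)
{
	unsigned int l = iv[0] + 1u;
	unsigned int i;

	if (l < 2 || l > 8)
		return -1;
	if (l < 4 && (msglen >> (8 * l)))
		return -1;	/* length does not fit in L bytes */

	memcpy(b0, iv, BLOCK_SIZE);
	memset(&b0[BLOCK_SIZE - l], 0, l);	/* clear the length field */

	b0[0] |= (uint8_t)(((authsize - 2) / 2) << 3);	/* M' field */
	if (have_aad)
		b0[0] |= 1 << 6;			/* Adata flag */

	/* big-endian message length in the low min(L, 4) bytes */
	for (i = 0; i < l && i < 4; i++)
		b0[BLOCK_SIZE - 1 - i] = (uint8_t)(msglen >> (8 * i));

	return 0;
}
```

With L = 2, an 8-byte tag and associated data present, the flags byte
comes out as 0x40 | (3 << 3) | 0x01 = 0x59.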
arch/arm64/crypto/Kconfig | 16 ++
arch/arm64/crypto/Makefile | 3 +
arch/arm64/crypto/sm4-ce-ccm-core.S | 328 ++++++++++++++++++++++++++++
arch/arm64/crypto/sm4-ce-ccm-glue.c | 303 +++++++++++++++++++++++++
4 files changed, 650 insertions(+)
create mode 100644 arch/arm64/crypto/sm4-ce-ccm-core.S
create mode 100644 arch/arm64/crypto/sm4-ce-ccm-glue.c
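ccm_calculate_auth_mac() prefixes the associated data with its encoded
length before feeding it to the CBC-MAC: two bytes when the length is
below 0xff00, otherwise the marker 0xfffe followed by a 32-bit length.
A minimal sketch of that encoding, using the hypothetical helper name
ccm_encode_aad_len():

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Write the CCM associated-data length prefix into out (room for six
 * bytes is assumed) and return how many bytes were written.
 */
static size_t ccm_encode_aad_len(uint8_t out[6], uint32_t assoclen)
{
	if (assoclen < 0xff00) {
		/* short form: 16-bit big-endian length */
		out[0] = (uint8_t)(assoclen >> 8);
		out[1] = (uint8_t)assoclen;
		return 2;
	}

	/* long form: 0xff 0xfe marker, then 32-bit big-endian length */
	out[0] = 0xff;
	out[1] = 0xfe;
	out[2] = (uint8_t)(assoclen >> 24);
	out[3] = (uint8_t)(assoclen >> 16);
	out[4] = (uint8_t)(assoclen >> 8);
	out[5] = (uint8_t)assoclen;
	return 6;
}
```

In the patch these prefix bytes are XORed into the MAC block right after
B0 has been encrypted, and the associated data follows from that offset.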
diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index 8939f5ae9214..2611036a3e3f 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -281,6 +281,22 @@ config CRYPTO_AES_ARM64_CE_CCM
- ARMv8 Crypto Extensions
- NEON (Advanced SIMD) extensions
+config CRYPTO_SM4_ARM64_CE_CCM
+ tristate "AEAD cipher: SM4 in CCM mode (ARMv8 Crypto Extensions)"
+ depends on KERNEL_MODE_NEON
+ select CRYPTO_ALGAPI
+ select CRYPTO_AEAD
+ select CRYPTO_SM4
+ select CRYPTO_SM4_ARM64_CE_BLK
+ help
+ AEAD cipher: SM4 cipher algorithms (OSCCA GB/T 32907-2016) with
+ CCM (Counter with Cipher Block Chaining-Message Authentication Code)
+ authenticated encryption mode (NIST SP800-38C)
+
+ Architecture: arm64 using:
+ - ARMv8 Crypto Extensions
+ - NEON (Advanced SIMD) extensions
+
config CRYPTO_CRCT10DIF_ARM64_CE
tristate "CRCT10DIF (PMULL)"
depends on KERNEL_MODE_NEON && CRC_T10DIF
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index 087f1625e775..843ea5266965 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -29,6 +29,9 @@ sm4-ce-cipher-y := sm4-ce-cipher-glue.o sm4-ce-cipher-core.o
obj-$(CONFIG_CRYPTO_SM4_ARM64_CE_BLK) += sm4-ce.o
sm4-ce-y := sm4-ce-glue.o sm4-ce-core.o
+obj-$(CONFIG_CRYPTO_SM4_ARM64_CE_CCM) += sm4-ce-ccm.o
+sm4-ce-ccm-y := sm4-ce-ccm-glue.o sm4-ce-ccm-core.o
+
obj-$(CONFIG_CRYPTO_SM4_ARM64_NEON_BLK) += sm4-neon.o
sm4-neon-y := sm4-neon-glue.o sm4-neon-core.o
diff --git a/arch/arm64/crypto/sm4-ce-ccm-core.S b/arch/arm64/crypto/sm4-ce-ccm-core.S
new file mode 100644
index 000000000000..028207c4afd0
--- /dev/null
+++ b/arch/arm64/crypto/sm4-ce-ccm-core.S
@@ -0,0 +1,328 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * SM4-CCM AEAD Algorithm using ARMv8 Crypto Extensions
+ * as specified in rfc8998
+ * https://datatracker.ietf.org/doc/html/rfc8998
+ *
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+#include "sm4-ce-asm.h"
+
+.arch armv8-a+crypto
+
+.irp b, 0, 1, 8, 9, 10, 11, 12, 13, 14, 15, 16, 24, 25, 26, 27, 28, 29, 30, 31
+ .set .Lv\b\().4s, \b
+.endr
+
+.macro sm4e, vd, vn
+ .inst 0xcec08400 | (.L\vn << 5) | .L\vd
+.endm
+
+/* Register macros */
+
+#define RMAC v16
+
+/* Helper macros. */
+
+#define inc_le128(vctr) \
+ mov vctr.d[1], x8; \
+ mov vctr.d[0], x7; \
+ adds x8, x8, #1; \
+ rev64 vctr.16b, vctr.16b; \
+ adc x7, x7, xzr;
+
+
+.align 3
+SYM_FUNC_START(sm4_ce_cbcmac_update)
+ /* input:
+ * x0: round key array, CTX
+ * x1: mac
+ * x2: src
+ * w3: nblocks
+ */
+ SM4_PREPARE(x0)
+
+ ld1 {RMAC.16b}, [x1]
+
+.Lcbcmac_loop_4x:
+ cmp w3, #4
+ blt .Lcbcmac_loop_1x
+
+ sub w3, w3, #4
+
+ ld1 {v0.16b-v3.16b}, [x2], #64
+
+ SM4_CRYPT_BLK(RMAC)
+ eor RMAC.16b, RMAC.16b, v0.16b
+ SM4_CRYPT_BLK(RMAC)
+ eor RMAC.16b, RMAC.16b, v1.16b
+ SM4_CRYPT_BLK(RMAC)
+ eor RMAC.16b, RMAC.16b, v2.16b
+ SM4_CRYPT_BLK(RMAC)
+ eor RMAC.16b, RMAC.16b, v3.16b
+
+ cbz w3, .Lcbcmac_end
+ b .Lcbcmac_loop_4x
+
+.Lcbcmac_loop_1x:
+ sub w3, w3, #1
+
+ ld1 {v0.16b}, [x2], #16
+
+ SM4_CRYPT_BLK(RMAC)
+ eor RMAC.16b, RMAC.16b, v0.16b
+
+ cbnz w3, .Lcbcmac_loop_1x
+
+.Lcbcmac_end:
+ st1 {RMAC.16b}, [x1]
+ ret
+SYM_FUNC_END(sm4_ce_cbcmac_update)
+
+.align 3
+SYM_FUNC_START(sm4_ce_ccm_final)
+ /* input:
+ * x0: round key array, CTX
+ * x1: ctr0 (big endian, 128 bit)
+ * x2: mac
+ */
+ SM4_PREPARE(x0)
+
+ ld1 {RMAC.16b}, [x2]
+ ld1 {v0.16b}, [x1]
+
+ SM4_CRYPT_BLK2(RMAC, v0)
+
+ /* en-/decrypt the mac with ctr0 */
+ eor RMAC.16b, RMAC.16b, v0.16b
+ st1 {RMAC.16b}, [x2]
+
+ ret
+SYM_FUNC_END(sm4_ce_ccm_final)
+
+.align 3
+SYM_FUNC_START(sm4_ce_ccm_enc)
+ /* input:
+ * x0: round key array, CTX
+ * x1: dst
+ * x2: src
+ * x3: ctr (big endian, 128 bit)
+ * w4: nbytes
+ * x5: mac
+ */
+ SM4_PREPARE(x0)
+
+ ldp x7, x8, [x3]
+ rev x7, x7
+ rev x8, x8
+
+ ld1 {RMAC.16b}, [x5]
+
+.Lccm_enc_loop_4x:
+ cmp w4, #(4 * 16)
+ blt .Lccm_enc_loop_1x
+
+ sub w4, w4, #(4 * 16)
+
+ /* construct CTRs */
+ inc_le128(v8) /* +0 */
+ inc_le128(v9) /* +1 */
+ inc_le128(v10) /* +2 */
+ inc_le128(v11) /* +3 */
+
+ ld1 {v0.16b-v3.16b}, [x2], #64
+
+ SM4_CRYPT_BLK2(v8, RMAC)
+ eor v8.16b, v8.16b, v0.16b
+ eor RMAC.16b, RMAC.16b, v0.16b
+ SM4_CRYPT_BLK2(v9, RMAC)
+ eor v9.16b, v9.16b, v1.16b
+ eor RMAC.16b, RMAC.16b, v1.16b
+ SM4_CRYPT_BLK2(v10, RMAC)
+ eor v10.16b, v10.16b, v2.16b
+ eor RMAC.16b, RMAC.16b, v2.16b
+ SM4_CRYPT_BLK2(v11, RMAC)
+ eor v11.16b, v11.16b, v3.16b
+ eor RMAC.16b, RMAC.16b, v3.16b
+
+ st1 {v8.16b-v11.16b}, [x1], #64
+
+ cbz w4, .Lccm_enc_end
+ b .Lccm_enc_loop_4x
+
+.Lccm_enc_loop_1x:
+ cmp w4, #16
+ blt .Lccm_enc_tail
+
+ sub w4, w4, #16
+
+ /* construct CTRs */
+ inc_le128(v8)
+
+ ld1 {v0.16b}, [x2], #16
+
+ SM4_CRYPT_BLK2(v8, RMAC)
+ eor v8.16b, v8.16b, v0.16b
+ eor RMAC.16b, RMAC.16b, v0.16b
+
+ st1 {v8.16b}, [x1], #16
+
+ cbz w4, .Lccm_enc_end
+ b .Lccm_enc_loop_1x
+
+.Lccm_enc_tail:
+ /* construct CTRs */
+ inc_le128(v8)
+
+ SM4_CRYPT_BLK2(RMAC, v8)
+
+ /* store new MAC */
+ st1 {RMAC.16b}, [x5]
+
+.Lccm_enc_tail_loop:
+ ldrb w0, [x2], #1 /* get 1 byte from input */
+ umov w9, v8.b[0] /* get top crypted CTR byte */
+ umov w6, RMAC.b[0] /* get top MAC byte */
+
+ eor w9, w9, w0 /* w9 = CTR ^ input */
+ eor w6, w6, w0 /* w6 = MAC ^ input */
+
+ strb w9, [x1], #1 /* store out byte */
+ strb w6, [x5], #1 /* store MAC byte */
+
+ subs w4, w4, #1
+ beq .Lccm_enc_ret
+
+ /* shift out one byte */
+ ext RMAC.16b, RMAC.16b, RMAC.16b, #1
+ ext v8.16b, v8.16b, v8.16b, #1
+
+ b .Lccm_enc_tail_loop
+
+.Lccm_enc_end:
+ /* store new MAC */
+ st1 {RMAC.16b}, [x5]
+
+ /* store new CTR */
+ rev x7, x7
+ rev x8, x8
+ stp x7, x8, [x3]
+
+.Lccm_enc_ret:
+ ret
+SYM_FUNC_END(sm4_ce_ccm_enc)
+
+.align 3
+SYM_FUNC_START(sm4_ce_ccm_dec)
+ /* input:
+ * x0: round key array, CTX
+ * x1: dst
+ * x2: src
+ * x3: ctr (big endian, 128 bit)
+ * w4: nbytes
+ * x5: mac
+ */
+ SM4_PREPARE(x0)
+
+ ldp x7, x8, [x3]
+ rev x7, x7
+ rev x8, x8
+
+ ld1 {RMAC.16b}, [x5]
+
+.Lccm_dec_loop_4x:
+ cmp w4, #(4 * 16)
+ blt .Lccm_dec_loop_1x
+
+ sub w4, w4, #(4 * 16)
+
+ /* construct CTRs */
+ inc_le128(v8) /* +0 */
+ inc_le128(v9) /* +1 */
+ inc_le128(v10) /* +2 */
+ inc_le128(v11) /* +3 */
+
+ ld1 {v0.16b-v3.16b}, [x2], #64
+
+ SM4_CRYPT_BLK2(v8, RMAC)
+ eor v8.16b, v8.16b, v0.16b
+ eor RMAC.16b, RMAC.16b, v8.16b
+ SM4_CRYPT_BLK2(v9, RMAC)
+ eor v9.16b, v9.16b, v1.16b
+ eor RMAC.16b, RMAC.16b, v9.16b
+ SM4_CRYPT_BLK2(v10, RMAC)
+ eor v10.16b, v10.16b, v2.16b
+ eor RMAC.16b, RMAC.16b, v10.16b
+ SM4_CRYPT_BLK2(v11, RMAC)
+ eor v11.16b, v11.16b, v3.16b
+ eor RMAC.16b, RMAC.16b, v11.16b
+
+ st1 {v8.16b-v11.16b}, [x1], #64
+
+ cbz w4, .Lccm_dec_end
+ b .Lccm_dec_loop_4x
+
+.Lccm_dec_loop_1x:
+ cmp w4, #16
+ blt .Lccm_dec_tail
+
+ sub w4, w4, #16
+
+ /* construct CTRs */
+ inc_le128(v8)
+
+ ld1 {v0.16b}, [x2], #16
+
+ SM4_CRYPT_BLK2(v8, RMAC)
+ eor v8.16b, v8.16b, v0.16b
+ eor RMAC.16b, RMAC.16b, v8.16b
+
+ st1 {v8.16b}, [x1], #16
+
+ cbz w4, .Lccm_dec_end
+ b .Lccm_dec_loop_1x
+
+.Lccm_dec_tail:
+ /* construct CTRs */
+ inc_le128(v8)
+
+ SM4_CRYPT_BLK2(RMAC, v8)
+
+ /* store new MAC */
+ st1 {RMAC.16b}, [x5]
+
+.Lccm_dec_tail_loop:
+ ldrb w0, [x2], #1 /* get 1 byte from input */
+ umov w9, v8.b[0] /* get top crypted CTR byte */
+ umov w6, RMAC.b[0] /* get top MAC byte */
+
+ eor w9, w9, w0 /* w9 = CTR ^ input */
+ eor w6, w6, w9 /* w6 = MAC ^ output */
+
+ strb w9, [x1], #1 /* store out byte */
+ strb w6, [x5], #1 /* store MAC byte */
+
+ subs w4, w4, #1
+ beq .Lccm_dec_ret
+
+ /* shift out one byte */
+ ext RMAC.16b, RMAC.16b, RMAC.16b, #1
+ ext v8.16b, v8.16b, v8.16b, #1
+
+ b .Lccm_dec_tail_loop
+
+.Lccm_dec_end:
+ /* store new MAC */
+ st1 {RMAC.16b}, [x5]
+
+ /* store new CTR */
+ rev x7, x7
+ rev x8, x8
+ stp x7, x8, [x3]
+
+.Lccm_dec_ret:
+ ret
+SYM_FUNC_END(sm4_ce_ccm_dec)
diff --git a/arch/arm64/crypto/sm4-ce-ccm-glue.c b/arch/arm64/crypto/sm4-ce-ccm-glue.c
new file mode 100644
index 000000000000..f2cec7b52efc
--- /dev/null
+++ b/arch/arm64/crypto/sm4-ce-ccm-glue.c
@@ -0,0 +1,303 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * SM4-CCM AEAD Algorithm using ARMv8 Crypto Extensions
+ * as specified in rfc8998
+ * https://datatracker.ietf.org/doc/html/rfc8998
+ *
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#include <linux/module.h>
+#include <linux/crypto.h>
+#include <linux/kernel.h>
+#include <linux/cpufeature.h>
+#include <asm/neon.h>
+#include <crypto/scatterwalk.h>
+#include <crypto/internal/aead.h>
+#include <crypto/internal/skcipher.h>
+#include <crypto/sm4.h>
+#include "sm4-ce.h"
+
+asmlinkage void sm4_ce_cbcmac_update(const u32 *rkey_enc, u8 *mac,
+ const u8 *src, unsigned int nblocks);
+asmlinkage void sm4_ce_ccm_enc(const u32 *rkey_enc, u8 *dst, const u8 *src,
+ u8 *iv, unsigned int nbytes, u8 *mac);
+asmlinkage void sm4_ce_ccm_dec(const u32 *rkey_enc, u8 *dst, const u8 *src,
+ u8 *iv, unsigned int nbytes, u8 *mac);
+asmlinkage void sm4_ce_ccm_final(const u32 *rkey_enc, u8 *iv, u8 *mac);
+
+
+static int ccm_setkey(struct crypto_aead *tfm, const u8 *key,
+ unsigned int key_len)
+{
+ struct sm4_ctx *ctx = crypto_aead_ctx(tfm);
+
+ if (key_len != SM4_KEY_SIZE)
+ return -EINVAL;
+
+ kernel_neon_begin();
+ sm4_ce_expand_key(key, ctx->rkey_enc, ctx->rkey_dec,
+ crypto_sm4_fk, crypto_sm4_ck);
+ kernel_neon_end();
+
+ return 0;
+}
+
+static int ccm_setauthsize(struct crypto_aead *tfm, unsigned int authsize)
+{
+ if ((authsize & 1) || authsize < 4)
+ return -EINVAL;
+ return 0;
+}
+
+static int ccm_format_input(u8 info[], struct aead_request *req,
+ unsigned int msglen)
+{
+ struct crypto_aead *aead = crypto_aead_reqtfm(req);
+ unsigned int l = req->iv[0] + 1;
+ unsigned int m;
+ __be32 len;
+
+ /* verify that CCM dimension 'L': 2 <= L <= 8 */
+ if (l < 2 || l > 8)
+ return -EINVAL;
+ if (l < 4 && msglen >> (8 * l))
+ return -EOVERFLOW;
+
+ memset(&req->iv[SM4_BLOCK_SIZE - l], 0, l);
+
+ memcpy(info, req->iv, SM4_BLOCK_SIZE);
+
+ m = crypto_aead_authsize(aead);
+
+ /* format flags field per RFC 3610/NIST 800-38C */
+ *info |= ((m - 2) / 2) << 3;
+ if (req->assoclen)
+ *info |= (1 << 6);
+
+ /*
+ * format message length field,
+ * Linux uses a u32 type to represent msglen
+ */
+ if (l >= 4)
+ l = 4;
+
+ len = cpu_to_be32(msglen);
+ memcpy(&info[SM4_BLOCK_SIZE - l], (u8 *)&len + 4 - l, l);
+
+ return 0;
+}
+
+static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[])
+{
+ struct crypto_aead *aead = crypto_aead_reqtfm(req);
+ struct sm4_ctx *ctx = crypto_aead_ctx(aead);
+ struct __packed { __be16 l; __be32 h; } aadlen;
+ u32 assoclen = req->assoclen;
+ struct scatter_walk walk;
+ unsigned int len;
+
+ if (assoclen < 0xff00) {
+ aadlen.l = cpu_to_be16(assoclen);
+ len = 2;
+ } else {
+ aadlen.l = cpu_to_be16(0xfffe);
+ put_unaligned_be32(assoclen, &aadlen.h);
+ len = 6;
+ }
+
+ sm4_ce_crypt_block(ctx->rkey_enc, mac, mac);
+ crypto_xor(mac, (const u8 *)&aadlen, len);
+
+ scatterwalk_start(&walk, req->src);
+
+ do {
+ u32 n = scatterwalk_clamp(&walk, assoclen);
+ u8 *p, *ptr;
+
+ if (!n) {
+ scatterwalk_start(&walk, sg_next(walk.sg));
+ n = scatterwalk_clamp(&walk, assoclen);
+ }
+
+ p = ptr = scatterwalk_map(&walk);
+ assoclen -= n;
+ scatterwalk_advance(&walk, n);
+
+ while (n > 0) {
+ unsigned int l, nblocks;
+
+ if (len == SM4_BLOCK_SIZE) {
+ if (n < SM4_BLOCK_SIZE) {
+ sm4_ce_crypt_block(ctx->rkey_enc,
+ mac, mac);
+
+ len = 0;
+ } else {
+ nblocks = n / SM4_BLOCK_SIZE;
+ sm4_ce_cbcmac_update(ctx->rkey_enc,
+ mac, ptr, nblocks);
+
+ ptr += nblocks * SM4_BLOCK_SIZE;
+ n %= SM4_BLOCK_SIZE;
+
+ continue;
+ }
+ }
+
+ l = min(n, SM4_BLOCK_SIZE - len);
+ if (l) {
+ crypto_xor(mac + len, ptr, l);
+ len += l;
+ ptr += l;
+ n -= l;
+ }
+ }
+
+ scatterwalk_unmap(p);
+ scatterwalk_done(&walk, 0, assoclen);
+ } while (assoclen);
+}
+
+static int ccm_crypt(struct aead_request *req, struct skcipher_walk *walk,
+ u32 *rkey_enc, u8 mac[],
+ void (*sm4_ce_ccm_crypt)(const u32 *rkey_enc, u8 *dst,
+ const u8 *src, u8 *iv,
+ unsigned int nbytes, u8 *mac))
+{
+ u8 __aligned(8) ctr0[SM4_BLOCK_SIZE];
+ int err;
+
+ /* preserve the initial ctr0 for the TAG */
+ memcpy(ctr0, walk->iv, SM4_BLOCK_SIZE);
+ crypto_inc(walk->iv, SM4_BLOCK_SIZE);
+
+ kernel_neon_begin();
+
+ if (req->assoclen)
+ ccm_calculate_auth_mac(req, mac);
+
+ do {
+ unsigned int tail = walk->nbytes % SM4_BLOCK_SIZE;
+ const u8 *src = walk->src.virt.addr;
+ u8 *dst = walk->dst.virt.addr;
+
+ if (walk->nbytes == walk->total)
+ tail = 0;
+
+ if (walk->nbytes - tail)
+ sm4_ce_ccm_crypt(rkey_enc, dst, src, walk->iv,
+ walk->nbytes - tail, mac);
+
+ if (walk->nbytes == walk->total)
+ sm4_ce_ccm_final(rkey_enc, ctr0, mac);
+
+ kernel_neon_end();
+
+ if (walk->nbytes) {
+ err = skcipher_walk_done(walk, tail);
+ if (err)
+ return err;
+ if (walk->nbytes)
+ kernel_neon_begin();
+ }
+ } while (walk->nbytes > 0);
+
+ return 0;
+}
+
+static int ccm_encrypt(struct aead_request *req)
+{
+ struct crypto_aead *aead = crypto_aead_reqtfm(req);
+ struct sm4_ctx *ctx = crypto_aead_ctx(aead);
+ u8 __aligned(8) mac[SM4_BLOCK_SIZE];
+ struct skcipher_walk walk;
+ int err;
+
+ err = ccm_format_input(mac, req, req->cryptlen);
+ if (err)
+ return err;
+
+ err = skcipher_walk_aead_encrypt(&walk, req, false);
+ if (err)
+ return err;
+
+ err = ccm_crypt(req, &walk, ctx->rkey_enc, mac, sm4_ce_ccm_enc);
+ if (err)
+ return err;
+
+ /* copy authtag to end of dst */
+ scatterwalk_map_and_copy(mac, req->dst, req->assoclen + req->cryptlen,
+ crypto_aead_authsize(aead), 1);
+
+ return 0;
+}
+
+static int ccm_decrypt(struct aead_request *req)
+{
+ struct crypto_aead *aead = crypto_aead_reqtfm(req);
+ unsigned int authsize = crypto_aead_authsize(aead);
+ struct sm4_ctx *ctx = crypto_aead_ctx(aead);
+ u8 __aligned(8) mac[SM4_BLOCK_SIZE];
+ u8 authtag[SM4_BLOCK_SIZE];
+ struct skcipher_walk walk;
+ int err;
+
+ err = ccm_format_input(mac, req, req->cryptlen - authsize);
+ if (err)
+ return err;
+
+ err = skcipher_walk_aead_decrypt(&walk, req, false);
+ if (err)
+ return err;
+
+ err = ccm_crypt(req, &walk, ctx->rkey_enc, mac, sm4_ce_ccm_dec);
+ if (err)
+ return err;
+
+ /* compare calculated auth tag with the stored one */
+ scatterwalk_map_and_copy(authtag, req->src,
+ req->assoclen + req->cryptlen - authsize,
+ authsize, 0);
+
+ if (crypto_memneq(authtag, mac, authsize))
+ return -EBADMSG;
+
+ return 0;
+}
+
+static struct aead_alg sm4_ccm_alg = {
+ .base = {
+ .cra_name = "ccm(sm4)",
+ .cra_driver_name = "ccm-sm4-ce",
+ .cra_priority = 400,
+ .cra_blocksize = 1,
+ .cra_ctxsize = sizeof(struct sm4_ctx),
+ .cra_module = THIS_MODULE,
+ },
+ .ivsize = SM4_BLOCK_SIZE,
+ .chunksize = SM4_BLOCK_SIZE,
+ .maxauthsize = SM4_BLOCK_SIZE,
+ .setkey = ccm_setkey,
+ .setauthsize = ccm_setauthsize,
+ .encrypt = ccm_encrypt,
+ .decrypt = ccm_decrypt,
+};
+
+static int __init sm4_ce_ccm_init(void)
+{
+ return crypto_register_aead(&sm4_ccm_alg);
+}
+
+static void __exit sm4_ce_ccm_exit(void)
+{
+ crypto_unregister_aead(&sm4_ccm_alg);
+}
+
+module_cpu_feature_match(SM4, sm4_ce_ccm_init);
+module_exit(sm4_ce_ccm_exit);
+
+MODULE_DESCRIPTION("Synchronous SM4 in CCM mode using ARMv8 Crypto Extensions");
+MODULE_ALIAS_CRYPTO("ccm(sm4)");
+MODULE_AUTHOR("Tianjia Zhang <tianjia.zhang@linux.alibaba.com>");
+MODULE_LICENSE("GPL v2");
--
2.24.3 (Apple Git-128)
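The counter handling in sm4_ce_ccm_enc()/sm4_ce_ccm_dec() above keeps
the 128-bit big-endian counter as two host-order 64-bit halves (x7 and
x8). The inc_le128() macro emits the current value as a counter block
and then increments the pair with carry. A rough C model of that
behaviour, with hypothetical names (struct ctr128, ctr128_load(),
ctr128_next(), ctr128_store()):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct ctr128 {
	uint64_t hi;	/* corresponds to x7 in the assembly */
	uint64_t lo;	/* corresponds to x8 in the assembly */
};

static uint64_t load_be64(const uint8_t *p)
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

static void store_be64(uint8_t *p, uint64_t v)
{
	int i;

	for (i = 7; i >= 0; i--) {
		p[i] = (uint8_t)v;
		v >>= 8;
	}
}

/* Split a big-endian counter block into two host-order halves. */
static void ctr128_load(struct ctr128 *c, const uint8_t be[16])
{
	c->hi = load_be64(be);
	c->lo = load_be64(be + 8);
}

/* Write the counter back in big-endian form. */
static void ctr128_store(const struct ctr128 *c, uint8_t be[16])
{
	store_be64(be, c->hi);
	store_be64(be + 8, c->lo);
}

/* Emit the current counter block, then increment with carry. */
static void ctr128_next(struct ctr128 *c, uint8_t out[16])
{
	ctr128_store(c, out);
	c->lo++;
	if (c->lo == 0)
		c->hi++;	/* carry into the upper half */
}
```

Note that the assembly only writes the counter back on the full-block
exit path (.Lccm_enc_end / .Lccm_dec_end), not after a partial tail.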
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH 14/16] crypto: arm64/sm4 - add CE implementation for CCM mode
@ 2022-09-26 9:36 ` Tianjia Zhang
0 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26 9:36 UTC (permalink / raw)
To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
linux-stm32
This patch adds a CE-optimized assembly implementation of SM4 in CCM mode.
Benchmarked on a T-Head Yitian-710 at 2.75 GHz with tcrypt modes 223 and
225, comparing performance before and after this patch (before this patch
the driver in use was ccm_base(ctr-sm4-ce,cbcmac-sm4-ce)). The columns are
block sizes in bytes and the values are throughput in Mb/s:
Before (rfc4309(ccm_base(ctr-sm4-ce,cbcmac-sm4-ce))):
ccm(sm4) | 16 64 256 512 1024 1420 4096 8192
-------------+---------------------------------------------------------------
CCM enc | 35.07 125.40 336.47 468.17 581.97 619.18 712.56 736.01
CCM dec | 34.87 124.40 335.08 466.75 581.04 618.81 712.25 735.89
CCM mb enc | 34.71 123.96 333.92 465.39 579.91 617.49 711.45 734.92
CCM mb dec | 34.42 122.80 331.02 462.81 578.28 616.42 709.88 734.19
After (rfc4309(ccm-sm4-ce)):
ccm-sm4-ce | 16 64 256 512 1024 1420 4096 8192
-------------+---------------------------------------------------------------
CCM enc | 77.12 249.82 569.94 725.17 839.27 867.71 952.87 969.89
CCM dec | 75.90 247.26 566.29 722.12 836.90 865.95 951.74 968.57
CCM mb enc | 75.98 245.25 562.91 718.99 834.76 864.70 950.17 967.90
CCM mb dec | 75.06 243.78 560.58 717.13 833.68 862.70 949.35 967.11
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
arch/arm64/crypto/Kconfig | 16 ++
arch/arm64/crypto/Makefile | 3 +
arch/arm64/crypto/sm4-ce-ccm-core.S | 328 ++++++++++++++++++++++++++++
arch/arm64/crypto/sm4-ce-ccm-glue.c | 303 +++++++++++++++++++++++++
4 files changed, 650 insertions(+)
create mode 100644 arch/arm64/crypto/sm4-ce-ccm-core.S
create mode 100644 arch/arm64/crypto/sm4-ce-ccm-glue.c
diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index 8939f5ae9214..2611036a3e3f 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -281,6 +281,22 @@ config CRYPTO_AES_ARM64_CE_CCM
- ARMv8 Crypto Extensions
- NEON (Advanced SIMD) extensions
+config CRYPTO_SM4_ARM64_CE_CCM
+ tristate "AEAD cipher: SM4 in CCM mode (ARMv8 Crypto Extensions)"
+ depends on KERNEL_MODE_NEON
+ select CRYPTO_ALGAPI
+ select CRYPTO_AEAD
+ select CRYPTO_SM4
+ select CRYPTO_SM4_ARM64_CE_BLK
+ help
+ AEAD cipher: SM4 cipher algorithms (OSCCA GB/T 32907-2016) with
+ CCM (Counter with Cipher Block Chaining-Message Authentication Code)
+ authenticated encryption mode (NIST SP800-38C)
+
+ Architecture: arm64 using:
+ - ARMv8 Crypto Extensions
+ - NEON (Advanced SIMD) extensions
+
config CRYPTO_CRCT10DIF_ARM64_CE
tristate "CRCT10DIF (PMULL)"
depends on KERNEL_MODE_NEON && CRC_T10DIF
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index 087f1625e775..843ea5266965 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -29,6 +29,9 @@ sm4-ce-cipher-y := sm4-ce-cipher-glue.o sm4-ce-cipher-core.o
obj-$(CONFIG_CRYPTO_SM4_ARM64_CE_BLK) += sm4-ce.o
sm4-ce-y := sm4-ce-glue.o sm4-ce-core.o
+obj-$(CONFIG_CRYPTO_SM4_ARM64_CE_CCM) += sm4-ce-ccm.o
+sm4-ce-ccm-y := sm4-ce-ccm-glue.o sm4-ce-ccm-core.o
+
obj-$(CONFIG_CRYPTO_SM4_ARM64_NEON_BLK) += sm4-neon.o
sm4-neon-y := sm4-neon-glue.o sm4-neon-core.o
diff --git a/arch/arm64/crypto/sm4-ce-ccm-core.S b/arch/arm64/crypto/sm4-ce-ccm-core.S
new file mode 100644
index 000000000000..028207c4afd0
--- /dev/null
+++ b/arch/arm64/crypto/sm4-ce-ccm-core.S
@@ -0,0 +1,328 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * SM4-CCM AEAD Algorithm using ARMv8 Crypto Extensions
+ * as specified in rfc8998
+ * https://datatracker.ietf.org/doc/html/rfc8998
+ *
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+#include "sm4-ce-asm.h"
+
+.arch armv8-a+crypto
+
+.irp b, 0, 1, 8, 9, 10, 11, 12, 13, 14, 15, 16, 24, 25, 26, 27, 28, 29, 30, 31
+ .set .Lv\b\().4s, \b
+.endr
+
+.macro sm4e, vd, vn
+ .inst 0xcec08400 | (.L\vn << 5) | .L\vd
+.endm
+
+/* Register macros */
+
+#define RMAC v16
+
+/* Helper macros. */
+
+#define inc_le128(vctr) \
+ mov vctr.d[1], x8; \
+ mov vctr.d[0], x7; \
+ adds x8, x8, #1; \
+ rev64 vctr.16b, vctr.16b; \
+ adc x7, x7, xzr;
+
+
+.align 3
+SYM_FUNC_START(sm4_ce_cbcmac_update)
+ /* input:
+ * x0: round key array, CTX
+ * x1: mac
+ * x2: src
+ * w3: nblocks
+ */
+ SM4_PREPARE(x0)
+
+ ld1 {RMAC.16b}, [x1]
+
+.Lcbcmac_loop_4x:
+ cmp w3, #4
+ blt .Lcbcmac_loop_1x
+
+ sub w3, w3, #4
+
+ ld1 {v0.16b-v3.16b}, [x2], #64
+
+ SM4_CRYPT_BLK(RMAC)
+ eor RMAC.16b, RMAC.16b, v0.16b
+ SM4_CRYPT_BLK(RMAC)
+ eor RMAC.16b, RMAC.16b, v1.16b
+ SM4_CRYPT_BLK(RMAC)
+ eor RMAC.16b, RMAC.16b, v2.16b
+ SM4_CRYPT_BLK(RMAC)
+ eor RMAC.16b, RMAC.16b, v3.16b
+
+ cbz w3, .Lcbcmac_end
+ b .Lcbcmac_loop_4x
+
+.Lcbcmac_loop_1x:
+ sub w3, w3, #1
+
+ ld1 {v0.16b}, [x2], #16
+
+ SM4_CRYPT_BLK(RMAC)
+ eor RMAC.16b, RMAC.16b, v0.16b
+
+ cbnz w3, .Lcbcmac_loop_1x
+
+.Lcbcmac_end:
+ st1 {RMAC.16b}, [x1]
+ ret
+SYM_FUNC_END(sm4_ce_cbcmac_update)
+
+.align 3
+SYM_FUNC_START(sm4_ce_ccm_final)
+ /* input:
+ * x0: round key array, CTX
+ * x1: ctr0 (big endian, 128 bit)
+ * x2: mac
+ */
+ SM4_PREPARE(x0)
+
+ ld1 {RMAC.16b}, [x2]
+ ld1 {v0.16b}, [x1]
+
+ SM4_CRYPT_BLK2(RMAC, v0)
+
+ /* en-/decrypt the mac with ctr0 */
+ eor RMAC.16b, RMAC.16b, v0.16b
+ st1 {RMAC.16b}, [x2]
+
+ ret
+SYM_FUNC_END(sm4_ce_ccm_final)
+
+.align 3
+SYM_FUNC_START(sm4_ce_ccm_enc)
+ /* input:
+ * x0: round key array, CTX
+ * x1: dst
+ * x2: src
+ * x3: ctr (big endian, 128 bit)
+ * w4: nbytes
+ * x5: mac
+ */
+ SM4_PREPARE(x0)
+
+ ldp x7, x8, [x3]
+ rev x7, x7
+ rev x8, x8
+
+ ld1 {RMAC.16b}, [x5]
+
+.Lccm_enc_loop_4x:
+ cmp w4, #(4 * 16)
+ blt .Lccm_enc_loop_1x
+
+ sub w4, w4, #(4 * 16)
+
+ /* construct CTRs */
+ inc_le128(v8) /* +0 */
+ inc_le128(v9) /* +1 */
+ inc_le128(v10) /* +2 */
+ inc_le128(v11) /* +3 */
+
+ ld1 {v0.16b-v3.16b}, [x2], #64
+
+ SM4_CRYPT_BLK2(v8, RMAC)
+ eor v8.16b, v8.16b, v0.16b
+ eor RMAC.16b, RMAC.16b, v0.16b
+ SM4_CRYPT_BLK2(v9, RMAC)
+ eor v9.16b, v9.16b, v1.16b
+ eor RMAC.16b, RMAC.16b, v1.16b
+ SM4_CRYPT_BLK2(v10, RMAC)
+ eor v10.16b, v10.16b, v2.16b
+ eor RMAC.16b, RMAC.16b, v2.16b
+ SM4_CRYPT_BLK2(v11, RMAC)
+ eor v11.16b, v11.16b, v3.16b
+ eor RMAC.16b, RMAC.16b, v3.16b
+
+ st1 {v8.16b-v11.16b}, [x1], #64
+
+ cbz w4, .Lccm_enc_end
+ b .Lccm_enc_loop_4x
+
+.Lccm_enc_loop_1x:
+ cmp w4, #16
+ blt .Lccm_enc_tail
+
+ sub w4, w4, #16
+
+ /* construct CTRs */
+ inc_le128(v8)
+
+ ld1 {v0.16b}, [x2], #16
+
+ SM4_CRYPT_BLK2(v8, RMAC)
+ eor v8.16b, v8.16b, v0.16b
+ eor RMAC.16b, RMAC.16b, v0.16b
+
+ st1 {v8.16b}, [x1], #16
+
+ cbz w4, .Lccm_enc_end
+ b .Lccm_enc_loop_1x
+
+.Lccm_enc_tail:
+ /* construct CTRs */
+ inc_le128(v8)
+
+ SM4_CRYPT_BLK2(RMAC, v8)
+
+ /* store new MAC */
+ st1 {RMAC.16b}, [x5]
+
+.Lccm_enc_tail_loop:
+ ldrb w0, [x2], #1 /* get 1 byte from input */
+ umov w9, v8.b[0] /* get top crypted CTR byte */
+ umov w6, RMAC.b[0] /* get top MAC byte */
+
+ eor w9, w9, w0 /* w9 = CTR ^ input */
+ eor w6, w6, w0 /* w6 = MAC ^ input */
+
+ strb w9, [x1], #1 /* store out byte */
+ strb w6, [x5], #1 /* store MAC byte */
+
+ subs w4, w4, #1
+ beq .Lccm_enc_ret
+
+ /* shift out one byte */
+ ext RMAC.16b, RMAC.16b, RMAC.16b, #1
+ ext v8.16b, v8.16b, v8.16b, #1
+
+ b .Lccm_enc_tail_loop
+
+.Lccm_enc_end:
+ /* store new MAC */
+ st1 {RMAC.16b}, [x5]
+
+ /* store new CTR */
+ rev x7, x7
+ rev x8, x8
+ stp x7, x8, [x3]
+
+.Lccm_enc_ret:
+ ret
+SYM_FUNC_END(sm4_ce_ccm_enc)
+
+.align 3
+SYM_FUNC_START(sm4_ce_ccm_dec)
+ /* input:
+ * x0: round key array, CTX
+ * x1: dst
+ * x2: src
+ * x3: ctr (big endian, 128 bit)
+ * w4: nbytes
+ * x5: mac
+ */
+ SM4_PREPARE(x0)
+
+ ldp x7, x8, [x3]
+ rev x7, x7
+ rev x8, x8
+
+ ld1 {RMAC.16b}, [x5]
+
+.Lccm_dec_loop_4x:
+ cmp w4, #(4 * 16)
+ blt .Lccm_dec_loop_1x
+
+ sub w4, w4, #(4 * 16)
+
+ /* construct CTRs */
+ inc_le128(v8) /* +0 */
+ inc_le128(v9) /* +1 */
+ inc_le128(v10) /* +2 */
+ inc_le128(v11) /* +3 */
+
+ ld1 {v0.16b-v3.16b}, [x2], #64
+
+ SM4_CRYPT_BLK2(v8, RMAC)
+ eor v8.16b, v8.16b, v0.16b
+ eor RMAC.16b, RMAC.16b, v8.16b
+ SM4_CRYPT_BLK2(v9, RMAC)
+ eor v9.16b, v9.16b, v1.16b
+ eor RMAC.16b, RMAC.16b, v9.16b
+ SM4_CRYPT_BLK2(v10, RMAC)
+ eor v10.16b, v10.16b, v2.16b
+ eor RMAC.16b, RMAC.16b, v10.16b
+ SM4_CRYPT_BLK2(v11, RMAC)
+ eor v11.16b, v11.16b, v3.16b
+ eor RMAC.16b, RMAC.16b, v11.16b
+
+ st1 {v8.16b-v11.16b}, [x1], #64
+
+ cbz w4, .Lccm_dec_end
+ b .Lccm_dec_loop_4x
+
+.Lccm_dec_loop_1x:
+ cmp w4, #16
+ blt .Lccm_dec_tail
+
+ sub w4, w4, #16
+
+ /* construct CTRs */
+ inc_le128(v8)
+
+ ld1 {v0.16b}, [x2], #16
+
+ SM4_CRYPT_BLK2(v8, RMAC)
+ eor v8.16b, v8.16b, v0.16b
+ eor RMAC.16b, RMAC.16b, v8.16b
+
+ st1 {v8.16b}, [x1], #16
+
+ cbz w4, .Lccm_dec_end
+ b .Lccm_dec_loop_1x
+
+.Lccm_dec_tail:
+ /* construct CTRs */
+ inc_le128(v8)
+
+ SM4_CRYPT_BLK2(RMAC, v8)
+
+ /* store new MAC */
+ st1 {RMAC.16b}, [x5]
+
+.Lccm_dec_tail_loop:
+ ldrb w0, [x2], #1 /* get 1 byte from input */
+ umov w9, v8.b[0] /* get top crypted CTR byte */
+ umov w6, RMAC.b[0] /* get top MAC byte */
+
+ eor w9, w9, w0 /* w9 = CTR ^ input */
+ eor w6, w6, w9 /* w6 = MAC ^ output */
+
+ strb w9, [x1], #1 /* store out byte */
+ strb w6, [x5], #1 /* store MAC byte */
+
+ subs w4, w4, #1
+ beq .Lccm_dec_ret
+
+ /* shift out one byte */
+ ext RMAC.16b, RMAC.16b, RMAC.16b, #1
+ ext v8.16b, v8.16b, v8.16b, #1
+
+ b .Lccm_dec_tail_loop
+
+.Lccm_dec_end:
+ /* store new MAC */
+ st1 {RMAC.16b}, [x5]
+
+ /* store new CTR */
+ rev x7, x7
+ rev x8, x8
+ stp x7, x8, [x3]
+
+.Lccm_dec_ret:
+ ret
+SYM_FUNC_END(sm4_ce_ccm_dec)
diff --git a/arch/arm64/crypto/sm4-ce-ccm-glue.c b/arch/arm64/crypto/sm4-ce-ccm-glue.c
new file mode 100644
index 000000000000..f2cec7b52efc
--- /dev/null
+++ b/arch/arm64/crypto/sm4-ce-ccm-glue.c
@@ -0,0 +1,303 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * SM4-CCM AEAD Algorithm using ARMv8 Crypto Extensions
+ * as specified in rfc8998
+ * https://datatracker.ietf.org/doc/html/rfc8998
+ *
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#include <linux/module.h>
+#include <linux/crypto.h>
+#include <linux/kernel.h>
+#include <linux/cpufeature.h>
+#include <asm/neon.h>
+#include <crypto/scatterwalk.h>
+#include <crypto/internal/aead.h>
+#include <crypto/internal/skcipher.h>
+#include <crypto/sm4.h>
+#include "sm4-ce.h"
+
+asmlinkage void sm4_ce_cbcmac_update(const u32 *rkey_enc, u8 *mac,
+ const u8 *src, unsigned int nblocks);
+asmlinkage void sm4_ce_ccm_enc(const u32 *rkey_enc, u8 *dst, const u8 *src,
+ u8 *iv, unsigned int nbytes, u8 *mac);
+asmlinkage void sm4_ce_ccm_dec(const u32 *rkey_enc, u8 *dst, const u8 *src,
+ u8 *iv, unsigned int nbytes, u8 *mac);
+asmlinkage void sm4_ce_ccm_final(const u32 *rkey_enc, u8 *iv, u8 *mac);
+
+
+static int ccm_setkey(struct crypto_aead *tfm, const u8 *key,
+ unsigned int key_len)
+{
+ struct sm4_ctx *ctx = crypto_aead_ctx(tfm);
+
+ if (key_len != SM4_KEY_SIZE)
+ return -EINVAL;
+
+ kernel_neon_begin();
+ sm4_ce_expand_key(key, ctx->rkey_enc, ctx->rkey_dec,
+ crypto_sm4_fk, crypto_sm4_ck);
+ kernel_neon_end();
+
+ return 0;
+}
+
+static int ccm_setauthsize(struct crypto_aead *tfm, unsigned int authsize)
+{
+ if ((authsize & 1) || authsize < 4)
+ return -EINVAL;
+ return 0;
+}
+
+static int ccm_format_input(u8 info[], struct aead_request *req,
+ unsigned int msglen)
+{
+ struct crypto_aead *aead = crypto_aead_reqtfm(req);
+ unsigned int l = req->iv[0] + 1;
+ unsigned int m;
+ __be32 len;
+
+ /* verify that CCM dimension 'L': 2 <= L <= 8 */
+ if (l < 2 || l > 8)
+ return -EINVAL;
+ if (l < 4 && msglen >> (8 * l))
+ return -EOVERFLOW;
+
+ memset(&req->iv[SM4_BLOCK_SIZE - l], 0, l);
+
+ memcpy(info, req->iv, SM4_BLOCK_SIZE);
+
+ m = crypto_aead_authsize(aead);
+
+ /* format flags field per RFC 3610/NIST 800-38C */
+ *info |= ((m - 2) / 2) << 3;
+ if (req->assoclen)
+ *info |= (1 << 6);
+
+ /*
+ * format message length field,
+ * Linux uses a u32 type to represent msglen
+ */
+ if (l >= 4)
+ l = 4;
+
+ len = cpu_to_be32(msglen);
+ memcpy(&info[SM4_BLOCK_SIZE - l], (u8 *)&len + 4 - l, l);
+
+ return 0;
+}
+
+static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[])
+{
+ struct crypto_aead *aead = crypto_aead_reqtfm(req);
+ struct sm4_ctx *ctx = crypto_aead_ctx(aead);
+ struct __packed { __be16 l; __be32 h; } aadlen;
+ u32 assoclen = req->assoclen;
+ struct scatter_walk walk;
+ unsigned int len;
+
+ if (assoclen < 0xff00) {
+ aadlen.l = cpu_to_be16(assoclen);
+ len = 2;
+ } else {
+ aadlen.l = cpu_to_be16(0xfffe);
+ put_unaligned_be32(assoclen, &aadlen.h);
+ len = 6;
+ }
+
+ sm4_ce_crypt_block(ctx->rkey_enc, mac, mac);
+ crypto_xor(mac, (const u8 *)&aadlen, len);
+
+ scatterwalk_start(&walk, req->src);
+
+ do {
+ u32 n = scatterwalk_clamp(&walk, assoclen);
+ u8 *p, *ptr;
+
+ if (!n) {
+ scatterwalk_start(&walk, sg_next(walk.sg));
+ n = scatterwalk_clamp(&walk, assoclen);
+ }
+
+ p = ptr = scatterwalk_map(&walk);
+ assoclen -= n;
+ scatterwalk_advance(&walk, n);
+
+ while (n > 0) {
+ unsigned int l, nblocks;
+
+ if (len == SM4_BLOCK_SIZE) {
+ if (n < SM4_BLOCK_SIZE) {
+ sm4_ce_crypt_block(ctx->rkey_enc,
+ mac, mac);
+
+ len = 0;
+ } else {
+ nblocks = n / SM4_BLOCK_SIZE;
+ sm4_ce_cbcmac_update(ctx->rkey_enc,
+ mac, ptr, nblocks);
+
+ ptr += nblocks * SM4_BLOCK_SIZE;
+ n %= SM4_BLOCK_SIZE;
+
+ continue;
+ }
+ }
+
+ l = min(n, SM4_BLOCK_SIZE - len);
+ if (l) {
+ crypto_xor(mac + len, ptr, l);
+ len += l;
+ ptr += l;
+ n -= l;
+ }
+ }
+
+ scatterwalk_unmap(p);
+ scatterwalk_done(&walk, 0, assoclen);
+ } while (assoclen);
+}
+
+static int ccm_crypt(struct aead_request *req, struct skcipher_walk *walk,
+ u32 *rkey_enc, u8 mac[],
+ void (*sm4_ce_ccm_crypt)(const u32 *rkey_enc, u8 *dst,
+ const u8 *src, u8 *iv,
+ unsigned int nbytes, u8 *mac))
+{
+ u8 __aligned(8) ctr0[SM4_BLOCK_SIZE];
+ int err;
+
+ /* preserve the initial ctr0 for the TAG */
+ memcpy(ctr0, walk->iv, SM4_BLOCK_SIZE);
+ crypto_inc(walk->iv, SM4_BLOCK_SIZE);
+
+ kernel_neon_begin();
+
+ if (req->assoclen)
+ ccm_calculate_auth_mac(req, mac);
+
+ do {
+ unsigned int tail = walk->nbytes % SM4_BLOCK_SIZE;
+ const u8 *src = walk->src.virt.addr;
+ u8 *dst = walk->dst.virt.addr;
+
+ if (walk->nbytes == walk->total)
+ tail = 0;
+
+ if (walk->nbytes - tail)
+ sm4_ce_ccm_crypt(rkey_enc, dst, src, walk->iv,
+ walk->nbytes - tail, mac);
+
+ if (walk->nbytes == walk->total)
+ sm4_ce_ccm_final(rkey_enc, ctr0, mac);
+
+ kernel_neon_end();
+
+ if (walk->nbytes) {
+ err = skcipher_walk_done(walk, tail);
+ if (err)
+ return err;
+ if (walk->nbytes)
+ kernel_neon_begin();
+ }
+ } while (walk->nbytes > 0);
+
+ return 0;
+}
+
+static int ccm_encrypt(struct aead_request *req)
+{
+ struct crypto_aead *aead = crypto_aead_reqtfm(req);
+ struct sm4_ctx *ctx = crypto_aead_ctx(aead);
+ u8 __aligned(8) mac[SM4_BLOCK_SIZE];
+ struct skcipher_walk walk;
+ int err;
+
+ err = ccm_format_input(mac, req, req->cryptlen);
+ if (err)
+ return err;
+
+ err = skcipher_walk_aead_encrypt(&walk, req, false);
+ if (err)
+ return err;
+
+ err = ccm_crypt(req, &walk, ctx->rkey_enc, mac, sm4_ce_ccm_enc);
+ if (err)
+ return err;
+
+ /* copy authtag to end of dst */
+ scatterwalk_map_and_copy(mac, req->dst, req->assoclen + req->cryptlen,
+ crypto_aead_authsize(aead), 1);
+
+ return 0;
+}
+
+static int ccm_decrypt(struct aead_request *req)
+{
+ struct crypto_aead *aead = crypto_aead_reqtfm(req);
+ unsigned int authsize = crypto_aead_authsize(aead);
+ struct sm4_ctx *ctx = crypto_aead_ctx(aead);
+ u8 __aligned(8) mac[SM4_BLOCK_SIZE];
+ u8 authtag[SM4_BLOCK_SIZE];
+ struct skcipher_walk walk;
+ int err;
+
+ err = ccm_format_input(mac, req, req->cryptlen - authsize);
+ if (err)
+ return err;
+
+ err = skcipher_walk_aead_decrypt(&walk, req, false);
+ if (err)
+ return err;
+
+ err = ccm_crypt(req, &walk, ctx->rkey_enc, mac, sm4_ce_ccm_dec);
+ if (err)
+ return err;
+
+ /* compare calculated auth tag with the stored one */
+ scatterwalk_map_and_copy(authtag, req->src,
+ req->assoclen + req->cryptlen - authsize,
+ authsize, 0);
+
+ if (crypto_memneq(authtag, mac, authsize))
+ return -EBADMSG;
+
+ return 0;
+}
+
+static struct aead_alg sm4_ccm_alg = {
+ .base = {
+ .cra_name = "ccm(sm4)",
+ .cra_driver_name = "ccm-sm4-ce",
+ .cra_priority = 400,
+ .cra_blocksize = 1,
+ .cra_ctxsize = sizeof(struct sm4_ctx),
+ .cra_module = THIS_MODULE,
+ },
+ .ivsize = SM4_BLOCK_SIZE,
+ .chunksize = SM4_BLOCK_SIZE,
+ .maxauthsize = SM4_BLOCK_SIZE,
+ .setkey = ccm_setkey,
+ .setauthsize = ccm_setauthsize,
+ .encrypt = ccm_encrypt,
+ .decrypt = ccm_decrypt,
+};
+
+static int __init sm4_ce_ccm_init(void)
+{
+ return crypto_register_aead(&sm4_ccm_alg);
+}
+
+static void __exit sm4_ce_ccm_exit(void)
+{
+ crypto_unregister_aead(&sm4_ccm_alg);
+}
+
+module_cpu_feature_match(SM4, sm4_ce_ccm_init);
+module_exit(sm4_ce_ccm_exit);
+
+MODULE_DESCRIPTION("Synchronous SM4 in CCM mode using ARMv8 Crypto Extensions");
+MODULE_ALIAS_CRYPTO("ccm(sm4)");
+MODULE_AUTHOR("Tianjia Zhang <tianjia.zhang@linux.alibaba.com>");
+MODULE_LICENSE("GPL v2");
--
2.24.3 (Apple Git-128)
* [PATCH 15/16] crypto: arm64/sm4 - add CE implementation for GCM mode
2022-09-26 9:36 ` Tianjia Zhang
@ 2022-09-26 9:36 ` Tianjia Zhang
-1 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26 9:36 UTC (permalink / raw)
To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
linux-stm32
This patch adds a CE-optimized assembly implementation of GCM mode.
Benchmarked on a T-Head Yitian-710 at 2.75 GHz. The data comes from the
224 and 224 modes of tcrypt and compares the performance before and after
this patch (the driver used before this patch is
gcm_base(ctr-sm4-ce,ghash-generic)).
The columns are block lengths in bytes and the throughput is given in Mb/s:
Before (gcm_base(ctr-sm4-ce,ghash-generic)):
gcm(sm4)     |      16      64     256     512    1024    1420    4096    8192
-------------+----------------------------------------------------------------
GCM enc      |   25.24   64.65  104.66  116.69  123.81  125.12  129.67  130.62
GCM dec      |   25.40   64.80  104.74  116.70  123.81  125.21  129.68  130.59
GCM mb enc   |   24.95   64.06  104.20  116.38  123.55  124.97  129.63  130.61
GCM mb dec   |   24.92   64.00  104.13  116.34  123.55  124.98  129.56  130.48
After:
gcm-sm4-ce   |      16      64     256     512    1024    1420    4096    8192
-------------+----------------------------------------------------------------
GCM enc      |  108.62  397.18  971.60 1283.92 1522.77 1513.39 1777.00 1806.96
GCM dec      |  116.36  398.14 1004.27 1319.11 1624.21 1635.43 1932.54 1974.20
GCM mb enc   |  107.13  391.79  962.05 1274.94 1514.76 1508.57 1769.07 1801.58
GCM mb dec   |  113.40  389.36  988.51 1307.68 1619.10 1631.55 1931.70 1970.86
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
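Reviewer note (not part of the commit): the 4-way GHASH loops in the
assembly below rely on the identity

  X' = (X ^ C0)*H^4 ^ C1*H^3 ^ C2*H^2 ^ C3*H

which gives the same result as four sequential X = (X ^ Ci)*H steps. The
following userspace C sketch checks that identity with a slow bitwise
multiply in GCM bit order (NIST SP 800-38D, Algorithm 1). All names here
(struct be128, gf128_mul(), ghash_serial(), ghash_aggregated(),
ghash_selfcheck()) are hypothetical and the input values are arbitrary.
Unlike the PMULL code, which XORs the unreduced 256-bit products and
reduces once, this sketch reduces inside every multiply; the result is
the same because reduction is linear.

```c
#include <assert.h>
#include <stdint.h>

/* 128-bit block as two halves: hi holds bytes 0..7 read big-endian */
struct be128 { uint64_t hi, lo; };

static struct be128 be128_xor(struct be128 a, struct be128 b)
{
	struct be128 r = { a.hi ^ b.hi, a.lo ^ b.lo };
	return r;
}

/* Bitwise GF(2^128) multiply, bit 0 is the leftmost bit of the block */
static struct be128 gf128_mul(struct be128 x, struct be128 y)
{
	struct be128 z = { 0, 0 }, v = y;
	int i;

	for (i = 0; i < 128; i++) {
		uint64_t xbit = (i < 64) ? (x.hi >> (63 - i)) & 1
					 : (x.lo >> (127 - i)) & 1;
		uint64_t lsb = v.lo & 1;

		if (xbit)
			z = be128_xor(z, v);

		/* v = v >> 1, folding in R = 0xe1 || 0^120 on carry-out */
		v.lo = (v.lo >> 1) | (v.hi << 63);
		v.hi >>= 1;
		if (lsb)
			v.hi ^= 0xe100000000000000ULL;
	}
	return z;
}

/* Four sequential GHASH steps: X = (X ^ Ci) * H */
static struct be128 ghash_serial(struct be128 x, struct be128 h,
				 const struct be128 c[4])
{
	int i;

	for (i = 0; i < 4; i++)
		x = gf128_mul(be128_xor(x, c[i]), h);
	return x;
}

/* One aggregated step using the precomputed powers H^1..H^4 */
static struct be128 ghash_aggregated(struct be128 x, struct be128 h,
				     const struct be128 c[4])
{
	struct be128 h2 = gf128_mul(h, h);
	struct be128 h3 = gf128_mul(h2, h);
	struct be128 h4 = gf128_mul(h2, h2);
	struct be128 acc = gf128_mul(be128_xor(x, c[0]), h4);

	acc = be128_xor(acc, gf128_mul(c[1], h3));
	acc = be128_xor(acc, gf128_mul(c[2], h2));
	acc = be128_xor(acc, gf128_mul(c[3], h));
	return acc;
}

/* Compare both evaluation orders on arbitrary inputs */
static void ghash_selfcheck(void)
{
	const struct be128 h = { 0x0123456789abcdefULL, 0xfedcba9876543210ULL };
	const struct be128 x = { 0x1111111122222222ULL, 0x3333333344444444ULL };
	const struct be128 c[4] = {
		{ 0xdeadbeef00000001ULL, 0x0000000000000002ULL },
		{ 0x0000000000000003ULL, 0xcafef00d00000004ULL },
		{ 0x5555555555555555ULL, 0xaaaaaaaaaaaaaaaaULL },
		{ 0xffffffffffffffffULL, 0x0000000000000001ULL },
	};
	struct be128 a = ghash_serial(x, h, c);
	struct be128 b = ghash_aggregated(x, h, c);

	assert(a.hi == b.hi && a.lo == b.lo);
}
```

The table of powers has the same shape as what sm4_ce_pmull_ghash_setup
stores: H^1 to H^4, with H^4 computed as H^2 * H^2.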
arch/arm64/crypto/Kconfig | 16 +
arch/arm64/crypto/Makefile | 3 +
arch/arm64/crypto/sm4-ce-gcm-core.S | 741 ++++++++++++++++++++++++++++
arch/arm64/crypto/sm4-ce-gcm-glue.c | 286 +++++++++++
4 files changed, 1046 insertions(+)
create mode 100644 arch/arm64/crypto/sm4-ce-gcm-core.S
create mode 100644 arch/arm64/crypto/sm4-ce-gcm-glue.c
diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index 2611036a3e3f..6793d5bc3ee5 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -297,6 +297,22 @@ config CRYPTO_SM4_ARM64_CE_CCM
- ARMv8 Crypto Extensions
- NEON (Advanced SIMD) extensions
+config CRYPTO_SM4_ARM64_CE_GCM
+ tristate "AEAD cipher: SM4 in GCM mode (ARMv8 Crypto Extensions)"
+ depends on KERNEL_MODE_NEON
+ select CRYPTO_ALGAPI
+ select CRYPTO_AEAD
+ select CRYPTO_SM4
+ select CRYPTO_SM4_ARM64_CE_BLK
+ help
+ AEAD cipher: SM4 cipher algorithms (OSCCA GB/T 32907-2016) with
+ GCM (Galois/Counter Mode) authenticated encryption mode (NIST SP800-38D)
+
+ Architecture: arm64 using:
+ - ARMv8 Crypto Extensions
+ - PMULL (Polynomial Multiply Long) instructions
+ - NEON (Advanced SIMD) extensions
+
config CRYPTO_CRCT10DIF_ARM64_CE
tristate "CRCT10DIF (PMULL)"
depends on KERNEL_MODE_NEON && CRC_T10DIF
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index 843ea5266965..4818e204c2ac 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -32,6 +32,9 @@ sm4-ce-y := sm4-ce-glue.o sm4-ce-core.o
obj-$(CONFIG_CRYPTO_SM4_ARM64_CE_CCM) += sm4-ce-ccm.o
sm4-ce-ccm-y := sm4-ce-ccm-glue.o sm4-ce-ccm-core.o
+obj-$(CONFIG_CRYPTO_SM4_ARM64_CE_GCM) += sm4-ce-gcm.o
+sm4-ce-gcm-y := sm4-ce-gcm-glue.o sm4-ce-gcm-core.o
+
obj-$(CONFIG_CRYPTO_SM4_ARM64_NEON_BLK) += sm4-neon.o
sm4-neon-y := sm4-neon-glue.o sm4-neon-core.o
diff --git a/arch/arm64/crypto/sm4-ce-gcm-core.S b/arch/arm64/crypto/sm4-ce-gcm-core.S
new file mode 100644
index 000000000000..7aa3ec18a289
--- /dev/null
+++ b/arch/arm64/crypto/sm4-ce-gcm-core.S
@@ -0,0 +1,741 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * SM4-GCM AEAD Algorithm using ARMv8 Crypto Extensions
+ * as specified in rfc8998
+ * https://datatracker.ietf.org/doc/html/rfc8998
+ *
+ * Copyright (C) 2016 Jussi Kivilinna <jussi.kivilinna@iki.fi>
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+#include "sm4-ce-asm.h"
+
+.arch armv8-a+crypto
+
+.irp b, 0, 1, 2, 3, 24, 25, 26, 27, 28, 29, 30, 31
+ .set .Lv\b\().4s, \b
+.endr
+
+.macro sm4e, vd, vn
+ .inst 0xcec08400 | (.L\vn << 5) | .L\vd
+.endm
+
+/* Register macros */
+
+/* Used for both encryption and decryption */
+#define RHASH v21
+#define RRCONST v22
+#define RZERO v23
+
+/* Helper macros. */
+
+/*
+ * input: m0, m1
+ * output: r0:r1 (low 128-bits in r0, high in r1)
+ */
+#define PMUL_128x128(r0, r1, m0, m1, T0, T1) \
+ ext T0.16b, m1.16b, m1.16b, #8; \
+ pmull r0.1q, m0.1d, m1.1d; \
+ pmull T1.1q, m0.1d, T0.1d; \
+ pmull2 T0.1q, m0.2d, T0.2d; \
+ pmull2 r1.1q, m0.2d, m1.2d; \
+ eor T0.16b, T0.16b, T1.16b; \
+ ext T1.16b, RZERO.16b, T0.16b, #8; \
+ ext T0.16b, T0.16b, RZERO.16b, #8; \
+ eor r0.16b, r0.16b, T1.16b; \
+ eor r1.16b, r1.16b, T0.16b;
+
+#define PMUL_128x128_4x(r0, r1, m0, m1, T0, T1, \
+ r2, r3, m2, m3, T2, T3, \
+ r4, r5, m4, m5, T4, T5, \
+ r6, r7, m6, m7, T6, T7) \
+ ext T0.16b, m1.16b, m1.16b, #8; \
+ ext T2.16b, m3.16b, m3.16b, #8; \
+ ext T4.16b, m5.16b, m5.16b, #8; \
+ ext T6.16b, m7.16b, m7.16b, #8; \
+ pmull r0.1q, m0.1d, m1.1d; \
+ pmull r2.1q, m2.1d, m3.1d; \
+ pmull r4.1q, m4.1d, m5.1d; \
+ pmull r6.1q, m6.1d, m7.1d; \
+ pmull T1.1q, m0.1d, T0.1d; \
+ pmull T3.1q, m2.1d, T2.1d; \
+ pmull T5.1q, m4.1d, T4.1d; \
+ pmull T7.1q, m6.1d, T6.1d; \
+ pmull2 T0.1q, m0.2d, T0.2d; \
+ pmull2 T2.1q, m2.2d, T2.2d; \
+ pmull2 T4.1q, m4.2d, T4.2d; \
+ pmull2 T6.1q, m6.2d, T6.2d; \
+ pmull2 r1.1q, m0.2d, m1.2d; \
+ pmull2 r3.1q, m2.2d, m3.2d; \
+ pmull2 r5.1q, m4.2d, m5.2d; \
+ pmull2 r7.1q, m6.2d, m7.2d; \
+ eor T0.16b, T0.16b, T1.16b; \
+ eor T2.16b, T2.16b, T3.16b; \
+ eor T4.16b, T4.16b, T5.16b; \
+ eor T6.16b, T6.16b, T7.16b; \
+ ext T1.16b, RZERO.16b, T0.16b, #8; \
+ ext T3.16b, RZERO.16b, T2.16b, #8; \
+ ext T5.16b, RZERO.16b, T4.16b, #8; \
+ ext T7.16b, RZERO.16b, T6.16b, #8; \
+ ext T0.16b, T0.16b, RZERO.16b, #8; \
+ ext T2.16b, T2.16b, RZERO.16b, #8; \
+ ext T4.16b, T4.16b, RZERO.16b, #8; \
+ ext T6.16b, T6.16b, RZERO.16b, #8; \
+ eor r0.16b, r0.16b, T1.16b; \
+ eor r2.16b, r2.16b, T3.16b; \
+ eor r4.16b, r4.16b, T5.16b; \
+ eor r6.16b, r6.16b, T7.16b; \
+ eor r1.16b, r1.16b, T0.16b; \
+ eor r3.16b, r3.16b, T2.16b; \
+ eor r5.16b, r5.16b, T4.16b; \
+ eor r7.16b, r7.16b, T6.16b;
+
+/*
+ * input: r0:r1 (low 128-bits in r0, high in r1)
+ * output: a
+ */
+#define REDUCTION(a, r0, r1, rconst, T0, T1) \
+ pmull2 T0.1q, r1.2d, rconst.2d; \
+ ext T1.16b, T0.16b, RZERO.16b, #8; \
+ ext T0.16b, RZERO.16b, T0.16b, #8; \
+ eor r1.16b, r1.16b, T1.16b; \
+ eor r0.16b, r0.16b, T0.16b; \
+ pmull T0.1q, r1.1d, rconst.1d; \
+ eor a.16b, r0.16b, T0.16b;
+
+#define SM4_CRYPT_PMUL_128x128_BLK(b0, r0, r1, m0, m1, T0, T1) \
+ rev32 b0.16b, b0.16b; \
+ ext T0.16b, m1.16b, m1.16b, #8; \
+ sm4e b0.4s, v24.4s; \
+ pmull r0.1q, m0.1d, m1.1d; \
+ sm4e b0.4s, v25.4s; \
+ pmull T1.1q, m0.1d, T0.1d; \
+ sm4e b0.4s, v26.4s; \
+ pmull2 T0.1q, m0.2d, T0.2d; \
+ sm4e b0.4s, v27.4s; \
+ pmull2 r1.1q, m0.2d, m1.2d; \
+ sm4e b0.4s, v28.4s; \
+ eor T0.16b, T0.16b, T1.16b; \
+ sm4e b0.4s, v29.4s; \
+ ext T1.16b, RZERO.16b, T0.16b, #8; \
+ sm4e b0.4s, v30.4s; \
+ ext T0.16b, T0.16b, RZERO.16b, #8; \
+ sm4e b0.4s, v31.4s; \
+ eor r0.16b, r0.16b, T1.16b; \
+ rev64 b0.4s, b0.4s; \
+ eor r1.16b, r1.16b, T0.16b; \
+ ext b0.16b, b0.16b, b0.16b, #8; \
+ rev32 b0.16b, b0.16b;
+
+#define SM4_CRYPT_PMUL_128x128_BLK3(b0, b1, b2, \
+ r0, r1, m0, m1, T0, T1, \
+ r2, r3, m2, m3, T2, T3, \
+ r4, r5, m4, m5, T4, T5) \
+ rev32 b0.16b, b0.16b; \
+ rev32 b1.16b, b1.16b; \
+ rev32 b2.16b, b2.16b; \
+ ext T0.16b, m1.16b, m1.16b, #8; \
+ ext T2.16b, m3.16b, m3.16b, #8; \
+ ext T4.16b, m5.16b, m5.16b, #8; \
+ sm4e b0.4s, v24.4s; \
+ sm4e b1.4s, v24.4s; \
+ sm4e b2.4s, v24.4s; \
+ pmull r0.1q, m0.1d, m1.1d; \
+ pmull r2.1q, m2.1d, m3.1d; \
+ pmull r4.1q, m4.1d, m5.1d; \
+ sm4e b0.4s, v25.4s; \
+ sm4e b1.4s, v25.4s; \
+ sm4e b2.4s, v25.4s; \
+ pmull T1.1q, m0.1d, T0.1d; \
+ pmull T3.1q, m2.1d, T2.1d; \
+ pmull T5.1q, m4.1d, T4.1d; \
+ sm4e b0.4s, v26.4s; \
+ sm4e b1.4s, v26.4s; \
+ sm4e b2.4s, v26.4s; \
+ pmull2 T0.1q, m0.2d, T0.2d; \
+ pmull2 T2.1q, m2.2d, T2.2d; \
+ pmull2 T4.1q, m4.2d, T4.2d; \
+ sm4e b0.4s, v27.4s; \
+ sm4e b1.4s, v27.4s; \
+ sm4e b2.4s, v27.4s; \
+ pmull2 r1.1q, m0.2d, m1.2d; \
+ pmull2 r3.1q, m2.2d, m3.2d; \
+ pmull2 r5.1q, m4.2d, m5.2d; \
+ sm4e b0.4s, v28.4s; \
+ sm4e b1.4s, v28.4s; \
+ sm4e b2.4s, v28.4s; \
+ eor T0.16b, T0.16b, T1.16b; \
+ eor T2.16b, T2.16b, T3.16b; \
+ eor T4.16b, T4.16b, T5.16b; \
+ sm4e b0.4s, v29.4s; \
+ sm4e b1.4s, v29.4s; \
+ sm4e b2.4s, v29.4s; \
+ ext T1.16b, RZERO.16b, T0.16b, #8; \
+ ext T3.16b, RZERO.16b, T2.16b, #8; \
+ ext T5.16b, RZERO.16b, T4.16b, #8; \
+ sm4e b0.4s, v30.4s; \
+ sm4e b1.4s, v30.4s; \
+ sm4e b2.4s, v30.4s; \
+ ext T0.16b, T0.16b, RZERO.16b, #8; \
+ ext T2.16b, T2.16b, RZERO.16b, #8; \
+ ext T4.16b, T4.16b, RZERO.16b, #8; \
+ sm4e b0.4s, v31.4s; \
+ sm4e b1.4s, v31.4s; \
+ sm4e b2.4s, v31.4s; \
+ eor r0.16b, r0.16b, T1.16b; \
+ eor r2.16b, r2.16b, T3.16b; \
+ eor r4.16b, r4.16b, T5.16b; \
+ rev64 b0.4s, b0.4s; \
+ rev64 b1.4s, b1.4s; \
+ rev64 b2.4s, b2.4s; \
+ eor r1.16b, r1.16b, T0.16b; \
+ eor r3.16b, r3.16b, T2.16b; \
+ eor r5.16b, r5.16b, T4.16b; \
+ ext b0.16b, b0.16b, b0.16b, #8; \
+ ext b1.16b, b1.16b, b1.16b, #8; \
+ ext b2.16b, b2.16b, b2.16b, #8; \
+ eor r0.16b, r0.16b, r2.16b; \
+ eor r1.16b, r1.16b, r3.16b; \
+ rev32 b0.16b, b0.16b; \
+ rev32 b1.16b, b1.16b; \
+ rev32 b2.16b, b2.16b; \
+ eor r0.16b, r0.16b, r4.16b; \
+ eor r1.16b, r1.16b, r5.16b;
+
+#define inc32_le128(vctr) \
+ mov vctr.d[1], x9; \
+ add w6, w9, #1; \
+ mov vctr.d[0], x8; \
+ bfi x9, x6, #0, #32; \
+ rev64 vctr.16b, vctr.16b;
+
+#define GTAG_HASH_LENGTHS(vctr0, vlen) \
+ ld1 {vlen.16b}, [x7]; \
+ /* construct CTR0 */ \
+ /* the lower 32 bits of the initial IV are always be32(1) */ \
+ mov x6, #0x1; \
+ bfi x9, x6, #0, #32; \
+ mov vctr0.d[0], x8; \
+ mov vctr0.d[1], x9; \
+ rbit vlen.16b, vlen.16b; \
+ rev64 vctr0.16b, vctr0.16b; \
+ /* authtag = GCTR(CTR0, GHASH) */ \
+ eor RHASH.16b, RHASH.16b, vlen.16b; \
+ SM4_CRYPT_PMUL_128x128_BLK(vctr0, RR0, RR1, RHASH, RH1, \
+ RTMP0, RTMP1); \
+ REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP2, RTMP3); \
+ rbit RHASH.16b, RHASH.16b; \
+ eor RHASH.16b, RHASH.16b, vctr0.16b;
+
+
+/* Register macros for encrypt and ghash */
+
+/* can be the same as input v0-v3 */
+#define RR1 v0
+#define RR3 v1
+#define RR5 v2
+#define RR7 v3
+
+#define RR0 v4
+#define RR2 v5
+#define RR4 v6
+#define RR6 v7
+
+#define RTMP0 v8
+#define RTMP1 v9
+#define RTMP2 v10
+#define RTMP3 v11
+#define RTMP4 v12
+#define RTMP5 v13
+#define RTMP6 v14
+#define RTMP7 v15
+
+#define RH1 v16
+#define RH2 v17
+#define RH3 v18
+#define RH4 v19
+
+.align 3
+SYM_FUNC_START(sm4_ce_pmull_ghash_setup)
+ /* input:
+ * x0: round key array, CTX
+ * x1: ghash table
+ */
+ SM4_PREPARE(x0)
+
+ adr_l x2, .Lghash_rconst
+ ld1r {RRCONST.2d}, [x2]
+
+ eor RZERO.16b, RZERO.16b, RZERO.16b
+
+ /* H = E(K, 0^128) */
+ rev32 v0.16b, RZERO.16b
+ SM4_CRYPT_BLK_BE(v0)
+
+ /* H ^ 1 */
+ rbit RH1.16b, v0.16b
+
+ /* H ^ 2 */
+ PMUL_128x128(RR0, RR1, RH1, RH1, RTMP0, RTMP1)
+ REDUCTION(RH2, RR0, RR1, RRCONST, RTMP2, RTMP3)
+
+ /* H ^ 3 */
+ PMUL_128x128(RR0, RR1, RH2, RH1, RTMP0, RTMP1)
+ REDUCTION(RH3, RR0, RR1, RRCONST, RTMP2, RTMP3)
+
+ /* H ^ 4 */
+ PMUL_128x128(RR0, RR1, RH2, RH2, RTMP0, RTMP1)
+ REDUCTION(RH4, RR0, RR1, RRCONST, RTMP2, RTMP3)
+
+ st1 {RH1.16b-RH4.16b}, [x1]
+
+ ret
+SYM_FUNC_END(sm4_ce_pmull_ghash_setup)
+
+.align 3
+SYM_FUNC_START(pmull_ghash_update)
+ /* input:
+ * x0: ghash table
+ * x1: ghash result
+ * x2: src
+ * w3: nblocks
+ */
+ ld1 {RH1.16b-RH4.16b}, [x0]
+
+ ld1 {RHASH.16b}, [x1]
+ rbit RHASH.16b, RHASH.16b
+
+ adr_l x4, .Lghash_rconst
+ ld1r {RRCONST.2d}, [x4]
+
+ eor RZERO.16b, RZERO.16b, RZERO.16b
+
+.Lghash_loop_4x:
+ cmp w3, #4
+ blt .Lghash_loop_1x
+
+ sub w3, w3, #4
+
+ ld1 {v0.16b-v3.16b}, [x2], #64
+
+ rbit v0.16b, v0.16b
+ rbit v1.16b, v1.16b
+ rbit v2.16b, v2.16b
+ rbit v3.16b, v3.16b
+
+ /*
+ * (in0 ^ HASH) * H^4 => rr0:rr1
+ * (in1) * H^3 => rr2:rr3
+ * (in2) * H^2 => rr4:rr5
+ * (in3) * H^1 => rr6:rr7
+ */
+ eor RHASH.16b, RHASH.16b, v0.16b
+
+ PMUL_128x128_4x(RR0, RR1, RHASH, RH4, RTMP0, RTMP1,
+ RR2, RR3, v1, RH3, RTMP2, RTMP3,
+ RR4, RR5, v2, RH2, RTMP4, RTMP5,
+ RR6, RR7, v3, RH1, RTMP6, RTMP7)
+
+ eor RR0.16b, RR0.16b, RR2.16b
+ eor RR1.16b, RR1.16b, RR3.16b
+ eor RR0.16b, RR0.16b, RR4.16b
+ eor RR1.16b, RR1.16b, RR5.16b
+ eor RR0.16b, RR0.16b, RR6.16b
+ eor RR1.16b, RR1.16b, RR7.16b
+
+ REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP0, RTMP1)
+
+ cbz w3, .Lghash_end
+ b .Lghash_loop_4x
+
+.Lghash_loop_1x:
+ sub w3, w3, #1
+
+ ld1 {v0.16b}, [x2], #16
+ rbit v0.16b, v0.16b
+ eor RHASH.16b, RHASH.16b, v0.16b
+
+ PMUL_128x128(RR0, RR1, RHASH, RH1, RTMP0, RTMP1)
+ REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP2, RTMP3)
+
+ cbnz w3, .Lghash_loop_1x
+
+.Lghash_end:
+ rbit RHASH.16b, RHASH.16b
+ st1 {RHASH.2d}, [x1]
+
+ ret
+SYM_FUNC_END(pmull_ghash_update)
+
+.align 3
+SYM_FUNC_START(sm4_ce_pmull_gcm_enc)
+ /* input:
+ * x0: round key array, CTX
+ * x1: dst
+ * x2: src
+ * x3: ctr (big endian, 128 bit)
+ * w4: nbytes
+ * x5: ghash result
+ * x6: ghash table
+ * x7: lengths (only for last block)
+ */
+ SM4_PREPARE(x0)
+
+ ldp x8, x9, [x3]
+ rev x8, x8
+ rev x9, x9
+
+ ld1 {RH1.16b-RH4.16b}, [x6]
+
+ ld1 {RHASH.16b}, [x5]
+ rbit RHASH.16b, RHASH.16b
+
+ adr_l x6, .Lghash_rconst
+ ld1r {RRCONST.2d}, [x6]
+
+ eor RZERO.16b, RZERO.16b, RZERO.16b
+
+ cbz w4, .Lgcm_enc_hash_len
+
+.Lgcm_enc_loop_4x:
+ cmp w4, #(4 * 16)
+ blt .Lgcm_enc_loop_1x
+
+ sub w4, w4, #(4 * 16)
+
+ /* construct CTRs */
+ inc32_le128(v0) /* +0 */
+ inc32_le128(v1) /* +1 */
+ inc32_le128(v2) /* +2 */
+ inc32_le128(v3) /* +3 */
+
+ ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64
+
+ SM4_CRYPT_BLK4(v0, v1, v2, v3)
+
+ eor v0.16b, v0.16b, RTMP0.16b
+ eor v1.16b, v1.16b, RTMP1.16b
+ eor v2.16b, v2.16b, RTMP2.16b
+ eor v3.16b, v3.16b, RTMP3.16b
+ st1 {v0.16b-v3.16b}, [x1], #64
+
+ /* ghash update */
+
+ rbit v0.16b, v0.16b
+ rbit v1.16b, v1.16b
+ rbit v2.16b, v2.16b
+ rbit v3.16b, v3.16b
+
+ /*
+ * (in0 ^ HASH) * H^4 => rr0:rr1
+ * (in1) * H^3 => rr2:rr3
+ * (in2) * H^2 => rr4:rr5
+ * (in3) * H^1 => rr6:rr7
+ */
+ eor RHASH.16b, RHASH.16b, v0.16b
+
+ PMUL_128x128_4x(RR0, RR1, RHASH, RH4, RTMP0, RTMP1,
+ RR2, RR3, v1, RH3, RTMP2, RTMP3,
+ RR4, RR5, v2, RH2, RTMP4, RTMP5,
+ RR6, RR7, v3, RH1, RTMP6, RTMP7)
+
+ eor RR0.16b, RR0.16b, RR2.16b
+ eor RR1.16b, RR1.16b, RR3.16b
+ eor RR0.16b, RR0.16b, RR4.16b
+ eor RR1.16b, RR1.16b, RR5.16b
+ eor RR0.16b, RR0.16b, RR6.16b
+ eor RR1.16b, RR1.16b, RR7.16b
+
+ REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP0, RTMP1)
+
+ cbz w4, .Lgcm_enc_hash_len
+ b .Lgcm_enc_loop_4x
+
+.Lgcm_enc_loop_1x:
+ cmp w4, #16
+ blt .Lgcm_enc_tail
+
+ sub w4, w4, #16
+
+ /* construct CTRs */
+ inc32_le128(v0)
+
+ ld1 {RTMP0.16b}, [x2], #16
+
+ SM4_CRYPT_BLK(v0)
+
+ eor v0.16b, v0.16b, RTMP0.16b
+ st1 {v0.16b}, [x1], #16
+
+ /* ghash update */
+ rbit v0.16b, v0.16b
+ eor RHASH.16b, RHASH.16b, v0.16b
+ PMUL_128x128(RR0, RR1, RHASH, RH1, RTMP0, RTMP1)
+ REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP2, RTMP3)
+
+ cbz w4, .Lgcm_enc_hash_len
+ b .Lgcm_enc_loop_1x
+
+.Lgcm_enc_tail:
+ /* construct CTRs */
+ inc32_le128(v0)
+ SM4_CRYPT_BLK(v0)
+
+ /* load permute table */
+ adr_l x0, .Lcts_permute_table
+ add x0, x0, #32
+ sub x0, x0, w4, uxtw
+ ld1 {v3.16b}, [x0]
+
+.Lgcm_enc_tail_loop:
+ /* do encrypt */
+ ldrb w0, [x2], #1 /* get 1 byte from input */
+ umov w6, v0.b[0] /* get top crypted byte */
+ eor w6, w6, w0 /* w6 = CTR ^ input */
+ strb w6, [x1], #1 /* store out byte */
+
+ /* shift right out one byte */
+ ext v0.16b, v0.16b, v0.16b, #1
+ /* the last ciphertext is placed in high bytes */
+ ins v0.b[15], w6
+
+ subs w4, w4, #1
+ bne .Lgcm_enc_tail_loop
+
+ /* padding last block with zeros */
+ tbl v0.16b, {v0.16b}, v3.16b
+
+ /* ghash update */
+ rbit v0.16b, v0.16b
+ eor RHASH.16b, RHASH.16b, v0.16b
+ PMUL_128x128(RR0, RR1, RHASH, RH1, RTMP0, RTMP1)
+ REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP2, RTMP3)
+
+.Lgcm_enc_hash_len:
+ cbz x7, .Lgcm_enc_end
+
+ GTAG_HASH_LENGTHS(v1, v3)
+
+ b .Lgcm_enc_ret
+
+.Lgcm_enc_end:
+ /* store new CTR */
+ rev x8, x8
+ rev x9, x9
+ stp x8, x9, [x3]
+
+ rbit RHASH.16b, RHASH.16b
+
+.Lgcm_enc_ret:
+ /* store new MAC */
+ st1 {RHASH.2d}, [x5]
+
+ ret
+SYM_FUNC_END(sm4_ce_pmull_gcm_enc)
+
+#undef RR1
+#undef RR3
+#undef RR5
+#undef RR7
+#undef RR0
+#undef RR2
+#undef RR4
+#undef RR6
+#undef RTMP0
+#undef RTMP1
+#undef RTMP2
+#undef RTMP3
+#undef RTMP4
+#undef RTMP5
+#undef RTMP6
+#undef RTMP7
+#undef RH1
+#undef RH2
+#undef RH3
+#undef RH4
+
+
+/* Register macros for decrypt */
+
+/* v0-v2 for building CTRs, v3-v5 for saving inputs */
+
+#define RR1 v6
+#define RR3 v7
+#define RR5 v8
+
+#define RR0 v9
+#define RR2 v10
+#define RR4 v11
+
+#define RTMP0 v12
+#define RTMP1 v13
+#define RTMP2 v14
+#define RTMP3 v15
+#define RTMP4 v16
+#define RTMP5 v17
+
+#define RH1 v18
+#define RH2 v19
+#define RH3 v20
+
+.align 3
+SYM_FUNC_START(sm4_ce_pmull_gcm_dec)
+ /* input:
+ * x0: round key array, CTX
+ * x1: dst
+ * x2: src
+ * x3: ctr (big endian, 128 bit)
+ * w4: nbytes
+ * x5: ghash result
+ * x6: ghash table
+ * x7: lengths (only for last block)
+ */
+ SM4_PREPARE(x0)
+
+ ldp x8, x9, [x3]
+ rev x8, x8
+ rev x9, x9
+
+ ld1 {RH1.16b-RH3.16b}, [x6]
+
+ ld1 {RHASH.16b}, [x5]
+ rbit RHASH.16b, RHASH.16b
+
+ adr_l x6, .Lghash_rconst
+ ld1r {RRCONST.2d}, [x6]
+
+ eor RZERO.16b, RZERO.16b, RZERO.16b
+
+ cbz w4, .Lgcm_dec_hash_len
+
+.Lgcm_dec_loop_3x:
+ cmp w4, #(3 * 16)
+ blt .Lgcm_dec_loop_1x
+
+ sub w4, w4, #(3 * 16)
+
+ ld1 {v3.16b-v5.16b}, [x2], #(3 * 16)
+
+ /* construct CTRs */
+ inc32_le128(v0) /* +0 */
+ rbit v6.16b, v3.16b
+ inc32_le128(v1) /* +1 */
+ rbit v7.16b, v4.16b
+ inc32_le128(v2) /* +2 */
+ rbit v8.16b, v5.16b
+
+ eor RHASH.16b, RHASH.16b, v6.16b
+
+ /* decrypt & ghash update */
+ SM4_CRYPT_PMUL_128x128_BLK3(v0, v1, v2,
+ RR0, RR1, RHASH, RH3, RTMP0, RTMP1,
+ RR2, RR3, v7, RH2, RTMP2, RTMP3,
+ RR4, RR5, v8, RH1, RTMP4, RTMP5)
+
+ eor v0.16b, v0.16b, v3.16b
+ eor v1.16b, v1.16b, v4.16b
+ eor v2.16b, v2.16b, v5.16b
+
+ REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP0, RTMP1)
+
+ st1 {v0.16b-v2.16b}, [x1], #(3 * 16)
+
+ cbz w4, .Lgcm_dec_hash_len
+ b .Lgcm_dec_loop_3x
+
+.Lgcm_dec_loop_1x:
+ cmp w4, #16
+ blt .Lgcm_dec_tail
+
+ sub w4, w4, #16
+
+ ld1 {v3.16b}, [x2], #16
+
+ /* construct CTRs */
+ inc32_le128(v0)
+ rbit v6.16b, v3.16b
+
+ eor RHASH.16b, RHASH.16b, v6.16b
+
+ SM4_CRYPT_PMUL_128x128_BLK(v0, RR0, RR1, RHASH, RH1, RTMP0, RTMP1)
+
+ eor v0.16b, v0.16b, v3.16b
+
+ REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP2, RTMP3)
+
+ st1 {v0.16b}, [x1], #16
+
+ cbz w4, .Lgcm_dec_hash_len
+ b .Lgcm_dec_loop_1x
+
+.Lgcm_dec_tail:
+ /* construct CTRs */
+ inc32_le128(v0)
+ SM4_CRYPT_BLK(v0)
+
+ /* load permute table */
+ adr_l x0, .Lcts_permute_table
+ add x0, x0, #32
+ sub x0, x0, w4, uxtw
+ ld1 {v3.16b}, [x0]
+
+.Lgcm_dec_tail_loop:
+ /* do decrypt */
+ ldrb w0, [x2], #1 /* get 1 byte from input */
+ umov w6, v0.b[0] /* get the next keystream byte */
+ eor w6, w6, w0 /* w6 = keystream ^ input */
+ strb w6, [x1], #1 /* store output byte */
+
+ /* rotate right by one byte */
+ ext v0.16b, v0.16b, v0.16b, #1
+ /* place the latest ciphertext byte in the top byte */
+ ins v0.b[15], w0
+
+ subs w4, w4, #1
+ bne .Lgcm_dec_tail_loop
+
+ /* padding last block with zeros */
+ tbl v0.16b, {v0.16b}, v3.16b
+
+ /* ghash update */
+ rbit v0.16b, v0.16b
+ eor RHASH.16b, RHASH.16b, v0.16b
+ PMUL_128x128(RR0, RR1, RHASH, RH1, RTMP0, RTMP1)
+ REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP2, RTMP3)
+
+.Lgcm_dec_hash_len:
+ cbz x7, .Lgcm_dec_end
+
+ GTAG_HASH_LENGTHS(v1, v3)
+
+ b .Lgcm_dec_ret
+
+.Lgcm_dec_end:
+ /* store new CTR */
+ rev x8, x8
+ rev x9, x9
+ stp x8, x9, [x3]
+
+ rbit RHASH.16b, RHASH.16b
+
+.Lgcm_dec_ret:
+ /* store new MAC */
+ st1 {RHASH.2d}, [x5]
+
+ ret
+SYM_FUNC_END(sm4_ce_pmull_gcm_dec)
+
+ .section ".rodata", "a"
+ .align 4
+.Lcts_permute_table:
+ .byte 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+ .byte 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+ .byte 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7
+ .byte 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf
+ .byte 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+ .byte 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+
+.Lghash_rconst:
+ .quad 0x87
diff --git a/arch/arm64/crypto/sm4-ce-gcm-glue.c b/arch/arm64/crypto/sm4-ce-gcm-glue.c
new file mode 100644
index 000000000000..e90ea0f17beb
--- /dev/null
+++ b/arch/arm64/crypto/sm4-ce-gcm-glue.c
@@ -0,0 +1,286 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * SM4-GCM AEAD Algorithm using ARMv8 Crypto Extensions
+ * as specified in rfc8998
+ * https://datatracker.ietf.org/doc/html/rfc8998
+ *
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#include <linux/module.h>
+#include <linux/crypto.h>
+#include <linux/kernel.h>
+#include <linux/cpufeature.h>
+#include <asm/neon.h>
+#include <crypto/b128ops.h>
+#include <crypto/scatterwalk.h>
+#include <crypto/internal/aead.h>
+#include <crypto/internal/skcipher.h>
+#include <crypto/sm4.h>
+#include "sm4-ce.h"
+
+asmlinkage void sm4_ce_pmull_ghash_setup(const u32 *rkey_enc, u8 *ghash_table);
+asmlinkage void pmull_ghash_update(const u8 *ghash_table, u8 *ghash,
+ const u8 *src, unsigned int nblocks);
+asmlinkage void sm4_ce_pmull_gcm_enc(const u32 *rkey_enc, u8 *dst,
+ const u8 *src, u8 *iv,
+ unsigned int nbytes, u8 *ghash,
+ const u8 *ghash_table, const u8 *lengths);
+asmlinkage void sm4_ce_pmull_gcm_dec(const u32 *rkey_enc, u8 *dst,
+ const u8 *src, u8 *iv,
+ unsigned int nbytes, u8 *ghash,
+ const u8 *ghash_table, const u8 *lengths);
+
+#define GHASH_BLOCK_SIZE 16
+#define GCM_IV_SIZE 12
+
+struct sm4_gcm_ctx {
+ struct sm4_ctx key;
+ u8 ghash_table[16 * 4];
+};
+
+
+static int gcm_setkey(struct crypto_aead *tfm, const u8 *key,
+ unsigned int key_len)
+{
+ struct sm4_gcm_ctx *ctx = crypto_aead_ctx(tfm);
+
+ if (key_len != SM4_KEY_SIZE)
+ return -EINVAL;
+
+ kernel_neon_begin();
+
+ sm4_ce_expand_key(key, ctx->key.rkey_enc, ctx->key.rkey_dec,
+ crypto_sm4_fk, crypto_sm4_ck);
+ sm4_ce_pmull_ghash_setup(ctx->key.rkey_enc, ctx->ghash_table);
+
+ kernel_neon_end();
+ return 0;
+}
+
+static int gcm_setauthsize(struct crypto_aead *tfm, unsigned int authsize)
+{
+ switch (authsize) {
+ case 4:
+ case 8:
+ case 12 ... 16:
+ return 0;
+ default:
+ return -EINVAL;
+ }
+}
+
+static void gcm_calculate_auth_mac(struct aead_request *req, u8 ghash[])
+{
+ struct crypto_aead *aead = crypto_aead_reqtfm(req);
+ struct sm4_gcm_ctx *ctx = crypto_aead_ctx(aead);
+ u8 __aligned(8) buffer[GHASH_BLOCK_SIZE];
+ u32 assoclen = req->assoclen;
+ struct scatter_walk walk;
+ unsigned int buflen = 0;
+
+ scatterwalk_start(&walk, req->src);
+
+ do {
+ u32 n = scatterwalk_clamp(&walk, assoclen);
+ u8 *p, *ptr;
+
+ if (!n) {
+ scatterwalk_start(&walk, sg_next(walk.sg));
+ n = scatterwalk_clamp(&walk, assoclen);
+ }
+
+ p = ptr = scatterwalk_map(&walk);
+ assoclen -= n;
+ scatterwalk_advance(&walk, n);
+
+ if (n + buflen < GHASH_BLOCK_SIZE) {
+ memcpy(&buffer[buflen], ptr, n);
+ buflen += n;
+ } else {
+ unsigned int nblocks;
+
+ if (buflen) {
+ unsigned int l = GHASH_BLOCK_SIZE - buflen;
+
+ memcpy(&buffer[buflen], ptr, l);
+ ptr += l;
+ n -= l;
+
+ pmull_ghash_update(ctx->ghash_table, ghash,
+ buffer, 1);
+ }
+
+ nblocks = n / GHASH_BLOCK_SIZE;
+ if (nblocks) {
+ pmull_ghash_update(ctx->ghash_table, ghash,
+ ptr, nblocks);
+ ptr += nblocks * GHASH_BLOCK_SIZE;
+ }
+
+ buflen = n % GHASH_BLOCK_SIZE;
+ if (buflen)
+ memcpy(&buffer[0], ptr, buflen);
+ }
+
+ scatterwalk_unmap(p);
+ scatterwalk_done(&walk, 0, assoclen);
+ } while (assoclen);
+
+ /* pad the final partial block with zeros */
+ if (buflen) {
+ memset(&buffer[buflen], 0, GHASH_BLOCK_SIZE - buflen);
+ pmull_ghash_update(ctx->ghash_table, ghash, buffer, 1);
+ }
+}
+
+static int gcm_crypt(struct aead_request *req, struct skcipher_walk *walk,
+ struct sm4_gcm_ctx *ctx, u8 ghash[],
+ void (*sm4_ce_pmull_gcm_crypt)(const u32 *rkey_enc,
+ u8 *dst, const u8 *src, u8 *iv,
+ unsigned int nbytes, u8 *ghash,
+ const u8 *ghash_table, const u8 *lengths))
+{
+ u8 __aligned(8) iv[SM4_BLOCK_SIZE];
+ be128 __aligned(8) lengths;
+ int err;
+
+ memset(ghash, 0, SM4_BLOCK_SIZE);
+
+ lengths.a = cpu_to_be64(req->assoclen * 8);
+ lengths.b = cpu_to_be64(walk->total * 8);
+
+ memcpy(iv, walk->iv, GCM_IV_SIZE);
+ put_unaligned_be32(2, iv + GCM_IV_SIZE);
+
+ kernel_neon_begin();
+
+ if (req->assoclen)
+ gcm_calculate_auth_mac(req, ghash);
+
+ do {
+ unsigned int tail = walk->nbytes % SM4_BLOCK_SIZE;
+ const u8 *src = walk->src.virt.addr;
+ u8 *dst = walk->dst.virt.addr;
+
+ if (walk->nbytes == walk->total) {
+ tail = 0;
+
+ sm4_ce_pmull_gcm_crypt(ctx->key.rkey_enc, dst, src, iv,
+ walk->nbytes, ghash,
+ ctx->ghash_table,
+ (const u8 *)&lengths);
+ } else if (walk->nbytes - tail) {
+ sm4_ce_pmull_gcm_crypt(ctx->key.rkey_enc, dst, src, iv,
+ walk->nbytes - tail, ghash,
+ ctx->ghash_table, NULL);
+ }
+
+ kernel_neon_end();
+
+ err = skcipher_walk_done(walk, tail);
+ if (err)
+ return err;
+ if (walk->nbytes)
+ kernel_neon_begin();
+ } while (walk->nbytes > 0);
+
+ return 0;
+}
+
+static int gcm_encrypt(struct aead_request *req)
+{
+ struct crypto_aead *aead = crypto_aead_reqtfm(req);
+ struct sm4_gcm_ctx *ctx = crypto_aead_ctx(aead);
+ u8 __aligned(8) ghash[SM4_BLOCK_SIZE];
+ struct skcipher_walk walk;
+ int err;
+
+ err = skcipher_walk_aead_encrypt(&walk, req, false);
+ if (err)
+ return err;
+
+ err = gcm_crypt(req, &walk, ctx, ghash, sm4_ce_pmull_gcm_enc);
+ if (err)
+ return err;
+
+ /* copy authtag to end of dst */
+ scatterwalk_map_and_copy(ghash, req->dst, req->assoclen + req->cryptlen,
+ crypto_aead_authsize(aead), 1);
+
+ return 0;
+}
+
+static int gcm_decrypt(struct aead_request *req)
+{
+ struct crypto_aead *aead = crypto_aead_reqtfm(req);
+ unsigned int authsize = crypto_aead_authsize(aead);
+ struct sm4_gcm_ctx *ctx = crypto_aead_ctx(aead);
+ u8 __aligned(8) ghash[SM4_BLOCK_SIZE];
+ u8 authtag[SM4_BLOCK_SIZE];
+ struct skcipher_walk walk;
+ int err;
+
+ err = skcipher_walk_aead_decrypt(&walk, req, false);
+ if (err)
+ return err;
+
+ err = gcm_crypt(req, &walk, ctx, ghash, sm4_ce_pmull_gcm_dec);
+ if (err)
+ return err;
+
+ /* compare calculated auth tag with the stored one */
+ scatterwalk_map_and_copy(authtag, req->src,
+ req->assoclen + req->cryptlen - authsize,
+ authsize, 0);
+
+ if (crypto_memneq(authtag, ghash, authsize))
+ return -EBADMSG;
+
+ return 0;
+}
+
+static struct aead_alg sm4_gcm_alg = {
+ .base = {
+ .cra_name = "gcm(sm4)",
+ .cra_driver_name = "gcm-sm4-ce",
+ .cra_priority = 400,
+ .cra_blocksize = 1,
+ .cra_ctxsize = sizeof(struct sm4_gcm_ctx),
+ .cra_module = THIS_MODULE,
+ },
+ .ivsize = GCM_IV_SIZE,
+ .chunksize = SM4_BLOCK_SIZE,
+ .maxauthsize = SM4_BLOCK_SIZE,
+ .setkey = gcm_setkey,
+ .setauthsize = gcm_setauthsize,
+ .encrypt = gcm_encrypt,
+ .decrypt = gcm_decrypt,
+};
+
+static int __init sm4_ce_gcm_init(void)
+{
+ if (!cpu_have_named_feature(PMULL))
+ return -ENODEV;
+
+ return crypto_register_aead(&sm4_gcm_alg);
+}
+
+static void __exit sm4_ce_gcm_exit(void)
+{
+ crypto_unregister_aead(&sm4_gcm_alg);
+}
+
+static const struct cpu_feature sm4_ce_gcm_cpu_feature[] = {
+ { cpu_feature(PMULL) },
+ {}
+};
+MODULE_DEVICE_TABLE(cpu, sm4_ce_gcm_cpu_feature);
+
+module_cpu_feature_match(SM4, sm4_ce_gcm_init);
+module_exit(sm4_ce_gcm_exit);
+
+MODULE_DESCRIPTION("Synchronous SM4 in GCM mode using ARMv8 Crypto Extensions");
+MODULE_ALIAS_CRYPTO("gcm(sm4)");
+MODULE_AUTHOR("Tianjia Zhang <tianjia.zhang@linux.alibaba.com>");
+MODULE_LICENSE("GPL v2");
--
2.24.3 (Apple Git-128)
* [PATCH 15/16] crypto: arm64/sm4 - add CE implementation for GCM mode
@ 2022-09-26 9:36 ` Tianjia Zhang
From: Tianjia Zhang @ 2022-09-26 9:36 UTC (permalink / raw)
To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
linux-stm32
This patch adds a CE-optimized assembly implementation of SM4 in GCM mode.
Benchmarked on a T-Head Yitian-710 at 2.75 GHz. The data comes from the 224 and 224
modes of tcrypt and compares performance before and after this patch (the
driver used before this patch is gcm_base(ctr-sm4-ce,ghash-generic)).
The columns are blocks of different lengths. The data is tabulated and the
unit is Mb/s:
Before (gcm_base(ctr-sm4-ce,ghash-generic)):
gcm(sm4) | 16 64 256 512 1024 1420 4096 8192
-------------+---------------------------------------------------------------------
GCM enc | 25.24 64.65 104.66 116.69 123.81 125.12 129.67 130.62
GCM dec | 25.40 64.80 104.74 116.70 123.81 125.21 129.68 130.59
GCM mb enc | 24.95 64.06 104.20 116.38 123.55 124.97 129.63 130.61
GCM mb dec | 24.92 64.00 104.13 116.34 123.55 124.98 129.56 130.48
After:
gcm-sm4-ce | 16 64 256 512 1024 1420 4096 8192
-------------+---------------------------------------------------------------------
GCM enc | 108.62 397.18 971.60 1283.92 1522.77 1513.39 1777.00 1806.96
GCM dec | 116.36 398.14 1004.27 1319.11 1624.21 1635.43 1932.54 1974.20
GCM mb enc | 107.13 391.79 962.05 1274.94 1514.76 1508.57 1769.07 1801.58
GCM mb dec | 113.40 389.36 988.51 1307.68 1619.10 1631.55 1931.70 1970.86
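
As a quick reading aid for the tables above, the following C sketch computes
the after/before throughput ratio per column for the "GCM enc" row. The
numbers are copied from the two tables; the array and function names are
illustrative only and are not part of this patch.

```c
#include <assert.h>
#include <stdio.h>

#define NCOLS 8

/* Block lengths, taken from the table column headers. */
static const int lengths[NCOLS] = { 16, 64, 256, 512, 1024, 1420, 4096, 8192 };

/* "GCM enc" throughput in Mb/s, before and after this patch. */
static const double enc_before[NCOLS] = {
	25.24, 64.65, 104.66, 116.69, 123.81, 125.12, 129.67, 130.62
};
static const double enc_after[NCOLS] = {
	108.62, 397.18, 971.60, 1283.92, 1522.77, 1513.39, 1777.00, 1806.96
};

/* Hypothetical helper: speedup factor for table column i. */
double enc_speedup(int i)
{
	assert(i >= 0 && i < NCOLS);
	return enc_after[i] / enc_before[i];
}

/* Print one "length: ratio" line per column. */
void print_enc_speedups(void)
{
	for (int i = 0; i < NCOLS; i++)
		printf("%5d: %.2fx\n", lengths[i], enc_speedup(i));
}
```

With these figures the ratio is about 4.3x for the smallest block length and
about 13.8x for the largest.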
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
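
Note on the partial final block: the .Lgcm_enc_tail path in the assembly
below encrypts the last n < 16 bytes one byte at a time and then uses a TBL
lookup into .Lcts_permute_table to build the zero-padded ciphertext block
that is fed to GHASH. The C model below is my reading of that sequence, not
code from the patch; tbl16() and gcm_enc_tail_model() are hypothetical names,
and the keystream block is passed in rather than computed with SM4.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same layout as .Lcts_permute_table: 16 x 0xff, 0..15, 16 x 0xff. */
static const uint8_t permute_table[48] = {
	0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
	0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
	0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7,
	0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf,
	0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
	0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
};

/* Model of TBL with a single table register: index >= 16 yields 0. */
void tbl16(uint8_t out[16], const uint8_t table[16], const uint8_t idx[16])
{
	uint8_t tmp[16];

	for (int i = 0; i < 16; i++)
		tmp[i] = idx[i] < 16 ? table[idx[i]] : 0;
	memcpy(out, tmp, 16);
}

/*
 * Encrypt an n-byte tail (0 < n < 16) with one keystream block and
 * produce the zero-padded ciphertext block used for the GHASH update.
 */
void gcm_enc_tail_model(uint8_t *dst, const uint8_t *src, unsigned int n,
			const uint8_t keystream[16], uint8_t ghash_in[16])
{
	uint8_t v0[16];

	assert(n > 0 && n < 16);
	memcpy(v0, keystream, 16);

	for (unsigned int k = 0; k < n; k++) {
		uint8_t c = v0[0] ^ src[k];	/* umov + eor */

		dst[k] = c;			/* strb */
		memmove(v0, v0 + 1, 15);	/* ext #1: drop byte 0 */
		v0[15] = c;			/* ins v0.b[15], w6 */
	}

	/* ld1 from table + 32 - n, then tbl: ciphertext first, zeros after */
	tbl16(ghash_in, v0, permute_table + 32 - n);
}
```

After the loop the n ciphertext bytes sit in the top n bytes of v0; the table
window starting at offset 32 - n moves them to the bottom and fills the rest
with zeros. The decrypt path differs only in that it inserts the input
(ciphertext) byte instead of the output byte.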
arch/arm64/crypto/Kconfig | 16 +
arch/arm64/crypto/Makefile | 3 +
arch/arm64/crypto/sm4-ce-gcm-core.S | 741 ++++++++++++++++++++++++++++
arch/arm64/crypto/sm4-ce-gcm-glue.c | 286 +++++++++++
4 files changed, 1046 insertions(+)
create mode 100644 arch/arm64/crypto/sm4-ce-gcm-core.S
create mode 100644 arch/arm64/crypto/sm4-ce-gcm-glue.c
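
Note on the GHASH arithmetic: the PMUL_128x128 and REDUCTION macros below
appear to implement a schoolbook 128x128 carry-less multiply followed by a
two-step reduction modulo x^128 + x^7 + x^2 + x + 1 (the 0x87 constant),
operating on blocks whose bytes were bit-reversed with rbit. The portable C
reference below is a sketch under that reading, useful for cross-checking
results by hand. struct u128, clmul64(), gf128_mul(), load_block() and
ghash_update_model() are hypothetical names and do not exist in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* 128-bit value; bit k of the lo:hi pair is the coefficient of x^k. */
struct u128 { uint64_t lo, hi; };

/* Bit-serial carry-less 64x64 -> 128 multiply (what one pmull computes). */
struct u128 clmul64(uint64_t a, uint64_t b)
{
	struct u128 r = { 0, 0 };

	for (int i = 0; i < 64; i++) {
		if ((b >> i) & 1) {
			r.lo ^= a << i;
			if (i)
				r.hi ^= a >> (64 - i);
		}
	}
	return r;
}

/* Multiply in GF(2^128) modulo x^128 + x^7 + x^2 + x + 1. */
struct u128 gf128_mul(struct u128 a, struct u128 b)
{
	struct u128 ll = clmul64(a.lo, b.lo);
	struct u128 hh = clmul64(a.hi, b.hi);
	struct u128 lh = clmul64(a.lo, b.hi);
	struct u128 hl = clmul64(a.hi, b.lo);
	struct u128 mid = { lh.lo ^ hl.lo, lh.hi ^ hl.hi };
	/* 256-bit product: r0 holds the low 128 bits, r1 the high 128 bits */
	struct u128 r0 = { ll.lo, ll.hi ^ mid.lo };
	struct u128 r1 = { hh.lo ^ mid.hi, hh.hi };
	struct u128 t;

	/* fold bits 192..255: x^192 == 0x87 * x^64 */
	t = clmul64(r1.hi, 0x87);
	r1.lo ^= t.hi;
	r0.hi ^= t.lo;

	/* fold bits 128..191: x^128 == 0x87 */
	t = clmul64(r1.lo, 0x87);
	r0.lo ^= t.lo;
	r0.hi ^= t.hi;
	return r0;
}

/* Reverse the bits of one byte (models rbit on a single lane). */
uint8_t rbit8(uint8_t b)
{
	uint8_t r = 0;

	for (int i = 0; i < 8; i++)
		r |= (uint8_t)(((b >> i) & 1) << (7 - i));
	return r;
}

/* Convert a 16-byte GCM block to the bit k <-> x^k representation. */
struct u128 load_block(const uint8_t blk[16])
{
	struct u128 v = { 0, 0 };

	for (int i = 0; i < 8; i++) {
		v.lo |= (uint64_t)rbit8(blk[i]) << (8 * i);
		v.hi |= (uint64_t)rbit8(blk[8 + i]) << (8 * i);
	}
	return v;
}

/* One GHASH step: Y = (Y ^ X) * H. */
struct u128 ghash_update_model(struct u128 y, struct u128 x, struct u128 h)
{
	struct u128 s = { y.lo ^ x.lo, y.hi ^ x.hi };

	return gf128_mul(s, h);
}
```

The bit-serial multiply is far slower than PMULL and is only meant as a
reference model; the 4x and 3x paths in the assembly use precomputed powers
of H so that several such products can be summed before a single reduction.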
diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index 2611036a3e3f..6793d5bc3ee5 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -297,6 +297,22 @@ config CRYPTO_SM4_ARM64_CE_CCM
- ARMv8 Crypto Extensions
- NEON (Advanced SIMD) extensions
+config CRYPTO_SM4_ARM64_CE_GCM
+ tristate "AEAD cipher: SM4 in GCM mode (ARMv8 Crypto Extensions)"
+ depends on KERNEL_MODE_NEON
+ select CRYPTO_ALGAPI
+ select CRYPTO_AEAD
+ select CRYPTO_SM4
+ select CRYPTO_SM4_ARM64_CE_BLK
+ help
+ AEAD cipher: SM4 cipher algorithms (OSCCA GB/T 32907-2016) with
+ GCM (Galois/Counter Mode) authenticated encryption mode (NIST SP800-38D)
+
+ Architecture: arm64 using:
+ - ARMv8 Crypto Extensions
+ - PMULL (Polynomial Multiply Long) instructions
+ - NEON (Advanced SIMD) extensions
+
config CRYPTO_CRCT10DIF_ARM64_CE
tristate "CRCT10DIF (PMULL)"
depends on KERNEL_MODE_NEON && CRC_T10DIF
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index 843ea5266965..4818e204c2ac 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -32,6 +32,9 @@ sm4-ce-y := sm4-ce-glue.o sm4-ce-core.o
obj-$(CONFIG_CRYPTO_SM4_ARM64_CE_CCM) += sm4-ce-ccm.o
sm4-ce-ccm-y := sm4-ce-ccm-glue.o sm4-ce-ccm-core.o
+obj-$(CONFIG_CRYPTO_SM4_ARM64_CE_GCM) += sm4-ce-gcm.o
+sm4-ce-gcm-y := sm4-ce-gcm-glue.o sm4-ce-gcm-core.o
+
obj-$(CONFIG_CRYPTO_SM4_ARM64_NEON_BLK) += sm4-neon.o
sm4-neon-y := sm4-neon-glue.o sm4-neon-core.o
diff --git a/arch/arm64/crypto/sm4-ce-gcm-core.S b/arch/arm64/crypto/sm4-ce-gcm-core.S
new file mode 100644
index 000000000000..7aa3ec18a289
--- /dev/null
+++ b/arch/arm64/crypto/sm4-ce-gcm-core.S
@@ -0,0 +1,741 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * SM4-GCM AEAD Algorithm using ARMv8 Crypto Extensions
+ * as specified in rfc8998
+ * https://datatracker.ietf.org/doc/html/rfc8998
+ *
+ * Copyright (C) 2016 Jussi Kivilinna <jussi.kivilinna@iki.fi>
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+#include "sm4-ce-asm.h"
+
+.arch armv8-a+crypto
+
+.irp b, 0, 1, 2, 3, 24, 25, 26, 27, 28, 29, 30, 31
+ .set .Lv\b\().4s, \b
+.endr
+
+.macro sm4e, vd, vn
+ .inst 0xcec08400 | (.L\vn << 5) | .L\vd
+.endm
+
+/* Register macros */
+
+/* Used for both encryption and decryption */
+#define RHASH v21
+#define RRCONST v22
+#define RZERO v23
+
+/* Helper macros. */
+
+/*
+ * input: m0, m1
+ * output: r0:r1 (low 128-bits in r0, high in r1)
+ */
+#define PMUL_128x128(r0, r1, m0, m1, T0, T1) \
+ ext T0.16b, m1.16b, m1.16b, #8; \
+ pmull r0.1q, m0.1d, m1.1d; \
+ pmull T1.1q, m0.1d, T0.1d; \
+ pmull2 T0.1q, m0.2d, T0.2d; \
+ pmull2 r1.1q, m0.2d, m1.2d; \
+ eor T0.16b, T0.16b, T1.16b; \
+ ext T1.16b, RZERO.16b, T0.16b, #8; \
+ ext T0.16b, T0.16b, RZERO.16b, #8; \
+ eor r0.16b, r0.16b, T1.16b; \
+ eor r1.16b, r1.16b, T0.16b;
+
+#define PMUL_128x128_4x(r0, r1, m0, m1, T0, T1, \
+ r2, r3, m2, m3, T2, T3, \
+ r4, r5, m4, m5, T4, T5, \
+ r6, r7, m6, m7, T6, T7) \
+ ext T0.16b, m1.16b, m1.16b, #8; \
+ ext T2.16b, m3.16b, m3.16b, #8; \
+ ext T4.16b, m5.16b, m5.16b, #8; \
+ ext T6.16b, m7.16b, m7.16b, #8; \
+ pmull r0.1q, m0.1d, m1.1d; \
+ pmull r2.1q, m2.1d, m3.1d; \
+ pmull r4.1q, m4.1d, m5.1d; \
+ pmull r6.1q, m6.1d, m7.1d; \
+ pmull T1.1q, m0.1d, T0.1d; \
+ pmull T3.1q, m2.1d, T2.1d; \
+ pmull T5.1q, m4.1d, T4.1d; \
+ pmull T7.1q, m6.1d, T6.1d; \
+ pmull2 T0.1q, m0.2d, T0.2d; \
+ pmull2 T2.1q, m2.2d, T2.2d; \
+ pmull2 T4.1q, m4.2d, T4.2d; \
+ pmull2 T6.1q, m6.2d, T6.2d; \
+ pmull2 r1.1q, m0.2d, m1.2d; \
+ pmull2 r3.1q, m2.2d, m3.2d; \
+ pmull2 r5.1q, m4.2d, m5.2d; \
+ pmull2 r7.1q, m6.2d, m7.2d; \
+ eor T0.16b, T0.16b, T1.16b; \
+ eor T2.16b, T2.16b, T3.16b; \
+ eor T4.16b, T4.16b, T5.16b; \
+ eor T6.16b, T6.16b, T7.16b; \
+ ext T1.16b, RZERO.16b, T0.16b, #8; \
+ ext T3.16b, RZERO.16b, T2.16b, #8; \
+ ext T5.16b, RZERO.16b, T4.16b, #8; \
+ ext T7.16b, RZERO.16b, T6.16b, #8; \
+ ext T0.16b, T0.16b, RZERO.16b, #8; \
+ ext T2.16b, T2.16b, RZERO.16b, #8; \
+ ext T4.16b, T4.16b, RZERO.16b, #8; \
+ ext T6.16b, T6.16b, RZERO.16b, #8; \
+ eor r0.16b, r0.16b, T1.16b; \
+ eor r2.16b, r2.16b, T3.16b; \
+ eor r4.16b, r4.16b, T5.16b; \
+ eor r6.16b, r6.16b, T7.16b; \
+ eor r1.16b, r1.16b, T0.16b; \
+ eor r3.16b, r3.16b, T2.16b; \
+ eor r5.16b, r5.16b, T4.16b; \
+ eor r7.16b, r7.16b, T6.16b;
+
+/*
+ * input: r0:r1 (low 128-bits in r0, high in r1)
+ * output: a
+ */
+#define REDUCTION(a, r0, r1, rconst, T0, T1) \
+ pmull2 T0.1q, r1.2d, rconst.2d; \
+ ext T1.16b, T0.16b, RZERO.16b, #8; \
+ ext T0.16b, RZERO.16b, T0.16b, #8; \
+ eor r1.16b, r1.16b, T1.16b; \
+ eor r0.16b, r0.16b, T0.16b; \
+ pmull T0.1q, r1.1d, rconst.1d; \
+ eor a.16b, r0.16b, T0.16b;
+
+#define SM4_CRYPT_PMUL_128x128_BLK(b0, r0, r1, m0, m1, T0, T1) \
+ rev32 b0.16b, b0.16b; \
+ ext T0.16b, m1.16b, m1.16b, #8; \
+ sm4e b0.4s, v24.4s; \
+ pmull r0.1q, m0.1d, m1.1d; \
+ sm4e b0.4s, v25.4s; \
+ pmull T1.1q, m0.1d, T0.1d; \
+ sm4e b0.4s, v26.4s; \
+ pmull2 T0.1q, m0.2d, T0.2d; \
+ sm4e b0.4s, v27.4s; \
+ pmull2 r1.1q, m0.2d, m1.2d; \
+ sm4e b0.4s, v28.4s; \
+ eor T0.16b, T0.16b, T1.16b; \
+ sm4e b0.4s, v29.4s; \
+ ext T1.16b, RZERO.16b, T0.16b, #8; \
+ sm4e b0.4s, v30.4s; \
+ ext T0.16b, T0.16b, RZERO.16b, #8; \
+ sm4e b0.4s, v31.4s; \
+ eor r0.16b, r0.16b, T1.16b; \
+ rev64 b0.4s, b0.4s; \
+ eor r1.16b, r1.16b, T0.16b; \
+ ext b0.16b, b0.16b, b0.16b, #8; \
+ rev32 b0.16b, b0.16b;
+
+#define SM4_CRYPT_PMUL_128x128_BLK3(b0, b1, b2, \
+ r0, r1, m0, m1, T0, T1, \
+ r2, r3, m2, m3, T2, T3, \
+ r4, r5, m4, m5, T4, T5) \
+ rev32 b0.16b, b0.16b; \
+ rev32 b1.16b, b1.16b; \
+ rev32 b2.16b, b2.16b; \
+ ext T0.16b, m1.16b, m1.16b, #8; \
+ ext T2.16b, m3.16b, m3.16b, #8; \
+ ext T4.16b, m5.16b, m5.16b, #8; \
+ sm4e b0.4s, v24.4s; \
+ sm4e b1.4s, v24.4s; \
+ sm4e b2.4s, v24.4s; \
+ pmull r0.1q, m0.1d, m1.1d; \
+ pmull r2.1q, m2.1d, m3.1d; \
+ pmull r4.1q, m4.1d, m5.1d; \
+ sm4e b0.4s, v25.4s; \
+ sm4e b1.4s, v25.4s; \
+ sm4e b2.4s, v25.4s; \
+ pmull T1.1q, m0.1d, T0.1d; \
+ pmull T3.1q, m2.1d, T2.1d; \
+ pmull T5.1q, m4.1d, T4.1d; \
+ sm4e b0.4s, v26.4s; \
+ sm4e b1.4s, v26.4s; \
+ sm4e b2.4s, v26.4s; \
+ pmull2 T0.1q, m0.2d, T0.2d; \
+ pmull2 T2.1q, m2.2d, T2.2d; \
+ pmull2 T4.1q, m4.2d, T4.2d; \
+ sm4e b0.4s, v27.4s; \
+ sm4e b1.4s, v27.4s; \
+ sm4e b2.4s, v27.4s; \
+ pmull2 r1.1q, m0.2d, m1.2d; \
+ pmull2 r3.1q, m2.2d, m3.2d; \
+ pmull2 r5.1q, m4.2d, m5.2d; \
+ sm4e b0.4s, v28.4s; \
+ sm4e b1.4s, v28.4s; \
+ sm4e b2.4s, v28.4s; \
+ eor T0.16b, T0.16b, T1.16b; \
+ eor T2.16b, T2.16b, T3.16b; \
+ eor T4.16b, T4.16b, T5.16b; \
+ sm4e b0.4s, v29.4s; \
+ sm4e b1.4s, v29.4s; \
+ sm4e b2.4s, v29.4s; \
+ ext T1.16b, RZERO.16b, T0.16b, #8; \
+ ext T3.16b, RZERO.16b, T2.16b, #8; \
+ ext T5.16b, RZERO.16b, T4.16b, #8; \
+ sm4e b0.4s, v30.4s; \
+ sm4e b1.4s, v30.4s; \
+ sm4e b2.4s, v30.4s; \
+ ext T0.16b, T0.16b, RZERO.16b, #8; \
+ ext T2.16b, T2.16b, RZERO.16b, #8; \
+ ext T4.16b, T4.16b, RZERO.16b, #8; \
+ sm4e b0.4s, v31.4s; \
+ sm4e b1.4s, v31.4s; \
+ sm4e b2.4s, v31.4s; \
+ eor r0.16b, r0.16b, T1.16b; \
+ eor r2.16b, r2.16b, T3.16b; \
+ eor r4.16b, r4.16b, T5.16b; \
+ rev64 b0.4s, b0.4s; \
+ rev64 b1.4s, b1.4s; \
+ rev64 b2.4s, b2.4s; \
+ eor r1.16b, r1.16b, T0.16b; \
+ eor r3.16b, r3.16b, T2.16b; \
+ eor r5.16b, r5.16b, T4.16b; \
+ ext b0.16b, b0.16b, b0.16b, #8; \
+ ext b1.16b, b1.16b, b1.16b, #8; \
+ ext b2.16b, b2.16b, b2.16b, #8; \
+ eor r0.16b, r0.16b, r2.16b; \
+ eor r1.16b, r1.16b, r3.16b; \
+ rev32 b0.16b, b0.16b; \
+ rev32 b1.16b, b1.16b; \
+ rev32 b2.16b, b2.16b; \
+ eor r0.16b, r0.16b, r4.16b; \
+ eor r1.16b, r1.16b, r5.16b;
+
+#define inc32_le128(vctr) \
+ mov vctr.d[1], x9; \
+ add w6, w9, #1; \
+ mov vctr.d[0], x8; \
+ bfi x9, x6, #0, #32; \
+ rev64 vctr.16b, vctr.16b;
+
+#define GTAG_HASH_LENGTHS(vctr0, vlen) \
+ ld1 {vlen.16b}, [x7]; \
+ /* construct CTR0 */ \
+ /* the low 32 bits of the initial counter block are always be32(1) */ \
+ mov x6, #0x1; \
+ bfi x9, x6, #0, #32; \
+ mov vctr0.d[0], x8; \
+ mov vctr0.d[1], x9; \
+ rbit vlen.16b, vlen.16b; \
+ rev64 vctr0.16b, vctr0.16b; \
+ /* authtag = GCTR(CTR0, GHASH) */ \
+ eor RHASH.16b, RHASH.16b, vlen.16b; \
+ SM4_CRYPT_PMUL_128x128_BLK(vctr0, RR0, RR1, RHASH, RH1, \
+ RTMP0, RTMP1); \
+ REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP2, RTMP3); \
+ rbit RHASH.16b, RHASH.16b; \
+ eor RHASH.16b, RHASH.16b, vctr0.16b;
+
+
+/* Register macros for encrypt and ghash */
+
+/* can be the same as input v0-v3 */
+#define RR1 v0
+#define RR3 v1
+#define RR5 v2
+#define RR7 v3
+
+#define RR0 v4
+#define RR2 v5
+#define RR4 v6
+#define RR6 v7
+
+#define RTMP0 v8
+#define RTMP1 v9
+#define RTMP2 v10
+#define RTMP3 v11
+#define RTMP4 v12
+#define RTMP5 v13
+#define RTMP6 v14
+#define RTMP7 v15
+
+#define RH1 v16
+#define RH2 v17
+#define RH3 v18
+#define RH4 v19
+
+.align 3
+SYM_FUNC_START(sm4_ce_pmull_ghash_setup)
+ /* input:
+ * x0: round key array, CTX
+ * x1: ghash table
+ */
+ SM4_PREPARE(x0)
+
+ adr_l x2, .Lghash_rconst
+ ld1r {RRCONST.2d}, [x2]
+
+ eor RZERO.16b, RZERO.16b, RZERO.16b
+
+ /* H = E(K, 0^128) */
+ rev32 v0.16b, RZERO.16b
+ SM4_CRYPT_BLK_BE(v0)
+
+ /* H ^ 1 */
+ rbit RH1.16b, v0.16b
+
+ /* H ^ 2 */
+ PMUL_128x128(RR0, RR1, RH1, RH1, RTMP0, RTMP1)
+ REDUCTION(RH2, RR0, RR1, RRCONST, RTMP2, RTMP3)
+
+ /* H ^ 3 */
+ PMUL_128x128(RR0, RR1, RH2, RH1, RTMP0, RTMP1)
+ REDUCTION(RH3, RR0, RR1, RRCONST, RTMP2, RTMP3)
+
+ /* H ^ 4 */
+ PMUL_128x128(RR0, RR1, RH2, RH2, RTMP0, RTMP1)
+ REDUCTION(RH4, RR0, RR1, RRCONST, RTMP2, RTMP3)
+
+ st1 {RH1.16b-RH4.16b}, [x1]
+
+ ret
+SYM_FUNC_END(sm4_ce_pmull_ghash_setup)
+
+.align 3
+SYM_FUNC_START(pmull_ghash_update)
+ /* input:
+ * x0: ghash table
+ * x1: ghash result
+ * x2: src
+ * w3: nblocks
+ */
+ ld1 {RH1.16b-RH4.16b}, [x0]
+
+ ld1 {RHASH.16b}, [x1]
+ rbit RHASH.16b, RHASH.16b
+
+ adr_l x4, .Lghash_rconst
+ ld1r {RRCONST.2d}, [x4]
+
+ eor RZERO.16b, RZERO.16b, RZERO.16b
+
+.Lghash_loop_4x:
+ cmp w3, #4
+ blt .Lghash_loop_1x
+
+ sub w3, w3, #4
+
+ ld1 {v0.16b-v3.16b}, [x2], #64
+
+ rbit v0.16b, v0.16b
+ rbit v1.16b, v1.16b
+ rbit v2.16b, v2.16b
+ rbit v3.16b, v3.16b
+
+ /*
+ * (in0 ^ HASH) * H^4 => rr0:rr1
+ * (in1) * H^3 => rr2:rr3
+ * (in2) * H^2 => rr4:rr5
+ * (in3) * H^1 => rr6:rr7
+ */
+ eor RHASH.16b, RHASH.16b, v0.16b
+
+ PMUL_128x128_4x(RR0, RR1, RHASH, RH4, RTMP0, RTMP1,
+ RR2, RR3, v1, RH3, RTMP2, RTMP3,
+ RR4, RR5, v2, RH2, RTMP4, RTMP5,
+ RR6, RR7, v3, RH1, RTMP6, RTMP7)
+
+ eor RR0.16b, RR0.16b, RR2.16b
+ eor RR1.16b, RR1.16b, RR3.16b
+ eor RR0.16b, RR0.16b, RR4.16b
+ eor RR1.16b, RR1.16b, RR5.16b
+ eor RR0.16b, RR0.16b, RR6.16b
+ eor RR1.16b, RR1.16b, RR7.16b
+
+ REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP0, RTMP1)
+
+ cbz w3, .Lghash_end
+ b .Lghash_loop_4x
+
+.Lghash_loop_1x:
+ sub w3, w3, #1
+
+ ld1 {v0.16b}, [x2], #16
+ rbit v0.16b, v0.16b
+ eor RHASH.16b, RHASH.16b, v0.16b
+
+ PMUL_128x128(RR0, RR1, RHASH, RH1, RTMP0, RTMP1)
+ REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP2, RTMP3)
+
+ cbnz w3, .Lghash_loop_1x
+
+.Lghash_end:
+ rbit RHASH.16b, RHASH.16b
+ st1 {RHASH.2d}, [x1]
+
+ ret
+SYM_FUNC_END(pmull_ghash_update)
+
+.align 3
+SYM_FUNC_START(sm4_ce_pmull_gcm_enc)
+ /* input:
+ * x0: round key array, CTX
+ * x1: dst
+ * x2: src
+ * x3: ctr (big endian, 128 bit)
+ * w4: nbytes
+ * x5: ghash result
+ * x6: ghash table
+ * x7: lengths (only for last block)
+ */
+ SM4_PREPARE(x0)
+
+ ldp x8, x9, [x3]
+ rev x8, x8
+ rev x9, x9
+
+ ld1 {RH1.16b-RH4.16b}, [x6]
+
+ ld1 {RHASH.16b}, [x5]
+ rbit RHASH.16b, RHASH.16b
+
+ adr_l x6, .Lghash_rconst
+ ld1r {RRCONST.2d}, [x6]
+
+ eor RZERO.16b, RZERO.16b, RZERO.16b
+
+ cbz w4, .Lgcm_enc_hash_len
+
+.Lgcm_enc_loop_4x:
+ cmp w4, #(4 * 16)
+ blt .Lgcm_enc_loop_1x
+
+ sub w4, w4, #(4 * 16)
+
+ /* construct CTRs */
+ inc32_le128(v0) /* +0 */
+ inc32_le128(v1) /* +1 */
+ inc32_le128(v2) /* +2 */
+ inc32_le128(v3) /* +3 */
+
+ ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64
+
+ SM4_CRYPT_BLK4(v0, v1, v2, v3)
+
+ eor v0.16b, v0.16b, RTMP0.16b
+ eor v1.16b, v1.16b, RTMP1.16b
+ eor v2.16b, v2.16b, RTMP2.16b
+ eor v3.16b, v3.16b, RTMP3.16b
+ st1 {v0.16b-v3.16b}, [x1], #64
+
+ /* ghash update */
+
+ rbit v0.16b, v0.16b
+ rbit v1.16b, v1.16b
+ rbit v2.16b, v2.16b
+ rbit v3.16b, v3.16b
+
+ /*
+ * (in0 ^ HASH) * H^4 => rr0:rr1
+ * (in1) * H^3 => rr2:rr3
+ * (in2) * H^2 => rr4:rr5
+ * (in3) * H^1 => rr6:rr7
+ */
+ eor RHASH.16b, RHASH.16b, v0.16b
+
+ PMUL_128x128_4x(RR0, RR1, RHASH, RH4, RTMP0, RTMP1,
+ RR2, RR3, v1, RH3, RTMP2, RTMP3,
+ RR4, RR5, v2, RH2, RTMP4, RTMP5,
+ RR6, RR7, v3, RH1, RTMP6, RTMP7)
+
+ eor RR0.16b, RR0.16b, RR2.16b
+ eor RR1.16b, RR1.16b, RR3.16b
+ eor RR0.16b, RR0.16b, RR4.16b
+ eor RR1.16b, RR1.16b, RR5.16b
+ eor RR0.16b, RR0.16b, RR6.16b
+ eor RR1.16b, RR1.16b, RR7.16b
+
+ REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP0, RTMP1)
+
+ cbz w4, .Lgcm_enc_hash_len
+ b .Lgcm_enc_loop_4x
+
+.Lgcm_enc_loop_1x:
+ cmp w4, #16
+ blt .Lgcm_enc_tail
+
+ sub w4, w4, #16
+
+ /* construct CTRs */
+ inc32_le128(v0)
+
+ ld1 {RTMP0.16b}, [x2], #16
+
+ SM4_CRYPT_BLK(v0)
+
+ eor v0.16b, v0.16b, RTMP0.16b
+ st1 {v0.16b}, [x1], #16
+
+ /* ghash update */
+ rbit v0.16b, v0.16b
+ eor RHASH.16b, RHASH.16b, v0.16b
+ PMUL_128x128(RR0, RR1, RHASH, RH1, RTMP0, RTMP1)
+ REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP2, RTMP3)
+
+ cbz w4, .Lgcm_enc_hash_len
+ b .Lgcm_enc_loop_1x
+
+.Lgcm_enc_tail:
+ /* construct CTRs */
+ inc32_le128(v0)
+ SM4_CRYPT_BLK(v0)
+
+ /* load permute table */
+ adr_l x0, .Lcts_permute_table
+ add x0, x0, #32
+ sub x0, x0, w4, uxtw
+ ld1 {v3.16b}, [x0]
+
+.Lgcm_enc_tail_loop:
+ /* do encrypt */
+ ldrb w0, [x2], #1 /* get 1 byte from input */
+ umov w6, v0.b[0] /* get the next keystream byte */
+ eor w6, w6, w0 /* w6 = keystream ^ input */
+ strb w6, [x1], #1 /* store output byte */
+
+ /* rotate right by one byte */
+ ext v0.16b, v0.16b, v0.16b, #1
+ /* place the latest ciphertext byte in the top byte */
+ ins v0.b[15], w6
+
+ subs w4, w4, #1
+ bne .Lgcm_enc_tail_loop
+
+ /* padding last block with zeros */
+ tbl v0.16b, {v0.16b}, v3.16b
+
+ /* ghash update */
+ rbit v0.16b, v0.16b
+ eor RHASH.16b, RHASH.16b, v0.16b
+ PMUL_128x128(RR0, RR1, RHASH, RH1, RTMP0, RTMP1)
+ REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP2, RTMP3)
+
+.Lgcm_enc_hash_len:
+ cbz x7, .Lgcm_enc_end
+
+ GTAG_HASH_LENGTHS(v1, v3)
+
+ b .Lgcm_enc_ret
+
+.Lgcm_enc_end:
+ /* store new CTR */
+ rev x8, x8
+ rev x9, x9
+ stp x8, x9, [x3]
+
+ rbit RHASH.16b, RHASH.16b
+
+.Lgcm_enc_ret:
+ /* store new MAC */
+ st1 {RHASH.2d}, [x5]
+
+ ret
+SYM_FUNC_END(sm4_ce_pmull_gcm_enc)
+
+#undef RR1
+#undef RR3
+#undef RR5
+#undef RR7
+#undef RR0
+#undef RR2
+#undef RR4
+#undef RR6
+#undef RTMP0
+#undef RTMP1
+#undef RTMP2
+#undef RTMP3
+#undef RTMP4
+#undef RTMP5
+#undef RTMP6
+#undef RTMP7
+#undef RH1
+#undef RH2
+#undef RH3
+#undef RH4
+
+
+/* Register macros for decrypt */
+
+/* v0-v2 for building CTRs, v3-v5 for saving inputs */
+
+#define RR1 v6
+#define RR3 v7
+#define RR5 v8
+
+#define RR0 v9
+#define RR2 v10
+#define RR4 v11
+
+#define RTMP0 v12
+#define RTMP1 v13
+#define RTMP2 v14
+#define RTMP3 v15
+#define RTMP4 v16
+#define RTMP5 v17
+
+#define RH1 v18
+#define RH2 v19
+#define RH3 v20
+
+.align 3
+SYM_FUNC_START(sm4_ce_pmull_gcm_dec)
+ /* input:
+ * x0: round key array, CTX
+ * x1: dst
+ * x2: src
+ * x3: ctr (big endian, 128 bit)
+ * w4: nbytes
+ * x5: ghash result
+ * x6: ghash table
+ * x7: lengths (only for last block)
+ */
+ SM4_PREPARE(x0)
+
+ ldp x8, x9, [x3]
+ rev x8, x8
+ rev x9, x9
+
+ ld1 {RH1.16b-RH3.16b}, [x6]
+
+ ld1 {RHASH.16b}, [x5]
+ rbit RHASH.16b, RHASH.16b
+
+ adr_l x6, .Lghash_rconst
+ ld1r {RRCONST.2d}, [x6]
+
+ eor RZERO.16b, RZERO.16b, RZERO.16b
+
+ cbz w4, .Lgcm_dec_hash_len
+
+.Lgcm_dec_loop_3x:
+ cmp w4, #(3 * 16)
+ blt .Lgcm_dec_loop_1x
+
+ sub w4, w4, #(3 * 16)
+
+ ld1 {v3.16b-v5.16b}, [x2], #(3 * 16)
+
+ /* construct CTRs */
+ inc32_le128(v0) /* +0 */
+ rbit v6.16b, v3.16b
+ inc32_le128(v1) /* +1 */
+ rbit v7.16b, v4.16b
+ inc32_le128(v2) /* +2 */
+ rbit v8.16b, v5.16b
+
+ eor RHASH.16b, RHASH.16b, v6.16b
+
+ /* decrypt & ghash update */
+ SM4_CRYPT_PMUL_128x128_BLK3(v0, v1, v2,
+ RR0, RR1, RHASH, RH3, RTMP0, RTMP1,
+ RR2, RR3, v7, RH2, RTMP2, RTMP3,
+ RR4, RR5, v8, RH1, RTMP4, RTMP5)
+
+ eor v0.16b, v0.16b, v3.16b
+ eor v1.16b, v1.16b, v4.16b
+ eor v2.16b, v2.16b, v5.16b
+
+ REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP0, RTMP1)
+
+ st1 {v0.16b-v2.16b}, [x1], #(3 * 16)
+
+ cbz w4, .Lgcm_dec_hash_len
+ b .Lgcm_dec_loop_3x
+
+.Lgcm_dec_loop_1x:
+ cmp w4, #16
+ blt .Lgcm_dec_tail
+
+ sub w4, w4, #16
+
+ ld1 {v3.16b}, [x2], #16
+
+ /* construct CTRs */
+ inc32_le128(v0)
+ rbit v6.16b, v3.16b
+
+ eor RHASH.16b, RHASH.16b, v6.16b
+
+ SM4_CRYPT_PMUL_128x128_BLK(v0, RR0, RR1, RHASH, RH1, RTMP0, RTMP1)
+
+ eor v0.16b, v0.16b, v3.16b
+
+ REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP2, RTMP3)
+
+ st1 {v0.16b}, [x1], #16
+
+ cbz w4, .Lgcm_dec_hash_len
+ b .Lgcm_dec_loop_1x
+
+.Lgcm_dec_tail:
+ /* construct CTRs */
+ inc32_le128(v0)
+ SM4_CRYPT_BLK(v0)
+
+ /* load permute table */
+ adr_l x0, .Lcts_permute_table
+ add x0, x0, #32
+ sub x0, x0, w4, uxtw
+ ld1 {v3.16b}, [x0]
+
+.Lgcm_dec_tail_loop:
+ /* do decrypt */
+ ldrb w0, [x2], #1 /* get 1 byte from input */
+ umov w6, v0.b[0] /* get the next keystream byte */
+ eor w6, w6, w0 /* w6 = keystream ^ input */
+ strb w6, [x1], #1 /* store output byte */
+
+ /* rotate right by one byte */
+ ext v0.16b, v0.16b, v0.16b, #1
+ /* place the latest ciphertext byte in the top byte */
+ ins v0.b[15], w0
+
+ subs w4, w4, #1
+ bne .Lgcm_dec_tail_loop
+
+ /* padding last block with zeros */
+ tbl v0.16b, {v0.16b}, v3.16b
+
+ /* ghash update */
+ rbit v0.16b, v0.16b
+ eor RHASH.16b, RHASH.16b, v0.16b
+ PMUL_128x128(RR0, RR1, RHASH, RH1, RTMP0, RTMP1)
+ REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP2, RTMP3)
+
+.Lgcm_dec_hash_len:
+ cbz x7, .Lgcm_dec_end
+
+ GTAG_HASH_LENGTHS(v1, v3)
+
+ b .Lgcm_dec_ret
+
+.Lgcm_dec_end:
+ /* store new CTR */
+ rev x8, x8
+ rev x9, x9
+ stp x8, x9, [x3]
+
+ rbit RHASH.16b, RHASH.16b
+
+.Lgcm_dec_ret:
+ /* store new MAC */
+ st1 {RHASH.2d}, [x5]
+
+ ret
+SYM_FUNC_END(sm4_ce_pmull_gcm_dec)
+
+ .section ".rodata", "a"
+ .align 4
+.Lcts_permute_table:
+ .byte 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+ .byte 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+ .byte 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7
+ .byte 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf
+ .byte 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+ .byte 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+
+.Lghash_rconst:
+ .quad 0x87
diff --git a/arch/arm64/crypto/sm4-ce-gcm-glue.c b/arch/arm64/crypto/sm4-ce-gcm-glue.c
new file mode 100644
index 000000000000..e90ea0f17beb
--- /dev/null
+++ b/arch/arm64/crypto/sm4-ce-gcm-glue.c
@@ -0,0 +1,286 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * SM4-GCM AEAD Algorithm using ARMv8 Crypto Extensions
+ * as specified in rfc8998
+ * https://datatracker.ietf.org/doc/html/rfc8998
+ *
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#include <linux/module.h>
+#include <linux/crypto.h>
+#include <linux/kernel.h>
+#include <linux/cpufeature.h>
+#include <asm/neon.h>
+#include <crypto/b128ops.h>
+#include <crypto/scatterwalk.h>
+#include <crypto/internal/aead.h>
+#include <crypto/internal/skcipher.h>
+#include <crypto/sm4.h>
+#include "sm4-ce.h"
+
+asmlinkage void sm4_ce_pmull_ghash_setup(const u32 *rkey_enc, u8 *ghash_table);
+asmlinkage void pmull_ghash_update(const u8 *ghash_table, u8 *ghash,
+ const u8 *src, unsigned int nblocks);
+asmlinkage void sm4_ce_pmull_gcm_enc(const u32 *rkey_enc, u8 *dst,
+ const u8 *src, u8 *iv,
+ unsigned int nbytes, u8 *ghash,
+ const u8 *ghash_table, const u8 *lengths);
+asmlinkage void sm4_ce_pmull_gcm_dec(const u32 *rkey_enc, u8 *dst,
+ const u8 *src, u8 *iv,
+ unsigned int nbytes, u8 *ghash,
+ const u8 *ghash_table, const u8 *lengths);
+
+#define GHASH_BLOCK_SIZE 16
+#define GCM_IV_SIZE 12
+
+struct sm4_gcm_ctx {
+ struct sm4_ctx key;
+ u8 ghash_table[16 * 4];
+};
+
+
+static int gcm_setkey(struct crypto_aead *tfm, const u8 *key,
+ unsigned int key_len)
+{
+ struct sm4_gcm_ctx *ctx = crypto_aead_ctx(tfm);
+
+ if (key_len != SM4_KEY_SIZE)
+ return -EINVAL;
+
+ kernel_neon_begin();
+
+ sm4_ce_expand_key(key, ctx->key.rkey_enc, ctx->key.rkey_dec,
+ crypto_sm4_fk, crypto_sm4_ck);
+ sm4_ce_pmull_ghash_setup(ctx->key.rkey_enc, ctx->ghash_table);
+
+ kernel_neon_end();
+ return 0;
+}
+
+static int gcm_setauthsize(struct crypto_aead *tfm, unsigned int authsize)
+{
+ switch (authsize) {
+ case 4:
+ case 8:
+ case 12 ... 16:
+ return 0;
+ default:
+ return -EINVAL;
+ }
+}
+
+static void gcm_calculate_auth_mac(struct aead_request *req, u8 ghash[])
+{
+ struct crypto_aead *aead = crypto_aead_reqtfm(req);
+ struct sm4_gcm_ctx *ctx = crypto_aead_ctx(aead);
+ u8 __aligned(8) buffer[GHASH_BLOCK_SIZE];
+ u32 assoclen = req->assoclen;
+ struct scatter_walk walk;
+ unsigned int buflen = 0;
+
+ scatterwalk_start(&walk, req->src);
+
+ do {
+ u32 n = scatterwalk_clamp(&walk, assoclen);
+ u8 *p, *ptr;
+
+ if (!n) {
+ scatterwalk_start(&walk, sg_next(walk.sg));
+ n = scatterwalk_clamp(&walk, assoclen);
+ }
+
+ p = ptr = scatterwalk_map(&walk);
+ assoclen -= n;
+ scatterwalk_advance(&walk, n);
+
+ if (n + buflen < GHASH_BLOCK_SIZE) {
+ memcpy(&buffer[buflen], ptr, n);
+ buflen += n;
+ } else {
+ unsigned int nblocks;
+
+ if (buflen) {
+ unsigned int l = GHASH_BLOCK_SIZE - buflen;
+
+ memcpy(&buffer[buflen], ptr, l);
+ ptr += l;
+ n -= l;
+
+ pmull_ghash_update(ctx->ghash_table, ghash,
+ buffer, 1);
+ }
+
+ nblocks = n / GHASH_BLOCK_SIZE;
+ if (nblocks) {
+ pmull_ghash_update(ctx->ghash_table, ghash,
+ ptr, nblocks);
+ ptr += nblocks * GHASH_BLOCK_SIZE;
+ }
+
+ buflen = n % GHASH_BLOCK_SIZE;
+ if (buflen)
+ memcpy(&buffer[0], ptr, buflen);
+ }
+
+ scatterwalk_unmap(p);
+ scatterwalk_done(&walk, 0, assoclen);
+ } while (assoclen);
+
+ /* zero-pad the final partial block */
+ if (buflen) {
+ memset(&buffer[buflen], 0, GHASH_BLOCK_SIZE - buflen);
+ pmull_ghash_update(ctx->ghash_table, ghash, buffer, 1);
+ }
+}
+
+static int gcm_crypt(struct aead_request *req, struct skcipher_walk *walk,
+ struct sm4_gcm_ctx *ctx, u8 ghash[],
+ void (*sm4_ce_pmull_gcm_crypt)(const u32 *rkey_enc,
+ u8 *dst, const u8 *src, u8 *iv,
+ unsigned int nbytes, u8 *ghash,
+ const u8 *ghash_table, const u8 *lengths))
+{
+ u8 __aligned(8) iv[SM4_BLOCK_SIZE];
+ be128 __aligned(8) lengths;
+ int err;
+
+ memset(ghash, 0, SM4_BLOCK_SIZE);
+
+ lengths.a = cpu_to_be64(req->assoclen * 8);
+ lengths.b = cpu_to_be64(walk->total * 8);
+
+ memcpy(iv, walk->iv, GCM_IV_SIZE);
+ put_unaligned_be32(2, iv + GCM_IV_SIZE);
+
+ kernel_neon_begin();
+
+ if (req->assoclen)
+ gcm_calculate_auth_mac(req, ghash);
+
+ do {
+ unsigned int tail = walk->nbytes % SM4_BLOCK_SIZE;
+ const u8 *src = walk->src.virt.addr;
+ u8 *dst = walk->dst.virt.addr;
+
+ if (walk->nbytes == walk->total) {
+ tail = 0;
+
+ sm4_ce_pmull_gcm_crypt(ctx->key.rkey_enc, dst, src, iv,
+ walk->nbytes, ghash,
+ ctx->ghash_table,
+ (const u8 *)&lengths);
+ } else if (walk->nbytes - tail) {
+ sm4_ce_pmull_gcm_crypt(ctx->key.rkey_enc, dst, src, iv,
+ walk->nbytes - tail, ghash,
+ ctx->ghash_table, NULL);
+ }
+
+ kernel_neon_end();
+
+ err = skcipher_walk_done(walk, tail);
+ if (err)
+ return err;
+ if (walk->nbytes)
+ kernel_neon_begin();
+ } while (walk->nbytes > 0);
+
+ return 0;
+}
+
+static int gcm_encrypt(struct aead_request *req)
+{
+ struct crypto_aead *aead = crypto_aead_reqtfm(req);
+ struct sm4_gcm_ctx *ctx = crypto_aead_ctx(aead);
+ u8 __aligned(8) ghash[SM4_BLOCK_SIZE];
+ struct skcipher_walk walk;
+ int err;
+
+ err = skcipher_walk_aead_encrypt(&walk, req, false);
+ if (err)
+ return err;
+
+ err = gcm_crypt(req, &walk, ctx, ghash, sm4_ce_pmull_gcm_enc);
+ if (err)
+ return err;
+
+ /* copy authtag to end of dst */
+ scatterwalk_map_and_copy(ghash, req->dst, req->assoclen + req->cryptlen,
+ crypto_aead_authsize(aead), 1);
+
+ return 0;
+}
+
+static int gcm_decrypt(struct aead_request *req)
+{
+ struct crypto_aead *aead = crypto_aead_reqtfm(req);
+ unsigned int authsize = crypto_aead_authsize(aead);
+ struct sm4_gcm_ctx *ctx = crypto_aead_ctx(aead);
+ u8 __aligned(8) ghash[SM4_BLOCK_SIZE];
+ u8 authtag[SM4_BLOCK_SIZE];
+ struct skcipher_walk walk;
+ int err;
+
+ err = skcipher_walk_aead_decrypt(&walk, req, false);
+ if (err)
+ return err;
+
+ err = gcm_crypt(req, &walk, ctx, ghash, sm4_ce_pmull_gcm_dec);
+ if (err)
+ return err;
+
+ /* compare calculated auth tag with the stored one */
+ scatterwalk_map_and_copy(authtag, req->src,
+ req->assoclen + req->cryptlen - authsize,
+ authsize, 0);
+
+ if (crypto_memneq(authtag, ghash, authsize))
+ return -EBADMSG;
+
+ return 0;
+}
+
+static struct aead_alg sm4_gcm_alg = {
+ .base = {
+ .cra_name = "gcm(sm4)",
+ .cra_driver_name = "gcm-sm4-ce",
+ .cra_priority = 400,
+ .cra_blocksize = 1,
+ .cra_ctxsize = sizeof(struct sm4_gcm_ctx),
+ .cra_module = THIS_MODULE,
+ },
+ .ivsize = GCM_IV_SIZE,
+ .chunksize = SM4_BLOCK_SIZE,
+ .maxauthsize = SM4_BLOCK_SIZE,
+ .setkey = gcm_setkey,
+ .setauthsize = gcm_setauthsize,
+ .encrypt = gcm_encrypt,
+ .decrypt = gcm_decrypt,
+};
+
+static int __init sm4_ce_gcm_init(void)
+{
+ if (!cpu_have_named_feature(PMULL))
+ return -ENODEV;
+
+ return crypto_register_aead(&sm4_gcm_alg);
+}
+
+static void __exit sm4_ce_gcm_exit(void)
+{
+ crypto_unregister_aead(&sm4_gcm_alg);
+}
+
+static const struct cpu_feature sm4_ce_gcm_cpu_feature[] = {
+ { cpu_feature(PMULL) },
+ {}
+};
+MODULE_DEVICE_TABLE(cpu, sm4_ce_gcm_cpu_feature);
+
+module_cpu_feature_match(SM4, sm4_ce_gcm_init);
+module_exit(sm4_ce_gcm_exit);
+
+MODULE_DESCRIPTION("Synchronous SM4 in GCM mode using ARMv8 Crypto Extensions");
+MODULE_ALIAS_CRYPTO("gcm(sm4)");
+MODULE_AUTHOR("Tianjia Zhang <tianjia.zhang@linux.alibaba.com>");
+MODULE_LICENSE("GPL v2");
--
2.24.3 (Apple Git-128)
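
For readers following gcm_crypt() in the patch above, here is a minimal
userspace sketch of the two 16-byte blocks it prepares before calling into
the assembly: the first payload counter block (12-byte IV followed by a
big-endian 32-bit counter set to 2) and the big-endian lengths block (AAD
bits, then payload bits). The helper names put_be32(), put_be64(),
gcm_first_ctr_block() and gcm_lengths_block() are hypothetical and not part
of the patch. The remark that counter value 1 is reserved for the tag is
the usual GCM convention for 96-bit IVs and is assumed here, since the
assembly that consumes these blocks is not shown in this excerpt.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define GCM_IV_SIZE	12
#define GCM_BLOCK_SIZE	16

/* Hypothetical helper: store a 32-bit value in big-endian byte order. */
static void put_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
}

/* Hypothetical helper: store a 64-bit value in big-endian byte order. */
static void put_be64(uint8_t *p, uint64_t v)
{
	put_be32(p, (uint32_t)(v >> 32));
	put_be32(p + 4, (uint32_t)v);
}

/*
 * First counter block used for the payload: IV || 0x00000002.
 * Assumption: counter value 1 is kept for encrypting the tag, as in
 * standard GCM with a 96-bit IV.
 */
static void gcm_first_ctr_block(uint8_t ctr[GCM_BLOCK_SIZE],
				const uint8_t iv[GCM_IV_SIZE])
{
	memcpy(ctr, iv, GCM_IV_SIZE);
	put_be32(ctr + GCM_IV_SIZE, 2);
}

/* Lengths block hashed last: bit lengths of the AAD and of the payload. */
static void gcm_lengths_block(uint8_t out[GCM_BLOCK_SIZE],
			      uint64_t assoclen, uint64_t cryptlen)
{
	/* The byte counts must not overflow when converted to bits. */
	assert(assoclen <= UINT64_MAX / 8 && cryptlen <= UINT64_MAX / 8);

	put_be64(out, assoclen * 8);
	put_be64(out + 8, cryptlen * 8);
}
```

In the patch itself the same values are built with put_unaligned_be32() and
cpu_to_be64(). The sketch only makes the byte layout explicit.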
* [PATCH 16/16] crypto: arm64/sm4 - add ARMv9 SVE cryptography acceleration implementation
2022-09-26 9:36 ` Tianjia Zhang
@ 2022-09-26 9:36 ` Tianjia Zhang
-1 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26 9:36 UTC (permalink / raw)
To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
linux-stm32
Scalable Vector Extension (SVE) is the next-generation SIMD extension for
arm64. SVE allows flexible vector length implementations with a range of
possible values in CPU implementations. The vector length can vary from a
minimum of 128 bits up to a maximum of 2048 bits, at 128-bit increments.
The SVE design guarantees that the same application can run on different
implementations that support SVE, without the need to recompile the code.
SVE was originally introduced by ARMv8, and ARMv9 introduced SVE2 to
expand and improve it. Just as NEON has the optional Crypto Extension
instructions, SVE has an optional set of equivalent instructions, called
the cryptography acceleration instructions.
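
As a rough illustration of what a variable vector length means for this
driver, the sketch below models in plain C how a request of nblocks SM4
blocks would be split across the 8-vector, 4-vector and single-vector SVE
paths, with any remainder handled one block at a time by the Crypto
Extension path. It follows the branch order of sm4_sve_ce_crypt as read
from this patch. The struct and function names are hypothetical, and this
is a model of the dispatch only, not of the cipher.

```c
#include <assert.h>

/* How often each path would run for one call (illustrative only). */
struct sm4_sve_split {
	unsigned int iters_8x;	/* passes over 8 full vectors */
	unsigned int iters_4x;	/* passes over 4 full vectors (0 or 1) */
	unsigned int iters_1x;	/* passes over 1 full vector */
	unsigned int ce_blocks;	/* leftover 16-byte blocks, one at a time */
};

static struct sm4_sve_split sm4_sve_split(unsigned int nblocks,
					  unsigned int vl_bytes)
{
	struct sm4_sve_split s = { 0, 0, 0, 0 };
	unsigned int bpv;

	/* SVE vector lengths run from 128 to 2048 bits in 128-bit steps. */
	assert(vl_bytes >= 16 && vl_bytes <= 256 && vl_bytes % 16 == 0);

	bpv = vl_bytes / 16;		/* SM4 blocks held by one vector */

	s.iters_8x = nblocks / (8 * bpv);
	nblocks %= 8 * bpv;

	/* The 4-vector path runs at most once per call. */
	if (nblocks >= 4 * bpv) {
		s.iters_4x = 1;
		nblocks -= 4 * bpv;
	}

	s.iters_1x = nblocks / bpv;
	s.ce_blocks = nblocks % bpv;	/* fewer blocks than one vector */

	return s;
}
```

For example, with a 256-bit VL (32 bytes, two blocks per vector), 45 blocks
split into two 8-vector passes, one 4-vector pass, two single-vector passes
and one leftover block. With a 128-bit VL the leftover count is always zero.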
This patch uses the SM4 cryptography acceleration instructions together
with SVE2 instructions to optimize the SM4 algorithm in ECB/CBC/CFB/CTR
modes. Since CBC/CFB encryption cannot be parallelized, the Crypto
Extension instructions are used for it.
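
The asymmetry between CBC encryption and decryption can be seen in a small
userspace sketch. It uses a toy, invertible byte transform as a stand-in
for the block cipher (explicitly NOT SM4), and all names in it are
hypothetical. Encryption has to walk the blocks in order because each
ciphertext block feeds the next one, while every plaintext block of the
decryption depends only on two ciphertext blocks that are already known, so
blocks can be decrypted independently and in any order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLK 16

/* Toy stand-in for a block cipher. This is NOT SM4. */
static void toy_encrypt(uint8_t *b, const uint8_t *key)
{
	for (int i = 0; i < BLK; i++)
		b[i] = (uint8_t)((b[i] ^ key[i]) + 0x3b);
}

static void toy_decrypt(uint8_t *b, const uint8_t *key)
{
	for (int i = 0; i < BLK; i++)
		b[i] = (uint8_t)((uint8_t)(b[i] - 0x3b) ^ key[i]);
}

/* Serial by construction: block i needs the ciphertext of block i - 1. */
static void cbc_encrypt(uint8_t *dst, const uint8_t *src, const uint8_t *iv,
			const uint8_t *key, size_t nblocks)
{
	const uint8_t *prev = iv;

	for (size_t i = 0; i < nblocks; i++) {
		for (int j = 0; j < BLK; j++)
			dst[i * BLK + j] = src[i * BLK + j] ^ prev[j];
		toy_encrypt(&dst[i * BLK], key);
		prev = &dst[i * BLK];
	}
}

/*
 * Decrypt block i on its own. Only ciphertext is read, so calls for
 * different i are independent and could run side by side in vector lanes.
 * Out-of-place only: dst must not alias src.
 */
static void cbc_decrypt_block(uint8_t *dst, const uint8_t *src,
			      const uint8_t *iv, const uint8_t *key, size_t i)
{
	const uint8_t *prev = i ? &src[(i - 1) * BLK] : iv;
	uint8_t tmp[BLK];

	assert(dst != src);

	memcpy(tmp, &src[i * BLK], BLK);
	toy_decrypt(tmp, key);
	for (int j = 0; j < BLK; j++)
		dst[i * BLK + j] = tmp[j] ^ prev[j];
}
```

Calling cbc_decrypt_block() for the blocks in reverse order still recovers
the plaintext, which is the property the vectorized decryption paths rely
on. The same reasoning applies to CFB.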
No test environment with a Vector Length (VL) greater than 128 bits was
available, so the performance data below was obtained on a machine with a
VL of 128 bits. Because this driver is only enabled when the VL is greater
than 128 bits, these numbers are for reference only. The data shows little
difference between the Crypto Extension and SVE (VL = 128 bits)
implementations; the gain is expected to be more noticeable at VL = 256
bits or longer.
Benchmark on T-Head Yitian-710 2.75 GHz. The data comes from mode 218 of
tcrypt and is compared with the Crypto Extension implementation. The
columns are block sizes in bytes. The data is tabulated and the unit is
Mb/s:
sm4-ce | 16 64 128 256 1024 1420 4096
------------+--------------------------------------------------------------
ECB enc | 315.18 1162.65 1815.66 2553.50 3692.91 3727.20 4001.93
ECB dec | 316.06 1172.97 1817.81 2554.66 3692.18 3786.54 4001.93
CBC enc | 304.82 629.54 768.65 864.72 953.90 963.32 974.06
CBC dec | 306.05 1142.53 1805.11 2481.67 3522.06 3587.87 3790.99
CFB enc | 309.48 635.70 774.44 865.85 950.62 952.68 968.24
CFB dec | 315.98 1170.38 1828.75 2509.72 3543.63 3539.40 3793.25
CTR enc | 285.83 1036.59 1583.50 2147.26 2933.54 2954.66 3041.14
CTR dec | 285.29 1037.47 1584.67 2145.51 2934.10 2950.89 3041.62
sm4-sve-ce (VL = 128 bits)
ECB enc | 310.00 1154.70 1813.26 2579.74 3766.90 3869.45 4100.26
ECB dec | 315.60 1176.22 1838.06 2593.69 3774.95 3878.42 4098.83
CBC enc | 303.44 622.65 764.67 861.40 953.18 963.05 973.77
CBC dec | 302.13 1091.15 1689.10 2267.79 3182.84 3242.68 3408.92
CFB enc | 296.62 620.41 762.94 858.96 948.18 956.04 967.67
CFB dec | 291.23 1065.50 1637.33 2228.12 3158.52 3213.35 3403.83
CTR enc | 272.27 959.35 1466.34 1934.24 2562.80 2595.87 2695.15
CTR dec | 273.40 963.65 1471.83 1938.97 2563.12 2597.25 2694.54
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
arch/arm64/crypto/Kconfig | 19 +
arch/arm64/crypto/Makefile | 3 +
arch/arm64/crypto/sm4-sve-ce-core.S | 1028 +++++++++++++++++++++++++++
arch/arm64/crypto/sm4-sve-ce-glue.c | 332 +++++++++
4 files changed, 1382 insertions(+)
create mode 100644 arch/arm64/crypto/sm4-sve-ce-core.S
create mode 100644 arch/arm64/crypto/sm4-sve-ce-glue.c
diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index 6793d5bc3ee5..bbb5a7a08af5 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -249,6 +249,25 @@ config CRYPTO_SM4_ARM64_CE_BLK
- ARMv8 Crypto Extensions
- NEON (Advanced SIMD) extensions
+config CRYPTO_SM4_ARM64_SVE_CE_BLK
+ tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR (ARMv9 cryptography acceleration with SVE2)"
+ depends on KERNEL_MODE_NEON
+ select CRYPTO_SKCIPHER
+ select CRYPTO_SM4
+ select CRYPTO_SM4_ARM64_CE_BLK
+ help
+ Length-preserving ciphers: SM4 cipher algorithms (OSCCA GB/T 32907-2016)
+ with block cipher modes:
+ - ECB (Electronic Codebook) mode (NIST SP800-38A)
+ - CBC (Cipher Block Chaining) mode (NIST SP800-38A)
+ - CFB (Cipher Feedback) mode (NIST SP800-38A)
+ - CTR (Counter) mode (NIST SP800-38A)
+
+ Architecture: arm64 using:
+ - ARMv8 Crypto Extensions
+ - ARMv9 cryptography acceleration with SVE2
+ - NEON (Advanced SIMD) extensions
+
config CRYPTO_SM4_ARM64_NEON_BLK
tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR (NEON)"
depends on KERNEL_MODE_NEON
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index 4818e204c2ac..355dd9053434 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -38,6 +38,9 @@ sm4-ce-gcm-y := sm4-ce-gcm-glue.o sm4-ce-gcm-core.o
obj-$(CONFIG_CRYPTO_SM4_ARM64_NEON_BLK) += sm4-neon.o
sm4-neon-y := sm4-neon-glue.o sm4-neon-core.o
+obj-$(CONFIG_CRYPTO_SM4_ARM64_SVE_CE_BLK) += sm4-sve-ce.o
+sm4-sve-ce-y := sm4-sve-ce-glue.o sm4-sve-ce-core.o
+
obj-$(CONFIG_CRYPTO_GHASH_ARM64_CE) += ghash-ce.o
ghash-ce-y := ghash-ce-glue.o ghash-ce-core.o
diff --git a/arch/arm64/crypto/sm4-sve-ce-core.S b/arch/arm64/crypto/sm4-sve-ce-core.S
new file mode 100644
index 000000000000..caecbdf2536c
--- /dev/null
+++ b/arch/arm64/crypto/sm4-sve-ce-core.S
@@ -0,0 +1,1028 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * SM4 Cipher Algorithm for ARMv9 Crypto Extensions with SVE2
+ * as specified in
+ * https://tools.ietf.org/id/draft-ribose-cfrg-sm4-10.html
+ *
+ * Copyright (C) 2022, Alibaba Group.
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+
+.arch armv8-a+crypto+sve+sve2
+
+.irp b, 0, 15, 24, 25, 26, 27, 28, 29, 30, 31
+ .set .Lv\b\().4s, \b
+.endr
+
+.irp b, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, \
+ 16, 24, 25, 26, 27, 28, 29, 30, 31
+ .set .Lz\b\().s, \b
+.endr
+
+.macro sm4e, vd, vn
+ .inst 0xcec08400 | (.L\vn << 5) | .L\vd
+.endm
+
+.macro sm4e_sve, zd, zn
+ .inst 0x4523e000 | (.L\zn << 5) | .L\zd
+.endm
+
+
+/* Register macros */
+
+#define RCTR z16
+#define RCTRv v16
+#define RIV z16
+#define RIVv v16
+#define RSWAP128 z17
+#define RZERO z18
+#define RLE128_INC z19
+
+#define RTMP0 z20
+#define RTMP0v v20
+#define RTMP1 z21
+#define RTMP2 z22
+#define RTMP3 z23
+
+
+/* Helper macros. */
+
+#define SM4_PREPARE(ptr) \
+ adr_l x7, .Lbswap128_mask; \
+ ptrue p0.b, ALL; \
+ rdvl x5, #1; \
+ ld1b {RSWAP128.b}, p0/z, [x7]; \
+ \
+ ld1 {v24.16b-v27.16b}, [ptr], #64; \
+ ld1 {v28.16b-v31.16b}, [ptr]; \
+ dup z24.q, z24.q[0]; \
+ dup z25.q, z25.q[0]; \
+ dup z26.q, z26.q[0]; \
+ dup z27.q, z27.q[0]; \
+ dup z28.q, z28.q[0]; \
+ dup z29.q, z29.q[0]; \
+ dup z30.q, z30.q[0]; \
+ dup z31.q, z31.q[0];
+
+#define SM4_SVE_CE_CRYPT_BLK(b0) \
+ revb b0.s, p0/m, b0.s; \
+ sm4e_sve b0.s, z24.s; \
+ sm4e_sve b0.s, z25.s; \
+ sm4e_sve b0.s, z26.s; \
+ sm4e_sve b0.s, z27.s; \
+ sm4e_sve b0.s, z28.s; \
+ sm4e_sve b0.s, z29.s; \
+ sm4e_sve b0.s, z30.s; \
+ sm4e_sve b0.s, z31.s; \
+ tbl b0.b, {b0.b}, RSWAP128.b; \
+ revb b0.s, p0/m, b0.s;
+
+#define SM4_SVE_CE_CRYPT_BLK4(b0, b1, b2, b3) \
+ revb b0.s, p0/m, b0.s; \
+ revb b1.s, p0/m, b1.s; \
+ revb b2.s, p0/m, b2.s; \
+ revb b3.s, p0/m, b3.s; \
+ sm4e_sve b0.s, z24.s; \
+ sm4e_sve b1.s, z24.s; \
+ sm4e_sve b2.s, z24.s; \
+ sm4e_sve b3.s, z24.s; \
+ sm4e_sve b0.s, z25.s; \
+ sm4e_sve b1.s, z25.s; \
+ sm4e_sve b2.s, z25.s; \
+ sm4e_sve b3.s, z25.s; \
+ sm4e_sve b0.s, z26.s; \
+ sm4e_sve b1.s, z26.s; \
+ sm4e_sve b2.s, z26.s; \
+ sm4e_sve b3.s, z26.s; \
+ sm4e_sve b0.s, z27.s; \
+ sm4e_sve b1.s, z27.s; \
+ sm4e_sve b2.s, z27.s; \
+ sm4e_sve b3.s, z27.s; \
+ sm4e_sve b0.s, z28.s; \
+ sm4e_sve b1.s, z28.s; \
+ sm4e_sve b2.s, z28.s; \
+ sm4e_sve b3.s, z28.s; \
+ sm4e_sve b0.s, z29.s; \
+ sm4e_sve b1.s, z29.s; \
+ sm4e_sve b2.s, z29.s; \
+ sm4e_sve b3.s, z29.s; \
+ sm4e_sve b0.s, z30.s; \
+ sm4e_sve b1.s, z30.s; \
+ sm4e_sve b2.s, z30.s; \
+ sm4e_sve b3.s, z30.s; \
+ sm4e_sve b0.s, z31.s; \
+ sm4e_sve b1.s, z31.s; \
+ sm4e_sve b2.s, z31.s; \
+ sm4e_sve b3.s, z31.s; \
+ tbl b0.b, {b0.b}, RSWAP128.b; \
+ tbl b1.b, {b1.b}, RSWAP128.b; \
+ tbl b2.b, {b2.b}, RSWAP128.b; \
+ tbl b3.b, {b3.b}, RSWAP128.b; \
+ revb b0.s, p0/m, b0.s; \
+ revb b1.s, p0/m, b1.s; \
+ revb b2.s, p0/m, b2.s; \
+ revb b3.s, p0/m, b3.s;
+
+#define SM4_SVE_CE_CRYPT_BLK8(b0, b1, b2, b3, b4, b5, b6, b7) \
+ revb b0.s, p0/m, b0.s; \
+ revb b1.s, p0/m, b1.s; \
+ revb b2.s, p0/m, b2.s; \
+ revb b3.s, p0/m, b3.s; \
+ revb b4.s, p0/m, b4.s; \
+ revb b5.s, p0/m, b5.s; \
+ revb b6.s, p0/m, b6.s; \
+ revb b7.s, p0/m, b7.s; \
+ sm4e_sve b0.s, z24.s; \
+ sm4e_sve b1.s, z24.s; \
+ sm4e_sve b2.s, z24.s; \
+ sm4e_sve b3.s, z24.s; \
+ sm4e_sve b4.s, z24.s; \
+ sm4e_sve b5.s, z24.s; \
+ sm4e_sve b6.s, z24.s; \
+ sm4e_sve b7.s, z24.s; \
+ sm4e_sve b0.s, z25.s; \
+ sm4e_sve b1.s, z25.s; \
+ sm4e_sve b2.s, z25.s; \
+ sm4e_sve b3.s, z25.s; \
+ sm4e_sve b4.s, z25.s; \
+ sm4e_sve b5.s, z25.s; \
+ sm4e_sve b6.s, z25.s; \
+ sm4e_sve b7.s, z25.s; \
+ sm4e_sve b0.s, z26.s; \
+ sm4e_sve b1.s, z26.s; \
+ sm4e_sve b2.s, z26.s; \
+ sm4e_sve b3.s, z26.s; \
+ sm4e_sve b4.s, z26.s; \
+ sm4e_sve b5.s, z26.s; \
+ sm4e_sve b6.s, z26.s; \
+ sm4e_sve b7.s, z26.s; \
+ sm4e_sve b0.s, z27.s; \
+ sm4e_sve b1.s, z27.s; \
+ sm4e_sve b2.s, z27.s; \
+ sm4e_sve b3.s, z27.s; \
+ sm4e_sve b4.s, z27.s; \
+ sm4e_sve b5.s, z27.s; \
+ sm4e_sve b6.s, z27.s; \
+ sm4e_sve b7.s, z27.s; \
+ sm4e_sve b0.s, z28.s; \
+ sm4e_sve b1.s, z28.s; \
+ sm4e_sve b2.s, z28.s; \
+ sm4e_sve b3.s, z28.s; \
+ sm4e_sve b4.s, z28.s; \
+ sm4e_sve b5.s, z28.s; \
+ sm4e_sve b6.s, z28.s; \
+ sm4e_sve b7.s, z28.s; \
+ sm4e_sve b0.s, z29.s; \
+ sm4e_sve b1.s, z29.s; \
+ sm4e_sve b2.s, z29.s; \
+ sm4e_sve b3.s, z29.s; \
+ sm4e_sve b4.s, z29.s; \
+ sm4e_sve b5.s, z29.s; \
+ sm4e_sve b6.s, z29.s; \
+ sm4e_sve b7.s, z29.s; \
+ sm4e_sve b0.s, z30.s; \
+ sm4e_sve b1.s, z30.s; \
+ sm4e_sve b2.s, z30.s; \
+ sm4e_sve b3.s, z30.s; \
+ sm4e_sve b4.s, z30.s; \
+ sm4e_sve b5.s, z30.s; \
+ sm4e_sve b6.s, z30.s; \
+ sm4e_sve b7.s, z30.s; \
+ sm4e_sve b0.s, z31.s; \
+ sm4e_sve b1.s, z31.s; \
+ sm4e_sve b2.s, z31.s; \
+ sm4e_sve b3.s, z31.s; \
+ sm4e_sve b4.s, z31.s; \
+ sm4e_sve b5.s, z31.s; \
+ sm4e_sve b6.s, z31.s; \
+ sm4e_sve b7.s, z31.s; \
+ tbl b0.b, {b0.b}, RSWAP128.b; \
+ tbl b1.b, {b1.b}, RSWAP128.b; \
+ tbl b2.b, {b2.b}, RSWAP128.b; \
+ tbl b3.b, {b3.b}, RSWAP128.b; \
+ tbl b4.b, {b4.b}, RSWAP128.b; \
+ tbl b5.b, {b5.b}, RSWAP128.b; \
+ tbl b6.b, {b6.b}, RSWAP128.b; \
+ tbl b7.b, {b7.b}, RSWAP128.b; \
+ revb b0.s, p0/m, b0.s; \
+ revb b1.s, p0/m, b1.s; \
+ revb b2.s, p0/m, b2.s; \
+ revb b3.s, p0/m, b3.s; \
+ revb b4.s, p0/m, b4.s; \
+ revb b5.s, p0/m, b5.s; \
+ revb b6.s, p0/m, b6.s; \
+ revb b7.s, p0/m, b7.s;
+
+#define SM4_CE_CRYPT_BLK(b0) \
+ rev32 b0.16b, b0.16b; \
+ sm4e b0.4s, v24.4s; \
+ sm4e b0.4s, v25.4s; \
+ sm4e b0.4s, v26.4s; \
+ sm4e b0.4s, v27.4s; \
+ sm4e b0.4s, v28.4s; \
+ sm4e b0.4s, v29.4s; \
+ sm4e b0.4s, v30.4s; \
+ sm4e b0.4s, v31.4s; \
+ rev64 b0.4s, b0.4s; \
+ ext b0.16b, b0.16b, b0.16b, #8; \
+ rev32 b0.16b, b0.16b;
+
+#define inc_le128(zctr) \
+ mov RCTRv.d[1], x8; \
+ mov RCTRv.d[0], x7; \
+ mov zctr.d, RLE128_INC.d; \
+ dup RCTR.q, RCTR.q[0]; \
+ adds x8, x8, x5, LSR #4; \
+ adclt zctr.d, RCTR.d, RZERO.d; \
+ adclt RCTR.d, zctr.d, RZERO.d; \
+ adc x7, x7, xzr; \
+ trn1 zctr.d, RCTR.d, zctr.d; \
+ revb zctr.d, p0/m, zctr.d;
+
+#define inc_le128_4x(zctr0, zctr1, zctr2, zctr3) \
+ mov v8.d[1], x8; \
+ mov v8.d[0], x7; \
+ adds x8, x8, x5, LSR #4; \
+ mov zctr0.d, RLE128_INC.d; \
+ adc x7, x7, xzr; \
+ mov v9.d[1], x8; \
+ mov v9.d[0], x7; \
+ adds x8, x8, x5, LSR #4; \
+ mov zctr1.d, RLE128_INC.d; \
+ adc x7, x7, xzr; \
+ mov v10.d[1], x8; \
+ mov v10.d[0], x7; \
+ adds x8, x8, x5, LSR #4; \
+ mov zctr2.d, RLE128_INC.d; \
+ adc x7, x7, xzr; \
+ mov v11.d[1], x8; \
+ mov v11.d[0], x7; \
+ adds x8, x8, x5, LSR #4; \
+ mov zctr3.d, RLE128_INC.d; \
+ adc x7, x7, xzr; \
+ dup z8.q, z8.q[0]; \
+ dup z9.q, z9.q[0]; \
+ dup z10.q, z10.q[0]; \
+ dup z11.q, z11.q[0]; \
+ adclt zctr0.d, z8.d, RZERO.d; \
+ adclt zctr1.d, z9.d, RZERO.d; \
+ adclt zctr2.d, z10.d, RZERO.d; \
+ adclt zctr3.d, z11.d, RZERO.d; \
+ adclt z8.d, zctr0.d, RZERO.d; \
+ adclt z9.d, zctr1.d, RZERO.d; \
+ adclt z10.d, zctr2.d, RZERO.d; \
+ adclt z11.d, zctr3.d, RZERO.d; \
+ trn1 zctr0.d, z8.d, zctr0.d; \
+ trn1 zctr1.d, z9.d, zctr1.d; \
+ trn1 zctr2.d, z10.d, zctr2.d; \
+ trn1 zctr3.d, z11.d, zctr3.d; \
+ revb zctr0.d, p0/m, zctr0.d; \
+ revb zctr1.d, p0/m, zctr1.d; \
+ revb zctr2.d, p0/m, zctr2.d; \
+ revb zctr3.d, p0/m, zctr3.d;
+
+#define inc_le128_8x(zctr0, zctr1, zctr2, zctr3, \
+ zctr4, zctr5, zctr6, zctr7) \
+ mov v8.d[1], x8; \
+ mov v8.d[0], x7; \
+ adds x8, x8, x5, LSR #4; \
+ mov zctr0.d, RLE128_INC.d; \
+ adc x7, x7, xzr; \
+ mov v9.d[1], x8; \
+ mov v9.d[0], x7; \
+ adds x8, x8, x5, LSR #4; \
+ mov zctr1.d, RLE128_INC.d; \
+ adc x7, x7, xzr; \
+ mov v10.d[1], x8; \
+ mov v10.d[0], x7; \
+ adds x8, x8, x5, LSR #4; \
+ mov zctr2.d, RLE128_INC.d; \
+ adc x7, x7, xzr; \
+ mov v11.d[1], x8; \
+ mov v11.d[0], x7; \
+ adds x8, x8, x5, LSR #4; \
+ mov zctr3.d, RLE128_INC.d; \
+ adc x7, x7, xzr; \
+ mov v12.d[1], x8; \
+ mov v12.d[0], x7; \
+ adds x8, x8, x5, LSR #4; \
+ mov zctr4.d, RLE128_INC.d; \
+ adc x7, x7, xzr; \
+ mov v13.d[1], x8; \
+ mov v13.d[0], x7; \
+ adds x8, x8, x5, LSR #4; \
+ mov zctr5.d, RLE128_INC.d; \
+ adc x7, x7, xzr; \
+ mov v14.d[1], x8; \
+ mov v14.d[0], x7; \
+ adds x8, x8, x5, LSR #4; \
+ mov zctr6.d, RLE128_INC.d; \
+ adc x7, x7, xzr; \
+ mov v15.d[1], x8; \
+ mov v15.d[0], x7; \
+ adds x8, x8, x5, LSR #4; \
+ mov zctr7.d, RLE128_INC.d; \
+ adc x7, x7, xzr; \
+ dup z8.q, z8.q[0]; \
+ dup z9.q, z9.q[0]; \
+ dup z10.q, z10.q[0]; \
+ dup z11.q, z11.q[0]; \
+ dup z12.q, z12.q[0]; \
+ dup z13.q, z13.q[0]; \
+ dup z14.q, z14.q[0]; \
+ dup z15.q, z15.q[0]; \
+ adclt zctr0.d, z8.d, RZERO.d; \
+ adclt zctr1.d, z9.d, RZERO.d; \
+ adclt zctr2.d, z10.d, RZERO.d; \
+ adclt zctr3.d, z11.d, RZERO.d; \
+ adclt zctr4.d, z12.d, RZERO.d; \
+ adclt zctr5.d, z13.d, RZERO.d; \
+ adclt zctr6.d, z14.d, RZERO.d; \
+ adclt zctr7.d, z15.d, RZERO.d; \
+ adclt z8.d, zctr0.d, RZERO.d; \
+ adclt z9.d, zctr1.d, RZERO.d; \
+ adclt z10.d, zctr2.d, RZERO.d; \
+ adclt z11.d, zctr3.d, RZERO.d; \
+ adclt z12.d, zctr4.d, RZERO.d; \
+ adclt z13.d, zctr5.d, RZERO.d; \
+ adclt z14.d, zctr6.d, RZERO.d; \
+ adclt z15.d, zctr7.d, RZERO.d; \
+ trn1 zctr0.d, z8.d, zctr0.d; \
+ trn1 zctr1.d, z9.d, zctr1.d; \
+ trn1 zctr2.d, z10.d, zctr2.d; \
+ trn1 zctr3.d, z11.d, zctr3.d; \
+ trn1 zctr4.d, z12.d, zctr4.d; \
+ trn1 zctr5.d, z13.d, zctr5.d; \
+ trn1 zctr6.d, z14.d, zctr6.d; \
+ trn1 zctr7.d, z15.d, zctr7.d; \
+ revb zctr0.d, p0/m, zctr0.d; \
+ revb zctr1.d, p0/m, zctr1.d; \
+ revb zctr2.d, p0/m, zctr2.d; \
+ revb zctr3.d, p0/m, zctr3.d; \
+ revb zctr4.d, p0/m, zctr4.d; \
+ revb zctr5.d, p0/m, zctr5.d; \
+ revb zctr6.d, p0/m, zctr6.d; \
+ revb zctr7.d, p0/m, zctr7.d;
+
+
+.align 3
+SYM_FUNC_START(sm4_sve_ce_crypt)
+ /* input:
+ * x0: round key array, CTX
+ * x1: dst
+ * x2: src
+ * w3: nblocks
+ */
+ uxtw x3, w3
+ SM4_PREPARE(x0)
+
+.Lcrypt_loop_8x:
+ sub x3, x3, x5, LSR #1 /* x3 - (8 * VL) */
+ tbnz x3, #63, .Lcrypt_4x
+
+ ld1b {z0.b}, p0/z, [x2]
+ ld1b {z1.b}, p0/z, [x2, #1, MUL VL]
+ ld1b {z2.b}, p0/z, [x2, #2, MUL VL]
+ ld1b {z3.b}, p0/z, [x2, #3, MUL VL]
+ ld1b {z4.b}, p0/z, [x2, #4, MUL VL]
+ ld1b {z5.b}, p0/z, [x2, #5, MUL VL]
+ ld1b {z6.b}, p0/z, [x2, #6, MUL VL]
+ ld1b {z7.b}, p0/z, [x2, #7, MUL VL]
+
+ SM4_SVE_CE_CRYPT_BLK8(z0, z1, z2, z3, z4, z5, z6, z7)
+
+ st1b {z0.b}, p0, [x1]
+ st1b {z1.b}, p0, [x1, #1, MUL VL]
+ st1b {z2.b}, p0, [x1, #2, MUL VL]
+ st1b {z3.b}, p0, [x1, #3, MUL VL]
+ st1b {z4.b}, p0, [x1, #4, MUL VL]
+ st1b {z5.b}, p0, [x1, #5, MUL VL]
+ st1b {z6.b}, p0, [x1, #6, MUL VL]
+ st1b {z7.b}, p0, [x1, #7, MUL VL]
+
+ addvl x2, x2, #8
+ addvl x1, x1, #8
+
+ cbz x3, .Lcrypt_end
+ b .Lcrypt_loop_8x
+
+.Lcrypt_4x:
+ add x3, x3, x5, LSR #1
+ cmp x3, x5, LSR #2
+ blt .Lcrypt_loop_1x
+
+ sub x3, x3, x5, LSR #2 /* x3 - (4 * VL) */
+
+ ld1b {z0.b}, p0/z, [x2]
+ ld1b {z1.b}, p0/z, [x2, #1, MUL VL]
+ ld1b {z2.b}, p0/z, [x2, #2, MUL VL]
+ ld1b {z3.b}, p0/z, [x2, #3, MUL VL]
+
+ SM4_SVE_CE_CRYPT_BLK4(z0, z1, z2, z3)
+
+ st1b {z0.b}, p0, [x1]
+ st1b {z1.b}, p0, [x1, #1, MUL VL]
+ st1b {z2.b}, p0, [x1, #2, MUL VL]
+ st1b {z3.b}, p0, [x1, #3, MUL VL]
+
+ addvl x2, x2, #4
+ addvl x1, x1, #4
+
+ cbz x3, .Lcrypt_end
+
+.Lcrypt_loop_1x:
+ cmp x3, x5, LSR #4
+ blt .Lcrypt_ce_loop_1x
+
+ sub x3, x3, x5, LSR #4 /* x3 - VL */
+
+ ld1b {z0.b}, p0/z, [x2]
+
+ SM4_SVE_CE_CRYPT_BLK(z0)
+
+ st1b {z0.b}, p0, [x1]
+
+ addvl x2, x2, #1
+ addvl x1, x1, #1
+
+ cbz x3, .Lcrypt_end
+ b .Lcrypt_loop_1x
+
+.Lcrypt_ce_loop_1x:
+ sub x3, x3, #1
+
+ ld1 {v0.16b}, [x2], #16
+ SM4_CE_CRYPT_BLK(v0)
+ st1 {v0.16b}, [x1], #16
+
+ cbnz x3, .Lcrypt_ce_loop_1x
+
+.Lcrypt_end:
+ ret
+SYM_FUNC_END(sm4_sve_ce_crypt)
+
+.align 3
+SYM_FUNC_START(sm4_sve_ce_cbc_dec)
+ /* input:
+ * x0: round key array, CTX
+ * x1: dst
+ * x2: src
+ * x3: iv (big endian, 128 bit)
+ * w4: nblocks
+ */
+ uxtw x4, w4
+ SM4_PREPARE(x0)
+
+ ld1 {RIVv.16b}, [x3]
+ ext RIV.b, RIV.b, RIV.b, #16
+
+.Lcbc_dec_loop_8x:
+ sub x4, x4, x5, LSR #1 /* x4 - (8 * VL) */
+ tbnz x4, #63, .Lcbc_dec_4x
+
+ ld1b {z15.b}, p0/z, [x2]
+ ld1b {z14.b}, p0/z, [x2, #1, MUL VL]
+ ld1b {z13.b}, p0/z, [x2, #2, MUL VL]
+ ld1b {z12.b}, p0/z, [x2, #3, MUL VL]
+ ld1b {z11.b}, p0/z, [x2, #4, MUL VL]
+ ld1b {z10.b}, p0/z, [x2, #5, MUL VL]
+ ld1b {z9.b}, p0/z, [x2, #6, MUL VL]
+ ld1b {z8.b}, p0/z, [x2, #7, MUL VL]
+ rev z0.b, z15.b
+ rev z1.b, z14.b
+ rev z2.b, z13.b
+ rev z3.b, z12.b
+ rev z4.b, z11.b
+ rev z5.b, z10.b
+ rev z6.b, z9.b
+ rev z7.b, z8.b
+ rev RTMP0.b, RIV.b
+ ext z7.b, z7.b, z6.b, #16
+ ext z6.b, z6.b, z5.b, #16
+ ext z5.b, z5.b, z4.b, #16
+ ext z4.b, z4.b, z3.b, #16
+ ext z3.b, z3.b, z2.b, #16
+ ext z2.b, z2.b, z1.b, #16
+ ext z1.b, z1.b, z0.b, #16
+ ext z0.b, z0.b, RTMP0.b, #16
+ rev z7.b, z7.b
+ rev z6.b, z6.b
+ rev z5.b, z5.b
+ rev z4.b, z4.b
+ rev z3.b, z3.b
+ rev z2.b, z2.b
+ rev z1.b, z1.b
+ rev z0.b, z0.b
+ mov RIV.d, z8.d
+
+ SM4_SVE_CE_CRYPT_BLK8(z15, z14, z13, z12, z11, z10, z9, z8)
+
+ eor z0.d, z0.d, z15.d
+ eor z1.d, z1.d, z14.d
+ eor z2.d, z2.d, z13.d
+ eor z3.d, z3.d, z12.d
+ eor z4.d, z4.d, z11.d
+ eor z5.d, z5.d, z10.d
+ eor z6.d, z6.d, z9.d
+ eor z7.d, z7.d, z8.d
+ st1b {z0.b}, p0, [x1]
+ st1b {z1.b}, p0, [x1, #1, MUL VL]
+ st1b {z2.b}, p0, [x1, #2, MUL VL]
+ st1b {z3.b}, p0, [x1, #3, MUL VL]
+ st1b {z4.b}, p0, [x1, #4, MUL VL]
+ st1b {z5.b}, p0, [x1, #5, MUL VL]
+ st1b {z6.b}, p0, [x1, #6, MUL VL]
+ st1b {z7.b}, p0, [x1, #7, MUL VL]
+
+ addvl x2, x2, #8
+ addvl x1, x1, #8
+
+ cbz x4, .Lcbc_dec_end
+ b .Lcbc_dec_loop_8x
+
+.Lcbc_dec_4x:
+ add x4, x4, x5, LSR #1
+ cmp x4, x5, LSR #2
+ blt .Lcbc_dec_loop_1x
+
+ sub x4, x4, x5, LSR #2 /* x4 - (4 * VL) */
+
+ ld1b {z15.b}, p0/z, [x2]
+ ld1b {z14.b}, p0/z, [x2, #1, MUL VL]
+ ld1b {z13.b}, p0/z, [x2, #2, MUL VL]
+ ld1b {z12.b}, p0/z, [x2, #3, MUL VL]
+ rev z0.b, z15.b
+ rev z1.b, z14.b
+ rev z2.b, z13.b
+ rev z3.b, z12.b
+ rev RTMP0.b, RIV.b
+ ext z3.b, z3.b, z2.b, #16
+ ext z2.b, z2.b, z1.b, #16
+ ext z1.b, z1.b, z0.b, #16
+ ext z0.b, z0.b, RTMP0.b, #16
+ rev z3.b, z3.b
+ rev z2.b, z2.b
+ rev z1.b, z1.b
+ rev z0.b, z0.b
+ mov RIV.d, z12.d
+
+ SM4_SVE_CE_CRYPT_BLK4(z15, z14, z13, z12)
+
+ eor z0.d, z0.d, z15.d
+ eor z1.d, z1.d, z14.d
+ eor z2.d, z2.d, z13.d
+ eor z3.d, z3.d, z12.d
+ st1b {z0.b}, p0, [x1]
+ st1b {z1.b}, p0, [x1, #1, MUL VL]
+ st1b {z2.b}, p0, [x1, #2, MUL VL]
+ st1b {z3.b}, p0, [x1, #3, MUL VL]
+
+ addvl x2, x2, #4
+ addvl x1, x1, #4
+
+ cbz x4, .Lcbc_dec_end
+
+.Lcbc_dec_loop_1x:
+ cmp x4, x5, LSR #4
+ blt .Lcbc_dec_ce
+
+ sub x4, x4, x5, LSR #4 /* x4 - VL */
+
+ ld1b {z15.b}, p0/z, [x2]
+ rev RTMP0.b, RIV.b
+ rev z0.b, z15.b
+ ext z0.b, z0.b, RTMP0.b, #16
+ rev z0.b, z0.b
+ mov RIV.d, z15.d
+
+ SM4_SVE_CE_CRYPT_BLK(z15)
+
+ eor z0.d, z0.d, z15.d
+ st1b {z0.b}, p0, [x1]
+
+ addvl x2, x2, #1
+ addvl x1, x1, #1
+
+ cbz x4, .Lcbc_dec_end
+ b .Lcbc_dec_loop_1x
+
+.Lcbc_dec_ce:
+ rev RIV.s, RIV.s
+ tbl RIV.b, {RIV.b}, RSWAP128.b
+
+.Lcbc_dec_ce_loop_1x:
+ sub x4, x4, #1
+
+ ld1 {v15.16b}, [x2], #16
+ mov v0.16b, RIVv.16b
+ mov RIVv.16b, v15.16b
+ SM4_CE_CRYPT_BLK(v15)
+ eor v0.16b, v0.16b, v15.16b
+ st1 {v0.16b}, [x1], #16
+
+ cbnz x4, .Lcbc_dec_ce_loop_1x
+
+ ext RIV.b, RIV.b, RIV.b, #16
+
+.Lcbc_dec_end:
+ /* store new IV */
+ rev RIV.s, RIV.s
+ tbl RIV.b, {RIV.b}, RSWAP128.b
+ st1 {RIVv.16b}, [x3]
+
+ ret
+SYM_FUNC_END(sm4_sve_ce_cbc_dec)
+
+.align 3
+SYM_FUNC_START(sm4_sve_ce_cfb_dec)
+ /* input:
+ * x0: round key array, CTX
+ * x1: dst
+ * x2: src
+ * x3: iv (big endian, 128 bit)
+ * w4: nblocks
+ */
+ uxtw x4, w4
+ SM4_PREPARE(x0)
+
+ ld1 {RIVv.16b}, [x3]
+ ext RIV.b, RIV.b, RIV.b, #16
+
+.Lcfb_dec_loop_8x:
+ sub x4, x4, x5, LSR #1 /* x4 - (8 * VL) */
+ tbnz x4, #63, .Lcfb_dec_4x
+
+ ld1b {z15.b}, p0/z, [x2]
+ ld1b {z14.b}, p0/z, [x2, #1, MUL VL]
+ ld1b {z13.b}, p0/z, [x2, #2, MUL VL]
+ ld1b {z12.b}, p0/z, [x2, #3, MUL VL]
+ ld1b {z11.b}, p0/z, [x2, #4, MUL VL]
+ ld1b {z10.b}, p0/z, [x2, #5, MUL VL]
+ ld1b {z9.b}, p0/z, [x2, #6, MUL VL]
+ ld1b {z8.b}, p0/z, [x2, #7, MUL VL]
+ rev z0.b, z15.b
+ rev z1.b, z14.b
+ rev z2.b, z13.b
+ rev z3.b, z12.b
+ rev z4.b, z11.b
+ rev z5.b, z10.b
+ rev z6.b, z9.b
+ rev z7.b, z8.b
+ rev RTMP0.b, RIV.b
+ ext z7.b, z7.b, z6.b, #16
+ ext z6.b, z6.b, z5.b, #16
+ ext z5.b, z5.b, z4.b, #16
+ ext z4.b, z4.b, z3.b, #16
+ ext z3.b, z3.b, z2.b, #16
+ ext z2.b, z2.b, z1.b, #16
+ ext z1.b, z1.b, z0.b, #16
+ ext z0.b, z0.b, RTMP0.b, #16
+ rev z7.b, z7.b
+ rev z6.b, z6.b
+ rev z5.b, z5.b
+ rev z4.b, z4.b
+ rev z3.b, z3.b
+ rev z2.b, z2.b
+ rev z1.b, z1.b
+ rev z0.b, z0.b
+ mov RIV.d, z8.d
+
+ SM4_SVE_CE_CRYPT_BLK8(z0, z1, z2, z3, z4, z5, z6, z7)
+
+ eor z0.d, z0.d, z15.d
+ eor z1.d, z1.d, z14.d
+ eor z2.d, z2.d, z13.d
+ eor z3.d, z3.d, z12.d
+ eor z4.d, z4.d, z11.d
+ eor z5.d, z5.d, z10.d
+ eor z6.d, z6.d, z9.d
+ eor z7.d, z7.d, z8.d
+ st1b {z0.b}, p0, [x1]
+ st1b {z1.b}, p0, [x1, #1, MUL VL]
+ st1b {z2.b}, p0, [x1, #2, MUL VL]
+ st1b {z3.b}, p0, [x1, #3, MUL VL]
+ st1b {z4.b}, p0, [x1, #4, MUL VL]
+ st1b {z5.b}, p0, [x1, #5, MUL VL]
+ st1b {z6.b}, p0, [x1, #6, MUL VL]
+ st1b {z7.b}, p0, [x1, #7, MUL VL]
+
+ addvl x2, x2, #8
+ addvl x1, x1, #8
+
+ cbz x4, .Lcfb_dec_end
+ b .Lcfb_dec_loop_8x
+
+.Lcfb_dec_4x:
+ add x4, x4, x5, LSR #1
+ cmp x4, x5, LSR #2
+ blt .Lcfb_dec_loop_1x
+
+ sub x4, x4, x5, LSR #2 /* x4 - (4 * VL) */
+
+ ld1b {z15.b}, p0/z, [x2]
+ ld1b {z14.b}, p0/z, [x2, #1, MUL VL]
+ ld1b {z13.b}, p0/z, [x2, #2, MUL VL]
+ ld1b {z12.b}, p0/z, [x2, #3, MUL VL]
+ rev z0.b, z15.b
+ rev z1.b, z14.b
+ rev z2.b, z13.b
+ rev z3.b, z12.b
+ rev RTMP0.b, RIV.b
+ ext z3.b, z3.b, z2.b, #16
+ ext z2.b, z2.b, z1.b, #16
+ ext z1.b, z1.b, z0.b, #16
+ ext z0.b, z0.b, RTMP0.b, #16
+ rev z3.b, z3.b
+ rev z2.b, z2.b
+ rev z1.b, z1.b
+ rev z0.b, z0.b
+ mov RIV.d, z12.d
+
+ SM4_SVE_CE_CRYPT_BLK4(z0, z1, z2, z3)
+
+ eor z0.d, z0.d, z15.d
+ eor z1.d, z1.d, z14.d
+ eor z2.d, z2.d, z13.d
+ eor z3.d, z3.d, z12.d
+ st1b {z0.b}, p0, [x1]
+ st1b {z1.b}, p0, [x1, #1, MUL VL]
+ st1b {z2.b}, p0, [x1, #2, MUL VL]
+ st1b {z3.b}, p0, [x1, #3, MUL VL]
+
+ addvl x2, x2, #4
+ addvl x1, x1, #4
+
+ cbz x4, .Lcfb_dec_end
+
+.Lcfb_dec_loop_1x:
+ cmp x4, x5, LSR #4
+ blt .Lcfb_dec_ce
+
+ sub x4, x4, x5, LSR #4 /* x4 - VL */
+
+ ld1b {z15.b}, p0/z, [x2]
+ rev RTMP0.b, RIV.b
+ rev z0.b, z15.b
+ ext z0.b, z0.b, RTMP0.b, #16
+ rev z0.b, z0.b
+ mov RIV.d, z15.d
+
+ SM4_SVE_CE_CRYPT_BLK(z0)
+
+ eor z0.d, z0.d, z15.d
+ st1b {z0.b}, p0, [x1]
+
+ addvl x2, x2, #1
+ addvl x1, x1, #1
+
+ cbz x4, .Lcfb_dec_end
+ b .Lcfb_dec_loop_1x
+
+.Lcfb_dec_ce:
+ rev RIV.s, RIV.s
+ tbl RIV.b, {RIV.b}, RSWAP128.b
+
+.Lcfb_dec_ce_loop_1x:
+ sub x4, x4, #1
+
+ ld1 {v15.16b}, [x2], #16
+ mov v0.16b, RIVv.16b
+ mov RIVv.16b, v15.16b
+ SM4_CE_CRYPT_BLK(v0)
+ eor v0.16b, v0.16b, v15.16b
+ st1 {v0.16b}, [x1], #16
+
+ cbnz x4, .Lcfb_dec_ce_loop_1x
+
+ ext RIV.b, RIV.b, RIV.b, #16
+
+.Lcfb_dec_end:
+ /* store new IV */
+ rev RIV.s, RIV.s
+ tbl RIV.b, {RIV.b}, RSWAP128.b
+ st1 {RIVv.16b}, [x3]
+
+ ret
+SYM_FUNC_END(sm4_sve_ce_cfb_dec)
+
+.align 3
+SYM_FUNC_START(sm4_sve_ce_ctr_crypt)
+ /* input:
+ * x0: round key array, CTX
+ * x1: dst
+ * x2: src
+ * x3: ctr (big endian, 128 bit)
+ * w4: nblocks
+ */
+ uxtw x4, w4
+ SM4_PREPARE(x0)
+
+ dup RZERO.d, #0
+ adr_l x6, .Lle128_inc
+ ld1b {RLE128_INC.b}, p0/z, [x6]
+
+ ldp x7, x8, [x3]
+ rev x7, x7
+ rev x8, x8
+
+.Lctr_loop_8x:
+ sub x4, x4, x5, LSR #1 /* x4 - (8 * VL) */
+ tbnz x4, #63, .Lctr_4x
+
+ inc_le128_8x(z0, z1, z2, z3, z4, z5, z6, z7)
+
+ ld1b {z8.b}, p0/z, [x2]
+ ld1b {z9.b}, p0/z, [x2, #1, MUL VL]
+ ld1b {z10.b}, p0/z, [x2, #2, MUL VL]
+ ld1b {z11.b}, p0/z, [x2, #3, MUL VL]
+ ld1b {z12.b}, p0/z, [x2, #4, MUL VL]
+ ld1b {z13.b}, p0/z, [x2, #5, MUL VL]
+ ld1b {z14.b}, p0/z, [x2, #6, MUL VL]
+ ld1b {z15.b}, p0/z, [x2, #7, MUL VL]
+
+ SM4_SVE_CE_CRYPT_BLK8(z0, z1, z2, z3, z4, z5, z6, z7)
+
+ eor z0.d, z0.d, z8.d
+ eor z1.d, z1.d, z9.d
+ eor z2.d, z2.d, z10.d
+ eor z3.d, z3.d, z11.d
+ eor z4.d, z4.d, z12.d
+ eor z5.d, z5.d, z13.d
+ eor z6.d, z6.d, z14.d
+ eor z7.d, z7.d, z15.d
+
+ st1b {z0.b}, p0, [x1]
+ st1b {z1.b}, p0, [x1, #1, MUL VL]
+ st1b {z2.b}, p0, [x1, #2, MUL VL]
+ st1b {z3.b}, p0, [x1, #3, MUL VL]
+ st1b {z4.b}, p0, [x1, #4, MUL VL]
+ st1b {z5.b}, p0, [x1, #5, MUL VL]
+ st1b {z6.b}, p0, [x1, #6, MUL VL]
+ st1b {z7.b}, p0, [x1, #7, MUL VL]
+
+ addvl x2, x2, #8
+ addvl x1, x1, #8
+
+ cbz x4, .Lctr_end
+ b .Lctr_loop_8x
+
+.Lctr_4x:
+ add x4, x4, x5, LSR #1
+ cmp x4, x5, LSR #2
+ blt .Lctr_loop_1x
+
+ sub x4, x4, x5, LSR #2 /* x4 - (4 * VL) */
+
+ inc_le128_4x(z0, z1, z2, z3)
+
+ ld1b {z8.b}, p0/z, [x2]
+ ld1b {z9.b}, p0/z, [x2, #1, MUL VL]
+ ld1b {z10.b}, p0/z, [x2, #2, MUL VL]
+ ld1b {z11.b}, p0/z, [x2, #3, MUL VL]
+
+ SM4_SVE_CE_CRYPT_BLK4(z0, z1, z2, z3)
+
+ eor z0.d, z0.d, z8.d
+ eor z1.d, z1.d, z9.d
+ eor z2.d, z2.d, z10.d
+ eor z3.d, z3.d, z11.d
+
+ st1b {z0.b}, p0, [x1]
+ st1b {z1.b}, p0, [x1, #1, MUL VL]
+ st1b {z2.b}, p0, [x1, #2, MUL VL]
+ st1b {z3.b}, p0, [x1, #3, MUL VL]
+
+ addvl x2, x2, #4
+ addvl x1, x1, #4
+
+ cbz x4, .Lctr_end
+
+.Lctr_loop_1x:
+ cmp x4, x5, LSR #4
+ blt .Lctr_ce_loop_1x
+
+ sub x4, x4, x5, LSR #4 /* x4 -= blocks per vector */
+
+ inc_le128(z0)
+ ld1b {z8.b}, p0/z, [x2]
+
+ SM4_SVE_CE_CRYPT_BLK(z0)
+
+ eor z0.d, z0.d, z8.d
+ st1b {z0.b}, p0, [x1]
+
+ addvl x2, x2, #1
+ addvl x1, x1, #1
+
+ cbz x4, .Lctr_end
+ b .Lctr_loop_1x
+
+.Lctr_ce_loop_1x:
+ sub x4, x4, #1
+
+ /* inc_le128 for CE */
+ mov v0.d[1], x8
+ mov v0.d[0], x7
+ adds x8, x8, #1
+ rev64 v0.16b, v0.16b
+ adc x7, x7, xzr
+
+ ld1 {v8.16b}, [x2], #16
+
+ SM4_CE_CRYPT_BLK(v0)
+
+ eor v0.16b, v0.16b, v8.16b
+ st1 {v0.16b}, [x1], #16
+
+ cbnz x4, .Lctr_ce_loop_1x
+
+.Lctr_end:
+ /* store new CTR */
+ rev x7, x7
+ rev x8, x8
+ stp x7, x8, [x3]
+
+ ret
+SYM_FUNC_END(sm4_sve_ce_ctr_crypt)
+
+.align 3
+SYM_FUNC_START(sm4_sve_get_vl)
+ /* VL in bytes */
+ rdvl x0, #1
+
+ ret
+SYM_FUNC_END(sm4_sve_get_vl)
+
+
+ .section ".rodata", "a"
+ .align 4
+.Lbswap128_mask:
+ .byte 0x0c, 0x0d, 0x0e, 0x0f, 0x08, 0x09, 0x0a, 0x0b
+ .byte 0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03
+ .byte 0x1c, 0x1d, 0x1e, 0x1f, 0x18, 0x19, 0x1a, 0x1b
+ .byte 0x14, 0x15, 0x16, 0x17, 0x10, 0x11, 0x12, 0x13
+ .byte 0x2c, 0x2d, 0x2e, 0x2f, 0x28, 0x29, 0x2a, 0x2b
+ .byte 0x24, 0x25, 0x26, 0x27, 0x20, 0x21, 0x22, 0x23
+ .byte 0x3c, 0x3d, 0x3e, 0x3f, 0x38, 0x39, 0x3a, 0x3b
+ .byte 0x34, 0x35, 0x36, 0x37, 0x30, 0x31, 0x32, 0x33
+ .byte 0x4c, 0x4d, 0x4e, 0x4f, 0x48, 0x49, 0x4a, 0x4b
+ .byte 0x44, 0x45, 0x46, 0x47, 0x40, 0x41, 0x42, 0x43
+ .byte 0x5c, 0x5d, 0x5e, 0x5f, 0x58, 0x59, 0x5a, 0x5b
+ .byte 0x54, 0x55, 0x56, 0x57, 0x50, 0x51, 0x52, 0x53
+ .byte 0x6c, 0x6d, 0x6e, 0x6f, 0x68, 0x69, 0x6a, 0x6b
+ .byte 0x64, 0x65, 0x66, 0x67, 0x60, 0x61, 0x62, 0x63
+ .byte 0x7c, 0x7d, 0x7e, 0x7f, 0x78, 0x79, 0x7a, 0x7b
+ .byte 0x74, 0x75, 0x76, 0x77, 0x70, 0x71, 0x72, 0x73
+ .byte 0x8c, 0x8d, 0x8e, 0x8f, 0x88, 0x89, 0x8a, 0x8b
+ .byte 0x84, 0x85, 0x86, 0x87, 0x80, 0x81, 0x82, 0x83
+ .byte 0x9c, 0x9d, 0x9e, 0x9f, 0x98, 0x99, 0x9a, 0x9b
+ .byte 0x94, 0x95, 0x96, 0x97, 0x90, 0x91, 0x92, 0x93
+ .byte 0xac, 0xad, 0xae, 0xaf, 0xa8, 0xa9, 0xaa, 0xab
+ .byte 0xa4, 0xa5, 0xa6, 0xa7, 0xa0, 0xa1, 0xa2, 0xa3
+ .byte 0xbc, 0xbd, 0xbe, 0xbf, 0xb8, 0xb9, 0xba, 0xbb
+ .byte 0xb4, 0xb5, 0xb6, 0xb7, 0xb0, 0xb1, 0xb2, 0xb3
+ .byte 0xcc, 0xcd, 0xce, 0xcf, 0xc8, 0xc9, 0xca, 0xcb
+ .byte 0xc4, 0xc5, 0xc6, 0xc7, 0xc0, 0xc1, 0xc2, 0xc3
+ .byte 0xdc, 0xdd, 0xde, 0xdf, 0xd8, 0xd9, 0xda, 0xdb
+ .byte 0xd4, 0xd5, 0xd6, 0xd7, 0xd0, 0xd1, 0xd2, 0xd3
+ .byte 0xec, 0xed, 0xee, 0xef, 0xe8, 0xe9, 0xea, 0xeb
+ .byte 0xe4, 0xe5, 0xe6, 0xe7, 0xe0, 0xe1, 0xe2, 0xe3
+ .byte 0xfc, 0xfd, 0xfe, 0xff, 0xf8, 0xf9, 0xfa, 0xfb
+ .byte 0xf4, 0xf5, 0xf6, 0xf7, 0xf0, 0xf1, 0xf2, 0xf3
+
+.Lle128_inc:
+ .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x05, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x06, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x07, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x09, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x0a, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x0b, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x0c, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x0d, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x0e, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x0f, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
diff --git a/arch/arm64/crypto/sm4-sve-ce-glue.c b/arch/arm64/crypto/sm4-sve-ce-glue.c
new file mode 100644
index 000000000000..fc797b72b5f0
--- /dev/null
+++ b/arch/arm64/crypto/sm4-sve-ce-glue.c
@@ -0,0 +1,332 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * SM4 Cipher Algorithm, using ARMv9 Crypto Extensions with SVE2
+ * as specified in
+ * https://tools.ietf.org/id/draft-ribose-cfrg-sm4-10.html
+ *
+ * Copyright (C) 2022, Alibaba Group.
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#include <linux/module.h>
+#include <linux/crypto.h>
+#include <linux/kernel.h>
+#include <linux/cpufeature.h>
+#include <asm/neon.h>
+#include <asm/simd.h>
+#include <crypto/internal/simd.h>
+#include <crypto/internal/skcipher.h>
+#include <crypto/sm4.h>
+#include "sm4-ce.h"
+
+asmlinkage void sm4_sve_ce_crypt(const u32 *rkey, u8 *dst,
+ const u8 *src, unsigned int nblocks);
+asmlinkage void sm4_sve_ce_cbc_dec(const u32 *rkey_dec, u8 *dst,
+ const u8 *src, u8 *iv,
+ unsigned int nblocks);
+asmlinkage void sm4_sve_ce_cfb_dec(const u32 *rkey_enc, u8 *dst,
+ const u8 *src, u8 *iv,
+ unsigned int nblocks);
+asmlinkage void sm4_sve_ce_ctr_crypt(const u32 *rkey_enc, u8 *dst,
+ const u8 *src, u8 *iv,
+ unsigned int nblocks);
+asmlinkage unsigned int sm4_sve_get_vl(void);
+
+
+static int sm4_setkey(struct crypto_skcipher *tfm, const u8 *key,
+ unsigned int key_len)
+{
+ struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+ if (key_len != SM4_KEY_SIZE)
+ return -EINVAL;
+
+ kernel_neon_begin();
+ sm4_ce_expand_key(key, ctx->rkey_enc, ctx->rkey_dec,
+ crypto_sm4_fk, crypto_sm4_ck);
+ kernel_neon_end();
+
+ return 0;
+}
+
+static int ecb_crypt(struct skcipher_request *req, const u32 *rkey)
+{
+ struct skcipher_walk walk;
+ unsigned int nbytes;
+ int err;
+
+ err = skcipher_walk_virt(&walk, req, false);
+
+ while ((nbytes = walk.nbytes) > 0) {
+ const u8 *src = walk.src.virt.addr;
+ u8 *dst = walk.dst.virt.addr;
+ unsigned int nblocks;
+
+ nblocks = nbytes / SM4_BLOCK_SIZE;
+ if (nblocks) {
+ kernel_neon_begin();
+
+ sm4_sve_ce_crypt(rkey, dst, src, nblocks);
+
+ kernel_neon_end();
+ }
+
+ err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE);
+ }
+
+ return err;
+}
+
+static int ecb_encrypt(struct skcipher_request *req)
+{
+ struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+ struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+ return ecb_crypt(req, ctx->rkey_enc);
+}
+
+static int ecb_decrypt(struct skcipher_request *req)
+{
+ struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+ struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+ return ecb_crypt(req, ctx->rkey_dec);
+}
+
+static int cbc_crypt(struct skcipher_request *req, const u32 *rkey,
+ void (*sm4_cbc_crypt)(const u32 *rkey, u8 *dst,
+ const u8 *src, u8 *iv, unsigned int nblocks))
+{
+ struct skcipher_walk walk;
+ unsigned int nbytes;
+ int err;
+
+ err = skcipher_walk_virt(&walk, req, false);
+
+ while ((nbytes = walk.nbytes) > 0) {
+ const u8 *src = walk.src.virt.addr;
+ u8 *dst = walk.dst.virt.addr;
+ unsigned int nblocks;
+
+ nblocks = nbytes / SM4_BLOCK_SIZE;
+ if (nblocks) {
+ kernel_neon_begin();
+
+ sm4_cbc_crypt(rkey, dst, src, walk.iv, nblocks);
+
+ kernel_neon_end();
+ }
+
+ err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE);
+ }
+
+ return err;
+}
+
+static int cbc_encrypt(struct skcipher_request *req)
+{
+ struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+ struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+ return cbc_crypt(req, ctx->rkey_enc, sm4_ce_cbc_enc);
+}
+
+static int cbc_decrypt(struct skcipher_request *req)
+{
+ struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+ struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+ return cbc_crypt(req, ctx->rkey_dec, sm4_sve_ce_cbc_dec);
+}
+
+static int cfb_crypt(struct skcipher_request *req,
+ void (*sm4_cfb_crypt)(const u32 *rkey, u8 *dst,
+ const u8 *src, u8 *iv, unsigned int nblocks))
+{
+ struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+ struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+ struct skcipher_walk walk;
+ unsigned int nbytes;
+ int err;
+
+ err = skcipher_walk_virt(&walk, req, false);
+
+ while ((nbytes = walk.nbytes) > 0) {
+ const u8 *src = walk.src.virt.addr;
+ u8 *dst = walk.dst.virt.addr;
+ unsigned int nblocks;
+
+ kernel_neon_begin();
+
+ nblocks = nbytes / SM4_BLOCK_SIZE;
+ if (nblocks) {
+ sm4_cfb_crypt(ctx->rkey_enc, dst, src,
+ walk.iv, nblocks);
+
+ dst += nblocks * SM4_BLOCK_SIZE;
+ src += nblocks * SM4_BLOCK_SIZE;
+ nbytes -= nblocks * SM4_BLOCK_SIZE;
+ }
+
+ /* tail: sm4_ce_crypt_block() also uses NEON registers */
+ if (walk.nbytes == walk.total && nbytes > 0) {
+ u8 keystream[SM4_BLOCK_SIZE];
+
+ sm4_ce_crypt_block(ctx->rkey_enc, keystream, walk.iv);
+ crypto_xor_cpy(dst, src, keystream, nbytes);
+ nbytes = 0;
+ }
+
+ kernel_neon_end();
+
+ err = skcipher_walk_done(&walk, nbytes);
+ }
+
+ return err;
+}
+
+static int cfb_encrypt(struct skcipher_request *req)
+{
+ return cfb_crypt(req, sm4_ce_cfb_enc);
+}
+
+static int cfb_decrypt(struct skcipher_request *req)
+{
+ return cfb_crypt(req, sm4_sve_ce_cfb_dec);
+}
+
+static int ctr_crypt(struct skcipher_request *req)
+{
+ struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+ struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+ struct skcipher_walk walk;
+ unsigned int nbytes;
+ int err;
+
+ err = skcipher_walk_virt(&walk, req, false);
+
+ while ((nbytes = walk.nbytes) > 0) {
+ const u8 *src = walk.src.virt.addr;
+ u8 *dst = walk.dst.virt.addr;
+ unsigned int nblocks;
+
+ kernel_neon_begin();
+
+ nblocks = nbytes / SM4_BLOCK_SIZE;
+ if (nblocks) {
+ sm4_sve_ce_ctr_crypt(ctx->rkey_enc, dst, src,
+ walk.iv, nblocks);
+
+ dst += nblocks * SM4_BLOCK_SIZE;
+ src += nblocks * SM4_BLOCK_SIZE;
+ nbytes -= nblocks * SM4_BLOCK_SIZE;
+ }
+
+ /* tail: sm4_ce_crypt_block() also uses NEON registers */
+ if (walk.nbytes == walk.total && nbytes > 0) {
+ u8 keystream[SM4_BLOCK_SIZE];
+
+ sm4_ce_crypt_block(ctx->rkey_enc, keystream, walk.iv);
+ crypto_inc(walk.iv, SM4_BLOCK_SIZE);
+ crypto_xor_cpy(dst, src, keystream, nbytes);
+ nbytes = 0;
+ }
+
+ kernel_neon_end();
+
+ err = skcipher_walk_done(&walk, nbytes);
+ }
+
+ return err;
+}
+
+static struct skcipher_alg sm4_algs[] = {
+ {
+ .base = {
+ .cra_name = "ecb(sm4)",
+ .cra_driver_name = "ecb-sm4-sve-ce",
+ .cra_priority = 500,
+ .cra_blocksize = SM4_BLOCK_SIZE,
+ .cra_ctxsize = sizeof(struct sm4_ctx),
+ .cra_module = THIS_MODULE,
+ },
+ .min_keysize = SM4_KEY_SIZE,
+ .max_keysize = SM4_KEY_SIZE,
+ .setkey = sm4_setkey,
+ .encrypt = ecb_encrypt,
+ .decrypt = ecb_decrypt,
+ }, {
+ .base = {
+ .cra_name = "cbc(sm4)",
+ .cra_driver_name = "cbc-sm4-sve-ce",
+ .cra_priority = 500,
+ .cra_blocksize = SM4_BLOCK_SIZE,
+ .cra_ctxsize = sizeof(struct sm4_ctx),
+ .cra_module = THIS_MODULE,
+ },
+ .min_keysize = SM4_KEY_SIZE,
+ .max_keysize = SM4_KEY_SIZE,
+ .ivsize = SM4_BLOCK_SIZE,
+ .setkey = sm4_setkey,
+ .encrypt = cbc_encrypt,
+ .decrypt = cbc_decrypt,
+ }, {
+ .base = {
+ .cra_name = "cfb(sm4)",
+ .cra_driver_name = "cfb-sm4-sve-ce",
+ .cra_priority = 500,
+ .cra_blocksize = 1,
+ .cra_ctxsize = sizeof(struct sm4_ctx),
+ .cra_module = THIS_MODULE,
+ },
+ .min_keysize = SM4_KEY_SIZE,
+ .max_keysize = SM4_KEY_SIZE,
+ .ivsize = SM4_BLOCK_SIZE,
+ .chunksize = SM4_BLOCK_SIZE,
+ .setkey = sm4_setkey,
+ .encrypt = cfb_encrypt,
+ .decrypt = cfb_decrypt,
+ }, {
+ .base = {
+ .cra_name = "ctr(sm4)",
+ .cra_driver_name = "ctr-sm4-sve-ce",
+ .cra_priority = 500,
+ .cra_blocksize = 1,
+ .cra_ctxsize = sizeof(struct sm4_ctx),
+ .cra_module = THIS_MODULE,
+ },
+ .min_keysize = SM4_KEY_SIZE,
+ .max_keysize = SM4_KEY_SIZE,
+ .ivsize = SM4_BLOCK_SIZE,
+ .chunksize = SM4_BLOCK_SIZE,
+ .setkey = sm4_setkey,
+ .encrypt = ctr_crypt,
+ .decrypt = ctr_crypt,
+ }
+};
+
+static int __init sm4_sve_ce_init(void)
+{
+ if (sm4_sve_get_vl() <= 16)
+ return -ENODEV;
+
+ return crypto_register_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs));
+}
+
+static void __exit sm4_sve_ce_exit(void)
+{
+ crypto_unregister_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs));
+}
+
+module_cpu_feature_match(SVESM4, sm4_sve_ce_init);
+module_exit(sm4_sve_ce_exit);
+
+MODULE_DESCRIPTION("SM4 ECB/CBC/CFB/CTR using ARMv9 Crypto Extensions with SVE2");
+MODULE_ALIAS_CRYPTO("sm4-sve-ce");
+MODULE_ALIAS_CRYPTO("sm4");
+MODULE_ALIAS_CRYPTO("ecb(sm4)");
+MODULE_ALIAS_CRYPTO("cbc(sm4)");
+MODULE_ALIAS_CRYPTO("cfb(sm4)");
+MODULE_ALIAS_CRYPTO("ctr(sm4)");
+MODULE_AUTHOR("Tianjia Zhang <tianjia.zhang@linux.alibaba.com>");
+MODULE_LICENSE("GPL v2");
--
2.24.3 (Apple Git-128)
* [PATCH 16/16] crypto: arm64/sm4 - add ARMv9 SVE cryptography acceleration implementation
@ 2022-09-26 9:36 ` Tianjia Zhang
From: Tianjia Zhang @ 2022-09-26 9:36 UTC (permalink / raw)
To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
linux-stm32
Scalable Vector Extension (SVE) is the next-generation SIMD extension for
arm64. SVE allows flexible vector length implementations with a range of
possible values in CPU implementations. The vector length can vary from a
minimum of 128 bits up to a maximum of 2048 bits, at 128-bit increments.
The SVE design guarantees that the same application can run on different
implementations that support SVE, without the need to recompile the code.
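To illustrate what vector-length-agnostic code looks like in practice, the
following is a minimal C sketch of how a block count could be split across
8-vector, 4-vector and single-vector passes, plus a single-block tail, for
a VL that is only known at run time. It mirrors the loop structure of the
assembly in sm4-sve-ce-core.S, but the names sm4_split and
sm4_split_blocks are hypothetical and are not part of this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of how many passes each stage would run. */
struct sm4_split {
	size_t pass8;		/* passes over 8 vectors at a time */
	size_t pass4;		/* 0 or 1 pass over 4 vectors */
	size_t pass1;		/* passes over a single vector */
	size_t ce_blocks;	/* leftover 16-byte blocks, one at a time */
};

/*
 * Split nblocks 16-byte blocks for a vector length of vl_bytes.
 * vl_bytes plays the role of the value returned by "rdvl x5, #1".
 */
static struct sm4_split sm4_split_blocks(size_t vl_bytes, size_t nblocks)
{
	struct sm4_split s = { 0, 0, 0, 0 };
	size_t bpv;

	/* SVE allows 128..2048 bits in 128-bit steps, i.e. 16..256 bytes. */
	assert(vl_bytes >= 16 && vl_bytes <= 256 && vl_bytes % 16 == 0);
	bpv = vl_bytes / 16;		/* blocks per vector */

	s.pass8 = nblocks / (8 * bpv);
	nblocks %= 8 * bpv;

	if (nblocks >= 4 * bpv) {	/* the 4x stage runs at most once */
		s.pass4 = 1;
		nblocks -= 4 * bpv;
	}

	s.pass1 = nblocks / bpv;
	s.ce_blocks = nblocks % bpv;	/* handled by the CE single-block loop */

	return s;
}
```

For example, with VL = 256 bits (32 bytes) and 45 blocks, this gives two
8-vector passes, one 4-vector pass, two single-vector passes and one
single block.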
SVE was originally introduced in ARMv8, and ARMv9 introduced SVE2 to
expand and improve it. Like the Crypto Extensions that accompany the
NEON instruction set, SVE2 provides similar instructions for this
algorithm, called cryptography acceleration instructions; these are
likewise an optional part of the instruction set.
This patch uses the SM4 cryptography acceleration instructions together
with SVE2 instructions to optimize SM4 in ECB/CBC/CFB/CTR modes. Since
CBC and CFB encryption cannot be parallelized, those two paths use the
Crypto Extension instructions instead.
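The reason CBC encryption is serial while CBC decryption is not can be
seen in a small C sketch. Each ciphertext block feeds into the next
encryption, whereas each decrypted block depends only on ciphertext that
is already available. The block function below is a toy stand-in and NOT
SM4; toy_enc, toy_dec and the other names are hypothetical and only serve
the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLK 16

/* Toy invertible block function standing in for SM4; not a real cipher. */
static void toy_enc(const uint8_t key[BLK], uint8_t out[BLK],
		    const uint8_t in[BLK])
{
	for (size_t i = 0; i < BLK; i++)
		out[i] = in[(i + 1) % BLK] ^ key[i];
}

static void toy_dec(const uint8_t key[BLK], uint8_t out[BLK],
		    const uint8_t in[BLK])
{
	for (size_t i = 0; i < BLK; i++)
		out[(i + 1) % BLK] = in[i] ^ key[i];
}

/* Encryption: block n needs C[n-1], so the loop is inherently serial. */
static void cbc_encrypt(const uint8_t key[BLK], uint8_t *dst,
			const uint8_t *src, const uint8_t iv[BLK],
			size_t nblocks)
{
	uint8_t chain[BLK], tmp[BLK];

	memcpy(chain, iv, BLK);
	for (size_t n = 0; n < nblocks; n++) {
		for (size_t i = 0; i < BLK; i++)
			tmp[i] = src[n * BLK + i] ^ chain[i];
		toy_enc(key, chain, tmp);	/* chain now holds C[n] */
		memcpy(dst + n * BLK, chain, BLK);
	}
}

/*
 * Decryption: P[n] = D(C[n]) ^ C[n-1] uses only ciphertext, so blocks can
 * be processed in any order (here: backwards) or several at a time.
 */
static void cbc_decrypt(const uint8_t key[BLK], uint8_t *dst,
			const uint8_t *src, const uint8_t iv[BLK],
			size_t nblocks)
{
	assert(dst != src);	/* this sketch is out-of-place only */

	for (size_t n = nblocks; n-- > 0; ) {
		const uint8_t *prev = n ? src + (n - 1) * BLK : iv;
		uint8_t tmp[BLK];

		toy_dec(key, tmp, src + n * BLK);
		for (size_t i = 0; i < BLK; i++)
			dst[n * BLK + i] = tmp[i] ^ prev[i];
	}
}

/* Returns 1 if decrypting the encrypted data yields the plaintext again. */
static int cbc_roundtrip_ok(void)
{
	uint8_t key[BLK], iv[BLK];
	uint8_t pt[4 * BLK], ct[4 * BLK], out[4 * BLK];

	for (size_t i = 0; i < BLK; i++) {
		key[i] = (uint8_t)(0xa5 ^ i);
		iv[i] = (uint8_t)(i * 7);
	}
	for (size_t i = 0; i < sizeof(pt); i++)
		pt[i] = (uint8_t)i;

	cbc_encrypt(key, ct, pt, iv, 4);
	cbc_decrypt(key, out, ct, iv, 4);

	return memcmp(pt, out, sizeof(pt)) == 0;
}
```

The same argument applies to CFB, where encryption feeds each ciphertext
block into the next block function call but decryption only reads
ciphertext.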
No test environment with a Vector Length (VL) greater than 128 bits was
available, so the performance data below was obtained on a machine with
VL = 128 bits. Because this driver is only enabled when the VL is greater
than 128 bits, these figures are for reference only. The data shows
little difference between the Crypto Extension and SVE (VL = 128 bits)
implementations; the benefit of SVE is expected to be more obvious when
the VL is 256 bits or longer.
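With a wider vector, a single CTR pass needs one counter value per
128-bit lane, so VL / 128 consecutive counters are produced at once. A
minimal C sketch of that counter generation, assuming a 128-bit
big-endian counter as in the ctr(sm4) path, is shown below. The function
name ctr_fill_lanes and its helpers are hypothetical; the patch itself
does this with the inc_le128 macros in assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SM4_BLK 16

/* Load a 64-bit big-endian value from p. */
static uint64_t be64_load(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Store v at p as a 64-bit big-endian value. */
static void be64_store(uint8_t *p, uint64_t v)
{
	for (int i = 7; i >= 0; i--) {
		p[i] = (uint8_t)v;
		v >>= 8;
	}
}

/*
 * Write `lanes` consecutive counter blocks (ctr, ctr + 1, ...) to out and
 * advance ctr by `lanes`. The carry from the low to the high 64 bits is
 * propagated, like the adds/adc pair in the scalar counter update.
 */
static void ctr_fill_lanes(uint8_t ctr[SM4_BLK], uint8_t *out, size_t lanes)
{
	uint64_t hi = be64_load(ctr);
	uint64_t lo = be64_load(ctr + 8);

	assert(lanes >= 1 && lanes <= 16);	/* VL of 128..2048 bits */

	for (size_t i = 0; i < lanes; i++) {
		be64_store(out + i * SM4_BLK, hi);
		be64_store(out + i * SM4_BLK + 8, lo);
		lo++;
		if (lo == 0)	/* low half wrapped around: carry */
			hi++;
	}

	be64_store(ctr, hi);
	be64_store(ctr + 8, lo);
}
```

The resulting counter blocks are then encrypted and XORed with the input,
so a VL of 256 bits handles two blocks per vector where the Crypto
Extension path handles one.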
Benchmark on T-Head Yitian-710 2.75 GHz. The data comes from mode 218 of
tcrypt and is compared with the Crypto Extension implementation. The
columns are block sizes in bytes. The data is tabulated and the unit is
Mb/s:
sm4-ce      |       16       64      128      256     1024     1420     4096
------------+---------------------------------------------------------------
ECB enc     |   315.18  1162.65  1815.66  2553.50  3692.91  3727.20  4001.93
ECB dec     |   316.06  1172.97  1817.81  2554.66  3692.18  3786.54  4001.93
CBC enc     |   304.82   629.54   768.65   864.72   953.90   963.32   974.06
CBC dec     |   306.05  1142.53  1805.11  2481.67  3522.06  3587.87  3790.99
CFB enc     |   309.48   635.70   774.44   865.85   950.62   952.68   968.24
CFB dec     |   315.98  1170.38  1828.75  2509.72  3543.63  3539.40  3793.25
CTR enc     |   285.83  1036.59  1583.50  2147.26  2933.54  2954.66  3041.14
CTR dec     |   285.29  1037.47  1584.67  2145.51  2934.10  2950.89  3041.62
sm4-sve-ce (VL = 128 bits)
ECB enc     |   310.00  1154.70  1813.26  2579.74  3766.90  3869.45  4100.26
ECB dec     |   315.60  1176.22  1838.06  2593.69  3774.95  3878.42  4098.83
CBC enc     |   303.44   622.65   764.67   861.40   953.18   963.05   973.77
CBC dec     |   302.13  1091.15  1689.10  2267.79  3182.84  3242.68  3408.92
CFB enc     |   296.62   620.41   762.94   858.96   948.18   956.04   967.67
CFB dec     |   291.23  1065.50  1637.33  2228.12  3158.52  3213.35  3403.83
CTR enc     |   272.27   959.35  1466.34  1934.24  2562.80  2595.87  2695.15
CTR dec     |   273.40   963.65  1471.83  1938.97  2563.12  2597.25  2694.54
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
arch/arm64/crypto/Kconfig | 19 +
arch/arm64/crypto/Makefile | 3 +
arch/arm64/crypto/sm4-sve-ce-core.S | 1028 +++++++++++++++++++++++++++
arch/arm64/crypto/sm4-sve-ce-glue.c | 332 +++++++++
4 files changed, 1382 insertions(+)
create mode 100644 arch/arm64/crypto/sm4-sve-ce-core.S
create mode 100644 arch/arm64/crypto/sm4-sve-ce-glue.c
diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index 6793d5bc3ee5..bbb5a7a08af5 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -249,6 +249,25 @@ config CRYPTO_SM4_ARM64_CE_BLK
- ARMv8 Crypto Extensions
- NEON (Advanced SIMD) extensions
+config CRYPTO_SM4_ARM64_SVE_CE_BLK
+ tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR (ARMv9 cryptography acceleration with SVE2)"
+ depends on KERNEL_MODE_NEON
+ select CRYPTO_SKCIPHER
+ select CRYPTO_SM4
+ select CRYPTO_SM4_ARM64_CE_BLK
+ help
+ Length-preserving ciphers: SM4 cipher algorithms (OSCCA GB/T 32907-2016)
+ with block cipher modes:
+ - ECB (Electronic Codebook) mode (NIST SP800-38A)
+ - CBC (Cipher Block Chaining) mode (NIST SP800-38A)
+ - CFB (Cipher Feedback) mode (NIST SP800-38A)
+ - CTR (Counter) mode (NIST SP800-38A)
+
+ Architecture: arm64 using:
+ - ARMv8 Crypto Extensions
+ - ARMv9 cryptography acceleration with SVE2
+ - NEON (Advanced SIMD) extensions
+
config CRYPTO_SM4_ARM64_NEON_BLK
tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR (NEON)"
depends on KERNEL_MODE_NEON
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index 4818e204c2ac..355dd9053434 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -38,6 +38,9 @@ sm4-ce-gcm-y := sm4-ce-gcm-glue.o sm4-ce-gcm-core.o
obj-$(CONFIG_CRYPTO_SM4_ARM64_NEON_BLK) += sm4-neon.o
sm4-neon-y := sm4-neon-glue.o sm4-neon-core.o
+obj-$(CONFIG_CRYPTO_SM4_ARM64_SVE_CE_BLK) += sm4-sve-ce.o
+sm4-sve-ce-y := sm4-sve-ce-glue.o sm4-sve-ce-core.o
+
obj-$(CONFIG_CRYPTO_GHASH_ARM64_CE) += ghash-ce.o
ghash-ce-y := ghash-ce-glue.o ghash-ce-core.o
diff --git a/arch/arm64/crypto/sm4-sve-ce-core.S b/arch/arm64/crypto/sm4-sve-ce-core.S
new file mode 100644
index 000000000000..caecbdf2536c
--- /dev/null
+++ b/arch/arm64/crypto/sm4-sve-ce-core.S
@@ -0,0 +1,1028 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * SM4 Cipher Algorithm for ARMv9 Crypto Extensions with SVE2
+ * as specified in
+ * https://tools.ietf.org/id/draft-ribose-cfrg-sm4-10.html
+ *
+ * Copyright (C) 2022, Alibaba Group.
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+
+.arch armv8-a+crypto+sve+sve2
+
+.irp b, 0, 15, 24, 25, 26, 27, 28, 29, 30, 31
+ .set .Lv\b\().4s, \b
+.endr
+
+.irp b, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, \
+ 16, 24, 25, 26, 27, 28, 29, 30, 31
+ .set .Lz\b\().s, \b
+.endr
+
+.macro sm4e, vd, vn
+ .inst 0xcec08400 | (.L\vn << 5) | .L\vd
+.endm
+
+.macro sm4e_sve, zd, zn
+ .inst 0x4523e000 | (.L\zn << 5) | .L\zd
+.endm
+
+
+/* Register macros */
+
+#define RCTR z16
+#define RCTRv v16
+#define RIV z16
+#define RIVv v16
+#define RSWAP128 z17
+#define RZERO z18
+#define RLE128_INC z19
+
+#define RTMP0 z20
+#define RTMP0v v20
+#define RTMP1 z21
+#define RTMP2 z22
+#define RTMP3 z23
+
+
+/* Helper macros. */
+
+#define SM4_PREPARE(ptr) \
+ adr_l x7, .Lbswap128_mask; \
+ ptrue p0.b, ALL; \
+ rdvl x5, #1; \
+ ld1b {RSWAP128.b}, p0/z, [x7]; \
+ \
+ ld1 {v24.16b-v27.16b}, [ptr], #64; \
+ ld1 {v28.16b-v31.16b}, [ptr]; \
+ dup z24.q, z24.q[0]; \
+ dup z25.q, z25.q[0]; \
+ dup z26.q, z26.q[0]; \
+ dup z27.q, z27.q[0]; \
+ dup z28.q, z28.q[0]; \
+ dup z29.q, z29.q[0]; \
+ dup z30.q, z30.q[0]; \
+ dup z31.q, z31.q[0];
+
+#define SM4_SVE_CE_CRYPT_BLK(b0) \
+ revb b0.s, p0/m, b0.s; \
+ sm4e_sve b0.s, z24.s; \
+ sm4e_sve b0.s, z25.s; \
+ sm4e_sve b0.s, z26.s; \
+ sm4e_sve b0.s, z27.s; \
+ sm4e_sve b0.s, z28.s; \
+ sm4e_sve b0.s, z29.s; \
+ sm4e_sve b0.s, z30.s; \
+ sm4e_sve b0.s, z31.s; \
+ tbl b0.b, {b0.b}, RSWAP128.b; \
+ revb b0.s, p0/m, b0.s;
+
+#define SM4_SVE_CE_CRYPT_BLK4(b0, b1, b2, b3) \
+ revb b0.s, p0/m, b0.s; \
+ revb b1.s, p0/m, b1.s; \
+ revb b2.s, p0/m, b2.s; \
+ revb b3.s, p0/m, b3.s; \
+ sm4e_sve b0.s, z24.s; \
+ sm4e_sve b1.s, z24.s; \
+ sm4e_sve b2.s, z24.s; \
+ sm4e_sve b3.s, z24.s; \
+ sm4e_sve b0.s, z25.s; \
+ sm4e_sve b1.s, z25.s; \
+ sm4e_sve b2.s, z25.s; \
+ sm4e_sve b3.s, z25.s; \
+ sm4e_sve b0.s, z26.s; \
+ sm4e_sve b1.s, z26.s; \
+ sm4e_sve b2.s, z26.s; \
+ sm4e_sve b3.s, z26.s; \
+ sm4e_sve b0.s, z27.s; \
+ sm4e_sve b1.s, z27.s; \
+ sm4e_sve b2.s, z27.s; \
+ sm4e_sve b3.s, z27.s; \
+ sm4e_sve b0.s, z28.s; \
+ sm4e_sve b1.s, z28.s; \
+ sm4e_sve b2.s, z28.s; \
+ sm4e_sve b3.s, z28.s; \
+ sm4e_sve b0.s, z29.s; \
+ sm4e_sve b1.s, z29.s; \
+ sm4e_sve b2.s, z29.s; \
+ sm4e_sve b3.s, z29.s; \
+ sm4e_sve b0.s, z30.s; \
+ sm4e_sve b1.s, z30.s; \
+ sm4e_sve b2.s, z30.s; \
+ sm4e_sve b3.s, z30.s; \
+ sm4e_sve b0.s, z31.s; \
+ sm4e_sve b1.s, z31.s; \
+ sm4e_sve b2.s, z31.s; \
+ sm4e_sve b3.s, z31.s; \
+ tbl b0.b, {b0.b}, RSWAP128.b; \
+ tbl b1.b, {b1.b}, RSWAP128.b; \
+ tbl b2.b, {b2.b}, RSWAP128.b; \
+ tbl b3.b, {b3.b}, RSWAP128.b; \
+ revb b0.s, p0/m, b0.s; \
+ revb b1.s, p0/m, b1.s; \
+ revb b2.s, p0/m, b2.s; \
+ revb b3.s, p0/m, b3.s;
+
+#define SM4_SVE_CE_CRYPT_BLK8(b0, b1, b2, b3, b4, b5, b6, b7) \
+ revb b0.s, p0/m, b0.s; \
+ revb b1.s, p0/m, b1.s; \
+ revb b2.s, p0/m, b2.s; \
+ revb b3.s, p0/m, b3.s; \
+ revb b4.s, p0/m, b4.s; \
+ revb b5.s, p0/m, b5.s; \
+ revb b6.s, p0/m, b6.s; \
+ revb b7.s, p0/m, b7.s; \
+ sm4e_sve b0.s, z24.s; \
+ sm4e_sve b1.s, z24.s; \
+ sm4e_sve b2.s, z24.s; \
+ sm4e_sve b3.s, z24.s; \
+ sm4e_sve b4.s, z24.s; \
+ sm4e_sve b5.s, z24.s; \
+ sm4e_sve b6.s, z24.s; \
+ sm4e_sve b7.s, z24.s; \
+ sm4e_sve b0.s, z25.s; \
+ sm4e_sve b1.s, z25.s; \
+ sm4e_sve b2.s, z25.s; \
+ sm4e_sve b3.s, z25.s; \
+ sm4e_sve b4.s, z25.s; \
+ sm4e_sve b5.s, z25.s; \
+ sm4e_sve b6.s, z25.s; \
+ sm4e_sve b7.s, z25.s; \
+ sm4e_sve b0.s, z26.s; \
+ sm4e_sve b1.s, z26.s; \
+ sm4e_sve b2.s, z26.s; \
+ sm4e_sve b3.s, z26.s; \
+ sm4e_sve b4.s, z26.s; \
+ sm4e_sve b5.s, z26.s; \
+ sm4e_sve b6.s, z26.s; \
+ sm4e_sve b7.s, z26.s; \
+ sm4e_sve b0.s, z27.s; \
+ sm4e_sve b1.s, z27.s; \
+ sm4e_sve b2.s, z27.s; \
+ sm4e_sve b3.s, z27.s; \
+ sm4e_sve b4.s, z27.s; \
+ sm4e_sve b5.s, z27.s; \
+ sm4e_sve b6.s, z27.s; \
+ sm4e_sve b7.s, z27.s; \
+ sm4e_sve b0.s, z28.s; \
+ sm4e_sve b1.s, z28.s; \
+ sm4e_sve b2.s, z28.s; \
+ sm4e_sve b3.s, z28.s; \
+ sm4e_sve b4.s, z28.s; \
+ sm4e_sve b5.s, z28.s; \
+ sm4e_sve b6.s, z28.s; \
+ sm4e_sve b7.s, z28.s; \
+ sm4e_sve b0.s, z29.s; \
+ sm4e_sve b1.s, z29.s; \
+ sm4e_sve b2.s, z29.s; \
+ sm4e_sve b3.s, z29.s; \
+ sm4e_sve b4.s, z29.s; \
+ sm4e_sve b5.s, z29.s; \
+ sm4e_sve b6.s, z29.s; \
+ sm4e_sve b7.s, z29.s; \
+ sm4e_sve b0.s, z30.s; \
+ sm4e_sve b1.s, z30.s; \
+ sm4e_sve b2.s, z30.s; \
+ sm4e_sve b3.s, z30.s; \
+ sm4e_sve b4.s, z30.s; \
+ sm4e_sve b5.s, z30.s; \
+ sm4e_sve b6.s, z30.s; \
+ sm4e_sve b7.s, z30.s; \
+ sm4e_sve b0.s, z31.s; \
+ sm4e_sve b1.s, z31.s; \
+ sm4e_sve b2.s, z31.s; \
+ sm4e_sve b3.s, z31.s; \
+ sm4e_sve b4.s, z31.s; \
+ sm4e_sve b5.s, z31.s; \
+ sm4e_sve b6.s, z31.s; \
+ sm4e_sve b7.s, z31.s; \
+ tbl b0.b, {b0.b}, RSWAP128.b; \
+ tbl b1.b, {b1.b}, RSWAP128.b; \
+ tbl b2.b, {b2.b}, RSWAP128.b; \
+ tbl b3.b, {b3.b}, RSWAP128.b; \
+ tbl b4.b, {b4.b}, RSWAP128.b; \
+ tbl b5.b, {b5.b}, RSWAP128.b; \
+ tbl b6.b, {b6.b}, RSWAP128.b; \
+ tbl b7.b, {b7.b}, RSWAP128.b; \
+ revb b0.s, p0/m, b0.s; \
+ revb b1.s, p0/m, b1.s; \
+ revb b2.s, p0/m, b2.s; \
+ revb b3.s, p0/m, b3.s; \
+ revb b4.s, p0/m, b4.s; \
+ revb b5.s, p0/m, b5.s; \
+ revb b6.s, p0/m, b6.s; \
+ revb b7.s, p0/m, b7.s;
+
+#define SM4_CE_CRYPT_BLK(b0) \
+ rev32 b0.16b, b0.16b; \
+ sm4e b0.4s, v24.4s; \
+ sm4e b0.4s, v25.4s; \
+ sm4e b0.4s, v26.4s; \
+ sm4e b0.4s, v27.4s; \
+ sm4e b0.4s, v28.4s; \
+ sm4e b0.4s, v29.4s; \
+ sm4e b0.4s, v30.4s; \
+ sm4e b0.4s, v31.4s; \
+ rev64 b0.4s, b0.4s; \
+ ext b0.16b, b0.16b, b0.16b, #8; \
+ rev32 b0.16b, b0.16b;
+
+#define inc_le128(zctr) \
+ mov RCTRv.d[1], x8; \
+ mov RCTRv.d[0], x7; \
+ mov zctr.d, RLE128_INC.d; \
+ dup RCTR.q, RCTR.q[0]; \
+ adds x8, x8, x5, LSR #4; \
+ adclt zctr.d, RCTR.d, RZERO.d; \
+ adclt RCTR.d, zctr.d, RZERO.d; \
+ adc x7, x7, xzr; \
+ trn1 zctr.d, RCTR.d, zctr.d; \
+ revb zctr.d, p0/m, zctr.d;
+
+#define inc_le128_4x(zctr0, zctr1, zctr2, zctr3) \
+ mov v8.d[1], x8; \
+ mov v8.d[0], x7; \
+ adds x8, x8, x5, LSR #4; \
+ mov zctr0.d, RLE128_INC.d; \
+ adc x7, x7, xzr; \
+ mov v9.d[1], x8; \
+ mov v9.d[0], x7; \
+ adds x8, x8, x5, LSR #4; \
+ mov zctr1.d, RLE128_INC.d; \
+ adc x7, x7, xzr; \
+ mov v10.d[1], x8; \
+ mov v10.d[0], x7; \
+ adds x8, x8, x5, LSR #4; \
+ mov zctr2.d, RLE128_INC.d; \
+ adc x7, x7, xzr; \
+ mov v11.d[1], x8; \
+ mov v11.d[0], x7; \
+ adds x8, x8, x5, LSR #4; \
+ mov zctr3.d, RLE128_INC.d; \
+ adc x7, x7, xzr; \
+ dup z8.q, z8.q[0]; \
+ dup z9.q, z9.q[0]; \
+ dup z10.q, z10.q[0]; \
+ dup z11.q, z11.q[0]; \
+ adclt zctr0.d, z8.d, RZERO.d; \
+ adclt zctr1.d, z9.d, RZERO.d; \
+ adclt zctr2.d, z10.d, RZERO.d; \
+ adclt zctr3.d, z11.d, RZERO.d; \
+ adclt z8.d, zctr0.d, RZERO.d; \
+ adclt z9.d, zctr1.d, RZERO.d; \
+ adclt z10.d, zctr2.d, RZERO.d; \
+ adclt z11.d, zctr3.d, RZERO.d; \
+ trn1 zctr0.d, z8.d, zctr0.d; \
+ trn1 zctr1.d, z9.d, zctr1.d; \
+ trn1 zctr2.d, z10.d, zctr2.d; \
+ trn1 zctr3.d, z11.d, zctr3.d; \
+ revb zctr0.d, p0/m, zctr0.d; \
+ revb zctr1.d, p0/m, zctr1.d; \
+ revb zctr2.d, p0/m, zctr2.d; \
+ revb zctr3.d, p0/m, zctr3.d;
+
+#define inc_le128_8x(zctr0, zctr1, zctr2, zctr3, \
+ zctr4, zctr5, zctr6, zctr7) \
+ mov v8.d[1], x8; \
+ mov v8.d[0], x7; \
+ adds x8, x8, x5, LSR #4; \
+ mov zctr0.d, RLE128_INC.d; \
+ adc x7, x7, xzr; \
+ mov v9.d[1], x8; \
+ mov v9.d[0], x7; \
+ adds x8, x8, x5, LSR #4; \
+ mov zctr1.d, RLE128_INC.d; \
+ adc x7, x7, xzr; \
+ mov v10.d[1], x8; \
+ mov v10.d[0], x7; \
+ adds x8, x8, x5, LSR #4; \
+ mov zctr2.d, RLE128_INC.d; \
+ adc x7, x7, xzr; \
+ mov v11.d[1], x8; \
+ mov v11.d[0], x7; \
+ adds x8, x8, x5, LSR #4; \
+ mov zctr3.d, RLE128_INC.d; \
+ adc x7, x7, xzr; \
+ mov v12.d[1], x8; \
+ mov v12.d[0], x7; \
+ adds x8, x8, x5, LSR #4; \
+ mov zctr4.d, RLE128_INC.d; \
+ adc x7, x7, xzr; \
+ mov v13.d[1], x8; \
+ mov v13.d[0], x7; \
+ adds x8, x8, x5, LSR #4; \
+ mov zctr5.d, RLE128_INC.d; \
+ adc x7, x7, xzr; \
+ mov v14.d[1], x8; \
+ mov v14.d[0], x7; \
+ adds x8, x8, x5, LSR #4; \
+ mov zctr6.d, RLE128_INC.d; \
+ adc x7, x7, xzr; \
+ mov v15.d[1], x8; \
+ mov v15.d[0], x7; \
+ adds x8, x8, x5, LSR #4; \
+ mov zctr7.d, RLE128_INC.d; \
+ adc x7, x7, xzr; \
+ dup z8.q, z8.q[0]; \
+ dup z9.q, z9.q[0]; \
+ dup z10.q, z10.q[0]; \
+ dup z11.q, z11.q[0]; \
+ dup z12.q, z12.q[0]; \
+ dup z13.q, z13.q[0]; \
+ dup z14.q, z14.q[0]; \
+ dup z15.q, z15.q[0]; \
+ adclt zctr0.d, z8.d, RZERO.d; \
+ adclt zctr1.d, z9.d, RZERO.d; \
+ adclt zctr2.d, z10.d, RZERO.d; \
+ adclt zctr3.d, z11.d, RZERO.d; \
+ adclt zctr4.d, z12.d, RZERO.d; \
+ adclt zctr5.d, z13.d, RZERO.d; \
+ adclt zctr6.d, z14.d, RZERO.d; \
+ adclt zctr7.d, z15.d, RZERO.d; \
+ adclt z8.d, zctr0.d, RZERO.d; \
+ adclt z9.d, zctr1.d, RZERO.d; \
+ adclt z10.d, zctr2.d, RZERO.d; \
+ adclt z11.d, zctr3.d, RZERO.d; \
+ adclt z12.d, zctr4.d, RZERO.d; \
+ adclt z13.d, zctr5.d, RZERO.d; \
+ adclt z14.d, zctr6.d, RZERO.d; \
+ adclt z15.d, zctr7.d, RZERO.d; \
+ trn1 zctr0.d, z8.d, zctr0.d; \
+ trn1 zctr1.d, z9.d, zctr1.d; \
+ trn1 zctr2.d, z10.d, zctr2.d; \
+ trn1 zctr3.d, z11.d, zctr3.d; \
+ trn1 zctr4.d, z12.d, zctr4.d; \
+ trn1 zctr5.d, z13.d, zctr5.d; \
+ trn1 zctr6.d, z14.d, zctr6.d; \
+ trn1 zctr7.d, z15.d, zctr7.d; \
+ revb zctr0.d, p0/m, zctr0.d; \
+ revb zctr1.d, p0/m, zctr1.d; \
+ revb zctr2.d, p0/m, zctr2.d; \
+ revb zctr3.d, p0/m, zctr3.d; \
+ revb zctr4.d, p0/m, zctr4.d; \
+ revb zctr5.d, p0/m, zctr5.d; \
+ revb zctr6.d, p0/m, zctr6.d; \
+ revb zctr7.d, p0/m, zctr7.d;
+
+
+.align 3
+SYM_FUNC_START(sm4_sve_ce_crypt)
+ /* input:
+ * x0: round key array, CTX
+ * x1: dst
+ * x2: src
+ * w3: nblocks
+ */
+ uxtw x3, w3
+ SM4_PREPARE(x0)
+
+.Lcrypt_loop_8x:
+ sub x3, x3, x5, LSR #1 /* x3 -= 8 * (blocks per vector) */
+ tbnz x3, #63, .Lcrypt_4x
+
+ ld1b {z0.b}, p0/z, [x2]
+ ld1b {z1.b}, p0/z, [x2, #1, MUL VL]
+ ld1b {z2.b}, p0/z, [x2, #2, MUL VL]
+ ld1b {z3.b}, p0/z, [x2, #3, MUL VL]
+ ld1b {z4.b}, p0/z, [x2, #4, MUL VL]
+ ld1b {z5.b}, p0/z, [x2, #5, MUL VL]
+ ld1b {z6.b}, p0/z, [x2, #6, MUL VL]
+ ld1b {z7.b}, p0/z, [x2, #7, MUL VL]
+
+ SM4_SVE_CE_CRYPT_BLK8(z0, z1, z2, z3, z4, z5, z6, z7)
+
+ st1b {z0.b}, p0, [x1]
+ st1b {z1.b}, p0, [x1, #1, MUL VL]
+ st1b {z2.b}, p0, [x1, #2, MUL VL]
+ st1b {z3.b}, p0, [x1, #3, MUL VL]
+ st1b {z4.b}, p0, [x1, #4, MUL VL]
+ st1b {z5.b}, p0, [x1, #5, MUL VL]
+ st1b {z6.b}, p0, [x1, #6, MUL VL]
+ st1b {z7.b}, p0, [x1, #7, MUL VL]
+
+ addvl x2, x2, #8
+ addvl x1, x1, #8
+
+ cbz x3, .Lcrypt_end
+ b .Lcrypt_loop_8x
+
+.Lcrypt_4x:
+ add x3, x3, x5, LSR #1
+ cmp x3, x5, LSR #2
+ blt .Lcrypt_loop_1x
+
+ sub x3, x3, x5, LSR #2 /* x3 -= 4 * (blocks per vector) */
+
+ ld1b {z0.b}, p0/z, [x2]
+ ld1b {z1.b}, p0/z, [x2, #1, MUL VL]
+ ld1b {z2.b}, p0/z, [x2, #2, MUL VL]
+ ld1b {z3.b}, p0/z, [x2, #3, MUL VL]
+
+ SM4_SVE_CE_CRYPT_BLK4(z0, z1, z2, z3)
+
+ st1b {z0.b}, p0, [x1]
+ st1b {z1.b}, p0, [x1, #1, MUL VL]
+ st1b {z2.b}, p0, [x1, #2, MUL VL]
+ st1b {z3.b}, p0, [x1, #3, MUL VL]
+
+ addvl x2, x2, #4
+ addvl x1, x1, #4
+
+ cbz x3, .Lcrypt_end
+
+.Lcrypt_loop_1x:
+ cmp x3, x5, LSR #4
+ blt .Lcrypt_ce_loop_1x
+
+ sub x3, x3, x5, LSR #4 /* x3 -= blocks per vector */
+
+ ld1b {z0.b}, p0/z, [x2]
+
+ SM4_SVE_CE_CRYPT_BLK(z0)
+
+ st1b {z0.b}, p0, [x1]
+
+ addvl x2, x2, #1
+ addvl x1, x1, #1
+
+ cbz x3, .Lcrypt_end
+ b .Lcrypt_loop_1x
+
+.Lcrypt_ce_loop_1x:
+ sub x3, x3, #1
+
+ ld1 {v0.16b}, [x2], #16
+ SM4_CE_CRYPT_BLK(v0)
+ st1 {v0.16b}, [x1], #16
+
+ cbnz x3, .Lcrypt_ce_loop_1x
+
+.Lcrypt_end:
+ ret
+SYM_FUNC_END(sm4_sve_ce_crypt)
+
+.align 3
+SYM_FUNC_START(sm4_sve_ce_cbc_dec)
+ /* input:
+ * x0: round key array, CTX
+ * x1: dst
+ * x2: src
+ * x3: iv (big endian, 128 bit)
+ * w4: nblocks
+ */
+ uxtw x4, w4
+ SM4_PREPARE(x0)
+
+ ld1 {RIVv.16b}, [x3]
+ ext RIV.b, RIV.b, RIV.b, #16
+
+.Lcbc_dec_loop_8x:
+ sub x4, x4, x5, LSR #1 /* x4 -= 8 * (blocks per vector) */
+ tbnz x4, #63, .Lcbc_dec_4x
+
+ ld1b {z15.b}, p0/z, [x2]
+ ld1b {z14.b}, p0/z, [x2, #1, MUL VL]
+ ld1b {z13.b}, p0/z, [x2, #2, MUL VL]
+ ld1b {z12.b}, p0/z, [x2, #3, MUL VL]
+ ld1b {z11.b}, p0/z, [x2, #4, MUL VL]
+ ld1b {z10.b}, p0/z, [x2, #5, MUL VL]
+ ld1b {z9.b}, p0/z, [x2, #6, MUL VL]
+ ld1b {z8.b}, p0/z, [x2, #7, MUL VL]
+ rev z0.b, z15.b
+ rev z1.b, z14.b
+ rev z2.b, z13.b
+ rev z3.b, z12.b
+ rev z4.b, z11.b
+ rev z5.b, z10.b
+ rev z6.b, z9.b
+ rev z7.b, z8.b
+ rev RTMP0.b, RIV.b
+ ext z7.b, z7.b, z6.b, #16
+ ext z6.b, z6.b, z5.b, #16
+ ext z5.b, z5.b, z4.b, #16
+ ext z4.b, z4.b, z3.b, #16
+ ext z3.b, z3.b, z2.b, #16
+ ext z2.b, z2.b, z1.b, #16
+ ext z1.b, z1.b, z0.b, #16
+ ext z0.b, z0.b, RTMP0.b, #16
+ rev z7.b, z7.b
+ rev z6.b, z6.b
+ rev z5.b, z5.b
+ rev z4.b, z4.b
+ rev z3.b, z3.b
+ rev z2.b, z2.b
+ rev z1.b, z1.b
+ rev z0.b, z0.b
+ mov RIV.d, z8.d
+
+ SM4_SVE_CE_CRYPT_BLK8(z15, z14, z13, z12, z11, z10, z9, z8)
+
+ eor z0.d, z0.d, z15.d
+ eor z1.d, z1.d, z14.d
+ eor z2.d, z2.d, z13.d
+ eor z3.d, z3.d, z12.d
+ eor z4.d, z4.d, z11.d
+ eor z5.d, z5.d, z10.d
+ eor z6.d, z6.d, z9.d
+ eor z7.d, z7.d, z8.d
+ st1b {z0.b}, p0, [x1]
+ st1b {z1.b}, p0, [x1, #1, MUL VL]
+ st1b {z2.b}, p0, [x1, #2, MUL VL]
+ st1b {z3.b}, p0, [x1, #3, MUL VL]
+ st1b {z4.b}, p0, [x1, #4, MUL VL]
+ st1b {z5.b}, p0, [x1, #5, MUL VL]
+ st1b {z6.b}, p0, [x1, #6, MUL VL]
+ st1b {z7.b}, p0, [x1, #7, MUL VL]
+
+ addvl x2, x2, #8
+ addvl x1, x1, #8
+
+ cbz x4, .Lcbc_dec_end
+ b .Lcbc_dec_loop_8x
+
+.Lcbc_dec_4x:
+ add x4, x4, x5, LSR #1
+ cmp x4, x5, LSR #2
+ blt .Lcbc_dec_loop_1x
+
+ sub x4, x4, x5, LSR #2 /* x4 - (4 * VL) */
+
+ ld1b {z15.b}, p0/z, [x2]
+ ld1b {z14.b}, p0/z, [x2, #1, MUL VL]
+ ld1b {z13.b}, p0/z, [x2, #2, MUL VL]
+ ld1b {z12.b}, p0/z, [x2, #3, MUL VL]
+ rev z0.b, z15.b
+ rev z1.b, z14.b
+ rev z2.b, z13.b
+ rev z3.b, z12.b
+ rev RTMP0.b, RIV.b
+ ext z3.b, z3.b, z2.b, #16
+ ext z2.b, z2.b, z1.b, #16
+ ext z1.b, z1.b, z0.b, #16
+ ext z0.b, z0.b, RTMP0.b, #16
+ rev z3.b, z3.b
+ rev z2.b, z2.b
+ rev z1.b, z1.b
+ rev z0.b, z0.b
+ mov RIV.d, z12.d
+
+ SM4_SVE_CE_CRYPT_BLK4(z15, z14, z13, z12)
+
+ eor z0.d, z0.d, z15.d
+ eor z1.d, z1.d, z14.d
+ eor z2.d, z2.d, z13.d
+ eor z3.d, z3.d, z12.d
+ st1b {z0.b}, p0, [x1]
+ st1b {z1.b}, p0, [x1, #1, MUL VL]
+ st1b {z2.b}, p0, [x1, #2, MUL VL]
+ st1b {z3.b}, p0, [x1, #3, MUL VL]
+
+ addvl x2, x2, #4
+ addvl x1, x1, #4
+
+ cbz x4, .Lcbc_dec_end
+
+.Lcbc_dec_loop_1x:
+ cmp x4, x5, LSR #4
+ blt .Lcbc_dec_ce
+
+ sub x4, x4, x5, LSR #4 /* x4 - VL */
+
+ ld1b {z15.b}, p0/z, [x2]
+ rev RTMP0.b, RIV.b
+ rev z0.b, z15.b
+ ext z0.b, z0.b, RTMP0.b, #16
+ rev z0.b, z0.b
+ mov RIV.d, z15.d
+
+ SM4_SVE_CE_CRYPT_BLK(z15)
+
+ eor z0.d, z0.d, z15.d
+ st1b {z0.b}, p0, [x1]
+
+ addvl x2, x2, #1
+ addvl x1, x1, #1
+
+ cbz x4, .Lcbc_dec_end
+ b .Lcbc_dec_loop_1x
+
+.Lcbc_dec_ce:
+ rev RIV.s, RIV.s
+ tbl RIV.b, {RIV.b}, RSWAP128.b
+
+.Lcbc_dec_ce_loop_1x:
+ sub x4, x4, #1
+
+ ld1 {v15.16b}, [x2], #16
+ mov v0.16b, RIVv.16b
+ mov RIVv.16b, v15.16b
+ SM4_CE_CRYPT_BLK(v15)
+ eor v0.16b, v0.16b, v15.16b
+ st1 {v0.16b}, [x1], #16
+
+ cbnz x4, .Lcbc_dec_ce_loop_1x
+
+ ext RIV.b, RIV.b, RIV.b, #16
+
+.Lcbc_dec_end:
+ /* store new IV */
+ rev RIV.s, RIV.s
+ tbl RIV.b, {RIV.b}, RSWAP128.b
+ st1 {RIVv.16b}, [x3]
+
+ ret
+SYM_FUNC_END(sm4_sve_ce_cbc_dec)
+
+.align 3
+SYM_FUNC_START(sm4_sve_ce_cfb_dec)
+ /* input:
+ * x0: round key array, CTX
+ * x1: dst
+ * x2: src
+ * x3: iv (big endian, 128 bit)
+ * w4: nblocks
+ */
+ uxtw x4, w4
+ SM4_PREPARE(x0)
+
+ ld1 {RIVv.16b}, [x3]
+ ext RIV.b, RIV.b, RIV.b, #16
+
+.Lcfb_dec_loop_8x:
+ sub x4, x4, x5, LSR #1 /* x4 - (8 * VL) */
+ tbnz x4, #63, .Lcfb_dec_4x
+
+ ld1b {z15.b}, p0/z, [x2]
+ ld1b {z14.b}, p0/z, [x2, #1, MUL VL]
+ ld1b {z13.b}, p0/z, [x2, #2, MUL VL]
+ ld1b {z12.b}, p0/z, [x2, #3, MUL VL]
+ ld1b {z11.b}, p0/z, [x2, #4, MUL VL]
+ ld1b {z10.b}, p0/z, [x2, #5, MUL VL]
+ ld1b {z9.b}, p0/z, [x2, #6, MUL VL]
+ ld1b {z8.b}, p0/z, [x2, #7, MUL VL]
+ rev z0.b, z15.b
+ rev z1.b, z14.b
+ rev z2.b, z13.b
+ rev z3.b, z12.b
+ rev z4.b, z11.b
+ rev z5.b, z10.b
+ rev z6.b, z9.b
+ rev z7.b, z8.b
+ rev RTMP0.b, RIV.b
+ ext z7.b, z7.b, z6.b, #16
+ ext z6.b, z6.b, z5.b, #16
+ ext z5.b, z5.b, z4.b, #16
+ ext z4.b, z4.b, z3.b, #16
+ ext z3.b, z3.b, z2.b, #16
+ ext z2.b, z2.b, z1.b, #16
+ ext z1.b, z1.b, z0.b, #16
+ ext z0.b, z0.b, RTMP0.b, #16
+ rev z7.b, z7.b
+ rev z6.b, z6.b
+ rev z5.b, z5.b
+ rev z4.b, z4.b
+ rev z3.b, z3.b
+ rev z2.b, z2.b
+ rev z1.b, z1.b
+ rev z0.b, z0.b
+ mov RIV.d, z8.d
+
+ SM4_SVE_CE_CRYPT_BLK8(z0, z1, z2, z3, z4, z5, z6, z7)
+
+ eor z0.d, z0.d, z15.d
+ eor z1.d, z1.d, z14.d
+ eor z2.d, z2.d, z13.d
+ eor z3.d, z3.d, z12.d
+ eor z4.d, z4.d, z11.d
+ eor z5.d, z5.d, z10.d
+ eor z6.d, z6.d, z9.d
+ eor z7.d, z7.d, z8.d
+ st1b {z0.b}, p0, [x1]
+ st1b {z1.b}, p0, [x1, #1, MUL VL]
+ st1b {z2.b}, p0, [x1, #2, MUL VL]
+ st1b {z3.b}, p0, [x1, #3, MUL VL]
+ st1b {z4.b}, p0, [x1, #4, MUL VL]
+ st1b {z5.b}, p0, [x1, #5, MUL VL]
+ st1b {z6.b}, p0, [x1, #6, MUL VL]
+ st1b {z7.b}, p0, [x1, #7, MUL VL]
+
+ addvl x2, x2, #8
+ addvl x1, x1, #8
+
+ cbz x4, .Lcfb_dec_end
+ b .Lcfb_dec_loop_8x
+
+.Lcfb_dec_4x:
+ add x4, x4, x5, LSR #1
+ cmp x4, x5, LSR #2
+ blt .Lcfb_dec_loop_1x
+
+ sub x4, x4, x5, LSR #2 /* x4 - (4 * VL) */
+
+ ld1b {z15.b}, p0/z, [x2]
+ ld1b {z14.b}, p0/z, [x2, #1, MUL VL]
+ ld1b {z13.b}, p0/z, [x2, #2, MUL VL]
+ ld1b {z12.b}, p0/z, [x2, #3, MUL VL]
+ rev z0.b, z15.b
+ rev z1.b, z14.b
+ rev z2.b, z13.b
+ rev z3.b, z12.b
+ rev RTMP0.b, RIV.b
+ ext z3.b, z3.b, z2.b, #16
+ ext z2.b, z2.b, z1.b, #16
+ ext z1.b, z1.b, z0.b, #16
+ ext z0.b, z0.b, RTMP0.b, #16
+ rev z3.b, z3.b
+ rev z2.b, z2.b
+ rev z1.b, z1.b
+ rev z0.b, z0.b
+ mov RIV.d, z12.d
+
+ SM4_SVE_CE_CRYPT_BLK4(z0, z1, z2, z3)
+
+ eor z0.d, z0.d, z15.d
+ eor z1.d, z1.d, z14.d
+ eor z2.d, z2.d, z13.d
+ eor z3.d, z3.d, z12.d
+ st1b {z0.b}, p0, [x1]
+ st1b {z1.b}, p0, [x1, #1, MUL VL]
+ st1b {z2.b}, p0, [x1, #2, MUL VL]
+ st1b {z3.b}, p0, [x1, #3, MUL VL]
+
+ addvl x2, x2, #4
+ addvl x1, x1, #4
+
+ cbz x4, .Lcfb_dec_end
+
+.Lcfb_dec_loop_1x:
+ cmp x4, x5, LSR #4
+ blt .Lcfb_dec_ce
+
+ sub x4, x4, x5, LSR #4 /* x4 - VL */
+
+ ld1b {z15.b}, p0/z, [x2]
+ rev RTMP0.b, RIV.b
+ rev z0.b, z15.b
+ ext z0.b, z0.b, RTMP0.b, #16
+ rev z0.b, z0.b
+ mov RIV.d, z15.d
+
+ SM4_SVE_CE_CRYPT_BLK(z0)
+
+ eor z0.d, z0.d, z15.d
+ st1b {z0.b}, p0, [x1]
+
+ addvl x2, x2, #1
+ addvl x1, x1, #1
+
+ cbz x4, .Lcfb_dec_end
+ b .Lcfb_dec_loop_1x
+
+.Lcfb_dec_ce:
+ rev RIV.s, RIV.s
+ tbl RIV.b, {RIV.b}, RSWAP128.b
+
+.Lcfb_dec_ce_loop_1x:
+ sub x4, x4, #1
+
+ ld1 {v15.16b}, [x2], #16
+ mov v0.16b, RIVv.16b
+ mov RIVv.16b, v15.16b
+ SM4_CE_CRYPT_BLK(v0)
+ eor v0.16b, v0.16b, v15.16b
+ st1 {v0.16b}, [x1], #16
+
+ cbnz x4, .Lcfb_dec_ce_loop_1x
+
+ ext RIV.b, RIV.b, RIV.b, #16
+
+.Lcfb_dec_end:
+ /* store new IV */
+ rev RIV.s, RIV.s
+ tbl RIV.b, {RIV.b}, RSWAP128.b
+ st1 {RIVv.16b}, [x3]
+
+ ret
+SYM_FUNC_END(sm4_sve_ce_cfb_dec)
+
+.align 3
+SYM_FUNC_START(sm4_sve_ce_ctr_crypt)
+ /* input:
+ * x0: round key array, CTX
+ * x1: dst
+ * x2: src
+ * x3: ctr (big endian, 128 bit)
+ * w4: nblocks
+ */
+ uxtw x4, w4
+ SM4_PREPARE(x0)
+
+ dup RZERO.d, #0
+ adr_l x6, .Lle128_inc
+ ld1b {RLE128_INC.b}, p0/z, [x6]
+
+ ldp x7, x8, [x3]
+ rev x7, x7
+ rev x8, x8
+
+.Lctr_loop_8x:
+ sub x4, x4, x5, LSR #1 /* x4 - (8 * VL) */
+ tbnz x4, #63, .Lctr_4x
+
+ inc_le128_8x(z0, z1, z2, z3, z4, z5, z6, z7)
+
+ ld1b {z8.b}, p0/z, [x2]
+ ld1b {z9.b}, p0/z, [x2, #1, MUL VL]
+ ld1b {z10.b}, p0/z, [x2, #2, MUL VL]
+ ld1b {z11.b}, p0/z, [x2, #3, MUL VL]
+ ld1b {z12.b}, p0/z, [x2, #4, MUL VL]
+ ld1b {z13.b}, p0/z, [x2, #5, MUL VL]
+ ld1b {z14.b}, p0/z, [x2, #6, MUL VL]
+ ld1b {z15.b}, p0/z, [x2, #7, MUL VL]
+
+ SM4_SVE_CE_CRYPT_BLK8(z0, z1, z2, z3, z4, z5, z6, z7)
+
+ eor z0.d, z0.d, z8.d
+ eor z1.d, z1.d, z9.d
+ eor z2.d, z2.d, z10.d
+ eor z3.d, z3.d, z11.d
+ eor z4.d, z4.d, z12.d
+ eor z5.d, z5.d, z13.d
+ eor z6.d, z6.d, z14.d
+ eor z7.d, z7.d, z15.d
+
+ st1b {z0.b}, p0, [x1]
+ st1b {z1.b}, p0, [x1, #1, MUL VL]
+ st1b {z2.b}, p0, [x1, #2, MUL VL]
+ st1b {z3.b}, p0, [x1, #3, MUL VL]
+ st1b {z4.b}, p0, [x1, #4, MUL VL]
+ st1b {z5.b}, p0, [x1, #5, MUL VL]
+ st1b {z6.b}, p0, [x1, #6, MUL VL]
+ st1b {z7.b}, p0, [x1, #7, MUL VL]
+
+ addvl x2, x2, #8
+ addvl x1, x1, #8
+
+ cbz x4, .Lctr_end
+ b .Lctr_loop_8x
+
+.Lctr_4x:
+ add x4, x4, x5, LSR #1
+ cmp x4, x5, LSR #2
+ blt .Lctr_loop_1x
+
+ sub x4, x4, x5, LSR #2 /* x4 - (4 * VL) */
+
+ inc_le128_4x(z0, z1, z2, z3)
+
+ ld1b {z8.b}, p0/z, [x2]
+ ld1b {z9.b}, p0/z, [x2, #1, MUL VL]
+ ld1b {z10.b}, p0/z, [x2, #2, MUL VL]
+ ld1b {z11.b}, p0/z, [x2, #3, MUL VL]
+
+ SM4_SVE_CE_CRYPT_BLK4(z0, z1, z2, z3)
+
+ eor z0.d, z0.d, z8.d
+ eor z1.d, z1.d, z9.d
+ eor z2.d, z2.d, z10.d
+ eor z3.d, z3.d, z11.d
+
+ st1b {z0.b}, p0, [x1]
+ st1b {z1.b}, p0, [x1, #1, MUL VL]
+ st1b {z2.b}, p0, [x1, #2, MUL VL]
+ st1b {z3.b}, p0, [x1, #3, MUL VL]
+
+ addvl x2, x2, #4
+ addvl x1, x1, #4
+
+ cbz x4, .Lctr_end
+
+.Lctr_loop_1x:
+ cmp x4, x5, LSR #4
+ blt .Lctr_ce_loop_1x
+
+ sub x4, x4, x5, LSR #4 /* x4 - VL */
+
+ inc_le128(z0)
+ ld1b {z8.b}, p0/z, [x2]
+
+ SM4_SVE_CE_CRYPT_BLK(z0)
+
+ eor z0.d, z0.d, z8.d
+ st1b {z0.b}, p0, [x1]
+
+ addvl x2, x2, #1
+ addvl x1, x1, #1
+
+ cbz x4, .Lctr_end
+ b .Lctr_loop_1x
+
+.Lctr_ce_loop_1x:
+ sub x4, x4, #1
+
+ /* inc_le128 for CE */
+ mov v0.d[1], x8
+ mov v0.d[0], x7
+ adds x8, x8, #1
+ rev64 v0.16b, v0.16b
+ adc x7, x7, xzr
+
+ ld1 {v8.16b}, [x2], #16
+
+ SM4_CE_CRYPT_BLK(v0)
+
+ eor v0.16b, v0.16b, v8.16b
+ st1 {v0.16b}, [x1], #16
+
+ cbnz x4, .Lctr_ce_loop_1x
+
+.Lctr_end:
+ /* store new CTR */
+ rev x7, x7
+ rev x8, x8
+ stp x7, x8, [x3]
+
+ ret
+SYM_FUNC_END(sm4_sve_ce_ctr_crypt)
+
+.align 3
+SYM_FUNC_START(sm4_sve_get_vl)
+ /* VL in bytes */
+ rdvl x0, #1
+
+ ret
+SYM_FUNC_END(sm4_sve_get_vl)
+
+
+ .section ".rodata", "a"
+ .align 4
+.Lbswap128_mask:
+ .byte 0x0c, 0x0d, 0x0e, 0x0f, 0x08, 0x09, 0x0a, 0x0b
+ .byte 0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03
+ .byte 0x1c, 0x1d, 0x1e, 0x1f, 0x18, 0x19, 0x1a, 0x1b
+ .byte 0x14, 0x15, 0x16, 0x17, 0x10, 0x11, 0x12, 0x13
+ .byte 0x2c, 0x2d, 0x2e, 0x2f, 0x28, 0x29, 0x2a, 0x2b
+ .byte 0x24, 0x25, 0x26, 0x27, 0x20, 0x21, 0x22, 0x23
+ .byte 0x3c, 0x3d, 0x3e, 0x3f, 0x38, 0x39, 0x3a, 0x3b
+ .byte 0x34, 0x35, 0x36, 0x37, 0x30, 0x31, 0x32, 0x33
+ .byte 0x4c, 0x4d, 0x4e, 0x4f, 0x48, 0x49, 0x4a, 0x4b
+ .byte 0x44, 0x45, 0x46, 0x47, 0x40, 0x41, 0x42, 0x43
+ .byte 0x5c, 0x5d, 0x5e, 0x5f, 0x58, 0x59, 0x5a, 0x5b
+ .byte 0x54, 0x55, 0x56, 0x57, 0x50, 0x51, 0x52, 0x53
+ .byte 0x6c, 0x6d, 0x6e, 0x6f, 0x68, 0x69, 0x6a, 0x6b
+ .byte 0x64, 0x65, 0x66, 0x67, 0x60, 0x61, 0x62, 0x63
+ .byte 0x7c, 0x7d, 0x7e, 0x7f, 0x78, 0x79, 0x7a, 0x7b
+ .byte 0x74, 0x75, 0x76, 0x77, 0x70, 0x71, 0x72, 0x73
+ .byte 0x8c, 0x8d, 0x8e, 0x8f, 0x88, 0x89, 0x8a, 0x8b
+ .byte 0x84, 0x85, 0x86, 0x87, 0x80, 0x81, 0x82, 0x83
+ .byte 0x9c, 0x9d, 0x9e, 0x9f, 0x98, 0x99, 0x9a, 0x9b
+ .byte 0x94, 0x95, 0x96, 0x97, 0x90, 0x91, 0x92, 0x93
+ .byte 0xac, 0xad, 0xae, 0xaf, 0xa8, 0xa9, 0xaa, 0xab
+ .byte 0xa4, 0xa5, 0xa6, 0xa7, 0xa0, 0xa1, 0xa2, 0xa3
+ .byte 0xbc, 0xbd, 0xbe, 0xbf, 0xb8, 0xb9, 0xba, 0xbb
+ .byte 0xb4, 0xb5, 0xb6, 0xb7, 0xb0, 0xb1, 0xb2, 0xb3
+ .byte 0xcc, 0xcd, 0xce, 0xcf, 0xc8, 0xc9, 0xca, 0xcb
+ .byte 0xc4, 0xc5, 0xc6, 0xc7, 0xc0, 0xc1, 0xc2, 0xc3
+ .byte 0xdc, 0xdd, 0xde, 0xdf, 0xd8, 0xd9, 0xda, 0xdb
+ .byte 0xd4, 0xd5, 0xd6, 0xd7, 0xd0, 0xd1, 0xd2, 0xd3
+ .byte 0xec, 0xed, 0xee, 0xef, 0xe8, 0xe9, 0xea, 0xeb
+ .byte 0xe4, 0xe5, 0xe6, 0xe7, 0xe0, 0xe1, 0xe2, 0xe3
+ .byte 0xfc, 0xfd, 0xfe, 0xff, 0xf8, 0xf9, 0xfa, 0xfb
+ .byte 0xf4, 0xf5, 0xf6, 0xf7, 0xf0, 0xf1, 0xf2, 0xf3
+
+.Lle128_inc:
+ .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x05, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x06, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x07, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x09, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x0a, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x0b, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x0c, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x0d, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x0e, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x0f, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+ .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
diff --git a/arch/arm64/crypto/sm4-sve-ce-glue.c b/arch/arm64/crypto/sm4-sve-ce-glue.c
new file mode 100644
index 000000000000..fc797b72b5f0
--- /dev/null
+++ b/arch/arm64/crypto/sm4-sve-ce-glue.c
@@ -0,0 +1,332 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * SM4 Cipher Algorithm, using ARMv9 Crypto Extensions with SVE2
+ * as specified in
+ * https://tools.ietf.org/id/draft-ribose-cfrg-sm4-10.html
+ *
+ * Copyright (C) 2022, Alibaba Group.
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#include <linux/module.h>
+#include <linux/crypto.h>
+#include <linux/kernel.h>
+#include <linux/cpufeature.h>
+#include <asm/neon.h>
+#include <asm/simd.h>
+#include <crypto/internal/simd.h>
+#include <crypto/internal/skcipher.h>
+#include <crypto/sm4.h>
+#include "sm4-ce.h"
+
+asmlinkage void sm4_sve_ce_crypt(const u32 *rkey, u8 *dst,
+ const u8 *src, unsigned int nblocks);
+asmlinkage void sm4_sve_ce_cbc_dec(const u32 *rkey_dec, u8 *dst,
+ const u8 *src, u8 *iv,
+ unsigned int nblocks);
+asmlinkage void sm4_sve_ce_cfb_dec(const u32 *rkey_enc, u8 *dst,
+ const u8 *src, u8 *iv,
+ unsigned int nblocks);
+asmlinkage void sm4_sve_ce_ctr_crypt(const u32 *rkey_enc, u8 *dst,
+ const u8 *src, u8 *iv,
+ unsigned int nblocks);
+asmlinkage unsigned int sm4_sve_get_vl(void);
+
+
+static int sm4_setkey(struct crypto_skcipher *tfm, const u8 *key,
+ unsigned int key_len)
+{
+ struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+ if (key_len != SM4_KEY_SIZE)
+ return -EINVAL;
+
+ kernel_neon_begin();
+ sm4_ce_expand_key(key, ctx->rkey_enc, ctx->rkey_dec,
+ crypto_sm4_fk, crypto_sm4_ck);
+ kernel_neon_end();
+
+ return 0;
+}
+
+static int ecb_crypt(struct skcipher_request *req, const u32 *rkey)
+{
+ struct skcipher_walk walk;
+ unsigned int nbytes;
+ int err;
+
+ err = skcipher_walk_virt(&walk, req, false);
+
+ while ((nbytes = walk.nbytes) > 0) {
+ const u8 *src = walk.src.virt.addr;
+ u8 *dst = walk.dst.virt.addr;
+ unsigned int nblocks;
+
+ nblocks = nbytes / SM4_BLOCK_SIZE;
+ if (nblocks) {
+ kernel_neon_begin();
+
+ sm4_sve_ce_crypt(rkey, dst, src, nblocks);
+
+ kernel_neon_end();
+ }
+
+ err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE);
+ }
+
+ return err;
+}
+
+static int ecb_encrypt(struct skcipher_request *req)
+{
+ struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+ struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+ return ecb_crypt(req, ctx->rkey_enc);
+}
+
+static int ecb_decrypt(struct skcipher_request *req)
+{
+ struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+ struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+ return ecb_crypt(req, ctx->rkey_dec);
+}
+
+static int cbc_crypt(struct skcipher_request *req, const u32 *rkey,
+ void (*sm4_cbc_crypt)(const u32 *rkey, u8 *dst,
+ const u8 *src, u8 *iv, unsigned int nblocks))
+{
+ struct skcipher_walk walk;
+ unsigned int nbytes;
+ int err;
+
+ err = skcipher_walk_virt(&walk, req, false);
+
+ while ((nbytes = walk.nbytes) > 0) {
+ const u8 *src = walk.src.virt.addr;
+ u8 *dst = walk.dst.virt.addr;
+ unsigned int nblocks;
+
+ nblocks = nbytes / SM4_BLOCK_SIZE;
+ if (nblocks) {
+ kernel_neon_begin();
+
+ sm4_cbc_crypt(rkey, dst, src, walk.iv, nblocks);
+
+ kernel_neon_end();
+ }
+
+ err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE);
+ }
+
+ return err;
+}
+
+static int cbc_encrypt(struct skcipher_request *req)
+{
+ struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+ struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+ return cbc_crypt(req, ctx->rkey_enc, sm4_ce_cbc_enc);
+}
+
+static int cbc_decrypt(struct skcipher_request *req)
+{
+ struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+ struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+ return cbc_crypt(req, ctx->rkey_dec, sm4_sve_ce_cbc_dec);
+}
+
+static int cfb_crypt(struct skcipher_request *req,
+ void (*sm4_cfb_crypt)(const u32 *rkey, u8 *dst,
+ const u8 *src, u8 *iv, unsigned int nblocks))
+{
+ struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+ struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+ struct skcipher_walk walk;
+ unsigned int nbytes;
+ int err;
+
+ err = skcipher_walk_virt(&walk, req, false);
+
+ while ((nbytes = walk.nbytes) > 0) {
+ const u8 *src = walk.src.virt.addr;
+ u8 *dst = walk.dst.virt.addr;
+ unsigned int nblocks;
+
+ nblocks = nbytes / SM4_BLOCK_SIZE;
+ if (nblocks) {
+ kernel_neon_begin();
+
+ sm4_cfb_crypt(ctx->rkey_enc, dst, src,
+ walk.iv, nblocks);
+
+ kernel_neon_end();
+
+ dst += nblocks * SM4_BLOCK_SIZE;
+ src += nblocks * SM4_BLOCK_SIZE;
+ nbytes -= nblocks * SM4_BLOCK_SIZE;
+ }
+
+ /* tail */
+ if (walk.nbytes == walk.total && nbytes > 0) {
+ u8 keystream[SM4_BLOCK_SIZE];
+
+ sm4_ce_crypt_block(ctx->rkey_enc, keystream, walk.iv);
+ crypto_xor_cpy(dst, src, keystream, nbytes);
+ nbytes = 0;
+ }
+
+ err = skcipher_walk_done(&walk, nbytes);
+ }
+
+ return err;
+}
+
+static int cfb_encrypt(struct skcipher_request *req)
+{
+ return cfb_crypt(req, sm4_ce_cfb_enc);
+}
+
+static int cfb_decrypt(struct skcipher_request *req)
+{
+ return cfb_crypt(req, sm4_sve_ce_cfb_dec);
+}
+
+static int ctr_crypt(struct skcipher_request *req)
+{
+ struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+ struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+ struct skcipher_walk walk;
+ unsigned int nbytes;
+ int err;
+
+ err = skcipher_walk_virt(&walk, req, false);
+
+ while ((nbytes = walk.nbytes) > 0) {
+ const u8 *src = walk.src.virt.addr;
+ u8 *dst = walk.dst.virt.addr;
+ unsigned int nblocks;
+
+ nblocks = nbytes / SM4_BLOCK_SIZE;
+ if (nblocks) {
+ kernel_neon_begin();
+
+ sm4_sve_ce_ctr_crypt(ctx->rkey_enc, dst, src,
+ walk.iv, nblocks);
+
+ kernel_neon_end();
+
+ dst += nblocks * SM4_BLOCK_SIZE;
+ src += nblocks * SM4_BLOCK_SIZE;
+ nbytes -= nblocks * SM4_BLOCK_SIZE;
+ }
+
+ /* tail */
+ if (walk.nbytes == walk.total && nbytes > 0) {
+ u8 keystream[SM4_BLOCK_SIZE];
+
+ sm4_ce_crypt_block(ctx->rkey_enc, keystream, walk.iv);
+ crypto_inc(walk.iv, SM4_BLOCK_SIZE);
+ crypto_xor_cpy(dst, src, keystream, nbytes);
+ nbytes = 0;
+ }
+
+ err = skcipher_walk_done(&walk, nbytes);
+ }
+
+ return err;
+}
+
+static struct skcipher_alg sm4_algs[] = {
+ {
+ .base = {
+ .cra_name = "ecb(sm4)",
+ .cra_driver_name = "ecb-sm4-sve-ce",
+ .cra_priority = 500,
+ .cra_blocksize = SM4_BLOCK_SIZE,
+ .cra_ctxsize = sizeof(struct sm4_ctx),
+ .cra_module = THIS_MODULE,
+ },
+ .min_keysize = SM4_KEY_SIZE,
+ .max_keysize = SM4_KEY_SIZE,
+ .setkey = sm4_setkey,
+ .encrypt = ecb_encrypt,
+ .decrypt = ecb_decrypt,
+ }, {
+ .base = {
+ .cra_name = "cbc(sm4)",
+ .cra_driver_name = "cbc-sm4-sve-ce",
+ .cra_priority = 500,
+ .cra_blocksize = SM4_BLOCK_SIZE,
+ .cra_ctxsize = sizeof(struct sm4_ctx),
+ .cra_module = THIS_MODULE,
+ },
+ .min_keysize = SM4_KEY_SIZE,
+ .max_keysize = SM4_KEY_SIZE,
+ .ivsize = SM4_BLOCK_SIZE,
+ .setkey = sm4_setkey,
+ .encrypt = cbc_encrypt,
+ .decrypt = cbc_decrypt,
+ }, {
+ .base = {
+ .cra_name = "cfb(sm4)",
+ .cra_driver_name = "cfb-sm4-sve-ce",
+ .cra_priority = 500,
+ .cra_blocksize = 1,
+ .cra_ctxsize = sizeof(struct sm4_ctx),
+ .cra_module = THIS_MODULE,
+ },
+ .min_keysize = SM4_KEY_SIZE,
+ .max_keysize = SM4_KEY_SIZE,
+ .ivsize = SM4_BLOCK_SIZE,
+ .chunksize = SM4_BLOCK_SIZE,
+ .setkey = sm4_setkey,
+ .encrypt = cfb_encrypt,
+ .decrypt = cfb_decrypt,
+ }, {
+ .base = {
+ .cra_name = "ctr(sm4)",
+ .cra_driver_name = "ctr-sm4-sve-ce",
+ .cra_priority = 500,
+ .cra_blocksize = 1,
+ .cra_ctxsize = sizeof(struct sm4_ctx),
+ .cra_module = THIS_MODULE,
+ },
+ .min_keysize = SM4_KEY_SIZE,
+ .max_keysize = SM4_KEY_SIZE,
+ .ivsize = SM4_BLOCK_SIZE,
+ .chunksize = SM4_BLOCK_SIZE,
+ .setkey = sm4_setkey,
+ .encrypt = ctr_crypt,
+ .decrypt = ctr_crypt,
+ }
+};
+
+static int __init sm4_sve_ce_init(void)
+{
+ if (sm4_sve_get_vl() <= 16)
+ return -ENODEV;
+
+ return crypto_register_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs));
+}
+
+static void __exit sm4_sve_ce_exit(void)
+{
+ crypto_unregister_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs));
+}
+
+module_cpu_feature_match(SVESM4, sm4_sve_ce_init);
+module_exit(sm4_sve_ce_exit);
+
+MODULE_DESCRIPTION("SM4 ECB/CBC/CFB/CTR using ARMv9 Crypto Extensions with SVE2");
+MODULE_ALIAS_CRYPTO("sm4-sve-ce");
+MODULE_ALIAS_CRYPTO("sm4");
+MODULE_ALIAS_CRYPTO("ecb(sm4)");
+MODULE_ALIAS_CRYPTO("cbc(sm4)");
+MODULE_ALIAS_CRYPTO("cfb(sm4)");
+MODULE_ALIAS_CRYPTO("ctr(sm4)");
+MODULE_AUTHOR("Tianjia Zhang <tianjia.zhang@linux.alibaba.com>");
+MODULE_LICENSE("GPL v2");
--
2.24.3 (Apple Git-128)
* Re: [PATCH 16/16] crypto: arm64/sm4 - add ARMv9 SVE cryptography acceleration implementation
2022-09-26 9:36 ` Tianjia Zhang
@ 2022-09-26 10:02 ` Ard Biesheuvel
-1 siblings, 0 replies; 42+ messages in thread
From: Ard Biesheuvel @ 2022-09-26 10:02 UTC (permalink / raw)
To: Tianjia Zhang, Mark Brown
Cc: Herbert Xu, David S. Miller, Jussi Kivilinna, Catalin Marinas,
Will Deacon, Maxime Coquelin, Alexandre Torgue, Eric Biggers,
linux-crypto, linux-arm-kernel, linux-kernel, linux-stm32
(cc Mark Brown)
Hello Tianjia,
On Mon, 26 Sept 2022 at 11:37, Tianjia Zhang
<tianjia.zhang@linux.alibaba.com> wrote:
>
> Scalable Vector Extension (SVE) is the next-generation SIMD extension for
> arm64. SVE allows flexible vector length implementations with a range of
> possible values in CPU implementations. The vector length can vary from a
> minimum of 128 bits up to a maximum of 2048 bits, at 128-bit increments.
> The SVE design guarantees that the same application can run on different
> implementations that support SVE, without the need to recompile the code.
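To make the effect of a variable vector length concrete, here is a rough sketch in C of how the loops in sm4-sve-ce-core.S appear to split a request into 8-vector, 4-vector and 1-vector passes plus a 128-bit Crypto Extension tail. The struct and function names are invented for this note and are not part of the patch; vl_bytes stands for the value that rdvl returns.

```c
#include <assert.h>

/* Hypothetical helper, not part of the patch. */
struct sm4_sve_split {
	unsigned int n8;   /* passes that process 8 vectors */
	unsigned int n4;   /* passes that process 4 vectors (0 or 1) */
	unsigned int n1;   /* passes that process 1 vector */
	unsigned int tail; /* blocks left for the 128-bit CE loop */
};

static struct sm4_sve_split sm4_sve_split_blocks(unsigned int nblocks,
						 unsigned int vl_bytes)
{
	struct sm4_sve_split s = { 0, 0, 0, 0 };
	unsigned int per_vec;

	/* SVE vector lengths: 128..2048 bits in 128-bit steps. */
	assert(vl_bytes >= 16 && vl_bytes <= 256 && vl_bytes % 16 == 0);

	per_vec = vl_bytes / 16;		/* blocks per vector: x5, LSR #4 */

	while (nblocks >= 8 * per_vec) {	/* threshold: x5, LSR #1 */
		s.n8++;
		nblocks -= 8 * per_vec;
	}
	if (nblocks >= 4 * per_vec) {		/* threshold: x5, LSR #2 */
		s.n4++;
		nblocks -= 4 * per_vec;
	}
	while (nblocks >= per_vec) {
		s.n1++;
		nblocks -= per_vec;
	}
	s.tail = nblocks;

	return s;
}
```

With vl_bytes == 16 every vector holds exactly one SM4 block, so each SVE pass covers the same number of blocks as the equivalent NEON/CE pass.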
>
> SVE was originally introduced by ARMv8, and ARMv9 introduced SVE2 to
> expand and improve it. Similar to the Crypto Extensions available
> alongside NEON, SVE also supports such instructions, called
> cryptography acceleration instructions, and these are likewise an
> optional instruction set.
>
> This patch uses SM4 cryptography acceleration instructions and SVE2
> instructions to optimize the SM4 algorithm for ECB/CBC/CFB/CTR modes.
> Since the encryption of CBC/CFB cannot be parallelized, the Crypto
> Extension instruction is used.
>
Given that we currently do not support the use of SVE in kernel mode,
this patch cannot be accepted at this time (but the rest of the series
looks reasonable to me, although I have only skimmed over the patches).
In view of the disappointing benchmark results below, I don't think
this is worth the hassle at the moment. If we can find a case where
using SVE in kernel mode truly makes a [favorable] difference, we can
revisit this, but not without a thorough analysis of the impact it
will have to support SVE in the kernel. Also, the fact that SVE may
also cover cryptographic extensions does not necessarily imply that a
micro-architecture will perform those crypto transformations in
parallel and so the performance may be the same even if VL > 128.
In summary, please drop this patch for now, and once there are more
encouraging performance numbers, please resubmit it as part of a
series that explicitly enables SVE in kernel mode on arm64, and
documents the requirements and constraints.
I have cc'ed Mark, who has been working on the SVE support and might
have something to add here as well.
Thanks,
Ard.
> Since no test environment with a Vector Length (VL) greater than 128 bits
> was found, the performance data was obtained on a machine with a VL of
> 128 bits. Because this driver is only enabled when the VL is greater than
> 128 bits, these numbers are for reference only. It can be seen from the
> data that there is little difference between the data optimized by Crypto
> Extension and SVE (VL=128 bits), and the optimization effect will be more
> obvious when VL=256 bits or longer.
>
> Benchmark on T-Head Yitian-710 2.75 GHz. The data comes from mode 218
> of tcrypt and is compared with the Crypto Extension implementation. The
> columns are block sizes in bytes. The data is tabulated and the
> unit is Mb/s:
>
> sm4-ce | 16 64 128 256 1024 1420 4096
> ------------+--------------------------------------------------------------
> ECB enc | 315.18 1162.65 1815.66 2553.50 3692.91 3727.20 4001.93
> ECB dec | 316.06 1172.97 1817.81 2554.66 3692.18 3786.54 4001.93
> CBC enc | 304.82 629.54 768.65 864.72 953.90 963.32 974.06
> CBC dec | 306.05 1142.53 1805.11 2481.67 3522.06 3587.87 3790.99
> CFB enc | 309.48 635.70 774.44 865.85 950.62 952.68 968.24
> CFB dec | 315.98 1170.38 1828.75 2509.72 3543.63 3539.40 3793.25
> CTR enc | 285.83 1036.59 1583.50 2147.26 2933.54 2954.66 3041.14
> CTR dec | 285.29 1037.47 1584.67 2145.51 2934.10 2950.89 3041.62
>
> sm4-sve-ce (VL = 128 bits)
> ECB enc | 310.00 1154.70 1813.26 2579.74 3766.90 3869.45 4100.26
> ECB dec | 315.60 1176.22 1838.06 2593.69 3774.95 3878.42 4098.83
> CBC enc | 303.44 622.65 764.67 861.40 953.18 963.05 973.77
> CBC dec | 302.13 1091.15 1689.10 2267.79 3182.84 3242.68 3408.92
> CFB enc | 296.62 620.41 762.94 858.96 948.18 956.04 967.67
> CFB dec | 291.23 1065.50 1637.33 2228.12 3158.52 3213.35 3403.83
> CTR enc | 272.27 959.35 1466.34 1934.24 2562.80 2595.87 2695.15
> CTR dec | 273.40 963.65 1471.83 1938.97 2563.12 2597.25 2694.54
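To put numbers on the comparison, the relative change in the 4096-byte column can be computed from the two tables. A minimal sketch in C; the helper and array names are made up for this note, and the figures are copied from the tables above.

```c
#include <assert.h>

/* Relative change of candidate against base, in percent. */
static double rel_change_percent(double base, double candidate)
{
	assert(base > 0.0);
	return (candidate - base) / base * 100.0;
}

/* 4096-byte column: sm4-ce first, sm4-sve-ce (VL = 128 bits) second. */
static const double ecb_enc_4096[2] = { 4001.93, 4100.26 };
static const double cbc_dec_4096[2] = { 3790.99, 3408.92 };
static const double ctr_enc_4096[2] = { 3041.14, 2695.15 };
```

By this measure ECB encryption gains about 2.5%, while CBC decryption and CTR encryption lose about 10% and 11% respectively.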
>
> Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
> ---
> arch/arm64/crypto/Kconfig | 19 +
> arch/arm64/crypto/Makefile | 3 +
> arch/arm64/crypto/sm4-sve-ce-core.S | 1028 +++++++++++++++++++++++++++
> arch/arm64/crypto/sm4-sve-ce-glue.c | 332 +++++++++
> 4 files changed, 1382 insertions(+)
> create mode 100644 arch/arm64/crypto/sm4-sve-ce-core.S
> create mode 100644 arch/arm64/crypto/sm4-sve-ce-glue.c
>
> diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
> index 6793d5bc3ee5..bbb5a7a08af5 100644
> --- a/arch/arm64/crypto/Kconfig
> +++ b/arch/arm64/crypto/Kconfig
> @@ -249,6 +249,25 @@ config CRYPTO_SM4_ARM64_CE_BLK
> - ARMv8 Crypto Extensions
> - NEON (Advanced SIMD) extensions
>
> +config CRYPTO_SM4_ARM64_SVE_CE_BLK
> + tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR (ARMv9 cryptography acceleration with SVE2)"
> + depends on KERNEL_MODE_NEON
> + select CRYPTO_SKCIPHER
> + select CRYPTO_SM4
> + select CRYPTO_SM4_ARM64_CE_BLK
> + help
> + Length-preserving ciphers: SM4 cipher algorithms (OSCCA GB/T 32907-2016)
> + with block cipher modes:
> + - ECB (Electronic Codebook) mode (NIST SP800-38A)
> + - CBC (Cipher Block Chaining) mode (NIST SP800-38A)
> + - CFB (Cipher Feedback) mode (NIST SP800-38A)
> + - CTR (Counter) mode (NIST SP800-38A)
> +
> + Architecture: arm64 using:
> + - ARMv8 Crypto Extensions
> + - ARMv9 cryptography acceleration with SVE2
> + - NEON (Advanced SIMD) extensions
> +
> config CRYPTO_SM4_ARM64_NEON_BLK
> tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR (NEON)"
> depends on KERNEL_MODE_NEON
> diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
> index 4818e204c2ac..355dd9053434 100644
> --- a/arch/arm64/crypto/Makefile
> +++ b/arch/arm64/crypto/Makefile
> @@ -38,6 +38,9 @@ sm4-ce-gcm-y := sm4-ce-gcm-glue.o sm4-ce-gcm-core.o
> obj-$(CONFIG_CRYPTO_SM4_ARM64_NEON_BLK) += sm4-neon.o
> sm4-neon-y := sm4-neon-glue.o sm4-neon-core.o
>
> +obj-$(CONFIG_CRYPTO_SM4_ARM64_SVE_CE_BLK) += sm4-sve-ce.o
> +sm4-sve-ce-y := sm4-sve-ce-glue.o sm4-sve-ce-core.o
> +
> obj-$(CONFIG_CRYPTO_GHASH_ARM64_CE) += ghash-ce.o
> ghash-ce-y := ghash-ce-glue.o ghash-ce-core.o
>
> diff --git a/arch/arm64/crypto/sm4-sve-ce-core.S b/arch/arm64/crypto/sm4-sve-ce-core.S
> new file mode 100644
> index 000000000000..caecbdf2536c
> --- /dev/null
> +++ b/arch/arm64/crypto/sm4-sve-ce-core.S
> @@ -0,0 +1,1028 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * SM4 Cipher Algorithm for ARMv9 Crypto Extensions with SVE2
> + * as specified in
> + * https://tools.ietf.org/id/draft-ribose-cfrg-sm4-10.html
> + *
> + * Copyright (C) 2022, Alibaba Group.
> + * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
> + */
> +
> +#include <linux/linkage.h>
> +#include <asm/assembler.h>
> +
> +.arch armv8-a+crypto+sve+sve2
> +
> +.irp b, 0, 15, 24, 25, 26, 27, 28, 29, 30, 31
> + .set .Lv\b\().4s, \b
> +.endr
> +
> +.irp b, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, \
> + 16, 24, 25, 26, 27, 28, 29, 30, 31
> + .set .Lz\b\().s, \b
> +.endr
> +
> +.macro sm4e, vd, vn
> + .inst 0xcec08400 | (.L\vn << 5) | .L\vd
> +.endm
> +
> +.macro sm4e_sve, zd, zn
> + .inst 0x4523e000 | (.L\zn << 5) | .L\zd
> +.endm
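The two macros above build each instruction word by OR-ing the register numbers into a fixed opcode, presumably so that the file assembles even when the toolchain does not know the mnemonics. The same arithmetic in C, as a sketch: the function names are invented here, and the base opcodes are simply the constants taken from the macros.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors ".inst 0xcec08400 | (.L\vn << 5) | .L\vd" from the sm4e macro. */
static uint32_t sm4e_encode(unsigned int vd, unsigned int vn)
{
	assert(vd < 32 && vn < 32);
	return 0xcec08400u | ((uint32_t)vn << 5) | vd;
}

/* Mirrors ".inst 0x4523e000 | (.L\zn << 5) | .L\zd" from the sm4e_sve macro. */
static uint32_t sm4e_sve_encode(unsigned int zd, unsigned int zn)
{
	assert(zd < 32 && zn < 32);
	return 0x4523e000u | ((uint32_t)zn << 5) | zd;
}
```

For example, the first round of SM4_SVE_CE_CRYPT_BLK(z0) uses z24 as the key register, which this arithmetic turns into the word 0x4523e300.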
> +
> +
> +/* Register macros */
> +
> +#define RCTR z16
> +#define RCTRv v16
> +#define RIV z16
> +#define RIVv v16
> +#define RSWAP128 z17
> +#define RZERO z18
> +#define RLE128_INC z19
> +
> +#define RTMP0 z20
> +#define RTMP0v v20
> +#define RTMP1 z21
> +#define RTMP2 z22
> +#define RTMP3 z23
> +
> +
> +/* Helper macros. */
> +
> +#define SM4_PREPARE(ptr) \
> + adr_l x7, .Lbswap128_mask; \
> + ptrue p0.b, ALL; \
> + rdvl x5, #1; \
> + ld1b {RSWAP128.b}, p0/z, [x7]; \
> + \
> + ld1 {v24.16b-v27.16b}, [ptr], #64; \
> + ld1 {v28.16b-v31.16b}, [ptr]; \
> + dup z24.q, z24.q[0]; \
> + dup z25.q, z25.q[0]; \
> + dup z26.q, z26.q[0]; \
> + dup z27.q, z27.q[0]; \
> + dup z28.q, z28.q[0]; \
> + dup z29.q, z29.q[0]; \
> + dup z30.q, z30.q[0]; \
> + dup z31.q, z31.q[0];
> +
> +#define SM4_SVE_CE_CRYPT_BLK(b0) \
> + revb b0.s, p0/m, b0.s; \
> + sm4e_sve b0.s, z24.s; \
> + sm4e_sve b0.s, z25.s; \
> + sm4e_sve b0.s, z26.s; \
> + sm4e_sve b0.s, z27.s; \
> + sm4e_sve b0.s, z28.s; \
> + sm4e_sve b0.s, z29.s; \
> + sm4e_sve b0.s, z30.s; \
> + sm4e_sve b0.s, z31.s; \
> + tbl b0.b, {b0.b}, RSWAP128.b; \
> + revb b0.s, p0/m, b0.s;
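The closing tbl/revb pair in this macro is easier to follow on a single 128-bit lane: the table lookup reverses the order of the four 32-bit words, and revb then reverses the bytes inside each word, so together they reverse all 16 bytes of the lane. A sketch in C under that reading; the function names are invented, and the mask is the first 16 bytes of .Lbswap128_mask (the remaining entries repeat the pattern for further lanes).

```c
#include <assert.h>
#include <stdint.h>

/* First lane of .Lbswap128_mask from the patch. */
static const uint8_t bswap128_mask[16] = {
	0x0c, 0x0d, 0x0e, 0x0f, 0x08, 0x09, 0x0a, 0x0b,
	0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03,
};

/* Models "tbl b0.b, {b0.b}, RSWAP128.b" on one lane: out[i] = in[mask[i]]. */
static void sm4_lane_tbl(uint8_t out[16], const uint8_t in[16])
{
	assert(out != in);	/* this model needs distinct buffers */
	for (int i = 0; i < 16; i++)
		out[i] = in[bswap128_mask[i]];
}

/* Models "revb b0.s, p0/m, b0.s": byte-reverse each 32-bit element. */
static void sm4_lane_revb32(uint8_t buf[16])
{
	for (int w = 0; w < 16; w += 4) {
		uint8_t t0 = buf[w], t1 = buf[w + 1];

		buf[w] = buf[w + 3];
		buf[w + 1] = buf[w + 2];
		buf[w + 2] = t1;
		buf[w + 3] = t0;
	}
}

/* tbl followed by revb, as at the end of SM4_SVE_CE_CRYPT_BLK. */
static void sm4_lane_finish(uint8_t out[16], const uint8_t in[16])
{
	sm4_lane_tbl(out, in);
	sm4_lane_revb32(out);
}
```

Feeding the bytes 0x00..0x0f through sm4_lane_finish() yields 0x0f..0x00.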
> +
> +#define SM4_SVE_CE_CRYPT_BLK4(b0, b1, b2, b3) \
> + revb b0.s, p0/m, b0.s; \
> + revb b1.s, p0/m, b1.s; \
> + revb b2.s, p0/m, b2.s; \
> + revb b3.s, p0/m, b3.s; \
> + sm4e_sve b0.s, z24.s; \
> + sm4e_sve b1.s, z24.s; \
> + sm4e_sve b2.s, z24.s; \
> + sm4e_sve b3.s, z24.s; \
> + sm4e_sve b0.s, z25.s; \
> + sm4e_sve b1.s, z25.s; \
> + sm4e_sve b2.s, z25.s; \
> + sm4e_sve b3.s, z25.s; \
> + sm4e_sve b0.s, z26.s; \
> + sm4e_sve b1.s, z26.s; \
> + sm4e_sve b2.s, z26.s; \
> + sm4e_sve b3.s, z26.s; \
> + sm4e_sve b0.s, z27.s; \
> + sm4e_sve b1.s, z27.s; \
> + sm4e_sve b2.s, z27.s; \
> + sm4e_sve b3.s, z27.s; \
> + sm4e_sve b0.s, z28.s; \
> + sm4e_sve b1.s, z28.s; \
> + sm4e_sve b2.s, z28.s; \
> + sm4e_sve b3.s, z28.s; \
> + sm4e_sve b0.s, z29.s; \
> + sm4e_sve b1.s, z29.s; \
> + sm4e_sve b2.s, z29.s; \
> + sm4e_sve b3.s, z29.s; \
> + sm4e_sve b0.s, z30.s; \
> + sm4e_sve b1.s, z30.s; \
> + sm4e_sve b2.s, z30.s; \
> + sm4e_sve b3.s, z30.s; \
> + sm4e_sve b0.s, z31.s; \
> + sm4e_sve b1.s, z31.s; \
> + sm4e_sve b2.s, z31.s; \
> + sm4e_sve b3.s, z31.s; \
> + tbl b0.b, {b0.b}, RSWAP128.b; \
> + tbl b1.b, {b1.b}, RSWAP128.b; \
> + tbl b2.b, {b2.b}, RSWAP128.b; \
> + tbl b3.b, {b3.b}, RSWAP128.b; \
> + revb b0.s, p0/m, b0.s; \
> + revb b1.s, p0/m, b1.s; \
> + revb b2.s, p0/m, b2.s; \
> + revb b3.s, p0/m, b3.s;
> +
> +#define SM4_SVE_CE_CRYPT_BLK8(b0, b1, b2, b3, b4, b5, b6, b7) \
> + revb b0.s, p0/m, b0.s; \
> + revb b1.s, p0/m, b1.s; \
> + revb b2.s, p0/m, b2.s; \
> + revb b3.s, p0/m, b3.s; \
> + revb b4.s, p0/m, b4.s; \
> + revb b5.s, p0/m, b5.s; \
> + revb b6.s, p0/m, b6.s; \
> + revb b7.s, p0/m, b7.s; \
> + sm4e_sve b0.s, z24.s; \
> + sm4e_sve b1.s, z24.s; \
> + sm4e_sve b2.s, z24.s; \
> + sm4e_sve b3.s, z24.s; \
> + sm4e_sve b4.s, z24.s; \
> + sm4e_sve b5.s, z24.s; \
> + sm4e_sve b6.s, z24.s; \
> + sm4e_sve b7.s, z24.s; \
> + sm4e_sve b0.s, z25.s; \
> + sm4e_sve b1.s, z25.s; \
> + sm4e_sve b2.s, z25.s; \
> + sm4e_sve b3.s, z25.s; \
> + sm4e_sve b4.s, z25.s; \
> + sm4e_sve b5.s, z25.s; \
> + sm4e_sve b6.s, z25.s; \
> + sm4e_sve b7.s, z25.s; \
> + sm4e_sve b0.s, z26.s; \
> + sm4e_sve b1.s, z26.s; \
> + sm4e_sve b2.s, z26.s; \
> + sm4e_sve b3.s, z26.s; \
> + sm4e_sve b4.s, z26.s; \
> + sm4e_sve b5.s, z26.s; \
> + sm4e_sve b6.s, z26.s; \
> + sm4e_sve b7.s, z26.s; \
> + sm4e_sve b0.s, z27.s; \
> + sm4e_sve b1.s, z27.s; \
> + sm4e_sve b2.s, z27.s; \
> + sm4e_sve b3.s, z27.s; \
> + sm4e_sve b4.s, z27.s; \
> + sm4e_sve b5.s, z27.s; \
> + sm4e_sve b6.s, z27.s; \
> + sm4e_sve b7.s, z27.s; \
> + sm4e_sve b0.s, z28.s; \
> + sm4e_sve b1.s, z28.s; \
> + sm4e_sve b2.s, z28.s; \
> + sm4e_sve b3.s, z28.s; \
> + sm4e_sve b4.s, z28.s; \
> + sm4e_sve b5.s, z28.s; \
> + sm4e_sve b6.s, z28.s; \
> + sm4e_sve b7.s, z28.s; \
> + sm4e_sve b0.s, z29.s; \
> + sm4e_sve b1.s, z29.s; \
> + sm4e_sve b2.s, z29.s; \
> + sm4e_sve b3.s, z29.s; \
> + sm4e_sve b4.s, z29.s; \
> + sm4e_sve b5.s, z29.s; \
> + sm4e_sve b6.s, z29.s; \
> + sm4e_sve b7.s, z29.s; \
> + sm4e_sve b0.s, z30.s; \
> + sm4e_sve b1.s, z30.s; \
> + sm4e_sve b2.s, z30.s; \
> + sm4e_sve b3.s, z30.s; \
> + sm4e_sve b4.s, z30.s; \
> + sm4e_sve b5.s, z30.s; \
> + sm4e_sve b6.s, z30.s; \
> + sm4e_sve b7.s, z30.s; \
> + sm4e_sve b0.s, z31.s; \
> + sm4e_sve b1.s, z31.s; \
> + sm4e_sve b2.s, z31.s; \
> + sm4e_sve b3.s, z31.s; \
> + sm4e_sve b4.s, z31.s; \
> + sm4e_sve b5.s, z31.s; \
> + sm4e_sve b6.s, z31.s; \
> + sm4e_sve b7.s, z31.s; \
> + tbl b0.b, {b0.b}, RSWAP128.b; \
> + tbl b1.b, {b1.b}, RSWAP128.b; \
> + tbl b2.b, {b2.b}, RSWAP128.b; \
> + tbl b3.b, {b3.b}, RSWAP128.b; \
> + tbl b4.b, {b4.b}, RSWAP128.b; \
> + tbl b5.b, {b5.b}, RSWAP128.b; \
> + tbl b6.b, {b6.b}, RSWAP128.b; \
> + tbl b7.b, {b7.b}, RSWAP128.b; \
> + revb b0.s, p0/m, b0.s; \
> + revb b1.s, p0/m, b1.s; \
> + revb b2.s, p0/m, b2.s; \
> + revb b3.s, p0/m, b3.s; \
> + revb b4.s, p0/m, b4.s; \
> + revb b5.s, p0/m, b5.s; \
> + revb b6.s, p0/m, b6.s; \
> + revb b7.s, p0/m, b7.s;
> +
> +#define SM4_CE_CRYPT_BLK(b0) \
> + rev32 b0.16b, b0.16b; \
> + sm4e b0.4s, v24.4s; \
> + sm4e b0.4s, v25.4s; \
> + sm4e b0.4s, v26.4s; \
> + sm4e b0.4s, v27.4s; \
> + sm4e b0.4s, v28.4s; \
> + sm4e b0.4s, v29.4s; \
> + sm4e b0.4s, v30.4s; \
> + sm4e b0.4s, v31.4s; \
> + rev64 b0.4s, b0.4s; \
> + ext b0.16b, b0.16b, b0.16b, #8; \
> + rev32 b0.16b, b0.16b;
> +
> +#define inc_le128(zctr) \
> + mov RCTRv.d[1], x8; \
> + mov RCTRv.d[0], x7; \
> + mov zctr.d, RLE128_INC.d; \
> + dup RCTR.q, RCTR.q[0]; \
> + adds x8, x8, x5, LSR #4; \
> + adclt zctr.d, RCTR.d, RZERO.d; \
> + adclt RCTR.d, zctr.d, RZERO.d; \
> + adc x7, x7, xzr; \
> + trn1 zctr.d, RCTR.d, zctr.d; \
> + revb zctr.d, p0/m, zctr.d;
> +
> +#define inc_le128_4x(zctr0, zctr1, zctr2, zctr3) \
> + mov v8.d[1], x8; \
> + mov v8.d[0], x7; \
> + adds x8, x8, x5, LSR #4; \
> + mov zctr0.d, RLE128_INC.d; \
> + adc x7, x7, xzr; \
> + mov v9.d[1], x8; \
> + mov v9.d[0], x7; \
> + adds x8, x8, x5, LSR #4; \
> + mov zctr1.d, RLE128_INC.d; \
> + adc x7, x7, xzr; \
> + mov v10.d[1], x8; \
> + mov v10.d[0], x7; \
> + adds x8, x8, x5, LSR #4; \
> + mov zctr2.d, RLE128_INC.d; \
> + adc x7, x7, xzr; \
> + mov v11.d[1], x8; \
> + mov v11.d[0], x7; \
> + adds x8, x8, x5, LSR #4; \
> + mov zctr3.d, RLE128_INC.d; \
> + adc x7, x7, xzr; \
> + dup z8.q, z8.q[0]; \
> + dup z9.q, z9.q[0]; \
> + dup z10.q, z10.q[0]; \
> + dup z11.q, z11.q[0]; \
> + adclt zctr0.d, z8.d, RZERO.d; \
> + adclt zctr1.d, z9.d, RZERO.d; \
> + adclt zctr2.d, z10.d, RZERO.d; \
> + adclt zctr3.d, z11.d, RZERO.d; \
> + adclt z8.d, zctr0.d, RZERO.d; \
> + adclt z9.d, zctr1.d, RZERO.d; \
> + adclt z10.d, zctr2.d, RZERO.d; \
> + adclt z11.d, zctr3.d, RZERO.d; \
> + trn1 zctr0.d, z8.d, zctr0.d; \
> + trn1 zctr1.d, z9.d, zctr1.d; \
> + trn1 zctr2.d, z10.d, zctr2.d; \
> + trn1 zctr3.d, z11.d, zctr3.d; \
> + revb zctr0.d, p0/m, zctr0.d; \
> + revb zctr1.d, p0/m, zctr1.d; \
> + revb zctr2.d, p0/m, zctr2.d; \
> + revb zctr3.d, p0/m, zctr3.d;
> +
> +#define inc_le128_8x(zctr0, zctr1, zctr2, zctr3, \
> + zctr4, zctr5, zctr6, zctr7) \
> + mov v8.d[1], x8; \
> + mov v8.d[0], x7; \
> + adds x8, x8, x5, LSR #4; \
> + mov zctr0.d, RLE128_INC.d; \
> + adc x7, x7, xzr; \
> + mov v9.d[1], x8; \
> + mov v9.d[0], x7; \
> + adds x8, x8, x5, LSR #4; \
> + mov zctr1.d, RLE128_INC.d; \
> + adc x7, x7, xzr; \
> + mov v10.d[1], x8; \
> + mov v10.d[0], x7; \
> + adds x8, x8, x5, LSR #4; \
> + mov zctr2.d, RLE128_INC.d; \
> + adc x7, x7, xzr; \
> + mov v11.d[1], x8; \
> + mov v11.d[0], x7; \
> + adds x8, x8, x5, LSR #4; \
> + mov zctr3.d, RLE128_INC.d; \
> + adc x7, x7, xzr; \
> + mov v12.d[1], x8; \
> + mov v12.d[0], x7; \
> + adds x8, x8, x5, LSR #4; \
> + mov zctr4.d, RLE128_INC.d; \
> + adc x7, x7, xzr; \
> + mov v13.d[1], x8; \
> + mov v13.d[0], x7; \
> + adds x8, x8, x5, LSR #4; \
> + mov zctr5.d, RLE128_INC.d; \
> + adc x7, x7, xzr; \
> + mov v14.d[1], x8; \
> + mov v14.d[0], x7; \
> + adds x8, x8, x5, LSR #4; \
> + mov zctr6.d, RLE128_INC.d; \
> + adc x7, x7, xzr; \
> + mov v15.d[1], x8; \
> + mov v15.d[0], x7; \
> + adds x8, x8, x5, LSR #4; \
> + mov zctr7.d, RLE128_INC.d; \
> + adc x7, x7, xzr; \
> + dup z8.q, z8.q[0]; \
> + dup z9.q, z9.q[0]; \
> + dup z10.q, z10.q[0]; \
> + dup z11.q, z11.q[0]; \
> + dup z12.q, z12.q[0]; \
> + dup z13.q, z13.q[0]; \
> + dup z14.q, z14.q[0]; \
> + dup z15.q, z15.q[0]; \
> + adclt zctr0.d, z8.d, RZERO.d; \
> + adclt zctr1.d, z9.d, RZERO.d; \
> + adclt zctr2.d, z10.d, RZERO.d; \
> + adclt zctr3.d, z11.d, RZERO.d; \
> + adclt zctr4.d, z12.d, RZERO.d; \
> + adclt zctr5.d, z13.d, RZERO.d; \
> + adclt zctr6.d, z14.d, RZERO.d; \
> + adclt zctr7.d, z15.d, RZERO.d; \
> + adclt z8.d, zctr0.d, RZERO.d; \
> + adclt z9.d, zctr1.d, RZERO.d; \
> + adclt z10.d, zctr2.d, RZERO.d; \
> + adclt z11.d, zctr3.d, RZERO.d; \
> + adclt z12.d, zctr4.d, RZERO.d; \
> + adclt z13.d, zctr5.d, RZERO.d; \
> + adclt z14.d, zctr6.d, RZERO.d; \
> + adclt z15.d, zctr7.d, RZERO.d; \
> + trn1 zctr0.d, z8.d, zctr0.d; \
> + trn1 zctr1.d, z9.d, zctr1.d; \
> + trn1 zctr2.d, z10.d, zctr2.d; \
> + trn1 zctr3.d, z11.d, zctr3.d; \
> + trn1 zctr4.d, z12.d, zctr4.d; \
> + trn1 zctr5.d, z13.d, zctr5.d; \
> + trn1 zctr6.d, z14.d, zctr6.d; \
> + trn1 zctr7.d, z15.d, zctr7.d; \
> + revb zctr0.d, p0/m, zctr0.d; \
> + revb zctr1.d, p0/m, zctr1.d; \
> + revb zctr2.d, p0/m, zctr2.d; \
> + revb zctr3.d, p0/m, zctr3.d; \
> + revb zctr4.d, p0/m, zctr4.d; \
> + revb zctr5.d, p0/m, zctr5.d; \
> + revb zctr6.d, p0/m, zctr6.d; \
> + revb zctr7.d, p0/m, zctr7.d;
> +
> +
> +.align 3
> +SYM_FUNC_START(sm4_sve_ce_crypt)
> + /* input:
> + * x0: round key array, CTX
> + * x1: dst
> + * x2: src
> + * w3: nblocks
> + */
> + uxtw x3, w3
> + SM4_PREPARE(x0)
> +
> +.Lcrypt_loop_8x:
> + sub x3, x3, x5, LSR #1 /* x3 - (8 * VL) */
> + tbnz x3, #63, .Lcrypt_4x
> +
> + ld1b {z0.b}, p0/z, [x2]
> + ld1b {z1.b}, p0/z, [x2, #1, MUL VL]
> + ld1b {z2.b}, p0/z, [x2, #2, MUL VL]
> + ld1b {z3.b}, p0/z, [x2, #3, MUL VL]
> + ld1b {z4.b}, p0/z, [x2, #4, MUL VL]
> + ld1b {z5.b}, p0/z, [x2, #5, MUL VL]
> + ld1b {z6.b}, p0/z, [x2, #6, MUL VL]
> + ld1b {z7.b}, p0/z, [x2, #7, MUL VL]
> +
> + SM4_SVE_CE_CRYPT_BLK8(z0, z1, z2, z3, z4, z5, z6, z7)
> +
> + st1b {z0.b}, p0, [x1]
> + st1b {z1.b}, p0, [x1, #1, MUL VL]
> + st1b {z2.b}, p0, [x1, #2, MUL VL]
> + st1b {z3.b}, p0, [x1, #3, MUL VL]
> + st1b {z4.b}, p0, [x1, #4, MUL VL]
> + st1b {z5.b}, p0, [x1, #5, MUL VL]
> + st1b {z6.b}, p0, [x1, #6, MUL VL]
> + st1b {z7.b}, p0, [x1, #7, MUL VL]
> +
> + addvl x2, x2, #8
> + addvl x1, x1, #8
> +
> + cbz x3, .Lcrypt_end
> + b .Lcrypt_loop_8x
> +
> +.Lcrypt_4x:
> + add x3, x3, x5, LSR #1
> + cmp x3, x5, LSR #2
> + blt .Lcrypt_loop_1x
> +
> + sub x3, x3, x5, LSR #2 /* x3 - (4 * VL) */
> +
> + ld1b {z0.b}, p0/z, [x2]
> + ld1b {z1.b}, p0/z, [x2, #1, MUL VL]
> + ld1b {z2.b}, p0/z, [x2, #2, MUL VL]
> + ld1b {z3.b}, p0/z, [x2, #3, MUL VL]
> +
> + SM4_SVE_CE_CRYPT_BLK4(z0, z1, z2, z3)
> +
> + st1b {z0.b}, p0, [x1]
> + st1b {z1.b}, p0, [x1, #1, MUL VL]
> + st1b {z2.b}, p0, [x1, #2, MUL VL]
> + st1b {z3.b}, p0, [x1, #3, MUL VL]
> +
> + addvl x2, x2, #4
> + addvl x1, x1, #4
> +
> + cbz x3, .Lcrypt_end
> +
> +.Lcrypt_loop_1x:
> + cmp x3, x5, LSR #4
> + blt .Lcrypt_ce_loop_1x
> +
> + sub x3, x3, x5, LSR #4 /* x3 - VL */
> +
> + ld1b {z0.b}, p0/z, [x2]
> +
> + SM4_SVE_CE_CRYPT_BLK(z0)
> +
> + st1b {z0.b}, p0, [x1]
> +
> + addvl x2, x2, #1
> + addvl x1, x1, #1
> +
> + cbz x3, .Lcrypt_end
> + b .Lcrypt_loop_1x
> +
> +.Lcrypt_ce_loop_1x:
> + sub x3, x3, #1
> +
> + ld1 {v0.16b}, [x2], #16
> + SM4_CE_CRYPT_BLK(v0)
> + st1 {v0.16b}, [x1], #16
> +
> + cbnz x3, .Lcrypt_ce_loop_1x
> +
> +.Lcrypt_end:
> + ret
> +SYM_FUNC_END(sm4_sve_ce_crypt)
> +
> +.align 3
> +SYM_FUNC_START(sm4_sve_ce_cbc_dec)
> + /* input:
> + * x0: round key array, CTX
> + * x1: dst
> + * x2: src
> + * x3: iv (big endian, 128 bit)
> + * w4: nblocks
> + */
> + uxtw x4, w4
> + SM4_PREPARE(x0)
> +
> + ld1 {RIVv.16b}, [x3]
> + ext RIV.b, RIV.b, RIV.b, #16
> +
> +.Lcbc_dec_loop_8x:
> + sub x4, x4, x5, LSR #1 /* x4 - (8 * VL) */
> + tbnz x4, #63, .Lcbc_dec_4x
> +
> + ld1b {z15.b}, p0/z, [x2]
> + ld1b {z14.b}, p0/z, [x2, #1, MUL VL]
> + ld1b {z13.b}, p0/z, [x2, #2, MUL VL]
> + ld1b {z12.b}, p0/z, [x2, #3, MUL VL]
> + ld1b {z11.b}, p0/z, [x2, #4, MUL VL]
> + ld1b {z10.b}, p0/z, [x2, #5, MUL VL]
> + ld1b {z9.b}, p0/z, [x2, #6, MUL VL]
> + ld1b {z8.b}, p0/z, [x2, #7, MUL VL]
> + rev z0.b, z15.b
> + rev z1.b, z14.b
> + rev z2.b, z13.b
> + rev z3.b, z12.b
> + rev z4.b, z11.b
> + rev z5.b, z10.b
> + rev z6.b, z9.b
> + rev z7.b, z8.b
> + rev RTMP0.b, RIV.b
> + ext z7.b, z7.b, z6.b, #16
> + ext z6.b, z6.b, z5.b, #16
> + ext z5.b, z5.b, z4.b, #16
> + ext z4.b, z4.b, z3.b, #16
> + ext z3.b, z3.b, z2.b, #16
> + ext z2.b, z2.b, z1.b, #16
> + ext z1.b, z1.b, z0.b, #16
> + ext z0.b, z0.b, RTMP0.b, #16
> + rev z7.b, z7.b
> + rev z6.b, z6.b
> + rev z5.b, z5.b
> + rev z4.b, z4.b
> + rev z3.b, z3.b
> + rev z2.b, z2.b
> + rev z1.b, z1.b
> + rev z0.b, z0.b
> + mov RIV.d, z8.d
> +
> + SM4_SVE_CE_CRYPT_BLK8(z15, z14, z13, z12, z11, z10, z9, z8)
> +
> + eor z0.d, z0.d, z15.d
> + eor z1.d, z1.d, z14.d
> + eor z2.d, z2.d, z13.d
> + eor z3.d, z3.d, z12.d
> + eor z4.d, z4.d, z11.d
> + eor z5.d, z5.d, z10.d
> + eor z6.d, z6.d, z9.d
> + eor z7.d, z7.d, z8.d
> + st1b {z0.b}, p0, [x1]
> + st1b {z1.b}, p0, [x1, #1, MUL VL]
> + st1b {z2.b}, p0, [x1, #2, MUL VL]
> + st1b {z3.b}, p0, [x1, #3, MUL VL]
> + st1b {z4.b}, p0, [x1, #4, MUL VL]
> + st1b {z5.b}, p0, [x1, #5, MUL VL]
> + st1b {z6.b}, p0, [x1, #6, MUL VL]
> + st1b {z7.b}, p0, [x1, #7, MUL VL]
> +
> + addvl x2, x2, #8
> + addvl x1, x1, #8
> +
> + cbz x4, .Lcbc_dec_end
> + b .Lcbc_dec_loop_8x
> +
> +.Lcbc_dec_4x:
> + add x4, x4, x5, LSR #1
> + cmp x4, x5, LSR #2
> + blt .Lcbc_dec_loop_1x
> +
> + sub x4, x4, x5, LSR #2 /* x4 - (4 * VL) */
> +
> + ld1b {z15.b}, p0/z, [x2]
> + ld1b {z14.b}, p0/z, [x2, #1, MUL VL]
> + ld1b {z13.b}, p0/z, [x2, #2, MUL VL]
> + ld1b {z12.b}, p0/z, [x2, #3, MUL VL]
> + rev z0.b, z15.b
> + rev z1.b, z14.b
> + rev z2.b, z13.b
> + rev z3.b, z12.b
> + rev RTMP0.b, RIV.b
> + ext z3.b, z3.b, z2.b, #16
> + ext z2.b, z2.b, z1.b, #16
> + ext z1.b, z1.b, z0.b, #16
> + ext z0.b, z0.b, RTMP0.b, #16
> + rev z3.b, z3.b
> + rev z2.b, z2.b
> + rev z1.b, z1.b
> + rev z0.b, z0.b
> + mov RIV.d, z12.d
> +
> + SM4_SVE_CE_CRYPT_BLK4(z15, z14, z13, z12)
> +
> + eor z0.d, z0.d, z15.d
> + eor z1.d, z1.d, z14.d
> + eor z2.d, z2.d, z13.d
> + eor z3.d, z3.d, z12.d
> + st1b {z0.b}, p0, [x1]
> + st1b {z1.b}, p0, [x1, #1, MUL VL]
> + st1b {z2.b}, p0, [x1, #2, MUL VL]
> + st1b {z3.b}, p0, [x1, #3, MUL VL]
> +
> + addvl x2, x2, #4
> + addvl x1, x1, #4
> +
> + cbz x4, .Lcbc_dec_end
> +
> +.Lcbc_dec_loop_1x:
> + cmp x4, x5, LSR #4
> + blt .Lcbc_dec_ce
> +
> + sub x4, x4, x5, LSR #4 /* x4 - VL */
> +
> + ld1b {z15.b}, p0/z, [x2]
> + rev RTMP0.b, RIV.b
> + rev z0.b, z15.b
> + ext z0.b, z0.b, RTMP0.b, #16
> + rev z0.b, z0.b
> + mov RIV.d, z15.d
> +
> + SM4_SVE_CE_CRYPT_BLK(z15)
> +
> + eor z0.d, z0.d, z15.d
> + st1b {z0.b}, p0, [x1]
> +
> + addvl x2, x2, #1
> + addvl x1, x1, #1
> +
> + cbz x4, .Lcbc_dec_end
> + b .Lcbc_dec_loop_1x
> +
> +.Lcbc_dec_ce:
> + rev RIV.s, RIV.s
> + tbl RIV.b, {RIV.b}, RSWAP128.b
> +
> +.Lcbc_dec_ce_loop_1x:
> + sub x4, x4, #1
> +
> + ld1 {v15.16b}, [x2], #16
> + mov v0.16b, RIVv.16b
> + mov RIVv.16b, v15.16b
> + SM4_CE_CRYPT_BLK(v15)
> + eor v0.16b, v0.16b, v15.16b
> + st1 {v0.16b}, [x1], #16
> +
> + cbnz x4, .Lcbc_dec_ce_loop_1x
> +
> + ext RIV.b, RIV.b, RIV.b, #16
> +
> +.Lcbc_dec_end:
> + /* store new IV */
> + rev RIV.s, RIV.s
> + tbl RIV.b, {RIV.b}, RSWAP128.b
> + st1 {RIVv.16b}, [x3]
> +
> + ret
> +SYM_FUNC_END(sm4_sve_ce_cbc_dec)
> +
> +.align 3
> +SYM_FUNC_START(sm4_sve_ce_cfb_dec)
> + /* input:
> + * x0: round key array, CTX
> + * x1: dst
> + * x2: src
> + * x3: iv (big endian, 128 bit)
> + * w4: nblocks
> + */
> + uxtw x4, w4
> + SM4_PREPARE(x0)
> +
> + ld1 {RIVv.16b}, [x3]
> + ext RIV.b, RIV.b, RIV.b, #16
> +
> +.Lcfb_dec_loop_8x:
> + sub x4, x4, x5, LSR #1 /* x4 - (8 * VL) */
> + tbnz x4, #63, .Lcfb_dec_4x
> +
> + ld1b {z15.b}, p0/z, [x2]
> + ld1b {z14.b}, p0/z, [x2, #1, MUL VL]
> + ld1b {z13.b}, p0/z, [x2, #2, MUL VL]
> + ld1b {z12.b}, p0/z, [x2, #3, MUL VL]
> + ld1b {z11.b}, p0/z, [x2, #4, MUL VL]
> + ld1b {z10.b}, p0/z, [x2, #5, MUL VL]
> + ld1b {z9.b}, p0/z, [x2, #6, MUL VL]
> + ld1b {z8.b}, p0/z, [x2, #7, MUL VL]
> + rev z0.b, z15.b
> + rev z1.b, z14.b
> + rev z2.b, z13.b
> + rev z3.b, z12.b
> + rev z4.b, z11.b
> + rev z5.b, z10.b
> + rev z6.b, z9.b
> + rev z7.b, z8.b
> + rev RTMP0.b, RIV.b
> + ext z7.b, z7.b, z6.b, #16
> + ext z6.b, z6.b, z5.b, #16
> + ext z5.b, z5.b, z4.b, #16
> + ext z4.b, z4.b, z3.b, #16
> + ext z3.b, z3.b, z2.b, #16
> + ext z2.b, z2.b, z1.b, #16
> + ext z1.b, z1.b, z0.b, #16
> + ext z0.b, z0.b, RTMP0.b, #16
> + rev z7.b, z7.b
> + rev z6.b, z6.b
> + rev z5.b, z5.b
> + rev z4.b, z4.b
> + rev z3.b, z3.b
> + rev z2.b, z2.b
> + rev z1.b, z1.b
> + rev z0.b, z0.b
> + mov RIV.d, z8.d
> +
> + SM4_SVE_CE_CRYPT_BLK8(z0, z1, z2, z3, z4, z5, z6, z7)
> +
> + eor z0.d, z0.d, z15.d
> + eor z1.d, z1.d, z14.d
> + eor z2.d, z2.d, z13.d
> + eor z3.d, z3.d, z12.d
> + eor z4.d, z4.d, z11.d
> + eor z5.d, z5.d, z10.d
> + eor z6.d, z6.d, z9.d
> + eor z7.d, z7.d, z8.d
> + st1b {z0.b}, p0, [x1]
> + st1b {z1.b}, p0, [x1, #1, MUL VL]
> + st1b {z2.b}, p0, [x1, #2, MUL VL]
> + st1b {z3.b}, p0, [x1, #3, MUL VL]
> + st1b {z4.b}, p0, [x1, #4, MUL VL]
> + st1b {z5.b}, p0, [x1, #5, MUL VL]
> + st1b {z6.b}, p0, [x1, #6, MUL VL]
> + st1b {z7.b}, p0, [x1, #7, MUL VL]
> +
> + addvl x2, x2, #8
> + addvl x1, x1, #8
> +
> + cbz x4, .Lcfb_dec_end
> + b .Lcfb_dec_loop_8x
> +
> +.Lcfb_dec_4x:
> + add x4, x4, x5, LSR #1
> + cmp x4, x5, LSR #2
> + blt .Lcfb_dec_loop_1x
> +
> + sub x4, x4, x5, LSR #2 /* x4 - (4 * VL) */
> +
> + ld1b {z15.b}, p0/z, [x2]
> + ld1b {z14.b}, p0/z, [x2, #1, MUL VL]
> + ld1b {z13.b}, p0/z, [x2, #2, MUL VL]
> + ld1b {z12.b}, p0/z, [x2, #3, MUL VL]
> + rev z0.b, z15.b
> + rev z1.b, z14.b
> + rev z2.b, z13.b
> + rev z3.b, z12.b
> + rev RTMP0.b, RIV.b
> + ext z3.b, z3.b, z2.b, #16
> + ext z2.b, z2.b, z1.b, #16
> + ext z1.b, z1.b, z0.b, #16
> + ext z0.b, z0.b, RTMP0.b, #16
> + rev z3.b, z3.b
> + rev z2.b, z2.b
> + rev z1.b, z1.b
> + rev z0.b, z0.b
> + mov RIV.d, z12.d
> +
> + SM4_SVE_CE_CRYPT_BLK4(z0, z1, z2, z3)
> +
> + eor z0.d, z0.d, z15.d
> + eor z1.d, z1.d, z14.d
> + eor z2.d, z2.d, z13.d
> + eor z3.d, z3.d, z12.d
> + st1b {z0.b}, p0, [x1]
> + st1b {z1.b}, p0, [x1, #1, MUL VL]
> + st1b {z2.b}, p0, [x1, #2, MUL VL]
> + st1b {z3.b}, p0, [x1, #3, MUL VL]
> +
> + addvl x2, x2, #4
> + addvl x1, x1, #4
> +
> + cbz x4, .Lcfb_dec_end
> +
> +.Lcfb_dec_loop_1x:
> + cmp x4, x5, LSR #4
> + blt .Lcfb_dec_ce
> +
> + sub x4, x4, x5, LSR #4 /* x4 - VL */
> +
> + ld1b {z15.b}, p0/z, [x2]
> + rev RTMP0.b, RIV.b
> + rev z0.b, z15.b
> + ext z0.b, z0.b, RTMP0.b, #16
> + rev z0.b, z0.b
> + mov RIV.d, z15.d
> +
> + SM4_SVE_CE_CRYPT_BLK(z0)
> +
> + eor z0.d, z0.d, z15.d
> + st1b {z0.b}, p0, [x1]
> +
> + addvl x2, x2, #1
> + addvl x1, x1, #1
> +
> + cbz x4, .Lcfb_dec_end
> + b .Lcfb_dec_loop_1x
> +
> +.Lcfb_dec_ce:
> + rev RIV.s, RIV.s
> + tbl RIV.b, {RIV.b}, RSWAP128.b
> +
> +.Lcfb_dec_ce_loop_1x:
> + sub x4, x4, #1
> +
> + ld1 {v15.16b}, [x2], #16
> + mov v0.16b, RIVv.16b
> + mov RIVv.16b, v15.16b
> + SM4_CE_CRYPT_BLK(v0)
> + eor v0.16b, v0.16b, v15.16b
> + st1 {v0.16b}, [x1], #16
> +
> + cbnz x4, .Lcfb_dec_ce_loop_1x
> +
> + ext RIV.b, RIV.b, RIV.b, #16
> +
> +.Lcfb_dec_end:
> + /* store new IV */
> + rev RIV.s, RIV.s
> + tbl RIV.b, {RIV.b}, RSWAP128.b
> + st1 {RIVv.16b}, [x3]
> +
> + ret
> +SYM_FUNC_END(sm4_sve_ce_cfb_dec)
> +
> +.align 3
> +SYM_FUNC_START(sm4_sve_ce_ctr_crypt)
> + /* input:
> + * x0: round key array, CTX
> + * x1: dst
> + * x2: src
> + * x3: ctr (big endian, 128 bit)
> + * w4: nblocks
> + */
> + uxtw x4, w4
> + SM4_PREPARE(x0)
> +
> + dup RZERO.d, #0
> + adr_l x6, .Lle128_inc
> + ld1b {RLE128_INC.b}, p0/z, [x6]
> +
> + ldp x7, x8, [x3]
> + rev x7, x7
> + rev x8, x8
> +
> +.Lctr_loop_8x:
> + sub x4, x4, x5, LSR #1 /* x4 - (8 * VL) */
> + tbnz x4, #63, .Lctr_4x
> +
> + inc_le128_8x(z0, z1, z2, z3, z4, z5, z6, z7)
> +
> + ld1b {z8.b}, p0/z, [x2]
> + ld1b {z9.b}, p0/z, [x2, #1, MUL VL]
> + ld1b {z10.b}, p0/z, [x2, #2, MUL VL]
> + ld1b {z11.b}, p0/z, [x2, #3, MUL VL]
> + ld1b {z12.b}, p0/z, [x2, #4, MUL VL]
> + ld1b {z13.b}, p0/z, [x2, #5, MUL VL]
> + ld1b {z14.b}, p0/z, [x2, #6, MUL VL]
> + ld1b {z15.b}, p0/z, [x2, #7, MUL VL]
> +
> + SM4_SVE_CE_CRYPT_BLK8(z0, z1, z2, z3, z4, z5, z6, z7)
> +
> + eor z0.d, z0.d, z8.d
> + eor z1.d, z1.d, z9.d
> + eor z2.d, z2.d, z10.d
> + eor z3.d, z3.d, z11.d
> + eor z4.d, z4.d, z12.d
> + eor z5.d, z5.d, z13.d
> + eor z6.d, z6.d, z14.d
> + eor z7.d, z7.d, z15.d
> +
> + st1b {z0.b}, p0, [x1]
> + st1b {z1.b}, p0, [x1, #1, MUL VL]
> + st1b {z2.b}, p0, [x1, #2, MUL VL]
> + st1b {z3.b}, p0, [x1, #3, MUL VL]
> + st1b {z4.b}, p0, [x1, #4, MUL VL]
> + st1b {z5.b}, p0, [x1, #5, MUL VL]
> + st1b {z6.b}, p0, [x1, #6, MUL VL]
> + st1b {z7.b}, p0, [x1, #7, MUL VL]
> +
> + addvl x2, x2, #8
> + addvl x1, x1, #8
> +
> + cbz x4, .Lctr_end
> + b .Lctr_loop_8x
> +
> +.Lctr_4x:
> + add x4, x4, x5, LSR #1
> + cmp x4, x5, LSR #2
> + blt .Lctr_loop_1x
> +
> + sub x4, x4, x5, LSR #2 /* x4 - (4 * VL) */
> +
> + inc_le128_4x(z0, z1, z2, z3)
> +
> + ld1b {z8.b}, p0/z, [x2]
> + ld1b {z9.b}, p0/z, [x2, #1, MUL VL]
> + ld1b {z10.b}, p0/z, [x2, #2, MUL VL]
> + ld1b {z11.b}, p0/z, [x2, #3, MUL VL]
> +
> + SM4_SVE_CE_CRYPT_BLK4(z0, z1, z2, z3)
> +
> + eor z0.d, z0.d, z8.d
> + eor z1.d, z1.d, z9.d
> + eor z2.d, z2.d, z10.d
> + eor z3.d, z3.d, z11.d
> +
> + st1b {z0.b}, p0, [x1]
> + st1b {z1.b}, p0, [x1, #1, MUL VL]
> + st1b {z2.b}, p0, [x1, #2, MUL VL]
> + st1b {z3.b}, p0, [x1, #3, MUL VL]
> +
> + addvl x2, x2, #4
> + addvl x1, x1, #4
> +
> + cbz x4, .Lctr_end
> +
> +.Lctr_loop_1x:
> + cmp x4, x5, LSR #4
> + blt .Lctr_ce_loop_1x
> +
> + sub x4, x4, x5, LSR #4 /* x4 - VL */
> +
> + inc_le128(z0)
> + ld1b {z8.b}, p0/z, [x2]
> +
> + SM4_SVE_CE_CRYPT_BLK(z0)
> +
> + eor z0.d, z0.d, z8.d
> + st1b {z0.b}, p0, [x1]
> +
> + addvl x2, x2, #1
> + addvl x1, x1, #1
> +
> + cbz x4, .Lctr_end
> + b .Lctr_loop_1x
> +
> +.Lctr_ce_loop_1x:
> + sub x4, x4, #1
> +
> + /* inc_le128 for CE */
> + mov v0.d[1], x8
> + mov v0.d[0], x7
> + adds x8, x8, #1
> + rev64 v0.16b, v0.16b
> + adc x7, x7, xzr
> +
> + ld1 {v8.16b}, [x2], #16
> +
> + SM4_CE_CRYPT_BLK(v0)
> +
> + eor v0.16b, v0.16b, v8.16b
> + st1 {v0.16b}, [x1], #16
> +
> + cbnz x4, .Lctr_ce_loop_1x
> +
> +.Lctr_end:
> + /* store new CTR */
> + rev x7, x7
> + rev x8, x8
> + stp x7, x8, [x3]
> +
> + ret
> +SYM_FUNC_END(sm4_sve_ce_ctr_crypt)
> +
> +.align 3
> +SYM_FUNC_START(sm4_sve_get_vl)
> + /* VL in bytes */
> + rdvl x0, #1
> +
> + ret
> +SYM_FUNC_END(sm4_sve_get_vl)
> +
> +
> + .section ".rodata", "a"
> + .align 4
> +.Lbswap128_mask:
> + .byte 0x0c, 0x0d, 0x0e, 0x0f, 0x08, 0x09, 0x0a, 0x0b
> + .byte 0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03
> + .byte 0x1c, 0x1d, 0x1e, 0x1f, 0x18, 0x19, 0x1a, 0x1b
> + .byte 0x14, 0x15, 0x16, 0x17, 0x10, 0x11, 0x12, 0x13
> + .byte 0x2c, 0x2d, 0x2e, 0x2f, 0x28, 0x29, 0x2a, 0x2b
> + .byte 0x24, 0x25, 0x26, 0x27, 0x20, 0x21, 0x22, 0x23
> + .byte 0x3c, 0x3d, 0x3e, 0x3f, 0x38, 0x39, 0x3a, 0x3b
> + .byte 0x34, 0x35, 0x36, 0x37, 0x30, 0x31, 0x32, 0x33
> + .byte 0x4c, 0x4d, 0x4e, 0x4f, 0x48, 0x49, 0x4a, 0x4b
> + .byte 0x44, 0x45, 0x46, 0x47, 0x40, 0x41, 0x42, 0x43
> + .byte 0x5c, 0x5d, 0x5e, 0x5f, 0x58, 0x59, 0x5a, 0x5b
> + .byte 0x54, 0x55, 0x56, 0x57, 0x50, 0x51, 0x52, 0x53
> + .byte 0x6c, 0x6d, 0x6e, 0x6f, 0x68, 0x69, 0x6a, 0x6b
> + .byte 0x64, 0x65, 0x66, 0x67, 0x60, 0x61, 0x62, 0x63
> + .byte 0x7c, 0x7d, 0x7e, 0x7f, 0x78, 0x79, 0x7a, 0x7b
> + .byte 0x74, 0x75, 0x76, 0x77, 0x70, 0x71, 0x72, 0x73
> + .byte 0x8c, 0x8d, 0x8e, 0x8f, 0x88, 0x89, 0x8a, 0x8b
> + .byte 0x84, 0x85, 0x86, 0x87, 0x80, 0x81, 0x82, 0x83
> + .byte 0x9c, 0x9d, 0x9e, 0x9f, 0x98, 0x99, 0x9a, 0x9b
> + .byte 0x94, 0x95, 0x96, 0x97, 0x90, 0x91, 0x92, 0x93
> + .byte 0xac, 0xad, 0xae, 0xaf, 0xa8, 0xa9, 0xaa, 0xab
> + .byte 0xa4, 0xa5, 0xa6, 0xa7, 0xa0, 0xa1, 0xa2, 0xa3
> + .byte 0xbc, 0xbd, 0xbe, 0xbf, 0xb8, 0xb9, 0xba, 0xbb
> + .byte 0xb4, 0xb5, 0xb6, 0xb7, 0xb0, 0xb1, 0xb2, 0xb3
> + .byte 0xcc, 0xcd, 0xce, 0xcf, 0xc8, 0xc9, 0xca, 0xcb
> + .byte 0xc4, 0xc5, 0xc6, 0xc7, 0xc0, 0xc1, 0xc2, 0xc3
> + .byte 0xdc, 0xdd, 0xde, 0xdf, 0xd8, 0xd9, 0xda, 0xdb
> + .byte 0xd4, 0xd5, 0xd6, 0xd7, 0xd0, 0xd1, 0xd2, 0xd3
> + .byte 0xec, 0xed, 0xee, 0xef, 0xe8, 0xe9, 0xea, 0xeb
> + .byte 0xe4, 0xe5, 0xe6, 0xe7, 0xe0, 0xe1, 0xe2, 0xe3
> + .byte 0xfc, 0xfd, 0xfe, 0xff, 0xf8, 0xf9, 0xfa, 0xfb
> + .byte 0xf4, 0xf5, 0xf6, 0xf7, 0xf0, 0xf1, 0xf2, 0xf3
> +
> +.Lle128_inc:
> + .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x05, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x06, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x07, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x09, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x0a, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x0b, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x0c, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x0d, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x0e, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x0f, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> diff --git a/arch/arm64/crypto/sm4-sve-ce-glue.c b/arch/arm64/crypto/sm4-sve-ce-glue.c
> new file mode 100644
> index 000000000000..fc797b72b5f0
> --- /dev/null
> +++ b/arch/arm64/crypto/sm4-sve-ce-glue.c
> @@ -0,0 +1,332 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * SM4 Cipher Algorithm, using ARMv9 Crypto Extensions with SVE2
> + * as specified in
> + * https://tools.ietf.org/id/draft-ribose-cfrg-sm4-10.html
> + *
> + * Copyright (C) 2022, Alibaba Group.
> + * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
> + */
> +
> +#include <linux/module.h>
> +#include <linux/crypto.h>
> +#include <linux/kernel.h>
> +#include <linux/cpufeature.h>
> +#include <asm/neon.h>
> +#include <asm/simd.h>
> +#include <crypto/internal/simd.h>
> +#include <crypto/internal/skcipher.h>
> +#include <crypto/sm4.h>
> +#include "sm4-ce.h"
> +
> +asmlinkage void sm4_sve_ce_crypt(const u32 *rkey, u8 *dst,
> + const u8 *src, unsigned int nblocks);
> +asmlinkage void sm4_sve_ce_cbc_dec(const u32 *rkey_dec, u8 *dst,
> + const u8 *src, u8 *iv,
> + unsigned int nblocks);
> +asmlinkage void sm4_sve_ce_cfb_dec(const u32 *rkey_enc, u8 *dst,
> + const u8 *src, u8 *iv,
> + unsigned int nblocks);
> +asmlinkage void sm4_sve_ce_ctr_crypt(const u32 *rkey_enc, u8 *dst,
> + const u8 *src, u8 *iv,
> + unsigned int nblocks);
> +asmlinkage unsigned int sm4_sve_get_vl(void);
> +
> +
> +static int sm4_setkey(struct crypto_skcipher *tfm, const u8 *key,
> + unsigned int key_len)
> +{
> + struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
> +
> + if (key_len != SM4_KEY_SIZE)
> + return -EINVAL;
> +
> + kernel_neon_begin();
> + sm4_ce_expand_key(key, ctx->rkey_enc, ctx->rkey_dec,
> + crypto_sm4_fk, crypto_sm4_ck);
> + kernel_neon_end();
> +
> + return 0;
> +}
> +
> +static int ecb_crypt(struct skcipher_request *req, const u32 *rkey)
> +{
> + struct skcipher_walk walk;
> + unsigned int nbytes;
> + int err;
> +
> + err = skcipher_walk_virt(&walk, req, false);
> +
> + while ((nbytes = walk.nbytes) > 0) {
> + const u8 *src = walk.src.virt.addr;
> + u8 *dst = walk.dst.virt.addr;
> + unsigned int nblocks;
> +
> + nblocks = nbytes / SM4_BLOCK_SIZE;
> + if (nblocks) {
> + kernel_neon_begin();
> +
> + sm4_sve_ce_crypt(rkey, dst, src, nblocks);
> +
> + kernel_neon_end();
> + }
> +
> + err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE);
> + }
> +
> + return err;
> +}
> +
> +static int ecb_encrypt(struct skcipher_request *req)
> +{
> + struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> + struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
> +
> + return ecb_crypt(req, ctx->rkey_enc);
> +}
> +
> +static int ecb_decrypt(struct skcipher_request *req)
> +{
> + struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> + struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
> +
> + return ecb_crypt(req, ctx->rkey_dec);
> +}
> +
> +static int cbc_crypt(struct skcipher_request *req, const u32 *rkey,
> + void (*sm4_cbc_crypt)(const u32 *rkey, u8 *dst,
> + const u8 *src, u8 *iv, unsigned int nblocks))
> +{
> + struct skcipher_walk walk;
> + unsigned int nbytes;
> + int err;
> +
> + err = skcipher_walk_virt(&walk, req, false);
> +
> + while ((nbytes = walk.nbytes) > 0) {
> + const u8 *src = walk.src.virt.addr;
> + u8 *dst = walk.dst.virt.addr;
> + unsigned int nblocks;
> +
> + nblocks = nbytes / SM4_BLOCK_SIZE;
> + if (nblocks) {
> + kernel_neon_begin();
> +
> + sm4_cbc_crypt(rkey, dst, src, walk.iv, nblocks);
> +
> + kernel_neon_end();
> + }
> +
> + err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE);
> + }
> +
> + return err;
> +}
> +
> +static int cbc_encrypt(struct skcipher_request *req)
> +{
> + struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> + struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
> +
> + return cbc_crypt(req, ctx->rkey_enc, sm4_ce_cbc_enc);
> +}
> +
> +static int cbc_decrypt(struct skcipher_request *req)
> +{
> + struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> + struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
> +
> + return cbc_crypt(req, ctx->rkey_dec, sm4_sve_ce_cbc_dec);
> +}
> +
> +static int cfb_crypt(struct skcipher_request *req,
> + void (*sm4_cfb_crypt)(const u32 *rkey, u8 *dst,
> + const u8 *src, u8 *iv, unsigned int nblocks))
> +{
> + struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> + struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
> + struct skcipher_walk walk;
> + unsigned int nbytes;
> + int err;
> +
> + err = skcipher_walk_virt(&walk, req, false);
> +
> + while ((nbytes = walk.nbytes) > 0) {
> + const u8 *src = walk.src.virt.addr;
> + u8 *dst = walk.dst.virt.addr;
> + unsigned int nblocks;
> +
> + nblocks = nbytes / SM4_BLOCK_SIZE;
> + if (nblocks) {
> + kernel_neon_begin();
> +
> + sm4_cfb_crypt(ctx->rkey_enc, dst, src,
> + walk.iv, nblocks);
> +
> + kernel_neon_end();
> +
> + dst += nblocks * SM4_BLOCK_SIZE;
> + src += nblocks * SM4_BLOCK_SIZE;
> + nbytes -= nblocks * SM4_BLOCK_SIZE;
> + }
> +
> + /* tail */
> + if (walk.nbytes == walk.total && nbytes > 0) {
> + u8 keystream[SM4_BLOCK_SIZE];
> +
> + sm4_ce_crypt_block(ctx->rkey_enc, keystream, walk.iv);
> + crypto_xor_cpy(dst, src, keystream, nbytes);
> + nbytes = 0;
> + }
> +
> + err = skcipher_walk_done(&walk, nbytes);
> + }
> +
> + return err;
> +}
> +
> +static int cfb_encrypt(struct skcipher_request *req)
> +{
> + return cfb_crypt(req, sm4_ce_cfb_enc);
> +}
> +
> +static int cfb_decrypt(struct skcipher_request *req)
> +{
> + return cfb_crypt(req, sm4_sve_ce_cfb_dec);
> +}
> +
> +static int ctr_crypt(struct skcipher_request *req)
> +{
> + struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> + struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
> + struct skcipher_walk walk;
> + unsigned int nbytes;
> + int err;
> +
> + err = skcipher_walk_virt(&walk, req, false);
> +
> + while ((nbytes = walk.nbytes) > 0) {
> + const u8 *src = walk.src.virt.addr;
> + u8 *dst = walk.dst.virt.addr;
> + unsigned int nblocks;
> +
> + nblocks = nbytes / SM4_BLOCK_SIZE;
> + if (nblocks) {
> + kernel_neon_begin();
> +
> + sm4_sve_ce_ctr_crypt(ctx->rkey_enc, dst, src,
> + walk.iv, nblocks);
> +
> + kernel_neon_end();
> +
> + dst += nblocks * SM4_BLOCK_SIZE;
> + src += nblocks * SM4_BLOCK_SIZE;
> + nbytes -= nblocks * SM4_BLOCK_SIZE;
> + }
> +
> + /* tail */
> + if (walk.nbytes == walk.total && nbytes > 0) {
> + u8 keystream[SM4_BLOCK_SIZE];
> +
> + sm4_ce_crypt_block(ctx->rkey_enc, keystream, walk.iv);
> + crypto_inc(walk.iv, SM4_BLOCK_SIZE);
> + crypto_xor_cpy(dst, src, keystream, nbytes);
> + nbytes = 0;
> + }
> +
> + err = skcipher_walk_done(&walk, nbytes);
> + }
> +
> + return err;
> +}
> +
> +static struct skcipher_alg sm4_algs[] = {
> + {
> + .base = {
> + .cra_name = "ecb(sm4)",
> + .cra_driver_name = "ecb-sm4-sve-ce",
> + .cra_priority = 500,
> + .cra_blocksize = SM4_BLOCK_SIZE,
> + .cra_ctxsize = sizeof(struct sm4_ctx),
> + .cra_module = THIS_MODULE,
> + },
> + .min_keysize = SM4_KEY_SIZE,
> + .max_keysize = SM4_KEY_SIZE,
> + .setkey = sm4_setkey,
> + .encrypt = ecb_encrypt,
> + .decrypt = ecb_decrypt,
> + }, {
> + .base = {
> + .cra_name = "cbc(sm4)",
> + .cra_driver_name = "cbc-sm4-sve-ce",
> + .cra_priority = 500,
> + .cra_blocksize = SM4_BLOCK_SIZE,
> + .cra_ctxsize = sizeof(struct sm4_ctx),
> + .cra_module = THIS_MODULE,
> + },
> + .min_keysize = SM4_KEY_SIZE,
> + .max_keysize = SM4_KEY_SIZE,
> + .ivsize = SM4_BLOCK_SIZE,
> + .setkey = sm4_setkey,
> + .encrypt = cbc_encrypt,
> + .decrypt = cbc_decrypt,
> + }, {
> + .base = {
> + .cra_name = "cfb(sm4)",
> + .cra_driver_name = "cfb-sm4-sve-ce",
> + .cra_priority = 500,
> + .cra_blocksize = 1,
> + .cra_ctxsize = sizeof(struct sm4_ctx),
> + .cra_module = THIS_MODULE,
> + },
> + .min_keysize = SM4_KEY_SIZE,
> + .max_keysize = SM4_KEY_SIZE,
> + .ivsize = SM4_BLOCK_SIZE,
> + .chunksize = SM4_BLOCK_SIZE,
> + .setkey = sm4_setkey,
> + .encrypt = cfb_encrypt,
> + .decrypt = cfb_decrypt,
> + }, {
> + .base = {
> + .cra_name = "ctr(sm4)",
> + .cra_driver_name = "ctr-sm4-sve-ce",
> + .cra_priority = 500,
> + .cra_blocksize = 1,
> + .cra_ctxsize = sizeof(struct sm4_ctx),
> + .cra_module = THIS_MODULE,
> + },
> + .min_keysize = SM4_KEY_SIZE,
> + .max_keysize = SM4_KEY_SIZE,
> + .ivsize = SM4_BLOCK_SIZE,
> + .chunksize = SM4_BLOCK_SIZE,
> + .setkey = sm4_setkey,
> + .encrypt = ctr_crypt,
> + .decrypt = ctr_crypt,
> + }
> +};
> +
> +static int __init sm4_sve_ce_init(void)
> +{
> + if (sm4_sve_get_vl() <= 16)
> + return -ENODEV;
> +
> + return crypto_register_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs));
> +}
> +
> +static void __exit sm4_sve_ce_exit(void)
> +{
> + crypto_unregister_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs));
> +}
> +
> +module_cpu_feature_match(SVESM4, sm4_sve_ce_init);
> +module_exit(sm4_sve_ce_exit);
> +
> +MODULE_DESCRIPTION("SM4 ECB/CBC/CFB/CTR using ARMv9 Crypto Extensions with SVE2");
> +MODULE_ALIAS_CRYPTO("sm4-sve-ce");
> +MODULE_ALIAS_CRYPTO("sm4");
> +MODULE_ALIAS_CRYPTO("ecb(sm4)");
> +MODULE_ALIAS_CRYPTO("cbc(sm4)");
> +MODULE_ALIAS_CRYPTO("cfb(sm4)");
> +MODULE_ALIAS_CRYPTO("ctr(sm4)");
> +MODULE_AUTHOR("Tianjia Zhang <tianjia.zhang@linux.alibaba.com>");
> +MODULE_LICENSE("GPL v2");
> --
> 2.24.3 (Apple Git-128)
>
* Re: [PATCH 16/16] crypto: arm64/sm4 - add ARMv9 SVE cryptography acceleration implementation
@ 2022-09-26 10:02 ` Ard Biesheuvel
From: Ard Biesheuvel @ 2022-09-26 10:02 UTC (permalink / raw)
To: Tianjia Zhang, Mark Brown
Cc: Herbert Xu, David S. Miller, Jussi Kivilinna, Catalin Marinas,
Will Deacon, Maxime Coquelin, Alexandre Torgue, Eric Biggers,
linux-crypto, linux-arm-kernel, linux-kernel, linux-stm32
(cc Mark Brown)
Hello Tianjia,
On Mon, 26 Sept 2022 at 11:37, Tianjia Zhang
<tianjia.zhang@linux.alibaba.com> wrote:
>
> Scalable Vector Extension (SVE) is the next-generation SIMD extension for
> arm64. SVE allows flexible vector length implementations with a range of
> possible values in CPU implementations. The vector length can vary from a
> minimum of 128 bits up to a maximum of 2048 bits, at 128-bit increments.
> The SVE design guarantees that the same application can run on different
> implementations that support SVE, without the need to recompile the code.
>
> SVE was originally introduced in ARMv8, and ARMv9 introduced SVE2 to
> expand and improve it. Just as NEON has the Crypto Extensions, SVE
> supports similar instructions, called cryptography acceleration
> instructions, but these are also an optional part of the instruction
> set.
>
> This patch uses SM4 cryptography acceleration instructions and SVE2
> instructions to optimize the SM4 algorithm for ECB/CBC/CFB/CTR modes.
> Since CBC/CFB encryption cannot be parallelized, the Crypto Extension
> instructions are used for it.
>
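To illustrate the point in the quoted commit message about CBC encryption not being parallelizable, here is a minimal C sketch. It assumes a toy 16-byte block transform (toy_encrypt_block/toy_decrypt_block, hypothetical names that are NOT SM4 and not secure) purely to show the data dependency: each encrypted block needs the previous ciphertext block, while each decrypted block only needs ciphertext that is already available.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLK 16

/* Toy stand-in for a block cipher. NOT SM4, not secure; illustration only. */
static void toy_encrypt_block(uint8_t out[BLK], const uint8_t in[BLK],
			      const uint8_t key[BLK])
{
	for (size_t i = 0; i < BLK; i++)
		out[i] = (uint8_t)((in[i] ^ key[i]) + 1);
}

static void toy_decrypt_block(uint8_t out[BLK], const uint8_t in[BLK],
			      const uint8_t key[BLK])
{
	for (size_t i = 0; i < BLK; i++)
		out[i] = (uint8_t)((in[i] - 1) ^ key[i]);
}

/* Serial: block i cannot start until ciphertext block i-1 exists. */
static void toy_cbc_encrypt(uint8_t *dst, const uint8_t *src, size_t nblocks,
			    const uint8_t key[BLK], const uint8_t iv[BLK])
{
	uint8_t prev[BLK], tmp[BLK];

	memcpy(prev, iv, BLK);
	for (size_t i = 0; i < nblocks; i++) {
		for (size_t j = 0; j < BLK; j++)
			tmp[j] = src[i * BLK + j] ^ prev[j];
		toy_encrypt_block(dst + i * BLK, tmp, key);
		memcpy(prev, dst + i * BLK, BLK);
	}
}

/*
 * Independent iterations: every block reads only the ciphertext input,
 * so the loop body could run for many blocks at once (dst must not
 * alias src in this sketch).
 */
static void toy_cbc_decrypt(uint8_t *dst, const uint8_t *src, size_t nblocks,
			    const uint8_t key[BLK], const uint8_t iv[BLK])
{
	for (size_t i = 0; i < nblocks; i++) {
		const uint8_t *prev = i ? src + (i - 1) * BLK : iv;
		uint8_t tmp[BLK];

		toy_decrypt_block(tmp, src + i * BLK, key);
		for (size_t j = 0; j < BLK; j++)
			dst[i * BLK + j] = tmp[j] ^ prev[j];
	}
}
```

The same asymmetry is visible in the benchmark table further down, where CBC/CFB encryption throughput stays far below the decryption throughput for both implementations.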
Given that we currently do not support the use of SVE in kernel mode,
this patch cannot be accepted at this time (but the rest of the series
looks reasonable to me, although I have only skimmed over the patches).
In view of the disappointing benchmark results below, I don't think
this is worth the hassle at the moment. If we can find a case where
using SVE in kernel mode truly makes a [favorable] difference, we can
revisit this, but not without a thorough analysis of the impact it
will have to support SVE in the kernel. Also, the fact that SVE may
also cover cryptographic extensions does not necessarily imply that a
micro-architecture will perform those crypto transformations in
parallel and so the performance may be the same even if VL > 128.
In summary, please drop this patch for now, and once there are more
encouraging performance numbers, please resubmit it as part of a
series that explicitly enables SVE in kernel mode on arm64, and
documents the requirements and constraints.
I have cc'ed Mark, who has been working on the SVE support and might
have something to add here as well.
Thanks,
Ard.
> Since no test environment with a Vector Length (VL) greater than 128 bits
> was available, the performance data was obtained on a machine with a VL of
> 128 bits. Because this driver is only enabled when the VL is greater than
> 128 bits, these numbers are for reference only. The data shows little
> difference between the Crypto Extension and SVE (VL = 128 bits)
> implementations; the benefit is expected to be more obvious when VL = 256
> bits or longer.
>
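As a reading aid for how the vector length enters the picture: the assembly in this patch reads the VL in bytes with rdvl and derives block counts from it (VL / 16 blocks per vector, hence the "x5, LSR #4", "LSR #2" and "LSR #1" operands for 1, 4 and 8 vectors). The C sketch below models that loop structure as I read it from sm4_sve_ce_crypt. The struct and function names are hypothetical and not part of the patch, and the sketch only counts blocks, it does not encrypt anything.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping type: how many 16-byte blocks each path handles. */
struct sm4_sve_split {
	size_t sve_8x;	/* blocks consumed by the 8-vector loop */
	size_t sve_4x;	/* blocks consumed by the single 4-vector step */
	size_t sve_1x;	/* blocks consumed by the 1-vector loop */
	size_t ce_tail;	/* blocks left for the 128-bit CE loop */
};

/* vl_bytes is what "rdvl x0, #1" would return: 16 for VL = 128 bits, etc. */
static struct sm4_sve_split sm4_sve_split_blocks(size_t nblocks, size_t vl_bytes)
{
	struct sm4_sve_split s = { 0, 0, 0, 0 };
	size_t bpv;

	assert(vl_bytes >= 16 && vl_bytes % 16 == 0);
	bpv = vl_bytes / 16;		/* SM4 blocks per SVE vector */

	while (nblocks >= 8 * bpv) {	/* .Lcrypt_loop_8x */
		s.sve_8x += 8 * bpv;
		nblocks -= 8 * bpv;
	}
	if (nblocks >= 4 * bpv) {	/* .Lcrypt_4x, taken at most once */
		s.sve_4x = 4 * bpv;
		nblocks -= 4 * bpv;
	}
	while (nblocks >= bpv) {	/* .Lcrypt_loop_1x */
		s.sve_1x += bpv;
		nblocks -= bpv;
	}
	s.ce_tail = nblocks;		/* .Lcrypt_ce_loop_1x */
	return s;
}
```

For example, with a 256-bit VL (vl_bytes = 32) and 31 blocks, this model gives 16 blocks to the 8-vector loop, 8 to the 4-vector step, 6 to the 1-vector loop, and 1 to the CE tail. With VL = 128 bits each vector holds exactly one block, so the SVE paths have no width advantage over the existing CE code.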
> Benchmark on T-Head Yitian-710 at 2.75 GHz. The data comes from mode 218
> of tcrypt and is compared against the Crypto Extension implementation.
> The columns are block lengths in bytes, and the unit is Mb/s:
>
> sm4-ce | 16 64 128 256 1024 1420 4096
> ------------+--------------------------------------------------------------
> ECB enc | 315.18 1162.65 1815.66 2553.50 3692.91 3727.20 4001.93
> ECB dec | 316.06 1172.97 1817.81 2554.66 3692.18 3786.54 4001.93
> CBC enc | 304.82 629.54 768.65 864.72 953.90 963.32 974.06
> CBC dec | 306.05 1142.53 1805.11 2481.67 3522.06 3587.87 3790.99
> CFB enc | 309.48 635.70 774.44 865.85 950.62 952.68 968.24
> CFB dec | 315.98 1170.38 1828.75 2509.72 3543.63 3539.40 3793.25
> CTR enc | 285.83 1036.59 1583.50 2147.26 2933.54 2954.66 3041.14
> CTR dec | 285.29 1037.47 1584.67 2145.51 2934.10 2950.89 3041.62
>
> sm4-sve-ce (VL = 128 bits)
> ECB enc | 310.00 1154.70 1813.26 2579.74 3766.90 3869.45 4100.26
> ECB dec | 315.60 1176.22 1838.06 2593.69 3774.95 3878.42 4098.83
> CBC enc | 303.44 622.65 764.67 861.40 953.18 963.05 973.77
> CBC dec | 302.13 1091.15 1689.10 2267.79 3182.84 3242.68 3408.92
> CFB enc | 296.62 620.41 762.94 858.96 948.18 956.04 967.67
> CFB dec | 291.23 1065.50 1637.33 2228.12 3158.52 3213.35 3403.83
> CTR enc | 272.27 959.35 1466.34 1934.24 2562.80 2595.87 2695.15
> CTR dec | 273.40 963.65 1471.83 1938.97 2563.12 2597.25 2694.54
>
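To put numbers on the comparison, the following C sketch computes the relative change of sm4-sve-ce (VL = 128 bits) against sm4-ce for the 4096-byte column of the two tables above. The values are copied from the tables; the array and function names are hypothetical and only serve this illustration.

```c
#include <assert.h>

/* Row order: ECB enc, ECB dec, CBC enc, CBC dec, CFB enc, CFB dec, CTR enc, CTR dec */
#define SM4_BENCH_ROWS 8

/* 4096-byte column of the sm4-ce table, in Mb/s. */
static const double sm4_ce_4096[SM4_BENCH_ROWS] = {
	4001.93, 4001.93, 974.06, 3790.99, 968.24, 3793.25, 3041.14, 3041.62,
};

/* 4096-byte column of the sm4-sve-ce (VL = 128 bits) table, in Mb/s. */
static const double sm4_sve_ce_4096[SM4_BENCH_ROWS] = {
	4100.26, 4098.83, 973.77, 3408.92, 967.67, 3403.83, 2695.15, 2694.54,
};

/* Relative change of candidate versus baseline, in percent. */
static double percent_change(double baseline, double candidate)
{
	return (candidate - baseline) / baseline * 100.0;
}

/* Percent change of SVE over CE for one row of the 4096-byte column. */
static double sm4_sve_vs_ce_4096(int row)
{
	assert(row >= 0 && row < SM4_BENCH_ROWS);
	return percent_change(sm4_ce_4096[row], sm4_sve_ce_4096[row]);
}
```

Under these numbers, ECB comes out roughly 2.5% faster with SVE, CBC/CFB encryption is essentially unchanged, and CBC/CFB decryption and CTR are roughly 10% to 11% slower at this block size.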
> Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
> ---
> arch/arm64/crypto/Kconfig | 19 +
> arch/arm64/crypto/Makefile | 3 +
> arch/arm64/crypto/sm4-sve-ce-core.S | 1028 +++++++++++++++++++++++++++
> arch/arm64/crypto/sm4-sve-ce-glue.c | 332 +++++++++
> 4 files changed, 1382 insertions(+)
> create mode 100644 arch/arm64/crypto/sm4-sve-ce-core.S
> create mode 100644 arch/arm64/crypto/sm4-sve-ce-glue.c
>
> diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
> index 6793d5bc3ee5..bbb5a7a08af5 100644
> --- a/arch/arm64/crypto/Kconfig
> +++ b/arch/arm64/crypto/Kconfig
> @@ -249,6 +249,25 @@ config CRYPTO_SM4_ARM64_CE_BLK
> - ARMv8 Crypto Extensions
> - NEON (Advanced SIMD) extensions
>
> +config CRYPTO_SM4_ARM64_SVE_CE_BLK
> + tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR (ARMv9 cryptography acceleration with SVE2)"
> + depends on KERNEL_MODE_NEON
> + select CRYPTO_SKCIPHER
> + select CRYPTO_SM4
> + select CRYPTO_SM4_ARM64_CE_BLK
> + help
> + Length-preserving ciphers: SM4 cipher algorithms (OSCCA GB/T 32907-2016)
> + with block cipher modes:
> + - ECB (Electronic Codebook) mode (NIST SP800-38A)
> + - CBC (Cipher Block Chaining) mode (NIST SP800-38A)
> + - CFB (Cipher Feedback) mode (NIST SP800-38A)
> + - CTR (Counter) mode (NIST SP800-38A)
> +
> + Architecture: arm64 using:
> + - ARMv8 Crypto Extensions
> + - ARMv9 cryptography acceleration with SVE2
> + - NEON (Advanced SIMD) extensions
> +
> config CRYPTO_SM4_ARM64_NEON_BLK
> tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR (NEON)"
> depends on KERNEL_MODE_NEON
> diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
> index 4818e204c2ac..355dd9053434 100644
> --- a/arch/arm64/crypto/Makefile
> +++ b/arch/arm64/crypto/Makefile
> @@ -38,6 +38,9 @@ sm4-ce-gcm-y := sm4-ce-gcm-glue.o sm4-ce-gcm-core.o
> obj-$(CONFIG_CRYPTO_SM4_ARM64_NEON_BLK) += sm4-neon.o
> sm4-neon-y := sm4-neon-glue.o sm4-neon-core.o
>
> +obj-$(CONFIG_CRYPTO_SM4_ARM64_SVE_CE_BLK) += sm4-sve-ce.o
> +sm4-sve-ce-y := sm4-sve-ce-glue.o sm4-sve-ce-core.o
> +
> obj-$(CONFIG_CRYPTO_GHASH_ARM64_CE) += ghash-ce.o
> ghash-ce-y := ghash-ce-glue.o ghash-ce-core.o
>
> diff --git a/arch/arm64/crypto/sm4-sve-ce-core.S b/arch/arm64/crypto/sm4-sve-ce-core.S
> new file mode 100644
> index 000000000000..caecbdf2536c
> --- /dev/null
> +++ b/arch/arm64/crypto/sm4-sve-ce-core.S
> @@ -0,0 +1,1028 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * SM4 Cipher Algorithm for ARMv9 Crypto Extensions with SVE2
> + * as specified in
> + * https://tools.ietf.org/id/draft-ribose-cfrg-sm4-10.html
> + *
> + * Copyright (C) 2022, Alibaba Group.
> + * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
> + */
> +
> +#include <linux/linkage.h>
> +#include <asm/assembler.h>
> +
> +.arch armv8-a+crypto+sve+sve2
> +
> +.irp b, 0, 15, 24, 25, 26, 27, 28, 29, 30, 31
> + .set .Lv\b\().4s, \b
> +.endr
> +
> +.irp b, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, \
> + 16, 24, 25, 26, 27, 28, 29, 30, 31
> + .set .Lz\b\().s, \b
> +.endr
> +
> +.macro sm4e, vd, vn
> + .inst 0xcec08400 | (.L\vn << 5) | .L\vd
> +.endm
> +
> +.macro sm4e_sve, zd, zn
> + .inst 0x4523e000 | (.L\zn << 5) | .L\zd
> +.endm
> +
> +
> +/* Register macros */
> +
> +#define RCTR z16
> +#define RCTRv v16
> +#define RIV z16
> +#define RIVv v16
> +#define RSWAP128 z17
> +#define RZERO z18
> +#define RLE128_INC z19
> +
> +#define RTMP0 z20
> +#define RTMP0v v20
> +#define RTMP1 z21
> +#define RTMP2 z22
> +#define RTMP3 z23
> +
> +
> +/* Helper macros. */
> +
> +#define SM4_PREPARE(ptr) \
> + adr_l x7, .Lbswap128_mask; \
> + ptrue p0.b, ALL; \
> + rdvl x5, #1; \
> + ld1b {RSWAP128.b}, p0/z, [x7]; \
> + \
> + ld1 {v24.16b-v27.16b}, [ptr], #64; \
> + ld1 {v28.16b-v31.16b}, [ptr]; \
> + dup z24.q, z24.q[0]; \
> + dup z25.q, z25.q[0]; \
> + dup z26.q, z26.q[0]; \
> + dup z27.q, z27.q[0]; \
> + dup z28.q, z28.q[0]; \
> + dup z29.q, z29.q[0]; \
> + dup z30.q, z30.q[0]; \
> + dup z31.q, z31.q[0];
> +
> +#define SM4_SVE_CE_CRYPT_BLK(b0) \
> + revb b0.s, p0/m, b0.s; \
> + sm4e_sve b0.s, z24.s; \
> + sm4e_sve b0.s, z25.s; \
> + sm4e_sve b0.s, z26.s; \
> + sm4e_sve b0.s, z27.s; \
> + sm4e_sve b0.s, z28.s; \
> + sm4e_sve b0.s, z29.s; \
> + sm4e_sve b0.s, z30.s; \
> + sm4e_sve b0.s, z31.s; \
> + tbl b0.b, {b0.b}, RSWAP128.b; \
> + revb b0.s, p0/m, b0.s;
> +
> +#define SM4_SVE_CE_CRYPT_BLK4(b0, b1, b2, b3) \
> + revb b0.s, p0/m, b0.s; \
> + revb b1.s, p0/m, b1.s; \
> + revb b2.s, p0/m, b2.s; \
> + revb b3.s, p0/m, b3.s; \
> + sm4e_sve b0.s, z24.s; \
> + sm4e_sve b1.s, z24.s; \
> + sm4e_sve b2.s, z24.s; \
> + sm4e_sve b3.s, z24.s; \
> + sm4e_sve b0.s, z25.s; \
> + sm4e_sve b1.s, z25.s; \
> + sm4e_sve b2.s, z25.s; \
> + sm4e_sve b3.s, z25.s; \
> + sm4e_sve b0.s, z26.s; \
> + sm4e_sve b1.s, z26.s; \
> + sm4e_sve b2.s, z26.s; \
> + sm4e_sve b3.s, z26.s; \
> + sm4e_sve b0.s, z27.s; \
> + sm4e_sve b1.s, z27.s; \
> + sm4e_sve b2.s, z27.s; \
> + sm4e_sve b3.s, z27.s; \
> + sm4e_sve b0.s, z28.s; \
> + sm4e_sve b1.s, z28.s; \
> + sm4e_sve b2.s, z28.s; \
> + sm4e_sve b3.s, z28.s; \
> + sm4e_sve b0.s, z29.s; \
> + sm4e_sve b1.s, z29.s; \
> + sm4e_sve b2.s, z29.s; \
> + sm4e_sve b3.s, z29.s; \
> + sm4e_sve b0.s, z30.s; \
> + sm4e_sve b1.s, z30.s; \
> + sm4e_sve b2.s, z30.s; \
> + sm4e_sve b3.s, z30.s; \
> + sm4e_sve b0.s, z31.s; \
> + sm4e_sve b1.s, z31.s; \
> + sm4e_sve b2.s, z31.s; \
> + sm4e_sve b3.s, z31.s; \
> + tbl b0.b, {b0.b}, RSWAP128.b; \
> + tbl b1.b, {b1.b}, RSWAP128.b; \
> + tbl b2.b, {b2.b}, RSWAP128.b; \
> + tbl b3.b, {b3.b}, RSWAP128.b; \
> + revb b0.s, p0/m, b0.s; \
> + revb b1.s, p0/m, b1.s; \
> + revb b2.s, p0/m, b2.s; \
> + revb b3.s, p0/m, b3.s;
> +
> +#define SM4_SVE_CE_CRYPT_BLK8(b0, b1, b2, b3, b4, b5, b6, b7) \
> + revb b0.s, p0/m, b0.s; \
> + revb b1.s, p0/m, b1.s; \
> + revb b2.s, p0/m, b2.s; \
> + revb b3.s, p0/m, b3.s; \
> + revb b4.s, p0/m, b4.s; \
> + revb b5.s, p0/m, b5.s; \
> + revb b6.s, p0/m, b6.s; \
> + revb b7.s, p0/m, b7.s; \
> + sm4e_sve b0.s, z24.s; \
> + sm4e_sve b1.s, z24.s; \
> + sm4e_sve b2.s, z24.s; \
> + sm4e_sve b3.s, z24.s; \
> + sm4e_sve b4.s, z24.s; \
> + sm4e_sve b5.s, z24.s; \
> + sm4e_sve b6.s, z24.s; \
> + sm4e_sve b7.s, z24.s; \
> + sm4e_sve b0.s, z25.s; \
> + sm4e_sve b1.s, z25.s; \
> + sm4e_sve b2.s, z25.s; \
> + sm4e_sve b3.s, z25.s; \
> + sm4e_sve b4.s, z25.s; \
> + sm4e_sve b5.s, z25.s; \
> + sm4e_sve b6.s, z25.s; \
> + sm4e_sve b7.s, z25.s; \
> + sm4e_sve b0.s, z26.s; \
> + sm4e_sve b1.s, z26.s; \
> + sm4e_sve b2.s, z26.s; \
> + sm4e_sve b3.s, z26.s; \
> + sm4e_sve b4.s, z26.s; \
> + sm4e_sve b5.s, z26.s; \
> + sm4e_sve b6.s, z26.s; \
> + sm4e_sve b7.s, z26.s; \
> + sm4e_sve b0.s, z27.s; \
> + sm4e_sve b1.s, z27.s; \
> + sm4e_sve b2.s, z27.s; \
> + sm4e_sve b3.s, z27.s; \
> + sm4e_sve b4.s, z27.s; \
> + sm4e_sve b5.s, z27.s; \
> + sm4e_sve b6.s, z27.s; \
> + sm4e_sve b7.s, z27.s; \
> + sm4e_sve b0.s, z28.s; \
> + sm4e_sve b1.s, z28.s; \
> + sm4e_sve b2.s, z28.s; \
> + sm4e_sve b3.s, z28.s; \
> + sm4e_sve b4.s, z28.s; \
> + sm4e_sve b5.s, z28.s; \
> + sm4e_sve b6.s, z28.s; \
> + sm4e_sve b7.s, z28.s; \
> + sm4e_sve b0.s, z29.s; \
> + sm4e_sve b1.s, z29.s; \
> + sm4e_sve b2.s, z29.s; \
> + sm4e_sve b3.s, z29.s; \
> + sm4e_sve b4.s, z29.s; \
> + sm4e_sve b5.s, z29.s; \
> + sm4e_sve b6.s, z29.s; \
> + sm4e_sve b7.s, z29.s; \
> + sm4e_sve b0.s, z30.s; \
> + sm4e_sve b1.s, z30.s; \
> + sm4e_sve b2.s, z30.s; \
> + sm4e_sve b3.s, z30.s; \
> + sm4e_sve b4.s, z30.s; \
> + sm4e_sve b5.s, z30.s; \
> + sm4e_sve b6.s, z30.s; \
> + sm4e_sve b7.s, z30.s; \
> + sm4e_sve b0.s, z31.s; \
> + sm4e_sve b1.s, z31.s; \
> + sm4e_sve b2.s, z31.s; \
> + sm4e_sve b3.s, z31.s; \
> + sm4e_sve b4.s, z31.s; \
> + sm4e_sve b5.s, z31.s; \
> + sm4e_sve b6.s, z31.s; \
> + sm4e_sve b7.s, z31.s; \
> + tbl b0.b, {b0.b}, RSWAP128.b; \
> + tbl b1.b, {b1.b}, RSWAP128.b; \
> + tbl b2.b, {b2.b}, RSWAP128.b; \
> + tbl b3.b, {b3.b}, RSWAP128.b; \
> + tbl b4.b, {b4.b}, RSWAP128.b; \
> + tbl b5.b, {b5.b}, RSWAP128.b; \
> + tbl b6.b, {b6.b}, RSWAP128.b; \
> + tbl b7.b, {b7.b}, RSWAP128.b; \
> + revb b0.s, p0/m, b0.s; \
> + revb b1.s, p0/m, b1.s; \
> + revb b2.s, p0/m, b2.s; \
> + revb b3.s, p0/m, b3.s; \
> + revb b4.s, p0/m, b4.s; \
> + revb b5.s, p0/m, b5.s; \
> + revb b6.s, p0/m, b6.s; \
> + revb b7.s, p0/m, b7.s;
> +
> +#define SM4_CE_CRYPT_BLK(b0) \
> + rev32 b0.16b, b0.16b; \
> + sm4e b0.4s, v24.4s; \
> + sm4e b0.4s, v25.4s; \
> + sm4e b0.4s, v26.4s; \
> + sm4e b0.4s, v27.4s; \
> + sm4e b0.4s, v28.4s; \
> + sm4e b0.4s, v29.4s; \
> + sm4e b0.4s, v30.4s; \
> + sm4e b0.4s, v31.4s; \
> + rev64 b0.4s, b0.4s; \
> + ext b0.16b, b0.16b, b0.16b, #8; \
> + rev32 b0.16b, b0.16b;
> +
> +#define inc_le128(zctr) \
> + mov RCTRv.d[1], x8; \
> + mov RCTRv.d[0], x7; \
> + mov zctr.d, RLE128_INC.d; \
> + dup RCTR.q, RCTR.q[0]; \
> + adds x8, x8, x5, LSR #4; \
> + adclt zctr.d, RCTR.d, RZERO.d; \
> + adclt RCTR.d, zctr.d, RZERO.d; \
> + adc x7, x7, xzr; \
> + trn1 zctr.d, RCTR.d, zctr.d; \
> + revb zctr.d, p0/m, zctr.d;
> +
> +#define inc_le128_4x(zctr0, zctr1, zctr2, zctr3) \
> + mov v8.d[1], x8; \
> + mov v8.d[0], x7; \
> + adds x8, x8, x5, LSR #4; \
> + mov zctr0.d, RLE128_INC.d; \
> + adc x7, x7, xzr; \
> + mov v9.d[1], x8; \
> + mov v9.d[0], x7; \
> + adds x8, x8, x5, LSR #4; \
> + mov zctr1.d, RLE128_INC.d; \
> + adc x7, x7, xzr; \
> + mov v10.d[1], x8; \
> + mov v10.d[0], x7; \
> + adds x8, x8, x5, LSR #4; \
> + mov zctr2.d, RLE128_INC.d; \
> + adc x7, x7, xzr; \
> + mov v11.d[1], x8; \
> + mov v11.d[0], x7; \
> + adds x8, x8, x5, LSR #4; \
> + mov zctr3.d, RLE128_INC.d; \
> + adc x7, x7, xzr; \
> + dup z8.q, z8.q[0]; \
> + dup z9.q, z9.q[0]; \
> + dup z10.q, z10.q[0]; \
> + dup z11.q, z11.q[0]; \
> + adclt zctr0.d, z8.d, RZERO.d; \
> + adclt zctr1.d, z9.d, RZERO.d; \
> + adclt zctr2.d, z10.d, RZERO.d; \
> + adclt zctr3.d, z11.d, RZERO.d; \
> + adclt z8.d, zctr0.d, RZERO.d; \
> + adclt z9.d, zctr1.d, RZERO.d; \
> + adclt z10.d, zctr2.d, RZERO.d; \
> + adclt z11.d, zctr3.d, RZERO.d; \
> + trn1 zctr0.d, z8.d, zctr0.d; \
> + trn1 zctr1.d, z9.d, zctr1.d; \
> + trn1 zctr2.d, z10.d, zctr2.d; \
> + trn1 zctr3.d, z11.d, zctr3.d; \
> + revb zctr0.d, p0/m, zctr0.d; \
> + revb zctr1.d, p0/m, zctr1.d; \
> + revb zctr2.d, p0/m, zctr2.d; \
> + revb zctr3.d, p0/m, zctr3.d;
> +
> +#define inc_le128_8x(zctr0, zctr1, zctr2, zctr3, \
> + zctr4, zctr5, zctr6, zctr7) \
> + mov v8.d[1], x8; \
> + mov v8.d[0], x7; \
> + adds x8, x8, x5, LSR #4; \
> + mov zctr0.d, RLE128_INC.d; \
> + adc x7, x7, xzr; \
> + mov v9.d[1], x8; \
> + mov v9.d[0], x7; \
> + adds x8, x8, x5, LSR #4; \
> + mov zctr1.d, RLE128_INC.d; \
> + adc x7, x7, xzr; \
> + mov v10.d[1], x8; \
> + mov v10.d[0], x7; \
> + adds x8, x8, x5, LSR #4; \
> + mov zctr2.d, RLE128_INC.d; \
> + adc x7, x7, xzr; \
> + mov v11.d[1], x8; \
> + mov v11.d[0], x7; \
> + adds x8, x8, x5, LSR #4; \
> + mov zctr3.d, RLE128_INC.d; \
> + adc x7, x7, xzr; \
> + mov v12.d[1], x8; \
> + mov v12.d[0], x7; \
> + adds x8, x8, x5, LSR #4; \
> + mov zctr4.d, RLE128_INC.d; \
> + adc x7, x7, xzr; \
> + mov v13.d[1], x8; \
> + mov v13.d[0], x7; \
> + adds x8, x8, x5, LSR #4; \
> + mov zctr5.d, RLE128_INC.d; \
> + adc x7, x7, xzr; \
> + mov v14.d[1], x8; \
> + mov v14.d[0], x7; \
> + adds x8, x8, x5, LSR #4; \
> + mov zctr6.d, RLE128_INC.d; \
> + adc x7, x7, xzr; \
> + mov v15.d[1], x8; \
> + mov v15.d[0], x7; \
> + adds x8, x8, x5, LSR #4; \
> + mov zctr7.d, RLE128_INC.d; \
> + adc x7, x7, xzr; \
> + dup z8.q, z8.q[0]; \
> + dup z9.q, z9.q[0]; \
> + dup z10.q, z10.q[0]; \
> + dup z11.q, z11.q[0]; \
> + dup z12.q, z12.q[0]; \
> + dup z13.q, z13.q[0]; \
> + dup z14.q, z14.q[0]; \
> + dup z15.q, z15.q[0]; \
> + adclt zctr0.d, z8.d, RZERO.d; \
> + adclt zctr1.d, z9.d, RZERO.d; \
> + adclt zctr2.d, z10.d, RZERO.d; \
> + adclt zctr3.d, z11.d, RZERO.d; \
> + adclt zctr4.d, z12.d, RZERO.d; \
> + adclt zctr5.d, z13.d, RZERO.d; \
> + adclt zctr6.d, z14.d, RZERO.d; \
> + adclt zctr7.d, z15.d, RZERO.d; \
> + adclt z8.d, zctr0.d, RZERO.d; \
> + adclt z9.d, zctr1.d, RZERO.d; \
> + adclt z10.d, zctr2.d, RZERO.d; \
> + adclt z11.d, zctr3.d, RZERO.d; \
> + adclt z12.d, zctr4.d, RZERO.d; \
> + adclt z13.d, zctr5.d, RZERO.d; \
> + adclt z14.d, zctr6.d, RZERO.d; \
> + adclt z15.d, zctr7.d, RZERO.d; \
> + trn1 zctr0.d, z8.d, zctr0.d; \
> + trn1 zctr1.d, z9.d, zctr1.d; \
> + trn1 zctr2.d, z10.d, zctr2.d; \
> + trn1 zctr3.d, z11.d, zctr3.d; \
> + trn1 zctr4.d, z12.d, zctr4.d; \
> + trn1 zctr5.d, z13.d, zctr5.d; \
> + trn1 zctr6.d, z14.d, zctr6.d; \
> + trn1 zctr7.d, z15.d, zctr7.d; \
> + revb zctr0.d, p0/m, zctr0.d; \
> + revb zctr1.d, p0/m, zctr1.d; \
> + revb zctr2.d, p0/m, zctr2.d; \
> + revb zctr3.d, p0/m, zctr3.d; \
> + revb zctr4.d, p0/m, zctr4.d; \
> + revb zctr5.d, p0/m, zctr5.d; \
> + revb zctr6.d, p0/m, zctr6.d; \
> + revb zctr7.d, p0/m, zctr7.d;
> +
> +
> +.align 3
> +SYM_FUNC_START(sm4_sve_ce_crypt)
> + /* input:
> + * x0: round key array, CTX
> + * x1: dst
> + * x2: src
> + * w3: nblocks
> + */
> + uxtw x3, w3
> + SM4_PREPARE(x0)
> +
> +.Lcrypt_loop_8x:
> + sub x3, x3, x5, LSR #1 /* x3 - (8 * VL) */
> + tbnz x3, #63, .Lcrypt_4x
> +
> + ld1b {z0.b}, p0/z, [x2]
> + ld1b {z1.b}, p0/z, [x2, #1, MUL VL]
> + ld1b {z2.b}, p0/z, [x2, #2, MUL VL]
> + ld1b {z3.b}, p0/z, [x2, #3, MUL VL]
> + ld1b {z4.b}, p0/z, [x2, #4, MUL VL]
> + ld1b {z5.b}, p0/z, [x2, #5, MUL VL]
> + ld1b {z6.b}, p0/z, [x2, #6, MUL VL]
> + ld1b {z7.b}, p0/z, [x2, #7, MUL VL]
> +
> + SM4_SVE_CE_CRYPT_BLK8(z0, z1, z2, z3, z4, z5, z6, z7)
> +
> + st1b {z0.b}, p0, [x1]
> + st1b {z1.b}, p0, [x1, #1, MUL VL]
> + st1b {z2.b}, p0, [x1, #2, MUL VL]
> + st1b {z3.b}, p0, [x1, #3, MUL VL]
> + st1b {z4.b}, p0, [x1, #4, MUL VL]
> + st1b {z5.b}, p0, [x1, #5, MUL VL]
> + st1b {z6.b}, p0, [x1, #6, MUL VL]
> + st1b {z7.b}, p0, [x1, #7, MUL VL]
> +
> + addvl x2, x2, #8
> + addvl x1, x1, #8
> +
> + cbz x3, .Lcrypt_end
> + b .Lcrypt_loop_8x
> +
> +.Lcrypt_4x:
> + add x3, x3, x5, LSR #1
> + cmp x3, x5, LSR #2
> + blt .Lcrypt_loop_1x
> +
> + sub x3, x3, x5, LSR #2 /* x3 - (4 * VL) */
> +
> + ld1b {z0.b}, p0/z, [x2]
> + ld1b {z1.b}, p0/z, [x2, #1, MUL VL]
> + ld1b {z2.b}, p0/z, [x2, #2, MUL VL]
> + ld1b {z3.b}, p0/z, [x2, #3, MUL VL]
> +
> + SM4_SVE_CE_CRYPT_BLK4(z0, z1, z2, z3)
> +
> + st1b {z0.b}, p0, [x1]
> + st1b {z1.b}, p0, [x1, #1, MUL VL]
> + st1b {z2.b}, p0, [x1, #2, MUL VL]
> + st1b {z3.b}, p0, [x1, #3, MUL VL]
> +
> + addvl x2, x2, #4
> + addvl x1, x1, #4
> +
> + cbz x3, .Lcrypt_end
> +
> +.Lcrypt_loop_1x:
> + cmp x3, x5, LSR #4
> + blt .Lcrypt_ce_loop_1x
> +
> + sub x3, x3, x5, LSR #4 /* x3 - VL */
> +
> + ld1b {z0.b}, p0/z, [x2]
> +
> + SM4_SVE_CE_CRYPT_BLK(z0)
> +
> + st1b {z0.b}, p0, [x1]
> +
> + addvl x2, x2, #1
> + addvl x1, x1, #1
> +
> + cbz x3, .Lcrypt_end
> + b .Lcrypt_loop_1x
> +
> +.Lcrypt_ce_loop_1x:
> + sub x3, x3, #1
> +
> + ld1 {v0.16b}, [x2], #16
> + SM4_CE_CRYPT_BLK(v0)
> + st1 {v0.16b}, [x1], #16
> +
> + cbnz x3, .Lcrypt_ce_loop_1x
> +
> +.Lcrypt_end:
> + ret
> +SYM_FUNC_END(sm4_sve_ce_crypt)
> +
> +.align 3
> +SYM_FUNC_START(sm4_sve_ce_cbc_dec)
> + /* input:
> + * x0: round key array, CTX
> + * x1: dst
> + * x2: src
> + * x3: iv (big endian, 128 bit)
> + * w4: nblocks
> + */
> + uxtw x4, w4
> + SM4_PREPARE(x0)
> +
> + ld1 {RIVv.16b}, [x3]
> + ext RIV.b, RIV.b, RIV.b, #16
> +
> +.Lcbc_dec_loop_8x:
> + sub x4, x4, x5, LSR #1 /* x4 - (8 * VL) */
> + tbnz x4, #63, .Lcbc_dec_4x
> +
> + ld1b {z15.b}, p0/z, [x2]
> + ld1b {z14.b}, p0/z, [x2, #1, MUL VL]
> + ld1b {z13.b}, p0/z, [x2, #2, MUL VL]
> + ld1b {z12.b}, p0/z, [x2, #3, MUL VL]
> + ld1b {z11.b}, p0/z, [x2, #4, MUL VL]
> + ld1b {z10.b}, p0/z, [x2, #5, MUL VL]
> + ld1b {z9.b}, p0/z, [x2, #6, MUL VL]
> + ld1b {z8.b}, p0/z, [x2, #7, MUL VL]
> + rev z0.b, z15.b
> + rev z1.b, z14.b
> + rev z2.b, z13.b
> + rev z3.b, z12.b
> + rev z4.b, z11.b
> + rev z5.b, z10.b
> + rev z6.b, z9.b
> + rev z7.b, z8.b
> + rev RTMP0.b, RIV.b
> + ext z7.b, z7.b, z6.b, #16
> + ext z6.b, z6.b, z5.b, #16
> + ext z5.b, z5.b, z4.b, #16
> + ext z4.b, z4.b, z3.b, #16
> + ext z3.b, z3.b, z2.b, #16
> + ext z2.b, z2.b, z1.b, #16
> + ext z1.b, z1.b, z0.b, #16
> + ext z0.b, z0.b, RTMP0.b, #16
> + rev z7.b, z7.b
> + rev z6.b, z6.b
> + rev z5.b, z5.b
> + rev z4.b, z4.b
> + rev z3.b, z3.b
> + rev z2.b, z2.b
> + rev z1.b, z1.b
> + rev z0.b, z0.b
> + mov RIV.d, z8.d
> +
> + SM4_SVE_CE_CRYPT_BLK8(z15, z14, z13, z12, z11, z10, z9, z8)
> +
> + eor z0.d, z0.d, z15.d
> + eor z1.d, z1.d, z14.d
> + eor z2.d, z2.d, z13.d
> + eor z3.d, z3.d, z12.d
> + eor z4.d, z4.d, z11.d
> + eor z5.d, z5.d, z10.d
> + eor z6.d, z6.d, z9.d
> + eor z7.d, z7.d, z8.d
> + st1b {z0.b}, p0, [x1]
> + st1b {z1.b}, p0, [x1, #1, MUL VL]
> + st1b {z2.b}, p0, [x1, #2, MUL VL]
> + st1b {z3.b}, p0, [x1, #3, MUL VL]
> + st1b {z4.b}, p0, [x1, #4, MUL VL]
> + st1b {z5.b}, p0, [x1, #5, MUL VL]
> + st1b {z6.b}, p0, [x1, #6, MUL VL]
> + st1b {z7.b}, p0, [x1, #7, MUL VL]
> +
> + addvl x2, x2, #8
> + addvl x1, x1, #8
> +
> + cbz x4, .Lcbc_dec_end
> + b .Lcbc_dec_loop_8x
> +
> +.Lcbc_dec_4x:
> + add x4, x4, x5, LSR #1
> + cmp x4, x5, LSR #2
> + blt .Lcbc_dec_loop_1x
> +
> + sub x4, x4, x5, LSR #2 /* x4 - (4 * VL) */
> +
> + ld1b {z15.b}, p0/z, [x2]
> + ld1b {z14.b}, p0/z, [x2, #1, MUL VL]
> + ld1b {z13.b}, p0/z, [x2, #2, MUL VL]
> + ld1b {z12.b}, p0/z, [x2, #3, MUL VL]
> + rev z0.b, z15.b
> + rev z1.b, z14.b
> + rev z2.b, z13.b
> + rev z3.b, z12.b
> + rev RTMP0.b, RIV.b
> + ext z3.b, z3.b, z2.b, #16
> + ext z2.b, z2.b, z1.b, #16
> + ext z1.b, z1.b, z0.b, #16
> + ext z0.b, z0.b, RTMP0.b, #16
> + rev z3.b, z3.b
> + rev z2.b, z2.b
> + rev z1.b, z1.b
> + rev z0.b, z0.b
> + mov RIV.d, z12.d
> +
> + SM4_SVE_CE_CRYPT_BLK4(z15, z14, z13, z12)
> +
> + eor z0.d, z0.d, z15.d
> + eor z1.d, z1.d, z14.d
> + eor z2.d, z2.d, z13.d
> + eor z3.d, z3.d, z12.d
> + st1b {z0.b}, p0, [x1]
> + st1b {z1.b}, p0, [x1, #1, MUL VL]
> + st1b {z2.b}, p0, [x1, #2, MUL VL]
> + st1b {z3.b}, p0, [x1, #3, MUL VL]
> +
> + addvl x2, x2, #4
> + addvl x1, x1, #4
> +
> + cbz x4, .Lcbc_dec_end
> +
> +.Lcbc_dec_loop_1x:
> + cmp x4, x5, LSR #4
> + blt .Lcbc_dec_ce
> +
> + sub x4, x4, x5, LSR #4 /* x4 - VL */
> +
> + ld1b {z15.b}, p0/z, [x2]
> + rev RTMP0.b, RIV.b
> + rev z0.b, z15.b
> + ext z0.b, z0.b, RTMP0.b, #16
> + rev z0.b, z0.b
> + mov RIV.d, z15.d
> +
> + SM4_SVE_CE_CRYPT_BLK(z15)
> +
> + eor z0.d, z0.d, z15.d
> + st1b {z0.b}, p0, [x1]
> +
> + addvl x2, x2, #1
> + addvl x1, x1, #1
> +
> + cbz x4, .Lcbc_dec_end
> + b .Lcbc_dec_loop_1x
> +
> +.Lcbc_dec_ce:
> + rev RIV.s, RIV.s
> + tbl RIV.b, {RIV.b}, RSWAP128.b
> +
> +.Lcbc_dec_ce_loop_1x:
> + sub x4, x4, #1
> +
> + ld1 {v15.16b}, [x2], #16
> + mov v0.16b, RIVv.16b
> + mov RIVv.16b, v15.16b
> + SM4_CE_CRYPT_BLK(v15)
> + eor v0.16b, v0.16b, v15.16b
> + st1 {v0.16b}, [x1], #16
> +
> + cbnz x4, .Lcbc_dec_ce_loop_1x
> +
> + ext RIV.b, RIV.b, RIV.b, #16
> +
> +.Lcbc_dec_end:
> + /* store new IV */
> + rev RIV.s, RIV.s
> + tbl RIV.b, {RIV.b}, RSWAP128.b
> + st1 {RIVv.16b}, [x3]
> +
> + ret
> +SYM_FUNC_END(sm4_sve_ce_cbc_dec)
> +
> +.align 3
> +SYM_FUNC_START(sm4_sve_ce_cfb_dec)
> + /* input:
> + * x0: round key array, CTX
> + * x1: dst
> + * x2: src
> + * x3: iv (big endian, 128 bit)
> + * w4: nblocks
> + */
> + uxtw x4, w4
> + SM4_PREPARE(x0)
> +
> + ld1 {RIVv.16b}, [x3]
> + ext RIV.b, RIV.b, RIV.b, #16
> +
> +.Lcfb_dec_loop_8x:
> + sub x4, x4, x5, LSR #1 /* x4 - (8 * VL) */
> + tbnz x4, #63, .Lcfb_dec_4x
> +
> + ld1b {z15.b}, p0/z, [x2]
> + ld1b {z14.b}, p0/z, [x2, #1, MUL VL]
> + ld1b {z13.b}, p0/z, [x2, #2, MUL VL]
> + ld1b {z12.b}, p0/z, [x2, #3, MUL VL]
> + ld1b {z11.b}, p0/z, [x2, #4, MUL VL]
> + ld1b {z10.b}, p0/z, [x2, #5, MUL VL]
> + ld1b {z9.b}, p0/z, [x2, #6, MUL VL]
> + ld1b {z8.b}, p0/z, [x2, #7, MUL VL]
> + rev z0.b, z15.b
> + rev z1.b, z14.b
> + rev z2.b, z13.b
> + rev z3.b, z12.b
> + rev z4.b, z11.b
> + rev z5.b, z10.b
> + rev z6.b, z9.b
> + rev z7.b, z8.b
> + rev RTMP0.b, RIV.b
> + ext z7.b, z7.b, z6.b, #16
> + ext z6.b, z6.b, z5.b, #16
> + ext z5.b, z5.b, z4.b, #16
> + ext z4.b, z4.b, z3.b, #16
> + ext z3.b, z3.b, z2.b, #16
> + ext z2.b, z2.b, z1.b, #16
> + ext z1.b, z1.b, z0.b, #16
> + ext z0.b, z0.b, RTMP0.b, #16
> + rev z7.b, z7.b
> + rev z6.b, z6.b
> + rev z5.b, z5.b
> + rev z4.b, z4.b
> + rev z3.b, z3.b
> + rev z2.b, z2.b
> + rev z1.b, z1.b
> + rev z0.b, z0.b
> + mov RIV.d, z8.d
> +
> + SM4_SVE_CE_CRYPT_BLK8(z0, z1, z2, z3, z4, z5, z6, z7)
> +
> + eor z0.d, z0.d, z15.d
> + eor z1.d, z1.d, z14.d
> + eor z2.d, z2.d, z13.d
> + eor z3.d, z3.d, z12.d
> + eor z4.d, z4.d, z11.d
> + eor z5.d, z5.d, z10.d
> + eor z6.d, z6.d, z9.d
> + eor z7.d, z7.d, z8.d
> + st1b {z0.b}, p0, [x1]
> + st1b {z1.b}, p0, [x1, #1, MUL VL]
> + st1b {z2.b}, p0, [x1, #2, MUL VL]
> + st1b {z3.b}, p0, [x1, #3, MUL VL]
> + st1b {z4.b}, p0, [x1, #4, MUL VL]
> + st1b {z5.b}, p0, [x1, #5, MUL VL]
> + st1b {z6.b}, p0, [x1, #6, MUL VL]
> + st1b {z7.b}, p0, [x1, #7, MUL VL]
> +
> + addvl x2, x2, #8
> + addvl x1, x1, #8
> +
> + cbz x4, .Lcfb_dec_end
> + b .Lcfb_dec_loop_8x
> +
> +.Lcfb_dec_4x:
> + add x4, x4, x5, LSR #1
> + cmp x4, x5, LSR #2
> + blt .Lcfb_dec_loop_1x
> +
> + sub x4, x4, x5, LSR #2 /* x4 - (4 * VL) */
> +
> + ld1b {z15.b}, p0/z, [x2]
> + ld1b {z14.b}, p0/z, [x2, #1, MUL VL]
> + ld1b {z13.b}, p0/z, [x2, #2, MUL VL]
> + ld1b {z12.b}, p0/z, [x2, #3, MUL VL]
> + rev z0.b, z15.b
> + rev z1.b, z14.b
> + rev z2.b, z13.b
> + rev z3.b, z12.b
> + rev RTMP0.b, RIV.b
> + ext z3.b, z3.b, z2.b, #16
> + ext z2.b, z2.b, z1.b, #16
> + ext z1.b, z1.b, z0.b, #16
> + ext z0.b, z0.b, RTMP0.b, #16
> + rev z3.b, z3.b
> + rev z2.b, z2.b
> + rev z1.b, z1.b
> + rev z0.b, z0.b
> + mov RIV.d, z12.d
> +
> + SM4_SVE_CE_CRYPT_BLK4(z0, z1, z2, z3)
> +
> + eor z0.d, z0.d, z15.d
> + eor z1.d, z1.d, z14.d
> + eor z2.d, z2.d, z13.d
> + eor z3.d, z3.d, z12.d
> + st1b {z0.b}, p0, [x1]
> + st1b {z1.b}, p0, [x1, #1, MUL VL]
> + st1b {z2.b}, p0, [x1, #2, MUL VL]
> + st1b {z3.b}, p0, [x1, #3, MUL VL]
> +
> + addvl x2, x2, #4
> + addvl x1, x1, #4
> +
> + cbz x4, .Lcfb_dec_end
> +
> +.Lcfb_dec_loop_1x:
> + cmp x4, x5, LSR #4
> + blt .Lcfb_dec_ce
> +
> + sub x4, x4, x5, LSR #4 /* x4 - VL */
> +
> + ld1b {z15.b}, p0/z, [x2]
> + rev RTMP0.b, RIV.b
> + rev z0.b, z15.b
> + ext z0.b, z0.b, RTMP0.b, #16
> + rev z0.b, z0.b
> + mov RIV.d, z15.d
> +
> + SM4_SVE_CE_CRYPT_BLK(z0)
> +
> + eor z0.d, z0.d, z15.d
> + st1b {z0.b}, p0, [x1]
> +
> + addvl x2, x2, #1
> + addvl x1, x1, #1
> +
> + cbz x4, .Lcfb_dec_end
> + b .Lcfb_dec_loop_1x
> +
> +.Lcfb_dec_ce:
> + rev RIV.s, RIV.s
> + tbl RIV.b, {RIV.b}, RSWAP128.b
> +
> +.Lcfb_dec_ce_loop_1x:
> + sub x4, x4, #1
> +
> + ld1 {v15.16b}, [x2], #16
> + mov v0.16b, RIVv.16b
> + mov RIVv.16b, v15.16b
> + SM4_CE_CRYPT_BLK(v0)
> + eor v0.16b, v0.16b, v15.16b
> + st1 {v0.16b}, [x1], #16
> +
> + cbnz x4, .Lcfb_dec_ce_loop_1x
> +
> + ext RIV.b, RIV.b, RIV.b, #16
> +
> +.Lcfb_dec_end:
> + /* store new IV */
> + rev RIV.s, RIV.s
> + tbl RIV.b, {RIV.b}, RSWAP128.b
> + st1 {RIVv.16b}, [x3]
> +
> + ret
> +SYM_FUNC_END(sm4_sve_ce_cfb_dec)
> +
> +.align 3
> +SYM_FUNC_START(sm4_sve_ce_ctr_crypt)
> + /* input:
> + * x0: round key array, CTX
> + * x1: dst
> + * x2: src
> + * x3: ctr (big endian, 128 bit)
> + * w4: nblocks
> + */
> + uxtw x4, w4
> + SM4_PREPARE(x0)
> +
> + dup RZERO.d, #0
> + adr_l x6, .Lle128_inc
> + ld1b {RLE128_INC.b}, p0/z, [x6]
> +
> + ldp x7, x8, [x3]
> + rev x7, x7
> + rev x8, x8
> +
> +.Lctr_loop_8x:
> + sub x4, x4, x5, LSR #1 /* x4 - (8 * VL) */
> + tbnz x4, #63, .Lctr_4x
> +
> + inc_le128_8x(z0, z1, z2, z3, z4, z5, z6, z7)
> +
> + ld1b {z8.b}, p0/z, [x2]
> + ld1b {z9.b}, p0/z, [x2, #1, MUL VL]
> + ld1b {z10.b}, p0/z, [x2, #2, MUL VL]
> + ld1b {z11.b}, p0/z, [x2, #3, MUL VL]
> + ld1b {z12.b}, p0/z, [x2, #4, MUL VL]
> + ld1b {z13.b}, p0/z, [x2, #5, MUL VL]
> + ld1b {z14.b}, p0/z, [x2, #6, MUL VL]
> + ld1b {z15.b}, p0/z, [x2, #7, MUL VL]
> +
> + SM4_SVE_CE_CRYPT_BLK8(z0, z1, z2, z3, z4, z5, z6, z7)
> +
> + eor z0.d, z0.d, z8.d
> + eor z1.d, z1.d, z9.d
> + eor z2.d, z2.d, z10.d
> + eor z3.d, z3.d, z11.d
> + eor z4.d, z4.d, z12.d
> + eor z5.d, z5.d, z13.d
> + eor z6.d, z6.d, z14.d
> + eor z7.d, z7.d, z15.d
> +
> + st1b {z0.b}, p0, [x1]
> + st1b {z1.b}, p0, [x1, #1, MUL VL]
> + st1b {z2.b}, p0, [x1, #2, MUL VL]
> + st1b {z3.b}, p0, [x1, #3, MUL VL]
> + st1b {z4.b}, p0, [x1, #4, MUL VL]
> + st1b {z5.b}, p0, [x1, #5, MUL VL]
> + st1b {z6.b}, p0, [x1, #6, MUL VL]
> + st1b {z7.b}, p0, [x1, #7, MUL VL]
> +
> + addvl x2, x2, #8
> + addvl x1, x1, #8
> +
> + cbz x4, .Lctr_end
> + b .Lctr_loop_8x
> +
> +.Lctr_4x:
> + add x4, x4, x5, LSR #1
> + cmp x4, x5, LSR #2
> + blt .Lctr_loop_1x
> +
> + sub x4, x4, x5, LSR #2 /* x4 - (4 * VL) */
> +
> + inc_le128_4x(z0, z1, z2, z3)
> +
> + ld1b {z8.b}, p0/z, [x2]
> + ld1b {z9.b}, p0/z, [x2, #1, MUL VL]
> + ld1b {z10.b}, p0/z, [x2, #2, MUL VL]
> + ld1b {z11.b}, p0/z, [x2, #3, MUL VL]
> +
> + SM4_SVE_CE_CRYPT_BLK4(z0, z1, z2, z3)
> +
> + eor z0.d, z0.d, z8.d
> + eor z1.d, z1.d, z9.d
> + eor z2.d, z2.d, z10.d
> + eor z3.d, z3.d, z11.d
> +
> + st1b {z0.b}, p0, [x1]
> + st1b {z1.b}, p0, [x1, #1, MUL VL]
> + st1b {z2.b}, p0, [x1, #2, MUL VL]
> + st1b {z3.b}, p0, [x1, #3, MUL VL]
> +
> + addvl x2, x2, #4
> + addvl x1, x1, #4
> +
> + cbz x4, .Lctr_end
> +
> +.Lctr_loop_1x:
> + cmp x4, x5, LSR #4
> + blt .Lctr_ce_loop_1x
> +
> + sub x4, x4, x5, LSR #4 /* x4 - VL */
> +
> + inc_le128(z0)
> + ld1b {z8.b}, p0/z, [x2]
> +
> + SM4_SVE_CE_CRYPT_BLK(z0)
> +
> + eor z0.d, z0.d, z8.d
> + st1b {z0.b}, p0, [x1]
> +
> + addvl x2, x2, #1
> + addvl x1, x1, #1
> +
> + cbz x4, .Lctr_end
> + b .Lctr_loop_1x
> +
> +.Lctr_ce_loop_1x:
> + sub x4, x4, #1
> +
> + /* inc_le128 for CE */
> + mov v0.d[1], x8
> + mov v0.d[0], x7
> + adds x8, x8, #1
> + rev64 v0.16b, v0.16b
> + adc x7, x7, xzr
> +
> + ld1 {v8.16b}, [x2], #16
> +
> + SM4_CE_CRYPT_BLK(v0)
> +
> + eor v0.16b, v0.16b, v8.16b
> + st1 {v0.16b}, [x1], #16
> +
> + cbnz x4, .Lctr_ce_loop_1x
> +
> +.Lctr_end:
> + /* store new CTR */
> + rev x7, x7
> + rev x8, x8
> + stp x7, x8, [x3]
> +
> + ret
> +SYM_FUNC_END(sm4_sve_ce_ctr_crypt)
> +
> +.align 3
> +SYM_FUNC_START(sm4_sve_get_vl)
> + /* VL in bytes */
> + rdvl x0, #1
> +
> + ret
> +SYM_FUNC_END(sm4_sve_get_vl)
> +
> +
> + .section ".rodata", "a"
> + .align 4
> +.Lbswap128_mask:
> + .byte 0x0c, 0x0d, 0x0e, 0x0f, 0x08, 0x09, 0x0a, 0x0b
> + .byte 0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03
> + .byte 0x1c, 0x1d, 0x1e, 0x1f, 0x18, 0x19, 0x1a, 0x1b
> + .byte 0x14, 0x15, 0x16, 0x17, 0x10, 0x11, 0x12, 0x13
> + .byte 0x2c, 0x2d, 0x2e, 0x2f, 0x28, 0x29, 0x2a, 0x2b
> + .byte 0x24, 0x25, 0x26, 0x27, 0x20, 0x21, 0x22, 0x23
> + .byte 0x3c, 0x3d, 0x3e, 0x3f, 0x38, 0x39, 0x3a, 0x3b
> + .byte 0x34, 0x35, 0x36, 0x37, 0x30, 0x31, 0x32, 0x33
> + .byte 0x4c, 0x4d, 0x4e, 0x4f, 0x48, 0x49, 0x4a, 0x4b
> + .byte 0x44, 0x45, 0x46, 0x47, 0x40, 0x41, 0x42, 0x43
> + .byte 0x5c, 0x5d, 0x5e, 0x5f, 0x58, 0x59, 0x5a, 0x5b
> + .byte 0x54, 0x55, 0x56, 0x57, 0x50, 0x51, 0x52, 0x53
> + .byte 0x6c, 0x6d, 0x6e, 0x6f, 0x68, 0x69, 0x6a, 0x6b
> + .byte 0x64, 0x65, 0x66, 0x67, 0x60, 0x61, 0x62, 0x63
> + .byte 0x7c, 0x7d, 0x7e, 0x7f, 0x78, 0x79, 0x7a, 0x7b
> + .byte 0x74, 0x75, 0x76, 0x77, 0x70, 0x71, 0x72, 0x73
> + .byte 0x8c, 0x8d, 0x8e, 0x8f, 0x88, 0x89, 0x8a, 0x8b
> + .byte 0x84, 0x85, 0x86, 0x87, 0x80, 0x81, 0x82, 0x83
> + .byte 0x9c, 0x9d, 0x9e, 0x9f, 0x98, 0x99, 0x9a, 0x9b
> + .byte 0x94, 0x95, 0x96, 0x97, 0x90, 0x91, 0x92, 0x93
> + .byte 0xac, 0xad, 0xae, 0xaf, 0xa8, 0xa9, 0xaa, 0xab
> + .byte 0xa4, 0xa5, 0xa6, 0xa7, 0xa0, 0xa1, 0xa2, 0xa3
> + .byte 0xbc, 0xbd, 0xbe, 0xbf, 0xb8, 0xb9, 0xba, 0xbb
> + .byte 0xb4, 0xb5, 0xb6, 0xb7, 0xb0, 0xb1, 0xb2, 0xb3
> + .byte 0xcc, 0xcd, 0xce, 0xcf, 0xc8, 0xc9, 0xca, 0xcb
> + .byte 0xc4, 0xc5, 0xc6, 0xc7, 0xc0, 0xc1, 0xc2, 0xc3
> + .byte 0xdc, 0xdd, 0xde, 0xdf, 0xd8, 0xd9, 0xda, 0xdb
> + .byte 0xd4, 0xd5, 0xd6, 0xd7, 0xd0, 0xd1, 0xd2, 0xd3
> + .byte 0xec, 0xed, 0xee, 0xef, 0xe8, 0xe9, 0xea, 0xeb
> + .byte 0xe4, 0xe5, 0xe6, 0xe7, 0xe0, 0xe1, 0xe2, 0xe3
> + .byte 0xfc, 0xfd, 0xfe, 0xff, 0xf8, 0xf9, 0xfa, 0xfb
> + .byte 0xf4, 0xf5, 0xf6, 0xf7, 0xf0, 0xf1, 0xf2, 0xf3
> +
> +.Lle128_inc:
> + .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x05, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x06, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x07, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x09, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x0a, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x0b, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x0c, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x0d, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x0e, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x0f, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> + .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> diff --git a/arch/arm64/crypto/sm4-sve-ce-glue.c b/arch/arm64/crypto/sm4-sve-ce-glue.c
> new file mode 100644
> index 000000000000..fc797b72b5f0
> --- /dev/null
> +++ b/arch/arm64/crypto/sm4-sve-ce-glue.c
> @@ -0,0 +1,332 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * SM4 Cipher Algorithm, using ARMv9 Crypto Extensions with SVE2
> + * as specified in
> + * https://tools.ietf.org/id/draft-ribose-cfrg-sm4-10.html
> + *
> + * Copyright (C) 2022, Alibaba Group.
> + * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
> + */
> +
> +#include <linux/module.h>
> +#include <linux/crypto.h>
> +#include <linux/kernel.h>
> +#include <linux/cpufeature.h>
> +#include <asm/neon.h>
> +#include <asm/simd.h>
> +#include <crypto/internal/simd.h>
> +#include <crypto/internal/skcipher.h>
> +#include <crypto/sm4.h>
> +#include "sm4-ce.h"
> +
> +asmlinkage void sm4_sve_ce_crypt(const u32 *rkey, u8 *dst,
> + const u8 *src, unsigned int nblocks);
> +asmlinkage void sm4_sve_ce_cbc_dec(const u32 *rkey_dec, u8 *dst,
> + const u8 *src, u8 *iv,
> + unsigned int nblocks);
> +asmlinkage void sm4_sve_ce_cfb_dec(const u32 *rkey_enc, u8 *dst,
> + const u8 *src, u8 *iv,
> + unsigned int nblocks);
> +asmlinkage void sm4_sve_ce_ctr_crypt(const u32 *rkey_enc, u8 *dst,
> + const u8 *src, u8 *iv,
> + unsigned int nblocks);
> +asmlinkage unsigned int sm4_sve_get_vl(void);
> +
> +
> +static int sm4_setkey(struct crypto_skcipher *tfm, const u8 *key,
> + unsigned int key_len)
> +{
> + struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
> +
> + if (key_len != SM4_KEY_SIZE)
> + return -EINVAL;
> +
> + kernel_neon_begin();
> + sm4_ce_expand_key(key, ctx->rkey_enc, ctx->rkey_dec,
> + crypto_sm4_fk, crypto_sm4_ck);
> + kernel_neon_end();
> +
> + return 0;
> +}
> +
> +static int ecb_crypt(struct skcipher_request *req, const u32 *rkey)
> +{
> + struct skcipher_walk walk;
> + unsigned int nbytes;
> + int err;
> +
> + err = skcipher_walk_virt(&walk, req, false);
> +
> + while ((nbytes = walk.nbytes) > 0) {
> + const u8 *src = walk.src.virt.addr;
> + u8 *dst = walk.dst.virt.addr;
> + unsigned int nblocks;
> +
> + nblocks = nbytes / SM4_BLOCK_SIZE;
> + if (nblocks) {
> + kernel_neon_begin();
> +
> + sm4_sve_ce_crypt(rkey, dst, src, nblocks);
> +
> + kernel_neon_end();
> + }
> +
> + err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE);
> + }
> +
> + return err;
> +}
> +
> +static int ecb_encrypt(struct skcipher_request *req)
> +{
> + struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> + struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
> +
> + return ecb_crypt(req, ctx->rkey_enc);
> +}
> +
> +static int ecb_decrypt(struct skcipher_request *req)
> +{
> + struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> + struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
> +
> + return ecb_crypt(req, ctx->rkey_dec);
> +}
> +
> +static int cbc_crypt(struct skcipher_request *req, const u32 *rkey,
> + void (*sm4_cbc_crypt)(const u32 *rkey, u8 *dst,
> + const u8 *src, u8 *iv, unsigned int nblocks))
> +{
> + struct skcipher_walk walk;
> + unsigned int nbytes;
> + int err;
> +
> + err = skcipher_walk_virt(&walk, req, false);
> +
> + while ((nbytes = walk.nbytes) > 0) {
> + const u8 *src = walk.src.virt.addr;
> + u8 *dst = walk.dst.virt.addr;
> + unsigned int nblocks;
> +
> + nblocks = nbytes / SM4_BLOCK_SIZE;
> + if (nblocks) {
> + kernel_neon_begin();
> +
> + sm4_cbc_crypt(rkey, dst, src, walk.iv, nblocks);
> +
> + kernel_neon_end();
> + }
> +
> + err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE);
> + }
> +
> + return err;
> +}
> +
> +static int cbc_encrypt(struct skcipher_request *req)
> +{
> + struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> + struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
> +
> + return cbc_crypt(req, ctx->rkey_enc, sm4_ce_cbc_enc);
> +}
> +
> +static int cbc_decrypt(struct skcipher_request *req)
> +{
> + struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> + struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
> +
> + return cbc_crypt(req, ctx->rkey_dec, sm4_sve_ce_cbc_dec);
> +}
> +
> +static int cfb_crypt(struct skcipher_request *req,
> + void (*sm4_cfb_crypt)(const u32 *rkey, u8 *dst,
> + const u8 *src, u8 *iv, unsigned int nblocks))
> +{
> + struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> + struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
> + struct skcipher_walk walk;
> + unsigned int nbytes;
> + int err;
> +
> + err = skcipher_walk_virt(&walk, req, false);
> +
> + while ((nbytes = walk.nbytes) > 0) {
> + const u8 *src = walk.src.virt.addr;
> + u8 *dst = walk.dst.virt.addr;
> + unsigned int nblocks;
> +
> + nblocks = nbytes / SM4_BLOCK_SIZE;
> + if (nblocks) {
> + kernel_neon_begin();
> +
> + sm4_cfb_crypt(ctx->rkey_enc, dst, src,
> + walk.iv, nblocks);
> +
> + kernel_neon_end();
> +
> + dst += nblocks * SM4_BLOCK_SIZE;
> + src += nblocks * SM4_BLOCK_SIZE;
> + nbytes -= nblocks * SM4_BLOCK_SIZE;
> + }
> +
> + /* tail */
> + if (walk.nbytes == walk.total && nbytes > 0) {
> + u8 keystream[SM4_BLOCK_SIZE];
> +
> + sm4_ce_crypt_block(ctx->rkey_enc, keystream, walk.iv);
> + crypto_xor_cpy(dst, src, keystream, nbytes);
> + nbytes = 0;
> + }
> +
> + err = skcipher_walk_done(&walk, nbytes);
> + }
> +
> + return err;
> +}
> +
> +static int cfb_encrypt(struct skcipher_request *req)
> +{
> + return cfb_crypt(req, sm4_ce_cfb_enc);
> +}
> +
> +static int cfb_decrypt(struct skcipher_request *req)
> +{
> + return cfb_crypt(req, sm4_sve_ce_cfb_dec);
> +}
> +
> +static int ctr_crypt(struct skcipher_request *req)
> +{
> + struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> + struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
> + struct skcipher_walk walk;
> + unsigned int nbytes;
> + int err;
> +
> + err = skcipher_walk_virt(&walk, req, false);
> +
> + while ((nbytes = walk.nbytes) > 0) {
> + const u8 *src = walk.src.virt.addr;
> + u8 *dst = walk.dst.virt.addr;
> + unsigned int nblocks;
> +
> + nblocks = nbytes / SM4_BLOCK_SIZE;
> + if (nblocks) {
> + kernel_neon_begin();
> +
> + sm4_sve_ce_ctr_crypt(ctx->rkey_enc, dst, src,
> + walk.iv, nblocks);
> +
> + kernel_neon_end();
> +
> + dst += nblocks * SM4_BLOCK_SIZE;
> + src += nblocks * SM4_BLOCK_SIZE;
> + nbytes -= nblocks * SM4_BLOCK_SIZE;
> + }
> +
> + /* tail */
> + if (walk.nbytes == walk.total && nbytes > 0) {
> + u8 keystream[SM4_BLOCK_SIZE];
> +
> + sm4_ce_crypt_block(ctx->rkey_enc, keystream, walk.iv);
> + crypto_inc(walk.iv, SM4_BLOCK_SIZE);
> + crypto_xor_cpy(dst, src, keystream, nbytes);
> + nbytes = 0;
> + }
> +
> + err = skcipher_walk_done(&walk, nbytes);
> + }
> +
> + return err;
> +}
> +
> +static struct skcipher_alg sm4_algs[] = {
> + {
> + .base = {
> + .cra_name = "ecb(sm4)",
> + .cra_driver_name = "ecb-sm4-sve-ce",
> + .cra_priority = 500,
> + .cra_blocksize = SM4_BLOCK_SIZE,
> + .cra_ctxsize = sizeof(struct sm4_ctx),
> + .cra_module = THIS_MODULE,
> + },
> + .min_keysize = SM4_KEY_SIZE,
> + .max_keysize = SM4_KEY_SIZE,
> + .setkey = sm4_setkey,
> + .encrypt = ecb_encrypt,
> + .decrypt = ecb_decrypt,
> + }, {
> + .base = {
> + .cra_name = "cbc(sm4)",
> + .cra_driver_name = "cbc-sm4-sve-ce",
> + .cra_priority = 500,
> + .cra_blocksize = SM4_BLOCK_SIZE,
> + .cra_ctxsize = sizeof(struct sm4_ctx),
> + .cra_module = THIS_MODULE,
> + },
> + .min_keysize = SM4_KEY_SIZE,
> + .max_keysize = SM4_KEY_SIZE,
> + .ivsize = SM4_BLOCK_SIZE,
> + .setkey = sm4_setkey,
> + .encrypt = cbc_encrypt,
> + .decrypt = cbc_decrypt,
> + }, {
> + .base = {
> + .cra_name = "cfb(sm4)",
> + .cra_driver_name = "cfb-sm4-sve-ce",
> + .cra_priority = 500,
> + .cra_blocksize = 1,
> + .cra_ctxsize = sizeof(struct sm4_ctx),
> + .cra_module = THIS_MODULE,
> + },
> + .min_keysize = SM4_KEY_SIZE,
> + .max_keysize = SM4_KEY_SIZE,
> + .ivsize = SM4_BLOCK_SIZE,
> + .chunksize = SM4_BLOCK_SIZE,
> + .setkey = sm4_setkey,
> + .encrypt = cfb_encrypt,
> + .decrypt = cfb_decrypt,
> + }, {
> + .base = {
> + .cra_name = "ctr(sm4)",
> + .cra_driver_name = "ctr-sm4-sve-ce",
> + .cra_priority = 500,
> + .cra_blocksize = 1,
> + .cra_ctxsize = sizeof(struct sm4_ctx),
> + .cra_module = THIS_MODULE,
> + },
> + .min_keysize = SM4_KEY_SIZE,
> + .max_keysize = SM4_KEY_SIZE,
> + .ivsize = SM4_BLOCK_SIZE,
> + .chunksize = SM4_BLOCK_SIZE,
> + .setkey = sm4_setkey,
> + .encrypt = ctr_crypt,
> + .decrypt = ctr_crypt,
> + }
> +};
> +
> +static int __init sm4_sve_ce_init(void)
> +{
> + if (sm4_sve_get_vl() <= 16)
> + return -ENODEV;
> +
> + return crypto_register_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs));
> +}
> +
> +static void __exit sm4_sve_ce_exit(void)
> +{
> + crypto_unregister_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs));
> +}
> +
> +module_cpu_feature_match(SVESM4, sm4_sve_ce_init);
> +module_exit(sm4_sve_ce_exit);
> +
> +MODULE_DESCRIPTION("SM4 ECB/CBC/CFB/CTR using ARMv9 Crypto Extensions with SVE2");
> +MODULE_ALIAS_CRYPTO("sm4-sve-ce");
> +MODULE_ALIAS_CRYPTO("sm4");
> +MODULE_ALIAS_CRYPTO("ecb(sm4)");
> +MODULE_ALIAS_CRYPTO("cbc(sm4)");
> +MODULE_ALIAS_CRYPTO("cfb(sm4)");
> +MODULE_ALIAS_CRYPTO("ctr(sm4)");
> +MODULE_AUTHOR("Tianjia Zhang <tianjia.zhang@linux.alibaba.com>");
> +MODULE_LICENSE("GPL v2");
> --
> 2.24.3 (Apple Git-128)
>
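The two lookup tables at the end of the quoted assembly (.Lbswap128_mask
and .Lle128_inc) follow a regular pattern, so they can be described more
compactly than 64 lines of .byte data. The following is a minimal sketch in
plain C, not part of the patch: the function names and MAX_VL_BYTES are
made up for illustration, and the pattern is inferred only from the table
contents shown above. It assumes the tables cover the architectural
maximum VL of 2048 bits (256 bytes, i.e. 16 lanes of 128 bits).

```c
#include <stdint.h>
#include <string.h>

#define MAX_VL_BYTES 256u	/* 2048-bit maximum SVE vector length */

/*
 * Hypothetical generator for the .Lbswap128_mask data: within every
 * 16-byte lane the order of the four 32-bit words is reversed, while
 * the byte order inside each word is kept.
 */
static void gen_bswap128_mask(uint8_t mask[MAX_VL_BYTES])
{
	for (unsigned int i = 0; i < MAX_VL_BYTES; i++) {
		unsigned int lane = i & 0xf0u;		/* start of 16-byte lane */
		unsigned int word = (i & 0x0fu) >> 2;	/* 32-bit word in lane   */
		unsigned int byte = i & 0x03u;		/* byte within the word  */

		mask[i] = (uint8_t)(lane | ((3u - word) << 2) | byte);
	}
}

/*
 * Hypothetical generator for the .Lle128_inc data: lane n holds the
 * little-endian 128-bit integer n, i.e. only the first byte is set.
 */
static void gen_le128_inc(uint8_t inc[MAX_VL_BYTES])
{
	memset(inc, 0, MAX_VL_BYTES);
	for (unsigned int lane = 0; lane < MAX_VL_BYTES / 16; lane++)
		inc[lane * 16] = (uint8_t)lane;
}
```

Read this way, the second table looks like a per-lane offset (0, 1, 2, ...)
added to a broadcast counter so that each 128-bit lane encrypts a different
counter value.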
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 16/16] crypto: arm64/sm4 - add ARMv9 SVE cryptography acceleration implementation
2022-09-26 10:02 ` Ard Biesheuvel
@ 2022-09-26 17:14 ` Mark Brown
-1 siblings, 0 replies; 42+ messages in thread
From: Mark Brown @ 2022-09-26 17:14 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Tianjia Zhang, Herbert Xu, David S. Miller, Jussi Kivilinna,
Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
linux-stm32
On Mon, Sep 26, 2022 at 12:02:04PM +0200, Ard Biesheuvel wrote:
> Given that we currently do not support the use of SVE in kernel mode,
> this patch cannot be accepted at this time (but the rest of the series
> looks reasonable to me, although I have only skimmed over the patches)
> In view of the disappointing benchmark results below, I don't think
> this is worth the hassle at the moment. If we can find a case where
> using SVE in kernel mode truly makes a [favorable] difference, we can
> revisit this, but not without a thorough analysis of the impact it
> will have to support SVE in the kernel. Also, the fact that SVE may
The kernel code doesn't really distinguish between FPSIMD and SVE in
terms of state management, and with the sharing of the V and Z registers
the architecture is very similar too so it shouldn't be too much hassle,
the only thing we should need is some management for the VL when
starting kernel mode SVE (probably just setting the maximum VL as a
first pass).
The current code should *work*, and on a system with only a single VL
supported it'd be equivalent since setting the VL is a noop. It'd just
mean that any kernel mode SVE would end up using whatever VL was last
set on the PE, which could result in inconsistent performance.
> also cover cryptographic extensions does not necessarily imply that a
> micro-architecture will perform those crypto transformations in
> parallel and so the performance may be the same even if VL > 128.
Indeed, though so long as the performance is comparable I guess it
doesn't really hurt - if we run into situations where for some
implementations SVE performs worse then we'd need to do something more
complicated than just using SVE if it's available but...
> In summary, please drop this patch for now, and once there are more
> encouraging performance numbers, please resubmit it as part of a
> series that explicitly enables SVE in kernel mode on arm64, and
> documents the requirements and constraints.
...in any case as you say until there are cases where SVE does better
for some in kernel use case we probably just shouldn't merge things.
Having said that I have been tempted to put together a branch which has
a kernel_sve_begin() implementation and collects proposed algorithm
implementations so they're there for people to experiment with as new
hardware becomes available. There's clearly interest in trying to use
SVE in kernel and it makes sense to try to avoid common pitfalls and
reduce duplication of effort.
A couple of very minor comments on the patch:
> > +config CRYPTO_SM4_ARM64_SVE_CE_BLK
> > + tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR (ARMv9 cryptography acceleration with SVE2)"
> > + depends on KERNEL_MODE_NEON
> > + select CRYPTO_SKCIPHER
> > + select CRYPTO_SM4
> > + select CRYPTO_SM4_ARM64_CE_BLK
> > + help
Our current baseline binutils version requirement predates SVE support
so we'd either need to manually encode all SVE instructions used or add
a suitable dependency. The dependency seems a lot more reasonable here,
and we could require a new enough version to avoid the manual encoding
that is done in the patch. I've not checked how new a version that'd
end up requiring, though; it might be unreasonable, so perhaps just
depending on binutils having basic SVE support and continuing with the
manual encoding might be more helpful.
> > +.macro sm4e, vd, vn
> > + .inst 0xcec08400 | (.L\vn << 5) | .L\vd
> > +.endm
For any manual encodings that do get left it'd be good to note the
binutils and LLVM versions which support the instruction so we can
hopefully at some point switch to assembling them normally.
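For readers unfamiliar with the .inst trick, the quoted macro simply ORs
the two register numbers into a fixed opcode word. Below is a minimal
sketch in plain C of that arithmetic; encode_sm4e() and SM4E_BASE are
hypothetical names used only for illustration, and the assumption is that
0xcec08400 is the base encoding of "sm4e vd.4s, vn.4s" with Rn in bits
[9:5] and Rd in bits [4:0], which is what the macro's shifts imply.

```c
#include <assert.h>
#include <stdint.h>

/* Base opcode word, copied from the quoted macro. */
#define SM4E_BASE 0xcec08400u

/*
 * Hypothetical helper (not part of the patch): compute the 32-bit word
 * that ".inst 0xcec08400 | (.L\vn << 5) | .L\vd" emits.
 * Register numbers must be in the range 0..31.
 */
static uint32_t encode_sm4e(unsigned int vd, unsigned int vn)
{
	assert(vd < 32 && vn < 32);
	return SM4E_BASE | ((uint32_t)vn << 5) | (uint32_t)vd;
}
```

For example, encode_sm4e(0, 8) yields 0xcec08500, the word the macro
would emit for "sm4e v0.4s, v8.4s" under the assumption above.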
> > +static int __init sm4_sve_ce_init(void)
> > +{
> > + if (sm4_sve_get_vl() <= 16)
> > + return -ENODEV;
I'm not clear what this check is attempting to guard against - what's
the issue with larger VLs?
If it is needed then we already have a sve_get_vl() in the core kernel
which we should probably be making available to modules rather than
having them open code something (eg, making it a static inline rather
than putting it in asm).
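One possible reading of the quoted check, sketched below in plain C: the
patch's sm4_sve_get_vl() uses "rdvl x0, #1", which returns the vector
length in bytes, so "<= 16" rejects only a 128-bit VL, where one SVE
vector holds a single SM4 block and offers no extra width over the
existing CE code. The helper names are hypothetical and this is an
interpretation, not something stated in the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define SM4_BLOCK_SIZE 16u	/* bytes, one 128-bit SM4 block */

/* Whole SM4 blocks that fit in one SVE vector of vl_bytes bytes. */
static unsigned int sm4_blocks_per_vector(unsigned int vl_bytes)
{
	/* Architectural VLs are multiples of 128 bits, up to 2048 bits. */
	assert(vl_bytes >= 16 && vl_bytes <= 256 && vl_bytes % 16 == 0);
	return vl_bytes / SM4_BLOCK_SIZE;
}

/* Mirrors the quoted init check: registration fails when VL <= 16 bytes. */
static bool sve_path_would_register(unsigned int vl_bytes)
{
	return vl_bytes > 16;
}
```

With a 256-bit VL (32 bytes) this gives two blocks per vector and the
driver registers; with a 128-bit VL it gives one block and init returns
-ENODEV.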
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 16/16] crypto: arm64/sm4 - add ARMv9 SVE cryptography acceleration implementation
2022-09-26 10:02 ` Ard Biesheuvel
@ 2022-09-27 4:26 ` Tianjia Zhang
-1 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-27 4:26 UTC (permalink / raw)
To: Ard Biesheuvel, Mark Brown
Cc: Herbert Xu, David S. Miller, Jussi Kivilinna, Catalin Marinas,
Will Deacon, Maxime Coquelin, Alexandre Torgue, Eric Biggers,
linux-crypto, linux-arm-kernel, linux-kernel, linux-stm32
Hi Ard,
On 9/26/22 6:02 PM, Ard Biesheuvel wrote:
> (cc Mark Brown)
>
> Hello Tianjia,
>
> On Mon, 26 Sept 2022 at 11:37, Tianjia Zhang
> <tianjia.zhang@linux.alibaba.com> wrote:
>>
>> Scalable Vector Extension (SVE) is the next-generation SIMD extension for
>> arm64. SVE allows flexible vector length implementations with a range of
>> possible values in CPU implementations. The vector length can vary from a
>> minimum of 128 bits up to a maximum of 2048 bits, at 128-bit increments.
>> The SVE design guarantees that the same application can run on different
>> implementations that support SVE, without the need to recompile the code.
>>
>> SVE was originally introduced by ARMv8, and ARMv9 introduced SVE2 to
>> expand and improve it. Just as NEON has the Crypto Extension, SVE has
>> similar instructions, called cryptography acceleration instructions,
>> which are likewise an optional part of the instruction set.
>>
>> This patch uses SM4 cryptography acceleration instructions and SVE2
>> instructions to optimize the SM4 algorithm for ECB/CBC/CFB/CTR modes.
>> Since the encryption of CBC/CFB cannot be parallelized, the Crypto
>> Extension instructions are used for it.
>>
>
> Given that we currently do not support the use of SVE in kernel mode,
> this patch cannot be accepted at this time (but the rest of the series
> looks reasonable to me, although I have only skimmed over the patches)
>
> In view of the disappointing benchmark results below, I don't think
> this is worth the hassle at the moment. If we can find a case where
> using SVE in kernel mode truly makes a [favorable] difference, we can
> revisit this, but not without a thorough analysis of the impact it
> will have to support SVE in the kernel. Also, the fact that SVE may
> also cover cryptographic extensions does not necessarily imply that a
> micro-architecture will perform those crypto transformations in
> parallel and so the performance may be the same even if VL > 128.
>
> In summary, please drop this patch for now, and once there are more
> encouraging performance numbers, please resubmit it as part of a
> series that explicitly enables SVE in kernel mode on arm64, and
> documents the requirements and constraints.
>
> I have cc'ed Mark, who has been working on the SVE support and might
> have something to add here as well.
>
> Thanks,
> Ard.
>
>
Thanks for your reply. The current performance of SVE is indeed
unsatisfactory. One reason is that the SVE implementation has to deal
with more, and more complex, data shifting operations, for example in
CBC/CFB mode, and in CTR mode it also needs more instructions to
complete the 128-bit counter increment. The CE implementation does not
have these complications.
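To make the counter handling concrete, here is a minimal sketch in plain
C of the scalar increment used on the CE path of the quoted assembly
(the "adds x8, x8, #1" / "adc x7, x7, xzr" pair followed by rev64): the
counter is kept as two native 64-bit halves and converted to big-endian
only when a block is produced. The type and function names are
hypothetical and exist only for illustration.

```c
#include <stdint.h>

/* Hypothetical type: 128-bit counter held as two native 64-bit halves. */
struct ctr128 {
	uint64_t hi;
	uint64_t lo;
};

static uint64_t load_be64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

static void store_be64(uint8_t *p, uint64_t v)
{
	for (int i = 7; i >= 0; i--) {
		p[i] = (uint8_t)v;
		v >>= 8;
	}
}

/* Load a big-endian 16-byte IV into the two-halves representation. */
static struct ctr128 ctr128_load(const uint8_t iv[16])
{
	struct ctr128 c = { load_be64(iv), load_be64(iv + 8) };

	return c;
}

/* Emit the current counter block (big-endian), then add one with carry. */
static void ctr128_next_block(struct ctr128 *c, uint8_t out[16])
{
	store_be64(out, c->hi);
	store_be64(out + 8, c->lo);

	c->lo += 1;		/* corresponds to adds x8, x8, #1 */
	if (c->lo == 0)		/* carry out of the low half      */
		c->hi += 1;	/* corresponds to adc x7, x7, xzr */
}
```

A vectorized variant has to apply the same carry propagation to every
128-bit lane at once, which is where the extra instructions mentioned
above come from.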
In addition, I naively thought that with a 256-bit VL the performance
would simply double compared to 128-bit. At present, this is not the
case. It is probably not worth using SVE until there is significantly
better performance data. I'll follow your advice and drop this patch.
Best regards,
Tianjia
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 16/16] crypto: arm64/sm4 - add ARMv9 SVE cryptography acceleration implementation
2022-09-26 17:14 ` Mark Brown
@ 2022-09-27 4:30 ` Tianjia Zhang
-1 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-27 4:30 UTC (permalink / raw)
To: Mark Brown, Ard Biesheuvel
Cc: Herbert Xu, David S. Miller, Jussi Kivilinna, Catalin Marinas,
Will Deacon, Maxime Coquelin, Alexandre Torgue, Eric Biggers,
linux-crypto, linux-arm-kernel, linux-kernel, linux-stm32
Hi Mark,
On 9/27/22 1:14 AM, Mark Brown wrote:
> On Mon, Sep 26, 2022 at 12:02:04PM +0200, Ard Biesheuvel wrote:
>
>> Given that we currently do not support the use of SVE in kernel mode,
>> this patch cannot be accepted at this time (but the rest of the series
>> looks reasonable to me, although I have only skimmed over the patches)
>
>> In view of the disappointing benchmark results below, I don't think
>> this is worth the hassle at the moment. If we can find a case where
>> using SVE in kernel mode truly makes a [favorable] difference, we can
>> revisit this, but not without a thorough analysis of the impact it
>> will have to support SVE in the kernel. Also, the fact that SVE may
>
> The kernel code doesn't really distinguish between FPSIMD and SVE in
> terms of state management, and with the sharing of the V and Z registers
> the architecture is very similar too so it shouldn't be too much hassle,
> the only thing we should need is some management for the VL when
> starting kernel mode SVE (probably just setting the maximum VL as a
> first pass).
>
> The current code should *work* and on a system with only a single VL
> supported it'd be equivalent since setting the VL is a noop, it'd just
> mean that any kernel mode SVE would end up using whatever the last VL
> set on the PE happened to be in which could result in inconsistent
> performance.
>
>> also cover cryptographic extensions does not necessarily imply that a
>> micro-architecture will perform those crypto transformations in
>> parallel and so the performance may be the same even if VL > 128.
>
> Indeed, though so long as the performance is comparable I guess it
> doesn't really hurt - if we run into situations where for some
> implementations SVE performs worse then we'd need to do something more
> complicated than just using SVE if it's available but...
>
>> In summary, please drop this patch for now, and once there are more
>> encouraging performance numbers, please resubmit it as part of a
>> series that explicitly enables SVE in kernel mode on arm64, and
>> documents the requirements and constraints.
>
> ...in any case as you say until there are cases where SVE does better
> for some in kernel use case we probably just shouldn't merge things.
>
> Having said that I have been tempted to put together a branch which has
> a kernel_sve_begin() implementation and collects proposed algorithm
> implementations so they're there for people to experiment with as new
> hardware becomes available. There's clearly interest in trying to use
> SVE in kernel and it makes sense to try to avoid common pitfalls and
> reduce duplication of effort.
>
Your reply helped me a lot. I did run into problems when running under
qemu with a VL larger than 128 bits, although the same code also tested
fine with the pure user-mode library libgcrypt. Perhaps it is only a
coincidence that it works fine at 128-bit on the physical machine.
I am looking forward to your experimental branch, and I believe that
there will be breakthroughs in hardware in the near future.

> A couple of very minor comments on the patch:
>
>>> +config CRYPTO_SM4_ARM64_SVE_CE_BLK
>>> + tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR (ARMv9 cryptography acceleration with SVE2)"
>>> + depends on KERNEL_MODE_NEON
>>> + select CRYPTO_SKCIPHER
>>> + select CRYPTO_SM4
>>> + select CRYPTO_SM4_ARM64_CE_BLK
>>> + help
>
> Our current baseline binutils version requirement predates SVE support
> so we'd either need to manually encode all SVE instructions used or add
> suitable dependency. The dependency seems a lot more reasonable here,
> and we could require a new enough version to avoid the manual encoding
> that is done in the patch (though I've not checked how new a version
> that'd end up requiring, it might be unreasonable so perhaps just
> depending on binutils having basic SVE support and continuing with the
> manual encoding might be more helpful).
>
>>> +.macro sm4e, vd, vn
>>> + .inst 0xcec08400 | (.L\vn << 5) | .L\vd
>>> +.endm
>
> For any manual encodings that do get left it'd be good to note the
> binutils and LLVM versions which support the instruction so we can
> hopefully at some point switch to assembling them normally.
>
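As a worked illustration of the manual encoding quoted above, the following C sketch computes the instruction word that the `sm4e` macro emits. It assumes that `.L\vn` and `.L\vd` resolve to plain register numbers in the range 0 to 31 (the symbol definitions are not shown in the quoted hunk). The function name `sm4e_encode` is hypothetical and exists only for this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: mirrors the arithmetic of the quoted macro
 *   .inst 0xcec08400 | (.L\vn << 5) | .L\vd
 * assuming vd and vn are vector register numbers in the range 0..31.
 */
static inline uint32_t sm4e_encode(unsigned int vd, unsigned int vn)
{
	/* Each register field is 5 bits wide, so only 0..31 is valid. */
	assert(vd < 32 && vn < 32);

	/* Base opcode, source register in bits [9:5], destination in bits [4:0]. */
	return UINT32_C(0xcec08400) | ((uint32_t)vn << 5) | (uint32_t)vd;
}
```

Under that assumption, destination register 0 and source register 1 give the word `0xcec08420`. Once a toolchain that knows the mnemonic can be required, a table like this becomes unnecessary, which is the point of recording the supporting binutils and LLVM versions next to each manual encoding.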
>>> +static int __init sm4_sve_ce_init(void)
>>> +{
>>> + if (sm4_sve_get_vl() <= 16)
>>> + return -ENODEV;
>
> I'm not clear what this check is attempting to guard against - what's
> the issue with larger VLs?
Since no physical environment is available, this check is based on my
naive assumption that performance with a 256-bit VL should theoretically
be twice that of 128-bit. Because the SVE code has to handle more complex
data shifting and CTR increment operations, I expected SVE to bring a
performance improvement only when the VL is 256 bits or more, and falling
back to CE to be the better choice otherwise.
It now seems that this assumption itself is not valid, so I will drop
this patch for now.
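
The check being discussed can be restated as a small C sketch. It assumes, as the `<= 16` comparison suggests, that the VL is reported in bytes, so 16 bytes corresponds to a 128-bit vector holding a single 16-byte SM4 block. The names `sm4_blocks_per_vector` and `sm4_sve_worthwhile` are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define SM4_BLOCK_SIZE 16	/* SM4 operates on 128-bit (16-byte) blocks */

/* Number of whole SM4 blocks that fit in one vector of vl_bytes bytes. */
static inline unsigned int sm4_blocks_per_vector(unsigned int vl_bytes)
{
	/* SVE vector lengths are multiples of 128 bits (16 bytes). */
	assert(vl_bytes >= SM4_BLOCK_SIZE && vl_bytes % SM4_BLOCK_SIZE == 0);
	return vl_bytes / SM4_BLOCK_SIZE;
}

/*
 * Mirrors the quoted init check: with a VL of 16 bytes or less the
 * module init returns -ENODEV and the CE implementation is used instead.
 */
static inline bool sm4_sve_worthwhile(unsigned int vl_bytes)
{
	return vl_bytes > SM4_BLOCK_SIZE;
}
```

Under this reading, a 128-bit VL holds one block per vector and a 256-bit VL holds two, which is where the expected doubling came from. As noted earlier in the thread, a wider vector does not necessarily mean the micro-architecture performs the crypto transformations in parallel, so the block count is only an upper bound on any speedup.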
>
> If it is needed then we already have a sve_get_vl() in the core kernel
> which we should probably be making available to modules rather than
> having them open code something (eg, making it a static inline rather
> than putting it in asm).
Yes, I agree. Making sve_get_vl() available to modules is the more
appropriate approach.
Best regards,
Tianjia