[PATCH 00/16] Optimizing SM3 and SM4 algorithms using NEON/CE/SVE instructions

* [PATCH 00/16] Optimizing SM3 and SM4 algorithms using NEON/CE/SVE instructions
From: Tianjia Zhang @ 2022-09-26  9:36 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
	Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
	Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
	linux-stm32

This patch series optimizes the SM3 and SM4 algorithms using several
arm64 instruction sets (NEON, CE and SVE), and adds accelerated
implementations of additional SM4 modes.

patch 1-2:   NEON instruction set optimization for SM3
patch 3:     Refactor and streamline the SM4 NEON implementation
patch 4-5:   Test vectors and tcrypt tests for the new SM4 modes
patch 6-8:   Refactor and streamline the SM4 CE implementation
patch 9-12:  CE-accelerated implementation of SM4 CTS/XTS/ESSIV
patch 13:    CE-accelerated implementation of SM4 CMAC/XCBC/CBCMAC
patch 14-15: CE-accelerated implementation of SM4 CCM/GCM
patch 16:    SM4 ARMv9 SVE cryptography acceleration implementation


Tianjia Zhang (16):
  crypto: arm64/sm3 - raise the priority of the CE implementation
  crypto: arm64/sm3 - add NEON assembly implementation
  crypto: arm64/sm4 - refactor and simplify NEON implementation
  crypto: testmgr - add SM4 cts-cbc/essiv/xts/xcbc test vectors
  crypto: tcrypt - add SM4 cts-cbc/essiv/xts/xcbc test
  crypto: arm64/sm4 - refactor and simplify CE implementation
  crypto: arm64/sm4 - simplify sm4_ce_expand_key() of CE implementation
  crypto: arm64/sm4 - export reusable CE acceleration functions
  crypto: arm64/sm4 - add CE implementation for CTS-CBC mode
  crypto: arm64/sm4 - add CE implementation for XTS mode
  crypto: essiv - allow digestsize to be greater than keysize
  crypto: arm64/sm4 - add CE implementation for ESSIV mode
  crypto: arm64/sm4 - add CE implementation for cmac/xcbc/cbcmac
  crypto: arm64/sm4 - add CE implementation for CCM mode
  crypto: arm64/sm4 - add CE implementation for GCM mode
  crypto: arm64/sm4 - add ARMv9 SVE cryptography acceleration
    implementation

 arch/arm64/crypto/Kconfig           |   66 +-
 arch/arm64/crypto/Makefile          |   12 +
 arch/arm64/crypto/sm3-ce-glue.c     |    2 +-
 arch/arm64/crypto/sm3-neon-core.S   |  600 +++++++++++++
 arch/arm64/crypto/sm3-neon-glue.c   |  103 +++
 arch/arm64/crypto/sm4-ce-asm.h      |  209 +++++
 arch/arm64/crypto/sm4-ce-ccm-core.S |  328 +++++++
 arch/arm64/crypto/sm4-ce-ccm-glue.c |  303 +++++++
 arch/arm64/crypto/sm4-ce-core.S     | 1247 ++++++++++++++++++---------
 arch/arm64/crypto/sm4-ce-gcm-core.S |  741 ++++++++++++++++
 arch/arm64/crypto/sm4-ce-gcm-glue.c |  286 ++++++
 arch/arm64/crypto/sm4-ce-glue.c     |  703 ++++++++++++++-
 arch/arm64/crypto/sm4-ce.h          |   16 +
 arch/arm64/crypto/sm4-neon-core.S   |  630 +++++++++-----
 arch/arm64/crypto/sm4-neon-glue.c   |  172 +---
 arch/arm64/crypto/sm4-sve-ce-core.S | 1028 ++++++++++++++++++++++
 arch/arm64/crypto/sm4-sve-ce-glue.c |  332 +++++++
 crypto/essiv.c                      |   11 +-
 crypto/tcrypt.c                     |   28 +
 crypto/testmgr.c                    |   25 +
 crypto/testmgr.h                    | 1161 +++++++++++++++++++++++++
 21 files changed, 7234 insertions(+), 769 deletions(-)
 create mode 100644 arch/arm64/crypto/sm3-neon-core.S
 create mode 100644 arch/arm64/crypto/sm3-neon-glue.c
 create mode 100644 arch/arm64/crypto/sm4-ce-asm.h
 create mode 100644 arch/arm64/crypto/sm4-ce-ccm-core.S
 create mode 100644 arch/arm64/crypto/sm4-ce-ccm-glue.c
 create mode 100644 arch/arm64/crypto/sm4-ce-gcm-core.S
 create mode 100644 arch/arm64/crypto/sm4-ce-gcm-glue.c
 create mode 100644 arch/arm64/crypto/sm4-ce.h
 create mode 100644 arch/arm64/crypto/sm4-sve-ce-core.S
 create mode 100644 arch/arm64/crypto/sm4-sve-ce-glue.c

-- 
2.24.3 (Apple Git-128)


* [PATCH 01/16] crypto: arm64/sm3 - raise the priority of the CE implementation
From: Tianjia Zhang @ 2022-09-26  9:36 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
	Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
	Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
	linux-stm32

Raise the priority of the sm3-ce algorithm from 200 to 400 to make
room for the sm3-neon implementation.
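
For readers unfamiliar with how the priority is used: when several
drivers register the same algorithm name, the crypto API prefers the
one with the highest cra_priority. The sketch below is a simplified
userspace model of that selection, not kernel code. The struct, the
pick_best() helper, and the priorities shown for sm3-generic and
sm3-neon are assumptions for illustration; only the value 400 for
sm3-ce comes from this patch. The real lookup also considers type
masks, flags and self-test state, which are ignored here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a registered algorithm. */
struct impl {
	const char *driver_name;
	int priority;
};

/* Return the entry with the highest priority, or NULL if n == 0. */
static const struct impl *pick_best(const struct impl *impls, size_t n)
{
	const struct impl *best = NULL;

	for (size_t i = 0; i < n; i++)
		if (!best || impls[i].priority > best->priority)
			best = &impls[i];
	return best;
}

/* Priorities for the first two entries are assumed, not from the patch. */
static const struct impl sm3_impls[] = {
	{ "sm3-generic", 100 },	/* assumption */
	{ "sm3-neon",    200 },	/* assumption */
	{ "sm3-ce",      400 },	/* value set by this patch */
};
```

With all three entries present, pick_best(sm3_impls, 3) selects
sm3-ce. If the CE driver is not registered (modelled here by passing
only the first two entries), sm3-neon is selected ahead of
sm3-generic.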

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
 arch/arm64/crypto/sm3-ce-glue.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/crypto/sm3-ce-glue.c b/arch/arm64/crypto/sm3-ce-glue.c
index ee98954ae8ca..54bf6ebcfffb 100644
--- a/arch/arm64/crypto/sm3-ce-glue.c
+++ b/arch/arm64/crypto/sm3-ce-glue.c
@@ -84,7 +84,7 @@ static struct shash_alg sm3_alg = {
 	.base.cra_driver_name	= "sm3-ce",
 	.base.cra_blocksize	= SM3_BLOCK_SIZE,
 	.base.cra_module	= THIS_MODULE,
-	.base.cra_priority	= 200,
+	.base.cra_priority	= 400,
 };
 
 static int __init sm3_ce_mod_init(void)
-- 
2.24.3 (Apple Git-128)


* [PATCH 02/16] crypto: arm64/sm3 - add NEON assembly implementation
From: Tianjia Zhang @ 2022-09-26  9:36 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
	Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
	Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
	linux-stm32

This patch adds a NEON-accelerated implementation of the SM3 hash
algorithm. The core algorithm is based on the SM3 NEON work in the
libgcrypt project.

Benchmarked on a T-Head Yitian-710 at 2.75 GHz using tcrypt mode 326,
comparing sm3-neon against sm3-generic and sm3-ce. Columns are update
sizes in bytes and the throughput unit is Mb/s:

update-size    |      16      64     256    1024    2048    4096    8192
---------------+--------------------------------------------------------
sm3-generic    |  185.24  221.28  301.26  307.43  300.83  308.82  308.91
sm3-neon       |  171.81  220.20  322.94  339.28  334.09  343.61  343.87
sm3-ce         |  227.48  333.48  502.62  527.87  520.45  534.91  535.40
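
As a reading aid for the round-function macros in the assembly below
(FF2_*, GG2_* and the P0 step at the end of R()), here is a small
userspace C sketch. It compares the textbook SM3 boolean functions for
rounds 16-63 with the instruction sequences the macros emit. The
function names are made up for this illustration and do not appear in
the patch; it is a sketch of the identities involved, not a
replacement for the assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate left by n bits, 0 < n < 32; rolw() does this as ror by 32 - n. */
static uint32_t rol32(uint32_t x, unsigned int n)
{
	return (x << n) | (x >> (32 - n));
}

/* Textbook forms for rounds 16-63. */
static uint32_t ff2_spec(uint32_t x, uint32_t y, uint32_t z)
{
	return (x & y) | (x & z) | (y & z);	/* bitwise majority */
}

static uint32_t gg2_spec(uint32_t x, uint32_t y, uint32_t z)
{
	return (x & y) | (~x & z);		/* bitwise x ? y : z */
}

/* Same functions, written the way FF2_1..FF2_3 compute them. */
static uint32_t ff2_asm(uint32_t x, uint32_t y, uint32_t z)
{
	uint32_t o = x ^ y;	/* eor o, x, y */
	uint32_t t = x & y;	/* and t, x, y */

	o &= z;			/* and o, o, z */
	return o ^ t;		/* eor o, o, t */
}

/* ... and the way GG2_1..GG2_3 compute them. */
static uint32_t gg2_asm(uint32_t x, uint32_t y, uint32_t z)
{
	uint32_t o = z & ~x;	/* bic o, z, x */
	uint32_t t = y & x;	/* and t, y, x */

	return o ^ t;		/* eor o, o, t */
}

/* P0 permutation: x ^ rol(x, 9) ^ rol(x, 17). */
static uint32_t p0(uint32_t x)
{
	return x ^ rol32(x, 9) ^ rol32(x, 17);
}

/*
 * The functions are purely bitwise, so checking every combination of
 * all-zero and all-one words covers all 8 per-bit input cases.
 * Returns 1 when both pairs of forms agree everywhere.
 */
static int sm3_bool_forms_agree(void)
{
	static const uint32_t v[2] = { 0x00000000u, 0xffffffffu };

	for (int i = 0; i < 2; i++)
		for (int j = 0; j < 2; j++)
			for (int k = 0; k < 2; k++) {
				if (ff2_spec(v[i], v[j], v[k]) !=
				    ff2_asm(v[i], v[j], v[k]))
					return 0;
				if (gg2_spec(v[i], v[j], v[k]) !=
				    gg2_asm(v[i], v[j], v[k]))
					return 0;
			}
	return 1;
}
```

The XOR in the last step can stand in for OR because the two
intermediate terms never have the same bit set: in GG2 one term is
masked by x and the other by ~x, and in FF2 the term (x ^ y) & z is
zero wherever x & y is one.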

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
 arch/arm64/crypto/Kconfig         |  11 +
 arch/arm64/crypto/Makefile        |   3 +
 arch/arm64/crypto/sm3-neon-core.S | 600 ++++++++++++++++++++++++++++++
 arch/arm64/crypto/sm3-neon-glue.c | 103 +++++
 4 files changed, 717 insertions(+)
 create mode 100644 arch/arm64/crypto/sm3-neon-core.S
 create mode 100644 arch/arm64/crypto/sm3-neon-glue.c

diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index 8bd80508a710..4b121dc0cfba 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -96,6 +96,17 @@ config CRYPTO_SHA3_ARM64
 	  Architecture: arm64 using:
 	  - ARMv8.2 Crypto Extensions
 
+config CRYPTO_SM3_NEON
+	tristate "Hash functions: SM3 (NEON)"
+	depends on KERNEL_MODE_NEON
+	select CRYPTO_HASH
+	select CRYPTO_SM3
+	help
+	  SM3 (ShangMi 3) secure hash function (OSCCA GM/T 0004-2012)
+
+	  Architecture: arm64 using:
+	  - NEON (Advanced SIMD) extensions
+
 config CRYPTO_SM3_ARM64_CE
 	tristate "Hash functions: SM3 (ARMv8.2 Crypto Extensions)"
 	depends on KERNEL_MODE_NEON
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index 24bb0c4610de..087f1625e775 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -17,6 +17,9 @@ sha512-ce-y := sha512-ce-glue.o sha512-ce-core.o
 obj-$(CONFIG_CRYPTO_SHA3_ARM64) += sha3-ce.o
 sha3-ce-y := sha3-ce-glue.o sha3-ce-core.o
 
+obj-$(CONFIG_CRYPTO_SM3_NEON) += sm3-neon.o
+sm3-neon-y := sm3-neon-glue.o sm3-neon-core.o
+
 obj-$(CONFIG_CRYPTO_SM3_ARM64_CE) += sm3-ce.o
 sm3-ce-y := sm3-ce-glue.o sm3-ce-core.o
 
diff --git a/arch/arm64/crypto/sm3-neon-core.S b/arch/arm64/crypto/sm3-neon-core.S
new file mode 100644
index 000000000000..3e3b4e5c736f
--- /dev/null
+++ b/arch/arm64/crypto/sm3-neon-core.S
@@ -0,0 +1,600 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * sm3-neon-core.S - SM3 secure hash using NEON instructions
+ *
+ * Linux/arm64 port of the libgcrypt SM3 implementation for AArch64
+ *
+ * Copyright (C) 2021 Jussi Kivilinna <jussi.kivilinna@iki.fi>
+ * Copyright (c) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+
+/* Context structure */
+
+#define state_h0 0
+#define state_h1 4
+#define state_h2 8
+#define state_h3 12
+#define state_h4 16
+#define state_h5 20
+#define state_h6 24
+#define state_h7 28
+
+/* Stack structure */
+
+#define STACK_W_SIZE        (32 * 2 * 3)
+
+#define STACK_W             (0)
+#define STACK_SIZE          (STACK_W + STACK_W_SIZE)
+
+/* Register macros */
+
+#define RSTATE x0
+#define RDATA  x1
+#define RNBLKS x2
+#define RKPTR  x28
+#define RFRAME x29
+
+#define ra w3
+#define rb w4
+#define rc w5
+#define rd w6
+#define re w7
+#define rf w8
+#define rg w9
+#define rh w10
+
+#define t0 w11
+#define t1 w12
+#define t2 w13
+#define t3 w14
+#define t4 w15
+#define t5 w16
+#define t6 w17
+
+#define k_even w19
+#define k_odd w20
+
+#define addr0 x21
+#define addr1 x22
+
+#define s0 w23
+#define s1 w24
+#define s2 w25
+#define s3 w26
+
+#define W0 v0
+#define W1 v1
+#define W2 v2
+#define W3 v3
+#define W4 v4
+#define W5 v5
+
+#define XTMP0 v6
+#define XTMP1 v7
+#define XTMP2 v16
+#define XTMP3 v17
+#define XTMP4 v18
+#define XTMP5 v19
+#define XTMP6 v20
+
+/* Helper macros. */
+
+#define _(...) /*_*/
+
+#define clear_vec(x) \
+	movi	x.8h, #0;
+
+#define rolw(o, a, n) \
+	ror	o, a, #(32 - n);
+
+/* Round function macros. */
+
+#define GG1_1(x, y, z, o, t) \
+	eor	o, x, y;
+#define GG1_2(x, y, z, o, t) \
+	eor	o, o, z;
+#define GG1_3(x, y, z, o, t)
+
+#define FF1_1(x, y, z, o, t) GG1_1(x, y, z, o, t)
+#define FF1_2(x, y, z, o, t)
+#define FF1_3(x, y, z, o, t) GG1_2(x, y, z, o, t)
+
+#define GG2_1(x, y, z, o, t) \
+	bic	o, z, x;
+#define GG2_2(x, y, z, o, t) \
+	and	t, y, x;
+#define GG2_3(x, y, z, o, t) \
+	eor	o, o, t;
+
+#define FF2_1(x, y, z, o, t) \
+	eor	o, x, y;
+#define FF2_2(x, y, z, o, t) \
+	and	t, x, y; \
+	and	o, o, z;
+#define FF2_3(x, y, z, o, t) \
+	eor	o, o, t;
+
+#define R(i, a, b, c, d, e, f, g, h, k, K_LOAD, round, widx, wtype, IOP, iop_param) \
+	K_LOAD(round);                                                        \
+	ldr	t5, [sp, #(wtype##_W1_ADDR(round, widx))];                    \
+	rolw(t0, a, 12);                              /* rol(a, 12) => t0 */  \
+      IOP(1, iop_param);                                                      \
+	FF##i##_1(a, b, c, t1, t2);                                           \
+	ldr	t6, [sp, #(wtype##_W1W2_ADDR(round, widx))];                  \
+	add	k, k, e;                                                      \
+      IOP(2, iop_param);                                                      \
+	GG##i##_1(e, f, g, t3, t4);                                           \
+	FF##i##_2(a, b, c, t1, t2);                                           \
+      IOP(3, iop_param);                                                      \
+	add	k, k, t0;                                                     \
+	add	h, h, t5;                                                     \
+	add	d, d, t6;                     /* w1w2 + d => d */             \
+      IOP(4, iop_param);                                                      \
+	rolw(k, k, 7);                        /* rol (t0 + e + t), 7) => k */ \
+	GG##i##_2(e, f, g, t3, t4);                                           \
+	add	h, h, k;                      /* h + w1 + k => h */           \
+      IOP(5, iop_param);                                                      \
+	FF##i##_3(a, b, c, t1, t2);                                           \
+	eor	t0, t0, k;                    /* k ^ t0 => t0 */              \
+	GG##i##_3(e, f, g, t3, t4);                                           \
+	add	d, d, t1;                     /* FF(a,b,c) + d => d */        \
+      IOP(6, iop_param);                                                      \
+	add	t3, t3, h;                    /* GG(e,f,g) + h => t3 */       \
+	rolw(b, b, 9);                        /* rol(b, 9) => b */            \
+	eor	h, t3, t3, ror #(32-9);                                       \
+      IOP(7, iop_param);                                                      \
+	add	d, d, t0;                     /* t0 + d => d */               \
+	rolw(f, f, 19);                       /* rol(f, 19) => f */           \
+      IOP(8, iop_param);                                                      \
+	eor	h, h, t3, ror #(32-17);       /* P0(t3) => h */
+
+#define R1(a, b, c, d, e, f, g, h, k, K_LOAD, round, widx, wtype, IOP, iop_param) \
+	R(1, ##a, ##b, ##c, ##d, ##e, ##f, ##g, ##h, ##k, K_LOAD, round, widx, wtype, IOP, iop_param)
+
+#define R2(a, b, c, d, e, f, g, h, k, K_LOAD, round, widx, wtype, IOP, iop_param) \
+	R(2, ##a, ##b, ##c, ##d, ##e, ##f, ##g, ##h, ##k, K_LOAD, round, widx, wtype, IOP, iop_param)
+
+#define KL(round) \
+	ldp	k_even, k_odd, [RKPTR, #(4*(round))];
+
+/* Input expansion macros. */
+
+/* Byte-swapped input address. */
+#define IW_W_ADDR(round, widx, offs) \
+	(STACK_W + ((round) / 4) * 64 + (offs) + ((widx) * 4))
+
+/* Expanded input address. */
+#define XW_W_ADDR(round, widx, offs) \
+	(STACK_W + ((((round) / 3) - 4) % 2) * 64 + (offs) + ((widx) * 4))
+
+/* Rounds 1-12, byte-swapped input block addresses. */
+#define IW_W1_ADDR(round, widx)   IW_W_ADDR(round, widx, 32)
+#define IW_W1W2_ADDR(round, widx) IW_W_ADDR(round, widx, 48)
+
+/* Rounds 13-64, expanded input block addresses. */
+#define XW_W1_ADDR(round, widx)   XW_W_ADDR(round, widx, 0)
+#define XW_W1W2_ADDR(round, widx) XW_W_ADDR(round, widx, 16)
+
+/* Input block loading.
+ * Interleaving within round function needed for in-order CPUs. */
+#define LOAD_W_VEC_1_1() \
+	add	addr0, sp, #IW_W1_ADDR(0, 0);
+#define LOAD_W_VEC_1_2() \
+	add	addr1, sp, #IW_W1_ADDR(4, 0);
+#define LOAD_W_VEC_1_3() \
+	ld1	{W0.16b}, [RDATA], #16;
+#define LOAD_W_VEC_1_4() \
+	ld1	{W1.16b}, [RDATA], #16;
+#define LOAD_W_VEC_1_5() \
+	ld1	{W2.16b}, [RDATA], #16;
+#define LOAD_W_VEC_1_6() \
+	ld1	{W3.16b}, [RDATA], #16;
+#define LOAD_W_VEC_1_7() \
+	rev32	XTMP0.16b, W0.16b;
+#define LOAD_W_VEC_1_8() \
+	rev32	XTMP1.16b, W1.16b;
+#define LOAD_W_VEC_2_1() \
+	rev32	XTMP2.16b, W2.16b;
+#define LOAD_W_VEC_2_2() \
+	rev32	XTMP3.16b, W3.16b;
+#define LOAD_W_VEC_2_3() \
+	eor	XTMP4.16b, XTMP1.16b, XTMP0.16b;
+#define LOAD_W_VEC_2_4() \
+	eor	XTMP5.16b, XTMP2.16b, XTMP1.16b;
+#define LOAD_W_VEC_2_5() \
+	st1	{XTMP0.16b}, [addr0], #16;
+#define LOAD_W_VEC_2_6() \
+	st1	{XTMP4.16b}, [addr0]; \
+	add	addr0, sp, #IW_W1_ADDR(8, 0);
+#define LOAD_W_VEC_2_7() \
+	eor	XTMP6.16b, XTMP3.16b, XTMP2.16b;
+#define LOAD_W_VEC_2_8() \
+	ext	W0.16b, XTMP0.16b, XTMP0.16b, #8;  /* W0: xx, w0, xx, xx */
+#define LOAD_W_VEC_3_1() \
+	mov	W2.16b, XTMP1.16b;                 /* W2: xx, w6, w5, w4 */
+#define LOAD_W_VEC_3_2() \
+	st1	{XTMP1.16b}, [addr1], #16;
+#define LOAD_W_VEC_3_3() \
+	st1	{XTMP5.16b}, [addr1]; \
+	ext	W1.16b, XTMP0.16b, XTMP0.16b, #4;  /* W1: xx, w3, w2, w1 */
+#define LOAD_W_VEC_3_4() \
+	ext	W3.16b, XTMP1.16b, XTMP2.16b, #12; /* W3: xx, w9, w8, w7 */
+#define LOAD_W_VEC_3_5() \
+	ext	W4.16b, XTMP2.16b, XTMP3.16b, #8;  /* W4: xx, w12, w11, w10 */
+#define LOAD_W_VEC_3_6() \
+	st1	{XTMP2.16b}, [addr0], #16;
+#define LOAD_W_VEC_3_7() \
+	st1	{XTMP6.16b}, [addr0];
+#define LOAD_W_VEC_3_8() \
+	ext	W5.16b, XTMP3.16b, XTMP3.16b, #4;  /* W5: xx, w15, w14, w13 */
+
+#define LOAD_W_VEC_1(iop_num, ...) \
+	LOAD_W_VEC_1_##iop_num()
+#define LOAD_W_VEC_2(iop_num, ...) \
+	LOAD_W_VEC_2_##iop_num()
+#define LOAD_W_VEC_3(iop_num, ...) \
+	LOAD_W_VEC_3_##iop_num()
+
+/* Message scheduling. Note: 3 words per vector register.
+ * Interleaving within round function needed for in-order CPUs. */
+#define SCHED_W_1_1(round, w0, w1, w2, w3, w4, w5) \
+	/* Load (w[i - 16]) => XTMP0 */            \
+	/* Load (w[i - 13]) => XTMP5 */            \
+	ext	XTMP0.16b, w0.16b, w0.16b, #12;    /* XTMP0: w0, xx, xx, xx */
+#define SCHED_W_1_2(round, w0, w1, w2, w3, w4, w5) \
+	ext	XTMP5.16b, w1.16b, w1.16b, #12;
+#define SCHED_W_1_3(round, w0, w1, w2, w3, w4, w5) \
+	ext	XTMP0.16b, XTMP0.16b, w1.16b, #12; /* XTMP0: xx, w2, w1, w0 */
+#define SCHED_W_1_4(round, w0, w1, w2, w3, w4, w5) \
+	ext	XTMP5.16b, XTMP5.16b, w2.16b, #12;
+#define SCHED_W_1_5(round, w0, w1, w2, w3, w4, w5) \
+	/* w[i - 9] == w3 */                       \
+	/* W3 ^ XTMP0 => XTMP0 */                  \
+	eor	XTMP0.16b, XTMP0.16b, w3.16b;
+#define SCHED_W_1_6(round, w0, w1, w2, w3, w4, w5) \
+	/* w[i - 3] == w5 */                       \
+	/* rol(w5, 15) ^ XTMP0 => XTMP0 */         \
+	/* rol(XTMP5, 7) => XTMP1 */               \
+	add	addr0, sp, #XW_W1_ADDR((round), 0); \
+	shl	XTMP2.4s, w5.4s, #15;
+#define SCHED_W_1_7(round, w0, w1, w2, w3, w4, w5) \
+	shl	XTMP1.4s, XTMP5.4s, #7;
+#define SCHED_W_1_8(round, w0, w1, w2, w3, w4, w5) \
+	sri	XTMP2.4s, w5.4s, #(32-15);
+#define SCHED_W_2_1(round, w0, w1, w2, w3, w4, w5) \
+	sri	XTMP1.4s, XTMP5.4s, #(32-7);
+#define SCHED_W_2_2(round, w0, w1, w2, w3, w4, w5) \
+	eor	XTMP0.16b, XTMP0.16b, XTMP2.16b;
+#define SCHED_W_2_3(round, w0, w1, w2, w3, w4, w5) \
+	/* w[i - 6] == W4 */                       \
+	/* W4 ^ XTMP1 => XTMP1 */                  \
+	eor	XTMP1.16b, XTMP1.16b, w4.16b;
+#define SCHED_W_2_4(round, w0, w1, w2, w3, w4, w5) \
+	/* P1(XTMP0) ^ XTMP1 => W0 */              \
+	shl	XTMP3.4s, XTMP0.4s, #15;
+#define SCHED_W_2_5(round, w0, w1, w2, w3, w4, w5) \
+	shl	XTMP4.4s, XTMP0.4s, #23;
+#define SCHED_W_2_6(round, w0, w1, w2, w3, w4, w5) \
+	eor	w0.16b, XTMP1.16b, XTMP0.16b;
+#define SCHED_W_2_7(round, w0, w1, w2, w3, w4, w5) \
+	sri	XTMP3.4s, XTMP0.4s, #(32-15);
+#define SCHED_W_2_8(round, w0, w1, w2, w3, w4, w5) \
+	sri	XTMP4.4s, XTMP0.4s, #(32-23);
+#define SCHED_W_3_1(round, w0, w1, w2, w3, w4, w5) \
+	eor	w0.16b, w0.16b, XTMP3.16b;
+#define SCHED_W_3_2(round, w0, w1, w2, w3, w4, w5) \
+	/* Load (w[i - 3]) => XTMP2 */             \
+	ext	XTMP2.16b, w4.16b, w4.16b, #12;
+#define SCHED_W_3_3(round, w0, w1, w2, w3, w4, w5) \
+	eor	w0.16b, w0.16b, XTMP4.16b;
+#define SCHED_W_3_4(round, w0, w1, w2, w3, w4, w5) \
+	ext	XTMP2.16b, XTMP2.16b, w5.16b, #12;
+#define SCHED_W_3_5(round, w0, w1, w2, w3, w4, w5) \
+	/* W1 ^ W2 => XTMP3 */                     \
+	eor	XTMP3.16b, XTMP2.16b, w0.16b;
+#define SCHED_W_3_6(round, w0, w1, w2, w3, w4, w5)
+#define SCHED_W_3_7(round, w0, w1, w2, w3, w4, w5) \
+	st1	{XTMP2.16b-XTMP3.16b}, [addr0];
+#define SCHED_W_3_8(round, w0, w1, w2, w3, w4, w5)
+
+#define SCHED_W_W0W1W2W3W4W5_1(iop_num, round) \
+	SCHED_W_1_##iop_num(round, W0, W1, W2, W3, W4, W5)
+#define SCHED_W_W0W1W2W3W4W5_2(iop_num, round) \
+	SCHED_W_2_##iop_num(round, W0, W1, W2, W3, W4, W5)
+#define SCHED_W_W0W1W2W3W4W5_3(iop_num, round) \
+	SCHED_W_3_##iop_num(round, W0, W1, W2, W3, W4, W5)
+
+#define SCHED_W_W1W2W3W4W5W0_1(iop_num, round) \
+	SCHED_W_1_##iop_num(round, W1, W2, W3, W4, W5, W0)
+#define SCHED_W_W1W2W3W4W5W0_2(iop_num, round) \
+	SCHED_W_2_##iop_num(round, W1, W2, W3, W4, W5, W0)
+#define SCHED_W_W1W2W3W4W5W0_3(iop_num, round) \
+	SCHED_W_3_##iop_num(round, W1, W2, W3, W4, W5, W0)
+
+#define SCHED_W_W2W3W4W5W0W1_1(iop_num, round) \
+	SCHED_W_1_##iop_num(round, W2, W3, W4, W5, W0, W1)
+#define SCHED_W_W2W3W4W5W0W1_2(iop_num, round) \
+	SCHED_W_2_##iop_num(round, W2, W3, W4, W5, W0, W1)
+#define SCHED_W_W2W3W4W5W0W1_3(iop_num, round) \
+	SCHED_W_3_##iop_num(round, W2, W3, W4, W5, W0, W1)
+
+#define SCHED_W_W3W4W5W0W1W2_1(iop_num, round) \
+	SCHED_W_1_##iop_num(round, W3, W4, W5, W0, W1, W2)
+#define SCHED_W_W3W4W5W0W1W2_2(iop_num, round) \
+	SCHED_W_2_##iop_num(round, W3, W4, W5, W0, W1, W2)
+#define SCHED_W_W3W4W5W0W1W2_3(iop_num, round) \
+	SCHED_W_3_##iop_num(round, W3, W4, W5, W0, W1, W2)
+
+#define SCHED_W_W4W5W0W1W2W3_1(iop_num, round) \
+	SCHED_W_1_##iop_num(round, W4, W5, W0, W1, W2, W3)
+#define SCHED_W_W4W5W0W1W2W3_2(iop_num, round) \
+	SCHED_W_2_##iop_num(round, W4, W5, W0, W1, W2, W3)
+#define SCHED_W_W4W5W0W1W2W3_3(iop_num, round) \
+	SCHED_W_3_##iop_num(round, W4, W5, W0, W1, W2, W3)
+
+#define SCHED_W_W5W0W1W2W3W4_1(iop_num, round) \
+	SCHED_W_1_##iop_num(round, W5, W0, W1, W2, W3, W4)
+#define SCHED_W_W5W0W1W2W3W4_2(iop_num, round) \
+	SCHED_W_2_##iop_num(round, W5, W0, W1, W2, W3, W4)
+#define SCHED_W_W5W0W1W2W3W4_3(iop_num, round) \
+	SCHED_W_3_##iop_num(round, W5, W0, W1, W2, W3, W4)
+
+
+	/*
+	 * Transform blocks*64 bytes (blocks*16 32-bit words) at 'src'.
+	 *
+	 * void sm3_neon_transform(struct sm3_state *sst, u8 const *src,
+	 *                         int blocks)
+	 */
+	.text
+.align 3
+SYM_FUNC_START(sm3_neon_transform)
+	ldp		ra, rb, [RSTATE, #0]
+	ldp		rc, rd, [RSTATE, #8]
+	ldp		re, rf, [RSTATE, #16]
+	ldp		rg, rh, [RSTATE, #24]
+
+	stp		x28, x29, [sp, #-16]!
+	stp		x19, x20, [sp, #-16]!
+	stp		x21, x22, [sp, #-16]!
+	stp		x23, x24, [sp, #-16]!
+	stp		x25, x26, [sp, #-16]!
+	mov		RFRAME, sp
+
+	sub		addr0, sp, #STACK_SIZE
+	adr_l		RKPTR, .LKtable
+	and		sp, addr0, #(~63)
+
+	/* Preload first block. */
+	LOAD_W_VEC_1(1, 0)
+	LOAD_W_VEC_1(2, 0)
+	LOAD_W_VEC_1(3, 0)
+	LOAD_W_VEC_1(4, 0)
+	LOAD_W_VEC_1(5, 0)
+	LOAD_W_VEC_1(6, 0)
+	LOAD_W_VEC_1(7, 0)
+	LOAD_W_VEC_1(8, 0)
+	LOAD_W_VEC_2(1, 0)
+	LOAD_W_VEC_2(2, 0)
+	LOAD_W_VEC_2(3, 0)
+	LOAD_W_VEC_2(4, 0)
+	LOAD_W_VEC_2(5, 0)
+	LOAD_W_VEC_2(6, 0)
+	LOAD_W_VEC_2(7, 0)
+	LOAD_W_VEC_2(8, 0)
+	LOAD_W_VEC_3(1, 0)
+	LOAD_W_VEC_3(2, 0)
+	LOAD_W_VEC_3(3, 0)
+	LOAD_W_VEC_3(4, 0)
+	LOAD_W_VEC_3(5, 0)
+	LOAD_W_VEC_3(6, 0)
+	LOAD_W_VEC_3(7, 0)
+	LOAD_W_VEC_3(8, 0)
+
+.balign 16
+.Loop:
+	/* Transform 0-3 */
+	R1(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 0, 0, IW, _, 0)
+	R1(rd, ra, rb, rc, rh, re, rf, rg, k_odd,  _,  1, 1, IW, _, 0)
+	R1(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 2, 2, IW, _, 0)
+	R1(rb, rc, rd, ra, rf, rg, rh, re, k_odd,  _,  3, 3, IW, _, 0)
+
+	/* Transform 4-7 + Precalc 12-14 */
+	R1(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 4, 0, IW, _, 0)
+	R1(rd, ra, rb, rc, rh, re, rf, rg, k_odd,  _,  5, 1, IW, _, 0)
+	R1(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 6, 2, IW, SCHED_W_W0W1W2W3W4W5_1, 12)
+	R1(rb, rc, rd, ra, rf, rg, rh, re, k_odd,  _,  7, 3, IW, SCHED_W_W0W1W2W3W4W5_2, 12)
+
+	/* Transform 8-11 + Precalc 12-17 */
+	R1(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 8, 0, IW, SCHED_W_W0W1W2W3W4W5_3, 12)
+	R1(rd, ra, rb, rc, rh, re, rf, rg, k_odd,  _,  9, 1, IW, SCHED_W_W1W2W3W4W5W0_1, 15)
+	R1(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 10, 2, IW, SCHED_W_W1W2W3W4W5W0_2, 15)
+	R1(rb, rc, rd, ra, rf, rg, rh, re, k_odd,  _,  11, 3, IW, SCHED_W_W1W2W3W4W5W0_3, 15)
+
+	/* Transform 12-14 + Precalc 18-20 */
+	R1(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 12, 0, XW, SCHED_W_W2W3W4W5W0W1_1, 18)
+	R1(rd, ra, rb, rc, rh, re, rf, rg, k_odd,  _,  13, 1, XW, SCHED_W_W2W3W4W5W0W1_2, 18)
+	R1(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 14, 2, XW, SCHED_W_W2W3W4W5W0W1_3, 18)
+
+	/* Transform 15-17 + Precalc 21-23 */
+	R1(rb, rc, rd, ra, rf, rg, rh, re, k_odd,  _,  15, 0, XW, SCHED_W_W3W4W5W0W1W2_1, 21)
+	R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 16, 1, XW, SCHED_W_W3W4W5W0W1W2_2, 21)
+	R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd,  _,  17, 2, XW, SCHED_W_W3W4W5W0W1W2_3, 21)
+
+	/* Transform 18-20 + Precalc 24-26 */
+	R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 18, 0, XW, SCHED_W_W4W5W0W1W2W3_1, 24)
+	R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd,  _,  19, 1, XW, SCHED_W_W4W5W0W1W2W3_2, 24)
+	R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 20, 2, XW, SCHED_W_W4W5W0W1W2W3_3, 24)
+
+	/* Transform 21-23 + Precalc 27-29 */
+	R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd,  _,  21, 0, XW, SCHED_W_W5W0W1W2W3W4_1, 27)
+	R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 22, 1, XW, SCHED_W_W5W0W1W2W3W4_2, 27)
+	R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd,  _,  23, 2, XW, SCHED_W_W5W0W1W2W3W4_3, 27)
+
+	/* Transform 24-26 + Precalc 30-32 */
+	R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 24, 0, XW, SCHED_W_W0W1W2W3W4W5_1, 30)
+	R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd,  _,  25, 1, XW, SCHED_W_W0W1W2W3W4W5_2, 30)
+	R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 26, 2, XW, SCHED_W_W0W1W2W3W4W5_3, 30)
+
+	/* Transform 27-29 + Precalc 33-35 */
+	R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd,  _,  27, 0, XW, SCHED_W_W1W2W3W4W5W0_1, 33)
+	R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 28, 1, XW, SCHED_W_W1W2W3W4W5W0_2, 33)
+	R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd,  _,  29, 2, XW, SCHED_W_W1W2W3W4W5W0_3, 33)
+
+	/* Transform 30-32 + Precalc 36-38 */
+	R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 30, 0, XW, SCHED_W_W2W3W4W5W0W1_1, 36)
+	R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd,  _,  31, 1, XW, SCHED_W_W2W3W4W5W0W1_2, 36)
+	R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 32, 2, XW, SCHED_W_W2W3W4W5W0W1_3, 36)
+
+	/* Transform 33-35 + Precalc 39-41 */
+	R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd,  _,  33, 0, XW, SCHED_W_W3W4W5W0W1W2_1, 39)
+	R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 34, 1, XW, SCHED_W_W3W4W5W0W1W2_2, 39)
+	R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd,  _,  35, 2, XW, SCHED_W_W3W4W5W0W1W2_3, 39)
+
+	/* Transform 36-38 + Precalc 42-44 */
+	R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 36, 0, XW, SCHED_W_W4W5W0W1W2W3_1, 42)
+	R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd,  _,  37, 1, XW, SCHED_W_W4W5W0W1W2W3_2, 42)
+	R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 38, 2, XW, SCHED_W_W4W5W0W1W2W3_3, 42)
+
+	/* Transform 39-41 + Precalc 45-47 */
+	R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd,  _,  39, 0, XW, SCHED_W_W5W0W1W2W3W4_1, 45)
+	R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 40, 1, XW, SCHED_W_W5W0W1W2W3W4_2, 45)
+	R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd,  _,  41, 2, XW, SCHED_W_W5W0W1W2W3W4_3, 45)
+
+	/* Transform 42-44 + Precalc 48-50 */
+	R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 42, 0, XW, SCHED_W_W0W1W2W3W4W5_1, 48)
+	R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd,  _,  43, 1, XW, SCHED_W_W0W1W2W3W4W5_2, 48)
+	R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 44, 2, XW, SCHED_W_W0W1W2W3W4W5_3, 48)
+
+	/* Transform 45-47 + Precalc 51-53 */
+	R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd,  _,  45, 0, XW, SCHED_W_W1W2W3W4W5W0_1, 51)
+	R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 46, 1, XW, SCHED_W_W1W2W3W4W5W0_2, 51)
+	R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd,  _,  47, 2, XW, SCHED_W_W1W2W3W4W5W0_3, 51)
+
+	/* Transform 48-50 + Precalc 54-56 */
+	R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 48, 0, XW, SCHED_W_W2W3W4W5W0W1_1, 54)
+	R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd,  _,  49, 1, XW, SCHED_W_W2W3W4W5W0W1_2, 54)
+	R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 50, 2, XW, SCHED_W_W2W3W4W5W0W1_3, 54)
+
+	/* Transform 51-53 + Precalc 57-59 */
+	R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd,  _,  51, 0, XW, SCHED_W_W3W4W5W0W1W2_1, 57)
+	R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 52, 1, XW, SCHED_W_W3W4W5W0W1W2_2, 57)
+	R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd,  _,  53, 2, XW, SCHED_W_W3W4W5W0W1W2_3, 57)
+
+	/* Transform 54-56 + Precalc 60-62 */
+	R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 54, 0, XW, SCHED_W_W4W5W0W1W2W3_1, 60)
+	R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd,  _,  55, 1, XW, SCHED_W_W4W5W0W1W2W3_2, 60)
+	R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 56, 2, XW, SCHED_W_W4W5W0W1W2W3_3, 60)
+
+	/* Transform 57-59 + Precalc 63 */
+	R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd,  _,  57, 0, XW, SCHED_W_W5W0W1W2W3W4_1, 63)
+	R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 58, 1, XW, SCHED_W_W5W0W1W2W3W4_2, 63)
+	R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd,  _,  59, 2, XW, SCHED_W_W5W0W1W2W3W4_3, 63)
+
+	/* Transform 60 */
+	R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 60, 0, XW, _, _)
+	subs		RNBLKS, RNBLKS, #1
+	b.eq		.Lend
+
+	/* Transform 61-63 + Preload next block */
+	R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd,  _,  61, 1, XW, LOAD_W_VEC_1, _)
+	ldp		s0, s1, [RSTATE, #0]
+	R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 62, 2, XW, LOAD_W_VEC_2, _)
+	ldp		s2, s3, [RSTATE, #8]
+	R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd,  _,  63, 0, XW, LOAD_W_VEC_3, _)
+
+	/* Update the chaining variables. */
+	eor		ra, ra, s0
+	eor		rb, rb, s1
+	ldp		s0, s1, [RSTATE, #16]
+	eor		rc, rc, s2
+	ldp		k_even, k_odd, [RSTATE, #24]
+	eor		rd, rd, s3
+	eor		re, re, s0
+	stp		ra, rb, [RSTATE, #0]
+	eor		rf, rf, s1
+	stp		rc, rd, [RSTATE, #8]
+	eor		rg, rg, k_even
+	stp		re, rf, [RSTATE, #16]
+	eor		rh, rh, k_odd
+	stp		rg, rh, [RSTATE, #24]
+	b		.Loop
+
+.Lend:
+	/* Transform 61-63 */
+	R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd,  _,  61, 1, XW, _, _)
+	ldp		s0, s1, [RSTATE, #0]
+	R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 62, 2, XW, _, _)
+	ldp		s2, s3, [RSTATE, #8]
+	R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd,  _,  63, 0, XW, _, _)
+
+	/* Update the chaining variables. */
+	eor		ra, ra, s0
+	clear_vec(W0)
+	eor		rb, rb, s1
+	clear_vec(W1)
+	ldp		s0, s1, [RSTATE, #16]
+	clear_vec(W2)
+	eor		rc, rc, s2
+	clear_vec(W3)
+	ldp		k_even, k_odd, [RSTATE, #24]
+	clear_vec(W4)
+	eor		rd, rd, s3
+	clear_vec(W5)
+	eor		re, re, s0
+	clear_vec(XTMP0)
+	stp		ra, rb, [RSTATE, #0]
+	clear_vec(XTMP1)
+	eor		rf, rf, s1
+	clear_vec(XTMP2)
+	stp		rc, rd, [RSTATE, #8]
+	clear_vec(XTMP3)
+	eor		rg, rg, k_even
+	clear_vec(XTMP4)
+	stp		re, rf, [RSTATE, #16]
+	clear_vec(XTMP5)
+	eor		rh, rh, k_odd
+	clear_vec(XTMP6)
+	stp		rg, rh, [RSTATE, #24]
+
+	/* Clear message expansion area */
+	add		addr0, sp, #STACK_W
+	st1		{W0.16b-W3.16b}, [addr0], #64
+	st1		{W0.16b-W3.16b}, [addr0], #64
+	st1		{W0.16b-W3.16b}, [addr0]
+
+	mov		sp, RFRAME
+
+	ldp		x25, x26, [sp], #16
+	ldp		x23, x24, [sp], #16
+	ldp		x21, x22, [sp], #16
+	ldp		x19, x20, [sp], #16
+	ldp		x28, x29, [sp], #16
+
+	ret
+SYM_FUNC_END(sm3_neon_transform)
+
+
+	.section	".rodata", "a"
+
+	.align 4
+.LKtable:
+	.long 0x79cc4519, 0xf3988a32, 0xe7311465, 0xce6228cb
+	.long 0x9cc45197, 0x3988a32f, 0x7311465e, 0xe6228cbc
+	.long 0xcc451979, 0x988a32f3, 0x311465e7, 0x6228cbce
+	.long 0xc451979c, 0x88a32f39, 0x11465e73, 0x228cbce6
+	.long 0x9d8a7a87, 0x3b14f50f, 0x7629ea1e, 0xec53d43c
+	.long 0xd8a7a879, 0xb14f50f3, 0x629ea1e7, 0xc53d43ce
+	.long 0x8a7a879d, 0x14f50f3b, 0x29ea1e76, 0x53d43cec
+	.long 0xa7a879d8, 0x4f50f3b1, 0x9ea1e762, 0x3d43cec5
+	.long 0x7a879d8a, 0xf50f3b14, 0xea1e7629, 0xd43cec53
+	.long 0xa879d8a7, 0x50f3b14f, 0xa1e7629e, 0x43cec53d
+	.long 0x879d8a7a, 0x0f3b14f5, 0x1e7629ea, 0x3cec53d4
+	.long 0x79d8a7a8, 0xf3b14f50, 0xe7629ea1, 0xcec53d43
+	.long 0x9d8a7a87, 0x3b14f50f, 0x7629ea1e, 0xec53d43c
+	.long 0xd8a7a879, 0xb14f50f3, 0x629ea1e7, 0xc53d43ce
+	.long 0x8a7a879d, 0x14f50f3b, 0x29ea1e76, 0x53d43cec
+	.long 0xa7a879d8, 0x4f50f3b1, 0x9ea1e762, 0x3d43cec5
diff --git a/arch/arm64/crypto/sm3-neon-glue.c b/arch/arm64/crypto/sm3-neon-glue.c
new file mode 100644
index 000000000000..7182ee683f14
--- /dev/null
+++ b/arch/arm64/crypto/sm3-neon-glue.c
@@ -0,0 +1,103 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * sm3-neon-glue.c - SM3 secure hash using NEON instructions
+ *
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#include <asm/neon.h>
+#include <asm/simd.h>
+#include <asm/unaligned.h>
+#include <crypto/internal/hash.h>
+#include <crypto/internal/simd.h>
+#include <crypto/sm3.h>
+#include <crypto/sm3_base.h>
+#include <linux/cpufeature.h>
+#include <linux/crypto.h>
+#include <linux/module.h>
+
+
+asmlinkage void sm3_neon_transform(struct sm3_state *sst, u8 const *src,
+				   int blocks);
+
+static int sm3_neon_update(struct shash_desc *desc, const u8 *data,
+			   unsigned int len)
+{
+	if (!crypto_simd_usable()) {
+		sm3_update(shash_desc_ctx(desc), data, len);
+		return 0;
+	}
+
+	kernel_neon_begin();
+	sm3_base_do_update(desc, data, len, sm3_neon_transform);
+	kernel_neon_end();
+
+	return 0;
+}
+
+static int sm3_neon_final(struct shash_desc *desc, u8 *out)
+{
+	if (!crypto_simd_usable()) {
+		sm3_final(shash_desc_ctx(desc), out);
+		return 0;
+	}
+
+	kernel_neon_begin();
+	sm3_base_do_finalize(desc, sm3_neon_transform);
+	kernel_neon_end();
+
+	return sm3_base_finish(desc, out);
+}
+
+static int sm3_neon_finup(struct shash_desc *desc, const u8 *data,
+			  unsigned int len, u8 *out)
+{
+	if (!crypto_simd_usable()) {
+		struct sm3_state *sctx = shash_desc_ctx(desc);
+
+		if (len)
+			sm3_update(sctx, data, len);
+		sm3_final(sctx, out);
+		return 0;
+	}
+
+	kernel_neon_begin();
+	if (len)
+		sm3_base_do_update(desc, data, len, sm3_neon_transform);
+	sm3_base_do_finalize(desc, sm3_neon_transform);
+	kernel_neon_end();
+
+	return sm3_base_finish(desc, out);
+}
+
+static struct shash_alg sm3_alg = {
+	.digestsize		= SM3_DIGEST_SIZE,
+	.init			= sm3_base_init,
+	.update			= sm3_neon_update,
+	.final			= sm3_neon_final,
+	.finup			= sm3_neon_finup,
+	.descsize		= sizeof(struct sm3_state),
+	.base.cra_name		= "sm3",
+	.base.cra_driver_name	= "sm3-neon",
+	.base.cra_blocksize	= SM3_BLOCK_SIZE,
+	.base.cra_module	= THIS_MODULE,
+	.base.cra_priority	= 200,
+};
+
+static int __init sm3_neon_init(void)
+{
+	return crypto_register_shash(&sm3_alg);
+}
+
+static void __exit sm3_neon_fini(void)
+{
+	crypto_unregister_shash(&sm3_alg);
+}
+
+module_init(sm3_neon_init);
+module_exit(sm3_neon_fini);
+
+MODULE_DESCRIPTION("SM3 secure hash using NEON instructions");
+MODULE_AUTHOR("Jussi Kivilinna <jussi.kivilinna@iki.fi>");
+MODULE_AUTHOR("Tianjia Zhang <tianjia.zhang@linux.alibaba.com>");
+MODULE_LICENSE("GPL v2");
-- 
2.24.3 (Apple Git-128)
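
Reading aid for sm3-neon-glue.c: every entry point first checks
crypto_simd_usable() and falls back to the generic sm3_update() /
sm3_final() when NEON cannot be used; otherwise it brackets the assembly
call with kernel_neon_begin() / kernel_neon_end(). Below is a minimal
user-space sketch of that dispatch shape. All names in it
(simd_usable_flag, simd_begin, transform_generic, transform_simd,
update) are hypothetical stand-ins, not kernel APIs, and the
"transforms" only count blocks.

```c
#include <stdbool.h>

/* Stand-in for crypto_simd_usable(); hypothetical, toggled by the caller. */
bool simd_usable_flag = true;

/* Counts how many SIMD sections were opened, for illustration only. */
int simd_sections;

/* Stand-ins for kernel_neon_begin() / kernel_neon_end(). */
void simd_begin(void) { simd_sections++; }
void simd_end(void) { }

/* Placeholder "scalar" path: records the block count in state[0]. */
void transform_generic(unsigned int *state, const unsigned char *src,
		       int blocks)
{
	(void)src;
	state[0] += blocks;
}

/* Placeholder "SIMD" path: records the block count in state[1]. */
void transform_simd(unsigned int *state, const unsigned char *src,
		    int blocks)
{
	(void)src;
	state[1] += blocks;
}

/* Same shape as sm3_neon_update(): fall back first, else bracket SIMD. */
void update(unsigned int *state, const unsigned char *src, int blocks)
{
	if (!simd_usable_flag) {
		transform_generic(state, src, blocks);
		return;
	}

	simd_begin();
	transform_simd(state, src, blocks);
	simd_end();
}
```

The fallback is checked before the SIMD section is opened, so the
scalar path never runs between the begin/end pair.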



* [PATCH 02/16] crypto: arm64/sm3 - add NEON assembly implementation
@ 2022-09-26  9:36   ` Tianjia Zhang
From: Tianjia Zhang @ 2022-09-26  9:36 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
	Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
	Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
	linux-stm32

This patch adds a NEON-accelerated implementation of the SM3 hash
algorithm. The core algorithm is based on the NEON-accelerated SM3
implementation in the libgcrypt project.

Benchmarked on a T-Head Yitian-710 at 2.75 GHz using mode 326 of
tcrypt, comparing sm3-neon with sm3-generic and sm3-ce. The columns are
update sizes (blocks of different lengths) and the throughput unit is
Mb/s:

update-size    |      16      64     256    1024    2048    4096    8192
---------------+--------------------------------------------------------
sm3-generic    |  185.24  221.28  301.26  307.43  300.83  308.82  308.91
sm3-neon       |  171.81  220.20  322.94  339.28  334.09  343.61  343.87
sm3-ce         |  227.48  333.48  502.62  527.87  520.45  534.91  535.40
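
As a reading aid for the table above, the relative gain of one driver
over another is the ratio of their throughputs. The sketch below
computes it as a percentage from the 16-byte and 8192-byte columns; the
helper names gain_percent and print_gains are made up for illustration
and are not part of the patch.

```c
#include <stdio.h>

/* Relative throughput of `other` versus `base`, in percent.
 * Positive means `other` is faster. Hypothetical helper. */
double gain_percent(double base, double other)
{
	return (other / base - 1.0) * 100.0;
}

/* Print a few comparisons using figures copied from the table (Mb/s). */
void print_gains(void)
{
	printf("neon vs generic,   16 B: %+.1f%%\n",
	       gain_percent(185.24, 171.81));
	printf("neon vs generic, 8192 B: %+.1f%%\n",
	       gain_percent(308.91, 343.87));
	printf("ce   vs generic, 8192 B: %+.1f%%\n",
	       gain_percent(308.91, 535.40));
}
```

From these figures, sm3-neon is about 7% slower than sm3-generic for
16-byte updates and roughly 11% faster for 8192-byte updates, while
sm3-ce is around 73% faster than sm3-generic at 8192 bytes.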

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
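Note on the SCHED_W_* macros in sm3-neon-core.S: they vectorise the SM3
message expansion, three words per NEON register, and store each W[j]
together with W[j] ^ W[j+4] on the stack for the round code to load. A
scalar reference sketch of that expansion may help when checking the
vector code. It is plain user-space C; the names rol32, p1 and
sm3_expand are illustrative and not part of the patch.

```c
#include <stdint.h>

/* Rotate left; callers only pass 0 < n < 32. */
uint32_t rol32(uint32_t x, unsigned int n)
{
	return (x << n) | (x >> (32 - n));
}

/* SM3 permutation P1, used only by the message expansion. */
uint32_t p1(uint32_t x)
{
	return x ^ rol32(x, 15) ^ rol32(x, 23);
}

/* Expand one 64-byte block into W[0..67] and W'[0..63]. */
void sm3_expand(const uint8_t block[64], uint32_t w[68], uint32_t w1[64])
{
	int j;

	/* Big-endian load, the scalar counterpart of rev32. */
	for (j = 0; j < 16; j++)
		w[j] = ((uint32_t)block[4 * j] << 24) |
		       ((uint32_t)block[4 * j + 1] << 16) |
		       ((uint32_t)block[4 * j + 2] << 8) |
		       (uint32_t)block[4 * j + 3];

	for (j = 16; j < 68; j++)
		w[j] = p1(w[j - 16] ^ w[j - 9] ^ rol32(w[j - 3], 15)) ^
		       rol32(w[j - 13], 7) ^ w[j - 6];

	/* W'[j] is what the round macro adds into d ("w1w2"). */
	for (j = 0; j < 64; j++)
		w1[j] = w[j] ^ w[j + 4];
}
```

The assembly computes the same words, but interleaves the expansion of
later rounds with the round function of earlier ones.
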
 arch/arm64/crypto/Kconfig         |  11 +
 arch/arm64/crypto/Makefile        |   3 +
 arch/arm64/crypto/sm3-neon-core.S | 600 ++++++++++++++++++++++++++++++
 arch/arm64/crypto/sm3-neon-glue.c | 103 +++++
 4 files changed, 717 insertions(+)
 create mode 100644 arch/arm64/crypto/sm3-neon-core.S
 create mode 100644 arch/arm64/crypto/sm3-neon-glue.c
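
Note on the round-function macros in sm3-neon-core.S: for rounds 16-63,
GG2_* computes (z & ~x) ^ (y & x) and FF2_* computes
((x ^ y) & z) ^ (x & y). These are equivalent to the specification
forms GG(x,y,z) = (x & y) | (~x & z) and
FF(x,y,z) = (x & y) | (x & z) | (y & z). The user-space sketch below
checks the equivalence; the function names are illustrative only.
Because all operations are bitwise, testing the 8 combinations of
all-zero and all-one words covers every case.

```c
#include <stdint.h>

/* Specification forms for rounds 16..63. */
uint32_t gg2_spec(uint32_t x, uint32_t y, uint32_t z)
{
	return (x & y) | (~x & z);
}

uint32_t ff2_spec(uint32_t x, uint32_t y, uint32_t z)
{
	return (x & y) | (x & z) | (y & z);
}

/* Form used by GG2_1..3: bic, and, eor. */
uint32_t gg2_macro(uint32_t x, uint32_t y, uint32_t z)
{
	uint32_t o = z & ~x;	/* bic o, z, x */
	uint32_t t = y & x;	/* and t, y, x */

	return o ^ t;		/* eor o, o, t */
}

/* Form used by FF2_1..3: eor, and, and, eor. */
uint32_t ff2_macro(uint32_t x, uint32_t y, uint32_t z)
{
	uint32_t o = x ^ y;	/* eor o, x, y */
	uint32_t t = x & y;	/* and t, x, y */

	o &= z;			/* and o, o, z */
	return o ^ t;		/* eor o, o, t */
}

/* Returns 1 if both pairs agree for every combination of input bits. */
int macros_match_spec(void)
{
	unsigned int m;

	for (m = 0; m < 8; m++) {
		uint32_t x = (m & 1) ? 0xffffffffu : 0;
		uint32_t y = (m & 2) ? 0xffffffffu : 0;
		uint32_t z = (m & 4) ? 0xffffffffu : 0;

		if (gg2_spec(x, y, z) != gg2_macro(x, y, z) ||
		    ff2_spec(x, y, z) != ff2_macro(x, y, z))
			return 0;
	}
	return 1;
}
```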

diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index 8bd80508a710..4b121dc0cfba 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -96,6 +96,17 @@ config CRYPTO_SHA3_ARM64
 	  Architecture: arm64 using:
 	  - ARMv8.2 Crypto Extensions
 
+config CRYPTO_SM3_NEON
+	tristate "Hash functions: SM3 (NEON)"
+	depends on KERNEL_MODE_NEON
+	select CRYPTO_HASH
+	select CRYPTO_SM3
+	help
+	  SM3 (ShangMi 3) secure hash function (OSCCA GM/T 0004-2012)
+
+	  Architecture: arm64 using:
+	  - NEON (Advanced SIMD) extensions
+
 config CRYPTO_SM3_ARM64_CE
 	tristate "Hash functions: SM3 (ARMv8.2 Crypto Extensions)"
 	depends on KERNEL_MODE_NEON
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index 24bb0c4610de..087f1625e775 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -17,6 +17,9 @@ sha512-ce-y := sha512-ce-glue.o sha512-ce-core.o
 obj-$(CONFIG_CRYPTO_SHA3_ARM64) += sha3-ce.o
 sha3-ce-y := sha3-ce-glue.o sha3-ce-core.o
 
+obj-$(CONFIG_CRYPTO_SM3_NEON) += sm3-neon.o
+sm3-neon-y := sm3-neon-glue.o sm3-neon-core.o
+
 obj-$(CONFIG_CRYPTO_SM3_ARM64_CE) += sm3-ce.o
 sm3-ce-y := sm3-ce-glue.o sm3-ce-core.o
 
diff --git a/arch/arm64/crypto/sm3-neon-core.S b/arch/arm64/crypto/sm3-neon-core.S
new file mode 100644
index 000000000000..3e3b4e5c736f
--- /dev/null
+++ b/arch/arm64/crypto/sm3-neon-core.S
@@ -0,0 +1,600 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * sm3-neon-core.S - SM3 secure hash using NEON instructions
+ *
+ * Linux/arm64 port of the libgcrypt SM3 implementation for AArch64
+ *
+ * Copyright (C) 2021 Jussi Kivilinna <jussi.kivilinna@iki.fi>
+ * Copyright (c) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+
+/* Context structure */
+
+#define state_h0 0
+#define state_h1 4
+#define state_h2 8
+#define state_h3 12
+#define state_h4 16
+#define state_h5 20
+#define state_h6 24
+#define state_h7 28
+
+/* Stack structure */
+
+#define STACK_W_SIZE        (32 * 2 * 3)
+
+#define STACK_W             (0)
+#define STACK_SIZE          (STACK_W + STACK_W_SIZE)
+
+/* Register macros */
+
+#define RSTATE x0
+#define RDATA  x1
+#define RNBLKS x2
+#define RKPTR  x28
+#define RFRAME x29
+
+#define ra w3
+#define rb w4
+#define rc w5
+#define rd w6
+#define re w7
+#define rf w8
+#define rg w9
+#define rh w10
+
+#define t0 w11
+#define t1 w12
+#define t2 w13
+#define t3 w14
+#define t4 w15
+#define t5 w16
+#define t6 w17
+
+#define k_even w19
+#define k_odd w20
+
+#define addr0 x21
+#define addr1 x22
+
+#define s0 w23
+#define s1 w24
+#define s2 w25
+#define s3 w26
+
+#define W0 v0
+#define W1 v1
+#define W2 v2
+#define W3 v3
+#define W4 v4
+#define W5 v5
+
+#define XTMP0 v6
+#define XTMP1 v7
+#define XTMP2 v16
+#define XTMP3 v17
+#define XTMP4 v18
+#define XTMP5 v19
+#define XTMP6 v20
+
+/* Helper macros. */
+
+#define _(...) /*_*/
+
+#define clear_vec(x) \
+	movi	x.8h, #0;
+
+#define rolw(o, a, n) \
+	ror	o, a, #(32 - n);
+
+/* Round function macros. */
+
+#define GG1_1(x, y, z, o, t) \
+	eor	o, x, y;
+#define GG1_2(x, y, z, o, t) \
+	eor	o, o, z;
+#define GG1_3(x, y, z, o, t)
+
+#define FF1_1(x, y, z, o, t) GG1_1(x, y, z, o, t)
+#define FF1_2(x, y, z, o, t)
+#define FF1_3(x, y, z, o, t) GG1_2(x, y, z, o, t)
+
+#define GG2_1(x, y, z, o, t) \
+	bic	o, z, x;
+#define GG2_2(x, y, z, o, t) \
+	and	t, y, x;
+#define GG2_3(x, y, z, o, t) \
+	eor	o, o, t;
+
+#define FF2_1(x, y, z, o, t) \
+	eor	o, x, y;
+#define FF2_2(x, y, z, o, t) \
+	and	t, x, y; \
+	and	o, o, z;
+#define FF2_3(x, y, z, o, t) \
+	eor	o, o, t;
+
+#define R(i, a, b, c, d, e, f, g, h, k, K_LOAD, round, widx, wtype, IOP, iop_param) \
+	K_LOAD(round);                                                        \
+	ldr	t5, [sp, #(wtype##_W1_ADDR(round, widx))];                    \
+	rolw(t0, a, 12);                              /* rol(a, 12) => t0 */  \
+      IOP(1, iop_param);                                                      \
+	FF##i##_1(a, b, c, t1, t2);                                           \
+	ldr	t6, [sp, #(wtype##_W1W2_ADDR(round, widx))];                  \
+	add	k, k, e;                                                      \
+      IOP(2, iop_param);                                                      \
+	GG##i##_1(e, f, g, t3, t4);                                           \
+	FF##i##_2(a, b, c, t1, t2);                                           \
+      IOP(3, iop_param);                                                      \
+	add	k, k, t0;                                                     \
+	add	h, h, t5;                                                     \
+	add	d, d, t6;                     /* w1w2 + d => d */             \
+      IOP(4, iop_param);                                                      \
+	rolw(k, k, 7);                        /* rol((t0 + e + t), 7) => k */ \
+	GG##i##_2(e, f, g, t3, t4);                                           \
+	add	h, h, k;                      /* h + w1 + k => h */           \
+      IOP(5, iop_param);                                                      \
+	FF##i##_3(a, b, c, t1, t2);                                           \
+	eor	t0, t0, k;                    /* k ^ t0 => t0 */              \
+	GG##i##_3(e, f, g, t3, t4);                                           \
+	add	d, d, t1;                     /* FF(a,b,c) + d => d */        \
+      IOP(6, iop_param);                                                      \
+	add	t3, t3, h;                    /* GG(e,f,g) + h => t3 */       \
+	rolw(b, b, 9);                        /* rol(b, 9) => b */            \
+	eor	h, t3, t3, ror #(32-9);                                       \
+      IOP(7, iop_param);                                                      \
+	add	d, d, t0;                     /* t0 + d => d */               \
+	rolw(f, f, 19);                       /* rol(f, 19) => f */           \
+      IOP(8, iop_param);                                                      \
+	eor	h, h, t3, ror #(32-17);       /* P0(t3) => h */
+
+#define R1(a, b, c, d, e, f, g, h, k, K_LOAD, round, widx, wtype, IOP, iop_param) \
+	R(1, ##a, ##b, ##c, ##d, ##e, ##f, ##g, ##h, ##k, K_LOAD, round, widx, wtype, IOP, iop_param)
+
+#define R2(a, b, c, d, e, f, g, h, k, K_LOAD, round, widx, wtype, IOP, iop_param) \
+	R(2, ##a, ##b, ##c, ##d, ##e, ##f, ##g, ##h, ##k, K_LOAD, round, widx, wtype, IOP, iop_param)
+
+#define KL(round) \
+	ldp	k_even, k_odd, [RKPTR, #(4*(round))];
+
+/* Input expansion macros. */
+
+/* Byte-swapped input address. */
+#define IW_W_ADDR(round, widx, offs) \
+	(STACK_W + ((round) / 4) * 64 + (offs) + ((widx) * 4))
+
+/* Expanded input address. */
+#define XW_W_ADDR(round, widx, offs) \
+	(STACK_W + ((((round) / 3) - 4) % 2) * 64 + (offs) + ((widx) * 4))
+
+/* Rounds 1-12, byte-swapped input block addresses. */
+#define IW_W1_ADDR(round, widx)   IW_W_ADDR(round, widx, 32)
+#define IW_W1W2_ADDR(round, widx) IW_W_ADDR(round, widx, 48)
+
+/* Rounds 13-64, expanded input block addresses. */
+#define XW_W1_ADDR(round, widx)   XW_W_ADDR(round, widx, 0)
+#define XW_W1W2_ADDR(round, widx) XW_W_ADDR(round, widx, 16)
+
+/* Input block loading.
+ * Interleaving within round function needed for in-order CPUs. */
+#define LOAD_W_VEC_1_1() \
+	add	addr0, sp, #IW_W1_ADDR(0, 0);
+#define LOAD_W_VEC_1_2() \
+	add	addr1, sp, #IW_W1_ADDR(4, 0);
+#define LOAD_W_VEC_1_3() \
+	ld1	{W0.16b}, [RDATA], #16;
+#define LOAD_W_VEC_1_4() \
+	ld1	{W1.16b}, [RDATA], #16;
+#define LOAD_W_VEC_1_5() \
+	ld1	{W2.16b}, [RDATA], #16;
+#define LOAD_W_VEC_1_6() \
+	ld1	{W3.16b}, [RDATA], #16;
+#define LOAD_W_VEC_1_7() \
+	rev32	XTMP0.16b, W0.16b;
+#define LOAD_W_VEC_1_8() \
+	rev32	XTMP1.16b, W1.16b;
+#define LOAD_W_VEC_2_1() \
+	rev32	XTMP2.16b, W2.16b;
+#define LOAD_W_VEC_2_2() \
+	rev32	XTMP3.16b, W3.16b;
+#define LOAD_W_VEC_2_3() \
+	eor	XTMP4.16b, XTMP1.16b, XTMP0.16b;
+#define LOAD_W_VEC_2_4() \
+	eor	XTMP5.16b, XTMP2.16b, XTMP1.16b;
+#define LOAD_W_VEC_2_5() \
+	st1	{XTMP0.16b}, [addr0], #16;
+#define LOAD_W_VEC_2_6() \
+	st1	{XTMP4.16b}, [addr0]; \
+	add	addr0, sp, #IW_W1_ADDR(8, 0);
+#define LOAD_W_VEC_2_7() \
+	eor	XTMP6.16b, XTMP3.16b, XTMP2.16b;
+#define LOAD_W_VEC_2_8() \
+	ext	W0.16b, XTMP0.16b, XTMP0.16b, #8;  /* W0: xx, w0, xx, xx */
+#define LOAD_W_VEC_3_1() \
+	mov	W2.16b, XTMP1.16b;                 /* W2: xx, w6, w5, w4 */
+#define LOAD_W_VEC_3_2() \
+	st1	{XTMP1.16b}, [addr1], #16;
+#define LOAD_W_VEC_3_3() \
+	st1	{XTMP5.16b}, [addr1]; \
+	ext	W1.16b, XTMP0.16b, XTMP0.16b, #4;  /* W1: xx, w3, w2, w1 */
+#define LOAD_W_VEC_3_4() \
+	ext	W3.16b, XTMP1.16b, XTMP2.16b, #12; /* W3: xx, w9, w8, w7 */
+#define LOAD_W_VEC_3_5() \
+	ext	W4.16b, XTMP2.16b, XTMP3.16b, #8;  /* W4: xx, w12, w11, w10 */
+#define LOAD_W_VEC_3_6() \
+	st1	{XTMP2.16b}, [addr0], #16;
+#define LOAD_W_VEC_3_7() \
+	st1	{XTMP6.16b}, [addr0];
+#define LOAD_W_VEC_3_8() \
+	ext	W5.16b, XTMP3.16b, XTMP3.16b, #4;  /* W5: xx, w15, w14, w13 */
+
+#define LOAD_W_VEC_1(iop_num, ...) \
+	LOAD_W_VEC_1_##iop_num()
+#define LOAD_W_VEC_2(iop_num, ...) \
+	LOAD_W_VEC_2_##iop_num()
+#define LOAD_W_VEC_3(iop_num, ...) \
+	LOAD_W_VEC_3_##iop_num()
+
+/* Message scheduling. Note: 3 words per vector register.
+ * Interleaving within round function needed for in-order CPUs. */
+#define SCHED_W_1_1(round, w0, w1, w2, w3, w4, w5) \
+	/* Load (w[i - 16]) => XTMP0 */            \
+	/* Load (w[i - 13]) => XTMP5 */            \
+	ext	XTMP0.16b, w0.16b, w0.16b, #12;    /* XTMP0: w0, xx, xx, xx */
+#define SCHED_W_1_2(round, w0, w1, w2, w3, w4, w5) \
+	ext	XTMP5.16b, w1.16b, w1.16b, #12;
+#define SCHED_W_1_3(round, w0, w1, w2, w3, w4, w5) \
+	ext	XTMP0.16b, XTMP0.16b, w1.16b, #12; /* XTMP0: xx, w2, w1, w0 */
+#define SCHED_W_1_4(round, w0, w1, w2, w3, w4, w5) \
+	ext	XTMP5.16b, XTMP5.16b, w2.16b, #12;
+#define SCHED_W_1_5(round, w0, w1, w2, w3, w4, w5) \
+	/* w[i - 9] == w3 */                       \
+	/* W3 ^ XTMP0 => XTMP0 */                  \
+	eor	XTMP0.16b, XTMP0.16b, w3.16b;
+#define SCHED_W_1_6(round, w0, w1, w2, w3, w4, w5) \
+	/* w[i - 3] == w5 */                       \
+	/* rol(w5, 15) ^ XTMP0 => XTMP0 */         \
+	/* rol(XTMP5, 7) => XTMP1 */               \
+	add	addr0, sp, #XW_W1_ADDR((round), 0); \
+	shl	XTMP2.4s, w5.4s, #15;
+#define SCHED_W_1_7(round, w0, w1, w2, w3, w4, w5) \
+	shl	XTMP1.4s, XTMP5.4s, #7;
+#define SCHED_W_1_8(round, w0, w1, w2, w3, w4, w5) \
+	sri	XTMP2.4s, w5.4s, #(32-15);
+#define SCHED_W_2_1(round, w0, w1, w2, w3, w4, w5) \
+	sri	XTMP1.4s, XTMP5.4s, #(32-7);
+#define SCHED_W_2_2(round, w0, w1, w2, w3, w4, w5) \
+	eor	XTMP0.16b, XTMP0.16b, XTMP2.16b;
+#define SCHED_W_2_3(round, w0, w1, w2, w3, w4, w5) \
+	/* w[i - 6] == W4 */                       \
+	/* W4 ^ XTMP1 => XTMP1 */                  \
+	eor	XTMP1.16b, XTMP1.16b, w4.16b;
+#define SCHED_W_2_4(round, w0, w1, w2, w3, w4, w5) \
+	/* P1(XTMP0) ^ XTMP1 => W0 */              \
+	shl	XTMP3.4s, XTMP0.4s, #15;
+#define SCHED_W_2_5(round, w0, w1, w2, w3, w4, w5) \
+	shl	XTMP4.4s, XTMP0.4s, #23;
+#define SCHED_W_2_6(round, w0, w1, w2, w3, w4, w5) \
+	eor	w0.16b, XTMP1.16b, XTMP0.16b;
+#define SCHED_W_2_7(round, w0, w1, w2, w3, w4, w5) \
+	sri	XTMP3.4s, XTMP0.4s, #(32-15);
+#define SCHED_W_2_8(round, w0, w1, w2, w3, w4, w5) \
+	sri	XTMP4.4s, XTMP0.4s, #(32-23);
+#define SCHED_W_3_1(round, w0, w1, w2, w3, w4, w5) \
+	eor	w0.16b, w0.16b, XTMP3.16b;
+#define SCHED_W_3_2(round, w0, w1, w2, w3, w4, w5) \
+	/* Load (w[i - 3]) => XTMP2 */             \
+	ext	XTMP2.16b, w4.16b, w4.16b, #12;
+#define SCHED_W_3_3(round, w0, w1, w2, w3, w4, w5) \
+	eor	w0.16b, w0.16b, XTMP4.16b;
+#define SCHED_W_3_4(round, w0, w1, w2, w3, w4, w5) \
+	ext	XTMP2.16b, XTMP2.16b, w5.16b, #12;
+#define SCHED_W_3_5(round, w0, w1, w2, w3, w4, w5) \
+	/* W1 ^ W2 => XTMP3 */                     \
+	eor	XTMP3.16b, XTMP2.16b, w0.16b;
+#define SCHED_W_3_6(round, w0, w1, w2, w3, w4, w5)
+#define SCHED_W_3_7(round, w0, w1, w2, w3, w4, w5) \
+	st1	{XTMP2.16b-XTMP3.16b}, [addr0];
+#define SCHED_W_3_8(round, w0, w1, w2, w3, w4, w5)
+
+#define SCHED_W_W0W1W2W3W4W5_1(iop_num, round) \
+	SCHED_W_1_##iop_num(round, W0, W1, W2, W3, W4, W5)
+#define SCHED_W_W0W1W2W3W4W5_2(iop_num, round) \
+	SCHED_W_2_##iop_num(round, W0, W1, W2, W3, W4, W5)
+#define SCHED_W_W0W1W2W3W4W5_3(iop_num, round) \
+	SCHED_W_3_##iop_num(round, W0, W1, W2, W3, W4, W5)
+
+#define SCHED_W_W1W2W3W4W5W0_1(iop_num, round) \
+	SCHED_W_1_##iop_num(round, W1, W2, W3, W4, W5, W0)
+#define SCHED_W_W1W2W3W4W5W0_2(iop_num, round) \
+	SCHED_W_2_##iop_num(round, W1, W2, W3, W4, W5, W0)
+#define SCHED_W_W1W2W3W4W5W0_3(iop_num, round) \
+	SCHED_W_3_##iop_num(round, W1, W2, W3, W4, W5, W0)
+
+#define SCHED_W_W2W3W4W5W0W1_1(iop_num, round) \
+	SCHED_W_1_##iop_num(round, W2, W3, W4, W5, W0, W1)
+#define SCHED_W_W2W3W4W5W0W1_2(iop_num, round) \
+	SCHED_W_2_##iop_num(round, W2, W3, W4, W5, W0, W1)
+#define SCHED_W_W2W3W4W5W0W1_3(iop_num, round) \
+	SCHED_W_3_##iop_num(round, W2, W3, W4, W5, W0, W1)
+
+#define SCHED_W_W3W4W5W0W1W2_1(iop_num, round) \
+	SCHED_W_1_##iop_num(round, W3, W4, W5, W0, W1, W2)
+#define SCHED_W_W3W4W5W0W1W2_2(iop_num, round) \
+	SCHED_W_2_##iop_num(round, W3, W4, W5, W0, W1, W2)
+#define SCHED_W_W3W4W5W0W1W2_3(iop_num, round) \
+	SCHED_W_3_##iop_num(round, W3, W4, W5, W0, W1, W2)
+
+#define SCHED_W_W4W5W0W1W2W3_1(iop_num, round) \
+	SCHED_W_1_##iop_num(round, W4, W5, W0, W1, W2, W3)
+#define SCHED_W_W4W5W0W1W2W3_2(iop_num, round) \
+	SCHED_W_2_##iop_num(round, W4, W5, W0, W1, W2, W3)
+#define SCHED_W_W4W5W0W1W2W3_3(iop_num, round) \
+	SCHED_W_3_##iop_num(round, W4, W5, W0, W1, W2, W3)
+
+#define SCHED_W_W5W0W1W2W3W4_1(iop_num, round) \
+	SCHED_W_1_##iop_num(round, W5, W0, W1, W2, W3, W4)
+#define SCHED_W_W5W0W1W2W3W4_2(iop_num, round) \
+	SCHED_W_2_##iop_num(round, W5, W0, W1, W2, W3, W4)
+#define SCHED_W_W5W0W1W2W3W4_3(iop_num, round) \
+	SCHED_W_3_##iop_num(round, W5, W0, W1, W2, W3, W4)
+
+
+	/*
+	 * Transform blocks*64 bytes (blocks*16 32-bit words) at 'src'.
+	 *
+	 * void sm3_neon_transform(struct sm3_state *sst, u8 const *src,
+	 *                         int blocks)
+	 */
+	.text
+.align 3
+SYM_FUNC_START(sm3_neon_transform)
+	ldp		ra, rb, [RSTATE, #0]
+	ldp		rc, rd, [RSTATE, #8]
+	ldp		re, rf, [RSTATE, #16]
+	ldp		rg, rh, [RSTATE, #24]
+
+	stp		x28, x29, [sp, #-16]!
+	stp		x19, x20, [sp, #-16]!
+	stp		x21, x22, [sp, #-16]!
+	stp		x23, x24, [sp, #-16]!
+	stp		x25, x26, [sp, #-16]!
+	mov		RFRAME, sp
+
+	sub		addr0, sp, #STACK_SIZE
+	adr_l		RKPTR, .LKtable
+	and		sp, addr0, #(~63)
+
+	/* Preload first block. */
+	LOAD_W_VEC_1(1, 0)
+	LOAD_W_VEC_1(2, 0)
+	LOAD_W_VEC_1(3, 0)
+	LOAD_W_VEC_1(4, 0)
+	LOAD_W_VEC_1(5, 0)
+	LOAD_W_VEC_1(6, 0)
+	LOAD_W_VEC_1(7, 0)
+	LOAD_W_VEC_1(8, 0)
+	LOAD_W_VEC_2(1, 0)
+	LOAD_W_VEC_2(2, 0)
+	LOAD_W_VEC_2(3, 0)
+	LOAD_W_VEC_2(4, 0)
+	LOAD_W_VEC_2(5, 0)
+	LOAD_W_VEC_2(6, 0)
+	LOAD_W_VEC_2(7, 0)
+	LOAD_W_VEC_2(8, 0)
+	LOAD_W_VEC_3(1, 0)
+	LOAD_W_VEC_3(2, 0)
+	LOAD_W_VEC_3(3, 0)
+	LOAD_W_VEC_3(4, 0)
+	LOAD_W_VEC_3(5, 0)
+	LOAD_W_VEC_3(6, 0)
+	LOAD_W_VEC_3(7, 0)
+	LOAD_W_VEC_3(8, 0)
+
+.balign 16
+.Loop:
+	/* Transform 0-3 */
+	R1(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 0, 0, IW, _, 0)
+	R1(rd, ra, rb, rc, rh, re, rf, rg, k_odd,  _,  1, 1, IW, _, 0)
+	R1(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 2, 2, IW, _, 0)
+	R1(rb, rc, rd, ra, rf, rg, rh, re, k_odd,  _,  3, 3, IW, _, 0)
+
+	/* Transform 4-7 + Precalc 12-14 */
+	R1(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 4, 0, IW, _, 0)
+	R1(rd, ra, rb, rc, rh, re, rf, rg, k_odd,  _,  5, 1, IW, _, 0)
+	R1(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 6, 2, IW, SCHED_W_W0W1W2W3W4W5_1, 12)
+	R1(rb, rc, rd, ra, rf, rg, rh, re, k_odd,  _,  7, 3, IW, SCHED_W_W0W1W2W3W4W5_2, 12)
+
+	/* Transform 8-11 + Precalc 12-17 */
+	R1(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 8, 0, IW, SCHED_W_W0W1W2W3W4W5_3, 12)
+	R1(rd, ra, rb, rc, rh, re, rf, rg, k_odd,  _,  9, 1, IW, SCHED_W_W1W2W3W4W5W0_1, 15)
+	R1(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 10, 2, IW, SCHED_W_W1W2W3W4W5W0_2, 15)
+	R1(rb, rc, rd, ra, rf, rg, rh, re, k_odd,  _,  11, 3, IW, SCHED_W_W1W2W3W4W5W0_3, 15)
+
+	/* Transform 12-14 + Precalc 18-20 */
+	R1(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 12, 0, XW, SCHED_W_W2W3W4W5W0W1_1, 18)
+	R1(rd, ra, rb, rc, rh, re, rf, rg, k_odd,  _,  13, 1, XW, SCHED_W_W2W3W4W5W0W1_2, 18)
+	R1(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 14, 2, XW, SCHED_W_W2W3W4W5W0W1_3, 18)
+
+	/* Transform 15-17 + Precalc 21-23 */
+	R1(rb, rc, rd, ra, rf, rg, rh, re, k_odd,  _,  15, 0, XW, SCHED_W_W3W4W5W0W1W2_1, 21)
+	R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 16, 1, XW, SCHED_W_W3W4W5W0W1W2_2, 21)
+	R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd,  _,  17, 2, XW, SCHED_W_W3W4W5W0W1W2_3, 21)
+
+	/* Transform 18-20 + Precalc 24-26 */
+	R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 18, 0, XW, SCHED_W_W4W5W0W1W2W3_1, 24)
+	R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd,  _,  19, 1, XW, SCHED_W_W4W5W0W1W2W3_2, 24)
+	R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 20, 2, XW, SCHED_W_W4W5W0W1W2W3_3, 24)
+
+	/* Transform 21-23 + Precalc 27-29 */
+	R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd,  _,  21, 0, XW, SCHED_W_W5W0W1W2W3W4_1, 27)
+	R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 22, 1, XW, SCHED_W_W5W0W1W2W3W4_2, 27)
+	R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd,  _,  23, 2, XW, SCHED_W_W5W0W1W2W3W4_3, 27)
+
+	/* Transform 24-26 + Precalc 30-32 */
+	R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 24, 0, XW, SCHED_W_W0W1W2W3W4W5_1, 30)
+	R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd,  _,  25, 1, XW, SCHED_W_W0W1W2W3W4W5_2, 30)
+	R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 26, 2, XW, SCHED_W_W0W1W2W3W4W5_3, 30)
+
+	/* Transform 27-29 + Precalc 33-35 */
+	R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd,  _,  27, 0, XW, SCHED_W_W1W2W3W4W5W0_1, 33)
+	R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 28, 1, XW, SCHED_W_W1W2W3W4W5W0_2, 33)
+	R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd,  _,  29, 2, XW, SCHED_W_W1W2W3W4W5W0_3, 33)
+
+	/* Transform 30-32 + Precalc 36-38 */
+	R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 30, 0, XW, SCHED_W_W2W3W4W5W0W1_1, 36)
+	R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd,  _,  31, 1, XW, SCHED_W_W2W3W4W5W0W1_2, 36)
+	R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 32, 2, XW, SCHED_W_W2W3W4W5W0W1_3, 36)
+
+	/* Transform 33-35 + Precalc 39-41 */
+	R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd,  _,  33, 0, XW, SCHED_W_W3W4W5W0W1W2_1, 39)
+	R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 34, 1, XW, SCHED_W_W3W4W5W0W1W2_2, 39)
+	R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd,  _,  35, 2, XW, SCHED_W_W3W4W5W0W1W2_3, 39)
+
+	/* Transform 36-38 + Precalc 42-44 */
+	R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 36, 0, XW, SCHED_W_W4W5W0W1W2W3_1, 42)
+	R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd,  _,  37, 1, XW, SCHED_W_W4W5W0W1W2W3_2, 42)
+	R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 38, 2, XW, SCHED_W_W4W5W0W1W2W3_3, 42)
+
+	/* Transform 39-41 + Precalc 45-47 */
+	R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd,  _,  39, 0, XW, SCHED_W_W5W0W1W2W3W4_1, 45)
+	R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 40, 1, XW, SCHED_W_W5W0W1W2W3W4_2, 45)
+	R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd,  _,  41, 2, XW, SCHED_W_W5W0W1W2W3W4_3, 45)
+
+	/* Transform 42-44 + Precalc 48-50 */
+	R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 42, 0, XW, SCHED_W_W0W1W2W3W4W5_1, 48)
+	R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd,  _,  43, 1, XW, SCHED_W_W0W1W2W3W4W5_2, 48)
+	R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 44, 2, XW, SCHED_W_W0W1W2W3W4W5_3, 48)
+
+	/* Transform 45-47 + Precalc 51-53 */
+	R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd,  _,  45, 0, XW, SCHED_W_W1W2W3W4W5W0_1, 51)
+	R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 46, 1, XW, SCHED_W_W1W2W3W4W5W0_2, 51)
+	R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd,  _,  47, 2, XW, SCHED_W_W1W2W3W4W5W0_3, 51)
+
+	/* Transform 48-50 + Precalc 54-56 */
+	R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 48, 0, XW, SCHED_W_W2W3W4W5W0W1_1, 54)
+	R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd,  _,  49, 1, XW, SCHED_W_W2W3W4W5W0W1_2, 54)
+	R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 50, 2, XW, SCHED_W_W2W3W4W5W0W1_3, 54)
+
+	/* Transform 51-53 + Precalc 57-59 */
+	R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd,  _,  51, 0, XW, SCHED_W_W3W4W5W0W1W2_1, 57)
+	R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 52, 1, XW, SCHED_W_W3W4W5W0W1W2_2, 57)
+	R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd,  _,  53, 2, XW, SCHED_W_W3W4W5W0W1W2_3, 57)
+
+	/* Transform 54-56 + Precalc 60-62 */
+	R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 54, 0, XW, SCHED_W_W4W5W0W1W2W3_1, 60)
+	R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd,  _,  55, 1, XW, SCHED_W_W4W5W0W1W2W3_2, 60)
+	R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 56, 2, XW, SCHED_W_W4W5W0W1W2W3_3, 60)
+
+	/* Transform 57-59 + Precalc 63 */
+	R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd,  _,  57, 0, XW, SCHED_W_W5W0W1W2W3W4_1, 63)
+	R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 58, 1, XW, SCHED_W_W5W0W1W2W3W4_2, 63)
+	R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd,  _,  59, 2, XW, SCHED_W_W5W0W1W2W3W4_3, 63)
+
+	/* Transform 60 */
+	R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 60, 0, XW, _, _)
+	subs		RNBLKS, RNBLKS, #1
+	b.eq		.Lend
+
+	/* Transform 61-63 + Preload next block */
+	R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd,  _,  61, 1, XW, LOAD_W_VEC_1, _)
+	ldp		s0, s1, [RSTATE, #0]
+	R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 62, 2, XW, LOAD_W_VEC_2, _)
+	ldp		s2, s3, [RSTATE, #8]
+	R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd,  _,  63, 0, XW, LOAD_W_VEC_3, _)
+
+	/* Update the chaining variables. */
+	eor		ra, ra, s0
+	eor		rb, rb, s1
+	ldp		s0, s1, [RSTATE, #16]
+	eor		rc, rc, s2
+	ldp		k_even, k_odd, [RSTATE, #24]
+	eor		rd, rd, s3
+	eor		re, re, s0
+	stp		ra, rb, [RSTATE, #0]
+	eor		rf, rf, s1
+	stp		rc, rd, [RSTATE, #8]
+	eor		rg, rg, k_even
+	stp		re, rf, [RSTATE, #16]
+	eor		rh, rh, k_odd
+	stp		rg, rh, [RSTATE, #24]
+	b		.Loop
+
+.Lend:
+	/* Transform 61-63 */
+	R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd,  _,  61, 1, XW, _, _)
+	ldp		s0, s1, [RSTATE, #0]
+	R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 62, 2, XW, _, _)
+	ldp		s2, s3, [RSTATE, #8]
+	R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd,  _,  63, 0, XW, _, _)
+
+	/* Update the chaining variables. */
+	eor		ra, ra, s0
+	clear_vec(W0)
+	eor		rb, rb, s1
+	clear_vec(W1)
+	ldp		s0, s1, [RSTATE, #16]
+	clear_vec(W2)
+	eor		rc, rc, s2
+	clear_vec(W3)
+	ldp		k_even, k_odd, [RSTATE, #24]
+	clear_vec(W4)
+	eor		rd, rd, s3
+	clear_vec(W5)
+	eor		re, re, s0
+	clear_vec(XTMP0)
+	stp		ra, rb, [RSTATE, #0]
+	clear_vec(XTMP1)
+	eor		rf, rf, s1
+	clear_vec(XTMP2)
+	stp		rc, rd, [RSTATE, #8]
+	clear_vec(XTMP3)
+	eor		rg, rg, k_even
+	clear_vec(XTMP4)
+	stp		re, rf, [RSTATE, #16]
+	clear_vec(XTMP5)
+	eor		rh, rh, k_odd
+	clear_vec(XTMP6)
+	stp		rg, rh, [RSTATE, #24]
+
+	/* Clear message expansion area */
+	add		addr0, sp, #STACK_W
+	st1		{W0.16b-W3.16b}, [addr0], #64
+	st1		{W0.16b-W3.16b}, [addr0], #64
+	st1		{W0.16b-W3.16b}, [addr0]
+
+	mov		sp, RFRAME
+
+	ldp		x25, x26, [sp], #16
+	ldp		x23, x24, [sp], #16
+	ldp		x21, x22, [sp], #16
+	ldp		x19, x20, [sp], #16
+	ldp		x28, x29, [sp], #16
+
+	ret
+SYM_FUNC_END(sm3_neon_transform)
+
+
+	.section	".rodata", "a"
+
+	.align 4
+.LKtable:
+	.long 0x79cc4519, 0xf3988a32, 0xe7311465, 0xce6228cb
+	.long 0x9cc45197, 0x3988a32f, 0x7311465e, 0xe6228cbc
+	.long 0xcc451979, 0x988a32f3, 0x311465e7, 0x6228cbce
+	.long 0xc451979c, 0x88a32f39, 0x11465e73, 0x228cbce6
+	.long 0x9d8a7a87, 0x3b14f50f, 0x7629ea1e, 0xec53d43c
+	.long 0xd8a7a879, 0xb14f50f3, 0x629ea1e7, 0xc53d43ce
+	.long 0x8a7a879d, 0x14f50f3b, 0x29ea1e76, 0x53d43cec
+	.long 0xa7a879d8, 0x4f50f3b1, 0x9ea1e762, 0x3d43cec5
+	.long 0x7a879d8a, 0xf50f3b14, 0xea1e7629, 0xd43cec53
+	.long 0xa879d8a7, 0x50f3b14f, 0xa1e7629e, 0x43cec53d
+	.long 0x879d8a7a, 0x0f3b14f5, 0x1e7629ea, 0x3cec53d4
+	.long 0x79d8a7a8, 0xf3b14f50, 0xe7629ea1, 0xcec53d43
+	.long 0x9d8a7a87, 0x3b14f50f, 0x7629ea1e, 0xec53d43c
+	.long 0xd8a7a879, 0xb14f50f3, 0x629ea1e7, 0xc53d43ce
+	.long 0x8a7a879d, 0x14f50f3b, 0x29ea1e76, 0x53d43cec
+	.long 0xa7a879d8, 0x4f50f3b1, 0x9ea1e762, 0x3d43cec5
diff --git a/arch/arm64/crypto/sm3-neon-glue.c b/arch/arm64/crypto/sm3-neon-glue.c
new file mode 100644
index 000000000000..7182ee683f14
--- /dev/null
+++ b/arch/arm64/crypto/sm3-neon-glue.c
@@ -0,0 +1,103 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * sm3-neon-glue.c - SM3 secure hash using NEON instructions
+ *
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#include <asm/neon.h>
+#include <asm/simd.h>
+#include <asm/unaligned.h>
+#include <crypto/internal/hash.h>
+#include <crypto/internal/simd.h>
+#include <crypto/sm3.h>
+#include <crypto/sm3_base.h>
+#include <linux/cpufeature.h>
+#include <linux/crypto.h>
+#include <linux/module.h>
+
+
+asmlinkage void sm3_neon_transform(struct sm3_state *sst, u8 const *src,
+				   int blocks);
+
+static int sm3_neon_update(struct shash_desc *desc, const u8 *data,
+			   unsigned int len)
+{
+	if (!crypto_simd_usable()) {
+		sm3_update(shash_desc_ctx(desc), data, len);
+		return 0;
+	}
+
+	kernel_neon_begin();
+	sm3_base_do_update(desc, data, len, sm3_neon_transform);
+	kernel_neon_end();
+
+	return 0;
+}
+
+static int sm3_neon_final(struct shash_desc *desc, u8 *out)
+{
+	if (!crypto_simd_usable()) {
+		sm3_final(shash_desc_ctx(desc), out);
+		return 0;
+	}
+
+	kernel_neon_begin();
+	sm3_base_do_finalize(desc, sm3_neon_transform);
+	kernel_neon_end();
+
+	return sm3_base_finish(desc, out);
+}
+
+static int sm3_neon_finup(struct shash_desc *desc, const u8 *data,
+			  unsigned int len, u8 *out)
+{
+	if (!crypto_simd_usable()) {
+		struct sm3_state *sctx = shash_desc_ctx(desc);
+
+		if (len)
+			sm3_update(sctx, data, len);
+		sm3_final(sctx, out);
+		return 0;
+	}
+
+	kernel_neon_begin();
+	if (len)
+		sm3_base_do_update(desc, data, len, sm3_neon_transform);
+	sm3_base_do_finalize(desc, sm3_neon_transform);
+	kernel_neon_end();
+
+	return sm3_base_finish(desc, out);
+}
+
+static struct shash_alg sm3_alg = {
+	.digestsize		= SM3_DIGEST_SIZE,
+	.init			= sm3_base_init,
+	.update			= sm3_neon_update,
+	.final			= sm3_neon_final,
+	.finup			= sm3_neon_finup,
+	.descsize		= sizeof(struct sm3_state),
+	.base.cra_name		= "sm3",
+	.base.cra_driver_name	= "sm3-neon",
+	.base.cra_blocksize	= SM3_BLOCK_SIZE,
+	.base.cra_module	= THIS_MODULE,
+	.base.cra_priority	= 200,
+};
+
+static int __init sm3_neon_init(void)
+{
+	return crypto_register_shash(&sm3_alg);
+}
+
+static void __exit sm3_neon_fini(void)
+{
+	crypto_unregister_shash(&sm3_alg);
+}
+
+module_init(sm3_neon_init);
+module_exit(sm3_neon_fini);
+
+MODULE_DESCRIPTION("SM3 secure hash using NEON instructions");
+MODULE_AUTHOR("Jussi Kivilinna <jussi.kivilinna@iki.fi>");
+MODULE_AUTHOR("Tianjia Zhang <tianjia.zhang@linux.alibaba.com>");
+MODULE_LICENSE("GPL v2");
-- 
2.24.3 (Apple Git-128)
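
A note on the .LKtable constants in the patch above: they appear to follow the SM3 round-constant rule K[j] = T_j <<< (j mod 32), with T_j = 0x79cc4519 for rounds 0-15 and 0x7a879d8a for rounds 16-63. The C below is only a minimal sketch for cross-checking the table; rotl32() and sm3_build_k_table() are hypothetical helper names and are not part of the patch.

```c
#include <stdint.h>

/* Rotate left by n bits; n is reduced mod 32 to avoid an undefined shift. */
static uint32_t rotl32(uint32_t x, unsigned int n)
{
	n &= 31;
	return n ? (x << n) | (x >> (32 - n)) : x;
}

/* Fill k[j] = T_j <<< (j mod 32), the layout assumed for .LKtable. */
static void sm3_build_k_table(uint32_t k[64])
{
	for (unsigned int j = 0; j < 64; j++) {
		uint32_t t = (j < 16) ? 0x79cc4519u : 0x7a879d8au;

		k[j] = rotl32(t, j);
	}
}
```

Because the rotation count wraps at 32, rows 13-16 of the table repeat rows 5-8.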


* [PATCH 03/16] crypto: arm64/sm4 - refactor and simplify NEON implementation
  2022-09-26  9:36 ` Tianjia Zhang
@ 2022-09-26  9:36   ` Tianjia Zhang
  -1 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26  9:36 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
	Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
	Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
	linux-stm32

This patch adds no new features. It refactors and simplifies the SM4
NEON implementation in the following ways:

The accelerated implementation now supports an arbitrary number of
blocks, not just multiples of 8. This simplifies the implementation and
speeds up inputs whose length is not a multiple of 8 blocks.
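
As a rough illustration only (this is not code from the patch), the way the reworked routines consume an arbitrary block count can be modelled in C. The struct and function names are hypothetical; in the assembly the same decisions are made by the .Lcrypt_loop_8x, .Lcrypt_4x and .Lcrypt_tail paths.

```c
/* How many passes of each width a given block count is split into. */
struct sm4_split {
	unsigned int x8;	/* number of 8-block passes */
	unsigned int x4;	/* 0 or 1 four-block pass */
	unsigned int tail;	/* 0..3 blocks left for the tail path */
};

static struct sm4_split sm4_split_blocks(unsigned int nblocks)
{
	struct sm4_split s = { 0, 0, 0 };

	/* Take 8 blocks at a time while at least 8 remain. */
	while (nblocks >= 8) {
		s.x8++;
		nblocks -= 8;
	}

	/* At most one 4-block pass can follow. */
	if (nblocks >= 4) {
		s.x4 = 1;
		nblocks -= 4;
	}

	/* Whatever is left (1..3 blocks) goes through the tail path. */
	s.tail = nblocks;
	return s;
}
```

For instance, 23 blocks would be handled as two 8-block passes, one 4-block pass and a 3-block tail. The glue code only calls the assembly with at least one block.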

When loading input data, use the ld4 instruction instead of ld1
wherever possible, which avoids a separate matrix transposition of the
input data.
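
The reason ld4 removes the transposition can be sketched with a small C model of the two loads, treating four 16-byte blocks as sixteen 32-bit words. The model_* names are hypothetical and byte order within each word is ignored (the rev32 step still handles that); only the lane placement is modelled.

```c
#include <stdint.h>

/* ld1 {v0.16b-v3.16b}: register r receives block r unchanged. */
static void model_ld1(uint32_t v[4][4], const uint32_t mem[16])
{
	for (int r = 0; r < 4; r++)
		for (int l = 0; l < 4; l++)
			v[r][l] = mem[4 * r + l];
}

/* ld4 {v0.4s-v3.4s}: register r receives word r of every block. */
static void model_ld4(uint32_t v[4][4], const uint32_t mem[16])
{
	for (int r = 0; r < 4; r++)
		for (int l = 0; l < 4; l++)
			v[r][l] = mem[4 * l + r];
}

/* What transpose_4x4() computes on four registers. */
static void model_transpose_4x4(uint32_t v[4][4])
{
	for (int r = 0; r < 4; r++)
		for (int l = r + 1; l < 4; l++) {
			uint32_t t = v[r][l];

			v[r][l] = v[l][r];
			v[l][r] = t;
		}
}
```

In this model, model_ld1() followed by model_transpose_4x4() gives the same register contents as a single model_ld4().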

Perform the matrix transpose and rotation operations on 8 blocks at a
time whenever possible, rather than on at most 4 blocks.
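
For readers unfamiliar with the rotation step, the following C model sketches what the zip1/zip2 sequence in rotate_clockwise_4x4() appears to compute: a transpose combined with a reversal of the word order, which matches SM4's final reversal of the four state words. The model_* names are hypothetical; the _2x model simply applies the same rotation to two independent groups of four registers.

```c
#include <stdint.h>

/* Model of rotate_clockwise_4x4(): out[r][l] = in[3 - l][r]. */
static void model_rotate_clockwise_4x4(uint32_t v[4][4])
{
	uint32_t in[4][4];

	for (int r = 0; r < 4; r++)
		for (int l = 0; l < 4; l++)
			in[r][l] = v[r][l];

	for (int r = 0; r < 4; r++)
		for (int l = 0; l < 4; l++)
			v[r][l] = in[3 - l][r];
}

/* Model of the _2x variant: two independent 4x4 rotations. */
static void model_rotate_clockwise_4x4_2x(uint32_t a[4][4], uint32_t b[4][4])
{
	model_rotate_clockwise_4x4(a);
	model_rotate_clockwise_4x4(b);
}
```

In the assembly the two rotations of the _2x macro are interleaved instruction by instruction, which is where the extra RTMP4-RTMP7 temporaries are needed.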

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
 arch/arm64/crypto/sm4-neon-core.S | 630 +++++++++++++++++++-----------
 arch/arm64/crypto/sm4-neon-glue.c | 172 +++-----
 2 files changed, 456 insertions(+), 346 deletions(-)

diff --git a/arch/arm64/crypto/sm4-neon-core.S b/arch/arm64/crypto/sm4-neon-core.S
index 3d5256b354d2..f295b4b7d70a 100644
--- a/arch/arm64/crypto/sm4-neon-core.S
+++ b/arch/arm64/crypto/sm4-neon-core.S
@@ -18,6 +18,11 @@
 #define RTMP2	v10
 #define RTMP3	v11
 
+#define RTMP4	v12
+#define RTMP5	v13
+#define RTMP6	v14
+#define RTMP7	v15
+
 #define RX0	v12
 #define RX1	v13
 #define RKEY	v14
@@ -25,7 +30,7 @@
 
 /* Helper macros. */
 
-#define PREPARE                                                 \
+#define SM4_PREPARE()                                           \
 	adr_l		x5, crypto_sm4_sbox;                    \
 	ld1		{v16.16b-v19.16b}, [x5], #64;           \
 	ld1		{v20.16b-v23.16b}, [x5], #64;           \
@@ -42,7 +47,25 @@
 	zip1		s2.2d, RTMP2.2d, RTMP3.2d;              \
 	zip2		s3.2d, RTMP2.2d, RTMP3.2d;
 
-#define rotate_clockwise_90(s0, s1, s2, s3)                     \
+#define transpose_4x4_2x(s0, s1, s2, s3, s4, s5, s6, s7)        \
+	zip1		RTMP0.4s, s0.4s, s1.4s;                 \
+	zip1		RTMP1.4s, s2.4s, s3.4s;                 \
+	zip2		RTMP2.4s, s0.4s, s1.4s;                 \
+	zip2		RTMP3.4s, s2.4s, s3.4s;                 \
+	zip1		RTMP4.4s, s4.4s, s5.4s;                 \
+	zip1		RTMP5.4s, s6.4s, s7.4s;                 \
+	zip2		RTMP6.4s, s4.4s, s5.4s;                 \
+	zip2		RTMP7.4s, s6.4s, s7.4s;                 \
+	zip1		s0.2d, RTMP0.2d, RTMP1.2d;              \
+	zip2		s1.2d, RTMP0.2d, RTMP1.2d;              \
+	zip1		s2.2d, RTMP2.2d, RTMP3.2d;              \
+	zip2		s3.2d, RTMP2.2d, RTMP3.2d;              \
+	zip1		s4.2d, RTMP4.2d, RTMP5.2d;              \
+	zip2		s5.2d, RTMP4.2d, RTMP5.2d;              \
+	zip1		s6.2d, RTMP6.2d, RTMP7.2d;              \
+	zip2		s7.2d, RTMP6.2d, RTMP7.2d;
+
+#define rotate_clockwise_4x4(s0, s1, s2, s3)                    \
 	zip1		RTMP0.4s, s1.4s, s0.4s;                 \
 	zip2		RTMP1.4s, s1.4s, s0.4s;                 \
 	zip1		RTMP2.4s, s3.4s, s2.4s;                 \
@@ -52,6 +75,24 @@
 	zip1		s2.2d, RTMP3.2d, RTMP1.2d;              \
 	zip2		s3.2d, RTMP3.2d, RTMP1.2d;
 
+#define rotate_clockwise_4x4_2x(s0, s1, s2, s3, s4, s5, s6, s7) \
+	zip1		RTMP0.4s, s1.4s, s0.4s;                 \
+	zip1		RTMP2.4s, s3.4s, s2.4s;                 \
+	zip2		RTMP1.4s, s1.4s, s0.4s;                 \
+	zip2		RTMP3.4s, s3.4s, s2.4s;                 \
+	zip1		RTMP4.4s, s5.4s, s4.4s;                 \
+	zip1		RTMP6.4s, s7.4s, s6.4s;                 \
+	zip2		RTMP5.4s, s5.4s, s4.4s;                 \
+	zip2		RTMP7.4s, s7.4s, s6.4s;                 \
+	zip1		s0.2d, RTMP2.2d, RTMP0.2d;              \
+	zip2		s1.2d, RTMP2.2d, RTMP0.2d;              \
+	zip1		s2.2d, RTMP3.2d, RTMP1.2d;              \
+	zip2		s3.2d, RTMP3.2d, RTMP1.2d;              \
+	zip1		s4.2d, RTMP6.2d, RTMP4.2d;              \
+	zip2		s5.2d, RTMP6.2d, RTMP4.2d;              \
+	zip1		s6.2d, RTMP7.2d, RTMP5.2d;              \
+	zip2		s7.2d, RTMP7.2d, RTMP5.2d;
+
 #define ROUND4(round, s0, s1, s2, s3)                           \
 	dup		RX0.4s, RKEY.s[round];                  \
 	/* rk ^ s1 ^ s2 ^ s3 */                                 \
@@ -87,14 +128,7 @@
 	/* s0 ^= RTMP3 */                                       \
 	eor		s0.16b, s0.16b, RTMP3.16b;
 
-#define SM4_CRYPT_BLK4(b0, b1, b2, b3)                          \
-	rev32		b0.16b, b0.16b;                         \
-	rev32		b1.16b, b1.16b;                         \
-	rev32		b2.16b, b2.16b;                         \
-	rev32		b3.16b, b3.16b;                         \
-                                                                \
-	transpose_4x4(b0, b1, b2, b3);                          \
-                                                                \
+#define SM4_CRYPT_BLK4_BE(b0, b1, b2, b3)                       \
 	mov		x6, 8;                                  \
 4:                                                              \
 	ld1		{RKEY.4s}, [x0], #16;                   \
@@ -107,15 +141,23 @@
                                                                 \
 	bne		4b;                                     \
                                                                 \
-	rotate_clockwise_90(b0, b1, b2, b3);                    \
 	rev32		b0.16b, b0.16b;                         \
 	rev32		b1.16b, b1.16b;                         \
 	rev32		b2.16b, b2.16b;                         \
 	rev32		b3.16b, b3.16b;                         \
                                                                 \
+	rotate_clockwise_4x4(b0, b1, b2, b3);                   \
+                                                                \
 	/* repoint to rkey */                                   \
 	sub		x0, x0, #128;
 
+#define SM4_CRYPT_BLK4(b0, b1, b2, b3)                          \
+	rev32		b0.16b, b0.16b;                         \
+	rev32		b1.16b, b1.16b;                         \
+	rev32		b2.16b, b2.16b;                         \
+	rev32		b3.16b, b3.16b;                         \
+	SM4_CRYPT_BLK4_BE(b0, b1, b2, b3);
+
 #define ROUND8(round, s0, s1, s2, s3, t0, t1, t2, t3)           \
 	/* rk ^ s1 ^ s2 ^ s3 */                                 \
 	dup		RX0.4s, RKEY.s[round];                  \
@@ -175,7 +217,7 @@
 	eor		s0.16b, s0.16b, RTMP0.16b;              \
 	eor		t0.16b, t0.16b, RTMP1.16b;
 
-#define SM4_CRYPT_BLK8(b0, b1, b2, b3, b4, b5, b6, b7)          \
+#define SM4_CRYPT_BLK8_norotate(b0, b1, b2, b3, b4, b5, b6, b7) \
 	rev32		b0.16b, b0.16b;                         \
 	rev32		b1.16b, b1.16b;                         \
 	rev32		b2.16b, b2.16b;                         \
@@ -185,9 +227,6 @@
 	rev32		b6.16b, b6.16b;                         \
 	rev32		b7.16b, b7.16b;                         \
                                                                 \
-	transpose_4x4(b0, b1, b2, b3);                          \
-	transpose_4x4(b4, b5, b6, b7);                          \
-                                                                \
 	mov		x6, 8;                                  \
 8:                                                              \
 	ld1		{RKEY.4s}, [x0], #16;                   \
@@ -200,8 +239,6 @@
                                                                 \
 	bne		8b;                                     \
                                                                 \
-	rotate_clockwise_90(b0, b1, b2, b3);                    \
-	rotate_clockwise_90(b4, b5, b6, b7);                    \
 	rev32		b0.16b, b0.16b;                         \
 	rev32		b1.16b, b1.16b;                         \
 	rev32		b2.16b, b2.16b;                         \
@@ -214,274 +251,429 @@
 	/* repoint to rkey */                                   \
 	sub		x0, x0, #128;
 
+#define SM4_CRYPT_BLK8(b0, b1, b2, b3, b4, b5, b6, b7)			\
+	SM4_CRYPT_BLK8_norotate(b0, b1, b2, b3, b4, b5, b6, b7);	\
+	rotate_clockwise_4x4_2x(b0, b1, b2, b3, b4, b5, b6, b7);	\
 
-.align 3
-SYM_FUNC_START_LOCAL(__sm4_neon_crypt_blk1_4)
-	/* input:
-	 *   x0: round key array, CTX
-	 *   x1: dst
-	 *   x2: src
-	 *   w3: num blocks (1..4)
-	 */
-	PREPARE;
-
-	ld1		{v0.16b}, [x2], #16;
-	mov		v1.16b, v0.16b;
-	mov		v2.16b, v0.16b;
-	mov		v3.16b, v0.16b;
-	cmp		w3, #2;
-	blt		.Lblk4_load_input_done;
-	ld1		{v1.16b}, [x2], #16;
-	beq		.Lblk4_load_input_done;
-	ld1		{v2.16b}, [x2], #16;
-	cmp		w3, #3;
-	beq		.Lblk4_load_input_done;
-	ld1		{v3.16b}, [x2];
-
-.Lblk4_load_input_done:
-	SM4_CRYPT_BLK4(v0, v1, v2, v3);
-
-	st1		{v0.16b}, [x1], #16;
-	cmp		w3, #2;
-	blt		.Lblk4_store_output_done;
-	st1		{v1.16b}, [x1], #16;
-	beq		.Lblk4_store_output_done;
-	st1		{v2.16b}, [x1], #16;
-	cmp		w3, #3;
-	beq		.Lblk4_store_output_done;
-	st1		{v3.16b}, [x1];
-
-.Lblk4_store_output_done:
-	ret;
-SYM_FUNC_END(__sm4_neon_crypt_blk1_4)
 
 .align 3
-SYM_FUNC_START(sm4_neon_crypt_blk1_8)
+SYM_FUNC_START(sm4_neon_crypt)
 	/* input:
 	 *   x0: round key array, CTX
 	 *   x1: dst
 	 *   x2: src
-	 *   w3: num blocks (1..8)
+	 *   w3: nblocks
 	 */
-	cmp		w3, #5;
-	blt		__sm4_neon_crypt_blk1_4;
-
-	PREPARE;
-
-	ld1		{v0.16b-v3.16b}, [x2], #64;
-	ld1		{v4.16b}, [x2], #16;
-	mov		v5.16b, v4.16b;
-	mov		v6.16b, v4.16b;
-	mov		v7.16b, v4.16b;
-	beq		.Lblk8_load_input_done;
-	ld1		{v5.16b}, [x2], #16;
-	cmp		w3, #7;
-	blt		.Lblk8_load_input_done;
-	ld1		{v6.16b}, [x2], #16;
-	beq		.Lblk8_load_input_done;
-	ld1		{v7.16b}, [x2];
-
-.Lblk8_load_input_done:
-	SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7);
-
-	cmp		w3, #6;
-	st1		{v0.16b-v3.16b}, [x1], #64;
-	st1		{v4.16b}, [x1], #16;
-	blt		.Lblk8_store_output_done;
-	st1		{v5.16b}, [x1], #16;
-	beq		.Lblk8_store_output_done;
-	st1		{v6.16b}, [x1], #16;
-	cmp		w3, #7;
-	beq		.Lblk8_store_output_done;
-	st1		{v7.16b}, [x1];
-
-.Lblk8_store_output_done:
-	ret;
-SYM_FUNC_END(sm4_neon_crypt_blk1_8)
+	SM4_PREPARE()
 
-.align 3
-SYM_FUNC_START(sm4_neon_crypt_blk8)
-	/* input:
-	 *   x0: round key array, CTX
-	 *   x1: dst
-	 *   x2: src
-	 *   w3: nblocks (multiples of 8)
-	 */
-	PREPARE;
+.Lcrypt_loop_8x:
+	sub		w3, w3, #8
+	tbnz		w3, #31, .Lcrypt_4x
+
+	ld4		{v0.4s-v3.4s}, [x2], #64
+	ld4		{v4.4s-v7.4s}, [x2], #64
 
-.Lcrypt_loop_blk:
-	subs		w3, w3, #8;
-	bmi		.Lcrypt_end;
+	SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7)
 
-	ld1		{v0.16b-v3.16b}, [x2], #64;
-	ld1		{v4.16b-v7.16b}, [x2], #64;
+	st1		{v0.16b-v3.16b}, [x1], #64
+	st1		{v4.16b-v7.16b}, [x1], #64
 
-	SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7);
+	cbz		w3, .Lcrypt_end
+	b		.Lcrypt_loop_8x
 
-	st1		{v0.16b-v3.16b}, [x1], #64;
-	st1		{v4.16b-v7.16b}, [x1], #64;
+.Lcrypt_4x:
+	add		w3, w3, #8
+	cmp		w3, #4
+	blt		.Lcrypt_tail
 
-	b		.Lcrypt_loop_blk;
+	sub		w3, w3, #4
+
+	ld4		{v0.4s-v3.4s}, [x2], #64
+
+	SM4_CRYPT_BLK4(v0, v1, v2, v3)
+
+	st1		{v0.16b-v3.16b}, [x1], #64
+
+	cbz		w3, .Lcrypt_end
+
+.Lcrypt_tail:
+	cmp		w3, #2
+	ld1		{v0.16b}, [x2], #16
+	blt		.Lcrypt_tail_load_done
+	ld1		{v1.16b}, [x2], #16
+	beq		.Lcrypt_tail_load_done
+	ld1		{v2.16b}, [x2], #16
+
+.Lcrypt_tail_load_done:
+	transpose_4x4(v0, v1, v2, v3)
+
+	SM4_CRYPT_BLK4(v0, v1, v2, v3)
+
+	cmp		w3, #2
+	st1		{v0.16b}, [x1], #16
+	blt		.Lcrypt_end
+	st1		{v1.16b}, [x1], #16
+	beq		.Lcrypt_end
+	st1		{v2.16b}, [x1], #16
 
 .Lcrypt_end:
-	ret;
-SYM_FUNC_END(sm4_neon_crypt_blk8)
+	ret
+SYM_FUNC_END(sm4_neon_crypt)
 
 .align 3
-SYM_FUNC_START(sm4_neon_cbc_dec_blk8)
+SYM_FUNC_START(sm4_neon_cbc_dec)
 	/* input:
 	 *   x0: round key array, CTX
 	 *   x1: dst
 	 *   x2: src
 	 *   x3: iv (big endian, 128 bit)
-	 *   w4: nblocks (multiples of 8)
+	 *   w4: nblocks
 	 */
-	PREPARE;
+	SM4_PREPARE()
+
+	ld1		{RIV.16b}, [x3]
+
+.Lcbc_dec_loop_8x:
+	sub		w4, w4, #8
+	tbnz		w4, #31, .Lcbc_dec_4x
+
+	ld4		{v0.4s-v3.4s}, [x2], #64
+	ld4		{v4.4s-v7.4s}, [x2]
+
+	SM4_CRYPT_BLK8_norotate(v0, v1, v2, v3, v4, v5, v6, v7)
+
+	/* Avoid overwriting the RIV register */
+	rotate_clockwise_4x4(v0, v1, v2, v3)
+	rotate_clockwise_4x4(v4, v5, v6, v7)
+
+	sub		x2, x2, #64
+
+	eor		v0.16b, v0.16b, RIV.16b
 
-	ld1		{RIV.16b}, [x3];
+	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64
+	ld1		{RTMP4.16b-RTMP7.16b}, [x2], #64
 
-.Lcbc_loop_blk:
-	subs		w4, w4, #8;
-	bmi		.Lcbc_end;
+	eor		v1.16b, v1.16b, RTMP0.16b
+	eor		v2.16b, v2.16b, RTMP1.16b
+	eor		v3.16b, v3.16b, RTMP2.16b
+	eor		v4.16b, v4.16b, RTMP3.16b
+	eor		v5.16b, v5.16b, RTMP4.16b
+	eor		v6.16b, v6.16b, RTMP5.16b
+	eor		v7.16b, v7.16b, RTMP6.16b
 
-	ld1		{v0.16b-v3.16b}, [x2], #64;
-	ld1		{v4.16b-v7.16b}, [x2];
+	mov		RIV.16b, RTMP7.16b
 
-	SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7);
+	st1		{v0.16b-v3.16b}, [x1], #64
+	st1		{v4.16b-v7.16b}, [x1], #64
 
-	sub		x2, x2, #64;
-	eor		v0.16b, v0.16b, RIV.16b;
-	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64;
-	eor		v1.16b, v1.16b, RTMP0.16b;
-	eor		v2.16b, v2.16b, RTMP1.16b;
-	eor		v3.16b, v3.16b, RTMP2.16b;
-	st1		{v0.16b-v3.16b}, [x1], #64;
+	cbz		w4, .Lcbc_dec_end
+	b		.Lcbc_dec_loop_8x
 
-	eor		v4.16b, v4.16b, RTMP3.16b;
-	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64;
-	eor		v5.16b, v5.16b, RTMP0.16b;
-	eor		v6.16b, v6.16b, RTMP1.16b;
-	eor		v7.16b, v7.16b, RTMP2.16b;
+.Lcbc_dec_4x:
+	add		w4, w4, #8
+	cmp		w4, #4
+	blt		.Lcbc_dec_tail
 
-	mov		RIV.16b, RTMP3.16b;
-	st1		{v4.16b-v7.16b}, [x1], #64;
+	sub		w4, w4, #4
 
-	b		.Lcbc_loop_blk;
+	ld1		{v0.16b-v3.16b}, [x2], #64
 
-.Lcbc_end:
+	rev32		v4.16b, v0.16b
+	rev32		v5.16b, v1.16b
+	rev32		v6.16b, v2.16b
+	rev32		v7.16b, v3.16b
+
+	transpose_4x4(v4, v5, v6, v7)
+
+	SM4_CRYPT_BLK4_BE(v4, v5, v6, v7)
+
+	eor		v4.16b, v4.16b, RIV.16b
+	eor		v5.16b, v5.16b, v0.16b
+	eor		v6.16b, v6.16b, v1.16b
+	eor		v7.16b, v7.16b, v2.16b
+
+	mov		RIV.16b, v3.16b
+
+	st1		{v4.16b-v7.16b}, [x1], #64
+
+	cbz		w4, .Lcbc_dec_end
+
+.Lcbc_dec_tail:
+	cmp		w4, #2
+	ld1		{v0.16b}, [x2], #16
+	blt		.Lcbc_dec_tail_load_done
+	ld1		{v1.16b}, [x2], #16
+	beq		.Lcbc_dec_tail_load_done
+	ld1		{v2.16b}, [x2], #16
+
+.Lcbc_dec_tail_load_done:
+	rev32		v4.16b, v0.16b
+	rev32		v5.16b, v1.16b
+	rev32		v6.16b, v2.16b
+
+	transpose_4x4(v4, v5, v6, v7)
+
+	SM4_CRYPT_BLK4_BE(v4, v5, v6, v7)
+
+	cmp		w4, #2
+	eor		v4.16b, v4.16b, RIV.16b
+	mov		RIV.16b, v0.16b
+	st1		{v4.16b}, [x1], #16
+	blt		.Lcbc_dec_end
+
+	eor		v5.16b, v5.16b, v0.16b
+	mov		RIV.16b, v1.16b
+	st1		{v5.16b}, [x1], #16
+	beq		.Lcbc_dec_end
+
+	eor		v6.16b, v6.16b, v1.16b
+	mov		RIV.16b, v2.16b
+	st1		{v6.16b}, [x1], #16
+
+.Lcbc_dec_end:
 	/* store new IV */
-	st1		{RIV.16b}, [x3];
+	st1		{RIV.16b}, [x3]
 
-	ret;
-SYM_FUNC_END(sm4_neon_cbc_dec_blk8)
+	ret
+SYM_FUNC_END(sm4_neon_cbc_dec)
 
 .align 3
-SYM_FUNC_START(sm4_neon_cfb_dec_blk8)
+SYM_FUNC_START(sm4_neon_cfb_dec)
 	/* input:
 	 *   x0: round key array, CTX
 	 *   x1: dst
 	 *   x2: src
 	 *   x3: iv (big endian, 128 bit)
-	 *   w4: nblocks (multiples of 8)
+	 *   w4: nblocks
 	 */
-	PREPARE;
+	SM4_PREPARE()
+
+	ld1		{v0.16b}, [x3]
+
+.Lcfb_dec_loop_8x:
+	sub		w4, w4, #8
+	tbnz		w4, #31, .Lcfb_dec_4x
+
+	ld1		{v1.16b-v3.16b}, [x2], #48
+	ld4		{v4.4s-v7.4s}, [x2]
+
+	transpose_4x4(v0, v1, v2, v3)
+
+	SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7)
+
+	sub		x2, x2, #48
+	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64
+	ld1		{RTMP4.16b-RTMP7.16b}, [x2], #64
+
+	eor		v0.16b, v0.16b, RTMP0.16b
+	eor		v1.16b, v1.16b, RTMP1.16b
+	eor		v2.16b, v2.16b, RTMP2.16b
+	eor		v3.16b, v3.16b, RTMP3.16b
+	eor		v4.16b, v4.16b, RTMP4.16b
+	eor		v5.16b, v5.16b, RTMP5.16b
+	eor		v6.16b, v6.16b, RTMP6.16b
+	eor		v7.16b, v7.16b, RTMP7.16b
+
+	st1		{v0.16b-v3.16b}, [x1], #64
+	st1		{v4.16b-v7.16b}, [x1], #64
+
+	mov		v0.16b, RTMP7.16b
+
+	cbz		w4, .Lcfb_dec_end
+	b		.Lcfb_dec_loop_8x
+
+.Lcfb_dec_4x:
+	add		w4, w4, #8
+	cmp		w4, #4
+	blt		.Lcfb_dec_tail
+
+	sub		w4, w4, #4
+
+	ld1		{v4.16b-v7.16b}, [x2], #64
+
+	rev32		v0.16b, v0.16b		/* v0 is IV register */
+	rev32		v1.16b, v4.16b
+	rev32		v2.16b, v5.16b
+	rev32		v3.16b, v6.16b
+
+	transpose_4x4(v0, v1, v2, v3)
+
+	SM4_CRYPT_BLK4_BE(v0, v1, v2, v3)
 
-	ld1		{v0.16b}, [x3];
+	eor		v0.16b, v0.16b, v4.16b
+	eor		v1.16b, v1.16b, v5.16b
+	eor		v2.16b, v2.16b, v6.16b
+	eor		v3.16b, v3.16b, v7.16b
 
-.Lcfb_loop_blk:
-	subs		w4, w4, #8;
-	bmi		.Lcfb_end;
+	st1		{v0.16b-v3.16b}, [x1], #64
 
-	ld1		{v1.16b, v2.16b, v3.16b}, [x2], #48;
-	ld1		{v4.16b-v7.16b}, [x2];
+	mov		v0.16b, v7.16b
 
-	SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7);
+	cbz		w4, .Lcfb_dec_end
 
-	sub		x2, x2, #48;
-	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64;
-	eor		v0.16b, v0.16b, RTMP0.16b;
-	eor		v1.16b, v1.16b, RTMP1.16b;
-	eor		v2.16b, v2.16b, RTMP2.16b;
-	eor		v3.16b, v3.16b, RTMP3.16b;
-	st1		{v0.16b-v3.16b}, [x1], #64;
+.Lcfb_dec_tail:
+	cmp		w4, #2
+	ld1		{v4.16b}, [x2], #16
+	blt		.Lcfb_dec_tail_load_done
+	ld1		{v5.16b}, [x2], #16
+	beq		.Lcfb_dec_tail_load_done
+	ld1		{v6.16b}, [x2], #16
 
-	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64;
-	eor		v4.16b, v4.16b, RTMP0.16b;
-	eor		v5.16b, v5.16b, RTMP1.16b;
-	eor		v6.16b, v6.16b, RTMP2.16b;
-	eor		v7.16b, v7.16b, RTMP3.16b;
-	st1		{v4.16b-v7.16b}, [x1], #64;
+.Lcfb_dec_tail_load_done:
+	rev32		v0.16b, v0.16b		/* v0 is IV register */
+	rev32		v1.16b, v4.16b
+	rev32		v2.16b, v5.16b
 
-	mov		v0.16b, RTMP3.16b;
+	transpose_4x4(v0, v1, v2, v3)
 
-	b		.Lcfb_loop_blk;
+	SM4_CRYPT_BLK4_BE(v0, v1, v2, v3)
 
-.Lcfb_end:
+	cmp		w4, #2
+	eor		v0.16b, v0.16b, v4.16b
+	st1		{v0.16b}, [x1], #16
+	mov		v0.16b, v4.16b
+	blt		.Lcfb_dec_end
+
+	eor		v1.16b, v1.16b, v5.16b
+	st1		{v1.16b}, [x1], #16
+	mov		v0.16b, v5.16b
+	beq		.Lcfb_dec_end
+
+	eor		v2.16b, v2.16b, v6.16b
+	st1		{v2.16b}, [x1], #16
+	mov		v0.16b, v6.16b
+
+.Lcfb_dec_end:
 	/* store new IV */
-	st1		{v0.16b}, [x3];
+	st1		{v0.16b}, [x3]
 
-	ret;
-SYM_FUNC_END(sm4_neon_cfb_dec_blk8)
+	ret
+SYM_FUNC_END(sm4_neon_cfb_dec)
 
 .align 3
-SYM_FUNC_START(sm4_neon_ctr_enc_blk8)
+SYM_FUNC_START(sm4_neon_ctr_crypt)
 	/* input:
 	 *   x0: round key array, CTX
 	 *   x1: dst
 	 *   x2: src
 	 *   x3: ctr (big endian, 128 bit)
-	 *   w4: nblocks (multiples of 8)
+	 *   w4: nblocks
 	 */
-	PREPARE;
+	SM4_PREPARE()
 
-	ldp		x7, x8, [x3];
-	rev		x7, x7;
-	rev		x8, x8;
+	ldp		x7, x8, [x3]
+	rev		x7, x7
+	rev		x8, x8
 
-.Lctr_loop_blk:
-	subs		w4, w4, #8;
-	bmi		.Lctr_end;
+.Lctr_crypt_loop_8x:
+	sub		w4, w4, #8
+	tbnz		w4, #31, .Lctr_crypt_4x
 
-#define inc_le128(vctr)                     \
-	mov		vctr.d[1], x8;      \
-	mov		vctr.d[0], x7;      \
-	adds		x8, x8, #1;         \
-	adc		x7, x7, xzr;        \
-	rev64		vctr.16b, vctr.16b;
+#define inc_le128(vctr)                             \
+		mov		vctr.d[1], x8;      \
+		mov		vctr.d[0], x7;      \
+		adds		x8, x8, #1;         \
+		rev64		vctr.16b, vctr.16b; \
+		adc		x7, x7, xzr;
 
 	/* construct CTRs */
-	inc_le128(v0);			/* +0 */
-	inc_le128(v1);			/* +1 */
-	inc_le128(v2);			/* +2 */
-	inc_le128(v3);			/* +3 */
-	inc_le128(v4);			/* +4 */
-	inc_le128(v5);			/* +5 */
-	inc_le128(v6);			/* +6 */
-	inc_le128(v7);			/* +7 */
-
-	SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7);
-
-	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64;
-	eor		v0.16b, v0.16b, RTMP0.16b;
-	eor		v1.16b, v1.16b, RTMP1.16b;
-	eor		v2.16b, v2.16b, RTMP2.16b;
-	eor		v3.16b, v3.16b, RTMP3.16b;
-	st1		{v0.16b-v3.16b}, [x1], #64;
-
-	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64;
-	eor		v4.16b, v4.16b, RTMP0.16b;
-	eor		v5.16b, v5.16b, RTMP1.16b;
-	eor		v6.16b, v6.16b, RTMP2.16b;
-	eor		v7.16b, v7.16b, RTMP3.16b;
-	st1		{v4.16b-v7.16b}, [x1], #64;
-
-	b		.Lctr_loop_blk;
-
-.Lctr_end:
+	inc_le128(v0)			/* +0 */
+	inc_le128(v1)			/* +1 */
+	inc_le128(v2)			/* +2 */
+	inc_le128(v3)			/* +3 */
+	inc_le128(v4)			/* +4 */
+	inc_le128(v5)			/* +5 */
+	inc_le128(v6)			/* +6 */
+	inc_le128(v7)			/* +7 */
+
+	transpose_4x4_2x(v0, v1, v2, v3, v4, v5, v6, v7)
+
+	SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7)
+
+	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64
+	ld1		{RTMP4.16b-RTMP7.16b}, [x2], #64
+
+	eor		v0.16b, v0.16b, RTMP0.16b
+	eor		v1.16b, v1.16b, RTMP1.16b
+	eor		v2.16b, v2.16b, RTMP2.16b
+	eor		v3.16b, v3.16b, RTMP3.16b
+	eor		v4.16b, v4.16b, RTMP4.16b
+	eor		v5.16b, v5.16b, RTMP5.16b
+	eor		v6.16b, v6.16b, RTMP6.16b
+	eor		v7.16b, v7.16b, RTMP7.16b
+
+	st1		{v0.16b-v3.16b}, [x1], #64
+	st1		{v4.16b-v7.16b}, [x1], #64
+
+	cbz		w4, .Lctr_crypt_end
+	b		.Lctr_crypt_loop_8x
+
+.Lctr_crypt_4x:
+	add		w4, w4, #8
+	cmp		w4, #4
+	blt		.Lctr_crypt_tail
+
+	sub		w4, w4, #4
+
+	/* construct CTRs */
+	inc_le128(v0)			/* +0 */
+	inc_le128(v1)			/* +1 */
+	inc_le128(v2)			/* +2 */
+	inc_le128(v3)			/* +3 */
+
+	ld1		{v4.16b-v7.16b}, [x2], #64
+
+	transpose_4x4(v0, v1, v2, v3)
+
+	SM4_CRYPT_BLK4(v0, v1, v2, v3)
+
+	eor		v0.16b, v0.16b, v4.16b
+	eor		v1.16b, v1.16b, v5.16b
+	eor		v2.16b, v2.16b, v6.16b
+	eor		v3.16b, v3.16b, v7.16b
+
+	st1		{v0.16b-v3.16b}, [x1], #64
+
+	cbz		w4, .Lctr_crypt_end
+
+.Lctr_crypt_tail:
+	/* inc_le128 clobbers the condition flags, so compare after it */
+	ld1		{v4.16b}, [x2], #16
+	inc_le128(v0)
+	cmp		w4, #2
+	blt		.Lctr_crypt_tail_load_done
+
+	ld1		{v5.16b}, [x2], #16
+	inc_le128(v1)
+	cmp		w4, #2
+	beq		.Lctr_crypt_tail_load_done
+
+	ld1		{v6.16b}, [x2], #16
+	inc_le128(v2)
+
+.Lctr_crypt_tail_load_done:
+	transpose_4x4(v0, v1, v2, v3)
+
+	SM4_CRYPT_BLK4(v0, v1, v2, v3)
+
+	cmp		w4, #2
+
+	eor		v0.16b, v0.16b, v4.16b
+	st1		{v0.16b}, [x1], #16
+	blt		.Lctr_crypt_end
+
+	eor		v1.16b, v1.16b, v5.16b
+	st1		{v1.16b}, [x1], #16
+	beq		.Lctr_crypt_end
+
+	eor		v2.16b, v2.16b, v6.16b
+	st1		{v2.16b}, [x1], #16
+
+.Lctr_crypt_end:
 	/* store new CTR */
-	rev		x7, x7;
-	rev		x8, x8;
-	stp		x7, x8, [x3];
+	rev		x7, x7
+	rev		x8, x8
+	stp		x7, x8, [x3]
 
-	ret;
-SYM_FUNC_END(sm4_neon_ctr_enc_blk8)
+	ret
+SYM_FUNC_END(sm4_neon_ctr_crypt)
diff --git a/arch/arm64/crypto/sm4-neon-glue.c b/arch/arm64/crypto/sm4-neon-glue.c
index 03a6a6866a31..7b19accf5c03 100644
--- a/arch/arm64/crypto/sm4-neon-glue.c
+++ b/arch/arm64/crypto/sm4-neon-glue.c
@@ -18,19 +18,14 @@
 #include <crypto/internal/skcipher.h>
 #include <crypto/sm4.h>
 
-#define BYTES2BLKS(nbytes)	((nbytes) >> 4)
-#define BYTES2BLK8(nbytes)	(((nbytes) >> 4) & ~(8 - 1))
-
-asmlinkage void sm4_neon_crypt_blk1_8(const u32 *rkey, u8 *dst, const u8 *src,
-				      unsigned int nblks);
-asmlinkage void sm4_neon_crypt_blk8(const u32 *rkey, u8 *dst, const u8 *src,
-				    unsigned int nblks);
-asmlinkage void sm4_neon_cbc_dec_blk8(const u32 *rkey, u8 *dst, const u8 *src,
-				      u8 *iv, unsigned int nblks);
-asmlinkage void sm4_neon_cfb_dec_blk8(const u32 *rkey, u8 *dst, const u8 *src,
-				      u8 *iv, unsigned int nblks);
-asmlinkage void sm4_neon_ctr_enc_blk8(const u32 *rkey, u8 *dst, const u8 *src,
-				      u8 *iv, unsigned int nblks);
+asmlinkage void sm4_neon_crypt(const u32 *rkey, u8 *dst, const u8 *src,
+			       unsigned int nblocks);
+asmlinkage void sm4_neon_cbc_dec(const u32 *rkey_dec, u8 *dst, const u8 *src,
+				 u8 *iv, unsigned int nblocks);
+asmlinkage void sm4_neon_cfb_dec(const u32 *rkey_enc, u8 *dst, const u8 *src,
+				 u8 *iv, unsigned int nblocks);
+asmlinkage void sm4_neon_ctr_crypt(const u32 *rkey_enc, u8 *dst, const u8 *src,
+				   u8 *iv, unsigned int nblocks);
 
 static int sm4_setkey(struct crypto_skcipher *tfm, const u8 *key,
 		      unsigned int key_len)
@@ -51,27 +46,18 @@ static int sm4_ecb_do_crypt(struct skcipher_request *req, const u32 *rkey)
 	while ((nbytes = walk.nbytes) > 0) {
 		const u8 *src = walk.src.virt.addr;
 		u8 *dst = walk.dst.virt.addr;
-		unsigned int nblks;
+		unsigned int nblocks;
 
-		kernel_neon_begin();
+		nblocks = nbytes / SM4_BLOCK_SIZE;
+		if (nblocks) {
+			kernel_neon_begin();
 
-		nblks = BYTES2BLK8(nbytes);
-		if (nblks) {
-			sm4_neon_crypt_blk8(rkey, dst, src, nblks);
-			dst += nblks * SM4_BLOCK_SIZE;
-			src += nblks * SM4_BLOCK_SIZE;
-			nbytes -= nblks * SM4_BLOCK_SIZE;
-		}
+			sm4_neon_crypt(rkey, dst, src, nblocks);
 
-		nblks = BYTES2BLKS(nbytes);
-		if (nblks) {
-			sm4_neon_crypt_blk1_8(rkey, dst, src, nblks);
-			nbytes -= nblks * SM4_BLOCK_SIZE;
+			kernel_neon_end();
 		}
 
-		kernel_neon_end();
-
-		err = skcipher_walk_done(&walk, nbytes);
+		err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE);
 	}
 
 	return err;
@@ -138,48 +124,19 @@ static int sm4_cbc_decrypt(struct skcipher_request *req)
 	while ((nbytes = walk.nbytes) > 0) {
 		const u8 *src = walk.src.virt.addr;
 		u8 *dst = walk.dst.virt.addr;
-		unsigned int nblks;
+		unsigned int nblocks;
 
-		kernel_neon_begin();
+		nblocks = nbytes / SM4_BLOCK_SIZE;
+		if (nblocks) {
+			kernel_neon_begin();
 
-		nblks = BYTES2BLK8(nbytes);
-		if (nblks) {
-			sm4_neon_cbc_dec_blk8(ctx->rkey_dec, dst, src,
-					walk.iv, nblks);
-			dst += nblks * SM4_BLOCK_SIZE;
-			src += nblks * SM4_BLOCK_SIZE;
-			nbytes -= nblks * SM4_BLOCK_SIZE;
-		}
+			sm4_neon_cbc_dec(ctx->rkey_dec, dst, src,
+					 walk.iv, nblocks);
 
-		nblks = BYTES2BLKS(nbytes);
-		if (nblks) {
-			u8 keystream[SM4_BLOCK_SIZE * 8];
-			u8 iv[SM4_BLOCK_SIZE];
-			int i;
-
-			sm4_neon_crypt_blk1_8(ctx->rkey_dec, keystream,
-					src, nblks);
-
-			src += ((int)nblks - 2) * SM4_BLOCK_SIZE;
-			dst += (nblks - 1) * SM4_BLOCK_SIZE;
-			memcpy(iv, src + SM4_BLOCK_SIZE, SM4_BLOCK_SIZE);
-
-			for (i = nblks - 1; i > 0; i--) {
-				crypto_xor_cpy(dst, src,
-					&keystream[i * SM4_BLOCK_SIZE],
-					SM4_BLOCK_SIZE);
-				src -= SM4_BLOCK_SIZE;
-				dst -= SM4_BLOCK_SIZE;
-			}
-			crypto_xor_cpy(dst, walk.iv,
-					keystream, SM4_BLOCK_SIZE);
-			memcpy(walk.iv, iv, SM4_BLOCK_SIZE);
-			nbytes -= nblks * SM4_BLOCK_SIZE;
+			kernel_neon_end();
 		}
 
-		kernel_neon_end();
-
-		err = skcipher_walk_done(&walk, nbytes);
+		err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE);
 	}
 
 	return err;
@@ -238,41 +195,21 @@ static int sm4_cfb_decrypt(struct skcipher_request *req)
 	while ((nbytes = walk.nbytes) > 0) {
 		const u8 *src = walk.src.virt.addr;
 		u8 *dst = walk.dst.virt.addr;
-		unsigned int nblks;
+		unsigned int nblocks;
 
-		kernel_neon_begin();
+		nblocks = nbytes / SM4_BLOCK_SIZE;
+		if (nblocks) {
+			kernel_neon_begin();
 
-		nblks = BYTES2BLK8(nbytes);
-		if (nblks) {
-			sm4_neon_cfb_dec_blk8(ctx->rkey_enc, dst, src,
-					walk.iv, nblks);
-			dst += nblks * SM4_BLOCK_SIZE;
-			src += nblks * SM4_BLOCK_SIZE;
-			nbytes -= nblks * SM4_BLOCK_SIZE;
-		}
+			sm4_neon_cfb_dec(ctx->rkey_enc, dst, src,
+					 walk.iv, nblocks);
 
-		nblks = BYTES2BLKS(nbytes);
-		if (nblks) {
-			u8 keystream[SM4_BLOCK_SIZE * 8];
-
-			memcpy(keystream, walk.iv, SM4_BLOCK_SIZE);
-			if (nblks > 1)
-				memcpy(&keystream[SM4_BLOCK_SIZE], src,
-					(nblks - 1) * SM4_BLOCK_SIZE);
-			memcpy(walk.iv, src + (nblks - 1) * SM4_BLOCK_SIZE,
-				SM4_BLOCK_SIZE);
-
-			sm4_neon_crypt_blk1_8(ctx->rkey_enc, keystream,
-					keystream, nblks);
-
-			crypto_xor_cpy(dst, src, keystream,
-					nblks * SM4_BLOCK_SIZE);
-			dst += nblks * SM4_BLOCK_SIZE;
-			src += nblks * SM4_BLOCK_SIZE;
-			nbytes -= nblks * SM4_BLOCK_SIZE;
-		}
+			kernel_neon_end();
 
-		kernel_neon_end();
+			dst += nblocks * SM4_BLOCK_SIZE;
+			src += nblocks * SM4_BLOCK_SIZE;
+			nbytes -= nblocks * SM4_BLOCK_SIZE;
+		}
 
 		/* tail */
 		if (walk.nbytes == walk.total && nbytes > 0) {
@@ -302,40 +239,21 @@ static int sm4_ctr_crypt(struct skcipher_request *req)
 	while ((nbytes = walk.nbytes) > 0) {
 		const u8 *src = walk.src.virt.addr;
 		u8 *dst = walk.dst.virt.addr;
-		unsigned int nblks;
+		unsigned int nblocks;
 
-		kernel_neon_begin();
+		nblocks = nbytes / SM4_BLOCK_SIZE;
+		if (nblocks) {
+			kernel_neon_begin();
 
-		nblks = BYTES2BLK8(nbytes);
-		if (nblks) {
-			sm4_neon_ctr_enc_blk8(ctx->rkey_enc, dst, src,
-					walk.iv, nblks);
-			dst += nblks * SM4_BLOCK_SIZE;
-			src += nblks * SM4_BLOCK_SIZE;
-			nbytes -= nblks * SM4_BLOCK_SIZE;
-		}
+			sm4_neon_ctr_crypt(ctx->rkey_enc, dst, src,
+					   walk.iv, nblocks);
 
-		nblks = BYTES2BLKS(nbytes);
-		if (nblks) {
-			u8 keystream[SM4_BLOCK_SIZE * 8];
-			int i;
-
-			for (i = 0; i < nblks; i++) {
-				memcpy(&keystream[i * SM4_BLOCK_SIZE],
-					walk.iv, SM4_BLOCK_SIZE);
-				crypto_inc(walk.iv, SM4_BLOCK_SIZE);
-			}
-			sm4_neon_crypt_blk1_8(ctx->rkey_enc, keystream,
-					keystream, nblks);
-
-			crypto_xor_cpy(dst, src, keystream,
-					nblks * SM4_BLOCK_SIZE);
-			dst += nblks * SM4_BLOCK_SIZE;
-			src += nblks * SM4_BLOCK_SIZE;
-			nbytes -= nblks * SM4_BLOCK_SIZE;
-		}
+			kernel_neon_end();
 
-		kernel_neon_end();
+			dst += nblocks * SM4_BLOCK_SIZE;
+			src += nblocks * SM4_BLOCK_SIZE;
+			nbytes -= nblocks * SM4_BLOCK_SIZE;
+		}
 
 		/* tail */
 		if (walk.nbytes == walk.total && nbytes > 0) {
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 03/16] crypto: arm64/sm4 - refactor and simplify NEON implementation
@ 2022-09-26  9:36   ` Tianjia Zhang
  0 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26  9:36 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
	Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
	Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
	linux-stm32

This patch adds no new features. It refactors and simplifies the SM4
NEON implementation in the following ways:

The accelerated implementation now supports an arbitrary number of
blocks, not just multiples of 8. This simplifies the implementation and
speeds up inputs whose length is not a multiple of 8 blocks.

When loading input data, use the ld4 instruction instead of ld1 wherever
possible, which avoids the cost of transposing the input matrix.

Use 8-block parallelism wherever possible for the matrix transpose and
rotation operations, instead of at most 4-block parallelism.

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
 arch/arm64/crypto/sm4-neon-core.S | 630 +++++++++++++++++++-----------
 arch/arm64/crypto/sm4-neon-glue.c | 172 +++-----
 2 files changed, 456 insertions(+), 346 deletions(-)

diff --git a/arch/arm64/crypto/sm4-neon-core.S b/arch/arm64/crypto/sm4-neon-core.S
index 3d5256b354d2..f295b4b7d70a 100644
--- a/arch/arm64/crypto/sm4-neon-core.S
+++ b/arch/arm64/crypto/sm4-neon-core.S
@@ -18,6 +18,11 @@
 #define RTMP2	v10
 #define RTMP3	v11
 
+#define RTMP4	v12
+#define RTMP5	v13
+#define RTMP6	v14
+#define RTMP7	v15
+
 #define RX0	v12
 #define RX1	v13
 #define RKEY	v14
@@ -25,7 +30,7 @@
 
 /* Helper macros. */
 
-#define PREPARE                                                 \
+#define SM4_PREPARE()                                           \
 	adr_l		x5, crypto_sm4_sbox;                    \
 	ld1		{v16.16b-v19.16b}, [x5], #64;           \
 	ld1		{v20.16b-v23.16b}, [x5], #64;           \
@@ -42,7 +47,25 @@
 	zip1		s2.2d, RTMP2.2d, RTMP3.2d;              \
 	zip2		s3.2d, RTMP2.2d, RTMP3.2d;
 
-#define rotate_clockwise_90(s0, s1, s2, s3)                     \
+#define transpose_4x4_2x(s0, s1, s2, s3, s4, s5, s6, s7)        \
+	zip1		RTMP0.4s, s0.4s, s1.4s;                 \
+	zip1		RTMP1.4s, s2.4s, s3.4s;                 \
+	zip2		RTMP2.4s, s0.4s, s1.4s;                 \
+	zip2		RTMP3.4s, s2.4s, s3.4s;                 \
+	zip1		RTMP4.4s, s4.4s, s5.4s;                 \
+	zip1		RTMP5.4s, s6.4s, s7.4s;                 \
+	zip2		RTMP6.4s, s4.4s, s5.4s;                 \
+	zip2		RTMP7.4s, s6.4s, s7.4s;                 \
+	zip1		s0.2d, RTMP0.2d, RTMP1.2d;              \
+	zip2		s1.2d, RTMP0.2d, RTMP1.2d;              \
+	zip1		s2.2d, RTMP2.2d, RTMP3.2d;              \
+	zip2		s3.2d, RTMP2.2d, RTMP3.2d;              \
+	zip1		s4.2d, RTMP4.2d, RTMP5.2d;              \
+	zip2		s5.2d, RTMP4.2d, RTMP5.2d;              \
+	zip1		s6.2d, RTMP6.2d, RTMP7.2d;              \
+	zip2		s7.2d, RTMP6.2d, RTMP7.2d;
+
+#define rotate_clockwise_4x4(s0, s1, s2, s3)                    \
 	zip1		RTMP0.4s, s1.4s, s0.4s;                 \
 	zip2		RTMP1.4s, s1.4s, s0.4s;                 \
 	zip1		RTMP2.4s, s3.4s, s2.4s;                 \
@@ -52,6 +75,24 @@
 	zip1		s2.2d, RTMP3.2d, RTMP1.2d;              \
 	zip2		s3.2d, RTMP3.2d, RTMP1.2d;
 
+#define rotate_clockwise_4x4_2x(s0, s1, s2, s3, s4, s5, s6, s7) \
+	zip1		RTMP0.4s, s1.4s, s0.4s;                 \
+	zip1		RTMP2.4s, s3.4s, s2.4s;                 \
+	zip2		RTMP1.4s, s1.4s, s0.4s;                 \
+	zip2		RTMP3.4s, s3.4s, s2.4s;                 \
+	zip1		RTMP4.4s, s5.4s, s4.4s;                 \
+	zip1		RTMP6.4s, s7.4s, s6.4s;                 \
+	zip2		RTMP5.4s, s5.4s, s4.4s;                 \
+	zip2		RTMP7.4s, s7.4s, s6.4s;                 \
+	zip1		s0.2d, RTMP2.2d, RTMP0.2d;              \
+	zip2		s1.2d, RTMP2.2d, RTMP0.2d;              \
+	zip1		s2.2d, RTMP3.2d, RTMP1.2d;              \
+	zip2		s3.2d, RTMP3.2d, RTMP1.2d;              \
+	zip1		s4.2d, RTMP6.2d, RTMP4.2d;              \
+	zip2		s5.2d, RTMP6.2d, RTMP4.2d;              \
+	zip1		s6.2d, RTMP7.2d, RTMP5.2d;              \
+	zip2		s7.2d, RTMP7.2d, RTMP5.2d;
+
 #define ROUND4(round, s0, s1, s2, s3)                           \
 	dup		RX0.4s, RKEY.s[round];                  \
 	/* rk ^ s1 ^ s2 ^ s3 */                                 \
@@ -87,14 +128,7 @@
 	/* s0 ^= RTMP3 */                                       \
 	eor		s0.16b, s0.16b, RTMP3.16b;
 
-#define SM4_CRYPT_BLK4(b0, b1, b2, b3)                          \
-	rev32		b0.16b, b0.16b;                         \
-	rev32		b1.16b, b1.16b;                         \
-	rev32		b2.16b, b2.16b;                         \
-	rev32		b3.16b, b3.16b;                         \
-                                                                \
-	transpose_4x4(b0, b1, b2, b3);                          \
-                                                                \
+#define SM4_CRYPT_BLK4_BE(b0, b1, b2, b3)                       \
 	mov		x6, 8;                                  \
 4:                                                              \
 	ld1		{RKEY.4s}, [x0], #16;                   \
@@ -107,15 +141,23 @@
                                                                 \
 	bne		4b;                                     \
                                                                 \
-	rotate_clockwise_90(b0, b1, b2, b3);                    \
 	rev32		b0.16b, b0.16b;                         \
 	rev32		b1.16b, b1.16b;                         \
 	rev32		b2.16b, b2.16b;                         \
 	rev32		b3.16b, b3.16b;                         \
                                                                 \
+	rotate_clockwise_4x4(b0, b1, b2, b3);                   \
+                                                                \
 	/* repoint to rkey */                                   \
 	sub		x0, x0, #128;
 
+#define SM4_CRYPT_BLK4(b0, b1, b2, b3)                          \
+	rev32		b0.16b, b0.16b;                         \
+	rev32		b1.16b, b1.16b;                         \
+	rev32		b2.16b, b2.16b;                         \
+	rev32		b3.16b, b3.16b;                         \
+	SM4_CRYPT_BLK4_BE(b0, b1, b2, b3);
+
 #define ROUND8(round, s0, s1, s2, s3, t0, t1, t2, t3)           \
 	/* rk ^ s1 ^ s2 ^ s3 */                                 \
 	dup		RX0.4s, RKEY.s[round];                  \
@@ -175,7 +217,7 @@
 	eor		s0.16b, s0.16b, RTMP0.16b;              \
 	eor		t0.16b, t0.16b, RTMP1.16b;
 
-#define SM4_CRYPT_BLK8(b0, b1, b2, b3, b4, b5, b6, b7)          \
+#define SM4_CRYPT_BLK8_norotate(b0, b1, b2, b3, b4, b5, b6, b7) \
 	rev32		b0.16b, b0.16b;                         \
 	rev32		b1.16b, b1.16b;                         \
 	rev32		b2.16b, b2.16b;                         \
@@ -185,9 +227,6 @@
 	rev32		b6.16b, b6.16b;                         \
 	rev32		b7.16b, b7.16b;                         \
                                                                 \
-	transpose_4x4(b0, b1, b2, b3);                          \
-	transpose_4x4(b4, b5, b6, b7);                          \
-                                                                \
 	mov		x6, 8;                                  \
 8:                                                              \
 	ld1		{RKEY.4s}, [x0], #16;                   \
@@ -200,8 +239,6 @@
                                                                 \
 	bne		8b;                                     \
                                                                 \
-	rotate_clockwise_90(b0, b1, b2, b3);                    \
-	rotate_clockwise_90(b4, b5, b6, b7);                    \
 	rev32		b0.16b, b0.16b;                         \
 	rev32		b1.16b, b1.16b;                         \
 	rev32		b2.16b, b2.16b;                         \
@@ -214,274 +251,429 @@
 	/* repoint to rkey */                                   \
 	sub		x0, x0, #128;
 
+#define SM4_CRYPT_BLK8(b0, b1, b2, b3, b4, b5, b6, b7)			\
+	SM4_CRYPT_BLK8_norotate(b0, b1, b2, b3, b4, b5, b6, b7);	\
+	rotate_clockwise_4x4_2x(b0, b1, b2, b3, b4, b5, b6, b7);	\
 
-.align 3
-SYM_FUNC_START_LOCAL(__sm4_neon_crypt_blk1_4)
-	/* input:
-	 *   x0: round key array, CTX
-	 *   x1: dst
-	 *   x2: src
-	 *   w3: num blocks (1..4)
-	 */
-	PREPARE;
-
-	ld1		{v0.16b}, [x2], #16;
-	mov		v1.16b, v0.16b;
-	mov		v2.16b, v0.16b;
-	mov		v3.16b, v0.16b;
-	cmp		w3, #2;
-	blt		.Lblk4_load_input_done;
-	ld1		{v1.16b}, [x2], #16;
-	beq		.Lblk4_load_input_done;
-	ld1		{v2.16b}, [x2], #16;
-	cmp		w3, #3;
-	beq		.Lblk4_load_input_done;
-	ld1		{v3.16b}, [x2];
-
-.Lblk4_load_input_done:
-	SM4_CRYPT_BLK4(v0, v1, v2, v3);
-
-	st1		{v0.16b}, [x1], #16;
-	cmp		w3, #2;
-	blt		.Lblk4_store_output_done;
-	st1		{v1.16b}, [x1], #16;
-	beq		.Lblk4_store_output_done;
-	st1		{v2.16b}, [x1], #16;
-	cmp		w3, #3;
-	beq		.Lblk4_store_output_done;
-	st1		{v3.16b}, [x1];
-
-.Lblk4_store_output_done:
-	ret;
-SYM_FUNC_END(__sm4_neon_crypt_blk1_4)
 
 .align 3
-SYM_FUNC_START(sm4_neon_crypt_blk1_8)
+SYM_FUNC_START(sm4_neon_crypt)
 	/* input:
 	 *   x0: round key array, CTX
 	 *   x1: dst
 	 *   x2: src
-	 *   w3: num blocks (1..8)
+	 *   w3: nblocks
 	 */
-	cmp		w3, #5;
-	blt		__sm4_neon_crypt_blk1_4;
-
-	PREPARE;
-
-	ld1		{v0.16b-v3.16b}, [x2], #64;
-	ld1		{v4.16b}, [x2], #16;
-	mov		v5.16b, v4.16b;
-	mov		v6.16b, v4.16b;
-	mov		v7.16b, v4.16b;
-	beq		.Lblk8_load_input_done;
-	ld1		{v5.16b}, [x2], #16;
-	cmp		w3, #7;
-	blt		.Lblk8_load_input_done;
-	ld1		{v6.16b}, [x2], #16;
-	beq		.Lblk8_load_input_done;
-	ld1		{v7.16b}, [x2];
-
-.Lblk8_load_input_done:
-	SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7);
-
-	cmp		w3, #6;
-	st1		{v0.16b-v3.16b}, [x1], #64;
-	st1		{v4.16b}, [x1], #16;
-	blt		.Lblk8_store_output_done;
-	st1		{v5.16b}, [x1], #16;
-	beq		.Lblk8_store_output_done;
-	st1		{v6.16b}, [x1], #16;
-	cmp		w3, #7;
-	beq		.Lblk8_store_output_done;
-	st1		{v7.16b}, [x1];
-
-.Lblk8_store_output_done:
-	ret;
-SYM_FUNC_END(sm4_neon_crypt_blk1_8)
+	SM4_PREPARE()
 
-.align 3
-SYM_FUNC_START(sm4_neon_crypt_blk8)
-	/* input:
-	 *   x0: round key array, CTX
-	 *   x1: dst
-	 *   x2: src
-	 *   w3: nblocks (multiples of 8)
-	 */
-	PREPARE;
+.Lcrypt_loop_8x:
+	sub		w3, w3, #8
+	tbnz		w3, #31, .Lcrypt_4x
+
+	ld4		{v0.4s-v3.4s}, [x2], #64
+	ld4		{v4.4s-v7.4s}, [x2], #64
 
-.Lcrypt_loop_blk:
-	subs		w3, w3, #8;
-	bmi		.Lcrypt_end;
+	SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7)
 
-	ld1		{v0.16b-v3.16b}, [x2], #64;
-	ld1		{v4.16b-v7.16b}, [x2], #64;
+	st1		{v0.16b-v3.16b}, [x1], #64
+	st1		{v4.16b-v7.16b}, [x1], #64
 
-	SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7);
+	cbz		w3, .Lcrypt_end
+	b		.Lcrypt_loop_8x
 
-	st1		{v0.16b-v3.16b}, [x1], #64;
-	st1		{v4.16b-v7.16b}, [x1], #64;
+.Lcrypt_4x:
+	add		w3, w3, #8
+	cmp		w3, #4
+	blt		.Lcrypt_tail
 
-	b		.Lcrypt_loop_blk;
+	sub		w3, w3, #4
+
+	ld4		{v0.4s-v3.4s}, [x2], #64
+
+	SM4_CRYPT_BLK4(v0, v1, v2, v3)
+
+	st1		{v0.16b-v3.16b}, [x1], #64
+
+	cbz		w3, .Lcrypt_end
+
+.Lcrypt_tail:
+	cmp		w3, #2
+	ld1		{v0.16b}, [x2], #16
+	blt		.Lcrypt_tail_load_done
+	ld1		{v1.16b}, [x2], #16
+	beq		.Lcrypt_tail_load_done
+	ld1		{v2.16b}, [x2], #16
+
+.Lcrypt_tail_load_done:
+	transpose_4x4(v0, v1, v2, v3)
+
+	SM4_CRYPT_BLK4(v0, v1, v2, v3)
+
+	cmp		w3, #2
+	st1		{v0.16b}, [x1], #16
+	blt		.Lcrypt_end
+	st1		{v1.16b}, [x1], #16
+	beq		.Lcrypt_end
+	st1		{v2.16b}, [x1], #16
 
 .Lcrypt_end:
-	ret;
-SYM_FUNC_END(sm4_neon_crypt_blk8)
+	ret
+SYM_FUNC_END(sm4_neon_crypt)
 
 .align 3
-SYM_FUNC_START(sm4_neon_cbc_dec_blk8)
+SYM_FUNC_START(sm4_neon_cbc_dec)
 	/* input:
 	 *   x0: round key array, CTX
 	 *   x1: dst
 	 *   x2: src
 	 *   x3: iv (big endian, 128 bit)
-	 *   w4: nblocks (multiples of 8)
+	 *   w4: nblocks
 	 */
-	PREPARE;
+	SM4_PREPARE()
+
+	ld1		{RIV.16b}, [x3]
+
+.Lcbc_dec_loop_8x:
+	sub		w4, w4, #8
+	tbnz		w4, #31, .Lcbc_dec_4x
+
+	ld4		{v0.4s-v3.4s}, [x2], #64
+	ld4		{v4.4s-v7.4s}, [x2]
+
+	SM4_CRYPT_BLK8_norotate(v0, v1, v2, v3, v4, v5, v6, v7)
+
+	/* Avoid overwriting the RIV register */
+	rotate_clockwise_4x4(v0, v1, v2, v3)
+	rotate_clockwise_4x4(v4, v5, v6, v7)
+
+	sub		x2, x2, #64
+
+	eor		v0.16b, v0.16b, RIV.16b
 
-	ld1		{RIV.16b}, [x3];
+	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64
+	ld1		{RTMP4.16b-RTMP7.16b}, [x2], #64
 
-.Lcbc_loop_blk:
-	subs		w4, w4, #8;
-	bmi		.Lcbc_end;
+	eor		v1.16b, v1.16b, RTMP0.16b
+	eor		v2.16b, v2.16b, RTMP1.16b
+	eor		v3.16b, v3.16b, RTMP2.16b
+	eor		v4.16b, v4.16b, RTMP3.16b
+	eor		v5.16b, v5.16b, RTMP4.16b
+	eor		v6.16b, v6.16b, RTMP5.16b
+	eor		v7.16b, v7.16b, RTMP6.16b
 
-	ld1		{v0.16b-v3.16b}, [x2], #64;
-	ld1		{v4.16b-v7.16b}, [x2];
+	mov		RIV.16b, RTMP7.16b
 
-	SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7);
+	st1		{v0.16b-v3.16b}, [x1], #64
+	st1		{v4.16b-v7.16b}, [x1], #64
 
-	sub		x2, x2, #64;
-	eor		v0.16b, v0.16b, RIV.16b;
-	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64;
-	eor		v1.16b, v1.16b, RTMP0.16b;
-	eor		v2.16b, v2.16b, RTMP1.16b;
-	eor		v3.16b, v3.16b, RTMP2.16b;
-	st1		{v0.16b-v3.16b}, [x1], #64;
+	cbz		w4, .Lcbc_dec_end
+	b		.Lcbc_dec_loop_8x
 
-	eor		v4.16b, v4.16b, RTMP3.16b;
-	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64;
-	eor		v5.16b, v5.16b, RTMP0.16b;
-	eor		v6.16b, v6.16b, RTMP1.16b;
-	eor		v7.16b, v7.16b, RTMP2.16b;
+.Lcbc_dec_4x:
+	add		w4, w4, #8
+	cmp		w4, #4
+	blt		.Lcbc_dec_tail
 
-	mov		RIV.16b, RTMP3.16b;
-	st1		{v4.16b-v7.16b}, [x1], #64;
+	sub		w4, w4, #4
 
-	b		.Lcbc_loop_blk;
+	ld1		{v0.16b-v3.16b}, [x2], #64
 
-.Lcbc_end:
+	rev32		v4.16b, v0.16b
+	rev32		v5.16b, v1.16b
+	rev32		v6.16b, v2.16b
+	rev32		v7.16b, v3.16b
+
+	transpose_4x4(v4, v5, v6, v7)
+
+	SM4_CRYPT_BLK4_BE(v4, v5, v6, v7)
+
+	eor		v4.16b, v4.16b, RIV.16b
+	eor		v5.16b, v5.16b, v0.16b
+	eor		v6.16b, v6.16b, v1.16b
+	eor		v7.16b, v7.16b, v2.16b
+
+	mov		RIV.16b, v3.16b
+
+	st1		{v4.16b-v7.16b}, [x1], #64
+
+	cbz		w4, .Lcbc_dec_end
+
+.Lcbc_dec_tail:
+	cmp		w4, #2
+	ld1		{v0.16b}, [x2], #16
+	blt		.Lcbc_dec_tail_load_done
+	ld1		{v1.16b}, [x2], #16
+	beq		.Lcbc_dec_tail_load_done
+	ld1		{v2.16b}, [x2], #16
+
+.Lcbc_dec_tail_load_done:
+	rev32		v4.16b, v0.16b
+	rev32		v5.16b, v1.16b
+	rev32		v6.16b, v2.16b
+
+	transpose_4x4(v4, v5, v6, v7)
+
+	SM4_CRYPT_BLK4_BE(v4, v5, v6, v7)
+
+	cmp		w4, #2
+	eor		v4.16b, v4.16b, RIV.16b
+	mov		RIV.16b, v0.16b
+	st1		{v4.16b}, [x1], #16
+	blt		.Lcbc_dec_end
+
+	eor		v5.16b, v5.16b, v0.16b
+	mov		RIV.16b, v1.16b
+	st1		{v5.16b}, [x1], #16
+	beq		.Lcbc_dec_end
+
+	eor		v6.16b, v6.16b, v1.16b
+	mov		RIV.16b, v2.16b
+	st1		{v6.16b}, [x1], #16
+
+.Lcbc_dec_end:
 	/* store new IV */
-	st1		{RIV.16b}, [x3];
+	st1		{RIV.16b}, [x3]
 
-	ret;
-SYM_FUNC_END(sm4_neon_cbc_dec_blk8)
+	ret
+SYM_FUNC_END(sm4_neon_cbc_dec)
 
 .align 3
-SYM_FUNC_START(sm4_neon_cfb_dec_blk8)
+SYM_FUNC_START(sm4_neon_cfb_dec)
 	/* input:
 	 *   x0: round key array, CTX
 	 *   x1: dst
 	 *   x2: src
 	 *   x3: iv (big endian, 128 bit)
-	 *   w4: nblocks (multiples of 8)
+	 *   w4: nblocks
 	 */
-	PREPARE;
+	SM4_PREPARE()
+
+	ld1		{v0.16b}, [x3]
+
+.Lcfb_dec_loop_8x:
+	sub		w4, w4, #8
+	tbnz		w4, #31, .Lcfb_dec_4x
+
+	ld1		{v1.16b-v3.16b}, [x2], #48
+	ld4		{v4.4s-v7.4s}, [x2]
+
+	transpose_4x4(v0, v1, v2, v3)
+
+	SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7)
+
+	sub		x2, x2, #48
+	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64
+	ld1		{RTMP4.16b-RTMP7.16b}, [x2], #64
+
+	eor		v0.16b, v0.16b, RTMP0.16b
+	eor		v1.16b, v1.16b, RTMP1.16b
+	eor		v2.16b, v2.16b, RTMP2.16b
+	eor		v3.16b, v3.16b, RTMP3.16b
+	eor		v4.16b, v4.16b, RTMP4.16b
+	eor		v5.16b, v5.16b, RTMP5.16b
+	eor		v6.16b, v6.16b, RTMP6.16b
+	eor		v7.16b, v7.16b, RTMP7.16b
+
+	st1		{v0.16b-v3.16b}, [x1], #64
+	st1		{v4.16b-v7.16b}, [x1], #64
+
+	mov		v0.16b, RTMP7.16b
+
+	cbz		w4, .Lcfb_dec_end
+	b		.Lcfb_dec_loop_8x
+
+.Lcfb_dec_4x:
+	add		w4, w4, #8
+	cmp		w4, #4
+	blt		.Lcfb_dec_tail
+
+	sub		w4, w4, #4
+
+	ld1		{v4.16b-v7.16b}, [x2], #64
+
+	rev32		v0.16b, v0.16b		/* v0 is IV register */
+	rev32		v1.16b, v4.16b
+	rev32		v2.16b, v5.16b
+	rev32		v3.16b, v6.16b
+
+	transpose_4x4(v0, v1, v2, v3)
+
+	SM4_CRYPT_BLK4_BE(v0, v1, v2, v3)
 
-	ld1		{v0.16b}, [x3];
+	eor		v0.16b, v0.16b, v4.16b
+	eor		v1.16b, v1.16b, v5.16b
+	eor		v2.16b, v2.16b, v6.16b
+	eor		v3.16b, v3.16b, v7.16b
 
-.Lcfb_loop_blk:
-	subs		w4, w4, #8;
-	bmi		.Lcfb_end;
+	st1		{v0.16b-v3.16b}, [x1], #64
 
-	ld1		{v1.16b, v2.16b, v3.16b}, [x2], #48;
-	ld1		{v4.16b-v7.16b}, [x2];
+	mov		v0.16b, v7.16b
 
-	SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7);
+	cbz		w4, .Lcfb_dec_end
 
-	sub		x2, x2, #48;
-	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64;
-	eor		v0.16b, v0.16b, RTMP0.16b;
-	eor		v1.16b, v1.16b, RTMP1.16b;
-	eor		v2.16b, v2.16b, RTMP2.16b;
-	eor		v3.16b, v3.16b, RTMP3.16b;
-	st1		{v0.16b-v3.16b}, [x1], #64;
+.Lcfb_dec_tail:
+	cmp		w4, #2
+	ld1		{v4.16b}, [x2], #16
+	blt		.Lcfb_dec_tail_load_done
+	ld1		{v5.16b}, [x2], #16
+	beq		.Lcfb_dec_tail_load_done
+	ld1		{v6.16b}, [x2], #16
 
-	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64;
-	eor		v4.16b, v4.16b, RTMP0.16b;
-	eor		v5.16b, v5.16b, RTMP1.16b;
-	eor		v6.16b, v6.16b, RTMP2.16b;
-	eor		v7.16b, v7.16b, RTMP3.16b;
-	st1		{v4.16b-v7.16b}, [x1], #64;
+.Lcfb_dec_tail_load_done:
+	rev32		v0.16b, v0.16b		/* v0 is IV register */
+	rev32		v1.16b, v4.16b
+	rev32		v2.16b, v5.16b
 
-	mov		v0.16b, RTMP3.16b;
+	transpose_4x4(v0, v1, v2, v3)
 
-	b		.Lcfb_loop_blk;
+	SM4_CRYPT_BLK4_BE(v0, v1, v2, v3)
 
-.Lcfb_end:
+	cmp		w4, #2
+	eor		v0.16b, v0.16b, v4.16b
+	st1		{v0.16b}, [x1], #16
+	mov		v0.16b, v4.16b
+	blt		.Lcfb_dec_end
+
+	eor		v1.16b, v1.16b, v5.16b
+	st1		{v1.16b}, [x1], #16
+	mov		v0.16b, v5.16b
+	beq		.Lcfb_dec_end
+
+	eor		v2.16b, v2.16b, v6.16b
+	st1		{v2.16b}, [x1], #16
+	mov		v0.16b, v6.16b
+
+.Lcfb_dec_end:
 	/* store new IV */
-	st1		{v0.16b}, [x3];
+	st1		{v0.16b}, [x3]
 
-	ret;
-SYM_FUNC_END(sm4_neon_cfb_dec_blk8)
+	ret
+SYM_FUNC_END(sm4_neon_cfb_dec)
 
 .align 3
-SYM_FUNC_START(sm4_neon_ctr_enc_blk8)
+SYM_FUNC_START(sm4_neon_ctr_crypt)
 	/* input:
 	 *   x0: round key array, CTX
 	 *   x1: dst
 	 *   x2: src
 	 *   x3: ctr (big endian, 128 bit)
-	 *   w4: nblocks (multiples of 8)
+	 *   w4: nblocks
 	 */
-	PREPARE;
+	SM4_PREPARE()
 
-	ldp		x7, x8, [x3];
-	rev		x7, x7;
-	rev		x8, x8;
+	ldp		x7, x8, [x3]
+	rev		x7, x7
+	rev		x8, x8
 
-.Lctr_loop_blk:
-	subs		w4, w4, #8;
-	bmi		.Lctr_end;
+.Lctr_crypt_loop_8x:
+	sub		w4, w4, #8
+	tbnz		w4, #31, .Lctr_crypt_4x
 
-#define inc_le128(vctr)                     \
-	mov		vctr.d[1], x8;      \
-	mov		vctr.d[0], x7;      \
-	adds		x8, x8, #1;         \
-	adc		x7, x7, xzr;        \
-	rev64		vctr.16b, vctr.16b;
+#define inc_le128(vctr)                             \
+		mov		vctr.d[1], x8;      \
+		mov		vctr.d[0], x7;      \
+		adds		x8, x8, #1;         \
+		rev64		vctr.16b, vctr.16b; \
+		adc		x7, x7, xzr;
 
 	/* construct CTRs */
-	inc_le128(v0);			/* +0 */
-	inc_le128(v1);			/* +1 */
-	inc_le128(v2);			/* +2 */
-	inc_le128(v3);			/* +3 */
-	inc_le128(v4);			/* +4 */
-	inc_le128(v5);			/* +5 */
-	inc_le128(v6);			/* +6 */
-	inc_le128(v7);			/* +7 */
-
-	SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7);
-
-	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64;
-	eor		v0.16b, v0.16b, RTMP0.16b;
-	eor		v1.16b, v1.16b, RTMP1.16b;
-	eor		v2.16b, v2.16b, RTMP2.16b;
-	eor		v3.16b, v3.16b, RTMP3.16b;
-	st1		{v0.16b-v3.16b}, [x1], #64;
-
-	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64;
-	eor		v4.16b, v4.16b, RTMP0.16b;
-	eor		v5.16b, v5.16b, RTMP1.16b;
-	eor		v6.16b, v6.16b, RTMP2.16b;
-	eor		v7.16b, v7.16b, RTMP3.16b;
-	st1		{v4.16b-v7.16b}, [x1], #64;
-
-	b		.Lctr_loop_blk;
-
-.Lctr_end:
+	inc_le128(v0)			/* +0 */
+	inc_le128(v1)			/* +1 */
+	inc_le128(v2)			/* +2 */
+	inc_le128(v3)			/* +3 */
+	inc_le128(v4)			/* +4 */
+	inc_le128(v5)			/* +5 */
+	inc_le128(v6)			/* +6 */
+	inc_le128(v7)			/* +7 */
+
+	transpose_4x4_2x(v0, v1, v2, v3, v4, v5, v6, v7)
+
+	SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7)
+
+	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64
+	ld1		{RTMP4.16b-RTMP7.16b}, [x2], #64
+
+	eor		v0.16b, v0.16b, RTMP0.16b
+	eor		v1.16b, v1.16b, RTMP1.16b
+	eor		v2.16b, v2.16b, RTMP2.16b
+	eor		v3.16b, v3.16b, RTMP3.16b
+	eor		v4.16b, v4.16b, RTMP4.16b
+	eor		v5.16b, v5.16b, RTMP5.16b
+	eor		v6.16b, v6.16b, RTMP6.16b
+	eor		v7.16b, v7.16b, RTMP7.16b
+
+	st1		{v0.16b-v3.16b}, [x1], #64
+	st1		{v4.16b-v7.16b}, [x1], #64
+
+	cbz		w4, .Lctr_crypt_end
+	b		.Lctr_crypt_loop_8x
+
+.Lctr_crypt_4x:
+	add		w4, w4, #8
+	cmp		w4, #4
+	blt		.Lctr_crypt_tail
+
+	sub		w4, w4, #4
+
+	/* construct CTRs */
+	inc_le128(v0)			/* +0 */
+	inc_le128(v1)			/* +1 */
+	inc_le128(v2)			/* +2 */
+	inc_le128(v3)			/* +3 */
+
+	ld1		{v4.16b-v7.16b}, [x2], #64
+
+	transpose_4x4(v0, v1, v2, v3)
+
+	SM4_CRYPT_BLK4(v0, v1, v2, v3)
+
+	eor		v0.16b, v0.16b, v4.16b
+	eor		v1.16b, v1.16b, v5.16b
+	eor		v2.16b, v2.16b, v6.16b
+	eor		v3.16b, v3.16b, v7.16b
+
+	st1		{v0.16b-v3.16b}, [x1], #64
+
+	cbz		w4, .Lctr_crypt_end
+
+.Lctr_crypt_tail:
+	/* inc_le128 will change the sign bit */
+	ld1		{v4.16b}, [x2], #16
+	inc_le128(v0)
+	cmp		w4, #2
+	blt		.Lctr_crypt_tail_load_done
+
+	ld1		{v5.16b}, [x2], #16
+	inc_le128(v1)
+	cmp		w4, #2
+	beq		.Lctr_crypt_tail_load_done
+
+	ld1		{v6.16b}, [x2], #16
+	inc_le128(v2)
+
+.Lctr_crypt_tail_load_done:
+	transpose_4x4(v0, v1, v2, v3)
+
+	SM4_CRYPT_BLK4(v0, v1, v2, v3)
+
+	cmp		w4, #2
+
+	eor		v0.16b, v0.16b, v4.16b
+	st1		{v0.16b}, [x1], #16
+	blt		.Lctr_crypt_end
+
+	eor		v1.16b, v1.16b, v5.16b
+	st1		{v1.16b}, [x1], #16
+	beq		.Lctr_crypt_end
+
+	eor		v2.16b, v2.16b, v6.16b
+	st1		{v2.16b}, [x1], #16
+
+.Lctr_crypt_end:
 	/* store new CTR */
-	rev		x7, x7;
-	rev		x8, x8;
-	stp		x7, x8, [x3];
+	rev		x7, x7
+	rev		x8, x8
+	stp		x7, x8, [x3]
 
-	ret;
-SYM_FUNC_END(sm4_neon_ctr_enc_blk8)
+	ret
+SYM_FUNC_END(sm4_neon_ctr_crypt)
diff --git a/arch/arm64/crypto/sm4-neon-glue.c b/arch/arm64/crypto/sm4-neon-glue.c
index 03a6a6866a31..7b19accf5c03 100644
--- a/arch/arm64/crypto/sm4-neon-glue.c
+++ b/arch/arm64/crypto/sm4-neon-glue.c
@@ -18,19 +18,14 @@
 #include <crypto/internal/skcipher.h>
 #include <crypto/sm4.h>
 
-#define BYTES2BLKS(nbytes)	((nbytes) >> 4)
-#define BYTES2BLK8(nbytes)	(((nbytes) >> 4) & ~(8 - 1))
-
-asmlinkage void sm4_neon_crypt_blk1_8(const u32 *rkey, u8 *dst, const u8 *src,
-				      unsigned int nblks);
-asmlinkage void sm4_neon_crypt_blk8(const u32 *rkey, u8 *dst, const u8 *src,
-				    unsigned int nblks);
-asmlinkage void sm4_neon_cbc_dec_blk8(const u32 *rkey, u8 *dst, const u8 *src,
-				      u8 *iv, unsigned int nblks);
-asmlinkage void sm4_neon_cfb_dec_blk8(const u32 *rkey, u8 *dst, const u8 *src,
-				      u8 *iv, unsigned int nblks);
-asmlinkage void sm4_neon_ctr_enc_blk8(const u32 *rkey, u8 *dst, const u8 *src,
-				      u8 *iv, unsigned int nblks);
+asmlinkage void sm4_neon_crypt(const u32 *rkey, u8 *dst, const u8 *src,
+			       unsigned int nblocks);
+asmlinkage void sm4_neon_cbc_dec(const u32 *rkey_dec, u8 *dst, const u8 *src,
+				 u8 *iv, unsigned int nblocks);
+asmlinkage void sm4_neon_cfb_dec(const u32 *rkey_enc, u8 *dst, const u8 *src,
+				 u8 *iv, unsigned int nblocks);
+asmlinkage void sm4_neon_ctr_crypt(const u32 *rkey_enc, u8 *dst, const u8 *src,
+				   u8 *iv, unsigned int nblocks);
 
 static int sm4_setkey(struct crypto_skcipher *tfm, const u8 *key,
 		      unsigned int key_len)
@@ -51,27 +46,18 @@ static int sm4_ecb_do_crypt(struct skcipher_request *req, const u32 *rkey)
 	while ((nbytes = walk.nbytes) > 0) {
 		const u8 *src = walk.src.virt.addr;
 		u8 *dst = walk.dst.virt.addr;
-		unsigned int nblks;
+		unsigned int nblocks;
 
-		kernel_neon_begin();
+		nblocks = nbytes / SM4_BLOCK_SIZE;
+		if (nblocks) {
+			kernel_neon_begin();
 
-		nblks = BYTES2BLK8(nbytes);
-		if (nblks) {
-			sm4_neon_crypt_blk8(rkey, dst, src, nblks);
-			dst += nblks * SM4_BLOCK_SIZE;
-			src += nblks * SM4_BLOCK_SIZE;
-			nbytes -= nblks * SM4_BLOCK_SIZE;
-		}
+			sm4_neon_crypt(rkey, dst, src, nblocks);
 
-		nblks = BYTES2BLKS(nbytes);
-		if (nblks) {
-			sm4_neon_crypt_blk1_8(rkey, dst, src, nblks);
-			nbytes -= nblks * SM4_BLOCK_SIZE;
+			kernel_neon_end();
 		}
 
-		kernel_neon_end();
-
-		err = skcipher_walk_done(&walk, nbytes);
+		err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE);
 	}
 
 	return err;
@@ -138,48 +124,19 @@ static int sm4_cbc_decrypt(struct skcipher_request *req)
 	while ((nbytes = walk.nbytes) > 0) {
 		const u8 *src = walk.src.virt.addr;
 		u8 *dst = walk.dst.virt.addr;
-		unsigned int nblks;
+		unsigned int nblocks;
 
-		kernel_neon_begin();
+		nblocks = nbytes / SM4_BLOCK_SIZE;
+		if (nblocks) {
+			kernel_neon_begin();
 
-		nblks = BYTES2BLK8(nbytes);
-		if (nblks) {
-			sm4_neon_cbc_dec_blk8(ctx->rkey_dec, dst, src,
-					walk.iv, nblks);
-			dst += nblks * SM4_BLOCK_SIZE;
-			src += nblks * SM4_BLOCK_SIZE;
-			nbytes -= nblks * SM4_BLOCK_SIZE;
-		}
+			sm4_neon_cbc_dec(ctx->rkey_dec, dst, src,
+					 walk.iv, nblocks);
 
-		nblks = BYTES2BLKS(nbytes);
-		if (nblks) {
-			u8 keystream[SM4_BLOCK_SIZE * 8];
-			u8 iv[SM4_BLOCK_SIZE];
-			int i;
-
-			sm4_neon_crypt_blk1_8(ctx->rkey_dec, keystream,
-					src, nblks);
-
-			src += ((int)nblks - 2) * SM4_BLOCK_SIZE;
-			dst += (nblks - 1) * SM4_BLOCK_SIZE;
-			memcpy(iv, src + SM4_BLOCK_SIZE, SM4_BLOCK_SIZE);
-
-			for (i = nblks - 1; i > 0; i--) {
-				crypto_xor_cpy(dst, src,
-					&keystream[i * SM4_BLOCK_SIZE],
-					SM4_BLOCK_SIZE);
-				src -= SM4_BLOCK_SIZE;
-				dst -= SM4_BLOCK_SIZE;
-			}
-			crypto_xor_cpy(dst, walk.iv,
-					keystream, SM4_BLOCK_SIZE);
-			memcpy(walk.iv, iv, SM4_BLOCK_SIZE);
-			nbytes -= nblks * SM4_BLOCK_SIZE;
+			kernel_neon_end();
 		}
 
-		kernel_neon_end();
-
-		err = skcipher_walk_done(&walk, nbytes);
+		err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE);
 	}
 
 	return err;
@@ -238,41 +195,21 @@ static int sm4_cfb_decrypt(struct skcipher_request *req)
 	while ((nbytes = walk.nbytes) > 0) {
 		const u8 *src = walk.src.virt.addr;
 		u8 *dst = walk.dst.virt.addr;
-		unsigned int nblks;
+		unsigned int nblocks;
 
-		kernel_neon_begin();
+		nblocks = nbytes / SM4_BLOCK_SIZE;
+		if (nblocks) {
+			kernel_neon_begin();
 
-		nblks = BYTES2BLK8(nbytes);
-		if (nblks) {
-			sm4_neon_cfb_dec_blk8(ctx->rkey_enc, dst, src,
-					walk.iv, nblks);
-			dst += nblks * SM4_BLOCK_SIZE;
-			src += nblks * SM4_BLOCK_SIZE;
-			nbytes -= nblks * SM4_BLOCK_SIZE;
-		}
+			sm4_neon_cfb_dec(ctx->rkey_enc, dst, src,
+					 walk.iv, nblocks);
 
-		nblks = BYTES2BLKS(nbytes);
-		if (nblks) {
-			u8 keystream[SM4_BLOCK_SIZE * 8];
-
-			memcpy(keystream, walk.iv, SM4_BLOCK_SIZE);
-			if (nblks > 1)
-				memcpy(&keystream[SM4_BLOCK_SIZE], src,
-					(nblks - 1) * SM4_BLOCK_SIZE);
-			memcpy(walk.iv, src + (nblks - 1) * SM4_BLOCK_SIZE,
-				SM4_BLOCK_SIZE);
-
-			sm4_neon_crypt_blk1_8(ctx->rkey_enc, keystream,
-					keystream, nblks);
-
-			crypto_xor_cpy(dst, src, keystream,
-					nblks * SM4_BLOCK_SIZE);
-			dst += nblks * SM4_BLOCK_SIZE;
-			src += nblks * SM4_BLOCK_SIZE;
-			nbytes -= nblks * SM4_BLOCK_SIZE;
-		}
+			kernel_neon_end();
 
-		kernel_neon_end();
+			dst += nblocks * SM4_BLOCK_SIZE;
+			src += nblocks * SM4_BLOCK_SIZE;
+			nbytes -= nblocks * SM4_BLOCK_SIZE;
+		}
 
 		/* tail */
 		if (walk.nbytes == walk.total && nbytes > 0) {
@@ -302,40 +239,21 @@ static int sm4_ctr_crypt(struct skcipher_request *req)
 	while ((nbytes = walk.nbytes) > 0) {
 		const u8 *src = walk.src.virt.addr;
 		u8 *dst = walk.dst.virt.addr;
-		unsigned int nblks;
+		unsigned int nblocks;
 
-		kernel_neon_begin();
+		nblocks = nbytes / SM4_BLOCK_SIZE;
+		if (nblocks) {
+			kernel_neon_begin();
 
-		nblks = BYTES2BLK8(nbytes);
-		if (nblks) {
-			sm4_neon_ctr_enc_blk8(ctx->rkey_enc, dst, src,
-					walk.iv, nblks);
-			dst += nblks * SM4_BLOCK_SIZE;
-			src += nblks * SM4_BLOCK_SIZE;
-			nbytes -= nblks * SM4_BLOCK_SIZE;
-		}
+			sm4_neon_ctr_crypt(ctx->rkey_enc, dst, src,
+					   walk.iv, nblocks);
 
-		nblks = BYTES2BLKS(nbytes);
-		if (nblks) {
-			u8 keystream[SM4_BLOCK_SIZE * 8];
-			int i;
-
-			for (i = 0; i < nblks; i++) {
-				memcpy(&keystream[i * SM4_BLOCK_SIZE],
-					walk.iv, SM4_BLOCK_SIZE);
-				crypto_inc(walk.iv, SM4_BLOCK_SIZE);
-			}
-			sm4_neon_crypt_blk1_8(ctx->rkey_enc, keystream,
-					keystream, nblks);
-
-			crypto_xor_cpy(dst, src, keystream,
-					nblks * SM4_BLOCK_SIZE);
-			dst += nblks * SM4_BLOCK_SIZE;
-			src += nblks * SM4_BLOCK_SIZE;
-			nbytes -= nblks * SM4_BLOCK_SIZE;
-		}
+			kernel_neon_end();
 
-		kernel_neon_end();
+			dst += nblocks * SM4_BLOCK_SIZE;
+			src += nblocks * SM4_BLOCK_SIZE;
+			nbytes -= nblocks * SM4_BLOCK_SIZE;
+		}
 
 		/* tail */
 		if (walk.nbytes == walk.total && nbytes > 0) {
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 04/16] crypto: testmgr - add SM4 cts-cbc/essiv/xts/xcbc test vectors
  2022-09-26  9:36 ` Tianjia Zhang
@ 2022-09-26  9:36   ` Tianjia Zhang
  -1 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26  9:36 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
	Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
	Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
	linux-stm32

This patch adds test vectors for the CTS-CBC, ESSIV, XTS and XCBC modes
of the SM4 algorithm, along with some additional test vectors for SM4
GCM/CCM.

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
 crypto/testmgr.c |   25 +
 crypto/testmgr.h | 1161 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 1186 insertions(+)

diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index e4bb03b8b924..cce101c7e8f9 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -4712,6 +4712,12 @@ static const struct alg_test_desc alg_test_descs[] = {
 		.alg = "cts(cbc(paes))",
 		.test = alg_test_null,
 		.fips_allowed = 1,
+	}, {
+		.alg = "cts(cbc(sm4))",
+		.test = alg_test_skcipher,
+		.suite = {
+			.cipher = __VECS(sm4_cts_tv_template)
+		}
 	}, {
 		.alg = "curve25519",
 		.test = alg_test_kpp,
@@ -5059,6 +5065,12 @@ static const struct alg_test_desc alg_test_descs[] = {
 			.cipher = __VECS(essiv_aes_cbc_tv_template)
 		}
 	}, {
+		.alg = "essiv(cbc(sm4),sm3)",
+		.test = alg_test_skcipher,
+		.suite = {
+			.cipher = __VECS(essiv_sm4_cbc_tv_template)
+		}
+	}, {
 #if IS_ENABLED(CONFIG_CRYPTO_DH_RFC7919_GROUPS)
 		.alg = "ffdhe2048(dh)",
 		.test = alg_test_kpp,
@@ -5586,6 +5598,12 @@ static const struct alg_test_desc alg_test_descs[] = {
 		.suite = {
 			.hash = __VECS(aes_xcbc128_tv_template)
 		}
+	}, {
+		.alg = "xcbc(sm4)",
+		.test = alg_test_hash,
+		.suite = {
+			.hash = __VECS(sm4_xcbc128_tv_template)
+		}
 	}, {
 		.alg = "xchacha12",
 		.test = alg_test_skcipher,
@@ -5640,6 +5658,13 @@ static const struct alg_test_desc alg_test_descs[] = {
 		.suite = {
 			.cipher = __VECS(serpent_xts_tv_template)
 		}
+	}, {
+		.alg = "xts(sm4)",
+		.generic_driver = "xts(ecb(sm4-generic))",
+		.test = alg_test_skcipher,
+		.suite = {
+			.cipher = __VECS(sm4_xts_tv_template)
+		}
 	}, {
 		.alg = "xts(twofish)",
 		.generic_driver = "xts(ecb(twofish-generic))",
diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index d6088e26f326..ced48e4dad0c 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -14882,6 +14882,537 @@ static const struct cipher_testvec sm4_cfb_tv_template[] = {
 	}
 };
 
+static const struct cipher_testvec sm4_cts_tv_template[] = {
+	/* Generated from AES-CTS test vectors */
+	{
+		.klen	= 16,
+		.key    = "\x63\x68\x69\x63\x6b\x65\x6e\x20"
+			  "\x74\x65\x72\x69\x79\x61\x6b\x69",
+		.ptext	= "\x49\x20\x77\x6f\x75\x6c\x64\x20"
+			  "\x6c\x69\x6b\x65\x20\x74\x68\x65"
+			  "\x20",
+		.len	= 17,
+		.ctext	= "\x05\xfe\x23\xee\x17\xa2\x89\x98"
+			  "\xbc\x97\x0a\x0b\x54\x67\xca\xd7"
+			  "\xd6",
+	}, {
+		.klen	= 16,
+		.key    = "\x63\x68\x69\x63\x6b\x65\x6e\x20"
+			  "\x74\x65\x72\x69\x79\x61\x6b\x69",
+		.ptext	= "\x49\x20\x77\x6f\x75\x6c\x64\x20"
+			  "\x6c\x69\x6b\x65\x20\x74\x68\x65"
+			  "\x20\x47\x65\x6e\x65\x72\x61\x6c"
+			  "\x20\x47\x61\x75\x27\x73\x20",
+		.len	= 31,
+		.ctext	= "\x15\x46\xe4\x95\xa4\xec\xf0\xb8"
+			  "\x49\xd6\x6a\x9d\x89\xc7\xfd\x70"
+			  "\xd6\x71\xc8\xc0\x4d\x52\x7c\x66"
+			  "\x93\xf7\x70\xbb\xa8\x3f\xa3",
+	}, {
+		.klen	= 16,
+		.key    = "\x63\x68\x69\x63\x6b\x65\x6e\x20"
+			  "\x74\x65\x72\x69\x79\x61\x6b\x69",
+		.ptext	= "\x49\x20\x77\x6f\x75\x6c\x64\x20"
+			  "\x6c\x69\x6b\x65\x20\x74\x68\x65"
+			  "\x20\x47\x65\x6e\x65\x72\x61\x6c"
+			  "\x20\x47\x61\x75\x27\x73\x20\x43",
+		.len	= 32,
+		.ctext	= "\x89\xc7\x99\x3f\x87\x69\x5c\xd3"
+			  "\x01\x6a\xbf\xd4\x3f\x79\x02\xa3"
+			  "\xd6\x71\xc8\xc0\x4d\x52\x7c\x66"
+			  "\x93\xf7\x70\xbb\xa8\x3f\xa3\xcf",
+	}, {
+		.klen	= 16,
+		.key    = "\x63\x68\x69\x63\x6b\x65\x6e\x20"
+			  "\x74\x65\x72\x69\x79\x61\x6b\x69",
+		.ptext	= "\x49\x20\x77\x6f\x75\x6c\x64\x20"
+			  "\x6c\x69\x6b\x65\x20\x74\x68\x65"
+			  "\x20\x47\x65\x6e\x65\x72\x61\x6c"
+			  "\x20\x47\x61\x75\x27\x73\x20\x43"
+			  "\x68\x69\x63\x6b\x65\x6e\x2c\x20"
+			  "\x70\x6c\x65\x61\x73\x65\x2c",
+		.len	= 47,
+		.ctext	= "\xd6\x71\xc8\xc0\x4d\x52\x7c\x66"
+			  "\x93\xf7\x70\xbb\xa8\x3f\xa3\xcf"
+			  "\xd3\xe1\xdc\xeb\xfa\x04\x11\x99"
+			  "\xde\xcf\x6f\x4d\x7b\x09\x92\x7f"
+			  "\x89\xc7\x99\x3f\x87\x69\x5c\xd3"
+			  "\x01\x6a\xbf\xd4\x3f\x79\x02",
+	}, {
+		.klen	= 16,
+		.key    = "\x63\x68\x69\x63\x6b\x65\x6e\x20"
+			  "\x74\x65\x72\x69\x79\x61\x6b\x69",
+		.ptext	= "\x49\x20\x77\x6f\x75\x6c\x64\x20"
+			  "\x6c\x69\x6b\x65\x20\x74\x68\x65"
+			  "\x20\x47\x65\x6e\x65\x72\x61\x6c"
+			  "\x20\x47\x61\x75\x27\x73\x20\x43"
+			  "\x68\x69\x63\x6b\x65\x6e\x2c\x20"
+			  "\x70\x6c\x65\x61\x73\x65\x2c\x20",
+		.len	= 48,
+		.ctext	= "\xd6\x71\xc8\xc0\x4d\x52\x7c\x66"
+			  "\x93\xf7\x70\xbb\xa8\x3f\xa3\xcf"
+			  "\x9a\xbd\x7b\xfe\x82\xab\xcc\x7f"
+			  "\xbd\x99\x21\x0c\x5e\x4d\xed\x20"
+			  "\x89\xc7\x99\x3f\x87\x69\x5c\xd3"
+			  "\x01\x6a\xbf\xd4\x3f\x79\x02\xa3",
+	}, {
+		.klen	= 16,
+		.key    = "\x63\x68\x69\x63\x6b\x65\x6e\x20"
+			  "\x74\x65\x72\x69\x79\x61\x6b\x69",
+		.ptext	= "\x49\x20\x77\x6f\x75\x6c\x64\x20"
+			  "\x6c\x69\x6b\x65\x20\x74\x68\x65"
+			  "\x20\x47\x65\x6e\x65\x72\x61\x6c"
+			  "\x20\x47\x61\x75\x27\x73\x20\x43"
+			  "\x68\x69\x63\x6b\x65\x6e\x2c\x20"
+			  "\x70\x6c\x65\x61\x73\x65\x2c\x20"
+			  "\x61\x6e\x64\x20\x77\x6f\x6e\x74"
+			  "\x6f\x6e\x20\x73\x6f\x75\x70\x2e",
+		.len	= 64,
+		.ctext	= "\xd6\x71\xc8\xc0\x4d\x52\x7c\x66"
+			  "\x93\xf7\x70\xbb\xa8\x3f\xa3\xcf"
+			  "\x89\xc7\x99\x3f\x87\x69\x5c\xd3"
+			  "\x01\x6a\xbf\xd4\x3f\x79\x02\xa3"
+			  "\x58\x19\xa4\x8f\xa9\x68\x5e\x6b"
+			  "\x2c\x0f\x81\x60\x15\x98\x27\x4f"
+			  "\x9a\xbd\x7b\xfe\x82\xab\xcc\x7f"
+			  "\xbd\x99\x21\x0c\x5e\x4d\xed\x20",
+	}
+};
+
+static const struct cipher_testvec essiv_sm4_cbc_tv_template[] = {
+	/* Generated from AES-ESSIV-CBC test vectors */
+	{
+		.key    = "\x06\xa9\x21\x40\x36\xb8\xa1\x5b"
+			  "\x51\x2e\x03\xd5\x34\x12\x00\x06",
+		.klen   = 16,
+		.iv	= "\x3d\xaf\xba\x42\x9d\x9e\xb4\x30"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.ptext	= "Single block msg",
+		.ctext	= "\x83\xa0\x79\x71\x18\xed\xb2\x0f"
+			  "\xa8\x71\x94\x22\x8e\x1f\xc1\xbb",
+		.len	= 16,
+	}, {
+		.key    = "\xc2\x86\x69\x6d\x88\x7c\x9a\xa0"
+			  "\x61\x1b\xbb\x3e\x20\x25\xa4\x5a",
+		.klen   = 16,
+		.iv     = "\x56\x2e\x17\x99\x6d\x09\x3d\x28"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.ptext	= "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17"
+			  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f",
+		.ctext	= "\x48\x38\xba\xa0\x09\xa2\xe1\x61"
+			  "\x94\xe5\xd2\x63\xe5\x04\x6c\x62"
+			  "\x93\x21\x95\xfb\x8c\xf4\x25\x19"
+			  "\xe0\x0f\x9c\xfa\x51\xfe\xe7\x32",
+		.len	= 32,
+	}, {
+		.key	= "\x1f\x35\x2c\x07\x3b\x61\x08\xd7"
+			  "\x2d\x98\x10\xa3\x09\x14\xdf\xf4",
+		.klen	= 16,
+		.iv	= "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.ptext	= "\x6b\xc1\xbe\xe2\x2e\x40\x9f\x96"
+			  "\xe9\x3d\x7e\x11\x73\x93\x17\x2a"
+			  "\xae\x2d\x8a\x57\x1e\x03\xac\x9c"
+			  "\x9e\xb7\x6f\xac\x45\xaf\x8e\x51"
+			  "\x30\xc8\x1c\x46\xa3\x5c\xe4\x11"
+			  "\xe5\xfb\xc1\x19\x1a\x0a\x52\xef"
+			  "\xf6\x9f\x24\x45\xdf\x4f\x9b\x17"
+			  "\xad\x2b\x41\x7b\xe6\x6c\x37\x10",
+		.ctext	= "\xa5\x1d\x64\x91\x28\x1f\xbe\x9e"
+			  "\x15\x39\x5f\xe4\xe1\x5a\x8c\x38"
+			  "\x80\x7f\xc7\x7d\x00\x4c\x4b\xff"
+			  "\x75\x3a\x03\xfe\x41\x75\x26\x9e"
+			  "\x3f\xf1\x36\xaf\x7b\x37\x73\x1a"
+			  "\xaf\x9b\x91\xec\x1e\xf0\x05\x9d"
+			  "\x87\xda\x7b\xa3\xaa\xe6\x5b\x98"
+			  "\x41\x73\xd5\x3c\x8c\x8b\xb5\x88",
+		.len	= 64,
+	}, {
+		.key	= "\xBE\xE1\x04\x27\xE1\x04\x27\x4A"
+			  "\x6D\x90\x4A\x6D\x90\xB3\xD6\xF9",
+		.klen	= 16,
+		.iv	= "\xE7\x82\x1D\xB8\x53\x11\xAC\x47"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.ptext	= "\x50\xB9\x22\xAE\x17\x80\x0C\x75"
+			  "\xDE\x47\xD3\x3C\xA5\x0E\x9A\x03"
+			  "\x6C\xF8\x61\xCA\x33\xBF\x28\x91"
+			  "\x1D\x86\xEF\x58\xE4\x4D\xB6\x1F"
+			  "\xAB\x14\x7D\x09\x72\xDB\x44\xD0"
+			  "\x39\xA2\x0B\x97\x00\x69\xF5\x5E"
+			  "\xC7\x30\xBC\x25\x8E\x1A\x83\xEC"
+			  "\x55\xE1\x4A\xB3\x1C\xA8\x11\x7A"
+			  "\x06\x6F\xD8\x41\xCD\x36\x9F\x08"
+			  "\x94\xFD\x66\xF2\x5B\xC4\x2D\xB9"
+			  "\x22\x8B\x17\x80\xE9\x52\xDE\x47"
+			  "\xB0\x19\xA5\x0E\x77\x03\x6C\xD5"
+			  "\x3E\xCA\x33\x9C\x05\x91\xFA\x63"
+			  "\xEF\x58\xC1\x2A\xB6\x1F\x88\x14"
+			  "\x7D\xE6\x4F\xDB\x44\xAD\x16\xA2"
+			  "\x0B\x74\x00\x69\xD2\x3B\xC7\x30"
+			  "\x99\x02\x8E\xF7\x60\xEC\x55\xBE"
+			  "\x27\xB3\x1C\x85\x11\x7A\xE3\x4C"
+			  "\xD8\x41\xAA\x13\x9F\x08\x71\xFD"
+			  "\x66\xCF\x38\xC4\x2D\x96\x22\x8B"
+			  "\xF4\x5D\xE9\x52\xBB\x24\xB0\x19"
+			  "\x82\x0E\x77\xE0\x49\xD5\x3E\xA7"
+			  "\x10\x9C\x05\x6E\xFA\x63\xCC\x35"
+			  "\xC1\x2A\x93\x1F\x88\xF1\x5A\xE6"
+			  "\x4F\xB8\x21\xAD\x16\x7F\x0B\x74"
+			  "\xDD\x46\xD2\x3B\xA4\x0D\x99\x02"
+			  "\x6B\xF7\x60\xC9\x32\xBE\x27\x90"
+			  "\x1C\x85\xEE\x57\xE3\x4C\xB5\x1E"
+			  "\xAA\x13\x7C\x08\x71\xDA\x43\xCF"
+			  "\x38\xA1\x0A\x96\xFF\x68\xF4\x5D"
+			  "\xC6\x2F\xBB\x24\x8D\x19\x82\xEB"
+			  "\x54\xE0\x49\xB2\x1B\xA7\x10\x79"
+			  "\x05\x6E\xD7\x40\xCC\x35\x9E\x07"
+			  "\x93\xFC\x65\xF1\x5A\xC3\x2C\xB8"
+			  "\x21\x8A\x16\x7F\xE8\x51\xDD\x46"
+			  "\xAF\x18\xA4\x0D\x76\x02\x6B\xD4"
+			  "\x3D\xC9\x32\x9B\x04\x90\xF9\x62"
+			  "\xEE\x57\xC0\x29\xB5\x1E\x87\x13"
+			  "\x7C\xE5\x4E\xDA\x43\xAC\x15\xA1"
+			  "\x0A\x73\xFF\x68\xD1\x3A\xC6\x2F"
+			  "\x98\x01\x8D\xF6\x5F\xEB\x54\xBD"
+			  "\x26\xB2\x1B\x84\x10\x79\xE2\x4B"
+			  "\xD7\x40\xA9\x12\x9E\x07\x70\xFC"
+			  "\x65\xCE\x37\xC3\x2C\x95\x21\x8A"
+			  "\xF3\x5C\xE8\x51\xBA\x23\xAF\x18"
+			  "\x81\x0D\x76\xDF\x48\xD4\x3D\xA6"
+			  "\x0F\x9B\x04\x6D\xF9\x62\xCB\x34"
+			  "\xC0\x29\x92\x1E\x87\xF0\x59\xE5"
+			  "\x4E\xB7\x20\xAC\x15\x7E\x0A\x73"
+			  "\xDC\x45\xD1\x3A\xA3\x0C\x98\x01"
+			  "\x6A\xF6\x5F\xC8\x31\xBD\x26\x8F"
+			  "\x1B\x84\xED\x56\xE2\x4B\xB4\x1D"
+			  "\xA9\x12\x7B\x07\x70\xD9\x42\xCE"
+			  "\x37\xA0\x09\x95\xFE\x67\xF3\x5C"
+			  "\xC5\x2E\xBA\x23\x8C\x18\x81\xEA"
+			  "\x53\xDF\x48\xB1\x1A\xA6\x0F\x78"
+			  "\x04\x6D\xD6\x3F\xCB\x34\x9D\x06"
+			  "\x92\xFB\x64\xF0\x59\xC2\x2B\xB7"
+			  "\x20\x89\x15\x7E\xE7\x50\xDC\x45"
+			  "\xAE\x17\xA3\x0C\x75\x01\x6A\xD3"
+			  "\x3C\xC8\x31\x9A\x03\x8F\xF8\x61"
+			  "\xED\x56\xBF\x28\xB4\x1D\x86\x12",
+		.ctext	= "\xad\x68\x40\x68\xb2\xf9\x77\x55"
+			  "\xd5\x1c\x17\x46\xc1\xfa\x05\xdd"
+			  "\x94\x5c\xb7\x99\x82\xba\x05\x48"
+			  "\xac\x5d\x14\x30\x2e\xc8\x0e\x2f"
+			  "\x5a\xd7\x39\x43\x95\x4d\x93\xff"
+			  "\x6b\xe3\xb7\x71\xc1\x39\x43\x8d"
+			  "\x10\xd7\xd9\xa8\xe7\x65\xb7\x0a"
+			  "\x27\x98\x5b\x90\xc3\x80\x1f\xd9"
+			  "\x65\x82\x88\x0a\xc3\x16\x3f\xae"
+			  "\x1f\xad\x88\xe9\xfb\x9e\xd4\xc8"
+			  "\x81\x36\x50\x37\x1f\x11\x83\xe2"
+			  "\xc5\x1a\x48\xdb\xc3\x18\x07\x5d"
+			  "\xee\x4b\xea\x40\xd3\xd9\x8c\x59"
+			  "\x29\xe1\x0b\x79\x3b\x28\xac\x75"
+			  "\xda\x82\x99\x86\xd4\xbe\xd8\x81"
+			  "\xe0\xc4\x58\x78\xe4\x33\xc1\xf1"
+			  "\xbe\x96\xd3\x4c\x42\x6b\xaf\x24"
+			  "\x69\xb4\x25\x88\x37\x9e\xb2\xfb"
+			  "\x5c\x93\x22\x89\x2f\x81\x85\x06"
+			  "\x12\x74\x3b\x6c\x99\x81\xfb\xbe"
+			  "\x0f\xc4\xa5\xb6\xf8\x79\x5f\x72"
+			  "\xf8\x46\x94\x3f\x1f\x9f\x15\xa2"
+			  "\xc8\xc0\xbf\xeb\xa3\x9e\x59\xe1"
+			  "\xbd\x1a\xe1\xe3\x6b\x33\x96\x54"
+			  "\x1b\xc4\x25\x74\x06\xcf\x8a\x75"
+			  "\x6c\xfc\x76\x7f\x9e\x7b\x00\xce"
+			  "\xa8\x1e\x6a\x0f\x5a\xa6\xcb\x77"
+			  "\x5f\x90\x39\xcb\xfe\x0e\x16\x53"
+			  "\x8e\x21\x0f\x7e\x51\xcc\x92\xb8"
+			  "\x4f\x65\x76\x20\x3d\x56\xb4\xcc"
+			  "\x8b\x8e\x8e\x68\xc3\x82\x53\x5c"
+			  "\x1c\x82\x13\x32\x3b\x97\xff\x48"
+			  "\x98\xda\x4a\x7c\xc8\x21\x83\xfd"
+			  "\xe2\xf1\x30\xe1\x11\xe9\xe8\x97"
+			  "\x97\x24\x06\x73\xf2\x52\xbb\xab"
+			  "\x9d\x5f\x0b\xa8\x2f\xab\x0b\x7d"
+			  "\xe8\x20\x7b\x67\x2e\x93\xb5\x11"
+			  "\x6c\x16\xea\xdd\x1a\x9d\xf2\xdc"
+			  "\x79\x57\xc4\x04\xcb\x7f\x36\xa0"
+			  "\x2e\xa7\x89\xab\xaa\x56\x59\x9e"
+			  "\xec\x38\xea\x1a\xe9\xa7\x58\x58"
+			  "\xb5\xb7\x8f\x8c\x5c\xd6\x86\x67"
+			  "\x65\x0f\x93\x47\xf7\x3e\x19\x19"
+			  "\x9b\x22\xd1\xc6\xc2\xba\x32\x5c"
+			  "\x2c\x7a\xa2\xbb\xa5\x22\xde\xe5"
+			  "\x1e\x78\x2c\xd3\x40\x6d\xfa\x79"
+			  "\x4c\x9e\x1c\x36\x34\xaf\x95\x2e"
+			  "\x68\x2e\x69\x7d\xe4\x7d\x0c\x74"
+			  "\xaf\x73\x5b\x48\x62\x90\x5e\x19"
+			  "\x0f\x12\xb3\xdb\x77\xbb\xe2\xac"
+			  "\xaf\xfe\xd9\xa1\x80\x09\xc6\xd4"
+			  "\xf4\x21\x3f\xa4\x0f\x16\x7b\x36"
+			  "\x29\x6d\x10\xa2\xba\xaf\xf5\xa3"
+			  "\x51\xca\x0a\x25\x74\x9a\xb7\x02"
+			  "\xb8\xf8\x6b\xda\xb8\x1c\x9f\x62"
+			  "\xf5\x61\x62\x9f\x4b\x71\x24\x45"
+			  "\xfb\x0f\xdf\xa8\x47\x6f\x2f\x05"
+			  "\x2f\xf4\xfd\xb8\xd1\x8c\x29\x9d"
+			  "\x9d\xe8\x6f\x10\x89\xef\x08\x59"
+			  "\xa0\x24\x1f\xdb\xea\xbc\x97\x44"
+			  "\x23\x74\xbf\xaa\x87\x10\x5c\x58"
+			  "\x2a\xe6\xe2\x19\xc5\x7e\x21\xe2",
+		.len	= 496,
+	},
+};
+
+static const struct cipher_testvec sm4_xts_tv_template[] = {
+	/* Generated from AES-XTS test vectors */
+	{
+		.key	= "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.klen	= 32,
+		.iv	= "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.ptext	= "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.ctext	= "\xd9\xb4\x21\xf7\x31\xc8\x94\xfd"
+			  "\xc3\x5b\x77\x29\x1f\xe4\xe3\xb0"
+			  "\x2a\x1f\xb7\x66\x98\xd5\x9f\x0e"
+			  "\x51\x37\x6c\x4a\xda\x5b\xc7\x5d",
+		.len	= 32,
+	}, {
+		.key	= "\x11\x11\x11\x11\x11\x11\x11\x11"
+			  "\x11\x11\x11\x11\x11\x11\x11\x11"
+			  "\x22\x22\x22\x22\x22\x22\x22\x22"
+			  "\x22\x22\x22\x22\x22\x22\x22\x22",
+		.klen	= 32,
+		.iv	= "\x33\x33\x33\x33\x33\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.ptext	= "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44",
+		.ctext	= "\xa7\x4d\x72\x6c\x11\x19\x6a\x32"
+			  "\xbe\x04\xe0\x01\xff\x29\xd0\xc7"
+			  "\x93\x2f\x9f\x3e\xc2\x9b\xfc\xb6"
+			  "\x4d\xd1\x7f\x63\xcb\xd3\xea\x31",
+		.len	= 32,
+	}, {
+		.key	= "\xff\xfe\xfd\xfc\xfb\xfa\xf9\xf8"
+			  "\xf7\xf6\xf5\xf4\xf3\xf2\xf1\xf0"
+			  "\x22\x22\x22\x22\x22\x22\x22\x22"
+			  "\x22\x22\x22\x22\x22\x22\x22\x22",
+		.klen	= 32,
+		.iv	= "\x33\x33\x33\x33\x33\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.ptext	= "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44",
+		.ctext	= "\x7f\x76\x08\x8e\xff\xad\xf7\x0c"
+			  "\x02\xea\x9f\x95\xda\x06\x28\xd3"
+			  "\x51\xbf\xcb\x9e\xac\x05\x63\xbc"
+			  "\xf1\x7b\x71\x0d\xab\x0a\x98\x26",
+		.len	= 32,
+	}, {
+		.key	= "\x27\x18\x28\x18\x28\x45\x90\x45"
+			  "\x23\x53\x60\x28\x74\x71\x35\x26"
+			  "\x31\x41\x59\x26\x53\x58\x97\x93"
+			  "\x23\x84\x62\x64\x33\x83\x27\x95",
+		.klen	= 32,
+		.iv	= "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.ptext	= "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17"
+			  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+			  "\x20\x21\x22\x23\x24\x25\x26\x27"
+			  "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+			  "\x30\x31\x32\x33\x34\x35\x36\x37"
+			  "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+			  "\x40\x41\x42\x43\x44\x45\x46\x47"
+			  "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+			  "\x50\x51\x52\x53\x54\x55\x56\x57"
+			  "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+			  "\x60\x61\x62\x63\x64\x65\x66\x67"
+			  "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+			  "\x70\x71\x72\x73\x74\x75\x76\x77"
+			  "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+			  "\x80\x81\x82\x83\x84\x85\x86\x87"
+			  "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+			  "\x90\x91\x92\x93\x94\x95\x96\x97"
+			  "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+			  "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+			  "\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+			  "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+			  "\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf"
+			  "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+			  "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf"
+			  "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7"
+			  "\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf"
+			  "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7"
+			  "\xe8\xe9\xea\xeb\xec\xed\xee\xef"
+			  "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7"
+			  "\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff"
+			  "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17"
+			  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+			  "\x20\x21\x22\x23\x24\x25\x26\x27"
+			  "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+			  "\x30\x31\x32\x33\x34\x35\x36\x37"
+			  "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+			  "\x40\x41\x42\x43\x44\x45\x46\x47"
+			  "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+			  "\x50\x51\x52\x53\x54\x55\x56\x57"
+			  "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+			  "\x60\x61\x62\x63\x64\x65\x66\x67"
+			  "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+			  "\x70\x71\x72\x73\x74\x75\x76\x77"
+			  "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+			  "\x80\x81\x82\x83\x84\x85\x86\x87"
+			  "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+			  "\x90\x91\x92\x93\x94\x95\x96\x97"
+			  "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+			  "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+			  "\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+			  "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+			  "\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf"
+			  "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+			  "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf"
+			  "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7"
+			  "\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf"
+			  "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7"
+			  "\xe8\xe9\xea\xeb\xec\xed\xee\xef"
+			  "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7"
+			  "\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff",
+		.ctext	= "\x54\xdd\x65\xb6\x32\x6f\xae\xa8"
+			  "\xfa\xd1\xa8\x3c\x63\x61\x4a\xf3"
+			  "\x9f\x72\x1d\x8d\xfe\x17\x7a\x30"
+			  "\xb6\x6a\xbf\x6a\x44\x99\x80\xe1"
+			  "\xcd\xbe\x06\xaf\xb7\x33\x36\xf3"
+			  "\x7a\x4d\x39\xde\x96\x4a\x30\xd7"
+			  "\xd0\x4a\x37\x99\x16\x9c\x60\x25"
+			  "\x8f\x6b\x74\x8a\x61\x86\x1a\xa5"
+			  "\xec\x92\xa2\xc1\x5b\x2b\x7c\x61"
+			  "\x5a\x42\xab\xa4\x99\xbb\xd6\xb7"
+			  "\x1d\xb9\xc7\x89\xb2\x18\x20\x89"
+			  "\xa2\x5d\xd3\xdf\x80\x0e\xd1\x86"
+			  "\x4d\x19\xf7\xed\x45\xfd\x17\xa9"
+			  "\x48\x0b\x0f\xb8\x2d\x9b\x7f\xc3"
+			  "\xed\x57\xe9\xa1\x14\x0e\xaa\x77"
+			  "\x8d\xd2\xdd\x67\x9e\x3e\xdc\x3d"
+			  "\xc4\xd5\x5c\x95\x0e\xbc\x53\x1d"
+			  "\x95\x92\xf7\xc4\x63\x82\x56\xd5"
+			  "\x65\x18\x29\x2a\x20\xaf\x98\xfd"
+			  "\xd3\xa6\x36\x00\x35\x0a\x70\xab"
+			  "\x5a\x40\xf4\xc2\x85\x03\x7c\xa0"
+			  "\x1f\x25\x1f\x19\xec\xae\x03\x29"
+			  "\xff\x77\xad\x88\xcd\x5a\x4c\xde"
+			  "\xa2\xae\xab\xc2\x21\x48\xff\xbd"
+			  "\x23\x9b\xd1\x05\x15\xbd\xe1\x13"
+			  "\x1d\xec\x84\x04\xe4\x43\xdc\x76"
+			  "\x31\x40\xd5\xf2\x2b\xf3\x3e\x0c"
+			  "\x68\x72\xd6\xb8\x1d\x63\x0f\x6f"
+			  "\x00\xcd\xd0\x58\xfe\x80\xf9\xcb"
+			  "\xfb\x77\x70\x7f\x93\xce\xe2\xca"
+			  "\x92\xb9\x15\xb8\x30\x40\x27\xc1"
+			  "\x90\xa8\x4e\x2d\x65\xe0\x18\xcc"
+			  "\x6a\x38\x7d\x37\x66\xac\xdb\x28"
+			  "\x25\x32\x84\xe8\xdb\x9a\xcf\x8f"
+			  "\x52\x28\x0d\xdc\x6d\x00\x33\xd2"
+			  "\xcc\xaa\xa4\xf9\xae\xff\x12\x36"
+			  "\x69\xbc\x02\x4f\xd6\x76\x8e\xdf"
+			  "\x8b\xc1\xf8\xd6\x22\xc1\x9c\x60"
+			  "\x9e\xf9\x7f\x60\x91\x90\xcd\x11"
+			  "\x02\x41\xe7\xfb\x08\x4e\xd8\x94"
+			  "\x2d\xa1\xf9\xb9\xcf\x1b\x51\x4b"
+			  "\x61\xa3\x88\xb3\x0e\xa6\x1a\x4a"
+			  "\x74\x5b\x38\x1e\xe7\xad\x6c\x4d"
+			  "\xb1\x27\x54\x53\xb8\x41\x3f\x98"
+			  "\xdf\x6e\x4a\x40\x98\x6e\xe4\xb5"
+			  "\x9a\xf5\xdf\xae\xcd\x30\x12\x65"
+			  "\x17\x90\x67\xa0\x0d\x7c\xa3\x5a"
+			  "\xb9\x5a\xbd\x61\x7a\xde\xa2\x8e"
+			  "\xc1\xc2\x6a\x97\xde\x28\xb8\xbf"
+			  "\xe3\x01\x20\xd6\xae\xfb\xd2\x58"
+			  "\xc5\x9e\x42\xd1\x61\xe8\x06\x5a"
+			  "\x78\x10\x6b\xdc\xa5\xcd\x90\xfb"
+			  "\x3a\xac\x4e\x93\x86\x6c\x8a\x7f"
+			  "\x96\x76\x86\x0a\x79\x14\x5b\xd9"
+			  "\x2e\x02\xe8\x19\xa9\x0b\xe0\xb9"
+			  "\x7c\xc5\x22\xb3\x21\x06\x85\x6f"
+			  "\xdf\x0e\x54\xd8\x8e\x46\x24\x15"
+			  "\x5a\x2f\x1c\x14\xea\xea\xa1\x63"
+			  "\xf8\x58\xe9\x9a\x80\x6e\x79\x1a"
+			  "\xcd\x82\xf1\xb0\xe2\x9f\x00\x28"
+			  "\xa4\xc3\x8e\x97\x6f\x57\x1a\x93"
+			  "\xf4\xfd\x57\xd7\x87\xc2\x4d\xb0"
+			  "\xe0\x1c\xa3\x04\xe5\xa5\xc4\xdd"
+			  "\x50\xcf\x8b\xdb\xf4\x91\xe5\x7c",
+		.len	= 512,
+	}, {
+		.key	= "\x62\x49\x77\x57\x24\x70\x93\x69"
+			  "\x99\x59\x57\x49\x66\x96\x76\x27"
+			  "\x02\x88\x41\x97\x16\x93\x99\x37"
+			  "\x51\x05\x82\x09\x74\x94\x45\x92",
+		.klen	= 32,
+		.iv	= "\xff\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.ptext	= "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17"
+			  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+			  "\x20\x21\x22\x23\x24\x25\x26\x27"
+			  "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+			  "\x30\x31\x32\x33\x34\x35\x36\x37"
+			  "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+			  "\x40\x41\x42\x43\x44\x45\x46\x47"
+			  "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+			  "\x50\x51\x52\x53\x54\x55\x56\x57"
+			  "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+			  "\x60\x61\x62\x63\x64\x65\x66\x67"
+			  "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+			  "\x70\x71\x72\x73\x74\x75\x76\x77"
+			  "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+			  "\x80\x81\x82\x83\x84\x85\x86\x87"
+			  "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+			  "\x90\x91\x92\x93\x94\x95\x96\x97"
+			  "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+			  "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+			  "\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+			  "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+			  "\xf8\xf9\xfa\xfb\xfc",
+		.ctext	= "\xa2\x9f\x9e\x4e\x71\xdb\x28\x3c"
+			  "\x80\x0e\xf6\xb7\x8e\x57\x1c\xba"
+			  "\x90\xda\x3b\x6c\x22\x00\x68\x30"
+			  "\x1d\x63\x0d\x9e\x6a\xad\x37\x55"
+			  "\xbc\x77\x1e\xc9\xad\x83\x30\xd5"
+			  "\x27\xb2\x66\x77\x18\x3c\xa6\x39"
+			  "\x9c\x0a\xaa\x1f\x02\xe1\xd5\x65"
+			  "\x9b\x8d\xc5\x97\x3d\xc5\x04\x53"
+			  "\x78\x00\xe3\xb0\x1a\x43\x4e\xb7"
+			  "\xc4\x9f\x38\xc5\x7b\xa4\x70\x64"
+			  "\x78\xe6\x32\xd9\x65\x44\xc5\x64"
+			  "\xb8\x42\x35\x99\xff\x66\x75\xb0"
+			  "\x22\xd3\x9b\x6e\x8d\xcf\x6a\x24"
+			  "\xfd\x92\xb7\x1b\x04\x28\x2a\x61"
+			  "\xdc\x96\x2a\x20\x7a\x2c\xf1\xf9"
+			  "\x12\x15\xf0\x4d\xcf\x2b\xde\x33"
+			  "\x41\xbc\xe7\x85\x87\x22\xb7\x16"
+			  "\x02\x1c\xd8\xa2\x0f\x1f\xa3\xe9"
+			  "\xd8\x45\x48\xe7\xbe\x08\x4e\x4e"
+			  "\x23\x79\x84\xdb\x40\x76\xf5\x13"
+			  "\x78\x92\x4a\x2f\xf9\x1b\xf2\x80"
+			  "\x25\x74\x51\x45\x9a\x77\x78\x97"
+			  "\xd3\xe0\xc7\xc4\x35\x67\x2a\xe6"
+			  "\xb3\x0d\x62\x9f\x8b",
+		.len	= 189,
+	},
+};
+
 static const struct aead_testvec sm4_gcm_tv_template[] = {
 	{ /* From https://datatracker.ietf.org/doc/html/rfc8998#appendix-A.1 */
 		.key	= "\x01\x23\x45\x67\x89\xAB\xCD\xEF"
@@ -14913,6 +15444,298 @@ static const struct aead_testvec sm4_gcm_tv_template[] = {
 			  "\x83\xDE\x35\x41\xE4\xC2\xB5\x81"
 			  "\x77\xE0\x65\xA9\xBF\x7B\x62\xEC",
 		.clen	= 80,
+	}, { /* Generated from AES-GCM test vectors */
+		.key    = zeroed_string,
+		.klen	= 16,
+		.ctext	= "\x23\x2f\x0c\xfe\x30\x8b\x49\xea"
+			  "\x6f\xc8\x82\x29\xb5\xdc\x85\x8d",
+		.clen	= 16,
+	}, {
+		.key    = zeroed_string,
+		.klen	= 16,
+		.ptext	= zeroed_string,
+		.plen	= 16,
+		.ctext	= "\x7d\xe2\xaa\x7f\x11\x10\x18\x82"
+			  "\x18\x06\x3b\xe1\xbf\xeb\x6d\x89"
+			  "\xb8\x51\xb5\xf3\x94\x93\x75\x2b"
+			  "\xe5\x08\xf1\xbb\x44\x82\xc5\x57",
+		.clen	= 32,
+	}, {
+		.key	= "\xfe\xff\xe9\x92\x86\x65\x73\x1c"
+			  "\x6d\x6a\x8f\x94\x67\x30\x83\x08",
+		.klen	= 16,
+		.iv	= "\xca\xfe\xba\xbe\xfa\xce\xdb\xad"
+			  "\xde\xca\xf8\x88",
+		.ptext	= "\xd9\x31\x32\x25\xf8\x84\x06\xe5"
+			  "\xa5\x59\x09\xc5\xaf\xf5\x26\x9a"
+			  "\x86\xa7\xa9\x53\x15\x34\xf7\xda"
+			  "\x2e\x4c\x30\x3d\x8a\x31\x8a\x72"
+			  "\x1c\x3c\x0c\x95\x95\x68\x09\x53"
+			  "\x2f\xcf\x0e\x24\x49\xa6\xb5\x25"
+			  "\xb1\x6a\xed\xf5\xaa\x0d\xe6\x57"
+			  "\xba\x63\x7b\x39\x1a\xaf\xd2\x55",
+		.plen	= 64,
+		.ctext	= "\xe4\x11\x0f\xf1\xc1\x41\x97\xe6"
+			  "\x76\x21\x6a\x33\x83\x10\x41\xeb"
+			  "\x09\x58\x00\x11\x7b\xdc\x3f\x75"
+			  "\x1a\x49\x6e\xfc\xf2\xbb\xdf\xdb"
+			  "\x3a\x2e\x13\xfd\xc5\xc1\x9d\x07"
+			  "\x1a\xe5\x48\x3f\xed\xde\x98\x5d"
+			  "\x3f\x2d\x5b\x4e\xee\x0b\xb6\xdf"
+			  "\xe3\x63\x36\x83\x23\xf7\x5b\x80"
+			  "\x7d\xfe\x77\xef\x71\xb1\x5e\xc9"
+			  "\x52\x6b\x09\xab\x84\x28\x4b\x8a",
+		.clen	= 80,
+	}, {
+		.key	= "\xfe\xff\xe9\x92\x86\x65\x73\x1c"
+			  "\x6d\x6a\x8f\x94\x67\x30\x83\x08",
+		.klen	= 16,
+		.iv	= "\xca\xfe\xba\xbe\xfa\xce\xdb\xad"
+			  "\xde\xca\xf8\x88",
+		.ptext	= "\xd9\x31\x32\x25\xf8\x84\x06\xe5"
+			  "\xa5\x59\x09\xc5\xaf\xf5\x26\x9a"
+			  "\x86\xa7\xa9\x53\x15\x34\xf7\xda"
+			  "\x2e\x4c\x30\x3d\x8a\x31\x8a\x72"
+			  "\x1c\x3c\x0c\x95\x95\x68\x09\x53"
+			  "\x2f\xcf\x0e\x24\x49\xa6\xb5\x25"
+			  "\xb1\x6a\xed\xf5\xaa\x0d\xe6\x57"
+			  "\xba\x63\x7b\x39",
+		.plen	= 60,
+		.assoc	= "\xfe\xed\xfa\xce\xde\xad\xbe\xef"
+			  "\xfe\xed\xfa\xce\xde\xad\xbe\xef"
+			  "\xab\xad\xda\xd2",
+		.alen	= 20,
+		.ctext	= "\xe4\x11\x0f\xf1\xc1\x41\x97\xe6"
+			  "\x76\x21\x6a\x33\x83\x10\x41\xeb"
+			  "\x09\x58\x00\x11\x7b\xdc\x3f\x75"
+			  "\x1a\x49\x6e\xfc\xf2\xbb\xdf\xdb"
+			  "\x3a\x2e\x13\xfd\xc5\xc1\x9d\x07"
+			  "\x1a\xe5\x48\x3f\xed\xde\x98\x5d"
+			  "\x3f\x2d\x5b\x4e\xee\x0b\xb6\xdf"
+			  "\xe3\x63\x36\x83"
+			  "\x89\xf6\xba\x35\xb8\x18\xd3\xcc"
+			  "\x38\x6c\x05\xb3\x8a\xcb\xc9\xde",
+		.clen	= 76,
+	}, {
+		.key	= "\xfe\xff\xe9\x92\x86\x65\x73\x1c"
+			  "\xfe\xff\xe9\x92\x86\x65\x73\x1c",
+		.klen	= 16,
+		.iv	= "\xca\xfe\xba\xbe\xfa\xce\xdb\xad"
+			  "\xde\xca\xf8\x88",
+		.ptext	= "\xd9\x31\x32\x25\xf8\x84\x06\xe5"
+			  "\xa5\x59\x09\xc5\xaf\xf5\x26\x9a"
+			  "\x86\xa7\xa9\x53\x15\x34\xf7\xda"
+			  "\x2e\x4c\x30\x3d\x8a\x31\x8a\x72"
+			  "\x1c\x3c\x0c\x95\x95\x68\x09\x53"
+			  "\x2f\xcf\x0e\x24\x49\xa6\xb5\x25"
+			  "\xb1\x6a\xed\xf5\xaa\x0d\xe6\x57"
+			  "\xba\x63\x7b\x39",
+		.plen	= 60,
+		.assoc	= "\xfe\xed\xfa\xce\xde\xad\xbe\xef"
+			  "\xfe\xed\xfa\xce\xde\xad\xbe\xef"
+			  "\xab\xad\xda\xd2",
+		.alen	= 20,
+		.ctext	= "\xc1\x11\x44\x51\xd9\x25\x87\x5b"
+			  "\x0f\xd9\x06\xf3\x33\x44\xbb\x87"
+			  "\x8b\xa3\x77\xd2\x0c\x60\xfa\xcc"
+			  "\x85\x50\x6f\x96\x0c\x54\x54\xc1"
+			  "\x58\x04\x88\x6e\xf4\x26\x35\x7e"
+			  "\x94\x80\x48\x6c\xf2\xf4\x88\x1f"
+			  "\x19\x63\xea\xae\xba\x81\x1a\x5d"
+			  "\x0e\x6f\x59\x08"
+			  "\x33\xac\x5b\xa8\x19\x60\xdb\x1d"
+			  "\xdd\x2e\x22\x2e\xe0\x87\x51\x5d",
+		.clen	= 76,
+	}, {
+		.key	= "\x8b\x32\xcf\xe7\x44\xed\x13\x59"
+			  "\x04\x38\x77\xb0\xb9\xad\xb4\x38",
+		.klen	= 16,
+		.iv	= "\x00\xff\xff\xff\xff\x00\x00\xff"
+			  "\xff\xff\x00\xff",
+		.ptext	= "\x42\xc1\xcc\x08\x48\x6f\x41\x3f"
+			  "\x2f\x11\x66\x8b\x2a\x16\xf0\xe0"
+			  "\x58\x83\xf0\xc3\x70\x14\xc0\x5b"
+			  "\x3f\xec\x1d\x25\x3c\x51\xd2\x03"
+			  "\xcf\x59\x74\x1f\xb2\x85\xb4\x07"
+			  "\xc6\x6a\x63\x39\x8a\x5b\xde\xcb"
+			  "\xaf\x08\x44\xbd\x6f\x91\x15\xe1"
+			  "\xf5\x7a\x6e\x18\xbd\xdd\x61\x50"
+			  "\x59\xa9\x97\xab\xbb\x0e\x74\x5c"
+			  "\x00\xa4\x43\x54\x04\x54\x9b\x3b"
+			  "\x77\xec\xfd\x5c\xa6\xe8\x7b\x08"
+			  "\xae\xe6\x10\x3f\x32\x65\xd1\xfc"
+			  "\xa4\x1d\x2c\x31\xfb\x33\x7a\xb3"
+			  "\x35\x23\xf4\x20\x41\xd4\xad\x82"
+			  "\x8b\xa4\xad\x96\x1c\x20\x53\xbe"
+			  "\x0e\xa6\xf4\xdc\x78\x49\x3e\x72"
+			  "\xb1\xa9\xb5\x83\xcb\x08\x54\xb7"
+			  "\xad\x49\x3a\xae\x98\xce\xa6\x66"
+			  "\x10\x30\x90\x8c\x55\x83\xd7\x7c"
+			  "\x8b\xe6\x53\xde\xd2\x6e\x18\x21"
+			  "\x01\x52\xd1\x9f\x9d\xbb\x9c\x73"
+			  "\x57\xcc\x89\x09\x75\x9b\x78\x70"
+			  "\xed\x26\x97\x4d\xb4\xe4\x0c\xa5"
+			  "\xfa\x70\x04\x70\xc6\x96\x1c\x7d"
+			  "\x54\x41\x77\xa8\xe3\xb0\x7e\x96"
+			  "\x82\xd9\xec\xa2\x87\x68\x55\xf9"
+			  "\x8f\x9e\x73\x43\x47\x6a\x08\x36"
+			  "\x93\x67\xa8\x2d\xde\xac\x41\xa9"
+			  "\x5c\x4d\x73\x97\x0f\x70\x68\xfa"
+			  "\x56\x4d\x00\xc2\x3b\x1f\xc8\xb9"
+			  "\x78\x1f\x51\x07\xe3\x9a\x13\x4e"
+			  "\xed\x2b\x2e\xa3\xf7\x44\xb2\xe7"
+			  "\xab\x19\x37\xd9\xba\x76\x5e\xd2"
+			  "\xf2\x53\x15\x17\x4c\x6b\x16\x9f"
+			  "\x02\x66\x49\xca\x7c\x91\x05\xf2"
+			  "\x45\x36\x1e\xf5\x77\xad\x1f\x46"
+			  "\xa8\x13\xfb\x63\xb6\x08\x99\x63"
+			  "\x82\xa2\xed\xb3\xac\xdf\x43\x19"
+			  "\x45\xea\x78\x73\xd9\xb7\x39\x11"
+			  "\xa3\x13\x7c\xf8\x3f\xf7\xad\x81"
+			  "\x48\x2f\xa9\x5c\x5f\xa0\xf0\x79"
+			  "\xa4\x47\x7d\x80\x20\x26\xfd\x63"
+			  "\x0a\xc7\x7e\x6d\x75\x47\xff\x76"
+			  "\x66\x2e\x8a\x6c\x81\x35\xaf\x0b"
+			  "\x2e\x6a\x49\x60\xc1\x10\xe1\xe1"
+			  "\x54\x03\xa4\x09\x0c\x37\x7a\x15"
+			  "\x23\x27\x5b\x8b\x4b\xa5\x64\x97"
+			  "\xae\x4a\x50\x73\x1f\x66\x1c\x5c"
+			  "\x03\x25\x3c\x8d\x48\x58\x71\x34"
+			  "\x0e\xec\x4e\x55\x1a\x03\x6a\xe5"
+			  "\xb6\x19\x2b\x84\x2a\x20\xd1\xea"
+			  "\x80\x6f\x96\x0e\x05\x62\xc7\x78"
+			  "\x87\x79\x60\x38\x46\xb4\x25\x57"
+			  "\x6e\x16\x63\xf8\xad\x6e\xd7\x42"
+			  "\x69\xe1\x88\xef\x6e\xd5\xb4\x9a"
+			  "\x3c\x78\x6c\x3b\xe5\xa0\x1d\x22"
+			  "\x86\x5c\x74\x3a\xeb\x24\x26\xc7"
+			  "\x09\xfc\x91\x96\x47\x87\x4f\x1a"
+			  "\xd6\x6b\x2c\x18\x47\xc0\xb8\x24"
+			  "\xa8\x5a\x4a\x9e\xcb\x03\xe7\x2a"
+			  "\x09\xe6\x4d\x9c\x6d\x86\x60\xf5"
+			  "\x2f\x48\x69\x37\x9f\xf2\xd2\xcb"
+			  "\x0e\x5a\xdd\x6e\x8a\xfb\x6a\xfe"
+			  "\x0b\x63\xde\x87\x42\x79\x8a\x68"
+			  "\x51\x28\x9b\x7a\xeb\xaf\xb8\x2f"
+			  "\x9d\xd1\xc7\x45\x90\x08\xc9\x83"
+			  "\xe9\x83\x84\xcb\x28\x69\x09\x69"
+			  "\xce\x99\x46\x00\x54\xcb\xd8\x38"
+			  "\xf9\x53\x4a\xbf\x31\xce\x57\x15"
+			  "\x33\xfa\x96\x04\x33\x42\xe3\xc0"
+			  "\xb7\x54\x4a\x65\x7a\x7c\x02\xe6"
+			  "\x19\x95\xd0\x0e\x82\x07\x63\xf9"
+			  "\xe1\x2b\x2a\xfc\x55\x92\x52\xc9"
+			  "\xb5\x9f\x23\x28\x60\xe7\x20\x51"
+			  "\x10\xd3\xed\x6d\x9b\xab\xb8\xe2"
+			  "\x5d\x9a\x34\xb3\xbe\x9c\x64\xcb"
+			  "\x78\xc6\x91\x22\x40\x91\x80\xbe"
+			  "\xd7\x78\x5c\x0e\x0a\xdc\x08\xe9"
+			  "\x67\x10\xa4\x83\x98\x79\x23\xe7"
+			  "\x92\xda\xa9\x22\x16\xb1\xe7\x78"
+			  "\xa3\x1c\x6c\x8f\x35\x7c\x4d\x37"
+			  "\x2f\x6e\x0b\x50\x5c\x34\xb9\xf9"
+			  "\xe6\x3d\x91\x0d\x32\x95\xaa\x3d"
+			  "\x48\x11\x06\xbb\x2d\xf2\x63\x88"
+			  "\x3f\x73\x09\xe2\x45\x56\x31\x51"
+			  "\xfa\x5e\x4e\x62\xf7\x90\xf9\xa9"
+			  "\x7d\x7b\x1b\xb1\xc8\x26\x6e\x66"
+			  "\xf6\x90\x9a\x7f\xf2\x57\xcc\x23"
+			  "\x59\xfa\xfa\xaa\x44\x04\x01\xa7"
+			  "\xa4\x78\xdb\x74\x3d\x8b\xb5",
+		.plen	= 719,
+		.ctext	= "\xdc\xb1\x0f\x2a\xe8\x2d\x1c\x57"
+			  "\xc4\x82\xfa\xd6\x87\xe6\x2f\x50"
+			  "\xbd\x9e\x0a\x42\x31\xf2\xc7\xbb"
+			  "\x21\x63\xa7\x05\x43\x33\xef\x33"
+			  "\x5c\xd3\x47\x55\xce\x5c\xe4\xd4"
+			  "\xe5\x07\x62\x22\xac\x01\xa8\x35"
+			  "\x9c\x59\x34\x30\x8e\xff\x9f\xb4"
+			  "\xd2\x4e\x74\x90\x64\xf2\x78\x5e"
+			  "\x63\xb7\xc5\x08\x1b\x37\xa5\x9e"
+			  "\xc0\xde\xff\xa9\x7f\x0b\xd3\x02"
+			  "\x83\x6e\x33\xfa\x43\x11\xd3\xda"
+			  "\x02\xcf\xcd\x4a\xc0\x78\x1f\x39"
+			  "\x62\xcb\xa3\x95\x7e\x13\x92\x28"
+			  "\xb2\xc4\x7a\xba\xd1\xc6\xf6\x1f"
+			  "\xda\x0b\xf1\xd1\x99\x54\xd8\x3b"
+			  "\x16\xf8\xe6\x97\x1e\xa7\xcf\x49"
+			  "\x69\x84\x01\x4c\xdc\x7a\x34\xff"
+			  "\x01\x08\xa3\x0b\x39\xac\x21\x37"
+			  "\xd8\xb4\x04\x19\x8b\x7a\x7d\x17"
+			  "\x44\xd1\x18\xaf\x1f\xa9\x29\xfe"
+			  "\xfa\x77\xe0\x40\x42\x0c\x79\xb7"
+			  "\xc3\x15\x1b\xd9\x0c\x82\xfc\x16"
+			  "\x70\xd6\x2a\xe9\x94\x72\xc5\xa5"
+			  "\x8a\x58\xbc\xfa\xe0\x88\x39\x4a"
+			  "\x80\xe8\xec\xaf\x60\xac\xe7\xf8"
+			  "\x9c\xf0\xfc\x61\x39\x07\x98\x6b"
+			  "\x88\xe3\x98\x22\x28\x18\x4a\x2d"
+			  "\x25\xef\x10\xe3\x83\x66\x3f\xfd"
+			  "\xc7\x0b\xa3\xfd\x97\xa9\xf4\xbd"
+			  "\xd8\x2a\xee\x4a\x50\xad\xcc\xb5"
+			  "\xc7\xab\xb8\x79\x9c\xd1\xf1\x27"
+			  "\x08\xf5\xf5\xe8\x1b\x66\xce\x41"
+			  "\x56\x60\x94\x86\xf0\x78\xc2\xfa"
+			  "\x5b\x63\x40\xb1\xd1\x1a\x38\x69"
+			  "\x0b\x8c\xb2\xf5\xa2\xbe\x90\x9d"
+			  "\x46\x23\x79\x8b\x3b\x4a\xf4\xbb"
+			  "\x55\xf7\x58\x9d\xaf\x59\xff\x74"
+			  "\xf3\xb9\xc4\x26\xb1\xf8\xe1\x28"
+			  "\x8b\x5e\x8f\x6d\x64\xe7\xe8\x63"
+			  "\xd2\x9e\xcb\xee\xae\x19\x04\x1d"
+			  "\x05\xf0\x9d\x99\x7b\x33\x33\xae"
+			  "\x6e\xe5\x09\xdd\x67\x51\xc4\xc8"
+			  "\x6a\xc7\x36\x35\xc9\x93\x76\xa1"
+			  "\xa8\x1c\xfa\x75\x92\x34\x0e\x7d"
+			  "\x3d\x1d\xef\x00\xfd\xa5\x25\x12"
+			  "\x7c\x91\x21\x41\xcc\x50\x47\xa9"
+			  "\x22\x50\x24\x96\x34\x79\x3d\xe8"
+			  "\x3f\xa0\x56\xaf\x98\x53\x55\xc3"
+			  "\x46\x1b\x17\x54\xb8\xb0\xb7\xe0"
+			  "\xe0\xab\x47\x6f\x06\xda\xcc\x75"
+			  "\xa7\x96\xb7\x92\xf3\xa0\x5f\xe6"
+			  "\xba\x97\xe3\x2f\x97\x05\xb2\x99"
+			  "\xa0\x09\x10\x98\x9c\xd3\x2e\xd1"
+			  "\x7e\x2a\x30\x54\x3c\xb9\x33\xe3"
+			  "\xf2\xaf\xd3\xa5\xee\xd0\x0b\x8a"
+			  "\x19\x54\x0f\x02\x51\x1f\x91\xdf"
+			  "\x71\x9c\xad\x77\x35\x28\x55\x6d"
+			  "\xcd\x7a\xd9\xa3\x41\x98\x6b\x37"
+			  "\x19\x0f\xbe\xae\x69\xb2\x25\x01"
+			  "\xee\x0e\x51\x4b\x53\xea\x0f\x5f"
+			  "\x85\x74\x79\x36\x32\x0a\x2a\x40"
+			  "\xad\x6b\x78\x41\x54\x99\xe9\xc1"
+			  "\x2b\x6c\x9b\x42\x21\xef\xe2\x50"
+			  "\x56\x8d\x78\xdf\x58\xbe\x0a\x0f"
+			  "\xfc\xfc\x0d\x2e\xd0\xcb\xa6\x0a"
+			  "\xa8\xd9\x1e\xa9\xd4\x7c\x99\x88"
+			  "\xcf\x11\xad\x1c\xd3\x04\x63\x55"
+			  "\xef\x85\x0b\x69\xa1\x40\xf1\x75"
+			  "\x24\xf4\xe5\x2c\xd4\x7a\x24\x50"
+			  "\x8f\xa2\x71\xc9\x92\x20\xcd\xcf"
+			  "\xda\x40\xbe\xf6\xfe\x1a\xca\xc7"
+			  "\x4a\x80\x45\x55\xcb\xdd\xb7\x01"
+			  "\xb0\x8d\xcb\xd2\xae\xbd\xa4\xd0"
+			  "\x5c\x10\x05\x66\x7b\xd4\xff\xd9"
+			  "\xc4\x23\x9d\x8d\x6b\x24\xf8\x3f"
+			  "\x73\x4d\x5c\x2b\x33\x4c\x5e\x63"
+			  "\x74\x6d\x03\xa1\x7a\x35\x65\x17"
+			  "\x38\x7f\x3b\xc1\x69\xcf\x61\x34"
+			  "\x30\x21\xaf\x97\x47\x12\x3f\xa1"
+			  "\xa7\x50\xc5\x87\xfb\x3f\x70\x32"
+			  "\x86\x17\x5f\x25\xe4\x74\xc6\xd0"
+			  "\x9b\x39\xe6\xe1\x5a\xec\x8f\x40"
+			  "\xce\xcc\x37\x3b\xd8\x72\x1c\x31"
+			  "\x75\xa4\xa6\x89\x8c\xdd\xd6\xd2"
+			  "\x32\x3d\xe8\xc3\x54\xab\x1f\x35"
+			  "\x52\xb4\x94\x81\xb0\x37\x3a\x03"
+			  "\xbb\xb1\x99\x30\xa5\xf8\x21\xcd"
+			  "\x93\x5d\xa7\x13\xed\xc7\x49\x09"
+			  "\x70\xda\x08\x39\xaa\x15\x9e\x45"
+			  "\x35\x2b\x0f\x5c\x8c\x8b\xc9"
+			  "\xa8\xb8\x9f\xfd\x37\x36\x31\x7e"
+			  "\x34\x4f\xc1\xc0\xca\x8a\x22\xfd",
+		.clen	= 735,
 	}
 };
 
@@ -14947,6 +15770,282 @@ static const struct aead_testvec sm4_ccm_tv_template[] = {
 			  "\x16\x84\x2D\x4F\xA1\x86\xF5\x6A"
 			  "\xB3\x32\x56\x97\x1F\xA1\x10\xF4",
 		.clen	= 80,
+	}, { /* Generated from AES-CCM test vectors */
+		.key	= "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+			  "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf",
+		.klen	= 16,
+		.iv	= "\x01\x00\x00\x00\x03\x02\x01\x00"
+			  "\xa0\xa1\xa2\xa3\xa4\xa5\x00\x00",
+		.assoc	= "\x00\x01\x02\x03\x04\x05\x06\x07",
+		.alen	= 8,
+		.ptext	= "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17"
+			  "\x18\x19\x1a\x1b\x1c\x1d\x1e",
+		.plen	= 23,
+		.ctext	= "\x7b\xff\x4a\x15\xf5\x73\xce\x82"
+			  "\x6e\xc2\x31\x1d\xe2\x53\x02\xac"
+			  "\xa4\x48\xf9\xe4\xf5\x1f\x81\x70"
+			  "\x18\xbc\xb6\x84\x01\xb8\xae",
+		.clen	= 31,
+	}, {
+		.key	= "\xf4\x6b\xc2\x75\x62\xfe\xb4\xe1"
+			  "\x53\x14\x73\x66\x8d\x88\xf6\x80",
+		.klen	= 16,
+		.iv	= "\x03\xa0\x20\x35\x26\xf2\x21\x8d"
+			  "\x50\x20\xda\xe2\x00\x00\x00\x00",
+		.assoc	= "\x5b\x9e\x13\x67\x02\x5e\xef\xc1"
+			  "\x6c\xf9\xd7\x1e\x52\x8f\x7a\x47"
+			  "\xe9\xd4\xcf\x20\x14\x6e\xf0\x2d"
+			  "\xd8\x9e\x2b\x56\x10\x23\x56\xe7",
+		.alen	= 32,
+		.ctext	= "\x23\x58\xce\xdc\x40\xb1\xcd\x92"
+			  "\x47\x96\x59\xfc\x8a\x26\x4f\xcf",
+		.clen	= 16,
+	}, {
+		.key	= "\xab\x2f\x8a\x74\xb7\x1c\xd2\xb1"
+			  "\xff\x80\x2e\x48\x7d\x82\xf8\xb9",
+		.klen	= 16,
+		.iv	= "\x03\xaf\x94\x87\x78\x35\x82\x81"
+			  "\x7f\x88\x94\x68\x00\x00\x00\x00",
+		.alen	= 0,
+		.ptext	= "\x00",
+		.plen	= 0,
+		.ctext	= "\x72\x7e\xf5\xd6\x39\x7a\x2b\x43",
+		.clen	= 8,
+	}, {
+		.key	= "\x39\xbb\xa7\xbe\x59\x97\x9e\x73"
+			  "\xa4\x48\x93\x39\x26\x71\x4a\xc6",
+		.klen	= 16,
+		.iv	= "\x03\xee\x49\x83\xe9\xa9\xff\xe9"
+			  "\x57\xba\xfd\x9e\x00\x00\x00\x00",
+		.assoc	= "\x44\xa6\x2c\x05\xe9\xe1\x43\xb1"
+			  "\x58\x7c\xf2\x5c\x6d\x39\x0a\x64"
+			  "\xa4\xf0\x13\x05\xd1\x77\x99\x67"
+			  "\x11\xc4\xc6\xdb\x00\x56\x36\x61",
+		.alen	= 32,
+		.ptext	= "\x00",
+		.plen	= 0,
+		.ctext	= "\xb0\x9d\xc6\xfb\x7d\xb5\xa1\x0e",
+		.clen	= 8,
+	}, {
+		.key	= "\x58\x5d\xa0\x96\x65\x1a\x04\xd7"
+			  "\x0d\x1a\x53\x3b\xb5\xe3\xf8\x8b",
+		.klen	= 16,
+		.iv	= "\x03\xcf\x76\x3f\xd9\x95\x75\x8f"
+			  "\x44\x89\x40\x7b\x00\x00\x00\x00",
+		.assoc	= "\x8f\x86\x6c\x4d\x1d\xc5\x39\x88"
+			  "\xc8\xf3\x5c\x52\x10\x63\x6f\x2b"
+			  "\x8a\x2a\xc5\x6f\x30\x23\x58\x7b"
+			  "\xfb\x36\x03\x11\xb4\xd9\xf2\xfe",
+		.alen	= 32,
+		.ptext	= "\xc2\x54\xc8\xde\x78\x87\x77\x40"
+			  "\x49\x71\xe4\xb7\xe7\xcb\x76\x61"
+			  "\x0a\x41\xb9\xe9\xc0\x76\x54\xab"
+			  "\x04\x49\x3b\x19\x93\x57\x25\x5d",
+		.plen	= 32,
+		.ctext	= "\xc9\xae\xef\x1d\xf3\x2c\xd3\x38"
+			  "\xc9\x7f\x7e\x28\xe8\xaa\xb3\x60"
+			  "\x49\xdc\x66\xca\x7b\x3d\xe0\x3c"
+			  "\xcb\x45\x9c\x1b\xb2\xbe\x07\x90"
+			  "\x87\xa6\x6b\x89\x0d\x0f\x90\xaa"
+			  "\x7d\xf6\x5a\x9a\x68\x2b\x81\x92",
+		.clen	= 48,
+	}, {
+		.key	= "\x8b\x32\xcf\xe7\x44\xed\x13\x59"
+			  "\x04\x38\x77\xb0\xb9\xad\xb4\x38",
+		.klen	= 16,
+		.iv	= "\x02\xff\xff\xff\xff\x00\x00\xff"
+			  "\xff\xff\x00\xff\xff\x00\x00\x00",
+		.assoc	= "\x8f\x86\x6c\x4d\x1d\xc5\x39\x88"
+			  "\xc8\xf3\x5c\x52\x10\x63\x6f\x2b"
+			  "\x8a\x2a\xc5\x6f\x30\x23\x58\x7b"
+			  "\xfb\x36\x03\x11\xb4\xd9\xf2\xfe"
+			  "\xc8\xf3\x5c\x52\x10\x63",
+		.alen	= 38,
+		.ptext	= "\x42\xc1\xcc\x08\x48\x6f\x41\x3f"
+			  "\x2f\x11\x66\x8b\x2a\x16\xf0\xe0"
+			  "\x58\x83\xf0\xc3\x70\x14\xc0\x5b"
+			  "\x3f\xec\x1d\x25\x3c\x51\xd2\x03"
+			  "\xcf\x59\x74\x1f\xb2\x85\xb4\x07"
+			  "\xc6\x6a\x63\x39\x8a\x5b\xde\xcb"
+			  "\xaf\x08\x44\xbd\x6f\x91\x15\xe1"
+			  "\xf5\x7a\x6e\x18\xbd\xdd\x61\x50"
+			  "\x59\xa9\x97\xab\xbb\x0e\x74\x5c"
+			  "\x00\xa4\x43\x54\x04\x54\x9b\x3b"
+			  "\x77\xec\xfd\x5c\xa6\xe8\x7b\x08"
+			  "\xae\xe6\x10\x3f\x32\x65\xd1\xfc"
+			  "\xa4\x1d\x2c\x31\xfb\x33\x7a\xb3"
+			  "\x35\x23\xf4\x20\x41\xd4\xad\x82"
+			  "\x8b\xa4\xad\x96\x1c\x20\x53\xbe"
+			  "\x0e\xa6\xf4\xdc\x78\x49\x3e\x72"
+			  "\xb1\xa9\xb5\x83\xcb\x08\x54\xb7"
+			  "\xad\x49\x3a\xae\x98\xce\xa6\x66"
+			  "\x10\x30\x90\x8c\x55\x83\xd7\x7c"
+			  "\x8b\xe6\x53\xde\xd2\x6e\x18\x21"
+			  "\x01\x52\xd1\x9f\x9d\xbb\x9c\x73"
+			  "\x57\xcc\x89\x09\x75\x9b\x78\x70"
+			  "\xed\x26\x97\x4d\xb4\xe4\x0c\xa5"
+			  "\xfa\x70\x04\x70\xc6\x96\x1c\x7d"
+			  "\x54\x41\x77\xa8\xe3\xb0\x7e\x96"
+			  "\x82\xd9\xec\xa2\x87\x68\x55\xf9"
+			  "\x8f\x9e\x73\x43\x47\x6a\x08\x36"
+			  "\x93\x67\xa8\x2d\xde\xac\x41\xa9"
+			  "\x5c\x4d\x73\x97\x0f\x70\x68\xfa"
+			  "\x56\x4d\x00\xc2\x3b\x1f\xc8\xb9"
+			  "\x78\x1f\x51\x07\xe3\x9a\x13\x4e"
+			  "\xed\x2b\x2e\xa3\xf7\x44\xb2\xe7"
+			  "\xab\x19\x37\xd9\xba\x76\x5e\xd2"
+			  "\xf2\x53\x15\x17\x4c\x6b\x16\x9f"
+			  "\x02\x66\x49\xca\x7c\x91\x05\xf2"
+			  "\x45\x36\x1e\xf5\x77\xad\x1f\x46"
+			  "\xa8\x13\xfb\x63\xb6\x08\x99\x63"
+			  "\x82\xa2\xed\xb3\xac\xdf\x43\x19"
+			  "\x45\xea\x78\x73\xd9\xb7\x39\x11"
+			  "\xa3\x13\x7c\xf8\x3f\xf7\xad\x81"
+			  "\x48\x2f\xa9\x5c\x5f\xa0\xf0\x79"
+			  "\xa4\x47\x7d\x80\x20\x26\xfd\x63"
+			  "\x0a\xc7\x7e\x6d\x75\x47\xff\x76"
+			  "\x66\x2e\x8a\x6c\x81\x35\xaf\x0b"
+			  "\x2e\x6a\x49\x60\xc1\x10\xe1\xe1"
+			  "\x54\x03\xa4\x09\x0c\x37\x7a\x15"
+			  "\x23\x27\x5b\x8b\x4b\xa5\x64\x97"
+			  "\xae\x4a\x50\x73\x1f\x66\x1c\x5c"
+			  "\x03\x25\x3c\x8d\x48\x58\x71\x34"
+			  "\x0e\xec\x4e\x55\x1a\x03\x6a\xe5"
+			  "\xb6\x19\x2b\x84\x2a\x20\xd1\xea"
+			  "\x80\x6f\x96\x0e\x05\x62\xc7\x78"
+			  "\x87\x79\x60\x38\x46\xb4\x25\x57"
+			  "\x6e\x16\x63\xf8\xad\x6e\xd7\x42"
+			  "\x69\xe1\x88\xef\x6e\xd5\xb4\x9a"
+			  "\x3c\x78\x6c\x3b\xe5\xa0\x1d\x22"
+			  "\x86\x5c\x74\x3a\xeb\x24\x26\xc7"
+			  "\x09\xfc\x91\x96\x47\x87\x4f\x1a"
+			  "\xd6\x6b\x2c\x18\x47\xc0\xb8\x24"
+			  "\xa8\x5a\x4a\x9e\xcb\x03\xe7\x2a"
+			  "\x09\xe6\x4d\x9c\x6d\x86\x60\xf5"
+			  "\x2f\x48\x69\x37\x9f\xf2\xd2\xcb"
+			  "\x0e\x5a\xdd\x6e\x8a\xfb\x6a\xfe"
+			  "\x0b\x63\xde\x87\x42\x79\x8a\x68"
+			  "\x51\x28\x9b\x7a\xeb\xaf\xb8\x2f"
+			  "\x9d\xd1\xc7\x45\x90\x08\xc9\x83"
+			  "\xe9\x83\x84\xcb\x28\x69\x09\x69"
+			  "\xce\x99\x46\x00\x54\xcb\xd8\x38"
+			  "\xf9\x53\x4a\xbf\x31\xce\x57\x15"
+			  "\x33\xfa\x96\x04\x33\x42\xe3\xc0"
+			  "\xb7\x54\x4a\x65\x7a\x7c\x02\xe6"
+			  "\x19\x95\xd0\x0e\x82\x07\x63\xf9"
+			  "\xe1\x2b\x2a\xfc\x55\x92\x52\xc9"
+			  "\xb5\x9f\x23\x28\x60\xe7\x20\x51"
+			  "\x10\xd3\xed\x6d\x9b\xab\xb8\xe2"
+			  "\x5d\x9a\x34\xb3\xbe\x9c\x64\xcb"
+			  "\x78\xc6\x91\x22\x40\x91\x80\xbe"
+			  "\xd7\x78\x5c\x0e\x0a\xdc\x08\xe9"
+			  "\x67\x10\xa4\x83\x98\x79\x23\xe7"
+			  "\x92\xda\xa9\x22\x16\xb1\xe7\x78"
+			  "\xa3\x1c\x6c\x8f\x35\x7c\x4d\x37"
+			  "\x2f\x6e\x0b\x50\x5c\x34\xb9\xf9"
+			  "\xe6\x3d\x91\x0d\x32\x95\xaa\x3d"
+			  "\x48\x11\x06\xbb\x2d\xf2\x63\x88"
+			  "\x3f\x73\x09\xe2\x45\x56\x31\x51"
+			  "\xfa\x5e\x4e\x62\xf7\x90\xf9\xa9"
+			  "\x7d\x7b\x1b\xb1\xc8\x26\x6e\x66"
+			  "\xf6\x90\x9a\x7f\xf2\x57\xcc\x23"
+			  "\x59\xfa\xfa\xaa\x44\x04\x01\xa7"
+			  "\xa4\x78\xdb\x74\x3d\x8b\xb5",
+		.plen	= 719,
+		.ctext	= "\xc5\x50\x85\x02\x72\xa8\xb3\x62"
+			  "\xf9\xcd\x77\x7b\x43\xa5\x04\x70"
+			  "\x68\x40\x57\x21\x1c\xfe\xef\x05"
+			  "\x4d\xb8\x44\xba\x59\xea\x62\x32"
+			  "\xcb\x6b\x6a\x39\x9b\xf3\xe5\xa4"
+			  "\x36\x38\xde\x7d\xcf\xb6\xcd\xe3"
+			  "\x89\xbf\x37\xc9\x96\x3c\x70\x10"
+			  "\x92\x47\xcc\xac\x6f\xf8\x55\x9a"
+			  "\x26\x43\x34\xb4\x92\x7d\x68\xfc"
+			  "\x60\x37\x74\x2a\x55\xba\xc7\xd7"
+			  "\x98\x69\xb7\xcf\x42\xfd\xb2\x10"
+			  "\xa0\x59\xe1\x2c\x73\x66\x12\x97"
+			  "\x85\x8b\x28\xcc\x29\x02\x15\x89"
+			  "\x23\xd3\x32\x92\x87\x57\x09\x13"
+			  "\x04\x7e\x8b\x6c\x3a\xc1\x4e\x6c"
+			  "\xe1\x9f\xc8\xcc\x47\x9c\xd8\x10"
+			  "\xf4\xb7\x5c\x30\x7a\x8b\x0f\x01"
+			  "\x52\x38\x02\x92\x99\xac\x03\x90"
+			  "\x18\x32\x2d\x21\x6a\x0a\x2a\xe7"
+			  "\xc2\xcc\x15\x84\x4e\x2b\x0b\x3a"
+			  "\x4c\xdc\xb0\x6b\x10\xd1\x27\x10"
+			  "\xf0\x4a\x5c\x43\xa0\x34\x34\x59"
+			  "\x47\x43\x48\xcb\x69\xa7\xff\x52"
+			  "\xb8\xca\x23\x09\x07\xd7\xc5\xe4"
+			  "\x2a\x4f\x99\xd5\x83\x36\x2a\x2d"
+			  "\x59\xd0\xca\xb0\xfa\x40\x8c\xab"
+			  "\xdf\x69\x08\xd9\x79\x1d\xde\xa8"
+			  "\x0b\x34\x74\x4d\xf5\xa0\x4c\x81"
+			  "\x7f\x93\x06\x40\x24\xfe\x7d\xcd"
+			  "\xe4\xfe\xf8\xf8\x30\xce\xd0\x5d"
+			  "\x70\xfd\x0d\x5a\x78\x85\x74\x2d"
+			  "\xe4\xb5\x40\x18\x99\x11\xe4\x6a"
+			  "\xdf\xfa\x4f\x25\x2c\xde\x15\xb7"
+			  "\x12\xd8\xc6\x90\x0d\x0f\xc9\xfb"
+			  "\x21\xf1\xed\xfe\x98\xe1\x03\xe2"
+			  "\x5c\xef\xb6\xc7\x87\x77\x0e\xcd"
+			  "\xff\x78\x94\xc9\xbe\xd3\x47\xf7"
+			  "\x8d\x37\x48\x01\x42\xe2\x17\x96"
+			  "\xfc\xc0\xcb\x7b\x7b\x57\xaf\x3b"
+			  "\xc9\xd0\x94\xce\x5e\x1b\xa9\x47"
+			  "\x02\x4d\x74\xcc\x45\x1d\xd3\x2d"
+			  "\x5f\x4f\x7f\xf2\x4b\xf9\x59\xee"
+			  "\x9e\x9e\xb9\x95\x29\x19\xd1\x5f"
+			  "\x72\xab\x8d\xf1\x28\xd1\x1c\xae"
+			  "\xc2\xba\xf7\x22\x84\x2c\x83\x51"
+			  "\x03\xad\xa3\xef\x81\xa7\xdc\xf1"
+			  "\x44\x51\x50\x96\x70\xd1\xe5\x47"
+			  "\x57\xf9\x30\x90\xe4\xbf\xfc\x75"
+			  "\x14\xaa\x4d\xb7\xb1\xe7\x79\x33"
+			  "\x43\xc2\x5c\xc1\xbc\x09\x92\x0f"
+			  "\xa7\xaf\x68\x51\x51\xec\x0b\xc3"
+			  "\x3d\x2b\x94\x30\x45\x29\x1b\x9e"
+			  "\x70\x56\xf8\xd6\x67\x2d\x39\x3b"
+			  "\x3c\xd2\xd0\xd3\xdc\x7d\x84\xe9"
+			  "\x06\x31\x98\xa6\x5c\xbf\x10\x58"
+			  "\xce\xbb\xa7\xe1\x65\x7e\x51\x87"
+			  "\x70\x46\xb4\x7f\xf9\xec\x92\x1c"
+			  "\x9b\x24\x49\xc1\x04\xbe\x1c\x5f"
+			  "\xcc\xb3\x33\x8c\xad\xe7\xdc\x32"
+			  "\x54\xa2\x0d\x83\x0f\x3c\x12\x5d"
+			  "\x71\xe3\x9c\xae\x71\xa3\x2a\x10"
+			  "\xc5\x91\xb4\x73\x96\x60\xdb\x5d"
+			  "\x1f\xd5\x9a\xd2\x69\xc3\xd7\x4b"
+			  "\xa2\x66\x81\x96\x4a\xaa\x02\xd6"
+			  "\xd5\x44\x9b\x42\x3a\x15\x5f\xe7"
+			  "\x4d\x7c\xf6\x71\x4a\xea\xe8\x43"
+			  "\xd7\x68\xe4\xbc\x05\x87\x49\x05"
+			  "\x3b\x47\xb2\x6d\x5f\xd1\x11\xa6"
+			  "\x58\xd4\xa2\x45\xec\xb5\x54\x55"
+			  "\xd3\xd6\xd2\x6a\x8b\x21\x9e\x2c"
+			  "\xf1\x27\x4b\x5b\xe3\xff\xe0\xfd"
+			  "\x4b\xf1\xe7\xe2\x84\xf2\x17\x37"
+			  "\x11\x68\xc4\x92\x4b\x6b\xef\x8e"
+			  "\x75\xf5\xc2\x7d\x5c\xe9\x7c\xfc"
+			  "\x2b\x00\x33\x0e\x7d\x69\xd8\xd4"
+			  "\x9b\xa8\x38\x54\x7e\x6d\x23\x51"
+			  "\x2c\xd6\xc4\x58\x23\x1c\x22\x2a"
+			  "\x59\xc5\x9b\xec\x9d\xbf\x03\x0f"
+			  "\xb3\xdd\xba\x02\x22\xa0\x34\x37"
+			  "\x19\x56\xc2\x5b\x32\x1d\x1e\x66"
+			  "\x68\xf4\x47\x05\x04\x18\xa7\x28"
+			  "\x80\xf2\xc7\x99\xed\x1e\x72\x48"
+			  "\x8f\x97\x5d\xb3\x74\x42\xfd\x0c"
+			  "\x0f\x5f\x29\x0c\xf1\x35\x22\x90"
+			  "\xd6\x7c\xb8\xa3\x2a\x89\x38\x71"
+			  "\xe9\x7a\x55\x3c\x3b\xf2\x6e\x1a"
+			  "\x22\x8f\x07\x81\xc1\xe1\xf1\x76"
+			  "\x2a\x75\xab\x86\xc4\xcc\x52\x59"
+			  "\x83\x19\x5e\xb3\x53\xe2\x81\xdf"
+			  "\xe6\x15\xb3\xba\x0c\x0e\xba"
+			  "\xa9\x2c\xed\x51\xd5\x06\xc8\xc6"
+			  "\x4b\x9f\x5d\x1b\x61\x31\xad\xf4",
+		.clen	= 735,
 	}
 };
 
@@ -15030,6 +16129,68 @@ static const struct hash_testvec sm4_cmac128_tv_template[] = {
 	}
 };
 
+static const struct hash_testvec sm4_xcbc128_tv_template[] = {
+	{ /* Generated from AES-XCBC128 test vectors */
+		.key		= "\x00\x01\x02\x03\x04\x05\x06\x07"
+				  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+		.plaintext 	= zeroed_string,
+		.digest 	= "\xa9\x9a\x5c\x44\xe2\x34\xee\x2c"
+				  "\x9b\xe4\x9d\xca\x64\xb0\xa5\xc4",
+		.psize		= 0,
+		.ksize		= 16,
+	}, {
+		.key		= "\x00\x01\x02\x03\x04\x05\x06\x07"
+				  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+		.plaintext 	= "\x00\x01\x02",
+		.digest		= "\x17\x27\x62\xf3\x8b\x88\x1d\xc0"
+				  "\x97\x35\x9c\x3e\x9f\x27\xb7\x83",
+		.psize		= 3,
+		.ksize		= 16,
+	}, {
+		.key		= "\x00\x01\x02\x03\x04\x05\x06\x07"
+				  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+		.plaintext 	= "\x00\x01\x02\x03\x04\x05\x06\x07"
+				  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+		.digest 	= "\xda\x45\xd1\xac\xec\x4d\xab\x46"
+				  "\xdd\x59\xe0\x44\xff\x59\xd5\xfc",
+		.psize		= 16,
+		.ksize		= 16,
+	}, {
+		.key		= "\x00\x01\x02\x03\x04\x05\x06\x07"
+				  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+		.plaintext 	= "\x00\x01\x02\x03\x04\x05\x06\x07"
+				  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+				  "\x10\x11\x12\x13",
+		.digest 	= "\xbe\x24\x5d\x81\x8c\x8a\x10\xa4"
+				  "\x8e\xc2\x16\xfa\xa4\x83\xc9\x2a",
+		.psize		= 20,
+		.ksize		= 16,
+	}, {
+		.key		= "\x00\x01\x02\x03\x04\x05\x06\x07"
+				  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+		.plaintext 	= "\x00\x01\x02\x03\x04\x05\x06\x07"
+				  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+				  "\x10\x11\x12\x13\x14\x15\x16\x17"
+				  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f",
+		.digest 	= "\x91\x82\x31\x56\xd5\x77\xa4\xc5"
+				  "\x88\x2d\xce\x3a\x87\x5e\xbd\xba",
+		.psize		= 32,
+		.ksize		= 16,
+	}, {
+		.key		= "\x00\x01\x02\x03\x04\x05\x06\x07"
+				  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+		.plaintext 	= "\x00\x01\x02\x03\x04\x05\x06\x07"
+				  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+				  "\x10\x11\x12\x13\x14\x15\x16\x17"
+				  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+				  "\x20\x21",
+		.digest 	= "\x2a\xae\xa5\x24\x0c\x12\x9f\x5f"
+				  "\x55\xfb\xae\x35\x13\x0d\x22\x2d",
+		.psize		= 34,
+		.ksize		= 16,
+	}
+};
+
 /* Cast6 test vectors from RFC 2612 */
 static const struct cipher_testvec cast6_tv_template[] = {
 	{
-- 
2.24.3 (Apple Git-128)



* [PATCH 04/16] crypto: testmgr - add SM4 cts-cbc/essiv/xts/xcbc test vectors
@ 2022-09-26  9:36   ` Tianjia Zhang
From: Tianjia Zhang @ 2022-09-26  9:36 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
	Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
	Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
	linux-stm32

Add test vectors for the CTS-CBC, ESSIV, XTS and XCBC modes of the SM4
algorithm, along with additional test vectors for SM4 GCM and CCM.

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
 crypto/testmgr.c |   25 +
 crypto/testmgr.h | 1161 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 1186 insertions(+)

diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index e4bb03b8b924..cce101c7e8f9 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -4712,6 +4712,12 @@ static const struct alg_test_desc alg_test_descs[] = {
 		.alg = "cts(cbc(paes))",
 		.test = alg_test_null,
 		.fips_allowed = 1,
+	}, {
+		.alg = "cts(cbc(sm4))",
+		.test = alg_test_skcipher,
+		.suite = {
+			.cipher = __VECS(sm4_cts_tv_template)
+		}
 	}, {
 		.alg = "curve25519",
 		.test = alg_test_kpp,
@@ -5059,6 +5065,12 @@ static const struct alg_test_desc alg_test_descs[] = {
 			.cipher = __VECS(essiv_aes_cbc_tv_template)
 		}
 	}, {
+		.alg = "essiv(cbc(sm4),sm3)",
+		.test = alg_test_skcipher,
+		.suite = {
+			.cipher = __VECS(essiv_sm4_cbc_tv_template)
+		}
+	}, {
 #if IS_ENABLED(CONFIG_CRYPTO_DH_RFC7919_GROUPS)
 		.alg = "ffdhe2048(dh)",
 		.test = alg_test_kpp,
@@ -5586,6 +5598,12 @@ static const struct alg_test_desc alg_test_descs[] = {
 		.suite = {
 			.hash = __VECS(aes_xcbc128_tv_template)
 		}
+	}, {
+		.alg = "xcbc(sm4)",
+		.test = alg_test_hash,
+		.suite = {
+			.hash = __VECS(sm4_xcbc128_tv_template)
+		}
 	}, {
 		.alg = "xchacha12",
 		.test = alg_test_skcipher,
@@ -5640,6 +5658,13 @@ static const struct alg_test_desc alg_test_descs[] = {
 		.suite = {
 			.cipher = __VECS(serpent_xts_tv_template)
 		}
+	}, {
+		.alg = "xts(sm4)",
+		.generic_driver = "xts(ecb(sm4-generic))",
+		.test = alg_test_skcipher,
+		.suite = {
+			.cipher = __VECS(sm4_xts_tv_template)
+		}
 	}, {
 		.alg = "xts(twofish)",
 		.generic_driver = "xts(ecb(twofish-generic))",
diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index d6088e26f326..ced48e4dad0c 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -14882,6 +14882,537 @@ static const struct cipher_testvec sm4_cfb_tv_template[] = {
 	}
 };
 
+static const struct cipher_testvec sm4_cts_tv_template[] = {
+	/* Generated from AES-CTS test vectors */
+	{
+		.klen	= 16,
+		.key    = "\x63\x68\x69\x63\x6b\x65\x6e\x20"
+			  "\x74\x65\x72\x69\x79\x61\x6b\x69",
+		.ptext	= "\x49\x20\x77\x6f\x75\x6c\x64\x20"
+			  "\x6c\x69\x6b\x65\x20\x74\x68\x65"
+			  "\x20",
+		.len	= 17,
+		.ctext	= "\x05\xfe\x23\xee\x17\xa2\x89\x98"
+			  "\xbc\x97\x0a\x0b\x54\x67\xca\xd7"
+			  "\xd6",
+	}, {
+		.klen	= 16,
+		.key    = "\x63\x68\x69\x63\x6b\x65\x6e\x20"
+			  "\x74\x65\x72\x69\x79\x61\x6b\x69",
+		.ptext	= "\x49\x20\x77\x6f\x75\x6c\x64\x20"
+			  "\x6c\x69\x6b\x65\x20\x74\x68\x65"
+			  "\x20\x47\x65\x6e\x65\x72\x61\x6c"
+			  "\x20\x47\x61\x75\x27\x73\x20",
+		.len	= 31,
+		.ctext	= "\x15\x46\xe4\x95\xa4\xec\xf0\xb8"
+			  "\x49\xd6\x6a\x9d\x89\xc7\xfd\x70"
+			  "\xd6\x71\xc8\xc0\x4d\x52\x7c\x66"
+			  "\x93\xf7\x70\xbb\xa8\x3f\xa3",
+	}, {
+		.klen	= 16,
+		.key    = "\x63\x68\x69\x63\x6b\x65\x6e\x20"
+			  "\x74\x65\x72\x69\x79\x61\x6b\x69",
+		.ptext	= "\x49\x20\x77\x6f\x75\x6c\x64\x20"
+			  "\x6c\x69\x6b\x65\x20\x74\x68\x65"
+			  "\x20\x47\x65\x6e\x65\x72\x61\x6c"
+			  "\x20\x47\x61\x75\x27\x73\x20\x43",
+		.len	= 32,
+		.ctext	= "\x89\xc7\x99\x3f\x87\x69\x5c\xd3"
+			  "\x01\x6a\xbf\xd4\x3f\x79\x02\xa3"
+			  "\xd6\x71\xc8\xc0\x4d\x52\x7c\x66"
+			  "\x93\xf7\x70\xbb\xa8\x3f\xa3\xcf",
+	}, {
+		.klen	= 16,
+		.key    = "\x63\x68\x69\x63\x6b\x65\x6e\x20"
+			  "\x74\x65\x72\x69\x79\x61\x6b\x69",
+		.ptext	= "\x49\x20\x77\x6f\x75\x6c\x64\x20"
+			  "\x6c\x69\x6b\x65\x20\x74\x68\x65"
+			  "\x20\x47\x65\x6e\x65\x72\x61\x6c"
+			  "\x20\x47\x61\x75\x27\x73\x20\x43"
+			  "\x68\x69\x63\x6b\x65\x6e\x2c\x20"
+			  "\x70\x6c\x65\x61\x73\x65\x2c",
+		.len	= 47,
+		.ctext	= "\xd6\x71\xc8\xc0\x4d\x52\x7c\x66"
+			  "\x93\xf7\x70\xbb\xa8\x3f\xa3\xcf"
+			  "\xd3\xe1\xdc\xeb\xfa\x04\x11\x99"
+			  "\xde\xcf\x6f\x4d\x7b\x09\x92\x7f"
+			  "\x89\xc7\x99\x3f\x87\x69\x5c\xd3"
+			  "\x01\x6a\xbf\xd4\x3f\x79\x02",
+	}, {
+		.klen	= 16,
+		.key    = "\x63\x68\x69\x63\x6b\x65\x6e\x20"
+			  "\x74\x65\x72\x69\x79\x61\x6b\x69",
+		.ptext	= "\x49\x20\x77\x6f\x75\x6c\x64\x20"
+			  "\x6c\x69\x6b\x65\x20\x74\x68\x65"
+			  "\x20\x47\x65\x6e\x65\x72\x61\x6c"
+			  "\x20\x47\x61\x75\x27\x73\x20\x43"
+			  "\x68\x69\x63\x6b\x65\x6e\x2c\x20"
+			  "\x70\x6c\x65\x61\x73\x65\x2c\x20",
+		.len	= 48,
+		.ctext	= "\xd6\x71\xc8\xc0\x4d\x52\x7c\x66"
+			  "\x93\xf7\x70\xbb\xa8\x3f\xa3\xcf"
+			  "\x9a\xbd\x7b\xfe\x82\xab\xcc\x7f"
+			  "\xbd\x99\x21\x0c\x5e\x4d\xed\x20"
+			  "\x89\xc7\x99\x3f\x87\x69\x5c\xd3"
+			  "\x01\x6a\xbf\xd4\x3f\x79\x02\xa3",
+	}, {
+		.klen	= 16,
+		.key    = "\x63\x68\x69\x63\x6b\x65\x6e\x20"
+			  "\x74\x65\x72\x69\x79\x61\x6b\x69",
+		.ptext	= "\x49\x20\x77\x6f\x75\x6c\x64\x20"
+			  "\x6c\x69\x6b\x65\x20\x74\x68\x65"
+			  "\x20\x47\x65\x6e\x65\x72\x61\x6c"
+			  "\x20\x47\x61\x75\x27\x73\x20\x43"
+			  "\x68\x69\x63\x6b\x65\x6e\x2c\x20"
+			  "\x70\x6c\x65\x61\x73\x65\x2c\x20"
+			  "\x61\x6e\x64\x20\x77\x6f\x6e\x74"
+			  "\x6f\x6e\x20\x73\x6f\x75\x70\x2e",
+		.len	= 64,
+		.ctext	= "\xd6\x71\xc8\xc0\x4d\x52\x7c\x66"
+			  "\x93\xf7\x70\xbb\xa8\x3f\xa3\xcf"
+			  "\x89\xc7\x99\x3f\x87\x69\x5c\xd3"
+			  "\x01\x6a\xbf\xd4\x3f\x79\x02\xa3"
+			  "\x58\x19\xa4\x8f\xa9\x68\x5e\x6b"
+			  "\x2c\x0f\x81\x60\x15\x98\x27\x4f"
+			  "\x9a\xbd\x7b\xfe\x82\xab\xcc\x7f"
+			  "\xbd\x99\x21\x0c\x5e\x4d\xed\x20",
+	}
+};
+
+static const struct cipher_testvec essiv_sm4_cbc_tv_template[] = {
+	/* Generated from AES-ESSIV-CBC test vectors */
+	{
+		.key    = "\x06\xa9\x21\x40\x36\xb8\xa1\x5b"
+			  "\x51\x2e\x03\xd5\x34\x12\x00\x06",
+		.klen   = 16,
+		.iv	= "\x3d\xaf\xba\x42\x9d\x9e\xb4\x30"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.ptext	= "Single block msg",
+		.ctext	= "\x83\xa0\x79\x71\x18\xed\xb2\x0f"
+			  "\xa8\x71\x94\x22\x8e\x1f\xc1\xbb",
+		.len	= 16,
+	}, {
+		.key    = "\xc2\x86\x69\x6d\x88\x7c\x9a\xa0"
+			  "\x61\x1b\xbb\x3e\x20\x25\xa4\x5a",
+		.klen   = 16,
+		.iv     = "\x56\x2e\x17\x99\x6d\x09\x3d\x28"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.ptext	= "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17"
+			  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f",
+		.ctext	= "\x48\x38\xba\xa0\x09\xa2\xe1\x61"
+			  "\x94\xe5\xd2\x63\xe5\x04\x6c\x62"
+			  "\x93\x21\x95\xfb\x8c\xf4\x25\x19"
+			  "\xe0\x0f\x9c\xfa\x51\xfe\xe7\x32",
+		.len	= 32,
+	}, {
+		.key	= "\x1f\x35\x2c\x07\x3b\x61\x08\xd7"
+			  "\x2d\x98\x10\xa3\x09\x14\xdf\xf4",
+		.klen	= 16,
+		.iv	= "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.ptext	= "\x6b\xc1\xbe\xe2\x2e\x40\x9f\x96"
+			  "\xe9\x3d\x7e\x11\x73\x93\x17\x2a"
+			  "\xae\x2d\x8a\x57\x1e\x03\xac\x9c"
+			  "\x9e\xb7\x6f\xac\x45\xaf\x8e\x51"
+			  "\x30\xc8\x1c\x46\xa3\x5c\xe4\x11"
+			  "\xe5\xfb\xc1\x19\x1a\x0a\x52\xef"
+			  "\xf6\x9f\x24\x45\xdf\x4f\x9b\x17"
+			  "\xad\x2b\x41\x7b\xe6\x6c\x37\x10",
+		.ctext	= "\xa5\x1d\x64\x91\x28\x1f\xbe\x9e"
+			  "\x15\x39\x5f\xe4\xe1\x5a\x8c\x38"
+			  "\x80\x7f\xc7\x7d\x00\x4c\x4b\xff"
+			  "\x75\x3a\x03\xfe\x41\x75\x26\x9e"
+			  "\x3f\xf1\x36\xaf\x7b\x37\x73\x1a"
+			  "\xaf\x9b\x91\xec\x1e\xf0\x05\x9d"
+			  "\x87\xda\x7b\xa3\xaa\xe6\x5b\x98"
+			  "\x41\x73\xd5\x3c\x8c\x8b\xb5\x88",
+		.len	= 64,
+	}, {
+		.key	= "\xBE\xE1\x04\x27\xE1\x04\x27\x4A"
+			  "\x6D\x90\x4A\x6D\x90\xB3\xD6\xF9",
+		.klen	= 16,
+		.iv	= "\xE7\x82\x1D\xB8\x53\x11\xAC\x47"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.ptext	= "\x50\xB9\x22\xAE\x17\x80\x0C\x75"
+			  "\xDE\x47\xD3\x3C\xA5\x0E\x9A\x03"
+			  "\x6C\xF8\x61\xCA\x33\xBF\x28\x91"
+			  "\x1D\x86\xEF\x58\xE4\x4D\xB6\x1F"
+			  "\xAB\x14\x7D\x09\x72\xDB\x44\xD0"
+			  "\x39\xA2\x0B\x97\x00\x69\xF5\x5E"
+			  "\xC7\x30\xBC\x25\x8E\x1A\x83\xEC"
+			  "\x55\xE1\x4A\xB3\x1C\xA8\x11\x7A"
+			  "\x06\x6F\xD8\x41\xCD\x36\x9F\x08"
+			  "\x94\xFD\x66\xF2\x5B\xC4\x2D\xB9"
+			  "\x22\x8B\x17\x80\xE9\x52\xDE\x47"
+			  "\xB0\x19\xA5\x0E\x77\x03\x6C\xD5"
+			  "\x3E\xCA\x33\x9C\x05\x91\xFA\x63"
+			  "\xEF\x58\xC1\x2A\xB6\x1F\x88\x14"
+			  "\x7D\xE6\x4F\xDB\x44\xAD\x16\xA2"
+			  "\x0B\x74\x00\x69\xD2\x3B\xC7\x30"
+			  "\x99\x02\x8E\xF7\x60\xEC\x55\xBE"
+			  "\x27\xB3\x1C\x85\x11\x7A\xE3\x4C"
+			  "\xD8\x41\xAA\x13\x9F\x08\x71\xFD"
+			  "\x66\xCF\x38\xC4\x2D\x96\x22\x8B"
+			  "\xF4\x5D\xE9\x52\xBB\x24\xB0\x19"
+			  "\x82\x0E\x77\xE0\x49\xD5\x3E\xA7"
+			  "\x10\x9C\x05\x6E\xFA\x63\xCC\x35"
+			  "\xC1\x2A\x93\x1F\x88\xF1\x5A\xE6"
+			  "\x4F\xB8\x21\xAD\x16\x7F\x0B\x74"
+			  "\xDD\x46\xD2\x3B\xA4\x0D\x99\x02"
+			  "\x6B\xF7\x60\xC9\x32\xBE\x27\x90"
+			  "\x1C\x85\xEE\x57\xE3\x4C\xB5\x1E"
+			  "\xAA\x13\x7C\x08\x71\xDA\x43\xCF"
+			  "\x38\xA1\x0A\x96\xFF\x68\xF4\x5D"
+			  "\xC6\x2F\xBB\x24\x8D\x19\x82\xEB"
+			  "\x54\xE0\x49\xB2\x1B\xA7\x10\x79"
+			  "\x05\x6E\xD7\x40\xCC\x35\x9E\x07"
+			  "\x93\xFC\x65\xF1\x5A\xC3\x2C\xB8"
+			  "\x21\x8A\x16\x7F\xE8\x51\xDD\x46"
+			  "\xAF\x18\xA4\x0D\x76\x02\x6B\xD4"
+			  "\x3D\xC9\x32\x9B\x04\x90\xF9\x62"
+			  "\xEE\x57\xC0\x29\xB5\x1E\x87\x13"
+			  "\x7C\xE5\x4E\xDA\x43\xAC\x15\xA1"
+			  "\x0A\x73\xFF\x68\xD1\x3A\xC6\x2F"
+			  "\x98\x01\x8D\xF6\x5F\xEB\x54\xBD"
+			  "\x26\xB2\x1B\x84\x10\x79\xE2\x4B"
+			  "\xD7\x40\xA9\x12\x9E\x07\x70\xFC"
+			  "\x65\xCE\x37\xC3\x2C\x95\x21\x8A"
+			  "\xF3\x5C\xE8\x51\xBA\x23\xAF\x18"
+			  "\x81\x0D\x76\xDF\x48\xD4\x3D\xA6"
+			  "\x0F\x9B\x04\x6D\xF9\x62\xCB\x34"
+			  "\xC0\x29\x92\x1E\x87\xF0\x59\xE5"
+			  "\x4E\xB7\x20\xAC\x15\x7E\x0A\x73"
+			  "\xDC\x45\xD1\x3A\xA3\x0C\x98\x01"
+			  "\x6A\xF6\x5F\xC8\x31\xBD\x26\x8F"
+			  "\x1B\x84\xED\x56\xE2\x4B\xB4\x1D"
+			  "\xA9\x12\x7B\x07\x70\xD9\x42\xCE"
+			  "\x37\xA0\x09\x95\xFE\x67\xF3\x5C"
+			  "\xC5\x2E\xBA\x23\x8C\x18\x81\xEA"
+			  "\x53\xDF\x48\xB1\x1A\xA6\x0F\x78"
+			  "\x04\x6D\xD6\x3F\xCB\x34\x9D\x06"
+			  "\x92\xFB\x64\xF0\x59\xC2\x2B\xB7"
+			  "\x20\x89\x15\x7E\xE7\x50\xDC\x45"
+			  "\xAE\x17\xA3\x0C\x75\x01\x6A\xD3"
+			  "\x3C\xC8\x31\x9A\x03\x8F\xF8\x61"
+			  "\xED\x56\xBF\x28\xB4\x1D\x86\x12",
+		.ctext	= "\xad\x68\x40\x68\xb2\xf9\x77\x55"
+			  "\xd5\x1c\x17\x46\xc1\xfa\x05\xdd"
+			  "\x94\x5c\xb7\x99\x82\xba\x05\x48"
+			  "\xac\x5d\x14\x30\x2e\xc8\x0e\x2f"
+			  "\x5a\xd7\x39\x43\x95\x4d\x93\xff"
+			  "\x6b\xe3\xb7\x71\xc1\x39\x43\x8d"
+			  "\x10\xd7\xd9\xa8\xe7\x65\xb7\x0a"
+			  "\x27\x98\x5b\x90\xc3\x80\x1f\xd9"
+			  "\x65\x82\x88\x0a\xc3\x16\x3f\xae"
+			  "\x1f\xad\x88\xe9\xfb\x9e\xd4\xc8"
+			  "\x81\x36\x50\x37\x1f\x11\x83\xe2"
+			  "\xc5\x1a\x48\xdb\xc3\x18\x07\x5d"
+			  "\xee\x4b\xea\x40\xd3\xd9\x8c\x59"
+			  "\x29\xe1\x0b\x79\x3b\x28\xac\x75"
+			  "\xda\x82\x99\x86\xd4\xbe\xd8\x81"
+			  "\xe0\xc4\x58\x78\xe4\x33\xc1\xf1"
+			  "\xbe\x96\xd3\x4c\x42\x6b\xaf\x24"
+			  "\x69\xb4\x25\x88\x37\x9e\xb2\xfb"
+			  "\x5c\x93\x22\x89\x2f\x81\x85\x06"
+			  "\x12\x74\x3b\x6c\x99\x81\xfb\xbe"
+			  "\x0f\xc4\xa5\xb6\xf8\x79\x5f\x72"
+			  "\xf8\x46\x94\x3f\x1f\x9f\x15\xa2"
+			  "\xc8\xc0\xbf\xeb\xa3\x9e\x59\xe1"
+			  "\xbd\x1a\xe1\xe3\x6b\x33\x96\x54"
+			  "\x1b\xc4\x25\x74\x06\xcf\x8a\x75"
+			  "\x6c\xfc\x76\x7f\x9e\x7b\x00\xce"
+			  "\xa8\x1e\x6a\x0f\x5a\xa6\xcb\x77"
+			  "\x5f\x90\x39\xcb\xfe\x0e\x16\x53"
+			  "\x8e\x21\x0f\x7e\x51\xcc\x92\xb8"
+			  "\x4f\x65\x76\x20\x3d\x56\xb4\xcc"
+			  "\x8b\x8e\x8e\x68\xc3\x82\x53\x5c"
+			  "\x1c\x82\x13\x32\x3b\x97\xff\x48"
+			  "\x98\xda\x4a\x7c\xc8\x21\x83\xfd"
+			  "\xe2\xf1\x30\xe1\x11\xe9\xe8\x97"
+			  "\x97\x24\x06\x73\xf2\x52\xbb\xab"
+			  "\x9d\x5f\x0b\xa8\x2f\xab\x0b\x7d"
+			  "\xe8\x20\x7b\x67\x2e\x93\xb5\x11"
+			  "\x6c\x16\xea\xdd\x1a\x9d\xf2\xdc"
+			  "\x79\x57\xc4\x04\xcb\x7f\x36\xa0"
+			  "\x2e\xa7\x89\xab\xaa\x56\x59\x9e"
+			  "\xec\x38\xea\x1a\xe9\xa7\x58\x58"
+			  "\xb5\xb7\x8f\x8c\x5c\xd6\x86\x67"
+			  "\x65\x0f\x93\x47\xf7\x3e\x19\x19"
+			  "\x9b\x22\xd1\xc6\xc2\xba\x32\x5c"
+			  "\x2c\x7a\xa2\xbb\xa5\x22\xde\xe5"
+			  "\x1e\x78\x2c\xd3\x40\x6d\xfa\x79"
+			  "\x4c\x9e\x1c\x36\x34\xaf\x95\x2e"
+			  "\x68\x2e\x69\x7d\xe4\x7d\x0c\x74"
+			  "\xaf\x73\x5b\x48\x62\x90\x5e\x19"
+			  "\x0f\x12\xb3\xdb\x77\xbb\xe2\xac"
+			  "\xaf\xfe\xd9\xa1\x80\x09\xc6\xd4"
+			  "\xf4\x21\x3f\xa4\x0f\x16\x7b\x36"
+			  "\x29\x6d\x10\xa2\xba\xaf\xf5\xa3"
+			  "\x51\xca\x0a\x25\x74\x9a\xb7\x02"
+			  "\xb8\xf8\x6b\xda\xb8\x1c\x9f\x62"
+			  "\xf5\x61\x62\x9f\x4b\x71\x24\x45"
+			  "\xfb\x0f\xdf\xa8\x47\x6f\x2f\x05"
+			  "\x2f\xf4\xfd\xb8\xd1\x8c\x29\x9d"
+			  "\x9d\xe8\x6f\x10\x89\xef\x08\x59"
+			  "\xa0\x24\x1f\xdb\xea\xbc\x97\x44"
+			  "\x23\x74\xbf\xaa\x87\x10\x5c\x58"
+			  "\x2a\xe6\xe2\x19\xc5\x7e\x21\xe2",
+		.len	= 496,
+	},
+};
+
+static const struct cipher_testvec sm4_xts_tv_template[] = {
+	/* Generated from AES-XTS test vectors */
+	{
+		.key	= "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.klen	= 32,
+		.iv	= "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.ptext	= "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.ctext	= "\xd9\xb4\x21\xf7\x31\xc8\x94\xfd"
+			  "\xc3\x5b\x77\x29\x1f\xe4\xe3\xb0"
+			  "\x2a\x1f\xb7\x66\x98\xd5\x9f\x0e"
+			  "\x51\x37\x6c\x4a\xda\x5b\xc7\x5d",
+		.len	= 32,
+	}, {
+		.key	= "\x11\x11\x11\x11\x11\x11\x11\x11"
+			  "\x11\x11\x11\x11\x11\x11\x11\x11"
+			  "\x22\x22\x22\x22\x22\x22\x22\x22"
+			  "\x22\x22\x22\x22\x22\x22\x22\x22",
+		.klen	= 32,
+		.iv	= "\x33\x33\x33\x33\x33\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.ptext	= "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44",
+		.ctext	= "\xa7\x4d\x72\x6c\x11\x19\x6a\x32"
+			  "\xbe\x04\xe0\x01\xff\x29\xd0\xc7"
+			  "\x93\x2f\x9f\x3e\xc2\x9b\xfc\xb6"
+			  "\x4d\xd1\x7f\x63\xcb\xd3\xea\x31",
+		.len	= 32,
+	}, {
+		.key	= "\xff\xfe\xfd\xfc\xfb\xfa\xf9\xf8"
+			  "\xf7\xf6\xf5\xf4\xf3\xf2\xf1\xf0"
+			  "\x22\x22\x22\x22\x22\x22\x22\x22"
+			  "\x22\x22\x22\x22\x22\x22\x22\x22",
+		.klen	= 32,
+		.iv	= "\x33\x33\x33\x33\x33\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.ptext	= "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44"
+			  "\x44\x44\x44\x44\x44\x44\x44\x44",
+		.ctext	= "\x7f\x76\x08\x8e\xff\xad\xf7\x0c"
+			  "\x02\xea\x9f\x95\xda\x06\x28\xd3"
+			  "\x51\xbf\xcb\x9e\xac\x05\x63\xbc"
+			  "\xf1\x7b\x71\x0d\xab\x0a\x98\x26",
+		.len	= 32,
+	}, {
+		.key	= "\x27\x18\x28\x18\x28\x45\x90\x45"
+			  "\x23\x53\x60\x28\x74\x71\x35\x26"
+			  "\x31\x41\x59\x26\x53\x58\x97\x93"
+			  "\x23\x84\x62\x64\x33\x83\x27\x95",
+		.klen	= 32,
+		.iv	= "\x00\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.ptext	= "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17"
+			  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+			  "\x20\x21\x22\x23\x24\x25\x26\x27"
+			  "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+			  "\x30\x31\x32\x33\x34\x35\x36\x37"
+			  "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+			  "\x40\x41\x42\x43\x44\x45\x46\x47"
+			  "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+			  "\x50\x51\x52\x53\x54\x55\x56\x57"
+			  "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+			  "\x60\x61\x62\x63\x64\x65\x66\x67"
+			  "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+			  "\x70\x71\x72\x73\x74\x75\x76\x77"
+			  "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+			  "\x80\x81\x82\x83\x84\x85\x86\x87"
+			  "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+			  "\x90\x91\x92\x93\x94\x95\x96\x97"
+			  "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+			  "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+			  "\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+			  "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+			  "\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf"
+			  "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+			  "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf"
+			  "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7"
+			  "\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf"
+			  "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7"
+			  "\xe8\xe9\xea\xeb\xec\xed\xee\xef"
+			  "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7"
+			  "\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff"
+			  "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17"
+			  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+			  "\x20\x21\x22\x23\x24\x25\x26\x27"
+			  "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+			  "\x30\x31\x32\x33\x34\x35\x36\x37"
+			  "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+			  "\x40\x41\x42\x43\x44\x45\x46\x47"
+			  "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+			  "\x50\x51\x52\x53\x54\x55\x56\x57"
+			  "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+			  "\x60\x61\x62\x63\x64\x65\x66\x67"
+			  "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+			  "\x70\x71\x72\x73\x74\x75\x76\x77"
+			  "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+			  "\x80\x81\x82\x83\x84\x85\x86\x87"
+			  "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+			  "\x90\x91\x92\x93\x94\x95\x96\x97"
+			  "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+			  "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+			  "\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+			  "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+			  "\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf"
+			  "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+			  "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf"
+			  "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7"
+			  "\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf"
+			  "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7"
+			  "\xe8\xe9\xea\xeb\xec\xed\xee\xef"
+			  "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7"
+			  "\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff",
+		.ctext	= "\x54\xdd\x65\xb6\x32\x6f\xae\xa8"
+			  "\xfa\xd1\xa8\x3c\x63\x61\x4a\xf3"
+			  "\x9f\x72\x1d\x8d\xfe\x17\x7a\x30"
+			  "\xb6\x6a\xbf\x6a\x44\x99\x80\xe1"
+			  "\xcd\xbe\x06\xaf\xb7\x33\x36\xf3"
+			  "\x7a\x4d\x39\xde\x96\x4a\x30\xd7"
+			  "\xd0\x4a\x37\x99\x16\x9c\x60\x25"
+			  "\x8f\x6b\x74\x8a\x61\x86\x1a\xa5"
+			  "\xec\x92\xa2\xc1\x5b\x2b\x7c\x61"
+			  "\x5a\x42\xab\xa4\x99\xbb\xd6\xb7"
+			  "\x1d\xb9\xc7\x89\xb2\x18\x20\x89"
+			  "\xa2\x5d\xd3\xdf\x80\x0e\xd1\x86"
+			  "\x4d\x19\xf7\xed\x45\xfd\x17\xa9"
+			  "\x48\x0b\x0f\xb8\x2d\x9b\x7f\xc3"
+			  "\xed\x57\xe9\xa1\x14\x0e\xaa\x77"
+			  "\x8d\xd2\xdd\x67\x9e\x3e\xdc\x3d"
+			  "\xc4\xd5\x5c\x95\x0e\xbc\x53\x1d"
+			  "\x95\x92\xf7\xc4\x63\x82\x56\xd5"
+			  "\x65\x18\x29\x2a\x20\xaf\x98\xfd"
+			  "\xd3\xa6\x36\x00\x35\x0a\x70\xab"
+			  "\x5a\x40\xf4\xc2\x85\x03\x7c\xa0"
+			  "\x1f\x25\x1f\x19\xec\xae\x03\x29"
+			  "\xff\x77\xad\x88\xcd\x5a\x4c\xde"
+			  "\xa2\xae\xab\xc2\x21\x48\xff\xbd"
+			  "\x23\x9b\xd1\x05\x15\xbd\xe1\x13"
+			  "\x1d\xec\x84\x04\xe4\x43\xdc\x76"
+			  "\x31\x40\xd5\xf2\x2b\xf3\x3e\x0c"
+			  "\x68\x72\xd6\xb8\x1d\x63\x0f\x6f"
+			  "\x00\xcd\xd0\x58\xfe\x80\xf9\xcb"
+			  "\xfb\x77\x70\x7f\x93\xce\xe2\xca"
+			  "\x92\xb9\x15\xb8\x30\x40\x27\xc1"
+			  "\x90\xa8\x4e\x2d\x65\xe0\x18\xcc"
+			  "\x6a\x38\x7d\x37\x66\xac\xdb\x28"
+			  "\x25\x32\x84\xe8\xdb\x9a\xcf\x8f"
+			  "\x52\x28\x0d\xdc\x6d\x00\x33\xd2"
+			  "\xcc\xaa\xa4\xf9\xae\xff\x12\x36"
+			  "\x69\xbc\x02\x4f\xd6\x76\x8e\xdf"
+			  "\x8b\xc1\xf8\xd6\x22\xc1\x9c\x60"
+			  "\x9e\xf9\x7f\x60\x91\x90\xcd\x11"
+			  "\x02\x41\xe7\xfb\x08\x4e\xd8\x94"
+			  "\x2d\xa1\xf9\xb9\xcf\x1b\x51\x4b"
+			  "\x61\xa3\x88\xb3\x0e\xa6\x1a\x4a"
+			  "\x74\x5b\x38\x1e\xe7\xad\x6c\x4d"
+			  "\xb1\x27\x54\x53\xb8\x41\x3f\x98"
+			  "\xdf\x6e\x4a\x40\x98\x6e\xe4\xb5"
+			  "\x9a\xf5\xdf\xae\xcd\x30\x12\x65"
+			  "\x17\x90\x67\xa0\x0d\x7c\xa3\x5a"
+			  "\xb9\x5a\xbd\x61\x7a\xde\xa2\x8e"
+			  "\xc1\xc2\x6a\x97\xde\x28\xb8\xbf"
+			  "\xe3\x01\x20\xd6\xae\xfb\xd2\x58"
+			  "\xc5\x9e\x42\xd1\x61\xe8\x06\x5a"
+			  "\x78\x10\x6b\xdc\xa5\xcd\x90\xfb"
+			  "\x3a\xac\x4e\x93\x86\x6c\x8a\x7f"
+			  "\x96\x76\x86\x0a\x79\x14\x5b\xd9"
+			  "\x2e\x02\xe8\x19\xa9\x0b\xe0\xb9"
+			  "\x7c\xc5\x22\xb3\x21\x06\x85\x6f"
+			  "\xdf\x0e\x54\xd8\x8e\x46\x24\x15"
+			  "\x5a\x2f\x1c\x14\xea\xea\xa1\x63"
+			  "\xf8\x58\xe9\x9a\x80\x6e\x79\x1a"
+			  "\xcd\x82\xf1\xb0\xe2\x9f\x00\x28"
+			  "\xa4\xc3\x8e\x97\x6f\x57\x1a\x93"
+			  "\xf4\xfd\x57\xd7\x87\xc2\x4d\xb0"
+			  "\xe0\x1c\xa3\x04\xe5\xa5\xc4\xdd"
+			  "\x50\xcf\x8b\xdb\xf4\x91\xe5\x7c",
+		.len	= 512,
+	}, {
+		.key	= "\x62\x49\x77\x57\x24\x70\x93\x69"
+			  "\x99\x59\x57\x49\x66\x96\x76\x27"
+			  "\x02\x88\x41\x97\x16\x93\x99\x37"
+			  "\x51\x05\x82\x09\x74\x94\x45\x92",
+		.klen	= 32,
+		.iv	= "\xff\x00\x00\x00\x00\x00\x00\x00"
+			  "\x00\x00\x00\x00\x00\x00\x00\x00",
+		.ptext	= "\x00\x01\x02\x03\x04\x05\x06\x07"
+			  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17"
+			  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+			  "\x20\x21\x22\x23\x24\x25\x26\x27"
+			  "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f"
+			  "\x30\x31\x32\x33\x34\x35\x36\x37"
+			  "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f"
+			  "\x40\x41\x42\x43\x44\x45\x46\x47"
+			  "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+			  "\x50\x51\x52\x53\x54\x55\x56\x57"
+			  "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f"
+			  "\x60\x61\x62\x63\x64\x65\x66\x67"
+			  "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f"
+			  "\x70\x71\x72\x73\x74\x75\x76\x77"
+			  "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
+			  "\x80\x81\x82\x83\x84\x85\x86\x87"
+			  "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+			  "\x90\x91\x92\x93\x94\x95\x96\x97"
+			  "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f"
+			  "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7"
+			  "\xa8\xa9\xaa\xab\xac\xad\xae\xaf"
+			  "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7"
+			  "\xf8\xf9\xfa\xfb\xfc",
+		.ctext	= "\xa2\x9f\x9e\x4e\x71\xdb\x28\x3c"
+			  "\x80\x0e\xf6\xb7\x8e\x57\x1c\xba"
+			  "\x90\xda\x3b\x6c\x22\x00\x68\x30"
+			  "\x1d\x63\x0d\x9e\x6a\xad\x37\x55"
+			  "\xbc\x77\x1e\xc9\xad\x83\x30\xd5"
+			  "\x27\xb2\x66\x77\x18\x3c\xa6\x39"
+			  "\x9c\x0a\xaa\x1f\x02\xe1\xd5\x65"
+			  "\x9b\x8d\xc5\x97\x3d\xc5\x04\x53"
+			  "\x78\x00\xe3\xb0\x1a\x43\x4e\xb7"
+			  "\xc4\x9f\x38\xc5\x7b\xa4\x70\x64"
+			  "\x78\xe6\x32\xd9\x65\x44\xc5\x64"
+			  "\xb8\x42\x35\x99\xff\x66\x75\xb0"
+			  "\x22\xd3\x9b\x6e\x8d\xcf\x6a\x24"
+			  "\xfd\x92\xb7\x1b\x04\x28\x2a\x61"
+			  "\xdc\x96\x2a\x20\x7a\x2c\xf1\xf9"
+			  "\x12\x15\xf0\x4d\xcf\x2b\xde\x33"
+			  "\x41\xbc\xe7\x85\x87\x22\xb7\x16"
+			  "\x02\x1c\xd8\xa2\x0f\x1f\xa3\xe9"
+			  "\xd8\x45\x48\xe7\xbe\x08\x4e\x4e"
+			  "\x23\x79\x84\xdb\x40\x76\xf5\x13"
+			  "\x78\x92\x4a\x2f\xf9\x1b\xf2\x80"
+			  "\x25\x74\x51\x45\x9a\x77\x78\x97"
+			  "\xd3\xe0\xc7\xc4\x35\x67\x2a\xe6"
+			  "\xb3\x0d\x62\x9f\x8b",
+		.len	= 189,
+	},
+};
+
 static const struct aead_testvec sm4_gcm_tv_template[] = {
 	{ /* From https://datatracker.ietf.org/doc/html/rfc8998#appendix-A.1 */
 		.key	= "\x01\x23\x45\x67\x89\xAB\xCD\xEF"
@@ -14913,6 +15444,298 @@ static const struct aead_testvec sm4_gcm_tv_template[] = {
 			  "\x83\xDE\x35\x41\xE4\xC2\xB5\x81"
 			  "\x77\xE0\x65\xA9\xBF\x7B\x62\xEC",
 		.clen	= 80,
+	}, { /* Generated from AES-GCM test vectors */
+		.key	= zeroed_string,
+		.klen	= 16,
+		.ctext	= "\x23\x2f\x0c\xfe\x30\x8b\x49\xea"
+			  "\x6f\xc8\x82\x29\xb5\xdc\x85\x8d",
+		.clen	= 16,
+	}, {
+		.key	= zeroed_string,
+		.klen	= 16,
+		.ptext	= zeroed_string,
+		.plen	= 16,
+		.ctext	= "\x7d\xe2\xaa\x7f\x11\x10\x18\x82"
+			  "\x18\x06\x3b\xe1\xbf\xeb\x6d\x89"
+			  "\xb8\x51\xb5\xf3\x94\x93\x75\x2b"
+			  "\xe5\x08\xf1\xbb\x44\x82\xc5\x57",
+		.clen	= 32,
+	}, {
+		.key	= "\xfe\xff\xe9\x92\x86\x65\x73\x1c"
+			  "\x6d\x6a\x8f\x94\x67\x30\x83\x08",
+		.klen	= 16,
+		.iv	= "\xca\xfe\xba\xbe\xfa\xce\xdb\xad"
+			  "\xde\xca\xf8\x88",
+		.ptext	= "\xd9\x31\x32\x25\xf8\x84\x06\xe5"
+			  "\xa5\x59\x09\xc5\xaf\xf5\x26\x9a"
+			  "\x86\xa7\xa9\x53\x15\x34\xf7\xda"
+			  "\x2e\x4c\x30\x3d\x8a\x31\x8a\x72"
+			  "\x1c\x3c\x0c\x95\x95\x68\x09\x53"
+			  "\x2f\xcf\x0e\x24\x49\xa6\xb5\x25"
+			  "\xb1\x6a\xed\xf5\xaa\x0d\xe6\x57"
+			  "\xba\x63\x7b\x39\x1a\xaf\xd2\x55",
+		.plen	= 64,
+		.ctext	= "\xe4\x11\x0f\xf1\xc1\x41\x97\xe6"
+			  "\x76\x21\x6a\x33\x83\x10\x41\xeb"
+			  "\x09\x58\x00\x11\x7b\xdc\x3f\x75"
+			  "\x1a\x49\x6e\xfc\xf2\xbb\xdf\xdb"
+			  "\x3a\x2e\x13\xfd\xc5\xc1\x9d\x07"
+			  "\x1a\xe5\x48\x3f\xed\xde\x98\x5d"
+			  "\x3f\x2d\x5b\x4e\xee\x0b\xb6\xdf"
+			  "\xe3\x63\x36\x83\x23\xf7\x5b\x80"
+			  "\x7d\xfe\x77\xef\x71\xb1\x5e\xc9"
+			  "\x52\x6b\x09\xab\x84\x28\x4b\x8a",
+		.clen	= 80,
+	}, {
+		.key	= "\xfe\xff\xe9\x92\x86\x65\x73\x1c"
+			  "\x6d\x6a\x8f\x94\x67\x30\x83\x08",
+		.klen	= 16,
+		.iv	= "\xca\xfe\xba\xbe\xfa\xce\xdb\xad"
+			  "\xde\xca\xf8\x88",
+		.ptext	= "\xd9\x31\x32\x25\xf8\x84\x06\xe5"
+			  "\xa5\x59\x09\xc5\xaf\xf5\x26\x9a"
+			  "\x86\xa7\xa9\x53\x15\x34\xf7\xda"
+			  "\x2e\x4c\x30\x3d\x8a\x31\x8a\x72"
+			  "\x1c\x3c\x0c\x95\x95\x68\x09\x53"
+			  "\x2f\xcf\x0e\x24\x49\xa6\xb5\x25"
+			  "\xb1\x6a\xed\xf5\xaa\x0d\xe6\x57"
+			  "\xba\x63\x7b\x39",
+		.plen	= 60,
+		.assoc	= "\xfe\xed\xfa\xce\xde\xad\xbe\xef"
+			  "\xfe\xed\xfa\xce\xde\xad\xbe\xef"
+			  "\xab\xad\xda\xd2",
+		.alen	= 20,
+		.ctext	= "\xe4\x11\x0f\xf1\xc1\x41\x97\xe6"
+			  "\x76\x21\x6a\x33\x83\x10\x41\xeb"
+			  "\x09\x58\x00\x11\x7b\xdc\x3f\x75"
+			  "\x1a\x49\x6e\xfc\xf2\xbb\xdf\xdb"
+			  "\x3a\x2e\x13\xfd\xc5\xc1\x9d\x07"
+			  "\x1a\xe5\x48\x3f\xed\xde\x98\x5d"
+			  "\x3f\x2d\x5b\x4e\xee\x0b\xb6\xdf"
+			  "\xe3\x63\x36\x83"
+			  "\x89\xf6\xba\x35\xb8\x18\xd3\xcc"
+			  "\x38\x6c\x05\xb3\x8a\xcb\xc9\xde",
+		.clen	= 76,
+	}, {
+		.key	= "\xfe\xff\xe9\x92\x86\x65\x73\x1c"
+			  "\xfe\xff\xe9\x92\x86\x65\x73\x1c",
+		.klen	= 16,
+		.iv	= "\xca\xfe\xba\xbe\xfa\xce\xdb\xad"
+			  "\xde\xca\xf8\x88",
+		.ptext	= "\xd9\x31\x32\x25\xf8\x84\x06\xe5"
+			  "\xa5\x59\x09\xc5\xaf\xf5\x26\x9a"
+			  "\x86\xa7\xa9\x53\x15\x34\xf7\xda"
+			  "\x2e\x4c\x30\x3d\x8a\x31\x8a\x72"
+			  "\x1c\x3c\x0c\x95\x95\x68\x09\x53"
+			  "\x2f\xcf\x0e\x24\x49\xa6\xb5\x25"
+			  "\xb1\x6a\xed\xf5\xaa\x0d\xe6\x57"
+			  "\xba\x63\x7b\x39",
+		.plen	= 60,
+		.assoc	= "\xfe\xed\xfa\xce\xde\xad\xbe\xef"
+			  "\xfe\xed\xfa\xce\xde\xad\xbe\xef"
+			  "\xab\xad\xda\xd2",
+		.alen	= 20,
+		.ctext	= "\xc1\x11\x44\x51\xd9\x25\x87\x5b"
+			  "\x0f\xd9\x06\xf3\x33\x44\xbb\x87"
+			  "\x8b\xa3\x77\xd2\x0c\x60\xfa\xcc"
+			  "\x85\x50\x6f\x96\x0c\x54\x54\xc1"
+			  "\x58\x04\x88\x6e\xf4\x26\x35\x7e"
+			  "\x94\x80\x48\x6c\xf2\xf4\x88\x1f"
+			  "\x19\x63\xea\xae\xba\x81\x1a\x5d"
+			  "\x0e\x6f\x59\x08"
+			  "\x33\xac\x5b\xa8\x19\x60\xdb\x1d"
+			  "\xdd\x2e\x22\x2e\xe0\x87\x51\x5d",
+		.clen	= 76,
+	}, {
+		.key	= "\x8b\x32\xcf\xe7\x44\xed\x13\x59"
+			  "\x04\x38\x77\xb0\xb9\xad\xb4\x38",
+		.klen	= 16,
+		.iv	= "\x00\xff\xff\xff\xff\x00\x00\xff"
+			  "\xff\xff\x00\xff",
+		.ptext	= "\x42\xc1\xcc\x08\x48\x6f\x41\x3f"
+			  "\x2f\x11\x66\x8b\x2a\x16\xf0\xe0"
+			  "\x58\x83\xf0\xc3\x70\x14\xc0\x5b"
+			  "\x3f\xec\x1d\x25\x3c\x51\xd2\x03"
+			  "\xcf\x59\x74\x1f\xb2\x85\xb4\x07"
+			  "\xc6\x6a\x63\x39\x8a\x5b\xde\xcb"
+			  "\xaf\x08\x44\xbd\x6f\x91\x15\xe1"
+			  "\xf5\x7a\x6e\x18\xbd\xdd\x61\x50"
+			  "\x59\xa9\x97\xab\xbb\x0e\x74\x5c"
+			  "\x00\xa4\x43\x54\x04\x54\x9b\x3b"
+			  "\x77\xec\xfd\x5c\xa6\xe8\x7b\x08"
+			  "\xae\xe6\x10\x3f\x32\x65\xd1\xfc"
+			  "\xa4\x1d\x2c\x31\xfb\x33\x7a\xb3"
+			  "\x35\x23\xf4\x20\x41\xd4\xad\x82"
+			  "\x8b\xa4\xad\x96\x1c\x20\x53\xbe"
+			  "\x0e\xa6\xf4\xdc\x78\x49\x3e\x72"
+			  "\xb1\xa9\xb5\x83\xcb\x08\x54\xb7"
+			  "\xad\x49\x3a\xae\x98\xce\xa6\x66"
+			  "\x10\x30\x90\x8c\x55\x83\xd7\x7c"
+			  "\x8b\xe6\x53\xde\xd2\x6e\x18\x21"
+			  "\x01\x52\xd1\x9f\x9d\xbb\x9c\x73"
+			  "\x57\xcc\x89\x09\x75\x9b\x78\x70"
+			  "\xed\x26\x97\x4d\xb4\xe4\x0c\xa5"
+			  "\xfa\x70\x04\x70\xc6\x96\x1c\x7d"
+			  "\x54\x41\x77\xa8\xe3\xb0\x7e\x96"
+			  "\x82\xd9\xec\xa2\x87\x68\x55\xf9"
+			  "\x8f\x9e\x73\x43\x47\x6a\x08\x36"
+			  "\x93\x67\xa8\x2d\xde\xac\x41\xa9"
+			  "\x5c\x4d\x73\x97\x0f\x70\x68\xfa"
+			  "\x56\x4d\x00\xc2\x3b\x1f\xc8\xb9"
+			  "\x78\x1f\x51\x07\xe3\x9a\x13\x4e"
+			  "\xed\x2b\x2e\xa3\xf7\x44\xb2\xe7"
+			  "\xab\x19\x37\xd9\xba\x76\x5e\xd2"
+			  "\xf2\x53\x15\x17\x4c\x6b\x16\x9f"
+			  "\x02\x66\x49\xca\x7c\x91\x05\xf2"
+			  "\x45\x36\x1e\xf5\x77\xad\x1f\x46"
+			  "\xa8\x13\xfb\x63\xb6\x08\x99\x63"
+			  "\x82\xa2\xed\xb3\xac\xdf\x43\x19"
+			  "\x45\xea\x78\x73\xd9\xb7\x39\x11"
+			  "\xa3\x13\x7c\xf8\x3f\xf7\xad\x81"
+			  "\x48\x2f\xa9\x5c\x5f\xa0\xf0\x79"
+			  "\xa4\x47\x7d\x80\x20\x26\xfd\x63"
+			  "\x0a\xc7\x7e\x6d\x75\x47\xff\x76"
+			  "\x66\x2e\x8a\x6c\x81\x35\xaf\x0b"
+			  "\x2e\x6a\x49\x60\xc1\x10\xe1\xe1"
+			  "\x54\x03\xa4\x09\x0c\x37\x7a\x15"
+			  "\x23\x27\x5b\x8b\x4b\xa5\x64\x97"
+			  "\xae\x4a\x50\x73\x1f\x66\x1c\x5c"
+			  "\x03\x25\x3c\x8d\x48\x58\x71\x34"
+			  "\x0e\xec\x4e\x55\x1a\x03\x6a\xe5"
+			  "\xb6\x19\x2b\x84\x2a\x20\xd1\xea"
+			  "\x80\x6f\x96\x0e\x05\x62\xc7\x78"
+			  "\x87\x79\x60\x38\x46\xb4\x25\x57"
+			  "\x6e\x16\x63\xf8\xad\x6e\xd7\x42"
+			  "\x69\xe1\x88\xef\x6e\xd5\xb4\x9a"
+			  "\x3c\x78\x6c\x3b\xe5\xa0\x1d\x22"
+			  "\x86\x5c\x74\x3a\xeb\x24\x26\xc7"
+			  "\x09\xfc\x91\x96\x47\x87\x4f\x1a"
+			  "\xd6\x6b\x2c\x18\x47\xc0\xb8\x24"
+			  "\xa8\x5a\x4a\x9e\xcb\x03\xe7\x2a"
+			  "\x09\xe6\x4d\x9c\x6d\x86\x60\xf5"
+			  "\x2f\x48\x69\x37\x9f\xf2\xd2\xcb"
+			  "\x0e\x5a\xdd\x6e\x8a\xfb\x6a\xfe"
+			  "\x0b\x63\xde\x87\x42\x79\x8a\x68"
+			  "\x51\x28\x9b\x7a\xeb\xaf\xb8\x2f"
+			  "\x9d\xd1\xc7\x45\x90\x08\xc9\x83"
+			  "\xe9\x83\x84\xcb\x28\x69\x09\x69"
+			  "\xce\x99\x46\x00\x54\xcb\xd8\x38"
+			  "\xf9\x53\x4a\xbf\x31\xce\x57\x15"
+			  "\x33\xfa\x96\x04\x33\x42\xe3\xc0"
+			  "\xb7\x54\x4a\x65\x7a\x7c\x02\xe6"
+			  "\x19\x95\xd0\x0e\x82\x07\x63\xf9"
+			  "\xe1\x2b\x2a\xfc\x55\x92\x52\xc9"
+			  "\xb5\x9f\x23\x28\x60\xe7\x20\x51"
+			  "\x10\xd3\xed\x6d\x9b\xab\xb8\xe2"
+			  "\x5d\x9a\x34\xb3\xbe\x9c\x64\xcb"
+			  "\x78\xc6\x91\x22\x40\x91\x80\xbe"
+			  "\xd7\x78\x5c\x0e\x0a\xdc\x08\xe9"
+			  "\x67\x10\xa4\x83\x98\x79\x23\xe7"
+			  "\x92\xda\xa9\x22\x16\xb1\xe7\x78"
+			  "\xa3\x1c\x6c\x8f\x35\x7c\x4d\x37"
+			  "\x2f\x6e\x0b\x50\x5c\x34\xb9\xf9"
+			  "\xe6\x3d\x91\x0d\x32\x95\xaa\x3d"
+			  "\x48\x11\x06\xbb\x2d\xf2\x63\x88"
+			  "\x3f\x73\x09\xe2\x45\x56\x31\x51"
+			  "\xfa\x5e\x4e\x62\xf7\x90\xf9\xa9"
+			  "\x7d\x7b\x1b\xb1\xc8\x26\x6e\x66"
+			  "\xf6\x90\x9a\x7f\xf2\x57\xcc\x23"
+			  "\x59\xfa\xfa\xaa\x44\x04\x01\xa7"
+			  "\xa4\x78\xdb\x74\x3d\x8b\xb5",
+		.plen	= 719,
+		.ctext	= "\xdc\xb1\x0f\x2a\xe8\x2d\x1c\x57"
+			  "\xc4\x82\xfa\xd6\x87\xe6\x2f\x50"
+			  "\xbd\x9e\x0a\x42\x31\xf2\xc7\xbb"
+			  "\x21\x63\xa7\x05\x43\x33\xef\x33"
+			  "\x5c\xd3\x47\x55\xce\x5c\xe4\xd4"
+			  "\xe5\x07\x62\x22\xac\x01\xa8\x35"
+			  "\x9c\x59\x34\x30\x8e\xff\x9f\xb4"
+			  "\xd2\x4e\x74\x90\x64\xf2\x78\x5e"
+			  "\x63\xb7\xc5\x08\x1b\x37\xa5\x9e"
+			  "\xc0\xde\xff\xa9\x7f\x0b\xd3\x02"
+			  "\x83\x6e\x33\xfa\x43\x11\xd3\xda"
+			  "\x02\xcf\xcd\x4a\xc0\x78\x1f\x39"
+			  "\x62\xcb\xa3\x95\x7e\x13\x92\x28"
+			  "\xb2\xc4\x7a\xba\xd1\xc6\xf6\x1f"
+			  "\xda\x0b\xf1\xd1\x99\x54\xd8\x3b"
+			  "\x16\xf8\xe6\x97\x1e\xa7\xcf\x49"
+			  "\x69\x84\x01\x4c\xdc\x7a\x34\xff"
+			  "\x01\x08\xa3\x0b\x39\xac\x21\x37"
+			  "\xd8\xb4\x04\x19\x8b\x7a\x7d\x17"
+			  "\x44\xd1\x18\xaf\x1f\xa9\x29\xfe"
+			  "\xfa\x77\xe0\x40\x42\x0c\x79\xb7"
+			  "\xc3\x15\x1b\xd9\x0c\x82\xfc\x16"
+			  "\x70\xd6\x2a\xe9\x94\x72\xc5\xa5"
+			  "\x8a\x58\xbc\xfa\xe0\x88\x39\x4a"
+			  "\x80\xe8\xec\xaf\x60\xac\xe7\xf8"
+			  "\x9c\xf0\xfc\x61\x39\x07\x98\x6b"
+			  "\x88\xe3\x98\x22\x28\x18\x4a\x2d"
+			  "\x25\xef\x10\xe3\x83\x66\x3f\xfd"
+			  "\xc7\x0b\xa3\xfd\x97\xa9\xf4\xbd"
+			  "\xd8\x2a\xee\x4a\x50\xad\xcc\xb5"
+			  "\xc7\xab\xb8\x79\x9c\xd1\xf1\x27"
+			  "\x08\xf5\xf5\xe8\x1b\x66\xce\x41"
+			  "\x56\x60\x94\x86\xf0\x78\xc2\xfa"
+			  "\x5b\x63\x40\xb1\xd1\x1a\x38\x69"
+			  "\x0b\x8c\xb2\xf5\xa2\xbe\x90\x9d"
+			  "\x46\x23\x79\x8b\x3b\x4a\xf4\xbb"
+			  "\x55\xf7\x58\x9d\xaf\x59\xff\x74"
+			  "\xf3\xb9\xc4\x26\xb1\xf8\xe1\x28"
+			  "\x8b\x5e\x8f\x6d\x64\xe7\xe8\x63"
+			  "\xd2\x9e\xcb\xee\xae\x19\x04\x1d"
+			  "\x05\xf0\x9d\x99\x7b\x33\x33\xae"
+			  "\x6e\xe5\x09\xdd\x67\x51\xc4\xc8"
+			  "\x6a\xc7\x36\x35\xc9\x93\x76\xa1"
+			  "\xa8\x1c\xfa\x75\x92\x34\x0e\x7d"
+			  "\x3d\x1d\xef\x00\xfd\xa5\x25\x12"
+			  "\x7c\x91\x21\x41\xcc\x50\x47\xa9"
+			  "\x22\x50\x24\x96\x34\x79\x3d\xe8"
+			  "\x3f\xa0\x56\xaf\x98\x53\x55\xc3"
+			  "\x46\x1b\x17\x54\xb8\xb0\xb7\xe0"
+			  "\xe0\xab\x47\x6f\x06\xda\xcc\x75"
+			  "\xa7\x96\xb7\x92\xf3\xa0\x5f\xe6"
+			  "\xba\x97\xe3\x2f\x97\x05\xb2\x99"
+			  "\xa0\x09\x10\x98\x9c\xd3\x2e\xd1"
+			  "\x7e\x2a\x30\x54\x3c\xb9\x33\xe3"
+			  "\xf2\xaf\xd3\xa5\xee\xd0\x0b\x8a"
+			  "\x19\x54\x0f\x02\x51\x1f\x91\xdf"
+			  "\x71\x9c\xad\x77\x35\x28\x55\x6d"
+			  "\xcd\x7a\xd9\xa3\x41\x98\x6b\x37"
+			  "\x19\x0f\xbe\xae\x69\xb2\x25\x01"
+			  "\xee\x0e\x51\x4b\x53\xea\x0f\x5f"
+			  "\x85\x74\x79\x36\x32\x0a\x2a\x40"
+			  "\xad\x6b\x78\x41\x54\x99\xe9\xc1"
+			  "\x2b\x6c\x9b\x42\x21\xef\xe2\x50"
+			  "\x56\x8d\x78\xdf\x58\xbe\x0a\x0f"
+			  "\xfc\xfc\x0d\x2e\xd0\xcb\xa6\x0a"
+			  "\xa8\xd9\x1e\xa9\xd4\x7c\x99\x88"
+			  "\xcf\x11\xad\x1c\xd3\x04\x63\x55"
+			  "\xef\x85\x0b\x69\xa1\x40\xf1\x75"
+			  "\x24\xf4\xe5\x2c\xd4\x7a\x24\x50"
+			  "\x8f\xa2\x71\xc9\x92\x20\xcd\xcf"
+			  "\xda\x40\xbe\xf6\xfe\x1a\xca\xc7"
+			  "\x4a\x80\x45\x55\xcb\xdd\xb7\x01"
+			  "\xb0\x8d\xcb\xd2\xae\xbd\xa4\xd0"
+			  "\x5c\x10\x05\x66\x7b\xd4\xff\xd9"
+			  "\xc4\x23\x9d\x8d\x6b\x24\xf8\x3f"
+			  "\x73\x4d\x5c\x2b\x33\x4c\x5e\x63"
+			  "\x74\x6d\x03\xa1\x7a\x35\x65\x17"
+			  "\x38\x7f\x3b\xc1\x69\xcf\x61\x34"
+			  "\x30\x21\xaf\x97\x47\x12\x3f\xa1"
+			  "\xa7\x50\xc5\x87\xfb\x3f\x70\x32"
+			  "\x86\x17\x5f\x25\xe4\x74\xc6\xd0"
+			  "\x9b\x39\xe6\xe1\x5a\xec\x8f\x40"
+			  "\xce\xcc\x37\x3b\xd8\x72\x1c\x31"
+			  "\x75\xa4\xa6\x89\x8c\xdd\xd6\xd2"
+			  "\x32\x3d\xe8\xc3\x54\xab\x1f\x35"
+			  "\x52\xb4\x94\x81\xb0\x37\x3a\x03"
+			  "\xbb\xb1\x99\x30\xa5\xf8\x21\xcd"
+			  "\x93\x5d\xa7\x13\xed\xc7\x49\x09"
+			  "\x70\xda\x08\x39\xaa\x15\x9e\x45"
+			  "\x35\x2b\x0f\x5c\x8c\x8b\xc9"
+			  "\xa8\xb8\x9f\xfd\x37\x36\x31\x7e"
+			  "\x34\x4f\xc1\xc0\xca\x8a\x22\xfd",
+		.clen	= 735,
 	}
 };
 
@@ -14947,6 +15770,282 @@ static const struct aead_testvec sm4_ccm_tv_template[] = {
 			  "\x16\x84\x2D\x4F\xA1\x86\xF5\x6A"
 			  "\xB3\x32\x56\x97\x1F\xA1\x10\xF4",
 		.clen	= 80,
+	}, { /* Generated from AES-CCM test vectors */
+		.key	= "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7"
+			  "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf",
+		.klen	= 16,
+		.iv	= "\x01\x00\x00\x00\x03\x02\x01\x00"
+			  "\xa0\xa1\xa2\xa3\xa4\xa5\x00\x00",
+		.assoc	= "\x00\x01\x02\x03\x04\x05\x06\x07",
+		.alen	= 8,
+		.ptext	= "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+			  "\x10\x11\x12\x13\x14\x15\x16\x17"
+			  "\x18\x19\x1a\x1b\x1c\x1d\x1e",
+		.plen	= 23,
+		.ctext	= "\x7b\xff\x4a\x15\xf5\x73\xce\x82"
+			  "\x6e\xc2\x31\x1d\xe2\x53\x02\xac"
+			  "\xa4\x48\xf9\xe4\xf5\x1f\x81\x70"
+			  "\x18\xbc\xb6\x84\x01\xb8\xae",
+		.clen	= 31,
+	}, {
+		.key	= "\xf4\x6b\xc2\x75\x62\xfe\xb4\xe1"
+			  "\x53\x14\x73\x66\x8d\x88\xf6\x80",
+		.klen	= 16,
+		.iv	= "\x03\xa0\x20\x35\x26\xf2\x21\x8d"
+			  "\x50\x20\xda\xe2\x00\x00\x00\x00",
+		.assoc	= "\x5b\x9e\x13\x67\x02\x5e\xef\xc1"
+			  "\x6c\xf9\xd7\x1e\x52\x8f\x7a\x47"
+			  "\xe9\xd4\xcf\x20\x14\x6e\xf0\x2d"
+			  "\xd8\x9e\x2b\x56\x10\x23\x56\xe7",
+		.alen	= 32,
+		.ctext	= "\x23\x58\xce\xdc\x40\xb1\xcd\x92"
+			  "\x47\x96\x59\xfc\x8a\x26\x4f\xcf",
+		.clen	= 16,
+	}, {
+		.key	= "\xab\x2f\x8a\x74\xb7\x1c\xd2\xb1"
+			  "\xff\x80\x2e\x48\x7d\x82\xf8\xb9",
+		.klen	= 16,
+		.iv	= "\x03\xaf\x94\x87\x78\x35\x82\x81"
+			  "\x7f\x88\x94\x68\x00\x00\x00\x00",
+		.alen	= 0,
+		.ptext	= "\x00",
+		.plen	= 0,
+		.ctext	= "\x72\x7e\xf5\xd6\x39\x7a\x2b\x43",
+		.clen	= 8,
+	}, {
+		.key	= "\x39\xbb\xa7\xbe\x59\x97\x9e\x73"
+			  "\xa4\x48\x93\x39\x26\x71\x4a\xc6",
+		.klen	= 16,
+		.iv	= "\x03\xee\x49\x83\xe9\xa9\xff\xe9"
+			  "\x57\xba\xfd\x9e\x00\x00\x00\x00",
+		.assoc	= "\x44\xa6\x2c\x05\xe9\xe1\x43\xb1"
+			  "\x58\x7c\xf2\x5c\x6d\x39\x0a\x64"
+			  "\xa4\xf0\x13\x05\xd1\x77\x99\x67"
+			  "\x11\xc4\xc6\xdb\x00\x56\x36\x61",
+		.alen	= 32,
+		.ptext	= "\x00",
+		.plen	= 0,
+		.ctext	= "\xb0\x9d\xc6\xfb\x7d\xb5\xa1\x0e",
+		.clen	= 8,
+	}, {
+		.key	= "\x58\x5d\xa0\x96\x65\x1a\x04\xd7"
+			  "\x0d\x1a\x53\x3b\xb5\xe3\xf8\x8b",
+		.klen	= 16,
+		.iv	= "\x03\xcf\x76\x3f\xd9\x95\x75\x8f"
+			  "\x44\x89\x40\x7b\x00\x00\x00\x00",
+		.assoc	= "\x8f\x86\x6c\x4d\x1d\xc5\x39\x88"
+			  "\xc8\xf3\x5c\x52\x10\x63\x6f\x2b"
+			  "\x8a\x2a\xc5\x6f\x30\x23\x58\x7b"
+			  "\xfb\x36\x03\x11\xb4\xd9\xf2\xfe",
+		.alen	= 32,
+		.ptext	= "\xc2\x54\xc8\xde\x78\x87\x77\x40"
+			  "\x49\x71\xe4\xb7\xe7\xcb\x76\x61"
+			  "\x0a\x41\xb9\xe9\xc0\x76\x54\xab"
+			  "\x04\x49\x3b\x19\x93\x57\x25\x5d",
+		.plen	= 32,
+		.ctext	= "\xc9\xae\xef\x1d\xf3\x2c\xd3\x38"
+			  "\xc9\x7f\x7e\x28\xe8\xaa\xb3\x60"
+			  "\x49\xdc\x66\xca\x7b\x3d\xe0\x3c"
+			  "\xcb\x45\x9c\x1b\xb2\xbe\x07\x90"
+			  "\x87\xa6\x6b\x89\x0d\x0f\x90\xaa"
+			  "\x7d\xf6\x5a\x9a\x68\x2b\x81\x92",
+		.clen	= 48,
+	}, {
+		.key	= "\x8b\x32\xcf\xe7\x44\xed\x13\x59"
+			  "\x04\x38\x77\xb0\xb9\xad\xb4\x38",
+		.klen	= 16,
+		.iv	= "\x02\xff\xff\xff\xff\x00\x00\xff"
+			  "\xff\xff\x00\xff\xff\x00\x00\x00",
+		.assoc	= "\x8f\x86\x6c\x4d\x1d\xc5\x39\x88"
+			  "\xc8\xf3\x5c\x52\x10\x63\x6f\x2b"
+			  "\x8a\x2a\xc5\x6f\x30\x23\x58\x7b"
+			  "\xfb\x36\x03\x11\xb4\xd9\xf2\xfe"
+			  "\xc8\xf3\x5c\x52\x10\x63",
+		.alen	= 38,
+		.ptext	= "\x42\xc1\xcc\x08\x48\x6f\x41\x3f"
+			  "\x2f\x11\x66\x8b\x2a\x16\xf0\xe0"
+			  "\x58\x83\xf0\xc3\x70\x14\xc0\x5b"
+			  "\x3f\xec\x1d\x25\x3c\x51\xd2\x03"
+			  "\xcf\x59\x74\x1f\xb2\x85\xb4\x07"
+			  "\xc6\x6a\x63\x39\x8a\x5b\xde\xcb"
+			  "\xaf\x08\x44\xbd\x6f\x91\x15\xe1"
+			  "\xf5\x7a\x6e\x18\xbd\xdd\x61\x50"
+			  "\x59\xa9\x97\xab\xbb\x0e\x74\x5c"
+			  "\x00\xa4\x43\x54\x04\x54\x9b\x3b"
+			  "\x77\xec\xfd\x5c\xa6\xe8\x7b\x08"
+			  "\xae\xe6\x10\x3f\x32\x65\xd1\xfc"
+			  "\xa4\x1d\x2c\x31\xfb\x33\x7a\xb3"
+			  "\x35\x23\xf4\x20\x41\xd4\xad\x82"
+			  "\x8b\xa4\xad\x96\x1c\x20\x53\xbe"
+			  "\x0e\xa6\xf4\xdc\x78\x49\x3e\x72"
+			  "\xb1\xa9\xb5\x83\xcb\x08\x54\xb7"
+			  "\xad\x49\x3a\xae\x98\xce\xa6\x66"
+			  "\x10\x30\x90\x8c\x55\x83\xd7\x7c"
+			  "\x8b\xe6\x53\xde\xd2\x6e\x18\x21"
+			  "\x01\x52\xd1\x9f\x9d\xbb\x9c\x73"
+			  "\x57\xcc\x89\x09\x75\x9b\x78\x70"
+			  "\xed\x26\x97\x4d\xb4\xe4\x0c\xa5"
+			  "\xfa\x70\x04\x70\xc6\x96\x1c\x7d"
+			  "\x54\x41\x77\xa8\xe3\xb0\x7e\x96"
+			  "\x82\xd9\xec\xa2\x87\x68\x55\xf9"
+			  "\x8f\x9e\x73\x43\x47\x6a\x08\x36"
+			  "\x93\x67\xa8\x2d\xde\xac\x41\xa9"
+			  "\x5c\x4d\x73\x97\x0f\x70\x68\xfa"
+			  "\x56\x4d\x00\xc2\x3b\x1f\xc8\xb9"
+			  "\x78\x1f\x51\x07\xe3\x9a\x13\x4e"
+			  "\xed\x2b\x2e\xa3\xf7\x44\xb2\xe7"
+			  "\xab\x19\x37\xd9\xba\x76\x5e\xd2"
+			  "\xf2\x53\x15\x17\x4c\x6b\x16\x9f"
+			  "\x02\x66\x49\xca\x7c\x91\x05\xf2"
+			  "\x45\x36\x1e\xf5\x77\xad\x1f\x46"
+			  "\xa8\x13\xfb\x63\xb6\x08\x99\x63"
+			  "\x82\xa2\xed\xb3\xac\xdf\x43\x19"
+			  "\x45\xea\x78\x73\xd9\xb7\x39\x11"
+			  "\xa3\x13\x7c\xf8\x3f\xf7\xad\x81"
+			  "\x48\x2f\xa9\x5c\x5f\xa0\xf0\x79"
+			  "\xa4\x47\x7d\x80\x20\x26\xfd\x63"
+			  "\x0a\xc7\x7e\x6d\x75\x47\xff\x76"
+			  "\x66\x2e\x8a\x6c\x81\x35\xaf\x0b"
+			  "\x2e\x6a\x49\x60\xc1\x10\xe1\xe1"
+			  "\x54\x03\xa4\x09\x0c\x37\x7a\x15"
+			  "\x23\x27\x5b\x8b\x4b\xa5\x64\x97"
+			  "\xae\x4a\x50\x73\x1f\x66\x1c\x5c"
+			  "\x03\x25\x3c\x8d\x48\x58\x71\x34"
+			  "\x0e\xec\x4e\x55\x1a\x03\x6a\xe5"
+			  "\xb6\x19\x2b\x84\x2a\x20\xd1\xea"
+			  "\x80\x6f\x96\x0e\x05\x62\xc7\x78"
+			  "\x87\x79\x60\x38\x46\xb4\x25\x57"
+			  "\x6e\x16\x63\xf8\xad\x6e\xd7\x42"
+			  "\x69\xe1\x88\xef\x6e\xd5\xb4\x9a"
+			  "\x3c\x78\x6c\x3b\xe5\xa0\x1d\x22"
+			  "\x86\x5c\x74\x3a\xeb\x24\x26\xc7"
+			  "\x09\xfc\x91\x96\x47\x87\x4f\x1a"
+			  "\xd6\x6b\x2c\x18\x47\xc0\xb8\x24"
+			  "\xa8\x5a\x4a\x9e\xcb\x03\xe7\x2a"
+			  "\x09\xe6\x4d\x9c\x6d\x86\x60\xf5"
+			  "\x2f\x48\x69\x37\x9f\xf2\xd2\xcb"
+			  "\x0e\x5a\xdd\x6e\x8a\xfb\x6a\xfe"
+			  "\x0b\x63\xde\x87\x42\x79\x8a\x68"
+			  "\x51\x28\x9b\x7a\xeb\xaf\xb8\x2f"
+			  "\x9d\xd1\xc7\x45\x90\x08\xc9\x83"
+			  "\xe9\x83\x84\xcb\x28\x69\x09\x69"
+			  "\xce\x99\x46\x00\x54\xcb\xd8\x38"
+			  "\xf9\x53\x4a\xbf\x31\xce\x57\x15"
+			  "\x33\xfa\x96\x04\x33\x42\xe3\xc0"
+			  "\xb7\x54\x4a\x65\x7a\x7c\x02\xe6"
+			  "\x19\x95\xd0\x0e\x82\x07\x63\xf9"
+			  "\xe1\x2b\x2a\xfc\x55\x92\x52\xc9"
+			  "\xb5\x9f\x23\x28\x60\xe7\x20\x51"
+			  "\x10\xd3\xed\x6d\x9b\xab\xb8\xe2"
+			  "\x5d\x9a\x34\xb3\xbe\x9c\x64\xcb"
+			  "\x78\xc6\x91\x22\x40\x91\x80\xbe"
+			  "\xd7\x78\x5c\x0e\x0a\xdc\x08\xe9"
+			  "\x67\x10\xa4\x83\x98\x79\x23\xe7"
+			  "\x92\xda\xa9\x22\x16\xb1\xe7\x78"
+			  "\xa3\x1c\x6c\x8f\x35\x7c\x4d\x37"
+			  "\x2f\x6e\x0b\x50\x5c\x34\xb9\xf9"
+			  "\xe6\x3d\x91\x0d\x32\x95\xaa\x3d"
+			  "\x48\x11\x06\xbb\x2d\xf2\x63\x88"
+			  "\x3f\x73\x09\xe2\x45\x56\x31\x51"
+			  "\xfa\x5e\x4e\x62\xf7\x90\xf9\xa9"
+			  "\x7d\x7b\x1b\xb1\xc8\x26\x6e\x66"
+			  "\xf6\x90\x9a\x7f\xf2\x57\xcc\x23"
+			  "\x59\xfa\xfa\xaa\x44\x04\x01\xa7"
+			  "\xa4\x78\xdb\x74\x3d\x8b\xb5",
+		.plen	= 719,
+		.ctext	= "\xc5\x50\x85\x02\x72\xa8\xb3\x62"
+			  "\xf9\xcd\x77\x7b\x43\xa5\x04\x70"
+			  "\x68\x40\x57\x21\x1c\xfe\xef\x05"
+			  "\x4d\xb8\x44\xba\x59\xea\x62\x32"
+			  "\xcb\x6b\x6a\x39\x9b\xf3\xe5\xa4"
+			  "\x36\x38\xde\x7d\xcf\xb6\xcd\xe3"
+			  "\x89\xbf\x37\xc9\x96\x3c\x70\x10"
+			  "\x92\x47\xcc\xac\x6f\xf8\x55\x9a"
+			  "\x26\x43\x34\xb4\x92\x7d\x68\xfc"
+			  "\x60\x37\x74\x2a\x55\xba\xc7\xd7"
+			  "\x98\x69\xb7\xcf\x42\xfd\xb2\x10"
+			  "\xa0\x59\xe1\x2c\x73\x66\x12\x97"
+			  "\x85\x8b\x28\xcc\x29\x02\x15\x89"
+			  "\x23\xd3\x32\x92\x87\x57\x09\x13"
+			  "\x04\x7e\x8b\x6c\x3a\xc1\x4e\x6c"
+			  "\xe1\x9f\xc8\xcc\x47\x9c\xd8\x10"
+			  "\xf4\xb7\x5c\x30\x7a\x8b\x0f\x01"
+			  "\x52\x38\x02\x92\x99\xac\x03\x90"
+			  "\x18\x32\x2d\x21\x6a\x0a\x2a\xe7"
+			  "\xc2\xcc\x15\x84\x4e\x2b\x0b\x3a"
+			  "\x4c\xdc\xb0\x6b\x10\xd1\x27\x10"
+			  "\xf0\x4a\x5c\x43\xa0\x34\x34\x59"
+			  "\x47\x43\x48\xcb\x69\xa7\xff\x52"
+			  "\xb8\xca\x23\x09\x07\xd7\xc5\xe4"
+			  "\x2a\x4f\x99\xd5\x83\x36\x2a\x2d"
+			  "\x59\xd0\xca\xb0\xfa\x40\x8c\xab"
+			  "\xdf\x69\x08\xd9\x79\x1d\xde\xa8"
+			  "\x0b\x34\x74\x4d\xf5\xa0\x4c\x81"
+			  "\x7f\x93\x06\x40\x24\xfe\x7d\xcd"
+			  "\xe4\xfe\xf8\xf8\x30\xce\xd0\x5d"
+			  "\x70\xfd\x0d\x5a\x78\x85\x74\x2d"
+			  "\xe4\xb5\x40\x18\x99\x11\xe4\x6a"
+			  "\xdf\xfa\x4f\x25\x2c\xde\x15\xb7"
+			  "\x12\xd8\xc6\x90\x0d\x0f\xc9\xfb"
+			  "\x21\xf1\xed\xfe\x98\xe1\x03\xe2"
+			  "\x5c\xef\xb6\xc7\x87\x77\x0e\xcd"
+			  "\xff\x78\x94\xc9\xbe\xd3\x47\xf7"
+			  "\x8d\x37\x48\x01\x42\xe2\x17\x96"
+			  "\xfc\xc0\xcb\x7b\x7b\x57\xaf\x3b"
+			  "\xc9\xd0\x94\xce\x5e\x1b\xa9\x47"
+			  "\x02\x4d\x74\xcc\x45\x1d\xd3\x2d"
+			  "\x5f\x4f\x7f\xf2\x4b\xf9\x59\xee"
+			  "\x9e\x9e\xb9\x95\x29\x19\xd1\x5f"
+			  "\x72\xab\x8d\xf1\x28\xd1\x1c\xae"
+			  "\xc2\xba\xf7\x22\x84\x2c\x83\x51"
+			  "\x03\xad\xa3\xef\x81\xa7\xdc\xf1"
+			  "\x44\x51\x50\x96\x70\xd1\xe5\x47"
+			  "\x57\xf9\x30\x90\xe4\xbf\xfc\x75"
+			  "\x14\xaa\x4d\xb7\xb1\xe7\x79\x33"
+			  "\x43\xc2\x5c\xc1\xbc\x09\x92\x0f"
+			  "\xa7\xaf\x68\x51\x51\xec\x0b\xc3"
+			  "\x3d\x2b\x94\x30\x45\x29\x1b\x9e"
+			  "\x70\x56\xf8\xd6\x67\x2d\x39\x3b"
+			  "\x3c\xd2\xd0\xd3\xdc\x7d\x84\xe9"
+			  "\x06\x31\x98\xa6\x5c\xbf\x10\x58"
+			  "\xce\xbb\xa7\xe1\x65\x7e\x51\x87"
+			  "\x70\x46\xb4\x7f\xf9\xec\x92\x1c"
+			  "\x9b\x24\x49\xc1\x04\xbe\x1c\x5f"
+			  "\xcc\xb3\x33\x8c\xad\xe7\xdc\x32"
+			  "\x54\xa2\x0d\x83\x0f\x3c\x12\x5d"
+			  "\x71\xe3\x9c\xae\x71\xa3\x2a\x10"
+			  "\xc5\x91\xb4\x73\x96\x60\xdb\x5d"
+			  "\x1f\xd5\x9a\xd2\x69\xc3\xd7\x4b"
+			  "\xa2\x66\x81\x96\x4a\xaa\x02\xd6"
+			  "\xd5\x44\x9b\x42\x3a\x15\x5f\xe7"
+			  "\x4d\x7c\xf6\x71\x4a\xea\xe8\x43"
+			  "\xd7\x68\xe4\xbc\x05\x87\x49\x05"
+			  "\x3b\x47\xb2\x6d\x5f\xd1\x11\xa6"
+			  "\x58\xd4\xa2\x45\xec\xb5\x54\x55"
+			  "\xd3\xd6\xd2\x6a\x8b\x21\x9e\x2c"
+			  "\xf1\x27\x4b\x5b\xe3\xff\xe0\xfd"
+			  "\x4b\xf1\xe7\xe2\x84\xf2\x17\x37"
+			  "\x11\x68\xc4\x92\x4b\x6b\xef\x8e"
+			  "\x75\xf5\xc2\x7d\x5c\xe9\x7c\xfc"
+			  "\x2b\x00\x33\x0e\x7d\x69\xd8\xd4"
+			  "\x9b\xa8\x38\x54\x7e\x6d\x23\x51"
+			  "\x2c\xd6\xc4\x58\x23\x1c\x22\x2a"
+			  "\x59\xc5\x9b\xec\x9d\xbf\x03\x0f"
+			  "\xb3\xdd\xba\x02\x22\xa0\x34\x37"
+			  "\x19\x56\xc2\x5b\x32\x1d\x1e\x66"
+			  "\x68\xf4\x47\x05\x04\x18\xa7\x28"
+			  "\x80\xf2\xc7\x99\xed\x1e\x72\x48"
+			  "\x8f\x97\x5d\xb3\x74\x42\xfd\x0c"
+			  "\x0f\x5f\x29\x0c\xf1\x35\x22\x90"
+			  "\xd6\x7c\xb8\xa3\x2a\x89\x38\x71"
+			  "\xe9\x7a\x55\x3c\x3b\xf2\x6e\x1a"
+			  "\x22\x8f\x07\x81\xc1\xe1\xf1\x76"
+			  "\x2a\x75\xab\x86\xc4\xcc\x52\x59"
+			  "\x83\x19\x5e\xb3\x53\xe2\x81\xdf"
+			  "\xe6\x15\xb3\xba\x0c\x0e\xba"
+			  "\xa9\x2c\xed\x51\xd5\x06\xc8\xc6"
+			  "\x4b\x9f\x5d\x1b\x61\x31\xad\xf4",
+		.clen	= 735,
 	}
 };
 
@@ -15030,6 +16129,68 @@ static const struct hash_testvec sm4_cmac128_tv_template[] = {
 	}
 };
 
+static const struct hash_testvec sm4_xcbc128_tv_template[] = {
+	{ /* Generated from AES-XCBC128 test vectors */
+		.key		= "\x00\x01\x02\x03\x04\x05\x06\x07"
+				  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+		.plaintext 	= zeroed_string,
+		.digest 	= "\xa9\x9a\x5c\x44\xe2\x34\xee\x2c"
+				  "\x9b\xe4\x9d\xca\x64\xb0\xa5\xc4",
+		.psize		= 0,
+		.ksize		= 16,
+	}, {
+		.key		= "\x00\x01\x02\x03\x04\x05\x06\x07"
+				  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+		.plaintext 	= "\x00\x01\x02",
+		.digest		= "\x17\x27\x62\xf3\x8b\x88\x1d\xc0"
+				  "\x97\x35\x9c\x3e\x9f\x27\xb7\x83",
+		.psize		= 3,
+		.ksize		= 16,
+	}, {
+		.key		= "\x00\x01\x02\x03\x04\x05\x06\x07"
+				  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+		.plaintext 	= "\x00\x01\x02\x03\x04\x05\x06\x07"
+				  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+		.digest 	= "\xda\x45\xd1\xac\xec\x4d\xab\x46"
+				  "\xdd\x59\xe0\x44\xff\x59\xd5\xfc",
+		.psize		= 16,
+		.ksize		= 16,
+	}, {
+		.key		= "\x00\x01\x02\x03\x04\x05\x06\x07"
+				  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+		.plaintext 	= "\x00\x01\x02\x03\x04\x05\x06\x07"
+				  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+				  "\x10\x11\x12\x13",
+		.digest 	= "\xbe\x24\x5d\x81\x8c\x8a\x10\xa4"
+				  "\x8e\xc2\x16\xfa\xa4\x83\xc9\x2a",
+		.psize		= 20,
+		.ksize		= 16,
+	}, {
+		.key		= "\x00\x01\x02\x03\x04\x05\x06\x07"
+				  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+		.plaintext 	= "\x00\x01\x02\x03\x04\x05\x06\x07"
+				  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+				  "\x10\x11\x12\x13\x14\x15\x16\x17"
+				  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f",
+		.digest 	= "\x91\x82\x31\x56\xd5\x77\xa4\xc5"
+				  "\x88\x2d\xce\x3a\x87\x5e\xbd\xba",
+		.psize		= 32,
+		.ksize		= 16,
+	}, {
+		.key		= "\x00\x01\x02\x03\x04\x05\x06\x07"
+				  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+		.plaintext 	= "\x00\x01\x02\x03\x04\x05\x06\x07"
+				  "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+				  "\x10\x11\x12\x13\x14\x15\x16\x17"
+				  "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"
+				  "\x20\x21",
+		.digest 	= "\x2a\xae\xa5\x24\x0c\x12\x9f\x5f"
+				  "\x55\xfb\xae\x35\x13\x0d\x22\x2d",
+		.psize		= 34,
+		.ksize		= 16,
+	}
+};
+
 /* Cast6 test vectors from RFC 2612 */
 static const struct cipher_testvec cast6_tv_template[] = {
 	{
-- 
2.24.3 (Apple Git-128)




* [PATCH 05/16] crypto: tcrypt - add SM4 cts-cbc/essiv/xts/xcbc test
  2022-09-26  9:36 ` Tianjia Zhang
@ 2022-09-26  9:36   ` Tianjia Zhang
  -1 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26  9:36 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
	Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
	Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
	linux-stm32

Add CTS-CBC/ESSIV/XTS/XCBC tests for the SM4 algorithm, along with the
corresponding speed tests, so that the performance-optimized
implementations of these modes can be exercised.

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
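The new standalone case labels in do_test() can be summarized as a small
lookup table. This is a user-space sketch for illustration only, not
kernel code: struct sm4_mode_entry, sm4_new_modes[] and sm4_lookup_mode()
are hypothetical names, and only the mode numbers that appear as new
"case" labels in the diff below are listed (the xts(sm4) additions extend
existing cases and are left out).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry mirroring one new case label in do_test(). */
struct sm4_mode_entry {
	int mode;		/* value of tcrypt's "mode" module parameter */
	const char *alg;	/* algorithm name passed to the test helper */
	int is_speed_test;	/* 0: correctness test, 1: speed test */
};

static const struct sm4_mode_entry sm4_new_modes[] = {
	{  59, "cts(cbc(sm4))",       0 },	/* tcrypt_test() */
	{ 160, "xcbc(sm4)",           0 },	/* tcrypt_test() */
	{ 230, "essiv(cbc(sm4),sm3)", 1 },	/* test_acipher_speed() */
};

/* Return the entry for a mode number, or NULL if it is not in the table. */
static const struct sm4_mode_entry *sm4_lookup_mode(int mode)
{
	size_t i;

	for (i = 0; i < sizeof(sm4_new_modes) / sizeof(sm4_new_modes[0]); i++)
		if (sm4_new_modes[i].mode == mode)
			return &sm4_new_modes[i];
	return NULL;
}
```

On a kernel built with the tcrypt module, loading it with "mode=59"
would then be expected to run the cts(cbc(sm4)) self-test, and "mode=230"
the ESSIV speed test.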
 crypto/tcrypt.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
index a82679b576bb..b870b2fe716d 100644
--- a/crypto/tcrypt.c
+++ b/crypto/tcrypt.c
@@ -1711,6 +1711,10 @@ static int do_test(const char *alg, u32 type, u32 mask, int m, u32 num_mb)
 		ret += tcrypt_test("gcm(aria)");
 		break;
 
+	case 59:
+		ret += tcrypt_test("cts(cbc(sm4))");
+		break;
+
 	case 100:
 		ret += tcrypt_test("hmac(md5)");
 		break;
@@ -1811,6 +1815,10 @@ static int do_test(const char *alg, u32 type, u32 mask, int m, u32 num_mb)
 		ret += tcrypt_test("cmac(sm4)");
 		break;
 
+	case 160:
+		ret += tcrypt_test("xcbc(sm4)");
+		break;
+
 	case 181:
 		ret += tcrypt_test("authenc(hmac(sha1),cbc(des))");
 		break;
@@ -1846,6 +1854,7 @@ static int do_test(const char *alg, u32 type, u32 mask, int m, u32 num_mb)
 		ret += tcrypt_test("cbc(sm4)");
 		ret += tcrypt_test("cfb(sm4)");
 		ret += tcrypt_test("ctr(sm4)");
+		ret += tcrypt_test("xts(sm4)");
 		break;
 	case 192:
 		ret += tcrypt_test("ecb(aria)");
@@ -2109,6 +2118,10 @@ static int do_test(const char *alg, u32 type, u32 mask, int m, u32 num_mb)
 				speed_template_16);
 		test_cipher_speed("cbc(sm4)", DECRYPT, sec, NULL, 0,
 				speed_template_16);
+		test_cipher_speed("cts(cbc(sm4))", ENCRYPT, sec, NULL, 0,
+				speed_template_16);
+		test_cipher_speed("cts(cbc(sm4))", DECRYPT, sec, NULL, 0,
+				speed_template_16);
 		test_cipher_speed("cfb(sm4)", ENCRYPT, sec, NULL, 0,
 				speed_template_16);
 		test_cipher_speed("cfb(sm4)", DECRYPT, sec, NULL, 0,
@@ -2117,6 +2130,10 @@ static int do_test(const char *alg, u32 type, u32 mask, int m, u32 num_mb)
 				speed_template_16);
 		test_cipher_speed("ctr(sm4)", DECRYPT, sec, NULL, 0,
 				speed_template_16);
+		test_cipher_speed("xts(sm4)", ENCRYPT, sec, NULL, 0,
+				speed_template_32);
+		test_cipher_speed("xts(sm4)", DECRYPT, sec, NULL, 0,
+				speed_template_32);
 		break;
 
 	case 219:
@@ -2212,6 +2229,13 @@ static int do_test(const char *alg, u32 type, u32 mask, int m, u32 num_mb)
 				   speed_template_16, num_mb);
 		break;
 
+	case 230:
+		test_acipher_speed("essiv(cbc(sm4),sm3)", ENCRYPT, sec,
+				   NULL, 0, speed_template_16);
+		test_acipher_speed("essiv(cbc(sm4),sm3)", DECRYPT, sec,
+				   NULL, 0, speed_template_16);
+		break;
+
 	case 300:
 		if (alg) {
 			test_hash_speed(alg, sec, generic_hash_speed_template);
@@ -2630,6 +2654,10 @@ static int do_test(const char *alg, u32 type, u32 mask, int m, u32 num_mb)
 				speed_template_16);
 		test_acipher_speed("ctr(sm4)", DECRYPT, sec, NULL, 0,
 				speed_template_16);
+		test_acipher_speed("xts(sm4)", ENCRYPT, sec, NULL, 0,
+				speed_template_32);
+		test_acipher_speed("xts(sm4)", DECRYPT, sec, NULL, 0,
+				speed_template_32);
 		break;
 
 	case 519:
-- 
2.24.3 (Apple Git-128)


* [PATCH 06/16] crypto: arm64/sm4 - refactor and simplify CE implementation
  2022-09-26  9:36 ` Tianjia Zhang
@ 2022-09-26  9:36   ` Tianjia Zhang
  -1 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26  9:36 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
	Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
	Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
	linux-stm32

This patch does not add new features, but only refactors and simplifies the
implementation of the Crypto Extension acceleration of the SM4 algorithm:

Extract the SM4 Crypto Extension helper macros into sm4-ce-asm.h for reuse
in the subsequent optimization of CCM/GCM modes.

Encryption in CBC and CFB modes now processes four blocks per iteration
instead of one, so a single ld1 instruction loads 64 bytes of data at a time,
which reduces unnecessary memory accesses.
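
As a rough illustration of the point above, here is a C sketch (not the
assembly in this patch). CBC encryption stays serial, since each block depends
on the previous ciphertext block, so batching only changes how data is loaded
and stored. Everything in the sketch is hypothetical: toy_encrypt_block() is a
made-up stand-in and is NOT SM4, and cbc_enc_1x(), cbc_enc_4x() and
cbc_variants_agree() exist only for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLK 16	/* block size in bytes, same as SM4 */

/* Toy stand-in for the block cipher. This is NOT SM4. */
static void toy_encrypt_block(uint8_t key, uint8_t b[BLK])
{
	for (unsigned int i = 0; i < BLK; i++)
		b[i] = (uint8_t)((b[i] ^ key) * 5u + i);
}

static void xor_block(uint8_t *d, const uint8_t *s)
{
	for (unsigned int i = 0; i < BLK; i++)
		d[i] ^= s[i];
}

/* One block per iteration: 16-byte load, encrypt, 16-byte store. */
static void cbc_enc_1x(uint8_t key, uint8_t *dst, const uint8_t *src,
		       uint8_t iv[BLK], unsigned int nblocks)
{
	while (nblocks--) {
		xor_block(iv, src);
		toy_encrypt_block(key, iv);
		memcpy(dst, iv, BLK);
		src += BLK;
		dst += BLK;
	}
}

/* Four blocks per iteration: one 64-byte load and one 64-byte store. */
static void cbc_enc_4x(uint8_t key, uint8_t *dst, const uint8_t *src,
		       uint8_t iv[BLK], unsigned int nblocks)
{
	uint8_t v[4][BLK];

	assert(dst && src && iv);

	while (nblocks >= 4) {
		memcpy(v, src, sizeof(v));	/* like ld1 {v0-v3}, #64 */
		xor_block(v[0], iv);
		toy_encrypt_block(key, v[0]);
		for (int i = 1; i < 4; i++) {
			/* chaining is still serial: v[i] needs v[i - 1] */
			xor_block(v[i], v[i - 1]);
			toy_encrypt_block(key, v[i]);
		}
		memcpy(dst, v, sizeof(v));	/* like st1 {v0-v3}, #64 */
		memcpy(iv, v[3], BLK);
		src += 4 * BLK;
		dst += 4 * BLK;
		nblocks -= 4;
	}
	cbc_enc_1x(key, dst, src, iv, nblocks);	/* 0..3 leftover blocks */
}

/* Returns 1 if both variants give the same ciphertext and final IV. */
int cbc_variants_agree(unsigned int nblocks)
{
	uint8_t src[16 * BLK], a[16 * BLK], b[16 * BLK];
	uint8_t iv_a[BLK], iv_b[BLK];

	assert(nblocks <= 16);
	for (unsigned int i = 0; i < sizeof(src); i++)
		src[i] = (uint8_t)(i * 7u + 3u);
	memset(a, 0, sizeof(a));
	memset(b, 0, sizeof(b));
	memset(iv_a, 0xA5, BLK);
	memset(iv_b, 0xA5, BLK);

	cbc_enc_1x(0x3c, a, src, iv_a, nblocks);
	cbc_enc_4x(0x3c, b, src, iv_b, nblocks);

	return !memcmp(a, b, nblocks * BLK) && !memcmp(iv_a, iv_b, BLK);
}
```

Calling cbc_variants_agree() with block counts that are not multiples of four
(for example 7) also exercises the one-block tail, which mirrors the
.Lcbc_enc_loop_1x fallback in the assembly.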

CBC/CFB/CTR make full use of free registers to avoid redundant memory
accesses, and some instructions are rearranged to improve out-of-order
execution.
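
The register usage can be sketched in C for the CBC decryption case, under the
same caveats as any illustration: this is not the patch's assembly, toy_enc()
and toy_dec() are made-up stand-ins and are NOT SM4, and all function names
are hypothetical. The idea shown is that the loaded ciphertext is kept in one
set of buffers (v0-v7 in the assembly) while decryption works on copies
(v8-v15), so the previous ciphertext blocks needed for chaining never have to
be re-read from memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLK	16	/* block size in bytes, same as SM4 */
#define BATCH	4	/* blocks handled per iteration in this sketch */

/* Toy invertible "cipher" used only for the demonstration. NOT SM4. */
static void toy_enc(uint8_t key, uint8_t b[BLK])
{
	for (unsigned int i = 0; i < BLK; i++)
		b[i] = (uint8_t)((b[i] ^ key) * 5u + i);
}

static void toy_dec(uint8_t key, uint8_t b[BLK])
{
	/* 205 is the inverse of 5 modulo 256 */
	for (unsigned int i = 0; i < BLK; i++)
		b[i] = (uint8_t)((uint8_t)((b[i] - i) * 205u) ^ key);
}

static void xor_block(uint8_t *d, const uint8_t *s)
{
	for (unsigned int i = 0; i < BLK; i++)
		d[i] ^= s[i];
}

static void cbc_dec_batched(uint8_t key, uint8_t *dst, const uint8_t *src,
			    uint8_t iv[BLK], unsigned int nblocks)
{
	uint8_t c[BATCH][BLK];	/* ciphertext, kept intact for chaining */
	uint8_t t[BATCH][BLK];	/* working copies that get decrypted */

	assert(dst && src && iv);

	while (nblocks >= BATCH) {
		memcpy(c, src, sizeof(c));	/* the only read of src */
		memcpy(t, c, sizeof(t));
		for (int i = 0; i < BATCH; i++)
			toy_dec(key, t[i]);
		xor_block(t[0], iv);
		for (int i = 1; i < BATCH; i++)
			xor_block(t[i], c[i - 1]);
		memcpy(dst, t, sizeof(t));	/* safe when dst == src */
		memcpy(iv, c[BATCH - 1], BLK);
		src += BATCH * BLK;
		dst += BATCH * BLK;
		nblocks -= BATCH;
	}
	while (nblocks--) {			/* leftover blocks */
		memcpy(c[0], src, BLK);
		memcpy(t[0], c[0], BLK);
		toy_dec(key, t[0]);
		xor_block(t[0], iv);
		memcpy(dst, t[0], BLK);
		memcpy(iv, c[0], BLK);
		src += BLK;
		dst += BLK;
	}
}

/* Encrypts a test pattern, decrypts it in place, returns 1 on success. */
int cbc_dec_roundtrip_ok(unsigned int nblocks)
{
	uint8_t plain[16 * BLK], buf[16 * BLK];
	uint8_t iv[BLK], prev[BLK], last_ct[BLK];

	assert(nblocks >= 1 && nblocks <= 16);
	for (unsigned int i = 0; i < sizeof(plain); i++)
		plain[i] = (uint8_t)(i * 11u + 1u);

	/* reference CBC encryption, one block at a time */
	memset(prev, 0x5a, BLK);
	for (unsigned int n = 0; n < nblocks; n++) {
		uint8_t *blk = buf + n * BLK;

		memcpy(blk, plain + n * BLK, BLK);
		xor_block(blk, prev);
		toy_enc(0x3c, blk);
		memcpy(prev, blk, BLK);
	}
	memcpy(last_ct, prev, BLK);

	memset(iv, 0x5a, BLK);
	cbc_dec_batched(0x3c, buf, buf, iv, nblocks);	/* in place */

	return !memcmp(buf, plain, nblocks * BLK) &&
	       !memcmp(iv, last_ct, BLK);
}
```

Because the ciphertext copies are never overwritten, the sketch also handles
in-place operation (dst == src), which is what cbc_dec_roundtrip_ok() checks.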

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
 arch/arm64/crypto/sm4-ce-asm.h  | 209 +++++++++++
 arch/arm64/crypto/sm4-ce-core.S | 646 ++++++++++++++------------------
 arch/arm64/crypto/sm4-ce-glue.c |  64 ++--
 3 files changed, 519 insertions(+), 400 deletions(-)
 create mode 100644 arch/arm64/crypto/sm4-ce-asm.h

diff --git a/arch/arm64/crypto/sm4-ce-asm.h b/arch/arm64/crypto/sm4-ce-asm.h
new file mode 100644
index 000000000000..7ea98e42e779
--- /dev/null
+++ b/arch/arm64/crypto/sm4-ce-asm.h
@@ -0,0 +1,209 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * SM4 helper macros for Crypto Extensions
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#define SM4_PREPARE(ptr)					\
+	ld1		{v24.16b-v27.16b}, [ptr], #64;		\
+	ld1		{v28.16b-v31.16b}, [ptr];
+
+#define SM4_CRYPT_BLK_BE(b0)					\
+	sm4e		b0.4s, v24.4s;				\
+	sm4e		b0.4s, v25.4s;				\
+	sm4e		b0.4s, v26.4s;				\
+	sm4e		b0.4s, v27.4s;				\
+	sm4e		b0.4s, v28.4s;				\
+	sm4e		b0.4s, v29.4s;				\
+	sm4e		b0.4s, v30.4s;				\
+	sm4e		b0.4s, v31.4s;				\
+	rev64		b0.4s, b0.4s;				\
+	ext		b0.16b, b0.16b, b0.16b, #8;		\
+	rev32		b0.16b, b0.16b;
+
+#define SM4_CRYPT_BLK(b0)					\
+	rev32		b0.16b, b0.16b;				\
+	SM4_CRYPT_BLK_BE(b0);
+
+#define SM4_CRYPT_BLK2_BE(b0, b1)				\
+	sm4e		b0.4s, v24.4s;				\
+	sm4e		b1.4s, v24.4s;				\
+	sm4e		b0.4s, v25.4s;				\
+	sm4e		b1.4s, v25.4s;				\
+	sm4e		b0.4s, v26.4s;				\
+	sm4e		b1.4s, v26.4s;				\
+	sm4e		b0.4s, v27.4s;				\
+	sm4e		b1.4s, v27.4s;				\
+	sm4e		b0.4s, v28.4s;				\
+	sm4e		b1.4s, v28.4s;				\
+	sm4e		b0.4s, v29.4s;				\
+	sm4e		b1.4s, v29.4s;				\
+	sm4e		b0.4s, v30.4s;				\
+	sm4e		b1.4s, v30.4s;				\
+	sm4e		b0.4s, v31.4s;				\
+	sm4e		b1.4s, v31.4s;				\
+	rev64		b0.4s, b0.4s;				\
+	rev64		b1.4s, b1.4s;				\
+	ext		b0.16b, b0.16b, b0.16b, #8;		\
+	ext		b1.16b, b1.16b, b1.16b, #8;		\
+	rev32		b0.16b, b0.16b;				\
+	rev32		b1.16b, b1.16b;
+
+#define SM4_CRYPT_BLK2(b0, b1)					\
+	rev32		b0.16b, b0.16b;				\
+	rev32		b1.16b, b1.16b;				\
+	SM4_CRYPT_BLK2_BE(b0, b1);
+
+#define SM4_CRYPT_BLK4_BE(b0, b1, b2, b3)			\
+	sm4e		b0.4s, v24.4s;				\
+	sm4e		b1.4s, v24.4s;				\
+	sm4e		b2.4s, v24.4s;				\
+	sm4e		b3.4s, v24.4s;				\
+	sm4e		b0.4s, v25.4s;				\
+	sm4e		b1.4s, v25.4s;				\
+	sm4e		b2.4s, v25.4s;				\
+	sm4e		b3.4s, v25.4s;				\
+	sm4e		b0.4s, v26.4s;				\
+	sm4e		b1.4s, v26.4s;				\
+	sm4e		b2.4s, v26.4s;				\
+	sm4e		b3.4s, v26.4s;				\
+	sm4e		b0.4s, v27.4s;				\
+	sm4e		b1.4s, v27.4s;				\
+	sm4e		b2.4s, v27.4s;				\
+	sm4e		b3.4s, v27.4s;				\
+	sm4e		b0.4s, v28.4s;				\
+	sm4e		b1.4s, v28.4s;				\
+	sm4e		b2.4s, v28.4s;				\
+	sm4e		b3.4s, v28.4s;				\
+	sm4e		b0.4s, v29.4s;				\
+	sm4e		b1.4s, v29.4s;				\
+	sm4e		b2.4s, v29.4s;				\
+	sm4e		b3.4s, v29.4s;				\
+	sm4e		b0.4s, v30.4s;				\
+	sm4e		b1.4s, v30.4s;				\
+	sm4e		b2.4s, v30.4s;				\
+	sm4e		b3.4s, v30.4s;				\
+	sm4e		b0.4s, v31.4s;				\
+	sm4e		b1.4s, v31.4s;				\
+	sm4e		b2.4s, v31.4s;				\
+	sm4e		b3.4s, v31.4s;				\
+	rev64		b0.4s, b0.4s;				\
+	rev64		b1.4s, b1.4s;				\
+	rev64		b2.4s, b2.4s;				\
+	rev64		b3.4s, b3.4s;				\
+	ext		b0.16b, b0.16b, b0.16b, #8;		\
+	ext		b1.16b, b1.16b, b1.16b, #8;		\
+	ext		b2.16b, b2.16b, b2.16b, #8;		\
+	ext		b3.16b, b3.16b, b3.16b, #8;		\
+	rev32		b0.16b, b0.16b;				\
+	rev32		b1.16b, b1.16b;				\
+	rev32		b2.16b, b2.16b;				\
+	rev32		b3.16b, b3.16b;
+
+#define SM4_CRYPT_BLK4(b0, b1, b2, b3)				\
+	rev32		b0.16b, b0.16b;				\
+	rev32		b1.16b, b1.16b;				\
+	rev32		b2.16b, b2.16b;				\
+	rev32		b3.16b, b3.16b;				\
+	SM4_CRYPT_BLK4_BE(b0, b1, b2, b3);
+
+#define SM4_CRYPT_BLK8_BE(b0, b1, b2, b3, b4, b5, b6, b7)	\
+	sm4e		b0.4s, v24.4s;				\
+	sm4e		b1.4s, v24.4s;				\
+	sm4e		b2.4s, v24.4s;				\
+	sm4e		b3.4s, v24.4s;				\
+	sm4e		b4.4s, v24.4s;				\
+	sm4e		b5.4s, v24.4s;				\
+	sm4e		b6.4s, v24.4s;				\
+	sm4e		b7.4s, v24.4s;				\
+	sm4e		b0.4s, v25.4s;				\
+	sm4e		b1.4s, v25.4s;				\
+	sm4e		b2.4s, v25.4s;				\
+	sm4e		b3.4s, v25.4s;				\
+	sm4e		b4.4s, v25.4s;				\
+	sm4e		b5.4s, v25.4s;				\
+	sm4e		b6.4s, v25.4s;				\
+	sm4e		b7.4s, v25.4s;				\
+	sm4e		b0.4s, v26.4s;				\
+	sm4e		b1.4s, v26.4s;				\
+	sm4e		b2.4s, v26.4s;				\
+	sm4e		b3.4s, v26.4s;				\
+	sm4e		b4.4s, v26.4s;				\
+	sm4e		b5.4s, v26.4s;				\
+	sm4e		b6.4s, v26.4s;				\
+	sm4e		b7.4s, v26.4s;				\
+	sm4e		b0.4s, v27.4s;				\
+	sm4e		b1.4s, v27.4s;				\
+	sm4e		b2.4s, v27.4s;				\
+	sm4e		b3.4s, v27.4s;				\
+	sm4e		b4.4s, v27.4s;				\
+	sm4e		b5.4s, v27.4s;				\
+	sm4e		b6.4s, v27.4s;				\
+	sm4e		b7.4s, v27.4s;				\
+	sm4e		b0.4s, v28.4s;				\
+	sm4e		b1.4s, v28.4s;				\
+	sm4e		b2.4s, v28.4s;				\
+	sm4e		b3.4s, v28.4s;				\
+	sm4e		b4.4s, v28.4s;				\
+	sm4e		b5.4s, v28.4s;				\
+	sm4e		b6.4s, v28.4s;				\
+	sm4e		b7.4s, v28.4s;				\
+	sm4e		b0.4s, v29.4s;				\
+	sm4e		b1.4s, v29.4s;				\
+	sm4e		b2.4s, v29.4s;				\
+	sm4e		b3.4s, v29.4s;				\
+	sm4e		b4.4s, v29.4s;				\
+	sm4e		b5.4s, v29.4s;				\
+	sm4e		b6.4s, v29.4s;				\
+	sm4e		b7.4s, v29.4s;				\
+	sm4e		b0.4s, v30.4s;				\
+	sm4e		b1.4s, v30.4s;				\
+	sm4e		b2.4s, v30.4s;				\
+	sm4e		b3.4s, v30.4s;				\
+	sm4e		b4.4s, v30.4s;				\
+	sm4e		b5.4s, v30.4s;				\
+	sm4e		b6.4s, v30.4s;				\
+	sm4e		b7.4s, v30.4s;				\
+	sm4e		b0.4s, v31.4s;				\
+	sm4e		b1.4s, v31.4s;				\
+	sm4e		b2.4s, v31.4s;				\
+	sm4e		b3.4s, v31.4s;				\
+	sm4e		b4.4s, v31.4s;				\
+	sm4e		b5.4s, v31.4s;				\
+	sm4e		b6.4s, v31.4s;				\
+	sm4e		b7.4s, v31.4s;				\
+	rev64		b0.4s, b0.4s;				\
+	rev64		b1.4s, b1.4s;				\
+	rev64		b2.4s, b2.4s;				\
+	rev64		b3.4s, b3.4s;				\
+	rev64		b4.4s, b4.4s;				\
+	rev64		b5.4s, b5.4s;				\
+	rev64		b6.4s, b6.4s;				\
+	rev64		b7.4s, b7.4s;				\
+	ext		b0.16b, b0.16b, b0.16b, #8;		\
+	ext		b1.16b, b1.16b, b1.16b, #8;		\
+	ext		b2.16b, b2.16b, b2.16b, #8;		\
+	ext		b3.16b, b3.16b, b3.16b, #8;		\
+	ext		b4.16b, b4.16b, b4.16b, #8;		\
+	ext		b5.16b, b5.16b, b5.16b, #8;		\
+	ext		b6.16b, b6.16b, b6.16b, #8;		\
+	ext		b7.16b, b7.16b, b7.16b, #8;		\
+	rev32		b0.16b, b0.16b;				\
+	rev32		b1.16b, b1.16b;				\
+	rev32		b2.16b, b2.16b;				\
+	rev32		b3.16b, b3.16b;				\
+	rev32		b4.16b, b4.16b;				\
+	rev32		b5.16b, b5.16b;				\
+	rev32		b6.16b, b6.16b;				\
+	rev32		b7.16b, b7.16b;
+
+#define SM4_CRYPT_BLK8(b0, b1, b2, b3, b4, b5, b6, b7)		\
+	rev32		b0.16b, b0.16b;				\
+	rev32		b1.16b, b1.16b;				\
+	rev32		b2.16b, b2.16b;				\
+	rev32		b3.16b, b3.16b;				\
+	rev32		b4.16b, b4.16b;				\
+	rev32		b5.16b, b5.16b;				\
+	rev32		b6.16b, b6.16b;				\
+	rev32		b7.16b, b7.16b;				\
+	SM4_CRYPT_BLK8_BE(b0, b1, b2, b3, b4, b5, b6, b7);
diff --git a/arch/arm64/crypto/sm4-ce-core.S b/arch/arm64/crypto/sm4-ce-core.S
index 934e0f093279..41fc745a8528 100644
--- a/arch/arm64/crypto/sm4-ce-core.S
+++ b/arch/arm64/crypto/sm4-ce-core.S
@@ -10,10 +10,12 @@
 
 #include <linux/linkage.h>
 #include <asm/assembler.h>
+#include "sm4-ce-asm.h"
 
 .arch	armv8-a+crypto
 
-.irp b, 0, 1, 2, 3, 4, 5, 6, 7, 16, 20, 24, 25, 26, 27, 28, 29, 30, 31
+.irp b, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, \
+		20, 24, 25, 26, 27, 28, 29, 30, 31
 	.set .Lv\b\().4s, \b
 .endr
 
@@ -34,174 +36,6 @@
 
 #define RIV	v20
 
-/* Helper macros. */
-
-#define PREPARE                                       \
-	ld1		{v24.16b-v27.16b}, [x0], #64; \
-	ld1		{v28.16b-v31.16b}, [x0];
-
-#define SM4_CRYPT_BLK(b0)                           \
-	rev32		b0.16b, b0.16b;             \
-	sm4e		b0.4s, v24.4s;              \
-	sm4e		b0.4s, v25.4s;              \
-	sm4e		b0.4s, v26.4s;              \
-	sm4e		b0.4s, v27.4s;              \
-	sm4e		b0.4s, v28.4s;              \
-	sm4e		b0.4s, v29.4s;              \
-	sm4e		b0.4s, v30.4s;              \
-	sm4e		b0.4s, v31.4s;              \
-	rev64		b0.4s, b0.4s;               \
-	ext		b0.16b, b0.16b, b0.16b, #8; \
-	rev32		b0.16b, b0.16b;
-
-#define SM4_CRYPT_BLK4(b0, b1, b2, b3)              \
-	rev32		b0.16b, b0.16b;             \
-	rev32		b1.16b, b1.16b;             \
-	rev32		b2.16b, b2.16b;             \
-	rev32		b3.16b, b3.16b;             \
-	sm4e		b0.4s, v24.4s;              \
-	sm4e		b1.4s, v24.4s;              \
-	sm4e		b2.4s, v24.4s;              \
-	sm4e		b3.4s, v24.4s;              \
-	sm4e		b0.4s, v25.4s;              \
-	sm4e		b1.4s, v25.4s;              \
-	sm4e		b2.4s, v25.4s;              \
-	sm4e		b3.4s, v25.4s;              \
-	sm4e		b0.4s, v26.4s;              \
-	sm4e		b1.4s, v26.4s;              \
-	sm4e		b2.4s, v26.4s;              \
-	sm4e		b3.4s, v26.4s;              \
-	sm4e		b0.4s, v27.4s;              \
-	sm4e		b1.4s, v27.4s;              \
-	sm4e		b2.4s, v27.4s;              \
-	sm4e		b3.4s, v27.4s;              \
-	sm4e		b0.4s, v28.4s;              \
-	sm4e		b1.4s, v28.4s;              \
-	sm4e		b2.4s, v28.4s;              \
-	sm4e		b3.4s, v28.4s;              \
-	sm4e		b0.4s, v29.4s;              \
-	sm4e		b1.4s, v29.4s;              \
-	sm4e		b2.4s, v29.4s;              \
-	sm4e		b3.4s, v29.4s;              \
-	sm4e		b0.4s, v30.4s;              \
-	sm4e		b1.4s, v30.4s;              \
-	sm4e		b2.4s, v30.4s;              \
-	sm4e		b3.4s, v30.4s;              \
-	sm4e		b0.4s, v31.4s;              \
-	sm4e		b1.4s, v31.4s;              \
-	sm4e		b2.4s, v31.4s;              \
-	sm4e		b3.4s, v31.4s;              \
-	rev64		b0.4s, b0.4s;               \
-	rev64		b1.4s, b1.4s;               \
-	rev64		b2.4s, b2.4s;               \
-	rev64		b3.4s, b3.4s;               \
-	ext		b0.16b, b0.16b, b0.16b, #8; \
-	ext		b1.16b, b1.16b, b1.16b, #8; \
-	ext		b2.16b, b2.16b, b2.16b, #8; \
-	ext		b3.16b, b3.16b, b3.16b, #8; \
-	rev32		b0.16b, b0.16b;             \
-	rev32		b1.16b, b1.16b;             \
-	rev32		b2.16b, b2.16b;             \
-	rev32		b3.16b, b3.16b;
-
-#define SM4_CRYPT_BLK8(b0, b1, b2, b3, b4, b5, b6, b7) \
-	rev32		b0.16b, b0.16b;             \
-	rev32		b1.16b, b1.16b;             \
-	rev32		b2.16b, b2.16b;             \
-	rev32		b3.16b, b3.16b;             \
-	rev32		b4.16b, b4.16b;             \
-	rev32		b5.16b, b5.16b;             \
-	rev32		b6.16b, b6.16b;             \
-	rev32		b7.16b, b7.16b;             \
-	sm4e		b0.4s, v24.4s;              \
-	sm4e		b1.4s, v24.4s;              \
-	sm4e		b2.4s, v24.4s;              \
-	sm4e		b3.4s, v24.4s;              \
-	sm4e		b4.4s, v24.4s;              \
-	sm4e		b5.4s, v24.4s;              \
-	sm4e		b6.4s, v24.4s;              \
-	sm4e		b7.4s, v24.4s;              \
-	sm4e		b0.4s, v25.4s;              \
-	sm4e		b1.4s, v25.4s;              \
-	sm4e		b2.4s, v25.4s;              \
-	sm4e		b3.4s, v25.4s;              \
-	sm4e		b4.4s, v25.4s;              \
-	sm4e		b5.4s, v25.4s;              \
-	sm4e		b6.4s, v25.4s;              \
-	sm4e		b7.4s, v25.4s;              \
-	sm4e		b0.4s, v26.4s;              \
-	sm4e		b1.4s, v26.4s;              \
-	sm4e		b2.4s, v26.4s;              \
-	sm4e		b3.4s, v26.4s;              \
-	sm4e		b4.4s, v26.4s;              \
-	sm4e		b5.4s, v26.4s;              \
-	sm4e		b6.4s, v26.4s;              \
-	sm4e		b7.4s, v26.4s;              \
-	sm4e		b0.4s, v27.4s;              \
-	sm4e		b1.4s, v27.4s;              \
-	sm4e		b2.4s, v27.4s;              \
-	sm4e		b3.4s, v27.4s;              \
-	sm4e		b4.4s, v27.4s;              \
-	sm4e		b5.4s, v27.4s;              \
-	sm4e		b6.4s, v27.4s;              \
-	sm4e		b7.4s, v27.4s;              \
-	sm4e		b0.4s, v28.4s;              \
-	sm4e		b1.4s, v28.4s;              \
-	sm4e		b2.4s, v28.4s;              \
-	sm4e		b3.4s, v28.4s;              \
-	sm4e		b4.4s, v28.4s;              \
-	sm4e		b5.4s, v28.4s;              \
-	sm4e		b6.4s, v28.4s;              \
-	sm4e		b7.4s, v28.4s;              \
-	sm4e		b0.4s, v29.4s;              \
-	sm4e		b1.4s, v29.4s;              \
-	sm4e		b2.4s, v29.4s;              \
-	sm4e		b3.4s, v29.4s;              \
-	sm4e		b4.4s, v29.4s;              \
-	sm4e		b5.4s, v29.4s;              \
-	sm4e		b6.4s, v29.4s;              \
-	sm4e		b7.4s, v29.4s;              \
-	sm4e		b0.4s, v30.4s;              \
-	sm4e		b1.4s, v30.4s;              \
-	sm4e		b2.4s, v30.4s;              \
-	sm4e		b3.4s, v30.4s;              \
-	sm4e		b4.4s, v30.4s;              \
-	sm4e		b5.4s, v30.4s;              \
-	sm4e		b6.4s, v30.4s;              \
-	sm4e		b7.4s, v30.4s;              \
-	sm4e		b0.4s, v31.4s;              \
-	sm4e		b1.4s, v31.4s;              \
-	sm4e		b2.4s, v31.4s;              \
-	sm4e		b3.4s, v31.4s;              \
-	sm4e		b4.4s, v31.4s;              \
-	sm4e		b5.4s, v31.4s;              \
-	sm4e		b6.4s, v31.4s;              \
-	sm4e		b7.4s, v31.4s;              \
-	rev64		b0.4s, b0.4s;               \
-	rev64		b1.4s, b1.4s;               \
-	rev64		b2.4s, b2.4s;               \
-	rev64		b3.4s, b3.4s;               \
-	rev64		b4.4s, b4.4s;               \
-	rev64		b5.4s, b5.4s;               \
-	rev64		b6.4s, b6.4s;               \
-	rev64		b7.4s, b7.4s;               \
-	ext		b0.16b, b0.16b, b0.16b, #8; \
-	ext		b1.16b, b1.16b, b1.16b, #8; \
-	ext		b2.16b, b2.16b, b2.16b, #8; \
-	ext		b3.16b, b3.16b, b3.16b, #8; \
-	ext		b4.16b, b4.16b, b4.16b, #8; \
-	ext		b5.16b, b5.16b, b5.16b, #8; \
-	ext		b6.16b, b6.16b, b6.16b, #8; \
-	ext		b7.16b, b7.16b, b7.16b, #8; \
-	rev32		b0.16b, b0.16b;             \
-	rev32		b1.16b, b1.16b;             \
-	rev32		b2.16b, b2.16b;             \
-	rev32		b3.16b, b3.16b;             \
-	rev32		b4.16b, b4.16b;             \
-	rev32		b5.16b, b5.16b;             \
-	rev32		b6.16b, b6.16b;             \
-	rev32		b7.16b, b7.16b;
-
 
 .align 3
 SYM_FUNC_START(sm4_ce_expand_key)
@@ -268,7 +102,7 @@ SYM_FUNC_START(sm4_ce_crypt_block)
 	 *   x1: dst
 	 *   x2: src
 	 */
-	PREPARE;
+	SM4_PREPARE(x0)
 
 	ld1		{v0.16b}, [x2];
 	SM4_CRYPT_BLK(v0);
@@ -285,7 +119,7 @@ SYM_FUNC_START(sm4_ce_crypt)
 	 *   x2: src
 	 *   w3: nblocks
 	 */
-	PREPARE;
+	SM4_PREPARE(x0)
 
 .Lcrypt_loop_blk:
 	sub		w3, w3, #8;
@@ -337,26 +171,50 @@ SYM_FUNC_START(sm4_ce_cbc_enc)
 	 *   x3: iv (big endian, 128 bit)
 	 *   w4: nblocks
 	 */
-	PREPARE;
+	SM4_PREPARE(x0)
+
+	ld1		{RIV.16b}, [x3]
+
+.Lcbc_enc_loop_4x:
+	cmp		w4, #4
+	blt		.Lcbc_enc_loop_1x
+
+	sub		w4, w4, #4
 
-	ld1		{RIV.16b}, [x3];
+	ld1		{v0.16b-v3.16b}, [x2], #64
 
-.Lcbc_enc_loop:
-	sub		w4, w4, #1;
+	eor		v0.16b, v0.16b, RIV.16b
+	SM4_CRYPT_BLK(v0)
+	eor		v1.16b, v1.16b, v0.16b
+	SM4_CRYPT_BLK(v1)
+	eor		v2.16b, v2.16b, v1.16b
+	SM4_CRYPT_BLK(v2)
+	eor		v3.16b, v3.16b, v2.16b
+	SM4_CRYPT_BLK(v3)
 
-	ld1		{RTMP0.16b}, [x2], #16;
-	eor		RIV.16b, RIV.16b, RTMP0.16b;
+	st1		{v0.16b-v3.16b}, [x1], #64
+	mov		RIV.16b, v3.16b
 
-	SM4_CRYPT_BLK(RIV);
+	cbz		w4, .Lcbc_enc_end
+	b		.Lcbc_enc_loop_4x
 
-	st1		{RIV.16b}, [x1], #16;
+.Lcbc_enc_loop_1x:
+	sub		w4, w4, #1
 
-	cbnz		w4, .Lcbc_enc_loop;
+	ld1		{v0.16b}, [x2], #16
 
+	eor		RIV.16b, RIV.16b, v0.16b
+	SM4_CRYPT_BLK(RIV)
+
+	st1		{RIV.16b}, [x1], #16
+
+	cbnz		w4, .Lcbc_enc_loop_1x
+
+.Lcbc_enc_end:
 	/* store new IV */
-	st1		{RIV.16b}, [x3];
+	st1		{RIV.16b}, [x3]
 
-	ret;
+	ret
 SYM_FUNC_END(sm4_ce_cbc_enc)
 
 .align 3
@@ -368,79 +226,93 @@ SYM_FUNC_START(sm4_ce_cbc_dec)
 	 *   x3: iv (big endian, 128 bit)
 	 *   w4: nblocks
 	 */
-	PREPARE;
+	SM4_PREPARE(x0)
 
-	ld1		{RIV.16b}, [x3];
+	ld1		{RIV.16b}, [x3]
 
-.Lcbc_loop_blk:
-	sub		w4, w4, #8;
-	tbnz		w4, #31, .Lcbc_tail8;
+.Lcbc_dec_loop_8x:
+	sub		w4, w4, #8
+	tbnz		w4, #31, .Lcbc_dec_4x
 
-	ld1		{v0.16b-v3.16b}, [x2], #64;
-	ld1		{v4.16b-v7.16b}, [x2];
+	ld1		{v0.16b-v3.16b}, [x2], #64
+	ld1		{v4.16b-v7.16b}, [x2], #64
 
-	SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7);
+	rev32		v8.16b, v0.16b
+	rev32		v9.16b, v1.16b
+	rev32		v10.16b, v2.16b
+	rev32		v11.16b, v3.16b
+	rev32		v12.16b, v4.16b
+	rev32		v13.16b, v5.16b
+	rev32		v14.16b, v6.16b
+	rev32		v15.16b, v7.16b
 
-	sub		x2, x2, #64;
-	eor		v0.16b, v0.16b, RIV.16b;
-	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64;
-	eor		v1.16b, v1.16b, RTMP0.16b;
-	eor		v2.16b, v2.16b, RTMP1.16b;
-	eor		v3.16b, v3.16b, RTMP2.16b;
-	st1		{v0.16b-v3.16b}, [x1], #64;
+	SM4_CRYPT_BLK8_BE(v8, v9, v10, v11, v12, v13, v14, v15)
 
-	eor		v4.16b, v4.16b, RTMP3.16b;
-	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64;
-	eor		v5.16b, v5.16b, RTMP0.16b;
-	eor		v6.16b, v6.16b, RTMP1.16b;
-	eor		v7.16b, v7.16b, RTMP2.16b;
+	eor		v8.16b, v8.16b, RIV.16b
+	eor		v9.16b, v9.16b, v0.16b
+	eor		v10.16b, v10.16b, v1.16b
+	eor		v11.16b, v11.16b, v2.16b
+	eor		v12.16b, v12.16b, v3.16b
+	eor		v13.16b, v13.16b, v4.16b
+	eor		v14.16b, v14.16b, v5.16b
+	eor		v15.16b, v15.16b, v6.16b
 
-	mov		RIV.16b, RTMP3.16b;
-	st1		{v4.16b-v7.16b}, [x1], #64;
+	st1		{v8.16b-v11.16b}, [x1], #64
+	st1		{v12.16b-v15.16b}, [x1], #64
 
-	cbz		w4, .Lcbc_end;
-	b		.Lcbc_loop_blk;
+	mov		RIV.16b, v7.16b
 
-.Lcbc_tail8:
-	add		w4, w4, #8;
-	cmp		w4, #4;
-	blt		.Lcbc_tail4;
+	cbz		w4, .Lcbc_dec_end
+	b		.Lcbc_dec_loop_8x
 
-	sub		w4, w4, #4;
+.Lcbc_dec_4x:
+	add		w4, w4, #8
+	cmp		w4, #4
+	blt		.Lcbc_dec_loop_1x
 
-	ld1		{v0.16b-v3.16b}, [x2];
+	sub		w4, w4, #4
 
-	SM4_CRYPT_BLK4(v0, v1, v2, v3);
+	ld1		{v0.16b-v3.16b}, [x2], #64
 
-	eor		v0.16b, v0.16b, RIV.16b;
-	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64;
-	eor		v1.16b, v1.16b, RTMP0.16b;
-	eor		v2.16b, v2.16b, RTMP1.16b;
-	eor		v3.16b, v3.16b, RTMP2.16b;
+	rev32		v8.16b, v0.16b
+	rev32		v9.16b, v1.16b
+	rev32		v10.16b, v2.16b
+	rev32		v11.16b, v3.16b
 
-	mov		RIV.16b, RTMP3.16b;
-	st1		{v0.16b-v3.16b}, [x1], #64;
+	SM4_CRYPT_BLK4_BE(v8, v9, v10, v11)
 
-	cbz		w4, .Lcbc_end;
+	eor		v8.16b, v8.16b, RIV.16b
+	eor		v9.16b, v9.16b, v0.16b
+	eor		v10.16b, v10.16b, v1.16b
+	eor		v11.16b, v11.16b, v2.16b
 
-.Lcbc_tail4:
-	sub		w4, w4, #1;
+	st1		{v8.16b-v11.16b}, [x1], #64
 
-	ld1		{v0.16b}, [x2];
+	mov		RIV.16b, v3.16b
 
-	SM4_CRYPT_BLK(v0);
+	cbz		w4, .Lcbc_dec_end
 
-	eor		v0.16b, v0.16b, RIV.16b;
-	ld1		{RIV.16b}, [x2], #16;
-	st1		{v0.16b}, [x1], #16;
+.Lcbc_dec_loop_1x:
+	sub		w4, w4, #1
+
+	ld1		{v0.16b}, [x2], #16
+
+	rev32		v8.16b, v0.16b
+
+	SM4_CRYPT_BLK_BE(v8)
 
-	cbnz		w4, .Lcbc_tail4;
+	eor		v8.16b, v8.16b, RIV.16b
+	st1		{v8.16b}, [x1], #16
 
-.Lcbc_end:
+	mov		RIV.16b, v0.16b
+
+	cbnz		w4, .Lcbc_dec_loop_1x
+
+.Lcbc_dec_end:
 	/* store new IV */
-	st1		{RIV.16b}, [x3];
+	st1		{RIV.16b}, [x3]
 
-	ret;
+	ret
 SYM_FUNC_END(sm4_ce_cbc_dec)
 
 .align 3
@@ -452,25 +324,57 @@ SYM_FUNC_START(sm4_ce_cfb_enc)
 	 *   x3: iv (big endian, 128 bit)
 	 *   w4: nblocks
 	 */
-	PREPARE;
+	SM4_PREPARE(x0)
+
+	ld1		{RIV.16b}, [x3]
+
+.Lcfb_enc_loop_4x:
+	cmp		w4, #4
+	blt		.Lcfb_enc_loop_1x
+
+	sub		w4, w4, #4
+
+	ld1		{v0.16b-v3.16b}, [x2], #64
+
+	rev32		v8.16b, RIV.16b
+	SM4_CRYPT_BLK_BE(v8)
+	eor		v0.16b, v0.16b, v8.16b
+
+	rev32		v8.16b, v0.16b
+	SM4_CRYPT_BLK_BE(v8)
+	eor		v1.16b, v1.16b, v8.16b
+
+	rev32		v8.16b, v1.16b
+	SM4_CRYPT_BLK_BE(v8)
+	eor		v2.16b, v2.16b, v8.16b
+
+	rev32		v8.16b, v2.16b
+	SM4_CRYPT_BLK_BE(v8)
+	eor		v3.16b, v3.16b, v8.16b
 
-	ld1		{RIV.16b}, [x3];
+	st1		{v0.16b-v3.16b}, [x1], #64
+	mov		RIV.16b, v3.16b
 
-.Lcfb_enc_loop:
-	sub		w4, w4, #1;
+	cbz		w4, .Lcfb_enc_end
+	b		.Lcfb_enc_loop_4x
 
-	SM4_CRYPT_BLK(RIV);
+.Lcfb_enc_loop_1x:
+	sub		w4, w4, #1
 
-	ld1		{RTMP0.16b}, [x2], #16;
-	eor		RIV.16b, RIV.16b, RTMP0.16b;
-	st1		{RIV.16b}, [x1], #16;
+	ld1		{v0.16b}, [x2], #16
 
-	cbnz		w4, .Lcfb_enc_loop;
+	SM4_CRYPT_BLK(RIV)
+	eor		RIV.16b, RIV.16b, v0.16b
 
+	st1		{RIV.16b}, [x1], #16
+
+	cbnz		w4, .Lcfb_enc_loop_1x
+
+.Lcfb_enc_end:
 	/* store new IV */
-	st1		{RIV.16b}, [x3];
+	st1		{RIV.16b}, [x3]
 
-	ret;
+	ret
 SYM_FUNC_END(sm4_ce_cfb_enc)
 
 .align 3
@@ -482,79 +386,91 @@ SYM_FUNC_START(sm4_ce_cfb_dec)
 	 *   x3: iv (big endian, 128 bit)
 	 *   w4: nblocks
 	 */
-	PREPARE;
+	SM4_PREPARE(x0)
 
-	ld1		{v0.16b}, [x3];
+	ld1		{RIV.16b}, [x3]
 
-.Lcfb_loop_blk:
-	sub		w4, w4, #8;
-	tbnz		w4, #31, .Lcfb_tail8;
+.Lcfb_dec_loop_8x:
+	sub		w4, w4, #8
+	tbnz		w4, #31, .Lcfb_dec_4x
 
-	ld1		{v1.16b, v2.16b, v3.16b}, [x2], #48;
-	ld1		{v4.16b-v7.16b}, [x2];
+	ld1		{v0.16b-v3.16b}, [x2], #64
+	ld1		{v4.16b-v7.16b}, [x2], #64
 
-	SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7);
+	rev32		v8.16b, RIV.16b
+	rev32		v9.16b, v0.16b
+	rev32		v10.16b, v1.16b
+	rev32		v11.16b, v2.16b
+	rev32		v12.16b, v3.16b
+	rev32		v13.16b, v4.16b
+	rev32		v14.16b, v5.16b
+	rev32		v15.16b, v6.16b
 
-	sub		x2, x2, #48;
-	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64;
-	eor		v0.16b, v0.16b, RTMP0.16b;
-	eor		v1.16b, v1.16b, RTMP1.16b;
-	eor		v2.16b, v2.16b, RTMP2.16b;
-	eor		v3.16b, v3.16b, RTMP3.16b;
-	st1		{v0.16b-v3.16b}, [x1], #64;
+	SM4_CRYPT_BLK8_BE(v8, v9, v10, v11, v12, v13, v14, v15)
 
-	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64;
-	eor		v4.16b, v4.16b, RTMP0.16b;
-	eor		v5.16b, v5.16b, RTMP1.16b;
-	eor		v6.16b, v6.16b, RTMP2.16b;
-	eor		v7.16b, v7.16b, RTMP3.16b;
-	st1		{v4.16b-v7.16b}, [x1], #64;
+	mov		RIV.16b, v7.16b
 
-	mov		v0.16b, RTMP3.16b;
+	eor		v0.16b, v0.16b, v8.16b
+	eor		v1.16b, v1.16b, v9.16b
+	eor		v2.16b, v2.16b, v10.16b
+	eor		v3.16b, v3.16b, v11.16b
+	eor		v4.16b, v4.16b, v12.16b
+	eor		v5.16b, v5.16b, v13.16b
+	eor		v6.16b, v6.16b, v14.16b
+	eor		v7.16b, v7.16b, v15.16b
 
-	cbz		w4, .Lcfb_end;
-	b		.Lcfb_loop_blk;
+	st1		{v0.16b-v3.16b}, [x1], #64
+	st1		{v4.16b-v7.16b}, [x1], #64
 
-.Lcfb_tail8:
-	add		w4, w4, #8;
-	cmp		w4, #4;
-	blt		.Lcfb_tail4;
+	cbz		w4, .Lcfb_dec_end
+	b		.Lcfb_dec_loop_8x
 
-	sub		w4, w4, #4;
+.Lcfb_dec_4x:
+	add		w4, w4, #8
+	cmp		w4, #4
+	blt		.Lcfb_dec_loop_1x
 
-	ld1		{v1.16b, v2.16b, v3.16b}, [x2];
+	sub		w4, w4, #4
 
-	SM4_CRYPT_BLK4(v0, v1, v2, v3);
+	ld1		{v0.16b-v3.16b}, [x2], #64
 
-	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64;
-	eor		v0.16b, v0.16b, RTMP0.16b;
-	eor		v1.16b, v1.16b, RTMP1.16b;
-	eor		v2.16b, v2.16b, RTMP2.16b;
-	eor		v3.16b, v3.16b, RTMP3.16b;
-	st1		{v0.16b-v3.16b}, [x1], #64;
+	rev32		v8.16b, RIV.16b
+	rev32		v9.16b, v0.16b
+	rev32		v10.16b, v1.16b
+	rev32		v11.16b, v2.16b
 
-	mov		v0.16b, RTMP3.16b;
+	SM4_CRYPT_BLK4_BE(v8, v9, v10, v11)
 
-	cbz		w4, .Lcfb_end;
+	mov		RIV.16b, v3.16b
 
-.Lcfb_tail4:
-	sub		w4, w4, #1;
+	eor		v0.16b, v0.16b, v8.16b
+	eor		v1.16b, v1.16b, v9.16b
+	eor		v2.16b, v2.16b, v10.16b
+	eor		v3.16b, v3.16b, v11.16b
 
-	SM4_CRYPT_BLK(v0);
+	st1		{v0.16b-v3.16b}, [x1], #64
 
-	ld1		{RTMP0.16b}, [x2], #16;
-	eor		v0.16b, v0.16b, RTMP0.16b;
-	st1		{v0.16b}, [x1], #16;
+	cbz		w4, .Lcfb_dec_end
+
+.Lcfb_dec_loop_1x:
+	sub		w4, w4, #1
+
+	ld1		{v0.16b}, [x2], #16
 
-	mov		v0.16b, RTMP0.16b;
+	SM4_CRYPT_BLK(RIV)
 
-	cbnz		w4, .Lcfb_tail4;
+	eor		RIV.16b, RIV.16b, v0.16b
+	st1		{RIV.16b}, [x1], #16
 
-.Lcfb_end:
+	mov		RIV.16b, v0.16b
+
+	cbnz		w4, .Lcfb_dec_loop_1x
+
+.Lcfb_dec_end:
 	/* store new IV */
-	st1		{v0.16b}, [x3];
+	st1		{RIV.16b}, [x3]
 
-	ret;
+	ret
 SYM_FUNC_END(sm4_ce_cfb_dec)
 
 .align 3
@@ -566,95 +482,99 @@ SYM_FUNC_START(sm4_ce_ctr_enc)
 	 *   x3: ctr (big endian, 128 bit)
 	 *   w4: nblocks
 	 */
-	PREPARE;
+	SM4_PREPARE(x0)
 
-	ldp		x7, x8, [x3];
-	rev		x7, x7;
-	rev		x8, x8;
+	ldp		x7, x8, [x3]
+	rev		x7, x7
+	rev		x8, x8
 
-.Lctr_loop_blk:
-	sub		w4, w4, #8;
-	tbnz		w4, #31, .Lctr_tail8;
+.Lctr_loop_8x:
+	sub		w4, w4, #8
+	tbnz		w4, #31, .Lctr_4x
 
-#define inc_le128(vctr)                     \
-	mov		vctr.d[1], x8;      \
-	mov		vctr.d[0], x7;      \
-	adds		x8, x8, #1;         \
-	adc		x7, x7, xzr;        \
-	rev64		vctr.16b, vctr.16b;
+#define inc_le128(vctr)					\
+		mov		vctr.d[1], x8;		\
+		mov		vctr.d[0], x7;		\
+		adds		x8, x8, #1;		\
+		rev64		vctr.16b, vctr.16b;	\
+		adc		x7, x7, xzr;
 
 	/* construct CTRs */
-	inc_le128(v0);			/* +0 */
-	inc_le128(v1);			/* +1 */
-	inc_le128(v2);			/* +2 */
-	inc_le128(v3);			/* +3 */
-	inc_le128(v4);			/* +4 */
-	inc_le128(v5);			/* +5 */
-	inc_le128(v6);			/* +6 */
-	inc_le128(v7);			/* +7 */
+	inc_le128(v0)			/* +0 */
+	inc_le128(v1)			/* +1 */
+	inc_le128(v2)			/* +2 */
+	inc_le128(v3)			/* +3 */
+	inc_le128(v4)			/* +4 */
+	inc_le128(v5)			/* +5 */
+	inc_le128(v6)			/* +6 */
+	inc_le128(v7)			/* +7 */
+
+	ld1		{v8.16b-v11.16b}, [x2], #64
+	ld1		{v12.16b-v15.16b}, [x2], #64
+
+	SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7)
+
+	eor		v0.16b, v0.16b, v8.16b
+	eor		v1.16b, v1.16b, v9.16b
+	eor		v2.16b, v2.16b, v10.16b
+	eor		v3.16b, v3.16b, v11.16b
+	eor		v4.16b, v4.16b, v12.16b
+	eor		v5.16b, v5.16b, v13.16b
+	eor		v6.16b, v6.16b, v14.16b
+	eor		v7.16b, v7.16b, v15.16b
+
+	st1		{v0.16b-v3.16b}, [x1], #64
+	st1		{v4.16b-v7.16b}, [x1], #64
+
+	cbz		w4, .Lctr_end
+	b		.Lctr_loop_8x
+
+.Lctr_4x:
+	add		w4, w4, #8
+	cmp		w4, #4
+	blt		.Lctr_loop_1x
+
+	sub		w4, w4, #4
 
-	SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7);
-
-	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64;
-	eor		v0.16b, v0.16b, RTMP0.16b;
-	eor		v1.16b, v1.16b, RTMP1.16b;
-	eor		v2.16b, v2.16b, RTMP2.16b;
-	eor		v3.16b, v3.16b, RTMP3.16b;
-	st1		{v0.16b-v3.16b}, [x1], #64;
+	/* construct CTRs */
+	inc_le128(v0)			/* +0 */
+	inc_le128(v1)			/* +1 */
+	inc_le128(v2)			/* +2 */
+	inc_le128(v3)			/* +3 */
 
-	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64;
-	eor		v4.16b, v4.16b, RTMP0.16b;
-	eor		v5.16b, v5.16b, RTMP1.16b;
-	eor		v6.16b, v6.16b, RTMP2.16b;
-	eor		v7.16b, v7.16b, RTMP3.16b;
-	st1		{v4.16b-v7.16b}, [x1], #64;
+	ld1		{v8.16b-v11.16b}, [x2], #64
 
-	cbz		w4, .Lctr_end;
-	b		.Lctr_loop_blk;
+	SM4_CRYPT_BLK4(v0, v1, v2, v3)
 
-.Lctr_tail8:
-	add		w4, w4, #8;
-	cmp		w4, #4;
-	blt		.Lctr_tail4;
+	eor		v0.16b, v0.16b, v8.16b
+	eor		v1.16b, v1.16b, v9.16b
+	eor		v2.16b, v2.16b, v10.16b
+	eor		v3.16b, v3.16b, v11.16b
 
-	sub		w4, w4, #4;
+	st1		{v0.16b-v3.16b}, [x1], #64
 
-	/* construct CTRs */
-	inc_le128(v0);			/* +0 */
-	inc_le128(v1);			/* +1 */
-	inc_le128(v2);			/* +2 */
-	inc_le128(v3);			/* +3 */
+	cbz		w4, .Lctr_end
 
-	SM4_CRYPT_BLK4(v0, v1, v2, v3);
-
-	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64;
-	eor		v0.16b, v0.16b, RTMP0.16b;
-	eor		v1.16b, v1.16b, RTMP1.16b;
-	eor		v2.16b, v2.16b, RTMP2.16b;
-	eor		v3.16b, v3.16b, RTMP3.16b;
-	st1		{v0.16b-v3.16b}, [x1], #64;
-
-	cbz		w4, .Lctr_end;
-
-.Lctr_tail4:
-	sub		w4, w4, #1;
+.Lctr_loop_1x:
+	sub		w4, w4, #1
 
 	/* construct CTRs */
-	inc_le128(v0);
+	inc_le128(v0)
 
-	SM4_CRYPT_BLK(v0);
+	ld1		{v8.16b}, [x2], #16
 
-	ld1		{RTMP0.16b}, [x2], #16;
-	eor		v0.16b, v0.16b, RTMP0.16b;
-	st1		{v0.16b}, [x1], #16;
+	SM4_CRYPT_BLK(v0)
+
+	eor		v0.16b, v0.16b, v8.16b
+	st1		{v0.16b}, [x1], #16
 
-	cbnz		w4, .Lctr_tail4;
+	cbnz		w4, .Lctr_loop_1x
 
 .Lctr_end:
 	/* store new CTR */
-	rev		x7, x7;
-	rev		x8, x8;
-	stp		x7, x8, [x3];
+	rev		x7, x7
+	rev		x8, x8
+	stp		x7, x8, [x3]
 
-	ret;
+	ret
 SYM_FUNC_END(sm4_ce_ctr_enc)
diff --git a/arch/arm64/crypto/sm4-ce-glue.c b/arch/arm64/crypto/sm4-ce-glue.c
index 496d55c0d01a..e56e81b1f35f 100644
--- a/arch/arm64/crypto/sm4-ce-glue.c
+++ b/arch/arm64/crypto/sm4-ce-glue.c
@@ -26,9 +26,9 @@ asmlinkage void sm4_ce_crypt_block(const u32 *rkey, u8 *dst, const u8 *src);
 asmlinkage void sm4_ce_crypt(const u32 *rkey, u8 *dst, const u8 *src,
 			     unsigned int nblks);
 asmlinkage void sm4_ce_cbc_enc(const u32 *rkey, u8 *dst, const u8 *src,
-			       u8 *iv, unsigned int nblks);
+			       u8 *iv, unsigned int nblocks);
 asmlinkage void sm4_ce_cbc_dec(const u32 *rkey, u8 *dst, const u8 *src,
-			       u8 *iv, unsigned int nblks);
+			       u8 *iv, unsigned int nblocks);
 asmlinkage void sm4_ce_cfb_enc(const u32 *rkey, u8 *dst, const u8 *src,
 			       u8 *iv, unsigned int nblks);
 asmlinkage void sm4_ce_cfb_dec(const u32 *rkey, u8 *dst, const u8 *src,
@@ -94,66 +94,56 @@ static int sm4_ecb_decrypt(struct skcipher_request *req)
 	return sm4_ecb_do_crypt(req, ctx->rkey_dec);
 }
 
-static int sm4_cbc_encrypt(struct skcipher_request *req)
+static int sm4_cbc_crypt(struct skcipher_request *req,
+			 struct sm4_ctx *ctx, bool encrypt)
 {
-	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
-	struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
 	struct skcipher_walk walk;
 	unsigned int nbytes;
 	int err;
 
 	err = skcipher_walk_virt(&walk, req, false);
+	if (err)
+		return err;
 
 	while ((nbytes = walk.nbytes) > 0) {
 		const u8 *src = walk.src.virt.addr;
 		u8 *dst = walk.dst.virt.addr;
-		unsigned int nblks;
+		unsigned int nblocks;
 
-		kernel_neon_begin();
+		nblocks = nbytes / SM4_BLOCK_SIZE;
+		if (nblocks) {
+			kernel_neon_begin();
 
-		nblks = BYTES2BLKS(nbytes);
-		if (nblks) {
-			sm4_ce_cbc_enc(ctx->rkey_enc, dst, src, walk.iv, nblks);
-			nbytes -= nblks * SM4_BLOCK_SIZE;
-		}
+			if (encrypt)
+				sm4_ce_cbc_enc(ctx->rkey_enc, dst, src,
+					       walk.iv, nblocks);
+			else
+				sm4_ce_cbc_dec(ctx->rkey_dec, dst, src,
+					       walk.iv, nblocks);
 
-		kernel_neon_end();
+			kernel_neon_end();
+		}
 
-		err = skcipher_walk_done(&walk, nbytes);
+		err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE);
 	}
 
 	return err;
 }
 
-static int sm4_cbc_decrypt(struct skcipher_request *req)
+static int sm4_cbc_encrypt(struct skcipher_request *req)
 {
 	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
 	struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
-	struct skcipher_walk walk;
-	unsigned int nbytes;
-	int err;
-
-	err = skcipher_walk_virt(&walk, req, false);
 
-	while ((nbytes = walk.nbytes) > 0) {
-		const u8 *src = walk.src.virt.addr;
-		u8 *dst = walk.dst.virt.addr;
-		unsigned int nblks;
-
-		kernel_neon_begin();
-
-		nblks = BYTES2BLKS(nbytes);
-		if (nblks) {
-			sm4_ce_cbc_dec(ctx->rkey_dec, dst, src, walk.iv, nblks);
-			nbytes -= nblks * SM4_BLOCK_SIZE;
-		}
-
-		kernel_neon_end();
+	return sm4_cbc_crypt(req, ctx, true);
+}
 
-		err = skcipher_walk_done(&walk, nbytes);
-	}
+static int sm4_cbc_decrypt(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
 
-	return err;
+	return sm4_cbc_crypt(req, ctx, false);
 }
 
 static int sm4_cfb_encrypt(struct skcipher_request *req)
-- 
2.24.3 (Apple Git-128)
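
The counter handling in sm4_ce_ctr_enc can be modelled in plain C. The
sketch below is an illustration only, not code from the patch: struct
ctr128, ctr128_load(), ctr128_next() and ctr128_carry_check() are
hypothetical names. It assumes that, after the two rev instructions, x7
holds the high and x8 the low 64 bits of the big-endian counter, which is
how the ldp/rev/stp sequence reads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the x7/x8 register pair. */
struct ctr128 {
	uint64_t hi;	/* x7: bytes 0..7 of the counter block */
	uint64_t lo;	/* x8: bytes 8..15 of the counter block */
};

static uint64_t load_be64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

static void store_be64(uint8_t *p, uint64_t v)
{
	for (int i = 7; i >= 0; i--) {
		p[i] = (uint8_t)v;
		v >>= 8;
	}
}

/* Models: ldp x7, x8, [x3]; rev x7, x7; rev x8, x8 */
static struct ctr128 ctr128_load(const uint8_t iv[16])
{
	struct ctr128 c = { load_be64(iv), load_be64(iv + 8) };

	return c;
}

/* Models inc_le128: emit the current counter block, then post-increment. */
static void ctr128_next(struct ctr128 *c, uint8_t block[16])
{
	assert(c != NULL && block != NULL);

	store_be64(block, c->hi);	/* mov vctr.d[0], x7 (+ rev64) */
	store_be64(block + 8, c->lo);	/* mov vctr.d[1], x8 (+ rev64) */

	c->lo += 1;			/* adds x8, x8, #1 */
	if (c->lo == 0)			/* carry out of the low half */
		c->hi += 1;		/* adc x7, x7, xzr */
}

/* Returns 0 if a carry out of the low half reaches the high half. */
static int ctr128_carry_check(void)
{
	uint8_t iv[16] = { 0 }, blk[16];
	struct ctr128 c;

	for (int i = 8; i < 16; i++)
		iv[i] = 0xff;		/* low half all ones */
	c = ctr128_load(iv);

	ctr128_next(&c, blk);		/* emits the IV unchanged ("+0") */
	if (blk[7] != 0x00 || blk[15] != 0xff)
		return -1;

	ctr128_next(&c, blk);		/* emits the wrapped counter */
	return !(blk[7] == 0x01 && blk[15] == 0x00 && c.hi == 1 && c.lo == 1);
}
```

Moving rev64 ahead of adc in the macro should not change this result:
rev64 only touches the vector register, while adc consumes the carry flag
set by adds.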



* [PATCH 06/16] crypto: arm64/sm4 - refactor and simplify CE implementation
@ 2022-09-26  9:36   ` Tianjia Zhang
  0 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26  9:36 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
	Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
	Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
	linux-stm32

This patch adds no new features; it only refactors and simplifies the
Crypto Extension (CE) accelerated implementation of the SM4 algorithm:

Move the SM4 CE helper macros into a separate header, sm4-ce-asm.h, so
that the later CCM/GCM patches in this series can reuse them.

Encryption in CBC and CFB modes now handles four blocks per loop iteration
instead of one, so a single ld1 instruction loads 64 bytes at a time, which
reduces the number of memory accesses. The blocks are still encrypted
serially, since each one depends on the previous ciphertext block.

CBC/CFB/CTR now use the previously idle registers v8-v15, so input blocks
no longer have to be reloaded from memory, and some instructions are
reordered to give out-of-order cores more independent work.

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
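As an illustration of the CBC encryption change (not code from the patch),
here is a minimal userspace C model. It assumes a stand-in block function:
toy_encrypt_block() is hypothetical and is not SM4; any fixed transform is
enough to show that batching the load and store leaves the chaining
unchanged. cbc_enc_1x(), cbc_enc_4x() and cbc_model_check() are likewise
hypothetical names that mirror .Lcbc_enc_loop_1x and .Lcbc_enc_loop_4x.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLK	16	/* SM4_BLOCK_SIZE */

/* Hypothetical stand-in for SM4_CRYPT_BLK; NOT the SM4 cipher. */
static void toy_encrypt_block(uint8_t b[BLK])
{
	for (size_t i = 0; i < BLK; i++)
		b[i] = (uint8_t)(b[i] * 5u + 0x3bu + i);
}

static void xor_block(uint8_t *dst, const uint8_t *a, const uint8_t *b)
{
	for (size_t i = 0; i < BLK; i++)
		dst[i] = a[i] ^ b[i];
}

/* One block per iteration, as in .Lcbc_enc_loop_1x. */
static void cbc_enc_1x(uint8_t *dst, const uint8_t *src, uint8_t iv[BLK],
		       unsigned int nblocks)
{
	while (nblocks--) {
		xor_block(iv, iv, src);
		toy_encrypt_block(iv);
		memcpy(dst, iv, BLK);
		src += BLK;
		dst += BLK;
	}
}

/* Load 64 bytes, chain four blocks in "registers", store 64 bytes. */
static void cbc_enc_4x(uint8_t *dst, const uint8_t *src, uint8_t iv[BLK],
		       unsigned int nblocks)
{
	uint8_t v[4][BLK];

	while (nblocks >= 4) {
		memcpy(v, src, sizeof(v));	/* ld1 {v0-v3}, [x2], #64 */
		xor_block(v[0], v[0], iv);
		toy_encrypt_block(v[0]);
		for (int i = 1; i < 4; i++) {
			/* still serial: block i needs ciphertext i - 1 */
			xor_block(v[i], v[i], v[i - 1]);
			toy_encrypt_block(v[i]);
		}
		memcpy(dst, v, sizeof(v));	/* st1 {v0-v3}, [x1], #64 */
		memcpy(iv, v[3], BLK);		/* mov RIV, v3 */
		src += sizeof(v);
		dst += sizeof(v);
		nblocks -= 4;
	}

	assert(nblocks < 4);			/* 0..3 blocks left */
	cbc_enc_1x(dst, src, iv, nblocks);	/* tail loop */
}

/* Returns 0 when both paths give the same ciphertext and final IV. */
static int cbc_model_check(void)
{
	uint8_t src[7 * BLK], a[7 * BLK], b[7 * BLK];
	uint8_t iv_a[BLK], iv_b[BLK];

	for (size_t i = 0; i < sizeof(src); i++)
		src[i] = (uint8_t)(i * 7u + 1u);
	memset(iv_a, 0xa5, BLK);
	memset(iv_b, 0xa5, BLK);

	cbc_enc_1x(a, src, iv_a, 7);
	cbc_enc_4x(b, src, iv_b, 7);

	return memcmp(a, b, sizeof(a)) || memcmp(iv_a, iv_b, BLK);
}
```

In this model the only difference between the two paths is the number of
loads and stores per block, which matches the saving the commit message
describes.
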
 arch/arm64/crypto/sm4-ce-asm.h  | 209 +++++++++++
 arch/arm64/crypto/sm4-ce-core.S | 646 ++++++++++++++------------------
 arch/arm64/crypto/sm4-ce-glue.c |  64 ++--
 3 files changed, 519 insertions(+), 400 deletions(-)
 create mode 100644 arch/arm64/crypto/sm4-ce-asm.h

diff --git a/arch/arm64/crypto/sm4-ce-asm.h b/arch/arm64/crypto/sm4-ce-asm.h
new file mode 100644
index 000000000000..7ea98e42e779
--- /dev/null
+++ b/arch/arm64/crypto/sm4-ce-asm.h
@@ -0,0 +1,209 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * SM4 helper macros for Crypto Extensions
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#define SM4_PREPARE(ptr)					\
+	ld1		{v24.16b-v27.16b}, [ptr], #64;		\
+	ld1		{v28.16b-v31.16b}, [ptr];
+
+#define SM4_CRYPT_BLK_BE(b0)					\
+	sm4e		b0.4s, v24.4s;				\
+	sm4e		b0.4s, v25.4s;				\
+	sm4e		b0.4s, v26.4s;				\
+	sm4e		b0.4s, v27.4s;				\
+	sm4e		b0.4s, v28.4s;				\
+	sm4e		b0.4s, v29.4s;				\
+	sm4e		b0.4s, v30.4s;				\
+	sm4e		b0.4s, v31.4s;				\
+	rev64		b0.4s, b0.4s;				\
+	ext		b0.16b, b0.16b, b0.16b, #8;		\
+	rev32		b0.16b, b0.16b;
+
+#define SM4_CRYPT_BLK(b0)					\
+	rev32		b0.16b, b0.16b;				\
+	SM4_CRYPT_BLK_BE(b0);
+
+#define SM4_CRYPT_BLK2_BE(b0, b1)				\
+	sm4e		b0.4s, v24.4s;				\
+	sm4e		b1.4s, v24.4s;				\
+	sm4e		b0.4s, v25.4s;				\
+	sm4e		b1.4s, v25.4s;				\
+	sm4e		b0.4s, v26.4s;				\
+	sm4e		b1.4s, v26.4s;				\
+	sm4e		b0.4s, v27.4s;				\
+	sm4e		b1.4s, v27.4s;				\
+	sm4e		b0.4s, v28.4s;				\
+	sm4e		b1.4s, v28.4s;				\
+	sm4e		b0.4s, v29.4s;				\
+	sm4e		b1.4s, v29.4s;				\
+	sm4e		b0.4s, v30.4s;				\
+	sm4e		b1.4s, v30.4s;				\
+	sm4e		b0.4s, v31.4s;				\
+	sm4e		b1.4s, v31.4s;				\
+	rev64		b0.4s, b0.4s;				\
+	rev64		b1.4s, b1.4s;				\
+	ext		b0.16b, b0.16b, b0.16b, #8;		\
+	ext		b1.16b, b1.16b, b1.16b, #8;		\
+	rev32		b0.16b, b0.16b;				\
+	rev32		b1.16b, b1.16b;
+
+#define SM4_CRYPT_BLK2(b0, b1)					\
+	rev32		b0.16b, b0.16b;				\
+	rev32		b1.16b, b1.16b;				\
+	SM4_CRYPT_BLK2_BE(b0, b1);
+
+#define SM4_CRYPT_BLK4_BE(b0, b1, b2, b3)			\
+	sm4e		b0.4s, v24.4s;				\
+	sm4e		b1.4s, v24.4s;				\
+	sm4e		b2.4s, v24.4s;				\
+	sm4e		b3.4s, v24.4s;				\
+	sm4e		b0.4s, v25.4s;				\
+	sm4e		b1.4s, v25.4s;				\
+	sm4e		b2.4s, v25.4s;				\
+	sm4e		b3.4s, v25.4s;				\
+	sm4e		b0.4s, v26.4s;				\
+	sm4e		b1.4s, v26.4s;				\
+	sm4e		b2.4s, v26.4s;				\
+	sm4e		b3.4s, v26.4s;				\
+	sm4e		b0.4s, v27.4s;				\
+	sm4e		b1.4s, v27.4s;				\
+	sm4e		b2.4s, v27.4s;				\
+	sm4e		b3.4s, v27.4s;				\
+	sm4e		b0.4s, v28.4s;				\
+	sm4e		b1.4s, v28.4s;				\
+	sm4e		b2.4s, v28.4s;				\
+	sm4e		b3.4s, v28.4s;				\
+	sm4e		b0.4s, v29.4s;				\
+	sm4e		b1.4s, v29.4s;				\
+	sm4e		b2.4s, v29.4s;				\
+	sm4e		b3.4s, v29.4s;				\
+	sm4e		b0.4s, v30.4s;				\
+	sm4e		b1.4s, v30.4s;				\
+	sm4e		b2.4s, v30.4s;				\
+	sm4e		b3.4s, v30.4s;				\
+	sm4e		b0.4s, v31.4s;				\
+	sm4e		b1.4s, v31.4s;				\
+	sm4e		b2.4s, v31.4s;				\
+	sm4e		b3.4s, v31.4s;				\
+	rev64		b0.4s, b0.4s;				\
+	rev64		b1.4s, b1.4s;				\
+	rev64		b2.4s, b2.4s;				\
+	rev64		b3.4s, b3.4s;				\
+	ext		b0.16b, b0.16b, b0.16b, #8;		\
+	ext		b1.16b, b1.16b, b1.16b, #8;		\
+	ext		b2.16b, b2.16b, b2.16b, #8;		\
+	ext		b3.16b, b3.16b, b3.16b, #8;		\
+	rev32		b0.16b, b0.16b;				\
+	rev32		b1.16b, b1.16b;				\
+	rev32		b2.16b, b2.16b;				\
+	rev32		b3.16b, b3.16b;
+
+#define SM4_CRYPT_BLK4(b0, b1, b2, b3)				\
+	rev32		b0.16b, b0.16b;				\
+	rev32		b1.16b, b1.16b;				\
+	rev32		b2.16b, b2.16b;				\
+	rev32		b3.16b, b3.16b;				\
+	SM4_CRYPT_BLK4_BE(b0, b1, b2, b3);
+
+#define SM4_CRYPT_BLK8_BE(b0, b1, b2, b3, b4, b5, b6, b7)	\
+	sm4e		b0.4s, v24.4s;				\
+	sm4e		b1.4s, v24.4s;				\
+	sm4e		b2.4s, v24.4s;				\
+	sm4e		b3.4s, v24.4s;				\
+	sm4e		b4.4s, v24.4s;				\
+	sm4e		b5.4s, v24.4s;				\
+	sm4e		b6.4s, v24.4s;				\
+	sm4e		b7.4s, v24.4s;				\
+	sm4e		b0.4s, v25.4s;				\
+	sm4e		b1.4s, v25.4s;				\
+	sm4e		b2.4s, v25.4s;				\
+	sm4e		b3.4s, v25.4s;				\
+	sm4e		b4.4s, v25.4s;				\
+	sm4e		b5.4s, v25.4s;				\
+	sm4e		b6.4s, v25.4s;				\
+	sm4e		b7.4s, v25.4s;				\
+	sm4e		b0.4s, v26.4s;				\
+	sm4e		b1.4s, v26.4s;				\
+	sm4e		b2.4s, v26.4s;				\
+	sm4e		b3.4s, v26.4s;				\
+	sm4e		b4.4s, v26.4s;				\
+	sm4e		b5.4s, v26.4s;				\
+	sm4e		b6.4s, v26.4s;				\
+	sm4e		b7.4s, v26.4s;				\
+	sm4e		b0.4s, v27.4s;				\
+	sm4e		b1.4s, v27.4s;				\
+	sm4e		b2.4s, v27.4s;				\
+	sm4e		b3.4s, v27.4s;				\
+	sm4e		b4.4s, v27.4s;				\
+	sm4e		b5.4s, v27.4s;				\
+	sm4e		b6.4s, v27.4s;				\
+	sm4e		b7.4s, v27.4s;				\
+	sm4e		b0.4s, v28.4s;				\
+	sm4e		b1.4s, v28.4s;				\
+	sm4e		b2.4s, v28.4s;				\
+	sm4e		b3.4s, v28.4s;				\
+	sm4e		b4.4s, v28.4s;				\
+	sm4e		b5.4s, v28.4s;				\
+	sm4e		b6.4s, v28.4s;				\
+	sm4e		b7.4s, v28.4s;				\
+	sm4e		b0.4s, v29.4s;				\
+	sm4e		b1.4s, v29.4s;				\
+	sm4e		b2.4s, v29.4s;				\
+	sm4e		b3.4s, v29.4s;				\
+	sm4e		b4.4s, v29.4s;				\
+	sm4e		b5.4s, v29.4s;				\
+	sm4e		b6.4s, v29.4s;				\
+	sm4e		b7.4s, v29.4s;				\
+	sm4e		b0.4s, v30.4s;				\
+	sm4e		b1.4s, v30.4s;				\
+	sm4e		b2.4s, v30.4s;				\
+	sm4e		b3.4s, v30.4s;				\
+	sm4e		b4.4s, v30.4s;				\
+	sm4e		b5.4s, v30.4s;				\
+	sm4e		b6.4s, v30.4s;				\
+	sm4e		b7.4s, v30.4s;				\
+	sm4e		b0.4s, v31.4s;				\
+	sm4e		b1.4s, v31.4s;				\
+	sm4e		b2.4s, v31.4s;				\
+	sm4e		b3.4s, v31.4s;				\
+	sm4e		b4.4s, v31.4s;				\
+	sm4e		b5.4s, v31.4s;				\
+	sm4e		b6.4s, v31.4s;				\
+	sm4e		b7.4s, v31.4s;				\
+	rev64		b0.4s, b0.4s;				\
+	rev64		b1.4s, b1.4s;				\
+	rev64		b2.4s, b2.4s;				\
+	rev64		b3.4s, b3.4s;				\
+	rev64		b4.4s, b4.4s;				\
+	rev64		b5.4s, b5.4s;				\
+	rev64		b6.4s, b6.4s;				\
+	rev64		b7.4s, b7.4s;				\
+	ext		b0.16b, b0.16b, b0.16b, #8;		\
+	ext		b1.16b, b1.16b, b1.16b, #8;		\
+	ext		b2.16b, b2.16b, b2.16b, #8;		\
+	ext		b3.16b, b3.16b, b3.16b, #8;		\
+	ext		b4.16b, b4.16b, b4.16b, #8;		\
+	ext		b5.16b, b5.16b, b5.16b, #8;		\
+	ext		b6.16b, b6.16b, b6.16b, #8;		\
+	ext		b7.16b, b7.16b, b7.16b, #8;		\
+	rev32		b0.16b, b0.16b;				\
+	rev32		b1.16b, b1.16b;				\
+	rev32		b2.16b, b2.16b;				\
+	rev32		b3.16b, b3.16b;				\
+	rev32		b4.16b, b4.16b;				\
+	rev32		b5.16b, b5.16b;				\
+	rev32		b6.16b, b6.16b;				\
+	rev32		b7.16b, b7.16b;
+
+#define SM4_CRYPT_BLK8(b0, b1, b2, b3, b4, b5, b6, b7)		\
+	rev32		b0.16b, b0.16b;				\
+	rev32		b1.16b, b1.16b;				\
+	rev32		b2.16b, b2.16b;				\
+	rev32		b3.16b, b3.16b;				\
+	rev32		b4.16b, b4.16b;				\
+	rev32		b5.16b, b5.16b;				\
+	rev32		b6.16b, b6.16b;				\
+	rev32		b7.16b, b7.16b;				\
+	SM4_CRYPT_BLK8_BE(b0, b1, b2, b3, b4, b5, b6, b7);
diff --git a/arch/arm64/crypto/sm4-ce-core.S b/arch/arm64/crypto/sm4-ce-core.S
index 934e0f093279..41fc745a8528 100644
--- a/arch/arm64/crypto/sm4-ce-core.S
+++ b/arch/arm64/crypto/sm4-ce-core.S
@@ -10,10 +10,12 @@
 
 #include <linux/linkage.h>
 #include <asm/assembler.h>
+#include "sm4-ce-asm.h"
 
 .arch	armv8-a+crypto
 
-.irp b, 0, 1, 2, 3, 4, 5, 6, 7, 16, 20, 24, 25, 26, 27, 28, 29, 30, 31
+.irp b, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, \
+		20, 24, 25, 26, 27, 28, 29, 30, 31
 	.set .Lv\b\().4s, \b
 .endr
 
@@ -34,174 +36,6 @@
 
 #define RIV	v20
 
-/* Helper macros. */
-
-#define PREPARE                                       \
-	ld1		{v24.16b-v27.16b}, [x0], #64; \
-	ld1		{v28.16b-v31.16b}, [x0];
-
-#define SM4_CRYPT_BLK(b0)                           \
-	rev32		b0.16b, b0.16b;             \
-	sm4e		b0.4s, v24.4s;              \
-	sm4e		b0.4s, v25.4s;              \
-	sm4e		b0.4s, v26.4s;              \
-	sm4e		b0.4s, v27.4s;              \
-	sm4e		b0.4s, v28.4s;              \
-	sm4e		b0.4s, v29.4s;              \
-	sm4e		b0.4s, v30.4s;              \
-	sm4e		b0.4s, v31.4s;              \
-	rev64		b0.4s, b0.4s;               \
-	ext		b0.16b, b0.16b, b0.16b, #8; \
-	rev32		b0.16b, b0.16b;
-
-#define SM4_CRYPT_BLK4(b0, b1, b2, b3)              \
-	rev32		b0.16b, b0.16b;             \
-	rev32		b1.16b, b1.16b;             \
-	rev32		b2.16b, b2.16b;             \
-	rev32		b3.16b, b3.16b;             \
-	sm4e		b0.4s, v24.4s;              \
-	sm4e		b1.4s, v24.4s;              \
-	sm4e		b2.4s, v24.4s;              \
-	sm4e		b3.4s, v24.4s;              \
-	sm4e		b0.4s, v25.4s;              \
-	sm4e		b1.4s, v25.4s;              \
-	sm4e		b2.4s, v25.4s;              \
-	sm4e		b3.4s, v25.4s;              \
-	sm4e		b0.4s, v26.4s;              \
-	sm4e		b1.4s, v26.4s;              \
-	sm4e		b2.4s, v26.4s;              \
-	sm4e		b3.4s, v26.4s;              \
-	sm4e		b0.4s, v27.4s;              \
-	sm4e		b1.4s, v27.4s;              \
-	sm4e		b2.4s, v27.4s;              \
-	sm4e		b3.4s, v27.4s;              \
-	sm4e		b0.4s, v28.4s;              \
-	sm4e		b1.4s, v28.4s;              \
-	sm4e		b2.4s, v28.4s;              \
-	sm4e		b3.4s, v28.4s;              \
-	sm4e		b0.4s, v29.4s;              \
-	sm4e		b1.4s, v29.4s;              \
-	sm4e		b2.4s, v29.4s;              \
-	sm4e		b3.4s, v29.4s;              \
-	sm4e		b0.4s, v30.4s;              \
-	sm4e		b1.4s, v30.4s;              \
-	sm4e		b2.4s, v30.4s;              \
-	sm4e		b3.4s, v30.4s;              \
-	sm4e		b0.4s, v31.4s;              \
-	sm4e		b1.4s, v31.4s;              \
-	sm4e		b2.4s, v31.4s;              \
-	sm4e		b3.4s, v31.4s;              \
-	rev64		b0.4s, b0.4s;               \
-	rev64		b1.4s, b1.4s;               \
-	rev64		b2.4s, b2.4s;               \
-	rev64		b3.4s, b3.4s;               \
-	ext		b0.16b, b0.16b, b0.16b, #8; \
-	ext		b1.16b, b1.16b, b1.16b, #8; \
-	ext		b2.16b, b2.16b, b2.16b, #8; \
-	ext		b3.16b, b3.16b, b3.16b, #8; \
-	rev32		b0.16b, b0.16b;             \
-	rev32		b1.16b, b1.16b;             \
-	rev32		b2.16b, b2.16b;             \
-	rev32		b3.16b, b3.16b;
-
-#define SM4_CRYPT_BLK8(b0, b1, b2, b3, b4, b5, b6, b7) \
-	rev32		b0.16b, b0.16b;             \
-	rev32		b1.16b, b1.16b;             \
-	rev32		b2.16b, b2.16b;             \
-	rev32		b3.16b, b3.16b;             \
-	rev32		b4.16b, b4.16b;             \
-	rev32		b5.16b, b5.16b;             \
-	rev32		b6.16b, b6.16b;             \
-	rev32		b7.16b, b7.16b;             \
-	sm4e		b0.4s, v24.4s;              \
-	sm4e		b1.4s, v24.4s;              \
-	sm4e		b2.4s, v24.4s;              \
-	sm4e		b3.4s, v24.4s;              \
-	sm4e		b4.4s, v24.4s;              \
-	sm4e		b5.4s, v24.4s;              \
-	sm4e		b6.4s, v24.4s;              \
-	sm4e		b7.4s, v24.4s;              \
-	sm4e		b0.4s, v25.4s;              \
-	sm4e		b1.4s, v25.4s;              \
-	sm4e		b2.4s, v25.4s;              \
-	sm4e		b3.4s, v25.4s;              \
-	sm4e		b4.4s, v25.4s;              \
-	sm4e		b5.4s, v25.4s;              \
-	sm4e		b6.4s, v25.4s;              \
-	sm4e		b7.4s, v25.4s;              \
-	sm4e		b0.4s, v26.4s;              \
-	sm4e		b1.4s, v26.4s;              \
-	sm4e		b2.4s, v26.4s;              \
-	sm4e		b3.4s, v26.4s;              \
-	sm4e		b4.4s, v26.4s;              \
-	sm4e		b5.4s, v26.4s;              \
-	sm4e		b6.4s, v26.4s;              \
-	sm4e		b7.4s, v26.4s;              \
-	sm4e		b0.4s, v27.4s;              \
-	sm4e		b1.4s, v27.4s;              \
-	sm4e		b2.4s, v27.4s;              \
-	sm4e		b3.4s, v27.4s;              \
-	sm4e		b4.4s, v27.4s;              \
-	sm4e		b5.4s, v27.4s;              \
-	sm4e		b6.4s, v27.4s;              \
-	sm4e		b7.4s, v27.4s;              \
-	sm4e		b0.4s, v28.4s;              \
-	sm4e		b1.4s, v28.4s;              \
-	sm4e		b2.4s, v28.4s;              \
-	sm4e		b3.4s, v28.4s;              \
-	sm4e		b4.4s, v28.4s;              \
-	sm4e		b5.4s, v28.4s;              \
-	sm4e		b6.4s, v28.4s;              \
-	sm4e		b7.4s, v28.4s;              \
-	sm4e		b0.4s, v29.4s;              \
-	sm4e		b1.4s, v29.4s;              \
-	sm4e		b2.4s, v29.4s;              \
-	sm4e		b3.4s, v29.4s;              \
-	sm4e		b4.4s, v29.4s;              \
-	sm4e		b5.4s, v29.4s;              \
-	sm4e		b6.4s, v29.4s;              \
-	sm4e		b7.4s, v29.4s;              \
-	sm4e		b0.4s, v30.4s;              \
-	sm4e		b1.4s, v30.4s;              \
-	sm4e		b2.4s, v30.4s;              \
-	sm4e		b3.4s, v30.4s;              \
-	sm4e		b4.4s, v30.4s;              \
-	sm4e		b5.4s, v30.4s;              \
-	sm4e		b6.4s, v30.4s;              \
-	sm4e		b7.4s, v30.4s;              \
-	sm4e		b0.4s, v31.4s;              \
-	sm4e		b1.4s, v31.4s;              \
-	sm4e		b2.4s, v31.4s;              \
-	sm4e		b3.4s, v31.4s;              \
-	sm4e		b4.4s, v31.4s;              \
-	sm4e		b5.4s, v31.4s;              \
-	sm4e		b6.4s, v31.4s;              \
-	sm4e		b7.4s, v31.4s;              \
-	rev64		b0.4s, b0.4s;               \
-	rev64		b1.4s, b1.4s;               \
-	rev64		b2.4s, b2.4s;               \
-	rev64		b3.4s, b3.4s;               \
-	rev64		b4.4s, b4.4s;               \
-	rev64		b5.4s, b5.4s;               \
-	rev64		b6.4s, b6.4s;               \
-	rev64		b7.4s, b7.4s;               \
-	ext		b0.16b, b0.16b, b0.16b, #8; \
-	ext		b1.16b, b1.16b, b1.16b, #8; \
-	ext		b2.16b, b2.16b, b2.16b, #8; \
-	ext		b3.16b, b3.16b, b3.16b, #8; \
-	ext		b4.16b, b4.16b, b4.16b, #8; \
-	ext		b5.16b, b5.16b, b5.16b, #8; \
-	ext		b6.16b, b6.16b, b6.16b, #8; \
-	ext		b7.16b, b7.16b, b7.16b, #8; \
-	rev32		b0.16b, b0.16b;             \
-	rev32		b1.16b, b1.16b;             \
-	rev32		b2.16b, b2.16b;             \
-	rev32		b3.16b, b3.16b;             \
-	rev32		b4.16b, b4.16b;             \
-	rev32		b5.16b, b5.16b;             \
-	rev32		b6.16b, b6.16b;             \
-	rev32		b7.16b, b7.16b;
-
 
 .align 3
 SYM_FUNC_START(sm4_ce_expand_key)
@@ -268,7 +102,7 @@ SYM_FUNC_START(sm4_ce_crypt_block)
 	 *   x1: dst
 	 *   x2: src
 	 */
-	PREPARE;
+	SM4_PREPARE(x0)
 
 	ld1		{v0.16b}, [x2];
 	SM4_CRYPT_BLK(v0);
@@ -285,7 +119,7 @@ SYM_FUNC_START(sm4_ce_crypt)
 	 *   x2: src
 	 *   w3: nblocks
 	 */
-	PREPARE;
+	SM4_PREPARE(x0)
 
 .Lcrypt_loop_blk:
 	sub		w3, w3, #8;
@@ -337,26 +171,50 @@ SYM_FUNC_START(sm4_ce_cbc_enc)
 	 *   x3: iv (big endian, 128 bit)
 	 *   w4: nblocks
 	 */
-	PREPARE;
+	SM4_PREPARE(x0)
+
+	ld1		{RIV.16b}, [x3]
+
+.Lcbc_enc_loop_4x:
+	cmp		w4, #4
+	blt		.Lcbc_enc_loop_1x
+
+	sub		w4, w4, #4
 
-	ld1		{RIV.16b}, [x3];
+	ld1		{v0.16b-v3.16b}, [x2], #64
 
-.Lcbc_enc_loop:
-	sub		w4, w4, #1;
+	eor		v0.16b, v0.16b, RIV.16b
+	SM4_CRYPT_BLK(v0)
+	eor		v1.16b, v1.16b, v0.16b
+	SM4_CRYPT_BLK(v1)
+	eor		v2.16b, v2.16b, v1.16b
+	SM4_CRYPT_BLK(v2)
+	eor		v3.16b, v3.16b, v2.16b
+	SM4_CRYPT_BLK(v3)
 
-	ld1		{RTMP0.16b}, [x2], #16;
-	eor		RIV.16b, RIV.16b, RTMP0.16b;
+	st1		{v0.16b-v3.16b}, [x1], #64
+	mov		RIV.16b, v3.16b
 
-	SM4_CRYPT_BLK(RIV);
+	cbz		w4, .Lcbc_enc_end
+	b		.Lcbc_enc_loop_4x
 
-	st1		{RIV.16b}, [x1], #16;
+.Lcbc_enc_loop_1x:
+	sub		w4, w4, #1
 
-	cbnz		w4, .Lcbc_enc_loop;
+	ld1		{v0.16b}, [x2], #16
 
+	eor		RIV.16b, RIV.16b, v0.16b
+	SM4_CRYPT_BLK(RIV)
+
+	st1		{RIV.16b}, [x1], #16
+
+	cbnz		w4, .Lcbc_enc_loop_1x
+
+.Lcbc_enc_end:
 	/* store new IV */
-	st1		{RIV.16b}, [x3];
+	st1		{RIV.16b}, [x3]
 
-	ret;
+	ret
 SYM_FUNC_END(sm4_ce_cbc_enc)
 
 .align 3
@@ -368,79 +226,93 @@ SYM_FUNC_START(sm4_ce_cbc_dec)
 	 *   x3: iv (big endian, 128 bit)
 	 *   w4: nblocks
 	 */
-	PREPARE;
+	SM4_PREPARE(x0)
 
-	ld1		{RIV.16b}, [x3];
+	ld1		{RIV.16b}, [x3]
 
-.Lcbc_loop_blk:
-	sub		w4, w4, #8;
-	tbnz		w4, #31, .Lcbc_tail8;
+.Lcbc_dec_loop_8x:
+	sub		w4, w4, #8
+	tbnz		w4, #31, .Lcbc_dec_4x
 
-	ld1		{v0.16b-v3.16b}, [x2], #64;
-	ld1		{v4.16b-v7.16b}, [x2];
+	ld1		{v0.16b-v3.16b}, [x2], #64
+	ld1		{v4.16b-v7.16b}, [x2], #64
 
-	SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7);
+	rev32		v8.16b, v0.16b
+	rev32		v9.16b, v1.16b
+	rev32		v10.16b, v2.16b
+	rev32		v11.16b, v3.16b
+	rev32		v12.16b, v4.16b
+	rev32		v13.16b, v5.16b
+	rev32		v14.16b, v6.16b
+	rev32		v15.16b, v7.16b
 
-	sub		x2, x2, #64;
-	eor		v0.16b, v0.16b, RIV.16b;
-	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64;
-	eor		v1.16b, v1.16b, RTMP0.16b;
-	eor		v2.16b, v2.16b, RTMP1.16b;
-	eor		v3.16b, v3.16b, RTMP2.16b;
-	st1		{v0.16b-v3.16b}, [x1], #64;
+	SM4_CRYPT_BLK8_BE(v8, v9, v10, v11, v12, v13, v14, v15)
 
-	eor		v4.16b, v4.16b, RTMP3.16b;
-	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64;
-	eor		v5.16b, v5.16b, RTMP0.16b;
-	eor		v6.16b, v6.16b, RTMP1.16b;
-	eor		v7.16b, v7.16b, RTMP2.16b;
+	eor		v8.16b, v8.16b, RIV.16b
+	eor		v9.16b, v9.16b, v0.16b
+	eor		v10.16b, v10.16b, v1.16b
+	eor		v11.16b, v11.16b, v2.16b
+	eor		v12.16b, v12.16b, v3.16b
+	eor		v13.16b, v13.16b, v4.16b
+	eor		v14.16b, v14.16b, v5.16b
+	eor		v15.16b, v15.16b, v6.16b
 
-	mov		RIV.16b, RTMP3.16b;
-	st1		{v4.16b-v7.16b}, [x1], #64;
+	st1		{v8.16b-v11.16b}, [x1], #64
+	st1		{v12.16b-v15.16b}, [x1], #64
 
-	cbz		w4, .Lcbc_end;
-	b		.Lcbc_loop_blk;
+	mov		RIV.16b, v7.16b
 
-.Lcbc_tail8:
-	add		w4, w4, #8;
-	cmp		w4, #4;
-	blt		.Lcbc_tail4;
+	cbz		w4, .Lcbc_dec_end
+	b		.Lcbc_dec_loop_8x
 
-	sub		w4, w4, #4;
+.Lcbc_dec_4x:
+	add		w4, w4, #8
+	cmp		w4, #4
+	blt		.Lcbc_dec_loop_1x
 
-	ld1		{v0.16b-v3.16b}, [x2];
+	sub		w4, w4, #4
 
-	SM4_CRYPT_BLK4(v0, v1, v2, v3);
+	ld1		{v0.16b-v3.16b}, [x2], #64
 
-	eor		v0.16b, v0.16b, RIV.16b;
-	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64;
-	eor		v1.16b, v1.16b, RTMP0.16b;
-	eor		v2.16b, v2.16b, RTMP1.16b;
-	eor		v3.16b, v3.16b, RTMP2.16b;
+	rev32		v8.16b, v0.16b
+	rev32		v9.16b, v1.16b
+	rev32		v10.16b, v2.16b
+	rev32		v11.16b, v3.16b
 
-	mov		RIV.16b, RTMP3.16b;
-	st1		{v0.16b-v3.16b}, [x1], #64;
+	SM4_CRYPT_BLK4_BE(v8, v9, v10, v11)
 
-	cbz		w4, .Lcbc_end;
+	eor		v8.16b, v8.16b, RIV.16b
+	eor		v9.16b, v9.16b, v0.16b
+	eor		v10.16b, v10.16b, v1.16b
+	eor		v11.16b, v11.16b, v2.16b
 
-.Lcbc_tail4:
-	sub		w4, w4, #1;
+	st1		{v8.16b-v11.16b}, [x1], #64
 
-	ld1		{v0.16b}, [x2];
+	mov		RIV.16b, v3.16b
 
-	SM4_CRYPT_BLK(v0);
+	cbz		w4, .Lcbc_dec_end
 
-	eor		v0.16b, v0.16b, RIV.16b;
-	ld1		{RIV.16b}, [x2], #16;
-	st1		{v0.16b}, [x1], #16;
+.Lcbc_dec_loop_1x:
+	sub		w4, w4, #1
+
+	ld1		{v0.16b}, [x2], #16
+
+	rev32		v8.16b, v0.16b
+
+	SM4_CRYPT_BLK_BE(v8)
 
-	cbnz		w4, .Lcbc_tail4;
+	eor		v8.16b, v8.16b, RIV.16b
+	st1		{v8.16b}, [x1], #16
 
-.Lcbc_end:
+	mov		RIV.16b, v0.16b
+
+	cbnz		w4, .Lcbc_dec_loop_1x
+
+.Lcbc_dec_end:
 	/* store new IV */
-	st1		{RIV.16b}, [x3];
+	st1		{RIV.16b}, [x3]
 
-	ret;
+	ret
 SYM_FUNC_END(sm4_ce_cbc_dec)
 
 .align 3
@@ -452,25 +324,57 @@ SYM_FUNC_START(sm4_ce_cfb_enc)
 	 *   x3: iv (big endian, 128 bit)
 	 *   w4: nblocks
 	 */
-	PREPARE;
+	SM4_PREPARE(x0)
+
+	ld1		{RIV.16b}, [x3]
+
+.Lcfb_enc_loop_4x:
+	cmp		w4, #4
+	blt		.Lcfb_enc_loop_1x
+
+	sub		w4, w4, #4
+
+	ld1		{v0.16b-v3.16b}, [x2], #64
+
+	rev32		v8.16b, RIV.16b
+	SM4_CRYPT_BLK_BE(v8)
+	eor		v0.16b, v0.16b, v8.16b
+
+	rev32		v8.16b, v0.16b
+	SM4_CRYPT_BLK_BE(v8)
+	eor		v1.16b, v1.16b, v8.16b
+
+	rev32		v8.16b, v1.16b
+	SM4_CRYPT_BLK_BE(v8)
+	eor		v2.16b, v2.16b, v8.16b
+
+	rev32		v8.16b, v2.16b
+	SM4_CRYPT_BLK_BE(v8)
+	eor		v3.16b, v3.16b, v8.16b
 
-	ld1		{RIV.16b}, [x3];
+	st1		{v0.16b-v3.16b}, [x1], #64
+	mov		RIV.16b, v3.16b
 
-.Lcfb_enc_loop:
-	sub		w4, w4, #1;
+	cbz		w4, .Lcfb_enc_end
+	b		.Lcfb_enc_loop_4x
 
-	SM4_CRYPT_BLK(RIV);
+.Lcfb_enc_loop_1x:
+	sub		w4, w4, #1
 
-	ld1		{RTMP0.16b}, [x2], #16;
-	eor		RIV.16b, RIV.16b, RTMP0.16b;
-	st1		{RIV.16b}, [x1], #16;
+	ld1		{v0.16b}, [x2], #16
 
-	cbnz		w4, .Lcfb_enc_loop;
+	SM4_CRYPT_BLK(RIV)
+	eor		RIV.16b, RIV.16b, v0.16b
 
+	st1		{RIV.16b}, [x1], #16
+
+	cbnz		w4, .Lcfb_enc_loop_1x
+
+.Lcfb_enc_end:
 	/* store new IV */
-	st1		{RIV.16b}, [x3];
+	st1		{RIV.16b}, [x3]
 
-	ret;
+	ret
 SYM_FUNC_END(sm4_ce_cfb_enc)
 
 .align 3
@@ -482,79 +386,91 @@ SYM_FUNC_START(sm4_ce_cfb_dec)
 	 *   x3: iv (big endian, 128 bit)
 	 *   w4: nblocks
 	 */
-	PREPARE;
+	SM4_PREPARE(x0)
 
-	ld1		{v0.16b}, [x3];
+	ld1		{RIV.16b}, [x3]
 
-.Lcfb_loop_blk:
-	sub		w4, w4, #8;
-	tbnz		w4, #31, .Lcfb_tail8;
+.Lcfb_dec_loop_8x:
+	sub		w4, w4, #8
+	tbnz		w4, #31, .Lcfb_dec_4x
 
-	ld1		{v1.16b, v2.16b, v3.16b}, [x2], #48;
-	ld1		{v4.16b-v7.16b}, [x2];
+	ld1		{v0.16b-v3.16b}, [x2], #64
+	ld1		{v4.16b-v7.16b}, [x2], #64
 
-	SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7);
+	rev32		v8.16b, RIV.16b
+	rev32		v9.16b, v0.16b
+	rev32		v10.16b, v1.16b
+	rev32		v11.16b, v2.16b
+	rev32		v12.16b, v3.16b
+	rev32		v13.16b, v4.16b
+	rev32		v14.16b, v5.16b
+	rev32		v15.16b, v6.16b
 
-	sub		x2, x2, #48;
-	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64;
-	eor		v0.16b, v0.16b, RTMP0.16b;
-	eor		v1.16b, v1.16b, RTMP1.16b;
-	eor		v2.16b, v2.16b, RTMP2.16b;
-	eor		v3.16b, v3.16b, RTMP3.16b;
-	st1		{v0.16b-v3.16b}, [x1], #64;
+	SM4_CRYPT_BLK8_BE(v8, v9, v10, v11, v12, v13, v14, v15)
 
-	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64;
-	eor		v4.16b, v4.16b, RTMP0.16b;
-	eor		v5.16b, v5.16b, RTMP1.16b;
-	eor		v6.16b, v6.16b, RTMP2.16b;
-	eor		v7.16b, v7.16b, RTMP3.16b;
-	st1		{v4.16b-v7.16b}, [x1], #64;
+	mov		RIV.16b, v7.16b
 
-	mov		v0.16b, RTMP3.16b;
+	eor		v0.16b, v0.16b, v8.16b
+	eor		v1.16b, v1.16b, v9.16b
+	eor		v2.16b, v2.16b, v10.16b
+	eor		v3.16b, v3.16b, v11.16b
+	eor		v4.16b, v4.16b, v12.16b
+	eor		v5.16b, v5.16b, v13.16b
+	eor		v6.16b, v6.16b, v14.16b
+	eor		v7.16b, v7.16b, v15.16b
 
-	cbz		w4, .Lcfb_end;
-	b		.Lcfb_loop_blk;
+	st1		{v0.16b-v3.16b}, [x1], #64
+	st1		{v4.16b-v7.16b}, [x1], #64
 
-.Lcfb_tail8:
-	add		w4, w4, #8;
-	cmp		w4, #4;
-	blt		.Lcfb_tail4;
+	cbz		w4, .Lcfb_dec_end
+	b		.Lcfb_dec_loop_8x
 
-	sub		w4, w4, #4;
+.Lcfb_dec_4x:
+	add		w4, w4, #8
+	cmp		w4, #4
+	blt		.Lcfb_dec_loop_1x
 
-	ld1		{v1.16b, v2.16b, v3.16b}, [x2];
+	sub		w4, w4, #4
 
-	SM4_CRYPT_BLK4(v0, v1, v2, v3);
+	ld1		{v0.16b-v3.16b}, [x2], #64
 
-	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64;
-	eor		v0.16b, v0.16b, RTMP0.16b;
-	eor		v1.16b, v1.16b, RTMP1.16b;
-	eor		v2.16b, v2.16b, RTMP2.16b;
-	eor		v3.16b, v3.16b, RTMP3.16b;
-	st1		{v0.16b-v3.16b}, [x1], #64;
+	rev32		v8.16b, RIV.16b
+	rev32		v9.16b, v0.16b
+	rev32		v10.16b, v1.16b
+	rev32		v11.16b, v2.16b
 
-	mov		v0.16b, RTMP3.16b;
+	SM4_CRYPT_BLK4_BE(v8, v9, v10, v11)
 
-	cbz		w4, .Lcfb_end;
+	mov		RIV.16b, v3.16b
 
-.Lcfb_tail4:
-	sub		w4, w4, #1;
+	eor		v0.16b, v0.16b, v8.16b
+	eor		v1.16b, v1.16b, v9.16b
+	eor		v2.16b, v2.16b, v10.16b
+	eor		v3.16b, v3.16b, v11.16b
 
-	SM4_CRYPT_BLK(v0);
+	st1		{v0.16b-v3.16b}, [x1], #64
 
-	ld1		{RTMP0.16b}, [x2], #16;
-	eor		v0.16b, v0.16b, RTMP0.16b;
-	st1		{v0.16b}, [x1], #16;
+	cbz		w4, .Lcfb_dec_end
+
+.Lcfb_dec_loop_1x:
+	sub		w4, w4, #1
+
+	ld1		{v0.16b}, [x2], #16
 
-	mov		v0.16b, RTMP0.16b;
+	SM4_CRYPT_BLK(RIV)
 
-	cbnz		w4, .Lcfb_tail4;
+	eor		RIV.16b, RIV.16b, v0.16b
+	st1		{RIV.16b}, [x1], #16
 
-.Lcfb_end:
+	mov		RIV.16b, v0.16b
+
+	cbnz		w4, .Lcfb_dec_loop_1x
+
+.Lcfb_dec_end:
 	/* store new IV */
-	st1		{v0.16b}, [x3];
+	st1		{RIV.16b}, [x3]
 
-	ret;
+	ret
 SYM_FUNC_END(sm4_ce_cfb_dec)
 
 .align 3
@@ -566,95 +482,99 @@ SYM_FUNC_START(sm4_ce_ctr_enc)
 	 *   x3: ctr (big endian, 128 bit)
 	 *   w4: nblocks
 	 */
-	PREPARE;
+	SM4_PREPARE(x0)
 
-	ldp		x7, x8, [x3];
-	rev		x7, x7;
-	rev		x8, x8;
+	ldp		x7, x8, [x3]
+	rev		x7, x7
+	rev		x8, x8
 
-.Lctr_loop_blk:
-	sub		w4, w4, #8;
-	tbnz		w4, #31, .Lctr_tail8;
+.Lctr_loop_8x:
+	sub		w4, w4, #8
+	tbnz		w4, #31, .Lctr_4x
 
-#define inc_le128(vctr)                     \
-	mov		vctr.d[1], x8;      \
-	mov		vctr.d[0], x7;      \
-	adds		x8, x8, #1;         \
-	adc		x7, x7, xzr;        \
-	rev64		vctr.16b, vctr.16b;
+#define inc_le128(vctr)					\
+		mov		vctr.d[1], x8;		\
+		mov		vctr.d[0], x7;		\
+		adds		x8, x8, #1;		\
+		rev64		vctr.16b, vctr.16b;	\
+		adc		x7, x7, xzr;
 
 	/* construct CTRs */
-	inc_le128(v0);			/* +0 */
-	inc_le128(v1);			/* +1 */
-	inc_le128(v2);			/* +2 */
-	inc_le128(v3);			/* +3 */
-	inc_le128(v4);			/* +4 */
-	inc_le128(v5);			/* +5 */
-	inc_le128(v6);			/* +6 */
-	inc_le128(v7);			/* +7 */
+	inc_le128(v0)			/* +0 */
+	inc_le128(v1)			/* +1 */
+	inc_le128(v2)			/* +2 */
+	inc_le128(v3)			/* +3 */
+	inc_le128(v4)			/* +4 */
+	inc_le128(v5)			/* +5 */
+	inc_le128(v6)			/* +6 */
+	inc_le128(v7)			/* +7 */
+
+	ld1		{v8.16b-v11.16b}, [x2], #64
+	ld1		{v12.16b-v15.16b}, [x2], #64
+
+	SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7)
+
+	eor		v0.16b, v0.16b, v8.16b
+	eor		v1.16b, v1.16b, v9.16b
+	eor		v2.16b, v2.16b, v10.16b
+	eor		v3.16b, v3.16b, v11.16b
+	eor		v4.16b, v4.16b, v12.16b
+	eor		v5.16b, v5.16b, v13.16b
+	eor		v6.16b, v6.16b, v14.16b
+	eor		v7.16b, v7.16b, v15.16b
+
+	st1		{v0.16b-v3.16b}, [x1], #64
+	st1		{v4.16b-v7.16b}, [x1], #64
+
+	cbz		w4, .Lctr_end
+	b		.Lctr_loop_8x
+
+.Lctr_4x:
+	add		w4, w4, #8
+	cmp		w4, #4
+	blt		.Lctr_loop_1x
+
+	sub		w4, w4, #4
 
-	SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7);
-
-	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64;
-	eor		v0.16b, v0.16b, RTMP0.16b;
-	eor		v1.16b, v1.16b, RTMP1.16b;
-	eor		v2.16b, v2.16b, RTMP2.16b;
-	eor		v3.16b, v3.16b, RTMP3.16b;
-	st1		{v0.16b-v3.16b}, [x1], #64;
+	/* construct CTRs */
+	inc_le128(v0)			/* +0 */
+	inc_le128(v1)			/* +1 */
+	inc_le128(v2)			/* +2 */
+	inc_le128(v3)			/* +3 */
 
-	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64;
-	eor		v4.16b, v4.16b, RTMP0.16b;
-	eor		v5.16b, v5.16b, RTMP1.16b;
-	eor		v6.16b, v6.16b, RTMP2.16b;
-	eor		v7.16b, v7.16b, RTMP3.16b;
-	st1		{v4.16b-v7.16b}, [x1], #64;
+	ld1		{v8.16b-v11.16b}, [x2], #64
 
-	cbz		w4, .Lctr_end;
-	b		.Lctr_loop_blk;
+	SM4_CRYPT_BLK4(v0, v1, v2, v3)
 
-.Lctr_tail8:
-	add		w4, w4, #8;
-	cmp		w4, #4;
-	blt		.Lctr_tail4;
+	eor		v0.16b, v0.16b, v8.16b
+	eor		v1.16b, v1.16b, v9.16b
+	eor		v2.16b, v2.16b, v10.16b
+	eor		v3.16b, v3.16b, v11.16b
 
-	sub		w4, w4, #4;
+	st1		{v0.16b-v3.16b}, [x1], #64
 
-	/* construct CTRs */
-	inc_le128(v0);			/* +0 */
-	inc_le128(v1);			/* +1 */
-	inc_le128(v2);			/* +2 */
-	inc_le128(v3);			/* +3 */
+	cbz		w4, .Lctr_end
 
-	SM4_CRYPT_BLK4(v0, v1, v2, v3);
-
-	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64;
-	eor		v0.16b, v0.16b, RTMP0.16b;
-	eor		v1.16b, v1.16b, RTMP1.16b;
-	eor		v2.16b, v2.16b, RTMP2.16b;
-	eor		v3.16b, v3.16b, RTMP3.16b;
-	st1		{v0.16b-v3.16b}, [x1], #64;
-
-	cbz		w4, .Lctr_end;
-
-.Lctr_tail4:
-	sub		w4, w4, #1;
+.Lctr_loop_1x:
+	sub		w4, w4, #1
 
 	/* construct CTRs */
-	inc_le128(v0);
+	inc_le128(v0)
 
-	SM4_CRYPT_BLK(v0);
+	ld1		{v8.16b}, [x2], #16
 
-	ld1		{RTMP0.16b}, [x2], #16;
-	eor		v0.16b, v0.16b, RTMP0.16b;
-	st1		{v0.16b}, [x1], #16;
+	SM4_CRYPT_BLK(v0)
+
+	eor		v0.16b, v0.16b, v8.16b
+	st1		{v0.16b}, [x1], #16
 
-	cbnz		w4, .Lctr_tail4;
+	cbnz		w4, .Lctr_loop_1x
 
 .Lctr_end:
 	/* store new CTR */
-	rev		x7, x7;
-	rev		x8, x8;
-	stp		x7, x8, [x3];
+	rev		x7, x7
+	rev		x8, x8
+	stp		x7, x8, [x3]
 
-	ret;
+	ret
 SYM_FUNC_END(sm4_ce_ctr_enc)
diff --git a/arch/arm64/crypto/sm4-ce-glue.c b/arch/arm64/crypto/sm4-ce-glue.c
index 496d55c0d01a..e56e81b1f35f 100644
--- a/arch/arm64/crypto/sm4-ce-glue.c
+++ b/arch/arm64/crypto/sm4-ce-glue.c
@@ -26,9 +26,9 @@ asmlinkage void sm4_ce_crypt_block(const u32 *rkey, u8 *dst, const u8 *src);
 asmlinkage void sm4_ce_crypt(const u32 *rkey, u8 *dst, const u8 *src,
 			     unsigned int nblks);
 asmlinkage void sm4_ce_cbc_enc(const u32 *rkey, u8 *dst, const u8 *src,
-			       u8 *iv, unsigned int nblks);
+			       u8 *iv, unsigned int nblocks);
 asmlinkage void sm4_ce_cbc_dec(const u32 *rkey, u8 *dst, const u8 *src,
-			       u8 *iv, unsigned int nblks);
+			       u8 *iv, unsigned int nblocks);
 asmlinkage void sm4_ce_cfb_enc(const u32 *rkey, u8 *dst, const u8 *src,
 			       u8 *iv, unsigned int nblks);
 asmlinkage void sm4_ce_cfb_dec(const u32 *rkey, u8 *dst, const u8 *src,
@@ -94,66 +94,56 @@ static int sm4_ecb_decrypt(struct skcipher_request *req)
 	return sm4_ecb_do_crypt(req, ctx->rkey_dec);
 }
 
-static int sm4_cbc_encrypt(struct skcipher_request *req)
+static int sm4_cbc_crypt(struct skcipher_request *req,
+			 struct sm4_ctx *ctx, bool encrypt)
 {
-	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
-	struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
 	struct skcipher_walk walk;
 	unsigned int nbytes;
 	int err;
 
 	err = skcipher_walk_virt(&walk, req, false);
+	if (err)
+		return err;
 
 	while ((nbytes = walk.nbytes) > 0) {
 		const u8 *src = walk.src.virt.addr;
 		u8 *dst = walk.dst.virt.addr;
-		unsigned int nblks;
+		unsigned int nblocks;
 
-		kernel_neon_begin();
+		nblocks = nbytes / SM4_BLOCK_SIZE;
+		if (nblocks) {
+			kernel_neon_begin();
 
-		nblks = BYTES2BLKS(nbytes);
-		if (nblks) {
-			sm4_ce_cbc_enc(ctx->rkey_enc, dst, src, walk.iv, nblks);
-			nbytes -= nblks * SM4_BLOCK_SIZE;
-		}
+			if (encrypt)
+				sm4_ce_cbc_enc(ctx->rkey_enc, dst, src,
+					       walk.iv, nblocks);
+			else
+				sm4_ce_cbc_dec(ctx->rkey_dec, dst, src,
+					       walk.iv, nblocks);
 
-		kernel_neon_end();
+			kernel_neon_end();
+		}
 
-		err = skcipher_walk_done(&walk, nbytes);
+		err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE);
 	}
 
 	return err;
 }
 
-static int sm4_cbc_decrypt(struct skcipher_request *req)
+static int sm4_cbc_encrypt(struct skcipher_request *req)
 {
 	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
 	struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
-	struct skcipher_walk walk;
-	unsigned int nbytes;
-	int err;
-
-	err = skcipher_walk_virt(&walk, req, false);
 
-	while ((nbytes = walk.nbytes) > 0) {
-		const u8 *src = walk.src.virt.addr;
-		u8 *dst = walk.dst.virt.addr;
-		unsigned int nblks;
-
-		kernel_neon_begin();
-
-		nblks = BYTES2BLKS(nbytes);
-		if (nblks) {
-			sm4_ce_cbc_dec(ctx->rkey_dec, dst, src, walk.iv, nblks);
-			nbytes -= nblks * SM4_BLOCK_SIZE;
-		}
-
-		kernel_neon_end();
+	return sm4_cbc_crypt(req, ctx, true);
+}
 
-		err = skcipher_walk_done(&walk, nbytes);
-	}
+static int sm4_cbc_decrypt(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
 
-	return err;
+	return sm4_cbc_crypt(req, ctx, false);
 }
 
 static int sm4_cfb_encrypt(struct skcipher_request *req)
-- 
2.24.3 (Apple Git-128)


* [PATCH 07/16] crypto: arm64/sm4 - simplify sm4_ce_expand_key() of CE implementation
  2022-09-26  9:36 ` Tianjia Zhang
@ 2022-09-26  9:36   ` Tianjia Zhang
  -1 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26  9:36 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
	Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
	Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
	linux-stm32

Use a 128-bit swap mask and the tbl instruction to simplify generation
of the SM4 decryption round keys (rkey_dec).

Also wrap the call to sm4_ce_expand_key() in kernel_neon_begin() and
kernel_neon_end(); the call was previously made without them.

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
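
For readers who want to check the tbl change, here is a minimal user-space
sketch in C (not kernel code). It models tbl with the .Lbswap128_mask
bytes and the removed rev64 + ext #8 pair. The helper names tbl16(),
rev64_ext8() and make_rkey_dec() are made up for this illustration. Under
that model both sequences reverse the order of the four 32-bit words in
a vector, so storing v7..v0 yields the round keys in reverse order.

```c
#include <stdint.h>
#include <string.h>

/* Byte indices copied from .Lbswap128_mask in this patch. */
static const uint8_t bswap128_mask[16] = {
	0x0c, 0x0d, 0x0e, 0x0f, 0x08, 0x09, 0x0a, 0x0b,
	0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03,
};

/* Model of a one-register tbl: out[i] = in[idx[i]], or 0 if idx[i] >= 16. */
static void tbl16(uint8_t out[16], const uint8_t in[16], const uint8_t idx[16])
{
	uint8_t tmp[16];

	for (int i = 0; i < 16; i++)
		tmp[i] = idx[i] < 16 ? in[idx[i]] : 0;
	memcpy(out, tmp, 16);
}

/* Model of the removed sequence: rev64 v.4s followed by ext #8. */
static void rev64_ext8(uint8_t out[16], const uint8_t in[16])
{
	uint8_t tmp[16];

	/* rev64 .4s: swap the two 32-bit words inside each 64-bit half */
	for (int half = 0; half < 2; half++) {
		memcpy(tmp + 8 * half, in + 8 * half + 4, 4);
		memcpy(tmp + 8 * half + 4, in + 8 * half, 4);
	}
	/* ext #8 with both sources equal: swap the two 64-bit halves */
	memcpy(out, tmp + 8, 8);
	memcpy(out + 8, tmp, 8);
}

/*
 * Hypothetical helper: derive rkey_dec from rkey_enc the way the new code
 * does, eight 128-bit vectors taken last-first, each passed through tbl.
 */
static void make_rkey_dec(uint32_t rkey_dec[32], const uint32_t rkey_enc[32])
{
	for (int v = 0; v < 8; v++)
		tbl16((uint8_t *)&rkey_dec[4 * v],
		      (const uint8_t *)&rkey_enc[4 * (7 - v)], bswap128_mask);
}
```

The bytes inside each 32-bit word keep their order, so the model does not
depend on host endianness.
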
 arch/arm64/crypto/sm4-ce-core.S | 46 ++++++++++++++++-----------------
 arch/arm64/crypto/sm4-ce-glue.c |  2 ++
 2 files changed, 24 insertions(+), 24 deletions(-)

diff --git a/arch/arm64/crypto/sm4-ce-core.S b/arch/arm64/crypto/sm4-ce-core.S
index 41fc745a8528..9e4b4f01cdf3 100644
--- a/arch/arm64/crypto/sm4-ce-core.S
+++ b/arch/arm64/crypto/sm4-ce-core.S
@@ -65,32 +65,23 @@ SYM_FUNC_START(sm4_ce_expand_key)
 	sm4ekey		v6.4s, v5.4s, v30.4s;
 	sm4ekey		v7.4s, v6.4s, v31.4s;
 
+	adr_l		x5, .Lbswap128_mask
+	ld1		{v24.16b}, [x5]
+
 	st1		{v0.16b-v3.16b}, [x1], #64;
 	st1		{v4.16b-v7.16b}, [x1];
-	rev64		v7.4s, v7.4s;
-	rev64		v6.4s, v6.4s;
-	rev64		v5.4s, v5.4s;
-	rev64		v4.4s, v4.4s;
-	rev64		v3.4s, v3.4s;
-	rev64		v2.4s, v2.4s;
-	rev64		v1.4s, v1.4s;
-	rev64		v0.4s, v0.4s;
-	ext		v7.16b, v7.16b, v7.16b, #8;
-	ext		v6.16b, v6.16b, v6.16b, #8;
-	ext		v5.16b, v5.16b, v5.16b, #8;
-	ext		v4.16b, v4.16b, v4.16b, #8;
-	ext		v3.16b, v3.16b, v3.16b, #8;
-	ext		v2.16b, v2.16b, v2.16b, #8;
-	ext		v1.16b, v1.16b, v1.16b, #8;
-	ext		v0.16b, v0.16b, v0.16b, #8;
-	st1		{v7.16b}, [x2], #16;
-	st1		{v6.16b}, [x2], #16;
-	st1		{v5.16b}, [x2], #16;
-	st1		{v4.16b}, [x2], #16;
-	st1		{v3.16b}, [x2], #16;
-	st1		{v2.16b}, [x2], #16;
-	st1		{v1.16b}, [x2], #16;
-	st1		{v0.16b}, [x2];
+
+	tbl		v16.16b, {v7.16b}, v24.16b
+	tbl		v17.16b, {v6.16b}, v24.16b
+	tbl		v18.16b, {v5.16b}, v24.16b
+	tbl		v19.16b, {v4.16b}, v24.16b
+	tbl		v20.16b, {v3.16b}, v24.16b
+	tbl		v21.16b, {v2.16b}, v24.16b
+	tbl		v22.16b, {v1.16b}, v24.16b
+	tbl		v23.16b, {v0.16b}, v24.16b
+
+	st1		{v16.16b-v19.16b}, [x2], #64
+	st1		{v20.16b-v23.16b}, [x2]
 
 	ret;
 SYM_FUNC_END(sm4_ce_expand_key)
@@ -578,3 +569,10 @@ SYM_FUNC_START(sm4_ce_ctr_enc)
 
 	ret
 SYM_FUNC_END(sm4_ce_ctr_enc)
+
+
+	.section	".rodata", "a"
+	.align 4
+.Lbswap128_mask:
+	.byte		0x0c, 0x0d, 0x0e, 0x0f, 0x08, 0x09, 0x0a, 0x0b
+	.byte		0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03
diff --git a/arch/arm64/crypto/sm4-ce-glue.c b/arch/arm64/crypto/sm4-ce-glue.c
index e56e81b1f35f..ff2d8442d473 100644
--- a/arch/arm64/crypto/sm4-ce-glue.c
+++ b/arch/arm64/crypto/sm4-ce-glue.c
@@ -44,8 +44,10 @@ static int sm4_setkey(struct crypto_skcipher *tfm, const u8 *key,
 	if (key_len != SM4_KEY_SIZE)
 		return -EINVAL;
 
+	kernel_neon_begin();
 	sm4_ce_expand_key(key, ctx->rkey_enc, ctx->rkey_dec,
 			  crypto_sm4_fk, crypto_sm4_ck);
+	kernel_neon_end();
 	return 0;
 }
 
-- 
2.24.3 (Apple Git-128)



* [PATCH 08/16] crypto: arm64/sm4 - export reusable CE acceleration functions
  2022-09-26  9:36 ` Tianjia Zhang
@ 2022-09-26  9:36   ` Tianjia Zhang
  -1 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26  9:36 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
	Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
	Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
	linux-stm32

Some functions of the Crypto Extensions accelerated SM4 implementation
can be reused by the upcoming accelerated implementations of the GCM/CCM
modes, and the CBC/CFB encryption is reused by the optimized
implementation of SVESM4. Export these functions and declare them in a
new header, sm4-ce.h.

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
 arch/arm64/crypto/sm4-ce-glue.c |  5 +++++
 arch/arm64/crypto/sm4-ce.h      | 16 ++++++++++++++++
 2 files changed, 21 insertions(+)
 create mode 100644 arch/arm64/crypto/sm4-ce.h

diff --git a/arch/arm64/crypto/sm4-ce-glue.c b/arch/arm64/crypto/sm4-ce-glue.c
index ff2d8442d473..63abcadc684b 100644
--- a/arch/arm64/crypto/sm4-ce-glue.c
+++ b/arch/arm64/crypto/sm4-ce-glue.c
@@ -36,6 +36,11 @@ asmlinkage void sm4_ce_cfb_dec(const u32 *rkey, u8 *dst, const u8 *src,
 asmlinkage void sm4_ce_ctr_enc(const u32 *rkey, u8 *dst, const u8 *src,
 			       u8 *iv, unsigned int nblks);
 
+EXPORT_SYMBOL(sm4_ce_expand_key);
+EXPORT_SYMBOL(sm4_ce_crypt_block);
+EXPORT_SYMBOL(sm4_ce_cbc_enc);
+EXPORT_SYMBOL(sm4_ce_cfb_enc);
+
 static int sm4_setkey(struct crypto_skcipher *tfm, const u8 *key,
 		      unsigned int key_len)
 {
diff --git a/arch/arm64/crypto/sm4-ce.h b/arch/arm64/crypto/sm4-ce.h
new file mode 100644
index 000000000000..109c21b37590
--- /dev/null
+++ b/arch/arm64/crypto/sm4-ce.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * SM4 common functions for Crypto Extensions
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+void sm4_ce_expand_key(const u8 *key, u32 *rkey_enc, u32 *rkey_dec,
+		       const u32 *fk, const u32 *ck);
+
+void sm4_ce_crypt_block(const u32 *rkey, u8 *dst, const u8 *src);
+
+void sm4_ce_cbc_enc(const u32 *rkey_enc, u8 *dst, const u8 *src,
+		    u8 *iv, unsigned int nblocks);
+
+void sm4_ce_cfb_enc(const u32 *rkey_enc, u8 *dst, const u8 *src,
+		    u8 *iv, unsigned int nblocks);
-- 
2.24.3 (Apple Git-128)



* [PATCH 09/16] crypto: arm64/sm4 - add CE implementation for CTS-CBC mode
  2022-09-26  9:36 ` Tianjia Zhang
@ 2022-09-26  9:36   ` Tianjia Zhang
  -1 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26  9:36 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
	Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
	Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
	linux-stm32

This patch adds a CE-optimized assembly implementation of CTS-CBC mode.

Benchmarked on a T-Head Yitian-710 at 2.75 GHz using tcrypt mode 218,
comparing performance before and after this patch (the driver used
before this patch is cts(cbc-sm4-ce)). The columns are block lengths in
bytes and the throughput is in Mb/s:

Before:

cts(cbc-sm4-ce) |      16       64      128      256     1024     1420     4096
----------------+--------------------------------------------------------------
    CTS-CBC enc |  286.09   297.17   457.97   627.75   868.58   900.80   957.69
    CTS-CBC dec |  286.67   285.63   538.35   947.08  2241.03  2577.32  3391.14

After:

cts-cbc-sm4-ce  |      16       64      128      256     1024     1420     4096
----------------+--------------------------------------------------------------
    CTS-CBC enc |  288.19   428.80   593.57   741.04   911.73   931.80   950.00
    CTS-CBC dec |  292.22   468.99   838.23  1380.76  2741.17  3036.42  3409.62

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
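
The permute-table trick used by the two new assembly routines can be
modelled in user space. The sketch below is only an illustration under
stated assumptions: toy_crypt() is a trivial invertible transform standing
in for SM4_CRYPT_BLK (it is not SM4), permute() models tbl/tbx, and
cts_tail() is a made-up name that follows the same load, permute and store
steps as sm4_ce_cbc_cts_enc/dec for a 17..32 byte tail.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLK 16

/* Same layout as .Lcts_permute_table: 16 x 0xff, 0x0..0xf, 16 x 0xff. */
static const uint8_t cts_permute_table[48] = {
	0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
	0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
	 0x0,  0x1,  0x2,  0x3,  0x4,  0x5,  0x6,  0x7,
	 0x8,  0x9,  0xa,  0xb,  0xc,  0xd,  0xe,  0xf,
	0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
	0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
};

/* Toy invertible block transform standing in for SM4_CRYPT_BLK. NOT SM4. */
static void toy_crypt(uint8_t b[BLK], const uint8_t key[BLK], int enc)
{
	uint8_t t[BLK];

	for (int i = 0; i < BLK; i++) {
		if (enc) {
			uint8_t x = b[i] ^ key[i];
			t[(i + 1) % BLK] = (uint8_t)((x << 3) | (x >> 5));
		} else {
			uint8_t y = b[(i + 1) % BLK];
			t[i] = (uint8_t)((y >> 3) | (y << 5)) ^ key[i];
		}
	}
	memcpy(b, t, BLK);
}

/* tbl (keep = 0) zeroes out-of-range lanes; tbx (keep = 1) leaves them. */
static void permute(uint8_t out[BLK], const uint8_t src[BLK],
		    const uint8_t idx[BLK], int keep)
{
	uint8_t t[BLK];

	for (int i = 0; i < BLK; i++)
		t[i] = idx[i] < BLK ? src[idx[i]] : (keep ? out[i] : 0);
	memcpy(out, t, BLK);
}

static void xor_blk(uint8_t a[BLK], const uint8_t b[BLK])
{
	for (int i = 0; i < BLK; i++)
		a[i] ^= b[i];
}

/* Final two blocks of CTS-CBC; nbytes must be in 17..32. */
static void cts_tail(const uint8_t key[BLK], uint8_t *dst, const uint8_t *src,
		     const uint8_t iv[BLK], unsigned int nbytes, int enc)
{
	assert(nbytes > BLK && nbytes <= 2 * BLK);

	unsigned int ln = nbytes - BLK;		/* length of the short block */
	const uint8_t *v3 = cts_permute_table + ln;
	const uint8_t *v4 = cts_permute_table + 32 - ln;
	uint8_t v0[BLK], v1[BLK], v2[BLK];

	memcpy(v0, src, BLK);
	memcpy(v1, src + ln, BLK);		/* overlapping load */

	if (enc) {
		xor_blk(v0, iv);
		toy_crypt(v0, key, 1);		/* v0 = En-1 */
		permute(v2, v0, v3, 0);		/* Cn in the last ln bytes */
		permute(v1, v1, v4, 0);		/* Pn padded with zeros */
		xor_blk(v1, v0);
		toy_crypt(v1, key, 1);
		memcpy(v0, v1, BLK);
	} else {
		toy_crypt(v0, key, 0);		/* v0 = Xn */
		permute(v2, v0, v3, 0);
		xor_blk(v2, v1);		/* Pn in the last ln bytes */
		permute(v0, v1, v4, 1);		/* v0 = En-1 */
		toy_crypt(v0, key, 0);
		xor_blk(v0, iv);
	}
	memcpy(dst + ln, v2, BLK);		/* overlapping stores */
	memcpy(dst, v0, BLK);
}
```

Both 16-byte stores overlap on purpose: the store at dst + ln leaves its
last ln bytes in the short final block, and the store at dst then
overwrites the rest with the full block.
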
 arch/arm64/crypto/sm4-ce-core.S | 102 ++++++++++++++++++++++++++++++++
 arch/arm64/crypto/sm4-ce-glue.c |  94 +++++++++++++++++++++++++++++
 2 files changed, 196 insertions(+)
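
The glue code in this patch splits a request into a bulk part for the
plain CBC path and a final part of 17..32 bytes for the ciphertext
stealing routines. A minimal sketch of that length arithmetic follows.
struct cts_split and cts_split_len() are hypothetical names used only
here, and the two macros are redefined locally so the snippet builds
outside the kernel.

```c
#include <assert.h>

#define SM4_BLOCK_SIZE		16
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

struct cts_split {
	unsigned int cbc_bytes;	/* handled by the plain CBC path */
	unsigned int cts_bytes;	/* handled by the ciphertext-stealing path */
};

/*
 * Mirrors the length arithmetic in sm4_cbc_cts_crypt(). Requires
 * cryptlen > SM4_BLOCK_SIZE; shorter requests are handled earlier.
 */
static struct cts_split cts_split_len(unsigned int cryptlen)
{
	struct cts_split s;
	int cbc_blocks;

	assert(cryptlen > SM4_BLOCK_SIZE);
	cbc_blocks = DIV_ROUND_UP(cryptlen, SM4_BLOCK_SIZE) - 2;
	s.cbc_bytes = cbc_blocks * SM4_BLOCK_SIZE;
	s.cts_bytes = cryptlen - s.cbc_bytes;
	return s;
}
```

For example, a 100-byte request gives 80 bytes of plain CBC and a 20-byte
tail, while a 32-byte request goes entirely to the tail routine.
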

diff --git a/arch/arm64/crypto/sm4-ce-core.S b/arch/arm64/crypto/sm4-ce-core.S
index 9e4b4f01cdf3..414d29f8110b 100644
--- a/arch/arm64/crypto/sm4-ce-core.S
+++ b/arch/arm64/crypto/sm4-ce-core.S
@@ -306,6 +306,100 @@ SYM_FUNC_START(sm4_ce_cbc_dec)
 	ret
 SYM_FUNC_END(sm4_ce_cbc_dec)
 
+.align 3
+SYM_FUNC_START(sm4_ce_cbc_cts_enc)
+	/* input:
+	 *   x0: round key array, CTX
+	 *   x1: dst
+	 *   x2: src
+	 *   x3: iv (big endian, 128 bit)
+	 *   w4: nbytes
+	 */
+	SM4_PREPARE(x0)
+
+	sub		w5, w4, #16
+	uxtw		x5, w5
+
+	ld1		{RIV.16b}, [x3]
+
+	ld1		{v0.16b}, [x2]
+	eor		RIV.16b, RIV.16b, v0.16b
+	SM4_CRYPT_BLK(RIV)
+
+	/* load permute table */
+	adr_l		x6, .Lcts_permute_table
+	add		x7, x6, #32
+	add		x6, x6, x5
+	sub		x7, x7, x5
+	ld1		{v3.16b}, [x6]
+	ld1		{v4.16b}, [x7]
+
+	/* overlapping loads */
+	add		x2, x2, x5
+	ld1		{v1.16b}, [x2]
+
+	/* create Cn from En-1 */
+	tbl		v0.16b, {RIV.16b}, v3.16b
+	/* padding Pn with zeros */
+	tbl		v1.16b, {v1.16b}, v4.16b
+
+	eor		v1.16b, v1.16b, RIV.16b
+	SM4_CRYPT_BLK(v1)
+
+	/* overlapping stores */
+	add		x5, x1, x5
+	st1		{v0.16b}, [x5]
+	st1		{v1.16b}, [x1]
+
+	ret
+SYM_FUNC_END(sm4_ce_cbc_cts_enc)
+
+.align 3
+SYM_FUNC_START(sm4_ce_cbc_cts_dec)
+	/* input:
+	 *   x0: round key array, CTX
+	 *   x1: dst
+	 *   x2: src
+	 *   x3: iv (big endian, 128 bit)
+	 *   w4: nbytes
+	 */
+	SM4_PREPARE(x0)
+
+	sub		w5, w4, #16
+	uxtw		x5, w5
+
+	ld1		{RIV.16b}, [x3]
+
+	/* load permute table */
+	adr_l		x6, .Lcts_permute_table
+	add		x7, x6, #32
+	add		x6, x6, x5
+	sub		x7, x7, x5
+	ld1		{v3.16b}, [x6]
+	ld1		{v4.16b}, [x7]
+
+	/* overlapping loads */
+	ld1		{v0.16b}, [x2], x5
+	ld1		{v1.16b}, [x2]
+
+	SM4_CRYPT_BLK(v0)
+	/* select the first Ln bytes of Xn to create Pn */
+	tbl		v2.16b, {v0.16b}, v3.16b
+	eor		v2.16b, v2.16b, v1.16b
+
+	/* overwrite the first Ln bytes with Cn to create En-1 */
+	tbx		v0.16b, {v1.16b}, v4.16b
+	SM4_CRYPT_BLK(v0)
+	eor		v0.16b, v0.16b, RIV.16b
+
+	/* overlapping stores */
+	add		x5, x1, x5
+	st1		{v2.16b}, [x5]
+	st1		{v0.16b}, [x1]
+
+	ret
+SYM_FUNC_END(sm4_ce_cbc_cts_dec)
+
 .align 3
 SYM_FUNC_START(sm4_ce_cfb_enc)
 	/* input:
@@ -576,3 +670,11 @@ SYM_FUNC_END(sm4_ce_ctr_enc)
 .Lbswap128_mask:
 	.byte		0x0c, 0x0d, 0x0e, 0x0f, 0x08, 0x09, 0x0a, 0x0b
 	.byte		0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03
+
+.Lcts_permute_table:
+	.byte		0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+	.byte		0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+	.byte		 0x0,  0x1,  0x2,  0x3,  0x4,  0x5,  0x6,  0x7
+	.byte		 0x8,  0x9,  0xa,  0xb,  0xc,  0xd,  0xe,  0xf
+	.byte		0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+	.byte		0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
diff --git a/arch/arm64/crypto/sm4-ce-glue.c b/arch/arm64/crypto/sm4-ce-glue.c
index 63abcadc684b..4d4072c7bfa2 100644
--- a/arch/arm64/crypto/sm4-ce-glue.c
+++ b/arch/arm64/crypto/sm4-ce-glue.c
@@ -16,6 +16,7 @@
 #include <asm/simd.h>
 #include <crypto/internal/simd.h>
 #include <crypto/internal/skcipher.h>
+#include <crypto/scatterwalk.h>
 #include <crypto/sm4.h>
 
 #define BYTES2BLKS(nbytes)	((nbytes) >> 4)
@@ -29,6 +30,10 @@ asmlinkage void sm4_ce_cbc_enc(const u32 *rkey, u8 *dst, const u8 *src,
 			       u8 *iv, unsigned int nblocks);
 asmlinkage void sm4_ce_cbc_dec(const u32 *rkey, u8 *dst, const u8 *src,
 			       u8 *iv, unsigned int nblocks);
+asmlinkage void sm4_ce_cbc_cts_enc(const u32 *rkey, u8 *dst, const u8 *src,
+				   u8 *iv, unsigned int nbytes);
+asmlinkage void sm4_ce_cbc_cts_dec(const u32 *rkey, u8 *dst, const u8 *src,
+				   u8 *iv, unsigned int nbytes);
 asmlinkage void sm4_ce_cfb_enc(const u32 *rkey, u8 *dst, const u8 *src,
 			       u8 *iv, unsigned int nblks);
 asmlinkage void sm4_ce_cfb_dec(const u32 *rkey, u8 *dst, const u8 *src,
@@ -153,6 +158,78 @@ static int sm4_cbc_decrypt(struct skcipher_request *req)
 	return sm4_cbc_crypt(req, ctx, false);
 }
 
+static int sm4_cbc_cts_crypt(struct skcipher_request *req, bool encrypt)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct scatterlist *src = req->src;
+	struct scatterlist *dst = req->dst;
+	struct scatterlist sg_src[2], sg_dst[2];
+	struct skcipher_request subreq;
+	struct skcipher_walk walk;
+	int cbc_blocks;
+	int err;
+
+	if (req->cryptlen < SM4_BLOCK_SIZE)
+		return -EINVAL;
+
+	if (req->cryptlen == SM4_BLOCK_SIZE)
+		return sm4_cbc_crypt(req, ctx, encrypt);
+
+	skcipher_request_set_tfm(&subreq, tfm);
+	skcipher_request_set_callback(&subreq, skcipher_request_flags(req),
+				      NULL, NULL);
+
+	/* handle the CBC cryption part */
+	cbc_blocks = DIV_ROUND_UP(req->cryptlen, SM4_BLOCK_SIZE) - 2;
+	if (cbc_blocks) {
+		skcipher_request_set_crypt(&subreq, src, dst,
+					   cbc_blocks * SM4_BLOCK_SIZE,
+					   req->iv);
+
+		err = sm4_cbc_crypt(&subreq, ctx, encrypt);
+		if (err)
+			return err;
+
+		dst = src = scatterwalk_ffwd(sg_src, src, subreq.cryptlen);
+		if (req->dst != req->src)
+			dst = scatterwalk_ffwd(sg_dst, req->dst,
+					       subreq.cryptlen);
+	}
+
+	/* handle ciphertext stealing */
+	skcipher_request_set_crypt(&subreq, src, dst,
+				   req->cryptlen - cbc_blocks * SM4_BLOCK_SIZE,
+				   req->iv);
+
+	err = skcipher_walk_virt(&walk, &subreq, false);
+	if (err)
+		return err;
+
+	kernel_neon_begin();
+
+	if (encrypt)
+		sm4_ce_cbc_cts_enc(ctx->rkey_enc, walk.dst.virt.addr,
+				   walk.src.virt.addr, walk.iv, walk.nbytes);
+	else
+		sm4_ce_cbc_cts_dec(ctx->rkey_dec, walk.dst.virt.addr,
+				   walk.src.virt.addr, walk.iv, walk.nbytes);
+
+	kernel_neon_end();
+
+	return skcipher_walk_done(&walk, 0);
+}
+
+static int sm4_cbc_cts_encrypt(struct skcipher_request *req)
+{
+	return sm4_cbc_cts_crypt(req, true);
+}
+
+static int sm4_cbc_cts_decrypt(struct skcipher_request *req)
+{
+	return sm4_cbc_cts_crypt(req, false);
+}
+
 static int sm4_cfb_encrypt(struct skcipher_request *req)
 {
 	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
@@ -342,6 +419,22 @@ static struct skcipher_alg sm4_algs[] = {
 		.setkey		= sm4_setkey,
 		.encrypt	= sm4_ctr_crypt,
 		.decrypt	= sm4_ctr_crypt,
+	}, {
+		.base = {
+			.cra_name		= "cts(cbc(sm4))",
+			.cra_driver_name	= "cts-cbc-sm4-ce",
+			.cra_priority		= 400,
+			.cra_blocksize		= SM4_BLOCK_SIZE,
+			.cra_ctxsize		= sizeof(struct sm4_ctx),
+			.cra_module		= THIS_MODULE,
+		},
+		.min_keysize	= SM4_KEY_SIZE,
+		.max_keysize	= SM4_KEY_SIZE,
+		.ivsize		= SM4_BLOCK_SIZE,
+		.walksize	= SM4_BLOCK_SIZE * 2,
+		.setkey		= sm4_setkey,
+		.encrypt	= sm4_cbc_cts_encrypt,
+		.decrypt	= sm4_cbc_cts_decrypt,
 	}
 };
 
@@ -365,5 +458,6 @@ MODULE_ALIAS_CRYPTO("ecb(sm4)");
 MODULE_ALIAS_CRYPTO("cbc(sm4)");
 MODULE_ALIAS_CRYPTO("cfb(sm4)");
 MODULE_ALIAS_CRYPTO("ctr(sm4)");
+MODULE_ALIAS_CRYPTO("cts(cbc(sm4))");
 MODULE_AUTHOR("Tianjia Zhang <tianjia.zhang@linux.alibaba.com>");
 MODULE_LICENSE("GPL v2");
-- 
2.24.3 (Apple Git-128)



* [PATCH 10/16] crypto: arm64/sm4 - add CE implementation for XTS mode
  2022-09-26  9:36 ` Tianjia Zhang
@ 2022-09-26  9:36   ` Tianjia Zhang
  -1 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26  9:36 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
	Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
	Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
	linux-stm32

This patch adds a CE-optimized assembly implementation of XTS mode.

Benchmarked on a T-Head Yitian-710 at 2.75 GHz using tcrypt mode 218,
comparing performance before and after this patch (the driver used
before this patch is xts(ecb-sm4-ce)). The columns are block lengths in
bytes and the throughput is in Mb/s:

Before:

xts(ecb-sm4-ce) |      16       64      128      256     1024     1420     4096
----------------+--------------------------------------------------------------
        XTS enc |  117.17   430.56   732.92  1134.98  2007.03  2136.23  2347.20
        XTS dec |  116.89   429.02   733.40  1132.96  2006.13  2130.50  2347.92

After:

xts-sm4-ce      |      16       64      128      256     1024     1420     4096
----------------+--------------------------------------------------------------
        XTS enc |  224.68   798.91  1248.08  1714.60  2413.73  2467.84  2612.62
        XTS dec |  229.85   791.34  1237.79  1720.00  2413.30  2473.84  2611.95

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
 arch/arm64/crypto/Kconfig       |   4 +-
 arch/arm64/crypto/sm4-ce-core.S | 343 ++++++++++++++++++++++++++++++++
 arch/arm64/crypto/sm4-ce-glue.c | 159 ++++++++++++++-
 3 files changed, 504 insertions(+), 2 deletions(-)
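
Note on the tweak update: the tweak_next() macro added to sm4-ce-core.S
by this patch multiplies the 128-bit tweak by x in GF(2^128), folding the
carry back in with the constant 0x87. The portable C sketch below is an
illustration only and is not code from this patch. The function name
xts_tweak_double is hypothetical, and the sketch assumes the usual XTS
byte order in which byte 0 is the least significant byte.

```c
#include <stdint.h>

/*
 * Multiply a 128-bit XTS tweak by x in GF(2^128), in place.
 * Assumption: t[0] is the least significant byte. The reduction
 * polynomial is x^128 + x^7 + x^2 + x + 1, hence the constant 0x87.
 */
static void xts_tweak_double(uint8_t t[16])
{
	uint8_t carry = 0;
	int i;

	/* Shift the whole 128-bit value left by one bit, byte by byte. */
	for (i = 0; i < 16; i++) {
		uint8_t next = t[i] >> 7;

		t[i] = (uint8_t)((t[i] << 1) | carry);
		carry = next;
	}

	/* A bit shifted out of the top is reduced back into byte 0. */
	if (carry)
		t[0] ^= 0x87;
}
```

The assembly does the same thing on two 64-bit lanes at once: sshr #63
turns each lane's top bit into a mask, RMASK selects 1 (carry into the
high lane) or 0x87 (reduction into the low lane), and ext #8 swaps the
lanes so that each correction lands in the other half.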

diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index 4b121dc0cfba..8939f5ae9214 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -231,7 +231,7 @@ config CRYPTO_SM4_ARM64_CE
 	  - NEON (Advanced SIMD) extensions
 
 config CRYPTO_SM4_ARM64_CE_BLK
-	tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR (ARMv8 Crypto Extensions)"
+	tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR/XTS (ARMv8 Crypto Extensions)"
 	depends on KERNEL_MODE_NEON
 	select CRYPTO_SKCIPHER
 	select CRYPTO_SM4
@@ -242,6 +242,8 @@ config CRYPTO_SM4_ARM64_CE_BLK
 	  - CBC (Cipher Block Chaining) mode (NIST SP800-38A)
 	  - CFB (Cipher Feedback) mode (NIST SP800-38A)
 	  - CTR (Counter) mode (NIST SP800-38A)
+	  - XTS (XOR Encrypt XOR with ciphertext stealing) mode (NIST SP800-38E
+	    and IEEE 1619)
 
 	  Architecture: arm64 using:
 	  - ARMv8 Crypto Extensions
diff --git a/arch/arm64/crypto/sm4-ce-core.S b/arch/arm64/crypto/sm4-ce-core.S
index 414d29f8110b..ddd15ec09d38 100644
--- a/arch/arm64/crypto/sm4-ce-core.S
+++ b/arch/arm64/crypto/sm4-ce-core.S
@@ -35,6 +35,7 @@
 #define RTMP3	v19
 
 #define RIV	v20
+#define RMASK	v21
 
 
 .align 3
@@ -665,6 +666,348 @@ SYM_FUNC_START(sm4_ce_ctr_enc)
 SYM_FUNC_END(sm4_ce_ctr_enc)
 
 
+#define tweak_next(vt, vin, RTMP)					\
+		sshr		RTMP.2d, vin.2d, #63;			\
+		and		RTMP.16b, RTMP.16b, RMASK.16b;		\
+		add		vt.2d, vin.2d, vin.2d;			\
+		ext		RTMP.16b, RTMP.16b, RTMP.16b, #8;	\
+		eor		vt.16b, vt.16b, RTMP.16b;
+
+.align 3
+SYM_FUNC_START(sm4_ce_xts_enc)
+	/* input:
+	 *   x0: round key array, CTX
+	 *   x1: dst
+	 *   x2: src
+	 *   x3: tweak (big endian, 128 bit)
+	 *   w4: nbytes
+	 *   x5: round key array for IV
+	 */
+	ld1		{v8.16b}, [x3]
+
+	cbz		x5, .Lxts_enc_nofirst
+
+	SM4_PREPARE(x5)
+
+	/* Generate first tweak */
+	SM4_CRYPT_BLK(v8)
+
+.Lxts_enc_nofirst:
+	SM4_PREPARE(x0)
+
+	ands		w5, w4, #15
+	lsr		w4, w4, #4
+	sub		w6, w4, #1
+	csel		w4, w4, w6, eq
+	uxtw		x5, w5
+
+	movi		RMASK.2s, #0x1
+	movi		RTMP0.2s, #0x87
+	uzp1		RMASK.4s, RMASK.4s, RTMP0.4s
+
+	cbz		w4, .Lxts_enc_cts
+
+.Lxts_enc_loop_8x:
+	sub		w4, w4, #8
+	tbnz		w4, #31, .Lxts_enc_4x
+
+	tweak_next( v9,  v8, RTMP0)
+	tweak_next(v10,  v9, RTMP1)
+	tweak_next(v11, v10, RTMP2)
+	tweak_next(v12, v11, RTMP3)
+	tweak_next(v13, v12, RTMP0)
+	tweak_next(v14, v13, RTMP1)
+	tweak_next(v15, v14, RTMP2)
+
+	ld1		{v0.16b-v3.16b}, [x2], #64
+	ld1		{v4.16b-v7.16b}, [x2], #64
+	eor		v0.16b, v0.16b,  v8.16b
+	eor		v1.16b, v1.16b,  v9.16b
+	eor		v2.16b, v2.16b, v10.16b
+	eor		v3.16b, v3.16b, v11.16b
+	eor		v4.16b, v4.16b, v12.16b
+	eor		v5.16b, v5.16b, v13.16b
+	eor		v6.16b, v6.16b, v14.16b
+	eor		v7.16b, v7.16b, v15.16b
+
+	SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7)
+
+	eor		v0.16b, v0.16b,  v8.16b
+	eor		v1.16b, v1.16b,  v9.16b
+	eor		v2.16b, v2.16b, v10.16b
+	eor		v3.16b, v3.16b, v11.16b
+	eor		v4.16b, v4.16b, v12.16b
+	eor		v5.16b, v5.16b, v13.16b
+	eor		v6.16b, v6.16b, v14.16b
+	eor		v7.16b, v7.16b, v15.16b
+	st1		{v0.16b-v3.16b}, [x1], #64
+	st1		{v4.16b-v7.16b}, [x1], #64
+
+	tweak_next(v8, v15, RTMP3)
+
+	cbz		w4, .Lxts_enc_cts
+	b		.Lxts_enc_loop_8x
+
+.Lxts_enc_4x:
+	add		w4, w4, #8
+	cmp		w4, #4
+	blt		.Lxts_enc_loop_1x
+
+	sub		w4, w4, #4
+
+	tweak_next( v9,  v8, RTMP0)
+	tweak_next(v10,  v9, RTMP1)
+	tweak_next(v11, v10, RTMP2)
+
+	ld1		{v0.16b-v3.16b}, [x2], #64
+	eor		v0.16b, v0.16b,  v8.16b
+	eor		v1.16b, v1.16b,  v9.16b
+	eor		v2.16b, v2.16b, v10.16b
+	eor		v3.16b, v3.16b, v11.16b
+
+	SM4_CRYPT_BLK4(v0, v1, v2, v3)
+
+	eor		v0.16b, v0.16b,  v8.16b
+	eor		v1.16b, v1.16b,  v9.16b
+	eor		v2.16b, v2.16b, v10.16b
+	eor		v3.16b, v3.16b, v11.16b
+	st1		{v0.16b-v3.16b}, [x1], #64
+
+	tweak_next(v8, v11, RTMP3)
+
+	cbz		w4, .Lxts_enc_cts
+
+.Lxts_enc_loop_1x:
+	sub		w4, w4, #1
+
+	ld1		{v0.16b}, [x2], #16
+	eor		v0.16b, v0.16b, v8.16b
+
+	SM4_CRYPT_BLK(v0)
+
+	eor		v0.16b, v0.16b, v8.16b
+	st1		{v0.16b}, [x1], #16
+
+	tweak_next(v8, v8, RTMP0)
+
+	cbnz		w4, .Lxts_enc_loop_1x
+
+.Lxts_enc_cts:
+	cbz		x5, .Lxts_enc_end
+
+	/* cipher text stealing */
+
+	tweak_next(v9, v8, RTMP0)
+	ld1		{v0.16b}, [x2]
+	eor		v0.16b, v0.16b, v8.16b
+	SM4_CRYPT_BLK(v0)
+	eor		v0.16b, v0.16b, v8.16b
+
+	/* load permute table */
+	adr_l		x6, .Lcts_permute_table
+	add		x7, x6, #32
+	add		x6, x6, x5
+	sub		x7, x7, x5
+	ld1		{v3.16b}, [x6]
+	ld1		{v4.16b}, [x7]
+
+	/* overlapping loads */
+	add		x2, x2, x5
+	ld1		{v1.16b}, [x2]
+
+	/* create Cn from En-1 */
+	tbl		v2.16b, {v0.16b}, v3.16b
+	/* padding Pn with En-1 at the end */
+	tbx		v0.16b, {v1.16b}, v4.16b
+
+	eor		v0.16b, v0.16b, v9.16b
+	SM4_CRYPT_BLK(v0)
+	eor		v0.16b, v0.16b, v9.16b
+
+
+	/* overlapping stores */
+	add		x5, x1, x5
+	st1		{v2.16b}, [x5]
+	st1		{v0.16b}, [x1]
+
+	b		.Lxts_enc_ret
+
+.Lxts_enc_end:
+	/* store new tweak */
+	st1		{v8.16b}, [x3]
+
+.Lxts_enc_ret:
+	ret
+SYM_FUNC_END(sm4_ce_xts_enc)
+
+.align 3
+SYM_FUNC_START(sm4_ce_xts_dec)
+	/* input:
+	 *   x0: round key array, CTX
+	 *   x1: dst
+	 *   x2: src
+	 *   x3: tweak (big endian, 128 bit)
+	 *   w4: nbytes
+	 *   x5: round key array for IV
+	 */
+	ld1		{v8.16b}, [x3]
+
+	cbz		x5, .Lxts_dec_nofirst
+
+	SM4_PREPARE(x5)
+
+	/* Generate first tweak */
+	SM4_CRYPT_BLK(v8)
+
+.Lxts_dec_nofirst:
+	SM4_PREPARE(x0)
+
+	ands		w5, w4, #15
+	lsr		w4, w4, #4
+	sub		w6, w4, #1
+	csel		w4, w4, w6, eq
+	uxtw		x5, w5
+
+	movi		RMASK.2s, #0x1
+	movi		RTMP0.2s, #0x87
+	uzp1		RMASK.4s, RMASK.4s, RTMP0.4s
+
+	cbz		w4, .Lxts_dec_cts
+
+.Lxts_dec_loop_8x:
+	sub		w4, w4, #8
+	tbnz		w4, #31, .Lxts_dec_4x
+
+	tweak_next( v9,  v8, RTMP0)
+	tweak_next(v10,  v9, RTMP1)
+	tweak_next(v11, v10, RTMP2)
+	tweak_next(v12, v11, RTMP3)
+	tweak_next(v13, v12, RTMP0)
+	tweak_next(v14, v13, RTMP1)
+	tweak_next(v15, v14, RTMP2)
+
+	ld1		{v0.16b-v3.16b}, [x2], #64
+	ld1		{v4.16b-v7.16b}, [x2], #64
+	eor		v0.16b, v0.16b,  v8.16b
+	eor		v1.16b, v1.16b,  v9.16b
+	eor		v2.16b, v2.16b, v10.16b
+	eor		v3.16b, v3.16b, v11.16b
+	eor		v4.16b, v4.16b, v12.16b
+	eor		v5.16b, v5.16b, v13.16b
+	eor		v6.16b, v6.16b, v14.16b
+	eor		v7.16b, v7.16b, v15.16b
+
+	SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7)
+
+	eor		v0.16b, v0.16b,  v8.16b
+	eor		v1.16b, v1.16b,  v9.16b
+	eor		v2.16b, v2.16b, v10.16b
+	eor		v3.16b, v3.16b, v11.16b
+	eor		v4.16b, v4.16b, v12.16b
+	eor		v5.16b, v5.16b, v13.16b
+	eor		v6.16b, v6.16b, v14.16b
+	eor		v7.16b, v7.16b, v15.16b
+	st1		{v0.16b-v3.16b}, [x1], #64
+	st1		{v4.16b-v7.16b}, [x1], #64
+
+	tweak_next(v8, v15, RTMP3)
+
+	cbz		w4, .Lxts_dec_cts
+	b		.Lxts_dec_loop_8x
+
+.Lxts_dec_4x:
+	add		w4, w4, #8
+	cmp		w4, #4
+	blt		.Lxts_dec_loop_1x
+
+	sub		w4, w4, #4
+
+	tweak_next( v9,  v8, RTMP0)
+	tweak_next(v10,  v9, RTMP1)
+	tweak_next(v11, v10, RTMP2)
+
+	ld1		{v0.16b-v3.16b}, [x2], #64
+	eor		v0.16b, v0.16b,  v8.16b
+	eor		v1.16b, v1.16b,  v9.16b
+	eor		v2.16b, v2.16b, v10.16b
+	eor		v3.16b, v3.16b, v11.16b
+
+	SM4_CRYPT_BLK4(v0, v1, v2, v3)
+
+	eor		v0.16b, v0.16b,  v8.16b
+	eor		v1.16b, v1.16b,  v9.16b
+	eor		v2.16b, v2.16b, v10.16b
+	eor		v3.16b, v3.16b, v11.16b
+	st1		{v0.16b-v3.16b}, [x1], #64
+
+	tweak_next(v8, v11, RTMP3)
+
+	cbz		w4, .Lxts_dec_cts
+
+.Lxts_dec_loop_1x:
+	sub		w4, w4, #1
+
+	ld1		{v0.16b}, [x2], #16
+	eor		v0.16b, v0.16b, v8.16b
+
+	SM4_CRYPT_BLK(v0)
+
+	eor		v0.16b, v0.16b, v8.16b
+	st1		{v0.16b}, [x1], #16
+
+	tweak_next(v8, v8, RTMP0)
+
+	cbnz		w4, .Lxts_dec_loop_1x
+
+.Lxts_dec_cts:
+	cbz		x5, .Lxts_dec_end
+
+	/* cipher text stealing */
+
+	tweak_next(v9, v8, RTMP0)
+	ld1		{v0.16b}, [x2]
+	eor		v0.16b, v0.16b, v9.16b
+	SM4_CRYPT_BLK(v0)
+	eor		v0.16b, v0.16b, v9.16b
+
+	/* load permute table */
+	adr_l		x6, .Lcts_permute_table
+	add		x7, x6, #32
+	add		x6, x6, x5
+	sub		x7, x7, x5
+	ld1		{v3.16b}, [x6]
+	ld1		{v4.16b}, [x7]
+
+	/* overlapping loads */
+	add		x2, x2, x5
+	ld1		{v1.16b}, [x2]
+
+	/* create Cn from En-1 */
+	tbl		v2.16b, {v0.16b}, v3.16b
+	/* padding Pn with En-1 at the end */
+	tbx		v0.16b, {v1.16b}, v4.16b
+
+	eor		v0.16b, v0.16b, v8.16b
+	SM4_CRYPT_BLK(v0)
+	eor		v0.16b, v0.16b, v8.16b
+
+
+	/* overlapping stores */
+	add		x5, x1, x5
+	st1		{v2.16b}, [x5]
+	st1		{v0.16b}, [x1]
+
+	b		.Lxts_dec_ret
+
+.Lxts_dec_end:
+	/* store new tweak */
+	st1		{v8.16b}, [x3]
+
+.Lxts_dec_ret:
+	ret
+SYM_FUNC_END(sm4_ce_xts_dec)
+
+
 	.section	".rodata", "a"
 	.align 4
 .Lbswap128_mask:
diff --git a/arch/arm64/crypto/sm4-ce-glue.c b/arch/arm64/crypto/sm4-ce-glue.c
index 4d4072c7bfa2..8222766f712a 100644
--- a/arch/arm64/crypto/sm4-ce-glue.c
+++ b/arch/arm64/crypto/sm4-ce-glue.c
@@ -17,6 +17,7 @@
 #include <crypto/internal/simd.h>
 #include <crypto/internal/skcipher.h>
 #include <crypto/scatterwalk.h>
+#include <crypto/xts.h>
 #include <crypto/sm4.h>
 
 #define BYTES2BLKS(nbytes)	((nbytes) >> 4)
@@ -40,12 +41,23 @@ asmlinkage void sm4_ce_cfb_dec(const u32 *rkey, u8 *dst, const u8 *src,
 			       u8 *iv, unsigned int nblks);
 asmlinkage void sm4_ce_ctr_enc(const u32 *rkey, u8 *dst, const u8 *src,
 			       u8 *iv, unsigned int nblks);
+asmlinkage void sm4_ce_xts_enc(const u32 *rkey1, u8 *dst, const u8 *src,
+			       u8 *tweak, unsigned int nbytes,
+			       const u32 *rkey2_enc);
+asmlinkage void sm4_ce_xts_dec(const u32 *rkey1, u8 *dst, const u8 *src,
+			       u8 *tweak, unsigned int nbytes,
+			       const u32 *rkey2_enc);
 
 EXPORT_SYMBOL(sm4_ce_expand_key);
 EXPORT_SYMBOL(sm4_ce_crypt_block);
 EXPORT_SYMBOL(sm4_ce_cbc_enc);
 EXPORT_SYMBOL(sm4_ce_cfb_enc);
 
+struct sm4_xts_ctx {
+	struct sm4_ctx key1;
+	struct sm4_ctx key2;
+};
+
 static int sm4_setkey(struct crypto_skcipher *tfm, const u8 *key,
 		      unsigned int key_len)
 {
@@ -61,6 +73,29 @@ static int sm4_setkey(struct crypto_skcipher *tfm, const u8 *key,
 	return 0;
 }
 
+static int sm4_xts_setkey(struct crypto_skcipher *tfm, const u8 *key,
+			  unsigned int key_len)
+{
+	struct sm4_xts_ctx *ctx = crypto_skcipher_ctx(tfm);
+	int ret;
+
+	if (key_len != SM4_KEY_SIZE * 2)
+		return -EINVAL;
+
+	ret = xts_verify_key(tfm, key, key_len);
+	if (ret)
+		return ret;
+
+	kernel_neon_begin();
+	sm4_ce_expand_key(key, ctx->key1.rkey_enc,
+			  ctx->key1.rkey_dec, crypto_sm4_fk, crypto_sm4_ck);
+	sm4_ce_expand_key(&key[SM4_KEY_SIZE], ctx->key2.rkey_enc,
+			  ctx->key2.rkey_dec, crypto_sm4_fk, crypto_sm4_ck);
+	kernel_neon_end();
+
+	return 0;
+}
+
 static int sm4_ecb_do_crypt(struct skcipher_request *req, const u32 *rkey)
 {
 	struct skcipher_walk walk;
@@ -357,6 +392,111 @@ static int sm4_ctr_crypt(struct skcipher_request *req)
 	return err;
 }
 
+static int sm4_xts_crypt(struct skcipher_request *req, bool encrypt)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct sm4_xts_ctx *ctx = crypto_skcipher_ctx(tfm);
+	int tail = req->cryptlen % SM4_BLOCK_SIZE;
+	const u32 *rkey2_enc = ctx->key2.rkey_enc;
+	struct scatterlist sg_src[2], sg_dst[2];
+	struct skcipher_request subreq;
+	struct scatterlist *src, *dst;
+	struct skcipher_walk walk;
+	unsigned int nbytes;
+	int err;
+
+	if (req->cryptlen < SM4_BLOCK_SIZE)
+		return -EINVAL;
+
+	err = skcipher_walk_virt(&walk, req, false);
+	if (err)
+		return err;
+
+	if (unlikely(tail > 0 && walk.nbytes < walk.total)) {
+		int nblocks = DIV_ROUND_UP(req->cryptlen, SM4_BLOCK_SIZE) - 2;
+
+		skcipher_walk_abort(&walk);
+
+		skcipher_request_set_tfm(&subreq, tfm);
+		skcipher_request_set_callback(&subreq,
+					      skcipher_request_flags(req),
+					      NULL, NULL);
+		skcipher_request_set_crypt(&subreq, req->src, req->dst,
+					   nblocks * SM4_BLOCK_SIZE, req->iv);
+
+		err = skcipher_walk_virt(&walk, &subreq, false);
+		if (err)
+			return err;
+	} else {
+		tail = 0;
+	}
+
+	while ((nbytes = walk.nbytes) >= SM4_BLOCK_SIZE) {
+		if (nbytes < walk.total)
+			nbytes &= ~(SM4_BLOCK_SIZE - 1);
+
+		kernel_neon_begin();
+
+		if (encrypt)
+			sm4_ce_xts_enc(ctx->key1.rkey_enc, walk.dst.virt.addr,
+				       walk.src.virt.addr, walk.iv, nbytes,
+				       rkey2_enc);
+		else
+			sm4_ce_xts_dec(ctx->key1.rkey_dec, walk.dst.virt.addr,
+				       walk.src.virt.addr, walk.iv, nbytes,
+				       rkey2_enc);
+
+		kernel_neon_end();
+
+		rkey2_enc = NULL;
+
+		err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
+		if (err)
+			return err;
+	}
+
+	if (likely(tail == 0))
+		return 0;
+
+	/* handle ciphertext stealing */
+
+	dst = src = scatterwalk_ffwd(sg_src, req->src, subreq.cryptlen);
+	if (req->dst != req->src)
+		dst = scatterwalk_ffwd(sg_dst, req->dst, subreq.cryptlen);
+
+	skcipher_request_set_crypt(&subreq, src, dst, SM4_BLOCK_SIZE + tail,
+				   req->iv);
+
+	err = skcipher_walk_virt(&walk, &subreq, false);
+	if (err)
+		return err;
+
+	kernel_neon_begin();
+
+	if (encrypt)
+		sm4_ce_xts_enc(ctx->key1.rkey_enc, walk.dst.virt.addr,
+			       walk.src.virt.addr, walk.iv, walk.nbytes,
+			       rkey2_enc);
+	else
+		sm4_ce_xts_dec(ctx->key1.rkey_dec, walk.dst.virt.addr,
+			       walk.src.virt.addr, walk.iv, walk.nbytes,
+			       rkey2_enc);
+
+	kernel_neon_end();
+
+	return skcipher_walk_done(&walk, 0);
+}
+
+static int sm4_xts_encrypt(struct skcipher_request *req)
+{
+	return sm4_xts_crypt(req, true);
+}
+
+static int sm4_xts_decrypt(struct skcipher_request *req)
+{
+	return sm4_xts_crypt(req, false);
+}
+
 static struct skcipher_alg sm4_algs[] = {
 	{
 		.base = {
@@ -435,6 +575,22 @@ static struct skcipher_alg sm4_algs[] = {
 		.setkey		= sm4_setkey,
 		.encrypt	= sm4_cbc_cts_encrypt,
 		.decrypt	= sm4_cbc_cts_decrypt,
+	}, {
+		.base = {
+			.cra_name		= "xts(sm4)",
+			.cra_driver_name	= "xts-sm4-ce",
+			.cra_priority		= 400,
+			.cra_blocksize		= SM4_BLOCK_SIZE,
+			.cra_ctxsize		= sizeof(struct sm4_xts_ctx),
+			.cra_module		= THIS_MODULE,
+		},
+		.min_keysize	= SM4_KEY_SIZE * 2,
+		.max_keysize	= SM4_KEY_SIZE * 2,
+		.ivsize		= SM4_BLOCK_SIZE,
+		.walksize	= SM4_BLOCK_SIZE * 2,
+		.setkey		= sm4_xts_setkey,
+		.encrypt	= sm4_xts_encrypt,
+		.decrypt	= sm4_xts_decrypt,
 	}
 };
 
@@ -451,7 +607,7 @@ static void __exit sm4_exit(void)
 module_cpu_feature_match(SM4, sm4_init);
 module_exit(sm4_exit);
 
-MODULE_DESCRIPTION("SM4 ECB/CBC/CFB/CTR using ARMv8 Crypto Extensions");
+MODULE_DESCRIPTION("SM4 ECB/CBC/CFB/CTR/XTS using ARMv8 Crypto Extensions");
 MODULE_ALIAS_CRYPTO("sm4-ce");
 MODULE_ALIAS_CRYPTO("sm4");
 MODULE_ALIAS_CRYPTO("ecb(sm4)");
@@ -459,5 +615,6 @@ MODULE_ALIAS_CRYPTO("cbc(sm4)");
 MODULE_ALIAS_CRYPTO("cfb(sm4)");
 MODULE_ALIAS_CRYPTO("ctr(sm4)");
 MODULE_ALIAS_CRYPTO("cts(cbc(sm4))");
+MODULE_ALIAS_CRYPTO("xts(sm4)");
 MODULE_AUTHOR("Tianjia Zhang <tianjia.zhang@linux.alibaba.com>");
 MODULE_LICENSE("GPL v2");
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 11/16] crypto: essiv - allow digestsize to be greater than keysize
  2022-09-26  9:36 ` Tianjia Zhang
@ 2022-09-26  9:36   ` Tianjia Zhang
  -1 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26  9:36 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
	Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
	Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
	linux-stm32

In ESSIV mode, the digest of the hash algorithm is used as the key to
encrypt the IV. The current implementation requires the digest size of
the hash algorithm to lie within the key size range of the cipher, which
excludes combinations such as essiv(cbc(sm4),sm3): the SM3 digest is
fixed at 256 bits and the SM4 key size is fixed at 128 bits, so ESSIV
mode cannot be used.

This patch allows algorithms whose digest size is greater than the key
size to use ESSIV mode by truncating the digest.
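
The truncation rule can be illustrated with a small userspace sketch.
It is not the kernel code: essiv_salt_len() and essiv_digest_usable()
are hypothetical helper names, and the two size constants are simply
the 256-bit and 128-bit figures quoted above, expressed in bytes.

```c
#include <stddef.h>

/* Sizes from the commit message, in bytes (256-bit digest, 128-bit key). */
#define SM3_DIGEST_SIZE 32
#define SM4_KEY_SIZE    16

/*
 * Hypothetical helper: number of leading digest bytes used as the key
 * that encrypts the IV. A digest longer than the key is truncated.
 */
static size_t essiv_salt_len(size_t digest_size, size_t max_key_size)
{
	return digest_size < max_key_size ? digest_size : max_key_size;
}

/*
 * Hypothetical helper mirroring the relaxed check: only a digest that
 * is shorter than the minimum key size is rejected.
 */
static int essiv_digest_usable(size_t digest_size, size_t min_key_size)
{
	return digest_size >= min_key_size;
}
```

With these assumptions, essiv_salt_len(SM3_DIGEST_SIZE, SM4_KEY_SIZE)
evaluates to 16, so only the first half of the SM3 digest is used as
the key.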

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
 crypto/essiv.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/crypto/essiv.c b/crypto/essiv.c
index e33369df9034..6ee5a61bcae4 100644
--- a/crypto/essiv.c
+++ b/crypto/essiv.c
@@ -68,6 +68,7 @@ static int essiv_skcipher_setkey(struct crypto_skcipher *tfm,
 {
 	struct essiv_tfm_ctx *tctx = crypto_skcipher_ctx(tfm);
 	u8 salt[HASH_MAX_DIGESTSIZE];
+	unsigned int saltlen;
 	int err;
 
 	crypto_skcipher_clear_flags(tctx->u.skcipher, CRYPTO_TFM_REQ_MASK);
@@ -86,8 +87,11 @@ static int essiv_skcipher_setkey(struct crypto_skcipher *tfm,
 	crypto_cipher_set_flags(tctx->essiv_cipher,
 				crypto_skcipher_get_flags(tfm) &
 				CRYPTO_TFM_REQ_MASK);
-	return crypto_cipher_setkey(tctx->essiv_cipher, salt,
-				    crypto_shash_digestsize(tctx->hash));
+
+	saltlen = min(crypto_shash_digestsize(tctx->hash),
+		      crypto_skcipher_max_keysize(tctx->u.skcipher));
+
+	return crypto_cipher_setkey(tctx->essiv_cipher, salt, saltlen);
 }
 
 static int essiv_aead_setkey(struct crypto_aead *tfm, const u8 *key,
@@ -418,8 +422,7 @@ static bool essiv_supported_algorithms(const char *essiv_cipher_name,
 	if (IS_ERR(alg))
 		return false;
 
-	if (hash_alg->digestsize < alg->cra_cipher.cia_min_keysize ||
-	    hash_alg->digestsize > alg->cra_cipher.cia_max_keysize)
+	if (hash_alg->digestsize < alg->cra_cipher.cia_min_keysize)
 		goto out;
 
 	if (ivsize != alg->cra_blocksize)
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 12/16] crypto: arm64/sm4 - add CE implementation for ESSIV mode
  2022-09-26  9:36 ` Tianjia Zhang
@ 2022-09-26  9:36   ` Tianjia Zhang
  -1 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26  9:36 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
	Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
	Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
	linux-stm32

This patch adds a CE-optimized assembly implementation of ESSIV mode.
The assembly reuses the CBC code: it encrypts the IV with a second round
key array and then branches into the existing CBC loop.
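
The data flow can be sketched in portable C as follows. This is only an
illustration under stated assumptions: toy_encrypt_block() and
toy_decrypt_block() are an invertible toy transform standing in for the
block cipher (they are NOT SM4), and all function names are
hypothetical. The point is the ordering: the IV is encrypted once with
the second key, then ordinary CBC runs with the first key. Decryption
also encrypts the IV, which matches the diff passing rkey2_enc in both
directions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLK 16

/* Toy stand-in for a block cipher (NOT SM4): XOR with key, rotate bytes. */
static void toy_encrypt_block(const uint8_t *key, uint8_t *out,
			      const uint8_t *in)
{
	uint8_t tmp[BLK];

	for (size_t i = 0; i < BLK; i++)
		tmp[(i + 1) % BLK] = (uint8_t)(in[i] ^ key[i]);
	memcpy(out, tmp, BLK);
}

/* Inverse of toy_encrypt_block(). */
static void toy_decrypt_block(const uint8_t *key, uint8_t *out,
			      const uint8_t *in)
{
	uint8_t tmp[BLK];

	for (size_t i = 0; i < BLK; i++)
		tmp[i] = (uint8_t)(in[(i + 1) % BLK] ^ key[i]);
	memcpy(out, tmp, BLK);
}

static void essiv_cbc_encrypt(const uint8_t *key1, const uint8_t *key2,
			      const uint8_t *iv, uint8_t *dst,
			      const uint8_t *src, size_t nblocks)
{
	uint8_t chain[BLK], x[BLK];

	/* Step 1: encrypt the IV with the second key. */
	toy_encrypt_block(key2, chain, iv);

	/* Step 2: ordinary CBC encryption with the first key. */
	for (size_t n = 0; n < nblocks; n++) {
		for (size_t i = 0; i < BLK; i++)
			x[i] = (uint8_t)(src[n * BLK + i] ^ chain[i]);
		toy_encrypt_block(key1, chain, x);
		memcpy(dst + n * BLK, chain, BLK);
	}
}

static void essiv_cbc_decrypt(const uint8_t *key1, const uint8_t *key2,
			      const uint8_t *iv, uint8_t *dst,
			      const uint8_t *src, size_t nblocks)
{
	uint8_t chain[BLK], c[BLK], p[BLK];

	/* The IV is encrypted (not decrypted) in both directions. */
	toy_encrypt_block(key2, chain, iv);

	for (size_t n = 0; n < nblocks; n++) {
		memcpy(c, src + n * BLK, BLK);
		toy_decrypt_block(key1, p, c);
		for (size_t i = 0; i < BLK; i++)
			dst[n * BLK + i] = (uint8_t)(p[i] ^ chain[i]);
		memcpy(chain, c, BLK);
	}
}
```

Decrypting the output of essiv_cbc_encrypt() with the same two keys and
IV returns the original plaintext, whatever invertible block transform
is plugged in.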

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
 arch/arm64/crypto/sm4-ce-core.S |  42 +++++++++++
 arch/arm64/crypto/sm4-ce-glue.c | 128 ++++++++++++++++++++++++++++++++
 2 files changed, 170 insertions(+)

diff --git a/arch/arm64/crypto/sm4-ce-core.S b/arch/arm64/crypto/sm4-ce-core.S
index ddd15ec09d38..6b923c3209a0 100644
--- a/arch/arm64/crypto/sm4-ce-core.S
+++ b/arch/arm64/crypto/sm4-ce-core.S
@@ -154,6 +154,26 @@ SYM_FUNC_START(sm4_ce_crypt)
 	ret;
 SYM_FUNC_END(sm4_ce_crypt)
 
+.align 3
+SYM_FUNC_START(sm4_ce_essiv_cbc_enc)
+	/* input:
+	 *   x0: round key array, CTX
+	 *   x1: dst
+	 *   x2: src
+	 *   x3: iv (big endian, 128 bit)
+	 *   w4: nblocks
+	 *   x5: round key array for IV
+	 */
+	ld1		{RIV.16b}, [x3]
+
+	SM4_PREPARE(x5)
+
+	SM4_CRYPT_BLK(RIV)
+
+	SM4_PREPARE(x0)
+
+	b		.Lcbc_enc_loop_4x
+
 .align 3
 SYM_FUNC_START(sm4_ce_cbc_enc)
 	/* input:
@@ -208,6 +228,27 @@ SYM_FUNC_START(sm4_ce_cbc_enc)
 
 	ret
 SYM_FUNC_END(sm4_ce_cbc_enc)
+SYM_FUNC_END(sm4_ce_essiv_cbc_enc)
+
+.align 3
+SYM_FUNC_START(sm4_ce_essiv_cbc_dec)
+	/* input:
+	 *   x0: round key array, CTX
+	 *   x1: dst
+	 *   x2: src
+	 *   x3: iv (big endian, 128 bit)
+	 *   w4: nblocks
+	 *   x5: round key array for IV
+	 */
+	ld1		{RIV.16b}, [x3]
+
+	SM4_PREPARE(x5)
+
+	SM4_CRYPT_BLK(RIV)
+
+	SM4_PREPARE(x0)
+
+	b		.Lcbc_dec_loop_8x
 
 .align 3
 SYM_FUNC_START(sm4_ce_cbc_dec)
@@ -306,6 +347,7 @@ SYM_FUNC_START(sm4_ce_cbc_dec)
 
 	ret
 SYM_FUNC_END(sm4_ce_cbc_dec)
+SYM_FUNC_END(sm4_ce_essiv_cbc_dec)
 
 .align 3
 SYM_FUNC_START(sm4_ce_cbc_cts_enc)
diff --git a/arch/arm64/crypto/sm4-ce-glue.c b/arch/arm64/crypto/sm4-ce-glue.c
index 8222766f712a..6267ec1cfac0 100644
--- a/arch/arm64/crypto/sm4-ce-glue.c
+++ b/arch/arm64/crypto/sm4-ce-glue.c
@@ -19,6 +19,8 @@
 #include <crypto/scatterwalk.h>
 #include <crypto/xts.h>
 #include <crypto/sm4.h>
+#include <crypto/sm3.h>
+#include <crypto/hash.h>
 
 #define BYTES2BLKS(nbytes)	((nbytes) >> 4)
 
@@ -35,6 +37,12 @@ asmlinkage void sm4_ce_cbc_cts_enc(const u32 *rkey, u8 *dst, const u8 *src,
 				   u8 *iv, unsigned int nbytes);
 asmlinkage void sm4_ce_cbc_cts_dec(const u32 *rkey, u8 *dst, const u8 *src,
 				   u8 *iv, unsigned int nbytes);
+asmlinkage void sm4_ce_essiv_cbc_enc(const u32 *rkey1, u8 *dst, const u8 *src,
+				     u8 *iv, unsigned int nblocks,
+				     const u32 *rkey2_enc);
+asmlinkage void sm4_ce_essiv_cbc_dec(const u32 *rkey1, u8 *dst, const u8 *src,
+				     u8 *iv, unsigned int nblocks,
+				     const u32 *rkey2_enc);
 asmlinkage void sm4_ce_cfb_enc(const u32 *rkey, u8 *dst, const u8 *src,
 			       u8 *iv, unsigned int nblks);
 asmlinkage void sm4_ce_cfb_dec(const u32 *rkey, u8 *dst, const u8 *src,
@@ -58,6 +66,12 @@ struct sm4_xts_ctx {
 	struct sm4_ctx key2;
 };
 
+struct sm4_essiv_cbc_ctx {
+	struct sm4_ctx key1;
+	struct sm4_ctx key2;
+	struct crypto_shash *hash;
+};
+
 static int sm4_setkey(struct crypto_skcipher *tfm, const u8 *key,
 		      unsigned int key_len)
 {
@@ -96,6 +110,27 @@ static int sm4_xts_setkey(struct crypto_skcipher *tfm, const u8 *key,
 	return 0;
 }
 
+static int sm4_essiv_cbc_setkey(struct crypto_skcipher *tfm, const u8 *key,
+				unsigned int key_len)
+{
+	struct sm4_essiv_cbc_ctx *ctx = crypto_skcipher_ctx(tfm);
+	u8 __aligned(8) digest[SM3_DIGEST_SIZE];
+
+	if (key_len != SM4_KEY_SIZE)
+		return -EINVAL;
+
+	crypto_shash_tfm_digest(ctx->hash, key, key_len, digest);
+
+	kernel_neon_begin();
+	sm4_ce_expand_key(key, ctx->key1.rkey_enc,
+			  ctx->key1.rkey_dec, crypto_sm4_fk, crypto_sm4_ck);
+	sm4_ce_expand_key(digest, ctx->key2.rkey_enc,
+			  ctx->key2.rkey_dec, crypto_sm4_fk, crypto_sm4_ck);
+	kernel_neon_end();
+
+	return 0;
+}
+
 static int sm4_ecb_do_crypt(struct skcipher_request *req, const u32 *rkey)
 {
 	struct skcipher_walk walk;
@@ -497,6 +532,81 @@ static int sm4_xts_decrypt(struct skcipher_request *req)
 	return sm4_xts_crypt(req, false);
 }
 
+static int sm4_essiv_cbc_init_tfm(struct crypto_skcipher *tfm)
+{
+	struct sm4_essiv_cbc_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+	ctx->hash = crypto_alloc_shash("sm3", 0, 0);
+
+	return PTR_ERR_OR_ZERO(ctx->hash);
+}
+
+static void sm4_essiv_cbc_exit_tfm(struct crypto_skcipher *tfm)
+{
+	struct sm4_essiv_cbc_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+	crypto_free_shash(ctx->hash);
+}
+
+static int sm4_essiv_cbc_crypt(struct skcipher_request *req, bool encrypt)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct sm4_essiv_cbc_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct skcipher_walk walk;
+	unsigned int nblocks;
+	int err;
+
+	err = skcipher_walk_virt(&walk, req, false);
+
+	if ((nblocks = walk.nbytes / SM4_BLOCK_SIZE) > 0) {
+		kernel_neon_begin();
+
+		if (encrypt)
+			sm4_ce_essiv_cbc_enc(ctx->key1.rkey_enc,
+					     walk.dst.virt.addr,
+					     walk.src.virt.addr, walk.iv,
+					     nblocks, ctx->key2.rkey_enc);
+		else
+			sm4_ce_essiv_cbc_dec(ctx->key1.rkey_dec,
+					     walk.dst.virt.addr,
+					     walk.src.virt.addr, walk.iv,
+					     nblocks, ctx->key2.rkey_enc);
+
+		kernel_neon_end();
+
+		err = skcipher_walk_done(&walk, walk.nbytes % SM4_BLOCK_SIZE);
+		if (err)
+			return err;
+	}
+
+	while ((nblocks = walk.nbytes / SM4_BLOCK_SIZE) > 0) {
+		kernel_neon_begin();
+
+		if (encrypt)
+			sm4_ce_cbc_enc(ctx->key1.rkey_enc, walk.dst.virt.addr,
+				       walk.src.virt.addr, walk.iv, nblocks);
+		else
+			sm4_ce_cbc_dec(ctx->key1.rkey_dec, walk.dst.virt.addr,
+				       walk.src.virt.addr, walk.iv, nblocks);
+
+		kernel_neon_end();
+
+		err = skcipher_walk_done(&walk, walk.nbytes % SM4_BLOCK_SIZE);
+	}
+
+	return err;
+}
+
+static int sm4_essiv_cbc_encrypt(struct skcipher_request *req)
+{
+	return sm4_essiv_cbc_crypt(req, true);
+}
+
+static int sm4_essiv_cbc_decrypt(struct skcipher_request *req)
+{
+	return sm4_essiv_cbc_crypt(req, false);
+}
+
 static struct skcipher_alg sm4_algs[] = {
 	{
 		.base = {
@@ -591,6 +701,23 @@ static struct skcipher_alg sm4_algs[] = {
 		.setkey		= sm4_xts_setkey,
 		.encrypt	= sm4_xts_encrypt,
 		.decrypt	= sm4_xts_decrypt,
+	}, {
+		.base = {
+			.cra_name		= "essiv(cbc(sm4),sm3)",
+			.cra_driver_name	= "essiv-cbc-sm4-sm3-ce",
+			.cra_priority		= 400 + 1,
+			.cra_blocksize		= SM4_BLOCK_SIZE,
+			.cra_ctxsize		= sizeof(struct sm4_essiv_cbc_ctx),
+			.cra_module		= THIS_MODULE,
+		},
+		.min_keysize	= SM4_KEY_SIZE,
+		.max_keysize	= SM4_KEY_SIZE,
+		.ivsize		= SM4_BLOCK_SIZE,
+		.setkey		= sm4_essiv_cbc_setkey,
+		.encrypt	= sm4_essiv_cbc_encrypt,
+		.decrypt	= sm4_essiv_cbc_decrypt,
+		.init		= sm4_essiv_cbc_init_tfm,
+		.exit		= sm4_essiv_cbc_exit_tfm,
 	}
 };
 
@@ -616,5 +743,6 @@ MODULE_ALIAS_CRYPTO("cfb(sm4)");
 MODULE_ALIAS_CRYPTO("ctr(sm4)");
 MODULE_ALIAS_CRYPTO("cts(cbc(sm4))");
 MODULE_ALIAS_CRYPTO("xts(sm4)");
+MODULE_ALIAS_CRYPTO("essiv(cbc(sm4),sm3)");
 MODULE_AUTHOR("Tianjia Zhang <tianjia.zhang@linux.alibaba.com>");
 MODULE_LICENSE("GPL v2");
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 13/16] crypto: arm64/sm4 - add CE implementation for cmac/xcbc/cbcmac
  2022-09-26  9:36 ` Tianjia Zhang
@ 2022-09-26  9:36   ` Tianjia Zhang
  -1 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26  9:36 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
	Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
	Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
	linux-stm32

This patch adds a CE-optimized assembly implementation of cmac/xcbc/cbcmac.
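
For orientation, the cmac setkey path in the diff below derives two
constants by doubling the encrypted zero block in GF(2^128). A minimal
userspace sketch of that arithmetic follows. struct u128,
gf128_double() and cmac_subkeys() are illustrative names, not kernel
symbols, and the halves are assumed to be already converted to host
byte order.

```c
#include <stdint.h>

/* 128-bit value as two host-order halves; hi holds the top 64 bits. */
struct u128 {
	uint64_t hi;
	uint64_t lo;
};

/*
 * Multiply by x in GF(2^128): shift left by one bit and, if a bit was
 * shifted out of the top, fold it back in with the constant 0x87.
 */
static struct u128 gf128_double(struct u128 v)
{
	struct u128 r;

	r.hi = (v.hi << 1) | (v.lo >> 63);
	r.lo = (v.lo << 1) ^ ((v.hi >> 63) ? 0x87 : 0);
	return r;
}

/* Derive the two CMAC constants from l, the encrypted zero block. */
static void cmac_subkeys(struct u128 l, struct u128 *k1, struct u128 *k2)
{
	*k1 = gf128_double(l);
	*k2 = gf128_double(*k1);
}
```

As a pure arithmetic check, doubling hi = 0x7df76b0c1ab899b3,
lo = 0x3e42f047b91b546f gives hi = 0xfbeed61835713366,
lo = 0x7c85e08f7236a8de; no reduction is applied because the top bit of
the input is clear.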

Benchmarked on a T-Head Yitian-710 at 2.75 GHz using mode 300 of tcrypt,
comparing performance before and after this patch (the driver used
before this patch is XXXmac(sm4-ce)). The columns are update sizes, and
the throughput is tabulated in Mb/s:

Before:

update-size    |      16      64     256    1024    2048    4096    8192
---------------+--------------------------------------------------------
cmac(sm4-ce)   |  293.33  403.69  503.76  527.78  531.10  535.46  535.81
xcbc(sm4-ce)   |  292.83  402.50  504.02  529.08  529.87  536.55  538.24
cbcmac(sm4-ce) |  318.42  415.79  497.12  515.05  523.15  521.19  523.01

After:

update-size    |      16      64     256    1024    2048    4096    8192
---------------+--------------------------------------------------------
cmac-sm4-ce    |  371.99  675.28  903.56  971.65  980.57  990.40  991.04
xcbc-sm4-ce    |  372.11  674.55  903.47  971.61  980.96  990.42  991.10
cbcmac-sm4-ce  |  371.63  675.33  903.23  972.07  981.42  990.93  991.45
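
One way to read the two tables is as a per-column speedup, the ratio of
the "After" figure to the "Before" figure. The sketch below does this
for the cmac rows only. The array and function names are made up for
illustration, and the numbers are copied from the tables above.

```c
#define NSIZES 7

/* Update sizes and cmac throughput (Mb/s), copied from the tables. */
static const int update_sizes[NSIZES] = {
	16, 64, 256, 1024, 2048, 4096, 8192
};
static const double cmac_before[NSIZES] = {
	293.33, 403.69, 503.76, 527.78, 531.10, 535.46, 535.81
};
static const double cmac_after[NSIZES] = {
	371.99, 675.28, 903.56, 971.65, 980.57, 990.40, 991.04
};

/* Speedup factor: throughput after the patch over throughput before. */
static double speedup(double before, double after)
{
	return after / before;
}
```

For cmac this works out to roughly 1.27x at update size 16 and roughly
1.85x at update size 8192, so the gain grows with the update size.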

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
 arch/arm64/crypto/sm4-ce-core.S |  70 +++++++++
 arch/arm64/crypto/sm4-ce-glue.c | 267 +++++++++++++++++++++++++++++++-
 2 files changed, 336 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/crypto/sm4-ce-core.S b/arch/arm64/crypto/sm4-ce-core.S
index 6b923c3209a0..69fe3b90b7ad 100644
--- a/arch/arm64/crypto/sm4-ce-core.S
+++ b/arch/arm64/crypto/sm4-ce-core.S
@@ -35,6 +35,7 @@
 #define RTMP3	v19
 
 #define RIV	v20
+#define RMAC	v20
 #define RMASK	v21
 
 
@@ -1049,6 +1050,75 @@ SYM_FUNC_START(sm4_ce_xts_dec)
 	ret
 SYM_FUNC_END(sm4_ce_xts_dec)
 
+.align 3
+SYM_FUNC_START(sm4_ce_mac_update)
+	/* input:
+	 *   x0: round key array, CTX
+	 *   x1: digest
+	 *   x2: src
+	 *   w3: nblocks
+	 *   w4: enc_before
+	 *   w5: enc_after
+	 */
+	SM4_PREPARE(x0)
+
+	ld1		{RMAC.16b}, [x1]
+
+	cbz		w4, .Lmac_update
+
+	SM4_CRYPT_BLK(RMAC)
+
+.Lmac_update:
+	cbz		w3, .Lmac_ret
+
+	sub		w6, w3, #1
+	cmp		w5, wzr
+	csel		w3, w3, w6, ne
+
+	cbz		w3, .Lmac_end
+
+.Lmac_loop_4x:
+	cmp		w3, #4
+	blt		.Lmac_loop_1x
+
+	sub		w3, w3, #4
+
+	ld1		{v0.16b-v3.16b}, [x2], #64
+
+	eor		RMAC.16b, RMAC.16b, v0.16b
+	SM4_CRYPT_BLK(RMAC)
+	eor		RMAC.16b, RMAC.16b, v1.16b
+	SM4_CRYPT_BLK(RMAC)
+	eor		RMAC.16b, RMAC.16b, v2.16b
+	SM4_CRYPT_BLK(RMAC)
+	eor		RMAC.16b, RMAC.16b, v3.16b
+	SM4_CRYPT_BLK(RMAC)
+
+	cbz		w3, .Lmac_end
+	b		.Lmac_loop_4x
+
+.Lmac_loop_1x:
+	sub		w3, w3, #1
+
+	ld1		{v0.16b}, [x2], #16
+
+	eor		RMAC.16b, RMAC.16b, v0.16b
+	SM4_CRYPT_BLK(RMAC)
+
+	cbnz		w3, .Lmac_loop_1x
+
+
+.Lmac_end:
+	cbnz		w5, .Lmac_ret
+
+	ld1		{v0.16b}, [x2], #16
+	eor		RMAC.16b, RMAC.16b, v0.16b
+
+.Lmac_ret:
+	st1		{RMAC.16b}, [x1]
+	ret
+SYM_FUNC_END(sm4_ce_mac_update)
+
 
 	.section	".rodata", "a"
 	.align 4
diff --git a/arch/arm64/crypto/sm4-ce-glue.c b/arch/arm64/crypto/sm4-ce-glue.c
index 6267ec1cfac0..c2d10b8e92b2 100644
--- a/arch/arm64/crypto/sm4-ce-glue.c
+++ b/arch/arm64/crypto/sm4-ce-glue.c
@@ -14,8 +14,10 @@
 #include <linux/cpufeature.h>
 #include <asm/neon.h>
 #include <asm/simd.h>
+#include <crypto/b128ops.h>
 #include <crypto/internal/simd.h>
 #include <crypto/internal/skcipher.h>
+#include <crypto/internal/hash.h>
 #include <crypto/scatterwalk.h>
 #include <crypto/xts.h>
 #include <crypto/sm4.h>
@@ -55,6 +57,9 @@ asmlinkage void sm4_ce_xts_enc(const u32 *rkey1, u8 *dst, const u8 *src,
 asmlinkage void sm4_ce_xts_dec(const u32 *rkey1, u8 *dst, const u8 *src,
 			       u8 *tweak, unsigned int nbytes,
 			       const u32 *rkey2_enc);
+asmlinkage void sm4_ce_mac_update(const u32 *rkey_enc, u8 *digest,
+				  const u8 *src, unsigned int nblocks,
+				  bool enc_before, bool enc_after);
 
 EXPORT_SYMBOL(sm4_ce_expand_key);
 EXPORT_SYMBOL(sm4_ce_crypt_block);
@@ -72,6 +77,16 @@ struct sm4_essiv_cbc_ctx {
 	struct crypto_shash *hash;
 };
 
+struct sm4_mac_tfm_ctx {
+	struct sm4_ctx key;
+	u8 __aligned(8) consts[];
+};
+
+struct sm4_mac_desc_ctx {
+	unsigned int len;
+	u8 digest[SM4_BLOCK_SIZE];
+};
+
 static int sm4_setkey(struct crypto_skcipher *tfm, const u8 *key,
 		      unsigned int key_len)
 {
@@ -721,13 +736,260 @@ static struct skcipher_alg sm4_algs[] = {
 	}
 };
 
+static int sm4_cbcmac_setkey(struct crypto_shash *tfm, const u8 *key,
+			     unsigned int key_len)
+{
+	struct sm4_mac_tfm_ctx *ctx = crypto_shash_ctx(tfm);
+
+	if (key_len != SM4_KEY_SIZE)
+		return -EINVAL;
+
+	kernel_neon_begin();
+	sm4_ce_expand_key(key, ctx->key.rkey_enc, ctx->key.rkey_dec,
+			  crypto_sm4_fk, crypto_sm4_ck);
+	kernel_neon_end();
+
+	return 0;
+}
+
+static int sm4_cmac_setkey(struct crypto_shash *tfm, const u8 *key,
+			   unsigned int key_len)
+{
+	struct sm4_mac_tfm_ctx *ctx = crypto_shash_ctx(tfm);
+	be128 *consts = (be128 *)ctx->consts;
+	u64 a, b;
+
+	if (key_len != SM4_KEY_SIZE)
+		return -EINVAL;
+
+	memset(consts, 0, SM4_BLOCK_SIZE);
+
+	kernel_neon_begin();
+
+	sm4_ce_expand_key(key, ctx->key.rkey_enc, ctx->key.rkey_dec,
+			  crypto_sm4_fk, crypto_sm4_ck);
+
+	/* encrypt the zero block */
+	sm4_ce_crypt_block(ctx->key.rkey_enc, (u8 *)consts, (const u8 *)consts);
+
+	kernel_neon_end();
+
+	/* gf(2^128) multiply zero-ciphertext with u and u^2 */
+	a = be64_to_cpu(consts[0].a);
+	b = be64_to_cpu(consts[0].b);
+	consts[0].a = cpu_to_be64((a << 1) | (b >> 63));
+	consts[0].b = cpu_to_be64((b << 1) ^ ((a >> 63) ? 0x87 : 0));
+
+	a = be64_to_cpu(consts[0].a);
+	b = be64_to_cpu(consts[0].b);
+	consts[1].a = cpu_to_be64((a << 1) | (b >> 63));
+	consts[1].b = cpu_to_be64((b << 1) ^ ((a >> 63) ? 0x87 : 0));
+
+	return 0;
+}
+
+static int sm4_xcbc_setkey(struct crypto_shash *tfm, const u8 *key,
+			   unsigned int key_len)
+{
+	struct sm4_mac_tfm_ctx *ctx = crypto_shash_ctx(tfm);
+	u8 __aligned(8) key2[SM4_BLOCK_SIZE];
+	static u8 const ks[3][SM4_BLOCK_SIZE] = {
+		{ [0 ... SM4_BLOCK_SIZE - 1] = 0x1},
+		{ [0 ... SM4_BLOCK_SIZE - 1] = 0x2},
+		{ [0 ... SM4_BLOCK_SIZE - 1] = 0x3},
+	};
+
+	if (key_len != SM4_KEY_SIZE)
+		return -EINVAL;
+
+	kernel_neon_begin();
+
+	sm4_ce_expand_key(key, ctx->key.rkey_enc, ctx->key.rkey_dec,
+			  crypto_sm4_fk, crypto_sm4_ck);
+
+	sm4_ce_crypt_block(ctx->key.rkey_enc, key2, ks[0]);
+	sm4_ce_crypt(ctx->key.rkey_enc, ctx->consts, ks[1], 2);
+
+	sm4_ce_expand_key(key2, ctx->key.rkey_enc, ctx->key.rkey_dec,
+			  crypto_sm4_fk, crypto_sm4_ck);
+
+	kernel_neon_end();
+
+	return 0;
+}
+
+static int sm4_mac_init(struct shash_desc *desc)
+{
+	struct sm4_mac_desc_ctx *ctx = shash_desc_ctx(desc);
+
+	memset(ctx->digest, 0, SM4_BLOCK_SIZE);
+	ctx->len = 0;
+
+	return 0;
+}
+
+static int sm4_mac_update(struct shash_desc *desc, const u8 *p,
+			  unsigned int len)
+{
+	struct sm4_mac_tfm_ctx *tctx = crypto_shash_ctx(desc->tfm);
+	struct sm4_mac_desc_ctx *ctx = shash_desc_ctx(desc);
+	unsigned int l, nblocks;
+
+	if (len == 0)
+		return 0;
+
+	if (ctx->len || ctx->len + len < SM4_BLOCK_SIZE) {
+		l = min(len, SM4_BLOCK_SIZE - ctx->len);
+
+		crypto_xor(ctx->digest + ctx->len, p, l);
+		ctx->len += l;
+		len -= l;
+		p += l;
+	}
+
+	if (len && (ctx->len % SM4_BLOCK_SIZE) == 0) {
+		kernel_neon_begin();
+
+		if (len < SM4_BLOCK_SIZE && ctx->len == SM4_BLOCK_SIZE) {
+			sm4_ce_crypt_block(tctx->key.rkey_enc,
+					   ctx->digest, ctx->digest);
+			ctx->len = 0;
+		} else {
+			nblocks = len / SM4_BLOCK_SIZE;
+			len %= SM4_BLOCK_SIZE;
+
+			sm4_ce_mac_update(tctx->key.rkey_enc, ctx->digest, p,
+					  nblocks, (ctx->len == SM4_BLOCK_SIZE),
+					  (len != 0));
+
+			p += nblocks * SM4_BLOCK_SIZE;
+
+			if (len == 0)
+				ctx->len = SM4_BLOCK_SIZE;
+		}
+
+		kernel_neon_end();
+
+		if (len) {
+			crypto_xor(ctx->digest, p, len);
+			ctx->len = len;
+		}
+	}
+
+	return 0;
+}
+
+static int sm4_cmac_final(struct shash_desc *desc, u8 *out)
+{
+	struct sm4_mac_tfm_ctx *tctx = crypto_shash_ctx(desc->tfm);
+	struct sm4_mac_desc_ctx *ctx = shash_desc_ctx(desc);
+	const u8 *consts = tctx->consts;
+
+	if (ctx->len != SM4_BLOCK_SIZE) {
+		ctx->digest[ctx->len] ^= 0x80;
+		consts += SM4_BLOCK_SIZE;
+	}
+
+	kernel_neon_begin();
+	sm4_ce_mac_update(tctx->key.rkey_enc, ctx->digest, consts, 1,
+			  false, true);
+	kernel_neon_end();
+
+	memcpy(out, ctx->digest, SM4_BLOCK_SIZE);
+
+	return 0;
+}
+
+static int sm4_cbcmac_final(struct shash_desc *desc, u8 *out)
+{
+	struct sm4_mac_tfm_ctx *tctx = crypto_shash_ctx(desc->tfm);
+	struct sm4_mac_desc_ctx *ctx = shash_desc_ctx(desc);
+
+	if (ctx->len) {
+		kernel_neon_begin();
+		sm4_ce_crypt_block(tctx->key.rkey_enc, ctx->digest,
+				   ctx->digest);
+		kernel_neon_end();
+	}
+
+	memcpy(out, ctx->digest, SM4_BLOCK_SIZE);
+
+	return 0;
+}
+
+static struct shash_alg sm4_mac_algs[] = {
+	{
+		.base = {
+			.cra_name		= "cmac(sm4)",
+			.cra_driver_name	= "cmac-sm4-ce",
+			.cra_priority		= 400,
+			.cra_blocksize		= SM4_BLOCK_SIZE,
+			.cra_ctxsize		= sizeof(struct sm4_mac_tfm_ctx)
+							+ SM4_BLOCK_SIZE * 2,
+			.cra_module		= THIS_MODULE,
+		},
+		.digestsize	= SM4_BLOCK_SIZE,
+		.init		= sm4_mac_init,
+		.update		= sm4_mac_update,
+		.final		= sm4_cmac_final,
+		.setkey		= sm4_cmac_setkey,
+		.descsize	= sizeof(struct sm4_mac_desc_ctx),
+	}, {
+		.base = {
+			.cra_name		= "xcbc(sm4)",
+			.cra_driver_name	= "xcbc-sm4-ce",
+			.cra_priority		= 400,
+			.cra_blocksize		= SM4_BLOCK_SIZE,
+			.cra_ctxsize		= sizeof(struct sm4_mac_tfm_ctx)
+							+ SM4_BLOCK_SIZE * 2,
+			.cra_module		= THIS_MODULE,
+		},
+		.digestsize	= SM4_BLOCK_SIZE,
+		.init		= sm4_mac_init,
+		.update		= sm4_mac_update,
+		.final		= sm4_cmac_final,
+		.setkey		= sm4_xcbc_setkey,
+		.descsize	= sizeof(struct sm4_mac_desc_ctx),
+	}, {
+		.base = {
+			.cra_name		= "cbcmac(sm4)",
+			.cra_driver_name	= "cbcmac-sm4-ce",
+			.cra_priority		= 400,
+			.cra_blocksize		= 1,
+			.cra_ctxsize		= sizeof(struct sm4_mac_tfm_ctx),
+			.cra_module		= THIS_MODULE,
+		},
+		.digestsize	= SM4_BLOCK_SIZE,
+		.init		= sm4_mac_init,
+		.update		= sm4_mac_update,
+		.final		= sm4_cbcmac_final,
+		.setkey		= sm4_cbcmac_setkey,
+		.descsize	= sizeof(struct sm4_mac_desc_ctx),
+	}
+};
+
 static int __init sm4_init(void)
 {
-	return crypto_register_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs));
+	int err;
+
+	err = crypto_register_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs));
+	if (err)
+		return err;
+
+	err = crypto_register_shashes(sm4_mac_algs, ARRAY_SIZE(sm4_mac_algs));
+	if (err)
+		goto out_err;
+
+	return 0;
+
+out_err:
+	crypto_unregister_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs));
+	return err;
 }
 
 static void __exit sm4_exit(void)
 {
+	crypto_unregister_shashes(sm4_mac_algs, ARRAY_SIZE(sm4_mac_algs));
 	crypto_unregister_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs));
 }
 
@@ -744,5 +1006,8 @@ MODULE_ALIAS_CRYPTO("ctr(sm4)");
 MODULE_ALIAS_CRYPTO("cts(cbc(sm4))");
 MODULE_ALIAS_CRYPTO("xts(sm4)");
 MODULE_ALIAS_CRYPTO("essiv(cbc(sm4),sm3)");
+MODULE_ALIAS_CRYPTO("cmac(sm4)");
+MODULE_ALIAS_CRYPTO("xcbc(sm4)");
+MODULE_ALIAS_CRYPTO("cbcmac(sm4)");
 MODULE_AUTHOR("Tianjia Zhang <tianjia.zhang@linux.alibaba.com>");
 MODULE_LICENSE("GPL v2");
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 14/16] crypto: arm64/sm4 - add CE implementation for CCM mode
  2022-09-26  9:36 ` Tianjia Zhang
@ 2022-09-26  9:36   ` Tianjia Zhang
  -1 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26  9:36 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
	Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
	Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
	linux-stm32

This patch adds a CE-optimized assembly implementation of SM4 in CCM mode.

Benchmarked on a T-Head Yitian-710 at 2.75 GHz using modes 223 and 225 of
tcrypt, comparing the performance before and after this patch (the driver
used before this patch is ccm_base(ctr-sm4-ce,cbcmac-sm4-ce)). The columns
are block lengths in bytes, and the tabulated values are throughput in
Mb/s:

Before (rfc4309(ccm_base(ctr-sm4-ce,cbcmac-sm4-ce))):

ccm(sm4)     |     16      64     256     512    1024    1420    4096    8192
-------------+---------------------------------------------------------------
  CCM enc    |  35.07  125.40  336.47  468.17  581.97  619.18  712.56  736.01
  CCM dec    |  34.87  124.40  335.08  466.75  581.04  618.81  712.25  735.89
  CCM mb enc |  34.71  123.96  333.92  465.39  579.91  617.49  711.45  734.92
  CCM mb dec |  34.42  122.80  331.02  462.81  578.28  616.42  709.88  734.19

After (rfc4309(ccm-sm4-ce)):

ccm-sm4-ce   |     16      64     256     512    1024    1420    4096    8192
-------------+---------------------------------------------------------------
  CCM enc    |  77.12  249.82  569.94  725.17  839.27  867.71  952.87  969.89
  CCM dec    |  75.90  247.26  566.29  722.12  836.90  865.95  951.74  968.57
  CCM mb enc |  75.98  245.25  562.91  718.99  834.76  864.70  950.17  967.90
  CCM mb dec |  75.06  243.78  560.58  717.13  833.68  862.70  949.35  967.11

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
 arch/arm64/crypto/Kconfig           |  16 ++
 arch/arm64/crypto/Makefile          |   3 +
 arch/arm64/crypto/sm4-ce-ccm-core.S | 328 ++++++++++++++++++++++++++++
 arch/arm64/crypto/sm4-ce-ccm-glue.c | 303 +++++++++++++++++++++++++
 4 files changed, 650 insertions(+)
 create mode 100644 arch/arm64/crypto/sm4-ce-ccm-core.S
 create mode 100644 arch/arm64/crypto/sm4-ce-ccm-glue.c

diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index 8939f5ae9214..2611036a3e3f 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -281,6 +281,22 @@ config CRYPTO_AES_ARM64_CE_CCM
 	  - ARMv8 Crypto Extensions
 	  - NEON (Advanced SIMD) extensions
 
+config CRYPTO_SM4_ARM64_CE_CCM
+	tristate "AEAD cipher: SM4 in CCM mode (ARMv8 Crypto Extensions)"
+	depends on KERNEL_MODE_NEON
+	select CRYPTO_ALGAPI
+	select CRYPTO_AEAD
+	select CRYPTO_SM4
+	select CRYPTO_SM4_ARM64_CE_BLK
+	help
+	  AEAD cipher: SM4 cipher algorithms (OSCCA GB/T 32907-2016) with
+	  CCM (Counter with Cipher Block Chaining-Message Authentication Code)
+	  authenticated encryption mode (NIST SP800-38C)
+
+	  Architecture: arm64 using:
+	  - ARMv8 Crypto Extensions
+	  - NEON (Advanced SIMD) extensions
+
 config CRYPTO_CRCT10DIF_ARM64_CE
 	tristate "CRCT10DIF (PMULL)"
 	depends on KERNEL_MODE_NEON && CRC_T10DIF
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index 087f1625e775..843ea5266965 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -29,6 +29,9 @@ sm4-ce-cipher-y := sm4-ce-cipher-glue.o sm4-ce-cipher-core.o
 obj-$(CONFIG_CRYPTO_SM4_ARM64_CE_BLK) += sm4-ce.o
 sm4-ce-y := sm4-ce-glue.o sm4-ce-core.o
 
+obj-$(CONFIG_CRYPTO_SM4_ARM64_CE_CCM) += sm4-ce-ccm.o
+sm4-ce-ccm-y := sm4-ce-ccm-glue.o sm4-ce-ccm-core.o
+
 obj-$(CONFIG_CRYPTO_SM4_ARM64_NEON_BLK) += sm4-neon.o
 sm4-neon-y := sm4-neon-glue.o sm4-neon-core.o
 
diff --git a/arch/arm64/crypto/sm4-ce-ccm-core.S b/arch/arm64/crypto/sm4-ce-ccm-core.S
new file mode 100644
index 000000000000..028207c4afd0
--- /dev/null
+++ b/arch/arm64/crypto/sm4-ce-ccm-core.S
@@ -0,0 +1,328 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * SM4-CCM AEAD Algorithm using ARMv8 Crypto Extensions
+ * as specified in rfc8998
+ * https://datatracker.ietf.org/doc/html/rfc8998
+ *
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+#include "sm4-ce-asm.h"
+
+.arch	armv8-a+crypto
+
+.irp b, 0, 1, 8, 9, 10, 11, 12, 13, 14, 15, 16, 24, 25, 26, 27, 28, 29, 30, 31
+	.set .Lv\b\().4s, \b
+.endr
+
+.macro sm4e, vd, vn
+	.inst 0xcec08400 | (.L\vn << 5) | .L\vd
+.endm
+
+/* Register macros */
+
+#define RMAC	v16
+
+/* Helper macros. */
+
+#define inc_le128(vctr)					\
+		mov		vctr.d[1], x8;		\
+		mov		vctr.d[0], x7;		\
+		adds		x8, x8, #1;		\
+		rev64		vctr.16b, vctr.16b;	\
+		adc		x7, x7, xzr;
+
+
+.align 3
+SYM_FUNC_START(sm4_ce_cbcmac_update)
+	/* input:
+	 *   x0: round key array, CTX
+	 *   x1: mac
+	 *   x2: src
+	 *   w3: nblocks
+	 */
+	SM4_PREPARE(x0)
+
+	ld1		{RMAC.16b}, [x1]
+
+.Lcbcmac_loop_4x:
+	cmp		w3, #4
+	blt		.Lcbcmac_loop_1x
+
+	sub		w3, w3, #4
+
+	ld1		{v0.16b-v3.16b}, [x2], #64
+
+	SM4_CRYPT_BLK(RMAC)
+	eor		RMAC.16b, RMAC.16b, v0.16b
+	SM4_CRYPT_BLK(RMAC)
+	eor		RMAC.16b, RMAC.16b, v1.16b
+	SM4_CRYPT_BLK(RMAC)
+	eor		RMAC.16b, RMAC.16b, v2.16b
+	SM4_CRYPT_BLK(RMAC)
+	eor		RMAC.16b, RMAC.16b, v3.16b
+
+	cbz		w3, .Lcbcmac_end
+	b		.Lcbcmac_loop_4x
+
+.Lcbcmac_loop_1x:
+	sub		w3, w3, #1
+
+	ld1		{v0.16b}, [x2], #16
+
+	SM4_CRYPT_BLK(RMAC)
+	eor		RMAC.16b, RMAC.16b, v0.16b
+
+	cbnz		w3, .Lcbcmac_loop_1x
+
+.Lcbcmac_end:
+	st1		{RMAC.16b}, [x1]
+	ret
+SYM_FUNC_END(sm4_ce_cbcmac_update)
+
+.align 3
+SYM_FUNC_START(sm4_ce_ccm_final)
+	/* input:
+	 *   x0: round key array, CTX
+	 *   x1: ctr0 (big endian, 128 bit)
+	 *   x2: mac
+	 */
+	SM4_PREPARE(x0)
+
+	ld1		{RMAC.16b}, [x2]
+	ld1		{v0.16b}, [x1]
+
+	SM4_CRYPT_BLK2(RMAC, v0)
+
+	/* en-/decrypt the mac with ctr0 */
+	eor		RMAC.16b, RMAC.16b, v0.16b
+	st1		{RMAC.16b}, [x2]
+
+	ret
+SYM_FUNC_END(sm4_ce_ccm_final)
+
+.align 3
+SYM_FUNC_START(sm4_ce_ccm_enc)
+	/* input:
+	 *   x0: round key array, CTX
+	 *   x1: dst
+	 *   x2: src
+	 *   x3: ctr (big endian, 128 bit)
+	 *   w4: nbytes
+	 *   x5: mac
+	 */
+	SM4_PREPARE(x0)
+
+	ldp		x7, x8, [x3]
+	rev		x7, x7
+	rev		x8, x8
+
+	ld1		{RMAC.16b}, [x5]
+
+.Lccm_enc_loop_4x:
+	cmp		w4, #(4 * 16)
+	blt		.Lccm_enc_loop_1x
+
+	sub		w4, w4, #(4 * 16)
+
+	/* construct CTRs */
+	inc_le128(v8)			/* +0 */
+	inc_le128(v9)			/* +1 */
+	inc_le128(v10)			/* +2 */
+	inc_le128(v11)			/* +3 */
+
+	ld1		{v0.16b-v3.16b}, [x2], #64
+
+	SM4_CRYPT_BLK2(v8, RMAC)
+	eor		v8.16b, v8.16b, v0.16b
+	eor		RMAC.16b, RMAC.16b, v0.16b
+	SM4_CRYPT_BLK2(v9, RMAC)
+	eor		v9.16b, v9.16b, v1.16b
+	eor		RMAC.16b, RMAC.16b, v1.16b
+	SM4_CRYPT_BLK2(v10, RMAC)
+	eor		v10.16b, v10.16b, v2.16b
+	eor		RMAC.16b, RMAC.16b, v2.16b
+	SM4_CRYPT_BLK2(v11, RMAC)
+	eor		v11.16b, v11.16b, v3.16b
+	eor		RMAC.16b, RMAC.16b, v3.16b
+
+	st1		{v8.16b-v11.16b}, [x1], #64
+
+	cbz		w4, .Lccm_enc_end
+	b		.Lccm_enc_loop_4x
+
+.Lccm_enc_loop_1x:
+	cmp		w4, #16
+	blt		.Lccm_enc_tail
+
+	sub		w4, w4, #16
+
+	/* construct CTRs */
+	inc_le128(v8)
+
+	ld1		{v0.16b}, [x2], #16
+
+	SM4_CRYPT_BLK2(v8, RMAC)
+	eor		v8.16b, v8.16b, v0.16b
+	eor		RMAC.16b, RMAC.16b, v0.16b
+
+	st1		{v8.16b}, [x1], #16
+
+	cbz		w4, .Lccm_enc_end
+	b		.Lccm_enc_loop_1x
+
+.Lccm_enc_tail:
+	/* construct CTRs */
+	inc_le128(v8)
+
+	SM4_CRYPT_BLK2(RMAC, v8)
+
+	/* store new MAC */
+	st1		{RMAC.16b}, [x5]
+
+.Lccm_enc_tail_loop:
+	ldrb		w0, [x2], #1		/* get 1 byte from input */
+	umov		w9, v8.b[0]		/* get top crypted CTR byte */
+	umov		w6, RMAC.b[0]		/* get top MAC byte */
+
+	eor		w9, w9, w0		/* w9 = CTR ^ input */
+	eor		w6, w6, w0		/* w6 = MAC ^ input */
+
+	strb		w9, [x1], #1		/* store out byte */
+	strb		w6, [x5], #1		/* store MAC byte */
+
+	subs		w4, w4, #1
+	beq		.Lccm_enc_ret
+
+	/* shift out one byte */
+	ext		RMAC.16b, RMAC.16b, RMAC.16b, #1
+	ext		v8.16b, v8.16b, v8.16b, #1
+
+	b		.Lccm_enc_tail_loop
+
+.Lccm_enc_end:
+	/* store new MAC */
+	st1		{RMAC.16b}, [x5]
+
+	/* store new CTR */
+	rev		x7, x7
+	rev		x8, x8
+	stp		x7, x8, [x3]
+
+.Lccm_enc_ret:
+	ret
+SYM_FUNC_END(sm4_ce_ccm_enc)
+
+.align 3
+SYM_FUNC_START(sm4_ce_ccm_dec)
+	/* input:
+	 *   x0: round key array, CTX
+	 *   x1: dst
+	 *   x2: src
+	 *   x3: ctr (big endian, 128 bit)
+	 *   w4: nbytes
+	 *   x5: mac
+	 */
+	SM4_PREPARE(x0)
+
+	ldp		x7, x8, [x3]
+	rev		x7, x7
+	rev		x8, x8
+
+	ld1		{RMAC.16b}, [x5]
+
+.Lccm_dec_loop_4x:
+	cmp		w4, #(4 * 16)
+	blt		.Lccm_dec_loop_1x
+
+	sub		w4, w4, #(4 * 16)
+
+	/* construct CTRs */
+	inc_le128(v8)			/* +0 */
+	inc_le128(v9)			/* +1 */
+	inc_le128(v10)			/* +2 */
+	inc_le128(v11)			/* +3 */
+
+	ld1		{v0.16b-v3.16b}, [x2], #64
+
+	SM4_CRYPT_BLK2(v8, RMAC)
+	eor		v8.16b, v8.16b, v0.16b
+	eor		RMAC.16b, RMAC.16b, v8.16b
+	SM4_CRYPT_BLK2(v9, RMAC)
+	eor		v9.16b, v9.16b, v1.16b
+	eor		RMAC.16b, RMAC.16b, v9.16b
+	SM4_CRYPT_BLK2(v10, RMAC)
+	eor		v10.16b, v10.16b, v2.16b
+	eor		RMAC.16b, RMAC.16b, v10.16b
+	SM4_CRYPT_BLK2(v11, RMAC)
+	eor		v11.16b, v11.16b, v3.16b
+	eor		RMAC.16b, RMAC.16b, v11.16b
+
+	st1		{v8.16b-v11.16b}, [x1], #64
+
+	cbz		w4, .Lccm_dec_end
+	b		.Lccm_dec_loop_4x
+
+.Lccm_dec_loop_1x:
+	cmp		w4, #16
+	blt		.Lccm_dec_tail
+
+	sub		w4, w4, #16
+
+	/* construct CTRs */
+	inc_le128(v8)
+
+	ld1		{v0.16b}, [x2], #16
+
+	SM4_CRYPT_BLK2(v8, RMAC)
+	eor		v8.16b, v8.16b, v0.16b
+	eor		RMAC.16b, RMAC.16b, v8.16b
+
+	st1		{v8.16b}, [x1], #16
+
+	cbz		w4, .Lccm_dec_end
+	b		.Lccm_dec_loop_1x
+
+.Lccm_dec_tail:
+	/* construct CTRs */
+	inc_le128(v8)
+
+	SM4_CRYPT_BLK2(RMAC, v8)
+
+	/* store new MAC */
+	st1		{RMAC.16b}, [x5]
+
+.Lccm_dec_tail_loop:
+	ldrb		w0, [x2], #1		/* get 1 byte from input */
+	umov		w9, v8.b[0]		/* get top crypted CTR byte */
+	umov		w6, RMAC.b[0]		/* get top MAC byte */
+
+	eor		w9, w9, w0		/* w9 = CTR ^ input */
+	eor		w6, w6, w9		/* w6 = MAC ^ output */
+
+	strb		w9, [x1], #1		/* store out byte */
+	strb		w6, [x5], #1		/* store MAC byte */
+
+	subs		w4, w4, #1
+	beq		.Lccm_dec_ret
+
+	/* shift out one byte */
+	ext		RMAC.16b, RMAC.16b, RMAC.16b, #1
+	ext		v8.16b, v8.16b, v8.16b, #1
+
+	b		.Lccm_dec_tail_loop
+
+.Lccm_dec_end:
+	/* store new MAC */
+	st1		{RMAC.16b}, [x5]
+
+	/* store new CTR */
+	rev		x7, x7
+	rev		x8, x8
+	stp		x7, x8, [x3]
+
+.Lccm_dec_ret:
+	ret
+SYM_FUNC_END(sm4_ce_ccm_dec)
diff --git a/arch/arm64/crypto/sm4-ce-ccm-glue.c b/arch/arm64/crypto/sm4-ce-ccm-glue.c
new file mode 100644
index 000000000000..f2cec7b52efc
--- /dev/null
+++ b/arch/arm64/crypto/sm4-ce-ccm-glue.c
@@ -0,0 +1,303 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * SM4-CCM AEAD Algorithm using ARMv8 Crypto Extensions
+ * as specified in rfc8998
+ * https://datatracker.ietf.org/doc/html/rfc8998
+ *
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#include <linux/module.h>
+#include <linux/crypto.h>
+#include <linux/kernel.h>
+#include <linux/cpufeature.h>
+#include <asm/neon.h>
+#include <crypto/scatterwalk.h>
+#include <crypto/internal/aead.h>
+#include <crypto/internal/skcipher.h>
+#include <crypto/sm4.h>
+#include "sm4-ce.h"
+
+asmlinkage void sm4_ce_cbcmac_update(const u32 *rkey_enc, u8 *mac,
+				     const u8 *src, unsigned int nblocks);
+asmlinkage void sm4_ce_ccm_enc(const u32 *rkey_enc, u8 *dst, const u8 *src,
+			       u8 *iv, unsigned int nbytes, u8 *mac);
+asmlinkage void sm4_ce_ccm_dec(const u32 *rkey_enc, u8 *dst, const u8 *src,
+			       u8 *iv, unsigned int nbytes, u8 *mac);
+asmlinkage void sm4_ce_ccm_final(const u32 *rkey_enc, u8 *iv, u8 *mac);
+
+
+static int ccm_setkey(struct crypto_aead *tfm, const u8 *key,
+		      unsigned int key_len)
+{
+	struct sm4_ctx *ctx = crypto_aead_ctx(tfm);
+
+	if (key_len != SM4_KEY_SIZE)
+		return -EINVAL;
+
+	kernel_neon_begin();
+	sm4_ce_expand_key(key, ctx->rkey_enc, ctx->rkey_dec,
+			  crypto_sm4_fk, crypto_sm4_ck);
+	kernel_neon_end();
+
+	return 0;
+}
+
+static int ccm_setauthsize(struct crypto_aead *tfm, unsigned int authsize)
+{
+	if ((authsize & 1) || authsize < 4)
+		return -EINVAL;
+	return 0;
+}
+
+static int ccm_format_input(u8 info[], struct aead_request *req,
+			    unsigned int msglen)
+{
+	struct crypto_aead *aead = crypto_aead_reqtfm(req);
+	unsigned int l = req->iv[0] + 1;
+	unsigned int m;
+	__be32 len;
+
+	/* verify that CCM dimension 'L': 2 <= L <= 8 */
+	if (l < 2 || l > 8)
+		return -EINVAL;
+	if (l < 4 && msglen >> (8 * l))
+		return -EOVERFLOW;
+
+	memset(&req->iv[SM4_BLOCK_SIZE - l], 0, l);
+
+	memcpy(info, req->iv, SM4_BLOCK_SIZE);
+
+	m = crypto_aead_authsize(aead);
+
+	/* format flags field per RFC 3610/NIST 800-38C */
+	*info |= ((m - 2) / 2) << 3;
+	if (req->assoclen)
+		*info |= (1 << 6);
+
+	/*
+	 * format message length field,
+	 * Linux uses a u32 type to represent msglen
+	 */
+	if (l >= 4)
+		l = 4;
+
+	len = cpu_to_be32(msglen);
+	memcpy(&info[SM4_BLOCK_SIZE - l], (u8 *)&len + 4 - l, l);
+
+	return 0;
+}
+
+static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[])
+{
+	struct crypto_aead *aead = crypto_aead_reqtfm(req);
+	struct sm4_ctx *ctx = crypto_aead_ctx(aead);
+	struct __packed { __be16 l; __be32 h; } aadlen;
+	u32 assoclen = req->assoclen;
+	struct scatter_walk walk;
+	unsigned int len;
+
+	if (assoclen < 0xff00) {
+		aadlen.l = cpu_to_be16(assoclen);
+		len = 2;
+	} else {
+		aadlen.l = cpu_to_be16(0xfffe);
+		put_unaligned_be32(assoclen, &aadlen.h);
+		len = 6;
+	}
+
+	sm4_ce_crypt_block(ctx->rkey_enc, mac, mac);
+	crypto_xor(mac, (const u8 *)&aadlen, len);
+
+	scatterwalk_start(&walk, req->src);
+
+	do {
+		u32 n = scatterwalk_clamp(&walk, assoclen);
+		u8 *p, *ptr;
+
+		if (!n) {
+			scatterwalk_start(&walk, sg_next(walk.sg));
+			n = scatterwalk_clamp(&walk, assoclen);
+		}
+
+		p = ptr = scatterwalk_map(&walk);
+		assoclen -= n;
+		scatterwalk_advance(&walk, n);
+
+		while (n > 0) {
+			unsigned int l, nblocks;
+
+			if (len == SM4_BLOCK_SIZE) {
+				if (n < SM4_BLOCK_SIZE) {
+					sm4_ce_crypt_block(ctx->rkey_enc,
+							   mac, mac);
+
+					len = 0;
+				} else {
+					nblocks = n / SM4_BLOCK_SIZE;
+					sm4_ce_cbcmac_update(ctx->rkey_enc,
+							     mac, ptr, nblocks);
+
+					ptr += nblocks * SM4_BLOCK_SIZE;
+					n %= SM4_BLOCK_SIZE;
+
+					continue;
+				}
+			}
+
+			l = min(n, SM4_BLOCK_SIZE - len);
+			if (l) {
+				crypto_xor(mac + len, ptr, l);
+				len += l;
+				ptr += l;
+				n -= l;
+			}
+		}
+
+		scatterwalk_unmap(p);
+		scatterwalk_done(&walk, 0, assoclen);
+	} while (assoclen);
+}
+
+static int ccm_crypt(struct aead_request *req, struct skcipher_walk *walk,
+		     u32 *rkey_enc, u8 mac[],
+		     void (*sm4_ce_ccm_crypt)(const u32 *rkey_enc, u8 *dst,
+					const u8 *src, u8 *iv,
+					unsigned int nbytes, u8 *mac))
+{
+	u8 __aligned(8) ctr0[SM4_BLOCK_SIZE];
+	int err;
+
+	/* preserve the initial ctr0 for the TAG */
+	memcpy(ctr0, walk->iv, SM4_BLOCK_SIZE);
+	crypto_inc(walk->iv, SM4_BLOCK_SIZE);
+
+	kernel_neon_begin();
+
+	if (req->assoclen)
+		ccm_calculate_auth_mac(req, mac);
+
+	do {
+		unsigned int tail = walk->nbytes % SM4_BLOCK_SIZE;
+		const u8 *src = walk->src.virt.addr;
+		u8 *dst = walk->dst.virt.addr;
+
+		if (walk->nbytes == walk->total)
+			tail = 0;
+
+		if (walk->nbytes - tail)
+			sm4_ce_ccm_crypt(rkey_enc, dst, src, walk->iv,
+					 walk->nbytes - tail, mac);
+
+		if (walk->nbytes == walk->total)
+			sm4_ce_ccm_final(rkey_enc, ctr0, mac);
+
+		kernel_neon_end();
+
+		if (walk->nbytes) {
+			err = skcipher_walk_done(walk, tail);
+			if (err)
+				return err;
+			if (walk->nbytes)
+				kernel_neon_begin();
+		}
+	} while (walk->nbytes > 0);
+
+	return 0;
+}
+
+static int ccm_encrypt(struct aead_request *req)
+{
+	struct crypto_aead *aead = crypto_aead_reqtfm(req);
+	struct sm4_ctx *ctx = crypto_aead_ctx(aead);
+	u8 __aligned(8) mac[SM4_BLOCK_SIZE];
+	struct skcipher_walk walk;
+	int err;
+
+	err = ccm_format_input(mac, req, req->cryptlen);
+	if (err)
+		return err;
+
+	err = skcipher_walk_aead_encrypt(&walk, req, false);
+	if (err)
+		return err;
+
+	err = ccm_crypt(req, &walk, ctx->rkey_enc, mac, sm4_ce_ccm_enc);
+	if (err)
+		return err;
+
+	/* copy authtag to end of dst */
+	scatterwalk_map_and_copy(mac, req->dst, req->assoclen + req->cryptlen,
+				 crypto_aead_authsize(aead), 1);
+
+	return 0;
+}
+
+static int ccm_decrypt(struct aead_request *req)
+{
+	struct crypto_aead *aead = crypto_aead_reqtfm(req);
+	unsigned int authsize = crypto_aead_authsize(aead);
+	struct sm4_ctx *ctx = crypto_aead_ctx(aead);
+	u8 __aligned(8) mac[SM4_BLOCK_SIZE];
+	u8 authtag[SM4_BLOCK_SIZE];
+	struct skcipher_walk walk;
+	int err;
+
+	err = ccm_format_input(mac, req, req->cryptlen - authsize);
+	if (err)
+		return err;
+
+	err = skcipher_walk_aead_decrypt(&walk, req, false);
+	if (err)
+		return err;
+
+	err = ccm_crypt(req, &walk, ctx->rkey_enc, mac, sm4_ce_ccm_dec);
+	if (err)
+		return err;
+
+	/* compare calculated auth tag with the stored one */
+	scatterwalk_map_and_copy(authtag, req->src,
+				 req->assoclen + req->cryptlen - authsize,
+				 authsize, 0);
+
+	if (crypto_memneq(authtag, mac, authsize))
+		return -EBADMSG;
+
+	return 0;
+}
+
+static struct aead_alg sm4_ccm_alg = {
+	.base = {
+		.cra_name		= "ccm(sm4)",
+		.cra_driver_name	= "ccm-sm4-ce",
+		.cra_priority		= 400,
+		.cra_blocksize		= 1,
+		.cra_ctxsize		= sizeof(struct sm4_ctx),
+		.cra_module		= THIS_MODULE,
+	},
+	.ivsize		= SM4_BLOCK_SIZE,
+	.chunksize	= SM4_BLOCK_SIZE,
+	.maxauthsize	= SM4_BLOCK_SIZE,
+	.setkey		= ccm_setkey,
+	.setauthsize	= ccm_setauthsize,
+	.encrypt	= ccm_encrypt,
+	.decrypt	= ccm_decrypt,
+};
+
+static int __init sm4_ce_ccm_init(void)
+{
+	return crypto_register_aead(&sm4_ccm_alg);
+}
+
+static void __exit sm4_ce_ccm_exit(void)
+{
+	crypto_unregister_aead(&sm4_ccm_alg);
+}
+
+module_cpu_feature_match(SM4, sm4_ce_ccm_init);
+module_exit(sm4_ce_ccm_exit);
+
+MODULE_DESCRIPTION("Synchronous SM4 in CCM mode using ARMv8 Crypto Extensions");
+MODULE_ALIAS_CRYPTO("ccm(sm4)");
+MODULE_AUTHOR("Tianjia Zhang <tianjia.zhang@linux.alibaba.com>");
+MODULE_LICENSE("GPL v2");
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 14/16] crypto: arm64/sm4 - add CE implementation for CCM mode
@ 2022-09-26  9:36   ` Tianjia Zhang
  0 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26  9:36 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
	Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
	Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
	linux-stm32

This patch adds a CE-optimized assembly implementation of SM4 in CCM mode.

Benchmarked on a T-Head Yitian-710 at 2.75 GHz using modes 223 and 225 of
tcrypt, comparing the performance before and after this patch (the driver
used before this patch is ccm_base(ctr-sm4-ce,cbcmac-sm4-ce)). The columns
are block lengths in bytes, and the tabulated values are throughput in
Mb/s:

Before (rfc4309(ccm_base(ctr-sm4-ce,cbcmac-sm4-ce))):

ccm(sm4)     |     16      64     256     512    1024    1420    4096    8192
-------------+---------------------------------------------------------------
  CCM enc    |  35.07  125.40  336.47  468.17  581.97  619.18  712.56  736.01
  CCM dec    |  34.87  124.40  335.08  466.75  581.04  618.81  712.25  735.89
  CCM mb enc |  34.71  123.96  333.92  465.39  579.91  617.49  711.45  734.92
  CCM mb dec |  34.42  122.80  331.02  462.81  578.28  616.42  709.88  734.19

After (rfc4309(ccm-sm4-ce)):

ccm-sm4-ce   |     16      64     256     512    1024    1420    4096    8192
-------------+---------------------------------------------------------------
  CCM enc    |  77.12  249.82  569.94  725.17  839.27  867.71  952.87  969.89
  CCM dec    |  75.90  247.26  566.29  722.12  836.90  865.95  951.74  968.57
  CCM mb enc |  75.98  245.25  562.91  718.99  834.76  864.70  950.17  967.90
  CCM mb dec |  75.06  243.78  560.58  717.13  833.68  862.70  949.35  967.11

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
 arch/arm64/crypto/Kconfig           |  16 ++
 arch/arm64/crypto/Makefile          |   3 +
 arch/arm64/crypto/sm4-ce-ccm-core.S | 328 ++++++++++++++++++++++++++++
 arch/arm64/crypto/sm4-ce-ccm-glue.c | 303 +++++++++++++++++++++++++
 4 files changed, 650 insertions(+)
 create mode 100644 arch/arm64/crypto/sm4-ce-ccm-core.S
 create mode 100644 arch/arm64/crypto/sm4-ce-ccm-glue.c

diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index 8939f5ae9214..2611036a3e3f 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -281,6 +281,22 @@ config CRYPTO_AES_ARM64_CE_CCM
 	  - ARMv8 Crypto Extensions
 	  - NEON (Advanced SIMD) extensions
 
+config CRYPTO_SM4_ARM64_CE_CCM
+	tristate "AEAD cipher: SM4 in CCM mode (ARMv8 Crypto Extensions)"
+	depends on KERNEL_MODE_NEON
+	select CRYPTO_ALGAPI
+	select CRYPTO_AEAD
+	select CRYPTO_SM4
+	select CRYPTO_SM4_ARM64_CE_BLK
+	help
+	  AEAD cipher: SM4 cipher algorithms (OSCCA GB/T 32907-2016) with
+	  CCM (Counter with Cipher Block Chaining-Message Authentication Code)
+	  authenticated encryption mode (NIST SP800-38C)
+
+	  Architecture: arm64 using:
+	  - ARMv8 Crypto Extensions
+	  - NEON (Advanced SIMD) extensions
+
 config CRYPTO_CRCT10DIF_ARM64_CE
 	tristate "CRCT10DIF (PMULL)"
 	depends on KERNEL_MODE_NEON && CRC_T10DIF
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index 087f1625e775..843ea5266965 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -29,6 +29,9 @@ sm4-ce-cipher-y := sm4-ce-cipher-glue.o sm4-ce-cipher-core.o
 obj-$(CONFIG_CRYPTO_SM4_ARM64_CE_BLK) += sm4-ce.o
 sm4-ce-y := sm4-ce-glue.o sm4-ce-core.o
 
+obj-$(CONFIG_CRYPTO_SM4_ARM64_CE_CCM) += sm4-ce-ccm.o
+sm4-ce-ccm-y := sm4-ce-ccm-glue.o sm4-ce-ccm-core.o
+
 obj-$(CONFIG_CRYPTO_SM4_ARM64_NEON_BLK) += sm4-neon.o
 sm4-neon-y := sm4-neon-glue.o sm4-neon-core.o
 
diff --git a/arch/arm64/crypto/sm4-ce-ccm-core.S b/arch/arm64/crypto/sm4-ce-ccm-core.S
new file mode 100644
index 000000000000..028207c4afd0
--- /dev/null
+++ b/arch/arm64/crypto/sm4-ce-ccm-core.S
@@ -0,0 +1,328 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * SM4-CCM AEAD Algorithm using ARMv8 Crypto Extensions
+ * as specified in rfc8998
+ * https://datatracker.ietf.org/doc/html/rfc8998
+ *
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+#include "sm4-ce-asm.h"
+
+.arch	armv8-a+crypto
+
+.irp b, 0, 1, 8, 9, 10, 11, 12, 13, 14, 15, 16, 24, 25, 26, 27, 28, 29, 30, 31
+	.set .Lv\b\().4s, \b
+.endr
+
+.macro sm4e, vd, vn
+	.inst 0xcec08400 | (.L\vn << 5) | .L\vd
+.endm
+
+/* Register macros */
+
+#define RMAC	v16
+
+/* Helper macros. */
+
+#define inc_le128(vctr)					\
+		mov		vctr.d[1], x8;		\
+		mov		vctr.d[0], x7;		\
+		adds		x8, x8, #1;		\
+		rev64		vctr.16b, vctr.16b;	\
+		adc		x7, x7, xzr;
+
+
+.align 3
+SYM_FUNC_START(sm4_ce_cbcmac_update)
+	/* input:
+	 *   x0: round key array, CTX
+	 *   x1: mac
+	 *   x2: src
+	 *   w3: nblocks
+	 */
+	SM4_PREPARE(x0)
+
+	ld1		{RMAC.16b}, [x1]
+
+.Lcbcmac_loop_4x:
+	cmp		w3, #4
+	blt		.Lcbcmac_loop_1x
+
+	sub		w3, w3, #4
+
+	ld1		{v0.16b-v3.16b}, [x2], #64
+
+	SM4_CRYPT_BLK(RMAC)
+	eor		RMAC.16b, RMAC.16b, v0.16b
+	SM4_CRYPT_BLK(RMAC)
+	eor		RMAC.16b, RMAC.16b, v1.16b
+	SM4_CRYPT_BLK(RMAC)
+	eor		RMAC.16b, RMAC.16b, v2.16b
+	SM4_CRYPT_BLK(RMAC)
+	eor		RMAC.16b, RMAC.16b, v3.16b
+
+	cbz		w3, .Lcbcmac_end
+	b		.Lcbcmac_loop_4x
+
+.Lcbcmac_loop_1x:
+	sub		w3, w3, #1
+
+	ld1		{v0.16b}, [x2], #16
+
+	SM4_CRYPT_BLK(RMAC)
+	eor		RMAC.16b, RMAC.16b, v0.16b
+
+	cbnz		w3, .Lcbcmac_loop_1x
+
+.Lcbcmac_end:
+	st1		{RMAC.16b}, [x1]
+	ret
+SYM_FUNC_END(sm4_ce_cbcmac_update)
+
+.align 3
+SYM_FUNC_START(sm4_ce_ccm_final)
+	/* input:
+	 *   x0: round key array, CTX
+	 *   x1: ctr0 (big endian, 128 bit)
+	 *   x2: mac
+	 */
+	SM4_PREPARE(x0)
+
+	ld1		{RMAC.16b}, [x2]
+	ld1		{v0.16b}, [x1]
+
+	SM4_CRYPT_BLK2(RMAC, v0)
+
+	/* en-/decrypt the mac with ctr0 */
+	eor		RMAC.16b, RMAC.16b, v0.16b
+	st1		{RMAC.16b}, [x2]
+
+	ret
+SYM_FUNC_END(sm4_ce_ccm_final)
+
+.align 3
+SYM_FUNC_START(sm4_ce_ccm_enc)
+	/* input:
+	 *   x0: round key array, CTX
+	 *   x1: dst
+	 *   x2: src
+	 *   x3: ctr (big endian, 128 bit)
+	 *   w4: nbytes
+	 *   x5: mac
+	 */
+	SM4_PREPARE(x0)
+
+	ldp		x7, x8, [x3]
+	rev		x7, x7
+	rev		x8, x8
+
+	ld1		{RMAC.16b}, [x5]
+
+.Lccm_enc_loop_4x:
+	cmp		w4, #(4 * 16)
+	blt		.Lccm_enc_loop_1x
+
+	sub		w4, w4, #(4 * 16)
+
+	/* construct CTRs */
+	inc_le128(v8)			/* +0 */
+	inc_le128(v9)			/* +1 */
+	inc_le128(v10)			/* +2 */
+	inc_le128(v11)			/* +3 */
+
+	ld1		{v0.16b-v3.16b}, [x2], #64
+
+	SM4_CRYPT_BLK2(v8, RMAC)
+	eor		v8.16b, v8.16b, v0.16b
+	eor		RMAC.16b, RMAC.16b, v0.16b
+	SM4_CRYPT_BLK2(v9, RMAC)
+	eor		v9.16b, v9.16b, v1.16b
+	eor		RMAC.16b, RMAC.16b, v1.16b
+	SM4_CRYPT_BLK2(v10, RMAC)
+	eor		v10.16b, v10.16b, v2.16b
+	eor		RMAC.16b, RMAC.16b, v2.16b
+	SM4_CRYPT_BLK2(v11, RMAC)
+	eor		v11.16b, v11.16b, v3.16b
+	eor		RMAC.16b, RMAC.16b, v3.16b
+
+	st1		{v8.16b-v11.16b}, [x1], #64
+
+	cbz		w4, .Lccm_enc_end
+	b		.Lccm_enc_loop_4x
+
+.Lccm_enc_loop_1x:
+	cmp		w4, #16
+	blt		.Lccm_enc_tail
+
+	sub		w4, w4, #16
+
+	/* construct CTRs */
+	inc_le128(v8)
+
+	ld1		{v0.16b}, [x2], #16
+
+	SM4_CRYPT_BLK2(v8, RMAC)
+	eor		v8.16b, v8.16b, v0.16b
+	eor		RMAC.16b, RMAC.16b, v0.16b
+
+	st1		{v8.16b}, [x1], #16
+
+	cbz		w4, .Lccm_enc_end
+	b		.Lccm_enc_loop_1x
+
+.Lccm_enc_tail:
+	/* construct CTRs */
+	inc_le128(v8)
+
+	SM4_CRYPT_BLK2(RMAC, v8)
+
+	/* store new MAC */
+	st1		{RMAC.16b}, [x5]
+
+.Lccm_enc_tail_loop:
+	ldrb		w0, [x2], #1		/* get 1 byte from input */
+	umov		w9, v8.b[0]		/* get top crypted CTR byte */
+	umov		w6, RMAC.b[0]		/* get top MAC byte */
+
+	eor		w9, w9, w0		/* w9 = CTR ^ input */
+	eor		w6, w6, w0		/* w6 = MAC ^ input */
+
+	strb		w9, [x1], #1		/* store out byte */
+	strb		w6, [x5], #1		/* store MAC byte */
+
+	subs		w4, w4, #1
+	beq		.Lccm_enc_ret
+
+	/* shift out one byte */
+	ext		RMAC.16b, RMAC.16b, RMAC.16b, #1
+	ext		v8.16b, v8.16b, v8.16b, #1
+
+	b		.Lccm_enc_tail_loop
+
+.Lccm_enc_end:
+	/* store new MAC */
+	st1		{RMAC.16b}, [x5]
+
+	/* store new CTR */
+	rev		x7, x7
+	rev		x8, x8
+	stp		x7, x8, [x3]
+
+.Lccm_enc_ret:
+	ret
+SYM_FUNC_END(sm4_ce_ccm_enc)
+
+.align 3
+SYM_FUNC_START(sm4_ce_ccm_dec)
+	/* input:
+	 *   x0: round key array, CTX
+	 *   x1: dst
+	 *   x2: src
+	 *   x3: ctr (big endian, 128 bit)
+	 *   w4: nbytes
+	 *   x5: mac
+	 */
+	SM4_PREPARE(x0)
+
+	ldp		x7, x8, [x3]
+	rev		x7, x7
+	rev		x8, x8
+
+	ld1		{RMAC.16b}, [x5]
+
+.Lccm_dec_loop_4x:
+	cmp		w4, #(4 * 16)
+	blt		.Lccm_dec_loop_1x
+
+	sub		w4, w4, #(4 * 16)
+
+	/* construct CTRs */
+	inc_le128(v8)			/* +0 */
+	inc_le128(v9)			/* +1 */
+	inc_le128(v10)			/* +2 */
+	inc_le128(v11)			/* +3 */
+
+	ld1		{v0.16b-v3.16b}, [x2], #64
+
+	SM4_CRYPT_BLK2(v8, RMAC)
+	eor		v8.16b, v8.16b, v0.16b
+	eor		RMAC.16b, RMAC.16b, v8.16b
+	SM4_CRYPT_BLK2(v9, RMAC)
+	eor		v9.16b, v9.16b, v1.16b
+	eor		RMAC.16b, RMAC.16b, v9.16b
+	SM4_CRYPT_BLK2(v10, RMAC)
+	eor		v10.16b, v10.16b, v2.16b
+	eor		RMAC.16b, RMAC.16b, v10.16b
+	SM4_CRYPT_BLK2(v11, RMAC)
+	eor		v11.16b, v11.16b, v3.16b
+	eor		RMAC.16b, RMAC.16b, v11.16b
+
+	st1		{v8.16b-v11.16b}, [x1], #64
+
+	cbz		w4, .Lccm_dec_end
+	b		.Lccm_dec_loop_4x
+
+.Lccm_dec_loop_1x:
+	cmp		w4, #16
+	blt		.Lccm_dec_tail
+
+	sub		w4, w4, #16
+
+	/* construct CTRs */
+	inc_le128(v8)
+
+	ld1		{v0.16b}, [x2], #16
+
+	SM4_CRYPT_BLK2(v8, RMAC)
+	eor		v8.16b, v8.16b, v0.16b
+	eor		RMAC.16b, RMAC.16b, v8.16b
+
+	st1		{v8.16b}, [x1], #16
+
+	cbz		w4, .Lccm_dec_end
+	b		.Lccm_dec_loop_1x
+
+.Lccm_dec_tail:
+	/* construct CTRs */
+	inc_le128(v8)
+
+	SM4_CRYPT_BLK2(RMAC, v8)
+
+	/* store new MAC */
+	st1		{RMAC.16b}, [x5]
+
+.Lccm_dec_tail_loop:
+	ldrb		w0, [x2], #1		/* get 1 byte from input */
+	umov		w9, v8.b[0]		/* get top crypted CTR byte */
+	umov		w6, RMAC.b[0]		/* get top MAC byte */
+
+	eor		w9, w9, w0		/* w9 = CTR ^ input */
+	eor		w6, w6, w9		/* w6 = MAC ^ output */
+
+	strb		w9, [x1], #1		/* store out byte */
+	strb		w6, [x5], #1		/* store MAC byte */
+
+	subs		w4, w4, #1
+	beq		.Lccm_dec_ret
+
+	/* shift out one byte */
+	ext		RMAC.16b, RMAC.16b, RMAC.16b, #1
+	ext		v8.16b, v8.16b, v8.16b, #1
+
+	b		.Lccm_dec_tail_loop
+
+.Lccm_dec_end:
+	/* store new MAC */
+	st1		{RMAC.16b}, [x5]
+
+	/* store new CTR */
+	rev		x7, x7
+	rev		x8, x8
+	stp		x7, x8, [x3]
+
+.Lccm_dec_ret:
+	ret
+SYM_FUNC_END(sm4_ce_ccm_dec)
diff --git a/arch/arm64/crypto/sm4-ce-ccm-glue.c b/arch/arm64/crypto/sm4-ce-ccm-glue.c
new file mode 100644
index 000000000000..f2cec7b52efc
--- /dev/null
+++ b/arch/arm64/crypto/sm4-ce-ccm-glue.c
@@ -0,0 +1,303 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * SM4-CCM AEAD Algorithm using ARMv8 Crypto Extensions
+ * as specified in rfc8998
+ * https://datatracker.ietf.org/doc/html/rfc8998
+ *
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#include <linux/module.h>
+#include <linux/crypto.h>
+#include <linux/kernel.h>
+#include <linux/cpufeature.h>
+#include <asm/neon.h>
+#include <crypto/scatterwalk.h>
+#include <crypto/internal/aead.h>
+#include <crypto/internal/skcipher.h>
+#include <crypto/sm4.h>
+#include "sm4-ce.h"
+
+asmlinkage void sm4_ce_cbcmac_update(const u32 *rkey_enc, u8 *mac,
+				     const u8 *src, unsigned int nblocks);
+asmlinkage void sm4_ce_ccm_enc(const u32 *rkey_enc, u8 *dst, const u8 *src,
+			       u8 *iv, unsigned int nbytes, u8 *mac);
+asmlinkage void sm4_ce_ccm_dec(const u32 *rkey_enc, u8 *dst, const u8 *src,
+			       u8 *iv, unsigned int nbytes, u8 *mac);
+asmlinkage void sm4_ce_ccm_final(const u32 *rkey_enc, u8 *iv, u8 *mac);
+
+
+static int ccm_setkey(struct crypto_aead *tfm, const u8 *key,
+		      unsigned int key_len)
+{
+	struct sm4_ctx *ctx = crypto_aead_ctx(tfm);
+
+	if (key_len != SM4_KEY_SIZE)
+		return -EINVAL;
+
+	kernel_neon_begin();
+	sm4_ce_expand_key(key, ctx->rkey_enc, ctx->rkey_dec,
+			  crypto_sm4_fk, crypto_sm4_ck);
+	kernel_neon_end();
+
+	return 0;
+}
+
+static int ccm_setauthsize(struct crypto_aead *tfm, unsigned int authsize)
+{
+	if ((authsize & 1) || authsize < 4)
+		return -EINVAL;
+	return 0;
+}
+
+static int ccm_format_input(u8 info[], struct aead_request *req,
+			    unsigned int msglen)
+{
+	struct crypto_aead *aead = crypto_aead_reqtfm(req);
+	unsigned int l = req->iv[0] + 1;
+	unsigned int m;
+	__be32 len;
+
+	/* verify that CCM dimension 'L': 2 <= L <= 8 */
+	if (l < 2 || l > 8)
+		return -EINVAL;
+	if (l < 4 && msglen >> (8 * l))
+		return -EOVERFLOW;
+
+	memset(&req->iv[SM4_BLOCK_SIZE - l], 0, l);
+
+	memcpy(info, req->iv, SM4_BLOCK_SIZE);
+
+	m = crypto_aead_authsize(aead);
+
+	/* format flags field per RFC 3610/NIST 800-38C */
+	*info |= ((m - 2) / 2) << 3;
+	if (req->assoclen)
+		*info |= (1 << 6);
+
+	/*
+	 * format message length field,
+	 * Linux uses a u32 type to represent msglen
+	 */
+	if (l >= 4)
+		l = 4;
+
+	len = cpu_to_be32(msglen);
+	memcpy(&info[SM4_BLOCK_SIZE - l], (u8 *)&len + 4 - l, l);
+
+	return 0;
+}
+
+static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[])
+{
+	struct crypto_aead *aead = crypto_aead_reqtfm(req);
+	struct sm4_ctx *ctx = crypto_aead_ctx(aead);
+	struct __packed { __be16 l; __be32 h; } aadlen;
+	u32 assoclen = req->assoclen;
+	struct scatter_walk walk;
+	unsigned int len;
+
+	if (assoclen < 0xff00) {
+		aadlen.l = cpu_to_be16(assoclen);
+		len = 2;
+	} else {
+		aadlen.l = cpu_to_be16(0xfffe);
+		put_unaligned_be32(assoclen, &aadlen.h);
+		len = 6;
+	}
+
+	sm4_ce_crypt_block(ctx->rkey_enc, mac, mac);
+	crypto_xor(mac, (const u8 *)&aadlen, len);
+
+	scatterwalk_start(&walk, req->src);
+
+	do {
+		u32 n = scatterwalk_clamp(&walk, assoclen);
+		u8 *p, *ptr;
+
+		if (!n) {
+			scatterwalk_start(&walk, sg_next(walk.sg));
+			n = scatterwalk_clamp(&walk, assoclen);
+		}
+
+		p = ptr = scatterwalk_map(&walk);
+		assoclen -= n;
+		scatterwalk_advance(&walk, n);
+
+		while (n > 0) {
+			unsigned int l, nblocks;
+
+			if (len == SM4_BLOCK_SIZE) {
+				if (n < SM4_BLOCK_SIZE) {
+					sm4_ce_crypt_block(ctx->rkey_enc,
+							   mac, mac);
+
+					len = 0;
+				} else {
+					nblocks = n / SM4_BLOCK_SIZE;
+					sm4_ce_cbcmac_update(ctx->rkey_enc,
+							     mac, ptr, nblocks);
+
+					ptr += nblocks * SM4_BLOCK_SIZE;
+					n %= SM4_BLOCK_SIZE;
+
+					continue;
+				}
+			}
+
+			l = min(n, SM4_BLOCK_SIZE - len);
+			if (l) {
+				crypto_xor(mac + len, ptr, l);
+				len += l;
+				ptr += l;
+				n -= l;
+			}
+		}
+
+		scatterwalk_unmap(p);
+		scatterwalk_done(&walk, 0, assoclen);
+	} while (assoclen);
+}
+
+static int ccm_crypt(struct aead_request *req, struct skcipher_walk *walk,
+		     u32 *rkey_enc, u8 mac[],
+		     void (*sm4_ce_ccm_crypt)(const u32 *rkey_enc, u8 *dst,
+					const u8 *src, u8 *iv,
+					unsigned int nbytes, u8 *mac))
+{
+	u8 __aligned(8) ctr0[SM4_BLOCK_SIZE];
+	int err;
+
+	/* preserve the initial ctr0 for the TAG */
+	memcpy(ctr0, walk->iv, SM4_BLOCK_SIZE);
+	crypto_inc(walk->iv, SM4_BLOCK_SIZE);
+
+	kernel_neon_begin();
+
+	if (req->assoclen)
+		ccm_calculate_auth_mac(req, mac);
+
+	do {
+		unsigned int tail = walk->nbytes % SM4_BLOCK_SIZE;
+		const u8 *src = walk->src.virt.addr;
+		u8 *dst = walk->dst.virt.addr;
+
+		if (walk->nbytes == walk->total)
+			tail = 0;
+
+		if (walk->nbytes - tail)
+			sm4_ce_ccm_crypt(rkey_enc, dst, src, walk->iv,
+					 walk->nbytes - tail, mac);
+
+		if (walk->nbytes == walk->total)
+			sm4_ce_ccm_final(rkey_enc, ctr0, mac);
+
+		kernel_neon_end();
+
+		if (walk->nbytes) {
+			err = skcipher_walk_done(walk, tail);
+			if (err)
+				return err;
+			if (walk->nbytes)
+				kernel_neon_begin();
+		}
+	} while (walk->nbytes > 0);
+
+	return 0;
+}
+
+static int ccm_encrypt(struct aead_request *req)
+{
+	struct crypto_aead *aead = crypto_aead_reqtfm(req);
+	struct sm4_ctx *ctx = crypto_aead_ctx(aead);
+	u8 __aligned(8) mac[SM4_BLOCK_SIZE];
+	struct skcipher_walk walk;
+	int err;
+
+	err = ccm_format_input(mac, req, req->cryptlen);
+	if (err)
+		return err;
+
+	err = skcipher_walk_aead_encrypt(&walk, req, false);
+	if (err)
+		return err;
+
+	err = ccm_crypt(req, &walk, ctx->rkey_enc, mac, sm4_ce_ccm_enc);
+	if (err)
+		return err;
+
+	/* copy authtag to end of dst */
+	scatterwalk_map_and_copy(mac, req->dst, req->assoclen + req->cryptlen,
+				 crypto_aead_authsize(aead), 1);
+
+	return 0;
+}
+
+static int ccm_decrypt(struct aead_request *req)
+{
+	struct crypto_aead *aead = crypto_aead_reqtfm(req);
+	unsigned int authsize = crypto_aead_authsize(aead);
+	struct sm4_ctx *ctx = crypto_aead_ctx(aead);
+	u8 __aligned(8) mac[SM4_BLOCK_SIZE];
+	u8 authtag[SM4_BLOCK_SIZE];
+	struct skcipher_walk walk;
+	int err;
+
+	err = ccm_format_input(mac, req, req->cryptlen - authsize);
+	if (err)
+		return err;
+
+	err = skcipher_walk_aead_decrypt(&walk, req, false);
+	if (err)
+		return err;
+
+	err = ccm_crypt(req, &walk, ctx->rkey_enc, mac, sm4_ce_ccm_dec);
+	if (err)
+		return err;
+
+	/* compare calculated auth tag with the stored one */
+	scatterwalk_map_and_copy(authtag, req->src,
+				 req->assoclen + req->cryptlen - authsize,
+				 authsize, 0);
+
+	if (crypto_memneq(authtag, mac, authsize))
+		return -EBADMSG;
+
+	return 0;
+}
+
+static struct aead_alg sm4_ccm_alg = {
+	.base = {
+		.cra_name		= "ccm(sm4)",
+		.cra_driver_name	= "ccm-sm4-ce",
+		.cra_priority		= 400,
+		.cra_blocksize		= 1,
+		.cra_ctxsize		= sizeof(struct sm4_ctx),
+		.cra_module		= THIS_MODULE,
+	},
+	.ivsize		= SM4_BLOCK_SIZE,
+	.chunksize	= SM4_BLOCK_SIZE,
+	.maxauthsize	= SM4_BLOCK_SIZE,
+	.setkey		= ccm_setkey,
+	.setauthsize	= ccm_setauthsize,
+	.encrypt	= ccm_encrypt,
+	.decrypt	= ccm_decrypt,
+};
+
+static int __init sm4_ce_ccm_init(void)
+{
+	return crypto_register_aead(&sm4_ccm_alg);
+}
+
+static void __exit sm4_ce_ccm_exit(void)
+{
+	crypto_unregister_aead(&sm4_ccm_alg);
+}
+
+module_cpu_feature_match(SM4, sm4_ce_ccm_init);
+module_exit(sm4_ce_ccm_exit);
+
+MODULE_DESCRIPTION("Synchronous SM4 in CCM mode using ARMv8 Crypto Extensions");
+MODULE_ALIAS_CRYPTO("ccm(sm4)");
+MODULE_AUTHOR("Tianjia Zhang <tianjia.zhang@linux.alibaba.com>");
+MODULE_LICENSE("GPL v2");
-- 
2.24.3 (Apple Git-128)



* [PATCH 15/16] crypto: arm64/sm4 - add CE implementation for GCM mode
  2022-09-26  9:36 ` Tianjia Zhang
@ 2022-09-26  9:36   ` Tianjia Zhang
  -1 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26  9:36 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
	Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
	Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
	linux-stm32

This patch is a CE-optimized assembly implementation for GCM mode.
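
For orientation, the GHASH multiplication that the PMULL-based macros in this
patch accelerate can be written as a slow bit-by-bit reference in portable C.
The sketch below follows the GF(2^128) multiplication described in NIST
SP 800-38D; it is not how the assembly computes the product, and the name
gf128_mul is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical reference helper (not from the patch): z = x * y in the
 * GCM field GF(2^128). Blocks use the GCM bit order, where the most
 * significant bit of byte 0 is the coefficient of x^0.
 */
static void gf128_mul(uint8_t z[16], const uint8_t x[16], const uint8_t y[16])
{
	uint8_t v[16], acc[16] = { 0 };
	int i, j, lsb;

	assert(z && x && y);
	memcpy(v, y, 16);

	for (i = 0; i < 128; i++) {
		/* if bit i of x is set, accumulate the current multiple of y */
		if (x[i / 8] & (0x80 >> (i % 8)))
			for (j = 0; j < 16; j++)
				acc[j] ^= v[j];

		/* v = v * x: shift right by one bit, reduce if a bit fell off */
		lsb = v[15] & 1;
		for (j = 15; j > 0; j--)
			v[j] = (uint8_t)((v[j] >> 1) | (v[j - 1] << 7));
		v[0] >>= 1;
		if (lsb)
			v[0] ^= 0xe1;	/* R = 11100001 || 0^120 */
	}

	memcpy(z, acc, 16);
}
```

A reference like this is mainly useful for cross-checking an optimized
implementation on test vectors; it processes one bit per iteration, whereas the
patch handles up to four blocks per reduction using precomputed powers of H.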

Benchmarked on a T-Head Yitian-710 at 2.75 GHz using tcrypt modes 224 and 226,
comparing performance before and after this patch (the driver used before this
patch is gcm_base(ctr-sm4-ce,ghash-generic)). The columns are block lengths in
bytes; throughput is tabulated in Mb/s:

Before (gcm_base(ctr-sm4-ce,ghash-generic)):

gcm(sm4)     |     16      64      256      512     1024     1420     4096     8192
-------------+---------------------------------------------------------------------
  GCM enc    |  25.24   64.65   104.66   116.69   123.81   125.12   129.67   130.62
  GCM dec    |  25.40   64.80   104.74   116.70   123.81   125.21   129.68   130.59
  GCM mb enc |  24.95   64.06   104.20   116.38   123.55   124.97   129.63   130.61
  GCM mb dec |  24.92   64.00   104.13   116.34   123.55   124.98   129.56   130.48

After:

gcm-sm4-ce   |     16      64      256      512     1024     1420     4096     8192
-------------+---------------------------------------------------------------------
  GCM enc    | 108.62  397.18   971.60  1283.92  1522.77  1513.39  1777.00  1806.96
  GCM dec    | 116.36  398.14  1004.27  1319.11  1624.21  1635.43  1932.54  1974.20
  GCM mb enc | 107.13  391.79   962.05  1274.94  1514.76  1508.57  1769.07  1801.58
  GCM mb dec | 113.40  389.36   988.51  1307.68  1619.10  1631.55  1931.70  1970.86
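
Separately from the throughput numbers, one detail of the counter handling may
be worth spelling out: the inc32_le128 macro in this patch increments only the
low 32 bits of the counter block, whereas the inc_le128 macro in the CCM patch
carries across the full 128 bits. A minimal C sketch of the GCM behaviour,
assuming a big-endian 16-byte counter block and a hypothetical helper name
gcm_inc32, could look like this:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): increment the last four bytes
 * of a big-endian counter block modulo 2^32, leaving bytes 0..11 untouched.
 */
static void gcm_inc32(uint8_t ctr[16])
{
	int i;

	assert(ctr);

	for (i = 15; i >= 12; i--)
		if (++ctr[i] != 0)
			break;	/* no carry into the next byte */
}
```

If the low word is 0xffffffff, the increment wraps it to zero and byte 11 is
left unchanged, which matches the 32-bit wrap of the counter in GCM.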

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
 arch/arm64/crypto/Kconfig           |  16 +
 arch/arm64/crypto/Makefile          |   3 +
 arch/arm64/crypto/sm4-ce-gcm-core.S | 741 ++++++++++++++++++++++++++++
 arch/arm64/crypto/sm4-ce-gcm-glue.c | 286 +++++++++++
 4 files changed, 1046 insertions(+)
 create mode 100644 arch/arm64/crypto/sm4-ce-gcm-core.S
 create mode 100644 arch/arm64/crypto/sm4-ce-gcm-glue.c

diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index 2611036a3e3f..6793d5bc3ee5 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -297,6 +297,22 @@ config CRYPTO_SM4_ARM64_CE_CCM
 	  - ARMv8 Crypto Extensions
 	  - NEON (Advanced SIMD) extensions
 
+config CRYPTO_SM4_ARM64_CE_GCM
+	tristate "AEAD cipher: SM4 in GCM mode (ARMv8 Crypto Extensions)"
+	depends on KERNEL_MODE_NEON
+	select CRYPTO_ALGAPI
+	select CRYPTO_AEAD
+	select CRYPTO_SM4
+	select CRYPTO_SM4_ARM64_CE_BLK
+	help
+	  AEAD cipher: SM4 cipher algorithms (OSCCA GB/T 32907-2016) with
+	  GCM (Galois/Counter Mode) authenticated encryption mode (NIST SP800-38D)
+
+	  Architecture: arm64 using:
+	  - ARMv8 Crypto Extensions
+	  - PMULL (Polynomial Multiply Long) instructions
+	  - NEON (Advanced SIMD) extensions
+
 config CRYPTO_CRCT10DIF_ARM64_CE
 	tristate "CRCT10DIF (PMULL)"
 	depends on KERNEL_MODE_NEON && CRC_T10DIF
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index 843ea5266965..4818e204c2ac 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -32,6 +32,9 @@ sm4-ce-y := sm4-ce-glue.o sm4-ce-core.o
 obj-$(CONFIG_CRYPTO_SM4_ARM64_CE_CCM) += sm4-ce-ccm.o
 sm4-ce-ccm-y := sm4-ce-ccm-glue.o sm4-ce-ccm-core.o
 
+obj-$(CONFIG_CRYPTO_SM4_ARM64_CE_GCM) += sm4-ce-gcm.o
+sm4-ce-gcm-y := sm4-ce-gcm-glue.o sm4-ce-gcm-core.o
+
 obj-$(CONFIG_CRYPTO_SM4_ARM64_NEON_BLK) += sm4-neon.o
 sm4-neon-y := sm4-neon-glue.o sm4-neon-core.o
 
diff --git a/arch/arm64/crypto/sm4-ce-gcm-core.S b/arch/arm64/crypto/sm4-ce-gcm-core.S
new file mode 100644
index 000000000000..7aa3ec18a289
--- /dev/null
+++ b/arch/arm64/crypto/sm4-ce-gcm-core.S
@@ -0,0 +1,741 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * SM4-GCM AEAD Algorithm using ARMv8 Crypto Extensions
+ * as specified in rfc8998
+ * https://datatracker.ietf.org/doc/html/rfc8998
+ *
+ * Copyright (C) 2016 Jussi Kivilinna <jussi.kivilinna@iki.fi>
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+#include "sm4-ce-asm.h"
+
+.arch	armv8-a+crypto
+
+.irp b, 0, 1, 2, 3, 24, 25, 26, 27, 28, 29, 30, 31
+	.set .Lv\b\().4s, \b
+.endr
+
+.macro sm4e, vd, vn
+	.inst 0xcec08400 | (.L\vn << 5) | .L\vd
+.endm
+
+/* Register macros */
+
+/* Used for both encryption and decryption */
+#define	RHASH	v21
+#define	RRCONST	v22
+#define RZERO	v23
+
+/* Helper macros. */
+
+/*
+ * input: m0, m1
+ * output: r0:r1 (low 128-bits in r0, high in r1)
+ */
+#define PMUL_128x128(r0, r1, m0, m1, T0, T1)			\
+		ext		T0.16b, m1.16b, m1.16b, #8;	\
+		pmull		r0.1q, m0.1d, m1.1d;		\
+		pmull		T1.1q, m0.1d, T0.1d;		\
+		pmull2		T0.1q, m0.2d, T0.2d;		\
+		pmull2		r1.1q, m0.2d, m1.2d;		\
+		eor		T0.16b, T0.16b, T1.16b;		\
+		ext		T1.16b, RZERO.16b, T0.16b, #8;	\
+		ext		T0.16b, T0.16b, RZERO.16b, #8;	\
+		eor		r0.16b, r0.16b, T1.16b;		\
+		eor		r1.16b, r1.16b, T0.16b;
+
+#define PMUL_128x128_4x(r0, r1, m0, m1, T0, T1,			\
+			r2, r3, m2, m3, T2, T3,			\
+			r4, r5, m4, m5, T4, T5,			\
+			r6, r7, m6, m7, T6, T7)			\
+		ext		T0.16b, m1.16b, m1.16b, #8;	\
+		ext		T2.16b, m3.16b, m3.16b, #8;	\
+		ext		T4.16b, m5.16b, m5.16b, #8;	\
+		ext		T6.16b, m7.16b, m7.16b, #8;	\
+		pmull		r0.1q, m0.1d, m1.1d;		\
+		pmull		r2.1q, m2.1d, m3.1d;		\
+		pmull		r4.1q, m4.1d, m5.1d;		\
+		pmull		r6.1q, m6.1d, m7.1d;		\
+		pmull		T1.1q, m0.1d, T0.1d;		\
+		pmull		T3.1q, m2.1d, T2.1d;		\
+		pmull		T5.1q, m4.1d, T4.1d;		\
+		pmull		T7.1q, m6.1d, T6.1d;		\
+		pmull2		T0.1q, m0.2d, T0.2d;		\
+		pmull2		T2.1q, m2.2d, T2.2d;		\
+		pmull2		T4.1q, m4.2d, T4.2d;		\
+		pmull2		T6.1q, m6.2d, T6.2d;		\
+		pmull2		r1.1q, m0.2d, m1.2d;		\
+		pmull2		r3.1q, m2.2d, m3.2d;		\
+		pmull2		r5.1q, m4.2d, m5.2d;		\
+		pmull2		r7.1q, m6.2d, m7.2d;		\
+		eor		T0.16b, T0.16b, T1.16b;		\
+		eor		T2.16b, T2.16b, T3.16b;		\
+		eor		T4.16b, T4.16b, T5.16b;		\
+		eor		T6.16b, T6.16b, T7.16b;		\
+		ext		T1.16b, RZERO.16b, T0.16b, #8;	\
+		ext		T3.16b, RZERO.16b, T2.16b, #8;	\
+		ext		T5.16b, RZERO.16b, T4.16b, #8;	\
+		ext		T7.16b, RZERO.16b, T6.16b, #8;	\
+		ext		T0.16b, T0.16b, RZERO.16b, #8;	\
+		ext		T2.16b, T2.16b, RZERO.16b, #8;	\
+		ext		T4.16b, T4.16b, RZERO.16b, #8;	\
+		ext		T6.16b, T6.16b, RZERO.16b, #8;	\
+		eor		r0.16b, r0.16b, T1.16b;		\
+		eor		r2.16b, r2.16b, T3.16b; 	\
+		eor		r4.16b, r4.16b, T5.16b; 	\
+		eor		r6.16b, r6.16b, T7.16b; 	\
+		eor		r1.16b, r1.16b, T0.16b; 	\
+		eor		r3.16b, r3.16b, T2.16b; 	\
+		eor		r5.16b, r5.16b, T4.16b; 	\
+		eor		r7.16b, r7.16b, T6.16b;
+
+/*
+ * input: r0:r1 (low 128-bits in r0, high in r1)
+ * output: a
+ */
+#define REDUCTION(a, r0, r1, rconst, T0, T1)			\
+		pmull2		T0.1q, r1.2d, rconst.2d;	\
+		ext		T1.16b, T0.16b, RZERO.16b, #8;	\
+		ext		T0.16b, RZERO.16b, T0.16b, #8;	\
+		eor		r1.16b, r1.16b, T1.16b;		\
+		eor		r0.16b, r0.16b, T0.16b;		\
+		pmull		T0.1q, r1.1d, rconst.1d;	\
+		eor		a.16b, r0.16b, T0.16b;
+
+#define SM4_CRYPT_PMUL_128x128_BLK(b0, r0, r1, m0, m1, T0, T1)	\
+	rev32			b0.16b, b0.16b;			\
+		ext		T0.16b, m1.16b, m1.16b, #8;	\
+	sm4e			b0.4s, v24.4s;			\
+		pmull		r0.1q, m0.1d, m1.1d;		\
+	sm4e			b0.4s, v25.4s;			\
+		pmull		T1.1q, m0.1d, T0.1d;		\
+	sm4e			b0.4s, v26.4s;			\
+		pmull2		T0.1q, m0.2d, T0.2d;		\
+	sm4e			b0.4s, v27.4s;			\
+		pmull2		r1.1q, m0.2d, m1.2d;		\
+	sm4e			b0.4s, v28.4s;			\
+		eor		T0.16b, T0.16b, T1.16b;		\
+	sm4e			b0.4s, v29.4s;			\
+		ext		T1.16b, RZERO.16b, T0.16b, #8;	\
+	sm4e			b0.4s, v30.4s;			\
+		ext		T0.16b, T0.16b, RZERO.16b, #8;	\
+	sm4e			b0.4s, v31.4s;			\
+		eor		r0.16b, r0.16b, T1.16b;		\
+	rev64			b0.4s, b0.4s;			\
+		eor		r1.16b, r1.16b, T0.16b;		\
+	ext			b0.16b, b0.16b, b0.16b, #8;	\
+	rev32			b0.16b, b0.16b;
+
+#define SM4_CRYPT_PMUL_128x128_BLK3(b0, b1, b2,			\
+				    r0, r1, m0, m1, T0, T1,	\
+				    r2, r3, m2, m3, T2, T3,	\
+				    r4, r5, m4, m5, T4, T5)	\
+	rev32			b0.16b, b0.16b;			\
+	rev32			b1.16b, b1.16b;			\
+	rev32			b2.16b, b2.16b;			\
+		ext		T0.16b, m1.16b, m1.16b, #8;	\
+		ext		T2.16b, m3.16b, m3.16b, #8;	\
+		ext		T4.16b, m5.16b, m5.16b, #8;	\
+	sm4e			b0.4s, v24.4s;			\
+	sm4e			b1.4s, v24.4s;			\
+	sm4e			b2.4s, v24.4s;			\
+		pmull		r0.1q, m0.1d, m1.1d;		\
+		pmull		r2.1q, m2.1d, m3.1d;		\
+		pmull		r4.1q, m4.1d, m5.1d;		\
+	sm4e			b0.4s, v25.4s;			\
+	sm4e			b1.4s, v25.4s;			\
+	sm4e			b2.4s, v25.4s;			\
+		pmull		T1.1q, m0.1d, T0.1d;		\
+		pmull		T3.1q, m2.1d, T2.1d;		\
+		pmull		T5.1q, m4.1d, T4.1d;		\
+	sm4e			b0.4s, v26.4s;			\
+	sm4e			b1.4s, v26.4s;			\
+	sm4e			b2.4s, v26.4s;			\
+		pmull2		T0.1q, m0.2d, T0.2d;		\
+		pmull2		T2.1q, m2.2d, T2.2d;		\
+		pmull2		T4.1q, m4.2d, T4.2d;		\
+	sm4e			b0.4s, v27.4s;			\
+	sm4e			b1.4s, v27.4s;			\
+	sm4e			b2.4s, v27.4s;			\
+		pmull2		r1.1q, m0.2d, m1.2d;		\
+		pmull2		r3.1q, m2.2d, m3.2d;		\
+		pmull2		r5.1q, m4.2d, m5.2d;		\
+	sm4e			b0.4s, v28.4s;			\
+	sm4e			b1.4s, v28.4s;			\
+	sm4e			b2.4s, v28.4s;			\
+		eor		T0.16b, T0.16b, T1.16b;		\
+		eor		T2.16b, T2.16b, T3.16b;		\
+		eor		T4.16b, T4.16b, T5.16b;		\
+	sm4e			b0.4s, v29.4s;			\
+	sm4e			b1.4s, v29.4s;			\
+	sm4e			b2.4s, v29.4s;			\
+		ext		T1.16b, RZERO.16b, T0.16b, #8;	\
+		ext		T3.16b, RZERO.16b, T2.16b, #8;	\
+		ext		T5.16b, RZERO.16b, T4.16b, #8;	\
+	sm4e			b0.4s, v30.4s;			\
+	sm4e			b1.4s, v30.4s;			\
+	sm4e			b2.4s, v30.4s;			\
+		ext		T0.16b, T0.16b, RZERO.16b, #8;	\
+		ext		T2.16b, T2.16b, RZERO.16b, #8;	\
+		ext		T4.16b, T4.16b, RZERO.16b, #8;	\
+	sm4e			b0.4s, v31.4s;			\
+	sm4e			b1.4s, v31.4s;			\
+	sm4e			b2.4s, v31.4s;			\
+		eor		r0.16b, r0.16b, T1.16b;		\
+		eor		r2.16b, r2.16b, T3.16b;		\
+		eor		r4.16b, r4.16b, T5.16b;		\
+	rev64			b0.4s, b0.4s;			\
+	rev64			b1.4s, b1.4s;			\
+	rev64			b2.4s, b2.4s;			\
+		eor		r1.16b, r1.16b, T0.16b;		\
+		eor		r3.16b, r3.16b, T2.16b;		\
+		eor		r5.16b, r5.16b, T4.16b;		\
+	ext			b0.16b, b0.16b, b0.16b, #8;	\
+	ext			b1.16b, b1.16b, b1.16b, #8;	\
+	ext			b2.16b, b2.16b, b2.16b, #8;	\
+		eor		r0.16b, r0.16b, r2.16b;		\
+		eor		r1.16b, r1.16b, r3.16b;		\
+	rev32			b0.16b, b0.16b;			\
+	rev32			b1.16b, b1.16b;			\
+	rev32			b2.16b, b2.16b;			\
+		eor		r0.16b, r0.16b, r4.16b;		\
+		eor		r1.16b, r1.16b, r5.16b;
+
+#define inc32_le128(vctr)					\
+		mov		vctr.d[1], x9;			\
+		add		w6, w9, #1;			\
+		mov		vctr.d[0], x8;			\
+		bfi		x9, x6, #0, #32;		\
+		rev64		vctr.16b, vctr.16b;
+
+#define GTAG_HASH_LENGTHS(vctr0, vlen)					\
+		ld1		{vlen.16b}, [x7];			\
+		/* construct CTR0 */					\
+		/* the lower 32-bits of initial IV is always be32(1) */	\
+		mov		x6, #0x1;				\
+		bfi		x9, x6, #0, #32;			\
+		mov		vctr0.d[0], x8;				\
+		mov		vctr0.d[1], x9;				\
+		rbit		vlen.16b, vlen.16b;			\
+		rev64		vctr0.16b, vctr0.16b;			\
+		/* authtag = GCTR(CTR0, GHASH) */			\
+		eor		RHASH.16b, RHASH.16b, vlen.16b;		\
+		SM4_CRYPT_PMUL_128x128_BLK(vctr0, RR0, RR1, RHASH, RH1,	\
+					   RTMP0, RTMP1);		\
+		REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP2, RTMP3);	\
+		rbit		RHASH.16b, RHASH.16b;			\
+		eor		RHASH.16b, RHASH.16b, vctr0.16b;
+
+
+/* Register macros for encrypt and ghash */
+
+/* can be the same as input v0-v3 */
+#define	RR1	v0
+#define	RR3	v1
+#define	RR5	v2
+#define	RR7	v3
+
+#define	RR0	v4
+#define	RR2	v5
+#define	RR4	v6
+#define	RR6	v7
+
+#define RTMP0	v8
+#define RTMP1	v9
+#define RTMP2	v10
+#define RTMP3	v11
+#define RTMP4	v12
+#define RTMP5	v13
+#define RTMP6	v14
+#define RTMP7	v15
+
+#define	RH1	v16
+#define	RH2	v17
+#define	RH3	v18
+#define	RH4	v19
+
+.align 3
+SYM_FUNC_START(sm4_ce_pmull_ghash_setup)
+	/* input:
+	 *   x0: round key array, CTX
+	 *   x1: ghash table
+	 */
+	SM4_PREPARE(x0)
+
+	adr_l		x2, .Lghash_rconst
+	ld1r		{RRCONST.2d}, [x2]
+
+	eor		RZERO.16b, RZERO.16b, RZERO.16b
+
+	/* H = E(K, 0^128) */
+	rev32		v0.16b, RZERO.16b
+	SM4_CRYPT_BLK_BE(v0)
+
+	/* H ^ 1 */
+	rbit		RH1.16b, v0.16b
+
+	/* H ^ 2 */
+	PMUL_128x128(RR0, RR1, RH1, RH1, RTMP0, RTMP1)
+	REDUCTION(RH2, RR0, RR1, RRCONST, RTMP2, RTMP3)
+
+	/* H ^ 3 */
+	PMUL_128x128(RR0, RR1, RH2, RH1, RTMP0, RTMP1)
+	REDUCTION(RH3, RR0, RR1, RRCONST, RTMP2, RTMP3)
+
+	/* H ^ 4 */
+	PMUL_128x128(RR0, RR1, RH2, RH2, RTMP0, RTMP1)
+	REDUCTION(RH4, RR0, RR1, RRCONST, RTMP2, RTMP3)
+
+	st1		{RH1.16b-RH4.16b}, [x1]
+
+	ret
+SYM_FUNC_END(sm4_ce_pmull_ghash_setup)
+
+.align 3
+SYM_FUNC_START(pmull_ghash_update)
+	/* input:
+	 *   x0: ghash table
+	 *   x1: ghash result
+	 *   x2: src
+	 *   w3: nblocks
+	 */
+	ld1		{RH1.16b-RH4.16b}, [x0]
+
+	ld1		{RHASH.16b}, [x1]
+	rbit		RHASH.16b, RHASH.16b
+
+	adr_l		x4, .Lghash_rconst
+	ld1r		{RRCONST.2d}, [x4]
+
+	eor		RZERO.16b, RZERO.16b, RZERO.16b
+
+.Lghash_loop_4x:
+	cmp		w3, #4
+	blt		.Lghash_loop_1x
+
+	sub		w3, w3, #4
+
+	ld1		{v0.16b-v3.16b}, [x2], #64
+
+	rbit		v0.16b, v0.16b
+	rbit		v1.16b, v1.16b
+	rbit		v2.16b, v2.16b
+	rbit		v3.16b, v3.16b
+
+	/*
+	 * (in0 ^ HASH) * H^4 => rr0:rr1
+	 * (in1)        * H^3 => rr2:rr3
+	 * (in2)        * H^2 => rr4:rr5
+	 * (in3)        * H^1 => rr6:rr7
+	 */
+	eor		RHASH.16b, RHASH.16b, v0.16b
+
+	PMUL_128x128_4x(RR0, RR1, RHASH, RH4, RTMP0, RTMP1,
+			RR2, RR3, v1, RH3, RTMP2, RTMP3,
+			RR4, RR5, v2, RH2, RTMP4, RTMP5,
+			RR6, RR7, v3, RH1, RTMP6, RTMP7)
+
+	eor		RR0.16b, RR0.16b, RR2.16b
+	eor		RR1.16b, RR1.16b, RR3.16b
+	eor		RR0.16b, RR0.16b, RR4.16b
+	eor		RR1.16b, RR1.16b, RR5.16b
+	eor		RR0.16b, RR0.16b, RR6.16b
+	eor		RR1.16b, RR1.16b, RR7.16b
+
+	REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP0, RTMP1)
+
+	cbz		w3, .Lghash_end
+	b		.Lghash_loop_4x
+
+.Lghash_loop_1x:
+	sub		w3, w3, #1
+
+	ld1		{v0.16b}, [x2], #16
+	rbit		v0.16b, v0.16b
+	eor		RHASH.16b, RHASH.16b, v0.16b
+
+	PMUL_128x128(RR0, RR1, RHASH, RH1, RTMP0, RTMP1)
+	REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP2, RTMP3)
+
+	cbnz		w3, .Lghash_loop_1x
+
+.Lghash_end:
+	rbit		RHASH.16b, RHASH.16b
+	st1		{RHASH.2d}, [x1]
+
+	ret
+SYM_FUNC_END(pmull_ghash_update)
+
+.align 3
+SYM_FUNC_START(sm4_ce_pmull_gcm_enc)
+	/* input:
+	 *   x0: round key array, CTX
+	 *   x1: dst
+	 *   x2: src
+	 *   x3: ctr (big endian, 128 bit)
+	 *   w4: nbytes
+	 *   x5: ghash result
+	 *   x6: ghash table
+	 *   x7: lengths (only for last block)
+	 */
+	SM4_PREPARE(x0)
+
+	ldp		x8, x9, [x3]
+	rev		x8, x8
+	rev		x9, x9
+
+	ld1		{RH1.16b-RH4.16b}, [x6]
+
+	ld1		{RHASH.16b}, [x5]
+	rbit		RHASH.16b, RHASH.16b
+
+	adr_l		x6, .Lghash_rconst
+	ld1r		{RRCONST.2d}, [x6]
+
+	eor		RZERO.16b, RZERO.16b, RZERO.16b
+
+	cbz		w4, .Lgcm_enc_hash_len
+
+.Lgcm_enc_loop_4x:
+	cmp		w4, #(4 * 16)
+	blt		.Lgcm_enc_loop_1x
+
+	sub		w4, w4, #(4 * 16)
+
+	/* construct CTRs */
+	inc32_le128(v0)			/* +0 */
+	inc32_le128(v1)			/* +1 */
+	inc32_le128(v2)			/* +2 */
+	inc32_le128(v3)			/* +3 */
+
+	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64
+
+	SM4_CRYPT_BLK4(v0, v1, v2, v3)
+
+	eor		v0.16b, v0.16b, RTMP0.16b
+	eor		v1.16b, v1.16b, RTMP1.16b
+	eor		v2.16b, v2.16b, RTMP2.16b
+	eor		v3.16b, v3.16b, RTMP3.16b
+	st1		{v0.16b-v3.16b}, [x1], #64
+
+	/* ghash update */
+
+	rbit		v0.16b, v0.16b
+	rbit		v1.16b, v1.16b
+	rbit		v2.16b, v2.16b
+	rbit		v3.16b, v3.16b
+
+	/*
+	 * (in0 ^ HASH) * H^4 => rr0:rr1
+	 * (in1)        * H^3 => rr2:rr3
+	 * (in2)        * H^2 => rr4:rr5
+	 * (in3)        * H^1 => rr6:rr7
+	 */
+	eor		RHASH.16b, RHASH.16b, v0.16b
+
+	PMUL_128x128_4x(RR0, RR1, RHASH, RH4, RTMP0, RTMP1,
+			RR2, RR3, v1, RH3, RTMP2, RTMP3,
+			RR4, RR5, v2, RH2, RTMP4, RTMP5,
+			RR6, RR7, v3, RH1, RTMP6, RTMP7)
+
+	eor		RR0.16b, RR0.16b, RR2.16b
+	eor		RR1.16b, RR1.16b, RR3.16b
+	eor		RR0.16b, RR0.16b, RR4.16b
+	eor		RR1.16b, RR1.16b, RR5.16b
+	eor		RR0.16b, RR0.16b, RR6.16b
+	eor		RR1.16b, RR1.16b, RR7.16b
+
+	REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP0, RTMP1)
+
+	cbz		w4, .Lgcm_enc_hash_len
+	b		.Lgcm_enc_loop_4x
+
+.Lgcm_enc_loop_1x:
+	cmp		w4, #16
+	blt		.Lgcm_enc_tail
+
+	sub		w4, w4, #16
+
+	/* construct CTRs */
+	inc32_le128(v0)
+
+	ld1		{RTMP0.16b}, [x2], #16
+
+	SM4_CRYPT_BLK(v0)
+
+	eor		v0.16b, v0.16b, RTMP0.16b
+	st1		{v0.16b}, [x1], #16
+
+	/* ghash update */
+	rbit		v0.16b, v0.16b
+	eor		RHASH.16b, RHASH.16b, v0.16b
+	PMUL_128x128(RR0, RR1, RHASH, RH1, RTMP0, RTMP1)
+	REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP2, RTMP3)
+
+	cbz		w4, .Lgcm_enc_hash_len
+	b		.Lgcm_enc_loop_1x
+
+.Lgcm_enc_tail:
+	/* construct CTRs */
+	inc32_le128(v0)
+	SM4_CRYPT_BLK(v0)
+
+	/* load permute table */
+	adr_l		x0, .Lcts_permute_table
+	add		x0, x0, #32
+	sub		x0, x0, w4, uxtw
+	ld1		{v3.16b}, [x0]
+
+.Lgcm_enc_tail_loop:
+	/* do encrypt */
+	ldrb		w0, [x2], #1	/* load 1 byte of input */
+	umov		w6, v0.b[0]	/* get next keystream byte */
+	eor		w6, w6, w0	/* w6 = keystream ^ input */
+	strb		w6, [x1], #1	/* store output byte */
+
+	/* rotate the consumed keystream byte out */
+	ext		v0.16b, v0.16b, v0.16b, #1
+	/* collect the ciphertext bytes in the high bytes */
+	ins		v0.b[15], w6
+
+	subs		w4, w4, #1
+	bne		.Lgcm_enc_tail_loop
+
+	/* padding last block with zeros */
+	tbl		v0.16b, {v0.16b}, v3.16b
+
+	/* ghash update */
+	rbit		v0.16b, v0.16b
+	eor		RHASH.16b, RHASH.16b, v0.16b
+	PMUL_128x128(RR0, RR1, RHASH, RH1, RTMP0, RTMP1)
+	REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP2, RTMP3)
+
+.Lgcm_enc_hash_len:
+	cbz		x7, .Lgcm_enc_end
+
+	GTAG_HASH_LENGTHS(v1, v3)
+
+	b		.Lgcm_enc_ret
+
+.Lgcm_enc_end:
+	/* store new CTR */
+	rev		x8, x8
+	rev		x9, x9
+	stp		x8, x9, [x3]
+
+	rbit		RHASH.16b, RHASH.16b
+
+.Lgcm_enc_ret:
+	/* store new MAC */
+	st1		{RHASH.2d}, [x5]
+
+	ret
+SYM_FUNC_END(sm4_ce_pmull_gcm_enc)
+
+#undef	RR1
+#undef	RR3
+#undef	RR5
+#undef	RR7
+#undef	RR0
+#undef	RR2
+#undef	RR4
+#undef	RR6
+#undef RTMP0
+#undef RTMP1
+#undef RTMP2
+#undef RTMP3
+#undef RTMP4
+#undef RTMP5
+#undef RTMP6
+#undef RTMP7
+#undef	RH1
+#undef	RH2
+#undef	RH3
+#undef	RH4
+
+
+/* Register macros for decrypt */
+
+/* v0-v2 for building CTRs, v3-v5 for saving inputs */
+
+#define	RR1	v6
+#define	RR3	v7
+#define	RR5	v8
+
+#define	RR0	v9
+#define	RR2	v10
+#define	RR4	v11
+
+#define RTMP0	v12
+#define RTMP1	v13
+#define RTMP2	v14
+#define RTMP3	v15
+#define RTMP4	v16
+#define RTMP5	v17
+
+#define	RH1	v18
+#define	RH2	v19
+#define	RH3	v20
+
+.align 3
+SYM_FUNC_START(sm4_ce_pmull_gcm_dec)
+	/* input:
+	 *   x0: round key array, CTX
+	 *   x1: dst
+	 *   x2: src
+	 *   x3: ctr (big endian, 128 bit)
+	 *   w4: nbytes
+	 *   x5: ghash result
+	 *   x6: ghash table
+	 *   x7: lengths (only for last block)
+	 */
+	SM4_PREPARE(x0)
+
+	ldp		x8, x9, [x3]
+	rev		x8, x8
+	rev		x9, x9
+
+	ld1		{RH1.16b-RH3.16b}, [x6]
+
+	ld1		{RHASH.16b}, [x5]
+	rbit		RHASH.16b, RHASH.16b
+
+	adr_l		x6, .Lghash_rconst
+	ld1r		{RRCONST.2d}, [x6]
+
+	eor		RZERO.16b, RZERO.16b, RZERO.16b
+
+	cbz		w4, .Lgcm_dec_hash_len
+
+.Lgcm_dec_loop_3x:
+	cmp		w4, #(3 * 16)
+	blt		.Lgcm_dec_loop_1x
+
+	sub		w4, w4, #(3 * 16)
+
+	ld1		{v3.16b-v5.16b}, [x2], #(3 * 16)
+
+	/* construct CTRs */
+	inc32_le128(v0)			/* +0 */
+	rbit		v6.16b, v3.16b
+	inc32_le128(v1)			/* +1 */
+	rbit		v7.16b, v4.16b
+	inc32_le128(v2)			/* +2 */
+	rbit		v8.16b, v5.16b
+
+	eor		RHASH.16b, RHASH.16b, v6.16b
+
+	/* decrypt & ghash update */
+	SM4_CRYPT_PMUL_128x128_BLK3(v0, v1, v2,
+				    RR0, RR1, RHASH, RH3, RTMP0, RTMP1,
+				    RR2, RR3, v7, RH2, RTMP2, RTMP3,
+				    RR4, RR5, v8, RH1, RTMP4, RTMP5)
+
+	eor		v0.16b, v0.16b, v3.16b
+	eor		v1.16b, v1.16b, v4.16b
+	eor		v2.16b, v2.16b, v5.16b
+
+	REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP0, RTMP1)
+
+	st1		{v0.16b-v2.16b}, [x1], #(3 * 16)
+
+	cbz		w4, .Lgcm_dec_hash_len
+	b		.Lgcm_dec_loop_3x
+
+.Lgcm_dec_loop_1x:
+	cmp		w4, #16
+	blt		.Lgcm_dec_tail
+
+	sub		w4, w4, #16
+
+	ld1		{v3.16b}, [x2], #16
+
+	/* construct CTRs */
+	inc32_le128(v0)
+	rbit		v6.16b, v3.16b
+
+	eor		RHASH.16b, RHASH.16b, v6.16b
+
+	SM4_CRYPT_PMUL_128x128_BLK(v0, RR0, RR1, RHASH, RH1, RTMP0, RTMP1)
+
+	eor		v0.16b, v0.16b, v3.16b
+
+	REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP2, RTMP3)
+
+	st1		{v0.16b}, [x1], #16
+
+	cbz		w4, .Lgcm_dec_hash_len
+	b		.Lgcm_dec_loop_1x
+
+.Lgcm_dec_tail:
+	/* construct CTRs */
+	inc32_le128(v0)
+	SM4_CRYPT_BLK(v0)
+
+	/* load permute table */
+	adr_l		x0, .Lcts_permute_table
+	add		x0, x0, #32
+	sub		x0, x0, w4, uxtw
+	ld1		{v3.16b}, [x0]
+
+.Lgcm_dec_tail_loop:
+	/* do decrypt */
+	ldrb		w0, [x2], #1	/* load 1 byte of input */
+	umov		w6, v0.b[0]	/* get next keystream byte */
+	eor		w6, w6, w0	/* w6 = keystream ^ input */
+	strb		w6, [x1], #1	/* store output byte */
+
+	/* rotate the consumed keystream byte out */
+	ext		v0.16b, v0.16b, v0.16b, #1
+	/* collect the ciphertext bytes in the high bytes */
+	ins		v0.b[15], w0
+
+	subs		w4, w4, #1
+	bne		.Lgcm_dec_tail_loop
+
+	/* padding last block with zeros */
+	tbl		v0.16b, {v0.16b}, v3.16b
+
+	/* ghash update */
+	rbit		v0.16b, v0.16b
+	eor		RHASH.16b, RHASH.16b, v0.16b
+	PMUL_128x128(RR0, RR1, RHASH, RH1, RTMP0, RTMP1)
+	REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP2, RTMP3)
+
+.Lgcm_dec_hash_len:
+	cbz		x7, .Lgcm_dec_end
+
+	GTAG_HASH_LENGTHS(v1, v3)
+
+	b		.Lgcm_dec_ret
+
+.Lgcm_dec_end:
+	/* store new CTR */
+	rev		x8, x8
+	rev		x9, x9
+	stp		x8, x9, [x3]
+
+	rbit		RHASH.16b, RHASH.16b
+
+.Lgcm_dec_ret:
+	/* store new MAC */
+	st1		{RHASH.2d}, [x5]
+
+	ret
+SYM_FUNC_END(sm4_ce_pmull_gcm_dec)
+
+	.section	".rodata", "a"
+	.align 4
+.Lcts_permute_table:
+	.byte		0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+	.byte		0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+	.byte		 0x0,  0x1,  0x2,  0x3,  0x4,  0x5,  0x6,  0x7
+	.byte		 0x8,  0x9,  0xa,  0xb,  0xc,  0xd,  0xe,  0xf
+	.byte		0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+	.byte		0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+
+.Lghash_rconst:
+	.quad		0x87
diff --git a/arch/arm64/crypto/sm4-ce-gcm-glue.c b/arch/arm64/crypto/sm4-ce-gcm-glue.c
new file mode 100644
index 000000000000..e90ea0f17beb
--- /dev/null
+++ b/arch/arm64/crypto/sm4-ce-gcm-glue.c
@@ -0,0 +1,286 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * SM4-GCM AEAD Algorithm using ARMv8 Crypto Extensions
+ * as specified in rfc8998
+ * https://datatracker.ietf.org/doc/html/rfc8998
+ *
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#include <linux/module.h>
+#include <linux/crypto.h>
+#include <linux/kernel.h>
+#include <linux/cpufeature.h>
+#include <asm/neon.h>
+#include <crypto/b128ops.h>
+#include <crypto/scatterwalk.h>
+#include <crypto/internal/aead.h>
+#include <crypto/internal/skcipher.h>
+#include <crypto/sm4.h>
+#include "sm4-ce.h"
+
+asmlinkage void sm4_ce_pmull_ghash_setup(const u32 *rkey_enc, u8 *ghash_table);
+asmlinkage void pmull_ghash_update(const u8 *ghash_table, u8 *ghash,
+				   const u8 *src, unsigned int nblocks);
+asmlinkage void sm4_ce_pmull_gcm_enc(const u32 *rkey_enc, u8 *dst,
+				     const u8 *src, u8 *iv,
+				     unsigned int nbytes, u8 *ghash,
+				     const u8 *ghash_table, const u8 *lengths);
+asmlinkage void sm4_ce_pmull_gcm_dec(const u32 *rkey_enc, u8 *dst,
+				     const u8 *src, u8 *iv,
+				     unsigned int nbytes, u8 *ghash,
+				     const u8 *ghash_table, const u8 *lengths);
+
+#define GHASH_BLOCK_SIZE	16
+#define GCM_IV_SIZE		12
+
+struct sm4_gcm_ctx {
+	struct sm4_ctx key;
+	u8 ghash_table[16 * 4];
+};
+
+
+static int gcm_setkey(struct crypto_aead *tfm, const u8 *key,
+		      unsigned int key_len)
+{
+	struct sm4_gcm_ctx *ctx = crypto_aead_ctx(tfm);
+
+	if (key_len != SM4_KEY_SIZE)
+		return -EINVAL;
+
+	kernel_neon_begin();
+
+	sm4_ce_expand_key(key, ctx->key.rkey_enc, ctx->key.rkey_dec,
+			  crypto_sm4_fk, crypto_sm4_ck);
+	sm4_ce_pmull_ghash_setup(ctx->key.rkey_enc, ctx->ghash_table);
+
+	kernel_neon_end();
+	return 0;
+}
+
+static int gcm_setauthsize(struct crypto_aead *tfm, unsigned int authsize)
+{
+	switch (authsize) {
+	case 4:
+	case 8:
+	case 12 ... 16:
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
+
+static void gcm_calculate_auth_mac(struct aead_request *req, u8 ghash[])
+{
+	struct crypto_aead *aead = crypto_aead_reqtfm(req);
+	struct sm4_gcm_ctx *ctx = crypto_aead_ctx(aead);
+	u8 __aligned(8) buffer[GHASH_BLOCK_SIZE];
+	u32 assoclen = req->assoclen;
+	struct scatter_walk walk;
+	unsigned int buflen = 0;
+
+	scatterwalk_start(&walk, req->src);
+
+	do {
+		u32 n = scatterwalk_clamp(&walk, assoclen);
+		u8 *p, *ptr;
+
+		if (!n) {
+			scatterwalk_start(&walk, sg_next(walk.sg));
+			n = scatterwalk_clamp(&walk, assoclen);
+		}
+
+		p = ptr = scatterwalk_map(&walk);
+		assoclen -= n;
+		scatterwalk_advance(&walk, n);
+
+		if (n + buflen < GHASH_BLOCK_SIZE) {
+			memcpy(&buffer[buflen], ptr, n);
+			buflen += n;
+		} else {
+			unsigned int nblocks;
+
+			if (buflen) {
+				unsigned int l = GHASH_BLOCK_SIZE - buflen;
+
+				memcpy(&buffer[buflen], ptr, l);
+				ptr += l;
+				n -= l;
+
+				pmull_ghash_update(ctx->ghash_table, ghash,
+						   buffer, 1);
+			}
+
+			nblocks = n / GHASH_BLOCK_SIZE;
+			if (nblocks) {
+				pmull_ghash_update(ctx->ghash_table, ghash,
+						   ptr, nblocks);
+				ptr += nblocks * GHASH_BLOCK_SIZE;
+			}
+
+			buflen = n % GHASH_BLOCK_SIZE;
+			if (buflen)
+				memcpy(&buffer[0], ptr, buflen);
+		}
+
+		scatterwalk_unmap(p);
+		scatterwalk_done(&walk, 0, assoclen);
+	} while (assoclen);
+
+	/* padding with '0' */
+	if (buflen) {
+		memset(&buffer[buflen], 0, GHASH_BLOCK_SIZE - buflen);
+		pmull_ghash_update(ctx->ghash_table, ghash, buffer, 1);
+	}
+}
+
+static int gcm_crypt(struct aead_request *req, struct skcipher_walk *walk,
+		     struct sm4_gcm_ctx *ctx, u8 ghash[],
+		     void (*sm4_ce_pmull_gcm_crypt)(const u32 *rkey_enc,
+				u8 *dst, const u8 *src, u8 *iv,
+				unsigned int nbytes, u8 *ghash,
+				const u8 *ghash_table, const u8 *lengths))
+{
+	u8 __aligned(8) iv[SM4_BLOCK_SIZE];
+	be128 __aligned(8) lengths;
+	int err;
+
+	memset(ghash, 0, SM4_BLOCK_SIZE);
+
+	lengths.a = cpu_to_be64(req->assoclen * 8);
+	lengths.b = cpu_to_be64(walk->total * 8);
+
+	memcpy(iv, walk->iv, GCM_IV_SIZE);
+	put_unaligned_be32(2, iv + GCM_IV_SIZE);
+
+	kernel_neon_begin();
+
+	if (req->assoclen)
+		gcm_calculate_auth_mac(req, ghash);
+
+	do {
+		unsigned int tail = walk->nbytes % SM4_BLOCK_SIZE;
+		const u8 *src = walk->src.virt.addr;
+		u8 *dst = walk->dst.virt.addr;
+
+		if (walk->nbytes == walk->total) {
+			tail = 0;
+
+			sm4_ce_pmull_gcm_crypt(ctx->key.rkey_enc, dst, src, iv,
+					       walk->nbytes, ghash,
+					       ctx->ghash_table,
+					       (const u8 *)&lengths);
+		} else if (walk->nbytes - tail) {
+			sm4_ce_pmull_gcm_crypt(ctx->key.rkey_enc, dst, src, iv,
+					       walk->nbytes - tail, ghash,
+					       ctx->ghash_table, NULL);
+		}
+
+		kernel_neon_end();
+
+		err = skcipher_walk_done(walk, tail);
+		if (err)
+			return err;
+		if (walk->nbytes)
+			kernel_neon_begin();
+	} while (walk->nbytes > 0);
+
+	return 0;
+}
+
+static int gcm_encrypt(struct aead_request *req)
+{
+	struct crypto_aead *aead = crypto_aead_reqtfm(req);
+	struct sm4_gcm_ctx *ctx = crypto_aead_ctx(aead);
+	u8 __aligned(8) ghash[SM4_BLOCK_SIZE];
+	struct skcipher_walk walk;
+	int err;
+
+	err = skcipher_walk_aead_encrypt(&walk, req, false);
+	if (err)
+		return err;
+
+	err = gcm_crypt(req, &walk, ctx, ghash, sm4_ce_pmull_gcm_enc);
+	if (err)
+		return err;
+
+	/* copy authtag to end of dst */
+	scatterwalk_map_and_copy(ghash, req->dst, req->assoclen + req->cryptlen,
+				 crypto_aead_authsize(aead), 1);
+
+	return 0;
+}
+
+static int gcm_decrypt(struct aead_request *req)
+{
+	struct crypto_aead *aead = crypto_aead_reqtfm(req);
+	unsigned int authsize = crypto_aead_authsize(aead);
+	struct sm4_gcm_ctx *ctx = crypto_aead_ctx(aead);
+	u8 __aligned(8) ghash[SM4_BLOCK_SIZE];
+	u8 authtag[SM4_BLOCK_SIZE];
+	struct skcipher_walk walk;
+	int err;
+
+	err = skcipher_walk_aead_decrypt(&walk, req, false);
+	if (err)
+		return err;
+
+	err = gcm_crypt(req, &walk, ctx, ghash, sm4_ce_pmull_gcm_dec);
+	if (err)
+		return err;
+
+	/* compare calculated auth tag with the stored one */
+	scatterwalk_map_and_copy(authtag, req->src,
+				 req->assoclen + req->cryptlen - authsize,
+				 authsize, 0);
+
+	if (crypto_memneq(authtag, ghash, authsize))
+		return -EBADMSG;
+
+	return 0;
+}
+
+static struct aead_alg sm4_gcm_alg = {
+	.base = {
+		.cra_name		= "gcm(sm4)",
+		.cra_driver_name	= "gcm-sm4-ce",
+		.cra_priority		= 400,
+		.cra_blocksize		= 1,
+		.cra_ctxsize		= sizeof(struct sm4_gcm_ctx),
+		.cra_module		= THIS_MODULE,
+	},
+	.ivsize		= GCM_IV_SIZE,
+	.chunksize	= SM4_BLOCK_SIZE,
+	.maxauthsize	= SM4_BLOCK_SIZE,
+	.setkey		= gcm_setkey,
+	.setauthsize	= gcm_setauthsize,
+	.encrypt	= gcm_encrypt,
+	.decrypt	= gcm_decrypt,
+};
+
+static int __init sm4_ce_gcm_init(void)
+{
+	if (!cpu_have_named_feature(PMULL))
+		return -ENODEV;
+
+	return crypto_register_aead(&sm4_gcm_alg);
+}
+
+static void __exit sm4_ce_gcm_exit(void)
+{
+	crypto_unregister_aead(&sm4_gcm_alg);
+}
+
+static const struct cpu_feature sm4_ce_gcm_cpu_feature[] = {
+	{ cpu_feature(PMULL) },
+	{}
+};
+MODULE_DEVICE_TABLE(cpu, sm4_ce_gcm_cpu_feature);
+
+module_cpu_feature_match(SM4, sm4_ce_gcm_init);
+module_exit(sm4_ce_gcm_exit);
+
+MODULE_DESCRIPTION("Synchronous SM4 in GCM mode using ARMv8 Crypto Extensions");
+MODULE_ALIAS_CRYPTO("gcm(sm4)");
+MODULE_AUTHOR("Tianjia Zhang <tianjia.zhang@linux.alibaba.com>");
+MODULE_LICENSE("GPL v2");
-- 
2.24.3 (Apple Git-128)
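
For readers following gcm_crypt() in the glue code above, here is a minimal
userspace sketch of the three 16-byte blocks involved: the counter block
(IV || be32(counter), with data encryption starting at counter 2 and the tag
using counter 1), the inc32 step that the inc32_le128 macro performs on the
low 32 bits only, and the lengths block (bit lengths as two big-endian 64-bit
values). The helper names gcm_counter_block(), gcm_inc32() and
gcm_lengths_block() are hypothetical and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define GCM_IV_SIZE 12

/* Store a 32-bit value in big-endian byte order. */
static void put_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
}

/* Load a 32-bit value stored in big-endian byte order. */
static uint32_t get_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Hypothetical helper: counter block = 96-bit IV || be32(ctr). */
static void gcm_counter_block(uint8_t out[16],
			      const uint8_t iv[GCM_IV_SIZE], uint32_t ctr)
{
	memcpy(out, iv, GCM_IV_SIZE);
	put_be32(out + GCM_IV_SIZE, ctr);
}

/*
 * Hypothetical helper: inc32 increments only the low 32 bits, modulo
 * 2^32. The upper 96 bits (the IV) never receive a carry.
 */
static void gcm_inc32(uint8_t blk[16])
{
	put_be32(blk + 12, get_be32(blk + 12) + 1u);
}

/* Hypothetical helper: be64(assoclen in bits) || be64(cryptlen in bits). */
static void gcm_lengths_block(uint8_t out[16], uint64_t assoclen,
			      uint64_t cryptlen)
{
	uint64_t abits, cbits;
	int i;

	/* byte counts must fit in 64 bits once converted to bit counts */
	assert(assoclen <= UINT64_MAX / 8 && cryptlen <= UINT64_MAX / 8);
	abits = assoclen * 8;
	cbits = cryptlen * 8;

	for (i = 0; i < 8; i++) {
		out[i] = (uint8_t)(abits >> (56 - 8 * i));
		out[8 + i] = (uint8_t)(cbits >> (56 - 8 * i));
	}
}
```

In terms of the patch, gcm_counter_block(blk, iv, 2) corresponds to the
memcpy() plus put_unaligned_be32(2, ...) pair in gcm_crypt(), and counter
value 1 is what GTAG_HASH_LENGTHS reconstructs when it computes the tag.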



* [PATCH 15/16] crypto: arm64/sm4 - add CE implementation for GCM mode
@ 2022-09-26  9:36   ` Tianjia Zhang
  0 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26  9:36 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
	Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
	Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
	linux-stm32

This patch is a CE-optimized assembly implementation for GCM mode.

Benchmarked on a T-Head Yitian-710 at 2.75 GHz. The data comes from the
224 and 224 modes of tcrypt and compares the performance before and after
this patch (the driver used before this patch is
gcm_base(ctr-sm4-ce,ghash-generic)). The columns are blocks of different
lengths. The data is tabulated and the unit is Mb/s:

Before (gcm_base(ctr-sm4-ce,ghash-generic)):

gcm(sm4)     |     16      64      256      512     1024     1420     4096     8192
-------------+---------------------------------------------------------------------
  GCM enc    |  25.24   64.65   104.66   116.69   123.81   125.12   129.67   130.62
  GCM dec    |  25.40   64.80   104.74   116.70   123.81   125.21   129.68   130.59
  GCM mb enc |  24.95   64.06   104.20   116.38   123.55   124.97   129.63   130.61
  GCM mb dec |  24.92   64.00   104.13   116.34   123.55   124.98   129.56   130.48

After:

gcm-sm4-ce   |     16      64      256      512     1024     1420     4096     8192
-------------+---------------------------------------------------------------------
  GCM enc    | 108.62  397.18   971.60  1283.92  1522.77  1513.39  1777.00  1806.96
  GCM dec    | 116.36  398.14  1004.27  1319.11  1624.21  1635.43  1932.54  1974.20
  GCM mb enc | 107.13  391.79   962.05  1274.94  1514.76  1508.57  1769.07  1801.58
  GCM mb dec | 113.40  389.36   988.51  1307.68  1619.10  1631.55  1931.70  1970.86

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
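
Note on the 4-block GHASH loop: the assembly below folds four blocks per
iteration as (in0 ^ HASH) * H^4 ^ in1 * H^3 ^ in2 * H^2 ^ in3 * H, using the
powers of H precomputed by sm4_ce_pmull_ghash_setup(). The following portable
sketch checks that this form equals the usual one-block-at-a-time update. It
is only an illustration: gf128_mul() is the plain bitwise multiplication from
NIST SP 800-38D, not the PMULL code, it reduces after every product (the
assembly XORs the unreduced products and reduces once, which gives the same
result by linearity), and the names blk128, ghash_serial() and ghash_4x() are
hypothetical.

```c
#include <stdint.h>

typedef struct { uint8_t b[16]; } blk128;

static blk128 blk_xor(blk128 a, blk128 b)
{
	for (int i = 0; i < 16; i++)
		a.b[i] ^= b.b[i];
	return a;
}

/* Z = X * Y in GF(2^128), bit order and polynomial as in SP 800-38D. */
static blk128 gf128_mul(blk128 x, blk128 y)
{
	blk128 z = { { 0 } };
	blk128 v = y;

	for (int i = 0; i < 128; i++) {
		/* bit i of X, counted from the top bit of byte 0 */
		if (x.b[i / 8] & (0x80 >> (i % 8)))
			z = blk_xor(z, v);

		/* V = V * x: shift right one bit, fold in R = 0xe1 || 0^120 */
		int lsb = v.b[15] & 1;

		for (int j = 15; j > 0; j--)
			v.b[j] = (uint8_t)((v.b[j] >> 1) | (v.b[j - 1] << 7));
		v.b[0] >>= 1;
		if (lsb)
			v.b[0] ^= 0xe1;
	}
	return z;
}

/* Reference update: Y = (Y ^ in[i]) * H, one block at a time. */
static blk128 ghash_serial(blk128 h, blk128 y, const blk128 in[4])
{
	for (int i = 0; i < 4; i++)
		y = gf128_mul(blk_xor(y, in[i]), h);
	return y;
}

/* Aggregated update using H^1..H^4, mirroring the 4x loop's structure. */
static blk128 ghash_4x(blk128 h, blk128 y, const blk128 in[4])
{
	blk128 h2 = gf128_mul(h, h);
	blk128 h3 = gf128_mul(h2, h);
	blk128 h4 = gf128_mul(h2, h2);
	blk128 acc = gf128_mul(blk_xor(y, in[0]), h4);

	acc = blk_xor(acc, gf128_mul(in[1], h3));
	acc = blk_xor(acc, gf128_mul(in[2], h2));
	acc = blk_xor(acc, gf128_mul(in[3], h));
	return acc;
}
```

For any H, running hash value and four input blocks, ghash_serial() and
ghash_4x() return the same block. The decrypt path applies the same idea
with three blocks and H^1..H^3.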
 arch/arm64/crypto/Kconfig           |  16 +
 arch/arm64/crypto/Makefile          |   3 +
 arch/arm64/crypto/sm4-ce-gcm-core.S | 741 ++++++++++++++++++++++++++++
 arch/arm64/crypto/sm4-ce-gcm-glue.c | 286 +++++++++++
 4 files changed, 1046 insertions(+)
 create mode 100644 arch/arm64/crypto/sm4-ce-gcm-core.S
 create mode 100644 arch/arm64/crypto/sm4-ce-gcm-glue.c
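
Note on the partial final block: in the diff below, the tail loops XOR one
byte at a time against the keystream, rotate the vector by one byte, and
insert each ciphertext byte at index 15. After n bytes the ciphertext sits in
the top n bytes, and a single tbl lookup through .Lcts_permute_table, loaded
at offset 32 - n, moves it to the bottom and zero-fills the rest before the
GHASH update. The sketch below emulates that sequence for the encrypt side in
plain C. tbl16() and gcm_enc_tail() are hypothetical names, and the 16-byte
keystream is assumed to be the already encrypted counter block.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same layout as .Lcts_permute_table: 16 x 0xff, 0..15, 16 x 0xff. */
static const uint8_t permute_table[48] = {
	0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
	0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
	0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
	0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f,
	0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
	0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
};

/* Emulates a one-register TBL: out-of-range indices produce zero. */
static void tbl16(uint8_t out[16], const uint8_t src[16],
		  const uint8_t idx[16])
{
	for (int i = 0; i < 16; i++)
		out[i] = idx[i] < 16 ? src[idx[i]] : 0;
}

/*
 * Hypothetical helper: encrypt n (1..15) tail bytes and produce the
 * zero-padded ciphertext block that is fed to GHASH.
 */
static void gcm_enc_tail(uint8_t ghash_in[16], uint8_t *dst,
			 const uint8_t *src, const uint8_t keystream[16],
			 unsigned int n)
{
	uint8_t v[16];

	assert(n > 0 && n < 16);
	memcpy(v, keystream, 16);

	for (unsigned int i = 0; i < n; i++) {
		uint8_t c = v[0] ^ src[i];	/* keystream ^ input */

		dst[i] = c;
		memmove(v, v + 1, 15);		/* ext ..., #1 */
		v[15] = c;			/* ins v.b[15], c */
	}

	/* ciphertext is now in v[16 - n .. 15]; move it down and zero-pad */
	tbl16(ghash_in, v, &permute_table[32 - n]);
}
```

The decrypt side differs only in which byte is inserted at index 15: it keeps
the input (ciphertext) byte rather than the output byte, since GHASH always
runs over the ciphertext.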

diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index 2611036a3e3f..6793d5bc3ee5 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -297,6 +297,22 @@ config CRYPTO_SM4_ARM64_CE_CCM
 	  - ARMv8 Crypto Extensions
 	  - NEON (Advanced SIMD) extensions
 
+config CRYPTO_SM4_ARM64_CE_GCM
+	tristate "AEAD cipher: SM4 in GCM mode (ARMv8 Crypto Extensions)"
+	depends on KERNEL_MODE_NEON
+	select CRYPTO_ALGAPI
+	select CRYPTO_AEAD
+	select CRYPTO_SM4
+	select CRYPTO_SM4_ARM64_CE_BLK
+	help
+	  AEAD cipher: SM4 cipher algorithms (OSCCA GB/T 32907-2016) with
+	  GCM (Galois/Counter Mode) authenticated encryption mode (NIST SP800-38D)
+
+	  Architecture: arm64 using:
+	  - ARMv8 Crypto Extensions
+	  - PMULL (Polynomial Multiply Long) instructions
+	  - NEON (Advanced SIMD) extensions
+
 config CRYPTO_CRCT10DIF_ARM64_CE
 	tristate "CRCT10DIF (PMULL)"
 	depends on KERNEL_MODE_NEON && CRC_T10DIF
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index 843ea5266965..4818e204c2ac 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -32,6 +32,9 @@ sm4-ce-y := sm4-ce-glue.o sm4-ce-core.o
 obj-$(CONFIG_CRYPTO_SM4_ARM64_CE_CCM) += sm4-ce-ccm.o
 sm4-ce-ccm-y := sm4-ce-ccm-glue.o sm4-ce-ccm-core.o
 
+obj-$(CONFIG_CRYPTO_SM4_ARM64_CE_GCM) += sm4-ce-gcm.o
+sm4-ce-gcm-y := sm4-ce-gcm-glue.o sm4-ce-gcm-core.o
+
 obj-$(CONFIG_CRYPTO_SM4_ARM64_NEON_BLK) += sm4-neon.o
 sm4-neon-y := sm4-neon-glue.o sm4-neon-core.o
 
diff --git a/arch/arm64/crypto/sm4-ce-gcm-core.S b/arch/arm64/crypto/sm4-ce-gcm-core.S
new file mode 100644
index 000000000000..7aa3ec18a289
--- /dev/null
+++ b/arch/arm64/crypto/sm4-ce-gcm-core.S
@@ -0,0 +1,741 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * SM4-GCM AEAD Algorithm using ARMv8 Crypto Extensions
+ * as specified in rfc8998
+ * https://datatracker.ietf.org/doc/html/rfc8998
+ *
+ * Copyright (C) 2016 Jussi Kivilinna <jussi.kivilinna@iki.fi>
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+#include "sm4-ce-asm.h"
+
+.arch	armv8-a+crypto
+
+.irp b, 0, 1, 2, 3, 24, 25, 26, 27, 28, 29, 30, 31
+	.set .Lv\b\().4s, \b
+.endr
+
+.macro sm4e, vd, vn
+	.inst 0xcec08400 | (.L\vn << 5) | .L\vd
+.endm
+
+/* Register macros */
+
+/* Used for both encryption and decryption */
+#define	RHASH	v21
+#define	RRCONST	v22
+#define RZERO	v23
+
+/* Helper macros. */
+
+/*
+ * input: m0, m1
+ * output: r0:r1 (low 128-bits in r0, high in r1)
+ */
+#define PMUL_128x128(r0, r1, m0, m1, T0, T1)			\
+		ext		T0.16b, m1.16b, m1.16b, #8;	\
+		pmull		r0.1q, m0.1d, m1.1d;		\
+		pmull		T1.1q, m0.1d, T0.1d;		\
+		pmull2		T0.1q, m0.2d, T0.2d;		\
+		pmull2		r1.1q, m0.2d, m1.2d;		\
+		eor		T0.16b, T0.16b, T1.16b;		\
+		ext		T1.16b, RZERO.16b, T0.16b, #8;	\
+		ext		T0.16b, T0.16b, RZERO.16b, #8;	\
+		eor		r0.16b, r0.16b, T1.16b;		\
+		eor		r1.16b, r1.16b, T0.16b;
+
+#define PMUL_128x128_4x(r0, r1, m0, m1, T0, T1,			\
+			r2, r3, m2, m3, T2, T3,			\
+			r4, r5, m4, m5, T4, T5,			\
+			r6, r7, m6, m7, T6, T7)			\
+		ext		T0.16b, m1.16b, m1.16b, #8;	\
+		ext		T2.16b, m3.16b, m3.16b, #8;	\
+		ext		T4.16b, m5.16b, m5.16b, #8;	\
+		ext		T6.16b, m7.16b, m7.16b, #8;	\
+		pmull		r0.1q, m0.1d, m1.1d;		\
+		pmull		r2.1q, m2.1d, m3.1d;		\
+		pmull		r4.1q, m4.1d, m5.1d;		\
+		pmull		r6.1q, m6.1d, m7.1d;		\
+		pmull		T1.1q, m0.1d, T0.1d;		\
+		pmull		T3.1q, m2.1d, T2.1d;		\
+		pmull		T5.1q, m4.1d, T4.1d;		\
+		pmull		T7.1q, m6.1d, T6.1d;		\
+		pmull2		T0.1q, m0.2d, T0.2d;		\
+		pmull2		T2.1q, m2.2d, T2.2d;		\
+		pmull2		T4.1q, m4.2d, T4.2d;		\
+		pmull2		T6.1q, m6.2d, T6.2d;		\
+		pmull2		r1.1q, m0.2d, m1.2d;		\
+		pmull2		r3.1q, m2.2d, m3.2d;		\
+		pmull2		r5.1q, m4.2d, m5.2d;		\
+		pmull2		r7.1q, m6.2d, m7.2d;		\
+		eor		T0.16b, T0.16b, T1.16b;		\
+		eor		T2.16b, T2.16b, T3.16b;		\
+		eor		T4.16b, T4.16b, T5.16b;		\
+		eor		T6.16b, T6.16b, T7.16b;		\
+		ext		T1.16b, RZERO.16b, T0.16b, #8;	\
+		ext		T3.16b, RZERO.16b, T2.16b, #8;	\
+		ext		T5.16b, RZERO.16b, T4.16b, #8;	\
+		ext		T7.16b, RZERO.16b, T6.16b, #8;	\
+		ext		T0.16b, T0.16b, RZERO.16b, #8;	\
+		ext		T2.16b, T2.16b, RZERO.16b, #8;	\
+		ext		T4.16b, T4.16b, RZERO.16b, #8;	\
+		ext		T6.16b, T6.16b, RZERO.16b, #8;	\
+		eor		r0.16b, r0.16b, T1.16b;		\
+		eor		r2.16b, r2.16b, T3.16b; 	\
+		eor		r4.16b, r4.16b, T5.16b; 	\
+		eor		r6.16b, r6.16b, T7.16b; 	\
+		eor		r1.16b, r1.16b, T0.16b; 	\
+		eor		r3.16b, r3.16b, T2.16b; 	\
+		eor		r5.16b, r5.16b, T4.16b; 	\
+		eor		r7.16b, r7.16b, T6.16b;
+
+/*
+ * input: r0:r1 (low 128-bits in r0, high in r1)
+ * output: a
+ */
+#define REDUCTION(a, r0, r1, rconst, T0, T1)			\
+		pmull2		T0.1q, r1.2d, rconst.2d;	\
+		ext		T1.16b, T0.16b, RZERO.16b, #8;	\
+		ext		T0.16b, RZERO.16b, T0.16b, #8;	\
+		eor		r1.16b, r1.16b, T1.16b;		\
+		eor		r0.16b, r0.16b, T0.16b;		\
+		pmull		T0.1q, r1.1d, rconst.1d;	\
+		eor		a.16b, r0.16b, T0.16b;
+
+#define SM4_CRYPT_PMUL_128x128_BLK(b0, r0, r1, m0, m1, T0, T1)	\
+	rev32			b0.16b, b0.16b;			\
+		ext		T0.16b, m1.16b, m1.16b, #8;	\
+	sm4e			b0.4s, v24.4s;			\
+		pmull		r0.1q, m0.1d, m1.1d;		\
+	sm4e			b0.4s, v25.4s;			\
+		pmull		T1.1q, m0.1d, T0.1d;		\
+	sm4e			b0.4s, v26.4s;			\
+		pmull2		T0.1q, m0.2d, T0.2d;		\
+	sm4e			b0.4s, v27.4s;			\
+		pmull2		r1.1q, m0.2d, m1.2d;		\
+	sm4e			b0.4s, v28.4s;			\
+		eor		T0.16b, T0.16b, T1.16b;		\
+	sm4e			b0.4s, v29.4s;			\
+		ext		T1.16b, RZERO.16b, T0.16b, #8;	\
+	sm4e			b0.4s, v30.4s;			\
+		ext		T0.16b, T0.16b, RZERO.16b, #8;	\
+	sm4e			b0.4s, v31.4s;			\
+		eor		r0.16b, r0.16b, T1.16b;		\
+	rev64			b0.4s, b0.4s;			\
+		eor		r1.16b, r1.16b, T0.16b;		\
+	ext			b0.16b, b0.16b, b0.16b, #8;	\
+	rev32			b0.16b, b0.16b;
+
+#define SM4_CRYPT_PMUL_128x128_BLK3(b0, b1, b2,			\
+				    r0, r1, m0, m1, T0, T1,	\
+				    r2, r3, m2, m3, T2, T3,	\
+				    r4, r5, m4, m5, T4, T5)	\
+	rev32			b0.16b, b0.16b;			\
+	rev32			b1.16b, b1.16b;			\
+	rev32			b2.16b, b2.16b;			\
+		ext		T0.16b, m1.16b, m1.16b, #8;	\
+		ext		T2.16b, m3.16b, m3.16b, #8;	\
+		ext		T4.16b, m5.16b, m5.16b, #8;	\
+	sm4e			b0.4s, v24.4s;			\
+	sm4e			b1.4s, v24.4s;			\
+	sm4e			b2.4s, v24.4s;			\
+		pmull		r0.1q, m0.1d, m1.1d;		\
+		pmull		r2.1q, m2.1d, m3.1d;		\
+		pmull		r4.1q, m4.1d, m5.1d;		\
+	sm4e			b0.4s, v25.4s;			\
+	sm4e			b1.4s, v25.4s;			\
+	sm4e			b2.4s, v25.4s;			\
+		pmull		T1.1q, m0.1d, T0.1d;		\
+		pmull		T3.1q, m2.1d, T2.1d;		\
+		pmull		T5.1q, m4.1d, T4.1d;		\
+	sm4e			b0.4s, v26.4s;			\
+	sm4e			b1.4s, v26.4s;			\
+	sm4e			b2.4s, v26.4s;			\
+		pmull2		T0.1q, m0.2d, T0.2d;		\
+		pmull2		T2.1q, m2.2d, T2.2d;		\
+		pmull2		T4.1q, m4.2d, T4.2d;		\
+	sm4e			b0.4s, v27.4s;			\
+	sm4e			b1.4s, v27.4s;			\
+	sm4e			b2.4s, v27.4s;			\
+		pmull2		r1.1q, m0.2d, m1.2d;		\
+		pmull2		r3.1q, m2.2d, m3.2d;		\
+		pmull2		r5.1q, m4.2d, m5.2d;		\
+	sm4e			b0.4s, v28.4s;			\
+	sm4e			b1.4s, v28.4s;			\
+	sm4e			b2.4s, v28.4s;			\
+		eor		T0.16b, T0.16b, T1.16b;		\
+		eor		T2.16b, T2.16b, T3.16b;		\
+		eor		T4.16b, T4.16b, T5.16b;		\
+	sm4e			b0.4s, v29.4s;			\
+	sm4e			b1.4s, v29.4s;			\
+	sm4e			b2.4s, v29.4s;			\
+		ext		T1.16b, RZERO.16b, T0.16b, #8;	\
+		ext		T3.16b, RZERO.16b, T2.16b, #8;	\
+		ext		T5.16b, RZERO.16b, T4.16b, #8;	\
+	sm4e			b0.4s, v30.4s;			\
+	sm4e			b1.4s, v30.4s;			\
+	sm4e			b2.4s, v30.4s;			\
+		ext		T0.16b, T0.16b, RZERO.16b, #8;	\
+		ext		T2.16b, T2.16b, RZERO.16b, #8;	\
+		ext		T4.16b, T4.16b, RZERO.16b, #8;	\
+	sm4e			b0.4s, v31.4s;			\
+	sm4e			b1.4s, v31.4s;			\
+	sm4e			b2.4s, v31.4s;			\
+		eor		r0.16b, r0.16b, T1.16b;		\
+		eor		r2.16b, r2.16b, T3.16b;		\
+		eor		r4.16b, r4.16b, T5.16b;		\
+	rev64			b0.4s, b0.4s;			\
+	rev64			b1.4s, b1.4s;			\
+	rev64			b2.4s, b2.4s;			\
+		eor		r1.16b, r1.16b, T0.16b;		\
+		eor		r3.16b, r3.16b, T2.16b;		\
+		eor		r5.16b, r5.16b, T4.16b;		\
+	ext			b0.16b, b0.16b, b0.16b, #8;	\
+	ext			b1.16b, b1.16b, b1.16b, #8;	\
+	ext			b2.16b, b2.16b, b2.16b, #8;	\
+		eor		r0.16b, r0.16b, r2.16b;		\
+		eor		r1.16b, r1.16b, r3.16b;		\
+	rev32			b0.16b, b0.16b;			\
+	rev32			b1.16b, b1.16b;			\
+	rev32			b2.16b, b2.16b;			\
+		eor		r0.16b, r0.16b, r4.16b;		\
+		eor		r1.16b, r1.16b, r5.16b;
+
+#define inc32_le128(vctr)					\
+		mov		vctr.d[1], x9;			\
+		add		w6, w9, #1;			\
+		mov		vctr.d[0], x8;			\
+		bfi		x9, x6, #0, #32;		\
+		rev64		vctr.16b, vctr.16b;
+
+#define GTAG_HASH_LENGTHS(vctr0, vlen)					\
+		ld1		{vlen.16b}, [x7];			\
+		/* construct CTR0 */					\
+		/* the low 32 bits of the initial IV are always be32(1) */	\
+		mov		x6, #0x1;				\
+		bfi		x9, x6, #0, #32;			\
+		mov		vctr0.d[0], x8;				\
+		mov		vctr0.d[1], x9;				\
+		rbit		vlen.16b, vlen.16b;			\
+		rev64		vctr0.16b, vctr0.16b;			\
+		/* authtag = GCTR(CTR0, GHASH) */			\
+		eor		RHASH.16b, RHASH.16b, vlen.16b;		\
+		SM4_CRYPT_PMUL_128x128_BLK(vctr0, RR0, RR1, RHASH, RH1,	\
+					   RTMP0, RTMP1);		\
+		REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP2, RTMP3);	\
+		rbit		RHASH.16b, RHASH.16b;			\
+		eor		RHASH.16b, RHASH.16b, vctr0.16b;
+
+
+/* Register macros for encrypt and ghash */
+
+/* can be the same as input v0-v3 */
+#define	RR1	v0
+#define	RR3	v1
+#define	RR5	v2
+#define	RR7	v3
+
+#define	RR0	v4
+#define	RR2	v5
+#define	RR4	v6
+#define	RR6	v7
+
+#define RTMP0	v8
+#define RTMP1	v9
+#define RTMP2	v10
+#define RTMP3	v11
+#define RTMP4	v12
+#define RTMP5	v13
+#define RTMP6	v14
+#define RTMP7	v15
+
+#define	RH1	v16
+#define	RH2	v17
+#define	RH3	v18
+#define	RH4	v19
+
+.align 3
+SYM_FUNC_START(sm4_ce_pmull_ghash_setup)
+	/* input:
+	 *   x0: round key array, CTX
+	 *   x1: ghash table
+	 */
+	SM4_PREPARE(x0)
+
+	adr_l		x2, .Lghash_rconst
+	ld1r		{RRCONST.2d}, [x2]
+
+	eor		RZERO.16b, RZERO.16b, RZERO.16b
+
+	/* H = E(K, 0^128) */
+	rev32		v0.16b, RZERO.16b
+	SM4_CRYPT_BLK_BE(v0)
+
+	/* H ^ 1 */
+	rbit		RH1.16b, v0.16b
+
+	/* H ^ 2 */
+	PMUL_128x128(RR0, RR1, RH1, RH1, RTMP0, RTMP1)
+	REDUCTION(RH2, RR0, RR1, RRCONST, RTMP2, RTMP3)
+
+	/* H ^ 3 */
+	PMUL_128x128(RR0, RR1, RH2, RH1, RTMP0, RTMP1)
+	REDUCTION(RH3, RR0, RR1, RRCONST, RTMP2, RTMP3)
+
+	/* H ^ 4 */
+	PMUL_128x128(RR0, RR1, RH2, RH2, RTMP0, RTMP1)
+	REDUCTION(RH4, RR0, RR1, RRCONST, RTMP2, RTMP3)
+
+	st1		{RH1.16b-RH4.16b}, [x1]
+
+	ret
+SYM_FUNC_END(sm4_ce_pmull_ghash_setup)
+
+.align 3
+SYM_FUNC_START(pmull_ghash_update)
+	/* input:
+	 *   x0: ghash table
+	 *   x1: ghash result
+	 *   x2: src
+	 *   w3: nblocks
+	 */
+	ld1		{RH1.16b-RH4.16b}, [x0]
+
+	ld1		{RHASH.16b}, [x1]
+	rbit		RHASH.16b, RHASH.16b
+
+	adr_l		x4, .Lghash_rconst
+	ld1r		{RRCONST.2d}, [x4]
+
+	eor		RZERO.16b, RZERO.16b, RZERO.16b
+
+.Lghash_loop_4x:
+	cmp		w3, #4
+	blt		.Lghash_loop_1x
+
+	sub		w3, w3, #4
+
+	ld1		{v0.16b-v3.16b}, [x2], #64
+
+	rbit		v0.16b, v0.16b
+	rbit		v1.16b, v1.16b
+	rbit		v2.16b, v2.16b
+	rbit		v3.16b, v3.16b
+
+	/*
+	 * (in0 ^ HASH) * H^4 => rr0:rr1
+	 * (in1)        * H^3 => rr2:rr3
+	 * (in2)        * H^2 => rr4:rr5
+	 * (in3)        * H^1 => rr6:rr7
+	 */
+	eor		RHASH.16b, RHASH.16b, v0.16b
+
+	PMUL_128x128_4x(RR0, RR1, RHASH, RH4, RTMP0, RTMP1,
+			RR2, RR3, v1, RH3, RTMP2, RTMP3,
+			RR4, RR5, v2, RH2, RTMP4, RTMP5,
+			RR6, RR7, v3, RH1, RTMP6, RTMP7)
+
+	eor		RR0.16b, RR0.16b, RR2.16b
+	eor		RR1.16b, RR1.16b, RR3.16b
+	eor		RR0.16b, RR0.16b, RR4.16b
+	eor		RR1.16b, RR1.16b, RR5.16b
+	eor		RR0.16b, RR0.16b, RR6.16b
+	eor		RR1.16b, RR1.16b, RR7.16b
+
+	REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP0, RTMP1)
+
+	cbz		w3, .Lghash_end
+	b		.Lghash_loop_4x
+
+.Lghash_loop_1x:
+	sub		w3, w3, #1
+
+	ld1		{v0.16b}, [x2], #16
+	rbit		v0.16b, v0.16b
+	eor		RHASH.16b, RHASH.16b, v0.16b
+
+	PMUL_128x128(RR0, RR1, RHASH, RH1, RTMP0, RTMP1)
+	REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP2, RTMP3)
+
+	cbnz		w3, .Lghash_loop_1x
+
+.Lghash_end:
+	rbit		RHASH.16b, RHASH.16b
+	st1		{RHASH.2d}, [x1]
+
+	ret
+SYM_FUNC_END(pmull_ghash_update)
+
+.align 3
+SYM_FUNC_START(sm4_ce_pmull_gcm_enc)
+	/* input:
+	 *   x0: round key array, CTX
+	 *   x1: dst
+	 *   x2: src
+	 *   x3: ctr (big endian, 128 bit)
+	 *   w4: nbytes
+	 *   x5: ghash result
+	 *   x6: ghash table
+	 *   x7: lengths (only for last block)
+	 */
+	SM4_PREPARE(x0)
+
+	ldp		x8, x9, [x3]
+	rev		x8, x8
+	rev		x9, x9
+
+	ld1		{RH1.16b-RH4.16b}, [x6]
+
+	ld1		{RHASH.16b}, [x5]
+	rbit		RHASH.16b, RHASH.16b
+
+	adr_l		x6, .Lghash_rconst
+	ld1r		{RRCONST.2d}, [x6]
+
+	eor		RZERO.16b, RZERO.16b, RZERO.16b
+
+	cbz		w4, .Lgcm_enc_hash_len
+
+.Lgcm_enc_loop_4x:
+	cmp		w4, #(4 * 16)
+	blt		.Lgcm_enc_loop_1x
+
+	sub		w4, w4, #(4 * 16)
+
+	/* construct CTRs */
+	inc32_le128(v0)			/* +0 */
+	inc32_le128(v1)			/* +1 */
+	inc32_le128(v2)			/* +2 */
+	inc32_le128(v3)			/* +3 */
+
+	ld1		{RTMP0.16b-RTMP3.16b}, [x2], #64
+
+	SM4_CRYPT_BLK4(v0, v1, v2, v3)
+
+	eor		v0.16b, v0.16b, RTMP0.16b
+	eor		v1.16b, v1.16b, RTMP1.16b
+	eor		v2.16b, v2.16b, RTMP2.16b
+	eor		v3.16b, v3.16b, RTMP3.16b
+	st1		{v0.16b-v3.16b}, [x1], #64
+
+	/* ghash update */
+
+	rbit		v0.16b, v0.16b
+	rbit		v1.16b, v1.16b
+	rbit		v2.16b, v2.16b
+	rbit		v3.16b, v3.16b
+
+	/*
+	 * (in0 ^ HASH) * H^4 => rr0:rr1
+	 * (in1)        * H^3 => rr2:rr3
+	 * (in2)        * H^2 => rr4:rr5
+	 * (in3)        * H^1 => rr6:rr7
+	 */
+	eor		RHASH.16b, RHASH.16b, v0.16b
+
+	PMUL_128x128_4x(RR0, RR1, RHASH, RH4, RTMP0, RTMP1,
+			RR2, RR3, v1, RH3, RTMP2, RTMP3,
+			RR4, RR5, v2, RH2, RTMP4, RTMP5,
+			RR6, RR7, v3, RH1, RTMP6, RTMP7)
+
+	eor		RR0.16b, RR0.16b, RR2.16b
+	eor		RR1.16b, RR1.16b, RR3.16b
+	eor		RR0.16b, RR0.16b, RR4.16b
+	eor		RR1.16b, RR1.16b, RR5.16b
+	eor		RR0.16b, RR0.16b, RR6.16b
+	eor		RR1.16b, RR1.16b, RR7.16b
+
+	REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP0, RTMP1)
+
+	cbz		w4, .Lgcm_enc_hash_len
+	b		.Lgcm_enc_loop_4x
+
+.Lgcm_enc_loop_1x:
+	cmp		w4, #16
+	blt		.Lgcm_enc_tail
+
+	sub		w4, w4, #16
+
+	/* construct CTRs */
+	inc32_le128(v0)
+
+	ld1		{RTMP0.16b}, [x2], #16
+
+	SM4_CRYPT_BLK(v0)
+
+	eor		v0.16b, v0.16b, RTMP0.16b
+	st1		{v0.16b}, [x1], #16
+
+	/* ghash update */
+	rbit		v0.16b, v0.16b
+	eor		RHASH.16b, RHASH.16b, v0.16b
+	PMUL_128x128(RR0, RR1, RHASH, RH1, RTMP0, RTMP1)
+	REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP2, RTMP3)
+
+	cbz		w4, .Lgcm_enc_hash_len
+	b		.Lgcm_enc_loop_1x
+
+.Lgcm_enc_tail:
+	/* construct CTRs */
+	inc32_le128(v0)
+	SM4_CRYPT_BLK(v0)
+
+	/* load permute table */
+	adr_l		x0, .Lcts_permute_table
+	add		x0, x0, #32
+	sub		x0, x0, w4, uxtw
+	ld1		{v3.16b}, [x0]
+
+.Lgcm_enc_tail_loop:
+	/* do encrypt */
+	ldrb		w0, [x2], #1	/* load 1 byte of input */
+	umov		w6, v0.b[0]	/* get next keystream byte */
+	eor		w6, w6, w0	/* w6 = keystream ^ input */
+	strb		w6, [x1], #1	/* store output byte */
+
+	/* rotate the consumed keystream byte out */
+	ext		v0.16b, v0.16b, v0.16b, #1
+	/* collect the ciphertext bytes in the high bytes */
+	ins		v0.b[15], w6
+
+	subs		w4, w4, #1
+	bne		.Lgcm_enc_tail_loop
+
+	/* padding last block with zeros */
+	tbl		v0.16b, {v0.16b}, v3.16b
+
+	/* ghash update */
+	rbit		v0.16b, v0.16b
+	eor		RHASH.16b, RHASH.16b, v0.16b
+	PMUL_128x128(RR0, RR1, RHASH, RH1, RTMP0, RTMP1)
+	REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP2, RTMP3)
+
+.Lgcm_enc_hash_len:
+	cbz		x7, .Lgcm_enc_end
+
+	GTAG_HASH_LENGTHS(v1, v3)
+
+	b		.Lgcm_enc_ret
+
+.Lgcm_enc_end:
+	/* store new CTR */
+	rev		x8, x8
+	rev		x9, x9
+	stp		x8, x9, [x3]
+
+	rbit		RHASH.16b, RHASH.16b
+
+.Lgcm_enc_ret:
+	/* store new MAC */
+	st1		{RHASH.2d}, [x5]
+
+	ret
+SYM_FUNC_END(sm4_ce_pmull_gcm_enc)
+
+#undef	RR1
+#undef	RR3
+#undef	RR5
+#undef	RR7
+#undef	RR0
+#undef	RR2
+#undef	RR4
+#undef	RR6
+#undef RTMP0
+#undef RTMP1
+#undef RTMP2
+#undef RTMP3
+#undef RTMP4
+#undef RTMP5
+#undef RTMP6
+#undef RTMP7
+#undef	RH1
+#undef	RH2
+#undef	RH3
+#undef	RH4
+
+
+/* Register macros for decrypt */
+
+/* v0-v2 for building CTRs, v3-v5 for saving inputs */
+
+#define	RR1	v6
+#define	RR3	v7
+#define	RR5	v8
+
+#define	RR0	v9
+#define	RR2	v10
+#define	RR4	v11
+
+#define RTMP0	v12
+#define RTMP1	v13
+#define RTMP2	v14
+#define RTMP3	v15
+#define RTMP4	v16
+#define RTMP5	v17
+
+#define	RH1	v18
+#define	RH2	v19
+#define	RH3	v20
+
+.align 3
+SYM_FUNC_START(sm4_ce_pmull_gcm_dec)
+	/* input:
+	 *   x0: round key array, CTX
+	 *   x1: dst
+	 *   x2: src
+	 *   x3: ctr (big endian, 128 bit)
+	 *   w4: nbytes
+	 *   x5: ghash result
+	 *   x6: ghash table
+	 *   x7: lengths (only for last block)
+	 */
+	SM4_PREPARE(x0)
+
+	ldp		x8, x9, [x3]
+	rev		x8, x8
+	rev		x9, x9
+
+	ld1		{RH1.16b-RH3.16b}, [x6]
+
+	ld1		{RHASH.16b}, [x5]
+	rbit		RHASH.16b, RHASH.16b
+
+	adr_l		x6, .Lghash_rconst
+	ld1r		{RRCONST.2d}, [x6]
+
+	eor		RZERO.16b, RZERO.16b, RZERO.16b
+
+	cbz		w4, .Lgcm_dec_hash_len
+
+.Lgcm_dec_loop_3x:
+	cmp		w4, #(3 * 16)
+	blt		.Lgcm_dec_loop_1x
+
+	sub		w4, w4, #(3 * 16)
+
+	ld1		{v3.16b-v5.16b}, [x2], #(3 * 16)
+
+	/* construct CTRs */
+	inc32_le128(v0)			/* +0 */
+	rbit		v6.16b, v3.16b
+	inc32_le128(v1)			/* +1 */
+	rbit		v7.16b, v4.16b
+	inc32_le128(v2)			/* +2 */
+	rbit		v8.16b, v5.16b
+
+	eor		RHASH.16b, RHASH.16b, v6.16b
+
+	/* decrypt & ghash update */
+	SM4_CRYPT_PMUL_128x128_BLK3(v0, v1, v2,
+				    RR0, RR1, RHASH, RH3, RTMP0, RTMP1,
+				    RR2, RR3, v7, RH2, RTMP2, RTMP3,
+				    RR4, RR5, v8, RH1, RTMP4, RTMP5)
+
+	eor		v0.16b, v0.16b, v3.16b
+	eor		v1.16b, v1.16b, v4.16b
+	eor		v2.16b, v2.16b, v5.16b
+
+	REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP0, RTMP1)
+
+	st1		{v0.16b-v2.16b}, [x1], #(3 * 16)
+
+	cbz		w4, .Lgcm_dec_hash_len
+	b		.Lgcm_dec_loop_3x
+
+.Lgcm_dec_loop_1x:
+	cmp		w4, #16
+	blt		.Lgcm_dec_tail
+
+	sub		w4, w4, #16
+
+	ld1		{v3.16b}, [x2], #16
+
+	/* construct CTRs */
+	inc32_le128(v0)
+	rbit		v6.16b, v3.16b
+
+	eor		RHASH.16b, RHASH.16b, v6.16b
+
+	SM4_CRYPT_PMUL_128x128_BLK(v0, RR0, RR1, RHASH, RH1, RTMP0, RTMP1)
+
+	eor		v0.16b, v0.16b, v3.16b
+
+	REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP2, RTMP3)
+
+	st1		{v0.16b}, [x1], #16
+
+	cbz		w4, .Lgcm_dec_hash_len
+	b		.Lgcm_dec_loop_1x
+
+.Lgcm_dec_tail:
+	/* construct CTRs */
+	inc32_le128(v0)
+	SM4_CRYPT_BLK(v0)
+
+	/* load permute table */
+	adr_l		x0, .Lcts_permute_table
+	add		x0, x0, #32
+	sub		x0, x0, w4, uxtw
+	ld1		{v3.16b}, [x0]
+
+.Lgcm_dec_tail_loop:
+	/* do decrypt */
+	ldrb		w0, [x2], #1	/* get 1 byte from input */
+	umov		w6, v0.b[0]	/* get next keystream byte */
+	eor		w6, w6, w0	/* w6 = CTR ^ input */
+	strb		w6, [x1], #1	/* store out byte */
+
+	/* shift right out one byte */
+	ext		v0.16b, v0.16b, v0.16b, #1
+	/* the last ciphertext is placed in high bytes */
+	ins		v0.b[15], w0
+
+	subs		w4, w4, #1
+	bne		.Lgcm_dec_tail_loop
+
+	/* padding last block with zeros */
+	tbl		v0.16b, {v0.16b}, v3.16b
+
+	/* ghash update */
+	rbit		v0.16b, v0.16b
+	eor		RHASH.16b, RHASH.16b, v0.16b
+	PMUL_128x128(RR0, RR1, RHASH, RH1, RTMP0, RTMP1)
+	REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP2, RTMP3)
+
+.Lgcm_dec_hash_len:
+	cbz		x7, .Lgcm_dec_end
+
+	GTAG_HASH_LENGTHS(v1, v3)
+
+	b		.Lgcm_dec_ret
+
+.Lgcm_dec_end:
+	/* store new CTR */
+	rev		x8, x8
+	rev		x9, x9
+	stp		x8, x9, [x3]
+
+	rbit		RHASH.16b, RHASH.16b
+
+.Lgcm_dec_ret:
+	/* store new MAC */
+	st1		{RHASH.2d}, [x5]
+
+	ret
+SYM_FUNC_END(sm4_ce_pmull_gcm_dec)
+
+	.section	".rodata", "a"
+	.align 4
+.Lcts_permute_table:
+	.byte		0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+	.byte		0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+	.byte		 0x0,  0x1,  0x2,  0x3,  0x4,  0x5,  0x6,  0x7
+	.byte		 0x8,  0x9,  0xa,  0xb,  0xc,  0xd,  0xe,  0xf
+	.byte		0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+	.byte		0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+
+.Lghash_rconst:
+	.quad		0x87
diff --git a/arch/arm64/crypto/sm4-ce-gcm-glue.c b/arch/arm64/crypto/sm4-ce-gcm-glue.c
new file mode 100644
index 000000000000..e90ea0f17beb
--- /dev/null
+++ b/arch/arm64/crypto/sm4-ce-gcm-glue.c
@@ -0,0 +1,286 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * SM4-GCM AEAD Algorithm using ARMv8 Crypto Extensions
+ * as specified in rfc8998
+ * https://datatracker.ietf.org/doc/html/rfc8998
+ *
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#include <linux/module.h>
+#include <linux/crypto.h>
+#include <linux/kernel.h>
+#include <linux/cpufeature.h>
+#include <asm/neon.h>
+#include <crypto/b128ops.h>
+#include <crypto/scatterwalk.h>
+#include <crypto/internal/aead.h>
+#include <crypto/internal/skcipher.h>
+#include <crypto/sm4.h>
+#include "sm4-ce.h"
+
+asmlinkage void sm4_ce_pmull_ghash_setup(const u32 *rkey_enc, u8 *ghash_table);
+asmlinkage void pmull_ghash_update(const u8 *ghash_table, u8 *ghash,
+				   const u8 *src, unsigned int nblocks);
+asmlinkage void sm4_ce_pmull_gcm_enc(const u32 *rkey_enc, u8 *dst,
+				     const u8 *src, u8 *iv,
+				     unsigned int nbytes, u8 *ghash,
+				     const u8 *ghash_table, const u8 *lengths);
+asmlinkage void sm4_ce_pmull_gcm_dec(const u32 *rkey_enc, u8 *dst,
+				     const u8 *src, u8 *iv,
+				     unsigned int nbytes, u8 *ghash,
+				     const u8 *ghash_table, const u8 *lengths);
+
+#define GHASH_BLOCK_SIZE	16
+#define GCM_IV_SIZE		12
+
+struct sm4_gcm_ctx {
+	struct sm4_ctx key;
+	u8 ghash_table[16 * 4];
+};
+
+
+static int gcm_setkey(struct crypto_aead *tfm, const u8 *key,
+		      unsigned int key_len)
+{
+	struct sm4_gcm_ctx *ctx = crypto_aead_ctx(tfm);
+
+	if (key_len != SM4_KEY_SIZE)
+		return -EINVAL;
+
+	kernel_neon_begin();
+
+	sm4_ce_expand_key(key, ctx->key.rkey_enc, ctx->key.rkey_dec,
+			  crypto_sm4_fk, crypto_sm4_ck);
+	sm4_ce_pmull_ghash_setup(ctx->key.rkey_enc, ctx->ghash_table);
+
+	kernel_neon_end();
+	return 0;
+}
+
+static int gcm_setauthsize(struct crypto_aead *tfm, unsigned int authsize)
+{
+	switch (authsize) {
+	case 4:
+	case 8:
+	case 12 ... 16:
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
+
+static void gcm_calculate_auth_mac(struct aead_request *req, u8 ghash[])
+{
+	struct crypto_aead *aead = crypto_aead_reqtfm(req);
+	struct sm4_gcm_ctx *ctx = crypto_aead_ctx(aead);
+	u8 __aligned(8) buffer[GHASH_BLOCK_SIZE];
+	u32 assoclen = req->assoclen;
+	struct scatter_walk walk;
+	unsigned int buflen = 0;
+
+	scatterwalk_start(&walk, req->src);
+
+	do {
+		u32 n = scatterwalk_clamp(&walk, assoclen);
+		u8 *p, *ptr;
+
+		if (!n) {
+			scatterwalk_start(&walk, sg_next(walk.sg));
+			n = scatterwalk_clamp(&walk, assoclen);
+		}
+
+		p = ptr = scatterwalk_map(&walk);
+		assoclen -= n;
+		scatterwalk_advance(&walk, n);
+
+		if (n + buflen < GHASH_BLOCK_SIZE) {
+			memcpy(&buffer[buflen], ptr, n);
+			buflen += n;
+		} else {
+			unsigned int nblocks;
+
+			if (buflen) {
+				unsigned int l = GHASH_BLOCK_SIZE - buflen;
+
+				memcpy(&buffer[buflen], ptr, l);
+				ptr += l;
+				n -= l;
+
+				pmull_ghash_update(ctx->ghash_table, ghash,
+						   buffer, 1);
+			}
+
+			nblocks = n / GHASH_BLOCK_SIZE;
+			if (nblocks) {
+				pmull_ghash_update(ctx->ghash_table, ghash,
+						   ptr, nblocks);
+				ptr += nblocks * GHASH_BLOCK_SIZE;
+			}
+
+			buflen = n % GHASH_BLOCK_SIZE;
+			if (buflen)
+				memcpy(&buffer[0], ptr, buflen);
+		}
+
+		scatterwalk_unmap(p);
+		scatterwalk_done(&walk, 0, assoclen);
+	} while (assoclen);
+
+	/* pad the last partial block with zeros */
+	if (buflen) {
+		memset(&buffer[buflen], 0, GHASH_BLOCK_SIZE - buflen);
+		pmull_ghash_update(ctx->ghash_table, ghash, buffer, 1);
+	}
+}
+
+static int gcm_crypt(struct aead_request *req, struct skcipher_walk *walk,
+		     struct sm4_gcm_ctx *ctx, u8 ghash[],
+		     void (*sm4_ce_pmull_gcm_crypt)(const u32 *rkey_enc,
+				u8 *dst, const u8 *src, u8 *iv,
+				unsigned int nbytes, u8 *ghash,
+				const u8 *ghash_table, const u8 *lengths))
+{
+	u8 __aligned(8) iv[SM4_BLOCK_SIZE];
+	be128 __aligned(8) lengths;
+	int err;
+
+	memset(ghash, 0, SM4_BLOCK_SIZE);
+
+	lengths.a = cpu_to_be64(req->assoclen * 8);
+	lengths.b = cpu_to_be64(walk->total * 8);
+
+	memcpy(iv, walk->iv, GCM_IV_SIZE);
+	put_unaligned_be32(2, iv + GCM_IV_SIZE);
+
+	kernel_neon_begin();
+
+	if (req->assoclen)
+		gcm_calculate_auth_mac(req, ghash);
+
+	do {
+		unsigned int tail = walk->nbytes % SM4_BLOCK_SIZE;
+		const u8 *src = walk->src.virt.addr;
+		u8 *dst = walk->dst.virt.addr;
+
+		if (walk->nbytes == walk->total) {
+			tail = 0;
+
+			sm4_ce_pmull_gcm_crypt(ctx->key.rkey_enc, dst, src, iv,
+					       walk->nbytes, ghash,
+					       ctx->ghash_table,
+					       (const u8 *)&lengths);
+		} else if (walk->nbytes - tail) {
+			sm4_ce_pmull_gcm_crypt(ctx->key.rkey_enc, dst, src, iv,
+					       walk->nbytes - tail, ghash,
+					       ctx->ghash_table, NULL);
+		}
+
+		kernel_neon_end();
+
+		err = skcipher_walk_done(walk, tail);
+		if (err)
+			return err;
+		if (walk->nbytes)
+			kernel_neon_begin();
+	} while (walk->nbytes > 0);
+
+	return 0;
+}
+
+static int gcm_encrypt(struct aead_request *req)
+{
+	struct crypto_aead *aead = crypto_aead_reqtfm(req);
+	struct sm4_gcm_ctx *ctx = crypto_aead_ctx(aead);
+	u8 __aligned(8) ghash[SM4_BLOCK_SIZE];
+	struct skcipher_walk walk;
+	int err;
+
+	err = skcipher_walk_aead_encrypt(&walk, req, false);
+	if (err)
+		return err;
+
+	err = gcm_crypt(req, &walk, ctx, ghash, sm4_ce_pmull_gcm_enc);
+	if (err)
+		return err;
+
+	/* copy authtag to end of dst */
+	scatterwalk_map_and_copy(ghash, req->dst, req->assoclen + req->cryptlen,
+				 crypto_aead_authsize(aead), 1);
+
+	return 0;
+}
+
+static int gcm_decrypt(struct aead_request *req)
+{
+	struct crypto_aead *aead = crypto_aead_reqtfm(req);
+	unsigned int authsize = crypto_aead_authsize(aead);
+	struct sm4_gcm_ctx *ctx = crypto_aead_ctx(aead);
+	u8 __aligned(8) ghash[SM4_BLOCK_SIZE];
+	u8 authtag[SM4_BLOCK_SIZE];
+	struct skcipher_walk walk;
+	int err;
+
+	err = skcipher_walk_aead_decrypt(&walk, req, false);
+	if (err)
+		return err;
+
+	err = gcm_crypt(req, &walk, ctx, ghash, sm4_ce_pmull_gcm_dec);
+	if (err)
+		return err;
+
+	/* compare calculated auth tag with the stored one */
+	scatterwalk_map_and_copy(authtag, req->src,
+				 req->assoclen + req->cryptlen - authsize,
+				 authsize, 0);
+
+	if (crypto_memneq(authtag, ghash, authsize))
+		return -EBADMSG;
+
+	return 0;
+}
+
+static struct aead_alg sm4_gcm_alg = {
+	.base = {
+		.cra_name		= "gcm(sm4)",
+		.cra_driver_name	= "gcm-sm4-ce",
+		.cra_priority		= 400,
+		.cra_blocksize		= 1,
+		.cra_ctxsize		= sizeof(struct sm4_gcm_ctx),
+		.cra_module		= THIS_MODULE,
+	},
+	.ivsize		= GCM_IV_SIZE,
+	.chunksize	= SM4_BLOCK_SIZE,
+	.maxauthsize	= SM4_BLOCK_SIZE,
+	.setkey		= gcm_setkey,
+	.setauthsize	= gcm_setauthsize,
+	.encrypt	= gcm_encrypt,
+	.decrypt	= gcm_decrypt,
+};
+
+static int __init sm4_ce_gcm_init(void)
+{
+	if (!cpu_have_named_feature(PMULL))
+		return -ENODEV;
+
+	return crypto_register_aead(&sm4_gcm_alg);
+}
+
+static void __exit sm4_ce_gcm_exit(void)
+{
+	crypto_unregister_aead(&sm4_gcm_alg);
+}
+
+static const struct cpu_feature sm4_ce_gcm_cpu_feature[] = {
+	{ cpu_feature(PMULL) },
+	{}
+};
+MODULE_DEVICE_TABLE(cpu, sm4_ce_gcm_cpu_feature);
+
+module_cpu_feature_match(SM4, sm4_ce_gcm_init);
+module_exit(sm4_ce_gcm_exit);
+
+MODULE_DESCRIPTION("Synchronous SM4 in GCM mode using ARMv8 Crypto Extensions");
+MODULE_ALIAS_CRYPTO("gcm(sm4)");
+MODULE_AUTHOR("Tianjia Zhang <tianjia.zhang@linux.alibaba.com>");
+MODULE_LICENSE("GPL v2");
-- 
2.24.3 (Apple Git-128)



* [PATCH 16/16] crypto: arm64/sm4 - add ARMv9 SVE cryptography acceleration implementation
  2022-09-26  9:36 ` Tianjia Zhang
@ 2022-09-26  9:36   ` Tianjia Zhang
  -1 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26  9:36 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
	Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
	Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
	linux-stm32

Scalable Vector Extension (SVE) is the next-generation SIMD extension for
arm64. SVE does not fix the vector length: each CPU implementation chooses
one, from a minimum of 128 bits up to a maximum of 2048 bits, in 128-bit
increments. The SVE design guarantees that the same application can run on
different implementations that support SVE without being recompiled.
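
As a rough illustration of what a variable vector length means for this
driver, the following C sketch (not part of the patch) mirrors how the
assembly in sm4_sve_ce_crypt appears to split a request: 8 vectors per
iteration, then at most one group of 4 vectors, then single vectors, with
leftover blocks handled one at a time by the Crypto Extension path. The
names blocks_per_vector() and split_blocks() are hypothetical and exist
only for this example.

```c
#include <assert.h>

#define SM4_BLOCK_SIZE 16

/* Hypothetical result type: how a request is divided between the loops. */
struct split {
	unsigned int iters_8x;	/* iterations that process 8 vectors each */
	unsigned int iters_4x;	/* 0 or 1: one group of 4 vectors */
	unsigned int iters_1x;	/* iterations that process 1 vector each */
	unsigned int ce_blocks;	/* leftover blocks, one per CE iteration */
};

/* Number of 16-byte SM4 blocks that fit in one SVE vector of vl_bits. */
static unsigned int blocks_per_vector(unsigned int vl_bits)
{
	/* SVE vector lengths are multiples of 128 bits, from 128 to 2048. */
	assert(vl_bits >= 128 && vl_bits <= 2048 && vl_bits % 128 == 0);
	return vl_bits / 8 / SM4_BLOCK_SIZE;
}

/* Divide nblocks the way the 8x / 4x / 1x / CE loops would consume them. */
static struct split split_blocks(unsigned int nblocks, unsigned int vl_bits)
{
	unsigned int bpv = blocks_per_vector(vl_bits);
	struct split s;

	s.iters_8x = nblocks / (8 * bpv);
	nblocks %= 8 * bpv;
	s.iters_4x = nblocks / (4 * bpv);
	nblocks %= 4 * bpv;
	s.iters_1x = nblocks / bpv;
	s.ce_blocks = nblocks % bpv;
	return s;
}
```

For example, with VL = 256 bits a vector holds 2 blocks, so a request of
37 blocks would be split into 2 iterations of 8 vectors, 2 single vectors
and 1 leftover block. With VL = 128 bits a vector holds exactly one block
and the leftover path is never needed.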

SVE was originally introduced in ARMv8, and ARMv9 introduced SVE2 to
expand and improve it. Just as NEON has the Crypto Extensions, SVE has
similar instructions, called cryptography acceleration instructions, and
they are likewise an optional part of the instruction set.

This patch uses the SM4 cryptography acceleration instructions and SVE2
instructions to optimize the SM4 algorithm for ECB/CBC/CFB/CTR modes.
Since CBC and CFB encryption cannot be parallelized, those paths use the
Crypto Extension instructions instead.
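
The dependency that prevents parallel CBC encryption can be seen in a
small C sketch (not part of the patch). It uses a toy, invertible byte
transform in place of SM4: toy_encrypt_block() and toy_decrypt_block() are
made up for this example and are not secure, so only the chaining
structure is meaningful. Encryption needs the previous ciphertext block
before it can start the next one, while decryption of block i only reads
ciphertext blocks i and i-1 and can therefore run in any order, or on
many blocks at once in a wide vector.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLK 16

/* Toy stand-in for the block cipher (NOT SM4): xor with key, rotate left 3. */
static void toy_encrypt_block(uint8_t *b, uint8_t key)
{
	for (size_t i = 0; i < BLK; i++) {
		uint8_t x = b[i] ^ key;

		b[i] = (uint8_t)((x << 3) | (x >> 5));
	}
}

/* Inverse of toy_encrypt_block(): rotate right 3, xor with key. */
static void toy_decrypt_block(uint8_t *b, uint8_t key)
{
	for (size_t i = 0; i < BLK; i++) {
		uint8_t x = (uint8_t)((b[i] >> 3) | (b[i] << 5));

		b[i] = x ^ key;
	}
}

/* CBC encryption: block i cannot start until ciphertext block i-1 exists. */
static void cbc_encrypt(uint8_t *dst, const uint8_t *src, const uint8_t *iv,
			size_t nblocks, uint8_t key)
{
	const uint8_t *prev = iv;

	for (size_t i = 0; i < nblocks; i++) {
		uint8_t *out = dst + i * BLK;

		for (size_t j = 0; j < BLK; j++)
			out[j] = src[i * BLK + j] ^ prev[j];
		toy_encrypt_block(out, key);
		prev = out;
	}
}

/*
 * CBC decryption, deliberately done last block first to show that the
 * order does not matter: every input it needs is already in src.
 * dst and src must not overlap in this sketch.
 */
static void cbc_decrypt_reverse(uint8_t *dst, const uint8_t *src,
				const uint8_t *iv, size_t nblocks, uint8_t key)
{
	assert(dst != src);

	for (size_t i = nblocks; i-- > 0; ) {
		const uint8_t *prev = i ? src + (i - 1) * BLK : iv;
		uint8_t *out = dst + i * BLK;

		memcpy(out, src + i * BLK, BLK);
		toy_decrypt_block(out, key);
		for (size_t j = 0; j < BLK; j++)
			out[j] ^= prev[j];
	}
}
```

Encrypting a buffer with cbc_encrypt() and then passing the result to
cbc_decrypt_reverse() with the same IV and key returns the original
plaintext, even though the blocks were decrypted in the opposite order.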

No test environment with a Vector Length (VL) greater than 128 bits was
available, so the performance data was obtained on a machine whose VL is
128 bits. Because this driver is only enabled when the VL is greater than
128 bits, these figures are for reference only. The data shows little
difference between the Crypto Extension and SVE (VL = 128 bits)
implementations; the benefit is expected to be more obvious when the VL
is 256 bits or longer.

Benchmark on T-Head Yitian-710 2.75 GHz. The data comes from mode 218 of
tcrypt and is compared with the Crypto Extension implementation. The
columns are the different block lengths, and the unit is Mb/s:

sm4-ce      |      16       64      128      256     1024     1420     4096
------------+--------------------------------------------------------------
    ECB enc |  315.18  1162.65  1815.66  2553.50  3692.91  3727.20  4001.93
    ECB dec |  316.06  1172.97  1817.81  2554.66  3692.18  3786.54  4001.93
    CBC enc |  304.82   629.54   768.65   864.72   953.90   963.32   974.06
    CBC dec |  306.05  1142.53  1805.11  2481.67  3522.06  3587.87  3790.99
    CFB enc |  309.48   635.70   774.44   865.85   950.62   952.68   968.24
    CFB dec |  315.98  1170.38  1828.75  2509.72  3543.63  3539.40  3793.25
    CTR enc |  285.83  1036.59  1583.50  2147.26  2933.54  2954.66  3041.14
    CTR dec |  285.29  1037.47  1584.67  2145.51  2934.10  2950.89  3041.62

sm4-sve-ce (VL = 128 bits)
    ECB enc |  310.00  1154.70  1813.26  2579.74  3766.90  3869.45  4100.26
    ECB dec |  315.60  1176.22  1838.06  2593.69  3774.95  3878.42  4098.83
    CBC enc |  303.44   622.65   764.67   861.40   953.18   963.05   973.77
    CBC dec |  302.13  1091.15  1689.10  2267.79  3182.84  3242.68  3408.92
    CFB enc |  296.62   620.41   762.94   858.96   948.18   956.04   967.67
    CFB dec |  291.23  1065.50  1637.33  2228.12  3158.52  3213.35  3403.83
    CTR enc |  272.27   959.35  1466.34  1934.24  2562.80  2595.87  2695.15
    CTR dec |  273.40   963.65  1471.83  1938.97  2563.12  2597.25  2694.54

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
 arch/arm64/crypto/Kconfig           |   19 +
 arch/arm64/crypto/Makefile          |    3 +
 arch/arm64/crypto/sm4-sve-ce-core.S | 1028 +++++++++++++++++++++++++++
 arch/arm64/crypto/sm4-sve-ce-glue.c |  332 +++++++++
 4 files changed, 1382 insertions(+)
 create mode 100644 arch/arm64/crypto/sm4-sve-ce-core.S
 create mode 100644 arch/arm64/crypto/sm4-sve-ce-glue.c

diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index 6793d5bc3ee5..bbb5a7a08af5 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -249,6 +249,25 @@ config CRYPTO_SM4_ARM64_CE_BLK
 	  - ARMv8 Crypto Extensions
 	  - NEON (Advanced SIMD) extensions
 
+config CRYPTO_SM4_ARM64_SVE_CE_BLK
+	tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR (ARMv9 cryptography acceleration with SVE2)"
+	depends on KERNEL_MODE_NEON
+	select CRYPTO_SKCIPHER
+	select CRYPTO_SM4
+	select CRYPTO_SM4_ARM64_CE_BLK
+	help
+	  Length-preserving ciphers: SM4 cipher algorithms (OSCCA GB/T 32907-2016)
+	  with block cipher modes:
+	  - ECB (Electronic Codebook) mode (NIST SP800-38A)
+	  - CBC (Cipher Block Chaining) mode (NIST SP800-38A)
+	  - CFB (Cipher Feedback) mode (NIST SP800-38A)
+	  - CTR (Counter) mode (NIST SP800-38A)
+
+	  Architecture: arm64 using:
+	  - ARMv8 Crypto Extensions
+	  - ARMv9 cryptography acceleration with SVE2
+	  - NEON (Advanced SIMD) extensions
+
 config CRYPTO_SM4_ARM64_NEON_BLK
 	tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR (NEON)"
 	depends on KERNEL_MODE_NEON
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index 4818e204c2ac..355dd9053434 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -38,6 +38,9 @@ sm4-ce-gcm-y := sm4-ce-gcm-glue.o sm4-ce-gcm-core.o
 obj-$(CONFIG_CRYPTO_SM4_ARM64_NEON_BLK) += sm4-neon.o
 sm4-neon-y := sm4-neon-glue.o sm4-neon-core.o
 
+obj-$(CONFIG_CRYPTO_SM4_ARM64_SVE_CE_BLK) += sm4-sve-ce.o
+sm4-sve-ce-y := sm4-sve-ce-glue.o sm4-sve-ce-core.o
+
 obj-$(CONFIG_CRYPTO_GHASH_ARM64_CE) += ghash-ce.o
 ghash-ce-y := ghash-ce-glue.o ghash-ce-core.o
 
diff --git a/arch/arm64/crypto/sm4-sve-ce-core.S b/arch/arm64/crypto/sm4-sve-ce-core.S
new file mode 100644
index 000000000000..caecbdf2536c
--- /dev/null
+++ b/arch/arm64/crypto/sm4-sve-ce-core.S
@@ -0,0 +1,1028 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * SM4 Cipher Algorithm for ARMv9 Crypto Extensions with SVE2
+ * as specified in
+ * https://tools.ietf.org/id/draft-ribose-cfrg-sm4-10.html
+ *
+ * Copyright (C) 2022, Alibaba Group.
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+
+.arch	armv8-a+crypto+sve+sve2
+
+.irp b, 0, 15, 24, 25, 26, 27, 28, 29, 30, 31
+	.set .Lv\b\().4s, \b
+.endr
+
+.irp b, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, \
+		16, 24, 25, 26, 27, 28, 29, 30, 31
+	.set .Lz\b\().s, \b
+.endr
+
+.macro sm4e, vd, vn
+	.inst 0xcec08400 | (.L\vn << 5) | .L\vd
+.endm
+
+.macro sm4e_sve, zd, zn
+	.inst 0x4523e000 | (.L\zn << 5) | .L\zd
+.endm
+
+
+/* Register macros */
+
+#define RCTR        z16
+#define RCTRv       v16
+#define RIV         z16
+#define RIVv        v16
+#define RSWAP128    z17
+#define RZERO       z18
+#define RLE128_INC  z19
+
+#define RTMP0       z20
+#define RTMP0v      v20
+#define RTMP1       z21
+#define RTMP2       z22
+#define RTMP3       z23
+
+
+/* Helper macros. */
+
+#define SM4_PREPARE(ptr)					\
+		adr_l		x7, .Lbswap128_mask;		\
+		ptrue		p0.b, ALL;			\
+		rdvl		x5, #1;				\
+		ld1b		{RSWAP128.b}, p0/z, [x7];	\
+								\
+		ld1		{v24.16b-v27.16b}, [ptr], #64;	\
+		ld1		{v28.16b-v31.16b}, [ptr];	\
+		dup		z24.q, z24.q[0];		\
+		dup		z25.q, z25.q[0];		\
+		dup		z26.q, z26.q[0];		\
+		dup		z27.q, z27.q[0];		\
+		dup		z28.q, z28.q[0];		\
+		dup		z29.q, z29.q[0];		\
+		dup		z30.q, z30.q[0];		\
+		dup		z31.q, z31.q[0];
+
+#define SM4_SVE_CE_CRYPT_BLK(b0)				\
+		revb		b0.s, p0/m, b0.s;		\
+		sm4e_sve	b0.s, z24.s;			\
+		sm4e_sve	b0.s, z25.s;			\
+		sm4e_sve	b0.s, z26.s;			\
+		sm4e_sve	b0.s, z27.s;			\
+		sm4e_sve	b0.s, z28.s;			\
+		sm4e_sve	b0.s, z29.s;			\
+		sm4e_sve	b0.s, z30.s;			\
+		sm4e_sve	b0.s, z31.s;			\
+		tbl		b0.b, {b0.b}, RSWAP128.b;	\
+		revb		b0.s, p0/m, b0.s;
+
+#define SM4_SVE_CE_CRYPT_BLK4(b0, b1, b2, b3)			\
+		revb		b0.s, p0/m, b0.s;		\
+		revb		b1.s, p0/m, b1.s;		\
+		revb		b2.s, p0/m, b2.s;		\
+		revb		b3.s, p0/m, b3.s;		\
+		sm4e_sve	b0.s, z24.s;			\
+		sm4e_sve	b1.s, z24.s;			\
+		sm4e_sve	b2.s, z24.s;			\
+		sm4e_sve	b3.s, z24.s;			\
+		sm4e_sve	b0.s, z25.s;			\
+		sm4e_sve	b1.s, z25.s;			\
+		sm4e_sve	b2.s, z25.s;			\
+		sm4e_sve	b3.s, z25.s;			\
+		sm4e_sve	b0.s, z26.s;			\
+		sm4e_sve	b1.s, z26.s;			\
+		sm4e_sve	b2.s, z26.s;			\
+		sm4e_sve	b3.s, z26.s;			\
+		sm4e_sve	b0.s, z27.s;			\
+		sm4e_sve	b1.s, z27.s;			\
+		sm4e_sve	b2.s, z27.s;			\
+		sm4e_sve	b3.s, z27.s;			\
+		sm4e_sve	b0.s, z28.s;			\
+		sm4e_sve	b1.s, z28.s;			\
+		sm4e_sve	b2.s, z28.s;			\
+		sm4e_sve	b3.s, z28.s;			\
+		sm4e_sve	b0.s, z29.s;			\
+		sm4e_sve	b1.s, z29.s;			\
+		sm4e_sve	b2.s, z29.s;			\
+		sm4e_sve	b3.s, z29.s;			\
+		sm4e_sve	b0.s, z30.s;			\
+		sm4e_sve	b1.s, z30.s;			\
+		sm4e_sve	b2.s, z30.s;			\
+		sm4e_sve	b3.s, z30.s;			\
+		sm4e_sve	b0.s, z31.s;			\
+		sm4e_sve	b1.s, z31.s;			\
+		sm4e_sve	b2.s, z31.s;			\
+		sm4e_sve	b3.s, z31.s;			\
+		tbl		b0.b, {b0.b}, RSWAP128.b;	\
+		tbl		b1.b, {b1.b}, RSWAP128.b;	\
+		tbl		b2.b, {b2.b}, RSWAP128.b;	\
+		tbl		b3.b, {b3.b}, RSWAP128.b;	\
+		revb		b0.s, p0/m, b0.s;		\
+		revb		b1.s, p0/m, b1.s;		\
+		revb		b2.s, p0/m, b2.s;		\
+		revb		b3.s, p0/m, b3.s;
+
+#define SM4_SVE_CE_CRYPT_BLK8(b0, b1, b2, b3, b4, b5, b6, b7)	\
+		revb		b0.s, p0/m, b0.s;		\
+		revb		b1.s, p0/m, b1.s;		\
+		revb		b2.s, p0/m, b2.s;		\
+		revb		b3.s, p0/m, b3.s;		\
+		revb		b4.s, p0/m, b4.s;		\
+		revb		b5.s, p0/m, b5.s;		\
+		revb		b6.s, p0/m, b6.s;		\
+		revb		b7.s, p0/m, b7.s;		\
+		sm4e_sve	b0.s, z24.s;			\
+		sm4e_sve	b1.s, z24.s;			\
+		sm4e_sve	b2.s, z24.s;			\
+		sm4e_sve	b3.s, z24.s;			\
+		sm4e_sve	b4.s, z24.s;			\
+		sm4e_sve	b5.s, z24.s;			\
+		sm4e_sve	b6.s, z24.s;			\
+		sm4e_sve	b7.s, z24.s;			\
+		sm4e_sve	b0.s, z25.s;			\
+		sm4e_sve	b1.s, z25.s;			\
+		sm4e_sve	b2.s, z25.s;			\
+		sm4e_sve	b3.s, z25.s;			\
+		sm4e_sve	b4.s, z25.s;			\
+		sm4e_sve	b5.s, z25.s;			\
+		sm4e_sve	b6.s, z25.s;			\
+		sm4e_sve	b7.s, z25.s;			\
+		sm4e_sve	b0.s, z26.s;			\
+		sm4e_sve	b1.s, z26.s;			\
+		sm4e_sve	b2.s, z26.s;			\
+		sm4e_sve	b3.s, z26.s;			\
+		sm4e_sve	b4.s, z26.s;			\
+		sm4e_sve	b5.s, z26.s;			\
+		sm4e_sve	b6.s, z26.s;			\
+		sm4e_sve	b7.s, z26.s;			\
+		sm4e_sve	b0.s, z27.s;			\
+		sm4e_sve	b1.s, z27.s;			\
+		sm4e_sve	b2.s, z27.s;			\
+		sm4e_sve	b3.s, z27.s;			\
+		sm4e_sve	b4.s, z27.s;			\
+		sm4e_sve	b5.s, z27.s;			\
+		sm4e_sve	b6.s, z27.s;			\
+		sm4e_sve	b7.s, z27.s;			\
+		sm4e_sve	b0.s, z28.s;			\
+		sm4e_sve	b1.s, z28.s;			\
+		sm4e_sve	b2.s, z28.s;			\
+		sm4e_sve	b3.s, z28.s;			\
+		sm4e_sve	b4.s, z28.s;			\
+		sm4e_sve	b5.s, z28.s;			\
+		sm4e_sve	b6.s, z28.s;			\
+		sm4e_sve	b7.s, z28.s;			\
+		sm4e_sve	b0.s, z29.s;			\
+		sm4e_sve	b1.s, z29.s;			\
+		sm4e_sve	b2.s, z29.s;			\
+		sm4e_sve	b3.s, z29.s;			\
+		sm4e_sve	b4.s, z29.s;			\
+		sm4e_sve	b5.s, z29.s;			\
+		sm4e_sve	b6.s, z29.s;			\
+		sm4e_sve	b7.s, z29.s;			\
+		sm4e_sve	b0.s, z30.s;			\
+		sm4e_sve	b1.s, z30.s;			\
+		sm4e_sve	b2.s, z30.s;			\
+		sm4e_sve	b3.s, z30.s;			\
+		sm4e_sve	b4.s, z30.s;			\
+		sm4e_sve	b5.s, z30.s;			\
+		sm4e_sve	b6.s, z30.s;			\
+		sm4e_sve	b7.s, z30.s;			\
+		sm4e_sve	b0.s, z31.s;			\
+		sm4e_sve	b1.s, z31.s;			\
+		sm4e_sve	b2.s, z31.s;			\
+		sm4e_sve	b3.s, z31.s;			\
+		sm4e_sve	b4.s, z31.s;			\
+		sm4e_sve	b5.s, z31.s;			\
+		sm4e_sve	b6.s, z31.s;			\
+		sm4e_sve	b7.s, z31.s;			\
+		tbl		b0.b, {b0.b}, RSWAP128.b;	\
+		tbl		b1.b, {b1.b}, RSWAP128.b;	\
+		tbl		b2.b, {b2.b}, RSWAP128.b;	\
+		tbl		b3.b, {b3.b}, RSWAP128.b;	\
+		tbl		b4.b, {b4.b}, RSWAP128.b;	\
+		tbl		b5.b, {b5.b}, RSWAP128.b;	\
+		tbl		b6.b, {b6.b}, RSWAP128.b;	\
+		tbl		b7.b, {b7.b}, RSWAP128.b;	\
+		revb		b0.s, p0/m, b0.s;		\
+		revb		b1.s, p0/m, b1.s;		\
+		revb		b2.s, p0/m, b2.s;		\
+		revb		b3.s, p0/m, b3.s;		\
+		revb		b4.s, p0/m, b4.s;		\
+		revb		b5.s, p0/m, b5.s;		\
+		revb		b6.s, p0/m, b6.s;		\
+		revb		b7.s, p0/m, b7.s;
+
+#define SM4_CE_CRYPT_BLK(b0)					\
+		rev32		b0.16b, b0.16b;			\
+		sm4e		b0.4s, v24.4s;			\
+		sm4e		b0.4s, v25.4s;			\
+		sm4e		b0.4s, v26.4s;			\
+		sm4e		b0.4s, v27.4s;			\
+		sm4e		b0.4s, v28.4s;			\
+		sm4e		b0.4s, v29.4s;			\
+		sm4e		b0.4s, v30.4s;			\
+		sm4e		b0.4s, v31.4s;			\
+		rev64		b0.4s, b0.4s;			\
+		ext		b0.16b, b0.16b, b0.16b, #8;	\
+		rev32		b0.16b, b0.16b;
+
+#define inc_le128(zctr)						\
+		mov		RCTRv.d[1], x8;			\
+		mov		RCTRv.d[0], x7;			\
+		mov		zctr.d, RLE128_INC.d;		\
+		dup		RCTR.q, RCTR.q[0];		\
+		adds		x8, x8, x5, LSR #4;		\
+		adclt		zctr.d, RCTR.d, RZERO.d;	\
+		adclt		RCTR.d, zctr.d, RZERO.d;	\
+		adc		x7, x7, xzr;			\
+		trn1		zctr.d, RCTR.d, zctr.d;		\
+		revb		zctr.d, p0/m, zctr.d;
+
+#define inc_le128_4x(zctr0, zctr1, zctr2, zctr3)		\
+		mov		v8.d[1], x8;			\
+		mov		v8.d[0], x7;			\
+		adds		x8, x8, x5, LSR #4;		\
+		mov		zctr0.d, RLE128_INC.d;		\
+		adc		x7, x7, xzr;			\
+		mov		v9.d[1], x8;			\
+		mov		v9.d[0], x7;			\
+		adds		x8, x8, x5, LSR #4;		\
+		mov		zctr1.d, RLE128_INC.d;		\
+		adc		x7, x7, xzr;			\
+		mov		v10.d[1], x8;			\
+		mov		v10.d[0], x7;			\
+		adds		x8, x8, x5, LSR #4;		\
+		mov		zctr2.d, RLE128_INC.d;		\
+		adc		x7, x7, xzr;			\
+		mov		v11.d[1], x8;			\
+		mov		v11.d[0], x7;			\
+		adds		x8, x8, x5, LSR #4;		\
+		mov		zctr3.d, RLE128_INC.d;		\
+		adc		x7, x7, xzr;			\
+		dup		z8.q, z8.q[0];			\
+		dup		z9.q, z9.q[0];			\
+		dup		z10.q, z10.q[0];		\
+		dup		z11.q, z11.q[0];		\
+		adclt		zctr0.d, z8.d, RZERO.d;		\
+		adclt		zctr1.d, z9.d, RZERO.d;		\
+		adclt		zctr2.d, z10.d, RZERO.d;	\
+		adclt		zctr3.d, z11.d, RZERO.d;	\
+		adclt		z8.d, zctr0.d, RZERO.d;		\
+		adclt		z9.d, zctr1.d, RZERO.d;		\
+		adclt		z10.d, zctr2.d, RZERO.d;	\
+		adclt		z11.d, zctr3.d, RZERO.d;	\
+		trn1		zctr0.d, z8.d, zctr0.d;		\
+		trn1		zctr1.d, z9.d, zctr1.d;		\
+		trn1		zctr2.d, z10.d, zctr2.d;	\
+		trn1		zctr3.d, z11.d, zctr3.d;	\
+		revb		zctr0.d, p0/m, zctr0.d;		\
+		revb		zctr1.d, p0/m, zctr1.d;		\
+		revb		zctr2.d, p0/m, zctr2.d;		\
+		revb		zctr3.d, p0/m, zctr3.d;
+
+#define inc_le128_8x(zctr0, zctr1, zctr2, zctr3,		\
+		     zctr4, zctr5, zctr6, zctr7)		\
+		mov		v8.d[1], x8;			\
+		mov		v8.d[0], x7;			\
+		adds		x8, x8, x5, LSR #4;		\
+		mov		zctr0.d, RLE128_INC.d;		\
+		adc		x7, x7, xzr;			\
+		mov		v9.d[1], x8;			\
+		mov		v9.d[0], x7;			\
+		adds		x8, x8, x5, LSR #4;		\
+		mov		zctr1.d, RLE128_INC.d;		\
+		adc		x7, x7, xzr;			\
+		mov		v10.d[1], x8;			\
+		mov		v10.d[0], x7;			\
+		adds		x8, x8, x5, LSR #4;		\
+		mov		zctr2.d, RLE128_INC.d;		\
+		adc		x7, x7, xzr;			\
+		mov		v11.d[1], x8;			\
+		mov		v11.d[0], x7;			\
+		adds		x8, x8, x5, LSR #4;		\
+		mov		zctr3.d, RLE128_INC.d;		\
+		adc		x7, x7, xzr;			\
+		mov		v12.d[1], x8;			\
+		mov		v12.d[0], x7;			\
+		adds		x8, x8, x5, LSR #4;		\
+		mov		zctr4.d, RLE128_INC.d;		\
+		adc		x7, x7, xzr;			\
+		mov		v13.d[1], x8;			\
+		mov		v13.d[0], x7;			\
+		adds		x8, x8, x5, LSR #4;		\
+		mov		zctr5.d, RLE128_INC.d;		\
+		adc		x7, x7, xzr;			\
+		mov		v14.d[1], x8;			\
+		mov		v14.d[0], x7;			\
+		adds		x8, x8, x5, LSR #4;		\
+		mov		zctr6.d, RLE128_INC.d;		\
+		adc		x7, x7, xzr;			\
+		mov		v15.d[1], x8;			\
+		mov		v15.d[0], x7;			\
+		adds		x8, x8, x5, LSR #4;		\
+		mov		zctr7.d, RLE128_INC.d;		\
+		adc		x7, x7, xzr;			\
+		dup		z8.q, z8.q[0];			\
+		dup		z9.q, z9.q[0];			\
+		dup		z10.q, z10.q[0];		\
+		dup		z11.q, z11.q[0];		\
+		dup		z12.q, z12.q[0];		\
+		dup		z13.q, z13.q[0];		\
+		dup		z14.q, z14.q[0];		\
+		dup		z15.q, z15.q[0];		\
+		adclt		zctr0.d, z8.d, RZERO.d;		\
+		adclt		zctr1.d, z9.d, RZERO.d;		\
+		adclt		zctr2.d, z10.d, RZERO.d;	\
+		adclt		zctr3.d, z11.d, RZERO.d;	\
+		adclt		zctr4.d, z12.d, RZERO.d;	\
+		adclt		zctr5.d, z13.d, RZERO.d;	\
+		adclt		zctr6.d, z14.d, RZERO.d;	\
+		adclt		zctr7.d, z15.d, RZERO.d;	\
+		adclt		z8.d, zctr0.d, RZERO.d;		\
+		adclt		z9.d, zctr1.d, RZERO.d;		\
+		adclt		z10.d, zctr2.d, RZERO.d;	\
+		adclt		z11.d, zctr3.d, RZERO.d;	\
+		adclt		z12.d, zctr4.d, RZERO.d;	\
+		adclt		z13.d, zctr5.d, RZERO.d;	\
+		adclt		z14.d, zctr6.d, RZERO.d;	\
+		adclt		z15.d, zctr7.d, RZERO.d;	\
+		trn1		zctr0.d, z8.d, zctr0.d;		\
+		trn1		zctr1.d, z9.d, zctr1.d;		\
+		trn1		zctr2.d, z10.d, zctr2.d;	\
+		trn1		zctr3.d, z11.d, zctr3.d;	\
+		trn1		zctr4.d, z12.d, zctr4.d;	\
+		trn1		zctr5.d, z13.d, zctr5.d;	\
+		trn1		zctr6.d, z14.d, zctr6.d;	\
+		trn1		zctr7.d, z15.d, zctr7.d;	\
+		revb		zctr0.d, p0/m, zctr0.d;		\
+		revb		zctr1.d, p0/m, zctr1.d;		\
+		revb		zctr2.d, p0/m, zctr2.d;		\
+		revb		zctr3.d, p0/m, zctr3.d;		\
+		revb		zctr4.d, p0/m, zctr4.d;		\
+		revb		zctr5.d, p0/m, zctr5.d;		\
+		revb		zctr6.d, p0/m, zctr6.d;		\
+		revb		zctr7.d, p0/m, zctr7.d;
+
+
+.align 3
+SYM_FUNC_START(sm4_sve_ce_crypt)
+	/* input:
+	 *   x0: round key array, CTX
+	 *   x1: dst
+	 *   x2: src
+	 *   w3: nblocks
+	 */
+	uxtw		x3, w3
+	SM4_PREPARE(x0)
+
+.Lcrypt_loop_8x:
+	sub		x3, x3, x5, LSR #1		/* x3 -= 8 * (blocks per vector) */
+	tbnz		x3, #63, .Lcrypt_4x
+
+	ld1b		{z0.b}, p0/z, [x2]
+	ld1b		{z1.b}, p0/z, [x2, #1, MUL VL]
+	ld1b		{z2.b}, p0/z, [x2, #2, MUL VL]
+	ld1b		{z3.b}, p0/z, [x2, #3, MUL VL]
+	ld1b		{z4.b}, p0/z, [x2, #4, MUL VL]
+	ld1b		{z5.b}, p0/z, [x2, #5, MUL VL]
+	ld1b		{z6.b}, p0/z, [x2, #6, MUL VL]
+	ld1b		{z7.b}, p0/z, [x2, #7, MUL VL]
+
+	SM4_SVE_CE_CRYPT_BLK8(z0, z1, z2, z3, z4, z5, z6, z7)
+
+	st1b		{z0.b}, p0, [x1]
+	st1b		{z1.b}, p0, [x1, #1, MUL VL]
+	st1b		{z2.b}, p0, [x1, #2, MUL VL]
+	st1b		{z3.b}, p0, [x1, #3, MUL VL]
+	st1b		{z4.b}, p0, [x1, #4, MUL VL]
+	st1b		{z5.b}, p0, [x1, #5, MUL VL]
+	st1b		{z6.b}, p0, [x1, #6, MUL VL]
+	st1b		{z7.b}, p0, [x1, #7, MUL VL]
+
+	addvl		x2, x2, #8
+	addvl		x1, x1, #8
+
+	cbz		x3, .Lcrypt_end
+	b		.Lcrypt_loop_8x
+
+.Lcrypt_4x:
+	add		x3, x3, x5, LSR #1
+	cmp		x3, x5, LSR #2
+	blt		.Lcrypt_loop_1x
+
+	sub		x3, x3, x5, LSR #2		/* x3 -= 4 * (blocks per vector) */
+
+	ld1b		{z0.b}, p0/z, [x2]
+	ld1b		{z1.b}, p0/z, [x2, #1, MUL VL]
+	ld1b		{z2.b}, p0/z, [x2, #2, MUL VL]
+	ld1b		{z3.b}, p0/z, [x2, #3, MUL VL]
+
+	SM4_SVE_CE_CRYPT_BLK4(z0, z1, z2, z3)
+
+	st1b		{z0.b}, p0, [x1]
+	st1b		{z1.b}, p0, [x1, #1, MUL VL]
+	st1b		{z2.b}, p0, [x1, #2, MUL VL]
+	st1b		{z3.b}, p0, [x1, #3, MUL VL]
+
+	addvl		x2, x2, #4
+	addvl		x1, x1, #4
+
+	cbz		x3, .Lcrypt_end
+
+.Lcrypt_loop_1x:
+	cmp		x3, x5, LSR #4
+	blt		.Lcrypt_ce_loop_1x
+
+	sub		x3, x3, x5, LSR #4		/* x3 -= blocks per vector */
+
+	ld1b		{z0.b}, p0/z, [x2]
+
+	SM4_SVE_CE_CRYPT_BLK(z0)
+
+	st1b		{z0.b}, p0, [x1]
+
+	addvl		x2, x2, #1
+	addvl		x1, x1, #1
+
+	cbz		x3, .Lcrypt_end
+	b		.Lcrypt_loop_1x
+
+.Lcrypt_ce_loop_1x:
+	sub		x3, x3, #1
+
+	ld1		{v0.16b}, [x2], #16
+	SM4_CE_CRYPT_BLK(v0)
+	st1		{v0.16b}, [x1], #16
+
+	cbnz		x3, .Lcrypt_ce_loop_1x
+
+.Lcrypt_end:
+	ret
+SYM_FUNC_END(sm4_sve_ce_crypt)
+
+.align 3
+SYM_FUNC_START(sm4_sve_ce_cbc_dec)
+	/* input:
+	 *   x0: round key array, CTX
+	 *   x1: dst
+	 *   x2: src
+	 *   x3: iv (big endian, 128 bit)
+	 *   w4: nblocks
+	 */
+	uxtw		x4, w4
+	SM4_PREPARE(x0)
+
+	ld1		{RIVv.16b}, [x3]
+	ext		RIV.b, RIV.b, RIV.b, #16
+
+.Lcbc_dec_loop_8x:
+	sub		x4, x4, x5, LSR #1		/* x4 -= 8 * (blocks per vector) */
+	tbnz		x4, #63, .Lcbc_dec_4x
+
+	ld1b		{z15.b}, p0/z, [x2]
+	ld1b		{z14.b}, p0/z, [x2, #1, MUL VL]
+	ld1b		{z13.b}, p0/z, [x2, #2, MUL VL]
+	ld1b		{z12.b}, p0/z, [x2, #3, MUL VL]
+	ld1b		{z11.b}, p0/z, [x2, #4, MUL VL]
+	ld1b		{z10.b}, p0/z, [x2, #5, MUL VL]
+	ld1b		{z9.b}, p0/z, [x2, #6, MUL VL]
+	ld1b		{z8.b}, p0/z, [x2, #7, MUL VL]
+	rev		z0.b, z15.b
+	rev		z1.b, z14.b
+	rev		z2.b, z13.b
+	rev		z3.b, z12.b
+	rev		z4.b, z11.b
+	rev		z5.b, z10.b
+	rev		z6.b, z9.b
+	rev		z7.b, z8.b
+	rev		RTMP0.b, RIV.b
+	ext		z7.b, z7.b, z6.b, #16
+	ext		z6.b, z6.b, z5.b, #16
+	ext		z5.b, z5.b, z4.b, #16
+	ext		z4.b, z4.b, z3.b, #16
+	ext		z3.b, z3.b, z2.b, #16
+	ext		z2.b, z2.b, z1.b, #16
+	ext		z1.b, z1.b, z0.b, #16
+	ext		z0.b, z0.b, RTMP0.b, #16
+	rev		z7.b, z7.b
+	rev		z6.b, z6.b
+	rev		z5.b, z5.b
+	rev		z4.b, z4.b
+	rev		z3.b, z3.b
+	rev		z2.b, z2.b
+	rev		z1.b, z1.b
+	rev		z0.b, z0.b
+	mov		RIV.d, z8.d
+
+	SM4_SVE_CE_CRYPT_BLK8(z15, z14, z13, z12, z11, z10, z9, z8)
+
+	eor		z0.d, z0.d, z15.d
+	eor		z1.d, z1.d, z14.d
+	eor		z2.d, z2.d, z13.d
+	eor		z3.d, z3.d, z12.d
+	eor		z4.d, z4.d, z11.d
+	eor		z5.d, z5.d, z10.d
+	eor		z6.d, z6.d, z9.d
+	eor		z7.d, z7.d, z8.d
+	st1b		{z0.b}, p0, [x1]
+	st1b		{z1.b}, p0, [x1, #1, MUL VL]
+	st1b		{z2.b}, p0, [x1, #2, MUL VL]
+	st1b		{z3.b}, p0, [x1, #3, MUL VL]
+	st1b		{z4.b}, p0, [x1, #4, MUL VL]
+	st1b		{z5.b}, p0, [x1, #5, MUL VL]
+	st1b		{z6.b}, p0, [x1, #6, MUL VL]
+	st1b		{z7.b}, p0, [x1, #7, MUL VL]
+
+	addvl		x2, x2, #8
+	addvl		x1, x1, #8
+
+	cbz		x4, .Lcbc_dec_end
+	b		.Lcbc_dec_loop_8x
+
+.Lcbc_dec_4x:
+	add		x4, x4, x5, LSR #1
+	cmp		x4, x5, LSR #2
+	blt		.Lcbc_dec_loop_1x
+
+	sub		x4, x4, x5, LSR #2		/* x4 -= 4 * (blocks per vector) */
+
+	ld1b		{z15.b}, p0/z, [x2]
+	ld1b		{z14.b}, p0/z, [x2, #1, MUL VL]
+	ld1b		{z13.b}, p0/z, [x2, #2, MUL VL]
+	ld1b		{z12.b}, p0/z, [x2, #3, MUL VL]
+	rev		z0.b, z15.b
+	rev		z1.b, z14.b
+	rev		z2.b, z13.b
+	rev		z3.b, z12.b
+	rev		RTMP0.b, RIV.b
+	ext		z3.b, z3.b, z2.b, #16
+	ext		z2.b, z2.b, z1.b, #16
+	ext		z1.b, z1.b, z0.b, #16
+	ext		z0.b, z0.b, RTMP0.b, #16
+	rev		z3.b, z3.b
+	rev		z2.b, z2.b
+	rev		z1.b, z1.b
+	rev		z0.b, z0.b
+	mov		RIV.d, z12.d
+
+	SM4_SVE_CE_CRYPT_BLK4(z15, z14, z13, z12)
+
+	eor		z0.d, z0.d, z15.d
+	eor		z1.d, z1.d, z14.d
+	eor		z2.d, z2.d, z13.d
+	eor		z3.d, z3.d, z12.d
+	st1b		{z0.b}, p0, [x1]
+	st1b		{z1.b}, p0, [x1, #1, MUL VL]
+	st1b		{z2.b}, p0, [x1, #2, MUL VL]
+	st1b		{z3.b}, p0, [x1, #3, MUL VL]
+
+	addvl		x2, x2, #4
+	addvl		x1, x1, #4
+
+	cbz		x4, .Lcbc_dec_end
+
+.Lcbc_dec_loop_1x:
+	cmp		x4, x5, LSR #4
+	blt		.Lcbc_dec_ce
+
+	sub		x4, x4, x5, LSR #4		/* x4 - VL */
+
+	ld1b		{z15.b}, p0/z, [x2]
+	rev		RTMP0.b, RIV.b
+	rev		z0.b, z15.b
+	ext		z0.b, z0.b, RTMP0.b, #16
+	rev		z0.b, z0.b
+	mov		RIV.d, z15.d
+
+	SM4_SVE_CE_CRYPT_BLK(z15)
+
+	eor		z0.d, z0.d, z15.d
+	st1b		{z0.b}, p0, [x1]
+
+	addvl		x2, x2, #1
+	addvl		x1, x1, #1
+
+	cbz		x4, .Lcbc_dec_end
+	b		.Lcbc_dec_loop_1x
+
+.Lcbc_dec_ce:
+	rev		RIV.s, RIV.s
+	tbl		RIV.b, {RIV.b}, RSWAP128.b
+
+.Lcbc_dec_ce_loop_1x:
+	sub		x4, x4, #1
+
+	ld1		{v15.16b}, [x2], #16
+	mov		v0.16b, RIVv.16b
+	mov		RIVv.16b, v15.16b
+	SM4_CE_CRYPT_BLK(v15)
+	eor		v0.16b, v0.16b, v15.16b
+	st1		{v0.16b}, [x1], #16
+
+	cbnz		x4, .Lcbc_dec_ce_loop_1x
+
+	ext		RIV.b, RIV.b, RIV.b, #16
+
+.Lcbc_dec_end:
+	/* store new IV */
+	rev		RIV.s, RIV.s
+	tbl		RIV.b, {RIV.b}, RSWAP128.b
+	st1		{RIVv.16b}, [x3]
+
+	ret
+SYM_FUNC_END(sm4_sve_ce_cbc_dec)
+
+.align 3
+SYM_FUNC_START(sm4_sve_ce_cfb_dec)
+	/* input:
+	 *   x0: round key array, CTX
+	 *   x1: dst
+	 *   x2: src
+	 *   x3: iv (big endian, 128 bit)
+	 *   w4: nblocks
+	 */
+	uxtw		x4, w4
+	SM4_PREPARE(x0)
+
+	ld1		{RIVv.16b}, [x3]
+	ext		RIV.b, RIV.b, RIV.b, #16
+
+.Lcfb_dec_loop_8x:
+	sub		x4, x4, x5, LSR #1		/* x4 - (8 * VL) */
+	tbnz		x4, #63, .Lcfb_dec_4x
+
+	ld1b		{z15.b}, p0/z, [x2]
+	ld1b		{z14.b}, p0/z, [x2, #1, MUL VL]
+	ld1b		{z13.b}, p0/z, [x2, #2, MUL VL]
+	ld1b		{z12.b}, p0/z, [x2, #3, MUL VL]
+	ld1b		{z11.b}, p0/z, [x2, #4, MUL VL]
+	ld1b		{z10.b}, p0/z, [x2, #5, MUL VL]
+	ld1b		{z9.b}, p0/z, [x2, #6, MUL VL]
+	ld1b		{z8.b}, p0/z, [x2, #7, MUL VL]
+	rev		z0.b, z15.b
+	rev		z1.b, z14.b
+	rev		z2.b, z13.b
+	rev		z3.b, z12.b
+	rev		z4.b, z11.b
+	rev		z5.b, z10.b
+	rev		z6.b, z9.b
+	rev		z7.b, z8.b
+	rev		RTMP0.b, RIV.b
+	ext		z7.b, z7.b, z6.b, #16
+	ext		z6.b, z6.b, z5.b, #16
+	ext		z5.b, z5.b, z4.b, #16
+	ext		z4.b, z4.b, z3.b, #16
+	ext		z3.b, z3.b, z2.b, #16
+	ext		z2.b, z2.b, z1.b, #16
+	ext		z1.b, z1.b, z0.b, #16
+	ext		z0.b, z0.b, RTMP0.b, #16
+	rev		z7.b, z7.b
+	rev		z6.b, z6.b
+	rev		z5.b, z5.b
+	rev		z4.b, z4.b
+	rev		z3.b, z3.b
+	rev		z2.b, z2.b
+	rev		z1.b, z1.b
+	rev		z0.b, z0.b
+	mov		RIV.d, z8.d
+
+	SM4_SVE_CE_CRYPT_BLK8(z0, z1, z2, z3, z4, z5, z6, z7)
+
+	eor		z0.d, z0.d, z15.d
+	eor		z1.d, z1.d, z14.d
+	eor		z2.d, z2.d, z13.d
+	eor		z3.d, z3.d, z12.d
+	eor		z4.d, z4.d, z11.d
+	eor		z5.d, z5.d, z10.d
+	eor		z6.d, z6.d, z9.d
+	eor		z7.d, z7.d, z8.d
+	st1b		{z0.b}, p0, [x1]
+	st1b		{z1.b}, p0, [x1, #1, MUL VL]
+	st1b		{z2.b}, p0, [x1, #2, MUL VL]
+	st1b		{z3.b}, p0, [x1, #3, MUL VL]
+	st1b		{z4.b}, p0, [x1, #4, MUL VL]
+	st1b		{z5.b}, p0, [x1, #5, MUL VL]
+	st1b		{z6.b}, p0, [x1, #6, MUL VL]
+	st1b		{z7.b}, p0, [x1, #7, MUL VL]
+
+	addvl		x2, x2, #8
+	addvl		x1, x1, #8
+
+	cbz		x4, .Lcfb_dec_end
+	b		.Lcfb_dec_loop_8x
+
+.Lcfb_dec_4x:
+	add		x4, x4, x5, LSR #1
+	cmp		x4, x5, LSR #2
+	blt		.Lcfb_dec_loop_1x
+
+	sub		x4, x4, x5, LSR #2		/* x4 - (4 * VL) */
+
+	ld1b		{z15.b}, p0/z, [x2]
+	ld1b		{z14.b}, p0/z, [x2, #1, MUL VL]
+	ld1b		{z13.b}, p0/z, [x2, #2, MUL VL]
+	ld1b		{z12.b}, p0/z, [x2, #3, MUL VL]
+	rev		z0.b, z15.b
+	rev		z1.b, z14.b
+	rev		z2.b, z13.b
+	rev		z3.b, z12.b
+	rev		RTMP0.b, RIV.b
+	ext		z3.b, z3.b, z2.b, #16
+	ext		z2.b, z2.b, z1.b, #16
+	ext		z1.b, z1.b, z0.b, #16
+	ext		z0.b, z0.b, RTMP0.b, #16
+	rev		z3.b, z3.b
+	rev		z2.b, z2.b
+	rev		z1.b, z1.b
+	rev		z0.b, z0.b
+	mov		RIV.d, z12.d
+
+	SM4_SVE_CE_CRYPT_BLK4(z0, z1, z2, z3)
+
+	eor		z0.d, z0.d, z15.d
+	eor		z1.d, z1.d, z14.d
+	eor		z2.d, z2.d, z13.d
+	eor		z3.d, z3.d, z12.d
+	st1b		{z0.b}, p0, [x1]
+	st1b		{z1.b}, p0, [x1, #1, MUL VL]
+	st1b		{z2.b}, p0, [x1, #2, MUL VL]
+	st1b		{z3.b}, p0, [x1, #3, MUL VL]
+
+	addvl		x2, x2, #4
+	addvl		x1, x1, #4
+
+	cbz		x4, .Lcfb_dec_end
+
+.Lcfb_dec_loop_1x:
+	cmp		x4, x5, LSR #4
+	blt		.Lcfb_dec_ce
+
+	sub		x4, x4, x5, LSR #4		/* x4 - VL */
+
+	ld1b		{z15.b}, p0/z, [x2]
+	rev		RTMP0.b, RIV.b
+	rev		z0.b, z15.b
+	ext		z0.b, z0.b, RTMP0.b, #16
+	rev		z0.b, z0.b
+	mov		RIV.d, z15.d
+
+	SM4_SVE_CE_CRYPT_BLK(z0)
+
+	eor		z0.d, z0.d, z15.d
+	st1b		{z0.b}, p0, [x1]
+
+	addvl		x2, x2, #1
+	addvl		x1, x1, #1
+
+	cbz		x4, .Lcfb_dec_end
+	b		.Lcfb_dec_loop_1x
+
+.Lcfb_dec_ce:
+	rev		RIV.s, RIV.s
+	tbl		RIV.b, {RIV.b}, RSWAP128.b
+
+.Lcfb_dec_ce_loop_1x:
+	sub		x4, x4, #1
+
+	ld1		{v15.16b}, [x2], #16
+	mov		v0.16b, RIVv.16b
+	mov		RIVv.16b, v15.16b
+	SM4_CE_CRYPT_BLK(v0)
+	eor		v0.16b, v0.16b, v15.16b
+	st1		{v0.16b}, [x1], #16
+
+	cbnz		x4, .Lcfb_dec_ce_loop_1x
+
+	ext		RIV.b, RIV.b, RIV.b, #16
+
+.Lcfb_dec_end:
+	/* store new IV */
+	rev		RIV.s, RIV.s
+	tbl		RIV.b, {RIV.b}, RSWAP128.b
+	st1		{RIVv.16b}, [x3]
+
+	ret
+SYM_FUNC_END(sm4_sve_ce_cfb_dec)
+
+.align 3
+SYM_FUNC_START(sm4_sve_ce_ctr_crypt)
+	/* input:
+	 *   x0: round key array, CTX
+	 *   x1: dst
+	 *   x2: src
+	 *   x3: ctr (big endian, 128 bit)
+	 *   w4: nblocks
+	 */
+	uxtw		x4, w4
+	SM4_PREPARE(x0)
+
+	dup		RZERO.d, #0
+	adr_l		x6, .Lle128_inc
+	ld1b		{RLE128_INC.b}, p0/z, [x6]
+
+	ldp		x7, x8, [x3]
+	rev		x7, x7
+	rev		x8, x8
+
+.Lctr_loop_8x:
+	sub		x4, x4, x5, LSR #1		/* x4 - (8 * VL) */
+	tbnz		x4, #63, .Lctr_4x
+
+	inc_le128_8x(z0, z1, z2, z3, z4, z5, z6, z7)
+
+	ld1b		{z8.b}, p0/z, [x2]
+	ld1b		{z9.b}, p0/z, [x2, #1, MUL VL]
+	ld1b		{z10.b}, p0/z, [x2, #2, MUL VL]
+	ld1b		{z11.b}, p0/z, [x2, #3, MUL VL]
+	ld1b		{z12.b}, p0/z, [x2, #4, MUL VL]
+	ld1b		{z13.b}, p0/z, [x2, #5, MUL VL]
+	ld1b		{z14.b}, p0/z, [x2, #6, MUL VL]
+	ld1b		{z15.b}, p0/z, [x2, #7, MUL VL]
+
+	SM4_SVE_CE_CRYPT_BLK8(z0, z1, z2, z3, z4, z5, z6, z7)
+
+	eor		z0.d, z0.d, z8.d
+	eor		z1.d, z1.d, z9.d
+	eor		z2.d, z2.d, z10.d
+	eor		z3.d, z3.d, z11.d
+	eor		z4.d, z4.d, z12.d
+	eor		z5.d, z5.d, z13.d
+	eor		z6.d, z6.d, z14.d
+	eor		z7.d, z7.d, z15.d
+
+	st1b		{z0.b}, p0, [x1]
+	st1b		{z1.b}, p0, [x1, #1, MUL VL]
+	st1b		{z2.b}, p0, [x1, #2, MUL VL]
+	st1b		{z3.b}, p0, [x1, #3, MUL VL]
+	st1b		{z4.b}, p0, [x1, #4, MUL VL]
+	st1b		{z5.b}, p0, [x1, #5, MUL VL]
+	st1b		{z6.b}, p0, [x1, #6, MUL VL]
+	st1b		{z7.b}, p0, [x1, #7, MUL VL]
+
+	addvl		x2, x2, #8
+	addvl		x1, x1, #8
+
+	cbz		x4, .Lctr_end
+	b		.Lctr_loop_8x
+
+.Lctr_4x:
+	add		x4, x4, x5, LSR #1
+	cmp		x4, x5, LSR #2
+	blt		.Lctr_loop_1x
+
+	sub		x4, x4, x5, LSR #2		/* x4 - (4 * VL) */
+
+	inc_le128_4x(z0, z1, z2, z3)
+
+	ld1b		{z8.b}, p0/z, [x2]
+	ld1b		{z9.b}, p0/z, [x2, #1, MUL VL]
+	ld1b		{z10.b}, p0/z, [x2, #2, MUL VL]
+	ld1b		{z11.b}, p0/z, [x2, #3, MUL VL]
+
+	SM4_SVE_CE_CRYPT_BLK4(z0, z1, z2, z3)
+
+	eor		z0.d, z0.d, z8.d
+	eor		z1.d, z1.d, z9.d
+	eor		z2.d, z2.d, z10.d
+	eor		z3.d, z3.d, z11.d
+
+	st1b		{z0.b}, p0, [x1]
+	st1b		{z1.b}, p0, [x1, #1, MUL VL]
+	st1b		{z2.b}, p0, [x1, #2, MUL VL]
+	st1b		{z3.b}, p0, [x1, #3, MUL VL]
+
+	addvl		x2, x2, #4
+	addvl		x1, x1, #4
+
+	cbz		x4, .Lctr_end
+
+.Lctr_loop_1x:
+	cmp		x4, x5, LSR #4
+	blt		.Lctr_ce_loop_1x
+
+	sub		x4, x4, x5, LSR #4		/* x4 - VL */
+
+	inc_le128(z0)
+	ld1b		{z8.b}, p0/z, [x2]
+
+	SM4_SVE_CE_CRYPT_BLK(z0)
+
+	eor		z0.d, z0.d, z8.d
+	st1b		{z0.b}, p0, [x1]
+
+	addvl		x2, x2, #1
+	addvl		x1, x1, #1
+
+	cbz		x4, .Lctr_end
+	b		.Lctr_loop_1x
+
+.Lctr_ce_loop_1x:
+	sub		x4, x4, #1
+
+	/* inc_le128 for CE */
+	mov		v0.d[1], x8
+	mov		v0.d[0], x7
+	adds		x8, x8, #1
+	rev64		v0.16b, v0.16b
+	adc		x7, x7, xzr
+
+	ld1		{v8.16b}, [x2], #16
+
+	SM4_CE_CRYPT_BLK(v0)
+
+	eor		v0.16b, v0.16b, v8.16b
+	st1		{v0.16b}, [x1], #16
+
+	cbnz		x4, .Lctr_ce_loop_1x
+
+.Lctr_end:
+	/* store new CTR */
+	rev		x7, x7
+	rev		x8, x8
+	stp		x7, x8, [x3]
+
+	ret
+SYM_FUNC_END(sm4_sve_ce_ctr_crypt)
+
+.align 3
+SYM_FUNC_START(sm4_sve_get_vl)
+	/* VL in bytes */
+	rdvl		x0, #1
+
+	ret
+SYM_FUNC_END(sm4_sve_get_vl)
+
+
+	.section	".rodata", "a"
+	.align 4
+.Lbswap128_mask:
+	.byte		0x0c, 0x0d, 0x0e, 0x0f, 0x08, 0x09, 0x0a, 0x0b
+	.byte		0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03
+	.byte		0x1c, 0x1d, 0x1e, 0x1f, 0x18, 0x19, 0x1a, 0x1b
+	.byte		0x14, 0x15, 0x16, 0x17, 0x10, 0x11, 0x12, 0x13
+	.byte		0x2c, 0x2d, 0x2e, 0x2f, 0x28, 0x29, 0x2a, 0x2b
+	.byte		0x24, 0x25, 0x26, 0x27, 0x20, 0x21, 0x22, 0x23
+	.byte		0x3c, 0x3d, 0x3e, 0x3f, 0x38, 0x39, 0x3a, 0x3b
+	.byte		0x34, 0x35, 0x36, 0x37, 0x30, 0x31, 0x32, 0x33
+	.byte		0x4c, 0x4d, 0x4e, 0x4f, 0x48, 0x49, 0x4a, 0x4b
+	.byte		0x44, 0x45, 0x46, 0x47, 0x40, 0x41, 0x42, 0x43
+	.byte		0x5c, 0x5d, 0x5e, 0x5f, 0x58, 0x59, 0x5a, 0x5b
+	.byte		0x54, 0x55, 0x56, 0x57, 0x50, 0x51, 0x52, 0x53
+	.byte		0x6c, 0x6d, 0x6e, 0x6f, 0x68, 0x69, 0x6a, 0x6b
+	.byte		0x64, 0x65, 0x66, 0x67, 0x60, 0x61, 0x62, 0x63
+	.byte		0x7c, 0x7d, 0x7e, 0x7f, 0x78, 0x79, 0x7a, 0x7b
+	.byte		0x74, 0x75, 0x76, 0x77, 0x70, 0x71, 0x72, 0x73
+	.byte		0x8c, 0x8d, 0x8e, 0x8f, 0x88, 0x89, 0x8a, 0x8b
+	.byte		0x84, 0x85, 0x86, 0x87, 0x80, 0x81, 0x82, 0x83
+	.byte		0x9c, 0x9d, 0x9e, 0x9f, 0x98, 0x99, 0x9a, 0x9b
+	.byte		0x94, 0x95, 0x96, 0x97, 0x90, 0x91, 0x92, 0x93
+	.byte		0xac, 0xad, 0xae, 0xaf, 0xa8, 0xa9, 0xaa, 0xab
+	.byte		0xa4, 0xa5, 0xa6, 0xa7, 0xa0, 0xa1, 0xa2, 0xa3
+	.byte		0xbc, 0xbd, 0xbe, 0xbf, 0xb8, 0xb9, 0xba, 0xbb
+	.byte		0xb4, 0xb5, 0xb6, 0xb7, 0xb0, 0xb1, 0xb2, 0xb3
+	.byte		0xcc, 0xcd, 0xce, 0xcf, 0xc8, 0xc9, 0xca, 0xcb
+	.byte		0xc4, 0xc5, 0xc6, 0xc7, 0xc0, 0xc1, 0xc2, 0xc3
+	.byte		0xdc, 0xdd, 0xde, 0xdf, 0xd8, 0xd9, 0xda, 0xdb
+	.byte		0xd4, 0xd5, 0xd6, 0xd7, 0xd0, 0xd1, 0xd2, 0xd3
+	.byte		0xec, 0xed, 0xee, 0xef, 0xe8, 0xe9, 0xea, 0xeb
+	.byte		0xe4, 0xe5, 0xe6, 0xe7, 0xe0, 0xe1, 0xe2, 0xe3
+	.byte		0xfc, 0xfd, 0xfe, 0xff, 0xf8, 0xf9, 0xfa, 0xfb
+	.byte		0xf4, 0xf5, 0xf6, 0xf7, 0xf0, 0xf1, 0xf2, 0xf3
+
+.Lle128_inc:
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x05, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x06, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x07, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x09, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x0a, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x0b, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x0c, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x0d, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x0e, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x0f, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
diff --git a/arch/arm64/crypto/sm4-sve-ce-glue.c b/arch/arm64/crypto/sm4-sve-ce-glue.c
new file mode 100644
index 000000000000..fc797b72b5f0
--- /dev/null
+++ b/arch/arm64/crypto/sm4-sve-ce-glue.c
@@ -0,0 +1,332 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * SM4 Cipher Algorithm, using ARMv9 Crypto Extensions with SVE2
+ * as specified in
+ * https://tools.ietf.org/id/draft-ribose-cfrg-sm4-10.html
+ *
+ * Copyright (C) 2022, Alibaba Group.
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#include <linux/module.h>
+#include <linux/crypto.h>
+#include <linux/kernel.h>
+#include <linux/cpufeature.h>
+#include <asm/neon.h>
+#include <asm/simd.h>
+#include <crypto/internal/simd.h>
+#include <crypto/internal/skcipher.h>
+#include <crypto/sm4.h>
+#include "sm4-ce.h"
+
+asmlinkage void sm4_sve_ce_crypt(const u32 *rkey, u8 *dst,
+				 const u8 *src, unsigned int nblocks);
+asmlinkage void sm4_sve_ce_cbc_dec(const u32 *rkey_dec, u8 *dst,
+				   const u8 *src, u8 *iv,
+				   unsigned int nblocks);
+asmlinkage void sm4_sve_ce_cfb_dec(const u32 *rkey_enc, u8 *dst,
+				   const u8 *src, u8 *iv,
+				   unsigned int nblocks);
+asmlinkage void sm4_sve_ce_ctr_crypt(const u32 *rkey_enc, u8 *dst,
+				     const u8 *src, u8 *iv,
+				     unsigned int nblocks);
+asmlinkage unsigned int sm4_sve_get_vl(void);
+
+
+static int sm4_setkey(struct crypto_skcipher *tfm, const u8 *key,
+		      unsigned int key_len)
+{
+	struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+	if (key_len != SM4_KEY_SIZE)
+		return -EINVAL;
+
+	kernel_neon_begin();
+	sm4_ce_expand_key(key, ctx->rkey_enc, ctx->rkey_dec,
+			  crypto_sm4_fk, crypto_sm4_ck);
+	kernel_neon_end();
+
+	return 0;
+}
+
+static int ecb_crypt(struct skcipher_request *req, const u32 *rkey)
+{
+	struct skcipher_walk walk;
+	unsigned int nbytes;
+	int err;
+
+	err = skcipher_walk_virt(&walk, req, false);
+
+	while ((nbytes = walk.nbytes) > 0) {
+		const u8 *src = walk.src.virt.addr;
+		u8 *dst = walk.dst.virt.addr;
+		unsigned int nblocks;
+
+		nblocks = nbytes / SM4_BLOCK_SIZE;
+		if (nblocks) {
+			kernel_neon_begin();
+
+			sm4_sve_ce_crypt(rkey, dst, src, nblocks);
+
+			kernel_neon_end();
+		}
+
+		err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE);
+	}
+
+	return err;
+}
+
+static int ecb_encrypt(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+	return ecb_crypt(req, ctx->rkey_enc);
+}
+
+static int ecb_decrypt(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+	return ecb_crypt(req, ctx->rkey_dec);
+}
+
+static int cbc_crypt(struct skcipher_request *req, const u32 *rkey,
+		     void (*sm4_cbc_crypt)(const u32 *rkey, u8 *dst,
+				const u8 *src, u8 *iv, unsigned int nblocks))
+{
+	struct skcipher_walk walk;
+	unsigned int nbytes;
+	int err;
+
+	err = skcipher_walk_virt(&walk, req, false);
+
+	while ((nbytes = walk.nbytes) > 0) {
+		const u8 *src = walk.src.virt.addr;
+		u8 *dst = walk.dst.virt.addr;
+		unsigned int nblocks;
+
+		nblocks = nbytes / SM4_BLOCK_SIZE;
+		if (nblocks) {
+			kernel_neon_begin();
+
+			sm4_cbc_crypt(rkey, dst, src, walk.iv, nblocks);
+
+			kernel_neon_end();
+		}
+
+		err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE);
+	}
+
+	return err;
+}
+
+static int cbc_encrypt(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+	return cbc_crypt(req, ctx->rkey_enc, sm4_ce_cbc_enc);
+}
+
+static int cbc_decrypt(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+	return cbc_crypt(req, ctx->rkey_dec, sm4_sve_ce_cbc_dec);
+}
+
+static int cfb_crypt(struct skcipher_request *req,
+		     void (*sm4_cfb_crypt)(const u32 *rkey, u8 *dst,
+				const u8 *src, u8 *iv, unsigned int nblocks))
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct skcipher_walk walk;
+	unsigned int nbytes;
+	int err;
+
+	err = skcipher_walk_virt(&walk, req, false);
+
+	while ((nbytes = walk.nbytes) > 0) {
+		const u8 *src = walk.src.virt.addr;
+		u8 *dst = walk.dst.virt.addr;
+		unsigned int nblocks;
+
+		nblocks = nbytes / SM4_BLOCK_SIZE;
+		if (nblocks) {
+			kernel_neon_begin();
+
+			sm4_cfb_crypt(ctx->rkey_enc, dst, src,
+				      walk.iv, nblocks);
+
+			kernel_neon_end();
+
+			dst += nblocks * SM4_BLOCK_SIZE;
+			src += nblocks * SM4_BLOCK_SIZE;
+			nbytes -= nblocks * SM4_BLOCK_SIZE;
+		}
+
+		/* tail */
+		if (walk.nbytes == walk.total && nbytes > 0) {
+			u8 keystream[SM4_BLOCK_SIZE];
+
+			sm4_ce_crypt_block(ctx->rkey_enc, keystream, walk.iv);
+			crypto_xor_cpy(dst, src, keystream, nbytes);
+			nbytes = 0;
+		}
+
+		err = skcipher_walk_done(&walk, nbytes);
+	}
+
+	return err;
+}
+
+static int cfb_encrypt(struct skcipher_request *req)
+{
+	return cfb_crypt(req, sm4_ce_cfb_enc);
+}
+
+static int cfb_decrypt(struct skcipher_request *req)
+{
+	return cfb_crypt(req, sm4_sve_ce_cfb_dec);
+}
+
+static int ctr_crypt(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct skcipher_walk walk;
+	unsigned int nbytes;
+	int err;
+
+	err = skcipher_walk_virt(&walk, req, false);
+
+	while ((nbytes = walk.nbytes) > 0) {
+		const u8 *src = walk.src.virt.addr;
+		u8 *dst = walk.dst.virt.addr;
+		unsigned int nblocks;
+
+		nblocks = nbytes / SM4_BLOCK_SIZE;
+		if (nblocks) {
+			kernel_neon_begin();
+
+			sm4_sve_ce_ctr_crypt(ctx->rkey_enc, dst, src,
+					     walk.iv, nblocks);
+
+			kernel_neon_end();
+
+			dst += nblocks * SM4_BLOCK_SIZE;
+			src += nblocks * SM4_BLOCK_SIZE;
+			nbytes -= nblocks * SM4_BLOCK_SIZE;
+		}
+
+		/* tail */
+		if (walk.nbytes == walk.total && nbytes > 0) {
+			u8 keystream[SM4_BLOCK_SIZE];
+
+			sm4_ce_crypt_block(ctx->rkey_enc, keystream, walk.iv);
+			crypto_inc(walk.iv, SM4_BLOCK_SIZE);
+			crypto_xor_cpy(dst, src, keystream, nbytes);
+			nbytes = 0;
+		}
+
+		err = skcipher_walk_done(&walk, nbytes);
+	}
+
+	return err;
+}
+
+static struct skcipher_alg sm4_algs[] = {
+	{
+		.base = {
+			.cra_name		= "ecb(sm4)",
+			.cra_driver_name	= "ecb-sm4-sve-ce",
+			.cra_priority		= 500,
+			.cra_blocksize		= SM4_BLOCK_SIZE,
+			.cra_ctxsize		= sizeof(struct sm4_ctx),
+			.cra_module		= THIS_MODULE,
+		},
+		.min_keysize	= SM4_KEY_SIZE,
+		.max_keysize	= SM4_KEY_SIZE,
+		.setkey		= sm4_setkey,
+		.encrypt	= ecb_encrypt,
+		.decrypt	= ecb_decrypt,
+	}, {
+		.base = {
+			.cra_name		= "cbc(sm4)",
+			.cra_driver_name	= "cbc-sm4-sve-ce",
+			.cra_priority		= 500,
+			.cra_blocksize		= SM4_BLOCK_SIZE,
+			.cra_ctxsize		= sizeof(struct sm4_ctx),
+			.cra_module		= THIS_MODULE,
+		},
+		.min_keysize	= SM4_KEY_SIZE,
+		.max_keysize	= SM4_KEY_SIZE,
+		.ivsize		= SM4_BLOCK_SIZE,
+		.setkey		= sm4_setkey,
+		.encrypt	= cbc_encrypt,
+		.decrypt	= cbc_decrypt,
+	}, {
+		.base = {
+			.cra_name		= "cfb(sm4)",
+			.cra_driver_name	= "cfb-sm4-sve-ce",
+			.cra_priority		= 500,
+			.cra_blocksize		= 1,
+			.cra_ctxsize		= sizeof(struct sm4_ctx),
+			.cra_module		= THIS_MODULE,
+		},
+		.min_keysize	= SM4_KEY_SIZE,
+		.max_keysize	= SM4_KEY_SIZE,
+		.ivsize		= SM4_BLOCK_SIZE,
+		.chunksize	= SM4_BLOCK_SIZE,
+		.setkey		= sm4_setkey,
+		.encrypt	= cfb_encrypt,
+		.decrypt	= cfb_decrypt,
+	}, {
+		.base = {
+			.cra_name		= "ctr(sm4)",
+			.cra_driver_name	= "ctr-sm4-sve-ce",
+			.cra_priority		= 500,
+			.cra_blocksize		= 1,
+			.cra_ctxsize		= sizeof(struct sm4_ctx),
+			.cra_module		= THIS_MODULE,
+		},
+		.min_keysize	= SM4_KEY_SIZE,
+		.max_keysize	= SM4_KEY_SIZE,
+		.ivsize		= SM4_BLOCK_SIZE,
+		.chunksize	= SM4_BLOCK_SIZE,
+		.setkey		= sm4_setkey,
+		.encrypt	= ctr_crypt,
+		.decrypt	= ctr_crypt,
+	}
+};
+
+static int __init sm4_sve_ce_init(void)
+{
+	if (sm4_sve_get_vl() <= 16)
+		return -ENODEV;
+
+	return crypto_register_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs));
+}
+
+static void __exit sm4_sve_ce_exit(void)
+{
+	crypto_unregister_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs));
+}
+
+module_cpu_feature_match(SVESM4, sm4_sve_ce_init);
+module_exit(sm4_sve_ce_exit);
+
+MODULE_DESCRIPTION("SM4 ECB/CBC/CFB/CTR using ARMv9 Crypto Extensions with SVE2");
+MODULE_ALIAS_CRYPTO("sm4-sve-ce");
+MODULE_ALIAS_CRYPTO("sm4");
+MODULE_ALIAS_CRYPTO("ecb(sm4)");
+MODULE_ALIAS_CRYPTO("cbc(sm4)");
+MODULE_ALIAS_CRYPTO("cfb(sm4)");
+MODULE_ALIAS_CRYPTO("ctr(sm4)");
+MODULE_AUTHOR("Tianjia Zhang <tianjia.zhang@linux.alibaba.com>");
+MODULE_LICENSE("GPL v2");
-- 
2.24.3 (Apple Git-128)


* [PATCH 16/16] crypto: arm64/sm4 - add ARMv9 SVE cryptography acceleration implementation
@ 2022-09-26  9:36   ` Tianjia Zhang
  0 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-26  9:36 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, Jussi Kivilinna, Ard Biesheuvel,
	Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
	Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
	linux-stm32

Scalable Vector Extension (SVE) is the next-generation SIMD extension for
arm64. SVE allows flexible vector length implementations with a range of
possible values in CPU implementations. The vector length can vary from a
minimum of 128 bits up to a maximum of 2048 bits, at 128-bit increments.
The SVE design guarantees that the same application can run on different
implementations that support SVE, without the need to recompile the code.
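
As an illustration of the vector-length range described above, the
following C sketch computes how many 16-byte SM4 blocks one SVE vector
can hold. It is not part of the patch: the helper name
sm4_blocks_per_vector() is hypothetical and is used here only to make
the arithmetic concrete.

```c
#include <assert.h>

#define SM4_BLOCK_SIZE	16	/* bytes per SM4 block */

/*
 * Hypothetical helper (not part of the patch): number of whole SM4
 * blocks that fit in one SVE vector of vl_bits bits.
 */
static unsigned int sm4_blocks_per_vector(unsigned int vl_bits)
{
	/* Legal SVE vector lengths are multiples of 128 bits, 128..2048. */
	assert(vl_bits >= 128 && vl_bits <= 2048 && vl_bits % 128 == 0);

	return (vl_bits / 8) / SM4_BLOCK_SIZE;
}
```

With VL = 128 bits a vector holds a single block, the same as a NEON
register, so any gain from wider processing only appears at VL = 256
bits and above.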

SVE was originally introduced by ARMv8, and ARMv9 introduced SVE2 to
expand and improve it. Similar to the Crypto Extension supported by the
NEON instruction set, SVE also supports similar instructions, called
cryptography acceleration instructions, but these are also an optional
instruction set.

This patch uses SM4 cryptography acceleration instructions and SVE2
instructions to optimize the SM4 algorithm for ECB/CBC/CFB/CTR modes.
Since the encryption of CBC/CFB cannot be parallelized, the Crypto
Extension instruction is used.
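
The reason CBC encryption is serial while CBC decryption can be
vectorized may be easier to see in a small C sketch. This is only an
illustration under loud assumptions: toy_encrypt()/toy_decrypt() are a
made-up invertible byte transform standing in for the block cipher (they
are NOT SM4), and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLK	16	/* block size in bytes */

static uint8_t rotl8(uint8_t v, unsigned int n)
{
	return (uint8_t)((v << n) | (v >> (8 - n)));
}

/* Toy stand-in for the block cipher; NOT SM4, illustration only. */
static void toy_encrypt(const uint8_t *key, uint8_t *dst, const uint8_t *src)
{
	for (int j = 0; j < BLK; j++)
		dst[j] = rotl8(src[j] ^ key[j], 3);
}

static void toy_decrypt(const uint8_t *key, uint8_t *dst, const uint8_t *src)
{
	for (int j = 0; j < BLK; j++)
		dst[j] = rotl8(src[j], 5) ^ key[j];	/* rotl by 5 == rotr by 3 */
}

/* C[i] = E(P[i] ^ C[i-1]): block i needs the output of block i-1. */
static void cbc_enc_toy(const uint8_t *key, uint8_t *dst, const uint8_t *src,
			uint8_t *iv, unsigned int nblocks)
{
	uint8_t tmp[BLK];

	for (unsigned int i = 0; i < nblocks; i++) {
		for (int j = 0; j < BLK; j++)
			tmp[j] = src[i * BLK + j] ^ iv[j];
		toy_encrypt(key, dst + i * BLK, tmp);
		memcpy(iv, dst + i * BLK, BLK);	/* chain into next block */
	}
}

/*
 * P[i] = D(C[i]) ^ C[i-1]: every input is already known ciphertext, so
 * the loop iterations are independent and can run side by side.
 */
static void cbc_dec_toy(const uint8_t *key, uint8_t *dst, const uint8_t *src,
			uint8_t *iv, unsigned int nblocks)
{
	uint8_t tmp[BLK];

	assert(dst != src);	/* this sketch is out-of-place only */

	for (unsigned int i = 0; i < nblocks; i++) {
		const uint8_t *prev = i ? src + (i - 1) * BLK : iv;

		toy_decrypt(key, tmp, src + i * BLK);
		for (int j = 0; j < BLK; j++)
			dst[i * BLK + j] = tmp[j] ^ prev[j];
	}
	if (nblocks)
		memcpy(iv, src + (nblocks - 1) * BLK, BLK);	/* new IV */
}
```

In the encryption loop each iteration consumes the ciphertext produced
by the previous one, so there is nothing to spread across vector lanes.
The decryption loop has no such dependency, which is what the SVE code
path exploits.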

Since no test environment with a Vector Length (VL) greater than 128 bits
was found, the performance data was obtained on a machine with a VL of
128 bits. Because this driver is only enabled when the VL is greater than
128 bits, this performance data is for reference only. It can be seen
from the data that there is little difference between the data optimized
by Crypto Extension and SVE (VL=128 bits), and the optimization effect
will be more obvious when VL=256 bits or longer.
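
The enable condition mentioned above can be sketched in C as follows.
This mirrors the check made in sm4_sve_ce_init() in the glue code, but
the function name sm4_sve_ce_vl_usable() and its parameter are
hypothetical; in the patch the vector length in bytes comes from
sm4_sve_get_vl(), which executes rdvl.

```c
#include <errno.h>

/*
 * Hypothetical stand-in for the module init check: vl_bytes is the SVE
 * vector length in bytes. A 16-byte (128-bit) vector holds only one SM4
 * block, so the driver declines to register in that case.
 */
static int sm4_sve_ce_vl_usable(unsigned int vl_bytes)
{
	if (vl_bytes <= 16)
		return -ENODEV;

	return 0;
}
```

Under this check the VL = 128 bits numbers below could only be measured
with the condition bypassed, which is why they are given for reference
only.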

Benchmark on T-Head Yitian-710 2.75 GHz. The data comes from the 218 mode
of tcrypt and is compared with that optimized by Crypto Extension. The
columns are blocks of different lengths. The data is tabulated and the
unit is Mb/s:

sm4-ce      |      16       64      128      256     1024     1420     4096
------------+--------------------------------------------------------------
    ECB enc |  315.18  1162.65  1815.66  2553.50  3692.91  3727.20  4001.93
    ECB dec |  316.06  1172.97  1817.81  2554.66  3692.18  3786.54  4001.93
    CBC enc |  304.82   629.54   768.65   864.72   953.90   963.32   974.06
    CBC dec |  306.05  1142.53  1805.11  2481.67  3522.06  3587.87  3790.99
    CFB enc |  309.48   635.70   774.44   865.85   950.62   952.68   968.24
    CFB dec |  315.98  1170.38  1828.75  2509.72  3543.63  3539.40  3793.25
    CTR enc |  285.83  1036.59  1583.50  2147.26  2933.54  2954.66  3041.14
    CTR dec |  285.29  1037.47  1584.67  2145.51  2934.10  2950.89  3041.62

sm4-sve-ce (VL = 128 bits)
    ECB enc |  310.00  1154.70  1813.26  2579.74  3766.90  3869.45  4100.26
    ECB dec |  315.60  1176.22  1838.06  2593.69  3774.95  3878.42  4098.83
    CBC enc |  303.44   622.65   764.67   861.40   953.18   963.05   973.77
    CBC dec |  302.13  1091.15  1689.10  2267.79  3182.84  3242.68  3408.92
    CFB enc |  296.62   620.41   762.94   858.96   948.18   956.04   967.67
    CFB dec |  291.23  1065.50  1637.33  2228.12  3158.52  3213.35  3403.83
    CTR enc |  272.27   959.35  1466.34  1934.24  2562.80  2595.87  2695.15
    CTR dec |  273.40   963.65  1471.83  1938.97  2563.12  2597.25  2694.54

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
 arch/arm64/crypto/Kconfig           |   19 +
 arch/arm64/crypto/Makefile          |    3 +
 arch/arm64/crypto/sm4-sve-ce-core.S | 1028 +++++++++++++++++++++++++++
 arch/arm64/crypto/sm4-sve-ce-glue.c |  332 +++++++++
 4 files changed, 1382 insertions(+)
 create mode 100644 arch/arm64/crypto/sm4-sve-ce-core.S
 create mode 100644 arch/arm64/crypto/sm4-sve-ce-glue.c

diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index 6793d5bc3ee5..bbb5a7a08af5 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -249,6 +249,25 @@ config CRYPTO_SM4_ARM64_CE_BLK
 	  - ARMv8 Crypto Extensions
 	  - NEON (Advanced SIMD) extensions
 
+config CRYPTO_SM4_ARM64_SVE_CE_BLK
+	tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR (ARMv9 cryptography acceleration with SVE2)"
+	depends on KERNEL_MODE_NEON
+	select CRYPTO_SKCIPHER
+	select CRYPTO_SM4
+	select CRYPTO_SM4_ARM64_CE_BLK
+	help
+	  Length-preserving ciphers: SM4 cipher algorithms (OSCCA GB/T 32907-2016)
+	  with block cipher modes:
+	  - ECB (Electronic Codebook) mode (NIST SP800-38A)
+	  - CBC (Cipher Block Chaining) mode (NIST SP800-38A)
+	  - CFB (Cipher Feedback) mode (NIST SP800-38A)
+	  - CTR (Counter) mode (NIST SP800-38A)
+
+	  Architecture: arm64 using:
+	  - ARMv8 Crypto Extensions
+	  - ARMv9 cryptography acceleration with SVE2
+	  - NEON (Advanced SIMD) extensions
+
 config CRYPTO_SM4_ARM64_NEON_BLK
 	tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR (NEON)"
 	depends on KERNEL_MODE_NEON
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index 4818e204c2ac..355dd9053434 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -38,6 +38,9 @@ sm4-ce-gcm-y := sm4-ce-gcm-glue.o sm4-ce-gcm-core.o
 obj-$(CONFIG_CRYPTO_SM4_ARM64_NEON_BLK) += sm4-neon.o
 sm4-neon-y := sm4-neon-glue.o sm4-neon-core.o
 
+obj-$(CONFIG_CRYPTO_SM4_ARM64_SVE_CE_BLK) += sm4-sve-ce.o
+sm4-sve-ce-y := sm4-sve-ce-glue.o sm4-sve-ce-core.o
+
 obj-$(CONFIG_CRYPTO_GHASH_ARM64_CE) += ghash-ce.o
 ghash-ce-y := ghash-ce-glue.o ghash-ce-core.o
 
diff --git a/arch/arm64/crypto/sm4-sve-ce-core.S b/arch/arm64/crypto/sm4-sve-ce-core.S
new file mode 100644
index 000000000000..caecbdf2536c
--- /dev/null
+++ b/arch/arm64/crypto/sm4-sve-ce-core.S
@@ -0,0 +1,1028 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * SM4 Cipher Algorithm for ARMv9 Crypto Extensions with SVE2
+ * as specified in
+ * https://tools.ietf.org/id/draft-ribose-cfrg-sm4-10.html
+ *
+ * Copyright (C) 2022, Alibaba Group.
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+
+.arch	armv8-a+crypto+sve+sve2
+
+.irp b, 0, 15, 24, 25, 26, 27, 28, 29, 30, 31
+	.set .Lv\b\().4s, \b
+.endr
+
+.irp b, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, \
+		16, 24, 25, 26, 27, 28, 29, 30, 31
+	.set .Lz\b\().s, \b
+.endr
+
+.macro sm4e, vd, vn
+	.inst 0xcec08400 | (.L\vn << 5) | .L\vd
+.endm
+
+.macro sm4e_sve, zd, zn
+	.inst 0x4523e000 | (.L\zn << 5) | .L\zd
+.endm
+
+
+/* Register macros */
+
+#define RCTR        z16
+#define RCTRv       v16
+#define RIV         z16
+#define RIVv        v16
+#define RSWAP128    z17
+#define RZERO       z18
+#define RLE128_INC  z19
+
+#define RTMP0       z20
+#define RTMP0v      v20
+#define RTMP1       z21
+#define RTMP2       z22
+#define RTMP3       z23
+
+
+/* Helper macros. */
+
+#define SM4_PREPARE(ptr)					\
+		adr_l		x7, .Lbswap128_mask;		\
+		ptrue		p0.b, ALL;			\
+		rdvl		x5, #1;				\
+		ld1b		{RSWAP128.b}, p0/z, [x7];	\
+								\
+		ld1		{v24.16b-v27.16b}, [ptr], #64;	\
+		ld1		{v28.16b-v31.16b}, [ptr];	\
+		dup		z24.q, z24.q[0];		\
+		dup		z25.q, z25.q[0];		\
+		dup		z26.q, z26.q[0];		\
+		dup		z27.q, z27.q[0];		\
+		dup		z28.q, z28.q[0];		\
+		dup		z29.q, z29.q[0];		\
+		dup		z30.q, z30.q[0];		\
+		dup		z31.q, z31.q[0];
+
+#define SM4_SVE_CE_CRYPT_BLK(b0)				\
+		revb		b0.s, p0/m, b0.s;		\
+		sm4e_sve	b0.s, z24.s;			\
+		sm4e_sve	b0.s, z25.s;			\
+		sm4e_sve	b0.s, z26.s;			\
+		sm4e_sve	b0.s, z27.s;			\
+		sm4e_sve	b0.s, z28.s;			\
+		sm4e_sve	b0.s, z29.s;			\
+		sm4e_sve	b0.s, z30.s;			\
+		sm4e_sve	b0.s, z31.s;			\
+		tbl		b0.b, {b0.b}, RSWAP128.b;	\
+		revb		b0.s, p0/m, b0.s;
+
+#define SM4_SVE_CE_CRYPT_BLK4(b0, b1, b2, b3)			\
+		revb		b0.s, p0/m, b0.s;		\
+		revb		b1.s, p0/m, b1.s;		\
+		revb		b2.s, p0/m, b2.s;		\
+		revb		b3.s, p0/m, b3.s;		\
+		sm4e_sve	b0.s, z24.s;			\
+		sm4e_sve	b1.s, z24.s;			\
+		sm4e_sve	b2.s, z24.s;			\
+		sm4e_sve	b3.s, z24.s;			\
+		sm4e_sve	b0.s, z25.s;			\
+		sm4e_sve	b1.s, z25.s;			\
+		sm4e_sve	b2.s, z25.s;			\
+		sm4e_sve	b3.s, z25.s;			\
+		sm4e_sve	b0.s, z26.s;			\
+		sm4e_sve	b1.s, z26.s;			\
+		sm4e_sve	b2.s, z26.s;			\
+		sm4e_sve	b3.s, z26.s;			\
+		sm4e_sve	b0.s, z27.s;			\
+		sm4e_sve	b1.s, z27.s;			\
+		sm4e_sve	b2.s, z27.s;			\
+		sm4e_sve	b3.s, z27.s;			\
+		sm4e_sve	b0.s, z28.s;			\
+		sm4e_sve	b1.s, z28.s;			\
+		sm4e_sve	b2.s, z28.s;			\
+		sm4e_sve	b3.s, z28.s;			\
+		sm4e_sve	b0.s, z29.s;			\
+		sm4e_sve	b1.s, z29.s;			\
+		sm4e_sve	b2.s, z29.s;			\
+		sm4e_sve	b3.s, z29.s;			\
+		sm4e_sve	b0.s, z30.s;			\
+		sm4e_sve	b1.s, z30.s;			\
+		sm4e_sve	b2.s, z30.s;			\
+		sm4e_sve	b3.s, z30.s;			\
+		sm4e_sve	b0.s, z31.s;			\
+		sm4e_sve	b1.s, z31.s;			\
+		sm4e_sve	b2.s, z31.s;			\
+		sm4e_sve	b3.s, z31.s;			\
+		tbl		b0.b, {b0.b}, RSWAP128.b;	\
+		tbl		b1.b, {b1.b}, RSWAP128.b;	\
+		tbl		b2.b, {b2.b}, RSWAP128.b;	\
+		tbl		b3.b, {b3.b}, RSWAP128.b;	\
+		revb		b0.s, p0/m, b0.s;		\
+		revb		b1.s, p0/m, b1.s;		\
+		revb		b2.s, p0/m, b2.s;		\
+		revb		b3.s, p0/m, b3.s;
+
+#define SM4_SVE_CE_CRYPT_BLK8(b0, b1, b2, b3, b4, b5, b6, b7)	\
+		revb		b0.s, p0/m, b0.s;		\
+		revb		b1.s, p0/m, b1.s;		\
+		revb		b2.s, p0/m, b2.s;		\
+		revb		b3.s, p0/m, b3.s;		\
+		revb		b4.s, p0/m, b4.s;		\
+		revb		b5.s, p0/m, b5.s;		\
+		revb		b6.s, p0/m, b6.s;		\
+		revb		b7.s, p0/m, b7.s;		\
+		sm4e_sve	b0.s, z24.s;			\
+		sm4e_sve	b1.s, z24.s;			\
+		sm4e_sve	b2.s, z24.s;			\
+		sm4e_sve	b3.s, z24.s;			\
+		sm4e_sve	b4.s, z24.s;			\
+		sm4e_sve	b5.s, z24.s;			\
+		sm4e_sve	b6.s, z24.s;			\
+		sm4e_sve	b7.s, z24.s;			\
+		sm4e_sve	b0.s, z25.s;			\
+		sm4e_sve	b1.s, z25.s;			\
+		sm4e_sve	b2.s, z25.s;			\
+		sm4e_sve	b3.s, z25.s;			\
+		sm4e_sve	b4.s, z25.s;			\
+		sm4e_sve	b5.s, z25.s;			\
+		sm4e_sve	b6.s, z25.s;			\
+		sm4e_sve	b7.s, z25.s;			\
+		sm4e_sve	b0.s, z26.s;			\
+		sm4e_sve	b1.s, z26.s;			\
+		sm4e_sve	b2.s, z26.s;			\
+		sm4e_sve	b3.s, z26.s;			\
+		sm4e_sve	b4.s, z26.s;			\
+		sm4e_sve	b5.s, z26.s;			\
+		sm4e_sve	b6.s, z26.s;			\
+		sm4e_sve	b7.s, z26.s;			\
+		sm4e_sve	b0.s, z27.s;			\
+		sm4e_sve	b1.s, z27.s;			\
+		sm4e_sve	b2.s, z27.s;			\
+		sm4e_sve	b3.s, z27.s;			\
+		sm4e_sve	b4.s, z27.s;			\
+		sm4e_sve	b5.s, z27.s;			\
+		sm4e_sve	b6.s, z27.s;			\
+		sm4e_sve	b7.s, z27.s;			\
+		sm4e_sve	b0.s, z28.s;			\
+		sm4e_sve	b1.s, z28.s;			\
+		sm4e_sve	b2.s, z28.s;			\
+		sm4e_sve	b3.s, z28.s;			\
+		sm4e_sve	b4.s, z28.s;			\
+		sm4e_sve	b5.s, z28.s;			\
+		sm4e_sve	b6.s, z28.s;			\
+		sm4e_sve	b7.s, z28.s;			\
+		sm4e_sve	b0.s, z29.s;			\
+		sm4e_sve	b1.s, z29.s;			\
+		sm4e_sve	b2.s, z29.s;			\
+		sm4e_sve	b3.s, z29.s;			\
+		sm4e_sve	b4.s, z29.s;			\
+		sm4e_sve	b5.s, z29.s;			\
+		sm4e_sve	b6.s, z29.s;			\
+		sm4e_sve	b7.s, z29.s;			\
+		sm4e_sve	b0.s, z30.s;			\
+		sm4e_sve	b1.s, z30.s;			\
+		sm4e_sve	b2.s, z30.s;			\
+		sm4e_sve	b3.s, z30.s;			\
+		sm4e_sve	b4.s, z30.s;			\
+		sm4e_sve	b5.s, z30.s;			\
+		sm4e_sve	b6.s, z30.s;			\
+		sm4e_sve	b7.s, z30.s;			\
+		sm4e_sve	b0.s, z31.s;			\
+		sm4e_sve	b1.s, z31.s;			\
+		sm4e_sve	b2.s, z31.s;			\
+		sm4e_sve	b3.s, z31.s;			\
+		sm4e_sve	b4.s, z31.s;			\
+		sm4e_sve	b5.s, z31.s;			\
+		sm4e_sve	b6.s, z31.s;			\
+		sm4e_sve	b7.s, z31.s;			\
+		tbl		b0.b, {b0.b}, RSWAP128.b;	\
+		tbl		b1.b, {b1.b}, RSWAP128.b;	\
+		tbl		b2.b, {b2.b}, RSWAP128.b;	\
+		tbl		b3.b, {b3.b}, RSWAP128.b;	\
+		tbl		b4.b, {b4.b}, RSWAP128.b;	\
+		tbl		b5.b, {b5.b}, RSWAP128.b;	\
+		tbl		b6.b, {b6.b}, RSWAP128.b;	\
+		tbl		b7.b, {b7.b}, RSWAP128.b;	\
+		revb		b0.s, p0/m, b0.s;		\
+		revb		b1.s, p0/m, b1.s;		\
+		revb		b2.s, p0/m, b2.s;		\
+		revb		b3.s, p0/m, b3.s;		\
+		revb		b4.s, p0/m, b4.s;		\
+		revb		b5.s, p0/m, b5.s;		\
+		revb		b6.s, p0/m, b6.s;		\
+		revb		b7.s, p0/m, b7.s;
+
+#define SM4_CE_CRYPT_BLK(b0)					\
+		rev32		b0.16b, b0.16b;			\
+		sm4e		b0.4s, v24.4s;			\
+		sm4e		b0.4s, v25.4s;			\
+		sm4e		b0.4s, v26.4s;			\
+		sm4e		b0.4s, v27.4s;			\
+		sm4e		b0.4s, v28.4s;			\
+		sm4e		b0.4s, v29.4s;			\
+		sm4e		b0.4s, v30.4s;			\
+		sm4e		b0.4s, v31.4s;			\
+		rev64		b0.4s, b0.4s;			\
+		ext		b0.16b, b0.16b, b0.16b, #8;	\
+		rev32		b0.16b, b0.16b;
+
+#define inc_le128(zctr)						\
+		mov		RCTRv.d[1], x8;			\
+		mov		RCTRv.d[0], x7;			\
+		mov		zctr.d, RLE128_INC.d;		\
+		dup		RCTR.q, RCTR.q[0];		\
+		adds		x8, x8, x5, LSR #4;		\
+		adclt		zctr.d, RCTR.d, RZERO.d;	\
+		adclt		RCTR.d, zctr.d, RZERO.d;	\
+		adc		x7, x7, xzr;			\
+		trn1		zctr.d, RCTR.d, zctr.d;		\
+		revb		zctr.d, p0/m, zctr.d;
+
+#define inc_le128_4x(zctr0, zctr1, zctr2, zctr3)		\
+		mov		v8.d[1], x8;			\
+		mov		v8.d[0], x7;			\
+		adds		x8, x8, x5, LSR #4;		\
+		mov		zctr0.d, RLE128_INC.d;		\
+		adc		x7, x7, xzr;			\
+		mov		v9.d[1], x8;			\
+		mov		v9.d[0], x7;			\
+		adds		x8, x8, x5, LSR #4;		\
+		mov		zctr1.d, RLE128_INC.d;		\
+		adc		x7, x7, xzr;			\
+		mov		v10.d[1], x8;			\
+		mov		v10.d[0], x7;			\
+		adds		x8, x8, x5, LSR #4;		\
+		mov		zctr2.d, RLE128_INC.d;		\
+		adc		x7, x7, xzr;			\
+		mov		v11.d[1], x8;			\
+		mov		v11.d[0], x7;			\
+		adds		x8, x8, x5, LSR #4;		\
+		mov		zctr3.d, RLE128_INC.d;		\
+		adc		x7, x7, xzr;			\
+		dup		z8.q, z8.q[0];			\
+		dup		z9.q, z9.q[0];			\
+		dup		z10.q, z10.q[0];		\
+		dup		z11.q, z11.q[0];		\
+		adclt		zctr0.d, z8.d, RZERO.d;		\
+		adclt		zctr1.d, z9.d, RZERO.d;		\
+		adclt		zctr2.d, z10.d, RZERO.d;	\
+		adclt		zctr3.d, z11.d, RZERO.d;	\
+		adclt		z8.d, zctr0.d, RZERO.d;		\
+		adclt		z9.d, zctr1.d, RZERO.d;		\
+		adclt		z10.d, zctr2.d, RZERO.d;	\
+		adclt		z11.d, zctr3.d, RZERO.d;	\
+		trn1		zctr0.d, z8.d, zctr0.d;		\
+		trn1		zctr1.d, z9.d, zctr1.d;		\
+		trn1		zctr2.d, z10.d, zctr2.d;	\
+		trn1		zctr3.d, z11.d, zctr3.d;	\
+		revb		zctr0.d, p0/m, zctr0.d;		\
+		revb		zctr1.d, p0/m, zctr1.d;		\
+		revb		zctr2.d, p0/m, zctr2.d;		\
+		revb		zctr3.d, p0/m, zctr3.d;
+
+#define inc_le128_8x(zctr0, zctr1, zctr2, zctr3,		\
+		     zctr4, zctr5, zctr6, zctr7)		\
+		mov		v8.d[1], x8;			\
+		mov		v8.d[0], x7;			\
+		adds		x8, x8, x5, LSR #4;		\
+		mov		zctr0.d, RLE128_INC.d;		\
+		adc		x7, x7, xzr;			\
+		mov		v9.d[1], x8;			\
+		mov		v9.d[0], x7;			\
+		adds		x8, x8, x5, LSR #4;		\
+		mov		zctr1.d, RLE128_INC.d;		\
+		adc		x7, x7, xzr;			\
+		mov		v10.d[1], x8;			\
+		mov		v10.d[0], x7;			\
+		adds		x8, x8, x5, LSR #4;		\
+		mov		zctr2.d, RLE128_INC.d;		\
+		adc		x7, x7, xzr;			\
+		mov		v11.d[1], x8;			\
+		mov		v11.d[0], x7;			\
+		adds		x8, x8, x5, LSR #4;		\
+		mov		zctr3.d, RLE128_INC.d;		\
+		adc		x7, x7, xzr;			\
+		mov		v12.d[1], x8;			\
+		mov		v12.d[0], x7;			\
+		adds		x8, x8, x5, LSR #4;		\
+		mov		zctr4.d, RLE128_INC.d;		\
+		adc		x7, x7, xzr;			\
+		mov		v13.d[1], x8;			\
+		mov		v13.d[0], x7;			\
+		adds		x8, x8, x5, LSR #4;		\
+		mov		zctr5.d, RLE128_INC.d;		\
+		adc		x7, x7, xzr;			\
+		mov		v14.d[1], x8;			\
+		mov		v14.d[0], x7;			\
+		adds		x8, x8, x5, LSR #4;		\
+		mov		zctr6.d, RLE128_INC.d;		\
+		adc		x7, x7, xzr;			\
+		mov		v15.d[1], x8;			\
+		mov		v15.d[0], x7;			\
+		adds		x8, x8, x5, LSR #4;		\
+		mov		zctr7.d, RLE128_INC.d;		\
+		adc		x7, x7, xzr;			\
+		dup		z8.q, z8.q[0];			\
+		dup		z9.q, z9.q[0];			\
+		dup		z10.q, z10.q[0];		\
+		dup		z11.q, z11.q[0];		\
+		dup		z12.q, z12.q[0];		\
+		dup		z13.q, z13.q[0];		\
+		dup		z14.q, z14.q[0];		\
+		dup		z15.q, z15.q[0];		\
+		adclt		zctr0.d, z8.d, RZERO.d;		\
+		adclt		zctr1.d, z9.d, RZERO.d;		\
+		adclt		zctr2.d, z10.d, RZERO.d;	\
+		adclt		zctr3.d, z11.d, RZERO.d;	\
+		adclt		zctr4.d, z12.d, RZERO.d;	\
+		adclt		zctr5.d, z13.d, RZERO.d;	\
+		adclt		zctr6.d, z14.d, RZERO.d;	\
+		adclt		zctr7.d, z15.d, RZERO.d;	\
+		adclt		z8.d, zctr0.d, RZERO.d;		\
+		adclt		z9.d, zctr1.d, RZERO.d;		\
+		adclt		z10.d, zctr2.d, RZERO.d;	\
+		adclt		z11.d, zctr3.d, RZERO.d;	\
+		adclt		z12.d, zctr4.d, RZERO.d;	\
+		adclt		z13.d, zctr5.d, RZERO.d;	\
+		adclt		z14.d, zctr6.d, RZERO.d;	\
+		adclt		z15.d, zctr7.d, RZERO.d;	\
+		trn1		zctr0.d, z8.d, zctr0.d;		\
+		trn1		zctr1.d, z9.d, zctr1.d;		\
+		trn1		zctr2.d, z10.d, zctr2.d;	\
+		trn1		zctr3.d, z11.d, zctr3.d;	\
+		trn1		zctr4.d, z12.d, zctr4.d;	\
+		trn1		zctr5.d, z13.d, zctr5.d;	\
+		trn1		zctr6.d, z14.d, zctr6.d;	\
+		trn1		zctr7.d, z15.d, zctr7.d;	\
+		revb		zctr0.d, p0/m, zctr0.d;		\
+		revb		zctr1.d, p0/m, zctr1.d;		\
+		revb		zctr2.d, p0/m, zctr2.d;		\
+		revb		zctr3.d, p0/m, zctr3.d;		\
+		revb		zctr4.d, p0/m, zctr4.d;		\
+		revb		zctr5.d, p0/m, zctr5.d;		\
+		revb		zctr6.d, p0/m, zctr6.d;		\
+		revb		zctr7.d, p0/m, zctr7.d;
+
+
+.align 3
+SYM_FUNC_START(sm4_sve_ce_crypt)
+	/* input:
+	 *   x0: round key array, CTX
+	 *   x1: dst
+	 *   x2: src
+	 *   w3: nblocks
+	 */
+	uxtw		x3, w3
+	SM4_PREPARE(x0)
+
+.Lcrypt_loop_8x:
+	sub		x3, x3, x5, LSR #1		/* x3 - (8 * VL) */
+	tbnz		x3, #63, .Lcrypt_4x
+
+	ld1b		{z0.b}, p0/z, [x2]
+	ld1b		{z1.b}, p0/z, [x2, #1, MUL VL]
+	ld1b		{z2.b}, p0/z, [x2, #2, MUL VL]
+	ld1b		{z3.b}, p0/z, [x2, #3, MUL VL]
+	ld1b		{z4.b}, p0/z, [x2, #4, MUL VL]
+	ld1b		{z5.b}, p0/z, [x2, #5, MUL VL]
+	ld1b		{z6.b}, p0/z, [x2, #6, MUL VL]
+	ld1b		{z7.b}, p0/z, [x2, #7, MUL VL]
+
+	SM4_SVE_CE_CRYPT_BLK8(z0, z1, z2, z3, z4, z5, z6, z7)
+
+	st1b		{z0.b}, p0, [x1]
+	st1b		{z1.b}, p0, [x1, #1, MUL VL]
+	st1b		{z2.b}, p0, [x1, #2, MUL VL]
+	st1b		{z3.b}, p0, [x1, #3, MUL VL]
+	st1b		{z4.b}, p0, [x1, #4, MUL VL]
+	st1b		{z5.b}, p0, [x1, #5, MUL VL]
+	st1b		{z6.b}, p0, [x1, #6, MUL VL]
+	st1b		{z7.b}, p0, [x1, #7, MUL VL]
+
+	addvl		x2, x2, #8
+	addvl		x1, x1, #8
+
+	cbz		x3, .Lcrypt_end
+	b		.Lcrypt_loop_8x
+
+.Lcrypt_4x:
+	add		x3, x3, x5, LSR #1
+	cmp		x3, x5, LSR #2
+	blt		.Lcrypt_loop_1x
+
+	sub		x3, x3, x5, LSR #2		/* x3 - (4 * VL) */
+
+	ld1b		{z0.b}, p0/z, [x2]
+	ld1b		{z1.b}, p0/z, [x2, #1, MUL VL]
+	ld1b		{z2.b}, p0/z, [x2, #2, MUL VL]
+	ld1b		{z3.b}, p0/z, [x2, #3, MUL VL]
+
+	SM4_SVE_CE_CRYPT_BLK4(z0, z1, z2, z3)
+
+	st1b		{z0.b}, p0, [x1]
+	st1b		{z1.b}, p0, [x1, #1, MUL VL]
+	st1b		{z2.b}, p0, [x1, #2, MUL VL]
+	st1b		{z3.b}, p0, [x1, #3, MUL VL]
+
+	addvl		x2, x2, #4
+	addvl		x1, x1, #4
+
+	cbz		x3, .Lcrypt_end
+
+.Lcrypt_loop_1x:
+	cmp		x3, x5, LSR #4
+	blt		.Lcrypt_ce_loop_1x
+
+	sub		x3, x3, x5, LSR #4		/* x3 - VL */
+
+	ld1b		{z0.b}, p0/z, [x2]
+
+	SM4_SVE_CE_CRYPT_BLK(z0)
+
+	st1b		{z0.b}, p0, [x1]
+
+	addvl		x2, x2, #1
+	addvl		x1, x1, #1
+
+	cbz		x3, .Lcrypt_end
+	b		.Lcrypt_loop_1x
+
+.Lcrypt_ce_loop_1x:
+	sub		x3, x3, #1
+
+	ld1		{v0.16b}, [x2], #16
+	SM4_CE_CRYPT_BLK(v0)
+	st1		{v0.16b}, [x1], #16
+
+	cbnz		x3, .Lcrypt_ce_loop_1x
+
+.Lcrypt_end:
+	ret
+SYM_FUNC_END(sm4_sve_ce_crypt)
+
+.align 3
+SYM_FUNC_START(sm4_sve_ce_cbc_dec)
+	/* input:
+	 *   x0: round key array, CTX
+	 *   x1: dst
+	 *   x2: src
+	 *   x3: iv (big endian, 128 bit)
+	 *   w4: nblocks
+	 */
+	uxtw		x4, w4
+	SM4_PREPARE(x0)
+
+	ld1		{RIVv.16b}, [x3]
+	ext		RIV.b, RIV.b, RIV.b, #16
+
+.Lcbc_dec_loop_8x:
+	sub		x4, x4, x5, LSR #1		/* x4 - (8 * VL) */
+	tbnz		x4, #63, .Lcbc_dec_4x
+
+	ld1b		{z15.b}, p0/z, [x2]
+	ld1b		{z14.b}, p0/z, [x2, #1, MUL VL]
+	ld1b		{z13.b}, p0/z, [x2, #2, MUL VL]
+	ld1b		{z12.b}, p0/z, [x2, #3, MUL VL]
+	ld1b		{z11.b}, p0/z, [x2, #4, MUL VL]
+	ld1b		{z10.b}, p0/z, [x2, #5, MUL VL]
+	ld1b		{z9.b}, p0/z, [x2, #6, MUL VL]
+	ld1b		{z8.b}, p0/z, [x2, #7, MUL VL]
+	rev		z0.b, z15.b
+	rev		z1.b, z14.b
+	rev		z2.b, z13.b
+	rev		z3.b, z12.b
+	rev		z4.b, z11.b
+	rev		z5.b, z10.b
+	rev		z6.b, z9.b
+	rev		z7.b, z8.b
+	rev		RTMP0.b, RIV.b
+	ext		z7.b, z7.b, z6.b, #16
+	ext		z6.b, z6.b, z5.b, #16
+	ext		z5.b, z5.b, z4.b, #16
+	ext		z4.b, z4.b, z3.b, #16
+	ext		z3.b, z3.b, z2.b, #16
+	ext		z2.b, z2.b, z1.b, #16
+	ext		z1.b, z1.b, z0.b, #16
+	ext		z0.b, z0.b, RTMP0.b, #16
+	rev		z7.b, z7.b
+	rev		z6.b, z6.b
+	rev		z5.b, z5.b
+	rev		z4.b, z4.b
+	rev		z3.b, z3.b
+	rev		z2.b, z2.b
+	rev		z1.b, z1.b
+	rev		z0.b, z0.b
+	mov		RIV.d, z8.d
+
+	SM4_SVE_CE_CRYPT_BLK8(z15, z14, z13, z12, z11, z10, z9, z8)
+
+	eor		z0.d, z0.d, z15.d
+	eor		z1.d, z1.d, z14.d
+	eor		z2.d, z2.d, z13.d
+	eor		z3.d, z3.d, z12.d
+	eor		z4.d, z4.d, z11.d
+	eor		z5.d, z5.d, z10.d
+	eor		z6.d, z6.d, z9.d
+	eor		z7.d, z7.d, z8.d
+	st1b		{z0.b}, p0, [x1]
+	st1b		{z1.b}, p0, [x1, #1, MUL VL]
+	st1b		{z2.b}, p0, [x1, #2, MUL VL]
+	st1b		{z3.b}, p0, [x1, #3, MUL VL]
+	st1b		{z4.b}, p0, [x1, #4, MUL VL]
+	st1b		{z5.b}, p0, [x1, #5, MUL VL]
+	st1b		{z6.b}, p0, [x1, #6, MUL VL]
+	st1b		{z7.b}, p0, [x1, #7, MUL VL]
+
+	addvl		x2, x2, #8
+	addvl		x1, x1, #8
+
+	cbz		x4, .Lcbc_dec_end
+	b		.Lcbc_dec_loop_8x
+
+.Lcbc_dec_4x:
+	add		x4, x4, x5, LSR #1
+	cmp		x4, x5, LSR #2
+	blt		.Lcbc_dec_loop_1x
+
+	sub		x4, x4, x5, LSR #2		/* x4 - (4 * VL) */
+
+	ld1b		{z15.b}, p0/z, [x2]
+	ld1b		{z14.b}, p0/z, [x2, #1, MUL VL]
+	ld1b		{z13.b}, p0/z, [x2, #2, MUL VL]
+	ld1b		{z12.b}, p0/z, [x2, #3, MUL VL]
+	rev		z0.b, z15.b
+	rev		z1.b, z14.b
+	rev		z2.b, z13.b
+	rev		z3.b, z12.b
+	rev		RTMP0.b, RIV.b
+	ext		z3.b, z3.b, z2.b, #16
+	ext		z2.b, z2.b, z1.b, #16
+	ext		z1.b, z1.b, z0.b, #16
+	ext		z0.b, z0.b, RTMP0.b, #16
+	rev		z3.b, z3.b
+	rev		z2.b, z2.b
+	rev		z1.b, z1.b
+	rev		z0.b, z0.b
+	mov		RIV.d, z12.d
+
+	SM4_SVE_CE_CRYPT_BLK4(z15, z14, z13, z12)
+
+	eor		z0.d, z0.d, z15.d
+	eor		z1.d, z1.d, z14.d
+	eor		z2.d, z2.d, z13.d
+	eor		z3.d, z3.d, z12.d
+	st1b		{z0.b}, p0, [x1]
+	st1b		{z1.b}, p0, [x1, #1, MUL VL]
+	st1b		{z2.b}, p0, [x1, #2, MUL VL]
+	st1b		{z3.b}, p0, [x1, #3, MUL VL]
+
+	addvl		x2, x2, #4
+	addvl		x1, x1, #4
+
+	cbz		x4, .Lcbc_dec_end
+
+.Lcbc_dec_loop_1x:
+	cmp		x4, x5, LSR #4
+	blt		.Lcbc_dec_ce
+
+	sub		x4, x4, x5, LSR #4		/* x4 - VL */
+
+	ld1b		{z15.b}, p0/z, [x2]
+	rev		RTMP0.b, RIV.b
+	rev		z0.b, z15.b
+	ext		z0.b, z0.b, RTMP0.b, #16
+	rev		z0.b, z0.b
+	mov		RIV.d, z15.d
+
+	SM4_SVE_CE_CRYPT_BLK(z15)
+
+	eor		z0.d, z0.d, z15.d
+	st1b		{z0.b}, p0, [x1]
+
+	addvl		x2, x2, #1
+	addvl		x1, x1, #1
+
+	cbz		x4, .Lcbc_dec_end
+	b		.Lcbc_dec_loop_1x
+
+.Lcbc_dec_ce:
+	rev		RIV.s, RIV.s
+	tbl		RIV.b, {RIV.b}, RSWAP128.b
+
+.Lcbc_dec_ce_loop_1x:
+	sub		x4, x4, #1
+
+	ld1		{v15.16b}, [x2], #16
+	mov		v0.16b, RIVv.16b
+	mov		RIVv.16b, v15.16b
+	SM4_CE_CRYPT_BLK(v15)
+	eor		v0.16b, v0.16b, v15.16b
+	st1		{v0.16b}, [x1], #16
+
+	cbnz		x4, .Lcbc_dec_ce_loop_1x
+
+	ext		RIV.b, RIV.b, RIV.b, #16
+
+.Lcbc_dec_end:
+	/* store new IV */
+	rev		RIV.s, RIV.s
+	tbl		RIV.b, {RIV.b}, RSWAP128.b
+	st1		{RIVv.16b}, [x3]
+
+	ret
+SYM_FUNC_END(sm4_sve_ce_cbc_dec)
+
+.align 3
+SYM_FUNC_START(sm4_sve_ce_cfb_dec)
+	/* input:
+	 *   x0: round key array, CTX
+	 *   x1: dst
+	 *   x2: src
+	 *   x3: iv (big endian, 128 bit)
+	 *   w4: nblocks
+	 */
+	uxtw		x4, w4
+	SM4_PREPARE(x0)
+
+	ld1		{RIVv.16b}, [x3]
+	ext		RIV.b, RIV.b, RIV.b, #16
+
+.Lcfb_dec_loop_8x:
+	sub		x4, x4, x5, LSR #1		/* x4 - (8 * VL) */
+	tbnz		x4, #63, .Lcfb_dec_4x
+
+	ld1b		{z15.b}, p0/z, [x2]
+	ld1b		{z14.b}, p0/z, [x2, #1, MUL VL]
+	ld1b		{z13.b}, p0/z, [x2, #2, MUL VL]
+	ld1b		{z12.b}, p0/z, [x2, #3, MUL VL]
+	ld1b		{z11.b}, p0/z, [x2, #4, MUL VL]
+	ld1b		{z10.b}, p0/z, [x2, #5, MUL VL]
+	ld1b		{z9.b}, p0/z, [x2, #6, MUL VL]
+	ld1b		{z8.b}, p0/z, [x2, #7, MUL VL]
+	rev		z0.b, z15.b
+	rev		z1.b, z14.b
+	rev		z2.b, z13.b
+	rev		z3.b, z12.b
+	rev		z4.b, z11.b
+	rev		z5.b, z10.b
+	rev		z6.b, z9.b
+	rev		z7.b, z8.b
+	rev		RTMP0.b, RIV.b
+	ext		z7.b, z7.b, z6.b, #16
+	ext		z6.b, z6.b, z5.b, #16
+	ext		z5.b, z5.b, z4.b, #16
+	ext		z4.b, z4.b, z3.b, #16
+	ext		z3.b, z3.b, z2.b, #16
+	ext		z2.b, z2.b, z1.b, #16
+	ext		z1.b, z1.b, z0.b, #16
+	ext		z0.b, z0.b, RTMP0.b, #16
+	rev		z7.b, z7.b
+	rev		z6.b, z6.b
+	rev		z5.b, z5.b
+	rev		z4.b, z4.b
+	rev		z3.b, z3.b
+	rev		z2.b, z2.b
+	rev		z1.b, z1.b
+	rev		z0.b, z0.b
+	mov		RIV.d, z8.d
+
+	SM4_SVE_CE_CRYPT_BLK8(z0, z1, z2, z3, z4, z5, z6, z7)
+
+	eor		z0.d, z0.d, z15.d
+	eor		z1.d, z1.d, z14.d
+	eor		z2.d, z2.d, z13.d
+	eor		z3.d, z3.d, z12.d
+	eor		z4.d, z4.d, z11.d
+	eor		z5.d, z5.d, z10.d
+	eor		z6.d, z6.d, z9.d
+	eor		z7.d, z7.d, z8.d
+	st1b		{z0.b}, p0, [x1]
+	st1b		{z1.b}, p0, [x1, #1, MUL VL]
+	st1b		{z2.b}, p0, [x1, #2, MUL VL]
+	st1b		{z3.b}, p0, [x1, #3, MUL VL]
+	st1b		{z4.b}, p0, [x1, #4, MUL VL]
+	st1b		{z5.b}, p0, [x1, #5, MUL VL]
+	st1b		{z6.b}, p0, [x1, #6, MUL VL]
+	st1b		{z7.b}, p0, [x1, #7, MUL VL]
+
+	addvl		x2, x2, #8
+	addvl		x1, x1, #8
+
+	cbz		x4, .Lcfb_dec_end
+	b		.Lcfb_dec_loop_8x
+
+.Lcfb_dec_4x:
+	add		x4, x4, x5, LSR #1
+	cmp		x4, x5, LSR #2
+	blt		.Lcfb_dec_loop_1x
+
+	sub		x4, x4, x5, LSR #2		/* x4 - (4 * VL) */
+
+	ld1b		{z15.b}, p0/z, [x2]
+	ld1b		{z14.b}, p0/z, [x2, #1, MUL VL]
+	ld1b		{z13.b}, p0/z, [x2, #2, MUL VL]
+	ld1b		{z12.b}, p0/z, [x2, #3, MUL VL]
+	rev		z0.b, z15.b
+	rev		z1.b, z14.b
+	rev		z2.b, z13.b
+	rev		z3.b, z12.b
+	rev		RTMP0.b, RIV.b
+	ext		z3.b, z3.b, z2.b, #16
+	ext		z2.b, z2.b, z1.b, #16
+	ext		z1.b, z1.b, z0.b, #16
+	ext		z0.b, z0.b, RTMP0.b, #16
+	rev		z3.b, z3.b
+	rev		z2.b, z2.b
+	rev		z1.b, z1.b
+	rev		z0.b, z0.b
+	mov		RIV.d, z12.d
+
+	SM4_SVE_CE_CRYPT_BLK4(z0, z1, z2, z3)
+
+	eor		z0.d, z0.d, z15.d
+	eor		z1.d, z1.d, z14.d
+	eor		z2.d, z2.d, z13.d
+	eor		z3.d, z3.d, z12.d
+	st1b		{z0.b}, p0, [x1]
+	st1b		{z1.b}, p0, [x1, #1, MUL VL]
+	st1b		{z2.b}, p0, [x1, #2, MUL VL]
+	st1b		{z3.b}, p0, [x1, #3, MUL VL]
+
+	addvl		x2, x2, #4
+	addvl		x1, x1, #4
+
+	cbz		x4, .Lcfb_dec_end
+
+.Lcfb_dec_loop_1x:
+	cmp		x4, x5, LSR #4
+	blt		.Lcfb_dec_ce
+
+	sub		x4, x4, x5, LSR #4		/* x4 - VL */
+
+	ld1b		{z15.b}, p0/z, [x2]
+	rev		RTMP0.b, RIV.b
+	rev		z0.b, z15.b
+	ext		z0.b, z0.b, RTMP0.b, #16
+	rev		z0.b, z0.b
+	mov		RIV.d, z15.d
+
+	SM4_SVE_CE_CRYPT_BLK(z0)
+
+	eor		z0.d, z0.d, z15.d
+	st1b		{z0.b}, p0, [x1]
+
+	addvl		x2, x2, #1
+	addvl		x1, x1, #1
+
+	cbz		x4, .Lcfb_dec_end
+	b		.Lcfb_dec_loop_1x
+
+.Lcfb_dec_ce:
+	rev		RIV.s, RIV.s
+	tbl		RIV.b, {RIV.b}, RSWAP128.b
+
+.Lcfb_dec_ce_loop_1x:
+	sub		x4, x4, #1
+
+	ld1		{v15.16b}, [x2], #16
+	mov		v0.16b, RIVv.16b
+	mov		RIVv.16b, v15.16b
+	SM4_CE_CRYPT_BLK(v0)
+	eor		v0.16b, v0.16b, v15.16b
+	st1		{v0.16b}, [x1], #16
+
+	cbnz		x4, .Lcfb_dec_ce_loop_1x
+
+	ext		RIV.b, RIV.b, RIV.b, #16
+
+.Lcfb_dec_end:
+	/* store new IV */
+	rev		RIV.s, RIV.s
+	tbl		RIV.b, {RIV.b}, RSWAP128.b
+	st1		{RIVv.16b}, [x3]
+
+	ret
+SYM_FUNC_END(sm4_sve_ce_cfb_dec)
+
+.align 3
+SYM_FUNC_START(sm4_sve_ce_ctr_crypt)
+	/* input:
+	 *   x0: round key array, CTX
+	 *   x1: dst
+	 *   x2: src
+	 *   x3: ctr (big endian, 128 bit)
+	 *   w4: nblocks
+	 */
+	uxtw		x4, w4
+	SM4_PREPARE(x0)
+
+	dup		RZERO.d, #0
+	adr_l		x6, .Lle128_inc
+	ld1b		{RLE128_INC.b}, p0/z, [x6]
+
+	ldp		x7, x8, [x3]
+	rev		x7, x7
+	rev		x8, x8
+
+.Lctr_loop_8x:
+	sub		x4, x4, x5, LSR #1		/* x4 - (8 * VL) */
+	tbnz		x4, #63, .Lctr_4x
+
+	inc_le128_8x(z0, z1, z2, z3, z4, z5, z6, z7)
+
+	ld1b		{z8.b}, p0/z, [x2]
+	ld1b		{z9.b}, p0/z, [x2, #1, MUL VL]
+	ld1b		{z10.b}, p0/z, [x2, #2, MUL VL]
+	ld1b		{z11.b}, p0/z, [x2, #3, MUL VL]
+	ld1b		{z12.b}, p0/z, [x2, #4, MUL VL]
+	ld1b		{z13.b}, p0/z, [x2, #5, MUL VL]
+	ld1b		{z14.b}, p0/z, [x2, #6, MUL VL]
+	ld1b		{z15.b}, p0/z, [x2, #7, MUL VL]
+
+	SM4_SVE_CE_CRYPT_BLK8(z0, z1, z2, z3, z4, z5, z6, z7)
+
+	eor		z0.d, z0.d, z8.d
+	eor		z1.d, z1.d, z9.d
+	eor		z2.d, z2.d, z10.d
+	eor		z3.d, z3.d, z11.d
+	eor		z4.d, z4.d, z12.d
+	eor		z5.d, z5.d, z13.d
+	eor		z6.d, z6.d, z14.d
+	eor		z7.d, z7.d, z15.d
+
+	st1b		{z0.b}, p0, [x1]
+	st1b		{z1.b}, p0, [x1, #1, MUL VL]
+	st1b		{z2.b}, p0, [x1, #2, MUL VL]
+	st1b		{z3.b}, p0, [x1, #3, MUL VL]
+	st1b		{z4.b}, p0, [x1, #4, MUL VL]
+	st1b		{z5.b}, p0, [x1, #5, MUL VL]
+	st1b		{z6.b}, p0, [x1, #6, MUL VL]
+	st1b		{z7.b}, p0, [x1, #7, MUL VL]
+
+	addvl		x2, x2, #8
+	addvl		x1, x1, #8
+
+	cbz		x4, .Lctr_end
+	b		.Lctr_loop_8x
+
+.Lctr_4x:
+	add		x4, x4, x5, LSR #1
+	cmp		x4, x5, LSR #2
+	blt		.Lctr_loop_1x
+
+	sub		x4, x4, x5, LSR #2		/* x4 - (4 * VL) */
+
+	inc_le128_4x(z0, z1, z2, z3)
+
+	ld1b		{z8.b}, p0/z, [x2]
+	ld1b		{z9.b}, p0/z, [x2, #1, MUL VL]
+	ld1b		{z10.b}, p0/z, [x2, #2, MUL VL]
+	ld1b		{z11.b}, p0/z, [x2, #3, MUL VL]
+
+	SM4_SVE_CE_CRYPT_BLK4(z0, z1, z2, z3)
+
+	eor		z0.d, z0.d, z8.d
+	eor		z1.d, z1.d, z9.d
+	eor		z2.d, z2.d, z10.d
+	eor		z3.d, z3.d, z11.d
+
+	st1b		{z0.b}, p0, [x1]
+	st1b		{z1.b}, p0, [x1, #1, MUL VL]
+	st1b		{z2.b}, p0, [x1, #2, MUL VL]
+	st1b		{z3.b}, p0, [x1, #3, MUL VL]
+
+	addvl		x2, x2, #4
+	addvl		x1, x1, #4
+
+	cbz		x4, .Lctr_end
+
+.Lctr_loop_1x:
+	cmp		x4, x5, LSR #4
+	blt		.Lctr_ce_loop_1x
+
+	sub		x4, x4, x5, LSR #4		/* x4 - VL */
+
+	inc_le128(z0)
+	ld1b		{z8.b}, p0/z, [x2]
+
+	SM4_SVE_CE_CRYPT_BLK(z0)
+
+	eor		z0.d, z0.d, z8.d
+	st1b		{z0.b}, p0, [x1]
+
+	addvl		x2, x2, #1
+	addvl		x1, x1, #1
+
+	cbz		x4, .Lctr_end
+	b		.Lctr_loop_1x
+
+.Lctr_ce_loop_1x:
+	sub		x4, x4, #1
+
+	/* inc_le128 for CE */
+	mov		v0.d[1], x8
+	mov		v0.d[0], x7
+	adds		x8, x8, #1
+	rev64		v0.16b, v0.16b
+	adc		x7, x7, xzr
+
+	ld1		{v8.16b}, [x2], #16
+
+	SM4_CE_CRYPT_BLK(v0)
+
+	eor		v0.16b, v0.16b, v8.16b
+	st1		{v0.16b}, [x1], #16
+
+	cbnz		x4, .Lctr_ce_loop_1x
+
+.Lctr_end:
+	/* store new CTR */
+	rev		x7, x7
+	rev		x8, x8
+	stp		x7, x8, [x3]
+
+	ret
+SYM_FUNC_END(sm4_sve_ce_ctr_crypt)
+
+.align 3
+SYM_FUNC_START(sm4_sve_get_vl)
+	/* VL in bytes */
+	rdvl		x0, #1
+
+	ret
+SYM_FUNC_END(sm4_sve_get_vl)
+
+
+	.section	".rodata", "a"
+	.align 4
+.Lbswap128_mask:
+	.byte		0x0c, 0x0d, 0x0e, 0x0f, 0x08, 0x09, 0x0a, 0x0b
+	.byte		0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03
+	.byte		0x1c, 0x1d, 0x1e, 0x1f, 0x18, 0x19, 0x1a, 0x1b
+	.byte		0x14, 0x15, 0x16, 0x17, 0x10, 0x11, 0x12, 0x13
+	.byte		0x2c, 0x2d, 0x2e, 0x2f, 0x28, 0x29, 0x2a, 0x2b
+	.byte		0x24, 0x25, 0x26, 0x27, 0x20, 0x21, 0x22, 0x23
+	.byte		0x3c, 0x3d, 0x3e, 0x3f, 0x38, 0x39, 0x3a, 0x3b
+	.byte		0x34, 0x35, 0x36, 0x37, 0x30, 0x31, 0x32, 0x33
+	.byte		0x4c, 0x4d, 0x4e, 0x4f, 0x48, 0x49, 0x4a, 0x4b
+	.byte		0x44, 0x45, 0x46, 0x47, 0x40, 0x41, 0x42, 0x43
+	.byte		0x5c, 0x5d, 0x5e, 0x5f, 0x58, 0x59, 0x5a, 0x5b
+	.byte		0x54, 0x55, 0x56, 0x57, 0x50, 0x51, 0x52, 0x53
+	.byte		0x6c, 0x6d, 0x6e, 0x6f, 0x68, 0x69, 0x6a, 0x6b
+	.byte		0x64, 0x65, 0x66, 0x67, 0x60, 0x61, 0x62, 0x63
+	.byte		0x7c, 0x7d, 0x7e, 0x7f, 0x78, 0x79, 0x7a, 0x7b
+	.byte		0x74, 0x75, 0x76, 0x77, 0x70, 0x71, 0x72, 0x73
+	.byte		0x8c, 0x8d, 0x8e, 0x8f, 0x88, 0x89, 0x8a, 0x8b
+	.byte		0x84, 0x85, 0x86, 0x87, 0x80, 0x81, 0x82, 0x83
+	.byte		0x9c, 0x9d, 0x9e, 0x9f, 0x98, 0x99, 0x9a, 0x9b
+	.byte		0x94, 0x95, 0x96, 0x97, 0x90, 0x91, 0x92, 0x93
+	.byte		0xac, 0xad, 0xae, 0xaf, 0xa8, 0xa9, 0xaa, 0xab
+	.byte		0xa4, 0xa5, 0xa6, 0xa7, 0xa0, 0xa1, 0xa2, 0xa3
+	.byte		0xbc, 0xbd, 0xbe, 0xbf, 0xb8, 0xb9, 0xba, 0xbb
+	.byte		0xb4, 0xb5, 0xb6, 0xb7, 0xb0, 0xb1, 0xb2, 0xb3
+	.byte		0xcc, 0xcd, 0xce, 0xcf, 0xc8, 0xc9, 0xca, 0xcb
+	.byte		0xc4, 0xc5, 0xc6, 0xc7, 0xc0, 0xc1, 0xc2, 0xc3
+	.byte		0xdc, 0xdd, 0xde, 0xdf, 0xd8, 0xd9, 0xda, 0xdb
+	.byte		0xd4, 0xd5, 0xd6, 0xd7, 0xd0, 0xd1, 0xd2, 0xd3
+	.byte		0xec, 0xed, 0xee, 0xef, 0xe8, 0xe9, 0xea, 0xeb
+	.byte		0xe4, 0xe5, 0xe6, 0xe7, 0xe0, 0xe1, 0xe2, 0xe3
+	.byte		0xfc, 0xfd, 0xfe, 0xff, 0xf8, 0xf9, 0xfa, 0xfb
+	.byte		0xf4, 0xf5, 0xf6, 0xf7, 0xf0, 0xf1, 0xf2, 0xf3
+
+.Lle128_inc:
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x05, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x06, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x07, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x09, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x0a, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x0b, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x0c, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x0d, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x0e, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x0f, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
diff --git a/arch/arm64/crypto/sm4-sve-ce-glue.c b/arch/arm64/crypto/sm4-sve-ce-glue.c
new file mode 100644
index 000000000000..fc797b72b5f0
--- /dev/null
+++ b/arch/arm64/crypto/sm4-sve-ce-glue.c
@@ -0,0 +1,332 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * SM4 Cipher Algorithm, using ARMv9 Crypto Extensions with SVE2
+ * as specified in
+ * https://tools.ietf.org/id/draft-ribose-cfrg-sm4-10.html
+ *
+ * Copyright (C) 2022, Alibaba Group.
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#include <linux/module.h>
+#include <linux/crypto.h>
+#include <linux/kernel.h>
+#include <linux/cpufeature.h>
+#include <asm/neon.h>
+#include <asm/simd.h>
+#include <crypto/internal/simd.h>
+#include <crypto/internal/skcipher.h>
+#include <crypto/sm4.h>
+#include "sm4-ce.h"
+
+asmlinkage void sm4_sve_ce_crypt(const u32 *rkey, u8 *dst,
+				 const u8 *src, unsigned int nblocks);
+asmlinkage void sm4_sve_ce_cbc_dec(const u32 *rkey_dec, u8 *dst,
+				   const u8 *src, u8 *iv,
+				   unsigned int nblocks);
+asmlinkage void sm4_sve_ce_cfb_dec(const u32 *rkey_enc, u8 *dst,
+				   const u8 *src, u8 *iv,
+				   unsigned int nblocks);
+asmlinkage void sm4_sve_ce_ctr_crypt(const u32 *rkey_enc, u8 *dst,
+				     const u8 *src, u8 *iv,
+				     unsigned int nblocks);
+asmlinkage unsigned int sm4_sve_get_vl(void);
+
+
+static int sm4_setkey(struct crypto_skcipher *tfm, const u8 *key,
+		      unsigned int key_len)
+{
+	struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+	if (key_len != SM4_KEY_SIZE)
+		return -EINVAL;
+
+	kernel_neon_begin();
+	sm4_ce_expand_key(key, ctx->rkey_enc, ctx->rkey_dec,
+			  crypto_sm4_fk, crypto_sm4_ck);
+	kernel_neon_end();
+
+	return 0;
+}
+
+static int ecb_crypt(struct skcipher_request *req, const u32 *rkey)
+{
+	struct skcipher_walk walk;
+	unsigned int nbytes;
+	int err;
+
+	err = skcipher_walk_virt(&walk, req, false);
+
+	while ((nbytes = walk.nbytes) > 0) {
+		const u8 *src = walk.src.virt.addr;
+		u8 *dst = walk.dst.virt.addr;
+		unsigned int nblocks;
+
+		nblocks = nbytes / SM4_BLOCK_SIZE;
+		if (nblocks) {
+			kernel_neon_begin();
+
+			sm4_sve_ce_crypt(rkey, dst, src, nblocks);
+
+			kernel_neon_end();
+		}
+
+		err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE);
+	}
+
+	return err;
+}
+
+static int ecb_encrypt(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+	return ecb_crypt(req, ctx->rkey_enc);
+}
+
+static int ecb_decrypt(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+	return ecb_crypt(req, ctx->rkey_dec);
+}
+
+static int cbc_crypt(struct skcipher_request *req, const u32 *rkey,
+		     void (*sm4_cbc_crypt)(const u32 *rkey, u8 *dst,
+				const u8 *src, u8 *iv, unsigned int nblocks))
+{
+	struct skcipher_walk walk;
+	unsigned int nbytes;
+	int err;
+
+	err = skcipher_walk_virt(&walk, req, false);
+
+	while ((nbytes = walk.nbytes) > 0) {
+		const u8 *src = walk.src.virt.addr;
+		u8 *dst = walk.dst.virt.addr;
+		unsigned int nblocks;
+
+		nblocks = nbytes / SM4_BLOCK_SIZE;
+		if (nblocks) {
+			kernel_neon_begin();
+
+			sm4_cbc_crypt(rkey, dst, src, walk.iv, nblocks);
+
+			kernel_neon_end();
+		}
+
+		err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE);
+	}
+
+	return err;
+}
+
+static int cbc_encrypt(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+	return cbc_crypt(req, ctx->rkey_enc, sm4_ce_cbc_enc);
+}
+
+static int cbc_decrypt(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+	return cbc_crypt(req, ctx->rkey_dec, sm4_sve_ce_cbc_dec);
+}
+
+static int cfb_crypt(struct skcipher_request *req,
+		     void (*sm4_cfb_crypt)(const u32 *rkey, u8 *dst,
+				const u8 *src, u8 *iv, unsigned int nblocks))
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct skcipher_walk walk;
+	unsigned int nbytes;
+	int err;
+
+	err = skcipher_walk_virt(&walk, req, false);
+
+	while ((nbytes = walk.nbytes) > 0) {
+		const u8 *src = walk.src.virt.addr;
+		u8 *dst = walk.dst.virt.addr;
+		unsigned int nblocks;
+
+		nblocks = nbytes / SM4_BLOCK_SIZE;
+		if (nblocks) {
+			kernel_neon_begin();
+
+			sm4_cfb_crypt(ctx->rkey_enc, dst, src,
+				      walk.iv, nblocks);
+
+			kernel_neon_end();
+
+			dst += nblocks * SM4_BLOCK_SIZE;
+			src += nblocks * SM4_BLOCK_SIZE;
+			nbytes -= nblocks * SM4_BLOCK_SIZE;
+		}
+
+		/* tail */
+		if (walk.nbytes == walk.total && nbytes > 0) {
+			u8 keystream[SM4_BLOCK_SIZE];
+
+			sm4_ce_crypt_block(ctx->rkey_enc, keystream, walk.iv);
+			crypto_xor_cpy(dst, src, keystream, nbytes);
+			nbytes = 0;
+		}
+
+		err = skcipher_walk_done(&walk, nbytes);
+	}
+
+	return err;
+}
+
+static int cfb_encrypt(struct skcipher_request *req)
+{
+	return cfb_crypt(req, sm4_ce_cfb_enc);
+}
+
+static int cfb_decrypt(struct skcipher_request *req)
+{
+	return cfb_crypt(req, sm4_sve_ce_cfb_dec);
+}
+
+static int ctr_crypt(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct skcipher_walk walk;
+	unsigned int nbytes;
+	int err;
+
+	err = skcipher_walk_virt(&walk, req, false);
+
+	while ((nbytes = walk.nbytes) > 0) {
+		const u8 *src = walk.src.virt.addr;
+		u8 *dst = walk.dst.virt.addr;
+		unsigned int nblocks;
+
+		nblocks = nbytes / SM4_BLOCK_SIZE;
+		if (nblocks) {
+			kernel_neon_begin();
+
+			sm4_sve_ce_ctr_crypt(ctx->rkey_enc, dst, src,
+					     walk.iv, nblocks);
+
+			kernel_neon_end();
+
+			dst += nblocks * SM4_BLOCK_SIZE;
+			src += nblocks * SM4_BLOCK_SIZE;
+			nbytes -= nblocks * SM4_BLOCK_SIZE;
+		}
+
+		/* tail */
+		if (walk.nbytes == walk.total && nbytes > 0) {
+			u8 keystream[SM4_BLOCK_SIZE];
+
+			sm4_ce_crypt_block(ctx->rkey_enc, keystream, walk.iv);
+			crypto_inc(walk.iv, SM4_BLOCK_SIZE);
+			crypto_xor_cpy(dst, src, keystream, nbytes);
+			nbytes = 0;
+		}
+
+		err = skcipher_walk_done(&walk, nbytes);
+	}
+
+	return err;
+}
+
+static struct skcipher_alg sm4_algs[] = {
+	{
+		.base = {
+			.cra_name		= "ecb(sm4)",
+			.cra_driver_name	= "ecb-sm4-sve-ce",
+			.cra_priority		= 500,
+			.cra_blocksize		= SM4_BLOCK_SIZE,
+			.cra_ctxsize		= sizeof(struct sm4_ctx),
+			.cra_module		= THIS_MODULE,
+		},
+		.min_keysize	= SM4_KEY_SIZE,
+		.max_keysize	= SM4_KEY_SIZE,
+		.setkey		= sm4_setkey,
+		.encrypt	= ecb_encrypt,
+		.decrypt	= ecb_decrypt,
+	}, {
+		.base = {
+			.cra_name		= "cbc(sm4)",
+			.cra_driver_name	= "cbc-sm4-sve-ce",
+			.cra_priority		= 500,
+			.cra_blocksize		= SM4_BLOCK_SIZE,
+			.cra_ctxsize		= sizeof(struct sm4_ctx),
+			.cra_module		= THIS_MODULE,
+		},
+		.min_keysize	= SM4_KEY_SIZE,
+		.max_keysize	= SM4_KEY_SIZE,
+		.ivsize		= SM4_BLOCK_SIZE,
+		.setkey		= sm4_setkey,
+		.encrypt	= cbc_encrypt,
+		.decrypt	= cbc_decrypt,
+	}, {
+		.base = {
+			.cra_name		= "cfb(sm4)",
+			.cra_driver_name	= "cfb-sm4-sve-ce",
+			.cra_priority		= 500,
+			.cra_blocksize		= 1,
+			.cra_ctxsize		= sizeof(struct sm4_ctx),
+			.cra_module		= THIS_MODULE,
+		},
+		.min_keysize	= SM4_KEY_SIZE,
+		.max_keysize	= SM4_KEY_SIZE,
+		.ivsize		= SM4_BLOCK_SIZE,
+		.chunksize	= SM4_BLOCK_SIZE,
+		.setkey		= sm4_setkey,
+		.encrypt	= cfb_encrypt,
+		.decrypt	= cfb_decrypt,
+	}, {
+		.base = {
+			.cra_name		= "ctr(sm4)",
+			.cra_driver_name	= "ctr-sm4-sve-ce",
+			.cra_priority		= 500,
+			.cra_blocksize		= 1,
+			.cra_ctxsize		= sizeof(struct sm4_ctx),
+			.cra_module		= THIS_MODULE,
+		},
+		.min_keysize	= SM4_KEY_SIZE,
+		.max_keysize	= SM4_KEY_SIZE,
+		.ivsize		= SM4_BLOCK_SIZE,
+		.chunksize	= SM4_BLOCK_SIZE,
+		.setkey		= sm4_setkey,
+		.encrypt	= ctr_crypt,
+		.decrypt	= ctr_crypt,
+	}
+};
+
+static int __init sm4_sve_ce_init(void)
+{
+	if (sm4_sve_get_vl() <= 16)
+		return -ENODEV;
+
+	return crypto_register_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs));
+}
+
+static void __exit sm4_sve_ce_exit(void)
+{
+	crypto_unregister_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs));
+}
+
+module_cpu_feature_match(SVESM4, sm4_sve_ce_init);
+module_exit(sm4_sve_ce_exit);
+
+MODULE_DESCRIPTION("SM4 ECB/CBC/CFB/CTR using ARMv9 Crypto Extensions with SVE2");
+MODULE_ALIAS_CRYPTO("sm4-sve-ce");
+MODULE_ALIAS_CRYPTO("sm4");
+MODULE_ALIAS_CRYPTO("ecb(sm4)");
+MODULE_ALIAS_CRYPTO("cbc(sm4)");
+MODULE_ALIAS_CRYPTO("cfb(sm4)");
+MODULE_ALIAS_CRYPTO("ctr(sm4)");
+MODULE_AUTHOR("Tianjia Zhang <tianjia.zhang@linux.alibaba.com>");
+MODULE_LICENSE("GPL v2");
-- 
2.24.3 (Apple Git-128)
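
The CTR routine in the patch above keeps the 128-bit big-endian counter
in two 64-bit registers (x7 for the high half, x8 for the low half after
byte reversal) and advances it with an adds/adc pair. A minimal C sketch
of that carry propagation follows; the helper names load_be64,
store_be64 and ctr128_add are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 64-bit big-endian value from memory. */
static uint64_t load_be64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Write a 64-bit value to memory in big-endian order. */
static void store_be64(uint8_t *p, uint64_t v)
{
	for (int i = 7; i >= 0; i--) {
		p[i] = (uint8_t)(v & 0xff);
		v >>= 8;
	}
}

/*
 * Add nblocks to a 128-bit big-endian counter.
 * Mirrors "adds x8, x8, #n; adc x7, x7, xzr" on the two halves.
 */
static void ctr128_add(uint8_t ctr[16], uint64_t nblocks)
{
	uint64_t hi, lo, sum;

	assert(ctr != NULL);

	hi = load_be64(ctr);		/* bytes 0..7: high half */
	lo = load_be64(ctr + 8);	/* bytes 8..15: low half */

	sum = lo + nblocks;
	if (sum < lo)			/* unsigned wrap means a carry out */
		hi++;

	store_be64(ctr, hi);
	store_be64(ctr + 8, sum);
}
```

Calling ctr128_add(iv, n) after processing n blocks leaves the buffer in
the same big-endian layout that the assembly stores back through x3.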



* Re: [PATCH 16/16] crypto: arm64/sm4 - add ARMv9 SVE cryptography acceleration implementation
  2022-09-26  9:36   ` Tianjia Zhang
@ 2022-09-26 10:02     ` Ard Biesheuvel
  -1 siblings, 0 replies; 42+ messages in thread
From: Ard Biesheuvel @ 2022-09-26 10:02 UTC (permalink / raw)
  To: Tianjia Zhang, Mark Brown
  Cc: Herbert Xu, David S. Miller, Jussi Kivilinna, Catalin Marinas,
	Will Deacon, Maxime Coquelin, Alexandre Torgue, Eric Biggers,
	linux-crypto, linux-arm-kernel, linux-kernel, linux-stm32

(cc Mark Brown)

Hello Tianjia,

On Mon, 26 Sept 2022 at 11:37, Tianjia Zhang
<tianjia.zhang@linux.alibaba.com> wrote:
>
> Scalable Vector Extension (SVE) is the next-generation SIMD extension for
> arm64. SVE allows flexible vector length implementations with a range of
> possible values in CPU implementations. The vector length can vary from a
> minimum of 128 bits up to a maximum of 2048 bits, at 128-bit increments.
> The SVE design guarantees that the same application can run on different
> implementations that support SVE, without the need to recompile the code.
>
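
To make the vector-length arithmetic concrete: a vector of VL bytes
holds VL / 16 SM4 blocks, and the assembly routines in this patch
consume a request in groups of 8 vectors, then at most one group of 4
vectors, then single vectors, and finally single blocks through the
128-bit CE path. The sketch below only models that split; the struct and
function names are hypothetical and it assumes nblocks was already
checked to be non-zero by the caller, as the glue code does.

```c
#include <assert.h>
#include <stddef.h>

#define SM4_BLOCK_SIZE	16

/* How a request is split across the code paths (illustrative only). */
struct sm4_sve_plan {
	size_t iter_8x;		/* iterations processing 8 vectors each */
	size_t iter_4x;		/* 0 or 1 iteration processing 4 vectors */
	size_t iter_1x;		/* iterations processing 1 vector each */
	size_t ce_blocks;	/* leftover blocks for the 128-bit CE loop */
};

static struct sm4_sve_plan sm4_sve_plan_blocks(size_t nblocks,
					       size_t vl_bytes)
{
	struct sm4_sve_plan plan = { 0, 0, 0, 0 };
	size_t per_vec;

	/* SVE allows 128..2048 bits in 128-bit steps, i.e. 16..256 bytes. */
	assert(vl_bytes >= 16 && vl_bytes <= 256 && vl_bytes % 16 == 0);

	per_vec = vl_bytes / SM4_BLOCK_SIZE;

	plan.iter_8x = nblocks / (8 * per_vec);
	nblocks %= 8 * per_vec;

	if (nblocks >= 4 * per_vec) {
		plan.iter_4x = 1;
		nblocks -= 4 * per_vec;
	}

	plan.iter_1x = nblocks / per_vec;
	plan.ce_blocks = nblocks % per_vec;

	return plan;
}
```

For example, with VL = 256 bits (32 bytes) a request of 31 blocks splits
into one 8-vector iteration, one 4-vector iteration, three single-vector
iterations and one block on the CE path.
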
> SVE was originally introduced by ARMv8, and ARMv9 introduced SVE2 to
> expand and improve it. Similar to the Crypto Extension that the NEON
> instruction set provides for the algorithm, SVE also supports similar
> instructions, called cryptography acceleration instructions, but these
> are also an optional instruction set.
>
> This patch uses SM4 cryptography acceleration instructions and SVE2
> instructions to optimize the SM4 algorithm for ECB/CBC/CFB/CTR modes.
> Since the encryption of CBC/CFB cannot be parallelized, the Crypto
> Extension instruction is used.
>
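
To make the CBC point in the quoted paragraph concrete, here is a minimal
sketch of the data dependency, not code from the patch. toy_encrypt() and
toy_decrypt() are hypothetical stand-ins for a block cipher (they are NOT
SM4), and the function names are made up for illustration. Encryption needs
the previous ciphertext block before it can start the next one, while every
decryption input is already available, so only decryption can be spread
across vector lanes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLK 16

/* Toy stand-in for a block cipher: NOT SM4, merely invertible. */
static void toy_encrypt(uint8_t *blk, uint8_t key)
{
	for (size_t i = 0; i < BLK; i++)
		blk[i] = (uint8_t)((blk[i] ^ key) + 1);
}

static void toy_decrypt(uint8_t *blk, uint8_t key)
{
	for (size_t i = 0; i < BLK; i++)
		blk[i] = (uint8_t)((blk[i] - 1) ^ key);
}

/* C[i] = E(P[i] ^ C[i-1]): block i cannot start before block i-1 is done. */
static void cbc_encrypt(uint8_t *dst, const uint8_t *src, size_t nblocks,
			const uint8_t *iv, uint8_t key)
{
	const uint8_t *prev = iv;

	for (size_t i = 0; i < nblocks; i++) {
		for (size_t j = 0; j < BLK; j++)
			dst[i * BLK + j] = src[i * BLK + j] ^ prev[j];
		toy_encrypt(dst + i * BLK, key);
		prev = dst + i * BLK;
	}
}

/*
 * P[i] = D(C[i]) ^ C[i-1]: all inputs are known up front, so the loop
 * iterations are independent. dst and src must not overlap here.
 */
static void cbc_decrypt(uint8_t *dst, const uint8_t *src, size_t nblocks,
			const uint8_t *iv, uint8_t key)
{
	for (size_t i = 0; i < nblocks; i++) {
		const uint8_t *prev = i ? src + (i - 1) * BLK : iv;
		uint8_t tmp[BLK];

		memcpy(tmp, src + i * BLK, BLK);
		toy_decrypt(tmp, key);
		for (size_t j = 0; j < BLK; j++)
			dst[i * BLK + j] = tmp[j] ^ prev[j];
	}
}
```

The same shape applies to CFB, where decryption only needs the previous
ciphertext block as cipher input.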

Given that we currently do not support the use of SVE in kernel mode,
this patch cannot be accepted at this time (but the rest of the series
looks reasonable to me, although I have only skimmed over the patches).

In view of the disappointing benchmark results below, I don't think
this is worth the hassle at the moment. If we can find a case where
using SVE in kernel mode truly makes a [favorable] difference, we can
revisit this, but not without a thorough analysis of the impact that
supporting SVE in the kernel will have. Also, the fact that SVE may
also cover cryptographic extensions does not necessarily imply that a
micro-architecture will perform those crypto transformations in
parallel, so the performance may be the same even if VL > 128.

In summary, please drop this patch for now, and once there are more
encouraging performance numbers, please resubmit it as part of a
series that explicitly enables SVE in kernel mode on arm64, and
documents the requirements and constraints.

I have cc'ed Mark, who has been working on the SVE support and might
have something to add here as well.

Thanks,
Ard.



> Since no test environment with a Vector Length (VL) greater than 128 bits
> was found, the performance data was obtained on a machine with a VL of
> 128 bits. Because this driver is enabled when the VL is greater than 128
> bits, these numbers are for reference only. The data shows little
> difference between the Crypto Extension and SVE (VL=128 bits)
> implementations; the optimization effect is expected to be more obvious
> when VL=256 bits or longer.
>
> Benchmark on T-Head Yitian-710 2.75 GHz. The data comes from mode 218
> of tcrypt and is compared with the Crypto Extension implementation. The
> columns are block sizes in bytes and the unit is Mb/s:
>
> sm4-ce      |      16       64      128      256     1024     1420     4096
> ------------+--------------------------------------------------------------
>     ECB enc |  315.18  1162.65  1815.66  2553.50  3692.91  3727.20  4001.93
>     ECB dec |  316.06  1172.97  1817.81  2554.66  3692.18  3786.54  4001.93
>     CBC enc |  304.82   629.54   768.65   864.72   953.90   963.32   974.06
>     CBC dec |  306.05  1142.53  1805.11  2481.67  3522.06  3587.87  3790.99
>     CFB enc |  309.48   635.70   774.44   865.85   950.62   952.68   968.24
>     CFB dec |  315.98  1170.38  1828.75  2509.72  3543.63  3539.40  3793.25
>     CTR enc |  285.83  1036.59  1583.50  2147.26  2933.54  2954.66  3041.14
>     CTR dec |  285.29  1037.47  1584.67  2145.51  2934.10  2950.89  3041.62
>
> sm4-sve-ce (VL = 128 bits)
>     ECB enc |  310.00  1154.70  1813.26  2579.74  3766.90  3869.45  4100.26
>     ECB dec |  315.60  1176.22  1838.06  2593.69  3774.95  3878.42  4098.83
>     CBC enc |  303.44   622.65   764.67   861.40   953.18   963.05   973.77
>     CBC dec |  302.13  1091.15  1689.10  2267.79  3182.84  3242.68  3408.92
>     CFB enc |  296.62   620.41   762.94   858.96   948.18   956.04   967.67
>     CFB dec |  291.23  1065.50  1637.33  2228.12  3158.52  3213.35  3403.83
>     CTR enc |  272.27   959.35  1466.34  1934.24  2562.80  2595.87  2695.15
>     CTR dec |  273.40   963.65  1471.83  1938.97  2563.12  2597.25  2694.54
>
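
For reference, the two tables above can be compared directly. The sketch
below is not part of the patch; the struct and function names are
hypothetical, and the figures are copied from the 4096 column of the quoted
tables. It computes the relative change of sm4-sve-ce against sm4-ce.

```c
#include <stddef.h>

/* Throughput in Mb/s at the 4096 column, copied from the tables above. */
struct sm4_bench_row {
	const char *mode;
	double ce;	/* sm4-ce */
	double sve_ce;	/* sm4-sve-ce, VL = 128 bits */
};

static const struct sm4_bench_row sm4_bench_4096[] = {
	{ "ECB enc", 4001.93, 4100.26 },
	{ "ECB dec", 4001.93, 4098.83 },
	{ "CBC enc",  974.06,  973.77 },
	{ "CBC dec", 3790.99, 3408.92 },
	{ "CFB enc",  968.24,  967.67 },
	{ "CFB dec", 3793.25, 3403.83 },
	{ "CTR enc", 3041.14, 2695.15 },
	{ "CTR dec", 3041.62, 2694.54 },
};

/* Percentage change of the SVE variant relative to the CE variant. */
static double sm4_bench_delta_pct(const struct sm4_bench_row *row)
{
	return (row->sve_ce / row->ce - 1.0) * 100.0;
}
```

On these figures ECB gains roughly 2%, CBC/CFB encryption is essentially
unchanged, and CBC/CFB decryption and CTR lose roughly 10 to 11%.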
> Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
> ---
>  arch/arm64/crypto/Kconfig           |   19 +
>  arch/arm64/crypto/Makefile          |    3 +
>  arch/arm64/crypto/sm4-sve-ce-core.S | 1028 +++++++++++++++++++++++++++
>  arch/arm64/crypto/sm4-sve-ce-glue.c |  332 +++++++++
>  4 files changed, 1382 insertions(+)
>  create mode 100644 arch/arm64/crypto/sm4-sve-ce-core.S
>  create mode 100644 arch/arm64/crypto/sm4-sve-ce-glue.c
>
> diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
> index 6793d5bc3ee5..bbb5a7a08af5 100644
> --- a/arch/arm64/crypto/Kconfig
> +++ b/arch/arm64/crypto/Kconfig
> @@ -249,6 +249,25 @@ config CRYPTO_SM4_ARM64_CE_BLK
>           - ARMv8 Crypto Extensions
>           - NEON (Advanced SIMD) extensions
>
> +config CRYPTO_SM4_ARM64_SVE_CE_BLK
> +       tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR (ARMv9 cryptography acceleration with SVE2)"
> +       depends on KERNEL_MODE_NEON
> +       select CRYPTO_SKCIPHER
> +       select CRYPTO_SM4
> +       select CRYPTO_SM4_ARM64_CE_BLK
> +       help
> +         Length-preserving ciphers: SM4 cipher algorithms (OSCCA GB/T 32907-2016)
> +         with block cipher modes:
> +         - ECB (Electronic Codebook) mode (NIST SP800-38A)
> +         - CBC (Cipher Block Chaining) mode (NIST SP800-38A)
> +         - CFB (Cipher Feedback) mode (NIST SP800-38A)
> +         - CTR (Counter) mode (NIST SP800-38A)
> +
> +         Architecture: arm64 using:
> +         - ARMv8 Crypto Extensions
> +         - ARMv9 cryptography acceleration with SVE2
> +         - NEON (Advanced SIMD) extensions
> +
>  config CRYPTO_SM4_ARM64_NEON_BLK
>         tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR (NEON)"
>         depends on KERNEL_MODE_NEON
> diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
> index 4818e204c2ac..355dd9053434 100644
> --- a/arch/arm64/crypto/Makefile
> +++ b/arch/arm64/crypto/Makefile
> @@ -38,6 +38,9 @@ sm4-ce-gcm-y := sm4-ce-gcm-glue.o sm4-ce-gcm-core.o
>  obj-$(CONFIG_CRYPTO_SM4_ARM64_NEON_BLK) += sm4-neon.o
>  sm4-neon-y := sm4-neon-glue.o sm4-neon-core.o
>
> +obj-$(CONFIG_CRYPTO_SM4_ARM64_SVE_CE_BLK) += sm4-sve-ce.o
> +sm4-sve-ce-y := sm4-sve-ce-glue.o sm4-sve-ce-core.o
> +
>  obj-$(CONFIG_CRYPTO_GHASH_ARM64_CE) += ghash-ce.o
>  ghash-ce-y := ghash-ce-glue.o ghash-ce-core.o
>
> diff --git a/arch/arm64/crypto/sm4-sve-ce-core.S b/arch/arm64/crypto/sm4-sve-ce-core.S
> new file mode 100644
> index 000000000000..caecbdf2536c
> --- /dev/null
> +++ b/arch/arm64/crypto/sm4-sve-ce-core.S
> @@ -0,0 +1,1028 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * SM4 Cipher Algorithm for ARMv9 Crypto Extensions with SVE2
> + * as specified in
> + * https://tools.ietf.org/id/draft-ribose-cfrg-sm4-10.html
> + *
> + * Copyright (C) 2022, Alibaba Group.
> + * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
> + */
> +
> +#include <linux/linkage.h>
> +#include <asm/assembler.h>
> +
> +.arch  armv8-a+crypto+sve+sve2
> +
> +.irp b, 0, 15, 24, 25, 26, 27, 28, 29, 30, 31
> +       .set .Lv\b\().4s, \b
> +.endr
> +
> +.irp b, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, \
> +               16, 24, 25, 26, 27, 28, 29, 30, 31
> +       .set .Lz\b\().s, \b
> +.endr
> +
> +.macro sm4e, vd, vn
> +       .inst 0xcec08400 | (.L\vn << 5) | .L\vd
> +.endm
> +
> +.macro sm4e_sve, zd, zn
> +       .inst 0x4523e000 | (.L\zn << 5) | .L\zd
> +.endm
> +
> +
> +/* Register macros */
> +
> +#define RCTR        z16
> +#define RCTRv       v16
> +#define RIV         z16
> +#define RIVv        v16
> +#define RSWAP128    z17
> +#define RZERO       z18
> +#define RLE128_INC  z19
> +
> +#define RTMP0       z20
> +#define RTMP0v      v20
> +#define RTMP1       z21
> +#define RTMP2       z22
> +#define RTMP3       z23
> +
> +
> +/* Helper macros. */
> +
> +#define SM4_PREPARE(ptr)                                       \
> +               adr_l           x7, .Lbswap128_mask;            \
> +               ptrue           p0.b, ALL;                      \
> +               rdvl            x5, #1;                         \
> +               ld1b            {RSWAP128.b}, p0/z, [x7];       \
> +                                                               \
> +               ld1             {v24.16b-v27.16b}, [ptr], #64;  \
> +               ld1             {v28.16b-v31.16b}, [ptr];       \
> +               dup             z24.q, z24.q[0];                \
> +               dup             z25.q, z25.q[0];                \
> +               dup             z26.q, z26.q[0];                \
> +               dup             z27.q, z27.q[0];                \
> +               dup             z28.q, z28.q[0];                \
> +               dup             z29.q, z29.q[0];                \
> +               dup             z30.q, z30.q[0];                \
> +               dup             z31.q, z31.q[0];
> +
> +#define SM4_SVE_CE_CRYPT_BLK(b0)                               \
> +               revb            b0.s, p0/m, b0.s;               \
> +               sm4e_sve        b0.s, z24.s;                    \
> +               sm4e_sve        b0.s, z25.s;                    \
> +               sm4e_sve        b0.s, z26.s;                    \
> +               sm4e_sve        b0.s, z27.s;                    \
> +               sm4e_sve        b0.s, z28.s;                    \
> +               sm4e_sve        b0.s, z29.s;                    \
> +               sm4e_sve        b0.s, z30.s;                    \
> +               sm4e_sve        b0.s, z31.s;                    \
> +               tbl             b0.b, {b0.b}, RSWAP128.b;       \
> +               revb            b0.s, p0/m, b0.s;
> +
> +#define SM4_SVE_CE_CRYPT_BLK4(b0, b1, b2, b3)                  \
> +               revb            b0.s, p0/m, b0.s;               \
> +               revb            b1.s, p0/m, b1.s;               \
> +               revb            b2.s, p0/m, b2.s;               \
> +               revb            b3.s, p0/m, b3.s;               \
> +               sm4e_sve        b0.s, z24.s;                    \
> +               sm4e_sve        b1.s, z24.s;                    \
> +               sm4e_sve        b2.s, z24.s;                    \
> +               sm4e_sve        b3.s, z24.s;                    \
> +               sm4e_sve        b0.s, z25.s;                    \
> +               sm4e_sve        b1.s, z25.s;                    \
> +               sm4e_sve        b2.s, z25.s;                    \
> +               sm4e_sve        b3.s, z25.s;                    \
> +               sm4e_sve        b0.s, z26.s;                    \
> +               sm4e_sve        b1.s, z26.s;                    \
> +               sm4e_sve        b2.s, z26.s;                    \
> +               sm4e_sve        b3.s, z26.s;                    \
> +               sm4e_sve        b0.s, z27.s;                    \
> +               sm4e_sve        b1.s, z27.s;                    \
> +               sm4e_sve        b2.s, z27.s;                    \
> +               sm4e_sve        b3.s, z27.s;                    \
> +               sm4e_sve        b0.s, z28.s;                    \
> +               sm4e_sve        b1.s, z28.s;                    \
> +               sm4e_sve        b2.s, z28.s;                    \
> +               sm4e_sve        b3.s, z28.s;                    \
> +               sm4e_sve        b0.s, z29.s;                    \
> +               sm4e_sve        b1.s, z29.s;                    \
> +               sm4e_sve        b2.s, z29.s;                    \
> +               sm4e_sve        b3.s, z29.s;                    \
> +               sm4e_sve        b0.s, z30.s;                    \
> +               sm4e_sve        b1.s, z30.s;                    \
> +               sm4e_sve        b2.s, z30.s;                    \
> +               sm4e_sve        b3.s, z30.s;                    \
> +               sm4e_sve        b0.s, z31.s;                    \
> +               sm4e_sve        b1.s, z31.s;                    \
> +               sm4e_sve        b2.s, z31.s;                    \
> +               sm4e_sve        b3.s, z31.s;                    \
> +               tbl             b0.b, {b0.b}, RSWAP128.b;       \
> +               tbl             b1.b, {b1.b}, RSWAP128.b;       \
> +               tbl             b2.b, {b2.b}, RSWAP128.b;       \
> +               tbl             b3.b, {b3.b}, RSWAP128.b;       \
> +               revb            b0.s, p0/m, b0.s;               \
> +               revb            b1.s, p0/m, b1.s;               \
> +               revb            b2.s, p0/m, b2.s;               \
> +               revb            b3.s, p0/m, b3.s;
> +
> +#define SM4_SVE_CE_CRYPT_BLK8(b0, b1, b2, b3, b4, b5, b6, b7)  \
> +               revb            b0.s, p0/m, b0.s;               \
> +               revb            b1.s, p0/m, b1.s;               \
> +               revb            b2.s, p0/m, b2.s;               \
> +               revb            b3.s, p0/m, b3.s;               \
> +               revb            b4.s, p0/m, b4.s;               \
> +               revb            b5.s, p0/m, b5.s;               \
> +               revb            b6.s, p0/m, b6.s;               \
> +               revb            b7.s, p0/m, b7.s;               \
> +               sm4e_sve        b0.s, z24.s;                    \
> +               sm4e_sve        b1.s, z24.s;                    \
> +               sm4e_sve        b2.s, z24.s;                    \
> +               sm4e_sve        b3.s, z24.s;                    \
> +               sm4e_sve        b4.s, z24.s;                    \
> +               sm4e_sve        b5.s, z24.s;                    \
> +               sm4e_sve        b6.s, z24.s;                    \
> +               sm4e_sve        b7.s, z24.s;                    \
> +               sm4e_sve        b0.s, z25.s;                    \
> +               sm4e_sve        b1.s, z25.s;                    \
> +               sm4e_sve        b2.s, z25.s;                    \
> +               sm4e_sve        b3.s, z25.s;                    \
> +               sm4e_sve        b4.s, z25.s;                    \
> +               sm4e_sve        b5.s, z25.s;                    \
> +               sm4e_sve        b6.s, z25.s;                    \
> +               sm4e_sve        b7.s, z25.s;                    \
> +               sm4e_sve        b0.s, z26.s;                    \
> +               sm4e_sve        b1.s, z26.s;                    \
> +               sm4e_sve        b2.s, z26.s;                    \
> +               sm4e_sve        b3.s, z26.s;                    \
> +               sm4e_sve        b4.s, z26.s;                    \
> +               sm4e_sve        b5.s, z26.s;                    \
> +               sm4e_sve        b6.s, z26.s;                    \
> +               sm4e_sve        b7.s, z26.s;                    \
> +               sm4e_sve        b0.s, z27.s;                    \
> +               sm4e_sve        b1.s, z27.s;                    \
> +               sm4e_sve        b2.s, z27.s;                    \
> +               sm4e_sve        b3.s, z27.s;                    \
> +               sm4e_sve        b4.s, z27.s;                    \
> +               sm4e_sve        b5.s, z27.s;                    \
> +               sm4e_sve        b6.s, z27.s;                    \
> +               sm4e_sve        b7.s, z27.s;                    \
> +               sm4e_sve        b0.s, z28.s;                    \
> +               sm4e_sve        b1.s, z28.s;                    \
> +               sm4e_sve        b2.s, z28.s;                    \
> +               sm4e_sve        b3.s, z28.s;                    \
> +               sm4e_sve        b4.s, z28.s;                    \
> +               sm4e_sve        b5.s, z28.s;                    \
> +               sm4e_sve        b6.s, z28.s;                    \
> +               sm4e_sve        b7.s, z28.s;                    \
> +               sm4e_sve        b0.s, z29.s;                    \
> +               sm4e_sve        b1.s, z29.s;                    \
> +               sm4e_sve        b2.s, z29.s;                    \
> +               sm4e_sve        b3.s, z29.s;                    \
> +               sm4e_sve        b4.s, z29.s;                    \
> +               sm4e_sve        b5.s, z29.s;                    \
> +               sm4e_sve        b6.s, z29.s;                    \
> +               sm4e_sve        b7.s, z29.s;                    \
> +               sm4e_sve        b0.s, z30.s;                    \
> +               sm4e_sve        b1.s, z30.s;                    \
> +               sm4e_sve        b2.s, z30.s;                    \
> +               sm4e_sve        b3.s, z30.s;                    \
> +               sm4e_sve        b4.s, z30.s;                    \
> +               sm4e_sve        b5.s, z30.s;                    \
> +               sm4e_sve        b6.s, z30.s;                    \
> +               sm4e_sve        b7.s, z30.s;                    \
> +               sm4e_sve        b0.s, z31.s;                    \
> +               sm4e_sve        b1.s, z31.s;                    \
> +               sm4e_sve        b2.s, z31.s;                    \
> +               sm4e_sve        b3.s, z31.s;                    \
> +               sm4e_sve        b4.s, z31.s;                    \
> +               sm4e_sve        b5.s, z31.s;                    \
> +               sm4e_sve        b6.s, z31.s;                    \
> +               sm4e_sve        b7.s, z31.s;                    \
> +               tbl             b0.b, {b0.b}, RSWAP128.b;       \
> +               tbl             b1.b, {b1.b}, RSWAP128.b;       \
> +               tbl             b2.b, {b2.b}, RSWAP128.b;       \
> +               tbl             b3.b, {b3.b}, RSWAP128.b;       \
> +               tbl             b4.b, {b4.b}, RSWAP128.b;       \
> +               tbl             b5.b, {b5.b}, RSWAP128.b;       \
> +               tbl             b6.b, {b6.b}, RSWAP128.b;       \
> +               tbl             b7.b, {b7.b}, RSWAP128.b;       \
> +               revb            b0.s, p0/m, b0.s;               \
> +               revb            b1.s, p0/m, b1.s;               \
> +               revb            b2.s, p0/m, b2.s;               \
> +               revb            b3.s, p0/m, b3.s;               \
> +               revb            b4.s, p0/m, b4.s;               \
> +               revb            b5.s, p0/m, b5.s;               \
> +               revb            b6.s, p0/m, b6.s;               \
> +               revb            b7.s, p0/m, b7.s;
> +
> +#define SM4_CE_CRYPT_BLK(b0)                                   \
> +               rev32           b0.16b, b0.16b;                 \
> +               sm4e            b0.4s, v24.4s;                  \
> +               sm4e            b0.4s, v25.4s;                  \
> +               sm4e            b0.4s, v26.4s;                  \
> +               sm4e            b0.4s, v27.4s;                  \
> +               sm4e            b0.4s, v28.4s;                  \
> +               sm4e            b0.4s, v29.4s;                  \
> +               sm4e            b0.4s, v30.4s;                  \
> +               sm4e            b0.4s, v31.4s;                  \
> +               rev64           b0.4s, b0.4s;                   \
> +               ext             b0.16b, b0.16b, b0.16b, #8;     \
> +               rev32           b0.16b, b0.16b;
> +
> +#define inc_le128(zctr)                                                \
> +               mov             RCTRv.d[1], x8;                 \
> +               mov             RCTRv.d[0], x7;                 \
> +               mov             zctr.d, RLE128_INC.d;           \
> +               dup             RCTR.q, RCTR.q[0];              \
> +               adds            x8, x8, x5, LSR #4;             \
> +               adclt           zctr.d, RCTR.d, RZERO.d;        \
> +               adclt           RCTR.d, zctr.d, RZERO.d;        \
> +               adc             x7, x7, xzr;                    \
> +               trn1            zctr.d, RCTR.d, zctr.d;         \
> +               revb            zctr.d, p0/m, zctr.d;
> +
> +#define inc_le128_4x(zctr0, zctr1, zctr2, zctr3)               \
> +               mov             v8.d[1], x8;                    \
> +               mov             v8.d[0], x7;                    \
> +               adds            x8, x8, x5, LSR #4;             \
> +               mov             zctr0.d, RLE128_INC.d;          \
> +               adc             x7, x7, xzr;                    \
> +               mov             v9.d[1], x8;                    \
> +               mov             v9.d[0], x7;                    \
> +               adds            x8, x8, x5, LSR #4;             \
> +               mov             zctr1.d, RLE128_INC.d;          \
> +               adc             x7, x7, xzr;                    \
> +               mov             v10.d[1], x8;                   \
> +               mov             v10.d[0], x7;                   \
> +               adds            x8, x8, x5, LSR #4;             \
> +               mov             zctr2.d, RLE128_INC.d;          \
> +               adc             x7, x7, xzr;                    \
> +               mov             v11.d[1], x8;                   \
> +               mov             v11.d[0], x7;                   \
> +               adds            x8, x8, x5, LSR #4;             \
> +               mov             zctr3.d, RLE128_INC.d;          \
> +               adc             x7, x7, xzr;                    \
> +               dup             z8.q, z8.q[0];                  \
> +               dup             z9.q, z9.q[0];                  \
> +               dup             z10.q, z10.q[0];                \
> +               dup             z11.q, z11.q[0];                \
> +               adclt           zctr0.d, z8.d, RZERO.d;         \
> +               adclt           zctr1.d, z9.d, RZERO.d;         \
> +               adclt           zctr2.d, z10.d, RZERO.d;        \
> +               adclt           zctr3.d, z11.d, RZERO.d;        \
> +               adclt           z8.d, zctr0.d, RZERO.d;         \
> +               adclt           z9.d, zctr1.d, RZERO.d;         \
> +               adclt           z10.d, zctr2.d, RZERO.d;        \
> +               adclt           z11.d, zctr3.d, RZERO.d;        \
> +               trn1            zctr0.d, z8.d, zctr0.d;         \
> +               trn1            zctr1.d, z9.d, zctr1.d;         \
> +               trn1            zctr2.d, z10.d, zctr2.d;        \
> +               trn1            zctr3.d, z11.d, zctr3.d;        \
> +               revb            zctr0.d, p0/m, zctr0.d;         \
> +               revb            zctr1.d, p0/m, zctr1.d;         \
> +               revb            zctr2.d, p0/m, zctr2.d;         \
> +               revb            zctr3.d, p0/m, zctr3.d;
> +
> +#define inc_le128_8x(zctr0, zctr1, zctr2, zctr3,               \
> +                    zctr4, zctr5, zctr6, zctr7)                \
> +               mov             v8.d[1], x8;                    \
> +               mov             v8.d[0], x7;                    \
> +               adds            x8, x8, x5, LSR #4;             \
> +               mov             zctr0.d, RLE128_INC.d;          \
> +               adc             x7, x7, xzr;                    \
> +               mov             v9.d[1], x8;                    \
> +               mov             v9.d[0], x7;                    \
> +               adds            x8, x8, x5, LSR #4;             \
> +               mov             zctr1.d, RLE128_INC.d;          \
> +               adc             x7, x7, xzr;                    \
> +               mov             v10.d[1], x8;                   \
> +               mov             v10.d[0], x7;                   \
> +               adds            x8, x8, x5, LSR #4;             \
> +               mov             zctr2.d, RLE128_INC.d;          \
> +               adc             x7, x7, xzr;                    \
> +               mov             v11.d[1], x8;                   \
> +               mov             v11.d[0], x7;                   \
> +               adds            x8, x8, x5, LSR #4;             \
> +               mov             zctr3.d, RLE128_INC.d;          \
> +               adc             x7, x7, xzr;                    \
> +               mov             v12.d[1], x8;                   \
> +               mov             v12.d[0], x7;                   \
> +               adds            x8, x8, x5, LSR #4;             \
> +               mov             zctr4.d, RLE128_INC.d;          \
> +               adc             x7, x7, xzr;                    \
> +               mov             v13.d[1], x8;                   \
> +               mov             v13.d[0], x7;                   \
> +               adds            x8, x8, x5, LSR #4;             \
> +               mov             zctr5.d, RLE128_INC.d;          \
> +               adc             x7, x7, xzr;                    \
> +               mov             v14.d[1], x8;                   \
> +               mov             v14.d[0], x7;                   \
> +               adds            x8, x8, x5, LSR #4;             \
> +               mov             zctr6.d, RLE128_INC.d;          \
> +               adc             x7, x7, xzr;                    \
> +               mov             v15.d[1], x8;                   \
> +               mov             v15.d[0], x7;                   \
> +               adds            x8, x8, x5, LSR #4;             \
> +               mov             zctr7.d, RLE128_INC.d;          \
> +               adc             x7, x7, xzr;                    \
> +               dup             z8.q, z8.q[0];                  \
> +               dup             z9.q, z9.q[0];                  \
> +               dup             z10.q, z10.q[0];                \
> +               dup             z11.q, z11.q[0];                \
> +               dup             z12.q, z12.q[0];                \
> +               dup             z13.q, z13.q[0];                \
> +               dup             z14.q, z14.q[0];                \
> +               dup             z15.q, z15.q[0];                \
> +               adclt           zctr0.d, z8.d, RZERO.d;         \
> +               adclt           zctr1.d, z9.d, RZERO.d;         \
> +               adclt           zctr2.d, z10.d, RZERO.d;        \
> +               adclt           zctr3.d, z11.d, RZERO.d;        \
> +               adclt           zctr4.d, z12.d, RZERO.d;        \
> +               adclt           zctr5.d, z13.d, RZERO.d;        \
> +               adclt           zctr6.d, z14.d, RZERO.d;        \
> +               adclt           zctr7.d, z15.d, RZERO.d;        \
> +               adclt           z8.d, zctr0.d, RZERO.d;         \
> +               adclt           z9.d, zctr1.d, RZERO.d;         \
> +               adclt           z10.d, zctr2.d, RZERO.d;        \
> +               adclt           z11.d, zctr3.d, RZERO.d;        \
> +               adclt           z12.d, zctr4.d, RZERO.d;        \
> +               adclt           z13.d, zctr5.d, RZERO.d;        \
> +               adclt           z14.d, zctr6.d, RZERO.d;        \
> +               adclt           z15.d, zctr7.d, RZERO.d;        \
> +               trn1            zctr0.d, z8.d, zctr0.d;         \
> +               trn1            zctr1.d, z9.d, zctr1.d;         \
> +               trn1            zctr2.d, z10.d, zctr2.d;        \
> +               trn1            zctr3.d, z11.d, zctr3.d;        \
> +               trn1            zctr4.d, z12.d, zctr4.d;        \
> +               trn1            zctr5.d, z13.d, zctr5.d;        \
> +               trn1            zctr6.d, z14.d, zctr6.d;        \
> +               trn1            zctr7.d, z15.d, zctr7.d;        \
> +               revb            zctr0.d, p0/m, zctr0.d;         \
> +               revb            zctr1.d, p0/m, zctr1.d;         \
> +               revb            zctr2.d, p0/m, zctr2.d;         \
> +               revb            zctr3.d, p0/m, zctr3.d;         \
> +               revb            zctr4.d, p0/m, zctr4.d;         \
> +               revb            zctr5.d, p0/m, zctr5.d;         \
> +               revb            zctr6.d, p0/m, zctr6.d;         \
> +               revb            zctr7.d, p0/m, zctr7.d;
> +
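
The scalar half of the inc_le128 macros above can be modelled in C. This is
a sketch under assumptions, not the patch's code: struct ctr128 and
ctr128_add() are hypothetical names, and the mapping of x7 to the high half
and x8 to the low half is inferred from the adds/adc order. The per-lane
offsets added in the vector registers (RLE128_INC, adclt) are not modelled.

```c
#include <stdint.h>

/* Hypothetical model: a 128-bit counter held as two 64-bit halves. */
struct ctr128 {
	uint64_t hi;	/* x7 in the macros (assumed high half) */
	uint64_t lo;	/* x8 in the macros (assumed low half) */
};

/*
 * Advance the counter by nblocks, mirroring
 *	adds	x8, x8, x5, LSR #4
 *	adc	x7, x7, xzr
 * where x5 LSR #4 is the number of 16-byte blocks in one vector.
 */
static void ctr128_add(struct ctr128 *ctr, uint64_t nblocks)
{
	uint64_t lo = ctr->lo + nblocks;

	ctr->hi += (lo < ctr->lo);	/* carry out of the low half */
	ctr->lo = lo;
}
```

For VL = 256 bits, one call with nblocks = 2 would correspond to the step
between consecutive vectors of counters.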
> +
> +.align 3
> +SYM_FUNC_START(sm4_sve_ce_crypt)
> +       /* input:
> +        *   x0: round key array, CTX
> +        *   x1: dst
> +        *   x2: src
> +        *   w3: nblocks
> +        */
> +       uxtw            x3, w3
> +       SM4_PREPARE(x0)
> +
> +.Lcrypt_loop_8x:
> +       sub             x3, x3, x5, LSR #1              /* x3 - (8 * VL) */
> +       tbnz            x3, #63, .Lcrypt_4x
> +
> +       ld1b            {z0.b}, p0/z, [x2]
> +       ld1b            {z1.b}, p0/z, [x2, #1, MUL VL]
> +       ld1b            {z2.b}, p0/z, [x2, #2, MUL VL]
> +       ld1b            {z3.b}, p0/z, [x2, #3, MUL VL]
> +       ld1b            {z4.b}, p0/z, [x2, #4, MUL VL]
> +       ld1b            {z5.b}, p0/z, [x2, #5, MUL VL]
> +       ld1b            {z6.b}, p0/z, [x2, #6, MUL VL]
> +       ld1b            {z7.b}, p0/z, [x2, #7, MUL VL]
> +
> +       SM4_SVE_CE_CRYPT_BLK8(z0, z1, z2, z3, z4, z5, z6, z7)
> +
> +       st1b            {z0.b}, p0, [x1]
> +       st1b            {z1.b}, p0, [x1, #1, MUL VL]
> +       st1b            {z2.b}, p0, [x1, #2, MUL VL]
> +       st1b            {z3.b}, p0, [x1, #3, MUL VL]
> +       st1b            {z4.b}, p0, [x1, #4, MUL VL]
> +       st1b            {z5.b}, p0, [x1, #5, MUL VL]
> +       st1b            {z6.b}, p0, [x1, #6, MUL VL]
> +       st1b            {z7.b}, p0, [x1, #7, MUL VL]
> +
> +       addvl           x2, x2, #8
> +       addvl           x1, x1, #8
> +
> +       cbz             x3, .Lcrypt_end
> +       b               .Lcrypt_loop_8x
> +
> +.Lcrypt_4x:
> +       add             x3, x3, x5, LSR #1
> +       cmp             x3, x5, LSR #2
> +       blt             .Lcrypt_loop_1x
> +
> +       sub             x3, x3, x5, LSR #2              /* x3 - (4 * VL) */
> +
> +       ld1b            {z0.b}, p0/z, [x2]
> +       ld1b            {z1.b}, p0/z, [x2, #1, MUL VL]
> +       ld1b            {z2.b}, p0/z, [x2, #2, MUL VL]
> +       ld1b            {z3.b}, p0/z, [x2, #3, MUL VL]
> +
> +       SM4_SVE_CE_CRYPT_BLK4(z0, z1, z2, z3)
> +
> +       st1b            {z0.b}, p0, [x1]
> +       st1b            {z1.b}, p0, [x1, #1, MUL VL]
> +       st1b            {z2.b}, p0, [x1, #2, MUL VL]
> +       st1b            {z3.b}, p0, [x1, #3, MUL VL]
> +
> +       addvl           x2, x2, #4
> +       addvl           x1, x1, #4
> +
> +       cbz             x3, .Lcrypt_end
> +
> +.Lcrypt_loop_1x:
> +       cmp             x3, x5, LSR #4
> +       blt             .Lcrypt_ce_loop_1x
> +
> +       sub             x3, x3, x5, LSR #4              /* x3 - VL */
> +
> +       ld1b            {z0.b}, p0/z, [x2]
> +
> +       SM4_SVE_CE_CRYPT_BLK(z0)
> +
> +       st1b            {z0.b}, p0, [x1]
> +
> +       addvl           x2, x2, #1
> +       addvl           x1, x1, #1
> +
> +       cbz             x3, .Lcrypt_end
> +       b               .Lcrypt_loop_1x
> +
> +.Lcrypt_ce_loop_1x:
> +       sub             x3, x3, #1
> +
> +       ld1             {v0.16b}, [x2], #16
> +       SM4_CE_CRYPT_BLK(v0)
> +       st1             {v0.16b}, [x1], #16
> +
> +       cbnz            x3, .Lcrypt_ce_loop_1x
> +
> +.Lcrypt_end:
> +       ret
> +SYM_FUNC_END(sm4_sve_ce_crypt)
> +
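
The loop structure of sm4_sve_ce_crypt above may be easier to follow as a
C model. This is a sketch, not the patch's code: the struct and function
names are hypothetical, and it assumes nblocks > 0 and a VL that is a
multiple of 128 bits. It only counts how many passes each path takes; x5
holds the VL in bytes (rdvl), so x5 LSR #4 is the number of 16-byte blocks
per vector, x5 LSR #2 covers four vectors and x5 LSR #1 covers eight.

```c
#include <stdint.h>

/* Hypothetical model of how sm4_sve_ce_crypt splits its input. */
struct sm4_sve_split {
	uint64_t iter_8x;	/* passes over 8 vectors */
	uint64_t iter_4x;	/* passes over 4 vectors (0 or 1) */
	uint64_t iter_1x;	/* passes over a single vector */
	uint64_t iter_ce;	/* single 16-byte blocks via NEON sm4e */
};

static struct sm4_sve_split sm4_sve_split_blocks(uint64_t nblocks,
						 unsigned int vl_bits)
{
	/* blocks per vector, i.e. x5 LSR #4 in the assembly */
	uint64_t per_vec = vl_bits / 128;
	struct sm4_sve_split s = { 0, 0, 0, 0 };

	/* .Lcrypt_loop_8x: repeat while 8 full vectors remain */
	s.iter_8x = nblocks / (8 * per_vec);
	nblocks %= 8 * per_vec;

	/* .Lcrypt_4x: taken at most once */
	if (nblocks >= 4 * per_vec) {
		s.iter_4x = 1;
		nblocks -= 4 * per_vec;
	}

	/* .Lcrypt_loop_1x: one vector at a time */
	s.iter_1x = nblocks / per_vec;

	/* .Lcrypt_ce_loop_1x: leftover blocks that do not fill a vector */
	s.iter_ce = nblocks % per_vec;
	return s;
}
```

With VL = 128 bits a vector holds exactly one block, so the NEON tail path
never has any work to do in this model.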
> +.align 3
> +SYM_FUNC_START(sm4_sve_ce_cbc_dec)
> +       /* input:
> +        *   x0: round key array, CTX
> +        *   x1: dst
> +        *   x2: src
> +        *   x3: iv (big endian, 128 bit)
> +        *   w4: nblocks
> +        */
> +       uxtw            x4, w4
> +       SM4_PREPARE(x0)
> +
> +       ld1             {RIVv.16b}, [x3]
> +       ext             RIV.b, RIV.b, RIV.b, #16
> +
> +.Lcbc_dec_loop_8x:
> +       sub             x4, x4, x5, LSR #1              /* x4 - (8 * VL) */
> +       tbnz            x4, #63, .Lcbc_dec_4x
> +
> +       ld1b            {z15.b}, p0/z, [x2]
> +       ld1b            {z14.b}, p0/z, [x2, #1, MUL VL]
> +       ld1b            {z13.b}, p0/z, [x2, #2, MUL VL]
> +       ld1b            {z12.b}, p0/z, [x2, #3, MUL VL]
> +       ld1b            {z11.b}, p0/z, [x2, #4, MUL VL]
> +       ld1b            {z10.b}, p0/z, [x2, #5, MUL VL]
> +       ld1b            {z9.b}, p0/z, [x2, #6, MUL VL]
> +       ld1b            {z8.b}, p0/z, [x2, #7, MUL VL]
> +       rev             z0.b, z15.b
> +       rev             z1.b, z14.b
> +       rev             z2.b, z13.b
> +       rev             z3.b, z12.b
> +       rev             z4.b, z11.b
> +       rev             z5.b, z10.b
> +       rev             z6.b, z9.b
> +       rev             z7.b, z8.b
> +       rev             RTMP0.b, RIV.b
> +       ext             z7.b, z7.b, z6.b, #16
> +       ext             z6.b, z6.b, z5.b, #16
> +       ext             z5.b, z5.b, z4.b, #16
> +       ext             z4.b, z4.b, z3.b, #16
> +       ext             z3.b, z3.b, z2.b, #16
> +       ext             z2.b, z2.b, z1.b, #16
> +       ext             z1.b, z1.b, z0.b, #16
> +       ext             z0.b, z0.b, RTMP0.b, #16
> +       rev             z7.b, z7.b
> +       rev             z6.b, z6.b
> +       rev             z5.b, z5.b
> +       rev             z4.b, z4.b
> +       rev             z3.b, z3.b
> +       rev             z2.b, z2.b
> +       rev             z1.b, z1.b
> +       rev             z0.b, z0.b
> +       mov             RIV.d, z8.d
> +
> +       SM4_SVE_CE_CRYPT_BLK8(z15, z14, z13, z12, z11, z10, z9, z8)
> +
> +       eor             z0.d, z0.d, z15.d
> +       eor             z1.d, z1.d, z14.d
> +       eor             z2.d, z2.d, z13.d
> +       eor             z3.d, z3.d, z12.d
> +       eor             z4.d, z4.d, z11.d
> +       eor             z5.d, z5.d, z10.d
> +       eor             z6.d, z6.d, z9.d
> +       eor             z7.d, z7.d, z8.d
> +       st1b            {z0.b}, p0, [x1]
> +       st1b            {z1.b}, p0, [x1, #1, MUL VL]
> +       st1b            {z2.b}, p0, [x1, #2, MUL VL]
> +       st1b            {z3.b}, p0, [x1, #3, MUL VL]
> +       st1b            {z4.b}, p0, [x1, #4, MUL VL]
> +       st1b            {z5.b}, p0, [x1, #5, MUL VL]
> +       st1b            {z6.b}, p0, [x1, #6, MUL VL]
> +       st1b            {z7.b}, p0, [x1, #7, MUL VL]
> +
> +       addvl           x2, x2, #8
> +       addvl           x1, x1, #8
> +
> +       cbz             x4, .Lcbc_dec_end
> +       b               .Lcbc_dec_loop_8x
> +
> +.Lcbc_dec_4x:
> +       add             x4, x4, x5, LSR #1
> +       cmp             x4, x5, LSR #2
> +       blt             .Lcbc_dec_loop_1x
> +
> +       sub             x4, x4, x5, LSR #2              /* x4 - (4 * VL) */
> +
> +       ld1b            {z15.b}, p0/z, [x2]
> +       ld1b            {z14.b}, p0/z, [x2, #1, MUL VL]
> +       ld1b            {z13.b}, p0/z, [x2, #2, MUL VL]
> +       ld1b            {z12.b}, p0/z, [x2, #3, MUL VL]
> +       rev             z0.b, z15.b
> +       rev             z1.b, z14.b
> +       rev             z2.b, z13.b
> +       rev             z3.b, z12.b
> +       rev             RTMP0.b, RIV.b
> +       ext             z3.b, z3.b, z2.b, #16
> +       ext             z2.b, z2.b, z1.b, #16
> +       ext             z1.b, z1.b, z0.b, #16
> +       ext             z0.b, z0.b, RTMP0.b, #16
> +       rev             z3.b, z3.b
> +       rev             z2.b, z2.b
> +       rev             z1.b, z1.b
> +       rev             z0.b, z0.b
> +       mov             RIV.d, z12.d
> +
> +       SM4_SVE_CE_CRYPT_BLK4(z15, z14, z13, z12)
> +
> +       eor             z0.d, z0.d, z15.d
> +       eor             z1.d, z1.d, z14.d
> +       eor             z2.d, z2.d, z13.d
> +       eor             z3.d, z3.d, z12.d
> +       st1b            {z0.b}, p0, [x1]
> +       st1b            {z1.b}, p0, [x1, #1, MUL VL]
> +       st1b            {z2.b}, p0, [x1, #2, MUL VL]
> +       st1b            {z3.b}, p0, [x1, #3, MUL VL]
> +
> +       addvl           x2, x2, #4
> +       addvl           x1, x1, #4
> +
> +       cbz             x4, .Lcbc_dec_end
> +
> +.Lcbc_dec_loop_1x:
> +       cmp             x4, x5, LSR #4
> +       blt             .Lcbc_dec_ce
> +
> +       sub             x4, x4, x5, LSR #4              /* x4 - VL */
> +
> +       ld1b            {z15.b}, p0/z, [x2]
> +       rev             RTMP0.b, RIV.b
> +       rev             z0.b, z15.b
> +       ext             z0.b, z0.b, RTMP0.b, #16
> +       rev             z0.b, z0.b
> +       mov             RIV.d, z15.d
> +
> +       SM4_SVE_CE_CRYPT_BLK(z15)
> +
> +       eor             z0.d, z0.d, z15.d
> +       st1b            {z0.b}, p0, [x1]
> +
> +       addvl           x2, x2, #1
> +       addvl           x1, x1, #1
> +
> +       cbz             x4, .Lcbc_dec_end
> +       b               .Lcbc_dec_loop_1x
> +
> +.Lcbc_dec_ce:
> +       rev             RIV.s, RIV.s
> +       tbl             RIV.b, {RIV.b}, RSWAP128.b
> +
> +.Lcbc_dec_ce_loop_1x:
> +       sub             x4, x4, #1
> +
> +       ld1             {v15.16b}, [x2], #16
> +       mov             v0.16b, RIVv.16b
> +       mov             RIVv.16b, v15.16b
> +       SM4_CE_CRYPT_BLK(v15)
> +       eor             v0.16b, v0.16b, v15.16b
> +       st1             {v0.16b}, [x1], #16
> +
> +       cbnz            x4, .Lcbc_dec_ce_loop_1x
> +
> +       ext             RIV.b, RIV.b, RIV.b, #16
> +
> +.Lcbc_dec_end:
> +       /* store new IV */
> +       rev             RIV.s, RIV.s
> +       tbl             RIV.b, {RIV.b}, RSWAP128.b
> +       st1             {RIVv.16b}, [x3]
> +
> +       ret
> +SYM_FUNC_END(sm4_sve_ce_cbc_dec)
> +
> +.align 3
> +SYM_FUNC_START(sm4_sve_ce_cfb_dec)
> +       /* input:
> +        *   x0: round key array, CTX
> +        *   x1: dst
> +        *   x2: src
> +        *   x3: iv (big endian, 128 bit)
> +        *   w4: nblocks
> +        */
> +       uxtw            x4, w4
> +       SM4_PREPARE(x0)
> +
> +       ld1             {RIVv.16b}, [x3]
> +       ext             RIV.b, RIV.b, RIV.b, #16
> +
> +.Lcfb_dec_loop_8x:
> +       sub             x4, x4, x5, LSR #1              /* x4 - (8 * VL) */
> +       tbnz            x4, #63, .Lcfb_dec_4x
> +
> +       ld1b            {z15.b}, p0/z, [x2]
> +       ld1b            {z14.b}, p0/z, [x2, #1, MUL VL]
> +       ld1b            {z13.b}, p0/z, [x2, #2, MUL VL]
> +       ld1b            {z12.b}, p0/z, [x2, #3, MUL VL]
> +       ld1b            {z11.b}, p0/z, [x2, #4, MUL VL]
> +       ld1b            {z10.b}, p0/z, [x2, #5, MUL VL]
> +       ld1b            {z9.b}, p0/z, [x2, #6, MUL VL]
> +       ld1b            {z8.b}, p0/z, [x2, #7, MUL VL]
> +       rev             z0.b, z15.b
> +       rev             z1.b, z14.b
> +       rev             z2.b, z13.b
> +       rev             z3.b, z12.b
> +       rev             z4.b, z11.b
> +       rev             z5.b, z10.b
> +       rev             z6.b, z9.b
> +       rev             z7.b, z8.b
> +       rev             RTMP0.b, RIV.b
> +       ext             z7.b, z7.b, z6.b, #16
> +       ext             z6.b, z6.b, z5.b, #16
> +       ext             z5.b, z5.b, z4.b, #16
> +       ext             z4.b, z4.b, z3.b, #16
> +       ext             z3.b, z3.b, z2.b, #16
> +       ext             z2.b, z2.b, z1.b, #16
> +       ext             z1.b, z1.b, z0.b, #16
> +       ext             z0.b, z0.b, RTMP0.b, #16
> +       rev             z7.b, z7.b
> +       rev             z6.b, z6.b
> +       rev             z5.b, z5.b
> +       rev             z4.b, z4.b
> +       rev             z3.b, z3.b
> +       rev             z2.b, z2.b
> +       rev             z1.b, z1.b
> +       rev             z0.b, z0.b
> +       mov             RIV.d, z8.d
> +
> +       SM4_SVE_CE_CRYPT_BLK8(z0, z1, z2, z3, z4, z5, z6, z7)
> +
> +       eor             z0.d, z0.d, z15.d
> +       eor             z1.d, z1.d, z14.d
> +       eor             z2.d, z2.d, z13.d
> +       eor             z3.d, z3.d, z12.d
> +       eor             z4.d, z4.d, z11.d
> +       eor             z5.d, z5.d, z10.d
> +       eor             z6.d, z6.d, z9.d
> +       eor             z7.d, z7.d, z8.d
> +       st1b            {z0.b}, p0, [x1]
> +       st1b            {z1.b}, p0, [x1, #1, MUL VL]
> +       st1b            {z2.b}, p0, [x1, #2, MUL VL]
> +       st1b            {z3.b}, p0, [x1, #3, MUL VL]
> +       st1b            {z4.b}, p0, [x1, #4, MUL VL]
> +       st1b            {z5.b}, p0, [x1, #5, MUL VL]
> +       st1b            {z6.b}, p0, [x1, #6, MUL VL]
> +       st1b            {z7.b}, p0, [x1, #7, MUL VL]
> +
> +       addvl           x2, x2, #8
> +       addvl           x1, x1, #8
> +
> +       cbz             x4, .Lcfb_dec_end
> +       b               .Lcfb_dec_loop_8x
> +
> +.Lcfb_dec_4x:
> +       add             x4, x4, x5, LSR #1
> +       cmp             x4, x5, LSR #2
> +       blt             .Lcfb_dec_loop_1x
> +
> +       sub             x4, x4, x5, LSR #2              /* x4 - (4 * VL) */
> +
> +       ld1b            {z15.b}, p0/z, [x2]
> +       ld1b            {z14.b}, p0/z, [x2, #1, MUL VL]
> +       ld1b            {z13.b}, p0/z, [x2, #2, MUL VL]
> +       ld1b            {z12.b}, p0/z, [x2, #3, MUL VL]
> +       rev             z0.b, z15.b
> +       rev             z1.b, z14.b
> +       rev             z2.b, z13.b
> +       rev             z3.b, z12.b
> +       rev             RTMP0.b, RIV.b
> +       ext             z3.b, z3.b, z2.b, #16
> +       ext             z2.b, z2.b, z1.b, #16
> +       ext             z1.b, z1.b, z0.b, #16
> +       ext             z0.b, z0.b, RTMP0.b, #16
> +       rev             z3.b, z3.b
> +       rev             z2.b, z2.b
> +       rev             z1.b, z1.b
> +       rev             z0.b, z0.b
> +       mov             RIV.d, z12.d
> +
> +       SM4_SVE_CE_CRYPT_BLK4(z0, z1, z2, z3)
> +
> +       eor             z0.d, z0.d, z15.d
> +       eor             z1.d, z1.d, z14.d
> +       eor             z2.d, z2.d, z13.d
> +       eor             z3.d, z3.d, z12.d
> +       st1b            {z0.b}, p0, [x1]
> +       st1b            {z1.b}, p0, [x1, #1, MUL VL]
> +       st1b            {z2.b}, p0, [x1, #2, MUL VL]
> +       st1b            {z3.b}, p0, [x1, #3, MUL VL]
> +
> +       addvl           x2, x2, #4
> +       addvl           x1, x1, #4
> +
> +       cbz             x4, .Lcfb_dec_end
> +
> +.Lcfb_dec_loop_1x:
> +       cmp             x4, x5, LSR #4
> +       blt             .Lcfb_dec_ce
> +
> +       sub             x4, x4, x5, LSR #4              /* x4 - VL */
> +
> +       ld1b            {z15.b}, p0/z, [x2]
> +       rev             RTMP0.b, RIV.b
> +       rev             z0.b, z15.b
> +       ext             z0.b, z0.b, RTMP0.b, #16
> +       rev             z0.b, z0.b
> +       mov             RIV.d, z15.d
> +
> +       SM4_SVE_CE_CRYPT_BLK(z0)
> +
> +       eor             z0.d, z0.d, z15.d
> +       st1b            {z0.b}, p0, [x1]
> +
> +       addvl           x2, x2, #1
> +       addvl           x1, x1, #1
> +
> +       cbz             x4, .Lcfb_dec_end
> +       b               .Lcfb_dec_loop_1x
> +
> +.Lcfb_dec_ce:
> +       rev             RIV.s, RIV.s
> +       tbl             RIV.b, {RIV.b}, RSWAP128.b
> +
> +.Lcfb_dec_ce_loop_1x:
> +       sub             x4, x4, #1
> +
> +       ld1             {v15.16b}, [x2], #16
> +       mov             v0.16b, RIVv.16b
> +       mov             RIVv.16b, v15.16b
> +       SM4_CE_CRYPT_BLK(v0)
> +       eor             v0.16b, v0.16b, v15.16b
> +       st1             {v0.16b}, [x1], #16
> +
> +       cbnz            x4, .Lcfb_dec_ce_loop_1x
> +
> +       ext             RIV.b, RIV.b, RIV.b, #16
> +
> +.Lcfb_dec_end:
> +       /* store new IV */
> +       rev             RIV.s, RIV.s
> +       tbl             RIV.b, {RIV.b}, RSWAP128.b
> +       st1             {RIVv.16b}, [x3]
> +
> +       ret
> +SYM_FUNC_END(sm4_sve_ce_cfb_dec)
> +
> +.align 3
> +SYM_FUNC_START(sm4_sve_ce_ctr_crypt)
> +       /* input:
> +        *   x0: round key array, CTX
> +        *   x1: dst
> +        *   x2: src
> +        *   x3: ctr (big endian, 128 bit)
> +        *   w4: nblocks
> +        */
> +       uxtw            x4, w4
> +       SM4_PREPARE(x0)
> +
> +       dup             RZERO.d, #0
> +       adr_l           x6, .Lle128_inc
> +       ld1b            {RLE128_INC.b}, p0/z, [x6]
> +
> +       ldp             x7, x8, [x3]
> +       rev             x7, x7
> +       rev             x8, x8
> +
> +.Lctr_loop_8x:
> +       sub             x4, x4, x5, LSR #1              /* x4 - (8 * VL) */
> +       tbnz            x4, #63, .Lctr_4x
> +
> +       inc_le128_8x(z0, z1, z2, z3, z4, z5, z6, z7)
> +
> +       ld1b            {z8.b}, p0/z, [x2]
> +       ld1b            {z9.b}, p0/z, [x2, #1, MUL VL]
> +       ld1b            {z10.b}, p0/z, [x2, #2, MUL VL]
> +       ld1b            {z11.b}, p0/z, [x2, #3, MUL VL]
> +       ld1b            {z12.b}, p0/z, [x2, #4, MUL VL]
> +       ld1b            {z13.b}, p0/z, [x2, #5, MUL VL]
> +       ld1b            {z14.b}, p0/z, [x2, #6, MUL VL]
> +       ld1b            {z15.b}, p0/z, [x2, #7, MUL VL]
> +
> +       SM4_SVE_CE_CRYPT_BLK8(z0, z1, z2, z3, z4, z5, z6, z7)
> +
> +       eor             z0.d, z0.d, z8.d
> +       eor             z1.d, z1.d, z9.d
> +       eor             z2.d, z2.d, z10.d
> +       eor             z3.d, z3.d, z11.d
> +       eor             z4.d, z4.d, z12.d
> +       eor             z5.d, z5.d, z13.d
> +       eor             z6.d, z6.d, z14.d
> +       eor             z7.d, z7.d, z15.d
> +
> +       st1b            {z0.b}, p0, [x1]
> +       st1b            {z1.b}, p0, [x1, #1, MUL VL]
> +       st1b            {z2.b}, p0, [x1, #2, MUL VL]
> +       st1b            {z3.b}, p0, [x1, #3, MUL VL]
> +       st1b            {z4.b}, p0, [x1, #4, MUL VL]
> +       st1b            {z5.b}, p0, [x1, #5, MUL VL]
> +       st1b            {z6.b}, p0, [x1, #6, MUL VL]
> +       st1b            {z7.b}, p0, [x1, #7, MUL VL]
> +
> +       addvl           x2, x2, #8
> +       addvl           x1, x1, #8
> +
> +       cbz             x4, .Lctr_end
> +       b               .Lctr_loop_8x
> +
> +.Lctr_4x:
> +       add             x4, x4, x5, LSR #1
> +       cmp             x4, x5, LSR #2
> +       blt             .Lctr_loop_1x
> +
> +       sub             x4, x4, x5, LSR #2              /* x4 - (4 * VL) */
> +
> +       inc_le128_4x(z0, z1, z2, z3)
> +
> +       ld1b            {z8.b}, p0/z, [x2]
> +       ld1b            {z9.b}, p0/z, [x2, #1, MUL VL]
> +       ld1b            {z10.b}, p0/z, [x2, #2, MUL VL]
> +       ld1b            {z11.b}, p0/z, [x2, #3, MUL VL]
> +
> +       SM4_SVE_CE_CRYPT_BLK4(z0, z1, z2, z3)
> +
> +       eor             z0.d, z0.d, z8.d
> +       eor             z1.d, z1.d, z9.d
> +       eor             z2.d, z2.d, z10.d
> +       eor             z3.d, z3.d, z11.d
> +
> +       st1b            {z0.b}, p0, [x1]
> +       st1b            {z1.b}, p0, [x1, #1, MUL VL]
> +       st1b            {z2.b}, p0, [x1, #2, MUL VL]
> +       st1b            {z3.b}, p0, [x1, #3, MUL VL]
> +
> +       addvl           x2, x2, #4
> +       addvl           x1, x1, #4
> +
> +       cbz             x4, .Lctr_end
> +
> +.Lctr_loop_1x:
> +       cmp             x4, x5, LSR #4
> +       blt             .Lctr_ce_loop_1x
> +
> +       sub             x4, x4, x5, LSR #4              /* x4 - VL */
> +
> +       inc_le128(z0)
> +       ld1b            {z8.b}, p0/z, [x2]
> +
> +       SM4_SVE_CE_CRYPT_BLK(z0)
> +
> +       eor             z0.d, z0.d, z8.d
> +       st1b            {z0.b}, p0, [x1]
> +
> +       addvl           x2, x2, #1
> +       addvl           x1, x1, #1
> +
> +       cbz             x4, .Lctr_end
> +       b               .Lctr_loop_1x
> +
> +.Lctr_ce_loop_1x:
> +       sub             x4, x4, #1
> +
> +       /* inc_le128 for CE */
> +       mov             v0.d[1], x8
> +       mov             v0.d[0], x7
> +       adds            x8, x8, #1
> +       rev64           v0.16b, v0.16b
> +       adc             x7, x7, xzr
> +
> +       ld1             {v8.16b}, [x2], #16
> +
> +       SM4_CE_CRYPT_BLK(v0)
> +
> +       eor             v0.16b, v0.16b, v8.16b
> +       st1             {v0.16b}, [x1], #16
> +
> +       cbnz            x4, .Lctr_ce_loop_1x
> +
> +.Lctr_end:
> +       /* store new CTR */
> +       rev             x7, x7
> +       rev             x8, x8
> +       stp             x7, x8, [x3]
> +
> +       ret
> +SYM_FUNC_END(sm4_sve_ce_ctr_crypt)
> +
> +.align 3
> +SYM_FUNC_START(sm4_sve_get_vl)
> +       /* VL in bytes */
> +       rdvl            x0, #1
> +
> +       ret
> +SYM_FUNC_END(sm4_sve_get_vl)
> +
> +
> +       .section        ".rodata", "a"
> +       .align 4
> +.Lbswap128_mask:
> +       .byte           0x0c, 0x0d, 0x0e, 0x0f, 0x08, 0x09, 0x0a, 0x0b
> +       .byte           0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03
> +       .byte           0x1c, 0x1d, 0x1e, 0x1f, 0x18, 0x19, 0x1a, 0x1b
> +       .byte           0x14, 0x15, 0x16, 0x17, 0x10, 0x11, 0x12, 0x13
> +       .byte           0x2c, 0x2d, 0x2e, 0x2f, 0x28, 0x29, 0x2a, 0x2b
> +       .byte           0x24, 0x25, 0x26, 0x27, 0x20, 0x21, 0x22, 0x23
> +       .byte           0x3c, 0x3d, 0x3e, 0x3f, 0x38, 0x39, 0x3a, 0x3b
> +       .byte           0x34, 0x35, 0x36, 0x37, 0x30, 0x31, 0x32, 0x33
> +       .byte           0x4c, 0x4d, 0x4e, 0x4f, 0x48, 0x49, 0x4a, 0x4b
> +       .byte           0x44, 0x45, 0x46, 0x47, 0x40, 0x41, 0x42, 0x43
> +       .byte           0x5c, 0x5d, 0x5e, 0x5f, 0x58, 0x59, 0x5a, 0x5b
> +       .byte           0x54, 0x55, 0x56, 0x57, 0x50, 0x51, 0x52, 0x53
> +       .byte           0x6c, 0x6d, 0x6e, 0x6f, 0x68, 0x69, 0x6a, 0x6b
> +       .byte           0x64, 0x65, 0x66, 0x67, 0x60, 0x61, 0x62, 0x63
> +       .byte           0x7c, 0x7d, 0x7e, 0x7f, 0x78, 0x79, 0x7a, 0x7b
> +       .byte           0x74, 0x75, 0x76, 0x77, 0x70, 0x71, 0x72, 0x73
> +       .byte           0x8c, 0x8d, 0x8e, 0x8f, 0x88, 0x89, 0x8a, 0x8b
> +       .byte           0x84, 0x85, 0x86, 0x87, 0x80, 0x81, 0x82, 0x83
> +       .byte           0x9c, 0x9d, 0x9e, 0x9f, 0x98, 0x99, 0x9a, 0x9b
> +       .byte           0x94, 0x95, 0x96, 0x97, 0x90, 0x91, 0x92, 0x93
> +       .byte           0xac, 0xad, 0xae, 0xaf, 0xa8, 0xa9, 0xaa, 0xab
> +       .byte           0xa4, 0xa5, 0xa6, 0xa7, 0xa0, 0xa1, 0xa2, 0xa3
> +       .byte           0xbc, 0xbd, 0xbe, 0xbf, 0xb8, 0xb9, 0xba, 0xbb
> +       .byte           0xb4, 0xb5, 0xb6, 0xb7, 0xb0, 0xb1, 0xb2, 0xb3
> +       .byte           0xcc, 0xcd, 0xce, 0xcf, 0xc8, 0xc9, 0xca, 0xcb
> +       .byte           0xc4, 0xc5, 0xc6, 0xc7, 0xc0, 0xc1, 0xc2, 0xc3
> +       .byte           0xdc, 0xdd, 0xde, 0xdf, 0xd8, 0xd9, 0xda, 0xdb
> +       .byte           0xd4, 0xd5, 0xd6, 0xd7, 0xd0, 0xd1, 0xd2, 0xd3
> +       .byte           0xec, 0xed, 0xee, 0xef, 0xe8, 0xe9, 0xea, 0xeb
> +       .byte           0xe4, 0xe5, 0xe6, 0xe7, 0xe0, 0xe1, 0xe2, 0xe3
> +       .byte           0xfc, 0xfd, 0xfe, 0xff, 0xf8, 0xf9, 0xfa, 0xfb
> +       .byte           0xf4, 0xf5, 0xf6, 0xf7, 0xf0, 0xf1, 0xf2, 0xf3
> +
> +.Lle128_inc:
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x05, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x06, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x07, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x09, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x0a, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x0b, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x0c, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x0d, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x0e, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x0f, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> diff --git a/arch/arm64/crypto/sm4-sve-ce-glue.c b/arch/arm64/crypto/sm4-sve-ce-glue.c
> new file mode 100644
> index 000000000000..fc797b72b5f0
> --- /dev/null
> +++ b/arch/arm64/crypto/sm4-sve-ce-glue.c
> @@ -0,0 +1,332 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * SM4 Cipher Algorithm, using ARMv9 Crypto Extensions with SVE2
> + * as specified in
> + * https://tools.ietf.org/id/draft-ribose-cfrg-sm4-10.html
> + *
> + * Copyright (C) 2022, Alibaba Group.
> + * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
> + */
> +
> +#include <linux/module.h>
> +#include <linux/crypto.h>
> +#include <linux/kernel.h>
> +#include <linux/cpufeature.h>
> +#include <asm/neon.h>
> +#include <asm/simd.h>
> +#include <crypto/internal/simd.h>
> +#include <crypto/internal/skcipher.h>
> +#include <crypto/sm4.h>
> +#include "sm4-ce.h"
> +
> +asmlinkage void sm4_sve_ce_crypt(const u32 *rkey, u8 *dst,
> +                                const u8 *src, unsigned int nblocks);
> +asmlinkage void sm4_sve_ce_cbc_dec(const u32 *rkey_dec, u8 *dst,
> +                                  const u8 *src, u8 *iv,
> +                                  unsigned int nblocks);
> +asmlinkage void sm4_sve_ce_cfb_dec(const u32 *rkey_enc, u8 *dst,
> +                                  const u8 *src, u8 *iv,
> +                                  unsigned int nblocks);
> +asmlinkage void sm4_sve_ce_ctr_crypt(const u32 *rkey_enc, u8 *dst,
> +                                    const u8 *src, u8 *iv,
> +                                    unsigned int nblocks);
> +asmlinkage unsigned int sm4_sve_get_vl(void);
> +
> +
> +static int sm4_setkey(struct crypto_skcipher *tfm, const u8 *key,
> +                     unsigned int key_len)
> +{
> +       struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
> +
> +       if (key_len != SM4_KEY_SIZE)
> +               return -EINVAL;
> +
> +       kernel_neon_begin();
> +       sm4_ce_expand_key(key, ctx->rkey_enc, ctx->rkey_dec,
> +                         crypto_sm4_fk, crypto_sm4_ck);
> +       kernel_neon_end();
> +
> +       return 0;
> +}
> +
> +static int ecb_crypt(struct skcipher_request *req, const u32 *rkey)
> +{
> +       struct skcipher_walk walk;
> +       unsigned int nbytes;
> +       int err;
> +
> +       err = skcipher_walk_virt(&walk, req, false);
> +
> +       while ((nbytes = walk.nbytes) > 0) {
> +               const u8 *src = walk.src.virt.addr;
> +               u8 *dst = walk.dst.virt.addr;
> +               unsigned int nblocks;
> +
> +               nblocks = nbytes / SM4_BLOCK_SIZE;
> +               if (nblocks) {
> +                       kernel_neon_begin();
> +
> +                       sm4_sve_ce_crypt(rkey, dst, src, nblocks);
> +
> +                       kernel_neon_end();
> +               }
> +
> +               err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE);
> +       }
> +
> +       return err;
> +}
> +
> +static int ecb_encrypt(struct skcipher_request *req)
> +{
> +       struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> +       struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
> +
> +       return ecb_crypt(req, ctx->rkey_enc);
> +}
> +
> +static int ecb_decrypt(struct skcipher_request *req)
> +{
> +       struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> +       struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
> +
> +       return ecb_crypt(req, ctx->rkey_dec);
> +}
> +
> +static int cbc_crypt(struct skcipher_request *req, const u32 *rkey,
> +                    void (*sm4_cbc_crypt)(const u32 *rkey, u8 *dst,
> +                               const u8 *src, u8 *iv, unsigned int nblocks))
> +{
> +       struct skcipher_walk walk;
> +       unsigned int nbytes;
> +       int err;
> +
> +       err = skcipher_walk_virt(&walk, req, false);
> +
> +       while ((nbytes = walk.nbytes) > 0) {
> +               const u8 *src = walk.src.virt.addr;
> +               u8 *dst = walk.dst.virt.addr;
> +               unsigned int nblocks;
> +
> +               nblocks = nbytes / SM4_BLOCK_SIZE;
> +               if (nblocks) {
> +                       kernel_neon_begin();
> +
> +                       sm4_cbc_crypt(rkey, dst, src, walk.iv, nblocks);
> +
> +                       kernel_neon_end();
> +               }
> +
> +               err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE);
> +       }
> +
> +       return err;
> +}
> +
> +static int cbc_encrypt(struct skcipher_request *req)
> +{
> +       struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> +       struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
> +
> +       return cbc_crypt(req, ctx->rkey_enc, sm4_ce_cbc_enc);
> +}
> +
> +static int cbc_decrypt(struct skcipher_request *req)
> +{
> +       struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> +       struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
> +
> +       return cbc_crypt(req, ctx->rkey_dec, sm4_sve_ce_cbc_dec);
> +}
> +
> +static int cfb_crypt(struct skcipher_request *req,
> +                    void (*sm4_cfb_crypt)(const u32 *rkey, u8 *dst,
> +                               const u8 *src, u8 *iv, unsigned int nblocks))
> +{
> +       struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> +       struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
> +       struct skcipher_walk walk;
> +       unsigned int nbytes;
> +       int err;
> +
> +       err = skcipher_walk_virt(&walk, req, false);
> +
> +       while ((nbytes = walk.nbytes) > 0) {
> +               const u8 *src = walk.src.virt.addr;
> +               u8 *dst = walk.dst.virt.addr;
> +               unsigned int nblocks;
> +
> +               nblocks = nbytes / SM4_BLOCK_SIZE;
> +               if (nblocks) {
> +                       kernel_neon_begin();
> +
> +                       sm4_cfb_crypt(ctx->rkey_enc, dst, src,
> +                                     walk.iv, nblocks);
> +
> +                       kernel_neon_end();
> +
> +                       dst += nblocks * SM4_BLOCK_SIZE;
> +                       src += nblocks * SM4_BLOCK_SIZE;
> +                       nbytes -= nblocks * SM4_BLOCK_SIZE;
> +               }
> +
> +               /* tail */
> +               if (walk.nbytes == walk.total && nbytes > 0) {
> +                       u8 keystream[SM4_BLOCK_SIZE];
> +
> +                       sm4_ce_crypt_block(ctx->rkey_enc, keystream, walk.iv);
> +                       crypto_xor_cpy(dst, src, keystream, nbytes);
> +                       nbytes = 0;
> +               }
> +
> +               err = skcipher_walk_done(&walk, nbytes);
> +       }
> +
> +       return err;
> +}
> +
> +static int cfb_encrypt(struct skcipher_request *req)
> +{
> +       return cfb_crypt(req, sm4_ce_cfb_enc);
> +}
> +
> +static int cfb_decrypt(struct skcipher_request *req)
> +{
> +       return cfb_crypt(req, sm4_sve_ce_cfb_dec);
> +}
> +
> +static int ctr_crypt(struct skcipher_request *req)
> +{
> +       struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> +       struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
> +       struct skcipher_walk walk;
> +       unsigned int nbytes;
> +       int err;
> +
> +       err = skcipher_walk_virt(&walk, req, false);
> +
> +       while ((nbytes = walk.nbytes) > 0) {
> +               const u8 *src = walk.src.virt.addr;
> +               u8 *dst = walk.dst.virt.addr;
> +               unsigned int nblocks;
> +
> +               nblocks = nbytes / SM4_BLOCK_SIZE;
> +               if (nblocks) {
> +                       kernel_neon_begin();
> +
> +                       sm4_sve_ce_ctr_crypt(ctx->rkey_enc, dst, src,
> +                                            walk.iv, nblocks);
> +
> +                       kernel_neon_end();
> +
> +                       dst += nblocks * SM4_BLOCK_SIZE;
> +                       src += nblocks * SM4_BLOCK_SIZE;
> +                       nbytes -= nblocks * SM4_BLOCK_SIZE;
> +               }
> +
> +               /* tail */
> +               if (walk.nbytes == walk.total && nbytes > 0) {
> +                       u8 keystream[SM4_BLOCK_SIZE];
> +
> +                       sm4_ce_crypt_block(ctx->rkey_enc, keystream, walk.iv);
> +                       crypto_inc(walk.iv, SM4_BLOCK_SIZE);
> +                       crypto_xor_cpy(dst, src, keystream, nbytes);
> +                       nbytes = 0;
> +               }
> +
> +               err = skcipher_walk_done(&walk, nbytes);
> +       }
> +
> +       return err;
> +}
> +
> +static struct skcipher_alg sm4_algs[] = {
> +       {
> +               .base = {
> +                       .cra_name               = "ecb(sm4)",
> +                       .cra_driver_name        = "ecb-sm4-sve-ce",
> +                       .cra_priority           = 500,
> +                       .cra_blocksize          = SM4_BLOCK_SIZE,
> +                       .cra_ctxsize            = sizeof(struct sm4_ctx),
> +                       .cra_module             = THIS_MODULE,
> +               },
> +               .min_keysize    = SM4_KEY_SIZE,
> +               .max_keysize    = SM4_KEY_SIZE,
> +               .setkey         = sm4_setkey,
> +               .encrypt        = ecb_encrypt,
> +               .decrypt        = ecb_decrypt,
> +       }, {
> +               .base = {
> +                       .cra_name               = "cbc(sm4)",
> +                       .cra_driver_name        = "cbc-sm4-sve-ce",
> +                       .cra_priority           = 500,
> +                       .cra_blocksize          = SM4_BLOCK_SIZE,
> +                       .cra_ctxsize            = sizeof(struct sm4_ctx),
> +                       .cra_module             = THIS_MODULE,
> +               },
> +               .min_keysize    = SM4_KEY_SIZE,
> +               .max_keysize    = SM4_KEY_SIZE,
> +               .ivsize         = SM4_BLOCK_SIZE,
> +               .setkey         = sm4_setkey,
> +               .encrypt        = cbc_encrypt,
> +               .decrypt        = cbc_decrypt,
> +       }, {
> +               .base = {
> +                       .cra_name               = "cfb(sm4)",
> +                       .cra_driver_name        = "cfb-sm4-sve-ce",
> +                       .cra_priority           = 500,
> +                       .cra_blocksize          = 1,
> +                       .cra_ctxsize            = sizeof(struct sm4_ctx),
> +                       .cra_module             = THIS_MODULE,
> +               },
> +               .min_keysize    = SM4_KEY_SIZE,
> +               .max_keysize    = SM4_KEY_SIZE,
> +               .ivsize         = SM4_BLOCK_SIZE,
> +               .chunksize      = SM4_BLOCK_SIZE,
> +               .setkey         = sm4_setkey,
> +               .encrypt        = cfb_encrypt,
> +               .decrypt        = cfb_decrypt,
> +       }, {
> +               .base = {
> +                       .cra_name               = "ctr(sm4)",
> +                       .cra_driver_name        = "ctr-sm4-sve-ce",
> +                       .cra_priority           = 500,
> +                       .cra_blocksize          = 1,
> +                       .cra_ctxsize            = sizeof(struct sm4_ctx),
> +                       .cra_module             = THIS_MODULE,
> +               },
> +               .min_keysize    = SM4_KEY_SIZE,
> +               .max_keysize    = SM4_KEY_SIZE,
> +               .ivsize         = SM4_BLOCK_SIZE,
> +               .chunksize      = SM4_BLOCK_SIZE,
> +               .setkey         = sm4_setkey,
> +               .encrypt        = ctr_crypt,
> +               .decrypt        = ctr_crypt,
> +       }
> +};
> +
> +static int __init sm4_sve_ce_init(void)
> +{
> +       if (sm4_sve_get_vl() <= 16)
> +               return -ENODEV;
> +
> +       return crypto_register_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs));
> +}
> +
> +static void __exit sm4_sve_ce_exit(void)
> +{
> +       crypto_unregister_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs));
> +}
> +
> +module_cpu_feature_match(SVESM4, sm4_sve_ce_init);
> +module_exit(sm4_sve_ce_exit);
> +
> +MODULE_DESCRIPTION("SM4 ECB/CBC/CFB/CTR using ARMv9 Crypto Extensions with SVE2");
> +MODULE_ALIAS_CRYPTO("sm4-sve-ce");
> +MODULE_ALIAS_CRYPTO("sm4");
> +MODULE_ALIAS_CRYPTO("ecb(sm4)");
> +MODULE_ALIAS_CRYPTO("cbc(sm4)");
> +MODULE_ALIAS_CRYPTO("cfb(sm4)");
> +MODULE_ALIAS_CRYPTO("ctr(sm4)");
> +MODULE_AUTHOR("Tianjia Zhang <tianjia.zhang@linux.alibaba.com>");
> +MODULE_LICENSE("GPL v2");
> --
> 2.24.3 (Apple Git-128)
>

* Re: [PATCH 16/16] crypto: arm64/sm4 - add ARMv9 SVE cryptography acceleration implementation
@ 2022-09-26 10:02     ` Ard Biesheuvel
From: Ard Biesheuvel @ 2022-09-26 10:02 UTC (permalink / raw)
  To: Tianjia Zhang, Mark Brown
  Cc: Herbert Xu, David S. Miller, Jussi Kivilinna, Catalin Marinas,
	Will Deacon, Maxime Coquelin, Alexandre Torgue, Eric Biggers,
	linux-crypto, linux-arm-kernel, linux-kernel, linux-stm32

(cc Mark Brown)

Hello Tianjia,

On Mon, 26 Sept 2022 at 11:37, Tianjia Zhang
<tianjia.zhang@linux.alibaba.com> wrote:
>
> Scalable Vector Extension (SVE) is the next-generation SIMD extension for
> arm64. SVE allows flexible vector length implementations with a range of
> possible values in CPU implementations. The vector length can vary from a
> minimum of 128 bits up to a maximum of 2048 bits, at 128-bit increments.
> The SVE design guarantees that the same application can run on different
> implementations that support SVE, without the need to recompile the code.
>
> SVE was originally introduced in ARMv8, and ARMv9 introduced SVE2 to
> expand and improve it. Just as NEON has the optional Crypto Extensions,
> SVE has comparable instructions, called cryptography acceleration
> instructions, and these too are an optional part of the instruction set.
>
> This patch uses SM4 cryptography acceleration instructions and SVE2
> instructions to optimize the SM4 algorithm for ECB/CBC/CFB/CTR modes.
> Since CBC/CFB encryption cannot be parallelized, those paths use the
> Crypto Extension instructions instead.
>
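
As a rough illustration of the vector-length range described in the quoted
commit message above, the following C sketch (editorial, not part of the
patch) computes how many 128-bit SM4 blocks fit in one SVE vector. The
helper names sve_vl_is_valid() and sm4_blocks_per_vector() are
hypothetical; the only assumption is the one stated above, that VL is a
multiple of 128 bits between 128 and 2048.

```c
#include <assert.h>
#include <stdbool.h>

#define SM4_BLOCK_BITS   128u   /* SM4 operates on 128-bit blocks */
#define SVE_VL_MIN_BITS  128u
#define SVE_VL_MAX_BITS 2048u

/* Hypothetical helper: true if vl_bits lies in the range described above. */
static bool sve_vl_is_valid(unsigned int vl_bits)
{
	return vl_bits >= SVE_VL_MIN_BITS && vl_bits <= SVE_VL_MAX_BITS &&
	       vl_bits % 128u == 0;
}

/* Hypothetical helper: SM4 blocks held by one vector, or 0 if VL is invalid. */
static unsigned int sm4_blocks_per_vector(unsigned int vl_bits)
{
	if (!sve_vl_is_valid(vl_bits))
		return 0;
	return vl_bits / SM4_BLOCK_BITS;
}

/* Small usage check covering both ends of the range. */
static void sm4_vl_selfcheck(void)
{
	assert(sm4_blocks_per_vector(SVE_VL_MIN_BITS) == 1);
	assert(sm4_blocks_per_vector(SVE_VL_MAX_BITS) == 16);
}
```

With VL = 128 bits a vector holds a single block, the same amount of data
as a NEON register; VL = 256 bits holds two. Whether a given core actually
processes those blocks in parallel is a micro-architectural question that
this arithmetic says nothing about.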

Given that we currently do not support the use of SVE in kernel mode,
this patch cannot be accepted at this time (but the rest of the series
looks reasonable to me, although I have only skimmed over the patches).

In view of the disappointing benchmark results below, I don't think
this is worth the hassle at the moment. If we can find a case where
using SVE in kernel mode truly makes a [favorable] difference, we can
revisit this, but not without a thorough analysis of the impact that
supporting SVE in the kernel will have. Also, the fact that SVE may
also cover cryptographic extensions does not necessarily imply that a
micro-architecture will perform those crypto transformations in
parallel and so the performance may be the same even if VL > 128.
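
To put the figures in perspective, the C sketch below (editorial, using
only numbers from the 4096-byte column of the benchmark tables quoted
further down) computes the relative throughput change going from sm4-ce to
sm4-sve-ce at VL = 128 bits. The struct and the helper rel_change_pct()
are hypothetical names introduced for illustration.

```c
#include <assert.h>

/* Throughput pair in Mb/s, copied from the 4096-byte column quoted below. */
struct sm4_bench {
	const char *mode;
	double ce;      /* sm4-ce */
	double sve_ce;  /* sm4-sve-ce, VL = 128 bits */
};

static const struct sm4_bench bench_4096[] = {
	{ "ECB enc", 4001.93, 4100.26 },
	{ "CBC dec", 3790.99, 3408.92 },
	{ "CFB dec", 3793.25, 3403.83 },
	{ "CTR enc", 3041.14, 2695.15 },
};

/* Hypothetical helper: percentage change going from CE to SVE-CE. */
static double rel_change_pct(const struct sm4_bench *b)
{
	assert(b->ce > 0.0);
	return (b->sve_ce - b->ce) / b->ce * 100.0;
}
```

On these inputs ECB encryption comes out roughly 2.5% faster, while CBC
decryption, CFB decryption and CTR encryption are each roughly 10% to 11%
slower.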

In summary, please drop this patch for now, and once there are more
encouraging performance numbers, please resubmit it as part of a
series that explicitly enables SVE in kernel mode on arm64, and
documents the requirements and constraints.

I have cc'ed Mark, who has been working on the SVE support and might
have something to add here as well.

Thanks,
Ard.



> Since no test environment with a Vector Length (VL) greater than 128 bits
> was available, the performance data was obtained on a machine with a VL
> of 128 bits. Because this driver is only enabled when the VL is greater
> than 128 bits, these figures are for reference only. The data shows
> little difference between the Crypto Extension and SVE (VL=128 bits)
> implementations, and the optimization effect should be more obvious when
> VL=256 bits or longer.
>
> Benchmark on T-Head Yitian-710 2.75 GHz. The data comes from mode 218 of
> tcrypt and is compared with the Crypto Extension implementation. The
> columns are block lengths in bytes. The data is tabulated and the unit
> is Mb/s:
>
> sm4-ce      |      16       64      128      256     1024     1420     4096
> ------------+--------------------------------------------------------------
>     ECB enc |  315.18  1162.65  1815.66  2553.50  3692.91  3727.20  4001.93
>     ECB dec |  316.06  1172.97  1817.81  2554.66  3692.18  3786.54  4001.93
>     CBC enc |  304.82   629.54   768.65   864.72   953.90   963.32   974.06
>     CBC dec |  306.05  1142.53  1805.11  2481.67  3522.06  3587.87  3790.99
>     CFB enc |  309.48   635.70   774.44   865.85   950.62   952.68   968.24
>     CFB dec |  315.98  1170.38  1828.75  2509.72  3543.63  3539.40  3793.25
>     CTR enc |  285.83  1036.59  1583.50  2147.26  2933.54  2954.66  3041.14
>     CTR dec |  285.29  1037.47  1584.67  2145.51  2934.10  2950.89  3041.62
>
> sm4-sve-ce (VL = 128 bits)
>     ECB enc |  310.00  1154.70  1813.26  2579.74  3766.90  3869.45  4100.26
>     ECB dec |  315.60  1176.22  1838.06  2593.69  3774.95  3878.42  4098.83
>     CBC enc |  303.44   622.65   764.67   861.40   953.18   963.05   973.77
>     CBC dec |  302.13  1091.15  1689.10  2267.79  3182.84  3242.68  3408.92
>     CFB enc |  296.62   620.41   762.94   858.96   948.18   956.04   967.67
>     CFB dec |  291.23  1065.50  1637.33  2228.12  3158.52  3213.35  3403.83
>     CTR enc |  272.27   959.35  1466.34  1934.24  2562.80  2595.87  2695.15
>     CTR dec |  273.40   963.65  1471.83  1938.97  2563.12  2597.25  2694.54
>
> Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
> ---
>  arch/arm64/crypto/Kconfig           |   19 +
>  arch/arm64/crypto/Makefile          |    3 +
>  arch/arm64/crypto/sm4-sve-ce-core.S | 1028 +++++++++++++++++++++++++++
>  arch/arm64/crypto/sm4-sve-ce-glue.c |  332 +++++++++
>  4 files changed, 1382 insertions(+)
>  create mode 100644 arch/arm64/crypto/sm4-sve-ce-core.S
>  create mode 100644 arch/arm64/crypto/sm4-sve-ce-glue.c
>
> diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
> index 6793d5bc3ee5..bbb5a7a08af5 100644
> --- a/arch/arm64/crypto/Kconfig
> +++ b/arch/arm64/crypto/Kconfig
> @@ -249,6 +249,25 @@ config CRYPTO_SM4_ARM64_CE_BLK
>           - ARMv8 Crypto Extensions
>           - NEON (Advanced SIMD) extensions
>
> +config CRYPTO_SM4_ARM64_SVE_CE_BLK
> +       tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR (ARMv9 cryptography acceleration with SVE2)"
> +       depends on KERNEL_MODE_NEON
> +       select CRYPTO_SKCIPHER
> +       select CRYPTO_SM4
> +       select CRYPTO_SM4_ARM64_CE_BLK
> +       help
> +         Length-preserving ciphers: SM4 cipher algorithms (OSCCA GB/T 32907-2016)
> +         with block cipher modes:
> +         - ECB (Electronic Codebook) mode (NIST SP800-38A)
> +         - CBC (Cipher Block Chaining) mode (NIST SP800-38A)
> +         - CFB (Cipher Feedback) mode (NIST SP800-38A)
> +         - CTR (Counter) mode (NIST SP800-38A)
> +
> +         Architecture: arm64 using:
> +         - ARMv8 Crypto Extensions
> +         - ARMv9 cryptography acceleration with SVE2
> +         - NEON (Advanced SIMD) extensions
> +
>  config CRYPTO_SM4_ARM64_NEON_BLK
>         tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR (NEON)"
>         depends on KERNEL_MODE_NEON
> diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
> index 4818e204c2ac..355dd9053434 100644
> --- a/arch/arm64/crypto/Makefile
> +++ b/arch/arm64/crypto/Makefile
> @@ -38,6 +38,9 @@ sm4-ce-gcm-y := sm4-ce-gcm-glue.o sm4-ce-gcm-core.o
>  obj-$(CONFIG_CRYPTO_SM4_ARM64_NEON_BLK) += sm4-neon.o
>  sm4-neon-y := sm4-neon-glue.o sm4-neon-core.o
>
> +obj-$(CONFIG_CRYPTO_SM4_ARM64_SVE_CE_BLK) += sm4-sve-ce.o
> +sm4-sve-ce-y := sm4-sve-ce-glue.o sm4-sve-ce-core.o
> +
>  obj-$(CONFIG_CRYPTO_GHASH_ARM64_CE) += ghash-ce.o
>  ghash-ce-y := ghash-ce-glue.o ghash-ce-core.o
>
> diff --git a/arch/arm64/crypto/sm4-sve-ce-core.S b/arch/arm64/crypto/sm4-sve-ce-core.S
> new file mode 100644
> index 000000000000..caecbdf2536c
> --- /dev/null
> +++ b/arch/arm64/crypto/sm4-sve-ce-core.S
> @@ -0,0 +1,1028 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * SM4 Cipher Algorithm for ARMv9 Crypto Extensions with SVE2
> + * as specified in
> + * https://tools.ietf.org/id/draft-ribose-cfrg-sm4-10.html
> + *
> + * Copyright (C) 2022, Alibaba Group.
> + * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
> + */
> +
> +#include <linux/linkage.h>
> +#include <asm/assembler.h>
> +
> +.arch  armv8-a+crypto+sve+sve2
> +
> +.irp b, 0, 15, 24, 25, 26, 27, 28, 29, 30, 31
> +       .set .Lv\b\().4s, \b
> +.endr
> +
> +.irp b, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, \
> +               16, 24, 25, 26, 27, 28, 29, 30, 31
> +       .set .Lz\b\().s, \b
> +.endr
> +
> +.macro sm4e, vd, vn
> +       .inst 0xcec08400 | (.L\vn << 5) | .L\vd
> +.endm
> +
> +.macro sm4e_sve, zd, zn
> +       .inst 0x4523e000 | (.L\zn << 5) | .L\zd
> +.endm
> +
> +
> +/* Register macros */
> +
> +#define RCTR        z16
> +#define RCTRv       v16
> +#define RIV         z16
> +#define RIVv        v16
> +#define RSWAP128    z17
> +#define RZERO       z18
> +#define RLE128_INC  z19
> +
> +#define RTMP0       z20
> +#define RTMP0v      v20
> +#define RTMP1       z21
> +#define RTMP2       z22
> +#define RTMP3       z23
> +
> +
> +/* Helper macros. */
> +
> +#define SM4_PREPARE(ptr)                                       \
> +               adr_l           x7, .Lbswap128_mask;            \
> +               ptrue           p0.b, ALL;                      \
> +               rdvl            x5, #1;                         \
> +               ld1b            {RSWAP128.b}, p0/z, [x7];       \
> +                                                               \
> +               ld1             {v24.16b-v27.16b}, [ptr], #64;  \
> +               ld1             {v28.16b-v31.16b}, [ptr];       \
> +               dup             z24.q, z24.q[0];                \
> +               dup             z25.q, z25.q[0];                \
> +               dup             z26.q, z26.q[0];                \
> +               dup             z27.q, z27.q[0];                \
> +               dup             z28.q, z28.q[0];                \
> +               dup             z29.q, z29.q[0];                \
> +               dup             z30.q, z30.q[0];                \
> +               dup             z31.q, z31.q[0];
> +
> +#define SM4_SVE_CE_CRYPT_BLK(b0)                               \
> +               revb            b0.s, p0/m, b0.s;               \
> +               sm4e_sve        b0.s, z24.s;                    \
> +               sm4e_sve        b0.s, z25.s;                    \
> +               sm4e_sve        b0.s, z26.s;                    \
> +               sm4e_sve        b0.s, z27.s;                    \
> +               sm4e_sve        b0.s, z28.s;                    \
> +               sm4e_sve        b0.s, z29.s;                    \
> +               sm4e_sve        b0.s, z30.s;                    \
> +               sm4e_sve        b0.s, z31.s;                    \
> +               tbl             b0.b, {b0.b}, RSWAP128.b;       \
> +               revb            b0.s, p0/m, b0.s;
> +
> +#define SM4_SVE_CE_CRYPT_BLK4(b0, b1, b2, b3)                  \
> +               revb            b0.s, p0/m, b0.s;               \
> +               revb            b1.s, p0/m, b1.s;               \
> +               revb            b2.s, p0/m, b2.s;               \
> +               revb            b3.s, p0/m, b3.s;               \
> +               sm4e_sve        b0.s, z24.s;                    \
> +               sm4e_sve        b1.s, z24.s;                    \
> +               sm4e_sve        b2.s, z24.s;                    \
> +               sm4e_sve        b3.s, z24.s;                    \
> +               sm4e_sve        b0.s, z25.s;                    \
> +               sm4e_sve        b1.s, z25.s;                    \
> +               sm4e_sve        b2.s, z25.s;                    \
> +               sm4e_sve        b3.s, z25.s;                    \
> +               sm4e_sve        b0.s, z26.s;                    \
> +               sm4e_sve        b1.s, z26.s;                    \
> +               sm4e_sve        b2.s, z26.s;                    \
> +               sm4e_sve        b3.s, z26.s;                    \
> +               sm4e_sve        b0.s, z27.s;                    \
> +               sm4e_sve        b1.s, z27.s;                    \
> +               sm4e_sve        b2.s, z27.s;                    \
> +               sm4e_sve        b3.s, z27.s;                    \
> +               sm4e_sve        b0.s, z28.s;                    \
> +               sm4e_sve        b1.s, z28.s;                    \
> +               sm4e_sve        b2.s, z28.s;                    \
> +               sm4e_sve        b3.s, z28.s;                    \
> +               sm4e_sve        b0.s, z29.s;                    \
> +               sm4e_sve        b1.s, z29.s;                    \
> +               sm4e_sve        b2.s, z29.s;                    \
> +               sm4e_sve        b3.s, z29.s;                    \
> +               sm4e_sve        b0.s, z30.s;                    \
> +               sm4e_sve        b1.s, z30.s;                    \
> +               sm4e_sve        b2.s, z30.s;                    \
> +               sm4e_sve        b3.s, z30.s;                    \
> +               sm4e_sve        b0.s, z31.s;                    \
> +               sm4e_sve        b1.s, z31.s;                    \
> +               sm4e_sve        b2.s, z31.s;                    \
> +               sm4e_sve        b3.s, z31.s;                    \
> +               tbl             b0.b, {b0.b}, RSWAP128.b;       \
> +               tbl             b1.b, {b1.b}, RSWAP128.b;       \
> +               tbl             b2.b, {b2.b}, RSWAP128.b;       \
> +               tbl             b3.b, {b3.b}, RSWAP128.b;       \
> +               revb            b0.s, p0/m, b0.s;               \
> +               revb            b1.s, p0/m, b1.s;               \
> +               revb            b2.s, p0/m, b2.s;               \
> +               revb            b3.s, p0/m, b3.s;
> +
> +#define SM4_SVE_CE_CRYPT_BLK8(b0, b1, b2, b3, b4, b5, b6, b7)  \
> +               revb            b0.s, p0/m, b0.s;               \
> +               revb            b1.s, p0/m, b1.s;               \
> +               revb            b2.s, p0/m, b2.s;               \
> +               revb            b3.s, p0/m, b3.s;               \
> +               revb            b4.s, p0/m, b4.s;               \
> +               revb            b5.s, p0/m, b5.s;               \
> +               revb            b6.s, p0/m, b6.s;               \
> +               revb            b7.s, p0/m, b7.s;               \
> +               sm4e_sve        b0.s, z24.s;                    \
> +               sm4e_sve        b1.s, z24.s;                    \
> +               sm4e_sve        b2.s, z24.s;                    \
> +               sm4e_sve        b3.s, z24.s;                    \
> +               sm4e_sve        b4.s, z24.s;                    \
> +               sm4e_sve        b5.s, z24.s;                    \
> +               sm4e_sve        b6.s, z24.s;                    \
> +               sm4e_sve        b7.s, z24.s;                    \
> +               sm4e_sve        b0.s, z25.s;                    \
> +               sm4e_sve        b1.s, z25.s;                    \
> +               sm4e_sve        b2.s, z25.s;                    \
> +               sm4e_sve        b3.s, z25.s;                    \
> +               sm4e_sve        b4.s, z25.s;                    \
> +               sm4e_sve        b5.s, z25.s;                    \
> +               sm4e_sve        b6.s, z25.s;                    \
> +               sm4e_sve        b7.s, z25.s;                    \
> +               sm4e_sve        b0.s, z26.s;                    \
> +               sm4e_sve        b1.s, z26.s;                    \
> +               sm4e_sve        b2.s, z26.s;                    \
> +               sm4e_sve        b3.s, z26.s;                    \
> +               sm4e_sve        b4.s, z26.s;                    \
> +               sm4e_sve        b5.s, z26.s;                    \
> +               sm4e_sve        b6.s, z26.s;                    \
> +               sm4e_sve        b7.s, z26.s;                    \
> +               sm4e_sve        b0.s, z27.s;                    \
> +               sm4e_sve        b1.s, z27.s;                    \
> +               sm4e_sve        b2.s, z27.s;                    \
> +               sm4e_sve        b3.s, z27.s;                    \
> +               sm4e_sve        b4.s, z27.s;                    \
> +               sm4e_sve        b5.s, z27.s;                    \
> +               sm4e_sve        b6.s, z27.s;                    \
> +               sm4e_sve        b7.s, z27.s;                    \
> +               sm4e_sve        b0.s, z28.s;                    \
> +               sm4e_sve        b1.s, z28.s;                    \
> +               sm4e_sve        b2.s, z28.s;                    \
> +               sm4e_sve        b3.s, z28.s;                    \
> +               sm4e_sve        b4.s, z28.s;                    \
> +               sm4e_sve        b5.s, z28.s;                    \
> +               sm4e_sve        b6.s, z28.s;                    \
> +               sm4e_sve        b7.s, z28.s;                    \
> +               sm4e_sve        b0.s, z29.s;                    \
> +               sm4e_sve        b1.s, z29.s;                    \
> +               sm4e_sve        b2.s, z29.s;                    \
> +               sm4e_sve        b3.s, z29.s;                    \
> +               sm4e_sve        b4.s, z29.s;                    \
> +               sm4e_sve        b5.s, z29.s;                    \
> +               sm4e_sve        b6.s, z29.s;                    \
> +               sm4e_sve        b7.s, z29.s;                    \
> +               sm4e_sve        b0.s, z30.s;                    \
> +               sm4e_sve        b1.s, z30.s;                    \
> +               sm4e_sve        b2.s, z30.s;                    \
> +               sm4e_sve        b3.s, z30.s;                    \
> +               sm4e_sve        b4.s, z30.s;                    \
> +               sm4e_sve        b5.s, z30.s;                    \
> +               sm4e_sve        b6.s, z30.s;                    \
> +               sm4e_sve        b7.s, z30.s;                    \
> +               sm4e_sve        b0.s, z31.s;                    \
> +               sm4e_sve        b1.s, z31.s;                    \
> +               sm4e_sve        b2.s, z31.s;                    \
> +               sm4e_sve        b3.s, z31.s;                    \
> +               sm4e_sve        b4.s, z31.s;                    \
> +               sm4e_sve        b5.s, z31.s;                    \
> +               sm4e_sve        b6.s, z31.s;                    \
> +               sm4e_sve        b7.s, z31.s;                    \
> +               tbl             b0.b, {b0.b}, RSWAP128.b;       \
> +               tbl             b1.b, {b1.b}, RSWAP128.b;       \
> +               tbl             b2.b, {b2.b}, RSWAP128.b;       \
> +               tbl             b3.b, {b3.b}, RSWAP128.b;       \
> +               tbl             b4.b, {b4.b}, RSWAP128.b;       \
> +               tbl             b5.b, {b5.b}, RSWAP128.b;       \
> +               tbl             b6.b, {b6.b}, RSWAP128.b;       \
> +               tbl             b7.b, {b7.b}, RSWAP128.b;       \
> +               revb            b0.s, p0/m, b0.s;               \
> +               revb            b1.s, p0/m, b1.s;               \
> +               revb            b2.s, p0/m, b2.s;               \
> +               revb            b3.s, p0/m, b3.s;               \
> +               revb            b4.s, p0/m, b4.s;               \
> +               revb            b5.s, p0/m, b5.s;               \
> +               revb            b6.s, p0/m, b6.s;               \
> +               revb            b7.s, p0/m, b7.s;
> +
> +#define SM4_CE_CRYPT_BLK(b0)                                   \
> +               rev32           b0.16b, b0.16b;                 \
> +               sm4e            b0.4s, v24.4s;                  \
> +               sm4e            b0.4s, v25.4s;                  \
> +               sm4e            b0.4s, v26.4s;                  \
> +               sm4e            b0.4s, v27.4s;                  \
> +               sm4e            b0.4s, v28.4s;                  \
> +               sm4e            b0.4s, v29.4s;                  \
> +               sm4e            b0.4s, v30.4s;                  \
> +               sm4e            b0.4s, v31.4s;                  \
> +               rev64           b0.4s, b0.4s;                   \
> +               ext             b0.16b, b0.16b, b0.16b, #8;     \
> +               rev32           b0.16b, b0.16b;
> +
> +#define inc_le128(zctr)                                                \
> +               mov             RCTRv.d[1], x8;                 \
> +               mov             RCTRv.d[0], x7;                 \
> +               mov             zctr.d, RLE128_INC.d;           \
> +               dup             RCTR.q, RCTR.q[0];              \
> +               adds            x8, x8, x5, LSR #4;             \
> +               adclt           zctr.d, RCTR.d, RZERO.d;        \
> +               adclt           RCTR.d, zctr.d, RZERO.d;        \
> +               adc             x7, x7, xzr;                    \
> +               trn1            zctr.d, RCTR.d, zctr.d;         \
> +               revb            zctr.d, p0/m, zctr.d;
> +
> +#define inc_le128_4x(zctr0, zctr1, zctr2, zctr3)               \
> +               mov             v8.d[1], x8;                    \
> +               mov             v8.d[0], x7;                    \
> +               adds            x8, x8, x5, LSR #4;             \
> +               mov             zctr0.d, RLE128_INC.d;          \
> +               adc             x7, x7, xzr;                    \
> +               mov             v9.d[1], x8;                    \
> +               mov             v9.d[0], x7;                    \
> +               adds            x8, x8, x5, LSR #4;             \
> +               mov             zctr1.d, RLE128_INC.d;          \
> +               adc             x7, x7, xzr;                    \
> +               mov             v10.d[1], x8;                   \
> +               mov             v10.d[0], x7;                   \
> +               adds            x8, x8, x5, LSR #4;             \
> +               mov             zctr2.d, RLE128_INC.d;          \
> +               adc             x7, x7, xzr;                    \
> +               mov             v11.d[1], x8;                   \
> +               mov             v11.d[0], x7;                   \
> +               adds            x8, x8, x5, LSR #4;             \
> +               mov             zctr3.d, RLE128_INC.d;          \
> +               adc             x7, x7, xzr;                    \
> +               dup             z8.q, z8.q[0];                  \
> +               dup             z9.q, z9.q[0];                  \
> +               dup             z10.q, z10.q[0];                \
> +               dup             z11.q, z11.q[0];                \
> +               adclt           zctr0.d, z8.d, RZERO.d;         \
> +               adclt           zctr1.d, z9.d, RZERO.d;         \
> +               adclt           zctr2.d, z10.d, RZERO.d;        \
> +               adclt           zctr3.d, z11.d, RZERO.d;        \
> +               adclt           z8.d, zctr0.d, RZERO.d;         \
> +               adclt           z9.d, zctr1.d, RZERO.d;         \
> +               adclt           z10.d, zctr2.d, RZERO.d;        \
> +               adclt           z11.d, zctr3.d, RZERO.d;        \
> +               trn1            zctr0.d, z8.d, zctr0.d;         \
> +               trn1            zctr1.d, z9.d, zctr1.d;         \
> +               trn1            zctr2.d, z10.d, zctr2.d;        \
> +               trn1            zctr3.d, z11.d, zctr3.d;        \
> +               revb            zctr0.d, p0/m, zctr0.d;         \
> +               revb            zctr1.d, p0/m, zctr1.d;         \
> +               revb            zctr2.d, p0/m, zctr2.d;         \
> +               revb            zctr3.d, p0/m, zctr3.d;
> +
> +#define inc_le128_8x(zctr0, zctr1, zctr2, zctr3,               \
> +                    zctr4, zctr5, zctr6, zctr7)                \
> +               mov             v8.d[1], x8;                    \
> +               mov             v8.d[0], x7;                    \
> +               adds            x8, x8, x5, LSR #4;             \
> +               mov             zctr0.d, RLE128_INC.d;          \
> +               adc             x7, x7, xzr;                    \
> +               mov             v9.d[1], x8;                    \
> +               mov             v9.d[0], x7;                    \
> +               adds            x8, x8, x5, LSR #4;             \
> +               mov             zctr1.d, RLE128_INC.d;          \
> +               adc             x7, x7, xzr;                    \
> +               mov             v10.d[1], x8;                   \
> +               mov             v10.d[0], x7;                   \
> +               adds            x8, x8, x5, LSR #4;             \
> +               mov             zctr2.d, RLE128_INC.d;          \
> +               adc             x7, x7, xzr;                    \
> +               mov             v11.d[1], x8;                   \
> +               mov             v11.d[0], x7;                   \
> +               adds            x8, x8, x5, LSR #4;             \
> +               mov             zctr3.d, RLE128_INC.d;          \
> +               adc             x7, x7, xzr;                    \
> +               mov             v12.d[1], x8;                   \
> +               mov             v12.d[0], x7;                   \
> +               adds            x8, x8, x5, LSR #4;             \
> +               mov             zctr4.d, RLE128_INC.d;          \
> +               adc             x7, x7, xzr;                    \
> +               mov             v13.d[1], x8;                   \
> +               mov             v13.d[0], x7;                   \
> +               adds            x8, x8, x5, LSR #4;             \
> +               mov             zctr5.d, RLE128_INC.d;          \
> +               adc             x7, x7, xzr;                    \
> +               mov             v14.d[1], x8;                   \
> +               mov             v14.d[0], x7;                   \
> +               adds            x8, x8, x5, LSR #4;             \
> +               mov             zctr6.d, RLE128_INC.d;          \
> +               adc             x7, x7, xzr;                    \
> +               mov             v15.d[1], x8;                   \
> +               mov             v15.d[0], x7;                   \
> +               adds            x8, x8, x5, LSR #4;             \
> +               mov             zctr7.d, RLE128_INC.d;          \
> +               adc             x7, x7, xzr;                    \
> +               dup             z8.q, z8.q[0];                  \
> +               dup             z9.q, z9.q[0];                  \
> +               dup             z10.q, z10.q[0];                \
> +               dup             z11.q, z11.q[0];                \
> +               dup             z12.q, z12.q[0];                \
> +               dup             z13.q, z13.q[0];                \
> +               dup             z14.q, z14.q[0];                \
> +               dup             z15.q, z15.q[0];                \
> +               adclt           zctr0.d, z8.d, RZERO.d;         \
> +               adclt           zctr1.d, z9.d, RZERO.d;         \
> +               adclt           zctr2.d, z10.d, RZERO.d;        \
> +               adclt           zctr3.d, z11.d, RZERO.d;        \
> +               adclt           zctr4.d, z12.d, RZERO.d;        \
> +               adclt           zctr5.d, z13.d, RZERO.d;        \
> +               adclt           zctr6.d, z14.d, RZERO.d;        \
> +               adclt           zctr7.d, z15.d, RZERO.d;        \
> +               adclt           z8.d, zctr0.d, RZERO.d;         \
> +               adclt           z9.d, zctr1.d, RZERO.d;         \
> +               adclt           z10.d, zctr2.d, RZERO.d;        \
> +               adclt           z11.d, zctr3.d, RZERO.d;        \
> +               adclt           z12.d, zctr4.d, RZERO.d;        \
> +               adclt           z13.d, zctr5.d, RZERO.d;        \
> +               adclt           z14.d, zctr6.d, RZERO.d;        \
> +               adclt           z15.d, zctr7.d, RZERO.d;        \
> +               trn1            zctr0.d, z8.d, zctr0.d;         \
> +               trn1            zctr1.d, z9.d, zctr1.d;         \
> +               trn1            zctr2.d, z10.d, zctr2.d;        \
> +               trn1            zctr3.d, z11.d, zctr3.d;        \
> +               trn1            zctr4.d, z12.d, zctr4.d;        \
> +               trn1            zctr5.d, z13.d, zctr5.d;        \
> +               trn1            zctr6.d, z14.d, zctr6.d;        \
> +               trn1            zctr7.d, z15.d, zctr7.d;        \
> +               revb            zctr0.d, p0/m, zctr0.d;         \
> +               revb            zctr1.d, p0/m, zctr1.d;         \
> +               revb            zctr2.d, p0/m, zctr2.d;         \
> +               revb            zctr3.d, p0/m, zctr3.d;         \
> +               revb            zctr4.d, p0/m, zctr4.d;         \
> +               revb            zctr5.d, p0/m, zctr5.d;         \
> +               revb            zctr6.d, p0/m, zctr6.d;         \
> +               revb            zctr7.d, p0/m, zctr7.d;
> +
> +
> +.align 3
> +SYM_FUNC_START(sm4_sve_ce_crypt)
> +       /* input:
> +        *   x0: round key array, CTX
> +        *   x1: dst
> +        *   x2: src
> +        *   w3: nblocks
> +        */
> +       uxtw            x3, w3
> +       SM4_PREPARE(x0)
> +
> +.Lcrypt_loop_8x:
> +       sub             x3, x3, x5, LSR #1              /* x3 - (8 * VL) */
> +       tbnz            x3, #63, .Lcrypt_4x
> +
> +       ld1b            {z0.b}, p0/z, [x2]
> +       ld1b            {z1.b}, p0/z, [x2, #1, MUL VL]
> +       ld1b            {z2.b}, p0/z, [x2, #2, MUL VL]
> +       ld1b            {z3.b}, p0/z, [x2, #3, MUL VL]
> +       ld1b            {z4.b}, p0/z, [x2, #4, MUL VL]
> +       ld1b            {z5.b}, p0/z, [x2, #5, MUL VL]
> +       ld1b            {z6.b}, p0/z, [x2, #6, MUL VL]
> +       ld1b            {z7.b}, p0/z, [x2, #7, MUL VL]
> +
> +       SM4_SVE_CE_CRYPT_BLK8(z0, z1, z2, z3, z4, z5, z6, z7)
> +
> +       st1b            {z0.b}, p0, [x1]
> +       st1b            {z1.b}, p0, [x1, #1, MUL VL]
> +       st1b            {z2.b}, p0, [x1, #2, MUL VL]
> +       st1b            {z3.b}, p0, [x1, #3, MUL VL]
> +       st1b            {z4.b}, p0, [x1, #4, MUL VL]
> +       st1b            {z5.b}, p0, [x1, #5, MUL VL]
> +       st1b            {z6.b}, p0, [x1, #6, MUL VL]
> +       st1b            {z7.b}, p0, [x1, #7, MUL VL]
> +
> +       addvl           x2, x2, #8
> +       addvl           x1, x1, #8
> +
> +       cbz             x3, .Lcrypt_end
> +       b               .Lcrypt_loop_8x
> +
> +.Lcrypt_4x:
> +       add             x3, x3, x5, LSR #1
> +       cmp             x3, x5, LSR #2
> +       blt             .Lcrypt_loop_1x
> +
> +       sub             x3, x3, x5, LSR #2              /* x3 - (4 * VL) */
> +
> +       ld1b            {z0.b}, p0/z, [x2]
> +       ld1b            {z1.b}, p0/z, [x2, #1, MUL VL]
> +       ld1b            {z2.b}, p0/z, [x2, #2, MUL VL]
> +       ld1b            {z3.b}, p0/z, [x2, #3, MUL VL]
> +
> +       SM4_SVE_CE_CRYPT_BLK4(z0, z1, z2, z3)
> +
> +       st1b            {z0.b}, p0, [x1]
> +       st1b            {z1.b}, p0, [x1, #1, MUL VL]
> +       st1b            {z2.b}, p0, [x1, #2, MUL VL]
> +       st1b            {z3.b}, p0, [x1, #3, MUL VL]
> +
> +       addvl           x2, x2, #4
> +       addvl           x1, x1, #4
> +
> +       cbz             x3, .Lcrypt_end
> +
> +.Lcrypt_loop_1x:
> +       cmp             x3, x5, LSR #4
> +       blt             .Lcrypt_ce_loop_1x
> +
> +       sub             x3, x3, x5, LSR #4              /* x3 - VL */
> +
> +       ld1b            {z0.b}, p0/z, [x2]
> +
> +       SM4_SVE_CE_CRYPT_BLK(z0)
> +
> +       st1b            {z0.b}, p0, [x1]
> +
> +       addvl           x2, x2, #1
> +       addvl           x1, x1, #1
> +
> +       cbz             x3, .Lcrypt_end
> +       b               .Lcrypt_loop_1x
> +
> +.Lcrypt_ce_loop_1x:
> +       sub             x3, x3, #1
> +
> +       ld1             {v0.16b}, [x2], #16
> +       SM4_CE_CRYPT_BLK(v0)
> +       st1             {v0.16b}, [x1], #16
> +
> +       cbnz            x3, .Lcrypt_ce_loop_1x
> +
> +.Lcrypt_end:
> +       ret
> +SYM_FUNC_END(sm4_sve_ce_crypt)
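
To make the control flow above easier to follow, here is a minimal C
sketch of how I read the block-count dispatch in sm4_sve_ce_crypt: a
loop over 8 vectors, at most one 4-vector step, a loop over single
vectors, then a block-at-a-time CE tail. It assumes x5 holds the SVE
vector length in bytes (SM4_PREPARE is not shown here, so that is an
inference from the "x5, LSR #4" comparisons). The struct and function
names are hypothetical and exist only for this illustration.

```c
#include <stddef.h>

/* Hypothetical tally of how many 16-byte blocks each tier handles. */
struct sm4_tiers {
	size_t blocks_8x;	/* handled by the 8-vector loop */
	size_t blocks_4x;	/* handled by the single 4-vector step */
	size_t blocks_1x;	/* handled by the 1-vector loop */
	size_t blocks_ce;	/* handled one block at a time by the CE tail */
};

/*
 * Mirror the branch structure of sm4_sve_ce_crypt. vl_bytes stands in
 * for x5 and is assumed to be the vector length in bytes (a multiple
 * of 16), so vl_bytes >> 4 is the number of blocks per vector.
 *
 * Unlike the assembly, this sketch tolerates nblocks == 0; the glue
 * code only calls the assembly when nblocks is non-zero.
 */
static struct sm4_tiers sm4_split_blocks(size_t nblocks, size_t vl_bytes)
{
	struct sm4_tiers t = { 0, 0, 0, 0 };
	size_t per_8x = vl_bytes >> 1;	/* x5, LSR #1 */
	size_t per_4x = vl_bytes >> 2;	/* x5, LSR #2 */
	size_t per_1x = vl_bytes >> 4;	/* x5, LSR #4 */

	while (nblocks >= per_8x) {	/* .Lcrypt_loop_8x */
		t.blocks_8x += per_8x;
		nblocks -= per_8x;
	}
	if (nblocks >= per_4x) {	/* .Lcrypt_4x, taken at most once */
		t.blocks_4x += per_4x;
		nblocks -= per_4x;
	}
	while (nblocks >= per_1x) {	/* .Lcrypt_loop_1x */
		t.blocks_1x += per_1x;
		nblocks -= per_1x;
	}
	t.blocks_ce = nblocks;		/* .Lcrypt_ce_loop_1x */
	return t;
}
```

For example, with a 256-bit vector (vl_bytes = 32) a vector holds two
blocks, so 45 blocks split into 32 + 8 + 4 + 1 across the four tiers.
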
> +
> +.align 3
> +SYM_FUNC_START(sm4_sve_ce_cbc_dec)
> +       /* input:
> +        *   x0: round key array, CTX
> +        *   x1: dst
> +        *   x2: src
> +        *   x3: iv (big endian, 128 bit)
> +        *   w4: nblocks
> +        */
> +       uxtw            x4, w4
> +       SM4_PREPARE(x0)
> +
> +       ld1             {RIVv.16b}, [x3]
> +       ext             RIV.b, RIV.b, RIV.b, #16
> +
> +.Lcbc_dec_loop_8x:
> +       sub             x4, x4, x5, LSR #1              /* x4 - (8 * VL) */
> +       tbnz            x4, #63, .Lcbc_dec_4x
> +
> +       ld1b            {z15.b}, p0/z, [x2]
> +       ld1b            {z14.b}, p0/z, [x2, #1, MUL VL]
> +       ld1b            {z13.b}, p0/z, [x2, #2, MUL VL]
> +       ld1b            {z12.b}, p0/z, [x2, #3, MUL VL]
> +       ld1b            {z11.b}, p0/z, [x2, #4, MUL VL]
> +       ld1b            {z10.b}, p0/z, [x2, #5, MUL VL]
> +       ld1b            {z9.b}, p0/z, [x2, #6, MUL VL]
> +       ld1b            {z8.b}, p0/z, [x2, #7, MUL VL]
> +       rev             z0.b, z15.b
> +       rev             z1.b, z14.b
> +       rev             z2.b, z13.b
> +       rev             z3.b, z12.b
> +       rev             z4.b, z11.b
> +       rev             z5.b, z10.b
> +       rev             z6.b, z9.b
> +       rev             z7.b, z8.b
> +       rev             RTMP0.b, RIV.b
> +       ext             z7.b, z7.b, z6.b, #16
> +       ext             z6.b, z6.b, z5.b, #16
> +       ext             z5.b, z5.b, z4.b, #16
> +       ext             z4.b, z4.b, z3.b, #16
> +       ext             z3.b, z3.b, z2.b, #16
> +       ext             z2.b, z2.b, z1.b, #16
> +       ext             z1.b, z1.b, z0.b, #16
> +       ext             z0.b, z0.b, RTMP0.b, #16
> +       rev             z7.b, z7.b
> +       rev             z6.b, z6.b
> +       rev             z5.b, z5.b
> +       rev             z4.b, z4.b
> +       rev             z3.b, z3.b
> +       rev             z2.b, z2.b
> +       rev             z1.b, z1.b
> +       rev             z0.b, z0.b
> +       mov             RIV.d, z8.d
> +
> +       SM4_SVE_CE_CRYPT_BLK8(z15, z14, z13, z12, z11, z10, z9, z8)
> +
> +       eor             z0.d, z0.d, z15.d
> +       eor             z1.d, z1.d, z14.d
> +       eor             z2.d, z2.d, z13.d
> +       eor             z3.d, z3.d, z12.d
> +       eor             z4.d, z4.d, z11.d
> +       eor             z5.d, z5.d, z10.d
> +       eor             z6.d, z6.d, z9.d
> +       eor             z7.d, z7.d, z8.d
> +       st1b            {z0.b}, p0, [x1]
> +       st1b            {z1.b}, p0, [x1, #1, MUL VL]
> +       st1b            {z2.b}, p0, [x1, #2, MUL VL]
> +       st1b            {z3.b}, p0, [x1, #3, MUL VL]
> +       st1b            {z4.b}, p0, [x1, #4, MUL VL]
> +       st1b            {z5.b}, p0, [x1, #5, MUL VL]
> +       st1b            {z6.b}, p0, [x1, #6, MUL VL]
> +       st1b            {z7.b}, p0, [x1, #7, MUL VL]
> +
> +       addvl           x2, x2, #8
> +       addvl           x1, x1, #8
> +
> +       cbz             x4, .Lcbc_dec_end
> +       b               .Lcbc_dec_loop_8x
> +
> +.Lcbc_dec_4x:
> +       add             x4, x4, x5, LSR #1
> +       cmp             x4, x5, LSR #2
> +       blt             .Lcbc_dec_loop_1x
> +
> +       sub             x4, x4, x5, LSR #2              /* x4 - (4 * VL) */
> +
> +       ld1b            {z15.b}, p0/z, [x2]
> +       ld1b            {z14.b}, p0/z, [x2, #1, MUL VL]
> +       ld1b            {z13.b}, p0/z, [x2, #2, MUL VL]
> +       ld1b            {z12.b}, p0/z, [x2, #3, MUL VL]
> +       rev             z0.b, z15.b
> +       rev             z1.b, z14.b
> +       rev             z2.b, z13.b
> +       rev             z3.b, z12.b
> +       rev             RTMP0.b, RIV.b
> +       ext             z3.b, z3.b, z2.b, #16
> +       ext             z2.b, z2.b, z1.b, #16
> +       ext             z1.b, z1.b, z0.b, #16
> +       ext             z0.b, z0.b, RTMP0.b, #16
> +       rev             z3.b, z3.b
> +       rev             z2.b, z2.b
> +       rev             z1.b, z1.b
> +       rev             z0.b, z0.b
> +       mov             RIV.d, z12.d
> +
> +       SM4_SVE_CE_CRYPT_BLK4(z15, z14, z13, z12)
> +
> +       eor             z0.d, z0.d, z15.d
> +       eor             z1.d, z1.d, z14.d
> +       eor             z2.d, z2.d, z13.d
> +       eor             z3.d, z3.d, z12.d
> +       st1b            {z0.b}, p0, [x1]
> +       st1b            {z1.b}, p0, [x1, #1, MUL VL]
> +       st1b            {z2.b}, p0, [x1, #2, MUL VL]
> +       st1b            {z3.b}, p0, [x1, #3, MUL VL]
> +
> +       addvl           x2, x2, #4
> +       addvl           x1, x1, #4
> +
> +       cbz             x4, .Lcbc_dec_end
> +
> +.Lcbc_dec_loop_1x:
> +       cmp             x4, x5, LSR #4
> +       blt             .Lcbc_dec_ce
> +
> +       sub             x4, x4, x5, LSR #4              /* x4 - VL */
> +
> +       ld1b            {z15.b}, p0/z, [x2]
> +       rev             RTMP0.b, RIV.b
> +       rev             z0.b, z15.b
> +       ext             z0.b, z0.b, RTMP0.b, #16
> +       rev             z0.b, z0.b
> +       mov             RIV.d, z15.d
> +
> +       SM4_SVE_CE_CRYPT_BLK(z15)
> +
> +       eor             z0.d, z0.d, z15.d
> +       st1b            {z0.b}, p0, [x1]
> +
> +       addvl           x2, x2, #1
> +       addvl           x1, x1, #1
> +
> +       cbz             x4, .Lcbc_dec_end
> +       b               .Lcbc_dec_loop_1x
> +
> +.Lcbc_dec_ce:
> +       rev             RIV.s, RIV.s
> +       tbl             RIV.b, {RIV.b}, RSWAP128.b
> +
> +.Lcbc_dec_ce_loop_1x:
> +       sub             x4, x4, #1
> +
> +       ld1             {v15.16b}, [x2], #16
> +       mov             v0.16b, RIVv.16b
> +       mov             RIVv.16b, v15.16b
> +       SM4_CE_CRYPT_BLK(v15)
> +       eor             v0.16b, v0.16b, v15.16b
> +       st1             {v0.16b}, [x1], #16
> +
> +       cbnz            x4, .Lcbc_dec_ce_loop_1x
> +
> +       ext             RIV.b, RIV.b, RIV.b, #16
> +
> +.Lcbc_dec_end:
> +       /* store new IV */
> +       rev             RIV.s, RIV.s
> +       tbl             RIV.b, {RIV.b}, RSWAP128.b
> +       st1             {RIVv.16b}, [x3]
> +
> +       ret
> +SYM_FUNC_END(sm4_sve_ce_cbc_dec)
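
For reference, the rev/ext/rev sequences above appear to build the
"previous ciphertext block" for every lane (shifting the IV into the
first lane), after which each plaintext block is D(C_i) XOR C_{i-1}
and the last ciphertext block becomes the new IV. A scalar sketch of
that chaining follows. The block function here is a toy stand-in, not
SM4, and every name in it is hypothetical; it only demonstrates the
data flow.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLK 16	/* block size in bytes, as for SM4 */

/* Toy invertible block function. NOT SM4; illustration only. */
static void toy_enc_block(uint8_t *out, const uint8_t *in)
{
	for (size_t i = 0; i < BLK; i++)
		out[i] = (uint8_t)((in[i] ^ 0xa5) + i);
}

/* Inverse of toy_enc_block. */
static void toy_dec_block(uint8_t *out, const uint8_t *in)
{
	for (size_t i = 0; i < BLK; i++)
		out[i] = (uint8_t)((uint8_t)(in[i] - i) ^ 0xa5);
}

/* CBC encryption: C_i = E(P_i ^ C_{i-1}), with C_{-1} = IV. */
static void toy_cbc_encrypt(uint8_t *dst, const uint8_t *src,
			    uint8_t *iv, size_t nblocks)
{
	uint8_t buf[BLK];

	while (nblocks--) {
		for (size_t i = 0; i < BLK; i++)
			buf[i] = src[i] ^ iv[i];
		toy_enc_block(dst, buf);
		memcpy(iv, dst, BLK);	/* chain into the next block */
		src += BLK;
		dst += BLK;
	}
}

/* CBC decryption: P_i = D(C_i) ^ C_{i-1}; iv ends up as the last C_i. */
static void toy_cbc_decrypt(uint8_t *dst, const uint8_t *src,
			    uint8_t *iv, size_t nblocks)
{
	uint8_t cur[BLK], tmp[BLK];

	while (nblocks--) {
		memcpy(cur, src, BLK);	/* keep C_i: dst may alias src */
		toy_dec_block(tmp, cur);
		for (size_t i = 0; i < BLK; i++)
			dst[i] = tmp[i] ^ iv[i];
		memcpy(iv, cur, BLK);	/* C_i is the next "previous" block */
		src += BLK;
		dst += BLK;
	}
}
```

Because every C_{i-1} is already known, the decryptions are
independent of each other, which is what lets the vector code process
8 * VL worth of blocks in one SM4_SVE_CE_CRYPT_BLK8 call.
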
> +
> +.align 3
> +SYM_FUNC_START(sm4_sve_ce_cfb_dec)
> +       /* input:
> +        *   x0: round key array, CTX
> +        *   x1: dst
> +        *   x2: src
> +        *   x3: iv (big endian, 128 bit)
> +        *   w4: nblocks
> +        */
> +       uxtw            x4, w4
> +       SM4_PREPARE(x0)
> +
> +       ld1             {RIVv.16b}, [x3]
> +       ext             RIV.b, RIV.b, RIV.b, #16
> +
> +.Lcfb_dec_loop_8x:
> +       sub             x4, x4, x5, LSR #1              /* x4 - (8 * VL) */
> +       tbnz            x4, #63, .Lcfb_dec_4x
> +
> +       ld1b            {z15.b}, p0/z, [x2]
> +       ld1b            {z14.b}, p0/z, [x2, #1, MUL VL]
> +       ld1b            {z13.b}, p0/z, [x2, #2, MUL VL]
> +       ld1b            {z12.b}, p0/z, [x2, #3, MUL VL]
> +       ld1b            {z11.b}, p0/z, [x2, #4, MUL VL]
> +       ld1b            {z10.b}, p0/z, [x2, #5, MUL VL]
> +       ld1b            {z9.b}, p0/z, [x2, #6, MUL VL]
> +       ld1b            {z8.b}, p0/z, [x2, #7, MUL VL]
> +       rev             z0.b, z15.b
> +       rev             z1.b, z14.b
> +       rev             z2.b, z13.b
> +       rev             z3.b, z12.b
> +       rev             z4.b, z11.b
> +       rev             z5.b, z10.b
> +       rev             z6.b, z9.b
> +       rev             z7.b, z8.b
> +       rev             RTMP0.b, RIV.b
> +       ext             z7.b, z7.b, z6.b, #16
> +       ext             z6.b, z6.b, z5.b, #16
> +       ext             z5.b, z5.b, z4.b, #16
> +       ext             z4.b, z4.b, z3.b, #16
> +       ext             z3.b, z3.b, z2.b, #16
> +       ext             z2.b, z2.b, z1.b, #16
> +       ext             z1.b, z1.b, z0.b, #16
> +       ext             z0.b, z0.b, RTMP0.b, #16
> +       rev             z7.b, z7.b
> +       rev             z6.b, z6.b
> +       rev             z5.b, z5.b
> +       rev             z4.b, z4.b
> +       rev             z3.b, z3.b
> +       rev             z2.b, z2.b
> +       rev             z1.b, z1.b
> +       rev             z0.b, z0.b
> +       mov             RIV.d, z8.d
> +
> +       SM4_SVE_CE_CRYPT_BLK8(z0, z1, z2, z3, z4, z5, z6, z7)
> +
> +       eor             z0.d, z0.d, z15.d
> +       eor             z1.d, z1.d, z14.d
> +       eor             z2.d, z2.d, z13.d
> +       eor             z3.d, z3.d, z12.d
> +       eor             z4.d, z4.d, z11.d
> +       eor             z5.d, z5.d, z10.d
> +       eor             z6.d, z6.d, z9.d
> +       eor             z7.d, z7.d, z8.d
> +       st1b            {z0.b}, p0, [x1]
> +       st1b            {z1.b}, p0, [x1, #1, MUL VL]
> +       st1b            {z2.b}, p0, [x1, #2, MUL VL]
> +       st1b            {z3.b}, p0, [x1, #3, MUL VL]
> +       st1b            {z4.b}, p0, [x1, #4, MUL VL]
> +       st1b            {z5.b}, p0, [x1, #5, MUL VL]
> +       st1b            {z6.b}, p0, [x1, #6, MUL VL]
> +       st1b            {z7.b}, p0, [x1, #7, MUL VL]
> +
> +       addvl           x2, x2, #8
> +       addvl           x1, x1, #8
> +
> +       cbz             x4, .Lcfb_dec_end
> +       b               .Lcfb_dec_loop_8x
> +
> +.Lcfb_dec_4x:
> +       add             x4, x4, x5, LSR #1
> +       cmp             x4, x5, LSR #2
> +       blt             .Lcfb_dec_loop_1x
> +
> +       sub             x4, x4, x5, LSR #2              /* x4 - (4 * VL) */
> +
> +       ld1b            {z15.b}, p0/z, [x2]
> +       ld1b            {z14.b}, p0/z, [x2, #1, MUL VL]
> +       ld1b            {z13.b}, p0/z, [x2, #2, MUL VL]
> +       ld1b            {z12.b}, p0/z, [x2, #3, MUL VL]
> +       rev             z0.b, z15.b
> +       rev             z1.b, z14.b
> +       rev             z2.b, z13.b
> +       rev             z3.b, z12.b
> +       rev             RTMP0.b, RIV.b
> +       ext             z3.b, z3.b, z2.b, #16
> +       ext             z2.b, z2.b, z1.b, #16
> +       ext             z1.b, z1.b, z0.b, #16
> +       ext             z0.b, z0.b, RTMP0.b, #16
> +       rev             z3.b, z3.b
> +       rev             z2.b, z2.b
> +       rev             z1.b, z1.b
> +       rev             z0.b, z0.b
> +       mov             RIV.d, z12.d
> +
> +       SM4_SVE_CE_CRYPT_BLK4(z0, z1, z2, z3)
> +
> +       eor             z0.d, z0.d, z15.d
> +       eor             z1.d, z1.d, z14.d
> +       eor             z2.d, z2.d, z13.d
> +       eor             z3.d, z3.d, z12.d
> +       st1b            {z0.b}, p0, [x1]
> +       st1b            {z1.b}, p0, [x1, #1, MUL VL]
> +       st1b            {z2.b}, p0, [x1, #2, MUL VL]
> +       st1b            {z3.b}, p0, [x1, #3, MUL VL]
> +
> +       addvl           x2, x2, #4
> +       addvl           x1, x1, #4
> +
> +       cbz             x4, .Lcfb_dec_end
> +
> +.Lcfb_dec_loop_1x:
> +       cmp             x4, x5, LSR #4
> +       blt             .Lcfb_dec_ce
> +
> +       sub             x4, x4, x5, LSR #4              /* x4 - VL */
> +
> +       ld1b            {z15.b}, p0/z, [x2]
> +       rev             RTMP0.b, RIV.b
> +       rev             z0.b, z15.b
> +       ext             z0.b, z0.b, RTMP0.b, #16
> +       rev             z0.b, z0.b
> +       mov             RIV.d, z15.d
> +
> +       SM4_SVE_CE_CRYPT_BLK(z0)
> +
> +       eor             z0.d, z0.d, z15.d
> +       st1b            {z0.b}, p0, [x1]
> +
> +       addvl           x2, x2, #1
> +       addvl           x1, x1, #1
> +
> +       cbz             x4, .Lcfb_dec_end
> +       b               .Lcfb_dec_loop_1x
> +
> +.Lcfb_dec_ce:
> +       rev             RIV.s, RIV.s
> +       tbl             RIV.b, {RIV.b}, RSWAP128.b
> +
> +.Lcfb_dec_ce_loop_1x:
> +       sub             x4, x4, #1
> +
> +       ld1             {v15.16b}, [x2], #16
> +       mov             v0.16b, RIVv.16b
> +       mov             RIVv.16b, v15.16b
> +       SM4_CE_CRYPT_BLK(v0)
> +       eor             v0.16b, v0.16b, v15.16b
> +       st1             {v0.16b}, [x1], #16
> +
> +       cbnz            x4, .Lcfb_dec_ce_loop_1x
> +
> +       ext             RIV.b, RIV.b, RIV.b, #16
> +
> +.Lcfb_dec_end:
> +       /* store new IV */
> +       rev             RIV.s, RIV.s
> +       tbl             RIV.b, {RIV.b}, RSWAP128.b
> +       st1             {RIVv.16b}, [x3]
> +
> +       ret
> +SYM_FUNC_END(sm4_sve_ce_cfb_dec)
> +
> +.align 3
> +SYM_FUNC_START(sm4_sve_ce_ctr_crypt)
> +       /* input:
> +        *   x0: round key array, CTX
> +        *   x1: dst
> +        *   x2: src
> +        *   x3: ctr (big endian, 128 bit)
> +        *   w4: nblocks
> +        */
> +       uxtw            x4, w4
> +       SM4_PREPARE(x0)
> +
> +       dup             RZERO.d, #0
> +       adr_l           x6, .Lle128_inc
> +       ld1b            {RLE128_INC.b}, p0/z, [x6]
> +
> +       ldp             x7, x8, [x3]
> +       rev             x7, x7
> +       rev             x8, x8
> +
> +.Lctr_loop_8x:
> +       sub             x4, x4, x5, LSR #1              /* x4 - (8 * VL) */
> +       tbnz            x4, #63, .Lctr_4x
> +
> +       inc_le128_8x(z0, z1, z2, z3, z4, z5, z6, z7)
> +
> +       ld1b            {z8.b}, p0/z, [x2]
> +       ld1b            {z9.b}, p0/z, [x2, #1, MUL VL]
> +       ld1b            {z10.b}, p0/z, [x2, #2, MUL VL]
> +       ld1b            {z11.b}, p0/z, [x2, #3, MUL VL]
> +       ld1b            {z12.b}, p0/z, [x2, #4, MUL VL]
> +       ld1b            {z13.b}, p0/z, [x2, #5, MUL VL]
> +       ld1b            {z14.b}, p0/z, [x2, #6, MUL VL]
> +       ld1b            {z15.b}, p0/z, [x2, #7, MUL VL]
> +
> +       SM4_SVE_CE_CRYPT_BLK8(z0, z1, z2, z3, z4, z5, z6, z7)
> +
> +       eor             z0.d, z0.d, z8.d
> +       eor             z1.d, z1.d, z9.d
> +       eor             z2.d, z2.d, z10.d
> +       eor             z3.d, z3.d, z11.d
> +       eor             z4.d, z4.d, z12.d
> +       eor             z5.d, z5.d, z13.d
> +       eor             z6.d, z6.d, z14.d
> +       eor             z7.d, z7.d, z15.d
> +
> +       st1b            {z0.b}, p0, [x1]
> +       st1b            {z1.b}, p0, [x1, #1, MUL VL]
> +       st1b            {z2.b}, p0, [x1, #2, MUL VL]
> +       st1b            {z3.b}, p0, [x1, #3, MUL VL]
> +       st1b            {z4.b}, p0, [x1, #4, MUL VL]
> +       st1b            {z5.b}, p0, [x1, #5, MUL VL]
> +       st1b            {z6.b}, p0, [x1, #6, MUL VL]
> +       st1b            {z7.b}, p0, [x1, #7, MUL VL]
> +
> +       addvl           x2, x2, #8
> +       addvl           x1, x1, #8
> +
> +       cbz             x4, .Lctr_end
> +       b               .Lctr_loop_8x
> +
> +.Lctr_4x:
> +       add             x4, x4, x5, LSR #1
> +       cmp             x4, x5, LSR #2
> +       blt             .Lctr_loop_1x
> +
> +       sub             x4, x4, x5, LSR #2              /* x4 - (4 * VL) */
> +
> +       inc_le128_4x(z0, z1, z2, z3)
> +
> +       ld1b            {z8.b}, p0/z, [x2]
> +       ld1b            {z9.b}, p0/z, [x2, #1, MUL VL]
> +       ld1b            {z10.b}, p0/z, [x2, #2, MUL VL]
> +       ld1b            {z11.b}, p0/z, [x2, #3, MUL VL]
> +
> +       SM4_SVE_CE_CRYPT_BLK4(z0, z1, z2, z3)
> +
> +       eor             z0.d, z0.d, z8.d
> +       eor             z1.d, z1.d, z9.d
> +       eor             z2.d, z2.d, z10.d
> +       eor             z3.d, z3.d, z11.d
> +
> +       st1b            {z0.b}, p0, [x1]
> +       st1b            {z1.b}, p0, [x1, #1, MUL VL]
> +       st1b            {z2.b}, p0, [x1, #2, MUL VL]
> +       st1b            {z3.b}, p0, [x1, #3, MUL VL]
> +
> +       addvl           x2, x2, #4
> +       addvl           x1, x1, #4
> +
> +       cbz             x4, .Lctr_end
> +
> +.Lctr_loop_1x:
> +       cmp             x4, x5, LSR #4
> +       blt             .Lctr_ce_loop_1x
> +
> +       sub             x4, x4, x5, LSR #4              /* x4 - VL */
> +
> +       inc_le128(z0)
> +       ld1b            {z8.b}, p0/z, [x2]
> +
> +       SM4_SVE_CE_CRYPT_BLK(z0)
> +
> +       eor             z0.d, z0.d, z8.d
> +       st1b            {z0.b}, p0, [x1]
> +
> +       addvl           x2, x2, #1
> +       addvl           x1, x1, #1
> +
> +       cbz             x4, .Lctr_end
> +       b               .Lctr_loop_1x
> +
> +.Lctr_ce_loop_1x:
> +       sub             x4, x4, #1
> +
> +       /* inc_le128 for CE */
> +       mov             v0.d[1], x8
> +       mov             v0.d[0], x7
> +       adds            x8, x8, #1
> +       rev64           v0.16b, v0.16b
> +       adc             x7, x7, xzr
> +
> +       ld1             {v8.16b}, [x2], #16
> +
> +       SM4_CE_CRYPT_BLK(v0)
> +
> +       eor             v0.16b, v0.16b, v8.16b
> +       st1             {v0.16b}, [x1], #16
> +
> +       cbnz            x4, .Lctr_ce_loop_1x
> +
> +.Lctr_end:
> +       /* store new CTR */
> +       rev             x7, x7
> +       rev             x8, x8
> +       stp             x7, x8, [x3]
> +
> +       ret
> +SYM_FUNC_END(sm4_sve_ce_ctr_crypt)
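
The scalar counter handling in .Lctr_ce_loop_1x (ldp + rev on entry,
adds/adc per block, rev + stp at .Lctr_end) amounts to incrementing a
128-bit big-endian integer held as two 64-bit halves. A portable C
sketch of that step, under the assumption that x7 is the high half and
x8 the low half; the helper names are hypothetical:

```c
#include <stdint.h>

/* Read a 64-bit big-endian value byte by byte (host-endian agnostic). */
static uint64_t load_be64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Write a 64-bit value in big-endian byte order. */
static void store_be64(uint8_t *p, uint64_t v)
{
	for (int i = 7; i >= 0; i--) {
		p[i] = (uint8_t)v;
		v >>= 8;
	}
}

/* Increment a 128-bit big-endian counter by one, in place. */
static void ctr128_inc(uint8_t ctr[16])
{
	uint64_t hi = load_be64(ctr);		/* x7 after rev */
	uint64_t lo = load_be64(ctr + 8);	/* x8 after rev */

	lo += 1;		/* adds x8, x8, #1 */
	hi += (lo == 0);	/* adc  x7, x7, xzr: carry out of the low half */

	store_be64(ctr, hi);
	store_be64(ctr + 8, lo);
}
```

The assembly keeps hi/lo in registers across the whole call and only
converts back to big-endian once when storing the new CTR, whereas
this sketch converts on every increment for simplicity.
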
> +
> +.align 3
> +SYM_FUNC_START(sm4_sve_get_vl)
> +       /* VL in bytes */
> +       rdvl            x0, #1
> +
> +       ret
> +SYM_FUNC_END(sm4_sve_get_vl)
> +
> +
> +       .section        ".rodata", "a"
> +       .align 4
> +.Lbswap128_mask:
> +       .byte           0x0c, 0x0d, 0x0e, 0x0f, 0x08, 0x09, 0x0a, 0x0b
> +       .byte           0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03
> +       .byte           0x1c, 0x1d, 0x1e, 0x1f, 0x18, 0x19, 0x1a, 0x1b
> +       .byte           0x14, 0x15, 0x16, 0x17, 0x10, 0x11, 0x12, 0x13
> +       .byte           0x2c, 0x2d, 0x2e, 0x2f, 0x28, 0x29, 0x2a, 0x2b
> +       .byte           0x24, 0x25, 0x26, 0x27, 0x20, 0x21, 0x22, 0x23
> +       .byte           0x3c, 0x3d, 0x3e, 0x3f, 0x38, 0x39, 0x3a, 0x3b
> +       .byte           0x34, 0x35, 0x36, 0x37, 0x30, 0x31, 0x32, 0x33
> +       .byte           0x4c, 0x4d, 0x4e, 0x4f, 0x48, 0x49, 0x4a, 0x4b
> +       .byte           0x44, 0x45, 0x46, 0x47, 0x40, 0x41, 0x42, 0x43
> +       .byte           0x5c, 0x5d, 0x5e, 0x5f, 0x58, 0x59, 0x5a, 0x5b
> +       .byte           0x54, 0x55, 0x56, 0x57, 0x50, 0x51, 0x52, 0x53
> +       .byte           0x6c, 0x6d, 0x6e, 0x6f, 0x68, 0x69, 0x6a, 0x6b
> +       .byte           0x64, 0x65, 0x66, 0x67, 0x60, 0x61, 0x62, 0x63
> +       .byte           0x7c, 0x7d, 0x7e, 0x7f, 0x78, 0x79, 0x7a, 0x7b
> +       .byte           0x74, 0x75, 0x76, 0x77, 0x70, 0x71, 0x72, 0x73
> +       .byte           0x8c, 0x8d, 0x8e, 0x8f, 0x88, 0x89, 0x8a, 0x8b
> +       .byte           0x84, 0x85, 0x86, 0x87, 0x80, 0x81, 0x82, 0x83
> +       .byte           0x9c, 0x9d, 0x9e, 0x9f, 0x98, 0x99, 0x9a, 0x9b
> +       .byte           0x94, 0x95, 0x96, 0x97, 0x90, 0x91, 0x92, 0x93
> +       .byte           0xac, 0xad, 0xae, 0xaf, 0xa8, 0xa9, 0xaa, 0xab
> +       .byte           0xa4, 0xa5, 0xa6, 0xa7, 0xa0, 0xa1, 0xa2, 0xa3
> +       .byte           0xbc, 0xbd, 0xbe, 0xbf, 0xb8, 0xb9, 0xba, 0xbb
> +       .byte           0xb4, 0xb5, 0xb6, 0xb7, 0xb0, 0xb1, 0xb2, 0xb3
> +       .byte           0xcc, 0xcd, 0xce, 0xcf, 0xc8, 0xc9, 0xca, 0xcb
> +       .byte           0xc4, 0xc5, 0xc6, 0xc7, 0xc0, 0xc1, 0xc2, 0xc3
> +       .byte           0xdc, 0xdd, 0xde, 0xdf, 0xd8, 0xd9, 0xda, 0xdb
> +       .byte           0xd4, 0xd5, 0xd6, 0xd7, 0xd0, 0xd1, 0xd2, 0xd3
> +       .byte           0xec, 0xed, 0xee, 0xef, 0xe8, 0xe9, 0xea, 0xeb
> +       .byte           0xe4, 0xe5, 0xe6, 0xe7, 0xe0, 0xe1, 0xe2, 0xe3
> +       .byte           0xfc, 0xfd, 0xfe, 0xff, 0xf8, 0xf9, 0xfa, 0xfb
> +       .byte           0xf4, 0xf5, 0xf6, 0xf7, 0xf0, 0xf1, 0xf2, 0xf3
> +
> +.Lle128_inc:
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x05, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x06, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x07, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x09, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x0a, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x0b, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x0c, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x0d, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x0e, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x0f, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> diff --git a/arch/arm64/crypto/sm4-sve-ce-glue.c b/arch/arm64/crypto/sm4-sve-ce-glue.c
> new file mode 100644
> index 000000000000..fc797b72b5f0
> --- /dev/null
> +++ b/arch/arm64/crypto/sm4-sve-ce-glue.c
> @@ -0,0 +1,332 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * SM4 Cipher Algorithm, using ARMv9 Crypto Extensions with SVE2
> + * as specified in
> + * https://tools.ietf.org/id/draft-ribose-cfrg-sm4-10.html
> + *
> + * Copyright (C) 2022, Alibaba Group.
> + * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
> + */
> +
> +#include <linux/module.h>
> +#include <linux/crypto.h>
> +#include <linux/kernel.h>
> +#include <linux/cpufeature.h>
> +#include <asm/neon.h>
> +#include <asm/simd.h>
> +#include <crypto/internal/simd.h>
> +#include <crypto/internal/skcipher.h>
> +#include <crypto/sm4.h>
> +#include "sm4-ce.h"
> +
> +asmlinkage void sm4_sve_ce_crypt(const u32 *rkey, u8 *dst,
> +                                const u8 *src, unsigned int nblocks);
> +asmlinkage void sm4_sve_ce_cbc_dec(const u32 *rkey_dec, u8 *dst,
> +                                  const u8 *src, u8 *iv,
> +                                  unsigned int nblocks);
> +asmlinkage void sm4_sve_ce_cfb_dec(const u32 *rkey_enc, u8 *dst,
> +                                  const u8 *src, u8 *iv,
> +                                  unsigned int nblocks);
> +asmlinkage void sm4_sve_ce_ctr_crypt(const u32 *rkey_enc, u8 *dst,
> +                                    const u8 *src, u8 *iv,
> +                                    unsigned int nblocks);
> +asmlinkage unsigned int sm4_sve_get_vl(void);
> +
> +
> +static int sm4_setkey(struct crypto_skcipher *tfm, const u8 *key,
> +                     unsigned int key_len)
> +{
> +       struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
> +
> +       if (key_len != SM4_KEY_SIZE)
> +               return -EINVAL;
> +
> +       kernel_neon_begin();
> +       sm4_ce_expand_key(key, ctx->rkey_enc, ctx->rkey_dec,
> +                         crypto_sm4_fk, crypto_sm4_ck);
> +       kernel_neon_end();
> +
> +       return 0;
> +}
> +
> +static int ecb_crypt(struct skcipher_request *req, const u32 *rkey)
> +{
> +       struct skcipher_walk walk;
> +       unsigned int nbytes;
> +       int err;
> +
> +       err = skcipher_walk_virt(&walk, req, false);
> +
> +       while ((nbytes = walk.nbytes) > 0) {
> +               const u8 *src = walk.src.virt.addr;
> +               u8 *dst = walk.dst.virt.addr;
> +               unsigned int nblocks;
> +
> +               nblocks = nbytes / SM4_BLOCK_SIZE;
> +               if (nblocks) {
> +                       kernel_neon_begin();
> +
> +                       sm4_sve_ce_crypt(rkey, dst, src, nblocks);
> +
> +                       kernel_neon_end();
> +               }
> +
> +               err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE);
> +       }
> +
> +       return err;
> +}
> +
> +static int ecb_encrypt(struct skcipher_request *req)
> +{
> +       struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> +       struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
> +
> +       return ecb_crypt(req, ctx->rkey_enc);
> +}
> +
> +static int ecb_decrypt(struct skcipher_request *req)
> +{
> +       struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> +       struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
> +
> +       return ecb_crypt(req, ctx->rkey_dec);
> +}
> +
> +static int cbc_crypt(struct skcipher_request *req, const u32 *rkey,
> +                    void (*sm4_cbc_crypt)(const u32 *rkey, u8 *dst,
> +                               const u8 *src, u8 *iv, unsigned int nblocks))
> +{
> +       struct skcipher_walk walk;
> +       unsigned int nbytes;
> +       int err;
> +
> +       err = skcipher_walk_virt(&walk, req, false);
> +
> +       while ((nbytes = walk.nbytes) > 0) {
> +               const u8 *src = walk.src.virt.addr;
> +               u8 *dst = walk.dst.virt.addr;
> +               unsigned int nblocks;
> +
> +               nblocks = nbytes / SM4_BLOCK_SIZE;
> +               if (nblocks) {
> +                       kernel_neon_begin();
> +
> +                       sm4_cbc_crypt(rkey, dst, src, walk.iv, nblocks);
> +
> +                       kernel_neon_end();
> +               }
> +
> +               err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE);
> +       }
> +
> +       return err;
> +}
> +
> +static int cbc_encrypt(struct skcipher_request *req)
> +{
> +       struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> +       struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
> +
> +       return cbc_crypt(req, ctx->rkey_enc, sm4_ce_cbc_enc);
> +}
> +
> +static int cbc_decrypt(struct skcipher_request *req)
> +{
> +       struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> +       struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
> +
> +       return cbc_crypt(req, ctx->rkey_dec, sm4_sve_ce_cbc_dec);
> +}
> +
> +static int cfb_crypt(struct skcipher_request *req,
> +                    void (*sm4_cfb_crypt)(const u32 *rkey, u8 *dst,
> +                               const u8 *src, u8 *iv, unsigned int nblocks))
> +{
> +       struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> +       struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
> +       struct skcipher_walk walk;
> +       unsigned int nbytes;
> +       int err;
> +
> +       err = skcipher_walk_virt(&walk, req, false);
> +
> +       while ((nbytes = walk.nbytes) > 0) {
> +               const u8 *src = walk.src.virt.addr;
> +               u8 *dst = walk.dst.virt.addr;
> +               unsigned int nblocks;
> +
> +               nblocks = nbytes / SM4_BLOCK_SIZE;
> +               if (nblocks) {
> +                       kernel_neon_begin();
> +
> +                       sm4_cfb_crypt(ctx->rkey_enc, dst, src,
> +                                     walk.iv, nblocks);
> +
> +                       kernel_neon_end();
> +
> +                       dst += nblocks * SM4_BLOCK_SIZE;
> +                       src += nblocks * SM4_BLOCK_SIZE;
> +                       nbytes -= nblocks * SM4_BLOCK_SIZE;
> +               }
> +
> +               /* tail */
> +               if (walk.nbytes == walk.total && nbytes > 0) {
> +                       u8 keystream[SM4_BLOCK_SIZE];
> +
> +                       sm4_ce_crypt_block(ctx->rkey_enc, keystream, walk.iv);
> +                       crypto_xor_cpy(dst, src, keystream, nbytes);
> +                       nbytes = 0;
> +               }
> +
> +               err = skcipher_walk_done(&walk, nbytes);
> +       }
> +
> +       return err;
> +}
> +
> +static int cfb_encrypt(struct skcipher_request *req)
> +{
> +       return cfb_crypt(req, sm4_ce_cfb_enc);
> +}
> +
> +static int cfb_decrypt(struct skcipher_request *req)
> +{
> +       return cfb_crypt(req, sm4_sve_ce_cfb_dec);
> +}
> +
> +static int ctr_crypt(struct skcipher_request *req)
> +{
> +       struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> +       struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
> +       struct skcipher_walk walk;
> +       unsigned int nbytes;
> +       int err;
> +
> +       err = skcipher_walk_virt(&walk, req, false);
> +
> +       while ((nbytes = walk.nbytes) > 0) {
> +               const u8 *src = walk.src.virt.addr;
> +               u8 *dst = walk.dst.virt.addr;
> +               unsigned int nblocks;
> +
> +               nblocks = nbytes / SM4_BLOCK_SIZE;
> +               if (nblocks) {
> +                       kernel_neon_begin();
> +
> +                       sm4_sve_ce_ctr_crypt(ctx->rkey_enc, dst, src,
> +                                            walk.iv, nblocks);
> +
> +                       kernel_neon_end();
> +
> +                       dst += nblocks * SM4_BLOCK_SIZE;
> +                       src += nblocks * SM4_BLOCK_SIZE;
> +                       nbytes -= nblocks * SM4_BLOCK_SIZE;
> +               }
> +
> +               /* tail */
> +               if (walk.nbytes == walk.total && nbytes > 0) {
> +                       u8 keystream[SM4_BLOCK_SIZE];
> +
> +                       sm4_ce_crypt_block(ctx->rkey_enc, keystream, walk.iv);
> +                       crypto_inc(walk.iv, SM4_BLOCK_SIZE);
> +                       crypto_xor_cpy(dst, src, keystream, nbytes);
> +                       nbytes = 0;
> +               }
> +
> +               err = skcipher_walk_done(&walk, nbytes);
> +       }
> +
> +       return err;
> +}
> +
> +static struct skcipher_alg sm4_algs[] = {
> +       {
> +               .base = {
> +                       .cra_name               = "ecb(sm4)",
> +                       .cra_driver_name        = "ecb-sm4-sve-ce",
> +                       .cra_priority           = 500,
> +                       .cra_blocksize          = SM4_BLOCK_SIZE,
> +                       .cra_ctxsize            = sizeof(struct sm4_ctx),
> +                       .cra_module             = THIS_MODULE,
> +               },
> +               .min_keysize    = SM4_KEY_SIZE,
> +               .max_keysize    = SM4_KEY_SIZE,
> +               .setkey         = sm4_setkey,
> +               .encrypt        = ecb_encrypt,
> +               .decrypt        = ecb_decrypt,
> +       }, {
> +               .base = {
> +                       .cra_name               = "cbc(sm4)",
> +                       .cra_driver_name        = "cbc-sm4-sve-ce",
> +                       .cra_priority           = 500,
> +                       .cra_blocksize          = SM4_BLOCK_SIZE,
> +                       .cra_ctxsize            = sizeof(struct sm4_ctx),
> +                       .cra_module             = THIS_MODULE,
> +               },
> +               .min_keysize    = SM4_KEY_SIZE,
> +               .max_keysize    = SM4_KEY_SIZE,
> +               .ivsize         = SM4_BLOCK_SIZE,
> +               .setkey         = sm4_setkey,
> +               .encrypt        = cbc_encrypt,
> +               .decrypt        = cbc_decrypt,
> +       }, {
> +               .base = {
> +                       .cra_name               = "cfb(sm4)",
> +                       .cra_driver_name        = "cfb-sm4-sve-ce",
> +                       .cra_priority           = 500,
> +                       .cra_blocksize          = 1,
> +                       .cra_ctxsize            = sizeof(struct sm4_ctx),
> +                       .cra_module             = THIS_MODULE,
> +               },
> +               .min_keysize    = SM4_KEY_SIZE,
> +               .max_keysize    = SM4_KEY_SIZE,
> +               .ivsize         = SM4_BLOCK_SIZE,
> +               .chunksize      = SM4_BLOCK_SIZE,
> +               .setkey         = sm4_setkey,
> +               .encrypt        = cfb_encrypt,
> +               .decrypt        = cfb_decrypt,
> +       }, {
> +               .base = {
> +                       .cra_name               = "ctr(sm4)",
> +                       .cra_driver_name        = "ctr-sm4-sve-ce",
> +                       .cra_priority           = 500,
> +                       .cra_blocksize          = 1,
> +                       .cra_ctxsize            = sizeof(struct sm4_ctx),
> +                       .cra_module             = THIS_MODULE,
> +               },
> +               .min_keysize    = SM4_KEY_SIZE,
> +               .max_keysize    = SM4_KEY_SIZE,
> +               .ivsize         = SM4_BLOCK_SIZE,
> +               .chunksize      = SM4_BLOCK_SIZE,
> +               .setkey         = sm4_setkey,
> +               .encrypt        = ctr_crypt,
> +               .decrypt        = ctr_crypt,
> +       }
> +};
> +
> +static int __init sm4_sve_ce_init(void)
> +{
> +       if (sm4_sve_get_vl() <= 16)
> +               return -ENODEV;
> +
> +       return crypto_register_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs));
> +}
> +
> +static void __exit sm4_sve_ce_exit(void)
> +{
> +       crypto_unregister_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs));
> +}
> +
> +module_cpu_feature_match(SVESM4, sm4_sve_ce_init);
> +module_exit(sm4_sve_ce_exit);
> +
> +MODULE_DESCRIPTION("SM4 ECB/CBC/CFB/CTR using ARMv9 Crypto Extensions with SVE2");
> +MODULE_ALIAS_CRYPTO("sm4-sve-ce");
> +MODULE_ALIAS_CRYPTO("sm4");
> +MODULE_ALIAS_CRYPTO("ecb(sm4)");
> +MODULE_ALIAS_CRYPTO("cbc(sm4)");
> +MODULE_ALIAS_CRYPTO("cfb(sm4)");
> +MODULE_ALIAS_CRYPTO("ctr(sm4)");
> +MODULE_AUTHOR("Tianjia Zhang <tianjia.zhang@linux.alibaba.com>");
> +MODULE_LICENSE("GPL v2");
> --
> 2.24.3 (Apple Git-128)
>
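
For readers following the quoted ctr_crypt() above: each walk step is
split into whole blocks, which go to the assembly routine, and a final
partial block that is XORed with a single keystream block. Below is a
minimal userspace sketch of that split. It is not the patch's code:
toy_encrypt_block() is a hypothetical stand-in and not a real cipher,
ctr_inc() only imitates what crypto_inc() does, and the buffer is
processed in one call rather than through skcipher_walk.

```c
#include <stddef.h>
#include <stdint.h>

#define SM4_BLOCK_SIZE 16

/* Toy stand-in for SM4 block encryption (assumption; NOT a real cipher). */
static void toy_encrypt_block(uint8_t out[SM4_BLOCK_SIZE],
			      const uint8_t in[SM4_BLOCK_SIZE])
{
	for (size_t i = 0; i < SM4_BLOCK_SIZE; i++)
		out[i] = (uint8_t)(in[i] ^ 0xa5);
}

/* Big-endian increment of the counter block by one. */
static void ctr_inc(uint8_t ctr[SM4_BLOCK_SIZE])
{
	for (size_t i = SM4_BLOCK_SIZE; i-- > 0; )
		if (++ctr[i] != 0)
			break;	/* no carry out of this byte */
}

/* CTR over nbytes: whole blocks first, then the partial tail. */
static void toy_ctr_crypt(uint8_t *dst, const uint8_t *src, size_t nbytes,
			  uint8_t ctr[SM4_BLOCK_SIZE])
{
	uint8_t keystream[SM4_BLOCK_SIZE];
	size_t nblocks = nbytes / SM4_BLOCK_SIZE;

	while (nblocks--) {
		toy_encrypt_block(keystream, ctr);
		ctr_inc(ctr);
		for (size_t i = 0; i < SM4_BLOCK_SIZE; i++)
			dst[i] = (uint8_t)(src[i] ^ keystream[i]);
		dst += SM4_BLOCK_SIZE;
		src += SM4_BLOCK_SIZE;
		nbytes -= SM4_BLOCK_SIZE;
	}

	/* tail: fewer than SM4_BLOCK_SIZE bytes remain */
	if (nbytes > 0) {
		toy_encrypt_block(keystream, ctr);
		ctr_inc(ctr);
		for (size_t i = 0; i < nbytes; i++)
			dst[i] = (uint8_t)(src[i] ^ keystream[i]);
	}
}
```

Because CTR is its own inverse, calling toy_ctr_crypt() a second time
with the same starting counter recovers the input.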

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 16/16] crypto: arm64/sm4 - add ARMv9 SVE cryptography acceleration implementation
  2022-09-26 10:02     ` Ard Biesheuvel
@ 2022-09-26 17:14       ` Mark Brown
  -1 siblings, 0 replies; 42+ messages in thread
From: Mark Brown @ 2022-09-26 17:14 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Tianjia Zhang, Herbert Xu, David S. Miller, Jussi Kivilinna,
	Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
	Eric Biggers, linux-crypto, linux-arm-kernel, linux-kernel,
	linux-stm32

[-- Attachment #1.1: Type: text/plain, Size: 4253 bytes --]

On Mon, Sep 26, 2022 at 12:02:04PM +0200, Ard Biesheuvel wrote:

> Given that we currently do not support the use of SVE in kernel mode,
> this patch cannot be accepted at this time (but the rest of the series
> looks reasonable to me, although I have only skimmed over the patches)

> In view of the disappointing benchmark results below, I don't think
> this is worth the hassle at the moment. If we can find a case where
> using SVE in kernel mode truly makes a [favorable] difference, we can
> revisit this, but not without a thorough analysis of the impact it
> will have to support SVE in the kernel. Also, the fact that SVE may

The kernel code doesn't really distinguish between FPSIMD and SVE in
terms of state management, and with the sharing of the V and Z registers
the architecture is very similar too so it shouldn't be too much hassle,
the only thing we should need is some management for the VL when
starting kernel mode SVE (probably just setting the maximum VL as a
first pass).

The current code should *work*, and on a system with only a single VL
supported it'd be equivalent since setting the VL is a noop; it'd just
mean that any kernel mode SVE would end up using whatever VL was last
set on the PE, which could result in inconsistent performance.

> also cover cryptographic extensions does not necessarily imply that a
> micro-architecture will perform those crypto transformations in
> parallel and so the performance may be the same even if VL > 128.

Indeed, though so long as the performance is comparable I guess it
doesn't really hurt - if we run into situations where for some
implementations SVE performs worse then we'd need to do something more
complicated than just using SVE if it's available but...

> In summary, please drop this patch for now, and once there are more
> encouraging performance numbers, please resubmit it as part of a
> series that explicitly enables SVE in kernel mode on arm64, and
> documents the requirements and constraints.

...in any case as you say until there are cases where SVE does better
for some in kernel use case we probably just shouldn't merge things.

Having said that I have been tempted to put together a branch which has
a kernel_sve_begin() implementation and collects proposed algorithm
implementations so they're there for people to experiment with as new
hardware becomes available.  There's clearly interest in trying to use
SVE in kernel and it makes sense to try to avoid common pitfalls and
reduce duplication of effort.

A couple of very minor comments on the patch:

> > +config CRYPTO_SM4_ARM64_SVE_CE_BLK
> > +       tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR (ARMv9 cryptography acceleration with SVE2)"
> > +       depends on KERNEL_MODE_NEON
> > +       select CRYPTO_SKCIPHER
> > +       select CRYPTO_SM4
> > +       select CRYPTO_SM4_ARM64_CE_BLK
> > +       help

Our current baseline binutils version requirement predates SVE support
so we'd either need to manually encode all SVE instructions used or add
a suitable dependency.  The dependency seems a lot more reasonable here,
and we could require a new enough version to avoid the manual encoding
that is done in the patch (though I've not checked how new a version
that'd end up requiring, it might be unreasonable so perhaps just
depending on binutils having basic SVE support and continuing with the
manual encoding might be more helpful).

> > +.macro sm4e, vd, vn
> > +       .inst 0xcec08400 | (.L\vn << 5) | .L\vd
> > +.endm

For any manual encodings that do get left it'd be good to note the
binutils and LLVM versions which support the instruction so we can
hopefully at some point switch to assembling them normally.
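
For reference, the word that the macro above emits can be reproduced in
plain C. This is only a sketch: sm4e_encode() is a hypothetical helper
name, and it assumes that .L\vd and .L\vn resolve to the vector register
numbers 0 to 31, as the shifts in the macro suggest.

```c
#include <stdint.h>

/*
 * Compute the 32-bit instruction word produced by
 *     .inst 0xcec08400 | (.L\vn << 5) | .L\vd
 * for destination register number vd and source register number vn.
 */
static inline uint32_t sm4e_encode(unsigned int vd, unsigned int vn)
{
	return 0xcec08400u | ((uint32_t)(vn & 0x1f) << 5) | (uint32_t)(vd & 0x1f);
}
```

With these assumptions, sm4e_encode(0, 1) yields 0xcec08420, which is the
kind of value a comment next to a manual encoding could record alongside
the toolchain versions.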

> > +static int __init sm4_sve_ce_init(void)
> > +{
> > +       if (sm4_sve_get_vl() <= 16)
> > +               return -ENODEV;

I'm not clear what this check is attempting to guard against - what's
the issue with larger VLs?

If it is needed then we already have a sve_get_vl() in the core kernel
which we should probably be making available to modules rather than
having them open code something (eg, making it a static inline rather
than putting it in asm).

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 16/16] crypto: arm64/sm4 - add ARMv9 SVE cryptography acceleration implementation
  2022-09-26 10:02     ` Ard Biesheuvel
@ 2022-09-27  4:26       ` Tianjia Zhang
  -1 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-27  4:26 UTC (permalink / raw)
  To: Ard Biesheuvel, Mark Brown
  Cc: Herbert Xu, David S. Miller, Jussi Kivilinna, Catalin Marinas,
	Will Deacon, Maxime Coquelin, Alexandre Torgue, Eric Biggers,
	linux-crypto, linux-arm-kernel, linux-kernel, linux-stm32

Hi Ard,

On 9/26/22 6:02 PM, Ard Biesheuvel wrote:
> (cc Mark Brown)
> 
> Hello Tianjia,
> 
> On Mon, 26 Sept 2022 at 11:37, Tianjia Zhang
> <tianjia.zhang@linux.alibaba.com> wrote:
>>
>> Scalable Vector Extension (SVE) is the next-generation SIMD extension for
>> arm64. SVE allows flexible vector length implementations with a range of
>> possible values in CPU implementations. The vector length can vary from a
>> minimum of 128 bits up to a maximum of 2048 bits, at 128-bit increments.
>> The SVE design guarantees that the same application can run on different
>> implementations that support SVE, without the need to recompile the code.
>>
>> SVE was originally introduced by ARMv8, and ARMv9 introduced SVE2 to
>> expand and improve it. Similar to the Crypto Extension supported by the
>> NEON instruction set for the algorithm, SVE also supports the similar
>> instructions, called cryptography acceleration instructions, but this is
>> also optional instruction set.
>>
>> This patch uses SM4 cryptography acceleration instructions and SVE2
>> instructions to optimize the SM4 algorithm for ECB/CBC/CFB/CTR modes.
>> Since the encryption of CBC/CFB cannot be parallelized, the Crypto
>> Extension instruction is used.
>>
> 
> Given that we currently do not support the use of SVE in kernel mode,
> this patch cannot be accepted at this time (but the rest of the series
> looks reasonable to me, although I have only skimmed over the patches)
> 
> In view of the disappointing benchmark results below, I don't think
> this is worth the hassle at the moment. If we can find a case where
> using SVE in kernel mode truly makes a [favorable] difference, we can
> revisit this, but not without a thorough analysis of the impact it
> will have to support SVE in the kernel. Also, the fact that SVE may
> also cover cryptographic extensions does not necessarily imply that a
> micro-architecture will perform those crypto transformations in
> parallel and so the performance may be the same even if VL > 128.
> 
> In summary, please drop this patch for now, and once there are more
> encouraging performance numbers, please resubmit it as part of a
> series that explicitly enables SVE in kernel mode on arm64, and
> documents the requirements and constraints.
> 
> I have cc'ed Mark, who has been working on the SVE support and might
> have something to add here as well.
> 
> Thanks,
> Ard.
> 
> 

Thanks for your reply. The current performance of SVE is indeed
unsatisfactory. One reason is that the SVE implementation has to deal
with more, and more complex, data shifting operations, for example in
CBC/CFB mode, and in CTR mode it also needs more instructions to
complete the 128-bit counter increment. The CE implementation does not
have these complications.
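
To illustrate the counter handling mentioned above, here is a scalar
sketch of generating the consecutive counter blocks that one vector's
worth of CTR input would need. It is not taken from the patch:
ctr128_fill(), load_be64() and store_be64() are hypothetical names, and
the point is only to show the carry from the low to the high 64 bits,
which a vectorised version presumably has to handle for every block.

```c
#include <stddef.h>
#include <stdint.h>

#define SM4_BLOCK_SIZE 16

/* Load a big-endian 64-bit value byte by byte (endian-neutral). */
static uint64_t load_be64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Store a 64-bit value in big-endian byte order. */
static void store_be64(uint8_t *p, uint64_t v)
{
	for (int i = 7; i >= 0; i--) {
		p[i] = (uint8_t)v;
		v >>= 8;
	}
}

/*
 * Write nblocks consecutive counter blocks (ctr, ctr + 1, ...) to out
 * and advance ctr by nblocks. When the low 64 bits wrap around, a carry
 * is added to the high 64 bits.
 */
static void ctr128_fill(uint8_t *out, uint8_t ctr[SM4_BLOCK_SIZE],
			size_t nblocks)
{
	uint64_t hi = load_be64(ctr);
	uint64_t lo = load_be64(ctr + 8);

	for (size_t i = 0; i < nblocks; i++) {
		store_be64(out + i * SM4_BLOCK_SIZE, hi);
		store_be64(out + i * SM4_BLOCK_SIZE + 8, lo);
		if (++lo == 0)
			hi++;	/* carry out of the low half */
	}

	store_be64(ctr, hi);
	store_be64(ctr + 8, lo);
}
```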

In addition, I naively thought that with a 256-bit VL the performance
would simply double compared to 128-bit. At present, this is not the
case. It is probably not worth using SVE until there is significantly
better performance data. I'll follow your advice and drop this patch.

Best regards,
Tianjia


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 16/16] crypto: arm64/sm4 - add ARMv9 SVE cryptography acceleration implementation
  2022-09-26 17:14       ` Mark Brown
@ 2022-09-27  4:30         ` Tianjia Zhang
  -1 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-27  4:30 UTC (permalink / raw)
  To: Mark Brown, Ard Biesheuvel
  Cc: Herbert Xu, David S. Miller, Jussi Kivilinna, Catalin Marinas,
	Will Deacon, Maxime Coquelin, Alexandre Torgue, Eric Biggers,
	linux-crypto, linux-arm-kernel, linux-kernel, linux-stm32

Hi Mark,

On 9/27/22 1:14 AM, Mark Brown wrote:
> On Mon, Sep 26, 2022 at 12:02:04PM +0200, Ard Biesheuvel wrote:
> 
>> Given that we currently do not support the use of SVE in kernel mode,
>> this patch cannot be accepted at this time (but the rest of the series
>> looks reasonable to me, although I have only skimmed over the patches)
> 
>> In view of the disappointing benchmark results below, I don't think
>> this is worth the hassle at the moment. If we can find a case where
>> using SVE in kernel mode truly makes a [favorable] difference, we can
>> revisit this, but not without a thorough analysis of the impact it
>> will have to support SVE in the kernel. Also, the fact that SVE may
> 
> The kernel code doesn't really distinguish between FPSIMD and SVE in
> terms of state management, and with the sharing of the V and Z registers
> the architecture is very similar too so it shouldn't be too much hassle,
> the only thing we should need is some management for the VL when
> starting kernel mode SVE (probably just setting the maximum VL as a
> first pass).
> 
> The current code should *work* and on a system with only a single VL
> supported it'd be equivalent since setting the VL is a noop, it'd just
> mean that any kernel mode SVE would end up using whatever the last VL
> set on the PE happened to be in which could result in inconsistent
> performance.
> 
>> also cover cryptographic extensions does not necessarily imply that a
>> micro-architecture will perform those crypto transformations in
>> parallel and so the performance may be the same even if VL > 128.
> 
> Indeed, though so long as the performance is comparable I guess it
> doesn't really hurt - if we run into situations where for some
> implementations SVE performs worse then we'd need to do something more
> complicated than just using SVE if it's available but...
> 
>> In summary, please drop this patch for now, and once there are more
>> encouraging performance numbers, please resubmit it as part of a
>> series that explicitly enables SVE in kernel mode on arm64, and
>> documents the requirements and constraints.
> 
> ...in any case as you say until there are cases where SVE does better
> for some in kernel use case we probably just shouldn't merge things.
> 
> Having said that I have been tempted to put together a branch which has
> a kernel_sve_begin() implementation and collects proposed algorithm
> implementations so they're there for people to experiment with as new
> hardware becomes available.  There's clearly interest in trying to use
> SVE in kernel and it makes sense to try to avoid common pitfalls and
> reduce duplication of effort.
> 

Your reply helped me a lot. I did encounter problems when running in a
qemu environment with a VL larger than 128 bits, although the same code
tested with the pure user-mode library libgcrypt seemed to work
normally. Maybe it is just a coincidence that it works fine with a
128-bit VL on the physical machine.

I am looking forward to your experimental branch, and I believe that
there will be breakthroughs in hardware in the near future.

> A couple of very minor comments on the patch:
> 
>>> +config CRYPTO_SM4_ARM64_SVE_CE_BLK
>>> +       tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR (ARMv9 cryptography acceleration with SVE2)"
>>> +       depends on KERNEL_MODE_NEON
>>> +       select CRYPTO_SKCIPHER
>>> +       select CRYPTO_SM4
>>> +       select CRYPTO_SM4_ARM64_CE_BLK
>>> +       help
> 
> Our current baseline binutils version requirement predates SVE support
> so we'd either need to manually encode all SVE instructions used or add
> suitable dependency.  The dependency seems a lot more reasonable here,
> and we could require a new enough version to avoid the manual encoding
> that is done in the patch (though I've not checked how new a version
> that'd end up requiring, it might be unreasonable so perhaps just
> depending on binutils having basic SVE support and continuing with the
> manual encoding might be more helpful).
> 
>>> +.macro sm4e, vd, vn
>>> +       .inst 0xcec08400 | (.L\vn << 5) | .L\vd
>>> +.endm
> 
> For any manual encodings that do get left it'd be good to note the
> binutils and LLVM versions which support the instruction so we can
> hopefully at some point switch to assembling them normally.
> 
>>> +static int __init sm4_sve_ce_init(void)
>>> +{
>>> +       if (sm4_sve_get_vl() <= 16)
>>> +               return -ENODEV;
> 
> I'm not clear what this check is attempting to guard against - what's
> the issue with larger VLs?

Since no physical test environment is available, this check is based on
my naive assumption that performance with a 256-bit VL should
theoretically be twice that of 128-bit. Because SVE needs to handle
more complex data shifting and CTR increment operations, I expected SVE
to bring a performance improvement only when VL is greater than or
equal to 256 bits, and falling back to CE to be the better choice
otherwise.

Now it seems that this assumption itself is not valid, so I will drop
this patch for now.
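
To make the units in that check concrete, here is a small sketch. It
assumes that sm4_sve_get_vl() reports the vector length in bytes, which
the comparison against 16 suggests but the quoted code does not show.
Both helper names are hypothetical.

```c
#define SM4_BLOCK_SIZE 16

/* Number of whole SM4 blocks that fit in one vector of vl_bytes bytes. */
static inline unsigned int sm4_blocks_per_vector(unsigned int vl_bytes)
{
	return vl_bytes / SM4_BLOCK_SIZE;
}

/* Same condition as the init-time check: only VLs above 128 bits pass. */
static inline int sm4_sve_vl_accepted(unsigned int vl_bytes)
{
	return vl_bytes > SM4_BLOCK_SIZE;
}
```

Under that assumption a 128-bit VL (16 bytes) holds one block and is
rejected, a 256-bit VL (32 bytes) holds two, and the architectural
maximum of 2048 bits (256 bytes) holds sixteen.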

> 
> If it is needed then we already have a sve_get_vl() in the core kernel
> which we should probably be making available to modules rather than
> having them open code something (eg, making it a static inline rather
> than putting it in asm).

Yes, I agree, exporting sve_get_vl() to the module is the more
appropriate approach.

Best regards,
Tianjia


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 16/16] crypto: arm64/sm4 - add ARMv9 SVE cryptography acceleration implementation
@ 2022-09-27  4:30         ` Tianjia Zhang
  0 siblings, 0 replies; 42+ messages in thread
From: Tianjia Zhang @ 2022-09-27  4:30 UTC (permalink / raw)
  To: Mark Brown, Ard Biesheuvel
  Cc: Herbert Xu, David S. Miller, Jussi Kivilinna, Catalin Marinas,
	Will Deacon, Maxime Coquelin, Alexandre Torgue, Eric Biggers,
	linux-crypto, linux-arm-kernel, linux-kernel, linux-stm32

Hi Mark,

On 9/27/22 1:14 AM, Mark Brown wrote:
> On Mon, Sep 26, 2022 at 12:02:04PM +0200, Ard Biesheuvel wrote:
> 
>> Given that we currently do not support the use of SVE in kernel mode,
>> this patch cannot be accepted at this time (but the rest of the series
>> looks reasonable to me, although I have only skimmed over the patches)
> 
>> In view of the disappointing benchmark results below, I don't think
>> this is worth the hassle at the moment. If we can find a case where
>> using SVE in kernel mode truly makes a [favorable] difference, we can
>> revisit this, but not without a thorough analysis of the impact it
>> will have to support SVE in the kernel. Also, the fact that SVE may
> 
> The kernel code doesn't really distinguish between FPSIMD and SVE in
> terms of state management, and with the sharing of the V and Z registers
> the architecture is very similar too so it shouldn't be too much hassle,
> the only thing we should need is some management for the VL when
> starting kernel mode SVE (probably just setting the maximum VL as a
> first pass).
> 
> The current code should *work* and on a system with only a single VL
> supported it'd be equivalent since setting the VL is a noop, it'd just
> mean that any kernel mode SVE would end up using whatever the last VL
> set on the PE happened to be in which could result in inconsistent
> performance.
> 
>> also cover cryptographic extensions does not necessarily imply that a
>> micro-architecture will perform those crypto transformations in
>> parallel and so the performance may be the same even if VL > 128.
> 
> Indeed, though so long as the performance is comparable I guess it
> doesn't really hurt - if we run into situations where for some
> implementations SVE performs worse then we'd need to do something more
> complicated than just using SVE if it's available but...
> 
>> In summary, please drop this patch for now, and once there are more
>> encouraging performance numbers, please resubmit it as part of a
>> series that explicitly enables SVE in kernel mode on arm64, and
>> documents the requirements and constraints.
> 
> ...in any case as you say until there are cases where SVE does better
> for some in kernel use case we probably just shouldn't merge things.
> 
> Having said that I have been tempted to put together a branch which has
> a kernel_sve_begin() implementation and collects proposed algorithm
> implementations so they're there for people to experiment with as new
> hardware becomes available.  There's clearly interest in trying to use
> SVE in kernel and it makes sense to try to avoid common pitfalls and
> reduce duplication of effort.
> 

Your reply helped me a lot. I did encounter problems when running in a
qemu environment with a VL larger than 128 bits, although the same code
tested with the pure user-mode library libgcrypt seemed to work
normally. Maybe it is just a coincidence that it works fine with a
128-bit VL on the physical machine.

I am looking forward to your experimental branch, and I believe that
there will be breakthroughs in hardware in the near future.

> A couple of very minor comments on the patch:
> 
>>> +config CRYPTO_SM4_ARM64_SVE_CE_BLK
>>> +       tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR (ARMv9 cryptography acceleration with SVE2)"
>>> +       depends on KERNEL_MODE_NEON
>>> +       select CRYPTO_SKCIPHER
>>> +       select CRYPTO_SM4
>>> +       select CRYPTO_SM4_ARM64_CE_BLK
>>> +       help
> 
> Our current baseline binutils version requirement predates SVE support
> so we'd either need to manually encode all SVE instructions used or add
> suitable dependency.  The dependency seems a lot more reasonable here,
> and we could require a new enough version to avoid the manual encoding
> that is done in the patch (though I've not checked how new a version
> that'd end up requiring, it might be unreasonable so perhaps just
> depending on binutils having basic SVE support and continuing with the
> manual encoding might be more helpful).
> 
>>> +.macro sm4e, vd, vn
>>> +       .inst 0xcec08400 | (.L\vn << 5) | .L\vd
>>> +.endm
> 
> For any manual encodings that do get left it'd be good to note the
> binutils and LLVM versions which support the instruction so we can
> hopefully at some point switch to assembling them normally.
> 
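
For reference, the word that the quoted macro hands to .inst can be
sketched in C as below. This is only an illustration: sm4e_encode() is a
hypothetical helper that is not part of the patch, and it assumes that
.L\vd and .L\vn resolve to the plain register numbers 0..31.

```c
#include <stdint.h>

/* Base opcode copied from the quoted macro (both register fields zero). */
#define SM4E_BASE 0xcec08400u

/*
 * Hypothetical helper, not part of the patch: build the 32-bit word the
 * macro emits through .inst. Assumes vd and vn are register numbers
 * 0..31; the source register sits in bits [9:5] and the destination
 * register in bits [4:0], as in the macro's shift and OR.
 */
static inline uint32_t sm4e_encode(unsigned int vd, unsigned int vn)
{
	return SM4E_BASE | ((uint32_t)(vn & 31u) << 5) | (uint32_t)(vd & 31u);
}
```

With these assumptions, sm4e_encode(0, 1) gives 0xcec08420, which is the
word the macro would emit for "sm4e v0.4s, v1.4s".
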
>>> +static int __init sm4_sve_ce_init(void)
>>> +{
>>> +       if (sm4_sve_get_vl() <= 16)
>>> +               return -ENODEV;
> 
> I'm not clear what this check is attempting to guard against - what's
> the issue with larger VLs?

Since I have no physical hardware to test on, this check is based on my
naive assumption that performance with a 256-bit VL should theoretically
be twice that of 128-bit. Because SVE needs to handle more complex data
shifting and CTR increment operations, I assumed that SVE would only
bring a performance improvement when VL is 256 bits or more, and that
falling back to CE is the better choice otherwise.

It now seems that this assumption does not hold, so I will drop this
patch for now.
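
To make the intent of the quoted check concrete, here is a minimal
sketch of the decision it encodes. sm4_sve_probe() is a hypothetical
stand-in for the init function, and it assumes sm4_sve_get_vl() reports
the vector length in bytes, so that 16 corresponds to 128 bits.

```c
#include <errno.h>

/*
 * Hypothetical stand-in for the quoted init check, not the patch itself.
 * vl_bytes is assumed to be what sm4_sve_get_vl() would return: the SVE
 * vector length in bytes.
 */
static int sm4_sve_probe(unsigned int vl_bytes)
{
	/* 16 bytes == 128 bits: no wider than CE, so decline registration. */
	if (vl_bytes <= 16)
		return -ENODEV;

	/* 256 bits or more: the case where SVE was assumed to pay off. */
	return 0;
}
```

Under that assumption a 128-bit system gets -ENODEV and keeps using the
CE driver, while a 256-bit system registers the SVE one.
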

> 
> If it is needed then we already have a sve_get_vl() in the core kernel
> which we should probably be making available to modules rather than
> having them open code something (eg, making it a static inline rather
> than putting it in asm).

Yes, I agree, exporting sve_get_vl() to the module is the more
appropriate approach.

Best regards,
Tianjia



end of thread, other threads:[~2022-09-27  4:32 UTC | newest]

Thread overview: 42+ messages
2022-09-26  9:36 [PATCH 00/16] Optimizing SM3 and SM4 algorithms using NEON/CE/SVE instructions Tianjia Zhang
2022-09-26  9:36 ` Tianjia Zhang
2022-09-26  9:36 ` [PATCH 01/16] crypto: arm64/sm3 - raise the priority of the CE implementation Tianjia Zhang
2022-09-26  9:36   ` Tianjia Zhang
2022-09-26  9:36 ` [PATCH 02/16] crypto: arm64/sm3 - add NEON assembly implementation Tianjia Zhang
2022-09-26  9:36   ` Tianjia Zhang
2022-09-26  9:36 ` [PATCH 03/16] crypto: arm64/sm4 - refactor and simplify NEON implementation Tianjia Zhang
2022-09-26  9:36   ` Tianjia Zhang
2022-09-26  9:36 ` [PATCH 04/16] crypto: testmgr - add SM4 cts-cbc/essiv/xts/xcbc test vectors Tianjia Zhang
2022-09-26  9:36   ` Tianjia Zhang
2022-09-26  9:36 ` [PATCH 05/16] crypto: tcrypt - add SM4 cts-cbc/essiv/xts/xcbc test Tianjia Zhang
2022-09-26  9:36   ` Tianjia Zhang
2022-09-26  9:36 ` [PATCH 06/16] crypto: arm64/sm4 - refactor and simplify CE implementation Tianjia Zhang
2022-09-26  9:36   ` Tianjia Zhang
2022-09-26  9:36 ` [PATCH 07/16] crypto: arm64/sm4 - simplify sm4_ce_expand_key() of " Tianjia Zhang
2022-09-26  9:36   ` Tianjia Zhang
2022-09-26  9:36 ` [PATCH 08/16] crypto: arm64/sm4 - export reusable CE acceleration functions Tianjia Zhang
2022-09-26  9:36   ` Tianjia Zhang
2022-09-26  9:36 ` [PATCH 09/16] crypto: arm64/sm4 - add CE implementation for CTS-CBC mode Tianjia Zhang
2022-09-26  9:36   ` Tianjia Zhang
2022-09-26  9:36 ` [PATCH 10/16] crypto: arm64/sm4 - add CE implementation for XTS mode Tianjia Zhang
2022-09-26  9:36   ` Tianjia Zhang
2022-09-26  9:36 ` [PATCH 11/16] crypto: essiv - allow digestsize to be greater than keysize Tianjia Zhang
2022-09-26  9:36   ` Tianjia Zhang
2022-09-26  9:36 ` [PATCH 12/16] crypto: arm64/sm4 - add CE implementation for ESSIV mode Tianjia Zhang
2022-09-26  9:36   ` Tianjia Zhang
2022-09-26  9:36 ` [PATCH 13/16] crypto: arm64/sm4 - add CE implementation for cmac/xcbc/cbcmac Tianjia Zhang
2022-09-26  9:36   ` Tianjia Zhang
2022-09-26  9:36 ` [PATCH 14/16] crypto: arm64/sm4 - add CE implementation for CCM mode Tianjia Zhang
2022-09-26  9:36   ` Tianjia Zhang
2022-09-26  9:36 ` [PATCH 15/16] crypto: arm64/sm4 - add CE implementation for GCM mode Tianjia Zhang
2022-09-26  9:36   ` Tianjia Zhang
2022-09-26  9:36 ` [PATCH 16/16] crypto: arm64/sm4 - add ARMv9 SVE cryptography acceleration implementation Tianjia Zhang
2022-09-26  9:36   ` Tianjia Zhang
2022-09-26 10:02   ` Ard Biesheuvel
2022-09-26 10:02     ` Ard Biesheuvel
2022-09-26 17:14     ` Mark Brown
2022-09-26 17:14       ` Mark Brown
2022-09-27  4:30       ` Tianjia Zhang
2022-09-27  4:30         ` Tianjia Zhang
2022-09-27  4:26     ` Tianjia Zhang
2022-09-27  4:26       ` Tianjia Zhang
