[PATCH 0/6] Introduce x86 assembly accelerated implementation for SM3 algorithm

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/6] Introduce x86 assembly accelerated implementation for SM3 algorithm
@ 2021-12-20  8:22 Tianjia Zhang
  2021-12-20  8:22 ` [PATCH 1/6] crypto: sm3 - create SM3 stand-alone library Tianjia Zhang
                   ` (5 more replies)
  0 siblings, 6 replies; 9+ messages in thread
From: Tianjia Zhang @ 2021-12-20  8:22 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, Vitaly Chikunov, Eric Biggers,
	Eric Biggers, Gilad Ben-Yossef, Ard Biesheuvel, Jussi Kivilinna,
	Catalin Marinas, Will Deacon, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, linux-crypto, x86,
	linux-arm-kernel, linux-kernel
  Cc: Tianjia Zhang

This series of patches creates an stand-alone library for SM3 hash
algorithm in the lib/crypto directory, and makes the implementations
that originally depended on sm3-generic depend on the stand-alone SM3
library, which also includes sm3-generic itself.

On this basis, the AVX assembly acceleration implementation of SM3
algorithm is introduced, the main algorithm implementation based on
SM3 AES/BMI2 accelerated work by libgcrypt at:
https://gnupg.org/software/libgcrypt/index.html

From the performance benchmark data, the performance improvement of
SM3 algorithm after AVX optimization can reach up to 38%.

Tianjia Zhang (6):
  crypto: sm3 - create SM3 stand-alone library
  crypto: arm64/sm3-ce - make dependent on sm3 library
  crypto: sm2 - make dependent on sm3 library
  crypto: sm3 - make dependent on sm3 library
  crypto: x86/sm3 - add AVX assembly implementation
  crypto: tcrypt - add asynchronous speed test for SM3

 arch/arm64/crypto/Kconfig        |   2 +-
 arch/arm64/crypto/sm3-ce-glue.c  |  20 +-
 arch/x86/crypto/Makefile         |   3 +
 arch/x86/crypto/sm3-avx-asm_64.S | 521 +++++++++++++++++++++++++++++++
 arch/x86/crypto/sm3_avx_glue.c   | 134 ++++++++
 crypto/Kconfig                   |  16 +-
 crypto/sm2.c                     |  36 +--
 crypto/sm3_generic.c             | 142 +--------
 crypto/tcrypt.c                  |  14 +-
 include/crypto/sm3.h             |  35 ++-
 lib/crypto/Kconfig               |   3 +
 lib/crypto/Makefile              |   3 +
 lib/crypto/sm3.c                 | 246 +++++++++++++++
 13 files changed, 1010 insertions(+), 165 deletions(-)
 create mode 100644 arch/x86/crypto/sm3-avx-asm_64.S
 create mode 100644 arch/x86/crypto/sm3_avx_glue.c
 create mode 100644 lib/crypto/sm3.c

-- 
2.32.0


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/6] crypto: sm3 - create SM3 stand-alone library
  2021-12-20  8:22 [PATCH 0/6] Introduce x86 assembly accelerated implementation for SM3 algorithm Tianjia Zhang
@ 2021-12-20  8:22 ` Tianjia Zhang
  2021-12-20  8:22 ` [PATCH 2/6] crypto: arm64/sm3-ce - make dependent on sm3 library Tianjia Zhang
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Tianjia Zhang @ 2021-12-20  8:22 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, Vitaly Chikunov, Eric Biggers,
	Eric Biggers, Gilad Ben-Yossef, Ard Biesheuvel, Jussi Kivilinna,
	Catalin Marinas, Will Deacon, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, linux-crypto, x86,
	linux-arm-kernel, linux-kernel
  Cc: Tianjia Zhang

Stand-alone implementation of the SM3 algorithm. It is designed
to have as little dependencies as possible. In other cases you
should generally use the hash APIs from include/crypto/hash.h.
Especially when hashing large amounts of data as those APIs may
be hw-accelerated. In the new SM3 stand-alone library,
sm3_compress() has also been optimized, instead of simply using
the code in sm3_generic.

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
 include/crypto/sm3.h |  31 ++++++
 lib/crypto/Kconfig   |   3 +
 lib/crypto/Makefile  |   3 +
 lib/crypto/sm3.c     | 246 +++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 283 insertions(+)
 create mode 100644 lib/crypto/sm3.c

diff --git a/include/crypto/sm3.h b/include/crypto/sm3.h
index 42ea21289ba9..315adaf38007 100644
--- a/include/crypto/sm3.h
+++ b/include/crypto/sm3.h
@@ -1,5 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
 /*
  * Common values for SM3 algorithm
+ *
+ * Copyright (C) 2017 ARM Limited or its affiliates.
+ * Copyright (C) 2021 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
  */
 
 #ifndef _CRYPTO_SM3_H
@@ -39,4 +43,31 @@ extern int crypto_sm3_final(struct shash_desc *desc, u8 *out);
 
 extern int crypto_sm3_finup(struct shash_desc *desc, const u8 *data,
 			     unsigned int len, u8 *hash);
+
+/*
+ * Stand-alone implementation of the SM3 algorithm. It is designed to
+ * have as little dependencies as possible so it can be used in the
+ * kexec_file purgatory. In other cases you should generally use the
+ * hash APIs from include/crypto/hash.h. Especially when hashing large
+ * amounts of data as those APIs may be hw-accelerated.
+ *
+ * For details see lib/crypto/sm3.c
+ */
+
+static inline void sm3_init(struct sm3_state *sctx)
+{
+	sctx->state[0] = SM3_IVA;
+	sctx->state[1] = SM3_IVB;
+	sctx->state[2] = SM3_IVC;
+	sctx->state[3] = SM3_IVD;
+	sctx->state[4] = SM3_IVE;
+	sctx->state[5] = SM3_IVF;
+	sctx->state[6] = SM3_IVG;
+	sctx->state[7] = SM3_IVH;
+	sctx->count = 0;
+}
+
+void sm3_update(struct sm3_state *sctx, const u8 *data, unsigned int len);
+void sm3_final(struct sm3_state *sctx, u8 *out);
+
 #endif
diff --git a/lib/crypto/Kconfig b/lib/crypto/Kconfig
index 545ccbddf6a1..36963e8f4eaa 100644
--- a/lib/crypto/Kconfig
+++ b/lib/crypto/Kconfig
@@ -129,5 +129,8 @@ config CRYPTO_LIB_CHACHA20POLY1305
 config CRYPTO_LIB_SHA256
 	tristate
 
+config CRYPTO_LIB_SM3
+	tristate
+
 config CRYPTO_LIB_SM4
 	tristate
diff --git a/lib/crypto/Makefile b/lib/crypto/Makefile
index 73205ed269ba..8149bc00b627 100644
--- a/lib/crypto/Makefile
+++ b/lib/crypto/Makefile
@@ -38,6 +38,9 @@ libpoly1305-y					+= poly1305.o
 obj-$(CONFIG_CRYPTO_LIB_SHA256)			+= libsha256.o
 libsha256-y					:= sha256.o
 
+obj-$(CONFIG_CRYPTO_LIB_SM3)			+= libsm3.o
+libsm3-y					:= sm3.o
+
 obj-$(CONFIG_CRYPTO_LIB_SM4)			+= libsm4.o
 libsm4-y					:= sm4.o
 
diff --git a/lib/crypto/sm3.c b/lib/crypto/sm3.c
new file mode 100644
index 000000000000..d473e358a873
--- /dev/null
+++ b/lib/crypto/sm3.c
@@ -0,0 +1,246 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * SM3 secure hash, as specified by OSCCA GM/T 0004-2012 SM3 and described
+ * at https://datatracker.ietf.org/doc/html/draft-sca-cfrg-sm3-02
+ *
+ * Copyright (C) 2017 ARM Limited or its affiliates.
+ * Copyright (C) 2017 Gilad Ben-Yossef <gilad@benyossef.com>
+ * Copyright (C) 2021 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#include <linux/module.h>
+#include <asm/unaligned.h>
+#include <crypto/sm3.h>
+
+static const u32 ____cacheline_aligned K[64] = {
+	0x79cc4519, 0xf3988a32, 0xe7311465, 0xce6228cb,
+	0x9cc45197, 0x3988a32f, 0x7311465e, 0xe6228cbc,
+	0xcc451979, 0x988a32f3, 0x311465e7, 0x6228cbce,
+	0xc451979c, 0x88a32f39, 0x11465e73, 0x228cbce6,
+	0x9d8a7a87, 0x3b14f50f, 0x7629ea1e, 0xec53d43c,
+	0xd8a7a879, 0xb14f50f3, 0x629ea1e7, 0xc53d43ce,
+	0x8a7a879d, 0x14f50f3b, 0x29ea1e76, 0x53d43cec,
+	0xa7a879d8, 0x4f50f3b1, 0x9ea1e762, 0x3d43cec5,
+	0x7a879d8a, 0xf50f3b14, 0xea1e7629, 0xd43cec53,
+	0xa879d8a7, 0x50f3b14f, 0xa1e7629e, 0x43cec53d,
+	0x879d8a7a, 0x0f3b14f5, 0x1e7629ea, 0x3cec53d4,
+	0x79d8a7a8, 0xf3b14f50, 0xe7629ea1, 0xcec53d43,
+	0x9d8a7a87, 0x3b14f50f, 0x7629ea1e, 0xec53d43c,
+	0xd8a7a879, 0xb14f50f3, 0x629ea1e7, 0xc53d43ce,
+	0x8a7a879d, 0x14f50f3b, 0x29ea1e76, 0x53d43cec,
+	0xa7a879d8, 0x4f50f3b1, 0x9ea1e762, 0x3d43cec5
+};
+
+/*
+ * Transform the message X which consists of 16 32-bit-words. See
+ * GM/T 004-2012 for details.
+ */
+#define R(i, a, b, c, d, e, f, g, h, t, w1, w2)			\
+	do {							\
+		ss1 = rol32((rol32((a), 12) + (e) + (t)), 7);	\
+		ss2 = ss1 ^ rol32((a), 12);			\
+		d += FF ## i(a, b, c) + ss2 + ((w1) ^ (w2));	\
+		h += GG ## i(e, f, g) + ss1 + (w1);		\
+		b = rol32((b), 9);				\
+		f = rol32((f), 19);				\
+		h = P0((h));					\
+	} while (0)
+
+#define R1(a, b, c, d, e, f, g, h, t, w1, w2) \
+	R(1, a, b, c, d, e, f, g, h, t, w1, w2)
+#define R2(a, b, c, d, e, f, g, h, t, w1, w2) \
+	R(2, a, b, c, d, e, f, g, h, t, w1, w2)
+
+#define FF1(x, y, z)  (x ^ y ^ z)
+#define FF2(x, y, z)  ((x & y) | (x & z) | (y & z))
+
+#define GG1(x, y, z)  FF1(x, y, z)
+#define GG2(x, y, z)  ((x & y) | (~x & z))
+
+/* Message expansion */
+#define P0(x) ((x) ^ rol32((x), 9) ^ rol32((x), 17))
+#define P1(x) ((x) ^ rol32((x), 15) ^ rol32((x), 23))
+#define I(i)  (W[i] = get_unaligned_be32(data + i * 4))
+#define W1(i) (W[i & 0x0f])
+#define W2(i) (W[i & 0x0f] =				\
+		P1(W[i & 0x0f]				\
+			^ W[(i-9) & 0x0f]		\
+			^ rol32(W[(i-3) & 0x0f], 15))	\
+		^ rol32(W[(i-13) & 0x0f], 7)		\
+		^ W[(i-6) & 0x0f])
+
+static void sm3_transform(struct sm3_state *sctx, u8 const *data, u32 W[16])
+{
+	u32 a, b, c, d, e, f, g, h, ss1, ss2;
+
+	a = sctx->state[0];
+	b = sctx->state[1];
+	c = sctx->state[2];
+	d = sctx->state[3];
+	e = sctx->state[4];
+	f = sctx->state[5];
+	g = sctx->state[6];
+	h = sctx->state[7];
+
+	R1(a, b, c, d, e, f, g, h, K[0], I(0), I(4));
+	R1(d, a, b, c, h, e, f, g, K[1], I(1), I(5));
+	R1(c, d, a, b, g, h, e, f, K[2], I(2), I(6));
+	R1(b, c, d, a, f, g, h, e, K[3], I(3), I(7));
+	R1(a, b, c, d, e, f, g, h, K[4], W1(4), I(8));
+	R1(d, a, b, c, h, e, f, g, K[5], W1(5), I(9));
+	R1(c, d, a, b, g, h, e, f, K[6], W1(6), I(10));
+	R1(b, c, d, a, f, g, h, e, K[7], W1(7), I(11));
+	R1(a, b, c, d, e, f, g, h, K[8], W1(8), I(12));
+	R1(d, a, b, c, h, e, f, g, K[9], W1(9), I(13));
+	R1(c, d, a, b, g, h, e, f, K[10], W1(10), I(14));
+	R1(b, c, d, a, f, g, h, e, K[11], W1(11), I(15));
+	R1(a, b, c, d, e, f, g, h, K[12], W1(12), W2(16));
+	R1(d, a, b, c, h, e, f, g, K[13], W1(13), W2(17));
+	R1(c, d, a, b, g, h, e, f, K[14], W1(14), W2(18));
+	R1(b, c, d, a, f, g, h, e, K[15], W1(15), W2(19));
+
+	R2(a, b, c, d, e, f, g, h, K[16], W1(16), W2(20));
+	R2(d, a, b, c, h, e, f, g, K[17], W1(17), W2(21));
+	R2(c, d, a, b, g, h, e, f, K[18], W1(18), W2(22));
+	R2(b, c, d, a, f, g, h, e, K[19], W1(19), W2(23));
+	R2(a, b, c, d, e, f, g, h, K[20], W1(20), W2(24));
+	R2(d, a, b, c, h, e, f, g, K[21], W1(21), W2(25));
+	R2(c, d, a, b, g, h, e, f, K[22], W1(22), W2(26));
+	R2(b, c, d, a, f, g, h, e, K[23], W1(23), W2(27));
+	R2(a, b, c, d, e, f, g, h, K[24], W1(24), W2(28));
+	R2(d, a, b, c, h, e, f, g, K[25], W1(25), W2(29));
+	R2(c, d, a, b, g, h, e, f, K[26], W1(26), W2(30));
+	R2(b, c, d, a, f, g, h, e, K[27], W1(27), W2(31));
+	R2(a, b, c, d, e, f, g, h, K[28], W1(28), W2(32));
+	R2(d, a, b, c, h, e, f, g, K[29], W1(29), W2(33));
+	R2(c, d, a, b, g, h, e, f, K[30], W1(30), W2(34));
+	R2(b, c, d, a, f, g, h, e, K[31], W1(31), W2(35));
+
+	R2(a, b, c, d, e, f, g, h, K[32], W1(32), W2(36));
+	R2(d, a, b, c, h, e, f, g, K[33], W1(33), W2(37));
+	R2(c, d, a, b, g, h, e, f, K[34], W1(34), W2(38));
+	R2(b, c, d, a, f, g, h, e, K[35], W1(35), W2(39));
+	R2(a, b, c, d, e, f, g, h, K[36], W1(36), W2(40));
+	R2(d, a, b, c, h, e, f, g, K[37], W1(37), W2(41));
+	R2(c, d, a, b, g, h, e, f, K[38], W1(38), W2(42));
+	R2(b, c, d, a, f, g, h, e, K[39], W1(39), W2(43));
+	R2(a, b, c, d, e, f, g, h, K[40], W1(40), W2(44));
+	R2(d, a, b, c, h, e, f, g, K[41], W1(41), W2(45));
+	R2(c, d, a, b, g, h, e, f, K[42], W1(42), W2(46));
+	R2(b, c, d, a, f, g, h, e, K[43], W1(43), W2(47));
+	R2(a, b, c, d, e, f, g, h, K[44], W1(44), W2(48));
+	R2(d, a, b, c, h, e, f, g, K[45], W1(45), W2(49));
+	R2(c, d, a, b, g, h, e, f, K[46], W1(46), W2(50));
+	R2(b, c, d, a, f, g, h, e, K[47], W1(47), W2(51));
+
+	R2(a, b, c, d, e, f, g, h, K[48], W1(48), W2(52));
+	R2(d, a, b, c, h, e, f, g, K[49], W1(49), W2(53));
+	R2(c, d, a, b, g, h, e, f, K[50], W1(50), W2(54));
+	R2(b, c, d, a, f, g, h, e, K[51], W1(51), W2(55));
+	R2(a, b, c, d, e, f, g, h, K[52], W1(52), W2(56));
+	R2(d, a, b, c, h, e, f, g, K[53], W1(53), W2(57));
+	R2(c, d, a, b, g, h, e, f, K[54], W1(54), W2(58));
+	R2(b, c, d, a, f, g, h, e, K[55], W1(55), W2(59));
+	R2(a, b, c, d, e, f, g, h, K[56], W1(56), W2(60));
+	R2(d, a, b, c, h, e, f, g, K[57], W1(57), W2(61));
+	R2(c, d, a, b, g, h, e, f, K[58], W1(58), W2(62));
+	R2(b, c, d, a, f, g, h, e, K[59], W1(59), W2(63));
+	R2(a, b, c, d, e, f, g, h, K[60], W1(60), W2(64));
+	R2(d, a, b, c, h, e, f, g, K[61], W1(61), W2(65));
+	R2(c, d, a, b, g, h, e, f, K[62], W1(62), W2(66));
+	R2(b, c, d, a, f, g, h, e, K[63], W1(63), W2(67));
+
+	sctx->state[0] ^= a;
+	sctx->state[1] ^= b;
+	sctx->state[2] ^= c;
+	sctx->state[3] ^= d;
+	sctx->state[4] ^= e;
+	sctx->state[5] ^= f;
+	sctx->state[6] ^= g;
+	sctx->state[7] ^= h;
+}
+#undef R
+#undef R1
+#undef R2
+#undef I
+#undef W1
+#undef W2
+
+static inline void sm3_block(struct sm3_state *sctx,
+		u8 const *data, int blocks, u32 W[16])
+{
+	while (blocks--) {
+		sm3_transform(sctx, data, W);
+		data += SM3_BLOCK_SIZE;
+	}
+}
+
+void sm3_update(struct sm3_state *sctx, const u8 *data, unsigned int len)
+{
+	unsigned int partial = sctx->count % SM3_BLOCK_SIZE;
+	u32 W[16];
+
+	sctx->count += len;
+
+	if ((partial + len) >= SM3_BLOCK_SIZE) {
+		int blocks;
+
+		if (partial) {
+			int p = SM3_BLOCK_SIZE - partial;
+
+			memcpy(sctx->buffer + partial, data, p);
+			data += p;
+			len -= p;
+
+			sm3_block(sctx, sctx->buffer, 1, W);
+		}
+
+		blocks = len / SM3_BLOCK_SIZE;
+		len %= SM3_BLOCK_SIZE;
+
+		if (blocks) {
+			sm3_block(sctx, data, blocks, W);
+			data += blocks * SM3_BLOCK_SIZE;
+		}
+
+		memzero_explicit(W, sizeof(W));
+
+		partial = 0;
+	}
+	if (len)
+		memcpy(sctx->buffer + partial, data, len);
+}
+EXPORT_SYMBOL_GPL(sm3_update);
+
+void sm3_final(struct sm3_state *sctx, u8 *out)
+{
+	const int bit_offset = SM3_BLOCK_SIZE - sizeof(u64);
+	__be64 *bits = (__be64 *)(sctx->buffer + bit_offset);
+	__be32 *digest = (__be32 *)out;
+	unsigned int partial = sctx->count % SM3_BLOCK_SIZE;
+	u32 W[16];
+	int i;
+
+	sctx->buffer[partial++] = 0x80;
+	if (partial > bit_offset) {
+		memset(sctx->buffer + partial, 0, SM3_BLOCK_SIZE - partial);
+		partial = 0;
+
+		sm3_block(sctx, sctx->buffer, 1, W);
+	}
+
+	memset(sctx->buffer + partial, 0, bit_offset - partial);
+	*bits = cpu_to_be64(sctx->count << 3);
+	sm3_block(sctx, sctx->buffer, 1, W);
+
+	for (i = 0; i < 8; i++)
+		put_unaligned_be32(sctx->state[i], digest++);
+
+	/* Zeroize sensitive information. */
+	memzero_explicit(W, sizeof(W));
+	memzero_explicit(sctx, sizeof(*sctx));
+}
+EXPORT_SYMBOL_GPL(sm3_final);
+
+MODULE_DESCRIPTION("Generic SM3 library");
+MODULE_LICENSE("GPL v2");
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 2/6] crypto: arm64/sm3-ce - make dependent on sm3 library
  2021-12-20  8:22 [PATCH 0/6] Introduce x86 assembly accelerated implementation for SM3 algorithm Tianjia Zhang
  2021-12-20  8:22 ` [PATCH 1/6] crypto: sm3 - create SM3 stand-alone library Tianjia Zhang
@ 2021-12-20  8:22 ` Tianjia Zhang
  2021-12-20  8:22 ` [PATCH 3/6] crypto: sm2 " Tianjia Zhang
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Tianjia Zhang @ 2021-12-20  8:22 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, Vitaly Chikunov, Eric Biggers,
	Eric Biggers, Gilad Ben-Yossef, Ard Biesheuvel, Jussi Kivilinna,
	Catalin Marinas, Will Deacon, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, linux-crypto, x86,
	linux-arm-kernel, linux-kernel
  Cc: Tianjia Zhang

SM3 generic library is stand-alone implementation, sm3-ce can depend
on the SM3 library instead of sm3-generic.

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
 arch/arm64/crypto/Kconfig       |  2 +-
 arch/arm64/crypto/sm3-ce-glue.c | 20 ++++++++++++++------
 2 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index addfa413650b..2a965aa0188d 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -45,7 +45,7 @@ config CRYPTO_SM3_ARM64_CE
 	tristate "SM3 digest algorithm (ARMv8.2 Crypto Extensions)"
 	depends on KERNEL_MODE_NEON
 	select CRYPTO_HASH
-	select CRYPTO_SM3
+	select CRYPTO_LIB_SM3
 
 config CRYPTO_SM4_ARM64_CE
 	tristate "SM4 symmetric cipher (ARMv8.2 Crypto Extensions)"
diff --git a/arch/arm64/crypto/sm3-ce-glue.c b/arch/arm64/crypto/sm3-ce-glue.c
index d71faca322f2..3198f31c9446 100644
--- a/arch/arm64/crypto/sm3-ce-glue.c
+++ b/arch/arm64/crypto/sm3-ce-glue.c
@@ -27,7 +27,7 @@ static int sm3_ce_update(struct shash_desc *desc, const u8 *data,
 			 unsigned int len)
 {
 	if (!crypto_simd_usable())
-		return crypto_sm3_update(desc, data, len);
+		return sm3_update(shash_desc_ctx(desc), data, len);
 
 	kernel_neon_begin();
 	sm3_base_do_update(desc, data, len, sm3_ce_transform);
@@ -39,7 +39,7 @@ static int sm3_ce_update(struct shash_desc *desc, const u8 *data,
 static int sm3_ce_final(struct shash_desc *desc, u8 *out)
 {
 	if (!crypto_simd_usable())
-		return crypto_sm3_finup(desc, NULL, 0, out);
+		return sm3_final(shash_desc_ctx(desc), out);
 
 	kernel_neon_begin();
 	sm3_base_do_finalize(desc, sm3_ce_transform);
@@ -51,14 +51,22 @@ static int sm3_ce_final(struct shash_desc *desc, u8 *out)
 static int sm3_ce_finup(struct shash_desc *desc, const u8 *data,
 			unsigned int len, u8 *out)
 {
-	if (!crypto_simd_usable())
-		return crypto_sm3_finup(desc, data, len, out);
+	if (!crypto_simd_usable()) {
+		struct sm3_state *sctx = shash_desc_ctx(desc);
+
+		if (len)
+			sm3_update(sctx, data, len);
+		sm3_final(sctx, out);
+		return 0;
+	}
 
 	kernel_neon_begin();
-	sm3_base_do_update(desc, data, len, sm3_ce_transform);
+	if (len)
+		sm3_base_do_update(desc, data, len, sm3_ce_transform);
+	sm3_base_do_finalize(desc, sm3_ce_transform);
 	kernel_neon_end();
 
-	return sm3_ce_final(desc, out);
+	return sm3_base_finish(desc, out);
 }
 
 static struct shash_alg sm3_alg = {
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 3/6] crypto: sm2 - make dependent on sm3 library
  2021-12-20  8:22 [PATCH 0/6] Introduce x86 assembly accelerated implementation for SM3 algorithm Tianjia Zhang
  2021-12-20  8:22 ` [PATCH 1/6] crypto: sm3 - create SM3 stand-alone library Tianjia Zhang
  2021-12-20  8:22 ` [PATCH 2/6] crypto: arm64/sm3-ce - make dependent on sm3 library Tianjia Zhang
@ 2021-12-20  8:22 ` Tianjia Zhang
  2021-12-20  8:22 ` [PATCH 4/6] crypto: sm3 " Tianjia Zhang
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Tianjia Zhang @ 2021-12-20  8:22 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, Vitaly Chikunov, Eric Biggers,
	Eric Biggers, Gilad Ben-Yossef, Ard Biesheuvel, Jussi Kivilinna,
	Catalin Marinas, Will Deacon, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, linux-crypto, x86,
	linux-arm-kernel, linux-kernel
  Cc: Tianjia Zhang

SM3 generic library is stand-alone implementation, it is necessary
for the calculation of sm2 z digest to depends on SM3 library
instead of sm3-generic.

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
 crypto/Kconfig |  2 +-
 crypto/sm2.c   | 36 ++++++++++++++++++------------------
 2 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/crypto/Kconfig b/crypto/Kconfig
index 01b9ca0836a5..60b252975dc4 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -267,7 +267,7 @@ config CRYPTO_ECRDSA
 
 config CRYPTO_SM2
 	tristate "SM2 algorithm"
-	select CRYPTO_SM3
+	select CRYPTO_LIB_SM3
 	select CRYPTO_AKCIPHER
 	select CRYPTO_MANAGER
 	select MPILIB
diff --git a/crypto/sm2.c b/crypto/sm2.c
index db8a4a265669..97c451e6c920 100644
--- a/crypto/sm2.c
+++ b/crypto/sm2.c
@@ -213,7 +213,7 @@ int sm2_get_signature_s(void *context, size_t hdrlen, unsigned char tag,
 	return 0;
 }
 
-static int sm2_z_digest_update(struct shash_desc *desc,
+static int sm2_z_digest_update(struct sm3_state *sctx,
 			MPI m, unsigned int pbytes)
 {
 	static const unsigned char zero[32];
@@ -226,20 +226,20 @@ static int sm2_z_digest_update(struct shash_desc *desc,
 
 	if (inlen < pbytes) {
 		/* padding with zero */
-		crypto_sm3_update(desc, zero, pbytes - inlen);
-		crypto_sm3_update(desc, in, inlen);
+		sm3_update(sctx, zero, pbytes - inlen);
+		sm3_update(sctx, in, inlen);
 	} else if (inlen > pbytes) {
 		/* skip the starting zero */
-		crypto_sm3_update(desc, in + inlen - pbytes, pbytes);
+		sm3_update(sctx, in + inlen - pbytes, pbytes);
 	} else {
-		crypto_sm3_update(desc, in, inlen);
+		sm3_update(sctx, in, inlen);
 	}
 
 	kfree(in);
 	return 0;
 }
 
-static int sm2_z_digest_update_point(struct shash_desc *desc,
+static int sm2_z_digest_update_point(struct sm3_state *sctx,
 		MPI_POINT point, struct mpi_ec_ctx *ec, unsigned int pbytes)
 {
 	MPI x, y;
@@ -249,8 +249,8 @@ static int sm2_z_digest_update_point(struct shash_desc *desc,
 	y = mpi_new(0);
 
 	if (!mpi_ec_get_affine(x, y, point, ec) &&
-		!sm2_z_digest_update(desc, x, pbytes) &&
-		!sm2_z_digest_update(desc, y, pbytes))
+		!sm2_z_digest_update(sctx, x, pbytes) &&
+		!sm2_z_digest_update(sctx, y, pbytes))
 		ret = 0;
 
 	mpi_free(x);
@@ -265,7 +265,7 @@ int sm2_compute_z_digest(struct crypto_akcipher *tfm,
 	struct mpi_ec_ctx *ec = akcipher_tfm_ctx(tfm);
 	uint16_t bits_len;
 	unsigned char entl[2];
-	SHASH_DESC_ON_STACK(desc, NULL);
+	struct sm3_state sctx;
 	unsigned int pbytes;
 
 	if (id_len > (USHRT_MAX / 8) || !ec->Q)
@@ -278,17 +278,17 @@ int sm2_compute_z_digest(struct crypto_akcipher *tfm,
 	pbytes = MPI_NBYTES(ec->p);
 
 	/* ZA = H256(ENTLA | IDA | a | b | xG | yG | xA | yA) */
-	sm3_base_init(desc);
-	crypto_sm3_update(desc, entl, 2);
-	crypto_sm3_update(desc, id, id_len);
-
-	if (sm2_z_digest_update(desc, ec->a, pbytes) ||
-		sm2_z_digest_update(desc, ec->b, pbytes) ||
-		sm2_z_digest_update_point(desc, ec->G, ec, pbytes) ||
-		sm2_z_digest_update_point(desc, ec->Q, ec, pbytes))
+	sm3_init(&sctx);
+	sm3_update(&sctx, entl, 2);
+	sm3_update(&sctx, id, id_len);
+
+	if (sm2_z_digest_update(&sctx, ec->a, pbytes) ||
+		sm2_z_digest_update(&sctx, ec->b, pbytes) ||
+		sm2_z_digest_update_point(&sctx, ec->G, ec, pbytes) ||
+		sm2_z_digest_update_point(&sctx, ec->Q, ec, pbytes))
 		return -EINVAL;
 
-	crypto_sm3_final(desc, dgst);
+	sm3_final(&sctx, dgst);
 	return 0;
 }
 EXPORT_SYMBOL(sm2_compute_z_digest);
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 4/6] crypto: sm3 - make dependent on sm3 library
  2021-12-20  8:22 [PATCH 0/6] Introduce x86 assembly accelerated implementation for SM3 algorithm Tianjia Zhang
                   ` (2 preceding siblings ...)
  2021-12-20  8:22 ` [PATCH 3/6] crypto: sm2 " Tianjia Zhang
@ 2021-12-20  8:22 ` Tianjia Zhang
  2021-12-20  8:22 ` [PATCH 5/6] crypto: x86/sm3 - add AVX assembly implementation Tianjia Zhang
  2021-12-20  8:22 ` [PATCH 6/6] crypto: tcrypt - add asynchronous speed test for SM3 Tianjia Zhang
  5 siblings, 0 replies; 9+ messages in thread
From: Tianjia Zhang @ 2021-12-20  8:22 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, Vitaly Chikunov, Eric Biggers,
	Eric Biggers, Gilad Ben-Yossef, Ard Biesheuvel, Jussi Kivilinna,
	Catalin Marinas, Will Deacon, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, linux-crypto, x86,
	linux-arm-kernel, linux-kernel
  Cc: Tianjia Zhang

SM3 generic library is stand-alone implementation, it is necessary
making the sm3-generic implementation to depends on SM3 library.
The functions crypto_sm3_*() provided by sm3_generic is no longer
exported.

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
 crypto/Kconfig       |   1 +
 crypto/sm3_generic.c | 142 +++++--------------------------------------
 include/crypto/sm3.h |  10 ---
 3 files changed, 16 insertions(+), 137 deletions(-)

diff --git a/crypto/Kconfig b/crypto/Kconfig
index 60b252975dc4..8e780bf66c0a 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -999,6 +999,7 @@ config CRYPTO_SHA3
 config CRYPTO_SM3
 	tristate "SM3 digest algorithm"
 	select CRYPTO_HASH
+	select CRYPTO_LIB_SM3
 	help
 	  SM3 secure hash function as defined by OSCCA GM/T 0004-2012 SM3).
 	  It is part of the Chinese Commercial Cryptography suite.
diff --git a/crypto/sm3_generic.c b/crypto/sm3_generic.c
index 193c4584bd00..a215c1c37e73 100644
--- a/crypto/sm3_generic.c
+++ b/crypto/sm3_generic.c
@@ -5,6 +5,7 @@
  *
  * Copyright (C) 2017 ARM Limited or its affiliates.
  * Written by Gilad Ben-Yossef <gilad@benyossef.com>
+ * Copyright (C) 2021 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
  */
 
 #include <crypto/internal/hash.h>
@@ -26,143 +27,29 @@ const u8 sm3_zero_message_hash[SM3_DIGEST_SIZE] = {
 };
 EXPORT_SYMBOL_GPL(sm3_zero_message_hash);
 
-static inline u32 p0(u32 x)
-{
-	return x ^ rol32(x, 9) ^ rol32(x, 17);
-}
-
-static inline u32 p1(u32 x)
-{
-	return x ^ rol32(x, 15) ^ rol32(x, 23);
-}
-
-static inline u32 ff(unsigned int n, u32 a, u32 b, u32 c)
-{
-	return (n < 16) ? (a ^ b ^ c) : ((a & b) | (a & c) | (b & c));
-}
-
-static inline u32 gg(unsigned int n, u32 e, u32 f, u32 g)
-{
-	return (n < 16) ? (e ^ f ^ g) : ((e & f) | ((~e) & g));
-}
-
-static inline u32 t(unsigned int n)
-{
-	return (n < 16) ? SM3_T1 : SM3_T2;
-}
-
-static void sm3_expand(u32 *t, u32 *w, u32 *wt)
-{
-	int i;
-	unsigned int tmp;
-
-	/* load the input */
-	for (i = 0; i <= 15; i++)
-		w[i] = get_unaligned_be32((__u32 *)t + i);
-
-	for (i = 16; i <= 67; i++) {
-		tmp = w[i - 16] ^ w[i - 9] ^ rol32(w[i - 3], 15);
-		w[i] = p1(tmp) ^ (rol32(w[i - 13], 7)) ^ w[i - 6];
-	}
-
-	for (i = 0; i <= 63; i++)
-		wt[i] = w[i] ^ w[i + 4];
-}
-
-static void sm3_compress(u32 *w, u32 *wt, u32 *m)
-{
-	u32 ss1;
-	u32 ss2;
-	u32 tt1;
-	u32 tt2;
-	u32 a, b, c, d, e, f, g, h;
-	int i;
-
-	a = m[0];
-	b = m[1];
-	c = m[2];
-	d = m[3];
-	e = m[4];
-	f = m[5];
-	g = m[6];
-	h = m[7];
-
-	for (i = 0; i <= 63; i++) {
-
-		ss1 = rol32((rol32(a, 12) + e + rol32(t(i), i & 31)), 7);
-
-		ss2 = ss1 ^ rol32(a, 12);
-
-		tt1 = ff(i, a, b, c) + d + ss2 + *wt;
-		wt++;
-
-		tt2 = gg(i, e, f, g) + h + ss1 + *w;
-		w++;
-
-		d = c;
-		c = rol32(b, 9);
-		b = a;
-		a = tt1;
-		h = g;
-		g = rol32(f, 19);
-		f = e;
-		e = p0(tt2);
-	}
-
-	m[0] = a ^ m[0];
-	m[1] = b ^ m[1];
-	m[2] = c ^ m[2];
-	m[3] = d ^ m[3];
-	m[4] = e ^ m[4];
-	m[5] = f ^ m[5];
-	m[6] = g ^ m[6];
-	m[7] = h ^ m[7];
-
-	a = b = c = d = e = f = g = h = ss1 = ss2 = tt1 = tt2 = 0;
-}
-
-static void sm3_transform(struct sm3_state *sst, u8 const *src)
-{
-	unsigned int w[68];
-	unsigned int wt[64];
-
-	sm3_expand((u32 *)src, w, wt);
-	sm3_compress(w, wt, sst->state);
-
-	memzero_explicit(w, sizeof(w));
-	memzero_explicit(wt, sizeof(wt));
-}
-
-static void sm3_generic_block_fn(struct sm3_state *sst, u8 const *src,
-				    int blocks)
-{
-	while (blocks--) {
-		sm3_transform(sst, src);
-		src += SM3_BLOCK_SIZE;
-	}
-}
-
-int crypto_sm3_update(struct shash_desc *desc, const u8 *data,
+static int crypto_sm3_update(struct shash_desc *desc, const u8 *data,
 			  unsigned int len)
 {
-	return sm3_base_do_update(desc, data, len, sm3_generic_block_fn);
+	sm3_update(shash_desc_ctx(desc), data, len);
+	return 0;
 }
-EXPORT_SYMBOL(crypto_sm3_update);
 
-int crypto_sm3_final(struct shash_desc *desc, u8 *out)
+static int crypto_sm3_final(struct shash_desc *desc, u8 *out)
 {
-	sm3_base_do_finalize(desc, sm3_generic_block_fn);
-	return sm3_base_finish(desc, out);
+	sm3_final(shash_desc_ctx(desc), out);
+	return 0;
 }
-EXPORT_SYMBOL(crypto_sm3_final);
 
-int crypto_sm3_finup(struct shash_desc *desc, const u8 *data,
+static int crypto_sm3_finup(struct shash_desc *desc, const u8 *data,
 			unsigned int len, u8 *hash)
 {
-	sm3_base_do_update(desc, data, len, sm3_generic_block_fn);
-	return crypto_sm3_final(desc, hash);
+	struct sm3_state *sctx = shash_desc_ctx(desc);
+
+	if (len)
+		sm3_update(sctx, data, len);
+	sm3_final(sctx, hash);
+	return 0;
 }
-EXPORT_SYMBOL(crypto_sm3_finup);
 
 static struct shash_alg sm3_alg = {
 	.digestsize	=	SM3_DIGEST_SIZE,
@@ -174,6 +61,7 @@ static struct shash_alg sm3_alg = {
 	.base		=	{
 		.cra_name	 =	"sm3",
 		.cra_driver_name =	"sm3-generic",
+		.cra_priority	=	100,
 		.cra_blocksize	 =	SM3_BLOCK_SIZE,
 		.cra_module	 =	THIS_MODULE,
 	}
diff --git a/include/crypto/sm3.h b/include/crypto/sm3.h
index 315adaf38007..56b50330ef0a 100644
--- a/include/crypto/sm3.h
+++ b/include/crypto/sm3.h
@@ -34,16 +34,6 @@ struct sm3_state {
 	u8 buffer[SM3_BLOCK_SIZE];
 };
 
-struct shash_desc;
-
-extern int crypto_sm3_update(struct shash_desc *desc, const u8 *data,
-			      unsigned int len);
-
-extern int crypto_sm3_final(struct shash_desc *desc, u8 *out);
-
-extern int crypto_sm3_finup(struct shash_desc *desc, const u8 *data,
-			     unsigned int len, u8 *hash);
-
 /*
  * Stand-alone implementation of the SM3 algorithm. It is designed to
  * have as little dependencies as possible so it can be used in the
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 5/6] crypto: x86/sm3 - add AVX assembly implementation
  2021-12-20  8:22 [PATCH 0/6] Introduce x86 assembly accelerated implementation for SM3 algorithm Tianjia Zhang
                   ` (3 preceding siblings ...)
  2021-12-20  8:22 ` [PATCH 4/6] crypto: sm3 " Tianjia Zhang
@ 2021-12-20  8:22 ` Tianjia Zhang
  2021-12-20 18:03   ` Jussi Kivilinna
  2021-12-20  8:22 ` [PATCH 6/6] crypto: tcrypt - add asynchronous speed test for SM3 Tianjia Zhang
  5 siblings, 1 reply; 9+ messages in thread
From: Tianjia Zhang @ 2021-12-20  8:22 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, Vitaly Chikunov, Eric Biggers,
	Eric Biggers, Gilad Ben-Yossef, Ard Biesheuvel, Jussi Kivilinna,
	Catalin Marinas, Will Deacon, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, linux-crypto, x86,
	linux-arm-kernel, linux-kernel
  Cc: Tianjia Zhang

This patch adds AVX assembly accelerated implementation of SM3 secure
hash algorithm. From the benchmark data, compared to pure software
implementation sm3-generic, the performance increase is up to 38%.

The main algorithm implementation based on SM3 AES/BMI2 accelerated
work by libgcrypt at:
https://gnupg.org/software/libgcrypt/index.html

Benchmark on Intel i5-6200U 2.30GHz, performance data of two
implementations, pure software sm3-generic and sm3-avx acceleration.
The data comes from the 326 mode and 422 mode of tcrypt. The abscissas
are different lengths of per update. The data is tabulated and the
unit is Mb/s:

update-size |     16      64     256    1024    2048    4096    8192
--------------------------------------------------------------------
sm3-generic | 105.97  129.60  182.12  189.62  188.06  193.66  194.88
sm3-avx     | 119.87  163.05  244.44  260.92  257.60  264.87  265.88

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
 arch/x86/crypto/Makefile         |   3 +
 arch/x86/crypto/sm3-avx-asm_64.S | 521 +++++++++++++++++++++++++++++++
 arch/x86/crypto/sm3_avx_glue.c   | 134 ++++++++
 crypto/Kconfig                   |  13 +
 4 files changed, 671 insertions(+)
 create mode 100644 arch/x86/crypto/sm3-avx-asm_64.S
 create mode 100644 arch/x86/crypto/sm3_avx_glue.c

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index f307c93fc90a..7cbe860f6201 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -88,6 +88,9 @@ nhpoly1305-avx2-y := nh-avx2-x86_64.o nhpoly1305-avx2-glue.o
 
 obj-$(CONFIG_CRYPTO_CURVE25519_X86) += curve25519-x86_64.o
 
+obj-$(CONFIG_CRYPTO_SM3_AVX_X86_64) += sm3-avx-x86_64.o
+sm3-avx-x86_64-y := sm3-avx-asm_64.o sm3_avx_glue.o
+
 obj-$(CONFIG_CRYPTO_SM4_AESNI_AVX_X86_64) += sm4-aesni-avx-x86_64.o
 sm4-aesni-avx-x86_64-y := sm4-aesni-avx-asm_64.o sm4_aesni_avx_glue.o
 
diff --git a/arch/x86/crypto/sm3-avx-asm_64.S b/arch/x86/crypto/sm3-avx-asm_64.S
new file mode 100644
index 000000000000..e7a9a37f3609
--- /dev/null
+++ b/arch/x86/crypto/sm3-avx-asm_64.S
@@ -0,0 +1,521 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * SM3 AVX accelerated transform.
+ * specified in: https://datatracker.ietf.org/doc/html/draft-sca-cfrg-sm3-02
+ *
+ * Copyright (C) 2021 Jussi Kivilinna <jussi.kivilinna@iki.fi>
+ * Copyright (C) 2021 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+/* Based on SM3 AES/BMI2 accelerated work by libgcrypt at:
+ *  https://gnupg.org/software/libgcrypt/index.html
+ */
+
+#include <linux/linkage.h>
+#include <asm/frame.h>
+
+/* Context structure */
+
+#define state_h0 0
+#define state_h1 4
+#define state_h2 8
+#define state_h3 12
+#define state_h4 16
+#define state_h5 20
+#define state_h6 24
+#define state_h7 28
+
+/* Constants */
+
+/* Round constant macros */
+
+#define K0  0x79cc4519
+#define K1  0xf3988a32
+#define K2  0xe7311465
+#define K3  0xce6228cb
+#define K4  0x9cc45197
+#define K5  0x3988a32f
+#define K6  0x7311465e
+#define K7  0xe6228cbc
+#define K8  0xcc451979
+#define K9  0x988a32f3
+#define K10 0x311465e7
+#define K11 0x6228cbce
+#define K12 0xc451979c
+#define K13 0x88a32f39
+#define K14 0x11465e73
+#define K15 0x228cbce6
+#define K16 0x9d8a7a87
+#define K17 0x3b14f50f
+#define K18 0x7629ea1e
+#define K19 0xec53d43c
+#define K20 0xd8a7a879
+#define K21 0xb14f50f3
+#define K22 0x629ea1e7
+#define K23 0xc53d43ce
+#define K24 0x8a7a879d
+#define K25 0x14f50f3b
+#define K26 0x29ea1e76
+#define K27 0x53d43cec
+#define K28 0xa7a879d8
+#define K29 0x4f50f3b1
+#define K30 0x9ea1e762
+#define K31 0x3d43cec5
+#define K32 0x7a879d8a
+#define K33 0xf50f3b14
+#define K34 0xea1e7629
+#define K35 0xd43cec53
+#define K36 0xa879d8a7
+#define K37 0x50f3b14f
+#define K38 0xa1e7629e
+#define K39 0x43cec53d
+#define K40 0x879d8a7a
+#define K41 0x0f3b14f5
+#define K42 0x1e7629ea
+#define K43 0x3cec53d4
+#define K44 0x79d8a7a8
+#define K45 0xf3b14f50
+#define K46 0xe7629ea1
+#define K47 0xcec53d43
+#define K48 0x9d8a7a87
+#define K49 0x3b14f50f
+#define K50 0x7629ea1e
+#define K51 0xec53d43c
+#define K52 0xd8a7a879
+#define K53 0xb14f50f3
+#define K54 0x629ea1e7
+#define K55 0xc53d43ce
+#define K56 0x8a7a879d
+#define K57 0x14f50f3b
+#define K58 0x29ea1e76
+#define K59 0x53d43cec
+#define K60 0xa7a879d8
+#define K61 0x4f50f3b1
+#define K62 0x9ea1e762
+#define K63 0x3d43cec5
+
+/* Register macros */
+
+#define RSTATE %rdi
+#define RDATA  %rsi
+#define RNBLKS %rdx
+
+#define t0 %eax
+#define t1 %ebx
+#define t2 %ecx
+
+#define a %r8d
+#define b %r9d
+#define c %r10d
+#define d %r11d
+#define e %r12d
+#define f %r13d
+#define g %r14d
+#define h %r15d
+
+#define W0 %xmm0
+#define W1 %xmm1
+#define W2 %xmm2
+#define W3 %xmm3
+#define W4 %xmm4
+#define W5 %xmm5
+
+#define XTMP0 %xmm6
+#define XTMP1 %xmm7
+#define XTMP2 %xmm8
+#define XTMP3 %xmm9
+#define XTMP4 %xmm10
+#define XTMP5 %xmm11
+#define XTMP6 %xmm12
+
+#define BSWAP_REG %xmm15
+
+/* Stack structure */
+
+#define STACK_W_SIZE        (32 * 2 * 3)
+#define STACK_REG_SAVE_SIZE (64)
+
+#define STACK_W             (0)
+#define STACK_REG_SAVE      (STACK_W + STACK_W_SIZE)
+#define STACK_SIZE          (STACK_REG_SAVE + STACK_REG_SAVE_SIZE)
+
+/* Instruction helpers. */
+
+#define roll2(v, reg) \
+	roll $(v), reg;
+
+#define roll3mov(v, src, dst) \
+	movl src, dst; \
+	roll $(v), dst;
+
+#define roll3(v, src, dst) \
+	rorxl $(32-(v)), src, dst;
+
+#define addl2(a, out) \
+	leal (a, out), out;
+
+#define addl3(a, b, out) \
+	leal (b, a), out;
+
+/* Round function macros. */
+
+#define GG1(x, y, z, o, t) \
+	movl x, o;         \
+	xorl y, o;         \
+	xorl z, o;
+
+#define FF1(x, y, z, o, t) GG1(x, y, z, o, t)
+
+#define GG2(x, y, z, o, t) \
+	andnl z, x, o;     \
+	movl y, t;         \
+	andl x, t;         \
+	addl2(t, o);
+
+#define FF2(x, y, z, o, t) \
+	movl y, o;         \
+	xorl x, o;         \
+	movl y, t;         \
+	andl x, t;         \
+	andl z, o;         \
+	xorl t, o;
+
+#define R(i, a, b, c, d, e, f, g, h, round, widx, wtype)                      \
+	/* rol(a, 12) => t0 */                                                \
+	roll3mov(12, a, t0); /* rorxl here would reduce perf by 6% on zen3 */ \
+	/* rol (t0 + e + t), 7) => t1 */                                      \
+	addl3(t0, e, t1);                                                     \
+	addl $K##round, t1;                                                   \
+	roll2(7, t1);                                                         \
+	/* h + w1 => h */                                                     \
+	addl wtype##_W1_ADDR(round, widx), h;                                 \
+	/* h + t1 => h */                                                     \
+	addl2(t1, h);                                                         \
+	/* t1 ^ t0 => t0 */                                                   \
+	xorl t1, t0;                                                          \
+	/* w1w2 + d => d */                                                   \
+	addl wtype##_W1W2_ADDR(round, widx), d;                               \
+	/* FF##i(a,b,c) => t1 */                                              \
+	FF##i(a, b, c, t1, t2);                                               \
+	/* d + t1 => d */                                                     \
+	addl2(t1, d);                                                         \
+	/* GG#i(e,f,g) => t2 */                                               \
+	GG##i(e, f, g, t2, t1);                                               \
+	/* h + t2 => h */                                                     \
+	addl2(t2, h);                                                         \
+	/* rol (f, 19) => f */                                                \
+	roll2(19, f);                                                         \
+	/* d + t0 => d */                                                     \
+	addl2(t0, d);                                                         \
+	/* rol (b, 9) => b */                                                 \
+	roll2(9, b);                                                          \
+	/* P0(h) => h */                                                      \
+	roll3(9, h, t2);                                                      \
+	roll3(17, h, t1);                                                     \
+	xorl t2, h;                                                           \
+	xorl t1, h;
+
+#define R1(a, b, c, d, e, f, g, h, round, widx, wtype) \
+	R(1, a, b, c, d, e, f, g, h, round, widx, wtype)
+
+#define R2(a, b, c, d, e, f, g, h, round, widx, wtype) \
+	R(2, a, b, c, d, e, f, g, h, round, widx, wtype)
+
+/* Input expansion macros. */
+
+/* Byte-swapped input address. */
+#define IW_W_ADDR(round, widx, offs) \
+	(STACK_W + ((round) / 4) * 64 + (offs) + ((widx) * 4))(%rsp)
+
+/* Expanded input address. */
+#define XW_W_ADDR(round, widx, offs) \
+	(STACK_W + ((((round) / 3) - 4) % 2) * 64 + (offs) + ((widx) * 4))(%rsp)
+
+/* Rounds 1-12, byte-swapped input block addresses. */
+#define IW_W1_ADDR(round, widx)   IW_W_ADDR(round, widx, 0)
+#define IW_W1W2_ADDR(round, widx) IW_W_ADDR(round, widx, 32)
+
+/* Rounds 1-12, expanded input block addresses. */
+#define XW_W1_ADDR(round, widx)   XW_W_ADDR(round, widx, 0)
+#define XW_W1W2_ADDR(round, widx) XW_W_ADDR(round, widx, 32)
+
+/* Input block loading. */
+#define LOAD_W_XMM_1()                                              \
+	vmovdqu 0*16(RDATA), XTMP0; /* XTMP0: w3, w2, w1, w0 */     \
+	vmovdqu 1*16(RDATA), XTMP1; /* XTMP1: w7, w6, w5, w4 */     \
+	vmovdqu 2*16(RDATA), XTMP2; /* XTMP2: w11, w10, w9, w8 */   \
+	vmovdqu 3*16(RDATA), XTMP3; /* XTMP3: w15, w14, w13, w12 */ \
+	vpshufb BSWAP_REG, XTMP0, XTMP0;                            \
+	vpshufb BSWAP_REG, XTMP1, XTMP1;                            \
+	vpshufb BSWAP_REG, XTMP2, XTMP2;                            \
+	vpshufb BSWAP_REG, XTMP3, XTMP3;                            \
+	vpxor XTMP0, XTMP1, XTMP4;                                  \
+	vpxor XTMP1, XTMP2, XTMP5;                                  \
+	vpxor XTMP2, XTMP3, XTMP6;                                  \
+	leaq 64(RDATA), RDATA;                                      \
+	vmovdqa XTMP0, IW_W1_ADDR(0, 0);                            \
+	vmovdqa XTMP4, IW_W1W2_ADDR(0, 0);                          \
+	vmovdqa XTMP1, IW_W1_ADDR(4, 0);                            \
+	vmovdqa XTMP5, IW_W1W2_ADDR(4, 0);
+
+#define LOAD_W_XMM_2()                     \
+	vmovdqa XTMP2, IW_W1_ADDR(8, 0);   \
+	vmovdqa XTMP6, IW_W1W2_ADDR(8, 0);
+
+#define LOAD_W_XMM_3()                                              \
+	vpshufd $0b00000000, XTMP0, W0; /* W0: xx, w0, xx, xx */    \
+	vpshufd $0b11111001, XTMP0, W1; /* W1: xx, w3, w2, w1 */    \
+	vmovdqa XTMP1, W2;              /* W2: xx, w6, w5, w4 */    \
+	vpalignr $12, XTMP1, XTMP2, W3; /* W3: xx, w9, w8, w7 */    \
+	vpalignr $8, XTMP2, XTMP3, W4;  /* W4: xx, w12, w11, w10 */ \
+	vpshufd $0b11111001, XTMP3, W5; /* W5: xx, w15, w14, w13 */
+
+/* Message scheduling. Note: 3 words per XMM register. */
+#define SCHED_W_0(round, w0, w1, w2, w3, w4, w5)                    \
+	/* Load (w[i - 16]) => XTMP0 */                             \
+	vpshufd $0b10111111, w0, XTMP0;                             \
+	vpalignr $12, XTMP0, w1, XTMP0; /* XTMP0: xx, w2, w1, w0 */ \
+	/* Load (w[i - 13]) => XTMP1 */                             \
+	vpshufd $0b10111111, w1, XTMP1;                             \
+	vpalignr $12, XTMP1, w2, XTMP1;                             \
+	/* w[i - 9] == w3 */                                        \
+	/* XMM3 ^ XTMP0 => XTMP0 */                                 \
+	vpxor w3, XTMP0, XTMP0;
+
+#define SCHED_W_1(round, w0, w1, w2, w3, w4, w5) \
+	/* w[i - 3] == w5 */                     \
+	/* rol(XMM5, 15) ^ XTMP0 => XTMP0 */     \
+	vpslld $15, w5, XTMP2;                   \
+	vpsrld $(32-15), w5, XTMP3;              \
+	vpxor XTMP2, XTMP3, XTMP3;               \
+	vpxor XTMP3, XTMP0, XTMP0;               \
+	/* rol(XTMP1, 7) => XTMP1 */             \
+	vpslld $7, XTMP1, XTMP5;                 \
+	vpsrld $(32-7), XTMP1, XTMP1;            \
+	vpxor XTMP5, XTMP1, XTMP1;               \
+	/* XMM4 ^ XTMP1 => XTMP1 */              \
+	vpxor w4, XTMP1, XTMP1;                  \
+	/* w[i - 6] == XMM4 */                   \
+	/* P1(XTMP0) ^ XTMP1 => XMM0 */          \
+	vpslld $15, XTMP0, XTMP5;                \
+	vpsrld $(32-15), XTMP0, XTMP6;           \
+	vpslld $23, XTMP0, XTMP2;                \
+	vpsrld $(32-23), XTMP0, XTMP3;           \
+	vpxor XTMP0, XTMP1, XTMP1;               \
+	vpxor XTMP6, XTMP5, XTMP5;               \
+	vpxor XTMP3, XTMP2, XTMP2;               \
+	vpxor XTMP2, XTMP5, XTMP5;               \
+	vpxor XTMP5, XTMP1, w0;
+
+#define SCHED_W_2(round, w0, w1, w2, w3, w4, w5) \
+	/* W1 in XMM12 */                        \
+	vpshufd $0b10111111, w4, XTMP4;          \
+	vpalignr $12, XTMP4, w5, XTMP4;          \
+	vmovdqa XTMP4, XW_W1_ADDR((round), 0);   \
+	/* W1 ^ W2 => XTMP1 */                   \
+	vpxor w0, XTMP4, XTMP1;                  \
+	vmovdqa XTMP1, XW_W1W2_ADDR((round), 0);
+
+
+.section	.rodata.cst16, "aM", @progbits, 16
+.align 16
+
+.Lbe32mask:
+	.long 0x00010203, 0x04050607, 0x08090a0b, 0x0c0d0e0f
+
+.text
+
+/*
+ * Transform nblocks*64 bytes (nblocks*16 32-bit words) at DATA.
+ *
+ * void sm3_transform_avx(struct sm3_state *state,
+ *                        const u8 *data, int nblocks);
+ */
+.align 16
+SYM_FUNC_START(sm3_transform_avx)
+	/* input:
+	 *	%rdi: ctx, CTX
+	 *	%rsi: data (64*nblks bytes)
+	 *	%rdx: nblocks
+	 */
+	vzeroupper;
+
+	pushq %rbp;
+	movq %rsp, %rbp;
+
+	movq %rdx, RNBLKS;
+
+	subq $STACK_SIZE, %rsp;
+	andq $(~63), %rsp;
+
+	movq %rbx, (STACK_REG_SAVE + 0 * 8)(%rsp);
+	movq %r15, (STACK_REG_SAVE + 1 * 8)(%rsp);
+	movq %r14, (STACK_REG_SAVE + 2 * 8)(%rsp);
+	movq %r13, (STACK_REG_SAVE + 3 * 8)(%rsp);
+	movq %r12, (STACK_REG_SAVE + 4 * 8)(%rsp);
+
+	vmovdqa .Lbe32mask (%rip), BSWAP_REG;
+
+	/* Get the values of the chaining variables. */
+	movl state_h0(RSTATE), a;
+	movl state_h1(RSTATE), b;
+	movl state_h2(RSTATE), c;
+	movl state_h3(RSTATE), d;
+	movl state_h4(RSTATE), e;
+	movl state_h5(RSTATE), f;
+	movl state_h6(RSTATE), g;
+	movl state_h7(RSTATE), h;
+
+.align 16
+.Loop:
+	/* Load data part1. */
+	LOAD_W_XMM_1();
+
+	leaq -1(RNBLKS), RNBLKS;
+
+	/* Transform 0-3 + Load data part2. */
+	R1(a, b, c, d, e, f, g, h, 0, 0, IW); LOAD_W_XMM_2();
+	R1(d, a, b, c, h, e, f, g, 1, 1, IW);
+	R1(c, d, a, b, g, h, e, f, 2, 2, IW);
+	R1(b, c, d, a, f, g, h, e, 3, 3, IW); LOAD_W_XMM_3();
+
+	/* Transform 4-7 + Precalc 12-14. */
+	R1(a, b, c, d, e, f, g, h, 4, 0, IW);
+	R1(d, a, b, c, h, e, f, g, 5, 1, IW);
+	R1(c, d, a, b, g, h, e, f, 6, 2, IW); SCHED_W_0(12, W0, W1, W2, W3, W4, W5);
+	R1(b, c, d, a, f, g, h, e, 7, 3, IW); SCHED_W_1(12, W0, W1, W2, W3, W4, W5);
+
+	/* Transform 8-11 + Precalc 12-17. */
+	R1(a, b, c, d, e, f, g, h, 8, 0, IW); SCHED_W_2(12, W0, W1, W2, W3, W4, W5);
+	R1(d, a, b, c, h, e, f, g, 9, 1, IW); SCHED_W_0(15, W1, W2, W3, W4, W5, W0);
+	R1(c, d, a, b, g, h, e, f, 10, 2, IW); SCHED_W_1(15, W1, W2, W3, W4, W5, W0);
+	R1(b, c, d, a, f, g, h, e, 11, 3, IW); SCHED_W_2(15, W1, W2, W3, W4, W5, W0);
+
+	/* Transform 12-14 + Precalc 18-20 */
+	R1(a, b, c, d, e, f, g, h, 12, 0, XW); SCHED_W_0(18, W2, W3, W4, W5, W0, W1);
+	R1(d, a, b, c, h, e, f, g, 13, 1, XW); SCHED_W_1(18, W2, W3, W4, W5, W0, W1);
+	R1(c, d, a, b, g, h, e, f, 14, 2, XW); SCHED_W_2(18, W2, W3, W4, W5, W0, W1);
+
+	/* Transform 15-17 + Precalc 21-23 */
+	R1(b, c, d, a, f, g, h, e, 15, 0, XW); SCHED_W_0(21, W3, W4, W5, W0, W1, W2);
+	R2(a, b, c, d, e, f, g, h, 16, 1, XW); SCHED_W_1(21, W3, W4, W5, W0, W1, W2);
+	R2(d, a, b, c, h, e, f, g, 17, 2, XW); SCHED_W_2(21, W3, W4, W5, W0, W1, W2);
+
+	/* Transform 18-20 + Precalc 24-26 */
+	R2(c, d, a, b, g, h, e, f, 18, 0, XW); SCHED_W_0(24, W4, W5, W0, W1, W2, W3);
+	R2(b, c, d, a, f, g, h, e, 19, 1, XW); SCHED_W_1(24, W4, W5, W0, W1, W2, W3);
+	R2(a, b, c, d, e, f, g, h, 20, 2, XW); SCHED_W_2(24, W4, W5, W0, W1, W2, W3);
+
+	/* Transform 21-23 + Precalc 27-29 */
+	R2(d, a, b, c, h, e, f, g, 21, 0, XW); SCHED_W_0(27, W5, W0, W1, W2, W3, W4);
+	R2(c, d, a, b, g, h, e, f, 22, 1, XW); SCHED_W_1(27, W5, W0, W1, W2, W3, W4);
+	R2(b, c, d, a, f, g, h, e, 23, 2, XW); SCHED_W_2(27, W5, W0, W1, W2, W3, W4);
+
+	/* Transform 24-26 + Precalc 30-32 */
+	R2(a, b, c, d, e, f, g, h, 24, 0, XW); SCHED_W_0(30, W0, W1, W2, W3, W4, W5);
+	R2(d, a, b, c, h, e, f, g, 25, 1, XW); SCHED_W_1(30, W0, W1, W2, W3, W4, W5);
+	R2(c, d, a, b, g, h, e, f, 26, 2, XW); SCHED_W_2(30, W0, W1, W2, W3, W4, W5);
+
+	/* Transform 27-29 + Precalc 33-35 */
+	R2(b, c, d, a, f, g, h, e, 27, 0, XW); SCHED_W_0(33, W1, W2, W3, W4, W5, W0);
+	R2(a, b, c, d, e, f, g, h, 28, 1, XW); SCHED_W_1(33, W1, W2, W3, W4, W5, W0);
+	R2(d, a, b, c, h, e, f, g, 29, 2, XW); SCHED_W_2(33, W1, W2, W3, W4, W5, W0);
+
+	/* Transform 30-32 + Precalc 36-38 */
+	R2(c, d, a, b, g, h, e, f, 30, 0, XW); SCHED_W_0(36, W2, W3, W4, W5, W0, W1);
+	R2(b, c, d, a, f, g, h, e, 31, 1, XW); SCHED_W_1(36, W2, W3, W4, W5, W0, W1);
+	R2(a, b, c, d, e, f, g, h, 32, 2, XW); SCHED_W_2(36, W2, W3, W4, W5, W0, W1);
+
+	/* Transform 33-35 + Precalc 39-41 */
+	R2(d, a, b, c, h, e, f, g, 33, 0, XW); SCHED_W_0(39, W3, W4, W5, W0, W1, W2);
+	R2(c, d, a, b, g, h, e, f, 34, 1, XW); SCHED_W_1(39, W3, W4, W5, W0, W1, W2);
+	R2(b, c, d, a, f, g, h, e, 35, 2, XW); SCHED_W_2(39, W3, W4, W5, W0, W1, W2);
+
+	/* Transform 36-38 + Precalc 42-44 */
+	R2(a, b, c, d, e, f, g, h, 36, 0, XW); SCHED_W_0(42, W4, W5, W0, W1, W2, W3);
+	R2(d, a, b, c, h, e, f, g, 37, 1, XW); SCHED_W_1(42, W4, W5, W0, W1, W2, W3);
+	R2(c, d, a, b, g, h, e, f, 38, 2, XW); SCHED_W_2(42, W4, W5, W0, W1, W2, W3);
+
+	/* Transform 39-41 + Precalc 45-47 */
+	R2(b, c, d, a, f, g, h, e, 39, 0, XW); SCHED_W_0(45, W5, W0, W1, W2, W3, W4);
+	R2(a, b, c, d, e, f, g, h, 40, 1, XW); SCHED_W_1(45, W5, W0, W1, W2, W3, W4);
+	R2(d, a, b, c, h, e, f, g, 41, 2, XW); SCHED_W_2(45, W5, W0, W1, W2, W3, W4);
+
+	/* Transform 42-44 + Precalc 48-50 */
+	R2(c, d, a, b, g, h, e, f, 42, 0, XW); SCHED_W_0(48, W0, W1, W2, W3, W4, W5);
+	R2(b, c, d, a, f, g, h, e, 43, 1, XW); SCHED_W_1(48, W0, W1, W2, W3, W4, W5);
+	R2(a, b, c, d, e, f, g, h, 44, 2, XW); SCHED_W_2(48, W0, W1, W2, W3, W4, W5);
+
+	/* Transform 45-47 + Precalc 51-53 */
+	R2(d, a, b, c, h, e, f, g, 45, 0, XW); SCHED_W_0(51, W1, W2, W3, W4, W5, W0);
+	R2(c, d, a, b, g, h, e, f, 46, 1, XW); SCHED_W_1(51, W1, W2, W3, W4, W5, W0);
+	R2(b, c, d, a, f, g, h, e, 47, 2, XW); SCHED_W_2(51, W1, W2, W3, W4, W5, W0);
+
+	/* Transform 48-50 + Precalc 54-56 */
+	R2(a, b, c, d, e, f, g, h, 48, 0, XW); SCHED_W_0(54, W2, W3, W4, W5, W0, W1);
+	R2(d, a, b, c, h, e, f, g, 49, 1, XW); SCHED_W_1(54, W2, W3, W4, W5, W0, W1);
+	R2(c, d, a, b, g, h, e, f, 50, 2, XW); SCHED_W_2(54, W2, W3, W4, W5, W0, W1);
+
+	/* Transform 51-53 + Precalc 57-59 */
+	R2(b, c, d, a, f, g, h, e, 51, 0, XW); SCHED_W_0(57, W3, W4, W5, W0, W1, W2);
+	R2(a, b, c, d, e, f, g, h, 52, 1, XW); SCHED_W_1(57, W3, W4, W5, W0, W1, W2);
+	R2(d, a, b, c, h, e, f, g, 53, 2, XW); SCHED_W_2(57, W3, W4, W5, W0, W1, W2);
+
+	/* Transform 54-56 + Precalc 60-62 */
+	R2(c, d, a, b, g, h, e, f, 54, 0, XW); SCHED_W_0(60, W4, W5, W0, W1, W2, W3);
+	R2(b, c, d, a, f, g, h, e, 55, 1, XW); SCHED_W_1(60, W4, W5, W0, W1, W2, W3);
+	R2(a, b, c, d, e, f, g, h, 56, 2, XW); SCHED_W_2(60, W4, W5, W0, W1, W2, W3);
+
+	/* Transform 57-59 + Precalc 63 */
+	R2(d, a, b, c, h, e, f, g, 57, 0, XW); SCHED_W_0(63, W5, W0, W1, W2, W3, W4);
+	R2(c, d, a, b, g, h, e, f, 58, 1, XW);
+	R2(b, c, d, a, f, g, h, e, 59, 2, XW); SCHED_W_1(63, W5, W0, W1, W2, W3, W4);
+
+	/* Transform 60-62 + Precalc 63 */
+	R2(a, b, c, d, e, f, g, h, 60, 0, XW);
+	R2(d, a, b, c, h, e, f, g, 61, 1, XW); SCHED_W_2(63, W5, W0, W1, W2, W3, W4);
+	R2(c, d, a, b, g, h, e, f, 62, 2, XW);
+
+	/* Transform 63 */
+	R2(b, c, d, a, f, g, h, e, 63, 0, XW);
+
+	/* Update the chaining variables. */
+	xorl state_h0(RSTATE), a;
+	xorl state_h1(RSTATE), b;
+	xorl state_h2(RSTATE), c;
+	xorl state_h3(RSTATE), d;
+	movl a, state_h0(RSTATE);
+	movl b, state_h1(RSTATE);
+	movl c, state_h2(RSTATE);
+	movl d, state_h3(RSTATE);
+	xorl state_h4(RSTATE), e;
+	xorl state_h5(RSTATE), f;
+	xorl state_h6(RSTATE), g;
+	xorl state_h7(RSTATE), h;
+	movl e, state_h4(RSTATE);
+	movl f, state_h5(RSTATE);
+	movl g, state_h6(RSTATE);
+	movl h, state_h7(RSTATE);
+
+	cmpq $0, RNBLKS;
+	jne .Loop;
+
+	vzeroall;
+
+	movq (STACK_REG_SAVE + 0 * 8)(%rsp), %rbx;
+	movq (STACK_REG_SAVE + 1 * 8)(%rsp), %r15;
+	movq (STACK_REG_SAVE + 2 * 8)(%rsp), %r14;
+	movq (STACK_REG_SAVE + 3 * 8)(%rsp), %r13;
+	movq (STACK_REG_SAVE + 4 * 8)(%rsp), %r12;
+
+	vmovdqa %xmm0, IW_W1_ADDR(0, 0);
+	vmovdqa %xmm0, IW_W1W2_ADDR(0, 0);
+	vmovdqa %xmm0, IW_W1_ADDR(4, 0);
+	vmovdqa %xmm0, IW_W1W2_ADDR(4, 0);
+	vmovdqa %xmm0, IW_W1_ADDR(8, 0);
+	vmovdqa %xmm0, IW_W1W2_ADDR(8, 0);
+
+	movq %rbp, %rsp;
+	popq %rbp;
+	ret;
+SYM_FUNC_END(sm3_transform_avx)
diff --git a/arch/x86/crypto/sm3_avx_glue.c b/arch/x86/crypto/sm3_avx_glue.c
new file mode 100644
index 000000000000..661b6f22ffcd
--- /dev/null
+++ b/arch/x86/crypto/sm3_avx_glue.c
@@ -0,0 +1,134 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * SM3 Secure Hash Algorithm, AVX assembler accelerated.
+ * specified in: https://datatracker.ietf.org/doc/html/draft-sca-cfrg-sm3-02
+ *
+ * Copyright (C) 2021 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
+
+#include <crypto/internal/hash.h>
+#include <crypto/internal/simd.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <crypto/sm3.h>
+#include <crypto/sm3_base.h>
+#include <asm/simd.h>
+
+asmlinkage void sm3_transform_avx(struct sm3_state *state,
+			const u8 *data, int nblocks);
+
+static int sm3_avx_update(struct shash_desc *desc, const u8 *data,
+			 unsigned int len)
+{
+	struct sm3_state *sctx = shash_desc_ctx(desc);
+
+	if (!crypto_simd_usable() ||
+			(sctx->count % SM3_BLOCK_SIZE) + len < SM3_BLOCK_SIZE) {
+		sm3_update(sctx, data, len);
+		return 0;
+	}
+
+	/*
+	 * Make sure struct sm3_state begins directly with the SM3
+	 * 256-bit internal state, as this is what the asm functions expect.
+	 */
+	BUILD_BUG_ON(offsetof(struct sm3_state, state) != 0);
+
+	kernel_fpu_begin();
+	sm3_base_do_update(desc, data, len, sm3_transform_avx);
+	kernel_fpu_end();
+
+	return 0;
+}
+
+static int sm3_avx_finup(struct shash_desc *desc, const u8 *data,
+		      unsigned int len, u8 *out)
+{
+	if (!crypto_simd_usable()) {
+		struct sm3_state *sctx = shash_desc_ctx(desc);
+
+		if (len)
+			sm3_update(sctx, data, len);
+
+		sm3_final(sctx, out);
+		return 0;
+	}
+
+	kernel_fpu_begin();
+	if (len)
+		sm3_base_do_update(desc, data, len, sm3_transform_avx);
+	sm3_base_do_finalize(desc, sm3_transform_avx);
+	kernel_fpu_end();
+
+	return sm3_base_finish(desc, out);
+}
+
+static int sm3_avx_final(struct shash_desc *desc, u8 *out)
+{
+	if (!crypto_simd_usable()) {
+		sm3_final(shash_desc_ctx(desc), out);
+		return 0;
+	}
+
+	kernel_fpu_begin();
+	sm3_base_do_finalize(desc, sm3_transform_avx);
+	kernel_fpu_end();
+
+	return sm3_base_finish(desc, out);
+}
+
+static struct shash_alg sm3_avx_alg = {
+	.digestsize	=	SM3_DIGEST_SIZE,
+	.init		=	sm3_base_init,
+	.update		=	sm3_avx_update,
+	.final		=	sm3_avx_final,
+	.finup		=	sm3_avx_finup,
+	.descsize	=	sizeof(struct sm3_state),
+	.base		=	{
+		.cra_name	=	"sm3",
+		.cra_driver_name =	"sm3-avx",
+		.cra_priority	=	300,
+		.cra_blocksize	=	SM3_BLOCK_SIZE,
+		.cra_module	=	THIS_MODULE,
+	}
+};
+
+static int __init sm3_avx_mod_init(void)
+{
+	const char *feature_name;
+
+	if (!boot_cpu_has(X86_FEATURE_AVX)) {
+		pr_info("AVX instruction are not detected.\n");
+		return -ENODEV;
+	}
+
+	if (!boot_cpu_has(X86_FEATURE_BMI2)) {
+		pr_info("BMI2 instruction are not detected.\n");
+		return -ENODEV;
+	}
+
+	if (!cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM,
+				&feature_name)) {
+		pr_info("CPU feature '%s' is not supported.\n", feature_name);
+		return -ENODEV;
+	}
+
+	return crypto_register_shash(&sm3_avx_alg);
+}
+
+static void __exit sm3_avx_mod_exit(void)
+{
+	crypto_unregister_shash(&sm3_avx_alg);
+}
+
+module_init(sm3_avx_mod_init);
+module_exit(sm3_avx_mod_exit);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Tianjia Zhang <tianjia.zhang@linux.alibaba.com>");
+MODULE_DESCRIPTION("SM3 Secure Hash Algorithm, AVX assembler accelerated");
+MODULE_ALIAS_CRYPTO("sm3");
+MODULE_ALIAS_CRYPTO("sm3-avx");
diff --git a/crypto/Kconfig b/crypto/Kconfig
index 8e780bf66c0a..0af5216c083a 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1008,6 +1008,19 @@ config CRYPTO_SM3
 	  http://www.oscca.gov.cn/UpFile/20101222141857786.pdf
 	  https://datatracker.ietf.org/doc/html/draft-shen-sm3-hash
 
+config CRYPTO_SM3_AVX_X86_64
+	tristate "SM3 digest algorithm (x86_64/AVX)"
+	depends on X86 && 64BIT
+	select CRYPTO_HASH
+	select CRYPTO_LIB_SM3
+	help
+	  SM3 secure hash function as defined by OSCCA GM/T 0004-2012 SM3).
+	  It is part of the Chinese Commercial Cryptography suite. This is
+	  SM3 optimized implementation using Advanced Vector Extensions (AVX)
+	  when available.
+
+	  If unsure, say N.
+
 config CRYPTO_STREEBOG
 	tristate "Streebog Hash Function"
 	select CRYPTO_HASH
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 6/6] crypto: tcrypt - add asynchronous speed test for SM3
  2021-12-20  8:22 [PATCH 0/6] Introduce x86 assembly accelerated implementation for SM3 algorithm Tianjia Zhang
                   ` (4 preceding siblings ...)
  2021-12-20  8:22 ` [PATCH 5/6] crypto: x86/sm3 - add AVX assembly implementation Tianjia Zhang
@ 2021-12-20  8:22 ` Tianjia Zhang
  5 siblings, 0 replies; 9+ messages in thread
From: Tianjia Zhang @ 2021-12-20  8:22 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, Vitaly Chikunov, Eric Biggers,
	Eric Biggers, Gilad Ben-Yossef, Ard Biesheuvel, Jussi Kivilinna,
	Catalin Marinas, Will Deacon, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, linux-crypto, x86,
	linux-arm-kernel, linux-kernel
  Cc: Tianjia Zhang

tcrypt supports testing of SM3 hash algorithms that use AVX
instruction acceleration.

In order to add the sm3 asynchronous test to the appropriate
position, shift the testcase sequence number of the multi buffer
backward and start from 450.

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
 crypto/tcrypt.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
index 00149657a4bc..82b5eef2246a 100644
--- a/crypto/tcrypt.c
+++ b/crypto/tcrypt.c
@@ -2571,31 +2571,35 @@ static int do_test(const char *alg, u32 type, u32 mask, int m, u32 num_mb)
 		if (mode > 400 && mode < 500) break;
 		fallthrough;
 	case 422:
+		test_ahash_speed("sm3", sec, generic_hash_speed_template);
+		if (mode > 400 && mode < 500) break;
+		fallthrough;
+	case 450:
 		test_mb_ahash_speed("sha1", sec, generic_hash_speed_template,
 				    num_mb);
 		if (mode > 400 && mode < 500) break;
 		fallthrough;
-	case 423:
+	case 451:
 		test_mb_ahash_speed("sha256", sec, generic_hash_speed_template,
 				    num_mb);
 		if (mode > 400 && mode < 500) break;
 		fallthrough;
-	case 424:
+	case 452:
 		test_mb_ahash_speed("sha512", sec, generic_hash_speed_template,
 				    num_mb);
 		if (mode > 400 && mode < 500) break;
 		fallthrough;
-	case 425:
+	case 453:
 		test_mb_ahash_speed("sm3", sec, generic_hash_speed_template,
 				    num_mb);
 		if (mode > 400 && mode < 500) break;
 		fallthrough;
-	case 426:
+	case 454:
 		test_mb_ahash_speed("streebog256", sec,
 				    generic_hash_speed_template, num_mb);
 		if (mode > 400 && mode < 500) break;
 		fallthrough;
-	case 427:
+	case 455:
 		test_mb_ahash_speed("streebog512", sec,
 				    generic_hash_speed_template, num_mb);
 		if (mode > 400 && mode < 500) break;
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 5/6] crypto: x86/sm3 - add AVX assembly implementation
  2021-12-20  8:22 ` [PATCH 5/6] crypto: x86/sm3 - add AVX assembly implementation Tianjia Zhang
@ 2021-12-20 18:03   ` Jussi Kivilinna
  2021-12-21  7:39     ` Tianjia Zhang
  0 siblings, 1 reply; 9+ messages in thread
From: Jussi Kivilinna @ 2021-12-20 18:03 UTC (permalink / raw)
  To: Tianjia Zhang, Herbert Xu, David S. Miller, Vitaly Chikunov,
	Eric Biggers, Eric Biggers, Gilad Ben-Yossef, Ard Biesheuvel,
	Catalin Marinas, Will Deacon, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, linux-crypto, x86,
	linux-arm-kernel, linux-kernel

On 20.12.2021 10.22, Tianjia Zhang wrote:
> This patch adds AVX assembly accelerated implementation of SM3 secure
> hash algorithm. From the benchmark data, compared to pure software
> implementation sm3-generic, the performance increase is up to 38%.
> 
> The main algorithm implementation based on SM3 AES/BMI2 accelerated
> work by libgcrypt at:
> https://gnupg.org/software/libgcrypt/index.html
> 
> Benchmark on Intel i5-6200U 2.30GHz, performance data of two
> implementations, pure software sm3-generic and sm3-avx acceleration.
> The data comes from the 326 mode and 422 mode of tcrypt. The abscissas
> are different lengths of per update. The data is tabulated and the
> unit is Mb/s:
> 
> update-size |     16      64     256    1024    2048    4096    8192
> --------------------------------------------------------------------
> sm3-generic | 105.97  129.60  182.12  189.62  188.06  193.66  194.88
> sm3-avx     | 119.87  163.05  244.44  260.92  257.60  264.87  265.88
> 
> Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
> ---
>   arch/x86/crypto/Makefile         |   3 +
>   arch/x86/crypto/sm3-avx-asm_64.S | 521 +++++++++++++++++++++++++++++++
>   arch/x86/crypto/sm3_avx_glue.c   | 134 ++++++++
>   crypto/Kconfig                   |  13 +
>   4 files changed, 671 insertions(+)
>   create mode 100644 arch/x86/crypto/sm3-avx-asm_64.S
>   create mode 100644 arch/x86/crypto/sm3_avx_glue.c
> 
> diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
> index f307c93fc90a..7cbe860f6201 100644
> --- a/arch/x86/crypto/Makefile
> +++ b/arch/x86/crypto/Makefile
> @@ -88,6 +88,9 @@ nhpoly1305-avx2-y := nh-avx2-x86_64.o nhpoly1305-avx2-glue.o
>   
>   obj-$(CONFIG_CRYPTO_CURVE25519_X86) += curve25519-x86_64.o
>   
> +obj-$(CONFIG_CRYPTO_SM3_AVX_X86_64) += sm3-avx-x86_64.o
> +sm3-avx-x86_64-y := sm3-avx-asm_64.o sm3_avx_glue.o
> +
>   obj-$(CONFIG_CRYPTO_SM4_AESNI_AVX_X86_64) += sm4-aesni-avx-x86_64.o
>   sm4-aesni-avx-x86_64-y := sm4-aesni-avx-asm_64.o sm4_aesni_avx_glue.o
>   
> diff --git a/arch/x86/crypto/sm3-avx-asm_64.S b/arch/x86/crypto/sm3-avx-asm_64.S
> new file mode 100644
> index 000000000000..e7a9a37f3609
> --- /dev/null
> +++ b/arch/x86/crypto/sm3-avx-asm_64.S
> @@ -0,0 +1,521 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * SM3 AVX accelerated transform.
> + * specified in: https://datatracker.ietf.org/doc/html/draft-sca-cfrg-sm3-02
> + *
> + * Copyright (C) 2021 Jussi Kivilinna <jussi.kivilinna@iki.fi>
> + * Copyright (C) 2021 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
> + */
<snip>
> +
> +#define R(i, a, b, c, d, e, f, g, h, round, widx, wtype)                      \
> +	/* rol(a, 12) => t0 */                                                \
> +	roll3mov(12, a, t0); /* rorxl here would reduce perf by 6% on zen3 */ \
> +	/* rol (t0 + e + t), 7) => t1 */                                      \
> +	addl3(t0, e, t1);                                                     \
> +	addl $K##round, t1;                                                   \

It's better to use "leal K##round(t0, e, 1), t1;" here and fix K0-K63 macros
instead as I noted at libgcrypt mailing-list:
  https://lists.gnupg.org/pipermail/gcrypt-devel/2021-December/005209.html

-Jussi

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 5/6] crypto: x86/sm3 - add AVX assembly implementation
  2021-12-20 18:03   ` Jussi Kivilinna
@ 2021-12-21  7:39     ` Tianjia Zhang
  0 siblings, 0 replies; 9+ messages in thread
From: Tianjia Zhang @ 2021-12-21  7:39 UTC (permalink / raw)
  To: Jussi Kivilinna, Herbert Xu, David S. Miller, Vitaly Chikunov,
	Eric Biggers, Eric Biggers, Gilad Ben-Yossef, Ard Biesheuvel,
	Catalin Marinas, Will Deacon, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, linux-crypto, x86,
	linux-arm-kernel, linux-kernel



On 12/21/21 2:03 AM, Jussi Kivilinna wrote:
> On 20.12.2021 10.22, Tianjia Zhang wrote:
>> This patch adds AVX assembly accelerated implementation of SM3 secure
>> hash algorithm. From the benchmark data, compared to pure software
>> implementation sm3-generic, the performance increase is up to 38%.
>>
>> The main algorithm implementation based on SM3 AES/BMI2 accelerated
>> work by libgcrypt at:
>> https://gnupg.org/software/libgcrypt/index.html
>>
>> Benchmark on Intel i5-6200U 2.30GHz, performance data of two
>> implementations, pure software sm3-generic and sm3-avx acceleration.
>> The data comes from the 326 mode and 422 mode of tcrypt. The abscissas
>> are different lengths of per update. The data is tabulated and the
>> unit is Mb/s:
>>
>> update-size |     16      64     256    1024    2048    4096    8192
>> --------------------------------------------------------------------
>> sm3-generic | 105.97  129.60  182.12  189.62  188.06  193.66  194.88
>> sm3-avx     | 119.87  163.05  244.44  260.92  257.60  264.87  265.88
>>
>> Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
>> ---
>>   arch/x86/crypto/Makefile         |   3 +
>>   arch/x86/crypto/sm3-avx-asm_64.S | 521 +++++++++++++++++++++++++++++++
>>   arch/x86/crypto/sm3_avx_glue.c   | 134 ++++++++
>>   crypto/Kconfig                   |  13 +
>>   4 files changed, 671 insertions(+)
>>   create mode 100644 arch/x86/crypto/sm3-avx-asm_64.S
>>   create mode 100644 arch/x86/crypto/sm3_avx_glue.c
>>
>> diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
>> index f307c93fc90a..7cbe860f6201 100644
>> --- a/arch/x86/crypto/Makefile
>> +++ b/arch/x86/crypto/Makefile
>> @@ -88,6 +88,9 @@ nhpoly1305-avx2-y := nh-avx2-x86_64.o 
>> nhpoly1305-avx2-glue.o
>>   obj-$(CONFIG_CRYPTO_CURVE25519_X86) += curve25519-x86_64.o
>> +obj-$(CONFIG_CRYPTO_SM3_AVX_X86_64) += sm3-avx-x86_64.o
>> +sm3-avx-x86_64-y := sm3-avx-asm_64.o sm3_avx_glue.o
>> +
>>   obj-$(CONFIG_CRYPTO_SM4_AESNI_AVX_X86_64) += sm4-aesni-avx-x86_64.o
>>   sm4-aesni-avx-x86_64-y := sm4-aesni-avx-asm_64.o sm4_aesni_avx_glue.o
>> diff --git a/arch/x86/crypto/sm3-avx-asm_64.S 
>> b/arch/x86/crypto/sm3-avx-asm_64.S
>> new file mode 100644
>> index 000000000000..e7a9a37f3609
>> --- /dev/null
>> +++ b/arch/x86/crypto/sm3-avx-asm_64.S
>> @@ -0,0 +1,521 @@
>> +/* SPDX-License-Identifier: GPL-2.0-or-later */
>> +/*
>> + * SM3 AVX accelerated transform.
>> + * specified in: 
>> https://datatracker.ietf.org/doc/html/draft-sca-cfrg-sm3-02
>> + *
>> + * Copyright (C) 2021 Jussi Kivilinna <jussi.kivilinna@iki.fi>
>> + * Copyright (C) 2021 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
>> + */
> <snip>
>> +
>> +#define R(i, a, b, c, d, e, f, g, h, round, widx, 
>> wtype)                      \
>> +    /* rol(a, 12) => t0 
>> */                                                \
>> +    roll3mov(12, a, t0); /* rorxl here would reduce perf by 6% on 
>> zen3 */ \
>> +    /* rol (t0 + e + t), 7) => t1 
>> */                                      \
>> +    addl3(t0, e, 
>> t1);                                                     \
>> +    addl $K##round, 
>> t1;                                                   \
> 
> It's better to use "leal K##round(t0, e, 1), t1;" here and fix K0-K63 
> macros
> instead as I noted at libgcrypt mailing-list:
>   https://lists.gnupg.org/pipermail/gcrypt-devel/2021-December/005209.html
> 
> -Jussi

Thanks for pointing it out, I will fix it in the next patch.

Best regards,
Tianjia

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2021-12-21  7:39 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-20  8:22 [PATCH 0/6] Introduce x86 assembly accelerated implementation for SM3 algorithm Tianjia Zhang
2021-12-20  8:22 ` [PATCH 1/6] crypto: sm3 - create SM3 stand-alone library Tianjia Zhang
2021-12-20  8:22 ` [PATCH 2/6] crypto: arm64/sm3-ce - make dependent on sm3 library Tianjia Zhang
2021-12-20  8:22 ` [PATCH 3/6] crypto: sm2 " Tianjia Zhang
2021-12-20  8:22 ` [PATCH 4/6] crypto: sm3 " Tianjia Zhang
2021-12-20  8:22 ` [PATCH 5/6] crypto: x86/sm3 - add AVX assembly implementation Tianjia Zhang
2021-12-20 18:03   ` Jussi Kivilinna
2021-12-21  7:39     ` Tianjia Zhang
2021-12-20  8:22 ` [PATCH 6/6] crypto: tcrypt - add asynchronous speed test for SM3 Tianjia Zhang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).