From: Megha Dey <megha.dey@intel.com>
To: herbert@gondor.apana.org.au, davem@davemloft.net
Cc: linux-crypto@vger.kernel.org, linux-kernel@vger.kernel.org,
	ravi.v.shankar@intel.com, tim.c.chen@intel.com,
	andi.kleen@intel.com, dave.hansen@intel.com, megha.dey@intel.com,
	wajdi.k.feghali@intel.com, greg.b.tucker@intel.com,
	robert.a.kasten@intel.com, rajendrakumar.chinnaiyan@intel.com,
	tomasz.kantecki@intel.com, ryan.d.saffores@intel.com,
	ilya.albrekht@intel.com, kyung.min.park@intel.com,
	tony.luck@intel.com, ira.weiny@intel.com
Subject: [RFC V1 2/7] crypto: crct10dif - Accelerated CRC T10 DIF with vectorized instruction
Date: Fri, 18 Dec 2020 13:10:59 -0800	[thread overview]
Message-ID: <1608325864-4033-3-git-send-email-megha.dey@intel.com> (raw)
In-Reply-To: <1608325864-4033-1-git-send-email-megha.dey@intel.com>

From: Kyung Min Park <kyung.min.park@intel.com>

Add an implementation of the T10 Data Integrity Field CRC16 (CRC T10 DIF)
calculation that uses the VPCLMULQDQ instruction. VPCLMULQDQ, together with
AVX-512F, adds an EVEX-encoded 512-bit version of the PCLMULQDQ instruction.
The advantage comes from packing four 128-bit blocks of data into a single
AVX512 register, reducing the instruction count and overall latency.
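
For illustration only (not part of the patch), the core 512-bit folding step
implemented in assembly below corresponds roughly to the following C
intrinsics sketch; the function name fold_512 and the use of the userspace
<immintrin.h> intrinsics are assumptions made purely for illustration:

  #include <immintrin.h>

  /* One 512-bit fold step: roughly what each vpclmulqdq/vpxorq group in the
   * fold loops below does (compile with -mavx512f -mvpclmulqdq).
   */
  static __m512i fold_512(__m512i acc, __m512i next_data, __m512i fold_consts)
  {
          /* carry-less multiply of the low and high 64-bit halves of each
           * 128-bit lane against the folding constants
           */
          __m512i lo = _mm512_clmulepi64_epi128(acc, fold_consts, 0x00);
          __m512i hi = _mm512_clmulepi64_epi128(acc, fold_consts, 0x11);

          /* xor both products with the next 64 bytes of input data */
          return _mm512_xor_si512(_mm512_xor_si512(lo, hi), next_data);
  }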

The glue code in the crct10dif_pclmul module overrides the existing PCLMULQDQ
version with the VPCLMULQDQ version when the following criteria are met:
At compile time:
1. CONFIG_CRYPTO_AVX512 is enabled
2. The toolchain (assembler) supports the VPCLMULQDQ instructions
At runtime:
1. The VPCLMULQDQ and AVX512VL features are supported by the platform
   (currently only Ice Lake)
2. If built into the kernel, crct10dif_pclmul.use_avx512 is set on the kernel
   command line at boot, or /sys/module/crct10dif_pclmul/parameters/use_avx512
   is set to 1 after boot.
   If built as a loadable module, the use_avx512 module parameter must be set:
   modprobe crct10dif_pclmul use_avx512=1
   (see the example after this list)
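
For illustration, the runtime switch can be exercised as follows (module name,
parameter and sysfs path exactly as described above):

  # loadable module: load with the parameter set
  modprobe crct10dif_pclmul use_avx512=1

  # built-in: enable after boot via sysfs
  echo 1 > /sys/module/crct10dif_pclmul/parameters/use_avx512

  # built-in: or pass on the kernel command line at boot
  crct10dif_pclmul.use_avx512=1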

A typical run of tcrypt comparing the CRC T10 DIF calculation with the
PCLMULQDQ instruction against the VPCLMULQDQ instruction shows the following
results:
For bytes per update >= 1KB, we see an average improvement of 46% (~1.4x).
For bytes per update < 1KB, we see an average improvement of 13%.
The test was performed on an Ice Lake based platform with the CPU frequency
held constant.

Detailed results for a variety of block sizes and update sizes are in
the table below.

---------------------------------------------------------------------------
|            |            |         cycles/operation         |            |
|            |            |       (the lower the better)     |            |
|    byte    |   bytes    |----------------------------------| percentage |
|   blocks   | per update |   CRC T10 DIF  |  CRC T10 DIF    | loss/gain  |
|            |            | with PCLMULQDQ | with VPCLMULQDQ |            |
|------------|------------|----------------|-----------------|------------|
|      16    |     16     |        77      |        106      |   -27.0    |
|      64    |     16     |       411      |        390      |     5.4    |
|      64    |     64     |        71      |         85      |   -16.0    |
|     256    |     16     |      1224      |       1308      |    -6.4    |
|     256    |     64     |       393      |        407      |    -3.4    |
|     256    |    256     |        93      |         86      |     8.1    |
|    1024    |     16     |      4564      |       5020      |    -9.0    |
|    1024    |    256     |       486      |        475      |     2.3    |
|    1024    |   1024     |       221      |        148      |    49.3    |
|    2048    |     16     |      8945      |       9851      |    -9.1    |
|    2048    |    256     |       982      |        951      |     3.3    |
|    2048    |   1024     |       500      |        369      |    35.5    |
|    2048    |   2048     |       413      |        265      |    55.8    |
|    4096    |     16     |     17885      |      19351      |    -7.5    |
|    4096    |    256     |      1828      |       1713      |     6.7    |
|    4096    |   1024     |       968      |        805      |    20.0    |
|    4096    |   4096     |       739      |        475      |    55.6    |
|    8192    |     16     |     48339      |      41556      |    16.3    |
|    8192    |    256     |      3494      |       3342      |     4.5    |
|    8192    |   1024     |      1959      |       1462      |    34.0    |
|    8192    |   4096     |      1561      |       1036      |    50.7    |
|    8192    |   8192     |      1540      |       1004      |    53.4    |
---------------------------------------------------------------------------
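
For reference, a comparable measurement can be collected with the tcrypt
module. The mode number of the crct10dif speed test is left as a placeholder
here and must be looked up in crypto/tcrypt.c of the kernel under test:

  # <N> = tcrypt mode number of the crct10dif hash speed test
  modprobe tcrypt mode=<N> sec=1
  dmesg | grep -i crct10dif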

This work was inspired by the CRC T10 DIF AVX512 optimization published in
the Intel Intelligent Storage Acceleration Library (ISA-L):
https://github.com/intel/isa-l/blob/master/crc/crc16_t10dif_by16_10.asm

Co-developed-by: Greg Tucker <greg.b.tucker@intel.com>
Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
Co-developed-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
Signed-off-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
Signed-off-by: Kyung Min Park <kyung.min.park@intel.com>
Signed-off-by: Megha Dey <megha.dey@intel.com>
---
 arch/x86/crypto/Makefile                  |   1 +
 arch/x86/crypto/crct10dif-avx512-asm_64.S | 482 ++++++++++++++++++++++++++++++
 arch/x86/crypto/crct10dif-pclmul_glue.c   |  24 +-
 arch/x86/include/asm/disabled-features.h  |   8 +-
 crypto/Kconfig                            |  23 ++
 5 files changed, 535 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/crypto/crct10dif-avx512-asm_64.S

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index a31de0c..bf0b0fc 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -80,6 +80,7 @@ crc32-pclmul-y := crc32-pclmul_asm.o crc32-pclmul_glue.o
 
 obj-$(CONFIG_CRYPTO_CRCT10DIF_PCLMUL) += crct10dif-pclmul.o
 crct10dif-pclmul-y := crct10dif-pcl-asm_64.o crct10dif-pclmul_glue.o
+crct10dif-pclmul-$(CONFIG_CRYPTO_CRCT10DIF_AVX512) += crct10dif-avx512-asm_64.o
 
 obj-$(CONFIG_CRYPTO_POLY1305_X86_64) += poly1305-x86_64.o
 poly1305-x86_64-y := poly1305-x86_64-cryptogams.o poly1305_glue.o
diff --git a/arch/x86/crypto/crct10dif-avx512-asm_64.S b/arch/x86/crypto/crct10dif-avx512-asm_64.S
new file mode 100644
index 0000000..07c9371
--- /dev/null
+++ b/arch/x86/crypto/crct10dif-avx512-asm_64.S
@@ -0,0 +1,482 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* Copyright(c) 2020 Intel Corporation.
+ *
+ * Implement CRC T10 DIF calculation with AVX512 instructions. (x86_64)
+ *
+ * This is CRC T10 DIF calculation with AVX512 instructions. It requires
+ * the support of Intel(R) AVX512F and VPCLMULQDQ instructions.
+ */
+
+#include <linux/linkage.h>
+
+.text
+#define		init_crc	%edi
+#define		buf		%rsi
+#define		len		%rdx
+#define		VARIABLE_OFFSET 16*2+8
+
+/*
+ * u16 crct10dif_pcl_avx512(u16 init_crc, const u8 *buf, size_t len);
+ */
+.align 16
+SYM_FUNC_START(crct10dif_pcl_avx512)
+
+	shl		$16, init_crc
+	/*
+	 * The code flow is exactly the same as for a 32-bit CRC. The only
+	 * difference is that before returning eax, we shift it right by 16
+	 * bits to scale the result back to 16 bits.
+	 */
+	sub		$(VARIABLE_OFFSET), %rsp
+
+	vbroadcasti32x4 SHUF_MASK(%rip), %zmm18
+
+	/* For sizes less than 256 bytes, we can't fold 256 bytes at a time. */
+	cmp		$256, len
+	jl		.less_than_256
+
+	/* load the initial crc value */
+	vmovd		init_crc, %xmm10
+
+	/*
+	 * crc value does not need to be byte-reflected, but it needs to be
+	 * moved to the high part of the register because data will be
+	 * byte-reflected and will align with initial crc at correct place.
+	 */
+	vpslldq		$12, %xmm10, %xmm10
+
+	/* receive the initial 64B data, xor the initial crc value. */
+	vmovdqu8	(buf), %zmm0
+	vmovdqu8	16*4(buf), %zmm4
+	vpshufb		%zmm18, %zmm0, %zmm0
+	vpshufb		%zmm18, %zmm4, %zmm4
+	vpxorq		%zmm10, %zmm0, %zmm0
+	vbroadcasti32x4	rk3(%rip), %zmm10
+
+	sub		$256, len
+	cmp		$256, len
+	jl		.fold_128_B_loop
+
+	vmovdqu8	16*8(buf), %zmm7
+	vmovdqu8	16*12(buf), %zmm8
+	vpshufb		%zmm18, %zmm7, %zmm7
+	vpshufb		%zmm18, %zmm8, %zmm8
+	vbroadcasti32x4 rk_1(%rip), %zmm16
+	sub		$256, len
+
+.fold_256_B_loop:
+	add		$256, buf
+	vmovdqu8	(buf), %zmm3
+	vpshufb		%zmm18, %zmm3, %zmm3
+	vpclmulqdq	$0x00, %zmm16, %zmm0, %zmm1
+	vpclmulqdq	$0x11, %zmm16, %zmm0, %zmm2
+	vpxorq		%zmm2, %zmm1, %zmm0
+	vpxorq		%zmm3, %zmm0, %zmm0
+
+	vmovdqu8	16*4(buf), %zmm9
+	vpshufb		%zmm18, %zmm9, %zmm9
+	vpclmulqdq	$0x00, %zmm16, %zmm4, %zmm5
+	vpclmulqdq	$0x11, %zmm16, %zmm4, %zmm6
+	vpxorq		%zmm6, %zmm5, %zmm4
+	vpxorq		%zmm9, %zmm4, %zmm4
+
+	vmovdqu8	16*8(buf), %zmm11
+	vpshufb		%zmm18, %zmm11, %zmm11
+	vpclmulqdq	$0x00, %zmm16, %zmm7, %zmm12
+	vpclmulqdq	$0x11, %zmm16, %zmm7, %zmm13
+	vpxorq		%zmm13, %zmm12, %zmm7
+	vpxorq		%zmm11, %zmm7, %zmm7
+
+	vmovdqu8	16*12(buf), %zmm17
+	vpshufb		%zmm18, %zmm17, %zmm17
+	vpclmulqdq	$0x00, %zmm16, %zmm8, %zmm14
+	vpclmulqdq	$0x11, %zmm16, %zmm8, %zmm15
+	vpxorq		%zmm15, %zmm14, %zmm8
+	vpxorq		%zmm17, %zmm8, %zmm8
+
+	sub		$256, len
+	jge		.fold_256_B_loop
+
+	/* Fold 256 into 128 */
+	add		$256, buf
+	vpclmulqdq	$0x00, %zmm10, %zmm0, %zmm1
+	vpclmulqdq	$0x11, %zmm10, %zmm0, %zmm2
+	vpternlogq	$0x96, %zmm2, %zmm1, %zmm7
+
+	vpclmulqdq	$0x00, %zmm10, %zmm4, %zmm5
+	vpclmulqdq	$0x11, %zmm10, %zmm4, %zmm6
+	vpternlogq	$0x96, %zmm6, %zmm5, %zmm8
+
+	vmovdqa32	%zmm7, %zmm0
+	vmovdqa32	%zmm8, %zmm4
+
+	add		$128, len
+	jmp		.fold_128_B_register
+
+	/*
+	 * At this point in the code, there are 128*x + y (0 <= y < 128) bytes
+	 * of buffer left. The fold_128_B_loop folds 128B at a time until we
+	 * have 128 + y bytes of buffer remaining.
+	 * This section of the code folds the equivalent of 8 xmm registers
+	 * in parallel.
+	 */
+.fold_128_B_loop:
+	add		$128, buf
+	vmovdqu8	(buf), %zmm8
+	vpshufb		%zmm18, %zmm8, %zmm8
+	vpclmulqdq	$0x00, %zmm10, %zmm0, %zmm2
+	vpclmulqdq	$0x11, %zmm10, %zmm0, %zmm1
+	vpxorq		%zmm1, %zmm2, %zmm0
+	vpxorq		%zmm8, %zmm0, %zmm0
+
+	vmovdqu8	16*4(buf), %zmm9
+	vpshufb		%zmm18, %zmm9, %zmm9
+	vpclmulqdq	$0x00, %zmm10, %zmm4, %zmm5
+	vpclmulqdq	$0x11, %zmm10, %zmm4, %zmm6
+	vpxorq		%zmm6, %zmm5, %zmm4
+	vpxorq		%zmm9, %zmm4, %zmm4
+
+	sub		$128, len
+	jge		.fold_128_B_loop
+
+	add		$128, buf
+
+	/*
+	 * At this point, the buffer pointer points at the last y bytes of the
+	 * buffer, where 0 <= y < 128. The 128B of folded data is held in two
+	 * zmm registers: zmm0 and zmm4.
+	 */
+.fold_128_B_register:
+	/* fold the 8 128b parts into 1 xmm register with different constant. */
+	vmovdqu8	rk9(%rip), %zmm16
+	vmovdqu8	rk17(%rip), %zmm11
+	vpclmulqdq	$0x00, %zmm16, %zmm0, %zmm1
+	vpclmulqdq	$0x11, %zmm16, %zmm0, %zmm2
+	vextracti64x2	$3, %zmm4, %xmm7
+
+	vpclmulqdq	$0x00, %zmm11, %zmm4, %zmm5
+	vpclmulqdq	$0x11, %zmm11, %zmm4, %zmm6
+	vmovdqa		rk1(%rip), %xmm10
+	vpternlogq	$0x96, %zmm5, %zmm2, %zmm1
+	vpternlogq	$0x96, %zmm7, %zmm6, %zmm1
+
+	vshufi64x2      $0x4e, %zmm1, %zmm1, %zmm8
+	vpxorq          %ymm1, %ymm8, %ymm8
+	vextracti64x2   $1, %ymm8, %xmm5
+	vpxorq          %xmm8, %xmm5, %xmm7
+
+	/*
+	 * Instead of 128, we add 128 - 16 to the loop counter to save one
+	 * instruction from the loop. Instead of a cmp instruction, we use
+	 * the negative flag with the jl instruction.
+	 */
+	add		$(128 - 16), len
+	jl		.final_reduction_for_128
+
+	/*
+	 * Now we have 16 + y bytes left to reduce. 16 bytes are in register
+	 * xmm7 and the rest is in memory. If y >= 16 we can fold 16 bytes at
+	 * a time; continue folding 16B at a time.
+	 */
+.16B_reduction_loop:
+	vpclmulqdq	$0x11, %xmm10, %xmm7, %xmm8
+	vpclmulqdq	$0x00, %xmm10, %xmm7, %xmm7
+	vpxor		%xmm8, %xmm7, %xmm7
+	vmovdqu		(buf), %xmm0
+	vpshufb		%xmm18, %xmm0, %xmm0
+	vpxor		%xmm0, %xmm7, %xmm7
+	add		$16, buf
+	sub		$16, len
+
+	/*
+	 * Instead of a cmp instruction, we use the flags set by the sub above
+	 * with the jge instruction (equivalent to: cmp len, 16-16) to check
+	 * whether there are 16 more bytes in the buffer that can be folded.
+	 */
+	jge		.16B_reduction_loop
+
+	/*
+	 * Now we have 16 + z bytes left to reduce, where 0 <= z < 16.
+	 * First, we reduce the data in the xmm7 register.
+	 */
+.final_reduction_for_128:
+	add		$16, len
+	je		.128_done
+
+	/*
+	 * Here we are handling data that is less than 16 bytes. Since we know
+	 * there was data before the pointer, we can offset the input pointer
+	 * back before the actual point so that exactly 16 bytes are loaded.
+	 * After that, the registers need to be adjusted.
+	 */
+.get_last_two_xmms:
+	vmovdqa		%xmm7, %xmm2
+	vmovdqu		-16(buf, len), %xmm1
+	vpshufb		%xmm18, %xmm1, %xmm1
+
+	/*
+	 * get rid of the extra data that was loaded before.
+	 * load the shift constant
+	 */
+	lea		16 + pshufb_shf_table(%rip), %rax
+	sub		len, %rax
+	vmovdqu		(%rax), %xmm0
+
+	vpshufb		%xmm0, %xmm2, %xmm2
+	vpxor		mask1(%rip), %xmm0, %xmm0
+	vpshufb		%xmm0, %xmm7, %xmm7
+	vpblendvb	%xmm0, %xmm2, %xmm1, %xmm1
+
+	vpclmulqdq	$0x11, %xmm10, %xmm7, %xmm8
+	vpclmulqdq	$0x00, %xmm10, %xmm7, %xmm7
+	vpxor		%xmm8, %xmm7, %xmm7
+	vpxor		%xmm1, %xmm7, %xmm7
+
+.128_done:
+	/* compute crc of a 128-bit value. */
+	vmovdqa		rk5(%rip), %xmm10
+	vmovdqa		%xmm7, %xmm0
+
+	vpclmulqdq	$0x01, %xmm10, %xmm7, %xmm7
+	vpslldq		$8, %xmm0, %xmm0
+	vpxor		%xmm0, %xmm7, %xmm7
+
+	vmovdqa		%xmm7, %xmm0
+	vpand		mask2(%rip), %xmm0, %xmm0
+	vpsrldq		$12, %xmm7, %xmm7
+	vpclmulqdq	$0x10, %xmm10, %xmm7, %xmm7
+	vpxor		%xmm0, %xmm7, %xmm7
+
+	/* barrett reduction */
+.barrett:
+	vmovdqa		rk7(%rip), %xmm10
+	vmovdqa		%xmm7, %xmm0
+	vpclmulqdq	$0x01, %xmm10, %xmm7, %xmm7
+	vpslldq		$4, %xmm7, %xmm7
+	vpclmulqdq	$0x11, %xmm10, %xmm7, %xmm7
+
+	vpslldq		$4, %xmm7, %xmm7
+	vpxor		%xmm0, %xmm7, %xmm7
+	vpextrd		$1, %xmm7, %eax
+
+.cleanup:
+	/* scale the result back to 16 bits. */
+	shr		$16, %eax
+	add		$(VARIABLE_OFFSET), %rsp
+	ret
+
+.align 16
+.less_than_256:
+	/* check if there is enough buffer to be able to fold 16B at a time. */
+	cmp		$32, len
+	jl		.less_than_32
+
+	/* If there is, load the constants. */
+	vmovdqa		rk1(%rip), %xmm10
+
+	/*
+	 * get the initial crc value and align it to its correct place.
+	 * And load the plaintext and byte-reflect it.
+	 */
+	vmovd		init_crc, %xmm0
+	vpslldq		$12, %xmm0, %xmm0
+	vmovdqu		(buf), %xmm7
+	vpshufb		%xmm18, %xmm7, %xmm7
+	vpxor		%xmm0, %xmm7, %xmm7
+
+	/* update the buffer pointer */
+	add		$16, buf
+
+	/* subtract 32 instead of 16 to save one instruction from the loop */
+	sub		$32, len
+
+	jmp		.16B_reduction_loop
+
+.align 16
+.less_than_32:
+	/*
+	 * mov initial crc to the return value. This is necessary for
+	 * zero-length buffers.
+	 */
+	mov		init_crc, %eax
+	test		len, len
+	je		.cleanup
+
+	vmovd		init_crc, %xmm0
+	vpslldq		$12, %xmm0, %xmm0
+
+	cmp		$16, len
+	je		.exact_16_left
+	jl		.less_than_16_left
+
+	vmovdqu		(buf), %xmm7
+	vpshufb		%xmm18, %xmm7, %xmm7
+	vpxor		%xmm0, %xmm7, %xmm7
+	add		$16, buf
+	sub		$16, len
+	vmovdqa		rk1(%rip), %xmm10
+	jmp		.get_last_two_xmms
+
+.align 16
+.less_than_16_left:
+	/*
+	 * Use stack space to load data of less than 16 bytes; zero out the
+	 * 16B of memory first.
+	 */
+	vpxor		%xmm1, %xmm1, %xmm1
+	mov		%rsp, %r11
+	vmovdqa		%xmm1, (%r11)
+
+	cmp		$4, len
+	jl		.only_less_than_4
+
+	mov		len, %r9
+	cmp		$8, len
+	jl		.less_than_8_left
+
+	mov		(buf), %rax
+	mov		%rax, (%r11)
+	add		$8, %r11
+	sub		$8, len
+	add		$8, buf
+.less_than_8_left:
+	cmp		$4, len
+	jl		.less_than_4_left
+
+	mov		(buf), %eax
+	mov		%eax, (%r11)
+	add		$4, %r11
+	sub		$4, len
+	add		$4, buf
+
+.less_than_4_left:
+	cmp		$2, len
+	jl		.less_than_2_left
+
+	mov		(buf), %ax
+	mov		%ax, (%r11)
+	add		$2, %r11
+	sub		$2, len
+	add		$2, buf
+.less_than_2_left:
+	cmp		$1, len
+	jl		.zero_left
+
+	mov		(buf), %al
+	mov		%al, (%r11)
+
+.zero_left:
+	vmovdqa		(%rsp), %xmm7
+	vpshufb		%xmm18, %xmm7, %xmm7
+	vpxor		%xmm0, %xmm7, %xmm7
+
+	lea		16 + pshufb_shf_table(%rip), %rax
+	sub		%r9, %rax
+	vmovdqu		(%rax), %xmm0
+	vpxor		mask1(%rip), %xmm0, %xmm0
+
+	vpshufb		%xmm0,%xmm7, %xmm7
+	jmp		.128_done
+
+.align 16
+.exact_16_left:
+	vmovdqu		(buf), %xmm7
+	vpshufb		%xmm18, %xmm7, %xmm7
+	vpxor		%xmm0, %xmm7, %xmm7
+	jmp		.128_done
+
+.only_less_than_4:
+	cmp		$3, len
+	jl		.only_less_than_3
+
+	mov		(buf), %al
+	mov		%al, (%r11)
+
+	mov		1(buf), %al
+	mov		%al, 1(%r11)
+
+	mov		2(buf), %al
+	mov		%al, 2(%r11)
+
+	vmovdqa		(%rsp), %xmm7
+	vpshufb		%xmm18, %xmm7, %xmm7
+	vpxor		%xmm0, %xmm7, %xmm7
+
+	vpsrldq		$5, %xmm7, %xmm7
+	jmp		.barrett
+
+.only_less_than_3:
+	cmp		$2, len
+	jl		.only_less_than_2
+
+	mov		(buf), %al
+	mov		%al, (%r11)
+
+	mov		1(buf), %al
+	mov		%al, 1(%r11)
+
+	vmovdqa		(%rsp), %xmm7
+	vpshufb		%xmm18, %xmm7, %xmm7
+	vpxor		%xmm0, %xmm7, %xmm7
+
+	vpsrldq		$6, %xmm7, %xmm7
+	jmp		.barrett
+
+.only_less_than_2:
+	mov		(buf), %al
+	mov		%al, (%r11)
+
+	vmovdqa		(%rsp), %xmm7
+	vpshufb		%xmm18, %xmm7, %xmm7
+	vpxor		%xmm0, %xmm7, %xmm7
+
+	vpsrldq		$7, %xmm7, %xmm7
+	jmp		.barrett
+SYM_FUNC_END(crct10dif_pcl_avx512)
+
+.section        .data
+.align 32
+rk_1:		.quad 0xdccf000000000000
+rk_2:		.quad 0x4b0b000000000000
+rk1:		.quad 0x2d56000000000000
+rk2:		.quad 0x06df000000000000
+rk3:		.quad 0x9d9d000000000000
+rk4:		.quad 0x7cf5000000000000
+rk5:		.quad 0x2d56000000000000
+rk6:		.quad 0x1368000000000000
+rk7:		.quad 0x00000001f65a57f8
+rk8:		.quad 0x000000018bb70000
+rk9:		.quad 0xceae000000000000
+rk10:		.quad 0xbfd6000000000000
+rk11:		.quad 0x1e16000000000000
+rk12:		.quad 0x713c000000000000
+rk13:		.quad 0xf7f9000000000000
+rk14:		.quad 0x80a6000000000000
+rk15:		.quad 0x044c000000000000
+rk16:		.quad 0xe658000000000000
+rk17:		.quad 0xad18000000000000
+rk18:		.quad 0xa497000000000000
+rk19:		.quad 0x6ee3000000000000
+rk20:		.quad 0xe7b5000000000000
+rk_1b:		.quad 0x2d56000000000000
+rk_2b:		.quad 0x06df000000000000
+		.quad 0x0000000000000000
+		.quad 0x0000000000000000
+
+.align 16
+mask1:
+	.octa	0x80808080808080808080808080808080
+
+.align 16
+mask2:
+	.octa	0x00000000FFFFFFFFFFFFFFFFFFFFFFFF
+
+.align 16
+SHUF_MASK:
+	.octa	0x000102030405060708090A0B0C0D0E0F
+
+.align 16
+pshufb_shf_table:	.octa 0x8f8e8d8c8b8a89888786858483828100
+			.octa 0x000e0d0c0b0a09080706050403020100
+			.octa 0x0f0e0d0c0b0a09088080808080808080
+			.octa 0x80808080808080808080808080808080
diff --git a/arch/x86/crypto/crct10dif-pclmul_glue.c b/arch/x86/crypto/crct10dif-pclmul_glue.c
index 71291d5a..26a6350 100644
--- a/arch/x86/crypto/crct10dif-pclmul_glue.c
+++ b/arch/x86/crypto/crct10dif-pclmul_glue.c
@@ -35,6 +35,16 @@
 #include <asm/simd.h>
 
 asmlinkage u16 crc_t10dif_pcl(u16 init_crc, const u8 *buf, size_t len);
+#ifdef CONFIG_CRYPTO_CRCT10DIF_AVX512
+asmlinkage u16 crct10dif_pcl_avx512(u16 init_crc, const u8 *buf, size_t len);
+#else
+static u16 crct10dif_pcl_avx512(u16 init_crc, const u8 *buf, size_t len)
+{ return 0; }
+#endif
+
+static bool use_avx512;
+module_param(use_avx512, bool, 0644);
+MODULE_PARM_DESC(use_avx512, "Use AVX512 optimized algorithm, if available");
 
 struct chksum_desc_ctx {
 	__u16 crc;
@@ -56,7 +66,12 @@ static int chksum_update(struct shash_desc *desc, const u8 *data,
 
 	if (length >= 16 && crypto_simd_usable()) {
 		kernel_fpu_begin();
-		ctx->crc = crc_t10dif_pcl(ctx->crc, data, length);
+		if (IS_ENABLED(CONFIG_CRYPTO_CRCT10DIF_AVX512) &&
+		    cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ) &&
+		    use_avx512)
+			ctx->crc = crct10dif_pcl_avx512(ctx->crc, data, length);
+		else
+			ctx->crc = crc_t10dif_pcl(ctx->crc, data, length);
 		kernel_fpu_end();
 	} else
 		ctx->crc = crc_t10dif_generic(ctx->crc, data, length);
@@ -75,7 +90,12 @@ static int __chksum_finup(__u16 crc, const u8 *data, unsigned int len, u8 *out)
 {
 	if (len >= 16 && crypto_simd_usable()) {
 		kernel_fpu_begin();
-		*(__u16 *)out = crc_t10dif_pcl(crc, data, len);
+		if (IS_ENABLED(CONFIG_CRYPTO_CRCT10DIF_AVX512) &&
+		    cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ) &&
+		    use_avx512)
+			*(__u16 *)out = crct10dif_pcl_avx512(crc, data, len);
+		else
+			*(__u16 *)out = crc_t10dif_pcl(crc, data, len);
 		kernel_fpu_end();
 	} else
 		*(__u16 *)out = crc_t10dif_generic(crc, data, len);
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 5861d34..1192dea 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -56,6 +56,12 @@
 # define DISABLE_PTI		(1 << (X86_FEATURE_PTI & 31))
 #endif
 
+#if defined(CONFIG_AS_VPCLMULQDQ)
+# define DISABLE_VPCLMULQDQ	0
+#else
+# define DISABLE_VPCLMULQDQ	(1 << (X86_FEATURE_VPCLMULQDQ & 31))
+#endif
+
 #ifdef CONFIG_IOMMU_SUPPORT
 # define DISABLE_ENQCMD	0
 #else
@@ -82,7 +88,7 @@
 #define DISABLED_MASK14	0
 #define DISABLED_MASK15	0
 #define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
-			 DISABLE_ENQCMD)
+			 DISABLE_ENQCMD|DISABLE_VPCLMULQDQ)
 #define DISABLED_MASK17	0
 #define DISABLED_MASK18	0
 #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 19)
diff --git a/crypto/Kconfig b/crypto/Kconfig
index a367fcf..b090f14 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -613,6 +613,29 @@ config CRYPTO_CRC32C_VPMSUM
 	  (vpmsum) instructions, introduced in POWER8. Enable on POWER8
 	  and newer processors for improved performance.
 
+config CRYPTO_AVX512
+	bool "AVX512 hardware acceleration for crypto algorithms"
+	depends on X86
+	depends on 64BIT
+	help
+	  This option will compile in AVX512 hardware accelerated crypto
+	  algorithms. These optimized algorithms provide substantial (2-10x)
+	  improvements over the existing crypto algorithms for large data
+	  sizes. However, they may also incur a frequency penalty (so-called
+	  "bin drops") and slow down other workloads running on the same
+	  core.
+
+# We default CRYPTO_CRCT10DIF_AVX512 to y but make it depend on CRYPTO_AVX512
+# so that a single option (CRYPTO_AVX512) can select multiple algorithms
+# when they are supported. Specifically, if the platform and/or toolchain
+# does not support VPCLMULQDQ, this algorithm should not be enabled as part
+# of the set that CRYPTO_AVX512 selects.
+config CRYPTO_CRCT10DIF_AVX512
+	bool
+	default y
+	depends on CRYPTO_AVX512
+	depends on CRYPTO_CRCT10DIF_PCLMUL
+	depends on AS_VPCLMULQDQ
 
 config CRYPTO_CRC32C_SPARC64
 	tristate "CRC32c CRC algorithm (SPARC64)"
-- 
2.7.4


Thread overview: 28+ messages
2020-12-18 21:10 [RFC V1 0/7] Introduce AVX512 optimized crypto algorithms Megha Dey
2020-12-18 21:10 ` [RFC V1 1/7] x86: Probe assembler capabilities for VAES and VPLCMULQDQ support Megha Dey
2021-01-16 16:54   ` Ard Biesheuvel
2021-01-20 22:38     ` Dey, Megha
2020-12-18 21:10 ` Megha Dey [this message]
2021-01-16 17:00   ` [RFC V1 2/7] crypto: crct10dif - Accelerated CRC T10 DIF with vectorized instruction Ard Biesheuvel
2021-01-20 22:46     ` Dey, Megha
2020-12-18 21:11 ` [RFC V1 3/7] crypto: ghash - Optimized GHASH computations Megha Dey
2020-12-19 17:03   ` Ard Biesheuvel
2021-01-16  0:14     ` Dey, Megha
2021-01-16  0:20       ` Dave Hansen
2021-01-16  2:04         ` Eric Biggers
2021-01-16  5:13           ` Dave Hansen
2021-01-16 16:48             ` Ard Biesheuvel
2021-01-16  1:43       ` Eric Biggers
2021-01-16  5:07         ` Dey, Megha
2020-12-18 21:11 ` [RFC V1 4/7] crypto: tcrypt - Add speed test for optimized " Megha Dey
2020-12-18 21:11 ` [RFC V1 5/7] crypto: aesni - AES CTR x86_64 "by16" AVX512 optimization Megha Dey
2021-01-16 17:03   ` Ard Biesheuvel
2021-01-20 22:46     ` Dey, Megha
2020-12-18 21:11 ` [RFC V1 6/7] crypto: aesni - fix coding style for if/else block Megha Dey
2020-12-18 21:11 ` [RFC V1 7/7] crypto: aesni - AVX512 version of AESNI-GCM using VPCLMULQDQ Megha Dey
2021-01-16 17:16   ` Ard Biesheuvel
2021-01-20 22:48     ` Dey, Megha
2020-12-21 23:20 ` [RFC V1 0/7] Introduce AVX512 optimized crypto algorithms Eric Biggers
2020-12-28 19:10   ` Dey, Megha
2021-01-16 16:52     ` Ard Biesheuvel
2021-01-16 18:35       ` Dey, Megha
